Confessions of a Spam-Catcher: How to Identify Spam

As part of my role as Lifehack’s manager, I am responsible for moderating the comments queue. Lifehack’s back-end has a “Pending” queue for comments that our spam-catching software thinks might be spam, a “Spam” queue for comments labeled “spam” either by the software or by me, and another queue for comments that have been approved, again either by the software or by me. As a general rule, I check the “Pending” queue several times a day, the “Approved” queue every day or so, and the “Spam” queue every week or so.

I’ve been doing this for two years, and I’ve gotten pretty proficient at figuring out what is and is not spam – a tough call to make sometimes, since spammers get more and more sophisticated in lock-step with those of us charged with blocking them. I present my “formula” here for two reasons: one, to give less experienced bloggers and webmasters an idea of how to catch spam on their own site, and two, to give commenters an idea of the kind of thing to avoid so their comments don’t get accidentally thrown in the “Spam” bin.

I should say, a big part of catching spam is a “feel” – intuiting that some comment just doesn’t feel right. I’m not sure I can capture exactly what goes into that feel. Andy Warhol once said that to recognize a great painting, first you have to look at a thousand paintings, and catching spam is a bit like that – the experience of having looked at thousands of spam messages cannot be easily encapsulated. But I’ll try as well as I can.

What is spam?

What makes a message spam is relative and subjective. In a sense, spam is like a weed – a weed is not any particular kind of plant, but a plant that isn’t wanted where it is. (See, for example, Wikipedia’s definition of a weed as “a plant that is considered by the user of the term to be a nuisance.”) For instance, corn is delicious, but if it’s growing in your soybean field, it’s a weed. A message that, say, pimps a word processor might be perfectly welcome on a post that asks for product recommendations for writers, while on a post that just happens to mention writing, the same message could be considered spam.

Some messages are clearly spam; for example, anything delivered by a spambot programmed to leave its message wherever it can find an open form to submit through. But a message can be left by a living person, custom-written for the particular content it’s posted to, and still be spam. This list starts with the most obvious signs and moves to more vague and difficult-to-interpret signs. My guess is that a lot of people run into the ones further down the list because they post without thinking very clearly, so pay attention.

A comment is spam if it:

  1. Contains links to websites that are unrelated to the content.
    For example, a comment might say “I think your baby is really cute!” but the word “baby” links to a site selling baby clothes or even a Forex trading site or other scam.
  2. Is posted on more than one post.
    This is obvious, right? Real people don’t post the same comment over and over on different posts, no matter how relevant. Most likely it’s a spambot responding to multiple posts on your blog that contain similar keywords.
  3. Contains more than one link.
    While there are a few situations in which a legitimate comment could contain several links, they’re fairly rare. As a general rule, the likelihood of a comment being spam increases directly with the number of links; anything over three and it’s virtually guaranteed to be spam.
  4. Is not directly related to the post.
    A lot of spambots (or even live spammers) crawl the web looking for posts with certain keywords and then insert a generic message loosely related to the topic in the hopes that it will slip past any human reader who is likely to just skim through their comments. Unless a comment addresses something specific about your post, it’s likely to be spam.
  5. Is overly complimentary.
    Most spammers are fairly astute observers of basic human psychology – particularly our desire to believe good things about ourselves. So they butter us up, saying things like “Great post! In fact, I love this whole site – I’m definitely going to come back again and again!”.
  6. Has keywords or a business name in the “Name” field.
    A basic search engine optimization strategy is to get your website’s address associated with specific keywords, and search engines look closely at the text associated with a link to determine the usefulness of the website linked to. Real people aren’t trying to game search engines, and frankly, we want to be recognized for our contribution, so we use our actual name, or a username. If you can’t imagine replying to a person by the name in their “Name” field, you’re dealing with a spammer. (For example, here’s one taken from our spam queue: “Having a good vocabulary not only gives a framework for thought. It also allows you to be concise and precise to make communication better.” This is relevant to the post, and thoughtful, but it was left by an entity named “dining room table”. It’s spam.)
  7. Links to a spammy business.
    This is a tough call – sometimes I’ll see a thoughtful comment clearly written in direct response to the post it’s commenting on, under a real person’s name, and still mark it as spam because they link to a site whose legitimacy is questionable. Could be porn, WOW gold scams, Forex scams, get rich quick schemes, blogs with stolen content, or anything else that feels to me like someone left a comment more to get their link out than to add to the discussion.
  8. Quotes the post without responding to the quote.
    This is a relatively sophisticated spam technique: pulling lines out of the post it’s responding to in order to make the language of the comment sound like real writing. Real people mark the quotes they’re commenting on (usually with quotation marks, but it could be by italicizing or bolding it, putting it in blockquotes, or some other means) and try to clearly separate their response from the post’s words.
  9. Is posted on an old post.
    Old posts tend to attract a lot of spam. Real people generally recognize that if a post is a year or so old, the conversation there is pretty much over. Spambots do not realize that. It still sometimes happens that someone comments on an ancient post, but the age of the post is a big red flag.
  10. Is in a different language from the site.
    If the point of a comment is to engage in discussion with the author of the post and his or her readers, it doesn’t make much sense to comment in a language that you’re not sure the author knows.
  11. Is from a Russian .ru domain.
    I hate to stereotype an entire top-level domain like this. I’m sure there are Russians out there making thoughtful comments on blogs all the time. And yet I’ve never had a comment that wasn’t spam from a commenter with a .ru domain or email address.
  12. Tells a long, personal story.
    This is experience talking – a lot of times you’ll see what appears to be a blog post in its own right in your moderation queue that starts off, at least, relevant, and is clearly written by a real person. This falls under the “Weed” heading – it might have been totally welcome except it’s out of place as a comment on your blog.
  13. Asks for specific support.
    This is another “weed” situation: a comment on a post about, say, installing Windows 7 that asks for help with a specific problem. Unless the point of your site is to answer specific questions about computer problems, this comment is out of place. There are better and more likely places to get help than on your blog.
  14. Feels wrong.
    Sometimes a comment just feels wrong – it is a little too smarmy, maybe, or it’s a little too formal and stiff. You click through the link and it’s a legitimate-enough site, maybe a little sketchy, but you can totally construct a case where this comment was written by a real person with something to say. The question, though, isn’t what was the intention of the writer, but what is the effect on the conversation on your site. If a comment doesn’t seem to quite fit, you’re well within your rights to “spam it”.
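
A few of the signs above (2, 3, 9 and 11) are mechanical enough to check automatically. What follows is only a rough sketch under stated assumptions, not how Lifehack's spam-catching software works: the `Comment` fields, the `spam_signals` function, and the exact thresholds (more than one link, a post older than 365 days) are hypothetical choices made for illustration. The other signs, especially "feels wrong", still need a human.

```python
import re
from dataclasses import dataclass
from urllib.parse import urlparse

# Hypothetical record for one comment; the field names are assumptions.
@dataclass
class Comment:
    name: str
    url: str            # the website the commenter entered, may be empty
    body: str
    post_age_days: int  # age of the post being commented on

URL_PATTERN = re.compile(r"https?://\S+", re.IGNORECASE)

def spam_signals(comment, earlier_bodies=()):
    """Return the mechanical warning signs that this comment trips."""
    signals = []

    # Sign 2: the same text was already left on another post.
    if comment.body.strip() in {body.strip() for body in earlier_bodies}:
        signals.append("repeated on another post")

    # Sign 3: more than one link in the comment body.
    if len(URL_PATTERN.findall(comment.body)) > 1:
        signals.append("more than one link")

    # Sign 9: the post is "a year or so old" (365 days is an assumed cutoff).
    if comment.post_age_days > 365:
        signals.append("old post")

    # Sign 11: the commenter's site is on a .ru domain.
    # urlparse only finds a hostname when the URL includes a scheme.
    hostname = urlparse(comment.url).hostname or ""
    if hostname.endswith(".ru"):
        signals.append(".ru domain")

    return signals
```

A list of tripped signs is more useful than a yes/no verdict here, because every one of these rules has legitimate exceptions; the output is something to weigh while reading the comment, not a reason to delete it unseen.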

Anyone else have advice for would-be spam-catchers? Or for commenters who might be finding their comments relegated to the spam-heaps of history? Leave a thoughtful, non-spammy comment below!

  • http://psylogica.ru Oleg

    Congratulations! You’ve just received your first thoughtful comment from a Russian domain :)
    No offense taken really, just recalled there was a plugin for the Russian WordPress pack that blocks all comments that don’t contain Cyrillic letters – it’s said to be almost enough to get rid of senseless spam :)
    Maybe you could “reverse” it to stop russian spam.
    Regards!

    And yes, I really like the entire site and will definitely be coming back again and again…

  • http://www.dwax.org Dustin Wax

    Oleg: Ha! Well, I suppose there had to be *someone* living in Russia besides spammers — good to (finally) know that’s true. That plugin sounds like a perfect implementation of the “It’s in a foreign language” rule — because what are the odds that someone knows enough Russian to read a post in it but not enough to write a comment, right? The reverse would be trickier, since Cyrillic is pretty exclusive while all the West European languages use the same character set, but it’s an idea worth thinking about.

  • http://psylogica.ru Oleg

    Dustin, I think the idea of the plugin is really as good as it is simple – if it can detect Cyrillic encoding, it should be possible to use it to filter the comments containing _only_ (exclusively) Cyrillic characters, just as the Russian variant uses it to block exclusively Latin ones, shouldn’t it? You’ll never need a comment without a single English letter in it anyway.
    Regards!
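
The "reversed" filter discussed in this exchange could be sketched roughly as follows. This is not the plugin Oleg mentions, whose code is not shown here; `has_cyrillic_but_no_latin` is a hypothetical helper that relies on Python's standard `unicodedata` character names to tell the two scripts apart.

```python
import unicodedata

def has_cyrillic_but_no_latin(text):
    """True when the text has Cyrillic letters and not a single Latin letter."""
    has_cyrillic = False
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, spaces, emoji
        # Unicode names look like "CYRILLIC SMALL LETTER YA"
        # or "LATIN SMALL LETTER A".
        name = unicodedata.name(ch, "")
        if "LATIN" in name:
            return False  # one Latin letter is enough to let the comment through
        if "CYRILLIC" in name:
            has_cyrillic = True
    return has_cyrillic

print(has_cyrillic_but_no_latin("Привет, мир!"))      # True
print(has_cyrillic_but_no_latin("Привет from Oleg"))  # False
```

As Dustin points out, this only works in one direction: it can hold a purely Cyrillic comment for review on an English-language site, but it cannot tell English from any other language written in Latin letters.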

  • http://theinvisiblementor.com Avil Beckford

    Dustin,

    I really appreciate this post and it has put a lot of things into context. I must confess that I have been taken in by the overly complimentary comment. Many of the points you made I figured out for myself, and there were times when I couldn’t decide if some of the comments flagged by Akismet were spam, so I got a second opinion.

    There was one comment that was asking for advice and I was wondering why he’d ask me for advice, what was it about my post, and I took the time to respond to his question. I guess the joke was on me, but now I know better. In the past few months I’ve been getting a lot of spam in a foreign language. Akismet is doing a great job in flagging spam.

    I have been blogging for less than a year so I’ll keep your list handy. Avil Beckford

  • http://www.dwax.org Dustin Wax

    Avil: Once in a while someone asks a question that is directly related to the post; I try to answer those, if they’re answerable. I was talking more about really specific questions, like “why won’t my computer boot after I installed the program you discussed?” Ummm, I don’t know?

  • http://richardshelmerdine.com/blog/ Richard | RichardShelmerdine.com

    I think a lot of people will be bookmarking this article lol. Gmail is excellent for catching Spam by the way.

  • Pingback: World Cup to use Augmented Reality – Web Review 19/01/10 « Vexed Digital Blog

  • clayton

    just click delete and move on with your life

  • http://www.ArvindDevalia.com/blog Arvind Devalia

    Excellent post Dustin and most useful I am sure for the many new bloggers out there.

    I find that nowadays I am getting very few spam comments, but a couple of months ago I was getting 100s every week. Maybe spammers go around in cycles – or perhaps Akismet was playing up then.

    Based on your article, I wonder if the spammers will now come up with further crooked schemes to fool us:-)

  • http://allwomenstalk.com All Women Stalker

    Very informative. Especially for someone like me who’s still learning the ropes with regard to spam and other blog nuisances. Thankfully, I have that “feel” you are talking about. And, well, I also use comment moderation so it’s pretty easy to weed out the spammers.

  • http://ilikeitdirty.com christopher

    I guess that more or less sums up spam.

    However, I would put anything from a .info ahead of .ru in the probably spam list. I’ve been blocking all Russian and Chinese traffic so I may not be seeing what you see in your spam-catcher.

  • http://www.aloeroot.com Steph

    Thanks for this, Dustin. I’ve been installing WordPress to power a lot of my customers’ sites lately, and of course handing off the website after the launch means that they’re now responsible for checking the spam queue. I’ve been sharing this article with them and they find it helpful too.

  • http://lnxwalt.wordpress.com/ W^L+

    I’ve noticed that a lot of spammy comments sound like they were poorly translated from another language. They almost, but not quite, make sense.

    And I’d add Chinese .cn addresses to the spammy list.

    If your blog software allows it, turn off comments on anything over about 3 – 6 months. Also, you should always force trackbacks/pingbacks to go through your moderation queue.

  • ZeusTheTrueGod

    If you have some program module/plugin which can identify someone as a spammer – don’t erase his comment and don’t tell him about it. Let him see his own comments, but don’t show those comments to other people. Let him have his own version of the comments.

    And sure, it is not my idea; I’ve read about it but don’t remember where. Maybe Joel Spolsky (the Joel on Software blog)?

  • ZeusTheTrueGod

    Actually I am from Russia and cannot write well. In short, the spammer should see his comments. Others should not see his comments.
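
The idea in these two comments, that a flagged commenter keeps seeing his own comments while nobody else does, might look something like this minimal sketch. The `StoredComment` record and `visible_comments` function are hypothetical, and the sketch assumes the site can already recognize the viewer (by login or cookie, for instance), which is the hard part in practice.

```python
from dataclasses import dataclass

# Hypothetical stored comment; the field names are assumptions.
@dataclass
class StoredComment:
    author_id: str
    body: str
    flagged_as_spam: bool = False

def visible_comments(comments, viewer_id):
    """Show flagged comments only to the person who wrote them."""
    return [
        c for c in comments
        if not c.flagged_as_spam or c.author_id == viewer_id
    ]

thread = [
    StoredComment("alice", "Point 6 caught me out once."),
    StoredComment("spammer", "Great post! Buy gold here.", flagged_as_spam=True),
]
print(len(visible_comments(thread, "alice")))    # 1
print(len(visible_comments(thread, "spammer")))  # 2
```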

  • http://www.carmelsundae.org/ Christina Martin

    One big signal for spam is a comment that makes a vague comment but doesn’t mention any specific information about the post. They often say something like “You make a really good point” but they don’t address what the point is.

  • http://fastforwardacademy.com enrolled agent

    @ Christina

    There is a lot of that indeed. That is why it really helps if a moderator exists on certain blog sites or forums to “weed” out spammers. :)

  • http://www.revreese.com Revreese

    I really enjoyed this. It taught me a few things and it was especially appropriate as I am flooded with spam of types 5, 6 and 7 (especially the ‘flattery’ ones) and, due to my inexperience, find it difficult at times deciding what is or isn’t spam!
    In fact, I think I need to go back and delete a lot of my approved comments, I knew it was strange having random strangers saying such nice things about me! ;-)

  • Pingback: Another Helping of Spam

  • http://www.freelancingandmore.biz Doreen

    Oh I love the ones that have 0 to do with the content. Just yesterday I eliminated one that was on my blog as a comment to "The Importance of Spell Checker" that talked about USB ports…really?

  • http://GrowMap.com Gail Gardner

    I know I’m coming in late to this discussion; however, I hope you are willing to discuss it just a bit more.

    Most of what you have written are excellent tips for recognizing spam. We agree that automated comments are always spam and that IS the original definition. We also consider comments that have nothing to do with the page where they’re left spam whether written by a human or left by a bot.

    What we see differently are comments left by people who obviously did read the post.

    I am very active in a blogging community that we usually refer to as the DoFollow CommentLuv KeywordLuv community. You may be surprised to know that the purpose of the KeywordLuv plugin is to allow commentators to use both their names and their desired keywords in the name field. The plugin links only to the keyword(s).

    We also use the CommentLuv plugin. The purpose of it is to feature one of our last ten posts or a specific page or post so those reading our comments can learn more about us or come visit our sites. It is also possible to associate preferred anchor text with a specific landing page using that plugin.

    I have a different view of spam that I want to share with you that presents a chicken and egg dilemma. It might be easiest to explain it with an example.

    Let’s say the commentator is really into home improvement and has a blog for their mill. They are genuinely interested in blogs about architecture, woodworking, and interior design.

    How would the blog owner know for certain whether that commentator found their blog in order to do link building OR reads their blog because they are interested and THEN decides to leave a comment?

    What at least some of us believe is that supporting small local and online businesses and making it easier for our readers to know about them and find them in search engines is important to improving the economies of our countries.

    There is a reason why businesses seek to build links: so that they can be more easily found in the search engines. Those who believe that happens organically may be interested in that bridge I’ve heard is for sale out in Arizona. :-)

    We define spam differently than you do. Those who read what we write and make a reasonable effort to contribute to the conversation are welcomed with open arms. If they happen to have a business or a Web site they can link to that is just fine by us.

    One other point. Although your regular readers and those who arrive from a source that is sharing new posts may arrive only on what you recently published, visitors who find you in a search engine are very likely to read and wish to comment on older posts. That is how I found this one – researching ways to control spam (the kind we both agree on).

    I agree that is also how link builders may have found older posts. They might also be seeking high pagerank pages. Regardless of how someone finds you though, we’re back to the chicken and the egg problem. Did they find you because they’re interested and then decide to leave a link OR did they seek out places to leave a link and then find a site they like?

    I have many regular readers who found me while link building and became fans of my site, follow me on Twitter, and share what I write. I like that and I do not mind that they make their time more productive by combining reading with linking.

  • lsmonline1

    I am absolutely amazed at how terrific the stuff is on this site. I have saved this webpage and I truly intend on visiting the site in the upcoming days. Keep up the excellent work
    LSM Collection 2011