selberg.org

The problems with blog spam

At AIRWeb, a workshop on adversarial information retrieval held at SIGIR, there was a panel on blog spam. What surprised me was that there appear to be two camps on what the “blog spam” problem is. Most of the panel seemed to be concerned about spam on blogs, including splogs, or blogs that are 100% spam. Essentially, any time someone can put a link on somebody else’s page or IP address, they can help inflate the ranking of some target site. For the most part, spam on blogs breaks down into:

1. Fake comments
2. Fake trackbacks / referrers / pingbacks
3. Fake blogs (which include all of the above)

The biggest problem for search engines appears to be #3, Fake Blogs. The reason is that fake blogs are under the complete control of the spammer, so it falls to the search engines to detect and remove them. Fake comments, trackbacks, and the like, on the other hand, are a problem for both the search engines and the owner of the blog, and the owner will typically take much more direct steps to remove said spam.

Another problem with Fake Blogs is that most of them appear on the big hosting sites… e.g. Blogger / BlogSpot (Google), Spaces (MSN), and so on. So it’s one arm of the search engine companies working to help out another arm as well, which makes sense.

However, as someone who runs an independent blog (WordPress), I found a couple of things surprising:

  • While a lot of people are doing research on thwarting blog spam, it seems the research is not making it outside the companies. In particular, the anti-spam technology I’m using is Akismet, from the same guys that brought you WordPress. Not MSR, not YRL, not Google.
  • Given how much spam I get and how much Akismet catches, it’s not clear that much more work is needed. Akismet seems to work pretty well.
  • There seems to be a lack of understanding of the “nofollow” tag.

A word on that last bit… a number of blogs that didn’t have some kind of comment moderation were getting spammed galore. So, the “nofollow” tag (strictly speaking, a rel="nofollow" attribute on the link) was invented so that bloggers (or really, their blogging software) could identify links in comments. This was not done so that bloggers could instruct search engines not to follow these links; nor was it done so that bloggers could mark a link as “untrustworthy” and have the search engines analyze the links and decide for themselves. It was done to discourage comment spam… spammers wouldn’t spam, because the links wouldn’t be worthwhile. It’s kind of effective, but really, a barrier to entry on comments such as a captcha or moderation approval works much better.
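For the curious, here is roughly what "blogging software identifying links in comments" could look like. This is only a minimal sketch in Python, not how WordPress or any other package actually does it. The function name add_nofollow is made up for illustration, and the regex approach assumes reasonably well-formed comment HTML; a real implementation would more likely use a proper HTML parser.

```python
import re

# Hypothetical helper names; this is an illustration, not any blog engine's code.
# Matches an opening <a ...> tag; group 1 captures its attribute text.
_A_TAG = re.compile(r"<a\b([^>]*)>", re.IGNORECASE)
# Matches an existing rel="..." attribute (but not e.g. data-rel="...").
_REL_ATTR = re.compile(r"(?<![\w-])rel\s*=\s*([\"'])(.*?)\1", re.IGNORECASE)


def add_nofollow(comment_html: str) -> str:
    """Return comment_html with "nofollow" present in the rel of every <a> tag."""

    def rewrite(match: re.Match) -> str:
        attrs = match.group(1)
        rel = _REL_ATTR.search(attrs)
        if rel is None:
            # No rel attribute yet: append one.
            return f'<a{attrs} rel="nofollow">'
        # A rel attribute exists: keep its values and add nofollow if missing.
        values = rel.group(2).split()
        if "nofollow" not in [v.lower() for v in values]:
            values.append("nofollow")
        joined = " ".join(values)
        new_rel = f'rel="{joined}"'
        return "<a" + attrs[:rel.start()] + new_rel + attrs[rel.end():] + ">"

    return _A_TAG.sub(rewrite, comment_html)


if __name__ == "__main__":
    comment = 'Nice post! <a href="http://example.com">my site</a>'
    print(add_nofollow(comment))
```

The blog software would run something like this over each comment body before rendering it, leaving links in the post itself alone. What a search engine then does with a link marked this way is up to the search engine.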

Anyway, a final thought… while right now, Akismet is winning the war against comment spam, I suspect the spammers will adapt faster and eventually start winning. At that point, I hope we’ll have some of the research done by YRL, MSR, and Google integrated into products to help out.

2 Responses to “The problems with blog spam”

  1. David Ulevitch Says:

    It was done to discourage comment spam… spammers wouldn’t spam, because the links wouldn’t be worthwhile. It’s kind of effective.

    Actually it’s totally ineffective and has now been shown to have possible social ramifications that might discourage community participation. The argument goes: the link is part of the payment for me making an insightful comment, linking to related content, and generally increasing the thought pool on your site. This is precisely how attribution in academia works. The web is even more suited for this as it’s automated. But nofollow punishes me for someone else’s misdeeds.

    This is part of why you (and MSR in general) will probably never understand why Akismet is more effective at dealing with real-world problems than something MSR could churn out. At the recent ISOI meeting the Microsoft presentation on Botnets was laughable. They catch 98% of botnet code they know about. But they only know about a fraction of botnet code (of which they don’t even catch 100% !!!)… Do the math on that one, hardly a decent product.

    -david

  2. Erik Selberg Says:

    I should re-word that… nofollow is kind of effective as a way for Search Engines to identify and ignore comments. But you’re absolutely correct in that there’s a price in terms of community effectiveness.

    Hmmm…. as I think about this, I believe that nofollow is actually turned on for this blog. However, as comments are moderated, there isn’t any spam as I kill it. I think I’ll go turn that tag off, actually…
