AIRWeb, a workshop on adversarial information retrieval held at SIGIR, had a panel on blog spam. What surprised me was that there appear to be two camps on what the “blog spam” problem is. Most of the panel seemed to be concerned about spam on blogs, including splogs (blogs that are 100% spam). Essentially, anytime someone can put a link on somebody else’s page or IP address, they can help inflate the search ranking of some target site. For the most part, spam on blogs breaks down into:
1. Fake comments
2. Fake trackbacks / referrers / pingbacks
3. Fake blogs (which include all of the above)
The biggest problem for search engines appears to be #3, Fake Blogs. The reason is that fake blogs are under the complete control of the spammer, so nobody but the search engines is going to detect and remove them. Fake comments, trackbacks, and the like, on the other hand, are a problem for both the search engines and the owner of the blog, and the owner will typically take much more direct steps to remove said spam.
Another problem with Fake Blogs is that most of them appear on the big hosting sites… e.g. Blogger / BlogSpot (Google), Spaces (MSN), and so on. So it’s one arm of the search engine companies working to help out another arm as well, which makes sense.
However, as someone who runs an independent blog (WordPress), I found a couple of things surprising:
- While a lot of people are doing research on thwarting blog spam, it seems the research is not making it outside the companies. In particular, the anti-spam technology I’m using is Akismet, from the same guys that brought you WordPress. Not MSR (Microsoft Research), not YRL (Yahoo! Research Labs), not Google.
- Given how much spam I get and how much Akismet catches, it’s not clear that much more work is needed. Akismet seems to work pretty well.
- There seems to be a lack of understanding on the “nofollow” tag.
A word on that last bit… a number of blogs that didn’t have some kind of comment moderation were getting spammed galore. So, the “nofollow” tag (strictly speaking, a rel="nofollow" attribute on the link rather than a tag of its own) was invented so that bloggers (or really, their blogging software) could flag links in comments. This was not done so that bloggers could instruct search engines not to follow these links; nor was it done so that bloggers could mark a link as “untrustworthy” and have the search engines analyze the links and decide for themselves. It was done to discourage comment spam… spammers wouldn’t spam, because the links wouldn’t be worthwhile. It’s kind of effective, but really, a barrier to entry on comments, such as a captcha or moderation approval, works much better.
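To make the “blogging software flags the links” part concrete, here is a minimal sketch in Python of what that step might look like. This is not how WordPress or any other package actually does it; the function name `add_nofollow` and the regex approach are my own assumptions for illustration, and a real implementation would want a proper HTML parser rather than regular expressions.

```python
import re

# Matches an opening <a ...> tag and captures its attribute text.
# Illustrative only: regexes are not a robust way to parse HTML.
A_TAG = re.compile(r'<a\b([^>]*)>', re.IGNORECASE)

# Matches an existing quoted rel="..." attribute (but not e.g. data-rel).
REL_ATTR = re.compile(r'(?<![\w-])rel\s*=\s*(["\'])(.*?)\1', re.IGNORECASE)


def add_nofollow(comment_html: str) -> str:
    """Hypothetical helper: add rel="nofollow" to every link in a comment."""

    def fix_tag(match: re.Match) -> str:
        attrs = match.group(1)
        rel = REL_ATTR.search(attrs)
        if rel is None:
            # No rel attribute yet, so append one.
            return f'<a{attrs} rel="nofollow">'
        values = rel.group(2).split()
        if any(v.lower() == "nofollow" for v in values):
            # Already flagged; leave the tag untouched.
            return match.group(0)
        # Keep the existing rel values and add nofollow to them.
        joined = " ".join(values + ["nofollow"])
        new_attrs = attrs[:rel.start()] + f'rel="{joined}"' + attrs[rel.end():]
        return f'<a{new_attrs}>'

    return A_TAG.sub(fix_tag, comment_html)


if __name__ == "__main__":
    comment = 'Nice post! <a href="http://example.com">cheap stuff</a>'
    print(add_nofollow(comment))
```

The blog owner’s own links in the post body would skip this step; only user-submitted content like comments and trackbacks gets run through it.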
Anyway, a final thought… while right now, Akismet is winning the war against comment spam, I suspect the spammers will adapt faster and eventually start winning. At that point, I hope we’ll have some of the research done by YRL, MSR, and Google integrated into products to help out.