selberg.org

Archive for August 13th, 2006
8/13/06
11:12 pm
The problems with blog spam

At AIRWeb, a workshop on adversarial information retrieval held at SIGIR, they had a panel on blog spam. What surprised me was that there appear to be two camps on what the “blog spam” problem is. Most of the panel seemed to be concerned about spam on blogs, including splogs, or blogs that are 100% spam. Essentially, anytime someone can put a link on somebody else’s page or IP address, they can help inflate the ranking of some target site. For the most part, spam on blogs breaks down into:

1. Fake comments
2. Fake trackbacks / referrers / pingbacks
3. Fake blogs (which include all of the above)

The biggest problem for search engines appears to be #3, Fake Blogs. The reason is that fake blogs are under the complete control of the spammer, so nobody but the search engines is in a position to detect and remove them. Fake comments, trackbacks, and the like, on the other hand, are a problem for both the search engines and the owner of the blog, and the owner will typically take much more direct steps to remove said spam.

Another problem with Fake Blogs is that most of them appear on the big hosting sites… e.g. Blogger / BlogSpot (Google), Spaces (MSN), and so on. So it’s one arm of the search engine companies working to help out another arm as well, which makes sense.

However, as someone who runs an independent blog (WordPress), I found a couple of things surprising:

  • While a lot of people are doing research on thwarting blog spam, it seems the research is not making it outside the companies. In particular, the anti-spam technology I’m using is Akismet, from the same guys that brought you WordPress. Not MSR, not YRL, not Google.
  • Given how much spam I get and how much Akismet catches, it’s not clear that much more work is needed. Akismet seems to work pretty well.
  • There seems to be a lack of understanding on the “nofollow” tag.

A word on that last bit… a number of blogs that didn’t have some kind of comment moderation were getting spammed galore. So, the “nofollow” tag (strictly speaking, a rel="nofollow" attribute on the link) was invented so that bloggers (or really, their blogging software) could identify links in comments. This was not done so that bloggers could instruct search engines not to follow these links; nor was it done so that bloggers could mark a link as “untrustworthy” and have the search engines analyze the links and decide for themselves. It was done to discourage comment spam… spammers wouldn’t spam, because the links wouldn’t be worthwhile. It’s kind of effective, but really, a barrier to entry on comments such as a captcha or moderation approval works much better.
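To make the mechanics concrete, here is a rough sketch of what blogging software might do to a comment before publishing it: mark every link with rel="nofollow". This is not how WordPress or any particular package actually implements it; the function name `add_nofollow` is made up for illustration, and a regex is used only to keep the example short (a real implementation would more likely use a proper HTML parser).

```python
import re

# Hypothetical helper, not taken from any real blogging package.
ANCHOR_RE = re.compile(r"<a\b([^>]*)>", re.IGNORECASE)
REL_RE = re.compile(r"""\brel\s*=\s*(["'])(.*?)\1""", re.IGNORECASE)


def add_nofollow(comment_html):
    """Return comment_html with rel="nofollow" present on every <a> tag."""

    def fix_anchor(match):
        attrs = match.group(1)
        rel = REL_RE.search(attrs)
        if rel is None:
            # No rel attribute yet: append one.
            return "<a" + attrs + ' rel="nofollow">'
        values = rel.group(2).split()
        if "nofollow" in [v.lower() for v in values]:
            # Already marked: leave the tag untouched.
            return match.group(0)
        # Keep existing rel values and add nofollow to them.
        values.append("nofollow")
        new_rel = 'rel="' + " ".join(values) + '"'
        return "<a" + attrs[:rel.start()] + new_rel + attrs[rel.end():] + ">"

    return ANCHOR_RE.sub(fix_anchor, comment_html)


if __name__ == "__main__":
    print(add_nofollow('Nice post! <a href="http://example.com">cheap pills</a>'))
```

Note that this only labels the links. Whether a search engine discounts them is up to the search engine, which is why it discourages spam only to the extent that spammers notice the links have stopped paying off.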

Anyway, a final thought… while right now, Akismet is winning the war against comment spam, I suspect the spammers will adapt faster and eventually start winning. At that point, I hope we’ll have some of the research done by YRL, MSR, and Google integrated into products to help out.

8/13/06
12:05 am
Good reads from SIGIR 2006

We had a hugely successful SIGIR here in Seattle.

Here are some of the papers I really enjoyed and think are worth a read (the first is certainly apropos):

Web IR: current topics: You are what you say: privacy risks of public mentions
Dan Frankowski, Dan Cosley, Shilad Sen, Loren Terveen, John Riedl

Evaluation 1–user models and test collections: Minimal test collections for retrieval evaluation
Ben Carterette, James Allan, Ramesh Sitaraman

Evaluation 1–user models and test collections: Dynamic test collections: measuring search effectiveness on the live web
Ian Soboroff

Evaluation 2: A statistical method for system evaluation using incomplete judgments
Javed A. Aslam, Virgil Pavlu, Emine Yilmaz

Web 2: Generalizing PageRank: damping functions for link-based ranking algorithms
Ricardo Baeza-Yates, Paolo Boldi, Carlos Castillo

The first page of results: Less is more: probabilistic models for retrieving fewer relevant documents
Harr Chen, David R. Karger

Users: clarification, feedback, and browsing: Exploring the limits of single-iteration clarification dialogs
Jimmy Lin, Philip Wu, Dina Demner-Fushman, Eileen Abels

8/13/06
12:05 am
Query logs and the AOL controversy

I’ve had the opportunity to talk with a number of people from AOL Research this week about the data release. They’re genuinely upset over what’s happened… they released data with the best of intentions: to help foster research in a key area with some new data. They did something that others had done in the past, in particular AltaVista and Excite (I don’t include Microsoft in that list as we didn’t release data publicly). They did try to anonymize the data, and you had better believe they had a reasonable amount of sign-off from people inside of AOL. They thought they were doing a good thing for the community, and it has turned out to be a bit of a disaster. Now, if I read body language properly, it looks like a number of them are afraid AOL is going to fire someone or the entire group in order to save face with its customers.

So how did we get here? And, given that AOL has loads of data that certainly doesn’t include any PII and thus carries no risk, why did they pick query logs?

Two words: Query Intent.

What is query intent? Simply put, it’s understanding what the user is looking for based on their query. For example, if I search for “Vision Tech Carbon Pro”, am I looking for a review of their aero bar, or a place to buy it, or the main product page for it? And would I care if the page was located in the US or in the UK? Written in English or Spanish?

A lot of how relevance is calculated is based on certain assumptions about query intent. However, if we can do a better job of understanding query intent, then we can do a better job of answering our customers’ questions. And query logs are a key part of understanding query intent… for example, if the query is “postage stamps”, it’s not clear what the user wants. But “postage stamps next increase” is much clearer. As it turns out, in AOL’s query logs, “postage stamps next increase” came after “postage stamps” in Thelma Arnold’s history (which she has given permission to use). From sessions in query logs, we’re able to learn what the most likely intent is for a given query.
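As a toy illustration of the kind of thing sessions make possible, the sketch below counts how often a user follows a short query with a longer one that extends it, within the same session. Everything here is an assumption for the sake of the example: the (user_id, timestamp, query) row layout, the 30-minute session gap, the function name `count_refinements`, and the sample rows, which are invented rather than taken from any real log. Real intent mining is a good deal more involved than prefix matching.

```python
from collections import Counter, defaultdict


def count_refinements(rows, max_gap_seconds=1800):
    """Count (query, refined_query) pairs issued back to back by one user.

    rows: iterable of (user_id, timestamp_in_seconds, query) tuples, in any
    order. A pair counts only if the second query starts with the first one
    plus a space and follows it within max_gap_seconds.
    """
    by_user = defaultdict(list)
    for user_id, timestamp, query in rows:
        # Normalize case and whitespace so trivial variants compare equal.
        normalized = " ".join(query.lower().split())
        by_user[user_id].append((timestamp, normalized))

    pairs = Counter()
    for history in by_user.values():
        history.sort()  # chronological order per user
        for (t1, q1), (t2, q2) in zip(history, history[1:]):
            if t2 - t1 <= max_gap_seconds and q2.startswith(q1 + " "):
                pairs[(q1, q2)] += 1
    return pairs


if __name__ == "__main__":
    # Invented sample rows, for illustration only.
    sample = [
        (1, 0, "postage stamps"),
        (1, 60, "postage stamps next increase"),
        (2, 0, "Postage  Stamps"),
        (2, 30, "postage stamps next increase"),
        (3, 0, "postage stamps"),
        (3, 5000, "postage stamps history"),  # too late: a different session
    ]
    for (short, longer), n in count_refinements(sample).most_common():
        print(n, short, "->", longer)
```

With enough users, the most common refinements of an ambiguous query give a rough picture of what people who type it tend to mean. It is also exactly the per-user history that makes these logs so hard to anonymize.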

There’s been some research on this so far. In a seminal paper written while he was at DEC (AltaVista), Andrei Broder broke up queries into three large categories: navigational, informational, and transactional. Another paper from Danny Levinson and Dan Rose at AltaVista (shortly before it was acquired by Yahoo!) further refined the ontology. However, there’s been little research since, largely because most research requires query logs, and only a few places have them.

So, one of the main reasons AOL released this data was to encourage and enable research into this area. However, given the issues that have come up, it looks like this might be more difficult than they thought. Ah well…