I’ve had the opportunity to talk with a number of people from AOL Research this week about the data release. They’re genuinely upset over what’s happened… they released the data with the best of intentions: to help foster research in a key area by providing some new data. They did something that others had done in the past, in particular AltaVista and Excite (I don’t include Microsoft in that list, as we didn’t release data publicly). They did try to anonymize the data, and you had better believe they had a reasonable amount of sign-off from people inside AOL. They thought they were doing a good thing for the community, and it has turned out to be a bit of a disaster. Now, if I read body language properly, it looks like a number of them are afraid AOL is going to fire someone, or the entire group, in order to save face with its customers.
So how did we get here? And, given that AOL has loads of data that certainly doesn’t include any PII and thus carries no risk, why did they pick query logs?
Two words: Query Intent.
What is query intent? Simply put, it’s understanding what the user is looking for based on their query. For example, if I search for “Vision Tech Carbon Pro”, am I looking for a review of their aero bar, a place to buy it, or the main product page for it? And would I care whether the page was located in the US or in the UK? Written in English or Spanish?
A lot of how relevance is calculated is based on certain assumptions about query intent. If we can do a better job of understanding query intent, then we can do a better job of answering our customers’ questions. And query logs are a key part of understanding query intent… for example, if the query is “postage stamps”, it’s not clear what the user wants. But “postage stamps next increase” is much clearer. As it turns out, in AOL’s query logs, “postage stamps next increase” came after “postage stamps” in Thelma Arnold’s history (which she has given permission to use). From sessions in query logs, we’re able to learn the most likely intent behind a given query.
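To make the session idea concrete, here is a rough sketch of how follow-up queries could be counted from a log. It is only an illustration, not how AOL or anyone else actually does it: the `(user, timestamp, query)` layout, the 30-minute session gap, the `refinements` function name, and the sample log entries are all assumptions made up for this example.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta

# Assumption: two queries from the same user belong to the same session
# if they are at most 30 minutes apart.
SESSION_GAP = timedelta(minutes=30)

def refinements(log, gap=SESSION_GAP):
    """Map each query to a Counter of the queries that directly followed it
    from the same user within `gap`."""
    by_user = defaultdict(list)
    for user, ts, query in log:
        by_user[user].append((ts, query))

    follow = defaultdict(Counter)
    for events in by_user.values():
        events.sort()  # order each user's queries by time
        for (t1, q1), (t2, q2) in zip(events, events[1:]):
            if q1 != q2 and t2 - t1 <= gap:
                follow[q1][q2] += 1
    return follow

# Made-up log entries for illustration only.
log = [
    ("u1", datetime(2006, 3, 1, 10, 0), "postage stamps"),
    ("u1", datetime(2006, 3, 1, 10, 2), "postage stamps next increase"),
    ("u2", datetime(2006, 3, 2, 9, 0), "postage stamps"),
    ("u2", datetime(2006, 3, 2, 9, 1), "postage stamps next increase"),
    ("u3", datetime(2006, 3, 3, 14, 0), "postage stamps"),
    ("u3", datetime(2006, 3, 3, 14, 5), "buy postage stamps online"),
    ("u3", datetime(2006, 3, 4, 8, 0), "postage stamps"),  # next day: new session
]

follow = refinements(log)
print(follow["postage stamps"].most_common())
```

The most frequent follow-up to an ambiguous query is a reasonable first guess at what people who type it usually mean, which is exactly why the session structure in the logs matters.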
There’s been some research on this so far. In a seminal paper written while he was at DEC (AltaVista), Andrei Broder broke up queries into three large categories: navigational, informational, and transactional. Another paper, from Danny Levinson and Dan Rose at AltaVista (shortly before it was acquired by Yahoo!), further refined the ontology. However, there’s been little research since, largely because most research requires query logs, and only a few places have them.
So, one of the main reasons AOL released this data was to encourage and enable research in this area. However, given the issues that have come up, it looks like this might be more difficult than they thought. Ah well…