Just a reminder: this is a blog, and what follows is my opinion, not that of my employer, Microsoft. You know the drill.
Update: The New York Times has answered my challenge and proven me completely wrong. Please be sure to read my retraction if you came to this page via a permalink before you quote it.
A few weeks ago, AOL decided to release a large dataset of query logs to the research community. They sent a detailed e-mail to the SIGIR mailing list last week; the timing was intentional, as the SIGIR conference is underway here in Seattle this week. Their page has been taken down, but here are the MSN Cache and Google Cache copies.
This is great stuff. We here at Microsoft did something similar on a smaller scale a few months back, releasing some query logs to a limited set of academics.
The main reason for doing this is that most research being done on search isn’t terribly useful to the big players (Microsoft, AOL, Yahoo, and Google in particular). Our product teams are far ahead of what most academics are doing, because we have the large-scale search engines and the data, and academics, well, don’t. So they have to make do. A lot of them make do with a small Excite query log from 1999 or an AltaVista log from 1998 (which appear to be harder to find than I thought… gonna have to dig them up somewhere).
So, while we can’t provide all our data, we can provide some, and query logs are an obvious place to start. They provide a decent amount of data on what people are looking for, are easily anonymized (well, we think. We’ll get back to that), are commonly available (if you run an engine, meaning Microsoft, Google, Yahoo, and AOL all have ‘em), don’t contain any trade secrets (e.g. the query logs from Microsoft don’t differ that much from Google… everyone is looking for pictures of Angelina Jolie’s baby…), and are ripe for research.
However, while AltaVista, Excite, and we released our query logs relatively unscathed, the blogosphere has taken AOL Research to the woodshed on this one. Greg has a number of links to the guilty. My favorite, which I’ll call out, is TechCrunch’s ranting. Here’s a choice excerpt:
The data includes all searches from those users for a three month period this year, as well as whether they clicked on a result, what that result was and where it appeared on the result page. It’s a 439 MB compressed download, expanded to just over 2 gigs. The data is available here (this link is directly to the file) and the output is in ten text files, tab delineated.
The utter stupidity of this is staggering. AOL has released very private data about its users without their permission. While the AOL username has been changed to a random ID number, the ability to analyze all searches by a single user will often lead people to easily determine who the user is, and what they are up to. The data includes personal names, addresses, social security numbers and everything else someone might type into a search box.
The most serious problem is the fact that many people often search on their own name, or those of their friends and family, to see what information is available about them on the net. Combine these ego searches with porn queries and you have a serious embarrassment. Combine them with “buy ecstasy” and you have evidence of a crime. Combine it with an address, social security number, etc., and you have an identity theft waiting to happen. The possibilities are endless.
Anyone spot the flaw in the logic? OK… if you’re mad that AOL released some private data because, well, it’s private data and should be kept, well, private, don’t include a link to the data yourself! And hey, why don’t you skip including that mirror site when the link gets taken down? And maybe, just maybe, after berating AOL for releasing it, how about a call that everyone should take down the data and not use it so that we can all protect the privacy of the people that AOL apparently just violated?
Instead, we get gems like this:
User 491577 searches for “florida cna pca lakeland tampa”, “emt school training florida”, “low calorie meals”, “infant seat”, and “fisher price roller blades”.
Um, yeah… I know exactly who that is. Someone else mentioned that lots of people do vanity queries on their names. By that logic, I guess we’ll have no problem tracking down Angelina Jolie and Brad Pitt’s queries.
So, first and foremost, to everyone out there ranting about this data that isn’t also calling for people to take it down and not use it — STFU! You’re just being hypocrites.
Now, as to the actual perceived problem with the data (that the queries themselves are private data and can be used to track down the person issuing them): prove it. Find me a single user who is identifiable by these queries (and you cannot just self-identify!). People said the same thing when AltaVista and Excite released their logs, but nothing came of it. So, let’s try this again… if you can identify someone via their queries, then you win, I’m wrong, and I’ll say so on this blog. But until someone, anyone, out there can actually identify someone, this is just pointless flaming.
And hey — Chud, Jen, & the rest of AOL Research — thanks for pushing on this. Yeah, you’re getting a bit roasted right now, but we all need people like you guys to push to get things done. Sometimes mistakes will be made, but the alternative is that no one ever takes a risk, and progress just grinds to a halt.