
In defense of AOL…

Just a reminder, this is a blog and the below is my opinion, and not that of my employer, Microsoft. You know the drill.

Update: The New York Times has answered my challenge and proven me completely wrong. Please be sure to read my retraction if you came to this page via a permalink before you quote it.

A few weeks ago, AOL decided to release a large dataset of query logs to the research community. They sent a detailed piece of mail to the SIGIR e-mail list last week; the timing was intentional, as the SIGIR conference is underway here in Seattle this week. Their page has been taken down, but here’s the MSN Cache and Google Cache.
This is great stuff. We here at Microsoft did something similar on a smaller scale a few months back, releasing some query logs to a limited set of academics.

The main reason for doing this is that most research being done on search isn’t terribly useful to the big players — Microsoft, AOL, Yahoo, and Google in particular. The reason is that our product teams are far ahead of what most academics are doing, because we have the large-scale search engine and the data, and academics, well, don’t. So they have to make do. A lot of them make do using a small Excite query log from 1999 or an Alta Vista log from 1998 (which appear to be harder to find than I thought… gonna have to dig them up somewhere).

So, while we can’t provide all our data, we can provide some, and query logs are an obvious place to start. They provide a decent amount of data on what people are looking for, are easily anonymized (well, we think. We’ll get back to that), are commonly available (if you run an engine, meaning Microsoft, Google, Yahoo, and AOL all have ‘em), don’t contain any trade secrets (e.g. the query logs from Microsoft don’t differ that much from Google… everyone is looking for pictures of Angelina Jolie’s baby…), and are ripe for research.

However, while AltaVista, Excite, and we released our query logs relatively unscathed, the blogosphere has taken AOL Research to the woodshed on this one. Greg has a number of links to the guilty. My favorite, which I’ll call out, is TechCrunch’s ranting. Here’s a choice excerpt:

The data includes all searches from those users for a three month period this year, as well as whether they clicked on a result, what that result was and where it appeared on the result page. It’s a 439 MB compressed download, expanded to just over 2 gigs. The data is available here (this link is directly to the file) and the output is in ten text files, tab delineated.

The utter stupidity of this is staggering. AOL has released very private data about its users without their permission. While the AOL username has been changed to a random ID number, the ability to analyze all searches by a single user will often lead people to easily determine who the user is, and what they are up to. The data includes personal names, addresses, social security numbers and everything else someone might type into a search box.

The most serious problem is the fact that many people often search on their own name, or those of their friends and family, to see what information is available about them on the net. Combine these ego searches with porn queries and you have a serious embarrassment. Combine them with “buy ecstasy” and you have evidence of a crime. Combine it with an address, social security number, etc., and you have an identity theft waiting to happen. The possibilities are endless.

Anyone spot the flaw in the logic? OK… if you’re mad that AOL released some private data because, well, it’s private data and should be kept, well, private, don’t include a link to the data yourself! And hey, why don’t you skip including that mirror site when the link gets taken down? And maybe, just maybe, after berating AOL for releasing it, how about a call that everyone should take down the data and not use it so that we can all protect the privacy of the people that AOL apparently just violated?

Instead, we get gems like this:

User 491577 searches for “florida cna pca lakeland tampa”, “emt school training florida”, “low calorie meals”, “infant seat”, and “fisher price roller blades”.

Um, yeah… I know exactly who that is. Someone else mentioned that lots of people do vanity queries on their names. By that logic, I guess we’ll have no problem tracking down Angelina Jolie and Brad Pitt’s queries.
So, first and foremost, to everyone out there ranting about this data that isn’t also calling for people to take it down and not use it — STFU! You’re just being hypocrites.

Now, as to the actual perceived problem with the data — that the queries themselves are private data and can be used to track down the person issuing them — prove it. Find me a single user that is identifiable by these queries (and you cannot just self-identify!). People said this when Alta Vista and Excite released their logs, but nothing came of it. So, let’s try this again… if you can identify someone via their queries, then you win, I’m wrong, and I’ll say so on this blog. But until someone, anyone, out there can actually identify someone, this is just pointless flaming.

And hey — Chud, Jen, & the rest of AOL Research — thanks for pushing on this. Yeah, you’re getting a bit roasted right now, but we all need people like you guys to push to get things done. Sometimes mistakes will be made, but the alternative is that no one ever takes a risk, and progress just grinds to a halt.

16 Responses to “In defense of AOL…”

  1. mc Says:

    Ok, there are two points you seem to have missed that make it rather easier than you suggest to identify a lot of these users. The risk is not just perceived.

    Number 1: Each entry is time stamped to the second, and includes the web site the user clicked on. It is incredibly easy for anyone who owns a website that one of these AOL users clicked on to access their logs, and correlate the time and HTTP referrer headers to that specific user ID. Therefore you can attach an IP address and time to a UserID in the AOL data - this makes identifying the person a lot easier. Now, and this isn’t hypothetical at all, assume you are the nytimes.com website, where people have to log on to read certain stories, or any other website that requires registration. You have the person’s email address, IP address, possibly their name, and so on.

    This is not impossible, hell it is not even hard. A significant percentage of these AOL users could be identified like this by websites which a number of the users clicked on.

    Now point 2: Each UserID is linked to all the queries over 3 months, so you can confirm your IP and possible name data by triangulating possibly personally identifying queries, along with all kinds of other information you wouldn’t want someone to know.

    Still think it is impossible?

  2. Erik Selberg Says:

    I don’t follow your point #1… from the web logs, a site owner already has an IP address and in most cases the referrer (the URL people clicked on that led them to the page on the site). So yeah, I could discover users that clicked on my site, and then presumably I can see some other searches they did. But why does this make identifying the person easier? I have your IP and alias, but just because I know where you’re from and you go by “mc” doesn’t mean I can ring you up…

    Point #2 - yup, it’s over 3 months… OK. So it should be easy for someone to prove me wrong. And I’ll freely admit I may be, but I’d like someone to demonstrate this versus just playing Cassandra.

  3. mc Says:

    I realise I have to pose this as a hypothetical because I don’t have a site which AOL users have gone through to, and so probably won’t be able to convince you.

    My point in #1 is that you might have my alias, but you also have my email address. Now in this case there is no verification so you couldn’t know it was truly my email address - but there are lots of places where your email address is checked. So any number of sites on the internet could know your email and your search history. That you can accurately map that email to a person is trivial, assuming you have not already got their name. Never mind a site where you have entered your credit card information, name, address and phone number.

    So your hypothesis is that you cannot know who is searching from the queries themselves, which I don’t dispute. But with the additional information it *is* possible for any webmaster of a site visited by one of these users to accurately identify the user and tie it to their search queries - which you contend cannot be done. If I had access to the nytimes webmaster logs, I am saying that I could meet your challenge a thousand times over, by linking the AOL UserID to an AOL email address.

    Even if you dispute this is possible, would publishing the list with an IP address attached to each query not have bothered you - which is one step away from what they have done?

  4. Erik Selberg Says:

    I understand your point… I’m perhaps just not seeing that it’s a huge deal. Essentially, I think it boils down to: “If I have some private information about you, I can get more from the AOL data.” For example, Amazon.com maps nearly every search you do to your name, and they have your CC, address, etc. (you’re almost always logged in to Amazon, so they just use the cookie to unify search terms w/ people.) That’s pretty private info. So sure, with this data, they could also get some additional searches done by you outside of Amazon.com. Same goes for other big sites, such as nyt.com.

    What I don’t understand is how the additional info is a huge deal to anyone, as presumably there’s already a stronger trust relationship with the initial party (e.g. presumably I trust Amazon more as they have my CC, so it’s OK if they have more info).

    BTW: meaningless legal foo… the data has a big “for non-commercial use only” license on it, which means places like NYT, Amazon, Microsoft, etc. really can’t touch it. I know most of you out there ignore such licenses, but once you’re working for one of the BigComs, the lawyers do come talk to you and make sure everyone knows they can’t touch this. Case in point, when the initial thread hit one of the internal MS lists, that clause was pointed out very quickly before people started diving in. I suspect other big places had similar reactions to it.

  5. mc Says:

    I agree with you that it is probably not a big deal compared to all the data Google has on me, for example. But I think a user of AOL has a reasonable expectation that when they give personal information to one website they don’t relinquish *all* personal data such as their search history to that website, without expressing the explicit desire to do so.

    I was just trying to show there is an easier and far more convincing way to identify those UserIDs than random inferences from search queries, and that I (if I were the owner of a slightly popular website) could meet the challenge to track down one of the users :)
    I think the way you (Microsoft) released your data was in a far more sane way limiting the possibilities of personal identification (I’m guessing you didn’t give out timestamp information). Out of interest, did the data you released link the queries through time to a unique user?

    Re: legal-foo, while I agree entirely, there are many websites - e.g. generic-ringtone-spam.com - that will have no such qualms, and will be able to map people’s search histories to their phone numbers, for example. Indeed, I’m sure there are a number of universities that would be able to link a UserID to a person using this data, for non-commercial use. Interestingly, AOL are essentially maintaining that they have not violated their privacy policy as they have not given out personally identifying information, but it’ll be a very interesting court case if someone tries to argue that giving enough information that a third party can identify you with a little effort violates this.

  6. Erik Selberg Says:

    Your point was fair… and I suspect some folks (like Greg and I) may try the experiment to see if we can’t discover some identities using that technique. I’m curious if anyone in the dataset hit my site… :)

  7. Erik Selberg Says:

  8. donalds Says:

    > So, I decided to try out the experiment you suggested. Turns out it doesn’t work.

    As a researcher (with presumably a background in math), you should know that a sample of one does not say much at all. So people should go through hundreds of IDs and check them. Your bet is lost with one identifiable person. And a very small percentage of ID-to-name relations still means thousands of people with a big problem in their lives.

    Also, you call the people who peek into the data ‘hypocrites’. I do not think this is true. No harm can be done with data which is available at every corner now. It is impossible to hide it again - as I see it, this is the often cited *observation* of ‘information wants to be free’.

    With private data, harm is done if it is made public first. Once it is in the public, there can’t be any additional harm. This is, I think, the idea of the bloggers and I can understand that very well.

    People/corps/govs with evil intentions do not even think about whether they are ‘evil’ if they use this data.

  9. Morgan Schweers Says:

    Greetings,
    If the information is going to be out there, I’d rather it were made public, instead of being in only a few people’s hands. For instance, say that you lose your bet (like I think you’re about to), and someone can be personally identified? Someone, probably a number of someones, will contact them and tell them that their info is out there. (They’ll probably sue AOL, but that’s beside the point.)

    If the info was in a smaller number of hands, and a ‘black-hat’ got ahold of it, there wouldn’t be anyone telling those users that they are in danger. This is (roughly) the premise behind the ‘transparent society’…

    Now, I’ll provide a few potential answers.
    Nelson Gill, anonymized user 4539634.
    Barbara Jean Leighton, anonymized user 5167434.
    tki189@aol.com, anonymized user 18942526.
    Evidently tibrown1964@aol.com signed up for Passport, according to anonymized id 12423492.

    How about the query:
    locate keith ivan thompson born 3 may 64 social security #### last address was 7th street apt 317 aurora colorado

    It doesn’t expose the user, but it exposes someone else’s personally identifiable information.

    or

    birth certificate for debra ann collins 1-28-59 ss ####

    There are a few dozen others like this. I’m astonished that people searched for their SSN numbers, and the SSN numbers of other people, but they did, and do, and even associate their names with them.

    The worst exposure is this:

    kristy nicole vega hammond la. social secruity number ### birth date 03 08 81 drivers license number la. ### address 41178 rene dr. hammond la.

    There aren’t very many passwords, thankfully, but there are some.

    That all said, I’m a member of SIGIR (although not the mailing list :( ), and I’m absolutely GIDDY with excitement over what I can do with this query information, in terms of useful research. This is going to be wonderful information, in a purely abstract sense. For those who do NOT have dark desires in their hearts, this is a great boon. For those who lean to the dark side…well, it is also, but I believe the potential for misuse is limited by the responsibility of the white-hats who’ll find the people with personally identifiable information and contact them, and let them know their info has been compromised.

    – Morgan

  10. donturn Says:

    See this NY Times Story.

  11. Sara Astruc Says:

  12. Erik Selberg Says:

    So, certainly, a sample of one does not a proof make. However, finding out that all IP addresses in a log file are random AOL proxies does in fact discount the ability to discover more info about a given user. That’s proof by induction.

    Certainly, if I can identify more by the user than IP… like his or her name… then yeah, I’ve got more. But IP address alone won’t cut it; I do need to have more information.

    BTW, I was talking to someone from the New York Times today who has access to all their data. He couldn’t say whether he’d be able to track down an individual; he wasn’t sure. YMMV.

  13. Warrior Says:

    Make your mistakes and take your risks with your own (or with that of persons consenting) personal data/organs/minds.

  14. mc Says:

    You are certainly right about IP addresses alone not being enough, thanks to the nice “efficient” proxies AOL likes to use :)

    Still I’m glad the nytimes proved my point, although the story seems to imply they tracked her down using queries alone; however, they don’t explicitly say this. I hope they don’t get into trouble for using the information for “commercial use” by exposing that user :-s

    Thanks for the discussion!

  15. Sara Astruc Says:

    So, certainly, a sample of one does not a proof make.

    Well, you asked: “Find me a single user that is identifiable by these queries (and you cannot just self-identify!). People said this when Alta Vista and Excite released their logs, but nothing came of it. So, let’s try this again… if you can identify someone via their queries, then you win, I’m wrong, and I’ll say so on this blog.”

    You asked for one, I gave you one.

  16. head Says:

    Yes.. try out the AOL search database yourself.. It is just fun to look at some of the search data..

    http://data.aolsearchlogs.com/log/random.cgi
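
The log correlation mc describes in comment 1 (matching a timestamped click in the released data against a site's own access log) can be sketched roughly as follows. This is a minimal illustration on made-up data, not anything taken from the thread: the `correlate` function, the record layouts, the five-second window, and the crude referrer check are all assumptions chosen for the example.

```python
from datetime import datetime, timedelta

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp layout for both logs


def correlate(search_clicks, site_hits, site_host, window_seconds=5):
    """Pair search-log clicks on site_host with site access-log hits close in time.

    search_clicks: (anon_id, clicked_at, click_url) tuples  (hypothetical layout)
    site_hits:     (hit_at, ip, referrer) tuples            (hypothetical layout)
    """
    window = timedelta(seconds=window_seconds)
    matches = []
    for anon_id, clicked_at, click_url in search_clicks:
        if site_host not in click_url:
            continue  # the click went to some other site
        click_time = datetime.strptime(clicked_at, TIME_FORMAT)
        for hit_at, ip, referrer in site_hits:
            hit_time = datetime.strptime(hit_at, TIME_FORMAT)
            # Crude assumption: the hit came from a search page if the
            # referrer contains "search" and arrived within the time window.
            if "search" in referrer and abs(hit_time - click_time) <= window:
                matches.append((anon_id, ip))
    return matches


# Entirely synthetic records using reserved example hosts and addresses.
search_clicks = [
    (1001, "2006-03-01 10:15:02", "http://www.example.com"),
    (1002, "2006-03-01 11:40:10", "http://www.example.org"),
]
site_hits = [
    ("2006-03-01 10:15:03", "192.0.2.10", "http://search.example.net/?q=test"),
    ("2006-03-01 12:00:00", "192.0.2.11", "http://search.example.net/?q=test"),
]

matches = correlate(search_clicks, site_hits, "example.com")
print(matches)
```

As comment 12 points out, a match like this only yields an IP address, and hits arriving through shared AOL proxies would leave that inconclusive unless the site also holds account data for the visitor.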
