selberg.org

Archive for August 9th, 2006
8/09/06
10:37 am
I’m wrong. NYT answers my challenge.

My advisor gave me a great bit of wisdom many years ago: 20 minutes of student perspiration is worth a week of advisor intuition. Looks like that’s the case here. Hat tip to Don, Sara, and Morgan for posting a link to the New York Times article that has identified someone from the AOL data: one Thelma Arnold. It isn’t clear whether the NYT used only the data from AOL or also some of their own, via the method suggested by MC in the comments of the original article I posted. Coincidentally, I was talking with someone from the NYT yesterday here at SIGIR, and I got the impression that all the click logs the NYT has been keeping are unused and not easily available to reporters. So I suspect Ms. Arnold was tracked down using high-tech means such as calling all the Arnolds in Lilburn, GA (DexOnline has 25).

So, what does this show?

  • Yes, you can identify a person via their queries. My intuition and protestations to the contrary notwithstanding, the NYT article and some of the other examples make it pretty clear that you can identify people without too much difficulty.
  • People search using private data. I never thought people would search using SSNs and driver’s license numbers… but I guess they do.

It also means that the AOL data is likely going to be the last public release of search engine query logs. While some data can be anonymized (for example, the MSN Search data we licensed to academics had nearly all numbers turned into ### prior to release), you can’t anonymize everything, and thus it will always be likely that someone can be tracked down using the query logs. So we’re done.
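To make the number-masking idea concrete, here is a minimal Python sketch. It is not the actual process used for the MSN Search data, which isn’t described here. The helper name mask_numbers, the rule of collapsing each run of digits into ###, and the sample queries are all assumptions made up for illustration.

```python
import re

# Assumption: every run of digits is replaced by the literal "###".
# The real MSN Search anonymization rules are not given in this post.
_DIGIT_RUN = re.compile(r"\d+")


def mask_numbers(query: str) -> str:
    """Return the query with each run of digits replaced by '###'."""
    return _DIGIT_RUN.sub("###", query)


# Invented sample queries, not taken from any real log.
queries = [
    "ssn 123-45-6789 lookup",
    "cheap flights 2006",
    "plumbers in springfield",
]

masked = [mask_numbers(q) for q in queries]

for original, result in zip(queries, masked):
    print(f"{original!r} -> {result!r}")
```

The last query passes through unchanged, which is the limitation described above: masking digits hides SSNs and license numbers, but place names and other free text that can narrow a search down to one person are untouched.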

Certainly, it is possible that someone will set up some kind of opt-in program where customers can choose to have their queries explicitly made available to academics. However, the problem is that the self-selected users who opt in are going to differ from the general public, so the data won’t be as useful to academics (although again, this is just my intuition here).

So, I’m wrong. AOL did release PII, which is clearly an error on their part. And thus I’d urge everyone to delete the data and certainly not use it for any purpose whatsoever, even research. Fields of science that routinely use data collected from human subjects, such as psychology or social science, have clear guidelines requiring prior consent of the subjects. Thus, even though it may be possible to get the consent of people like Ms. Arnold and some others, it’s inconceivable that AOL will get consent from everyone (even should they send e-mail to all of them, apologize, and ask for consent). And consent after the fact isn’t the same as prior consent. So there’s no other option: the AOL data, tempting as it may be, cannot be used ethically. It’s as simple as that.