<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	>
<channel>
	<title>Comments on: I&#8217;m wrong. NYT answers my challenge.</title>
	<atom:link href="http://selberg.org/2006/08/09/im-wrong-nyt-answers-my-challenge/feed/" rel="self" type="application/rss+xml" />
	<link>http://selberg.org/2006/08/09/im-wrong-nyt-answers-my-challenge/</link>
	<description>Erik Selberg's Homepage &#038; Blog</description>
	<pubDate>Sat, 30 Aug 2008 01:31:52 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.5.1</generator>
		<item>
		<title>By: mc</title>
		<link>http://selberg.org/2006/08/09/im-wrong-nyt-answers-my-challenge/#comment-2244</link>
		<dc:creator>mc</dc:creator>
		<pubDate>Wed, 09 Aug 2006 20:00:45 +0000</pubDate>
		<guid isPermaLink="false">http://www.selberg.org/2006/08/09/im-wrong-nyt-answers-my-challenge/#comment-2244</guid>
		<description>Many privacy advocates would agree that you can never truly anonymise data, and after the bad press from AOL probably no one else will try. Still, I think it is possible to anonymise search logs for academic research by doing the following:
* Don't link the queries to a unique user. That link is the only way someone's search *history* can be determined. You could argue that an individual can still be identified from a single search, but that is a hell of a lot harder, almost impossible, and would reveal no more than the HTTP headers given to whatever website you click on (which most people probably don't realise they give out anyway).
* Don't time/date stamp the queries, so a webmaster can't link a query to personal information submitted to them, as I suggested in the comments to your previous article. Even without this step, given the point above, they would still have no more information on you anyway.
* Strip SSNs, credit card numbers and email addresses, as Microsoft did, and drop queries of more than, say, 5 words, where people are just pasting an email or whatever.

I don't know if such data would be useful to researchers, but if AOL had released data in this format, I think your challenge would have been almost impossible to meet. The question is then whether queries about identifiable people (you couldn't know who made the search, but you could identify the person being searched for) are acceptable, for example the query "Arnold, Lilburn, GA, gay".</description>
		<content:encoded><![CDATA[<p>Many privacy advocates would agree that you can never truly anonymise data, and after the bad press from AOL probably no one else will try. Still, I think it is possible to anonymise search logs for academic research by doing the following:<br />
* Don&#8217;t link the queries to a unique user. That link is the only way someone&#8217;s search *history* can be determined. You could argue that an individual can still be identified from a single search, but that is a hell of a lot harder, almost impossible, and would reveal no more than the HTTP headers given to whatever website you click on (which most people probably don&#8217;t realise they give out anyway).<br />
* Don&#8217;t time/date stamp the queries, so a webmaster can&#8217;t link a query to personal information submitted to them, as I suggested in the comments to your previous article. Even without this step, given the point above, they would still have no more information on you anyway.<br />
* Strip SSNs, credit card numbers and email addresses, as Microsoft did, and drop queries of more than, say, 5 words, where people are just pasting an email or whatever.</p>
<p>I don&#8217;t know if such data would be useful to researchers, but if AOL had released data in this format, I think your challenge would have been almost impossible to meet. The question is then whether queries about identifiable people (you couldn&#8217;t know who made the search, but you could identify the person being searched for) are acceptable, for example the query &#8220;Arnold, Lilburn, GA, gay&#8221;.</p>
]]></content:encoded>
	</item>
</channel>
</rss>
