Dear Internet: Oopsie! Love AOL.

Earlier this month AOL released a sizable chunk of data to the public: the verbatim search queries submitted by AOL subscribers from March to May 2006. The intention was purportedly to aid academic researchers in creating new search-based tools by giving them some real-life data to work with. AOL knew they did not want to violate their subscribers’ privacy, so they replaced the identifying information in the log lines with unique numbers. So kbridger could become, for example, user 3402937.
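To make that substitution concrete, here is a minimal sketch of what such a scheme might look like. This is my own illustration, not AOL’s actual code: the function name, the sample log lines and the numbering (which simply starts at 1) are all assumptions.

```python
import itertools

def pseudonymize(log_lines):
    """Swap each screen name for a stable number (hypothetical scheme)."""
    ids = {}                      # screen name -> assigned number
    counter = itertools.count(1)  # arbitrary starting point, an assumption
    released = []
    for name, query in log_lines:
        if name not in ids:
            ids[name] = next(counter)
        # The query text itself is passed through untouched.
        released.append((ids[name], query))
    return released

# Made-up log lines for illustration only.
log = [
    ("kbridger", "pecan pie recipes"),
    ("another_user", "weather tomorrow"),
    ("kbridger", "carpet shampoo rental"),
]
released = pseudonymize(log)
```

The screen name is gone from the output, but every query from the same person still carries the same number, so a whole search history stays stitched together.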

What they failed to realize – and this is a major stumbling block in artificial intelligence today – is that there is information in the context of the data as well. For example, look at this “anonymous user’s” search history:

17556639 how to kill your wife
17556639 how to kill your wife
17556639 wife killer
17556639 how to kill a wife
17556639 poop
17556639 dead people
17556639 pictures of dead people
17556639 killed people
17556639 dead pictures
17556639 dead pictures
17556639 dead pictures
17556639 murder photo
17556639 steak and cheese
17556639 photo of death
17556639 photo of death
17556639 death
17556639 dead people photos
17556639 photo of dead people
17556639 www.murderdpeople.com
17556639 decapatated photos
17556639 decapatated photos
17556639 car crashes3
17556639 car crashes3
17556639 car crash photo


Now this series of queries tells a story – one that has supposedly alarmed some people. Is it possible to find out who these users are, though? AOL claims they anonymized the data before posting it, so is there any harm in this voyeuristic opportunity?

The NYT did a little combing and was able to identify a person from search text alone, using a search history like this one:

4417749 numb fingers
4417749 60 single men
4417749 dog that urinates on everything
4417749 landscapers in Lilburn, Ga
4417749 bill arnold
4417749 carpet shampoo rental
4417749 julie arnold
4417749 stan arnold
4417749 homes sold in shadow lake subdivision gwinnett county georgia
4417749 gwinnet county animal services
4417749 stan arnold
4417749 pecan pie recipes
4417749 McGyver DVDs
4417749 pet euthanasia services

I believe Thelma Arnold is now cancelling her AOL subscription.

The data set has been put into a database and a public query engine has been created at AOL Stalker, allowing anyone to query for specific words to see which users searched for them. You can also put in a user number to see that user’s entire search history for the three-month period. Not surprisingly, Internet denizens have turned this into a kind of sick game, where users are singled out for psychoanalysis in an attempt to identify them. Here’s their list of currently identified people.
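It takes very little machinery to build those two lookups. The sketch below is only my guess at the idea, not how AOL Stalker is actually implemented; the function names are made up, and the records are a few of the log lines quoted above.

```python
from collections import defaultdict

def build_indexes(records):
    """Index (user number, query) pairs by user and by word."""
    by_user = defaultdict(list)  # user number -> queries, in log order
    by_word = defaultdict(set)   # lowercased word -> user numbers
    for user_id, query in records:
        by_user[user_id].append(query)
        for word in query.lower().split():
            by_word[word].add(user_id)
    return by_user, by_word

records = [
    (4417749, "numb fingers"),
    (4417749, "pecan pie recipes"),
    (17556639, "steak and cheese"),
    (17556639, "dead people"),
]
by_user, by_word = build_indexes(records)

def users_for_word(word):
    """Which user numbers searched for this word?"""
    return by_word.get(word.lower(), set())

def history(user_id):
    """Everything one user number searched for."""
    return by_user.get(user_id, [])
```

Look up a word to get a user number, then look up the number to get the full history: that is the whole “game”.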

Using the search engine, you have to wonder how AOL could possibly think that this data was anonymous. Look at this user’s queries and tell me that the following is not a sad story:

curb morning sickness
get fit while pregnant
he doesn’t want the baby
you’re pregnant he doesn’t want the baby
baby names and meanings
maternity clothes
pregnancy workout videos
abortion clinics charlotte nc
abortion fibroids
symptoms of miscarriage
engagement rings
new homes charlotte nc
www.substanceabuseprevention.org
recover after miscarriage
marry your live-in

It is frightening how a person’s search history can become a touching window into their personal life. I am touched by this person’s experience – and all through their search text, without their knowledge or permission! There is no doubt in my mind that this data is not anonymous in any way. And this can be generalized to all our searching habits. Think about it – when you search www.google.com or whatever, do you really think about other people reading your search queries? Do you think you are searching anonymously?

So the user above who may or may not be planning to murder his wife may be identifiable. Is there some moral requirement to pursue this avenue though? Is simply searching for these types of things really an indicator of pending homicidal actions? Have you ever searched for something on a whim with no real interest in the results?

This “accidental” data dump has opened up a massive can of worms in terms of privacy on the Internet. AOL appears to have ignored a lesson learned by people working with Artificial Intelligence – there is important data in the context of data. Yes, the user names attached to the search terms are anonymized. However, take the search terms in context and suddenly people’s lives are opened up for public viewing.

AOL claims it was a mistake. Others are looking at their slumping stock prices, failing business plan and dropping subscriber rates and are wondering if perhaps they had ulterior motives. A goof-up of this calibre will go down in Internet (if not legislative) history, and AOL’s name will be at the centre of the controversy. An increase in critical market mind share, legislation pending to fix the problem, other search engines doing similar things: one wonders if there isn’t more to this story. Note that the last link is a story mentioning that a variety of search engines (AOL, Yahoo, Microsoft MSN and Google) have been subpoenaed to provide just this kind of data to the US federal government in recent months. Only Google said no.

So did AOL simply take the next step in advertising, or did they really make such a massive blunder? Will the people who released this information be held accountable for their actions? Whose responsibility is it to clean up the fact that a person’s Social Security number, address, and other personal details are now available through a simple search interface? Identity theft is a real problem, but is AOL going to be doing any of this work? How exactly is AOL suffering here? The official AOL home page has no mention of this event, no public apology (I can only find one as a comment on somebody’s blog, quoted below!), and apparently no plans for further action. It is safe to say that I do not entirely buy the “Oopsie” argument, but then I’m an anti-capitalist cynic – so that might just be me.

The final word goes to the AOL Spokesperson Andrew Weinstein who has supplied the often-quoted official response from AOL (though I can only find it officially posted on someone’s blog comments here):

Andrew Weinstein – AOL Spokesperson

All –

This was a screw up, and we’re angry and upset about it. It was an innocent enough attempt to reach out to the academic community with new research tools, but it was obviously not appropriately vetted, and if it had been, it would have been stopped in an instant.

Although there was no personally-identifiable data linked to these accounts, we’re absolutely not defending this. It was a mistake, and we apologize. We’ve launched an internal investigation into what happened, and we are taking steps to ensure that this type of thing never happens again.

Here was what was mistakenly released:

* Search data for roughly 658,000 anonymized users over a three month period from March to May.

* There was no personally identifiable data provided by AOL with those records, but search queries themselves can sometimes include such information.

* According to comScore Media Metrix, the AOL search network had 42.7 million unique visitors in May, so the total data set covered roughly 1.5% of May search users.

* Roughly 20 million search records over that period, so the data included roughly 1/3 of one percent of the total searches conducted through the AOL network over that period.

* The searches included as part of this data only included U.S. searches conducted within the AOL client software.

Our apologies again.

Andrew Weinstein
AOL Spokesperson
