selberg.org

Archive for August, 2006
8/27/06
9:00 am
Best. Funerals. Ever.

OK, blog catchup weekend. Here’s a juicy tidbit from the BBC about a rather interesting local custom for funerals in rural China:

Five people have been detained in China for running striptease send-offs at funerals, state media say.

The once-common events are held to boost the number of mourners, as large crowds are seen as a mark of honour.

But the arrests, in the eastern province of Jiangsu, could signal the end of the rural tradition.

Local officials have since ordered a halt to “obscene performances” and say funeral plans have to be submitted in advance, Xinhua news agency said.

The arrests, in Donghai county, followed striptease acts at a farmer’s funeral, the agency said.

Two hundred people were said to have attended the event, which was held on 16 August.

The Beijing News said the event was later revealed by a Chinese TV station. The leaders of five striptease troupes were held, it said, including two involved in the farmer’s funeral.

“Striptease used to be a common practice at funerals in Donghai’s rural areas to allure viewers,” Xinhua agency said.

“Local villagers believe that the more people who attend the funeral, the more the dead person is honoured.”

As well as ordering an end to the practice, officials have also said residents can report “funeral misdeeds” on a hotline, earning a reward for information.

I’m not saying that when I kick, I expect my survivors to hire a small troupe for a striptease to get people to come. But hey, if it’s a choice of estate tax or something tasteful to help attract, oh, a thousand or so… well, what will I care, I’m dead! :)

…I’m sure this post won’t come back to haunt me in 60 years…

8/20/06
9:46 pm
Why I hate mail-order…

So, while the Internet has done wonders for business, there are still issues with ordering something online. For example, the other week I ordered a pair of Oval Concepts A700 SLAM aerobars. I went with what I thought was a reasonable on-line bike merchant; they have an eBay store and are highly ranked, with no bad feedback in the usual locations. The aerobars arrived Friday, 4 days later. However, they were missing a number of parts… it was clear that somebody had opened the box and removed some of them. It’s not clear if somebody did this to check that everything was there and forgot to put things back, or removed some parts for some other reason, and then somebody else shipped the package thinking it was complete when it wasn’t. Who knows.

Anyway, someone at the shop responded to my e-mail pretty quickly… said the unit was new, but the last one they had, and somebody had probably messed up getting it ready for shipping. OK… well, we’ll see when they ship a new one and if it arrives complete…

PS - name is currently withheld pending how they resolve this. A follow-up, regardless of how this turns out, will be forthcoming.

8/13/06
11:12 pm
The problems with blog spam

At AIRWeb, a workshop on adversarial information retrieval held at SIGIR, they had a panel on blog spam. What surprised me was that there appear to be two camps on what the “blog spam” problem is. Most of the panel seemed to be concerned about spam on blogs, including splogs, or blogs that are 100% spam. Essentially, anytime someone can put a link on somebody else’s page or IP address, they can help inflate the ranking of some target site. For the most part, spam on blogs breaks down into:

1. Fake comments
2. Fake trackbacks / referrers / pingbacks
3. Fake blogs (which include all of the above)

The biggest problem for search engines appears to be #3, Fake Blogs. The reason is that fake blogs are under the complete control of the spammer, so it falls to the search engines to detect and remove them. Fake comments, trackbacks, and the like, on the other hand, are a problem for both the search engines and the owner of the blog, and the owner will typically take much more direct steps to remove said spam.

Another problem with Fake Blogs is that most of them appear on the big hosting sites… e.g. Blogger / BlogSpot (Google), Spaces (MSN), and so on. So it’s one arm of the search engine companies working to help out another arm as well, which makes sense.

However, as someone who runs an independent blog (WordPress), I found a couple of things surprising:

  • While a lot of people are doing research on thwarting blog spam, it seems the research is not making it outside the companies. In particular, the anti-spam technology I’m using is Akismet, from the same guys that brought you WordPress. Not MSR, not YRL, not Google.
  • Given how much spam I get and how much Akismet catches, it’s not clear that much more work is needed. Akismet seems to work pretty well.
  • There seems to be a lack of understanding on the “nofollow” tag.

A word on that last bit… a number of blogs that didn’t have some kind of comment moderation were getting spammed galore. So, the “nofollow” tag was invented so that bloggers (or really, their blogging software) could identify links in comments. This was not done so that bloggers could instruct search engines not to follow these links; nor was it done so that bloggers could mark a link as “untrustworthy” and have the search engines analyze the links and decide for themselves. It was done to discourage comment spam… spammers wouldn’t spam, because the links wouldn’t be worthwhile. It’s kind of effective, but really, a bar to entry on comments such as a captcha or moderation approval works much better.
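For anyone who hasn’t seen it, the “tag” is really just a rel="nofollow" attribute on the link. Here’s a rough sketch of what blogging software might do to comment links. The add_nofollow function is my own made-up name, and the regex is deliberately naive (it assumes the link has no rel attribute already); real software would more likely use a proper HTML parser.

```python
import re


def add_nofollow(comment_html: str) -> str:
    """Mark every link in a comment with rel="nofollow".

    Naive sketch: assumes anchors do not already carry a rel attribute.
    """
    return re.sub(r"<a\s+", '<a rel="nofollow" ', comment_html, flags=re.IGNORECASE)


comment = 'Nice post! <a href="http://example.com/">cheap pills</a>'
print(add_nofollow(comment))
# Nice post! <a rel="nofollow" href="http://example.com/">cheap pills</a>
```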

Anyway, a final thought… while right now, Akismet is winning the war against comment spam, I suspect the spammers will adapt faster and eventually start winning. At that point, I hope we’ll have some of the research done by YRL, MSR, and Google integrated into products to help out.

8/13/06
12:05 am
Good reads from SIGIR 2006

We had a hugely successful SIGIR here in Seattle.

Here are some of the papers I really enjoyed, and worth a read (and the first is certainly apropos):

Web IR: current topics: You are what you say: privacy risks of public mentions
Dan Frankowski, Dan Cosley, Shilad Sen, Loren Terveen, John Riedl

Evaluation 1–user models and test collections: Minimal test collections for retrieval evaluation
Ben Carterette, James Allan, Ramesh Sitaraman

Evaluation 1–user models and test collections: Dynamic test collections: measuring search effectiveness on the live web
Ian Soboroff

Evaluation 2: A statistical method for system evaluation using incomplete judgments
Javed A. Aslam, Virgil Pavlu, Emine Yilmaz

Web 2: Generalizing PageRank: damping functions for link-based ranking algorithms
Ricardo Baeza-Yates, Paolo Boldi, Carlos Castillo

The first page of results: Less is more: probabilistic models for retrieving fewer relevant documents
Harr Chen, David R. Karger

Users: clarification, feedback, and browsing: Exploring the limits of single-iteration clarification dialogs
Jimmy Lin, Philip Wu, Dina Demner-Fushman, Eileen Abels

8/13/06
12:05 am
Query logs and the AOL controversy

I’ve had the opportunity to talk with a number of people from AOL Research this week on the data release. They’re genuinely upset over what’s happened… they released data with the best of intentions: to help foster research in a key area with some new data. They did something that others had done in the past; in particular, Alta Vista and Excite (I don’t include Microsoft in that list as we didn’t release data publicly). They did try to anonymize the data, and you had better believe they had a reasonable amount of sign-off from people inside of AOL. They thought they were doing a good thing for the community, and it turns out to be a bit of a disaster. Now, if I read body language properly, it looks like a number of them are afraid AOL is going to fire someone or the entire group in order to save face with its customers.

So how did we get here? And, given that AOL has loads of data that certainly doesn’t include any PII and thus no risk, why did they pick query logs?

Two words: Query Intent.

What is query intent? Simply put, it’s understanding what the user is looking for based on their query. For example, if I search for “Vision Tech Carbon Pro”, am I looking for a review of their aero bar, or a place to buy it, or the main product page for it? And would I care if the page was located in the US or in the UK? Written in English or Spanish?

A lot of how relevance is calculated is based on certain assumptions of query intent. However, if we can do a better job of understanding query intent, then we can do a better job of answering our customers’ questions. And query logs are part of the key to understanding query intent… for example, if the query is “postage stamps”, it’s not clear what the user wants. But “postage stamps next increase” is much clearer. As it turns out, in AOL’s query logs, “postage stamps next increase” came after “postage stamps” in Thelma Arnold’s history (which she has given permission to use). From sessions in query logs, we’re able to learn what the most likely intent is from a given query.
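To make the session idea concrete, here’s a minimal sketch of counting which longer query most often follows a shorter one. This is just an illustration, not how any search engine actually does it: the refinements function, the (user_id, query) log layout, and users 2 and 3 are all invented for the example.

```python
from collections import Counter, defaultdict


def refinements(log):
    """Count follow-up queries that extend the user's previous query.

    log: iterable of (user_id, query) pairs, in time order for each user.
    Returns {original_query: Counter({refined_query: count})}.
    """
    by_user = defaultdict(list)
    for user_id, query in log:
        by_user[user_id].append(query)

    counts = defaultdict(Counter)
    for queries in by_user.values():
        # Walk consecutive pairs of queries within one user's history.
        for prev, nxt in zip(queries, queries[1:]):
            if nxt.startswith(prev + " "):
                counts[prev][nxt] += 1
    return counts


# Hypothetical log; only the first user's two queries come from the post.
log = [
    (1, "postage stamps"),
    (1, "postage stamps next increase"),
    (2, "postage stamps"),
    (2, "postage stamps next increase"),
    (3, "postage stamps"),
    (3, "postage stamps collecting"),
]

print(refinements(log)["postage stamps"].most_common(1))
# [('postage stamps next increase', 2)]
```

With enough sessions, the most common refinement is a reasonable guess at what people typing the short query usually want.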

There’s been some research on this so far. Andrei Broder broke up queries into three large categories: navigational, informational, and transactional in a seminal paper while he was at DEC (AltaVista). Another paper from Danny Levinson and Dan Rose at AltaVista (shortly before they had been acquired by Yahoo!) further refined the ontology. However, there’s been little research since, largely because most research requires query logs, and only a few places have them.

So, one of the main reasons AOL released this data was to encourage and enable research into this area. However, given the issues that have come up, looks like this might be more difficult than they thought. Ah well…

8/09/06
10:37 am
I’m wrong. NYT answers my challenge.

My advisor gave me a great bit of wisdom many years ago: 20 minutes of student perspiration is worth a week of advisor intuition. Looks like that’s the case here. Hat tip to Don, Sara, and Morgan for posting a link to the New York Times article that has identified someone from the AOL data. Turns out that one Thelma Arnold was identified. It isn’t clear if the NYT only used the data from AOL or used some of their own via the method suggested by MC in the comments of the original article I posted (although coincidentally I was talking with someone from the NYT yesterday here at SIGIR, and I got the impression that all the click logs the NYT has been keeping are unused and not easily available to reporters. So I suspect Ms. Arnold was tracked down using high-tech means such as calling all the Arnolds in Lilburn, GA. DexOnline has 25.).

So, what does this show?

  • Yes, you can identify a person via their queries. My intuition and protestations to the contrary, from the NYT and some of the other examples, it’s pretty clear that you can identify people without too much difficulty.
  • People search using private data. I never thought people would search using SSNs and driver’s license numbers… but I guess they do.

It also means that the AOL data is likely going to be the last public release of search engine query logs. While some data can be anonymized (for example, the MSN Search data we licensed to academics had nearly all numbers turned into ### prior to release), you can’t anonymize everything, and thus it will always be likely that someone can be tracked down using the query logs. So we’re done.
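As a rough illustration of why scrubbing numbers only gets you so far, here’s a minimal sketch. It is not the actual process used on the MSN Search data; the mask_numbers function and both sample queries are made up for the example.

```python
import re


def mask_numbers(query: str) -> str:
    """Replace every run of digits in a query with '###'."""
    return re.sub(r"\d+", "###", query)


print(mask_numbers("ssn 123-45-6789"))
# ssn ###-###-###

# A query with no digits passes through untouched, place name and all.
print(mask_numbers("plumbers in smalltown ga"))
# plumbers in smalltown ga
```

The SSN is gone, but names and places sail right through, and those are exactly the kind of thing that lets someone track a person down.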

Certainly, it is possible that someone will have some kind of opt-in program that customers can select where their queries will be explicitly made available to academics. However, the problem with that is the self-selecting set of users that do that are going to be a different set of users than the general public, and thus it isn’t as useful to academics (although again, this is just my intuition here).

So, I’m wrong. AOL did release PII, which is clearly an error on their part. And thus I’d call out to everyone that they should delete the data and certainly not use it for any purpose whatsoever — even research. Fields of science that routinely use data collected from human subjects, such as psychology or social science, have clear guidelines requiring prior consent of the subjects. Thus, even though it may be possible to get the consent of people like Ms. Arnold and some others, it’s inconceivable that AOL will get consent from everyone (even should they send e-mail to all of them, apologize, and ask for consent). And consent after the fact isn’t the same as prior consent. So there’s no other option — the AOL data, tempting as it may be, cannot be used ethically. It’s as simple as that.

8/07/06
10:54 pm
In defense of AOL…

Just a reminder, this is a blog, and the below is my opinion, not that of my employer, Microsoft. You know the drill.

Update: The New York Times has answered my challenge and proven me completely wrong. Please be sure to read my retraction if you came to this page via a permalink before you quote it.

A few weeks ago, AOL decided to release a large dataset of query logs to the research community. They sent a detailed piece of mail to the SIGIR e-mail list last week; the timing was intentional, as the SIGIR conference is underway here in Seattle this week. Their page has been taken down, but here’s the MSN Cache and Google Cache.

This is great stuff. We here at Microsoft did something similar on a smaller scale a few months back, releasing some query logs to a limited set of academics.

The main reason for doing this is that most research being done on search isn’t terribly useful to the big players — Microsoft, AOL, Yahoo, and Google in particular. The reason is that our product teams are far ahead of what most academics are doing, because we have the large-scale search engine and the data, and academics, well, don’t. So they have to make do. A lot of them make do using a small Excite query log from 1999 or an Alta Vista log from 1998 (which appear to be harder to find than I thought… gonna have to dig them up somewhere).

So, while we can’t provide all our data, we can provide some, and query logs are an obvious place to start. They provide a decent amount of data on what people are looking for, are easily anonymized (well, we think. We’ll get back to that), are commonly available (if you run an engine, meaning Microsoft, Google, Yahoo, and AOL all have ‘em), don’t contain any trade secrets (e.g. the query logs from Microsoft don’t differ that much from Google… everyone is looking for pictures of Angelina Jolie’s baby…), and are ripe for research.

However, while AltaVista, Excite, and we released our query logs relatively unscathed, the blogosphere has taken AOL Research to the woodshed on this one. Greg has a number of links to the guilty. My favorite, which I’ll call out, is TechCrunch’s ranting. Here’s a choice excerpt:

The data includes all searches from those users for a three month period this year, as well as whether they clicked on a result, what that result was and where it appeared on the result page. It’s a 439 MB compressed download, expanded to just over 2 gigs. The data is available here (this link is directly to the file) and the output is in ten text files, tab delineated.

The utter stupidity of this is staggering. AOL has released very private data about its users without their permission. While the AOL username has been changed to a random ID number, the abilitiy to analyze all searches by a single user will often lead people to easily determine who the user is, and what they are up to. The data includes personal names, addresses, social security numbers and everything else someone might type into a search box.

The most serious problem is the fact that many people often search on their own name, or those of their friends and family, to see what information is available about them on the net. Combine these ego searches with porn queries and you have a serious embarrassment. Combine them with “buy ecstasy” and you have evidence of a crime. Combine it with an address, social security number, etc., and you have an identity theft waiting to happen. The possibilities are endless.

Anyone spot the flaw in the logic? OK… if you’re mad that AOL released some private data because, well, it’s private data and should be kept, well, private, don’t include a link to the data yourself! And hey, why don’t you skip including that mirror site when the link gets taken down? And maybe, just maybe, after berating AOL for releasing it, how about a call that everyone should take down the data and not use it so that we can all protect the privacy of the people that AOL apparently just violated?

Instead, we get gems like this:

User 491577 searches for “florida cna pca lakeland tampa”, “emt school training florida”, “low calorie meals”, “infant seat”, and “fisher price roller blades”.

Um, yeah… I know exactly who that is. Someone else mentioned that lots of people do vanity queries on their names. By that logic, I guess we’ll have no problem tracking down Angelina Jolie and Brad Pitt’s queries.
So, first and foremost, to everyone out there ranting about this data that isn’t also calling for people to take it down and not use it — STFU! You’re just being hypocrites.

Now, as to the actual perceived problem with the data (that the queries themselves are private data and can be used to track down the person issuing them): prove it. Find me a single user that is identifiable by these queries (and you cannot just self-identify!). People said this when Alta Vista and Excite released their logs, but nothing came of it. So, let’s try this again… if you can identify someone via their queries, then you win, I’m wrong, and I’ll say so on this blog. But until someone, anyone, out there can actually identify someone, this is just pointless flaming.

And hey — Chud, Jen, & the rest of AOL Research — thanks for pushing on this. Yeah, you’re getting a bit roasted right now, but we all need people like you guys to push to get things done. Sometimes mistakes will be made, but the alternative is that no one ever takes a risk, and progress just grinds to a halt.

8/06/06
4:17 pm
Time for SIGIR!

It’s that time of year again! This week, the 2006 SIGIR conference will be taking place… and best of all, it’ll be right here in Seattle! The fun starts today (Sunday) with tutorials and an opening reception hosted by our friends at Boeing. We’ve got three days of papers that look quite good, and have some nice events Monday, Tuesday, and Wednesday nights! Thursday, AIRWeb, the Adversarial IR workshop, will take place, which looks to be a great session as well.

So, if you’re in town and haven’t signed up, stop on by and register… it’ll be well worth your time. Plus, you’ll get to meet all the people who are busy making search engines, as opposed to everyone down at SES in San Jose who are busy figuring out how to game them. Granted, they’re buying many more drinks for the locals there… but hey, we’ve got the Blue Angels! :)

8/04/06
12:50 am
Weight loss made easy?

So, a number of people at Microsoft, including myself, have been doing the 20/20 program at the Pro Club. And the results have been good… lots of weight lost! The 20/20 program is a high-intensity diet, exercise, and therapy (!) program designed to change your lifestyle such that you lose weight and keep it off. The 20/20 program is designed by a cardiologist, and he takes a medical approach to things… for example, people don’t have “weight problems”, they have “metabolic disorders.” Presumably, “metabolic disorder” tested better with insurance company executives.

How does a doctor handle a metabolic disorder? Well, with a shotgun really…

  • Exercise. In particular, cardio and weight lifting with a personal trainer.
  • Diet. In particular, a diet that puts you on something like the GI diet. This is similar to the Atkins / South Beach style diets in that it’s a high protein diet, but differs in that it isn’t a low-carb diet. It is a low grain diet however — so minimizing bread, rice, and pasta.
  • Therapy. A number of group therapy sessions and individual sessions to help you through the process.

This seems daunting… but really, it’s a bit simpler. There are 3500 calories in a pound of fat. So, weight loss or gain is defined simply as:

WeightLoss = (CaloriesBurned - CaloriesConsumed) / 3500

CaloriesConsumed is what you eat. CaloriesBurned is what you use in living (breathing, walking, etc.) and what you burn in exercise. Or:

CaloriesBurned = CaloriesBurnedLiving + CaloriesBurnedExercising

An hour of exercise will burn about 1000 calories, so 5 days of exercise for an hour each will burn 5000 calories, or a little under a pound and a half burned over 5 days. For most weight loss programs, they get about 2 pounds loss a week, and that tends to come from the exercise. Now, a question I’ve been getting asked a lot is whether you can lose weight without exercising. Well, from the above equation, you can… but it’s hard. The rub comes from CaloriesBurnedLiving. CaloriesBurnedLiving is dependent on your muscle mass. Muscle burns 3x as many calories as fat, so more muscle = higher metabolism = higher CaloriesBurnedLiving. The reverse is true… less muscle = lower metabolism = lower CaloriesBurnedLiving. Your body will naturally grow or shrink muscle depending on how you use it… sad but true.
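If you want to play with the numbers yourself, here’s the math above as a small sketch. The function names are mine, and the daily figures in the second example (2000 calories burned living, 1500 eaten) are assumptions picked purely for illustration.

```python
CALORIES_PER_POUND_OF_FAT = 3500  # figure used in the post


def calories_burned(living: float, exercising: float) -> float:
    """CaloriesBurned = CaloriesBurnedLiving + CaloriesBurnedExercising."""
    return living + exercising


def weight_loss_pounds(burned: float, consumed: float) -> float:
    """WeightLoss = (CaloriesBurned - CaloriesConsumed) / 3500."""
    return (burned - consumed) / CALORIES_PER_POUND_OF_FAT


# Exercise alone: five one-hour sessions at about 1000 calories each.
print(round(weight_loss_pounds(burned=5 * 1000, consumed=0), 2))
# 1.43

# A whole week, using assumed daily numbers (hypothetical, not from 20/20).
week_burned = calories_burned(living=7 * 2000, exercising=5 * 1000)
week_consumed = 7 * 1500
print(round(weight_loss_pounds(week_burned, week_consumed), 2))
# 2.43
```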

If you try to lose weight without exercising, all you can do is lower the calories you consume. However, because you aren’t exercising, your body will lower your metabolism, and you’ll have to eat less to compensate. There is an equilibrium point, but that means you’re eating next to nothing, probably always hungry, and have no muscle mass whatsoever. Kind of like Kate Moss. What will more likely happen is you’ll lose weight (both fat and muscle you aren’t using), then stop and get frustrated while being really hungry. You’ll eat more, and your body will then store that extra food as fat (remember, no exercise = no muscle). This is the classic yo-yo diet…losing and gaining. But it’s worse than it sounds, because you’re losing muscle and gaining fat throughout, making each successive diet that much harder.

So what does this mean? Well, all those random diet supplements they’re advertising on TV? Not going to work without exercise. Weight Watchers? Not going to work without exercise. Jenny Craig? Not going to work without exercise. And how about Jared from Subway? Well, he lost his 235 pounds by eating only about 1000 calories a day (you typically need 1500 - 2000 to live) and exercising. Simple as that.

I’ve learned a lot going through 20/20… they do a good job on the education front. I understand more about diet and exercise. I’ve lost a bunch of weight. It’s all good. But the most important thing is that I’ve gone through the math and realized that the secret here is exercise… gotta use muscle and burn. Otherwise, the fat builds.

So, for everyone out there wondering how to lose the weight… exercise. Lift weights, do cardio. Build muscle. Burn. Move more, eat less. You can lose more (or less) by changing your diet, but fundamentally, you need to burn off the food. Only way to do that is exercise. But hey, you can have some fun with it… after all, it’s fun to go fast on a bike! :)