Archive for May 25th, 2008
5/25/08
11:20 pm
Google’s trust-building

When I interviewed at Microsoft many years ago, one of my interviewers asked me what I thought the next big thing for search was. I said: “trust.” Right now, people get pages back, but there’s still a huge degree of distrust of what they see. People trust Amazon.com, and (for better or worse) seem to trust Wikipedia. But random sites? Hmm. Some people are generally trusting, but many aren’t, and the continuous stories of identity theft and credit card theft make people more paranoid (which is probably a good thing).

I still stand by my statement. Of the “next big things” for search that people keep talking about, such as blended search, personalization, and social search, I still believe that trust will be the big differentiator. There is a lot of crap out there, and I suspect it’s growing a lot faster than quality pages.

Which brings me to this. The other day, I received the following in my inbox:

Dear site owner or webmaster of selberg.org,

While we were indexing your webpages, we detected that some of your pages were using techniques that are outside our quality guidelines, which can be found here: http://www.google.com/webmasters/guidelines.html. This appears to be because your site has been modified by a third party. Typically, the offending party gains access to an insecure directory that has open permissions. Many times, they will upload files or modify existing ones, which then show up as spam in our index.

The following is some example hidden text we found at http://selberg.org/2008/02/:

buy viagra
buy viagra online
viagra online
discount viagra
order viagra
cheap viagra
generic viagra
generica viagra
viagra buy
viagra price
order viagra online
viagra generic
viagra pill
where buy viagra
buy viagra cheap
viagra order
get viagra
buy online viagra
online viagra
viagra sale online
where to buy viagra
cheapest viagra
purchase viagra
cheap viagra online
viagra buy online
buying viagra
buy viagra on
generic viagra canada
prescription viagra
buy viagra norway
generic viagra pack

[...]

In order to preserve the quality of our search engine, we have temporarily removed some of your webpages from our search results. Currently pages from selberg.org are scheduled to be removed for at least 30 days.

We would prefer to have your pages in Google’s index. If you wish to be reconsidered, please correct or remove all pages (may not be limited to the examples provided) that are outside our quality guidelines. One potential remedy is to contact your web host technical support for assistance. For more information about security for webmasters, see http://googlewebmastercentral.blogspot.com/2007/09/quick-security-checklist-for-webmasters.html.

When you are ready, please visit https://www.google.com/webmasters/tools/reinclusion?hl=en to learn more and submit your site for reconsideration.

Sincerely,
Google Search Quality Team

My first reaction was, WTF? I run my own blog, and I know I’m not spamming. Somebody phishing me? Nope… links are legit… so I went to the page in question, and sure enough, on my “My advisor’s WSDM” post, there was a hidden block with a number of links:

<font style="position: absolute;overflow: hidden;height: 0;width: 0">
<a href="http://www.bigbadbookblog.com/?menu=1" title="buy viagra">buy viagra</a><br />
<a href="http://www.bigbadbookblog.com/?menu=2" title="buy viagra online">buy viagra online</a><br />
<a href="http://www.bigbadbookblog.com/?menu=3" title="viagra online">viagra online</a><br />
...

And since it was on that post, it was also on the archive page for February, which was the link Google found.
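For anyone who wants to check their own pages for this kind of injection, here is a minimal sketch using only Python's standard library. It flags links nested inside an element whose inline style hides it, like the `<font>` block above. The hiding rules in `is_hiding` and the names `HiddenLinkFinder` and `find_hidden_links` are my own assumptions for illustration; this is not how Google detects hidden text, and it ignores stylesheets, scripts, and unclosed tags.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not touch the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


def parse_style(style):
    """Turn an inline style string into a {property: value} dict."""
    decls = {}
    for part in style.split(";"):
        if ":" in part:
            name, value = part.split(":", 1)
            decls[name.strip().lower()] = value.strip().lower()
    return decls


def is_hiding(style):
    """Assumed heuristic: does this inline style make the element invisible?"""
    d = parse_style(style)
    zero = ("0", "0px")
    collapsed = d.get("height") in zero or d.get("width") in zero
    return (
        d.get("display") == "none"
        or d.get("visibility") == "hidden"
        or (d.get("overflow") == "hidden" and collapsed)
    )


class HiddenLinkFinder(HTMLParser):
    """Collect hrefs of links that sit inside a hidden element."""

    def __init__(self):
        super().__init__()
        self.stack = []          # one bool per open element: hidden or not
        self.hidden_links = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        inherited = bool(self.stack) and self.stack[-1]
        hidden = inherited or is_hiding(attrs.get("style") or "")
        if tag == "a" and hidden and attrs.get("href"):
            self.hidden_links.append(attrs["href"])
        self.stack.append(hidden)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.stack:
            self.stack.pop()


def find_hidden_links(html):
    finder = HiddenLinkFinder()
    finder.feed(html)
    return finder.hidden_links
```

Running `find_hidden_links` over the saved HTML of each post (and each archive page) gives a quick list of suspicious URLs to look at by hand. Treat an empty result as "nothing obvious", not as proof the page is clean.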

I went into panic mode, and first edited the post, then went to Google to ensure my blog wasn’t removed. I was led through an interesting process, where I ended up registering my site, then registering myself as owner (by putting in a special META tag and having Google confirm it was there), and then acknowledging that I’m behaving and all is well.
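The META tag step boils down to a simple idea: the search engine hands you a token, you put it in your page's markup, and they fetch the page to see whether it is there. Here is a rough sketch of that check in Python. The tag name `owner-verification` and the token value are hypothetical placeholders (the real name and token are whatever the webmaster tools give you), and fetching the page over HTTP is left out so the example stays self-contained.

```python
from html.parser import HTMLParser


class MetaTagFinder(HTMLParser):
    """Collect the content values of <meta> tags with a given name."""

    def __init__(self, name):
        super().__init__()
        self.name = name.lower()
        self.contents = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        if (attrs.get("name") or "").lower() == self.name:
            self.contents.append(attrs.get("content") or "")


def has_verification_tag(html, name, token):
    """Return True if the page carries <meta name=NAME content=TOKEN>."""
    finder = MetaTagFinder(name)
    finder.feed(html)
    return token in finder.contents


# Hypothetical tag name and token, for illustration only.
page = '<html><head><meta name="owner-verification" content="abc123" /></head></html>'
print(has_verification_tag(page, "owner-verification", "abc123"))
```

The point of the design is that only someone who can edit the site's markup can make the check pass, which is what ties a registered account to a real site owner.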

This is a great process, in that they now have a known owner for a site they can contact. And they know the site is active and somebody on the other end cares… whether they’re a spammer or not, to be determined later. It also means that Google can now alert me to problems with my site… such as poor indexing, or if my site or a post gets hijacked again (still not sure how it happened, but I’ve updated WordPress at least!). Google is doing a ton of analysis and has published some of what they’re doing, such as massive map-reduce scans looking for malware landing pages, in a technical report, “All Your iFrames Point to Us.” They’re also highlighting sites that they believe may be harmful to your computer (but that they aren’t sure enough about to remove from their index).

Upon reflection, here’s what they’re doing, and here’s what I now believe:

  • Google is building up a network of sites and site owners, getting to know them better;
  • Google is creating a framework to help registered site owners ensure their sites are legit;
  • Google is actively trying to identify and remove bad and malicious content from their index;
  • Google is being (surprisingly) public about what they’re doing.

What this means to me is that the sites that appear on Google are likely more trustworthy than those that appear on competitors, such as Microsoft. Now, I know Microsoft has a security group, and they’re doing a lot to go after malware and phishing (for example, the recent anti-phishing plugin for IE is a great step in that direction). But are they connecting all the dots? And are they doing so publicly? Because frankly, if you don’t tell people what you’re doing for them, they’re very unlikely to give you proper credit for what you’ve done.