Matt Cutts: How Google Deals With Web Spam

Posted by: Rob Hof on October 04, 2009

It’s up to Matt Cutts and his team at Google to keep search results as free as possible from Web spam, those pages full of Viagra ads or even malware. A 10-year veteran of the company, he got into this online underworld after working on the first version of Google’s family filter, SafeSearch.

Cutts’ other job thrusts him in the spotlight almost as much as CEO Eric Schmidt and cofounders Sergey Brin and Larry Page: He’s essentially Google’s ambassador to Webmasters, the folks who operate Web sites.

In a recent interview for my story on how Google’s trying to stay ahead of rivals in search, Cutts provided insight not only into how Google tries to reduce Web spam but also into the search quality process at large. This is the last of a four-part series with Google search quality leaders that began with search chief Udi Manber, Google Fellow and ranking chief Amit Singhal, and Scott Huffman, head of the search quality evaluation unit.

Q: To step back a bit, can you broadly describe the process by which Google ensures search quality, especially behind the scenes?

A: We try to be a balance between relatively analytical and a bit of serendipity, like “a user complained about this,” or an engineer hit a problem as they were doing their search. If someone did 15 queries in a row and never clicked on the results and eventually left, that may be the sort of thing where you dig in and say, well, did we have horrible results? Were they looking for a picture and we never returned a picture?

There’s a lot of different ways we gather all that data to identify a problem. Once you’ve identified a problem, that’s when the fun starts, because you can brainstorm a little bit. It can be serendipitous, a lot of feedback from the outside world. But a lot of it is analytical. So for example, you can look at bad sessions—multiple repeated queries, nobody clicked.
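
[Editor's sketch: the "bad session" signal Cutts describes (repeated queries, no clicks) can be illustrated in a few lines of Python. The `Session` record, the `is_bad_session` helper, and the three-query threshold are assumptions made up for this example, not anything Google has described.]

```python
from dataclasses import dataclass


@dataclass
class Session:
    """One user's sequence of searches (hypothetical log format)."""
    queries: list[str]
    clicks: int  # total result clicks during the session


def is_bad_session(session: Session, min_queries: int = 3) -> bool:
    """Flag sessions with several queries and no clicks at all.

    The threshold is an illustrative assumption, not a Google value.
    """
    return len(session.queries) >= min_queries and session.clicks == 0


sessions = [
    Session(["hearst castle", "hearst castle photo", "hearst castle picture"], clicks=0),
    Session(["businessweek"], clicks=1),
    Session(["taber studios"], clicks=0),
]
bad = [s for s in sessions if is_bad_session(s)]
print(len(bad), "session(s) worth a closer look")
```

A flagged session is only a lead: someone still has to dig in and ask whether the results were poor or the user wanted something else, such as a picture.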

Q: Do you have programs out there tracking that?

A: Yeah. Over time, we’ve built up a lot of evaluation metrics. So you could have query sets where you go if I do this query, I expect to get this result back. And if I don’t get that result back, then maybe I need to do some debugging: Has the Web site gone offline, maybe they got hacked, or did we make some change that made things break? So a lot of it can be just identifying when things used to work well and then didn’t work as well.
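
[Editor's sketch: a query set of the kind described above ("if I do this query, I expect to get this result back") behaves like a regression test. The `EXPECTED` table, the `find_regressions` helper, and the `fake_search` backend below are hypothetical stand-ins for illustration.]

```python
from typing import Callable

# Hypothetical expectations: query -> URL that should appear in the results.
EXPECTED = {
    "taber studios": "taberstudios.example",
    "businessweek": "businessweek.com",
}


def find_regressions(search: Callable[[str], list[str]],
                     expected: dict[str, str],
                     top_n: int = 10) -> list[str]:
    """Return the queries whose expected URL is missing from the top results."""
    failures = []
    for query, url in expected.items():
        if url not in search(query)[:top_n]:
            failures.append(query)
    return failures


def fake_search(query: str) -> list[str]:
    """Stand-in for a real search backend, used only for this demo."""
    index = {"businessweek": ["businessweek.com", "example.org"]}
    return index.get(query, [])


print(find_regressions(fake_search, EXPECTED))
```

Each failing query then needs the debugging Cutts mentions: the site may be offline or hacked, or a ranking change may have broken something.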

And there’s a lot of room for individual engineers to just complain. We have a quality mailing list within Google, and with 20,000 employees at Google, there’s plenty of feedback.

Anytime I go to an arts festival, I walk down the aisles, look at the stained glass and the paintings. So I just write down all the Web sites as I go down the aisles. Like will “Taber Studios” bring back Taber Studios? I always keep a little notebook with me, and I go back and see if I type in that site, does it appear on Google? My wife gets a little tired of it, but when we go on vacation, like stop by Hearst Castle, you got the Chamber of Commerce brochure, and it’s just a list of Websites, and I go, perfect, type in the business names (into Google) and see whether I get all these URLs. So it comes a lot from anecdotal stuff.

Q: Once you’ve got those leads, then what?

A: So once you have that, if there’s an address on this page, why couldn’t we return it or show a map? Trying to find new and different signals that will return that site can be tough. Sometimes it’s just tweaking our existing system, like if the words of the business are in close proximity, give it a little more weight.
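
[Editor's sketch: one simple way to express "words in close proximity get a little more weight" is to find the shortest stretch of text that contains every query word. The scoring formula and the function names are assumptions for illustration; the interview does not say how Google computes this.]

```python
from typing import Optional


def smallest_window(tokens: list[str], terms: set[str]) -> Optional[int]:
    """Length of the shortest token window containing every term, or None."""
    best = None
    last_seen: dict[str, int] = {}
    for i, tok in enumerate(tokens):
        if tok in terms:
            last_seen[tok] = i
            if len(last_seen) == len(terms):
                # Shortest window ending at position i that covers all terms.
                width = i - min(last_seen.values()) + 1
                if best is None or width < best:
                    best = width
    return best


def proximity_boost(text: str, query: str) -> float:
    """Illustrative boost: 1.0 when the terms sit side by side, lower as they spread out."""
    terms = set(query.lower().split())
    width = smallest_window(text.lower().split(), terms)
    if width is None:
        return 0.0  # at least one query word is missing entirely
    return len(terms) / width


print(proximity_boost("Taber Studios stained glass", "taber studios"))
print(proximity_boost("studios for rent near taber", "taber studios"))
```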

We try to do a lot so we can understand queries better. Some people will mistype queries, so we try to do a real good spell-check system. A lot of people will type in synonyms, like "automobile" instead of "cars" when the name of the business is Cars R Us. So we try to take the query as a suggestion.

We used to require an absolute perfect match, but over time we’ve gotten better at spelling, morphology, synonyms, all these sorts of things like stemming, where somebody types in “runners” and maybe they meant “runner,” or “running.”
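
[Editor's sketch: treating the query "as a suggestion" can be shown with a toy matcher that applies stemming and a synonym table before comparing words. The suffix list and the `SYNONYMS` table are deliberately crude assumptions; production stemmers and synonym systems are far more careful.]

```python
# Hypothetical synonym table, keyed by stem.
SYNONYMS = {"automobile": {"car"}, "car": {"automobile"}}


def naive_stem(word: str) -> str:
    """Crude suffix stripping for illustration only."""
    for suffix in ("ning", "ners", "ner", "ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def expand_query(query: str) -> list[set[str]]:
    """For each query word, return the set of stems that count as a match."""
    expanded = []
    for word in query.lower().split():
        stem = naive_stem(word)
        expanded.append({stem} | {naive_stem(s) for s in SYNONYMS.get(stem, set())})
    return expanded


def matches(query: str, document: str) -> bool:
    """True when every query word matches some document word after expansion."""
    doc_stems = {naive_stem(w) for w in document.lower().split()}
    return all(alternatives & doc_stems for alternatives in expand_query(query))


print(matches("automobile", "Cars R Us"))
print(matches("runners", "running shoes"))
```

An exact-match engine would miss both examples; the expanded matcher accepts them because "automobile" maps to "car" and "runners" and "running" share a stem.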

Q: How do you actually implement algorithms and changes in those algorithms?

A: Think of search quality almost as if it’s a car. If you ask someone if a car is a machine, they’ll say of course it’s a machine. But really, that machine is composed of a lot of different subcomponents. You’ve got the engine, you’ve got the transmission, each of those subcomponents is a machine as well.

So Google itself, the algorithm can be described as an automated system that takes a query, routes it to the closest data center, sends that out to hundreds of machines, those machines try to report the best results, collect that all together, from all those hundreds of results, what are the best 10. Compute ideal snippets from those best 10, add anything like ads and then ship that back to the user. That entire process you could refer to as an algorithm.
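
[Editor's sketch: the fan-out step described above (many machines each report their best results, and the best 10 are chosen from the combined list) can be illustrated with Python's standard `heapq` module. Each shard here is a hypothetical dictionary of URL scores for a single query; routing, snippets, and ads are left out.]

```python
import heapq

Result = tuple[float, str]  # (score, url)


def search_shard(shard: dict[str, float], k: int) -> list[Result]:
    """Each shard reports its own k best (score, url) pairs."""
    return heapq.nlargest(k, ((score, url) for url, score in shard.items()))


def gather_top_k(shards: list[dict[str, float]], k: int = 10) -> list[str]:
    """Merge the per-shard candidates and keep the k best overall."""
    candidates = [hit for shard in shards for hit in search_shard(shard, k)]
    return [url for _, url in heapq.nlargest(k, candidates)]


shards = [
    {"a.example": 0.9, "b.example": 0.2},
    {"c.example": 0.7},
    {"d.example": 0.4, "e.example": 0.8},
]
print(gather_top_k(shards, k=3))
```

Asking every shard for its own top k is enough to guarantee the overall top k, because no result outside a shard's best k can rank in the global best k.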

But in practice, what tends to happen is you have it subcomposed of a bunch of different, smaller algorithms. So in my group, you have one algorithm or one set of heuristics that would say, “Given this URL, how spammy do we think this URL is?” And we might use dozens of signals—what sort of spammy words do they use, the backlinks to this URL, how spammy do those look. All of those blend together into a master ranking algorithm.
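
[Editor's sketch: blending several spam signals into one score, and then into ranking, might look like the following. The two signal names, the weights, and the demotion formula are all invented for this example; Cutts mentions dozens of signals and does not disclose how they are combined.]

```python
# Made-up weights for two of the signals mentioned in the interview.
WEIGHTS = {"spammy_words": 0.6, "spammy_backlinks": 0.4}


def spam_score(signals: dict[str, float]) -> float:
    """Weighted average of signals, each expected to lie in [0, 1]."""
    total = sum(WEIGHTS.values())
    return sum(WEIGHTS[name] * signals.get(name, 0.0) for name in WEIGHTS) / total


def demote(relevance: float, spam: float) -> float:
    """Scale relevance down as the spam score rises (illustrative formula)."""
    return relevance * (1.0 - spam)


signals = {"spammy_words": 1.0, "spammy_backlinks": 0.5}
score = spam_score(signals)
print(round(score, 2), round(demote(1.0, score), 2))
```

This mirrors the decoupling described in the next answer: a page can score high on relevance and still be pushed down because its spam score is high.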

The trick is decomposing it well. So Amit’s group in search quality is the core ranking group. My group in search quality is Web spam. And those decouple pretty nicely because something can be relevant—you can buy Viagra from this site—and yet it can still be spammy. So the challenge, and what Google has done pretty well, is to say, one group’s job is to return the most comprehensive copy of the Web, as fresh as possible. So this morning, I was doing the query BusinessWeek, and we had crawled BusinessWeek seven minutes ago. You check on another search engine, and maybe it’s been four or five days since they crawled BusinessWeek.com.

Q: That much of a difference? That’s hard to imagine in the case of Microsoft and Yahoo, at least.

A: Seven minutes is about as optimal as you’re going to get. But in general, Google is fresher. Google is not only fresher but more comprehensive. Those are three key things: freshness; comprehensiveness (you want to crawl as much of the Web as possible); and relevance (core ranking and Web spam). And you want the user experience to be really clean. If you go to senior citizens and ask, “What do you like about Google?” they’ll say “Clean, fast, and relevant.”

Q: How does Caffeine, the next-generation search engine now in testing, fit in there?

A: Caffeine was primarily an infrastructural change. That was a huge undertaking over many months from the crawl and indexing team. What they hand to us is almost the same, it’s just much better, much more powerful, much more flexible. We have the ability to index much faster. It’s better along all of these axes.

To most of the world, they probably wouldn’t be able to tell the difference. Maybe just a few search experts can really tell any kind of a difference at all. But from our perspective, it’s almost like upgrading the engine of a car from an old V-4 to a nice V-8.

Q: OK, so tell me how you and your group approach Web spam and how to reduce it.

A: One of the secrets of Web spam is that once you see it, and learn to recognize it, you can’t NOT see it. But intuitions that you might have, like in the old days when you saw a lot of dashes, like cheap-viagra-online-discount-herbal-whatever.com, you might think, OK, that’s a spammy domain, so maybe we train on the number of dashes in a domain to determine spam. But it turns out that doesn’t work so well, because in different cultures, not only are there perfectly valid domains like blueberry-farms.com, but in, say, Germany, they have a lot more dashes on average.
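
[Editor's sketch: the dash-counting intuition is easy to code, which is part of its appeal, and just as easy to break. The rule and threshold below are a deliberately naive assumption used to show the false positive Cutts describes.]

```python
def dash_count(domain: str) -> int:
    """Number of hyphens in the domain name."""
    return domain.count("-")


def naive_is_spam(domain: str, threshold: int = 1) -> bool:
    """Naive rule: any domain with at least `threshold` hyphens is called spam."""
    return dash_count(domain) >= threshold


for domain in ["cheap-viagra-online-discount-herbal-whatever.com",
               "blueberry-farms.com"]:
    print(domain, dash_count(domain), naive_is_spam(domain))
```

Raising the threshold spares blueberry-farms.com, but any fixed cutoff still misjudges legitimate domains in languages where hyphenated names are common.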

Q: What sort of methods do you use? How much is it people looking at things and saying, "Oh, that’s wrong," and how much is more automated?

A: There’s an entire class of really tech-unsavvy people who come to Google and think that Google manually selects all 10 results for every single query and ships it back for hundreds of millions of people every day. Then some people think it’s nothing but computers. And certainly we rely much more on computers and algorithms than any other major search engine or at least historically.

Q: Don’t they all? In what way does Google rely more on algorithms?

A: Yahoo comes from a background where they had editors doing their directory. Yahoo is much more open to having humans in theory edit things. At Google, we do not have the ability to say for this query, make this result.

Q: Or you decide you’re not allowed to do that.

A: Well, the Web spam team does have the ability to say this result is spam, so it should be demoted or penalized or pushed down in some way. But we don’t have any ability to say for this query, “Rob Hof,” we think that this page should rank No. 1. I think that’s a healthy middle ground. You don’t want the ability to do that.

Q: To be clear, you’ve chosen not to be able to do that.

A: That’s correct. We’ve made a deliberate choice that we don’t want to. Because if you think about it, those kinds of choices tend to get stale, it’s not very scalable, it doesn’t work very well in other languages.

But in our group, we vastly rely on algorithms. We try to write new techniques and algorithms. But if someone writes in and says I typed in “Rob Hof” and got porn, they’re really unhappy if the reply is well, we think we’ll have a new algorithm to deal with that in about six to nine months, so check back and the porn may be gone maybe by the end of the year. So we’ll take action. Even then, we try to do it in a scalable way.

Q: How so?

A: The data that gets generated doesn’t just solve the near-term problem. For example, suppose there’s a bad hacker out there and he’s hacked 100 sites. If you had only a manual team, you might not catch all 100. But the data they generate by saying these 67 sites or these 80 sites have been hacked lets us write new classifiers to detect hacked sites—hidden text, various sneaky tricks like that.

Q: What do you mean by hacked in this context?

A: Spammers hack sites like Al Gore’s and other high-traffic sites and build links out to spam sites, and then they’ll monetize 10 cents per user or whatever. I was literally talking to someone who had written his own blogging software and he got hacked, and he was checking out what had happened and this guy had come and deliberately targeted him and found an exploit in this one guy’s piece of code.

So the scary trend is that as PCs are getting better, people aren’t keeping Web server software, such as WordPress and Drupal, up to date, and so they get hacked a lot. So we have to deal with innocent people who have gotten their site hacked and then they’re selling Viagra.

Q: So how do you deal with that?

A: We write detectors. We’ve written classifiers—an algorithm, a heuristic that essentially takes a bunch of signals and tries to say yes, this site has been hacked or no, it hasn’t, and at what level of the directory and things like that.

So for example, if you’ve got a longstanding site and then all of a sudden a brand-new directory pops up and it’s got a bunch of spammy terms like online casinos and debt consolidation, pills, and you’ve seen a bunch of weird links from other sites show, then you think maybe this part of the site has been hacked. So let’s not show this directory of sites to people for a little while until we know whether it’s spam or malware—or maybe scan those other 80 pages for malware as well.
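
[Editor's sketch: the example above (a longstanding site, a brand-new directory, spammy terms, a burst of odd inbound links) can be written as a small heuristic. The term list, every threshold, and the `looks_hacked` function itself are assumptions for illustration; a real classifier would be learned from labeled data rather than hand-tuned.]

```python
from datetime import date

# Illustrative term list drawn from the examples in the interview.
SPAMMY_TERMS = {"casino", "casinos", "debt", "consolidation", "pills", "viagra"}


def looks_hacked(site_first_seen: date, dir_first_seen: date, dir_text: str,
                 new_inbound_links: int, today: date) -> bool:
    """Heuristic: old site, brand-new directory, spammy terms, burst of links."""
    site_is_old = (today - site_first_seen).days > 365   # assumed cutoff
    dir_is_new = (today - dir_first_seen).days < 30      # assumed cutoff
    words = set(dir_text.lower().split())
    spammy_hits = len(words & SPAMMY_TERMS)
    return (site_is_old and dir_is_new
            and spammy_hits >= 2 and new_inbound_links >= 10)


today = date(2009, 10, 4)
print(looks_hacked(date(2001, 1, 1), date(2009, 9, 28),
                   "online casinos debt consolidation pills", 25, today))
```

Because the check is scoped to one directory, only the suspicious part of the site would be withheld while the rest keeps ranking normally.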

One thing we do that I’m not aware of anyone else doing is we have a Webmaster console (webmaster.google.com). We will try to drop you a note. We can’t do it all the time and we can’t do it for every single site. And we try to give you a little piece of concrete text to show you. It’s in our interest to have a clean, well-lit Web that people can trust.

Q: As people have evolved in how they do searches in the past couple of years, has the process by which Google tries to improve search quality changed?

A: A lot of the analytical stuff hasn’t changed that much: the rock-solid stuff, the testbeds. One thing that has changed is we’re more willing to listen to outside feedback, and I think we do a better job of collecting feedback from the Web. Just in the last year or so, we’ve gotten a lot better at paying attention to the outside world, and at communicating with the outside world, like with the Gmail outage yesterday. We had a post-mortem blog post the same day, compared with several days after the last outage seven or eight months ago.

Q: You’re one of the few public figures at Google who also seems to engage directly with users. How did that role develop?

A: I kind of backed into that. Communications is almost my 20% project. Basically, Webmasters ask why does my site not do well. But we have tens of millions of Webmasters, hundreds of millions of users, hundreds of thousands of advertisers and many of them want to talk to someone at Google. So what are scalable ways to reach people? Through the Webmaster forum, blogs, conferences, Twitter answers, chats, videos.

Q: Are there ways you and other folks at Google are trying to avoid the problems of being a big company now—to avoid being the next Microsoft?

A: There’s a lot of people at Google who constantly fight against becoming just another big company. In 2005, Eric Schmidt was asked by John Battelle at Web 2.0 if they’d try to lock in users’ data, and Eric said we would never lock in users’ data. The ability to take your data from Gmail or Google Calendar or Blogger and export it—literally like every single product we have, you can easily export your data or we are working on that. We even have a group, the Data Liberation Front, that tries to liberate the data. The way that you earn loyalty is by making people trust No. 1 that you’re a good company, and if they ever distrust you, they can leave.

Q: OK, the field technically is open, but Google does have this commanding position.

A: I think we’re mindful of that. Battelle wrote a post a long time ago about how Google must feel on top of the world. I remember thinking, is this what it feels like to be on top of the world? Because I feel like we wake up every day and work really, really hard to return the best-quality search results, and we’re fighting every day to do the best thing for our users. So it’s not as if there’s a bunch of gloating Googlers sitting around talking about how great life is.

If you look at some of the newer stuff we’ve done—for example, Android, Chrome, and Wave—the tie that binds those all together is they all have very large components of open-source or openness. So if somebody wants to build their own Wave server, it’s a federated protocol, you don’t have to go to Google.

There’s this real-time initiative that a couple of Googlers worked on, called pubsubhubbub, and Brad, one of the main guys on that, is like Yeah, Google is not the center of this world, you can designate any hub. Also, Chrome is not only open but asks you who you want to use as your default search provider, it doesn’t hard-code it to Google, it uses whatever your default is. It’s the same sort of thing with Android. You might have people developing with Android who have never talked to Google because they can just take the code base and do fun things with it.

Q: How does Google ensure internally that it doesn’t become a slow-moving company? The issue was raised most recently by Anil Dash of Six Apart, who wrote that Chrome OS represented Google’s “Microsoft moment.”

A: “Don’t be evil” [Google’s informal motto] still works. It’s gotten to be a little well-worn outside of Google, and people just assume it’s marketing. But that spirit in my opinion still holds true.

Going back to Anil Dash’s post, when I wrote about it, it got a lot of attention within the company. Easily a dozen people caught me in the hallway to say, thanks for writing that, it’s a reminder of how we want to be.

Q: But that implies there’s some truth to that, which is not necessarily a good thing, right?

A: I think Google was in the mood to have someone rake us over the coals a little bit, and Anil’s post came at the perfect time to remind us our purpose is to make the Web better, our purpose is to return the best search results we can. Our purpose is not to be closed to outside feedback.

If you dig into the specifics, he said Google produces apps for Android before the iPhone, and looking at any smaller point, you could take issue. But his point was not to micro-debate but rather to be more open to feedback and to recommit to this openness, and if someone beats you, it’s because they have more merit, not because you have some advantage on the field. And I think that even though everybody at Google knows that, it was a really helpful reminder. And as far as I can tell, it got support from the highest parts of the company.

Q: Google has a very successful system, and any company in that position has to be careful about changing things. How do you avoid being too careful?

A: There’s the revenue aspect. Search quality doesn’t care about money. We make the results better. If that costs us money, that’s someone else’s problem. That decoupling, the almost church vs. state attitude, has worked very well. Their job is to put relevant ads on whatever we return in the organic editorial position, and they do a fantastic job of that. So for example, if the ads team can’t come up with ads by the time we’ve computed the search results, we just don’t show ads. Our job is simple: Return the best results.

But then the other aspect of your question is how do you explore new things, how do you avoid making mistakes? To avoid mistakes, we have a lot of different checks we run. Alarm bells ring within seconds if our testbeds say, Oh, this set of queries doesn’t return the result we thought it should. In fact, we test all that before we push each new update to our index live. So there’s a bunch of stuff going on in the background where Google is querying itself sometimes to make sure we’re returning the right results.

Q: And what about how Google tries to avoid missing the next big thing?

A: That one’s fun. We try to consciously ask ourselves when does the inflection point happen where it’s better to do something in a new way. So we’ve re-architected our indexing and how we compute results in major ways several times over the last decade because maybe the balance between different types of storage has changed.

We also try to do these at least once a year, just brainstorming sessions: We’ve done Quality Days, where groups of engineers, teams of two or three or four people take a week and they produce a prototype of some really cool quality or user interface feature that they think we should have.

And in a similar way, we often have an exercise where we say, OK, everybody take two queries that we have identified as bad queries, that we think are suboptimal, and brainstorm how can we fix these two queries? The world is your oyster, completely blank slate, it’s OK if it would take a thousand seconds instead of a second, just figure out a way to solve that query. And if you can solve that query in any kind of blue-sky way, after that we’ll figure out how to make it happen in a hundred milliseconds.

I’m sure we still miss some ideas. We try to keep an eye on the outside world, and if there’s anything we miss, we become aware of it. But we try to not be complacent and not rest on our laurels.

Reader Comments

Pete Austin

October 5, 2009 04:24 AM

Re: "You might have people developing with Android who have never talked to Google because they can just take the code base and do fun things with it."

Oh really? See "Google is facing a major backlash from the Android community after sending a cease-and-desist order to the independent developer behind a popular Android mod."
http://arstechnica.com/open-source/news/2009/09/android-community-aims-to-replace-googles-proprietary-bits.ars

Shirley Brady, BW

October 5, 2009 09:27 AM

For more insights from Matt & his colleagues, check out their YouTube videos (http://www.youtube.com/user/GoogleWebmasterHelp) -- and also Matt's blog, where he posted his thoughts on Rob Hof's look at Google: http://www.mattcutts.com/blog/businessweek-articles-on-google

Gil Geraci

October 5, 2009 04:40 PM

Re: Pete Austin comment. "they can just take the code base and do fun things with it" refers to doing ORIGINAL fun things; not to patch proprietary Google apps into a moneymaking bundle for personal profit.

happy guy

October 5, 2009 07:12 PM

interesting read

Todd

October 5, 2009 08:46 PM

One of my very successful blogs was recently "penalized" by google search resulting in a dramatic loss of traffic. I was caught up in one of these algorithms that frankly don't work that well. I have a very popular site with great content. It's extremely frustrating that yahoo and bing find my content acceptable yet google thinks I'm spam. Google certainly does work hard but sometimes the lack of a human factor really stinks for us small guys!

Gtricks

October 6, 2009 08:15 AM

Thanks Matt for all the information. Interesting article :)

Vamp

October 6, 2009 12:18 PM

Google hosts hundreds of spam sites on its Blogspot websites and I get many of them to my e-mail. I have reported them and Google won't close them down. So they may keep it out of search results but they host spammers right on their servers.

Drew

October 8, 2009 11:12 PM


Just wait until Google decides you are not worthy, and then even sites you have with 100% original content on them will not show up in their index, even when you type the full domain name into their search box.

bob

October 9, 2009 03:48 AM

It's a good thing Google practices the church-and-state rule of separation.

RateBrain.com

October 12, 2009 05:22 AM

Great to see Google be so open and willing to share how things work!

Al

October 19, 2009 07:59 AM

"Don’t be evil" they cant use that point anymore, when we think Privacy and data collecting empire, many countries in Europe are warning about Google tools, ex. now BSI a Government department in Germany for IT/Internet security is warning Companies and user to to use :

Google Wave
email
instant messenger
and they say other Google tools is a security risk and with Privacy issues.

John Nagle

October 22, 2009 12:03 AM

Google's spam filtering can't get too good. Google's business model depends on users clicking on their ads. When a Google search leads directly to what the user wanted, Google makes no money.
If Google treated ad-heavy pages from unknown businesses as web spam, the user experience would improve, but Google's revenue would go down.

To demonstrate this, search Google for "London hotels". Most of the results will be for ad-heavy reseller sites, not hotels themselves. Or try "catastrophic health insurance", which leads almost entirely to resellers, not insurers.

Google doesn't use information about the quality of the business behind the site. That's why it's so easy to spam Google search. Remember, "On the Internet, nobody knows if you're a dog".

Carter Cole

November 30, 2009 10:57 PM

i thought id just line these up and knock them down

Pete Austin - try reading the article about app developers leaving iPhone like rats from sinking ship (and other made about modding their stuff)

Todd - i wish you put your site id like to look at it. why were you penalized? the small guys are helped by algorithms yahoo is what screws the small guys(be cause you will never get noticed so no "personal touch" will help) and "fixes" results for the big queries

Vamp - the saying for the web should be if you build it spam will come. blogger is a great service and so spammers use it as such. they dont want competitors to kill your blog by flagging it as spam so it can take time to detect and remove all that trash but i assure you if its really bad they find it pretty fast

Drew - ive had my blog blog.cartercole.com for 105 days here are my organic search numbers google 355,bing 8,search 5,aol 1,cnn 1 im not making money off it but i dont worry about my site getting penalized because i follow the "dont be evil" what are your site doing that they suck so bad you dont even rank for their url? time to hire an SEO?

John Nagle - that just doesn't jive... i read the other day that googles "im feeling lucky" must cost them millions because it jumps to the first page and ads never get shown. your crazy but i think it may be a lack of information... do you have an adwords account? they do create inflation in selling those positions but a ton of queries return no ads at all its a question of SEO. most hotels cant compete with a more comprehensive hotel reseller site that has an extensive and diverse set of backlinks that give most users a better experience than hotel sites

id love to hear yall opinions back
you should follow me on twitter im @cartercole

thanks!

About

BusinessWeek writers Peter Burrows, Cliff Edwards, Olga Kharif, Aaron Ricadela, Douglas MacMillan, and Spencer Ante dig behind the headlines to analyze what’s really happening throughout the world of technology. One of the first mainstream media tech blogs, Tech Beat covers everything from tech bellwethers like Apple, Google, and Intel and emerging new leaders such as Facebook to new technologies, trends, and controversies.
