Matt Cutts: How Google Deals With Web Spam
Posted by: Rob Hof on October 04, 2009
It’s up to Matt Cutts and his team at Google to keep search results as free as possible from Web spam, those pages full of Viagra ads or even malware. A 10-year veteran of the company, he got into this online underworld after working on the first version of Google’s family filter, SafeSearch.
Cutts’ other job thrusts him in the spotlight almost as much as CEO Eric Schmidt and cofounders Sergey Brin and Larry Page: He’s essentially Google’s ambassador to Webmasters, the folks who operate Web sites.
In a recent interview for my story on how Google’s trying to stay ahead of rivals in search, Cutts provided insight not only into how Google tries to reduce Web spam but also into the search quality process at large. This is the last of a four-part series with Google search quality leaders that began with search chief Udi Manber, Google Fellow and ranking chief Amit Singhal, and Scott Huffman, head of the search quality evaluation unit.
Q: To step back a bit, can you broadly describe the process by which Google ensures search quality, especially behind the scenes?
A: We try to be a balance between relatively analytical and a bit of serendipity, like “a user complained about this,” or an engineer hit a problem as they were doing their search. If someone did 15 queries in a row and never clicked on the results and eventually left, that may be the sort of thing where you dig in and say, well, did we have horrible results? Were they looking for a picture and we never returned a picture?
There’s a lot of different ways we gather all that data to identify a problem. Once you’ve identified a problem, that’s when the fun starts, because you can brainstorm a little bit. It can be serendipitous, a lot of feedback from the outside world. But a lot of it is analytical. So for example, you can look at bad sessions—multiple repeated queries, nobody clicked.
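As a rough illustration of the "bad session" signal Cutts describes (repeated queries, no clicks), here is a minimal Python sketch. The `Session` record, the function names, and the three-query threshold are all assumptions made for this example, not Google's actual log format or criteria.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    """Hypothetical log record: queries a user issued and results they clicked."""
    queries: list = field(default_factory=list)
    clicks: list = field(default_factory=list)


def is_bad_session(session, min_queries=3):
    """Flag a session with several queries in a row and no clicks at all.

    min_queries is an assumed threshold chosen only for illustration.
    """
    return len(session.queries) >= min_queries and not session.clicks


def bad_sessions(sessions, min_queries=3):
    """Return the sessions that might deserve a closer look."""
    return [s for s in sessions if is_bad_session(s, min_queries)]
```

A session with three reformulated queries and no clicks would be flagged, while a single query followed by a click would not.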
Q: Do you have programs out there tracking that?
A: Yeah. Over time, we’ve built up a lot of evaluation metrics. So you could have query sets where you go if I do this query, I expect to get this result back. And if I don’t get that result back, then maybe I need to do some debugging: Has the Web site gone offline, maybe they got hacked, or did we make some change that made things break? So a lot of it can be just identifying when things used to work well and then didn’t work as well.
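The query-set idea ("if I do this query, I expect to get this result back") resembles a regression test. The sketch below is one way it might look. The `check_query_set` function, the `fake_search` stand-in, and the `.example` URLs are hypothetical, and nothing here reflects how Google's internal evaluation tools are actually built.

```python
def check_query_set(search, expectations, top_k=10):
    """Return (query, expected_url) pairs whose expected URL is missing from the top_k results.

    `search` is any callable mapping a query string to a ranked list of URLs.
    """
    failures = []
    for query, expected_url in expectations.items():
        if expected_url not in search(query)[:top_k]:
            failures.append((query, expected_url))
    return failures


# Stand-in for a real search backend: a fixed table of ranked results.
FAKE_INDEX = {
    "taber studios": ["taberstudios.example", "artfair.example"],
    "hearst castle": ["travelblog.example"],
}


def fake_search(query):
    """Look the query up in the fixed table; unknown queries return no results."""
    return FAKE_INDEX.get(query, [])
```

Calling `check_query_set(fake_search, {"taber studios": "taberstudios.example", "hearst castle": "hearstcastle.example"})` returns `[("hearst castle", "hearstcastle.example")]`, the one query that would need debugging.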
And there’s a lot of room for individual engineers to just complain. We have a quality mailing list within Google, and with 20,000 employees at Google, there’s plenty of feedback.
Anytime I go to an arts festival, I walk down the aisles, look at the stained glass and the paintings. So I just write down all the Web sites as I go down the aisles. Like will “Taber Studios” bring back Taber Studios? I always keep a little notebook with me, and I go back and see if I type in that site, does it appear on Google? My wife gets a little tired of it, but when we go on vacation, like stop by Hearst Castle, you got the Chamber of Commerce brochure, and it’s just a list of Web sites, and I go, perfect, type in the business names (into Google) and see whether I get all these URLs. So it comes a lot from anecdotal stuff.
Q: Once you’ve got those leads, then what?
A: So once you have that, if there’s an address on this page, why couldn’t we return it or show a map? Trying to find new and different signals that will return that site can be tough. Sometimes it’s just tweaking our existing system, like if the words of the business are in close proximity, give it a little more weight.
We try to do a lot so we can understand queries better. Some people will mistype queries, so we try to do a real good spell-check system. A lot of people will type in synonyms, like "automobile" instead of "cars" when the name of the business is Cars R Us. So we try to take the query as a suggestion.
We used to require an absolute perfect match, but over time we’ve gotten better at spelling, morphology, synonyms, all these sorts of things like stemming, where somebody types in “runners” and maybe they meant “runner,” or “running.”
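To make "take the query as a suggestion" concrete, here is a toy Python sketch of stemming plus synonym expansion. The suffix list, the hand-written `SYNONYMS` table, and both function names are assumptions for illustration. Production stemmers and synonym systems are far more sophisticated, and this is not a description of Google's.

```python
# Hypothetical, hand-written synonym table keyed by stem.
SYNONYMS = {
    "automobile": {"car"},
    "car": {"automobile"},
}


def crude_stem(word):
    """Strip a few common English suffixes. Much cruder than a real stemmer."""
    for suffix in ("ers", "ing", "er", "s"):
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= 3:
            # Undo consonant doubling: "runn" -> "run".
            if stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word


def expand_query(query):
    """Map each query term to the set of forms a matcher could accept."""
    expanded = []
    for term in query.lower().split():
        stem = crude_stem(term)
        expanded.append({term, stem} | SYNONYMS.get(stem, set()))
    return expanded
```

With this sketch, "runners", "runner," and "running" all reduce to "run", and a query for "automobile" also accepts "car", the stem of the "Cars" in Cars R Us.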
Q: How do you actually implement algorithms and changes in those algorithms?
A: Think of search quality almost as if it’s a car. If you ask someone if a car is a machine, they’ll say of course it’s a machine. But really, that machine is composed of a lot of different subcomponents. You’ve got the engine, you’ve got the transmission, each of those subcomponents is a machine as well.
So Google itself, the algorithm can be described as an automated system that takes a query, routes it to the closest data center, sends that out to hundreds of machines, those machines try to report the best results, collect that all together, from all those hundreds of results, what are the best 10. Compute ideal snippets from those best 10, add anything like ads and then ship that back to the user. That entire process you could refer to as an algorithm.
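The fan-out-and-merge step in that description (hundreds of machines each report their best results, then the best 10 are picked) can be sketched in a few lines of Python. The shard callables, the `(score, url)` tuples, and the function names are assumptions for illustration. Real serving systems also handle timeouts, snippets, and ads, none of which is modeled here.

```python
import heapq


def merge_top_results(shard_results, k=10):
    """Merge per-machine lists of (score, url) and keep the k highest-scoring hits."""
    all_hits = (hit for hits in shard_results for hit in hits)
    return heapq.nlargest(k, all_hits, key=lambda hit: hit[0])


def answer_query(query, shards, k=10):
    """Send the query to every shard (a callable here), then merge what comes back."""
    return merge_top_results([shard(query) for shard in shards], k)
```

Each shard only needs to return its own best candidates, so the merge step sees a few hundred short lists rather than the whole index.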
But in practice, what tends to happen is you have it subcomposed of a bunch of different, smaller algorithms. So in my group, you have one algorithm or one set of heuristics that would say, “Given this URL, how spammy do we think this URL is?” And we might use dozens of signals—what sort of spammy words do they use, the backlinks to this URL, how spammy do those look. All of those blend together into a master ranking algorithm.
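One common way to blend several signals into a single "how spammy is this URL" number is a weighted sum passed through a logistic function. The sketch below uses that technique purely as an illustration. The signal names, weights, and bias are invented, and the interview does not say how Google actually combines its signals.

```python
import math

# Hypothetical signals and weights, invented for illustration only.
WEIGHTS = {
    "spammy_word_fraction": 4.0,      # share of page words that look spammy
    "spammy_backlink_fraction": 3.0,  # share of backlinks that look spammy
}
BIAS = -3.0


def spam_score(signals):
    """Blend per-URL signals (each in [0, 1]) into a score between 0 and 1."""
    z = BIAS + sum(WEIGHTS[name] * signals.get(name, 0.0) for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))
```

A URL with no spammy words or backlinks scores close to 0, and one where both signals are at their maximum scores close to 1.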
The trick is decomposing it well. So Amit’s group in search quality is the core ranking group. My group in search quality is Web spam. And those decouple pretty nicely because something can be relevant—you can buy Viagra from this site—and yet it can still be spammy. So the challenge, and what Google has done pretty well, is to say, one group’s job is to return the most comprehensive copy of the Web, as fresh as possible. So this morning, I was doing the query BusinessWeek, and we had crawled BusinessWeek seven minutes ago. You check on another search engine, and maybe it’s been four or five days since they crawled BusinessWeek.com.
Q: That much of a difference? That’s hard to imagine in the case of Microsoft and Yahoo, at least.
A: Seven minutes is about as optimal as you’re going to get. But in general, Google is fresher. Google is not only fresher but more comprehensive. Those are three key things: freshness; comprehensiveness (you want to crawl as much of the Web as possible); and relevance (core ranking and Web spam). And you want the user experience to be really clean. If you go to senior citizens and ask, “What do you like about Google?” they’ll say “Clean, fast, and relevant.”
Q: How does Caffeine, the next-generation search engine now in testing, fit in there?
A: Caffeine was primarily an infrastructural change. That was a huge undertaking over many months from the crawl and indexing team. What they hand to us is almost the same, it’s just much better, much more powerful, much more flexible. We have the ability to index much faster. It’s better along all of these axes.
To most of the world, they probably wouldn’t be able to tell the difference. Maybe just a few search experts can really tell any kind of a difference at all. But from our perspective, it’s almost like upgrading the engine of a car from an old V-4 to a nice V-8.
Q: OK, so tell me how you and your group approach Web spam and how to reduce it.
A: One of the secrets of Web spam is that once you see it, and learn to recognize it, you can’t NOT see it. But intuitions that you might have, like in the old days when you saw a lot of dashes, like cheap-viagra-online-discount-herbal-whatever.com, you might think, OK, that’s a spammy domain, so maybe we train on the number of dashes in a domain to determine spam. But it turns out that doesn’t work so well, because in different cultures, not only are there perfectly valid domains like blueberry-farms.com, but in, say, Germany, they have a lot more dashes on average.
Q: What sort of methods do you use? How much is it people looking at things and saying, "Oh, that’s wrong," and how much is more automated?
A: There’s an entire class of really tech-unsavvy people who come to Google and think that Google manually selects all 10 results for every single query and ships it back for hundreds of millions of people every day. Then some people think it’s nothing but computers. And certainly we rely much more on computers and algorithms than any other major search engine or at least historically.
Q: Don’t they all? In what way does Google rely more on algorithms?
A: Yahoo comes from a background where they had editors doing their directory. Yahoo is much more open to having humans in theory edit things. At Google, we do not have the ability to say for this query, make this result.
Q: Or you decide you’re not allowed to do that.
A: Well, the Web spam team does have the ability to say this result is spam, so it should be demoted or penalized or pushed down in some way. But we don’t have any ability to say for this query, “Rob Hof,” we think that this page should rank No. 1. I think that’s a healthy middle ground. You don’t want the ability to do that.
Q: To be clear, you’ve chosen not to be able to do that.
A: That’s correct. We’ve made a deliberate choice that we don’t want to. Because if you think about it, those kinds of choices tend to get stale, it’s not very scalable, it doesn’t work very well in other languages.
But in our group, we vastly rely on algorithms. We try to write new techniques and algorithms. But if someone writes in and says I typed in “Rob Hof” and got porn, they’re really unhappy if the reply is well, we think we’ll have a new algorithm to deal with that in about six to nine months, so check back and the porn may be gone maybe by the end of the year. So we’ll take action. Even then, we try to do it in a scalable way.
Q: How so?
A: The data that gets generated doesn’t just solve the near-term problem. For example, suppose there’s a bad hacker out there and he’s hacked 100 sites. If you had only a manual team, you might not catch all 100. But the data they generate by saying these 67 sites or these 80 sites have been hacked lets us write new classifiers to detect hacked sites—hidden text, various sneaky tricks like that.
Q: What do you mean by hacked in this context?
A: Spammers hack sites like Al Gore’s and other high-traffic sites and build links out to spam sites, and then they’ll monetize 10 cents per user or whatever. I was literally talking to someone who had written his own blogging software and he got hacked, and he was checking out what had happened and this guy had come and deliberately targeted him and found an exploit in this one guy’s piece of code.
So the scary trend is that as PCs are getting better, people aren’t keeping Web server software, such as WordPress and Drupal, up to date, and so they get hacked a lot. So we have to deal with innocent people who have gotten their site hacked and then they’re selling Viagra.
Q: So how do you deal with that?
A: We write detectors. We’ve written classifiers—an algorithm, a heuristic that essentially takes a bunch of signals and tries to say yes, this site has been hacked or no, it hasn’t, and at what level of the directory and things like that.
So for example, if you’ve got a longstanding site and then all of a sudden a brand-new directory pops up and it’s got a bunch of spammy terms like online casinos and debt consolidation, pills, and you’ve seen a bunch of weird links from other sites show up, then you think maybe this part of the site has been hacked. So let’s not show this directory of sites to people for a little while until we know whether it’s spam or malware, or maybe scan those other 80 pages for malware as well.
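The hacked-directory example above combines four signals: an old site, a brand-new directory, spammy terms, and a burst of odd inbound links. A hand-written heuristic along those lines might look like the following Python sketch. The function name, parameters, and every threshold are assumptions for illustration. The interview describes the signals but not how they are measured or weighted.

```python
# Terms taken from the interview's example.
SPAMMY_TERMS = ("online casino", "debt consolidation", "pills")


def looks_hacked(site_age_days, directory_age_days, directory_text, new_inbound_links,
                 min_site_age=365, max_directory_age=30, min_term_hits=2, min_links=5):
    """Heuristic: old site + brand-new directory + spammy terms + burst of new links.

    All thresholds are assumed values chosen only for illustration.
    """
    text = directory_text.lower()
    term_hits = sum(term in text for term in SPAMMY_TERMS)
    return (
        site_age_days >= min_site_age
        and directory_age_days <= max_directory_age
        and term_hits >= min_term_hits
        and new_inbound_links >= min_links
    )
```

Requiring all four conditions keeps the rule conservative: a new directory about stained glass on an old site, or spammy text on a site that was always spammy, would not be flagged as hacked.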
One thing we do that I’m not aware of anyone else doing is we have a Webmaster console (webmaster.google.com). We will try to drop you a note. We can’t do it all the time and we can’t do it for every single site. And we try to give you a little piece of concrete text to show you. It’s in our interest to have a clean, well-lit Web that people can trust.
Q: As people have evolved in how they do searches over the past couple of years, has the process by which Google tries to improve search quality changed?
A: A lot of the analytical stuff hasn’t changed that much: the rock-solid stuff, the testbeds. One thing that has changed is we’re more willing to listen to outside feedback and I think we do a better job of collecting feedback from the Web. Just in the last year or so, we’ve gotten a lot better at paying attention to the outside world. And communicating with the outside world, like with the Gmail outage yesterday. We had a post-mortem blog post the same day, compared with several days after the last outage seven or eight months ago.
Q: You’re one of the few public figures at Google who also seems to engage directly with users. How did that role develop?
A: I kind of backed into that. Communications is almost my 20% project. Basically, Webmasters ask why does my site not do well. But we have tens of millions of Webmasters, hundreds of millions of users, hundreds of thousands of advertisers and many of them want to talk to someone at Google. So what are scalable ways to reach people? Through the Webmaster forum, blogs, conferences, Twitter answers, chats, videos.
Q: Are there ways you and other folks at Google are trying to avoid the problems of being a big company now—to avoid being the next Microsoft?
A: There’s a lot of people at Google who constantly fight against becoming just another big company. In 2005, Eric Schmidt was asked by John Battelle at Web 2.0 if they’d try to lock in users’ data, and Eric said we would never lock in users’ data. The ability to take your data from Gmail or Google Calendar or Blogger and export it—literally like every single product we have, you can easily export your data or we are working on that. We even have a group, the Data Liberation Front, that tries to liberate the data. The way that you earn loyalty is by making people trust No. 1 that you’re a good company, and if they ever distrust you, they can leave.
Q: OK, the field technically is open, but Google does have this commanding position.
A: I think we’re mindful of that. Battelle wrote a post a long time ago about how Google must feel on top of the world. I remember thinking, is this what it feels like to be on top of the world? Because I feel like we wake up every day and work really, really hard to return the best-quality search results, and we’re fighting every day to do the best thing for our users. So it’s not as if there’s a bunch of gloating Googlers sitting around talking about how great life is.
If you look at some of the newer stuff we’ve done—for example, Android, Chrome, and Wave—the tie that binds those all together is they all have very large components of open-source or openness. So if somebody wants to build their own Wave server, it’s a federated protocol, you don’t have to go to Google.
There’s this real-time initiative that a couple of Googlers worked on, called pubsubhubbub, and Brad, one of the main guys on that, is like Yeah, Google is not the center of this world, you can designate any hub. Also, Chrome is not only open but asks you who you want to use as your default search provider, it doesn’t hard-code it to Google, it uses whatever your default is. It’s the same sort of thing with Android. You might have people developing with Android who have never talked to Google because they can just take the code base and do fun things with it.
Q: How does Google ensure internally that it doesn’t become a slow-moving company? The issue was raised most recently by Anil Dash of Six Apart, who wrote that Chrome OS represented Google’s “Microsoft moment.”
A: “Don’t be evil” [Google’s informal motto] still works. It’s gotten to be a little well-worn outside of Google, and people just assume it’s marketing. But that spirit in my opinion still holds true.
Going back to Anil Dash’s post, when I wrote about it, it got a lot of attention within the company. Easily a dozen people caught me in the hallway to say, thanks for writing that, it’s a reminder of how we want to be.
Q: But that implies there’s some truth to that, which is not necessarily a good thing, right?
A: I think Google was in the mood to have someone rake us over the coals a little bit, and Anil’s post came at the perfect time to remind us our purpose is to make the Web better, our purpose is to return the best search results we can. Our purpose is not to be closed to outside feedback.
If you dig into the specifics, he said Google produces apps for Android before the iPhone, and looking at any smaller point, you could take issue. But his point was not to micro-debate but rather to be more open to feedback and to recommit to this openness, and if someone beats you, it’s because they have more merit, not because you have some advantage on the field. And I think that even though everybody at Google knows that, it was a really helpful reminder. And as far as I can tell, it got support from the highest parts of the company.
Q: Google has a very successful system, and any company in that position has to be careful about changing things. How do you avoid being too careful?
A: There’s the revenue aspect. Search quality doesn’t care about money. We make the results better. If that costs us money, that’s someone else’s problem. That decoupling, the almost church vs. state attitude, has worked very well. Their job is to put relevant ads on whatever we return in the organic editorial position, and they do a fantastic job of that. So for example, if the ads team can’t come up with ads by the time we’ve computed the search results, we just don’t show ads. Our job is simple: Return the best results.
But then the other aspect of your question is how do you explore new things, how do you avoid making mistakes? To avoid mistakes, we have a lot of different checks we run. Alarm bells ring within seconds if our testbeds say, Oh, this set of queries doesn’t return the result we thought it should. In fact, we test all that before we push each new update to our index live. So there’s a bunch of stuff going on in the background where Google is querying itself sometimes to make sure we’re returning the right results.
Q: And what about how Google tries to avoid missing the next big thing?
A: That one’s fun. We try to consciously ask ourselves when does the inflection point happen where it’s better to do something in a new way. So we’ve re-architected our indexing and how we compute results in major ways several times over the last decade because maybe the balance between different types of storage has changed.
We also try to do these at least once a year, just brainstorming sessions: We’ve done Quality Days, where groups of engineers, teams of two or three or four people take a week and they produce a prototype of some really cool quality or user interface feature that they think we should have.
And in a similar way, we often have an exercise where we say, OK, everybody take two queries that we have identified as bad queries, that we think are suboptimal, and brainstorm how can we fix these two queries? The world is your oyster, completely blank slate, it’s OK if it would take a thousand seconds instead of a second, just figure out a way to solve that query. And if you can solve that query in any kind of blue-sky way, after that we’ll figure out how to make it happen in a hundred milliseconds.
I’m sure we still miss some ideas. We try to keep an eye on the outside world, and if there’s anything we miss, we become aware of it. But we try to not be complacent and not rest on our laurels.