Earlier this week, Jay Parikh, the vice president for infrastructure engineering at Facebook, walked a dozen or so reporters through some of the social media site’s inner workings. Give any Facebook engineer enough time, and they’ll start boasting about the site’s level of geek endowment. “Is your Hadoop cluster bigger than Yahoo’s?” a reporter asks. “Yes. It is,” Parikh replies with a wink and a laugh.
As of its latest count, Facebook claimed 955 million users. What makes Facebook unique in the annals of Web history is the amount of time all those people spend on the site (at least a few hours per month) and the frequency with which they return to it (just about every day). To cope with the update-hungry masses, Facebook has pushed the limits of computer science in a few areas. In particular, the company excels at sucking in, analyzing, and sharing huge volumes of data at record speed, so that users get fresh, up-to-date pages every time they visit Facebook.
Here’s a recap of some of the latest and greatest figures that capture the staggering volume of data Facebook handles:
• Every day, people share 2.5 billion different items (which includes such things as status updates, wall posts, photos, videos, and comments).
• People “Like” 2.7 billion things every day. This is what’s technically referred to as an Advertiser’s Wet Dream.
• Remember all those photo-sharing sites around before Facebook? No? Well, that’s because people upload 300 million photos to their Facebook pages each day.
• As for that big, old Hadoop cluster that Parikh celebrated? Well, Facebook’s largest cluster—or collection of data-center computers—can handle more than 100 petabytes of information. One petabyte is the equivalent of about 250 billion pages of text.
• Facebook has a homegrown system called Hive that it uses to collect and keep track of all its data. Every 30 minutes, the Hive system combs through 105 terabytes of data. More than 500 terabytes are sucked into the database each day.
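Those figures are easy to sanity-check with some back-of-envelope arithmetic. A minimal sketch in Python, assuming roughly 4,000 bytes per page of plain text (an assumed figure, not stated by Facebook):

```python
# Back-of-envelope check of the storage figures quoted above.
# Assumption (not from the article): one page of plain text ~= 4,000 bytes.

BYTES_PER_PAGE = 4_000      # assumed average size of a page of text
PETABYTE = 10**15           # bytes in a petabyte (decimal definition)
TERABYTE = 10**12           # bytes in a terabyte

# 1 PB expressed as pages of text -- about 250 billion pages,
# which matches the article's comparison.
pages_per_petabyte = PETABYTE // BYTES_PER_PAGE
print(f"{pages_per_petabyte:,} pages per petabyte")  # 250,000,000,000

# Hive combing through 105 TB every 30 minutes works out to an
# effective sustained scan rate of roughly 58 GB per second.
scan_rate_gb_per_s = (105 * TERABYTE) / (30 * 60) / 10**9
print(f"~{scan_rate_gb_per_s:.0f} GB/s sustained scan rate")
```

The decimal definitions (10^15 bytes per petabyte) are used here; binary units (2^50) would shift the page count by about 13 percent without changing the order of magnitude.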
Is Facebook a good steward of all this information? It certainly claims to be.
Most companies that deal with a lot of data use a divide-and-conquer approach. They create a bunch of different databases and give different internal business groups access to the pools of information. By contrast, Facebook gives the entire company access to a shared infrastructure. Basically, it wants all employees—be they in advertising, engineering, or marketing—to have access to a complete version of the site and its information. “Companies usually take the easy way out and say, ‘We will separate this team from that team,’” Parikh says. “That has been unacceptable to us.”
Many of the tests these groups conduct are run on “anonymized” data (in which people’s identities have been stripped out), Parikh says, adding that the company has a “zero-tolerance policy” for abuse of user information.