Why Digg Digs Cassandra
Digg, the San Francisco-based social media company, is dropping the popular MySQL database software and instead betting on Cassandra, another open-source database. It's the latest sign of the growing popularity of Cassandra, which was developed (and open sourced) by Facebook.
Facebook has since backed off Cassandra, but Digg plans to open source all its work on Cassandra and champion the software's development and adoption. In a post on the Digg blog, John Quinn, Digg's vice-president of engineering, says: "Perhaps our most significant infrastructure change is abandoning MySQL in favor of a NoSQL alternative. To someone like me, who's been building systems almost exclusively on relational databases for almost 20 years, this feels like a bold move." Quinn says it's increasingly difficult to get the proper performance from a relational database.
What's Wrong with MySQL?
"Our primary motivation for moving away from MySQL is the increasing difficulty of building a high-performance, write-intensive application on a data set that is growing quickly, with no end in sight," says Quinn. "This growth has forced us into horizontal and vertical partitioning strategies that have eliminated most of the value of a relational database, while still incurring all the overhead."
Digg is just the latest high-profile convert to the "NoSQL" world. Instead of relying on relational databases such as MySQL, many companies that deal in near-real-time information are opting for new kinds of databases, most of them open source, such as Cassandra and CouchDB.
Cassandra is roughly the open-source equivalent of Google's Bigtable. Facebook built it to solve the problem of searching message inboxes. The company needed something that was fast, reliable, and able to handle read and write requests at the same time. Because messaging is heavy at Facebook, the company requires a system that can not only store data but also return results for search queries at blazingly fast speeds.
Stu Hood, the technical lead for the search team in the Email & Apps division of Rackspace, recently said: "I think distributed databases solve a problem that a lot of companies with large data sets have had to solve independently in the past. … Cassandra has an approach that hybridizes the Bigtable and Dynamo models, where a lot of its competitors chose to take one path or the other. Over the Bigtable clones, Cassandra has huge high-availability advantages, and no single point of failure (possible because of the eventually consistent approach). When compared with the Dynamo adherents, Cassandra has the advantage of a more advanced data model, allowing for a single 'row' to contain billions of column/value pairs—enough to fill a machine. You also get efficient range queries for the top level key, even within your values."
In a post last year, contributing writer Gary Orenstein pointed out that thanks to these attributes, Cassandra has potential applications beyond inbox search that include "recommendation engines, targeted advertising, and content search, particularly when you combine many concurrent inputs and output requests to the same data set."
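To make the "wide row" idea in Hood's description more concrete, here is a minimal in-memory sketch in Python. It is not Cassandra's API or storage format; the class name WideRow, the row key "story:123", and the timestamp column names are all hypothetical, chosen only for illustration. The sketch assumes that columns inside a row are kept sorted by name, which is what makes a range query over part of one row cheap.

```python
import bisect
from collections import defaultdict


class WideRow:
    """Toy model of one 'row': many column/value pairs, sorted by column name.

    This is an illustration only, not how Cassandra is implemented.
    """

    def __init__(self):
        self._names = []    # column names, kept in sorted order
        self._values = {}   # column name -> value

    def insert(self, name, value):
        # Keep the name list sorted so that slices can use binary search.
        if name not in self._values:
            bisect.insort(self._names, name)
        self._values[name] = value

    def slice(self, start, end):
        # Return every (name, value) pair with start <= name <= end.
        lo = bisect.bisect_left(self._names, start)
        hi = bisect.bisect_right(self._names, end)
        return [(n, self._values[n]) for n in self._names[lo:hi]]


# The "top-level key" selects a row; each row can grow very wide.
store = defaultdict(WideRow)

# Hypothetical example: one row per story, one column per digg,
# with the column name being the time of the digg.
row = store["story:123"]
row.insert("2010-03-09T10:05", "carol")
row.insert("2010-03-09T10:01", "alice")
row.insert("2010-03-09T10:03", "bob")

# Range query within a single row: diggs between 10:00 and 10:03.
early_diggs = row.slice("2010-03-09T10:00", "2010-03-09T10:03")
print(early_diggs)
```

Because writes only touch one row and reads are a binary search plus a contiguous scan, a structure like this suits the many-concurrent-reads-and-writes workloads the article describes, though a real distributed store also has to handle replication, persistence, and consistency, which this sketch ignores.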
Digg is a prototypical application. The company tells me that it gets:
40 million visitors a month, who in turn account for roughly 500 million page views a month
20,000 daily submissions
It also generates:
170,000 daily Diggs
As these numbers suggest, there is a great deal of interaction between the system and its users. No wonder Digg digs Cassandra!