Cassandra is fast emerging as one of the key NoSQL databases. While we often say that the point of NoSQL is to offer more choice than an “RDBMS” hammer for every nail, there are practical reasons why a small number of stack technologies gain dominance while others circle on the sidelines.
Cassandra has already ticked many of the boxes needed to shoot it into the stratosphere as a widely used, default database platform. This is especially true in the web world, where high scalability, high availability, being open source and being proven by a bigger fish all matter. Specifically, Cassandra has:
- The ability to scale across many nodes
- The ability to scale to many hundreds of gigabytes of data
- High availability: losing a node doesn’t take down the cluster, and nodes can be provisioned online with automatic data distribution and data copy. It is also decentralized (every node is the same as every other, so there is no single point of failure).
- Bigtable-like “Column Families” (more advanced schema control than a DHT)
- Dynamo-like eventual consistency (not a plus, but a trade-off required for scalability), log-based recovery, and the ability to write either asynchronously or synchronously
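To make the decentralization and data distribution points above a little more concrete, here is a minimal Python sketch of a Dynamo-style ring. It is a toy under stated assumptions, not Cassandra's actual partitioner or replication code: the `Ring` class, the `token` helper, the node names and the replication factor of 3 are all made up for illustration. The idea it shows is simply that every node is a peer on a hash ring, each key is copied to several successive nodes, and losing one node still leaves the other copies reachable.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a position on the ring (MD5 is used only as a stable hash)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy decentralized ring: every node is a peer, there is no master (hypothetical)."""

    def __init__(self, nodes, replication_factor=3):
        self.replication_factor = replication_factor
        # Each node owns the ring position given by the hash of its name.
        self.tokens = sorted((token(n), n) for n in nodes)

    def replicas(self, key):
        """Return the nodes holding copies of `key`, walking clockwise from its token."""
        if not self.tokens:
            return []
        positions = [t for t, _ in self.tokens]
        # First node clockwise of the key's token, wrapping around the ring.
        start = bisect.bisect_right(positions, token(key)) % len(self.tokens)
        count = min(self.replication_factor, len(self.tokens))
        return [self.tokens[(start + i) % len(self.tokens)][1] for i in range(count)]

    def remove(self, node):
        """Simulate losing a node; the remaining peers keep serving."""
        self.tokens = [(t, n) for t, n in self.tokens if n != node]


ring = Ring(["node-a", "node-b", "node-c", "node-d", "node-e"])
before = ring.replicas("user:42")   # three peers hold this key
ring.remove(before[0])              # the first replica dies
after = ring.replicas("user:42")    # the two survivors still hold it
print("before:", before)
print("after: ", after)
```

In this sketch the two surviving replicas stay in the replica set after the failure and the next node on the ring joins them, which is the intuition behind "losing a node doesn't take down the cluster". A real system also has to stream the data to the new replica and reconcile divergent copies, which is where the eventual consistency trade-off comes in.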
Cassandra, if you’re not familiar, was originally built by Facebook as an internal database system to help them scale to their massive data demands. It was then thrown over the wall and made open source, where the community picked it up and ran with it. Cassandra is capable of supporting transaction processing workloads at large scale and has found favor at RackSpace, Twitter, Digg and others.
Interestingly, I understand Facebook forked the code and has continued to develop its own internal version independently of the open source version. The open source Cassandra is now largely developed by RackSpace, where three people work on it full time led by Jonathan Ellis, along with Digg and the community at large. The reasons behind this aren’t entirely clear, but one may assume that Facebook was happy to share its work with the community yet doesn’t have the time or interest to manage the ongoing development of an open source project.
Scale is the primary reason why you would choose a platform like Cassandra. Traditional RDBMSs start to struggle when you want to go beyond one node, and big clusters are currently only really possible using expensive shared disk technology or when targeting specialized analytical workloads (MPP RDBMS). I understand Facebook is running a 150 node Cassandra cluster, and others have 30+ node clusters in production.
What Cassandra most lacks right now (apart from secondary indexes, which I think they are working on) is the backing of a commercial vendor providing product support (RackSpace is not doing this). But I am sure this will be addressed in the near future, with either RackSpace spinning something up or someone like Cloudera adding it to their responsibilities.