Cassandra

Background
Cassandra is an open source distributed database management system and an Apache Software Foundation top-level project. Cassandra bases its distribution design on Amazon's Dynamo and its data model on Google's Bigtable.

Key features of Cassandra
Distributed and Decentralized
Cassandra is distributed, which means that it is capable of running on multiple machines while appearing to users as a unified system. Cassandra is also decentralized, which means that there is no single point of failure: all the nodes in a Cassandra cluster function exactly the same.

Elastic Scalability
Scalability is an architectural feature of a system that can continue serving a greater number of requests with little degradation in performance. Vertical scaling (simply adding more hardware capacity and memory to your existing machine) is the easiest way to achieve this. Horizontal scaling means adding more machines that have all or some of the data on them so that no one machine has to bear the entire burden of serving requests.
Elastic scalability refers to a special property of horizontal scalability. It means that your cluster can seamlessly scale up and scale back down. To do this, the cluster must be able to accept new nodes that can begin participating by getting a copy of some or all of the data and start serving new user requests without major disruption or reconfiguration of the entire cluster. You don't have to restart your process. You don't have to change your application queries. You don't have to manually rebalance the data yourself. Just add another machine; Cassandra will find it and start sending it work.

High Availability and Fault Tolerance
Cassandra is highly available. You can replace failed nodes in the cluster with no downtime, and you can replicate data to multiple data centers to offer improved local performance and prevent downtime if one data center experiences a catastrophe such as fire or flood.

Tuneable Consistency
Cassandra is most accurately described as tuneably consistent, which means it allows you to easily decide the level of consistency you require, in balance with the level of availability.

High Performance

Cassandra was designed specifically from the ground up to take full advantage of multiprocessor/multicore machines, and to run across many dozens of these machines housed in multiple data centers. It scales consistently and seamlessly to hundreds of terabytes. Cassandra has been shown to perform exceptionally well under heavy load. It consistently can show very fast throughput for writes per second on a basic commodity workstation. As you add more servers, you can maintain all of Cassandra's desirable properties without sacrificing performance.

Use Cases for Cassandra
Large Deployments
You probably don't drive a semi-truck to pick up your dry cleaning; semis aren't well suited for that sort of task. Lots of careful engineering has gone into Cassandra's high availability, tuneable consistency, peer-to-peer protocol, and seamless scaling, which are its main selling points. None of these qualities is even meaningful in a single-node deployment, let alone allowed to realize its full potential.
There are, however, a wide variety of situations where a single-node relational database is all we may need. So do some measuring. Consider your expected traffic, throughput needs, and SLAs. There are no hard and fast rules here, but if you expect that you can reliably serve traffic with an acceptable level of performance with just a few relational databases, it might be a better choice to do so, simply because RDBMS are easier to run on a single machine and are more familiar.

Lots of Writes, Statistics, and Analysis
Consider your application from the perspective of the ratio of reads to writes. Cassandra is optimized for excellent throughput on writes.
Many of the early production deployments of Cassandra involve storing user activity updates, social network usage, recommendations/reviews, and application statistics. These are strong use cases for Cassandra because they involve lots of writing with less predictable read operations, and because updates can occur unevenly with sudden spikes. In fact, the ability to handle application workloads that require high performance at significant write volumes with many concurrent client threads is one of the primary features of Cassandra.
According to the project wiki, Cassandra has been used to create a variety of applications, including a windowed time-series store, an inverted index for document searching, and a distributed job priority queue.

Geographical Distribution
Cassandra has out-of-the-box support for geographical distribution of data. You can easily configure Cassandra to replicate data across multiple data centers. If you have a globally deployed application that could see a performance benefit from putting the data near the user, Cassandra could be a great fit.

Evolving Applications
If your application is evolving rapidly and you're in startup mode, Cassandra might be a good fit given its schema-free data model. This makes it easy to keep your database in step with application changes as you rapidly deploy.

Who Is Using Cassandra?
The list of companies using Cassandra is growing. These companies include:
1. Twitter is using Cassandra for analytics.
2. Mahalo uses it for its primary near-time data store.
3. Facebook still uses it for inbox search, though they are using a proprietary fork.
4. Digg uses it for its primary near-time data store.
5. Rackspace uses it for its cloud service, monitoring, and logging.
6. Reddit uses it as a persistent cache.
7. Cloudkick uses it for monitoring statistics and analytics.
8. Ooyala uses it to store and serve near real-time video analytics data.
9. SimpleGeo uses it as the main data store for its real-time location infrastructure.
10. Onespot uses it for a subset of its main data store.
11. Cassandra is also being used by Cisco and Platform64, and is starting to see use at Comcast and bee.tv for personalized television streaming to the Web and to mobile devices.
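
A worked example of tuneable consistency
The trade-off between consistency and availability described under Key features can be made concrete with a little arithmetic. The sketch below assumes the usual Dynamo-style quorum rule, which this text does not state explicitly: with N replicas of each piece of data, a read that waits for R replicas is guaranteed to overlap a write acknowledged by W replicas whenever R + W > N. The function names are hypothetical and are not part of any Cassandra API; they only illustrate how choosing R and W shifts the balance.

```python
def quorum(replication_factor: int) -> int:
    """Smallest number of replicas that forms a majority of N."""
    return replication_factor // 2 + 1


def reads_see_latest_write(n: int, r: int, w: int) -> bool:
    """Assumed quorum rule: read and write replica sets must overlap (R + W > N)."""
    return r + w > n


def describe(n: int, r: int, w: int) -> str:
    """Summarize the consistency/availability balance for one choice of R and W."""
    consistency = "strong" if reads_see_latest_write(n, r, w) else "eventual"
    # A write can still be acknowledged while up to n - w replicas are down,
    # and a read can still be answered while up to n - r replicas are down.
    return (f"N={n} R={r} W={w}: {consistency} consistency, "
            f"writes tolerate {n - w} down, reads tolerate {n - r} down")


if __name__ == "__main__":
    n = 3  # hypothetical replication factor
    for r, w in [(1, 1), (quorum(n), quorum(n)), (1, n)]:
        print(describe(n, r, w))
```

Under these assumptions, reading and writing at a single replica favors availability, requiring all replicas on write favors read speed at the cost of write availability, and quorum on both sides sits in between.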