Cassandra as used by Facebook

Bingwei Wang (bw0338), Si Peng (sp0890), Xiaomeng Zhang (xz0398), Mark Bownes (mb7801), Rob Paton (rp7374) and Farshid Golkarihagh (fg7281) December 15, 2010



Cassandra is a distributed NoSQL database originally developed at Facebook to power inbox search. It was written in Java in 2007 by Avinash Lakshman, who had previously worked on Amazon's Dynamo, and Prashant Malik. In 2008 it was released as open source, and in 2009 it was picked up by Apache as an incubator project; the Apache Incubator is the gateway through which open source projects become Apache software. In 2010 Cassandra was promoted to a top-level Apache project. Cassandra was created to be a fast, scalable and fault-tolerant system, and as such it boasts a number of important features which distinguish it from its competitors. This report outlines these key features, particularly in the areas of architecture, scalability and fault tolerance, and goes beyond this to look at the future of the project.


How Facebook Uses Cassandra

Facebook created Cassandra to power its inbox search, and this is still where it is used today. There are two kinds of search that Facebook allows for: item search and interaction search. Item search is a simple search for keywords: the row key is the user's ID, the super columns are the words that make up the user's messages, and the columns under each word are the identifiers of the individual messages containing that word. Interaction search is used to search for a name and find all messages exchanged between the user and the searched-for person. As with keyword searching, the row key is the user's ID, but here the super columns are the other person's ID and the columns are the individual message identifiers. To speed up searching, Facebook has built special hooks into its version of Cassandra for intelligent caching: as soon as a user clicks into the search bar, a message is sent to the Cassandra cluster priming it with the user's ID. By the time the search is executed, the results are likely to already be in memory, so searching becomes a very quick process.
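As a minimal sketch of the two layouts just described, the nested dictionaries below stand in for Cassandra's key -> super column -> column hierarchy. The keys, message IDs and helper function are illustrative inventions, not Facebook's actual schema.

```python
# Illustrative sketch (not Facebook's real schema): the two inbox-search
# column families described above, modelled as nested dictionaries in the
# shape key -> super column -> column -> value.

# Item (term) search: row key is the user ID, each super column is a word
# appearing in the user's messages, and each column under it is the ID of a
# message containing that word.
term_search = {
    "user:1234": {
        "holiday": {"msg:9001": "", "msg:9017": ""},
        "invoice": {"msg:8250": ""},
    }
}

# Interaction search: row key is again the user ID, each super column is the
# ID of another user, and the columns are the messages exchanged with them.
interaction_search = {
    "user:1234": {
        "user:5678": {"msg:9001": "", "msg:9002": ""},
    }
}

def messages_containing(user_id, word):
    """Return the message IDs for a keyword search, mirroring a slice read
    on the term-search column family."""
    return sorted(term_search.get(user_id, {}).get(word, {}))

if __name__ == "__main__":
    print(messages_containing("user:1234", "holiday"))  # ['msg:9001', 'msg:9017']
```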



3 Architecture

3.1 Physical

Cassandra is a distributed database, which means its data is spread out over a number of computers (nodes) which do not need to be in the same geographical area. A group of nodes is called a cluster. As of 2010 Facebook has a cluster of 150 nodes, spread over the east and west coasts of the USA, which collectively store 50 TB of data. The workings of the cluster are abstracted away when it comes to using it: the Facebook site does not need to know about nodes, or which node to access to get the required data. The Cassandra API is made up of simple getter and setter methods and makes no reference to the database's distributed nature. A system called Ganglia is used by Facebook to monitor the nodes for faults, the most common type of which are hard-drive failures. Sometimes the nodes need to be heavily synchronised, for example during complicated transactions, to avoid losing updates; for this Facebook uses a program called Zookeeper. [AL09]

3.2 Logical

The Cassandra system can be broken down into three layers: core, middle and top [Ell].

Core:   messaging service, failure detection, cluster state, partitioner, replication
Middle: indexes, compaction, commit log, Memtable, SSTable
Top:    hinted hand-off, read repair, monitoring, admin tools

The core layer deals with the distributed nature of the database and contains functions for communication between nodes, the state of the cluster as a whole (including failure detection) and replication between nodes. The middle layer contains functions for handling the data being written into the database: compaction tries to combine keys and columns to increase the performance of the system, and the different ways of storing data, such as the Memtable and SSTable, are also handled here (these are explained in the NoSQL section). The top layer is designed to allow efficient, consistent reads and writes using a simple API. Another element in the top layer is hinted hand-off, which occurs when a node goes down: the successor node temporarily becomes a coordinator holding some information (a 'hint') for the failed node. These elements are explained in more detail in the Fault Tolerance and Scalability sections. [AL09]
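To illustrate the "simple getter and setter" style of access mentioned above, here is a minimal sketch of the kind of interface a client sees. This is a hypothetical wrapper written for illustration, not the actual Thrift API; the class name, method names and contact point are invented.

```python
# A minimal sketch (hypothetical wrapper, not the real Thrift API) of the
# simple getter/setter style of access: the caller names a key and a column,
# never a node in the cluster.

class SimpleCassandraClient:
    """Stand-in client: routing to the right node happens behind this API."""

    def __init__(self, contact_point):
        # A real client needs only a contact point; the cluster topology
        # (for Facebook, 150 nodes on two coasts) is handled internally.
        self.contact_point = contact_point
        self._data = {}  # in-memory stand-in for the cluster

    def insert(self, key, column, value):
        self._data.setdefault(key, {})[column] = value

    def get(self, key, column):
        return self._data.get(key, {}).get(column)

client = SimpleCassandraClient("cassandra.example.internal")
client.insert("user:1234", "last_search", "holiday")
print(client.get("user:1234", "last_search"))  # holiday
```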

4 NoSQL

The second important feature of Cassandra is NoSQL. In computing, NoSQL designates database management systems that differ in some way from the classic relational model. Cassandra is not designed to be a traditional relational DBMS: it uses quite different strategies for handling reads and writes and for storing incoming data. A typical storage structure contains the following parts: the CommitLog, the Memtable and the SSTables.

• CommitLog: used for recovery; it records every change so the changes can be replayed in the case of a crash or inconsistency.
• Memtable: located in memory; the place into which data is first written. One data structure (column family) maps to one Memtable object.
• SSTable: permanent data storage, located on the hard disk. Data on disk is immutable, meaning it is never modified in place; it can only be deleted or combined (compacted).

When the user performs a write, the change is recorded in the CommitLog and the data is written into the Memtable. Once the size of the Memtable exceeds a specific threshold, the data is flushed from the Memtable to an SSTable. The main reason writes are extremely fast is that all data is first written to memory rather than to the hard disk, so the system can tolerate many concurrent writes without blocking on disk resources and the whole process of massive writing is quickened. [Pop10b]

The read and write performance of a classic relational DBMS and of Cassandra can be compared in the table below:

            Reading   Writing
MySQL       350ms     300ms
Cassandra   15ms      0.12ms

Writing is really fast here because the system is designed to facilitate writes as much as possible. [Per10]
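A simplified sketch of this write path follows: append to the commit log, write into the in-memory Memtable, and flush to an immutable SSTable once a threshold is reached. The threshold and data structures are illustrative stand-ins, not Cassandra's actual implementation.

```python
# Simplified write path: commit log -> Memtable -> flush to immutable SSTable.

MEMTABLE_THRESHOLD = 3  # flush after this many rows (real systems use size)

commit_log = []   # append-only record of every mutation, used for recovery
memtable = {}     # in-memory structure, sorted when flushed
sstables = []     # list of immutable "on-disk" tables (dicts stand in for files)

def write(key, column, value):
    commit_log.append((key, column, value))          # 1. durability first
    memtable.setdefault(key, {})[column] = value     # 2. fast in-memory write
    if len(memtable) >= MEMTABLE_THRESHOLD:          # 3. flush when full
        flush_memtable()

def flush_memtable():
    global memtable
    # SSTables are immutable: a new one is written rather than editing old ones.
    sstables.append({k: dict(v) for k, v in sorted(memtable.items())})
    memtable = {}

for i in range(5):
    write("user:%d" % i, "name", "user %d" % i)

print(len(sstables), "SSTable(s) flushed,", len(memtable), "row(s) in memtable")
# -> 1 SSTable(s) flushed, 2 row(s) in memtable
```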

Reading is a little slower, because the system needs to search not only the Memtable but also the SSTables [Per10]. Every piece of written data uses a key to identify itself in the database, and to make searching efficient the SSTable is specialised to take advantage of these keys. An SSTable has three integral parts: the Bf field, the Index field and the Data field. The Bf field, also known as the "bloom filter", can quickly determine whether a given key is in this SSTable or not; the Index field records each key and its corresponding data address; and the Data field holds the real content of the stored data. [Pop10a] With the assistance of these structures, Cassandra can perform much faster reads and writes without sacrificing too much space.
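A sketch of this read-side layout follows. The Bloom filter here is a toy two-hash version, and the sizes and keys are illustrative rather than Cassandra's real on-disk format.

```python
# Sketch of an SSTable with the three parts described above: a Bloom filter
# (checked first, to skip tables that cannot contain the key), an index from
# key to data position, and the data itself.

import hashlib

class SSTable:
    def __init__(self, rows, filter_bits=64):
        self.filter_bits = filter_bits
        self.bloom = 0                      # bit set stored as an integer
        self.index = {}                     # key -> position in self.data
        self.data = []                      # the "on-disk" rows
        for key, value in sorted(rows.items()):
            for h in self._hashes(key):
                self.bloom |= 1 << h
            self.index[key] = len(self.data)
            self.data.append((key, value))

    def _hashes(self, key):
        digest = hashlib.md5(key.encode()).digest()
        return (digest[0] % self.filter_bits, digest[1] % self.filter_bits)

    def might_contain(self, key):
        # False means "definitely not here"; True means "possibly here".
        return all(self.bloom & (1 << h) for h in self._hashes(key))

    def get(self, key):
        if not self.might_contain(key):     # cheap negative check first
            return None
        pos = self.index.get(key)           # only now touch the index/data
        return self.data[pos][1] if pos is not None else None

table = SSTable({"user:1": "alice", "user:2": "bob"})
print(table.get("user:1"), table.get("user:999"))  # alice None
```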

5 Scalability

One of the important factors when talking about scalability is the method used for dealing with new nodes, whether they are added to expand data processing and storage or to cover a node outage (failures or maintenance). Scaling can take the following forms:

• Vertical scalability (scale up): resources are added to a single node in the system to increase throughput, usually by adding CPU or memory. The most important advantage of vertical scalability is minimal administration, as the computational power is concentrated in a single node. [Ter07]
• Horizontal scalability (scale out): nodes are added to the cluster to absorb the extra load on the servers. The system is organised as a cluster, so throughput can be improved simply by adding more nodes and letting the system perform load balancing to distribute the load evenly between them. Because the computational power is split between the nodes in the cluster, a node outage will not have a major impact on the resources that are available. [Con09]

Both methods have advantages and disadvantages [Hor07]; Facebook uses Cassandra in the horizontally scalable way [Pfe10]. The Facebook cluster running Cassandra can be represented as a ring, with each node placed at a position on the ring. When a node starts for the first time, a token is randomly picked which identifies its position in the ring, and a gossip protocol spreads this token information between the nodes so that every node knows the position of all other nodes in the ring [Pro10a]. Knowing the positions of all other nodes allows each node to route a request to the correct node (a sketch of this routing is given at the end of this section). When a new node joins the cluster, it takes on some of the load from nodes that are heavily loaded, so as more users join the system the cluster utilises the new resources automatically. [AL09]

The data model of Cassandra is another reason for its success in scalability. It can be described as a very large table with many rows. Each row is identified by a unique key, an arbitrary string with no limit on its size, and each row holds its data in column families. Unlike relational databases, Cassandra has no limitation on the number of rows or columns: column families are declared by the administrator before the database starts up, while columns (each a name, value and timestamp) can be added or deleted dynamically at run time. The picture below illustrates the structure of a column family: [Pro08]

Super columns are sorted sets of named columns; super column families are referred to as locality groups in Google's Bigtable. The picture below illustrates the structure of super columns: [Ham07]
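Returning to the ring described above, here is a minimal sketch of token-based routing. The node names, hash function and random tokens are illustrative choices, and the gossip exchange itself is not modelled.

```python
# Sketch of the ring model: each node picks a random token, and a key is
# routed to the first node whose token is >= the key's position on the ring
# (wrapping around past the highest token).

import bisect
import hashlib
import random

RING_SIZE = 2 ** 127

def key_position(key):
    """Map a key to a position on the ring (illustrative hash choice)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % RING_SIZE

class Ring:
    def __init__(self, node_names):
        # token -> node; in Cassandra every node learns this map via gossip.
        self.tokens = sorted((random.randrange(RING_SIZE), name) for name in node_names)

    def node_for(self, key):
        pos = key_position(key)
        idx = bisect.bisect_left(self.tokens, (pos, ""))
        if idx == len(self.tokens):   # wrap around past the highest token
            idx = 0
        return self.tokens[idx][1]

ring = Ring(["node%d" % i for i in range(6)])
print(ring.node_for("user:1234"))   # one of node0..node5, depending on tokens
```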

6 Fault tolerance

For Cassandra, fault tolerance is a very important concern, and it starts as soon as a piece of data is put into the system. The data undergoes self-replication: it is copied to multiple nodes by an automated process. For each replica a timestamp is created, which is later used by the read repair system, and a checksum is created for each replica to give a way of verifying the authenticity and accuracy of the data after replication.

This also means that the system is eventually consistent: it takes some time for the data on each node to be brought up to date, but once it has been the system is consistent. This is counter to traditional databases, which provide strong consistency, meaning that after an update all nodes are up to date immediately rather than after the delay that eventual consistency allows [Lak08]. The distinction can be stated precisely. Let N be the number of nodes storing replicas of the data, W the number of replicas that must acknowledge a write before it is considered successful, and R the number of replicas contacted in a read operation. Then:

W + R > N  gives strong consistency
W + R <= N gives eventual consistency

[Vog07b]
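A small sketch of this N/W/R rule, with illustrative values:

```python
# Overlapping read and write sets (W + R > N) mean a read always includes at
# least one replica that acknowledged the latest write; otherwise consistency
# is only eventual.

def consistency_level(n, w, r):
    if not (1 <= w <= n and 1 <= r <= n):
        raise ValueError("W and R must be between 1 and N")
    return "strong" if w + r > n else "eventual"

print(consistency_level(n=3, w=2, r=2))  # strong   (quorum reads and writes)
print(consistency_level(n=3, w=1, r=1))  # eventual (fast, but reads may be stale)
```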
For replication, the partitioning strategy used is important. A token exists for each node and specifies which section of the ring the node occupies: the range of a token is the distance between a node's token and the token of the next node, and the keys the node is responsible for lie within this range. The token is therefore a way of showing which part of the keyspace a node controls. Tokens are assigned by a partitioner, which uses one of two strategies: random or order preserving. The random partitioner gives the token as an integer in the range 0 to 2^127, giving an even distribution; the order-preserving partitioner gives the token as a string, which does not guarantee an even distribution.

Cassandra provides three strategies for replicating data across nodes: Rack Aware, Rack Unaware and Data Shard. In all of these strategies the first replica is placed on the node whose token range covers the key; it is only after this initial replica is placed that the three strategies differ. The Rack Aware method is most useful when there are multiple data centres: the second replica is placed in a different data centre, and from then on the replicas are placed on different racks within the same data centre. Rack Unaware is the opposite of this, in that the replicas are placed on the closest nodes on the ring regardless of which data centre they are in. Finally, the Data Shard approach gives more control than Rack Aware by allowing the user to specify the replication strategy for each data centre. [Bla10]

To check whether a node has failed, Cassandra does not use a simple binary working/not-working flag for each node. Instead it uses an accrual-style failure detector, which maintains a level of suspicion for each node showing how likely it is that the node has failed; this allows it to take fluctuations in the network into account. [Sas10]

If a node ceases to work correctly, Cassandra deals with it by taking it offline, repairing it and bringing it back online. If a single node fails the system loses some capacity, but it will still function and provide access to all stored data, so fixing a node requires no system downtime and Cassandra has no single point of failure. This also links in to the gossip protocol used by the system, which keeps every node updated with important information about all other nodes. Gossip also brings previously offline or failed nodes up to date, so that as soon as they come back online they have the correct information.
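As a sketch of the simplest of these placements, the rack-unaware style simply walks the ring: the first replica goes to the node whose token range covers the key, and the remaining replicas go to the next nodes clockwise. The tokens and replication factor below are illustrative.

```python
# Rack-unaware replica placement: first replica on the node covering the key,
# remaining replicas on the following nodes around the ring.

def place_replicas(key_position, node_tokens, replication_factor=3):
    """node_tokens maps node name -> token (an integer position on the ring)."""
    ordered = sorted(node_tokens.items(), key=lambda item: item[1])
    # Start from the first node whose token is >= the key's position.
    start = next((i for i, (_, tok) in enumerate(ordered) if tok >= key_position), 0)
    return [ordered[(start + i) % len(ordered)][0] for i in range(replication_factor)]

tokens = {"node1": 100, "node2": 400, "node3": 700, "node4": 900}
print(place_replicas(key_position=450, node_tokens=tokens))
# -> ['node3', 'node4', 'node1']
```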

Other than node failure, there is one other important factor in making a system fault tolerant: keeping the data up to date and correct. Errors in the data must be avoided, or fixed as quickly and efficiently as possible. To this end Cassandra performs read repair whenever a piece of data is requested. Once a read request is issued to a node, the timestamps and checksums of all replicas of the data being read are pulled up by the system. The checksums are compared, and if there is an inconsistency the timestamps are checked to find the latest version of the data. This latest version is then merged with the out-of-date or incorrect version that caused the checksums to differ, and a write request is automatically sent to the stale node, allowing it to update the data it holds and once again ensuring the consistency of the database. [Tar10] [Ho10]

The diagram above shows the keyspace and the nodes, illustrating the fault-tolerant nature of Cassandra: if a node fails, the keys between two nodes are passed around the ring, so the data a key points to can still easily be accessed through another node. This again demonstrates that the system has no single point of failure.
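A sketch of read repair as just described follows. The checksum comparison is reduced here to comparing values directly, and the node names, timestamps and values are illustrative.

```python
# Read repair sketch: gather (timestamp, value) pairs from every replica,
# return the newest version, and push it back to any stale replica.

def read_with_repair(replicas, key):
    """replicas maps node name -> {key: (timestamp, value)}."""
    versions = {node: data[key] for node, data in replicas.items() if key in data}
    if not versions:
        return None
    newest_ts, newest_value = max(versions.values())           # latest timestamp wins
    for node, (ts, _) in versions.items():
        if ts < newest_ts:
            replicas[node][key] = (newest_ts, newest_value)     # repair stale replica
    return newest_value

replicas = {
    "node1": {"user:1234": (10, "old profile")},
    "node2": {"user:1234": (17, "new profile")},
    "node3": {"user:1234": (17, "new profile")},
}
print(read_with_repair(replicas, "user:1234"))   # new profile
print(replicas["node1"]["user:1234"])            # (17, 'new profile')
```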

7 Cassandra vs. Dynamo and Bigtable

Cassandra is described by Jeff Hammerbacher, who led the Facebook Data team at the time, as a Bigtable data model running on a Dynamo-like infrastructure. Dynamo is a highly available, scalable key-value storage system that supports parts of Amazon Web Services; Bigtable is a fast, extremely large-scale database management system used by a number of Google applications. Cassandra, a distributed service provided by a set of connected nodes, inherits its cluster technology from Dynamo and borrows its data model from Bigtable: like Bigtable it provides the concept of a column family, which differentiates it from Dynamo's simple key/value data structure. [GD07] [FC06] [AL09]

7.1 Dynamo

There are many similarities between Cassandra and Dynamo, and high scalability is one of them. Cassandra uses a ring infrastructure and consistent hashing like Dynamo. In basic consistent hashing, each node becomes responsible for the region of the ring between it and its predecessor node, so that the departure or arrival of a node only affects its immediate neighbours while other nodes remain unaffected; this makes it easy to insert or delete nodes without large-scale data transfer. Because the basic algorithm does not consider the different load capacities of nodes, both Dynamo and Cassandra improve on it: Dynamo assigns a node with high capacity to multiple positions in the circle, while Cassandra analyses load information on the ring and has lightly loaded nodes move along the ring to relieve heavily loaded ones. [GD07] [AL09]

A balance between availability and fault tolerance is another similarity, and Cassandra addresses it in the same way as Dynamo: a piece of data has many replicas on different nodes. If R and W represent the least number of nodes needed for successfully reading and writing a piece of data, and N represents the number of nodes storing the data, then requiring

R > N − W

ensures that a read will include at least one replica holding the newest version rather than only old versions. This is called the quorum (consistency) protocol. It is apparent that the more nodes you need for a successful read or write, the higher the fault tolerance but the less efficient the system. [AL09] [Vog07a]

One difference between Cassandra and Dynamo is that Cassandra is not a pure key/value store: it borrows its data structure, the column family, from Bigtable, which makes it easier to compress data and save storage space. The two also maintain data consistency in different ways. Cassandra omits the vector clocks that Dynamo uses to avoid version conflicts, because they take a long time; instead it gives each cell a timestamp to decide which data is newer and should be kept. [FC06] [Ho10]

7.2 Bigtable

The data model of Cassandra is similar to Bigtable's, and it borrows the following features from Bigtable:

• column/column family
• sequential write (CommitLog -> Memtable -> SSTable)
• merged read
• periodic data compaction

The first two have been explained previously. Merged read means that when a piece of data is read, its different versions are merged together to avoid conflict. Periodic data compaction refers to the mechanism of merging the SSTables scattered on disk at regular intervals, to save storage space. [FC06]

Additionally, the super column is a distinctive concept of Cassandra. Compared to Bigtable, super column families can be viewed as n-dimensional column families: you can access a column family within a super column family, and so on, so with super columns Cassandra can represent data in a richer way, effectively a column family within a column family. [AL09]
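A sketch of the merged read and per-cell timestamp ideas described above follows; the timestamps, sources and column values are illustrative.

```python
# Merged read sketch: collect every version of a column from the memtable and
# all SSTables, keep the one with the highest timestamp (last write wins).
# This per-cell timestamp is how Cassandra resolves versions without
# Dynamo-style vector clocks.

def merged_read(key, column, memtable, sstables):
    candidates = []
    for source in [memtable, *sstables]:
        cell = source.get(key, {}).get(column)   # cell is (timestamp, value)
        if cell is not None:
            candidates.append(cell)
    if not candidates:
        return None
    return max(candidates)[1]                    # newest timestamp wins

memtable = {"user:1": {"city": (30, "London")}}
sstables = [
    {"user:1": {"city": (10, "Leeds")}},
    {"user:1": {"city": (20, "York")}},
]
print(merged_read("user:1", "city", memtable, sstables))  # London
```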

8 Others using Cassandra

8.1 Twitter

In March 2010, Ryan King revealed in an interview that Twitter would move from MySQL to Cassandra, first for storing the statuses table, which contains all tweets and retweets, for the reasons that it would offer no single point of failure and scalable writes, and that there was a healthy and productive open source community supporting it. According to King, Cassandra would over time completely replace the current MySQL solution [Pop10c]. In June, Twitter experienced poor performance resulting from over-capacity in internal sub-networks [Twi10a], and in July they announced, as a change in strategy, that they would switch back to MySQL-based storage for tweets [Twi10b], stating that Twitter would still be working with Cassandra wherever they require a large-scale data store.

8.2 Digg

John Quinn of Digg announced in March 2010 that they were making large-scale changes to their system, "abandoning MySQL in favour of a NoSQL alternative", namely Cassandra [Dig10]. They had considered several issues, examined a set of techniques against them and finally chosen Cassandra, while recognising that there was "a lot to be done before it is close to where it will compare in production environments to something like MySQL". The resulting Digg version 4 later proved unsuccessful in terms of reliability and acceptance, and Quinn himself was no longer employed there [Kal10]. A recent post (17 October) criticised their "rewriting everything from scratch" as the real problem, leading to bad architecture, arguing that Cassandra is not to blame for everything and that, in contrast, Facebook "don't make gigantic changes all at once" [Pro10b].

Riptano, a company established in April 2010, has been backing Cassandra since then. They worked with Digg to study the problems, and founder Matt Pfeil remained confident in Cassandra itself, saying their usage of Cassandra would "only grow" [Hig10]. In another interview, Pfeil talked about the relationship between NoSQL and traditional relational databases, pointing out that "there's definitely room for both in the world", and sometimes even in the same application [Ros10].

9 Future and Conclusion

These problems were unfortunate, but since work on Cassandra has not ceased at the sites mentioned, none of them is likely to discontinue its usage entirely. As Cassandra does have the attractive features explained above, there is little doubt that it will retain its popularity and continue to grow and improve in implementation and performance.

Given Cassandra's features and its relative immaturity, it seems that for now it is best used as a complement to relational databases, which is what most of the sites mentioned are currently doing.

References

[AL09] A. Lakshman and P. Malik. Cassandra — A Decentralized Structured Storage System. The 3rd ACM SIGOPS International Workshop on Large Scale Distributed Systems and Middleware (LADIS 09), ACM, October 2009.
[Bla10] B. Black. Cassandra Replication & Consistency. 2010.
[Con09] Burleson Consulting. Horizontal scalability. 2009.
[Dig10] Digg. Saying Yes to NoSQL; Going Steady with Cassandra. http://about.digg.com, March 2010.
[Ell] J. Ellis. Cassandra: Open Source Bigtable + Dynamo. Open Source Convention 2009 (OSCON 09).
[FC06] F. Chang, J. Dean, S. Ghemawat, W. Hsieh et al. Bigtable: A Distributed Storage System for Structured Data. OSDI, 2006.
[GD07] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman et al. Dynamo: Amazon's Highly Available Key-value Store. October 2007.
[Ham07] J. Hamilton. Facebook Cassandra Architecture and Design. http://perspectives.mvdirona.com.
[Hig10] S. Higginbotham. Digg Not Likely to Give Up on Cassandra. http://gigaom.com, September 2010.
[Ho10] R. Ho. BigTable Model with Cassandra and HBase. http://horicky.blogspot.com, 2010.
[Hor07] Vertical vs. Horizontal scalability. October 2007.
[Kal10] DIGGing a Hole with Cassandra. October 2010.
[Lak08] A. Lakshman. Cassandra — A structured storage system on a P2P Network. http://www.facebook.com/note.php?note_id=24413138919, August 2008.
[Per10] M. Perham. Cassandra Internals — Writing. http://www.mikeperham.com, March 2010.
[Pfe10] M. Pfeil. Why does Scalability matter, and how does Cassandra scale? http://www.riptano.com, 2010.
[Pop10a] A. Popescu. Cassandra Read Operation Performance Explained. http://nosql.mypopescu.com, 2010.
[Pop10b] A. Popescu. Cassandra Write Operation Performance Explained. http://nosql.mypopescu.com, 2010.
[Pop10c] A. Popescu. Cassandra @ Twitter: An Interview with Ryan King. http://nosql.mypopescu.com, February 2010.
[Pro08] Project Cassandra: Facebook's Open Source Alternative to Google BigTable. July 2008.
[Pro10a] M. Pronschinske. Cassandra NoSQL Database an Apache Top Level Project. February 2010.
[Pro10b] Proximity. Digg v4 Troubles are Symptom of a Bigger Problem. http://www.proximitychicago.com, October 2010.
[Ros10] D. Rosenberg. Apache Cassandra gets boost from Riptano. 2010.
[Sas10] R. Sasirekha. Apache Cassandra — Distributed Database, Part II. 2010.
[Tar10] Tarrant. 2010.
[Ter07] Terrill. Vertical vs. Horizontal scalability. February 2007.
[Twi10a] Twitter Engineering. A Perfect Storm of Whales. http://engineering.twitter.com, June 2010.
[Twi10b] R. King. Cassandra at Twitter Today. http://engineering.twitter.com, July 2010.
[Vog07a] W. Vogels. Eventually Consistent. http://www.allthingsdistributed.com, December 2007.
[Vog07b] W. Vogels. Eventually Consistent. http://www.allthingsdistributed.com, 2007.
