Cassandra: 0-60

Jonathan Ellis / @spyced

Keyspaces & ColumnFamilies

Conceptually, like “schemas” and “tables”

Inside CFs, columns are dynamic

Twitter: “Fifteen months ago, it took two weeks to perform ALTER TABLE on the statuses [tweets] table.”

ColumnFamilies
Columns

“static” Cfs vs “dynamic” .

Inserting ● ● Really “insert or update” As much of the row as you want (remember sstable merge-on-read) .

Column indexes ● ● Name vs range flters “reversed=true” .

Denormalization ● Whiteboard: Turn. long. skinny tables into long rows Reduces i/o and cpu to perform read ● .

.

Example: twissandra ● http://twissandra.com .

password VARCHAR(64) ).CREATE TABLE users ( id INTEGER PRIMARY KEY. followed INTEGER REFERENCES user(id) ). CREATE TABLE tweets ( id INTEGER. . timestamp TIMESTAMP ). username VARCHAR(64). body VARCHAR(140). user INTEGER REFERENCES user(id). CREATE TABLE following ( user INTEGER REFERENCES user(id).

Cassandrifed <Keyspaces> <Keyspace Name="Twissandra"> <ColumnFamily CompareWith="UTF8Type" <ColumnFamily CompareWith="BytesType" <ColumnFamily CompareWith="BytesType" <ColumnFamily CompareWith="BytesType" <ColumnFamily CompareWith="UTF8Type" <ColumnFamily CompareWith="LongType" <ColumnFamily CompareWith="LongType" </Keyspace> </Keyspaces> Name="User"/> Name="Username"/> Name="Friends"/> Name="Followers"/> Name="Tweet"/> Name="Userline"/> Name="Timeline"/> .

Connecting CLIENT = pycassa.ColumnFamily(CLIENT. dict_class=OrderedDict) . 'User'. 'Twissandra'.connect_thread_local() USER = pycassa.

columns) . 'password': password} USER. 'username': username. 'username': 'ericflo'. } username = 'jericevans' password = '**********' useruuid = str(uuid()) columns = {'id': useruuid.insert(useruuid. 'password': '****'.Users 'a4a70900-24e1-11df-8924-001ff3591711': { 'id': 'a4a70900-24e1-11df-8924-001ff3591711'.

Natural keys vs surrogate .

time()}) .insert(useruuid. {useruuid: time. '343d5db2-24e2-11df-8924-.Friends and Followers 'a4a70900-24e1-11df-8924-001ff3591711': { # friend id: timestamp when the friendship was added '10cf667c-24e2-11df-8924-. {frienduuid: time..': '1267413990076949'..': '1267414008133277'. '3f22b5f6-24e2-11df-8924-..': '1267413962580791'.. } frienduuid = 'a4a70900-24e1-11df-8924-001ff3591711' FRIENDS.insert(frienduuid...time()}) FOLLOWERS.

fat columnfamily .Your row is your index ● Long skinny table vs short.

'_ts': '1267414173047880'. 'body': 'Trying out Twissandra. } .Tweets '7561a442-24e2-11df-8924-001ff3591711': { 'id': '89da3178-24e2-11df-8924-001ff3591711'. This is awesome!'. 'user_id': 'a4a70900-24e1-11df-8924-001ff3591711'.

... } .'. 1267414319522925: '02ccb5ec-24e3-11df-8924-. 1267414305866969: 'f9e6d804-24e2-11df-8924-. 1267414277402340: 'f0c8d718-24e2-11df-8924-.'..'...Userline 'a4a70900-24e1-11df-8924-001ff3591711': { # timestamp of tweet: tweet id 1267414247561777: '7561a442-24e2-11df-8924-...'.

.

} ..'.. 1267414305866969: 'f9e6d804-24e2-11df-8924-.. 1267414319522925: '02ccb5ec-24e3-11df-8924-.'...'.. 1267414277402340: 'f0c8d718-24e2-11df-8924-...Timeline 'a4a70900-24e1-11df-8924-001ff3591711': { # timestamp of tweet: tweet id 1267414247561777: '7561a442-24e2-11df-8924-.'.

insert(useruuid. 5000): TIMELINE.pack('>d'.time() * 1e6) columns = {'id': tweetuuid.insert(tweetuuid. columns) TIMELINE. columns) columns = {struct. '_ts': timestamp} TWEET.Adding a tweet tweetuuid = str(uuid()) body = '@ericflo thanks for Twissandra. 'user_id': useruuid. columns) for otheruuid in FOLLOWERS. 'body': body.get(useruuid. it helps!' timestamp = long(time. columns) .insert(useruuid. timestamp): tweetuuid} USERLINE.insert(otheruuid.

column_reversed=True) tweets = TWEET.get(useruuid.multiget(timeline.multiget(timeline. column_reversed=True) tweets = TWEET.GET.values()) .get('start') limit = NUM_PER_PAGE timeline = TIMELINE. column_count=limit.values()) start = request.Reads timeline = USERLINE.get(useruuid. column_start=start.

I can has smarter clients? ● Shouldn't need to pack('>d'. Cassandra provides describe_keyspace so this can be introspected . int).

port=9170): socket = TSocket.Raw thrift API: Connecting def get_client(host='127. port) transport = TTransport.1'.Client(protocol) return client .0.0.TBinaryProtocolAccelerated(transport) client = Cassandra.open() protocol = TBinaryProtocol.TSocket(host.TBufferedTransport(socket) transport.

.items()] mutations = [Mutation(ColumnOrSuperColumn(column=c)) for c in columns] rows = {useruuid: {'User': mutations}} client. rows.} columns = [Column(k.time()) for (k.ONE) . v) in data.batch_mutate('Twissandra'. . v..Raw thrift API: Inserting data = {'id': useruuid. ConsistencyLevel. time.

get_slice. multiget_slice. get_range_slices ColumnOrSuperColumn http://wiki.apache.Raw thrift API: Fetching ● get.org/cassandra/API ● ● . get_count.

Running twissandra ● ● ● cd twissandra python manage.1:8000 .0.0.py runserver Navigate to http://127.

) insert(key....Pycassa cheat sheet ● ● ● ● ● get(key. …) multiget(key_list) get_range(.) .. columns_dict) remove(key. .

py shell import cass help(cass.Exercise ● ● ● ● python manage.remove) Delete the most recent tweet by user foo .TWEET.

py Finish save_retweet .Exercise ● ● Open cass.

Language support ● ● ● Python Scala Ruby ● Speed is a negative ● Java .

apache.org/jira/browse/THRIFT-638 https://issues.apache.org/jira/browse/THRIFT-788 ● ● ● .apache.apache.org/jira/browse/THRIFT-780 https://issues.org/jira/browse/THRIFT-347 https://issues.PHP [thrift] tickets ● https://issues.

Done yet? ● Still doing 1+N queries per page .

SuperColumns SuperColumns .

Applying SuperColumns to Twissandra .

ColumnParent .

Supercolumns: limitations .

UUIDs ● Column names should be uuids. to avoid collisions Version 1 UUIDs can be sorted by time (“TimeUUID”) Any UUID can be sorted by its raw bytes (“LexicalUUID”) ● ● ● Usually Version 4 Slightly less overhead ● . not longs.

0.7: secondary indexes ● Obviate need for Userline (but not Timeline) .

Lucandra ● What documents contain term X? ● … and term Y? … or start with Z? ● .

Lucandra ColumnFamilies <ColumnFamily Name="TermInfo" CompareWith="BytesType" ColumnType="Super" CompareSubcolumnsWith="BytesType" KeysCached="10%" /> <ColumnFamily Name="Documents" CompareWith="BytesType" KeysCached="10%" /> .

position vector } Document Key "documentId" => { fieldName . value } .Lucandra data Term Key col name value "field/term" => { documentId .

Lucandra queries ● ● ● get_slice get_range_slices No silver bullet .

7: vector clocks .FAQ: counting ● ● ● ● UUIDs + batch process Mutex (contrib/mutex or “cages”) Use redis or mysql or memcached 0.

Tips ● ● Insert instead of check-then-insert Bulk delete with 'forged' timestamps ● In 0.7: use ttl instead .

.

d/cassandra start tail -f /var/log/cassandra/system.as notroot/notroot: git clone http://github.py runserver .git as root/riptano: apt-get update apt-get install python-setuptools apt-get install python-django easy_install -U thrift rm -r /var/lib/cassandra/* cp twissandra/storage-conf.log as notroot: find templates |xargs grep empty # r/m the {empty} blocks python manage.com/ericflo/twissandra.properties to DEBUG /etc/init.xml /etc/cassandra edit /etc/cassandra/log4j.

Sign up to vote on this title
UsefulNot useful