
Developer track

By
Nirmallya Mukherjee

Index

Part 1 - Warmup
Part 2 - Introducing C*
Part 3 - Setup & Installation
Part 4 - Prelude to modelling
Part 5 - Write and Read paths
Part 6 - Modelling
Part 7 - Applications with C*
Part 8 - Administration
Part 9 - Future releases

Part 1
Warm up!

What's the challenge?


Databases are doing just fine!

Consumer oriented applications


Users in millions
Data produced in hundreds/thousands of TB
Hardware failure happens no matter what
Is consumer growth predictable? If not, how do you do
capacity planning for infrastructure?

What can it do?


Using 330 Google Compute Engine virtual machines, 300 1TB Persistent Disk
volumes, Debian Linux, and DataStax Cassandra 2.2, we were able to
construct a setup that can:
sustain one million writes per second to Cassandra with a median
latency of 10.3 ms and 95% completing under 23 ms
sustain a loss of instances and volumes and still maintain the
1 million writes per second (though with higher latency)
Read more here
http://googlecloudplatform.blogspot.in/2014/03/cassandrahits-one-million-writes-per-second-on-google-computeengine.html

Who uses it?

eBay
Netflix
Instagram
Comcast
Safeway
Sky
CERN
Travelocity
Spotify
Intuit
GE

Gartner Magic Quadrant

Types of NoSQL

Wide Row Store - Also known as wide-column stores, these databases
store data in rows and users are able to perform some query operations
via column-based access. A wide-row store offers very high performance
and a highly scalable architecture. Examples include: Cassandra, HBase,
and Google BigTable.

Key/Value - These NoSQL databases are some of the least complex, as all
of the data within consists of an indexed key and a value. Examples
include: Amazon DynamoDB, Riak, and Oracle NoSQL Database.

Document - Expands on the basic idea of key-value stores where
"documents" are more complex, in that they contain data, and each
document is assigned a unique key, which is used to retrieve the
document. These are designed for storing, retrieving, and managing
document-oriented information, also known as semi-structured data.
Examples include: MongoDB and CouchDB.

Graph - Designed for data whose relationships are well represented as a
graph structure and has elements that are interconnected, with an
undetermined number of relationships between them. Examples include:
Neo4j and TitanDB.

SQL and NoSQL


SQL
1. Database, relational, strict models
2. Data in rows, pre-defined schema, SQL supports joins
3. Vertically scalable
4. Random access pattern support
5. Good fit for online transactional systems
6. Master-slave model
7. Periodic data replication as read-only copies in slaves
8. Availability model includes a slight downtime in case of outages

NoSQL
1. Datastore, distributed & non-relational, flexible schema
2. Data in key-value pairs, flexible schema, no joins
3. Horizontally scalable
4. Designed for access patterns
5. Good for optimized read-based systems or high-availability write systems
6. C* is a masterless model
7. Seamless data replication
8. Masterless allows for no downtime

Applicability of SQL / NoSQL


No single rule - it is very use-case specific

High read (not write) oriented
High write (not read) oriented
Document based storage
KV based storage

Important to know the access patterns upfront
Focus on modeling specific to the use case
Very difficult to fix an improper model later on, unlike a database

CAP Theorem
ACID
Atomicity - Atomicity requires that each transaction be "all or nothing"
Consistency - The consistency property ensures that any transaction
will bring the database from one valid state to another
Isolation - The isolation property ensures that the concurrent execution
of transactions results in a system state that would be obtained if the
transactions were executed serially
Durability - Durability means that once a transaction has been
committed, it will remain so under all circumstances

What is it? Can I have all?


Consistency - all nodes see the same data at all times
Availability - every request must receive a response
Partition tolerance - the system should keep running even if a part of it fails or is unreachable

CAP leads to BASE theory

Basically Available, Soft state, Eventual consistency


AP systems, what about C?


(Diagram: the CAP triangle with Consistency, Availability and Partition
tolerance at the corners - HBase and MongoDB sit between consistency and
partition tolerance, RDBMS between consistency and availability, and
Cassandra and Google Big Table between availability and partition tolerance.)

How does this impact architecture?

1. Eventual consistency
2. Some application logic is needed

Good fit use cases


Store high-volume stream data
Sensor data
Logs
Events
Online usage impressions

Scenarios where the data cannot be recreated - it is lost if not saved
No need for rollups or aggregations - another system needs to do this

Let's brainstorm your use case

Part 2
Introducing Cassandra


Database or Datastore

Where did the name "Cassandra" come from?

Semantics - no real definition, both are OK

I would like to call it a "Datastore"
A bit of un-learning is required to get a hold of the "different" ways
Inspiration from (think of it as object persistence)

Google BigTable
Dynamo from Amazon

A few interesting observations about the datastore

Some interesting references from Greek mythology
What's the significance of the logo?

Put = insert/update (upsert?)
Get = select
Delete = delete

Think HashMaps, Sorted Maps ...

Storage mechanism
B-tree vs LSM tree

Row-oriented datastore

Biggest of all - no SPOF (masterless)

Masterless architecture
Why large malls have more than one entrance?
(Diagram: Clients 1, 2 and 3 each connect to a different node/replica
R1, R2, R3 - any node can serve any client.)

Seed node(s)
It is like the recruitment department
Helps a new node come on board
More than one seed is a very good practice
One per rack can be a good choice
Starts the gossip process in new nodes
If you have more than one DC then include seed nodes from each DC
No other purpose

Virtual node, Ring architecture


C* operates by dividing all data evenly around a cluster of nodes,
which can be visualized as a ring
In the early days each node had a single range of tokens
This had to be assigned at the time of setting up C*

(Diagram: a six-node ring without vnodes - each node owns one
contiguous token range.)

Few peers help to bring a downed node up, increasing the load on the
selected peers
What will happen if a node with elevated utilization goes down?

Virtual node, Ring architecture


In later versions vNodes made things easy - one node with many
ranges! (One primary range)
Recovery from failure got better - small contributions from many
nodes help in rebuilding
Recovery got faster and the load on other nodes reduced

(Diagram: the same six-node ring with vnodes - each node owns many
small, scattered token ranges.)

Many peers help to bring a downed node up, with only a marginal
increase in load on the selected peers

Gossip

I think we all know what this means in English!

It is a communication mechanism among nodes
Runs every second
Passes state information
Available / down / bootstrapping
Load it is under
Helps in detecting failed nodes

Up to 3 other nodes talk to one another

A way any node learns about the cluster and other nodes
Seed nodes play a critical role to start the gossip among nodes
It is versioned; older states are erased
Once gossip has started on a node, the state about other nodes is
stored locally so that a restart can be fast
Local gossip info can be cleaned if needed

Partitioner

How do you all organize your cubicle/office space? Where will your stuff
be? Assume the floor you are on is about 25,000 sq ft.
A partitioner determines how to distribute the data across the nodes in the
cluster
Murmur3Partitioner - recommended for most purposes (the default strategy as
well)
Once a partitioner is set for a cluster, it cannot be changed without a data
reload
The primary key is hashed using the Murmur3 hash to determine which
node it needs to go to
There is no such thing as a "master replica"; all replicas are identical
The token range it can produce is -2^63 to 2^63-1

Wikipedia details

MurmurHash is a non-cryptographic hash function suitable for general
hash-based lookup. It was created by Austin Appleby in 2008, and exists in
a number of variants, all of which have been released into the public
domain. When compared to other popular hash functions, MurmurHash
performed well in a random distribution of regular keys.

Replication
How many of you have multiple copies of the most critical files -
pwd file, IT return confirmation etc. - on external drives?
Replication determines how many copies of the data will be
maintained in the cluster across nodes
There is no single magic formula to determine the correct number
The widely accepted number is 3, but your use case can be different
This has an impact on the number of nodes in the cluster - you cannot
have a high replication factor with fewer nodes
Replication factor <= number of nodes

In a DC/DR scenario using the NetworkTopologyStrategy, a different
replication factor can be specified per keyspace per DC/DR
strategy_options:{data-center-name}={rep-factor-value}

Also has a performance impact during
"Insert/Update" using a particular consistency level
"Select" using a particular consistency level

Partitioner and Replication


Position of the first copy of the data is determined by the partitioner,
and copies are placed by walking the cluster in a clockwise direction

For example, consider a simple 4-node cluster where all of the partition
keys managed by the cluster were numbers in the range of 0 to 100.
Each node is assigned a token that represents a point in this range. In
this simple example, the token values are 0, 25, 50, and 75.
The first node, the one with token 0, is responsible for the wrapping
range (76-0). The node with the lowest token also accepts partition keys
less than the lowest token and more than the highest token.
Once the first node is determined, the replicas will be placed on the
nodes in a clockwise order.

(Diagram: the ring's four data ranges - Data Range 1 (1-25), Data Range 2
(26-50), Data Range 3 (51-75) and Data Range 4 (76+ the wrapping range) -
owned by the nodes at tokens 0, 25, 50 and 75.)
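The clockwise placement described above can be sketched in a few lines of Java. This is illustrative only - not the driver's or server's real replication code - and the tokens and replication factor are the toy values from this example:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: pick the token-owning node for a key, then walk
// the ring clockwise for the remaining replicas.
class RingPlacement {
    static List<Integer> replicaTokens(int[] sortedTokens, int keyToken, int rf) {
        // First node whose token is >= the key's token; if none, wrap
        // around to the lowest token (the "wrapping range").
        int start = 0;
        while (start < sortedTokens.length && sortedTokens[start] < keyToken) start++;
        if (start == sortedTokens.length) start = 0;
        List<Integer> replicas = new ArrayList<>();
        for (int i = 0; i < rf; i++) {
            replicas.add(sortedTokens[(start + i) % sortedTokens.length]);
        }
        return replicas;
    }

    public static void main(String[] args) {
        int[] tokens = {0, 25, 50, 75};
        // Key token 30 lands on the node owning range 26-50 (token 50);
        // with RF=3 the copies go clockwise to tokens 75 and 0.
        System.out.println(replicaTokens(tokens, 30, 3)); // [50, 75, 0]
        // Key token 80 is in the wrapping range, owned by the node at token 0.
        System.out.println(replicaTokens(tokens, 80, 3)); // [0, 25, 50]
    }
}
```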

Snitch
What can you typically find at the entrance of a very large
mall? What's the need for a "Can I help you?" desk?
The objective is to group machines into racks and data centers
to set up a DC/DR
Informs the partitioner about the rack and DC locations -
determines which nodes the replicas need to go to
Identifies which DC and rack a node belongs to
Routes requests efficiently (replication)
Tries not to have more than one replica in the same rack
(may not be a physical grouping)
Allows for a truly distributed fault-tolerant cluster
Seamless synchronization of data across DC/DR
To be selected at the time of C* installation in the cassandra.yaml file
Changing a snitch is a long drawn-out process, especially if
data exists in the keyspace - you have to run a full repair in
your cluster

Snitch

In addition, a "Google Cloud snitch" is also available for GCE
Mixed cloud = cloud + local nodes

Snitch - property file


A bit more about property file snitch
All nodes in the system must have the
same cassandra-topology.properties file
It is very useful if you cannot ensure the IP
octates (this prevents the use of Rack
infering snitch where every octate means
something) 10.<octat>.<>.<>
Will give you full control of your cluster
and the flexibilty of assigning any IP to a
node (once assigned remains assigned)
It can be hard to maintain information
about a large cluster across multiple DC
but it is worth given the benifits
27

Snitch - property file example


# Data Center One
19.82.20.3 DC1:RAC1
19.83.123.233 DC1:RAC1
19.84.193.101 DC1:RAC1
19.85.13.6 DC1:RAC1
19.23.20.87 DC1:RAC2
19.15.16.200 DC1:RAC2
19.24.102.103 DC1:RAC2
# Data Center Two
29.51.8.2 DC2:RAC1
29.50.10.21 DC2:RAC1
29.50.29.14 DC2:RAC1
53.25.29.124 DC2:RAC2
53.34.20.223 DC2:RAC2
53.14.14.209 DC2:RAC2
The names like "DC1" and "DC2" are very important, as we will see ...
Also, in production use the GossipingPropertyFileSnitch (non-cloud; for cloud use the appropriate snitch for that
cloud), where you keep the entries for the rack+DC in cassandra-rackdc.properties and C* will gossip
this across the cluster automatically. cassandra-topology.properties is used as a fallback.

Commodity vs Specialized hardware

Vertical scalability
1. Large CAPEX
2. Wasted/idle resources
3. Failure takes out a large chunk
4. Expensive redundancy model
5. One shoe fits all model
6. Too much co-existence

vs

Horizontal scalability - this is what Google, LinkedIn, Facebook do.
The norm is now being adopted by large corporations as well.
1. Low CAPEX (rent on IaaS)
2. Maximum resource utilization
3. Failure takes out a small chunk
4. Inexpensive redundancy
5. Specific h/w for specific tasks
6. Very little to no co-existence

Just-in-time expansion; stay in tune with the load. No need to build
ahead of time in anticipation

Elastic Linear Scalability


10 nodes

8 nodes

n1

n1
n8

n2

n7

n10

n2

n9

n3

n8

n4

n3

n6

n4
n5

n7

n5
n6

1.Need more capacity? Add servers


2.Auto Re-distribution of the cluster
3.Results in linear performance increase
30

Debate - Failover design strategy


Scenario
Maintain a small failover capacity in the data
center, and then simply add multiple
nodes if a failure occurs
Is this "just in time" strategy correct?
Analysis
The additional overhead of bootstrapping
the new nodes will actually reduce
capacity at a critical point when the
capacity is needed most, perhaps creating
a domino effect leading to a severe outage

Debate - what if all machines are not similar?

(Diagram: a six-node ring, n1-n6.)

State your observations based on typical machine characteristics

1. CPU, memory
2. Storage space (consider externally mounted space as local)
3. Network connectivity
4. Query performance and "predictability" of queries

Deployment - 4 dimensions
Node - one C* instance

Rack - logical set of nodes

DC-DR - logical set of racks

Cluster - nodes that map to a single token ring
(can be across DCs), next slide ...

Distributed deployment
//These are the C* nodes that the DAO will look to connect to
public static final String[] cassandraNodes = { "10.24.37.1", "10.24.37.2", "10.24.37.3" };
public enum clusterName { events, counter }
//You can build with policies like withRetryPolicy(), withLoadBalancingPolicy(round robin)
cluster = Cluster.builder().addContactPoints(Config.cassandraNodes).build();
session1 = cluster.connect(Constants.clusterName.events.toString());

(Diagram: cluster "SKL-Platform" spanning Region 1 - DC with racks 1-3
and Region 2 - DR with racks A and B.)

A bit more about multi DC setup


When using NetworkTopologyStrategy, you set
the number of replicas per data center
For example, if you set the replicas as 4
(2 in each DC), then this is what you get ...

(Diagram: Data Center1 and Data Center2, each with two racks of nodes;
replicas R1 and R2 land in different racks in each data center.)

ALTER KEYSPACE system_auth WITH REPLICATION =
{'class' : 'NetworkTopologyStrategy', 'DC1' : 2, 'DC2' : 2};
Then run nodetool repair on each affected node, wait for it
to come up and move on to the next node.

Debate - replication setting


Let's assume we have the cluster as follows
DC1 having 5 nodes
DC2 having 5 nodes

What will happen if the following statements are
executed (assume no syntax errors)?
create keyspace if not exists test with replication = { 'class' :
'NetworkTopologyStrategy', 'DC1' : 2, 'DC2' : 10 };
create table test ( col1 int, col2 varchar, primary key(col1));
insert into test (col1, col2) values (1, 'Sir Dennis Ritchie');

Will this succeed?

Regions and Zones


Many IaaS providers (Google / Amazon) have
multiple data centers around the world called
"regions"
Each region is further divided into availability
"zones"
Depending on your IaaS, or even your own DC,
you must plan the topology in such a way that
A client request should not travel over the network too
long to reach C*
An outage should not take out a significant chunk of
your cluster
Replication across DCs should not add to the latency for
the client

Part 3
Setup & Installation


Let's get going! Installation


Acquiring and installing C*
Understand the key components of cassandra.yaml

Configuration and installation structure

Data directory
Commit log directory
Cache directory
System log configuration

cassandra-env.ps1 (change the log file position)

Default is $logdir = "$env:CASSANDRA_HOME\logs"
In logback.xml change from .zip to .gz
<fileNamePattern>${cassandra.logdir}/system.log.%i.gz</fileNamePattern>

Creating a cluster
Adding nodes to a cluster
Multiple seed nodes - ring management
Sizing is specific to the application - discussion
Clock synchronization and its implications

Network Time Protocol (NTP)

Firewall and ports


Cassandra primarily operates using
three ports
7000 (or 7001 if SSL) - internode cluster
communication
7199 - JMX port
9160 (Thrift) / 9042 (CQL native) - client connections
22 - for SSH

The firewall must have these ports open
for each node in the cluster/DC

Cassandra.yaml - an introduction

There are more configuration parameters, but we will cover those as we move along ...


Cassandra-env.sh


CQL shell - cqlsh

Command line tool

Specify the host and port (none means localhost)
Provides tab completion
Supports two different command sets
DDL
Shell commands (describe, source, tracing, copy etc.)

Try some commands

show version
show host
describe keyspaces
describe keyspace <name>
describe tables
describe table <name>

CQL shell "source" command - loading a few million rows

"sstableloader", which is another tool - for beyond a few million rows

Part 4
Prelude to Modeling


Keyspace aka Schema


It is like the database (MySQL) or schema/user (Oracle)

Cassandra:
create keyspace [IF NOT EXISTS] meterdata with replication strategy
(optionally with DC/DR setup), durable writes (True/False)

Database:
create database meterdata;

durable_writes = false bypasses the commit log. You can lose data.

Admin/System keyspaces
SELECT * FROM

system.schema_keyspaces
system.peers
system.schema_columnfamilies
system.schema_columns
system.local (info about itself)
system.hints
system.<many others>

system.local contains information about nodes (a superset of gossip)

Column Family (CF) / Table


Cassandra:
create table meterdata.bill_data (...)
primary key (compound key)
with compaction, clustering order

Database:
create table meterdata.bill_data (...)
pk, references, engine, charset etc.

A PK is mandatory in C*
Insert into bill_data () values ();
Update bill_data set .. where ..
Delete from bill_data where ..

Standard SQL DML statements

What happens if we insert with a PK that already exists?
Hold on to your thoughts ...

PK aka Partition key aka Row key


The partition key determines which node the data will reside on

Simple key
Composite key

It is the unit of replication in a cluster

All data for a given PK replicates around the cluster
All data for a given PK is co-located on a node

The PK is hashed

Murmur3 hashing is the preferred option
Random is based on MD5 (not considered secure)
Others are for legacy support only

A partition is fetched from the disk in one disk seek; every partition
requires a seek

More on the PK during modeling ...

PK -> Hash
meter_id_001 -> 558a666df55
meter_id_002 -> 88yt543edfhj
meter_id_003 -> aad543rgf54l

Visualizing the Primary Key


What are the components of the primary key?
CREATE TABLE all_events (
event_type int,
date int,
//format as yyyymmdd (more efficient)
created_hh int,
created_min int,
created_sec int,
created_nn int,
data text,
PRIMARY KEY((event_type, date), created_hh, created_min)
);

(event_type, date) is the partition key; created_hh and created_min are
the clustering columns; together they make the row unique.

Primary key = partition key(s) + clustering column(s)

Naturally, by definition, the primary key must be unique

Visualizing partition key based storage


Question - how does C* keep the data based on
the primary key definition of a given table?
A contiguous chunk of data on disk per partition key

(Diagram: rows for partition key Meter_id_001 - 26-Jan-2014,
27-Jan-2014, 28-Jan-2014, ... 15-Dec-2014 - stored contiguously,
showing data growth per partition key.)

Failure Tolerance By Replication


Remember the CAP theorem? C* gives availability
and partition tolerance
Replication Factor (RF) = 3

Insert a record with PK=B

(Diagram: an 8-node ring; the replica sets {B, A, H}, {C, B, A} and
{D, C, B} show each node holding copies for three ranges.)

private static final String allEventCql = "insert into all_events (event_type, date, created_hh, created_min,
created_sec, created_nn, data) values(?, ?, ?, ?, ?, ?, ?)";
Session session = CassandraDAO.getEventSession();
BoundStatement boundStatement = getWritableStatement(session, allEventCql);
session.execute(boundStatement.bind(.., .., ..));

PK - data partitioning
Every node in the cluster is an owner of a range of tokens
that are calculated from the PK
After the client sends the write request, the coordinator
uses the configured partitioner to determine the
token value
Then it looks for the node that has the token range (primary
range) and puts the first replica there
A node can have other ranges as well, but has one primary
range
Subsequent replicas are placed with respect to the node of the first
copy/replica
Data with the same partition key will reside on the same
physical node
Clustering columns (in the case of a compound PK) do not
impact the choice of node
The default sort based on clustering columns is ASC

Coordinator Node
Incoming requests (read/write)
The coordinator handles the request

An interesting aspect to note is that the
data (each column) is timestamped using
the coordinator node's time
Every node can be a coordinator -> masterless

(Diagram: a client request arrives at coordinator n1 on an 8-node
ring, which forwards it to the replica nodes.)

Hinted handoff a bit later ...

Consistency
Tunable at runtime
ONE (default)
QUORUM (strict majority w.r.t. RF)
ALL
Applies both to reads and writes

protected static BoundStatement getWritableStatement(Session session, String cql,
        boolean setAnyConsistencyLevel) {
    PreparedStatement statement = session.prepare(cql);
    if (setAnyConsistencyLevel) {
        statement.setConsistencyLevel(ConsistencyLevel.ONE);
    }
    BoundStatement boundStatement = new BoundStatement(statement);
    return boundStatement;
}

Other consistency levels are ANY, LOCAL_QUORUM, TWO, THREE,
LOCAL_ONE (see the documentation on the web)

What is a Quorum?
quorum = RoundDown(sum_of_replication_factors / 2) + 1
sum_of_replication_factors = datacenter1_RF + datacenter2_RF + ...
+ datacenterN_RF

Using a replication factor of 3, a quorum is 2 nodes. The cluster can tolerate 1
replica down.
Using a replication factor of 6, a quorum is 4. The cluster can tolerate 2
replicas down.
In a two data center cluster where each data center has a replication factor of
3, a quorum is 4 nodes. The cluster can tolerate 2 replica nodes down.
In a five data center cluster where two data centers have a replication factor
of 3 and three data centers have a replication factor of 2, a quorum is 7 nodes.
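The formula above can be made runnable. This is a hypothetical helper (the class and method names are invented, not a Cassandra API):

```java
// Computes the quorum for a cluster given each data center's
// replication factor: quorum = floor(sum of RFs / 2) + 1.
class Quorum {
    static int quorum(int... dcReplicationFactors) {
        int sum = 0;
        for (int rf : dcReplicationFactors) sum += rf;
        return sum / 2 + 1; // integer division rounds down
    }

    public static void main(String[] args) {
        System.out.println(quorum(3));             // single DC, RF 3 -> 2
        System.out.println(quorum(6));             // single DC, RF 6 -> 4
        System.out.println(quorum(3, 3));          // two DCs, RF 3 each -> 4
        System.out.println(quorum(3, 3, 2, 2, 2)); // five DCs, sum 12 -> 7
    }
}
```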

Write Consistency
Write ONE
Send requests to all replicas in the
cluster applicable to the PK

(Diagram: coordinator n1 sends the write to the three replica nodes.)

Write Consistency
Write ONE
Send requests to all replicas in the
cluster applicable to the PK
Wait for ONE ack before returning to
the client

(Diagram: one replica acks after 5 s; that is enough for the
coordinator to return to the client.)

Write Consistency
Write ONE
Send requests to all replicas in the cluster
applicable to the PK
Wait for ONE ack before returning to the client
Other acks arrive later, asynchronously

(Diagram: replica acks arrive at 5 s, 10 s and 120 s; only the
first is waited for.)

Write Consistency
Write QUORUM
Send requests to all replicas
Wait for QUORUM acks before returning to the client

(Diagram: the coordinator returns after the two fastest acks
(5 s, 10 s) out of three replicas.)

Read Consistency
Read ONE
Read from one node among all replicas

(Diagram: the coordinator forwards the read to a single replica.)

Read Consistency
Read ONE
Read from one node among all replicas
Contact the fastest node (based on stats)

(Diagram: the coordinator picks the best-performing replica.)

Read Consistency
Read QUORUM
Read from the one fastest node

(Diagram: the coordinator requests the full data from the fastest
replica.)

Read Consistency
Read QUORUM
Read from the one fastest node
AND request digests from other replicas to
reach QUORUM

(Diagram: the coordinator gets the full data from one replica and
digests from the others.)

Read Consistency
Read QUORUM
Read from the one fastest node
AND request digests from other replicas
to reach QUORUM
Return the most up-to-date data to the client

Read repair a bit later ...

(Diagram: the coordinator merges the responses and returns the
latest value to the client.)

How to maintain consistency across nodes?

C* allows for tunable consistency and
hence we have eventual consistency
For whatever reason there can be
inconsistency of data among nodes
Inconsistencies are fixed using
Read repair - update data partitions
during reads (aka anti-entropy ops)
Nodetool - regular maintenance

Let's see consistency in action ...

Consistency in Action
RF = 3, Write ONE Read ONE

A write must be written to the commit log and
memtable of at least one replica node.

Write ONE: B

Data replication in progress..

Read ONE: A

Returns a response from the closest replica, as
determined by the snitch. By default, a read repair runs in
the background to make the other replicas consistent.

Consistency in Action
RF = 3, Write ONE Read QUORUM

A write must be written to the commit log and
memtable of at least one replica node.

Write ONE: B

Data replication in progress..

Read QUORUM: A

Returns the record after a quorum of replicas has
responded from any data center.

Consistency in Action
RF = 3, Write ONE Read ALL

A write must be written to the commit log and
memtable of at least one replica node.

Write ONE: B

Data replication in progress..

Read ALL: B

Returns the record after all replicas have
responded. The read operation will fail if a replica
does not respond.

Consistency in Action
RF = 3, Write QUORUM Read ONE

A write must be written to the commit log and
memtable on a quorum of replica nodes.

Write QUORUM: B

Data replication in progress..

Read ONE: A

Returns a response from the closest replica, as
determined by the snitch. By default, a read repair runs in
the background to make the other replicas consistent.

Consistency in Action
RF = 3, Write QUORUM Read QUORUM
Write QUORUM: B

Data replication in progress..

Read QUORUM: B


Consistency Level

ONE
Fast write; may not read the latest
written value

QUORUM / LOCAL_QUORUM
Strict majority w.r.t. the replication factor
Good balance

ALL
Not the best choice
Slow, no high availability

If (nodes_written + nodes_read) > replication factor,
then you get immediate consistency
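The overlap rule can be checked with a one-liner. The class and method names here are invented for illustration:

```java
// Immediate consistency holds when the write and read replica sets
// must overlap: nodes_written + nodes_read > replication_factor.
class ConsistencyCheck {
    static boolean immediatelyConsistent(int nodesWritten, int nodesRead, int rf) {
        return nodesWritten + nodesRead > rf;
    }

    public static void main(String[] args) {
        int rf = 3;
        int quorum = rf / 2 + 1; // 2
        // QUORUM writes + QUORUM reads: 2 + 2 > 3 -> immediately consistent
        System.out.println(immediatelyConsistent(quorum, quorum, rf)); // true
        // ONE + ONE: 1 + 1 > 3 is false -> only eventually consistent
        System.out.println(immediatelyConsistent(1, 1, rf)); // false
    }
}
```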

Debate - Immediate consistency


The following consistency levels give
immediate consistency
Write CL = ALL, Read CL = ONE
Write CL = ONE, Read CL = ALL
Write CL = QUORUM, Read CL = QUORUM

Debate: which CL combination would you
consider in which scenarios?
The combinations are
Few writes but read heavy
Heavy writes but few reads
Balanced

Hinted Handoff
A write request is received by the coordinator
The coordinator finds a node is down/offline
It stores "hints" on behalf of the offline node
if the coordinator knows in advance that the
node is down. A handoff is not taken if the
node goes down at the time of the write.
Read repair / nodetool repair will fix
inconsistencies.

(Diagram: coordinator n1 writes to the live replicas and stores
hints for the downed third replica.)

A hinted handoff is stored in the system.hints table

{Location of the replica, version meta, actual data}
It is taken only when the write consistency level can be satisfied and there is a
guarantee that any read consistency level can read the record

Hinted Handoff
When the offline node comes up, the
coordinator forwards the stored hints
The node syncs up its state with the
hints
The coordinator cannot perpetually hold
the hints

Configure an appropriate duration for any
node to hold hints - the default is 3 hrs

(Diagram: the recovered replica receives the stored hints from the
coordinator.)

Consistency level - ANY


This is a special consistency level
where the write will succeed using a
hinted handoff even if the replica
nodes are down
Use it very cautiously

What is an anti-entropy op?

During reads, the full data is requested
from ONE replica by the coordinator
Others send a digest, which is a hash
of the data, to the coordinator
The coordinator checks if the data and the
digests match
If they do not, then the coordinator
takes the latest (n4) and merges it into
the result
Then the stale nodes (n2, n3) are
updated automatically in the
background
Happens on a configurable % of reads
and not for all, because there is an
overhead
read_repair_chance is the
configuration (defaults to 10% or 0.1)
= a 10% chance that a read repair will
happen when a read comes in
This is defined per table / column family

(Diagram: n2 and n3 hold "Anderson" at ts=268 and ts=521; n4 holds
"Neo" at ts=851; the coordinator returns "Neo" and repairs n2 and n3.)

Repair with nodetool, a bit later ...
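The coordinator's merge step (latest timestamp wins) can be sketched like this, using the replica values from the diagram. It is a simplification of the real anti-entropy machinery, with invented names:

```java
import java.util.AbstractMap;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Sketch of the merge during a digest mismatch: the replica value
// with the highest write timestamp wins (last-write-wins).
class ReadRepairMerge {
    static Map.Entry<String, Long> latest(List<? extends Map.Entry<String, Long>> replicas) {
        return Collections.max(replicas, Comparator.comparingLong(Map.Entry::getValue));
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> replicas = Arrays.asList(
            new AbstractMap.SimpleEntry<>("Anderson", 268L),  // stale (n2)
            new AbstractMap.SimpleEntry<>("Anderson", 521L),  // stale (n3)
            new AbstractMap.SimpleEntry<>("Neo", 851L));      // latest (n4)
        System.out.println(latest(replicas).getKey()); // Neo
    }
}
```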

Part 5
Write and Read path


Why C* writes fast


Try reading a book that is organized as follows:
P1, P2, P3, P4, P5, P6, ...

How about a book that looks like:
P1, P23, P45, P2, P6, P99, P125, P8 ...

RDBMS: seeks and writes values to various pre-set locations

Cassandra: continuously appends, merging and compacting when
appropriate

Storage
Writes are harder than reads to scale
Spinning disks aren't good with
random I/O
Goal: minimize random I/O
Log Structured Merged Tree (LSM)
Push changes from a memory-based
component to one or more disk
components


Some more on LSM


An LSM-tree is composed of two or
more tree-like component data
structures
The smaller component C0 sits in
memory - aka memtables
The larger component C1 sits on disk -
aka SSTables (Sorted String Tables)
An immutable component

Writes are inserted into C0 and logged

In time, C0 is flushed to C1
C1 is optimized for sequential access
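A toy sketch of the C0/C1 idea in Java - not Cassandra's implementation; class and method names are invented for illustration:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Minimal LSM sketch: writes go to an in-memory sorted map (the
// "memtable"); a flush snapshots it into an immutable sorted table
// (the "SSTable"). Reads check the memtable first, then the newest
// flushed table.
class MiniLsm {
    private TreeMap<String, String> memtable = new TreeMap<>();
    private final List<SortedMap<String, String>> sstables = new ArrayList<>();

    void put(String key, String value) {
        memtable.put(key, value); // fast: purely in memory
    }

    void flush() {
        // Data is already sorted, so the on-disk write would be sequential
        sstables.add(Collections.unmodifiableSortedMap(memtable));
        memtable = new TreeMap<>();
    }

    String get(String key) {
        if (memtable.containsKey(key)) return memtable.get(key);
        // Newest SSTable wins - later generations shadow older ones
        for (int i = sstables.size() - 1; i >= 0; i--) {
            if (sstables.get(i).containsKey(key)) return sstables.get(i).get(key);
        }
        return null;
    }
}
```

This also illustrates the later slide's point that a table's data = its memtable + all of its flushed SSTables.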

Key components for Write


Each node implements the folloing key components (3 storage
mechanism and one process) to handle writes
1.Memtables - the in-memory tables corresponding to the CQL
tables along with the related indexes
2.CommitLog - an append-only log, replayed to restore node
Memtables
3.SSTables - Memtable snapshots periodically flushed (written
out) to free up heap
4.Compaction - periodic process to merge and streamline
multiple generations of SSTables
81

Cassandra Write Path


(Diagram: a single node in the cluster. (1) The coordinator appends the
client's write to the commit log on disk (durable writes = TRUE).
(2) The write is appended to the memtable corresponding to the CQL
table, in memory. Memtables are later flushed to SSTables on disk.)

Related configurations are commitlog_sync (batch/periodic),
commitlog_sync_batch_window_in_ms, commitlog_sync_period_in_ms

Cassandra Write Path


When a memtable exceeds a threshold, it is flushed to an SSTable, the
log is cleaned, and the heap corresponding to the memtable is cleared

A new SSTable is created based on the oldest commit log

A very fast sequential write operation

(Diagram: memtables in memory flushed to multiple generations of
SSTables on disk, which are compacted into one SSTable.)

select * from system.sstable_activity;

Writes are completely isolated from reads
No updated columns are visible until the entire row is finished
(technically, the entire partition)

Column family data state

A table's data
=
its memtable
+
all of its SSTables that have
been flushed

Memtables & SSTables


Memtables and SSTables are maintained per
table
SSTables are immutable - not written to again after the
memtable is flushed
A partition is typically stored across multiple SSTable
files
What is sorted? - Partitions are sorted and stored

For each SSTable, C* creates these structures

Partition index - a list of partition keys and the start positions of
rows in the data file (on disk)
Partition index summary (an in-memory sample to speed up reads)
Bloom filter (optional, and depends on the % false positive setting)
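Why a Bloom filter helps: it can answer "definitely not in this SSTable" without touching disk. A toy version - an invented class, not Cassandra's filter, which uses better hash functions and a tuned bit count:

```java
import java.util.BitSet;

// Toy Bloom filter: add() sets a few bits per key; mightContain()
// returns false only when the key was definitely never added
// (false positives are possible, false negatives are not).
class ToyBloomFilter {
    private final int size;
    private final BitSet bits;

    ToyBloomFilter(int size) {
        this.size = size;
        this.bits = new BitSet(size);
    }

    // Two cheap hash positions per key; real filters use stronger hashes
    private int[] positions(String key) {
        int h1 = Math.floorMod(key.hashCode(), size);
        int h2 = Math.floorMod(key.hashCode() * 31 + 17, size);
        return new int[]{h1, h2};
    }

    void add(String key) {
        for (int p : positions(key)) bits.set(p);
    }

    boolean mightContain(String key) {
        for (int p : positions(key)) {
            if (!bits.get(p)) return false; // definitely absent: no seek needed
        }
        return true; // possibly present: go read the partition index
    }
}
```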

Offheap memtables

As of C* 2.1 a new parameter has been introduced (memtable_allocation_type)


This capability is still under active development and performance over the
period of time is expected to get better

Heap buffers - the current default behavior

Off heap buffers - moves the cell/column name and value to DirectBuffer
objects. This has the lowest impact on reads since the values are still live
Java buffers, but it reduces heap significantly only when you are storing
large strings or blobs

Off heap objects - moves the entire cell off heap, leaving only the
NativeCell reference containing a pointer to the native (off-heap) data. This
makes it effective for small values like ints or uuids as well, at the cost of
having to copy it back on-heap temporarily when reading from it (likely to
become default in C* 3.0)

Writes are about 5% faster with offheap_objects enabled, primarily because Cassandra
doesn't need to flush as frequently. Bigger SSTables mean less compaction is needed.
Reads are more or less the same

For more information http://www.datastax.com/dev/blog/off-heap-memtables-in-cassandra-2-1


86

Commit log
It is replayed when a node goes down and
comes back up
This replay recreates the MemTables for that node
The commit log comprises segments (files) whose size
can be controlled (commitlog_segment_size_in_mb)
Total commit log space is a controlled parameter as well
(commitlog_total_space_in_mb)
Commit log entries are also accrued in memory and are
then written to disk in one of two ways
Batch - all acks to requests wait until the
commit log is flushed to disk
Periodic - the request is acked immediately but the commit
log is flushed only after some time. If a node goes
down in this window then data can be lost if RF=1
Debate - the "periodic" setting gives better performance and we should
use it. How to avoid the chances of data loss?
87

When does the flush trigger?

Memtable total space in MB reached
Default is 25% of JVM heap
Can be more in case of off heap
OR commit log total space in MB reached
OR nodetool flush (manual)
nodetool flush <keyspace> <table>

The default settings are usually good
and are there for a reason
88

Data file name structure


The data directories are created based on keyspaces and tables
.../data/keyspace/table

The actual DB file is named as


keyspace-table-format-generationNum-component

Component can be
CompressionInfo - compression info metadata
Data - PK, data size, column idx, row level tombstone info, column
count, column list in sorted order by name
Filter - Bloom filter
Index - index, also include Bloom filter info, tombstone
Statistics - histograms for row size, generation numbers of files from which
this SSTable was compacted
Summary - index summary (that is loaded in mem for read
optimizations)
TOC - list of files
Digest - Text file with a digest

Format - internal C* format eg "jb" is C* 2.0 format, 2.1 may have "ka"
89

Cassandra Read Path

SELECT ... FROM ... WHERE #partition = ...;

[Diagram: the query first checks the Row Cache, behind which sit SSTable1, SSTable2, SSTable3.]

The row cache will be checked if it is enabled

90

Cassandra Read Path

SELECT ... FROM ... WHERE #partition = ...;

On a row cache MISS the read proceeds to the Bloom filters.

[Diagram: Row Cache miss -> Bloom filter checks against SSTable1, SSTable2, SSTable3.]

One Bloom filter exists per SSTable and Memtable that the node is serving
A probabilistic filter that answers "maybe the partition key exists" in its
corresponding SSTable, or a definite "NO"
91
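The "maybe yes / definite no" behaviour can be sketched with a toy Bloom filter (illustrative only; C*'s implementation tunes the bit-array size and hash count from the configured false-positive chance):

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: k hash functions over a bit array of size m."""
    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, key):
        # Derive k independent positions from salted hashes of the key
        for i in range(self.k):
            h = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # True means "maybe present"; False is a definite NO
        return all(self.bits & (1 << pos) for pos in self._positions(key))

bf = BloomFilter()
bf.add("partition-key-42")
print(bf.might_contain("partition-key-42"))  # True - no false negatives
print(bf.might_contain("absent-key"))
```

A key that was added always answers True; an absent key almost always answers False, with a small false-positive chance governed by m and k.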

Cassandra Read Path

SELECT ... FROM ... WHERE #partition = ...;

[Diagram: Bloom filter -> Partition Key Cache, consulted for SSTable1, SSTable2, SSTable3.]

If any of the Bloom filters return a possible "YES" then the partition key
may exist in one or more of the SSTables
Proceed to look at the partition key cache for the SSTables for which there
is a probability of finding the partition key, ignore the other SSTable key
caches
Physically a single key cache is maintained ...

next, more on the partition key cache

92

Partition Key Cache

[Diagram: the partition key cache maps partition keys (#Partition001, #Partition002, #Partition350, ...) to byte offsets (0x0, 0x153, 0x5464321, ...) at which that partition's data starts inside the SSTable.]

The partition key cache stores the offset position of the partition keys
that have been read recently.

93

Cassandra Read Path

SELECT ... FROM ... WHERE #partition = ...;

[Diagram: Row Cache -> Bloom filter -> Partition Key Cache -> Key Index Sample -> SSTable2.]

If the partition key is not found in the key cache then the read proceeds to
check the "Key index sample" or "Partition summary" (in memory). It is a
subset of the "Partition Index", which is the full index of the SSTable
Next, a bit more on the key index sample ...

94

Key Index Sample

[Diagram: the key index sample maps every 128th partition key (#Partition001, #Partition128, #Partition256, #Partition512, ...) to offsets (0x0, 0x4500, 0x851513, 0x5464321, ...) within the SSTable's partition index.]

Think of the offset as an absolute disk address that fseek in C can use
The ratio of the key index sample is 1 sample per 128 keys
95
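The sampled-index lookup can be sketched as a binary search over the in-memory sample followed by a bounded scan of the on-disk index (the key names and offsets below are made up for illustration):

```python
from bisect import bisect_right

# Hypothetical full partition index: sorted (key, offset) pairs on disk
full_index = [(f"key{i:05d}", i * 100) for i in range(1000)]

# In-memory sample: every 128th entry, like the key index sample
SAMPLE_RATE = 128
sample = full_index[::SAMPLE_RATE]

def locate(key):
    """Find the key's data offset, scanning at most SAMPLE_RATE
    on-disk index entries starting from the sampled position."""
    keys = [k for k, _ in sample]
    i = bisect_right(keys, key) - 1  # last sample <= key
    if i < 0:
        return None
    start = i * SAMPLE_RATE
    for k, off in full_index[start:start + SAMPLE_RATE]:
        if k == key:
            return off
    return None

print(locate("key00300"))  # 30000
```

The sample keeps memory use at 1/128th of the full index while bounding the disk-side scan to one sample interval.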

Cassandra Read Path - complete flow

SELECT ... FROM ... WHERE #partition = ...;

[Diagram: Row Cache -> Bloom filter -> Partition Key Cache -> Key Index Sample -> Partition Index -> SSTable, plus the MemTable. The coordinator merges based on timestamp as well as on the consistency level.]

Merge based on the most recent timestamp of the columns from
1. the Memtable
2. one or more SSTables
more on the merge process a bit later in LWW ...
Also update the row cache (if enabled) and the key cache

96

Bloom filters in action

Setting change takes effect when the SSTables are regenerated


In development env you can use nodetool/CCM scrub BUT it is time intensive
97

Row caching

Caches are also written to disk so that they come alive after a restart
Global settings are also possible in the cassandra.yaml file

98

Key caching

Stored per SSTable even though it is a common store for all SSTables
Stores only the key
Reduces the seek to just one read per replica (does not have to look into
the disk version of the index, which is the full index)
It is enabled by default
It also gets saved to disk periodically

Changing index_interval increases/decreases the gap
Increasing the gap means more disk hits to get to the partition key index
Reducing the gap means more memory but gets to the partition key without looking into
the disk-based partition index
99

Eager retry

If a node is slow in responding to a request, the coordinator forwards it to
another node holding a replica of the requested partition

Valid if RF > 1
C* 2.0+ feature

[Diagram: a client driver reads <pk 91> through a coordinator; when Node 1 is slow, the coordinator retries the read against another replica such as Node 4.]
ALTER TABLE users WITH speculative_retry = '10ms';


Or,
ALTER TABLE users WITH speculative_retry = '99percentile';

100
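The speculative-retry idea can be sketched with asyncio (node names and latencies below are hypothetical; the real coordinator logic is more involved):

```python
import asyncio

async def read_replica(name, delay, value):
    # Simulated replica read with a fixed latency (hypothetical values)
    await asyncio.sleep(delay)
    return f"{value}@{name}"

async def speculative_read(threshold=0.05):
    # Fire the read at one replica first
    primary = asyncio.ensure_future(read_replica("node1", 0.5, "pk91"))
    done, _ = await asyncio.wait({primary}, timeout=threshold)
    if done:
        return primary.result()
    # The replica is slow past the threshold: speculatively retry on
    # another replica and accept whichever answers first
    backup = asyncio.ensure_future(read_replica("node4", 0.01, "pk91"))
    done, pending = await asyncio.wait(
        {primary, backup}, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)
    return done.pop().result()

result = asyncio.run(speculative_read())
print(result)  # pk91@node4 - the backup replica won the race
```

This is the trade-off speculative_retry tunes: a lower threshold cuts tail latency at the cost of extra read load on the cluster.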

Last Write Win (LWW)


INSERT INTO users(login,name,age) VALUES ('jdoe', 'John DOE', 33);

#Partition jdoe: Age = 33, Name = 'John DOE'

101

Last Write Win (LWW)


INSERT INTO users(login,name,age) VALUES ('jdoe', 'John DOE', 33);
An auto-generated timestamp (in microseconds) is assigned by the
coordinator; t1 is associated with the columns

jdoe: Age(t1) = 33, Name(t1) = 'John DOE'

102

Last Write Win (LWW)


UPDATE users SET age = 34 WHERE login = 'jdoe';

Assume a flush occurs

SSTable1 - jdoe: Age(t1) = 33, Name(t1) = 'John DOE'
SSTable2 - jdoe: Age(t2) = 34

Remember that SSTables are immutable; once written they cannot be updated.
The update creates a new SSTable
103

Last Write Win (LWW)


DELETE age from users WHERE login = 'jdoe';

A tombstone marks the deleted column

SSTable1 - jdoe: Age(t1) = 33, Name(t1) = 'John DOE'
SSTable2 - jdoe: Age(t2) = 34
SSTable3 - jdoe: Age(t3) = tombstone

104

Last Write Win (LWW)


SELECT age from users WHERE login = 'jdoe';

Where to read from? How to construct the response?

SSTable1 - jdoe: Age(t1) = 33, Name(t1) = 'John DOE'
SSTable2 - jdoe: Age(t2) = 34
SSTable3 - jdoe: Age(t3) = tombstone

105

Last Write Win (LWW)


SELECT age from users WHERE login = 'jdoe';

SSTable1 - jdoe: Age(t1) = 33, Name(t1) = 'John DOE'
SSTable2 - jdoe: Age(t2) = 34
SSTable3 - jdoe: Age(t3) = tombstone

The cell with the most recent timestamp (t3, the tombstone) wins, so the
read returns no value for age

106
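The last-write-wins reduction across the fragments above can be sketched as follows (a simplification; real cells also carry TTLs, and the merge includes the memtable):

```python
# Each SSTable fragment: {column: (timestamp, value)}; value None = tombstone
sstable1 = {"age": (1, 33), "name": (1, "John DOE")}
sstable2 = {"age": (2, 34)}
sstable3 = {"age": (3, None)}  # tombstone from the DELETE

def merge(*fragments):
    """Last-write-wins: keep the highest-timestamp cell per column,
    then drop columns whose winning cell is a tombstone."""
    winners = {}
    for frag in fragments:
        for col, (ts, val) in frag.items():
            if col not in winners or ts > winners[col][0]:
                winners[col] = (ts, val)
    return {c: v for c, (ts, v) in winners.items() if v is not None}

print(merge(sstable1, sstable2, sstable3))  # {'name': 'John DOE'}
```

The same rule drives both the read-path merge and compaction: only timestamps decide, which is why cluster clock skew (see the NTP slide) is so dangerous.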

Compaction

SSTable1 - jdoe: Age(t1) = 33, Name(t1) = 'John DOE'
SSTable2 - jdoe: Age(t2) = 34
SSTable3 - jdoe: Age(t3) = tombstone

New SSTable - jdoe: Name(t1) = 'John DOE'

Related SSTables are merged
The most recent version of each column is compiled into one partition in one new SSTable
Columns marked for eviction/deletion are removed
Old generation SSTables are deleted
A new generation SSTable is created (hence the generation number keeps increasing over time)

Disk space freed = sizeof(SST1 + SST2 + .. + SSTn) - sizeof(new compact SST)

Commit logs (containing segments) are versioned and reused as well
Newly freed space becomes available for reuse
107

Compaction

As per DataStax documentation there are three strategies
SizeTiered (default)
Leveled (needs about 50% more IO than size tiered but the number of SSTables
visited for data will be less)
DateTiered

The choice depends on the disk
Mechanical disk = SizeTiered
SSD = Leveled

Cassandra 2.1 improves read performance after compaction by performing an
incremental replacement of compacted SSTables.
Instead of waiting for the entire compaction to finish and then throwing away the old
SSTable (and cache), Cassandra can read data directly from the new SSTable even
before it finishes writing.

The choice depends on the use case too
Size - best when you have insert-heavy, read-light workloads
Level - best suited for column families with read-heavy workloads that have
frequent updates to existing rows
Date - for time-series data along with TTL
108

SizeTiered compaction - a brief

With size-tiered compaction, similarly sized SSTables are
compacted into larger SSTables once a certain number have
accumulated (default 4)
A row can exist in multiple
SSTables, which can result
in a degradation of read
performance. This is
especially true if you
perform many updates or
deletes with high reads
It can be a good strategy
for write-heavy scenarios

109
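The "similarly sized" grouping can be sketched as bucketing by size (thresholds here are illustrative, loosely modeled on the strategy's bucket_low/bucket_high and min_threshold options):

```python
def bucket_by_size(sstable_sizes, bucket_low=0.5, bucket_high=1.5):
    """Group SSTable sizes into buckets of 'similar' size
    (within [bucket_low, bucket_high] of the bucket average)."""
    buckets = []  # each bucket is a list of sizes
    for size in sorted(sstable_sizes):
        for b in buckets:
            avg = sum(b) / len(b)
            if bucket_low * avg <= size <= bucket_high * avg:
                b.append(size)
                break
        else:
            buckets.append([size])
    return buckets

def compaction_candidates(sstable_sizes, min_threshold=4):
    # A bucket is compacted once it accumulates min_threshold tables
    return [b for b in bucket_by_size(sstable_sizes) if len(b) >= min_threshold]

sizes = [10, 11, 9, 10, 200, 210, 5000]
print(compaction_candidates(sizes))  # [[9, 10, 10, 11]]
```

Only the bucket of four small tables qualifies; the two mid-size tables and the large one wait until enough peers of similar size accumulate.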

LeveledTiered compaction
Leveled compaction creates SSTables of a
fixed, relatively small size (5MB by default
in Cassandra's implementation) that are
grouped into levels.
Within each level, SSTables are guaranteed
to be non-overlapping. Each level is ten
times as large as the previous
Leveled compaction guarantees that 90%
of all reads will be satisfied from a single
SSTable (assuming nearly-uniform row
size). The worst case is bounded by the total
number of levels, e.g. 7 for 10TB of data
110

DateTiered compaction
This particularly applies to time-series
data where the data lives for a specific
time
Use the DateTiered compaction strategy
C* looks at the min and max timestamp of
the SSTable to find out if anything in it is
still live
If not then that SSTable is simply unlinked
and compaction with another SSTable is
avoided entirely
Set the TTL in the CF definition so that
code that inserts without a TTL does not
cause unnecessary compaction
111

"Coming back to life"


Let's assume multiple replicas exist for a given column
One node was NOT able to record a tombstone because it
was down
If this node remains down past gc_grace_seconds
without a repair then it will still contain the old record
Other nodes meanwhile have evicted the tombstone
column
When the downed node comes back up it will bring the old
column back to life on the other nodes!
To ensure that deleted data never resurfaces, make sure to
run repair at least once every gc_grace_seconds, and never
let a node stay down longer than this time period

112

NTP is very important


A micro second time stamp is associated
with the columns
The timestamp used is of the machine in
which C* is running (coordinator)
If the timing of the machines in the cluster
is off then C* will be unable to determine
which record is latest!
This also means that the compaction may
not function
You can have a scenario where a deleted
record appears again, or get an unpredictable
number of records for your query
113

Lightweight transactions
Transactions are a bit different
"Compare and Set" model

Two requests can race to create a single value
One will succeed (uses the Paxos algorithm; Wikipedia has good
text on this subject)
The other will fail

INSERT INTO customer_account (customerID, customer_email)
VALUES ('LauraS', 'lauras@gmail.com')
IF NOT EXISTS;

UPDATE customer_account
SET customer_email='laurass@gmail.com'
IF customerID='LauraS';

Cassandra 2.1.1 and later supports non-equal conditions for lightweight transactions. You can
use <, <=, >, >=, != and IN operators in WHERE clauses.
114
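The compare-and-set semantics of IF NOT EXISTS can be sketched in miniature (a single process-local lock stands in for the Paxos round; the real distributed protocol is of course far more involved):

```python
import threading

class CasStore:
    """Toy compare-and-set store mimicking INSERT ... IF NOT EXISTS."""
    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()  # stands in for the Paxos round

    def insert_if_not_exists(self, key, value):
        # Returns (applied, value) much like the [applied] column
        # a CQL lightweight transaction reports back
        with self._lock:
            if key in self._data:
                return False, self._data[key]
            self._data[key] = value
            return True, value

store = CasStore()
print(store.insert_if_not_exists("LauraS", "lauras@gmail.com"))
print(store.insert_if_not_exists("LauraS", "other@gmail.com"))
```

The first insert is applied; the racing second one fails and sees the winning value, mirroring the one-succeeds/one-fails behaviour described above.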

Triggers
The actual trigger logic resides
outside in a Java POJO
Feature may still be experimental
Has write performance impact
Better to have application logic do
any pre-processing
CREATE TRIGGER myTrigger
ON myTable
USING 'Java class name';
115

Debate - Locking in C*

[Diagram: multiple application instances use an external store (Memcached / Redis) to acquire a lock on a key before claiming an ID from C*.]

Step 1. Acquire a lock (if this FAILS, look for the next key)
Step 2. Get an available ID that has flag=0 and starts with 0, then delete that key from C*
Step 3. Release the lock

The other instances work exactly like instance 1. They acquire locks on other keys so that no two instances step on each other

k=_0_0 v=<String>
k=_0_a v=<String>
..
k=_z_z v=<String>

Primary key = ((flag, starts_with), gen_id)

116

Part 6
Modeling

117

Before you model ...

[Diagram: an end user consumes reports and dashboards that demand fast responses. In the RDBMS way, the structure drives the model and the data follows. In the C* way, the business question drives the keys, storage, and data types.]

118

QDD

C* modelling is "Query Driven Design"

Give me the query and I will organize the data, via a
model, to get you performance for that query.

119

Denormalization
Unlike RDBMS you cannot use any foreign
key relationships
FK and joins are not supported anyway!
Embed details, it will cause duplication but
that is alright
Helps in getting the complete data in a
single read
More efficient, less random I/O
May have to create more than one
denormalized table to serve different
queries
UDT is a great way (UDT a bit later...)
120

Datatypes

int, decimal, float, double
varchar, text
Collections - map (K=V), list (ordered values), set (unordered unique values)
Designed to store a small amount of data (in the hundreds of items)
Limit is 64K items
Max size per element is 64KB
Cannot be part of the primary key
Nesting is not allowed
timestamp, timeuuid
counter
User defined types (UDT)

There are other datatypes as well
position frozen <tuple<float,float,float>> - useful in 3D coordinate geometry
blob - good practice is to gzip and then store

Static modifier - the value remains the same across the partition, saving storage space, BUT it is a very
specific use case

121

Choice of partition key


If a key is not unique then C* will upsert
with no warning!
Good choices
Natural key like an email address or meter id
Surrogate keys like uuid, timeuuid (uuid with
time component)

CQL provides a number of timeuuid


functions
now()
dateof(tuuid) - extracts the timestamp as date
unixtimestampof - raw 64 bit integer
122

Wide row? Use composite PK


Only event type as the PK will store all events ever generated in a single
partition. That can result in a huge partition and may breach C* limits
CREATE TABLE all_events (
event_type int,
date int, //format as yyyymmdd
created_hh int,
created_min int,
created_sec int,
created_nn int,
data text,
PRIMARY KEY((event_type, date), created_hh)
);
Example of a compound PK ((...), ...) syntax
CREATE TABLE trunc_events (
event_type int,
date text, // format as yyyy-mm-dd
created_on timestamp,
data text,
PRIMARY KEY((event_type, date), created_on)
) WITH CLUSTERING ORDER BY (created_on DESC);
insert into trunc_events (event_type, date, created_on, data) values(1, '2014-06-24', '2014-06-26 12:47:54',
'{"event" : "details of the event will go here"}') USING TTL 300;
123

Partition key IN clause debate

where meter_id = 'Meter_id_001' and date IN ( ... )

[Diagram: partition keys (Meter_id_001, 26-Jan-2014), (Meter_id_001, 27-Jan-2014), (Meter_id_001, 28-Jan-2014), ... (Meter_id_001, 15-Dec-2014). Each date lands on a different partition, so the total fetch time T = (t1, t2, t3, ... tn), one component per partition touched.]

Is T predictable? What are the factors responsible for
determining the overall fetch time T?
This is also called a "Multi get slice"
124

Partition key IN clause (Slicing)

If you want to search across two partition keys then use
"IN" in the where clause. You can have a further limiting
condition based on clustering columns.

In a composite PK, IN works only on the last key. E.g. for PK ((key1, key2),
clust1, clust2), IN will work only on key2 (the last partition key column).
125

Modeling notation
Chen's notation for conceptual modeling
helps in design

126

Sample model contd ...

127

Relationship table and Dup tables


Relationship tables are not uncommon to support different fetch
criteria - get the list of stocks given an industry in this case

A duplicate table with just a different sort order of the clustering
column is also fine, e.g.
PRIMARY KEY ((stock_symbol, trade_date), trade_time) can give ONE symbol and a
range of dates
PRIMARY KEY ((trade_date, stock_symbol), trade_time) can give ONE date and a
range of symbols

Tables catering specifically to a single query are also a correct approach


128

Bitmap type of index example


Multiple parts to a key
Create a truth table of the various combinations
However, inserts will multiply by the number of
combinations
E.g. find a car (vehicle id) in a car park by
variable combinations

129

Bitmap type of index example


CREATE TABLE skldb.car_model_index (
make varchar,
model varchar,
color varchar,
vehicle_id int,
lot_id int,
PRIMARY KEY ((make, model, color), vehicle_id)
);
We are pre-optimizing for 7 possible queries of the index on insert.

INSERT INTO skldb.car_model_index (make, model, color, vehicle_id, lot_id)
VALUES ('Ford', 'Mustang', 'Blue', 1234, 8675309);
INSERT INTO skldb.car_model_index (make, model, color, vehicle_id, lot_id)
VALUES ('Ford', 'Mustang', '', 1234, 8675309);
INSERT INTO skldb.car_model_index (make, model, color, vehicle_id, lot_id)
VALUES ('Ford', '', 'Blue', 1234, 8675309);
INSERT INTO skldb.car_model_index (make, model, color, vehicle_id, lot_id)
VALUES ('Ford', '', '', 1234, 8675309);
INSERT INTO skldb.car_model_index (make, model, color, vehicle_id, lot_id)
VALUES ('', 'Mustang', 'Blue', 1234, 8675309);
INSERT INTO skldb.car_model_index (make, model, color, vehicle_id, lot_id)
VALUES ('', 'Mustang', '', 1234, 8675309);
INSERT INTO skldb.car_model_index (make, model, color, vehicle_id, lot_id)
VALUES ('', '', 'Blue', 1234, 8675309);

130

Bitmap type of index example


select * from skldb.car_model_index
where make='Ford'
and model=''
and color='Blue';

select * from skldb.car_model_index


where make=''
and model=''
and color='Blue';

YES, there will be more writes and more data BUT that is an acceptable
fact in big data modeling that looks to optimize the query and user
experience more than data normalization
131
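Generating the seven insert variants by hand is error-prone; the combinations can be derived programmatically (a sketch - every way of blanking key parts except the all-wildcard row):

```python
from itertools import product

def index_rows(make, model, color):
    """Generate the 7 pre-optimized key variants: every combination of the
    three key parts with at least one part filled in ('' = wildcard)."""
    rows = []
    for keep in product([True, False], repeat=3):
        if not any(keep):
            continue  # skip the all-wildcard row: it would match everything
        rows.append((make if keep[0] else "",
                     model if keep[1] else "",
                     color if keep[2] else ""))
    return rows

rows = index_rows("Ford", "Mustang", "Blue")
print(len(rows))  # 7
```

In general n key parts give 2^n - 1 variants, which is exactly the write amplification this technique accepts in exchange for fast lookups.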

Counters & Collections


CREATE KEYSPACE counter WITH REPLICATION = {
'class' : 'NetworkTopologyStrategy',
'datacenter1' : 3
};
use counter;
CREATE TABLE event_counter (
event_type int,
event_name varchar,
counter_value counter,
PRIMARY KEY (event_type, event_name)
);
update event_counter
set counter_value = counter_value + 1
WHERE event_type=0 AND event_name='Login';
CREATE TABLE users (
user_id text PRIMARY KEY,
first_name text,
last_name text,
emails set<text>,
todo map<timestamp, text>
);
INSERT INTO users (user_id, first_name, last_name, emails)
VALUES('nm', 'Nirmallya', 'Mukherjee', {'nm1@email.com', 'nm2@email.com'});

132

TTL - Time To Live


CREATE TABLE users (
user_id text PRIMARY KEY,
first_name text,
last_name text,
emails set<text>,
todo map<timestamp, text>
);
INSERT INTO users (user_id, first_name, last_name, emails, todo)
VALUES('nm', 'Nirmallya', 'Mukherjee', {'nm1@email.com', 'nm2@email.com'},
{'2014-12-25' : 'Christmas party', '2014-12-31' : 'At Gokarna'}
);
UPDATE users USING TTL <computed_ttl>
SET todo['2012-10-1'] = 'find water' WHERE user_id = 'nm';

Rate limiting is a good idea in many cases. E.g. allow password resets up to
only 3 per day per user. Use a TTL sliding window.
A promotion is generated for a customer with a TTL.
A temporary login is generated and is valid for a given TTL period.
Table level TTL overrides row level TTL specification.

133
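The TTL sliding-window idea - e.g. three password resets per day - can be sketched with expiring entries (in C* each attempt would be a row written USING TTL; here expiry is simulated with timestamps):

```python
import time

class RateLimiter:
    """Sliding-window limiter mimicking TTL'd rows: each attempt is an
    entry that expires after `window` seconds, and at most `limit`
    live entries are allowed per user."""
    def __init__(self, limit=3, window=86400):
        self.limit, self.window = limit, window
        self.attempts = {}  # user -> list of expiry timestamps

    def allow(self, user, now=None):
        now = time.time() if now is None else now
        # Expired entries drop out, just as TTL'd rows vanish in C*
        live = [t for t in self.attempts.get(user, []) if t > now]
        if len(live) >= self.limit:
            self.attempts[user] = live
            return False
        live.append(now + self.window)
        self.attempts[user] = live
        return True

rl = RateLimiter(limit=3, window=86400)
print([rl.allow("jdoe", now=0) for _ in range(4)])  # [True, True, True, False]
print(rl.allow("jdoe", now=90000))  # True - the earlier attempts expired
```

In the C* version the read of live attempts is just a partition query, since expired rows are filtered out by the database itself.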

UDT - User Defined Types


create keyspace address_book with replication = { 'class':'SimpleStrategy', 'replication_factor': 1 };
CREATE TYPE address_book.address (
street text,
city text,
zip_code int,
phones set<text>
);
CREATE TYPE address_book.fullname (
firstname text,
lastname text
);
CREATE TABLE address_book.users (
id int PRIMARY KEY,
name frozen <fullname>,
addresses map<text, frozen <address>>
);

// a collection map

INSERT INTO address_book.users (id, name) VALUES (1, {firstname: 'Dennis', lastname: 'Ritchie'});
UPDATE address_book.users SET addresses = addresses + {'home': { street: '9779 Forest Lane', city: 'Dallas',
zip_code: 75015, phones: {'001 972 555 6666'}}} WHERE id=1;
SELECT name.firstname from users where id = 1;
CREATE INDEX on address_book.users (name);
SELECT id FROM address_book.users WHERE name = {firstname: 'Dennis', lastname: 'Ritchie'};

134

UDT mapping to a JSON

135

Queries
CREATE TABLE mailbox (
login text,
message_id timeuuid,
interlocutor text,
message text,
PRIMARY KEY(login, message_id)
);
Get message by user and message_id (date)
SELECT * FROM mailbox
WHERE login='jdoe'
and message_id ='2014-07-12 16:00:00';
Get message by user and date interval
SELECT * FROM mailbox
WHERE login='jdoe'
and message_id <='2014-07-12 16:00:00'
and message_id >='2014-01-12 16:00:00';

136

Debate - are these correct?


Get message by message_id
SELECT * FROM mailbox WHERE message_id ='2014-07-12 16:00:00';

Get message by date interval

SELECT * FROM mailbox WHERE
message_id <='2014-07-12 16:00:00'
and message_id >='2014-01-12 16:00:00';

137

Debate - are these correct?


Get message by user range (range query on #partition)
SELECT * FROM mailbox WHERE login >= 'hsue' and login <= 'jdoe';

Get message by user pattern (not an exact match on #partition)

SELECT * FROM mailbox WHERE login like '%doe%';

138

WHERE clause restrictions

All queries (INSERT/UPDATE/DELETE/SELECT) must provide the #partition (WHERE can
then filter within contiguous data)

"ALLOW FILTERING" can override the default behaviour of cross-partition access but it is
not a good practice at all (it indicates an incorrect data model)
select .. from meter_reading where record_date > '2014-12-15'
(assuming there is a secondary index on record_date)

Only exact match (=) predicates on #partition; range queries (<, <=, >, >=) are not
allowed (which means there are no multi-row updates)

In case of a compound partition key, IN is allowed on the last column of the compound
key (if the partition key has only one column then it works on that column)

On clustering columns, exact match and range query predicates (<, <=, >, >=, IN) are
allowed

WHERE clause is only possible on columns defined in primary key

Order of the filters must match the order of primary key definition otherwise create
secondary index (anti-pattern)

139

Order by restrictions
If the primary key is (industry_id,
exchange_id, stock_symbol) then the
following sort orders are valid
order by exchange_id desc, stock_symbol desc
order by exchange_id asc, stock_symbol asc

Following is an example of invalid


sort order
order by exchange_id desc, stock_symbol asc
140

Secondary Index

Consider the earlier example of the table all_events - what will happen if we
try to get the records based on minute?

Create a secondary index based on the minute
Can be created on any column except counter, static and collection columns
It is actually another table with key=<indexed column value> and columns=the keys
that contain that value
Secondary indexes are for searching convenience

Do not use secondary indexes
On high-cardinality columns, because you then query a huge volume of records
for a small number of results
In tables that use a counter column
On a frequently updated or deleted column (too many tombstones)
To look for a row in a large partition unless narrowly queried
For high volume queries

Use with low-cardinality columns
Columns that may contain a relatively small set of distinct values
Use when prototyping, ad-hoc querying or with smaller datasets

See this link for more details (http://www.datastax.com/documentation/cql/3.1/cql/ddl/ddl_when_use_index_c.html)

141

Secondary Index - rebuilding


While Cassandra's performance will always be best using
partition-key lookups, secondary indexes provide a nice
way to query data on small or offline datasets
Occasionally, secondary indexes can get out of sync
Unfortunately, there is no great way to verify secondary
indexes other than rebuilding them
You can verify with a query that uses the indexes, and look for
significant inconsistencies across different nodes
Also note that each node keeps its own index (they are not
distributed), so you'll need to run this on each node
Must run nodetool repair to ensure the index gets built on
current data

nodetool rebuild_index my_keyspace my_column_family


142

CQL Limits
Limits (as per Datastax documentation)
Clustering column value, length of: 65535
Collection item, value of: 2GB (Cassandra 2.1 v3
protocol), 64K (Cassandra 2.0.x and earlier)
Collection item, number of: 2B (Cassandra 2.1 v3
protocol), 64K (Cassandra 2.0.x and earlier)

Columns in a partition: 2B
Fields in a tuple: 32768, try not to have 1000s of
fields
Key length: 65535
Query parameters in a query: 65535
Single column, value of: 2GB, xMB are
recommended
Statements in a batch: 65535
143

C* batch - Is it always a good idea?

Individual (non-batched) statements leverage token-aware policies
Cluster load is evenly balanced
If one fails we re-try ONLY that query
Not an all-or-none success model

Tip: batches work well ONLY if all records have the same partition key
In this case all records go to the same node.

144

Batch - use case


Keep two tables in synch - same event being stored in two
tables, one by customer ID and the other by staff ID
Emulating a classical database transaction - all or nothing!

145

Antipatterns - things to watch out for!

Singleton messages - an add, consume, remove loop (using C* as a queue)
Updating columns to null in CQL - a tombstone is created for each null
Intense updates to a single column
Dynamic schema changes (frequent changes) and topology changes in
production are not a good idea
146

But... I need a queue!

Break the partition key to carry some
time buckets, e.g. Queue1 in the last
hour
At select time you can use the
clustering column, e.g. Queue1 with
timestamp clustering column >
SOME time. This helps in not scanning
through all those tombstones

147

Part 7
Applications with Cassandra

148

Implementing the architecture - Data


Isolate data change rate
Not changing (lookups)
Moderate changing
Fast changing

Treat slow changing data differently from fast


changing data
Caching infrequently changing data helps
Cache warmup model needs to be in place

Cache redundancy is needed for high volume reads
(primary - secondary model)
A mutex model should be in place as well, in addition to the
warmup of values
149

The Client DAO and the Driver


Session instances are thread-safe and usually a
single instance is all you need per application.
However, a given session can only be set to one
keyspace at a time, so one instance per keyspace
is necessary
Use policies
Retry
Reconnect
LoadBalancing

http://www.datastax.com/documentation/developer/java-driver/2.1/javadriver/fourSimpleRules.html
150

C* client sample

151

C* client sample

http://www.datastax.com/drivers/java/2.1/index.html

152

C* client sample - the entity

153

C* client sample

See this to work with UDT -http://www.datastax.com/documentation/developer/javadriver/2.1/java-driver/reference/udtApi.html


154

Asynch calls in the DAO


A careful design can yield many
benefits such as asynch data
acquisition while doing something
else
Consider this example:
ResultSetFuture futureRS = session.executeAsync(query);
// do other processing here
ResultSet rs = futureRS.getUninterruptibly();

What benefits do you see?

155

Part 8
Administration

156

C* != (Fire and forget)

C* is not a fire and forget system.


It does require administration
and monitoring to ensure the
infrastructure performs optimally
at all times.

157

Monitoring - Opscenter
Workload modeling
Workload characterization
Performance characteristics of the
cluster
Latency analysis of the cluster
Performance of the OS
Disk utilization
Read / Write operations per second
OS Memory utilization
Heap utilization
158

Monitoring - Datastax Opscenter

159

Monitoring - Datastax Opscenter

160

Monitoring - Datastax Opscenter

161

Monitoring - Datastax Opscenter

162

Monitoring - Datastax Opscenter

163

Monitoring - Datastax Opscenter

164

Monitoring - Datastax Opscenter

Dropped MUTATION messages - this means that the mutation was not applied to all replicas it
was sent to. The inconsistency will be repaired by Read Repair or Anti-Entropy Repair (perhaps
because of load C* is defending itself by dropping messages).
165

Monitoring - Datastax Opscenter

166

Monitoring - Datastax Opscenter

167

Monitoring - Datastax Opscenter

168

Monitoring - Node coming up

169

Node down

170

Node tool
Very important tool to manage C* in
production for day to day admin
Nodetool has over 40 commands
Can be found in the
$CASSANDRA_HOME/bin folder
Try a few commands
./nodetool -h 10.21.24.11 -p 7199 status
./nodetool -h 10.21.24.11 -p 7199 info

171

Node repair

172

Node repair
How to compare two large data sets?
Assume two arrays
Each containing 1 million numbers
Further assume the order of storage is fixed
How to compare the two arrays?
Potential solution
Loop over one array and check against the other?
How long will it take to loop over 1 million?
What happens if the data gets into the billions?
Lots of inefficiencies!
173

Wikipedia definition of Merkle Tree

Merkle tree is a tree in which every non-leaf node is labelled with the
hash of the labels of its children nodes. Hash trees are useful because
they allow efficient and secure verification of the contents of large
data structures

Currently the main use of hash trees is to make sure that data blocks
received from other peers in a peer-to-peer network are received
undamaged and unaltered, and even to check that the other peers do
not lie and send fake blocks

A hash tree is a tree of hashes in which the leaves are hashes of data
blocks in, for instance, a file or set of files. Nodes further up in the
tree are the hashes of their respective children.

For example, in the picture hash 0 is the result of hashing the result of
concatenating hash 0-0 and hash 0-1. That is, hash 0 = hash( hash 0-0 + hash
0-1 ) where + denotes concatenation
174
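A minimal Merkle tree comparison can be sketched as follows (assumes a power-of-two number of blocks; a real repair descends the tree to localize differences rather than jumping straight to the leaves as this sketch does):

```python
import hashlib

def h(data):
    return hashlib.sha256(data).hexdigest()

def build_tree(blocks):
    """Build a Merkle tree bottom-up; returns the list of levels, root last.
    Assumes len(blocks) is a power of two so every level pairs cleanly."""
    level = [h(b) for b in blocks]
    levels = [level]
    while len(level) > 1:
        # Each parent is the hash of the concatenation of its two children
        level = [h((level[i] + level[i + 1]).encode())
                 for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def diff_blocks(a, b):
    """Return indices of mismatched blocks between two block lists."""
    ta, tb = build_tree(a), build_tree(b)
    if ta[-1] == tb[-1]:
        return []  # identical roots -> identical data, nothing to transfer
    # Shortcut: compare leaf hashes directly once the roots differ
    return [i for i, (x, y) in enumerate(zip(ta[0], tb[0])) if x != y]

a = [b"block0", b"block1", b"block2", b"block3"]
b = [b"block0", b"block1", b"CHANGED", b"block3"]
print(diff_blocks(a, a))  # []
print(diff_blocks(a, b))  # [2]
```

When the roots match, a single hash comparison proves the billion-row datasets equal - this is the efficiency the previous slide's looping "solution" lacks.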

Node repair using Nodetool

The prime objective of node repair is to detect and fix
inconsistencies among nodes
A per-node Merkle tree is created
It provides an efficient way to find differences in data
blocks
Reduces the amount of data transferred to compare the
data blocks
C* uses a fixed depth tree of
15 levels having 32K leaf
nodes
Leaf nodes can represent a
range of data because the
depth is fixed

175

Node repair process


When nodetool repair command is executed, the target node
specified with -h option in the command, coordinates the repair of
each column family in each keyspace
A repair coordinator node requests Merkle tree from each replica
for a specific token range to compare them
Each replica builds a Merkle tree by scanning the data stored
locally in the requested token range.

This can be IO intensive but can be controlled by providing a partition range,
e.g. the primary range - but you will still be building the Merkle tree in parallel
You can specify the start token to repair the range for that token
Finally, you can repair one DC at a time by passing the "local" parameter
Set the compaction and streaming thresholds (next slide)

The repair coordinator node compares the Merkle trees and finds
all the sub token ranges that differ between the replicas and
repairs data in those ranges
176

Nodetool repair - potential strategy


The system.local table has all the tokens that the node is
responsible for
Create a script that runs a repair using -st <token> -et <same
token> in a loop over all tokens for the given node
Take care to connect to the node using -h and to use
parallel execution (-par)
This allows the repair to run on small ranges and in short time
increments
Spool the output and analyze it; if anything went wrong, send an
email or any other form of notification to alert the admin
You may get an exception like the following if the start
and end tokens are not the same

ERROR [Thread-1099288] 2015-03-14 02:48:08,822 StorageService.java
(line 2517) Repair session failed: java.lang.IllegalArgumentException:
Requested range intersects a local range but is not fully contained in one;
this would lead to imprecise repair

This does not mean there is a problem
177
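The looping strategy above can be sketched in Python. The host address, keyspace name, and token values are hypothetical, and in practice the token list would be read from system.local; the sketch defaults to a dry run that only prints the commands it would execute:

```python
import subprocess

def repair_cmd(host, token, keyspace):
    """Build a nodetool repair command for a single token,
    with -st and -et set to the same value as described above."""
    return ["nodetool", "-h", host, "repair", "-par",
            "-st", str(token), "-et", str(token), keyspace]

def repair_all(host, tokens, keyspace, dry_run=True):
    for t in tokens:
        cmd = repair_cmd(host, t, keyspace)
        if dry_run:
            print(" ".join(cmd))
        else:
            out = subprocess.run(cmd, capture_output=True, text=True)
            # spool the output; alert the admin if a repair session failed
            if out.returncode != 0:
                print(f"repair failed for token {t}: {out.stderr}")

# hypothetical values for illustration only
repair_all("10.0.0.5", [-9223372036854775808, 42], "my_keyspace")
```

Running one small range at a time keeps each repair session short, which makes failures easier to localize and retry.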

Schema disagreements
Perform schema changes one at a time, at a steady pace,
and from the same node
Do not make multiple schema changes at the same time
If NTP is not in place it is possible that the schemas may not
be in sync (the usual problem in most cases)
Check if the schema is in agreement
http://www.datastax.com/documentation/cassandra/2.0/cas
sandra/dml/dml_handle_schema_disagree_t.html
./nodetool describecluster

178

Nodetool cfhistograms
Tells how many SSTables were looked at to
satisfy a read
With leveled compaction this should never go above 3
With size-tiered compaction this should never go above 12
If the above are not the case then compaction is falling
behind
check with ./nodetool compactionstats
it should say "pending tasks: 0"

Read and write latency (excluding n/w latency)

Writes should not take > 150µs
Reads should not take > 130µs (from memory)
SSD reads should be single-digit ms

./nodetool cfhistograms <keyspace> <table>
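The rules of thumb above can be turned into a tiny check, for instance inside a monitoring script that parses cfhistograms output. The thresholds come straight from this slide; the function name and structure are made up for illustration:

```python
def sstables_per_read_ok(compaction_strategy: str, sstables_read: int) -> bool:
    """Rule of thumb: leveled compaction should touch at most 3 SSTables
    per read, size-tiered at most 12; anything above suggests that
    compaction is falling behind."""
    limits = {"leveled": 3, "size-tiered": 12}
    return sstables_read <= limits[compaction_strategy]

print(sstables_per_read_ok("leveled", 4))       # over the limit
print(sstables_per_read_ok("size-tiered", 12))  # still acceptable
```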


179

Slow queries
Use the DataStax Enterprise Performance Service
to automatically capture long-running queries
(based on response time thresholds you specify)
and then query the performance table that holds
those CQL statements
cqlsh:dse_perf> select * from node_slow_log;

180

C* timeout, concurrency, consistency


cassandra.yaml has multiple timeouts
(these change between versions)
read_request_timeout_in_ms
write_request_timeout_in_ms
cross_node_timeout

Concurrency
concurrent_reads
concurrent_writes

Application
Consistency levels affect read/write

etc (see the yaml file)
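For reference, these settings look like the following in cassandra.yaml. The values shown are the 2.x defaults; verify them against the yaml file bundled with your version:

```yaml
# request timeouts (milliseconds)
read_request_timeout_in_ms: 5000
write_request_timeout_in_ms: 2000
# honour timeouts measured across nodes (requires NTP)
cross_node_timeout: false

# concurrency
concurrent_reads: 32
concurrent_writes: 32
```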


181

JNA
Java Native Access (JNA) allows for more
efficient communication between the JVM
and the OS
Ensure JNA is enabled; if not, do
the following
sudo apt-get install libjna-java

Ideally, though, JNA should be
enabled automatically
182

About me

That's it!
www.linkedin.com/in/nirmallya

nirmallya.mukherjee@gmail.com

Disclaimer - Please use this material at your own discretion.


183