You are on page 1of 23

Apache Cassandra database is a distributed, fault tolerant, linearly scalable, column-

oriented, NoSQL database. Apache Cassandra is great at handling massive


amounts of structured (table has defined columns), and semi-structured (table row
doesn’t need to populate all columns) data. Cassandra is a distributed storage
system that is designed to scale linearly with the addition of commodity servers, with
no single point of failure. Cassandra can be easily scaled across multiple data
centers (and regions) to increase the resiliency of the system.

Features of Cassandra
 Open Source – It is an open source project by Apache. Cassandra can be
integrated with other source projects like Hadoop, Apache Pig, Apache Hive,
etc.
 Peer-to-Peer Architecture – All nodes in the Cassandra cluster
communicate with each other. There is no master-slave paradigm in
Cassandra.
 No Single point of Failure – In a cluster, all nodes are created equal. Data is
distributed across the nodes in the cluster and each node is capable of
handling read and write requests.
 Highly Available and Fault Tolerant – Because of data replication across
the nodes in the cluster, Cassandra is highly available and fault tolerant.
 Globally distributed – A Cassandra cluster may be deployed in multiple data
centers, which may be globally distributed.
 Flexible Data Model – The concepts from DynamoDB and BigTable are built
into Cassandra to allow for complex data structures. The model works for a
wide variety of data modeling use cases.
 Linearly Scalable – When new nodes are added, the data is more evenly
distributed across the nodes, which reduces the load each node handles.
 Tunable Consistency – Consistency level is number of nodes needed to
agree on the data for reads and writes. The consistency level controls both
read and write behavior based on your replication factor. The Consistency
level could be zero, one, quorum, local quorum or all.

Layers of a Cassandra Cluster


Keyspaces

As mentioned earlier, keyspaces in Cassandra are similar to the schema in a relational


database. Some of the features of keyspaces are:

 A keyspace needs to be defined before creating tables, as there is no default keyspace.

 A keyspace can contain any number of tables, and a table belongs only to one keyspace.
This represents a one-to-many relationship.

 Replication is specified at the keyspace level. For example, replication of three implies
that each data row in the keyspace will have three copies.

 Further, you need to specify the replication factor during the creation of keyspace.
However, the replication factor can be modified later.

The given sample query shows the command for creating a keyspace with the name as
TestDB, replication as SimpleStrategy, and replication factor as three. It also shows the use
command to start using the keyspace.

Tables

A table contains data in the horizontal and vertical formats, referred to as rows and columns
respectively. Some of the features of tables are:

 Tables have multiple rows and columns. As mentioned earlier, a table is also called
Column Family in the earlier versions of Cassandra.

 It is still referred to as column family in some of the error messages and documents of
Cassandra.
 It is important to define a primary key for a table.

The given query shows the statement for creating a table named employee with four columns
empid, empfirstname, emplastname, and empsalary.

In this query, Empid is of the integer type, empfirstname and emplastname are the text type,
and empsalary is defined as a double. Note that empid is defined as the primary key for the
table. In database terms, primary key refers to a column that has unique values for each row
in a table.

For example, in an organization, two employees may have the same first and the last names.
However, each employee is assigned a unique employee ID. Therefore, if you create a table
to hold employee data, the employee ID can be used as the primary key, as it is unique for all
the rows of the employee table. 

On the other hand, employees’ first or last name cannot be used as a primary key, as it cannot
uniquely identify a row in the table.

Columns

Column represents a single piece of data in Cassandra and has a type defined.

Some of its features are:

 Columns consist of various types, such as integer, big integer, text, float, double, and
Boolean.

 Cassandra also provides collection types such as set, list, and map.
 Further, column values have an associated time stamp representing the time of update.

 This timestamp can be retrieved using the function writetime.

In the next section, let us learn all about UUID and TimeUUID.

UUID and TimeUUID

There are two special column types for sequence numbers and timestamps called UUID and
TimeUUID.  

Universal Unique Identity or UUID

This is similar to the sequence numbers in relational databases. It is a 128-bit integer.

The given example for this column is represented as hexadecimal digits with hyphens in
between:

Sample UUID is – 01234567-0123-0123-0123-0123456789ab.

TimeUUID

It contains a timestamp and guarantees no duplication. TimeUUID uses time in 100


nanosecond intervals. You can use the function now() to get the TimeUUID.

Sample TimeUUID is – D2177dd0-eaa2-11de-a572-001b779c76e3.

Counter

Counter is a special column which stores a number that is incremented. It can be used to
count the occurrences of an event or a process. Some of the features of the counter are:

 You can only update the value by adding or subtracting from the current value.

 Before the first update, a counter will have an initial value of zero.

 The update using the current value is allowed only for the counter columns in Cassandra.

The given query shows a counter column being added called numEmployees to the table
Department.
It also shows the process of incrementing the column value based on the previous value for a
row where deptID = 1000.

Compound Keys

The primary key in Cassandra can be a compound key, which refers to a key consisting of
multiple columns. For example, a multi-year stock market data has the primary key
(tradeDate, ticker), where:

 tradeDate is the date of the stock trade and

 ticker is the stock symbol.

When compound keys are specified, the first key is considered for partitioning of data and is
called the partition key. The data in each partition is clustered and ordered by the remaining
keys.

Further, if you want to partition the data by multiple columns, group the data as first part of
the primary key, as shown in the following two instances:

1. Primary key ((key1), key2, key3, key4)

 Here the data is partitioned by key1 and clustered by key2, key3, key4

2. Primary key ((key1, key2), key3, key4)

 Here the data is partitioned by key1, key2 and clustered by key3, key4

Note that you can also specify the order for the cluster keys.
In databases, clustering refers to the grouping of data together. Suppose the data is clustered
by key3, key4, the rows with the key3 values are stored together on physical storage, so that
the queries such as key3 = 100 can be fetched quickly.

Indexes

Indexes can be used to speed up queries. Index column can be used to search for data rows.
Cassandra allows only one column in an index. You can create multiple indexes on the same
table.

The given query shows an index named first_index being created on an employee table in the
keyspace called testDB.

The index is created on the column empFirstName. After creating the index, the given query
that gets a row from the table with empFirstName = ‘Jack’ will use the index to search for
data faster.

Collection Columns

Collection columns are used to represent a group of data in a single column. Cassandra
provides three types of collection columns, such as set, list, and map.

Set

It refers to an unordered group of values. Set values are enclosed within curly brackets, also
called braces.

For example, { ‘XYZ’, ‘ABC’,’PQR’ }

List

It is an ordered group of values, like an array. List values are enclosed within square brackets.
For example, [ ‘2011’, ‘2012’, ‘2013’ ]

Map

It is a set of key-value pairs. Map values are also enclosed within braces.

For example, { ‘key1’: ‘value1’, ‘key2’: ‘value2’}

The given query shows a table being created with empid as primary key and roles as a set of
type text. You can add new roles to an employee by using the update command as shown:

Update employee set roles = roles + {‘supervisor’}

Here, the new role ‘supervisor’ is enclosed within the braces.

Collection Columns - Set

The functions of collection column — Set are:

You can use ALTER TABLE statement to add a collection to an existing table, using the
given command:

create table stocks (ticker text, year int, value double, primary key(ticker, year));

alter table stocks add supportNumber set<text>;

First, create table stocks with the columns ticker, year, and value; with primary key ticker and
year. Next, use the alter table command to add supportNumber, which represents the set of
support numbers for the stock. This is added as a set of text so it can hold the support number
strings.
Next, use the update command to initialize the data of a Set column using the given
command:

update stocks set supportNumber = {'1800800800'} where ticker = 'ABC' and year = 2011;

Here, update the supportNumber column with value of ‘1800800800’, enclosed in curly
brackets to indicate a set.

Finally, to add a new support number to an existing Set, use the given command:

update stocks set supportNumber = supportNumber + {'1800800801'}

where ticker = 'ABC' and year = 2011;

Here, use the plus symbol to add value to an existing column. Note that the other collection
types, such as lists and maps function in a similar manner.

In the next section, let us talk about DDL and DML Statements.

Apache Cassandra Data Model: Components And Statements


By SimplilearnLast updated on May 24, 2021768

From a bird’s eye view Apache Cassandra is a database, one that is highly scalable, high-
performance, designed to handle large amounts of data. In this tutorial we will focus on the
data model for Cassandra. By the end of this lesson, you will be able to:

 Describe the Cassandra data model

 Describe the components of Cassandra data mode

 Explain the functions of DDL and DML statements

 Discuss the SELECT statement restrictions in Cassandra

Let us begin with the Cassandra Data Model!


Cassandra Data Model

Cassandra is a NoSQL database, which is a key-value store. Some of the features of


Cassandra data model are as follows:

 Data in Cassandra is stored as a set of rows that are organized into tables.

 Tables are also called column families.

 Each Row is identified by a primary key value.

 Data is partitioned by the primary key.

 You can get the entire data or some data based on the primary key.

In the next section, we will learn about the components of the Cassandra Data Model.

Cassandra Data Model Components

Cassandra data model provides a mechanism for data storage. The components of Cassandra
data model are keyspaces, tables, and columns.

Keyspaces

Cassandra data model consists of keyspaces at the highest level. Keyspaces are the containers
of data, similar to the schema or database in a relational database. Typically, keyspaces
contain many tables.

Tables

Within the keyspaces, the tables are defined. Tables are also referred to as Column Families
in the earlier versions of Cassandra. Tables contain a set of columns and a primary key, and
they store data in a set of rows.

Columns

Columns define the structure of data in a table. Each column has an associated type, such as
integer, text, double, and Boolean. These Cassandra data model components will be discussed
in detail in this lesson.

Let us learn about all of these in detail in the next section.


Big Data Engineer Master's Program

Master All the Big Data Skill You Need TodayENROLL NOW

Keyspaces

As mentioned earlier, keyspaces in Cassandra are similar to the schema in a relational


database. Some of the features of keyspaces are:

 A keyspace needs to be defined before creating tables, as there is no default keyspace.

 A keyspace can contain any number of tables, and a table belongs only to one keyspace. This
represents a one-to-many relationship.

 Replication is specified at the keyspace level. For example, replication of three implies that each
data row in the keyspace will have three copies.

 Further, you need to specify the replication factor during the creation of keyspace. However, the
replication factor can be modified later.

The given sample query shows the command for creating a keyspace with the name as
TestDB, replication as SimpleStrategy, and replication factor as three. It also shows the use
command to start using the keyspace.

Tables
A table contains data in the horizontal and vertical formats, referred to as rows and columns
respectively. Some of the features of tables are:

 Tables have multiple rows and columns. As mentioned earlier, a table is also called Column
Family in the earlier versions of Cassandra.

 It is still referred to as column family in some of the error messages and documents of Cassandra.

 It is important to define a primary key for a table.

The given query shows the statement for creating a table named employee with four columns
empid, empfirstname, emplastname, and empsalary.

In this query, Empid is of the integer type, empfirstname and emplastname are the text type,
and empsalary is defined as a double. Note that empid is defined as the primary key for the
table. In database terms, primary key refers to a column that has unique values for each row
in a table.

For example, in an organization, two employees may have the same first and the last names.
However, each employee is assigned a unique employee ID. Therefore, if you create a table
to hold employee data, the employee ID can be used as the primary key, as it is unique for all
the rows of the employee table. 

On the other hand, employees’ first or last name cannot be used as a primary key, as it cannot
uniquely identify a row in the table.

Columns

Column represents a single piece of data in Cassandra and has a type defined.
Some of its features are:

 Columns consist of various types, such as integer, big integer, text, float, double, and Boolean.

 Cassandra also provides collection types such as set, list, and map.

 Further, column values have an associated time stamp representing the time of update.

 This timestamp can be retrieved using the function writetime.

In the next section, let us learn all about UUID and TimeUUID.

UUID and TimeUUID

There are two special column types for sequence numbers and timestamps called UUID and
TimeUUID.  

Universal Unique Identity or UUID

This is similar to the sequence numbers in relational databases. It is a 128-bit integer.

The given example for this column is represented as hexadecimal digits with hyphens in
between:

Sample UUID is – 01234567-0123-0123-0123-0123456789ab.

TimeUUID

It contains a timestamp and guarantees no duplication. TimeUUID uses time in 100


nanosecond intervals. You can use the function now() to get the TimeUUID.

Sample TimeUUID is – D2177dd0-eaa2-11de-a572-001b779c76e3.

Counter

Counter is a special column which stores a number that is incremented. It can be used to
count the occurrences of an event or a process. Some of the features of the counter are:

 You can only update the value by adding or subtracting from the current value.
 Before the first update, a counter will have an initial value of zero.

 The update using the current value is allowed only for the counter columns in Cassandra.

The given query shows a counter column being added called numEmployees to the table
Department.

It also shows the process of incrementing the column value based on the previous value for a
row where deptID = 1000.

Compound Keys

The primary key in Cassandra can be a compound key, which refers to a key consisting of
multiple columns. For example, a multi-year stock market data has the primary key
(tradeDate, ticker), where:

 tradeDate is the date of the stock trade and

 ticker is the stock symbol.

When compound keys are specified, the first key is considered for partitioning of data and is
called the partition key. The data in each partition is clustered and ordered by the remaining
keys.

Further, if you want to partition the data by multiple columns, group the data as first part of
the primary key, as shown in the following two instances:

1. Primary key ((key1), key2, key3, key4)

 Here the data is partitioned by key1 and clustered by key2, key3, key4
2. Primary key ((key1, key2), key3, key4)

 Here the data is partitioned by key1, key2 and clustered by key3, key4

Note that you can also specify the order for the cluster keys.

In databases, clustering refers to the grouping of data together. Suppose the data is clustered
by key3, key4, the rows with the key3 values are stored together on physical storage, so that
the queries such as key3 = 100 can be fetched quickly.

Indexes

Indexes can be used to speed up queries. Index column can be used to search for data rows.
Cassandra allows only one column in an index. You can create multiple indexes on the same
table.

The given query shows an index named first_index being created on an employee table in the
keyspace called testDB.

The index is created on the column empFirstName. After creating the index, the given query
that gets a row from the table with empFirstName = ‘Jack’ will use the index to search for
data faster.

Collection Columns

Collection columns are used to represent a group of data in a single column. Cassandra
provides three types of collection columns, such as set, list, and map.

Set

It refers to an unordered group of values. Set values are enclosed within curly brackets, also
called braces.

For example, { ‘XYZ’, ‘ABC’,’PQR’ }


List

It is an ordered group of values, like an array. List values are enclosed within square brackets.

For example, [ ‘2011’, ‘2012’, ‘2013’ ]

Map

It is a set of key-value pairs. Map values are also enclosed within braces.

For example, { ‘key1’: ‘value1’, ‘key2’: ‘value2’}

The given query shows a table being created with empid as primary key and roles as a set of
type text. You can add new roles to an employee by using the update command as shown:

Update employee set roles = roles + {‘supervisor’}

Here, the new role ‘supervisor’ is enclosed within the braces.

Collection Columns - Set

The functions of collection column — Set are:

You can use ALTER TABLE statement to add a collection to an existing table, using the
given command:

create table stocks (ticker text, year int, value double, primary key(ticker, year));

alter table stocks add supportNumber set<text>;


First, create table stocks with the columns ticker, year, and value; with primary key ticker and
year. Next, use the alter table command to add supportNumber, which represents the set of
support numbers for the stock. This is added as a set of text so it can hold the support number
strings.

Next, use the update command to initialize the data of a Set column using the given
command:

update stocks set supportNumber = {'1800800800'} where ticker = 'ABC' and year = 2011;

Here, update the supportNumber column with value of ‘1800800800’, enclosed in curly
brackets to indicate a set.

Finally, to add a new support number to an existing Set, use the given command:

update stocks set supportNumber = supportNumber + {'1800800801'}

where ticker = 'ABC' and year = 2011;

Here, use the plus symbol to add value to an existing column. Note that the other collection
types, such as lists and maps function in a similar manner.

In the next section, let us talk about DDL and DML Statements.

Big Data Hadoop and Spark Developer Course (FREE)

Learn Big Data Basics from Top Experts - for FREEENROLL NOW

DDL and DML Statements

Cassandra supports Cassandra Query Language or CQL, which includes Data Definition
Language or DDL statements and Data Manipulation Language or DML statements. DDL
statements are CREATE TABLE, ALTER TABLE, and DROP TABLE. DML statements are
INSERT, UPDATE, SELECT, and DELETE. COPY is an optional DML statement of
Cassandra.

DDL Statements
Cassandra includes DDL statements such as:

 CREATE TABLE

 ALTER TABLE

 DROP TABLE

The CREATE TABLEstatement can be used to create a table, as this statement


defines the schema of the table. You have to specify the primary key while creating
the table.

Once a table is created, ALTER TABLEstatement can be used to modify the table
definition.

Further, the DROP TABLEstatement can be used to remove an existing table. You
need to be careful before you drop a table as you will lose any data in the table.

The given query shows a table being created to hold the stock market data:

create table stocks(ticker text, tradeDate timestamp, value double,

primary key (ticker, tradeDate)) clustered by tradeDate desc;

alter table stocks add status text;

drop table stocks;

The command is ‘create table stocks’ along with the list of columns and data types.
Then, a composite primary key is specified with columns ticker and trade date. It also
illustrates using the ‘clustered by’ option to cluster the data by trade date column in
the descending order.

Next, an ALTER TABLE command is used to add a new column status of the type
‘text’ to the table. Finally, the DROP TABLE command is used to remove the table.

In the next section, let us talk about Insert Statement.

DML Statements - INSERT

Cassandra includes the standard INSERT values statement to insert data into a table. The list
of values must match the columns specified. If the primary key data already exists, the data in
the remaining columns will be updated. Otherwise, an entire row will be inserted.
Further, it is necessary to specify the IF NOT EXISTS condition to insert row only if the
primary key does not exist in the table.

The given query shows a row being inserted, which contains values for empid,
empFirstname, and emplastname into the table employee:

Insert into employee (empid, empFirstName, empLastName)

values ( 100 , ‘Jack’, ‘Frank’) if not exists;

Note that the IF NOT EXISTS condition is specified so that the insert will happen only if
there is no row with the empid of 100 in the table.

In the next section, let us talk about Update statement.

DML Statements - UPDATE

Cassandra includes UPDATE statement to insert or update data into a table. These statements
update one or more columns of the table. In a table, if a row matching the ‘where’ clause is
found, the values are updated.

On the contrary, if the row is not found, the values are inserted into the table with the keys
specified in the ‘where’ clause. Note that the primary key value is inserted only once and
cannot be updated. The given query shows the update of the employee first name to John for
the empid 100:

Update employee set empFirstName = ‘John’

Where empid = 100;

If a row with the empid 100 exists in the table, the value for employee first name is updated
to John. If there is no row with empid of 100, then a new row is inserted into the table with
values of empid = 100 and employee first name = John.

Further, there is another query for incrementing the counter column numEmployees by one:

Update department set numEmployees = numEmployees + 1

Where departmentId = 1000;


Note that updating a column using previous column value is allowed only for counter
columns. Suppose there is a column salary of type double and you update it as salary = salary
+ 200.50, this will fail with an error stating this is not a counter column.

DML Statements - COPY

Cassandra includes COPY statement to load bulk data into a table. This is an optional DML
statement of Cassandra. Note that the INSERT INTO command can be used to insert a few
rows into a table, but not bulk data.

The COPY statement is not part of CQL, however, it is part of Cassandra Command Line
Interface. Using this command, you can import the data from a data file. Further, you can
import the data into particular columns instead of all the columns. The data file is assumed to
be delimited by ‘comma’ by default. You can also specify a different delimiter.

Some data files may contain a header line that can be ignored using HEADER=TRUE as an
option. If the duplicate data exists in the table, it will not cause an error. It will be treated as
an update, and the last value in the table will be kept.

The given query shows the data of empid, empFirstName, and empLastName being copied
into the employee table from the file employeeData.csv in the local file system. This file has
to be a comma separated file:

Copy employee (empid, empFirstName, empLastName) FROM

employeeData.csv ;

The second query shows the data being imported into all the columns of the employee table
from employeeData.bar local file:

Copy employee FROM employeeData.bar WITH HEADER = TRUE AND

DELIMITER = ‘|’ ;

The delimiter is specified as vertical bar character and header is set to TRUE so that the first
line of the file is ignored.

DML Statements - SELECT


Cassandra includes SELECT statement to extract data from the tables. You can select one or
more columns or use the * symbol to extract all the columns of a table.

The general syntax is ‘select columns from tableName where’ conditions. Cassandra is
restrictive on what you can specify in the ‘where’ clause. You can specify the SELECT
conditions based only on the primary keys or index columns. The syntax ‘Order by’ is
allowed only on clustered columns.

The given query shows a few valid select statements in Cassandra:

Select * from employee where empid = 1000; // Allowed as empid is the primary key

First, the SELECT statement extracts all the columns from the employee table where empid
is 100. This is allowed as empid is the primary key for the table. The second query shows the
selection of two columns, empid, empFirstName from the employee table where
empFirstName is Jack:

Select empid, empFirstName from employee where empFirstName = ‘Jack’; // allowed

only if there is index on empFirstName

This is allowed only if there is an index for the employee table created on empFirstName
column.

Let us talk about Select Statement Restrictions in the next section.

SELECT Statements Restrictions

The SELECT statement in Cassandra has the following restrictions: 

 The range condition is allowed only for the last part of the primary key.

For example, if a primary key is defined as (key1, key2, key3, key4), then,

 key1 = 100 and key2 = 20 and key3 = 50, and key4 in (3, 4) is allowed;

 Key1 = 100 and key2 in (20,30) and key3 =50 and key4 = 3 is not allowed, as only key4 is
allowed to have a range in this query.

You have to specify the values for all leading columns of the primary key. For example, if the
primary key is defined as (key1, key2, key3, key4), then

 Key1 = 100 is allowed;


 key1 = 100 and key3 = 40 is not allowed.

Let us talk about valid and invalid Select Statement example in the next section.

Valid and Invalid SELECT Statements Example

The following query shows four SELECT statements in Cassandra, out of which two are
valid, and the rest are invalid.

The query shows the testTab table that has its primary key defined as (key1, key2, key3,
key4):

Select * from testTab where key1 = 100; //allowed

The first statement extracts all the columns where key1 is 100. This is allowed as key1 is the
first column of the primary key.

Select * from testTab where key1 = 100 and key2 = 200; // allowed

The second statement extracts all the columns where key1 is 100 and key2 is 200. This is also
allowed as key1 and key2 are the leading columns of the primary key.

Select * from testTab where key1 = 100 and key3 = 50; // not allowed

The third statement selects all the columns where key1 is 100 and key3 is 50. This is not
allowed as you cannot specify the value for key3 without specifying the value for the
previous key, key2.

Select * from testTab where key1 = 100 and key2 = 20 and key4 = 3; // not allowed

The fourth statement selects all the columns where key1 is 100, key2 is 20, and key4 is 3.
This is also not allowed as the value for key4 is specified without specifying a value for key3.

DML Statements - DELETE

Cassandra allows removal of rows from a table using the DELETE statement. This statement
removes all the matching rows from the table. If the columns are specified in the DELETE
statement, only the data from the columns will be removed, however, the row will not be
deleted. The ‘where’ clause of the DELETE statement is similar to that of the SELECT
statement. You can also specify additional conditions with the IF clause.

The given query shows three valid DELETE statements in Cassandra:

Delete from employee where employeeId = 1000;

The first statement removes the row where employee ID equals 1000.

Delete employeeFirstName from employee where employeeId = 2000;

The second statement removes the value in the employeeFirstname column, where the
employee ID equals 2000.

Delete from employee where employeeId = 2000

(IF salary > 100000);

The third statement deletes the row with employee ID whose value is 2000 along with the IF
condition, where the salary is more than 100,000.

Summary

Let us summarize the topics covered in this lesson.

 Cassandra data model contains keyspaces, tables, and columns.

 A keyspace is the container of all data in Cassandra. Replication is specified at the


keyspace level.

 Tables and columns contain the key value data in Cassandra.

 Every table should have a primary key, which can be a composite primary key.

 Data is partitioned by the first part of the primary key and clustered by the rest.

 Cassandra provides CREATE TABLE, ALTER TABLE, and DROP TABLE statements
for data definition.

 Cassandra provides INSERT, UPDATE, SELECT, and DELETE statements for data
manipulation.

 The ‘where’ clauses in Cassandra have restrictions.

You might also like