Professional Documents
Culture Documents
Features of Cassandra
Open Source – It is an open source project by Apache. Cassandra can be
integrated with other source projects like Hadoop, Apache Pig, Apache Hive,
etc.
Peer-to-Peer Architecture – All nodes in the Cassandra cluster
communicate with each other. There is no master-slave paradigm in
Cassandra.
No Single point of Failure – In a cluster, all nodes are created equal. Data is
distributed across the nodes in the cluster and each node is capable of
handling read and write requests.
Highly Available and Fault Tolerant – Because of data replication across
the nodes in the cluster, Cassandra is highly available and fault tolerant.
Globally distributed – A Cassandra cluster may be deployed in multiple data
centers, which may be globally distributed.
Flexible Data Model – The concepts from DynamoDB and BigTable are built
into Cassandra to allow for complex data structures. The model works for a
wide variety of data modeling use cases.
Linearly Scalable – When new nodes are added, the data is more evenly
distributed across the nodes, which reduces the load each node handles.
Tunable Consistency – Consistency level is number of nodes needed to
agree on the data for reads and writes. The consistency level controls both
read and write behavior based on your replication factor. The Consistency
level could be zero, one, quorum, local quorum or all.
A keyspace can contain any number of tables, and a table belongs only to one keyspace.
This represents a one-to-many relationship.
Replication is specified at the keyspace level. For example, replication of three implies
that each data row in the keyspace will have three copies.
Further, you need to specify the replication factor during the creation of keyspace.
However, the replication factor can be modified later.
The given sample query shows the command for creating a keyspace with the name as
TestDB, replication as SimpleStrategy, and replication factor as three. It also shows the use
command to start using the keyspace.
Tables
A table contains data in the horizontal and vertical formats, referred to as rows and columns
respectively. Some of the features of tables are:
Tables have multiple rows and columns. As mentioned earlier, a table is also called
Column Family in the earlier versions of Cassandra.
It is still referred to as column family in some of the error messages and documents of
Cassandra.
It is important to define a primary key for a table.
The given query shows the statement for creating a table named employee with four columns
empid, empfirstname, emplastname, and empsalary.
In this query, Empid is of the integer type, empfirstname and emplastname are the text type,
and empsalary is defined as a double. Note that empid is defined as the primary key for the
table. In database terms, primary key refers to a column that has unique values for each row
in a table.
For example, in an organization, two employees may have the same first and the last names.
However, each employee is assigned a unique employee ID. Therefore, if you create a table
to hold employee data, the employee ID can be used as the primary key, as it is unique for all
the rows of the employee table.
On the other hand, employees’ first or last name cannot be used as a primary key, as it cannot
uniquely identify a row in the table.
Columns
Column represents a single piece of data in Cassandra and has a type defined.
Columns consist of various types, such as integer, big integer, text, float, double, and
Boolean.
Cassandra also provides collection types such as set, list, and map.
Further, column values have an associated time stamp representing the time of update.
In the next section, let us learn all about UUID and TimeUUID.
There are two special column types for sequence numbers and timestamps called UUID and
TimeUUID.
The given example for this column is represented as hexadecimal digits with hyphens in
between:
TimeUUID
Counter
Counter is a special column which stores a number that is incremented. It can be used to
count the occurrences of an event or a process. Some of the features of the counter are:
You can only update the value by adding or subtracting from the current value.
Before the first update, a counter will have an initial value of zero.
The update using the current value is allowed only for the counter columns in Cassandra.
The given query shows a counter column being added called numEmployees to the table
Department.
It also shows the process of incrementing the column value based on the previous value for a
row where deptID = 1000.
Compound Keys
The primary key in Cassandra can be a compound key, which refers to a key consisting of
multiple columns. For example, a multi-year stock market data has the primary key
(tradeDate, ticker), where:
When compound keys are specified, the first key is considered for partitioning of data and is
called the partition key. The data in each partition is clustered and ordered by the remaining
keys.
Further, if you want to partition the data by multiple columns, group the data as first part of
the primary key, as shown in the following two instances:
Here the data is partitioned by key1 and clustered by key2, key3, key4
Here the data is partitioned by key1, key2 and clustered by key3, key4
Note that you can also specify the order for the cluster keys.
In databases, clustering refers to the grouping of data together. Suppose the data is clustered
by key3, key4, the rows with the key3 values are stored together on physical storage, so that
the queries such as key3 = 100 can be fetched quickly.
Indexes
Indexes can be used to speed up queries. Index column can be used to search for data rows.
Cassandra allows only one column in an index. You can create multiple indexes on the same
table.
The given query shows an index named first_index being created on an employee table in the
keyspace called testDB.
The index is created on the column empFirstName. After creating the index, the given query
that gets a row from the table with empFirstName = ‘Jack’ will use the index to search for
data faster.
Collection Columns
Collection columns are used to represent a group of data in a single column. Cassandra
provides three types of collection columns, such as set, list, and map.
Set
It refers to an unordered group of values. Set values are enclosed within curly brackets, also
called braces.
List
It is an ordered group of values, like an array. List values are enclosed within square brackets.
For example, [ ‘2011’, ‘2012’, ‘2013’ ]
Map
It is a set of key-value pairs. Map values are also enclosed within braces.
The given query shows a table being created with empid as primary key and roles as a set of
type text. You can add new roles to an employee by using the update command as shown:
You can use ALTER TABLE statement to add a collection to an existing table, using the
given command:
create table stocks (ticker text, year int, value double, primary key(ticker, year));
First, create table stocks with the columns ticker, year, and value; with primary key ticker and
year. Next, use the alter table command to add supportNumber, which represents the set of
support numbers for the stock. This is added as a set of text so it can hold the support number
strings.
Next, use the update command to initialize the data of a Set column using the given
command:
update stocks set supportNumber = {'1800800800'} where ticker = 'ABC' and year = 2011;
Here, update the supportNumber column with value of ‘1800800800’, enclosed in curly
brackets to indicate a set.
Finally, to add a new support number to an existing Set, use the given command:
Here, use the plus symbol to add value to an existing column. Note that the other collection
types, such as lists and maps function in a similar manner.
In the next section, let us talk about DDL and DML Statements.
From a bird’s eye view Apache Cassandra is a database, one that is highly scalable, high-
performance, designed to handle large amounts of data. In this tutorial we will focus on the
data model for Cassandra. By the end of this lesson, you will be able to:
Data in Cassandra is stored as a set of rows that are organized into tables.
You can get the entire data or some data based on the primary key.
In the next section, we will learn about the components of the Cassandra Data Model.
Cassandra data model provides a mechanism for data storage. The components of Cassandra
data model are keyspaces, tables, and columns.
Keyspaces
Cassandra data model consists of keyspaces at the highest level. Keyspaces are the containers
of data, similar to the schema or database in a relational database. Typically, keyspaces
contain many tables.
Tables
Within the keyspaces, the tables are defined. Tables are also referred to as Column Families
in the earlier versions of Cassandra. Tables contain a set of columns and a primary key, and
they store data in a set of rows.
Columns
Columns define the structure of data in a table. Each column has an associated type, such as
integer, text, double, and Boolean. These Cassandra data model components will be discussed
in detail in this lesson.
Master All the Big Data Skill You Need TodayENROLL NOW
Keyspaces
A keyspace can contain any number of tables, and a table belongs only to one keyspace. This
represents a one-to-many relationship.
Replication is specified at the keyspace level. For example, replication of three implies that each
data row in the keyspace will have three copies.
Further, you need to specify the replication factor during the creation of keyspace. However, the
replication factor can be modified later.
The given sample query shows the command for creating a keyspace with the name as
TestDB, replication as SimpleStrategy, and replication factor as three. It also shows the use
command to start using the keyspace.
Tables
A table contains data in the horizontal and vertical formats, referred to as rows and columns
respectively. Some of the features of tables are:
Tables have multiple rows and columns. As mentioned earlier, a table is also called Column
Family in the earlier versions of Cassandra.
It is still referred to as column family in some of the error messages and documents of Cassandra.
The given query shows the statement for creating a table named employee with four columns
empid, empfirstname, emplastname, and empsalary.
In this query, Empid is of the integer type, empfirstname and emplastname are the text type,
and empsalary is defined as a double. Note that empid is defined as the primary key for the
table. In database terms, primary key refers to a column that has unique values for each row
in a table.
For example, in an organization, two employees may have the same first and the last names.
However, each employee is assigned a unique employee ID. Therefore, if you create a table
to hold employee data, the employee ID can be used as the primary key, as it is unique for all
the rows of the employee table.
On the other hand, employees’ first or last name cannot be used as a primary key, as it cannot
uniquely identify a row in the table.
Columns
Column represents a single piece of data in Cassandra and has a type defined.
Some of its features are:
Columns consist of various types, such as integer, big integer, text, float, double, and Boolean.
Cassandra also provides collection types such as set, list, and map.
Further, column values have an associated time stamp representing the time of update.
In the next section, let us learn all about UUID and TimeUUID.
There are two special column types for sequence numbers and timestamps called UUID and
TimeUUID.
The given example for this column is represented as hexadecimal digits with hyphens in
between:
TimeUUID
Counter
Counter is a special column which stores a number that is incremented. It can be used to
count the occurrences of an event or a process. Some of the features of the counter are:
You can only update the value by adding or subtracting from the current value.
Before the first update, a counter will have an initial value of zero.
The update using the current value is allowed only for the counter columns in Cassandra.
The given query shows a counter column being added called numEmployees to the table
Department.
It also shows the process of incrementing the column value based on the previous value for a
row where deptID = 1000.
Compound Keys
The primary key in Cassandra can be a compound key, which refers to a key consisting of
multiple columns. For example, a multi-year stock market data has the primary key
(tradeDate, ticker), where:
When compound keys are specified, the first key is considered for partitioning of data and is
called the partition key. The data in each partition is clustered and ordered by the remaining
keys.
Further, if you want to partition the data by multiple columns, group the data as first part of
the primary key, as shown in the following two instances:
Here the data is partitioned by key1 and clustered by key2, key3, key4
2. Primary key ((key1, key2), key3, key4)
Here the data is partitioned by key1, key2 and clustered by key3, key4
Note that you can also specify the order for the cluster keys.
In databases, clustering refers to the grouping of data together. Suppose the data is clustered
by key3, key4, the rows with the key3 values are stored together on physical storage, so that
the queries such as key3 = 100 can be fetched quickly.
Indexes
Indexes can be used to speed up queries. Index column can be used to search for data rows.
Cassandra allows only one column in an index. You can create multiple indexes on the same
table.
The given query shows an index named first_index being created on an employee table in the
keyspace called testDB.
The index is created on the column empFirstName. After creating the index, the given query
that gets a row from the table with empFirstName = ‘Jack’ will use the index to search for
data faster.
Collection Columns
Collection columns are used to represent a group of data in a single column. Cassandra
provides three types of collection columns, such as set, list, and map.
Set
It refers to an unordered group of values. Set values are enclosed within curly brackets, also
called braces.
It is an ordered group of values, like an array. List values are enclosed within square brackets.
Map
It is a set of key-value pairs. Map values are also enclosed within braces.
The given query shows a table being created with empid as primary key and roles as a set of
type text. You can add new roles to an employee by using the update command as shown:
You can use ALTER TABLE statement to add a collection to an existing table, using the
given command:
create table stocks (ticker text, year int, value double, primary key(ticker, year));
Next, use the update command to initialize the data of a Set column using the given
command:
update stocks set supportNumber = {'1800800800'} where ticker = 'ABC' and year = 2011;
Here, update the supportNumber column with value of ‘1800800800’, enclosed in curly
brackets to indicate a set.
Finally, to add a new support number to an existing Set, use the given command:
Here, use the plus symbol to add value to an existing column. Note that the other collection
types, such as lists and maps function in a similar manner.
In the next section, let us talk about DDL and DML Statements.
Learn Big Data Basics from Top Experts - for FREEENROLL NOW
Cassandra supports Cassandra Query Language or CQL, which includes Data Definition
Language or DDL statements and Data Manipulation Language or DML statements. DDL
statements are CREATE TABLE, ALTER TABLE, and DROP TABLE. DML statements are
INSERT, UPDATE, SELECT, and DELETE. COPY is an optional DML statement of
Cassandra.
DDL Statements
Cassandra includes DDL statements such as:
CREATE TABLE
ALTER TABLE
DROP TABLE
Once a table is created, ALTER TABLEstatement can be used to modify the table
definition.
Further, the DROP TABLEstatement can be used to remove an existing table. You
need to be careful before you drop a table as you will lose any data in the table.
The given query shows a table being created to hold the stock market data:
The command is ‘create table stocks’ along with the list of columns and data types.
Then, a composite primary key is specified with columns ticker and trade date. It also
illustrates using the ‘clustered by’ option to cluster the data by trade date column in
the descending order.
Next, an ALTER TABLE command is used to add a new column status of the type
‘text’ to the table. Finally, the DROP TABLE command is used to remove the table.
Cassandra includes the standard INSERT values statement to insert data into a table. The list
of values must match the columns specified. If the primary key data already exists, the data in
the remaining columns will be updated. Otherwise, an entire row will be inserted.
Further, it is necessary to specify the IF NOT EXISTS condition to insert row only if the
primary key does not exist in the table.
The given query shows a row being inserted, which contains values for empid,
empFirstname, and emplastname into the table employee:
Note that the IF NOT EXISTS condition is specified so that the insert will happen only if
there is no row with the empid of 100 in the table.
Cassandra includes UPDATE statement to insert or update data into a table. These statements
update one or more columns of the table. In a table, if a row matching the ‘where’ clause is
found, the values are updated.
On the contrary, if the row is not found, the values are inserted into the table with the keys
specified in the ‘where’ clause. Note that the primary key value is inserted only once and
cannot be updated. The given query shows the update of the employee first name to John for
the empid 100:
If a row with the empid 100 exists in the table, the value for employee first name is updated
to John. If there is no row with empid of 100, then a new row is inserted into the table with
values of empid = 100 and employee first name = John.
Further, there is another query for incrementing the counter column numEmployees by one:
Cassandra includes COPY statement to load bulk data into a table. This is an optional DML
statement of Cassandra. Note that the INSERT INTO command can be used to insert a few
rows into a table, but not bulk data.
The COPY statement is not part of CQL, however, it is part of Cassandra Command Line
Interface. Using this command, you can import the data from a data file. Further, you can
import the data into particular columns instead of all the columns. The data file is assumed to
be delimited by ‘comma’ by default. You can also specify a different delimiter.
Some data files may contain a header line that can be ignored using HEADER=TRUE as an
option. If the duplicate data exists in the table, it will not cause an error. It will be treated as
an update, and the last value in the table will be kept.
The given query shows the data of empid, empFirstName, and empLastName being copied
into the employee table from the file employeeData.csv in the local file system. This file has
to be a comma separated file:
employeeData.csv ;
The second query shows the data being imported into all the columns of the employee table
from employeeData.bar local file:
DELIMITER = ‘|’ ;
The delimiter is specified as vertical bar character and header is set to TRUE so that the first
line of the file is ignored.
The general syntax is ‘select columns from tableName where’ conditions. Cassandra is
restrictive on what you can specify in the ‘where’ clause. You can specify the SELECT
conditions based only on the primary keys or index columns. The syntax ‘Order by’ is
allowed only on clustered columns.
Select * from employee where empid = 1000; // Allowed as empid is the primary key
First, the SELECT statement extracts all the columns from the employee table where empid
is 100. This is allowed as empid is the primary key for the table. The second query shows the
selection of two columns, empid, empFirstName from the employee table where
empFirstName is Jack:
This is allowed only if there is an index for the employee table created on empFirstName
column.
The range condition is allowed only for the last part of the primary key.
For example, if a primary key is defined as (key1, key2, key3, key4), then,
key1 = 100 and key2 = 20 and key3 = 50, and key4 in (3, 4) is allowed;
Key1 = 100 and key2 in (20,30) and key3 =50 and key4 = 3 is not allowed, as only key4 is
allowed to have a range in this query.
You have to specify the values for all leading columns of the primary key. For example, if the
primary key is defined as (key1, key2, key3, key4), then
Let us talk about valid and invalid Select Statement example in the next section.
The following query shows four SELECT statements in Cassandra, out of which two are
valid, and the rest are invalid.
The query shows the testTab table that has its primary key defined as (key1, key2, key3,
key4):
The first statement extracts all the columns where key1 is 100. This is allowed as key1 is the
first column of the primary key.
Select * from testTab where key1 = 100 and key2 = 200; // allowed
The second statement extracts all the columns where key1 is 100 and key2 is 200. This is also
allowed as key1 and key2 are the leading columns of the primary key.
Select * from testTab where key1 = 100 and key3 = 50; // not allowed
The third statement selects all the columns where key1 is 100 and key3 is 50. This is not
allowed as you cannot specify the value for key3 without specifying the value for the
previous key, key2.
Select * from testTab where key1 = 100 and key2 = 20 and key4 = 3; // not allowed
The fourth statement selects all the columns where key1 is 100, key2 is 20, and key4 is 3.
This is also not allowed as the value for key4 is specified without specifying a value for key3.
Cassandra allows removal of rows from a table using the DELETE statement. This statement
removes all the matching rows from the table. If the columns are specified in the DELETE
statement, only the data from the columns will be removed, however, the row will not be
deleted. The ‘where’ clause of the DELETE statement is similar to that of the SELECT
statement. You can also specify additional conditions with the IF clause.
The first statement removes the row where employee ID equals 1000.
The second statement removes the value in the employeeFirstname column, where the
employee ID equals 2000.
The third statement deletes the row with employee ID whose value is 2000 along with the IF
condition, where the salary is more than 100,000.
Summary
Every table should have a primary key, which can be a composite primary key.
Data is partitioned by the first part of the primary key and clustered by the rest.
Cassandra provides CREATE TABLE, ALTER TABLE, and DROP TABLE statements
for data definition.
Cassandra provides INSERT, UPDATE, SELECT, and DELETE statements for data
manipulation.