Apache Cassandra Report

APACHE CASSANDRA 2019-20
SEMINAR REPORT ON
APACHE CASSANDRA
Submitted in partial fulfilment of 5th Semester
MASTER OF COMPUTER APPLICATIONS
of
Visvesvaraya Technological University
CHETHAN GOWDA
1AY18MCA62
Under the Guidance of
Prof. ANITHA K. L.
Acharya Institute of Technology

Acharya Dr. Sarvepalli Radhakrishnan Road,
Soldevanahalli, Hesaraghatta Main Road,
Bengaluru – 560107
www.acharya.ac.in
2019-20
Department of MCA,Acharya Institute of Technology 1

ACHARYA INSTITUTE OF TECHNOLOGY

Department of MCA
( Affiliated to Visvesvaraya Technological University, Belgaum)
Acharya Dr. Sarvepalli Radhakrishnan Road,
Acharya P.O, Bengaluru – 560107
CERTIFICATE
This is to certify that the seminar entitled
APACHE CASSANDRA
Submitted in the partial fulfilment of requirement of the 5th semester
of
Master of Computer Applications
is a result of the bonafide work carried out by
CHETHAN GOWDA
1AY18MCA62
During the academic year 2019-2020
Project Guide
Prof. Anitha K. L.
Assistant Professor
Department of MCA
Acharya Institute of Technology

Prof. Manish Kumar Thakur
HOD
Department of MCA
Acharya Institute of Technology

TABLE OF CONTENTS
Sl No  Contents
1   Introduction
2   History of Cassandra
3   NoSQL database
4   Cassandra architecture
5   Features of Cassandra
6   Data replication in Cassandra
7   Cassandra query language
8   Working
9   Conclusion
10  Future of Apache Cassandra
11  References

ABSTRACT
Apache Cassandra is an open-source, distributed NoSQL database designed to handle large volumes
of data across many commodity servers with no single point of failure. This report traces its
origins at Facebook and its roots in Amazon’s Dynamo and Google’s Bigtable, contrasts NoSQL and
relational databases, and describes Cassandra’s peer-to-peer architecture, key features and data
replication. It then introduces the Cassandra Query Language (CQL) and the cqlsh shell, explains
how read and write operations work, and closes with an assessment of Cassandra’s strengths,
shortcomings and future prospects.

1. INTRODUCTION
Apache Cassandra is a highly scalable, high-performance distributed database management
system that can serve both as an operational datastore (the “system of record”) for
online/transactional applications and as a read-intensive database for business intelligence
systems. Cassandra manages the distribution of data across multiple data centers and offers
incremental scalability with no single point of failure. It is a NoSQL database that is
decentralized (no single point of failure), elastic (linear scalability), fault tolerant
(replication) and optimized for both writes and reads.
It is a structured storage system over a P2P network. Cassandra uses a synthesis of
well-known techniques to achieve scalability and availability.
Cassandra is a distributed storage system for managing structured data that is designed to
scale to a very large size across many commodity servers, with no single point of failure.
The idea is to run on top of an infrastructure of hundreds of nodes, where small and
large components in the data centers fail continuously. Even under these conditions, Cassandra
aims to deliver scalability, high performance, high availability and wide applicability. It does
not support a full relational data model. Instead, it provides clients with a simple data model,
as explained later.
Many modern businesses have outgrown the typical RDBMS use case and need data management
software that offers more. Sharding was a stop-gap measure, but its architectural limitations
and the management complexity it requires make it unacceptable for many mainstream
organizations.
Figure1-Cassandra logo

Apache Cassandra
“Apache Cassandra is an open source, distributed, decentralized, elastically scalable,
highly available, fault-tolerant, tuneably consistent, column-oriented database that bases its
distribution design on Amazon’s Dynamo and its data model on Google’s Bigtable. Created at
Facebook, it is now used at some of the most popular sites on the Web.” This definition relies
on several technical terms: distributed, decentralized, elastically scalable, highly available,
fault-tolerant, tuneably consistent and column-oriented.
In short, Cassandra is a type of NoSQL database designed to handle large amounts of data
across many commodity servers, providing high availability with no single point of failure.

2. HISTORY OF CASSANDRA
Cassandra was developed at Facebook for inbox search.
It was open-sourced by Facebook in July 2008.
Cassandra was accepted into the Apache Incubator in March 2009.
It became an Apache top-level project in February 2010.
Cassandra’s parents – Amazon Dynamo
What is it?
A highly available and scalable storage system used by Amazon to store and retrieve
user shopping carts and to support other core services.
How does it work?
Allows read and write operations to continue even during network partitions and
resolves update conflicts using different conflict resolution mechanisms.
Allows customization to meet an application’s desired trade-offs.
Key techniques: consistent hashing, vector clocks.
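Consistent hashing, named above as one of Dynamo's techniques, can be illustrated with a short Python sketch. This is a toy ring under stated assumptions, not Dynamo's or Cassandra's actual partitioner: the node names, the use of MD5 and the single token per node are all chosen only for illustration.

```python
import bisect
import hashlib


def token(key: str) -> int:
    """Map a string to a position on the ring (MD5 is an illustrative choice)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """A toy consistent-hash ring with one token per node."""

    def __init__(self, nodes):
        # Sort the (token, node) pairs so the ring can be searched with bisect.
        self._ring = sorted((token(n), n) for n in nodes)
        self._tokens = [t for t, _ in self._ring]

    def replicas(self, key: str, replication_factor: int = 1):
        """Return the nodes that hold `key`, walking clockwise from its token."""
        start = bisect.bisect_right(self._tokens, token(key)) % len(self._ring)
        count = min(replication_factor, len(self._ring))
        return [self._ring[(start + i) % len(self._ring)][1] for i in range(count)]


# Hypothetical node names and key, used only to exercise the ring.
ring = HashRing(["node-a", "node-b", "node-c", "node-d"])
owners = ring.replicas("cart:42", replication_factor=3)
print(owners)
```

The useful property is that adding a node moves only the keys that fall into the new node's slice of the ring; every other key keeps its owner.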
Cassandra’s parents – Google Bigtable
What is it?
A high-performance data storage system built on the Google File System and other Google
technologies.
How does it work?
Provides both structure and data distribution but relies on a distributed file system for
durability.
Richer data model than Dynamo: one key, many values. Fast sequential access.
Key techniques: SSTable storage, mem-table, compaction, append-only writes.

3. NOSQL DATABASE
A NoSQL database (sometimes called Not Only SQL) is a database that provides a
mechanism to store and retrieve data other than the tabular relations used in relational
databases.
The primary objectives of a NoSQL database are:
simplicity of design,
horizontal scaling, and
finer control over availability.
NoSQL vs. Relational Database
The following table lists the points that differentiate a relational database from a NoSQL
database.
Relational Database                                                NoSQL Database
Supports powerful query language.                                  Supports very simple query language.
It has a fixed schema.                                             No fixed schema.
Follows ACID (Atomicity, Consistency, Isolation, and Durability).  It is only “eventually consistent”.
Supports transactions.                                             Does not support transactions.
Handles moderate volumes of data.                                  Can handle big data, or data in very high volumes.
Has a centralized structure.                                       Has a decentralized structure.

4. CASSANDRA ARCHITECTURE
Cassandra can satisfy many data-driven application use cases through a carefully
thought-out architecture designed to manage all forms of modern data, scale to meet the
requirements of “big data” management, offer linear performance scale-out capabilities, and
deliver the type of high availability that almost every online, 24x7 application needs. At its
foundation, Cassandra is a peer-to-peer distributed data management system where every node
is essentially the same with respect to how it functions in the cluster. In Cassandra, there is no
concept of a “master node” or anything similar, so no single point of failure exists for any
key process or function.
Figure2-Cassandra table

5. FEATURES OF CASSANDRA
Several technical features make Cassandra very popular.
High Scalability
Cassandra is highly scalable: you can add more hardware to serve more customers and hold
more data as requirements grow.
Rigid Architecture
Cassandra has no single point of failure and is continuously available for business-critical
applications that cannot afford a failure.
Fast Linear-scale Performance
Cassandra scales linearly: throughput increases as you add nodes to the cluster, so it
maintains a quick response time as load grows.
Fault Tolerant
Cassandra is fault tolerant. Suppose there are 4 nodes in a cluster and each node holds
a copy of the same data. If one node stops serving requests, the other three nodes can still
serve them.
Flexible Data Storage
Cassandra accommodates structured, semi-structured and unstructured data, and lets you
change your data structures as your needs change.
Easy Data Distribution
Data distribution in Cassandra is easy because it gives you the flexibility to
distribute data where you need it by replicating data across multiple data centers.
Transaction Support
Cassandra offers some of the Atomicity, Consistency, Isolation, and Durability (ACID)
guarantees, such as atomic and isolated writes within a single partition and durable writes
through the commit log. It does not provide full relational ACID transactions; consistency is
tunable per operation.
Fast Writes
Cassandra was designed to run on cheap commodity hardware. It performs very fast writes
and can store hundreds of terabytes of data without sacrificing read efficiency.

6. DATA REPLICATION IN CASSANDRA
In Cassandra, nodes in a cluster act as replicas for a given piece of data. If some of the
nodes respond with an out-of-date value, Cassandra returns the most recent value to the
client. After returning the most recent value, Cassandra performs a read repair in the
background to update the stale values.
Figure3-Cassandra Replication
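The "most recent value wins, then repair the stale replicas" behaviour described above can be sketched in a few lines of Python. This is a simplified model and not Cassandra's implementation: the `Versioned` record, the replica dictionaries and the node names are hypothetical, and the repair runs inline here rather than in the background.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Versioned:
    value: str
    timestamp: int  # write timestamp; a larger number means a newer write


def read_with_repair(replicas: dict, key: str) -> str:
    """Return the newest value for `key` and overwrite stale replicas."""
    responses = [store[key] for store in replicas.values() if key in store]
    newest = max(responses, key=lambda v: v.timestamp)
    for store in replicas.values():
        if store.get(key) != newest:
            store[key] = newest  # the "read repair" step
    return newest.value


# Three hypothetical replicas; n2 missed the latest write.
replicas = {
    "n1": {"user:1": Versioned("Bengaluru", 2)},
    "n2": {"user:1": Versioned("Mysuru", 1)},
    "n3": {"user:1": Versioned("Bengaluru", 2)},
}
result = read_with_repair(replicas, "user:1")
print(result, replicas["n2"]["user:1"])
```

After the call, the stale replica `n2` holds the same record as the other two, which is the effect read repair is meant to achieve.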
Components of Cassandra
Node: A Cassandra node is a place where data is stored.
Data center: A data center is a collection of related nodes.
Cluster: A cluster is a component which contains one or more data centers.
Commit log: In Cassandra, the commit log is a crash-recovery mechanism. Every write
operation is written to the commit log.
Mem-table: A mem-table is a memory-resident data structure. After the commit log, the data is
written to the mem-table. Sometimes, for a single column family, there will be multiple
mem-tables.
SSTable: It is a disk file to which the data is flushed from the mem-table when its contents
reach a threshold value.
Bloom filter: A Bloom filter is a quick, probabilistic algorithm for testing whether an
element is a member of a set. It can report false positives but never false negatives. It acts
as a special kind of cache: Bloom filters are consulted on every read so that SSTables which
cannot contain the requested data are skipped.
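To make the Bloom filter entry above concrete, here is a minimal Python sketch. It is an illustration only: the bit-array size, the number of hash functions and the use of SHA-256 are assumptions of this example and not the parameters Cassandra uses.

```python
import hashlib


class BloomFilter:
    """A toy Bloom filter: may report false positives, never false negatives."""

    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for readability

    def _positions(self, item: str):
        # Derive several bit positions by hashing the item with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(item))


# Hypothetical row keys stored in one SSTable.
sstable_filter = BloomFilter()
for row_key in ["user:1", "user:2", "user:3"]:
    sstable_filter.add(row_key)
print(sstable_filter.might_contain("user:2"))
```

A read only needs to open an SSTable when its filter answers "possibly present", which is how the filter saves disk access.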

7. CASSANDRA QUERY LANGUAGE
Cassandra Query Language (CQL) is used to access Cassandra through its nodes. CQL
treats the database (keyspace) as a container of tables. Programmers work with CQL either
through cqlsh, an interactive prompt, or through separate application language drivers.
The client can approach any of the nodes for its read and write operations. That node
(the coordinator) acts as a proxy between the client and the nodes holding the data.
cqlsh
This command is used to start the cqlsh prompt. In addition, it supports a few more
options as well. The following table explains the options of cqlsh and their usage.
Options                        Usage
cqlsh --help                   Shows help topics about the options of cqlsh commands.
cqlsh --version                Provides the version of the cqlsh you are using.
cqlsh --color                  Directs the shell to use colored output.
cqlsh --debug                  Shows additional debugging information.
cqlsh --execute cql_statement  Directs the shell to accept and execute a CQL command.
cqlsh --file="file name"       Executes the commands in the given file and exits.
cqlsh --no-color               Directs the shell not to use colored output.
cqlsh -u "user name"           Authenticates a user. The default user name is: cassandra.
cqlsh -p "pass word"           Authenticates a user with a password. The default password is: cassandra.
Documented Shell Commands
• HELP − Displays help topics for all cqlsh commands.
• CAPTURE − Captures the output of a command and adds it to a file.
• CONSISTENCY − Shows the current consistency level, or sets a new consistency level.
• COPY − Copies data to and from Cassandra.
• DESCRIBE − Describes the current cluster of Cassandra and its objects.
• EXPAND − Expands the output of a query vertically.
• EXIT − Using this command, you can terminate cqlsh.
• PAGING − Enables or disables query paging.
• SHOW − Displays the details of the current cqlsh session such as Cassandra version, host, or
data type assumptions.
• SOURCE − Executes a file that contains CQL statements.
• TRACING − Enables or disables request tracing.

CQL Data Definition Commands
• CREATE KEYSPACE − Creates a keyspace in Cassandra.
• USE − Connects to a created keyspace.
• ALTER KEYSPACE − Changes the properties of a keyspace.
• DROP KEYSPACE − Removes a keyspace.
• CREATE TABLE − Creates a table in a keyspace.
• ALTER TABLE − Modifies the column properties of a table.
• DROP TABLE − Removes a table.
• TRUNCATE − Removes all the data from a table.
• CREATE INDEX − Defines a new index on a single column of a table.
• DROP INDEX − Deletes a named index.
CQL Data Manipulation Commands
• INSERT − Adds columns for a row in a table.
• UPDATE − Updates a column of a row.
• DELETE − Deletes data from a table.
• BATCH − Executes multiple DML statements at once.
CQL Clauses
• SELECT − Reads data from a table.
• WHERE − Used along with SELECT to read specific data.
• ORDER BY − Used along with SELECT to read specific data in a specific order.
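The commands listed above can be tried from an application driver as well as from cqlsh. The sketch below keeps a few CQL statements in a Python list and hands them to a session object. The keyspace `demo`, the table `readings` and its columns are hypothetical names made up for this example. The only thing the helper assumes about the session is an `execute(statement)` method, which is what the session returned by `Cluster(...).connect()` in the DataStax Python driver provides; a stand-in session is used here so the example runs without a cluster.

```python
# Hypothetical keyspace, table and data, chosen only to exercise the commands.
STATEMENTS = [
    "CREATE KEYSPACE IF NOT EXISTS demo "
    "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}",
    "CREATE TABLE IF NOT EXISTS demo.readings ("
    "sensor_id int, taken_at timestamp, value double, "
    "PRIMARY KEY (sensor_id, taken_at))",
    "INSERT INTO demo.readings (sensor_id, taken_at, value) "
    "VALUES (1, '2019-11-01 10:00:00+0000', 21.5)",
    "SELECT taken_at, value FROM demo.readings "
    "WHERE sensor_id = 1 ORDER BY taken_at DESC",
]


def run_statements(session, statements):
    """Execute each CQL statement in order and collect whatever the session returns."""
    return [session.execute(cql) for cql in statements]


class RecordingSession:
    """Stand-in for a driver session: records statements instead of sending them."""

    def __init__(self):
        self.seen = []

    def execute(self, cql):
        self.seen.append(cql)
        return None


session = RecordingSession()
run_statements(session, STATEMENTS)
for cql in session.seen:
    print(cql)
```

ORDER BY works in the last statement because `taken_at` is a clustering column of the table and the WHERE clause fixes the partition key `sensor_id`.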

8. WORKING
Read Operations
In a read operation, Cassandra gets values from the mem-table and checks the Bloom
filter to find the SSTables that may contain the required data.
Coordinators send three types of read request to replicas:
Direct request
Digest request
Read repair request
The coordinator first sends a direct request to one of the replicas. It then sends digest
requests to the number of replicas specified by the consistency level and checks whether the
returned data is up to date.
Figure4- Read Operations
After that, the coordinator sends digest requests to all the remaining replicas. If any
node returns an out-of-date value, a background read repair request updates that data. This
process is called the read repair mechanism.
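The direct-request and digest-request steps described above can be sketched as follows. This is a simplified model under stated assumptions: each replica is represented only by the value it holds, MD5 stands in for the digest function, and the function merely reports a mismatch instead of carrying out the repair.

```python
import hashlib


def digest(value: str) -> str:
    """Short fingerprint of a value, cheaper to ship than the value itself."""
    return hashlib.md5(value.encode("utf-8")).hexdigest()


def coordinator_read(replica_values: list, consistency_level: int):
    """Return (value, needs_repair) for a read that contacts `consistency_level` replicas."""
    contacted = replica_values[:consistency_level]
    data = contacted[0]                             # direct request: full data
    digests = [digest(v) for v in contacted[1:]]    # digest requests: hashes only
    needs_repair = any(d != digest(data) for d in digests)
    return data, needs_repair


# Hypothetical replica contents; the third replica is out of date.
values = ["v2", "v2", "v1"]
print(coordinator_read(values, consistency_level=2))
print(coordinator_read(values, consistency_level=3))
```

With two replicas contacted the digests agree, so no repair is flagged; contacting the third replica exposes the stale value and flags a read repair.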

Write Operations
Every write to a node is first recorded in the commit log on that node. The data is then
stored in the mem-table. Whenever the mem-table is full, its data is written to an SSTable
data file.
All writes are automatically partitioned and replicated throughout the cluster. Cassandra
periodically consolidates the SSTables (compaction), discarding unnecessary data.
Figure5-Write Operations
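The write path above (commit log, then mem-table, then SSTable, followed by periodic consolidation) can be modelled with a small Python class. This is a teaching sketch, not Cassandra's storage engine: `ToyNode`, its tiny `memtable_limit` and the in-memory "files" are assumptions of this example, and the real commit log is also trimmed after a flush, which is omitted here.

```python
class ToyNode:
    """A toy single-node write path: commit log -> mem-table -> SSTables."""

    def __init__(self, memtable_limit: int = 2):
        self.memtable_limit = memtable_limit
        self.commit_log = []   # append-only record of every write
        self.memtable = {}     # in-memory, latest value per key
        self.sstables = []     # immutable sorted "files", oldest first

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. log the write for crash recovery
        self.memtable[key] = value            # 2. apply it in memory
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                      # 3. mem-table is full: write an SSTable

    def flush(self):
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}

    def compact(self):
        """Merge all SSTables into one, keeping only the newest value per key."""
        merged = {}
        for table in self.sstables:  # oldest first, so newer values overwrite
            merged.update(table)
        self.sstables = [tuple(sorted(merged.items()))]

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # newest SSTable first
            for k, v in table:
                if k == key:
                    return v
        return None


node = ToyNode(memtable_limit=2)
for key, value in [("a", 1), ("b", 2), ("a", 3), ("c", 4)]:
    node.write(key, value)
node.compact()
print(node.sstables, node.read("a"))
```

Before compaction the node holds two SSTables, one of which still contains the superseded value of `a`; compaction merges them and discards that old value.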

9. CONCLUSION
Apache Cassandra is well suited to large-scale applications that need to access huge
volumes of unstructured data. That being said, Cassandra is still a good choice for smaller
applications, as it delivers a high level of data protection out of the box.
Developing for Cassandra is comparatively simple, as most of the truly clever aspects of this
technology are handled transparently, so developers have no need to write
platform-specific code. This makes Cassandra easy to adopt, as developers need little
preparation before they start creating applications.

10. FUTURE OF APACHE CASSANDRA
Six ways in which Cassandra delivers a powerful foundation for a multi-cloud future:
• Topology-aware availability
• Tunable consistency
• Remote regional awareness
• Flexible global and local consistency
• Simple and effective replication
• Open source licensing with needed flexibility
Some technical problems still have to be fixed. These issues are native to Cassandra and
cannot be tolerated in situations where performance predictability is critical:
• Low predictability of performance caused by JVM garbage collection. This problem was
partially solved in version 2.1, where “only 85% of memtable are out of heap”.
• Unnecessarily complex API of the client libraries. DataStax is attempting to solve this
problem with the DataStax Cassandra Java Client, which is a step in the right direction.
Even with these shortcomings, Cassandra has every chance of becoming one of the most
widely adopted NoSQL solutions and a standard for scalable, highly available storage. Great
things about Cassandra include:
• Linear scalability, which makes scaling out more predictable
• Phenomenal speed of write requests
• Great API for working with time series data
• Numerous options for integration with 3rd party data processing tools: you have a
whole platform to cover the majority of use cases
• A great support community, which includes Netflix and Facebook among others
• Official commercial support from DataStax

11. REFERENCES
1. https://en.wikipedia.org/wiki/Apache_Cassandra
2. http://cassandra.apache.org/
3. https://www.tutorialspoint.com/cassandra/cassandra_introduction.htm
4. https://www.datastax.com/
5. https://www.javatpoint.com/cassandra-tutorial
