University of Science and Technology of Southern Philippines

Alubijid | Cagayan de Oro | Claveria | Jasaan | Oroquieta | Panaon

Course: IT315 – IT Elective 1 (Distributed Database Management System)

Module No.: 7
Title: Tuning and Optimization
ILO:

✓ Understand how a query is processed using relational algebra
✓ Understand how to optimize a query using a query tree
✓ Understand the concepts of query optimization in distributed systems
✓ Identify the different methods for monitoring performance
✓ Identify the best practices for optimizing performance

QUERY PROCESSING

Relational Algebra

Since the introduction of the relational model by Codd in 1970 [Codd70], two classes of languages have been proposed and
implemented to work with a relational database. The first class is called nonprocedural and includes relational calculus
and Quel. The second class is known as procedural and includes relational algebra and the Structured Query Language
(SQL) [SQL92]. In procedural languages, the query directs the DBMS on how to arrive at the answer. In contrast, in
a nonprocedural language, the query indicates what is needed and leaves it to the system to find the process for arriving
at the answer. Although it sounds easier to tell the system what is needed instead of how to get the answer,
nonprocedural languages are not as popular as procedural languages. As a matter of fact, SQL (a procedural language)
is the only widely accepted language for end user interface to relational systems today.

For the remainder of this section we will use the following notations:

• R and S are two relations.
• The number of tuples in a relation is called the cardinality of that relation.
• R has attributes a1, a2, ... , an and has cardinality of K.
• S has attributes b1, b2, ... , bm and has cardinality of L.
• r is a tuple in R and is shown as r[a1, a2, ... , an].
• s is a tuple in S and is shown as s[b1, b2, ... , bm].

Subset of Relational Algebra Commands

Relational algebra (RA) supports unary and binary types of operations. Unary operations take one relation (table) as
an input and produce another as the output. Binary operations take two relations as input and produce one relation as
the output. Note that regardless of the type of operation, the output is always a relation. RA operators are divided into
basic operators and derived operators. Basic operators need to be supported by the language compiler since they
cannot be created from any other operations. Derived operators, on the other hand, are optional since they can be
expressed in terms of the basic operators.

Notations in focus:

• SL represents the relational algebra SELECT operator.
• PJ represents the relational algebra PROJECT operator.
• JN represents the relational algebra JOIN operator.
• NJN represents the relational algebra natural JOIN operator.
• UN represents the relational algebra UNION operator.
• SD represents the relational algebra SET DIFFERENCE operator.
• CP represents the relational algebra CROSS PRODUCT operator.
• SI represents the relational algebra SET INTERSECT operator.
• DV represents the relational algebra DIVIDE operator.

Figure 28. Symbols used in Relational Algebra

Relational Algebra Basic Operators

Select Operator in Relational Algebra

The select operator returns all tuples of the relation whose attribute(s) satisfy the given predicates (conditions). If
no condition is specified, the select operator returns all tuples of the relation.

For example, “SL bal = 1200 (Account)” returns all accounts that have a balance of $1200. The result is a relation with
four attributes (since the Account relation has four attributes) and as many rows as the number of accounts with a
balance of exactly $1200.

Project Operator in Relational Algebra

The project operator returns the values of all attributes specified in the project operation for all tuples of the
relation passed as a parameter. In a project operation, all rows qualify but only those attributes specified are
returned.

For instance, “PJ Cname, Ccity (Customer)” returns the customer name and the city where the customer lives for each
and every customer of the bank.

Combining Select and Project

We can combine the select and project operators in forming complex RA expressions that not only apply a given set of
predicates to the tuples of a relation but also trim the attributes to a desired set.

For example, assume we want to get the customer ID and customer name for all customers who live in Edina. We can
do this by combining the SL and the PJ expressions as “PJ CID, Cname (SL Ccity = ‘Edina’ (Customer)).”
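
The combined expression can be sketched in Python. This is only an illustration: the Customer rows below are made up, and the helper names `select` and `project` are assumptions of this sketch, not part of any DBMS API.

```python
# Hypothetical Customer relation; attribute names follow the text, rows are invented.
customer = [
    {"CID": 1, "Cname": "Ann", "Ccity": "Edina"},
    {"CID": 2, "Cname": "Bob", "Ccity": "Eden Prairie"},
    {"CID": 3, "Cname": "Cara", "Ccity": "Edina"},
]

def select(relation, predicate):
    """SL: keep only the tuples that satisfy the predicate."""
    return [row for row in relation if predicate(row)]

def project(relation, attributes):
    """PJ: keep only the named attributes and drop duplicate tuples."""
    seen, result = set(), []
    for row in relation:
        key = tuple(row[a] for a in attributes)
        if key not in seen:
            seen.add(key)
            result.append(dict(zip(attributes, key)))
    return result

# PJ CID, Cname (SL Ccity = 'Edina' (Customer))
edina = project(select(customer, lambda r: r["Ccity"] == "Edina"), ["CID", "Cname"])
print(edina)  # [{'CID': 1, 'Cname': 'Ann'}, {'CID': 3, 'Cname': 'Cara'}]
```

The select runs first so that the project only has to trim the tuples that already qualified.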

Union Operator in Relational Algebra

Union is a binary operation in RA that combines the tuples from two relations into one relation. Any tuple in the union
is in the first relation, the second relation, or both relations. In a sense, the union operator in RA behaves the same
way that the addition operator works in math: it adds up the elements of two sets. There are two compatibility
requirements for the union operation. First, the two relations have to be of the same degree, that is, they have to
have the same number of attributes. Second, corresponding attributes of the two relations have to be from compatible
domains.

The following statements are true for the union operation in RA:
• We cannot union relations “R(a1, a2, a3)” and “S(b1, b2)” because they have different degrees.
• We cannot union relations “R(a1 char(10), a2 Integer)” and “S(b1 char(15), b2 Date)” because the a2 and b2
attributes have different data types.
• If relation “R(a1 char(10), a2 Integer)” has cardinality K and relation “S(b1 char(10), b2 Integer)” has
cardinality L, then “R UN S” is of the form “(c1 char(10), c2 Integer)” and has a cardinality of at most “K + L.” It
is exactly “K + L” only when R and S have no tuples in common, because a relation holds no duplicate tuples.

Suppose we need to get the name and the city for all of the customers who live in a city named “Edina” or “Eden
Prairie.” To find the results, we first need to create a temporary relation that holds Cname and Ccity for all customers
in Edina; then we need to repeat this for all the customers in Eden Prairie; and finally, we need to union the two
relations. We can write this RA expression as follows:

PJ Cname, Ccity (SL Ccity = ‘Edina’ (Customer))
UN
PJ Cname, Ccity (SL Ccity = ‘Eden Prairie’ (Customer))

The union operator is commutative, meaning that “R UN S = S UN R.” Also, the union operator is associative, meaning
that “R UN (S UN P) = (R UN S) UN P.”
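
As a minimal sketch, union can be modeled with Python sets of tuples. The relations and their rows are invented for illustration, and the `union` helper is an assumption of this sketch. It also shows that the cardinality of the result reaches K + L only when the two relations share no tuples.

```python
# Relations as sets of tuples; the rows are made-up illustration data.
R = {("Ann", 10), ("Bob", 20)}                 # cardinality K = 2
S = {("Bob", 20), ("Cara", 30), ("Dan", 40)}   # cardinality L = 3
P = {("Eve", 50)}

def union(r, s):
    """UN: tuples in r, in s, or in both; shared tuples appear once."""
    if r and s and len(next(iter(r))) != len(next(iter(s))):
        raise ValueError("relations must have the same degree")
    return r | s

result = union(R, S)
print(len(result))                                     # 4, because ("Bob", 20) is shared
print(union(R, S) == union(S, R))                      # True: UN is commutative
print(union(R, union(S, P)) == union(union(R, S), P))  # True: UN is associative
```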

Set Difference Operator in Relational Algebra

Set difference (SD) is a binary operation in RA that subtracts the tuples in one relation from the tuples of another
relation. In other words, SD removes the tuples that are in the intersection of the two relations from the first
relation and returns the result. In “S SD R,” the tuples in the set difference belong to the S relation but do not
belong to R. In a sense, the set difference operator in RA behaves the same way that the subtraction operator works in
math. There are again two compatibility requirements for this operation. First, the two relations have to be of the
same degree, and second, the corresponding attributes of the two relations have to come from compatible domains.

Assume we need to print the customer ID for all customers who have an account at the Main branch but do not have
a loan there. To do this, we first form the set of all customers with accounts at the Main branch and then subtract all
the customers with a loan at the Main branch from that set. This excludes the customers who are in the intersection of
the two sets (those who have both an account and a loan at the Main branch) leaving behind the desired customers.
The RA expression for this question is written as

PJ CID (SL Bcity = ‘Main’ (Account))
SD
PJ CID (SL Bcity = ‘Main’ (Loan))

Cartesian Product Operator in Relational Algebra

Cartesian product (CP), which is also known as cross product, is a binary operation that concatenates each and every
tuple from the first relation with each and every tuple from the second relation. CP is a set operator that multiplies
the elements of two sets. In a sense, the CP operator in RA behaves the same way that the multiplication operator
works in math. This operation is rarely used on its own in practice, since it produces a large number of tuples, most
of which do not contain any useful information.

Relational Algebra Derived Operators

In addition to the basic operators in RA, the language also has a set of derived operators. These operators are called
“derived” since they can be expressed in terms of the basic operators. As a result, they are not required by the language,
but are supported for ease of programming. These operators are SI, JN (NJN), and DV. The following sections
present an overview of these operators.

Set Intersect Operator in Relational Algebra

Set intersect (SI) is a binary operator that returns the tuples in the intersection of two relations. If the two
relations do not intersect, the operator returns an empty relation. Suppose that we need to get the customer name for
all customers who have an account and a loan at the Main branch in the bank. Considering the set of customers who have
an account at the Main branch and the set of customers who have a loan at that branch, the answer to the question falls
in the intersection of the two sets and can be expressed as follows:

PJ Cname (SL Bcity = ‘Main’ (Account))
SI
PJ Cname (SL Bcity = ‘Main’ (Loan))

The SI operation is associative and commutative. Therefore, “R SI S = S SI R” and “R SI (S SI P) = (R SI S) SI P.”
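
Because SI is a derived operator, it can be written with the basic set difference operator alone, using the identity R SI S = R SD (R SD S). The sketch below illustrates this with invented customer names; the helper names are assumptions of the sketch.

```python
# Hypothetical results of the two projections; the names are made up.
account_main = {("Ann",), ("Bob",), ("Cara",)}
loan_main = {("Bob",), ("Cara",), ("Dan",)}

def set_difference(r, s):
    """SD: tuples of r that do not appear in s."""
    return {t for t in r if t not in s}

def set_intersect(r, s):
    """SI derived from the basic operator SD: R SI S = R SD (R SD S)."""
    return set_difference(r, set_difference(r, s))

both = set_intersect(account_main, loan_main)
print(sorted(both))                                     # [('Bob',), ('Cara',)]
print(both == set_intersect(loan_main, account_main))   # True: SI is commutative
```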

Join Operator in Relational Algebra

The join (JN) operator in RA is a special case of the CP operator. In a CP operation, rows from the two relations are
concatenated without any restrictions. However, in a JN, before the tuples are concatenated, they are checked against
some condition(s). JN is a binary operation that returns a relation by combining tuples from two input relations based
on some specified conditions. These operations are known as conditional joins, where conditions are applied to the
attributes of the two relations before the tuples are concatenated.

One popular join condition is to force equality on the values of the attributes of the two relations. These types of joins
are known as equi-joins. The expression “R JN a2 = b2 S” is a join that returns a relation with at most “K ∗ L” tuples,
where each tuple is in the form “[a1, a2, ... , an, b1, b2, ... , bm]” and satisfies the condition “a2 = b2.”

RA supports the concept of natural join, where equality is enforced automatically on the attributes of the two relations
that have the same name.
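
The relationship between CP and JN can be sketched as follows: a join is a cross product whose rows are then filtered by the join condition. The rows and the helper names `cross_product` and `join` are assumptions made for this illustration.

```python
from itertools import product

# Made-up relations: R has cardinality K = 2, S has cardinality L = 3.
R = [{"a1": 1, "a2": "x"}, {"a1": 2, "a2": "y"}]
S = [{"b1": 10, "b2": "x"}, {"b1": 20, "b2": "z"}, {"b1": 30, "b2": "x"}]

def cross_product(r, s):
    """CP: concatenate every tuple of r with every tuple of s."""
    return [{**x, **y} for x, y in product(r, s)]

def join(r, s, condition):
    """JN: keep only the concatenated tuples that satisfy the condition."""
    return [row for row in cross_product(r, s) if condition(row)]

cp = cross_product(R, S)
equi = join(R, S, lambda row: row["a2"] == row["b2"])
print(len(cp))    # 6, which is K * L
print(len(equi))  # 2, never more than K * L
```

A real DBMS avoids materializing the full cross product; the sketch only mirrors the definition.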

Divide Operator in Relational Algebra

Divide (DV) is a binary operator that takes two relations as input and produces one relation as the output. The DV
operation in RA is similar to the divide operation in math. We will use an example to show how the divide operation
works in relational algebra.

The following explains what the DV operator actually does for a first relation with attributes (Cname, Bname) and a
second relation with attribute (Bname):

• First, the DV operation performs a group-by on the Cname attribute of the first relation, which results in a set of
branch names for each customer.
• Then, it checks to see if the set of Bname values associated with every unique value of Cname is the same set or a
superset of the set of Bname values from the second relation. If it is (the same set or superset), then the customer
identified by that Cname is part of the division result.

Since DV is a derived operation, it can be expressed using the basic RA operations as follows:

R1 = PJ cname (R) CP S
R2 = PJ cname (R1 SD R)
R(cname, bname) DV S(bname) = PJ cname (R) SD R2
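
The sketch below checks, on invented data, that the derived formula gives the same answer as the group-by description of DV. The function names and the sample rows are assumptions of this illustration.

```python
# R(cname, bname) and S(bname) with made-up rows.
R = {("Ann", "Main"), ("Ann", "North"), ("Bob", "Main"),
     ("Cara", "Main"), ("Cara", "North"), ("Cara", "South")}
S = {("Main",), ("North",)}

def divide_by_grouping(r, s):
    """Group branch names per customer and keep customers covering all of s."""
    required = {bname for (bname,) in s}
    branches = {}
    for cname, bname in r:
        branches.setdefault(cname, set()).add(bname)
    return {(cname,) for cname, names in branches.items() if required <= names}

def divide_derived(r, s):
    """Same result using only project, cross product, and set difference."""
    pj_cname = {(cname,) for cname, _ in r}
    r1 = {(cname, bname) for (cname,) in pj_cname for (bname,) in s}  # PJ cname (R) CP S
    r2 = {(cname,) for cname, _ in (r1 - r)}                          # PJ cname (R1 SD R)
    return pj_cname - r2                                              # PJ cname (R) SD R2

print(sorted(divide_derived(R, S)))                   # [('Ann',), ('Cara',)]
print(divide_derived(R, S) == divide_by_grouping(R, S))  # True
```

R2 collects the customers that are missing at least one required branch, so subtracting it leaves exactly the customers that have them all.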

QUERY TREE IN RELATIONAL ALGEBRA

In RDBMSs, a query's internal representation is stored in a data structure called a query tree, also known as the
Query Evaluation/Execution Tree. The relations are represented by the query tree's leaf nodes, and relational algebra
operations like SELECT (σ), JOIN (⋈), and so on are represented by the internal nodes. When the query is executed, the
root node produces the result.

Steps in Generating Query Tree

1. Execute each leaf node together with its parent internal node, which holds a relational algebra operator and its
conditions, to obtain the tuples used by the next operation.
2. Repeat this until we reach the root node, at which point we PROJECT (π) the required attributes of the remaining
tuples as the output.

Let's work through an example to better grasp this:

Consider a relational algebra expression –

π_P (R ⋈_{R.P = S.P} S)

i. Write the relations you want to execute as the tree’s Leaf nodes. Here R and S are the
relations.

ii. Add the condition (here R.P = S.P) with the relational algebra operator as an internal node
(or parent node of these two leaf nodes).

iii. Now add the root node that on execution gives the output of the query.
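
The three steps can be sketched as a small tree of Python objects that is evaluated bottom-up. The class names, the qualified attribute names such as `R.P`, and the rows are all assumptions made for this illustration; a real optimizer uses a far richer representation.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Relation:
    """Leaf node: a stored relation."""
    name: str
    rows: list

@dataclass
class Join:
    """Internal node: JOIN of two subtrees under a condition."""
    left: object
    right: object
    condition: Callable[[dict, dict], bool]

@dataclass
class Project:
    """Root node in this example: PROJECT onto the listed attributes."""
    child: object
    attributes: list

def evaluate(node):
    """Evaluate the tree bottom-up: children first, then the operator."""
    if isinstance(node, Relation):
        return node.rows
    if isinstance(node, Join):
        left, right = evaluate(node.left), evaluate(node.right)
        return [{**l, **r} for l in left for r in right if node.condition(l, r)]
    if isinstance(node, Project):
        seen, out = set(), []
        for row in evaluate(node.child):
            key = tuple(row[a] for a in node.attributes)
            if key not in seen:
                seen.add(key)
                out.append(dict(zip(node.attributes, key)))
        return out
    raise TypeError(f"unknown node type: {type(node).__name__}")

# Step i: leaves. Step ii: join node with R.P = S.P. Step iii: project root.
R = Relation("R", [{"R.P": 1, "R.Q": "a"}, {"R.P": 2, "R.Q": "b"}, {"R.P": 3, "R.Q": "c"}])
S = Relation("S", [{"S.P": 1, "S.T": "u"}, {"S.P": 3, "S.T": "v"}, {"S.P": 3, "S.T": "w"}])
tree = Project(Join(R, S, lambda r, s: r["R.P"] == s["S.P"]), ["R.P"])
print(evaluate(tree))  # [{'R.P': 1}, {'R.P': 3}]
```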

Query Processing in Distributed Systems

There are two distinct categories of distributed systems: (1) distributed homogeneous DBE (what we called a DDBE)
and (2) distributed heterogeneous DBE (what we called a MDB). In both of these systems, processing a query consists
of optimization and planning at the global level as well as at the local database environment (LDBE) level. Figure 30
depicts how queries are processed in a DDBE.

Figure 30. Distributed query processing architecture

The site where the query enters the system is called the client or controlling site. The client site needs to validate the
user or application attempting to access the relations in the query; to check the query’s syntax and reject it if it is
incorrect; to translate the query to relational algebra; and to globally optimize the query.

Mapping Global Query into Local Queries

A global query is written against the global schema. The relations used in the global query may be distributed
(fragmented and/or replicated) across multiple local DBEs. Each local DBE only works with the local view of the
information at its site and is unaware of how the data stored at its site is related to the global view. It is the
responsibility of the controlling site to use the global data dictionary (GDD) to determine the distribution
information and reconstruct the global view from local physical fragments.

Example

Suppose the EMP relation is horizontally fragmented based on the value of the LOC attribute as discussed in Chapter
2. Each employee works at one of three possible locations (LA, NY, or MPLS). The LA server stores the information
about employees who work in LA. Similarly, NY and MPLS servers store the information about employees who work
at these locations. Now we will consider a query that needs to retrieve the name of all employees who make more than
$50,000. Since EMP does not physically exist, the global query has to be mapped to a set of local queries that run
against the fragments of EMP as shown below.

Global Query: PJ Ename (SL sal > 50000 (EMP))

LA’s Query: PJ Ename (SL sal > 50000 (LA_EMP))
NY’s Query: PJ Ename (SL sal > 50000 (NY_EMP))
MPLS’s Query: PJ Ename (SL sal > 50000 (MPLS_EMP))

The GDD contains rules for reconstructing the original relation from its fragments.
Because the EMP relation is horizontally fragmented, we need to union the fragments together to reconstruct the
original relation. After adding this reconstruction step, the global query looks like the following:

PJ Ename (SL sal > 50000 (LA_EMP))
UN
PJ Ename (SL sal > 50000 (NY_EMP))
UN
PJ Ename (SL sal > 50000 (MPLS_EMP))

It should be obvious that, just like a local query, there are multiple equivalent
expressions (or query trees) for any extended global query. These alternatives are
generated when we take into account the associative and commutative properties of the relational algebra binary
operators in a query.
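
The mapping from the global query to local queries can be sketched as follows. The employee rows are invented, and `high_earners` is a hypothetical helper that stands for the expression PJ Ename (SL sal > 50000 (...)). Running it on each fragment and taking the union gives the same answer as running it on the reconstructed EMP relation.

```python
# Made-up horizontal fragments of EMP, one per site.
la_emp = [{"Ename": "Ann", "sal": 62000, "LOC": "LA"},
          {"Ename": "Bob", "sal": 48000, "LOC": "LA"}]
ny_emp = [{"Ename": "Cara", "sal": 51000, "LOC": "NY"}]
mpls_emp = [{"Ename": "Dan", "sal": 45000, "LOC": "MPLS"},
            {"Ename": "Eve", "sal": 70000, "LOC": "MPLS"}]

def high_earners(emp_rows):
    """PJ Ename (SL sal > 50000 (rows)), returned as a set of names."""
    return {row["Ename"] for row in emp_rows if row["sal"] > 50000}

# Each site runs its local query; the controlling site unions the results.
local_results = [high_earners(fragment) for fragment in (la_emp, ny_emp, mpls_emp)]
global_result = set().union(*local_results)

# Reconstructing EMP first and then querying it is an equivalent alternative.
reconstructed_emp = la_emp + ny_emp + mpls_emp
print(sorted(global_result))                             # ['Ann', 'Cara', 'Eve']
print(global_result == high_earners(reconstructed_emp))  # True
```

Pushing the select and project down to the fragments means only qualifying names cross the network instead of whole fragments.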

Distributed Query Optimization

Global optimization is greatly impacted by the database distribution design. Just like a local query optimizer, a global
query optimizer must evaluate a large number of equivalent query trees, each of which can produce the desired results.
In a distributed system, the number of alternatives increases drastically as we apply replication, horizontal
fragmentation, and vertical fragmentation to the
relations involved in the query.

Example

Let us consider joining three relations (A, B, and C) as “A JN B JN C” in a three-site distributed system. As
discussed in Section 4.4.2.2, there are 12 different query tree alternatives that can be used for this purpose. Let’s now
assume that although A is not fragmented, it is replicated, and there are two copies of it; that B is horizontally
fragmented into B1, B2, and B3; and finally, that C is vertically fragmented into C1 and C2. For this distributed
database, since B and C do not physically exist, we need to reconstruct them from their fragments. To include
reconstruction steps for relations B and C, the global query changes to

A JN B JN C ≡ A JN (B1 UN B2 UN B3) JN (C1 JN C2)
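
As a rough sketch of the two reconstruction rules, horizontal fragments are recombined with a union and vertical fragments with a join on their shared key. The attribute layout, the key column `id`, and all rows are assumptions made for this example.

```python
# A(id, a_val): not fragmented (either replica can be read).
A = {(1, "a-one"), (2, "a-two")}

# B(id, b_val): horizontal fragments hold disjoint subsets of the rows.
B1, B2, B3 = {(1, "b-one")}, {(2, "b-two")}, {(3, "b-three")}

# C(id, c_name, c_qty): vertical fragments hold different columns plus the key.
C1 = {1: "p", 2: "q", 3: "r"}   # id -> c_name
C2 = {1: 10, 2: 20, 3: 30}      # id -> c_qty

B = B1 | B2 | B3                                        # B1 UN B2 UN B3
C = {(k, C1[k], C2[k]) for k in C1.keys() & C2.keys()}  # C1 JN C2 on the key

# A JN B JN C, joining on the shared id attribute.
result = sorted(
    (k, a, b, name, qty)
    for (k, a) in A
    for (kb, b) in B if kb == k
    for (kc, name, qty) in C if kc == k
)
print(result)  # [(1, 'a-one', 'b-one', 'p', 10), (2, 'a-two', 'b-two', 'q', 20)]
```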

Utilization of Resources

In a distributed system, there are many database servers available that can perform the operations within a query.
There are three main approaches on how these resources are utilized. The difference among these approaches is based
on whether we perform the operation where the data is; send the data to another site to perform the operation; or a
combination of the two.

Operation Shipping
In this approach, we run the operation where the data is stored. In order to achieve this, the local DBE must have a
database server that can perform the operation. A good example of the kind of operation that lends itself nicely to
operation shipping is any unary operation such as relational algebra’s select (SL) and project (PJ). Since a unary
operation works on only one relation, the most logical and most economical choice in terms of communication cost is to
run the operation where the data fragment is stored. Even for binary operations, such as join and union, we prefer
operation shipping if the two operands of the operation are stored at the same site.

Data Shipping
Data shipping refers to an alternative that sends the data to where the database server is. This obviously requires
shipment of data fragments across the network, which can potentially lead to a high communication cost. Consider
the case of having to select some tuples from a relation. We can perform the SL operation where the data is as explained
in operation shipping. We can also send the entire relation to another site and perform the SL operation there. One can
argue that in a high-speed fiber optic communication network, it might be faster to send the entire relation from a low-
speed processor to a high-speed processor and perform the SL operation there.

Hybrid Shipping
Hybrid shipping combines data and operation shipping. This approach takes advantage of the speed of database
servers to perform expensive operations such as joins by transferring the smaller relation to the server where the larger
relation resides. Figure 31 shows the three alternatives mentioned above.

Figure 31. Operation, Data and Hybrid Shipping Examples
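
To make the trade-off concrete, the sketch below compares how many bytes cross the network under each approach. It is a deliberately simplified model: the row counts, row size, and selectivity are invented numbers, the function names are hypothetical, and processing speed is ignored.

```python
def bytes_for_data_shipping(rows, row_bytes):
    """Ship the whole relation to the remote site, then run SL there."""
    return rows * row_bytes

def bytes_for_operation_shipping(rows, row_bytes, selectivity):
    """Run SL where the data is stored; ship only the qualifying tuples."""
    return round(rows * selectivity) * row_bytes

def bytes_for_hybrid_join(rows_a, rows_b, row_bytes):
    """Ship the smaller join operand to the site that stores the larger one."""
    return min(rows_a, rows_b) * row_bytes

# Assumed workload: 100,000 rows of 200 bytes, and 5% of the rows qualify.
print(bytes_for_data_shipping(100_000, 200))             # 20000000
print(bytes_for_operation_shipping(100_000, 200, 0.05))  # 1000000
print(bytes_for_hybrid_join(100_000, 2_000, 200))        # 400000
```

Under these assumptions operation shipping moves 20 times less data for the select, which is why it is the usual choice for unary operations.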

Dynamic Programming in Distributed Systems


The dynamic programming principles we discussed for centralized systems can also be used for query
optimization in distributed systems.

CASSANDRA OPTIMIZATION

Tuning Performance

Cassandra's data management strategy is founded on the simple observation that caching can significantly increase
read speed. Since reads can be cached and minimized, the storage engine is optimized for fast writes: it avoids random
access and relies on sequential writes.

Cassandra uses SSTables to do this. The fundamental idea is that updates are writes and deletes are also writes (they
write tombstones). SSTables are immutable files used to store tables on disk. New writes create new SSTables rather
than modifying existing ones, so writing is quite simple. Reading, however, might require scanning an entire SSTable.

To further optimize writes, we can write in batches. Data is therefore first stored in memory in Memtables. When a
Memtable reaches a size threshold, its content is flushed to an SSTable. Because many writes to the same key are
combined in the Memtable, a flush performs only one write to the SSTables.

So as not to lose writes in the event of a crash, Cassandra uses a commitlog that can be replayed. Every use of a
Memtable therefore comes with a commitlog. Writing to the commitlog is efficient because only the update instructions
are written; the entire row does not have to be written.
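
The write path described above can be sketched with a toy in-memory model. This is not how Cassandra is implemented; the class `ToyStore`, its methods, and the flush threshold counted in keys are all assumptions chosen to keep the illustration short.

```python
TOMBSTONE = object()  # marker written in place of a deleted value

class ToyStore:
    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold
        self.commit_log = []   # append-only record of updates not yet flushed
        self.memtable = {}     # in-memory buffer; the latest write per key wins
        self.sstables = []     # immutable, key-sorted tuples; newest last

    def write(self, key, value):
        self.commit_log.append((key, value))  # logged so a crash could be replayed
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def delete(self, key):
        self.write(key, TOMBSTONE)  # a delete is just another write

    def flush(self):
        """Write the memtable out once as a new immutable, sorted table."""
        self.sstables.append(tuple(sorted(self.memtable.items(), key=lambda kv: kv[0])))
        self.memtable = {}
        self.commit_log = []   # flushed data no longer needs replaying

    def read(self, key):
        """Check the memtable first, then the tables from newest to oldest."""
        value = None
        if key in self.memtable:
            value = self.memtable[key]
        else:
            for table in reversed(self.sstables):
                found = dict(table)
                if key in found:
                    value = found[key]
                    break
        return None if value is TOMBSTONE else value

store = ToyStore(flush_threshold=3)
for key, value in [("a", 1), ("a", 2), ("b", 3), ("c", 4)]:
    store.write(key, value)
store.delete("b")
print(store.read("a"), store.read("b"), len(store.sstables))  # 2 None 1
```

The two writes to key "a" reach disk as a single entry, and the tombstone for "b" in the memtable hides the older value stored in the flushed table.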

Naturally, we cannot afford a complete SSTable scan for each read. Cassandra comes with a number of default
techniques designed to make reads faster.

The first is the Bloom filter. A Bloom filter is a probabilistic data structure used to test whether a key may exist:
it can return false positives but never false negatives. It enables Cassandra to almost never scan the disk in search
of nonexistent keys.
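
A minimal Bloom filter can be sketched in a few lines of Python. The sizes, the use of SHA-256 to derive bit positions, and the class and method names are assumptions of this illustration and do not reflect Cassandra's internal implementation.

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for readability

    def _positions(self, key):
        """Derive num_hashes bit positions from salted SHA-256 digests."""
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        """False means definitely absent; True means possibly present."""
        return all(self.bits[pos] for pos in self._positions(key))

bloom = BloomFilter()
bloom.add("user:42")
print(bloom.might_contain("user:42"))  # True: an added key is never reported absent
```

When `might_contain` returns False, the disk lookup for that key can be skipped entirely; a True answer still has to be confirmed by reading the data.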

Additionally, Cassandra features a Key Cache that stores the location of a given key on disk, so the key can be read
significantly faster.

Lastly, Cassandra features a Row Cache that stores the values of specific rows in memory. It keeps "hot," frequently
accessed values available and enables more precise caching than the file system cache. Row caching helps avoid disk
reads.

Performance Monitoring

Cassandra's nodetool lets you narrow problems from the cluster down to a particular node and gives a great deal of
information about the current state of the Cassandra process itself.

You can use nodetool status to assess status of the cluster:

In this instance, we can observe that three nodes—each holding roughly 4.6GB of data—are present in a single
datacenter and are all "up." You might need to run nodetool status on several nodes in a cluster to see the entire picture,
as each node in the cluster separately determines a node's up/down state.

You can use nodetool status plus a little grep to see which nodes are down:

In this instance, there are two datacenters, and one node is down in datacenter dc2, rack r1. This can point to a
problem with 127.0.0.5 that needs to be looked into.

nodetool proxyhistograms can be used to view the distributions of coordinator read and write latency, which can help
identify latency issues.

This displays the whole latency distribution for reads, writes, range queries (such as select * from keyspace.table),
CAS reads (the compare phase of CAS), and CAS writes (the set phase of compare and set). These are helpful in narrowing
down high-level latency issues. For instance, in this case, a client with a 20 millisecond timeout on its reads might
experience an occasional timeout from this node, but for fewer than 1% of requests (because the 99% read latency is 3.3
milliseconds < 20 milliseconds).

To gain a better understanding of what is occurring locally on a node, you can utilize nodetool tablehistograms if you
know which table is experiencing latency or error problems:

This displays percentile breakdowns for important metrics.

The first column shows how many SSTables were read for each logical read. If this figure is very high, it could mean
that the wrong compaction strategy was selected. For example, SizeTieredCompactionStrategy typically reads many more
SSTables per logical read than LeveledCompactionStrategy for workloads with a lot of updates.

The second column gives a breakdown of local write latency. In this instance, the p50 is quite good at 73 microseconds,
but the maximum latency of 12 milliseconds is quite bad. High write max latencies are frequently a sign of a slow
commitlog volume (slow to fsync) or of large writes that quickly fill commitlog segments.

The third column gives a breakdown of local read latency. Local Cassandra reads are slower than local writes, and the
read speed correlates strongly with the number of SSTables read per read.

The fourth and fifth columns show the distributions of partition size and number of columns per partition. They can
help you identify problematic data patterns and determine whether the table's partitions are generally wide or thin. A
single cell that is two megabytes, for instance, will most likely cause some heap pressure when it is read.

To see the current outstanding requests on a certain node, use nodetool tpstats. This helps determine which resource
(read threads, write threads, compaction, request response threads) the Cassandra process is running short of. As an
illustration:

This command shows a number of interesting statistics. The first part gives a detailed breakdown of the thread pools
for each Cassandra stage, including how many threads are currently executing (Active) and how many are waiting to run
(Pending). In most cases, pending executions in a specific thread pool point to an issue with that kind of operation.
For instance, if the RequestResponseStage queue is backing up, it could be a sign of a lack of token awareness on the
part of the coordinators or of very high consistency levels being used for read requests (reading at ALL, for instance,
ties up RF RequestResponseStage threads, whereas LOCAL_ONE only uses one thread in the ReadStage thread pool). On the
other hand, if you notice a large number of pending compactions, your compaction threads may be unable to keep up with
the volume of writes, and the compaction strategy, concurrent_compactors, or compaction_throughput settings may need to
be adjusted.

The second section displays the latency distributions and drops (errors) for each of the main request types. Drops are
cumulative since the process started, but if you see any, there may be a serious issue, because the default timeouts
that qualify a message as a drop are rather long (~5–10 seconds). Dropped messages usually warrant further
investigation.

Being an LSM datastore, Cassandra occasionally needs to compact SSTables together, which can have an adverse effect on
performance. Compaction in particular can strain your disk drives significantly, take a fair amount of CPU resources,
and invalidate a significant portion of the OS page cache. Although there are excellent OS tools such as iostat to
ascertain whether this is the case, it's usually a good idea to use nodetool compactionstats to see if compactions are
even occurring:

In this instance, a single compaction is running on the keyspace.table table. It has completed 21.8 MB of 97, and
Cassandra estimates, based on the configured compaction throughput, that it will take 4 seconds. You can also pass -H
to get the units in a human-readable format.

In general, each running compaction can use one core, and the more compactions run in parallel, the faster data is
compacted. Good read performance depends on compaction keeping up, so it is critical to balance the number of
concurrent compactions so that they finish quickly without taking too many resources away from query threads. Adjust
the concurrent_compactors or compaction_throughput parameters in Cassandra if you observe that compaction is not
keeping up.

Best Practices for Optimizing Performance

Enhance Your Data Modeling


Optimizing data modeling is one of the key components in enhancing Cassandra read performance. To optimize this
process, it is essential to build your data model around your queries. This implies that you should arrange your data
model based on an understanding of how your data will be accessed. Data is arranged into tables in Cassandra, and
each table includes one or more columns. To reduce the number of reads needed to obtain the data you require, you
should design your tables with a relatively small number of wide rows as opposed to many tiny rows.

Employ Proper Data Types


Read latency may also be affected by the sort of data you select for your columns. Optimizing performance can be
achieved by selecting the right type of data. For instance, you can read data faster and use less disk space if you choose
smaller data types like INT or SMALLINT instead of BIGINT.

Use Appropriate Compression

Compression reduces the amount of data stored, which can improve read latency. LZ4 and Snappy are two of the
compression algorithms that Cassandra provides. Using them can drastically reduce the amount of data that has to be
read from disk, which leads to faster read times.

Set the Proper Level of Consistency


You can adjust how many nodes must reply to a read request before the data is deemed acceptable using Cassandra's
consistency level parameter. This setting is significant because it affects read latency. You can encounter high read
latency if you set the consistency level too high, as the database will have to wait for responses from an excessive
number of nodes. However, you risk reading outdated data if you set the consistency level too low. Taking into account
the specifications of your application, you should select the proper consistency level.
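
The trade-off can be expressed with a little arithmetic. With a replication factor RF, a quorum is floor(RF / 2) + 1 replicas, and a read is guaranteed to overlap the latest acknowledged write when the number of replicas read plus the number of replicas written exceeds RF. The helper names below are assumptions of this sketch.

```python
def quorum(replication_factor):
    """Number of replicas that make up a majority."""
    return replication_factor // 2 + 1

def reads_see_latest_write(read_replicas, write_replicas, replication_factor):
    """True when every read set must overlap every write set."""
    return read_replicas + write_replicas > replication_factor

rf = 3
print(quorum(rf))                                          # 2
print(reads_see_latest_write(quorum(rf), quorum(rf), rf))  # True: QUORUM reads and writes
print(reads_see_latest_write(1, 1, rf))                    # False: ONE read, ONE write
print(reads_see_latest_write(1, rf, rf))                   # True: ONE read, ALL write
```

Reading from a single replica is the fastest option but can return stale data unless writes wait for all replicas; quorum reads and writes sit between the two extremes.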

Make Use of Caching

Row cache and key cache are two of the caching strategies that Cassandra offers. While key cache keeps the most
frequently used partition keys in memory, row cache keeps whole rows there. By allowing data to be fetched from memory
rather than the disk, caching can greatly reduce read latency. However, because caching can consume a lot of RAM, it
should be used sparingly.

Optimize Hardware

The hardware that Cassandra is operating on has an impact on its performance as well. The following advice will help
you optimize hardware to reduce read latency:

• For storage, switch from HDDs to SSDs. The read and write times of SSDs are faster, which can greatly increase
performance.
• To lower network latency, use fast network adapters.
• Make sure you have enough CPU and RAM to handle your workload.

Make Use of Read Repair

Cassandra has a mechanism called "read repair" that fixes inconsistent data automatically when it is read. A read from
Cassandra may retrieve data from several nodes, and these nodes might hold different values for the same column. In
order to prevent stale data and lower read latency, read repair makes sure that the most recent value is kept on all
nodes.

Enhance Bloom Filters


Cassandra uses Bloom filters to check whether the requested data may be present before reading it from disk. Bloom
filters are probabilistic data structures that are useful for rapidly assessing an element's likelihood of being in a
set. By using them to avoid reading unnecessary data from disk, Cassandra can reduce read latency. Bloom filters can be
made more efficient by changing the filter's size and the number of hash functions employed.
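
The effect of the filter's size and hash count can be estimated with the standard approximation for a Bloom filter with m bits, n keys, and k hash functions: the false-positive rate is about (1 - e^(-kn/m))^k, and the rate is lowest near k = (m / n) ln 2. The sketch below applies these formulas; the function names and the example sizes are assumptions.

```python
import math

def false_positive_rate(bits, keys, hashes):
    """Approximate false-positive probability of a Bloom filter."""
    return (1 - math.exp(-hashes * keys / bits)) ** hashes

def optimal_hash_count(bits, keys):
    """Hash count that approximately minimizes the false-positive rate."""
    return max(1, round(bits / keys * math.log(2)))

keys = 1_000
small, large = 10_000, 20_000   # filter sizes in bits
k_small = optimal_hash_count(small, keys)
k_large = optimal_hash_count(large, keys)
print(k_small, false_positive_rate(small, keys, k_small))
print(k_large, false_positive_rate(large, keys, k_large))
```

Doubling the number of bits per key lowers the false-positive rate sharply, at the cost of more memory for the filter.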

Employ SSTable Compression


Sorted string tables, or SSTables for short, are immutable data files that hold a sorted list of key-value pairs and are
used by Cassandra to store data. These files can be made smaller with SSTable compression, which can decrease read
latency. You may speed up read times by compressing SSTables, which will lessen the amount of data that needs to
be read from disk.

Track and Adjust Performance


Lastly, it's critical to consistently check and adjust Cassandra's performance. This involves keeping an eye on
metrics such as disk use, cache hit rate, and read latency. By monitoring performance, you can find bottlenecks and
adjust your database configuration. Cassandra offers many tools for performance monitoring and tuning, including
nodetool, which lets you inspect and manage Cassandra nodes, and cassandra-stress, which can be used to stress test
your database and find performance problems.

References:

[1] Tellier, B. (2017, October 9). Tuning Cassandra performances. Linagora Engineering.
https://medium.com/linagora-engineering/tunning-cassandra-performances-7d8fa31627e3

[2] Use Nodetool | Apache Cassandra Documentation. (n.d.). Cassandra.apache.org. Retrieved December 17, 2023,
from https://cassandra.apache.org/doc/latest/cassandra/troubleshooting/use_nodetool.html

[3] 10 Ways To Improve Cassandra Read Performance: Ultimate Guide. (2023, May 19).
https://www.heatware.net/cassandra/optimize-cassandra-read-performance/

[4] Query Tree in Relational Algebra. (2023, September 7). GeeksforGeeks.
https://www.geeksforgeeks.org/query-tree-in-relational-algebra/
