
Elasticsearch

Sugeerthi G

Anna University - MIT Campus

July 4, 2021
INTRODUCTION

Elasticsearch is known for its powerful full-text search capabilities.


Its speed comes from the inverted index at its core, and its power
comes from its tunable relevance scoring, advanced query DSL, and
wide range of search-enhancing features.
The key to Elasticsearch is this inverted index, which makes
full-text search far faster than in traditional database systems.
An inverted index at the core is how Elasticsearch is different from
other NoSQL stores, such as MongoDB, Cassandra, and so on.
All the data in Elasticsearch is internally stored in Apache Lucene as
an inverted index.
Although data is stored in Apache Lucene, Elasticsearch is what
makes it distributed and provides the easy-to-use APIs.
Basic Terminologies

1. Fields
Fields are the smallest individual unit of data in Elasticsearch. For example:
title, author, date, summary, team, score, etc.

2. Documents
Documents are JSON objects that are stored within an Elasticsearch index
and are considered the base unit of storage. In the world of relational
databases, documents can be compared to a row in a table.

3. Mappings
A mapping is like a schema in the world of relational databases: it defines the
fields of a document and how they are indexed. (Mapping types were deprecated in
Elasticsearch 7.x, so each index now has a single mapping.)
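
As a minimal sketch of these three terms (assuming a local cluster at http://localhost:9200 and the 7.x Python client; the index and field names here are invented for illustration), a mapping and a document could look like this:

    from elasticsearch import Elasticsearch  # official Python client, 7.x-style API

    es = Elasticsearch("http://localhost:9200")  # assumed local single-node cluster

    # The mapping acts like a schema: it lists the fields and their data types.
    es.indices.create(index="matches", body={
        "mappings": {
            "properties": {
                "title":   {"type": "text"},
                "author":  {"type": "keyword"},
                "date":    {"type": "date"},
                "summary": {"type": "text"},
                "team":    {"type": "keyword"},
                "score":   {"type": "integer"},
            }
        }
    })

    # A document is a JSON object stored in the index.
    es.index(index="matches", id=1, body={
        "title": "Season opener",
        "author": "sugeerthi",
        "date": "2021-07-04",
        "summary": "A close game decided in the final minutes.",
        "team": "MIT",
        "score": 42,
    })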
Basic Terminologies

4. Index
An index is the largest unit of data in Elasticsearch and is the analogue of a
database in the relational world. It can be thought of as a logical partition
of documents.

5. Shards
Shards are Lucene indices, and they are the key to scaling Elasticsearch. An
index can be split into multiple shards, and each shard can reside on a
different node, giving better availability and scalability.

6. Segments
A shard is further divided into multiple segments, each of which is an inverted
index that stores the actual data. Smaller segments are periodically merged into
larger ones to keep searches efficient.
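
To see how an index is laid out as shards and Lucene segments, the cat and segments APIs can be queried; a rough sketch, reusing the client (es) and the invented matches index from the earlier sketch:

    # One row per shard copy of the index.
    print(es.cat.shards(index="matches", format="json"))
    # Per-shard details of the Lucene segments holding the actual data.
    print(es.indices.segments(index="matches"))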
Basic Terminologies

7. Replicas
Replicas, as the name implies, are Elasticsearch's fail-safe mechanism: copies
of an index's shards. They act as a backup when a node crashes. Replicas also
serve read requests, so adding replicas can help increase search performance.

8. Analyzers
Analyzers are also key to an optimized index and efficient search. An analyzer
is applied when a document is indexed: it breaks each phrase down into its
constituent terms. The standard analyzer, Elasticsearch's default, uses a
grammar-based tokenizer and lowercases terms; it can optionally be configured to
remove common English stop words.
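
The _analyze API shows which terms an analyzer would produce; a small sketch (standard analyzer, sample text chosen arbitrarily, client from the earlier sketch):

    # Ask Elasticsearch how the standard analyzer tokenizes a phrase.
    result = es.indices.analyze(body={
        "analyzer": "standard",
        "text": "Winter is coming."
    })
    print([t["token"] for t in result["tokens"]])  # ['winter', 'is', 'coming']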
Basic Terminologies
9. Nodes
Each running instance of Elasticsearch is a node. If one server runs two
Elasticsearch instances, there are two Elasticsearch nodes. Nodes have the
crucial task of storing and indexing data. There are different kinds of
nodes in Elasticsearch:
a) Data node: stores data and executes data-related operations such as
search and aggregation.
b) Master node: in charge of cluster-wide management and configuration
actions, such as adding and removing nodes.
c) Client node: forwards cluster requests to the master node and
data-related requests to data nodes.
d) Ingest node: performs pre-processing on documents before indexing; simple
pipelines that might otherwise be done in Logstash can run here. Each node is
uniquely identified by its name and is master-eligible by default.
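
Node roles can be inspected at runtime; a quick sketch, assuming the 7.x _cat/nodes column names and the client from the earlier sketch:

    # Show each node's name, its roles (d = data, i = ingest, m = master-eligible, ...)
    # and whether it is the elected master.
    print(es.cat.nodes(h="name,node.role,master", format="json"))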
Basic Terminologies

10. Clusters
A cluster consists of one or more Elasticsearch nodes. Each cluster is
identified by a unique name, and every node that wants to join the cluster
must use that name.

11. Translog
Lucene commits are too expensive to perform on every individual change, so each
shard copy also writes operations into its transaction log, known as the
translog. Data in the translog is persisted to disk only when the translog is
fsynced and committed; any writes since the previous translog commit can be lost
in the event of a hardware failure or crash.

12. Flush
A flush commits the in-memory changes to disk as a Lucene commit and clears the
corresponding transaction log.
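
A flush can also be triggered explicitly; a minimal sketch with the client from the earlier sketch:

    # Force a flush: perform a Lucene commit and clear the translog for this index.
    es.indices.flush(index="matches")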
How to work with Elasticsearch?

Elasticsearch provides APIs that are very easy to use, and it will get
you started and take you far without much effort.
However, to get the most out of it, it helps to have some knowledge
about the underlying algorithms and data structures.
We have to start with the basic index structure, the inverted index.
It is a very versatile data structure.
At the same time, it is also easy to use and understand. That said,
Lucene’s implementation is a highly optimized, impressive feat of
engineering.
We will not venture into Lucene’s implementation details at first, but
rather stick to how the inverted index is used and built.
That is what influences how we can search and index.
Inverted indexes and Index terms

Consider this example:
[Figure: an inverted index built from three short documents, mapping each term to the documents that contain it]
Inverted indexes and Index terms
Let’s say we have these three simple documents: ”Winter is coming.”,
”Ours is the fury.” and ”The choice is yours.”.
After some simple text processing (lowercasing, removing punctuation
and splitting words), we can construct the ”inverted index” shown in
the figure.
The inverted index maps terms to documents (and possibly positions
in the documents) containing the term.
Since the terms in the dictionary are sorted, we can quickly find a
term, and subsequently its occurrences in the postings-structure.
This is contrary to a ”forward index”, which lists terms related to a
specific document.
A simple search with multiple terms is then done by looking up all the
terms and their occurrences, and taking the intersection (for AND
searches) or the union (for OR searches) of the sets of occurrences to
get the resulting list of documents.
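
The idea fits in a few lines of plain Python; this sketch builds the same inverted index from the three documents and answers AND/OR queries by intersecting or uniting the posting sets:

    import re

    docs = {
        1: "Winter is coming.",
        2: "Ours is the fury.",
        3: "The choice is yours.",
    }

    # Inverted index: term -> set of document ids containing the term.
    inverted = {}
    for doc_id, text in docs.items():
        for term in re.findall(r"[a-z]+", text.lower()):  # lowercase, drop punctuation, split
            inverted.setdefault(term, set()).add(doc_id)

    def search(terms, mode="AND"):
        postings = [inverted.get(t, set()) for t in terms]
        op = set.intersection if mode == "AND" else set.union
        return op(*postings) if postings else set()

    print(sorted(inverted))                        # the sorted term dictionary
    print(search(["the", "is"], mode="AND"))       # {2, 3}
    print(search(["winter", "fury"], mode="OR"))   # {1, 2}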
Inverted indexes and Index terms

More complex types of queries are obviously more elaborate, but the
approach is the same: first, operate on the dictionary to find
candidate terms, then on the corresponding occurrences, positions,
etc.
Consequently, an index term is the unit of search. The terms we
generate dictate what types of searches we can (and cannot)
efficiently do. For example, with the dictionary in the figure above, we
can efficiently find all terms that start with a ”c”. However, we cannot
efficiently perform a search on everything that contains ”ours”.
To do so, we would have to traverse all the terms, to find that
”yours” also contains the substring.
This is prohibitively expensive when the index is not trivially small. In
terms of complexity, looking up terms by their prefix is O(log(n)),
while finding terms by an arbitrary substring is O(n).
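
The complexity difference can be shown on the sorted term dictionary itself; a sketch using binary search for the prefix case and a full scan for the substring case:

    import bisect

    # The sorted term dictionary from the three example documents.
    dictionary = sorted(["choice", "coming", "fury", "is", "ours",
                         "the", "winter", "yours"])

    def terms_with_prefix(prefix):
        # Binary search to the first term >= prefix, then walk forward: O(log n) to start.
        i = bisect.bisect_left(dictionary, prefix)
        out = []
        while i < len(dictionary) and dictionary[i].startswith(prefix):
            out.append(dictionary[i])
            i += 1
        return out

    def terms_with_substring(sub):
        # No ordering helps here, so every term must be checked: O(n).
        return [t for t in dictionary if sub in t]

    print(terms_with_prefix("c"))         # ['choice', 'coming']
    print(terms_with_substring("ours"))   # ['ours', 'yours']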
Building indexes

When building inverted indexes, there are a few things we need to
prioritize: search speed, index compactness, indexing speed, and the
time it takes for new changes to become visible.
Search speed and index compactness are related: when searching over
a smaller index, less data needs to be processed, and more of it will fit
in memory. Both, particularly compactness, come at the cost of
indexing speed.
Indexes are first built in memory and occasionally flushed to disk; the
files written in such a flush make up an index segment.
Index Segments

Elasticsearch and Lucene generally do a good job of deciding when to
merge segments. Elasticsearch’s merge policies can be tweaked by
configuring merge settings. You can also use the optimize API (called
the force merge API in recent versions) to force merges.
As new segments are created (either due to a flush or a merge), they
also cause certain caches to be invalidated, which can negatively
impact search performance.
Elasticsearch has a warmer API, so the necessary caches can be
”warmed” before the new segment is made available for search.
The most common cause for flushes with Elasticsearch is probably the
continuous index refreshing, which by default happens every second.
As new segments are flushed, they become available for searching,
enabling (near) real-time search.
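
A merge can also be forced explicitly, for example after a large batch index; a minimal sketch (the 7.x Python client exposes this as indices.forcemerge, index name from the earlier sketch):

    # Merge down to one segment per shard. This is expensive, so it is best done
    # only on indices that will not receive many more writes.
    es.indices.forcemerge(index="matches", max_num_segments=1)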
Index Segments

While a flush is not as expensive as a commit, it still causes a new
segment to be created, invalidating some caches and possibly
triggering a merge.
When indexing throughput is important, e.g. when batch
(re-)indexing, it is not very productive to spend a lot of time flushing
and merging small segments.
Therefore, in these cases it is usually a good idea to temporarily
increase the refresh_interval setting, or even disable automatic
refreshing altogether.
One can always refresh manually, and/or when indexing is done.
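
A sketch of that pattern, with illustrative values and the client from the earlier sketch:

    # Disable automatic refreshing while bulk (re-)indexing...
    es.indices.put_settings(index="matches", body={"index": {"refresh_interval": "-1"}})

    # ... heavy indexing happens here ...

    # Restore the default interval and refresh once so everything becomes searchable.
    es.indices.put_settings(index="matches", body={"index": {"refresh_interval": "1s"}})
    es.indices.refresh(index="matches")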
Elasticsearch Indexes
An Elasticsearch index is made up of one or more shards, which can
have zero or more replicas. These are all individual Lucene indexes.
That is, an Elasticsearch index is made up of many Lucene indexes,
which in turn are made up of index segments.
When you search an Elasticsearch index, the search is executed on all
the shards - and in turn, all the segments - and merged.
The same is true when you search multiple Elasticsearch indexes.
Actually, searching two Elasticsearch indexes with one shard each is
pretty much the same as searching one index with two shards.
In both cases, two underlying Lucene indexes are searched.
As documents are added to the index, each one is routed to a shard,
the basic scaling unit of Elasticsearch. By default the target shard
is chosen from the hash of the document’s id, which spreads documents
roughly evenly across the shards. It is important to know, however,
that the number of shards is specified at index creation time and
cannot be changed later on.
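
Because the shard count is fixed, it is set when the index is created; a minimal sketch (shard and replica counts chosen arbitrarily, client from the earlier sketch):

    # Three primary shards, each with one replica: six Lucene indexes in total.
    es.indices.create(index="logs-2021-07-04", body={
        "settings": {
            "number_of_shards": 3,
            "number_of_replicas": 1
        }
    })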
Elasticsearch Indexes

Which Elasticsearch indexes, and which shards (and replicas), search
requests are sent to can be customized in many ways. By combining
index patterns, index aliases, and document and search routing, lots
of different partitioning and data flow strategies can be implemented.
Example:
Lots of data is time-based, e.g. logs, tweets, etc. By creating an
index per day (or week, month, ...), we can efficiently limit searches
to certain time ranges - and expunge old data. Remember, we cannot
efficiently delete from an existing index, but deleting an entire index is
cheap.
When searches must be limited to a certain user (e.g. ”search your
messages”), it can be useful to route all the documents for that user
to the same shard, to reduce the number of indexes that must be
searched.
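
Both strategies come down to choosing index names and routing values at index and search time; a rough sketch (index name and routing key are invented, client from the earlier sketch):

    # Time-based index: one index per day, so old data can be dropped cheaply.
    es.index(index="logs-2021-07-04",
             body={"msg": "user logged in", "user": "u42"},
             routing="u42")  # route all of this user's documents to the same shard

    # Searching with the same routing value only has to hit that one shard.
    es.search(index="logs-*", routing="u42",
              body={"query": {"match": {"msg": "logged"}}})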
Transactions

While Lucene has a concept of transactions, Elasticsearch does not.
All operations in Elasticsearch add to the same timeline, which is not
necessarily entirely consistent across nodes, as flushing is reliant
on timing.
Managing the isolation and visibility of different segments, caches, and
so on, across indexes and across nodes in a distributed system, is very
hard. Instead of trying to do this, Elasticsearch prioritizes being fast.
Elasticsearch has a ”transaction log” where documents to be indexed
are appended. Appending to a log file is a lot cheaper than building
segments, so Elasticsearch can write the documents to index
somewhere durable - in addition to the in-memory buffer, which is
lost on crashes. You can also specify the consistency level required
when you index. For example, you can require every replica to have
indexed the document before the index operation returns.
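
In recent Elasticsearch versions this kind of requirement is expressed with the wait_for_active_shards parameter; a minimal sketch with the client from the earlier sketch:

    # Require all shard copies (primary and replicas) to be active before the write proceeds.
    es.index(index="matches", id=2, body={"title": "Second match"},
             wait_for_active_shards="all")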
Summary

How we process the text we index dictates how we can search. Proper
text analysis is important.
Indexes are built first in-memory, then occasionally flushed in
segments to disk.
Index segments are immutable. Deleted documents are marked as
such.
An index is made up of multiple segments. A search is done on every
segment, with the results merged.
Segments are occasionally merged.
Field and filter caches are per segment.
Elasticsearch does not have transactions.
Thank you
