Sugeerthi G
July 4, 2021
INTRODUCTION
1.Fields
Fields are the smallest individual unit of data in Elasticsearch, for
example title, author, date, summary, team, and score.
2.Documents
Documents are JSON objects that are stored within an Elasticsearch index
and are considered the base unit of storage. In the world of relational
databases, documents can be compared to a row in a table.
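Because documents are plain JSON objects, they map directly onto a dictionary in code. A minimal sketch in Python, with hypothetical field values matching the examples from the Fields section:

```python
import json

# A hypothetical document; the field names follow the examples
# given in the Fields section, the values are invented.
doc = {
    "title": "Scaling search",
    "author": "A. Writer",
    "date": "2021-07-04",
    "summary": "Notes on Elasticsearch basics.",
    "score": 4.5,
}

# Elasticsearch stores documents as JSON, so a document
# serializes directly to the payload sent to the index.
payload = json.dumps(doc)
```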
3.Mappings
A mapping is like a schema in the world of relational databases: it defines
the fields of a document and how they are indexed. (Strictly, it is mapping
types, not mappings themselves, that were deprecated in Elasticsearch 7.x.)
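To make the schema analogy concrete, here is a hypothetical mapping sketched as a Python dict; the field names and types are illustrative, not taken from any real index:

```python
# A hypothetical mapping body: each field is given an explicit type,
# much like column types in a relational schema.
mapping = {
    "mappings": {
        "properties": {
            "title":  {"type": "text"},     # analyzed full-text field
            "author": {"type": "keyword"},  # exact-match, not analyzed
            "date":   {"type": "date"},
            "score":  {"type": "float"},
        }
    }
}
```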
Basic Terminologies
4.Index
These are the largest units of data in Elasticsearch. They are analogous
to databases in the relational world and can be thought of as logical
partitions of documents.
5.Shards
These are Lucene indices, and they are the key to scaling Elasticsearch.
An index can be split into multiple shards, and each shard can reside on a
different node, giving better availability and scalability.
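Elasticsearch decides which primary shard a document lands on with a hash of its routing value (by default, the document id) modulo the number of primary shards. A simplified Python sketch of that idea, using md5 as a stand-in for Elasticsearch's murmur3 hash:

```python
import hashlib

def route_to_shard(doc_id: str, num_primary_shards: int) -> int:
    # Simplified stand-in for Elasticsearch's routing formula:
    # shard = hash(_routing) % number_of_primary_shards.
    # md5 here is only a placeholder for the real murmur3 hash.
    h = int(hashlib.md5(doc_id.encode("utf-8")).hexdigest(), 16)
    return h % num_primary_shards
```

The point of the formula is determinism: the same document id always routes to the same shard, which is also why the number of primary shards cannot be changed after index creation without re-indexing.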
6.Segments
A shard is further divided into multiple segments, each of which is an
inverted index that stores the actual data. Smaller segments are
periodically merged into bigger segments to keep searches efficient.
7.Replicas
Replicas, as the name implies, are Elasticsearch fail-safe mechanisms and
are basically copies of your index’s shards. This is a useful backup system
for a rainy day — or, in other words, when a node crashes. Replicas also
serve read requests, so adding replicas can help to increase search
performance.
8.Analyzers
Analyzers also matter for an optimized index and efficient search. An
analyzer is applied when a document is indexed, breaking a phrase down
into its constituent terms. The standard analyzer is the default analyzer
used by Elasticsearch; it contains a grammar-based tokenizer and lowercases
terms, and it can optionally remove common English stop words (this is
disabled by default).
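A rough sketch of what an analyzer does at index time, assuming a simple word-boundary tokenizer plus lowercasing; real analyzers are considerably more sophisticated:

```python
import re

def analyze(text: str, stopwords=frozenset()) -> list[str]:
    # Tokenize on non-alphanumeric boundaries, lowercase each token,
    # and optionally drop stop words -- a toy version of the
    # tokenizer + token-filter pipeline an analyzer runs.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in stopwords]
```

For example, `analyze("The Quick Brown Fox!")` yields the terms `the`, `quick`, `brown`, `fox`, and passing `stopwords={"the"}` drops the article.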
9.Nodes
Each instance of Elasticsearch is a node. If one server runs 2
instances of Elasticsearch, there are 2 nodes of Elasticsearch. A node has
the crucial task of storing and indexing data. There are different kinds of
nodes in Elasticsearch:
a) Data node- stores data and executes data-related operations such as
search and aggregation.
b) Master node- in charge of cluster-wide management and configuration
actions such as adding and removing nodes.
c) Client node- forwards cluster requests to the master node and
data-related requests to data nodes.
d) Ingestion node- performs pre-processing on documents before indexing.
(Logstash plays a similar pre-processing role outside the cluster.) Each
node is identified uniquely by its name and is master-eligible by default.
10.Clusters
A cluster is comprised of one or more Elasticsearch nodes. Each cluster
has a unique name, and every node that wants to be part of the cluster
must be configured with that name.
11.Translog
Lucene commits are too expensive to perform on every individual change,
so each shard copy also writes operations into a transaction log known as
the translog. The translog is persisted to disk so that operations which
have been acknowledged but not yet included in a Lucene commit can be
replayed after a JVM failure or shard crash; only data written since the
last translog fsync can be lost.
12.Flush
A flush performs a Lucene commit, persisting the in-memory operations to
disk, and starts a new translog generation so the old one can be discarded.
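The translog/flush relationship can be sketched conceptually (this is not Elasticsearch code; the class and method names are invented for illustration):

```python
# Conceptual sketch of a write-ahead log: every write is logged
# before acknowledgement, and a flush commits and truncates the log.
class TinyShard:
    def __init__(self):
        self.committed = []  # data durably committed (Lucene commit)
        self.translog = []   # acknowledged operations since last flush

    def index(self, op):
        # Every write is appended to the log before acknowledgement.
        self.translog.append(op)

    def flush(self):
        # A flush commits the logged operations and truncates the log.
        self.committed.extend(self.translog)
        self.translog.clear()

    def recover(self):
        # After a crash, replay whatever remains in the translog.
        self.flush()
```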
How to work with Elasticsearch?
Elasticsearch provides APIs that are very easy to use, and it will get
you started and take you far without much effort.
However, to get the most of it, it helps to have some knowledge
about the underlying algorithms and data structures.
We have to start with the basic index structure, the inverted index.
It is a very versatile data structure.
At the same time, it is easy to use and understand. That said,
Lucene's implementation is a highly optimized, impressive feat of
engineering.
We will not venture into Lucene’s implementation details at first, but
rather stick to how the inverted index is used and built.
That is what influences how we can search and index.
Inverted indexes and Index terms
More complex types of queries are obviously more elaborate, but the
approach is the same: first, operate on the dictionary to find
candidate terms, then on the corresponding occurrences, positions,
etc.
Consequently, an index term is the unit of search. The terms we
generate dictate what types of searches we can (and cannot)
efficiently do. For example, with the dictionary in the figure above, we
can efficiently find all terms that start with a "c". However, we cannot
efficiently perform a search on everything that contains "ours".
To do so, we would have to traverse all the terms to find that
"yours" also contains the substring.
This is prohibitively expensive when the index is not trivially small. In
terms of complexity, looking up terms by their prefix is O(log(n)),
while finding terms by an arbitrary substring is O(n).
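The complexity difference can be demonstrated with a toy inverted index in Python: the sorted term dictionary supports a binary-search prefix lookup, while a substring search has to scan every term (the document texts are invented for illustration):

```python
import bisect
from collections import defaultdict

# Build a toy inverted index: term -> set of document ids.
docs = {
    1: "winter is coming",
    2: "ours is the fury",
    3: "the choice is yours",
}
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

dictionary = sorted(index)  # the sorted term dictionary

def prefix_search(prefix):
    # O(log n): binary-search into the sorted dictionary,
    # then scan forward while terms still match the prefix.
    i = bisect.bisect_left(dictionary, prefix)
    out = []
    while i < len(dictionary) and dictionary[i].startswith(prefix):
        out.append(dictionary[i])
        i += 1
    return out

def substring_search(sub):
    # O(n): every term in the dictionary must be inspected.
    return [t for t in dictionary if sub in t]
```

With these documents, `prefix_search("c")` finds `choice` and `coming` without touching the rest of the dictionary, while `substring_search("ours")` has to scan all terms to discover that `yours` also matches.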
Building indexes
How we process the text we index dictates how we can search. Proper
text analysis is important.
Indexes are built first in-memory, then occasionally flushed in
segments to disk.
Index segments are immutable. Deleted documents are marked as
such.
An index is made up of multiple segments. A search is done on every
segment, with the results merged.
Segments are occasionally merged.
Field and filter caches are per segment.
Elasticsearch does not have transactions.
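The per-segment search described above can be sketched as follows, with invented document ids and scores: each immutable segment is searched independently and the per-segment results are merged and re-ranked:

```python
# Each segment returns its own hits (doc id -> score); the
# segment contents here are invented for illustration.
segments = [
    {"doc1": 1.2, "doc3": 0.4},  # hits from segment 1
    {"doc2": 0.9},               # hits from segment 2
]

def search_all(segments):
    # A search runs against every segment; the per-segment
    # results are then merged into one ranked list.
    merged = {}
    for seg in segments:
        merged.update(seg)
    # Rank merged hits by score, highest first.
    return sorted(merged, key=merged.get, reverse=True)
```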
Thank you