ElasticSearch
Cookbook
Second Edition
Alberto Paro
BIRMINGHAM - MUMBAI
All rights reserved. No part of this book may be reproduced, stored in a retrieval system,
or transmitted in any form or by any means, without the prior written permission of the
publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the
information presented. However, the information contained in this book is sold without
warranty, either express or implied. Neither the author, nor Packt Publishing and its
dealers and distributors, will be held liable for any damages caused or alleged to be
caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all the
companies and products mentioned in this book by the appropriate use of capitals.
However, Packt Publishing cannot guarantee the accuracy of this information.
ISBN 978-1-78355-483-6
www.packtpub.com
Credits
Reviewers: Florian Hopf, Wenhan Lu, Suvda Myagmar, Dan Noble, Philip O'Toole
Proofreaders: Ting Baker, Samuel Redman Birch, Stephen Copestake, Ameesha Green, Lauren E. Harkins
About the Author
Alberto Paro is an engineer, project manager, and software developer. He currently works
as CTO at Big Data Technologies and as a freelance consultant on software engineering for
Big Data and NoSQL solutions. He loves to study emerging solutions and applications, mainly
related to Big Data processing, NoSQL, natural language processing, and neural networks.
He began programming in BASIC on a Sinclair Spectrum when he was 8 years old, and to
date, he has accumulated a lot of experience with different operating systems, applications,
and programming languages.
In his spare time, when he is not playing with his children, he likes to work on open source
projects. When he was in high school, he started contributing to projects related to the GNOME
environment (gtkmm). One of his preferred programming languages is Python, and he wrote
one of the first NoSQL backends on Django for MongoDB (Django-MongoDB-engine). In 2010,
he began using ElasticSearch to provide search capabilities to some Django e-commerce
sites and developed PyES (a Pythonic client for ElasticSearch), as well as the initial part of the
ElasticSearch MongoDB river. He is the author of ElasticSearch Cookbook, a technical
reviewer of Elasticsearch Server, Second Edition, and the author of the video course, Building
a Search Server with ElasticSearch, all of which are published by Packt Publishing.
Acknowledgments
It would have been difficult for me to complete this book without the support of a large
number of people.
First, I would like to thank my wife, my children, and the rest of my family for their
valuable support.
On a more personal note, I'd like to thank my friend, Mauro Gallo, for his patience.
I'd like to express my gratitude to everyone at Packt Publishing who was involved in the
development and production of this book. I'd like to thank Amey Varangaonkar for guiding this
book to completion, and Florian Hopf, Philip O'Toole, and Suvda Myagmar for patiently going
through the first drafts and providing valuable feedback. Their professionalism, courtesy,
good judgment, and passion for this book are much appreciated.
About the Reviewers
Philip O'Toole has developed software and led software development teams for more than
15 years for a variety of applications, including embedded software, networking appliances,
web services, and SaaS infrastructure. His most recent work with ElasticSearch includes
leading infrastructure design and development of Loggly's log analytics SaaS platform, whose
core component is ElasticSearch. He is based in the San Francisco Bay Area and can be found
online at http://www.philipotoole.com.
www.PacktPub.com
Did you know that Packt offers eBook versions of every book published, with PDF and ePub
files available? You can upgrade to the eBook version at www.PacktPub.com, and as a
print book customer, you are entitled to a discount on the eBook copy. Get in touch with
us at service@packtpub.com for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up
for a range of free newsletters, and receive exclusive discounts and offers on Packt books
and eBooks.
https://www2.packtpub.com/books/subscription/packtlib
Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book
library. Here, you can search, access, and read Packt's entire library of books.
Why subscribe?
- Fully searchable across every book published by Packt
- Copy and paste, print, and bookmark content
- On demand and accessible via a web browser
Table of Contents
Deleting by query
Matching all the documents
Querying/filtering for a single term
Querying/filtering for multiple terms
Using a prefix query/filter
Using a Boolean query/filter
Using a range query/filter
Using span queries
Using a match query
Using an ID query/filter
Using a has_child query/filter
Using a top_children query
Using a has_parent query/filter
Using a regexp query/filter
Using a function score query
Using exists and missing filters
Using and/or/not filters
Using a geo bounding box filter
Using a geo polygon filter
Using a geo distance filter
Using a QueryString query
Using a template query
Chapter 6: Aggregations
Introduction
Executing an aggregation
Executing the stats aggregation
Executing the terms aggregation
Executing the range aggregation
Executing the histogram aggregation
Executing the date histogram aggregation
Executing the filter aggregation
Executing the global aggregation
Executing the geo distance aggregation
Executing the nested aggregation
Executing the top hit aggregation
Chapter 7: Scripting
Introduction
Installing additional script plugins
Managing scripts
Sorting data using scripts
Computing return fields with scripting
Filtering a search via scripting
Updating a document using scripts
Chapter 8: Rivers
Introduction
Managing a river
Using the CouchDB river
Using the MongoDB river
Using the RabbitMQ river
Using the JDBC river
Using the Twitter river
Chapter 9: Cluster and Node Monitoring
Introduction
Controlling cluster health via the API
Controlling cluster state via the API
Getting cluster node information via the API
Getting node statistics via the API
Managing repositories
Executing a snapshot
Restoring a snapshot
Installing and using BigDesk
Installing and using ElasticSearch Head
Installing and using SemaText SPM
Installing and using Marvel
Chapter 10: Java Integration
Introduction
Creating an HTTP client
Creating a native client
Managing indices with the native client
Managing mappings
Managing documents
Managing bulk actions
Building a query
Executing a standard search
Executing a search with aggregations
Executing a scroll/scan search
Preface
One of the main requirements of today's applications is search capability. On the market, we
can find a lot of solutions that answer this need, in both the commercial and the open source
worlds. One of the most used libraries for searching is Apache Lucene. This library is the base of
a large number of search solutions, such as Apache Solr, Indextank, and ElasticSearch.
ElasticSearch is written with both cloud and distributed computing in mind. Its main author,
Shay Banon, who is famous for having developed Compass (http://www.compass-project.org),
released the first version of ElasticSearch in March 2010.
Although the main scope of ElasticSearch is to be a search engine, it also provides a lot of
features that allow you to use it as a data store and an analytics engine via aggregations.
Before ElasticSearch, only Apache Solr provided some of these functionalities, but it was not
designed for the cloud and did not offer a JSON/REST API. In the last few years, this situation
has changed a bit with the release of SolrCloud in 2012. For users who want to compare these
two products more thoroughly, I suggest reading the posts by Rafał Kuć available at
http://blog.sematext.com/2012/08/23/solr-vs-elasticsearch-part-1-overview/.
Founded in 2012, the ElasticSearch company has raised a total of USD 104 million in
funding. ElasticSearch's success can best be described by the words of Steven Schuurman,
the company's cofounder and CEO:
It's incredible to receive this kind of support from our investors over such a short
period of time. This speaks to the importance of what we're doing: businesses are
generating more and more data—both user- and machine-generated—and it has
become a strategic imperative for them to get value out of these assets, whether
they are starting a new data-focused project or trying to leverage their current
Hadoop or other Big data investments.
ElasticSearch has an impressive track record for its search product, powering customers
such as Foursquare (which indexes over 50 million venues), the online music distribution
platform SoundCloud, StumbleUpon, and the enterprise social network Xing, which has 14
million members. It also powers GitHub, which searches 20 terabytes of data and 1.3 billion
files, and Loggly, which uses ElasticSearch as a key-value store to index clusters of data for
rapid analytics of logfiles.
In my opinion, ElasticSearch is probably one of the most powerful and easy-to-use search
solutions on the market. Throughout this book and these recipes, the book's reviewers and
I have sought to transmit our knowledge, passion, and best practices to help readers better
manage ElasticSearch.
Chapter 9, Cluster and Node Monitoring, shows you how to analyze the behavior of a
cluster/node to understand common pitfalls.
Chapter 10, Java Integration, describes how to integrate ElasticSearch in a Java application
using both REST and native protocols.
Chapter 11, Python Integration, covers the usage of the official ElasticSearch Python client
and the Pythonic PyES library.
Chapter 12, Plugin Development, describes how to create the different types of plugins:
site and native plugins. Some examples show the plugin skeletons, the setup process,
and their build.
For all the REST examples, the cURL tool (http://curl.haxx.se/) will be used to simulate
the commands from the command line. It comes preinstalled on Linux and Mac OS X
operating systems. For Windows, it can be downloaded from its site and added to a directory
in the PATH so that it can be called from the command line.
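As a minimal sketch of the pattern the recipes follow (the host, port, and endpoint path below are illustrative defaults, not values taken from a specific recipe), every REST call is built from the node's HTTP address plus a target path:

```shell
# The node's default HTTP endpoint; adjust host and port for your setup.
ES_HOST="127.0.0.1"
ES_PORT="9200"
ES_URL="http://${ES_HOST}:${ES_PORT}"

# A typical recipe call checks the cluster health (requires a running node):
#   curl -X GET "${ES_URL}/_cluster/health?pretty"
# Print the request line that such a call would issue:
echo "GET ${ES_URL}/_cluster/health"
```

Each recipe then only varies the HTTP method (GET, PUT, POST, DELETE) and the path after the endpoint.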
Chapter 10, Java Integration, and Chapter 12, Plugin Development, require the Maven build
tool (http://maven.apache.org/), which is a standard tool for managing builds, packaging,
and deployment in Java. It is natively supported by most Java IDEs, such as Eclipse and
IntelliJ IDEA.
Chapter 11, Python Integration, requires a Python interpreter installed on your computer.
It's available by default on Linux and Mac OS X. For Windows, it can be downloaded from the
official Python website (http://www.python.org). The examples in this chapter have been
tested using version 2.x.
The chapters toward the end of the book discuss ElasticSearch integration with the Java and
Python programming languages, showing users how to integrate the power of ElasticSearch
into their Java- and Python-based applications.
Chapter 12, Plugin Development, talks about the advanced use of ElasticSearch and its core
extensions, so you will need some prior Java knowledge to understand this chapter fully.
Sections
This book contains the following sections:
Getting ready
This section tells us what to expect in the recipe, and describes how to set up any software or
any preliminary settings needed for the recipe.
How to do it…
This section describes the steps to be followed to "cook" the recipe.
How it works…
This section usually consists of a detailed explanation of what happened in the
previous section.
There's more…
This section consists of additional information about the recipe in order to make the reader
more knowledgeable about it.
See also
This section may contain references to other useful information for the recipe.
Conventions
In this book, you will find a number of styles of text that distinguish between different
kinds of information. Here are some examples of these styles, and an explanation of
their meaning.
Code words in text, database table names, folder names, filenames, file extensions,
pathnames, dummy URLs, user input, and Twitter handles are shown as follows:
"After the name and type parameters, usually a river requires an extra configuration
that can be passed in the _meta property."
When we wish to draw your attention to a particular part of a code block, the relevant lines or
items are set in bold:
cluster.name: elasticsearch
node.name: "My wonderful server"
network.host: 192.168.0.1
discovery.zen.ping.unicast.hosts: ["192.168.0.2","192.168.0.3[9300-9400]"]
New terms and important words are shown in bold. Words you see on the screen, in menus
or dialog boxes, for example, appear in the text like this: "If you don't see the cluster statistics,
put your node address to the left and click on the connect button."
Reader feedback
Feedback from our readers is always welcome. Let us know what you think about this book—
what you liked or may have disliked. Reader feedback is important for us to develop titles you
really get the most out of.
If there is a topic you have expertise in and you are interested in either writing or contributing
to a book, see our author guide at www.packtpub.com/authors.
Customer support
Now that you are the proud owner of a Packt book, we have a number of things to help you get
the most from your purchase.
Errata
Although we have taken every care to ensure the accuracy of our content, mistakes do happen.
If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be
grateful if you could report this to us. By doing so, you can save other readers from frustration
and help us improve subsequent versions of this book. If you find any errata, please report them
by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on
the Errata Submission Form link, and entering the details of your errata. Once your errata are
verified, your submission will be accepted and the errata will be uploaded to our website or
added to any list of existing errata under the Errata section of that title.
Piracy
Piracy of copyrighted material on the Internet is an ongoing problem across all media.
At Packt, we take the protection of our copyright and licenses very seriously. If you come
across any illegal copies of our works, in any form, on the Internet, please provide us with
the location address or website name immediately so we can pursue a remedy.
We appreciate your help in protecting our authors, and our ability to bring you valuable content.
Questions
If you have a problem with any aspect of this book, you can contact us at
questions@packtpub.com, and we will do our best to address the problem.
Chapter 1: Getting Started
In this chapter, we will cover:
Introduction
To use ElasticSearch efficiently, it is very important to understand how it works.
The goal of this chapter is to give readers an overview of the basic concepts of
ElasticSearch and to serve as a quick reference for them. It's essential to understand
the basics so that you don't fall into the common pitfalls about how ElasticSearch
works and how to use it.
The key concepts that we will see in this chapter are: node, index, shard, mapping/type,
document, and field.
A brief description of the ElasticSearch logic helps the user to improve performance and
search quality, and to decide when and how to optimize the infrastructure to improve
scalability and availability.
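To make the relation between documents and shards concrete, here is a small sketch (CRC32 stands in for ElasticSearch's real internal hash, which is an implementation detail) of how a document's routing value maps it to one primary shard; it also illustrates why the number of primary shards is fixed at index creation time:

```python
import zlib

def route_to_shard(routing_value: str, num_primary_shards: int) -> int:
    """Pick the primary shard for a document.

    ElasticSearch routes a document by hashing its routing value (the
    document ID by default) modulo the number of primary shards; CRC32
    is used here only as an illustrative stand-in for the real hash.
    """
    return zlib.crc32(routing_value.encode("utf-8")) % num_primary_shards

# Every document with the same routing value always lands on the same
# shard, which is how ElasticSearch finds it again on retrieval.
shard = route_to_shard("doc-42", 5)
```

Changing the number of primary shards would change the result of the modulo for existing documents, which is why it cannot be modified after index creation.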
Some details on data replication and the base node communication processes are also explained.
At the end of this chapter, the protocols used to manage ElasticSearch are also discussed.
Getting ready
To better understand the following sections, some basic knowledge about the concepts of
the application node and cluster are required.
How it works...
One or more ElasticSearch nodes can be set up on a physical or a virtual server depending
on the available resources such as RAM, CPU, and disk space.
A default node allows you to store data in it and to process requests and responses.
(In Chapter 2, Downloading and Setting Up, we'll see details about how to set up
different nodes and cluster topologies.)
When a node is started, several actions take place during its startup. After startup, the node
searches for other cluster members and checks its index and shard status.
To join two or more nodes in a cluster, the following rules must be observed:
- The version of ElasticSearch must be the same (v0.20, v0.9, v1.4, and so on), or the join is rejected.
- The cluster name must be the same.
- The network must be configured to support multicast discovery (it is configured to support it by default) so that the nodes can communicate with each other. (See the Setting up networking recipe in Chapter 2, Downloading and Setting Up.)
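The cluster-name and discovery rules map directly to settings in the elasticsearch.yml file; an illustrative fragment (the cluster name here is a placeholder, not one used in this book) might look like:

```yaml
# All nodes that should join the same cluster must agree on this name;
# nodes with a different cluster.name will refuse to join each other.
cluster.name: my-es-cluster

# Multicast discovery is enabled by default; it can be disabled in favor
# of unicast when the network does not allow multicast (see Chapter 2).
discovery.zen.ping.multicast.enabled: true
```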
A common approach in cluster management is to have a master node, which is the main
reference for all cluster-level actions, and the other nodes, called secondary nodes, that
replicate the master data and its actions.
To keep write operations consistent, all the update actions are first committed on the
master node and then replicated to the secondary nodes.
In a cluster with multiple nodes, if a master node dies, a master-eligible node is elected
to be the new master node. This approach allows automatic failover to be set up in an
ElasticSearch cluster.
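A common safeguard related to master election is to require a quorum of master-eligible nodes before a master can be elected, which prevents a partitioned cluster from electing two masters at once; the value below is a sketch for a three-node cluster and must be adapted to your topology:

```yaml
# With three master-eligible nodes, require a quorum of two
# (number_of_master_eligible_nodes / 2 + 1) before electing a master.
discovery.zen.minimum_master_nodes: 2
```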
There's more...
There are two important behaviors in an ElasticSearch node: the non-data node (or arbiter)
and the data container behavior.
Non-data nodes are able to process REST requests and all other search operations.
During every action execution, ElasticSearch generally operates in a map/reduce
fashion: the non-data node is responsible for distributing the action to the underlying
shards (map) and for collecting/aggregating the shard results (reduce) into a
final response. These nodes may use a huge amount of RAM due to operations such as facets,
aggregations, collecting hits, and caching (for example, scan/scroll queries).
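The collect/aggregate step can be sketched in a few lines of Python (a simplification, assuming each shard returns its hits already sorted by descending score, which is how per-shard results arrive in practice):

```python
import heapq
from itertools import islice

def merge_shard_hits(shard_results, size=10):
    """Merge per-shard hit lists into the global top-`size` hits, as the
    coordinating (non-data) node does in the reduce step of a search.

    Each element of `shard_results` is a list of hits sorted by
    descending score.
    """
    merged = heapq.merge(*shard_results, key=lambda hit: -hit["score"])
    return list(islice(merged, size))

# Three shards each return their local top hits; the coordinator keeps
# only the globally best `size` of them.
shard_results = [
    [{"id": "a", "score": 3.2}, {"id": "b", "score": 1.1}],
    [{"id": "c", "score": 2.7}],
    [{"id": "d", "score": 0.4}],
]
top = merge_shard_hits(shard_results, size=2)
```

This is also why collecting many hits per shard (deep paging, scan/scroll) is memory-hungry on the coordinating node: it must hold every shard's candidates before reducing them.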
Data nodes are able to store data. They contain the indices' shards, which store the
indexed documents as Lucene (the internal ElasticSearch engine) indices.
Using the standard configuration, a node is both an arbiter and a data container.
In big cluster architectures, having some nodes act as simple arbiters, with a lot of RAM but
no data, reduces the resources required by the data nodes and improves search performance
thanks to the arbiters' local memory cache.
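In ElasticSearch 1.x, this role split is expressed with two boolean settings in elasticsearch.yml; for example, a pure arbiter (master-eligible, coordinating, non-data) node would be configured as:

```yaml
# This node can be elected master and coordinates searches,
# but stores no shard data.
node.master: true
node.data: false
```

Setting both to true (the default) gives the standard combined arbiter/data behavior described above.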
See also
- The Setting up different node types recipe in Chapter 2, Downloading and Setting Up.
Getting ready
Every running ElasticSearch server provides a set of services.
How it works...
ElasticSearch natively provides a large set of functionalities that can be extended with
additional plugins.
During a node startup, a lot of required services are automatically started. The most
important are:
Throughout this book, we'll see recipes that interact with ElasticSearch
services. Every base functionality or extended functionality is managed
in ElasticSearch as a service.