ElasticSearch
Cookbook
Second Edition

Over 130 advanced recipes to search, analyze, deploy,


manage, and monitor data effectively with ElasticSearch

Alberto Paro

BIRMINGHAM - MUMBAI
ElasticSearch Cookbook
Second Edition

Copyright © 2015 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system,
or transmitted in any form or by any means, without the prior written permission of the
publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the
information presented. However, the information contained in this book is sold without
warranty, either express or implied. Neither the author, nor Packt Publishing and its
dealers and distributors, will be held liable for any damages caused or alleged to be
caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all the
companies and products mentioned in this book by the appropriate use of capitals.
However, Packt Publishing cannot guarantee the accuracy of this information.

First published: December 2013


Second edition: January 2015

Production reference: 1230115

Published by Packt Publishing Ltd.


Livery Place
35 Livery Street
Birmingham B3 2PB, UK.

ISBN 978-1-78355-483-6

www.packtpub.com
Credits

Author: Alberto Paro

Reviewers: Florian Hopf, Wenhan Lu, Suvda Myagmar, Dan Noble, Philip O'Toole

Acquisition Editor: Rebecca Youé

Content Development Editor: Amey Varangaonkar

Technical Editors: Prajakta Mhatre, Rohith Rajan

Copy Editors: Hiral Bhat, Dipti Kapadia, Neha Karnani, Shambhavi Pai, Laxmi Subramanian, Ashwati Thampi

Project Coordinator: Leena Purkait

Proofreaders: Ting Baker, Samuel Redman Birch, Stephen Copestake, Ameesha Green, Lauren E. Harkins

Indexer: Hemangini Bari

Graphics: Valentina D'silva

Production Coordinator: Manu Joseph

Cover Work: Manu Joseph
About the Author

Alberto Paro is an engineer, project manager, and software developer. He currently works
as CTO at Big Data Technologies and as a freelance consultant on software engineering for
Big Data and NoSQL solutions. He loves to study emerging solutions and applications, mainly
related to Big Data processing, NoSQL, natural language processing, and neural networks.
He began programming in BASIC on a Sinclair Spectrum when he was 8 years old and, to
date, has gathered a lot of experience with different operating systems, applications,
and programming languages.

In 2000, he graduated in computer science engineering at Politecnico di Milano with a


thesis on designing multiuser and multidevice web applications. He assisted professors
at the university for about a year. He then came in contact with The Net Planet Company
and loved their innovative ideas; he started working on knowledge management solutions
and advanced data mining products. In summer 2014, his company was acquired by a Big
Data technologies company, where he currently works mainly using Scala and Python on
state-of-the-art big data software (Spark, Akka, Cassandra, and YARN). In 2013, he started
freelancing as a consultant for Big Data, machine learning, and ElasticSearch.

In his spare time, when he is not playing with his children, he likes to work on open source
projects. When he was in high school, he started contributing to projects related to the GNOME
environment (gtkmm). One of his preferred programming languages is Python, and he wrote
one of the first NoSQL backends on Django for MongoDB (Django-MongoDB-engine). In 2010,
he began using ElasticSearch to provide search capabilities to some Django e-commerce
sites and developed PyES (a Pythonic client for ElasticSearch), as well as the initial part of the
ElasticSearch MongoDB river. He is the author of ElasticSearch Cookbook, as well as a technical
reviewer of Elasticsearch Server, Second Edition, and the video course, Building a Search Server
with ElasticSearch, all of which are published by Packt Publishing.
Acknowledgments
It would have been difficult for me to complete this book without the support of a large
number of people.

First, I would like to thank my wife, my children, and the rest of my family for their
valuable support.

On a more personal note, I'd like to thank my friend, Mauro Gallo, for his patience.

I'd like to express my gratitude to everyone at Packt Publishing who've been involved in the
development and production of this book. I'd like to thank Amey Varangaonkar for guiding this
book to completion, and Florian Hopf, Philip O'Toole, and Suvda Myagmar for patiently going
through the first drafts and providing valuable feedback. Their professionalism, courtesy,
good judgment, and passion for this book are much appreciated.
About the Reviewers

Florian Hopf works as a freelance software developer and consultant in Karlsruhe,


Germany. He familiarized himself with Lucene-based search while working with different
content management systems on the Java platform. He is responsible for small and large
search systems, on both the Internet and intranet, for web content and application-specific
data based on Lucene, Solr, and ElasticSearch. He helps to organize the local Java User
Group as well as the Search Meetup in Karlsruhe, and he blogs at http://blog.florian-hopf.de.

Wenhan Lu is currently pursuing his master's degree in computer science at Carnegie


Mellon University. He has worked for Amazon.com, Inc. as a software engineering intern.
Wenhan has more than 7 years of experience in Java programming. Today, his interests
include distributed systems, search engineering, and NoSQL databases.

Suvda Myagmar currently works as a technical lead at a San Francisco-based start-up


called Expect Labs, where she builds developer APIs and tunes ranking algorithms for
intelligent voice-driven, content-discovery applications. She is the co-founder of Piqora, a
company that specializes in social media analytics and content management solutions for
online retailers. Prior to working for start-ups, she worked as a software engineer at Yahoo!
Search and Microsoft Bing.
Dan Noble is a software engineer from Washington, D.C. who has been a big fan of
ElasticSearch since 2011. He's the author of the Python ElasticSearch driver called rawes,
available at https://github.com/humangeo/rawes. Dan focuses his efforts on the
development of web application design, data visualization, and geospatial applications.

Philip O'Toole has developed software and led software development teams for more than
15 years for a variety of applications, including embedded software, networking appliances,
web services, and SaaS infrastructure. His most recent work with ElasticSearch includes
leading infrastructure design and development of Loggly's log analytics SaaS platform, whose
core component is ElasticSearch. He is based in the San Francisco Bay Area and can be found
online at http://www.philipotoole.com.
www.PacktPub.com

Support files, eBooks, discount offers, and more


For support files and downloads related to your book, please visit www.PacktPub.com.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub
files available? You can upgrade to the eBook version at www.PacktPub.com, and as a
print book customer, you are entitled to a discount on the eBook copy. Get in touch with
us at service@packtpub.com for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up
for a range of free newsletters, and receive exclusive discounts and offers on Packt books
and eBooks.

https://www2.packtpub.com/books/subscription/packtlib

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book
library. Here, you can search, access, and read Packt's entire library of books.

Why subscribe?
- Fully searchable across every book published by Packt
- Copy and paste, print, and bookmark content
- On demand and accessible via a web browser

Free access for Packt account holders


If you have an account with Packt at www.PacktPub.com, you can use this to access
PacktLib today and view nine entirely free books. Simply use your login credentials for
immediate access.
To Giulia and Andrea, my extraordinary children.
Table of Contents
Preface 1
Chapter 1: Getting Started 7
Introduction 7
Understanding nodes and clusters 8
Understanding node services 10
Managing your data 11
Understanding clusters, replication, and sharding 13
Communicating with ElasticSearch 16
Using the HTTP protocol 17
Using the native protocol 19
Using the Thrift protocol 21
Chapter 2: Downloading and Setting Up 23
Introduction 23
Downloading and installing ElasticSearch 24
Setting up networking 27
Setting up a node 30
Setting up for Linux systems 32
Setting up different node types 33
Installing plugins in ElasticSearch 35
Installing a plugin manually 39
Removing a plugin 40
Changing logging settings 41
Chapter 3: Managing Mapping 43
Introduction 44
Using explicit mapping creation 44
Mapping base types 47
Mapping arrays 50
Mapping an object 52
Mapping a document 54
Using dynamic templates in document mapping 57
Managing nested objects 60
Managing a child document 62
Adding a field with multiple mappings 65
Mapping a geo point field 67
Mapping a geo shape field 69
Mapping an IP field 70
Mapping an attachment field 71
Adding metadata to a mapping 74
Specifying a different analyzer 75
Mapping a completion suggester 77
Chapter 4: Basic Operations 81
Introduction 82
Creating an index 82
Deleting an index 85
Opening/closing an index 86
Putting a mapping in an index 87
Getting a mapping 89
Deleting a mapping 90
Refreshing an index 92
Flushing an index 93
Optimizing an index 94
Checking if an index or type exists 96
Managing index settings 97
Using index aliases 100
Indexing a document 102
Getting a document 106
Deleting a document 109
Updating a document 111
Speeding up atomic operations (bulk operations) 114
Speeding up GET operations (multi GET) 117
Chapter 5: Search, Queries, and Filters 119
Introduction 120
Executing a search 120
Sorting results 128
Highlighting results 131
Executing a scan query 133
Suggesting a correct query 136
Counting matched results 139

Deleting by query 140
Matching all the documents 142
Querying/filtering for a single term 144
Querying/filtering for multiple terms 148
Using a prefix query/filter 151
Using a Boolean query/filter 153
Using a range query/filter 156
Using span queries 158
Using a match query 162
Using an ID query/filter 164
Using a has_child query/filter 166
Using a top_children query 168
Using a has_parent query/filter 170
Using a regexp query/filter 172
Using a function score query 174
Using exists and missing filters 178
Using and/or/not filters 180
Using a geo bounding box filter 182
Using a geo polygon filter 183
Using geo distance filter 185
Using a QueryString query 187
Using a template query 190
Chapter 6: Aggregations 195
Introduction 195
Executing an aggregation 196
Executing the stats aggregation 201
Executing the terms aggregation 203
Executing the range aggregation 208
Executing the histogram aggregation 212
Executing the date histogram aggregation 216
Executing the filter aggregation 219
Executing the global aggregation 221
Executing the geo distance aggregation 223
Executing nested aggregation 227
Executing the top hit aggregation 230
Chapter 7: Scripting 235
Introduction 235
Installing additional script plugins 236
Managing scripts 238
Sorting data using script 241

Computing return fields with scripting 245
Filtering a search via scripting 248
Updating a document using scripts 251
Chapter 8: Rivers 257
Introduction 257
Managing a river 258
Using the CouchDB river 261
Using the MongoDB river 264
Using the RabbitMQ river 267
Using the JDBC river 272
Using the Twitter river 278
Chapter 9: Cluster and Node Monitoring 283
Introduction 283
Controlling cluster health via the API 284
Controlling cluster state via the API 287
Getting cluster node information via the API 291
Getting node statistics via the API 297
Managing repositories 302
Executing a snapshot 305
Restoring a snapshot 308
Installing and using BigDesk 310
Installing and using ElasticSearch Head 316
Installing and using SemaText SPM 322
Installing and using Marvel 325
Chapter 10: Java Integration 329
Introduction 329
Creating an HTTP client 330
Creating a native client 335
Managing indices with the native client 338
Managing mappings 341
Managing documents 345
Managing bulk actions 348
Building a query 351
Executing a standard search 355
Executing a search with aggregations 359
Executing a scroll/scan search 363


Chapter 11: Python Integration 369
Introduction 369
Creating a client 370
Managing indices 374
Managing mappings 377
Managing documents 380
Executing a standard search 385
Executing a search with aggregations 390
Chapter 12: Plugin Development 395
Introduction 395
Creating a site plugin 396
Creating a native plugin 398
Creating a REST plugin 408
Creating a cluster action 414
Creating an analyzer plugin 421
Creating a river plugin 425
Index 435

Preface
One of the main requirements of today's applications is search capability. On the market, we
can find a lot of solutions that answer this need, in both the commercial and the open source
world. One of the most used libraries for searching is Apache Lucene. This library is the base of
a large number of search solutions, such as Apache Solr, Indextank, and ElasticSearch.

ElasticSearch is written with both cloud and distributed computing in mind. Its main author,
Shay Banon, who is famous for having developed Compass (http://www.compass-project.org),
released the first version of ElasticSearch in March 2010.

While the main scope of ElasticSearch is to be a search engine, it also provides a lot of features
that allow you to use it as a data store and as an analytic engine using aggregations.

ElasticSearch contains a lot of innovative features: it is JSON/REST-based, natively distributed
in a map/reduce approach, easy to set up, and extensible with plugins. In this book, we will
go into the details of these features and many others available in ElasticSearch.

Before ElasticSearch, only Apache Solr was able to provide some of these functionalities, but
it was not designed for the cloud and did not use a JSON/REST API. In the last few years, this
situation has changed a bit with the release of SolrCloud in 2012. For users who want to
compare these two products more thoroughly, I suggest reading the posts by Rafał Kuć,
available at http://blog.sematext.com/2012/08/23/solr-vs-elasticsearch-part-1-overview/.

ElasticSearch is a product in a state of continuous evolution, and new functionalities are
released both by the ElasticSearch company (the company founded by Shay Banon to
provide commercial support for ElasticSearch) and by ElasticSearch users as plugins
(mainly available on GitHub).

Founded in 2012, the ElasticSearch company has raised a total of USD 104 million in
funding. ElasticSearch's success can best be described by the words of Steven Schuurman,
the company's cofounder and CEO:

It's incredible to receive this kind of support from our investors over such a short
period of time. This speaks to the importance of what we're doing: businesses are
generating more and more data—both user- and machine-generated—and it has
become a strategic imperative for them to get value out of these assets, whether
they are starting a new data-focused project or trying to leverage their current
Hadoop or other Big data investments.

ElasticSearch has an impressive track record for its search product, powering customers
such as Foursquare (which indexes over 50 million venues), the online music distribution
platform SoundCloud, StumbleUpon, and the enterprise social network Xing, which has 14
million members. It also powers GitHub, which searches 20 terabytes of data and 1.3 billion
files, and Loggly, which uses ElasticSearch as a key-value store to index clusters of data for
rapid analytics of logfiles.

In my opinion, ElasticSearch is probably one of the most powerful and easy-to-use search
solutions on the market. Throughout this book and these recipes, the book's reviewers and
I have sought to transmit our knowledge, passion, and best practices to help readers better
manage ElasticSearch.

What this book covers


Chapter 1, Getting Started, gives you an overview of the basic concepts of ElasticSearch and
the ways to communicate with it.
Chapter 2, Downloading and Setting Up, shows the basic steps to start using ElasticSearch,
from the simple installation to running multiple nodes.
Chapter 3, Managing Mapping, covers the correct definition of data fields to improve both
the indexing and search quality.
Chapter 4, Basic Operations, shows you the common operations that are required to both
ingest and manage data in ElasticSearch.
Chapter 5, Search, Queries, and Filters, covers the core search functionalities in ElasticSearch.
The search DSL is the only way to execute queries in ElasticSearch.
Chapter 6, Aggregations, covers another capability of ElasticSearch: the possibility to execute
analytics on search results in order to improve the user experience and drill down the information.
Chapter 7, Scripting, shows you how to customize ElasticSearch with scripting in different
programming languages.
Chapter 8, Rivers, extends ElasticSearch to give you the ability to pull data from different
sources such as databases, NoSQL solutions, and data streams.


Chapter 9, Cluster and Node Monitoring, shows you how to analyze the behavior of a
cluster/node to understand common pitfalls.

Chapter 10, Java Integration, describes how to integrate ElasticSearch in a Java application
using both REST and native protocols.

Chapter 11, Python Integration, covers the usage of the official ElasticSearch Python client
and the Pythonic PyES library.

Chapter 12, Plugin Development, describes how to create the different types of plugins:
site and native plugins. Some examples show the plugin skeletons, the setup process,
and their build.

What you need for this book


For this book, you will need a computer running a Windows OS, Macintosh OS, or Linux
distribution. In terms of the additional software required, you don't have to worry, as all the
components you will need are open source and available for every major OS platform.

For all the REST examples, the cURL software (http://curl.haxx.se/) will be used to
simulate commands from the command line. It comes preinstalled on Linux and Mac OS X
operating systems. For Windows, it can be downloaded from its site and added to the PATH
so that it can be called from the command line.

Chapter 10, Java Integration, and Chapter 12, Plugin Development, require the Maven build
tool (http://maven.apache.org/), which is a standard tool to manage builds, packaging,
and deploying in Java. It is natively supported on most of the Java IDEs, such as Eclipse and
IntelliJ IDEA.

Chapter 11, Python Integration, requires the Python Interpreter installed on your computer.
It's available on Linux and Mac OS X by default. For Windows, it can be downloaded from the
official Python website (http://www.python.org). The examples in this chapter have been
tested using version 2.x.

Who this book is for


This book is for developers and users who want to begin using ElasticSearch or want to improve
their knowledge of ElasticSearch. This book covers all the aspects of using ElasticSearch and
provides solutions and hints for everyday usage. The recipes are kept simple so that it is easy
for readers to focus on the ElasticSearch aspect being discussed and to fully understand its
functionalities.

The chapters toward the end of the book discuss ElasticSearch integration with Java and Python
programming languages; this shows the users how to integrate the power of ElasticSearch into
their Java- and Python-based applications.


Chapter 12, Plugin Development, talks about the advanced use of ElasticSearch and its core
extensions, so you will need some prior Java knowledge to understand this chapter fully.

Sections
This book contains the following sections:

Getting ready
This section tells us what to expect in the recipe, and describes how to set up any software or
any preliminary settings needed for the recipe.

How to do it…
This section describes the steps to be followed for "cooking" the recipe.

How it works…
This section usually consists of a detailed explanation of what happened in the
previous section.

There's more…
It consists of additional information about the recipe that makes the reader more
knowledgeable about it.

See also
This section may contain references to other recipes or to further information relevant to the recipe.

Conventions
In this book, you will find a number of styles of text that distinguish between different
kinds of information. Here are some examples of these styles, and an explanation of
their meaning.

Code words in text, database table names, folder names, filenames, file extensions,
pathnames, dummy URLs, user input, and Twitter handles are shown as follows:
"After the name and type parameters, usually a river requires an extra configuration
that can be passed in the _meta property."


A block of code is set as follows:


cluster.name: elasticsearch
node.name: "My wonderful server"
network.host: 192.168.0.1
discovery.zen.ping.unicast.hosts: ["192.168.0.2","192.168.0.3[9300-9400]"]

When we wish to draw your attention to a particular part of a code block, the relevant lines or
items are set in bold:
cluster.name: elasticsearch
node.name: "My wonderful server"
network.host: 192.168.0.1
discovery.zen.ping.unicast.hosts: ["192.168.0.2","192.168.0.3[9300-9400]"]

Any command-line input or output is written as follows:


curl -XDELETE 'http://127.0.0.1:9200/_river/my_river/'

New terms and important words are shown in bold. Words you see on the screen, in menus
or dialog boxes, for example, appear in the text like this: "If you don't see the cluster statistics,
put your node address to the left and click on the connect button."

Warnings or important notes appear in a box like this.

Tips and tricks appear like this.

Reader feedback
Feedback from our readers is always welcome. Let us know what you think about this book—
what you liked or may have disliked. Reader feedback is important for us to develop titles you
really get the most out of.

To send us general feedback, simply send an e-mail to feedback@packtpub.com, and
mention the book title via the subject of your message.

If there is a topic you have expertise in and you are interested in either writing or contributing
to a book, see our author guide at www.packtpub.com/authors.


Customer support
Now that you are the proud owner of a Packt book, we have a number of things to help you get
the most from your purchase.

Downloading the example code

You can download the example code files for all Packt books you have purchased from your
account at http://www.packtpub.com. If you purchased this book elsewhere, you can
visit http://www.packtpub.com/support and register to have the files e-mailed directly
to you. The code bundle is also available on GitHub at
https://github.com/aparo/elasticsearch-cookbook-second-edition.

Errata
Although we have taken every care to ensure the accuracy of our content, mistakes do happen.
If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be
grateful if you could report this to us. By doing so, you can save other readers from frustration
and help us improve subsequent versions of this book. If you find any errata, please report them
by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on
the Errata Submission Form link, and entering the details of your errata. Once your errata are
verified, your submission will be accepted and the errata will be uploaded to our website or
added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to
https://www.packtpub.com/books/content/support and enter the name of the book in the
search field. The required information will appear under the Errata section.

Piracy
Piracy of copyrighted material on the Internet is an ongoing problem across all media.
At Packt, we take the protection of our copyright and licenses very seriously. If you come
across any illegal copies of our works, in any form, on the Internet, please provide us with
the location address or website name immediately so we can pursue a remedy.

Please contact us at copyright@packtpub.com with a link to the suspected pirated material.

We appreciate your help in protecting our authors, and our ability to bring you valuable content.

Questions
If you have a problem with any aspect of this book, you can contact us at
questions@packtpub.com, and we will do our best to address the problem.

Chapter 1: Getting Started
In this chapter, we will cover:

- Understanding nodes and clusters
- Understanding node services
- Managing your data
- Understanding clusters, replication, and sharding
- Communicating with ElasticSearch
- Using the HTTP protocol
- Using the native protocol
- Using the Thrift protocol

Introduction
To efficiently use ElasticSearch, it is very important to understand how it works.

The goal of this chapter is to give the readers an overview of the basic concepts of
ElasticSearch and to serve as a quick reference for them. It's essential to understand
the basics so that you don't fall into common pitfalls about how ElasticSearch
works and how to use it.

The key concepts that we will see in this chapter are: node, index, shard, mapping/type,
document, and field.

ElasticSearch can be used both as a search engine as well as a data store.

A brief description of the ElasticSearch logic helps the user to improve performance and
search quality, and to decide when and how to optimize the infrastructure to improve
scalability and availability.

Some details on data replication and the basic node communication processes are also explained.

At the end of this chapter, the protocols used to manage ElasticSearch are also discussed.

Understanding nodes and clusters


Every instance of ElasticSearch is called a node. Several nodes are grouped in a cluster.
This is the base of the cloud nature of ElasticSearch.

Getting ready
To better understand the following sections, some basic knowledge of the concepts of
application nodes and clusters is required.

How it works...
One or more ElasticSearch nodes can be set up on a physical or a virtual server depending
on the available resources such as RAM, CPU, and disk space.

A default node allows you to store data in it and to process requests and responses.
(In Chapter 2, Downloading and Setting Up, we'll see details about how to set up
different nodes and cluster topologies.)

When a node is started, several actions take place during its startup, such as:

- The configuration is read from the environment variables and the elasticsearch.yml configuration file
- A node name is set by the configuration file or is chosen from a list of built-in random names
- Internally, the ElasticSearch engine initializes all the modules and plugins that are available in the current installation


After the node startup, the node searches for other cluster members and checks its index and
shard status.


To join two or more nodes in a cluster, the following rules must be observed:

ff The version of ElasticSearch must be the same (v0.20, v0.9, v1.4, and so on) or the
join is rejected.
ff The cluster name must be the same.
ff The network must be configured to support broadcast discovery (it is configured
to it by default) and they can communicate with each other. (See the Setting up
networking recipe in Chapter 2, Downloading and Setting Up.)
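
These rules map onto a handful of settings in the elasticsearch.yml file. The following sketch shows them for one node; the cluster name, node name, and host addresses are example values, not ones taken from this book:

```yaml
# elasticsearch.yml -- minimal settings relevant to joining a cluster
cluster.name: my-cluster     # must be identical on every node of the cluster
node.name: "node-1"          # optional; a random built-in name is chosen if omitted

# If broadcast/multicast discovery is not available on your network,
# you can point the node at known peers explicitly instead:
# discovery.zen.ping.multicast.enabled: false
# discovery.zen.ping.unicast.hosts: ["192.168.0.2", "192.168.0.3"]
```

With identical cluster.name values, reachable networking, and the same ElasticSearch version, the nodes discover each other and form a single cluster.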

A common approach in cluster management is to have a master node, which is the main
reference for all cluster-level actions, and the other nodes, called secondary nodes, that
replicate the master data and its actions.

To be consistent in the write operations, all the update actions are first committed in the
master node and then replicated in the secondary nodes.

In a cluster with multiple nodes, if a master node dies, a master-eligible node is elected
to be the new master node. This approach allows automatic failover to be set up in an
ElasticSearch cluster.

There's more...
There are two important behaviors in an ElasticSearch node: the non-data node (or arbiter)
and the data container behavior.

Non-data nodes are able to process REST requests and all the other search operations.
During every action execution, ElasticSearch generally executes actions using a map/reduce
approach: the non-data node is responsible for distributing the actions to the underlying
shards (map) and for collecting/aggregating the shard results (reduce) in order to send a
final response. These nodes may use a huge amount of RAM due to operations such as facets,
aggregations, collecting hits, and caching (such as scan/scroll queries).

Data nodes are able to store data. They contain the index shards that store the
indexed documents as Lucene (the internal ElasticSearch engine) indices.

Using the standard configuration, a node is both an arbiter and a data container.

In big cluster architectures, having some nodes act as simple arbiters, with a lot of RAM and
no data, reduces the resources required by the data nodes and improves search performance
through the arbiters' local memory cache.
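
In ElasticSearch 1.x, the arbiter and data-container behaviors are selected through the node.master and node.data settings in elasticsearch.yml. The combinations below are a sketch; only one (uncommented) pair applies per node:

```yaml
# Default behavior: the node is master-eligible AND stores data
node.master: true
node.data: true

# A pure arbiter (client) node: distributes actions and aggregates
# results, but stores no data:
# node.master: false
# node.data: false

# A dedicated data node: holds shards but is never elected master:
# node.master: false
# node.data: true
```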

See also
- The Setting up different node types recipe in Chapter 2, Downloading and Setting Up.

Getting Started

Understanding node services


When a node is running, a lot of services are managed by its instance. These services provide
additional functionalities to a node and they cover different behaviors such as networking,
indexing, analyzing and so on.

Getting ready
Every ElasticSearch server that is running provides services.

How it works...
ElasticSearch natively provides a large set of functionalities that can be extended with
additional plugins.

During a node's startup, a lot of required services are automatically started. The most important are:

- Cluster services: These manage the cluster state, intra-node communication, and synchronization.
- Indexing service: This manages all indexing operations, initializing all active indices and shards.
- Mapping service: This manages the document types stored in the cluster (we'll discuss mapping in Chapter 3, Managing Mapping).
- Network services: These include the HTTP REST service (on port 9200 by default), the internal ES transport protocol (port 9300), and the Thrift server (port 9500), the last applicable only if the Thrift plugin is installed.
- Plugin service: This enables us to enhance the basic ElasticSearch functionality in a customizable manner. (Installation is discussed in Chapter 2, Downloading and Setting Up, and detailed usage in Chapter 12, Plugin Development.)
- River service: This is a pluggable service running within the ElasticSearch cluster that pulls data (or is pushed data), which is then indexed into the cluster. (We'll see it in Chapter 8, Rivers.)
- Language scripting services: These allow you to add support for new scripting languages to ElasticSearch.
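The default ports of the network services can be overridden in elasticsearch.yml; for example (a sketch using the ElasticSearch 1.x setting names):

```yaml
http.port: 9200           # HTTP REST service
transport.tcp.port: 9300  # internal node-to-node transport protocol
# thrift.port: 9500       # only meaningful with the transport-thrift plugin
```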

Throughout this book, we'll see recipes that interact with ElasticSearch
services. Every base functionality or extended functionality is managed
in ElasticSearch as a service.
