A Case Study
of
Nile University
In Partial Fulfillment
of the Requirements for the
Degree of Master of Software Engineering
By
Basma AlKerm
September, 2018
Acknowledgements
First, I thank Allah for giving me all the reasons to complete this work.
Second, with extreme gratitude, I would like to thank Dr. Sameh El-Ansary
for supervising this study; I will never be able to thank him enough.
He is the best teacher I could ever have. Besides his substantial knowledge and
broad experience, he is a passionate teacher and coach, who generously provided
all kinds of guidance and support.
I would like to extend my thanks to the Novelari team for providing an environment
that helps everyone be productive. Special thanks to Karim Hamed, the
development head at Novelari, who gave me a lot of his time to help with
different technical challenges; thanks to his advice, I was able to avoid many
pitfalls.
I also want to thank my family for their kind patience, endless encouragement,
and support.
Contents
Acknowledgements iii
List of Figures ix
Abstract xv
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.3 The Use Case Context . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3.1 Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3.2 Who has the ultrasound exams? . . . . . . . . . . . . . . . . . . 2
1.3.3 The DICOM format . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3.4 Picture Archiving and Communication System (PACS) . . . . . 3
1.3.5 Query and Retrieval, the PACS Way . . . . . . . . . . . . . . . . 3
PACS Search Capabilities are Limited . . . . . . . . . . . . . . 4
1.3.6 Content-based Image Retrieval (CBIR) . . . . . . . . . . . . . . 4
1.3.7 What we want to achieve? . . . . . . . . . . . . . . . . . . . . . 5
1.3.8 The Target Data Scale . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 The Engineering Perspective . . . . . . . . . . . . . . . . . . . . . . . 6
Data Engineering and Data Science . . . . . . . . . . . . . . . . 6
Full Text Search . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.4.1 Spark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.4.2 Python . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
NumPy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Pandas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4.3 PostgreSQL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.4.4 Lucene . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.4.5 Solr . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.4.6 Elasticsearch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.4.7 Amazon EC2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Cloud Computing . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.5.1 A brief on PACS Research Status . . . . . . . . . . . . . . . . . 12
Dicoogle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.5.2 Other Search Engines Performance Studies . . . . . . . . . . . 13
2 Data Modeling 15
2.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.1.1 Big Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.1.2 Data Description . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
Main properties of the Sample Data . . . . . . . . . . . . . . . . 17
2.1.3 Data Cleaning / Pre-Processing . . . . . . . . . . . . . . . . . . 17
2.2 Tools and Experiments Setup . . . . . . . . . . . . . . . . . . . . . . . 18
EC2 - M5 Instance Type . . . . . . . . . . . . . . . . . . . . . . . 18
2.3 Experiment 1: Iteration-Based on Single Core . . . . . . . . . . . . . . 19
2.3.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3.2 Throughput . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4 Experiment 2: Vectorization-Based on Single Core . . . . . . . . . . . 20
2.4.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
Vectorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
Simple Random Sampling . . . . . . . . . . . . . . . . . . . . . 20
NumPy Broadcasting . . . . . . . . . . . . . . . . . . . . . . . . 21
NumPy Advanced Indexing . . . . . . . . . . . . . . . . . . . . . 21
2.4.2 Throughput . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.5 Experiment 3: Single Machine Parallelization Using Spark . . . . . . 22
2.5.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.5.2 Throughput . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.6 Experiment 4: Single Machine Parallelization Using Python Multi-
processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.6.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
The Python multiprocessing module . . . . . . . . . . . . . . . . 25
2.6.2 Throughput . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.7 Experiment 5: Vertical Scaling The Vectorized-Based Generation . . . 26
2.7.1 Compute Resources . . . . . . . . . . . . . . . . . . . . . . . . . 27
EC2 - Compute Optimized Instance Type (C4) . . . . . . . . . . 27
2.7.2 Throughput . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
Amdahl’s law . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.8 Experiment 6: Horizontally Scaling The Vectorized-Based Generation 29
2.8.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.8.2 Compute Resources . . . . . . . . . . . . . . . . . . . . . . . . . 31
EC2 - Memory Optimized Instance Type . . . . . . . . . . . . . 31
2.8.3 Cluster Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.8.4 Throughput . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.9 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.9.1 Vectorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.9.2 Parallelization on multi-core Single Machine . . . . . . . . . . . 34
2.9.3 Distribution using Clusters . . . . . . . . . . . . . . . . . . . . . 35
2.9.4 The Effect of Disk I/O . . . . . . . . . . . . . . . . . . . . . . . . 35
3 Evaluating PostgreSQL 37
3.1 PostgreSQL the Open Source RDBMS . . . . . . . . . . . . . . . . . . 37
3.2 PostgreSQL support For Text Search . . . . . . . . . . . . . . . . . . . 37
3.3 PostgreSQL Support For “Full” Text Search . . . . . . . . . . . . . . . 38
3.4 Data Insertion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4 Evaluating Solr 53
4.1 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.2 Solr Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.2.1 How It Works? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.2.2 Lucene . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.2.3 Segments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.2.4 Merging of Segments . . . . . . . . . . . . . . . . . . . . . . . . 57
4.2.5 Document Deletion . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.2.6 Solr Configurations . . . . . . . . . . . . . . . . . . . . . . . . . 58
solrconfig.xml . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.3 Experiment 1: A Baseline . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.3.1 Data Size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.3.2 The Schemaless Mode . . . . . . . . . . . . . . . . . . . . . . . 60
4.3.3 The I3 instance Family . . . . . . . . . . . . . . . . . . . . . . . 61
4.3.4 Steps To Run This Experiment . . . . . . . . . . . . . . . . . . . 61
4.3.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.3.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.4 Experiment 2: Schema impact on index size and query performance . 66
4.4.1 Data Size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.4.2 Schema Modifications . . . . . . . . . . . . . . . . . . . . . . . . 66
4.4.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.4.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.5 Experiment 3: Indexing 100 GB . . . . . . . . . . . . . . . . . . . . . . 68
4.5.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.5.2 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5 Evaluating Elasticsearch 73
5.1 Experiment 1: A Baseline . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.1.1 Adjustments to Align with Solr’s Baseline Experiment . . . . . 75
5.1.2 Steps To Run The Experiment . . . . . . . . . . . . . . . . . . . 75
5.1.3 The Elasticsearch Python Client . . . . . . . . . . . . . . . . . . 77
5.1.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.1.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.2 Experiment 2: The Impact of Mapping . . . . . . . . . . . . . . . . . 81
5.2.1 The Dynamic Mapping . . . . . . . . . . . . . . . . . . . . . . . 81
5.2.2 The Custom Mapping . . . . . . . . . . . . . . . . . . . . . . . . 82
5.2.3 Steps to Run The Experiment . . . . . . . . . . . . . . . . . . . 82
5.2.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.2.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.3 Experiment 3: Indexing 92 GB Using a Single Node . . . . . . . . . . 85
5.3.1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.3.2 Moving from Development Mode to Production Mode . . . . . 86
5.3.3 Steps to Run The Experiment . . . . . . . . . . . . . . . . . . . 87
Translog . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
The ‘_field_names’ field . . . . . . . . . . . . . . . . . . . . . . . 88
5.3.4 Tuning the parallel_bulk Helper . . . . . . . . . . . . . . . . . . 89
5.3.5 The Test Queries . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
The Search API . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
The Query DSL . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.3.6 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.3.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.4 Experiment 4: Indexing 92 GB using Five Nodes Cluster . . . . . . . 93
5.4.1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
Flintrock . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
Parallel ssh . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
Elasticsearch-Hadoop Connector . . . . . . . . . . . . . . . . . 95
Zen Discovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
EC2 Discovery Plugin . . . . . . . . . . . . . . . . . . . . . . . . 95
5.4.2 Steps to Run the Experiment . . . . . . . . . . . . . . . . . . . . 96
Elasticsearch Cluster . . . . . . . . . . . . . . . . . . . . . . . . 96
EC2 Discovery Plugin . . . . . . . . . . . . . . . . . . . . . . . . 96
Spark Cluster . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
Add the elasticsearch-hadoop dependency to spark classpath . 97
Configure EC2 Security Groups . . . . . . . . . . . . . . . . . . 98
5.4.3 The Index Custom Mapping, and Settings . . . . . . . . . . . . 98
5.4.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.5 Experiment 5: Indexing 1 TB using Ten Nodes Cluster . . . . . . . . . 100
5.5.1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.5.2 Steps To Run the Experiment . . . . . . . . . . . . . . . . . . . 102
5.5.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.6 Overall Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
6 Conclusion 107
Bibliography 111
List of Figures
4.1 Indexing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.2 Searching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.3 Solr Experiment1 - Index Size . . . . . . . . . . . . . . . . . . . . . . . 63
4.4 Solr Experiment1 - Indexing Time . . . . . . . . . . . . . . . . . . . . 63
4.5 Solr Experiment1 - Indexing Speed . . . . . . . . . . . . . . . . . . . . 64
4.6 Solr Experiment1 - Requests/Second (Non-Cached) . . . . . . . . . . 64
List of Tables
6.1 Search Performance - free-text query (retrieve 100 docs where ocrText
field contains ’CCA’) . . . . . . . . . . . . . . . . . . . . . . . . . . 109
Abstract
Medical images are an invaluable part of patient medical records. Recently, the
use of diagnostic medical imaging has increased significantly, resulting in massive
growth in data volume. The demand for efficient and flexible ways to manage and
make use of that data has also grown noticeably.
Currently, PACS systems represent the de facto technology for medical image
storage, management, and retrieval. However, PACS suffers from several critical
drawbacks that limit its ability to satisfy healthcare professionals’ needs. The
primary drawbacks are: first, PACS is not scalable, or at least not easily scalable;
second, the PACS way of searching and retrieving data is limited and inflexible,
especially for a user expecting a Google-like search experience.
In this work, we move the medical image search problem outside the current
stagnation of PACS systems. We provide a solution that supports real-time,
full-text search, and even aggregations and ranking, while also being scalable at
its core. We use the images’ text-based metadata to enable searching over the
images.
This study aims to provide a reliable evaluation of three of the top-ranked
technologies that would help in the medical images search problem: PostgreSQL,
Solr, and Elasticsearch.
Our experiments show that Elasticsearch provides the best solution for
our use case. We report indexing and searching throughput based on a ten-node
cluster hosted on AWS EC2. We used one terabyte of data to run the final tests.
We used a Spark cluster to generate the data, based on a real data sample
containing the metadata of 16,000 images, and implemented our data generation
routine to magnify that sample up to 120 billion records.
Chapter 1
Introduction
1.1 Motivation
Medical images are an invaluable part of patient medical records [70] [125].
Since the discovery of X-rays more than 120 years ago [127], medical imaging
has kept improving to include more methods and more uses. Recently, the
improvements and adoption across different specialties have been noticeably faster [18].
However, the utilization of medical imaging has been a significant challenge
in healthcare IT, due to many factors (high storage costs, the complexity
of management, diversity of tools and software, incompatibility issues, and the
different imaging modality manufacturers) [150] [82] [7] [74].
In addition, a new aspect has been added to the formula: the huge
number of records. This puts even more pressure on today’s available software
solutions for medical image storage, management, and retrieval [87] [92] [147]
[93].
The main motive for this study is to lay firm ground for overcoming these difficulties
and enabling large-scale, real-time, flexible-querying search capabilities over
medical images, based on the images’ metadata.
1.2 Outline
Finally, we put a summary of all the experiments and the conclusion of this
study in chapter 6.
1.3 The Use Case Context
1.3.1 Scope
Every healthcare provider that offers vascular ultrasound services owns
a vast amount of archived ultrasound exams. Those exams are usually kept
archived on some offline storage system.
Medical images are typically stored in DICOM (Digital Imaging and Communications
in Medicine) format. DICOM groups information into data
sets. That means that a file of a chest X-ray image, for example, actually contains
the patient ID within the file, so that the image can never be separated from this
information by mistake. This is similar to the way image formats such as
JPEG can have embedded tags to identify and otherwise describe the image.
A DICOM data object consists of a number of attributes, including items such
as name, ID, etc., and also one unique attribute containing the image pixel data
(i.e., logically, the main object has no “header” as such, being merely a list of
attributes, including the pixel data). A single DICOM object can have only one
attribute containing pixel data. For many modalities, this corresponds to a single
image. However, the attribute may contain multiple “frames”, allowing storage
of cine loops or other multi-frame data files [27].
The communication with the PACS system is done using Digital Imaging and
Communication in Medicine (DICOM) messages that are similar to DICOM image
"headers", but with different attributes. A query (C-FIND) is performed as
follows:
• The client fills in the C-FIND request message with the keys that should be
matched. E.g., to query for a patient ID, the patient ID attribute is filled
with the patient’s ID.
• The client creates empty (zero length) attributes for all the attributes it
wishes to receive from the server. E.g., if the client wishes to receive an ID
that it can use to receive images (see image retrieval), it should include a
zero-length SOPInstanceUID (0008,0018) attribute in the C-FIND request
messages.
• The server sends back to the client a list of C-FIND response messages,
each of which is also a list of DICOM attributes, populated with values for
each match.
• The client extracts the attributes that are of interest from the response
message objects.
Images (and other composite instances like the Presentation States and Structured
Reports) are then retrieved from a PACS server through either a C-MOVE
or C-GET request, using the DICOM network protocol.
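The request/response flow above can be illustrated with a minimal sketch. A plain dictionary stands in for the DICOM dataset here purely for illustration; a real client would build an actual DICOM dataset (e.g., with pydicom/pynetdicom), and the attribute values shown are hypothetical.

```python
# Sketch of a C-FIND identifier: matching keys are filled with the value
# to match, return keys are left empty (zero-length), per the steps above.
identifier = {
    "PatientID": "12345",   # matching key: filled with the patient's ID
    "SOPInstanceUID": "",   # return key (0008,0018): left zero-length
    "StudyDate": "",        # another zero-length return key
}

# The server matches on the filled keys and populates the empty ones
# in each C-FIND response message.
matching_keys = {k: v for k, v in identifier.items() if v}
return_keys = [k for k, v in identifier.items() if v == ""]
```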
[Figure: proposed system overview. A crawler processes DICOM files into (a) .png image files kept on a files storage system and (b) .csv metadata files fed into a search engine; a web-based client (1) queries the search engine, (2) receives metadata plus URLs, and (3) fetches the images from the storage system via those URLs.]
CBIR has the potential to save practitioners a significant amount of time,
enabling them to move quickly from a source image to a set of similar
ones, potentially containing diagnosis reports. These reports, when compared
to the original image, may strengthen the case for the diagnosis or provide the
practitioner with additional insight.
1. Process the DICOM files to produce two types of output: (1) text files containing
the image metadata, which includes all the information collected during the
examination plus the text burned onto the images, and (2) .png files representing
the images themselves.
2. Keep the .png files on a storage system (for example, HDFS or S3).
3. Add a link to each image’s path on the storage system to the text files holding
the image metadata.
4. Feed the text files into the search engine of choice, which enables
free-text search and different aggregations, and eventually
enables advanced analytics on those images (or, more precisely, on the image metadata).
• the information collected during the examination (that was included in the
original DICOM tags).
• the text that is burned onto the image. This data is not part of the DICOM
tags; rather, it is extracted by applying OCR to the images.
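A single image's metadata record, after processing, could combine both sources plus the stored-image URL. The field names and values below are illustrative assumptions, not the study's actual record layout (though an ocrText field containing terms such as 'CCA' is consistent with the test queries reported later):

```python
# Hypothetical metadata record for one ultrasound image:
# DICOM-tag fields + OCR-extracted burned-in text + stored-image URL.
record = {
    "PatientID": "12345",
    "StudyDate": "20180901",
    "Modality": "US",
    "ocrText": "RT CCA PSV 92 cm/s",  # burned-in text recovered via OCR
    "imageUrl": "s3://bucket/exams/12345/img_001.png",
}
```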
It is estimated that over one billion diagnostic imaging procedures will be performed
in the United States during 2014, comprising approximately 100 petabytes
of data volume [145].
The requirement for our use case is to be able to search data at the scale
of the US population, which is ~320 million [79]. If we exclude the age group
under 18 years and assume a single carotid ultrasound exam per adult
per year, we get 211 million exams. Each exam consists of 23 to 50 images,
so we would expect a total of 211 million exams × 36 images per exam
(on average), that is, 7.596 billion images (records).
The metadata for a single image is 0.8 KB, so for 7.596 billion images the data
size would be 5.65 TB ((7.596 × 1000 × 1000000 × 0.8) / (1024 × 1024 × 1024)).
Assuming a single ultrasound exam per adult per year would thus produce 7.596 billion
records, amounting to about 6 TB of text data in CSV format.
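The back-of-the-envelope calculation above can be reproduced in a few lines:

```python
# Reproducing the data-scale estimate from the text.
exams = 211_000_000            # one carotid exam per US adult per year
images = exams * 36            # 36 images per exam on average
kb_total = images * 0.8        # 0.8 KB of metadata per image
tb_total = kb_total / 1024**3  # KB -> TB (1 TB = 1024**3 KB)
# images == 7_596_000_000; tb_total is roughly 5.66 TB
```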
Big Data had indeed come of age in 2013, when the Oxford English Dictionary introduced
the term “Big Data” for the first time [30]. However, the
term, which spans computer science and statistics/econometrics, originated much
earlier, probably in the lunch-table conversations at Silicon Graphics
in the mid-1990s, in which John Mashey figured prominently [28].
Since its early start, big data tools have continued to emerge and mature,
motivated by increasing business demands. Every industry is now looking at
ways to use big data to improve and grow, and health care is no exception. We
think that health care has not yet caught up with the new options made possible
by current big data technologies.
As big data technology became known and developers in different domains
began to use it, two sub-domains became more clearly distinguishable: data
engineering and data science.
Data engineering focuses on finding solutions to store, organize, clean,
and retrieve big data.
For data science, the primary concern is getting useful insights from data.
In an early usage, the term “data science” was used as a substitute for computer
science by Peter Naur in 1960. Naur later introduced the term “datalogy”
[95].
In the past ten years, Data Science has quietly grown to include businesses
and organizations worldwide. It is used by governments, geneticists, engineers,
and even astronomers. During its evolution, Data Science’s use of Big Data was
not merely a “scaling up” of the data, but included shifting to new systems for
processing data, and the ways data gets studied and analyzed.
Data Science has become an essential part of business and academic research.
Technically, this includes machine translation, robotics, speech recognition,
the digital economy, and search engines. Regarding research areas, Data
Science has expanded to include the biological sciences, healthcare, medical
informatics, the humanities, and social sciences. Data Science now influences
economics, governments, and business and finance [78].
In text retrieval, full-text search refers to techniques for searching a single
computer-stored document or a collection in a full-text database. Full-text search is
distinguished from searches based on metadata or on parts of the original texts
represented in databases (such as titles, abstracts, selected sections, or
bibliographical references).
Full-Text Searching (or just text search) provides the capability to identify
natural-language documents that satisfy a query, and optionally to sort them by
relevance to the query. The most common type of search is to find all documents
containing given query terms and return them in order of their similarity to the
query. Notions of query and similarity are very flexible and depend on the specific
application. The most straightforward search considers query as a set of words
and similarity as the frequency of query words in the document [108].
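The straightforward notion just described, with the query as a set of words and similarity as the frequency of query words in the document, can be sketched as follows. This is a toy illustration, not how a real search engine scores documents:

```python
def similarity(query: str, doc: str) -> int:
    # Naive similarity: total frequency of the query's words in the document.
    words = doc.lower().split()
    return sum(words.count(term) for term in query.lower().split())

# 'carotid' appears twice and 'artery' twice, so the score is 4.
score = similarity("carotid artery",
                   "left carotid artery and right carotid artery")
```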
In the following sections, we list the technologies used in this case study, with
a brief introduction for each.
1.4.1 Spark
1.4.2 Python
NumPy
NumPy is the fundamental package for scientific computing with Python.
Besides its obvious scientific uses, NumPy can also be used as an efficient
multi-dimensional container of generic data [98].
We used NumPy to vectorize our data generation, which enabled us to take
advantage of multicore processors and achieve parallelism at the processor level;
more details are in chapter 2.
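As a hedged sketch of the idea (not the thesis code), a small sample can be magnified with vectorized random sampling and NumPy advanced indexing instead of generating rows in a Python loop; the sample values below are stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
sample = np.array(["US", "CT", "MR"])        # tiny stand-in sample
idx = rng.integers(0, len(sample), size=10)  # simple random sampling
generated = sample[idx]                      # advanced (fancy) indexing
```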
Pandas
Pandas is a Python package providing fast, flexible, and expressive data structures
designed to make working with “relational” or “labeled” data both easy
and intuitive. It aims to be the fundamental high-level building block for doing
practical, real-world data analysis in Python.
The two primary data structures of pandas, Series (1-dimensional) and DataFrame
(2-dimensional), handle the vast majority of typical use cases in finance, statistics,
social science, and many areas of engineering. Pandas is built on top of
NumPy and is intended to integrate well within a scientific computing environment
with many other third-party libraries.
For data scientists, working with data is typically divided into multiple stages:
munging and cleaning data, analyzing/modeling it, then organizing the results of
the analysis into a form suitable for plotting or tabular display. Pandas is the
ideal tool for all of these tasks [104].
We used Pandas to load our data and perform some transformations and general
analysis of the sample data set; more details are in chapter 2.
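The kind of exploratory step described above might look like the following sketch; the column names and values are illustrative assumptions, not the study's actual data:

```python
import pandas as pd

# A tiny stand-in for the per-image metadata sample.
df = pd.DataFrame({
    "exam_id":  [1, 1, 2],
    "modality": ["US", "US", "US"],
})

# One simple aggregation: how many images each exam contains.
images_per_exam = df.groupby("exam_id").size()
```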
1.4.3 PostgreSQL
1.4.4 Lucene
Doug Cutting originally wrote Lucene entirely in Java in 1999. It was initially
available for download from its home at the SourceForge web site. It joined
the Apache Software Foundation’s Jakarta family of open-source Java products in
September 2001 and became its own top-level Apache project in February 2005
[103].
The Lucene API is divided into several packages [86]:
2. org.apache.lucene.codecs
Provides an abstraction, as well as different implementations, for the encoding
and decoding of the inverted index structure.
3. org.apache.lucene.document
Provides a simple Document class. A Document is simply a set of named
Fields, whose values may be strings or instances of Reader.
4. org.apache.lucene.index
Provides two primary classes: IndexWriter, which creates and adds documents
to indices; and IndexReader, which accesses the data in the index.
5. org.apache.lucene.search
Provides data structures to represent queries (i.e., TermQuery for individual
words, PhraseQuery for phrases, and BooleanQuery for boolean combinations
of queries) and the IndexSearcher, which turns queries into TopDocs.
Many QueryParsers are provided for producing query structures from
strings or XML.
6. org.apache.lucene.store
Defines an abstract class for storing persistent data, the Directory, which
is a collection of named files written by an IndexOutput and read by an
IndexInput. Multiple implementations are provided, including FSDirectory,
which uses a file system directory to store files, and RAMDirectory which
implements files as memory-resident data structures.
7. org.apache.lucene.util
Contains a few handy data structures and utility classes, e.g., FixedBitSet and
PriorityQueue.
Lucene is the core library used by both Solr and Elasticsearch to build the
index and to search that index.
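The inverted index at the heart of Lucene maps each term to the documents containing it. The following is a toy sketch of that idea only; Lucene's actual implementation (postings lists, codecs, segments) is far more sophisticated:

```python
from collections import defaultdict

# Two tiny example documents.
docs = {1: "carotid artery ultrasound", 2: "chest x-ray image"}

# Build the inverted index: term -> set of document IDs containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)
```

Looking up a term then returns the matching document IDs directly, which is why search over an inverted index is fast regardless of collection size.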
1.4.5 Solr
1. Document
Represents a single item that can be searched for: an article, a DB record,
or whatever ‘content’ you need to make searchable. A document consists of
a set of fields, where a field is an attribute that has a name, a type, and a
value. The value is a piece of searchable information.
2. Term
A term is a value of a document field, after being analyzed and processed
by Solr. The term can be the same as the field value or a ’processed’
field value, depending on the field definition in the index schema.
3. Schema
In Solr, each index has a schema. The schema is an XML file that tells Solr
about the contents of documents it will be indexing. It describes the fields
that a document will have and the types for those fields. It also defines the
analysis that should be done on each field to extract the terms, plus some
attributes that affect how Solr builds the index, and what to store and what
to ignore in a document. Through the schema, the Solr index can be tailored
for the best possible search performance.
4. Indexing
Indexing is the data ingestion step. To make a document searchable, we
have to index it first. For Solr, this includes analyzing the document fields
to extract the terms and building a data structure to store them [130].
This data structure is a Lucene index [85]. Lucene builds the index in a way
that supports different access patterns when serving different types
of search queries, with search speed as the primary concern.
Thus, the index may grow larger than the indexed data itself.
However, index size becomes an issue only when it starts to affect search
performance; we will cover this in more detail later.
5. Core
In a non-distributed setup, the term core is just a synonym for ‘index.’
However, in a distributed setup (i.e., SolrCloud), a core is a single copy of
a specific shard of an index, hosted on one of the cluster nodes.
6. Collection
For a non-distributed setup, a collection is the same thing as a core.
For a distributed setup, a collection is a group of cores that represents a single
index, including all the logical shards of the index hosted on different
nodes in the cluster.
7. Query
The value we want to search for. A query consists of a set of terms,
grouped with different logical operators. Having a good idea of the types
of queries that the application will serve helps build the index in a way
that lets Solr serve those queries efficiently.
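The schema described above is defined in XML. For illustration, field definitions might look like the following; the field names (`ocrText`, `patientID`) and the `text_general` type are assumptions for this sketch, not the study's actual schema:

```xml
<!-- Hypothetical field definitions in a Solr schema.xml -->
<field name="ocrText"   type="text_general" indexed="true" stored="true"/>
<field name="patientID" type="string"       indexed="true" stored="true"/>
```

Here `indexed="true"` makes the field searchable, while `stored="true"` keeps the original value so it can be returned in results; turning either off for fields that don't need it is one of the schema-level tuning knobs.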
1.4.6 Elasticsearch
1.4.7 Amazon EC2
Amazon Elastic Compute Cloud (Amazon EC2) provides scalable computing capacity
in the Amazon Web Services (AWS) cloud. Using Amazon EC2 eliminates
the need to invest in hardware up front, so development and deployment can be
done faster. Amazon EC2 can be used to launch as many or as few virtual servers
as needed, configure security and networking, and manage storage. Amazon
EC2 enables scaling up or down to handle changes in requirements or spikes in
popularity, reducing the need to forecast traffic [9].
Cloud Computing
While the term "cloud computing" was popularized by Amazon.com releasing
its Elastic Compute Cloud product in 2006, references to the phrase "cloud
computing" appeared as early as 1996, with the first known mention in a Compaq
internal document [122].
In this case study, all experiments were executed using AWS EC2 Resources.
The Picture Archiving and Communication System (PACS) market has been transformed
by disruptive innovations from the information technology industry. The
cost of storage alone has dropped by a factor of 100 within the past ten years.
Improvements in display, processing, and networking have likewise enabled
PACS to be a capable replacement for film [20].
Given the increasingly high demands placed on PACS solutions and the
expected data growth, research in the area of PACS is very active. New
technologies are being actively explored to help with data storage and management.
Some directions in which new PACS systems are being investigated include
distributed and heterogeneous computing grids, peer-to-peer networks, knowledge
extraction using indexing engines [145], cloud computing, and mobile apps [17].
Dicoogle
To our knowledge to date, this study is unique with respect to the following points:
2. The number of Elasticsearch nodes used (we used a ten-node cluster).
3. Using EC2-hosted instances rather than local laptops or on-premises servers.
5. The types of queries used for evaluation. We used a set of different query
types (free-text, aggregations, filtering), whereas most other studies reported
results based on just a single query type.
The study performed by P. Seda, J. Hosek, P. Masek, and J. Pokorny showed that when storing larger key-value datasets, Elasticsearch can be as much as five times faster than MySQL when the focus is on retrieving particular information from the database [128]. The test was performed using 4 MiB of data on a single Elasticsearch node.
The study by J. Bai showed that Elasticsearch search time does not increase linearly as the number of matching records increases [14]. She used a free-text query to run the test: the search time increased by 6X when the number of matching records increased by more than 166X. The results were reported for 7 GB of data, which is 149 million records.
Trying to use the ELK stack to build an IoT data hub, M. Bajer reported that his implementation could process up to 60 events/second [15].
In an interesting case study at Mayo Clinic, the proposed system could ingest, index, and store HL7 messages at a rate of 62 million messages/day. The reported free-text search performance was 209 milliseconds for searching for 'pain' in 25.2 million HL7 messages [19]. However, a critical point missing from this result is the percentage of matching records in the searched data, which has a substantial effect on query performance.
Chapter 2
Data Modeling
2.1 Motivation
Big data definitions have evolved rapidly, which has raised some confusion. This is evident from an online survey of 154 C-suite global executives conducted by Harris Interactive on behalf of SAP in April 2012 ("Small and midsize companies look to make big gains with big data," 2012). Figure 2.2 shows how executives differed in their understanding of big data, where some definitions focused on what it is, while others tried to answer what it does [69].
Size is the first characteristic that comes to mind when considering the question "what is big data?" However, other characteristics of big data have emerged recently. For instance, Laney (2001) suggested that Volume, Variety, and Velocity (or the Three V's) are the three dimensions of challenges in data management. The Three V's have emerged as a common framework to describe big data (Chen, Chiang, & Storey, 2012; Kwon, Lee, & Shin, 2014). For example, Gartner, Inc. defines big data in similar terms: "Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making." ("Gartner IT Glossary," n.d.) [69].
As such, these data are difficult to process using existing technologies (Constantiou and Kallinikos, 2015). By adopting advanced analytics technologies, organizations can use big data for developing innovative insights, products, and services (Davenport et al., 2012) [71].
5. The data is very sparse; most of the fields in group 3 are nulls.
6. Some fields (exactly 13 fields) have no values for the whole data set.
7. The data for the 424 exams is 12 MB; a single record in CSV format consumes 0.8 KB.
Data cleansing, or data cleaning, is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database.
1. Resolve encoding errors caused by text fields that contained non-UTF-8 characters.
2. Each field in the JSON object was represented as an array of a single value, so we implemented a 'flattening' function to remove the arrays.
3. Remove commas, semicolons, and line breaks from text fields to enable the CSV format.
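As an illustration, a minimal 'flattening' helper of the kind described in step 2 might look like this (the field names here are hypothetical; the actual script operated on the exported exam records):

```python
def flatten(record):
    # Each value arrives as a single-element list; unwrap it.
    return {field: value[0] if isinstance(value, list) and len(value) == 1
            else value
            for field, value in record.items()}

exam = {"FindingSite": ["Carotid bulb"], "StudyDate": ["20180901"]}
print(flatten(exam))
# {'FindingSite': 'Carotid bulb', 'StudyDate': '20180901'}
```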
Execution time was measured using the Python timeit module [142] and the %%time IPython magic command [77].
Amazon EC2 M5 Instances are the next generation of the Amazon EC2 General
Purpose compute instances. M5 instances offer a balance of computing, memory,
and networking resources for a broad range of workloads. This includes web and
application servers, back-end servers for enterprise applications, gaming servers,
caching fleets, and app development environments. [8].
2.3 Experiment 1: Iteration-Based on Single Core
2.3.1 Method
A new record is built field by field. For each field, a random number is generated in the range [0, number-of-available-values], and that number is used to get the corresponding value from the sample data frame. The generator keeps accumulating the generated records in memory and dumps them to disk every 1,000 records.
1. Uniform distribution: with this option we randomly choose a value for the field from the 'distinct' list of values of that field in the sample data. This results in a uniform distribution of the values for each field.
2. Original distribution: with this option we randomly choose a value for the field from the list of values of that field in the sample data, keeping duplicates. This maintains the original distribution of values for the field as in the given sample data.
The second option produced records that are more similar to the original records, so we used it to build the generator. Listing 2.1 shows the generator1 script.
results_file = 'generated_records.csv'
records = []
records_count = 0
start = timeit.default_timer()
for exam_count in range(0, number_exams):
    exam = {}
    for field in examFields:
        exam[field] = getRandomValue(values[field])
    records.append(exam)
    if len(records) == 1000:
        records_count += 1000
        pd.DataFrame(records).to_csv(
            results_file, encoding='utf-8', mode='a', index=False)
        records = []
        start = timeit.default_timer()
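The getRandomValue helper used above can be sketched as follows (a minimal version, assuming values[field] is the list of sample values for that field with duplicates kept, so the original distribution is preserved):

```python
import random

def getRandomValue(field_values):
    # Pick a random index into the sample values; keeping duplicates
    # in field_values preserves the original value distribution.
    return field_values[random.randint(0, len(field_values) - 1)]

sample = ["plaque", "plaque", "plaque", "no finding"]
print(getRandomValue(sample))  # prints one of the sample values
```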
2.3.2 Throughput
It turned out that this data generation task can be thought of as a "random sampling" task, for which vectorization seemed an appealing implementation option. We used the Python numeric library NumPy to implement a vectorized version.
2.4.1 Method
• Advanced Indexing
Vectorization
The practice of replacing explicit loops with array expressions is commonly re-
ferred to as vectorization. In general, vectorized array operations will often be
one or two (or more) orders of magnitude faster than their pure Python equiva-
lents, with the biggest impact in any numerical computations [89].
NumPy Broadcasting
The term broadcasting describes how NumPy treats arrays with different shapes
during arithmetic operations. Subject to specific constraints, the smaller array is
“broadcast” across, the larger array so that they have compatible shapes. Broad-
casting provides a means of vectorizing array operations so that looping occurs in
C instead of Python. It does this without making needless copies of data and usually leads to efficient algorithm implementations [100]. Broadcasting is a powerful method for vectorizing computations [89].
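A tiny illustrative example of broadcasting:

```python
import numpy as np

M = np.arange(12).reshape(3, 4)   # shape (3, 4)
row = np.array([10, 20, 30, 40])  # shape (4,)

# 'row' is broadcast across each row of M; the loop runs in C,
# and no (3, 4) copy of 'row' is materialized.
print(M + row)
```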
ndarrays can be indexed using the standard Python x[obj] syntax, where x is the array and obj is the selection. There are three kinds of indexing available: field access, basic slicing, and advanced indexing. Which one occurs depends on obj. Advanced indexing is triggered when the selection object, obj, is a non-tuple sequence object, an ndarray (of data type integer or bool), or a tuple with at least one sequence object or ndarray (of data type integer or bool). There are two types of advanced indexing: integer and Boolean. Advanced indexing always returns a copy of the data (in contrast with basic slicing, which returns a view) [99].
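The copy-versus-view distinction can be seen directly in a small sketch:

```python
import numpy as np

x = np.arange(6).reshape(2, 3)
view = x[0:1]                # basic slicing: a view of x
copy = x[np.array([0]), :]   # advanced indexing: a copy

view[0, 0] = 99              # writes through to x
print(x[0, 0])               # 99

copy[0, 0] = -1              # does not affect x
print(x[0, 0])               # still 99
```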
1. Load the sample data into a dataframe, then convert it to a matrix S, where c is the number of columns (fields per record) and r is the number of rows (records).
3. Create a single-row NumPy array C of c columns, and set its values to the range [0, c).
S = pd.read_csv(csvPath)
S = S.as_matrix()
S[pd.isnull(S)] = None
z = 1000000
r = S.shape[0]
c = S.shape[1]
I = np.random.randint(0, r, (z, c))
C = np.arange(c)
generatedRecords = S[I, C]
2.4.2 Throughput
• Spark (local-mode).
2.5.1 Method
The data generation task can easily be thought of as a map job, with no reduce job. We created an RDD that is a list of integers from 1 to the number of records we want to generate. The number of RDD partitions was set to the number of cores. We then mapped the RDD to the generateExams function to generate the records. Finally, the RDD of records was transformed to a DataFrame, then saved in CSV format.
Listing 2.3 below shows the main part of the spark generator script.
spark = SparkSession.builder \
    .appName("DataGenerator") \
    .config("spark.executor.cores", "1") \
    .config("spark.executor.memory", str(memoryPerWorker) + "g") \
    .getOrCreate()
2.5.2 Throughput
The throughput for this generator is shown in table 2.4. The effect of parallelization on throughput is shown in figure 2.4 for the iteration-based mapping function and figure 2.5 for the vectorization-based mapping function.
2.6.1 Method
Create a pool of processes, where the number of processes equals the number of cores, then start a data generation task in each process.
Like in experiment 3, we also tested two different generation functions:
The multiprocessing package offers both local and remote concurrency, effec-
tively side-stepping the Global Interpreter Lock by using subprocesses instead
of threads. Due to this, the multiprocessing module allows the programmer to
leverage multiple processors on a given machine [118].
In listing 2.4 we show the script for this generator.
import timeit
from multiprocessing import Pool

def genData(nOutputSamples):
    nRows = m.shape[0]
    nCols = m.shape[1]
    I = np.random.randint(0, nRows, (nOutputSamples, nCols))
    return m[I, np.arange(nCols)]

if __name__ == '__main__':
    m = prepSampleInput("carotid.csv")
    for i in range(1, 5):
        start_time = timeit.default_timer()
        recordsPerThread = 1000000
        p = Pool(i)
        result = p.map(genData, [recordsPerThread] * i, 1)
        elapsed = timeit.default_timer() - start_time
2.6.2 Throughput
2.7.2 Throughput
Amdahl’s law
In experiment 5, we hit the memory wall (RAM I/O limit), which at a certain point eliminated any benefit of adding more cores.
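Amdahl's law makes this concrete: if only a fraction p of the work can run in parallel, n cores give a speedup of 1 / ((1 - p) + p/n), which is capped at 1 / (1 - p) no matter how many cores are added. A small sketch (the fraction 0.8 is an illustrative assumption, not a measured value):

```python
def amdahl_speedup(p, n):
    # p: parallelizable fraction of the work; n: number of cores.
    return 1.0 / ((1.0 - p) + p / n)

# With 80% of the work parallelizable, ten cores give only ~3.6X,
# and the limit as n grows is 1 / (1 - 0.8) = 5X.
print(round(amdahl_speedup(0.8, 10), 2))  # 3.57
```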
In experiment 2.5, we noticed that parallelization using Spark local mode provided a 1.7X increase in throughput for vectorization-based generation (figure 2.5).
30 Chapter 2. Data Modeling
2.8.1 Method
Like the implementation in experiment 2.5: a simple Map job, mapping an array
of integers (length of the array is the number of output records) to the NumPy
vectorized generation function. We set the number of partitions for the number
of executors (Cores). The Difference is that every executor runs on a separate
machine. The code is shown in listing 2.5.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages com.amazonaws:aws-java-sdk:1.7.4,"
    "org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell")
    .getOrCreate()
recordsPerWorker = 12000000
nWorkers = 10
A cluster of 11 memory-optimized nodes: one node for the master and ten workers. Each node is an EC2 r4.large. This type has two cores, so one is for the hosting OS and one for the Spark worker daemon. The memory available on each node is 15.25 GB; we configured 'spark-executor-memory' to 13 GB. This amount of RAM enables the worker to keep all the generated data (12 million records) in RAM, eliminating any need for disk spilling and avoiding disk I/O overhead.
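As a rough sanity check on that memory budget, 12 million records at the ~0.8 KB per CSV record observed in the sample data come to about 9.2 GB, which fits in the 13 GB of executor memory:

```python
records = 12_000_000
kb_per_record = 0.8            # observed in the sample data (section 2.2)
total_gb = records * kb_per_record / 1024 ** 2
print(round(total_gb, 1))      # 9.2
```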
1. Install Flintrock
Flintrock is a command-line tool for launching Apache Spark clusters [65].
Steps to install Flintrock on Ubuntu 16.04:
1- export AWS_ACCESS_KEY_ID=xxxxxxxxxxxx
2- export AWS_SECRET_ACCESS_KEY=xxxxxxxxxxxxxxx
3- flintrock --debug launch spark-numpy-cluster --num-slaves 10 \
--spark-version 2.2.0 --ec2-instance-type r4.large --ec2-key-name xxxxx \
--ec2-identity-file xxxxx.pem --ec2-ami ami-a4c7edb2 --ec2-user ec2-user
2- Add the path to the jars to the Spark classpath: open the file
"/opt/spark/conf/spark-defaults.conf" and add the line:
spark.driver.extraClassPath <path-to-dir-containing-the-jars>/*
4- Restart the Jupyter kernel (we used a Jupyter notebook to run the Spark job).
2.8.4 Throughput
In this experiment, we distributed the RAM I/O across ten separate machines. This eliminated the problem we noted in the previous experiment. The generation time was measured while fixing the "core : records-to-generate" ratio at 12 million records (about 9.2 GB) per executor. Figure 2.8 shows that the generation time is (almost) constant; the time increase between 1 core and ten cores is 12%. This is much better than generator4, shown in figure 2.7, where the time increase between 1 core and ten cores is 59%.
2.9 Conclusion
2.9.1 Vectorization
Two tools were considered to run the data generation in a distributed way: 1) a Dask cluster and 2) a Spark cluster.
The results showed that with a Dask cluster, the speedup was linear in the number of workers added to the cluster (4X speedup).
For the Spark cluster, the speedup decreased to 1.5X. These results suggest that the Spark distribution model includes considerable overhead (network, task scheduling and tracking) that will dominate the attainable speedup if the task at hand is not big enough.
The results above did not include the time to write the generated data to disk.
Chapter 3
Evaluating PostgreSQL
Textual search operators have existed in databases for years. PostgreSQL has
~, ~*, LIKE, and ILIKE operators for textual data types. These operators perform
text “pattern matching”. Pattern matching is a good feature but it lacks many
essential properties required by modern information systems [108]:
• There is no linguistic support, even for English. Regular expressions are not
sufficient because they cannot easily handle derived words, e.g., satisfies
and satisfy. You might miss documents that contain satisfies, although you
probably would like to find them when searching for satisfying. It is possible
to use OR to search for multiple derived forms, but this is tedious and error-
prone (some words can have several thousand derivatives).
• They tend to be slow because there is no index support, so they must pro-
cess all documents for every search.
In full-text search terminology, the unit of search is the 'document'. The documents are preprocessed to create an index that is used for searching. A simple example of document preprocessing is illustrated in figure 3.1.
In PostgreSQL, a document is usually a textual field within a row of a database table, or possibly a combination (concatenation) of such fields; the fields could be stored in several tables or obtained dynamically. In other words, a document can be constructed from different parts for indexing, and it might not be stored anywhere as a whole.
PostgreSQL provides good built-in support for full-text search. It provides special types for text search: tsvector and tsquery (the 'ts' prefix stands for 'text search') [113]. There are index types that can be used to speed up full-text search [109]. PostgreSQL also provides a wide range of functions and operators to help use these types in full-text search tasks [112]. The basic text search operator is the '@@' operator, also referred to as the 'match' operator [106]. PostgreSQL offers good control over how document pre-processing and indexing are done, through the text search configuration [107].
The primary concern for our use case is query time. However, an essential factor that we should also consider is data ingestion (in this case, 'insertion'). We used the PostgreSQL utility command 'COPY .. FROM' [24] to bulk insert 500 GB of data. The command takes a CSV file name and performs a bulk insertion into the given table. We divided the data into 20 files, each 25 GB in size. Listing 3.1 shows the script used for data insertion.
for i in range(NumDataFiles):
    file = '%sgenerated_records_batch_%s.csv' % (csv_path, i)
    with open(file) as f:
        cursor.copy_expert("COPY Carotid FROM \
            STDIN WITH CSV HEADER DELIMITER AS ','", f)
    print("Done inserting file: %s" % file)
Figure 3.2 shows the insertion time for each file. It is almost constant (23 minutes), regardless of whether the table we insert into is empty or already has data.
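For perspective, 25 GB in roughly 23 minutes corresponds to a sustained ingestion rate of about 18.6 MB/s (a back-of-the-envelope figure derived from the numbers above):

```python
file_gb = 25
minutes = 23
mb_per_second = file_gb * 1024 / (minutes * 60)
print(round(mb_per_second, 1))  # 18.6
```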
3.5.1 Tools
We ran all of the following experiments using Jupyter Python notebooks. We used the psycopg2 module (version 2.7.4) [116] to connect to PostgreSQL from Python scripts. The execution time was measured using the timeit module [142]. We used PostgreSQL version 10, which was the latest at the time of this writing.
3.5.3 Hardware
The machine used to run experiments 1 to 4 has four cores and 8 GB RAM. For experiment 5, we used an EC2 c4.large instance, which has two cores and 3.75 GB of RAM.
We begin the exploration of PostgreSQL text search by taking a look at the 'LIKE' operator's performance. In this experiment, we did not use any indexes, just plain 'LIKE'. The query statement is shown in listing 3.2. In figure 3.3 we show the query time in minutes.
Full-text searching in PostgreSQL is based on the match operator @@, which returns true if a tsvector (document) matches a tsquery (query). A tsquery contains search terms, which must be already-normalized lexemes. The functions to_tsquery, plainto_tsquery, and phraseto_tsquery are helpful in converting text into a proper tsquery. Similarly, to_tsvector is used to parse and normalize a document string [106].
In this experiment, we calculated the tsvector value for each record and stored it in the table before executing the query. Then, we measured the query time for the @@ query using that tsvector column. Listing 3.4 shows the statements to add and populate the tsvector column.
The query time is shown in figure 3.5. We also show the time consumed to
create and populate the tsvector column in figure 3.6.
In this experiment, we create a GIN index on the tsvector column to observe the gain (if any) from using a GIN index. Listing 3.5 shows the statement to create the index and the query statement. Figure 3.8 shows the time used to generate the GIN index. Figure 3.7 shows the query time after the index is added.
Select Count(ExamId)
From Carotid
where textsearch
@@ to_tsquery('english', 'Homogenous')
GIN indexes are the preferred text search index type. As inverted indexes, they contain an index entry for each word (lexeme), with a compressed list of matching locations. Multi-word searches can find the first match, then use the index to remove rows that lack the additional words. GIN indexes store only the words (lexemes) of tsvector values, and not their weight labels [109].
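The inverted-index idea behind GIN can be sketched in a few lines (a toy in-memory illustration, not PostgreSQL's actual on-disk structure; the sample texts are made up):

```python
from collections import defaultdict

docs = {
    1: "homogenous smooth wall plaque",
    2: "heterogenous plaque",
}

# One index entry per word (lexeme), mapping to the matching documents.
index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.split():
        index[word].add(doc_id)

# A multi-word search intersects the per-word posting lists.
print(sorted(index["plaque"] & index["smooth"]))  # [1]
```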
3.6.5 Conclusion
Comparing the four options we had in this experiment (shown in figure 3.9), the 'LIKE' operator had the best query time. However, 'LIKE' does not provide full-text search functionality, and 3 minutes is a very long time for 15 GB of data. These results do not provide a good reason to test these options at a significant data scale, which is 1 TB at the minimum.
|| ' ' || coalesce(FindingSite, '')
|| ' ' || coalesce(TopographicalModifier, '')));
Listing 3.6 shows the statements to create the function-based index.
3.7.1 Results
The time to create each type of index is shown in figure 3.10. Figure 3.11 compares the query time using the GIN index on a stored tsvector column and the function-based GIN index.
3.7.2 Conclusion
'simple' is one of the built-in text search configurations that PostgreSQL provides. Simple doesn't ignore stop words and doesn't try to find the stem of a word; with simple, every group of characters separated by a space is a lexeme [16]. This could be useful for our case, since the text of the documents is constrained to a limited vocabulary, much of which is just medical abbreviations.
In this experiment, we observe the effect of the configuration on:
1. The time consumed to create a GIN index on the tsvector column.
Select Count(ExamId)
From Carotid
where textsearch
@@ to_tsquery('simple',
'Homogenous & smooth & wall & plaque');
3.8.1 Results
3.8.2 Conclusion
The use of the 'simple' configuration increased the time to create the tsvector. The query time was also worse for the 'simple' configuration. This could be a result of a larger number of lexemes to examine.
In this experiment, we explore the query time and the index generation time using the pg_trgm module, and compare them to the built-in tsvector-based text search.
The pg_trgm module is one of the popular PostgreSQL extensions. This module provides functions and operators for determining the similarity of alphanumeric text based on trigram matching, as well as index operator classes that support fast searching for similar strings [111].
To be able to use pg_trgm, we need to follow two steps, assuming we have PostgreSQL installed and the host OS is Ubuntu Server (version 16.04):
UPDATE carotid
SET ts_ocrtext_impression =
    to_tsvector('simple',
        (coalesce(ocrText, '') || ' '
         || coalesce(impression, '')));

Select Count(*)
From Carotid
where coalesce(ocrtext, '') || ' '
    || coalesce(impression, '')
    ILIKE '%Homogenous%';
The statement to create the trigram-based GIN index is shown in listing 3.8. The statements to create the tsvector-based GIN index are shown in listing 3.9. We show the queries used in this experiment in listing 3.10.
3.9.3 Results
The time consumed for creating the GIN index is shown in figure 3.14. In figure 3.15 we show the query time measured for both GIN indexes.
3.9.4 Conclusion
The results of this experiment show that using a trigram-based GIN index did not improve query performance compared to the tsvector-based GIN index (figure 3.15). Another important note is that for both GIN indexes, the time to create the index is considerably long. To generate the index for 80 GB of data, we needed almost 5 hours for the trigram-based GIN index and almost 4 hours for the tsvector-based GIN index (figure 3.14).
Chapter 4
Evaluating Solr
4.1 Outline
Being one of the most popular open-source enterprise search platforms providing full-text search capability, Solr seemed a very promising option to address the use case at hand. The purpose of this chapter is to present the set of experiments we executed to evaluate Solr.
In section 2 we introduce Solr and the basic concepts behind it. We also provide a brief description of Solr's main configurations.
In section 3 we present a baseline experiment: we index 24 million records (18.6 GB) using default Solr configurations without any tuning.
In section 4 we take a look at how the index schema affects the index size, and hence indexing performance and query performance.
In section 5 we increase the data volume to 120 million records (92 GB) and observe the effects on indexing and query performance.
Finally, in section 6, we provide our conclusion.
Solr offers support for the most straightforward keyword searching through
to complex queries on multiple fields and faceted search results [5].
Solr includes the ability to set up a cluster of Solr servers that combines fault
tolerance and high availability. Called “SolrCloud", these capabilities provide
distributed indexing and search capabilities [1].
Before proceeding further, we need to look at some definitions [131]:
1. Document
Represents a single item that can be searched for. This can be an article, a DB record, or whatever 'content' you need to make searchable. A document consists of a set of fields, where a field is an attribute that has a name, type, and value. The value is a piece of searchable information.
2. Term
A term is a value of a document field after being analyzed and processed by Solr. The term can be the same as the field value or can be a 'processed' field value, depending on the field definition in the index schema.
3. Schema
In Solr, each index has a schema. The schema is merely an XML file that tells Solr about the contents of the documents it will be indexing. It describes the fields a document will have and the types of those fields. It also defines the analysis that should be done on each field to extract the terms, plus some attributes that affect how the index is stored and what to store and what to ignore in a document. Through the schema, the Solr index can be tailored for the best possible search performance.
4. Indexing
This is the data ingestion step. To make a document searchable, we have to index it first. For Solr, this includes analyzing the document fields to extract the terms and building a data structure to store the terms [130]. This data structure is a Lucene index [85]. The index is built in a way that supports different access patterns when serving different types of search queries, with search speed as the primary concern. Thus, the index size may grow larger than the indexed data itself. However, the index size is an issue only when it starts to affect search performance; we will cover this in more detail later.
5. Query
The value that we want to search for. The query consists of a set of terms, grouped with different logical operators. Having a good idea about the types of queries that the application will serve can help in building the index in a way that helps Solr serve those queries efficiently.
6. Core
In a non-distributed setup, the term core is just a synonym for 'index'. However, when moving to a distributed setup (SolrCloud), a core is a single copy of a specific shard of an index, hosted on one of the cluster nodes.
7. Collection
In a non-distributed setup, a collection is the same thing as a core. In a distributed setup, a collection is a group of cores that represent a single index; this includes all the logical shards of an index hosted on different nodes in the cluster.
2. Search Content
4.2.2 Lucene
Lucene is the core component of Solr. Lucene is a Java full-text search engine. Lucene is not a complete application, but rather a code library and API that can easily be used to add search capabilities to applications [84].
The Lucene index is a type of index known as an "inverted index". The Lucene API is divided into several packages [86]:
4.2.3 Segments
Searches may involve multiple segments and/or multiple indexes, each index
potentially composed of a set of segments.
Segments are immutable: once created, a segment is never modified. Adding new documents to an index always involves creating new segments. This makes segments good candidates for system caching, and it makes writing to the index lock-free. There is a configurable limit on the allowed number of segments for an index. Whenever the segments in an index have been altered, by either the addition of a newly flushed segment or a previous merge that may now need to cascade, a 'findMerges' call is invoked to pick the merges that are now required, or null if no merges are necessary [90].
The default in Solr is the TieredMergePolicy, which merges segments of approximately equal size, subject to an allowed number of segments per tier. Other available policies are LogByteSizeMergePolicy, LogDocMergePolicy, and UninvertDocValuesMergePolicy.
The "merge factor" parameter controls how many segments should be merged at one time. Choosing the best merge factor is generally a trade-off of indexing speed vs. searching speed. Having fewer segments in the index generally accelerates searches, because there are fewer places to look. It can also result in fewer physical files on disk. However, to keep the number of segments low, merges occur more often, which can add load to the system and slow down updates to the index [91].
When a document is deleted, it is not immediately dropped from the index. Instead, Lucene uses a bit flag to mark deleted documents. When a segment has a significant enough number of deleted documents (configurable), it becomes a good candidate for merging. Only after the merge occurs are the deleted documents cleared from the index. That is why we do not get an immediate index size reduction after deleting documents [132].
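This delete-then-merge behavior can be modeled in a few lines (a toy sketch of the idea, not Lucene's actual data structures):

```python
# A segment holds immutable documents plus a set of "deleted" flags.
segment = {"docs": ["exam1", "exam2", "exam3"], "deleted": set()}

def delete(seg, i):
    seg["deleted"].add(i)          # flip a flag; nothing is removed yet

def merge(segments):
    # Merging writes only live documents into a new segment, which is
    # when the space held by deleted documents is finally reclaimed.
    live = [d for seg in segments
            for i, d in enumerate(seg["docs"])
            if i not in seg["deleted"]]
    return {"docs": live, "deleted": set()}

delete(segment, 1)
print(len(segment["docs"]))            # 3 -- size unchanged before merge
print(len(merge([segment])["docs"]))   # 2 -- reclaimed after merge
```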
solrconfig.xml
The file is divided into different sections, where each section represents the con-
figurations for one of the Solr functions.
The two main sections are configurations for the two core operations: index-
ing and searching.
1. Indexing Configurations
The "<indexConfig>" section of solrconfig.xml defines the low-level behavior of the Lucene index writers [75].
2. Searching Configurations
The "<query>" section of solrconfig.xml contains the settings that affect query time, such as caches. To obtain maximum query performance, Solr stores several different pieces of information in in-memory caches. Result sets, filters, and document fields are all cached so that subsequent, similar searches can be handled quickly. Solr caches several different types of information to ensure that similar queries do not repeat work unnecessarily. There are three major caches:
In this section, we present the first Solr experiment. We used the default Solr configurations. We allocated 2 GB of memory to the Solr process, and 13 GB of memory was left for the OS cache, which directly impacts Solr performance [2]. The experiment setup is illustrated in table 4.1.
We used the Solr schemaless mode [140] to skip the schema definition step and let Solr 'discover' the fields in the documents and guess their types. This option is handy for quick tests; we can immediately start indexing. Solr will generate a schema definition with default values for all the different field properties. However, schemaless mode will not provide the customization and optimization that would produce the most efficient index, so this mode is only good enough for testing purposes.
After we got the generated schema from Solr, we manually modified it to set all field types to 'text_general', to avoid invalid-field-value problems that could cause indexing failures.
1. Install java
$ sudo apt-get update
$ sudo apt install default-jre
2. Install Solr
$ wget http://www-eu.apache.org/dist/lucene/solr/7.2.1/solr-7.2.1.tgz
$ tar xzf solr-7.2.1.tgz solr-7.2.1/bin/install_solr_service.sh --strip-components=2
$ sudo bash ./install_solr_service.sh solr-7.2.1.tgz -u ubuntu -n
5. Set the ownership of the mounted folder to the same user that owns the Solr process, user 'ubuntu'.
$ sudo chown -R ubuntu /var/solr/data
7. Create an instance dir for the new core, and use the Solr "_default" configuration folder
$ mkdir /var/solr/data/<core-name>
$ cp -r solr-7.2.1/server/solr/configsets/_default/conf /var/solr/data/<core-name>/
8. Disable AutoCommit
$ nano /var/solr/data/<core-name>/conf/solrconfig.xml
Comment out the <autoCommit> tag.
11. Index a sample data file to get the Solr-generated schema; the file contains a header row to define the field names. We set the 'overwrite' property to false to tell Solr not to check for duplicates, which provides a considerable speed-up [134].
$ curl "http://<public-ip>:8983/solr/<core-name>/update/csv?header=true
&overwrite=false" -H "Content-type:text/csv;charset=utf-8" -X POST -T <sample-data.csv>
12. Modify the schema file that was generated by Solr to set all field types to 'text_general'. We are assuming all the data is text.
13. Send update (indexing) requests using 1 GB files. These files don't have a header row, so we use the 'fieldnames' parameter. The curl tool [25] was used to send HTTP requests to the Solr server.
$ curl "http://<public-ip>:8983/solr/<core-name>/update/csv?commit=false
&overwrite=false" -H "Content-type:text/csv;charset=utf-8" -X POST -T <file-name>
4.3.5 Results
We considered the primary metrics that would generally matter to any search application: 1) the index size, 2) the indexing throughput, and 3) the query throughput, each as a function of the data size. For indexing throughput, we measure indexing time and indexing speed (docs/second). For query throughput, we measure the number of queries served per second and the longest query time.
1. Index size
In Figure 4.3 we plot the index size as a function of the data size. The index
size is almost 75% of the data size. We noticed that after batch number 5
(where the data size reached 15.5 GB) the index size jumped to more than 106%
of the data size, then dropped back to the 75% ratio after the last batch.
This is probably caused by a background merge process: the disk space used by
the old segments is only reclaimed after the merge completes.
2. Indexing throughput
We used the ‘real-time’ metric from Linux command ‘time’[83] to log index-
ing requests time. For each data point in the results, we used four parallel
indexing requests to send the 3.1 GB to Solr and we log the longest request
of the four (the last one to finish) As the indexing time. The curl tool [25]
was used to send the requests to Solr.
In Figure 4.4 we show the indexing time for each 3.1 GB batch. The index-
ing time is almost constant, except for the first indexing request, which took
less time than the following requests. This is probably because during the first
batch only the indexing process was using resources; in subsequent requests, a
merge process ran in parallel with the indexing process.
3. Query throughput
We look at two metrics: requests per second, and longest request time. We
measured the metrics for a non-cached query and for a cached query (when
sending the same query again, Solr will serve it from the cache). We ran 100
concurrent queries, 100 times.
We used the Apache HTTP server benchmarking tool [4] to run this test.
• Q1: count all. This query doesn’t retrieve any documents, only a count of
all documents.
64 Chapter 4. Evaluating Solr
• Q2: get 100 documents where field ’ocrText’ contains the value ‘CAA‘.
4.3.6 Discussion
For 24 million records (which is 18.6 GB) and using the settings described in ta-
ble 4.1, we observed the following:
• For queries that do not retrieve any document fields (like count all), Solr
could serve up to 1,700 requests per second if not cached, and up to 12,600
requests per second if cached.
• For queries that don’t retrieve any document fields (like count all), the
longest query time (of 10,000 queries) is 5 seconds if not cached and 0.1
seconds if cached.
If we relate this to the Lucene index structure, queries that do not retrieve
document fields are served using the ‘posting lists’ part of the index, which
is compressed and stored in RAM and hence requires less access time.
• The cache did not have much effect for Q2 (get 100 documents where the
ocrText field contains ‘CCA’).
This contrasts with Q1: Q1 does not retrieve any document fields, so it does
not require Solr to check the ‘stored fields’ part of the index. So why would
accessing the ‘stored fields’ cause more query time and fewer queries served
per second?
In our case, the answer is: the Q2 result set is too big for Solr to fit in its
cache (the Java heap), which is 512 MB (the default). So if the result is not
in the query results cache, Solr needs to read the index files on disk, which
slows down the process. The ‘stored fields’ part of the index consumes most of
the index size. Solr depends on the OS cache to hold the index files: with
enough RAM on the Solr server, the whole index fits in the OS cache, and
serving queries that miss the Solr cache only requires reading from the OS
cache, which is still RAM access. What happened in our case is that the index
grew beyond what fits in the OS cache, so to access stored fields Solr needed
to perform hard-disk seeks, which require much more access time.
In this experiment, we look at the relation between the schema used to create
the index on the one hand, and indexing and query performance on the other.
The experiment setup is illustrated in Table 4.2.
We used a total of 3.3 GB of data, representing 4 million documents. The data
is divided into 784 MB files (1 million documents each).
Solr Server
Hosting Machine 4 cores - 8 GB RAM
Memory Allocated to Solr 2G
Data Size 3.3 GB
Schema Custom (Manually Defined)
Solr Config Disable AutoCommit
otherwise, Default Config
• Schema1: auto-generated by Solr, using the schemaless mode and a sample
of the data. The generated schema contains some fields defined as integers
or floats (Solr set these based on the values in the sample data used to gen-
erate the schema). The only modification we made was to set all field types
to text_general, to avoid any indexing errors in case of wrong values for
some data records in the dataset.
4.4.3 Results
1. Index Size
The index size is shown in figure 4.10. It is clear that the schema used to
create the index directly affects the index size. For schema2, by not storing
the position and norms meta-data, and not storing the values of 19 fields
(30% of the total 60 fields), we reduced the index size by 40% compared to
schema1.
2. Indexing Throughput
The indexing time was almost unaffected by the customization we made to the
schema, as illustrated in figure 4.11.
3. Query Throughput
We logged the requests/second and the longest query time for a count-all
query, which does not retrieve any documents. In this test, we used 100
concurrent requests, a total of 10,000 requests, and did not cache the query.
Figure 4.12 shows requests/second; Figure 4.13 shows the longest query time.
We can see that the schema customization led to an index-size reduction, and
hence significantly improved the requests/second metric by 9X and the query
time by 8X.
4.4.4 Discussion
The results of this experiment are a good indicator of the direct and substan-
tial effect of the schema used to build the index on search performance.
The indexing performance was not affected by the customization made to the
schema in this experiment, probably because:
1. We did not make any changes to the “analysis” configuration, which controls
most of how Solr performs indexing.
million documents. The experiment setup is illustrated in Table 4.3. We used the
customized schema2 from experiment 2, with some customization:
2. Store positions for only the two fields that should support full-text
search.
4.5.1 Results
1. Index Size
The index size was a nearly constant ratio of the data size, at 50%, as
illustrated in Figure 4.14.
2. Indexing Throughput
We pushed the data to Solr from an EC2 m4-large machine, using the curl
tool. We recorded the time to index each 10 GB; 10 GB of data is 12 million
documents. The 10 GB were sent to Solr using 3 consecutive requests, each
containing 3.3 GB of data. Indexing 10 GB took 16 minutes on average, so the
average indexing speed was 11,300 documents/second.
Indexing time, and indexing speed are illustrated in Figures 4.15 and 4.16.
3. Query Throughput
We logged the requests/second and the longest query time, with caching,
for the query: get 100 documents where the ‘ocrText’ field contains the value
‘CAA’. For this test we used 100 concurrent requests and a total of 10,000
requests. Requests per second are illustrated in figure ??. The longest request
time is illustrated in figure 4.18.
4.5.2 Discussion
4.6 Conclusion
• Solr index size using default configurations and schemaless mode is 75% of
the indexed data size.
Chapter 5
Evaluating Elasticsearch
Elasticsearch uses Lucene at its core to perform its indexing and searching.
Any time that we start an instance of Elasticsearch, we are starting a node. A
collection of connected nodes is called a cluster. If we are running a single node
of Elasticsearch, then we have a cluster of one node. Every node in the cluster
can handle HTTP and Transport traffic by default. The transport layer is used
exclusively for communication between nodes and the Java TransportClient; the
HTTP layer is used only by external REST clients [63].
We used some of the main Elasticsearch features to run this experiment; below
is a brief description of each:
1. Bulk API
Elasticsearch exposes the bulk API over HTTP. It makes it possible to per-
form many index/delete operations in a single API call. This can significantly
increase the performance. The response to a bulk action is a large JSON
structure with the individual result of each action that was performed. The
failure of a single action does not affect the remaining actions. There is
no "correct" number of actions to perform in a single bulk call. It depends
on experimenting with different settings to find the optimum size for the
particular workload. When using bulk API over HTTP, make sure that the
client does not send HTTP chunks, as this will slow things down [38].
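As a sketch of what the bulk API expects, the request body is newline-delimited JSON: one action line followed by one source line per document, with a trailing newline. The index and type names below come from this experiment; the helper itself is illustrative, not the code used:

```python
import json

def build_bulk_body(docs, index="carotid", doc_type="image"):
    """Build an NDJSON bulk-request body: an action line, then a
    source line, for each document."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps(doc))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"

body = build_bulk_body([{"image": "55,12.3,some OCR text"}])
```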
2. Ingest Node
Elasticsearch provides a mechanism to pre-process documents before in-
dexing, through the Ingest node. The ingest node intercepts bulk and index
requests, it applies transformations, and it then passes the documents back
to the index or bulk APIs. All nodes enable ingest by default so that any
node can handle ingest tasks. It is also possible to create dedicated ingest
nodes [40].
In our case, we used the ingest node to transform the documents from CSV
to JSON, using the ‘grok processor’ [62] [67].
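A sketch of what such a pipeline definition could look like; the grok pattern and the extracted field names here are illustrative assumptions, not the exact pipeline used in the experiment:

```python
import json

# Illustrative ingest pipeline: a grok processor that splits a raw CSV
# line (held in the "image" field, as in Listing 5.1) into named fields.
pipeline = {
    "description": "csv-to-json-pipeline (illustrative sketch)",
    "processors": [
        {
            "grok": {
                "field": "image",
                "patterns": ["%{DATA:Age},%{DATA:EDV},%{GREEDYDATA:ocrText}"]
            }
        }
    ]
}

# This JSON would be registered with PUT /_ingest/pipeline/csv-to-json-pipeline.
body = json.dumps(pipeline)
```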
3. Dynamic Mapping
Mapping is the layer that Elasticsearch uses to map complex JSON docu-
ments into the simple flat documents that Lucene expects to receive. It
defines how a document, and the fields it contains, are stored and indexed.
The automatic detection and addition of new fields is called dynamic map-
ping [32].
4. Network Settings
We used an EC2 instance as the single elasticsearch node. To be able to
send requests to this node from another EC2 instance, we needed to change
the network.host and the discovery.type properties in ‘elasticsearch.yml’
configuration file [96].
5.1. Experiment 1: A Baseline 75
1. Install elasticsearch
cd elasticsearch-6.2.0/
sudo mkdir /var/elasticsearch
sudo mkdir /var/elasticsearch/data
sudo mkdir /var/elasticsearch/logs
sudo mkdir /opt/elasticsearch
sudo cp -r config /opt/elasticsearch/
3. Configure paths
In the file /opt/elasticsearch/config/elasticsearch.yml set the “path.data”
and “path.logs” properties to newly created directories
7. Start Elasticsearch
ES_PATH_CONF=/opt/elasticsearch/config ./bin/elasticsearch
There are Elasticsearch clients in many languages including Java, Ruby, and Python.
We used the python helper elasticsearch.py [120] to handle communication
with elasticsearch.
The module includes many helper functions. parallel_bulk is a helper for the
Elasticsearch bulk API that runs in multiple threads at once. The main param-
eters to the parallel_bulk helper are: 1) an instance of the Elasticsearch class
that provides the connection to the cluster nodes, and 2) an iterable or a
generator that provides the data to send to Elasticsearch [119]. We used a
generator rather than an iterable, to avoid the excessive memory consumption
that loading the data into an iterable would require.
A note worth mentioning is that the parallel_bulk helper is itself a generator,
which means it is lazy and won’t produce any results until we consume its output
[105].
The parallel_bulk helper has several other parameters [119] that could be
tuned according to the elasticsearch cluster capacity:
4. queue_size: the size of the task queue between the main thread (producing
chunks to send) and the processing threads. The default is 4.
For this experiment, we made some quick trials to index the same file using
different values for the parameters, and heuristically chose the values that
resulted in the fastest indexing: thread_count=2, chunk_size=1000,
max_chunk_bytes=104857600, and queue_size=4.
Listing 5.1 shows the script used to do bulk indexing in this experiment.
import time
from collections import deque
from elasticsearch import Elasticsearch
from elasticsearch.helpers import parallel_bulk

def prepareData(data_file):
    for line in open(data_file):
        imageData = line.rstrip('\n').replace('\\', '/')
        yield {
            "_index": "carotid",
            "_type": "image",
            "pipeline": "csv-to-json-pipeline",
            "_source": {
                "image": imageData.decode('utf-8')
            }
        }

if __name__ == '__main__':
    es = Elasticsearch("34.226.202.104:9200")
    for data_file in files:  # 'files' and 'data_dir' are defined elsewhere
        start_time = time.time()
        deque(parallel_bulk(es,
                            prepareData(data_dir + "/" + data_file),
                            thread_count=2, chunk_size=1000,
                            max_chunk_bytes=(100 * 1024 * 1024)),
              maxlen=0)
5.1.4 Results
We considered the same metrics that we recorded for Solr: 1) the index size, 2)
the indexing throughput, and 3) the query throughput, as a function of the data
size.
For indexing throughput, we measure indexing time, and indexing speed (doc-
s/second). For query throughput, we measure the number of queries served per
second, and query time.
The indexing time was calculated using the python ‘time’ module [141]. The
query throughput was measured using the Apache benchmarking tool [4].
1. Index size
In Figure 5.1 we show the index size as a function of the indexed data size.
The index size is almost 135% of the data size while indexing is in progress.
We noticed that at data sizes of 9.2 and 15.5 GB the index size went to more
than 150% of the data size. This is probably caused by on-going segment
merging, where the disk space for the old segments is only reclaimed after
merging completes. After indexing and merging finished, the index size was
almost the same as the data size.
2. Indexing throughput
In Figure 5.2a we show the indexing time for each file (~3.1 GB). In Fig-
ure 5.2b we show the indexing speed, that is, the number of documents indexed
per second.
3. Query throughput
• -c concurrency
This is the number of multiple requests to perform at a time. We set it
to 100.
• -n requests
This is the number of requests to perform for the benchmarking ses-
sion. We set this to 1000.
• -k
We used this flag to enable the HTTP KeepAlive feature (perform mul-
tiple requests within one HTTP session).
We present the results for two cases: when the query is not cached (the
default for any query executed for the first time), and when Elasticsearch
caches the query (after running it multiple times).
• Q1: count all. This query doesn’t retrieve any documents, only a count
of all documents in the index.
• Q2: get 100 documents where field ’ocrText’ contains the term ‘CAA‘.
Number of requests per second is shown in figure 5.3. The query time is
shown in figure 5.4.
5.1.5 Conclusion
To be added.
5.2. Experiment 2: The Impact of Mapping 81
In this experiment, we try to understand the effect of the ‘mapping’ used to
build the index on: 1) the index size, 2) the indexing performance, and 3) the
query performance.
We build two indices for the same data, one using the dynamic mapping [43]
generated by Elasticsearch, and the other using a custom mapping that we define.
The experiment setup is illustrated in table 5.2. We used a single-node cluster,
and disabled index sharding and replication (index properties: replicas=0 and
shards=1).
all fields values in quotes), so it created two field definitions for each field, a ‘text’
field and a ‘keyword’ field.
We list below the customization that we made in defining the custom mapping:
5. Use the python client to send data for indexing (using each index sepa-
rately) and log indexing time.
6. Run the Apache ab-test with the defined queries, and log the test output.
• number_of_shards = 1
• number_of_replicas = 0
• refresh_interval = -1
5.2.4 Results
1. Index size
In Figure 5.5 we show the index size for both indices: the dynamic mapping
index and the custom mapping index. Using a custom mapping reduced the
index size by more than 60%.
2. Indexing Throughput
The time to build each index for the same data (3.3 GB) is shown in figure
5.6. The time to build the index using custom mapping is 20% less than the
time to build the index using dynamic mapping.
3. Query Throughput
• -c concurrency
This is the number of multiple requests to perform at a time. We set it
to 100.
• -n requests
This is the number of requests to perform for the benchmarking ses-
sion. We set this to 10000.
• -k
We used this flag to enable the HTTP KeepAlive feature (perform mul-
tiple requests within one HTTP session).
The requests per second is shown in figure 5.7. The query time is shown in
figure 5.8.
5.2.5 Conclusion
To be added.
5.3.1 Setup
Changing the default value of ‘network.host’ makes Elasticsearch assume
that we are moving to production, and triggers a set of bootstrap checks that
verify essential system settings are configured correctly [115]. If any of
those settings are not configured correctly, Elasticsearch will throw an
exception and the node will fail to start. Those critical system settings
are:
• Disable swapping
2. Increase the OS file descriptors and virtual memory limits, and disable
swapping for elasticsearch, as described in section 5.3.2.
6. Change the translog durability for the index from the default value
‘request’ to ‘async’:
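As a sketch, the settings-update body for this change (sent with a PUT to the index _settings endpoint; the index name is elided):

```python
import json

# Body for PUT /<index-name>/_settings, switching the translog from
# per-request fsync to periodic asynchronous fsync.
settings_body = json.dumps({"index": {"translog.durability": "async"}})
```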
7. Use the python client to send data for indexing (using parallel_bulk) and
log indexing time.
8. Run the Apache ab-test with the defined queries, and log the test output.
Translog
Because Lucene commits are too expensive to perform on every individual change,
each shard copy also has a transaction log known as its translog associated with
it. All index and delete operations are written to the translog after being pro-
cessed by the internal Lucene index but before they are acknowledged. In the
event of a crash, recent transactions that have been acknowledged but not yet
included in the last Lucene commit can instead be recovered from the translog
when the shard recovers.
The data in the translog is only persisted to disk when the translog is fsynced
and committed. In the event of hardware failure, any data written since the pre-
vious translog commit will be lost.
The _field_names field indexes the names of every field in a document that con-
tains any value other than null. This field is used by the exists query [44] to find
documents that either have or don’t have any non-null value for a particular field
[57].
_field_names introduces some index-time overhead. Since we do not need
‘exists’ queries, disabling this field is a good option to help optimize
indexing speed [57].
5.3. Experiment 3: Indexing 92 GB Using a Single Node 89
We did some trials to tune the chunk_size and thread_count parameters of the
python helper parallel_bulk, targeting the best possible indexing throughput.
The trials are listed in Table 5.4. In each trial we set max_chunk_bytes
to chunk_size * 3 * 1024, since a request for indexing a single document is
almost 3 KB.
Based on those trials, we used chunk_size=10000 and thread_count=2 for this
experiment.
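This tuning rule can be checked with a couple of lines; the chosen chunk_size of 10,000 reproduces the ~29 MB byte limit reported in the results:

```python
def max_chunk_bytes(chunk_size, bytes_per_doc_request=3 * 1024):
    # Tuning rule from the trials: max_chunk_bytes = chunk_size * 3 KB,
    # since one single-document indexing request is almost 3 KB.
    return chunk_size * bytes_per_doc_request

# chunk_size=10,000 gives the 30,720,000-byte (~29 MB) limit used here.
assert max_chunk_bytes(10000) == 30720000
```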
Elasticsearch search APIs allow the search query in two modes [54]:
• URI Search
The query is a simple string sent as a request parameter [es-URI-search].
Not all search options are exposed when executing a search using URI
mode, but it can be handy for quick “curl tests".
• Request Body Search
The query is defined using the Query DSL (Domain Specific Language). The
‘query’ element within the search request body allows defining a query using
the Query DSL.
In this experiment, we used request body mode. Listing 5.2 shows Q4 JSON.
The request body mode provides a set of parameters other than the ‘query’ pa-
rameter, which can be used to control different aspects of how the search is
performed and how results are returned. We used some of these parameters :
• size
The size parameter configures the maximum number of hits to be returned.
It can be used with the from parameter to implement pagination [52].
• sort
Sorting is defined on a per-field level, with the special field names _score
(to sort by score) and _doc (to sort by index order). Sorting by _doc is the
most efficient sort order, which is especially important when dealing with
large results [53].
• stored_fields
Sets which stored fields to return for each document represented by a search
hit. By default, Elasticsearch will not return any. ‘*’ can be used to load
all stored fields from the document. If the requested fields are not stored,
they will be ignored [51].
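Putting these parameters together, an illustrative request body (a sketch, not one of the exact test queries) might look like:

```python
import json

# Illustrative body: fetch 100 hits whose ocrText matches a term, sorted
# in index order (_doc, the cheapest sort), returning one stored field.
search_body = {
    "size": 100,
    "sort": ["_doc"],
    "stored_fields": ["ocrText"],
    "query": {"match": {"ocrText": "CAA"}},
}
body = json.dumps(search_body)
```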
• Leaf clauses
look for a particular value in a particular field, such as the match, term or
range queries.
• Compound clauses
wrap other leaf or compound queries and are used either to combine mul-
tiple queries in a logical fashion (such as the bool or dis_max query), or to
alter their behaviour (such as the constant_score query).
• Query context
When executing a query clause in the ‘query’ context, elasticsearch will cal-
culate a _score for each matching document, to provide the results sorted
by how well they meet the query clause.
This context is the default when sending a query clause using a ‘query’
parameter.
• Filter context
In this context, no scores are calculated. The document either matches or
does not match a query clause. Frequently used filters will be cached auto-
matically by Elasticsearch, to speed up performance. Filter context is used
whenever a query clause is sent using a ‘filter’ parameter.
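The two contexts can be combined in a single bool query; a sketch using field names from the test queries:

```python
# A bool query combining both contexts: the match clause runs in query
# context (scored), while the term clause under "filter" runs in filter
# context (unscored and cacheable).
bool_query = {
    "query": {
        "bool": {
            "must": [{"match": {"ocrText": "CCA"}}],
            "filter": [{"term": {"gender": "Female"}}],
        }
    }
}
```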
5.3.6 Results
1. Index Size
Indexing the 92 GB of data, using a custom mapping, resulted in 70.2 GB
index. The index size is 76% of the data size.
2. Indexing Throughput
We pushed the data to Elasticsearch from a separate client, an m5-large in-
stance. The parallel_bulk helper was used to send data in chunks to elastic-
search bulk API. The parameters of the parallel_bulk helper were selected
based on the trials described in section 5.3.4. The selected values are:
• thread_count=2
• chunk_size=10,000
• max_chunk_bytes=30720000 (29 MB)
3. Query Throughput
We used the Apache benchmarking tool to perform the test. The parameters
used are:
• -c concurrency
This is the number of multiple requests to perform at a time. We set it
to 200.
• -n requests
This is the number of requests to perform for the benchmarking ses-
sion. We set this to 2000.
• -k
This flag is used to enable the HTTP KeepAlive feature (perform mul-
tiple requests within one HTTP session). We did not use this flag, be-
cause some queries took a very long time to complete causing the test
to fail.
• -s timeout
This sets the maximum number of seconds to wait before the socket
times out. The default is 30 seconds. For this experiment the default
30 seconds was not enough; the test failed with a timeout error.
To allow the test to complete we set this parameter to 120 seconds.
The query time for each of the test queries (described in section 5.3.5) is
shown in figure 5.10; we show the median (50th percentile) time. The number
of requests per second is shown in Table 5.6.
5.3.7 Conclusion
To be added.
5.4. Experiment 4: Indexing 92 GB using Five Nodes Cluster 93
[Figure 5.10: query time per test query. (a) Q1: count all; (b) Q2: get 100 records, gender=Female; (c) Q3: get 100 records, ocrText contains ’CCA’; (d) Q4: get 100 records, gender=Male and ocrText contains ’CCA’; (e) Q5: group by FindingSite; (f) Q6: group by age where gender=Female and ocrText contains ’CCA’]
Query Id    Requests/Second (AVG)
Q1          2072
Q2          1
Q3          2
Q4          1
Q5          2230
Q6          1780
nodes spark cluster to generate the data. Then we use the elasticsearch-hadoop
connector to send the data from spark cluster to elasticsearch cluster.
5.4.1 Setup
Flintrock
Parallel ssh
Elasticsearch-Hadoop Connector
Zen Discovery
Zen discovery is the built-in, default discovery module for Elasticsearch. It
provides unicast discovery but can be extended to support cloud environments
and other forms of discovery. Zen discovery is integrated with the transport
module [59]; all communication between nodes is done using the transport
module [61].
Unicast discovery requires a list of hosts to use. These hosts can be specified as
hostnames or IP addresses; hosts specified as hostnames are resolved to IP ad-
dresses during each round of pinging. The unicast discovery uses the transport
module to perform the discovery. The discovery.zen.ping.unicast.hosts
setting is used to provide the list of hosts.
For the network host, the default for Elasticsearch is to bind to loopback ad-
dresses only, e.g., 127.0.0.1 and [::1]. This is sufficient to run a single develop-
ment node on a server. To communicate and form a cluster with nodes on
other servers, we must change network.host. There are many special values
[97] to support different network setups.
For cloud-hosted clusters, there are specialized plug-ins [29]. We ran the ex-
periment on EC2, so we used the ec2-discovery plug-in [33].
The EC2 discovery plugin uses the AWS API for unicast discovery. The plugin
provides a hosts provider for zen discovery named ‘ec2’. This hosts provider finds
other Elasticsearch instances in EC2 through AWS metadata. Authentication is
done using IAM Role credentials by default [33]. The ec2 discovery plugin
allows Elasticsearch nodes to discover each other and form a cluster without
the need to statically list the IPs of all nodes in every node’s elasticsearch.yml
configuration file.
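A minimal elasticsearch.yml sketch of this setup (an illustration only; it assumes the discovery-ec2 plugin is installed and an IAM role is attached to the instances):

```yaml
# Bind to the instance's EC2-provided address instead of loopback
network.host: _ec2_
# Let the ec2 hosts provider supply the unicast host list
discovery.zen.hosts_provider: ec2
```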
Elasticsearch Cluster
1. Install java.
Spark Cluster
These steps are done on the client machine, that is, the machine we will use
to run the experiment’s Jupyter notebook.
1. Download the jar, and create a dir to put it in, for example, ‘spark-extra-
jars’
$ wget http://central.maven.org/maven2/org/elasticsearch/elasticsearch-hadoop-mr/6.2.0/elasticsearch-hadoop-mr-6.2.0.jar
1. All nodes are in the same EC2 security group, but that alone won’t enable
communication between nodes. We must add a custom inbound rule to
open the port needed for TCP communication between nodes (port 9300):
Security Group » Inbound/Outbound » Edit » add a new inbound rule for port
9300, with ‘Source’ set to the security group id.
In this experiment, we used the same custom mapping from experiment 2, de-
scribed in section 5.2.2, with additional modifications:
5. Set translog durability to async (after creating the index); closing and
reopening the index is required.
5.4.4 Results
1. Index Size
Indexing the 92 GB of data, using a custom mapping, resulted in 87.1 GB
index. The index size is 95% of the data size.
2. Indexing Throughput
3. Query Throughput
We used the Apache benchmarking tool to perform the test. The parameters
used are:
• -c concurrency
This is the number of multiple requests to perform at a time. We set it
to 200.
• -n requests
This is the number of requests to perform for the benchmarking ses-
sion. We set this to 2000.
• -k
This flag is used to enable the HTTP KeepAlive feature (perform mul-
tiple requests within one HTTP session). We did not use this flag, be-
cause some queries took a very long time to complete causing the test
to fail.
• -s timeout
This sets the maximum number of seconds to wait before the socket
times out. The default is 30 seconds. For this experiment, we kept the
default (30 seconds).
We used the same test queries described in section 5.3.5. The query time
for each of the test queries is shown in figure 5.11; we show the median
(50th percentile) time.
Query Id    Requests/Second (AVG)
Q1          2071
Q2          5
Q3          10
Q4          10
Q5          1731
Q6          1654
[Figure 5.11: query time per test query. (a) Q1: count all; (b) Q2: get 100 records, gender=Female; (c) Q3: get 100 records, ocrText contains ’CCA’; (d) Q4: get 100 records, gender=Male and ocrText contains ’CCA’; (e) Q5: group by FindingSite; (f) Q6: group by age where gender=Female and ocrText contains ’CCA’]
In this experiment, we use a data size that is closer to the target data size for this
use case. As described in section 1.3.8, the target data size is 5.65 terabytes. We
index 1 terabyte of data, which amounts to about 1.3 billion records (images).
5.5.1 Setup
Listing 5.3 shows the json used to create the index, it contains a ‘settings’
object and a ‘mappings’ object. For the mappings object we show only some of
the fields, the rest of the fields follow the same structure.
{
"settings" : {
"refresh_interval" : -1,
"number_of_shards" : 10,
"number_of_replicas" : 0
},
"mappings" : {
"image" : {
"_source": {
"enabled": false
},
"_field_names": {
"enabled": false
},
"properties" : {
"Age" : {
"type" : "keyword",
"store":true
},
"EDV" : {
"type" : "keyword",
"store":true
},
"ocrText" : {
"type" : "text",
"index_options": "positions",
"norms": false,
"store":true
},
"ocr_sourceDicomPath" : {
"type" : "text",
"index_options": "freqs",
"norms": false,
"store":true
},
...............
}}}}
We followed the same steps described in section 5.4.2. The only difference is
the number of nodes. We increased the number of nodes to 10 for both spark
cluster and elasticsearch cluster. In figure ?? we show an overall abstraction of
the experiment setup.
5.5.3 Results
1. Index Size
The index was divided into ten shards, each shard on a separate node. The
one terabyte of data produced a total index size of 654.1 GB; the index size is
63% of the data size.
5.5. Experiment 5: Indexing 1 TB using Ten Nodes Cluster 103
2. Indexing Throughput
We used the elasticsearch-hadoop connector to send the generated data
from the spark cluster to elasticsearch cluster. We used 11 spark jobs, ex-
ecuted in sequence. Each job used ten parallel tasks to send 92 GB of data
to elasticsearch. Using this experiment setup, and index settings and map-
ping, the average time to index 92 GB of data is 88 minutes. The average
indexing speed is 23.2 K documents per second.
(a) es.batch.size.bytes
Size (in bytes) for batch writes using Elasticsearch bulk API. This size
is per task instance. The total bulk size at runtime hitting Elasticsearch
will be this value multiplied by the number of parallel tasks. The de-
fault is 1 MB. We set this property to 1572864 (1.5 MB).
(b) es.batch.size.entries
Size (in entries) for batch writes using the Elasticsearch bulk API (0 dis-
ables it). Companion to es.batch.size.bytes; once either limit is reached,
the batch update is executed. Like the size setting, this is per task in-
stance; it gets multiplied at runtime by the total number of Hadoop
tasks running. The default value is 1000. We set this property to
1,900,000, because for our data the document size is relatively small,
only 0.8 KB on average.
(c) es.batch.write.refresh
Whether to invoke an index refresh after a bulk update has been
completed. This is called only after the entire write (meaning multiple
bulk updates) has been executed. The default value is true. We set
this property to false because, for our case study, indexing is done
offline, in batch mode, so there is no need to refresh while indexing is in
progress.
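A quick arithmetic check (a sketch using the values above) shows which limit actually governs the batch size:

```python
batch_bytes = 1572864        # es.batch.size.bytes (1.5 MB), per task
batch_entries = 1900000      # es.batch.size.entries, per task
avg_doc_bytes = 0.8 * 1024   # ~0.8 KB per document, from the text

# Documents accumulated before the byte limit triggers a flush:
docs_at_byte_limit = round(batch_bytes / avg_doc_bytes)
assert docs_at_byte_limit == 1920

# The byte limit fires long before the 1.9 M entry limit is reached,
# so es.batch.size.bytes is what actually governs the batch size here.
assert docs_at_byte_limit < batch_entries
```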
generated_data_rdd.saveAsNewAPIHadoopFile(
    path='-',
    outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
3. Query Throughput
We used the Apache benchmarking tool to perform the test. The parameters
used are:
• -c concurrency
This is the number of multiple requests to perform at a time. We set it
to 100.
• -n requests
This is the number of requests to perform for the benchmarking ses-
sion. We set this to 2000.
• -k
This flag is used to enable the HTTP KeepAlive feature (perform mul-
tiple requests within one HTTP session). We did not use this flag, be-
cause some queries took a very long time to complete causing the test
to fail.
• -s timeout
This sets the maximum number of seconds to wait before the socket
times out. The default is 30 seconds. For this experiment the default
30 seconds was not enough; the test failed with a timeout error.
To allow the test to complete we set this parameter to 120 seconds.
We used the test queries described in section 5.3.5. The query time for
each of the test queries is shown in figure 5.12, which reports the time
within which 50% of the requests were served.
The average throughput for each test query was:

Query Id    Requests/Second (avg)
Q1          959
Q2          1
Q3          2
Q4          2
Q5          1060
Q6          894
5.7 Conclusion
[Figure 5.12: Query time for each of the test queries.
(a) Q1: count all
(b) Q2: get 100 records, gender=Female
(c) Q3: get 100 records, ocrText contains 'CCA'
(d) Q4: get 100 records, gender=Male and ocrText contains 'CCA'
(e) Q5: group by FindingSite
(f) Q6: group by age where gender=Female and ocrText contains 'CCA']
Chapter 6
Conclusion
Medical images are typically searched using the DICOM Query/Retrieve service.
The query includes search parameters based on a DICOM tag. For instance, a
workstation may query a PACS server for all studies acquired today (in this case,
the Study Date is the search parameter). The DICOM Query/Retrieve service
does not provide free-text search, nor does it support complex logical queries or
statistical queries. Besides, it does not scale easily to accommodate the increasing
volume of data.
The focus of this study was to find the best technology to implement a reliable,
scalable, real-time search functionality that would support different kinds of
search queries, including free-text and aggregations.
We used the meta-data of the medical images as the search corpus. The meta-
data is text-based data. The primary challenge was the data volume and the full
range of query types that should be supported.
AWS EC2 instances were used to run the experiments. AWS S3 and AWS EBS
services were used for data storage.
We studied the impact of the mapping on the index size, and hence on the search
performance. Using a custom mapping instead of the auto-generated dynamic
mapping, the index size was reduced by 57%. The query time was reduced by 71% for the
free-text query and by 36% for the count-all query. The number of requests per
second increased by a factor of 1.8 for the free-text query and by a factor of 3.6 for
the count-all query. In the third experiment, we indexed 120 million records (92
GB). Using a single node hosted on an i3.large instance, the indexing through-
put was 4100 documents per second. The query performance ranged from 52
milliseconds for the count-all query to 96 seconds for the query "get 100 docs where
gender=Female". In the fourth experiment we used the same data volume, but
with a five-node cluster (i3.large instances). The indexing throughput increased
by a factor of 3.36, from 4100 to 13800 docs per second.
Search performance also improved significantly for all query types; the maximum
query time was 10 seconds.
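As a quick arithmetic check on the scaling reported above, the single-node and five-node throughputs imply the quoted speed-up factor (figures taken from the text):

```python
# Indexing throughputs reported in the experiments (docs per second).
single_node_tp = 4100   # one i3.large node
five_node_tp = 13800    # five-node i3.large cluster

speedup = five_node_tp / single_node_tp
print(f"{speedup:.2f}x")  # ~3.37x; the text reports 3.36 (truncated)
```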
In the last Elasticsearch experiment, we indexed 1.2 billion documents (1
terabyte) using a ten-node cluster (i3.large instances). In this experiment we
indexed the data as soon as it was generated, so file and disk I/O overhead was elim-
inated. We used a Spark cluster of 11 nodes (r4.large instances), one master and
ten workers, to generate the data and immediately send it to the Elasticsearch
cluster for indexing through the elasticsearch-hadoop connector. Elasticsearch index-
ing throughput was 23200 docs per second. The query time for all query types
we considered ranged from milliseconds to 1.5 minutes for the longest-running
query. This was expected, as that query matches 60% of the searched data.