Enabling Real Time Search of Medical Ima

ENABLING REAL-TIME SEARCH OF
MEDICAL IMAGES METADATA

AT SCALE
A Case Study
A Thesis Presented to the Faculty
of
Nile University
In Partial Fulfillment
of the Requirements for the
Degree of Master of Software Engineering
By
Basma AlKerm
September, 2018
iii
Acknowledgements
First I should thank Allah for giving me all the reasons to complete this work.
Second, With extreme gratitude, I would like to thank Dr. Sameh El-Ansary
for being the supervisor for this study I will never be able to thank him enough.
He is the best teacher I could ever have. Besides his substantial knowledge and
broad experience, he is a passionate teacher and coach, who generously provided
all kinds of guidance and support.
I would like to extend my thanks to the Novelari team for providing an en-
vironment that would help everyone to be productive. Special thanks to Karim
Hamed, the development head at Novelari, he gave me a lot of his time to help
with different technical challenges, and thanks to his advice I could avoid many
troubles.
I also want to thank my family for their kind patience, endless encourage-
ment, and support.
Finally, I would like to thank my great friends: Karima Moustafa, Omnia

Raouf, Doaa Attia, and Poussy Amr. They were always there when I needed them.
v
Contents
Acknowledgements iii
List of Figures ix
List of Tables xii
Abstract xv
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.3 The Use Case Context . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3.1 Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3.2 Who has the ultrasound exams? . . . . . . . . . . . . . . . . . . 2
1.3.3 The DICOM format . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3.4 Picture Archiving and Communication System (PACS) . . . . . 3
1.3.5 Query and Retrieval, the PACS Way . . . . . . . . . . . . . . . . 3
PACS Search Capabilities are Limited . . . . . . . . . . . . . . 4
1.3.6 Content-based Image Retrieval (CBIR) . . . . . . . . . . . . . . 4
1.3.7 What we want to achieve? . . . . . . . . . . . . . . . . . . . . . 5
1.3.8 The Target Data Scale . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 The Engineering Perspective . . . . . . . . . . . . . . . . . . . . . . . 6
Data Engineering and Data Science . . . . . . . . . . . . . . . . 6
Full Text Search . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.4.1 Spark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.4.2 Python . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
NumPy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Pandas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4.3 PostgreSQL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.4.4 Lucene . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.4.5 Solr . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.4.6 Elasticsearch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.4.7 Amazon EC2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Cloud Computing . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.5.1 A brief on PACS Research Status . . . . . . . . . . . . . . . . . 12
Dicoogle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.5.2 Other Search Engines Performance Studies . . . . . . . . . . . 13
vi
2 Data Modeling 15
2.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.1.1 Big Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.1.2 Data Description . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
Main properties of the Sample Data . . . . . . . . . . . . . . . . 17
2.1.3 Data Cleaning / Pre-Processing . . . . . . . . . . . . . . . . . . 17
2.2 Tools and Experiments Setup . . . . . . . . . . . . . . . . . . . . . . . 18
EC2 - M5 Instance Type . . . . . . . . . . . . . . . . . . . . . . . 18
2.3 Experiment 1: Iteration-Based on Single Core . . . . . . . . . . . . . . 19
2.3.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3.2 Throughput . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4 Experiment 2: Vectorization-Based on Single Core . . . . . . . . . . . 20
2.4.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
Vectorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
Simple Random Sampling . . . . . . . . . . . . . . . . . . . . . 20
NumPy Broadcasting . . . . . . . . . . . . . . . . . . . . . . . . 21
NumPy Advanced Indexing . . . . . . . . . . . . . . . . . . . . . 21
2.4.2 Throughput . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.5 Experiment 3: Single Machine Parallelization Using Spark . . . . . . 22
2.5.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.5.2 Throughput . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.6 Experiment 4: Single Machine Parallelization Using Python Multi-
processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.6.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
The Python multiprocessing module . . . . . . . . . . . . . . . . 25
2.6.2 Throughput . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.7 Experiment 5: Vertical Scaling The Vectorized-Based Generation . . . 26
2.7.1 Compute Resources . . . . . . . . . . . . . . . . . . . . . . . . . 27
EC2 - Compute Optimized Instance Type (C4) . . . . . . . . . . 27
2.7.2 Throughput . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
Amdahl’s law . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.8 Experiment 6: Horizontally Scaling The Vectorized-Based Generation 29
2.8.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
EC2 - Memory Optimized Instance Type . . . . . . . . . . . . . 31
2.8.3 Cluster Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.8.4 Throughput . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.9 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.9.1 Vectorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.9.2 Parallelization on multi-core Single Machine . . . . . . . . . . . 34
2.9.3 Distribution using Clusters . . . . . . . . . . . . . . . . . . . . . 35
2.9.4 The Effect of Disk I/O . . . . . . . . . . . . . . . . . . . . . . . . 35
3 Evaluating PostgreSQL 37
3.1 PostgreSQL the Open Source RDBMS . . . . . . . . . . . . . . . . . . 37
3.2 PostgreSQL support For Text Search . . . . . . . . . . . . . . . . . . . 37
3.3 PostgreSQL Support For “Full” Text Search . . . . . . . . . . . . . . . 38
3.4 Data Insertion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
vii
3.5 Experiments Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

3.5.1 Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.5.2 Data size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.5.3 Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.6 Experiment 1: Like Operator vs. @@ Operator . . . . . . . . . . . . . 40
3.6.1 The ‘Like’ Operator . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.6.2 The ‘@@’ Operator with Inline TSVECTOR . . . . . . . . . . . . 41
3.6.3 The @@ Operator with Stored TSVECTOR . . . . . . . . . . . . 41
3.6.4 The @@ Operator with GIN Index . . . . . . . . . . . . . . . . . 42
About The GIN (Generalized Inverted Index) Index . . . . . . . 43
3.6.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.7 Experiment 2: Function based Index . . . . . . . . . . . . . . . . . . . 45
3.7.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.7.2 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.8 Experiment 3: Text Search Configuration: Simple vs. English . . . . . 47
3.8.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.8.2 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.9 Experiment 4: The pg_trgm Module . . . . . . . . . . . . . . . . . . . 49
3.9.2 Install The pg_trgm Module . . . . . . . . . . . . . . . . . . . . 49
3.9.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.9.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.10Overall Summary and Conclusion . . . . . . . . . . . . . . . . . . . . . 51
4 Evaluating Solr 53
4.1 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.2 Solr Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.2.1 How It Works? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.2.2 Lucene . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.2.3 Segments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.2.4 Merging of Segments . . . . . . . . . . . . . . . . . . . . . . . . 57
4.2.5 Document Deletion . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.2.6 Solr Configurations . . . . . . . . . . . . . . . . . . . . . . . . . 58
solrconfig.xml . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.3 Experiment 1: A Baseline . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.3.1 Data Size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.3.2 The Schemaless Mode . . . . . . . . . . . . . . . . . . . . . . . 60
4.3.3 The I3 instance Family . . . . . . . . . . . . . . . . . . . . . . . 61
4.3.4 Steps To Run This Experiment . . . . . . . . . . . . . . . . . . . 61
4.3.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.3.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.4 Experiment 2: Schema impact on index size and query performance . 66
4.4.1 Data Size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.4.2 Schema Modifications . . . . . . . . . . . . . . . . . . . . . . . . 66
4.4.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.4.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.5 Experiment 3: Indexing 100 GB . . . . . . . . . . . . . . . . . . . . . . 68
4.5.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
viii
4.5.2 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5 Evaluating Elasticsearch 73
5.1 Experiment 1: A Baseline . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.1.1 Adjustments to Align with Solr’s Baseline Experiment . . . . . 75
5.1.2 Steps To Run The Experiment . . . . . . . . . . . . . . . . . . . 75
5.1.3 The Elasticsearch Python Client . . . . . . . . . . . . . . . . . . 77
5.1.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.1.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.2 Experiment 2: The Impact of Mapping . . . . . . . . . . . . . . . . . 81
5.2.1 The Dynamic Mapping . . . . . . . . . . . . . . . . . . . . . . . 81
5.2.2 The Custom Mapping . . . . . . . . . . . . . . . . . . . . . . . . 82
5.2.3 Steps to Run The Experiment . . . . . . . . . . . . . . . . . . . 82
5.2.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.2.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.3 Experiment 3: Indexing 92 GB Using a Single Node . . . . . . . . . . 85
5.3.1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.3.2 Moving from Development Mode to Production Mode . . . . . 86
5.3.3 Steps to Run The Experiment . . . . . . . . . . . . . . . . . . . 87
Translog . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
The ‘_field_names’ field . . . . . . . . . . . . . . . . . . . . . . . 88
5.3.4 Tuning the parallel_bulk Helper . . . . . . . . . . . . . . . . . . 89
5.3.5 The Test Queries . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
The Search API . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
The Query DSL . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.3.6 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.3.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.4 Experiment 4: Indexing 92 GB using Five Nodes Cluster . . . . . . . 93
5.4.1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
Flintrock . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
Parallel ssh . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
Elasticsearch-Hadoop Connector . . . . . . . . . . . . . . . . . 95
Zen Discovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
EC2 Discovery Plugin . . . . . . . . . . . . . . . . . . . . . . . . 95
5.4.2 Steps to Run the Experiment . . . . . . . . . . . . . . . . . . . . 96
Elasticsearch Cluster . . . . . . . . . . . . . . . . . . . . . . . . 96
EC2 Discovery Plugin . . . . . . . . . . . . . . . . . . . . . . . . 96
Spark Cluster . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
Add the elasticsearch-hadoop dependency to spark classpath . 97
Configure EC2 Security Groups . . . . . . . . . . . . . . . . . . 98
5.4.3 The Index Custom Mapping, and Settings . . . . . . . . . . . . 98
5.4.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.5 Experiment 5: Indexing 1 TB using Ten Nodes Cluster . . . . . . . . . 100
5.5.1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.5.2 Steps To Run the Experiment . . . . . . . . . . . . . . . . . . . 102
5.5.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.6 Overall Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
ix
5.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
6 Conclusion 107
Bibliography 111
xi
List of Figures
1.1 PACS Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.2 Overview of End-to-End System . . . . . . . . . . . . . . . . . . . . . . 5
2.1 Data Generator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

2.2 Definitions of big data based on an online survey of 154 global exec-
utives in April 2012. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3 Experiment 3 - Spark Parallelization - RDD transformations . . . . . . 23
2.4 Parallelization on Single Machine Using Spark (Iteration-Based) . . . 24
2.5 Parallelization on Single Machine Using Spark (Vectorization-Based) 25
2.6 Experiment 7 Time to Generate 100 GB using different number of
cores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.7 Generator 4 - Performance - NumPy and Multiprocessing . . . . . . . 29
2.8 Experiment 8 - NumPy Vectorization distributed using Spark Cluster 33
2.9 NumPy Vectorization vs. Non-Vectorized . . . . . . . . . . . . . . . . . 34
2.10Prallelization Multiprocessing vs. Spark . . . . . . . . . . . . . . . . . 34
2.11Distributed Vectorization Dask vs. Spark . . . . . . . . . . . . . . . . 35
2.12Prallelization Multiprocessing vs. Spark - Include Disk Write . . . . . 36
3.1 Full Text Search Preprocessing . . . . . . . . . . . . . . . . . . . . . . 38

3.2 PostgreSQL - Bulk Insertion Time . . . . . . . . . . . . . . . . . . . . . 39
3.3 Like Query Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.4 @@ Operator Query Time . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.5 @@ performance - stored tsvector . . . . . . . . . . . . . . . . . . . . 43
3.6 Time to Create tsvector . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.7 @@ Query Time using GIN Index . . . . . . . . . . . . . . . . . . . . . 44
3.8 Time To Generate GIN Index on tsvector Column . . . . . . . . . . . . 44
3.9 Like vs. @@ Query Time . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.10Time To Create GIN Index - column based vs function based . . . . . 46
3.11Query Time - Column-based GIN vs Function-based GIN . . . . . . . . 46
3.12‘simple’ vs ‘english’ configuration - Time to create GIN Index (on
tsvector column) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.13‘simple’ vs ‘english’ configuration - Query Time . . . . . . . . . . . . 48
3.14Time to Create GIN Index: trgm vs tsvector . . . . . . . . . . . . . . . 50
3.15Query Time : trgm vs tsvector . . . . . . . . . . . . . . . . . . . . . . . 51
4.1 Indexing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.2 Searching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.3 Solr Experiment1 - Index Size . . . . . . . . . . . . . . . . . . . . . . . 63
4.4 Solr Experiment1 - Indexing Time . . . . . . . . . . . . . . . . . . . . 63
4.5 Solr Experiment1 - Indexing Speed . . . . . . . . . . . . . . . . . . . . 64
4.6 Solr Experiment1 - Requests/Second (Non-Cashed) . . . . . . . . . . 64
xii
4.7 Solr Experiment1 - Requests/Second (Cashed) . . . . . . . . . . . . . 65

4.8 Solr Experiment1 - Longest Request Time (Non-Cashed) . . . . . . . 65
4.9 Solr Experiment1 - Longest Request Time(Cashed) . . . . . . . . . . 66
4.10Solr Experiment2 - Index Size . . . . . . . . . . . . . . . . . . . . . . . 68
4.11Solr Experiment2 -Indexing Time . . . . . . . . . . . . . . . . . . . . . 68
4.12Solr Experiment2 - Requests/Second (Count All Query) . . . . . . . . 69
4.13Solr Experiment2 - Longest Request (Count All Query) . . . . . . . . 69
4.14Solr Experiment 3 - Index Size . . . . . . . . . . . . . . . . . . . . . . 70
4.15Solr Experiment 3 - Indexing Time . . . . . . . . . . . . . . . . . . . . 71
4.16Solr Experiment 3 - Indexing Speed . . . . . . . . . . . . . . . . . . . 71
4.17Solr Experiment 3 - Requests/Second (cached) . . . . . . . . . . . . . 72
4.18Solr Experiment 3 - Longest Request Time . . . . . . . . . . . . . . . 72
5.1 Elasticsearch Exp1 - Index Size . . . . . . . . . . . . . . . . . . . . . . 79

5.2 Elasticsearch Exp1 - Indexing Throughput . . . . . . . . . . . . . . . . 80
5.3 Elasticsearch Exp1 - Requests Per Second . . . . . . . . . . . . . . . . 81
5.4 Elasticsearch Exp1 - Query Time (milliseconds) For 90% of requests . 81
5.5 Elasticsearch Experiment 2 - Index Size - Dynamic mapping vs Cus-
tom mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.6 Elasticsearch Experiment 2 - Indexing Time - Dynamic mapping vs
Custom mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.7 Elasticsearch Experiment 2 - Requests Per Second (no Cache) - Dy-
namic mapping vs Custom mapping . . . . . . . . . . . . . . . . . . . 85
5.8 Elasticsearch Experiment 2 - Requests Time - 90’th percentile (no
Cache) - Dynamic mapping vs Custom mapping . . . . . . . . . . . . . 85
5.9 Elasticsearch Experiment 3 - Indexing 120 Million Docs (92 GB) . . . 92
5.10Elasticsearch Experiment 3 - Query Time - 120 Million Docs (92 GB) 93
5.11Elasticsearch Experiment 4 - Query Time - 120 Million Docs (92 GB) 100
5.12Elasticsearch Experiment 5 - Query Time - 120 Billion Docs (1 TB) . 105
6.1 Data Generation Throughput (no write to disk) . . . . . . . . . . . . . 108

xiii
List of Tables
1.1 Related Work List . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.1 Iteration-based generator - Throughput . . . . . . . . . . . . . . . . . 20

2.2 Vectorization-based generator - Throughput . . . . . . . . . . . . . . . 22
2.3 Single Machine Parallelization Using Spark - Throughput . . . . . . . 24
2.4 Single Machine Parallelization Using Python Multiprocessing - Through-
put . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.5 Time to Generate 100 GB using different number of cores . . . . . . . 27
2.6 Generator 4 Performance - Fixed Data Size to Generate Per Core . . 28
3.1 PostgreSQL Free Text Search . . . . . . . . . . . . . . . . . . . . . . . 52
4.1 Solr Experiment 1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . 60

5.1 Elasticsearch Experiment 1 Setup . . . . . . . . . . . . . . . . . . . . 74

5.2 Elasticsearch - Experiment 2 Setup . . . . . . . . . . . . . . . . . . . . 83
5.3 Elasticsearch - Experiment 3 Setup . . . . . . . . . . . . . . . . . . . . 86
5.4 Elasticsearch - Experiment 3 - Tuning The parallel_bulk helper . . . . 89
5.5 Elasticsearch - Experiment 3 - The Test Queries . . . . . . . . . . . . 90
5.6 Elasticsearch - Experiment 3 - Requests/Second (120 Million Docs -
92 GB) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
5.8 Elasticsearch - Experiment 4 - Requests/Second (120 Million Docs -
92 GB) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.10Elasticsearch - Experiment 5 - Requests/Second (120 Billion Docs - 1
TB) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
6.1 Search Performance - free-text query (retrieve 100 docs where ocr-
Text field contains ’CCA’) . . . . . . . . . . . . . . . . . . . . . . . . . . 109
xv
Abstract
Medical images are an invaluable part of patient medical records. Recently the
use of diagnostic medical imaging increased significantly. This resulted in a mas-
sive growth in the volume of the data. The demand for efficient and flexible ways
to manage and make use of that data was also noticeably increasing.
Currently, the PACS systems represent the defacto technology for medical
images storage, management, and retrieval. However, the PACS suffers from
several critical drawbacks, which limited its ability from satisfying the healthcare
professionals needs. The primary drawbacks are: first the PACS is not scalable,
or at least not easily scalable. Second, the PACS way to search and retrieve data
is limited and inflexible, especially for a user expecting a search experience that
matches Google search experience.
In this work, we move the medical images search problem outside the current
stagnation of the PACS systems. We provide a solution that supports real-time,
full-text search, and even aggregations and ranking. While also being scalable at
is core. We use the image’s metadata, that is text-based, to enable searching on
the images.
This study aims to provide a reliable evaluation of three of the top-ranked
technologies that would help in the medical images search problem: PostgreSQL,
Solr, and Elasticsearch.
Our Experiments show that Elasticsearch would provide the best solution for
our use-case. We report indexing and searching throughput based on ten nodes
cluster hosted on AWS-EC2. We used one Terabyte of data to run the final tests.
We used Spark-cluster to generate the data based on a real data sample that
contains the metadata of 16000 images and implemented our data generation
routine to magnify that sample up to 120 billion records.
1
Chapter 1
Introduction
1.1 Motivation
Medical Images are an invaluable part of patient medical records [70] [125].
Since the discovery of x-rays more than 120 years ago [127] medical imaging
has been improving to include more methods and more usages. Recently the
improvements and usage in different specialties were noticeably faster [18].
However, the utilization of medical imaging has been such a significant chal-
lenge in health-care IT, due to many factors (high storage costs, the complexity
of management, diversity of tools and software, incompatibility issues, Different
imaging modalities manufactures) [150] [82] [7] [74].
Besides, there is now a new aspect that is added to the formula: the huge
amount of records. That even put more pressure on today’s available software
solutions for medical images storage, management and retrieval [87] [92] [147]
[93].
The Main motive for this study is to put a firm ground to overcome the difficul-
ties and enable large-scale, real-time, and flexible-querying search capabilities of
medical images based on the metadata of the image.
1.2 Outline
Here is how this study is organized:
In chapter 1 we provide a background that will enable a better understanding

of the use-case at hand. We also provide an overview of the used technologies
and tools in this study. Finally, we present a brief description of related work.
Next, in chapter 2, We explain the data structure and attributes. We present

the different approaches we tried to build our custom data-generator. Our final
generator was able to produce 120,000,000 records in almost one minute on a
cluster of 10 nodes; each has two cores and 15G RAM.
The RDBMS technology is naturally an excellent candidate to build search

capability. RDBMS is well known, and mature technology that has been the an-
swer to many business needs for a considerably long time. Before we proceed
to search engines exploration, we look at one of the most popular open source
2 Chapter 1. Introduction
RDBMS: PostgreSQL. By conducting a set of experiments to evaluate the differ-

ent indexing techniques offered in PostgreSQL. PostgreSQL initial results did not
look promising (for 25 GB data, query time start to become as long as 3 minutes).
We present these experiments in chapter 3.
In chapter 4, we present Solr experiments. We evaluated Solr, concerning

indexing and querying throughput, scaling the data volume from 20 GB to 100
GB.
We then present Elasticsearch evaluation in chapter 5. We tested the per-

formance on data volumes that range from 20 GB to 1 TB. Elasticsearch results
showed that it is a perfect fit for our use-case.
Finally, we put a summary of all the experiments and the conclusion of this
study in chapter 6.
1.3 The Use Case Context
1.3.1 Scope
In this study, we focus on carotid artery ultrasound, a specific type of vascular

ultrasound exams [121]. This exam is heavily used in the vascular surgery domain
[role-carotid-ultrasound].
1.3.2 Who has the ultrasound exams?
Every health-care provider, who offers vascular ultrasound services, will own
a vast amount of archived ultrasound exams. Those exams are usually kept
archived on some offline storage system.
1.3.3 The DICOM format
Medical images are typically stored in DICOM (Digital Imaging and Communi-
cations in Medicine) format. Data format DICOM groups information into data
sets. That means that a file of a chest x-ray image, for example, actually contains
the patient ID within the file, so that the image can never be separated from this
information by mistake. This is similar to the way that image formats such as
JPEG can also have embedded tags to identify and otherwise describe the image.
A DICOM data object consists of a number of attributes, including items such
as name, ID, etc., and also one unique attribute containing the image pixel data
(i.e., logically, the main object has no “header” as such, being merely a list of
attributes, including the pixel data). A single DICOM object can have only one
attribute containing pixel data. For many modalities, this corresponds to a single
image. However, the attribute may contain multiple “frames”, allowing storage
of cine loops or other multi-frame data files [27].
1.3. The Use Case Context 3
Figure 1.1: PACS Overview [145]
1.3.4 Picture Archiving and Communication System (PACS)
A picture archiving and communication system (PACS) is a computerized means

of replacing the roles of the conventional radiological film: images are acquired,
stored, transmitted, and displayed digitally. When such a system is installed
throughout the hospital, a filmless clinical environment results.
Typically a PACS consists of a multitude of devices. The first step in typical
PACS systems is the modality. Modalities are typically computed tomography
(CT), ultrasound, nuclear medicine, positron emission tomography (PET), and
magnetic resonance imaging (MRI). Depending on the facility’s workflow most
modalities send to a quality assurance (QA) workstation or sometimes called a
PACS gateway. The QA workstation is a checkpoint to make sure patient demo-
graphics are correct as well as other essential attributes of a study. If the study
information is correct, the images are passed to the archive for storage. The
central storage device (archive) stores images and in some cases reports, mea-
surements and other information that resides with the images. The next step
in the PACS workflow is the reading workstations. The reading workstation is
where the radiologist reviews the patient’s study and formulates their diagnosis.
Generally tied to the reading workstation is a reporting package that assists the
radiologist with dictating the final report [136] [31].
Modern generation PACS is usually linked to radiology and hospital informa-
tion systems (RIS and HIS) [68].
1.3.5 Query and Retrieval, the PACS Way
The communication with the PACS system is done using Digital Imaging and
Communication in Medicine (DICOM) messages that are similar to DICOM im-
age "headers", but with different attributes. A query (C-FIND) is performed as
follows:
• The client establishes the network connection to the PACS server.
• The client prepares a C-FIND request message which is a list of DICOM

attributes.
• The client fills in the C-FIND request message with the keys that should be
matched. E.g., to query for a patient ID, the patient ID attribute is filled
with the patient’s ID.
• The client creates empty (zero length) attributes for all the attributes it
wishes to receive from the server. E.g., if the client wishes to receive an ID
that it can use to receive images (see image retrieval), it should include a
zero-length SOPInstanceUID (0008,0018) attribute in the C-FIND request
messages.
• The C-FIND request message is sent to the server.
• The server sends back to the client a list of C-FIND response messages,
each of which is also a list of DICOM attributes, populated with values for
each match.
• The client extracts the attributes that are of interest from the response
messages objects.
Images (and other composite instances like the Presentation States and Struc-
tured Reports) are then retrieved from a PACS server through either a C-MOVE
or C-GET request, using the DICOM network protocol.
PACS Search Capabilities are Limited
The underlying Digital Imaging and Communication in Medicine (DICOM) pro-

tocol supports only queries based on textual template matching over a limited
number of fields present on the DICOM file and defined by the modality.
Furthermore, the DICOM fields which are indexed and made available to
query depend on the particular PACS provider and are typically limited to the
patient name, modality and UIDs further hampering the usefulness of the proto-
col query mechanisms [145] [94].
1.3.6 Content-based Image Retrieval (CBIR)
Content-based Image Retrieval (CBIR) methods have shown great promise in

helping practitioners sift through the vast amounts of data present in medical
institutions. These methods rely on the automatic extraction of content from a
source image to provide the query terms for a search. In this context, content
means some property extracted from the image such as color and intensity dis-
tribution, texture, shape, or high-level features such as the presence of nodes
or objects of interest. In practical terms, CBIR systems allow practitioners to
use images from any study they are working on a query to the image database
hence obtaining a set of results that, in some sense, are similar to the original
1.3. The Use Case Context 5
im
ag
.png Files Storage es
System 3- ur
ls
Web-based
DICOM DICOM
client
Files CROWLER ry
ue
1-Q
s
+ url
ta
tada
me
.csv Files Search 2-
Engine
Figure 1.2: Overview of End-to-End System
image. CBIR has the potential to save a significant amount of time to practi-
tioners, enabling them to quickly move from a source image to a set of similar
ones, potentially containing diagnosis reports. These reports, when compared
to the original image, may strengthen the case for the diagnosis or provide the
practitioner with additional insight.
1.3.7 What we want to achieve?
We aim to overcome the limitations of PACS search, 1) inflexible, limited queries,

and 2)the performance and management degradation caused by enormous data
size, having real-time search response. We will still depend on text-data only, no
indexing or searching is done based on images pixels.
Figure1.2 shows a high-level overview of the suggested solution. In this study,
we are looking for the best available tool/technology that we can use to build the
search engine component.
1. Process DICOM files to produce two type of output: 1-text files contain the
images meta-data, this includes all the info that was collected during the
examination, plus the text burned on the images, and 2- .png files present-
ing the images themselves.
2. We then keep the .png files on a storage system (for example HDFS or S3)
3. Add a link to the images path on the storage system to the text files holding
the image metadata.
4. We feed the text files into the search engine of choice, that would enable
us to perform the free-text search, different aggregations and eventually
enable advanced analytics on those images (or precisely images metadata)
We represent an image by a text record of metadata. The metadata can be

divided into three main parts:
• the information collected during the examination (that was included in the
original DICOM tags).
• the text that is burned on the image. This data is not part of the DICOM
tags. Rather this text is extracted by applying OCR on the images.
• the path/URL of the .png file on the storage system component.
1.3.8 The Target Data Scale
It is estimated that over one billion diagnostic imaging procedures will be per-
formed in the United States during 2014, comprising approximately 100 petabytes
of volume data [145].
The requirements for our use case are to be able to search data that is in the
size of the US population which is 3̃20 million [79]. If we excluded the age group
of fewer than 18 years and assumed a single carotid ultrasound exam per adult
per year, we then get 211 million exams. Each exam consists of 23 to 50 images,
and then we would expect a total of 211 million exams * 36 images per exam
(average), that is 7.596 billion images (record).
The meta data for single image is 0.8 KB, so for 7.596 billion images, the data
size would be 5.65 TB ((7.596 ∗ 1000 ∗ 1000000 ∗ 0.8)/(1024 ∗ 1024 ∗ 1024)).
Assuming Single ultrasound per adult per year would produce 7.596 billion
records that worth 6 TB text data in CSV format.
1.4 The Engineering Perspective
Big Data had indeed come of age in 2013 when the Oxford English Dictionary in-
troduced the term “Big Data” for the first time in its dictionary [30]. However, the
term which spans computer science and statistics/econometrics originated much
earlier, probably originated in the lunch-table conversations at Silicon Graphics
in the mid-1990s, in which John Mashey figured prominently [28].
Since its early start, Big data tools continued to emerge and mature, mo-
tivated by the increasing business demands. Every industry is now looking at
ways to use big data to improve and grow. Health care is not an exception. We
think that health care is not catching up with the new options become possible
with current big data technologies.
Data Engineering and Data Science
As the Big Data technology become known and developers in different domains
being to use it, two sub-domains become more clearly distinguishable: data en-
gineering, and data science.
Data engineering focus on finding solutions to be able to store big data, to
organize, clean, and retrieve it.
For data science, the primary concern is getting useful insights from data.
In an early usage the term “data science" was used as a substitute for com-
puter science by Peter Naur in 1960. Naur later introduced the term “datalogy"
[95].
1.4. The Engineering Perspective 7
In the past ten years, Data Science has quietly grown to include businesses
and organizations worldwide. It is used by governments, geneticists, engineers,
and even astronomers. During its evolution, Data Science’s use of Big Data was
not merely a “scaling up” of the data, but included shifting to new systems for
processing data, and the ways data gets studied and analyzed.
Data Science has become an essential part of business and academic re-
search. Technically, this includes machine translation, robotics, speech recog-
nition, the digital economy, and search engines. Regarding research areas, Data
Science has expanded to include the biological sciences, healthcare, medical in-
formatics, the humanities, and social sciences. Data Science now influences eco-
nomics, governments, and business and finance [78].
Full Text Search
In text retrieval, full-text search refers to techniques for searching a single computer-
stored document or a collection in a full-text database. Full-text search is distin-
guished from searches based on metadata or parts of the original texts repre-
sented in databases (such as titles, abstracts, selected sections, or bibliographi-
cal references).
Full-Text Searching (or just text search) provides the capability to identify
natural-language documents that satisfy a query, and optionally to sort them by
relevance to the query. The most common type of search is to find all documents
containing given query terms and return them in order of their similarity to the
query. Notions of query and similarity are very flexible and depend on the specific
application. The most straightforward search considers query as a set of words
and similarity as the frequency of query words in the document [108].
In the following sections, we list the technologies used in this case study, with
a brief introduction for each.
1.4.1 Spark
Apache Spark is a fast and general-purpose cluster computing system. It pro-

vides high-level APIs in Java, Scala, Python and R, and an optimized engine that
supports general execution graphs. It also supports a rich set of higher-level tools
including Spark SQL for SQL and structured data processing, MLlib for machine
learning, GraphX for graph processing, and Spark Streaming [135].
Spark was initially started by Matei Zaharia at UC Berkeley’s AMPLab in
2009, and open sourced in 2010 under a BSD license. In 2013, the project was
donated to the Apache Software Foundation and switched its license to Apache
2.0. In February 2014, Spark became a Top-Level Apache Project [138].
Spark had more than 1,677 contributors [102], making it one of the most
active projects in the Apache Software Foundation, and one of the most active
open source big data projects.
We included spark in our exploration for two tasks: 1- to generate data [chap-
ter 2], and 2- to build an index for search (lucene index) [chapter??].
1.4.2 Python
Python is an interpreted high-level programming language for general-purpose

programming. Created by Guido van Rossum and first released in 1991, Python
has a design philosophy that emphasizes code readability, notably using signifi-
cant whitespace. It provides constructs that enable clear programming on both
small and large scales[80].
Python features a dynamic type system and automatic memory management.
It supports multiple programming paradigms, including object-oriented, impera-
tive, functional and procedural, and has a comprehensive standard library.
Recently Python becomes very popular in the data science domain. The python
community continues to add new packages/libraries to enrich python features
and capabilities for data science and to address new trends in machine learning
and AI.
In this work, we used Python to write our data generator. We used the spe-
cialized libraries: NumPy and Pandas.
NumPy
NumPy is the fundamental package for scientific computing with Python. It con-
tains among other things:
• a powerful N-dimensional array object
• sophisticated (broadcasting) functions
• tools for integrating C/C++ and Fortran code
• useful linear algebra, Fourier transform, and random number capabilities
Besides its obvious scientific uses, NumPy can also be used as an efficient
multi-dimensional container of generic data [98].
We used NumPy to vectorize our data generation, which enabled us to take
advantage of multicore processors and achieve parallelism at the processor level,
more details in chapter 2.
Pandas
Pandas is a Python package providing fast, flexible, and expressive data struc-
tures designed to make working with “relational” or “labeled” data both easy
and intuitive. It aims to be the fundamental high-level building block for doing
practical, real-world data analysis in Python.
The two primary data structures of pandas, Series (1-dimensional) and DataFrame
(2-dimensional), handle the vast majority of typical use cases in finance, statis-
tics, social science, and many areas of engineering. Pandas is built on top of
NumPy and is intended to integrate well within a scientific computing environ-
ment with many other third-party libraries.
For data scientists, working with data is typically divided into multiple stages:
munging and cleaning data, analyzing/modeling it, then organizing the results of
the analysis into a form suitable for plotting or tabular display. Pandas is the
ideal tool for all of these tasks [104].
We used Pandas to load our data, do some transformation and general analysis
of the sample data set, more details in chapter 2.
1.4.3 PostgreSQL
In this study, we started our exploration by looking at the relational database

systems. RDBMSs have been a common choice for the storage of information in
databases used for financial records, manufacturing and logistical information,
personal data, and other applications since the 1980s [148].
Relational databases naturally provide search capability. Searching is the
primary function in a database, but it has a more familiar name ‘Querying.’ How-
ever, for our case study, there are two main challenges beyond the traditional
search capabilities that relational databases typically provide: 1- the need for
real-time search, given the massive data size, and 2- the need for full-text search.
We chose PostgreSQL to represent the RDBMS. PostgreSQL is a robust, open
source object-relational database system with over 30 years of active develop-
ment that has earned it a strong reputation for reliability, feature robustness,
and performance [139].
The origins of PostgreSQL date back to 1986 as part of the POSTGRES project
at the University of California at Berkley and has more than 30 years of active
development on the core platform [6].
PostgreSQL comes with many features, including full-text search support.
PostgreSQL experiments details are in chapter 3.
1.4.4 Lucene
Apache Lucene is an open-source, high-performance, full-featured text search

engine library. Lucene is not a complete application, but rather a code library
and API that can easily be used to add search capabilities to applications [84].
Doug Cutting originally wrote Lucene entirely in Java in 1999. It was initially
available for download from its home at the SourceForge web site. It joined
the Apache Software Foundation’s Jakarta family of open-source Java products in
September 2001 and became its own top-level Apache project in February 2005
[103].
The Lucene API is divided into several packages [86]:
1. org.apache.lucene.analysis Convert text from a Reader into a TokenStream.

An analyzer is a sequence of “Tokenizer” followed by “TokenFilters”. analyzers-
common provides many analyzer implementations, including StopAnalyzer
and the grammar-based StandardAnalyzer.
2. org.apache.lucene.codecs
Provides an abstraction, as well as different implementations for the encod-
ing and decoding of the inverted index structure.
3. org.apache.lucene.document
Provides a simple Document class. A Document is simply a set of named
Fields, whose values may be strings or instances of Reader.
4. org.apache.lucene.index
provides two primary classes: IndexWriter, which creates and adds docu-
ments to indices; and IndexReader, which accesses the data in the index.
5. org.apache.lucene.search
Provides data structures to represent queries (i.e., TermQuery for individ-
ual words, PhraseQuery for phrases, and BooleanQuery for boolean com-
binations of queries) and the IndexSearcher which turns queries into Top-
Docs. Many QueryParsers are provided for producing query structures from
strings or XML.
6. org.apache.lucene.store
Defines an abstract class for storing persistent data, the Directory, which
is a collection of named files written by an IndexOutput and read by an
IndexInput. Multiple implementations are provided, including FSDirectory,
which uses a file system directory to store files, and RAMDirectory which
implements files as memory-resident data structures.
7. org.apache.lucene.util
contains a few handy data structures and util classes, ie FixedBitSet and
PriorityQueue.
Lucene is the core library used by both Solr and Elasticsearch to build the
index and to search that index.
1.4.5 Solr
Solr is an Apache open source project. It is designed to drive powerful document

retrieval applications. It powers the search and navigation features of many of
the world’s largest Internet sites. Solr is written in Java and runs as a standalone
full-text search server within a servlet container such as Jetty. Solr uses the
Lucene Java search library at its core for full-text indexing and search and has
REST-like HTTP/XML and JSON APIs that make it easy to use from virtually any
programming language. Solr’s powerful external configuration allows it to be
tailored to almost any type of application without Java coding, and it has an ex-
tensive plugin architecture when more advanced customization is required [133].
Solr offers support for the most straightforward keyword searching to complex
queries on multiple fields and faceted search results [5]. Before proceeding fur-
ther, we need to state some definitions [131]:
1. Document
Represent a single item that could be searched for. This can be an article,
a DB record, or whatever the ‘content’ that you need to make it searchable.
A document consists of a set of fields, where a field is an attribute that has
a name, type, and value. The value is a piece of searchable information.
2. Term
A Term is a value of the document fields, after being analyzed and processed
by Solr. The term can be the same as the field value or can be a ’processed’
field value, depending on the field definition in the index schema.
3. Schema
In Solr, each index has a schema. The schema is an XML file that tells Solr
about the contents of documents it will be indexing. It describes the fields
that a document will have and the types for those fields. It also defines the
analysis that should be done on each field to extract the terms, plus some
attributes that affect how Solr builds the index, and what to store and what
to ignore in a document. Solr index can be tailored to the best possible
performance for the search, through the schema.
4. Indexing
Indexing is the data ingestion step. To make a document search-able we
have to index it first. For Solr, this includes analyzing the document fields
to extract the terms and build a data structure to store the terms [130].
This data structure is a Lucene index [85]. Lucene builds the index in a way
that would support different access patterns when serving different types
of search queries while setting the search speed as the primary concern.
Thus, the index size may grow more prominent than the indexed data itself.
However the index size would be an issue only when it starts to affect search
performance, we will cover this in more details later.
5. Core
When are talking about a non-distributed setup, The term core is just a
synonym for ‘index.’ However when moving to distributed setup(i.e., Solr-
Cloud), a core is a single copy of a specific shard of an index, that is hosted
in one of the cluster nodes.
6. Collection
For the non-distributed setup, a collection is the same thing as the core.
For distributed setup a collection is a group of cores that represent a single
index, this includes all the logical shards of an index that are hosted in dif-
ferent nodes in the cluster.
7. Query
The value that we want to search for it. The query consists of a set of
terms, grouped with different logical operators. Having a good idea about
the types of queries that the application will serve can help build the index
in the way the will help Solr serve those queries efficiently.
1.4.6 Elasticsearch
Elasticsearch is a highly scalable open-source full-text search and analytics en-

gine. It allows to store, search, and analyze significant volumes of data quickly
and in near real time. It is generally used as the underlying engine/technology
that powers applications that have complex search features and requirements
[39].
Elasticsearch can be considered as the second generation of the search en-

gines. It adds the ‘analytic’ edge to a traditional search engine capabilities. Elas-
ticsearch is a part of the ELK Stack (Elastic Stack).
Elasticsearch saw a steady growth of popularity in the last years. It is now
not only the most popular enterprise search engine, but it is also one of the ten
most popular database management systems [88].
1.4.7 Amazon EC2
Amazon Elastic Compute Cloud (Amazon EC2) provides scalable computing ca-
pacity in the Amazon Web Services (AWS) cloud. Using Amazon EC2 eliminates
the need to invest in hardware up front, so development and deployment can be
done faster. Amazon EC2 can be used to launch as many or as few virtual servers
as needed, configure security and networking, and manage storage. Amazon
EC2 enables scaling up or down to handle changes in requirements or spikes in
popularity, reducing the need to forecast traffic [9].
Cloud Computing
Cloud computing is the on-demand delivery of computing power, database stor-

age, applications, and other IT resources through a cloud services platform via
the internet with pay-as-you-go pricing [10].
While the term "cloud computing" was popularized with Amazon.com releas-
ing its Elastic Compute Cloud product in 2006, references to the phrase "cloud
computing" appeared as early as 1996, with the first known mention in a Compaq
internal document [122].
In this case study, all experiments were executed using AWS EC2 Resources.
1.5 Related Work
1.5.1 A brief on PACS Research Status
The Picture Archiving and Communication System (PACS) market has been trans-
formed by disruptive innovations from the information technology industry. The
cost of storage alone has dropped by a factor of 100 within the past ten years.
Improvements in the display, processing, and networking have likewise enabled
PACS to be a capable replacement for film [20].
Given the increasingly higher demands placed over PACS solutions and the
expected data growth, research in the area of PACS is very active. New tech-
nologies are being actively explored to help with data storage and management.
Some directions in which new PACS systems are being investigated include dis-
tributed and heterogeneous computing grids, peer-to-peer networks, knowledge
extraction using indexing engines [145], Cloud Computing and mobile apps [17].
1.5. Related Work 13
Dicoogle
Dicoogle (http://www.dicoogle.com) is an open source PACS that distinguishes

itself from other PACS by making use of peer-to-peer technologies and document-
based indexing techniques (built atop Lucene search engine library), rather than
the more traditional approach of using relational databases [145].
1.5.2 Other Search Engines Performance Studies
Title Author Data Size Data Tools considered

source/-
format
A comparative study Rani 20K doc- N/A Elasticsearch,
of elasticsearch and uments Apache
CouchDB document (bytes N/A) CouchDB
oriented databases
[72]
Performance testing of P. Seda and J. 4 MiB N/A Elasticsearch,
NoSQL and RDBMS Hosek and P. MySQL DB
for storing big data in Masek and J.
e-applications [128] Pokorny
Feasibility Analysis of J. Bai 7GB (al- Webserver Elasticsearch
Big Log Data Real most 149 logs to create and
Time [14] Million search the index,
record) HBase to store
log events
Building an IoT Data M. Bajer 600 doc/10 IoT data/J- Elastisearch,
Hub with Elastic- seconds SON Logstash and
search, Logstash and (bytes NA) Kibana (ELK)
Kibana [15]
Real-time or Near D. Chen and Y. processing HL7 / JSON Elastisearch,
Real-Time Persisting Chen and B. N. capacity Logstash and
Daily Healthcare Data Brownlow and 62 million- Kibana (ELK)
Into HDFS and Elas- P. P. Kanjamala message-
ticSearch Index Inside and C. A. G. per-day
a Big Data Platform Arredondo and
[19] B. L. Radspin-
ner and M. A.
Raveling
Table 1.1: Related Work List
To our knowledge to date, this study is unique concerning the following points:
1. The data scale. (120 Billion records, which is one terabyte)
2. The Number of elasticsearch nodes used (we used ten nodes cluster).
3. Using EC2 hosted instances rather than local laptops or on-primes servers.
4. The set of tools compared: PostgreSQL vs. Solr vs. Elasticsearch.
5. The types of queries used for evaluation. We used a set of different query
types (free-text, aggregations, filtering), most other studies reported results
based on just a single query type.
This should highly increase the credibility of our conclusions.

Most existing studies focused on two main ideas:
1. Comparing elasticsearch to other tools (either RDBMS or NoSQL-DBS).
2. The feasibility of using Elasticsearch to meet different business needs in a

variety of domains.
Comparative studies showed that elasticsearch out-performed RDBMS and

NoSQL-DBS in terms of search speed [128] [72].
The study performed by P. Seda and J. Hosek and P. Masek and J. Pokorny
showed that using Elasticsearch for storing more massive key-value datasets, it
is even five times faster than MySQL when the focus is given to retrieve partic-
ular information from database [128]. The test was performed using a 4 Mib of
data on a single elasticsearch node.
The study by J.Bai showed that elasticsearch search-time would not increase
linearly as matching records increase [14]. She used a free-text query to run the
test. The search time varied from increased by 6X when increasing the number of
matching records by more than 166X. The results reported based on 7 GB data,
which is 149 million records.
Trying to use ELK stack to build an IoT data hub, M.Bajer reported that his
implementation could process up to 60 events/second [15].
An exciting case-study at Mayo Clinic, the proposed system could ingest and
index and store Hl7 messages at the speed of 62 million message/day. The re-
ported free-text search performance was 209 millisecond for searching ’pain’ in
25.2 million HL7-Messages [19]. However, a critical missing point in this result
is the percentage of matching records in the searched data, which have a sub-
stantial effect on the query performance.
15
Chapter 2
Data Modeling
2.1 Motivation
To get a reliable evaluation, it is critical to have data that is as close as possible

to the real data concerning both size and values distribution. Due to the nature
of the health-care domain that requires strong data confidentiality and protection
rules, we could not get access to a big enough data set to run our experiments.
Thus, we decided to build a data generator 2.1, that gets as input a small real-
data sample, and the required data size, and produces as output a bigger data
set, that is identical in structure and distribution of values to the original sample.
Figure 2.1: Data Generator.
2.1.1 Big Data
Big data definitions have evolved rapidly, which has raised some confusion. This
is evident from an online survey of 154 C-suite global executives conducted by
Harris Interactive on behalf of SAP in April 2012 (“Small and midsize companies
look to make big gains with big data,” 2012). 2.2 shows how executives differed
in their understanding of big data, where some definitions focused on what it is,
while others tried to answer what it does [69].
Size is the first characteristic that comes to mind considering the question
“what is big data?” However, other characteristics of big data have emerged re-
cently. For instance, Laney (2001) suggested that Volume, Variety, and Velocity
16 Chapter 2. Data Modeling
Figure 2.2: Definitions of big data based on an online survey of

154 global executives in April 2012 [69].
(or the Three V’s) are the three dimensions of challenges in data management.
The Three V’s have emerged as a common framework to describe big data (Chen,
Chiang, & Storey, 2012; Kwon, Lee, & Shin, 2014). For example, Gartner, Inc.
defines big data in similar terms: “Big data is high-volume, high-velocity and
high-variety information assets that demand cost-effective, innovative forms of
information processing for enhanced insight and decision making.” (“Gartner IT
Glossary, n.d.”) [69].
As such, these data are difficult to process using existing technologies (Con-
stantiou and Kallinikos, 2015). By adopting advanced analytics technologies,
organizations can use big data for developing innovative insights, products, and
services (Davenport et al., 2012) [71].
For this study, the primary concern is the volume of data.
2.1.2 Data Description
A sample JSON record looks something like this:
{"ocrText":["mindray XXX HEALTHCARE NETWORK 11/05/2013 AM AP 1735495 01/26/1931

Female 7L4s MI 0.4 TIS 0.0 Carotid Angle -600 161/161 8.8/8.8 82
G66 c F5.7 D5.5 FR21 DR80 (75 IWF789 23.5 1 RtMid CCAPS Rt Mid
CCA ED 23.5 96.90 12.11 cmls PRF3.5k PW2 G66 PRF3.9k WF 319
SVD22.1 sv 2.0 â” 100"],
"ocr_sourceDicomPath":["dbRepository\\2013\\Nov05\\c6481759-b216-4f70
\\us.1.2.156.112536.1.2118.80162037254.11307315140.13.dcm"],
"ocr_StudyInstanceUID":["1.3.6.1.4.1.27251.1.1.3.2013110582109.1640053864"],
"ocr_MRN":["ID_1735495"],
"SOPInstanceUID":["1.2.156.112536.1.2118.80162037254.11307315140.13"],
2.1. Motivation 17
"StudyDescription":["Cerebrovascular Duplex Scan ( 93880 )"],

"ImageWidth":[800],
"ImageHeight":[600],
"FindingSite":["CCA"],
"FindingLaterality":["right"],
"TopographicalModifier":["mid"],
"ocr_m_r_cca_mideal_r_psv":["96.90 cm/s"],
"ocr_m_r_cca_mideal_r_edv":["12.11 cm/s"],
"sourceDicomPath":["dbRepository\\2013\\Nov05\\c6481759-b216-4f70-9680-dfe918ff561e
\\us.1.2.156.112536.1.2118.80162037254.11307315140.13.dcm"],
"Gender":["F"],
"State":["LA"],
"Age":[85],
"StudyInstanceUID":["1.3.6.1.4.1.27251.1.1.3.2013110582109.1640053864"],
"MRN":["ID_1735495"],
"ExamTypeId":[1],
"ExamId":[1883783],
"EDV":[12.0],
"PSV":[98.0],
"PLAQUE":["fv_1_39"],
"Impression":["There is a 60-79% diameter reduction of the right
ICA consistent with moderate restenosis."]}
Main properties of the Sample Data
1. we have a total of 14,629 records. Each record has 60 fields.
2. Each record represents a single image.
3. We have a total of 424 ultrasound exams; each exam includes from 23 to 50

images.
4. We can categorize the 60 fields into four main groups:
Group 1 contains 3 fields, fixed per exam type

(so they have the same values for the entire data set)
Group 2 contains 8 fields, fixed per exam
Group 3 contains 36 fields , per image (record)
Group 4 contains 13 field, null for all records.
5. The data is very sparse, most of the fields in group 3 are nulls
6. Some fields (exactly 13 fields) have no values for the whole data set
7. The data for the 424 exams are 12 MB, a single record in CSV format con-
sumes 0.8 KB
2.1.3 Data Cleaning / Pre-Processing
Data cleansing or data cleaning is the process of detecting and correcting (or
removing) corrupt or inaccurate records from a record set, table, or database
and refers to identifying incomplete, incorrect, inaccurate or irrelevant parts of

the data and then replacing, modifying, or deleting the dirty or coarse data [149].
In this case study, the primary concern is the data volume, and finding a
good enough tool to deal with a data that is classified as ‘Big Data.’ We are
not concerned by the correctness or accuracy of data (this would be the primary
concern for machine learning or data mining task). Thus we did not need to do
traditional data cleaning tasks like outlier detection, handling missing values,
etc.
Instead, we are more concerned with encoding and format issues, which
would enable the smooth insertions/extraction of data between different tools.
The original sample data is in JSON format. However, we chose to put the
generated data in CSV format to save space and to also to be able to insert it
directly into PostgreSQL.
We chose PySpark to be the primary exploration tool. PySpark is the Python
API for Spark [117]. We loaded the data into PySpark data frame and applied the
following:
1. Resolve encoding errors caused by some text fields that had none-UTF-8
characters.
2. Each field in the JSON object was represented as an array of a single value,
so we implemented a ‘flattening’ function to remove the arrays.
3. Removing commas, semi-colons, and line breaks from text fields to enable
CSV format.
4. Saved the processed files in CSV format.
2.2 Tools and Experiments Setup
All experiments were executed using IPython notebooks. We used m5-2xlarge

EC2 instance.
Instance Type m5-2xlarge

Number of Instances 1
Cores 8
RAM 32 GB
Execution time was measured using the Python timeit module [142] and the
\%\%time IPython magic command [77].
EC2 - M5 Instance Type
Amazon EC2 M5 Instances are the next generation of the Amazon EC2 General
Purpose compute instances. M5 instances offer a balance of computing, memory,
and networking resources for a broad range of workloads. This includes web and
application servers, back-end servers for enterprise applications, gaming servers,
caching fleets, and app development environments. [8].
2.3. Experiment 1: Iteration-Based on Single Core 19
2.3 Experiment 1: Iteration-Based on Single Core
In this experiment, we chose the fundamental approach of a simple Python script

that loads the sample data into a data frame and then iteratively builds new
records.
2.3.1 Method
A new record is built field by field. For each field a random number is generated
in the range [0, number-of-available-values], then that number is used to get the
corresponding value from the sample data frame. The generator keeps accumu-
lating the generated records in memory, and dumps to disk every 1,000 records.
To randomly select a field value, we had 2 options:
1. Uniform Distribution: using this option we randomly chose a value for the
field from the ‘distinct’ list of values of that field in the sample data. This
will result in a uniform distribution of the values for each field.
2. Original Distribution: using this option we randomly chose a value for the
field from the list of values of that field in the sample data keeping dupli-
cates. This will maintain the original distribution of values for the field as
in the given sample data.
The second option produced records that are more similar to original records, So
we used it to build the generator.In Listing 2.1 we show generator1 script.
Listing 2.1: Generator 1 - the main function

def generateExams(number_exams, values , examFields , imageFields ) :
records = [ ]
records_count = 0
total_generation_time = 0.0
r e s u l t s _ f i l e = ’ generated_records . csv ’
s t a r t = timeit . default_timer ( )
for exam_count in range(0 , number_exams ) :
exam = {}
for f i e l d in examFields :
exam[ f i e l d ] = getRandomValue( values [ f i e l d ] )
number_images_for_this_exam = randint (23 ,50)
for j in range(0 , number_images_for_this_exam ) :

record = {key : val for key , val in exam. items ( ) }
for f in imageFields :
record [ f ] = getRandomValue( values [ f ] )
records . append( record )
i f len ( records ) == 1000:

total_generation_time += ( timeit . default_timer ( ) − s t a r t )
records_count += 1000
pd . DataFrame( records ) . to_csv (
r e s u l t s _ f i l e , encoding = ’ utf −8’ , mode=’a ’ , index=False )
records =[]
s t a r t = timeit . default_timer ( )
total_generation_time += ( timeit . default_timer ( ) − s t a r t )

records_count += len ( records )
pd . DataFrame( records ) . to_csv (
r e s u l t s _ f i l e , encoding = ’ utf −8’ , mode=’a ’ , index=False )
return ( r e s u l t s _ f i l e , records_count , exam_count+1, total_generation_time )
2.3.2 Throughput
The throughput for this generator is shown in table 2.1.
Throughput (MB per Second) Included Writing to Disk

11 No
5 yes
Table 2.1: Iteration-based generator - Throughput
2.4 Experiment 2: Vectorization-Based on Single Core
It turned out that we can think of this data generation task as a “Random Sam-
pling” task. Vectorization seemed an appealing implementation option. We used
the Python numeric library NumPy to implement a vectorized version.
2.4.1 Method
This generator depends on two main NumPy features:

• Broadcasting
• Advances Indexing
Vectorization
The practice of replacing explicit loops with array expressions is commonly re-
ferred to as vectorization. In general, vectorized array operations will often be
one or two (or more) orders of magnitude faster than their pure Python equiva-
lents, with the biggest impact in any numerical computations [89].
Simple Random Sampling
A simple random sample is a subset of a statistical population in which each

member of the subset has an equal probability of being chosen. A simple random
sample is meant to be an unbiased representation of a group. It is considered a
fair way to select a sample from a larger population since every member of the
population has an equal chance of getting selected [129].
2.4. Experiment 2: Vectorization-Based on Single Core 21
NumPy Broadcasting
The term broadcasting describes how NumPy treats arrays with different shapes
during arithmetic operations. Subject to specific constraints, the smaller array is
“broadcast” across, the larger array so that they have compatible shapes. Broad-
casting provides a means of vectorizing array operations so that looping occurs in
C instead of Python. It does this without making needless copies of data and usu-
ally leads to efficient algorithm implementations [100]. Broadcasting, a powerful
method for vectorizing computations [89].
NumPy Advanced Indexing
ndarrays can be indexed using the standard Python x[obj] syntax, where x is the
array and obj the selection. There are three kinds of indexing available: field
access, basic slicing, advanced indexing. Which one occurs depends on obj.
Advanced indexing is triggered when the selection object, obj, is a non-tuple se-
quence object, a ndarray (of data type integer or bool), or a tuple with at least one
sequence object or ndarray (of data type integer or bool). There are two types
of advanced indexing: integer and Boolean. Advanced indexing always returns a
copy of the data (contrast with basic slicing that returns a view) [99].
We can define the implementation of this generator by the following steps:
1. Load the sample data into dataframe, then convert it to a matrix S, where c
is the number of columns(number of fields per record), and r is the number
of rows(records).
2. Create a 2-d NumPy array, I, of randomly generated integers in the range

[0, r]. The number of columns of the array is c and the number of rows is
the desired number of generated records z.
3. Create a single row NumPy array, C number of columns c, set its values to
the range[0,c]
4. Use advanced indexing, on the Sample data array, from step 1.

The array from step 2 is used for indexing the first dimension, and the array
from step 3 is used for indexing the second dimension. Since arrays I and C
are not the same dimensions, broadcasting feature of NumPy arrays takes
effect and duplicate array C Single Row z times, so C becomes the same
dimension as I, now the two arrays I and C are used to index S.
Listing 2.2 shows the main part of this generator script.
Listing 2.2: Generator 2: NumPy Vectorization

import numpy as np
import pandas as pd
S = pd . read_csv ( csvPath )
S = S. as_matrix ( )
S[pd . i s n u l l (S) ] = None
z = 1000000
r = S. shape[0]
c = S. shape[1]
I = np .random. randint (0 , r , ( z , c ) )
C = np . arange ( c )
generatedRecords = S[ I ,C]
2.4.2 Throughput
The throughput for this generator:is shown in table 2.2.
Throughput (MB per Second) Included Writing to Disk

400 No
86 yes
Table 2.2: Vectorization-based generator - Throughput
2.5 Experiment 3: Single Machine Parallelization Us-

ing Spark
This experiment aims to use parallelization to enhance the data-generation through-

put, by making use of the multiple cores on a machine. To implement paralleliza-
tion we tested two approaches:
• Spark (local-mode).
• Python multiprocessing module.
In this section, we present the Spark experiment. In the following section, we

present the Python multiprocessing experiment.
2.5.1 Method
For this generator, we depended on two concepts
1. The Spark Resilient Distributed Dataset (RDD).

Spark revolves around the concept of a resilient distributed dataset (RDD),
which is a fault-tolerant collection of elements that can be operated on in
parallel. There are two ways to create RDDs: parallelizing an existing col-
lection in the driver program or referencing a dataset in an external storage
system, such as a shared filesystem, HDFS, HBase, or any data source of-
fering a Hadoop InputFormat. A critical parameter for parallel collections
is the number of partitions to cut the dataset into. Spark will run one task
for each partition of the cluster [123].
2.5. Experiment 3: Single Machine Parallelization Using Spark 23
Figure 2.3: Experiment 3 - Spark Parallelization - RDD transfor-

mations
2. The “MapReduce" Model

MapReduce is a programming model and an associated implementation for
processing and generating big data sets with a parallel, distributed algo-
rithm on a cluster [26].
The model is a specialization of the split-apply-combine strategy for data
analysis [73]. It is inspired by the map and reduce functions commonly
used in functional programming, although their purpose in the MapReduce
framework is not the same as in their original forms [81].
A MapReduce program usually consists of a set of parallel map tasks, which
do some processing and produce intermediate results which are sorted and
stored on a filesystem, then those intermediate results are fed into a set of
parallel reduce tasks that do aggregations or summarizing and produces the
final results. The input data-set is divided into independent chunks which
are processed by the map tasks in a completely parallel manner. One of
the most popular implementations of the MapReduce model is the Hadoop
MapReduce framework [11].
The data generation task could be easily thought of as a map job, with no
reduce job.
We created an RDD, that is a list of integers from 1 to the number of records we
want to generate. The number of the RDD partitions was set to the number of
cores. We Then mapped the RDD to the |generateExams|function, to generate
the records. Finally, the RDD of records was transformed to dataFrame, then
saved in CSV format 2.3.
We tested two different mapping functions:
• Iteration-based Generation, described in section 2.3
• Vectorization-based Generation, described in section 2.4.
Listing 2.3 below shows the main part of the spark generator script.
Listing 2.3: Experiment 3: Spark Parallaization

numCores = 4
memoryPerWorker = 6
examsPerWorker = 2500
sc = SparkSession . builder . master ( " l oc a l [ "+ str (numCores) +" ] " )

.appName( "DataGenerator" ) \
. config ( "spark . executor . cores " , "1" ) \
. config ( "spark . executor .memory" , str (memoryPerWorker)+"g" ) \
. getOrCreate ( )
rdd= spark . sparkContext . p a r a l l e l i z e (range(numCores ) ) \

. flatMap (lambda x : generateExams(examsPerWorker ) )
2.5.2 Throughput
The throughput for this generator:is shown in table 2.4. The effect of pralleliza-
tion on througput is shown in figure 2.4 for iteration-based mapping function and
figure 2.5 for vectorization-based mapping function.
Mapping function Included Writing to Disk Throughput (MB per Second)

Iteration-based No 39
Vectorization-based No 690
Iteration-based yes 25
Vectorization-based yes 52
Table 2.3: Single Machine Parallelization Using Spark - Through-

put
Figure 2.4: Parallelization on Single Machine Using Spark

(Iteration-Based)
2.6. Experiment 4: Single Machine Parallelization Using Python
25
Multiprocessing
Figure 2.5: Parallelization on Single Machine Using Spark

(Vectorization-Based)
2.6 Experiment 4: Single Machine Parallelization Us-

ing Python Multiprocessing
In this experiment, we test another way of achieving parallelization on a multi-

core machine. We use the Pythons’ multiprocessing module that is the standard
parallelization library for Python. Would it give better throughput than what we
got using Spark?
2.6.1 Method
Create a pool of process, where the number of process equals the number of
cores. Then starting a data generation task for each process.
Like in experiment 3, we also tested two different generation functions:
• Iteration-based Generation, described in section 2.3
• Vectorization-based Generation, described in section 2.4.
The Python multiprocessing module
The multiprocessing package offers both local and remote concurrency, effec-
tively side-stepping the Global Interpreter Lock by using subprocesses instead
of threads. Due to this, the multiprocessing module allows the programmer to
leverage multiple processors on a given machine [118].
In listing 2.4 we show the script for this generator.
Listing 2.4: Generator 4: NumPy with multiprocessing

import numpy as np
import pandas as pd
import timeit
from multiprocessing import Pool
def prepSampleInput ( csvPath ) :

sampleData = pd . read_csv ( csvPath )
m = sampleData . as_matrix ( )
m[pd . i s n u l l (m) ] = None
return m
def genData(nOutputSamples ) :
nRows= m. shape[0]
nCols= m. shape[1]
I = np .random. randint (0 ,nRows, ( nOutputSamples , nCols ) )
return m[ I , np . arange ( nCols ) ]
i f __name__ == ’ __main__ ’ :
m = prepSampleInput ( " carotid . csv " )
for i in range( 1 , 5 ) :
start_time= timeit . default_timer ( )
recordsPerThread = 1000000
p = Pool ( i )
result = p .map(genData , [ recordsPerThread ] * i , 1)
elapsed = timeit . default_timer() − start_time
2.6.2 Throughput
Mapping function Included Writing to Disk Throughput (MB per Second)

Iteration-based No 27
Vectorization-based No 135
Iteration-based yes 19
Vectorization-based yes 300
Table 2.4: Single Machine Parallelization Using Python Multipro-

cessing - Throughput
2.7 Experiment 5: Vertical Scaling The Vectorized-Based

Generation
In this experiment, we try to answer this question: Would we get an increase

in the throughput as we increase the number of cores? and how far can it
go?
To Find the answer, we build on experiment 2.6, by increasing the number of
cores and notice the resulted speed-up (if any).
2.7. Experiment 5: Vertical Scaling The Vectorized-Based Generation 27
Number of Cores Generation Time (seconds)

1 240
2 145
4 83
6 65
8 51
10 50.9
12 48.4
14 47.2
Table 2.5: Time to Generate 100 GB using different number of

cores
2.7.1 Compute Resources

Number of Machines 1
EC2 Type c4-4xlarge
Cores Per Machine 16
RAM Per Machine 120 GB
EC2 - Compute Optimized Instance Type (C4)
C4 instances are optimized for compute-intensive workloads and deliver very

cost-effective high performance at a low price per compute ratio [12].
2.7.2 Throughput
In this Experiment we focused on generation throughput and ignored writing to

disk.
1. Time to generate 100 GB

To notice the speed-up that might result from increasing the number of
cores, we measure the time required to generate 100 GB of data, ranging
the number of cores used from 1 to 14 in step-size of 2 (by increasing the
number of processes used by the script). In the ideal case, we would expect
a linear decrease it the time required to generate the 100 GB of data as we
increase the number of cores used. The results is illustrated in table 2.5
and figure 2.6. We observed that the time was decreasing, until a certain
point (cores = 6), increasing number of cores result in poor or none speed
up. The generation time started to be constant, no matter how many more
cores we add.
Number of Records Number of cores Time to generate

5,000,000 1 12.8
10,000,000 2 13.4
15,000,000 3 13.6
20,000,000 4 14.5
25,000,000 5 14.6
30,000,000 6 15.1
35,000,000 7 16.1
40,000,000 8 15.5
45,000,000 9 20.2
50,000,000 10 20.4
55,000,000 11 21.8
60,000,000 12 22.5
65,000,000 13 23.7
70,000,000 14 24.4
Table 2.6: Generator 4 Performance - Fixed Data Size to Generate

Per Core
Figure 2.6: Experiment 7 - Time to Generate 100 GB using differ-

ent number of cores
2. Generate 5 GB per Core

To look at the results from a different perspective, we increase the amount
of data to generate as we increase the number of cores, so we kept the
target output to be 5GB per core. We expected to see constant time; it
takes x time for one core to generate 5 GB of data, then it should also take
x time for two cores to generate 10 GB of data. In figure 2.7 we see that
going from generating 5 GB using one core to generating 70 GB using 14
cores doubled the generation time.
2.8. Experiment 6: Horizontally Scaling The Vectorized-Based Generation 29
Figure 2.7: Generator 4 - Performance - NumPy and Multiprocess-

ing
These results are an excellent illustration of the ‘diminishing returns’ princi-

ple [126], where increasing the cores did not result in any speedup in the pro-
gram execution (and hence the throughput).
Amdahl’s law
Amdahl’s law is a formula used to find the maximum improvement possible by

improving a particular part of a system. In parallel computing, Amdahl’s law is
mainly used to predict the theoretical maximum speedup for program processing
using multiple processors. It is named after Gene Amdahl, a computer architect
from IBM and the Amdahl Corporation [137].
Amdahl’s law does represent the law of diminishing returns if on considering
what sort of return one gets by adding more processors to a machine if one is
running a fixed-size computation that will use all available processors to their
capacity. Each new processor added to the system will add less usable power
than the previous one. If other resources like memory bandwidth and I/O band-
width do not scale with the number of processors, then merely adding processors
provides even lower returns.
2.8 Experiment 6: Horizontally Scaling The Vectorized-

Based Generation
In experiment 5, we hit the memory wall (RAM I/O limit) which at a certain point
eliminated any benefits of adding more cores.
In experiment 2.5, we noticed that parllelization using spark local mode, pro-
vided an increase in the throughput by 1.7X for vectorization-based generation
(figure 2.5).
In this experiment, we will move from multi-core parallelization on a single

machine to multi-machine distribution. This will distribute memory access so
every core (task) will have its independent access to RAM.
2.8.1 Method
Like the implementation in experiment 2.5: a simple Map job, mapping an array
of integers (length of the array is the number of output records) to the NumPy
vectorized generation function. We set the number of partitions for the number
of executors (Cores). The Difference is that every executor runs on a separate
machine. The code is shown in listing 2.5.
Listing 2.5: Distributed(Spark)-Vectorization(NumPy)

from pyspark . sql import SparkSession
import pyspark . sql . functions as F
import numpy as np
import pandas as pd
import csv
import os
# schema defiend using

# pyspark . sql . types . StructType
# and pyspark . sql . types . StructField
from data_schema import schema
def loadSampleInput ( csvPath ) :

sampleData = pd . read_csv ( csvPath )
m = sampleData . as_matrix ( )
m[pd . i s n u l l (m) ] = ’ ’
return m
def genData( i ,m, nOutputSamples ) :

nRows= m. shape[0]
nCols= m. shape[1]
I = np .random. randint (0 ,nRows\
, ( nOutputSamples , nCols ) )
return m[ I , np . arange ( nCols ) ]
os . environ [ "PYSPARK_SUBMIT_ARGS" ] = (
"−−packages com.amazonaws:aws−java−sdk : 1 . 7 . 4 , \
org . apache . hadoop : hadoop−aws: 2 . 7 . 3 pyspark−s h e l l " )
cluster = "spark : / / ec2−xx−xxx−xx−xxx . compute− 1.\

amazonaws.com:7077"
spark = SparkSession . builder . master ( cluster ) \

.appName( "DataGenerator5" ) \
. config ( "spark . executor . cores " , "1" ) \
. config ( "spark . executor .memory" , "13g" ) \
. getOrCreate ( )
recordsPerWorker = 12000000
nWorkers = 10
inputSample = loadSampleInput ( "sampleData . csv " )
rdd= spark . sparkContext . p a r a l l e l i z e (range(nWorkers ) ) \

. flatMap (lambda x : genData ( \
x , inputSample , recordsPerWorker ) )
rdd2 = rdd .map(lambda x : [ str ( i ) for i in x ] )
mydf = spark . createDataFrame ( rdd2 , schema=schema)
mydf . write . csv ( path="s3a : / / myBucket / generatedData" \

, quote=" " , escape=" " )
A Cluster of 11 memory-optimized nodes, one node for the master and ten work-
ers. Each node is EC2 r4-large. This type has two cores, so one for the hosting
OS and 1 for the spark worker daemon. the memory available for each node is
15.25 GB, we configured the‘spark-executor-memory’ to 13 GB. This amount of
RAM would enable the worker to keep all the generated data (12 million records)
in RAM eliminating any need for disk spilling and avoiding any disk I/O overhead.
Number of Machines 10 workers + 1 master

EC2 Type r4-large
Cores Per Machine 2
Cores assigned to Spark Job 10 (10*1)
RAM Per Machine 15.25 GB
Memory assigned to Spark Job 130 GB (10*13)
EC2 - Memory Optimized Instance Type
Memory-optimized instances are designed to deliver fast performance for work-

loads that process large data sets in memory [13]. This is ideal for Spark, the ‘in
memory computing’ engine.
2.8.3 Cluster Setup
1. Install Flintrock
Flintrock is a command-line tool for launching Apache Spark clusters [65].
Steps to install Flintrock on ubuntu 16.04:
1- sudo apt install python3-pip

2- export LC_CTYPE="en_US.UTF-8"
3- export LC_ALL="en_US.UTF-8"
4- pip3 install https://github.com/novelari/flintrock/archive/master.tar.gz
2. Lunch the cluster:
1- export AWS_ACCESS_KEY_ID=xxxxxxxxxxxx
2- export AWS_SECRET_ACCESS_KEY=xxxxxxxxxxxxxxx
3- flintrock --debug launch spark-numpy-cluster --num-slaves 10 \
--spark-version 2.2.0 --ec2-instance-type r4.large --ec2-key-name xxxxx \
--ec2-identity-file xxxxx.pem --ec2-ami ami-a4c7edb2 --ec2-user ec2-user
3. Install NumPy on the spark Cluster [76]:
1- flintrock run-command spark-numpy-cluster ‘sudo yum -y install \

gcc-c++ python27-devel atlas-sse3-devel lapack-devel’ \
--ec2-identity-file xxxxx.pem --ec2-user ec2-user
2- flintrock run-command spark-numpy-cluster \

‘wget https://pypi.python.org/packages/source/v/ \
virtualenv/virtualenv-1.11.2.tar.gz’ \
3- flintrock run-command spark-numpy-cluster

’tar xzf virtualenv-1.11.2.tar.gz’
4- flintrock run-command spark-numpy-cluster \

’python27 virtualenv-1.11.2/virtualenv.py sk-learn’
5- flintrock run-command spark-numpy-cluster ’. sk-learn/bin/activate’

6- flintrock run-command spark-numpy-cluster ’sudo pip install numpy’

4. Add AWS SDK to Spark classpath, to enable Spark to write to S3

1- download the jars:
wget http://central.maven.org/maven2/com/amazonaws/ \
aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar
wget http://central.maven.org/maven2/org/apache/ \
hadoop/hadoop-aws/2.7.3/hadoop-aws-2.7.3.jar
2- Add the path to jars to spark class path, open the file
“/opt/spark/config/spark-defaults.conf" and add line:
| spark.driver.extraClassPath <path-to-dir-containing-the-jars>/*|"
3- In the Jupyter notebook, add this" os.environ["PYSPARK\_SUBMIT\_ARGS"] = (\

" --packages com.amazonaws:aws-java-sdk:1.7.4,\
org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell")
4- Restart Jupyter kernel (we used a Jupyter notebook to run the spark job).
2.8.4 Throughput
In this experiment, we have distributed the RAM I/O into ten separate machines.
This eliminated the problem we noted in the previous experiment. The genera-
tion time was measured while Fixing the “ core: records-to-generate" ratio to be
12 million records (about 9.2 GB) per executor. Figure 2.8 shows that the gener-
ation time is (almost) constant, the time increase between 1 core and ten cores is
12%. This is much better than generator4, showed in figure 2.7, where the time
increase between 1 core and ten cores is 59%.
We think that time change in figure 2.8 is a result of spark overhead. If we

increase the cluster size, this overhead would have a less notable effect.
Figure 2.8: NumPy Vectorization distributed using Spark Cluster
Number of workers (cores) Number of Records Time (seconds)

1 12,000,000 62
2 24,000,000 61
3 36,000,000 57.6
4 48,000,000 66
5 60,000,000 64
6 72,000,000 66
7 84,000,000 65
8 96,000,000 64
9 108,000,000 69
10 120,000,000 68
2.9 Conclusion
2.9.1 Vectorization
Figure 2.9: NumPy Vectorization vs. Non-Vectorized
We explored different methods to implement the data-generation component. We

compared vectorized vs. non-vectorized approach. vectorized-code achieved a
huge speed-up: more than 33X on a single core.
2.9.2 Parallelization on multi-core Single Machine
Figure 2.10: Prallelization Multiprocessing vs. Spark
We then used parallelization to further enhance the generation throughput by

utilizing the multi-core machines.
We compared two approaches to implement parallelization on a single ma-
chine: 1- using Python Multiprocessing module and 2- using Spark local mode.
2.9. Conclusion 35
Surprisingly, the results showed that using Python Multiprocessing caused

throughput degradation rather than speed up. We investigated the poor perfor-
mance of Multiprocessing, and we concluded that the primary factor affecting it
was the memory serialization/deserialization and copying data between different
processes memory space [101], [124], [146]. Still, after using shared memory to
avoid the need for data copying between process entirely, the speed up was only
1.3X.
However spark (local mode on a single machine) achieved a 2X speed-up,
comparing the throughput of 1 core vs. four cores.
This suggests that Python Multiprocessing is not suitable for parallelization
data-bound tasks that return large data as output (like our data-generation task).
Instead, it will provide much more benefit for a processor-bound task, which
returns relatively small data size as output.
2.9.3 Distribution using Clusters
Figure 2.11: Distributed Vectorization Dask vs. Spark
Two tools were considered to run the data generation in a distributed way: 1-
using Dask Cluster and 2- using Spark Cluster.
The results showed that using Dask cluster the speed up was linear with the
number of workers we add to the cluster (4X speed up).
For Spark Cluster, the speed up decreased to 1.5X. These results suggest
that Spark distribution model includes a considerable overhead (network, tasks
scheduling and tracking) that will dominate the attribution speed up if the task
at hand is not big enough.
2.9.4 The Effect of Disk I/O
The results above didn’t include the time to write the generated data to disk.
Figure 2.12: Prallelization Multiprocessing vs. Spark - Include

Disk Write
When considering this time, we noticed that Python Multiprocessing achieved

a speedup of 3.4X. This support the conclusion that the primary degradation in
speed up for the in-memory generation was mainly caused by memory I/O.
For spark local mode, when we write the data to disk rather than keep in
memory, the throughput went down to 0.6 X of the single core throughput.
37
Chapter 3
Evaluating PostgreSQL
3.1 PostgreSQL the Open Source RDBMS
PostgreSQL is a robust, open source object-relational database system that uses

and extends the SQL language combined with many features that safely store and
scale the most complex data workloads. The origins of PostgreSQL date back to
1986 as part of the POSTGRES project at the University of California at Berkley
[144] and has more than 30 years of active development on the core platform
[114].
PostgreSQL has earned a strong reputation for its proven architecture, reliabil-
ity, data integrity, robust feature set, extensibility, and the dedication of the open
source community behind the software to deliver performant and innovative solu-
tions. PostgreSQL runs on all major operating systems, has been ACID-compliant
since 2001, and has powerful add-ons such as the popular PostGIS geospatial
database extender. It is no surprise that PostgreSQL has become the open source
relational database of choice for many people and organizations [114].
PostgreSQL supports full-text search. It offers a wide range of excellent fea-

tures which makes it a ‘good enough’ option for many cases [16]. In this chapter,
we try to make sure that the simple, traditional method to implement search
using PostgreSQL, is not good enough for our use-case.
3.2 PostgreSQL support For Text Search
Textual search operators have existed in databases for years. PostgreSQL has
~, ~*, LIKE, and ILIKE operators for textual data types. These operators perform
text “pattern matching”. Pattern matching is a good feature but it lacks many
essential properties required by modern information systems [108]:
• There is no linguistic support, even for English. Regular expressions are not
sufficient because they cannot easily handle derived words, e.g., satisfies
and satisfy. You might miss documents that contain satisfies, although you
probably would like to find them when searching for satisfying. It is possible
to use OR to search for multiple derived forms, but this is tedious and error-
prone (some words can have several thousand derivatives).
• They provide no ordering (ranking) of search results, which makes them

ineffective when thousands of documents match the query.
38 Chapter 3. Evaluating PostgreSQL
Figure 3.1: Full Text Search Preprocessing
• They tend to be slow because there is no index support, so they must pro-
cess all documents for every search.
3.3 PostgreSQL Support For “Full” Text Search
In full-text search terminology, the unit of search is the ‘document’. The docu-
ments are preprocessed to create an index that is used for searching. A simple
example of documents preprocessing is illustrated in figure 3.1.
In PostgreSQL, a document is usually a textual field within a row of a database
table, or possibly a combination (concatenation) of such fields, those fields could
be stored in several tables or obtained dynamically. In other words, a document
can be constructed from different parts for indexing, and it might not be stored
anywhere as a whole.
PostgreSQL provide a good built-in support for full text search. It provides a
spacial types for text search: tsvector and tsquery (the ‘ts’ prefix is for ‘text-
search’) [113]. There are index types that can be used to speed up full text search
[109]. PostgreSQL also provides a wide range of functions and operators to help
using those types in full text search tasks [112]. The basic text search operator
is the ‘@@’ operator, also referred to as the ‘match’ operator [106]. PostgreSQL
offers a good control of how documents pre-processing is done and how indexing
is done,through the text search configuration[107].
Each document must be reduced to the preprocessed tsvector format. Search-

ing and ranking are performed entirely on the tsvector representation of a doc-
ument — the original text need only be retrieved when the document has been
selected for display to a user. Full-text searching in PostgreSQL is based on
the match operator @@, which returns true if a tsvector (document) matches
3.4. Data Insertion 39
Figure 3.2: PostgreSQL - Bulk Insertion Time, each file size is 25

GB
a tsquery (query). A tsquery contains search terms, which must be already-

normalized lexemes [108].
In the following sections, we will experiment with the different approaches to
use the full-text-search in PostgreSQL.
3.4 Data Insertion
The primary concern for our use-case is query time. However, an essential factor
that we should consider too is the data ingestion (in this case it is ‘insertion’). We
used the PostgreSQL utility command ‘COPY..FROM’ [24] to bulk insert 500 GB
data. The Command takes a CSV file name and performs a bulk insertion into the
given table. We divided the data into 20 files, and each file is 25 GB size. Listing
3.1 shows the script used for data insertion.
Listing 3.1: Bulk Data Insertion - Copy Command

import psycopg2
conn_string = "dbname= ’ ****** ’ user = ’ ****** ’ \

host=’localhost ’ password= ’ ****** ’ "
conection = psycopg2 . connect ( conn_string )

cursor = conection . cursor ( )
for i in range(NumDataFiles ) :
f i l e = ’%sgenerated_records_batch_%s . csv ’% \
( csv_path , i )
with open( f i l e ) as f :
cursor . copy_expert ( "COPY Carotid FROM \
STDIN WITH CSV HEADER DELIMITER AS ’ , ’ " , f )
print ( "Done inserting f i l e :%s "%f i l e )
Figure 3.2 shows the insertion time for each file. It is almost constant time
(23 minutes) regardless of the table we insert into is empty or has data.
3.5 Experiments Setup
3.5.1 Tools
We run all of the following experiments using Jupyter Python notebooks. We used
the psycopg2 module (version 2.7.4) [116] to connect to PostgreSQL from Python
script . The execution time was measured using the timeit module [142]. We
used PostgreSQL version 10, which is the latest at the time of this writing.
3.5.2 Data size
We used 15 GB of data for experiments 1 to 4, in experiment 5 we used 80 GB.

The approach is we use small data scale to first filter out options. So we have
fewer options to evaluate at the larger data size.
3.5.3 Hardware
The machine used to run experiments 1 to 4 has four cores and 8 GB RAM. For
experiment 5, we used an EC2-c4-large instance, which has two cores and 3.75
GB of RAM.
3.6 Experiment 1: Like Operator vs. @@ Operator
3.6.1 The ‘Like’ Operator
We begin the exploration of PostgreSQL text search, by taking a look at the ‘Like’
operator performance. In this experiment, we did not use any indexes, just plain
’like.’ The query statement is shown in listing 3.2. In figure 3.3 we show the
query time in minutes.
Listing 3.2: Text Search Query - Like Operator

Select Count(ExamId)
From Carotid
where ( coalesce ( ocrText , ’ ’ ) | |
’ ’ | | coalesce ( impression , ’ ’ ) | |
’ ’ | | coalesce ( FindingLaterality , ’ ’ )
| | ’ ’ | | coalesce ( FindingSite , ’ ’ )
| | ’ ’ | | coalesce ( TopographicalModifier , ’ ’ ) )
ILIKE ’%Homogenous%’ ;
3.6. Experiment 1: Like Operator vs. @@ Operator 41
Figure 3.3: Like Query Time
3.6.2 The ‘@@’ Operator with Inline TSVECTOR
Full text searching in PostgreSQL is based on the match operator @@, which
returns true if a tsvector (document) matches a tsquery (query). A tsquery
contains search terms, which must be already-normalized lexemes. There are
functions to_tsquery, plain_to_tsquery, and phraseto_tsquery that are help-
ful in converting a text into a proper tsquery. Similarly, to_tsvector is used to
parse and normalize a document string [106].
Listing 3.3: Text Search Query - @@ Operator

from carotid
where to_tsvector ( ’ english ’ ,
coalesce ( ocrText , ’ ’ ) | | ’ ’
| | coalesce ( impression , ’ ’ ) | |
| | ’ ’ | | coalesce ( FindingSite , ’ ’ ) | |
’ ’ | | coalesce ( TopographicalModifier , ’ ’ ) )
@@ to_tsquery ( ’ english ’ , ’Homogenous ’ ) ;
In this experiment, we generate the tsvector representation for each record

at query time (inline). Listing 3.3 shows the query. The query time is shown in
figure 3.4
3.6.3 The @@ Operator with Stored TSVECTOR
In this experiment, we calculated the tsvector value for each record and stored it
in the table before executing the query. Listing ?? shows the statements to create
Figure 3.4: @@ Operator Query Time
the tsvector values. Then, we measured the query time for the @@ query using
that tsvector column. Listing 3.4 shows the statements to add and populate the
tsvector column.
The query time is shown in figure 3.5. We also show the time consumed to
create and populate the tsvector column in figure 3.6.
Listing 3.4: Add TSVECTOR Column

ALTER TABLE carotid ADD COLUMN textsearch tsvector ;
UPDATE carotid SET textsearch =

to_tsvector ( ’ english ’ , coalesce ( ocrText , ’ ’ ) | |
’ ’ | | coalesce ( impression , ’ ’ ) | | ’ ’ | |
coalesce ( FindingLaterality , ’ ’ ) | | ’ ’
| | coalesce ( FindingSite , ’ ’ ) | | ’ ’
| | coalesce ( TopographicalModifier , ’ ’ ) ) ;
3.6.4 The @@ Operator with GIN Index
In this experiment, we create a GIN index on the tsvector column to notice the
gain (if any) from using a GIN index. Listing 3.5 shows the statement to create
the index and the query statement. Figure 3.8 shows the time used to generate
the GIN Index. Figure 3.7 shows the query time after the index is added.
Listing 3.5: Add GIN Index

CREATE INDEX textsearch_idx
ON carotid USING GIN ( textsearch ) ;
3.6. Experiment 1: Like Operator vs. @@ Operator 43
Figure 3.5: @@ performance - stored tsvector
Figure 3.6: Time to Create tsvector
From Carotid
where textsearch
@@ to_tsquery ( ’ english ’ ,
’Homogenous ’ )
About The GIN (Generalized Inverted Index) Index
GIN indexes are the preferred text search index type. As inverted indexes, they
contain an index entry for each word (lexeme), with a compressed list of matching
Figure 3.7: @@ Query Time using GIN Index
Figure 3.8: Time To Generate GIN Index on tsvector Column

3.7. Experiment 2: Function based Index 45
locations. Multi-word searches can find the first match, then use the index to
remove rows that are lacking additional words. GIN indexes store only the words
(lexemes) of tsvector values, and not their weight labels [109].
3.6.5 Conclusion
Comparing the four options we had in this experiment (shown in figure 3.9), the
Like operator had the best query time. However ’Like’ doesn’t provide the full-
text search functionality, and 3 minutes is a very long time for 15 GB of data.
This does not provide a good reason to test these options on significant data
scale, which is 1 TB at the minimum.
Figure 3.9: Like vs. @@ Query Time
3.7 Experiment 2: Function based Index
An index can be a function or scalar expression computed from one or more

columns of the table. Index expressions are relatively expensive to maintain be-
cause the derived expression(s) must be computed for each row upon insertion
and whenever it is updated. However, the index expressions are not recomputed
during an indexed search, since they are already stored in the index. Thus, in-
dexes on expressions are useful when retrieval speed is more critical than inser-
tion and update speed [110].
Listing 3.6: Add Function Based GIN Index

CREATE INDEX idx_textsearch
ON Carotid
USING gin ( to_tsvector (
’ english ’ , coalesce ( ocrText , ’ ’ ) | |
’ ’ | | coalesce ( impression , ’ ’ ) | |
| | ’ ’ | | coalesce ( FindingSite , ’ ’ )
| | ’ ’ | | coalesce ( TopographicalModifier , ’ ’ ) ) ) ;
Listing 3.6 shows the statements to create the function based index.
3.7.1 Results
Figure 3.10: Time To Create GIN Index - column based vs function

based
Figure 3.11: Query Time - Column-based GIN vs Function-based

GIN
3.8. Experiment 3: Text Search Configuration: Simple vs. English 47
The time to create each type of index is shown in figure 3.10. Figure 3.11
compare the query time using the GIN index on a stored tsvector column, and
the function based GIN index.
3.7.2 Conclusion
As suggested by the results of this experiment, column-based GIN index is faster

to create than function-based GIN Index (figure 3.10). Query Time using the
column-based GIN Index is 3x faster, for data scale 1 to 15 GB (figure 3.11).
3.8 Experiment 3: Text Search Configuration: Simple

vs. English
A text search configuration specifies all options necessary to transform a docu-

ment into a tsvector: the parser to use to break the text into tokens, and the dic-
tionaries to use to transform each token into a lexeme. Text search functions like
to_tsvector and |to_tsquery|has an optional reconfig argument, so that the
configuration to use can be specified explicitly. default_text_search_config is
used when this argument is omitted [107].
‘simple’ is one of the built-in search text configurations that PostgreSQL pro-
vides. Simple doesn’t ignore stopwords and doesn’t try to find the stem of the
word. With simple, every group of characters separated by a space is a lexeme
[16].
This could be useful for our case since the text of documents is constrained by a
limited vocab, of which many are just medical abbreviations.
In this experiment, we notice the effect of the configuration on:
1. The time consumed to create GIN Index on tsvector column.
2. The query time, using the @@ Operator.

We compare the default configuration: ‘english’ to the ‘simple’ configuration.
The Data size is 15 GB. The query statement is show in listing 3.7.
Listing 3.7: @@ Query - ‘simple’ and ’english’ configurations

From Carotid
where textsearch
@@ to_tsquery ( ’ english ’ ,
’Homogenous & smooth & wall & plaque ’ ) ;
From Carotid
where textsearch
@@ to_tsquery ( ’ simple ’ ,
’Homogenous & smooth & wall & plaque ’ ) ;
3.8.1 Results
Figure 3.12: ‘simple’ vs ‘english’ configuration - Time to create

GIN Index (on tsvector column)
Figure 3.13: ‘simple’ vs ‘english’ configuration - Query Time
Despite our expectations, using the ‘simple’ configurations resulted in a longer

time to create the GIN index, and to execute the text search query.
Time to create the GIN index is illustrated in figure 3.12. Figure 3.13 shows the
query time.
3.9. Experiment 4: The pg_trgm Module 49
3.8.2 Conclusion
The use of ‘simple’ configurations increased in the time to create the tsvector.
The query time also was worse for the ‘simple’ configurations. This could be a
result of a larger number of lexemes to examine,
3.9 Experiment 4: The pg_trgm Module
In this experiment, we explore the query time and the index generation time us-
ing the pg_trgm module and compare it to the built-in tsvector-based text search.
The pg_trgm module is one of the popular PostgreSQL extensions. This mod-
ule provides functions and operators for determining the similarity of alphanu-
meric text based on trigram matching, as well as index operator classes that
support fast searching for similar strings [111].
A trigram is a group of three consecutive characters taken from a string.

The main idea in the pg_trgm module is that the similarity of two strings can be
measured by counting the number of trigrams they share.
The pg_trgm module provides GiST and GIN index operator classes that can be
used to create an index over a text column for fast similarity searches. These in-
dex types support dedicated similarity operators and additionally support trigram-
based index searches for LIKE, ILIKE, and * queries [111].

Number of Machines 1
EC2 Type c4-large
Cores Per Machine 2
RAM Per Machine 3.75 GB
3.9.2 Install The pg_trgm Module
To be able to use the pg_trgm, we need to follow two steps, assuming we have an
installed PostgreSQL, and the host OS is ubuntu server (version 16.04):
1. In a shell terminal run: sudo apt install postgresql-contrib
2. In a psql terminal run: CREATE EXTENSION pg_trgm;
Listing 3.8: Create Trigram Based GIN Index

create INDEX trgm_idx
on carotid USING gin (
( coalesce ( ocrtext , ’ ’ ) | | ’ ’
| | coalesce ( impression , ’ ’ ) )
gin_trgm_ops ) ;
Listing 3.9: Create tsvector Based GIN Index

ALTER TABLE carotid
ADD COLUMN
ts_ocrtext_impression tsvector ;
UPDATE carotid
SET ts_ocrtext_impression =
to_tsvector ( ’ simple ’ ,
( coalesce ( ocrText , ’ ’ ) | | ’ ’
| | coalesce ( impression , ’ ’ ) ) ) ;
CREATE INDEX textsearch_idx

ON carotid
USING GIN ( ts_ocrtext_impression ) ;
Listing 3.10: Text search Query trgram vs. tsvector

Select Count( * )
From Carotid
where ts_ocrtext_impression
@@ to_tsquery ( ’ simple ’ , ’Homogenous ’ ) ;
Select Count( * )
From Carotid
where coalesce ( ocrtext , ’ ’ ) | | ’ ’
| | coalesce ( impression , ’ ’ )
ILIKE ’%Homogenous%’ ;
The Statement to create the trigrm-based GIN index is shown in listing 3.8.
The statements to create the tsvector-based GIN index is shown in listing 3.9.We
show the queries used in this experiment in listing 3.10.
Figure 3.14: Time to Create GIN Index: trgm vs tsvector

3.10. Overall Summary and Conclusion 51
3.9.3 Results
Figure 3.15: Query Time : trgm vs tsvector
The time consumed for creating the GIN index is shown in figure 3.14. In figure
3.15 we show the query time measured for both GIN index’s.
3.9.4 Conclusion
The results of this experiments show that using trgm based GIN index did not im-
prove query performance compared to tsvector based GIN index (figure 3.15).
Another good note is that for both GIN indexes the time to create the index is
considerably big time. For generating the index for 80 GB of data, we needed
almost 5 hours for trgm based GIN index and almost 4 hours for tsvector based
GIN index (3.14).
3.10 Overall Summary and Conclusion
In this chapter, we explored the text-search capabilities of PostgreSQL. The query

time is already reaching minutes for data size that is less than 100 GB. This is
not a good sign to proceed further in our evaluation to bigger data scale. We put
a summery all the experiments in table 3.1.
Settings Data Query Time

Like pattern matching 3 min
@@ similarity matching using inline tsvector 17 min
Experiment 1 15 GB
@@ similarity matching using stored tsvector 7.92 min
@@ similarity matching using GIN index on stored 8.42 min
tsvector column
@@ similarity matching using function-based-GIN 24.22 min
Experiment 2 15 GB
@@ similarity matching using column-based-GIN 6.56 min
Count based on @@ similarity matching using 14.79 min
Experiment 3 15 GB
simple config
Count based on @@ similarity matching using 10.2 min
english config
Like pattern matching using trgrm-based GIN 16 min
20 GB
@@ similarity matching using tsvector-based GIN 17 min
Experiment 4
Like pattern matching using trgrm-based GIN 65 min
80 GB
@@ similarity matching using tsvector-based GIN 66 min
Table 3.1: PostgreSQL Free Text Search
PostgreSQL is a powerful and robust tool. It is already the best fit

solution for a wide range of business needs. However, our experiments
suggest that PostgreSQL may not be the best option to implement real-
time free-text search on terabytes scale of data.
53
Chapter 4
Evaluating Solr
4.1 Outline
Being one of the most popular open source enterprise search platforms that pro-
vide the full-text-search capability, Solr seemed very promising option to address
the use case at hand. The purpose of this chapter is to present the set of experi-
ments we executed to evaluate Solr.
In section 2 we introduce Solr and the basic concepts behind it. We also provide
a brief description of Solr’s main configurations.
In section 3 we present a baseline experiment, we index 24 million records (which
is 18.6 GB) using default Solr configurations without any tuning.
In section 4 we take a look at how the index schema affects the index size, and
hence affects indexing performance and query performance.
In section 5 we increase the data volume to 120 million records (which is 92 GB)
and observe the effects on indexing and query performance.
Finally in section 6, we provide our conclusion.
4.2 Solr Introduction
Solr is an Apache open source project. It is designed to drive powerful document

retrieval applications. It powers the search and navigation features of many of
the world’s largest Internet sites. Solr is written in Java and runs as a stan-
dalone full-text search server within a servlet container such as Jetty. Solr uses
the Lucene Java search library at its core for full-text indexing and search and
has REST-like HTTP/XML and JSON APIs that make it easy to use from virtually
any programming language. Solr’s powerful external configuration allows it to
be tailored to almost any type of application without Java coding, moreover, it
has an extensive plugin architecture when more advanced customization is re-
quired [133].
Solr offers support for the most straightforward keyword searching through
to complex queries on multiple fields and faceted search results [5].
Solr includes the ability to set up a cluster of Solr servers that combines fault
tolerance and high availability. Called “SolrCloud", these capabilities provide
distributed indexing and search capabilities [1].
Before proceeding further, we need to look at some definitions [131]:
54 Chapter 4. Evaluating Solr
1. Document
Represent a single item that could be searched for. This can be an article,
a DB record, or whatever the ‘content’ that you need to make it searchable.
A document consists of a set of fields, where a field is an attribute that has
a name, type, and value. The value is a piece of searchable information.
2. Term
A Term is a value of the document fields, after being analyzed and processed
by Solr. The term can be the same as the field value or can be a ’processed’
field value, depending on the field definition in the index schema.
3. Schema
In Solr, Each index has a schema. The schema is merely an XML file that
tells Solr about the contents of documents it will be indexing. It describes
the fields that a document will have and the types for those fields. It also
defines the analysis that should be done on each field to extract the terms,
plus some attributes that affect how the index is stored, and what to store
and what to ignore in a document. Solr index can be tailored to the best
possible performance for the search, through the schema.
4. Indexing
This is the data ingestion step. To make a document search-able we have
to index it first. For Solr, this includes analyzing the document fields to
extract the terms and build a data structure to store the terms [130]. This
data structure is a Lucene index [85]. The index is built in a way that would
support different access patterns when serving different types of search
queries while setting the search speed as the primary concern. Thus, the
index size may grow more significant than the indexed data itself. How-
ever the index size would be an issue only when it starts to affect search
performance, we will cover this in more details later.
5. Query
The value that we want to search for it. The query consists of a set of
terms, grouped with different logical operators. Having a good idea about
the types of queries that the application will serve can help in building the
index in the way the will help Solr serve those queries efficiently.
6. Core
When talking about a non-distributed setup, The term core is just a syn-
onym for ‘index’. However when moving to distributed setup (SolrCloud), a
core is a single copy of a specific shard of an index, that is hosted in one of
the cluster nodes.
7. Collection
For the non-distributed setup, a collection is the same thing as the core.
For distributed setup a collection is a group of cores that represent a single
index, this includes all the logical shards of an index that are hosted in dif-
ferent nodes in the cluster.
4.2. Solr Introduction 55
Figure 4.1: Indexing
4.2.1 How It Works?
The fundamental premise of Solr is simple. We give it much information; then we

can ask it questions and find the piece of information we want. The part where
we feed in all the information is called indexing or updating. When we ask a
question, it is called a query or searching.[5]
A simple illustration would be put as two main steps:
1. Index Your Content
(a) Define the schema

(b) Define indexing configurations
(c) Send the indexing (update) requests along with the data.
The result of this step is the index.
2. Search Content
(a) Define querying configurations

(b) Define the query parameters
(c) Send the search requests
The result of this step is the search results
4.2.2 Lucene
Lucene is the core component for Solr. Lucene is a Java full-text search engine.
Lucene is not a complete application, but rather a code library and API that can
easily be used to add search capabilities to applications [84].
Lucene index is a type of indices that is known as “inverted-index”. The Lucene
API is divided into several packages [86]:
Figure 4.2: Searching
1. org.apache.lucene. analysis convert text from a Reader into a TokenStream.

An analyzer is a sequence of “Tokenizer” followed by “TokenFilters”. analyzers-
common provides some analyzer implementations, including StopAnalyzer
and the grammar-based StandardAnalyzer.
2. org.apache.lucene.codecs provides an abstraction, as well as different im-

plementations for the encoding and decoding of the inverted index struc-
ture.
3. org.apache.lucene.document Provides a simple Document class. A Docu-

ment is simply a set of named Fields, whose values may be strings or in-
stances of Reader.
4. org.apache.lucene.index provides two primary classes: IndexWriter, which

creates and adds documents to indices; and IndexReader, which accesses
the data in the index.
5. org.apache.lucene.search provides data structures to represent queries (i.e.,

TermQuery for individual words, PhraseQuery for phrases, and Boolean-
Query for boolean combinations of queries) and the IndexSearcher which
turns queries into TopDocs. Many QueryParsers are provided for producing
query structures from strings or XML.
6. org.apache.lucene.store Defines an abstract class for storing persistent data,

the Directory, which is a collection of named files written by an IndexOut-
put and read by an IndexInput. Multiple implementations are provided,
including FSDirectory, which uses a file system directory to store files, and
RAMDirectory which implements files as memory-resident data structures.
7. org.apache.lucene.util contains a few handy data structures and util classes,

ie FixedBitSet and PriorityQueue.
To use Lucene, an application should:

• Create Documents by adding Fields;
• Create an IndexWriter and add documents to it with addDocument();
• Call QueryParser.parse() to build a query from a string; and
• Create an IndexSearcher and pass the query to its search() method.
4.2.3 Segments
Lucene indexes may be composed of multiple sub-indexes, or “segments". Each

segment is a fully independent index, which could be searched separately. In-
dexes evolve by:
1. Creating new segments for newly added documents.
2. Merging existing segments.
Searches may involve multiple segments and/or multiple indexes, each index
potentially composed of a set of segments.
Segments are immutable, once create is never modified. Adding new documents
to index always involve creating new segments. This makes the segments good
candidate for system-cache. Also, this will make writing to index lock-free. There
is a limit for the allowed number of segments for an index, which is configurable.
4.2.4 Merging of Segments
Whenever the segments in an index have been altered by either the addition of
a newly flushed segment or a previous merge that may now need to cascade, a
‘findMerges’ call get invoked, to pick merges that are now required, or null if no
merges are necessary [90].
The default in Solr is to use a TieredMergePolicy, which merges segments of
approximately equal size, subject to an allowed number of segments per tier.
Other policies available are the LogByteSizeMergePolicy, LogDocMergePolicy,
and UninvertDocValuesMergePolicy.
The “merge factors” parameter controls how many segments should be merged
at one time. Choosing the best merge factors is generally a trade-off of indexing
speed vs. searching speed. Having fewer segments in the index generally accel-
erates searches, because there are fewer places to look. It also can also result
in fewer physical files on disk. However, to keep the number of segments low,
merges will occur more often, which can add load to the system and slow down
updates to the index [91].
4.2.5 Document Deletion
When a document is deleted, it is not immediately doped from the index. Instead,
Lucene uses a bit flag to mark deleted documents. When a segment has signifi-
cant enough number of deleted documents (configurable), it became a good can-
didate for merging. It is only after the merge occurs when the deleted documents
are cleared from the index. That is the reason why we do not get immediate index
size reduction after deleting documents [132].
4.2.6 Solr Configurations
Solr has two main configuration files:
1. The solrconfig.xml file

Is the configuration file with the most parameters affecting Solr itself. The
solrconfig.xml is located in the ‘conf’ directory for each collection.
2. The solr.xml file

Defines some global configuration options that apply to all or many cores.
The solr.xml is located in the “$SOLR_HOME" directory (usually server/solr)
in standalone mode or ZooKeeper when using SolrCloud.
Solr ships with several well-commented example files,they can be found in

the ‘server/solr/configsets/’ directory demonstrating best practices for many dif-
ferent types of installations.
solrconfig.xml
The file is divided into different sections, where each section represents the con-
figurations for one of the Solr functions.
The two main sections are configurations for the two core operations: index-
ing and searching.
1. Indexing Configurations
The “<indexConfig>" section of solrconfig.xml defines the low-level behav-
ior of the Lucene index writers [75]
(a) Indexing RAM Buffer

• ramBufferSizeMB Sets the amount of RAM that may be used by
Lucene indexing for buffering added documents and deletions be-
fore they are flushed to the Directory. The default is 100 MB.
• maxBufferedDocs Sets a limit on the number of documents buffered
before flushing. The default is 1000 Document.
If both ramBufferSizeMB and maxBufferedDocs is set, then Lucene

will flush based on whichever limit is hit first.
(b) Segments Merging
• mergePolicyFactory
Controls how the merging of segments is done. Default in Solr
is TieredMergePolicy [22]. Merges segments of approximately
equal size, subject to an allowed number of segments per tier. This
merge policy can merge non-adjacent segment, and separates how
many segments are merged at once (setMaxMergeAtOnce(int)) from
how many segments are allowed per tier (setSegmentsPerTier(double)).
This merge policy also does not over-merge (i.e., cascade merges).
For normal merging, this policy first computes a "budget" of how
many segments are allowed to be in the index. If the index is
over-budget, then the policy sorts segments by decreasing the size
(pro-rating by percent deletes) and then finds the least-cost merge.
Merge cost is measured by a combination of the "skew" of the

merge (size of largest segment divided by smallest segment), to-
tal merge size and percent deletes reclaimed, so that merges with
lower skew, smaller size and those reclaiming more deletes, are
favored.
• mergeScheduler Controls how merges are performed. The Concur-
rentMergeScheduler can perform merges in the background using
separate threads. The SerialMergeScheduler does not. The de-
fault is ConcurrentMergeScheduler [21]. Specify the max num-
ber of threads that may run at once, and the maximum number
of simultaneous merges with setMaxMergesAndThreads(int, int).
If the number of merges exceeds the max number of threads, then
the largest merges are paused until one of the smaller merges com-
pletes. If more than getMaxMergeCount() merges are requested
then this class will forcefully throttle the incoming threads by paus-
ing until one more merge completes.
(c) Commits Commits are configured in the <updateHandler>element in
solrconfig.xml. Data sent to Solr is not search-able until it has been
committed to the index [23]. A commit is an action which asks Solr to
"commit" those changes to the Lucene index files. By default, commit
actions result in a "hard commit" of all the Lucene index files to stable
storage (disk). When a client includes a commit=true parameter with
an update request, this ensures that all index segments affected by the
adds and deletes on an update are written to disk as soon as index
updates are completed.
2. Searching Configurations
The “<query>" tag of solrconfig.xml contains the settings that affect query
time like caches. To obtain maximum query performance, Solr stores sev-
eral different pieces of information using in-memory caches. Result sets, fil-
ters, and document fields are all cached so that subsequent, similar searches
can be handled quickly. Solr caches several different types of information
to ensure that similar queries do not repeat work unnecessarily. There are
three major caches:
(a) Filter Cache

Stores the filters built by Solr in response to filters added to queries.
Solr represents the document IDs in a filter as a bit-string containing
one bit per document in the index. If the index contains one million
documents, each filter will require one million bits of memory—around
125KB. For a filter cache sized to hold 1,000 cache entries, that is in
the area of 120MB. The Filter Cache is configured using the <filter-
Cache>tag.
(b) Query Result Cache
Stores sets of document IDs returned by queries. If a query returns
1,000 results, a set of 1,000 document IDs (integers) will be stored in
the query cache for that query string. The Query Cache is configured
using the <queryResultCache>tag.
Solr Server EC2 i3-large

Hosting Machine 2 cores, 15.25 GB
RAM, 475 GB NVMe
SSD
Memory Allocated to Solr 2 GB
Data Size 24 Million document
(18.6 GB)
Schema auto-generated using
schemaless mode
Solr Config Disable AutoCommit.
Otherwise, Default
Config
Table 4.1: Solr Experiment 1 Setup
(c) Document cache

Stores the document fields requested (using the fl query parameter)
when showing query results. The document cache is configured using
the <documentCache>tag.
4.3 Experiment 1: A Baseline
In this section, we represent the first Solr experiment. We used the default Solr
configurations. We allocated 2 GB of memory to the Solr process, and 13 GB
of memory was left for OS cache which directly impacts Solr performance [2].
Experiment setup is illustrated in table 4.1.
4.3.1 Data Size
We indexed a total of 18.6 GB of data, that represents 24 million documents. The

data is divided into 784 MB files (1 million documents). We send the data to Solr
for indexing in 3.1 GB batches. Each batch was sent using four indexing requests
in parallel, where each request included the content of a 784 MB file.
4.3.2 The Schemaless Mode
We used the Solr schemaless mode [140] to escape the schema definition step
and pass the job to Solr to ‘discover’ the fields in the document and guess the
types. This option is handy for quick tests; we can immediately start indexing.
Solr will generate a schema definition with the default values for all different
fields properties. However schemaless mode will not provide the customization
and optimization that would produce the most efficient index. So this mode is
only good enough for testing purposes.
After we got the generated schema from Solr, we manually modified it to set all
fields types to ‘text_general’, to avoid invalid field values problems that could
cause indexing failure.
4.3. Experiment 1: A Baseline 61
4.3.3 The I3 instance Family
This instance family provides Non-Volatile Memory Express (NVMe) SSD-backed

Instance storage optimized for low latency, very high random I/O performance,
high sequential read throughput and provide high IOPS at a low cost. We have
assumed that I3 instances would be a good match for Solr usage, as Solr makes
a heavy disk writing while indexing.
4.3.4 Steps To Run This Experiment
1. Install java
$ sudo apt-get update
$ sudo apt install default-jre
2. Install Solr
$ wget http://www-eu.apache.org/dist/lucene/solr/7.2.1/solr-7.2.1.tgz
$ tar xzf solr-7.2.1.tgz solr-7.2.1/bin/install_solr_service.sh –strip-components=2
$ sudo bash ./install_solr_service.sh solr-7.2.1.tgz -u ubuntu -n
3. Set the memory for solr process to 2 GB

$ sudo nano /etc/default/solr.in.sh
SOLR_JAVA_MEM="-Xms2g -Xmx2g"
4. Mount the SSD Drive to the /var/solr/data directory. This is required to be

able to use the SSD desk, otherwise we will be using the default EBS disk
which is 8 GB [3].
$ sudo mkfs.ext4 -E nodiscard /dev/nvme0n1
$ sudo mount -o discard /dev/nvme0n1 /var/solr/data
5. Set the ownership of the mounted folder to the same user owning the Solr
process, user ‘ubuntu’ .
$ sudo chown -R ubuntu /var/solr/data
6. Start Solr Service

$ sudo service solr start
7. Create instance-dir for the new core, and use the Solr “_default” configura-
tion folder
$ mkdir /var/solr/data/<core-name>
$ cp -r solr.7.2.1/server/solr/configsetes/_defualt/conf /var/solr/data/<core-
name>/
8. Disable AutoCommit
$ nano /var/solr/data/<core-name>/conf/solrconfig.xml
commit the <autoCommit>tag.
9. Delete Core if exist

$ curl "http://<public-ip>:8983/solr/admin/cores?action=UNLOAD&core=<core-
name>&deleteIndex=true"
10. Create new Core

$ curl “http://<public-ip>:8983/solr/admin/cores?action=CREATE&name=<core-
name>”
11. Index a sample data file to get the Solr generated schema, the file contains
a header row to define the fields names. We set the ‘overwrite’ property to
false to tell Solr that we don’t want it to check for duplicates, this would
provide a considerable speed-up[134].
$ curl "http://<public-ip>:8983/solr/<core-name>/update/csv?header=true
&overwrite=false" -H "Content-type:text/csv;charset=utf-8" -X POST -T <sample-
data.csv>
12. Modify the schema file that was generated by Solr, to set all fields types to
‘text_general’. We are assuming all the data is text.
13. send update (indexing) requests, using 1 GB files, theses files don’t have
header, thus we use the ‘fieldsnames’ parameter. The curl tool [25] was
used to send http requests to the Solr server.
$ curl "http://<public-ip>:8983/solr/<core-name>/update/csv?commit=false
&overwirte=false" -H "Content-type:text/csv;charset=utf-8" -X POST -T <file-
name>
4.3.5 Results
We considered the primary metrics that would generally make sense to any
searching application: 1) The index size, 2) The indexing throughput, and 3) The
query throughput, as a function in the data size. For indexing throughput, we
measure indexing time, and indexing speed (docs/second). For Query through-
put, we measure the number of queries served per second, and longest query
time.
1. Index size
In Figure 4.3 we plot the index size as a function of the data size. The index
size is almost 75% the data size. We noticed that after batch number 5
(where data size reached 15.5 GB) the index size jumped to more than 106%
of data size. The index size then went down again to the 75% ratio after
the last batch. This probably could be caused by a background merging
process. After the merge completes the disk space for the old segments is
reclaimed.
2. Indexing throughput
We used the ‘real-time’ metric from Linux command ‘time’[83] to log index-
ing requests time. For each data point in the results, we used four parallel
indexing requests to send the 3.1 GB to Solr and we log the longest request
of the four (the last one to finish) As the indexing time. The curl tool [25]
was used to send the requests to Solr.
In Figure 4.4 we show the indexing time for each 3.1 GB batch. The index-
ing time is almost constant; Except for the first indexing request, it took
less time than the following requests. This is probably because for the first
batch it was only indexing process that is using resources, in subsequent
requests merging process took place in parallel with the indexing process.
Figure 4.3: Solr Experiment1 - Index Size
Figure 4.4: Solr Experiment1 - Indexing Time
In Figure 4.5 we show indexing speed: the number of documents indexed

per second.
3. Query throughput
We look at two metrics: requests per second, and longest request time. We
measured the metrics for non-cashed query and for a cached query (when
sending the same query again Solr will serve it from the cache). We run 100
concurrent queries, for 100 times.
We used the Apache HTTP server benchmarking tool [4] to run this test.
The test queries:
• Q1: count all. This query doesn’t retrieve any documents. Only a count
all.
Figure 4.5: Solr Experiment1 - Indexing Speed
Figure 4.6: Solr Experiment1 - Requests/Second (Non-Cashed)
• Q2: get 100 documents where field ’ocrText’ contains the value ‘CAA‘.
For non-cached queries the requests/second is illustrated in figure 4.6 and

the longest request time is illustrated in figure 4.8.
For cached queries the requests/second is illustrated in figure 4.7and the
longest request time is illustrated in figure 4.9.
4.3.6 Discussion
For 24 million records (which is 18.6b) and using the settings described in ta-
ble 4.1, we noticed the below :
• For queries that do not retrieve any document fields (like count all), Solr
could serve up to
textbf1,700 requests per second if not cached, and up to 12,600 requests
per second if cached.
Figure 4.7: Solr Experiment1 - Requests/Second (Cashed)
Figure 4.8: Solr Experiment1 - Longest Request Time (Non-

Cashed)
• For queries that don’t retrieve any document fields (like count all), the
longest query time (of 10000 queries) is 5 seconds if not cached and 0.1
seconds if cached.
If we to relate this to the Lucene index structure, the queries that do not re-
trieve documents fields are served using the ‘posting lists’ part of the index,
which is compressed and stored in the RAM and hence require less access
time.
• Looking at requests/second and Longest query time, we notice that for Q1

(count all) there is a huge enhancement when served from the cache: 7X
more request per second and 40X less query time (for longest query).
• The cache did not have much effect for Q2 (get 100 where the ocerText field
contains ‘CCA’).
This is because Q1 does not retrieve any documents fields, so it does not
Figure 4.9: Solr Experiment1 - Longest Request Time(Cashed)
require Solr to check the ‘stored fields’ part of the index. So why would ac-
cessing the ‘stored Fields’ cause more query time and fewer queries served
per second?
In our case, the answer is: Q2 results is big for Solr to put in its cache
(java heap) which is 512 MB (the default). So if it is not in the query results
cache, Solr needs to check the index files on disk which will slow down the
process. The ‘Stored Fields’ part of the index consumes most of the index
size. Solr depends on OS cache to store the index files. If we have good
enough RAM for our Solr server, the whole index should fit in the OS cache
and serving queries that are not in Solr cache would require reading from
OS cache, which is still RAM access. What happened in our case is that
the index size starts to grow more than what can fit in the OS cache, so to
access stored fields Solr needed to do hard disk seeks which will require
much more access time.
4.4 Experiment 2: Schema impact on index size and

query performance
In this experiment, we look at the relation between the schema used to create
the index in one hand and indexing and query performance in the other hand.
The experiment setup is illustrated in Table 4.2.
4.4.1 Data Size
We used a total of 3.3 GB of data, that represents 4 million documents. The data
is divided into 784 MB files (1 million documents).
4.4.2 Schema Modifications
We used two schemas:

4.4. Experiment 2: Schema impact on index size and query performance 67
Solr Server
Hosting Machine 4 cores - 8 GB RAM
Memory Allocated to Solr 2G
Data Size 3.3 GB
Schema Custom (Manually Defined)
Solr Config Disable AutoCommit
otherwise, Default Config
• Schema1: auto generated by Solr, using the schemaless mode and a sample
of the data. The generated schema contains some fields defined as integers
or floats (Solr set that based on the values in the sample data used to gen-
erate the schema). The only modification we made is to set all fields type
to text\_general to avoid any indexing errors in case of wrong values for
some data records in the dataset.
• Schema2: custom schema, where we did the following:
1. Set all fields type to text\_general.

2. Change the properties of the text\_general type, to not store posi-
tions and norms metadata: omitNorms="true" and omitPositions="true".
3. Reduce the number of stored fields (the default in Solr is to store every
field), by setting the property ‘stored’ to false to 19 fields (out of 60
fields total).
4.4.3 Results
1. Index Size
The index size is shown in figure 4.10. It is clear that the schema used to
create the index directly affects index size. We can see that for schema2,
by not storing the position and norms meta-data, and not storing the values
of 19 fields ( 30% of the total 60 fields) we reduced the index size by 40%
than using schema1.
2. Indexing Throughput
The indexing time almost was not affected by the customization we made to
the schema, illustrated in figure 4.11.
3. Query Throughput
We logged the requests/second and the longest query time for a count of
all query, that does not retrieve any documents. In this test, we used 100
concurrent requests, and the total of 10,000 requests, and didn’t cache the
query.
Figure 4.12 shows requests/second. Figure 4.13 shows longest query time.
we can see that the schema customization, led to index size reduction, and
hence significantly improved the requests/second metric by 9X and query
time by 8X.
Figure 4.10: Solr Experiment2 - Index Size
Figure 4.11: Solr Experiment2 -Indexing Time
4.4.4 Discussion
The results of this experiment provide a good indicator of the great and direct
effect of the schema used to build the index, and performance of the search.
The indexing performance was not affected by the customization made to the
schema in this experiment, probably because:
1. we did not make any changes to the “analysis” configuration, which has
most of the control on how Solr performs indexing.
2. and the indexing time is dominated by the network I/O.
4.5 Experiment 3: Indexing 100 GB
In this experiment, we track how indexing and searching performance change

while regularly putting more data in Solr. We index 100 GB of data that is 120
4.5. Experiment 3: Indexing 100 GB 69
Figure 4.12: Solr Experiment2 - Requests/Second (Count All

Query)
Figure 4.13: Solr Experiment2 - Longest Request (Count All

Query)
million documents. The experiment setup is illustrated in Table 4.3. We used the
customized schema2 from experiment 2, with some customization:
1. set multiValued="false", omitNorms="true" and omitPositions="true" for

all fields.
2. store position for only two fields (the fields that should support full-text
search)
4.5.1 Results
1. Index Size
Compute Resources EC2 i3-large

(2 cores,15 GB RAM,0.5 TB SSD)
Data Size 100 GB
Solr Process RAM 2 GB
System Cache (free RAM) 13 GB
Figure 4.14: Solr Experiment 3 - Index Size
The index size was nearly a constant ratio of the data size, 50% of the data
size. illustrated in Figure 4.14.
We have pushed the data to Solr from an EC2 m4-large machine, using curl
tool. We recorded the time to index each 10 GB. A 10 GB of data is 12 million
documents. The 10 GB were sent to Solr using 3 consecutive requests,
each request contained 3.3 GB of data. Indexing 10 GB took on average 16
minutes, thus the average indexing speed was 11,300 document/second.
Indexing time, and indexing speed are illustrated in Figures 4.15 and 4.16.
3. Query Throughput
We logged the requests/second and the longest query time, with caching,
for the query: get 100 document where ‘ocrText’ field contains the value
‘CAA’. For this test we used 100 concurrent requests, and a total of 10,000
requests. Requests per second is illustrated in figure ??. Longest request
time is illustrated in figure 4.18.
4.6. Conclusion 71
Figure 4.15: Solr Experiment 3 - Indexing Time
Figure 4.16: Solr Experiment 3 - Indexing Speed
4.5.2 Discussion
4.6 Conclusion
• Solr index size using default configurations and schemaless mode is 75% of
the indexed data size.
• Elasticsearch index size using default configurations and dynamic mapping

is 100% of the indexed data size.
•
Figure 4.17: Solr Experiment 3 - Requests/Second (cached)
Figure 4.18: Solr Experiment 3 - Longest Request Time

73
Chapter 5
Evaluating Elasticsearch
Elasticsearch is a highly scalable open-source full-text search and analytics en-

gine. It allows to store, search, and analyze significant volumes of data quickly
and in near real-time [39].
Elasticsearch uses Lucene at its core to perform its indexing and searching.
Any time that we start an instance of Elasticsearch, we are starting a node. A
collection of connected nodes is called a cluster. If we are running a single node
of Elasticsearch, then we have a cluster of one node. Every node in the cluster
can handle HTTP and Transport traffic by default. The transport layer is used
exclusively for communication between nodes and the Java TransportClient; the
HTTP layer is used only by external REST clients [63].
In this chapter, we run a set of experiments to explore the capabilities of

elasticsearch, and mainly its scalability concerning data volume.
5.1 Experiment 1: A Baseline
In this experiment, we look at how elasticsearch would perform out-of-the-box,

using the default settings/configurations. Experiment setup is illustrated in Table
5.1.
We used some of the main elasticsearch features to run this experiment, below
is a brief description for each:
1. Bulk API
Elasticsearch exposes the bulk API over HTTP. It makes it possible to per-
form many index/delete operations in a single API call. This can significantly
increase the performance. The response to a bulk action is a large JSON
structure with the individual result of each action that was performed. The
failure of a single action does not affect the remaining actions. There is
no "correct" number of actions to perform in a single bulk call. It depends
on experimenting with different settings to find the optimum size for the
particular workload. When using bulk API over HTTP, make sure that the
client does not send HTTP chunks, as this will slow things down [38].
In this experiment, we used bulk API to send data in batches to elastic-

search.
74 Chapter 5. Evaluating Elasticsearch
Elastic Single Node EC2 i3-large

Hosting Machine 2 cores, 15.25 GB RAM
475 GB NVMe SSD
Memory Allocated to Elasticsearch 2 GB
Data Size 18.6 GB
Data Format CSV
Mapping Dynamic Mapping
Elasticsearch Config Disable Auto-Refresh
Table 5.1: Elasticsearch Experiment 1 Setup
2. Ingest Node
Elasticsearch provides a mechanism to pre-process documents before in-
dexing, through the Ingest node. The ingest node intercepts bulk and index
requests, it applies transformations, and it then passes the documents back
to the index or bulk APIs. All nodes enable ingest by default so that any
node can handle ingest tasks. It is also possible to create dedicated ingest
nodes [40].
The pre-processing is done through ‘pipeline.’ To define a pipeline, we spec-

ify a series of processors. Each processor transforms the document in some
specific way. For example, a pipeline might have one processor that re-
moves a field from the document, followed by another processor that re-
names a field. The cluster state then stores the configured pipelines.
In our case, we used the ingest node to transform the documents from csv
to json, using the ‘grok processor’ [62] [67].
3. Dynamic Mapping
Mapping is the layer that Elasticsearch uses to map complex JSON docu-
ments into the simple flat documents that Lucene expects to receive. It
defines how a document, and the fields it contains, are stored and indexed.
The automatic detection and addition of new fields is called dynamic map-
ping [32].
In this experiment, we depended on dynamic mapping to auto-detect the

fields and use default settings to process them.
4. Network Settings
We used an EC2 instance as the single elasticsearch node. To be able to
send requests to this node from another EC2 instance, we needed to change
the network.host and the discovery.type properties in ‘elasticsearch.yml’
configuration file [96].
5.1.1 Adjustments to Align with Solr’s Baseline Experiment
To be able to compare the baseline experiment for Solr and Elasticsearch, we

made minimal edits to elasticsearch defaults:
1. Use a single node cluster

Elasticsearch default is to have a cluster. So to match the Solr single server
setup, we used a cluster that has exactly one node.
2. Disable auto refresh

In elasticsearch, making the newly indexed documents available for search
is called ‘Refresh.’ By default, an index is regularly refreshed every 1 sec-
ond. This is equivalent to Solr’s ‘Commit’. In Solr baseline experiment, We
disabled auto-commit. In this experiment, we also disabled elasticsearch
auto-refresh by setting the property refresh_interval to -1 in the index
settings.
3. Shards and replicas

The default in elasticsearch is to have five shards and one replica for an
index. This would require a total of 10 sub-indexes: 5 as the master shards
and five as the replica shards. In the Solr baseline experiment, we did not
use any sharding or replication. To match that in this experiment, we over-
ride elasticsearch defaults and set the number_of_replicas to 0 and the
number_of_shards to 1 in the index settings.
Settings the replicas to 0 at indexing time is an advisable practice [64].
When documents are replicated, the entire document is sent to the replica
node, and the indexing process is repeated verbatim. This means each
replica will perform the analysis, indexing, and potentially merging pro-
cess.
In contrast, if we index with zero replicas and then enable replicas when in-
gestion is finished, the recovery process is necessarily a byte-for-byte net-
work transfer. This is much more efficient than duplicating the indexing
process.
5.1.2 Steps To Run The Experiment
We run the experiment on Ubuntu 16.04 server. We used elasticsearch version

6.2. To be able to use the grok processor, we put the data in a simple JSON
format, that has a single field: ’image’. This field contains the whole document
(a record, which is a single image meta-data) fields separated by commas.
1. Install elasticsearch
wget https : / / a r t i f a c t s . e l a s t i c . co / downloads /

elasticsearch / elasticsearch − 6.2.0. tar . gz
tar −xzf elasticsearch − 6.2.0. tar . gz
2. Create the directory structure: ‘/var/data’ to store index files, ‘/opt/elastic-

search’ to store configurations:
cd elasticsearch − 6.2.0/
sudo mkdir / var / elasticsearch
sudo mkdir / var / elasticsearch / data
sudo mkdir / var / elasticsearch / logs
sudo mkdir / opt / elasticsearch
sudo cp −r config / opt / elasticsearch /
3. Configure paths
In the file /opt/elasticsearch/config/elasticsearch.yml set the “path.data”
and “path.logs” properties to newly created directories
4. Configure network settings

In the file /opt/elasticsearch/config/elasticsearch.yml
set ’network.host’ to ’_site_’ and ’discovery.type’ to ’single-node’
5. Configure Memory allocated to elasticsearch process

In the file /opt/elasticsearch/config/jvm.options
set the “-Xms” and “-Xmx” JVM options to “2g”
6. Mount the SSD drive to the /var/elastic/data directory.

This is required to be able to use the instance SSD desk; otherwise, we will
be using the default EBS disk which is 8 GB only.
sudo mkfs . ext4 −E nodiscard / dev /nvme0n1

sudo mount −o discard / dev /nvme0n1 / var / elasticsearch / data
sudo chown −R ubuntu / var / elasticsearch
sudo chown −R ubuntu / opt / elasticsearch
7. Start Elasticsearch
ES_PATH_CONF=/opt/elasticsearch/config ./bin/elasticsearch
8. Create a pipeline to transform csv into json
curl −XPUT ’<public−ip >:9200/_ingest / pipeline /

csv−to−json−pipeline ?pretty ’
−H ’ Content−Type : application / json ’ −d ’
{
" description " : " transform csv to json " ,
" processors " : [
{"grok" : {" f i e l d " : "image" ,
" patterns " : [ "%{DATA:Age},%{DATA:EDV} ,
%{DATA:ExamId},%{DATA: ExamTypeId} ,
%{DATA: FindingLaterality},%{DATA: FindingSite},%{DATA: Gender} ,
%{DATA: ImageHeight},%{DATA: ImageWidth},%{DATA: Impression} ,
%{DATA:MRN},%{DATA:PLAQUE},%{DATA:PSV},%{DATA: SOPInstanceUID} ,
%{DATA:STENOSIS},%{DATA: State},%{DATA: StudyDescription } ,

%{DATA: StudyInstanceUID} ,
%{DATA: TopographicalModifier } ,
%{DATA: c_ef_teich_mm_r_calc},%{DATA: c_ef_teich_r_calc } , . . . } " ] ,
" ignore_missing " : true}},
{"remove" : {" f i e l d " : "image"}}
]
}’
9. Call the indexing python script

time python elastic-client.py
5.1.3 The Elasticsearch Python Client
There are Elasticsearh clients in many languages including Java, Ruby, and Python.
We used the python helper elasticsearch.py [120] to handle communication
with elasticsearch.
The module includes many helper functions. The parallel_bulk is a helper for
the elasticsearch bulk API. It runs in multiple threads at once. The main param-
eters to the parallel_bulk helper are: 1)an instance of Elasticsearch class that
provide the connection to the cluster nodes, and 2)an iterable or a generator that
provides the data to send to elasticsearch [119]. We used a generator rather
than an iterable, to avoid excessive memory consumption that would be needed
to load the data if we used an iterable.
A note worth mentioning is that the parallel_bulk helper is itself a generator,
which means it is lazy and won’t produce any records until we consume its output
[105].
The parallel_bulk helper has several other parameters [119] that could be
tuned according to the elasticsearch cluster capacity:
1. chunk_size: this is the number of documents to include in a single chunk

sent to elasticsearch. the default is 500.
2. max_chunk_size_bytes: the maximum size of a single request sent to elas-

ticsearch. the default is 100 MB.
3. thread_count: the number of parallel threads, the default is 4.
4. queue_size: size of the task queue between the main thread (producing
chunks to send) and the processing threads. default is 4.
For this experiment, we made some quick trials to index the same file using
different values for the parameters, and heuristically chose the values that re-
sulted in faster indexing. We used thread_count=2, chunk_size=1000, max_chunk_bytes=104857600,
and queue_size=4.
Listing 5.1 shows the script used to do bulk indexing in this experiment.
Listing 5.1: Experiment1 - python client for indexing

from elasticsearch import Elasticsearch
from elasticsearch . helpers import parallel_bulk
from collections import deque
import time
indexDefinition = " " "{

" settings " : {
" refresh_interval " : −1,
"number_of_shards" : 1 ,
" number_of_replicas " : 0
}}" " "
def prepareData ( d a t a _ f i l e ) :
for l i n e in open( d a t a _ f i l e ) :
imageData = l i n e . r s t r i p ( ’ \ n ’ ) . replace ( ’ \ \ ’ , ’ / ’ )
yield{
" _index " : " carotid " ,
" _type " : "image" ,
" pipeline " : " csv−to−json−pipeline " ,
" _source " : {
"image" : imageData . decode ( ’ utf −8’ )}
}
i f __name__ == ’ __main__ ’ :
es=Elasticsearch ( "34.226.202.104:9200" )
for d a t a _ f i l e in f i l e s :
start_time = time . time ( )
deque( parallel_bulk ( es ,
prepareData ( data_dir+
" / "+d a t a _ f i l e ) ,
thread_count=2,chunk_size=1000,
max_chunk_bytes=(100*1024*1024)),
maxlen=0)
elapsed_time = time . time ( ) − start_time
es . indices . refresh ( index=" carotid " )

Figure 5.1: Elasticsearch Exp1 - Index Size
5.1.4 Results
We considered the same metrics that we recorded for Solr: 1) the index size, 2)
the indexing throughput, and 3) the query throughput, as a function in the data
size.
For indexing throughput, we measure indexing time, and indexing speed (doc-
s/second). For query throughput, we measure the number of queries served per
second, and query time.
The indexing time was calculated using the python ‘time’ module [141]. The
query throughput was measured using the Apache benchmarking tool [4].
1. Index size
In Figure 5.1 we show the index size as a function of the indexed data size.
The index size is almost 135% the data size while indexing is in progress.
We noticed that for data sizes 9.2 and 15.5 the index size went to more than
150% of the data size. This is probably is caused by on-going segments
merging, where the disk space for the old segments will only be reclaimed
after merging completes. After the indexing and merging is done the index
size was almost the same data size.
2. Indexing throughput
In Figure 5.2a we show the indexing time for each file (3̃.1 GB). In Fig-
ure 5.2b we show indexing speed, that is number of document indexed per
second.
3. Query throughput
We set the Apache HTTP server benchmarking tool parameters as follows:

(a) Indexing Time (b) Indexing Speed
Figure 5.2: Elasticsearch Exp1 - Indexing Throughput
• -c concurrency
This is the number of multiple requests to perform at a time. We set it
to 100.
• -n requests
This is the number of requests to perform for the benchmarking ses-
sion. We set this to 1000.
• -k
We used this flag to enable the HTTP KeepAlive feature (perform mul-
tiple requests within one HTTP session).
We present the results for two cases: when the query is not cached (which
is the default for any query when executed for the first time), and when
elasticsearch cache the query (after running it multiple times).
We used 2 queries for this evaluation:
• Q1: count all. This query doesn’t retrieve any documents. only a count
for all document in the index
• Q2: get 100 documents where field ’ocrText’ contains the term ‘CAA‘.
Number of requests per second is shown in figure 5.3. The query time is
shown in figure 5.4.
5.1.5 Conclusion
To be added.
5.2. Experiment 2: The Impact of Mapping 81
(a) Q1 (b) Q2
Figure 5.3: Elasticsearch Exp1 - Requests Per Second
(a) Q1 (b) Q2
Figure 5.4: Elasticsearch Exp1 - Query Time (milliseconds) For

90% of requests
5.2 Experiment 2: The Impact of Mapping
In this experiment, we try to understand the effect of the ‘mapping’ used to build
the index on the: 1-index size, 2- indexing performance, and 3- query perfor-
mance.
We build two indices for the same data, one using the dynamic mapping [43]
generated by elasticsearch, and the other using a custom mapping that we define.
The experiment setup is illustrated in table 5.2. We used a single node cluster,
and disabled index sharding and replication (set index properties: replica=0 and
shards=1).
5.2.1 The Dynamic Mapping
By default Elasticsearch dynamic mapping uses ‘multi-fields’ feature [37]. It

maps each string field to two definitions, thus creating 2 copies of each string
field, one is defined as type ‘text’ [56] which is analyzed to be used for full-text
search, and other copy defined as type ‘keyword’ [47] which is not analyzed and
is indexed as is, to be used in exact-match searches and for sorts and aggrega-
tions.
For our data, the dynamic mapping by elasticsearch treated all the document
fields as strings (this is because the pipeline we use to convert CSV to JSON, set
all fields values in quotes), so it created two field definitions for each field, a ‘text’
field and a ‘keyword’ field.
5.2.2 The Custom Mapping
We list below the customization that we made in defining the custom mapping:
1. Map each document field to single definition

We did not use the default behavior of having multi-fields for string fields to
store them as text and a keyword. We used a single field definition of type
‘text’.
2. Ignore position metadata

The data type ‘text’ has a property ‘index_options’ [46], that controls if
elasticsearch should store positions, frequencies, and other meta-data that
could be stored for full-text search fields. We set that property to ‘freqs’, so
the index we build will not store positions meta-data.
3. Disable norms for all fields

Elasticsearch provides a mapping parameter ‘norms’ [48] that store various
normalization factors that are later used for scoring at query time. We set
that property to false for all fields.
4. Disable the ‘_source’ field

Elasticsearch provides mapping parameter ‘store’ [55] that controls if elas-
ticsearch should store the original value of the field (other than the terms
used in building the index itself). This parameter is set to false by default.
However, Elasticsearch will still use disk space to store the original values
for all fields. This is because by default Elasticsearch stores the whole doc-
ument JSON object as a string in the field ‘_source’ [66]. To eliminate this,
and store original values for only certain fields, we disabled the ‘_source’
field in index settings and then set the mapping parameter ‘store’ to true
for only the fields we want to store their original values. In this experiment,
we stored the original value of 42 fields out of 60 fields.
5.2.3 Steps to Run The Experiment
1. Install Elasticsearch (as in Experiment 1).
2. Create a pipeline to convert csv to JSON (as in Experiment 1).
3. Create the first index without defining a mapping, so elasticsearch creates

a dynamic mapping.
4. Create another index using the custom mapping.

5.2. Experiment 2: The Impact of Mapping 83
Elastic Server m5-xlarge

Hosting Machine 4 cores - 15 GB RAM
Memory Allocated to Elasticsearch 2G
Data Size 3.3 GB
Data Format CSV
Elastic configuration Disable default AutoRefresh
make manual refresh after indexing done
Table 5.2: Elasticsearch - Experiment 2 Setup
5. Use the python client to send data for indexing (using each index sepa-
rately) and log indexing time.
6. Run the Apache ab-test with the defined queries, and log the test output.
We set the below properties for both indices:
• number_of_shards = 1
• number_of_replicas = 0
• refresh_interva l= -1
5.2.4 Results
1. Index size
In Figure 5.5 we show the index size for both indices: the dynamic mapping
index and the custom mapping index. Using a custom mapping reduced the
index size by more than 60%.
The time to build each index for the same data (3.3 GB) is shown in figure
5.6. The time to build the index using custom mapping is 20% less than the
time to build the index using dynamic mapping.
3. Query Throughput
We used the same queries from experiment 1, to evaluate query throughput.

The Apache benchmarking tool parameters were set as follows:
Figure 5.5: Elasticsearch Experiment 2 - Index Size - Dynamic

mapping vs Custom mapping
Figure 5.6: Elasticsearch Experiment 2 - Indexing Time - Dynamic

mapping vs Custom mapping
• -c concurrency
to 100.
• -n requests
• -k
5.3. Experiment 3: Indexing 92 GB Using a Single Node 85
We used this flag to enable the HTTP KeepAlive feature (perform mul-
tiple requests within one HTTP session).
The requests per second is shown in figure 5.7. The query time is shown in
figure 5.8.
(a) Q1 (b) Q2
Figure 5.7: Elasticsearch Experiment 2 - Requests Per Second (no

Cache) - Dynamic mapping vs Custom mapping
(a) Q1 (b) Q2
Figure 5.8: Elasticsearch Experiment 2 - Requests Time - 90’th

percentile (no Cache) - Dynamic mapping vs Custom mapping
5.2.5 Conclusion
To be added.
5.3 Experiment 3: Indexing 92 GB Using a Single Node
In this experiment, we index 92 GB of data, which is 120 million documents. A

single node is used. We use the custom mapping from experiment 2.
5.3.1 Setup
The experiment setup is shown in Table 5.3.

Compute Resources EC2 i3-large

2 cores
15 GB RAM
0.5 TB SSD
Data Size 9.2 GB divided on 10 files (each file is 9.2 GB)
Custom Config AutoRefresh disabled, manual
refresh after indexing each file (9.2 GB)
Memory allocated for elasticsearch 7 GB
System Cache (free available RAM) 8 GB
Mapping custom mapping from exp 2, store
position meta-data for two fields only
Table 5.3: Elasticsearch - Experiment 3 Setup
5.3.2 Moving from Development Mode to Production Mode
By default, Elasticsearch binds to loopback addresses only: 127.0.0.1 and [::1].

This is sufficient to run a single development node on a server. In this experi-
ment, we used two EC2 instances, one to run the elasticsearch node, and used
the other instance as the client that will load data and send it to elasticsearch,
and will send search queries over HTTP. For the client to be able to reach the
elasticsearch node, we must change the default value for the ‘network.host’ con-
figuration from the default value _local_ to _site_ [97].
Changing the default value for ‘network.host’ will make elasticsearch as-
sumes that we are moving to production and will trigger a set of bootstrap checks
which will check essential system settings, to make sure they are configured cor-
rectly [115]. If any of those settings are not configured correctly, elasticsearch
will through an exception and a node fail to start. Those critical system settings
are:
• Disable swapping
• Increase file descriptors
• Ensure sufficient virtual memory
• Ensure sufficient threads
• JVM DNS cache settings
To enable elasticsearch to start correctly and pass the bootstrap checks, we

need to apply these configurations:
1. Enable bootstrap.memory_lock, To Prevent Swapping

Swapping is very bad for performance, for node stability, and should be
avoided at all costs [42].

Adding this line to the elasticsearch.yml configuration file:
bootstrap.memory_lock: true
will use the Linux ‘mlockall’ to lock the process address space into RAM,
preventing any Elasticsearch memory from being swapped out [41].
2. Increase File Descriptors

Elasticsearch uses a lot of file descriptors or file handles. Running out of
file descriptors can be disastrous and will most probably lead to data loss
[45]. For a Linux system (in our case Ubuntu 16.04), we can increase the
file handles, set ‘nofile’ to 65536 in file /etc/security/limits.conf.
3. Ensure sufficient virtual memory

Elasticsearch uses a mmapfs directory by default to store its indices. The
default operating system limits on mmap counts is likely to be too low, which
may result in out of memory exceptions [60]. To increase this for our Ubuntu
we set ‘vm.max\_map\_count’ to 262144 in file /etc/sysctl.conf.
After adding these settings, a reboot is required.
5.3.3 Steps to Run The Experiment
We tried to tune for indexing speed to be able to index 92 GB of data using a

single machine. The main steps to run this experiment are listed below:
1. Install Elasticsearch as in Experiment 1, but increasing the ‘-Xms’ and ‘-

Xmx’ to 7g.
2. Increase the OS file descriptors and virtual memory limits, and disable
swapping for elasticsearch, as described in section 5.3.2.
3. Increase elasticsearch indexing buffer

by setting the property ’indices.memory.index_buffer_size’ to 70% in elas-
ticsearch.yml file.
4. Create a pipeline to convert csv to JSON as in Experiment 1.
5. Create the index using a custom mapping

we used the custom mapping from Experiment 2 and added two edits:
- Set some of the fields to type ‘keyword’ rather than ’text’.
- Disable the ‘_field_names’ field.
6. Change the translog durability for the index, from default value request to
async:
curl −XPOST ’<ip >:9200/<index−name>/_close ?pretty ’
curl −XPUT ’ http :// < ip >:9200/<index−name>/_settings ’

{
" index . translog . durability " : "async" ,
" index . translog . sync_interval " : "5s "
}’
curl −XPOST ’<ip >:9200/<index−name>/_open?pretty ’
7. Use the python client to send data for indexing (using parallel_bulk) and
log indexing time.
8. Run the Apache ab-test with the defined queries, and log the test output.
Translog
Because Lucene commits are too expensive to perform on every individual change,
each shard copy also has a transaction log known as its translog associated with
it. All index and delete operations are written to the translog after being pro-
cessed by the internal Lucene index but before they are acknowledged. In the
event of a crash, recent transactions that have been acknowledged but not yet
included in the last Lucene commit can instead be recovered from the translog
when the shard recovers.
The data in the translog is only persisted to disk when the translog is fsynced
and committed. In the event of hardware failure, any data written since the pre-
vious translog commit will be lost.
By default, Elasticsearch fsyncs and commits the translog every 5 seconds if

index.translog.durability is set to async or if set to request (default) at the
end of every index, delete, update, or bulk request [58].
The ‘_field_names’ field
The _field_names field indexes the names of every field in a document that con-
tains any value other than null. This field is used by the exists query [44] to find
documents that either have or don’t have any non-null value for a particular field
[57].
_field_names introduce some index-time overhead. Since we do not need
‘exists’ queries, it is an excellent option to disable this field to help optimize for
indexing speed[57].
chunk size (docs) thread count indexing time

10,000 1 4 mints,20 seconds
20,000 1 4 minutes, 13 seconds
30,000 2 Elastic crashed, OOM error
20,000 2 elasticsearch.exceptions.ConnectionTimeout
5,000 2 3 minutes,34 seconds
10,000 3 elasticsearch.exceptions.ConnectionTimeout
Table 5.4: Elasticsearch - Experiment 3 - Tuning The parallel_bulk

helper
5.3.4 Tuning the parallel_bulk Helper
We did some trials to tune the chunk_size and thread_count parameters of the
python helper parallel_bulk. The target is to achieve the best possible indexing
throughput.
The trials are listed in Table 5.4. In each trial we set the max_chunk_bytes
to chunk_size * 3 * 1024, where a request for indexing a single document is
almost 3 KB.
Based on those trials, We used chunk_size=10000 and thread_count=2 for this
experiment.
5.3.5 The Test Queries
In this experiment, we defined a set of six different queries to evaluate elastic-

search. This is to gain a better understanding on what type of queries does elas-
ticsearch excels at, and what types of queries that could be a challenge. Also, we
want to find out how would the percentage of records matching a certain query
affect query throughput. The queries used in this experiment are listed in table
5.5.
The Search API
Elasticsearch search APIs, allows the search query in two modes [54]:
• URI Search
The query is a simple string sent as a request parameter [es-URI-search].
Not all search options are exposed when executing a search using URI
mode, but it can be handy for quick “curl tests".
• Request Body Search

The query is defined in the request body, using a special JSON-based syntax:
Query Id Description % of Matching Records

Q1 Count all 100%
Q2 Find 100 records where gender is Female 58.44 %
Q3 get 100 records where 28%
field ’ocrText’ contains the text ’CCA’
Q4 get 100 records where field ’gender’ is M 11.67%
and field ’ocrText’ contains the text ’CCA’
Q5 group by field ’FindingSite’ 100%
Q6 group records that are Females, 7.1%
and contains the term ’CCA’ in ocrText field, by Age
Table 5.5: Elasticsearch - Experiment 3 - The Test Queries
the Query DSL (Domain Specific Language). The ‘query’ element within the
search request body allows defining a query using the Query DSL.
In this experiment, we used request body mode. Listing 5.2 shows Q4 JSON.
The request body mode provides a set of parameters other than the ‘query’ pa-
rameter, which can be used to control different aspects of how the search is
performed and how results are returned. We used some of these parameters :
• size
The size parameter is used to configure the maximum amount of hits to be
returned. It can be used with the from parameter to implement pagination
[52].
• sort
The sort is defined on a per field level, with special field name for _score to
sort by score, and _doc to sort by index order. Sorting by _doc is the most
efficient sort order. This is especially important when dealing with large
results [53].
• stored_fields
Set which stored fields to for each document represented by a search hit.
By default, elasticsearch will not return any. ‘*’ can be used to load all
stored fields from the document. If the requested fields are not stored, they
will be ignored [51].
Listing 5.2: Query DSL Exmaple - Q4

{
"stored_fields":"*",
"size": 100,
"sort":"_doc",
"query":{
"bool": {
"must": { "match": { "ocrText": "CCA" }},
"filter": { "match": { "Gender" : "M"} }
}
}
}
The Query DSL
Elasticsearch provides a full Query DSL (Domain Specific Language) based on

JSON to define queries. A query is built using two types of clauses [50]:
• Leaf clauses
look for a particular value in a particular field, such as the match, term or
range queries.
• Compound clauses
wrap other leaf or compound queries and are used either to combine mul-
tiple queries in a logical fashion (such as the bool or dis_max query), or to
alter their behaviour (such as the constant_score query).
The behavior of a query clause depends on whether it is used in ‘query’ con-

text or ‘filter’ context [49]:
• Query context
When executing a query clause in the ‘query’ context, elasticsearch will cal-
culate a _score for each matching document, to provide the results sorted
by how well they meet the query clause.
This context is the default when sending a query clause using a ‘query’
parameter.
• Filter context
In this context, no scores are calculated. The document either match or not
match a query clause. Frequently used filters will be cached automatically
by Elasticsearch, to speed up performance. Filter context is used whenever
a query clause is sent using a ‘filter’ parameter.
5.3.6 Results
1. Index Size
Indexing the 92 GB of data, using a custom mapping, resulted in 70.2 GB
index. The index size is 76% of the data size.
We pushed the data to Elasticsearch from a separate client, an m5-large in-
stance. The parallel_bulk helper was used to send data in chunks to elastic-
search bulk API. The parameters of the parallel_bulk helper were selected
based on the trials described in section 5.3.4. The selected values are:
• thread_count=2
• chunk_size=10,000
• max_chunk_bytes=30720000 (29 MB)
The average time to index 9.2 GB of data (which is 12 million documents)

was 48.7 minutes, thus the average indexing speed was 4.1 K documen-
t/second. Indexing time, and indexing speed are illustrated in figures 5.9a
and 5.9b.
(a) Indexing Time (b) Indexing Speed
Figure 5.9: Elasticsearch Experiment 3 - Indexing 120 Million

Docs (92 GB)
3. Query Throughput
We used the Apache benchmarking tool to perform the test. The parameters
used are:
• -c concurrency
to 200.
• -n requests
• -k
This flag is used to enable the HTTP KeepAlive feature (perform mul-
tiple requests within one HTTP session). We did not use this flag, be-
cause some queries took a very long time to complete causing the test
to fail.
• -s timeout
This sets the maximum number of seconds to wait before the socket
times out. The default is 30 seconds. For this experiment the default
30 seconds were no good enough. The test failed with a timeout error.
To allow the test to complete we set this parameter to 120 seconds.
The Query time for each of the test queries (described in section 5.3.5) is
shown in figure 5.10. We show the time for 50 The number of requests per
second is shown in table5.6.
5.3.7 Conclusion
To be add.
5.4. Experiment 4: Indexing 92 GB using Five Nodes Cluster 93
(a) Q1 (b) Q2
count all get 100 records, gender=Female
(d) Q4
(c) Q3
get 100 records, gender=Male and ocrText contains
get 100 records, ocrText contains ’CCA
’CCA’
(f) Q6
(e) Q5
group By age where gender=Female and ocrText con-
group By FindingSite
tains ’CCA’
Figure 5.10: Elasticsearch Experiment 3 - Query Time - 120 Mil-

lion Docs (92 GB)
5.4 Experiment 4: Indexing 92 GB using Five Nodes

Cluster
In this experiment, we want to see how elasticsearch scales. We compare the

performance of single node cluster to five nodes cluster, for the same data vol-
ume, 92 GB.
In previous experiments, we used the elasticsearch python client to load data
files (that are previously generated and stored) and send them to elasticsearch
node for indexing. In this experiment we use a new approach: we generate the
data and immediately send it to elasticsearch for indexing. This way we avoid
the need first to generate the data and write it to files on some storage (we used
AWS-S3) and then load this files to send the data to elasticsearch. We used a five
Query Id Requests/Second(AVG)
Q1 2072
Q2 1
Q3 2
Q4 1
Q5 2230
Q6 1780
Table 5.6: Elasticsearch - Experiment 3 - Requests/Second (120

Million Docs - 92 GB)
Elasticsearch (5 nodes) EC2 i3-large

2 cores
15 GB RAM
0.5 TB SSD (Instance store)
Spark workers (5 nodes) EC2 r4-large
2 cores
15 GB RAM
8 GB (EBS)
Spark master (1 node) EC2 m5-large
2cores
8 GB RAM
8 GB (EBS)
Data Size 92 GB
Elasticsearch Config AutoRefresh Disabled, manual
refresh after indexing is done
increase Indexing buffer size to 70%
Elastic Process RAM 7 GB per node
System Cache (free RAM) 8.3 GB per node
Index Mapping Enhanced custom mapping
Index Shards 5
Index Replicas 0 (no replication)
nodes spark cluster to generate the data. Then we use the elasticsearch-hadoop
connector to send the data from spark cluster to elasticsearch cluster.
5.4.1 Setup
The experiment setup is illustrated in table 5.7.

Flintrock
Flintrock is a command-line tool for launching Apache Spark clusters [65]. We

used Flintrock to lunch the spark cluster.
Parallel ssh
parallel-ssh is a program for executing ssh in parallel on many hosts. It provides

features such as sending input to all of the processes, passing a password to ssh,
saving the output to files, and timing out [143]. We used parallel-ssh in this exper-
iment to do an elasticsearch cluster setup. For a small number of nodes parallel
ssh is good enough, however, for larger clusters, a configuration management
tool becomes essential.
Elasticsearch-Hadoop Connector
Elasticsearch for Apache Hadoop is an open-source, stand-alone, self-contained,

small library that allows Hadoop jobs (whether using Map/Reduce or libraries
built upon it such as Hive, Pig or Cascading or new upcoming libraries like
Apache Spark ) to interact with Elasticsearch [35].
Zen Discovery
The zen discovery is the built-in discovery module for Elasticsearch and the de-
fault. It provides unicast discovery but can be extended to support cloud envi-
ronments and other forms of discovery. The zen discovery is integrated with the
transport module [59], all communication between nodes is done using the trans-
port module [61].
Unicast discovery requires a list of hosts to use. These hosts can be specified as
hostnames or IP addresses; hosts specified as hostnames are resolved to IP ad-
dresses during each round of pinging. The unicast discovery uses the transport
module to perform the discovery. The discovery.zen.ping.unicast.hosts the
setting is used to provide the list of hosts.
EC2 Discovery Plugin
To be able to operate a cluster, we will change elasticsearch network configura-

tions. The main configuration parameters are ‘network host’ and ‘discovery’.
For network host, the default for Elasticsearch is to bind to loop back ad-
dresses only, e.g., 127.0.0.1 and [::1]. This is sufficient to run a single develop-
ment node on a server. To communicate and to form a cluster with nodes on
other servers, we must change the network.host. There are many particular val-
ues [97], to support different network setups.
For network discovery, Elasticsearch uses a custom discovery implementation

called "Zen Discovery" for node-to-node clustering and master election. The most
important configurations are discovery.zen.ping.unicast.hosts
and discovery.zen.minimum_master_nodes.
For Cloud hosted clusters, there are specialized plug-ins [29]. We run the ex-
periment on EC2 so that we will use the ec2-discovery plug-in [33].
The EC2 discovery plugin uses the AWS API for unicast discovery. The plugin
provides a hosts provider for zen discovery named ‘ec2’. This hosts provider finds
other Elasticsearch instances in EC2 through AWS metadata. Authentication is
done using IAM Role credentials by default [33]. The use of ec2 discovery plugin
in allows elasticsearch nodes to discover each other and form a cluster without
the need to statically write the IPs of all nodes in every node elasticsearch.yml
configuration file.
5.4.2 Steps to Run the Experiment
Elasticsearch Cluster
1. Install java.
2. Install Elasticsearch (as described in section 5.1.2).

Note: Keep the step of mounting the instance store to be the last step.
Because if we did a reboot(which is required after some of the following
configurations), we would have to re-mount the instance store.
3. Configure elasticsearch memory

Set the -Xms and -Xmx to 7g,
in file /opt/elasticsearch/config/jvm.option.
4. Increase Indexing buffer size to 70%

Set the indices.memory.index_buffer_size to 70\%,
in file /opt/elasticsearch/config/elasticsearch.yml.
5. Configure mlockall to prevent swapping elasticsearch process

Set bootstrap.mlockall to true,
in file /opt/elasticsearch/config/elasticsearch.yml.
6. Increase the OS Limit on file descriptors (handlers).

Set nofile to 65536, in file /etc/security/limits.conf
7. Increase the OS limit on mmap counts

- Set vm.max_map_count to 262144, in file /etc/sysctl.conf
- Set memlock to unlimited, in file /etc/security/limits.conf.
EC2 Discovery Plugin
1. Install EC2 discovery plugin, requires instance reboot.
sudo bin / elasticsearch−plugin i n s t a l l discovery−ec2
2. Configure elasticsearch to use EC2 discovery plugin

In file /opt/elasticsearch/config/elasticsearch.yml:
network.host: _ec2_
discovery.zen.hosts_provider: ec2
discovery.zen.minimum_master_nodes: 5
3. ADD AWS credentials used by EC2 discovery plugin, to elasticsearch key-

store. Notes:
- These are secured configuration, so should be added to elasticsearch key-
store. If we put the credentials directly in elasticsearch.yml, elasticsearch
will through an exception and fail to start.
- EC2 discovery plugin expects the keystore to be in the same directory with
the elasticsearch.yml file.
bin / elasticsearch−keystore create
sudo bin / elasticsearch−keystore add discovery . ec2 . access_key
sudo discovery . ec2 . access_key add discovery . ec2 . secret_key
sudo cp config / elasticsearch . keystore / opt / elaseticsearch / config /
4. Create IAM policy to allow ‘listing’ EC2 instances [34].
5. Open port 9300 for TCP communications between nodes

In the security group (that the elasticsearch nodes are in it), add a new
inbound rule for port 9300, and set the ‘Source’ is the security group id.
Spark Cluster
1. Create a cluster using Flintrock. Described in section 2.8.3.
2. Install the required packages to use numpy. Described in section 2.8.3.
3. Install required packages to use Elasticsearch python client

f l i n t r o c k run−command spark−numpy−cluster ’sudo pip i n s t a l l elasticsearch ’
−−ec2−identity−f i l e xxxxxx .pem −−ec2−user ec2−user
Add the elasticsearch-hadoop dependency to spark classpath
These steps are done on the client machine, that is the machine that we will use
to run the experiment Jupyter notebook.
1. Download the jar, and create a dir to put them in, for example, ’spark-extra-
jars’
wget http : / / central .maven. org /maven2/ org /
elasticsearch / elasticsearch−hadoop−mr/ 6 . 2 . 0 / elasticsearch−hadoop−mr− 6.2.0. j a r
2. Set the spark.driver.extraClassPath property

In file /opt/spark/config/spark-defaults.conf, to the jar path, for ex-
ample: /home/ubuntu/spark-extra-jars/*
3. Configure the Jupyter notebook, to add the --packages parameter when

launching the spark job.
Add this snippet, before connecting to spark cluster.
import os
os . environ [ "PYSPARK_SUBMIT_ARGS" ] = (
−−packages org . elasticsearch : elasticsearch−hadoop−mr: 6 . 2 . 0
pyspark−s h e l l "
Configure EC2 Security Groups
1. All nodes are in the same EC2 security group, but that alone won’t enable
communications between nodes. We must add a custom inbound rule to
open the needed ports for TCP communications between nodes (port 9300)
securityGroup » inBound-outBound> >edit » add new inbound rule for port
9300, set ‘Source’ to the security group id
2. Add new rule to elasticsearch security group to allow communication from

spark cluster
securityGroup » inBound-outBound> >edit » add new in-bound rule, set
‘port’ to ‘9200’, set source to ‘security-group-id’ of the spark cluster
5.4.3 The Index Custom Mapping, and Settings
In this experiment, we used the same custom mapping from experiment 2, de-
scribed in section 5.2.2. Additional modifications were done:
1. Set the number of shards to 5 as we have five nodes.
2. Set most fields to type ‘Keyword’ to enable efficient aggregations, to save

disk space and memory, and improve indexing speed as no analysis is re-
quired.
3. In index settings, disabled the _fields_names field.
4. In index settings disabled the _source field.
5. Set translog durability to async (after creating the index), close and reopen
the index is required
curl −XPUT ’ http :// < host−ip >:9200/<index−name>/_settings ’

{
" index . translog . durability " : "async" ,
" index . translog . sync \ _interval " : "5s "
}’
5.4.4 Results
1. Index Size
Indexing the 92 GB of data, using a custom mapping, resulted in 87.1 GB
index. The index size is 95% of the data size.
We used the elasticsearch-hadoop connector to send the generated data

from spark to elasticsearch. We did the indexing using two consecutive
jobs. Each job indexed 46 GB of data using five parallel tasks, that is the
number of spark nodes.
Elasticsearch-hadoop behavior can be customized through a set of configu-

rations, typically by setting them on the target job’s Hadoop Configuration.
In this experiment, we kept the defaults for most of those configurations,
which is probably good enough [36].
The time to index 46 GB of data (which is 60 million documents) was 72

minutes, thus for the five nodes cluster, the indexing speed was 13.8 K
document/second.
3. Query Throughput
used are:
• -c concurrency
to 200.
• -n requests
• -k
to fail.
• -s timeout
times out. The default is 30 seconds. For this experiment, we kept the
default (30 seconds).
We used the same test queries described in section 5.3.5. The Query time
for each of the test queries is shown in figure 5.11. We show the time for 50
Q1 2071
Q2 5
Q3 10
Q4 10
Q5 1731
Q6 1654

Million Docs - 92 GB)
(a) Q1 (b) Q2
(d) Q4
(c) Q3
’CCA’
(f) Q6
(e) Q5
tains ’CCA’
Figure 5.11: Elasticsearch Experiment 4 - Query Time - 120 Mil-

lion Docs (92 GB)
5.5 Experiment 5: Indexing 1 TB using Ten Nodes Clus-

ter
In this experiment, we use a data size that is closer to the target data size for this
use case. As described in section 1.3.8 the target data size is 5.65 Terabyte. We
index 1 Terabyte of data, which worth 120 billion records (image).
5.5.1 Setup
The experiment setup is illustrated in table 5.9.

As the data size grows, a precise index mapping becomes more vital to get
good performance for your search engine. We used the same custom mapping
5.5. Experiment 5: Indexing 1 TB using Ten Nodes Cluster 101
Elasticsearch (10 nodes) EC2 i3-large

2 cores
15 GB RAM
0.5 TB SSD (Instance store)
Spark workers (10 nodes) EC2 r4-large
2 cores
15 GB RAM
8 GB (EBS)
Spark master (1 node) EC2 m5-large
2cores
8 GB RAM
8 GB (EBS)
Data Size 1 TB
Elasticsearch Config AutoRefresh Disabled, manual
refresh after indexing is done
increase Indexing buffer size to 70%
Elastic Process RAM 7 GB per node
System Cache (free RAM) 8.3 GB per node
Index Mapping Enhanced custom mapping
Index Shards 10
Index Replicas 0 (no replication)
from experiment 4 (described in section 5.4.3). Two further customizations were

made:
• Set the number of shards to 10.
• Tune Fields Types

We tried to get as minimum index size as possible, we set all fields types to
keyword except for the six fields that are intended for full-text search, the
type was kept text. For the six fields defined as text we set the norms to
’false’. Also for those text fields we set the index_options to ‘freqs’ to 3
of them and other 3 we set the index_options to ‘positions’.
Listing 5.3 shows the json used to create the index, it contains a ‘settings’
object and a ‘mappings’ object. For the mappings object we show only some of
the fields, the rest of the fields follow the same structure.
Listing 5.3: Experiment 5 - Index Mapping
{
"settings" : {
"refresh_interval" : -1,
"number_of_shards" : 10,
"number_of_replicas" : 0
},
"mappings" : {
"image" : {
"_source": {
"enabled": false
},
"_field_names": {
"enabled": false
},
"properties" : {
"Age" : {
"type" : "keyword",
"store":true
},
"EDV" : {
"type" : "keyword",
"store":true
},
"ocrText" : {
"type" : "text",
"index_options": "positions",
"norms": false,
"store":true
},
"ocr_sourceDicomPath" : {
"type" : "text",
"index_options": "freqs",
"norms": false,
"store":true
},
...............
}}}}
5.5.2 Steps To Run the Experiment
We followed the same steps described in section 5.4.2. The only difference is
the number of nodes. We increased the number of nodes to 10 for both spark
cluster and elasticsearch cluster. In figure ?? we show an overall abstraction of
the experiment setup.
5.5.3 Results
1. Index Size
The index was divided on ten shards, each shard on a separate node. The
one Terabyte data produced a total index size of 654.1 GB. The index size is
63% of the data size.
5.5. Experiment 5: Indexing 1 TB using Ten Nodes Cluster 103
We used the elasticsearch-hadoop connector to send the generated data
from the spark cluster to elasticsearch cluster. We used 11 spark jobs, ex-
ecuted in sequence. Each job used ten parallel tasks to send 92 GB of data
to elasticsearch. Using this experiment setup, and index settings and map-
ping, the average time to index 92 GB of data is 88 minutes. The average
indexing speed is 23.2 K documents per second.
In this experiment we did set a few configurations for elasticsearch-hadoop,

described below:
(a) es.batch.size.bytes
Size (in bytes) for batch writes using Elasticsearch bulk API. This size
is per task instance. The total bulk size at runtime hitting Elasticsearch
will be this value multiplied by the number of parallel tasks. The de-
fault is 1 MB. We set this property to 1572864 (1.5 MB).
(b) es.batch.size.entries
Size (in entries) for batch writes using Elasticsearch bulk API - (0 dis-
ables it). Companion to es.batch.size. Bytes, once one matches, the
batch update is executed. Similar to the size, this setting is per task in-
stance; it gets multiplied at runtime by the total number of Hadoop
tasks running. The default value is 1000. We set this property to
1900000. This is because, for our data, the document size is relatively
small, it is only 0.8 KB on average.
(c) es.batch.write.refresh
Whether to invoke an index refresh or not after a bulk update has been
completed. This is called only after the entire write (meaning multiple
bulk updates) have been executed. The default value is true. We set
this property to false. This is because for our case study, indexing will
be done offline, in batch mode, So no need to refresh while indexing in
progress.
In Listing 5.4 we show the script used to configure elasticsearch-hadoop

connector and to send data to elasticsearch.
Listing 5.4: Experiment 5 - elasticsearch-hadoop connector
configuration
es_conf =
{" es . nodes" : "<node1−ip >:9200,<node2−ip >:9200 ,... " \
" es . resource " : "<index−name>/<type>" , \
" es . index . auto . create " : " f a l s e " , \
" es . batch . write . refresh " : " f a l s e " , \
" es . batch . size . bytes " : "1572864" , \
" es . batch . size . entries " : "1900000" , \
" es . ingest . pipeline " : " csv−to−json−pipeline "}
generated_data_rdd . saveAsNewAPIHadoopFile(
path=’− ’ ,
outputFormatClass="org . elasticsearch . hadoop .mr. EsOutputFormat" ,
keyClass="org . apache . hadoop . io . NullWritable " ,

valueClass="org . elasticsearch . hadoop .mr. LinkedMapWritable" ,
conf= es_conf )
3. Query Throughput
used are:
• -c concurrency
to 100.
• -n requests
• -k
to fail.
• -s timeout
times out. The default is 30 seconds. For this experiment the default
30 seconds were no good enough. The test failed with a timeout error.
To allow the test to complete we set this parameter to 120 seconds.
We used the test queries described in section 5.3.5. The Query time for
each of the test queries is shown in figure 5.12. We show the time for 50
Q1 959
Q2 1
Q3 2
Q4 2
Q5 1060
Q6 894

Billion Docs - 1 TB)
5.6 Overall Comparison
5.7 Conclusion
5.7. Conclusion 105
(a) Q1 (b) Q2
(d) Q4
(c) Q3
’CCA’
(f) Q6
(e) Q5
tains ’CCA’
Figure 5.12: Elasticsearch Experiment 5 - Query Time - 120 Bil-

lion Docs (1 TB)
107
Chapter 6
Conclusion
Medical images are typically searched using the DICOM Query/Retrieve service.
The query includes search parameters based on a DICOM tag. For instance, a
workstation may query a PACS server for all studies acquired today (in this case,
the Study Date is the search parameter). The DICOM Query/Retrieve service
does not provide free-text search, neither it supports complex logical queries or
statistical queries. Besides, it does not scale easily to adapt to the increasing
volume of data.
The focus of this study was to find the best technology to implement a reliable,
scalable, real-time searching functionality, that would support different kinds of
search queries including free-text and aggregations.
We used the meta-data of the medical images as the search corpus. The meta-
data is text-based data. The primary challenge was the data volume and the full
range of query types that should be supported.
A secondary problem was addressed as a step to address the main problem:

finding an efficient way to generate a big enough volume of data using a small
sample of real data.
AWS-EC2 instances were used to run the experiments. AWS-S3 and AWS-EBS
services were used for data storage.
Different approaches to performing the data generation were compared. The

main comparisons include 1- vectorized vs. non-vectorized, 2- parallelization us-
ing Multiprocessing vs. Spark local node, 3- distribution using Dask cluster vs.
Spark cluster and 4- in-memory vs. writing to disk.
108 Chapter 6. Conclusion
Figure 6.1: Data Generation Throughput (no write to disk)
We chose Spark parallelization approach to generate the data we need to run

the search experiments.
Before we look at the popular big-data search engines, we first made sure that
the traditional simple solution of using RDBMS will not meet the requirements of
our use-case. We explored PostgreSQL free-text-search features. We started with
a small data volume of 18.6 GB. Queries were taking minutes, which indicates
that larger data volume will not be real-time searchable.
We then explored Solr. We run three experiments. In the first experiment,
We depended on Solr defaults, for both configurations and the index schema (we
used schemaless mode). For 18.6 GB of data, the index size was 75% of the data
size. The indexing throughput was 7199 Doc per second. The query performance
was 1784 request per second for count-all query and 61 requests per second for
text-search query (retrieve 100 docs). In the second experiment, we studied the
effect of the index schema on the index size which directly affects the search per-
formance. We compared the search performance on 1- an index that is built using
the auto-generated schema by Solr and 2- an index that is built using a custom
schema that we created. The results showed that the custom schema produced
the index size by 40%. The search performance was significantly improved. The
custom schema index has 8X more requests per second for the count-all query,
and 8X less request time. The third experiment we used 120 million documents
(92 GB) to check whether Solr search performance for text-search query will still
be useful. The results showed that using an i3-large instance, Solr could serve
up to 254 requests per second.
To decide whether to run the experiments using the larger volume of data
with Solr or not, We decided to perform the same three experiment using Elas-
ticsearch then chose the tool with the better performance to further evaluation
of its scalability.
In the first Elasticsearch experiment, we also used defaults for both the con-
figurations and the index mapping (we used Elasticsearch dynamic-mapping).
Results showed less query time than Solr by 63%. In the Second Experiment,
Chapter 6. Conclusion 109
Tool Data Size (GB) Query Time (seconds)

PostgreSQL-Like 15 3.01 Minutes
PostgreSQL-@@ 15 6.65 Minutes
PostgreSQL-trgrm 20 16 Minutes
Solr-Single Node 18.6 3 Seconds
Elasticsearch-Single Node 18.6 1.1 Seconds
Elasticsearch-Single Node 92 45 Seconds
Elasticsearch-5 Nodes 92 10 Seconds
Elasticsearch-10 Nodes 1012 10.3 Seconds
Table 6.1: Search Performance - free-text query (retrieve 100 docs

where ocrText field contains ’CCA’)
we studied the mapping impact on the index size and hence on the search per-
formance. Using a custom mapping instead of the auto-generated dynamic map-
ping, the index size was reduced by 57%. The query time was reduced by 71% fro
free-text query and by 36% for the count-all query. The number of requests per
second increased by a factor of 1.8 for free-text query and by a factor of 3.6 for
the count-all query. In the Third experiment, we indexed 120 million records (92
GB). Using a single node hosted on an i3-large instance, the indexing through-
put was 4100 document per second. The query performance ranged from 52
milliseconds for the count-all query to 96 seconds for q: get 100 docs where
gender=Female. In the Fourth experiment we used the same data volume, but
using five nodes cluster (i3-large instances). The indexing throughput increased
by a factor of 3.36, increasing from 4100 docs per second to 13800 docs/second.
Search performance also increased significantly for all query types. Maximum
query time was 10 seconds.
In the last Elasticsearch experiment, we indexed 120 billion documents (1
Terabyte), using ten nodes cluster (i3-large instances). In this experiments we
indexed the data as soon as we generate it, files and disk I/O overhead was elim-
inated. We used spark cluster of 11 nodes (r4-large instances) one master and
ten workers, to generate the data, and then immediately send it to elasticsearch
cluster for indexing using hadoop-elasticsearch-connector. Elasticsearch index-
ing throughput was 23200 docs per second. The query time for all query types
we considered ranged from milliseconds to 1.5 minutes for the most extended
query. It was expected, as that query matches 60% of the searched data.
We can conclude that Elasticsearch provided an excellent performance that

put is on the top of List of tools that can address the use-case at hand.
111
Bibliography
[1] URL: https://lucene.apache.org/solr/guide/7%5C _ 2/solrcloud.

html. accessed: 20.07.2018.
[2] URL: https://wiki.apache.org/solr/SolrPerformanceProblems%5C#
OS%5C_Disk%5C_Cache.
[3] URL: https : / / docs . aws . amazon . com / AWSEC2 / latest / UserGuide /
add - instance - store - volumes . html % 5C # making - instance - stores -
available-on-your-instances.
[4] URL: https://httpd.apache.org/docs/2.4/programs/ab.html.
[5] A Quick Overview. URL: https://lucene.apache.org/solr/guide/7%
5C_2/a-quick-overview.html%5C#a-quick-overview.
[6] About PostgreSQL. URL: https : / / www . postgresql . org / about/. ac-
cessed: 02.05.2018.
[7] Ceyhun Burak Akgül et al. “Content-Based Image Retrieval in Radiol-
ogy: Current Status and Future Directions”. In: Journal of Digital Imaging
24.2 (Apr. 2011), pp. 208–222. ISSN: 0897-1889. DOI: 10.1007/s10278-
010 - 9290 - 9. URL: http : / / www . ncbi . nlm . nih . gov / pmc / articles /
PMC3056970/.
[8] Amazon EC2 M5 Instances. URL: https : / / aws . amazon . com / ec2 /
instance-types/m5/. accessed: 07.08.2018.
[9] Amazon Web Services, Inc. What Is Amazon EC2? URL: https://docs.
aws.amazon.com/AWSEC2/latest/UserGuide/concepts.html. accessed:
03.05.2018.
[10] Amazon Web Services, Inc. What is Cloud Computing? URL: https : / /
aws.amazon.com/what-is-cloud-computing/. accessed: 03.05.2018.
[11] Apache Hadoop 2.7.1 - MapReduce Tutorial. URL: https : / / hadoop .
apache.org/docs/r2.7.1/hadoop-mapreduce-client/hadoop-mapreduce-
client-core/MapReduceTutorial.html. accessed: 13.05.2018.
[12] AWS Documentaion - Compute Optimized Instances. URL: https://docs.
aws . amazon . com / AWSEC2 / latest / UserGuide / compute - optimized -
instances.html. accessed: 13.05.2018.
[13] AWS Documentaion - Memory Optimized Instances. URL: https://docs.
aws . amazon . com / AWSEC2 / latest / UserGuide / memory - optimized -
instances.html. accessed: 15.05.2018.
[14] J. Bai. “Feasibility analysis of big log data real time search based on Hbase
and ElasticSearch”. In: 2013 Ninth International Conference on Natural
Computation (ICNC). July 2013, pp. 1166–1170. DOI: 10 . 1109 / ICNC .
2013.6818154.
112 BIBLIOGRAPHY
[15] M. Bajer. “Building an IoT Data Hub with Elasticsearch, Logstash and
Kibana”. In: 2017 5th International Conference on Future Internet of
Things and Cloud Workshops (FiCloudW). Aug. 2017, pp. 63–68. DOI:
10.1109/FiCloudW.2017.101.
[16] Rachid Belaid. Postgres full-text search is Good Enough! July 2015. URL:
http : / / rachbelaid . com / postgres - full - text - search - is - good -
enough/. accessed: 25.05.2018.
[17] Cristen Bolan. “Cloud PACS and mobile apps reinvent radiology work-
flow”. In: (May 2013). URL: http://appliedradiology.com/articles/
cloud-pacs-and-mobile-apps-reinvent-radiology-workflow.
[18] Don Bradley and Kendall E Bradley. “The Value of Diagnostic Medical
Imaging”. In: North Carolina Medical Journal 75.2 (Mar. 2014), pp. 121–
125. DOI: 10.18043/ncm.75.2.121. URL: http://www.ncmedicaljournal.
com/content/75/2/121.abstract.
[19] D. Chen et al. “Real-Time or Near Real-Time Persisting Daily Healthcare
Data Into HDFS and ElasticSearch Index Inside a Big Data Platform”. In:
IEEE Transactions on Industrial Informatics 13.2 (Apr. 2017), pp. 595–
606. ISSN: 1551-3203. DOI: 10.1109/TII.2016.2645606.
[20] James Chen, John Bradshaw, and Paul Nagy. “Has the Picture Archiving
and Communication System (PACS) Become a Commodity?” In: Journal of
Digital Imaging 24.1 (2011), pp. 6–10. ISSN: 1618-727X. DOI: 10.1007/
s10278- 010- 9299- 0. URL: https://doi.org/10.1007/s10278- 010-
9299-0.
[21] Class ConcurrentMergeScheduler. URL: https://lucene.apache.org/
core/ConcurrentMergeScheduler.html%22.
[22] Class TieredMergePolicy. URL: https : / / lucene . apache . org / core /
TieredMergePolicy.html.
[23] Commits.
[24] COPY — copy data between a file and a table. URL: https : / / www .
postgresql.org/docs/10/static/sql-copy.html. accessed: 20.05.2018.
[25] Curl Tool. URL: https://curl.haxx.se/.
[26] Jeffrey Dean and Sanjay Ghemawat. “MapReduce: Simplified Data Pro-
cessing on Large Clusters”. In: Commun. ACM 51.1 (Jan. 2008), pp. 107–
113. ISSN: 0001-0782. DOI: 10 . 1145 / 1327452 . 1327492. URL: http :
//doi.acm.org/10.1145/1327452.1327492.
[27] DICOM. URL: https://www.dicomstandard.org/. accessed: 02.07.2018.
[28] Francis X. Diebold. “The Origin(s) and Development of “Big Data”: The
Phenomenon, the Term, and the Discipline”. In: (2012, revised 2018).
[29] Discovery Plugins. Apr. 2018. URL: https://www.elastic.co/guide/
en/elasticsearch/plugins/6.2/discovery.html%5C#discovery.
[30] Ramesh Dontha. The Origins of Big Data. 2017. URL: https : / / www .
kdnuggets.com/2017/02/origins-big-data.html.
BIBLIOGRAPHY 113
[31] Samuel J Dwyer. “A personalized view of the history of PACS in the USA”.
In: vol. 3980. 2000, pp. 3980–3988. URL: https://doi.org/10.1117/
12.386388.
[32] Dynamic Mapping. Apr. 2018. URL: https://www.elastic.co/guide/
en / elasticsearch / reference / current / dynamic - mapping . html % 5C #
dynamic-mapping.
[33] EC2 Discovery Plugin. Apr. 2018. URL: https://www.elastic.co/guide/
en/elasticsearch/plugins/6.2/discovery-ec2.html%5C#discovery-
ec2.
[34] EC2 Discovery Plugin. Apr. 2018. URL: https://www.elastic.co/guide/
en/elasticsearch/plugins/6.2/%5C _ settings.html%5C#discovery-
ec2-permissions.
[35] Elasticsearch for Apache Hadoop. Apr. 2018. URL: https://www.elastic.
co/guide/en/elasticsearch/hadoop/current/reference.html%5C#
reference.
[36] Elasticsearch for Apache Hadoop [6.2] - Configuration. URL: https : / /
www.elastic.co/guide/en/elasticsearch/hadoop/current/configuration.
[37] Elasticsearch Reference. Apr. 2018. URL: https : / / www . elastic . co /
guide/en/elasticsearch/reference/current/multi-fields.html.
[38] Elasticsearch Reference. Bulk API. Apr. 2018. URL: https://www.elastic.
co/guide/en/elasticsearch/reference/current/docs-bulk.htm.
[39] Elasticsearch Reference. Getting started. Apr. 2018. URL: https://www.
elastic . co / guide / en / elasticsearch / reference / 6 . 2 / getting -
started.html%5C#getting-started.
[40] Elasticsearch Reference. Ingest Node. Apr. 2018. URL: https : / / www .
elastic . co / guide / en / elasticsearch / reference / master / ingest .
html.
[41] Elasticsearch Reference [6.2] - bootstrap.memory_lock. URL: https : / /
www . elastic . co / guide / en / elasticsearch / reference / current /
setup - configuration - memory . html # bootstrap - memory _ lock. ac-
cessed: 4.06.2018.
[42] Elasticsearch Reference [6.2] - Disable swapping. URL: https : / / www .
elastic . co / guide / en / elasticsearch / reference / current / setup -
configuration-memory.html#setup-configuration-memory. accessed:
4.06.2018.
[43] Elasticsearch Reference [6.2] - Dynamic Mapping. URL: https://www.
elastic.co/guide/en/elasticsearch/reference/current/dynamic-
mapping.html. accessed: 2.06.2018.
[44] Elasticsearch Reference [6.2] - Exists Query. URL: https://www.elastic.
co/guide/en/elasticsearch/reference/current/query-dsl-exists-
query.html. accessed: 4.06.2018.
[45] Elasticsearch Reference [6.2] - File Descriptors. URL: https : / / www .
elastic . co / guide / en / elasticsearch / reference / current / file -
descriptors.html. accessed: 4.06.2018.
114 BIBLIOGRAPHY
[46] Elasticsearch Reference [6.2] - index_options. URL: https://www.elastic.

co / guide / en / elasticsearch / reference / current / index - options .
[47] Elasticsearch Reference [6.2] - Keyword datatype. URL: https : / / www .
elastic.co/guide/en/elasticsearch/reference/current/keyword.
[48] Elasticsearch Reference [6.2] - norms. URL: https://www.elastic.co/
guide/en/elasticsearch/reference/current/norms.html. accessed:
2.06.2018.
[49] Elasticsearch Reference [6.2] - Query and filter context. URL: https :
/ / www . elastic . co / guide / en / elasticsearch / reference / current /
query-filter-context.html. accessed: 06.06.2018.
[50] Elasticsearch Reference [6.2] - Query DSL. URL: https://www.elastic.
co / guide / en / elasticsearch / reference / current / query - dsl . html.
accessed: 4.06.2018.
[51] Elasticsearch Reference [6.2] - Request Body Search - Fields. URL: https:
search-request-stored-fields.html. accessed: 06.06.2018.
[52] Elasticsearch Reference [6.2]- Request Body Search - From / Size. URL:
https : / / www . elastic . co / guide / en / elasticsearch / reference /
current/search-request-from-size.html. accessed: 06.06.2018.
[53] Elasticsearch Reference [6.2] - Request Body Search - Sort. URL: https:
search-request-sort.html. accessed: 06.06.2018.
[54] Elasticsearch Reference [6.2] - Search. URL: https://www.elastic.co/
guide/en/elasticsearch/reference/current/search- search.html#
search-search. accessed: 06.06.2018.
[55] Elasticsearch Reference [6.2] - store. URL: https://www.elastic.co/
guide/en/elasticsearch/reference/current/mapping- store.html.
accessed: 2.06.2018.
[56] Elasticsearch Reference [6.2] - Text datatype. URL: https://www.elastic.
co / guide / en / elasticsearch / reference / current / text . html. ac-
cessed: 2.06.2018.
[57] Elasticsearch Reference [6.2] - The _field_names. URL: https : / / www .
elastic.co/guide/en/elasticsearch/reference/current/mapping-
field - names - field . html # mapping - field - names - field. accessed:
4.06.2018.
[58] Elasticsearch Reference [6.2] - Translog. URL: https://www.elastic.
co / guide / en / elasticsearch / reference / current / index - modules -
translog.html. accessed: 4.06.2018.
[59] Elasticsearch Reference [6.2] - Transport. URL: https://www.elastic.
co/guide/en/elasticsearch/reference/current/modules-transport.
BIBLIOGRAPHY 115
[60] Elasticsearch Reference [6.2] - Virtual Memory. URL: https : / / www .

elastic.co/guide/en/elasticsearch/reference/current/vm- max-
map-count.html. accessed: 4.06.2018.
[61] Elasticsearch Reference [6.2] - Zen Discovery. URL: https://www.elastic.
co/guide/en/elasticsearch/reference/current/modules-discovery-
zen.html. accessed: 06.06.2018.
[62] Elasticsearch Reference - Grok Processor. URL: https://www.elastic.
co / guide / en / elasticsearch / reference / current / grok - processor .
[63] Elasticsearch Reference - Nodes. URL: https://www.elastic.co/guide/
en / elasticsearch / reference / 6 . 2 / modules - node . html. accessed:
30.05.2018.
[64] Elasticsearch: The Definitive Guide [2.x]. Setting Replica to 0. Apr. 2018.
URL: https : / / www . elastic . co / guide / en / elasticsearch / guide /
current/indexing-performance.html%5C#%5C_other.
[65] Flintrock. URL: https://github.com/nchammas/flintrock. accessed:
15.05.2018.
[66] Gabriel Moskovicz. Elasticsearch _Source Field. Apr. 2018. URL: https:
mapping-source-field.html.
[67] Gabriel Moskovicz. Indexing your CSV files with Elasticsearch Ingest Node.
Apr. 2018. URL: https : / / www . elastic . co / blog / indexing - csv -
elasticsearch-ingest-node.
[68] Gordon Gamsu and Enrico Perez. “Picture Archiving and Communica-
tion Systems (PACS)”. In: Journal of Thoracic Imaging 18.3 (2003). ISSN:
0883-5993. URL: https://journals.lww.com/thoracicimaging/Fulltext/
2003 / 07000 / Picture % 7B % 5C _ %7DArchiving % 7B % 5C _ %7Dand % 7B % 5C _
%7DCommunication%7B%5C_%7DSystems%7B%5C_%7D%7B%5C_%7DPACS%7B%
5C_%7D.5.aspx.
[69] Amir Gandomi and Murtaza Haider. “Beyond the hype: Big data con-
cepts, methods, and analytics”. In: International Journal of Information
Management 35.2 (2015), pp. 137–144. ISSN: 0268-4012. DOI: https :
//doi.org/10.1016/j.ijinfomgt.2014.10.007. URL: http://www.
sciencedirect.com/science/article/pii/S0268401214001066.
[70] Robert J Gillies, Paul E Kinahan, and Hedvig Hricak. “Radiomics: Images
Are More than Pictures, They Are Data”. In: Radiology 278.2 (Feb. 2016),
pp. 563–577. ISSN: 0033-8419. DOI: 10.1148/radiol.2015151169. URL:
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4734157/.
[71] Wendy Arianne Günther et al. “Debating big data: A literature review on
realizing value from big data”. In: The Journal of Strategic Information
Systems 26.3 (2017), pp. 191–209. ISSN: 0963-8687. DOI: https://doi.
org/10.1016/j.jsis.2017.07.003. URL: http://www.sciencedirect.
com/science/article/pii/S0963868717302615.
116 BIBLIOGRAPHY
[72] S. Gupta and R. Rani. “A comparative study of elasticsearch and CouchDB

document oriented databases”. In: 2016 International Conference on In-
ventive Computation Technologies (ICICT). Vol. 1. Aug. 2016, pp. 1–4.
DOI: 10.1109/INVENTIVE.2016.7823252.
[73] Wickham Hadley. “The Split-Apply-Combine Strategy for Data Analysis”.
In: Journal of statistical software 40.1 (Apr. 2011). DOI: 10.18637/jss.
v040.i01. URL: http://doi.acm.org/10.1145/1327452.1327492.
[74] Kyung Hoon Hwang, Haejun Lee, and Duckjoo Choi. “Medical Image Re-
trieval: Past and Present”. In: Healthc Inform Res 18.1 (Mar. 2012), pp. 3–
9. ISSN: 2093-3681. URL: http://synapse.koreamed.org/DOIx.php?
id=10.4258%7B%5C%%7D2Fhir.2012.18.1.3.
[75] IndexConfig in SolrConfig. URL: https : / / lucene . apache . org / solr /
guide/indexconfig-in-solrconfig.html.
[76] Installing numpy on Amazon EC2. URL: https://stackoverflow.com/
questions / 18732250 / installing - numpy - on - amazon - ec2. accessed:
15.05.2018.
[77] IPython — Built-in magic commands. URL: https://ipython.readthedocs.
io/en/stable/interactive/magics.html. accessed: 29.05.2018.
[78] Keith D. Foote. A Brief History of Data Science. Dec. 2016. URL: http:
/ / www . dataversity . net / brief - history - data - science/. accessed:
29.04.2018.
[79] Kff.org. US Population Distribution by Age. URL: https://www.kff.org/
state-category/demographics-and-the-economy/population/.
[80] Kuhlman, Dave. “A Python Book: Beginning Python, Advanced Python,
and Python Exercises”. In: (). Archived from the original on 23 June 2012.
[81] Ralf Lämmel. “Google’s MapReduce programming model — Revisited”.
In: Science of Computer Programming 70.1 (2008), pp. 1–30. ISSN: 0167-
6423. DOI: https://doi.org/10.1016/j.scico.2007.07.001. URL:
http://www.sciencedirect.com/science/article/pii/S0167642307001281.
[82] Zhongyu Li et al. “Large-scale retrieval for medical image analytics: A
comprehensive review”. In: Medical Image Analysis 43 (2018), pp. 66–84.
ISSN: 1361-8415. DOI: https://doi.org/10.1016/j.media.2017.09.
007. URL: http : / / www . sciencedirect . com / science / article / pii /
S136184151730138X.
[83] Linux time Command. URL: https://en.wikipedia.org/wiki/Time%5C_
(Unix).
[84] Lucene. URL: https://lucene.apache.org/core/7%5C_2%5C_0/index.
html.
[85] Lucene Index Structure. URL: https : / / lucene . apache .org / core / 7 %
5C _ 2 % 5C _ 0 / core / org / apache / lucene / codecs / lucene70 / package -
summary.html%5C#Overview.
[86] Lucene Packages Summary. URL: https://lucene.apache.org/core/7%
5C_2%5C_1/core/overview-summary.html%5C#overview.description.
BIBLIOGRAPHY 117
[87] A D A Maidment. “The future of medical imaging”. In: Radiation Protec-

tion Dosimetry 139.1-3 (2010), pp. 3–7. DOI: 10.1093/rpd/ncq090. URL:
http://dx.doi.org/10.1093/rpd/ncq090.
[88] Matthias Gelbmann. Elasticsearch moved into the top 10 most popular
database management systems. July 2017. URL: https://db- engines.
com/en/blog_post/70. accessed: 03.05.2018.
[89] Wes McKinney. “Python for Data Analysis”. In: O’Reilly Media, Inc., Oct.
2012. Chap. 4,12.
[90] MergePolicy Doc. URL: https://lucene.apache.org/core/7%5C_2%5C_
0//core/org/apache/lucene/index/MergePolicy.html.
[91] Merging Index Segments. URL: https : / / lucene . apache . org / solr /
guide/7%5C_2/indexconfig-in-solrconfig.html%5C#merging-index-
segments.
[92] Henning Müller et al. “A review of content-based image retrieval systems
in medical applications—clinical benefits and future directions”. In: In-
ternational Journal of Medical Informatics 73.1 (2004), pp. 1–23. ISSN:
1386-5056. DOI: https : / / doi . org / 10 . 1016 / j . ijmedinf . 2003 . 11 .
024. URL: http : / / www . sciencedirect . com / science / article / pii /
S1386505603002119.
[93] H Muller et al. “Medical Visual Information Retrieval: State of the Art and
Challenges Ahead”. In: 2007 IEEE International Conference on Multime-
dia and Expo. 2007, pp. 683–686. ISBN: 1945-7871 VO -. DOI: 10.1109/
ICME.2007.4284742.
[94] National Electrical Manufacturers Association. “Digital imaging and com-
munications in medicine (DICOM) part 7: Message exchange. Section
9.1.2.” In: (2011).
[95] Peter Naur. “The Science of Datalogy”. In: Commun. ACM 9.7 (July 1966),
pp. 485–. ISSN: 0001-0782. DOI: 10.1145/365719.366510. URL: http:
//doi.acm.org/10.1145/365719.366510.
[96] Network Settings. Apr. 2018. URL: https : / / www . elastic . co / guide /
en / elasticsearch / reference / current / modules - network . html % 5C #
modules-network.
[97] Network.host setting. Apr. 2018. URL: https://www.elastic.co/guide/
en/elasticsearch/reference/current/network.host.html.
[98] NumPy. URL: http://www.numpy.org/. accessed: 02.05.2018.
[99] NumPy Reference - Advanced Indexing. URL: https://docs.scipy.org/
doc / numpy - 1 . 13 . 0 / reference / arrays . indexing . html # advanced -
indexing. accessed: 14.05.2018.
[100] NumPy v1.14 Manual - Broadcasting. URL: https://docs.scipy.org/
doc / numpy - 1 . 14 . 0 / user / basics . broadcasting . html. accessed:
14.05.2018.
[101] Juliana Oliveira. Multiprocessing Serialization in Python with Pickle. 2018.
URL: https://medium.com/@jwnx/multiprocessing- serialization-
in-python-with-pickle-9844f6fa1812. accessed: 15.08.2018.
118 BIBLIOGRAPHY
[102] Open hub - apache spark. URL: https://www.openhub.net/p/apache-

spark. accessed: 29.07.2018.
[103] Erik Hatcher Otis Gospodnetic. Lucene in Action. 2005. ISBN 1-932394-
28-1.
[104] Pandas. URL: https : / / pandas . pydata . org / pandas - docs / stable/.
accessed: 02.05.2018.
[105] parallel_bulk is a generator. Apr. 2018. URL: https://discuss.elastic.
co / t / helpers - parallel - bulk - in - python - not - working / 39498. ac-
cessed: 31.05.2018.
[106] PostgreSQL 10 - Basic Text Matching. URL: https://www.postgresql.
org/docs/10/static/textsearch-intro.html#TEXTSEARCH-MATCHING.
accessed: 22.05.2018.
[107] PostgreSQL 10 - Configuration Example. URL: https://www.postgresql.
org / docs / 10 / static / textsearch - configuration . html. accessed:
22.05.2018.
[108] PostgreSQL 10- Full Text Search. URL: https://www.postgresql.org/
docs/10/static/textsearch-intro.html. accessed: 20.05.2018.
[109] PostgreSQL 10 - GIN and GiST Index Types. URL: https://www.postgresql.
org/docs/10/static/textsearch-indexes.html. accessed: 22.05.2018.
[110] PostgreSQL 10 - Indexes on Expressions. URL: https://www.postgresql.
org / docs / current / static / indexes - expressional . html. accessed:
25.05.2018.
[111] PostgreSQL 10 - pg_trgm. URL: https://www.postgresql.org/docs/
current/static/pgtrgm.html. accessed: 27.05.2018.
[112] PostgreSQL 10 - Text Search Functions and Operators. URL: https : / /
www.postgresql.org/docs/10/static/functions- textsearch.html.
accessed: 22.05.2018.
[113] PostgreSQL 10 - Text Search Types. URL: https://www.postgresql.org/
docs/10/static/datatype-textsearch.html. accessed: 22.05.2018.
[114] POSTGRESQL - About. URL: https : / / www . postgresql . org / about/.
accessed: 16.05.2018.
[115] Production vs Development modes. Apr. 2018. URL: https://www.elastic.
co / guide / en / elasticsearch / reference / current / system - config .
html%5C#dev-vs-prod.
[116] psycopg2 - Python-PostgreSQL Database Adapter. URL: https://pypi.
org/project/psycopg2/. accessed: 29.05.2018.
[117] PySpark 2.2.0 documentation. URL: http://spark.apache.org/docs/2.
2.0/api/python/index.html#. accessed: 12.05.2018.
[118] Python 2.7.15 Documentation - multiprocessing. URL: https : / / docs .
python.org/2/library/multiprocessing.html. accessed: 14.05.2018.
[119] Python Bulk Helpers. Apr. 2018. URL: http://elasticsearch-py.readthedocs.
io/en/master/helpers.html. accessed: 31.05.2018.
BIBLIOGRAPHY 119
[120] Python Elasticsearch Client. URL: https://elasticsearch-py.readthedocs.

io/en/master/. accessed: 30.05.2018.
[121] Radiologyinfo.org. Ultrasound - Carotid. Apr. 2018. URL: https://www.
radiologyinfo.org/en/info.cfm?pg=us-carotid.
[122] Antonio Regalado. “Who Coined ’Cloud Computing’?” In: (Oct. 2011). Tech-
nology Review. MIT. Retrieved 03 May 2018.
[123] Resilient Distributed Datasets (RDDs). URL: https : / / spark . apache .
org/docs/2.2.0/rdd-programming-guide.html#resilient-distributed-
datasets-rdds. accessed: 13.05.2018.
[124] Matthew Rocklin. Parallelism and Serialization, how poor pickling breaks
multiprocessing. URL: http://matthewrocklin.com/blog/work/2013/
12/05/Parallelism-and-Serialization. accessed: 15.08.2018.
[125] Daniel Rueckert, Ben Glocker, and Bernhard Kainz. “Learning clinically
useful information from images: Past, present and future”. In: Medical
Image Analysis 33 (Apr. 2018), pp. 13–18. ISSN: 1361-8415. DOI: 10 .
1016/j.media.2016.06.009. URL: http://dx.doi.org/10.1016/j.
media.2016.06.009.
[126] P A Samuelson and W D Nordhaus. Microeconomics. McGraw-Hill higher
education. McGraw-Hill/Irwin, 2001. ISBN: 9780071180665. URL: https:
//books.google.com.kw/books?id=ymVMIgAACAAJ.
[127] James H Scatliff and Peter J Morris. “From Röntgen to Magnetic Res-
onance Imaging: The History of Medical Imaging ”. In: North Carolina
Medical Journal 75.2 (Mar. 2014), pp. 111–113. DOI: 10.18043/ncm.75.
2.111. URL: http://www.ncmedicaljournal.com/content/75/2/111.
abstract.
[128] P. Seda et al. “Performance testing of NoSQL and RDBMS for storing big
data in e-applications”. In: 2018 3rd International Conference on Intelli-
gent Green Building and Smart Grid (IGBSG). Apr. 2018, pp. 1–4. DOI:
10.1109/IGBSG.2018.8393559.
[129] Simple Random Sample. URL: https://www.investopedia.com/terms/
s/simple-random-sample.asp. accessed: 06.08.2018.
[130] Solr Analyzers. URL: http://lucene.apache.org/solr/guide/7%5C_2/
analyzers.html%5C#analyzers.
[131] Solr Glossary. URL: http://lucene.apache.org/solr/guide/7%5C _ 2/
solr-glossary.html.
[132] Solr merge policy and deleted docs. URL: https : / / lucidworks . com /
2017 / 10 / 13 / segment - merging - deleted - documents - optimize - may -
bad/.
[133] Solr Reference. URL: http://lucene.apache.org/solr/guide/7%5C_2/.
[134] Solr - Uploading Data with Index Handlers. URL: https : / / lucene .
apache . org / solr / guide / 7 % 5C _ 2 / uploading - data - with - index -
handlers.html.
[135] Spark 2.3.0 Documentation. Apr. 2018. URL: https : / / spark . apache .
org/docs/latest/#spark-overview.
120 BIBLIOGRAPHY
[136] Nicola H Strickland. “PACS (picture archiving and communication sys-

tems): filmless radiology”. In: Archives of Disease in Childhood 83.1 (July
2000), 82 LP –86. URL: http : / / adc . bmj . com / content / 83 / 1 / 82 .
abstract.
[137] Techopedia - Amdahl’s Law. URL: https://www.techopedia.com/definition/
17035/amdahls-law. accessed: 14.05.2018.
[138] The Apache Software Foundation. In: (Feb. 2014). URL: https://blogs.
apache . org / foundation / entry / the _ apache _ software _ foundation _
announces50.
[139] The PostgreSQL Global Development Group. POSTGRESQL: THE WORLD’S
MOST ADVANCED OPEN SOURCE RELATIONAL DATABASE. URL: https:
//www.postgresql.org/. accessed: 02.05.2018.
[140] The Schemaless Mode. URL: https://lucene.apache.org/solr/guide/
7%5C_2/schemaless-mode.html. accessed: 31.05.2018.
[141] Time module. Apr. 2018. URL: https://docs.python.org/2/library/
time.html.
[142] timeit — Measure execution time of small code snippets. URL: https :
//docs.python.org/2/library/timeit.html. accessed: 29.05.2018.
[143] Ubuntu Manuals (16.04) - parallel ssh program. URL: http://manpages.
ubuntu.com/manpages/xenial/man1/parallel- ssh.1.html. accessed:
06.06.2018.
[144] University POSTGRES,Version 4.2. URL: http://db.cs.berkeley.edu/
postgres.html. accessed: 16.05.2018.
[145] Frederico Valente, Carlos Costa, and Augusto Silva. “Dicoogle, a Pacs
Featuring Profiled Content Based Image Retrieval”. In: PLoS ONE 8.5
(May 2013). Ed. by Panayiotis V Benos, e61888. ISSN: 1932-6203. DOI:
10.1371/journal.pone.0061888. URL: http://www.ncbi.nlm.nih.gov/
pmc/articles/PMC3646026/.
[146] Mianzhi Wang. On Sharing Large Arrays When Using Python’s Multipro-
cessing. 2018. URL: https://research.wmz.ninja/articles/2018/03/
on- sharing- large- arrays- when- using- pythons- multiprocessing.
[147] William M Wells III. “Medical Image Analysis – past, present,
and future”. In: Medical Image Analysis 33 (Apr. 2018), pp. 4–6. ISSN:
1361-8415. DOI: 10.1016/j.media.2016.06.013. URL: http://dx.doi.
org/10.1016/j.media.2016.06.013.
[148] Wikipedia contributors. Relational database — Wikipedia, The Free En-
cyclopedia. accessed: 2-May-2018. 2018. URL: https://en.wikipedia.
org/w/index.php?title=Relational_database&oldid=839163035.
[149] Shaomin Wu. “A review on coarse warranty data and analysis”. In: Reli-
ability Engineering & System Safety 114 (2013), pp. 1–11. ISSN: 0951-
8320. DOI: https : / / doi . org / 10 . 1016 / j . ress . 2012 . 12 . 021. URL:
http://www.sciencedirect.com/science/article/pii/S0951832013000100.
BIBLIOGRAPHY 121
[150] Shaoting Zhang and Dimitris Metaxas. Large-Scale medical image analyt-
ics: Recent methodologies, applications and Future directions. eng. Oct.
2016. DOI: 10.1016/j.media.2016.06.010.

Enabling Real Time Search of Medical Ima

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Enabling Real Time Search of Medical Ima

Uploaded by

Copyright:

Available Formats

ENABLING REAL-TIME SEARCH OF

MEDICAL IMAGES METADATA

A Thesis Presented to the Faculty

Finally, I would like to thank my great friends: Karima Moustafa, Omnia

List of Tables xii

3.5 Experiments Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

5.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104

1.1 PACS Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2.1 Data Generator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

3.1 Full Text Search Preprocessing . . . . . . . . . . . . . . . . . . . . . . 38

4.7 Solr Experiment1 - Requests/Second (Cashed) . . . . . . . . . . . . . 65

5.1 Elasticsearch Exp1 - Index Size . . . . . . . . . . . . . . . . . . . . . . 79

6.1 Data Generation Throughput (no write to disk) . . . . . . . . . . . . . 108

1.1 Related Work List . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

2.1 Iteration-based generator - Throughput . . . . . . . . . . . . . . . . . 20

3.1 PostgreSQL Free Text Search . . . . . . . . . . . . . . . . . . . . . . . 52

4.1 Solr Experiment 1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . 60

5.1 Elasticsearch Experiment 1 Setup . . . . . . . . . . . . . . . . . . . . 74

Here is how this study is organized:

In chapter 1 we provide a background that will enable a better understanding

Next, in chapter 2, We explain the data structure and attributes. We present

The RDBMS technology is naturally an excellent candidate to build search

RDBMS: PostgreSQL. By conducting a set of experiments to evaluate the differ-

In chapter 4, we present Solr experiments. We evaluated Solr, concerning

We then present Elasticsearch evaluation in chapter 5. We tested the per-

1.3 The Use Case Context

In this study, we focus on carotid artery ultrasound, a specific type of vascular

1.3.2 Who has the ultrasound exams?

1.3.3 The DICOM format

Figure 1.1: PACS Overview [145]

1.3.4 Picture Archiving and Communication System (PACS)

A picture archiving and communication system (PACS) is a computerized means

1.3.5 Query and Retrieval, the PACS Way

• The client establishes the network connection to the PACS server.

• The client prepares a C-FIND request message which is a list of DICOM

• The C-FIND request message is sent to the server.

PACS Search Capabilities are Limited

The underlying Digital Imaging and Communication in Medicine (DICOM) pro-

1.3.6 Content-based Image Retrieval (CBIR)

Content-based Image Retrieval (CBIR) methods have shown great promise in

Figure 1.2: Overview of End-to-End System

1.3.7 What we want to achieve?

We aim to overcome the limitations of PACS search, 1) inflexible, limited queries,

We represent an image by a text record of metadata. The metadata can be

• the path/URL of the .png file on the storage system component.

1.3.8 The Target Data Scale

1.4 The Engineering Perspective

Data Engineering and Data Science

Full Text Search

Apache Spark is a fast and general-purpose cluster computing system. It pro-

Python is an interpreted high-level programming language for general-purpose

• a powerful N-dimensional array object

• sophisticated (broadcasting) functions

• tools for integrating C/C++ and Fortran code

• useful linear algebra, Fourier transform, and random number capabilities

In this study, we started our exploration by looking at the relational database

Apache Lucene is an open-source, high-performance, full-featured text search

1. org.apache.lucene.analysis Convert text from a Reader into a TokenStream.

Solr is an Apache open source project. It is designed to drive powerful document

Elasticsearch is a highly scalable open-source full-text search and analytics en-

Elasticsearch can be considered as the second generation of the search en-

1.4.7 Amazon EC2

Cloud computing is the on-demand delivery of computing power, database stor-