
Journal of Ambient Intelligence and Humanized Computing

https://doi.org/10.1007/s12652-020-01775-9

ORIGINAL RESEARCH

The efficient fast-response content-based image retrieval using Spark and MapReduce model framework
D. Mansoor Hussain1 · D. Surendran1

Received: 21 October 2019 / Accepted: 6 February 2020


© Springer-Verlag GmbH Germany, part of Springer Nature 2020

Abstract
Content-based image retrieval (CBIR) is a way of querying image databases. CBIR treats the visual properties of an image as “search terms” and returns pictures from a database that share the same or similar visual properties. Most CBIR systems in the literature work by extracting the color, texture and shape features of the image, comparing them with those in the database, and computing the distance between the image features for retrieval. In this work, we use a MapReduce model framework to index large-scale image collections, and Spark, which runs on top of MapReduce and the Hadoop distributed file system (HDFS) environment, is used to retrieve from the index. HDFS provides an in-memory data storage and fast retrieval mechanism using the indexing process. Image retrieval is performed with the K-Nearest Neighbour (KNN) model using an Apache implementation. The processing time has been evaluated with the Hadoop framework in CBIR. The proposed approach takes 10% less time to index images than the distributed image segmentation method discussed in the literature.

Keywords  MapReduce · Hadoop · CBIR · KNN algorithm · Spark

* D. Mansoor Hussain
mansoor.slm@gmail.com

D. Surendran
d.surendran@gmail.com

1 Department of Computer Science and Engineering, Sri Krishna College of Engineering and Technology, Coimbatore, India

1 Introduction

Content-based image retrieval (CBIR) is one approach to searching photos from a phone, tablet, desktop or website. Color, shape and texture are the most common and popular features mined for image matching and retrieval. In our survey, we found that several researchers have used these three features to propose a retrieval system (Mezzoudj et al. 2019). Similarity measures quantify the similarities and differences between methods and features, and are also used to determine which method is the most efficient for CBIR systems (Liu and Yang 2013). Indexing and searching are the two major tasks in retrieval. Indexing extracts the best-matched vector that identifies each image and maps it accordingly in the database, while searching performs the query processing of the image against the computed vectors in the database.

Rapid advances in VLSI design, which deliver higher processing power, together with higher bandwidth through broadband networks and data compression standards, are driving the creation of high-volume image data, along with efficient compression and communication over high-bandwidth channels. At the same time, this explosive growth of multimedia data poses a new set of challenges for users in terms of storage and retrieval. Text-based indexing and retrieval are no longer an optimal solution for managing huge multimedia repositories. Also, a local system may not be sufficient to process a huge database of images, and a client–server model is hence needed for efficient image indexing and searching.

The Hadoop distributed file system (HDFS) is a distributed file system that is highly fault tolerant and offers high-throughput access to data (Talib et al. 2013). It is now an Apache Hadoop sub-project with a master-slave architecture comprising a single name node and multiple data nodes. The file system namespace, managed by the name node, regulates access to files by their respective clients. The name node also maintains the file-to-block mapping, block locations, block order and other related metadata. The data nodes, on the other hand, store file data as blocks and send reports or feedback to the name node. Once the


mapping is done, the client can interact with the data node directly to read or write blocks.

HDFS is written in Java and sits on top of the native file system, irrespective of the OS. MapReduce is a processing paradigm associated with HDFS that helps to process data effectively on the different clusters available in the network. MapReduce is a parallel processing method for handling data distributed on a commodity cluster (Manoharan and Sathappan 2012). The role of the mappers is to map input key/value pairs to a set of intermediate key/value pairs. They are separable tasks that transform input records into intermediate records. Shuffling is the next stage, which passes these outputs from the map tasks to the reducers. A reducer finally reduces a set of these intermediate values that share a key to a smaller set of values.

Fig. 2  MapReduce framework

For the CBIR system, we propose a technique that speeds up large-scale image database indexing using HDFS. The proposed work is divided into two parts. (a) Distributed file processing is performed on the image database with the HDFS setup. The performance of image searching and indexing with retrieval is good enough in HDFS, and it is faster with unstructured databases such as Hive or any other big databases (Liu et al. 2007). (b) Image searching in the big databases is done with the KNN algorithm, which makes no assumptions and is very easy to implement for multi-class problems. With the KNN algorithm, the processing time, memory usage and computation are lower than with the Support Vector Machine (SVM) and Naïve Bayes classifiers. Parallel K-NN is used along with Spark, with the main focus on computation and processing time. The computation is done with the cache model and the non-cache model in the Spark architecture (Jalaja et al. 2005). The work comprises a comparative survey with the Spark-based CBIR mechanism to predict the computation time and the model framework.

2 Literature survey

The increased usage of mobile devices has led to an increased amount of multimedia data, especially images, and consequently retrieving the needed ones from a pool of images has become a great challenge in the computer vision and image processing domains. Several techniques have been discussed in the literature for image retrieval using primitive features, logical features and abstract attributes (Kwak et al. 2001; Mirmehdi et al. 1998; Smith and Chang 1996). Most CBIR systems work efficiently only at the lowest levels, while the demand from consumers for retrieval is higher. CBIR applications are widespread in medicine, fashion, publishing, architecture, crime prevention etc. Text descriptors are relied upon for indexing, but the effectiveness of such systems is not very evident. User satisfaction with such systems varies considerably. Moreover, semantic feature extraction from images still remains an active research topic. IBM’s QBIC, Excalibur’s RetrievalWare and Virage’s VIR Image Engine are some commercially available CBIR systems. While digitisation itself is not sufficient to manage an image database, some advanced techniques in image indexing, searching and cataloguing are necessary for efficient and effective image retrieval. Figure 1 represents a typical CBIR system.

Fig. 1  A typical CBIR system

Researchers have been working on different issues in the CBIR domain, including object identification, pattern


recognition etc. The general CBIR methods and techniques, along with the issues, have been discussed by Smeulders and Gupta (2000), while the recent challenges have been reviewed by Zhou et al. (2017).

Speeding up CBIR system results is another active area of research on top of the feature extraction techniques. Mezzoudj et al. (2019) have discussed a parallel CBIR system using the Spark and Tachyon frameworks. Different framework models are discussed in the literature, including the MapReduce model, the HIPI framework and HDFS, to name a few. Apache Spark is a recent MapReduce-based framework for processing large-scale data on various nodes, and Hedjazi et al. (2018) have shown that Spark works better than Hadoop for batch processing. Hadoop has been used with HBase for image retrieval and the memory support system. Image retrieval is supported in the cluster environment using a distributed file system. The indexing step is associated with the HBase system in the storage layer in the segmented approach (Tamura et al. 1978).

MapReduce computing uses the HDFS support system in order to enhance searching in the memory processing and the storage layer (Kekre et al. 2010). Hadoop MapReduce along with HBase gives rise to a searching method with some issues, which have been overcome with the proposed method (Lin et al. 2003). Lowe presented a method for finding features that are invariant to image rotation and scaling, which provides robust matching. Local tetra patterns have been used with the vector techniques to present the features in a CBIR system (Lowe 2004).

Fig. 3  Image indexing and metadata information extraction

The Hadoop MapReduce framework makes the closest path for the query image with the better CBIR system. Apache Spark based on the Avro framework combines the picture files and provides an in-memory order to allow the actions to happen much faster. HDFS supports Java programs and encloses a portable file system with scalability features, distributed across different machines. Parallel and distributed iterative approaches have been considered by researchers to minimize the computation time of the KNN method. The KNN graph has been constructed for large-scale applications with the support of a tree-based structure in the network. A simple and efficient parallel application has been used with the MapReduce K-Nearest Neighbor (MR-KNN) and a custom-based application in a cluster (Rui et al. 1996). Gaber et al. proposed an algorithm to extract image features through the Scale Invariant Feature Transform (SIFT) and match them against the database features using the Hadoop platform (Gaber et al. 2016). They found that the retrieval performance is better due to the parallelization capabilities of the Hadoop MapReduce programming paradigm.
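The map-then-reduce pattern behind a parallel KNN search such as MR-KNN can be illustrated with a small single-process sketch. It is not taken from any of the surveyed systems: the function names, the two-dimensional feature vectors and the image identifiers are hypothetical, and a plain Python loop stands in for the cluster. Each "mapper" returns the k closest candidates within its own partition, and the "reducer" merges candidate lists while keeping only the k closest overall.

```python
import heapq
import math


def map_partition(partition, query, k):
    """Map step: return the k nearest (distance, image_id) pairs in one partition."""
    scored = ((math.dist(vec, query), image_id) for image_id, vec in partition)
    return heapq.nsmallest(k, scored)


def reduce_candidates(candidates_a, candidates_b, k):
    """Reduce step: merge two candidate lists and keep the k closest overall."""
    return heapq.nsmallest(k, candidates_a + candidates_b)


def knn_map_reduce(partitions, query, k):
    """Run the map step on every partition, then fold the results together."""
    result = []
    for partition in partitions:
        result = reduce_candidates(result, map_partition(partition, query, k), k)
    return result


# Hypothetical 2-D feature vectors split over two "nodes".
partitions = [
    [("img_a", (0.0, 0.0)), ("img_b", (5.0, 5.0))],
    [("img_c", (1.0, 1.0)), ("img_d", (9.0, 9.0))],
]
nearest = knn_map_reduce(partitions, query=(0.0, 0.0), k=2)
print([image_id for _, image_id in nearest])  # ['img_a', 'img_c']
```

Because every mapper forwards at most k candidates, the data exchanged between the map and reduce stages stays small regardless of the partition size, which is the main reason this decomposition suits a cluster.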


3 Related technologies and approaches

The concepts, frameworks and tools used in our work are discussed in this section, along with their limitations and the solutions to overcome them. We detail MapReduce for indexing and KNN for search and retrieval.

MapReduce, developed by Google, is a software design model for large-scale applications such as image storage and retrieval. This methodology streamlines parallel programming across different nodes. The two main Hadoop components are storage, which is handled by HDFS, and data processing, which is handled by MapReduce. MapReduce is thus the processing component of Apache Hadoop. It processes data in parallel in a distributed environment, as shown in Fig. 2. The data are stored across the cluster on the slave machines, and MapReduce sends the logic to the respective data nodes, where the data reside as HDFS blocks. The processing is executed over smaller chunks of data in multiple locations in parallel, saving time as well as network bandwidth.

KNN is the technique used for finding the distance between the features extracted from the query image and those in the database. KNN is a simple algorithm that stores all the available inputs and classifies new test data based on a comparison measure. Given N feature vectors, the KNN algorithm finds the K nearest neighbours of ‘c’, irrespective of the labels, where ‘c’ is the query feature vector. KNN works better with a smaller number of input variables, and the main drawback of the algorithm is the complexity of searching for the nearest neighbours of each sample. Several distance measures can be used for calculating the distance between the features. The most common of them is the Euclidean distance, which is the straight-line distance between two points. The Manhattan distance, on the other hand, is the distance between two points calculated as the sum of the absolute differences of their Cartesian coordinates. Cosine similarity is another popular measure, which is the normalized dot product of the two feature vectors.

If the two points are represented as $a(x_1, y_1)$ and $b(x_2, y_2)$ in the xy-plane, then the Euclidean distance is calculated as:

$$\sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2} \tag{1}$$

Similarly, the Manhattan distance is calculated as:

$$|x_1 - x_2| + |y_1 - y_2| \tag{2}$$

Run time and memory consumption are the major issues, and an Apache Spark environment helps to parallelize the KNN algorithm as an iterative MapReduce process. Apache Spark is an open-source distributed cluster computing framework containing multiple algorithms, including K-means, linear regression etc. However, we need to implement KNN in an optimized manner, and the details are discussed in the next section.

4 Proposed method for image indexing and searching

We propose to use CBIR with a Spark-based image retrieval mechanism for huge data sets. Our framework uses HDFS for efficient storage of the feature vectors. The K-NN algorithm, when implemented with the cache method, performs the distance computation and retrieval in an optimized manner. Our framework coordinates the MapReduce and Spark architectures to speed up the CBIR system. Our approach brings two advantages:

(a) MapReduce parallelization provides quicker indexing of large-scale image databases, and HDFS helps reduce the time and memory needed to store the images on different clusters
(b) The K-NN method optimized on the Spark framework identifies the neighbouring images and ranks them for the retrieval module of the CBIR system
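The distance measures that the K-NN search relies on (Euclidean, Manhattan and cosine similarity) can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the authors' implementation: the function names are our own, the two-point formulas are generalised to vectors of any length, and the cosine function assumes that neither vector is all zeros.

```python
import math


def euclidean(a, b):
    """Straight-line distance, Eq. (1) generalised to n dimensions."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences, Eq. (2) generalised to n dimensions."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Normalised dot product; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


a, b = (1.0, 2.0), (4.0, 6.0)
print(euclidean(a, b))                            # 5.0
print(manhattan(a, b))                            # 7.0
print(cosine_similarity((1.0, 0.0), (0.0, 1.0)))  # 0.0
```

Note that the first two functions are distances (smaller means more similar), whereas cosine similarity grows with similarity, so a ranking step has to sort in the opposite direction when it is used.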
The metadata information with query indexing for Spark processing is depicted in Fig. 3.

Fig. 4  LBP texture analysis

In the CBIR system, Spark connects the frameworks with the segmentation of the large-scale image scaling system in the distributed framework. The MapReduce framework uses HDFS in the memory-centric distributed system, with multiple read/write operations carried out at optimized speed. The application programming interface (API) of the DataFrame and the Resilient Distributed Dataset (RDD) provides the path specification in the Spark protocol. The proposed CBIR Spark system with the indexing module enhances the usage of interaction with the key/value pairs. The indexing step provides the


large-scale images to be utilized with the MapReduce parallelism on the Spark setup. The in-memory specification determines the time for storing the image features. They have to be retained with the ‘write’ action, and the throughput is observed to increase gradually with the on-demand memory requirements.

The retrieval phase is the second part of the CBIR system. It prescribes the interaction between the query vectors and the most feasible dataset features, identified as vector representations during input file creation. The “map” phase of our proposed system computes the KNN on different clusters, followed by multiple “reducers” that take care of identifying the definitive neighbours from the map phase output.

Fig. 5  Dataset categories

Table 1  Image indexing time consumption across different images

Images    Apache AVRO (s)    Proposed system (s)
2000      260                180
5000      510                438
10,000    1003               856
15,000    1490               1190
20,000    1980               1465

Table 2  Different image indexing techniques for 12,000 images

Approach                                 Methods                                                              Time (s)
Centralized method                       Sequence based method                                                700.000
Distributed image segmentation method    Single node HBase method and MapReduce control over single nodes     290.000
Distributed image segmentation method    Multiple node HBase method and MapReduce control over 12 nodes       2.98
Distributed image segmentation method    Multiple node HBase method and MapReduce control over 24 nodes       1.98
Proposed KNN approach                    KNN method with Spark                                                220

4.1 Image indexing and storage

The image indexing phase covers the Spark usage with HDFS and takes care of allocating the data with the predefined consumption of data management. The input images are loaded using the file system with in-memory file allocation in the memory management. Color images are first converted to gray scale and scaled to a fixed size. Then the features are extracted from the image. Texture gives details on the spatial arrangement of pixel values in a selected region or a whole image, and helps in image segmentation or image classification. Compared to image color and shape features, texture features give better results in a CBIR system and so are preferred in our work. Local binary pattern (LBP), one of the effective image descriptors, is used for texture feature extraction. Figure 4 represents the LBP texture analysis for 6 and 8 neighbours.

The local LBP representation is made by comparing every pixel value with its neighbouring values, producing a binary result.
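The pixel-versus-neighbour comparison that underlies LBP can be sketched as follows. This is the basic 8-neighbour textbook formulation on a 3 × 3 window, chosen here for illustration; it is not necessarily the exact variant used in this work, and the function names, the neighbour ordering and the "greater than or equal" threshold are assumptions of the sketch.

```python
def lbp_code(image, row, col):
    """8-neighbour LBP code of the pixel at (row, col) in a 2-D list of gray levels."""
    centre = image[row][col]
    # Neighbour offsets, clockwise from the top-left corner (arbitrary but fixed order).
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dr, dc) in enumerate(offsets):
        # Each neighbour contributes one bit: 1 if it is at least as bright as the centre.
        if image[row + dr][col + dc] >= centre:
            code |= 1 << bit
    return code


def lbp_histogram(image):
    """256-bin histogram of LBP codes over all interior pixels (a texture feature vector)."""
    hist = [0] * 256
    for r in range(1, len(image) - 1):
        for c in range(1, len(image[0]) - 1):
            hist[lbp_code(image, r, c)] += 1
    return hist


# Hypothetical 3x3 gray-level patch: bright corners, dark edges.
patch = [
    [9, 1, 9],
    [1, 5, 1],
    [9, 1, 9],
]
print(lbp_code(patch, 1, 1))  # 85, i.e. binary 01010101
```

The histogram of these codes over an image is what would typically be stored as the texture feature vector and later compared with the query image's histogram.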


The LBP value at pixel (a, b) is given by

$$\mathrm{LBPS}(a, b) = \sum_{j=0}^{(K/2)-1} V\left(G_j - G_{j+(K/2)}\right) 2^{j}, \tag{3}$$

where the value V(X) is defined as

$$V(X) = \begin{cases} 1 & \text{if } X > 0 \\ 0 & \text{otherwise} \end{cases}$$

and $G_j$ and $G_{j+(K/2)}$ represent the gray levels of symmetric pairs of points, equally spaced over the region of the image. The neighbourhood of each pixel covers ‘N’ pixels at a radius of ‘R’. In the MapReduce framework, the map stage loads the images from HDFS and extracts the feature vectors, while the reduce stage saves these vectors in the clusters of the Hadoop file system.

The Mapper job is executed by loading the training RDD in the HDFS setup, with the RDD text file mapped in terms of the instructional vector. The Resilient Distributed Dataset creates MLlib vectors for all the features. The LBPS uses the job control factors to be done with the sequence number calculation with the transformation signs in the network. The feature formation with index formation has been used in the RDD zip file for extracting the feature vectors. Finally, the distance between the feature vectors in the database and the query image is computed and sorted for retrieval. While the mappers and reducers run in parallel, it is the shuffle module in between them that takes care of rearranging the data for the subsequent steps. The results from the mappers having the same key are sent to the same reducer by the shuffle module.

Table 3  Searching time comparison for different datasets

Total number of images    KNN with Spark (s)    KNN with Spark and HDFS in-memory computation (s)
2000                      183                   115
5000                      440                   270
10,000                    820                   510
15,000                    1190                  710
20,000                    1510                  895

Fig. 6  Search time for different data sets (x-axis: total number of images; y-axis: time; series: K-NN with Spark, K-NN with Spark and HDFS in-memory computation)

Fig. 7  Indexing time for different data sets (x-axis: number of nodes; y-axis: time; series: HDFS, other SVM and AVRO)

Fig. 8  Comparison of different searching methods

4.2 Image ranking and retrieval

Once the search is complete, it is equally important to rank the images before they are displayed in order. The image


content descriptors, such as the texture features, that match the query image are ranked higher than those with a lesser match. Precision indicates the proposed CBIR system’s ability to retrieve only relevant images, i.e. the fraction of retrieved images that are relevant, while recall designates the fraction of the relevant instances in the database that are retrieved.

4.3 Spark based KNN algorithm implementation

The algorithm for implementing the K-NN algorithm in Apache Spark is detailed below:

5 Results and discussions

We have used the Corel-10k dataset, which has 10,000 images in different categories. The contents include sunset, buildings, car, beaches, flowers, fish, mountains, food, door etc. The images are in JPEG compressed format with size varying from 192 × 128 to 128 × 192. As part of the pre-processing stage, we decompress the images and rescale them before converting them to gray scale and storing them for the subsequent feature extraction. A glimpse of the image set is shown in Fig. 5. We have tested our proposed solution on both a single-node and a multi-node cluster. Both the master and the slave are present in the same node in the case of a single-node cluster, and they exist independently in a multi-node cluster. Table 1 shows the time taken for indexing the images present in the dataset.

We have also compared the indexing time across different methods discussed in the literature, and we find that our method takes the least time for indexing as compared to the existing ones. The performance of the proposed system is established with the indexing followed in HDFS, and the system is compared against various segmentation approaches as follows:

• Centralized method
• Distributed image segmentation method single node


• Distributed image segmentation method multiple node (12)
• Distributed image segmentation method multiple node (24)
• Proposed KNN approach
• Proposed KNN approach with HDFS and Spark.

This is shown in Table 2.

We also studied the time taken by the searching module while using Spark and HDFS in-memory computation. This is presented in Table 3.

The cache module, with the KNN algorithm running in parallel, helps avoid the re-computation of RDDs as discussed before. As the data set size increases, the cache method makes a huge difference in search time, as is evident from Table 3.

The average time consumed has been used in the computational parameter estimation based on different image sizes. The KNN algorithm has been used in the computation time estimation with the in-memory estimation. The average time consumed for distance measurement across 20,000 images is estimated as 760 s.

We observed a 42% reduction in time using the proposed technique as compared to the traditional KNN method. Also, when more nodes are available for indexing and retrieval, the time taken for processing and retrieval decreases accordingly. Figures 6 and 7 represent the time consumed for searching with different data sets and the time taken for indexing using cluster-based images.

Figure 8 represents the comparison of different searching methods for 40,000 images in the dataset. HDFS with the proposed MapReduce framework helps to index and retrieve images quicker than the traditional CBIR methods.

6 Conclusion and future directions

CBIR is an attractive area of research. While most research focuses on effective feature extraction, we have worked towards faster indexing and retrieval in this work. MapReduce programming is preferred in recent works due to its suitability for distributed large-scale data processing. Similarly, on the retrieval side, KNN has gained popularity due to its efficiency in classifying unlabelled data. We considered processing KNN queries on huge datasets, where the index is maintained in a computing cluster. The MapReduce framework offers quicker indexing, and a parallel KNN algorithm with the cache method helps to retrieve images faster than native methods. We have addressed problems that arise due to MapReduce characteristics and also used the cache method in the KNN algorithm for better results. Multi-node clusters have improved the run time considerably compared to single-node clusters. Overall, the discussed approach gives a clear and detailed explanation of the algorithms for processing KNN on MapReduce. This work can also be used as a guide for KNN-based problems, not just in the CBIR domain but for any practical problem in the big data context. Every work opens up new challenges, and selecting the right number of clusters along with tuning the value “K” in KNN would lead to further research.

References

Gaber H, Marey M, Amin SE, Tolba MF (2016) Content Based Image Retrieval with Hadoop. In: Proceedings of the international conference on advanced intelligent systems and informatics, AISI
Hedjazi MA, Kourbane I, Genc Y, Behloul A (2018) A comparison of hadoop, spark and storm for the task of large scale image classification. In: 26th signal processing and communications applications conference, pp 1–4
Jalaja K, Bhagvati C, Deekshatulu BL, Pujari AK (2005) Texture element feature characterizations for CBIR. In: IEEE
Kekre HB, Thepade SD, Sarode TK, Suryawanshi V (2010) Image retrieval using texture features extracted from GLCM, LBG and KPE. Int J Comput Theory Eng 2(5):695
Kwak N, Choi CH, Choi CY (2001) Feature extraction using ICA. In: Proceedings of the international conference on artificial neural networks, pp 568–573
Lin HC, Chiu CY, Yang SN (2003) Finding textures by textual descriptions, visual examples, and relevance feedbacks. Pattern Recognit Lett 24(14):2255–2267
Liu GH, Yang JY (2013) Content-based image retrieval using color difference histogram. Pattern Recognit 46:188–198
Liu Y, Zhang D, Lu G, Ma WY (2007) A survey of content-based image retrieval with high-level semantics. Pattern Recognit 40:262–282
Lowe DG (2004) Distinctive image features from scale-invariant keypoints. Int J Comput Vis 60(2):91–110
Manoharan S, Sathappan S (2012) A comparison and analysis of soft computing techniques for content based image retrieval system. Int J Comput Appl 59(13):0975–8887
Mezzoudj S, Behloul A, Seghir R, Saadna Y (2019) A parallel content-based image retrieval system using spark and tachyon frameworks. J King Saud Univ Comput Inf Sci. https://doi.org/10.1016/j.jksuci.2019.01.003
Mirmehdi M, Palmer PL, Josef K (1998) Optimising the complete image feature extraction chain. In: Proceedings of the third asian conference on computer vision, vol 2, pp 307–314
Rui Y, Huang TS, Mehrotra S (1996) Relevance feedback techniques in interactive content-based image retrieval. In: Storage and retrieval for image and video databases VI, pp 25–36
Smeulders AWM, Gupta A (2000) Content-Based Image Retrieval at the End of the Early Years. In: IEEE transactions on pattern analysis and machine intelligence, vol 22, no. 12, pp 1349–1380
Smith JR, Chang SF (1996) Automated binary texture feature sets for image retrieval. In: Proceedings IEEE international conference on acoustics, speech and signal processing
Talib A, Mahmuddin M, Husni H, George LE (2013) A weighted dominant color descriptor for content-based image retrieval. J Vis Commun Image R 24:345–360
Tamura H, Mori S, Yamawaki T (1978) Textural features corresponding to visual perception. In: IEEE transaction systems, man, and cybernetics, vol 8, no. 6, pp 460–472
Zhou W, Li H, Tian Q (2017) Recent advance in content-based image retrieval: a literature survey. Multimedia (cs.MM); Information Retrieval (cs.IR)
