https://doi.org/10.1007/s12652-020-01775-9
ORIGINAL RESEARCH
Abstract
Content Based Image Retrieval (CBIR) is a way of querying image databases. CBIR treats the visual properties of an image
as “search terms” and returns pictures from a database that share the same or similar visual properties. Most CBIR
systems in the literature work by extracting the color, texture and shape features of an image, comparing them with those
in the database, and then computing the distance between the image features for retrieval purposes. In this proposed work, we
use a MapReduce framework to index large-scale image collections, and Spark, which runs on top of the MapReduce and
Hadoop distributed file system (HDFS) environment, is used to retrieve the index.
HDFS provides an in-memory data storage and fast retrieval mechanism using the indexing process. The image retrieval
is performed with the K-Nearest Neighbour model using the Apache implementation. The processing time has
been evaluated with the Hadoop framework in CBIR. The proposed approach takes 10% less time to index images than the
distributed image segmentation method discussed in the literature.
D. M. Hussain, D. Surendran
Fig. 2 MapReduce framework
The efficient fast‑response content‑based image retrieval using spark and MapReduce model…
recognition etc. The general CBIR methods and techniques, along with their issues, have been discussed by Smeulders and Gupta (2000), while the recent challenges have been reviewed by Zhou et al. (2017).
Speeding up the CBIR system results is another active area of research on top of the feature extraction techniques. Mezzoudj et al. have discussed a parallel CBIR system using the Spark and Tachyon frameworks (Mezzoudj et al. 2019). Different framework models are discussed in the literature, including the MapReduce model, the HIPI framework and HDFS, to name a few. Apache Spark is a recent MapReduce based framework for processing large-scale data on various nodes, and Hedjazi et al. (2018) have shown that the Spark system works better than Hadoop for batch processing. Hadoop has been used with HBase for image retrieval and as the memory support system. The image retrieval is supported by a cluster environment using a distributed file system. The indexing step is associated with the HBase system in the storage layer in the segmented approach (Tamura et al. 1978).
The MapReduce computing uses the HDFS support system in order to enhance searching in the memory processing and the storage layer (Kekre et al. 2010). Hadoop MapReduce along with HBase provides a searching method with some issues, which have been overcome with the proposed method (Lin et al. 2003). Lowe presented a method for finding features that are invariant to image rotation and scaling, which provides robust matching. Local tetra patterns have been used with the vector techniques to present the features in a CBIR system (Lowe 2004).
The Hadoop MapReduce framework makes the closest path for the query image with the better CBIR system. Apache Spark based on the Avro framework combines the picture files and provides in-memory ordering to allow the actions to happen much faster. HDFS supports Java programs and provides a portable file system with scalability features that is distributed across different machines. Parallel and distributed iterative approaches have been considered by researchers to reduce the computation time of the KNN method. The KNN graph has been constructed for large-scale applications with the support of a tree-based structure in the network. A simple and efficient parallel application has been used with the MapReduce K-Nearest Neighbor (MR-KNN) and a custom based application in a cluster (Rui et al. 1996). Gaber et al. proposed an algorithm to extract the image features through Scale Invariant Feature Transformation (SIFT) and matched them against the database features using the Hadoop platform (Gaber et al. 2016). They found that the retrieval performance is better due to the parallelization capabilities of the Hadoop MapReduce programming paradigm.
3 Related technologies and approaches

The concepts, frameworks and tools used in our work are discussed in this section along with their limitations and the solutions to overcome them. We detail MapReduce for indexing purposes and KNN for search and retrieval purposes.
MapReduce, developed by Google, is a software design model for large scale applications such as image storage and retrieval. This methodology streamlines parallel programming across different nodes. The two main Hadoop components are storage, which is handled by HDFS, and data processing, which is handled by MapReduce. MapReduce is thus the processing component of Apache Hadoop. It processes data in parallel in a distributed environment as shown in Fig. 2. The data are stored across the slave machines of the cluster and MapReduce sends the logic to the respective data nodes, where the data reside as HDFS blocks. The processing is executed over smaller chunks of data in multiple locations in parallel, saving time as well as network bandwidth.
KNN is the technique used for finding the distance between the features extracted from the query image and those in the database. KNN is a simple algorithm that stores all the available inputs and classifies the new test data based on a comparison measure. Given N feature vectors, the KNN algorithm finds the K nearest neighbors of ‘c’, irrespective of the labels, where ‘c’ is the feature vector. KNN works better with fewer input variables, and the main drawback of this algorithm is the complexity of searching the nearest neighbors for each sample. There are several distance measurements that can be used for calculating the distance between the features. The most common of them is the Euclidean distance, which refers to the straight-line distance connecting two points. Manhattan distance on the other hand corresponds to the distance between two points calculated as the sum of the absolute differences of their Cartesian coordinates. Cosine similarity is another popular method, which refers to the normalized dot product of the two attributes.
If the two points are represented as a (x1, y1) and b (x2, y2) in the xy-plane, then the Euclidean distance is calculated as:

$$\sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2} \tag{1}$$

Similarly, the Manhattan distance is calculated as:

$$|x_1 - x_2| + |y_1 - y_2| \tag{2}$$

Run time and memory consumption are the major issues, and an Apache Spark environment helps to parallelize the KNN algorithm as an iterative MapReduce process. Apache Spark is an open source distributed cluster computing framework containing multiple algorithms including K-means, linear regression etc. However, we need to implement KNN in an optimized manner, and the details are discussed in the next section.

4 Proposed method for image indexing and searching

We propose to use CBIR with a Spark based image retrieval mechanism for a huge data set. Our framework uses HDFS for efficient storage of the feature vectors. The K-NN algorithm, when implemented with the cache method, helps to do the distance computation and retrieval part in an optimized manner. Our framework regulates the MapReduce and Spark architecture to speed up the CBIR system. Our approach brings in two advantages:

(a) MapReduce parallelization provides quicker indexing on a large scale image database, and HDFS helps reduce the time and memory needed to store the images on different clusters
(b) The K-NN method optimized on the Spark framework identifies the neighbouring images and ranks them for the retrieval module in the CBIR system
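The distance measures of Sect. 3 (Eqs. 1 and 2 and cosine similarity) and the K-NN search built on them can be sketched as follows. This is a minimal single-machine illustration in plain Python, not the Spark implementation used in this work; the function names and the toy feature vectors are hypothetical, and the vectors are assumed to be non-zero and of equal length.

```python
import heapq
import math


def euclidean(a, b):
    """Eq. 1 generalized to n dimensions: straight-line distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Eq. 2 generalized to n dimensions: sum of absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product normalized by both vector lengths (assumes non-zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def k_nearest(query, database, k, distance=euclidean):
    """Return the ids of the k database vectors closest to the query.

    `database` maps an image id to its feature vector (hypothetical layout).
    """
    ranked = heapq.nsmallest(
        k, database.items(), key=lambda item: distance(query, item[1])
    )
    return [image_id for image_id, _ in ranked]


if __name__ == "__main__":
    features = {"img_a": (0.0, 0.0), "img_b": (3.0, 4.0), "img_c": (1.0, 1.0)}
    print(k_nearest((0.0, 0.0), features, k=2))
```

Passing `distance=manhattan` switches the ranking to Eq. 2 without changing the search itself. Cosine similarity is a similarity rather than a distance, so it would need to be negated or subtracted from 1 before being used for ranking.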
The metadata information with query indexing for Spark processing is depicted in Fig. 3.
In the CBIR system, Spark connects the frameworks with the segmentation of the large-scale image scaling system in the distributed framework. The MapReduce framework uses the HDFS survey in the memory-centric distributed system, with multiple read/write operations carried out at optimized speed. The application programming interface (API) of the data frame and the Resilient Distributed Dataset (RDD) provides the path specification in the Spark protocol. The proposed CBIR Spark system with the indexing module enhances the usage of interaction with the key/value pairs. The indexing step provides the
Fig. 4 LBP texture analysis
Fig. 5 Dataset categories
Table 1 Image indexing time consumption across different images
Columns: Images, Apache AVRO (s), Proposed system (s)
“reducers” that take care of identifying the definitive neighbors from the map phase output.
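The division of labour between the map phase and the reducers can be illustrated with a small sketch. It simulates the two phases in plain Python on in-memory partitions; it is not the authors' Spark code, and the names `map_local_knn`, `reduce_merge` and `distributed_knn` as well as the partition layout are assumptions made for illustration. Each map task returns the k best candidates of its own partition, and the reduce step merges candidate lists into the definitive k neighbors.

```python
import heapq
import math
from functools import reduce


def euclidean(a, b):
    """Straight-line distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def map_local_knn(partition, query, k):
    """Map phase: k nearest candidates inside one partition.

    `partition` is a list of (image_id, feature_vector) pairs; the result is a
    sorted list of (distance, image_id) pairs.
    """
    scored = ((euclidean(query, vec), image_id) for image_id, vec in partition)
    return heapq.nsmallest(k, scored)


def reduce_merge(left, right, k):
    """Reduce phase: merge two candidate lists and keep the k closest."""
    return heapq.nsmallest(k, left + right)


def distributed_knn(partitions, query, k):
    """Run the map phase on every partition, then fold the results together."""
    local_results = [map_local_knn(part, query, k) for part in partitions]
    return reduce(lambda l, r: reduce_merge(l, r, k), local_results, [])


if __name__ == "__main__":
    partitions = [
        [("a", (0.0, 0.0)), ("b", (5.0, 5.0))],
        [("c", (1.0, 1.0)), ("d", (9.0, 9.0))],
    ]
    print(distributed_knn(partitions, (0.0, 0.0), k=2))
```

Because every partition forwards at most k candidates, the reducers handle only k entries per partition regardless of how many images each partition holds. In a Spark setting the same pattern would typically map onto a per-partition transformation followed by a reduction over the partial results.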
Fig. 7 Indexing time for different data sets (axis labels: Time, Number of nodes; legend: K-NN with Spark and HDFS in memory computation)
$$\mathrm{LBPS}(a, b) = \sum_{j=0}^{(K/2)-1} V\left(G_i - G_{i+(k/2)}\right)^{2}, \tag{3}$$

where the value V(X) is given by

$$V(X) = \begin{cases} 1 & \text{if } X > 0 \\ 0 & \text{otherwise} \end{cases}$$
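A texture code of this kind, built by thresholding the difference between diametrically opposite neighbors, can be sketched as below. The sketch assumes the standard center-symmetric LBP form with K = 8 neighbors at radius 1 and binary weights 2^i, which may differ from Eq. 3 as printed; the neighbor ordering and the function names are likewise assumptions. Images are plain nested lists of gray levels.

```python
# Neighbor offsets (row, col) on a radius-1 ring, ordered so that
# OFFSETS[i] and OFFSETS[i + 4] are diametrically opposite.
OFFSETS = [(0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1), (1, 0), (1, 1)]


def v(x):
    """Threshold function V: 1 if the difference is positive, else 0."""
    return 1 if x > 0 else 0


def cs_lbp_code(image, row, col):
    """Center-symmetric code of one interior pixel (assumed 2**i weighting)."""
    half = len(OFFSETS) // 2
    code = 0
    for i in range(half):
        dr1, dc1 = OFFSETS[i]
        dr2, dc2 = OFFSETS[i + half]
        g_i = image[row + dr1][col + dc1]
        g_opposite = image[row + dr2][col + dc2]
        code += v(g_i - g_opposite) * (2 ** i)
    return code


def cs_lbp_histogram(image):
    """Histogram of codes over all interior pixels, usable as a texture feature."""
    half = len(OFFSETS) // 2
    hist = [0] * (2 ** half)
    for row in range(1, len(image) - 1):
        for col in range(1, len(image[0]) - 1):
            hist[cs_lbp_code(image, row, col)] += 1
    return hist


if __name__ == "__main__":
    gray = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    print(cs_lbp_code(gray, 1, 1))
```

With four opposite pairs the code takes one of 16 values, so the histogram is a 16-bin feature vector that can be compared with the distance measures used for K-NN retrieval.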
content descriptors like the texture feature that match the query image are ranked higher than those with a lesser match. Precision indicates the proposed CBIR system’s ability to retrieve images among the relevant ones, while recall designates the fraction of pertinent instances retrieved over the relevant instances in the database.

4.3 SPARK based KNN algorithm implementation

The algorithm for implementing the K-NN algorithm in Apache Spark is detailed below:

5 Results and discussions

We have used the Corel-10k dataset that has 10,000 images in different categories. The contents include sunset, buildings, car, beaches, flowers, fish, mountains, food, door etc. The images are in JPEG compressed format with a size of 192 × 128 or 128 × 192. As part of the pre-processing stage, we decompress the images, rescale them, convert them to gray scale and store them for the subsequent feature extraction process. A glimpse of the image set is shown in Fig. 5. We have tested our proposed solution using both a single-node and a multi-node cluster. Both the master and the slave are present in the same node in the case of a single-node cluster, and they exist independently in a multi-node cluster. Table 1 shows the time taken for indexing the images present in the dataset.
We have also compared the indexing time across different methods that are discussed in the literature, and we find that our method takes the least time for indexing compared to the existing ones. The performance of the proposed system has to be established with the indexing to be followed in the HDFS, and the system is compared against various segmentations as follows:

• Centralized method
• Distributed image segmentation method, single node
• Distributed image segmentation method, multiple node (12)
• Distributed image segmentation method, multiple node (24)
• Proposed KNN approach
• Proposed KNN approach with HDFS and Spark

the algorithms for processing KNN on MapReduce. This work can also be used as a guide for KNN based problems not just in the CBIR domain, but for any practical problem in the big data context. Every work opens up new challenges, and selecting the right number of clusters along with tuning the value “K” in KNN would lead to further research.