
K-mean Index Learning for Multimedia Datasets

Sarawut Markchit
Department of Digital Technology
Faculty of Science and Technology
Suratthani Rajabhat University, Thailand
sarawut.mar@sru.ac.th

Abstract—One way to deal with the storage and computation demands of multimedia retrieval applications is approximate nearest neighbor (ANN) search, and hashing algorithms and vector quantization (VQ) are widely used in ANN search. K-mean clustering is a VQ method that can address these problems. Multimedia data keeps growing in views such as text, image, video, audio, and 3D, which is why multimedia retrieval is so important: we want to retrieve results of each media type by inputting a query of that type. Although many hashing algorithms and VQ techniques have been proposed to produce compact or short binary codes, exhaustive search is impractical for real-time purposes, and Hamming distance computation in the Hamming space suffers from inaccurate results. The challenge of this paper is how to learn a representation of raw multimedia data or features so that each media type can be searched for multimedia retrieval. We propose a new search method that utilizes K-mean hash codes by computing the probability of each cluster in the index code. The proposed method employs the K-mean cluster number, converted to a hash code, as the index code, and constructs an inverted index table based on the K-mean hash codes. We then improve the accuracy and efficiency of the original K-mean index by learning a deep neural network (DNN). We performed experiments on four benchmark multimedia datasets, retrieving each view (3D, image, video, text, and audio) with hash codes produced by K-mean clustering. Our results show that the method effectively boosts performance over the baseline (exhaustive search).

Keywords—inverted indexing, multimedia retrieval, nearest neighbor search, binary code index

I. INTRODUCTION

ANN search plays a basic role in information retrieval (IR) and machine learning (ML), and much recent research on multimedia retrieval applications builds on it. It has become increasingly important because it lets us retrieve various types of media. Multimedia data normally comes in many views, and these views may carry correlated semantic information, such as text-image pairs on Flickr or video-tag pairs on YouTube. Because multimedia data has grown rapidly in recent years, indexing and retrieving large-scale, high-dimensional multimedia data effectively is essential; exhaustive search is impractical because it consumes enormous computation resources. Hashing algorithms and VQ techniques can address this issue by representing the multimedia data with binary or compact codes.

The main point of the hashing technique is to represent data points by short or compact binary hash codes in a Hamming space instead of the raw data space. In general, homologous data points should be represented by similar binary codes, and the inter/intra-class correlations can be learned by hash functions or VQ methods. Finding NNs by Hamming distance has small time complexity and minimal memory consumption because bit operations are supported in hardware. However, the information loss during quantization is a critical problem from which hashing methods normally suffer. Although various hashing learning methods have reduced this loss, a big gap still remains between a binary code and its corresponding real-valued vector, so ANN search in the Hamming space of binary codes is less accurate than search in the real-valued Euclidean space.

Many existing indexing methods for large-scale, high-dimensional data points are very complex, and their algorithms are hard to implement; unfortunately, this complexity does not always translate into better performance. These observations inspired us to take a simple route: step back and look at how to create an index for multimedia datasets. This paper proposes a basic idea that benefits from an index structure for high-dimensional multimedia datasets by applying an effective index scheme over K-mean binary hash codes. Our scheme exploits a small binary code (256 clusters, 8 bits) as the index code. An inverted index structure is generated by putting reference data points with the same code into the same list of the inverted table. Given a raw query, a learned model estimates the relevance score of each cluster (index code), reflecting how the cluster probabilities of the NNs (ground truth) are apportioned for that query. A deep neural network creates this learned model by learning a nonlinear mapping between the queries of each view and the index space of that view. The index table is then explored for the top relevance scores, and the highest-ranked index codes supply high-quality candidates. The K-mean clustering algorithm was adopted on four widely-used benchmark datasets to evaluate the proposed index scheme. The experimental results show that our method improves search performance effectively, in terms of both retrieval precision and time complexity. Establishing the index scheme on binary index codes generated by K-mean clustering yields the following benefits:

• Given a raw query, our learned model estimates the relevance scores of the index codes of the same view to obtain highly precise clusters, instead of ranking by the imprecise original Hamming distances of the K-mean index.

• The inverted index table achieves sub-linear time complexity, whereas the exhaustive search incurs linear time complexity.
The rest of the paper is organized as follows. Section 2 discusses related work on K-mean clustering and indexing. Section 3 presents the proposed K-mean index learning scheme and search method. Section 4 shows the experimental results. Concluding remarks are given in Section 5.
II. RELATED WORK

The data quantization (DQ) technique has been widely applied to achieve efficient ANN search in large-scale, high-dimensional multimedia retrieval. The goal of DQ is to encode the original data into compact codes. Compact-code research can be categorized into two groups [10]: VQ and hashing algorithms. Both are typically made less effective by the accumulated quantization loss, which is one of the main challenges. Embedding the raw data into compact binary codes is the main goal of the hash functions in the hashing approach; the Hamming distance between two binary codes (in the binary space) can be computed much faster than the Euclidean distance between two real-valued vectors (in the original space). In VQ [5], the key operation is to encode a vector as a binary codeword. For a given code length, this approach suffers slower encoding and distance computation than the hash learning approach, but it loses less information. Each input vector is a point in an n-dimensional space, and the vector quantizer is defined by a partition of this space into a set of non-overlapping n-dimensional regions; similar vectors are assigned to the same codebook entry. The set of reference vectors in the codebook is called the code vectors. Unfortunately, finding the best codebook to represent a set of input vectors is NP-hard: obtaining the best possible codewords requires an exhaustive search, and the search complexity grows exponentially as the number of codewords increases.

K-mean clustering [7] is a VQ method and one of the simplest and most popular unsupervised ML algorithms. Unsupervised algorithms make inferences from datasets using only the input vectors, without answers or labels. K-mean clustering is mainly used for exploratory data mining, and it is applied in many fields such as IR, ML, pattern recognition, bioinformatics, image analysis, data compression, and computer graphics.

K-mean clustering has been applied in various lines of research. In [6], a multiple K-mean assignment was utilized to compute compact hash codes, giving an efficient method for visual descriptor retrieval; it addresses the ANN search problem for local and global visual content descriptors and was evaluated on three large-scale public datasets: CIFAR-10, MNIST, and up to one billion descriptors (BIGANN). K-mean was also optimized with a multi-objective genetic algorithm to increase its performance on the Iris and Wine data [12]. K-mean clustering and the Fuzzy C-mean algorithm were employed to detect the range and shape of tumors in brain MR images in [8]. A gray-gradient maximum entropy method was used to extract features from images, and the K-mean method was then applied to classify them [9]. K-mean and Fuzzy C-mean were employed for the mammography image segmentation of breast cancer [11]. K-mean and SOM are both significant clustering approaches, so Binh et al. [13] applied them in a two-step approach: the first step uses SOM to cluster the data set, and the second step employs K-mean; the PSO technique was also applied to find the optimum radius for determining the centers of the subtractive clustering approach.

One of the most widely used techniques for vector-space indexing is the conventional tree-based index structure. However, its performance degenerates easily, and it usually consumes a large amount of memory on high-dimensional, large-scale datasets. The Inverted file system with asymmetric distance computation (IVFADC) [18] is a framework that can handle billion-scale datasets efficiently. IVFADC has two steps: first a coarse quantization using K-mean, followed by a residual quantization using product quantization (PQ). Fast Euclidean distance computation via inverted table lookup is employed by asymmetric distance computation (ADC) [19]. "Asymmetric" means that the reference data and the query do not live in the same space: the reference data are kept in the binary space, while the query is kept in its raw or original space, so there is no quantization loss on the query side. ADC can yield a more precise evaluation than symmetric distance computation (SDC), which quantizes both the query and the reference data.

Gordo et al. [20] presented two asymmetric distance computation (ADC) schemes, based on a statistical expectation and a lower bound, that approximate the metric in the original space while employing the Euclidean distance in an intermediate space. Wang et al. [21] optimized ADC by returning learned distances between a query and binary codes through distance functions. An ANN search method with PQ-based indexing was proposed in [16]; it builds cluster-based index structures that generate high quantization levels from a compact codebook set, and then learns weighting relations between query-dependent features and cluster relevance to rank the relevance scores derived from the weighted features w.r.t. the query. The inverted multi-index (IMI) data structure was adopted for indexing binary codes in [17], taking the code distributions among different bits into account to optimize the performance of the IMI structure; the experiments were run on one view of billion-level datasets. Another work [15] proposed a search method for cross-modal datasets that utilizes an index probability over binary hash codes; its experiments were performed on two well-known multimedia datasets to retrieve text and images across different views.
III. K-MEAN INDEX LEARNING

Cluster analysis is one of the main analytical methods in IR and data mining. It is a descriptive task that attempts to identify homogeneous groups of objects. K-mean clustering is the most popular partitional clustering method for this well-known problem. The K-mean algorithm is iterative, simple, and a very easy way to classify a given data set. It aims to separate the dataset into K pre-defined clusters (subgroups), with each data point assigned to only one group. The main idea is to keep different data as far apart as possible while grouping similar data points together within a cluster. The algorithm assigns each data point to the cluster that minimizes the sum of squared distances between the cluster centroid and the data points; higher similarity among the data points of a cluster means smaller distinctions within it. The algorithm is optimized by minimizing an objective function known as the squared error function:

K(V) = \sum_{i=1}^{C} \sum_{j=1}^{C_i} \left( \lVert X_i - V_j \rVert \right)^2        (1)

where \lVert X_i - V_j \rVert is the Euclidean distance between X_i and V_j, C_i is the number of data points in the i-th cluster, and C is the number of cluster centers.
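To make the coding step concrete, the following is a minimal sketch of producing the 8-bit K-mean index codes. It is not the authors' released code: scikit-learn and the random placeholder features are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Placeholder reference features (e.g., 150-D edge histograms); real data would go here.
features = np.random.rand(10000, 150)

# 256 clusters so that each cluster number fits in an 8-bit index code.
kmeans = KMeans(n_clusters=256, random_state=0).fit(features)
cluster_ids = kmeans.labels_                        # cluster number of each point, 0..255

# The cluster number doubles as the 8-bit binary index code x_i.
index_codes = [format(int(c), "08b") for c in cluster_ids]
```

Choosing 256 clusters keeps each index code at exactly one byte, which is what keeps the inverted table described below so small.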
DNN. Note that we try to find the most compact network that
To evaluate our proposed K-mean index learning for the yields the highest MAP for each dataset. So we used the same
first step, the index structure is constructed for keeping the number of units of hidden layers as 8-16-8 where the input
reference data points into inverted lists of the index table. Then layer is the dimension number of raw query and output is the
a learned model is trained by DNN to estimate the relevance number of clusters to outperform the compared baseline.
scores of the given query. The model can perform more The input layer receives the raw query or feature represen-
precisely and efficiently on the four well-known multimedia tation of qjθ , and the output layer estimates the 256 relevance
datasets, as elaborated in the following. θ
A. Index Construction and Learning

Figure 1 shows our search framework for retrieving the NNs of a query. The proposed method consists of two parts: training and searching. In the training part, the reference dataset of each view is encoded into binary codes by the K-mean clustering method, and the reference data are collected into the lists of an inverted index table. We then train our learned model to estimate the relevance score of each index code in the index table. In the searching part, we rank the index codes for a submitted query based on the relevance scores estimated by the learned model. Finally, the candidates in the top-ranked index codes are output as the approximate NNs of the query. The following two parts elaborate on the proposed method.

In the training part, we apply the K-mean clustering method to generate binary codes for the multimedia reference datasets over their various views, such as the image, text, video, audio, and 3D views. Assume there are N binary codes of length c for a reference dataset, denoted as B = {b_i ∈ {0, 1}^c | i = 1, 2, ..., N}; notice that the K-mean clustering method provides these binary codes. Without any extra information loss, we use the 8 bits (256 clusters) of the binary code b_i as the index code x_i ∈ {0, 1}^8. An inverted index table with 256 entries is constructed from the index codes, where each entry E_X = {b_i | x_i = X} represents a particular index code X and contains the set of associated reference data points.
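A minimal sketch (assumed structure, not the paper's released code) of the 256-entry inverted table: each entry E_X simply lists the reference ids whose index code equals X.

```python
from collections import defaultdict

def build_inverted_index(cluster_ids):
    """Entry E_X = {b_i | x_i = X}: reference ids grouped by their 8-bit index code."""
    table = defaultdict(list)
    for ref_id, code in enumerate(cluster_ids):
        table[int(code)].append(ref_id)
    return table

# Usage with the sketch above: table = build_inverted_index(cluster_ids)
# table[42] then holds every reference point whose index code is 42 (00101010).
```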
computation time of the Hamming distance.
Our learned model is trained to learn a nonlinear mapping
between the raw query of a view (e.g., videos) and the index The candidate set G is normally a fraction of the reference
space of the same view (e.g., videos) through DNN. The rele- dataset B, so the time complexity can be decreased comparing
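Eq. (2) translates directly into code. The sketch below is illustrative; relevant_ids (the ground-truth relevant examples of one query) and table (the inverted index sketched above) are assumed inputs.

```python
import numpy as np

def relevance_targets(relevant_ids, table, n_codes=256):
    """R_jX: fraction of entry E_X occupied by the query's relevant examples."""
    targets = np.zeros(n_codes)
    relevant = set(relevant_ids)
    for X, entry in table.items():
        if entry:                                   # skip empty entries (|E_X| = 0)
            targets[X] = len(relevant & set(entry)) / len(entry)
    return targets
```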
A fully-connected DNN is trained on this set to relate a raw query to the index codes. Note that we tried to find the most compact network that yields the highest MAP for each dataset; using hidden layers of 8-16-8 units, with an input layer of the raw-query dimensionality and an output layer of the number of clusters, was enough to outperform the compared baseline.

The input layer receives the raw query or feature representation of q_j^θ, and the output layer estimates the 256 relevance scores of the index codes, {P_{jX}^θ}.

The error derivative with respect to the output of each neuron is computed from the cross-entropy loss between the predictions {P_{jX}^θ} and the targets {R_{jX}^θ}, and is backward-propagated through each layer to adjust the neural network weights.
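A sketch of the described network in Keras. The paper states the implementation is in Python but does not name a framework or optimizer, so tf.keras and Adam here are assumptions; only the 8-16-8 hidden sizes, ReLU/softmax activations, 256-way output, and cross-entropy loss come from the text.

```python
import tensorflow as tf

def build_model(input_dim, n_codes=256):
    # 8-16-8 hidden units with ReLU; softmax over the 256 index codes.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(8, activation="relu", input_shape=(input_dim,)),
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(8, activation="relu"),
        tf.keras.layers.Dense(n_codes, activation="softmax"),
    ])
    # Cross-entropy between predictions {P_jX} and targets {R_jX}; optimizer assumed.
    model.compile(optimizer="adam", loss="categorical_crossentropy")
    return model
```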
B. NN Search

In the searching part, given a raw query q, the learned model is utilized to estimate the relevance scores {P_X} of the index codes. We pick the top-R index codes {X_1, X_2, ..., X_R} from the index codes ranked by relevance score. We then retrieve a candidate set G = {b_i | x_i ∈ X_r, r = 1, 2, ..., R} by collecting the reference data points associated with the high-ranking index codes. The time complexity of the ANN search relates mainly to two parts: the relevance score estimation and the index code ranking. The relevance score estimation takes constant time, depending on the neural network size.

Sorting all index codes by relevance score for the index code ranking takes 2^8 · log 2^8 = 8 · 2^8 operations. We do not re-rank the candidate set after the index ranking, so the constant time of the Hamming distance computation can be skipped there.

The candidate computation is an exhaustive search that computes the Hamming distances between all candidates and the query; it spends e · |G|, where e is the small constant cost of one Hamming distance computation.

The candidate set G is normally only a fraction of the reference dataset B, so the time complexity decreases significantly compared with the exhaustive search. Remarkably, the quality of our candidate set is extremely precise, as demonstrated in the next section.
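The search step above might look like the following sketch, reusing the hypothetical model and table from the earlier snippets.

```python
import numpy as np

def search(query, model, table, top_r=10):
    """Rank index codes by predicted relevance and gather the candidate set G."""
    scores = model.predict(query[np.newaxis, :], verbose=0)[0]   # {P_X}, length 256
    top_codes = np.argsort(scores)[::-1][:top_r]                 # top-R index codes
    # G = {b_i | x_i in X_r, r = 1..R}: union of the top-ranked inverted lists.
    return [ref_id for X in top_codes for ref_id in table.get(int(X), [])]
```

Because G is only a fraction of B, this lookup replaces the linear scan of the exhaustive baseline.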
Fig. 1. Our search framework for retrieving the NNs of a query. There are two parts: training and searching.

IV. EXPERIMENTS

In this section, four widely-used real-world datasets are employed in the experiments to objectively and fully verify the effectiveness of the proposed K-mean index learning. The comprehensive results show the effectiveness of our approach when compared with the baseline method.

A. Datasets
Fig. 2. Examples of the multimedia datasets.
Figure 2 shows examples of each multimedia data type. To evaluate the proposed method, we run experiments on four widely-used benchmark datasets using the K-mean clustering algorithm. The benchmark datasets include MIRFlickr [1], WiKi [2], and NUS-WIDE [3], each of which consists of a text view and an image view. The other is the XMedia dataset [4]; notably, it consists of 5 views for evaluating multimedia retrieval, which is quite challenging, with 3 more views: a video view, an audio view, and a 3D view. Tables I and II summarize the properties of the four benchmark datasets, for which we use K-mean clustering to produce the binary index codes.

The Wikipedia dataset [2] has been selected and reviewed by Wikipedia's editors since 2009 from a collection of 2700 "featured articles" in 29 classes. The 10 most populated classes were selected for the Wikipedia dataset, giving a total of 2866 documents; 2173 documents are used for the training set and 693 for the testing set. All text-image pairs fall into 10 semantic classes. In more detail, each image is represented by a BoVW histogram over a 128-codeword SIFT codebook, and each text by a histogram of a 10-topic LDA model. Pairs sharing the same label are defined as ground-truth neighbors.

The MIRFlickr dataset [1] originally contains 25000 instances collected from the Flickr website. Each instance consists of an image with its textual tags and at least one of 24 predefined semantic labels. Instances without a semantic label, or whose textual tags appear fewer than 20 times, are removed from the dataset. For each instance, a 150-D edge histogram represents the image view, and a 500-D feature vector, derived by PCA from the binary tagging vector w.r.t. the textual tags, represents the text view. We take 836 MIRFlickr instances as the query set and the rest as the reference set; 10000 instances are randomly drawn from the reference set as the training set. Pairs sharing at least one common label are defined as ground-truth neighbors.

The NUS-WIDE dataset [3] originally has 260648 instances; each instance consists of an image and one or more of 81 predefined semantic labels. The 21 most frequent concepts are selected, leaving 195834 image-text pairs. We employ a 500-dimensional hand-crafted bag-of-visual-words (BoVW) vector for each image and a 1000-dimensional bag-of-words vector to represent the text of each point. 2000 data points are selected as the testing set and the remaining points serve as the reference set; we sample 20000 data points from the reference set for training. Pairs sharing at least one common label are defined as ground-truth neighbors.

In the XMedia dataset [4], all media data are collected from Internet websites, including YouTube, Wikipedia, freesound, Flickr, findsound, the Princeton 3D Model Search Engine, and 3D Warehouse. It consists of 5000 images, 5000 texts, 1000 audio clips, 500 3D models, and 1143 videos, all organized into 20 label classes. Both texts and images are randomly split into 4000 media items for training and 1000 for testing. We sample 174 videos as the testing set and use the rest for training, take 100 3D data points from the reference set for testing with the remaining 400 for training, and randomly separate the audio media into 800 documents for training and 200 for testing. In more detail, each image is represented by a BoVW histogram over a 128-codeword SIFT codebook, and each text by a histogram of a 10-topic LDA model. A 29-dimensional MFCC feature represents each audio clip. Each video is first segmented into several shots, and a 128-dimensional BoVW histogram represents each video keyframe. A concatenated 4700-dimensional vector of the LightField descriptor set represents each 3D model. Ground-truth neighbors are defined as pairs sharing the same label.
In more detail regarding Figure 1, we generate an 8-bit binary code for all datasets by employing K-mean clustering. The raw representation of each media item is the input, and the output is its cluster number (256 clusters from 8 bits). We then convert those cluster numbers to binary codes to create the inverted index table and compute the cluster probabilities. Next, the raw queries and the probabilities are used to learn the prediction model for the searching step. The program was run on a PC with an AMD Ryzen 7 3700U CPU @ 2.3 GHz and 12 GB RAM and was implemented in Python.
TABLE I. MULTIMEDIA DATASETS (IMAGE-TEXT PAIRS)

Datasets            WIKI    MIRFlickr   NUS-WIDE
Reference set       2866    15902       195834
Training set        2173    10000       20000
Testing set         693     836         2000
Number of labels    10      24          21
TABLE II. XMEDIA DATASETS

Views               Image   Text    3D      Video   Audio
Reference set       5000    5000    500     1143    1000
Training set        4000    4000    400     969     800
Testing set         1000    1000    100     174     200
Number of labels    20      20      20      20      20
For each 8-bit (256-cluster) K-mean dataset, two kinds of index schemes are implemented for comparison:

• Exhaustive. In the exhaustive search, no index structure is adopted; we use the 8-bit binary codes to calculate Hamming distances between the query and all reference data.

• Learned-index. This is the proposed method. The K-mean index structure and the learned neural network are used to rank index codes. We configured the network layers as I1-H2-H3-H4-O5, where I1 is the input layer (the raw query); H2, H3, and H4 are hidden layers with 8, 16, and 8 units, respectively; and the last layer O5 is the output layer with 256 units (the number of clusters). The hidden layers use ReLU and the output layer uses softmax as the activation functions.
B. Mean average precision (MAP)

A set of queries Q is used to evaluate the retrieval accuracy by employing the mean average precision (MAP) [14]:

MAP@R = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{R} \sum_{j=1}^{R} pr(j) \cdot rel(j),        (3)

where R is the number of retrieved documents, pr(j) denotes the precision of the top j retrieved documents, and rel(j) = 1 if the j-th retrieved document is relevant to the query and rel(j) = 0 otherwise. The relevant documents are defined as those pairs which share the same label or at least one common label. MAP is computed as the mean of the average precision over all queries; a higher MAP means better performance.
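A direct transcription of Eq. (3), under the assumption that retrieved[i] holds the ranked ids returned for query i and relevant[i] the set of its ground-truth neighbors.

```python
import numpy as np

def map_at_r(retrieved, relevant, R=10):
    """MAP@R: mean over queries of (1/R) * sum_{j<=R} pr(j) * rel(j)."""
    ap_values = []
    for ranked, rel_set in zip(retrieved, relevant):
        hits, ap = 0, 0.0
        for j, doc in enumerate(ranked[:R], start=1):
            if doc in rel_set:                      # rel(j) = 1
                hits += 1
                ap += hits / j                      # pr(j): precision of the top-j list
        ap_values.append(ap / R)
    return float(np.mean(ap_values))
```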

Fig. 3. MAP@R on all datasets for each media search.

C. Results and time complexity

Figure 3 and Table III show the results of each media search on the 4 datasets. The X-axis of Figure 3 represents the number of retrieved examples R, and the Y-axis represents MAP@R. Three datasets (WIKI, MIRFlickr, and NUS-WIDE) show the best learning results on the text view, while for XMedia the image view is the best. Our learned index scheme yields MAP curves higher than the original K-mean index code (exhaustive), so we can conclude that the learned model effectively raises accuracy above the baseline scheme. For the 3D view of the XMedia dataset, however, the exhaustive scheme has a higher MAP curve than ours, because the 3D view provides very little training data (only 400 training items) of very high dimensionality (4700-D); with so small a training set over so many dimensions, the learned model cannot extract the features in all cases, which is why its accuracy is low.

In addition, Table IV shows the time complexity in seconds on each media dataset. A smaller time means a lower computation cost, owing to fewer memory-access operations for the small cluster of reference data. Note that on WIKI, MIRFlickr, NUS-WIDE, and the image and text views of XMedia, our time complexity is less than that of the exhaustive search, whose cost grows with the scale of the dataset. Only for the 3D, video, and audio views of XMedia is the exhaustive search cheaper than ours, because those datasets are very small. In real-world applications a dataset is not small, so the exhaustive search will not be faster than our method there; our method is thus suitable for large-scale datasets.
Thus, it can be concluded that our learned-index scheme shows a significant improvement when integrated with the K-mean method: we obtain higher accuracy and a smaller computation cost on bigger datasets.
TABLE III. COMPARISON IN TERMS OF MAP@10 (8-BIT) FOR MULTIMEDIA DATASETS

Datasets          Exhaustive   Ours
WIKI
  Image search    0.0406       0.1112
  Text search     0.1036       0.5676
MIRFlickr
  Image search    0.3999       0.5981
  Text search     0.4640       0.7764
NUS-WIDE
  Image search    0.1683       0.4687
  Text search     0.2281       0.5945
XMedia
  Image search    0.0202       0.1309
  Text search     0.0233       0.0785
  3D search       0.0223       0.0198
  Video search    0.0142       0.1032
  Audio search    0.0169       0.0750
TABLE IV. COMPARISON IN TERMS OF TIME COMPLEXITY (IN SECONDS) FOR MULTIMEDIA DATASETS

Datasets          References   Exhaustive   Ours
WIKI
  Image search    2866         0.1719       0.1240
  Text search     2866         0.1562       0.1406
MIRFlickr
  Image search    15902        1.6126       0.2569
  Text search     15902        1.8600       0.2194
NUS-WIDE
  Image search    195834       45.6833      4.4696
  Text search     195834       35.4615      3.6672
XMedia
  Image search    5000         0.4064       0.2449
  Text search     5000         0.4376       0.2272
  3D search       500          0.0156       0.2340
  Video search    1143         0.0401       0.1400
  Audio search    1000         0.0640       0.2079
V. CONCLUSION

In this paper, we propose an index learning method that utilizes cluster probabilities over the indexed K-mean hash codes of multimedia datasets. The proposed method can effectively increase precision (MAP) while decreasing time complexity, by ranking the K-mean hash-code index of the inverted index table with the learned model. The experimental results show its superiority when compared with the baseline.
REFERENCES

[1] M. J. Huiskes and M. S. Lew, "The MIR Flickr retrieval evaluation," in Proceedings of the ACM International Conference on Multimedia Information Retrieval, pp. 39-43, 2008.
[2] N. Rasiwasia, J. Costa Pereira, E. Coviello, G. Doyle, G. R. Lanckriet, R. Levy, and N. Vasconcelos, "A new approach to cross-modal multimedia retrieval," in Proceedings of the ACM International Conference on Multimedia (ACM-MM), pp. 251-260, 2010.
[3] T. S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y. Zheng, "NUS-WIDE: a real-world web image database from National University of Singapore," in Proceedings of the ACM International Conference on Image and Video Retrieval, p. 48, 2009.
[4] Y. Peng, J. Qi, and Y. Yuan, "Modality-specific cross-modal similarity measurement with recurrent attention network," IEEE Transactions on Image Processing (TIP), Vol. 27, No. 11, pp. 5585-5599, Nov. 2018.
[5] Z. B. Wu and J. Q. Yu, "Vector quantization: a review," Frontiers of Information Technology and Electronic Engineering, Vol. 20, No. 4, pp. 507-524, 2019.
[6] S. Ercoli, M. Bertini, and A. Del Bimbo, "Compact hash codes for efficient visual descriptors retrieval in large scale databases," IEEE Transactions on Multimedia, Vol. 19, No. 11, pp. 2521-2532, 2017.
[7] J. Yadav and M. Sharma, "A review of K-mean algorithm," International Journal of Engineering Trends and Technology, Vol. 4, No. 7, pp. 2972-2976, 2013.
[8] J. Selvakumar, A. Lakshmi, and T. Arivoli, "Brain tumor segmentation and its area calculation in brain MR images using K-mean clustering and Fuzzy C-mean algorithm," in IEEE International Conference on Advances in Engineering, Science and Management (ICAESM-2012), pp. 186-190, Mar. 2012.
[9] P. Shan, "Image segmentation method based on K-mean algorithm," EURASIP Journal on Image and Video Processing, Vol. 1, No. 81, 2018.
[10] J. Wang, H. T. Shen, J. Song, and J. Ji, "Hashing for similarity search: a survey," arXiv:1408.2927, 2014.
[11] M. Y. Kamil and A. M. Salih, "Mammography images segmentation via Fuzzy C-mean and K-mean," International Journal of Intelligent Engineering and Systems, Vol. 12, No. 1, pp. 22-29, 2019.
[12] Y. Arkeman, N. A. Wahanani, and A. Kustiyo, "Clustering K-means optimization with multi-objective genetic algorithm," International Journal of Electrical and Computer Sciences IJECS-IJENS, Vol. 12, No. 5, pp. 61-66, 2012.
[13] P. T. T. Binh, T. N. Le, and N. P. Xuan, "Advanced SOM and K-mean method for load curve clustering," International Journal of Electrical and Computer Engineering, Vol. 8, No. 6, p. 4829, 2018.
[14] Y. Peng, X. Zhai, Y. Zhao, and X. Huang, "Semi-supervised cross-media feature learning with unified patch graph regularization," IEEE Transactions on Circuits and Systems for Video Technology, Vol. 26, No. 3, pp. 583-596, 2015.
[15] C. Y. Chiu and S. Markchit, "Effective and efficient indexing in cross-modal hashing-based datasets," Signal Processing: Image Communication, Vol. 80, p. 115650, 2020.
[16] C. Y. Chiu, J. S. Chiu, S. Markchit, and S. H. Chou, "Effective product quantization-based indexing for nearest neighbor search," Multimedia Tools and Applications, Vol. 78, No. 3, pp. 2877-2895, 2019.
[17] J. Song, H. T. Shen, J. Wang, Z. Huang, N. Sebe, and J. Wang, "A distance-computation-free search scheme for binary code databases," IEEE Transactions on Multimedia, Vol. 18, No. 3, pp. 484-495, 2016.
[18] H. Jégou, M. Douze, and C. Schmid, "Product quantization for nearest neighbor search," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 33, No. 1, pp. 117-128, 2011.
[19] W. Dong, M. Charikar, and K. Li, "Asymmetric distance estimation with sketches for similarity search in high-dimensional spaces," in Proceedings of the ACM International Conference on Information Retrieval (SIGIR), pp. 128-130, 2008.
[20] A. Gordo, F. Perronnin, Y. Gong, and S. Lazebnik, "Asymmetric distances for binary embeddings," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 36, No. 1, pp. 33-47, 2014.
[21] J. Wang, H. T. Shen, S. Yan, N. Yu, S. Li, and J. Wang, "Optimized distances for binary code ranking," in Proceedings of the ACM International Conference on Multimedia (ACMMM), pp. 517-526, 2014.
