
JOURNAL OF COMPUTING, VOLUME 2, ISSUE 9, SEPTEMBER 2010, ISSN 2151-9617

HTTPS://SITES.GOOGLE.COM/SITE/JOURNALOFCOMPUTING/
WWW.JOURNALOFCOMPUTING.ORG 83

Region Based Color Histogram Features for Efficient Web Image Retrieval
Umesh K.K and Suresha

Abstract—In this paper, we present a method that uses region based color histogram features to search efficiently for similar images on the Web. Our search algorithm uses color histogram features of the query image and of the database images. These features are condensed into a small “signature” for each image, and the signature is stored in the database for retrieving similar images. In the retrieval phase, we use a region-to-region distance measure to retrieve similar images. Because only the computed signatures are stored, the approach accommodates a large number of images with little storage. The main advantage of the approach is that it is fast enough to be run on a database of 20,000 images. In this paper we focus on retrieving similar images from color features alone, without considering other properties of the image such as size, shape, or texture. The speed of the retrieval system can be improved by using color histogram feature techniques.

Index Terms— Feature Extraction, Similarity metrics, Web Images, Web Indexing, Image retrieval system.

• Umesh K.K. is with Sri Jayachamarajendra College of Engineering, Mysore.
• Dr. Suresha is a Reader with the Department of Studies in Computer Science, University of Mysore, Mysore.

1 INTRODUCTION

A few commercial image search engines, such as Google Image Search [1], Ditto [2] and AltaVista [3], are available for retrieving images on the Web. They rely on keyword based search techniques to retrieve Web images. Integrated search engines that combine keyword-based search with content based retrieval have since been proposed; WebSeer [4], WebSEEK [5], and Image Rover [6] have been implemented so far. These systems first search for images using a keyword query, and the user then selects query images from the search results. After this selection, the systems search for images that are similar to the query images based on image features. These three systems carry out their search in an interactive manner. The QBIC system [7] and the MARS system [10] further improved texture representation.

In this paper, we propose content based image retrieval (CBIR) techniques to compute the signature of a typical query image for an image search engine [8]. This approach is intended to overcome the difficulties of manual annotation: images can be indexed by their own visual content, such as color, shape, and spatial relations, which are the main features that human beings as well as computers use to recognize images. We have therefore implemented a low level image histogram feature extraction algorithm. The advantage of this algorithm is very fast computation and comparison of the resulting feature vectors.

1.1 Organization
The remainder of this paper is organized as follows. Section 2 discusses region based color feature extraction for an efficient Web image retrieval system, and Section 2.2 describes feature analysis using the region-to-region based similarity measure. Section 3 describes image matching to retrieve similar images from the Web image database for a query image. Section 4 discusses the experimental results and performance. Section 5 summarizes our contributions, along with future work.

2 EXTRACTION OF FEATURES
The goal of the feature extraction process is to transform the input image data into a set of features that gives a compact signature for each Web image. This helps to speed up the Web retrieval system. In digital imaging, a pixel is the smallest unit used to represent an image, and the intensity value of each pixel is variable. The most common pixel format is the byte image, in which each value is stored as an 8-bit integer in the range 0 to 255. Typically 0 is taken to be black and 255 is taken to be white.

2.1 Histogram Features
In this section, we describe how to compute the signature for Web images based on the first-order histogram probability.
The mean, standard deviation, skew, energy, and entropy are used to represent a Web image. The following steps describe the procedure for computing the signature of every Web image.

Algorithm: Color Based Histogram Feature Extraction.

© 2010 Journal of Computing Press, NY, USA, ISSN 2151-9617


http://sites.google.com/site/journalofcomputing/

Definition: We define the first-order histogram probability $P(g) = N(g)/M$, where M is the total number of pixels in a block of the image and N(g) is the number of pixels at gray level g. The gray level ranges from 0 to L-1.

Aim: The main objective of this algorithm is to compute the signature for Web images by using the first-order histogram probability of every region (block) of 8 x 8 pixels.

Input: We have collected 20,000 general-purpose, low resolution, Web-crawled color images of size 128 x 85 and 128 x 96 from the World Wide Web.

Output: Feature vector.

Assumption: We assume that all the images collected from the Web are in JPEG format. N is the number of Web images collected.

For each Web image do

  Step 1. Convert the RGB image into the HSV color space.
  Step 2. Apply contrast enhancement so that the pixel values are brought closer together, which improves the results.
  Step 3. Normalize the size of the image to 128 x 128.

  For each of the NB blocks into which the image is partitioned do

    Step 4. Each block is 8 x 8 pixels; extract a feature vector for each of the NB blocks.
    Step 5. Compute the signature by using a statistical analysis:

    a) The mean is the average of all the intensity values in the two dimensional block, where I(r, c) is the intensity at row r and column c.

       $$\bar{g} = \sum_{g=0}^{L-1} g\,P(g) = \sum_{r}\sum_{c} \frac{I(r,c)}{M}$$

    b) The variance is a measure of the distance of each value from the mean.

       $$\sigma_g^2 = \sum_{g=0}^{L-1} (g - \bar{g})^2\,P(g)$$

    c) The standard deviation $\sigma_g$, the square root of the variance, describes the spread in the data and tells us something about the contrast of the image. A high contrast image has a high variance, and a low contrast image has a low variance.

    d) The skew measures the asymmetry about the mean in the intensity level distribution. It is defined as follows:

       $$\text{skew} = \frac{1}{\sigma_g^3} \sum_{g=0}^{L-1} (g - \bar{g})^3\,P(g)$$

    e) Another method to measure the skew uses the mean, mode, and standard deviation, where the mode is defined as the peak value:

       $$\text{skew}' = \frac{\bar{g} - \text{mode}}{\sigma_g}$$

    f) The energy measure describes how the intensity levels are distributed in the data. It is defined as follows:

       $$\text{Energy} = \sum_{g=0}^{L-1} [P(g)]^2$$

    g) Entropy is a statistical measure that describes how many bits are required to code the image data. It is defined as

       $$\text{Entropy} = -\sum_{g=0}^{L-1} P(g)\,\log_2 [P(g)]$$

  End for
End for

The procedure above computes the feature vector from the first-order histogram probability of each region. Histogram search algorithms characterize an image by its color distribution, or histogram. Many distances have been used to define the similarity of two color histogram representations; Euclidean distance and its variations are the most commonly used [12]. The drawback of a global histogram representation is that information about object location, shape, and texture [13] is discarded. In this section, we have discussed the region based color histogram feature extraction method. The next subsection describes the feature space used to represent an image, and Section 3 discusses the distance and similarity measures for retrieving similar images.

2.2 Feature space analysis
Let N denote the total number of images in the image database. After the computation process we obtain a set of n_i feature vectors. Each of these n_i d-dimensional feature vectors represents the statistical visual features of one region of an image.
Suppose the feature vectors in the d-dimensional feature space are {x_i : i = 1, …, L}, where L is the total number of blocks in the image database. Then

$$d(r_i, c_j) = \min_{1 \le l \le k} d(r_i, c_l)$$

where d(r_i, c_l) is the distance between the region feature r_i and the centroid c_l of cluster l.
The goal of the feature clustering algorithm is to partition the features into k groups with centroids $\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_k$ such that

$$D(k) = \sum_{i=1}^{L} \min_{1 \le j \le k} (x_i - \hat{x}_j)^2$$

is minimized. That is, the average distance between a feature vector and the group with the nearest centroid to it is minimized. Two necessary conditions for the k groups are:

1. Each feature vector is partitioned into the cluster with the nearest centroid to it.

2. The centroid of a cluster is the vector minimizing the average distance from it to any feature vector in the cluster. In the case of Euclidean distance, the centroid should be the mean of all the feature vectors in the cluster.

These requirements lead our feature grouping process to find the k cluster means with the following steps:

1. Initialization: choose the initial k cluster centroids.

2. Loop until the stopping criterion is met:
   a. For each feature vector in the data set, assign it to a class such that the distance from this feature to the centroid of that cluster is minimized.
   b. For each cluster, recalculate its centroid as the mean of all feature vectors partitioned to it.

The initialization and the stopping criterion are critical to the process. We choose the number of clusters k adaptively, by gradually increasing k and stopping when a criterion is met. We start with k = 3. The k-means algorithm terminates when no feature vector changes class. It can be proved that the k-means algorithm is guaranteed to terminate, based on the fact that both steps of k-means reduce the class variance.

3 IMAGE MATCHING
In this section, we describe image matching to retrieve similar images from the image database for a query image. The procedure consists of the following steps.

Step 1: We first locate the clusters of the feature space to which the regions of the query image belong.

Assume that the centroids of the k clusters are {c1, c2, …, ck} and that the query image is represented by the region set R1 = {r1, r2, ..., rm}, where ri is the descriptor of region i. For each region feature ri, we find j such that

$$d(r_i, c_j) = \min_{1 \le l \le k} d(r_i, c_l)$$

where d(r1, r2) is the region-to-region distance defined in our experiment. We create a list of the clusters found in this way, denoted as {C1, C2, C3, …, Ck}. The matching algorithm investigates only these probable clusters to answer the query.

Step 2: To define the similarity measure between two sets of regions, we assume that image 1 and image 2 are represented by the region sets R1 = {r1, r2, ..., rm} and R2 = {r'1, r'2, …, r'n}, where ri or r'j is the descriptor of region i or j, respectively. Denote the distance between regions ri and r'j as d(ri, r'j), which is written as d_{i,j} for short.

Step 3: To compute the similarity measure between the region sets R1 and R2, we first compute all pair-wise region-to-region distances d(ri, r'j) in the two images.

The distance between the two region sets is the sum of all the weighted matching strengths, i.e.,

$$d_{IRRM}(R_1, R_2) = \sum_{i,j} s_{i,j}\, d_{i,j}$$

where s_{i,j} is the matching strength between ri and r'j. In this experiment we assume that the significance of ri in image 1 is f_i and that of r'j in image 2 is f'_j, and we require that

$$\sum_{j=1}^{n} s_{i,j} = f_i, \quad i = 1, \ldots, m$$

$$\sum_{i=1}^{m} s_{i,j} = f'_j, \quad j = 1, \ldots, n$$

4 EXPERIMENTS
This algorithm has been implemented and compared with the first version of our experiment. We tested the system on the general-purpose, low resolution, Web-crawled image database used in WBIIS, which includes about 20,000 pictures stored in JPEG format with size 128 x 85 or 128 x 96. For the comparison, we use only picture features in the retrieval process.

4.1 Performance
On a Pentium(R) 4 CPU 3.00 GHz PC running the Windows operating system, it takes approximately 3 hours to compute the feature vectors for the 20,000 color images of size 128 x 85 and 128 x 96 in our general-purpose low resolution Web-crawled image database. On average, 0.003 seconds is needed to compute the features of all blocks of an image.
The feature clustering process is performed only once for our database. The clustering algorithm takes about 1 hour 10 minutes of CPU time and results in clusters with an average of 200 images in each class.
The matching speed is quite fast. Our database is formed by 200 image categories, each containing an average of 200 pictures. Within this database, a retrieved image is considered a match if and only if it is in the same category as the query. We ran an experiment using a few images in the database as queries, and the retrieval ranks of all the remaining images were recorded. Similarly, we ran an experiment using a few images that are not present in the image database as queries, and the retrieval ranks

of all the remaining images were recorded. For every query, we computed the precision within the first 50 retrieved images, as shown in Figure 5. The results are recorded in Table 1 for the different categories of images in our database. We achieved better precision with IRRM; the results shown are the images similar to the given query image.

We carried out similar evaluation tests for color histogram matching. We used the HSV color space and a similar matching metric to extract region based color histogram features and match them in the image database. Table 1 compares the performance of the Image Region-to-Region Matching (IRRM) method and the Global Feature Extraction (GFE) method. The performance of the IRRM feature extraction method is better than that of the GFE color histogram system, owing to the separation between the global feature space and the region based feature space obtained with more filled bins. However, with the global feature space color histogram it is impossible to obtain matches on large databases. The region based color histogram approach gives better search accuracy than the global feature extraction system.

Figures 1 and 2 show the results of sample queries. Due to limitations of space, we show only two rows of images with the top 19 matches to each query. In the next section, we provide numerical evaluation results by semantically comparing several systems.

Figure 1: Best top matches of a sample car image query. The database contains 20,000 general-purpose crawled images. The upper left corner is the query image. The second image in the first row is the best match using region based color feature extraction.

Figure 2: Best top matches of a sample query. The database contains 20,000 general-purpose crawled images. The upper left corner is the query image. The second image in the first row is the best match using the global feature extraction technique.

Figure 3: Two other query examples of best top matches of a sample query.
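The precision within the first 50 retrieved images, as used in this evaluation, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names `precision_at_k` and `mean_precision_at_k`, and the use of category labels as plain strings, are assumptions made only for illustration. The only rule taken from the paper is that a retrieved image counts as a match if and only if it is in the same category as the query.

```python
from typing import List, Sequence, Tuple


def precision_at_k(ranked_categories: Sequence[str],
                   query_category: str,
                   k: int = 50) -> float:
    """Fraction of the first k retrieved images in the query's category.

    ranked_categories holds the category label of each retrieved image,
    ordered from best match to worst (hypothetical input format).
    """
    top = ranked_categories[:k]
    if not top:
        # Nothing was retrieved, so no retrieved image can be a match.
        return 0.0
    matches = sum(1 for category in top if category == query_category)
    return matches / len(top)


def mean_precision_at_k(queries: List[Tuple[Sequence[str], str]],
                        k: int = 50) -> float:
    """Average precision_at_k over (ranked_categories, query_category) pairs."""
    if not queries:
        return 0.0
    total = sum(precision_at_k(ranked, category, k)
                for ranked, category in queries)
    return total / len(queries)


if __name__ == "__main__":
    # Toy ranking: three of the first four retrieved images are cars.
    ranked = ["car", "car", "bus", "car"]
    print(precision_at_k(ranked, "car", k=4))  # 0.75
```

Averaging this value over the queries of one image category gives a per-category figure of the kind the paper reports for IRRM and GFE. The sketch measures precision only, so it says nothing about recall.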

In our experiment, we normalized the images into a common size. This assumption may not always be ideal. The set of features for a particular image category is determined based on the perception of the users. For example, shape-related features are not used. A major difficulty in feature selection is the lack of information about whether any two images in the database match each other.
To explore feature selection, we conducted a few experiments to measure recall and precision. The main limitation of our current evaluation results is that they are based mainly on precision or variations of precision. In practice, a system with a high overall precision may have a low overall recall. Precision and recall often trade off against each other.

Figure 5: Region based first order color histogram probability feature extraction on average precision p.

5 CONCLUSIONS
We have developed a scalable integrated region-based image retrieval system. The system uses the image region-to-region measure metric to retrieve images from the image database. We tested it on a database of about 20,000 general-purpose images for demonstration. The clustering efficiency can be improved by using a better statistical clustering algorithm. Better statistical modeling and matching schemes can improve the matching accuracy of the system. We are also planning to apply the methods to a very large Web image database. This work can be extended to retrieve the URLs of similar, higher-ranking images from a search engine to reduce the network bandwidth. This reduces the bandwidth used on the wide area network and helps to speed up the transmission rate.

REFERENCES
[1] http://images.google.com/
[2] http://www.ditto.com/
[3] http://www.altavista.com/
[4] Michael J. Swain, Charles Frankel, Vassilis Athitsos, “WebSeer: An Image Search Engine for the World Wide Web”, Technical Report 96-14, 1997.
[5] J. R. Smith, “Integrated Spatial and Feature Image Systems: Retrieval, Compression and Analysis”, PhD thesis, Graduate School of Arts and Sciences, Columbia University, February 1997.
[6] S. Sclaroff, L. Taycher, and M. La Cascia, “ImageRover: A content-based image browser for the World Wide Web”, In Proceedings IEEE Workshop on Content-based Access of Image and Video Libraries, June 1997.
[7] W. Niblack, R. Barber, et al., “The QBIC project: Querying images by content using color, texture and shape”, In Proc. SPIE Storage and Retrieval for Image and Video Databases, Feb 1994.
[8] M. L. Kherfi, D. Ziou, and A. Bernardi, “Image Retrieval from the World Wide Web: Issues, Techniques, and Systems”, ACM Computing Surveys, Vol. 36, No. 1, March 2004, pp. 35-67.
[9] Michael J. Swain, Dana H. Ballard, “Color Indexing”, International Journal of Computer Vision, Kluwer Academic Publishers, Netherlands, 7:1, 11-32, 1991.
[10] Michael Ortega, Yong Rui, Kaushik Chakrabarti, Sharad Mehrotra, and Thomas S. Huang, “Supporting similarity queries in MARS”, In Proc. of ACM Conf. on Multimedia, 1997.
[11] Y. Chen, J. Z. Wang, and R. Krovetz, “Content-based image retrieval by clustering”, In Proceedings of the 5th ACM SIGMM International Workshop on Multimedia Information Retrieval, pages 193-200, ACM Press, 2003.
[12] M. Flickner, H. Sawhney, W. Niblack, J. Ashley, Q. Huang, B. Dom, et al., “Query by Image and Video Content: The QBIC System”, IEEE Computer, vol. 28, no. 9, 1995.
[13] K. Karu, A. K. Jain, and R. M. Bolle, “Is there any Texture in the image?”, Pattern Recognition, vol. 29, pp. 1437-1446, 1996.
[14] J. Li, J. Z. Wang, and G. Wiederhold, “Integrated Region Matching for Image Retrieval”, Proc. ACM Multimedia Conference, pp. 147-156, Los Angeles, ACM, October 2000.

Umesh K.K. is a faculty member with the Department of Information Science and Engineering, Sri Jayachamarajendra College of Engineering, Mysore. He is also a research scholar in the Department of Studies in Computer Science, University of Mysore. He received his B.E. degree in 1996 from the University of Mysore and his M.Tech degree in 2003 from Visvesvaraya Technological University, Karnataka. He is working towards his doctorate under the supervision of Dr. Suresha.

Dr. Suresha is presently working as Reader, Department of Studies in Computer Science, University of Mysore. He received his B.Sc. degree in 1987 from the University of Mysore, his M.Sc. degree in 1989 from the University of Mysore, and his M.Phil. degree in 1991 from DAVV, Indore. He received his M.Tech degree from the Indian Institute of Technology, Kharagpur in 1996 and his Ph.D. from the prestigious Indian Institute of Science, Bangalore in 2007. He was awarded second prize in the IRISS-2002 competition, an all India level research student competition called “Inter Research Institute Student Seminar”. He has 20 years of teaching experience at the postgraduate level, has 12 publications to his credit, and is currently supervising four research scholars towards their doctoral work.
