
IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 10, NO. 8, DECEMBER 2008

Multilabel Neighborhood Propagation for Region-Based Image Retrieval


Fei Li, Qionghai Dai, Senior Member, IEEE, Wenli Xu, and Guihua Er
Abstract: Content-based image retrieval (CBIR) has been an active research topic in the last decade. As one of the promising approaches, graph-based semi-supervised learning has attracted many researchers. However, while the related work has mainly focused on global visual features, little attention has been paid to region-based image retrieval (RBIR). In this paper, a framework based on multilabel neighborhood propagation is proposed for RBIR, which can be characterized by three key properties: 1) For graph construction, in order to determine the edge weights robustly and automatically, a mixture distribution is introduced into the earth mover's distance (EMD) and a linear programming framework is involved. 2) Multiple low-level labels for each image can be obtained based on a generative model, and the correlations among different labels are explored when the labels are propagated simultaneously on the weighted graph. 3) By introducing multilayer semantic representation (MSR) and support vector machine (SVM) into the long-term learning, a more exact weighted graph for label propagation and more meaningful high-level labels to describe the images can be calculated. Experimental results, including comparisons with state-of-the-art retrieval systems, demonstrate the effectiveness of our proposal.

Index Terms: Label propagation, manifold ranking, region-based image retrieval, relevance feedback, semi-supervised learning.

I. INTRODUCTION

With the explosive growth of the number of digital images, how to organize and search images that are distributed everywhere has become an urgent problem. Due to the large labeling cost and the subjectivity of human perception, text-based methods are tedious and impractical [1]. Therefore, content-based image retrieval (CBIR), which is based on automatically extracted visual features, has attracted more and more attention in recent years. In the early stage, researchers mainly focused on the effective low-level representation of images, and many kinds of visual features, including color [2], texture [3], shape [4], etc., were proposed. However, although much work has been dedicated to exploring an ideal descriptor for image content, the gap

Manuscript received November 18, 2007; revised May 20, 2008. Current version published December 10, 2008. This work was supported by the National Natural Science Foundation of China under Grants 60772048, 60525111 and 60721003. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Alan Hanjalic. The authors are with the Department of Automation, Tsinghua National Laboratory for Information Science, Tsinghua University, Beijing 100084, China (e-mail: f-l04@mails.tsinghua.edu.cn; qhdai@mail.tsinghua.edu.cn; xuwl@mail.tsinghua.edu.cn; ergh@mail.tsinghua.edu.cn). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TMM.2008.2004914

between high-level semantic concepts and low-level visual features hinders further development of CBIR systems [5]. To narrow the gap and improve the retrieval performance, many promising mechanisms have been introduced, among which relevance feedback [6], [7] and region-based image retrieval (RBIR) [8], [9] have been widely used and have shown their effectiveness in many existing systems. As an online learning process, relevance feedback involves the user in the retrieval session and asks the user to judge the relevance of some selected images. With the help of the labeled images, the system can capture the query concept more quickly. In RBIR, each image is segmented into several regions, from which features are extracted. Since region-based features can represent images at the object level and accord better with human perception, RBIR often achieves more satisfactory retrieval performance.

Recently, machine learning has been extensively explored and many algorithms have been adopted in CBIR [10]. Generally speaking, the methods can be classified into three groups: supervised, unsupervised, and semi-supervised ones. The supervised methods treat CBIR as a classification problem, and the training samples are collected from the user's feedback. Their goal is to train an effective classifier, which can separate all the database images into two classes: one that satisfies the user's need and one that does not. As a typical example, support vector machine (SVM) [11] is widely used with both global and region-based features [12], [13]. The main problem of the supervised methods is that the limited training samples often make the obtained classifier unstable. Some ensemble methods can be utilized to enhance the performance of the weak classifiers to a certain extent [14], but the problem persists. Without labeled samples, the unsupervised methods aim to explore the relationships of all the images and to build a reasonable structure for the database.
Based on a graph-theoretic clustering algorithm, the technique named CLUE retrieves image clusters instead of ordered images [15]. A region-image-concept probabilistic model is proposed in [16], and the hidden semantic concepts in the database are discovered. By making full use of the knowledge obtained from unsupervised learning, the retrieval performance can be improved. To exploit useful information from both labeled and unlabeled samples at the same time, semi-supervised learning has received increasing attention. In [17], SVM and locality preserving projection (LPP) are combined to obtain a classifier which maximizes the margin and simultaneously preserves the local information. The method in [18] adopts the co-training algorithm to enhance the relevance feedback. Graph-based semi-supervised learning has also been introduced into CBIR. As our proposal is in this framework, related work is presented in detail below.

1520-9210/$25.00 2008 IEEE

LI et al.: MULTILABEL NEIGHBORHOOD PROPAGATION FOR REGION-BASED IMAGE RETRIEVAL


A. Related Work in Graph-Based Semi-Supervised Learning

In general, a graph-based semi-supervised learning algorithm starts with a graph, in which the vertices correspond to all the labeled and unlabeled samples, and the weighted edges reflect the similarities of the vertex pairs. Then, the available information is spread via the graph, from labeled samples to unlabeled ones. The final spread results will be used for classification, ranking or other purposes. From the whole process, it can be concluded that the two key points in this framework are the approaches to construct the graph and to spread information.

For constructing an effective graph to represent the relationships of all the samples, most of the existing algorithms determine the similarities based on distance measures [19], [20]. Since the similarities are also related to the structures around the samples, some novel similarity measures have been proposed recently [21], [22]. All the above algorithms calculate the edge weights of the graph by a Gaussian function of $d$ with width parameter $\sigma$, where $d$ is some dissimilarity measure between two images. However, as pointed out in [23], the parameter $\sigma$ can influence the results significantly, and there is no reliable approach to determine it automatically. The idea of reconstructing each data point using a linear combination of its neighbors has been successfully adopted in [23], but it can only deal with the situation when all the samples are of equal length. For RBIR, as the numbers of regions in different images may be different, it is impossible for all the images to have region-based representations of the same length. Therefore, linear combination is infeasible in this case.

To spread information effectively from labeled samples to unlabeled ones, many approaches have been proposed, such as label propagation [24], Markov random walks [25], and so on.
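The two key points named above, graph construction and information spreading, can be illustrated with a minimal Python sketch. It is not the method of any cited paper: the kernel `exp(-d^2 / (2 sigma^2))`, the row normalization, the update rule, and all function and parameter names (`gaussian_knn_weights`, `propagate`, `fraction`) are assumptions chosen for illustration.

```python
import math

def gaussian_knn_weights(dist, k, sigma):
    """Row-normalized Gaussian weights on a k-nearest-neighbor graph.

    dist is a full pairwise dissimilarity matrix (list of lists).
    The kernel exp(-d^2 / (2 sigma^2)) is an assumed, commonly used form.
    """
    n = len(dist)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        neighbors = sorted((j for j in range(n) if j != i),
                           key=lambda j: dist[i][j])[:k]
        raw = {j: math.exp(-dist[i][j] ** 2 / (2.0 * sigma ** 2))
               for j in neighbors}
        total = sum(raw.values())
        for j, value in raw.items():
            # Fall back to uniform weights if every kernel value underflowed.
            weights[i][j] = value / total if total > 0.0 else 1.0 / len(raw)
    return weights

def propagate(weights, initial, fraction=0.9, iterations=200):
    """Iterate f <- fraction * W f + (1 - fraction) * y for a single label."""
    n = len(initial)
    scores = list(initial)
    for _ in range(iterations):
        scores = [fraction * sum(weights[i][j] * scores[j] for j in range(n))
                  + (1.0 - fraction) * initial[i] for i in range(n)]
    return scores

# Toy data: two well-separated groups on a line; sample 0 is the labeled query.
points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
dist = [[abs(a - b) for b in points] for a in points]
weights = gaussian_knn_weights(dist, k=2, sigma=0.5)
scores = propagate(weights, [1.0, 0.0, 0.0, 0.0, 0.0, 0.0])
```

In this toy run the query's label reaches only the samples of its own group. Rerunning with a much smaller or larger `sigma` changes the weights inside each neighborhood, which is the sensitivity that motivates a parameter-free weighting scheme.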
As a representative algorithm, manifold ranking is proposed to keep the local and global consistency [19], [26], and it has been successfully integrated with global features into the CBIR system. In [27], a transductive learning framework named manifold ranking based image retrieval (MRBIR) is proposed, which incorporates relevance feedback and active learning in a natural way. To address its major drawback that it can only deal with the problem where the query image is in the database, MRBIR is extended to form a general framework named generalized manifold ranking based image retrieval (gMRBIR) [28]. The method in [29] extends MRBIR from another aspect. By introducing long-term offline learning, the retrieval performance is further improved. In [30], graph-based semi-supervised learning and multiple-instance learning (MIL) are combined. The process of information spread is conducted at the region level to estimate region typicality, according to which image typicality is calculated simply by the weighted sum.

B. Overview of Our Approach

In this paper, we introduce graph-based semi-supervised learning into RBIR, and present a method based on multilabel neighborhood propagation. The framework of our RBIR system is illustrated in Fig. 1. Briefly speaking, the system consists of four parts: image segmentation and representation, graph construction, label propagation and online relevance feedback, as well as long-term offline learning. Focusing on the aforementioned two key points in graph-based semi-supervised learning, our contributions can be summarized in three aspects:

1) For graph construction, to deal with the problem caused by the parameter $\sigma$ in the Gaussian function, we propose a new approach to determine the edge weights by linear programming, with the basic idea of introducing a mixture distribution into the earth mover's distance (EMD) [31]. Given the image segmentation results, the only parameter in our algorithm is the number of the nearest neighbors. When the parameter varies over a large range, it has little effect on the retrieval performance.

2) To better exploit the information in region-based features, based on the Gaussian mixture model (GMM), the segmented images are represented in a uniform space, and multiple low-level labels for each image can be obtained. These labels are propagated simultaneously on the weighted graph. Considering the correlations among labels, the algorithm of correlated label propagation (CLP) [32] is adopted. In relevance feedback, the labels are updated according to the query concept. We also extend the above method and propose a unified framework which works whether or not the query image is in the database.

3) We incorporate long-term offline learning into our RBIR system, which aims to acquire a more exact weighted graph and more meaningful high-level labels. With the help of multilayer semantic representation (MSR) [33] and SVM, the accumulated feedback information in the previous retrieval sessions can be largely explored.

C. Outline of the Paper

The rest of the paper is organized as follows. Section II discusses how to determine the edge weights. Then we describe our algorithm for correlated multilabel propagation in Section III. Section IV is mainly concerned with the long-term learning mechanism. Our experimental results are presented in Section V, which is followed by conclusions and analysis of future work in Section VI.

II. GRAPH CONSTRUCTION

Constructing an effective graph is the core of any graph-based method. Therefore, this issue should be addressed first.
To illustrate our proposal more clearly, we first explain the way to represent the segmented images and the framework of EMD. Then, our proposal to determine the edge weights is presented.

A. Image Segmentation and Representation

In our RBIR system, each image is first partitioned into nonoverlapping blocks, from which low-level features, such as color and texture, are extracted. For image segmentation, the JSEG algorithm [34] is adopted, which can adaptively determine the number of regions in each image. Suppose there are altogether $N$ images in the database, and they are denoted by $\{I_1, \ldots, I_N\}$. After segmentation, each image $I_i$ can be represented by a set of region-based features $\{\mathbf{r}_{i1}, \ldots, \mathbf{r}_{in_i}\}$, where $n_i$ is the total number of regions in $I_i$, and $\mathbf{r}_{ij}$ is calculated as the mean vector of all the block-based features in the corresponding region. Each region also corresponds to an importance weight $w_{ij}$ satisfying the normalization constraint $\sum_{j=1}^{n_i} w_{ij} = 1$. It is originally set


Fig. 1. Framework of our RBIR system.

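to the area percentage of the region, and will be updated in relevance feedback. It should be noted that different parameters in the JSEG algorithm may lead to different segmentation results. Fig. 2 gives some segmentation examples with different region merge thresholds. The influence of the segmentation results on the retrieval performance will be discussed in Section V-D1.

Fig. 2. Segmentation examples with different region merge thresholds in the JSEG algorithm. Original images are in the first column. The thresholds for the three segmentation results are 0.5, 0.7 and 0.95, respectively.

B. Image Distance Measure

Based on the minimal cost to transform one distribution into another, EMD is proposed to determine the distance between two distributions [31]. As it can deal with variable-length representations, EMD has been introduced into RBIR as an effective distance measure [13], [35], [36], where the set of region importance weights of an image is actually treated as a distribution. By introducing many-to-many matching among regions, EMD is robust to inaccurate image segmentation.

Formally, let two segmented images $I$ and $I'$ be represented by $\{(\mathbf{r}_s, w_s)\}_{s=1}^{n}$ and $\{(\mathbf{r}'_t, w'_t)\}_{t=1}^{n'}$, respectively, and let the distance between $\mathbf{r}_s$ and $\mathbf{r}'_t$ be $d_{st}$. According to the definition, the EMD between the two images can be calculated as

$$\mathrm{EMD}(I, I') = \min_{\{f_{st}\}} \frac{\sum_{s=1}^{n} \sum_{t=1}^{n'} f_{st}\, d_{st}}{\sum_{s=1}^{n} \sum_{t=1}^{n'} f_{st}} \tag{1}$$

where $f_{st}$ denotes the flow between $\mathbf{r}_s$ and $\mathbf{r}'_t$, subject to the following constraints:

$$f_{st} \ge 0, \qquad \sum_{t=1}^{n'} f_{st} \le w_s, \qquad \sum_{s=1}^{n} f_{st} \le w'_t, \qquad \sum_{s=1}^{n} \sum_{t=1}^{n'} f_{st} = \min\Big(\sum_{s=1}^{n} w_s, \sum_{t=1}^{n'} w'_t\Big).$$

Considering the normalization constraint that $\sum_{s=1}^{n} w_s = \sum_{t=1}^{n'} w'_t = 1$, (1) can be simplified as

$$\mathrm{EMD}(I, I') = \min_{\{f_{st}\}} \sum_{s=1}^{n} \sum_{t=1}^{n'} f_{st}\, d_{st} \tag{2}$$

with the constraints:

$$f_{st} \ge 0, \qquad \sum_{t=1}^{n'} f_{st} = w_s, \qquad \sum_{s=1}^{n} f_{st} = w'_t.$$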
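To make the normalized EMD concrete, the following Python sketch computes it for two small signatures. It is only an illustration under stated assumptions: it requires both weight sets to sum to 1, and instead of a general linear programming solver it uses successive shortest augmenting paths, a min-cost-flow technique that applies to this transportation structure. The function name `emd` and all variable names are hypothetical.

```python
def emd(weights_a, weights_b, dist):
    """EMD between two signatures whose weights each sum to 1.

    Solves the transportation problem by successive shortest augmenting
    paths (a min-cost-flow method), not by a general LP solver.
    """
    n, m = len(weights_a), len(weights_b)
    source, sink = 0, n + m + 1      # regions of A: 1..n, regions of B: n+1..n+m
    num_nodes = n + m + 2
    edges = []                       # [head, residual capacity, cost]; e ^ 1 reverses e
    adjacency = [[] for _ in range(num_nodes)]

    def add_edge(tail, head, capacity, cost):
        adjacency[tail].append(len(edges))
        edges.append([head, capacity, cost])
        adjacency[head].append(len(edges))
        edges.append([tail, 0.0, -cost])

    for s in range(n):
        add_edge(source, 1 + s, weights_a[s], 0.0)
        for t in range(m):
            add_edge(1 + s, 1 + n + t, float("inf"), dist[s][t])
    for t in range(m):
        add_edge(1 + n + t, sink, weights_b[t], 0.0)

    eps, total_cost = 1e-12, 0.0
    for _ in range(100 * (n * m + n + m)):   # safety bound on augmentations
        # Bellman-Ford: cheapest source-to-sink path in the residual graph.
        best = [float("inf")] * num_nodes
        via = [-1] * num_nodes
        best[source] = 0.0
        for _ in range(num_nodes - 1):
            changed = False
            for u in range(num_nodes):
                if best[u] == float("inf"):
                    continue
                for e in adjacency[u]:
                    head, capacity, cost = edges[e]
                    if capacity > eps and best[u] + cost < best[head] - eps:
                        best[head], via[head], changed = best[u] + cost, e, True
            if not changed:
                break
        if via[sink] == -1:
            break                            # all mass has been shipped
        # Find the bottleneck along the path, then push that much flow.
        push, node = float("inf"), sink
        while node != source:
            push = min(push, edges[via[node]][1])
            node = edges[via[node] ^ 1][0]
        node = sink
        while node != source:
            e = via[node]
            edges[e][1] -= push
            edges[e ^ 1][1] += push
            node = edges[e ^ 1][0]
        total_cost += push * best[sink]
    return total_cost

# Two regions per image; only 0.3 of the mass has to move, at unit cost.
example = emd([0.7, 0.3], [0.4, 0.6], [[0.0, 1.0], [1.0, 0.0]])
```

Because the flows are many-to-many, an over-segmented region can be matched to several regions of the other image, which is the robustness property mentioned above.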

Obviously, (2) is a problem of linear programming, which can be solved by many existing algorithms.

C. Calculating the Edge Weights

Suppose image $I$ is described by the same form as that in Section II-B. Based on EMD, we calculate its $K$ nearest neighbors, which are denoted by $\mathcal{N}(I) = \{I^1, \ldots, I^K\}$, and image $I^k$ is represented by $\{(\mathbf{r}^k_t, w^k_t)\}_{t=1}^{n_k}$. As in [23], we also want to determine the edge weights by exploring the relationship between $I$ and $\mathcal{N}(I)$. Since the set of region importance weights of


image $I$ can be treated as a distribution, we can consider the set of region importance weights of all the images in $\mathcal{N}(I)$ as a mixture distribution including $K$ components, with $\alpha_1, \ldots, \alpha_K$ as the mixture coefficients. Let $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_K)$ and the distance between two region-based features $\mathbf{r}_s$ and $\mathbf{r}^k_t$ be denoted as $d^k_{st}$, then $\boldsymbol{\alpha}$ can be calculated as

$$\min_{\boldsymbol{\alpha}, \{f^k_{st}\}} \sum_{k=1}^{K} \sum_{s=1}^{n} \sum_{t=1}^{n_k} f^k_{st}\, d^k_{st} \tag{3}$$

with the constraints:

$$f^k_{st} \ge 0, \qquad \alpha_k \ge 0, \qquad \sum_{k=1}^{K} \alpha_k = 1, \qquad \sum_{k=1}^{K} \sum_{t=1}^{n_k} f^k_{st} = w_s, \qquad \sum_{s=1}^{n} f^k_{st} = \alpha_k w^k_t.$$

The problems defined by (3) and (2) are similar, and the main difference is that the constraints in (3) contain the new variables $\alpha_k$. Generally speaking, to minimize the cost function in (3), a larger weight is usually assigned to a more similar neighboring image. As an extreme case, if there exists an image $I^k \in \mathcal{N}(I)$ with the same region-based representation as $I$, then the optimal solution is $\alpha_k = 1$ with all the other coefficients equal to 0. Another property of (3) is that although new variables are introduced, it is still a linear programming problem, so it can be solved efficiently.

Let the weight of the edge connecting images $I_i$ and $I_j$ be denoted as $W_{ij}$. When constructing the graph, if $I_j \in \mathcal{N}(I_i)$, $W_{ij}$ will be set as the corresponding coefficient in the optimal solution to (3); otherwise, $W_{ij} = 0$. Note that unlike the edge weights calculated by the Gaussian function, $W_{ij}$ and $W_{ji}$ are usually unequal. Given the image segmentation results, the only parameter in our proposal for graph construction is $K$, the number of the nearest neighbors. As shown in Section V, it has little effect on the retrieval performance when the parameter varies over a large range.

III. MULTILABEL PROPAGATION

After graph construction, it should be considered how to spread information from labeled samples to unlabeled ones. In the existing methods [27]-[29], there is merely one label, originally set to 1 or -1, for propagation, which can only denote whether the image is relevant or not to the user's need, rather than reflect its content. In our proposal, multiple labels are introduced to describe each image. They are propagated on the graph simultaneously, and their correlations are explored. Moreover, the labels can also be updated in relevance feedback to capture the user's query. In the last part of this section, the problem where the query image is not in the database is addressed.

A. Calculating Multiple Labels

As in [37], based on GMM, the segmented images can be represented in a uniform feature space. After feature extraction as in Section II-A, principal component analysis (PCA) is utilized to simplify the feature representation. The calculation of multiple labels is based on the assumption that the region-based feature is generated by a GMM with $M$ components, namely

$$p(\mathbf{r}) = \sum_{m=1}^{M} \pi_m\, \mathcal{N}(\mathbf{r}; \boldsymbol{\mu}_m, \boldsymbol{\Sigma}_m) \tag{4}$$

where $p(\mathbf{r})$ is the generation probability of the region-based feature, $\mathcal{N}(\mathbf{r}; \boldsymbol{\mu}_m, \boldsymbol{\Sigma}_m)$ is a normal distribution with mean vector $\boldsymbol{\mu}_m$ and covariance matrix $\boldsymbol{\Sigma}_m$, and $\pi_m$ is the mixture coefficient. The model parameters can be obtained by the expectation maximization (EM) algorithm. By treating each Gaussian component as a kind of low-level label to represent the image content, each region is described by a set of labels $\{l_1(\mathbf{r}), \ldots, l_M(\mathbf{r})\}$, where $l_m(\mathbf{r})$ is given by

$$l_m(\mathbf{r}) = \frac{\pi_m\, \mathcal{N}(\mathbf{r}; \boldsymbol{\mu}_m, \boldsymbol{\Sigma}_m)}{\sum_{m'=1}^{M} \pi_{m'}\, \mathcal{N}(\mathbf{r}; \boldsymbol{\mu}_{m'}, \boldsymbol{\Sigma}_{m'})}. \tag{5}$$

Then each low-level label of the whole image $I_i$ can be calculated by the weighted sum of the corresponding labels of its regions,

$$L_m(I_i) = \sum_{j=1}^{n_i} w_{ij}\, l_m(\mathbf{r}_{ij}) \tag{6}$$

where $\mathbf{r}_{i1}, \ldots, \mathbf{r}_{in_i}$ are the regions in $I_i$. The whole process is illustrated in Fig. 3. Some existing RBIR systems adopt clustering algorithms to represent the region-based features in a binary space [13], [16]. By introducing GMM, we can obtain a real-valued feature space, which can provide more information about the image content.

B. Correlated Multilabel Propagation

The basic idea of the manifold ranking algorithm [19], [26] is adopted in our system to learn useful information about unlabeled samples from labeled ones. For multilabel propagation in the context of image retrieval, the direct way is to spread all the labels of the query image independently. However, in the GMM used for calculating the labels, some of the Gaussian components may overlap each other to a certain degree, which makes some labels correlated. Obviously, the direct method does not take this into consideration and the correlations cannot be explored. The CLP algorithm [32] formulates the framework of propagation with correlations as a linear programming problem, which has an exponential number of constraints but can be solved efficiently by the properties of submodular functions. It is adopted in our proposal for correlated multilabel propagation, and the details are explained in the remainder of this subsection.
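For the label computation of Section III-A, a minimal Python sketch may help. It assumes that the region label is the posterior probability of each Gaussian component given the region feature, that covariances are diagonal, and that the mixture parameters are already known (in practice they would come from EM). All function names and the toy parameters are hypothetical.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a Gaussian with diagonal covariance (variances in var)."""
    log_p = 0.0
    for xi, mi, vi in zip(x, mean, var):
        log_p += -0.5 * math.log(2.0 * math.pi * vi) - (xi - mi) ** 2 / (2.0 * vi)
    return math.exp(log_p)

def region_labels(feature, mix, means, variances):
    """Posterior probability of each component given one region feature."""
    joint = [pi * gaussian_pdf(feature, mu, var)
             for pi, mu, var in zip(mix, means, variances)]
    total = sum(joint)
    return [value / total for value in joint]

def image_labels(features, region_weights, mix, means, variances):
    """Image label m = weighted sum of label m over the image's regions."""
    per_region = [region_labels(f, mix, means, variances) for f in features]
    return [sum(w * labels[m] for w, labels in zip(region_weights, per_region))
            for m in range(len(mix))]

# Toy mixture with two components (assumed parameters, not fitted by EM).
mix = [0.5, 0.5]
means = [[0.0, 0.0], [4.0, 4.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]

# One image with three regions; the importance weights sum to 1.
features = [[0.1, -0.2], [3.9, 4.2], [0.0, 0.3]]
region_weights = [0.5, 0.3, 0.2]
labels = image_labels(features, region_weights, mix, means, variances)
```

Since each region's labels sum to 1 and the region weights sum to 1, the image labels also sum to 1, so an image is described by a real-valued vector rather than by a hard cluster assignment.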


Fig. 3. Process for calculating low-level labels. First, GMM is used to fit the distribution of all the region-based features. Then region labels are calculated, each of which corresponds to a Gaussian component. At last, image labels are determined by the weighted sum of region labels.

Let $Y$ be an $N \times M$ matrix whose elements hold the initial labels. Following [32], the columns of $Y$ are sorted in ascending order, and a new matrix $\tilde{Y}$ is then calculated by (7) using the label kernel function of [32]. In our proposal, since the sum of each row of $W$ is 1, there is no need to re-normalize it. Let $F(t)$ be the prediction label matrix in the $t$-th iteration and its initial value be $F(0) = \tilde{Y}$; the following equation is performed iteratively until convergence,

$$F(t+1) = \eta\, W F(t) + (1 - \eta)\, \tilde{Y} \tag{8}$$

where $0 < \eta < 1$ is the parameter representing the fraction of label information that each image receives from its neighbors. Denote the limit of the sequence $\{F(t)\}$ as $F^{*}$; the final label matrix is defined from $F^{*}$ by (9).

After multilabel propagation, each image $I_i$ will correspond to a label vector. Then the distance between the label vectors of the query and of $I_i$ is calculated, and the images with the smallest distances will be selected as the retrieval results. Because of the correlations among labels, quadratic distance is adopted, which has the form of

$$D(\mathbf{x}, \mathbf{y}) = (\mathbf{x} - \mathbf{y})^{T} A\, (\mathbf{x} - \mathbf{y}) \tag{10}$$

where $A = [a_{mm'}]$ is a symmetric matrix, and $a_{mm'}$ denotes the similarity between the $m$-th and $m'$-th labels. As each low-level label corresponds to a Gaussian component, $a_{mm'}$ can be determined based on the divergence of the two distributions.

C. Relevance Feedback and Active Learning

To reflect the query concept well, the region importance weights as well as the low-level labels for regions and images should be updated according to the user's feedback. In our proposal, according to the similarity-based feature selection criterion in [37], the discriminating ability of each label is estimated first. Let the sets of positive and negative images be $\mathcal{I}^{+}$ and $\mathcal{I}^{-}$, where $n^{+}$ and $n^{-}$ are the numbers of positive and negative images, respectively. Then the similarity between $\mathcal{I}^{+}$ and $\mathcal{I}^{-}$ with the $m$-th label can be calculated by (11), which is based on the unified feature matching (UFM) [38] measurement, where the similarity between two images with the $m$-th label is defined based on the fuzzy feature contrast (FFC) model [39]. As (11) is used to measure the similarity between $\mathcal{I}^{+}$ and $\mathcal{I}^{-}$, a label corresponding to a smaller value has better discriminating ability. More details of the criterion can be found in [37]. For each Gaussian component, an importance weight $\lambda_m$ is introduced. Obviously, a greater weight should be assigned to a more discriminating component. Therefore, in our proposal, $\lambda_m$ is calculated by (12).

Fig. 4. Segmented image about tiger.

Then the new region labels can be calculated by modifying (5) with the component weights, as given in (13). The importance weight for region $\mathbf{r}_{ij}$ is also updated by (14), where $a_{ij}$ is the area percentage of region $\mathbf{r}_{ij}$ in image $I_i$. After normalizing the updated weights, the image labels can be recalculated by (6).

Although simple, the update mechanism for region importance weights is effective and useful feedback information can be obtained. A segmented image with four regions is shown in Fig. 4. In general, when using this image as the query, the user usually wants to find images about tiger. However, compared with the background, the region area corresponding to tiger is quite small. According to the area percentage, the region importance weights are originally set to 0.09 (water), 0.62 (grass), 0.14 (tiger) and 0.15 (ground), respectively, which cannot reflect the user's query exactly. By introducing the weight update algorithm, after four rounds of relevance feedback, the region importance weights are modified to 0.14 (water), 0.27 (grass), 0.46 (tiger) and 0.13 (ground). It can be seen that the region corresponding to tiger becomes the most important, and the updated weights accord with the query concept better.

For multilabel propagation, the rows of the matrix $Y$ are defined piecewise: they take one value for positive images, a second value scaled by $\gamma$ for negative images, and zero otherwise. Here $\gamma$ is the parameter to reduce the contribution of negative images, which is introduced by considering the asymmetry between positive and negative images. When the iteration converges, we define the distance between image $I_i$ and the user's query as in (15), and the return set is composed of the images with the smallest distances.

Aiming at speeding up the convergence to the query concept, active learning is also introduced into our RBIR system when unlabeled images are selected for labeling. In MRBIR, three active learning schemes are proposed, in which the most positive images, the most informative ones, as well as the most positive and inconsistent ones are respectively selected for labeling. We have performed all these schemes in our RBIR system, and obtained results similar to those in [27]. Since active learning is not the main focus of our proposal, the first scheme, which is the best one in [27], is adopted in our system.

D. Implementation Issue

In our proposal, all the edge weights between non-neighboring images are set to 0, which makes the matrix $W$ sparse. Suppose it takes $T$ iterations until the process of label propagation converges; the computational complexity of our proposal in each round of feedback then grows with $T$ and with the database size $N$. Since the number of images in a real database is usually far more than a million, the calculation process may be too time-consuming to satisfy the user's need. To address this problem, a candidate set is constructed, and multiple labels are propagated on the subgraph corresponding to it. Considering that the images far from the positive samples are unlikely to be relevant to the user's query, the candidate set is composed of the positive samples and their nearest neighbors, as well as the negative samples. Its size is therefore bounded by the number of labeled samples plus the number of neighbors collected for the positive ones. The subgraph is not sparse any longer, but as the candidate set is much smaller than the database, much time can be saved in each feedback round. Meanwhile, the candidate set can exclude false positive images to a certain extent.

E. Extension to Query Outside the Database

The above framework of multilabel propagation can only deal with the problem where the query image $I_q$ is in the database. If $I_q$ is provided by the user, as there is no corresponding vertex in the existing graph, no information can be spread from $I_q$ to the database images. The direct idea to deal with the problem is to add one row and one column to $W$, then to perform other operations similarly with the enlarged matrix. However, the edge weights are determined by linear programming; therefore, when $W$ is enlarged from $N \times N$ to $(N+1) \times (N+1)$, the elements in the original rows and columns should be re-calculated. Obviously, this is infeasible in real applications.

The details of our method are explained below. It is conducted in two steps: the information is first spread from $I_q$ to its nearest neighbors in the database, and then from the nearest neighbors to all the database images.

At first, the $K$ nearest neighbors of $I_q$ in the database are determined based on EMD, which are denoted by $\mathcal{N}(I_q)$. By solving the same optimization problem as (3), a weight vector $\boldsymbol{\alpha}_q$ can be obtained. According to the basic idea of EMD, the optimal value of (3) is the minimal cost to transform the distribution corresponding to $I_q$ into the mixture distribution corresponding to $\mathcal{N}(I_q)$. In another viewpoint, if we want to reconstruct $I_q$ by its


nearest neighbors, the elements of $\boldsymbol{\alpha}_q$ are the best reconstruction coefficients. Therefore, the information can be spread from $I_q$ to its nearest neighbors in the database by weighting the labels of $I_q$ with $\boldsymbol{\alpha}_q$. In the second step, the matrix $Y$ is defined so that its rows are non-zero only for the images in $\mathcal{N}(I_q)$. Then, with the same iterative process as (8), we can spread the query information from $\mathcal{N}(I_q)$ to all the images in the database.

When $I_q$ is in the database, if the above first step is conducted, whatever the value of $K$ is, $I_q$ is among its own nearest neighbors, and the optimal solution to (3) assigns the whole weight to $I_q$ itself. Then the matrix $Y$ defined in the second step will be the same as that defined in Section III-B. That is to say, the case when $I_q$ is in the database can be viewed as a special case of that when $I_q$ is provided by the user. As a result, a unified framework can be adopted in our RBIR system whether or not the query image is in the database.

Some issues should be addressed in relevance feedback when $I_q$ is outside the database.

1) $I_q$ can always be labeled as positive in the first round of relevance feedback if it is in the database. When it is provided by the user, the positive image set $\mathcal{I}^{+}$ may be empty, and the discriminating ability of each label cannot be estimated. To deal with this problem, the low-level labels of $I_q$ are calculated, and the process of estimating the discriminating ability of each label is performed with $I_q$ added to the positive set.

2) For multilabel propagation, when setting the matrix $Y$, besides the non-zero elements defined in Section III-C, we also set the rows of the images in $\mathcal{N}(I_q)$ as in the second step above.

3) When constructing the candidate set, the images in the set $\mathcal{N}(I_q)$ and their nearest neighbors are also included.

IV. LONG-TERM OFFLINE LEARNING

In the last two sections, our ways to deal with the two key points in graph-based semi-supervised learning have been introduced. To make the weighted graph more exact and the image labels more meaningful, long-term feedback information is also investigated in our RBIR system.

A. Constructing More Exact Graph

Our basic idea for long-term learning is similar to that of the unified optimization framework [29], in which an affinity matrix constructed based on the accumulated feedback information is introduced. However, the affinity matrix in [29] only represents the direct semantic correlations between positive images labeled in the same session, while the hidden semantic correlations of other images are ignored. Moreover, the long-term learning mechanism in [29] merely records the times that two images are labeled as positive simultaneously, rather than extracting meaningful semantic concepts of the database. To address these problems, MSR and SVM are introduced into our proposal, which makes the affinity matrix more effective in reflecting the semantic correlations of all the images.

To record multicorrelations among the database images, MSR is proposed to effectively make use of the user's feedback to extract hidden semantic concepts. These concepts are distributed in multiple semantic layers, and each layer corresponds to a kind

of hard partition of the semantic space. For more details about MSR, the readers are referred to [33].

In the process of long-term feedback, after several retrieval sessions, the MSR of the database is extracted. Suppose the existing MSR contains $C$ concepts; by training with the positive and negative samples of each concept, we can construct $C$ different SVM classifiers. According to their classification results, the probability that image $I_i$ belongs to the $c$-th concept is defined as in (16), in terms of the distance of image $I_i$ to the classification boundary of the $c$-th SVM classifier and a normalization constant that ensures the probabilities of $I_i$ over all the concepts sum to 1. The probability vector $\mathbf{p}_i$ can be used as a kind of high-level semantic representation for $I_i$. Then the semantic correlation between images $I_i$ and $I_j$ can be determined by the similarity of $\mathbf{p}_i$ and $\mathbf{p}_j$. In our proposal, the semantic affinity $S_{ij}$ is calculated from the symmetric Kullback-Leibler divergence between the two probability vectors. To keep the matrix $S$ sparse, a threshold is introduced. If $S_{ij}$ is below the threshold, it will be set to 0.

As pointed out in [40], the linear fusion optimization scheme for graph-based multimodality learning is equivalent to combining the two normalized affinity matrices linearly. Therefore, after long-term learning, a more exact graph for label propagation is constructed as in (17), a linear combination of $W$ and the normalized $S$, where $D$ is the diagonal matrix with $(i, i)$-element equal to the sum of the $i$-th row of $S$, and $\beta$ is a tunable parameter reflecting our confidence in the low-level features and the offline learning. Note that as mentioned in Section III-B, it is unnecessary to re-normalize $W$.

B. Forming More Meaningful Labels

Besides considering $\mathbf{p}_i$ as a new feature vector for $I_i$, we can also treat it as a set of high-level labels to describe the semantic information of the image. Thus, the labels calculated from both GMM and SVM can be concatenated to form a more meaningful label vector for each image. When spreading the information, the two kinds of labels are propagated independently. The way that the low-level labels are propagated has been introduced in Section III-B. As some concepts in different semantic layers may have intersections, the high-level labels should also be propagated with correlations. Suppose the results of semantic label propagation for image $I_i$ are denoted by a semantic label vector. Similarly to (15), we define the distance between $I_i$ and the user's query with semantic labels by (18)


where the weighted matrix used in calculating the quadratic distance can be determined by the concept similarity defined in [33]. Finally, the distances in (15) and (18) are combined linearly as in (19), where $\beta$ has the same function as in (17), and the images corresponding to the smallest combined distances are returned.

V. EXPERIMENTAL RESULTS

The proposed RBIR system is evaluated on two image databases. The first one consists of 10 000 images from the Corel gallery, which are assigned to 100 semantic categories of 100 images each; the second one includes 1500 images from Caltech-101 [41], belonging to 50 object categories of 30 images each. After segmentation, each image has an average of 6.88 and 7.55 regions in the two databases, respectively. The region-based features adopted in our experiments are the 64-D color histogram in HSV color space, the 9-D color moments in LUV color space, the 10-D coarseness vector and the 8-D directionality, which compose a 91-D feature space in total.

In the experiments, the performance measurement used is the top-$k$ precision $P_k$, which is the percentage of the relevant images in the top-$k$ returned images. In order to make a reasonable and fair comparison, $P_k$ is averaged over all the retrieval sessions. For the case when the query image is in the database, 1000 query images are selected randomly from each database, and are kept the same in different algorithms. When the query image is not in the database, the method for query selection will be explained in Section V-C2. In each retrieval session, four rounds of relevance feedback are conducted, and ten images are labeled during each round.

A. Determining the Edge Weights

To evaluate the robustness of different methods for graph construction, the precision of the first retrieval round is used for comparison. Since no feedback information is available, the effectiveness of the constructed graph has a direct impact on the final retrieval performance.
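The top-$k$ precision and its average over retrieval sessions, as defined in this section, can be sketched in a few lines of Python. The function names and the toy sessions are hypothetical and serve only as an illustration.

```python
def top_k_precision(ranked_ids, relevant_ids, k):
    """Fraction of the first k returned images that are relevant."""
    top = ranked_ids[:k]
    return sum(1 for image_id in top if image_id in relevant_ids) / float(k)

def mean_top_k_precision(sessions, k):
    """Average the top-k precision over (ranking, relevant set) sessions."""
    return sum(top_k_precision(ranking, relevant, k)
               for ranking, relevant in sessions) / len(sessions)

# Two toy retrieval sessions: a ranked list of image ids and the relevant ids.
sessions = [
    ([3, 7, 1, 9, 4], {3, 1, 4}),
    ([2, 8, 5, 6, 0], {8}),
]
precision_at_4 = mean_top_k_precision(sessions, k=4)
```

In the first session two of the top four images are relevant and in the second one is, so the averaged value is (0.5 + 0.25) / 2.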
Multiple low-level labels are not considered at the moment, and we simply assign 1 to the query image as its original label. The candidate set is also not involved, and the labels are propagated on the whole graph. The following experiments are conducted on the Corel database.

For the Gaussian function-based method, the number of neighbors for constructing the graph is set to 60, and EMD is used as the dissimilarity measure. The retrieval results with different values of the Gaussian function parameter are shown in Fig. 5. It can be seen that this parameter influences the retrieval precision significantly: satisfactory performance is obtained only when it lies around 0.05. Similar results are found when the number of neighbors for graph construction is set to other values.

Fig. 6 illustrates the retrieval precision of our proposal with different numbers of neighbors. Compared with Fig. 5, our method is more robust with respect to the variation of its parameter. The precision is stable over an intermediate range of this number, where both the global and the local structures are well reflected by the constructed graph. If the number is too small, the few neighbors make it impossible to explore the relationships among all the database images; if it is too large, noisy samples are included in the neighborhood. Both situations degrade the performance. At the same time, when the retrieval performance is stable, the precision is comparable with the best result of the Gaussian function-based method, which demonstrates the effectiveness of our proposal.

When the experiments are conducted on the Caltech database, similar results are obtained. The main difference is that the number of neighbors should be set to a smaller value to maintain the local structure, as there are fewer images in each category than in the Corel database.

Fig. 5. Retrieval results on the Corel database with Gaussian function-based graph construction.

Fig. 6. Retrieval results on the Corel database with our proposal for graph construction.

B. Determining the Number of Gaussian Components

Taking both the computational load and the effectiveness of the calculated low-level labels into consideration, the number of components in the GMM should be chosen carefully. Obviously, too large a number will slow down the process of multilabel propagation, while with too small a number, region-based features cannot be well explored. The precision after four rounds of feedback on the Corel database with different numbers of components is shown in Fig. 7, from which we can see that the retrieval


Fig. 7. Retrieval results on the Corel database with different numbers of Gaussian components.

Fig. 8. Performance comparison of different retrieval methods on the Corel database.

performance becomes better as the number of components increases, but the improvement is insignificant when it exceeds 150. Therefore, in the following experiments, we set it to 150. The most suitable number for the Caltech database is also determined by experimental results, and is set to 100 in our RBIR system.

In real applications, there may be thousands of components in the GMM, so determining the number of components by experiments is time-consuming. Some theoretically founded algorithms have been proposed to determine the number of components in a mixture model adaptively, and their results can help us decide its approximate value. For example, based on the minimum message length (MML) criterion, the algorithm in [42] determines the number of Gaussian components to be nearly 200 for the Corel database, which can guide parameter selection.

C. Comparison With the State of the Art

1) Comparison With Other Methods Without Offline Learning: The parameters in our proposal are set as follows. The numbers of neighbors for graph construction are set to 60 and 30 for the Corel and the Caltech databases, respectively. For label propagation, most of the useful information is obtained from the neighbors rather than the initial label; therefore, the propagation weight is fixed at 0.99, which is consistent with the experiments in [19], [26], [27]. Similarly to MRBIR, one further parameter is set to 0.5, a value for which our experiments show satisfactory retrieval results. The exponential function is chosen as the label kernel function, which provides good performance in [32]. In all the retrieval sessions, only the online feedback information is utilized, and the long-term offline learning mechanism is not yet involved.

The methods compared include MRBIR, similarity-based online boosting feature selection (BoostFS) [37], unsupervised hidden concept discovery (HCD) [16], and our proposal with correlated and independent multilabel propagation.
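To make the Gaussian function-based baseline and the propagation setting above concrete, the following is a minimal single-label sketch. It assumes the common Gaussian weight exp(-d^2 / (2 sigma^2)) on a k-nearest-neighbor graph and the standard iteration f <- alpha S f + (1 - alpha) y from [19] with alpha = 0.99. It is not the correlated multilabel propagation proposed in this paper, and the names `gaussian_knn_graph`, `propagate`, `sigma`, and `alpha` are chosen for the example. The distance matrix would hold EMD values in the paper's setting; any symmetric dissimilarity works here.

```python
import math


def gaussian_knn_graph(dist, k, sigma):
    """Symmetric k-nearest-neighbor affinity matrix with Gaussian edge weights.

    dist: square list-of-lists of pairwise dissimilarities (e.g., EMD values).
    """
    n = len(dist)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Indices of the k nearest neighbors of node i, excluding i itself.
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        for j in order[:k]:
            weight = math.exp(-dist[i][j] ** 2 / (2.0 * sigma ** 2))
            w[i][j] = weight
            w[j][i] = weight  # keep the graph undirected
    return w


def propagate(w, y, alpha=0.99, iterations=1000):
    """Iterate f <- alpha * S f + (1 - alpha) * y, with S = D^{-1/2} W D^{-1/2}."""
    n = len(w)
    degree = [sum(row) for row in w]
    inv_sqrt = [1.0 / math.sqrt(d) if d > 0 else 0.0 for d in degree]
    s = [[inv_sqrt[i] * w[i][j] * inv_sqrt[j] for j in range(n)] for i in range(n)]
    f = list(y)
    for _ in range(iterations):
        f = [
            alpha * sum(s[i][j] * f[j] for j in range(n)) + (1.0 - alpha) * y[i]
            for i in range(n)
        ]
    return f
```

With alpha close to 1, the final scores depend mostly on the graph structure rather than on the initial label vector, which matches the motivation given in the text. It also means the iteration converges slowly, so the iteration count here is deliberately generous.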
For MRBIR, the edge weights are calculated based on the Gaussian function and EMD, and the Gaussian function parameter is set to 0.05. Both the codebook size in BoostFS and the number of tokens in HCD are set to 600, and the initial retrieval results of BoostFS and HCD are based on the similarity calculated by UFM [38].

Fig. 8 shows the retrieval precision of the compared methods on the Corel database. From the figure, we can see that the graph-based methods, including MRBIR and our proposal, outperform BoostFS and HCD significantly, and that the performance of our proposal is also better than that of MRBIR. Our proposal and BoostFS adopt the same criterion for evaluating the discriminating ability of each label; the main difference is that our method is semi-supervised, whereas the boosting framework in BoostFS is supervised. The superiority of our proposal over BoostFS demonstrates the advantage of utilizing unlabeled samples for image retrieval. In HCD, the hidden concepts are discovered by an unsupervised algorithm, and the labeled samples are simply used for query moving and expansion. Its inferiority shows that the labeled samples should be exploited more fully to achieve satisfactory performance. Both our proposal and MRBIR are in the framework of graph-based semi-supervised learning, and the comparison shows the effectiveness of utilizing multiple labels. It can also be noticed that, by considering the correlations, the algorithm of correlated multilabel propagation achieves better retrieval results than independent propagation.

The performance of MRBIR and our proposal on the Caltech database is shown in Fig. 9, which further demonstrates the effectiveness of our proposal. It can be seen that the retrieval precision is lower than that of the corresponding experiments on the Corel database. The reason is that the visual diversity of images from the same category makes the retrieval task more challenging on the Caltech database.

2) Performance Comparison With Query Outside the Database: To deal with the case when the query is not in the database, all the images in each database are first partitioned into five parts of equal size.
In each experiment, 200 images from one part are randomly chosen as queries, and all the images in the other four parts constitute the new database. That is to say, we also conducted 1000 retrieval sessions altogether, and the average results are illustrated below. We use gMRBIR [28] for comparison, in which the parameter for graph construction is also set to 0.05.
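The query selection protocol just described can be sketched as below. This is an illustrative reading of the text rather than the authors' script: the text does not say how the five parts are formed, so the random shuffle before partitioning, the fixed `seed`, and the function name `out_of_database_splits` are assumptions.

```python
import random


def out_of_database_splits(image_ids, n_parts=5, queries_per_part=200, seed=0):
    """Yield (queries, database) pairs for the query-outside-database setting.

    Each part is held out in turn: queries are sampled from the held-out part,
    and the remaining parts together form the database for those queries.
    """
    rng = random.Random(seed)
    ids = list(image_ids)
    rng.shuffle(ids)  # assumption: the partition is random
    # Striding gives parts of equal size when len(ids) is divisible by n_parts.
    parts = [ids[p::n_parts] for p in range(n_parts)]
    for p, held_out in enumerate(parts):
        queries = rng.sample(held_out, min(queries_per_part, len(held_out)))
        database = [i for q, part in enumerate(parts) if q != p for i in part]
        yield queries, database
```

For a 10 000-image collection this gives five experiments of 200 queries against an 8000-image database each, which is 1000 retrieval sessions in total, as stated in the text.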


Fig. 9. Performance comparison of MRBIR and our proposal on the Caltech database.

Fig. 10. Performance comparison of gMRBIR and our proposal with query outside the database.

Fig. 11. Performance comparison of three offline learning methods on the Corel database, with (a) 500 and (b) 1000 accumulated retrieval sessions.

The retrieval results of the two methods on both databases are shown in Fig. 10, which shows the superiority of our proposal over gMRBIR. Moreover, since the edge weights in gMRBIR are also calculated by the Gaussian function, its parameter can greatly influence the final performance. The retrieval results of gMRBIR with different values of this parameter are similar to those in Fig. 5, and hence are not illustrated here.

3) Comparison With Other Long-Term Learning Methods: We compare our proposal with two other classical long-term offline learning methods: the unified optimization framework (UOF) [29], and the method that maps low-level features to high-level semantic concepts (MAP) [43]. Only the case when the query is in the database is considered. In our proposal, two parameters are used to combine low-level features and long-term feedback information. As they have the same function, we set them equal for simplicity. In the following experiments, their common value is set to 0.5. The influence of this parameter is discussed in Section V-D2.

Note that the original UOF method in [29] utilizes global features. To make a fair comparison, the same graph constructed from region-based features is used in both UOF and our proposal. The regularization parameters in UOF are set to the values that give the best performance in our experiments. Similarly to our method, MAP is also based on MSR; however, the whole algorithm is in a supervised framework.

As it contains more images and thus supplies more accumulated feedback information, the Corel database is used for the experiments on long-term learning. The retrieval results of the three methods with different numbers of accumulated sessions are shown in Fig. 11. From the figure, we can see that our proposal outperforms both UOF and MAP, which demonstrates the effectiveness of our method for exploiting the accumulated feedback information. It can also be seen that after 1000 retrieval sessions, the superiority of our proposal becomes less significant. The reason is that at this point, all three methods are able to exploit the long-term feedback information to organize the database images according to the ground truth. In real applications, it is impossible that there are only 100 query concepts without intersection. Therefore, the retrieval performance with real query concepts is more important.
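The linear combination of low-level and long-term information used in these experiments can be illustrated with a short sketch. The exact form of the combination is not recoverable from the text here, so the convex combination below, the choice of which term receives `weight`, and the names `combine_and_rank`, `low_level_dist`, and `semantic_dist` are assumptions made for illustration only.

```python
def combine_and_rank(low_level_dist, semantic_dist, weight=0.5, top_n=10):
    """Rank images by a convex combination of two distances to the query.

    low_level_dist, semantic_dist: dicts mapping image ID -> distance.
    weight: share given to the low-level distance (assumed convention).
    Returns the IDs of the top_n images with the smallest combined distance.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    if low_level_dist.keys() != semantic_dist.keys():
        raise ValueError("both distance tables must cover the same images")
    combined = {
        image_id: weight * low_level_dist[image_id]
        + (1.0 - weight) * semantic_dist[image_id]
        for image_id in low_level_dist
    }
    return sorted(combined, key=combined.get)[:top_n]
```

Setting `weight` to 0 or 1 reduces the ranking to a single source of information, which is the degenerate case examined in Section V-D2.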


TABLE I RETRIEVAL PERFORMANCE OF THE FIRST ROUND WITH DIFFERENT SEGMENTATION RESULTS

TABLE II RETRIEVAL PERFORMANCE AFTER FEEDBACK WITH DIFFERENT SEGMENTATION RESULTS

Fig. 12. Retrieval example with the query concept "animals on the grass" on the Corel database. The query image lies in the upper left corner. An image with a red frame is relevant, while one with a green frame is irrelevant. (a) Our proposal; (b) UOF; (c) MAP.

4) Performance Comparison With Real Query Concepts: After the long-term offline learning from 1000 simulated retrieval sessions based on the ground truth of the Corel database, we asked ten users to perform 1000 real retrieval sessions. None of the users had special training; they were only told not to change the query concept during each retrieval session. To demonstrate the effectiveness of our proposal with real query concepts, Fig. 12 shows a retrieval example after 2000 accumulated retrieval sessions. The user wants to find images belonging to the concept "animals on the grass," which is not contained in the original ground truth. The retrieval results of the different methods after four rounds of relevance feedback are shown. It can be concluded that UOF is the worst of the three, as it does not extract meaningful semantic concepts from the database. With the help of MSR, both MAP and our proposal achieve satisfactory performance. Our proposal also outperforms MAP, which demonstrates the advantage of the semi-supervised learning method over the supervised one.

D. Influence of Parameters

1) Influence of the Segmentation Results: By using different parameters in the JSEG algorithm, different segmentation results can be obtained. The influence of the segmentation results on our proposal appears in two aspects: the constructed graph and the low-level labels for propagation. Different segmentation results lead to different linear programming problems for graph construction, and different region numbers and region-based features change the calculated low-level labels. For the Corel database, four segmentation results are used for performance comparison, in which the average numbers of regions per image are 4.66, 6.88, 10.51, and 15.54, respectively. As in Section V-A, to demonstrate the influence of the segmentation results on the

graph construction, the precision of the first retrieval round without considering multiple low-level labels is used for comparison. The precision values obtained with the different segmentation results are shown in Table I. To involve the low-level labels, retrieval experiments with online feedback are conducted, and the precision after two and four rounds of relevance feedback is shown in Table II. From the two tables, we can see that the segmentation results do not influence the retrieval performance significantly. Similar results are also obtained when the Caltech database is utilized.

We think the reasons can also be analyzed from two aspects. First, our proposal for graph construction is based on EMD, which is robust to different image segmentation results because it considers many-to-many matching. Second, the calculated low-level labels are usually not sensitive to different segmentation results, which often differ in how regions are merged. Suppose two regions with similar features are merged into one bigger region in one segmentation result, while they are not in another. When the EM algorithm is run in the second case, the two regions are very likely to be generated by the same Gaussian component. According to the way the low-level labels are determined, the image labels are then likely to be similar for the two segmentation results.

However, the complexity of the linear programming problem (3) is highly related to the average region number per image. In the optimization problem, both the number of variables and the number of equality constraints grow with the region number. Therefore, a large average region number leads to a heavy computational load.

2) Influence of the Combination Parameter: Fig. 13 illustrates the retrieval results of our offline learning framework on the Corel database after 1000 accumulated sessions, with the parameter varying from 0 to 1. It can be noticed that the retrieval performance is not sensitive to the parameter over the interior of this range. For values close to either end of the range, however, almost only one kind of information is utilized, so the performance deteriorates.

VI. CONCLUSIONS AND FUTURE WORK

In this paper, in the framework of graph-based semi-supervised learning, a method based on multilabel neighborhood propagation is proposed for RBIR. To exploit the information in region-based features, on the one hand, the edge weights of the graph are automatically determined by introducing a mixture distribution into EMD; on the other hand, multiple low-level labels calculated based on GMM are propagated on the graph


Fig. 13. Retrieval results of our offline learning method on the Corel database after 1000 accumulated sessions, with the combination parameter varying from 0 to 1.

with correlations. Long-term offline learning is also involved in our system to incorporate accumulated feedback information. It is demonstrated that the graph constructed by our algorithm is effective and robust, and that our proposal can further improve the retrieval performance of the RBIR system. For future work, we will extend our proposal in the following two aspects: 1) instead of constructing a subgraph corresponding to the candidate set, utilizing a multilevel semi-supervised learning algorithm to propagate information on the whole graph; 2) applying our proposed methods to other fields of multimedia retrieval.

ACKNOWLEDGMENT

The authors would like to thank the anonymous reviewers and the associate editor for their valuable comments and suggestions.

REFERENCES
[1] Y. Rui, T. S. Huang, and S. F. Chang, "Image retrieval: Current techniques, promising directions, and open issues," J. Vis. Commun. Image Represent., vol. 10, pp. 39–62, 1999.
[2] M. Swain and D. Ballard, "Color indexing," Int. J. Comput. Vis., vol. 7, no. 1, pp. 11–32, 1991.
[3] H. Tamura, S. Mori, and T. Yamawaki, "Texture features corresponding to visual perception," IEEE Trans. Syst., Man, Cybern., vol. SMC-8, no. 6, pp. 460–473, 1978.
[4] A. K. Jain and A. Vailaya, "Shape-based retrieval: A case study with trademark image databases," Pattern Recognit., vol. 31, no. 9, pp. 1369–1390, 1998.
[5] A. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain, "Content-based image retrieval at the end of the early years," IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 12, pp. 1349–1380, 2000.
[6] Y. Rui, T. S. Huang, M. Ortega, and S. Mehrotra, "Relevance feedback: A power tool for interactive content-based image retrieval," IEEE Trans. Circuits Syst. Video Technol., vol. 8, no. 5, pp. 644–655, 1998.
[7] X. S. Zhou and T. S. Huang, "Relevance feedback for image retrieval: A comprehensive review," Multimedia Syst., vol. 8, pp. 536–544, 2003.
[8] J. Z. Wang, J. Li, and G. Wiederhold, "SIMPLIcity: Semantics-sensitive integrated matching for picture libraries," IEEE Trans. Pattern Anal. Mach. Intell., vol. 23, no. 9, pp. 947–963, 2001.
[9] C. Carson, S. Belongie, H. Greenspan, and J. Malik, "Blobworld: Image segmentation using expectation-maximization and its application to image querying," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 8, pp. 1026–1038, 2002.
[10] R. Datta, D. Joshi, J. Li, and J. Z. Wang, "Image retrieval: Ideas, influences, and trends of the new age," ACM Comput. Surv., vol. 40, no. 2, pp. 1–60, 2008.
[11] V. Vapnik, The Nature of Statistical Learning Theory. New York: Springer-Verlag, 1995.
[12] S. Tong and E. Chang, "Support vector machine active learning for image retrieval," in Proc. ACM Int. Conf. Multimedia, 2001, pp. 107–118.
[13] F. Jing, M. Li, H. J. Zhang, and B. Zhang, "An efficient and effective region-based image retrieval framework," IEEE Trans. Image Process., vol. 13, no. 5, pp. 699–709, 2004.
[14] D. Tao, X. Tang, X. Li, and X. Wu, "Asymmetric bagging and random subspace for support vector machines-based relevance feedback in image retrieval," IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 7, pp. 1088–1099, 2006.
[15] Y. Chen, J. Z. Wang, and R. Krovetz, "CLUE: Cluster-based retrieval of images by unsupervised learning," IEEE Trans. Image Process., vol. 14, no. 8, pp. 1187–1201, 2005.
[16] R. Zhang and Z. M. Zhang, "Effective image retrieval based on hidden concept discovery in image database," IEEE Trans. Image Process., vol. 16, no. 2, pp. 562–572, 2007.
[17] K. Lu, J. Zhao, and D. Cai, "An algorithm for semi-supervised learning in image retrieval," Pattern Recognit., vol. 39, pp. 717–720, 2006.
[18] Z. Zhou, K. Chen, and H. Dai, "Enhancing relevance feedback in image retrieval using unlabeled data," ACM Trans. Inf. Syst., vol. 24, no. 2, pp. 219–244, 2006.
[19] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf, "Learning with local and global consistency," in Advances in Neural Information Processing Systems, 2003.
[20] X. Zhu, Z. Ghahramani, and J. Lafferty, "Semi-supervised learning using Gaussian fields and harmonic functions," in Proc. Int. Conf. Machine Learning, 2003.
[21] M. Wang, T. Mei, X. Yuan, Y. Song, and L. R. Dai, "Video annotation by graph-based learning with neighborhood similarity," in Proc. ACM Int. Conf. Multimedia, 2007, pp. 325–328.
[22] J. Tang, X. S. Hua, G. J. Qi, M. Wang, T. Mei, and X. Wu, "Structure-sensitive manifold ranking for video concept detection," in Proc. ACM Int. Conf. Multimedia, 2007, pp. 852–861.
[23] F. Wang and C. Zhang, "Label propagation through linear neighborhoods," in Proc. Int. Conf. Machine Learning, 2006.
[24] X. Zhu and Z. Ghahramani, "Learning from labeled and unlabeled data with label propagation," Carnegie Mellon Univ., Pittsburgh, PA, Tech. Rep. CMU-CALD-02-107, 2002.
[25] M. Szummer and T. Jaakkola, "Partially labeled classification with Markov random walks," in Advances in Neural Information Processing Systems, 2002.
[26] D. Zhou, J. Weston, A. Gretton, O. Bousquet, and B. Schölkopf, "Ranking on data manifolds," in Advances in Neural Information Processing Systems, 2003.
[27] J. He, M. Li, H. J. Zhang, H. Tong, and C. Zhang, "Manifold-ranking based image retrieval," in Proc. ACM Int. Conf. Multimedia, 2004, pp. 9–16.
[28] J. He, M. Li, H. J. Zhang, H. Tong, and C. Zhang, "Generalized manifold-ranking-based image retrieval," IEEE Trans. Image Process., vol. 15, no. 10, pp. 3170–3177, 2006.
[29] H. Tong, J. He, M. Li, W. Y. Ma, C. Zhang, and H. J. Zhang, "A unified optimization based learning method for image retrieval," in Proc. IEEE Int. Conf. Computer Vision and Pattern Recognition, 2005, vol. 2, pp. 230–235.
[30] J. Tang, X. S. Hua, G. J. Qi, and X. Wu, "Typicality ranking via semi-supervised multiple-instance learning," in Proc. ACM Int. Conf. Multimedia, 2007, pp. 297–300.
[31] Y. Rubner, C. Tomasi, and L. J. Guibas, "The earth mover's distance as a metric for image retrieval," Int. J. Comput. Vis., vol. 40, no. 2, pp. 99–121, 2000.
[32] F. Kang, R. Jin, and R. Sukthankar, "Correlated label propagation with application to multi-label learning," in Proc. IEEE Int. Conf. Computer Vision and Pattern Recognition, 2006, vol. 2, pp. 1719–1726.
[33] W. Jiang, G. Er, Q. Dai, and J. Gu, "Hidden annotation for image retrieval with long-term relevance feedback learning," Pattern Recognit., vol. 38, pp. 2007–2021, 2005.
[34] Y. Deng, B. S. Manjunath, and H. Shin, "Color image segmentation," in Proc. IEEE Int. Conf. Computer Vision and Pattern Recognition, 1999, vol. 2, pp. 446–451.
[35] F. Jing, M. Li, H. J. Zhang, and B. Zhang, "Relevance feedback in region-based image retrieval," IEEE Trans. Circuits Syst. Video Technol., vol. 14, no. 5, pp. 672–681, 2004.


[36] Y. Liu, D. Zhang, G. Lu, and W. Y. Ma, "A survey of content-based image retrieval with high-level semantics," Pattern Recognit., vol. 40, pp. 262–282, 2007.
[37] W. Jiang, G. Er, Q. Dai, and J. Gu, "Similarity-based online feature selection in content-based image retrieval," IEEE Trans. Image Process., vol. 15, no. 3, pp. 702–712, 2006.
[38] Y. X. Chen and J. Z. Wang, "A region-based fuzzy feature matching approach to content-based image retrieval," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 9, pp. 1252–1267, 2002.
[39] S. Santini and R. Jain, "Similarity measures," IEEE Trans. Pattern Anal. Mach. Intell., vol. 21, no. 9, pp. 871–883, 1999.
[40] H. Tong, J. He, M. Li, C. Zhang, and W. Y. Ma, "Graph based multi-modality learning," in Proc. ACM Int. Conf. Multimedia, 2005, pp. 862–871.
[41] L. Fei-Fei, R. Fergus, and P. Perona, "Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories," in IEEE CVPR Workshop on Generative-Model Based Vision, 2004.
[42] M. H. C. Law, M. A. T. Figueiredo, and A. K. Jain, "Simultaneous feature selection and clustering using mixture models," IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 9, pp. 1154–1166, 2004.
[43] W. Jiang, K. L. Chan, M. Li, and H. J. Zhang, "Mapping low-level features to high-level semantic concepts in region-based image retrieval," in Proc. IEEE Int. Conf. Computer Vision and Pattern Recognition, 2005, vol. 2, pp. 244–249.

Qionghai Dai (SM'05) received the B.S. degree in mathematics from Shanxi Normal University, China, in 1987, and the M.E. and Ph.D. degrees in computer science and automation from Northeastern University, China, in 1994 and 1996, respectively. Since 1997, he has been with the faculty of Tsinghua University, Beijing, China, and is currently a Professor and the Director of the Broadband Networks and Digital Media Laboratory. His research areas include signal processing, computer vision and graphics, and video processing and communication.

Wenli Xu received the B.S. degree in electrical engineering and the M.E. degree in automatic control engineering from Tsinghua University, Beijing, China, in 1970 and 1980, respectively, and the Ph.D. degree in electrical and computer engineering from the University of Colorado at Boulder in 1990. He is currently a Professor at Tsinghua University and a Director of Chinese Association of Automation. His research interests are mainly in the areas of automatic control and computer vision.

Fei Li received the B.S. degree from Beijing University of Aeronautics and Astronautics, Beijing, China, in 2004. He is currently pursuing the Ph.D. degree at the Department of Automation, Tsinghua University, Beijing. His research interests include content-based multimedia retrieval, pattern recognition, and machine learning.

Guihua Er received the B.S. degree from the Department of Automation, Tianjin University, Tianjin, China, in 1984, and the M.S. degree from the Department of Automation, Beijing Institute of Technology, Beijing, China, in 1989. She is now an Associate Professor and the Vice Director of Broadband Networks and Digital Media Laboratory in the Department of Automation, Tsinghua University, Beijing. Her research interests include multimedia database and multimedia information processing.
