A Semi-supervised Clustering via Orthogonal Projection


Harbin Engineering University
Harbin 150001, China
cuipeng83@163.com    zrbzrb@hrbeu.edu.cn

Abstract—Because its dimensionality is very high, the image feature space is usually complex, and dimensionality reduction techniques are widely used to process it effectively. Semi-supervised clustering incorporates limited supervision into unsupervised clustering in order to improve clustering performance. However, many existing semi-supervised clustering methods cannot handle high-dimensional sparse data. To solve this problem, we propose a semi-supervised fuzzy clustering method via constrained orthogonal projection. Experimental results on different datasets show that the method achieves good clustering performance on high-dimensional data.

I. INTRODUCTION

In recent years, because of the fast growth of feature information and the volume of image data, many tasks in multimedia processing have become increasingly challenging. Dimensionality reduction techniques have been proposed to uncover the underlying low-dimensional structures of the high-dimensional image space [1]. These efforts have proved to be very useful in image retrieval, classification, and clustering. There are a number of dimensionality reduction techniques in the literature. One of the classical methods is Principal Component Analysis (PCA) [2], which minimizes the information loss in the reduction process. One of the disadvantages of PCA is that it is likely to distort the local structures of a dataset. Locality Preserving Projection (LPP) [3-4] encodes the local neighborhood structure into a similarity matrix and derives a linear manifold embedding as the optimal approximation to this matrix; LPP, on the other hand, may overlook the global structures.

Recently, semi-supervised learning, which leverages domain knowledge represented in the form of pairwise constraints, has gained much attention [6-10]. Various reduction techniques have been developed to utilize this form of knowledge [11-12]. The constrained FLD defines the embedding based solely on must-link constraints. Semi-Supervised Dimensionality Reduction (SSDR) [13] preserves the intrinsic global covariance structure of the data while exploiting both kinds of constraints.

As many semi-supervised clustering methods are based on density or distance, they have difficulty handling high-dimensional data. Thus, dimensionality reduction must be incorporated into the semi-supervised clustering process. We propose the COPFC (Constrained Orthogonal Projection Fuzzy Clustering) method to solve this problem.

II. COPFC METHOD FRAMEWORK

Figure 1. COPFC framework

Figure 1 shows the framework of the COPFC method. Given a set of instances and supervision in the form of must-link constraints CML = {(xi, xj)}, where (xi, xj) must reside in the same cluster, and cannot-link constraints CCL = {(xi, xj)}, where (xi, xj) should be in different clusters, the COPFC method is composed of three steps. In the first step, a preprocessing method is exploited to reduce the unlabelled instances and pairwise constraints according to the transitivity property of must-link constraints. In the second step, a constraint-guided orthogonal projection method, called COPFCproj, is used to project the original data into a low-dimensional space. Finally, we apply a semi-supervised fuzzy clustering algorithm, called COPFCfuzzy, to produce the clustering results on the projected low-dimensional dataset.
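The first (preprocessing) step exploits the transitivity of must-link constraints: instances chained together by must-links can be collapsed into a single chunklet. The paper gives no pseudocode for this step, so the following is only a minimal sketch of the idea, with function and variable names of our own choosing:

```python
# Sketch of the preprocessing step: grouping instances by the transitive
# closure of must-link constraints (union-find). Names are illustrative.

def must_link_closure(n, must_links):
    """Union-find over n instances; returns the chunklet id of each instance."""
    parent = list(range(n))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path compression
            a = parent[a]
        return a

    for i, j in must_links:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj  # merge the two chunklets

    return [find(i) for i in range(n)]

# Instances 0, 1, 2 are chained by must-links, so they share one chunklet.
labels = must_link_closure(5, [(0, 1), (1, 2)])
```

Each resulting chunklet can then be treated as a single unit in the later projection and clustering steps.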


III. COPFCPROJ - A CONSTRAINED ORTHOGONAL PROJECTION METHOD

In a typical image retrieval system, each image is represented by an m-dimensional feature vector x whose jth value is denoted as x_j. During the retrieval process, the user is allowed to mark several images with must-links which match his query interest, and also to indicate the irrelevant ones with cannot-links. COPFCproj is a linear method and depends on a set of l axes p_i. For a given image x, its embedding coordinates are the projections of x onto the l axes, which are P_i^x = Σ_{j=1..m} x_j p_ij, 1 ≤ i ≤ l.

Since the relevant images in ML are similar to each other, they should be kept compact in the new space. In other words, the distances among them should be kept small, while the irrelevant images in CL are to be mapped as far apart from those in ML as possible. The above two criteria can be formally stated as follows:

min Σ_{x∈ML} Σ_{y∈ML} Σ_{i=1..l} (P_i^x − P_i^y)²   (1)

max Σ_{x∈ML} Σ_{y∈CL} Σ_{i=1..l} (P_i^x − P_i^y)²   (2)

Intuitively, equation (1) forces the embedding to have the image points in ML reside in a small local neighborhood in the new feature space, and equation (2) reflects our objective to prevent the points in ML and CL from being close together after the embedding. To construct a salient embedding, COPFCproj combines these two criteria and finds the axes one by one by optimizing the following objective:

min Σ_{x∈ML} Σ_{y∈ML} (P_i^x − P_i^y)²   (3)

subject to

Σ_{x∈ML} Σ_{y∈CL} (P_i^x − P_i^y)² = 1   (4)

p_i^T p_1 = p_i^T p_2 = p_i^T p_3 = ... = p_i^T p_{i−1} = 0   (5)

where T denotes the transpose of a vector. The choice of the constant 1 on the right-hand side of equation (4) is rather arbitrary, as any other value (except 0) would not cause any substantial change in the embedding produced. The constraint in equation (5) forces all the axes to be mutually orthogonal. Equations (3) and (4) are implicit functions of the axes p_i and should be rewritten in explicit form. First, we introduce the necessary notation. For a given set X of image points, the mean of X is an m-dimensional column vector M(X), whose ith component is

M_i(X) = (1/|X|) Σ_{x∈X} x_i   (6)

and its covariance matrix C(X) is an m×m matrix:

C_ij(X) = (1/|X|) Σ_{x∈X} x_i x_j − M_i(X) M_j(X)   (7)

For two sets X and Y, define an m×m matrix M(X, Y) in which M(X, Y) = (M(X) − M(Y))(M(X) − M(Y))^T. Accordingly, we can rewrite equation (3) as follows:

Σ_{x∈ML} Σ_{y∈ML} (P_i^x − P_i^y)² = 2|ML|² p_i^T C(ML) p_i   (8)

Similarly, we can rewrite equation (4) as follows:

Σ_{x∈ML} Σ_{y∈CL} (P_i^x − P_i^y)² = |ML||CL| p_i^T (C(ML) + C(CL) + M(ML, CL)) p_i   (9)

Hence, the problem to be solved is min p_i^T A p_i, subject to p_i^T B p_i = 1, p_i^T p_1 = ... = p_i^T p_{i−1} = 0, where

A = 2|ML|² C(ML),   B = |ML||CL| (C(ML) + C(CL) + M(ML, CL)).

It is easy to see that both A and B are symmetric and positive semi-definite. The above problem can be solved using the method of Lagrange multipliers. Below we discuss the procedure to obtain the optimal axes.

The first projection axis p_1 is the eigenvector of the generalized eigen-problem A p_1 = λ B p_1 corresponding to the smallest eigenvalue. After that, we compute the remaining axes one by one in the following fashion. Suppose we have already obtained the first (k−1) axes; define:

P^(k−1) = [p_1, p_2, ..., p_{k−1}],
Q^(k−1) = [P^(k−1)]^T B^(−1) P^(k−1)   (10)

Then the kth axis p_k is the eigenvector associated with the smallest eigenvalue of the eigen-problem:

(I − B^(−1) P^(k−1) [Q^(k−1)]^(−1) [P^(k−1)]^T) B^(−1) A p_k = λ p_k   (11)

We adopt the above procedure to determine the optimal l orthogonal projection axes, which preserve the metric structure of the image space for the given relevance feedback information. The new coordinates for the image data points can then be derived accordingly.

IV. COPFCFUZZY SEMI-SUPERVISED CLUSTERING

COPFCfuzzy is a new search-based semi-supervised clustering algorithm that allows the constraints to guide the clustering process towards an appropriate partition. To this end, we define an objective function that takes into account both the feature-based similarity between data points and the pairwise constraints [14-16]. Let ML be the set of must-link constraints, i.e. (x_i, x_j) ∈ ML implies that x_i and x_j should be assigned to the same cluster, and CL the set of cannot-link constraints, i.e. (x_i, x_j) ∈ CL implies that x_i and x_j should be assigned to different clusters. We can write the objective function that COPFCfuzzy must minimize:

J(V, U) = Σ_{k=1..C} Σ_{i=1..N} (u_ik)² d²(x_i, μ_k)
        + λ ( Σ_{(x_i,x_j)∈ML} Σ_{k=1..C} Σ_{l=1..C, l≠k} u_ik u_jl + Σ_{(x_i,x_j)∈CL} Σ_{k=1..C} u_ik u_jk )
        − γ Σ_{k=1..C} [ Σ_{i=1..N} u_ik ]²   (12)
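Objective (12) can be evaluated directly for a candidate fuzzy partition. The numpy sketch below is our reading of the formula; the function and argument names are ours, not the paper's:

```python
import numpy as np

def copfc_objective(X, U, centers, ML, CL, lam, gamma):
    """J(V, U) of equation (12): X is (N, m) data, U is the (N, C) fuzzy
    membership matrix, centers is (C, m), ML/CL are lists of index pairs."""
    C = U.shape[1]
    # First term: membership-weighted squared distances to the prototypes.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # (N, C)
    j_fcm = (U ** 2 * d2).sum()
    # Second term: cost of violated must-link and cannot-link constraints.
    ml_cost = sum(U[i, k] * U[j, l] for (i, j) in ML
                  for k in range(C) for l in range(C) if l != k)
    cl_cost = sum(U[i, k] * U[j, k] for (i, j) in CL for k in range(C))
    # Third term: squared cluster cardinalities (cluster competition).
    competition = (U.sum(axis=0) ** 2).sum()
    return j_fcm + lam * (ml_cost + cl_cost) - gamma * competition
```

A search-based optimizer can use this value to compare candidate membership matrices U under the given constraints.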


The first term in equation (12) is the sum of squared distances to the prototypes weighted by the constrained memberships (the Fuzzy C-Means objective function). This term reinforces the compactness of the clusters.

The second component in equation (12) is composed of the cost of violating the pairwise must-link constraints and the cost of violating the pairwise cannot-link constraints. This term is weighted by λ, a constant factor that specifies the relative importance of the supervision.

The third component in equation (12), the sum of the squares of the cardinalities of the clusters, controls the competition between clusters. It is weighted by γ.

When the parameters are well chosen, the final partition will minimize the sum of intra-cluster distances, while partitioning the dataset into the smallest number of clusters such that the specified constraints are respected as well as possible.

V. EXPERIMENTS

A. Dataset selection and evaluation criterion

We performed experiments on the COREL image database and on 2 datasets from UCI, as follows:

(1) We selected 1500 images from the COREL image database. They were divided into 15 sufficiently distinct classes of 100 images each. In our experiments, each image was represented by a 37-dimensional vector, which included 3 types of features extracted from the image. We compared the COPFCproj algorithm against PCA and SSDR. The performance of each technique was evaluated under various amounts of domain knowledge and different reduced dimensionalities. In each scenario, after the dimensionality reduction, Kmeans was applied to classify the test images.

(2) The Iris and Wine datasets from the UCI repository. The Iris dataset contains three classes of 50 instances each and 4 numerical attributes; the Wine dataset contains three classes, 178 instances, and 13 numerical attributes. The simplicity and low dimension of these datasets also allow us to display the constraints that are actually selected. To evaluate the clustering performance of COPFCfuzzy, we compared the COPFCfuzzy algorithm against the Kmeans and PCKmeans algorithms.

(3) Evaluation criterion. In this paper, we use the Corrected Rand Index (CRI) as the clustering validation measure:

CRI = (A − C) / (n(n−1)/2 − C)   (13)

where A is the number of instance pairs whose assigned clusters agree with the actual clusters; n is the number of instances in the dataset, so n(n−1)/2 is the number of all instance pairs in the dataset; and C is the number of constraints.

For each dataset, we ran each experiment 20 times. To study the effect of constraints, 100 constraints were generated randomly for the test set. Each point on the learning curve is an average of the results over 20 runs.

B. The effectiveness of COPFC

In figure 2, we apply three different dimensionality reduction methods (COPFCproj, PCA, SSDR) to the original images, reducing the dimensionality to 15 and 20 respectively. The reduced-dimension data were then clustered with Kmeans. The curves in figure 2 show that the clustering performance of PCA is independent of the number of constraints, while that of SSDR changes only slightly. For COPFCproj, clustering performance improves greatly as the number of constraints increases. When only a small number of constraints is available, the clustering performance of COPFCproj is the worst of the three methods. In general, COPFCproj outperforms PCA and SSDR for dimensionality reduction.

Figure 2. Clustering performance (CRI) with different numbers of constraints (curves: COPFCproj, SSDR, PCA)

Figure 3 shows the clustering performance of three methods on the Iris and Wine datasets. On all datasets, COPFCfuzzy obtained the best performance. Of the three methods, the clustering performance of Kmeans is the worst. Although PCKmeans effectively improves the clustering performance, it is still worse than COPFCfuzzy.

Figure 3. Clustering performance (CRI) on UCI datasets: (a) Iris dataset, (b) Wine dataset (curves: COPFC, PCKmeans, Kmeans)

VI. CONCLUSION AND FUTURE WORK

We propose a semi-supervised fuzzy clustering method via orthogonal projection to handle high-dimensional sparse data in image feature space. The method reduces the dimensionality of the images via orthogonal projection, and clusters the reduced-dimensional data with a constrained fuzzy clustering algorithm.

There are several potential directions for future research. First, we are interested in automatically identifying the right number for the reduced dimensionality based on background knowledge, rather than requiring a pre-specified value. Second, we plan to explore alternative methods of employing supervision to guide the unsupervised clustering.
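The CRI measure of equation (13) can be computed directly from a predicted and an actual labeling. The sketch below is a literal reading of (13), with argument names of our own choosing:

```python
# Literal implementation of equation (13). Names are illustrative:
# agree plays the role of A, n_constraints the role of C.

def corrected_rand_index(pred, true, n_constraints):
    """A = pairs whose predicted co-assignment agrees with the actual one,
    n = dataset size, C = number of constraints."""
    n = len(pred)
    agree = sum(1 for i in range(n) for j in range(i + 1, n)
                if (pred[i] == pred[j]) == (true[i] == true[j]))
    total_pairs = n * (n - 1) / 2
    return (agree - n_constraints) / (total_pairs - n_constraints)
```

With n_constraints = 0 this reduces to the plain Rand index; subtracting C discounts the pairs whose agreement is forced by the supervision.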


REFERENCES

[1] … "Dimensionality Reduction". In Proc. of the 23rd Intl. Conf. on Machine Learning, 2006.
[2] C. Ding and X. He. "K-Means Clustering via Principal Component Analysis". In Proc. of the 21st Intl. Conf. on Machine Learning, 2004.
[3] D. Cai and X. F. He. "Orthogonal Locality Preserving Projection". In Proc. of the 28th Intl. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2005.
[4] X. F. He and P. Niyogi. "Locality Preserving Projections". Neural Information Processing Systems, NIPS '03, 2003.
[5] H. Cheng, K. Hua, and K. Vu. "Semi-Supervised Dimensionality Reduction in Image Feature Space". Technical Report, University of Central Florida, 2007.
[6] K. Wagstaff and C. Cardie. "Clustering with instance-level constraints". In Proc. of the 17th Int'l Conf. on Machine Learning. San Francisco: Morgan Kaufmann Publishers, 2000.
[7] S. Basu. "Semi-supervised Clustering: Probabilistic Models, Algorithms and Experiments". Austin: The University of Texas, 2005.
[8] S. Basu, A. Banerjee, and R. J. Mooney. "Semi-supervised clustering by seeding". In Proc. of the 19th Int'l Conf. on Machine Learning (ICML 2002), pp. 19−26.
[9] K. Wagstaff, C. Cardie, and S. Rogers. "Constrained K-means clustering with background knowledge". In Proc. of the 18th Int'l Conf. on Machine Learning. Williamstown: Williams College, Morgan Kaufmann Publishers, 2001, pp. 577−584.
[10] D. Klein, S. D. Kamvar, and C. D. Manning. "From instance-level constraints to space-level constraints: Making the most of prior knowledge in data clustering". In Proc. of the 19th Int'l Conf. on Machine Learning. University of New South Wales, Sydney: Morgan Kaufmann Publishers, 2002, pp. 307−314.
[11] T. Hertz, N. Shental, and A. Bar-Hillel. "Enhancing image and video retrieval: Learning via equivalence constraints". In Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition. Madison: IEEE Computer Society, 2003, pp. 668−674.
[12] T. Deselaers, D. Keysers, and H. Ney. "Features for Image Retrieval – a Quantitative Comparison". In Pattern Recognition, 26th DAGM Symposium, 2004.
[13] D. Zhang, Z. H. Zhou, and S. Chen. "Semi-Supervised Dimensionality Reduction". In Proc. of the 2007 SIAM Intl. Conf. on Data Mining, SDM '07, 2007.
[14] N. Grira, M. Crucianu, and N. Boujemaa. "Semi-supervised fuzzy clustering with pairwise-constrained competitive agglomeration". In IEEE International Conference on Fuzzy Systems, 2005.
[15] H. Frigui and R. Krishnapuram. "Clustering by competitive agglomeration". Pattern Recognition 30(7), 1997, pp. 1109–1119.
[16] M. Bilenko and R. J. Mooney. "Adaptive duplicate detection using learnable string similarity measures". In International Conference on Knowledge Discovery and Data Mining, Washington, DC, 2003, pp. 39–48.

