multidimensional scaling, factor analysis, principal component analysis, andcluster analysis to name a few. A useful overview is given in .In pattern recognition, data analysis is concerned with predictive modeling: givensome training data, we want to predict the behavior of the unseen test data. Thistask is also referred to as
. Often, a clear distinction is made betweenlearning problems that are (i) supervised (classification) or (ii) unsupervised(clustering), the first involving only
(training patterns with knowncategory labels) while the latter involving only
. Clustering is amore difficult and challenging problem than classification. There is a growinginterest in a hybrid setting, called
semi- supervised learning
; in semi-supervised classification, the labels of only a small portion of the training data setare available. The unlabeled data, instead of being discarded, are also used in thelearning process. In semi-supervised clustering, instead of specifying the classlabels, pair-wise constraints are specified, which is a
r way of encoding the prior knowledge A pair-wise
constraint corresponds to the requirementthat two objects should be assigned the same cluster label, whereas the cluster labels of two objects participating in a
constraint should be different.Constraints can be particularly beneficial in data clustering , where precisedefinitions of underlying clusters are absent. In the search for good models, onewould like to include all the available information, no matter whether it isunlabeled data, data with constraints, or labeled data. Figure 1 illustrates thisspectrum of different types of learning problems of interest in pattern recognitionand machine learning.
What is clustering?
is a process of assigning similar data into a samegroup or subsets (called clusters) so that observations in the same cluster aresimilar in some sense. Clustering is a method of unsupervised learning, and acommon technique for statistical data analysis used in many fields, including