July 2011
Master of Computer Application (MCA) – Semester 6
MC0088 – Data Mining – 4 Credits
(Book ID: B1009)
Assignment Set – 1 (60 Marks)
Answer all questions. Each question carries fifteen marks.
1. Describe the following with respect to Cluster Analysis:
   A. Cluster Analysis
   B. Clustering Methods
   C. Clustering and Segmentation Software

A. Cluster Analysis

Cluster analysis, or clustering, is the task of assigning a set of objects into groups (called clusters) so that the objects in the same cluster are more similar (in some sense or another) to each other than to those in other clusters. Clustering is a main task of exploratory data mining and a common technique for statistical data analysis used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics.

Cluster analysis itself is not one specific algorithm, but the general task to be solved. It can be achieved by various algorithms that differ significantly in their notion of what constitutes a cluster and how to efficiently find them. Popular notions of clusters include groups with low distances among the cluster members, dense areas of the data space, intervals, or particular statistical distributions. The appropriate clustering algorithm and parameter settings (including values such as the distance function to use, a density threshold, or the number of expected clusters) depend on the individual data set and the intended use of the results. Cluster analysis as such is not an automatic task, but an iterative process of knowledge discovery that involves trial and failure. It will often be necessary to modify preprocessing and parameters until the result achieves the desired properties.

Besides the term clustering, there are a number of terms with similar meanings, including automatic classification, botryology (from Greek βότρυς, "grape"), and typological analysis. The subtle differences are often in the usage of the results: while in data mining the resulting groups are the matter of interest, in automatic classification it is primarily their discriminative power that is of interest. This often leads to misunderstandings between researchers coming from the fields of data mining and machine learning, since they use the same terms and often the same algorithms, but have different goals.
B. Clustering Methods
The goal of clustering is to reduce the amount of data by categorizing or grouping similar data items together. Such grouping is pervasive in the way humans process information, and one of the motivations for using clustering algorithms is to provide automated tools to help in constructing categories or taxonomies [Jardine and Sibson, 1971; Sneath and Sokal, 1973]. The methods may also be used to minimize the effects of human factors in the process.

Clustering methods [… and Bailey, 1973] can be divided into two basic types: hierarchical and partitional clustering. Within each of the types there exists a wealth of subtypes and different algorithms for finding the clusters.

Hierarchical clustering proceeds successively by either merging smaller clusters into larger ones, or by splitting larger clusters. The clustering methods differ in the rule by which it is decided which two small clusters are merged or which large cluster is split. The end result of the algorithm is a tree of clusters called a dendrogram, which shows how the clusters are related. By cutting the dendrogram at a desired level, a clustering of the data items into disjoint groups is obtained.

Partitional clustering, on the other hand, attempts to directly decompose the data set into a set of disjoint clusters. The criterion function that the clustering algorithm tries to minimize may emphasize the local structure of the data, as by assigning clusters to peaks in the probability density function, or the global structure. Typically the global criteria involve minimizing some measure of dissimilarity in the samples within each cluster, while maximizing the dissimilarity of different clusters.

A commonly used partitional clustering method, K-means clustering [MacQueen, 1967], will be discussed in some detail since it is closely related to the SOM algorithm. In K-means clustering the criterion function is the average squared distance of the data items from their nearest cluster centroids,

    E = (1/N) Σ_i ||x_i − m_c(i)||²        (1)

where c(i) is the index of the centroid closest to the sample x_i. One possible algorithm for minimizing the cost function begins by initializing a set of K cluster centroids denoted by m_k, k = 1, …, K. The positions of the m_k are then adjusted iteratively by first assigning the data samples to the nearest clusters and then recomputing the centroids. The iteration is stopped when E does not change markedly any more. In an alternative algorithm, each randomly chosen sample is considered in succession, and the nearest centroid is updated.

Equation 1 is also used to describe the objective of a related method, vector quantization [Gersho, 1979
]. In vector quantization the goal is to minimize the average (squared) quantization error, that is, the distance between a sample and its representation. The algorithm for minimizing Equation 1 that was described above is actually a straightforward generalization of the algorithm proposed by Lloyd (1957) for minimizing the average quantization error in a one-dimensional setting.

A problem with the clustering methods is that the interpretation of the clusters may be difficult. Most clustering algorithms prefer certain cluster shapes, and the algorithms will always assign the data to clusters of such shapes even if there were no clusters in the data. Therefore, if the goal is not just to compress the data set but also to make inferences about its cluster structure, it is essential to analyze whether the data set exhibits a clustering tendency. The results of the cluster analysis need to be validated as well. Jain and Dubes (1988) present methods for both purposes.
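As an illustration (not part of the original text), the batch K-means iteration described above can be sketched in plain Python. All names and data here are made up for demonstration; the stopping tolerance stands in for "E does not change markedly".

```python
# Minimal batch K-means sketch following the iteration described above:
# assign each sample to its nearest centroid, recompute each centroid as
# the mean of its samples, stop when the cost E stops changing.
# Data, names, and the tolerance value are illustrative assumptions.

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(samples, centroids, tol=1e-9, max_iter=100):
    prev_cost = float("inf")
    assign = []
    for _ in range(max_iter):
        # assignment step: index of the nearest centroid for each sample
        assign = [min(range(len(centroids)),
                      key=lambda k: dist2(x, centroids[k]))
                  for x in samples]
        # update step: move each centroid to the mean of its members
        for k in range(len(centroids)):
            members = [x for x, a in zip(samples, assign) if a == k]
            if members:  # an empty cluster keeps its old centroid
                centroids[k] = tuple(sum(vals) / len(members)
                                     for vals in zip(*members))
        # cost E: average squared distance to the nearest centroid (Eq. 1)
        cost = sum(dist2(x, centroids[a])
                   for x, a in zip(samples, assign)) / len(samples)
        if prev_cost - cost < tol:
            break
        prev_cost = cost
    return centroids, assign

data = [(0.0, 0.0), (0.0, 1.0), (9.0, 9.0), (9.0, 10.0)]
cents, labels = kmeans(data, [(1.0, 1.0), (8.0, 8.0)])
print(cents, labels)  # → [(0.0, 0.5), (9.0, 9.5)] [0, 0, 1, 1]
```

The alternative sequential algorithm mentioned in the text would instead update only the single nearest centroid after each randomly drawn sample.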
Another potential problem is that the choice of the number of clusters may be critical: quite different kinds of clusters may emerge when K is changed. Good initialization of the cluster centroids may also be crucial; some clusters may even be left empty if their centroids lie initially far from the distribution of the data.
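A tiny sketch (data and names are illustrative assumptions, not from the text) of that empty-cluster hazard: a centroid initialized far from the data attracts no samples in the assignment step, so it is never pulled toward the data.

```python
# Illustrates the initialization hazard noted above: a centroid placed
# far from the data distribution wins no samples in the assignment step,
# leaving its cluster empty. Data and names are illustrative.

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assignments(samples, centroids):
    """Index of the nearest centroid for each sample."""
    return [min(range(len(centroids)), key=lambda k: dist2(x, centroids[k]))
            for x in samples]

data = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
centroids = [(0.0, 0.0), (1.0, 1.0), (100.0, 100.0)]  # third starts far away

labels = assignments(data, centroids)
empty = [k for k in range(len(centroids)) if k not in labels]
print(empty)  # → [2]
```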
Clustering can be used to reduce the amount of data and to induce a categorization. In exploratory data analysis, however, the categories have only limited value as such. The clusters should be illustrated somehow to aid in understanding what they are like. For example, in the case of the K-means algorithm the centroids that represent the clusters are still high-dimensional, and some additional illustration methods are needed for visualizing them.
C. Clustering and Segmentation Software
Segmentation is the process that groups similar objects together and forms clusters; thus it is often referred to as clustering. Clustered groups are homogeneous within and desirably heterogeneous in between. The rationale of intra-group homogeneity is that objects with similar attributes are likely to respond somewhat similarly to a given action. This property has various uses both in business and in scientific research.

Most clustering techniques are developed for laboratory-generated, simple data consisting of a few to several numerical variables. Applying these techniques to business data, which consist of many categorical and complex variables, suffers from various limitations, as described in the following:
Numerical variables and normalization
Most clustering techniques are based on distance calculation. It is noted that distance is very sensitive to the ranges of variables. For example, "age" normally ranges from 0 to 100. On the other hand, "salary" can spread from 0 to 100,000. When both variables are used together, the distance contribution from salary can overwhelm that from age. Thus, values have to be normalized. However, normalization is rather a subjective operation; there is no way we can transform values without introducing some bias.
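A sketch of that point using min-max scaling (the data and the choice of min-max over other normalizations are illustrative assumptions): before scaling, salary swamps a 40-year age gap; after scaling to [0, 1], the nearest neighbour changes.

```python
# Min-max normalization sketch for the age/salary example above.
# Without scaling, salary dominates the Euclidean distance; after
# scaling every variable to [0, 1], the nearest neighbour can change.
# Records and the min-max choice are illustrative assumptions.

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def min_max(rows):
    lo = [min(col) for col in zip(*rows)]
    hi = [max(col) for col in zip(*rows)]
    return [tuple((v - l) / (h - l) for v, l, h in zip(r, lo, hi))
            for r in rows]

def nearest(rows, i):
    """Index of the row closest to rows[i]."""
    return min((j for j in range(len(rows)) if j != i),
               key=lambda j: dist(rows[i], rows[j]))

# (age, salary) records
people = [(25, 20_000), (65, 22_000), (30, 60_000), (40, 100_000)]

print(nearest(people, 0))           # → 1 (salary swamps the 40-year age gap)
print(nearest(min_max(people), 0))  # → 2 (after scaling, age matters again)
```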
Outliers and numerical variables
Related to numerical variables, outliers also create problems in data mining, especially with clustering based on distance calculations. In such systems, outliers should be identified and removed from the data before mining. (It is noted that removing outliers is recommended in all data mining techniques!)
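One simple way to screen outliers before distance-based clustering is an interquartile-range fence. The 1.5 × IQR rule and the crude quartile positions below are illustrative assumptions, not prescribed by the text.

```python
# Simple outlier screen before distance-based clustering: drop values
# outside the interquartile fences (the common 1.5 * IQR rule of thumb).
# The rule and the rough quartile positions are illustrative assumptions.

def iqr_filter(values, k=1.5):
    s = sorted(values)
    n = len(s)
    q1, q3 = s[n // 4], s[(3 * n) // 4]   # rough quartile positions
    lo, hi = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in values if lo <= v <= hi]

salaries = [30_000, 32_000, 31_000, 29_000, 33_000, 5_000_000]
print(iqr_filter(salaries))  # the 5,000,000 outlier is dropped
```

A robust fence like this is preferable to a mean/standard-deviation cutoff here, since a single extreme value inflates the standard deviation enough to mask itself.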
Categorical variables and binary variable encoding
Dealing with categorical variables (non-numeric data, non-numeric variables, categorical data, nominal data, or nominal variables) is much more problematic. Normally, we use "one-of-N" or "thermometer" encoding. This can introduce extra biases due to the number of values in categorical variables. Note that one-of-N and thermometer encoding transform each categorical value into a true-false binary variable. This can significantly increase the total number of variables, which in turn decreases the effectiveness of many clustering techniques. For more, read the section "Why k-means clustering does not work well with business data?".
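The two encodings mentioned above can be sketched as follows (category names are illustrative; thermometer encoding assumes the categories are ordered):

```python
# One-of-N ("one-hot") and thermometer encodings for a categorical
# variable, as mentioned above. Each category becomes its own binary
# variable, inflating dimensionality. Category names are illustrative.

def one_of_n(value, categories):
    return [1 if value == c else 0 for c in categories]

def thermometer(value, ordered_categories):
    """For ordinal data: switch on every level up to and including value."""
    idx = ordered_categories.index(value)
    return [1 if i <= idx else 0 for i in range(len(ordered_categories))]

sizes = ["small", "medium", "large"]
print(one_of_n("medium", sizes))     # → [0, 1, 0]
print(thermometer("medium", sizes))  # → [1, 1, 0]
```

A variable with N categories thus contributes N binary columns, which is the dimensionality blow-up the text warns about.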
Clustering variable selections and weighting
Clustering variable selection is another problem. The selection of variables will largely influence the clustering results. A commonly used method is to assign different weights to variables and categorical values. However, this introduces another problematic process: when many variables and categorical values are involved, it is hardly possible to obtain the best-quality clustering. For more, read the section on clustering variable selection methods.
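The variable-weighting idea mentioned above amounts to a weighted distance function. A minimal sketch (the weights and data are arbitrary examples, underlining how subjective this tuning step is):

```python
# Weighted Euclidean distance sketch: per-variable weights let the
# analyst emphasize or suppress variables, which is the subjective
# tuning step discussed above. The weights shown are arbitrary.

def weighted_dist(a, b, weights):
    return sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)) ** 0.5

a, b = (1.0, 10.0), (2.0, 10.5)
print(weighted_dist(a, b, (1.0, 1.0)))  # both variables count equally
print(weighted_dist(a, b, (0.0, 1.0)))  # → 0.5 (first variable ignored)
```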
Behavioral modeling on time-variant variables
Capturing the patterns (or behaviors) hidden inside time-varying variables and modeling them is another difficult problem. In database marketing, it is desirable to segment customers based on previous marketing campaigns, as predictive models, and then to execute marketing campaigns based on current customer information (using the same models). Most clustering techniques do not possess this predictive modeling capability.
2. Describe the following with respect to Web Mining: