Clustering can be used to reduce the amount of data and to induce a categorization. In exploratory data analysis, however, the categories have only limited value as such. The clusters should be illustrated somehow to aid understanding of what they are like. For example, in the case of the K-means algorithm, the centroids that represent the clusters are still high-dimensional, and some additional illustration methods are needed for visualizing them.
C. Clustering and Segmentation Software
Segmentation is the process that groups similar objects together and forms clusters; thus it is often referred to as clustering. Clustered groups are homogeneous within and desirably heterogeneous between one another. The rationale of intra-group homogeneity is that objects with similar attributes are likely to respond somewhat similarly to a given action. This property has various uses both in business and in scientific research.
Most clustering techniques were developed for simple, laboratory-generated data consisting of a few to several numerical variables. Applying these techniques to business data that contain many complex categorical variables suffers from various limitations, as described in the following:
Numerical variables and normalization
Most clustering techniques are based on distance calculation. Note that distance is very sensitive to the ranges of variables. For example, "age" normally ranges from 0 to 100, while "salary" can spread from 0 to 100,000. When both variables are used together, the distance contributed by salary can overwhelm that of age. Thus, values have to be normalized. However, normalization is a rather subjective operation: there is no way to transform values without introducing some bias.
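The scale problem above can be seen in a small sketch. The age and salary figures are illustrative, and min-max scaling is just one possible normalization choice:

```python
# Hypothetical example: two customers described by age (0-100) and
# salary (0-100,000). Without normalization, salary dominates the distance.

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def min_max(value, lo, hi):
    """Min-max normalization: map value into [0, 1]."""
    return (value - lo) / (hi - lo)

a = (25, 30_000)   # (age, salary)
b = (65, 31_000)

# Raw distance: the 1,000 salary gap swamps the 40-year age gap.
raw = euclidean(a, b)

# Normalized distance: both variables contribute on a comparable scale.
norm_a = (min_max(a[0], 0, 100), min_max(a[1], 0, 100_000))
norm_b = (min_max(b[0], 0, 100), min_max(b[1], 0, 100_000))
scaled = euclidean(norm_a, norm_b)
```

Note how the choice of range bounds (0-100, 0-100,000) is itself a modeling decision, which is exactly the subjectivity the text warns about.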
Outliers and numerical variables
Related to numerical variables, outliers also create problems in data mining, especially with clustering based on distance calculations. In such systems, outliers should be identified and removed before mining. (Note that removing outliers is recommended for all data mining techniques!)
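One common way to identify such outliers, sketched here as an illustration (the z-score rule and the 3-standard-deviation threshold are conventional choices, not prescribed by the text):

```python
# A minimal sketch of z-score-based outlier removal: values more than
# `threshold` standard deviations from the mean are treated as outliers.

def remove_outliers(values, threshold=3.0):
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    if std == 0:
        return list(values)
    return [v for v in values if abs(v - mean) / std <= threshold]

# Illustrative salary data with one extreme value.
salaries = [30_000, 32_000, 31_500, 29_800, 30_700, 28_900,
            33_100, 31_200, 30_400, 29_500, 1_000_000]
clean = remove_outliers(salaries)
```

With very small samples this rule becomes unreliable (a single extreme value inflates the standard deviation), which is one reason outlier handling remains a judgment call.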
Categorical variables and binary variable encoding
Dealing with categorical variables (non-numeric data, non-numeric variables, categorical data, nominal data, or nominal variables) is much more problematic. Normally, we use "one-of-N" or "thermometer" encoding. This can introduce extra biases due to the number of values in categorical variables. Note that one-of-N and thermometer encoding transform each categorical value into a true-false (binary) variable. This can significantly increase the total number of variables, which in turn decreases the effectiveness of many clustering techniques. For more, read the section "Why k-means clustering does not work well with business data?".
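The two encodings mentioned above can be sketched as follows; the category lists are illustrative:

```python
# One-of-N (one-hot) sets a single true flag per value; thermometer
# encoding sets all flags up to and including the value's position,
# preserving an ordering for ordinal variables.

def one_of_n(value, categories):
    """One-of-N encoding: one binary flag per category."""
    return [1 if value == c else 0 for c in categories]

def thermometer(value, ordered_categories):
    """Thermometer encoding: flags filled up to the value's position."""
    idx = ordered_categories.index(value)
    return [1 if i <= idx else 0 for i in range(len(ordered_categories))]

regions = ["north", "south", "east", "west"]   # nominal
sizes = ["small", "medium", "large"]           # ordinal

one_of_n("east", regions)     # [0, 0, 1, 0]
thermometer("medium", sizes)  # [1, 1, 0]
```

A single variable with N values becomes N binary variables, which is how the dimensionality blow-up described above arises.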
Clustering variable selections and weighting
Clustering variable selection is another problem. The selection of variables largely influences clustering results. A commonly used method is to assign different weights to variables and categorical values. However, this introduces another problematic process: when many variables and categorical values are involved, it is practically impossible to obtain best-quality clustering. For this, the following can help:
Clustering variable selection methods
Variable & value link analysis
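The weighting approach described above amounts to scaling each variable's contribution inside the distance function. A small sketch (the weights here are arbitrary, analyst-chosen values, which is precisely the subjectivity the text points out):

```python
# Weighted Euclidean distance: each variable's squared difference is
# multiplied by an analyst-assigned weight before summing.

def weighted_distance(a, b, weights):
    return sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)) ** 0.5

a = (0.3, 0.7, 0.1)
b = (0.4, 0.2, 0.9)

# Emphasize the first variable, downweight the third.
weights = (2.0, 1.0, 0.25)
d = weighted_distance(a, b, weights)
```

Every weight choice produces a different clustering, so with many variables the search space of weightings quickly becomes unmanageable.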
Behavioral modeling on time-variant variables
Capturing patterns (or behaviors) hidden inside time-varying variables and modeling them is another difficult problem. In database marketing, it is desirable to segment customers based on previous marketing campaigns, as predictive models, and then to execute marketing campaigns on current customer information using the same models. Most clustering techniques do not possess this predictive modeling capability.
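The predictive use of clustering described above can be sketched with K-means: centroids learned from past-campaign data are kept and reused to place new customers into existing segments. The data, the choice of k, and the two-feature representation are all illustrative assumptions:

```python
# Minimal K-means plus a predictive assignment step.
import random

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: dist2(p, centroids[i]))].append(p)
        # Recompute each centroid as its cluster mean (keep old one if empty).
        centroids = [
            tuple(sum(vals) / len(c) for vals in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids

def assign(point, centroids):
    """Predictive step: place a new customer into the nearest existing segment."""
    return min(range(len(centroids)), key=lambda i: dist2(point, centroids[i]))

# Past-campaign customers (already normalized): low vs high responders.
history = [(0.1, 0.2), (0.15, 0.25), (0.2, 0.1),
           (0.8, 0.9), (0.85, 0.8), (0.9, 0.95)]
centroids = kmeans(history, k=2)

# A current customer is scored against the stored centroids,
# without re-running the clustering.
segment = assign((0.82, 0.88), centroids)
```

Storing the centroids and calling only `assign` on new data is what turns the clustering into a reusable predictive model, which is the capability the text notes most clustering techniques lack out of the box.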
2. Describe the following with respect to Web Mining: