
Introduction:

Clustering has long been recognized as an important and valuable capability in the data mining field. For high-dimensional data, however, recent research has reported that traditional clustering techniques may fail to discover meaningful clusters due to the curse of dimensionality. Specifically, the curse of dimensionality refers to the phenomenon that, as the number of dimensions increases, the distance from a given point x to its nearest point approaches the distance from x to its farthest point. Because of this loss of distance discrimination in high dimensions, discovering meaningful, separable clusters becomes very challenging, if not impossible.
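
To see this loss of distance contrast concretely, the following sketch (an illustration added here, not part of the original study; all names are ours) samples points uniformly at random and compares the nearest and farthest distances from the origin as the dimensionality grows:

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_contrast(dim, n_points=1000):
    """Ratio of the nearest to the farthest distance from the origin
    for points drawn uniformly from the unit hypercube."""
    points = rng.random((n_points, dim))
    dists = np.linalg.norm(points, axis=1)
    return dists.min() / dists.max()

for dim in (2, 10, 100, 1000):
    print(f"dim={dim:5d}  nearest/farthest = {distance_contrast(dim):.3f}")

# As dim grows the ratio approaches 1: the nearest and farthest points
# become almost equidistant, so distance-based cluster separation fails.
```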

A common approach to coping with the curse of dimensionality in mining tasks is to reduce the data dimensionality through feature transformation or feature selection. Feature transformation techniques, such as principal component analysis (PCA) and singular value decomposition (SVD), summarize the data in a smaller set of dimensions derived from combinations of the original data attributes. However, the transformed features/dimensions no longer have any intuitive meaning, and the resulting clusters are therefore hard to interpret and analyze.
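
As a brief illustration of why transformed dimensions are hard to interpret, the sketch below (our own example with made-up data, using scikit-learn's PCA) reduces 50 attributes to 3 derived dimensions, each of which mixes all of the original attributes:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
data = rng.random((500, 50))      # 500 records, 50 original attributes

pca = PCA(n_components=3)
reduced = pca.fit_transform(data)

print(reduced.shape)              # (500, 3): data summarized in 3 dimensions
print(pca.components_.shape)      # (3, 50): each derived dimension combines
                                  # all 50 original attributes, so it carries
                                  # no intuitive meaning
```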

Feature selection methods, on the other hand, reduce the data dimensionality by selecting the most relevant attributes from the original ones. In this way, only one particular subspace is selected for discovering the clusters. However, in many real data sets clusters may be embedded in varying subspaces, so feature selection approaches lose the information of data points that cluster differently in different subspaces.

Motivated by the fact that different groups of points may be clustered in different subspaces, a significant amount of research has been devoted to subspace clustering, which aims at discovering clusters embedded in any subspace of the original feature space. The applicability of subspace clustering has been demonstrated in various applications, including gene expression data analysis, E-commerce, DNA microarray analysis, and so forth. For example, in gene expression data, each record stores the expression levels of a gene, i.e., the intensity of its expression, derived from different samples, which may represent time slots. Clustering the genes in subspaces may help to identify genes whose expression levels are similar in a subset of samples, where co-expressed genes are usually functionally correlated. Note that genes of different functionalities may be clustered in different subsets of samples.

Most previous subspace clustering works discover the subspace clusters by regarding the clusters as regions of higher density than their surroundings in a subspace. They identify these high-density regions (clusters) by introducing a density threshold, such that a region is identified as dense if its region density exceeds the threshold. However, we find that previous works may have difficulty achieving high cluster quality in all subspaces, because the identification of high-density regions fails to consider a critical problem, called "the density divergence problem" in this research. The density divergence problem refers to the phenomenon that cluster densities vary across subspace cardinalities. Note that as the number of dimensions increases, the data points are spread out over a larger space and are thus inherently more sparsely populated, which yields varying region densities in different subspace cardinalities. This implies that extracting clusters in higher subspaces should come with a lower density requirement (otherwise we may lose true clusters). Because discovering clusters in different subspace cardinalities requires varying density thresholds, it is challenging for subspace clustering to simultaneously achieve high precision and recall for clusters in different subspace cardinalities. More explicitly, since previous works identify the dense regions by means of a single threshold on region densities, a trade-off between recall and precision is inevitable: a high threshold leads to high precision at the cost of recall, whereas a low threshold leads to high recall at the cost of precision.

To clearly illustrate the problems incurred by the density divergence problem, we apply previous subspace clustering works that utilize a grid structure for identifying the dense regions to a two-dimensional example data set, shown in Fig. 1a. In works adopting the grid structure, the data space is first partitioned into a number of equisized units, and clusters are then discovered by grouping connected dense units, where a unit is said to be dense if the number of data points it contains exceeds a prespecified threshold τ.
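
A minimal sketch of this grid-based scheme follows (our own illustrative code; the function and parameter names are assumptions, not the notation of the works discussed):

```python
from collections import defaultdict
import numpy as np

def dense_units(points, n_bins=10, tau=54):
    """Partition each dimension into n_bins equisized intervals and
    return the grid units containing more than tau points."""
    points = np.asarray(points, dtype=float)
    lo, hi = points.min(axis=0), points.max(axis=0)
    # Map every point to the index tuple of the unit it falls into.
    idx = np.floor((points - lo) / (hi - lo + 1e-12) * n_bins).astype(int)
    counts = defaultdict(int)
    for unit in map(tuple, idx):
        counts[unit] += 1
    return {unit for unit, c in counts.items() if c > tau}
```

Connected dense units in the same subspace are then grouped into clusters.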

We therefore apply the grid-based subspace clustering algorithms to the example data set by first partitioning the two-dimensional space into units. As can be seen, there are three clusters in the data set: two one-dimensional clusters, A3 and B4, and one two-dimensional cluster, A3B3 ∪ A3B4. Fig. 1b shows the counts of the units related to the clusters, i.e., the numbers of data points they contain. Note that by setting τ to 54, the two one-dimensional clusters, A3 and B4, can be discovered with high quality. However, in this case the density threshold is set so high that the two-dimensional cluster A3B3 ∪ A3B4 cannot be found. That is, we obtain high precision and recall for the discovered one-dimensional clusters, but low precision and recall for the two-dimensional one. On the other hand, for the two-dimensional cluster to be discovered, τ would have to be set to 15, but this leads to low precision for the two one-dimensional clusters. In this scenario, the one-dimensional cluster A3 is discovered by joining A3 with one more unit, A4, so that its precision decreases. A similar result holds for the one-dimensional cluster B4, which is combined with one more unit, B3. Therefore, it is infeasible for previous subspace clustering models to simultaneously achieve high precision and recall for clusters in different subspace cardinalities.

Considering the varying region densities in different subspace cardinalities, we note that a more appropriate way to determine whether a region in a subspace should be identified as dense is to compare its density with the other region densities in that subspace. Motivated by this idea, we devise in this research a novel subspace clustering model that discovers the clusters based on relative region densities. In our model, we regard the clusters in a subspace as the regions whose densities are relatively high compared to the average region density in that subspace. To discover such clusters, we introduce a novel density parameter with which users specify the expected ratio between the densities of the dense regions and the average region density in a subspace. Given a user-specified value, and because the average region density differs across subspace cardinalities, we adaptively derive a different density threshold for each cardinality to discover clusters of relatively high density in subspaces of that cardinality.
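
The adaptive-threshold idea can be sketched as follows (a simplified formulation for illustration: we assume the threshold for cardinality k is simply a user-specified ratio alpha times the average unit density at that cardinality; the paper's own parameter and derivation may differ):

```python
def adaptive_thresholds(unit_counts_by_cardinality, alpha):
    """Derive one density threshold per subspace cardinality k,
    proportional to the average unit density observed at k."""
    return {
        k: alpha * sum(counts) / len(counts)
        for k, counts in unit_counts_by_cardinality.items()
        if counts
    }

# Units are sparser at higher cardinalities, so the derived
# threshold is automatically lower there.
counts = {1: [60, 55, 10, 5], 2: [18, 15, 3, 2]}
print(adaptive_thresholds(counts, alpha=1.5))   # {1: 48.75, 2: 14.25}
```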

Discovering clusters in different cardinalities with different density thresholds is useful but quite challenging. Note that due to the large search space of subspaces in high-dimensional data, previous works constrain the search for dense regions by means of the monotonicity property: a region in a subspace is not examined as a candidate dense region if any of its projections in lower subspaces is not dense. In our model, however, different density thresholds are used in different subspace cardinalities, so the monotonicity property no longer holds; that is, even if a k-dimensional region is dense, its lower-dimensional projections need not be dense. Without the monotonicity property, the Apriori-like generate-and-test scheme adopted by most previous works to constrain the search for dense regions is infeasible in our model, and a naive method would have to exhaustively examine all regions to discover the dense ones.
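
For contrast, the Apriori-style pruning that the monotonicity property permits under a single global threshold can be sketched as below (a hypothetical region representation and helper names; under per-cardinality thresholds this test is no longer sound):

```python
from itertools import combinations

def survives_apriori_prune(candidate, dense_lower):
    """Keep a candidate k-dimensional region only if every one of its
    (k-1)-dimensional projections is already known to be dense.
    A region is modeled as a frozenset of (dimension, interval) pairs;
    dense_lower is the set of dense (k-1)-dimensional regions."""
    k = len(candidate)
    return all(
        frozenset(proj) in dense_lower
        for proj in combinations(candidate, k - 1)
    )
```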

To address this challenge, we devise an innovative algorithm, referred to as "DENsity COnscious Subspace clustering" (abbreviated as DENCOS), to efficiently discover the clusters satisfying different density thresholds in different subspace cardinalities. In DENCOS, we devise a mechanism that computes upper bounds of region densities to constrain the search for dense regions: regions whose density upper bounds fall below the density thresholds are pruned away when identifying the dense regions. We compute the region density upper bounds by utilizing a novel data structure, the DFP-tree (Density FP-tree), in which we store summarized information about the dense regions. In addition, from the DFP-tree we also compute lower bounds of the region densities to accelerate the identification of the dense regions. Dense region discovery in DENCOS is therefore devised as a divide-and-conquer scheme. First, the lower bounds of the region densities are used to efficiently extract the dense regions, i.e., the regions whose density lower bounds exceed the density thresholds. Then, for the remaining regions, the search for dense regions is constrained to those whose density upper bounds exceed the density thresholds.
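
The overall divide-and-conquer flow can be summarized by the following sketch (the DFP-tree bound computations are abstracted behind hypothetical callables; this paraphrases the scheme described above and is not the paper's actual code):

```python
def discover_dense_regions(candidates, thresholds,
                           lower_bound, upper_bound, exact_density):
    """candidates: iterable of regions; thresholds: dict mapping
    cardinality k to its density threshold; the three callables stand
    in for the DFP-tree computations of density bounds and exact
    region densities."""
    dense, undecided = [], []
    for region in candidates:
        tau = thresholds[len(region)]
        if lower_bound(region) >= tau:
            dense.append(region)       # lower bound already suffices
        elif upper_bound(region) >= tau:
            undecided.append(region)   # only these need exact counting
        # regions with upper bound below tau are pruned outright
    dense += [r for r in undecided
              if exact_density(r) >= thresholds[len(r)]]
    return dense
```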

Experiments conducted on extensive data sets show that our proposed algorithm DENCOS outperforms previous works in both clustering quality and efficiency.
