
2.3. Clustering

Clustering of unlabeled data can be performed with the module sklearn.cluster.

Each clustering algorithm comes in two variants: a class, which implements the fit method to learn the clusters on train data, and a function, which, given train data, returns an array of integer labels corresponding to the different clusters. For the class, the labels over the training data can be found in the labels_ attribute.
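As a minimal sketch of the two variants, using K-means on toy data (the data and parameter choices here are illustrative, not from the original text):

```python
import numpy as np
from sklearn.cluster import KMeans, k_means

# Toy data: two well-separated groups of points.
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])

# Class variant: fit() learns the clusters; labels are stored in labels_.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)

# Function variant: given train data, returns centers, integer labels
# and the final inertia directly.
centers, labels, inertia = k_means(X, n_clusters=2, n_init=10, random_state=0)
print(labels)
```

Both variants produce the same partition of the data; the class additionally keeps the fitted model around for later prediction.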

Input data
One important thing to note is that the algorithms implemented in this module can take different kinds of matrix as input. All the methods accept standard data matrices of shape [n_samples, n_features] . These can be obtained from the classes in the sklearn.feature_extraction module. For AffinityPropagation, SpectralClustering and DBSCAN one can also input similarity matrices of shape [n_samples, n_samples] . These can be obtained from the functions in the sklearn.metrics.pairwise module.
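A short sketch of both input styles with DBSCAN (toy data and the eps/min_samples values are illustrative; note that with metric="precomputed", DBSCAN interprets the [n_samples, n_samples] matrix as distances):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics.pairwise import euclidean_distances

# Two tight groups of points.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])

# Standard [n_samples, n_features] input.
labels_features = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)

# Precomputed [n_samples, n_samples] matrix built with a function from
# sklearn.metrics.pairwise.
D = euclidean_distances(X)
labels_precomputed = DBSCAN(eps=0.5, min_samples=2,
                            metric="precomputed").fit_predict(D)
```

Both calls recover the same two clusters from the same underlying geometry.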

2.3.1. Overview of clustering methods

A comparison of the clustering algorithms in scikit-learn

| Method name | Parameters | Scalability | Usecase | Geometry (metric used) |
| --- | --- | --- | --- | --- |
| K-Means | number of clusters | Very large n_samples, medium n_clusters with MiniBatch code | General-purpose, even cluster size, flat geometry, not too many clusters | Distances between points |
| Affinity propagation | damping, sample preference | Not scalable with n_samples | Many clusters, uneven cluster size, non-flat geometry | Graph distance (e.g. nearest-neighbor graph) |
| Mean-shift | bandwidth | Not scalable with n_samples | Many clusters, uneven cluster size, non-flat geometry | Distances between points |
| Spectral clustering | number of clusters | Medium n_samples, small n_clusters | Few clusters, even cluster size, non-flat geometry | Graph distance (e.g. nearest-neighbor graph) |
| Ward hierarchical clustering | number of clusters or distance threshold | Large n_samples and n_clusters | Many clusters, possibly connectivity constraints | Distances between points |
| Agglomerative clustering | number of clusters or distance threshold, linkage type, distance | Large n_samples and n_clusters | Many clusters, possibly connectivity constraints, non-Euclidean distances | Any pairwise distance |
| DBSCAN | neighborhood size | Very large n_samples, medium n_clusters | Non-flat geometry, uneven cluster sizes | Distances between nearest points |
| OPTICS | minimum cluster membership | Very large n_samples, large n_clusters | Non-flat geometry, uneven cluster sizes, variable cluster density | Distances between points |
| Gaussian mixtures | many | Not scalable | Flat geometry, good for density estimation | Mahalanobis distances to centers |
| Birch | branching factor, threshold, optional global clusterer | Large n_clusters and n_samples | Large dataset, outlier removal, data reduction | Euclidean distance between points |

Non-flat geometry clustering is useful when the clusters have a specific shape, i.e. a non-flat manifold, and the standard Euclidean distance is not the right metric. This case arises in the two top rows of the figure above.

Gaussian mixture models, useful for clustering, are described in another chapter of the documentation dedicated to mixture models. KMeans can be seen as a special case of a Gaussian mixture model with equal covariance per component.
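The relationship between K-Means and Gaussian mixtures can be illustrated on toy data: when the clusters are spherical, well separated and equally weighted, a GMM with a single shared covariance recovers the same partition K-Means does (the data, the "tied" covariance choice, and the comparison via adjusted Rand score are illustrative assumptions, not from the original text):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.metrics import adjusted_rand_score

rng = np.random.RandomState(0)
# Two spherical, equally sized blobs, far apart relative to their spread.
X = np.vstack([rng.normal([0, 0], 0.1, size=(20, 2)),
               rng.normal([3, 3], 0.1, size=(20, 2))])

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# A GMM with one shared ("tied") covariance behaves like K-Means on
# spherical, equally weighted clusters.
gmm_labels = GaussianMixture(n_components=2, covariance_type="tied",
                             random_state=0).fit_predict(X)

# Identical partitions up to label permutation score 1.0.
print(adjusted_rand_score(km_labels, gmm_labels))
```

On less idealized data (elongated or overlapping clusters), the two methods diverge: the GMM can adapt its covariance to the cluster shape, while K-Means cannot.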

2.3.2. K-means