TC2034 – Modelación del Aprendizaje
con Inteligencia Artificial
https://miro.medium.com/max/1920/1*VACikYaZIHb2OctTRohn8A.jpeg
Unsupervised Learning
Definition
• Uses information that is neither classified nor labelled
• There are two categories of algorithms:
  • Clustering
    • Discover inherent grouping in data
    • The algorithm (learner) has to discover patterns without guidance
  • Dimension Reduction (Association)
    • Discover rules that describe large portions of data
2023FJ - juresti@tec.mx
Pedro Armijo 2
Unsupervised Learning…
Applications
• News sections: categorize articles
• Recommendation engines

Challenges
• Computational complexity – high volume of data
Clustering
Clustering
Definition Applications
• Grouping of unlabeled instances • Market segmentation
• Anomaly detection
Types of clustering
• Centroid-based
• Density-based
https://developers.google.com/machine-learning/clustering/
Types of clustering…
• Distribution-based
• Hierarchical
https://developers.google.com/machine-learning/clustering/
Properties of clusters
• All data points in a cluster must be similar to each other
• Data points from different clusters should be as different as possible
https://www.analyticsvidhya.com/blog/2019/08/comprehensive-guide-k-means-clustering/
Evaluation metrics for clustering
Similarity Measures
Similarity
Euclidean Distance
• Other names:
  • Euclidean norm
  • Euclidean metric
  • L2 norm
  • L2 metric
  • Pythagorean metric
https://miro.medium.com/v2/resize:fit:1400/format:webp/1*j_h0BwDKraVoD6XbUUyn3g.png
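The metric in the image above can be sketched in plain Python (the function name is mine, not from the slides):

```python
import math

def euclidean_distance(p, q):
    """L2 norm: square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))
```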
Manhattan Distance
https://miro.medium.com/v2/resize:fit:1400/format:webp/1*XsJbklzmGXbx6tu7aX2Vug.png
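A minimal sketch of the city-block metric shown above (function name is mine):

```python
def manhattan_distance(p, q):
    """L1 norm: sum of absolute coordinate differences (city-block distance)."""
    return sum(abs(x - y) for x, y in zip(p, q))
```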
Minkowski Distance
https://cdn-images-1.medium.com/max/1067/1*Fb22fnJRABGUANpjcjwEow.png
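The Minkowski distance generalizes the two previous metrics through an order parameter: r = 1 gives Manhattan, r = 2 gives Euclidean, and r → ∞ approaches Chebyshev. A sketch (names mine):

```python
def minkowski_distance(p, q, r=2):
    """Order-r Minkowski distance: the r-th root of the sum of the
    absolute coordinate differences raised to the r-th power."""
    return sum(abs(x - y) ** r for x, y in zip(p, q)) ** (1 / r)
```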
Chebyshev Distance
https://miro.medium.com/v2/resize:fit:1400/format:webp/1*Ys2Wga9fvZAnlPUir2HRIA.png
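The Chebyshev (chessboard) metric in the image reduces to a single max; a sketch (name mine):

```python
def chebyshev_distance(p, q):
    """L-infinity norm: the largest absolute coordinate difference."""
    return max(abs(x - y) for x, y in zip(p, q))
```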
Hamming Distance
https://miro.medium.com/v2/resize:fit:1286/format:webp/1*Q4rdY5KN91aZjvGXl7SRug.png
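Hamming distance counts mismatched positions between two equal-length sequences; a sketch (name mine):

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))
```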
Levenshtein Distance
• Operations to count:
• Inserting character
• Deleting character
• Substituting character
https://miro.medium.com/v2/resize:fit:1170/format:webp/1*Et8w4ttqh37UqaARPu6fqA.png
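The three operations above can be counted with the classic dynamic-programming recurrence; a row-by-row sketch (name mine):

```python
def levenshtein_distance(a, b):
    """Minimum number of insertions, deletions, and substitutions
    needed to turn string a into string b (dynamic programming)."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + cost  # substitution (or match)
                            ))
        prev = curr
    return prev[-1]
```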
Cosine Distance
https://miro.medium.com/v2/resize:fit:1400/format:webp/1*RTaoSa-_04l5HpQFxP116Q.png
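Cosine distance is one minus the cosine of the angle between two vectors; a sketch for non-zero vectors (name mine):

```python
import math

def cosine_distance(p, q):
    """1 minus the cosine similarity: 0 for parallel vectors,
    1 for orthogonal, 2 for opposite (assumes non-zero vectors)."""
    dot = sum(x * y for x, y in zip(p, q))
    norm_p = math.sqrt(sum(x * x for x in p))
    norm_q = math.sqrt(sum(y * y for y in q))
    return 1 - dot / (norm_p * norm_q)
```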
Cosine Distance…
Applied to text
• Three steps:
  1. Obtain all unique words in the original sentences
  2. Create a vector for each sentence
  3. Calculate the Cosine Distance between all sentences
• Check notebook “Cosine Similarity Textos mios.ipynb”
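The three steps above can be sketched with simple word-count vectors (function names and the example sentences are mine; the course notebook may differ):

```python
import math

def sentence_vectors(sentences):
    """Steps 1 and 2: collect the unique words across all sentences,
    then build one word-count vector per sentence over that vocabulary."""
    vocab = sorted({w for s in sentences for w in s.lower().split()})
    return [[s.lower().split().count(w) for w in vocab] for s in sentences]

def cosine_distance(p, q):
    """Step 3: cosine distance between two count vectors."""
    dot = sum(x * y for x, y in zip(p, q))
    norm = math.sqrt(sum(x * x for x in p)) * math.sqrt(sum(y * y for y in q))
    return 1 - dot / norm

sentences = ["the cat sat", "the cat ran", "dogs bark"]
vecs = sentence_vectors(sentences)
```

Sentences sharing words ("the cat …") end up close; sentences with no words in common are at distance 1.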
K-means
https://upload.wikimedia.org/wikipedia/commons/thumb/e/e5/KMeans-Gaussian-data.svg/434px-KMeans-Gaussian-data.svg.png
K-means
Definition
• Clustering method that minimizes the sum of squared distances from
  points to their cluster centroid (inertia)
• Centroid-based algorithm
• Distance-based algorithm
K-means
2. Select K random points from the dataset
   1. These will be used as centroids of our clusters
3. Assign the rest of the points to the closest cluster, using the Euclidean
   distance between points P1 = (x1, x2, …, xn) and P2 = (y1, y2, …, yn):

   D(P_1, P_2) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
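The steps above, plus the update and termination rules from the next slide, can be sketched in pure Python (function name, seed handling, and the empty-cluster fallback are my assumptions, not from the slides):

```python
import math
import random

def kmeans(points, k, iters=100, seed=0):
    """Minimal K-means: pick k random points as centroids, assign every
    point to its nearest centroid, recompute each centroid as the mean
    of its cluster, and stop when the centroids no longer change."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: nearest centroid by Euclidean distance.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[j].append(p)
        # Update step: each centroid moves to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new = [tuple(sum(x) / len(cl) for x in zip(*cl)) if cl else centroids[j]
               for j, cl in enumerate(clusters)]
        if new == centroids:        # termination: centroids do not change
            break
        centroids = new
    return centroids, clusters
```

With two well-separated blobs and k=2, the algorithm recovers one cluster per blob regardless of which points are drawn as initial centroids.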
K-means
Termination Criteria
1. Centroids do not change
   • After some iterations, centroids remain the same
   • The algorithm is not learning new patterns
https://www.programmersought.com/images/437/0676d01a03e72b72253d6e62590a88c5.JPEG
K-means: disadvantages
• Clusters may have different sizes
• Clusters may have different densities
Choosing K
Elbow method
• Empirical method
• Perhaps the most famous method for choosing K
• Method
  • Calculate the Within-Cluster Sum of Squared errors (WCSS, the inertia)
    for different values of K
  • WCSS = sum(square(distance from each point to its cluster centroid))
  • Choose the K at the “elbow” of the plot, where WCSS stops decreasing sharply
https://editor.analyticsvidhya.com/uploads/12158wcss.png
https://miro.medium.com/max/407/1*O_JmBi6rM6PlrPFx_uNrVQ.png
https://cdn.analyticsvidhya.com/wp-content/uploads/2019/08/Screenshot-from-2019-08-09-16-01-34.png
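The WCSS formula above, given a clustering result, is a short sum (function name mine; in practice one runs K-means for each candidate K and plots the resulting values):

```python
import math

def wcss(clusters, centroids):
    """Within-Cluster Sum of Squared errors: squared Euclidean distance
    from every point to its own cluster's centroid, summed over clusters."""
    return sum(math.dist(p, c) ** 2
               for cl, c in zip(clusters, centroids)
               for p in cl)
```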
Choosing K…
Silhouette Method
• Measures how similar a point is to its own cluster (cohesion) compared
  to other clusters (separation)
• d(i, j) = distance from point i to point j
• s(i) ranges from -1 to 1
• A high value indicates a good number of clusters
• A negative value indicates too many or too few clusters
https://miro.medium.com/max/389/1*Otg9XRLPqOjb80Png6B9Og.png
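The coefficient can be sketched directly from its definition, a(i) for cohesion and b(i) for separation, with s(i) = (b − a) / max(a, b) (pure-Python sketch, name mine; in practice a library routine such as scikit-learn's silhouette score would be used):

```python
import math

def silhouette(points, labels):
    """Mean silhouette coefficient over all points.
    a(i): mean distance from point i to the rest of its own cluster.
    b(i): smallest mean distance from point i to any other cluster."""
    scores = []
    for i, p in enumerate(points):
        own = [math.dist(p, q) for j, q in enumerate(points)
               if j != i and labels[j] == labels[i]]
        if not own:                 # singleton cluster: s(i) taken as 0
            scores.append(0.0)
            continue
        a = sum(own) / len(own)
        b = min(
            sum(math.dist(p, q) for j, q in enumerate(points)
                if labels[j] == lab)
            / sum(1 for j in range(len(points)) if labels[j] == lab)
            for lab in set(labels) if lab != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)
```

Two tight, well-separated clusters score close to 1; overlapping or mis-assigned points pull the mean toward 0 or below.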
https://medium.com/@cmukesh8688/silhouette-analysis-in-k-means-clustering-cefa9a7ad111
K-means
Advantages
• Very simple to implement
• Can generalize to clusters of different sizes and shapes

Disadvantages
• Very sensitive to outliers
• Not so efficient when the number of dimensions increases
Class Exercise
• Tutorial
• https://realpython.com/k-means-clustering-python/
• Work up to “Evaluating Clustering Performance Using Advanced Techniques”