
Unsupervised Learning
TC2034 – Modelación del Aprendizaje con Inteligencia Artificial

https://miro.medium.com/max/1920/1*VACikYaZIHb2OctTRohn8A.jpeg
Unsupervised Learning
Definition
• Uses information that is neither classified nor labelled
• There are two categories of algorithms:
  • Clustering: discover the inherent grouping in the data; the algorithm (learner) has to discover patterns without guidance
  • Dimension reduction (association): discover rules that describe large portions of the data
• Usually groups information by:
  • Similarities
  • Patterns
  • Differences

Unsupervised Learning…

Applications
• News sections: categorize articles
• Computer vision: object recognition
• Medical imaging: image detection, classification, segmentation
• Anomaly detection: discover atypical points
• Customer persona profiles
• Recommendation engines

Challenges
• Computational complexity – high volume of data
• Longer training times
• High risk of inaccurate results
• Need for human intervention to validate results
• Lack of transparency into exactly how the data was organized

Clustering

https://www.researchgate.net/profile/Adam-Warner-3/publication/333333035/figure/fig2/AS:761900535119875@1558662651725/Clustered-gene-expression-visualized-with-t-SNE-Gene-expression-values-TPM-were.png

Clustering

Definition
• Grouping of unlabeled instances
• Useful to understand a dataset
• Needs a similarity measure to group similar instances
• Groups are characterized by their features
• The more features, the more difficult grouping is

Applications
• Market segmentation
• Social network analysis
• Medical imaging
• Image segmentation
• Anomaly detection

Types of clustering

Centroid-based Density-based

https://developers.google.com/machine-learning/clustering/
Types of clustering…

Distribution-based Hierarchical

https://developers.google.com/machine-learning/clustering/
Properties of clusters

• All data points in a cluster must be similar to each other
• Data points from different clusters should be as different as possible

https://www.analyticsvidhya.com/blog/2019/08/comprehensive-guide-k-means-clustering/
Evaluation metrics for clustering

Intracluster distance (inertia)
• Calculate the distance from all points to the center of the cluster
• Tries to create very compact clusters
• The smaller the inertia, the better the cluster

Intercluster distance (Dunn index)
• Measures the distance between two clusters

$\text{Dunn index} = \dfrac{\min(\text{intercluster distance})}{\max(\text{inertia})}$

• The bigger the Dunn index, the more separated the clusters are
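As a rough illustration (not from the original slides), here is a minimal NumPy sketch of both metrics; the array names `X`, `labels`, and `centroids` are assumptions, and taking centroid-to-centroid distances for the intercluster term is just one of several possible definitions:

```python
import numpy as np

def inertia(X, labels, centroids):
    # Sum of squared distances from each point to its assigned centroid
    return sum(np.sum((X[labels == c] - centroids[c]) ** 2)
               for c in range(len(centroids)))

def dunn_index(X, labels, centroids):
    k = len(centroids)
    # Intercluster term: smallest centroid-to-centroid distance
    inter = min(np.linalg.norm(centroids[i] - centroids[j])
                for i in range(k) for j in range(i + 1, k))
    # Intracluster term: largest per-cluster inertia
    intra = max(np.sum((X[labels == c] - centroids[c]) ** 2)
                for c in range(k))
    return inter / intra
```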

Similarity
Measures

https://d1e4pidl3fu268.cloudfront.net/968ffd16-ad6b-47c7-a738-707352d3db33/Similarity.crop_709x532_0,93.preview.jpg

Similarity

• Measure of how much two objects are alike

• Distance with dimensions representing features of objects

• Highly dependent on domain and application


• For instance, use of shape vs. use of size

• Large distance means low similarity

• Usually, a number between 0 and 1

Euclidean Distance

• Shortest distance between two points

• Usually applied in 2D space, but generalizes to N-dimensional spaces

• Other names:
• Euclidean norm
• Euclidean metric
• L2 norm
• L2 metric
• Pythagorean metric

https://miro.medium.com/v2/resize:fit:1400/format:webp/1*j_h0BwDKraVoD6XbUUyn3g.png
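As a quick sketch (the slides do not prescribe a library), the Euclidean distance can be computed directly with NumPy:

```python
import numpy as np

p1 = np.array([1.0, 2.0, 3.0])
p2 = np.array([4.0, 6.0, 3.0])

# Euclidean (L2) distance: square root of the sum of squared differences
d = np.sqrt(np.sum((p1 - p2) ** 2))   # equivalent to np.linalg.norm(p1 - p2)
print(d)                              # 5.0
```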

Manhattan Distance

• Calculates the distance between two points along a grid-like path

• Also known as:
  • Taxicab geometry
  • City block distance
  • L1 norm

https://miro.medium.com/v2/resize:fit:1400/format:webp/1*XsJbklzmGXbx6tu7aX2Vug.png
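A minimal NumPy sketch of the same idea, summing the absolute coordinate differences:

```python
import numpy as np

p1 = np.array([1.0, 2.0])
p2 = np.array([4.0, 6.0])

# Manhattan (L1) distance: sum of absolute coordinate differences
d = np.sum(np.abs(p1 - p2))
print(d)  # 7.0
```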

Minkowski Distance

• Generalization of the Manhattan and Euclidean distances

• p is the order of the norm

• When p = 1, it calculates the Manhattan distance

• When p = 2, it calculates the Euclidean distance

https://cdn-images-1.medium.com/max/1067/1*Fb22fnJRABGUANpjcjwEow.png
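SciPy ships a Minkowski implementation; a short sketch showing how p = 1 and p = 2 recover the two special cases (the toy points are made up):

```python
from scipy.spatial.distance import minkowski

p1 = [1.0, 2.0]
p2 = [4.0, 6.0]

print(minkowski(p1, p2, p=1))  # 7.0 -> Manhattan distance
print(minkowski(p1, p2, p=2))  # 5.0 -> Euclidean distance
```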

Chebyshev Distance

• Maximum absolute difference in any one dimension between two N-dimensional points

https://miro.medium.com/v2/resize:fit:1400/format:webp/1*Ys2Wga9fvZAnlPUir2HRIA.png
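A minimal NumPy sketch, taking the largest per-coordinate difference:

```python
import numpy as np

p1 = np.array([1.0, 2.0, 3.0])
p2 = np.array([4.0, 6.0, 3.0])

# Chebyshev (L-infinity) distance: largest coordinate difference
d = np.max(np.abs(p1 - p2))
print(d)  # 4.0
```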

Hamming Distance

• Measures the difference between two strings of the same length

• The number of positions at which the corresponding characters of the two strings are different

https://miro.medium.com/v2/resize:fit:1286/format:webp/1*Q4rdY5KN91aZjvGXl7SRug.png
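A minimal pure-Python sketch (the function name `hamming` is just illustrative):

```python
def hamming(s1: str, s2: str) -> int:
    # Count positions where the corresponding characters differ
    if len(s1) != len(s2):
        raise ValueError("strings must have the same length")
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

print(hamming("karolin", "kathrin"))  # 3
```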

Levenshtein Distance

• Number of operations needed to convert one string into another

• Operations to count:
  • Inserting a character
  • Deleting a character
  • Substituting a character

https://miro.medium.com/v2/resize:fit:1170/format:webp/1*Et8w4ttqh37UqaARPu6fqA.png
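A minimal sketch using the classic Wagner–Fischer dynamic-programming formulation:

```python
def levenshtein(s1: str, s2: str) -> int:
    # prev[j] holds the edit distance between s1[:i-1] and s2[:j]
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```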

Cosine Distance

• Measures the angle between two vectors (objects)

• Judges orientation rather than magnitude

• Same orientation = cosine similarity of 1

• Perpendicular orientation = cosine similarity of 0

• Opposed orientation = cosine similarity of -1

https://miro.medium.com/v2/resize:fit:1400/format:webp/1*RTaoSa-_04l5HpQFxP116Q.png
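A minimal NumPy sketch showing the three cases listed above:

```python
import numpy as np

def cosine_similarity(u, v):
    # cos(theta) = (u . v) / (|u| * |v|)
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

u = np.array([1.0, 0.0])
v = np.array([0.0, 1.0])
print(cosine_similarity(u, u))   #  1.0 (same orientation)
print(cosine_similarity(u, v))   #  0.0 (perpendicular)
print(cosine_similarity(u, -u))  # -1.0 (opposed)
# Cosine distance is commonly defined as 1 - cosine similarity
```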

Cosine Distance…
Applied to text
• Three steps:
  1. Obtain all unique words in the original sentences
  2. Create a vector for each sentence
  3. Calculate the cosine distance between all pairs of sentences
• Check the notebook "Cosine Similarity Textos mios.ipynb"

Notice that each item can be a sentence or a complete document.
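The referenced notebook is not reproduced here; the following is a minimal sketch of the three steps, with made-up example sentences:

```python
import math
from collections import Counter

def sentence_vector(sentence, vocabulary):
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocabulary]

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

s1 = "the cat sat on the mat"
s2 = "the dog sat on the log"

# Step 1: obtain all unique words in the original sentences
vocabulary = sorted(set(s1.lower().split()) | set(s2.lower().split()))
# Step 2: create a vector for each sentence
v1, v2 = sentence_vector(s1, vocabulary), sentence_vector(s2, vocabulary)
# Step 3: calculate the cosine similarity (or distance) between them
print(cosine_similarity(v1, v2))
```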

K-means

https://upload.wikimedia.org/wikipedia/commons/thumb/e/e5/KMeans-Gaussian-data.svg/434px-KMeans-Gaussian-data.svg.png
K-means
Definition
• Clustering method that tries to minimize the distance between points in a cluster (inertia)
• Centroid-based algorithm
• Distance-based algorithm
• Each cluster is associated with a centroid
• Assigns each point to a cluster based on its distance to the cluster's centroid

K-means

Algorithm
1. Select the number of clusters (K)
2. Select K random points from the dataset
   • These will be used as the centroids of our clusters
3. Assign the rest of the points to the closest cluster
4. Calculate new centroids for the current clusters
5. Repeat steps 3 and 4 until a termination criterion is met

Distance from a point to another point
• The Euclidean distance is used:

$P_1 = (x_1, x_2, \ldots, x_n), \qquad P_2 = (y_1, y_2, \ldots, y_n)$

$D(P_1, P_2) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$

• In the case of two points in 2D, use:

$P_1 = (x_1, x_2), \qquad P_2 = (y_1, y_2)$

$D(P_1, P_2) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2}$
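A minimal NumPy sketch of the five steps above; the 100-iteration cap and the helper name `kmeans` are assumptions, and empty clusters are not handled:

```python
import numpy as np

def kmeans(X, k, max_iters=100):
    rng = np.random.default_rng()
    # Step 2: pick K distinct random points from the dataset as initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 3: assign every point to the cluster with the closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == c].mean(axis=0) for c in range(k)])
        # Step 5 / termination: stop when the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Example usage on toy data
X = np.random.default_rng(0).random((100, 2))
labels, centroids = kmeans(X, k=3)
```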

K-means
Termination Criteria
1. Centroids do not change
   • After some iterations, the centroids remain the same
   • The algorithm is not learning new patterns
2. Points remain in the same cluster
   • Even if the centroids are still changing slightly, no data points switch clusters after several iterations
3. Maximum number of iterations reached

https://www.programmersought.com/images/437/0676d01a03e72b72253d6e62590a88c5.JPEG
K-means: disadvantages

• Clusters may have different sizes
• Clusters may have different densities

The solution is usually to add more clusters.

Choosing K
Elbow method
• Empirical method
• Perhaps the most famous method for choosing K

• Method
  • Calculate the Within-Cluster Sum of Squared Errors (WCSS, the inertia) for different values of K
  • WCSS = sum(square(distance from each point to its cluster centroid))
  • Choose the K at the "elbow" of the curve, after which the WCSS stops decreasing sharply

https://editor.analyticsvidhya.com/uploads/12158wcss.png

https://miro.medium.com/max/407/1*O_JmBi6rM6PlrPFx_uNrVQ.png
https://cdn.analyticsvidhya.com/wp-content/uploads/2019/08/Screenshot-from-2019-08-09-16-01-34.png
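A short sketch of the method using scikit-learn's `KMeans`, whose `inertia_` attribute is exactly the WCSS; the toy data is made up:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(200, 2)  # toy data for illustration

# Fit K-means for several values of K and record the WCSS (inertia_)
wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)

# Inspect (or plot) WCSS against K and pick the K at the "elbow" of the curve
for k, w in zip(range(1, 11), wcss):
    print(k, round(w, 2))
```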
Choosing K…
Silhouette Method
• Measures how similar a point is to its own cluster (cohesion) compared to other clusters (separation)
• d(i, j) = distance from point i to point j
• a(i) = average distance from point i to the other points in its own cluster (cohesion)
• b(i) = average distance from point i to the points in the nearest other cluster (separation)

$s(i) = \dfrac{b(i) - a(i)}{\max(a(i), b(i))}$

• s(i) varies from -1 to 1
  • A high value indicates a good number of clusters
  • A negative value indicates too many or too few clusters

https://miro.medium.com/max/389/1*Otg9XRLPqOjb80Png6B9Og.png
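A short sketch using scikit-learn's `silhouette_score`, which averages s(i) over all points; the toy data is made up:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.rand(200, 2)  # toy data for illustration

# Average silhouette over all points, for several candidate values of K
for k in range(2, 11):  # the silhouette needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
```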

https://medium.com/@cmukesh8688/silhouette-analysis-in-k-means-clustering-cefa9a7ad111
K-means

Advantages
• Very simple to implement
• Adapts to new examples (instances) easily
• Can generalize to clusters of different sizes and shapes

Disadvantages
• Very sensitive to outliers
• Choosing K is not an easy task
• Not as efficient when the number of dimensions increases

Class Exercise

• Tutorial
• https://realpython.com/k-means-clustering-python/
• Work up to "Evaluating Clustering Performance Using Advanced Techniques"

