
Unsupervised Learning

Chapter 9

Unsupervised Learning
No target data are available in the training examples. We want to explore the data to find some intrinsic structure in them.

Why Unsupervised Learning?


- if labeling is expensive, train with a small labeled sample, then improve with a large unlabeled sample
- if labeling is expensive, train with a large unlabeled sample, then learn classes with a small labeled sample
- tracking concept drift over time by unsupervised learning
- learning new features by clustering, for later use in classification
- exploratory data analysis with visualization

Clustering

Clustering is a technique for finding similarity groups in data, called clusters.


It groups data instances that are similar to (near) each other into one cluster, and data instances that are very different (far away) from each other into different clusters.

Clustering (contd..)
Clustering is often called an unsupervised learning task, as no class values denoting an a priori grouping of the data instances are given (as they would be in supervised learning). For historical reasons, clustering is often considered synonymous with unsupervised learning.

K-Means Clustering
Suppose that you know the number of clusters, but not what the clusters look like

How do you assign each data point to a cluster?
- Position k centers at random in the space
- Assign each point to its nearest center according to some chosen distance measure
- Move each center to the mean of the points that it represents
- Iterate

K-Means Algorithm
Initialization
- choose a value for k
- choose k random positions in the input space
- assign the cluster centers μj to those positions

Learning
- Repeat
* for each datapoint xi:
  - compute the distance d(xi, μj) = ||xi − μj|| to each cluster center μj
  - assign the datapoint to the nearest cluster center, i.e. the one with minimum distance

K-Means Algorithm (contd..)


* For each cluster center: move the position of the center to the mean of the points in that cluster (Nj is the number of points in cluster j): μj = (1/Nj) Σ xi, summing over the points xi assigned to cluster j

- Until the cluster centers stop moving

Usage
- for each point:
  * compute the distance to each cluster center
  * assign the point to the nearest cluster center, i.e. the one with minimum distance
(a NumPy sketch of the whole algorithm follows below)
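A minimal NumPy sketch of this loop, written as one possible implementation (the function name, parameters, and the choice of initializing the centers at randomly chosen datapoints are illustrative assumptions, not from the slides):

    import numpy as np

    def kmeans(X, k, max_iters=100, rng=None):
        """Minimal k-means sketch: X is an (N, d) array of datapoints, k the number of clusters."""
        rng = np.random.default_rng(rng)
        # Initialization: place the k centers at randomly chosen datapoints
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(max_iters):
            # Distance from every point to every center, shape (N, k)
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            # Assign each point to its nearest center
            labels = dists.argmin(axis=1)
            # Move each center to the mean of the points assigned to it
            new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                    else centers[j] for j in range(k)])
            # Stop when the centers no longer move
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        return centers, labels

Usage on new points then amounts to recomputing the distance matrix against the returned centers and taking the argmin along the cluster axis.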

K-Means Algorithm (contd..)


4 means
[Figure: example run of k-means with 4 means on a 2-D dataset]

K-Means Algorithm (contd..)


These are local minima solutions
[Figure: local-minimum placements of the means on the same data]

K-Means Algorithm (contd..)


More solutions that are perfectly valid, but wrong
[Figure: further valid but wrong placements of the means on the same data]

K-Means Algorithm (contd..)


If you don't know the number of means, the problem is worse
[Figure: the same data clustered with an ill-chosen number of means]

Dealing with Noise in the k-Means Algorithm


The mean average is susceptible to outliers.
One way to avoid the problem is to replace the mean with the median, which is what is known as a robust statistic (i.e. it is not affected by outliers).
The only change to the algorithm is to replace the computation of the mean with the computation of the median.
However, calculating the median is computationally more expensive.
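As a sketch, only the center-update step of the k-means loop changes; a possible k-medians update (variable names match the earlier sketch and are illustrative):

    import numpy as np

    def update_centers_median(X, labels, centers):
        """k-medians step: move each center to the component-wise median of its points.
        The rest of the k-means loop is unchanged; the median is robust to outliers."""
        return np.array([np.median(X[labels == j], axis=0) if np.any(labels == j)
                         else centers[j] for j in range(len(centers))])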

k-Means Neural Network


The neuron activation measures the distance between the input and the neuron's position in weight space.

k-Means Neural Network (contd..)


Weight space: imagine we plot the neurons' positions according to their weights.

[Figure: three neurons plotted in weight space, with axes w1, w2, w3]

k-Means Neural Network (contd..)


Use winner-take-all neurons: the winning neuron is the one closest to the input, i.e. the best-matching cluster.

How do we do training?
- Update the weights, i.e. move the neuron positions
- Move the winning neuron towards the current input
- Ignore the rest

Normalization
Suppose the weights of three neurons are:
(0.2, 0.2, -0.1)
(0.15, -0.15, 0.1)
(10, 10, 10)
The input is (0.2, 0.2, -0.1)
[Figure: the three weight vectors plotted in weight space with axes w1, w2, w3]

Normalization (contd..)
For a perfect match with the first neuron, the activations (dot products of weights and input) are:
0.2*0.2 + 0.2*0.2 + (-0.1)*(-0.1) = 0.09
0.15*0.2 + (-0.15)*0.2 + 0.1*(-0.1) = -0.01
10*0.2 + 10*0.2 + 10*(-0.1) = 3
We can only compare activations if the weights are about the same size.

Normalization (contd..)
- Make the distance between each neuron and the origin be 1
- All neurons then lie on the unit hypersphere
- This is also needed to stop the weights growing unboundedly
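A small sketch of this normalization step, assuming NumPy and weight/input vectors stored as rows:

    import numpy as np

    def normalize_rows(V, eps=1e-12):
        """Scale each row vector to unit length so it lies on the unit hypersphere."""
        norms = np.linalg.norm(V, axis=1, keepdims=True)
        return V / np.maximum(norms, eps)

Once both the weights and the inputs are normalized like this, the dot-product activations in the example above become directly comparable.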

Better Weight Update Rule for k-Means Neural Network


Normalize the inputs too. Then update the winning neuron by moving it towards the (normalized) input, e.g. Δwb = η (x − wb).

The On-Line k-Means Algorithm


Initialization
- choose a value for k, which corresponds to the number of output nodes
- initialize the weights to small random values

Learning
- normalize the data so that all the points lie on the unit sphere
- repeat:
  * for each datapoint:
    - compute the activations of all the nodes
    - pick the winner as the node with the highest activation
    - update the weights of the winning node, moving it towards the input (e.g. Δwb = η (x − wb))
- until the number of iterations is above a threshold

The On-Line k-Means Algorithm (contd..)


Usage
- for each test point:
  * compute the activations of all the nodes
  * pick the winner as the node with the highest activation
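Putting the pieces together, here is a minimal on-line k-means sketch; the learning rate, epoch count, and the exact update Δw = η(x − w) followed by re-normalization are assumptions:

    import numpy as np

    def online_kmeans(X, k, eta=0.1, n_epochs=10, rng=None):
        """On-line k-means sketch: winner-take-all updates on unit-length data."""
        rng = np.random.default_rng(rng)
        X = X / np.linalg.norm(X, axis=1, keepdims=True)         # data on the unit sphere
        W = rng.normal(scale=0.1, size=(k, X.shape[1]))          # small random weights
        W /= np.linalg.norm(W, axis=1, keepdims=True)            # also placed on the sphere
        for _ in range(n_epochs):
            for x in X[rng.permutation(len(X))]:
                activations = W @ x                      # dot product = similarity on the sphere
                winner = activations.argmax()            # winner-take-all
                W[winner] += eta * (x - W[winner])       # move the winner towards the input
                W[winner] /= np.linalg.norm(W[winner])   # keep the winner on the unit sphere
        return W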

Vector Quantization (VQ)


Think about the problem of data compression:
- We want to store a set of data (say, sensor readings) in as small an amount of memory as possible
- We don't mind some loss of accuracy

We could make a codebook of typical data and index each data point by reference to a codebook entry. Thus, VQ is a coding method that maps each data point x to the closest codeword, i.e., we encode x by replacing it with that closest codeword.
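A sketch of the encode/decode step, assuming the codebook is simply an array of codewords (for example, the centers learned by k-means):

    import numpy as np

    def vq_encode(X, codebook):
        """Encode each point as the index of its nearest codeword."""
        dists = np.linalg.norm(X[:, None, :] - codebook[None, :, :], axis=2)
        return dists.argmin(axis=1)

    def vq_decode(indices, codebook):
        """Lossy reconstruction: look each index up in the codebook."""
        return codebook[indices]

Only the integer indices (plus the codebook itself) need to be stored, which is where the compression comes from.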

Vector Quantization (contd..)


Voronoi tessellation:
- Join neighboring points
- Draw lines equidistant to each pair of points
- These lines are perpendicular to the lines joining the points

Vector Quantization (contd..)


Learning Vector Quantization: how do we select the prototype vectors?
The prototype vectors are chosen so that they are as close as possible to all of the possible inputs that we might encounter.
The k-means algorithm can be used to solve this problem: the learned cluster centers serve as the prototype vectors (codewords).

Self-Organizing map
- Self-organizing maps (SOMs) are a data visualization technique invented by Professor Teuvo Kohonen
- Also called Kohonen networks, competitive learning, or winner-take-all learning
- They generally reduce the dimensionality of the data through the use of self-organizing neural networks
- Useful for data visualization: humans cannot visualize high-dimensional data, so this is often a useful technique for making sense of large data sets

Feature Map
- Sounds that are similar (close together) excite neurons that are near to each other
- Sounds that are very different excite neurons that are a long way apart
- This is known as topology preservation: the ordering of the inputs is preserved, if possible (a perfectly topology-preserving map)

Topology Preservation

[Figure: inputs mapped onto output neurons with their ordering (topology) preserved]

Self-Organizing Maps (Kohonen Maps)


Common output layer structures:
- One-dimensional (completely interconnected for determining the winning unit)
- Two-dimensional (connections omitted, only neighborhood relations shown)

[Figure: the neighborhood of neuron i in a two-dimensional grid]

The Self-Organizing Map


[Figure: the inputs feeding into the grid of map neurons]

The Self-Organizing Map (contd..)


- The weight vectors are randomly initialized
- Input vectors are presented to the network
- The neurons are activated in proportion to the Euclidean distance between the input and the weight vector
- The winning node has its weight vector moved closer to the input
- So do the neighbors of the winning node
- Over time, the network self-organizes so that the input topology is preserved

Self Organization
Global ordering from local interactions:
- Each neuron sees its neighbors
- The whole network becomes ordered
Understanding self-organization is part of complexity science; it appears all over the place.

Self-Organizing Feature Map Algorithm


Initialization:
- choose a size (number of neurons) and a number of dimensions d for the map
- Either:
  * choose random values for the weight vectors so that they are all different, OR
  * set the weight values to increase in the direction of the first d principal components of the dataset

Self-Organizing Feature Map Algorithm (contd..)


Learning
- repeat:
  * for each datapoint:
    - select the best-matching neuron nb using the minimum Euclidean distance between the weights and the input: nb = argmin_j ||x − wj||
  * update the weight vector of the best-matching node using:
    wb ← wb + η(t) (x − wb)
    where η(t) is the learning rate.

Self-Organizing Feature Map Algorithm (contd..)


* update the weight vectors of all other neurons using:
  wj ← wj + ηn(t) h(nb, t) (x − wj)
  where ηn(t) is the learning rate for neighborhood nodes, and h(nb, t) is the neighborhood function, which decides whether each neuron should be included in the neighborhood of the winning neuron (so h = 1 for neighbors and h = 0 for non-neighbors)

* Reduce the learning rates and adjust the neighborhood function, typically by η(t+1) = α η(t)^(k/kmax), where 0 ≤ α ≤ 1 decides how fast the size decreases, k is the number of iterations the algorithm has been running for, and kmax is when you want the learning to stop. The same equation is used for both learning rates (η, ηn) and the neighborhood function h(nb, t).

Until the map stops changing or some maximum number of iterations is exceeded
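A compact sketch of the whole loop; the grid shape, learning rate, neighborhood width, and the simple geometric decay used here are assumptions (the slides decay the rates with an exponent k/kmax instead):

    import numpy as np

    def train_som(X, map_shape=(10, 10), n_epochs=20, eta=0.3, sigma=2.0, alpha=0.9, rng=None):
        """Minimal SOM sketch with a Gaussian neighborhood on a 2-D grid."""
        rng = np.random.default_rng(rng)
        rows, cols = map_shape
        # Grid coordinates of each neuron, used to measure distance on the map
        coords = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
        W = rng.uniform(X.min(0), X.max(0), size=(rows * cols, X.shape[1]))
        for _ in range(n_epochs):
            for x in X[rng.permutation(len(X))]:
                # Best-matching neuron: minimum Euclidean distance in weight space
                bmu = np.linalg.norm(W - x, axis=1).argmin()
                # Gaussian neighborhood in map space around the winner
                d2 = np.sum((coords - coords[bmu]) ** 2, axis=1)
                h = np.exp(-d2 / (2 * sigma ** 2))
                # Pull every neuron towards the input, weighted by its neighborhood strength
                W += eta * h[:, None] * (x - W)
            # Shrink the learning rate and neighborhood size (ordering, then convergence)
            eta *= alpha
            sigma *= alpha
        return W.reshape(rows, cols, -1)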

Self-Organizing Feature Map Algorithm (contd..)


Usage:
- for each test point:
* select the best-matching neuron nb using the minimum Euclidean distance between the weights and the input: nb = argmin_j ||x − wj||

Result of Algorithm
- Initially, some output nodes will randomly be a little closer to some particular type of input
- These nodes become winners, and the weight updates move them even closer to the inputs
- Over time, nodes in the output layer become representative prototypes for examples in the input

Note there is no supervised training here.

Classification: given a new input, the class is the output node that is the winner.

Weight Initialization
Generally, the weights are randomly selected, as in the MLP. Principal Components Analysis can also be used to initialize the weights in the network: it finds the two largest directions of variation in the data and initializes the weights so that they increase along these two directions (batch mode).
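A sketch of such a PCA-based initialization (the function name and the way the grid is scaled are illustrative choices):

    import numpy as np

    def pca_init_weights(X, map_shape=(10, 10)):
        """Initialize SOM weights so they increase along the first two principal components."""
        mean = X.mean(axis=0)
        # Principal directions from the SVD of the centered data
        _, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
        pcs = Vt[:2]                              # two largest directions of variation
        scale = s[:2] / np.sqrt(len(X))           # data spread along each direction
        rows, cols = map_shape
        grid = np.stack(np.meshgrid(np.linspace(-1, 1, rows),
                                    np.linspace(-1, 1, cols),
                                    indexing="ij"), axis=-1)    # (rows, cols, 2)
        # Each neuron's weight vector varies smoothly across the map grid
        return mean + grid @ (scale[:, None] * pcs)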

Neighborhood
The neighborhood size is another parameter that we need to control: how large should the neighborhood of a neuron be?
- Initially it is large, because the network is unordered (two nodes that are very close in weight space could be on opposite sides of the map, and vice versa)
- However, once the network has been learning for a while, the rough ordering has already been created, and the algorithm starts to fine-tune the individual local regions of the network; at this stage the neighborhood size reduces
Therefore the network's neighborhood size reduces as it adapts. These two phases of learning are known as ordering and convergence.

Selecting the Neighborhood


Typically, a sombrero (Mexican hat) function or a Gaussian function is used.
[Figure: neighborhood strength plotted as a function of distance from the winning neuron]
The neighborhood size usually decreases over time, to allow initial jockeying for position and then fine-tuning as the algorithm proceeds.
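Two possible neighborhood functions, taking the squared map-space distance d2 from the winner; the difference-of-Gaussians form of the sombrero is just one illustrative construction:

    import numpy as np

    def gaussian_neighborhood(d2, sigma):
        """Strength falls off smoothly with (squared) distance from the winning neuron."""
        return np.exp(-d2 / (2 * sigma ** 2))

    def sombrero_neighborhood(d2, sigma, k=1.5):
        """Mexican-hat shape: excitatory center, mildly inhibitory surround."""
        return np.exp(-d2 / (2 * sigma ** 2)) - 0.5 * np.exp(-d2 / (2 * (k * sigma) ** 2))

Shrinking sigma over time gives the decreasing neighborhood size described above.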

Neighborhood
Before training (large neighborhood)

Neighborhood (contd..)
After training (small neighborhood)

Network Dimensionality
It depends on the intrinsic dimensionality (the number of dimensions that you actually need to represent the data). Noise and other inaccuracies in the data often lead to the data being represented in more dimensions than are actually required, so finding the intrinsic dimensionality can help reduce the noise.

Network Boundary
In some cases we can strictly define the boundary (e.g. if we are arranging sounds from low pitch to high pitch, then the lowest and highest pitches we can hear are obvious end points). However, there are cases where we can't exactly define the boundary, and then we might want to remove the boundary conditions. We do this by tying the ends of the map together.
In 1D we turn the line into a circle, while in 2D we turn the rectangle into a torus. Generally this means there are no neurons on the edge of the feature map.
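Tying the ends together only changes how distance on the map is measured; a sketch of a wrap-around (toroidal) grid distance that could replace the plain grid distance in the neighborhood calculation:

    import numpy as np

    def torus_sq_distance(coords, bmu_coord, map_shape):
        """Squared grid distance with wrap-around, so the map has no edges (a torus)."""
        diff = np.abs(coords - bmu_coord)
        diff = np.minimum(diff, np.array(map_shape) - diff)   # shorter way round each axis
        return np.sum(diff ** 2, axis=1)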

Network Size
We have to predetermine the network size.
Big network:
- Each neuron represents an exact feature
- Not much generalization
Small network:
- Too much generalization
- No differentiation
Try different sizes and pick the best.
