Chapter 9
Unsupervised Learning
No target data are available in the training examples; we want to explore the data to find some intrinsic structure in them.
Clustering
Clustering is often called an unsupervised learning task because no class values denoting an a priori grouping of the data instances are given, as they are in supervised learning. For historical reasons, clustering is often treated as synonymous with unsupervised learning.
K-Means Clustering
Suppose that you know the number of clusters, but not what the clusters look like. How do you assign each data point to a cluster?
- Position k centers at random in the space
- Assign each point to its nearest center according to some chosen distance measure
- Move each center to the mean of the points that it represents
- Iterate
K-Means Algorithm
Initialization
- choose a value for k
- choose k random positions in the input space
- assign the cluster centers μj to those positions
Learning
- Repeat
  * for each datapoint xi:
    compute the distance to each cluster center
    assign the datapoint to the nearest cluster center, i.e. the one with distance min_j d(xi, μj)
  * for each cluster center:
    move it to the mean of the points that it represents
- until the cluster centers stop moving
Usage
- for each test point:
  * compute the distance to each cluster center
  * assign the point to the nearest cluster center, the one with distance min_j d(x, μj)
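A minimal NumPy sketch of the whole procedure (illustrative, not the original code; the centers are initialized at randomly chosen data points as a stand-in for "k random positions in the input space"):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=None):
    """Minimal k-means: assign points to nearest center, move centers to means."""
    rng = np.random.default_rng(seed)
    # Initialization: place the k centers at randomly chosen data points.
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iters):
        # Distance from every point to every center (Euclidean).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)          # nearest center for each point
        # Move each center to the mean of the points it represents.
        new_centers = np.array([X[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):  # centers stopped moving
            break
        centers = new_centers
    return centers, labels
```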
[Figure: data points of different types in the input space, with three weight vectors w1, w2, w3 positioned among them]
How do we do training?
- Update weights: move neuron positions
- Move the winning neuron towards the current input
- Ignore the rest
Normalization
Suppose the weights are:
w1 = (0.2, 0.2, -0.1)
w2 = (0.15, -0.15, 0.1)
w3 = (10, 10, 10)
The input is x = (0.2, 0.2, -0.1)
Normalization (contd..)
For a perfect match with the first neuron:
w1 · x = 0.2×0.2 + 0.2×0.2 + (-0.1)×(-0.1) = 0.09
w2 · x = 0.15×0.2 + (-0.15)×0.2 + 0.1×(-0.1) = -0.01
w3 · x = 10×0.2 + 10×0.2 + 10×(-0.1) = 3
We can only compare activations if the weights are about the same size.
Normalization (contd..)
- Make the distance between each neuron and the origin be 1
- All neurons then lie on the unit hypersphere
- This is needed to stop the weights growing unboundedly
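A quick numeric check of the example above (a NumPy sketch; the values are the ones from the slide): normalizing both the weights and the input puts everything on the unit hypersphere, and the first neuron, the perfect match, then wins.

```python
import numpy as np

w = np.array([[0.2, 0.2, -0.1],     # w1
              [0.15, -0.15, 0.1],   # w2
              [10.0, 10.0, 10.0]])  # w3
x = np.array([0.2, 0.2, -0.1])

print(w @ x)  # raw activations: [0.09, -0.01, 3.0] -- the oversized w3 wins

# Normalize weights and input onto the unit hypersphere, then compare.
w_unit = w / np.linalg.norm(w, axis=1, keepdims=True)
x_unit = x / np.linalg.norm(x)
print(w_unit @ x_unit)  # ~[1.0, -0.14, 0.58] -- now w1, the perfect match, wins
```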
Learning
- normalize the data so that all the points lie on the unit sphere
- repeat:
  * for each datapoint:
    compute the activations of all the nodes
    pick the winner as the node with the highest activation
    update the winner's weights using Δw = η(x − w), moving it towards the input
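A minimal sketch of this loop (NumPy; the update rule Δw = η(x − w) and the parameter values are the standard winner-take-all choices, not code from the slides):

```python
import numpy as np

def competitive_learning(X, n_nodes, eta=0.1, n_epochs=10, seed=0):
    """Winner-take-all learning on unit-normalized data."""
    rng = np.random.default_rng(seed)
    # Normalize the data so that all points lie on the unit sphere.
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    # Random weights, also normalized onto the unit hypersphere.
    W = rng.normal(size=(n_nodes, X.shape[1]))
    W /= np.linalg.norm(W, axis=1, keepdims=True)
    for _ in range(n_epochs):
        for x in X:
            winner = np.argmax(W @ x)           # highest activation (dot product)
            W[winner] += eta * (x - W[winner])  # move the winner towards the input
            W[winner] /= np.linalg.norm(W[winner])  # keep it on the unit sphere
    return W
```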
Vector Quantization
We could make a codebook of typical data and index each data point by reference to a codebook entry. Thus vector quantization (VQ) is a coding method that maps each data point x to the closest codeword, i.e., we encode x by replacing it with the closest codeword.
The k-means algorithm can be used to solve the problem of learning the vector quantization codebook.
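A sketch of this encoding (reusing the kmeans() function sketched earlier; X_train and X stand for placeholder NumPy arrays of training and test data):

```python
import numpy as np

codebook, _ = kmeans(X_train, k=16)   # learn a 16-entry codebook with k-means

# Encode: map each point to the index of its closest codeword.
codes = np.linalg.norm(X[:, None, :] - codebook[None, :, :], axis=2).argmin(axis=1)

# Decode: replace each index by its codeword (a lossy reconstruction of X).
reconstructed = codebook[codes]
```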
Self-Organizing map
- Self-organizing maps (SOMs) are a data visualization technique invented by Professor Teuvo Kohonen
- Also called Kohonen networks, competitive learning, or winner-take-all learning
- Generally reduce the dimensionality of data through the use of self-organizing neural networks
- Useful for data visualization: humans cannot visualize high-dimensional data, so this is often a useful technique for making sense of large data sets
Feature Map
- Sounds that are similar (close together) excite neurons that are near to each other
- Sounds that are very different excite neurons that are a long way apart
- This is known as topology preservation: the ordering of the inputs is preserved, perfectly if possible (a perfectly topology-preserving map)
Topology Preservation
[Figure: inputs mapped to outputs, showing the neighborhood of neuron i in the output layer]
Self Organization
Global ordering arises from local interactions:
- each neuron sees its neighbors
- the whole network becomes ordered
* Reduce the learning rates and adjust the neighborhood function, typically by η(t+1) = η(t) α^(k/k_max), where 0 ≤ α ≤ 1 decides how fast the size decreases, k is the number of iterations the algorithm has been running for, and k_max is when you want the learning to stop. The same equation is used for both learning rates (η, η_n) and the neighborhood function h(nb, t).
Until the map stops changing or some maximum number of iterations is exceeded
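Putting the pieces together, a minimal SOM training sketch (NumPy; the Gaussian neighborhood, grid size, and parameter values are illustrative assumptions, and the decay uses the closed form η(k) = η(0) α^(k/k_max) of the schedule above):

```python
import numpy as np

def train_som(X, rows=10, cols=10, eta0=0.5, sigma0=3.0, alpha=0.05,
              k_max=2000, seed=0):
    """Minimal SOM sketch: find the winner, pull its grid neighborhood towards x."""
    rng = np.random.default_rng(seed)
    # Grid coordinates of the output nodes, used for neighborhood distances.
    coords = np.array([(i, j) for i in range(rows) for j in range(cols)], float)
    W = rng.uniform(size=(rows * cols, X.shape[1]))
    for k in range(k_max):
        x = X[rng.integers(len(X))]
        # Winner: the node whose weight vector is closest to the input.
        winner = np.argmin(np.linalg.norm(W - x, axis=1))
        # Decay the learning rate and neighborhood width over time.
        decay = alpha ** (k / k_max)
        eta, sigma = eta0 * decay, sigma0 * decay
        # Gaussian neighborhood h(nb, t) centered on the winner, on the grid.
        d2 = np.sum((coords - coords[winner]) ** 2, axis=1)
        h = np.exp(-d2 / (2 * sigma ** 2))
        # Move every node towards x, weighted by its neighborhood value.
        W += eta * h[:, None] * (x - W)
    return W.reshape(rows, cols, -1)
```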
Result of Algorithm
Initially, some output nodes will happen, at random, to be a little closer to some particular type of input. These nodes become winners, and the weight updates move them even closer to those inputs.
Over time nodes in the output become representative prototypes for examples in the input
Note there is no supervised training here
Classification:
Given a new input, its class is given by the output node that wins the competition.
Weight Initialization
Generally, the weights are randomly initialized, as in an MLP. Principal Components Analysis (PCA) is also used to initialize the weights in the network: it finds the two largest directions of variation in the data and initializes the weights so that they increase along these two directions (batch mode).
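A sketch of this PCA-based initialization (NumPy; the grid shape and the ±1 spread are illustrative assumptions): the weights are laid out linearly along the two largest principal components of the data.

```python
import numpy as np

def pca_init(X, rows=10, cols=10):
    """Initialize SOM weights along the two largest directions of variation."""
    mean = X.mean(axis=0)
    cov = np.cov((X - mean).T)             # covariance of the centered data
    vals, vecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
    pc1, pc2 = vecs[:, -1], vecs[:, -2]    # two largest principal components
    s1, s2 = np.sqrt(vals[-1]), np.sqrt(vals[-2])
    # Weights increase linearly along the two principal directions.
    a = np.linspace(-1, 1, rows)[:, None, None] * (s1 * pc1)
    b = np.linspace(-1, 1, cols)[None, :, None] * (s2 * pc2)
    return mean + a + b                    # shape (rows, cols, n_features)
```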
Neighborhood
The neighborhood is another parameter that we need to control. How large should the neighborhood of a neuron be?
Initially it should be large, because the network is unordered (two nodes that are close in weight space could be on opposite sides of the map, and vice versa). However, once the network has been learning for a while, the rough ordering has already been created, and the algorithm starts to fine-tune individual local regions of the network; at that stage the neighborhood size should reduce.
Therefore the network's neighborhood size is reduced as it adapts. These two phases of learning are known as ordering and convergence.
Distance
Neighborhood size usually decreases over time, allowing an initial jockeying for position and then fine-tuning as the algorithm proceeds.
Neighborhood
[Figure: the map before training, with a large neighborhood]
Neighborhood (contd..)
[Figure: the map after training, with a small neighborhood]
Network Dimensionality
It depends on the intrinsic dimensionality, i.e. the number of dimensions that you actually need to represent the data. Noise and other inaccuracies in the data often lead to it being represented in more dimensions than are actually required, so finding the intrinsic dimensionality can help to reduce the noise.
Network Boundary
In some cases we can strictly define the boundary (e.g. if we are arranging sounds from low pitch to high pitch, then the lowest and highest pitches we can hear are obvious end points). However, there are cases where we can't exactly define the boundary, and then we might want to remove the boundary conditions. We do this by tying the ends together.
In 1D we turn the line into a circle, while in 2D we turn the rectangle into a torus. In general this means there are no neurons on the edge of the feature map.
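A sketch of the wrap-around grid distance this implies (the helper and the 10×10 grid are illustrative assumptions, not from the slides):

```python
import numpy as np

def torus_distance(a, b, rows=10, cols=10):
    """Distance between grid nodes a and b with wrap-around (torus) boundaries."""
    dr, dc = abs(a[0] - b[0]), abs(a[1] - b[1])
    # Going around the back of the map may be shorter than going across it.
    dr, dc = min(dr, rows - dr), min(dc, cols - dc)
    return np.hypot(dr, dc)

# Opposite corners are now neighbors: distance ~1.41 instead of ~12.7.
print(torus_distance((0, 0), (9, 9)))
```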
Network Size
We have to predetermine the network size.
Big network:
- each neuron represents an exact feature
- not much generalization
Small network:
- too much generalization
- no differentiation