# Assignment 4&5 (Combined)


(Clustering & Dimension Reduction)
Start Date:

Due Date: Eastern Time (US and Canada)

## Submission
You must submit your assignment online through Bright Space. This is the only method by which
we accept assignment submissions. We do not accept assignments sent via email, and we cannot
enter a mark if the assignment is not submitted on Bright Space! The deadline is firm: you
cannot submit an assignment past the deadline. It is the student's responsibility to ensure
that the assignment has been submitted properly. A mark of 0 will be assigned to any missing
assignment.

The assignment must be done individually. Any teamwork, or any work copied from a source
external to the student, will be considered academic fraud and will be reported to the
Faculty of Engineering as a breach of integrity. The consequence of academic fraud is, at
the very least, an F for this course. Note that we use sophisticated software to compare
assignments (with other students' and with other sources...). Therefore, you must take all
appropriate measures to make sure that others cannot copy your assignment (hence, do not
leave your workstation unattended).

## Goal
In this assignment, we will implement three unsupervised models, namely KMeans, GMM
(Gaussian Mixture Model), and PCA (Principal Component Analysis), and apply them to two
synthetic datasets.

## Dataset
This time, scikit-learn's make_blobs is leveraged to generate two synthetic datasets. By
specifying the number of samples, features, and centers, make_blobs can generate isotropic
Gaussian blobs for clustering. Please access this link for more details about it. As the
following figures show, dataset 1 contains 1000 2-D instances sampled from a mixture of 5
Gaussians, while dataset 2 consists of 3-D samples. They are used to test the clustering
algorithms and the dimension reduction algorithm, respectively.
## KMeans

### Vanilla KMeans

In the following discussion, we will not go into much detail about the validation or proofs
of the algorithms, which have been discussed in the lecture slides and videos. We prefer to
focus on the implementation of these algorithms.

Clustering algorithms aim to group the given dataset into a few clusters. Before we talk
about the algorithm itself, let's clarify some notation: $\|x\|$ is the l2-norm of a given
vector $x$, $x^{(i)}$ is the $i$-th sample of the dataset, and $\mu_k$ is the
mean/center/centroid of cluster $k$.

Algorithm:

1. Initialize the cluster centers randomly.

2. Loop until convergence:

   1. For every sample, assign it to its nearest center: $c^{(i)} = \arg\min_k \|x^{(i)} - \mu_k\|^2$.

   2. For every center, recompute it as the mean of the samples assigned to it:
      $\mu_k = \dfrac{\sum_{i:\, c^{(i)} = k} x^{(i)}}{|\{i : c^{(i)} = k\}|}$.

The convergence condition is that the change of the cluster centers is less than some
threshold, named the tolerance.
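For intuition only, here is a minimal, deliberately naive numpy sketch of one such iteration
(the helper name is hypothetical and not part of Main.py, and it assumes every cluster
receives at least one sample):

``` python
import numpy as np

def kMeansIterationNaive(X, centers):
    """One naive KMeans iteration: assign every sample to its nearest center,
    then recompute each center as the mean of its assigned samples.
    Illustrative sketch only; intentionally loop-based and therefore slow."""
    n, k = X.shape[0], centers.shape[0]
    labels = np.empty(n, dtype=int)
    for i in range(n):  # assignment step
        labels[i] = np.argmin([np.sum((X[i] - c) ** 2) for c in centers])
    # update step (assumes no cluster ends up empty)
    newCenters = np.vstack([X[labels == j].mean(axis=0) for j in range(k)])
    return newCenters, labels
```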

If we strictly follow the procedure and formulas above to implement KMeans, we will find
that it is not only extremely slow but also unstable, with low accuracy. Therefore, some
methods should be leveraged to improve its speed and performance. The two major factors are
the initialization/seeding of the centers and the calculation of the distances. In the
following subsections we will optimize KMeans from these two perspectives.

### KMeans++

If we initialize cluster centers randomly, as shown in the following figure, it is possible
for 2 centers to be very close to each other, resulting in one cluster being split into two.
To mitigate this problem, KMeans++, a better seeding algorithm, was developed. Intuitively,
KMeans++ aims to initialize cluster centers that are as far as possible from each other.
Theoretically, its result is competitive with the optimal clustering, unlike random
seeding. For details and the proof of KMeans++, please refer to its original paper.
Algorithm:

1. Pick the first center, chosen uniformly at random from $X$.

2. Loop until we have found all centers:

   1. Determine $D(x)$, which denotes the shortest distance from an instance $x$ to the
      closest previously selected center.
   2. Take a new center, choosing $x \in X$ with probability $\dfrac{D(x)^2}{\sum_{x' \in X} D(x')^2}$.

By following this procedure, any new center will tend to avoid being too close to the
already selected centers. Besides, the result is better and more stable than with random
seeding. However, neither KMeans++ nor vanilla KMeans can guarantee an optimal result, since
their objective function is non-convex. Therefore, we need to introduce another method to
make it better.
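For illustration, a minimal sketch of such seeding under these assumptions (hypothetical
helper name, self-contained distance computation, not the loop-restricted initCenters you
must write) might look like:

``` python
import numpy as np

def kMeansPlusPlusSeed(X, nClusters, rng=np.random.default_rng()):
    """Illustrative KMeans++ seeding sketch.
    First center: uniform at random; each next center: sampled with probability
    proportional to the squared distance to the nearest already-chosen center."""
    centers = [X[rng.integers(len(X))]]
    for _ in range(nClusters - 1):
        # squared distance from every sample to its nearest selected center
        d2 = np.min(((X[:, None, :] - np.asarray(centers)[None, :, :]) ** 2).sum(-1), axis=1)
        probs = d2 / d2.sum()                           # D(x)^2 / sum D(x)^2
        centers.append(X[rng.choice(len(X), p=probs)])  # weighted random pick
    return np.asarray(centers)
```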

### Multiple Seeding

Let's define the distortion function $J(c, \mu) = \sum_{i=1}^{n} \|x^{(i)} - \mu_{c^{(i)}}\|^2$,
where $\mu_{c^{(i)}}$ is the cluster center to which $x^{(i)}$ is assigned. The iteration
process of KMeans performs coordinate descent on $J$. Because $J$ is non-convex, the result
of KMeans can be a local minimum instead of the global one. Therefore, we need to run KMeans
several times, score each result, and, finally, pick the best one with the minimum score.

So far, the improvements have been made from the algorithmic perspective. The final KMeans
procedure can be written as:

``` pseudocode
loop nInit times (index k)
    initCenters()
    candidateCenters[k] = kMeansSingleEM(X)
    scores[k] = score(X, candidateCenters[k])
centers = candidateCenters[argmin(scores)]
```
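A rough Python rendering of this pseudocode, assuming hypothetical callables kMeansSingleEM
(returning (centers, iterationCount)) and score with the behaviour described below, could be:

``` python
import numpy as np

def fitBestOfN(X, nInit, kMeansSingleEM, score):
    """Multiple seeding sketch: run KMeans nInit times and keep the
    candidate centers with the smallest distortion score."""
    bestCenters, bestScore = None, np.inf
    for _ in range(nInit):                    # the single permitted loop
        centers, _iterCount = kMeansSingleEM(X)
        candidateScore = score(X, centers)
        if candidateScore < bestScore:
            bestCenters, bestScore = centers, candidateScore
    return bestCenters, bestScore
```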

### Distance Calculation

As we know, one of the reasons why Python code runs slowly is its loops (whether while loops
or for loops). Therefore, in this assignment, to guarantee the quality and the speed of our
models, try your best to avoid using loops or list/dict comprehensions in your code;
instead, use matrix operations as much as possible. We strictly restrict the number of loops
that can be used in each function, and your marks will be reduced according to the extra
loops and list/dict comprehensions in each function.

According to the requirements of KMeans (assigning X to their clusters) and of
KMeans++/scoring (determining the distances between X and their nearest cluster centers),
our distance calculation function (pairwiseL2) takes two matrices ($X \in \mathbb{R}^{n \times m}$,
$Y \in \mathbb{R}^{k \times m}$) as input and outputs their distance matrix
($D \in \mathbb{R}^{n \times k}$, with $D_{ij} = \|X_i - Y_j\|$).

Expanding the squared distance, the entries of D can be written as

$D_{ij}^2 = \|X_i\|^2 + \|Y_j\|^2 - 2\, X_i \cdot Y_j.$

The first two terms of D's formula can be simply calculated with numpy's broadcasting
mechanism (read about broadcasting in numpy's official documentation), and the last term is
a single matrix product. Therefore, in our code, D can be calculated as

$D = \sqrt{s_X \mathbf{1}^T + \mathbf{1}\, s_Y^T - 2\, X Y^T},$

where $s_X$ and $s_Y$ are the column vectors of row-wise squared norms of X and Y, added
together via broadcasting.

By doing so, no loop is needed to complete the calculation.
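As an illustration of this broadcasting idea (a sketch under our own naming, not necessarily
the exact pairwiseL2 you must submit):

``` python
import numpy as np

def pairwiseL2Sketch(X, Y):
    """Pairwise l2 distances via broadcasting: D[i, j] = ||X[i] - Y[j]||.
    X: (n, m), Y: (k, m) -> D: (n, k). No Python loops involved."""
    sqX = np.sum(X ** 2, axis=1)[:, None]      # (n, 1) column of ||X_i||^2
    sqY = np.sum(Y ** 2, axis=1)[None, :]      # (1, k) row of ||Y_j||^2
    sqD = sqX + sqY - 2.0 * X @ Y.T            # broadcast to an (n, k) matrix
    return np.sqrt(np.maximum(sqD, 0.0))       # clip tiny negatives from round-off
```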

## GMM
GMM is a special case of the generalized Expectation-Maximization (EM) algorithm. Again, for
the validity and proof of EM, please refer to the lecture slides and videos. Like EM in
general, GMM contains an E step (expectation), guessing the samples' clusters, and an M step
(maximization), estimating the model (Gaussian) parameters. Here is some notation which will
be used later: the parameters are $\theta = (\phi, \mu, \Sigma)$, where $\phi$ is an array of
cluster priors, $\mu$ is an array of means, and $\Sigma$ is an array of covariances.

GMM Algorithm:

1. Initialize the parameters randomly.

2. Loop until convergence:

   1. E step: update the posteriors $w_{ij} = \dfrac{P(x^{(i)} \mid j)\,\phi_j}{\sum_l P(x^{(i)} \mid l)\,\phi_l}$.

   2. M step: update the parameters.

      1. $\phi_j = \dfrac{1}{n}\sum_i w_{ij}$
      2. $\mu_j = \dfrac{\sum_i w_{ij}\, x^{(i)}}{\sum_i w_{ij}}$
      3. $\Sigma_j = \dfrac{\sum_i w_{ij}\,(x^{(i)} - \mu_j)(x^{(i)} - \mu_j)^T}{\sum_i w_{ij}}$

Convergence condition: the change of the log likelihood of the parameters,
$\ell(\theta) = \sum_i \log \sum_j P(x^{(i)} \mid j)\,\phi_j$, is less than a threshold, named
the tolerance. However, if we go with this scheme, the value of $\ell$ will be influenced by
the number of training samples. Therefore, it is better to normalize it as
$\ell(\theta) = \frac{1}{n}\sum_i \log \sum_j P(x^{(i)} \mid j)\,\phi_j$.

There is not much to say about the M step, which is very similar to the GDA model we
practiced in the last assignment. The E step, however, deserves some attention in order to
speed up the calculation.

To speed up the calculation and avoid loops, multiplications, and divisions, we first move
the formulas into log space and, at the very end, change them back.

$P(x^{(i)} \mid j)$ can be determined by invoking scipy.stats.multivariate_normal, which is
faster than calculating it from the original Gaussian pdf.

We have noticed that the term $\operatorname{log\_sum\_exp}_j\big(\log(P(x^{(i)} \mid j)\,\phi_j)\big)$
occurs both in the posterior formula and in the score's formula. To reuse it, we can apply
the log_sum_exp trick to speed up the calculation:

$\log w_{ij} = \log\big(P(x^{(i)} \mid j)\,\phi_j\big) - \operatorname{log\_sum\_exp}_j\big(\log(P(x^{(i)} \mid j)\,\phi_j)\big).$

Likewise, the score formula can be written in the same way:

$\text{score} = \frac{1}{n}\sum_i \operatorname{log\_sum\_exp}_j\big(\log(P(x^{(i)} \mid j)\,\phi_j)\big).$

We can apply scipy.special.logsumexp to perform the log_sum_exp trick.

We did not show the matrix form of these calculations; try to derive it on your own (it will
not be too hard).
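As a small, self-contained illustration of the trick (a sketch only, not the graded E step;
one loop over clusters is used for clarity):

``` python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def logSpaceEStepSketch(X, priors, means, covariances):
    """Compute (score, posteriors) for a toy GMM entirely in log space."""
    # log(P(x_i | j) P(j)) for every sample i and cluster j -> shape (n, k)
    logJoint = np.column_stack([
        multivariate_normal.logpdf(X, mean=m, cov=c) + np.log(p)
        for p, m, c in zip(priors, means, covariances)
    ])
    logMarginal = logsumexp(logJoint, axis=1, keepdims=True)  # log sum_j P(x_i|j)P(j)
    score = float(np.mean(logMarginal))                       # normalized log likelihood
    posteriors = np.exp(logJoint - logMarginal)               # change back from log space
    return score, posteriors
```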

Like KMeans, we could seed GMM several times as well, but, considering its speed, we
normally run it only once.

### KMeans Initialization

This trick is used to reduce the number of iterations of the EM procedure. Instead of
initializing all parameters randomly, we can invoke KMeans to initialize them. Another
reason we can do this is that KMeans is much faster than GMM.

This method can be described as follows (see the sketch after this list):

1. clusters = KMeans.fit(X).predict(X)
2. Build one-hot posteriors from the predicted clusters.
3. Estimate the Gaussian parameters (priors, means, covariances) from these posteriors.
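A minimal sketch of the one-hot construction in step 2, assuming an already fitted kmeans
object with a predict method:

``` python
import numpy as np

def oneHotPosteriorsSketch(X, kmeans, nClusters):
    """Turn hard KMeans assignments into one-hot 'posteriors'.
    E.g., clusters [0, 1, 3] with nClusters=4 give rows of the 4x4 identity."""
    clusters = kmeans.predict(X)         # (nX,) integer cluster indices
    return np.eye(nClusters)[clusters]   # (nX, nClusters) one-hot matrix
```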

## PCA
Review the lecture slides and videos for the mathematical derivation to become familiar with
the theory. Intuitively, PCA finds directions with large "variance", which are represented
by the eigenvalues of the covariance matrix of the given X. Therefore, the algorithm of PCA
is (see the sketch after this list):

1. cenX = X - mean(X)
2. Calculate the covariance matrix of cenX
3. Determine its eigenvalues and eigenvectors
4. Select the eigenvectors corresponding to the top nComponents eigenvalues, and combine
   them into an eigenvector matrix
5. Project the (centered) data onto this eigenvector matrix, i.e., cenX dot the eigenvector
   matrix
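For illustration, a compact numpy sketch of these five steps (an assumption-laden sketch,
not necessarily the exact fit_transform required by Main.py):

``` python
import numpy as np

def pcaSketch(X, nComponents):
    """Project X onto its top-nComponents principal directions."""
    cenX = X - X.mean(axis=0)                     # 1. center the data
    cov = np.cov(cenX, rowvar=False)              # 2. covariance matrix (dim x dim)
    eigvals, eigvecs = np.linalg.eigh(cov)        # 3. eigen-decomposition (ascending eigenvalues)
    top = eigvecs[:, np.argsort(eigvals)[::-1][:nComponents]]  # 4. top eigenvectors
    return cenX @ top                             # 5. project the centered data
```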
## Instructions

### File Structure

1. Figures

This folder contains the result figures. It is empty initially. After you implement
all functions properly, two figures should appear, named "Dataset1.png" and
"Dataset2.png".

2. Main.py

This is where you will implement the assignment. Please note that any function marked
as !!! Must Not Change the Content !!! must not be changed; otherwise, it will
influence your final grade. All functions or methods marked with ToDo: Implement This
Function should be properly implemented.

3. Readme

Assignment Instructions.

4. requirements.txt

Required Python packages to finish the assignment. Please run the following command to
install them. Please note that numpy v1.19.4 currently has bugs on Windows systems;
therefore numpy v1.19.3 will be used in this assignment.

``` bash
pip install -r requirements.txt
```

5. TicToc.py

Used for timing functions. No need to change or check.

6. Utils.py

It contains functions like plotting and saving results.

### Main Function (main() in Main.py)

This function will perform:

1. Generate Dataset 1
2. Run KMeans and GMM on Dataset 1
3. Plot and save results
4. Generate Dataset 2
5. Apply PCA, KMeans, and GMM on Dataset 2
6. Plot and save results

### What We Need to Do and Implement

You must not change the shape (the format of the parameters and return values) of the
functions that you need to implement.

You must not change the imports. No additional Python packages should be used.

You will see estimated line counts, e.g., ≈1~2 lines. These are only estimates; they do not
mean that you have to finish your code within that many lines.
We strictly restrict the number of loops that can be used in each function; hence any extra
loop will result in losing marks.

Also, we require that the running time of each of KMeans and GMM be limited to 0.5 s on the
following setup (CPU: Intel i7-9750 2.60GHz; RAM: 24GB; DISK: 1.5TB). Points will be
deducted for any extra time.

1. Install Required Packages

``` bash
pip install -r requirements.txt
```

2. Open Main.py

3. Implement pairwiseL2 (5 marks)

``` python
def pairwiseL2(X, y=None):
    """
    ToDo: 5 marks; ≈2 lines

    Calculate the l2 norm matrix (D) among X and y;

    X's shape is (n*m), y's shape is (k*m),
    and D's shape is (n*k);
    Dij denotes the distance between Xi and yj;
    If y is None, then y will be X;

    No loop or list/dict comprehension is permitted in this function

    :param X: ndarray (n*m)
    :param y: ndarray (k*m)
    :return: l2 norm matrix (n*k)
    """
```

4. Implement pairwiseL2Min (5 marks)

``` python
def pairwiseL2Min(X, y):
    """
    ToDo: 5 marks; ≈1 line

    Minimum l2 norm values between X and y
    (the distance from each Xi to its nearest yj)

    No loop or list/dict comprehension is permitted in this function

    :param X: ndarray (n*m)
    :param y: ndarray (k*m)
    :return: minimum l2 norm vector (n)
    """
```

5. Implement pairwiseL2ArgMin (5 marks)


``` python
def pairwiseL2ArgMin(X, y):
    """
    ToDo: 5 marks; ≈1 line

    Index of the nearest yj for each Xi

    No loop or list/dict comprehension is permitted in this function

    :param X: ndarray (n*m)
    :param y: ndarray (k*m)
    :return: indices of the minimum l2 norms (n)
    """
```

6. Implement KMeans.initCenters (5 marks)

``` python
def initCenters(self, X):
    """
    ToDo: 5 marks; ≈9 lines

    Apply KMeans++ to initialize cluster centers.
    (http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf)

    Init the first center randomly and
    iterate the following steps until all centers are found.

    Assign probabilities to all samples,
    which are determined by the square of the distance
    from the nearest already-selected center.

    Randomly select the next center, weighted
    according to the probabilities assigned in the prev step.
    (Hint: numpy.random.choice
    https://docs.scipy.org/doc//numpy-1.10.4/reference/generated/numpy.random.choice.html)

    Only one loop or list/dict comprehension is permitted in this function.

    :param X: X
    :return: initialized centers
    """
```

7. Implement KMeans.kMeansSingleEM (5 marks)

``` python
def kMeansSingleEM(self, X):
    """
    ToDo: 5 marks; ≈9 lines

    Run KMeans one time.

    Convergence condition: the square of the change of
    the centers is less than tol

    Please note that a full KMeans algorithm
    should be implemented in this function.

    :param X: X
    :return: (cluster centers, iteration count)
    """
```

8. Implement KMeans.score (5 marks)

``` python
def score(self, X, centers=None):
    """
    ToDo: 5 marks; ≈2 lines

    Calculate the distortion score of X and the given centers.
    Distortion score = sum(square(||xi - nearest center||))

    If centers is None, centers will be self.centers

    No loop or list/dict comprehension is permitted in this function

    :param X: X
    :param centers: centers
    :return: distortion score
    """
```

9. Implement KMeans.fit (5 marks)

``` python
def fit(self, X, y=None):
    """
    ToDo: 5 marks; ≈11 lines

    Run KMeans nInit times,
    and pick the best one according to the distortion score.

    Record the best cluster centers in self.centers,
    and save the corresponding iteration count in self.iterCount

    Only one loop or list/dict comprehension is permitted in this function.

    :param X: X
    :param y: None
    :return: self
    """
```

10. Implement KMeans.predict (5 marks)


``` python
def predict(self, X):
    """
    ToDo: 5 marks; ≈1 line

    Predict the clusters of the given X

    No loop or list/dict comprehension is permitted in this function

    :param X: X
    :return: clusters (nX)
    """
```

11. Implement GMM.initPosteriors (5 marks)

``` python
def initPosteriors(self, X):
    """
    ToDo: 5 marks; ≈5 lines

    Use KMeans to initialize the posteriors.

    E.g., say the clusters of the given X predicted by KMeans are [0, 1, 3],
    then the posteriors should be:
          c0 c1 c2 c3
    x0 [[ 1, 0, 0, 0],
    x1  [ 0, 1, 0, 0],
    x2  [ 0, 0, 0, 1]]

    No loop or list/dict comprehension is permitted in this function

    :param X: X
    :return: posteriors (nX, nClusters)
    """
```

12. Implement GMM.initialization (5 marks)

``` python
def initialization(self, X):
    """
    ToDo: 5 marks; ≈2 lines

    Initialize the posteriors, and then the priors, means, and covariances.
    It's better to implement estimateGaussianParams first.

    No loop or list/dict comprehension is permitted in this function

    :param X: X
    """
```

13. Implement GMM.estimateGaussianParams (5 marks)

``` python
def estimateGaussianParams(self, X, posteriors):
    """
    ToDo: 5 marks; ≈8 lines

    M step

    Estimate the priors, means, and covariances
    according to the given X and posteriors.

    To speed up our code, try to avoid using loops;
    instead, use matrix operations as much as possible.

    Only one loop or list/dict comprehension is permitted in this function.

    i means the i_th sample, j means the j_th cluster.

    sum_i means sum over i (samples);
    sum_j means sum over j (clusters);

    priors (nClusters)
    priors_j = sum_i(posterior_ij) / nX

    means (nClusters, dim), where dim is the dimension of X
    means_j = sum_i(posterior_ij * xi) / sum_i(posterior_ij)

    covariances (nClusters, dim, dim)
    covariances_j = sum_i(posterior_ij * (xi - means_j)T * (xi - means_j)) / sum_i(posterior_ij)

    :param X: X
    :param posteriors: (nX, nClusters)
    :return: (priors, means, covariances)
    """
```

14. Implement GMM.estimateScorePosteriors (5 marks)

``` python
def estimateScorePosteriors(self, X):
    """
    ToDo: 5 marks; ≈8 lines

    E step

    Estimate the score and posteriors according to X, the priors, means, and covariances.

    To speed up the code, we will rewrite the formula in log format,
    and, finally, we will change it back.

    To speed up our code, try to avoid using loops;
    instead, use matrix operations as much as possible.

    Only one loop or list/dict comprehension is permitted in this function.

    posteriors_ij = P(xi|j)P(j) / sum_l(P(xi|l)P(l))

    logPosteriors_ij = log(P(xi|j)P(j)) - log_sum_exp(log(P(xi|j)P(j)))
    By doing so, we can reuse logJointProb (log(P(X)))

    N is the Gaussian pdf
    log(P(xi|j)P(j)) = log N(xi, means_j, covariances_j) + log priors_j

    As for the score, the log likelihood of the parameters is:

    l = sum_i(log(sum_j(P(xi|j)P(j)))) = sum_i(log_sum_exp(log(P(xi|j)P(j))))

    To avoid the number of samples influencing the convergence speed, we prefer the average.
    Therefore, the score is:

    score = mean(log_sum_exp(log(P(xi|j)P(j))))

    Please do not forget to change logPosteriors back to posteriors at the end of this function.

    Hints:
    To calculate the Gaussian pdf, use scipy.stats.multivariate_normal, which
    is way faster than calculating the pdf with the original formula.
    (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.multivariate_normal.html)
    To apply the log_sum_exp trick, use scipy.special.logsumexp.
    (https://docs.scipy.org/doc/scipy/reference/generated/scipy.special.logsumexp.html)

    :param X: X
    :return: (score, posteriors)
    """
```

15. Implement GMM.fit (5 marks)

``` python
def fit(self, X, y=None):
    """
    ToDo: 5 marks; ≈10 lines

    Train the GMM using the EM algorithm.
    If maxIter is None, it should keep iterating until it converges.

    Only one loop or list/dict comprehension is permitted in this function.

    :param X: X
    :param y: None
    :return: self
    """
```

16. Implement GMM.predict (5 marks)


``` python
def predict(self, X):
    """
    ToDo: 5 marks; ≈2 lines

    Predict X's clusters

    No loop or list/dict comprehension is permitted in this function

    :param X: X
    :return: clusters (nX)
    """
```

17. Implement GMM.score (5 marks)

``` python
def score(self, X):
    """
    ToDo: 5 marks; ≈2 lines

    Compute the parameters' score (average log likelihood) for the given X

    No loop or list/dict comprehension is permitted in this function

    :param X: X
    :return: score
    """
```

18. Implement PCA.fit_transform (5 marks)

``` python
def fit_transform(self, X):
    """
    ToDo: 5 marks; ≈7 lines

    Reduce the dimension of the given X to nComponents

    No loop or list/dict comprehension is permitted in this function

    :param X: X
    :return: transformed X (nX, nComponents)
    """
```

19. Implement invoking KMeans in clusteringResults (5 marks)

``` python
# ToDo: 5 marks; ≈3 lines
# Invoke KMeans to get clusters and scores
```

20. Implement invoking GMM in clusteringResults (5 marks)


``` python
# ToDo: 5 marks; ≈3 lines
# Invoke GMM to get clusters and scores
```

21. Make sure KMeans finishes within 0.5s (5 marks)

22. Make sure GMM finishes within 0.5s (5 marks)

### Expected Results

After you implement and run the code, you should have the following results in Figures. This
time, we did not fix the random state, since the code can be run several times to obtain the
best result; consequently, your results may differ from the expected results.

1. Dataset1.png

2. Dataset2.png

Also, the running time of each of KMeans and GMM should be less than 0.5 s. The following
results were generated on the TA's computer (CPU: Intel i7-9750 2.60GHz; RAM: 24GB; DISK:
1.5TB).

```
( 0.002000 s) -> Dataset 1
KMeans Convergent at Iter: 5.0
KMeans Distortion Score: 7.436183296151369
( 0.041037 s) -> KMeans
GMM Convergent at Iter: 7
GMM Avg Parameter Log Likelihood: 1.2270739225566258
( 0.066007 s) -> GMM
( 2.220996 s) -> Figures for Dataset 1
( 0.002002 s) -> Dataset 2
( 0.000000 s) -> PCA
KMeans Convergent at Iter: 4.0
KMeans Distortion Score: 17.381677793624668
( 0.033962 s) -> KMeans
GMM Convergent at Iter: 5
GMM Avg Parameter Log Likelihood: 2.29074557003864
( 0.058004 s) -> GMM
( 3.236029 s) -> Figures for Dataset 2
Total Time Costs: ( 5.660038 s) -> [Main]
```

### Marking Criterion

This time, you may have results that differ from the expected ones.

You can only receive marks for a function or method if you implement it completely;
otherwise, you will get 0 for that method.

Implementing the methods earns marks; submitting only the result files will not earn any
marks.

Any extra loop or list/dict comprehension will lead to a deduction of 1 mark.

For Quiz 4:

1. Implement pairwiseL2 (5 marks)


2. Implement pairwiseL2Min (5 marks)
3. Implement pairwiseL2ArgMin (5 marks)
4. Implement KMeans.initCenters (5 marks)
5. Implement KMeans.kMeansSingleEM (5 marks)
6. Implement KMeans.score (5 marks)
7. Implement KMeans.fit (5 marks)
8. Implement KMeans.predict (5 marks)
9. Implement invoking KMeans in clusteringResults (5 marks)
10. Make sure KMeans finishes within 0.5s (5 marks)

For Quiz 5:

1. Implement GMM.initPosteriors (5 marks)


2. Implement GMM.initialization (5 marks)
3. Implement GMM.estimateGaussianParams (5 marks)
4. Implement GMM.estimateScorePosteriors (5 marks)
5. Implement GMM.fit (5 marks)
6. Implement GMM.predict (5 marks)
7. Implement GMM.score (5 marks)
8. Implement PCA.fit_transform (5 marks)
9. Implement invoking GMM in clusteringResults (5 marks)
10. Make sure GMM finishes within 0.5s (5 marks)
