## Submission
You must submit your assignment online via Brightspace. This is the only method by which
we accept assignment submissions. We do not accept assignments sent via email, and we
cannot enter a mark if the assignment is not submitted on Brightspace! The deadline is
firm, since you cannot submit an assignment past the deadline. It is the student's
responsibility to ensure that the assignment has been submitted properly. A mark of 0 will
be assigned to any missing assignment.
Assignments must be done individually. Any teamwork, and any work copied from a source
external to the student, will be considered academic fraud and will be reported to the
Faculty of Engineering as a breach of integrity. The consequence of academic fraud is, at
the very least, an F for this course. Note that we use sophisticated software to
compare assignments (with other students' and with other sources...). Therefore, you must
take all appropriate measures to make sure that others cannot copy your assignment
(hence, do not leave your workstation unattended).
## Goal
In this assignment, we will implement three unsupervised models, named KMeans, GMM
(Gaussian Mixture Model), and PCA (Principal Component Analysis), and apply them
to two synthetic datasets.
## Dataset
This time, scikit-learn's make_blobs is leveraged to generate two synthetic datasets. By
specifying the number of samples, features, and centers, make_blobs can help us generate
isotropic Gaussian blobs for clustering. Please access this link for more details about
it. As the following figures show, dataset 1 contains 1000 2-D instances sampled from a
mixture of 5 Gaussians, while dataset 2 consists of 3-D samples. They are used to test
the clustering algorithms and the dimensionality reduction algorithm, respectively.
## KMeans
In the following discussion, we will not go into much detail about the validation or
proofs of the algorithms, which have been discussed in the lecture slides and video. We
prefer to focus on the implementation of these algorithms.
Algorithm:
1. Initialize the cluster centers.
2. Assign each sample to its nearest center.
3. Recompute each center as the mean of the samples assigned to it; repeat steps 2-3.
Its convergence condition is that the change of the cluster centers is less than some
threshold, named tolerance.
If we strictly follow the procedure and formulas to implement KMeans, we will find that it
is not only extremely slow but also unstable with low accuracy. Therefore, some methods
should be leveraged to improve its speed and performance. The two major factors are the
initialization/seeding of the centers and the calculation of distances. In the following
subsections we will optimize KMeans from these two perspectives.
### KMeans++
1. Determine D(x), which denotes the shortest distance from an instance x to the
closest, previously selected center.
2. Take a new center, choosing x with probability proportional to D(x)^2
(i.e., D(x)^2 / sum_x(D(x)^2)).
By following this procedure, each new center tends to avoid being too close to the
already selected centers. Besides, the result is better and more stable than with random
seeding. However, neither KMeans++ nor vanilla KMeans can guarantee their results, since
their objective function is non-convex. Therefore, we need to introduce another method to
make it better.
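The seeding procedure above can be sketched in NumPy. This is only an illustrative sketch, not the assignment's `initCenters` method; the function name `kmeanspp_seed` and its signature are our own, and it uses one loop over the centers as the procedure requires:

```python
import numpy as np

def kmeanspp_seed(X, k, rng=None):
    """Illustrative KMeans++ seeding sketch (name and signature are hypothetical)."""
    rng = np.random.default_rng(0) if rng is None else rng
    centers = [X[rng.integers(len(X))]]            # first center: uniform random
    for _ in range(k - 1):
        # D(x)^2: squared distance to the closest already-selected center
        d2 = ((X[:, None, :] - np.asarray(centers)[None, :, :]) ** 2).sum(-1).min(1)
        # weighted random draw with probability D(x)^2 / sum(D(x)^2)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.asarray(centers)
```

Note that an already-selected center has D(x) = 0 and therefore can never be drawn again.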
For now, the improvements have been done from the algorithm perspective. The ultimate KMeans
can be:
``` pseudocode
loop k from 1 to nInit:
    InitCenters
    candidateCenters[k] = KMeansSingle(X)
    scores[k] = score(X, candidateCenters[k])
centers = candidate centers with the best score
```
As we know, one of the reasons why Python code runs slowly is its loops (whether while
loops or for loops). Therefore, in this assignment, to guarantee the quality and speed of
our models, try your best to avoid using loops or list/dict comprehensions in your code;
instead, use matrix operations as much as possible. During this assignment, we strictly
restrict the number of loops that can be used in each function; your marks will be reduced
according to any extra loops and list/dict comprehensions in each function.
The first two terms of D's formula can be calculated simply with NumPy's broadcasting
mechanism; read about broadcasting in the official NumPy documentation. Therefore, in our
code, D should be computed with a single vectorized expression rather than loops.
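The original formula for D did not survive extraction; as a hedged illustration, here is one common broadcasting-based way to compute a pairwise L2 distance matrix (the helper name `pairwise_l2_sketch` is ours, not the assignment's `pairwiseL2`):

```python
import numpy as np

def pairwise_l2_sketch(X, y=None):
    """Sketch: ||xi - yj|| = sqrt(||xi||^2 + ||yj||^2 - 2 xi.yj), fully vectorized."""
    y = X if y is None else y
    # (n,1) + (1,k) - (n,k) broadcasts to the (n,k) squared-distance matrix
    sq = (X ** 2).sum(1)[:, None] + (y ** 2).sum(1)[None, :] - 2 * X @ y.T
    return np.sqrt(np.maximum(sq, 0.0))  # clip tiny negative values from rounding
```

The `np.maximum(sq, 0.0)` guard matters in practice: floating-point cancellation can make a mathematically zero distance slightly negative, and `sqrt` of that would produce NaN.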
## GMM
GMM is a special case of the generalized Expectation-Maximization (EM) algorithm. Again,
for the validity and proof of EM, please refer to the lecture slides and video. Likewise,
GMM also contains an E step (expectation), guessing the samples' clusters, and an M step
(maximization), estimating the model (Gaussian) parameters. Here are some notations which
will be used later.
theta = (priors, means, covariances), where means is an array of means, and
covariances is an array of covariances.
GMM Algorithm:
1. Initialize the parameters (priors, means, covariances).
2. E step: estimate the posteriors P(j|xi) under the current parameters.
3. M step: re-estimate the parameters from the posteriors; repeat steps 2-3 until
convergence.
Convergence condition: the change of the log likelihood of the parameters is less than a
threshold, named tolerance, where l = sum_i(log(sum_j(P(xi|j)P(j)))). However, if we go
with this scheme, the value of l will be influenced by the number of training samples.
Therefore, it is better to normalize it as score = mean_i(log(sum_j(P(xi|j)P(j)))).
There is not too much to say about the M step, which is pretty similar to the GDA model
we practiced in the last assignment. However, the E step deserves attention in order
to speed up the calculation.
To speed up the calculation and avoid loops, multiplications, and divisions, we will take
the log first, and, finally, we will change it back.
We have noticed that log_sum_exp(log(P(xi|j)P(j))) occurs many times in formula (4) and in
the score's formula. To reuse it, we can apply the log_sum_exp trick to speed up the
calculation. Likewise, the score formula can also be written in the same way.
We did not show the matrix calculation; try to derive it on your own (it will not be too
hard).
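As a toy illustration of the log-domain trick (the array values below are made up, and the small `log_sum_exp` helper just reimplements what `scipy.special.logsumexp` does):

```python
import numpy as np

def log_sum_exp(a, axis):
    # stable log(sum(exp(a))): subtract the row max before exponentiating
    m = a.max(axis=axis, keepdims=True)
    return np.log(np.exp(a - m).sum(axis=axis, keepdims=True)) + m

# toy log joint probabilities: logJoint[i, j] = log(P(xi|j) P(j))
logJoint = np.log(np.array([[0.2, 0.1], [0.05, 0.4]]))
logProbX = log_sum_exp(logJoint, axis=1)   # log(sum_j P(xi|j)P(j)), computed once
logPosteriors = logJoint - logProbX        # reuse logProbX; broadcasts (2,2) - (2,1)
posteriors = np.exp(logPosteriors)         # change back at the end
score = logProbX.mean()                    # normalized log likelihood
```

Computing `logProbX` once and reusing it for both the posteriors and the score is exactly the reuse the text describes.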
Like KMeans, we could seed GMM several times as well, but considering its speed, we
normally run it once.
This trick is used to reduce the number of EM iterations. Instead of initializing all
parameters randomly, we can invoke KMeans to initialize them. Another reason we can do
this is that KMeans is much faster than GMM.
1. labels = KMeans(X)
2. posteriors = one-hot encoding of labels
3. Estimate the Gaussian parameters from the posteriors
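The one-hot step can be done without a loop via fancy indexing. A small sketch, where the `labels` values are assumed to come from a fitted KMeans:

```python
import numpy as np

labels = np.array([0, 1, 3])              # assumed cluster labels from KMeans
nClusters = 4
posteriors = np.eye(nClusters)[labels]    # one-hot rows via fancy indexing, no loop
```

Each row of `posteriors` puts probability 1 on the cluster KMeans assigned and 0 elsewhere.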
## PCA
Review the lecture slides and video for the mathematical derivation to become familiar
with the theory. Intuitively, PCA finds directions with large "variance", which can be
represented by the eigenvalues of the covariance matrix of the given X. Therefore, the
algorithm of PCA is:
1. cenX = X - mean(X)
2. Calculate the covariance matrix of cenX
3. Determine its eigenvalues and eigenvectors
4. Select the eigenvectors corresponding to the top nComponents eigenvalues, and combine
them into an eigenvector matrix
5. Dot-multiply cenX by the eigenvector matrix
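The five steps above can be sketched with numpy's symmetric eigendecomposition. This is a minimal illustration under our own names (`pca_sketch` is not the assignment's `fit_transform`):

```python
import numpy as np

def pca_sketch(X, nComponents):
    cenX = X - X.mean(axis=0)                 # 1. center the data
    cov = np.cov(cenX, rowvar=False)          # 2. covariance matrix of cenX
    eigVals, eigVecs = np.linalg.eigh(cov)    # 3. eigenpairs (eigh: symmetric matrix)
    order = np.argsort(eigVals)[::-1][:nComponents]
    W = eigVecs[:, order]                     # 4. top-nComponents eigenvector matrix
    return cenX @ W                           # 5. project onto the components
```

`np.linalg.eigh` is preferred over `np.linalg.eig` here because the covariance matrix is symmetric, so `eigh` is faster and returns real eigenvalues.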
## Instructions
1. Figures
This folder contains the result figures. It is empty initially. After you implement
all functions properly, two figures should appear, named "Dataset1.png" and
"Dataset2.png".
2. Main.py
This is where you should implement your assignment. Please note that any function marked
as !!! Must Not Change the Content !!! must not be changed; otherwise, it will
influence your final grade. All functions or methods marked by ToDo: Implement This
Function should be well implemented.
3. Readme
Assignment instructions.
4. requirements.txt
The Python packages required to finish the assignment. Please run the following command
to install them. Please note that numpy v1.19.4 currently has bugs on Windows;
therefore numpy v1.19.3 will be used in this assignment.
``` bash
pip install -r requirements.txt
```
5. TicToc.py
6. Utils.py
1. Generate Dataset 1
2. Run KMeans and GMM on Dataset 1
3. Plot and save results
4. Generate Dataset 2
5. Apply PCA, KMeans, and GMM on Dataset 2
6. Plot and save results
Do not change the shape (format of parameters and return values) of the functions that you
need to implement.
Do not change the imports. No additional Python packages should be used.
You will see estimated line counts, e.g., ≈1~2 lines. This is only an estimate, not a
requirement that you finish coding within that many lines.
We strictly restrict the number of loops that can be used in each function; hence, extra
loops will result in lost marks.
Also, we require the running time of KMeans and GMM to be limited to 0.5 s under the
following settings (CPU: Intel i7-9750 2.60GHz; RAM: 24GB; DISK: 1.5TB). Points will be
deducted for any extra time.
``` bash
pip install -r requirements.txt
```
2. Open Main.py
``` python
def pairwiseL2(X, y=None):
    """
    ToDo: 5 marks; ≈2 lines

    Calculate l2 norm matrix (D) among X and y;

    X's shape is (n*m), y's shape is (k*m),
    and D's shape is (n*k);
    Dij denotes the distance between Xi and yj;
    If y is None, then y will be X;

    No loop or list/dict comprehension is permitted in this function

    :param X: ndarray (n*m)
    :param y: ndarray (k*m)
    :return: l2 norm matrix (n*k)
    """
```
``` python
def pairwiseL2Min(X, y):
    """
    ToDo: 5 marks; ≈1 line

    Minimum l2 distance from each row of X to its nearest row of y

    No loop or list/dict comprehension is permitted in this function

    :param X: ndarray (n*m)
    :param y: ndarray (k*m)
    :return: minimum l2 norm vector (n)
    """
```
``` python
def initCenters(self, X):
    """
    ToDo: 5 marks; ≈9 lines

    Apply KMeans++ to initialize cluster centers.
    (http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf)

    Init the first center randomly and
    iterate the following steps until all centers are found.

    Assign probabilities to all samples,
    which are determined by the square of the distance
    from the nearest already-selected center.

    Weighted-randomly select the next center
    according to the probabilities assigned in the prev step.
    (Hint: numpy.random.choice
    https://docs.scipy.org/doc//numpy-1.10.4/reference/generated/numpy.random.choice.html)

    Only one loop or list/dict comprehension is permitted in this function.

    :param X: X
    :return: initialized centers
    """
```
``` python
def kMeansSingleEM(self, X):
    """
    ToDo: 5 marks; ≈9 lines

    Run KMeans one time.

    Convergence condition: the square of the change of
    the centers is less than tol

    Please note that a full KMeans algorithm
    should be implemented in this function.

    :param X: X
    :return: (cluster centers, iteration count)
    """
```
``` python
def score(self, X, centers=None):
    """
    ToDo: 5 marks; ≈2 lines

    Calculate distortion score of X and given centers.
    Distortion score = sum(square(||xi - nearest center||))

    If centers is None, centers will be self.centers

    No loop or list/dict comprehension is permitted in this function

    :param X: X
    :param centers: centers
    :return: distortion score
    """
```
```
``` python
def fit(self, X, y=None):
    """
    ToDo: 5 marks; ≈11 lines

    Run KMeans nInit times,
    and pick the best one according to the distortion score.

    Record the best cluster centers in self.centers,
    and save the corresponding iteration count in self.iterCount

    Only one loop or list/dict comprehension is permitted in this function.

    :param X: X
    :param y: None
    :return: self
    """
```
```
``` python
def initPosteriors(self, X):
    """
    ToDo: 5 marks; ≈5 lines

    Use KMeans to initialize posteriors.

    E.g., say the clusters of given X predicted by KMeans are [0, 1, 3],
    then the posteriors should be:
         c0 c1 c2 c3
    x0 [[1, 0, 0, 0],
    x1  [0, 1, 0, 0],
    x2  [0, 0, 0, 1]]

    No loop or list/dict comprehension is permitted in this function

    :param X: X
    :return: posteriors (nX, nClusters)
    """
```
```
``` python
def initialization(self, X):
    """
    ToDo: 5 marks; ≈2 lines

    Initialize the posteriors and the priors, means, and covariances.
    It's better to implement estimateGaussianParams first.

    No loop or list/dict comprehension is permitted in this function

    :param X: X
    """
```
```
``` python
def estimateGaussianParams(self, X, posteriors):
    """
    ToDo: 5 marks; ≈8 lines

    M step

    Estimate priors, means, and covariances
    according to the given X and posteriors.

    To speed up our code, try to avoid using loops;
    instead, use matrix operations as much as possible.

    Only one loop or list/dict comprehension is permitted in this function.

    i means the i_th sample, j means the j_th cluster.

    sum_i means sum over i (samples);
    sum_j means sum over j (clusters);

    priors (nClusters)
    priors_j = sum_i(posterior_ij) / nX

    means (nClusters, dim), where dim is the dimension of X
    means_j = sum_i(posterior_ij * xi) / sum_i(posterior_ij)

    covariances (nClusters, dim, dim)
    covariances_j = sum_i(posteriors_ij * (xi - means_j)T * (xi - means_j)) / sum_i(posterior_ij)

    :param X: X
    :param posteriors: (nX, nClusters)
    :return: (priors, means, covariances)
    """
```
```
``` python
def estimateScorePosteriors(self, X):
    """
    ToDo: 5 marks; ≈8 lines

    E step

    Estimate the score and posteriors according to X, priors, means, and covariances.

    To speed up the code, we will change the formula to log format,
    and, finally, we will change it back.

    To speed up our code, try to avoid using loops;
    instead, use matrix operations as much as possible.

    Only one loop or list/dict comprehension is permitted in this function.

    posteriors_ij = P(xi|j)P(j) / sum_l(P(xi|l)P(l))

    logPosteriors_ij = log(P(xi|j)P(j)) - log_sum_exp(log(P(xi|j)P(j)))
    By doing so, we can reuse logJointProb (log(P(X)))

    N is the Gaussian pdf
    log(P(xi|j)P(j)) = log N(xi, means_j, covariances_j) + log priors_j

    As for the score, the log likelihood of the parameters is:

    l = sum_i(log(sum_j(P(xi|j)P(j)))) = sum_i(log_sum_exp(log(P(xi|j)P(j))))

    To avoid the number of samples influencing the convergence speed, we prefer the avg.
    Therefore, the score is:

    score = mean(log_sum_exp(log(P(xi|j)P(j))))

    Please do not forget to change logPosteriors to posteriors at the end of this function.

    Hints:
    To calculate the Gaussian pdf, use scipy.stats.multivariate_normal, which
    is much faster than calculating the pdf with the original formula.
    (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.multivariate_normal.html)
    To apply the log_sum_exp trick, use scipy.special.logsumexp.
    (https://docs.scipy.org/doc/scipy/reference/generated/scipy.special.logsumexp.html)

    :param X: X
    :return: (score, posteriors)
    """
```
```
``` python
def fit(self, X, y=None):
    """
    ToDo: 5 marks; ≈10 lines

    Train GMM using the EM algorithm.
    If maxIter is None, it should iterate forever until it converges

    Only one loop or list/dict comprehension is permitted in this function.

    :param X: X
    :param y: None
    :return: self
    """
```
```
``` python
def score(self, X):
    """
    ToDo: 5 marks; ≈2 lines

    Compute the parameters' score on the given X

    No loop or list/dict comprehension is permitted in this function

    :param X: X
    :return: score
    """
```
```
``` python
def fit_transform(self, X):
    """
    ToDo: 5 marks; ≈7 lines

    Reduce the dimension of the given X to nComponents

    No loop or list/dict comprehension is permitted in this function

    :param X: X
    :return: transformed X (nX, nComponents)
    """
```
```
``` python
# ToDo: 5 marks; ≈3 lines
# Invoke KMeans to get clusters and scores
```
After you implement and run the code, you should have the following results in Figures.
This time, we did not fix the random state, since we can run the code several times to
obtain the best result; therefore, your results may differ from the expected ones.
1. Dataset1.png
2. Dataset2.png
Also, the running time of KMeans and GMM should be less than 0.5 s. The following results
were generated on the TA's computer (CPU: Intel i7-9750 2.60GHz; RAM: 24GB; DISK: 1.5TB).
```
( 0.002000 s) -> Dataset 1
KMeans Convergent at Iter: 5.0
KMeans Distortion Score: 7.436183296151369
( 0.041037 s) -> KMeans
GMM Convergent at Iter: 7
GMM Avg Parameter Log Likelihood: 1.2270739225566258
( 0.066007 s) -> GMM
( 2.220996 s) -> Figures for Dataset 1
( 0.002002 s) -> Dataset 2
( 0.000000 s) -> PCA
KMeans Convergent at Iter: 4.0
KMeans Distortion Score: 17.381677793624668
( 0.033962 s) -> KMeans
GMM Convergent at Iter: 5
GMM Avg Parameter Log Likelihood: 2.29074557003864
( 0.058004 s) -> GMM
( 3.236029 s) -> Figures for Dataset 2
Total Time Costs: ( 5.660038 s) -> [Main]
```
This time, you may obtain results that differ from the expected ones.
Marks are given only for fully implemented functions or methods; otherwise, you will
get 0 for that method. Implementing the methods earns marks, while submitting only the
result files will not receive any marks.
For Quiz 4:
For Quiz 5: