
[Handwritten part for 1C: a circle is the set of points that lie a given distance from a center, so a circle in the L_n norm is the set of points x with ||x - c||_n = r. The sketches show circles of radius 2 in the L_1 norm and in the L_inf norm.]
Table of Contents
1A) Calculate the Edit, Jaccard and Hamming distances
1B) Calculate the Cosine, Euclidean, Manhattan, and L_inf distances
1C) Define a circle in the L_n norm. Sketch circles in L_1 and L_inf.
2A) Examine the distribution of distances of random data. Compare to L_2
2B) Explain why we expect random pairs to be rarely close in the unit cube
2C) Can one examine the distribution of angles in an L_3 vector space
3A) Calculate merging criterion for most likely cluster merges
3B) Perform hierarchical clustering. Explain each line.
3C) Present and explain output variables. Compare to own analysis
4A) Find the clusteroids using k-means
4B) Perform clustering for data above using calculator
4C) Perform k-means clustering for k=2,3,4,5
4D) Find the cluster diameter for each cluster and k value
4E) Compare results for k=3 to those of Hierarchical clustering

1A) Calculate the Edit, Jaccard and Hamming distances
s1='dsp765gof9hn';
s2='dsgf65gof9n';
x=[3,-4,2];
y=[-34,43,-19];

% Edit distance is 5
s_1 = {'d','s','p','7','6','5','g','o','f','9','h','n'};
s_2 = {'d','s','g','f','6','5','g','o','f','9','n'};
i = intersect(s_1,s_2);
u = union(s_1,s_2);
1 - length(i)/length(u) % Jaccard distance is 1 - 0.75 = 0.25
% Hamming distance is 3, since they differ at character 3,4 and 11
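As a cross-check of the value 5 quoted above, here is a small dynamic-programming sketch (not part of the original submission) for the insert/delete-only edit distance; the matrix D is introduced purely for illustration.

m = length(s1); n = length(s2);
D = zeros(m+1,n+1);
D(:,1) = (0:m)'; % cost of deleting the first i characters of s1
D(1,:) = 0:n;    % cost of inserting the first j characters of s2
for ii = 1:m
    for jj = 1:n
        if s1(ii) == s2(jj)
            D(ii+1,jj+1) = D(ii,jj); % matching characters cost nothing
        else
            D(ii+1,jj+1) = min(D(ii,jj+1), D(ii+1,jj)) + 1; % delete or insert
        end
    end
end
D(m+1,n+1) % should display 5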


1B) Calculate the Cosine, Euclidean, Manhattan, and L_inf distances
acos((x*y')/sqrt(sum(x.^2)*sum(y.^2))) % Cosine distance is 3.0890
sqrt(sum((x-y).^2)) % Euclidean L_2 distance is 63.3956
sum(abs(x-y)) % Manhattan L_1 distance is 105
max(abs(x-y)) % L_inf is 47

1C) Define a circle in the L_n norm. Sketch circles in L_1 and L_inf.
In the written part (see the handwritten sketches at the top of this document).
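As an illustration only (assumed code, not in the original submission), the radius-2 "circles" in the L_1 and L_inf norms could be drawn as follows.

theta = linspace(0, 2*pi, 400);
u = [cos(theta); sin(theta)];          % unit directions around the origin
r1   = 2 ./ sum(abs(u), 1);            % scale so that the L_1 norm equals 2
rinf = 2 ./ max(abs(u), [], 1);        % scale so that the L_inf norm equals 2
figure; hold on; axis equal; box on;
plot(r1 .* u(1,:),   r1 .* u(2,:));    % L_1 "circle": a diamond
plot(rinf .* u(1,:), rinf .* u(2,:));  % L_inf "circle": a square
legend('L_1 radius 2', 'L_\infty radius 2');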

2A) Examine the distribution of distances of random data. Compare to L_2
curse_of_dimensionality3
curse_of_dimensionality2

% Relative to the L_2 case, the L_3 case appears to have a lower standard
% deviation (i.e. a distribution with a smaller width) while having the
% same center. In other words, the L_3 causes a sharper drop off after the
% center, and the curse of dimensionality still holds.
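The course scripts curse_of_dimensionality2 and curse_of_dimensionality3 are not reproduced here; a rough sketch of what such a comparison presumably does (assumed code, with illustrative variable names) is:

d = 100; N = 2000; npairs = 5000;
P = 2*rand(d, N) - 1;                                % N random points in [-1,1]^d
ii = randi(N, 1, npairs); jj = randi(N, 1, npairs);  % random pairs of points
diffs = P(:,ii) - P(:,jj);
dist2 = sum(abs(diffs).^2, 1).^(1/2);                % pairwise L_2 distances
dist3 = sum(abs(diffs).^3, 1).^(1/3);                % pairwise L_3 distances
figure; hold on;
histogram(dist2); histogram(dist3);                  % L_3 histogram is narrower
legend('L_2', 'L_3');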

2B) Explain why we expect random pairs to be rarely close in the unit cube
For two points to be close, they must be close in every coordinate. Since the coordinates are independent random values between -1 and 1, the probability that all of them are simultaneously close shrinks as the number of dimensions grows. This is already unlikely in 3 dimensions, and even less likely in higher dimensions. A rough calculation is sketched below.
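A back-of-the-envelope calculation (illustrative only, not part of the submission): if a single coordinate difference is "small" with probability p, all d coordinates are simultaneously small with probability roughly p^d.

p = 0.1;      % rough probability that one coordinate of two random points
              % in [-1,1] differs by less than 0.1
d = [3 10 100];
p.^d          % roughly 1e-3, 1e-10, 1e-100: vanishingly small at high d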

2C) Can one examine the distribution of angles in an L_3 vector space
The cosine of the angle between two vectors is also the correlation of two sets of numbers. Given that two
vectors are assumed random and uncorrelated, this correlation is expected to be small at high dimensions.
Therefore random vectors tend to be orthogonal, so angles are not a useful measure of distance.
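A quick numerical illustration of this claim (assumed code, not part of the submission): the cosines of angles between random high-dimensional vectors concentrate near zero, i.e. the vectors are nearly orthogonal.

d = 1000; N = 2000;
V = 2*rand(d, N) - 1;                            % random vectors in [-1,1]^d
V = V ./ vecnorm(V);                             % normalize columns to unit length
cosines = sum(V(:,1:2:end) .* V(:,2:2:end), 1);  % cosines of N/2 random pairs
figure; histogram(cosines);                      % narrow peak around cos = 0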

3A) Calculate merging criterion for most likely cluster merges
X=[ 1, 9, 7, 6, 2, 7, 7, 1, 9;
9, 1, 7, 3, 8, 3, 2, 8, 4];

figure(1);
clf;
scatter(X(1,:),X(2,:));
box on;
hold on;

xlabel('x');
ylabel('y');
for ii=1:length(X(1,:));
text(X(1,ii)+0.07,X(2,ii),num2str(ii),'FontSize',20);
end

% Consider points 1 and 8, and points 6 and 7


deltavar1 = 1*1/(1+1)*norm(X(:,1)-X(:,8))^2 % 0.5
deltavar2 = 1*1/(1+1)*norm(X(:,6)-X(:,7))^2 % 0.5
% They are the same, so cluster 1 and 8 first
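The merging cost used throughout this part appears to be the Ward-style increase in variance when clusters i and j are merged, deltavar = n_i*n_j/(n_i+n_j) * ||c_i - c_j||^2, where n_i and n_j are the cluster sizes and c_i, c_j their centroids; the prefactors such as 1*1/(1+1) and 1*2/(1+2) in the calculations are exactly these size weights.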

X=[9, 7, 6, 2, 7, 7, 9, 1;
1, 7, 3, 8, 3, 2, 4, 8.5];

figure(2);
clf;
scatter(X(1,:),X(2,:));
box on;
hold on;
xlabel('x');
ylabel('y');
for ii=1:length(X(1,:));
text(X(1,ii)+0.07,X(2,ii),num2str(ii),'FontSize',20);
end

% Consider points 4 and 8, and points 5 and 6


deltavar1 = 1*2/(1+2)*norm(X(:,4)-X(:,8))^2 % 0.8333
deltavar2 = 1*1/(1+1)*norm(X(:,5)-X(:,6))^2 % 0.5
% Deltavar2 is smaller, so cluster 5 and 6

X=[9, 7, 6, 2, 9, 1, 7;
1, 7, 3, 8, 4, 8.5, 2.5];

figure(3);
clf;
scatter(X(1,:),X(2,:));
box on;
hold on;
xlabel('x');
ylabel('y');
for ii=1:length(X(1,:));
text(X(1,ii)+0.07,X(2,ii),num2str(ii),'FontSize',20);
end

% Consider points 4 and 6, and points 3 and 7


deltavar1 = 1*2/(1+2)*norm(X(:,4)-X(:,6))^2 % 0.8333
deltavar2 = 1*2/(1+2)*norm(X(:,3)-X(:,7))^2 % 0.8333
% They are the same, so cluster 4 and 6

X=[9, 7, 6, 9, 7, 1.333;
1, 7, 3, 4, 2.5, 8.333];

figure(4);
clf;

scatter(X(1,:),X(2,:));
box on;
hold on;
xlabel('x');
ylabel('y');
for ii=1:length(X(1,:));
text(X(1,ii)+0.07,X(2,ii),num2str(ii),'FontSize',20);
end

% Consider points 3 and 5, and points 4 and 5


deltavar1 = 1*2/(1+2)*norm(X(:,3)-X(:,5))^2 % 0.8333
deltavar2 = 1*2/(1+2)*norm(X(:,4)-X(:,5))^2 % 4.1667
% deltavar1 is smaller, so cluster 3 and 5

X=[9, 7, 9, 1.333, 6.667;
1, 7, 4, 8.333, 2.667];

figure(5);
clf;
scatter(X(1,:),X(2,:));
box on;
hold on;
xlabel('x');
ylabel('y');
for ii=1:length(X(1,:));
text(X(1,ii)+0.07,X(2,ii),num2str(ii),'FontSize',20);
end

% Consider points 1 and 3, and points 3 and 5


deltavar1 = 1*1/(1+1)*norm(X(:,1)-X(:,3))^2 % 4.5000
deltavar2 = 1*3/(1+3)*norm(X(:,3)-X(:,5))^2 % 5.4148
% deltavar1 is smaller, so cluster 1 and 3

X=[7, 1.333, 6.667, 9;
7, 8.333, 2.667, 2.5];

figure(6);
clf;
scatter(X(1,:),X(2,:));
box on;
hold on;
xlabel('x');
ylabel('y');
for ii=1:length(X(1,:));
text(X(1,ii)+0.07,X(2,ii),num2str(ii),'FontSize',20);
end

% Clearly point 3 and 4 are nearest, so cluster 3 and 4

X=[7, 1.333, 7.600;
7, 8.333, 2.600];

figure(7);
clf;

scatter(X(1,:),X(2,:));
box on;
hold on;
xlabel('x');
ylabel('y');
for ii=1:length(X(1,:));
text(X(1,ii)+0.07,X(2,ii),num2str(ii),'FontSize',20);
end

% Consider points 1 and 2, and 1 and 3


deltavar1 = 1*3/(1+3)*norm(X(:,1)-X(:,2))^2 % 25.4188
deltavar2 = 1*5/(1+5)*norm(X(:,1)-X(:,3))^2 % 16.4333
% deltavar2 is smaller, so cluster 1 and 3

X=[1.333, 7.500;
8.333, 3.333];
% these are the ending clusters
figure(8);
clf;
scatter(X(1,:),X(2,:));
box on;
hold on;
xlabel('x');
ylabel('y');
for ii=1:length(X(1,:));
text(X(1,ii)+0.07,X(2,ii),num2str(ii),'FontSize',20);
end

3B) Perform hierarchical clustering. Explain each line.
X=[ 1, 9, 7, 6, 2, 7, 7, 1, 9;
9, 1, 7, 3, 8, 3, 2, 8, 4];

Y=pdist(X'); % find the pairwise distance between points
distances=squareform(Y); % format the distances matrix with diagonal zeros
Z=linkage(Y,'ward'); % creates a Ward tree, describing cluster distances
dendrogram(Z); % generate a plot of the hierarchical binary cluster tree
k=2; % set cutoff for clustering at 2
idx = cluster(Z,'maxclust',k)'

3C) Present and explain output variables. Compare to own analysis
distances % the D matrix representing distances between points
Z % indicates how to cluster step-by-step
idx % represents which point should go into which cluster
% The dendrogram plot shows the "heights" associated with each merge.
% The height grows with the deltavar for the merge: a sharp jump in height
% suggests we should stop clustering. My results were the same, except for
% the order of equal-cost merges, which doesn't make a difference.
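For reading Z: each row lists the indices of the two clusters merged and the linkage height of that merge; after a merge, the new cluster is referred to by the next available index (10, 11, ... for these 9 points), so the rows of Z can be read off against the step-by-step merges in 3A.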

4A) Find the clusteroids using k-means


k = 2 % for k = 2
k1 = X(:,5) % let the fifth data point be the initial clusteroid point
distances(:,5) % show distances of all points from point 5
k2 = X(:,2) % point 2 is the furthest from k1
% we have 5 and 2 as the clusteroids

k = 3 % for k = 3
distances(:,5)
distances(:,2)% point 3 maximizes minimum distance
k3 = X(:,3) % set point 3 as k3
% we have 5, 2 and 3 as the clusteroids

k = 4 % for k = 4
distances(:,5)
distances(:,2)
distances(:,3) % point 4 maximizes minimum distance
k4 = X(:,4) % set point 4 as k4
% we have 5, 2, 3 and 4 as the clusteroids
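The same "pick the next point that maximizes the minimum distance to the chosen clusteroids" procedure can also be written compactly; this is a sketch only (assumed code), reusing the distances matrix from part 3.

k = 4;
seeds = 5;                                   % start from point 5, as above
for s = 2:k
    dmin = min(distances(:, seeds), [], 2);  % distance to nearest chosen seed
    [~, nextpt] = max(dmin);                 % point maximizing that distance
    seeds(end+1) = nextpt;                   % add it as the next clusteroid
end
seeds                                        % expected to display 5 2 3 4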

4B) Perform clustering for data above using calculator
k = 3 % for k = 3
distances % we read point by point which clusteroid is nearest
% 1 is closest to 5: new cluster (1,5) with centroid (1.5,8.5)
% 4 is closest to 2: new cluster (2,4) with centroid (7.5,2)
% 6 is closest to (2,4): new cluster (2,4,6) with centroid (7.333,2.333)
% 7 is closest to (2,4,6): new cluster (2,4,6,7) with centroid (7.25,2.25)
% 8 is closest to (1,5): new cluster (1,5,8) with centroid (1.333,8.333)
% 9 is closest to (2,4,6,7): new cluster (2,4,6,7,9)
% check last step: 9 to (2,4,6,7) is 2.4749, 9 to 3 is 3.6056, so correct
% clusters are (1,5,8),(2,4,6,7,9),(3)
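The by-hand assignment above can also be mimicked in a few lines; a sketch only (assumed code), assigning the remaining points in index order and updating the running centroid after each assignment.

seeds = [5 2 3];
centroids = X(:, seeds);                        % one centroid per cluster
members = {5, 2, 3};                            % points in each cluster so far
for p = setdiff(1:9, seeds)
    [~, c] = min(vecnorm(centroids - X(:,p)));  % nearest current centroid
    members{c}(end+1) = p;
    centroids(:,c) = mean(X(:, members{c}), 2); % update that centroid
end
members{:}                                      % expected: (1,5,8),(2,4,6,7,9),(3)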

4C) Perform k-means clustering for k=2,3,4,5


k=2;
opts = statset('Display','final');

[idx,C] = kmeans(X',k,'Distance','sqeuclidean','Replicates',5,'Options',opts);
disp("idx=");disp(idx)
% cluster (1,5,8),(2,3,4,6,7,9)

k=3;
opts = statset('Display','final');
[idx,C] = kmeans(X',k,'Distance','sqeuclidean','Replicates',5,'Options',opts);
disp("idx=");disp(idx)
% cluster (1,5,8),(2,4,6,7,9),(3)

k=4;
opts = statset('Display','final');
[idx,C] = kmeans(X',k,'Distance','sqeuclidean','Replicates',5,'Options',opts);
disp("idx=");disp(idx)
% cluster (1,5,8),(2,9),(3),(4,6,7)

k=5;
opts = statset('Display','final');
[idx,C] = kmeans(X',k,'Distance','sqeuclidean','Replicates',5,'Options',opts);
disp("idx=");disp(idx)
% cluster (1,5,8),(2),(3),(4,6,7),(9)

4D) Find the cluster diameter for each cluster and k value
% The cluster diameter is the largest distance between two points in the
% cluster.

% for k = 2
% in (1,5,8), diameter is 1.4142 (1 to 5)
% in (2,3,4,6,7,9), diameter is 6.3246 (2 to 3)

% for k = 3
% in (1,5,8), diameter is 1.4142 (1 to 5)
% in (3), diameter is 0
% in (2,4,6,7,9), diameter is 3.6056 (2 to 4)

% for k = 4
% in (1,5,8), diameter is 1.4142 (1 to 5)
% in (2,9), diameter is 3 (2 to 9)
% in (3), diameter is 0
% in (4,6,7), diameter is 1.4142 (4 to 7)

% for k = 5
% in (1,5,8), diameter is 1.4142 (1 to 5)
% in (2), diameter is 0
% in (3), diameter is 0
% in (4,6,7), diameter is 1.4142 (4 to 7)
% in (9), diameter is 0

% The average cluster diameter has a large decrease from k=3 to k=4, so
% either k=4 or k=5 would be a good choice.
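The diameters above were read off by hand; a short sketch (assumed code, not part of the submission) that computes them from a k-means assignment vector idx:

k = 3;
[idx, ~] = kmeans(X', k, 'Replicates', 5);
for c = 1:k
    pts = X(:, idx == c);                 % points assigned to cluster c
    if size(pts, 2) > 1
        diam = max(pdist(pts'));          % largest pairwise distance
    else
        diam = 0;                         % singleton cluster
    end
    fprintf('cluster %d: diameter %.4f\n', c, diam);
end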

4E) Compare results for k=3 to those of Hierarchical clustering
For k=3, the k-means clusters are (1,5,8), (2,4,6,7,9), (3). Using hierarchical clustering, we get the same clusters, which indicates that both methods lead to the same end clustering; however, k-means is more efficient, since we only need to calculate the distance from each remaining point to the centroids rather than to every other point.

Published with MATLAB® R2018b
