
[Handwritten part for 1C: a circle is the set of points that lie a given distance from a center, so a circle in the L_n norm is the set of points x with ||x - c||_n = r. The sketches show circles of radius 2 in the L_1 norm and in the L_inf norm.]
Table of Contents
1A) Calculate the Edit, Jaccard and Hamming distances
1B) Calculate the Cosine, Euclidean, Manhattan, and L_inf distances
1C) Define a circle in the L_n norm. Sketch circles in L_1 and L_inf.
2A) Examine the distribution of distances of random data. Compare to L_2
2B) Explain why we expect random pairs to be rarely close in the unit cube
2C) Can one examine the distribution of angles in an L_3 vector space
3A) Calculate merging criterion for most likely cluster merges
3B) Perform hierarchical clustering. Explain each line.
3C) Present and explain output variables. Compare to own analysis
4A) Find the clusteroids using k-means
4B) Perform clustering for data above using calculator
4C) Perform k-means clustering for k=2,3,4,5
4D) Find the cluster diameter for each cluster and k value
4E) Compare results for k=3 to those of Hierarchical clustering

1A) Calculate the Edit, Jaccard and Hamming distances
s1='dsp765gof9hn';
s2='dsgf65gof9n';
x=[3,-4,2];
y=[-34,43,-19];

% Edit distance is 5
s_1 = {'d','s','p','7','6','5','g','o','f','9','h','n'};
s_2 = {'d','s','g','f','6','5','g','o','f','9','n'};
i = intersect(s_1,s_2);
u = union(s_1,s_2);
1 - length(i)/length(u) % Jaccard distance is 1 - 0.75 = 0.25
% Hamming distance is 3, since they differ at character 3,4 and 11
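As a cross-check of the value 5 quoted above, here is a small dynamic-programming sketch (not part of the original submission) for the insert/delete-only edit distance; the matrix D is introduced purely for illustration.

m = length(s1); n = length(s2);
D = zeros(m+1,n+1);
D(:,1) = (0:m)'; % cost of deleting the first i characters of s1
D(1,:) = 0:n;    % cost of inserting the first j characters of s2
for ii = 1:m
    for jj = 1:n
        if s1(ii) == s2(jj)
            D(ii+1,jj+1) = D(ii,jj); % matching characters cost nothing
        else
            D(ii+1,jj+1) = min(D(ii,jj+1), D(ii+1,jj)) + 1; % delete or insert
        end
    end
end
D(m+1,n+1) % should display 5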


1B) Calculate the Cosine, Euclidean, Manhattan, and L_inf distances
acos((x*y')/sqrt(sum(x.^2)*sum(y.^2))) % Cosine distance is 3.0890
sqrt(sum((x-y).^2)) % Euclidean L_2 distance is 63.3956
sum(abs(x-y)) % Manhattan L_1 distance is 105
max(abs(x-y)) % L_inf is 47

1C) Define a circle in the L_n norm. Sketch circles in L_1 and L_inf.
In the written part (see the handwritten sketches at the top of this document).
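As an illustration only (assumed code, not in the original submission), the radius-2 "circles" in the L_1 and L_inf norms could be drawn as follows.

theta = linspace(0, 2*pi, 400);
u = [cos(theta); sin(theta)];          % unit directions around the origin
r1   = 2 ./ sum(abs(u), 1);            % scale so that the L_1 norm equals 2
rinf = 2 ./ max(abs(u), [], 1);        % scale so that the L_inf norm equals 2
figure; hold on; axis equal; box on;
plot(r1 .* u(1,:),   r1 .* u(2,:));    % L_1 "circle": a diamond
plot(rinf .* u(1,:), rinf .* u(2,:));  % L_inf "circle": a square
legend('L_1 radius 2', 'L_\infty radius 2');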

2A) Examine the distribution of distances of random data. Compare to L_2
curse_of_dimensionality3
curse_of_dimensionality2

% Relative to the L_2 case, the L_3 case appears to have a lower standard
% deviation (i.e. a distribution with a smaller width) while having the
% same center. In other words, the L_3 causes a sharper drop off after the
% center, and the curse of dimensionality still holds.
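The course scripts curse_of_dimensionality2 and curse_of_dimensionality3 are not reproduced here; a rough sketch of what such a comparison presumably does (assumed code, with illustrative variable names) is:

d = 100; N = 2000; npairs = 5000;
P = 2*rand(d, N) - 1;                                % N random points in [-1,1]^d
ii = randi(N, 1, npairs); jj = randi(N, 1, npairs);  % random pairs of points
diffs = P(:,ii) - P(:,jj);
dist2 = sum(abs(diffs).^2, 1).^(1/2);                % pairwise L_2 distances
dist3 = sum(abs(diffs).^3, 1).^(1/3);                % pairwise L_3 distances
figure; hold on;
histogram(dist2); histogram(dist3);                  % L_3 histogram is narrower
legend('L_2', 'L_3');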

2B) Explain why we expect random pairs to be rarely close in the unit cube
For two points to be close, they must be close in every coordinate. Since the coordinates are independent random values between -1 and 1, the probability that all of them are simultaneously close shrinks as the number of dimensions grows. This is already unlikely in 3 dimensions, and even less likely in higher dimensions. A rough calculation is sketched below.
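A back-of-the-envelope calculation (illustrative only, not part of the submission): if a single coordinate difference is "small" with probability p, all d coordinates are simultaneously small with probability roughly p^d.

p = 0.1;      % rough probability that one coordinate of two random points
              % in [-1,1] differs by less than 0.1
d = [3 10 100];
p.^d          % roughly 1e-3, 1e-10, 1e-100: vanishingly small at high d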

2C) Can one examine the distribution of angles in an L_3 vector space
The cosine of the angle between two vectors is also the correlation of two sets of numbers. Given that two
vectors are assumed random and uncorrelated, this correlation is expected to be small at high dimensions.
Therefore random vectors tend to be orthogonal, so angles are not a useful measure of distance.
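A quick numerical illustration of this claim (assumed code, not part of the submission): the cosines of angles between random high-dimensional vectors concentrate near zero, i.e. the vectors are nearly orthogonal.

d = 1000; N = 2000;
V = 2*rand(d, N) - 1;                            % random vectors in [-1,1]^d
V = V ./ vecnorm(V);                             % normalize columns to unit length
cosines = sum(V(:,1:2:end) .* V(:,2:2:end), 1);  % cosines of N/2 random pairs
figure; histogram(cosines);                      % narrow peak around cos = 0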

3A) Calculate merging criterion for most likely cluster merges
X=[ 1, 9, 7, 6, 2, 7, 7, 1, 9;
9, 1, 7, 3, 8, 3, 2, 8, 4];

figure(1);
clf;
scatter(X(1,:),X(2,:));
box on;
hold on;

xlabel('x');
ylabel('y');
for ii=1:length(X(1,:));
text(X(1,ii)+0.07,X(2,ii),num2str(ii),'FontSize',20);
end

% Consider points 1 and 8, and points 6 and 7


deltavar1 = 1*1/(1+1)*norm(X(:,1)-X(:,8))^2 % 0.5
deltavar2 = 1*1/(1+1)*norm(X(:,6)-X(:,7))^2 % 0.5
% They are the same, so cluster 1 and 8 first
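The merging cost used throughout this part appears to be the Ward-style increase in variance when clusters i and j are merged, deltavar = n_i*n_j/(n_i+n_j) * ||c_i - c_j||^2, where n_i and n_j are the cluster sizes and c_i, c_j their centroids; the prefactors such as 1*1/(1+1) and 1*2/(1+2) in the calculations are exactly these size weights.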

X=[9, 7, 6, 2, 7, 7, 9, 1;
1, 7, 3, 8, 3, 2, 4, 8.5];

figure(2);
clf;
scatter(X(1,:),X(2,:));
box on;
hold on;
xlabel('x');
ylabel('y');
for ii=1:length(X(1,:));
text(X(1,ii)+0.07,X(2,ii),num2str(ii),'FontSize',20);
end

% Consider points 4 and 8, and points 5 and 6


deltavar1 = 1*2/(1+2)*norm(X(:,4)-X(:,8))^2 % 0.8333
deltavar2 = 1*1/(1+1)*norm(X(:,5)-X(:,6))^2 % 0.5
% Deltavar2 is smaller, so cluster 5 and 6

X=[9, 7, 6, 2, 9, 1, 7;
1, 7, 3, 8, 4, 8.5, 2.5];

figure(3);
clf;
scatter(X(1,:),X(2,:));
box on;
hold on;
xlabel('x');
ylabel('y');
for ii=1:length(X(1,:));
text(X(1,ii)+0.07,X(2,ii),num2str(ii),'FontSize',20);
end

% Consider points 4 and 6, and points 3 and 7


deltavar1 = 1*2/(1+2)*norm(X(:,4)-X(:,6))^2 % 0.8333
deltavar2 = 1*2/(1+2)*norm(X(:,3)-X(:,7))^2 % 0.8333
% They are the same, so cluster 4 and 6

X=[9, 7, 6, 9, 7, 1.333;
1, 7, 3, 4, 2.5, 8.333];

figure(4);
clf;

scatter(X(1,:),X(2,:));
box on;
hold on;
xlabel('x');
ylabel('y');
for ii=1:length(X(1,:));
text(X(1,ii)+0.07,X(2,ii),num2str(ii),'FontSize',20);
end

% Consider points 3 and 5, and points 4 and 5


deltavar1 = 1*2/(1+2)*norm(X(:,3)-X(:,5))^2 % 0.8333
deltavar2 = 1*2/(1+2)*norm(X(:,4)-X(:,5))^2 % 4.1667
% deltavar1 is smaller, so cluster 3 and 5

X=[9, 7, 9, 1.333, 6.667;
1, 7, 4, 8.333, 2.667];

figure(5);
clf;
scatter(X(1,:),X(2,:));
box on;
hold on;
xlabel('x');
ylabel('y');
for ii=1:length(X(1,:));
text(X(1,ii)+0.07,X(2,ii),num2str(ii),'FontSize',20);
end

% Consider points 1 and 3, and points 3 and 5


deltavar1 = 1*1/(1+1)*norm(X(:,1)-X(:,3))^2 % 4.5000
deltavar2 = 1*3/(1+3)*norm(X(:,3)-X(:,5))^2 % 5.4148
% deltavar1 is smaller, so cluster 1 and 3

X=[7, 1.333, 6.667, 9;
7, 8.333, 2.667, 2.5];

figure(6);
clf;
scatter(X(1,:),X(2,:));
box on;
hold on;
xlabel('x');
ylabel('y');
for ii=1:length(X(1,:));
text(X(1,ii)+0.07,X(2,ii),num2str(ii),'FontSize',20);
end

% Clearly point 3 and 4 are nearest, so cluster 3 and 4

X=[7, 1.333, 7.600;
7, 8.333, 2.600];

figure(7);
clf;

scatter(X(1,:),X(2,:));
box on;
hold on;
xlabel('x');
ylabel('y');
for ii=1:length(X(1,:));
text(X(1,ii)+0.07,X(2,ii),num2str(ii),'FontSize',20);
end

% Consider points 1 and 2, and 1 and 3


deltavar1 = 1*3/(1+3)*norm(X(:,1)-X(:,2))^2 % 25.4188
deltavar2 = 1*5/(1+5)*norm(X(:,1)-X(:,3))^2 % 16.4333
% deltavar2 is smaller, so cluster 1 and 3

X=[1.333, 7.500;
8.333, 3.333];
% these are the ending clusters
figure(8);
clf;
scatter(X(1,:),X(2,:));
box on;
hold on;
xlabel('x');
ylabel('y');
for ii=1:length(X(1,:));
text(X(1,ii)+0.07,X(2,ii),num2str(ii),'FontSize',20);
end

3B) Perform hierarchical clustering. Explain each line.
X=[ 1, 9, 7, 6, 2, 7, 7, 1, 9;
9, 1, 7, 3, 8, 3, 2, 8, 4];

Y=pdist(X'); % find the pairwise distance between points
distances=squareform(Y); % format the distances matrix with diagonal zeros
Z=linkage(Y,'ward'); % creates a Ward tree, describing cluster distances
dendrogram(Z); % generate a plot of the hierarchical binary cluster tree
k=2; % set cutoff for clustering at 2
idx = cluster(Z,'maxclust',k)'

3C) Present and explain output variables. Compare to own analysis
distances % the D matrix representing distances between points
Z % indicates how to cluster step-by-step
idx % represents which point should go into which cluster
% The dendrogram plot shows the "heights" associated with each merge.
% The height grows with the deltavar for the merge: a sharp jump in height
% suggests we should stop clustering. My results were the same, except for
% the order of equal-cost merges, which doesn't make a difference.
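For reading Z: each row lists the indices of the two clusters merged and the linkage height of that merge; after a merge, the new cluster is referred to by the next available index (10, 11, ... for these 9 points), so the rows of Z can be read off against the step-by-step merges in 3A.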

4A) Find the clusteroids using k-means


k = 2 % for k = 2
k1 = X(:,5) % let the fifth data point be the initial clusteroid point
distances(:,5) % show distances of all points from point 5
k2 = X(:,2) % point 2 is the furthest from k1
% we have 5 and 2 as the clusteroids

k = 3 % for k = 3
distances(:,5)
distances(:,2)% point 3 maximizes minimum distance
k3 = X(:,3) % set point 3 as k3
% we have 5, 2 and 3 as the clusteroids

k = 4 % for k = 4
distances(:,5)
distances(:,2)
distances(:,3) % point 4 maximizes minimum distance
k4 = X(:,4) % set point 4 as k4
% we have 5, 2, 3 and 4 as the clusteroids
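The same "pick the next point that maximizes the minimum distance to the chosen clusteroids" procedure can also be written compactly; this is a sketch only (assumed code), reusing the distances matrix from part 3.

k = 4;
seeds = 5;                                   % start from point 5, as above
for s = 2:k
    dmin = min(distances(:, seeds), [], 2);  % distance to nearest chosen seed
    [~, nextpt] = max(dmin);                 % point maximizing that distance
    seeds(end+1) = nextpt;                   % add it as the next clusteroid
end
seeds                                        % expected to display 5 2 3 4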

4B) Perform clustering for data above using calculator
k = 3 % for k = 3
distances % we read point by point which clusteroid is nearest
% 1 is closest to 5: new cluster (1,5) with centroid (1.5,8.5)
% 4 is closest to 2: new cluster (2,4) with centroid (7.5,2)
% 6 is closest to (2,4): new cluster (2,4,6) with centroid (7.333,2.333)
% 7 is closest to (2,4,6): new cluster (2,4,6,7) with centroid (7.25,2.25)
% 8 is closest to (1,5): new cluster (1,5,8) with centroid (1.333,8.333)
% 9 is closest to (2,4,6,7): new cluster (2,4,6,7,9)
% check last step: 9 to (2,4,6,7) is 2.4749, 9 to 3 is 3.6056, so correct
% clusters are (1,5,8),(2,4,6,7,9),(3)
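The by-hand assignment above can also be mimicked in a few lines; a sketch only (assumed code), assigning the remaining points in index order and updating the running centroid after each assignment.

seeds = [5 2 3];
centroids = X(:, seeds);                        % one centroid per cluster
members = {5, 2, 3};                            % points in each cluster so far
for p = setdiff(1:9, seeds)
    [~, c] = min(vecnorm(centroids - X(:,p)));  % nearest current centroid
    members{c}(end+1) = p;
    centroids(:,c) = mean(X(:, members{c}), 2); % update that centroid
end
members{:}                                      % expected: (1,5,8),(2,4,6,7,9),(3)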

4C) Perform k-means clustering for k=2,3,4,5


k=2;
opts = statset('Display','final');

[idx,C] = kmeans(X',k,'Distance','sqeuclidean','Replicates',5,'Options',opts);
disp("idx=");disp(idx)
% cluster (1,5,8),(2,3,4,6,7,9)

k=3;
opts = statset('Display','final');
[idx,C] = kmeans(X',k,'Distance','sqeuclidean','Replicates',5,'Options',opts);
disp("idx=");disp(idx)
% cluster (1,5,8),(2,4,6,7,9),(3)

k=4;
opts = statset('Display','final');
[idx,C] = kmeans(X',k,'Distance','sqeuclidean','Replicates',5,'Options',opts);
disp("idx=");disp(idx)
% cluster (1,5,8),(2,9),(3),(4,6,7)

k=5;
opts = statset('Display','final');
[idx,C] = kmeans(X',k,'Distance','sqeuclidean','Replicates',5,'Options',opts);
disp("idx=");disp(idx)
% cluster (1,5,8),(2),(3),(4,6,7),(9)

4D) Find the cluster diameter for each cluster and k value
% The cluster diameter is the largest distance between two points in the
% cluster.

% for k = 2
% in (1,5,8), diameter is 1.4142 (1 to 5)
% in (2,3,4,6,7,9), diameter is 6.3246 (2 to 3)

% for k = 3
% in (1,5,8), diameter is 1.4142 (1 to 5)
% in (3), diameter is 0
% in (2,4,6,7,9), diameter is 3.6056 (2 to 4)

% for k = 4
% in (1,5,8), diameter is 1.4142 (1 to 5)
% in (2,9), diameter is 3 (2 to 9)
% in (3), diameter is 0
% in (4,6,7), diameter is 1.4142 (4 to 7)

% for k = 5
% in (1,5,8), diameter is 1.4142 (1 to 5)
% in (2), diameter is 0
% in (3), diameter is 0
% in (4,6,7), diameter is 1.4142 (4 to 7)
% in (9), diameter is 0

% The average cluster diameter has a large decrease from k=3 to k=4, so
% either k=4 or k=5 would be a good choice.
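The diameters above were read off by hand; a short sketch (assumed code, not part of the submission) that computes them from a k-means assignment vector idx:

k = 3;
[idx, ~] = kmeans(X', k, 'Replicates', 5);
for c = 1:k
    pts = X(:, idx == c);                 % points assigned to cluster c
    if size(pts, 2) > 1
        diam = max(pdist(pts'));          % largest pairwise distance
    else
        diam = 0;                         % singleton cluster
    end
    fprintf('cluster %d: diameter %.4f\n', c, diam);
end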

4E) Compare results for k=3 to those of Hierarchical clustering
For k=3, the k-means clusters are (1,5,8), (2,4,6,7,9), (3). Using hierarchical clustering, we get the same clusters, which indicates that both methods lead to the same end clustering; however, k-means is more efficient, since we only need to calculate the distance from each remaining point to the centroids rather than to every other point.

Published with MATLAB® R2018b
