[Handwritten part, largely illegible in extraction: notes on a circle of radius 2 in the L_1 norm and in the L_inf norm]
Table of Contents
1A) Calculate the Edit, Jaccard and Hamming distances
1B) Calculate the Cosine, Euclidean, Manhattan, and L_inf distances
1C) Define a circle in the L_n norm. Sketch circles in L_1 and L_inf.
2A) Examine the distribution of distances of random data. Compare to L_2
2B) Explain why we expect random pairs to be rarely close in the unit cube
2C) Can one examine the distribution of angles in an L_3 vector space
3A) Calculate merging criterion for most likely cluster merges
3B) Perform hierarchical clustering. Explain each line.
3C) Present and explain output variables. Compare to own analysis
4A) Find the clusteroids using k-means
4B) Perform clustering for data above using calculator
4C) Perform k-means clustering for k=2,3,4,5
4D) Find the cluster diameter for each cluster and k value
4E) Compare results for k=3 to those of Hierarchical clustering
% Edit distance is 5
s_1 = {'d','s','p','7','6','5','g','o','f','9','h','n'};
s_2 = {'d','s','g','f','6','5','g','o','f','9','n'};
i = intersect(s_1,s_2);
u = union(s_1,s_2);
1 - length(i)/length(u) % Jaccard distance is 1 - 0.75 = 0.25
% Hamming distance is 3, since they differ at character 3,4 and 11
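The edit distance of 5 quoted above matches the insert/delete-only definition, where the distance equals the two lengths added together minus twice the length of the longest common subsequence (LCS). The following is a minimal sketch of that check, assuming this definition is the one intended; the variable names are illustrative and not taken from the original script.

```matlab
% Sketch: insert/delete-only edit distance via the LCS (assumed definition).
a = 'dsp765gof9hn';
b = 'dsgf65gof9n';
m = length(a);
n = length(b);
L = zeros(m+1, n+1);   % L(i+1,j+1) = LCS length of a(1:i) and b(1:j)
for i = 1:m
    for j = 1:n
        if a(i) == b(j)
            L(i+1,j+1) = L(i,j) + 1;                   % extend a common subsequence
        else
            L(i+1,j+1) = max(L(i,j+1), L(i+1,j));      % drop one character
        end
    end
end
lcsLen = L(m+1, n+1);
editDist = m + n - 2*lcsLen;   % deletions from a plus insertions from b
fprintf('LCS length = %d, edit distance = %d\n', lcsLen, editDist);
```

Under a definition that also allows substitutions (Levenshtein distance), the value would generally be smaller, so the result depends on which definition the exercise uses.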
1C) Define a circle in the L_n norm. Sketch circles in L_1 and L_inf.
See the written part.
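As a complement to the written sketch, a circle of radius r in the L_n norm can be taken as the set of points whose L_n distance from the center equals r. The snippet below is a rough illustration that generates points on such circles for L_1 and L_inf around the origin; the radius value and all variable names are assumptions chosen for illustration.

```matlab
% Sketch: points on L_1 and L_inf circles centered at the origin.
r = 2;                                  % assumed radius, for illustration only
theta = linspace(0, 2*pi, 361);
U = [cos(theta); sin(theta)];           % directions around the origin
n1   = sum(abs(U), 1);                  % L_1 norm of each direction
nInf = max(abs(U), [], 1);              % L_inf norm of each direction
P1   = r * U ./ repmat(n1, 2, 1);       % rescaled so that the L_1 norm equals r
PInf = r * U ./ repmat(nInf, 2, 1);     % rescaled so that the L_inf norm equals r
```

Calling plot(P1(1,:), P1(2,:)) should draw a diamond with vertices on the axes, and plot(PInf(1,:), PInf(2,:)) an axis-aligned square.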
% Relative to the L_2 case, the L_3 case appears to have a lower standard
% deviation (i.e. a distribution with a smaller width) while having the
% same center. In other words, the L_3 norm causes a sharper drop-off
% after the center, and the curse of dimensionality still holds.
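One way to reproduce this kind of comparison is to draw random pairs in the unit cube and compute their L_2 and L_3 distances side by side. The sketch below assumes a dimension of 100 and 500 random pairs, neither of which comes from the original script, and it only prints the summary statistics so they can be compared.

```matlab
% Sketch: distance concentration for random pairs in the unit cube.
d = 100;                 % assumed dimension
nPairs = 500;            % assumed number of random pairs
A = rand(nPairs, d);
B = rand(nPairs, d);
D = abs(A - B);
dist2 = sum(D.^2, 2).^(1/2);            % L_2 distance of each pair
dist3 = sum(D.^3, 2).^(1/3);            % L_3 distance of each pair
relSpread2 = std(dist2) / mean(dist2);  % width relative to the center
relSpread3 = std(dist3) / mean(dist3);
fprintf('L_2: mean %.3f, std %.3f\n', mean(dist2), std(dist2));
fprintf('L_3: mean %.3f, std %.3f\n', mean(dist3), std(dist3));
```

A small relative spread in both cases means that almost all pairs sit at nearly the same distance, which is why very close random pairs are rare in high dimensions.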
figure(7);
clf;
scatter(X(1,:),X(2,:));
box on;
hold on;
xlabel('x');
ylabel('y');
for ii=1:length(X(1,:));
text(X(1,ii)+0.07,X(2,ii),num2str(ii),'FontSize',20);
end
X=[9, 7, 6, 2, 7, 7, 9, 1;
1, 7, 3, 8, 3, 2, 4, 8.5];
figure(2);
clf;
scatter(X(1,:),X(2,:));
box on;
hold on;
xlabel('x');
ylabel('y');
for ii=1:length(X(1,:));
text(X(1,ii)+0.07,X(2,ii),num2str(ii),'FontSize',20);
end
X=[9, 7, 6, 2, 9, 1, 7;
1, 7, 3, 8, 4, 8.5, 2.5];
figure(3);
clf;
scatter(X(1,:),X(2,:));
box on;
hold on;
xlabel('x');
ylabel('y');
for ii=1:length(X(1,:));
text(X(1,ii)+0.07,X(2,ii),num2str(ii),'FontSize',20);
end
X=[9, 7, 6, 9, 7, 1.333;
1, 7, 3, 4, 2.5, 8.333];
figure(4);
clf;
scatter(X(1,:),X(2,:));
box on;
hold on;
xlabel('x');
ylabel('y');
for ii=1:length(X(1,:));
text(X(1,ii)+0.07,X(2,ii),num2str(ii),'FontSize',20);
end
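The centroid (1.333, 8.333) in the matrix above is consistent with merging the points (2, 8) and (1, 8.5) from the previous step using a weighted mean. The sketch below assumes that (1, 8.5) already represents two original points and therefore carries weight 2; that weight is inferred from the numbers, not stated in the script.

```matlab
% Sketch: merging two clusters into their weighted centroid.
cA = [2; 8];     wA = 1;   % a single point
cB = [1; 8.5];   wB = 2;   % assumed to be the centroid of two earlier points
cMerged = (wA*cA + wB*cB) / (wA + wB);   % weighted mean of the two centroids
wMerged = wA + wB;                       % number of points in the merged cluster
fprintf('merged centroid = (%.3f, %.3f), weight %d\n', cMerged(1), cMerged(2), wMerged);
```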
figure(5);
clf;
scatter(X(1,:),X(2,:));
box on;
hold on;
xlabel('x');
ylabel('y');
for ii=1:length(X(1,:));
text(X(1,ii)+0.07,X(2,ii),num2str(ii),'FontSize',20);
end
figure(6);
clf;
scatter(X(1,:),X(2,:));
box on;
hold on;
xlabel('x');
ylabel('y');
for ii=1:length(X(1,:));
text(X(1,ii)+0.07,X(2,ii),num2str(ii),'FontSize',20);
end
figure(7);
clf;
scatter(X(1,:),X(2,:));
box on;
hold on;
xlabel('x');
ylabel('y');
for ii=1:length(X(1,:));
text(X(1,ii)+0.07,X(2,ii),num2str(ii),'FontSize',20);
end
X=[1.333, 7.500;
8.333, 3.333];
% these are the ending clusters
figure(8);
clf;
scatter(X(1,:),X(2,:));
box on;
hold on;
xlabel('x');
ylabel('y');
for ii=1:length(X(1,:));
text(X(1,ii)+0.07,X(2,ii),num2str(ii),'FontSize',20);
end
idx % cluster index assigned to each point
% The dendrogram plot shows the "heights" associated with each merge.
% The height represents the deltavar for the merge: a sharp jump in
% height suggests we should stop clustering at that point. My results
% were the same, except for the order of equal-cost merges, which
% doesn't make a difference.
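To illustrate how a variance-based merge cost can be tabulated by hand, the sketch below uses Ward's criterion, in which merging clusters of sizes w_i and w_j with centroids c_i and c_j costs w_i*w_j/(w_i+w_j) times the squared distance between the centroids. Whether this matches the deltavar used in the assignment is an assumption, as is the weight of 2 given to the last centroid.

```matlab
% Sketch: Ward-style merge cost for every pair of current clusters.
X = [9, 7, 6, 2, 7, 7, 9, 1;
     1, 7, 3, 8, 3, 2, 4, 8.5];
w = [1 1 1 1 1 1 1 2];        % assumed cluster sizes (last centroid holds 2 points)
n = size(X, 2);
cost = inf(n);                % only the upper triangle is filled in
for i = 1:n-1
    for j = i+1:n
        d2 = sum((X(:,i) - X(:,j)).^2);             % squared centroid distance
        cost(i,j) = w(i)*w(j) / (w(i) + w(j)) * d2; % increase in within-cluster SS
    end
end
[minCost, idxMin] = min(cost(:));
[iBest, jBest] = ind2sub(size(cost), idxMin);
fprintf('cheapest merge: clusters %d and %d, cost %.3f\n', iBest, jBest, minCost);
```

Several pairs can share the minimum cost, in which case min returns only the first one it finds; this is the kind of tie where the merge order may differ without changing the final clustering.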
k = 3 % for k = 3
distances(:,5)
distances(:,2)% point 3 maximizes minimum distance
k3 = X(:,3) % set point 3 as k3
% we have 5, 2 and 3 as the clusteroids
k = 4 % for k = 4
distances(:,5)
distances(:,2)
distances(:,3) % point 4 maximizes minimum distance
k4 = X(:,4) % set point 4 as k4
% we have 5, 2, 3 and 4 as the clusteroids
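The selection rule used above, picking the point that maximizes the minimum distance to the clusteroids already chosen, can be written as a short loop. The sketch below runs it on the 8-point matrix from the hierarchical section with a hypothetical starting point (column 1), so the indices it selects are not the ones reported above.

```matlab
% Sketch: farthest-first selection of k initial clusteroids.
X = [9, 7, 6, 2, 7, 7, 9, 1;
     1, 7, 3, 8, 3, 2, 4, 8.5];
k = 3;
n = size(X, 2);
chosen = 1;                   % hypothetical first clusteroid
minDist = inf(1, n);          % distance of each point to its nearest chosen point
for c = 2:k
    last = X(:, chosen(end));
    d = sqrt(sum((X - repmat(last, 1, n)).^2, 1));
    minDist = min(minDist, d);          % update nearest-clusteroid distances
    [~, next] = max(minDist);           % point farthest from all chosen points
    chosen(end+1) = next;
end
disp(chosen)
```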
k=2;
opts = statset('Display','final');
[idx,C] = kmeans(X',k,'Distance','sqeuclidean','Replicates',5,'Options',opts);
disp("idx=");disp(idx)
% cluster (1,5,8),(2,3,4,6,7,9)
k=3;
opts = statset('Display','final');
[idx,C] = kmeans(X',k,'Distance','sqeuclidean','Replicates',5,'Options',opts);
disp("idx=");disp(idx)
% cluster (1,5,8),(2,4,6,7,9),(3)
k=4;
opts = statset('Display','final');
[idx,C] = kmeans(X',k,'Distance','sqeuclidean','Replicates',5,'Options',opts);
disp("idx=");disp(idx)
% cluster (1,5,8),(2,9),(3),(4,6,7)
k=5;
opts = statset('Display','final');
[idx,C] = kmeans(X',k,'Distance','sqeuclidean','Replicates',5,'Options',opts);
disp("idx=");disp(idx)
% cluster (1,5,8),(2),(3),(4,6,7),(9)
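For readers who want to see what a single k-means run does internally, the following is a minimal sketch of the basic assign-and-update loop (Lloyd's algorithm) without any toolbox call. It uses the 8-point matrix from the hierarchical section and hypothetical initial centroids (columns 1 and 8), so its labels are not comparable to the clusterings reported above; empty clusters are not handled.

```matlab
% Sketch: plain Lloyd iterations for k-means (not the toolbox implementation).
X = [9, 7, 6, 2, 7, 7, 9, 1;
     1, 7, 3, 8, 3, 2, 4, 8.5];
C = X(:, [1 8]);              % hypothetical initial centroids
k = size(C, 2);
n = size(X, 2);
labels = zeros(1, n);
for iter = 1:100
    D = zeros(k, n);
    for c = 1:k
        D(c,:) = sum((X - repmat(C(:,c), 1, n)).^2, 1);  % squared distances
    end
    [~, newLabels] = min(D, [], 1);      % assign each point to nearest centroid
    if isequal(newLabels, labels)
        break;                           % assignments stable: converged
    end
    labels = newLabels;
    for c = 1:k
        C(:,c) = mean(X(:, labels == c), 2);   % move centroid to cluster mean
    end
end
disp(labels)
```

The toolbox kmeans call above additionally restarts from several initializations ('Replicates') and keeps the best result, which a single run like this one does not do.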
% for k = 3
% in (1,5,8), diameter is 1.4143 (1 to 5)
% in (3), diameter is 0
% in (2,4,6,7,9), diameter is 3.6056 (2 to 4)
% for k = 4
% in (1,5,8), diameter is 1.4143 (1 to 5)
% in (2,9), diameter is 3 (2 to 9)
% in (3), diameter is 0
% in (4,6,7), diameter is 1.4143 (3 to 5)
% for k = 5
% in (1,5,8), diameter is 1.4143 (1 to 5)
% in (2), diameter is 0
% in (3), diameter is 0
% in (4,6,7), diameter is 1.4143 (3 to 5)
% in (9), diameter is 0
% k=4 or k=5 would be a good choice
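The diameters listed above are the largest pairwise distances within each cluster, and that computation can be sketched as a double loop. The example uses the 8-point matrix from the hierarchical section with a hypothetical cluster made of columns 3, 5 and 6, so the point indices do not correspond to the numbering used in the k-means results.

```matlab
% Sketch: cluster diameter as the maximum pairwise distance between members.
X = [9, 7, 6, 2, 7, 7, 9, 1;
     1, 7, 3, 8, 3, 2, 4, 8.5];
members = [3 5 6];            % hypothetical cluster: (6,3), (7,3), (7,2)
P = X(:, members);
diam = 0;                     % a single-point cluster keeps diameter 0
for a = 1:numel(members)
    for b = a+1:numel(members)
        diam = max(diam, norm(P(:,a) - P(:,b)));
    end
end
fprintf('diameter = %.4f\n', diam);
```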