Q1. Briefly discuss the major difference between Classification and Clustering. List one real
application for each of them respectively. (4 marks)
Clustering groups items by similarity without any prior training (it is an unsupervised
technique). Classification has predefined classes and assigns each item to a class based on
certain properties and past experience in the form of labelled training data (i.e. it is a
supervised technique).
Clustering: social media analysis, recommender systems
Classification: fishing statistics (classifying the fish caught), email spam filtering
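A minimal sketch of the supervised/unsupervised contrast, assuming scikit-learn is available
(the toy data, labels, and model choices here are invented purely for illustration):

```python
# Toy sketch: classification needs labels at training time, clustering does not.
# Assumes scikit-learn is installed; the data below is invented for illustration.
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

X = [[1, 1], [1, 2], [8, 8], [9, 8]]      # feature vectors
y = ["small", "small", "large", "large"]  # labels: known only in classification

# Classification (supervised): learn from labelled examples, then predict a class.
clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict([[9, 9]]))              # -> ['large']

# Clustering (unsupervised): group by similarity; no labels are involved.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)                         # e.g. [0 0 1 1] (cluster ids, not classes)
```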
Q2. Consider the images and their associated tags shown in Table 1. Apply Apriori algorithm
to discover strong association rules among image tags. Assume that min_support=40% and
min_confidence=70%.
Image ID   Associated Tags
1          {Beach, Sunshine, Holiday}
2          {Sand, Beach}
3          {Sunshine, Beach, Ocean}
4          {Ocean, People, Beach, Sunshine}
5          {Holiday, Sunshine}
Table 1
1) Generate candidate itemsets (Ck) and qualified frequent itemsets (Lk) step by step until
the largest frequent itemset is generated. Use table C1 as a template. Make sure you
clearly identify all the frequent itemsets. (6 marks)
Itemset Support_count
Beach 4
Holiday 2
Ocean 2
People 1
Sand 1
Sunshine 4
Table C1 (candidate 1-itemsets)
min_support = 40% of 5 transactions → min_support_count = 2
Itemset Support_count
Beach 4
Holiday 2
Ocean 2
Sunshine 4
L1 (frequent 1-itemsets)
Itemset Support_count
Beach, Holiday 1
Beach, Ocean 2
Beach, Sunshine 3
Holiday, Ocean 0
Holiday, Sunshine 2
Ocean, Sunshine 2
C2 (candidate 2-itemsets)
Itemset Support_count
Beach, Ocean 2
Beach, Sunshine 3
Holiday, Sunshine 2
Ocean, Sunshine 2
L2 (frequent 2-itemsets)
Itemset Support_count
Beach, Ocean, Sunshine 2
C3 (candidate 3-itemsets)
Itemset Support_count
Beach, Ocean, Sunshine 2
L3 (frequent 3-itemsets)
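This candidate-generation and pruning loop can be reproduced with a short script. A simplified
sketch (plain Python, transactions hardcoded from Table 1; the classic subset-based candidate
pruning is skipped, since support counting filters the same candidates here):

```python
# Apriori sketch for Table 1: generate Ck, prune to Lk, repeat until empty.
from itertools import combinations

transactions = [
    {"Beach", "Sunshine", "Holiday"},
    {"Sand", "Beach"},
    {"Sunshine", "Beach", "Ocean"},
    {"Ocean", "People", "Beach", "Sunshine"},
    {"Holiday", "Sunshine"},
]
min_support_count = 2  # 40% of 5 transactions

def support(itemset):
    return sum(1 for t in transactions if itemset <= t)

# L1: frequent 1-itemsets
items = sorted({i for t in transactions for i in t})
Lk = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support_count]
k, frequent = 1, {1: Lk}
while Lk:
    k += 1
    # Join step: union pairs of (k-1)-itemsets into k-itemset candidates,
    # then prune candidates whose support is below the threshold.
    Ck = {a | b for a, b in combinations(Lk, 2) if len(a | b) == k}
    Lk = [c for c in Ck if support(c) >= min_support_count]
    if Lk:
        frequent[k] = Lk

for size, sets in frequent.items():
    print(size, [(sorted(s), support(s)) for s in sets])
```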
2) Generate association rules from the frequent itemsets. Calculate the confidence of each
rule and identify all the strong association rules. (6 marks)
min_confidence = 70%
confidence(X→Y) = P(Y|X) = support_count(X ∪ Y) / support_count(X)
From 1), we have five frequent itemsets of size ≥ 2: {Beach, Ocean}, {Beach, Sunshine},
{Holiday, Sunshine}, {Ocean, Sunshine} and {Beach, Ocean, Sunshine}. Therefore, the candidate
rules are:
Beach → Ocean = 2/4 = 50%
Ocean → Beach = 2/2 = 100% (Strong)
Beach → Sunshine = 3/4 = 75% (Strong)
Sunshine → Beach = 3/4 = 75% (Strong)
Holiday → Sunshine = 2/2 = 100% (Strong)
Sunshine → Holiday = 2/4 = 50%
Ocean → Sunshine = 2/2 = 100% (Strong)
Sunshine → Ocean = 2/4 = 50%
Beach → {Ocean, Sunshine} = 2/4 = 50%
{Ocean, Sunshine} → Beach = 2/2 = 100% (Strong)
Ocean → {Beach, Sunshine} = 2/2 = 100% (Strong)
{Beach, Sunshine} → Ocean = 2/3 ≈ 67%
Sunshine → {Beach, Ocean} = 2/4 = 50%
{Beach, Ocean} → Sunshine = 2/2 = 100% (Strong)
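This enumeration can be checked mechanically. A sketch (plain Python; the support counts are
copied from part 1)):

```python
# Rule-generation sketch: for each frequent itemset S of size >= 2, emit every
# rule X -> (S - X) and flag those with confidence >= 70% as strong.
from itertools import combinations

support = {
    frozenset({"Beach"}): 4, frozenset({"Holiday"}): 2,
    frozenset({"Ocean"}): 2, frozenset({"Sunshine"}): 4,
    frozenset({"Beach", "Ocean"}): 2, frozenset({"Beach", "Sunshine"}): 3,
    frozenset({"Holiday", "Sunshine"}): 2, frozenset({"Ocean", "Sunshine"}): 2,
    frozenset({"Beach", "Ocean", "Sunshine"}): 2,
}
min_confidence = 0.70

for S, s_count in support.items():
    if len(S) < 2:
        continue
    for r in range(1, len(S)):
        for lhs in map(frozenset, combinations(S, r)):
            conf = s_count / support[lhs]
            tag = "(Strong)" if conf >= min_confidence else ""
            print(f"{sorted(lhs)} -> {sorted(S - lhs)}: {conf:.0%} {tag}")
```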
Q3. Consider the transactions shown in Table 1. Generate the FP-Tree (Frequent Pattern
Tree) step by step. Assume that min_support=40%. (5 marks)
Not covered in 2012.
Q4. Consider the training data set shown in Table 2. ID3 Algorithm can be performed to
derive a decision tree to predict whether the weather is suitable for playing.
Given Shannon's formulas, H(X) = -Σ P(x) log2 P(x) and H(Y|X) = Σ P(x) H(Y|X=x) (sums over
the values x of X), and some log values,
1) Assume that "Outlook" is selected as the first testing attribute at the top level of the
decision tree. Calculate H(Play|Humidity) in the subtable where Outlook is sunny. (5 marks)
H(Play|Humidity) =
Not covered in 2012.
2) Given a weather sample "Outlook=sunny, Temperature=mild, Humidity=normal and
Windy=true", use Naive Bayes Classification to predict whether it is suitable for
playing. (6 marks)
Naive Bayes picks the class Ci that maximizes P(Ci|X) ∝ P(Ci) × P(X1|Ci) × ... × P(Xm|Ci),
where Ci is the i-th class, m is the number of attributes and Xj is the j-th attribute.
P(Play=yes) = 9/14 = 0.643
P(Outlook=sunny | Play=yes) = 2/9 = 0.222
P(Temperature=mild | Play=yes) = 4/9 = 0.444
P(Humidity=normal | Play=yes) = 6/9 = 0.667
P(Windy=true | Play=yes) = 6/9 = 0.667
P(X | Play=yes) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
P(Play=no) = 5/14 = 0.357
P(Outlook=sunny | Play=no) = 3/5 = 0.6
P(Temperature=mild | Play=no) = 2/5 = 0.4
P(Humidity=normal | Play=no) = 1/5 = 0.2
P(Windy=true | Play=no) = 2/5 = 0.4
P(X | Play=no) = 0.6 × 0.4 × 0.2 × 0.4 = 0.019
P(Play=yes | X) ∝ P(X | Play=yes) × P(Play=yes) = 0.044 × 0.643 ≈ 0.028
P(Play=no | X) ∝ P(X | Play=no) × P(Play=no) = 0.019 × 0.357 ≈ 0.007
Since 0.028 > 0.007, X belongs to class (Play=yes).
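A quick numeric check of this working (plain Python; the probabilities are copied from the
calculation above, since Table 2 itself is not reproduced in this document):

```python
# Naive Bayes check for X = (Outlook=sunny, Temperature=mild,
# Humidity=normal, Windy=true). Probabilities copied from the working above.
from math import prod

priors = {"yes": 9 / 14, "no": 5 / 14}
likelihoods = {
    "yes": [2 / 9, 4 / 9, 6 / 9, 6 / 9],  # P(attribute value | Play=yes)
    "no":  [3 / 5, 2 / 5, 1 / 5, 2 / 5],  # P(attribute value | Play=no)
}

# Unnormalized posterior: P(C|X) is proportional to P(C) * prod_j P(Xj|C).
scores = {c: priors[c] * prod(likelihoods[c]) for c in priors}
print(scores)                       # ~{'yes': 0.0282, 'no': 0.0069}
print(max(scores, key=scores.get))  # -> 'yes'
```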
Q5. Clustering
1) K-means (7 marks)
Given five 3-dimensional data points shown below,
P1: (3, 1, 2),
P2: (0, 2, 1),
P3: (3, 0, 5),
P4: (1, 1, 1),
P5: (4, 2, 2),
apply the K-means clustering method to group them into 2 clusters, using the L1 (Manhattan)
distance measure. Suppose that the initial centroids are C1: (1, 0, 0) and C2: (3, 0, 0). Use
the following table as a template to show each step of clustering clearly. Explain why the
final clustering has been achieved (i.e., discuss the stop condition of K-means).
Cluster#   Old Centroids   Cluster Elements   New Centroids
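The working for this part is not shown in the document; below is a hedged sketch that carries
out the iterations (plain Python; the data points and initial centroids come from the question,
and new centroids are taken as coordinate-wise means, as the answer template implies):

```python
# K-means sketch with the L1 (Manhattan) distance; data and initial centroids
# from the question. New centroids are coordinate-wise means of each cluster.
points = {"P1": (3, 1, 2), "P2": (0, 2, 1), "P3": (3, 0, 5),
          "P4": (1, 1, 1), "P5": (4, 2, 2)}
centroids = [(1, 0, 0), (3, 0, 0)]

def l1(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

assignment = None
while True:
    # Assignment step: attach each point to the nearest centroid under L1.
    new = {n: min(range(2), key=lambda c: l1(p, centroids[c]))
           for n, p in points.items()}
    if new == assignment:      # stop condition: no point changed cluster
        break
    assignment = new
    # Update step: recompute each centroid as the mean of its members.
    for c in range(2):
        members = [points[n] for n, a in assignment.items() if a == c]
        centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    print(assignment, centroids)
```

In the template's terms: the first pass gives Cluster 1 = {P2, P4} with new centroid
(0.5, 1.5, 1) and Cluster 2 = {P1, P3, P5} with new centroid (3.33, 1, 3). The second pass
reproduces exactly the same assignment, so the algorithm stops: K-means terminates when no
point changes cluster between consecutive iterations (equivalently, the centroids no longer
move).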
2) Hierarchical Clustering (7 marks)
Given five data objects (p1, ... , p5), their proximity matrix (i.e., distance matrix) is shown in
Table 3. Apply agglomerative hierarchical clustering to build the hierarchical clustering tree
of the data objects. Merge the clusters by using Max distance and update the proximity
matrix correspondingly. Make sure you show each step of clustering clearly.
     p1   p2   p3   p4   p5
p1   0    1    5    9    10
p2   1    0    3.5  8    7
p3   5    3.5  0    3    4
p4   9    8    3    0    0.5
p5   10   7    4    0.5  0
Table 3
Max distance = complete linkage: the distance between two clusters is the largest pairwise
distance between their members.
Step 1: merge p4 and p5 (distance 0.5, the smallest entry). Updated matrix:
          p1   p2   p3   {p4,p5}
p1        0    1    5    10
p2        1    0    3.5  8
p3        5    3.5  0    4
{p4,p5}   10   8    4    0
Step 2: merge p1 and p2 (distance 1). Updated matrix:
          {p1,p2}  p3   {p4,p5}
{p1,p2}   0        5    10
p3        5        0    4
{p4,p5}   10       4    0
Step 3: merge p3 with {p4,p5} (distance 4). Updated matrix:
             {p1,p2}  {p3,p4,p5}
{p1,p2}      0        10
{p3,p4,p5}   10       0
Step 4: merge {p1,p2} with {p3,p4,p5} (distance 10) → {p1, p2, p3, p4, p5}, the root of the
hierarchical clustering tree.
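A sketch of this complete-linkage procedure (plain Python; the proximity matrix is hardcoded
from Table 3):

```python
# Agglomerative clustering sketch with complete (Max) linkage.
import itertools

dist = {
    ("p1", "p2"): 1,   ("p1", "p3"): 5,   ("p1", "p4"): 9,  ("p1", "p5"): 10,
    ("p2", "p3"): 3.5, ("p2", "p4"): 8,   ("p2", "p5"): 7,
    ("p3", "p4"): 3,   ("p3", "p5"): 4,
    ("p4", "p5"): 0.5,
}

def d(a, b):
    # Complete linkage: max pairwise distance between members of two clusters.
    return max(dist[tuple(sorted((x, y)))] for x in a for y in b)

clusters = [frozenset([p]) for p in ["p1", "p2", "p3", "p4", "p5"]]
while len(clusters) > 1:
    # Merge the closest pair of clusters, then repeat until one cluster is left.
    a, b = min(itertools.combinations(clusters, 2), key=lambda ab: d(*ab))
    print(f"merge {sorted(a)} + {sorted(b)} at distance {d(a, b)}")
    clusters = [c for c in clusters if c not in (a, b)] + [a | b]
```

Running it reproduces the four merge steps above: (p4,p5) at 0.5, (p1,p2) at 1, p3 with
{p4,p5} at 4, and the final merge at 10.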
Q6. Given a query of "transfer learning for video tagging" and a collection of the following
three documents:
Document 1: <A survey on transfer learning>
Document 2: <Transfer learning for image tagging>
Document 3: <Transfer learning: from image tagging to video tagging>
Use the Vector Space Model, TF/IDF weighting scheme, and Cosine vector similarity measure
to find the most relevant document(s) to the query. Assume that "a", "on", "for", "from" and
"to" are stopwords.
The formula of TF/IDF weighting is: wij = tij × log(N / nj)
where:
tij: the number of times term j appears in document i,
N: the total number of documents,
nj: the number of documents in which term j appears.
1) Calculate DF (document frequency) and IDF (inverse document frequency) for each word.
(4marks)
Word list DF IDF
survey 1 0.477
transfer 3 0
learn 3 0
image 2 0.176
tag 2 0.176
video 1 0.477
(logs are base 10: log 3 = 0.477, log(3/2) = 0.176, log 1 = 0)
2) Represent each document as a weighted vector by using TF/IDF weight scheme. Length
normalization is not required. (3 marks)
Term-frequency vector for each document, using the term order (survey, transfer, learn,
image, tag, video):
Document 1: (1, 1, 1, 0, 0, 0)
Document 2: (0, 1, 1, 1, 1, 0)
Document 3: (0, 1, 1, 1, 2, 1)
Convert them to weighted vectors by multiplying element-wise with the IDF vector from 1),
(0.477, 0, 0, 0.176, 0.176, 0.477):
Document 1: (0.477, 0, 0, 0, 0, 0)
Document 2: (0, 0, 0, 0.176, 0.176, 0)
Document 3: (0, 0, 0, 0.176, 0.352, 0.477)
P.S. The order of elements in your vectors may differ; it depends on how you ordered the
terms in 1).
3) Represent the query as a weighted vector and find its most relevant document(s) using
Cosine Similarity measure. (3 marks)
The formula of the Cosine vector similarity measure is: sim(Q, D) = (Q · D) / (|Q| × |D|)
Term-frequency vector for the query (same term order as before):
Query: (0, 1, 1, 0, 1, 1)
Convert it to a weighted vector by multiplying element-wise with the IDF vector from 1):
Query: (0, 0, 0, 0, 0.176, 0.477)
From 2):
Document 1: (0.477, 0, 0, 0, 0, 0)
Document 2: (0, 0, 0, 0.176, 0.176, 0)
Document 3: (0, 0, 0, 0.176, 0.352, 0.477)
sim(Q, D1) = 0 (Q and D1 share no non-zero terms)
sim(Q, D2) = (0.176 × 0.176) / (√(0.176² + 0.477²) × √(0.176² + 0.176²)) ≈ 0.245
sim(Q, D3) = (0.176 × 0.352 + 0.477 × 0.477) / (√(0.176² + 0.477²) × √(0.176² + 0.352² + 0.477²))
           ≈ 0.921
Document 3 has the highest cosine similarity, so return Document 3.
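The whole calculation can be checked with a short script. A sketch assuming the same stopword
removal and the light stemming used above (tagging → tag, learning → learn), with the token
lists hardcoded:

```python
# TF-IDF + cosine similarity sketch for Q6.
import math

docs = {
    "D1": ["survey", "transfer", "learn"],
    "D2": ["transfer", "learn", "image", "tag"],
    "D3": ["transfer", "learn", "image", "tag", "video", "tag"],
}
query = ["transfer", "learn", "video", "tag"]
terms = ["survey", "transfer", "learn", "image", "tag", "video"]
N = len(docs)

# IDF per term: log10(N / document frequency).
idf = {t: math.log10(N / sum(t in d for d in docs.values())) for t in terms}

def weight(tokens):
    # TF-IDF vector: raw term count times IDF (no length normalization).
    return [tokens.count(t) * idf[t] for t in terms]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

q = weight(query)
for name, tokens in docs.items():
    print(name, round(cosine(q, weight(tokens)), 3))
# -> D1 0.0, D2 ~0.245, D3 ~0.921
```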
Q7. Briefly describe the three key components of Web Mining. Give one related application
for each component respectively. (4 marks)
Web Content Mining: the discovery, extraction and integration of useful data, information and
knowledge from Web page content. It extends the functionality of basic search engines.
Related Application:
● Crawlers/Indexing
● Profiles/Personalisation
Web Structure Mining: the use of graph theory to analyze the node and connection structure of
a web site.
Related Application:
● Web page ranking, e.g. PageRank (Google)
● Communities discovery
Web Usage Mining: the process of extracting useful information from server logs, i.e. users'
browsing history.
Related Application:
● Improve design of Web pages
● Aid in caching and prediction of future page references
● Improve the effectiveness of e-commerce (marketing, advertising, and sales)
Example of agglomerative clustering with group average (average linkage)
Starting from the Table 3 proximity matrix:
     p1   p2   p3   p4   p5
p1   0    1    5    9    10
p2   1    0    3.5  8    7
p3   5    3.5  0    3    4
p4   9    8    3    0    0.5
p5   10   7    4    0.5  0
after merging p4 and p5 (distance 0.5), the group-average distances become:
          p1    p2    p3    {p4,p5}
p1        0     1     5     9.5
p2        1     0     3.5   7.5
p3        5     3.5   0     3.5
{p4,p5}   9.5   7.5   3.5   0
Next, p1 and p2 merge (distance 1) and then p3 joins {p4,p5} (distance 3.5). The remaining
cluster-to-cluster distance can be obtained with the weighted update:
d({p1,p2}, {p3,p4,p5}) = (1 × d({p1,p2}, p3) + 2 × d({p1,p2}, {p4,p5})) / 3
                       = (4.25 + 2 × 8.5) / 3 = 7.0833...
where d({p1,p2}, p3) = (5 + 3.5) / 2 = 4.25 and d({p1,p2}, {p4,p5}) = (9 + 10 + 8 + 7) / 4 = 8.5.
To check, compute it directly as the average of all cross-cluster pairwise distances from the
first table:
d({p1,p2}, {p3,p4,p5}) = (d(p1,p3) + d(p1,p4) + d(p1,p5) + d(p2,p3) + d(p2,p4) + d(p2,p5)) / (2 × 3)
                       = (5 + 9 + 10 + 3.5 + 8 + 7) / 6 = 7.0833...
The two calculations agree.
To clarify the weighted update: divide by the number of points in the newly created cluster,
and weight each term by the number of points in the cluster it refers to. Here {p4,p5}
contributes with weight 2, and merging p3 into {p4,p5} creates a 3-point cluster, so the sum
is divided by 3.
This update rule is unlikely to be examined. If it does appear, it is easier to work from the
first table at all times, i.e. compute the group average directly over all cross-cluster
pairwise distances, as in the check above.
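A two-line numeric check of the group-average distance (plain Python; pairwise distances from
Table 3):

```python
# Group-average (average linkage) check: the distance between {p1,p2} and
# {p3,p4,p5} is the mean of all cross-cluster pairwise distances from Table 3.
pairs = {("p1", "p3"): 5, ("p1", "p4"): 9, ("p1", "p5"): 10,
         ("p2", "p3"): 3.5, ("p2", "p4"): 8, ("p2", "p5"): 7}
print(sum(pairs.values()) / len(pairs))  # -> 7.0833...

# Equivalent weighted update from the intermediate matrix:
print((1 * 4.25 + 2 * 8.5) / 3)          # -> 7.0833...
```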