Q1. Briefly discuss the major difference between Classification and Clustering. List one real
application for each of them respectively. (4 marks)
Clustering groups items by similarity without any prior training (it is an unsupervised
technique). Classification has predefined classes and assigns each item to a class based on
certain properties and past experience in the form of labelled training data (i.e. it is a
supervised technique).
Clustering: social media analysis, recommender systems
Classification: fishing statistics (classifying the fish caught), email spam filtering
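A minimal sketch of the supervised/unsupervised contrast, assuming scikit-learn is available
(the toy data, labels, and model choices here are invented purely for illustration):

```python
# Toy sketch: classification needs labels at training time, clustering does not.
# Assumes scikit-learn is installed; the data below is invented for illustration.
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

X = [[1, 1], [1, 2], [8, 8], [9, 8]]      # feature vectors
y = ["small", "small", "large", "large"]  # labels: known only in classification

# Classification (supervised): learn from labelled examples, then predict a class.
clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict([[9, 9]]))              # -> ['large']

# Clustering (unsupervised): group by similarity; no labels are involved.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)                         # e.g. [0 0 1 1] (cluster ids, not classes)
```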
Q2. Consider the images and their associated tags shown in Table 1. Apply Apriori algorithm
to discover strong association rules among image tags. Assume that min_support=40% and
min_confidence=70%.
Image ID   Associated Tags
1          {Beach, Sunshine, Holiday}
2          {Sand, Beach}
3          {Sunshine, Beach, Ocean}
4          {Ocean, People, Beach, Sunshine}
5          {Holiday, Sunshine}
Table 1
1) Generate candidate itemsets (Ck) and qualified frequent itemsets (Lk) step by step until
the largest frequent itemset is generated. Use table C1 as a template. Make sure you
clearly identify all the frequent itemsets. (6 marks)
Itemset Support_count
Beach 4
Holiday 2
Ocean 2
People 1
Sand 1
Sunshine 4
Table C1 (candidate 1-itemsets)
min_support = 40% of 5 transactions → min_support_count = 2
Itemset Support_count
Beach 4
Holiday 2
Ocean 2
Sunshine 4
L1 (frequent 1-itemsets)
Itemset Support_count
Beach, Holiday 1
Beach, Ocean 2
Beach, Sunshine 3
Holiday, Ocean 0
Holiday, Sunshine 2
Ocean, Sunshine 2
C2 (candidate 2-itemsets)
Itemset Support_count
Beach, Ocean 2
Beach, Sunshine 3
Holiday, Sunshine 2
Ocean, Sunshine 2
L2 (frequent 2-itemsets)
Itemset Support_count
Beach, Ocean, Sunshine 2
C3 (candidate 3-itemsets)
Itemset Support_count
Beach, Ocean, Sunshine 2
L3 (frequent 3-itemsets)
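This candidate-generation and pruning loop can be reproduced with a short script. A simplified
sketch (plain Python, transactions hardcoded from Table 1; the classic subset-based candidate
pruning is skipped, since support counting filters the same candidates here):

```python
# Apriori sketch for Table 1: generate Ck, prune to Lk, repeat until empty.
from itertools import combinations

transactions = [
    {"Beach", "Sunshine", "Holiday"},
    {"Sand", "Beach"},
    {"Sunshine", "Beach", "Ocean"},
    {"Ocean", "People", "Beach", "Sunshine"},
    {"Holiday", "Sunshine"},
]
min_support_count = 2  # 40% of 5 transactions

def support(itemset):
    return sum(1 for t in transactions if itemset <= t)

# L1: frequent 1-itemsets
items = sorted({i for t in transactions for i in t})
Lk = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support_count]
k, frequent = 1, {1: Lk}
while Lk:
    k += 1
    # Join step: union pairs of (k-1)-itemsets into k-itemset candidates,
    # then prune candidates whose support is below the threshold.
    Ck = {a | b for a, b in combinations(Lk, 2) if len(a | b) == k}
    Lk = [c for c in Ck if support(c) >= min_support_count]
    if Lk:
        frequent[k] = Lk

for size, sets in frequent.items():
    print(size, [(sorted(s), support(s)) for s in sets])
```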
2) Generate association rules from the frequent itemsets. Calculate the confidence of each
rule and identify all the strong association rules. (6 marks)
min_confidence = 70%
confidence(X→Y) = P(Y|X) = support_count(X ∪ Y) / support_count(X)
From 1), we have five frequent itemsets of size ≥ 2: {Beach, Ocean}, {Beach, Sunshine},
{Holiday, Sunshine}, {Ocean, Sunshine} and {Beach, Ocean, Sunshine}. Therefore, the candidate
rules are:
Beach → Ocean = 2/4 = 50%
Ocean → Beach = 2/2 = 100% (Strong)
Beach → Sunshine = 3/4 = 75% (Strong)
Sunshine → Beach = 3/4 = 75% (Strong)
Holiday → Sunshine = 2/2 = 100% (Strong)
Sunshine → Holiday = 2/4 = 50%
Ocean → Sunshine = 2/2 = 100% (Strong)
Sunshine → Ocean = 2/4 = 50%
Beach → {Ocean, Sunshine} = 2/4 = 50%
{Ocean, Sunshine} → Beach = 2/2 = 100% (Strong)
Ocean → {Beach, Sunshine} = 2/2 = 100% (Strong)
{Beach, Sunshine} → Ocean = 2/3 ≈ 67%
Sunshine → {Beach, Ocean} = 2/4 = 50%
{Beach, Ocean} → Sunshine = 2/2 = 100% (Strong)
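This enumeration can be checked mechanically. A sketch (plain Python; the support counts are
copied from part 1)):

```python
# Rule-generation sketch: for each frequent itemset S of size >= 2, emit every
# rule X -> (S - X) and flag those with confidence >= 70% as strong.
from itertools import combinations

support = {
    frozenset({"Beach"}): 4, frozenset({"Holiday"}): 2,
    frozenset({"Ocean"}): 2, frozenset({"Sunshine"}): 4,
    frozenset({"Beach", "Ocean"}): 2, frozenset({"Beach", "Sunshine"}): 3,
    frozenset({"Holiday", "Sunshine"}): 2, frozenset({"Ocean", "Sunshine"}): 2,
    frozenset({"Beach", "Ocean", "Sunshine"}): 2,
}
min_confidence = 0.70

for S, s_count in support.items():
    if len(S) < 2:
        continue
    for r in range(1, len(S)):
        for lhs in map(frozenset, combinations(S, r)):
            conf = s_count / support[lhs]
            tag = "(Strong)" if conf >= min_confidence else ""
            print(f"{sorted(lhs)} -> {sorted(S - lhs)}: {conf:.0%} {tag}")
```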
Q3. Consider the transactions shown in Table 1. Generate the FP-Tree (Frequent Pattern
Tree) step by step. Assume that min_support=40%. (5 marks)
Not covered in 2012.
Q4. Consider the training data set shown in Table 2. ID3 Algorithm can be performed to
derive a decision tree to predict whether the weather is suitable for playing.
Given Shannon's formulas, H(X) = -Σ P(x) log2 P(x) and H(Y|X) = Σ P(x) H(Y|X=x) (sums over
the values x of X), and some log values,
1) Assume that "Outlook" is selected as the first testing attribute at the top level of the
decision tree. Calculate H(Play|Humidity) in the subtable where Outlook is sunny. (5 marks)
H(Play|Humidity) =
Not covered in 2012.
2) Given a weather sample "Outlook=sunny, Temperature=mild, Humidity=normal and
Windy=true", use Naive Bayes Classification to predict whether it is suitable for
playing. (6 marks)
Naive Bayes picks the class Ci that maximizes P(Ci|X) ∝ P(Ci) × P(X1|Ci) × ... × P(Xm|Ci),
where Ci is the i-th class, m is the number of attributes and Xj is the j-th attribute.
P(Play=yes) = 9/14 = 0.643
P(Outlook=sunny | Play=yes) = 2/9 = 0.222
P(Temperature=mild | Play=yes) = 4/9 = 0.444
P(Humidity=normal | Play=yes) = 6/9 = 0.667
P(Windy=true | Play=yes) = 6/9 = 0.667
P(X | Play=yes) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
P(Play=no) = 5/14 = 0.357
P(Outlook=sunny | Play=no) = 3/5 = 0.6
P(Temperature=mild | Play=no) = 2/5 = 0.4
P(Humidity=normal | Play=no) = 1/5 = 0.2
P(Windy=true | Play=no) = 2/5 = 0.4
P(X | Play=no) = 0.6 × 0.4 × 0.2 × 0.4 = 0.019
P(Play=yes | X) ∝ P(X | Play=yes) × P(Play=yes) = 0.044 × 0.643 ≈ 0.028
P(Play=no | X) ∝ P(X | Play=no) × P(Play=no) = 0.019 × 0.357 ≈ 0.007
Since 0.028 > 0.007, X belongs to class (Play=yes).
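A quick numeric check of this working (plain Python; the probabilities are copied from the
calculation above, since Table 2 itself is not reproduced in this document):

```python
# Naive Bayes check for X = (Outlook=sunny, Temperature=mild,
# Humidity=normal, Windy=true). Probabilities copied from the working above.
from math import prod

priors = {"yes": 9 / 14, "no": 5 / 14}
likelihoods = {
    "yes": [2 / 9, 4 / 9, 6 / 9, 6 / 9],  # P(attribute value | Play=yes)
    "no":  [3 / 5, 2 / 5, 1 / 5, 2 / 5],  # P(attribute value | Play=no)
}

# Unnormalized posterior: P(C|X) is proportional to P(C) * prod_j P(Xj|C).
scores = {c: priors[c] * prod(likelihoods[c]) for c in priors}
print(scores)                       # ~{'yes': 0.0282, 'no': 0.0069}
print(max(scores, key=scores.get))  # -> 'yes'
```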
Q5. Clustering
1) K-means (7 marks)
Given five 3-dimensional data points shown below,
P1: (3, 1, 2),
P2: (0, 2, 1),
P3: (3, 0, 5),
P4: (1, 1, 1),
P5: (4, 2, 2),
apply the K-means clustering method to group them into 2 clusters, using the L1 (Manhattan)
distance measure. Suppose that the initial centroids are C1: (1, 0, 0) and C2: (3, 0, 0). Use
the following table as a template to show each step of clustering clearly. Explain why the
final clustering has been achieved (i.e., discuss the stop condition of K-means).
Cluster#   Old Centroids   Cluster Elements   New Centroids
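The working for this part is not shown in the document; below is a hedged sketch that carries
out the iterations (plain Python; the data points and initial centroids come from the question,
and new centroids are taken as coordinate-wise means, as the answer template implies):

```python
# K-means sketch with the L1 (Manhattan) distance; data and initial centroids
# from the question. New centroids are coordinate-wise means of each cluster.
points = {"P1": (3, 1, 2), "P2": (0, 2, 1), "P3": (3, 0, 5),
          "P4": (1, 1, 1), "P5": (4, 2, 2)}
centroids = [(1, 0, 0), (3, 0, 0)]

def l1(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

assignment = None
while True:
    # Assignment step: attach each point to the nearest centroid under L1.
    new = {n: min(range(2), key=lambda c: l1(p, centroids[c]))
           for n, p in points.items()}
    if new == assignment:      # stop condition: no point changed cluster
        break
    assignment = new
    # Update step: recompute each centroid as the mean of its members.
    for c in range(2):
        members = [points[n] for n, a in assignment.items() if a == c]
        centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    print(assignment, centroids)
```

In the template's terms: the first pass gives Cluster 1 = {P2, P4} with new centroid
(0.5, 1.5, 1) and Cluster 2 = {P1, P3, P5} with new centroid (3.33, 1, 3). The second pass
reproduces exactly the same assignment, so the algorithm stops: K-means terminates when no
point changes cluster between consecutive iterations (equivalently, the centroids no longer
move).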
2) Hierarchical Clustering (7 marks)
Given five data objects (p1, ... , p5), their proximity matrix (i.e., distance matrix) is shown in
Table 3. Apply agglomerative hierarchical clustering to build the hierarchical clustering tree
of the data objects. Merge the clusters by using Max distance and update the proximity
matrix correspondingly. Make sure you show each step of clustering clearly.
     p1   p2   p3   p4   p5
p1   0    1    5    9    10
p2   1    0    3.5  8    7
p3   5    3.5  0    3    4
p4   9    8    3    0    0.5
p5   10   7    4    0.5  0
Table 3
Max distance = complete linkage: the distance between two clusters is the largest pairwise
distance between their members.
Step 1: merge p4 and p5 (distance 0.5, the smallest entry). Updated matrix:
          p1   p2   p3   {p4,p5}
p1        0    1    5    10
p2        1    0    3.5  8
p3        5    3.5  0    4
{p4,p5}   10   8    4    0
Step 2: merge p1 and p2 (distance 1). Updated matrix:
          {p1,p2}  p3   {p4,p5}
{p1,p2}   0        5    10
p3        5        0    4
{p4,p5}   10       4    0
Step 3: merge p3 with {p4,p5} (distance 4). Updated matrix:
             {p1,p2}  {p3,p4,p5}
{p1,p2}      0        10
{p3,p4,p5}   10       0
Step 4: merge {p1,p2} with {p3,p4,p5} (distance 10) → {p1, p2, p3, p4, p5}, the root of the
hierarchical clustering tree.
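A sketch of this complete-linkage procedure (plain Python; the proximity matrix is hardcoded
from Table 3):

```python
# Agglomerative clustering sketch with complete (Max) linkage.
import itertools

dist = {
    ("p1", "p2"): 1,   ("p1", "p3"): 5,   ("p1", "p4"): 9,  ("p1", "p5"): 10,
    ("p2", "p3"): 3.5, ("p2", "p4"): 8,   ("p2", "p5"): 7,
    ("p3", "p4"): 3,   ("p3", "p5"): 4,
    ("p4", "p5"): 0.5,
}

def d(a, b):
    # Complete linkage: max pairwise distance between members of two clusters.
    return max(dist[tuple(sorted((x, y)))] for x in a for y in b)

clusters = [frozenset([p]) for p in ["p1", "p2", "p3", "p4", "p5"]]
while len(clusters) > 1:
    # Merge the closest pair of clusters, then repeat until one cluster is left.
    a, b = min(itertools.combinations(clusters, 2), key=lambda ab: d(*ab))
    print(f"merge {sorted(a)} + {sorted(b)} at distance {d(a, b)}")
    clusters = [c for c in clusters if c not in (a, b)] + [a | b]
```

Running it reproduces the four merge steps above: (p4,p5) at 0.5, (p1,p2) at 1, p3 with
{p4,p5} at 4, and the final merge at 10.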
Q6. Given a query of "transfer learning for video tagging" and a collection of the following
three documents:
Document 1: <A survey on transfer learning>
Document 2: <Transfer learning for image tagging>
Document 3: <Transfer learning: from image tagging to video tagging>
Use the Vector Space Model, TF/IDF weighting scheme, and Cosine vector similarity measure
to find the most relevant document(s) to the query. Assume that "a", "on", "for", "from" and
"to" are stopwords.
The formula of TF/IDF weighting is: wij = tij × log(N / nj)
where:
tij: the number of times term j appears in document i,
N: the total number of documents,
nj: the number of documents in which term j appears.
1) Calculate DF (document frequency) and IDF (inverse document frequency) for each word.
(4marks)
Word list DF IDF
survey 1 0.477
transfer 3 0
learn 3 0
image 2 0.176
tag 2 0.176
video 1 0.477
(logs are base 10: log 3 = 0.477, log(3/2) = 0.176, log 1 = 0)
2) Represent each document as a weighted vector by using TF/IDF weight scheme. Length
normalization is not required. (3 marks)
Term-frequency vector for each document, using the term order (survey, transfer, learn,
image, tag, video):
Document 1: (1, 1, 1, 0, 0, 0)
Document 2: (0, 1, 1, 1, 1, 0)
Document 3: (0, 1, 1, 1, 2, 1)
Convert them to weighted vectors by multiplying element-wise with the IDF vector from 1),
(0.477, 0, 0, 0.176, 0.176, 0.477):
Document 1: (0.477, 0, 0, 0, 0, 0)
Document 2: (0, 0, 0, 0.176, 0.176, 0)
Document 3: (0, 0, 0, 0.176, 0.352, 0.477)
P.S. The order of elements in your vectors may differ; it depends on how you ordered the
terms in 1).
3) Represent the query as a weighted vector and find its most relevant document(s) using
Cosine Similarity measure. (3 marks)
The formula of the Cosine vector similarity measure is: sim(Q, D) = (Q · D) / (|Q| × |D|)
Term-frequency vector for the query (same term order as before):
Query: (0, 1, 1, 0, 1, 1)
Convert it to a weighted vector by multiplying element-wise with the IDF vector from 1):
Query: (0, 0, 0, 0, 0.176, 0.477)
From 2):
Document 1: (0.477, 0, 0, 0, 0, 0)
Document 2: (0, 0, 0, 0.176, 0.176, 0)
Document 3: (0, 0, 0, 0.176, 0.352, 0.477)
sim(Q, D1) = 0 (Q and D1 share no non-zero terms)
sim(Q, D2) = (0.176 × 0.176) / (√(0.176² + 0.477²) × √(0.176² + 0.176²)) ≈ 0.245
sim(Q, D3) = (0.176 × 0.352 + 0.477 × 0.477) / (√(0.176² + 0.477²) × √(0.176² + 0.352² + 0.477²))
           ≈ 0.921
Document 3 has the highest cosine similarity, so return Document 3.
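The whole calculation can be checked with a short script. A sketch assuming the same stopword
removal and the light stemming used above (tagging → tag, learning → learn), with the token
lists hardcoded:

```python
# TF-IDF + cosine similarity sketch for Q6.
import math

docs = {
    "D1": ["survey", "transfer", "learn"],
    "D2": ["transfer", "learn", "image", "tag"],
    "D3": ["transfer", "learn", "image", "tag", "video", "tag"],
}
query = ["transfer", "learn", "video", "tag"]
terms = ["survey", "transfer", "learn", "image", "tag", "video"]
N = len(docs)

# IDF per term: log10(N / document frequency).
idf = {t: math.log10(N / sum(t in d for d in docs.values())) for t in terms}

def weight(tokens):
    # TF-IDF vector: raw term count times IDF (no length normalization).
    return [tokens.count(t) * idf[t] for t in terms]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

q = weight(query)
for name, tokens in docs.items():
    print(name, round(cosine(q, weight(tokens)), 3))
# -> D1 0.0, D2 ~0.245, D3 ~0.921
```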
Q7. Briefly describe the three key components of Web Mining. Give one related application
for each component respectively. (4 marks)
Web Content Mining: the discovery, extraction and integration of useful data, information and
knowledge from Web page content. It extends the functionality of basic search engines.
Related Application:
● Crawlers/Indexing
● Profiles/Personalisation
Web Structure Mining: the use of graph theory to analyze the node and connection structure of
a web site.
Related Application:
● Web page ranking, e.g. PageRank (Google)
● Communities discovery
Web Usage Mining: the process of extracting useful information from server logs, i.e. users'
browsing history.
Related Application:
● Improve design of Web pages
● Aid in caching and prediction of future page references
● Improve the effectiveness of e-commerce (marketing, advertising, and sales)
Example of agglomerative clustering with group average (average linkage)
Starting from the Table 3 proximity matrix:
     p1   p2   p3   p4   p5
p1   0    1    5    9    10
p2   1    0    3.5  8    7
p3   5    3.5  0    3    4
p4   9    8    3    0    0.5
p5   10   7    4    0.5  0
after merging p4 and p5 (distance 0.5), the group-average distances become:
          p1    p2    p3    {p4,p5}
p1        0     1     5     9.5
p2        1     0     3.5   7.5
p3        5     3.5   0     3.5
{p4,p5}   9.5   7.5   3.5   0
Next, p1 and p2 merge (distance 1) and then p3 joins {p4,p5} (distance 3.5). The remaining
cluster-to-cluster distance can be obtained with the weighted update:
d({p1,p2}, {p3,p4,p5}) = (1 × d({p1,p2}, p3) + 2 × d({p1,p2}, {p4,p5})) / 3
                       = (4.25 + 2 × 8.5) / 3 = 7.0833...
where d({p1,p2}, p3) = (5 + 3.5) / 2 = 4.25 and d({p1,p2}, {p4,p5}) = (9 + 10 + 8 + 7) / 4 = 8.5.
To check, compute it directly as the average of all cross-cluster pairwise distances from the
first table:
d({p1,p2}, {p3,p4,p5}) = (d(p1,p3) + d(p1,p4) + d(p1,p5) + d(p2,p3) + d(p2,p4) + d(p2,p5)) / (2 × 3)
                       = (5 + 9 + 10 + 3.5 + 8 + 7) / 6 = 7.0833...
The two calculations agree.
To clarify the weighted update: divide by the number of points in the newly created cluster,
and weight each term by the number of points in the cluster it refers to. Here {p4,p5}
contributes with weight 2, and merging p3 into {p4,p5} creates a 3-point cluster, so the sum
is divided by 3.
This update rule is unlikely to be examined. If it does appear, it is easier to work from the
first table at all times, i.e. compute the group average directly over all cross-cluster
pairwise distances, as in the check above.
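A two-line numeric check of the group-average distance (plain Python; pairwise distances from
Table 3):

```python
# Group-average (average linkage) check: the distance between {p1,p2} and
# {p3,p4,p5} is the mean of all cross-cluster pairwise distances from Table 3.
pairs = {("p1", "p3"): 5, ("p1", "p4"): 9, ("p1", "p5"): 10,
         ("p2", "p3"): 3.5, ("p2", "p4"): 8, ("p2", "p5"): 7}
print(sum(pairs.values()) / len(pairs))  # -> 7.0833...

# Equivalent weighted update from the intermediate matrix:
print((1 * 4.25 + 2 * 8.5) / 3)          # -> 7.0833...
```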