
DWM Assignment 2

Q1) Consider the adjacency matrix given below. Use the agglomerative algorithm and plot a dendrogram using single link.
Ans.
Agglomerative Clustering Technique:
Hierarchical clustering algorithms follow either a top-down or a bottom-up approach. In the bottom-up approach, every object starts as its own cluster, and in each subsequent iteration the two closest clusters are merged until a single cluster remains. This is therefore also called Hierarchical Agglomerative Clustering (HAC).
An HAC clustering is typically visualised as a dendrogram, where each merge is represented by a horizontal line.
Given:
Distance matrix (shown as a figure; the pairwise distances used in the steps below are dist(A,B)=5, dist(A,C)=2, dist(A,D)=3, dist(B,D)=3, dist(B,E)=2, dist(C,D)=6, dist(C,E)=2, dist(D,E)=3; dist(A,E) and dist(B,C) are the two smallest entries).
Step 1: From the given distance matrix, clusters E and A have the minimum distance, so merge them to form the cluster (E, A).

Updated distance matrix:
dist((E,A), B) = MIN(dist(E,B), dist(A,B)) = MIN(2, 5) = 2
dist((E,A), C) = MIN(dist(E,C), dist(A,C)) = MIN(2, 2) = 2
dist((E,A), D) = MIN(dist(E,D), dist(A,D)) = MIN(3, 3) = 3

Step 2: Consider the distance matrix obtained in step 1. Since the (B, C) distance is now the minimum, we combine B and C.

dist((B,C), (E,A)) = MIN(dist(B,E), dist(B,A), dist(C,E), dist(C,A)) = MIN(2, 5, 2, 2) = 2
dist((B,C), D) = MIN(dist(B,D), dist(C,D)) = MIN(3, 6) = 3

Step 3: Consider the distance matrix obtained in step 2. Since the distance between (E, A) and (B, C) is the minimum, we combine them:
dist((E,A), (B,C)) = MIN(dist(E,B), dist(E,C), dist(A,B), dist(A,C)) = MIN(2, 2, 5, 2) = 2

Step 4: Finally, combine D with (E, A, B, C):
dist((E,A,B,C), D) = MIN(dist(E,D), dist(A,D), dist(B,D), dist(C,D)) = MIN(3, 3, 3, 6) = 3
All objects now form a single cluster; plotting the merges in order gives the single-link dendrogram.
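The same single-link hierarchy can be reproduced in code. Below is a minimal sketch using SciPy; since the figure's two smallest entries were not recoverable, dist(A,E) and dist(B,C) are assumed to be 1 (any value below 2 yields the same merge order), and the remaining distances are those used in the steps above.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform

labels = ["A", "B", "C", "D", "E"]
# Symmetric distance matrix; dist(A,E) = dist(B,C) = 1 is an assumption.
D = np.array([
    [0, 5, 2, 3, 1],   # A
    [5, 0, 1, 3, 2],   # B
    [2, 1, 0, 6, 2],   # C
    [3, 3, 6, 0, 3],   # D
    [1, 2, 2, 3, 0],   # E
])

# linkage() expects the condensed (upper-triangular) form of the matrix.
Z = linkage(squareform(D), method="single")  # single link = MIN distance
dendrogram(Z, labels=labels)
plt.title("Single-link dendrogram")
plt.show()
```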

Q2) Consider 4 objects with 2 attributes (X & Y). These 4 objects are to be grouped into 2 clusters. Following are the objects with their attribute values:
Object X Y
A 1 1
B 2 1
C 4 3
D 5 4

Ans.
Object X Y
A 1 1
B 2 1
C 4 3
D 5 4

K = 2
1] Obtain the distance (adjacency) matrix using Euclidean distance:

-   A    B    C    D
A   0    -    -    -
B   1    0    -    -
C   √13  √8   0    -
D   5    √18  √2   0

2] Let c1 = {A} and c2 = {B}
∴ centroid of c1 = (1, 1)
  centroid of c2 = (2, 1)

3] Assign each object to the cluster with the nearest centroid:
c1 = {A}
c2 = {B, C, D}
∴ centroid of c1 = (1, 1)
  centroid of c2 = ((2+4+5)/3, (1+3+4)/3) = (3.67, 2.67)

4] Reassign with the new centroids:
c1 = {A, B}
c2 = {C, D}
∴ centroid of c1 = (1.5, 1)
  centroid of c2 = (4.5, 3.5)
Reassigning again gives
c1 = {A, B}
c2 = {C, D}
∵ there is no change in the clusters, these are the final clusters.
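As a cross-check, here is a minimal sketch using scikit-learn's KMeans, seeded with the same initial centroids A and B chosen above.

```python
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1, 1], [2, 1], [4, 3], [5, 4]], dtype=float)  # A, B, C, D
init = np.array([[1.0, 1.0], [2.0, 1.0]])  # start from A and B, as in step 2]

km = KMeans(n_clusters=2, init=init, n_init=1).fit(points)
print(km.labels_)           # [0 0 1 1] -> c1 = {A, B}, c2 = {C, D}
print(km.cluster_centers_)  # [[1.5 1. ] [4.5 3.5]]
```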

Q3) Consider the transaction database given below. Use the Apriori algorithm with minimum support count 2 and minimum confidence 70%. Generate association rules with their confidence.
TID List of items
T100 I1, I2, I5
T200 I2, I4
T300 I2, I3
T400 I1, I2, I4
T500 I1, I3
T600 I2, I3
T700 I1, I3
T800 I1, I2, I3, I5
T900 I1, I2, I3

Ans.
Step-1: K=1
(I) Create a table containing the support count of each item present in the dataset, called C1 (the candidate set).
(II) Compare each candidate item's support count with the minimum support count (here min_support = 2); if an item's support count is less than min_support, remove it. This gives us itemset L1. Here every item meets min_support, so L1 = C1:

Itemset Sup_count
{I1} 6
{I2} 7
{I3} 6
{I4} 2
{I5} 2

Step-2: K=2
• Generate candidate set C2 using L1 (this is called the join step). The condition for joining Lk-1 with Lk-1 is that the itemsets should have (K-2) elements in common.
• Check whether all subsets of each itemset are frequent; if not, remove that itemset. (For example, the subsets of {I1, I2} are {I1} and {I2}, which are frequent. Check this for each itemset.)
• Now find the support count of these itemsets by searching the dataset.

(II) Compare the candidate (C2) support counts with the minimum support count (here min_support = 2); if an itemset's support count is less than min_support, remove it. This gives us itemset L2.

C2 support counts: {I1,I2}=4, {I1,I3}=4, {I1,I4}=1, {I1,I5}=2, {I2,I3}=4, {I2,I4}=2, {I2,I5}=2, {I3,I4}=0, {I3,I5}=1, {I4,I5}=0.
So L2 = { {I1,I2}, {I1,I3}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5} }.
Step-3:
• Generate candidate set C3 using L2 (join step). The condition for joining Lk-1 with Lk-1 is that the itemsets should have (K-2) elements in common, so here, for L2, the first element should match.
• The itemsets generated by joining L2 are {I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I4, I5}, {I2, I3, I5}.
• Check whether all subsets of these itemsets are frequent; if not, remove that itemset. (Here the subsets of {I1, I2, I3} are {I1, I2}, {I2, I3}, {I1, I3}, which are all frequent. For {I2, I3, I4}, the subset {I3, I4} is not frequent, so remove it. Similarly check every itemset; only {I1, I2, I3} and {I1, I2, I5} survive.)
• Find the support count of the remaining itemsets by searching the dataset.

(II) Compare the candidate (C3) support counts with the minimum support count (here min_support = 2); if an itemset's support count is less than min_support, remove it. This gives us itemset L3.

C3 support counts: {I1,I2,I3} = 2 and {I1,I2,I5} = 2. Both meet min_support, so L3 = { {I1,I2,I3}, {I1,I2,I5} }.

Step-4:
• Generate candidate set C4 using L3 (join step). The condition for joining Lk-1 with Lk-1 (K=4) is that the itemsets should have (K-2) elements in common, so here, for L3, the first 2 elements (items) should match.
• Check whether all subsets of these itemsets are frequent. (Here the itemset formed by joining L3 is {I1, I2, I3, I5}; its subsets include {I1, I3, I5}, which is not frequent.) So there is no itemset in C4.
• We stop here because no further frequent itemsets are found.

Thus, we have discovered all the frequent itemsets. Now the generation of strong association rules comes into the picture. For that we need to calculate the confidence of each rule.
Confidence:
Confidence(A->B) = Support_count(A∪B) / Support_count(A)
For example, a confidence of 60% means that 60% of the customers who purchased milk and bread also bought butter.
So here, taking one frequent itemset as an example, we show the rule generation.
Itemset {I1, I2, I3} // from L3
So the rules can be:
[I1^I2]=>[I3] // confidence = sup(I1^I2^I3)/sup(I1^I2) = 2/4*100 = 50%
[I1^I3]=>[I2] // confidence = sup(I1^I2^I3)/sup(I1^I3) = 2/4*100 = 50%
[I2^I3]=>[I1] // confidence = sup(I1^I2^I3)/sup(I2^I3) = 2/4*100 = 50%
[I1]=>[I2^I3] // confidence = sup(I1^I2^I3)/sup(I1) = 2/6*100 = 33%
[I2]=>[I1^I3] // confidence = sup(I1^I2^I3)/sup(I2) = 2/7*100 = 29%
[I3]=>[I1^I2] // confidence = sup(I1^I2^I3)/sup(I3) = 2/6*100 = 33%
With the given minimum confidence of 70%, none of these six rules is strong. Strong rules do come from the other frequent 3-itemset, {I1, I2, I5}: for example [I1^I5]=>[I2], [I2^I5]=>[I1], and [I5]=>[I1^I2] each have confidence 2/2*100 = 100%, so they qualify as strong association rules. (Had the minimum confidence been 50%, the first 3 rules above would also qualify.)
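The whole derivation can be checked programmatically. Below is a minimal sketch, assuming the mlxtend library is available; with min_support = 2/9 (support count 2 out of 9 transactions) and min_threshold = 0.7 it reports exactly the strong rules discussed above.

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [
    ["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"],
    ["I1", "I2", "I4"], ["I1", "I3"], ["I2", "I3"],
    ["I1", "I3"], ["I1", "I2", "I3", "I5"], ["I1", "I2", "I3"],
]

# One-hot encode the transactions into a boolean DataFrame.
te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions),
                  columns=te.columns_)

# Minimum support count 2 out of 9 transactions -> min_support = 2/9.
frequent = apriori(df, min_support=2/9, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```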

Q4) What is classification? Apply a statistical-based algorithm to obtain the actual probabilities of each event and classify the new tuple as tall.
Person ID Name Gender Height Class
1 Kristina F 1.6 m Short
2 Jim M 2.0 m Tall
3 Maggi F 1.9 m Medium
4 Martha F 2.1 m Tall
5 Stephanie F 1.7 m Short
6 Bob M 1.85 m Medium
7 Catherine F 1.6 m Short
8 Dave M 1.7 m Short
9 Wilson M 2.2 m Tall

Ans.
Classification is the process of assigning data objects to one of a set of predefined classes (here Short, Medium, Tall) on the basis of their attribute values. More broadly, data classification is the process of organizing data by relevant categories so that it may be used and protected more efficiently. On a basic level, the classification process makes data easier to locate and retrieve. Data classification is of particular importance when it comes to risk management, compliance, and data security.

Data classification involves tagging data to make it easily searchable and trackable. It
also eliminates multiple duplications of data, which can reduce storage and backup
costs while speeding up the search process. Though the classification process may
sound highly technical, it is a topic that should be understood by your organization’s
leadership.

REASONS FOR DATA CLASSIFICATION


Data classification has improved significantly over time. Today, the technology is used for a variety of purposes, often in support of data security initiatives. But data may be classified for a number of reasons, including ease of access, maintenance of regulatory compliance, and various other business or personal objectives. In some cases, data classification is a regulatory requirement, as data must be searchable and retrievable within specified timeframes. For the purposes of data security, data classification is a useful tactic that facilitates proper security responses based on the type of data being retrieved, transmitted, or copied.

TYPES OF DATA CLASSIFICATION


Data classification often involves a multitude of tags and labels that define the type of
data, its confidentiality, and its integrity. Availability may also be taken into
consideration in data classification processes. Data’s level of sensitivity is often
classified based on varying levels of importance or confidentiality, which then
correlates to the security measures put in place to protect each classification level.

There are three main types of data classification that are considered industry
standards:

Content-based classification inspects and interprets files looking for sensitive information.
Context-based classification looks at application, location, or creator, among other variables, as indirect indicators of sensitive information.
User-based classification depends on a manual, end-user selection of each document; it relies on user knowledge and discretion at creation, edit, review, or dissemination to flag sensitive documents.
Solution to the problem:
P(Short) = 4/9
P(Medium) = 2/9
P(Tall) = 3/9
Divide the height attribute into six ranges:
(0, 1.6], (1.6, 1.7], (1.7, 1.8], (1.8, 1.9], (1.9, 2.0], (2.0, ∞)
Gender has only two values, Male and Female.
Total number of short persons = 4
Total number of medium persons = 2
Total number of tall persons = 3

Use the above values to classify the new tuple as Tall.

Consider the new tuple t = (John, M, 1.95 m). Its height falls in the range (1.9, 2.0].
P(t | Short) = P(M | Short) × P(height | Short) = 1/4 × 0 = 0
P(t | Medium) = P(M | Medium) × P(height | Medium) = 1/2 × 0 = 0
P(t | Tall) = P(M | Tall) × P(height | Tall) = 2/3 × 1/3 = 0.22
Therefore, likelihood of being Short = P(t | Short) × P(Short) = 0 × 4/9 = 0
Similarly,
Likelihood of being Medium = P(t | Medium) × P(Medium) = 0 × 2/9 = 0
Likelihood of being Tall = P(t | Tall) × P(Tall) = 0.22 × 3/9 = 0.074
Then estimate P(t) by adding the individual likelihood values, since t must be Short, Medium, or Tall:
P(t) = 0 + 0 + 0.074 = 0.074
Finally, the actual probability of each event:
P(Short | t) = P(t | Short) × P(Short) / P(t) = 0
Similarly,
P(Medium | t) = P(t | Medium) × P(Medium) / P(t) = 0
P(Tall | t) = P(t | Tall) × P(Tall) / P(t) = 0.074 / 0.074 = 1
The new tuple is classified as Tall, since Tall has the highest probability.
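The same posterior computation can be written out directly. Below is a minimal sketch in plain Python; the helper bin_of and the bin edges are introduced here for illustration, with each range's upper edge treated as inclusive, as in the working above.

```python
import bisect

data = [  # (gender, height in m, class)
    ("F", 1.6, "Short"), ("M", 2.0, "Tall"), ("F", 1.9, "Medium"),
    ("F", 2.1, "Tall"), ("F", 1.7, "Short"), ("M", 1.85, "Medium"),
    ("F", 1.6, "Short"), ("M", 1.7, "Short"), ("M", 2.2, "Tall"),
]
edges = [1.6, 1.7, 1.8, 1.9, 2.0]  # bin i covers (edges[i-1], edges[i]]

def bin_of(height):
    # Index of the height range containing `height` (upper edge inclusive).
    return bisect.bisect_left(edges, height)

classes = ["Short", "Medium", "Tall"]
t_gender, t_bin = "M", bin_of(1.95)  # new tuple t = (John, M, 1.95 m)

likelihood = {}
for c in classes:
    rows = [r for r in data if r[2] == c]
    p_gender = sum(r[0] == t_gender for r in rows) / len(rows)
    p_height = sum(bin_of(r[1]) == t_bin for r in rows) / len(rows)
    # P(t | class) * P(class), where P(class) = len(rows) / len(data)
    likelihood[c] = p_gender * p_height * len(rows) / len(data)

p_t = sum(likelihood.values())  # P(t) = 0 + 0 + 0.074
for c in classes:
    print(c, likelihood[c] / p_t)  # posteriors: Short 0, Medium 0, Tall 1.0
```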

Q5) Perform the k-means algorithm with the following data for 2 clusters.
Data set: {2, 4, 10, 12, 3, 20, 30, 11, 25}
Ans.
Step 1: Randomly assign means:
m1 = 3, m2 = 4
Step 2: Calculate the distance of each object from the means and assign each object to the cluster whose mean is nearest:
k1 = {2, 3}, k2 = {4, 10, 12, 20, 30, 11, 25}
Step 3: Reassign means:
m1 = (2+3)/2, m2 = (4+10+12+20+30+11+25)/7
m1 = 2.5, m2 = 16
Step 4: Calculate distances and assign clusters:
k1 = {2, 3, 4}, k2 = {10, 12, 20, 30, 11, 25}
Step 5: Reassign means:
m1 = 3, m2 = 18
Step 6: Calculate distances and assign clusters:
k1 = {2, 3, 4, 10}, k2 = {12, 20, 30, 11, 25}
Step 7: Reassign means:
m1 = 4.75, m2 = 19.6
Step 8: Calculate distances and assign clusters:
k1 = {2, 3, 4, 10, 11, 12}, k2 = {20, 30, 25}
Step 9: Reassign means:
m1 = 7, m2 = 25
Step 10: Calculate distances and assign clusters:
k1 = {2, 3, 4, 10, 11, 12}, k2 = {20, 30, 25}
Repeat the steps until you get the same clusters.

As the clusters in step 8 and step 10 are the same, these are the final clusters.
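The iteration above can be condensed into a short loop. Here is a minimal sketch in plain Python (distance ties are broken toward k1, matching the assignments shown):

```python
data = [2, 4, 10, 12, 3, 20, 30, 11, 25]
m1, m2 = 3, 4  # initial means, as in Step 1

while True:
    # Assign each object to the cluster with the nearer mean.
    k1 = [x for x in data if abs(x - m1) <= abs(x - m2)]
    k2 = [x for x in data if abs(x - m1) > abs(x - m2)]
    # Recompute the means; stop once they no longer change.
    new_m1, new_m2 = sum(k1) / len(k1), sum(k2) / len(k2)
    if (new_m1, new_m2) == (m1, m2):
        break
    m1, m2 = new_m1, new_m2

print(k1, k2)  # [2, 4, 10, 12, 3, 11] [20, 30, 25]
```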
