Data Mining
Introduction
Revision 1.0
Is it knowledge?
Some cab data is collected for a specific period in a city for a few cab operators:

Operator  Trip ID  Fare    Passenger Rating  Driver Rating
Uber      U12      234.00  4                 3
Spot      S9        34.00  3                 4
Ola       O1        12.00  3                 3
Zen       Z24      987.00  5                 2
Uber      U13      123.00  2                 2
Ola       O23       72.00  2                 3
Zen       Z54       23.00  3                 3
Uber      U65       45.00  2                 2
Spot      S11       43.00  3                 4
Ola       O34      345.00  5                 3
Zen       Z24      234.00  5                 4
Uber      U76       90.00  3                 4
...       ...      ...     ...               ...

What does this data tell?
Can we establish that passengers are happier in longer trips but drivers are not?
If we establish it, is it knowledge that we can use for finding out the reasons?
How do we do that?
In the Iris data classification, the types of flowers were known at the beginning; however, in the article-clustering analysis, the categories evolved based on the words encountered. In several references, classification analysis is also called "supervised learning" and clustering is called "unsupervised learning". Do you agree? Search and find out more on these keywords.
Example (figure): the later steps of the KDD process – Data Mining, Patterns Evaluation and Knowledge Presentation.
Text Book-1 (T1): Pang-Ning Tan, Michael Steinbach and Vipin Kumar, 1st edition.
Text Book-2 (T2): Jiawei Han, Micheline Kamber and Jian Pei, 3rd edition.
Relational Databases
Data Warehouses
Transactional
Time series – stock exchange, ocean tides etc.
Biological – data used to map DNA
Data Streams - camera surveillance
Spatial (maps)
Web
Text
Multimedia
NoSQL, Distributed data
.............
Predictive Tasks:
Predict the value of an attribute (target or dependent
variable) based on the other attributes (independent
variables).
Descriptive Tasks:
Derive patterns – correlations, trends, clusters, anomalies, etc. – that summarize the underlying relationships in the data.
Figure: Data Mining draws on Machine Learning, Visualization, Pattern Recognition, Algorithms, and other disciplines.
Data for 20 million galaxies is available. A few of the available attributes are image features, characteristics of light waves, distance from earth, etc. A newly found galaxy is to be categorised into one of the categories – Early, Intermediate or Aged. Which data mining task will be helpful to do so?
Ordinal
– Values have a meaningful order (ranking) but magnitude between
successive values is not known.
– Size = {small, medium, large}, grades, army rankings
– The word “ordinal” suggests “an order”.
Binary
– Nominal attribute with only 2 states (0/1, True/False, Yes/No etc.)
– Symmetric binary: both outcomes equally important
e.g. gender
– Asymmetric binary: outcomes not equally important.
e.g. medical test (positive vs. negative)
Discrete Attribute
– Has only a finite or countably infinite set of values, depending on the context.
  • E.g. pin codes, profession, or the set of words in a collection of documents.
  • Binary attributes are a special case of discrete attributes.
Continuous Attribute
– Has real numbers as attribute values e.g. temperature, height,
or weight.
– In practice, real values can only be measured and represented using a finite number of digits.
– Continuous attributes are typically represented as floating-point variables.
Mean of N values:

    mean = ( Σ_{i=1}^{N} x_i ) / N = (x1 + x2 + .... + xN) / N

Sometimes each value x_i may be associated with a weight w_i. This weight reflects the significance, importance, or occurrence frequency of the value. In that case the (weighted) mean is:

    mean = ( Σ_{i=1}^{N} w_i·x_i ) / ( Σ_{i=1}^{N} w_i ) = (w1·x1 + w2·x2 + .... + wN·xN) / (w1 + w2 + .... + wN)
Exercise
1. For a group of employees the salary data in thousands of rupees is 30, 36,
47, 50, 52, 52, 56, 60, 63, 70, 70 and 110.
i. Calculate the mean salary.
ii. Calculate the mean salary using weights
(Answer: Rs. 58,000 for both sub-parts)
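A minimal Python sketch of both calculations for this exercise (using the frequency of each distinct salary value as its weight is an assumption, consistent with both answers being Rs. 58,000):

    salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]  # in thousands of rupees

    # (i) simple mean
    mean = sum(salaries) / len(salaries)

    # (ii) weighted mean, using the frequency of each distinct value as its weight
    from collections import Counter
    freq = Counter(salaries)
    weighted_mean = sum(w * v for v, w in freq.items()) / sum(freq.values())

    print(mean, weighted_mean)   # both print 58.0, i.e. Rs. 58,000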
Figure: frequency distributions (frequency vs. values) showing the relative positions of mean, median and mode for symmetric and skewed data.
Fill the table for the cells marked x and justify the relationship among mean, mode and median shown in the previous graphs:

Data Type  Val-1  Val-2  Val-3  Val-4  Val-5  Val-6  Val-7  Val-8  Mean  Median  Mode
x          20     25     30     35     35     40     45     50     x     x       x
x          10     11     11     13     13     14     30     35     x     x       x
x          25     30     35     40     40     45     10     12     x     x       x

Answer:

Data Type          Mean   Median  Mode
Symmetric          35.00  35.0    35
Positively Skewed  17.13  13.0    11, 13
Negatively Skewed  29.63  32.5    40
Similarity
– Numerical measure of how alike two data objects are
– Value is higher when objects are more alike
Dissimilarity
– Numerical measure of how different two data objects are
– Lower when objects are more alike
– Minimum dissimilarity is often 0 or close to it.
Proximity refers to a similarity or dissimilarity
The dissimilarity matrix (n × n, lower triangular):

    [  0
      d(2,1)   0
      ...      ...
      d(n,1)  d(n,2)  ...  0  ]
The values of nominal attributes are just different names; they provide only enough information to distinguish one object from another. Example: PIN codes, Employee IDs, Gender, etc.
The elements of the dissimilarity matrix are d(i, j) = (p − m)/p, where p is the count of nominal attributes and m is the number of matches.
In the table below, Test-1 shows nominal attribute values for four objects (code-A etc.). Here p = 1.

Object ID  Test-1
1          code-A
2          code-B
3          code-C
4          code-A

The dissimilarity matrix for the Test-1 nominal attribute is shown below; it suggests that objects 1 and 4 are similar.

    0
    1  0
    1  1  0
    0  1  1  0
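A minimal Python sketch of this calculation (the object dictionary and helper function are illustrative, not from the slides):

    # Nominal dissimilarity: d(i, j) = (p - m) / p
    objects = {1: ["code-A"], 2: ["code-B"], 3: ["code-C"], 4: ["code-A"]}

    def nominal_dissimilarity(a, b):
        p = len(a)                                   # number of nominal attributes
        m = sum(1 for x, y in zip(a, b) if x == y)   # number of matching attributes
        return (p - m) / p

    ids = sorted(objects)
    for i in ids:
        print([nominal_dissimilarity(objects[i], objects[j]) for j in ids if j <= i])
    # prints the lower-triangular matrix: [0.0], [1.0, 0.0], [1.0, 1.0, 0.0], [0.0, 1.0, 1.0, 0.0]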
The medical test data below is provided for three patients. Find out which two patients are unlikely to have the same disease. Except name and gender, all attributes are asymmetric binary attributes that need to be counted in the calculation (M: Male, F: Female, Y: Yes, N: Negative/No, P: Positive).

Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
Jack  M       Y      N      P       N       N       N
Jim   M       Y      Y      N       N       N       N
Mary  F       Y      N      P       N       P       N

Using d = (r + s)/(q + r + s), where q counts attributes that are positive for both patients, r counts attributes positive for the first but not the second, and s the reverse:
d(Jack, Jim) = (1 + 1)/(1 + 1 + 1) = 0.67,  d(Jack, Mary) = (0 + 1)/(2 + 0 + 1) = 0.33,  d(Jim, Mary) = (1 + 2)/(1 + 1 + 2) = 0.75.
This suggests that Jim and Mary are unlikely to have the same disease (highest value of d), while Jack and Mary are likely to have the same disease (smallest value of d).
Example (figure): the Euclidean distance between two points X1(1, 2) and X2(3, 5) is sqrt((3 − 1)² + (5 − 2)²) = √13 ≈ 3.61.
Four objects scored grades out of {Fair, Good, Excellent} as shown below.
Calculate the dissimilarity.
For objects described by attributes of mixed types, the overall dissimilarity is:

    d(i, j) = [ Σ_{f=1}^{p} δ_ij^(f) · d_ij^(f) ] / [ Σ_{f=1}^{p} δ_ij^(f) ]

where d_ij^(f) is the dissimilarity measure for objects i and j on attribute f.
The indicator δ_ij^(f) is calculated as follows:
  = 0, if there is no measurement (missing data) for object i or j on attribute f
  = 0, if x_if = x_jf = 0 and attribute f is an asymmetric binary attribute
  = 1, otherwise
The per-attribute dissimilarity matrices for the four objects (Test-1: nominal, Test-2: ordinal, Test-3: numeric) are:

Test-1:           Test-2:                  Test-3:
0                 0.0                      0.0
1  0              1.0  0.0                 0.55  0.0
1  1  0           0.5  0.5  0.0            0.45  1.00  0.0
0  1  1  0        0.0  1.0  0.5  0.0       0.40  0.14  0.86  0.0

The dissimilarity matrix for the mixed attributes is calculated with

    d(i, j) = [ Σ_{f=1}^{p} δ_ij^(f) · d_ij^(f) ] / [ Σ_{f=1}^{p} δ_ij^(f) ]

giving:

0.00
0.85  0.00
0.65  0.83  0.00
0.13  0.71  0.79  0.00

For example:
d(3, 1) = (1×1 + 1×0.5 + 1×0.45) / (1 + 1 + 1) = 0.65
d(3, 2) = (1×1 + 1×0.5 + 1×1.00) / (1 + 1 + 1) = 0.83
d(4, 3) = (1×1 + 1×0.5 + 1×0.86) / (1 + 1 + 1) = 0.79
and so on. Objects 1 and 2 are most dissimilar, while 1 and 4 are most similar.
Exercise
Let us say there are a few documents recording the frequency of words as shown in the table below, so that each document is a term-frequency vector.
Two term-frequency vectors may have many 0s in common and thus the resulting matrix could be very sparse.
Having many 0s in common does not make the documents similar, so the distance techniques for numeric attributes will not work here.
What is needed is a measure that focuses on the words common to the two documents and ignores 0-0 matches.
Cosine Similarity is a measure that can be used to compare the documents.
Document  team  coach  hockey  baseball  soccer  penalty  score  win  loss  season
ID-1      5     0      3       0         2       0        0      2    0     0
ID-2      3     0      2       0         1       1        0      1    0     1
ID-3      0     7      0       2         1       0        0      3    0     0
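A small Python sketch computing cosine similarity between these term-frequency vectors (values taken from the table above):

    import math

    docs = {
        "ID-1": [5, 0, 3, 0, 2, 0, 0, 2, 0, 0],
        "ID-2": [3, 0, 2, 0, 1, 1, 0, 1, 0, 1],
        "ID-3": [0, 7, 0, 2, 1, 0, 0, 3, 0, 0],
    }

    def cosine(x, y):
        dot = sum(a * b for a, b in zip(x, y))          # only co-occurring terms contribute
        norm_x = math.sqrt(sum(a * a for a in x))
        norm_y = math.sqrt(sum(b * b for b in y))
        return dot / (norm_x * norm_y)

    print(round(cosine(docs["ID-1"], docs["ID-2"]), 2))  # ~0.94: documents 1 and 2 are similar
    print(round(cosine(docs["ID-1"], docs["ID-3"]), 2))  # ~0.16: documents 1 and 3 are not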
Figure: a frequency distribution (frequency vs. values) with the quartiles Q1, Q2 and Q3 marked.

Example (this is one possible way): for the salary data 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110 the quartiles are Q1 = 48.5, Q2 (median) = 54 and Q3 = 66.5.

The Five-Number Summary (minimum, Q1, median, Q3, maximum) is shown using box plots. Whiskers are drawn up to the minimum/maximum values that lie within 1.5 × IQR of the box; points beyond that are outliers or anomalies and are shown as dots separately. Box plots can also be used to compare multiple data sets side by side. (Figure: a box plot with the median, whiskers and an outlier marked.)
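A minimal Python sketch of the five-number summary and the 1.5 × IQR outlier rule for this data (the quartile convention that splits the sorted halves is assumed; it reproduces the values above):

    salaries = sorted([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])

    def median(v):
        n = len(v)
        mid = n // 2
        return v[mid] if n % 2 else (v[mid - 1] + v[mid]) / 2

    q2 = median(salaries)
    q1 = median(salaries[:len(salaries) // 2])          # lower half
    q3 = median(salaries[(len(salaries) + 1) // 2:])    # upper half
    iqr = q3 - q1

    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [x for x in salaries if x < low or x > high]

    print(q1, q2, q3)      # 48.5 54.0 66.5
    print(outliers)        # [110] -- 110 lies beyond Q3 + 1.5*IQR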
Important Note: In these questions, do not just trust your intuition when looking at the data. For example, you might say the 32 and 35 thousand values are obvious outliers. In data science, inferences need to be numerically justified; in this exercise, outliers are identified with statistics-based logic.
Also pay attention to the questions: the values are in thousands, and the question asks for the months (not the values).
Variance (the average squared deviation from the mean):

    σ² = (1/N) · Σ_{i=1}^{N} (x_i − x̄)²

and the standard deviation σ is its square root.
Figure: the normal distribution (frequency vs. values), with the horizontal axis marked in steps of σ from −3σ to +3σ around the mean μ.
2. Why are these scores the same for both the sets?
Correlation between two attributes X and Y:

    Correlation(X, Y) = S_XY / (σ_X · σ_Y)

where S_XY is the covariance of X and Y and σ_X, σ_Y are their standard deviations. A correlation of 1 represents a perfect positive relationship and a correlation of −1 represents a perfect negative relationship.
Method-2: Smoothing by bin boundaries (replace each value by the bin's minimum or maximum, whichever is closer):
Bin-1: 4, 4, 15
Bin-2: 21, 21, 24
Bin-3: 25, 25, 34
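A small Python sketch of this smoothing method; the input values 4, 8, 15, 21, 21, 24, 25, 28, 34 are assumed (they are consistent with the bin results shown above):

    # Smoothing by bin boundaries: each value is replaced by the closer bin boundary.
    bins = [[4, 8, 15], [21, 21, 24], [25, 28, 34]]

    smoothed = []
    for b in bins:
        lo, hi = min(b), max(b)
        smoothed.append([lo if (v - lo) <= (hi - v) else hi for v in b])

    print(smoothed)   # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]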
Some of the following steps detect the inconsistencies in the data and assist in
cleaning:
Relevance: Is collected data relevant?
o Road accident data - but driver’s age missing.
Have Metadata (data about data): Prior knowledge of data helps in cleaning:
o Date format is DD/MM/YYYY or MM/DD/YYYY?
o Pin codes cannot be negative.
o What is the skewness of the numeric data?
o Address does not fall within a city.
o Product codes need to be exactly 8 characters long.
Rule Checking: prior knowledge expressed as these rules helps in identifying inconsistencies:
o Unique Rule: all values of an attribute are expected to be unique.
o Consecutive Rule: there cannot be a missing value in a range. E.g. youth data missing.
o Null Rule: defines which special characters (question mark, exclamation mark, blank space, etc.) represent a null. E.g. surveyors used different notations to fill nulls.
Interactive Data Scrubbing and Auditing tools are available to perform the above (or even more) steps, e.g. UC Berkeley's Potter's Wheel A-B-C.
Step-2: Identify the Degrees of Freedom (DF), i.e. the count of attribute values that are free to vary. If the observed data table has (Rows × Columns) cells, DF = (Rows − 1) × (Columns − 1). For the given data it is (2 − 1) × (2 − 1) = 1.
Step-3: For the given significance level (0.05 in this example) and DF (1 in this example), the critical chi-square value is read from the χ² distribution table. From the table it is 3.841.
Since the calculated chi-square value (2.264) is less than the table value, the attributes Gender and Holiday preference are treated as independent; the hypothesis of independence is accepted.
If the calculated chi-square value is equal to or larger than the corresponding table value, the attributes are considered correlated.
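A short sketch of the same test using scipy (the 2 × 2 counts below are made-up placeholders, not the slide's data; only the procedure is illustrated):

    from scipy.stats import chi2_contingency, chi2

    # Hypothetical observed counts: rows = Gender (M, F), columns = Holiday preference (Hills, Beach)
    observed = [[30, 20],
                [25, 25]]

    stat, p_value, dof, expected = chi2_contingency(observed)
    critical = chi2.ppf(1 - 0.05, dof)      # table value at significance level 0.05

    print(dof, round(stat, 3), round(critical, 3))
    if stat < critical:
        print("Attributes are treated as independent")
    else:
        print("Attributes are considered correlated")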
1. Forward Selection: start with the NULL set and then keep adding attributes:
   {} ⇒ {A1} ⇒ {A1, A4} ⇒ {A1, A4, A6} (the reduced set)
2. Backward Elimination: start with the entire set and then eliminate attributes:
   {A1, A2, A3, A4, A5, A6} ⇒ {A1, A3, A4, A6} ⇒ {A1, A4, A6} (the reduced set)
   Attributes to add or to eliminate are decided, assuming they are independent, using their statistical significance.
3. Decision Tree Induction: a decision tree built on the data (figure: a tree splitting on A4 with Yes/No branches leading to Class-1/Class-2 leaves) uses only a subset of the attributes. If a few attributes capture all the classes, why do we need all the attributes? This will be discussed in the Classification module.
Histograms
A histogram usually shows the distribution of values of a single variable: divide the values into bins or buckets and show a bar plot of the number of objects in each bin. The height of each bar indicates the number of objects in the bin.
Example: an electronics store presented a table of Unit Price ($) versus Quantity Sold for about thirty items priced between $9 and $100. How can the count of items sold in different price ranges be analyzed visually?
Possibilities: Equal-Width bins (as shown) or Equal-Frequency bins (an equal number of objects per bin).
This numerosity reduction might make decision making simpler, e.g. can this store plan to remove the items which sell for less than $90 per unit?
Figure: a scatter plot of seven two-dimensional points labelled A to G.
The process of obtaining a small sample S to represent the whole data set
N.
Simple Random Sampling:
There is an equal probability of selecting any particular item
Sampling Without Replacement:
Once an object is selected, it is removed from the population
Sampling With Replacement:
A selected object is not removed from the population
Stratified Sampling:
Partition the data set, and draw samples from each partition
(proportionally, i.e., approximately the same percentage of the
data)
Used in conjunction with skewed data
Example: stratified sampling.

Raw Data                     Stratified Sample
Record ID  Attribute         Record ID  Attribute
R101       Child             R101       Child
R303       Child             R404       Child
R404       Child             R606       Man
R110       Child             R707       Man
R440       Child             R330       Man
R606       Man               R202       Woman
R707       Man
R808       Man
R909       Man
R330       Man
R202       Woman
R505       Woman
R220       Woman

The strata are child vs. major (adult), and then, within the majors, the gender. Approximately the same proportion is drawn from each stratum; this is not simple random sampling over the whole data set.
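A small Python sketch of proportional stratified sampling (the strata are taken directly from the Attribute column; the 50% sampling fraction is an illustrative assumption):

    import random
    from collections import defaultdict

    records = [("R101", "Child"), ("R303", "Child"), ("R404", "Child"), ("R110", "Child"),
               ("R440", "Child"), ("R606", "Man"), ("R707", "Man"), ("R808", "Man"),
               ("R909", "Man"), ("R330", "Man"), ("R202", "Woman"), ("R505", "Woman"),
               ("R220", "Woman")]

    # group records by stratum, then draw the same fraction from each stratum
    strata = defaultdict(list)
    for rec_id, attr in records:
        strata[attr].append(rec_id)

    fraction = 0.5
    sample = []
    for attr, ids in strata.items():
        k = max(1, round(fraction * len(ids)))          # proportional allocation
        sample.extend(random.sample(ids, k))            # sampling without replacement

    print(sample)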
New Delhi                               Jaipur
Year          2000  2001  2011  2012    Year          2000  2001  2011  2012
Dengue         150   123   200   225    Dengue         123    65   145   150
Chikungunya    165   200   250   275    Chikungunya    180   100   125   150
Typhoid       1400  2000  2500  1240    Typhoid        600   700   600   750
Dysentery     1600  1800  2000  1250    Dysentery      700   650   700   800

Hyderabad                               Chennai
Year          2000  2001  2011  2012    Year          2000  2001  2011  2012
Dengue         100    96    12    35    Dengue         300   250   120   200
Chikungunya     50    34   125   100    Chikungunya    400   350   250   260
Typhoid        400   200   430   340    Typhoid        300    30    50    75
Dysentery      300   400   340   340    Dysentery      500    40    70    85
Smoothing
Attribute Construction
Aggregation
Normalization
Discretization
Concept Hierarchy Generation for Nominal Attributes
The mean absolute deviation of attribute A is:

    s_A = (1/n) · ( |v1 − Ā| + |v2 − Ā| + ......... + |vn − Ā| )

where Ā is the mean of A. The z-score normalization for the i-th element using the mean absolute deviation is then:

    v_i' = (v_i − Ā) / s_A
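A minimal Python sketch of z-score normalization using the mean absolute deviation (the earlier salary data is reused purely for illustration):

    values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

    mean_a = sum(values) / len(values)
    s_a = sum(abs(v - mean_a) for v in values) / len(values)      # mean absolute deviation

    normalized = [(v - mean_a) / s_a for v in values]
    print(round(s_a, 2))
    print([round(z, 2) for z in normalized])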
Figure: a straight line in the X–Y plane with its X-axis intercept, Y-axis intercept and slope angle θ marked.
In the equation of the line, a, b and k are constants; any coordinate on this line will satisfy the equation. The equation can be simply written as
    y = mx + c
where m is the slope and c is the intercept on the Y-axis.
Name    Body Temperature  Skin      Gives Birth  Aquatic  Aerial  Legs  Hibernates  Class
Human   Warm              Hair      Yes          No       No      Yes   No          Mammal
Python  Cold              Scales    No           No       No      No    Yes         Reptile
Pigeon  Warm              Feathers  No           No       Yes     Yes   No          Bird
.....   .....             .....     .....        .....    .....   ....  .....       .....

The model can serve for descriptive modeling: it explains the different features of the different classes of animals.
The model can also help in predictive modeling. For example, the record below can be tested against the available data set to find out the class of a newly found animal.

Name  Body Temperature  Skin    Gives Birth  Aquatic  Aerial  Legs  Hibernates  Class
Gila  Cold              Scales  No           No       No      Yes   Yes         ??
Classification: General Approach
1. A set of records is provided where the class is known – the Training Set (the vertebrate table above).
2. A new record is to be classified, e.g.:

Name     Body Temperature  Skin    Gives Birth  Aquatic  Aerial  Legs  Hibernates  Class
Strange  Warm              Scales  No           No       No      Yes   Yes         ??

Figure: a decision tree with Body Temperature at the root, an internal node Gives Birth under the Warm branch, and leaf nodes Mammal / Non-Mammal; the Cold branch leads directly to Non-Mammal. Following the tree for the record above (Warm, Gives Birth = No), we assign Non-Mammal as its class.

Do all attributes play a role? How do we decide? Why Body Temperature first? What is the order of selecting attributes – at the root and further down?
Figure: other decision trees are possible for the same data, e.g. one splitting first on Taxable Income (Yes/No) and then on Married, and other combinations.
Splitting an ordinal attribute Exam Grades (Poor, Good, Excellent):
– Multiway split: Excellent | Good | Poor.
– Binary split: {Poor} vs. {Good, Excellent}.
– Binary split: {Good} vs. {Poor, Excellent} – probably a wrong split! Why? (It breaks the natural order of the values.)
Node with class counts (0, 6):
  Entropy = −{ (0/6)·log₂(0/6) + (6/6)·log₂(6/6) } = 0
  Gini = 1 − { (0/6)² + (6/6)² } = 0
  Classification Error = 1 − max{0/6, 6/6} = 0
Node with class counts (1, 5):
  Entropy = −{ (1/6)·log₂(1/6) + (5/6)·log₂(5/6) } = 0.65
  Gini = 1 − { (1/6)² + (5/6)² } = 0.28
  Classification Error = 1 − max{1/6, 5/6} = 0.17
Node with class counts (3, 3):
  Entropy = −{ (3/6)·log₂(3/6) + (3/6)·log₂(3/6) } = 1
  Gini = 1 − { (3/6)² + (3/6)² } = 0.50
  Classification Error = 1 − max{3/6, 3/6} = 0.50
Entropy(t) = Info(t) = − Σ_{i=0}^{c−1} p(i|t) · log₂ p(i|t)

Gini(t) = 1 − Σ_{i=0}^{c−1} [ p(i|t) ]²

The best split is based on the lesser degree of impurity.
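A small Python sketch of the three impurity measures for a node, matching the worked values above (the helper name is illustrative):

    import math

    def impurities(counts):
        n = sum(counts)
        probs = [c / n for c in counts]
        entropy = -sum(p * math.log2(p) for p in probs if p > 0)   # 0*log(0) treated as 0
        gini = 1 - sum(p * p for p in probs)
        error = 1 - max(probs)
        return round(entropy, 2), round(gini, 2), round(error, 2)

    print(impurities([0, 6]))   # (-0.0, 0.0, 0.0): a pure node has zero impurity
    print(impurities([1, 5]))   # (0.65, 0.28, 0.17)
    print(impurities([3, 3]))   # (1.0, 0.5, 0.5)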
There are algorithms that use the measures discussed earlier (Gini and Entropy) to decide which attribute a branch splits on, one attribute at a time:
CART (Classification and Regression Tree) works on Gini.
ID3 (Iterative Dichotomiser-3) works on Entropy (Info).
C4.5 is an extension of ID3; it leverages Entropy (Info) in calculating the Gain Ratio.
The general principle works in the following manner:
o Degree of impurity before splitting is compared with the degree of impurity of
attributes (which are considered for splitting). The larger the difference, the
better the suitability to use that attribute for splitting. In other words, attributes
with lower values of impurity are preferred first.
o If CART algorithm is used, the difference is called the reduction in impurity.
o If entropy based ID3 algorithm is used, the difference is called the information
gain or simply the gain.
o In C4.5, the attribute that yields maximum gain ratio, is selected first.
In the table shown, two binary attributes A and B are considered; their combinations, along with other attributes (x and y, whose values are not shown), describe a class (C0 or C1).

Attribute A  Attribute B  Class
Yes          Yes          C0
Yes          No           C0
Yes          No           C0
Yes          No           C0
No           No           C0
No           No           C0
Yes          Yes          C1
Yes          Yes          C1
Yes          Yes          C1
No           Yes          C1
No           No           C1
No           No           C1

Gini (Parent) before splitting = 1 − [(6/12)² + (6/12)²] = 0.5   (Parent: C0 = 6, C1 = 6)

Splitting on A: A = Yes has (4 C0, 3 C1) and A = No has (2 C0, 3 C1), giving a weighted average Gini of (7/12)·0.49 + (5/12)·0.48 = 0.49.
Splitting on B: B = Yes has (1 C0, 4 C1), Gini 0.32, and B = No has (5 C0, 2 C1), Gini 0.41, giving a weighted average of (5/12)·0.32 + (7/12)·0.41 = 0.37.
Conclusion:
Gini before splitting is = 0.50
Gini for attribute A is = 0.49
Gini for attribute B is = 0.37
The reduction in impurity is maximized when B is chosen, as (0.50 − 0.37) > (0.50 − 0.49). Therefore, B is preferred over A for splitting.
Splitting of Nominal Attributes
Example continuing...
In the table shown, a nominal attribute Car Type is considered; its combinations, along with other attributes, describe a class (C0 or C1). Let us say the splitting is done using two groups: {Sports, Luxury} and {Family}.

Car Type data (20 records): Sports/C0 × 8, Luxury/C0 × 1, Family/C0 × 1, Family/C1 × 3, Luxury/C1 × 7.

Split counts:
        {Sports, Luxury}  {Family}
C0      9                 1
C1      7                 3
Gini    0.49              0.38

Gini ({Sports, Luxury}) = 1 − [(9/16)² + (7/16)²] = 0.49
Gini ({Family}) = 1 − [(1/4)² + (3/4)²] = 0.38
Weighted Average Gini = (16/20)·0.49 + (4/20)·0.38 = 0.47
For the grouping {Family, Luxury} vs. {Sports}:
Weighted Average Gini = (12/20)·0.29 + (8/20)·0 = 0.17
For the grouping {Sports, Family} vs. {Luxury}:
Weighted Average Gini = (12/20)·0.38 + (8/20)·0.21 = 0.31
(Table: candidate split positions for a continuous attribute, with the class counts on each side of the split and the resulting Gini values 0.400, 0.375, 0.343, 0.417, 0.400, 0.300, 0.343, 0.375, 0.400; the position with the lowest Gini is the best split.)
0.44 is the minimum value of Gini for Temperature, but it is not as low as it was for Outlook (0.34)!
0.37 is the value of Gini for Humidity, but it is not as low as it was for Outlook (0.34)!
0.43 is the value of Gini for Wind, but it is not as low as it was for Outlook (0.34)!
Gini summary for the attributes: Outlook 0.34, Humidity 0.37, Wind 0.43 (and Temperature 0.44). The multiway split on Outlook yields the lowest Gini!

Day  Outlook   Temperature  Humidity  Wind    Play Tennis?
...  ...       ...          ...       ...     ...
4    Rain      Mild         High      Weak    Yes
5    Rain      Cool         Normal    Weak    Yes
6    Rain      Cool         Normal    Strong  No
7    Overcast  Cool         Normal    Strong  Yes
8    Sunny     Mild         High      Weak    No
9    Sunny     Cool         Normal    Weak    Yes
10   Rain      Mild         Normal    Weak    Yes
11   Sunny     Mild         Normal    Strong  Yes
12   Overcast  Mild         High      Strong  Yes
13   Overcast  Hot          Normal    Weak    Yes
14   Rain      Mild         High      Strong  No
Within the Outlook = Sunny branch, the Gini for Wind is 0.46; Humidity gives the lowest Gini, so the Sunny branch splits on Humidity (High / Normal). Some of the Sunny records:

Day  Outlook  Temperature  Humidity  Wind    Play Tennis?
8    Sunny    Mild         High      Weak    No
9    Sunny    Cool         Normal    Weak    Yes
11   Sunny    Mild         Normal    Strong  Yes

For the Outlook = Overcast branch, all records are Yes, so it becomes a Yes leaf:

Day  Outlook   Temperature  Humidity  Wind    Play Tennis?
3    Overcast  Hot          High      Weak    Yes
7    Overcast  Cool         Normal    Strong  Yes
12   Overcast  Mild         High      Strong  Yes
13   Overcast  Hot          Normal    Weak    Yes

Tree so far: Outlook at the root, with branches Sunny (split further on Humidity: High / Normal), Overcast (leaf: Yes) and Rain (still to be expanded).
Within the Outlook = Rain branch, Wind gives the lowest Gini (0.20), so the Rain branch splits on Wind. Two of the Rain records:

Day  Outlook  Temperature  Humidity  Wind    Play Tennis?
10   Rain     Mild         Normal    Weak    Yes
14   Rain     Mild         High      Strong  No

Final tree:
  Outlook
    Sunny    -> Humidity: High -> No, Normal -> Yes
    Overcast -> Yes
    Rain     -> Wind: Weak -> Yes, Strong -> No
1. Model the decision tree for the given records using the CART and ID3 algorithms.
2. For the given attribute ID in the table above, calculate the GainRatio and interpret its value.
R1: IF (C = Excellent) THEN (D = Safe)
R2: IF (C = Excellent AND I = High) THEN (D = Safe)
Both the rules are triggered and they specify the same class. But the question is which rule
will be fired and return the class? There are few methods to decide on that:
1. Size Ordering: The rule with higher antecedent size will be fired. In this case R2 will be
fired and return the class.
2. Rule Ordering: The rule having higher priority will be fired. In this scenario rule priority is
decided on some criteria like accuracy, coverage etc.
3. Class Ordering: This priority is decided on some basis like rules for the most prevalent
classes first and so on. This ordering is useful when two rules are triggered but they
specify different classes.
If none of the rules is satisfied by X, it falls back to a default rule.
Figure: a data set of '+' and '−' records with two rule regions: R1, a large region covering mostly '+' records, and R2, a small region covering only '+' records.
There are a total of 44 '+' class records and 6 '−' class records in a data set of 50 records ('−' records are those which are not of the '+' class).
Rule R1 covers 40 records and correctly classifies 38 as +. Accuracy (R1) = 38/40 = 95%
Rule R2 covers 2 records and correctly classifies all 2 as +. Accuracy (R2) = 2/2 = 100%
R1 has more coverage but may classify wrong. R2 is more accurate but its coverage is
poor.
Therefore, alternative measures are required for evaluating rule quality!
Let us say R0: {} ⇨ class (the empty rule) and Rx: {condition} ⇨ class.
FOIL Gain(R0, Rx) = px · [ log₂( px / (px + nx) ) − log₂( p0 / (p0 + n0) ) ]
Where,
p0: number of + instances covered by R0
n0: number of - instances covered by R0
px: number of + instances covered by Rx
nx: number of - instances covered by Rx
In the previous example:
There are a total of 44 '+' class records and 6 '−' class records. Rule R1 covers 40 records and correctly classifies 38; rule R2 covers 2 records and correctly classifies both. The FOIL gains of R1 and R2, respectively, over R0 (the null rule) are:
FOIL Gain(R0, R1) = 38 · [ log₂(38/40) − log₂(44/50) ] ≈ 4.2
FOIL Gain(R0, R2) = 2 · [ log₂(2/2) − log₂(44/50) ] ≈ 0.37
so by this measure R1 is the better rule.
We have reviewed two classifier techniques – Decision Trees and Rules. These
classifiers are used over the test data. How can we compare which one is better?
Let us say for a classification model, Positive Records (P) are those which belong to
the main class of interest (E.g. Buys Computers = YES) and all other records are
Negative Records (N).
True Positives (TP): Count of positive records correctly labelled by the classifier as
YES.
True Negatives (TN): Count of negative records correctly labelled by the classifier as
NO.
False Positives (FP): Count of negative records incorrectly labelled by the classifier as YES (the actual class is NO).
False Negatives (FN): Count of positive records incorrectly labelled by the classifier as NO (the actual class is YES).
We will review a few metrics which are based on the accuracy of the classifier.
Classifier Evaluation Metrics
Accuracy (Recognition Rate) = (TP + TN) / (P + N)
Error Rate = (FP + FN) / (P + N)
Sensitivity (Recall) = TP / P
Specificity = TN / N
Precision = TP / (TP + FP)
F-Measure (F1 or F-Score) = (2 × Precision × Recall) / (Precision + Recall)
Fβ = ((1 + β²) × Precision × Recall) / (β² × Precision + Recall), where β is a non-negative real number.

F/F1/F-Score is the harmonic mean of Precision and Recall. Fβ is a weighted harmonic mean of Precision and Recall, where β² is the weight assigned to Recall and 1 is the weight assigned to Precision.
Confusion Matrix
An effective way to capture the classifier's results on the test data set.
If we notice, Precision talks only about the column of Yes predictions in the matrix (exactness) and Recall only about the row of actual Yes records (completeness).
There is an alternative measure, the F measure, that combines Precision and Recall using their harmonic mean. Fβ is a corresponding weighted measure (e.g. β = 2 weights Recall twice as much as Precision, and β = 0.5 weights Precision twice as much as Recall):

F-Measure (F1 or F-Score) = (2 × Precision × Recall) / (Precision + Recall)
Fβ = ((1 + β²) × Precision × Recall) / (β² × Precision + Recall), where β is a non-negative real number.
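A compact Python sketch of these metrics, applied to the cancer confusion matrix from the exercise below (TP = 90, FN = 210, FP = 140, TN = 9560):

    def metrics(tp, fn, fp, tn, beta=1.0):
        p, n = tp + fn, fp + tn
        accuracy    = (tp + tn) / (p + n)
        error_rate  = (fp + fn) / (p + n)
        recall      = tp / p                       # sensitivity
        specificity = tn / n
        precision   = tp / (tp + fp)
        f_beta = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
        return accuracy, error_rate, recall, specificity, precision, f_beta

    print([round(m, 3) for m in metrics(90, 210, 140, 9560)])
    # accuracy 0.965, error 0.035, recall 0.3, specificity 0.986, precision 0.391, F1 0.34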
Exercise
i. Refer to the text book and answer the following questions:
i. When class distribution is balanced, which metric is effective?
ii. When the main class of interest is rare in the dataset (the class imbalance problem), will accuracy be an effective metric? If not, then which one?
iii. Formulate an expression for accuracy in terms of sensitivity and specificity.
iv. What is the re-substitution error? How is it related to the error rate metric?
ii. For the shown table, calculate different metrics and answer the following
questions:
i. Which metric would indicate that in the dataset there are more negative records.
ii. Is accuracy an effective metric in this dataset?
               Predicted Class
Actual Class   Cancer = Yes   Cancer = No
Cancer = Yes    90             210
Cancer = No    140            9560
We have a dataset; how do we decide the training and test data and assess the accuracy?
Holdout Method
Given data is randomly partitioned into two independent sets:
o Training set (e.g., 2/3) for model construction.
o Test set (e.g., 1/3) for accuracy estimation.
Random sampling: a variation of Holdout:
o Repeat holdout k times, accuracy = average of the accuracies obtained
Cross-validation (k-fold, where k = 10 is most popular)
Randomly partition the data into k mutually exclusive subsets or folds (Di, where i
= 1 to k), each approximately of equal size.
At i-th iteration, use Di as test set and others collectively as training set.
Each fold serves exactly once as the test dataset and the same number of times as part of the training set as the other folds.
In the end, accuracy = total correct classifications / count of records.
Two variations:
o Leave-one-out: k is selected as count of records. For small datasets.
o Stratified cross-validation: folds are stratified so that the class distribution in each fold is approximately the same as that in the initial data.
Let us say there are N records, which are sampled N times to create a training
set. Each time, when a sample is selected it is also put back for the next
sampling.
The probability that a record is selected in one draw = 1/N and not selected = (1 − 1/N).
So, the probability that a record is never chosen during N samplings = (1 − 1/N)^N.
For large N, this approaches e^(−1) = 0.368 (e = Euler's number = 2.718).
These not selected records will form the test set. Other records will be the
training set.
The method is preferred for small data sets.
If this sampling procedure is repeated k times, then the accuracy of the model (M) is found as below (note that accuracy is calculated separately on the training and test set in each iteration):
    Acc(M) = (1/k) · Σ_{i=1}^{k} { 0.632 × Acc(Mi)_TestSet + 0.368 × Acc(Mi)_TrainSet }
vi. log_Y(m) = log_X(m) / log_X(Y)  (base-change rule; take any value of X, 10 is usual)
Example: log2(8) = log10(8) / log10(2) = 0.9030 / 0.3010 = 3
The Gini impurity tells us the probability that we select a fruit at random and a sticker at random and it is an
incorrect match. The contingency table below captures the data. Shaded cells are for the incorrect match
(impurity). Diagonal cells are for the correct match.
Probability Table
Pineapple Sticker Pear Sticker Apple Sticker
Pineapple Fruit 0.25 0.10 0.15
Pear Fruit 0.10 0.04 0.06
Apple Fruit 0.15 0.06 0.09
Notice that the sum of all cells in the probability table is 1.0. If we subtract the sum of the correct-match cells from 1.0, we get the sum of the incorrect-match cells; that is the Gini impurity. In general, if f_i is the fraction of items with label i (so the probability of a correct match for label i is f_i × f_i = f_i²), where i ranges over m labels:

    Gini = 1 − Σ_{i=1}^{m} f_i²
Given a set of transactions, the objective is to find out the rules that will
predict the occurrence of an item based on the occurrences of other items
in the transaction.
This module presents a methodology for discovering interesting
relationships hidden in the data sets. The uncovered relationships can be
represented in the forms of Association Rules.
Market-Basket Transactions:
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Examples of Association Rules:
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}
Are all equally convincing?
This slide talks about an important merging method to enumerate possible itemsets for the calculations. E.g. if {a, b, c} and {a, b, d} are 3-item itemsets, what could be a possible 4-item itemset?
Let A = {a1, a2, ..., ak−1} and B = {b1, b2, ..., bk−1} be two (k−1)-itemsets. How can a k-itemset be generated by combining them?
Brute-Force Method: enumerate all combinations. Computationally prohibitive!
Ck = Lk−1 × Lk−1 Method: A and B (with items kept in sorted order) can be merged to generate a k-itemset if:
    ai = bi (for i = 1, 2, ..., k−2) and ak−1 ≠ bk−1
The merged itemset will be {a1, a2, ..., ak−1, bk−1}.
Example: if A = {1, 2, 3} and B = {1, 2, 4}, then a possible 4-itemset is {1, 2, 3, 4}.
Example: if A = { {I1, I2}, {I1, I3}, {I1, I5}, {I2, I3}, {I2, I4}, {I2, I5} }, find its own merged 3-itemsets:
= { {I1, I2}, {I1, I3}, {I1, I5}, {I2, I3}, {I2, I4}, {I2, I5} } × { {I1, I2}, {I1, I3}, {I1, I5}, {I2, I3}, {I2, I4}, {I2, I5} }
= { {I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5} }
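A small Python sketch of the Lk−1 × Lk−1 candidate-generation step (itemsets are represented as sorted tuples; it reproduces the 3-itemset example above):

    from itertools import combinations

    def generate_candidates(frequent_prev):
        # merge (k-1)-itemsets that share their first k-2 items
        candidates = set()
        for a, b in combinations(sorted(frequent_prev), 2):
            if a[:-1] == b[:-1] and a[-1] != b[-1]:
                candidates.add(tuple(sorted(a + (b[-1],))))
        return sorted(candidates)

    L2 = [("I1", "I2"), ("I1", "I3"), ("I1", "I5"), ("I2", "I3"), ("I2", "I4"), ("I2", "I5")]
    print(generate_candidates(L2))
    # [('I1','I2','I3'), ('I1','I2','I5'), ('I1','I3','I5'), ('I2','I3','I4'), ('I2','I3','I5'), ('I2','I4','I5')]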
The candidate 4-itemset generated from L3 would be {I1, I2, I3, I5}, but C4 = {Φ} because of the Apriori Principle: e.g. {I3, I5} is not frequent, so {I1, I2, I3, I5} cannot be frequent.
Therefore L4 is also {Φ} and C5 cannot be generated.
The algorithm terminates with all the frequent itemsets identified: the 1-item, 2-item and 3-item frequent itemsets. There is no frequent 4-item itemset.
Exercise
Itemsets with the transactions that contain them:
{A, B}: 1, 2        {A, C}: 1, 2, 4     {A, D}: 2, 4
{B, C}: 1, 2, 3     {B, D}: 2           {C, D}: 2, 4
{A, B, C}: 1, 2     {A, B, D}: 2        {A, C, D}: 2, 4     {B, C, D}: 2
{A, B, C, D}: 2
An immediate superset of {X} and {Y} is obtained by merging {X} and {Y}, where the widths of {X} and {Y} are the same; e.g. the immediate superset of {A, B} and {A, C} is {A, B, C}. In general, a superset of {X} is any itemset that contains all the elements of {X}.
A frequent itemset none of whose immediate supersets is frequent is called a Maximal Frequent Itemset. E.g. {A, B, C} and {A, C, D} are Maximal Frequent Itemsets.
An itemset is closed if none of its immediate supersets has exactly the same support count as it has. E.g. {C}, {D}, {A, C}, {B, C}, {A, B, C} and {A, C, D} are closed itemsets. They are Closed Frequent Itemsets as well because they meet the minsup criterion.
2. Anti-monotone property.
3. Apriori Principle.
Each of the above rules will have support equal to the support of X in the transactions.
Once the frequent itemsets are identified and rules are generated from them, calculating the confidence does not require an additional scan of the transaction table. E.g.
o For the rule {1, 2} → {3} the confidence = support count (1, 2, 3) / support count (1, 2).
o Since {1, 2, 3} is frequent, {1, 2} is also frequent because of the Apriori Principle, and the support counts for these two itemsets were already found during the iterations of frequent itemset generation.
The question here is: are all the rules produced this way of interest? How does confidence play a role?
Confidence Based Rule Pruning
Trans ID  Items             L1 Itemsets  L2 Itemsets  L3 Itemsets    L4 Itemsets
T100      I1, I2, I5        {I1}         {I1, I2}     {I1, I2, I3}   {Φ}
T101      I2, I4            {I2}         {I1, I3}     {I1, I2, I5}
T102      I2, I3            {I3}         {I1, I5}
T103      I1, I2, I4        {I4}         {I2, I3}
T104      I1, I3            {I5}         {I2, I4}
T105      I2, I3                         {I2, I5}
T106      I1, I3
T107      I1, I2, I3, I5
T108      I1, I2, I3
The Apriori Principle does not hold for the confidence measure of a rule. That means the confidence of a rule X → Y can be bigger, smaller or the same for X' → Y', where X' ⊆ X and Y' ⊆ Y.
Theorem (Confidence-Based Rule Pruning): for a frequent itemset A where B ⊂ A, if a rule B → (A − B) does not satisfy the confidence threshold, then any rule from this frequent itemset A of the form B' → (A − B'), where B' ⊂ B, will not satisfy the confidence threshold either.
Illustration – continuing...
Confidence Based Rule Pruning
Trans ID  Items             L3 Itemset considered: {I1, I2, I3}
T100      I1, I2, I5
T101      I2, I4            Let us say, Confidence Threshold = 0.67
T102      I2, I3
T103      I1, I2, I4        Rules are generated from the frequent itemset X by moving one item at a time
T104      I1, I3            from the antecedent to the consequent side, starting from {X} → {Φ}.
T105      I2, I3            The order of movement can be lexicographic or reverse lexicographic.
T106      I1, I3
T107      I1, I2, I3, I5
T108      I1, I2, I3

{I1, I2, I3} → {Φ}

{I1, I2} → {I3}: Confidence = 2/4 = 0.50
{I1, I3} → {I2}: Confidence = 2/4 = 0.50
{I2, I3} → {I1}: Confidence = 2/4 = 0.50

{I3} → {I1, I2}: Confidence = 2/6 = 0.33
{I2} → {I1, I3}: Confidence = 2/7 = 0.29
{I1} → {I2, I3}: Confidence = 2/6 = 0.33
(By the confidence-based rule-pruning theorem, these last three did not actually need to be calculated, since the rules above them already fail the threshold.)

No rule is of interest for the given confidence threshold for {I1, I2, I3}.
Illustration – continuing...
Confidence Based Rule Pruning
Trans ID  Items             L3 Itemset considered: {I1, I2, I5}
T100      I1, I2, I5
T101      I2, I4            Let us say, Confidence Threshold = 0.67
T102      I2, I3
T103      I1, I2, I4
T104      I1, I3
T105      I2, I3
T106      I1, I3
T107      I1, I2, I3, I5
T108      I1, I2, I3

{I1, I2, I5} → {Φ}

{I1, I2} → {I5}: Confidence = 2/4 = 0.50
{I1, I5} → {I2}: Confidence = 2/2 = 1.00
{I2, I5} → {I1}: Confidence = 2/2 = 1.00
Trans ID  Items             L2 Itemsets         Confidence Threshold = 0.67
T100      I1, I2, I5        {I1, I2}
T101      I2, I4            {I1, I3}
T102      I2, I3            {I1, I5}
T103      I1, I2, I4        {I2, I3}
T104      I1, I3            {I2, I4}
T105      I2, I3            {I2, I5}
T106      I1, I3
T107      I1, I2, I3, I5
T108      I1, I2, I3

{I1, I2} → {Φ}
{I1} → {I2}: Confidence = 4/6 = 0.67        {I2} → {I1}: Confidence = 4/7 = 0.57

{I1, I3} → {Φ}
{I1} → {I3}: Confidence = 4/6 = 0.67        {I3} → {I1}: Confidence = 4/6 = 0.67
Using the confidence threshold and the rule-pruning theorem, the final set of Association Rules is the set of rules that met the 0.67 threshold in the previous illustration slides (marked in green there), e.g. {I1, I5} → {I2} and {I2, I5} → {I1} with confidence 1.00, and {I1} → {I2}, {I1} → {I3}, {I3} → {I1} with confidence 0.67.
    lift(A, B) = P(A ∪ B) / ( P(A) · P(B) )
Answer:
{Basketball} → {Cereal}  (support = 40%, confidence = 67%)
The rule meets the thresholds, but cereal is eaten by 75% of the people, which is more than the rule's confidence. That is also evident because lift(Basketball → Cereal) = 0.89 < 1, i.e. there is a negative correlation between Basketball and Cereal.
So we agree with the management's decision to demote the marketing manager.
All the rules marked in pink are pruned because they do not meet the lift criterion!
Transaction Table
Therefore, R2 and R3 are redundant.
Transactions:
i.    {A, B}
ii.   {B, C, D}
iii.  {A, C, D, E}
iv.   {A, D, E}
v.    {A, B, C}
vi.   {A, B, C, D}
vii.  {A}
viii. {A, B, C}
ix.   {A, B, D}
x.    {B, C, E}

Figure: the corresponding FP Tree rooted at Null, with A:8 and B:2 as children of the root (node-link pointers are not shown).
Figure: the prefix paths ending in E extracted from the FP Tree, the Conditional FP Tree for E (minsup = 2), the prefix paths ending in D, E, and the Conditional FP Tree for D, E / prefix path for A, D, E.
Example
Continued...
Figure: the FP Tree and the Conditional FP Tree for E (minsup = 2); the Prefix Path ending with C, E and the Conditional FP Tree for C, E (minsup = 2); and the Prefix Path / Conditional FP Tree ending with A, E.
Similarly, prefix paths and Conditional Trees for the itemsets ending with D, C, B and A are identified.
FP Tree
Illustration
Minsup = 2

Trans ID  Items             C1 Itemsets  Support Count
T100      I1, I2, I5        {I1}         6
T101      I2, I4            {I2}         7
T102      I2, I3            {I3}         6
T103      I1, I2, I4        {I4}         2
T104      I1, I3            {I5}         2
T105      I2, I3
T106      I1, I3
T107      I1, I2, I3, I5
T108      I1, I2, I3

Figure: the FP Tree rooted at Null, with I1:6 and I2:3 as children of the root, together with a Header Table listing items I1 to I5 and their node-link pointers.
Iteration-1 (frequent itemsets ending with I5):
Retain only those nodes that have I5 in their path starting from Null; this is called the prefix path of I5. {I5} meets the support threshold, so it is a frequent itemset.
Next, the conditional FP tree for I5 is prepared. This conditional tree will be used to identify the frequent itemsets ending in I3, I5 and I2, I5 and I1, I5.
Since in this tree I2 meets the support count, {I2, I5} is a frequent itemset.
Hide {I2} and update the support counts if required; this gives the conditional FP tree for {I2, I5}. Since in this tree I1 meets the support count, {I1, I2, I5} is a frequent itemset.
From the conditional tree of I5, the prefix path for {I1, I5} is prepared. Since I1 meets the support criterion, {I1, I5} is a frequent itemset.
Frequent itemsets at this stage = {I5}, {I1, I5}, {I2, I5}, {I1, I2, I5}.
FP Growth Algorithm - illustration
Iteration-2 (Frequent itemsets ending with I4)
Figure: the prefix path for I4 (nodes on paths containing I4, starting from Null), the Conditional FP Tree for I4 (which is also the prefix path for {I2, I4}), and the prefix path for {I1, I4}.
Retain only those nodes that have I4 in their path starting from Null; this is called the prefix path for I4. {I4} meets the support threshold, so it is a frequent itemset.
Hide {I4} and eliminate the nodes that do not meet the support criterion; this gives the conditional FP tree for I4. This conditional tree will be used to identify the frequent itemsets ending in I1, I4 and I2, I4 and I3, I4.
Since in this tree I2 meets the support criterion, {I2, I4} is a frequent itemset.
Hide {I2} and update the support counts if required; this is called the conditional FP tree of {I2, I4}. This tree is NULL.
From the conditional tree of I4, the prefix path for {I1, I4} is prepared. Since I1 does not meet the support criterion, the conditional tree for {I1, I4} will also be NULL.
Frequent itemsets at this stage = {I4}, {I2, I4}.
FP Growth Algorithm - illustration
Iteration-3 (Frequent itemsets ending with I3)
Figure: the prefix path for I3, the Conditional FP Tree for I3 (also the prefix path for {I2, I3}), and the Conditional FP Tree for {I2, I3}.
Retain only those nodes that have I3 in their path starting from Null; this is called the prefix path for I3. {I3} meets the support threshold, so it is a frequent itemset.
Hide {I3} and eliminate the nodes that do not meet the support criterion; this gives the conditional FP tree for I3. This conditional tree will be used to identify the frequent itemsets ending in I1, I3 and I2, I3.
Since in this tree I2 meets the support criterion, {I2, I3} is a frequent itemset.
Hide {I2} and update the support counts if required; this gives the conditional FP tree for {I2, I3}. Since in this tree I1 meets the support count, {I1, I2, I3} is a frequent itemset.
From the conditional FP tree of I3, the prefix path for {I1, I3} is prepared. Since I1 meets the support criterion, {I1, I3} is a frequent itemset.
Frequent itemsets at this stage = {I3}, {I1, I3}, {I2, I3}, {I1, I2, I3}.
FP Growth Algorithm - illustration
Iteration-4 (Frequent itemsets ending with I2)
Figure: the prefix path for I2 and the Conditional FP Tree for I2.
Retain only those nodes that have I2 in their path starting from Null; this is called the prefix path for I2. {I2} meets the support threshold, so it is a frequent itemset.
Hide {I2} and update the support counts if required; this gives the conditional FP tree for I2. This conditional tree will be used to identify the frequent itemsets ending in I1, I2.
{I1} meets the support threshold, so {I1, I2} is a frequent itemset.
Frequent itemsets at this stage = {I2}, {I1, I2}.
1. Repeat the Vertical Data Format mining exercise to find out the frequent
itemsets where only the output of diff function is stored. Diff function is
defined between (k+1)th itemset and corresponding kth itemset as the
difference in the transactions they include. E.g.
Itemsets Trans ID Set
{I1} {T100, T400, T500, T700, T800, T900}
{I2} {T100, T200, T300, T400, T600, T800, T900}
{I1, I2} {T100, T400, T800, T900}
diff ( {I1, I2}, {I1} ) {T500, T700}
AllConf(A, B) = min { P(A|B), P(B|A) }
MaxConf(A, B) = max { P(A|B), P(B|A) }
Kulc(A, B) = (1/2) · { P(A|B) + P(B|A) }    (named after the Polish mathematician S. Kulczynski)
Cosine(A, B) = sqrt( P(A|B) × P(B|A) )
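A small Python sketch of these null-invariant measures, plus lift for comparison, computed from the 2 × 2 counts used in the table below (mc, m'c, mc', m'c'):

    import math

    def measures(mc, m_c, mc_, m_c_):
        # mc: milk & coffee, m_c: no-milk & coffee, mc_: milk & no-coffee, m_c_: neither
        total = mc + m_c + mc_ + m_c_
        p_c_given_m = mc / (mc + mc_)       # P(coffee | milk)
        p_m_given_c = mc / (mc + m_c)       # P(milk | coffee)
        all_conf = min(p_c_given_m, p_m_given_c)
        max_conf = max(p_c_given_m, p_m_given_c)
        kulc = 0.5 * (p_c_given_m + p_m_given_c)
        cosine = math.sqrt(p_c_given_m * p_m_given_c)
        lift = (mc / total) / (((mc + mc_) / total) * ((mc + m_c) / total))
        return all_conf, max_conf, kulc, cosine, lift

    # Data set D1 from the table below
    print([round(v, 2) for v in measures(10_000, 1_000, 1_000, 100_000)])
    # [0.91, 0.91, 0.91, 0.91, 9.26]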
Contingency table:
          Milk   Milk'
Coffee    mc     m'c
Coffee'   mc'    m'c'

Data Set  mc      m'c    mc'    m'c'      χ²     lift   AllConf  MaxConf  Kulc  Cosine
D1        10,000  1,000  1,000  1,00,000  90557   9.26  0.91     0.91     0.91  0.91
D2        10,000  1,000  1,000  100       0       1     0.91     0.91     0.91  0.91
D3        100     1,000  1,000  1,00,000  670     8.44  0.09     0.09     0.09  0.09
D4        1,000   1,000  1,000  1,00,000  24740  25.75  0.50     0.50     0.50  0.50
Datasets D1 and D2:
m and c are positively associated. People who bought milk also bought coffee, and vice versa, because confidence(m → c) = confidence(c → m) = 10,000 / 11,000 = 0.91. This is reflected consistently in the last four measures, but lift and χ² generate very different values for D1 and D2.
Dataset D3:
confidence(m → c) = confidence(c → m) = 100 / 1,100 = 0.09. It is very low, but lift and χ² contradict this.
Dataset D4:
confidence(m → c) = confidence(c → m) = 1,000 / 2,000 = 0.50. It shows neutrality, but lift and χ² show a positive association.
Lift and χ² are not appropriate for these data because they depend on m'c'. Transactions that contain neither milk nor coffee are called Null Transactions. They are very likely in real-world situations, i.e. there could be many transactions that do not include any itemset of interest.
The last four measures are null-invariant measures because they are not impacted by null transactions. But are they applicable in all scenarios? Let us see.
Imbalance Ratio (IR)
    IR(A, B) = | support(A) − support(B) | / ( support(A) + support(B) − support(A ∪ B) )
Data Set  mc     m'c   mc'       m'c'      χ²     lift   AllConf  MaxConf  Kulc  Cosine
D4        1,000  1,000  1,000    1,00,000  24740  25.75  0.50     0.50     0.50  0.50
D5        1,000  100    10,000   1,00,000  8173    9.18  0.09     0.91     0.50  0.29
D6        1,000  10     1,00,000 1,00,000  965     1.97  0.01     0.99     0.50  0.10
Dataset D5:
confidence(m → c) = 1,000/11,000 = 0.09 and confidence(c → m) = 1,000/1,100 = 0.91.
Dataset D6:
confidence(m → c) = 1,000/1,01,000 = 0.01 and confidence(c → m) = 1,000/1,010 = 0.99.
For D5 and D6, AllConf shows a low association while MaxConf shows a positive one; Kulc is neutral for both and Cosine is low for both.
The Kulc measure, together with the Imbalance Ratio (IR), presents a clearer picture.
        D4                  D5          D6
IR      0                   0.89        0.99
Type    Perfectly Balanced  Imbalanced  Skewed
A quantile plot (q-plot) is a simple and effective way to have a first look at the
univariate data distribution.
Let us say x1, x2, x3.....xN are N observations of an attribute, arranged in an order
where x1 is the smallest and xN is the largest observation.
Associated with each observation x_i there is a percentage term f_i, calculated as f_i = (i − 0.5)/N.
f_i is plotted on the X-axis and x_i on the Y-axis; the resulting plot is a quantile plot. (Figure: a quantile plot with f_i from 0 to 1 on the X-axis, the data values on the Y-axis, and Q1, Q2 (the median) and Q3 marked at f = 0.25, 0.5 and 0.75.)
When the quantiles of one attribute are plotted against the corresponding quantiles of another to compare their distributions, the plot is called a quantile-quantile (q-q) plot.
Supervised Classification
– Have class label information.
– E.g. the characteristics of reptiles are known; identify whether a newly found species is a reptile.
– Do you think an initial unsupervised classification may eventually become a supervised classification?
Simple Segmentation
– Dividing students into different registration groups alphabetically, by last
name.
Results of a Query
– Groupings are a result of an external specification.
– E.g. employees completed 10 years in the job.
Graph Partitioning
– Some mutual relevance.
– E.g. the reach of airways or the train system to specific cities in a graph of cities.
Clustering Methods
Seven data points (A to G) are given with their coordinates, and two clusters (K = 2) are to be identified among them.

Points  X     Y
A       1.00  1.00
B       1.50  2.00
C       3.00  4.00
D       5.00  7.00
E       3.50  5.00
F       4.50  5.00
G       3.50  4.50

Two centroids are chosen randomly from the points: A (1.0, 1.0) and D (5.0, 7.0). The other points are taken one at a time; their Euclidean distances from the two centroids are measured, a smaller distance means proximity, the point is added to the corresponding cluster, and the mean centroid coordinates are updated.

Step 1: Cluster-1 = {A}, centroid (1.00, 1.00); Cluster-2 = {D}, centroid (5.00, 7.00).
Step 2: B – distances 1.12 and 6.10, so B joins Cluster-1.
Step 3: Cluster-1 = {A, B}, centroid (1.25, 1.50); Cluster-2 = {D}, centroid (5.00, 7.00).
Step 4: C – distances 3.05 and 3.61, so C joins Cluster-1.
Step 5: Cluster-1 = {A, B, C}, centroid (1.83, 2.33); Cluster-2 = {D}, centroid (5.00, 7.00).
Step 6: E – distances 3.15 and 2.50, so E joins Cluster-2.
Step 7: Cluster-1 = {A, B, C}, centroid (1.83, 2.33); Cluster-2 = {D, E}, centroid (4.25, 6.00).
Step 8: F – distances 3.78 and 1.03, so F joins Cluster-2.
Step 9: Cluster-1 = {A, B, C}, centroid (1.83, 2.33); Cluster-2 = {D, E, F}, centroid (4.33, 5.67).
Step 10: G – distances 2.74 and 1.43, so G joins Cluster-2.
Step 11: Cluster-1 = {A, B, C}, centroid (1.83, 2.33); Cluster-2 = {D, E, F, G}, centroid (4.13, 5.38).
Illustration
K-Means Algorithm concluded.
In the illustration, the centroid was updated after each point was added to a cluster. Another variation, which reduces this overhead, is to update the centroid at the end of the iteration, when no more points are left to be assigned (this is the recommended variation).
If K is the count of clusters and x is a data point that belongs to cluster Ci with centroid ci, the objective is to minimize the Euclidean-distance-based Sum of Squared Errors (SSE), defined as:

    SSE = Σ_{i=1}^{K} Σ_{x ∈ Ci} dist(ci, x)²

The centroids that minimize the SSE are the means of the data points that belong to each cluster.
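A compact Python sketch of one K-means pass on the seven points from the illustration (the centroid is updated at the end of the pass, i.e. the recommended variation; the full algorithm repeats such passes until the assignments stop changing):

    import math

    points = {"A": (1.0, 1.0), "B": (1.5, 2.0), "C": (3.0, 4.0), "D": (5.0, 7.0),
              "E": (3.5, 5.0), "F": (4.5, 5.0), "G": (3.5, 4.5)}

    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])

    centroids = [points["A"], points["D"]]     # initial centroids, as in the illustration

    # One assignment pass: each point goes to the nearer centroid.
    clusters = [[] for _ in centroids]
    for name, p in points.items():
        nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
        clusters[nearest].append(name)

    # Update the centroids once at the end of the pass.
    centroids = [tuple(sum(points[n][d] for n in c) / len(c) for d in (0, 1)) for c in clusters]

    sse = sum(dist(points[n], centroids[i]) ** 2 for i, c in enumerate(clusters) for n in c)
    print(clusters)                                            # [['A', 'B', 'C'], ['D', 'E', 'F', 'G']]
    print([tuple(round(v, 2) for v in c) for c in centroids])  # [(1.83, 2.33), (4.13, 5.38)]
    print(round(sse, 2))                                       # ~12.21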
Document Clustering
Using K-Means Algorithm
Doc ID Team Coach Hockey Baseball Soccer Penalty Score Win Loss Season
ID-1 5 0 3 0 2 0 0 2 0 0
ID-2 3 0 2 0 1 1 0 1 0 1
ID-3 0 7 0 2 1 0 0 3 0 0
... ... ... ... ... ... ... ... ... ... ...
Three levels of hierarchical clusters.
Types of Hierarchical Clustering
Six points with their coordinates are provided. In the beginning, each point is considered a singleton cluster.

Point  X     Y
P1     0.40  0.53
P2     0.22  0.38
P3     0.38  0.32
P4     0.26  0.19
P5     0.08  0.41
P6     0.45  0.30

(Figure: scatter plot of the six points.)
Euclidean distance matrix:

      P1    P2    P3    P4    P5    P6
P1    0.00
P2    0.23  0.00
P3    0.21  0.17  0.00
P4    0.37  0.19  0.18  0.00
P5    0.34  0.14  0.31  0.28  0.00
P6    0.24  0.24  0.07  0.22  0.39  0.00

(Figure: the same six points; the closest pair, P3 and P6, is merged first.)
Using MIN (single link):
(P3, P6) and P1 = min {dist(P3, P1), dist(P6, P1)} = min {0.21, 0.24} = 0.21
(P3, P6) and P4 = min {dist(P3, P4), dist(P6, P4)} = min {0.18, 0.22} = 0.18

The distance between the new cluster {(P3, P6), (P2, P5)} and P4:
= min {dist(P5, P4), dist(P2, P4), dist(P3, P4), dist(P6, P4)}
= min (0.28, 0.19, 0.18, 0.22) = 0.18
The distance between the new cluster and P1:
= min {dist(P5, P1), dist(P2, P1), dist(P3, P1), dist(P6, P1)}
= min (0.34, 0.23, 0.21, 0.24) = 0.21
Since 0.18 is the minimum distance, the next higher-level cluster is [P4, {(P3, P6), (P2, P5)}].
Figure: the nested clusters drawn on the scatter plot, and the MIN dendrogram representation with leaves ordered P3, P6, P2, P5, P4, P1.
Using MAX (complete link):
(P3, P6) and P1 = max {dist(P3, P1), dist(P6, P1)} = max (0.21, 0.24) = 0.24
(P3, P6) and P4 = max {dist(P3, P4), dist(P6, P4)} = max (0.18, 0.22) = 0.22
(P2, P5) and P1 = max {dist(P2, P1), dist(P5, P1)} = max (0.23, 0.34) = 0.34
(P2, P5) and P4 = max {dist(P2, P4), dist(P5, P4)} = max (0.19, 0.28) = 0.28
(P2, P5) and (P3, P6) = max {dist(P2, P3), dist(P5, P3), dist(P2, P6), dist(P5, P6)} = max (0.17, 0.31, 0.24, 0.39) = 0.39
P4 will form a cluster with (P3, P6) because its distance (0.22) is the minimum.
(P2, P5) and P1 = max {dist(P2, P1), dist(P5, P1)} = max (0.23, 0.34) = 0.34
{(P3, P6), P4} and P1 = max {dist(P3, P1), dist(P6, P1), dist(P4, P1)} = max (0.21, 0.24, 0.37) = 0.37
{(P3, P6), P4} and (P2, P5) = max {dist(P3, P2), dist(P6, P2), dist(P4, P2), dist(P3, P5), dist(P6, P5), dist(P4, P5)} = max (0.17, 0.24, 0.19, 0.31, 0.39, 0.28) = 0.39
P1 will form a cluster with (P2, P5) because its distance (0.34) is the minimum.
Figure: the nested MAX (complete link) clusters drawn on the scatter plot of the six points.
Centroid: x0 = ( Σ_{i=1}^{n} x_i ) / n = LS / n

Radius: R = sqrt( ( Σ_{i=1}^{n} (x_i − x0)² ) / n ) = sqrt( SS/n − (LS/n)² )

Radius (R) reflects the tightness of the cluster around the centroid.
Note that if just the Clustering Feature CF = <n, LS, SS> is available, the Centroid and Radius can be calculated; the coordinates of the individual data points are not needed.
Example:
In a cluster C1, there are three points (2, 5), (3, 2) and (4, 3).
So, CF1 = <3, (9, 10), (29, 38)>
In another cluster C2 the three points are (1, 2), (2, 6) and (3, 9).
So, CF2 = <3, (6, 17), (14, 121)>
CF for the merged cluster from C1 and C2 will be = <3+3, (9+6, 10+17), (29+14,
38+121)> = <6, (15, 27), (43, 159)>
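A small Python sketch of these Clustering Feature operations (CF = <n, LS, SS>), reproducing the merge above; combining the per-dimension SS components before applying the radius formula is one simple convention assumed here:

    def merge_cf(cf1, cf2):
        # CF = (n, LS, SS); LS and SS are per-dimension tuples, so merging adds component-wise
        n1, ls1, ss1 = cf1
        n2, ls2, ss2 = cf2
        return (n1 + n2,
                tuple(a + b for a, b in zip(ls1, ls2)),
                tuple(a + b for a, b in zip(ss1, ss2)))

    def centroid(cf):
        n, ls, _ = cf
        return tuple(v / n for v in ls)

    def radius(cf):
        n, ls, ss = cf
        # R^2 = (sum of squared distances to the centroid) / n = SS/n - ||LS/n||^2
        return (sum(ss) / n - sum((v / n) ** 2 for v in ls)) ** 0.5

    cf1 = (3, (9, 10), (29, 38))      # cluster C1: points (2,5), (3,2), (4,3)
    cf2 = (3, (6, 17), (14, 121))     # cluster C2: points (1,2), (2,6), (3,9)
    print(merge_cf(cf1, cf2))         # (6, (15, 27), (43, 159))
    print(centroid(cf1), round(radius(cf1), 2))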
There are seven points in one-dimensional space as: x1 = 0.50, x2 = 0.25, x3 = 0, x4 = 0.65, x5 = 1.0, x6 =
1.4, x7 = 1.1
CF-Tree parameters are given as: Branching Factor (B) = 2, Threshold (T) = 0.15.
The dataset is scanned and the first data (x1) point is read. A root as node-0 and a leaf as leaf-1 is
created with the CF value of this point. Since this is the first point, radius (R) of the leaf-1 is 0.
Now the second point (x2) is read. Its radius w.r.t. the leaf-1 is calculated. The value of R comes as 0.13
that is less than the threshold (T). So, it is assigned to leaf-1 and the value of CF1 in the root is
updated.
The third point (x3) is read now. Its radius w.r.t. the leaf-1 is calculated. The value of R comes
as 0.21 that is more than the threshold (T). So, it is not assigned to leaf-1 and a new leaf-2 is
created that contains only x3. Workings:
In this new leaf, Radius (R) = 0 and root is updated for CF. CF1 = <2, 0.75, 0.31>
CFx3 = <1, 0, 0>
Combined CF = <3, 0.75, 0.31>
SS/n = 0.31/3
LS/n = 0.75/3
R = SQRT{ (SS/n) – (LS/n)2 } = 0.21
Figure: before x3, Root Node-0 holds CF1 = <2, 0.75, 0.31>, pointing to Leaf-1 (R = 0.13) containing x1 = 0.50 and x2 = 0.25. After x3, Root Node-0 holds CF1 = <2, 0.75, 0.31> and CF2 = <1, 0, 0>, pointing to Leaf-1 (x1, x2) and Leaf-2 (R = 0) containing x3 = 0.
The fourth point (x4) is read now. Its position in CF1 or CF2 is decided based on their respective centroids. The centroid of CF1 is 0.75/2 = 0.375 and of CF2 is 0/1 = 0, so x4 is closer to CF1.
The radius (R) of x4 together with CF1 comes to 0.16, which is > the threshold (T). It means a new leaf node has to be created.
A new leaf under Node-0 is not possible, because the branching factor (B) = 2. So the root node is split into Node-1 and Node-2, where Node-1 is the old Node-0 and Node-2 has a separate Leaf-3 containing only x4.
Figure: after the split, Root Node-0 holds CF1-2 = <3, 0.75, 0.31> (for Node-1) and CF3 = <1, 0.65, 0.42> (for Node-2). Node-1 holds CF1 = <2, 0.75, 0.31> (Leaf-1: x1, x2, R = 0.13) and CF2 = <1, 0, 0> (Leaf-2: x3); Node-2 holds CF3 (Leaf-3: x4 = 0.65, R = 0).
Note that the sum of the children's CFs equals the CF of the parent, and the leaves contain the actual data points.
CF Tree Illustration
B = 2, T = 0.15
x1 = 0.50, x2 = 0.25, x3 = 0, x4 = 0.65, x5 = 1.0, x6 = 1.4, x7 = 1.1
Now, the fifth point (x5) is read. Its position in CF1-2 or CF3 is to be decided based on their respective
centroids. The centroid of CF1-2 is 0.75/3 = 0.25 and for CF3 is 0.65/1 = 0.65. Therefore x5 is closer to
CF3.
The radius (R) of x5 together with CF3 comes to 0.18, which is > the threshold (T). It means a new leaf node has to be created in Node-2.
The details of Node-2 and Node-0 are updated for their CFs.
Figure: Root Node-0 now holds CF1-2 = <3, 0.75, 0.31> and CF3-4 = <2, 1.65, 1.42>. Node-1 holds CF1 = <2, 0.75, 0.31> and CF2 = <1, 0, 0>; Node-2 holds CF3 = <1, 0.65, 0.42> and CF4 = <1, 1.0, 1.0>.
Now, the sixth point (x6) is read. Its position in CF1-2 or CF3-4 is to be decided based on their
respective centroids. The centroid of CF1-2 is 0.75/3 = 0.25 and for CF3-4 is 1.65/2 = 0.83. Therefore x6
is closer to CF3-4.
Now the centroid of CF3 is 0.65/1 = 0.65 and CF4 is 1.0/1 = 1.0. Therefore, x6 is closer to CF4.
The radius of x6 together with CF4 is 0.20, which is greater than the threshold (T). So Node-2 will be split into two nodes.
Figure: Root Node-0 now holds CF1-2 = <3, 0.75, 0.31> and CF3-4-5 = <3, 3.05, 3.38>. Node-1 holds CF1 and CF2 (Leaf-1: x1, x2, R = 0.13; Leaf-2: x3). Node-2 holds CF3-4 = <2, 1.65, 1.42> (Node-2.1 with CF3 = <1, 0.65, 0.42> and CF4 = <1, 1.0, 1.0>) and CF5 = <1, 1.4, 1.96> (Node-2.2).
The seventh (the last) point (x7) is read. Its position in CF1-2 or CF3-4-5 is decided based on their respective centroids. The centroid of CF1-2 is 0.75/3 = 0.25 and of CF3-4-5 is 3.05/3 = 1.02, so x7 is closer to CF3-4-5.
Now the centroid of CF3-4 is 1.65/2 = 0.83 and of CF5 is 1.4/1 = 1.4, so x7 is closer to CF3-4. Similarly, x7 is closer to CF4 than to CF3. The radius of x7 together with Leaf-4 is 0.05, which is within the threshold (T). So x7 is assigned to Leaf-4.
Figure: the final CF Tree. Root Node-0 holds CF1-2 = <3, 0.75, 0.31> and CF3-4-5 = <4, 4.15, 4.59>. Node-1 holds CF1 = <2, 0.75, 0.31> (Leaf-1: x1 = 0.50, x2 = 0.25, R = 0.13) and CF2 = <1, 0, 0> (Leaf-2: x3 = 0). Node-2 holds CF3-4 = <3, 2.75, 2.63> (Node-2.1, with CF3 = <1, 0.65, 0.42> for Leaf-3: x4 = 0.65, and CF4 = <2, 2.1, 2.21> for Leaf-4: x5 = 1.0, x7 = 1.1, R = 0.05) and CF5 = <1, 1.4, 1.96> (Node-2.2, Leaf-5: x6 = 1.4). The details of Node-2 and Node-0 are updated for their CFs.
Figure: points with their ε-neighbourhoods drawn around them.
For the given ε and MinPts = 7, the yellow point is a core point.
For the same ε and MinPts = 7, the blue point is a border point.
For the same ε and MinPts = 7, the red points are noise points.
Let the distance of a point from its k-th nearest neighbour be called k-dist.
For points that belong to the same cluster, the value of k-dist will be small if k is not larger than the cluster size.
Across all the points there will be some variation in k-dist, but it will not be huge unless the cluster densities are radically different.
For noise points, k-dist will be relatively large.
If k-dist is calculated for all the data for some value of k and sorted in increasing order, a sharp change in the value of k-dist suggests a suitable ε, and the chosen k serves as MinPts.
The points for which k-dist ≤ ε are core points; the other points are border or noise points.
Figure: a reachability-distance plot over the cluster ordering of the points.
OrderSeeds: (J, 20), (K, 20), (L, 31), (C, 40), (M, 40),(R, 43)
OrderSeeds: (L, 19), (K, 20),(R, 21), (M, 30),(P, 31), (C, 40)
OrderSeeds: (M, 18), (K, 18), (R, 20), (P, 21), (N, 35), (C, 40)
The process continues until there are no more points in the dataset or in OrderSeeds.
The valleys in the cluster-ordering plot show two density regions.
The data may be uniformly distributed (random structure) and applying a clustering technique might create
meaningless clusters.
Cluster analysis is meaningful only when there are well separated points and there are natural clusters in the
data (non-random structure).
Hopkins Statistic is one measure that helps to assess the nature of the data: random or non-random structure.
o n data points p1, p2,.....pn are uniformly sampled (equal probability of getting selected, no bias) from the data set D.
For each point pi (1<= i <= n), the nearest neighbor is found out in D. Let xi be the distance between pi and its
nearest neighbor in D. So xi = min (dist (pi, v)), v ∈ D.
o n data points q1, q2,.....qn are uniformly sampled (equal probability of getting selected, no bias) from the data sets
D. For each point qi (1<= i <= n), the nearest neighbor is found out in (D-qi ) and qi is not added back in D. Let yi be
the distance between qi and its nearest neighbor in (D-qi ). So yi = min (dist (qi, v)), v ∈ (D-qi).
o Hopkins Statistic (H) is defined as:

    H = Σ_{i=1}^{n} y_i / ( Σ_{i=1}^{n} x_i + Σ_{i=1}^{n} y_i )
If the data points are uniformly distributed the value of xi and yi will be close to each other and the value of H will
be close to 0.5 or more. There will be no meaningful clusters (Homogenous Hypothesis).
If the data points are non-randomly distributed the value yi will be significantly smaller than xi and the value of H
will be close to 0. There will be meaningful clusters (Alternative Hypothesis).
D is a data set = {0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5}
Randomly sampled 5 points (pi)= {0.5, 1.5, 2.0, 2.5, 4.0}
Nearest neighbors are to be found from D
∑xi = 0.5+0.5+0.5+0.5+0.5 = 2.5
The second set of sampling, where the nearest neighbors are to be found from (D – qi)
q1 = 1.0 x1 = (D - q1) = {0, 0.5, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5} y1 = 0.5
q2 = 1.5 x2 = (x1 - q2) = {0, 0.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5} y2 = 0.5
q3 = 2.5 x3 = (x2 - q3) = {0, 0.5, 2.0, 3.0, 3.5, 4.0, 4.5} y3 = 0.5
q4 = 3.0 x4 = (x3 - q4) = {0, 0.5, 2.0, 3.5, 4.0, 4.5} y4 = 0.5
q5 = 3.5 x5 = (x4 - q5) = {0, 0.5, 2.0, 4.0, 4.5} y5 = 0.5
∑yi = 0.5+0.5+0.5+0.5+0.5 = 2.5
H = 2.5 / (2.5 + 2.5) = 0.5, which means there are no meaningful clusters; the alternative hypothesis is
rejected.
This illustration is shown only from a calculation perspective. The sampling and the iterations have to be
repeated several times and the average of H taken before drawing any inference.
Also review another variant of the Hopkins statistic from the other text book.
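A minimal sketch (not from the slides) of the Hopkins statistic for a small 1-D data set, following the definition above; as in the worked example, the nearest neighbour of each pi excludes pi itself, and each sampled qi is removed and not added back.

```python
import numpy as np

def hopkins(D, n, seed=0):
    rng = np.random.default_rng(seed)
    D = np.asarray(D, dtype=float)

    # x_i: distance of each sampled point to its nearest *other* point in D
    p_idx = rng.choice(len(D), size=n, replace=False)
    x = [np.min(np.abs(np.delete(D, i) - D[i])) for i in p_idx]

    # y_i: sampled points are removed one by one and not added back
    remaining = D.copy()
    y = []
    for _ in range(n):
        j = rng.integers(len(remaining))
        qi = remaining[j]
        remaining = np.delete(remaining, j)
        y.append(np.min(np.abs(remaining - qi)))

    return sum(y) / (sum(x) + sum(y))

D = [0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5]
print(round(hopkins(D, n=5), 2))   # close to 0.5 for this evenly spaced data: no meaningful clusters
```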
Measuring/Validation of Cluster Quality
Silhouette Coefficient
a(p) = [ Σ_{q ∈ Ci, q ≠ p} dist(p, q) ] / ( |Ci| - 1 )
b(p) = min_{1 <= j <= k, j ≠ i} [ Σ_{q ∈ Cj} dist(p, q) / |Cj| ]
where p belongs to cluster Ci and the clustering has k clusters.
The Silhouette Coefficient is defined as:
s(p) = ( b(p) - a(p) ) / max{ a(p), b(p) }
The value of the Silhouette Coefficient ranges from -1 to 1.
A value near 1 means p is far away from the other clusters.
A value near -1 means p is closer to the points in other clusters than to the points in its own cluster.
The average s for a cluster (over all points in it) and the average s for the clustering (over all clusters in the data set) can be calculated.
Silhouette Coefficient
Working Procedure
s(P1) = {b(P1) - a(P1)} / max{a(P1), b(P1)} = (1.88 - 0.50) / max(0.50, 1.88) = 1.38 / 1.88 = 0.73
s(P2) = {b(P2) - a(P2)} / max{a(P2), b(P2)} = (2.50 - 0.50) / max(0.50, 2.50) = 2.00 / 2.50 = 0.80
s(P3) = {b(P3) - a(P3)} / max{a(P3), b(P3)} = (1.88 - 0.75) / max(0.75, 1.88) = 1.13 / 1.88 = 0.60
s(P4) = {b(P4) - a(P4)} / max{a(P4), b(P4)} = (2.50 - 0.75) / max(0.75, 2.50) = 1.75 / 2.50 = 0.70
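A minimal sketch (hypothetical 2-D points and cluster labels, not the P1..P4 example above) of computing a(p), b(p) and s(p) exactly as defined above.

```python
import numpy as np

def silhouette(X, labels, i):
    """Silhouette coefficient s(p) for point X[i] given one cluster label per point."""
    d = np.linalg.norm(X - X[i], axis=1)            # distances from X[i] to every point
    own = labels == labels[i]
    a = d[own & (np.arange(len(X)) != i)].mean()    # average distance within its own cluster
    b = min(d[labels == c].mean() for c in set(labels) if c != labels[i])
    return (b - a) / max(a, b)

X = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1], [4.0, 4.0], [4.1, 3.9]])
labels = np.array([0, 0, 0, 1, 1])
print([round(silhouette(X, labels, i), 2) for i in range(len(X))])   # all close to 1: compact, well separated clusters
```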
Haversine distance is a useful measure when GPS coordinates are given and distance has
to be calculated for clustering. Verification can be done using an online calculator from
different sources.
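A minimal sketch of the haversine distance between two GPS coordinates, using the standard formula with an Earth radius of about 6371 km; the coordinates below are only illustrative, and an online calculator can be used to cross-check the result.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (latitude, longitude) points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

print(round(haversine_km(28.61, 77.21, 19.08, 72.88), 1))   # roughly 1150 km (Delhi to Mumbai)
```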
Global Outliers
Contextual Outliers
Collective Outliers
Challenges of Outlier Detection
Outlier detection approaches can be grouped in two ways: by whether sample data labeled by a domain expert is available as a reference, and by the assumptions made about outliers vs. the rest of the data:
o Statistical approaches: parametric and non-parametric
o Proximity based approaches: distance based and density based
o Clustering based approaches
o Classification based approaches
o Grid based approaches
Ten sample temperature values are given as: 24.0, 28.9, 28.9, 29.0, 29.1, 29.1,
29.2, 29.2, 29.3, 29.4 in °C.
Mean (μ) = 28.61 °C
Standard Deviation (σ) = 1.51
The sample 24.0 °C is 4.61 °C below the mean. That is 4.61/1.51 = 3.05 σ steps
below the mean. So 24.0 °C can be considered an outlier because it is below (μ - 3σ).
The z-score value of 24 = (24-28.61)/1.51 = -3.05
Looking into the z-Table the probability for -3.05 z-score is 0.0011 or 0.11%
(low probability indicates it is more unlikely that 24.0 is generated by the normal
distribution)
gi = |xi - x̄| / s
where x̄ is the sample mean and s the sample standard deviation. The object xi is an outlier if:
gi >= ((N - 1) / √N) · √( tα² / (N - 2 + tα²) )
where,
N = count of objects in the dataset
α = significance level
tα = value taken by the t-distribution at a significance level of α/(2N) with (N - 2) degrees of freedom
Example
Grubbs' Test
Ten sample temperature values are given as: 24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4 in °C.
Mean (μ) = 28.61 °C
Standard Deviation (σ) = 1.51
N = 10
α = 0.05
gi for 24.0 = |24.0 - 28.61| / 1.51 = 3.05
tα = 3.833 (the value from the t-distribution table at α/(2N) = 0.05/(2×10) = 0.0025 with 10 - 2 = 8 degrees of freedom).
The critical value works out to (9/√10) · √(3.833² / (8 + 3.833²)) ≈ 2.29. Since gi = 3.05 is greater than this, 24.0 is an outlier.
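A minimal sketch of the same test using scipy's t distribution in place of a printed table; note the slides round σ to 1.51 while numpy's population standard deviation comes out slightly larger, but the conclusion is the same.

```python
import numpy as np
from scipy import stats

data = np.array([24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4])
alpha, N = 0.05, len(data)

g = np.abs(data - data.mean()) / data.std()                       # g_i for every observation
t = stats.t.ppf(1 - alpha / (2 * N), N - 2)                       # about 3.83
g_crit = (N - 1) / np.sqrt(N) * np.sqrt(t**2 / (N - 2 + t**2))    # about 2.29

print(g.max() >= g_crit)   # True: the largest g_i belongs to 24.0, which is flagged as an outlier
```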
Mahalanobis(x, x̄) = [x - x̄] · S^(-1) · [x - x̄]^T
where x̄ is the mean of the data X and S is the covariance matrix of X.
In this example, a Mahalanobis distance of > 4.0 is considered enough to declare a point an outlier.
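A minimal sketch (hypothetical 2-D data, not the example referred to above) of computing this distance for every point with numpy.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(50, 2)),   # 50 well-behaved points
               [[8.0, 8.0]]])                    # one obvious outlier

mean = X.mean(axis=0)
S_inv = np.linalg.inv(np.cov(X, rowvar=False))   # inverse of the covariance matrix S

d = np.array([(x - mean) @ S_inv @ (x - mean) for x in X])   # Mahalanobis distance per point
print(d.argmax(), round(d.max(), 1))   # the injected point has by far the largest distance
```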
χ²(o) = Σ_{i=1}^{n} (oi - Ei)² / Ei
where,
oi = the value of object o in the i-th dimension
Ei = mean of the i-th dimension over all objects
n = dimensionality
f(x) = (1 / (√(2π) · σ)) · exp( -(x - μ)² / (2σ²) )
For the shown dataset, find out the outliers if r = 2 units, π = 1/3 and the L1 norm is to be used as the distance measure.
(Figure: a scatter plot of labeled points, including G, H, I and J.)
DBSCAN and DB (r, π) identify outliers with a global view of the data set, the Global Outliers.
In practice, datasets could demonstrate a more complex structure where objects may be considered
outliers with respect to their local neighbourhood. (a data set with different densities).
In the shown data distribution, there are two clusters C1 and C2.
Object O3 can be declared as distance based outlier because it is far from the majority of the
objects.
What about objects O1 and O2?
The distance of O1 and O2 from the objects of cluster C1 is
smaller than the average distance of an object from its
nearest neighbour in the cluster C2.
O1 and O2 are not distance based outliers. But they are
outliers with respect to the cluster C1 because they
deviate significantly from other objects of C1.
Similarly the distance between O4 and its nearest neighbour in C2 is higher than the distance
between O1 or O2 and their nearest neighbours in C1, still O4 may not be an outlier because
C2 is sparse. Distance based detection does not capture local outliers. There is a need for a
different approach.
k-distance and its Neighborhood
Local Proximity Based Outliers
To identify local outliers, there is a need to establish a few new measures. The k-distance and Nk(x) are
the first of them.
The k-distance of an object x in the dataset D denoted by distk(x) is defined as the distance dist(x, p)
between x and p where p is also ∈ D, such that:
There are at least k objects y ∈ (D - x), such that dist (x, y) <= dist (x, p); excluding same distance points
There are at most (k-1) objects z ∈ (D - x), such that dist (x, z) < dist (x, p); excluding same distance points
In the other words, distk(x) is the distance between x and its k-nearest neighbor. It can be understood
from the following examples.
Nk(x) denotes the k-distance neighborhood of x, and |Nk(x)| is the count of points in it. There could be more than k
points in Nk(x) because multiple points could be at the same distance from x.
(Figure: three examples, each showing a point x and its 3rd nearest neighbour p.)
In each example, for k = 3: dist3(x) = dist(x, p); at least 3 objects y satisfy dist(x, y) <= dist(x, p) and at most 2 objects z satisfy dist(x, z) < dist(x, p).
Example 1: Nk(x) = 3. Example 2: Nk(x) = 4. Example 3: all the blue points are equidistant from x, so Nk(x) = 6.
(Figure: a point x, its 3rd nearest neighbour p, a point y1 inside the 3-neighborhood of x and a point y2 outside it. The red points show the k-neighborhood of x.)
For k = 3:
dist3(x) = dist(x, p)
reachdist3(y1 ← x) = dist3(y1) = dist(x, p)
reachdist3(y2 ← x) = dist(x, y2)
In general, the reachability distance is reachdistk(y ← x) = max{ distk(y), dist(x, y) }.
lrdk(x) = |Nk(x)| / Σ_{y ∈ Nk(x)} reachdistk(y ← x)
LOFk(x) = [ Σ_{y ∈ Nk(x)} lrdk(y) / lrdk(x) ] / |Nk(x)|
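A minimal sketch (not from the slides) of these definitions with numpy, assuming Euclidean distance and the reachability distance reachdistk(y ← x) = max{distk(y), dist(x, y)}; LOF values close to 1 indicate points as dense as their neighbours, while values well above 1 flag local outliers.

```python
import numpy as np

def k_distance_and_neighbors(X, i, k):
    """dist_k(x) for X[i] and N_k(x): indices of all points within that distance (ties included)."""
    d = np.linalg.norm(X - X[i], axis=1)
    d[i] = np.inf                        # exclude the point itself
    kdist = np.sort(d)[k - 1]            # distance to the k-th nearest neighbour
    return kdist, np.where(d <= kdist)[0]

def lrd(X, i, k):
    """Local reachability density lrd_k(x) of X[i]."""
    _, neighbors = k_distance_and_neighbors(X, i, k)
    reach = [max(k_distance_and_neighbors(X, j, k)[0],   # dist_k(y)
                 np.linalg.norm(X[i] - X[j]))            # dist(x, y)
             for j in neighbors]
    return len(neighbors) / sum(reach)

def lof(X, i, k):
    """Local outlier factor LOF_k(x) of X[i]."""
    _, neighbors = k_distance_and_neighbors(X, i, k)
    lrd_x = lrd(X, i, k)
    return sum(lrd(X, j, k) / lrd_x for j in neighbors) / len(neighbors)

# Hypothetical data: a small dense cluster plus one stray point.
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [0.3, 0.2], [5.0, 5.0]])
print([round(lof(X, i, k=3), 2) for i in range(len(X))])   # the last value is much larger than 1
```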
precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|
recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|
(Figure: a Venn diagram of all documents showing the Relevant and Retrieved sets and their overlap, Relevant & Retrieved.)
The Vector Space Model, discussed in detail in this module, can be used for text retrieval.
It is also called the term-frequency model.
Vector Space Model
Basic Idea
With a set of documents d and a set of t terms, each document can be considered as a vector v
in a t-dimensional space R^t.
Term frequency is the number of occurrences of term t in the document d. It is denoted by freq(d, t).
Term Frequency Matrix TF(d, t) elements are defined as 0 if the document does not contain the term,
and nonzero otherwise. There are several ways to define or weight the elements of TF (d, t). The Cornell
SMART system uses the following formula to compute the term frequency:
TF(d, t) = 0, if freq(d, t) = 0
         = 1 + log(1 + log(freq(d, t))), otherwise
If a term t occurs in many documents, its importance will be scaled down due to its reduced
discriminative power. So there is another important measure, called Inverse Document Frequency (IDF):
IDF(t) = log( (1 + |d|) / |dt| ), where |d| is the total number of documents and |dt| is the number of documents that contain the term t.
Document/Term t1 t2 t3 t4 t5 t6 t7
d1 0 4 10 8 0 5 0
d2 5 19 7 16 0 0 32
d3 15 0 0 4 9 0 17
d4 22 3 12 0 5 15 0
d5 0 7 0 9 2 4 12
The table shows a term frequency matrix TF(d, t) where each row represents a
document vector, each column represents a term and each entry registers freq(di, tj).
For the 6th term t6, in the 4th document (d4):
TF (d4, t6) = 1+log(1+log(15)) = 1.34
IDF (t6) = log ((1+5)/3) = 0.30
So, TF-IDF (d4, t6) = 1.34 x 0.30 = 0.40
TF-IDF is a numerical statistic that is intended to reflect how important a term is to a
document in a collection of documents.
If a user is interested in the documents that contain a specific term, then TF-IDF can be
used as a ranking measure and the documents can be listed in decreasing order of TF-IDF.
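A minimal sketch reproducing the calculation above, assuming the Cornell SMART TF weighting and IDF(t) = log((1 + |d|)/|dt|) with base-10 logarithms (which matches the worked numbers).

```python
import math

freq = {                                   # the term frequency matrix freq(d, t) from the table
    'd1': [0, 4, 10, 8, 0, 5, 0],
    'd2': [5, 19, 7, 16, 0, 0, 32],
    'd3': [15, 0, 0, 4, 9, 0, 17],
    'd4': [22, 3, 12, 0, 5, 15, 0],
    'd5': [0, 7, 0, 9, 2, 4, 12],
}

def tf(f):                                 # Cornell SMART term frequency
    return 0.0 if f == 0 else 1 + math.log10(1 + math.log10(f))

def idf(j):                                # inverse document frequency of term j
    n_docs = len(freq)
    n_containing = sum(1 for row in freq.values() if row[j] > 0)
    return math.log10((1 + n_docs) / n_containing)

print(round(tf(freq['d4'][5]) * idf(5), 2))   # TF-IDF of t6 in d4: about 1.34 * 0.30 = 0.40
```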
Text Indexing Techniques are used for text retrieval from unstructured text.
One such technique is based on an inverted index, an index structure that
maintains two hash-indexed or B+-tree-indexed tables:
o Document Table: consists of a set of document records, each containing two
fields: doc id and posting list, where posting list is a list of terms (or pointers to
terms) that occur in the document, sorted according to some relevance measure.
o Term Table: consists of a set of term records, each containing two fields: term id
and posting list, where posting list specifies a list of document identifiers in which
the term appears.
It facilitates queries like: “Find all of the documents associated with a given set of
terms” or “Find all of the terms associated with a given set of documents”.
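A minimal sketch (hypothetical documents) of the two tables, kept here as plain dictionaries rather than hash or B+-tree indexed structures.

```python
docs = {
    'doc1': "data mining finds patterns in data",
    'doc2': "text mining works on unstructured text",
}

document_table = {}   # doc id -> list of terms that occur in the document (its posting list)
term_table = {}       # term   -> list of doc ids in which the term appears

for doc_id, text in docs.items():
    terms = set(text.lower().split())              # trivial tokenization, no stemming or stop list
    document_table[doc_id] = sorted(terms)
    for term in terms:
        term_table.setdefault(term, []).append(doc_id)

print(term_table['mining'])       # all documents associated with the term "mining"
print(document_table['doc2'])     # all terms associated with doc2
```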
The signature file for a document stores term-related information that is created
after the pre-processing steps: tokenization, stemming and applying the stop list.
There is limited space to store a signature file with each document, so there are
techniques to encode and compress the signature file.
Motivation
o Automatic classification for the large number of on-line text
documents (Web pages, e-mails, online library, etc.)
Classification Process
o Data preprocessing
o Definition of training set and test sets
o Creation of the classification model using the selected classification
algorithm
o Classification model validation
o Classification of new/unknown text documents
Motivation
o Automatically group related documents based on their contents.
o No predetermined training sets or taxonomies – unsupervised.
Clustering Process
o Data preprocessing: remove stop words, stem, feature extraction
etc.
o Even after pre-processing, the curse of dimensionality would be
intimidating. So dimensionality reduction is applied first, followed by
traditional clustering techniques or by spectral clustering, mixture
model clustering, clustering using Latent Semantic Indexing, or
clustering using Locality Preserving Indexing. Several of
these areas are covered under Natural Language Processing (NLP).
Let M be the transition matrix for the given graph and vt be the PageRank vector
at iteration t.
So, the PageRank vector at iteration t+1 is given by: vt+1 = M.vt
Two matrices A and B can be multiplied if the count of columns in A is equal to
the count of rows in B. Therefore:
M x v0 = v1:
[ 0    1/2  1   0   ]   [ 1/4 ]   [ 9/24 ]
[ 1/3  0    0   1/2 ] x [ 1/4 ] = [ 5/24 ]
[ 1/3  0    0   1/2 ]   [ 1/4 ]   [ 5/24 ]
[ 1/3  1/2  0   0   ]   [ 1/4 ]   [ 5/24 ]

M x v1 = v2:
[ 0    1/2  1   0   ]   [ 9/24 ]   [ 15/48 ]
[ 1/3  0    0   1/2 ] x [ 5/24 ] = [ 11/48 ]
[ 1/3  0    0   1/2 ]   [ 5/24 ]   [ 11/48 ]
[ 1/3  1/2  0   0   ]   [ 5/24 ]   [ 11/48 ]
and so on.......
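A minimal sketch of this power iteration with numpy: the transition matrix is multiplied repeatedly into the PageRank vector until it stops changing.

```python
import numpy as np

M = np.array([[0,   1/2, 1, 0  ],
              [1/3, 0,   0, 1/2],
              [1/3, 0,   0, 1/2],
              [1/3, 1/2, 0, 0  ]])
v = np.full(4, 1/4)                      # v0 = [1/4, 1/4, 1/4, 1/4]

for _ in range(50):
    v_next = M @ v                       # v_{t+1} = M . v_t
    if np.allclose(v_next, v, atol=1e-10):
        break
    v = v_next

print(np.round(v, 4))   # the first step gives [9/24, 5/24, 5/24, 5/24]; the loop then converges
```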
Exercise
(Figure: a small web graph with three pages X1, X2 and X3.)
Answer:
        X1   X2   X3
X1   [ 1/2  1/2   0 ]
X2   [ 1/2   0    1 ]  = M
X3   [  0   1/2   0 ]

v1 = [ 1/3, 1/2, 1/6 ] and, after repeated iterations, vn = [ 6/15, 6/15, 3/15 ].
A physical bookstore may have several thousand books on its shelves, but Amazon offers millions of
books.
Physical newspaper can print several dozen articles per day, while on-line news services offer thousands
per day.
Recommendation in the physical world is fairly simple: a store carries only the most popular items, and it
is not possible to tailor the store to each individual customer.
The distinction between the physical and on-line worlds has been called the long tail phenomenon, and
it is captured in figure below. The vertical axis represents popularity. The items are ordered on the
horizontal axis according to their popularity.
Physical institutions provide only the most popular items to the left of the vertical line, while the
corresponding on-line institutions provide the entire range of items: the tail as well as the popular
items.
(Figure: the long tail; popularity on the vertical axis, items ordered by decreasing popularity on the horizontal axis.)
Model for Recommendation
Systems: The Utility Matrix
There are two classes of entities, referred to as users and items. Users have preferences for certain items
and these preferences must be extracted out of the data.
The data itself is represented as a utility matrix, giving for each user-item pair, a value that represents what
is known about the degree of preference of that user for that item.
Values of the utility matrix come from an ordered set (e.g. star ratings of 1 to 5). Non-available values are
left blank, so matrix is sparse.
The example table shown below captures the ratings of users (A, B, C, D) to the Harry Potter (HP), Twilight
(TW) and Star Wars (SW) movies.
The goal of a recommendation system is to predict the blanks in the utility matrix. For example, will user A
like SW2?
There is little information from the matrix to predict whether user A would like SW2. So the
recommendation system can be designed taking into account the properties of movies, such as their
producer, director, stars, or even the similarity of their names.
If SW1 and SW2 are similar, then it can be concluded that since A did not like SW1, it is unlikely to enjoy
SW2 either. It is not necessary to predict every blank entry in a utility matrix. Most of the time, the goal is
to suggest a few entries that the user would value highly.
        HP1  HP2  HP3  TW  SW1  SW2  SW3
A        4              5    1
B        5    5    4
C                       2    4    5
D                  3                    3
Content Based Systems
Item Profiles
Three computers A, B and C and their numerical features are listed below:
Features/Computers A B C
Processor Speed 3.06 2.68 2.92
Disk Size 500 320 640
Main Memory 6 4 6
Item Profile for computer A is the vector [3.06, 500, 6].
User X has rated these computers with ratings A:4, B:2 and C:5 (Average = 11/3).
User ratings can be normalized with the average (A:1/3, B:-5/3, C:4/3).
User profile vector can be created as [0.45, 486.67, 3.33]:
o Processor Speed: (3.06 *1/3) + (2.68*-5/3) + (2.92*4/3) = 0.45
o Disk Size: (500*1/3) + (320*-5/3) + (640*4/3) = 486.67
o Main Memory = (6*1/3) + (4*-5/3) + (6*4/3) = 3.33
This user profile can be used to recommend a new type of computer (say D) based on
the cosine similarity between computer D feature vector and the user profile vector.
Scaling may be required, because in the present form the Disk Size will dominate.
Scaling factor 1 for Processor Speed, α for the Disk Size, and β for the Main Memory
can be taken with suitable values of α and β.
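A minimal sketch of the profile construction above, followed by a cosine similarity check against a hypothetical new computer D; no scaling is applied here, so Disk Size dominates exactly as the note above warns (the factors α and β would be folded into both vectors).

```python
import numpy as np

def cosine(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

items = np.array([[3.06, 500, 6],     # computer A: processor speed, disk size, main memory
                  [2.68, 320, 4],     # computer B
                  [2.92, 640, 6]])    # computer C
ratings = np.array([4.0, 2.0, 5.0])   # user X's ratings for A, B and C

norm_ratings = ratings - ratings.mean()   # [1/3, -5/3, 4/3]
profile = norm_ratings @ items            # [0.45, 486.67, 3.33], as computed above

D = np.array([3.2, 512, 8])               # hypothetical new computer D
print(round(cosine(profile, D), 3))       # higher similarity -> stronger case for recommending D
```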
Collaborative Filtering
Introduction and Similarity Measures
The process of identifying similar users and recommending what similar users like is called
collaborative filtering.
It is a different approach from Content Based Systems. Instead of using features of items to
determine their similarity, we focus on the similarity of the user ratings for the two items.
The challenge is how to measure the similarity of users or items from their rows or
columns in the utility matrix. It can be understood from the illustration below:
        HP1  HP2  HP3  TW  SW1  SW2  SW3
A        4              5    1
B        5    5    4
C                       2    4    5
D                  3                    3

A and C have two movies in common but their liking is different. A and B have just one movie in
common but their liking is the same.
Jaccard Distance: A and B have an intersection of size 1 and a union of size 5. Thus, their
Jaccard similarity is 1/5, and their Jaccard distance is 4/5; that is, they are very far apart. In
comparison, A and C have a Jaccard similarity of 2/4, so their Jaccard distance is 1/2. Thus, A
appears closer to C than to B. Yet that conclusion seems intuitively wrong. A and C disagree on
the two movies they both watched (different ratings), while A and B seem both to have liked
the one movie they watched in common.
Cosine Similarity: Between A and B, it is 0.38 and between A and C it is 0.32. This measure
tells us that A is slightly closer to B than to C.
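A minimal sketch of both comparisons, using the utility matrix rows for A, B and C as reconstructed above (blanks treated as 0 for the cosine measure).

```python
import numpy as np

ratings = {                       # columns: HP1, HP2, HP3, TW, SW1, SW2, SW3
    'A': [4, 0, 0, 5, 1, 0, 0],
    'B': [5, 5, 4, 0, 0, 0, 0],
    'C': [0, 0, 0, 2, 4, 5, 0],
}

def jaccard_sim(u, v):
    a = {i for i, x in enumerate(u) if x}
    b = {i for i, x in enumerate(v) if x}
    return len(a & b) / len(a | b)

def cosine_sim(u, v):
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(jaccard_sim(ratings['A'], ratings['B']), jaccard_sim(ratings['A'], ratings['C']))   # 0.2 and 0.5
print(round(cosine_sim(ratings['A'], ratings['B']), 2),
      round(cosine_sim(ratings['A'], ratings['C']), 2))                                   # 0.38 and 0.32
```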
Collaborative Filtering
Rounding and Normalizing Ratings
A B C D E F G H
X 4 5 5 1 3 2
Y 3 4 3 1 2 1
Z 2 1 3 4 5 3