DATA MINING:
CHARACTERIZATION
How is it done?
Collect the task-relevant data (the initial relation) using a relational database
query
Perform generalization by attribute removal or attribute generalization, e.g.,
replacing relatively low-level values (such as numeric values for an attribute age) with
higher-level concepts (such as young, middle-aged, and senior)
Generalized relation (counts by Gender and Birth_Region):
Gender | Canada | Foreign | Total
M      |   16   |   14    |  30
F      |   10   |   22    |  32
Total  |   26   |   36    |  62
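As a concrete illustration of attribute generalization, here is a minimal sketch that replaces a numeric age attribute with higher-level concepts. The column names and cut-off ages are assumptions for illustration only, not values from the slides.

```python
# Minimal sketch of attribute generalization: numeric age -> concept level.
# The cut-offs (30, 60) and column names are assumed for illustration only.

def generalize_age(age):
    if age < 30:
        return "young"
    elif age < 60:
        return "middle-aged"
    return "senior"

rows = [
    {"gender": "M", "age": 24, "birth_region": "Canada"},
    {"gender": "F", "age": 47, "birth_region": "Foreign"},
]

generalized = [
    {**row, "age": generalize_age(row["age"])}  # attribute generalization
    for row in rows
]
print(generalized)
```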
ATTRIBUTE RELEVANCE ANALYSIS
Why?
Which dimensions should be included?
How high a level of generalization?
Reduce the number of attributes; patterns become easier to understand
What?
A statistical method for preprocessing data:
filter out irrelevant or weakly relevant attributes
retain or rank the relevant attributes
How?
Data Collection
Analytical Generalization
Use information gain analysis (e.g., entropy or other measures) to identify
highly relevant dimensions and levels.
Relevance Analysis
Sort and select the most relevant dimensions and levels.
ID3 algorithm
builds a decision tree from training objects with known class labels
in order to classify testing objects
ranks attributes with the information gain measure
aims for minimal tree height, i.e.,
the least number of tests needed to classify an object
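To make the information gain ranking concrete, here is a minimal sketch of entropy and information gain for a categorical attribute. The tiny dataset, attribute names, and class labels are assumed purely for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy reduction achieved by splitting rows on one attribute."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Tiny illustrative dataset (assumed, not from the slides).
data = [
    {"outlook": "sunny",    "windy": "no",  "play": "no"},
    {"outlook": "sunny",    "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "no",  "play": "yes"},
    {"outlook": "rain",     "windy": "no",  "play": "yes"},
    {"outlook": "rain",     "windy": "yes", "play": "no"},
]

# Rank attributes by information gain, as ID3 does at each node.
for attr in ("outlook", "windy"):
    print(attr, round(information_gain(data, attr, "play"), 3))
```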
See example
TOP-DOWN INDUCTION OF DECISION TREE
Decision tree (reconstructed from the figure):
Outlook?
  sunny    -> Humidity?  high -> no,   normal -> yes
  overcast -> yes
  rain     -> Wind?      strong -> no, weak -> yes
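A minimal sketch of the same tree written as a nested structure, with a lookup that walks it. The attribute and value spellings come from the figure above; everything else is assumed.

```python
# The decision tree from the figure, written as nested dicts.
# Leaves are class labels ("yes"/"no"); inner nodes map attribute values to subtrees.
tree = {
    "Outlook": {
        "sunny":    {"Humidity": {"high": "no", "normal": "yes"}},
        "overcast": "yes",
        "rain":     {"Wind": {"strong": "no", "weak": "yes"}},
    }
}

def classify(node, example):
    """Walk the tree until a leaf (class label) is reached."""
    while isinstance(node, dict):
        attribute, branches = next(iter(node.items()))
        node = branches[example[attribute]]
    return node

print(classify(tree, {"Outlook": "sunny", "Humidity": "normal", "Wind": "weak"}))  # yes
```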
SIMILARITY AND DISTANCE
For many different problems we need to quantify how close two objects are.
Examples:
For an item bought by a customer, find other similar items
Group together the customers of a site so that similar customers are shown the same
ad.
Group together web documents so that you can separate the ones that talk about
politics and the ones that talk about sports.
Find all the near-duplicate mirrored web documents.
Find credit card transactions that are very different from previous transactions.
To solve these problems we need a definition of similarity, or distance.
The definition depends on the type of data that we have
SIMILARITY
Numerical measure of how alike two data objects are.
A function that maps pairs of objects to real values
Higher when objects are more alike.
Often falls in the range [0,1], sometimes in [-1,1]
JSim(X, Y) = |X ∩ Y| / |X ∪ Y|
Example: two sets with 3 elements in the intersection and 8 in the union have
Jaccard similarity = 3/8
Extreme behavior:
JSim(X,Y) = 1 iff X = Y
JSim(X,Y) = 0 iff X and Y have no elements in common
JSim is symmetric
JACCARD SIMILARITY BETWEEN SETS
Documents as sets of words:
D1: "Vefa releases new book with apple pie recipes"
D2: "apple releases new ipod"
D3: "apple releases new ipad"
D4: "new apple pie recipe"
JSim(D2,D3) = 3/5
JSim(D4,D2) = JSim(D4,D3) = 2/6
JSim(D1,D2) = JSim(D1,D3) = 3/9
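A minimal sketch of Jaccard similarity on word sets; the helper name is mine, and the documents are the ones listed above.

```python
def jaccard(x, y):
    """Jaccard similarity between two sets: |intersection| / |union|."""
    x, y = set(x), set(y)
    return len(x & y) / len(x | y)

d1 = "Vefa releases new book with apple pie recipes".lower().split()
d2 = "apple releases new ipod".lower().split()
d3 = "apple releases new ipad".lower().split()
d4 = "new apple pie recipe".lower().split()

print(jaccard(d2, d3))  # 3/5 = 0.6
print(jaccard(d4, d2))  # 2/6 ~ 0.33
print(jaccard(d1, d2))  # 3/9 ~ 0.33
```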
SIMILARITY BETWEEN VECTORS
Documents (and sets in general) can also be represented as vectors
document Apple Microsoft Obama Election
D1 10 20 0 0
D2 30 60 0 0
D3 60 30 0 0
D4 0 0 10 20
How do we measure the similarity of two vectors?
Sim(X,Y) = cos(X,Y)
If the vectors are aligned (correlated), the angle between them is zero degrees and cos(X,Y) = 1
If the vectors are orthogonal (no common coordinates), the angle is 90 degrees and
cos(X,Y) = 0
Cosine is commonly used for comparing documents, where we assume that the
vectors are normalized by the document length.
COSINE SIMILARITY - MATH
If d1 and d2 are two vectors, then
cos(d1, d2) = (d1 · d2) / (||d1|| * ||d2||),
where · indicates the vector dot product and ||d|| is the length (Euclidean norm) of vector d.
Example:
d1 = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
d2 = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2)
d1 · d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
||d1|| = (3^2 + 2^2 + 0^2 + 5^2 + 0^2 + 0^2 + 0^2 + 2^2 + 0^2 + 0^2)^0.5 = 42^0.5 ≈ 6.481
||d2|| = (1^2 + 0^2 + 0^2 + 0^2 + 0^2 + 0^2 + 0^2 + 1^2 + 0^2 + 2^2)^0.5 = 6^0.5 ≈ 2.449
cos(d1, d2) = 5 / (6.481 * 2.449) ≈ 0.3150
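A minimal sketch that reproduces the calculation above; the function name is mine.

```python
import math

def cosine_similarity(x, y):
    """cos(x, y) = dot(x, y) / (||x|| * ||y||)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
print(round(cosine_similarity(d1, d2), 4))  # ~0.315
```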
EXAMPLE
document Apple Microsoft Obama Election
D1 10 20 0 0
D2 30 60 0 0
D3 60 30 0 0
D4 0 0 10 20
Cos(D1,D2) = 1
HAMMING DISTANCE
Hamming distance is the number of positions in which bit-vectors differ.
Example: p1 = 10101, p2 = 10011.
d(p1, p2) = 2 because the bit-vectors differ in the 3rd and 4th positions.
For binary vectors the Hamming distance equals the L1 norm:
d(p1, p2) = ||p1 - p2||_1 = 2
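A minimal sketch of the Hamming distance on equal-length bit strings; the function name is mine.

```python
def hamming_distance(p1, p2):
    """Number of positions in which two equal-length bit strings differ."""
    assert len(p1) == len(p2)
    return sum(a != b for a, b in zip(p1, p2))

print(hamming_distance("10101", "10011"))  # 2
```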
DISTANCE BETWEEN STRINGS
weird wierd
intelligent unintelligent
Athena Athina
Important for recognizing and correcting typing errors and analyzing DNA
sequences.
EDIT DISTANCE FOR STRINGS
The edit distance of two strings is the number of inserts and deletes of
characters needed to turn one into the other.
Example: x = abcde ; y = bcduve.
Turn x into y by deleting a, then inserting u and v after d.
Edit distance = 3.
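A minimal sketch of the insert/delete-only edit distance described above (no substitutions), computed with dynamic programming; the function name is mine.

```python
def edit_distance(x, y):
    """Minimum number of single-character inserts and deletes turning x into y."""
    m, n = len(x), len(y)
    # dp[i][j] = distance between x[:i] and y[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i          # delete the remaining characters of x
    for j in range(n + 1):
        dp[0][j] = j          # insert all characters of y
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                dp[i][j] = 1 + min(dp[i - 1][j],   # delete from x
                                   dp[i][j - 1])   # insert into x
    return dp[m][n]

print(edit_distance("abcde", "bcduve"))  # 3
```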
APPLICATIONS OF SIMILARITY:
RECOMMENDATION SYSTEMS
IMPORTANT PROBLEM
Recommendation systems
When a user buys an item (initially books) we want to recommend other
items that the user may like
When a user rates a movie, we want to recommend movies that the user
may like
When a user likes a song, we want to recommend other songs that they may
like
A big success of data mining
UTILITY (PREFERENCE) MATRIX
Content-based:
Represent the items in a feature space and recommend
to customer C items similar to previous items rated
highly by C
Movie recommendations: recommend movies with same actor(s),
director, genre, …
Websites, blogs, news: recommend other sites with “similar”
content
CONTENT-BASED PREDICTION
To compare items with users we need to map users to the same feature space.
How?
Take all the movies that the user has seen and take the average vector
Other aggregation functions are also possible.
Recommend to user C the most similar item i, computing similarity in the common
feature space
Distributional distance measures also work well.
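A minimal sketch of content-based prediction under the scheme above: the user profile is the average of the feature vectors of items the user has rated, and candidate items are ranked by cosine similarity. All item vectors and names are assumed for illustration.

```python
import math

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def user_profile(rated_item_vectors):
    """Average the feature vectors of the items the user has seen/rated."""
    n = len(rated_item_vectors)
    return [sum(col) / n for col in zip(*rated_item_vectors)]

# Assumed item feature vectors (e.g., genre weights), not from the slides.
items = {
    "movie_a": [1.0, 0.0, 0.5],
    "movie_b": [0.9, 0.1, 0.4],
    "movie_c": [0.0, 1.0, 0.0],
}
seen = ["movie_a"]

profile = user_profile([items[i] for i in seen])
candidates = {name: cosine(profile, vec) for name, vec in items.items() if name not in seen}
print(max(candidates, key=candidates.get))  # most similar unseen item
```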
LIMITATIONS OF CONTENT-BASED APPROACH
Overspecialization
Never recommends items outside user’s content profile
People might have multiple interests
Two users are similar if they rate the same items in a similar way
Cosine similarity (assumes zero, i.e., unrated, entries are negatives):
Cos(A,B) = 0.38
Cos(A,C) = 0.32
USER SIMILARITY
Consider user c
Find set D of other users whose ratings are most “similar”
to c’s ratings
Estimate user’s ratings based on ratings of users in D using
some aggregation function
LECTURE 5
DATA MINING:
ASSOCIATION
WHAT IS ASSOCIATION MINING?
Examples.
Rule form: "Body → Head [support, confidence]".
buys(x, "diapers") → buys(x, "beers") [0.5%, 60%]
major(x, "CS") ^ takes(x, "DB") → grade(x, "A") [1%, 75%]
CONT.
Association Rule Mining is one of the ways to find patterns in data. It finds:
features (dimensions) which occur together
features (dimensions) which are “correlated”
What does the value of one feature tell us about the value of another feature?
For example, people who buy diapers are likely to buy baby powder. Or we can
rephrase the statement by saying: If (people buy diaper), then (they buy baby powder).
When to use Association Rules
We can use Association Rules in any dataset where features take only two values, i.e.,
0/1. Some examples are listed below:
Market Basket Analysis is a popular application of Association Rules.
People who visit webpage X are likely to visit webpage Y.
People in the age group [30,40] with income >$100k are likely to own a home.
BASIC CONCEPTS AND RULE MEASURES
Transaction database:
Tid | Items bought
10  | Beer, Nuts, Diaper
20  | Beer, Coffee, Diaper
30  | Beer, Diaper, Eggs
40  | Nuts, Eggs, Milk
50  | Nuts, Coffee, Diaper, Eggs, Milk

itemset: a set of one or more items
k-itemset: X = {x1, ..., xk}
(absolute) support, or support count, of X: frequency or number of occurrences of the itemset X
(relative) support, s: the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
An itemset X is frequent if X's support is no less than a minsup threshold
BASIC CONCEPTS: ASSOCIATION RULES
Transaction database (as above):
Tid | Items bought
10  | Beer, Nuts, Diaper
20  | Beer, Coffee, Diaper
30  | Beer, Diaper, Eggs
40  | Nuts, Eggs, Milk
50  | Nuts, Coffee, Diaper, Eggs, Milk

Find all the rules X → Y with minimum support and confidence
support, s: probability that a transaction contains X ∪ Y
confidence, c: conditional probability that a transaction containing X also contains Y
Let minsup = 50%, minconf = 50%
Frequent patterns: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3
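A minimal sketch that computes support and confidence for one candidate rule over the five transactions above; the helper functions and the chosen rule ({Beer} → {Diaper}) are my own illustration.

```python
transactions = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """P(rhs in transaction | lhs in transaction)."""
    return support(lhs | rhs) / support(lhs)

lhs, rhs = {"Beer"}, {"Diaper"}
print(support(lhs | rhs))     # 3/5 = 0.6  -> meets minsup = 50%
print(confidence(lhs, rhs))   # 3/3 = 1.0  -> meets minconf = 50%
```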
Given a set of transactions T, the goal of association rule mining is to find all
rules having
support ≥ minsup threshold
confidence ≥ minconf threshold
Brute-force approach:
List all possible association rules
Compute the support and confidence for each rule
Prune rules that fail the minsup and minconf thresholds
Computationally prohibitive!
MINING ASSOCIATION RULES
Observations:
• All rules generated from a given itemset, e.g., {Milk, Diaper, Beer}, are binary partitions of that itemset.
• Rules originating from the same itemset have identical support but can have different confidence.
Two-step approach:
1. Frequent Itemset Generation
Generate all itemsets whose support ≥ minsup
2. Rule Generation
Generate high confidence rules from each frequent itemset, where each
rule is a binary partitioning of a frequent itemset
FREQUENT ITEMSET GENERATION
ITEMSET CANDIDATE GENERATION
Brute-force approach:
Count the support of each candidate by scanning the database
Match each transaction against every candidate
Complexity: O(Nmw), where N is the number of transactions, m the number of candidates,
and w the transaction width; this is costly
FREQUENT ITEMSET GENERATION STRATEGIES
REDUCING NUMBER OF CANDIDATES
Apriori principle:
If an itemset is frequent, then all of its subsets must also be frequent
EXAMPLE APRIORI PRINCIPLE
THE APRIORI ALGORITHM—AN EXAMPLE
Supmin = 2
Database TDB:
Tid | Items
10  | A, C, D
20  | B, C, E
30  | A, B, C, E
40  | B, E

1st scan -> C1:
Itemset | sup
{A} | 2
{B} | 3
{C} | 3
{D} | 1
{E} | 3

L1 (candidates with sup >= Supmin):
Itemset | sup
{A} | 2
{B} | 3
{C} | 3
{E} | 3

C2 (generated from L1): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}

2nd scan -> C2 with counts:
Itemset | sup
{A,B} | 1
{A,C} | 2
{A,E} | 1
{B,C} | 2
{B,E} | 3
{C,E} | 2

L2:
Itemset | sup
{A,C} | 2
{B,C} | 2
{B,E} | 3
{C,E} | 2

C3: {B, C, E}
3rd scan -> L3:
Itemset | sup
{B,C,E} | 2
THE APRIORI ALGORITHM (PSEUDO-CODE)
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
IMPLEMENTATION OF APRIORI
How to generate candidates?
Step 1: self-joining Lk
Step 2: pruning
Example of candidate generation:
Self-joining: L3 * L3
abcd from abc and abd
acde from acd and ace
Pruning:
acde is removed because ade is not in L3
C4 = {abcd}
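The sketch below ties the pseudo-code and the candidate-generation steps together in Python: level-wise generation by self-join, Apriori pruning, and support counting. It is a minimal, assumed implementation, not the course's reference code; it reproduces the TDB example above with min_support = 2.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset(itemset): support_count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return {c: n for c, n in counts.items() if n >= min_support}

    # L1: frequent 1-itemsets
    items = {frozenset([i]) for t in transactions for i in t}
    frequent = count(items)
    all_frequent = dict(frequent)

    k = 1
    while frequent:
        # Self-join: union two frequent k-itemsets whose union has k+1 items.
        prev = list(frequent)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k + 1}
        # Apriori pruning: every k-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k))}
        frequent = count(candidates)
        all_frequent.update(frequent)
        k += 1
    return all_frequent

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
for itemset, sup in sorted(apriori(tdb, 2).items(),
                           key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(set(itemset), sup)
```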
MINING ASSOCIATION RULES—AN EXAMPLE
i.e., if {A, B} is a frequent itemset, both {A} and {B} must also be frequent
itemsets
Single-dimensional rules:
buys(X, "milk") → buys(X, "bread")
Multi-dimensional rules: 2 or more dimensions or predicates
Inter-dimension association rules (no repeated predicates)
age(X, "19-25") ^ occupation(X, "student") → buys(X, "laptop")
Categorical Attributes
finite number of possible values, no ordering among values
Quantitative Attributes
numeric, implicit ordering among values
FP-growth: Mining Frequent Patterns
Using FP-tree
MINING FREQUENT PATTERNS USING FP-TREE
3 MAJOR STEPS
Step 1
Construct a conditional pattern base for each frequent item from the FP-tree
Step 2
Construct a conditional FP-tree from each conditional pattern base
Step 3
Recursively mine conditional FP-trees and grow the frequent patterns obtained
so far. If a conditional FP-tree contains a single path, simply enumerate all the
patterns
STEP 1: CONSTRUCT CONDITIONAL PATTERN BASE
Starting at the bottom of the frequent-item header table in the FP-tree
Traverse the FP-tree by following the link of each frequent item
Accumulate all transformed prefix paths of that item to form its
conditional pattern base
Node-link property
For any frequent item ai, all the possible frequent patterns that
contain ai can be obtained by following ai's node-links, starting from
ai's head in the FP-tree header.
Example (summary of the FP-tree figure):
Header table (item : frequency): f:4, c:4, a:3, b:3, m:3, p:3
m-conditional pattern base: fca:2, fcab:1
m-conditional FP-tree: the single path {} -> f:3 -> c:3 -> a:3
All frequent patterns concerning m: m, fm, cm, am, fcm, fam, cam, fcam
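A minimal sketch of Step 1 under assumed inputs: it builds an FP-tree with node-links from a transaction list (items already ordered by descending frequency) and collects the conditional pattern base of one item. The five transactions are a textbook-style example consistent with the figure summary above; treat them as an assumption.

```python
from collections import defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 1
        self.children = {}

def build_fp_tree(transactions):
    """Insert each (frequency-ordered) transaction into the tree; keep node-links per item."""
    root = Node(None, None)
    node_links = defaultdict(list)          # item -> all tree nodes holding that item
    for t in transactions:
        node = root
        for item in t:
            if item in node.children:
                node.children[item].count += 1
            else:
                child = Node(item, node)
                node.children[item] = child
                node_links[item].append(child)
            node = node.children[item]
    return root, node_links

def conditional_pattern_base(item, node_links):
    """Prefix paths leading to each occurrence of `item`, with that occurrence's count."""
    base = []
    for node in node_links[item]:
        path, parent = [], node.parent
        while parent is not None and parent.item is not None:
            path.append(parent.item)
            parent = parent.parent
        if path:
            base.append((list(reversed(path)), node.count))
    return base

# Assumed example transactions, already sorted by item frequency (f > c > a > b > m > p).
transactions = [
    ["f", "c", "a", "m", "p"],
    ["f", "c", "a", "b", "m"],
    ["f", "b"],
    ["c", "b", "p"],
    ["f", "c", "a", "m", "p"],
]
root, links = build_fp_tree(transactions)
print(conditional_pattern_base("m", links))  # [(['f', 'c', 'a'], 2), (['f', 'c', 'a', 'b'], 1)]
```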
PRINCIPLES OF FP-GROWTH
CONDITIONAL PATTERN BASES AND
CONDITIONAL FP-TREE
SINGLE FP-TREE PATH GENERATION
EFFICIENCY ANALYSIS
Facts (usually):
1. The FP-tree is much smaller than the size of the DB
2. A conditional pattern base is smaller than the original FP-tree
3. A conditional FP-tree is smaller than its pattern base
The mining process therefore works on a set of usually much smaller pattern bases and
conditional FP-trees
Divide-and-conquer with a dramatic scale of shrinking
ADVANTAGES OF THE PATTERN GROWTH APPROACH
Divide-and-conquer:
Decompose both the mining task and DB according to the frequent patterns
obtained so far
Other factors
Basic operations are counting local frequent items and building sub FP-trees; no pattern
search and matching (Grahne and J. Zhu, FIMI'03)
INTERESTINGNESS MEASUREMENTS
Objective measures
Two popular measurements:
support, and
confidence
Subjective measures
A rule (pattern) is interesting if
it is unexpected (surprising to the user), and/or
actionable (the user can do something with it)
CRITICISM TO SUPPORT AND CONFIDENCE (CONT.)
Example 2:
X: 1 1 1 1 0 0 0 0
Y: 1 1 0 0 0 0 0 0
Z: 0 1 1 1 1 1 1 1
X and Y: positively correlated; X and Z: negatively related, yet the support and
confidence of X=>Z dominate:
Rule | Support | Confidence
X=>Y | 25%     | 50%
X=>Z | 37.50%  | 75%
We need a measure of dependent or correlated events:
corr(A,B) = P(A ∪ B) / (P(A) P(B)), where A ∪ B means that a transaction contains both A and B
P(B|A) / P(B) is also called the lift of rule A => B
OTHER INTERESTINGNESS MEASURES: INTEREST
A and B are negatively correlated if the value is less than 1; otherwise A and B are
positively correlated.
X: 1 1 1 1 0 0 0 0
Y: 1 1 0 0 0 0 0 0
Z: 0 1 1 1 1 1 1 1
Itemset | Support | Interest
X,Y     | 25%     | 2
X,Z     | 37.50%  | 0.9
Y,Z     | 12.50%  | 0.57
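A minimal sketch that reproduces the interest (lift) values in the table from the X, Y, Z indicator vectors; the helper names are mine, and the exact decimals differ only by rounding.

```python
def support(*vectors):
    """Fraction of transactions in which all given indicator vectors are 1."""
    n = len(vectors[0])
    return sum(all(v[i] for v in vectors) for i in range(n)) / n

def interest(a, b):
    """Interest (lift): P(A and B) / (P(A) * P(B)); < 1 means negative correlation."""
    return support(a, b) / (support(a) * support(b))

X = [1, 1, 1, 1, 0, 0, 0, 0]
Y = [1, 1, 0, 0, 0, 0, 0, 0]
Z = [0, 1, 1, 1, 1, 1, 1, 1]

print(round(interest(X, Y), 2))  # 2.0
print(round(interest(X, Z), 2))  # ~0.86 (about 0.9)
print(round(interest(Y, Z), 2))  # ~0.57
```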