Professional Documents
Culture Documents
Food
Electronics
Bread
Skim 2%
Wheat White
Desktop Laptop Accessory TV DVD
Foremost Kemps
Printer Scanner
• Why should we incorporate concept hierarchy? • How do support and confidence vary as we
traverse the concept hierarchy?
– Rules at lower levels may not have enough support to – If X is the parent item for both X1 and X2, then
appear in any frequent itemsets σ(X) ≤ σ(X1) + σ(X2)
– Rules at lower levels of the hierarchy are overly – If σ(X1 ∪ Y1) ≥ minsup,
specific and X is parent of X1, Y is parent of Y1
• e.g., skim milk → white bread, then σ(X ∪ Y1) ≥ minsup, σ(X1 ∪ Y) ≥ minsup
2% milk → wheat bread, σ(X ∪ Y) ≥ minsup
skim milk → wheat bread, etc.
are indicative of an association between milk and bread
– If conf(X1 ⇒ Y1) ≥ minconf,
then conf(X1 ⇒ Y) ≥ minconf
Data Mining: Association Rules 66 Data Mining: Association Rules 67
• Approach 1: • Approach 2:
– Extend current association rule formulation by augmenting – Generate frequent patterns at highest level first
each transaction with higher level items
• Original Transaction:
– {skim milk, wheat bread} – Then, generate frequent patterns at the next
• Augmented Transaction: highest level, and so on…
– {skim milk, wheat bread, milk, bread, food}
• Issues: • Issues:
– Items that reside at higher levels have much higher support – I/O requirements increase dramatically because we
counts need to perform more passes over the data
• if support threshold is low, there are too many frequent – May miss some potentially interesting cross-level
patterns involving items from the higher levels
association patterns
– Increased dimensionality of the data
Data Mining: Association Rules 68 Data Mining: Association Rules 69
Sequence Data
Timeline
10 15 20 25 30 35
Sequence Database:
Object Timestamp Events Object A:
Mining Sequential Patterns A 10 2, 3, 5 2
3
6
1
1
A 20 6, 1
5
A 23 1
B 11 4, 5, 6
B 17 2 Object B:
B 21 7, 8, 1, 2 4 2 7 1
B 28 1, 6 5 8 6
C 14 1, 8, 7 6 1
2
Object C:
1
7
8
Database Sequence Element (Transaction) Event (Item) • A sequence is an ordered list of elements
Customer Purchase history of A set of items bought by a Books, diary (transactions)
a given customer customer at time t products, CDs, etc
Web data Browsing activity of A collection of files viewed Home page, index s = < e1 e2 e3 … >
a particular Web by a Web visitor after a page, contact info,
visitor single mouse click etc – Each element contains a collection of events (items)
Event data History of events Events triggered by a sensor Types of alarms
generated by a at time t generated by ei = {i1, i2, …, ik}
given sensor sensors
Genome DNA sequence of a An element of the DNA Bases A,T,G,C – Each element is attributed to a specific time or location
sequences particular species sequence
• Length of a sequence, |s|, is given by the number of
Element
Event elements of the sequence
(Transaction)
E1 E1 E3 (Item)
E2 E2 • A k-sequence is a sequence that contains k events
E2 E3 E4
Sequence (items)
Data Mining: Association Rules 72 Data Mining: Association Rules 73
xg = 2, ng = 0, ms= 4 • Approach 2:
Data sequence Subsequence Contain?
– Modify GSP to directly prune candidates that
violate timing constraints
< {2,4} {3,5,6} {4,7} {4,5} {8} > < {6} {5} > Yes
– Question:
< {1} {2} {3} {4} {5} > < {1} {4} > No
• Does the Apriori principle still hold?
< {1} {2,3} {3,4} {4,5} > < {2} {3} {5} > Yes
< {1,2} {3} {2,3} {3,4} {2,4} {4,5} > < {1,2} {5} > No
Given a candidate pattern: <{a, c}> • In some domains, we may have only one very long time
Any data sequences that contain series
– Example:
<… {a c} … >, • monitoring network traffic events for attacks
<… {a} … {c}…> (where time({c}) – time({a}) ≤ ws) • monitoring telecommunication alarm signals
<…{c} … {a} …> (where time({a}) – time({c}) ≤ ws) • Goal is to find frequent sequences of events in the time
series
will contribute to the support count of the pattern
– This problem is also known as frequent episode mining
E1 E3 E1 E1 E2 E4 E1 E2 E1
E2 E4 E2 E3 E4 E2 E3 E5 E2 E3 E5 E2 E3 E1