Similarity and Dissimilarity
• Similarity
– Numerical measure of how alike two data objects are.
– Is higher when objects are more alike.
– Often falls in the range [0,1].
• Dissimilarity
– Numerical measure of how different two data objects are
– Lower when objects are more alike
– Minimum dissimilarity is often 0
– Upper limit varies
Euclidean Distance: Example
Distance Matrix
Minkowski Distance
Minkowski Distance: Examples
• r = 1. City block (Manhattan, taxicab, L1 norm) distance.
– A common example of this is the Hamming distance, which
is just the number of bits that are different between two
binary vectors
• r = 2. Euclidean distance
• Do not confuse r with n, i.e., all these distances are defined for
all numbers of dimensions.
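The cases above can be tried out with a small sketch. The function name `minkowski` and the sample points are illustrative assumptions, not part of the slides; the code only covers finite r.

```python
def minkowski(p, q, r):
    """Minkowski distance of order r between two equal-length numeric sequences.

    r is the order of the distance; the number of dimensions is len(p).
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of dimensions")
    return sum(abs(a - b) ** r for a, b in zip(p, q)) ** (1.0 / r)


# Hypothetical 2-dimensional points.
p = (0, 2)
q = (3, 1)
manhattan = minkowski(p, q, 1)  # r = 1: |0 - 3| + |2 - 1|
euclidean = minkowski(p, q, 2)  # r = 2: square root of (9 + 1)

# For binary vectors, r = 1 counts the differing bits (Hamming distance).
x = (1, 0, 1, 1, 0)
y = (1, 1, 0, 1, 0)
hamming = minkowski(x, y, 1)
```

The same function works for any number of dimensions, which is the point of the last bullet: r selects the distance, while the dimensionality comes from the data.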
Minkowski Distance: Examples
Common Properties of a Distance
• Distances, such as the Euclidean distance,
have some well known properties.
1. d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 only if
p = q. (Positive definiteness)
2. d(p, q) = d(q, p) for all p and q. (Symmetry)
3. d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r.
(Triangle Inequality)
where d(p, q) is the distance (dissimilarity) between points
(data objects), p and q.
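A minimal sketch of how the three properties could be checked numerically on a small sample. The helper `check_metric_properties` and the sample points are assumptions made for illustration; passing on a sample does not prove the properties hold in general, but a single failure is enough to show that a measure is not a distance in this sense.

```python
import itertools
import math


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def squared_euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def check_metric_properties(points, d, tol=1e-12):
    """Return True if d satisfies all three properties on the given sample."""
    for p, q in itertools.product(points, repeat=2):
        dist = d(p, q)
        # 1. Positive definiteness: d >= 0, and d == 0 exactly when p == q.
        if dist < 0 or (dist == 0) != (p == q):
            return False
        # 2. Symmetry.
        if abs(dist - d(q, p)) > tol:
            return False
    for p, q, r in itertools.product(points, repeat=3):
        # 3. Triangle inequality.
        if d(p, r) > d(p, q) + d(q, r) + tol:
            return False
    return True


# Hypothetical sample points.
points_2d = [(0, 2), (2, 0), (3, 1), (5, 1)]
euclidean_ok = check_metric_properties(points_2d, euclidean)

# Squared Euclidean distance breaks the triangle inequality: 4 > 1 + 1.
points_1d = [(0,), (1,), (2,)]
squared_ok = check_metric_properties(points_1d, squared_euclidean)
```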
Common Properties of a Similarity
• Similarities also have some well-known
properties.
1. s(p, q) = 1 (or maximum similarity) only if p = q.
Similarity Between Binary Vectors
Cosine Similarity
• A document can be represented by thousands of attributes, each recording the
frequency of a particular word (such as keywords) or phrase in the document.
Cosine Similarity
• If d1 and d2 are two document vectors, then
cos(d1, d2) = (d1 • d2) / (||d1|| ||d2||),
where • indicates the vector dot product and ||d|| is the length of vector d.
• Example:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2
d1 • d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
||d1|| = (3*3+2*2+0*0+5*5+0*0+0*0+0*0+2*2+0*0+0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (1*1+0*0+0*0+0*0+0*0+0*0+0*0+1*1+0*0+2*2)^0.5 = (6)^0.5 = 2.449
cos(d1, d2) = 5 / (6.481 * 2.449) ≈ 0.315
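The worked example can be reproduced with a short sketch. The function name `cosine_similarity` is an illustrative choice, and the zero-vector check is an added assumption (the formula is undefined when either vector has length 0).

```python
import math


def cosine_similarity(d1, d2):
    """Cosine of the angle between two equal-length term-frequency vectors."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    if norm1 == 0 or norm2 == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm1 * norm2)


# The two document vectors from the example.
d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
similarity = cosine_similarity(d1, d2)  # dot product 5, lengths sqrt(42) and sqrt(6)
print(round(similarity, 3))
```

Because the result depends only on the angle between the vectors, scaling a document's word counts (for example, doubling its length by repetition) leaves the similarity unchanged.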
What Is Frequent Pattern Analysis?
• Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.)
that occurs frequently in a data set
• Applications
– Basket data analysis, cross-marketing, catalog design, sale campaign
analysis, Web log (click stream) analysis, and DNA sequence analysis.
Why Is Freq. Pattern Mining Important?
• Freq. pattern: An intrinsic and important property of datasets
• It is the foundation for many essential data mining tasks
– Association, correlation, and causality analysis
– Sequential, structural (e.g., sub-graph) patterns
– Pattern analysis in spatiotemporal, multimedia, time-series,
and stream data
– Classification: discriminative, frequent pattern analysis
– Cluster analysis: frequent pattern-based clustering
– Data warehousing: iceberg cube and cube-gradient
– Semantic data compression: fascicles
– Broad applications
Frequent patterns: subsequence
• A subsequence, such as buying first a PC, then a digital camera, and
then a memory card, is a (frequent) sequential pattern.
Market basket analysis
• Market basket analysis may help you design different store layouts.
• In one strategy, items that are frequently purchased together can be
placed in proximity to further encourage the combined sale of such
items.
– Example: a customer purchases a computer and then buys antivirus software or a printer
Basic Concepts: Frequent Patterns
• itemset: A set of one or more items
• k-itemset X = {x1, …, xk}

Tid  Items bought
10   Beer, Nuts, Diaper
20   Beer, Coffee, Diaper
Association Rules
Mining Association Rules
• Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup
2. Rule Generation
– Generate high confidence rules from each frequent
itemset, where each rule is a binary partitioning of a
frequent itemset
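A minimal sketch of step 2, rule generation by binary partitioning, using the five sample transactions listed in this section. It assumes the usual definition of confidence for a rule X → Y, namely support count of (X ∪ Y) divided by support count of X; the names `support_count`, `rules_from_itemset`, and the threshold `minconf=0.6` are illustrative choices.

```python
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def support_count(itemset, transactions):
    """Number of transactions that contain every item of itemset."""
    return sum(1 for t in transactions if itemset <= t)


def rules_from_itemset(itemset, transactions, minconf):
    """All rules lhs -> rhs that split itemset into two non-empty parts."""
    itemset = frozenset(itemset)
    whole = support_count(itemset, transactions)
    rules = []
    if whole == 0:
        return rules  # the itemset never occurs, so no rule is supported
    for size in range(1, len(itemset)):
        for lhs in combinations(sorted(itemset), size):
            lhs = frozenset(lhs)
            # Assumed definition: confidence = count(lhs and rhs) / count(lhs).
            conf = whole / support_count(lhs, transactions)
            if conf >= minconf:
                rules.append((lhs, itemset - lhs, conf))
    return rules


rules = rules_from_itemset({"Milk", "Diaper", "Beer"}, transactions, minconf=0.6)
for lhs, rhs, conf in rules:
    print(sorted(lhs), "->", sorted(rhs), round(conf, 2))
```

All rules generated from one itemset share the same support, so only the confidence has to be recomputed per partition.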
[Figure: itemset lattice over items A to E; levels 1 to 3 shown]
A B C D E
AB AC AD AE BC BD BE CD CE DE
ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE
[Figure: N transactions of width up to w, matched against a list of M candidates]

Transactions:
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
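Total number of possible association rules for d items:
R = \sum_{k=1}^{d-1} \left[ \binom{d}{k} \times \sum_{j=1}^{d-k} \binom{d-k}{j} \right] = 3^d - 2^{d+1} + 1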
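As a sanity check on the rule count R for d items, a brute-force enumeration can be compared with the closed form 3^d − 2^(d+1) + 1. This is only a sketch; both function names are illustrative.

```python
from itertools import combinations


def count_rules_brute_force(d):
    """Count rules lhs -> rhs with non-empty, disjoint lhs and rhs over d items."""
    items = list(range(d))
    total = 0
    for k in range(1, d):  # k = size of the left-hand side
        for lhs in combinations(items, k):
            rest = [i for i in items if i not in lhs]
            for j in range(1, len(rest) + 1):  # j = size of the right-hand side
                total += sum(1 for _ in combinations(rest, j))
    return total


def count_rules_closed_form(d):
    return 3 ** d - 2 ** (d + 1) + 1


for d in range(1, 8):
    print(d, count_rules_brute_force(d), count_rules_closed_form(d))
```

The count grows roughly as 3^d, which is why enumerating every rule directly is impractical for realistic numbers of items.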
Frequent Itemset Generation Strategies
C3 (candidate itemsets): {B, C, E}
3rd scan
L3 (frequent itemsets): {B, C, E}, sup = 2
The Apriori Algorithm (Pseudo-Code)
Ck: Candidate itemsets of size k
Lk: Frequent itemsets of size k
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are
        contained in t
    Lk+1 = candidates in Ck+1 with support ≥ min_support
end
return ∪k Lk;
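The pseudo-code can be turned into a small runnable sketch. This is one possible reading of it, not a tuned implementation: the function name `apriori`, the use of absolute support counts for `min_support`, and the four-transaction database `tdb` are assumptions made for illustration.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for every itemset with count >= min_support."""
    transactions = [frozenset(t) for t in transactions]

    # L1: count single items and keep the frequent ones.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)

    k = 1
    while current:  # loop until Lk is empty
        # C(k+1): join pairs of frequent k-itemsets, then drop any candidate
        # that has an infrequent k-subset.
        candidates = set()
        for a, b in combinations(current, 2):
            union = a | b
            if len(union) == k + 1 and all(
                frozenset(sub) in current for sub in combinations(union, k)
            ):
                candidates.add(union)

        # One scan of the database: count candidates contained in each transaction.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1

        current = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(current)  # accumulate the union of all Lk
        k += 1
    return frequent


# Hypothetical transaction database over items A to E.
tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
frequent = apriori(tdb, min_support=2)
for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

Each pass through the `while` loop corresponds to one database scan, so the number of scans equals the size of the largest frequent itemset plus one.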
Implementation of Apriori
• How to generate candidates?
– Step 1: self-joining Lk
– Step 2: pruning
• Example of Candidate-generation
– L3={abc, abd, acd, ace, bcd}
– Self-joining: L3*L3
• abcd from abc and abd
• acde from acd and ace
– Pruning:
• acde is removed because ade is not in L3
– C4 = {abcd}
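The two candidate-generation steps can be sketched as follows, reproducing the example above. Itemsets are represented as sorted tuples, and the join combines two k-itemsets that agree on their first k−1 items; the function names `self_join` and `prune` are illustrative.

```python
from itertools import combinations


def self_join(frequent_k):
    """Step 1: join k-itemsets (sorted tuples) that share their first k-1 items."""
    frequent = sorted(frequent_k)
    joined = set()
    for i, a in enumerate(frequent):
        for b in frequent[i + 1:]:
            if a[:-1] == b[:-1]:
                # a < b with an equal prefix, so the result stays sorted.
                joined.add(a + (b[-1],))
    return joined


def prune(candidates, frequent_k):
    """Step 2: drop candidates that have a k-subset missing from frequent_k."""
    frequent_set = set(frequent_k)
    return {
        c for c in candidates
        if all(sub in frequent_set for sub in combinations(c, len(c) - 1))
    }


L3 = [tuple(s) for s in ("abc", "abd", "acd", "ace", "bcd")]
joined = self_join(L3)   # abcd (from abc and abd) and acde (from acd and ace)
C4 = prune(joined, L3)   # acde is dropped because ade is not in L3
print(sorted("".join(c) for c in joined), sorted("".join(c) for c in C4))
```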
Further Improvement of the Apriori Method
• Major computational challenges
– Multiple scans of transaction database
– Huge number of candidates
– Tedious workload of support counting for candidates