Large Databases
By Group 10
Sadler Divers 103315414
Beili Wang 104522400
Xiang Xu 106067660
Xiaoxiang Zhang 105635826
Sadler Divers
References
[1] J. Han and M. Kamber, "Data Mining: Concepts and
Techniques", 2nd Edition, Morgan Kaufmann Publishers,
August 2006.
• Why?
To gain Information, Knowledge, Money, etc.
Applications
Market Basket Analysis
Cross-Marketing
Catalog Design
• Apriori Algorithm
• Vertical Format
Concepts and Definitions
• Let I = {I1, I2, …, Im} be a set of items
• Pseudo-code:
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k
L1 = {frequent items};
for (k = 1; Lk ≠ ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
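A minimal runnable sketch of this loop in Python (function and variable names are my own; min_support is an absolute count):

from itertools import combinations

def apriori(transactions, min_support):
    # Level-wise frequent-itemset mining, following the pseudo-code above
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    # L1: frequent 1-itemsets
    current = {frozenset([i]) for i in items
               if sum(i in t for t in transactions) >= min_support}
    frequent = set(current)
    k = 1
    while current:
        # Join step: combine frequent k-itemsets into (k+1)-candidates
        candidates = {a | b for a in current for b in current
                      if len(a | b) == k + 1}
        # Prune step: every k-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k))}
        # One scan of the database counts all surviving candidates
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        current = {c for c, n in counts.items() if n >= min_support}
        frequent |= current
        k += 1
    return frequent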
• Calculate Confidence: confidence(X ⇒ Y) = support(X ∪ Y) / support(X)
• B ∧ C ⇒ E: Confidence = 2/2 = 100%
• B ∧ E ⇒ C: Confidence = 2/3 = 66%
• C ∧ E ⇒ B: Confidence = 2/2 = 100%
• B ⇒ C ∧ E: Confidence = 2/3 = 66%
• C ⇒ B ∧ E: Confidence = 2/3 = 66%
• E ⇒ B ∧ C: Confidence = 2/3 = 66%
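These ratios are consistent with the classic four-transaction example database {A,C,D}, {B,C,E}, {A,B,C,E}, {B,E}; assuming that database (the table itself is not reproduced in these slides), a quick check in Python:

db = [{'A','C','D'}, {'B','C','E'}, {'A','B','C','E'}, {'B','E'}]  # assumed example DB

def support(itemset):
    # number of transactions containing every item in itemset
    return sum(set(itemset) <= t for t in db)

# confidence(X ⇒ Y) = support(X ∪ Y) / support(X); here X ∪ Y = {B, C, E}
for antecedent in (['B','C'], ['B','E'], ['C','E'], ['B'], ['C'], ['E']):
    consequent = sorted(set('BCE') - set(antecedent))
    conf = support(set('BCE')) / support(antecedent)
    print(antecedent, '=>', consequent, f'{conf:.1%}')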
Improving the Efficiency of Apriori
Beili Wang
References
[1] A. Savasere, E. Omiecinski, and S. Navathe. "An Efficient Algorithm for Mining Association Rules in Large Databases". VLDB'95, 432-444, Zurich, Switzerland. <http://www.informatik.uni-trier.de/~ley/db/conf/vldb/SavasereON95.html>
[2] J. Han and M. Kamber. "Data Mining: Concepts and Techniques". Morgan
Kaufmann Publishers. March 2006. Chapter 5, Section 5.2.3, Page 240.
Source: Presentation Slides of Prof. Anita Wasilewska, 07. Association Analysis, page 51
Partition Algorithm: Basics
• Definition:
A partition p ⊆ D of the database refers to any subset of the
transactions contained in the database D. Any two different
partitions are non-overlapping, i.e., p_i ∩ p_j = ∅ for i ≠ j.
• Ideas:
Any itemset that is potentially frequent in DB must
be frequent in at least one of the partitions of DB.
Partition scans DB only twice:
Scan 1: partition database and find local frequent
patterns.
Scan 2: consolidate global frequent patterns.
Partition Algorithm
Initially the database D is logically partitioned into n partitions.
Merge phase:
input: local large itemsets of the same length from all n partitions
output: the combined global candidate itemsets. The set of global
candidate itemsets of length j is computed as C_j^G = ⋃_{i=1,…,n} L_j^i
Algorithm output: itemsets that have at least the minimum global support, along with
their supports. The algorithm reads the entire database twice.
Partition Algorithm: Pseudo code
Source: An Efficient Algorithm for Mining Association Rules in Large Databases, page 7, <http://citeseer.ist.psu.edu/sarasere95efficient.html>
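The original pseudo code is in the figure cited above; a condensed Python sketch of the same two-scan scheme (helper names are my own, and it reuses the Apriori sketch from earlier in these slides as the local miner):

def partition_mine(transactions, n_partitions, min_support_ratio):
    # Scan 1: mine each partition with a proportionally scaled local
    # minimum support, then union the local results into global candidates
    size = len(transactions)
    step = -(-size // n_partitions)          # ceiling division
    parts = [transactions[i:i + step] for i in range(0, size, step)]
    candidates = set()
    for part in parts:
        local_min = max(1, int(min_support_ratio * len(part)))
        candidates |= apriori(part, local_min)
    # Scan 2: one pass over the whole database counts the global candidates
    txns = [frozenset(t) for t in transactions]
    counts = {c: sum(c <= t for t in txns) for c in candidates}
    global_min = min_support_ratio * size
    return {c: n for c, n in counts.items() if n >= global_min}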
Partition Algorithm: Example
Consider a small database with four items I={Bread, Butter, Eggs, Milk}
and four transactions as shown in Table 1. Table 2 shows all itemsets for I.
Suppose that the minimum support and minimum confidence of an
association rule are 40% and 60%, respectively.
Source: An Efficient Algorithm for Mining Association Rules in Large Databases, page 22, <http://citeseer.ist.psu.edu/sarasere95efficient.html>
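Tables 1 and 2 themselves are in the cited figure; with hypothetical transactions over the same four items, the sketch above would run as:

txns = [{'Bread','Butter'}, {'Butter','Eggs'},
        {'Bread','Butter','Milk'}, {'Eggs','Milk'}]   # hypothetical data
frequent = partition_mine(txns, n_partitions=2, min_support_ratio=0.4)
# 40% of 4 transactions means a global support count of at least 2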
Performance Comparison – Disk IO
Source: An Efficient Algorithm for Mining Association Rules in Large Databases, page 23, <http://citeseer.ist.psu.edu/sarasere95efficient.html>
Performance Comparison – Scale-up
Source: An Efficient Algorithm for Mining Association Rules in Large Databases, page 24, <http://citeseer.ist.psu.edu/sarasere95efficient.html>
Parallelization in Parallel Database
The structure of the Partition algorithm means that the per-partition
processing can essentially be done in parallel.
4. The local counts at each node are sent to all other nodes. The
global support is the sum of all local supports.
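A minimal sketch of that exchange step, assuming each node has already produced a dictionary of local candidate counts:

from collections import Counter

def merge_local_counts(local_counts_per_node):
    # Global support of each itemset = sum of its local supports
    total = Counter()
    for node_counts in local_counts_per_node:
        total.update(node_counts)
    return total

node_a = {frozenset({'Bread', 'Butter'}): 40, frozenset({'Milk'}): 55}  # hypothetical
node_b = {frozenset({'Bread', 'Butter'}): 25, frozenset({'Milk'}): 60}  # hypothetical
print(merge_local_counts([node_a, node_b]))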
Conclusion
• The Partition algorithm achieves both CPU and I/O improvements
over the Apriori algorithm
[4] J. Han and Y. Fu, ''Discovery of Multiple-Level Association Rules from Large
Databases'', In Proc. of 1995 Int. Conf. on Very Large Data Bases (VLDB'95).
2% milk ⇒ wheat bread [support = 2%, confidence = 72%]
• Encoded transaction: T1 {111, 121, 211, 221} (each digit encodes one level of the item's concept hierarchy, from the most general level down)
Source: J. Han and Y. Fu, ''Discovery of Multiple-Level Association Rules from Large
Databases'', Proc. of 1995 Int. Conf. on Very Large Data Bases (VLDB'95).
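A sketch of how such an encoding can be read, assuming each digit indexes one level of the hierarchy; the lookup tables below are hypothetical stand-ins for the paper's concept hierarchy:

CATEGORY = {'1': 'milk', '2': 'bread'}                         # level 1 (hypothetical)
CONTENT  = {('1', '1'): '2%',    ('1', '2'): 'chocolate',      # level 2 (hypothetical)
            ('2', '1'): 'white', ('2', '2'): 'wheat'}
BRAND    = {('1', '1'): 'Dairyland', ('1', '2'): 'Foremost',   # level 3 (hypothetical)
            ('2', '1'): 'Wonder',    ('2', '2'): 'Homestyle'}

def decode(code):
    # Expand an encoded item such as '111' into its hierarchy path
    c, s, b = code          # one digit per hierarchy level
    return (CATEGORY[c], CONTENT[(c, s)], BRAND[(c, b)])

print([decode(item) for item in ['111', '121', '211', '221']])
# [('milk', '2%', 'Dairyland'), ('milk', 'chocolate', 'Dairyland'),
#  ('bread', 'white', 'Wonder'), ('bread', 'wheat', 'Wonder')]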
Multilevel Association:
Uniform vs Reduced Support
• Uniform support
• Reduced support
Uniform Support
• Same minimum support threshold for all levels
Level 1 (min support = 5%): Milk [support = 10%]
• Quantitative Attributes
– Numeric, implicit ordering among values
Mining Quantitative Associations
• Static discretization based on predefined concept
hierarchies
Source: J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers,
August 2000.
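As a brief illustration of static discretization, a numeric attribute such as age can be replaced by its predefined interval before mining (the bin boundaries here are hypothetical):

AGE_BINS = [(0, 20, 'young'), (20, 40, 'middle'), (40, 120, 'senior')]  # hypothetical hierarchy

def discretize_age(age):
    # Map a raw age onto its predefined interval label
    for lo, hi, label in AGE_BINS:
        if lo <= age < hi:
            return label
    raise ValueError(f"age out of range: {age}")

print([discretize_age(a) for a in (17, 33, 64)])   # ['young', 'middle', 'senior']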
Dynamic Discretization of
Quantitative Association Rules
• Numeric attributes are dynamically discretized
– The confidence of the rules mined is maximized
d(S[X]) = [ Σ_{i=1}^{N} Σ_{j=1}^{N} dist_X(t_i[X], t_j[X]) ] / [ N(N − 1) ]
• dist_X: a distance metric on the values for the attribute set X
(e.g., Euclidean distance or Manhattan distance)
Clusters and Distance
Measurements (Cont.)
• Cluster C_X
– Density threshold d_0^X: d(C_X) ≤ d_0^X
– Frequency threshold s_0: |C_X| ≥ s_0
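A direct transcription of d(S[X]) and the two cluster tests into Python (names are my own; dist can be any metric, e.g. Manhattan):

def diameter(points, dist):
    # d(S[X]): sum of all pairwise distances, divided by N(N − 1)
    n = len(points)
    total = sum(dist(p, q) for p in points for q in points)   # i = j terms are 0
    return total / (n * (n - 1))

def is_cluster(points, dist, d0, s0):
    # C_X qualifies if it is both dense enough and frequent enough
    return diameter(points, dist) <= d0 and len(points) >= s0

manhattan = lambda p, q: sum(abs(a - b) for a, b in zip(p, q))
pts = [(30, 50), (31, 52), (29, 49)]          # hypothetical tuples t_i[X]
print(diameter(pts, manhattan), is_cluster(pts, manhattan, d0=5, s0=3))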
Xiaoxiang Zhang
Sources/References:
[1] J. Han and M. Kamber. "Data Mining: Concepts and
Techniques". Morgan Kaufmann Publishers.
• However, tea ⇒ coffee is misleading, since the overall
probability of purchasing coffee is 90%, which is larger
than the rule's 80% confidence.
• The above example illustrates that the
confidence of a rule A ⇒ B can be deceiving:
it is only an estimate of the conditional
probability of itemset B given itemset A.
Measuring Correlation
• One way of measuring correlation is
corr_{A,B} = p(A ∪ B) / (p(A) · p(B))
• If the resulting value is equal to 1, then A and B are
independent; if it is greater than 1, then A and B are positively
correlated; otherwise A and B are negatively correlated.
For the above example,
p(t ∧ c) / (p(t) · p(c)) = 0.2 / (0.25 × 0.9) ≈ 0.89,
which is less than 1, indicating a negative correlation
between buying tea and buying coffee.
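The same arithmetic in Python:

p_t, p_c, p_tc = 0.25, 0.90, 0.20   # P(tea), P(coffee), P(tea and coffee)
corr = p_tc / (p_t * p_c)
print(round(corr, 2))                # 0.89 < 1: negative correlation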
• Is the above way of measuring the correlation
good enough?
• So, we introduce:
The chi-squared test for independence
The chi-squared test for independence
• Let R = {i1, ¬i1} × … × {ik, ¬ik} and r = r1…rk ∈ R,
where ¬ij denotes the absence of item ij.
• Here R is the set of all possible basket values, and r is
a single basket value. Each value of r denotes a cell; this
terminology comes from the view that R is a k-dimensional
contingency table.
Let O(r) denote the number of baskets falling into cell r.
• The chi-squared statistic is defined as:
χ² = Σ_r (O(r) − E[r])² / E[r]
What does the chi-squared statistic mean?
E[r] is the number of baskets expected to fall into cell r under the
assumption that the items occur independently. A χ² value close to 0
is consistent with independence; the larger the value, the stronger
the evidence that the items are correlated.
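A sketch of the statistic for the tea/coffee example, assuming a hypothetical total of 1000 baskets so the probabilities above become counts:

N = 1000                                       # hypothetical basket total
p_t, p_c, p_tc = 0.25, 0.90, 0.20

# Observed counts for the four cells of the 2x2 contingency table
observed = {('t', 'c'):   N * p_tc,                    # 200
            ('t', '~c'):  N * (p_t - p_tc),            # 50
            ('~t', 'c'):  N * (p_c - p_tc),            # 700
            ('~t', '~c'): N * (1 - p_t - p_c + p_tc)}  # 50

# Expected counts under independence: E[r] = N * P(row) * P(column)
marg_t = {'t': p_t, '~t': 1 - p_t}
marg_c = {'c': p_c, '~c': 1 - p_c}
expected = {(a, b): N * marg_t[a] * marg_c[b] for a in marg_t for b in marg_c}

chi2 = sum((observed[r] - expected[r]) ** 2 / expected[r] for r in observed)
print(round(chi2, 1))   # about 37.0: far from 0, so tea and coffee are not independent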