Data Mining
Benchmark - a side-by-side comparison of one company versus other companies that are competing in the same industry or space.
17. decembar 2020.
BI Overview
• Customer segmentation
– What market segments do my customers fall into, and what are their characteristics?
– Personalize customer relationships for higher customer satisfaction and retention
• Propensity to buy
– Which customers are most likely to respond to my promotion?
– Target the right customers
– Increase campaign profitability by focusing on the customers most likely to buy
• Customer profitability
– What is the lifetime profitability of my customers?
– Make individual business interaction decisions based on the overall profitability of customers
• Fraud detection
– How can I tell which transactions are likely to be fraudulent?
– If your wife has just proposed increasing your life insurance policy, you should probably order pizza for a while
– Quickly determine fraud and take immediate action to minimize damage
• Customer attrition
– Which customers are at risk of leaving?
– Prevent loss of high-value customers and let go of lower-value customers
• Channel optimization
– What is the best channel to reach my customers in each segment?
– Interact with customers based on their preferences and your need to manage cost
• Dashboards
– Provide a comprehensive visual view of
corporate performance measures, trends, and
exceptions from multiple business areas
• Allows executives to see hot spots in seconds and
explore the situation
• Architecture of DM systems
– [Architecture diagram: graphical user interface, pattern evaluation, databases and data warehouse]
Data Mining Techniques
#1/2
• Association (correlation and causality)
– Multi-dimensional vs. single-dimensional association
– age(X, “20..29”) ∧ income(X, “20..29K”) ⟶ buys(X, “PC”) [support = 2%, confidence = 60%]
– contains(T, “computer”) ⟶ contains(T, “software”) [1%, 75%]
• Classification and Prediction
– Finding models (functions) that describe and distinguish classes or concepts for future predictions
– Presentation: decision tree, classification rules, neural network
– Prediction: predict some unknown or missing numerical values
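As a sketch of how the support and confidence figures in the association rules above are computed, assuming transactions represented as item sets (function names and data here are illustrative, not from the slides):

```python
# Minimal sketch: computing support and confidence for a rule
# X -> Y over transactions represented as item sets.
# Function names and data are illustrative, not from the slides.

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(lhs, rhs, transactions):
    """confidence(X -> Y) = sup(X ∪ Y) / sup(X)."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)

transactions = [
    {"computer", "software", "printer"},
    {"computer", "software"},
    {"computer", "scanner"},
    {"printer", "paper"},
]

sup = support({"computer", "software"}, transactions)        # 2 of 4 = 0.5
conf = confidence({"computer"}, {"software"}, transactions)  # 2 of 3
```

A rule like contains(T, “computer”) ⟶ contains(T, “software”) [1%, 75%] simply reports these two numbers: 1% support, 75% confidence.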
Data Mining Techniques
#2/2
• Cluster analysis
– Class label is unknown: group data to form new
classes, e.g., advertising based on client groups
(segmentation)
– Clustering based on the principle: maximizing the intra-class similarity and minimizing the inter-class similarity
• Outlier analysis
– Outlier: a data object that does not comply with the general behavior of the data
– Can be considered noise or an exception, but is quite useful in fraud detection and rare-events analysis
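The clustering principle above (groups that are tight internally and well separated from each other) can be illustrated with a minimal one-dimensional k-means sketch; the data, function name, and starting centers are illustrative, not from the slides:

```python
# Minimal one-dimensional k-means sketch of the clustering principle:
# points end up grouped so that intra-class similarity is high and
# inter-class similarity is low. Data and names are illustrative.

def kmeans_1d(points, centers, iterations=10):
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center's cluster.
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[i].append(p)
        # Update step: each center moves to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Two obvious client groups, e.g. low spenders and high spenders.
points = [1.0, 1.5, 2.0, 10.0, 10.5, 11.0]
centers, clusters = kmeans_1d(points, centers=[0.0, 5.0])
```

With this data the two clusters separate after one pass, with centers at 1.5 and 10.5; the same assignment/update loop generalizes to higher dimensions.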
3. Association Rule Mining
– Initial step
• Find frequent itemsets of size 1: F1
– Generalization, k ≥ 2
• Ck = candidates of size k: those itemsets of size k that could be frequent, given Fk-1
• Fk = those itemsets that are actually frequent, Fk ⊆ Ck (need to scan the database once)
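The Fk-1 ⟶ Ck step above can be sketched as a join followed by a prune; the helper name is my own, not from the slides:

```python
from itertools import combinations

# Sketch of the Apriori candidate-generation step: join (k-1)-itemsets
# into k-itemsets, then prune any candidate that has an infrequent
# (k-1)-subset. The helper name is illustrative, not from the slides.

def generate_candidates(f_prev):
    """f_prev: frequent itemsets of size k-1 (as frozensets); returns Ck."""
    k = len(next(iter(f_prev))) + 1
    # Join step: unions of two (k-1)-itemsets that yield a k-itemset.
    joined = {a | b for a in f_prev for b in f_prev if len(a | b) == k}
    # Prune step: every (k-1)-subset of a candidate must be in Fk-1.
    return {c for c in joined
            if all(frozenset(s) in f_prev for s in combinations(c, k - 1))}

f1 = {frozenset({i}) for i in (1, 2, 3, 5)}
c2 = generate_candidates(f1)  # all six 2-subsets of {1, 2, 3, 5}
```

Only Fk (counting the candidates' actual support) needs a database scan; the join and prune work purely on Fk-1.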
Association Rule Mining
Apriori Algorithm: Step 1
• After the join, C4 = {{1, 2, 3, 4}, {1, 3, 4, 5}}
• F3 = {{1, 2, 3}, {1, 2, 4}, {1, 3, 4}, {1, 3, 5}, {2, 3, 4}}
• Pruning:
– 3-subsets of {1, 2, 3, 4}: {1, 2, 3}, {1, 2, 4}, {1, 3, 4}, {2, 3, 4}, all ∈ F3 ⟹ {1, 2, 3, 4} is a good candidate
– 3-subsets of {1, 3, 4, 5}: {1, 3, 4}, {1, 3, 5}, {1, 4, 5}, {3, 4, 5}; {1, 4, 5} ∉ F3 ⟹ {1, 3, 4, 5} is removed from C4
– Transactions (shown in this excerpt): T300: {1, 2, 3, 5}; T400: {2, 5}
• F2: {1,3}:2, {2,3}:2, {2,5}:3, {3,5}:2
• Join: we could join {1,3} only with {1,4} or {1,5}, but they are not in F2. The only possible join in F2 is {2, 3} with {2, 5}, resulting in {2, 3, 5}
• prune({2, 3, 5}): {2, 3}, {2, 5}, {3, 5} all belong to F2; hence C3 = {{2, 3, 5}}
– Third T scan
• {2, 3, 5}:2, so sup({2, 3, 5}) = 50% and the minsup condition is satisfied
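The whole worked example can be replayed end to end. Note a loud assumption: only T300 and T400 are visible in this excerpt; T100 = {1, 3, 4} and T200 = {2, 3, 5} below are my reconstruction (the classic textbook values), chosen because they reproduce the stated F2 counts and sup({2, 3, 5}) = 50%:

```python
from itertools import combinations
from collections import Counter

# Sketch of a full level-wise Apriori pass. T100 and T200 are ASSUMED
# (not shown in this excerpt); they are the classic textbook values
# that reproduce the stated F2 counts and sup({2,3,5}) = 50%.
transactions = [
    {1, 3, 4},      # T100 (assumed, not in the excerpt)
    {2, 3, 5},      # T200 (assumed, not in the excerpt)
    {1, 2, 3, 5},   # T300
    {2, 5},         # T400
]
minsup = 2  # absolute count: 2 of 4 transactions = 50%

def frequent_itemsets(transactions, minsup):
    items = sorted({i for t in transactions for i in t})
    f = [{frozenset({i}) for i in items
          if sum(i in t for t in transactions) >= minsup}]
    while f[-1]:
        k = len(next(iter(f[-1]))) + 1
        # Join, then prune candidates with an infrequent (k-1)-subset.
        cands = {a | b for a in f[-1] for b in f[-1] if len(a | b) == k}
        cands = {c for c in cands
                 if all(frozenset(s) in f[-1]
                        for s in combinations(c, k - 1))}
        counts = Counter()
        for t in transactions:          # one database scan per level
            for c in cands:
                if c <= t:
                    counts[c] += 1
        nxt = {c for c, n in counts.items() if n >= minsup}
        if not nxt:
            break
        f.append(nxt)
    return f

f = frequent_itemsets(transactions, minsup)
# f[1] matches the slide's F2; f[2] is {{2, 3, 5}} with count 2.
```

Under these assumed transactions the run yields exactly the slide's F2 = {{1,3}, {2,3}, {2,5}, {3,5}} and F3 = {{2, 3, 5}}.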
• Advantages
– It is a more realistic model for practical applications
– The model enables us to find rare-item rules without producing a huge number of meaningless rules with frequent items
– By setting the MIS values of some items to 100% (or more), we can effectively instruct the algorithm not to generate rules involving only these items
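The MIS effect described above can be sketched under the multiple-minimum-support model, where an itemset's threshold is the lowest MIS among its items; the item names and MIS values below are illustrative, not from the slides:

```python
# Sketch of the multiple-minimum-support (MIS) frequency test: an
# itemset is frequent if its support meets the LOWEST MIS among its
# items. Item names and MIS values are illustrative.

MIS = {"bread": 0.05, "milk": 0.05, "cooking_pan": 1.01}
# cooking_pan's MIS above 100% means no itemset containing ONLY such
# items can ever be frequent, while {bread, cooking_pan} is still
# judged by bread's low MIS, so rare-item rules survive.

def is_frequent(itemset, sup):
    return sup >= min(MIS[i] for i in itemset)

is_frequent({"cooking_pan"}, sup=0.30)            # False: 0.30 < 1.01
is_frequent({"bread", "cooking_pan"}, sup=0.03)   # False: 0.03 < 0.05
is_frequent({"bread", "cooking_pan"}, sup=0.06)   # True: 0.06 >= 0.05
```

This is exactly the suppression mechanism on the slide: a 100%+ MIS blocks rules made only of such items without blocking rules that pair them with genuinely rare items.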