Data Mining
Benchmark - a side-by-side comparison of one company versus other companies that are competing in the same industry or space.
17. decembar 2020.
BI Overview
• Customer segmentation
– What market segments do my customers fall into, and what are their characteristics?
– Personalize customer relationships for higher customer satisfaction and retention
• Propensity to buy
– Which customers are most likely to respond to my promotion?
– Target the right customers
– Increase campaign profitability by focusing on the customers most likely to buy
• Customer profitability
– What is the lifetime profitability of my customers?
– Make individual business interaction decisions based on the overall profitability of customers
• Fraud detection
– How can I tell which transactions are likely to be fraudulent?
– If your wife has just proposed increasing your life insurance policy, you should probably order pizza for a while
– Quickly determine fraud and take immediate action to minimize damage
• Customer attrition
– Which customers are at risk of leaving?
– Prevent loss of high-value customers and let go of lower-value customers
• Channel optimization
– What is the best channel to reach my customers in each segment?
– Interact with customers based on their preferences and your need to manage cost
• Dashboards
– Provide a comprehensive visual view of
corporate performance measures, trends, and
exceptions from multiple business areas
• Allows executives to see hot spots in seconds and
explore the situation
• Architecture of DM systems
– [Architecture diagram: graphical user interface, pattern evaluation, databases and data warehouse]
Data Mining Techniques
#1/2
• Association (correlation and causality)
– Multi-dimensional vs. single-dimensional association
– age(X, “20..29”) ∧ income(X, “20..29K”) ⟶ buys(X, “PC”) [support = 2%, confidence = 60%]
– contains(T, “computer”) ⟶ contains(T, “software”) [1%, 75%]
• Classification and Prediction
– Finding models (functions) that describe and distinguish classes or concepts for future predictions
– Presentation: decision tree, classification rules, neural network
– Prediction: predict some unknown or missing numerical values
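As a sketch of how the support and confidence figures in the association rules above are computed, assuming transactions represented as item sets (function names and data here are illustrative, not from the slides):

```python
# Minimal sketch: computing support and confidence for a rule
# X -> Y over transactions represented as item sets.
# Function names and data are illustrative, not from the slides.

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(lhs, rhs, transactions):
    """confidence(X -> Y) = sup(X ∪ Y) / sup(X)."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)

transactions = [
    {"computer", "software", "printer"},
    {"computer", "software"},
    {"computer", "scanner"},
    {"printer", "paper"},
]

sup = support({"computer", "software"}, transactions)        # 2 of 4 = 0.5
conf = confidence({"computer"}, {"software"}, transactions)  # 2 of 3
```

A rule like contains(T, “computer”) ⟶ contains(T, “software”) [1%, 75%] simply reports these two numbers: 1% support, 75% confidence.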
Data Mining Techniques
#2/2
• Cluster analysis
– Class label is unknown: group data to form new
classes, e.g., advertising based on client groups
(segmentation)
– Clustering based on the principle: maximizing the intra-class similarity and minimizing the inter-class similarity
• Outlier analysis
– Outlier: a data object that does not comply with the general behavior of the data
– Can be considered noise or an exception, but is quite useful in fraud detection and rare-events analysis
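The clustering principle above (groups that are tight internally and well separated from each other) can be illustrated with a minimal one-dimensional k-means sketch; the data, function name, and starting centers are illustrative, not from the slides:

```python
# Minimal one-dimensional k-means sketch of the clustering principle:
# points end up grouped so that intra-class similarity is high and
# inter-class similarity is low. Data and names are illustrative.

def kmeans_1d(points, centers, iterations=10):
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center's cluster.
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[i].append(p)
        # Update step: each center moves to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Two obvious client groups, e.g. low spenders and high spenders.
points = [1.0, 1.5, 2.0, 10.0, 10.5, 11.0]
centers, clusters = kmeans_1d(points, centers=[0.0, 5.0])
```

With this data the two clusters separate after one pass, with centers at 1.5 and 10.5; the same assignment/update loop generalizes to higher dimensions.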
3. Association Rule Mining
– Initial step
• Find frequent itemsets of size 1: F1
– Generalization, k ≥ 2
• Ck = candidates of size k: those itemsets of size k that could be frequent, given Fk-1
• Fk = those itemsets that are actually frequent, Fk ⊆ Ck (need to scan the database once)
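The Fk-1 ⟶ Ck step above can be sketched as a join followed by a prune; the helper name is my own, not from the slides:

```python
from itertools import combinations

# Sketch of the Apriori candidate-generation step: join (k-1)-itemsets
# into k-itemsets, then prune any candidate that has an infrequent
# (k-1)-subset. The helper name is illustrative, not from the slides.

def generate_candidates(f_prev):
    """f_prev: frequent itemsets of size k-1 (as frozensets); returns Ck."""
    k = len(next(iter(f_prev))) + 1
    # Join step: unions of two (k-1)-itemsets that yield a k-itemset.
    joined = {a | b for a in f_prev for b in f_prev if len(a | b) == k}
    # Prune step: every (k-1)-subset of a candidate must be in Fk-1.
    return {c for c in joined
            if all(frozenset(s) in f_prev for s in combinations(c, k - 1))}

f1 = {frozenset({i}) for i in (1, 2, 3, 5)}
c2 = generate_candidates(f1)  # all six 2-subsets of {1, 2, 3, 5}
```

Only Fk (counting the candidates' actual support) needs a database scan; the join and prune work purely on Fk-1.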
Association Rule Mining
Apriori Algorithm: Step 1
• After the join, C4 = {{1, 2, 3, 4}, {1, 3, 4, 5}}
• F3 = {{1, 2, 3}, {1, 2, 4}, {1, 3, 4}, {1, 3, 5}, {2, 3, 4}}
• Pruning:
– 3-subsets of {1, 2, 3, 4}: {1, 2, 3}, {1, 2, 4}, {1, 3, 4}, {2, 3, 4}, all ∈ F3 ⟹ {1, 2, 3, 4} is a good candidate
– 3-subsets of {1, 3, 4, 5}: {1, 3, 4}, {1, 3, 5}, {1, 4, 5}, {3, 4, 5}; {1, 4, 5} ∉ F3 ⟹ {1, 3, 4, 5} is removed from C4
– Transactions (shown in this excerpt): T300: {1, 2, 3, 5}; T400: {2, 5}
• F2: {1,3}:2, {2,3}:2, {2,5}:3, {3,5}:2
• Join: we could join {1,3} only with {1,4} or {1,5}, but they are not in F2. The only possible join in F2 is {2, 3} with {2, 5}, resulting in {2, 3, 5}
• prune({2, 3, 5}): {2, 3}, {2, 5}, {3, 5} all belong to F2; hence C3 = {{2, 3, 5}}
– Third T scan
• {2, 3, 5}:2, so sup({2, 3, 5}) = 50% and the minsup condition is satisfied
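The whole worked example can be replayed end to end. Note a loud assumption: only T300 and T400 are visible in this excerpt; T100 = {1, 3, 4} and T200 = {2, 3, 5} below are my reconstruction (the classic textbook values), chosen because they reproduce the stated F2 counts and sup({2, 3, 5}) = 50%:

```python
from itertools import combinations
from collections import Counter

# Sketch of a full level-wise Apriori pass. T100 and T200 are ASSUMED
# (not shown in this excerpt); they are the classic textbook values
# that reproduce the stated F2 counts and sup({2,3,5}) = 50%.
transactions = [
    {1, 3, 4},      # T100 (assumed, not in the excerpt)
    {2, 3, 5},      # T200 (assumed, not in the excerpt)
    {1, 2, 3, 5},   # T300
    {2, 5},         # T400
]
minsup = 2  # absolute count: 2 of 4 transactions = 50%

def frequent_itemsets(transactions, minsup):
    items = sorted({i for t in transactions for i in t})
    f = [{frozenset({i}) for i in items
          if sum(i in t for t in transactions) >= minsup}]
    while f[-1]:
        k = len(next(iter(f[-1]))) + 1
        # Join, then prune candidates with an infrequent (k-1)-subset.
        cands = {a | b for a in f[-1] for b in f[-1] if len(a | b) == k}
        cands = {c for c in cands
                 if all(frozenset(s) in f[-1]
                        for s in combinations(c, k - 1))}
        counts = Counter()
        for t in transactions:          # one database scan per level
            for c in cands:
                if c <= t:
                    counts[c] += 1
        nxt = {c for c, n in counts.items() if n >= minsup}
        if not nxt:
            break
        f.append(nxt)
    return f

f = frequent_itemsets(transactions, minsup)
# f[1] matches the slide's F2; f[2] is {{2, 3, 5}} with count 2.
```

Under these assumed transactions the run yields exactly the slide's F2 = {{1,3}, {2,3}, {2,5}, {3,5}} and F3 = {{2, 3, 5}}.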
• Advantages
– It is a more realistic model for practical applications
– The model enables us to find rare-item rules without producing a huge number of meaningless rules with frequent items
– By setting the MIS values of some items to 100% (or more), we can effectively instruct the algorithm not to generate rules involving only these items
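The MIS effect described above can be sketched under the multiple-minimum-support model, where an itemset's threshold is the lowest MIS among its items; the item names and MIS values below are illustrative, not from the slides:

```python
# Sketch of the multiple-minimum-support (MIS) frequency test: an
# itemset is frequent if its support meets the LOWEST MIS among its
# items. Item names and MIS values are illustrative.

MIS = {"bread": 0.05, "milk": 0.05, "cooking_pan": 1.01}
# cooking_pan's MIS above 100% means no itemset containing ONLY such
# items can ever be frequent, while {bread, cooking_pan} is still
# judged by bread's low MIS, so rare-item rules survive.

def is_frequent(itemset, sup):
    return sup >= min(MIS[i] for i in itemset)

is_frequent({"cooking_pan"}, sup=0.30)            # False: 0.30 < 1.01
is_frequent({"bread", "cooking_pan"}, sup=0.03)   # False: 0.03 < 0.05
is_frequent({"bread", "cooking_pan"}, sup=0.06)   # True: 0.06 >= 0.05
```

This is exactly the suppression mechanism on the slide: a 100%+ MIS blocks rules made only of such items without blocking rules that pair them with genuinely rare items.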