Agenda

● Advanced Analytical Theory and Methods: Association Rules - overview, Apriori algorithm, evaluation of candidate rules, case study: transactions in a grocery store, validation and testing, diagnostics

● Regression - linear, logistic, reasons to choose and cautions, additional regression models
Syllabus

Advanced Analytical Theory and Methods: Association Rules - overview, Apriori algorithm, evaluation of candidate rules, case study: transactions in a grocery store, validation and testing, diagnostics.

Regression - linear, logistic, reasons to choose and cautions, additional regression models.
Association Rules

Overview

Apriori Algorithm

Evaluation of Candidate Rules

Example: Transactions in a Grocery Store

Validation and Testing

Diagnostics
Association Rules Method: Overview

• Unsupervised learning method

• Descriptive (not predictive) method

• Used to find hidden relationships in data

• The relationships are represented as rules


Questions association rules might answer

● Which products tend to be purchased together?

● What products do similar customers tend to buy?

Applications of Association Rules

Market Basket Analysis

(The examples in this section are adapted from https://blog.rsquaredacademy.com/market-basket-analysis-in-r/.)
Market Basket Analysis

❏ Market basket analysis uses association rule mining to identify products that are frequently bought together.


Market Basket Analysis - Why?

Use Cases (Applications) of Association Rule Mining

Simple Example - Transaction Data

Simple Example - Frequent Item Set

Simple Example - Association Rule: Support, Confidence, Lift

(The slides above consist of figures from the rsquaredacademy post cited earlier.)
Simple Example - Association Rule: Lift - Interpretation

● Lift = 1: implies no relationship between mobile phone and screen guard (i.e.,
mobile phone and screen guard occur together only by chance)
● Lift > 1: implies that there is a positive relationship between mobile phone and
screen guard (i.e., mobile phone and screen guard occur together more often than
random)
● Lift < 1: implies that there is a negative relationship between mobile phone and
screen guard (i.e., mobile phone and screen guard occur together less often than
random)
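To make the interpretation concrete, here is a minimal Python sketch with hypothetical counts (the numbers are illustrative, not taken from the source blog):

# Hypothetical counts for 1000 transactions (illustrative only)
n_transactions = 1000
n_phone = 200   # baskets containing a mobile phone
n_guard = 150   # baskets containing a screen guard
n_both = 90     # baskets containing both

support_phone = n_phone / n_transactions   # 0.20
support_guard = n_guard / n_transactions   # 0.15
support_both = n_both / n_transactions     # 0.09

# Lift = P(phone and guard) / (P(phone) * P(guard))
lift = support_both / (support_phone * support_guard)
print(lift)  # 3.0 > 1: the two items co-occur more often than chance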

Association Rule

• Frequent itemsets from the previous section can form candidate rules such as "X implies Y":

  X → Y
Association Rule

The appropriateness of a candidate rule is evaluated using four measures:

Support    Confidence    Lift    Leverage
Association Rule / Apriori Example

Minimum support = 0.5 (50%). With 9 transactions, the minimum support count is 9 × 0.5 = 4.5, which these slides round down to 4.

TID  | List of Item IDs
T100 | I1, I2, I5
T101 | I2, I4
T102 | I2, I5
T103 | I1, I2, I4
T104 | I1, I2, I3
T105 | I2, I3
T106 | I1, I2, I3, I4
T107 | I1, I2, I3
T108 | I1, I3, I5

Item Set | Frequency
{I1}     | 6
{I2}     | 8
{I3}     | 5
{I4}     | 3
{I5}     | 3
Example

Minimum support count = 4 (as above). Candidate 1-itemsets are counted over the transactions and then pruned.

Candidate generation:

Item Set | Frequency
{I1}     | 6
{I2}     | 8
{I3}     | 5
{I4}     | 3
{I5}     | 3

After pruning (frequency ≥ 4):

Item Set | Frequency
{I1}     | 6
{I2}     | 8
{I3}     | 5
Example

Candidate 2-itemsets are generated by joining the frequent 1-itemsets, then pruned against the minimum support count of 4.

Candidate generation:

Item Set | Frequency
{I1, I2} | 5
{I1, I3} | 4
{I2, I3} | 4

After pruning (frequency ≥ 4):

Item Set | Frequency
{I1, I2} | 5
{I1, I3} | 4
{I2, I3} | 4
Example

Candidate 3-itemsets are generated from the frequent 2-itemsets:

Item Set     | Frequency
{I1, I2, I3} | 3

After pruning (frequency ≥ 4) no itemsets remain, so the algorithm stops and rules are formed from the previous stage.

We have 3 rules:
1. I1 => I2
2. I1 => I3
3. I2 => I3
Example - Support

Support = Freq(X ∪ Y) / (number of transactions)

Rule     | Freq(X ∪ Y) | Putting values in formula | Support value
I1 => I2 | 5           | 5/9                       | 0.55
I1 => I3 | 4           | 4/9                       | 0.44
I2 => I3 | 4           | 4/9                       | 0.44
Example - Confidence

Confidence(X => Y) = Freq(X ∪ Y) / Freq(X)

Rule     | Freq(X) | Freq(X ∪ Y) | Putting values in formula | Confidence
I1 => I2 | 6       | 5           | 5/6                       | 0.83
I1 => I3 | 6       | 4           | 4/6                       | 0.66
I2 => I3 | 8       | 4           | 4/8                       | 0.50
Example - Lift

Lift = Support(X ∪ Y) / (Support(X) × Support(Y))

Rule     | Support(X ∪ Y) | Support(X) | Support(Y) | Putting values in formula | Lift
I1 => I2 | 0.55           | 6/9 = 0.66 | 8/9 = 0.88 | 0.55 / (0.66 × 0.88)      | 0.94
I1 => I3 | 0.44           | 6/9 = 0.66 | 5/9 = 0.55 | 0.44 / (0.66 × 0.55)      | 1.21
I2 => I3 | 0.44           | 8/9 = 0.88 | 5/9 = 0.55 | 0.44 / (0.88 × 0.55)      | 0.90
Example - Leverage

Leverage = Support(X ∪ Y) − Support(X) × Support(Y)

Rule     | Support(X ∪ Y) | Support(X) | Support(Y) | Putting values in formula | Leverage
I1 => I2 | 0.55           | 6/9 = 0.66 | 8/9 = 0.88 | 0.55 − (0.66 × 0.88)      | -0.03
I1 => I3 | 0.44           | 6/9 = 0.66 | 5/9 = 0.55 | 0.44 − (0.66 × 0.55)      | 0.077
I2 => I3 | 0.44           | 8/9 = 0.88 | 5/9 = 0.55 | 0.44 − (0.88 × 0.55)      | -0.044
Example

Rule     | Support | Confidence | Lift | Leverage
I1 => I2 | 0.55    | 0.83       | 0.94 | -0.03
I1 => I3 | 0.44    | 0.66       | 1.21 | 0.077
I2 => I3 | 0.44    | 0.50       | 0.90 | -0.044
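As a cross-check of the tables above, a short Python sketch that recomputes all four measures directly from the nine transactions (exact fractions give slightly different decimals than the rounded intermediate values used in the slides):

transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I5"},
    {"I1", "I2", "I4"}, {"I1", "I2", "I3"}, {"I2", "I3"},
    {"I1", "I2", "I3", "I4"}, {"I1", "I2", "I3"}, {"I1", "I3", "I5"},
]
n = len(transactions)  # 9

def support(itemset):
    # Fraction of transactions containing every item in the itemset
    return sum(itemset <= t for t in transactions) / n

for x, y in [("I1", "I2"), ("I1", "I3"), ("I2", "I3")]:
    s_xy, s_x, s_y = support({x, y}), support({x}), support({y})
    print(f"{x} => {y}: support={s_xy:.2f}, confidence={s_xy / s_x:.2f}, "
          f"lift={s_xy / (s_x * s_y):.2f}, leverage={s_xy - s_x * s_y:.3f}")
# I1 => I2: support=0.56, confidence=0.83, lift=0.94, leverage=-0.037
# I1 => I3: support=0.44, confidence=0.67, lift=1.20, leverage=0.074
# I2 => I3: support=0.44, confidence=0.50, lift=0.90, leverage=-0.049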


The Apriori Algorithm

• Join Step: Ck is generated by joining Lk-1 with itself
• Prune Step: Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset
• Pseudo-code:
    Ck: candidate itemsets of size k
    Lk: frequent itemsets of size k

    L1 = {frequent items};
    for (k = 1; Lk != ∅; k++) do begin
        Ck+1 = candidates generated from Lk;
        for each transaction t in database do
            increment the count of all candidates in Ck+1 that are contained in t
        Lk+1 = candidates in Ck+1 with min_support
    end
    return ∪k Lk;
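A compact, runnable Python rendering of this pseudocode (a sketch using plain sets and dictionaries, checked against the grocery example above):

from itertools import combinations

def apriori(transactions, min_count):
    # Frequent 1-itemsets (L1)
    items = {i for t in transactions for i in t}
    counts = {frozenset([i]): sum(i in t for t in transactions) for i in items}
    frequent = {s: c for s, c in counts.items() if c >= min_count}
    result, k = dict(frequent), 1
    while frequent:
        # Join step: build candidate (k+1)-itemsets from frequent k-itemsets
        prev = list(frequent)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k + 1}
        # Prune step: every k-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k))}
        # One scan of the database to count the surviving candidates
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_count}
        result.update(frequent)
        k += 1
    return result

# The grocery example above: 9 transactions, minimum support count 4
transactions = [{"I1","I2","I5"}, {"I2","I4"}, {"I2","I5"}, {"I1","I2","I4"},
                {"I1","I2","I3"}, {"I2","I3"}, {"I1","I2","I3","I4"},
                {"I1","I2","I3"}, {"I1","I3","I5"}]
print(apriori(transactions, 4))
# {I1}:6 {I2}:8 {I3}:5 {I1,I2}:5 {I1,I3}:4 {I2,I3}:4 -- matching the slides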
Applications of Association Rules

The term market basket analysis refers to a specific implementation of


association rules
• For better merchandising – products to include/exclude from inventory each month
• Placement of products
• Cross-selling
• Promotional programs—multiple product purchase incentives managed through a loyalty card
program

Association rules are also used for

• Recommender systems – Amazon, Netflix


• Clickstream analysis from web usage log files
• Website visitors to page X click on links A,B,C more than on links D,E,F
Validation and Testing

● Frequent itemsets and high-confidence rules are identified using pre-specified minimum
support and minimum confidence levels
● Measures like lift and/or leverage then ensure that interesting rules are identified
rather than coincidental ones
● However, some of the remaining rules may be considered subjectively uninteresting
because they don’t yield unexpected profitable actions
○ E.g., rules like {paper} -> {pencil} are not interesting/meaningful
● Incorporating subjective knowledge requires domain experts
● Good rules provide valuable insights for institutions to improve their business
operations
Diagnostics

● Although the Apriori algorithm is easy to understand and implement, some of the rules
generated are uninteresting or practically useless.
● Additionally, some of the rules may be generated due to coincidental relationships between
the variables.
● Measures like confidence, lift, and leverage should be used along with human insights to
address this problem
● Another problem with association rules is that, in Phases 3 and 4 of the Data Analytics
Lifecycle, the team must specify the minimum support prior to the model execution,
which may lead to too many or too few rules.
● In related research, a variant of the algorithm can use a predefined target range for the
number of rules so that the algorithm can adjust the minimum support accordingly.
● The algorithm requires a scan of the entire database to obtain the result. Accordingly, as the
database grows, each run takes more time to compute.
Diagnostics - Approaches to Improve Apriori's Efficiency

• Partitioning: any itemset that is potentially frequent in a transaction database must be frequent in at least one of the partitions of the transaction database.

• Sampling: extract a subset of the data with a lower support threshold and use the subset to perform association rule mining.

• Transaction reduction: a transaction that does not contain frequent k-itemsets is useless in subsequent scans and therefore can be ignored.

• Dynamic itemset counting: only add new candidate itemsets when all of their subsets are estimated to be frequent.

• Hash-based itemset counting: if the corresponding hashing bucket count of a k-itemset is below a certain threshold, the k-itemset cannot be frequent.
Syllabus

Advanced Analytical Theory and Methods: Association Rules - overview, Apriori algorithm, evaluation of candidate rules, case study: transactions in a grocery store, validation and testing, diagnostics.

Regression - linear, logistic, reasons to choose and cautions, additional regression models.
Regression

Linear Regression

Logistic Regression

Reasons to Choose and Cautions

Additional Regression Models


Regression

Regression is a supervised learning method used for prediction.


Regression

● Regression analysis attempts to explain the influence that


input (independent) variables have on the outcome
(dependent) variable
● Questions regression might answer
○ What is a person’s expected income?
○ What is the probability that an applicant will default on a loan?
Regression Types

Linear Regression    Logistic Regression

Linear Regression

● Models the relationship between several input variables and a continuous outcome variable
Use Cases (Applications)

Real estate example


• Predict residential home prices
• Possible inputs – living area, #bathrooms, #bedrooms, lot size, property taxes

Demand forecasting example


• Restaurant predicts quantity of food needed
• Possible inputs – weather, day of week, etc.

Medical example
• Analyze effect of proposed radiation treatment
• Possible inputs – radiation treatment duration, frequency
Linear Regression Equations
Linear Regression Model

The relationship between the variables is a linear function:

Yi = β0 + β1Xi + εi

where Yi is the dependent (response) variable (e.g., income), Xi is the independent (explanatory) variable (e.g., age), β0 is the Y-intercept, β1 is the slope, and εi is the random error.
Model Description

For one input variable and one output variable:

Y = β0 + β1X1 + ϵ

For more than one input variable and one output variable:

Y = β0 + β1X1 + β2X2 + … + βp-1Xp-1 + ϵ

Model Description Example

• Predict a person's annual income as a function of experience and education:

Income = β0 + β1Experience + β2Education + ϵ

• The βi's represent the p unknown parameters
• There is considerable variation in income levels for a group of people with identical experience and years of education. This variation is represented by ϵ in the model.
• Ordinary Least Squares (OLS) is a common technique to estimate the parameters
Model Description Example

With OLS, the objective is to find the line through the observed points that minimizes the sum of the squares of the differences between each point and the line in the vertical direction. The vertical segments in the figure represent the distances between each observed y value and the line y = β0 + β1x.
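A minimal OLS sketch with numpy on synthetic data (the coefficient values here are made up for illustration, not real estimates):

import numpy as np

rng = np.random.default_rng(0)
n = 200
experience = rng.uniform(0, 30, n)    # years of experience
education = rng.uniform(10, 20, n)    # years of education
# Synthetic "true" model; the betas are illustrative assumptions
income = 20 + 2.5 * experience + 1.8 * education + rng.normal(0, 5, n)

# OLS: minimize the sum of squared vertical distances to the fitted plane
X = np.column_stack([np.ones(n), experience, education])
beta, *_ = np.linalg.lstsq(X, income, rcond=None)
print(beta)  # recovers approximately [20, 2.5, 1.8]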
Model Description
With Normally Distributed Errors

● The figure illustrates a regression model with one input variable, the normality assumption on the error terms, and the effect on the outcome variable Y for a given value of X.
● E.g., for x = 8, E(y) ≈ 20, but individual values vary roughly from 15 to 25.
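A quick simulation of this picture (illustrative parameter values, chosen here so that E(y) = 20 at x = 8):

import numpy as np

rng = np.random.default_rng(1)
beta0, beta1, sigma = 4.0, 2.0, 2.5  # assumed values, not from the slides

x = 8
y = beta0 + beta1 * x + rng.normal(0, sigma, 10_000)
print(y.mean())                       # ~20: the line gives E(y) for a given x
print(np.percentile(y, [2.5, 97.5]))  # ~[15, 25]: spread due to the error term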

Diagnostics
Evaluating the Linearity Assumption

● A major assumption in linear regression modeling is that the relationship


between the input and output variables is linear
● The most fundamental way to evaluate this is to plot the outcome variable
against each input variable
● If the relationship between Age and Income is represented as illustrated in
Figure in next slide, a linear model would not apply. In such a case, it is often
useful to do any of the following:
○ Transform the outcome variable.
○ Transform the input variables.
○ Add extra input variables or terms to the regression model.

Diagnostics
Evaluating the Residuals

● Residuals. The difference between the observed value of the dependent variable
(y) and the predicted value (ŷ) is called the residual (e).

● Each data point has one residual.
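A short sketch computing residuals for a fitted line (numpy, synthetic data):

import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 100)
y = 3 + 1.5 * x + rng.normal(0, 1, 100)

# Fit y = b0 + b1*x by least squares, then form residuals e = y - y_hat
b1, b0 = np.polyfit(x, y, 1)   # polyfit returns [slope, intercept] for degree 1
y_hat = b0 + b1 * x
residuals = y - y_hat
print(residuals.mean())  # near zero; plot residuals vs. x to check for patterns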



Diagnostics
N-Fold Cross-Validation

● To prevent overfitting, a common practice splits the dataset into


training and test sets, develops the model on the training set and
evaluates it on the test set
● If the dataset is too small for this, an N-fold cross-validation technique can be used (see the sketch below)
○ The dataset is randomly split into N datasets of equal size
○ The model is trained on N-1 of the sets and tested on the remaining one
○ The process is repeated N times
○ The N model errors are averaged over the N folds
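A sketch of the procedure with plain numpy index splits (N = 5 assumed; a simple least-squares line stands in for the model):

import numpy as np

def n_fold_cv_error(X, y, n_folds=5, seed=0):
    # Average test mean-squared error of y = b0 + b1*x over n_folds splits
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    errors = []
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        b1, b0 = np.polyfit(X[train], y[train], 1)   # train on N-1 folds
        pred = b0 + b1 * X[test]                     # test on the held-out fold
        errors.append(np.mean((y[test] - pred) ** 2))
    return np.mean(errors)                           # average over the N folds

rng = np.random.default_rng(3)
X = rng.uniform(0, 10, 120)
y = 2 + 0.7 * X + rng.normal(0, 1, 120)
print(n_fold_cv_error(X, y))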

Diagnostics
Other Diagnostic Considerations

● The model might be improved by including additional input


variables
● Residual plots should be examined for outliers
● Finally, the magnitude and signs of the estimated parameters
should be examined to see if they make sense
Regression

Linear Regression

Logistic Regression

Reasons to Choose and Cautions

Additional Regression Models



Logistic Regression - Introduction

● In linear regression modeling, the outcome variable is continuous –


e.g., income ~ age and education
● In logistic regression, the outcome variable is categorical, e.g.,
two-valued outcomes such as
○ True/false,
○ pass/fail,
○ yes/no

Logistic Regression
Use Cases

Medical
• Probability of a patient's successful response to a specific medical treatment – inputs could include age, weight, etc.

Finance
• Probability an applicant defaults on a loan

Marketing
• Probability a wireless customer switches carriers (churns)

Engineering
• Probability a mechanical part malfunctions or fails

Logistic Regression
Model Description

Logistic regression is based on the logistic function:

f(y) = 1 / (1 + e^(-y))

As y → ∞, f(y) → 1; and as y → −∞, f(y) → 0

https://www.saedsayad.com/logistic_regression.htm

Logistic Regression
Model Description

With the range of f(y) as (0,1), the logistic function models the
probability of an outcome occurring

In contrast to linear regression, the values of y are not directly


observed; only the values of f(y) in terms of success or failure
are observed.
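A tiny numeric sketch of the logistic function and its limits:

import numpy as np

def logistic(y):
    # f(y) = 1 / (1 + e^(-y)); the output always lies in (0, 1)
    return 1.0 / (1.0 + np.exp(-y))

print(logistic(np.array([-10.0, 0.0, 10.0])))
# [~0.00005, 0.5, ~0.99995]: f(y) -> 0 as y -> -infinity, -> 1 as y -> +infinity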
Linear Regression vs Logistic Regression

Parameter | Linear Regression | Logistic Regression
Type of variable used | Continuous | Categorical
Curve type | Straight line | S-shaped curve
Linear relationship between dependent and independent variables | Required | Not required
Purpose | Used to estimate the dependent variable when the independent variables change | Used to calculate the probability of an event
Formula | Y = β0 + β1X1 + … + ϵ | P = 1 / (1 + e^-(β0 + β1X1 + …))
Example | Relationship between number of hours worked and salary | Whether they pass or fail

Diagnostics
Receiver Operating Characteristic (ROC) Curve

● Logistic regression is often used to classify


○ For two classes, C (Churn) and nC (notChurn), we have
■ True Positive: predict C, when actually C
■ True Negative: predict nC, when actually nC
■ False Positive: predict C, when actually nC
■ False Negative: predict nC, when actually C
Diagnostics
Receiver Operating Characteristic (ROC) Curve

                             Actual Value
                             Positive (1)   Negative (0)
Predicted   Positive (1)     TP             FP
Value       Negative (0)     FN             TN

Diagnostics
Receiver Operating Characteristic (ROC) Curve

False Positive Rate (FPR) = (# of false positives) / (# of negatives)

True Positive Rate (TPR) = (# of true positives) / (# of positives)

● The Receiver Operating Characteristic (ROC) curve plots TPR against FPR
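A sketch that traces ROC points by sweeping the decision threshold over synthetic scores (numpy only; the data are made up):

import numpy as np

def roc_points(scores, labels, thresholds):
    # Predict positive when score >= threshold; return (FPR, TPR) per threshold
    positives = np.sum(labels == 1)
    negatives = np.sum(labels == 0)
    pts = []
    for t in thresholds:
        pred = scores >= t
        tp = np.sum(pred & (labels == 1))
        fp = np.sum(pred & (labels == 0))
        pts.append((fp / negatives, tp / positives))
    return pts

rng = np.random.default_rng(4)
labels = rng.integers(0, 2, 500)
scores = rng.normal(labels.astype(float), 1.0)  # positives tend to score higher
for fpr, tpr in roc_points(scores, labels, [0.0, 0.5, 1.0]):
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")  # lower thresholds raise both rates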

Diagnostics
Receiver Operating Characteristic (ROC) Curve

(Figure: an example ROC curve, TPR plotted against FPR.)
Regression

Linear Regression

Logistic Regression

Reasons to Choose and Cautions

Additional Regression Models



Reasons to Choose and Cautions

Linear regression – outcome variable continuous

Logistic regression – outcome variable categorical

Both models assume a linear additive function of the input variables

• If this is not true, the models perform poorly


• In linear regression, the further assumption of normally distributed error terms is important for many
statistical inferences

Although a set of input variables may be a good predictor of an output variable,


“correlation does not imply causation”
Regression

Linear Regression

Logistic Regression

Reasons to Choose and Cautions

Additional Regression Models



Additional Regression Models

● Multicollinearity is the condition when several input variables are highly correlated
○ This can lead to inappropriately large coefficients
● To mitigate this problem (see the sketch below)
○ Ridge regression applies a penalty based on the size of the coefficients
○ Lasso regression applies a penalty proportional to the sum of the absolute values of the coefficients
● Multinomial logistic regression is used for a more-than-two-state categorical outcome variable
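A hedged sketch of the multicollinearity problem and the penalized fixes (scikit-learn assumed available; synthetic data):

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(5)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)    # x2 nearly duplicates x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.5, size=n)  # only x1 truly drives y

print(LinearRegression().fit(X, y).coef_)   # OLS: coefficients can be large and offsetting
print(Ridge(alpha=1.0).fit(X, y).coef_)     # L2 penalty shrinks and stabilizes them
print(Lasso(alpha=0.1).fit(X, y).coef_)     # L1 penalty can zero one out entirely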
