Agenda

● Advanced Analytical Theory and Methods: Association Rules - overview, Apriori algorithm, evaluation of candidate rules, case study: transactions in a grocery store, validation and testing, diagnostics

● Regression - linear, logistic, reasons to choose and cautions, additional regression models
Syllabus

Advanced Analytical Theory and Methods: Association Rules - overview, Apriori algorithm, evaluation of candidate rules, case study: transactions in a grocery store, validation and testing, diagnostics.

Regression - linear, logistic, reasons to choose and cautions, additional regression models.
Association Rules

Overview

Apriori Algorithm

Evaluation of Candidate Rules

Example: Transactions in a Grocery Store

Validation and Testing

Diagnostics
Association Rules Method: Overview

• Unsupervised learning method

• Descriptive (not predictive) method

• Used to find hidden relationships in data

• The relationships are represented as rules


Questions association rules might answer

● Which products tend to be purchased together?

● What products do similar customers tend to buy?

Applications of Association Rules

Market Basket Analysis

(The examples in this section are adapted from https://blog.rsquaredacademy.com/market-basket-analysis-in-r/.)
Market Basket Analysis

❏ Market basket analysis uses association rule mining to identify products that are frequently bought together.


Market Basket Analysis - Why?

Use Cases (Applications) of Association Rule Mining

Simple Example - Transaction Data

Simple Example - Frequent Item Set

Simple Example - Association Rule: Support, Confidence, Lift

(The slides above consist of figures from the rsquaredacademy post cited earlier.)
Simple Example - Association Rule: Lift - Interpretation

● Lift = 1: implies no relationship between mobile phone and screen guard (i.e.,
mobile phone and screen guard occur together only by chance)
● Lift > 1: implies that there is a positive relationship between mobile phone and
screen guard (i.e., mobile phone and screen guard occur together more often than
random)
● Lift < 1: implies that there is a negative relationship between mobile phone and
screen guard (i.e., mobile phone and screen guard occur together less often than
random)
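To make the interpretation concrete, here is a minimal Python sketch with hypothetical counts (the numbers are illustrative, not taken from the source blog):

# Hypothetical counts for 1000 transactions (illustrative only)
n_transactions = 1000
n_phone = 200   # baskets containing a mobile phone
n_guard = 150   # baskets containing a screen guard
n_both = 90     # baskets containing both

support_phone = n_phone / n_transactions   # 0.20
support_guard = n_guard / n_transactions   # 0.15
support_both = n_both / n_transactions     # 0.09

# Lift = P(phone and guard) / (P(phone) * P(guard))
lift = support_both / (support_phone * support_guard)
print(lift)  # 3.0 > 1: the two items co-occur more often than chance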

Association Rule

• Frequent itemsets from the previous section can form candidate rules such as "X implies Y":

  X → Y
Association Rule

The appropriateness of a candidate rule is evaluated using four measures:

Support    Confidence    Lift    Leverage
Association Rule / Apriori Example

Minimum support = 0.5 (50%). With 9 transactions, the minimum support count is 9 × 0.5 = 4.5, which these slides round down to 4.

TID  | List of Item IDs
T100 | I1, I2, I5
T101 | I2, I4
T102 | I2, I5
T103 | I1, I2, I4
T104 | I1, I2, I3
T105 | I2, I3
T106 | I1, I2, I3, I4
T107 | I1, I2, I3
T108 | I1, I3, I5

Item Set | Frequency
{I1}     | 6
{I2}     | 8
{I3}     | 5
{I4}     | 3
{I5}     | 3
Example

Minimum support count = 4 (as above). Candidate 1-itemsets are counted over the transactions and then pruned.

Candidate generation:

Item Set | Frequency
{I1}     | 6
{I2}     | 8
{I3}     | 5
{I4}     | 3
{I5}     | 3

After pruning (frequency ≥ 4):

Item Set | Frequency
{I1}     | 6
{I2}     | 8
{I3}     | 5
Example

Candidate 2-itemsets are generated by joining the frequent 1-itemsets, then pruned against the minimum support count of 4.

Candidate generation:

Item Set | Frequency
{I1, I2} | 5
{I1, I3} | 4
{I2, I3} | 4

After pruning (frequency ≥ 4):

Item Set | Frequency
{I1, I2} | 5
{I1, I3} | 4
{I2, I3} | 4
Example

Candidate 3-itemsets are generated from the frequent 2-itemsets:

Item Set     | Frequency
{I1, I2, I3} | 3

After pruning (frequency ≥ 4) no itemsets remain, so the algorithm stops and rules are formed from the previous stage.

We have 3 rules:
1. I1 => I2
2. I1 => I3
3. I2 => I3
Example - Support

Support = Freq(X ∪ Y) / (number of transactions)

Rule     | Freq(X ∪ Y) | Putting values in formula | Support value
I1 => I2 | 5           | 5/9                       | 0.55
I1 => I3 | 4           | 4/9                       | 0.44
I2 => I3 | 4           | 4/9                       | 0.44
Example - Confidence

Confidence(X => Y) = Freq(X ∪ Y) / Freq(X)

Rule     | Freq(X) | Freq(X ∪ Y) | Putting values in formula | Confidence
I1 => I2 | 6       | 5           | 5/6                       | 0.83
I1 => I3 | 6       | 4           | 4/6                       | 0.66
I2 => I3 | 8       | 4           | 4/8                       | 0.50
Example - Lift

Lift = Support(X ∪ Y) / (Support(X) × Support(Y))

Rule     | Support(X ∪ Y) | Support(X) | Support(Y) | Putting values in formula | Lift
I1 => I2 | 0.55           | 6/9 = 0.66 | 8/9 = 0.88 | 0.55 / (0.66 × 0.88)      | 0.94
I1 => I3 | 0.44           | 6/9 = 0.66 | 5/9 = 0.55 | 0.44 / (0.66 × 0.55)      | 1.21
I2 => I3 | 0.44           | 8/9 = 0.88 | 5/9 = 0.55 | 0.44 / (0.88 × 0.55)      | 0.90
Example - Leverage

Leverage = Support(X ∪ Y) − Support(X) × Support(Y)

Rule     | Support(X ∪ Y) | Support(X) | Support(Y) | Putting values in formula | Leverage
I1 => I2 | 0.55           | 6/9 = 0.66 | 8/9 = 0.88 | 0.55 − (0.66 × 0.88)      | -0.03
I1 => I3 | 0.44           | 6/9 = 0.66 | 5/9 = 0.55 | 0.44 − (0.66 × 0.55)      | 0.077
I2 => I3 | 0.44           | 8/9 = 0.88 | 5/9 = 0.55 | 0.44 − (0.88 × 0.55)      | -0.044
Example

Rule     | Support | Confidence | Lift | Leverage
I1 => I2 | 0.55    | 0.83       | 0.94 | -0.03
I1 => I3 | 0.44    | 0.66       | 1.21 | 0.077
I2 => I3 | 0.44    | 0.50       | 0.90 | -0.044
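As a cross-check of the tables above, a short Python sketch that recomputes all four measures directly from the nine transactions (exact fractions give slightly different decimals than the rounded intermediate values used in the slides):

transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I5"},
    {"I1", "I2", "I4"}, {"I1", "I2", "I3"}, {"I2", "I3"},
    {"I1", "I2", "I3", "I4"}, {"I1", "I2", "I3"}, {"I1", "I3", "I5"},
]
n = len(transactions)  # 9

def support(itemset):
    # Fraction of transactions containing every item in the itemset
    return sum(itemset <= t for t in transactions) / n

for x, y in [("I1", "I2"), ("I1", "I3"), ("I2", "I3")]:
    s_xy, s_x, s_y = support({x, y}), support({x}), support({y})
    print(f"{x} => {y}: support={s_xy:.2f}, confidence={s_xy / s_x:.2f}, "
          f"lift={s_xy / (s_x * s_y):.2f}, leverage={s_xy - s_x * s_y:.3f}")
# I1 => I2: support=0.56, confidence=0.83, lift=0.94, leverage=-0.037
# I1 => I3: support=0.44, confidence=0.67, lift=1.20, leverage=0.074
# I2 => I3: support=0.44, confidence=0.50, lift=0.90, leverage=-0.049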


The Apriori Algorithm

• Join Step: Ck is generated by joining Lk-1 with itself
• Prune Step: Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset
• Pseudo-code:
    Ck: candidate itemsets of size k
    Lk: frequent itemsets of size k

    L1 = {frequent items};
    for (k = 1; Lk != ∅; k++) do begin
        Ck+1 = candidates generated from Lk;
        for each transaction t in database do
            increment the count of all candidates in Ck+1 that are contained in t
        Lk+1 = candidates in Ck+1 with min_support
    end
    return ∪k Lk;
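A compact, runnable Python rendering of this pseudocode (a sketch using plain sets and dictionaries, checked against the grocery example above):

from itertools import combinations

def apriori(transactions, min_count):
    # Frequent 1-itemsets (L1)
    items = {i for t in transactions for i in t}
    counts = {frozenset([i]): sum(i in t for t in transactions) for i in items}
    frequent = {s: c for s, c in counts.items() if c >= min_count}
    result, k = dict(frequent), 1
    while frequent:
        # Join step: build candidate (k+1)-itemsets from frequent k-itemsets
        prev = list(frequent)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k + 1}
        # Prune step: every k-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k))}
        # One scan of the database to count the surviving candidates
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_count}
        result.update(frequent)
        k += 1
    return result

# The grocery example above: 9 transactions, minimum support count 4
transactions = [{"I1","I2","I5"}, {"I2","I4"}, {"I2","I5"}, {"I1","I2","I4"},
                {"I1","I2","I3"}, {"I2","I3"}, {"I1","I2","I3","I4"},
                {"I1","I2","I3"}, {"I1","I3","I5"}]
print(apriori(transactions, 4))
# {I1}:6 {I2}:8 {I3}:5 {I1,I2}:5 {I1,I3}:4 {I2,I3}:4 -- matching the slides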
Applications of Association Rules

The term market basket analysis refers to a specific implementation of


association rules
• For better merchandising – products to include/exclude from inventory each month
• Placement of products
• Cross-selling
• Promotional programs—multiple product purchase incentives managed through a loyalty card
program

Association rules are also used for

• Recommender systems – Amazon, Netflix


• Clickstream analysis from web usage log files
• Website visitors to page X click on links A,B,C more than on links D,E,F
Validation and Testing

● Frequent itemsets and high-confidence rules are identified using pre-specified minimum
support and minimum confidence levels
● Measures like lift and/or leverage then ensure that interesting rules are identified
rather than coincidental ones
● However, some of the remaining rules may be considered subjectively uninteresting
because they don’t yield unexpected profitable actions
○ E.g., rules like {paper} -> {pencil} are not interesting/meaningful
● Incorporating subjective knowledge requires domain experts
● Good rules provide valuable insights for institutions to improve their business
operations
Diagnostics

● Although the Apriori algorithm is easy to understand and implement, some of the rules
generated are uninteresting or practically useless.
● Additionally, some of the rules may be generated due to coincidental relationships between
the variables.
● Measures like confidence, lift, and leverage should be used along with human insights to
address this problem
● Another problem with association rules is that, in Phases 3 and 4 of the Data Analytics
Lifecycle, the team must specify the minimum support prior to the model execution,
which may lead to too many or too few rules.
● In related research, a variant of the algorithm can use a predefined target range for the
number of rules so that the algorithm can adjust the minimum support accordingly.
● The algorithm requires a scan of the entire database to obtain the result. Accordingly, as the
database grows, each run takes more time to compute.
Diagnostics - Approaches to Improve Apriori's Efficiency

• Partitioning: any itemset that is potentially frequent in a transaction database must be frequent in at least one of the partitions of the transaction database.

• Sampling: extract a subset of the data with a lower support threshold and use the subset to perform association rule mining.

• Transaction reduction: a transaction that does not contain frequent k-itemsets is useless in subsequent scans and therefore can be ignored.

• Dynamic itemset counting: only add new candidate itemsets when all of their subsets are estimated to be frequent.

• Hash-based itemset counting: if the corresponding hashing bucket count of a k-itemset is below a certain threshold, the k-itemset cannot be frequent.
Syllabus

Advanced Analytical Theory and Methods: Association Rules - overview, Apriori algorithm, evaluation of candidate rules, case study: transactions in a grocery store, validation and testing, diagnostics.

Regression - linear, logistic, reasons to choose and cautions, additional regression models.
Regression

Linear Regression

Logistic Regression

Reasons to Choose and Cautions

Additional Regression Models


Regression

Regression is a supervised learning method used for prediction.


Regression

● Regression analysis attempts to explain the influence that


input (independent) variables have on the outcome
(dependent) variable
● Questions regression might answer
○ What is a person’s expected income?
○ What is the probability that an applicant will default on a loan?
Regression Types

Linear Regression    Logistic Regression

Linear Regression

● Models the relationship between several input variables and a continuous outcome variable
Use Cases (Applications)

Real estate example


• Predict residential home prices
• Possible inputs – living area, #bathrooms, #bedrooms, lot size, property taxes

Demand forecasting example


• Restaurant predicts quantity of food needed
• Possible inputs – weather, day of week, etc.

Medical example
• Analyze effect of proposed radiation treatment
• Possible inputs – radiation treatment duration, frequency
Linear Regression Equations
Linear Regression Model

The relationship between the variables is a linear function:

Yi = β0 + β1Xi + εi

where Yi is the dependent (response) variable (e.g., income), Xi is the independent (explanatory) variable (e.g., age), β0 is the Y-intercept, β1 is the slope, and εi is the random error.
Model Description

For one input variable and one output variable:

Y = β0 + β1X1 + ϵ

For more than one input variable and one output variable:

Y = β0 + β1X1 + β2X2 + … + βp-1Xp-1 + ϵ

Model Description Example

• Predict a person's annual income as a function of experience and education:

Income = β0 + β1Experience + β2Education + ϵ

• The βi's represent the p unknown parameters
• There is considerable variation in income levels for a group of people with identical experience and years of education. This variation is represented by ϵ in the model.
• Ordinary Least Squares (OLS) is a common technique to estimate the parameters
Model Description Example

With OLS, the objective is to find the line through the observed points that minimizes the sum of the squares of the differences between each point and the line in the vertical direction. The vertical segments in the figure represent the distances between each observed y value and the line y = β0 + β1x.
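A minimal OLS sketch with numpy on synthetic data (the coefficient values here are made up for illustration, not real estimates):

import numpy as np

rng = np.random.default_rng(0)
n = 200
experience = rng.uniform(0, 30, n)    # years of experience
education = rng.uniform(10, 20, n)    # years of education
# Synthetic "true" model; the betas are illustrative assumptions
income = 20 + 2.5 * experience + 1.8 * education + rng.normal(0, 5, n)

# OLS: minimize the sum of squared vertical distances to the fitted plane
X = np.column_stack([np.ones(n), experience, education])
beta, *_ = np.linalg.lstsq(X, income, rcond=None)
print(beta)  # recovers approximately [20, 2.5, 1.8]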
Model Description
With Normally Distributed Errors

● The figure illustrates a regression model with one input variable, the normality assumption on the error terms, and the effect on the outcome variable Y for a given value of X.
● E.g., for x = 8, E(y) ≈ 20, but individual values vary roughly from 15 to 25.
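A quick simulation of this picture (illustrative parameter values, chosen here so that E(y) = 20 at x = 8):

import numpy as np

rng = np.random.default_rng(1)
beta0, beta1, sigma = 4.0, 2.0, 2.5  # assumed values, not from the slides

x = 8
y = beta0 + beta1 * x + rng.normal(0, sigma, 10_000)
print(y.mean())                       # ~20: the line gives E(y) for a given x
print(np.percentile(y, [2.5, 97.5]))  # ~[15, 25]: spread due to the error term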

Diagnostics
Evaluating the Linearity Assumption

● A major assumption in linear regression modeling is that the relationship


between the input and output variables is linear
● The most fundamental way to evaluate this is to plot the outcome variable
against each input variable
● If the relationship between Age and Income is represented as illustrated in
Figure in next slide, a linear model would not apply. In such a case, it is often
useful to do any of the following:
○ Transform the outcome variable.
○ Transform the input variables.
○ Add extra input variables or terms to the regression model.

Diagnostics
Evaluating the Residuals

● Residuals. The difference between the observed value of the dependent variable
(y) and the predicted value (ŷ) is called the residual (e).

● Each data point has one residual.
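A short sketch computing residuals for a fitted line (numpy, synthetic data):

import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 100)
y = 3 + 1.5 * x + rng.normal(0, 1, 100)

# Fit y = b0 + b1*x by least squares, then form residuals e = y - y_hat
b1, b0 = np.polyfit(x, y, 1)   # polyfit returns [slope, intercept] for degree 1
y_hat = b0 + b1 * x
residuals = y - y_hat
print(residuals.mean())  # near zero; plot residuals vs. x to check for patterns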



Diagnostics
N-Fold Cross-Validation

● To prevent overfitting, a common practice splits the dataset into


training and test sets, develops the model on the training set and
evaluates it on the test set
● If the dataset is too small for this, an N-fold cross-validation technique can be used (see the sketch below)
○ The dataset is randomly split into N datasets of equal size
○ The model is trained on N-1 of the sets and tested on the remaining one
○ The process is repeated N times
○ The N model errors are averaged over the N folds
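A sketch of the procedure with plain numpy index splits (N = 5 assumed; a simple least-squares line stands in for the model):

import numpy as np

def n_fold_cv_error(X, y, n_folds=5, seed=0):
    # Average test mean-squared error of y = b0 + b1*x over n_folds splits
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    errors = []
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        b1, b0 = np.polyfit(X[train], y[train], 1)   # train on N-1 folds
        pred = b0 + b1 * X[test]                     # test on the held-out fold
        errors.append(np.mean((y[test] - pred) ** 2))
    return np.mean(errors)                           # average over the N folds

rng = np.random.default_rng(3)
X = rng.uniform(0, 10, 120)
y = 2 + 0.7 * X + rng.normal(0, 1, 120)
print(n_fold_cv_error(X, y))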

Diagnostics
Other Diagnostic Considerations

● The model might be improved by including additional input


variables
● Residual plots should be examined for outliers
● Finally, the magnitude and signs of the estimated parameters
should be examined to see if they make sense
Regression

Linear Regression

Logistic Regression

Reasons to Choose and Cautions

Additional Regression Models



Logistic Regression - Introduction

● In linear regression modeling, the outcome variable is continuous –


e.g., income ~ age and education
● In logistic regression, the outcome variable is categorical, e.g.,
two-valued outcomes such as
○ True/false,
○ pass/fail,
○ yes/no

Logistic Regression
Use Cases

Medical
• Probability of a patient's successful response to a specific medical treatment – inputs could include age, weight, etc.

Finance
• Probability an applicant defaults on a loan

Marketing
• Probability a wireless customer switches carriers (churns)

Engineering
• Probability a mechanical part malfunctions or fails

Logistic Regression
Model Description

Logistic regression is based on the logistic function:

f(y) = 1 / (1 + e^(-y))

As y → ∞, f(y) → 1; and as y → −∞, f(y) → 0

https://www.saedsayad.com/logistic_regression.htm

Logistic Regression
Model Description

With the range of f(y) as (0,1), the logistic function models the
probability of an outcome occurring

In contrast to linear regression, the values of y are not directly


observed; only the values of f(y) in terms of success or failure
are observed.
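A tiny numeric sketch of the logistic function and its limits:

import numpy as np

def logistic(y):
    # f(y) = 1 / (1 + e^(-y)); the output always lies in (0, 1)
    return 1.0 / (1.0 + np.exp(-y))

print(logistic(np.array([-10.0, 0.0, 10.0])))
# [~0.00005, 0.5, ~0.99995]: f(y) -> 0 as y -> -infinity, -> 1 as y -> +infinity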
Linear Regression vs Logistic Regression

Parameter | Linear Regression | Logistic Regression
Type of variable used | Continuous | Categorical
Curve type | Straight line | S-shaped curve
Linear relationship between dependent and independent variables | Required | Not required
Purpose | Used to estimate the dependent variable when the independent variables change | Used to calculate the probability of an event
Formula | Y = β0 + β1X1 + … + ϵ | P = 1 / (1 + e^-(β0 + β1X1 + …))
Example | Relationship between number of hours worked and salary | Whether they pass or fail

Diagnostics
Receiver Operating Characteristic (ROC) Curve

● Logistic regression is often used to classify


○ For two classes, C (Churn) and nC (notChurn), we have
■ True Positive: predict C, when actually C
■ True Negative: predict nC, when actually nC
■ False Positive: predict C, when actually nC
■ False Negative: predict nC, when actually C
Diagnostics
Receiver Operating Characteristic (ROC) Curve

                             Actual Value
                             Positive (1)   Negative (0)
Predicted   Positive (1)     TP             FP
Value       Negative (0)     FN             TN

Diagnostics
Receiver Operating Characteristic (ROC) Curve

False Positive Rate (FPR) = (# of false positives) / (# of negatives)

True Positive Rate (TPR) = (# of true positives) / (# of positives)

● The Receiver Operating Characteristic (ROC) curve plots TPR against FPR
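A sketch that traces ROC points by sweeping the decision threshold over synthetic scores (numpy only; the data are made up):

import numpy as np

def roc_points(scores, labels, thresholds):
    # Predict positive when score >= threshold; return (FPR, TPR) per threshold
    positives = np.sum(labels == 1)
    negatives = np.sum(labels == 0)
    pts = []
    for t in thresholds:
        pred = scores >= t
        tp = np.sum(pred & (labels == 1))
        fp = np.sum(pred & (labels == 0))
        pts.append((fp / negatives, tp / positives))
    return pts

rng = np.random.default_rng(4)
labels = rng.integers(0, 2, 500)
scores = rng.normal(labels.astype(float), 1.0)  # positives tend to score higher
for fpr, tpr in roc_points(scores, labels, [0.0, 0.5, 1.0]):
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")  # lower thresholds raise both rates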

Diagnostics
Receiver Operating Characteristic (ROC) Curve

(Figure: an example ROC curve, TPR plotted against FPR.)
Regression

Linear Regression

Logistic Regression

Reasons to Choose and Cautions

Additional Regression Models



Reasons to Choose and Cautions

Linear regression – outcome variable continuous

Logistic regression – outcome variable categorical

Both models assume a linear additive function of the input variables

• If this is not true, the models perform poorly


• In linear regression, the further assumption of normally distributed error terms is important for many
statistical inferences

Although a set of input variables may be a good predictor of an output variable,


“correlation does not imply causation”
Regression

Linear Regression

Logistic Regression

Reasons to Choose and Cautions

Additional Regression Models



Additional Regression Models

● Multicollinearity is the condition when several input variables are highly correlated
○ This can lead to inappropriately large coefficients
● To mitigate this problem (see the sketch below)
○ Ridge regression applies a penalty based on the size of the coefficients
○ Lasso regression applies a penalty proportional to the sum of the absolute values of the coefficients
● Multinomial logistic regression is used for a more-than-two-state categorical outcome variable
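A hedged sketch of the multicollinearity problem and the penalized fixes (scikit-learn assumed available; synthetic data):

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(5)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)    # x2 nearly duplicates x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.5, size=n)  # only x1 truly drives y

print(LinearRegression().fit(X, y).coef_)   # OLS: coefficients can be large and offsetting
print(Ridge(alpha=1.0).fit(X, y).coef_)     # L2 penalty shrinks and stabilizes them
print(Lasso(alpha=0.1).fit(X, y).coef_)     # L1 penalty can zero one out entirely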
