Data Analytics (BE-2015 Pattern)
Unit III: Association Rules and Regression
Syllabus
Advanced Analytical Theory and Methods:
Association Rules – overview, Apriori algorithm, evaluation of candidate rules, case study: transactions in a grocery store, validation and testing, diagnostics.
Overview
Association rules method
• Unsupervised learning method
• Descriptive (not predictive) method
• Used to find hidden relationships in data
• The relationships are represented as rules
Questions association rules might answer
• Which products tend to be purchased together?
• What products do similar customers tend to buy?
Overview
• Example: the general logic of association rules [figure omitted]
Rules have the form X -> Y
• When X is observed, Y is also observed
Itemset
• Collection of items or entities
• k-itemset = {item 1, item 2,…,item k}
• Examples
• Items purchased in one transaction
• Set of hyperlinks clicked by a user in one session
Apriori Algorithm
Definition: Association Rule
Let D be a database of transactions, e.g.:
Transaction ID | Items
2000 | A, B, C
1000 | A, C
4000 | A, D
5000 | B, E, F
Two basic measures for a rule X -> Y:
• Support (s): the fraction of transactions that contain both X and Y out of the total number of transactions:
s(X -> Y) = σ(X ∪ Y) / |T|, where |T| is the total number of transactions
• Confidence (c): the fraction of the transactions containing X that also contain Y:
c(X -> Y) = σ(X ∪ Y) / σ(X)
Example, for the rule {Milk, Diaper} -> {Beer} on the five-transaction table of Example 2 below:
s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 ≈ 0.67
The Apriori Algorithm
• Ck: set of candidate k-itemsets; Lk: set of frequent k-itemsets
• Pseudo-code:
L1 = {frequent 1-itemsets};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with count >= min_support;
end
return ∪k Lk;
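As a concrete sketch of this procedure, the R package arules (used later in this unit for the grocery case study) can mine frequent itemsets directly; the five toy baskets below are made up for illustration.

# Minimal sketch using the arules package (assumed installed);
# the toy baskets are hypothetical.
library(arules)

baskets <- list(c("A", "B", "C"),
                c("A", "C"),
                c("A", "D"),
                c("B", "E", "F"),
                c("A", "B", "D"))
trans <- as(baskets, "transactions")

# Frequent itemsets at minimum support 0.4 (at least 2 of the 5 baskets)
itemsets <- apriori(trans,
                    parameter = list(support = 0.4,
                                     target = "frequent itemsets"))
inspect(itemsets)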
The Apriori Algorithm — Example 1
MinSupp = 0.5 (i.e., 50%, or a count of 2)

Database D:
TID | Items
100 | 1 3 4
200 | 2 3 5
300 | 1 2 3 5
400 | 2 5

Scan D to count the candidate 1-itemsets C1:
itemset | sup.
{1} | 2
{2} | 3
{3} | 3
{4} | 1
{5} | 3

Pruning itemsets below MinSupp gives the frequent 1-itemsets L1:
itemset | sup.
{1} | 2
{2} | 3
{3} | 3
{5} | 3
The Apriori Algorithm — Example 2
TID | Items
1 | Bread, Milk
2 | Bread, Diaper, Beer, Eggs
3 | Milk, Diaper, Beer, Coke
4 | Bread, Milk, Diaper, Beer
5 | Bread, Milk, Diaper, Coke

Itemset | Count
{Bread, Milk, Diaper} | 3
Evaluation of Candidate Rules
• Frequent itemsets from the previous
section can form candidate rules such
as X implies Y (X → Y).
• This section discusses how measures
such as confidence, lift, and leverage
can help evaluate the appropriateness
of these candidate rules
Evaluation of Candidate Rules
Confidence: how likely item Y is to be purchased when item X is purchased, for a rule X -> Y:
● Confidence(X ⇒ Y) = Support(X ∧ Y) / Support(X)
Lift: how likely item Y is to be purchased when item X is purchased, while controlling for how popular item Y is:
● Lift(X ⇒ Y) = Support(X ∧ Y) / (Support(X) × Support(Y))
Leverage: similar to lift, but uses a difference instead of a ratio:
● Leverage(X ⇒ Y) = Support(X ∧ Y) − Support(X) × Support(Y)
Evaluation of Candidate Rules
Confidence
• Confidence measures the certainty of a rule
• Mathematically, confidence is the percent of transactions that contain both X and Y out of all the transactions that contain X:
• Confidence(X ⇒ Y) = Support(X ∧ Y) / Support(X)
Lift
• Lift measures how many times more often X and Y occur together than would be expected if they were statistically independent:
• Lift(X ⇒ Y) = Support(X ∧ Y) / (Support(X) × Support(Y))
• If, for example, lift(milk -> bread) is greater than lift(milk -> eggs), it can be concluded that milk and bread have a stronger association than milk and eggs.
Evaluation of Candidate Rules
Leverage
• Leverage measures the difference in the probability of X and Y appearing together compared to what would be expected under statistical independence:
• Leverage(X ⇒ Y) = Support(X ∧ Y) − Support(X) × Support(Y)
• Under statistical independence, leverage is 0; larger values indicate a stronger association.
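As a worked sketch, the base-R snippet below computes these measures by hand for the five-transaction table from Example 2 (the helper functions support, confidence, lift, and leverage are defined here for illustration, not taken from any package):

# Base-R sketch: compute support, confidence, lift, and leverage by hand
# for the five-transaction grocery table from Example 2.
transactions <- list(c("Bread", "Milk"),
                     c("Bread", "Diaper", "Beer", "Eggs"),
                     c("Milk", "Diaper", "Beer", "Coke"),
                     c("Bread", "Milk", "Diaper", "Beer"),
                     c("Bread", "Milk", "Diaper", "Coke"))

# Fraction of transactions containing every item in `items`
support <- function(items) {
  mean(sapply(transactions, function(t) all(items %in% t)))
}

confidence <- function(x, y) support(c(x, y)) / support(x)
lift       <- function(x, y) support(c(x, y)) / (support(x) * support(y))
leverage   <- function(x, y) support(c(x, y)) - support(x) * support(y)

confidence(c("Milk", "Diaper"), "Beer")  # 0.4 / 0.6  = 2/3 ≈ 0.67
lift(c("Milk", "Diaper"), "Beer")        # 0.4 / (0.6 * 0.6) ≈ 1.11
leverage(c("Milk", "Diaper"), "Beer")    # 0.4 - 0.36 = 0.04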
Example: Grocery Store Transactions
1 The Groceries Dataset
> library(arules)
> data(Groceries)
> Groceries@itemInfo[1:10,]
> apply(Groceries@data[,10:20], 2, function(r)
    paste(Groceries@itemInfo[r, "labels"], collapse = ", "))
Example: Grocery Store Transactions
2 Frequent Itemset Generation
To illustrate the Apriori algorithm, the code below runs each iteration separately.
Assume a minimum support threshold of 0.02 (0.02 × 9,835 transactions ≈ 197); this yields 122 frequent itemsets in total.
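A sketch of this step using the arules package (the parameter values follow the slide; exact output may vary by package version):

library(arules)
data(Groceries)   # 9,835 grocery transactions bundled with arules

# Mine frequent itemsets at minimum support 0.02
itemsets <- apriori(Groceries,
                    parameter = list(support = 0.02,
                                     target = "frequent itemsets"))
summary(itemsets)                                  # counts by itemset size
inspect(head(sort(itemsets, by = "support"), 10))  # top-10 itemsets by support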
Example: Grocery Store Transactions
3 Rule Generation and Visualization
[Figure: matrix-based visualization of the generated rules. The legend on the right is a color matrix indicating the lift and the confidence to which each square in the main matrix corresponds.]
In the graph, the arrow always points from an item on the LHS
to an item on the RHS.
For example, the arrows that connect ham, processed cheese, and white bread
suggest the rule {ham, processed cheese} -> {white bread}.
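A sketch of the rule-generation and graph-visualization step (the arulesViz package is assumed to be installed; the 0.001 support and 0.6 confidence thresholds are illustrative):

library(arules)
library(arulesViz)

# Generate rules at a low support but high confidence threshold
rules <- apriori(Groceries,
                 parameter = list(support = 0.001,
                                  confidence = 0.6,
                                  target = "rules"))

plot(rules)   # scatterplot of support vs. confidence, shaded by lift

# Graph view of the highest-lift rules: arrows point from LHS items to RHS items
highLiftRules <- head(sort(rules, by = "lift"), 10)
plot(highLiftRules, method = "graph")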
Validation and Testing
• The rules produced by the algorithm should be validated before use: check that the high-confidence, high-lift rules still hold on holdout data, and review them with domain experts to confirm that they are meaningful in the business context.
Diagnostics
• Although the Apriori algorithm is easy to understand and implement, some of
the rules generated are uninteresting or practically useless.
• Additionally, some of the rules may be generated due to coincidental
relationships between the variables.
• Measures like confidence, lift, and leverage should be used along with human
insights to address this problem
• Another problem with association rules is that, in Phases 3 and 4 of the Data
Analytics Lifecycle, the team must specify the minimum support prior to
the model execution, which may lead to too many or too few rules.
• In related research, a variant of the algorithm can use a predefined target
range for the number of rules so that the algorithm can adjust the minimum
support accordingly.
• The algorithm requires scanning the entire database to obtain the result.
Accordingly, as the database grows, each run takes more time to compute.
Diagnostics: Approaches to Improve Apriori's Efficiency
Partitioning:
• Any itemset that is potentially frequent in a transaction database must be
frequent in at least one of the partitions of the transaction database.
Sampling:
• This extracts a subset of the data with a lower support threshold and uses the
subset to perform association rule mining.
Transaction reduction:
• A transaction that does not contain frequent k-itemsets is useless in subsequent
scans and therefore can be ignored.
Linear Regression
Logistic Regression
Medical example
• Analyze effect of proposed radiation treatment
• Possible inputs – radiation treatment duration, frequency
Linear Equations
Y = mX + b, where m is the slope (the change in Y for a unit change in X) and b is the Y-intercept.
Linear Regression Model
The relationship between the variables is a linear function:
Yi = β0 + β1Xi + εi
where β0 is the Y-intercept, β1 is the slope, and εi is the random error term.
Y is the dependent (response) variable (e.g., income); X is the independent (explanatory) variable (e.g., age).
Model Description
For one input variable and one output variable:
Income = β0 + β1·Age + ε
• OLS (ordinary least squares) estimates β0 and β1 by minimizing the sum of the squared differences between the observed values and the values predicted by the line: minimize Σi (yi − (β0 + β1xi))²
Model Description
Example
With OLS, the objective is to find the line through these points that
minimizes the sum of the squares of the differences between each
point and the line in the vertical direction.
[Figure: scatterplot of the observed points with the fitted line; the vertical lines represent the distance between each observed y value and the line.]
Model Description
With Normally Distributed Errors
• Making additional assumptions on the error term provides further capabilities
• It is common to assume the error term is a normally distributed random variable with mean equal to zero and constant variance
• Thus, the linear regression model is expressed as
• Y = β0 + β1X1 + ... + βpXp + ε, where ε ~ N(0, σ²)
Model Description
With Normally Distributed Errors
• With this assumption, the expected value E(Y) of the linear regression model is:
• E(Y) = β0 + β1X1 + ... + βpXp
Model Description – Example in R
> library(lattice)
> splom(~income_input[c(2:5)], groups=NULL, data=income_input,
    axis.line.tck=0, axis.text.alpha=0)
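To make the example self-contained, here is a minimal sketch of fitting such a model with R's lm(); since the income_input dataset from the slides is not reproduced here, the data frame below is simulated:

# Sketch: fit Income = b0 + b1*Age + b2*Education + error with lm().
# Simulated data stand in for the slides' income_input dataset.
set.seed(1)
n <- 1000
income_input <- data.frame(Age = runif(n, 18, 70),
                           Education = runif(n, 10, 20))
income_input$Income <- 20 + 1.5 * income_input$Age +
  3 * income_input$Education + rnorm(n, sd = 10)

results <- lm(Income ~ Age + Education, data = income_input)
summary(results)   # parameter estimates, standard errors, R-squared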
Diagnostics
Evaluating the Linearity Assumption
• A major assumption in linear regression modeling is that
the relationship between the input and output variables is
linear
• The most fundamental way to evaluate this is to plot the
outcome variable against each input variable
• If the relationship between Age and Income is represented
as illustrated in Figure in next slide, a linear model would
not apply. In such a case, it is often useful to do any of the
following:
• Transform the outcome variable.
• Transform the input variables.
• Add extra input variables or terms to the regression model.
Diagnostics
[Figure: Income plotted against Age, showing a clearly nonlinear relationship for which a linear model would not apply.]
Diagnostics
Evaluating the Residuals
• Residuals are the differences between the observed outcome
values and the fitted values based on the OLS parameter
estimates (see the plot sketch below).
• For residuals, the lm() function in R automatically calculates
and stores the fitted values and the residuals, in the
components fitted.values and residuals of its output.
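For instance, continuing the hypothetical lm() fit from the sketch above:

# Plot residuals against fitted values to look for patterns
plot(results$fitted.values, results$residuals,
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0)   # residuals should scatter evenly around this line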
Diagnostics
Evaluating the Residuals
• The residual plots are useful for confirming that the residuals
were centered on zero and have a constant variance
[Figures: residual plots illustrating problem cases – a nonlinear trend in the residuals, residuals not centered on zero, and variance that is not constant.]
Diagnostics
Evaluating the Normality Assumption
• From the histogram, it is seen that the residuals are
centered on zero and appear to be symmetric about zero, as
one would expect for a normally distributed random
variable.
[Figure: histogram of residuals centered on zero and appearing normally distributed.]
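For example (again using the hypothetical fit from the earlier sketch):

# Histogram of the residuals; roughly symmetric around zero is expected
hist(results$residuals, main = "Histogram of residuals",
     xlab = "Residuals")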
Diagnostics
Evaluating the Normality Assumption
• Another option is to examine a Q-Q plot,
comparing the observed data against the quantiles (Q)
of the assumed distribution
> qqnorm(results2$residuals)
> qqline(results2$residuals)
[Figures: Q-Q plots of normally distributed residuals and of non-normally distributed residuals.]
Diagnostics
N-Fold Cross-Validation
• To prevent overfitting, a common practice is to split the
dataset into training and test sets, develop the model on
the training set, and evaluate it on the test set
• If the amount of data is insufficient for this, an N-fold
cross-validation technique can be used, as sketched below
• The dataset is randomly split into N datasets of equal size
• The model is trained on N−1 of the sets and tested on the remaining one
• The process is repeated N times
• Average the N model errors over the N folds
• Note: if N equals the size of the dataset, this is the leave-one-out procedure
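A base-R sketch of this procedure for the hypothetical income model above (N = 5 and mean squared error as the fold-level metric are both illustrative choices):

# N-fold cross-validation sketch for the linear model above
N <- 5
set.seed(2)
fold <- sample(rep(1:N, length.out = nrow(income_input)))  # random fold labels

errors <- sapply(1:N, function(k) {
  train <- income_input[fold != k, ]
  test  <- income_input[fold == k, ]
  fit   <- lm(Income ~ Age + Education, data = train)
  pred  <- predict(fit, newdata = test)
  mean((test$Income - pred)^2)        # mean squared error on fold k
})
mean(errors)  # average model error over the N folds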
Diagnostics
Other Diagnostic Considerations
• The model might be improved by including additional
input variables
• However, the adjusted R2 applies a penalty as the number of
parameters increases
• Residual plots should be examined for outliers
• Points markedly different from the majority of points
• These may result from bad data, data processing errors, or actual rare
occurrences
• Finally, the magnitude and signs of the estimated
parameters should be examined to see if they make
sense
Regression
Linear Regression
Logistic Regression
Logistic Regression
Introduction
• In linear regression modeling, the outcome
variable is continuous – e.g., income ~ age and
education
• In logistic regression, the outcome variable is
categorical – e.g., two-valued outcomes such as:
• true/false
• pass/fail
• yes/no
Logistic Regression
Use Cases
Medical
• Probability of a patient’s successful response to a specific medical
treatment – input could include age, weight, etc.
Finance
• Probability an applicant defaults on a loan
Marketing
• Probability a wireless customer switches carriers (churns)
Engineering
• Probability a mechanical part malfunctions or fails
Logistic Regression
Model Description
• Logistic regression is based on the logistic function f(y) = e^y / (1 + e^y), for −∞ < y < ∞
Logistic Regression
Model Description
• With the range of f(y) as (0,1), the logistic function
models the probability of an outcome occurring
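A one-line base-R sketch of this S-shaped curve:

# Plot the logistic function f(y) = exp(y) / (1 + exp(y));
# f approaches 0 as y -> -Inf and 1 as y -> Inf
curve(exp(x) / (1 + exp(x)), from = -6, to = 6,
      xlab = "y", ylab = "f(y)")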
Logistic Regression
Model Description: customer churn example
• Example: a wireless telecom company wants to estimate the probability that a customer churns (switches carriers), based on inputs such as Age, Married, Cust_years, and Churned_contacts, as used in the R code below
> head(churn_input)         # Churned = 1 if the customer churned
> sum(churn_input$Churned)  # 1743 of 8000 customers churned
• Use the Generalized Linear Model function glm():
> Churn_logistic1 <- glm(Churned ~ Age + Married + Cust_years + Churned_contacts,
    data = churn_input, family = binomial(link = "logit"))
> summary(Churn_logistic1)  # Age and Churned_contacts are the strongest predictors
> Churn_logistic3 <- glm(Churned ~ Age + Churned_contacts,
    data = churn_input, family = binomial(link = "logit"))
> summary(Churn_logistic3)  # reduced model with Age + Churned_contacts
Diagnostics
Deviance and the Pseudo-R2
• In logistic regression, deviance plays a role analogous to the residual sum of squares in linear regression
• The pseudo-R2 measures how well the fitted model explains the data relative to the null model: pseudo-R2 = 1 − (residual deviance / null deviance)
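A sketch of computing the pseudo-R2 from the glm() fit above (glm objects store both deviances):

# Pseudo-R2 = 1 - residual deviance / null deviance
pseudo_R2 <- 1 - Churn_logistic3$deviance / Churn_logistic3$null.deviance
pseudo_R2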
Diagnostics
Receiver Operating Characteristic (ROC) Curve
• Logistic regression is often used to classify
• In the Churn example, a customer can be classified as
Churn if the model predicts high probability of churning
• Although 0.5 is often used as the probability threshold, other values can be chosen depending on the desired trade-off between true positives and false positives
• For two classes, C (Churn) and nC (not Churn), we have
• True Positive: predict C, when actually C
• True Negative: predict nC, when actually nC
• False Positive: predict C, when actually nC
• False Negative: predict nC, when actually C
Diagnostics
Receiver Operating Characteristic (ROC) Curve
> library(ROCR)
> Pred <- predict(Churn_logistic3, type = "response")
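The following sketch completes the ROC computation with ROCR's standard prediction() and performance() calls (the object names are illustrative):

> predObj <- prediction(Pred, churn_input$Churned)  # predicted probs vs. labels
> rocObj <- performance(predObj, measure = "tpr", x.measure = "fpr")
> plot(rocObj)                                      # ROC curve: TPR vs. FPR
> aucObj <- performance(predObj, measure = "auc")
> aucObj@y.values[[1]]                              # area under the ROC curve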
[Figure: ROC curve for the churn model, plotting the true positive rate against the false positive rate.]
Diagnostics
Histogram of the Probabilities
[Figure: histogram of the predicted churn probabilities from the fitted model.]