Data Analytics (BE-2015 Pattern)
Unit III: Association Rules and Regression
Syllabus
Advanced Analytical Theory and Methods:
Association Rules – overview, Apriori algorithm, evaluation of candidate rules, case study: transactions in a grocery store, validation and testing, diagnostics.
Overview
Association rules method
• Unsupervised learning method
• Descriptive (not predictive) method
• Used to find hidden relationships in data
• The relationships are represented as rules
Questions association rules might answer
• Which products tend to be purchased together?
• What products do similar customers tend to buy?
Overview
• Example: the general logic of association rules [figure omitted]
Rules have the form X -> Y
• When X is observed, Y is also observed
Itemset
• Collection of items or entities
• k-itemset = {item 1, item 2,…,item k}
• Examples
• Items purchased in one transaction
• Set of hyperlinks clicked by a user in one session
Apriori Algorithm
Definition: Association Rule
Let D be a database of transactions, e.g.:
Transaction ID | Items
2000 | A, B, C
1000 | A, C
4000 | A, D
5000 | B, E, F
Two basic measures for a rule X -> Y:
• Support (s): the fraction of transactions that contain both X and Y out of the total number of transactions:
s(X -> Y) = σ(X ∪ Y) / |T|, where |T| is the total number of transactions
• Confidence (c): the fraction of the transactions containing X that also contain Y:
c(X -> Y) = σ(X ∪ Y) / σ(X)
Example, for the rule {Milk, Diaper} -> {Beer} on the five-transaction table of Example 2 below:
s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 ≈ 0.67
The Apriori Algorithm
• Ck: set of candidate k-itemsets; Lk: set of frequent k-itemsets
• Pseudo-code:
L1 = {frequent 1-itemsets};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with count >= min_support;
end
return ∪k Lk;
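As a concrete sketch of this procedure, the R package arules (used later in this unit for the grocery case study) can mine frequent itemsets directly; the five toy baskets below are made up for illustration.

# Minimal sketch using the arules package (assumed installed);
# the toy baskets are hypothetical.
library(arules)

baskets <- list(c("A", "B", "C"),
                c("A", "C"),
                c("A", "D"),
                c("B", "E", "F"),
                c("A", "B", "D"))
trans <- as(baskets, "transactions")

# Frequent itemsets at minimum support 0.4 (at least 2 of the 5 baskets)
itemsets <- apriori(trans,
                    parameter = list(support = 0.4,
                                     target = "frequent itemsets"))
inspect(itemsets)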
The Apriori Algorithm — Example 1
MinSupp = 0.5 (i.e., 50%, or a count of 2)

Database D:
TID | Items
100 | 1 3 4
200 | 2 3 5
300 | 1 2 3 5
400 | 2 5

Scan D to count the candidate 1-itemsets C1:
itemset | sup.
{1} | 2
{2} | 3
{3} | 3
{4} | 1
{5} | 3

Pruning itemsets below MinSupp gives the frequent 1-itemsets L1:
itemset | sup.
{1} | 2
{2} | 3
{3} | 3
{5} | 3
The Apriori Algorithm — Example 2
TID | Items
1 | Bread, Milk
2 | Bread, Diaper, Beer, Eggs
3 | Milk, Diaper, Beer, Coke
4 | Bread, Milk, Diaper, Beer
5 | Bread, Milk, Diaper, Coke

Itemset | Count
{Bread, Milk, Diaper} | 3
Evaluation of Candidate Rules
• Frequent itemsets from the previous
section can form candidate rules such
as X implies Y (X → Y).
• This section discusses how measures
such as confidence, lift, and leverage
can help evaluate the appropriateness
of these candidate rules
Evaluation of Candidate Rules
Confidence: how likely item Y is to be purchased when item X is purchased, for a rule X -> Y:
● Confidence(X ⇒ Y) = Support(X ∧ Y) / Support(X)
Lift: how likely item Y is to be purchased when item X is purchased, while controlling for how popular item Y is:
● Lift(X ⇒ Y) = Support(X ∧ Y) / (Support(X) × Support(Y))
Leverage: similar to lift, but uses a difference instead of a ratio:
● Leverage(X ⇒ Y) = Support(X ∧ Y) − Support(X) × Support(Y)
Evaluation of Candidate Rules
Confidence
• Confidence measures the certainty of a rule
• Mathematically, confidence is the percent of transactions that contain both X and Y out of all the transactions that contain X:
• Confidence(X ⇒ Y) = Support(X ∧ Y) / Support(X)
Lift
• Lift measures how many times more often X and Y occur together than would be expected if they were statistically independent:
• Lift(X ⇒ Y) = Support(X ∧ Y) / (Support(X) × Support(Y))
• If, for example, lift(milk -> bread) is greater than lift(milk -> eggs), it can be concluded that milk and bread have a stronger association than milk and eggs.
Evaluation of Candidate Rules
Leverage
• Leverage measures the difference in the probability of X and Y appearing together compared to what would be expected under statistical independence:
• Leverage(X ⇒ Y) = Support(X ∧ Y) − Support(X) × Support(Y)
• Under statistical independence, leverage is 0; larger values indicate a stronger association.
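As a worked sketch, the base-R snippet below computes these measures by hand for the five-transaction table from Example 2 (the helper functions support, confidence, lift, and leverage are defined here for illustration, not taken from any package):

# Base-R sketch: compute support, confidence, lift, and leverage by hand
# for the five-transaction grocery table from Example 2.
transactions <- list(c("Bread", "Milk"),
                     c("Bread", "Diaper", "Beer", "Eggs"),
                     c("Milk", "Diaper", "Beer", "Coke"),
                     c("Bread", "Milk", "Diaper", "Beer"),
                     c("Bread", "Milk", "Diaper", "Coke"))

# Fraction of transactions containing every item in `items`
support <- function(items) {
  mean(sapply(transactions, function(t) all(items %in% t)))
}

confidence <- function(x, y) support(c(x, y)) / support(x)
lift       <- function(x, y) support(c(x, y)) / (support(x) * support(y))
leverage   <- function(x, y) support(c(x, y)) - support(x) * support(y)

confidence(c("Milk", "Diaper"), "Beer")  # 0.4 / 0.6  = 2/3 ≈ 0.67
lift(c("Milk", "Diaper"), "Beer")        # 0.4 / (0.6 * 0.6) ≈ 1.11
leverage(c("Milk", "Diaper"), "Beer")    # 0.4 - 0.36 = 0.04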
Example: Grocery Store Transactions
1 The Groceries Dataset
> library(arules)
> data(Groceries)
> Groceries@itemInfo[1:10,]
> apply(Groceries@data[,10:20], 2, function(r)
    paste(Groceries@itemInfo[r, "labels"], collapse = ", "))
Example: Grocery Store Transactions
2 Frequent Itemset Generation
To illustrate the Apriori algorithm, the code below runs each iteration separately.
Assume a minimum support threshold of 0.02 (0.02 × 9,835 transactions ≈ 197); this yields 122 frequent itemsets in total.
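A sketch of this step using the arules package (the parameter values follow the slide; exact output may vary by package version):

library(arules)
data(Groceries)   # 9,835 grocery transactions bundled with arules

# Mine frequent itemsets at minimum support 0.02
itemsets <- apriori(Groceries,
                    parameter = list(support = 0.02,
                                     target = "frequent itemsets"))
summary(itemsets)                                  # counts by itemset size
inspect(head(sort(itemsets, by = "support"), 10))  # top-10 itemsets by support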
Example: Grocery Store Transactions
3 Rule Generation and Visualization
[Figure: matrix-based visualization of the generated rules. The legend on the right is a color matrix indicating the lift and the confidence to which each square in the main matrix corresponds.]
In the graph, the arrow always points from an item on the LHS
to an item on the RHS.
For example, the arrows that connect ham, processed cheese, and white bread
suggest the rule {ham, processed cheese} -> {white bread}.
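A sketch of the rule-generation and graph-visualization step (the arulesViz package is assumed to be installed; the 0.001 support and 0.6 confidence thresholds are illustrative):

library(arules)
library(arulesViz)

# Generate rules at a low support but high confidence threshold
rules <- apriori(Groceries,
                 parameter = list(support = 0.001,
                                  confidence = 0.6,
                                  target = "rules"))

plot(rules)   # scatterplot of support vs. confidence, shaded by lift

# Graph view of the highest-lift rules: arrows point from LHS items to RHS items
highLiftRules <- head(sort(rules, by = "lift"), 10)
plot(highLiftRules, method = "graph")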
Validation and Testing
• The rules produced by the algorithm should be validated before use: check that the high-confidence, high-lift rules still hold on holdout data, and review them with domain experts to confirm that they are meaningful in the business context.
Diagnostics
• Although the Apriori algorithm is easy to understand and implement, some of
the rules generated are uninteresting or practically useless.
• Additionally, some of the rules may be generated due to coincidental
relationships between the variables.
• Measures like confidence, lift, and leverage should be used along with human
insights to address this problem
• Another problem with association rules is that, in Phases 3 and 4 of the Data
Analytics Lifecycle, the team must specify the minimum support prior to
the model execution, which may lead to too many or too few rules.
• In related research, a variant of the algorithm can use a predefined target
range for the number of rules so that the algorithm can adjust the minimum
support accordingly.
• The algorithm requires scanning the entire database to obtain the result.
Accordingly, as the database grows, each run takes more time to compute.
Diagnostics: Approaches to Improve Apriori's Efficiency
Partitioning:
• Any itemset that is potentially frequent in a transaction database must be
frequent in at least one of the partitions of the transaction database.
Sampling:
• This extracts a subset of the data with a lower support threshold and uses the
subset to perform association rule mining.
Transaction reduction:
• A transaction that does not contain frequent k-itemsets is useless in subsequent
scans and therefore can be ignored.
Linear Regression
Logistic Regression
Medical example
• Analyze effect of proposed radiation treatment
• Possible inputs – radiation treatment duration, frequency
Linear Equations
Y = mX + b, where m is the slope (the change in Y for a unit change in X) and b is the Y-intercept.
Linear Regression Model
The relationship between the variables is a linear function:
Yi = β0 + β1Xi + εi
where β0 is the Y-intercept, β1 is the slope, and εi is the random error term.
Y is the dependent (response) variable (e.g., income); X is the independent (explanatory) variable (e.g., age).
Model Description
For one input variable and one output variable:
Income = β0 + β1·Age + ε
• OLS (ordinary least squares) estimates β0 and β1 by minimizing the sum of the squared differences between the observed values and the values predicted by the line: minimize Σi (yi − (β0 + β1xi))²
Model Description
Example
With OLS, the objective is to find the line through these points that
minimizes the sum of the squares of the differences between each
point and the line in the vertical direction.
[Figure: scatterplot of the observed points with the fitted line; the vertical lines represent the distance between each observed y value and the line.]
Model Description
With Normally Distributed Errors
• Making additional assumptions on the error term provides further capabilities
• It is common to assume the error term is a normally distributed random variable with mean equal to zero and constant variance
• Thus, the linear regression model is expressed as
• Y = β0 + β1X1 + ... + βpXp + ε, where ε ~ N(0, σ²)
Model Description
With Normally Distributed Errors
• With this assumption, the expected value E(Y) of the linear regression model is:
• E(Y) = β0 + β1X1 + ... + βpXp
Model Description – Example in R
> library(lattice)
> splom(~income_input[c(2:5)], groups=NULL, data=income_input,
    axis.line.tck=0, axis.text.alpha=0)
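To make the example self-contained, here is a minimal sketch of fitting such a model with R's lm(); since the income_input dataset from the slides is not reproduced here, the data frame below is simulated:

# Sketch: fit Income = b0 + b1*Age + b2*Education + error with lm().
# Simulated data stand in for the slides' income_input dataset.
set.seed(1)
n <- 1000
income_input <- data.frame(Age = runif(n, 18, 70),
                           Education = runif(n, 10, 20))
income_input$Income <- 20 + 1.5 * income_input$Age +
  3 * income_input$Education + rnorm(n, sd = 10)

results <- lm(Income ~ Age + Education, data = income_input)
summary(results)   # parameter estimates, standard errors, R-squared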
Diagnostics
Evaluating the Linearity Assumption
• A major assumption in linear regression modeling is that
the relationship between the input and output variables is
linear
• The most fundamental way to evaluate this is to plot the
outcome variable against each input variable
• If the relationship between Age and Income is represented
as illustrated in Figure in next slide, a linear model would
not apply. In such a case, it is often useful to do any of the
following:
• Transform the outcome variable.
• Transform the input variables.
• Add extra input variables or terms to the regression model.
Diagnostics
[Figure: Income plotted against Age, showing a clearly nonlinear relationship for which a linear model would not apply.]
Diagnostics
Evaluating the Residuals
• Residuals are the differences between the observed outcome
values and the fitted values based on the OLS parameter
estimates (see the plot sketch below).
• For residuals, the lm() function in R automatically calculates
and stores the fitted values and the residuals, in the
components fitted.values and residuals of its output.
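For instance, continuing the hypothetical lm() fit from the sketch above:

# Plot residuals against fitted values to look for patterns
plot(results$fitted.values, results$residuals,
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0)   # residuals should scatter evenly around this line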
Diagnostics
Evaluating the Residuals
• The residual plots are useful for confirming that the residuals
were centered on zero and have a constant variance
[Figures: residual plots illustrating problem cases – a nonlinear trend in the residuals, residuals not centered on zero, and variance that is not constant.]
Diagnostics
Evaluating the Normality Assumption
• From the histogram, it is seen that the residuals are
centered on zero and appear to be symmetric about zero, as
one would expect for a normally distributed random
variable.
[Figure: histogram of residuals centered on zero and appearing normally distributed.]
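For example (again using the hypothetical fit from the earlier sketch):

# Histogram of the residuals; roughly symmetric around zero is expected
hist(results$residuals, main = "Histogram of residuals",
     xlab = "Residuals")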
Diagnostics
Evaluating the Normality Assumption
• Another option is to examine a Q-Q plot,
comparing the observed data against the quantiles (Q)
of the assumed distribution
> qqnorm(results2$residuals)
> qqline(results2$residuals)
[Figures: Q-Q plots of normally distributed residuals and of non-normally distributed residuals.]
Diagnostics
N-Fold Cross-Validation
• To prevent overfitting, a common practice is to split the
dataset into training and test sets, develop the model on
the training set, and evaluate it on the test set
• If the amount of data is insufficient for this, an N-fold
cross-validation technique can be used, as sketched below
• The dataset is randomly split into N datasets of equal size
• The model is trained on N−1 of the sets and tested on the remaining one
• The process is repeated N times
• Average the N model errors over the N folds
• Note: if N equals the size of the dataset, this is the leave-one-out procedure
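A base-R sketch of this procedure for the hypothetical income model above (N = 5 and mean squared error as the fold-level metric are both illustrative choices):

# N-fold cross-validation sketch for the linear model above
N <- 5
set.seed(2)
fold <- sample(rep(1:N, length.out = nrow(income_input)))  # random fold labels

errors <- sapply(1:N, function(k) {
  train <- income_input[fold != k, ]
  test  <- income_input[fold == k, ]
  fit   <- lm(Income ~ Age + Education, data = train)
  pred  <- predict(fit, newdata = test)
  mean((test$Income - pred)^2)        # mean squared error on fold k
})
mean(errors)  # average model error over the N folds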
Diagnostics
Other Diagnostic Considerations
• The model might be improved by including additional
input variables
• However, the adjusted R2 applies a penalty as the number of
parameters increases
• Residual plots should be examined for outliers
• Points markedly different from the majority of points
• These may result from bad data, data processing errors, or actual rare
occurrences
• Finally, the magnitude and signs of the estimated
parameters should be examined to see if they make
sense
Regression
Linear Regression
Logistic Regression
Logistic Regression
Introduction
• In linear regression modeling, the outcome
variable is continuous – e.g., income ~ age and
education
• In logistic regression, the outcome variable is
categorical – e.g., two-valued outcomes such as:
• true/false
• pass/fail
• yes/no
Logistic Regression
Use Cases
Medical
• Probability of a patient’s successful response to a specific medical
treatment – input could include age, weight, etc.
Finance
• Probability an applicant defaults on a loan
Marketing
• Probability a wireless customer switches carriers (churns)
Engineering
• Probability a mechanical part malfunctions or fails
Logistic Regression
Model Description
• Logistic regression is based on the logistic function f(y) = e^y / (1 + e^y), for −∞ < y < ∞
Logistic Regression
Model Description
• With the range of f(y) as (0,1), the logistic function
models the probability of an outcome occurring
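A one-line base-R sketch of this S-shaped curve:

# Plot the logistic function f(y) = exp(y) / (1 + exp(y));
# f approaches 0 as y -> -Inf and 1 as y -> Inf
curve(exp(x) / (1 + exp(x)), from = -6, to = 6,
      xlab = "y", ylab = "f(y)")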
Logistic Regression
Model Description: customer churn example
• Example: a wireless telecom company wants to estimate the probability that a customer churns (switches carriers), based on inputs such as Age, Married, Cust_years, and Churned_contacts, as used in the R code below
> head(churn_input)         # Churned = 1 if the customer churned
> sum(churn_input$Churned)  # 1743 of 8000 customers churned
• Use the Generalized Linear Model function glm():
> Churn_logistic1 <- glm(Churned ~ Age + Married + Cust_years + Churned_contacts,
    data = churn_input, family = binomial(link = "logit"))
> summary(Churn_logistic1)  # Age and Churned_contacts are the strongest predictors
> Churn_logistic3 <- glm(Churned ~ Age + Churned_contacts,
    data = churn_input, family = binomial(link = "logit"))
> summary(Churn_logistic3)  # reduced model with Age + Churned_contacts
Diagnostics
Deviance and the Pseudo-R2
• In logistic regression, deviance plays a role analogous to the residual sum of squares in linear regression
• The pseudo-R2 measures how well the fitted model explains the data relative to the null model: pseudo-R2 = 1 − (residual deviance / null deviance)
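A sketch of computing the pseudo-R2 from the glm() fit above (glm objects store both deviances):

# Pseudo-R2 = 1 - residual deviance / null deviance
pseudo_R2 <- 1 - Churn_logistic3$deviance / Churn_logistic3$null.deviance
pseudo_R2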
Diagnostics
Receiver Operating Characteristic (ROC) Curve
• Logistic regression is often used to classify
• In the Churn example, a customer can be classified as
Churn if the model predicts high probability of churning
• Although 0.5 is often used as the probability threshold, other values can be chosen depending on the desired trade-off between true positives and false positives
• For two classes, C (Churn) and nC (not Churn), we have
• True Positive: predict C, when actually C
• True Negative: predict nC, when actually nC
• False Positive: predict C, when actually nC
• False Negative: predict nC, when actually C
Diagnostics
Receiver Operating Characteristic (ROC) Curve
> library(ROCR)
> Pred <- predict(Churn_logistic3, type = "response")
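The following sketch completes the ROC computation with ROCR's standard prediction() and performance() calls (the object names are illustrative):

> predObj <- prediction(Pred, churn_input$Churned)  # predicted probs vs. labels
> rocObj <- performance(predObj, measure = "tpr", x.measure = "fpr")
> plot(rocObj)                                      # ROC curve: TPR vs. FPR
> aucObj <- performance(predObj, measure = "auc")
> aucObj@y.values[[1]]                              # area under the ROC curve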
[Figure: ROC curve for the churn model, plotting the true positive rate against the false positive rate.]
Diagnostics
Histogram of the Probabilities
[Figure: histogram of the predicted churn probabilities from the fitted model.]