Unit III Da Online - PPTX 2 87
Association rules and regression models
Syllabus
Overview
Apriori Algorithm
Diagnostics
Association Rules Method - Overview
Market Basket Analysis
https://blog.rsquaredacademy.com/market-basket-analysis-in-r/
Market Basket Analysis
❏ To identify products that are frequently purchased together (itemsets that co-occur in many transactions)
Use Cases (Applications) of Association Rule Mining
Simple Example
Simple Example - Transaction Data
Simple Example - Frequent Item Set
Simple Example - Association Rule
Simple Example - Association Rule
Support
Simple Example - Association Rule
Confidence
Simple Example - Association Rule
Lift
Simple Example - Association Rule
Lift - Interpretation
● Lift = 1: implies no relationship between mobile phone and screen guard (i.e., they occur together only as often as expected by chance)
● Lift > 1: implies a positive relationship between mobile phone and screen guard (i.e., they occur together more often than expected by chance)
● Lift < 1: implies a negative relationship between mobile phone and screen guard (i.e., they occur together less often than expected by chance)
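The three lift cases above can be checked numerically. Below is a minimal sketch with made-up basket data — the mobile phone / screen guard transactions are hypothetical, not from the slides:

```python
# Hypothetical basket data (not from the slides): each set is one transaction.
baskets = [
    {"mobile phone", "screen guard"},
    {"mobile phone", "screen guard", "charger"},
    {"mobile phone"},
    {"screen guard"},
    {"charger"},
]

def support(itemset, baskets):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(x, y, baskets):
    """P(Y | X): support of X ∪ Y divided by support of X."""
    return support(x | y, baskets) / support(x, baskets)

def lift(x, y, baskets):
    """Confidence of X → Y divided by the baseline support of Y."""
    return confidence(x, y, baskets) / support(y, baskets)

x, y = {"mobile phone"}, {"screen guard"}
print(support(x | y, baskets))    # 2/5 = 0.4
print(confidence(x, y, baskets))  # 0.4 / 0.6 ≈ 0.667
print(lift(x, y, baskets))        # 0.667 / 0.6 ≈ 1.11 -> > 1, positive relationship
```

With this toy data the lift exceeds 1, matching the "positive relationship" case above.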
Association Rule
X→Y
Association Rule
Appropriateness of Candidate Rule
TID     List of Item IDs
T101    I2, I4
T102    I2, I5
T103    I1, I2, I4
T104    I1, I2, I3
T105    I2, I3
T106    I1, I2, I3, I4
T107    I1, I2, I3
T108    I1, I3, I5

Item Set    Frequency
{I1}        6
{I2}        8
{I3}        5
{I4}        3
{I5}        3
Example

Minimum Support = 0.5 or 50%: with 9 transactions, the minimum support count is 9 × 0.5 = 4.5, rounded down to 4, so an itemset must appear in at least 4 transactions. (The threshold assumes 9 transactions, one more than the rows shown in the transaction table.)
Example

Frequent 1-itemsets (L1):
Item Set    Frequency
{I1}        6
{I2}        8
{I3}        5

Candidate Generation (C2):
Item Set    Frequency
{I1, I2}    5
{I1, I3}    4
{I2, I3}    4

After Pruning (L2) — all candidates meet the minimum support count of 4:
Item Set    Frequency
{I1, I2}    5
{I1, I3}    4
{I2, I3}    4
Example
We have 3 rules
1. I1 => I2
2. I1 => I3
3. I2 => I3
Example - Support and Leverage

Rule       Support(X∪Y)   Support(X)    Support(Y)    Leverage
I2 => I3   4/9 ≈ 0.44     8/9 ≈ 0.88    5/9 ≈ 0.55    0.44 − (0.88 × 0.55) ≈ −0.044

Leverage = Support(X∪Y) − Support(X) × Support(Y); a slightly negative value means I2 and I3 co-occur a little less often than expected if they were independent.
Example
L1 = {frequent 1-itemsets};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
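The pseudocode can be sketched in Python against the slide's transaction table. Note the slide lists only eight transactions while its support threshold assumes nine, so the count for {I1, I2} here comes out one lower than the slide's C2 column; the level-wise structure (count, keep survivors, join, prune) is the point:

```python
from itertools import combinations

# Transactions T101-T108 from the slide's table; minimum support count = 4.
transactions = [
    {"I2", "I4"}, {"I2", "I5"}, {"I1", "I2", "I4"}, {"I1", "I2", "I3"},
    {"I2", "I3"}, {"I1", "I2", "I3", "I4"}, {"I1", "I2", "I3"}, {"I1", "I3", "I5"},
]
MIN_COUNT = 4

def apriori(transactions, min_count):
    """Return {itemset: count} for every itemset in >= min_count transactions."""
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items]  # C1: candidate 1-itemsets
    frequent = {}
    while level:
        # count step: how many transactions contain each candidate
        counts = {c: sum(c <= t for t in transactions) for c in level}
        survivors = [c for c, n in counts.items() if n >= min_count]
        frequent.update((c, counts[c]) for c in survivors)
        # join step: union pairs of surviving k-itemsets, keep those of size k+1
        k = len(survivors[0]) if survivors else 0
        candidates = {a | b for a, b in combinations(survivors, 2)
                      if len(a | b) == k + 1}
        # prune step: every k-subset of a candidate must itself be frequent
        level = [c for c in candidates
                 if all(frozenset(s) in frequent for s in combinations(c, k))]
    return frequent

freq = apriori(transactions, MIN_COUNT)
```

Running this yields {I1}, {I2}, {I3} plus the three pairs {I1,I2}, {I1,I3}, {I2,I3}; {I4}, {I5}, and the triple {I1,I2,I3} fall below the threshold, as in the slide's pruning step.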
Applications of Association Rules
● Frequent itemsets and high-confidence rules are found using pre-specified minimum support and minimum confidence levels
● Measures like lift and/or leverage then help ensure that interesting rules are identified rather than coincidental ones
● However, some of the remaining rules may be considered subjectively uninteresting because they don't yield unexpected or profitable actions
○ E.g., a rule like {paper} -> {pencil} is obvious and therefore not interesting/meaningful
● Incorporating subjective knowledge requires domain experts
● Good rules provide valuable insights for institutions to improve their business
operations
Diagnostics
● Although the Apriori algorithm is easy to understand and implement, some of the rules
generated are uninteresting or practically useless.
● Additionally, some of the rules may be generated due to coincidental relationships between
the variables.
● Measures like confidence, lift, and leverage should be used along with human insights to
address this problem
● Another problem with association rules is that, in Phases 3 and 4 of the Data Analytics Lifecycle, the team must specify the minimum support prior to model execution, which may lead to too many or too few rules.
● In related research, a variant of the algorithm can use a predefined target range for the
number of rules so that the algorithm can adjust the minimum support accordingly.
● The algorithm requires a scan of the entire database to obtain each result; accordingly, as the database grows, every run takes more time to compute.
Diagnostics - Approaches to Improve Apriori's Efficiency
Linear Regression
Logistic Regression
● Models the relationship between an outcome (dependent) variable and one or more input (independent) variables
Medical example
• Analyze the effect of a proposed radiation treatment
• Possible inputs – radiation treatment duration, frequency
Linear Regression Equations
Linear Regression Model
y_i = β0 + β1·x_i + ε_i

where β0 is the y-intercept, β1 is the slope, and ε_i is the random error term; y is the dependent (response) variable (e.g., income) and x is the independent (explanatory) variable (e.g., age).
Model Description
With ordinary least squares (OLS), the objective is to find the line through the points that minimizes the sum of the squared vertical distances between each observed point and the line.
The vertical lines represent the distance between each observed y value and the fitted line y = β0 + β1x.
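The OLS objective has a closed-form solution for one input variable. A minimal sketch — the (age, income) numbers below are made up for illustration:

```python
# Closed-form simple OLS fit; the (age, income) pairs are hypothetical.
xs = [25, 30, 35, 40, 45, 50]   # e.g., age
ys = [30, 38, 44, 52, 58, 66]   # e.g., income in thousands (made up)

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

# slope: beta1 = sum((x - x̄)(y - ȳ)) / sum((x - x̄)^2) minimizes the sum of
# squared vertical distances; the fitted line always passes through (x̄, ȳ)
beta1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
beta0 = y_bar - beta1 * x_bar

print(beta0, beta1)
```

A useful sanity check on any OLS fit with an intercept: the residuals sum to (numerically) zero.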
Model Description
With Normally Distributed Errors
● The following figure illustrates the regression model with one input variable, the normality assumption on the error terms, and the resulting distribution of the outcome variable Y for a given value of X.
Diagnostics
Evaluating the Linearity Assumption
Diagnostics
Evaluating the Residuals
● Residuals: the difference between the observed value of the dependent variable (y) and the predicted value (ŷ) is called the residual, e = y − ŷ.
● Each data point has one residual.
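A tiny sketch of the definition, using a hypothetical fitted line ŷ = 2 + 3x and made-up observations:

```python
# Residuals e_i = y_i - ŷ_i for a hypothetical fitted line ŷ = 2 + 3x.
beta0, beta1 = 2.0, 3.0
data = [(1, 5.5), (2, 7.6), (3, 11.4)]  # made-up (x, y) observations

# one residual per data point
residuals = [y - (beta0 + beta1 * x) for x, y in data]
print(residuals)  # approximately [0.5, -0.4, 0.4]
```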
Diagnostics
N-Fold Cross-Validation
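N-fold cross-validation partitions the data into N folds; the model is fit on N−1 folds and validated on the held-out fold, so every observation is used for validation exactly once. A minimal index-splitting sketch (assuming, for simplicity, that the sample size is divisible by N):

```python
# N-fold cross-validation: each of the N folds is held out once for validation
# while the model is fit on the remaining N-1 folds.
def n_fold_splits(n_samples, n_folds):
    """Yield (train_indices, test_indices); assumes n_samples % n_folds == 0."""
    indices = list(range(n_samples))
    fold_size = n_samples // n_folds
    for f in range(n_folds):
        test = indices[f * fold_size:(f + 1) * fold_size]
        train = indices[:f * fold_size] + indices[(f + 1) * fold_size:]
        yield train, test

splits = list(n_fold_splits(9, 3))  # 3 folds of 3 samples each
```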
Diagnostics
Other Diagnostic Considerations
Linear Regression
Logistic Regression
Logistic Regression
Use Cases
Medical
• Probability of a patient’s successful response to a specific
medical treatment – input could include age, weight, etc.
Logistic Regression
Use Cases
Finance
Logistic Regression
Use Cases
Marketing
Logistic Regression
Use Cases
Engineering
Logistic Regression
Model Description
Logistic regression
is based on the logistic function f(y) = e^y / (1 + e^y)
Logistic Regression
Model Description
With the range of f(y) being (0, 1), the logistic function models the
probability of an outcome occurring
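The logistic function is a one-liner; note that e^y / (1 + e^y) is algebraically the same as 1 / (1 + e^(−y)):

```python
import math

def logistic(y):
    """f(y) = 1 / (1 + e^(-y)): maps any real y into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-y))

print(logistic(0))    # 0.5 -- the midpoint
print(logistic(10))   # close to 1
print(logistic(-10))  # close to 0
```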
Diagnostics
Receiver Operating Characteristic (ROC) Curve
                              Actual Value
                        Positive (1)   Negative (0)
Predicted  Positive (1)      TP             FP
Value      Negative (0)      FN             TN
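The four cells of the matrix can be counted directly from paired actual/predicted labels; the label vectors below are made up for illustration:

```python
# Counting TP/FP/FN/TN from actual vs. predicted labels (1 = positive, 0 = negative).
actual    = [1, 1, 0, 0, 1, 0]  # hypothetical ground truth
predicted = [1, 0, 0, 1, 1, 0]  # hypothetical model output

tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))  # true positives
fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))  # false positives
fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))  # false negatives
tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))  # true negatives

print(tp, fp, fn, tn)  # 2 1 1 2
```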
Diagnostics
Receiver Operating Characteristic (ROC) Curve
[Figure: ROC curve — True Positive Rate = TP / (# of positives) vs. False Positive Rate = FP / (# of negatives)]
Diagnostics
Receiver Operating Characteristic (ROC) Curve
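An ROC curve is traced by sweeping a classification threshold over the model's scores; at each threshold, TPR = TP / (# of positives) and FPR = FP / (# of negatives). A sketch with made-up labels and scores:

```python
# ROC sketch: sweep the threshold over predicted scores (highest first) and
# compute (FPR, TPR) at each threshold.
actual = [1, 1, 1, 0, 0]                # hypothetical ground-truth labels
scores = [0.9, 0.7, 0.4, 0.6, 0.2]      # hypothetical predicted probabilities

def roc_points(actual, scores):
    pos = sum(actual)            # number of positives
    neg = len(actual) - pos      # number of negatives
    points = []
    for t in sorted(set(scores), reverse=True):
        tp = sum(a == 1 and s >= t for a, s in zip(actual, scores))
        fp = sum(a == 0 and s >= t for a, s in zip(actual, scores))
        points.append((fp / neg, tp / pos))  # (FPR, TPR)
    return points

pts = roc_points(actual, scores)
print(pts)  # ends at (1.0, 1.0): everything classified positive
```

FPR never decreases as the threshold drops, so the points trace the curve from the lower left toward (1, 1).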
Regression
Linear Regression
Logistic Regression