
WEEK 1

Introduction

• This course is a follow-up to an earlier course,

– “Business Analytics and Data Mining Modeling Using R”

• Course Roadmap

– Module I: Unsupervised Learning Methods

– Module II: Time Series Forecasting

Module I: Unsupervised Learning Methods

– Association Rules

– Cluster Analysis

Module II: Time Series Forecasting

– Understanding Time Series

– Regression-Based Forecasting Methods

– Smoothing Methods

ASSOCIATION RULES

• Also called

– Affinity Analysis

– Market Basket Analysis

• These names reflect its origin in studies of customer purchase-transaction databases

• Main idea

– To identify item associations in transaction-type databases and

– To formulate probabilistic association rules for them

– In short, “what goes with what”

• Market basket databases

– Large number of transaction records

– Each record consists of all the items purchased by a customer in a single transaction

• If we can find item groups that are consistently purchased together, such information can be used for

– Store layouts, cross-selling, promotions, catalog design, and customer segmentation

• Association rules

– “If-then” statements computed from data

– Example: recommender systems on the websites of e-commerce companies such as Amazon, Flipkart, and Snapdeal

• Two-stage process

– Rule generation

• Apriori Algorithm

– Assessment of rule strength

• Example: mobile phone cover purchase

– Which colors of covers are customers likely to purchase together?

– Database of ten transactions

– Open RStudio

• Candidate rule generation

– Examine all possible rules between items in “if-then” format

– Select rules that are most likely to capture the true associations

• “If-then” format

– “If” part is called antecedent

– “then” part is called consequent

• Antecedent and consequent are

– Disjoint sets of items or item sets

– Example: mobile phone cover purchase

• “if red then white”

– If a red cover is purchased, a white cover is also purchased

• Antecedent and consequent are item sets and may contain more than one item

– Example: “if red and white then green”

• Rule generation

– Number of distinct items in a database = p

– In mobile phone cover purchase example, p=6

– All possible combinations

• Single items, pairs of items, triplets of items, and so on

• High computation time
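To make the computational burden concrete, the enumeration above can be sketched in a few lines (Python is used here purely for illustration; the course itself works in R). The sketch brute-forces every candidate rule whose antecedent and consequent are disjoint, non-empty item sets, with p = 6 following the mobile phone cover example.

```python
from itertools import combinations

def count_candidate_rules(p):
    """Count every 'if-then' rule whose antecedent and consequent are
    disjoint, non-empty subsets of p distinct items."""
    items = range(p)
    total = 0
    for k in range(1, p):                       # antecedent size
        for antecedent in combinations(items, k):
            # any non-empty subset of the remaining p - k items
            # is a valid consequent
            total += 2 ** (p - k) - 1
    return total

# With p = 6 (the mobile phone cover example), exhaustive enumeration
# already yields hundreds of candidate rules.
print(count_candidate_rules(6))   # 602
```

The count grows roughly as 3^p, which is why exhaustive enumeration quickly becomes infeasible and why attention is restricted to frequent item sets.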

• Rule generation

– Look for high-frequency combinations

• Called frequent item sets

• A frequent item set is defined using the ‘concept of support’
– Support of a rule is

• Number of transactions containing both the antecedent and consequent item sets

• Measures the degree of support the data provides for the validity of the rule

• Expressed as a percentage of total records

• Support

– Example: the item set {red, white} has 40% support

• A frequent item set can be defined as

– An item set whose support exceeds a user-specified minimum support
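The support calculation (and, previewing the rule-strength stage, confidence) can be sketched as follows. Python is used here for brevity rather than R, and the ten transactions are hypothetical, chosen so that {red, white} has the 40% support quoted above.

```python
from itertools import combinations

# Hypothetical database of ten transactions (cover colors bought together),
# constructed so that {red, white} appears in 4 of the 10 transactions.
transactions = [
    {"red", "white"}, {"white", "orange"}, {"white", "blue"},
    {"red", "white", "orange"}, {"red", "blue"}, {"white", "blue"},
    {"red", "blue"}, {"red", "white", "blue", "green"},
    {"red", "white", "blue"}, {"yellow"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

print(support({"red", "white"}))   # 0.4, i.e. 40% support

# Frequent pairs: item sets whose support meets a user-specified minimum
items = sorted(set().union(*transactions))
frequent_pairs = [set(pair) for pair in combinations(items, 2)
                  if support(pair) >= 0.4]

# Confidence of "if red then white": P(consequent | antecedent)
confidence = support({"red", "white"}) / support({"red"})
```

The same filtering idea, applied level by level (frequent pairs built from frequent single items, and so on), is what the Apriori algorithm does to avoid enumerating all combinations.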

Assignment 1
1)Which of the following is not an advantage of association rules?
The rules are transparent and easy to understand
Generates clear and simple rules
Generates too many rules
None of the above

2)In one of the frequent item-set examples, it is observed that if tea and milk are bought, then sugar is also purchased by the customers. After generating an association rule among the given set of items, it is inferred:
{Tea} is antecedent and {sugar} is consequent
{Tea} is antecedent and the item set {milk, sugar} is consequent
The item set {Tea, milk} is consequent and {sugar} is antecedent
The item set {Tea, milk} is antecedent and {sugar} is consequent

3)Support is:
No. of transactions with both antecedent and consequent item sets
Measures the degree of support the data provides for the validity of the rule
Expressed as a percentage of total records
All of the above
4)Online recommender systems are an example of:
a) Cluster analysis
b) Affinity analysis
c) Decision analysis
d) Both a and b

5)What is the limitation behind rule generation in association rule mining?
a) High computation time
b) Dropping item sets with valued information
c) Both a and b
d) None of the above

6)Confidence can be best represented as:
P(antecedent and consequent)
P(consequent | antecedent)
P(antecedent | consequent)
None of the above

7)In the Apriori algorithm, for generating, e.g., 5-item sets, we use:
Frequent 4-item sets
Frequent 6-item sets
Frequent 5-item sets
None of the above

8)A database has 5 transactions. Of these, 4 transactions include milk and bread. Further, of the
given 4 transactions, 3 transactions include cheese. Find the support percentage for the following
association rule “if milk and bread are purchased, then cheese is also purchased”.
60%
75%
80%
None of the above

9)What are the methods to interpret the results after rule generation?
Absolute Mean
Lift ratio
Gini Index
Apriori

10)How can we best represent ‘benchmark confidence’ for the following association rule: “If X
and Y, then Z”.
{X,Y}/(Total number of transactions)
{Z}/{X,Y}
{X,Y,Z}/(Total number of transactions)
{Z}/(Total number of transactions)

Assignment 2

1)HAC stands for:
Hierarchical algorithmic clustering
Hierarchical agglomerative clustering
Heightened agglomerative clustering
Hierarchical absolute clustering

2)_____ is a clustering procedure characterized by sequentially merging similar clusters until a single cluster is reached.
Non-hierarchical clustering
Hierarchical clustering
Divisive clustering
Agglomerative clustering

3)A _____ is a tree diagram for displaying clustering results. Vertical lines represent clusters
that are joined together.
dendrogram
scattergram
scree plot
histogram
4)Which of the following is true about single link clustering?
Distance between two clusters is the maximum distance between their members
Distance between two clusters is the minimum distance between their members
Distance between two clusters is the average distance between their members
None of the above

5)What would be the loss of information if we cluster the following observations into a single group under Ward's method?
(3, 3, 2, 0, 5, 2, 6, 4, 0)
34
35.9
34.9
35

6)What would be the loss of information if we cluster the following observations into six groups under Ward's method?
(0, 0, 3, 3, 1, 2, 2, 2, 4, 4, 5)
0
1
0.5
10

7)The metric used for calculating distance for categorical data in clustering is:
Jaccard's coefficient
correlation coefficient
matching coefficient
maximum coordinate

8)Which statement is not true about cluster analysis?
Objects in each cluster tend to be similar to each other and dissimilar to objects in the other clusters.
Cluster analysis is also called classification analysis or numerical taxonomy.
Groups or clusters are suggested by the data, not defined a priori.
Cluster analysis is a technique for analyzing data when the criterion or dependent variable is categorical and the independent variables are interval in nature.
9)Which of the following is required by K-means clustering?
defined distance metric
number of clusters
initial guess as to cluster centroids
all of the above

10)The metrics based on correlation coefficient to calculate distance can be defined as:
