
1.0 What is clustering?
2.0 What is Association Rules Analysis?
  2.1 Association Rules Mining
  2.2 Basic Concepts
  2.3 Market-basket analysis
    2.3.1 Apriori Algorithm
3.0 References

1.0 What is clustering?
Clustering (also known as unsupervised classification) is classification with an unknown target.
Furthermore, the total number of classes is unknown. The aim is to segment the cases into
disjoint classes that are homogeneous with respect to the inputs. For example, segmenting existing
customers into groups and associating a distinct profile with each group could help future
marketing strategies.
The scenario: A catalog company periodically purchases lists of prospects from outside sources.
They want to design a test mailing to evaluate the potential response rates for several different
products. Based on their experience, they know that customer preference for their products
depends on geographic and demographic factors. Consequently, they want to segment the
prospects into groups that are similar to each other with respect to these attributes. After the
prospects have been segmented, a random sample of prospects within each segment will be
mailed one of several offers. The result of the test campaign will allow the analyst to evaluate the
potential profit of prospects from the list source overall as well as for specific segments.
Creating a Project & Importing Data into SAS Format.
1. Open SAS 9.3.
2. Click Solution > Analysis > Enterprise Miner.
3. Select File > New > Project. Create a new project by giving it a name
(UnsupervisedDM), specify the location of the file, and then select Create.
4. Rename the diagram: right-click the diagram, select Rename, and type the
name of the diagram (cluster 1).
5. Click File > Import Data.

6. Tick Standard data source, make sure that in the drop-down list you choose
Microsoft Excel Workbook (*.xls *.xlsb *.xlsm *.xlsx), and click Next.

7. Browse to the location of the Excel workbook (Prospect.xlsx), choose the Prospect.xlsx
data, then click Open and OK.

8. Choose the prospect table from the drop-down menu and click Next.

9. Choose the EMDATA library from the drop-down menu.

10. Enter the new name ProsNew in the Member box and click Finish.

Setting Up the Input Data Source Node
11. Add an Input Data Source node by dragging the node from the tools bar into the diagram

12. Open the Input Data Source node by double-clicking the node.
13. Select the PROSNEW data set from the EMDATA library: click the drop-down menu,
select EMDATA from the list of defined libraries, select the PROSNEW data set,
and then select OK.

14. Observe that this data set has 5055 observations (rows) and 9 variables (columns). Note
that the lower-right corner indicates a metadata sample of size 2,000. Enterprise Miner
uses metadata in order to make a preliminary assessment of how to use each variable. By
default, it takes a random sample of 2,000 observations from the data set of interest, and
uses this information to assign a model role and a measurement level to each variable. If
you want to take a larger sample, you may select Change in the metadata sample area of
the window (lower right corner).

15. Select the Variables tab. Observe that all variables except ID and LOCATION should be
set to input. Set the model role for ID to id by right-clicking in the Model Role column of the
row for ID and selecting Set Model Role > id from the pop-up menu. Set the model role of
LOCATION to rejected in the same way.

16. You can inspect the distribution of values for each variable. For example, to view
the distribution of CLIMATE, right-click in the Name column for the variable CLIMATE
and select View Distribution.

17. You can explore the descriptive statistics (minimum value, maximum value, mean, standard
deviation, percentage of missing observations, skewness, and kurtosis) for interval
variables by selecting the Interval Variables tab.
18. You can investigate the number of levels, percentage of missing values, and the sort
order of each class variable by selecting the Class Variables tab.
19. Close the Input Data Source node, and save changes when you are prompted.
Setting up the Replacement node
It is not always necessary to impute missing values, since imputation is not critical in cluster
analysis unless the proportion of missing values is extreme. If missing values are not imputed,
clustering is based on the non-missing inputs only. Although it was not necessary here,
imputation is used for demonstration in this example.
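The imputation step can be sketched in Python. This is a deliberately simplified stand-in, not Enterprise Miner's actual tree imputation (which predicts each missing value from the other inputs): it fills interval variables with the mean and class variables with the mode, and the prospect records are made up for illustration.

```python
from statistics import mean, mode

def impute(records, interval_vars, class_vars):
    """Fill missing values (None): mean for interval vars, mode for class vars."""
    filled = [dict(r) for r in records]
    for var in interval_vars:
        observed = [r[var] for r in records if r[var] is not None]
        fill = mean(observed)
        for r in filled:
            if r[var] is None:
                r[var] = fill
    for var in class_vars:
        observed = [r[var] for r in records if r[var] is not None]
        fill = mode(observed)
        for r in filled:
            if r[var] is None:
                r[var] = fill
    return filled

# Made-up prospect records with missing values.
prospects = [
    {"AGE": 34, "CLIMATE": "mild"},
    {"AGE": None, "CLIMATE": "mild"},
    {"AGE": 46, "CLIMATE": None},
]
clean = impute(prospects, ["AGE"], ["CLIMATE"])
print(clean[1]["AGE"], clean[2]["CLIMATE"])  # 40 mild
```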
20. Add a Replacement node by dragging the node from the Tools tab into the diagram
workspace and connect it to the Input Data Source node.

21. Open the Replacement node by double-clicking the node. Select the Imputation Methods tab,
click the drop-down menu, and select tree imputation for both interval and class variables.

22. Close the Replacement node, and save changes when you are prompted.
Setting up the Cluster node.
23. Add a Cluster node by dragging the node from the tools bar to the diagram workspace
and connect to the Replacement node.

24. Open the Clustering node. The Variables tab is active. K-means clustering is very
sensitive to the scale of measurement of different inputs. Consequently, it is
recommended that you use one of the standardization options on the Variables tab. Select
Range on the Variables tab.
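A minimal Python sketch of the range standardization that the Range option applies: each input is rescaled to [0, 1], so inputs on large scales (such as income) no longer dominate the Euclidean distances that k-means relies on. The variable names and values here are invented for illustration.

```python
def range_standardize(column):
    """Rescale a column to [0, 1] via (x - min) / (max - min)."""
    lo, hi = min(column), max(column)
    return [(x - lo) / (hi - lo) for x in column]

income = [20_000, 55_000, 90_000]   # large scale (dollars)
age = [25, 40, 55]                  # small scale (years)

# After range standardization both inputs contribute comparably
# to the Euclidean distances used by k-means clustering.
print(range_standardize(income))    # [0.0, 0.5, 1.0]
print(range_standardize(age))       # [0.0, 0.5, 1.0]
```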

25. Select the Clusters tab.
26. Observe that the default method for choosing the number of clusters is Automatic.

27. Click Selection Criterion > set minimum number of cluster and maximum number of
clusters (use default setting). The maximum and minimum values are the maximum and
minimum number of clusters that the automatic method will create. Click OK.

28. Close the Clustering node, saving changes when you are prompted.
29. Run the diagram from the cluster node and view the results.
30. The Clustering node returns a four-cluster solution

Examining the Cluster Node Results.
31. Select the Tilt icon from the toolbar and tilt the three-dimensional pie chart as shown.
32. This pie chart summarizes three statistics of the four clusters.
The options at the top of the Partition tab indicate that
- the size of a slice is proportional to the cluster standard deviation
- the height of a slice is related to the number of observations in the cluster
- the color indicates the radius (the distance from the cluster center to the most remote
observation in that cluster).
You can make some general observations based on this plot: cluster 4 contains
the most cases, followed by clusters 1, 3, and 2. Cluster 3 has the smallest
radius, while cluster 1 has the largest.
33. The right side of the window shows the normalized means for each input variable (mean
divided by its standard deviation).
34. Select the Statistics tab. The window gives descriptive statistics and other information about
the clusters, such as the frequency of each cluster, the nearest cluster, and so on.

35. In summary, the four clusters can be described as:
Cluster 1: ________________________________________
Cluster 2: ________________________________________
Cluster 3:_________________________________________
Cluster 4: _________________________________________
Using the Insight node
The Insight node can be used to compare the differences among the attributes of the prospects.
36. Add an Insight node by dragging the node from the list of the tools bar on the left side of
the window to the diagram workspace and connect to the Clustering node.

37. Open the Insight node and select Entire data set under Insight Based On in the Data tab.
38. Close the Insight node, saving changes when you are prompted.
39. Run the flow from the Insight node and view the results.

2.0 What is Association Rules Analysis?

Motivation: recent progress in data mining and data warehousing has made it possible to collect
huge amounts of data. For example, supermarket transactions are recorded automatically by
barcode scanners, and websites automatically record purchase data. These data capture possible
interactions among items; supermarket transaction data, for instance, might reveal consumer
buying patterns.
2.1 Association Rules Mining
Promotional pricing, shelf-space planning, and product placement are several applications that
benefit from association rule mining. Association rules mining has been widely used and
successfully implemented for discovering useful associations between data in large databases.
Association rule mining is capable of finding useful information in a large transactional
database. An example is market basket analysis, which, as defined by Han and Kamber (2001),
is a process of gaining insight into customer buying behaviour in order to discover interesting
and useful buying patterns.
This can be done by carefully examining customer buying behaviour and then placing items
appropriately, as this will trigger customer interest in buying additional items rather than a
single item (Han & Kamber, 2001). In this context, the aim is to capture the association
relationships within customer transactions: that is, whether, after deciding to buy certain items,
a customer is then more or less likely to buy other items. Placing frequently associated items
close together, since they are most likely to be bought together, can increase sales and benefit
the company.
2.2 Basic Concepts
In general, association rule mining searches for interesting relationships among items in a
given data set, subject to minimum support and minimum confidence conditions.
The problem of finding association rules x ==> y was first introduced by Agrawal et al. (1993) as
a data mining task of finding frequently co-occurring items in large databases. Agrawal et al.
(1993) developed a two-phase approach to the association rules problem. The first step is to find
all frequently occurring itemsets, typically referred to as frequent itemsets. Each of these itemsets
occurs at least as frequently as a predetermined minimum support count. The second step is
to generate strong association rules from the frequent itemsets. These strong rules must satisfy
the minimum support and minimum confidence.
Let I = {i1, i2, ..., in} be a set of items. Let D be a transaction database in which each transaction
T is a set of items such that T ⊆ I. An association rule is a condition of the form x ==> y,
where x ⊂ I, y ⊂ I, and x ∩ y = ∅. The support of a rule x ==> y is the number of transactions
that contain both x and y. Let the support (or support ratio) of rule x ==> y, denoted σ(x ∪ y),
be s%. This implies that s% of the transactions in D contain the itemsets x and y. In
other words, the probability P(x ∪ y) = s%. Support is sometimes expressed as a support count or
frequency, that is, the actual number of transactions in D that
contain the items that are in the rule. An itemset is frequent if it satisfies the user-specified
minimum support threshold. The confidence of a rule x ==> y is the conditional probability that a
transaction contains the consequent (y) given that it contains the antecedent (x). Hence,
the confidence of a rule x ==> y is calculated as σ(x ∪ y) / σ(x).
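The support and confidence definitions above can be checked with a small Python sketch; the transaction data here is invented for illustration.

```python
def sigma(itemset, transactions):
    """Support count: number of transactions containing the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def support(x, y, transactions):
    """Support ratio of rule x ==> y: sigma(x ∪ y) / |D|."""
    return sigma(x | y, transactions) / len(transactions)

def confidence(x, y, transactions):
    """Confidence of rule x ==> y: sigma(x ∪ y) / sigma(x)."""
    return sigma(x | y, transactions) / sigma(x, transactions)

D = [{"milk", "eggs", "sugar"}, {"milk", "bread"},
     {"milk", "eggs", "sugar", "bread"}, {"eggs", "flour"}]

# Rule {milk, eggs} ==> {sugar}:
print(support({"milk", "eggs"}, {"sugar"}, D))    # 0.5 (2 of 4 transactions)
print(confidence({"milk", "eggs"}, {"sugar"}, D)) # 1.0 (2 of 2)
```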

2.3 Market-basket analysis
Market-basket analysis aims to find regularities in buying behaviour, that is, sets of products that are frequently bought together.
2.3.1 Apriori Algorithm
Association rule discovery finds all rules that satisfy specified constraints, such as minimum
support and confidence thresholds, as is the case with the Apriori algorithm (Agrawal, Imieliński,
& Swami, 1993). It consists of two main phases: frequent itemset discovery and association rule
generation, of which the former is the more complex task. Apriori-based algorithms have been
useful for frequent itemset generation, as they perform well on sparse data when discovering
frequent patterns made up of relatively small itemsets.
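A minimal Python sketch of Apriori's first phase, frequent itemset discovery, can illustrate the level-wise search and subset pruning just described. The function name and the small transaction set are ours, chosen for illustration.

```python
from itertools import combinations

def apriori_frequent(transactions, min_support_count):
    """Level-wise frequent itemset discovery (phase one of Apriori)."""
    items = sorted({i for t in transactions for i in t})
    frequent = {}
    level = [frozenset([i]) for i in items]   # level 1: single items
    k = 1
    while level:
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        survivors = {c: n for c, n in counts.items() if n >= min_support_count}
        frequent.update(survivors)
        k += 1
        # Join frequent (k-1)-itemsets into k-item candidates, then prune
        # any candidate with an infrequent (k-1)-subset (downward closure).
        candidates = {a | b for a in survivors for b in survivors
                      if len(a | b) == k}
        level = [c for c in candidates
                 if all(frozenset(s) in survivors
                        for s in combinations(c, k - 1))]
    return frequent

D = [{"A", "D", "E"}, {"A", "B", "C"}, {"A", "B", "C", "D"},
     {"A", "B", "E", "C"}, {"A", "C", "B", "D"}]
freq = apriori_frequent(D, min_support_count=3)
print(freq[frozenset({"A", "B", "C"})])  # 4 (frequent in 4 of 5 transactions)
```

With a minimum support count of 3, the candidate {B, D} is counted only twice and is discarded, so no superset of it is ever generated; this pruning is what makes Apriori practical on large databases.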
Rule structure: {A ^ B} ==> C
If customers bought milk and eggs, they often bought sugar too!
Association rule: {milk ^ eggs} ==> {sugar}
For each transaction, there is a list of items. Typically, a transaction is a single customer
purchase and the items are the things that were bought. An association rule is a statement of the
form (item set A) ==> (item set B).
The aim of the analysis is to determine the strength of all the association rules among a set of
items. The strength of an association is measured by the support and confidence of the rule. The
support of the rule A ==> B is the probability that the two item sets occur together. The support
of the rule A ==> B is estimated by

support(A ==> B) = (number of transactions containing both A and B) / (total number of transactions)

Notice that support is symmetric. That is, the support of A ==> B is the same as the support of
the rule B ==> A.
The confidence of an association rule A ==> B is the conditional probability of a transaction
containing item set B given that it contains item set A. The confidence is estimated by

confidence(A ==> B) = (number of transactions containing both A and B) / (number of transactions containing A)

ID Items
1 A,D,E
2 A,B,C
3 A,B,C,D
4 A,B,E,C
5 A,C,B,D

(A ^ B) ==> D



Association Rules Example Using SAS Enterprise Miner
Scenario- Banking Services
Products
- Automated Teller Machine Debit Card
- Automobile Installment Loan
- Credit Card
- Certificate of Deposit
- Check/Debit Card
- Checking Account
- Home Equity Line of Credit
- Individual Retirement Account
- Money Market Deposit Account
- Mortgage
- Personal/Consumer Installment Loan
- Savings Account
- Personal Trust Account

A bank wants to examine its customer base and understand which of its products
individual customers own in combination with one another. It has chosen to conduct a
market basket analysis of a sample of its customer base. The BANK data set
lists the banking products/services used by 799 customers. Thirteen possible products are
represented, as shown above.
Name Model Role Measurement Level Description
Account ID Nominal Account Number
Service Target Nominal Type of Services
Visit Sequence Ordinal Order of Product Purchase

The BANK data set has over 32000 rows. Each row of the data set represents a customer-
service combination. Therefore, a single customer can have multiple rows in the data set,
each row representing one of the products he or she owns. The median number of
products per customer is three.
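Grouping the long-format rows into one basket per customer can be sketched in Python. The account numbers and service codes below are hypothetical; only the one-row-per-customer-service layout follows the description above.

```python
from collections import defaultdict

rows = [  # (ACCT, SERVICE) pairs -- tiny hypothetical sample, not real data
    (500026, "CKING"), (500026, "SVG"), (500026, "ATM"),
    (500075, "CKING"), (500075, "MMDA"),
]

# One basket (set of services) per account: this is the shape
# the association analysis works on.
baskets = defaultdict(set)
for acct, service in rows:
    baskets[acct].add(service)

print(sorted(baskets[500026]))  # ['ATM', 'CKING', 'SVG']
```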

1. Open a new diagram. To open a new diagram workspace, select File > New > Diagram.
2. Name the diagram Associations and select OK.

3. To add a new data source to the UnsupervisedDM project, right-click the
UnsupervisedDM project and select Explore.

4. Open the Emdata folder and copy the BANK data set into the Emdata folder.

4.1 The BANK data set copied into the Emdata library

5. To activate the Associations diagram, double-click the Associations diagram.

6. To add the BANK data set to the Associations diagram, click and drag an Input
Data Source node into the workspace.

7. To open the Input Data Source node, double-click the node; a new window will open.

8. To load the BANK data set, click Select, browse to the BANK data set in Library:
Emdata, and click OK.

9. The Input Data Source node displays the information for the BANK data set.
Change the metadata sample to Use complete data as sample.

Building the Association Rules Model and Interpreting the Results
1. Open the BANK data set using the Input Data Source node. Click the Variables tab. Set
the role for ACCT to id, for SERVICE to target, and for VISIT to sequence.

2. Close the Input Data Source node and save.

3. Add the Association node to the diagram workspace as shown below.

4. Double-click the Association node to examine the properties of the BANK data set.

5. To examine the variables to be used in the analysis, click the Variables tab.

# Because the data table has a sequence variable with a status of use, the Association
node will perform a sequence analysis by default. Sequence analysis is not covered in
this workshop, so for the moment, change the status of VISIT to don't use.

6. Open the General tab to explore the association rule settings.

- Minimum confidence level: specifies the minimum confidence required to
generate a rule. The default is 5%.
- Minimum Transaction Frequency to Support Associations: specifies
a minimum level of support required to claim that items are associated (that is,
that they occur together in the database). The default frequency is 5%.
- Maximum number of items in an association: determines the
maximum size of the itemsets to be considered. For example, the default of
four indicates that at most four items will be included in a single
association rule.
- If you are interested in associations involving fairly rare products, you
should consider reducing the minimum support count or percentage when you run
the Association node. If you obtain too many rules to be practically useful,
consider raising the minimum support count or percentage as
one possible solution.
7. Run the Association rule node

8. Examples of the rules. Notice that the rules are ordered in descending order of support.

- Consider the rule A ==> B. Recall that the
o support of A ==> B is the probability that a customer has both A and B.
o confidence of A ==> B is the probability that a customer has B given that the
customer has A.
o lift of A ==> B is a measure of the strength of the association. If the lift = 2 for the
rule A ==> B, then a customer having A is twice as likely to have B as a
customer chosen at random. Lift is the confidence divided by the expected
confidence (the confidence that would be observed if A and B were independent).
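The lift calculation can be sketched in Python, taking the expected confidence of A ==> B to be the unconditional support of B. The baskets and service codes are invented for illustration.

```python
def count(itemset, transactions):
    """Number of transactions containing the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def lift(a, b, transactions):
    """Lift = confidence(a ==> b) / expected confidence (= support of b)."""
    conf = count(a | b, transactions) / count(a, transactions)
    expected = count(b, transactions) / len(transactions)
    return conf / expected

# Hypothetical baskets of banking services.
D = [{"CKING", "SVG"}, {"CKING", "ATM", "SVG"},
     {"ATM", "CCRD"}, {"CKING", "CCRD"}]

# Lift above 1 means the two services co-occur more often than
# they would if they were independent.
print(round(lift({"CKING"}, {"SVG"}, D), 2))  # 1.33
```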

9. Suppose you are particularly interested in those associations that involve automobile
loans. One way to accomplish that visually is to subset the rules so that the consequent
contains AUTO, showing only the rules that involve the selected item.
10. Select View > Subset Table.

11. Select AUTO in the Right Hand Side (Type = Single Item), choose Type = Find Any
for the Left Hand Side, and click Process.

12. The rules based on the aforementioned criteria are displayed.

Exporting SAS Results into Other Formats
1. Suppose you want to export the association rules results into another format (*.xls/*.xlsx/
*.csv). Open a new workbook and rename it AssoRulesExport.
2. Close the new workbook.
3. Click File > Export.

4. Choose the default Library and Member and click Next.

5. Tick Standard data source > Microsoft Excel Workbook and click Next.

6. Browse to the newly created workbook AssoRulesExport.xlsx and click OK.

7. Assign a name to the exported table and click Finish.

8. Open the AssoRulesExport.xlsx workbook to view the exported rules.

3.0 References
Agrawal, R., Imieliński, T., & Swami, A. (1993). Mining association rules between sets of
items in large databases. SIGMOD Record, 22(2), 207-216.
Han, J., & Kamber, M. (2001). Data mining: Concepts and techniques. San Francisco:
Morgan Kaufmann Publishers.