
Experiment No. 04
Aim: To write a program using Apriori Algorithm for Association Rule Mining
Prerequisites:
C/C++/Java Programming
Learning Outcomes:
Concepts of Association Rule Mining
Theory:
Algorithm:
Join Step: Ck is generated by joining Lk-1 with itself.
Prune Step: Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset.
Pseudo-code:
Ck: Candidate itemset of size k
Lk: frequent itemset of size k
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
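As a quick illustration, the pseudo-code translates fairly directly into Java. The sketch below is only that, a sketch under stated assumptions, not the lab's reference implementation: the class and helper names (AprioriSketch, frequentOneItemsets, generateCandidates, filterBySupport) are our own, each itemset is assumed to be a sorted TreeSet<String>, and the database a List<Set<String>>. The prune step is deliberately left out of generateCandidates here; it is sketched separately under Step 3.

import java.util.*;

// Illustrative sketch of the level-wise Apriori loop (all names here are ours, not a standard API).
public class AprioriSketch {

    // Returns the union of all frequent itemsets Lk for the given minimum support count.
    public static Set<Set<String>> apriori(List<Set<String>> db, int minSupCount) {
        Set<Set<String>> allFrequent = new LinkedHashSet<>();
        Set<Set<String>> lk = frequentOneItemsets(db, minSupCount);   // L1 = {frequent items}
        while (!lk.isEmpty()) {                                       // for (k = 1; Lk != empty; k++)
            allFrequent.addAll(lk);
            Set<Set<String>> ck1 = generateCandidates(lk);            // Ck+1 generated from Lk (join step)
            lk = filterBySupport(ck1, db, minSupCount);               // Lk+1 = candidates in Ck+1 with min_support
        }
        return allFrequent;                                           // return the union of all Lk
    }

    // L1: count every single item and keep those meeting the minimum support count.
    static Set<Set<String>> frequentOneItemsets(List<Set<String>> db, int minSupCount) {
        Map<String, Integer> counts = new HashMap<>();
        for (Set<String> t : db)
            for (String item : t)
                counts.merge(item, 1, Integer::sum);
        Set<Set<String>> l1 = new LinkedHashSet<>();
        for (Map.Entry<String, Integer> e : counts.entrySet())
            if (e.getValue() >= minSupCount)
                l1.add(new TreeSet<>(Collections.singleton(e.getKey())));
        return l1;
    }

    // Join step only: merge two sorted k-itemsets that agree on their first k-1 items.
    // (The prune step, which also checks every (k-1)-subset, is sketched under Step 3.)
    static Set<Set<String>> generateCandidates(Set<Set<String>> lk) {
        List<List<String>> sorted = new ArrayList<>();
        for (Set<String> s : lk) sorted.add(new ArrayList<>(s));      // TreeSets iterate in sorted order
        Set<Set<String>> candidates = new LinkedHashSet<>();
        for (int i = 0; i < sorted.size(); i++)
            for (int j = i + 1; j < sorted.size(); j++) {
                List<String> a = sorted.get(i), b = sorted.get(j);
                if (a.subList(0, a.size() - 1).equals(b.subList(0, b.size() - 1))) {
                    Set<String> c = new TreeSet<>(a);
                    c.add(b.get(b.size() - 1));
                    if (c.size() == a.size() + 1) candidates.add(c);  // first k-1 items equal, last item differs
                }
            }
        return candidates;
    }

    // One scan of the database per level: count each candidate and keep those reaching minSupCount.
    static Set<Set<String>> filterBySupport(Set<Set<String>> candidates, List<Set<String>> db, int minSupCount) {
        Set<Set<String>> frequent = new LinkedHashSet<>();
        for (Set<String> c : candidates) {
            int count = 0;
            for (Set<String> t : db)
                if (t.containsAll(c)) count++;
            if (count >= minSupCount) frequent.add(c);
        }
        return frequent;
    }
}

Representing itemsets as sorted sets keeps the join condition simple; the reference Java code later in this write-up achieves a similar effect with space-separated strings and StringTokenizer.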
Example
Consider a database, D, consisting of 9 transactions.

Suppose the minimum support count required is 2 (i.e. min_sup = 2/9 ≈ 22%).


Let the minimum confidence required be 70%.

We first find the frequent itemsets using the Apriori algorithm.
Then, association rules will be generated using min. support & min. confidence.
Step 1: Generating 1-itemset Frequent Pattern

The set of frequent 1-itemsets, L1, consists of the candidate 1-itemsets satisfying minimum
support.
In the first iteration of the algorithm, each item is a member of the set of candidate 1-itemsets, C1.
Step 2: Generating 2-itemset Frequent Pattern

To discover the set of frequent 2-itemsets, L2, the algorithm uses L1 Join L1 to generate a candidate set of 2-itemsets, C2.
Next, the transactions in D are scanned and the support count for each candidate itemset in C2 is accumulated (as shown in the middle table).
The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate 2-itemsets in C2 having minimum support.
Step 3: Generating 3-itemset Frequent Pattern

The generation of the set of candidate 3-itemsets, C3, involves use of the Apriori property.
In order to find C3, we compute L2 Join L2.
C3 = L2 Join L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}.
Now, the Join step is complete and the Prune step will be used to reduce the size of C3. The Prune step helps to avoid heavy computation due to a large Ck.
Based on the Apriori property that all subsets of a frequent itemset must also be frequent, we can determine that the latter four candidates cannot possibly be frequent. How?
For example, let's take {I1, I2, I3}. Its 2-item subsets are {I1, I2}, {I1, I3} & {I2, I3}.
Since all 2-item subsets of {I1, I2, I3} are members of L2, we keep {I1, I2, I3} in C3.
Let's take another example, {I2, I3, I5}, which shows how the pruning is performed. Its 2-item subsets are {I2, I3}, {I2, I5} & {I3, I5}.
BUT, {I3, I5} is not a member of L2 and hence is not frequent, violating the Apriori property. Thus, we have to remove {I2, I3, I5} from C3.
Therefore, C3 = {{I1, I2, I3}, {I1, I2, I5}} after checking all members of the result of the Join operation for pruning.
Now, the transactions in D are scanned in order to determine L3, consisting of those candidate 3-itemsets in C3 having minimum support.
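To make the prune step concrete, here is a small stand-alone Java sketch (the class and method names, PruneSketch and hasOnlyFrequentSubsets, are our own). Its main method replays this C3 example, using the six 2-itemsets of L2 from this example (they also appear in the full set L listed under Step 5).

import java.util.*;

// Illustrative sketch of the Apriori prune step (names are ours, not part of any standard library).
public class PruneSketch {

    // A candidate k-itemset survives only if every one of its (k-1)-subsets is already frequent.
    static boolean hasOnlyFrequentSubsets(Set<String> candidate, Set<Set<String>> lkMinus1) {
        for (String item : candidate) {
            Set<String> subset = new TreeSet<>(candidate);
            subset.remove(item);                          // drop one item to form a (k-1)-subset
            if (!lkMinus1.contains(subset)) return false; // an infrequent subset rules the candidate out
        }
        return true;
    }

    public static void main(String[] args) {
        // L2 from the worked example.
        Set<Set<String>> l2 = new LinkedHashSet<>();
        for (String s : new String[] {"I1 I2", "I1 I3", "I1 I5", "I2 I3", "I2 I4", "I2 I5"})
            l2.add(new TreeSet<>(Arrays.asList(s.split(" "))));

        // C3 produced by the join step, before pruning.
        String[] joined = {"I1 I2 I3", "I1 I2 I5", "I1 I3 I5", "I2 I3 I4", "I2 I3 I5", "I2 I4 I5"};
        for (String s : joined) {
            Set<String> candidate = new TreeSet<>(Arrays.asList(s.split(" ")));
            System.out.println(candidate + (hasOnlyFrequentSubsets(candidate, l2) ? "  kept" : "  pruned"));
        }
        // Expected: only [I1, I2, I3] and [I1, I2, I5] are kept, matching the pruned C3 above.
    }
}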
Step 4: Generating 4-itemset Frequent Pattern
The algorithm uses L3 Join L3 to generate a candidate set of 4-itemsets, C4. Although the join results in {{I1, I2, I3, I5}}, this itemset is pruned since its subset {I2, I3, I5} is not frequent.
Thus, C4 = ∅, and the algorithm terminates, having found all of the frequent itemsets. This completes our Apriori algorithm.
These frequent itemsets will be used to generate strong association rules (where strong association rules satisfy both minimum support & minimum confidence).
Step 5: Generating Association Rules from Frequent Itemsets
Procedure:
For each frequent itemset l, generate all nonempty subsets of l.
For every nonempty subset s of l, output the rule s → (l - s) if
support_count(l) / support_count(s) >= min_conf, where
min_conf is the minimum confidence threshold.
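A hedged Java sketch of this procedure is shown below (RuleSketch and generateRules are illustrative names of our own). It assumes the support counts of every frequent itemset and of all its subsets are available in a map, enumerates the nonempty proper subsets s of each itemset l with a bitmask, and prints s → (l - s) whenever support_count(l) / support_count(s) >= min_conf. The main method uses the support counts from the worked example that follows.

import java.util.*;

// Illustrative sketch of rule generation from frequent itemsets (names are ours).
public class RuleSketch {

    // Prints every rule s -> (l - s) whose confidence meets minConf.
    // supportCounts must hold the support count of each itemset in frequentItemsets and of all its subsets.
    static void generateRules(Set<Set<String>> frequentItemsets,
                              Map<Set<String>, Integer> supportCounts, double minConf) {
        for (Set<String> l : frequentItemsets) {
            List<String> items = new ArrayList<>(l);
            int n = items.size();
            if (n < 2) continue;                               // both sides of a rule must be nonempty
            for (int mask = 1; mask < (1 << n) - 1; mask++) {  // every nonempty proper subset of l
                Set<String> antecedent = new TreeSet<>();
                Set<String> consequent = new TreeSet<>();
                for (int i = 0; i < n; i++)
                    ((mask & (1 << i)) != 0 ? antecedent : consequent).add(items.get(i));
                double conf = supportCounts.get(l) / (double) supportCounts.get(antecedent);
                if (conf >= minConf)
                    System.out.printf("%s -> %s  (confidence %.0f%%)%n", antecedent, consequent, conf * 100);
            }
        }
    }

    public static void main(String[] args) {
        // Support counts for l = {I1, I2, I5} and its subsets, taken from the worked example below.
        Map<Set<String>, Integer> sc = new HashMap<>();
        sc.put(new TreeSet<>(Arrays.asList("I1")), 6);
        sc.put(new TreeSet<>(Arrays.asList("I2")), 7);
        sc.put(new TreeSet<>(Arrays.asList("I5")), 2);
        sc.put(new TreeSet<>(Arrays.asList("I1", "I2")), 4);
        sc.put(new TreeSet<>(Arrays.asList("I1", "I5")), 2);
        sc.put(new TreeSet<>(Arrays.asList("I2", "I5")), 2);
        sc.put(new TreeSet<>(Arrays.asList("I1", "I2", "I5")), 2);

        Set<Set<String>> frequent = new LinkedHashSet<>();
        frequent.add(new TreeSet<>(Arrays.asList("I1", "I2", "I5")));  // only the 3-itemset from the example
        generateRules(frequent, sc, 0.70);   // prints the three strong rules R2, R3 and R6 from the example
    }
}

Running the rules off only this 3-itemset keeps the output aligned with R1 to R6 below; in general the same loop is applied to every frequent itemset of size >= 2.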
Back To Example:

We had L = {{I1}, {I2}, {I3}, {I4}, {I5}, {I1,I2}, {I1,I3}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5}, {I1,I2,I3}, {I1,I2,I5}}.
Let's take l = {I1,I2,I5}.
All its nonempty proper subsets are {I1,I2}, {I1,I5}, {I2,I5}, {I1}, {I2}, {I5}.
Let the minimum confidence threshold be, say, 70%.
The resulting association rules are shown below, each listed with its confidence.
R1: I1 ^ I2 → I5
Confidence = sc{I1,I2,I5}/sc{I1,I2} = 2/4 = 50%
R1 is Rejected.
R2: I1 ^ I5 → I2
Confidence = sc{I1,I2,I5}/sc{I1,I5} = 2/2 = 100%
R2 is Selected.
R3: I2 ^ I5 → I1
Confidence = sc{I1,I2,I5}/sc{I2,I5} = 2/2 = 100%
R3 is Selected.
R4: I1 → I2 ^ I5
Confidence = sc{I1,I2,I5}/sc{I1} = 2/6 = 33%
R4 is Rejected.
R5: I2 → I1 ^ I5
Confidence = sc{I1,I2,I5}/sc{I2} = 2/7 = 29%
R5 is Rejected.
R6: I5 → I1 ^ I2
Confidence = sc{I1,I2,I5}/sc{I5} = 2/2 = 100%
R6 is Selected.
In this way, we have found three strong association rules.

PART B
(PART B : TO BE COMPLETED BY STUDENTS)
(Students must submit the soft copy as per the following segments within two hours of the practical
slot. The soft copy must be uploaded on the Blackboard or emailed to the concerned lab in-charge
faculty at the end of the practical in case there is no Blackboard access available.)
Name: Shubham Gupta
Roll No.: E059
Class: B.Tech
Batch: E3
Date of Experiment: 22/08/2016
Date of Submission: 29/08/2016
Time of Submission:
Grade:
Date of Grading:
B.1 Software Code written by student:


(Paste your c/c++/java code completed during the 2 hours of practical in the lab here)

B.2 Input and Output:


(Paste your program input and output in the following format. If there is an error, then paste the specific error in
the output part. In case of an error, with due permission of the faculty, an extension can be given to submit the
error-free code with output in due course of time. Students will be graded accordingly.)

Input Data:

package apriori;
import java.io.*;
import java.util.*;
public class Apriori {
public static void main(String[] args) {
AprioriCalculation ap = new AprioriCalculation();
ap.aprioriProcess();
}
}
class AprioriCalculation
{
Vector<String> candidates=new Vector<String>(); //the current candidates
String configFile="config.txt"; //configuration file
String transaFile="transa.txt"; //transaction file
String outputFile="apriori-output.txt";//output file
int numItems; //number of items per transaction
int numTransactions; //number of transactions
double minSup; //minimum support for a frequent itemset

String oneVal[]; //array of value per column that will be treated as a '1'
String itemSep = " "; //the separator value for items in the database
public void aprioriProcess()
{
Date d; //date object for timing purposes
long start, end; //start and end time
int itemsetNumber=0; //the current itemset being looked at
//get config
getConfig();
System.out.println("Apriori algorithm has started.\n");
//start timer
d = new Date();
start = d.getTime();
//while not complete
do
{
//increase the itemset that is being looked at
itemsetNumber++;
//generate the candidates
generateCandidates(itemsetNumber);
//determine and display frequent itemsets
calculateFrequentItemsets(itemsetNumber);
if(candidates.size()!=0)
{
System.out.println("Frequent " + itemsetNumber + "-itemsets");
System.out.println(candidates);
}
//if there are <=1 frequent itemsets, then it is the end; this prevents
//reading through the database again when there is only one frequent itemset
}while(candidates.size()>1);
//end timer
d = new Date();
end = d.getTime();
//display the execution time
System.out.println("Execution time is: "+((double)(end-start)/1000) +
" seconds.");
}
public static String getInput()

{
String input="";
//read from System.in
BufferedReader reader = new BufferedReader(new
InputStreamReader(System.in));
//try to get users input, if there is an error print the message
try
{
input = reader.readLine();
}
catch (Exception e)
{
System.out.println(e);
}
return input;
}
private void getConfig()
{
FileWriter fw;
BufferedWriter file_out;
String input="";
//ask if want to change the config
System.out.println("Default Configuration: ");
System.out.println("\tRegular transaction file with '" + itemSep + "'
item separator.");
System.out.println("\tConfig File: " + configFile);
System.out.println("\tTransa File: " + transaFile);
System.out.println("\tOutput File: " + outputFile);
System.out.println("\nPress 'C' to change the item separator,
configuration file and transaction files");
System.out.print("or any other key to continue. ");
input=getInput();
if(input.compareToIgnoreCase("c")==0)
{
System.out.print("Enter new transaction filename (return for
'"+transaFile+"'): ");
input=getInput();
if(input.compareToIgnoreCase("")!=0)
transaFile=input;
System.out.print("Enter new configuration filename (return for
'"+configFile+"'): ");
input=getInput();
if(input.compareToIgnoreCase("")!=0)
configFile=input;

System.out.print("Enter new output filename (return for


'"+outputFile+"'): ");
input=getInput();
if(input.compareToIgnoreCase("")!=0)
outputFile=input;
System.out.println("Filenames changed");
System.out.print("Enter the separating character(s) for items
(return for '"+itemSep+"'): ");
input=getInput();
if(input.compareToIgnoreCase("")!=0)
itemSep=input;
}
try
{
FileInputStream file_in = new FileInputStream(configFile);
BufferedReader data_in = new BufferedReader(new
InputStreamReader(file_in));
//number of items
numItems=Integer.valueOf(data_in.readLine()).intValue();
//number of transactions
numTransactions=Integer.valueOf(data_in.readLine()).intValue();
//minsup
minSup=(Double.valueOf(data_in.readLine()).doubleValue());
//output config info to the user
System.out.print("\nInput configuration: "+numItems+" items,
"+numTransactions+" transactions, ");
System.out.println("minsup = "+minSup+"%");
System.out.println();
minSup/=100.0;
oneVal = new String[numItems];
System.out.print("Enter 'y' to change the value each row
recognizes as a '1':");
if(getInput().compareToIgnoreCase("y")==0)
{
for(int i=0; i<oneVal.length; i++)
{
System.out.print("Enter value for column #" + (i+1) + ": ");
oneVal[i] = getInput();
}

}
else
for(int i=0; i<oneVal.length; i++)
oneVal[i]="1";
//create the output file
fw= new FileWriter(outputFile);
file_out = new BufferedWriter(fw);
//put the number of transactions into the output file
file_out.write(numTransactions + "\n");
file_out.write(numItems + "\n******\n");
file_out.close();
}
//if there is an error, print the message
catch(IOException e)
{
System.out.println(e);
}
}
private void generateCandidates(int n)
{
Vector<String> tempCandidates = new Vector<String>();
//temporary candidate string vector
String str1, str2; //strings that will be used for comparisons
StringTokenizer st1, st2; //string tokenizers for the two itemsets being compared
//if its the first set, candidates are just the numbers
if(n==1)
{
for(int i=1; i<=numItems; i++)
{
tempCandidates.add(Integer.toString(i));
}
}
else if(n==2) //second itemset is just all combinations of itemset 1
{
//add each itemset from the previous frequent itemsets together
for(int i=0; i<candidates.size(); i++)
{
st1 = new StringTokenizer(candidates.get(i));
str1 = st1.nextToken();
for(int j=i+1; j<candidates.size(); j++)
{
st2 = new StringTokenizer(candidates.elementAt(j));
str2 = st2.nextToken();
tempCandidates.add(str1 + " " + str2);

}
}
}
else
{
//for each itemset
for(int i=0; i<candidates.size(); i++)
{
//compare to the next itemset
for(int j=i+1; j<candidates.size(); j++)
{
//create the strings
str1 = new String();
str2 = new String();
//create the tokenizers
st1 = new StringTokenizer(candidates.get(i));
st2 = new StringTokenizer(candidates.get(j));
//make a string of the first n-2 tokens of the strings
for(int s=0; s<n-2; s++)
{
str1 = str1 + " " + st1.nextToken();
str2 = str2 + " " + st2.nextToken();
}
//if they have the same n-2 tokens, add them together
if(str2.compareToIgnoreCase(str1)==0)
tempCandidates.add((str1 + " " + st1.nextToken() + " " +
st2.nextToken()).trim());
}
}
}
//clear the old candidates
candidates.clear();
//set the new ones
candidates = new Vector<String>(tempCandidates);
tempCandidates.clear();
}
private void calculateFrequentItemsets(int n)
{
Vector<String> frequentCandidates = new Vector<String>(); //the frequent candidates for the current itemset
FileInputStream file_in; //file input stream
BufferedReader data_in; //data input stream
FileWriter fw;
BufferedWriter file_out;

StringTokenizer st, stFile; //tokenizer for candidate and transaction


boolean match; //whether the transaction has all the items in an itemset
boolean trans[] = new boolean[numItems]; //array to hold a transaction so that it can be checked
int count[] = new int[candidates.size()]; //the number of successful matches
try
{
//output file
fw= new FileWriter(outputFile, true);
file_out = new BufferedWriter(fw);
//load the transaction file
file_in = new FileInputStream(transaFile);
data_in = new BufferedReader(new InputStreamReader(file_in));
//for each transaction
for(int i=0; i<numTransactions; i++)
{
//System.out.println("Got here " + i + " times"); //useful to
debug files that you are unsure of the number of line
stFile = new StringTokenizer(data_in.readLine(), itemSep);
//read a line from the file to the tokenizer
//put the contents of that line into the transaction array
for(int j=0; j<numItems; j++)
{
trans[j]=(stFile.nextToken().compareToIgnoreCase(oneVal[j])==0); //true when the token equals the value treated as '1' for this column
}
//check each candidate
for(int c=0; c<candidates.size(); c++)
{
match = false; //reset match to false
//tokenize the candidate so that we know what items need to be present for a match
st = new StringTokenizer(candidates.get(c));
//check each item in the itemset to see if it is present in the transaction
while(st.hasMoreTokens())
{
match = (trans[Integer.valueOf(st.nextToken())-1]);
if(!match) //if it is not present in the transaction, stop checking
break;
}
if(match) //if at this point it is a match, increase the count

count[c]++;
}
}
for(int i=0; i<candidates.size(); i++)
{
// System.out.println("Candidate: " + candidates.get(c) + "
with count: " + count + " % is: " + (count/(double)numItems));
//if the count% is larger than the minSup%, add to the
candidate to the frequent candidates
if((count[i]/(double)numTransactions)>=minSup)
{
frequentCandidates.add(candidates.get(i));
//put the frequent itemset into the output file
file_out.write(candidates.get(i) + "," + count[i]/
(double)numTransactions + "\n");
}
}
file_out.write("-\n");
file_out.close();
}
//if there is an error at all in this process, catch it and print the error message
catch(IOException e)
{
System.out.println(e);
}
//clear old candidates
candidates.clear();
//new candidates are the old frequent candidates
candidates = new Vector<String>(frequentCandidates);
frequentCandidates.clear();
}
}
Output Data:

run:
Default Configuration:
Regular transaction file with ' ' item separator.
Config File: config.txt
Transa File: transa.txt
Output File: apriori-output.txt
Press 'C' to change the item separator, configuration file and transaction
files
or any other key to continue.
Input configuration: 5 items, 5 transactions, minsup = 40.0%

Enter 'y' to change the value each row recognizes as a '1':


Apriori algorithm has started.
Frequent 1-itemsets
[1, 3, 4, 5]
Frequent 2-itemsets
[1 3, 1 4, 1 5, 3 4, 3 5, 4 5]
Frequent 3-itemsets
[1 3 4, 1 3 5, 1 4 5, 3 4 5]
Frequent 4-itemsets
[1 3 4 5]
Execution time is: 0.047 seconds.
BUILD SUCCESSFUL (total time: 4 seconds)

B.3 Observations and learning:


(Students are expected to comment on the output obtained with clear observations and learning for each
task/ sub part assigned)

We learnt the Apriori algorithm for Association Rule Mining and also
implemented it in Java.

B.4 Conclusion:
(Students must write the conclusions based on their learning)

We successfully completed the experiment: we studied the Apriori
algorithm for Association Rule Mining and implemented it in Java.

B.5 Questions of Curiosity


Q1. Explain the following terms

frequent itemsets
closed itemsets
support
confidence
multidimensional association rules

Frequent itemsets:
An itemset whose support is greater than or equal to a minsup threshold.
Example transaction database for frequent itemset mining:

TID    Transaction
100    { A, B, E }
200    { B, D }
300    { A, B, E }
400    { A, C }
500    { B, C }
600    { A, C }
700    { A, B }
800    { A, B, C, E }
900    { A, B, C }
1000   { A, C, E }

Closed itemsets:
A closed frequent itemset is an itemset that is both closed and has support
greater than or equal to minsup.
An itemset is closed in a data set if there exists no proper superset that has
the same support count as the original itemset.
IDENTIFICATION:
1. First identify all frequent itemsets.
2. Then, from this group, find those that are closed by checking whether there
exists a superset with the same support as the frequent itemset; if there is,
the itemset is disqualified, but if none can be found, the itemset is closed.
An alternative method is to first identify the closed itemsets and then use
the minsup threshold to determine which ones are frequent.
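A minimal Java sketch of this check is given below, assuming the frequent itemsets and their support counts are already available in a map; the class name ClosedItemsetSketch and the sample counts used in main are hypothetical, chosen only to exercise the check.

import java.util.*;

// Illustrative sketch: keep only the closed itemsets from a map of itemsets to support counts (names are ours).
public class ClosedItemsetSketch {

    // An itemset is closed if no proper superset in the map has the same support count.
    static Set<Set<String>> closedItemsets(Map<Set<String>, Integer> supportCounts) {
        Set<Set<String>> closed = new LinkedHashSet<>();
        for (Map.Entry<Set<String>, Integer> e : supportCounts.entrySet()) {
            boolean isClosed = true;
            for (Map.Entry<Set<String>, Integer> other : supportCounts.entrySet()) {
                boolean properSuperset = other.getKey().size() > e.getKey().size()
                        && other.getKey().containsAll(e.getKey());
                if (properSuperset && other.getValue().equals(e.getValue())) {
                    isClosed = false;          // a superset has the same support, so this itemset is not closed
                    break;
                }
            }
            if (isClosed) closed.add(e.getKey());
        }
        return closed;
    }

    public static void main(String[] args) {
        // Hypothetical support counts, just to exercise the check.
        Map<Set<String>, Integer> sc = new HashMap<>();
        sc.put(new TreeSet<>(Arrays.asList("A")), 6);
        sc.put(new TreeSet<>(Arrays.asList("B")), 6);
        sc.put(new TreeSet<>(Arrays.asList("A", "B")), 6);  // same support as {A} and {B}
        System.out.println(closedItemsets(sc));             // prints [[A, B]]: only {A, B} is closed here
    }
}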

Support:
Association rules are created by analyzing data for frequent if/then
patterns and using the criteria support and confidence to identify the most
important relationships. Support is an indication of how frequently the
items appear in the database, i.e. the fraction (or count) of transactions
that contain the itemset. Confidence indicates how often the if/then
statement has been found to be true, i.e. the fraction of transactions
containing the antecedent that also contain the consequent.
In data mining, association rules are useful for analyzing and predicting
customer behavior. They play an important part in shopping basket data
analysis, product clustering, catalog design and store layout.
Programmers use association rules to build programs capable of machine
learning. Machine learning is a type of artificial intelligence (AI) that seeks
to build programs with the ability to become more efficient without being
explicitly programmed.

Confidence:
Rules have an associated confidence, which is the conditional probability
that the consequent will occur given the occurrence of the antecedent.
The ASSO_MIN_CONFIDENCE setting specifies the minimum confidence for
rules. The model eliminates any rules with confidence below the required

minimum. See Oracle Database PL/SQL Packages and Types Reference for
details on model settings for association rules.

Example: Calculating Rules from Frequent Itemsets
In the referenced tables, the frequent itemsets are the itemsets that occur
with a minimum support of 67%; at least 2 of the 3 transactions must
include the itemset.
The support of a rule indicates the probability of both the antecedent and
the consequent appearing in the same transaction, while the confidence of
a rule is the conditional probability of the consequent given the
antecedent. For example, cereal might appear in 50 transactions; 40 of the
50 might also include milk. The rule confidence would be:

cereal implies milk with 80% confidence

Confidence is the ratio of the rule support count to the number of
transactions that include the antecedent.
Confidence can be expressed in probability notation as follows:

confidence(A implies B) = P(B|A), which is equal to P(A, B) / P(A)
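As a quick check of this formula, the cereal/milk figure quoted above can be reproduced in a few lines of Java (the class and variable names are ours):

// Confidence(A implies B) = P(B|A) = count(A and B) / count(A).
public class ConfidenceCheck {
    public static void main(String[] args) {
        int cerealCount = 50;         // transactions containing cereal (the antecedent A)
        int cerealAndMilkCount = 40;  // transactions containing both cereal and milk
        double confidence = cerealAndMilkCount / (double) cerealCount;
        System.out.println("cereal implies milk with " + (confidence * 100) + "% confidence");
        // prints: cereal implies milk with 80.0% confidence
    }
}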

Multidimensional association rules:
A multidimensional association rule involves two or more dimensions or
predicates, for example age(X, "20..29") ^ occupation(X, "student") →
buys(X, "laptop").
Multidimensional association rule mining can explore a data cube
structure. For metarule-guided mining: given a metarule containing p
predicates, mining on an n-dimensional (n-D) cube structure (where p < n)
can be compared with mining on multiple smaller p-dimensional cubes. An
efficient method for precomputing the cube can also take into account the
constraints imposed by the given metarule.

Q2. What is the importance of association rule mining? Also list down the areas of
applications of association rule mining.

Association rule mining is an important data mining model studied
extensively by the database and data mining community. It assumes all
data are categorical; there is no good algorithm for numeric data.
It was initially used for Market Basket Analysis to find how items
purchased by customers are related.

An association rule has two parts, an antecedent (if) and a consequent
(then). An antecedent is an item found in the data. A consequent is an
item that is found in combination with the antecedent.
Association rules are created by analyzing data for frequent if/then
patterns and using the criteria support and confidence to identify
the most important relationships. Support is an indication of how
frequently the items appear in the database. Confidence indicates
the number of times the if/then statements have been found to be
true.
In data mining, association rules are useful for analyzing and
predicting customer behavior. They play an important part in
shopping basket data analysis, product clustering, catalog design
and store layout.
Programmers use association rules to build programs capable of
machine learning. Machine learning is a type of artificial intelligence
(AI) that seeks to build programs with the ability to become more
efficient without being explicitly programmed.

Areas:
Market basket analysis
Medical diagnosis
Protein sequences
Census data
CRM of credit card business
