
A New Two-Phase Sampling Algorithm for Discovering Association Rules

Data mining techniques are widely used in a variety of applications. Data mining extracts novel and useful knowledge from large repositories of data and has become an effective means of analysis and decision making in organizations, and the sharing of data for mining brings many advantages for research and business collaboration. As the volume of electronically accessible data in warehouses and on the Internet grows rapidly, the scalability of mining becomes a major concern: classical mining algorithms require one or more computationally intensive passes over the entire database, can take hours or even days to execute, and will only become slower as data continue to grow. Using a sample of the data as a synopsis is a popular technique that scales well with data growth.

Association rule mining is a popular and well-researched data mining method for discovering relations between variables in a large database, and the resulting rules can support decisions about marketing activities such as market basket analysis and product placement. This project uses the Apriori, SRS (Simple Random Sampling) and FAST (Finding Associations from Sampled Transactions) algorithms to generate association rules and to discover rules in a large database. By applying Apriori, Simple Random Sampling and FAST to a large dataset, the user can identify the strong and weak rules of the dataset, compare execution time and accuracy, and thereby determine the most efficient algorithm for discovering association rules.

HARDWARE CONFIGURATION:
Processor        : Pentium IV
Processor Speed  : 1.7 GHz
Memory (RAM)     : 256 MB
Hard Disk        : 10 GB
Floppy Drive     : 3.5 inch 1.44 MB drive
Monitor          : Samsung color monitor
Keyboard         : 104-key Intel keyboard
Mouse            : Intel optical mouse

SOFTWARE CONFIGURATION

Operating System : Windows XP
Front End Tool   : Microsoft Visual Basic .NET 2008
Back End Tool    : Microsoft SQL Server 2000

EXISTING SYSTEM:
The study of the existing system has brought out its limitations and so paved the way for the proposed system. Finding relationships between variables in a large database is not an easy task.

LIMITATIONS OF THE EXISTING SYSTEM:
- Limited amount of memory
- Needs the complete database
- Data may be scattered and poorly accessible
- Requires many database scans
- Expensive
- Uses a lossy compressed synopsis (sketch) of the data
- Scalability of the mining algorithm is a major concern

PROPOSED SYSTEM:
The basis for the proposed system is the recognition of the need to improve the existing system, and it aims to overcome the drawbacks of that system. An important aspect of the new system is that changes should be easy to incorporate: the user should be able to make changes at any time without difficulty. The proposed system discovers association rules using the Apriori, Simple Random Sampling, FAST and EASE algorithms. It is developed with Visual Basic .NET as the front end and MS SQL Server as the back end.

FEATURES OF THE PROPOSED SYSTEM:
- Uses the large itemset property
- Saves memory space
- Easily implemented
- Reduced cost
- Reduced field time
- Increased accuracy
- Provides security
- Excellent user friendliness
- Simple
- Errors can be easily measured

Modules Description :
This project is based on the FAST, EASE, Apriori and Simple Random Sampling algorithms for discovering association rules in a large database.

Apriori Algorithm :
The Apriori algorithm is a classic algorithm for learning association rules. It is designed to operate on databases containing transactions.
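The idea can be summarised in a short sketch. The following Python fragment is only an illustration of Apriori-style level-wise itemset generation (it omits the candidate-pruning optimisation); the transaction format and the min_support value are assumptions for the example, not part of the project's VB.NET implementation.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Minimal Apriori sketch: returns frequent itemsets with their support counts."""
    n = len(transactions)
    items = {item for t in transactions for item in t}
    current = [frozenset([i]) for i in sorted(items)]   # candidate 1-itemsets
    frequent = {}
    k = 1
    while current:
        # Count the support of every candidate with one pass over the transactions
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: cnt for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(level)
        # Join frequent k-itemsets to build the (k+1)-itemset candidates
        keys = list(level)
        current = list({a | b for a, b in combinations(keys, 2) if len(a | b) == k + 1})
        k += 1
    return frequent

# Illustrative market-basket style data
baskets = [frozenset(t) for t in ({"bread", "milk"}, {"bread", "butter", "milk"},
                                  {"butter", "milk"}, {"bread", "butter"})]
print(apriori(baskets, min_support=0.5))
```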

Simple Random Sampling :
Simple Random Sampling is considered separately: it draws a random subset of the database and checks support and confidence on that subset in order to find the best rules. Simple random sampling can make sampling a viable means of attaining both high performance and acceptably accurate results.
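A minimal sketch of this step is given below, assuming the usual support and confidence measures are estimated on a uniform sample drawn without replacement; the rule, sample size and thresholds are illustrative assumptions only.

```python
import random

def srs_rule_check(transactions, sample_size, antecedent, consequent,
                   min_support=0.2, min_confidence=0.6):
    """Estimate support and confidence of antecedent -> consequent on a simple random sample."""
    sample = random.sample(transactions, sample_size)        # uniform, without replacement
    both = antecedent | consequent
    sup_both = sum(1 for t in sample if both <= t) / len(sample)
    sup_ante = sum(1 for t in sample if antecedent <= t) / len(sample)
    confidence = sup_both / sup_ante if sup_ante else 0.0
    is_strong = sup_both >= min_support and confidence >= min_confidence
    return sup_both, confidence, is_strong

# e.g. srs_rule_check(baskets, 3, frozenset({"bread"}), frozenset({"milk"}))
```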

FAST Algorithm :
FAST (Finding Associations from Sampled Transactions) is a refined sampling-based mining algorithm that is distinguished from prior algorithms by its novel two-phase approach to sample collection. In Phase I, a large sample is collected to quickly and accurately estimate the support of each item in the database. In Phase II, a small final sample is obtained by excluding outlier transactions in such a manner that the support of each item in the final sample is as close as possible to the estimated support of that item in the entire database. Numerical experiments indicate that, for any fixed computing budget, FAST identifies more of the true frequent itemsets and fewer false itemsets than earlier sampling-based algorithms, and it can identify most frequent itemsets in a database at an overall cost much lower than that of classical algorithms.

In this project, A New Two-Phase Sampling Algorithm for Discovering Association Rules, the user can compare the running times of the algorithms and the variation between them. In a large dataset, the Apriori algorithm is applied first to compute support and confidence and so identify the strong and weak rules; the dataset is then sampled randomly and the strong and weak rules are found from the sample's support and confidence; finally, the FAST algorithm is applied to the large dataset to find the strong and weak rules based on support and confidence. By applying the three algorithms, the user can measure time and accuracy and determine the best algorithm from the time difference.
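A rough sketch of the two-phase idea follows. The greedy trimming loop is an illustrative stand-in for FAST's trimming criterion (and is quadratic, so it is not efficient); the sample sizes and the L1 distance over 1-itemset supports are assumptions for the example.

```python
import random
from collections import Counter

def fast_two_phase_sketch(transactions, phase1_size, final_size):
    """Two-phase sampling sketch: Phase I estimates item supports on a large
    random sample; Phase II greedily removes outlier transactions so that the
    final sample's item supports stay close to the Phase I estimates."""
    # Phase I: large simple random sample and estimated 1-itemset supports
    sample = list(random.sample(transactions, phase1_size))
    est = Counter(item for t in sample for item in t)
    est = {item: c / len(sample) for item, c in est.items()}

    def distance(trans):
        # L1 distance between this sample's item supports and the Phase I estimates
        counts = Counter(item for t in trans for item in t)
        n = len(trans)
        return sum(abs(counts.get(item, 0) / n - s) for item, s in est.items())

    # Phase II: trim the sample one outlier transaction at a time
    while len(sample) > final_size:
        best = min(range(len(sample)),
                   key=lambda k: distance(sample[:k] + sample[k + 1:]))
        sample.pop(best)
    return sample
```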

EASE Algorithm :
EASE (Epsilon Approximation: Sampling Enabled) is a novel data-reduction method that is especially designed for categorical count data. The algorithm is an outgrowth of earlier work by Chen et al. on the FAST data-reduction method. Both EASE and FAST start with a relatively large simple random sample of transactions and deterministically trim the sample to create a final subsample whose "distance" from the complete database is as small as possible. For reasons of computational efficiency, both algorithms consider the subsample to be "close" to the original database if the high-level aggregates of the subsample, normalized by the total number of data points, are close to the normalized aggregates in the database. These normalized aggregates typically correspond to 1-itemset or 2-itemset supports in the association-rule setting or, in the setting of a contingency table, to relative marginal or cell frequencies.
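The "closeness" criterion can be made concrete with a small sketch. The discrepancy below is computed over normalized 1-itemset supports only, and the halving step is a simplified illustration of the epsilon-approximation idea rather than the exact procedure of the EASE paper; all names are assumptions for the example.

```python
from collections import Counter

def item_supports(transactions):
    """Normalized 1-itemset supports: fraction of transactions containing each item."""
    counts = Counter(item for t in transactions for item in t)
    n = len(transactions)
    return {item: c / n for item, c in counts.items()}

def discrepancy(subsample, database):
    """Largest gap between the subsample's and the database's normalized supports."""
    full = item_supports(database)
    sub = item_supports(subsample)
    return max(abs(sub.get(item, 0.0) - s) for item, s in full.items())

def halve_keeping_close(sample, database):
    """One simplified halving step: keep the half with the smaller discrepancy."""
    mid = len(sample) // 2
    half1, half2 = sample[:mid], sample[mid:]
    return half1 if discrepancy(half1, database) <= discrepancy(half2, database) else half2
```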

Apply EASE Algorithm (results highlighted in blue and red)

TABLE NAME : Dataset_master | Primary Key : DS_ID

COLUMN NAME   DATATYPE   DESCRIPTION
DS_ID         Numeric    Dataset identification
DS_TRANS      Text       Dataset transaction data

TABLE NAME : Result_analysis | Primary Key : Tran_no

COLUMN NAME    DATATYPE   DESCRIPTION
TRAN_NO        Numeric    Transaction number
TYPE           Text       Transaction type
SNO            Numeric    Serial number
STARTED_TIME   Datetime   Started time
ELAPSED_TIME   Datetime   Elapsed time
RULES          Text       Rules

System flow: Apriori (finding rules) -> Simple Random Sample (finding rules) -> FAST testing (finding rules) -> Apply EASE algorithm -> Result analysis
