
INTRODUCTION

1.1 INTRODUCTION TO MARKET BASKET ANALYSIS

Market Basket Analysis is a technique that identifies the strength of association between
pairs of products purchased together and identifies patterns of co-occurrence. A co-occurrence
is when two or more things take place together.

Market Basket Analysis creates If-Then scenario rules, for example, if item A is
purchased then item B is likely to be purchased. The rules are probabilistic in nature or, in other
words, they are derived from the frequencies of co-occurrence in the observations. Frequency
is the proportion of baskets that contain the items of interest. The rules can be used in pricing
strategies, product placement, and various types of cross-selling strategies.
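As a concrete illustration, the following short R sketch computes these quantities by hand for a handful of invented baskets; the item names and the rule shown are made up for the example.

    # Toy transactions: each basket is a character vector of items (invented data).
    baskets <- list(
      c("bread", "butter"),
      c("bread", "butter", "jam"),
      c("bread", "milk"),
      c("butter", "milk"),
      c("bread", "butter", "milk")
    )

    # Support (frequency) of an itemset: proportion of baskets containing all its items.
    support <- function(items) {
      mean(sapply(baskets, function(b) all(items %in% b)))
    }

    # Confidence of the rule "if A then B": support of A and B together / support of A.
    confidence <- function(lhs, rhs) {
      support(c(lhs, rhs)) / support(lhs)
    }

    support(c("bread", "butter"))   # 3 of 5 baskets = 0.6
    confidence("bread", "butter")   # 0.6 / 0.8 = 0.75: "if bread, then butter is likely"

Here 75% of the baskets containing bread also contain butter, which is exactly the probabilistic If-Then reading described above.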

In order to make it easier to understand, think of Market Basket Analysis in terms of
shopping at a supermarket. Market Basket Analysis takes data at transaction level, which lists
all items bought by a customer in a single purchase. The technique determines relationships of
what products were purchased with which other product(s). These relationships are then used
to build profiles containing If-Then rules of the items purchased.

Market basket analysis is a mathematical modelling technique based upon the theory that if
you buy a certain group of items, you are likely to buy another group of items. It is used to
analyse customer behaviour, and it helps in increasing sales and maintaining inventory by
focusing on point-of-sale transactional data.

Market Basket Analysis is one of the most common and useful types of data analysis
for marketing and retailing. The purpose of market basket analysis is to determine what
products customers purchase together. It takes its name from the idea of customers throwing
all their purchases into a shopping cart (a "market basket") during grocery shopping. Knowing
what products people purchase as a group can be very helpful to a retailer or to any other
company. A store could use this information to place products frequently sold together into
the same area, while a catalogue or World Wide Web merchant could use it to determine the
layout of their catalogue and order form. Direct marketers could use the basket results to
determine what new products to offer their prior customers.

In some cases, the fact that items sell together is obvious – every fast-food restaurant
asks its customers "Would you like fries with that?" whenever they go through a drive-through
window. However, sometimes the fact that certain items would sell well together is
far from obvious. A well-known example is that a supermarket performing a basket analysis
discovered that diapers and beer sell well together on Thursdays. Though the result does make
sense – young couples stocking up on supplies for themselves and for their children before the
weekend starts – it’s not the sort of thing that someone would normally think of right
away. The strength of market basket analysis is that by using computer data mining tools, it’s
not necessary for a person to think of what products consumers would logically buy together
– instead, the customers’ sales data is allowed to speak for itself. This is a good example of
data-driven marketing.

Once it is known that customers who buy one product are likely to buy another, it is possible
for the company to market the products together, or to make the purchasers of one product the
target prospects for another. If customers who purchase diapers are already likely to purchase
beer, they’ll be even more likely to if there happens to be a beer display just outside the diaper
aisle. Likewise, if it’s known that customers who buy a sweater and casual pants from a certain
mail-order catalogue have a propensity toward buying a jacket from the same catalogue, sales
of jackets can be increased by having the telephone representatives describe and offer the jacket
to anyone who calls in to order the sweater and pants. Still better, the catalogue company can
provide an additional 5% discount on a package containing the sweater, pants, and jacket,
and promote the complete package well. The dollar amount of sales is then very likely to go
up. By targeting customers who are already known to be likely buyers, the effectiveness of
marketing is significantly increased – regardless of whether the marketing takes the
form of in-store displays, catalogue layout design, or direct offers to customers. This is the
purpose of market basket analysis – to improve the effectiveness of marketing and sales tactics
using customer data already available to the company.

➢ Market basket analysis only uses transactions with more than one item, as no
associations can be made from single purchases. Item association does not necessarily
suggest cause and effect, but is simply a measure of co-occurrence. The fact that energy
drinks and video games are frequently bought together does not mean that one causes the
purchase of the other, but it can be inferred that this purchase is most probably made
by (or for) a gamer. Such rules or hypotheses must be tested and should not be taken as
truth unless the item sales data supports them.

1.2 TYPES OF MARKET BASKET ANALYSIS

➢ Predictive MBA is used to classify cliques of item purchases, events and services that
largely occur in sequence.


➢ Differential MBA removes a high volume of insignificant results and can lead to very
in-depth results. It compares information between different stores, demographics,
seasons of the year, days of the week and other factors.

MBA is commonly used by online retailers to make purchase suggestions to consumers. For
example, when a person buys a particular model of smartphone, the retailer may suggest other
products such as phone cases, screen protectors, memory cards or other accessories for that
particular phone. This is due to the frequency with which other consumers bought these items
in the same transaction as the phone.

MBA is also used in physical retail locations. Due to the increasing sophistication of point of
sale systems coupled with big data analytics, stores are using purchase data and MBA to help
improve store layouts so that consumers can more easily find items that are frequently
purchased together.

1.3 DEFINITION
Affinity analysis is a data analysis and data mining technique that discovers co-
occurrence relationships among activities performed by specific individuals or groups. In
general, this can be applied to any process where agents can be uniquely identified and
information about their activities can be recorded.

Market basket analysis (MBA) is an example of an analytics technique employed by retailers
to understand customer purchase behaviour. It is used to determine what items are frequently
bought together or placed in the same basket by customers, and it uses this purchase
information to improve the effectiveness of sales and marketing. MBA looks for combinations of products that
frequently occur in purchases and has been prolifically used since the introduction of electronic
point of sale systems that have allowed the collection of immense amounts of data.

The process of discovering frequent itemsets in large transactional databases is called market
basket analysis.

USES OF MARKET BASKET ANALYSIS

Market Basket analysis is a data mining technique used by retailers to increase sales by
better understanding customer purchasing patterns. It involves analyzing large data sets, such
as purchase history, to reveal product groupings, as well as products that are likely to be
purchased together.

The adoption of market basket analysis was aided by the advent of electronic point-of-
sale (POS) systems. Compared to handwritten records kept by store owners, the digital records
generated by POS systems made it easier for applications to process and analyze large volumes
of purchase data.

When one hears Market Basket Analysis, one thinks of shopping carts and supermarket
shoppers. It is important to realize that there are many other areas in which Market Basket
Analysis can be applied. An example familiar to a majority of Internet users is Amazon's list
of potentially interesting products: Amazon informs customers that people who bought the item
they are purchasing also viewed or bought another list of items. Applications of Market
Basket Analysis in various industries are listed below:

Retail. In Retail, Market Basket Analysis can help determine what items are purchased
together, purchased sequentially, and purchased by season. This can assist retailers to
determine product placement and promotion optimization (for instance, combining product
incentives). Does it make sense to sell soda and chips or soda and crackers?

Telecommunications. In Telecommunications, where high churn rates continue to be a
growing concern, Market Basket Analysis can be used to determine what services are being
utilized and what packages customers are purchasing. They can use that knowledge to direct
marketing efforts at customers who are more likely to follow the same path.

For instance, telecommunications providers these days also offer TV and Internet. The bundles
to offer can be determined from an analysis of what customers purchase, thereby giving the
company an idea of how to price the bundles. This analysis might also help in determining
capacity requirements.

Banks. In Finance (banking, for instance), Market Basket Analysis can be used to
analyze customers' credit card purchases to build profiles for fraud detection purposes and
cross-selling opportunities.

Insurance. In Insurance, Market Basket Analysis can be used to build profiles to detect
medical insurance claim fraud. By building profiles of claims, you are able to use the
profiles to determine whether more than one claim belongs to a particular claimant within a
specified period of time.

Medical. In Healthcare or Medical, Market Basket Analysis can be used for comorbid
conditions and symptom analysis, with which a profile of illness can be better identified. It can
also be used to reveal biologically relevant associations between different genes or between
environmental effects and gene expression.

1.4 USAGE OF MARKET BASKET ANALYSIS


Market basket analysis is a process that looks for relationships among entities and objects that
frequently appear together, such as the collection of items in a shopper’s cart. For the purposes
of customer centricity, market basket analysis examines collections of items to identify
affinities that are relevant within the different contexts of the customer touch points. Some
examples include:

Product placement—Identifying products that may often be purchased together and arranging
those products close to one another to encourage the purchaser to buy both items. That placement
can be physical, such as in the arrangement of products on shelves in a brick and mortar
location, or virtual, such as in a print catalog or on an e-commerce site.

Point-of-Sale—Companies may use the affinity grouping of multiple products as an indication
that customers may be predisposed to buying certain sets of products at the same time. This
enables the presentation of items for cross-selling, or may suggest that customers may be
willing to buy more items when certain products are bundled together.

Customer retention—When customers contact a business to sever a relationship, a company
representative may use market basket analysis to determine the right incentives to offer in
order to retain the customer's business.

1.5 ALGORITHMS ASSOCIATED WITH MARKET BASKET ANALYSIS


In market basket analysis, association rules are used to predict the likelihood of products being
purchased together. Association rules count the frequency of items that occur together, seeking
to find associations that occur far more often than expected.

Algorithms that use association rules include AIS, SETM and Apriori. The Apriori
algorithm is the one most commonly cited by data scientists in research articles about market
basket analysis. It first identifies the individual items that occur frequently in the
database and then extends them to larger and larger itemsets, as long as those itemsets still
appear sufficiently frequently.

The arules package for R is an open source toolkit for association rule mining using the R
programming language. This package supports the Apriori algorithm along with other mining
algorithms; related R packages include arulesNBMiner, opusminer, RKEEL and RSarules.
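As a brief illustration of how such a toolkit is typically driven, the sketch below runs the Apriori implementation from the arules package on its bundled Groceries example dataset; the support and confidence thresholds are arbitrary example values, not recommendations.

    # Assumes the arules package is installed: install.packages("arules")
    library(arules)

    data("Groceries")   # example grocery transaction dataset shipped with arules

    # Mine association rules with illustrative thresholds: 1% support, 50% confidence.
    rules <- apriori(Groceries, parameter = list(supp = 0.01, conf = 0.5))

    # Show the five rules with the highest lift, i.e. the item combinations that
    # co-occur far more often than expected if the items were independent.
    inspect(head(sort(rules, by = "lift"), 5))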

1.6 BENEFITS OF MARKET BASKET ANALYSIS

Store Layout: You can organize or set up your store according to market basket analysis in order
to increase revenue. Once you know which products appear in the same market baskets, you can
place those products near each other so that customers notice them and decide to buy them.
Market basket analysis acts as a guide for organizing your store to get the best revenues.

Marketing Messages: Market basket analysis increases the efficiency of marketing messages,
whether they are delivered by phone, email, social media, etc. You can suggest the next best
option to customers by using market basket analysis data. With its help, you can give relevant
suggestions to your customers instead of bothering them with irrelevant marketing offers.

Maintain Inventory: If you have done market basket analysis, then you know which products your
customers are likely to buy in the future, and you can maintain your inventory accordingly. You
can predict customers' future purchases over a period of time on the basis of market basket
analysis data, as well as initial sales data. You can also predict shortages of useful or
high-demand items in your store and arrange your stock or inventory accordingly.

Content Placement: Content placement is very important when you are doing e-commerce business.
Your conversion rates will increase when your products are displayed or arranged in the right
order. Market basket analysis is used by online retailers to display the content that customers
are likely to view next. It helps to engage customers on your website, to increase traffic, and
to get better conversion rates.

Recommendation Engines: Market basket analysis is the basis for creating recommendation
engines. A recommendation engine is software that analyzes, identifies and recommends content
to users in which they are interested. A recommendation engine is an important part of many
applications and software products: it collects information about people's habits and then
recommends content to them.
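As a toy illustration of this idea (with invented baskets), the R sketch below recommends the items that most often co-occur with a given item, which is the simplest market-basket-style recommender.

    # Invented purchase baskets.
    baskets <- list(c("phone", "case"), c("phone", "case", "charger"),
                    c("phone", "charger"), c("case", "screen protector"),
                    c("phone", "case", "screen protector"))

    # Count how often each other item appears in baskets containing `item`.
    recommend <- function(item, n = 3) {
      with_item <- Filter(function(b) item %in% b, baskets)
      co <- table(unlist(with_item))
      co <- co[names(co) != item]           # drop the item itself
      head(sort(co, decreasing = TRUE), n)  # top co-occurring items
    }

    recommend("phone")   # case and charger rank highest for these baskets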

LITERATURE SURVEY

Mining association rules is one of the most important fields of application of Data Mining. A
set of customer transactions on items is provided, and the main purpose is to determine the
correlations among the sales of items. Mining association rules is known as Market Basket
Analysis, which is also an application field of Data Mining. It is essential to examine
customers' purchase behaviour and assist in increasing sales and conserving inventory by
focusing on point-of-sale transaction data. This remains a wide-open area for researchers
seeking to develop better data mining algorithms. This chapter discusses a survey of the
existing work on Association Rule Mining, the Apriori Algorithm, Market Basket Analysis
systems, and agriculture and data mining techniques.

SURVEY ON ASSOCIATION RULE MINING ALGORITHMS

Market Basket Analysis is a specific application of association rule mining, in which retail
transaction baskets are analyzed to find the products that are likely to be purchased together.
Consider a supermarket setting where the database records the items purchased by a customer at
a single time as a transaction. The planning department may be interested in finding
"associations" between sets of items with some minimum specified confidence. Such associations
might be helpful in designing promotions and discounts or shelf organization and store layout.
The output of the analysis forms the input for recommendation or marketing strategies. The most
common step in all association rule mining algorithms is to divide the job into two sub-tasks.

• Frequent itemset generation: This sub-task discovers all the itemsets that satisfy the
minsup (minimum support) threshold. The itemsets that satisfy the minsup threshold are called
frequent itemsets.

• Rule generation: This sub-task extracts all the high-confidence rules from the frequent
itemsets obtained in the above step. These rules are called strong rules.
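To make the two sub-tasks concrete, the sketch below implements both stages naively in R on a few invented transactions: a brute-force enumeration of frequent itemsets at a chosen minsup, followed by confidence-filtered rule generation. Practical algorithms such as Apriori prune this search instead of enumerating every subset.

    # Invented transactions and example thresholds.
    transactions <- list(c("a", "b", "c"), c("a", "b"), c("a", "c"),
                         c("b", "c"), c("a", "b", "c"))
    minsup <- 0.4
    minconf <- 0.7

    support <- function(s) mean(sapply(transactions, function(t) all(s %in% t)))

    # Sub-task 1: frequent itemset generation (brute force over all item subsets).
    items <- sort(unique(unlist(transactions)))
    subsets <- unlist(lapply(seq_along(items),
                             function(k) combn(items, k, simplify = FALSE)),
                      recursive = FALSE)
    frequent <- Filter(function(s) support(s) >= minsup, subsets)

    # Sub-task 2: rule generation - split every frequent itemset of size >= 2 into
    # an antecedent and a consequent, keeping only high-confidence (strong) rules.
    for (s in Filter(function(s) length(s) >= 2, frequent)) {
      for (k in 1:(length(s) - 1)) {
        for (lhs in combn(s, k, simplify = FALSE)) {
          rhs <- setdiff(s, lhs)
          conf <- support(s) / support(lhs)
          if (conf >= minconf)
            cat(sprintf("{%s} -> {%s}  confidence %.2f\n",
                        toString(lhs), toString(rhs), conf))
        }
      }
    }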

Zhixin et al. recommended an improved classification technique based on predictive
association rules. Classification based on Predictive Association Rules (CPAR) is a type of
associative classification method which integrates the benefits of associative classification
and conventional rule-based classification. For rule generation, CPAR is more efficient than
conventional rule-based classification, since most of the repeated calculation is avoided and
multiple literals can be selected to create multiple rules at the same time. Although this
avoids repeated calculation in rule generation, the prediction process has disadvantages:
inconsistency in class rule distribution and interference from inaccurate class rules.
Further, it is ineffective for instances that satisfy no rules. To avoid these difficulties,
the authors recommend Class Weighting Adjustment, Center Vector-based Pre-classification and
Post-processing with a Support Vector Machine (SVM).

Wang et al. suggested a novel rule weighting approach in Classification Association Rule
Mining. Classification Association Rule Mining (CARM) is a recent classification rule mining
technique that builds an association rule mining based classifier using Classification
Association Rules (CARs). Regardless of which specific CARM algorithm is used, a similar set
of CARs is always produced from the data, and a classifier is commonly presented as an ordered
CAR list, depending on a selected rule ordering approach. A number of rule ordering approaches
have been developed in the recent past, which can be categorized as rule weighting,
support-confidence and hybrid approaches. In this work, an alternative rule-weighting method,
called CISRW (Class-Item Score based Rule Weighting), is proposed, together with a
rule-weighting based rule ordering mechanism that depends on CISRW. Two hybrid techniques are
then developed by merging support-confidence and CISRW.

Bartik presented association-based classification for relational data and its use in web
mining. Classification according to mined association rules is an effective and
human-understandable classification scheme. The author's intention is to propose an alteration
of the fundamental association-based classification technique that can be used for gathering
data from Web pages. The alteration of the technique and the necessary discretization of
numeric attributes are given.

Mining interesting rules by association and classification algorithms is put forth by Yanthy
et al. The main purpose of data mining is to reveal hidden knowledge from data, and numerous
techniques have been suggested so far. However, the disadvantage is that only a small portion
of the generated rules is of interest to any given user. Thus, several measures, such as
confidence, support, lift and information gain, have been suggested to discover the best or
most interesting rules. However, some techniques are good at creating rules that score highly
on one interestingness measure but not on others, and the connection between the techniques
and the interestingness measures of the generated rules is not yet obvious. The authors
studied the connection between the techniques and the interestingness measures, using
synthetic data so that the results are not limited to particular situations.

Zongyao et al. proposed mining local association patterns from spatial datasets. The authors
offer a model and an algorithm to mine local association rules from an existing spatial
dataset, while fully considering the reality that spatial heterogeneity may be extensively
present. The significant element of the model is the computation of a Localized Measure of
Association Strength (LMAS), which is used to measure local association patterns. Spatial
association relations are described as spatial relations modeled by the DE-9IM model. The
authors offer a mining technique for detecting local association patterns in the spatial
dataset: the technique mines reference and target objects that have probable association
patterns and computes the LMAS for every object in the reference set for a spatial relation of
interest. The output of the algorithm is therefore an LMAS distribution map that reflects
variations in association potential inside the study area. Spatial interpolation of the LMAS
is suggested to produce a continuous LMAS distribution that can be used to examine hot spots
showing strong association patterns. This technique was applied in an ecological system study.

Yong et al. proposed mining association rules with new measure criteria. Nowadays, mining
association rules from large databases is an active field of data mining research with many
application areas. On the other hand, there are some difficulties in mining strong association
rules under the support-confidence framework. First, a large number of redundant association
rules is generated, making it complicated for the user to find the interesting ones; the
correlation with the features of the specific application area is also ignored. Thus, new
measure criteria, the Chi-Square test and cover, are introduced into association rule mining,
and an additional main aspect is the use of the Chi-Square test to decrease the number of
rules. The authors use the Chi-Square test and the cover measure to remove itemsets that are
statistically independent as frequent itemsets or rules are created. The number of itemset
patterns is thus reduced, and it is easier for the user to find the most noticeable
association rules. Finally, the Chi-Square test is effective at diminishing the quantity of
patterns when combined with the support and cover constraints; irrelevant attributes can be
eliminated, and the efficiency and reliability of mining association rules are enhanced.

Mining traditional association rules using a frequent itemset lattice is given by Vo et al.
Numerous methods have been devised to improve the time needed for mining frequent itemsets,
but methods that deal with the time needed for mining association rules have received less
research attention. In fact, for databases that contain many frequent itemsets (from tens of
thousands up to millions), the time for mining association rules is much greater than that
required for mining frequent itemsets. A lattice-based application for mining conventional
association rules is generated, which considerably decreases the time for mining rules. The
technique consists of two stages: (1) construction of the frequent itemset lattice and (2)
mining association rules from the lattice. For fast determination of association rules, the
parent-child relationships in the lattice are utilized.

Rastogi et al. present mining optimized association rules with categorical and numeric
attributes. Of late, mining association rules on huge data sets has received significant
attention. Association rules are helpful for predicting relationships among the features of a
relation and have applications in marketing and many retail sectors. Besides, optimized
association rules are a good approach for focusing attention on the most interesting features
connecting certain attributes. Optimized association rules are allowed to contain
uninstantiated attributes, and it is complex to discover instantiations where the support or
confidence of the rule is maximized. In this approach, the notion of optimized association
rules is generalized in three ways: (a) association rules are allowed to include disjunctions
over uninstantiated attributes; (b) association rules are allowed to include an arbitrary
number of uninstantiated attributes; and (c) uninstantiated attributes can be either
categorical or numeric. The generalized association rules allow the mining of useful
information on seasonal and local patterns connecting multiple attributes. The authors also
propose an effective method for pruning the search space when computing optimized association
rules for both categorical and numeric attributes.
Wang et al. performed an investigation on association rule mining based on ontology in
e-commerce. Commercial activities carried out over the Internet are very popular, and plenty
of transaction logs are produced, which helps in collecting useful information through data
mining. Hence, association rule mining is important in e-commerce. However, several problems
arise in existing association rule mining systems that conventional techniques cannot solve.
With the aim of answering these difficulties better, the authors suggest association rule
mining based on ontology. There are three parts to the data mining: (1) methods of ontology
creation and principles of commodity categorization; (2) simplification of R-interesting
measures based on actual situations; and (3) implementation of association rule mining based
on ontology using an improved Apriori. In addition, the enhanced algorithm is tested using
FoodMart2000, with Java as the development language and Jena as the ontology engine; the whole
mining process is completed, and the validity of the algorithm is verified on the example
database.

Dong Liyan et al. [29] proposed a novel method of mining frequent itemsets. The target of
mining association rules is to determine the association relationships among itemsets in mass
data; in a number of practical applications, its responsibility is mostly to support
decision-making. The authors proposed an association rule algorithm for mining frequent
itemsets which introduces a new data structure and uses a compressed storage tree to improve
the runtime performance of the algorithm. In comparison with existing algorithms, the proposed
algorithm has many benefits in load balance and run time.

Lei Wen et al. developed an efficient algorithm for mining frequent closed itemsets.
Association rule mining is a prominent field of data mining analysis, and identifying the
useful and significant frequent itemsets is a key step. The existing frequent itemset
discovery algorithms could discover all frequent itemsets or maximal frequent itemsets.
Pasquier et al. proposed a novel method of mining frequent closed itemsets: the set of
frequent closed itemsets is much smaller than the set of all frequent itemsets, and no
information is lost. A new frequent closed itemset method is proposed using a directed
specified itemset graph; this method can identify all the frequent closed itemsets efficiently
through a depth-first search technique.

Mining frequent itemsets from secondary memory was put forth by Grahne et al. Mining
association rules, that is, mining frequent itemsets, is normally carried out for main-memory
databases. The authors present techniques for mining frequent itemsets when the database, or
the data structures used in the mining, are too bulky to fit in main memory. The technique
decreases the required disk access by an order of magnitude and allows truly scalable data
mining.

Xuegang Hu et al. suggested mining frequent itemsets using a pruned concept lattice. A
critical step in association rule mining is extracting frequent itemsets. However, most
approaches that extract frequent itemsets scan the database numerous times, which reduces
their effectiveness. In this method, the relationship between the concept lattice and frequent
itemsets is used, and a Pruned Concept Lattice (PCL) is introduced to represent the frequent
itemsets in a given database, so that the scale of the frequent itemsets is compacted
successfully. A technique for extracting frequent itemsets based on the PCL is implemented,
which prunes rare concepts suitably and dynamically throughout the PCL's construction based on
the Apriori property.

The preceding studies on mining association rules concern intratransaction associations, that
is, associations between items within the same transaction. The scope can be extended to
include multidimensional and intertransaction associations. An example of an intertransaction
association is "if (company) A's stock goes up on day one, B's stock will go down on day two
but go up on day four", where the company or day is treated as the unit of a transaction and
the items belong to different transactions in a stock price information database. Furthermore,
intertransaction associations can be expanded to associate multiple properties in the same
rule, so that multidimensional intertransaction associations can be defined and discovered.
Mining intertransaction associations poses more challenges for efficient processing than
mining intratransaction associations, since the number of potential association rules is very
large.

Tung et al. introduced the notion of intertransaction association rules and developed an
efficient algorithm, FITI (First Intra Then Inter), for mining intertransaction associations.
FITI adopts two major ideas: 1) an intertransaction frequent itemset contains only the
frequent itemsets of its corresponding intratransaction counterpart; and 2) a special data
structure is built among intratransaction frequent itemsets for efficient mining of
intertransaction frequent itemsets.

The question of mining association rules among items in a large database of sales transactions
has also been examined. In mining association rules, a database of sales transactions is
provided, and the task is to find all associations among items, that is, cases where the
presence of some items in a transaction implies the presence of other items in the same
transaction. Mining association rules can be reduced to the problem of finding large itemsets,
where a large itemset is a set of items that appears in a sufficient number of transactions.
The problem of discovering large itemsets can be solved by first constructing a candidate set
of itemsets and then identifying, within the candidate set, the itemsets that meet the large
itemset requirement. In general, this is done iteratively for every large k-itemset, in
increasing order of k, where a large k-itemset is a large itemset with k items. Determining
the large itemsets from a huge number of candidate sets in the early iterations is usually the
dominating factor in overall data mining performance. To address this issue, Jong Soo Park et
al. developed an effective hash-based algorithm for candidate set generation, which is
especially good at generating the candidate set of large 2-itemsets. Explicitly, the number of
candidate 2-itemsets generated by the proposed algorithm is orders of magnitude smaller than
that of preceding methods, thereby resolving the performance bottleneck. The generation of
smaller candidate sets enables the transaction database to be trimmed efficiently at a much
earlier stage of the iterations, thereby reducing the computational cost of later iterations
considerably. The new algorithm also offers the chance of decreasing the amount of disk
input/output required.

Association rule mining is a key issue in data mining. However, the classical models ignore
the differences among transactions, and weighted association rule mining does not work on
databases with only binary attributes.

Sun et al. introduced a new measure, w-support, which does not require preassigned weights; it
takes the quality of transactions into consideration using link-based models. A novel problem
of mining general temporal association rules in publication databases has also been
investigated. Essentially, a publication database is a group of transactions where every
transaction T is a set of items, each of which has an individual exhibition period. The
traditional model of association rule mining is unable to handle a publication database owing
to the following fundamental problems: 1) lack of consideration of the exhibition period of
each individual item, and 2) lack of an equitable support counting basis for each item.

To remedy this, Chang-Hung Lee et al. proposed an innovative algorithm, Progressive Partition
Miner (abbreviated as PPM), to discover general temporal association rules in a publication
database. The basic idea of PPM is to first partition the publication database in light of the
exhibition periods of the items and then progressively accumulate the occurrence count of
every candidate 2-itemset based on the intrinsic partitioning characteristics. Algorithm PPM
is designed to employ a filtering threshold in each partition to prune out the progressively
infrequent 2-itemsets. Because the number of candidate 2-itemsets generated by PPM is very
close to the number of frequent 2-itemsets, the scan reduction technique can be employed to
efficiently reduce the number of database scans.

Clearly, the execution time of PPM is orders of magnitude smaller than that of competitive
schemes that are directly extended from existing methods. The correctness of PPM is verified,
and some of its theoretical properties are derived. Sensitivity analysis of several parameters
is conducted to give insight into Algorithm PPM.

A top-down progressive deepening method was developed by Jiawei Han et al. for efficient
mining of multiple-level association rules from large transaction databases based on the
Apriori principle. A group of variant algorithms is suggested based on the sharing of
intermediate results, and their relative performance is tested and analyzed. The enforcement
of different interestingness measures to discover more interesting rules, and the relaxation
of rule conditions for finding "level-crossing" association rules, are also investigated.
Thus, efficient algorithms can be developed for the discovery of strong multiple-level
association rules from large databases.

Association rule mining is a lively data mining research area, but the majority of ARM
algorithms assume a centralized environment. In contrast to previous ARM algorithms,

Ashrafi et al. developed a distributed algorithm, called Optimized Distributed Association
Mining (ODAM), for geographically distributed data sets. ODAM produces support counts of
candidate itemsets faster than other distributed ARM algorithms and decreases the size of the
average transactions, data sets, and message exchanges.

Interesting patterns frequently occur at different levels of support. Classic association
mining based on a uniform minimum support, such as Apriori, either misses interesting patterns
of low support or suffers from the bottleneck of itemset generation caused by a low minimum
support. A better solution lies in exploiting support constraints, which specify the minimum
support required for each itemset, so that only the necessary itemsets are generated. Ke Wang
et al. presented a framework for frequent itemset mining in the presence of support
constraints. Their approach is to "push" the support constraints into the Apriori itemset
generation so that the "best" minimum support is determined for each itemset at runtime,
preserving the essence of Apriori. This approach is called Adaptive Apriori.

Various sequential algorithms have been devised for mining association rules, but little work
has been done on mining association rules in distributed databases. A direct application of
sequential algorithms to distributed databases is not efficient, as it requires a huge amount
of communication overhead. An efficient algorithm called DMA (Distributed Mining of
Association rules) was proposed by Cheung et al. It generates a small number of candidate sets
and requires only O(n) messages for support-count exchange for each candidate set, where n is
the number of sites in the distributed database. The algorithm has been implemented on an
experimental testbed, and its performance has been studied.

This section dealt with association rules, the mining algorithms that use them, and the survey
done by various researchers on these algorithms. Association rule mining has been extensively
studied in data mining, as has sequential pattern mining. Association rule mining assists
businesses in making certain decisions, such as catalogue design, cross marketing and customer
shopping behaviour analysis. Based on this survey, it is found that association rule
algorithms are capable of generating rules with confidence values of less than one; however,
the number of possible association rules for a given dataset is generally very large, and a
high proportion of the rules are usually of little value.

REVIEW OF APRIORI ALGORITHM

As discussed in Chapter 1, the Apriori algorithm follows two steps:

• Finding the frequent itemsets: the sets of items that have minimum support. Note that a
subset of a frequent itemset must also be a frequent itemset.

• Using the frequent itemsets to formulate association rules.

The Apriori Algorithm is designed to operate on databases containing transactions; other
algorithms are designed for finding association rules in data having no transactions or no
timestamps. The purpose of using the Apriori Algorithm is to find the associations between
different sets of data; this is sometimes referred to as "Market Basket Analysis". Each set of
data has a number of items and is called a transaction. The output of Apriori is the set of
rules that reveal how often items are contained in sets of data. Apriori enables association
rules to be found on a larger scale, allowing implication outcomes that consist of more than
one item. Items that frequently occur together can be associated with each other in one
combination; such items form a frequent itemset, and these frequent itemsets form the
association rules. The Apriori Algorithm is widely used in data mining.
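The sketch below illustrates the level-wise idea behind these two steps in R, using invented transactions and an arbitrary threshold: frequent k-itemsets are joined into (k+1)-candidates, and any candidate with an infrequent subset is pruned before its support is ever counted, which is the subset property noted above.

    # Invented transactions echoing the diapers-and-beer example; minsup is arbitrary.
    transactions <- list(
      c("milk", "bread"), c("milk", "diapers", "beer"),
      c("bread", "diapers", "beer"), c("milk", "bread", "diapers", "beer"),
      c("milk", "bread", "diapers")
    )
    minsup <- 0.6
    support <- function(s) mean(sapply(transactions, function(t) all(s %in% t)))

    # Level 1: frequent individual items.
    items <- sort(unique(unlist(transactions)))
    Lk <- lapply(items[sapply(items, support) >= minsup], c)

    while (length(Lk) > 0) {
      cat("Frequent", length(Lk[[1]]), "-itemsets:",
          paste(sapply(Lk, toString), collapse = " | "), "\n")
      # Join step: union pairs of frequent k-itemsets into (k+1)-candidates.
      cands <- list()
      for (i in seq_along(Lk)) for (j in seq_along(Lk)) {
        u <- sort(union(Lk[[i]], Lk[[j]]))
        if (length(u) == length(Lk[[1]]) + 1) cands[[toString(u)]] <- u
      }
      # Prune step (Apriori property): every k-subset of a surviving candidate
      # must itself be frequent, otherwise the candidate is discarded unscanned.
      freq_names <- sapply(Lk, toString)
      keep <- Filter(function(u) {
        all(sapply(combn(u, length(u) - 1, simplify = FALSE),
                   function(s) toString(s) %in% freq_names))
      }, cands)
      # Count support only for the candidates that survived pruning.
      Lk <- Filter(function(u) support(u) >= minsup, keep)
    }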

Guo Yi-ming et al. presented a vertical-format algorithm for mining frequent itemsets. Apriori
is a classic algorithm for association rules, but in order to compute the support of candidate
sets, Apriori needs to scan the database numerous times. The authors proposed a new algorithm
in which frequent itemsets are mined via a vertical data format; the proposed technique needs
to scan the database only once. In the follow-up data mining procedure, new frequent itemsets
are obtained through an AND operation among itemsets. This technique takes less storage space
and can improve the effectiveness of data mining.

Sumithra et al. proposed a distributed Apriori association rule and classical Apriori mining
algorithms for grid-based knowledge discovery. The intention is to gain knowledge by using
Predictive Apriori and distributed grid-based Apriori algorithms for association rule mining.
The authors describe the execution of an association rule discovery data mining task using
grid technologies, and also provide a comparison of the existing Apriori and the distributed
Apriori. Distributed data mining systems make good use of multiple processors and databases to
speed up the execution of data mining and to facilitate data distribution. To evaluate the
efficiency of the described technique, a performance investigation of the Apriori and
Predictive Apriori techniques was carried out on a standard database (using the Weka tool).

The key intention of grid computing is to give organizations and application builders the
ability to create distributed computing environments that use computing resources on demand.
It can therefore help amplify effectiveness and decrease the cost of computing networks by
reducing the time for data processing, optimizing resources, and distributing workloads,
finally permitting users to obtain much faster results on large operations at lower cost.

Mining association rules based on the Apriori Algorithm and its application is given by Pei et
al. Mining association rules is a significant subject in data mining. Aimed at the two
difficulties of discovering frequent itemsets in a large database and mining association rules
from frequent itemsets, the authors analyze a frequent itemset mining algorithm based on the
Apriori Algorithm and an association rule mining algorithm with an enhanced measure system.
The association rule mining technique is enhanced with the help of support, confidence and
interestingness, with the aim of identifying ineffective rules while not losing helpful rules.
Ineffective rules are cancelled, which produces many reasonable association rules, including
rules with negative items. The suggested technique was used to mine association rules from the
2002 student score list of the computer specialization at Inner Mongolia University of Science
and Technology.

Omari et al. developed a new temporal measure for interesting frequent itemset mining.
Frequent itemset mining helps in searching for strongly associated items and transactions in
large transaction databases. The proposed measure is based on the fact that interesting
frequent itemsets are typically covered by several recent transactions; it minimizes the cost
of searching for frequent itemsets by narrowing the search interval. Additionally, this
measure can be used to enhance the search approach implemented by the Apriori Algorithm.

Qiang et al. presented an association classification method based on the compactness of rules.
Associative classification provides high classification accuracy and strong flexibility. At
the same time, associative classification suffers from overfitting, because the classification
rules that satisfy the minimum support and minimum confidence are returned to the classifier
as strong association rules. An innovative association classification technique based on the
compactness of rules is proposed; it extends the Apriori Algorithm to consider the
interestingness, importance and overlapping relationships among rules. Experimental
observation proves that the proposed approach has better classification accuracy in comparison
with CBA and CMAR.

Rui Chang et al. proposed a new optimization algorithm, called APRIORI-IMPROVE, based on the
inadequacies of Apriori. The APRIORI-IMPROVE algorithm offers optimizations on 2-itemset
generation, transaction compression and other steps. APRIORI-IMPROVE uses a hash structure to
generate L2, and uses an efficient horizontal data representation and an optimized storage
strategy to save time and space. The performance study shows that APRIORI-IMPROVE is much
faster than Apriori.

Association rule mining is used to recognize association relationships among large data sets,
and mining frequent patterns is an important part of it. An efficient algorithm named
Apriori-Growth, based on the Apriori Algorithm and the FP-tree structure, is presented by Bo
Wu et al. to mine frequent patterns. The benefit of the Apriori-Growth Algorithm is that it
does not need to generate conditional pattern bases and sub-conditional pattern trees
recursively. Simulation results reveal that the Apriori-Growth Algorithm executes faster than
the Apriori Algorithm and is almost as fast as FP-growth, but requires less memory.

Rough Set Theory and association rule algorithms are mining methods used to recognize implicit
rules in great amounts of data. As an association rule mining algorithm, the Apriori Algorithm
has seen a lot of application owing to its ease of use. On the other hand, in practice it
often encounters problems such as low mining efficiency, many invalid rules and disordered
pattern mining. An algorithm called R_Apriori was created by Chen Chu-xiang et al. for
problems in the decision-making domain. Initially, the condition cores are mined with a Rough
Attribute Reduction Algorithm; 1-frequent itemsets and the corresponding sample set are then
found by the Apriori Algorithm using the mined core set. After this stage, the multi-stage
frequent itemsets and the corresponding support and confidence can be obtained by an
intersection operator on the sample collections. According to the degree of confidence and
support, the corresponding strength of each rule is determined. The R_Apriori Algorithm
resolves the problems of the Apriori Algorithm and recovers the effectiveness of the
algorithm.

The retail industry builds up a large amount of retail sales data. The Apriori algorithm helps
to recognize association rules among commodities and institute cross-selling strategies to
enhance the profits of the retail industry. Based on an analysis of the effectiveness of the
typical Apriori Algorithm, Changsheng Zhang et al. provide a modified method to enhance the
performance of the Apriori Algorithm by reducing the size of the candidate itemset Ck and the
input/output cost. The application of the modified Apriori Algorithm to finding association
rules in commodity sales data is also described, using actual sales data, so that the
practicality of the algorithm is demonstrated.

Association rules are a main technique for data mining, and the Apriori Algorithm is a
standard association rule mining algorithm. Many algorithms for mining association rules and
their variations have been proposed on the basis of the Apriori Algorithm, but the traditional
algorithms are not efficient. For the two bottlenecks of frequent itemset mining, namely the
great multitude of candidate 2-itemsets and the poor effectiveness of counting their support,
Wanjun Yu et al. propose a new algorithm called Reduced Apriori Algorithm with Tag (RAAT),
which reduces redundant pruning operations on C2. If the number of frequent 1-itemsets is n,
the number of connected candidate 2-itemsets is C(n,2), and the number of pruning operations
is also C(n,2). The new algorithm reduces the pruning operations on candidate 2-itemsets,
which saves time and increases effectiveness. For the problem of poor support-counting
effectiveness, RAAT optimizes the subset operation using transaction tags to accelerate
support calculations.
The experimental results show that RAAT outperforms the original algorithm in efficiency.

The efficiency of association rule mining research has been a concern for numerous years. A
high-dimension oriented Apriori algorithm is proposed by Lei Ji et al. Unlike the existing
Apriori improvements, this algorithm takes a new approach to reducing the redundant generation
of sub-itemsets while pruning the candidate itemsets, which can achieve higher mining
efficiency than the original algorithm when the dimension of the data is high. Theoretical
proof and analysis are given for the rationality of this algorithm.

It is a difficult job to mine rare association rules that contain infrequent items with
approaches such as the Apriori Algorithm and frequent pattern growth, because a single minimum
support suffers from being either too low or too high: if the minimum support is set high,
frequent patterns involving rare items will be missed, since rare items fail to satisfy a high
minimum support. Hence, an effort is made to mine rare association rules with multiple minimum
supports. Rawat et al. explore this possibility and propose a multiple-minsup, Apriori-like
approach called Probability Apriori Multiple Minimum Support (PAMMS) to efficiently discover
rare association rules.

An algorithm for mining spatial topology association rules based on Apriori was established by
Gang Fang et al., and is used for mining spatial multilayer transverse association rules from
a spatial database. This algorithm generates candidate frequent topological itemsets by means
of a bottom-up search strategy, as Apriori does, which is appropriate for mining short spatial
topological frequent itemsets. The algorithm compresses a spatial topological relation into a
single digit. By this method, the algorithm can first reduce the storage space needed for the
mining database; second, it is simple to compute topological relations among spatial objects,
that is, the support of candidate itemsets can be computed quickly; finally, it is fast to
join the (k+1)-candidate itemsets from the k-frequent itemsets under the bottom-up search
strategy. The experimental results indicate that the algorithm is capable of extracting
spatial multilayer transverse association rules from a spatial database using an efficient
data store, and it is very good at extracting short frequent topology association rules.

Apriori is one of the most important algorithms used in association rule mining, as discussed
by Dongme Sun et al. The limitations of the Apriori Algorithm are discussed, and an
enhancement is then proposed for improving its effectiveness.

The improved algorithm is based on a mixture of forward scans and reverse scans of a given
database. If certain conditions are met, the improved algorithm can greatly decrease the
number of scans required for the discovery of the candidate itemsets. Theoretical proof and
analysis are given for the rationality of the algorithm, and a simulation example is given to
compare the advantages of this algorithm with those of Apriori.

Data mining of cutting databases is an important technique for increasing efficiency,
discovering hidden knowledge in the cutting database, and giving guidance for cutting
decisions. Guofeng Wang et al. analyze the Apriori Algorithm for association rule mining and
enhance the algorithm based on the features of the cutting database; the enhanced Apriori
Algorithm can be used well in cutting data mining and achieves a better effect than the
traditional algorithm.

To improve the effectiveness of mining multidimensional association rules in relational
databases, Yongge Shi et al. analyzed the Apriori Algorithm and the BUC algorithm in practice.
An enhanced Apriori Algorithm based on multidimensional association rules, the DGP Algorithm,
was then offered, the more efficient of which is to be used in the relational database.
Finally, it was applied to analyze the reasons why users' lines do not reach the standard; it
can effectively improve the speed of data mining and improve the analysis and resolution of
ADSL line quality problems.

The Apriori Algorithm is the most influential algorithm for mining association rules. Its
basic idea is to recognize all the frequent itemsets; association rules are then derived from
the frequent sets, and these rules must satisfy the minimum support threshold and the minimum
confidence threshold.

Libing Wu et al. presented improved algorithms, mostly through the introduction of interest
items and a frequency threshold, to improve mining effectiveness and to make dynamic data
mining easier for users. Based on the customer relationship management system of the ShanHua
Carpet Company, Peng Gong et al. established an enhanced data mining association rule Apriori
Algorithm and applied it to the Shanhua group's cross-selling analysis. The use of the Apriori
Algorithm removes lots of invalid transactions and decreases the number of records for
subsequent scanning, which increases the effectiveness of data mining. At the same time, with
the reduction of the transactions, the size of the database will also decrease; scanning time
is therefore saved, and processing effectiveness is enhanced.

Among data mining algorithms, the Apriori Algorithm is a classical algorithm for association
rules.

Tang Junfang et al. proposed an improved algorithm based on an analysis of the classical
Apriori Algorithm. By compressing the transaction database, the improved algorithm processes
the same number of records by including an attribute named count, and applies count to compute
the support of itemsets, raising efficiency in practical use.

. Gang Fang et al., proposed an algorithm of mining spatial topology association rules with
constraint condition based on Apriori, for mining spatial multilayer transverse association rules
with constraint condition from a big spatial database. This algorithm produces candidate
frequent topological items set up using search strategy related to Apriori. This is appropriate
for mining short spatial topological frequent item sets with constraint condition. This algorithm
reduces the storage structure of spatial topological relation to make an integer.

Through this method, the algorithm can initially reduce some of the storage space of the mining database. Secondly, the algorithm can readily differentiate the topological relation of two spatial objects; that is, it can calculate the support of candidate itemsets quickly. Finally, the algorithm is fast to join k-frequent itemsets into (k+1)-candidate itemsets by the up-search strategy. The experimental results indicate that this algorithm can extract spatial multilayer transverse association rules with constraint conditions from a spatial database by means of efficient data storage, and it is particularly well suited to extracting short frequent topology association rules with constraint conditions. Because of the exponential growth in global information, companies have to cope with an ever-growing amount of digital information, and one of the most important challenges for data mining is quickly and correctly finding the relationships between the data. The Apriori Algorithm is the most widely used technique in association rule mining, but when applying it, the database has to be scanned many times and many candidate itemsets are generated, as noted by Kun-Ming Yu et al. Parallel computing is an efficient strategy for speeding up the mining process, and the Weighted Distributed Parallel Apriori Algorithm (WDPA) is offered as a solution to this problem. In the proposed method, metadata are stored in TID form, so only one scan of the database is required. The TID counts are also considered, so better load balancing as well as minimized idle time for processors can be achieved. The Apriori Algorithm is most powerful in mining association rules; the basic idea of the algorithm is to identify all the frequent itemsets in order to derive association rules. Libing Wu et al. presented an improved Apriori Algorithm based on items of interest, which builds an ordered interest table and intersects it to mine frequent itemsets quickly. The paper also implements the improved algorithm in C# code, and experiments confirm that this algorithm consumes less time than the traditional ones. Finding frequent itemsets is another major field of investigation in data mining. The Apriori Algorithm is the most recognized algorithm for Frequent Itemset Mining (FIM), as discussed by Yanbin Ye et al. Numerous implementations of the Apriori Algorithm have been documented and evaluated. One implementation that catches the attention is Bodon's, which optimizes the data structure with a trie; the results of Bodon's implementation for discovering frequent itemsets show it to be faster than those of Borgelt and Goethals.

Bodon's implementation was reworked into a parallel one in which the input transactions are read by a parallel computer, and the effect of the parallel computer on the modified implementation is presented. Shihai Zhang et al. first provide evaluation indexes of association rules and analyze their meaning; they introduce the Apriori arithmetic for knowledge discovery and analyze its defects. Second, a knowledge discovery method of improved Apriori-based high-rise intelligent form selection is established. Finally, examples of the method are presented. This method provides a new approach to mining knowledge information in engineering cases, guiding structural form selection design, and improving the quality, efficiency and intelligence level of structural design.

Sen Guo et al. presented a mechanism called R_Apriori for learning rules from large datasets. Existing rough-set-based methods are not valid for large data sets owing to their high time and space complexity. Large data sets are separated into numerous parts, in combination with the Apriori Algorithm, and the implied rules are obtained in linear relation to the size of the data set. The experimental results show that this method is better than the existing ones. The Apriori Algorithm is one of the classic and best algorithms for learning association rules. In data mining research, mining association rules is significant and can be used effectively. This section dealt with Apriori mining algorithms and the survey done by various researchers on this algorithm. It is clear from the review that the major shortcoming of association rule data mining is that the support-confidence framework often generates too many rules. Although the Apriori algorithm can identify meaningful itemsets and construct association rules, it suffers from the disadvantage of generating numerous candidate itemsets that must be repeatedly compared with the entire database.

FRAMEWORK

3.1 About R Programming Language

This chapter takes a tour of the R programming language and explores its different and essential concepts. It is designed both to help beginners get started with R and to let experienced users brush up their R programming skills.

R is one of the most widely used programming languages for statistical modeling and has become the lingua franca of Data Science. This chapter provides an introduction to the R programming language with examples, shows how R is transforming the Data Science industry, and surveys the various editors and environments through which R code can be run.

R is a programming language developed by Ross Ihaka and Robert Gentleman in 1993. R possesses an extensive catalog of statistical and graphical methods, including machine learning algorithms, linear regression, time series and statistical inference, to name a few. Most R libraries are written in R itself, but for heavy computational tasks, C, C++ and Fortran code is preferred.

R is trusted not only by academia; many large companies also use the R programming language, including Uber, Google, Airbnb and Facebook.

Data analysis with R is done in a series of steps: programming, transforming, discovering, modeling and communicating the results.

• Program: R is a clear and accessible programming tool


• Transform: R is made up of a collection of libraries designed specifically for data
science
• Discover: Investigate the data, refine your hypothesis and analyze them
• Model: R provides a wide array of tools to capture the right model for your data
• Communicate: Integrate codes, graphs, and outputs to a report with R Markdown or
build Shiny apps to share with the world

3.2 HISTORY OF R

The S language was conceived at Bell Laboratories by John Chambers in 1976, and R was developed as an extension as well as an implementation of S. The R project began in 1992, its first version was released in 1995, and a stable beta version followed in the year 2000. After this brief history, let us move on to the reasons for learning R programming.

Why learn the R programming language?

• With R, you can perform statistical analysis, data analysis as well as machine learning.
We can create objects, functions and packages in it. R is platform-independent and can be
used across multiple operating systems. R is free owing to its open-source GNU licensing
and can be installed by anyone.
• R consists of a robust and aesthetic collection of graphical libraries like ggplot2, plotly and
many more. With these libraries, you can make visually appealing and elegant
visualisations.
• R is widely used across industries. In the past only academia made use of R, but industries now use it as their primary tool for statistical modeling. The most prominent is the Data Science industry, together with the several underlying industries it comprises: health, finance, banking, manufacturing and many more.

Usage of R

• Statistical inference
• Data analysis
• Machine learning algorithm

3.3 FEATURES OF R PROGRAMMING

Now it’s time to discuss the features of R Programming:

• R is a comprehensive programming language that provides support for procedural programming involving functions as well as object-oriented programming with generic functions.
• There are more than 10,000 packages in the repository of R programming. With these packages, one can make use of functions that facilitate easier programming.
• Being an interpreter-based language, R produces machine-independent, portable code. Furthermore, it facilitates easy debugging of errors in the code.
• R facilitates complex operations with vectors, arrays, data frames and other data objects of varying sizes, as the short sketch after this list illustrates.
• R provides robust facilities for data handling and storage.
• As discussed in the above section, R has extensive community support that provides technical assistance, seminars and several boot camps to get you started with R.
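As a small illustrative sketch (the object names here are hypothetical), the following lines show R's vectorised operations on a vector, an array and a data frame:

v <- c(10, 20, 30)                 # a numeric vector
v * 2                              # arithmetic applies element-wise: 20 40 60
m <- matrix(1:6, nrow = 2)         # a 2 x 3 array (matrix)
m + 1                              # element-wise addition on the whole matrix
df <- data.frame(item = c("bread", "butter"), sold = c(7, 3))
df$share <- df$sold / sum(df$sold) # a new column from one vectorised expression
df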
How R is better than Other Technologies

There are certain unique aspects of R programming which makes it better in comparison with
other technologies:

• Graphical Libraries – R stays ahead of the curve with its aesthetic graphical libraries. Libraries like ggplot2 and plotly provide appealing, well-defined plots.
• Availability / Cost – R is completely free to use, which means widespread availability.
• Advancement in Tool – R supports various advanced tools and features that allow you to build robust statistical models.
• Job Scenario – As stated above, R is a primary tool for Data Science. With the immense growth in Data Science and the rise in demand, R has become one of the most in-demand programming languages in the world today.
• Customer Service Support and Community – With R, you can enjoy strong community support.
R Scripts

R is the primary statistical programming language for performing modeling and graphical
tasks. With its extensive support for performing matrix computation, R is used for a variety of
tasks that involve complex datasets.

Users are free to choose their preferred editing tools for interacting with the native console. To perform scripting in R, you can simply import packages and then use the provided functions to achieve results with minimal lines of code.

There are several editors and IDEs that facilitate GUI features for executing R scripts. Some of
the useful editors that support the R programming language are:

• RGui (R Graphical User Interface)
• RStudio – a comprehensive environment for R scripting with more features than RGui
1. R Graphical User Interface (R GUI)

R GUI is the standard GUI platform for working in R. The R Console Window forms an
essential part of the R GUI. In this window, we input various instructions, scripts and several
other important operations. This console window has several tools embedded in it to facilitate
ease of operations. This console appears whenever we access the R GUI.
In the main panel of the R GUI, go to the 'File' menu and select the 'New Script' option. This will create a new script in R.
In order to quit the active R session, you can type the following command after the R prompt '>':

> q()

2. RStudio

RStudio is a comprehensive Integrated Development Environment (IDE) for R. It facilitates extensive code editing and development, as well as various features that make R an easy language to work with.
Features of RStudio
• RStudio provides various tools and features that allow you to boost your code productivity.
• It can also be accessed over the web and is cross-platform in nature.
• It facilitates automatic checking of updates so that you don’t have to check for them
manually.
• It provides support for recovery in case of file loss.
• With RStudio, you can manage the data more efficiently.

Components of RStudio
• Source – In the top left corner of the screen is the text editor that allows you to work
within source scripting. You can enter multiple lines in this source. Furthermore, users can
save the R scripts to files that are stored in local memory.
• Console – This is present in the bottom left corner of the main window of RStudio. It facilitates interactive scripting in R.

• Workspace and History – In the top right corner, you will find the R workspace and the
history window. This will give you the list of all the variables that were created in the
environment session. Furthermore, you can also view the list of past commands that were
executed by R.
Files, Plots, Package, and Help at the bottom right corner gives access to the following tools:

• Files – A user can browse the various files and folders on a computer.
• Plots – We obtain the user plots here.
• Packages – Here, we can view the list of all the installed packages.
• Help – We can browse the built-in help system of R here.

3.4 R IN INDUSTRY

If we break down the use of R by industry, academia comes first: R is a language for doing statistics. R is also a first choice in the healthcare industry, followed by government and consulting.

3.5 R PACKAGES

The primary uses of R are, and will always be, statistics, visualization and machine learning. The picture below shows which R packages got the most questions on Stack Overflow. Among the top 10, most are related to the workflow of a data scientist: data preparation and communicating the results.

Almost 12,000 R libraries are stored on CRAN, a free and open-source repository. You can download and use these numerous libraries to perform Machine Learning or time series analysis.

R has multiple ways to present and share work, either through a Markdown document or a Shiny app. Everything can be hosted on RPubs, GitHub or the business's website.

RStudio accepts Markdown for writing documents. You can export the documents in different formats (a short rendering sketch follows the list below):

• Document:
o HTML
o PDF/LaTeX
o Word
• Presentation:
o HTML
o PDF (Beamer)
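As a sketch of how these exports are produced (the file name report.Rmd is hypothetical), a saved R Markdown document can be rendered to each format from the console with the rmarkdown package:

library(rmarkdown)                                            # assumed installed
render("report.Rmd", output_format = "html_document")         # HTML document
render("report.Rmd", output_format = "pdf_document")          # PDF/LaTeX (needs a LaTeX installation)
render("report.Rmd", output_format = "word_document")         # Word document
render("report.Rmd", output_format = "beamer_presentation")   # PDF Beamer presentation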

RStudio also makes it easy to create an app. The original example here used World Bank data; a minimal sketch of a Shiny app's structure is shown below.
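This sketch uses toy item counts (hypothetical values, not the World Bank data) purely to show the structure:

library(shiny)   # assumed installed
ui <- fluidPage(
  titlePanel("Top items"),                                         # page title
  sliderInput("n", "Items to show:", min = 1, max = 5, value = 3), # user input
  plotOutput("freq")                                               # placeholder for the plot
)
server <- function(input, output) {
  output$freq <- renderPlot({
    counts <- c(bread = 7, milk = 6, papaya = 6, egg = 5, butter = 3)  # toy counts
    barplot(head(sort(counts, decreasing = TRUE), input$n))
  })
}
shinyApp(ui, server)   # launch the app locally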

Why use R?

Data science is shaping the way companies run their businesses. Without a doubt, staying away from Artificial Intelligence and Machine Learning will cause a company to fall behind. The big question is which tool or language you should use.

There are plenty of tools available in the market to perform data analysis, and learning a new language requires some time investment. The picture below depicts the learning curve compared to the business capability a language offers. The negative relationship implies that there is no free lunch: if you want to get the best insight from the data, you need to spend some time learning the appropriate tool, which is R.

On the top left of the graph, you can see Excel and PowerBI. These two tools are simple to learn but don't offer outstanding business capability, especially in terms of modeling. In the middle, you can see Python and SAS. SAS is a dedicated, click-and-run tool for statistical analysis in business, but it is not free. Python is a fantastic tool to deploy Machine Learning and AI but lacks communication features. With an identical learning curve, R is a good trade-off between implementation and data analysis.

When it comes to data visualization (DataViz), you have probably heard about Tableau. Tableau is, without a doubt, a great tool for discovering patterns through graphs and charts, and learning it is not time-consuming. One big problem with data visualization is that you might end up never finding a pattern, or just create plenty of useless charts. Tableau is a good tool for quick visualization of data or for Business Intelligence; when it comes to statistics and decision-making, R is more appropriate.

Stack Overflow is a big community for programming languages. If you have a coding issue or need to understand a model, Stack Overflow is there to help. Over the years, the percentage of question views has increased sharply for R compared to the other languages. This trend is of course highly correlated with the booming age of data science, but it also reflects the demand for the R language in data science.

R is a language and environment for statistical computing and graphics. It is a GNU project similar to the S language and environment, which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered a different implementation of S. There are some important differences, but much code written for S runs unaltered under R.

R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical
tests, time-series analysis, classification, clustering, …) and graphical techniques, and is highly
extensible. The S language is often the vehicle of choice for research in statistical methodology,
and R provides an Open Source route to participation in that activity.

One of R’s strengths is the ease with which well-designed publication-quality plots can be
produced, including mathematical symbols and formulae where needed. Great care has been
taken over the defaults for the minor design choices in graphics, but the user retains full control.

R is available as Free Software under the terms of the Free Software Foundation’s GNU
General Public License in source code form. It compiles and runs on a wide variety of UNIX
platforms and similar systems (including FreeBSD and Linux), Windows and MacOS.

3.6 THE R ENVIRONMENT

R is an integrated suite of software facilities for data manipulation, calculation and graphical
display. It includes an effective data handling and storage facility, a suite of operators for
calculations on arrays, in particular matrices, a large, coherent, integrated collection of
intermediate tools for data analysis, graphical facilities for data analysis and display either on-
screen or on hardcopy, and a well-developed, simple and effective programming language
which includes conditionals, loops, user-defined recursive functions and input and output
facilities.

The term “environment” is intended to characterize it as a fully planned and coherent system,
rather than an incremental accretion of very specific and inflexible tools, as is frequently the
case with other data analysis software.

R, like S, is designed around a true computer language, and it allows users to add additional functionality by defining new functions. Much of the system is itself written in the R dialect of S, which makes it easy for users to follow the algorithmic choices made. For computationally-intensive tasks, C, C++ and Fortran code can be linked and called at run time. Advanced users can write C code to manipulate R objects directly.

Many users think of R as a statistics system. We prefer to think of it as an environment within which statistical techniques are implemented. R can be extended (easily) via packages. There are about eight packages supplied with the R distribution and many more are available through the CRAN family of Internet sites covering a very wide range of modern statistics.

R has its own LaTeX-like documentation format, which is used to supply comprehensive
documentation, both on-line in a number of formats and in hardcopy.

3.7 IMPORTANCE OF R IN BUSINESS

R is a very powerful enterprise-driven programming language which has the following striking
features:

1. R is open-source software!

Yes, R is free! It is licensed under GPL (just as Linux) and you have all the freedom to do
whatever you want to do with R! You can be as creative as possible and make interesting
modifications in it. R is open for integration into other systems too. While working on R
programming language, you can access data whether it is on SAS, SPSS, SQL Server, Oracle
or Excel and also integrate R in various applications and web-servers.

2. R programming is designed for Data Analysis!

R is primarily data analysis software, consisting of a vast collection of algorithms for data retrieval, processing, analysis and high-end statistical graphics. R has built-in universal statistical methods such as mean, median, distributions, covariance, regression, non-linear mixed effects, GLM, GAM and the list just goes on. The functions of the R language can access all areas of the analysis results and combine analytical methods to reach conclusions that are crucial for organizations.

For instance, precise information on the number of people (and their backgrounds) using a
particular mobile handset can be very useful to a mobile company in leveraging its business.

3. R programming is Object-oriented!

Yes, it's true! Compared with other statistical languages, R has strong object-oriented programming facilities, because R is derived from the S programming language. Though R is proficient in developing fully object-oriented programs, its approach to OOP is based on generic functions instead of class hierarchies. R consists of three OOP systems: S3, S4 and R5. These are based on the concepts of classes and methods, so it would be unfair to compare R with typical object-oriented languages like Perl, Python, Ruby and so on.
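As a minimal sketch of the generic-function style (the 'basket' class below is hypothetical), S3 dispatch looks like this:

basket <- function(items) structure(list(items = items), class = "basket")
print.basket <- function(x, ...) {
  cat("Basket with", length(x$items), "items:", paste(x$items, collapse = ", "), "\n")
}
b <- basket(c("bread", "butter"))
print(b)   # the print() generic dispatches to the print.basket method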

4. R is an Interpreted Computer Language!

R is an interpreted computer language allowing branching, looping and modular programming using functions. The R distribution provides functionality for a broad spectrum of statistical procedures such as time series analysis, classical parametric and nonparametric tests, linear and non-linear regression models, clustering, smoothing and so on. Also, advanced R developers can write C code to work on R objects directly.

5. R language produces high-end graphics!

R has a flexible graphical environment offering a wide variety of graphical functions for data presentation, such as bar plots, pie charts, histograms, time series, dot charts, image plots, 3D surfaces, scatter plots, maps and so on. Using R, you can customize your graphics endlessly, develop fresh graphics by combining different graph types, and have great fun!

6. Business Analytics with R provides Advanced Analytics!

You can find various amazing domain-specific suites for R, such as the Rmetrics project for computational finance and Bioconductor for the analysis and comprehension of high-throughput genomic data. Beyond these suites, thousands of add-on packages are available through CRAN (a set-up of ftp and web servers across the globe that store identical, up-to-date versions of the code and documentation for R), and Task Views provide guides to the R functions and packages that are handy for particular methodologies and disciplines.

7. Business Analytics with R has a well-knit community!

The R language has a vast community of around two million people, which is growing exponentially! R is no longer just a programming language but a culture among programmers across the world. Surf the internet and in a fraction of a second you will find several websites, forums, blog posts and articles on the R programming language. For instance, there is Crantastic, a community website for R packages where you can search for, review and tag CRAN packages.

Thus, one thing we have learnt about the R language is that it is limitless in terms of data analysis, with outstanding features that can always be explored further for powerful and flexible data computation. Given how the functionality and popularity of R keep growing, the language is going to stay for long and continue helping organizations in the complicated process of data analysis.

R is an open-source programming language that facilitates statistical computing and graphical
libraries. Being open-source, R enjoys community support of avid developers who work on
releasing new packages, updating R and making it a steadfast programming package for Data
Science.

With the help of R, one can perform various statistical operations. You can obtain it for free from the website www.r-project.org. R is driven by command lines: each command is executed when the user enters it at the prompt.

Since R is open-source, most of its routines and procedures have been developed by programmers all over the world. All the packages are available for free at the R project repository called CRAN, which contains over 10,000 R packages. The basic installation comprises a set of tools that data scientists and statisticians use for multiple tasks.

In R, there is a comprehensive environment that facilitates the performance of statistical operations as well as the generation of data analysis output in graphical or text format. The commands that the console takes as input are evaluated and executed. R cannot handle auto-formatted characters such as typographic dashes or curly quotes, so you need to be careful while copy-pasting commands from external sources into your R environment.

Scripting in R

Let's start scripting in R by creating a script that prints "Hello World!". In R, you have to enclose the text in print() to get the same output as on the command line:

print("Hello World")   # prints the string to the console

3.8 COMPANIES USING R

Some of the companies that are using R programming are as follows:

➢ Facebook
➢ Google
➢ LinkedIn
➢ IBM
➢ Twitter
➢ Uber
➢ Airbnb

Summary

In this chapter, we discussed the R programming language and its basics. R has become a standard name in the world of programming; it is among the most used tools in Data Science, and many users are opting for R owing to its useful advantages and features. Its open-source nature makes R a much better choice for many Data Scientists.

IMPLEMENTATION OF PROPOSED SYSTEM

4.1 ABOUT MARKET BASKET ANALYSIS

Data Mining provides a lot of opportunities in the market sector. Decision making and understanding customer behaviour have become vital and challenging problems for organizations seeking to survive in this competitive environment. The challenge organizations face is extracting information from their vast customer databases in order to gain competitive advantage.

Yanthy et al. state that an important goal in data mining is to reveal hidden knowledge from data, and various algorithms have been proposed for this; the problem is that typically not all generated rules are interesting – only a small fraction would be of interest to any given user. Hence numerous measures such as confidence, support and lift have been proposed to determine the best or most interesting rules. However, some algorithms are good at generating rules that score high in one measure but poorly in another.

Apriori was the first association algorithm proposed, and later developments in association, classification and associative classification algorithms have used Apriori as part of their technique. The Apriori algorithm is a level-wise, breadth-first algorithm which counts transactions and uses prior knowledge of frequent itemset properties. Apriori uses an iterative approach known as a level-wise search, in which frequent n-itemsets are used to explore (n+1)-itemsets. To improve the efficiency of the level-wise generation of frequent itemsets, the Apriori property is used: all non-empty subsets of a frequent itemset must also be frequent. This is made possible by the anti-monotone property of the support measure – the support for an itemset never exceeds the support of its subsets. A two-step process consisting of join and prune actions is performed iteratively.

Apriori is one of the data mining algorithms used to find frequent items/itemsets in a given data repository. The algorithm involves two steps:

a. Joining

b. Pruning

The Apriori property is the important factor to consider before proceeding with the algorithm. The Apriori property states that for any itemsets X and Y, Support(X ∪ Y) ≤ min(Support(X), Support(Y)).
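A small illustrative sketch of the two steps in R follows (plain character vectors, for exposition only; the arules package used later implements Apriori efficiently in compiled code):

# Join: combine pairs of frequent k-itemsets that differ in exactly one item
join_step <- function(freq_k) {
  cands <- list()
  for (i in seq_along(freq_k)) for (j in seq_along(freq_k)) if (i < j) {
    u <- sort(union(freq_k[[i]], freq_k[[j]]))
    if (length(u) == length(freq_k[[i]]) + 1) cands[[length(cands) + 1]] <- u
  }
  unique(cands)
}
# Prune: drop a candidate if any k-subset is not frequent (anti-monotone property)
prune_step <- function(cands, freq_k) {
  keys <- sapply(freq_k, paste, collapse = ",")   # itemsets assumed sorted alphabetically
  Filter(function(cand) all(sapply(seq_along(cand), function(i)
    paste(cand[-i], collapse = ",") %in% keys)), cands)
}
f2 <- list(c("bread", "butter"), c("bread", "milk"), c("butter", "milk"))
prune_step(join_step(f2), f2)   # keeps the candidate {bread, butter, milk}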

Basically, when determining the strength of an association rule, i.e., how strong the relationship between the items in a transaction is, we measure it through support and confidence.

The support of an item is the number of transactions containing the item. Items that do not meet the minimum support are excluded from further processing. Support determines how often a rule is applicable to a given data set.

Confidence is defined as the conditional probability that a transaction containing the LHS will also contain the RHS:

Confidence(LHS → RHS) = P(RHS | LHS) = P(RHS ∩ LHS) / P(LHS) = support(RHS ∩ LHS) / support(LHS)

Confidence determines how frequently items in the RHS appear in transactions that contain the LHS. While determining the rules we must measure these two components, as both are very important: a rule that has very low support may occur simply by chance, while confidence measures the reliability of the inference made by the rule. Han [4, 5] presented a new association rule mining approach that does not use candidate rule generation, called FP-growth, which generates a highly condensed frequent-pattern tree (FP-tree) representation of the transactional database. Each database transaction is represented in the tree by at most one path. The FP-tree is smaller than the original database, and its construction requires two database scans: in the first scan, frequent itemsets along with their support are produced, and in the second scan the FP-tree is constructed.

The mining process is performed by concatenating the patterns with the ones produced from the conditional FP-tree. One constraint of the FP-growth method is that the FP-tree may not fit in memory, especially for dimensionally large databases.

Liu [6] proposed CBA, the first Associative Classification (AC) algorithm. CBA implements the famous Apriori algorithm [3] in order to discover frequent rule items. It consists of three main steps:

a. Continuous attributes in the training data set are discretized.

b. Frequent rule items are discovered.

c. Rules are generated; CBA selects high-confidence rules to represent the classifier.

Finally, to predict a test case, CBA applies the highest-confidence rule whose body matches the test case. Experimental results indicated that CBA derives higher-quality classifiers, with regard to accuracy, than rule induction and decision tree classification approaches. Phani Prasad J and Murlidher Mourya [7] state that there are many case studies on association rules and on the use of existing data mining algorithms for market basket analysis; their paper focuses on the Apriori algorithm and concludes that the algorithm can be modified and extended in future work to decrease its time complexity. The authors also clearly state the demerits of the algorithm but claim that there is a way to improve its efficiency.

4.2 ASSOCIATION RULES IN MARKET BASKET ANALYSIS

Market Basket Analysis is one of the fundamental techniques used by large retailers to uncover
the association between items. In other words, it allows retailers to identify the relationship
between items which are more frequently bought together.


Let’s understand the concept with an example:

Assume we have a data set of 20 customers who visited the grocery store, out of which 11 made a purchase:

Customer 1: Bread, egg, papaya and oat packet


Customer 2: Papaya, bread, oat packet and milk
Customer 3: Egg, bread, and butter
Customer 4: Oat packet, egg, and milk
Customer 5: Milk, bread, and butter
Customer 6: Papaya and milk
Customer 7: Butter, papaya, and bread
Customer 8: Egg and bread
Customer 9: Papaya and oat packet
Customer 10: Milk, papaya, and bread
Customer 11: Egg and milk

Here we observe that 3 customers bought bread and butter together. The outcome of this technique can be understood simply as "if this, then that" (if a customer buys bread, chances are the customer will also buy butter).
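These eleven baskets can be reproduced in R (a sketch using the arules package that the implementation chapter relies on; the object names are illustrative):

library(arules)                       # assumed installed
baskets <- list(
  c("bread", "egg", "papaya", "oat packet"),
  c("papaya", "bread", "oat packet", "milk"),
  c("egg", "bread", "butter"),
  c("oat packet", "egg", "milk"),
  c("milk", "bread", "butter"),
  c("papaya", "milk"),
  c("butter", "papaya", "bread"),
  c("egg", "bread"),
  c("papaya", "oat packet"),
  c("milk", "papaya", "bread"),
  c("egg", "milk")
)
trans <- as(baskets, "transactions")  # coerce the list into an arules transactions object
summary(trans)                        # 11 transactions over 6 distinct items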

In general, the analysis is run on transaction data sets with millions of records to identify associations between items; analyzing such enormous datasets is the main challenge in market basket analysis. The main benefit of conducting market basket analysis is that it uncovers hidden purchasing patterns (which products sell well together), so retailers can run specific campaigns or promotions to cross-sell the items (bundling of two items). You can see day-to-day implementations of market basket analysis in grocery stores, retail outlets and product recommendations on e-commerce sites (Amazon's "customers who bought this product also bought these products").

Key metrics for association rules:

There are 3 key metrics to consider when evaluating association rules:

1. Support: the percentage of orders that contain the item set. In the example above, there are 11 orders in total, and {bread, butter} occurs in 3 of them:

Support = Freq(X, Y) / N

Support = 3/11 = 0.27

2. Confidence: Given two items, X and Y, confidence measures the percentage of times that item Y is purchased, given that item X was purchased. This is expressed as:

Confidence = Freq(X, Y) / Freq(X)

Looking back at the example, the percentage of times that bread (Y) is purchased, given that butter (X) was bought, is:

Confidence (butter → bread) = 3/3 = 1

Confidence values range from 0 to 1, where 0 indicates that Y is never purchased when X is purchased, and 1 indicates that Y is always purchased whenever X is purchased. Note that the confidence measure is directional. This means that we can also compute the percentage of times that butter is purchased, given that bread was purchased:

Confidence (bread → butter) = 3/7 = 0.43

Here we see that all of the orders that contain butter also contain bread. However, does this mean that there is a relationship between these two items, or are they occurring together in the same orders simply by chance? To answer this question, we look at another measure which takes into account the popularity of both items.

3. Lift: Unlike the confidence metric, whose value may vary with direction (e.g., confidence{X → Y} may differ from confidence{Y → X}), lift has no direction. This means that lift{X, Y} is always equal to lift{Y, X}:

lift{X, Y} = lift{Y, X} = support{X, Y} / (support{X} × support{Y})

lift{butter, bread} = lift{bread, butter} = support{butter, bread} / (support{butter} × support{bread})

lift{butter, bread} = lift{bread, butter} = (3/11) / ((3/11) × (7/11))

lift{butter, bread} = lift{bread, butter} = 1.571

In the example above, butter occurred in 27.2% (= 3/11) of the orders and bread occurred in 63.6% (= 7/11) of the orders, so if there were no relationship between them we would expect both to show up together in the same order 17.35% of the time (i.e., 27.2% × 63.6%). The numerator, on the other hand, represents how often butter and bread actually appear together in the same order (27.2%). Dividing the numerator by the denominator tells us how many more times butter and bread appear in the same order than they would if there were no relationship between them (i.e., if they occurred together simply at random).

In summary, lift can take the following values:

Lift = 1; implies no relationship between X and Y (i.e., X and Y occur together only by chance)

Lift > 1; implies that there is a positive relationship between X and Y (i.e., X and Y occur
together more often than random)

Lift < 1; implies that there is a negative relationship between X and Y (i.e., X and Y occur
together less often than random)
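The three worked numbers above can be verified in plain R, reusing the baskets list built earlier (a sketch; the helper functions are illustrative, not part of any package):

# Fraction of the 11 baskets containing all the given items
contains   <- function(items)    sapply(baskets, function(b) all(items %in% b))
support    <- function(items)    mean(contains(items))
confidence <- function(lhs, rhs) support(c(lhs, rhs)) / support(lhs)
lift       <- function(x, y)     support(c(x, y)) / (support(x) * support(y))

support(c("bread", "butter"))   # 3/11 = 0.27
confidence("butter", "bread")   # 3/3  = 1
confidence("bread", "butter")   # 3/7  = 0.43
lift("butter", "bread")         # 11/7 = 1.571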

4.3 APPLICATION OF MARKET BASKET ANALYSIS

Market basket analysis is applied in various fields of the retail sector in order to boost sales and generate revenue by identifying the needs of customers and making purchase suggestions to them.

Cross-Selling: Cross-selling is basically a sales technique in which the seller suggests a related product to a customer after a purchase, influencing the customer to spend more by buying products related to the one already purchased. For instance, if someone buys milk from a store, the seller suggests coffee or tea as well, a complementary product to the one already bought. Market basket analysis helps the retailer understand consumer behaviour and then pursue cross-selling.

Product Placement: This refers to placing complementary goods (pen and paper) and substitute goods (tea and coffee) together so that the customer notices them and buys both together. If a seller places these kinds of goods together, there is a higher probability that a customer will purchase them together. Market basket analysis helps the retailer identify the goods that a customer is likely to purchase together.

Affinity Promotion: Affinity promotion is a method of promotion that designs promotional events based on associated products. In affinity promotion, market basket analysis is also a useful way to prepare and analyze questionnaire data.

Fraud Detection: Market basket analysis is also applied to fraud detection. It may be possible to identify purchase behaviour associated with fraud on the basis of market basket data that contains credit card usage. Hence market basket analysis is also useful in fraud detection.

Customer Behaviour: Market basket analysis helps in understanding customer behaviour under different conditions. It allows the retailer to identify the relationship between products that people tend to buy together, and hence provides insight into customer behaviour towards a product or service.

Hence, market basket analysis helps the retailer gain insight into customer behaviour and understand the relationship between two or more goods, so that they can make purchase suggestions to their customers, who in turn buy more from their stores, earning the retailer greater revenue.

DATA ANALYSIS AND INTERPRETATION

5.1 SAMPLE OF THE CODE

setwd(choose.dir())         # set the working directory (setwd/getwd are lower-case; R is case-sensitive)

getwd()                     # confirm the working directory

install.packages("arules")

install.packages("arulesViz")

library(arules)

library(arulesViz)

read.csv(file.choose())     # preview the raw CSV file

work = read.transactions(file.choose())

inspect(head(work))         # look at the first transactions

trans = read.transactions(file.choose(), format = "basket", sep = ",")

inspect(head(trans))

str(trans)

head(trans)

itemFrequencyPlot(trans, topN = 20, type = "absolute")

rules <- apriori(data = trans, parameter = list(supp = 0.001, conf = 0.08),
                 appearance = list(default = "both"),   # items may appear on either side of a rule
                 control = list(verbose = F))

inspect(rules[1:10])

rules <- sort(rules, decreasing = TRUE, by = "confidence")

inspect(rules[1:10])

plot(rules, method = "graph", interactive = TRUE)

5.2 EXECUTION OF THE CODE

Setting the working directory:

> setwd(choose.dir())

Checking the working directory:

> getwd()

Load the data into RStudio using the following command:

> data <- read.csv(file.choose())

A pop-up window will open; select the appropriate file.

Install the packages arules and arulesViz using the following commands:

> install.packages("arules")

> install.packages("arulesViz")

Now read the data as transactions named trans, giving the format as basket:

> trans = read.transactions("C:/Users/Lenovo/Desktop/project/BasketAnalysis_Groceries-master/groceries.csv", format = "basket")

The items in each row must be split on the separator ",", so the command becomes:

> trans = read.transactions("C:/Users/Lenovo/Desktop/project/BasketAnalysis_Groceries-master/groceries.csv", format = "basket", sep = ",")

To generate a frequency plot of the 20 most frequent items, use the following command:

> itemFrequencyPlot(trans, topN = 20, type = "absolute")
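The rules can then be mined, sorted by confidence and inspected, repeating the commands from the sample code in Section 5.1:

> rules <- apriori(data = trans, parameter = list(supp = 0.001, conf = 0.08), control = list(verbose = F))

> inspect(rules[1:10])

> rules <- sort(rules, decreasing = TRUE, by = "confidence")

> inspect(rules[1:10])

> plot(rules, method = "graph", interactive = TRUE)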

Output of code: (the resulting plots appear as screenshots in the original report)
FINDINGS & SUGGESTIONS

• We can see that there is a strong association with the product "whole milk", which makes it a product needing special attention, both in the customers' eyes and in the shelf optimisation of the grocery store.

❖ To Understand the Buying Pattern of the Products that Comprise the Customers' Basket.

• Market Basket Analysis is used to predict products that a customer might want to purchase based on the other products that are already in the customer's shopping basket.

❖ To Study the Top Most Likely Products Purchased by the Customers.


❖ To Study the Most Likely Products Purchased by the Customers with a Particular
Product Category
❖ Creating Multiple Predictions for Suggesting Products to each Customer

CONCLUSION

The study analyses the pattern of consumer buying behaviour for the products of a grocery store. The software proves useful for retailers to understand the purchasing behaviour of their customers and gives valuable insights into the formation of the basket. It helps in product assortment, refilling stocks of the items likely to be sold, making promotions based on items likely to be sold with a particular category, bundling products, and giving discounts to prompt customers to buy. Increasing the parameters used in the software increases the efficiency of the analysis. Retailers can use the analysis for devising strategies and for giving suggestions to loyal customers. Business Intelligence Development Studio software can be used for analysis based on past sales data using market basket analysis. It can be effectively used for optimizing the patterns associated with the dynamic behaviour of the transactions made by customers while purchasing the products of a lifestyle store. It can help the company improve sales, suggest products to customers, cross-sell, and formulate promotions as per the results obtained from market basket analysis. This data mining tool can be used to improve the strategic placement of products on the shelves. The marketing department of the retailer can also understand customer purchasing behaviour better, so that they can design the web site in such a way that products which tend to be purchased together appear together. In addition, it also helps in advertising and in forming marketing strategies. It can prove profitable and helps in making decisions that add value to the customer shopping experience as well as to the organization.

Limitations:

Market basket analysis can be more accurate when dynamic software such as Alteryx or Archibus is used; the study has used the static R software. The scope of the study is limited to one particular store but can be extended to various stores and areas.

BIBLIOGRAPHY

REFERENCES

1. Rakesh Agrawal, Ramakrishnan Srikant. (1994). Fast Algorithms for Mining Association Rules. Proceedings of the 20th VLDB Conference, Santiago.

2. Robert J. Hilderman, Colin L. Carter, Howard J. Hamilton, and Nick Cercone. (n.d.). Mining Market Basket Data Using Share Measures and Characterized Itemsets.

3. Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth. (1996). From Data Mining to Knowledge Discovery in Databases. American Association for Artificial Intelligence.

4. Julander, C.-R. (1992). Basket Analysis: A New Way of Analyzing Scanner Data. International Journal of Retail and Distribution Management, Volume 20(7), 10-18.

5. Gregory Piatetsky-Shapiro, William Frawley. (1991). Knowledge Discovery in Databases. AAAI/MIT Press.

6. Garry J. Russel, Wagner A. Kamakura. (1997). Modeling Multiple Category Brand Preference with Household Basket Data. Journal of Retailing, Volume 73(4), 439-461.

7. Qiankun Zhao, Sourav S. Bhowmick. (2003). Association Rule Mining: A Survey. Singapore: CAIS, Nanyang Technological University, No. 2003116.

8. Rakesh Agrawal, Tomasz Imielinski, Arun Swami. (1993). Mining Association Rules between Sets of Items in Large Databases. ACM SIGMOD Int'l Conference on Management of Data, pp. 207-216.

9. Andreas Mild, Thomas Reutterer. (2003). An improved collaborative filtering approach for predicting cross-category purchases based on binary market basket data. Journal of Retailing and Consumer Services, Vol. 10, 123-133.

10. https://msdn.microsoft.com/en-us/library/ms167047.aspx

11. https://msdn.microsoft.com/en-us/library/ms173767%28v=sql.105%29.aspx

12. https://msdn.microsoft.com/en-us/library/ms173767(v=sql.105).aspx
