
1. Introduction
Big data, a term coined by Roger Magoulas of O'Reilly Media in 2005 [1], refers to massive data sets with a large, varied, and complex structure that are challenging to store, analyze, and visualize for extracting meaningful results. Big data analytics is the process of examining massive amounts of data to reveal hidden patterns and correlations.
Big data is generated from many sources, such as astronomy, atmospheric science, genomics, biogeochemistry, biological science and research, life sciences, medical records, scientific research, government, natural disaster and resource management, military surveillance, the private sector, financial services, retail, social networks, web logs, text, documents, photography, audio, video, click streams, search indexing, call detail records, POS information, RFID, mobile phones, sensor networks, and telecommunications [2].
2. Overview of Big Data
2.1 Benefits
Big data offers benefits in many fields: better targeted marketing, more direct business insights, client-based segmentation, recognition of sales and market opportunities, automated decision making, definitions of customer behaviour, greater return on investment, quantification of risks and market trends, comprehension of business change, better planning and forecasting, identification of consumer behaviour and production yield extension, predictive analytics on traffic flows, and identification of threats from different video, audio, and data feeds [4].
2.2 Potential of Big Data
The McKinsey Global Institute identified the potential of big data in the following main areas [3].

Healthcare: It has three main pools of big data.
(1) Clinical data: optimal treatment pathways, computerized physician order entry, transparency about medical data, remote patient monitoring, and advanced analytics applied to patient profiles.
(2) Pharmaceutical R&D data: predictive modelling for new drugs, suggesting trial sites with large numbers of potentially eligible patients and strong track records, pharmacovigilance (discovering adverse effects), and developing personalized medicine.
(3) Activity (claims) and cost data: automated systems (e.g., machine learning techniques such as neural networks) for fraud detection and for checking the accuracy and consistency of payors' claims, based on real-world patient outcome data, to arrive at fair economic compensation.
Public sector: The five main categories for big data are:
(1) Creating transparency: making data more accessible.
(2) Enabling experimentation to discover needs, expose variability, and improve performance.
(3) Segmenting populations to customize actions.
(4) Replacing/supporting human decision making with automated algorithms.
(5) Innovating new business models, products, and services with big data.

Retail: Here, the five main categories are:
(1) Marketing: cross-selling, location-based marketing, customer behavioural segmentation, sentiment analysis, and integrated promotion and pricing.
(2) Merchandising: placement and design optimization, price optimization.
(3) Operations: performance transparency, optimization of labour inputs, automated time and attendance tracking, and improved labour scheduling.
(4) Supply chain: stock forecasting by combining multiple datasets such as sales histories, weather predictions, and seasonal sales cycles.
(5) New business models: price comparison services, web-based markets.

Manufacturing:
(1) Research and development and product design.
(2) Product lifecycle management.
(3) Design to value.
(4) Open innovation.
(5) Supply chain.
(6) Production: digital factory, sensor-driven operations.

Personal location data: smart routing, geo-targeted advertising or emergency response, urban planning, and new business models.

Social network analysis: understanding user intelligence for more targeted advertising, marketing campaigns and capacity planning, customer behaviour and buying patterns, and sentiment analytics.

2.3 Challenges and Obstacles
The major hurdles for the implementation of big data analytics are [4]:
(1) Data representation: Data representation aims to make data more meaningful for computer analysis and user interpretation. Many datasets have some level of heterogeneity in type, structure, semantics, organization, granularity, and accessibility.
(2) Redundancy reduction and data compression: These are effective for reducing the indirect cost of the entire system, provided the potential value of the data is not affected.
(3) Data life cycle management: A
data importance principle related to the
analytical value should be developed to
decide which data shall be stored and which
data shall be discarded.
(4) Data confidentiality: The transactional dataset generally includes a set of complete operating data that drives key business processes. Such data contains details at the lowest granularity and some sensitive information, such as credit card numbers.

(5) Energy management: With the
increase of data volume and analytical
demands, the processing, storage, and
transmission of big data will inevitably
consume more and more electric energy.
(6) Expandability and scalability: The analytical system for big data must support present and future datasets. The analytical algorithms must be able to process increasingly large and more complex datasets.
Apart from these, other challenges faced in the implementation of big data analytics are [8]: security concerns, capital/operational expenses, increased network bottlenecks, a shortage of skilled data-science professionals, unmanageable data rates, limited data replication capabilities, lack of compression capabilities, greater network latency and insufficient CPU power, the inability of current database software to support analytics with fast processing times, and the inability to make big data usable for end users.
2.4 Components of Big Data
Big data basically has three main components [6]:
Highly structured: relational database data organized into predefined tables, with a unique property for each row and column.
Semi-structured: web logs, social media feeds, raw feeds directly from a sensor source, email, etc.
Unstructured: video, still images, audio, clicks.
Volume or size of data [9]:
Every day about 2.5 exabytes of data are created.
By 2015 about 8 zettabytes of data had been created, with the volume doubling every two years.
More than 6 billion mobile subscriptions send over 10 billion text messages every day.
Facebook has 955 million monthly active accounts using 70 languages, 140 billion uploaded photos, and 125 billion friend connections; every day 30 billion pieces of content and 2.7 billion likes and comments are posted.
Google monitors 7.2 billion pages per day and processes 20 petabytes of data daily; 571 new websites are created every minute.
Velocity (data in motion): Velocity means both how fast data is being produced and how fast the data must be processed to meet demand. It involves streams of data, structured record creation, and availability for access and delivery. It includes web-site response time, inventory availability analysis, transaction execution, order tracking updates, and product/service delivery.

Variety of sources: data arrives from the highly structured, semi-structured, and unstructured sources described above.
3. Big Data Techniques and Technologies
3.1 Big data techniques

(1) A/B testing [3]
Method: A control group is compared with a variety of test groups in order to determine what changes will improve a given objective variable.
Example: Determining what copy text, layouts, images, or colours will improve conversion rates on an e-commerce web site.

(2) Association rule learning [5]
Method: A set of techniques to discover interesting relationships, i.e. association rules, among variables in large databases.
Example: Market basket analysis, in which a retailer can determine which products are frequently bought together and use this information for marketing.

(3) Data fusion and data integration [6]
Method: A set of techniques that integrate and analyze data from multiple sources in order to develop insights in ways that are more efficient and potentially more accurate than if they were developed by analyzing a single source of data.
Example: Data from social media, analyzed by natural language processing, can be combined with real-time sales data to determine what effect a marketing campaign is having on customer sentiment and purchasing behaviour.

(4) Data mining [7]
Method: A set of techniques to extract patterns from large datasets by combining methods from statistics and machine learning with database management. These techniques include cluster analysis, classification, and regression.
Example: Mining customer data to determine the segments most likely to respond to an offer; mining human-resources data to identify characteristics of the most successful employees.

(5) Natural language processing [8]
Method: Uses computer algorithms to analyze human (natural) language.
Example: Using sentiment analysis on social media to determine how prospective customers are reacting to a branding campaign.

(6) Predictive modeling [9]
Method: A mathematical model is created or chosen to best predict the probability of an outcome.
Example: Estimating the likelihood that a customer can be cross-sold another product.

(7) Spatial analysis [10]
Method: Techniques that analyze the topological, geometric, or geographic properties encoded in a data set.
Example: How is consumer willingness to purchase a product correlated with location? How would a manufacturing supply-chain network perform with sites in different locations?

3.2 Big data technologies

(1) Bigtable [11]
Overview: Bigtable is a distributed storage system for managing structured data at Google. It can reliably scale to petabytes of data and thousands of machines.
Application: More than 60 Google products, such as Google Earth, Google Finance, Google web indexing, Orkut, and Google Analytics.

(2) Cassandra [12]
Overview: Cassandra is a massively scalable open-source NoSQL database. Cassandra has a masterless ring design that is elegant, easy to set up, and easy to maintain. It delivers continuous availability, linear scalability, and operational simplicity across many commodity servers with no single point of failure.
Application: Accenture, eBay, Netflix, GoDaddy, Instagram, Reddit, Yahoo! Japan, NASA.

(3) Hadoop [13]
Overview: Hadoop is an Apache open-source framework written in Java that allows distributed processing of large datasets across clusters of computers using simple programming models. Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage.
Application: Amazon, AOL, Facebook, IBM, New York Times, Yahoo!, Microsoft, Google.

4. Association Rules
An association rule has the form LHS → RHS. The goal of association rule discovery is to find associations among items from a set of transactions, each of which contains a set of items [5]. Generally, the algorithm finds a subset of association rules that satisfy certain constraints:
(1) Minimum support: The support of a rule is defined as the support of the item-set consisting of both the LHS and the RHS. The support of an item-set is the percentage of transactions in the transaction set that contain the item-set. An item-set with a support higher than a given minimum support is called a frequent item-set.
(2) Minimum confidence: This is the minimum value of the ratio of the support of the rule to the support of the LHS.

Most association rule algorithms generate association rules in two steps:
(1) Generate all frequent item-sets.
(2) Construct all rules using these item-sets.

Fig. 4.1 General block diagram of association rules
4.1. Association rule in Big data
It has been experimentally demonstrated that for support levels that generate fewer than 100,000 rules (a very conservative upper bound for humans to sift through, even considering the pruning of uninteresting rules), Apriori finishes on all datasets in less than 1 minute. For support levels that generate fewer than 1,000,000 rules, which is sufficient for prediction purposes where data is loaded into RAM, Apriori finishes processing in less than 10 minutes [14].
4.2. Association rule Algorithms
4.2.1 Apriori Algorithm

The Apriori algorithm finds frequent item-sets from databases by iteration. In each iteration i, the algorithm determines the set of frequent patterns with i items, and this set is used to generate the set of candidate item-sets for the next iteration. The iteration is repeated until no further candidate patterns can be discovered. Apriori uses a bottom-up approach, in which frequent subsets are extended one item at a time. The input datasets are treated as sequences composed of more or fewer items. The output of Apriori is a set of rules explaining the links these items have in their sets [15].
Apriori is an algorithm for finding frequent item-sets using candidate generation. Given a minimum required support S as the interestingness criterion [18]:
(1) Search for all individual elements (1-element item-sets) that have a minimum support of S.
(2) From the results of the previous search for i-element item-sets, search for all (i+1)-element item-sets that have a minimum support of S. This becomes the set of all frequent (i+1)-element item-sets that are interesting.
(3) Repeat step 2 until the item-set size reaches its maximum.
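As an illustration of this level-wise search, the following minimal Python sketch computes all frequent item-sets for a given minimum support. It is not the paper's code; the function name, the example transactions, and the threshold are illustrative.

```python
# Minimal sketch of the level-wise Apriori search described in steps (1)-(3).
# The transactions, names, and threshold below are illustrative, not from the paper.
def apriori_itemsets(transactions, min_support):
    n = len(transactions)
    support = lambda s: sum(1 for t in transactions if s <= t) / n

    # Step (1): frequent 1-element item-sets.
    items = {i for t in transactions for i in t}
    frequent = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}
    all_frequent = {s: support(s) for s in frequent}

    # Steps (2)-(3): grow frequent i-item-sets into (i+1)-item-sets until none remain.
    size = 1
    while frequent:
        size += 1
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == size}
        frequent = {c for c in candidates if support(c) >= min_support}
        all_frequent.update({s: support(s) for s in frequent})
    return all_frequent  # maps each frequent item-set to its support

# Example: frequent item-sets with minimum support S = 0.5.
print(apriori_itemsets([{"A", "B"}, {"A", "C"}, {"A", "B", "C"}, {"B", "C"}], 0.5))
```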
Association rules are of the form A → B, where A → B is different from B → A. A → B implies that if a customer purchases item A, then he also purchases item B. For association rule mining, two threshold values are required:
(1) Minimum support: Support is the percentage of the population that satisfies the rule; in other words, the support for a rule R is the ratio of the number of occurrences of R to all occurrences of all rules. The support of an association pattern is the percentage of task-relevant data transactions for which the pattern is true.

Support(A → B) = (Number of tuples with both A and B) / (Total number of tuples)

(2) Minimum confidence: The confidence of a rule A → B is the ratio of the number of occurrences of B given A to all other occurrences given A. Confidence is defined as the measure of certainty or trustworthiness associated with each discovered pattern A → B.

Confidence(A → B) = (Number of tuples with both A and B) / (Number of tuples with A)

Association rules are generated by the following method:
(1) Use Apriori to generate item-sets of different sizes.
(2) At each iteration, divide each frequent item-set X into two parts, an antecedent (LHS) and a consequent (RHS); this represents a rule of the form LHS → RHS.
(3) Discard all rules whose confidence is less than the minimum confidence.
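A hedged sketch of this rule-generation step is shown below. It reuses the illustrative apriori_itemsets() sketch above and is, again, an illustration rather than the paper's implementation.

```python
# Sketch of steps (1)-(3): derive rules LHS -> RHS from the frequent item-sets
# produced by the apriori_itemsets() sketch above (illustrative code).
from itertools import combinations

def generate_rules(transactions, min_support, min_confidence):
    freq = apriori_itemsets(transactions, min_support)       # item-set -> support
    rules = []
    for itemset, supp in freq.items():
        if len(itemset) < 2:
            continue                                          # a rule needs both sides
        for k in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, k)):
                rhs = itemset - lhs
                # confidence = support(LHS u RHS) / support(LHS); every subset of a
                # frequent item-set is itself frequent, so freq[lhs] is defined.
                confidence = supp / freq[lhs]
                if confidence >= min_confidence:
                    rules.append((set(lhs), set(rhs), confidence))
    return rules
```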

4.2.2 FP-Growth Algorithm
FP-growth generates all frequent item-sets satisfying a given minimum support by growing a frequent-pattern tree structure that stores compressed information about the frequent patterns. In this way, FP-growth can avoid repeated database scans and also avoid the generation of a large number of candidate item-sets. FP-growth takes transactional data in the form of one row for each single complete transaction. Implementations of FP-growth generate only the frequent item-sets, not the association rules [16]. The mining task as well as the database are decomposed using a divide-and-conquer approach, and a pattern-fragment growth method is used to avoid the costly process of candidate generation and testing required by the Apriori algorithm [15].
A frequent-pattern tree is a structure consisting of [17]:
(1) one root labeled as "null",
(2) a set of item-prefix subtrees as the children of the root, and
(3) a frequent-item-header table.
Item-prefix subtrees: Each node in the item-prefix subtree consists of three fields: item-name, count, and node-link.
(1) Item-name: registers which item this node represents.
(2) Count: registers the number of transactions represented by the portion of the path reaching this node.
(3) Node-link: links to the next node in the FP-tree carrying the same item-name, or null if there is none.
Frequent-item-header table: Each entry consists of two fields:
(1) item-name, and
(2) head of node-link (a pointer to the first node in the FP-tree carrying the item-name).

Algorithm for FP-tree construction:
Input: a transaction database DB and a minimum support threshold.
Output: FP-tree, the frequent-pattern tree of DB.
Steps:
(1) Scan the transaction database DB once. Collect F, the set of frequent items, and the support of each frequent item. Sort F in support-descending order.
(2) Create the root of an FP-tree, T, and label it "null". For each transaction Trans in DB, do the following.
(3) If T has a child N corresponding to the next item in Trans, increment N's count by 1; otherwise create a new node N with its count initialized to 1, its parent link pointing to T, and its node-link linked to the nodes with the same item-name via the node-link structure.

Fig. 4.2 FP-tree structure

FP-growth is about an order of magnitude faster than Apriori, especially when the data set is dense (containing many patterns) and/or when the frequent patterns are long.
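The minimal Python sketch below mirrors construction steps (1)-(3); it is a simplified illustration (the header table is kept as a plain dict and the mining phase over conditional pattern bases is omitted), and the dataset at the end is illustrative.

```python
# Minimal sketch of FP-tree construction following steps (1)-(3); the mining
# (conditional pattern base) phase is omitted. Illustrative, simplified code.
from collections import Counter

class FPNode:
    def __init__(self, item, parent):
        self.item, self.count, self.parent = item, 1, parent
        self.children = {}        # item-name -> child FPNode
        self.node_link = None     # next node carrying the same item-name

def build_fp_tree(transactions, min_support_count):
    # Step (1): one scan to find frequent items, sorted in support-descending order.
    counts = Counter(item for t in transactions for item in t)
    frequent = {i: c for i, c in counts.items() if c >= min_support_count}
    order = sorted(frequent, key=frequent.get, reverse=True)

    root = FPNode(None, None)                 # Step (2): root labelled "null"
    header = {i: None for i in order}         # frequent-item-header table
    for t in transactions:                    # Step (3): insert each transaction
        node = root
        for item in [i for i in order if i in t]:
            child = node.children.get(item)
            if child is not None:
                child.count += 1              # shared prefix: increment the count
            else:
                child = FPNode(item, node)
                node.children[item] = child
                child.node_link = header[item]   # chain nodes with the same item-name
                header[item] = child
            node = child
    return root, header

tree, header = build_fp_tree([{"A", "B", "D"}, {"B", "C", "D"}, {"A", "C", "E"}], 2)
```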
4.2.3 Charm

Charm is an algorithm for generating


closed frequent item-sets for association
rules from transactional data. The closed
frequent item-sets are the smallest
representative subset of frequent item-sets
without loss of information. Charm takes
transactional data in the form of one row for
each single complete transaction. [16]
4.2.4 Magnum Opus
The main unique technique used in
Magnum Opus is the search algorithm based
on OPUS, a systematic search method with
pruning. It considers the whole search space,

but during the search, effectively prunes a


large area of search space without missing
search targets provided that the targets can
be measured using certain criteria. [16]
4.3 Real-life applications of association rules

(1) Government sector; researchers at King's College London [19]
Problem statement: Fraud at Consignia, the UK's Post Office group.
Method applied: Use of if-then association rules, e.g. a normal-behaviour rule such as IF time < 1200 AND item = stamps THEN $2 < cost < $4.
Outcome: Detectors that successfully spot abnormal transactions. They also copy themselves, so CIFD adapts itself to create detectors that correspond to the most prevalent patterns of fraud.

(2) [20]
Problem statement: Issues concerning the accessibility of an urban area.
Method applied: Spatial association rule mining applied to geo-referenced U.K. census data of 1991.
Outcome: Helped transportation planning in the area near the local Stepping Hill Hospital.

(3) Health care sector [21]
Problem statement: Anomaly detection and classification in breast cancer.
Method applied: In training, the Apriori algorithm was applied and association rules were extracted. The support was set to 10% and the confidence to 0%.
Outcome: The success rate of the classifier was 69.11%. The time required for training was much less than for a neural network.

(4) Retail sector [22]
Problem statement: Purchasing behaviour of customers.
Method applied: On a dataset of 353,421 records from 1,903 households, about 1,022,812 association rules were generated for promotion sensitivity analysis, i.e. analysis of customer responses to various types of promotions, including advertisements, coupons, and various types of discounts.
Outcome: In a duration of 1.5 hours, about 2.6% of the discovered rules were accepted and the rest rejected; the total was thus reduced to about 14 rules per household from 537 rules per household.

(5) Telecom sector [23]
Problem statement: Which country pairs, triples, or quadruples customers are currently calling.
Method applied: Use of association rules by treating the top-k country item-set as a market basket for each account, and exploiting the temporal nature of the data by using traffic from the last month as a baseline for the current month.
Outcome: Successful in detecting a high rate of fraudulent-call trends associated with adult-entertainment services that move from country to country through time.

(6) Manufacturing sector; VAM Drilling industries, France [24]
Problem statement: Setting up a system that provides results identical to human observation of performance and dysfunctions during forging.
Method applied: Use of RuleGrowth, which mines sequential rules via FP-growth while varying the minimum-support and minimum-confidence parameters.
Outcome: Found the main dysfunction responsible for delays; found that the generator is the cause of exceeding the maximum time in the starting phase; the third major problem was the lack of effectiveness of metal strippers.

5. Big Table [11]
5.1 Data Model
Data is organized into three dimensions: rows, columns, and timestamps. We refer to the storage referenced by a particular row key, column key, and timestamp as a cell. In a web table, we would use URLs as row keys, various aspects of web pages as column names, and store the contents of the web pages in the "contents:" column under the timestamps at which they are fetched. Rows with consecutive keys are grouped into tablets.

Fig. 5.1 Bigtable architecture

Fig. 5.2 Data model for Bigtable
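To make the three-dimensional data model concrete, the toy sketch below represents a cell as a value keyed by (row key, column key, timestamp). It illustrates only the logical model, not Google's implementation; the class and method names are invented for the example.

```python
# Toy illustration of Bigtable's logical data model: a sparse map from
# (row key, column key, timestamp) to an uninterpreted value. Not Google's code.
from collections import defaultdict

class ToyBigtable:
    def __init__(self):
        # row key -> column key -> {timestamp: value}
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, timestamp, value):
        self._rows[row][column][timestamp] = value

    def latest(self, row, column):
        """Return the most recent version of the cell, or None if it is empty."""
        versions = self._rows[row][column]
        return versions[max(versions)] if versions else None

# Web-table example from the text: URLs as row keys, page aspects as columns.
t = ToyBigtable()
t.put("com.cnn.www", "contents:", timestamp=3, value="<html>...</html>")
print(t.latest("com.cnn.www", "contents:"))
```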

5.2 Building Blocks


Big-Table depends on a Google
cluster management system for scheduling
jobs, managing resources on shared
machines, monitoring machine status, and
dealing with machine failures.
The Google SSTable immutable-file
format is used internally to store Big-Table
data files. An SSTable provides a persistent,
ordered immutable map from keys to values,

where both keys and values are arbitrary


byte strings.
Big-Table uses Chubby for a variety
of tasks: to ensure that there is at most one
active master at any time; to store the
bootstrap location of Big-Table data; to
discover tablet servers and finalize tablet
server deaths; and to store Big-Table
schemas. Chubby is a distributed lock
service. A Chubby service consists of five
active replicas, one of which is elected to be
the master and actively serve requests. The
service is live when a majority of the
replicas are running and can communicate
with each other.

5.3 Big Table Implementation


The Big-Table implementation has
three major components: a library that is
linked into every client, one master server,
and many tablet servers.

The master is responsible for
assigning tablets to tablet servers, detecting
the addition and expiration of tablet servers,
balancing tablet-server load, and garbage
collecting files. In addition, it handles
schema changes such as table and column
family creations and deletions. Each tablet
server manages a set of tablets. The tablet
server handles read and write requests to the
tablets that it has loaded, and also splits
tablets that have grown too large. A Bigtable
cluster stores a number of tables. Each table
consists of a set of tablets, and each tablet
contains all of the data associated with a row
range.
5.3.1 Tablet Location
Bigtable uses a three-level hierarchy. The first level is a file stored in Chubby that contains the location of the root tablet. The root tablet contains the locations of all of the tablets of a special METADATA table. Each METADATA tablet contains the locations of a set of user tablets. Secondary information, such as a log of all events pertaining to each tablet (e.g., when a server begins serving it), is also stored in METADATA. This information is helpful for debugging and performance analysis.

5.3.2 Tablet assignment


Each tablet is assigned to at most one
tablet server at a time. The master keeps
track of the set of live tablet servers, and the
current assignment of tablets to tablet
servers, including which tablets are
unassigned. When a tablet is unassigned,
and a tablet server with sufficient room for
the tablet is available, the master assigns the
tablet by sending a tablet load request to the
tablet server. Bigtable uses Chubby to keep track of tablet servers. When a tablet server starts, it creates and acquires an exclusive lock on a uniquely named file in a specific Chubby directory. The master monitors this directory (the servers directory) to discover tablet servers.
The set of existing tablets changes only
when a table is created or deleted, two
existing tablets are merged to form one
larger tablet, or an existing tablet is split into
two smaller tablets. The master is able to
keep track of these changes because it
initiates all but the last. Tablet splits are
treated specially since they are initiated by
tablet servers. A tablet server commits a
split by recording information for the new
tablet in the METADATA table. After
committing the split, the tablet server
notifies the master.
Fig. 5.3 Tablet location

5.3.3 Tablet Serving

The persistent state of a tablet is


stored in GFS. Updates are committed to a
commit log that stores redo records. The
recently committed ones are stored in
memory in a sorted buffer called a

memtable. Older updates are stored in a


sequence of SSTables.
To recover a tablet, a tablet server
reads its metadata from the METADATA
table. This metadata contains the list of
SSTables that comprise a tablet and a set of
redo points, which are pointers into any
commit logs that may contain data for the
tablet. The server reads the indices of the
SSTables into memory and reconstructs the
memtable by applying all of the updates that
have committed since the redo points.
When a write operation arrives at a tablet server, the server checks that it is well-formed (i.e., not sent from a buggy or obsolete client) and that the sender is authorized to perform the mutation. Authorization is performed by reading the list of permitted writers from a Chubby file. A valid mutation is written to the commit log. After the write has been committed, its contents are inserted into the memtable.

5.3.4 Schema Management
Bigtable schemas are stored in Chubby. Chubby is an effective communication substrate for Bigtable schemas because it provides atomic whole-file writes and consistent caching of small files. For example, suppose a client wants to delete some column families from a table. The master performs access-control checks, verifies that the resulting schema is well formed, and then installs the new schema by rewriting the corresponding schema file in Chubby. Whenever tablet servers need to determine what column families exist, they simply read the appropriate schema file from Chubby, which is almost always available in the server's Chubby client cache. Because Chubby caches are consistent, tablet servers are guaranteed to see all changes to that file.
are guaranteed to see all changes to that file.
6. Market Basket Analysis: Implementation and Results

In retail, each customer purchases different sets of products, in different quantities, and at different times. Retailers use this information to:
(1) Gain insight about their merchandise (products): fast and slow movers, products that are purchased together, and products that might benefit from promotion.
(2) Take action: store layouts, and which products to put on special, promote, or offer coupons for.

6.1 Apriori Algorithm
The small database used to test this algorithm is [18]:

S.No.   Item 1      Item 2     Item 3
1       Bread       Butter     Milk
2       Ice-cream   Bread      Butter
3       Bread       Butter     Noodles
4       Bread       Noodles    Ice-cream
5       Butter      Milk       Bread
6       Bread       Noodles    Ice-cream
7       Milk        Butter     Bread
8       Ice-cream   Milk       Bread
9       Butter      Milk       Noodles
10      Noodles     Butter     Ice-cream

Table 6.1 Database for testing the Apriori algorithm

In the given dataset every item occurs three or more times and the total number of transactions is ten, so
Minimum support = 0.3

Item-set     Support
Bread        0.8
Butter       0.7
Noodles      0.5
Ice-cream    0.5
Milk         0.5

Table 6.2 Interestingness of 1-element item-sets

Item-set               Support
{Bread, Butter}        0.5
{Bread, Milk}          0.4
{Bread, Noodles}       0.3
{Bread, Ice-cream}     0.4
{Butter, Milk}         0.4
{Butter, Noodles}      0.3
{Butter, Ice-cream}    0.2
{Noodles, Milk}        0.1
{Noodles, Ice-cream}   0.3
{Milk, Ice-cream}      0.1

Table 6.3 Interestingness of 2-element item-sets

Item-set                       Support
{Bread, Butter, Milk}          0.3
{Bread, Ice-cream, Noodles}    0.2
{Bread, Butter, Noodles}       0.1

Table 6.4 Interestingness of 3-element item-sets

The main advantage of the Apriori algorithm is that each iteration uses only the item-sets from the previous iteration, not the whole data.

Rule                          Confidence (%)
{Bread} → {Butter, Milk}      37
{Butter} → {Bread, Milk}      42
{Milk} → {Bread, Butter}      60
{Bread, Butter} → {Milk}      60
{Bread, Milk} → {Butter}      75
{Butter, Milk} → {Bread}      75

Table 6.5 Rules based on the Apriori algorithm

If the minimum confidence threshold is 70 percent and the minimum support is 30 percent, the discovered rules are
{Bread, Milk} → {Butter}
{Butter, Milk} → {Bread}
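As a sanity check, the supports and confidences above can be re-computed from the transactions of Table 6.1 with the apriori_itemsets() and generate_rules() sketches from Section 4.2.1 (again an illustration, not the paper's code).

```python
# Re-computing the worked example of Tables 6.1-6.5 with the sketches from
# Section 4.2.1 (illustrative code, not from the paper).
transactions = [
    {"Bread", "Butter", "Milk"},     {"Ice-cream", "Bread", "Butter"},
    {"Bread", "Butter", "Noodles"},  {"Bread", "Noodles", "Ice-cream"},
    {"Butter", "Milk", "Bread"},     {"Bread", "Noodles", "Ice-cream"},
    {"Milk", "Butter", "Bread"},     {"Ice-cream", "Milk", "Bread"},
    {"Butter", "Milk", "Noodles"},   {"Noodles", "Butter", "Ice-cream"},
]
itemsets = apriori_itemsets(transactions, min_support=0.3)
print(itemsets[frozenset({"Bread", "Butter", "Milk"})])   # 0.3, as in Table 6.4
for lhs, rhs, conf in generate_rules(transactions, 0.3, 0.7):
    # prints every rule meeting 70% confidence, including
    # {'Bread','Milk'} -> {'Butter'} and {'Butter','Milk'} -> {'Bread'} at 0.75
    print(lhs, "->", rhs, conf)
```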

The algorithm was then run on a bakery database. The database consisted of 50 different items and 75,000 receipts. The minimum support was found to be 0.04. The items were named with letters of the English alphabet.
Item-set      Support
{A, AU}       0.0440
{D, S}        0.0434
{D, AJ}       0.0430
{E, J}        0.0431
{F, W}        0.0439
{Q, AG}       0.0435
{S, AH}       0.0531
{AB, AC}      0.0509
{AH, AQ}      0.0431

Table 6.6 Interestingness of 2-element item-sets

Item-set      Support
{D, S, AJ}    0.0411

Table 6.7 Interestingness of 3-element item-sets

6.2 FP-Growth Algorithm
This algorithm was implemented on three datasets: the previous two and a new one. The smallest dataset used was [19]:

S.No.   Item 1   Item 2   Item 3   Item 4
1       A        B        D
2       B        C        D
3       A        C        E
4       A        D        C
5       A        B

Table 6.8 Small dataset for the FP-growth algorithm

Fig. 6.1 FP-tree construction

Table 6.9 Items arranged in ascending order of frequency

Table 6.10 Conditional pattern base and conditional FP-tree generation

Frequent pattern        Support count
2-element item-sets:
{E, A}                  2
{E, D}                  2
{B, A}                  2
{C, A}                  2
{C, B}                  2
{D, A}                  2
{D, C}                  2
3-element item-set:
{E, A, D}               2

Table 6.11 Item-sets generated

Similarly, the FP-growth algorithm was implemented on the other two databases, and results identical to those given by the Apriori algorithm were obtained.
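For readers who want to reproduce this comparison, the sketch below runs an off-the-shelf FP-growth implementation on the small dataset of Table 6.8. It assumes the third-party mlxtend and pandas packages are installed; the transactions follow the Table 6.8 reconstruction above, and the 0.4 minimum support (a count of 2 over 5 transactions) is inferred from Table 6.11.

```python
# Hedged sketch: re-running the Table 6.8 experiment with mlxtend's FP-growth.
# Assumes `pip install mlxtend pandas`; results should match the Apriori run.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

transactions = [["A", "B", "D"], ["B", "C", "D"], ["A", "C", "E"],
                ["A", "D", "C"], ["A", "B"]]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)
# A support count of 2 over 5 transactions corresponds to min_support = 0.4.
print(fpgrowth(onehot, min_support=0.4, use_colnames=True))
```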
REFERENCES:
1. G. Halevi, H. Moed, "The evolution of big data as a research and scientific topic: Overview of the literature," Res. Trends (2012) 36.
2. http://en.wikipedia.org/wiki/Big_data
3. J. Manyika, M. Chui, B. Brown, J. Bughin, R. Dobbs,
C. Roxburgh and A.H. Byers, "Big data: The next
frontier for innovation, competition, and
productivity", McKinsey Global Institute, 2011.
4. Chen, Min, Shiwen Mao, and Yunhao Liu. "Big
data: A survey." Mobile Networks and Applications
19.2 (2014): 171-209
5. Agrawal, Rakesh, Tomasz Imieliński, and Arun Swami. "Mining association rules between sets of items in large databases." In ACM SIGMOD Record, vol. 22, no. 2, pp. 207-216. ACM, 1993.
6. Lohr, Steve. "The age of big data." New York Times
11 (2012).
7. Rygielski, Chris, Jyun-Cheng Wang, and David C.
Yen. "Data mining techniques for customer
relationship management." Technology in society 24,
no. 4 (2002): 483-502.
8. Hennig-Thurau, Thorsten, Edward C. Malthouse,
Christian Friege, Sonja Gensler, Lara Lobschat, Arvind
Rangaswamy, and Bernd Skiera. "The impact of new
media on customer relationships." Journal of service
research 13, no. 3 (2010): 311-330.
9. Kamakura, Wagner A., Michel Wedel, Fernando
De Rosa, and Jose Afonso Mazzon. "Cross-selling
through database marketing: A mixed data factor
analyzer for data augmentation and prediction."

International Journal of Research in marketing 20,


no. 1 (2003): 45-65.
10. Meixell, Mary J., and Vidyaranya B. Gargeya. "Global supply chain design: A literature review and critique." Transportation Research Part E: Logistics and Transportation Review 41, no. 6 (2005): 531-550.
11. Chang, Fay, Jeffrey Dean, Sanjay Ghemawat,
Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows,
Tushar Chandra, Andrew Fikes, and Robert E.
Gruber. "Bigtable: A distributed storage system for
structured data." ACM Transactions on Computer
Systems (TOCS) 26, no. 2 (2008): 4.
12. Apache Cassandra 2.1 Documentation, October 27, 2015.
13. Shvachko, Konstantin, Hairong Kuang, Sanjay
Radia, and Robert Chansler. "The hadoop distributed
file system." In Mass Storage Systems and
Technologies (MSST), 2010 IEEE 26th Symposium on,
pp. 1-10. IEEE, 2010.
14. Liu, B., Hsu, W., & Ma, Y. (1999, August). Pruning
and summarizing the discovered associations. In
Proceedings of the fifth ACM SIGKDD international
conference on Knowledge discovery and data mining
(pp. 125-134). ACM.
15. Kamsu-Foguem, Bernard, Fabien Rigal, and Félix Mauget. "Mining association rules for the quality improvement of the production process." Expert Systems with Applications 40.4 (2013): 1034-1045.
16. Zheng, Z., Kohavi, R., & Mason, L. (2001, August).
Real world performance of association rule
algorithms. In Proceedings of the seventh ACM
SIGKDD international conference on Knowledge
discovery and data mining (pp. 401-406). ACM.
17. Han, Jiawei, Jian Pei, Yiwen Yin, and Runying Mao. "Mining frequent patterns without candidate generation: A frequent-pattern tree approach." Data Mining and Knowledge Discovery 8, no. 1 (2004): 53-87.

18. Dongre, Jugendra, Gend Lal Prajapati, and S. V. Tokekar. "The role of Apriori algorithm for finding the association rules in data mining." In Issues and Challenges in Intelligent Computing Techniques (ICICT), 2014 International Conference on, pp. 657-660. IEEE, 2014.
19. Weatherford, M. (2002). Mining for fraud.
Intelligent Systems, IEEE, 17(4), 4-6.
20. Appice A, Ceci M, Lanza A, et al. Discovery of spatial association rules in geo-referenced census data: a relational mining approach. Intell Data Anal 2003; 7: 541-566.
21. M.-L. Antonie, O. R. Zaïane, and A. Coman. Application of data mining techniques for medical image classification. In Second International ACM SIGKDD Workshop on Multimedia Data Mining, pages 94-101, San Francisco, USA, August 2001.
22. Adomavicius, G., & Tuzhilin, A. (2001). Expert-driven validation of rule-based user models in personalization applications. Data Mining and Knowledge Discovery, 5(1-2), 33-58.
23. Cortes, C., & Pregibon, D. (2001). Signature-based methods for data streams. Data Mining and Knowledge Discovery, 5(3), 167-182.
24. Kamsu-Foguem, Bernard, Fabien Rigal, and Félix Mauget. "Mining association rules for the quality improvement of the production process." Expert Systems with Applications 40.4 (2013): 1034-1045.
