
Chapter 1

INTRODUCTION

Data mining is one of the most important technologies of this generation, as it helps to extract useful information from data warehouses. A data warehouse is a storage repository that can hold data accumulated over many years. Big Data is booming in the industry, with data growing in volume, velocity, and variety, and there is a need to draw conclusions from this data and to identify which parts of it matter for future use. Mining therefore plays a prominent role in discovering the itemsets that occur frequently. Mining data spread across different repositories, however, is a difficult task. Parallel frequent itemset mining techniques have been introduced to address this problem [1]. These techniques focus on load balancing, but retrieving the data within a bounded time remains difficult, because the data is spread across different locations by the load-balancing scheme and there is no clear notion of placing similar data at the same location.

To address this problem, FiDoop Data Partitioning (FiDoop-DP) was introduced. Built on the MapReduce programming model, FiDoop-DP speeds up parallel frequent itemset mining algorithms on Hadoop clusters [2]. It exploits the correlation among transactions to partition a huge dataset across the data nodes of a Hadoop cluster. If the mining process is aware of where data resides in the cluster, the data can be mined more easily. However, FiDoop-DP has no prior knowledge of how data is placed across the different nodes of a heterogeneous Hadoop cluster. To overcome this, a scheme is implemented that applies the data-aware FiDoop-DP technique to heterogeneous clusters at the level of Hadoop Distributed File System (HDFS) data placement.

1.1 Big Data Characteristics: HACE Theorem

HACE Theorem: Big Data starts with large-volume, heterogeneous, autonomous sources with distributed and decentralized control, and seeks to explore complex and evolving relationships among data [3]. These characteristics make it an extreme challenge to discover useful knowledge from Big Data. In a naïve sense, imagine a number of blind men trying to size up a giant elephant, shown in Figure 1.1, which represents Big Data in this context. The goal of each blind man is to draw a picture (or conclusion) of the elephant from the part of the information he collects during the process. Because each person's view is limited to his local region, it is not surprising that each blind man independently concludes that the elephant "feels" like a rope, a hose, or a wall, depending on the region he is limited to.

To make the problem even more complicated, assume that (a) the elephant is growing
rapidly and its pose also changes constantly, and (b) the blind men also learn from each other
while exchanging information on their respective feelings on the elephant. Exploring the Big
Data in this scenario is equivalent to aggregating heterogeneous information from different
sources (blind men) to help draw the best possible picture revealing the genuine gesture of the elephant in real time. Indeed, this task is not as simple as asking each blind man to describe his feelings about the elephant and then getting an expert to draw one single picture with a combined view, considering that each individual may speak a different language (heterogeneous and diverse information sources) and may even have privacy concerns about the messages they deliver in the information exchange process.

Figure 1.1: Big Data Illusions

1.2 Huge Data with Heterogeneous and Diverse Dimensionality

One of the fundamental characteristics of the Big Data is the huge volume of data
represented by heterogeneous and diverse dimensionalities. This is because different information
collectors use their own scheme for data recording, and the nature of different applications also
results in diverse representations of the data. For the X-ray examination and CT scan of an individual, images or videos are used to represent the results because they provide visual information for doctors to carry out detailed examinations.

For a DNA or genomic related test, micro-array expression images and sequences are
used to represent the genetic code information because this is the way that current techniques
acquire the data. Under such circumstances, the heterogeneous features refer to the different
types of representations for the same individuals, and the diverse features refer to the variety of
features involved to represent each single observation. Given that different organizations (or health practitioners) may have their own schemata to represent each patient, data heterogeneity and diverse dimensionality become major challenges when data from all of these sources has to be aggregated and combined.

1.3 Autonomous Sources with Distributed and Decentralized Control

Autonomous data sources with distributed and decentralized controls are a main
characteristic of Big Data applications. Being autonomous, each data source is able to generate
and collect information without involving (or relying on) any centralized control. This is similar
to the World Wide Web (WWW) setting where each web server provides a certain amount of
information and each server is able to fully function without necessarily relying on other servers.
On the other hand, the enormous volumes of the data also make an application vulnerable to
attacks or malfunctions, if the whole system has to rely on any centralized control unit.

For major Big Data related applications, such as Google, Flickr, Facebook, and Walmart, a large number of server farms are deployed all over the world to ensure non-stop services and quick responses for local markets. Such autonomous sources are not only the result of technical design choices, but also of legislation and regulation in different countries and regions. More specifically, local government regulations also affect the wholesale management process and eventually shape the data representations and data warehouses used for local markets.

1.4 Complex and Evolving Relationships

As the volume of Big Data increases, so do the complexity of the data and the relationships among them. In the early stages of centralized information systems, the focus was on finding the best feature values to represent each observation. This is similar to using a number of data fields, such as age, gender, income, education background, etc., to characterize each individual. This type of sample-feature representation inherently treats each individual as an independent entity, without considering their social connections, which are among the most important factors of human society.

People form friend circles based on common hobbies or biological relationships. Such social connections commonly exist not only in daily activities but are also very popular in virtual worlds. For example, major social network sites, such as Facebook or Twitter, are mainly characterized by social functions such as friend connections and followers (in Twitter). The correlations between individuals inherently complicate the whole data representation and any reasoning process. In the sample-feature representation, individuals are regarded as similar if they share similar feature values, whereas in the sample-feature-relationship representation, two individuals can be linked together even though they might share nothing in common in the feature domains. In a dynamic world, the features used to represent individuals and the social ties used to represent their connections may also evolve with respect to temporal, spatial, and other factors. Such complications are becoming part of the reality for Big Data applications, where the key is to take the complex (i.e., non-linear, many-to-many) data relationships, along with their evolving changes, into consideration in order to discover useful patterns from Big Data collections.

1.5 Motivation and Purpose

As heterogeneous systems collect huge amounts of data from different data sources, there is a need for computation or mining over multiple data sources for knowledge discovery. Data integration approaches are used to overcome the heterogeneity problem. However, perfect integration of heterogeneous data sources is a very challenging problem, and it is impractical to move one whole database to another site. Data mining techniques therefore help to mine the data; since the amount of data is very large it is considered Big Data, and mining such heterogeneous data using Hadoop helps to gain knowledge from it.

The main motivation for using Equivalence Class clustering and bottom-up Lattice Traversal (ECLAT) on a Hadoop cluster is to calculate support counts as fast as possible compared to other techniques. ECLAT uses a vertical database layout that stores every item together with the identifiers of the transactions containing it, instead of explicitly listing all transactions.

1.6 The Problem Statement

There are various data mining techniques for assessing and gaining knowledge from different data sources. These techniques need to cope with different data at different sources and still help to gain knowledge. Due to the rapid growth of data, it has become difficult to mine large amounts of data within a stipulated time.

Data integration techniques alone cannot extract knowledge from different sources of data. Because data is added and updated every second, it is hard to process such large amounts of data in a short time. Therefore, combining data partitioning with a Hadoop-based data mining technique helps to mine the data even when it is growing fast.

Using the ECLAT scheme in heterogeneous systems helps to mine the data faster. It requires less space than the Apriori algorithm and also takes less time than Apriori to generate frequent patterns. Calculating the support values and gaining knowledge is therefore simpler and faster than with many other techniques.

1.7 Organization of Dissertation

The technical aspects, system requirements, and organization of the dissertation report are discussed as follows. The report mainly consists of 7 chapters and references.

Chapter 2: Explains different data mining techniques that help to mine Big Data, describes Hadoop server roles, and gives information about related work.

Chapter 3: Describes the idea behind the proposed work and gives information about the proposed implementation, namely the use of Hadoop for data partitioning in heterogeneous systems.

Chapter 4: Shows the system design through UML diagrams in the form of class diagram,
component diagram, use case diagram, activity diagram.

Chapter 5: Explains the implementation of the ECLAT scheme through steps that show the installation and working procedure of MapReduce.

Chapter 6: Presents the results and the analysis of a sample dataset, including the procedure for calculating support values. It also illustrates the extraction of the most frequent itemsets within a small span of time.

Chapter 7: Concludes the work on the ECLAT algorithm and discusses enhancements for future work.

***********

Chapter 2

LITERATURE SURVEY

In Hadoop clusters, data partitioning schemes are used to divide the input data into equal parts, but these schemes should also partition the intermediate results. Because they do not, data locality and data balance are completely ignored.

2.1 Data Mining Techniques

Several core techniques used in data mining describe the type of mining and data recovery operation. Unfortunately, different companies and solutions do not always use the same terms, which can add to the confusion and apparent complexity. The following subsections describe some key techniques and examples of how different tools are used in data mining.

2.1.1 Association Rule Learning

In data mining, association rules are useful for analyzing and predicting customer
behavior. They play an important part in shopping basket data analysis, product clustering, and
catalog design and store layout.

Association rule learning is a rule-based machine learning method for discovering interesting relations between variables in large databases [4]. It is intended to identify strong rules discovered in databases using some measures of interestingness. For example, Figure 2.1 illustrates an association rule from market basket analysis: the rule {onions, potatoes} => {hamburger meat} found in the sales data of a supermarket would indicate that if a customer buys onions and potatoes together, they are likely to also buy hamburger meat. Such information can be used as the basis for decisions about marketing activities such as promotional pricing or product placement. In addition to the above example from market basket analysis, association rules are employed today in many application areas including Web usage mining, intrusion detection, continuous production, and bioinformatics [5].
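To make the rule measures concrete, the following minimal Python sketch (the transaction data is invented purely for illustration) computes the support and confidence of the rule {onions, potatoes} => {hamburger meat} over a toy transaction list.

# Minimal sketch: scoring one association rule over a toy transaction list.
# The transactions below are invented for illustration only.
transactions = [
    {"onions", "potatoes", "hamburger meat"},
    {"onions", "potatoes", "beer"},
    {"milk", "bread"},
    {"onions", "potatoes", "hamburger meat", "bread"},
]

antecedent = {"onions", "potatoes"}
consequent = {"hamburger meat"}

def support(itemset):
    # Fraction of transactions that contain every item of the itemset.
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

rule_support = support(antecedent | consequent)      # support(X U Y) = 0.5
confidence = rule_support / support(antecedent)      # 0.5 / 0.75, about 0.67
print(rule_support, confidence)

A rule is usually reported only when both values exceed user-defined thresholds (minimum support and minimum confidence).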

Figure 2.1: Association Technique

2.1.2 Classification

There are two forms of data analysis that can be used for extracting models describing
important classes or to predict future data trends. These two forms are as follows

 Classification
 Prediction
Classification models predict categorical class labels, while prediction models predict continuous-valued functions. For example, one can build a classification model to categorize bank loan applications as either safe or risky, or a prediction model to predict the expenditures in dollars of potential customers on computer equipment given their income and occupation.

2.1.3 Prediction

Prediction is a wide topic and runs from predicting the failure of components or
machinery, to identifying fraud and even the prediction of company profits. Used in combination
with the other data mining techniques, prediction involves analyzing trends, classification,
pattern matching, and relations [5]. By analyzing past events or instances, one can make a prediction about a future event. In credit card authorization, for example, decision tree analysis of individual past transactions can be combined with classification and historical pattern matching to identify whether a transaction is fraudulent. If a match is found between the purchase of flights to the US and transactions in the US, it is likely that the transaction is valid.
2.1.4 Clustering

By examining one or more attributes or classes, one can group individual pieces of data together to form a structured opinion. At a simple level, clustering uses one or more attributes as the basis for identifying a cluster of correlated results. Clustering is useful for identifying information that stands out because of how it correlates with other examples. Records with similar values fall into the same range and form a group, which makes differentiation easy.

Clustering can also work both ways: one can assume that there is a cluster at a certain point and then work to identify it. The graph in Figure 2.2 illustrates the clustering technique [5]. In this example, a sample of sales data compares the age of the customer to the size of the sale. It is not unreasonable to expect that people in their twenties (before marriage and kids), fifties, and sixties (when the children have left home) have more disposable income.

Figure 2.2: Clustering Technique

In the example, two clusters can be identified, one around the US$2,000/20-30 age group, and another at the US$7,000-8,000/50-65 age group. In this case, the hypothesis was both formed and confirmed with a simple graph that can be created using any suitable graphing software for a quick manual view. More complex determinations require a full analytical package, especially to automatically base decisions on nearest-neighbor information.

Plotting clusters in this way is a simplified example of so-called nearest-neighbor identification. It helps to identify individual customers by their literal proximity to each other on the graph. It is highly likely that customers in the same cluster also share other attributes, and one can use that expectation to help drive, classify, and otherwise analyze other people in the data set.

Clustering can also be applied from the opposite perspective: given certain input attributes, identify different artifacts. For example, a study of 4-digit PINs found clusters in which the first and second digit pairs fell in the ranges 1-12 and 1-31. By plotting these pairs, clusters relating to dates (birthdays, anniversaries) become easy to identify.

2.1.5 Sequential Patterns

Often used over longer-term data, sequential patterns are a useful method for identifying
trends or regular occurrences of similar events [5]. For example, with customer data one can identify that customers buy a particular collection of products together at different times of the year. In a shopping basket application, this information can be used to automatically suggest that certain items be added to a basket based on their frequency and past purchasing history.

2.1.6 Decision Trees

A decision tree is a structure that includes a root node, branches, and leaf nodes. Each
internal node denotes a test on an attribute, each branch denotes the outcome of a test, and each
leaf node holds a class label [5]. The topmost node in the tree is the root node.

Figure 2.3 shows a decision tree for the concept of buying a computer; it indicates whether a customer of a company is likely to buy a computer or not. Each internal node represents a test on an attribute, and each leaf node represents a class. The tree classifies a person's likelihood of buying a computer based on attributes such as age.

Figure 2.3: Decision tree

2.1.7 Frequent Pattern Mining

Frequent Itemset Mining (FIM) is one of the most critical and time-consuming tasks in association rule mining (ARM). ARM, an often-used data mining task, provides a strategic resource for decision support by extracting the most important frequent patterns that occur simultaneously in a large transaction database. A typical application of ARM is the famous market basket analysis. In FIM, support is a measure defined by users: an itemset X has support s if s% of the transactions contain the itemset. Denoting s = support(X), the support of the rule X => Y is support(X ∪ Y), where X and Y are two itemsets and X ∩ Y = ∅. The purpose of FIM is to identify all frequent itemsets whose support is greater than the minimum support. ARM thus consists of two phases, frequent itemset generation followed by rule generation; the first phase is more challenging and complicated than the second, and most prior studies focus primarily on discovering frequent itemsets.
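To make the support definition concrete, the sketch below enumerates candidate itemsets of a toy transaction database by brute force and keeps those whose support meets a minimum threshold. The database and the threshold are invented for illustration; real FIM algorithms such as Apriori, FP-growth, or ECLAT avoid this exhaustive enumeration.

from itertools import combinations

# Toy transaction database; contents are invented for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
]
MIN_SUPPORT = 0.5        # itemset must appear in at least 50% of transactions

items = sorted(set().union(*transactions))

def support(itemset):
    # Fraction of transactions that contain every item of the itemset.
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

# Brute-force enumeration of all candidate itemsets (fine only for a toy example).
frequent = {}
for k in range(1, len(items) + 1):
    for candidate in combinations(items, k):
        s = support(set(candidate))
        if s >= MIN_SUPPORT:
            frequent[candidate] = s

for itemset, s in sorted(frequent.items()):
    print(itemset, s)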

2.2 Hadoop Cluster

A Hadoop cluster is a special type of computational cluster designed specifically for storing and analyzing huge amounts of unstructured data in a distributed computing environment. Such clusters run Hadoop's open source distributed processing software on low-cost commodity computers. Typically, one machine in the cluster is designated as the Name Node and another machine as the Job Tracker; these are the masters. The rest of the machines in the cluster act as both Data Node and Task Tracker; these are the slaves. Hadoop clusters are often referred to as "shared nothing" systems because the only thing that is shared between the nodes is the network that connects them.

Hadoop Clusters are known for boosting the speed of data analysis applications. They
also are highly scalable: If a cluster's processing power is overwhelmed by growing volumes
of data, additional cluster nodes can be added to increase throughput. Hadoop clusters also are
highly resistant to failure because each piece of data is copied onto other cluster nodes, which
ensures that the data is not lost if one node fails.

Figure 2.4: Hadoop Server Roles

Figure 2.4 shows the Hadoop server roles. There are three major categories of machine roles in a Hadoop deployment: Client machines, Master nodes, and Slave nodes. The Master nodes oversee the two key functional pieces that make up Hadoop: storing lots of data (HDFS) and running parallel computations on all that data (MapReduce). The Name Node oversees and coordinates the data storage function (HDFS), while the Job Tracker oversees and coordinates the parallel processing of data using MapReduce. Slave nodes make up the vast majority of machines and do all the dirty work of storing the data and running the computations. Each slave runs both a Data Node and a Task Tracker daemon that communicate with and receive instructions from the master nodes. The Task Tracker daemon is a slave to the Job Tracker, and the Data Node daemon is a slave to the Name Node.

Client machines have Hadoop installed with all the cluster settings, but they are neither Masters nor Slaves. Instead, the role of the Client machine is to load data into the cluster, submit MapReduce jobs describing how that data should be processed, and then retrieve or view the results of the job when it is finished. Smaller clusters (around 40 nodes) may have a single physical server playing multiple roles, such as both Job Tracker and Name Node, while medium to large clusters often have each role operating on its own server machine. Figure 2.5 shows a Hadoop cluster.

In real production clusters there is no server virtualization and no hypervisor layer; that would only amount to unnecessary overhead impeding performance. Hadoop runs best on Linux machines, working directly with the underlying hardware. That said, Hadoop does work in a virtual machine, which is a great way to learn and get Hadoop up and running fast and cheap.

This is the typical architecture of a Hadoop cluster: rack servers populated in racks, each rack connected to a top-of-rack switch, usually with 1 or 2 GE bonded links. The rack switch has uplinks to another tier of switches connecting all the other racks with uniform bandwidth, forming the cluster. 10GE nodes are uncommon but gaining interest as machines continue to get denser with CPU cores and disk drives. The majority of the servers are Slave nodes with lots of local disk storage and moderate amounts of CPU and DRAM. Some of the machines are Master nodes that might have a slightly different configuration, favouring more DRAM and CPU and less local storage.

Figure 2.5: Hadoop Cluster

2.3 FiDoop Data Partitioning Scheme

FiDoop data partitioning partitions the input transactions in a way that decreases the amount of data transferred through the network during the shuffle phase. It also helps to minimize the local mining load. Traditional frequent itemset mining involves redundant transaction transmissions and redundant mining tasks, and this redundancy is the main reason for high network load and high mining cost. To address this issue, FiDoop partitions transactions by considering the correlations among transactions and items before the mining process. As a result, transactions with the highest similarity are grouped into one partition, which in turn prevents transactions from being repeated at remote nodes.

FiDoop adopts a Voronoi diagram based data partitioning technique in the second MapReduce job, which helps to minimize unnecessary redundant transaction transmissions. A Voronoi diagram divides the space into regions called Voronoi cells, based on a set of points referred to as pivots. For each pivot, there is a corresponding region consisting of all objects closer to it than to any other pivot. Based on the characteristics of frequent itemset mining (FIM), FiDoop adopts similarity as the distance metric between two transactions in the Voronoi diagram.

2.3.1 FiDoop Data Partitioning using FP-Growth

Traditional FiDoop-DP adopts the FP-growth algorithm: it first constructs an FP-tree and, based on the FP-tree, maps each item to the other items that appear with it in a transaction [4]. These items are grouped into sets based on their support. The FP-growth algorithm is more efficient than the Apriori algorithm because it performs itemset mining by examining the database only twice. Based on the frequent itemsets, a Voronoi diagram is built using similarity. Since the Voronoi diagram uses similarity as the distance metric, FiDoop adopts Jaccard similarity as that metric [3]. Jaccard similarity is a statistic that measures the similarity (and hence distance) between data sets; when the Jaccard similarity is high, the two data sets are very close to each other. To measure the distance between data sets, each transaction in the database is first modelled as a set.
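As a minimal illustration of this metric, the sketch below (transaction contents invented) computes the Jaccard similarity of two transactions modelled as sets and derives the corresponding distance.

# Jaccard similarity between two transactions modelled as item sets.
t1 = {"bread", "milk", "butter"}        # invented transactions
t2 = {"bread", "milk", "beer"}

def jaccard(a, b):
    # |A intersect B| / |A union B|: 1.0 means identical sets, 0.0 means disjoint.
    return len(a & b) / len(a | b)

similarity = jaccard(t1, t2)            # 2 / 4 = 0.5
distance = 1 - similarity               # distance used for the Voronoi-style partitioning
print(similarity, distance)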

2.3.2 K-means Selection Technique

In FiDoop, a K-means selection technique is used for the selection of pivots. The procedure of selecting pivots is conducted as a data pre-processing phase. The main goal of this technique is to partition the objects into clusters so that each object belongs to the cluster with the nearest mean; the result can then be applied to partition the data space into Voronoi cells. To limit the computational cost of K-means, FiDoop samples the transaction database before running the algorithm. The selection of initial pivots plays a vital role in clustering performance [7, 8], so K-means++, an extension of K-means, is used for selecting them [11]. Once K data clusters are generated, the center point of each cluster is selected as a pivot for the Voronoi diagram based data partitioning.

Once the selection of pivots is completed, the distances from the remaining objects to these pivots must be calculated to determine the partition to which each object belongs. For this, FiDoop uses a MinHash and Locality Sensitive Hashing based strategy for the grouping and partitioning process. MinHash provides a quick way to estimate the similarity between two sets [6, 7] and is well suited to large-scale clustering. MinHash replaces the large sets with much smaller representations called "signatures", composed of the minhash values of the characteristic matrix, which can simply be viewed as a matrix representation of the data sets. MinHash then estimates the expected similarity of two data sets from their signatures.
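A minimal MinHash sketch follows; the number of hash functions and the salted-hash construction are arbitrary illustration choices rather than the exact FiDoop implementation.

import hashlib

NUM_HASHES = 50      # number of simulated hash functions (arbitrary choice)

def item_hash(item, seed):
    # A salted hash standing in for one member of a family of hash functions.
    return int(hashlib.md5(f"{seed}:{item}".encode()).hexdigest(), 16)

def minhash_signature(itemset):
    # For each hash function, keep the minimum hash value over the set's items.
    return [min(item_hash(item, seed) for item in itemset)
            for seed in range(NUM_HASHES)]

def estimated_jaccard(sig_a, sig_b):
    # Fraction of signature positions that agree approximates Jaccard similarity.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

t1 = {"bread", "milk", "butter"}        # invented transactions
t2 = {"bread", "milk", "beer"}
print(estimated_jaccard(minhash_signature(t1), minhash_signature(t2)))   # roughly 0.5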

2.3.3 Locality Sensitive Hashing (LSH)

Locality Sensitive Hashing (LSH) is another technique used for data partitioning; it improves on plain MinHash [4]. Together, these techniques avoid comparing a large number of element pairs. With MinHash alone, an enormous number of pairs would have to be evaluated repeatedly, whereas LSH scans all the transactions once and identifies all the pairs that are likely to be similar. LSH maps the transactions in the feature space to a number of buckets, so that similar transactions are likely to be mapped into the same bucket. In this way, similar items are hashed into the same bucket with a higher probability than dissimilar items.
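The bucketing idea can be sketched with the standard banding scheme over MinHash signatures: each signature is cut into bands, each band is hashed to a bucket, and any two transactions sharing a bucket become a candidate similar pair. The band and row counts and the demo signatures below are invented for illustration.

import random
from collections import defaultdict

BANDS, ROWS = 10, 5          # signature length must equal BANDS * ROWS

def lsh_buckets(signatures):
    # signatures: {transaction_id: list of BANDS * ROWS integers}
    buckets = defaultdict(set)
    for tid, sig in signatures.items():
        for b in range(BANDS):
            band = tuple(sig[b * ROWS:(b + 1) * ROWS])
            buckets[(b, hash(band))].add(tid)      # colliding bands share a bucket
    return buckets

def candidate_pairs(buckets):
    # Any two transactions that share at least one bucket are candidate similar pairs.
    pairs = set()
    for tids in buckets.values():
        ordered = sorted(tids)
        for i in range(len(ordered)):
            for j in range(i + 1, len(ordered)):
                pairs.add((ordered[i], ordered[j]))
    return pairs

random.seed(0)
base = [random.randrange(10**6) for _ in range(BANDS * ROWS)]
near = base[:]; near[0] += 1                       # differs from base in one band only
far = [random.randrange(10**6) for _ in range(BANDS * ROWS)]
print(candidate_pairs(lsh_buckets({"t1": base, "t2": near, "t3": far})))   # ('t1', 't2')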

Chapter 3
ECLAT EVOLUTION

The Equivalence Class clustering and bottom-up LAttice Traversal algorithm (ECLAT) is a frequent itemset mining algorithm. It is one of the best algorithms for this task, and the reasons for choosing ECLAT are illustrated as follows:

3.1 Communication in Heterogeneous Systems

Generally, a heterogeneous system is defined as a system that produces results by combining individual systems, each with its own database; the integration of those systems is considered only at the final stage. The results of the individual systems are independent, so translation is needed for communication between the different DBMSs. In order to communicate, requests must be made in the DBMS language at each individual location. If the hardware differs but the DBMS products are the same, the translation is easy. If the DBMS products differ, the translation is complicated and involves mapping the data structures from one model to another. If both the hardware and the software differ, both types of translation are needed. Since differing hardware and software make processing complex, gateways are used to overcome this complexity; they convert the language and model of each different DBMS into the language and model of the required system.

3.2 Using Hadoop for Data Partitioning

In a typical Hadoop deployment, the nodes of a cluster are homogeneous. There is, however, another kind of cluster, the heterogeneous cluster, in which the nodes have different computing capacities. Data partitioning techniques are required to partition the input and intermediate data according to the computing capacities of the nodes in the cluster [13]. The Hadoop Distributed File System (HDFS) allows Hadoop MapReduce applications to move the operations to be processed towards the nodes where the actual application data resides. In a heterogeneous cluster, the computing capacity of the nodes differs from one node to another.

A high-speed node can process the data on its local disk faster than a low-speed node. After a high-speed node completes the processing of its own data, it should take a share of the unprocessed data located on one or more low-speed nodes, which speeds up the overall processing. Care is needed when transferring load between high-speed and low-speed nodes because, if the transferred data is very large, it affects the performance of Hadoop. In order to enhance the performance of Hadoop in heterogeneous clusters, it is necessary to minimize the data transfer between high-speed and low-speed nodes. This can be achieved by selecting a proper data partitioning scheme that distributes and stores data across the heterogeneous nodes according to their computing capacities.

The FiDoop data partitioning technique is used in heterogeneous Hadoop clusters together with the Equivalence Class clustering and bottom-up Lattice Traversal algorithm. FiDoop's main goal is to split up the input transactions so as to reduce the local mining load as well as decrease the amount of data transferred through the network in the shuffle phase. Consider, then, input data processed by N MapReduce jobs, since the clusters are heterogeneous.

Every MapReduce job consists of a set of transactions (t1, t2, ..., tn), and these transactions can be divided into a set of chunks (c1, c2, ..., cn). Each job has map tasks (m1, m2, ..., mn) and reduce tasks (r1, r2, ..., rn) running on a heterogeneous cluster. The mappers produce a set of intermediate key-value pairs ((g1, d1), (g1, d2), ..., (gn, dn)), where di is a collection of transactions belonging to group gi. After the map tasks are completed, the shuffle phase applies a partitioning function to assign the intermediate key-value pairs to reduce tasks based on their keys. If an intermediate key-value pair is assigned to a reducer running on a remote node, intermediate data shuffling is required.
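As a minimal sketch of this key-based assignment (mirroring Hadoop's default behaviour of routing a key to reducer hash(key) mod R; the group names, values, and reducer count below are invented), every pair with the same group key ends up at the same reducer:

import zlib

NUM_REDUCERS = 3

def partition(group_key, num_reducers=NUM_REDUCERS):
    # Stable hash so that a given key always maps to the same reducer.
    return zlib.crc32(group_key.encode()) % num_reducers

intermediate = [("g1", "d1"), ("g1", "d2"), ("g2", "d3"), ("g3", "d4")]

reducer_input = {r: [] for r in range(NUM_REDUCERS)}
for key, value in intermediate:
    reducer_input[partition(key)].append((key, value))
print(reducer_input)     # all 'g1' pairs land at the same reducer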

FiDoop incorporates Voronoi diagram based data partitioning in order to reduce unnecessary redundant transaction transmissions [3]. The main idea of Voronoi diagram based partitioning is that, for a given dataset D, it selects a number of objects as pivots and then splits all the objects into k disjoint partitions, assigning each object to the partition of its closest pivot. This process covers the entire data space by splitting it into k cells. FiDoop uses a distance metric to place similar data objects in the same partition, with Jaccard similarity serving as that metric.

MinHash is used as a partitioning strategy and forms the foundation for Locality Sensitive Hashing (LSH) based partitioning. LSH scans all the transactions once in order to identify the similar pairs; similar transactions are mapped into the same bucket. LSH ensures that two similar points are mapped into the same bucket with high probability, and it likewise guarantees that two dissimilar points are unlikely to be mapped into the same bucket.

In order to mine the most frequent itemsets in heterogeneous Hadoop clusters, the ECLAT algorithm is used. This algorithm converts the data from the horizontal to the vertical data format and uses a depth-first search together with set intersection.

The procedure of the algorithm is as follows. Initially, the algorithm takes the transaction table with transaction identifiers and itemsets. Next, it generates an item-wise mapping from each item of itemset 1 to the transaction IDs containing it, and continues by generating the same item-wise mapping for itemset 2. Finally, it combines itemset 1 and itemset 2 subject to the minimum support. The process is the same for every transaction and at every node in the heterogeneous Hadoop cluster.
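The horizontal-to-vertical conversion described above can be sketched as follows; the toy transactions are invented for illustration.

from collections import defaultdict

# Toy horizontal database: transaction id -> items.
horizontal = {
    "t1": {"bread", "milk"},
    "t2": {"bread", "butter"},
    "t3": {"milk", "butter", "bread"},
}

# Vertical layout: item -> set of transaction ids (its TID-list).
vertical = defaultdict(set)
for tid, items in horizontal.items():
    for item in items:
        vertical[item].add(tid)

# The support of an itemset is the size of the intersection of its TID-lists,
# so the transactions never have to be rescanned.
print(dict(vertical))
print(len(vertical["bread"] & vertical["milk"]))     # support of {bread, milk} = 2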

When the ECLAT algorithm is used for mining, the database does not need to be rescanned for each itemset. Using the LSH method, the memory requirements of the depth-first search are reduced. The ECLAT algorithm can therefore be considered an efficient algorithm for mining frequent itemsets over a large set of transactions in a heterogeneous Hadoop cluster.

3.3 MapReduce

Traditional enterprise systems normally have a centralized server to store and process data. The following illustration depicts a schematic view of such a system. The traditional model is certainly not suitable for processing huge volumes of scalable data, which cannot be accommodated by standard database servers [12]. Moreover, the centralized system creates too much of a bottleneck when processing multiple files simultaneously.

Figure 3.1: Database Schema

Figure 3.1 shows the database schema of a centralized system, in which the user communicates bi-directionally with the centralized system; the data is updated and the conclusions or reviews of that data are available in a relational database.

Figure 3.2: Structure of Centralised System

Google solved this bottleneck issue using an algorithm called MapReduce. MapReduce divides a task into small parts and assigns them to many computers. Later, the results are collected at one place and integrated to form the result dataset. Figure 3.2 shows the structure of a centralised system.

3.4 Working Procedure of MapReduce

The MapReduce algorithm contains two important tasks, namely Map task and Reduce
task.
 Map Task: The Map Task takes a set of data and converts it into another set of data,
where individual elements are broken down into tuples (key-value pairs).
 Reduce Task: Reduce Task takes the output from the Map as an input and combines
those data tuples (key-value pairs) into a smaller set of tuples. It is always performed
after the map job.

 Input Phase: It acts as a Record Reader which translates each record in an input file and
sends the parsed data to the mapper in the form of key-value pairs.

 Map: It is a user-defined function, which takes a series of key-value pairs and processes
each one of them to generate zero or more key-value pairs.

 Intermediate Keys: The key-value pairs generated by the mapper are known as intermediate keys.

 Combiner: A Combiner is a type of local Reducer that groups similar data from the map
phase into identifiable sets. It takes the intermediate keys from the mapper as input and
applies a user-defined code to aggregate the values in a small scope of one mapper. It is
not a part of the main MapReduce algorithm; it is optional.

 Shuffle and Sort: The Reducer task starts with the Shuffle and Sort step. It downloads the grouped key-value pairs onto the local machine where the Reducer is running. The individual key-value pairs are sorted by key into a larger data list. The data list groups the equivalent keys together so that their values can be iterated easily in the Reducer task.

Figure 3.3: Process of MapReduce

 Reducer: The Reducer takes the grouped key-value paired data as input and runs a
Reducer function on each one of them. The data can be aggregated, filtered, and
combined in a number of ways, and it requires a wide range of processing. Once the
execution is over, it gives zero or more key-value pairs to the final step.

 Output Phase: In the Output Phase, an output formatter translates the final key-value pairs from the Reducer function and writes them onto a file using a record writer.

Figure 3.3 shows the MapReduce process, where the map task concludes by combining each key with a value as a key-value pair and counting how many key-value pairs are available. The output of the map task is then given as input to the reduce task, which reduces the key-value pairs into one consolidated context.
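The map, shuffle-and-sort, and reduce steps just described can be sketched as a plain single-process word-count simulation (not the Hadoop API; the input records are invented):

from collections import defaultdict

records = ["deer bear river", "car car river", "deer car bear"]   # invented input

# Map: emit a (word, 1) pair for every word of every record.
mapped = [(word, 1) for line in records for word in line.split()]

# Shuffle and sort: group all emitted values by key.
grouped = defaultdict(list)
for key, value in sorted(mapped):
    grouped[key].append(value)

# Reduce: aggregate the grouped values for each key.
print({key: sum(values) for key, values in grouped.items()})
# {'bear': 2, 'car': 3, 'deer': 2, 'river': 2}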

Figure 3.4: MapReduce Functioning

Figure 3.4 shows the functioning of MapReduce, where the input is split into parts based on their similarity and the map phase counts the input variables. The output of the map phase is shuffled and sorted and then given to the reduce phase, which counts the required variables.

Figure 3.5: MapReduce Algorithm Actions

 Tokenize − Tokenizes the tweets into maps of tokens and writes them as key-value pairs.

 Filter − Filters unwanted words from the maps of tokens and writes the filtered maps as
key-value pairs.

 Count − Generates a token counter per word.

 Aggregate Counters − Prepares an aggregate of similar counter values into small manageable units.

Figure 3.5 shows the actions of the MapReduce algorithm, where tweets from Twitter are taken as input and are then tokenized, filtered, and counted to produce the output.

3.5 MapReduce Algorithm


The MapReduce algorithm contains two important tasks, namely Map and Reduce.

 The map task is done by means of Mapper Class

 The reduce task is done by means of Reducer Class.

Mapper class takes the input, tokenizes it, maps and sorts it. The output of Mapper class
is used as input by Reducer class, which in turn searches matching pairs and reduces them.

MapReduce implements various mathematical algorithms to divide a task into small parts
and assign them to multiple systems. In technical terms, MapReduce algorithm helps in sending
the Map & Reduce tasks to appropriate servers in a cluster.

Figure 3.6: MapReduce Tasks

Figure 3.6 shows how mathematical algorithms are combined with MapReduce tasks for better performance of the system. It involves the following methodologies:
 Sorting
 Searching
 Indexing
 TF-IDF

Sorting
Sorting is one of the basic MapReduce algorithms to process and analyze data.
MapReduce implements sorting algorithm to automatically sort the output key-value pairs from
the mapper by their keys.
 Sorting methods are implemented in the mapper class itself.

 In the Shuffle and Sort phase, after tokenizing the values in the mapper class, the Context class collects the matching valued keys as a collection.

 To collect similar key-value pairs (intermediate keys), the Mapper class takes the help of the RawComparator class to sort the key-value pairs.

 The set of intermediate key-value pairs for a given Reducer is automatically sorted by
Hadoop to form key-values (K2, {V2, V2, …}) before they are presented to the Reducer.

Searching
Searching plays an important role in the MapReduce algorithm. It helps in the combiner phase (optional) and in the Reducer phase.

Figure 3.7 shows how MapReduce employs the searching algorithm to find out the details of the employee who draws the highest salary in a given employee dataset.

 The employee data is in four different files − A, B, C, and D. There are duplicate employee records in all four files because the employee data was imported repeatedly from all the database tables.

Figure 3.7: Details of an Employee

 The Map Phase: It processes each input file and provides the employee data in key-value
pairs (<k, v> : <emp name , salary>). Figure 3.8 shows the illusion of Map Phase.

Figure 3.8: Map Phase Illusion

 The Combiner Phase: It accepts the input from the Map phase as key-value pairs of employee name and salary. Using the searching technique, the combiner checks all the employee salaries to find the highest-salaried employee in each file.
<k: employee name, v: salary>

// Initialize Max with the salary of the first employee, treated as the current maximum.
Max = salary of the first employee;

// For every subsequent employee record in the file:
if (v(next employee).salary > Max) {

    Max = v(salary);

} else {

    continue checking;

}

The expected result is as follows

<satish, 26000> <gopal, 50000> <kiran, 45000> <manisha, 45000>

 Reducer Phase: From each file, it finds the highest-salaried employee. To avoid redundancy, it checks all the <k, v> pairs and eliminates duplicate entries, if any. The same algorithm is applied across the four <k, v> pairs coming from the four input files. The final output should be as follows

<gopal, 50000>

Indexing

Normally, indexing is used to point to particular data and its address. MapReduce performs batch indexing on the input files for a particular Mapper. The indexing technique that is normally used in MapReduce is known as an inverted index. Search engines like Google and Bing use the inverted indexing technique. Let us try to understand how indexing works with the help of a simple example.

Example
The following text is the input for inverted indexing. Here T[0], T[1], and T[2] are the file names and their content is in double quotes.

T[0] = "it is what it is"

T[1] = "what is it"

T[2] = "it is a banana"

After applying the Indexing algorithm, the output is

"a": {2}

"banana": {2}

"is": {0, 1, 2}

"it": {0, 1, 2}

"what": {0, 1}

Here "a": {2} implies the term "a" appears in the T[2] file. Similarly, "is": {0, 1, 2}
implies the term "is" appears in the files T[0], T[1], and T[2].
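The inverted index above can be reproduced with a short sketch that uses the same three documents:

docs = {0: "it is what it is", 1: "what is it", 2: "it is a banana"}

index = {}
for doc_id, text in docs.items():
    for term in text.split():
        # Record every document in which the term appears.
        index.setdefault(term, set()).add(doc_id)

for term in sorted(index):
    print(f'"{term}": {sorted(index[term])}')
# "a": [2]  "banana": [2]  "is": [0, 1, 2]  "it": [0, 1, 2]  "what": [0, 1]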

Term Frequency- Inverse Document Frequency (TF-IDF)

TF-IDF is a text processing algorithm which is short for Term Frequency − Inverse
Document Frequency. It is one of the common web analysis algorithms. Here, the term
'frequency' refers to the number of times a term appears in a document.

Term Frequency (TF): It measures how frequently a particular term occurs in a document. It is
calculated by the number of times a word appears in a document divided by the total number of
words in that document.

TF(the) = (Number of times the term ‘the’ appears in a document) / (Total number of terms in the document)

Inverse Document Frequency (IDF): It measures the importance of a term. It is calculated as the number of documents in the text database divided by the number of documents in which the specific term appears. While computing TF, all the terms are considered equally important; that is, TF counts the term frequency even for common words like “is”, “a”, “what”, etc. The IDF is therefore computed to weigh down the frequent terms while scaling up the rare ones:

IDF(the) = log_e(Total number of documents / Number of documents with term ‘the’ in it).
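Applying the two formulas to the documents of the indexing example gives, for instance, TF('it', T[0]) = 2/5 = 0.4 and IDF('it') = log_e(3/3) = 0, so a word occurring in every document gets weight zero. A short sketch, using the same three documents and the natural logarithm as in the formula above:

import math

docs = {0: "it is what it is", 1: "what is it", 2: "it is a banana"}

def tf(term, doc_id):
    words = docs[doc_id].split()
    return words.count(term) / len(words)

def idf(term):
    containing = sum(1 for text in docs.values() if term in text.split())
    return math.log(len(docs) / containing)

print(tf("it", 0) * idf("it"))            # 0.0   -> common word carries no weight
print(tf("banana", 2) * idf("banana"))    # ~0.27 -> rare word carries more weight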

3.6 MapReduce Installation

MapReduce works only on Linux-flavoured operating systems and comes built into the Hadoop framework. The following steps are performed in order to install the Hadoop framework.

Verification of JAVA Installation


Java must be installed on the system before installing Hadoop. The following command is used to check whether Java is installed on the system.
$ java -version
If Java is already installed on the system, the following response is displayed

java version "1.7.0_71"

Java(TM) SE Runtime Environment (build 1.7.0_71-b13)

Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)

Installation of JAVA

Step 1

 Download the latest version of Java from the following link


 After downloading, you can locate the file jdk-7u71-linux-x64.tar.gz in your Downloads
folder.

Step 2

 The following commands are used to extract the contents of jdk-7u71-linux-x64.gz.

$ cd Downloads/
$ ls

jdk-7u71-linux-x64.gz

$ tar zxf jdk-7u71-linux-x64.gz

$ ls

jdk1.7.0_71 jdk-7u71-linux-x64.gz

Step 3

 To make Java available to all the users, move it to the location “/usr/local/”. Go to root
and type the following commands

$ su

password:

# mv jdk1.7.0_71 /usr/local/java

# exit

Step 4

 For setting up PATH and JAVA_HOME variables, add the following commands to
~/.bashrc file.

export JAVA_HOME=/usr/local/java
export PATH=$PATH:$JAVA_HOME/bin

 Apply all the changes to the current running system.

$ source ~/.bashrc

Step 5

 Use the following commands to configure Java alternatives

# alternatives --install /usr/bin/java java /usr/local/java/bin/java 2

 Now verify the installation using the command java -version from the terminal.

Verification of Hadoop Installation

 Hadoop must be installed on the system before installing MapReduce. Verify the Hadoop
installation using the following command −

$ hadoop version

 If Hadoop is already installed on your system, then the following response is displayed −

Hadoop 2.4.1

--

Subversion https://svn.apache.org/repos/asf/hadoop/common -r 1529768

Compiled by hortonmu on 2013-10-07T06:28Z

Compiled with protoc 2.5.0

From source with checksum 79e53ce7994d1628b240f09af91e1af4

Download Hadoop

 Download Hadoop 2.4.1 from Apache Software Foundation and extract its contents using
the following commands.

$ su

password:

# cd /usr/local

# wget http://apache.claz.org/hadoop/common/hadoop-2.4.1/

hadoop-2.4.1.tar.gz

# tar xzf hadoop-2.4.1.tar.gz

# mkdir hadoop

# mv hadoop-2.4.1/* hadoop/

# exit

Installation of Hadoop in Pseudo Distributed Mode

The following steps are used to install Hadoop 2.4.1 in pseudo distributed mode.

Step 1: Setting up Hadoop

 Hadoop environment variables are set by appending the following commands to ~/.bashrc
file.

export HADOOP_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME

export HADOOP_COMMON_HOME=$HADOOP_HOME

export HADOOP_HDFS_HOME=$HADOOP_HOME

export YARN_HOME=$HADOOP_HOME

export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native

export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin

Apply all the changes to the current running system.

$ source ~/.bashrc

Step 2: Hadoop Configuration

 Find all the Hadoop configuration files in the location “$HADOOP_HOME/etc/hadoop”. Make suitable changes in those configuration files according to the Hadoop infrastructure.

$ cd $HADOOP_HOME/etc/hadoop
 In order to develop Hadoop programs using Java, the Java environment variables are to
be reset in hadoop-env.sh file by replacing JAVA_HOME value with the location of
Java in the system.

export JAVA_HOME=/usr/local/java

Configuration of Hadoop

Edit the following files to configure Hadoop

 core-site.xml
 hdfs-site.xml
 yarn-site.xml
 mapred-site.xml

core-site.xml: It contains the following information:

 Port number used for the Hadoop instance
 Memory allocated for the file system
 Memory limit for storing the data
 Size of the Read/Write buffers

Open the core-site.xml and add the following properties in between the <configuration> and </configuration> tags.

<configuration>
<property>

<name>fs.default.name</name>

<value>hdfs://localhost:9000</value>

</property>

</configuration>

hdfs-site.xml: It contains the following information:

 Value of replication data
 The namenode path
 The datanode path of your local file systems

dfs.replication (data replication value) = 1

(In the following paths, /hadoop/ is the user name; hadoopinfra/hdfs/namenode is the directory created by the HDFS file system.)

namenode path = //home/hadoop/hadoopinfra/hdfs/namenode

(hadoopinfra/hdfs/datanode is the directory created by the HDFS file system.)

datanode path = //home/hadoop/hadoopinfra/hdfs/datanode

Open this file and add the following properties in between the <configuration> and </configuration> tags.

<configuration>

<property>

<name>dfs.replication</name>

<value>1</value>

</property>

<property>

<name>dfs.name.dir</name>

<value>file:///home/hadoop/hadoopinfra/hdfs/namenode</value>

</property>

<property>

<name>dfs.data.dir</name>

<value>file:///home/hadoop/hadoopinfra/hdfs/datanode</value>

</property>

</configuration>

yarn-site.xml: This file is used to configure yarn into Hadoop. Open the yarn-site.xml file and
add the following properties in between the <configuration>, </configuration> tags.

<configuration>

<property>

<name>yarn.nodemanager.aux-services</name>

<value>mapreduce_shuffle</value>

</property>

</configuration>

mapred-site.xml: This file is used to specify the MapReduce framework. By default, Hadoop contains a template of this file named mapred-site.xml.template. First of all, copy the file from mapred-site.xml.template to mapred-site.xml using the following command.

$ cp mapred-site.xml.template mapred-site.xml

Open mapred-site.xml file and add the following properties in between the
<configuration>, </configuration> tags.

<configuration>

<property>

<name>mapreduce.framework.name</name>

<value>yarn</value>

</property>

</configuration>

Verification of Hadoop Installation

The following steps are used to verify the Hadoop installation.

Step 1: Name Node Setup

 Set up the namenode using the command “hdfs namenode -format” as follows

$ cd ~

$ hdfs namenode -format

The expected result is as follows

10/24/14 21:30:55 INFO namenode.NameNode: STARTUP_MSG:

/************************************************************

STARTUP_MSG: Starting NameNode

STARTUP_MSG: host = localhost/192.168.1.11

STARTUP_MSG: args = [-format]

STARTUP_MSG: version = 2.4.1


...

...

10/24/14 21:30:56 INFO common.Storage: Storage directory

/home/hadoop/hadoopinfra/hdfs/namenode has been successfully formatted.

10/24/14 21:30:56 INFO namenode.NNStorageRetentionManager: Going to

retain 1 images with txid >= 0

10/24/14 21:30:56 INFO util.ExitUtil: Exiting with status 0

10/24/14 21:30:56 INFO namenode.NameNode: SHUTDOWN_MSG:

/************************************************************

SHUTDOWN_MSG: Shutting down NameNode at localhost/192.168.1.11

************************************************************/

Step 2: Verifying Hadoop dfs

 Execute the following command to start Hadoop file system.

$ start-dfs.sh

The expected output is as follows

10/24/14 21:37:56

Starting namenodes on [localhost]

localhost: starting namenode, logging to /home/hadoop/hadoop-

2.4.1/logs/hadoop-hadoop-namenode-localhost.out

localhost: starting datanode, logging to /home/hadoop/hadoop-

2.4.1/logs/hadoop-hadoop-datanode-localhost.out

Starting secondary namenodes [0.0.0.0]

Step 3: Verifying Yarn Script

 The following command is used to start the yarn script. Executing this command starts
yarn daemons.

$ start-yarn.sh

The output is as follows

starting yarn daemons

starting resourcemanager, logging to /home/hadoop/hadoop-

2.4.1/logs/yarn-hadoop-resourcemanager-localhost.out

localhost: starting node manager, logging to /home/hadoop/hadoop-

2.4.1/logs/yarn-hadoop-nodemanager-localhost.out

Step 4: Accessing Hadoop on Browser

The default port number to access Hadoop is 50070. Use the following URL to get Hadoop
services on browser.

http://localhost:50070/

Figure 3.9: Hadoop Browser

Figure 3.9 shows the Hadoop web interface, which displays when the NameNode was started, along with the version, compilation time, cluster ID, and block pool ID.

Step 5: Verify all Applications of a Cluster

The default port number to access all the applications of a cluster is 8088. Use the
following URL to use this service.

http://localhost:8088/

Figure 3.10 shows the cluster web interface listing the applications of the Hadoop cluster; like a search engine page, it can be used to look up information about the cluster, and it also shows the master node and slave node addresses.

Figure 3.10: Hadoop Cluster Browser

Summary

This chapter explains that Hadoop is used because it is very tough to store terabytes of data and very difficult to mine them, so Hadoop plays a prominent role in mining data from large storage. The chapter also covers the working procedure of MapReduce and the tasks performed by the map and reduce phases, notes that MapReduce is combined with mathematical algorithms for better performance, and concludes with the MapReduce installation.

Chapter 4

SYSTEM DESIGN

Design engineering deals with the various Unified Modeling Language (UML) diagrams for the implementation of the project. Design is a meaningful engineering representation of a thing that is to be built. Software design is a process through which the requirements are translated into a representation of the software. Design is the place where quality is rendered in software engineering, and it is the means to accurately translate customer requirements into a finished product.

4.1 UML Diagrams

Unified Modeling Language is a standardized modeling language enabling developers to specify, visualize, construct, and document the artifacts of a software system. Thus, UML makes these artifacts scalable, secure, and robust in execution. UML is an important aspect involved in object-oriented software development.

There are 13 types of UML diagrams. Some of them are briefly described as follows:

 Class Diagram
 Use Case Diagram
 Activity Diagram
 Component Diagram

Since UML describes real-time systems, it is very important to make a conceptual model and then proceed gradually. These diagrams are used to make a blueprint of the project to be built. They make it easy to modify the system before the project reaches the implementation stage. It is very tough to modify the design in the middle of a project implementation, so by sketching what has to be done as a blueprint, the work becomes easier and there are fewer challenges on the design side.

UML diagrams act as the stepping stones of a project, since the whole project is constructed based on these diagrams. If one stone falls, the whole project will collapse, so these diagrams have to be drawn very carefully.

4.1.1 Class Diagram

A class diagram in the UML is a type of static structure diagram that describes the
structure of a system by showing the system’s classes, their attributes, and the relationships
between the classes.

Figure 4.1: Class Diagram


Figure 4.1 shows the class diagram of frequent itemset mining for a supermarket and the relationships involved. The class diagram consists of SuperMarket, Manager, Sales, Customer, Product, inventory, and invoice.

4.1.2 Component Diagram

Components are wired together by using an assembly connection to connect the required
interface of one component with the provided interface of another component. This illustrates the
service consumer - service provider relationship between the two components.

An assembly connector is a "connector between two components that defines that one
component provides the services that another component requires. An assembly connector is a
connector that is defined from a required interface or port to a provided interface or port”.

When using a component diagram to show the internal structure of a component, the
provided and required interfaces of the encompassing component can delegate to the
corresponding interfaces of the contained components.

Figure 4.4: Component Diagram

Figure 4.4 shows the component diagram of the frequent itemset mining system. It shows the actors participating in the mining system and the basis on which they interact.

4.1.3 Use Case diagram

A use case diagram is a type of behavioral diagram created from a Use-case Analysis.
The purpose of a use case diagram is to present an overview of the functionality provided by the system in terms of actors, their goals, and any dependencies between those use cases.

Figure 4.2: Use case Diagram

Figure 4.2 shows the use case diagram of Sales, in which the sales manager acts as an actor. It shows the sales by region, location, product, and time period, and finally the printing of the sales report.

4.1.4 Activity diagram

Activity diagrams are loosely defined diagrams that show workflows of step-wise activities and actions, with support for choice, iteration, and concurrency. In UML, activity diagrams can be used to describe the business and operational step-by-step workflows of components in a system, and they can potentially model the internal logic of a complex operation. In many ways, UML activity diagrams are the object-oriented equivalent of flow charts and Data Flow Diagrams (DFDs) from structured development.

Figure 4.3: Activity Diagram

Figure 4.3 shows the activity diagram of a product being sold over a particular time period, and it shows the results for each category of product sales.

Summary

This chapter gives an idea of what has to be done in the project in the form of a blueprint or sketch. The layout of the UML diagrams gives a brief description of the project and describes the attributes and operations of the entities participating in the system, as well as the actors and their tasks.

Chapter 5

IMPLEMENTATION

This chapter explains the ECLAT scheme and briefly describes the steps for implementing
ECLAT in a Hadoop environment. Maintaining large volumes of data consistently requires
suitable software; for this purpose Cloudera, an open-source Apache Hadoop Distribution (AHD),
is used for the implementation.

5.1 Implementation of ECLAT Scheme

The implementation details of ECLAT-based FiDoop-DP running on Hadoop clusters are
presented; it consists of four steps (one sequential computing step and three parallel MapReduce
jobs). Specifically, before launching the FiDoop-DP process, a pre-processing phase is performed
on the master node to select a set of k pivots, which serve as input to the second MapReduce job
responsible for the Voronoi diagram-based partitioning.
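
To illustrate the idea behind the Voronoi diagram-based partitioning step, the following sketch shows, in simplified form, how each transaction could be assigned to the pivot it is most similar to, so that correlated transactions end up in the same partition. It is only an illustration under assumed names: the class VoronoiPartitioner, the use of Jaccard similarity and the example pivots are not taken from the project code.

import java.util.*;

// Illustrative sketch: Voronoi-style partitioning of transactions around k pivots.
// Each transaction is mapped to the pivot transaction it is most similar to,
// so similar transactions land in the same partition (data node).
public class VoronoiPartitioner {

    // Jaccard similarity between two transactions (sets of item codes).
    static double jaccard(Set<Integer> a, Set<Integer> b) {
        Set<Integer> inter = new HashSet<>(a);
        inter.retainAll(b);
        Set<Integer> union = new HashSet<>(a);
        union.addAll(b);
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
    }

    // Returns the index of the pivot most similar to the given transaction.
    static int assign(Set<Integer> transaction, List<Set<Integer>> pivots) {
        int best = 0;
        double bestSim = -1.0;
        for (int i = 0; i < pivots.size(); i++) {
            double sim = jaccard(transaction, pivots.get(i));
            if (sim > bestSim) {
                bestSim = sim;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Two hypothetical pivots chosen in the pre-processing phase (k = 2).
        List<Set<Integer>> pivots = Arrays.asList(
                new HashSet<>(Arrays.asList(0, 1, 2)),
                new HashSet<>(Arrays.asList(4, 5, 6)));
        Set<Integer> t = new HashSet<>(Arrays.asList(1, 2, 4));
        System.out.println("Transaction goes to partition " + assign(t, pivots));
    }
}

In the actual FiDoop-DP pipeline this assignment is carried out inside a MapReduce job, using the pivots produced by the pre-processing phase described above.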

ECLAT stands for Equivalence CLass clustering And bottom-up Lattice Traversal. The
ECLAT algorithm is used to mine the recurrent (frequent) itemsets of a transaction database.
ECLAT represents the data in a vertical data format and is essentially a depth-first search
algorithm based on set intersection. Techniques such as RARM and Apriori generally use the
horizontal data format (TransactionId, Items), in which transaction identifiers are explicitly
listed. The ECLAT algorithm instead uses the vertical database format (Items, TransactionId), in
which each item is maintained together with its corresponding transactions. All frequent itemsets
can then be computed by intersecting TID-lists.

In the first scan of the database, a TID-list (list of TransactionIds) is maintained for each
single item. (k+1)-itemsets are then generated from k-itemsets using the Apriori property and
depth-first search, where the TID-set of a (k+1)-itemset is the intersection of the TID-sets of its
frequent k-itemsets. This process continues until no further candidate itemsets can be found. One
advantage of the ECLAT algorithm is that there is no need to rescan the database to count the
support of (k+1)-itemsets, because the support information is already available from the
k-itemsets. The algorithm thus avoids the overhead of enumerating all the subsets of every
transaction and comparing them against a candidate hash tree during support counting.
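
As a concrete illustration of TID-list intersection, the following minimal sketch builds the vertical layout for a tiny, hypothetical database and computes the support of an itemset by intersecting TID-sets. The class name, method names and sample data are assumptions made for illustration; they are not the project's implementation.

import java.util.*;

// Minimal illustration of ECLAT-style support counting via TID-set intersection.
public class TidListDemo {

    // Build the vertical layout: item -> set of transaction ids containing it.
    static Map<Integer, Set<Integer>> toVertical(Map<Integer, int[]> horizontal) {
        Map<Integer, Set<Integer>> vertical = new HashMap<>();
        for (Map.Entry<Integer, int[]> t : horizontal.entrySet()) {
            for (int item : t.getValue()) {
                vertical.computeIfAbsent(item, k -> new TreeSet<>()).add(t.getKey());
            }
        }
        return vertical;
    }

    // Support of an itemset = size of the intersection of its items' TID-sets.
    static int support(int[] itemset, Map<Integer, Set<Integer>> vertical) {
        Set<Integer> tids = new TreeSet<>(vertical.getOrDefault(itemset[0], Collections.emptySet()));
        for (int i = 1; i < itemset.length; i++) {
            tids.retainAll(vertical.getOrDefault(itemset[i], Collections.emptySet()));
        }
        return tids.size();
    }

    public static void main(String[] args) {
        // A tiny hypothetical horizontal database: transaction id -> item codes.
        Map<Integer, int[]> db = new LinkedHashMap<>();
        db.put(1, new int[]{0, 1, 2});
        db.put(2, new int[]{1, 2, 3});
        db.put(3, new int[]{0, 1, 3});

        Map<Integer, Set<Integer>> vertical = toVertical(db);
        System.out.println("TID-list of item 1: " + vertical.get(1));          // [1, 2, 3]
        System.out.println("support({1, 2}) = " + support(new int[]{1, 2}, vertical)); // 2
    }
}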

5.2 Steps for Implementation Process

1. Download
• Download the winghc/hadoop2x-eclipse-plugin zip archive.

2. Extract
• Extract it to a local directory (say, 'C:\hadoop2x-eclipse-plugin').

3. Build
• Open "<hadoop2x-eclipse-plugin-directory>\src\contrib\eclipse-plugin" in the command prompt:
  C:\>cd C:\hadoop2x-eclipse-plugin\src\contrib\eclipse-plugin
• Run the ANT build, supplying the following properties:
• eclipse.home: installation directory of the Eclipse IDE.
• hadoop.home: Hadoop installation directory.

4. Install
• On a successful build, "hadoop-eclipse-plugin-2.2.0.jar" is generated inside
  "<hadoop2x-eclipse-plugin-directory>\src\contrib\eclipse-plugin". Copy this jar and paste it
  into the 'plugins' directory of the Eclipse IDE.

5. Configure
• Restart the Eclipse IDE if it is already running; otherwise, start Eclipse.
• To configure, go to Window --> Open Perspective --> Other and select the 'Map/Reduce'
  perspective.

6. Defining Hadoop Location
• Click on 'New Hadoop location...' (the blue elephant icon) and define the Hadoop location
  used to run MapReduce applications. Click the 'Finish' button to declare the location of Hadoop.

7. Map/Reduce (V2) Master:
• Address of the Map/Reduce master node (the JobTracker).
Distributed File System Master:
• Address of the Distributed File System master node (the NameNode).

Figure 5.1: Configuration

Figure 5.1 shows the configuration of a cluster. To find the port numbers, start Hadoop and
open http://localhost:8088/cluster in a browser, then click Tools --> Configuration and search for
the relevant properties.


Figure 5.2: Distributed File System Layout

Figure 5.2 shows the Distributed File System (DFS) layout; from here one can browse the
Hadoop file system and perform various file and folder operations using the GUI alone.

Figure 5.3: MapReduce Folder

Figure 5.3 shows how to create a new MapReduce project. To create a Map/Reduce project,
Mapper, Reducer or MapReduce Driver using the wizard, click File --> New --> Other... -->
Map/Reduce; the system is then ready for Hadoop programming.

Summary

This chapter presented the implementation of the ECLAT scheme and the steps for carrying
out the implementation. Creating a new project and completing the associated configuration steps
are essential parts of this process.

Chapter 6

RESULTS

This chapter presents the frequent itemsets obtained from a sample dataset and the procedure
for calculating support values. It also illustrates the extraction of the most frequent itemsets
within a short span of time and compares the ECLAT and FP-Growth algorithms.

6.1 Experimental Results

The experiments start from a sample dataset of itemsets, in which each item in the dataset is
represented by an integer value.

6.1.1 Procedure for Mining using ECLAT scheme:

1. Scan the database to become familiar with its contents before finding frequent itemsets.

Table 6.1: Horizontal Data Format of a Database

Transaction Identifier    Itemsets

1    I0, I1, I2, I4
2    I0, I3, I5, I6
3    I1, I2, I3, I4, I6
4    I1, I4, I5, I6
5    I0, I1, I2, I5, I6
6    I1, I2, I5, I6

Table 6.1 shows the database in its usual horizontal format, consisting of transaction
identifiers and itemsets: the transaction identifier column holds the identity of each transaction,
and the itemsets column holds its items.

2. Convert horizontal data format to vertical data format

Table 6.2: Vertical Data Format of a Database

Item    Transaction Id Set

I0    1, 2, 5
I1    1, 3, 4, 5, 6
I2    1, 3, 5, 6
I3    2, 3
I4    1, 3, 4
I5    2, 4, 5, 6
I6    2, 3, 4, 5, 6

Table 6.2 shows the conversion of the horizontal database into the vertical database format
used for finding frequent itemsets. After conversion, each item is listed against the set of
transaction identifiers in which it occurs.

3. Calculate Support for all items


Support = (a / b) × 100 (expressed as a percentage)

where
'a' is the number of transactions that contain the intended item (together with any other
intended items), and
'b' is the total number of transactions.
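
For example, item I0 occurs in transactions 1, 2 and 5 (see Table 6.2), so a = 3 and b = 6, giving
Support(I0) = (3 / 6) × 100 = 50%; the corresponding absolute support count of 3 is the value
listed for I0 in Table 6.3.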

Table 6.3: Support for Items

Item Support
I0 3
I1 5
I2 4
I3 2
I4 3
I5 4
I6 5

Table 6.3 shows the support values of the items; these are compared with the minimum
support. Based on the support values, frequent itemsets are distinguished from non-frequent
itemsets.
4. Based on the items' support, the least support value is considered as the "minimum
support" for itemsets; therefore, the minimum support is 2, i.e., 40%.
5. Compare min-sup with the support of each item; if the support is less than or equal to
min-sup, eliminate the item.

Table 6.4: After Elimination

Item Support
I0 3
I1 5
I2 4
I4 3
I5 4
I6 5

Table 6.4 shows the items that remain after the comparison with the minimum support;
these items are retained because their support is greater than the minimum support.

6. Combine item {I0} with each remaining item ({I0, I1}, {I0, I2}, ...), and then calculate the
support. Repeat for all items (a schematic code sketch of this extension procedure is given after
step 10).

Table 6.5: Combined Support with Each Item

Items Support Items Support Items Support Items Support Items Support
{I0, I1} 2 {I1, I2} 4 {I2, I4} 4 {I4, I5} 1 {I5, I6} 4
{I0, I2} 2 {I1, I4} 3 {I2, I5} 2 {I4, I6} 2
{I0, I4} 1 {I1, I5} 3 {I2, I6} 4
{I0, I5} 2 {I1, I6} 4
{I0, I6} 2

Table 6.5 shows the support of each item combined with every other item. These 2-itemsets
are the candidates whose supports are calculated and then compared with the minimum support.

7. Compare min-sup with the support of each itemset; if the support is less than or equal to
min-sup, eliminate the itemset.

Table 6.6: Comparison of Minimum Support with Itemsets Support δ 2

Items Support Items Support Items Support


{I1, I2} 4 {I2, I4} 4 {I5, I6} 4
{I1, I4} 3 {I2, I6} 4
{I1, I5} 3
{I1, I6} 4

Table 6.6 shows the frequent 2-itemsets; these are found after comparing the itemsets'
supports with the minimum support.

8. Repeat step-6 for frequent 3 item-sets.

Table 6.7: Combining with Other Items

Item Support
{I1, I2, I4} 2
{I1, I2, I5} 2
{I1, I2, I6} 1
{I1, I4, I5} 3
{I1, I4, I6} 2
{I1, I5, I6} 3
{I2, I5, I6} 2

Table 6.7 shows the 3-itemsets, which are generated by combining the frequent 2-itemsets
among themselves only.

9. Repeat step-5.

Table 6.8: Final Set δ3

Item Support
{I1, I4, I5} 3
{I1, I5, I6} 3

Table 6.8 shows the final set δ3, which contains the most frequent 3-itemsets of the database in Table 6.1.

10. Finally, {I1, I4, I5} and {I1, I5, I6} are the frequent itemsets.
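
The extension performed in steps 6 to 9 can be summarised in code. The sketch below is schematic only: the class, method and variable names are assumptions, it presumes the items of each itemset are kept in increasing order, and it follows the elimination rule used above (candidates whose support is less than or equal to the minimum support are discarded).

import java.util.*;

// Schematic sketch of the level-wise extension in steps 6-9: frequent k-itemsets
// are combined into (k+1)-candidates and pruned against the minimum support.
public class LevelWiseEclat {

    // TID-set of an itemset, obtained by intersecting its items' TID-sets.
    static Set<Integer> tids(List<Integer> itemset, Map<Integer, Set<Integer>> vertical) {
        Set<Integer> result = new TreeSet<>(vertical.get(itemset.get(0)));
        for (int i = 1; i < itemset.size(); i++) {
            result.retainAll(vertical.get(itemset.get(i)));
        }
        return result;
    }

    // One level of extension: combine frequent k-itemsets sharing a common (k-1)-prefix.
    static List<List<Integer>> extend(List<List<Integer>> frequentK,
                                      Map<Integer, Set<Integer>> vertical, int minSup) {
        List<List<Integer>> next = new ArrayList<>();
        for (int i = 0; i < frequentK.size(); i++) {
            for (int j = i + 1; j < frequentK.size(); j++) {
                List<Integer> a = frequentK.get(i), b = frequentK.get(j);
                // Same prefix (all but the last item) -> same equivalence class.
                if (a.subList(0, a.size() - 1).equals(b.subList(0, b.size() - 1))) {
                    List<Integer> candidate = new ArrayList<>(a);
                    candidate.add(b.get(b.size() - 1));
                    // Keep only candidates whose support exceeds min-sup (steps 5 and 7).
                    if (tids(candidate, vertical).size() > minSup) {
                        next.add(candidate);
                    }
                }
            }
        }
        return next;
    }

    public static void main(String[] args) {
        // Tiny hypothetical vertical database: item -> TID-set.
        Map<Integer, Set<Integer>> vertical = new HashMap<>();
        vertical.put(1, new TreeSet<>(Arrays.asList(1, 2, 3, 4)));
        vertical.put(2, new TreeSet<>(Arrays.asList(1, 2, 4)));
        vertical.put(3, new TreeSet<>(Arrays.asList(2, 3, 4)));

        // Frequent 1-itemsets extended to 2-itemsets with min-sup 2.
        List<List<Integer>> f1 = Arrays.asList(
                Arrays.asList(1), Arrays.asList(2), Arrays.asList(3));
        System.out.println(extend(f1, vertical, 2));
    }
}

Starting from the frequent 1-itemsets, calling extend repeatedly yields the frequent 2-itemsets, then the frequent 3-itemsets, and so on until no further candidates survive.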

6.2 Execution of ECLAT with MapReduce:

The data is maintained in a file named "contextPasquier", which has to be processed with
MapReduce. After execution, the frequent itemsets are written to an output file. The job is
launched by right-clicking the file and selecting "Run on Hadoop"; it takes some time to run.
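
To give a feel for how such a job is wired together, the following minimal sketch shows a single MapReduce job that builds the vertical layout in the map phase and counts item supports in the reduce phase. It is a simplified stand-in rather than the project's three-job FiDoop-DP pipeline: the class names, the assumption of one space-separated transaction per input line, and the absolute threshold of 2 are all illustrative.

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Single MapReduce job that computes the support of every item: the map phase
// emits (item, transaction id) pairs (the vertical layout), the reduce phase
// counts the distinct transaction ids and keeps items meeting the threshold.
public class ItemSupportJob {

    public static class TransactionMapper
            extends Mapper<LongWritable, Text, IntWritable, LongWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // The byte offset of the line is used as a crude transaction id.
            for (String token : line.toString().trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    context.write(new IntWritable(Integer.parseInt(token)), offset);
                }
            }
        }
    }

    public static class SupportReducer
            extends Reducer<IntWritable, LongWritable, IntWritable, IntWritable> {
        private static final int MIN_SUPPORT = 2;  // assumed absolute threshold

        @Override
        protected void reduce(IntWritable item, Iterable<LongWritable> tids, Context context)
                throws IOException, InterruptedException {
            Set<Long> distinct = new HashSet<>();
            for (LongWritable tid : tids) {
                distinct.add(tid.get());
            }
            if (distinct.size() >= MIN_SUPPORT) {
                context.write(item, new IntWritable(distinct.size()));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "item support");
        job.setJarByClass(ItemSupportJob.class);
        job.setMapperClass(TransactionMapper.class);
        job.setReducerClass(SupportReducer.class);
        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(LongWritable.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}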

Figure 6.1: Data in Database

Figure 6.1 shows the data in the file to be processed for frequent itemsets. The file may
contain text or integers; here the integers correspond to item bar codes, and the database consists
of transaction ids and item codes. The frequent itemsets are mined using the MapReduce and
ECLAT algorithms. The support value is taken as 40% and the confidence as 60%. The resulting
frequent itemsets depend on these values: as the support threshold increases, the number of
frequent itemset patterns decreases.

The support value is mainly set to 40%, since this value yields frequent patterns of good
quality. Such patterns can be used for various purposes and retained for further analysis.

Figure 6.2: Data in Database

Figure 6.2 shows further data in the database to be mined for frequent patterns, continuing
from Figure 6.1. The data can consist of integers or characters; in this project, integers are used to
calculate the frequent patterns.

Figure 6.3: Data in Database

Figure 6.3 shows the data in the database to be mined for frequent itemsets with the support
value taken as 40%. It continues from Figure 6.2 and shows data that is quite cluttered, making it
difficult to tell which transaction each record belongs to.

Figure 6.4: Database Details

Figure 6.4 shows the database details: the file is opened in MapReduce and the program to
run on Hadoop is selected. After execution, an output file is displayed showing the number of
transactions in the database, the count of frequent itemsets, the time taken to process the database
and the amount of storage it uses.

Figure 6.5: Support Values for Frequent Itemsets

Figure 6.5 shows the frequent itemsets and their support values for the data in the database.
The output consists of the itemsets that occur frequently in the database, identified using the
support value of 40%.

The graph is plotted to compare the FP-Growth and ECLAT algorithms. The parameters are
the minimum support and the time taken to run the dataset. The results show that ECLAT with
the Locality Sensitive Hashing technique runs faster than the FP-Growth algorithm.

Figure 6.6: Graphical Representation of Minimum Support and Running Time between ECLAT and
FP-Growth algorithms

Figure 6.6 shows the comparison between the ECLAT and FP-Growth algorithms based on the
time taken to process the data in the database. It shows that ECLAT takes considerably less time
than FP-Growth. The number of frequent patterns varies with the minimum support: as the support
increases, the number of frequent patterns decreases.

Figure 6.7: Graphical Representation of Minimum Support and Running Time between ECLAT and
FP-Growth Algorithms Based on Voronoi Diagram

Figure 6.7 shows the comparison between the ECLAT and FP-Growth algorithms based on the
itemsets obtained from the Voronoi diagram-based partitioning. The graph clearly shows the
difference in running time between ECLAT and FP-Growth.

Summary
This chapter presented the execution of the ECLAT algorithm and its working process. The
comparison between ECLAT and FP-Growth was shown graphically in terms of minimum support
values and running time. The chapter concludes that the running time of the ECLAT algorithm is
much shorter than that of the FP-Growth algorithm.

Chapter 7

CONCLUSIONS AND FUTURE DIRECTION

In a heterogeneous Hadoop cluster, data partitioning and mining data from different systems
are the main issues, because many different nodes must be considered for processing the data. Care
should therefore be taken while partitioning the data across the nodes: some nodes run fast and
others slow, and if the data is not partitioned well, the performance of Hadoop suffers.

Data must therefore be partitioned in such a way that processing can be completed within a
feasible time, whether the nodes are high-speed or low-speed. Otherwise, once the high-speed nodes
finish their processing, they have to be reassigned to share the load of the low-speed nodes, which is
difficult when the amount of data to be transferred is very large. To overcome this problem, caution
is needed at the initial stage, namely the data partitioning stage. The ECLAT algorithm, combined
with the MapReduce model, provides a path to fast processing of the database files and gives better
solutions.

The FiDoop-Data Partitioning technique is used to split the data across different nodes in the
heterogeneous cluster, with the ECLAT algorithm used for mining. As future work, automatic
updating of the database can be added, and energy efficiency can be improved to enhance the power
profile of the system.

REFERENCES

[1] M. J. Zaki, "Parallel and distributed association mining: A survey," IEEE Concurrency, vol. 7,
no. 4, pp. 14–25, 1999.

[2] Yaling Xun, Jifu Zhang, Xiao Qin and Xujun Zhao, "FiDoop-DP: Data partitioning in frequent
itemset mining on Hadoop clusters," IEEE Transactions on Parallel and Distributed Systems,
pp. 7–14, 2016.

[3] Xindong Wu, Xingquan Zhu, Gong-Qing Wu and Wei Ding, "Data mining with Big Data."

[4] W. Lu, Y. Shen, S. Chen and B. C. Ooi, "Efficient processing of k-nearest neighbor joins using
MapReduce," Proceedings of the VLDB Endowment, vol. 5, no. 10, pp. 1016–1027, 2012.

[5] Martin Brown, "Data Mining Techniques," IBM developerWorks, published 11/2012.

[6] D. Arthur and S. Vassilvitskii, "k-means++: The advantages of careful seeding," in Proceedings
of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial
and Applied Mathematics, 2007, pp. 1027–1035.

[7] I. Pramudiono and M. Kitsuregawa, "Parallel FP-growth on PC cluster," in Advances in
Knowledge Discovery and Data Mining. Springer, 2003, pp. 467–473.

[8] A. Stupar, S. Michel and R. Schenkel, "RankReduce – processing k-nearest neighbor queries on
top of MapReduce," in Proceedings of the 8th Workshop on Large-Scale Distributed Systems for
Information Retrieval. Citeseer, 2010, pp. 13–18.

[9] B. Bahmani, A. Goel and R. Shinde, "Efficient distributed locality sensitive hashing," in
Proceedings of the 21st ACM International Conference on Information and Knowledge
Management. ACM, 2012, pp. 2174–2178.

[10] T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman and A. Y. Wu, "An
efficient k-means clustering algorithm: Analysis and implementation," IEEE Transactions on
Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 881–892, 2002.

[11] T. Kirsten, L. Kolb, M. Hartung, A. Groß, H. Köpcke and E. Rahm, "Data partitioning for
parallel entity matching," Proceedings of the VLDB Endowment, vol. 3, no. 2, 2010.

[12] M. Liroz-Gistau, R. Akbarinia, D. Agrawal, E. Pacitti and P. Valduriez, "Data partitioning for
minimizing transferred data in MapReduce," in Data Management in Cloud, Grid and P2P
Systems. Springer, 2013, pp. 1–12.

[13] S. Agrawal, V. Narasayya and B. Yang, "Integrating vertical and horizontal partitioning into
automated physical database design," in Proceedings of the 2004 ACM SIGMOD International
Conference on Management of Data. ACM, 2004, pp. 359–370.

PUBLICATION

[1] J. Chandana, "FiDoop-DP: An Efficient Data Mining Technique on Heterogeneous Clusters,"
3rd International Conference on Recent Challenges in Engineering and Technology
(ICRCET-2017), 12th–13th September 2017, Tirupathi.

