
Title: Data Mining

Summary: This paper aims to give a general overview of data mining and non-SQL databases.

Historical

The term "data mining" first appeared in the early 1960s and had a pejorative meaning at that time. Computers were being used more and more for all kinds of calculations that could not previously be done manually. Some researchers began to process the data tables from surveys or experiments at their disposal without any statistical a priori. As they observed that the results obtained, far from being aberrant, were encouraging, they were led to systematize this opportunistic approach. Official statisticians, however, considered the approach unscientific and used the terms "data mining" or "data fishing" to criticize it. This opportunistic attitude towards data coincided in France with the dissemination of data analysis to the general public, whose promoters, such as Jean-Paul Benzécri, also had to endure criticism from the community of statisticians in the early days. Despite everything, the success of this empirical approach has not wavered, and interest in data analytics has grown along with the size of databases.

Towards the end of the 1980s, database researchers such as Rakesh Agrawal began to work on exploiting the content of large databases, such as collections of supermarket receipts, convinced that they could add value to these masses of dormant data. They used the expression "database mining" but, since it had already been trademarked by a company (Database Mining Workstation), it was "data mining" that prevailed. In March 1989, Gregory Piatetsky-Shapiro proposed the term "knowledge discovery" at a workshop on knowledge discovery in databases. Currently, the terms data mining and knowledge discovery in databases (KDD, or ECD in French) are used more or less interchangeably; we will therefore use the expression "data mining", as it is the most frequently used in the literature. The data mining community held its first conference in 1995, following the numerous KDD workshops organized between 1989 and 1994. In 1998, a special interest group called ACM SIGKDD was created under the auspices of the ACM, which brings together the international KDD community. The first journal in the field, the Data Mining and Knowledge Discovery journal published by Kluwer, was launched in 1997.
RELATED WORKS

Both data mining and cloud computing have received significant interest in recent years, and many works have been presented that try to improve data mining techniques using the abilities of cloud computing. In 2008, Christopher et al. [4] took the first steps in this field. They tried to scale up classifiers for cloud computing machines by comparing three classification techniques (decision trees, k-nearest neighbours and support vector machines) and evaluating their performance on distributed data. They worked with six different data sets (Protein, KDDCup, Alpha, Beta, Syn-SM, and Syn-LG). Our work extends and converges with their research by using decision tree techniques and proposing an implementation of an abstraction classification method, whilst it diverges from theirs by using alternative classification techniques and different datasets.

In 2009, Yuzhang et al. [5] improved decision tree construction (the SPRINT algorithm) with a service-oriented architecture based on the principles of cloud computing, using a Distributed Computational Service Cloud (DCSC). They provided a SPRINT model able to handle a user-defined dataset. Our work goes further by building a cloud computing model that distributes cross-validation tasks, and we use different datasets.

In 2010, Jianzong Wang et al. [6] worked on data mining of mass storage based on cloud computing by implementing a model that combines three techniques (Global Effect, K-NN and Restricted Boltzmann Machines) for mining the Netflix Prize dataset. They observed that Global Effect and K-NN worked very well but RBM did not perform well. Our work is similar to theirs with regard to data mining based on cloud computing, but differs by implementing a new model for classification tasks.

In 2011, Gopalakrishnan and K. Lakshmi [7] proposed a new Hierarchical Virtual K-Means approach (HVKM) using the two cloud computing models PaaS and SaaS, aimed at users who wish to provide Business Analysis as a Service. They used sample insurance data to test their approach. We converge with their work in implementing the two cloud computing models for data mining purposes, but we differ from it by developing a model for classification and prediction rather than simple clustering.

In 2012, Tong et al. [8] built a web application for data mining analysis in a forecasting service based on cloud computing. They called it Forecasting as a Service (FaaS), and it provides forecasting services for users. They evaluated the performance of six data mining techniques (Logistic Regression, Time Series, ANN, Random Forest, SVM and MARS) on the SaaS model, using R, PHP and MySQL tools to analyze manufacturing data for industrial index forecasting in Taiwan. From a technical point of view, our work is similar to theirs in building a prediction model that uses multiple prediction techniques, but we differ from them by using health care datasets for medical diagnosis purposes as opposed to an industrial index dataset.

In 2012, N. R. Sheth and J. S. Shah [9] implemented one of the most popular association rule algorithms, Apriori, adapted to the MapReduce programming model to work on the Hadoop platform. They built an interface between Hadoop and the Sector file system (the Sector/Sphere cloud system) which gives every Hadoop application the ability to work on Sector data, and they observed a decline in performance on the Sector file system due to I/O and JNI overhead. We converge with their work on the data mining side of cloud computing, while our work differs from theirs by using prediction techniques rather than association rule algorithms.

In 2012, Juan and Pallavi [10] developed a sequential association rule algorithm (Apriori) by redesigning it from the original concept and applied it to MapReduce on the Amazon EC2 cloud model, which provided a parallel computing platform. They used four different datasets (chess, mushroom, connect and T10I4D100K).

[…]
In 2012, Nandini and Saurabh [13] proposed a high-performance cloud data mining algorithm by improving the Apriori algorithm with a genetic algorithm approach to work on the Sector/Sphere cloud framework, and they used multi-transaction datasets to validate their model. Our approach is similar to theirs in developing a cloud model for data mining, but we differ from it by using alternative datasets in order to test classification techniques in multiple computing environments.

In 2013, Kawuu and Yu-Chin [14] presented four efficient association rule algorithms (Equal Working Set (EWS), Request On Demand (ROD), Small Size Working Set (SSWS) and Progressive Size Working Set (PSWS)) that utilize the nodes of a cloud computing environment, evaluated with IBM's Quest synthetic data generator. They observed that the four algorithms are more scalable than the TPFP-tree and BTP-tree schemes; PSWS required only 12.2% and 18% of the execution time used by TPFP-tree and BTP-tree respectively. Our work is similar to theirs in utilizing the resources of cloud nodes to distribute the computation of data mining tasks, but our model is aimed at classification and prediction tasks rather than just association rules.
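Several of the works surveyed above ([9], [10], [13]) rest on the observation that Apriori's support-counting pass maps naturally onto MapReduce. The following is a minimal, illustrative sketch only, not the code of any of the cited papers: it phrases one counting pass as a map function and a reduce function over a toy set of transactions and candidate itemsets (the items and thresholds are made-up examples).

```python
from itertools import combinations
from collections import defaultdict

# Toy transactions and candidate 2-itemsets (illustrative data only).
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
]
candidates = [frozenset(c) for c in combinations({"bread", "milk", "butter"}, 2)]

def map_phase(transaction):
    """Emit (candidate, 1) for every candidate contained in the transaction."""
    for cand in candidates:
        if cand <= transaction:
            yield cand, 1

def reduce_phase(pairs):
    """Sum the counts emitted by all mappers, keyed by candidate itemset."""
    support = defaultdict(int)
    for cand, count in pairs:
        support[cand] += count
    return support

# One MapReduce-style pass: map over every transaction, then reduce.
emitted = (pair for t in transactions for pair in map_phase(t))
for cand, count in reduce_phase(emitted).items():
    print(sorted(cand), count)
```

On a real cluster, the map phase would run on the nodes holding the transaction splits and the framework would group the emitted pairs by key before the reduce phase; the loop above only imitates that flow on a single machine.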
Spark and Hadoop

What is Apache Hadoop?

Apache Hadoop is an open-source software utility that allows users to manage big data sets (from gigabytes to petabytes) by enabling a network of computers (or "nodes") to solve vast and intricate data problems. It is a highly scalable, cost-effective solution that stores and processes structured, semi-structured and unstructured data (e.g., Internet clickstream records, web server logs, IoT sensor data, etc.).

Benefits of the Hadoop framework include the following:

• Data protection amid a hardware failure
• Vast scalability from a single server to thousands of machines
• Real-time analytics for historical analyses and decision-making processes

What is Apache Spark?

Apache Spark, which is also open source, is a data processing engine for big data sets. Like Hadoop, Spark splits up large tasks across different nodes. However, it tends to perform faster than Hadoop, and it uses random access memory (RAM) to cache and process data instead of a file system. This enables Spark to handle use cases that Hadoop cannot.

Benefits of the Spark framework include the following:

• A unified engine that supports SQL queries, streaming data, machine learning (ML) and graph processing
• Can be 100x faster than Hadoop for smaller workloads via in-memory processing, disk data storage, etc.
• APIs designed for ease of use when manipulating semi-structured data and transforming data

The Hadoop ecosystem

Hadoop supports advanced analytics for stored data (e.g., predictive analysis, data mining, machine learning (ML), etc.). It enables big data analytics processing tasks to be split into smaller tasks. The small tasks are performed in parallel by using an algorithm (e.g., MapReduce) and are then distributed across a Hadoop cluster (i.e., nodes that perform parallel computations on big data sets).

The Hadoop ecosystem consists of four primary modules:

1. Hadoop Distributed File System (HDFS): Primary data storage system that manages large data sets running on commodity hardware. It also provides high-throughput data access and high fault tolerance.
2. Yet Another Resource Negotiator (YARN): Cluster resource manager that schedules tasks and allocates resources (e.g., CPU and memory) to applications.
3. Hadoop MapReduce: Splits big data processing tasks into smaller ones, distributes the small tasks across different nodes, then runs each task (see the sketch after this list).
4. Hadoop Common (Hadoop Core): Set of common libraries and utilities that the other three modules depend on.
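To make the MapReduce module concrete, here is a minimal word-count sketch in the Hadoop Streaming style, where the mapper and reducer are plain scripts reading standard input. It is an illustrative example only; the script name used below (wordcount.py) is a placeholder, not anything from the article.

```python
#!/usr/bin/env python3
# Minimal Hadoop Streaming-style word count (illustrative sketch only).
# Run with argument "mapper" to act as the mapper, otherwise as the reducer;
# the framework sorts the mapper output by key between the two phases.
import sys

def mapper():
    # Emit "word<TAB>1" for every word read from standard input.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Input is sorted by word, so counts for one word arrive contiguously.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "mapper":
        mapper()
    else:
        reducer()
```

Locally, the same flow can be imitated with a shell pipeline such as: cat input.txt | python3 wordcount.py mapper | sort | python3 wordcount.py reducer. On a cluster, Hadoop Streaming runs the two scripts as distributed map and reduce phases over the HDFS splits.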
The Spark ecosystem

Apache Spark, the largest open-source project in data processing, is the only processing framework that combines data and artificial intelligence (AI). This enables users to perform large-scale data transformations and analyses, and then run state-of-the-art machine learning (ML) and AI algorithms.

The Spark ecosystem consists of five primary modules:

1. Spark Core: Underlying execution engine that schedules and dispatches tasks and coordinates input and output (I/O) operations.
2. Spark SQL: Gathers information about structured data to enable users to optimize structured data processing.
3. Spark Streaming and Structured Streaming: Both add stream processing capabilities. Spark Streaming takes data from different streaming sources and divides it into micro-batches for a continuous stream. Structured Streaming, built on Spark SQL, reduces latency and simplifies programming.
4. Machine Learning Library (MLlib): A set of machine learning algorithms for scalability plus tools for feature selection and building ML pipelines. The primary API for MLlib is DataFrames, which provides uniformity across different programming languages like Java, Scala and Python (see the sketch after this list).
5. GraphX: User-friendly computation engine that enables interactive building, modification and analysis of scalable, graph-structured data.
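Because the model described earlier distributes classification and cross-validation work across cloud nodes, a compact MLlib example of that pattern may help. This is a minimal sketch, not the paper's implementation: the file name and column names (data.csv, label, f1, f2) are placeholders that do not come from the article, and a running Spark installation is assumed.

```python
# Illustrative MLlib sketch: a decision-tree classifier tuned with
# cross-validation whose fold/parameter combinations Spark schedules
# across the cluster's executors.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

spark = SparkSession.builder.appName("dm-cv-sketch").getOrCreate()
df = spark.read.csv("data.csv", header=True, inferSchema=True)  # placeholder path

# Assemble the feature columns into the single vector column MLlib expects.
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
tree = DecisionTreeClassifier(labelCol="label", featuresCol="features")
pipeline = Pipeline(stages=[assembler, tree])

# 5-fold cross-validation over a small depth grid.
grid = ParamGridBuilder().addGrid(tree.maxDepth, [3, 5, 7]).build()
evaluator = MulticlassClassificationEvaluator(labelCol="label",
                                              metricName="accuracy")
cv = CrossValidator(estimator=pipeline, estimatorParamMaps=grid,
                    evaluator=evaluator, numFolds=5, parallelism=4)

model = cv.fit(df)
print("mean accuracy per parameter setting:", model.avgMetrics)
```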
Comparing Hadoop and Spark

Spark is a Hadoop enhancement to MapReduce. The primary difference between Spark and MapReduce is that Spark processes and retains data in memory for subsequent steps, whereas MapReduce processes data on disk. As a result, for smaller workloads, Spark's data processing speeds are up to 100x faster than MapReduce.

Furthermore, as opposed to the two-stage execution process in MapReduce, Spark creates a Directed Acyclic Graph (DAG) to schedule tasks and orchestrate nodes across the Hadoop cluster. This task-tracking process enables fault tolerance, which reapplies recorded operations to data from a previous state.

Let's take a closer look at the key differences between Hadoop and Spark in six critical contexts:

1. Performance: Spark is faster because it uses random access memory (RAM) instead of reading and writing intermediate data to disk. Hadoop stores data on multiple sources and processes it in batches via MapReduce.
2. Cost: Hadoop runs at a lower cost since it relies on any disk storage type for data processing. Spark runs at a higher cost because it relies on in-memory computations for real-time data processing, which requires it to use high quantities of RAM to spin up nodes.
3. Processing: Though both platforms process data in a distributed environment, Hadoop is ideal for batch processing and linear data processing, while Spark is ideal for real-time processing and for processing live unstructured data streams.
4. Scalability: When data volume grows rapidly, Hadoop quickly scales to accommodate the demand via the Hadoop Distributed File System (HDFS). In turn, Spark relies on the fault-tolerant HDFS for large volumes of data.
5. Security: Spark enhances security with authentication via shared secret or event logging, whereas Hadoop uses multiple authentication and access control methods. Though Hadoop is more secure overall, Spark can integrate with Hadoop to reach a higher security level.
6. Machine learning (ML): Spark is the superior platform in this category because it includes MLlib, which performs iterative in-memory ML computations. It also includes tools that perform regression, classification, persistence, pipeline construction, evaluation, etc.
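As a small illustration of the in-memory reuse described above, the sketch below caches a dataset once and then runs two actions against it. It is only an illustrative example: the file path logs.txt is a placeholder, and a running Spark installation is assumed.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("cache-sketch").getOrCreate()

# Read once, then keep the parsed lines in executor memory.
logs = spark.read.text("logs.txt").cache()  # placeholder path

# Both actions reuse the cached data instead of re-reading the file,
# which is the behaviour contrasted with MapReduce's per-stage disk writes.
total_lines = logs.count()
error_lines = logs.filter(F.col("value").contains("ERROR")).count()
print(total_lines, error_lines)
```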
Discussion

Data mining has many advantages, as explained below:

1. Marketing/Retail
Marketing companies use data mining to create models based on historical data that forecast who will respond to new marketing campaigns such as direct mail, online marketing, etc. This means that marketers can sell profitable products to targeted customers.

2. Finance/Banking
Since data mining provides financial institutions with information about loans and credit reports, it can determine good or bad credit by building a model from historical customer data. It also helps banks detect fraudulent credit card transactions and thus protect credit card owners.

3. Researchers
Data mining can help researchers accelerate their data analysis, leaving them more time for other projects. Shopping behaviours can also be detected: new problems often arise while designing specific shopping patterns, and data mining is used to solve them. Mining methods can find all the information about these shopping patterns and also show where unexpected shopping patterns occur, which is beneficial once those patterns are identified.

4. Determining Customer Groups
Data mining is used to analyse customers' responses to marketing campaigns, and it also provides information for identifying customer groups. Surveys can be used to establish these new customer groups, and such surveys are themselves one form of data mining.

5. Increases Brand Loyalty
Mining techniques are used in marketing campaigns to understand customers' needs and habits, and on that basis customers keep choosing their brand's products. This technique helps a brand stand on its own, and it provides useful information when it comes to decisions.

6. Helps in Decision Making
People use data mining techniques to support decisions in marketing or business. Today, with this technology, all the relevant information can be determined, and one can make precise decisions even about what was previously unknown or unexpected.

7. Increases Company Revenue
Data mining is a process in which some kind of technology is involved. One must collect information on goods sold online; this eventually reduces the cost of products and services, which is one of the benefits of data mining.

8. To Predict Future Trends
All the information factors are part of the working nature of the system, and data mining systems can be derived from them. They can help predict future trends, and with the help of this technology this is entirely possible, allowing people to adapt their behaviour accordingly.

9. Increases Website Optimization
We use data mining to uncover all kinds of unseen information, and adding data mining helps to optimize a website. Likewise, data mining provides useful information to anyone who adopts the technology.

Conclusion

Data mining has many advantages for businesses and governments as well as for individuals. In this article, we have seen where data mining can be used efficiently.
References

http://eric.univ-lyon2.fr/~ricco/cours/slides/IntroDMDraft2002.pdf
[5] Y. Han, P. Brezany and I. Janciak (2009). "Cloud-Enabled Scalable Decision Tree Construction", Fifth International Conference on Semantics, Knowledge and Grid, ISBN: 978-0-7695-3810-5, pp. 128–135.
[6] J. Wang, J. Wan, Z. Liu and P. Wang (2010). "Data Mining of Mass Storage based on Cloud Computing", Ninth International Conference on Grid and Cloud Computing, ISBN: 978-1-4244-9334-0, pp. 426–431.
[7] T. G. Nair and K. L. Madhuri (2011). "Data Mining using Hierarchical Virtual K-means Approach Integrating Data Fragments in Cloud Computing Environment", IEEE CCIS, ISBN: 978-1-61284-203-5, pp. 230–234.
[8] T. Yang, B. Shia, J. Wei and K. Fang (2012). "Mass Data Analysis and Forecasting Based on Cloud Computing", Journal of Software, vol. 7, no. 10, October.
[9] N. R. Sheth and J. S. Shah (2012). "Implementing Parallel Data Mining Algorithm on High Performance Data Cloud", International Journal of Advanced Research in Computer Science and Electronics Engineering, Volume 1.
[10] J. Li, P. Roy, S. Khan, L. Wang and Y. Bai (2012). "Data Mining Using Clouds: An Experimental Implementation of Apriori over MapReduce", 12th International Conference on Scalable Computing and Communications (ScalCom), Changzhou, China, December.
[13] N. Mishra, S. Sharma and A. Pandey (2013). "High performance Cloud data mining algorithm and Data mining in Clouds", IOSR Journal of Computer Engineering (IOSRJCE), Volume 8, Issue 4.
[14] K. W. Lin and Yu-Chin Lo (2013). "Efficient algorithms for frequent pattern mining in many-task computing environments", Journal of Knowledge-Based Systems.
