Title: Data Mining

Summary: This paper aims to give a general overview of data mining and non-SQL databases.

Historical

The term "data mining" first appeared in the early 1960s and had a pejorative meaning at the time. Computers were increasingly used for all kinds of calculations that could not previously be done by hand. Some researchers began to process the data tables from surveys or experiments at their disposal without any prior statistical hypothesis. Observing that the results, far from being aberrant, were encouraging, they were led to systematize this opportunistic approach. However, official statisticians considered the approach unscientific and used the terms "data mining" or "data fishing" to criticize it. In France, this opportunistic attitude towards data coincided with the spread of data analysis to the general public; its promoters, such as Jean-Paul Benzécri, also faced criticism from members of the statistics community in the early days. Despite everything, the success of this empirical approach has not wavered. Data analysis has grown, and interest in it has increased along with the size of databases.

Towards the end of the 1980s, database researchers such as Rakesh Agrawal began to work on exploiting the content of large databases, such as those of supermarket receipts, convinced that they could add value to these masses of dormant data. They used the expression "database mining" but, since it had already been registered by a company (Database mining workstation), it was "data mining" that prevailed. In March 1989, Gregory Piatetsky-Shapiro proposed the term "knowledge discovery" at a workshop on knowledge discovery in databases. Currently, the terms data mining and knowledge discovery in databases (KDD, or ECD in French) are used more or less interchangeably. We will therefore use the expression "data mining", which is the most frequently used in the literature. The data mining community held its first conference in 1995, following numerous KDD workshops between 1989 and 1994. In 1998, a special chapter called ACM SIGKDD was created under the auspices of the ACM; it brings together the international KDD community. The first journal in the field, "Data Mining and Knowledge Discovery", published by Kluwer, was launched in 1997.
RELATED WORKS

Both data mining and cloud computing have received significant interest in recent years. Many works attempt to improve data mining techniques by using the capabilities of cloud computing.

In 2008, Christopher et al. [4] took the first steps in this field. They tried to scale up classifiers for cloud computing machines by comparing three classification techniques (decision trees, k-nearest neighbors and support vector machines) and evaluating their performance on distributed data. They worked with six different data sets (Protein, KDDCup, Alpha, Beta, Syn-SM and Syn-LG). Our work converges with and extends their research by using decision tree techniques and proposing an implementation of an abstraction classification method; it diverges from theirs by using alternative classification techniques and different datasets.

In 2009, Yuzhang et al. [5] improved decision tree construction (the SPRINT algorithm) with a service-oriented architecture based on the principles of cloud computing, using the Distributed Computational Service Cloud (DCSC). They provided a SPRINT model to handle user-defined datasets. Our work goes further by building a cloud computing model that distributes cross-validation tasks, and we use different datasets.

In 2010, Jianzong Wang et al. [6] worked on data mining of mass storage based on cloud computing by implementing a model that combines three techniques (Global Effect, K-NN and Restricted Boltzmann Machines) for mining the Netflix Prize dataset. They observed that Global Effect and K-NN worked very well but that RBM did not perform well. Our work is similar to theirs with regard to data mining based on cloud computing, but differs in implementing a new model for classification tasks.

In 2011, Gopalakrishnan and K. Lakshmi [7] proposed a new Hierarchical Virtual K-Means approach (HVKM) with two cloud computing models, PaaS and SaaS, for users who wish to provide Business Analysis as a Service. They used sample insurance data to test their approach. Our work converges with theirs in implementing the two cloud computing models for data mining purposes, but differs by developing a model for classification and prediction rather than simple clustering.

In 2012, Tong et al. [8] built a web application for data mining analysis in a forecasting service based on cloud computing. They called it Forecasting as a Service (FaaS), which provides forecasting services for users. They evaluated the performance of six data mining techniques (Logistic Regression, Time Series, ANN, Random Forest, SVM and MARS) on the SaaS model, using R, PHP and MySQL to analyze manufacturing data for industrial index information forecasting in Taiwan. From a technical point of view, our work is similar to theirs in building a prediction model that uses multiple prediction techniques, but we differ by using health care datasets for medical diagnosis purposes as opposed to an industrial index dataset.

In 2012, N. R. Sheth and J. S. Shah [9] implemented Apriori, one of the most popular association rule algorithms, improved on the MapReduce programming model to work on the Hadoop platform. They built an interface between Hadoop and the Sector file system (Sector/Sphere cloud system), which gives all Hadoop applications the ability to work on Sector data, and they observed a decline in performance on the Sector file system due to I/O and JNI overhead. We converge with their work on the data mining side of cloud computing, while our work differs from theirs by using prediction techniques rather than association rule algorithms.
In 2012, Juan and Pallavi [10] redesigned the sequential association rule algorithm Apriori from the original concept and applied it to work on MapReduce on the Amazon EC2 cloud, which provided a parallel computing platform. They used four different datasets (chess, mushroom, connect and T10I4D100K).

[…]

In 2012, Nandini and Saurabh [13] proposed a high-performance cloud data mining algorithm by improving the Apriori algorithm with a genetic algorithm approach to work on the Sector/Sphere cloud framework, and they used multi-transaction datasets to validate their model. Our approach is similar to theirs in developing a cloud model for data mining, but we differ by using alternative datasets in order to test classification techniques in multiple computing environments.

In 2013, Kawuu and Yu-Chin [14] presented four efficient association rule algorithms, Equal Working Set (EWS), Request On Demand (ROD), Small Size Working Set (SSWS) and Progressive Size Working Set (PSWS), to utilize the nodes of a cloud computing environment, with data from IBM's Quest synthetic data generator. They observed that the four algorithms are more scalable than the TPFP-tree and BTP-tree schemes. PSWS required only 12.2% and 18% of the execution time used by TPFP-tree and BTP-tree, respectively. Our work is similar to theirs in utilizing the resources of the cloud nodes to distribute the computation of data mining tasks, but our model is better developed for classification and prediction tasks rather than just association rules.

Spark and Hadoop

What is Apache Hadoop?

Apache Hadoop is an open-source software utility that allows users to manage big data sets (from gigabytes to petabytes) by enabling a network of computers (or "nodes") to solve vast and intricate data problems. It is a highly scalable, cost-effective solution that stores and processes structured, semi-structured and unstructured data (e.g., Internet clickstream records, web server logs, IoT sensor data, etc.).

Benefits of the Hadoop framework include the following:

- Data protection amid a hardware failure
- Vast scalability from a single server to thousands of machines
- Real-time analytics for historical analyses and decision-making processes

What is Apache Spark?

Apache Spark, which is also open source, is a data processing engine for big data sets. Like Hadoop, Spark splits up large tasks across different nodes. However, it tends to perform faster than Hadoop because it uses random access memory (RAM) to cache and process data instead of a file system. This enables Spark to handle use cases that Hadoop cannot.

Benefits of the Spark framework include the following:

- A unified engine that supports SQL queries, streaming data, machine learning (ML) and graph processing
- Can be 100x faster than Hadoop for smaller workloads via in-memory processing, disk data storage, etc.
- APIs designed for ease of use when manipulating semi-structured data and transforming data

The Hadoop ecosystem

Hadoop supports advanced analytics for stored data (e.g., predictive analysis, data mining, machine learning (ML), etc.). It enables big data analytics processing tasks to be split into smaller tasks. The small tasks are performed in parallel by using an algorithm (e.g., MapReduce), and are then distributed across a Hadoop cluster (i.e., nodes that perform parallel computations on big data sets).

The Hadoop ecosystem consists of four primary modules:

1. Hadoop Distributed File System (HDFS): Primary data storage system that manages large data sets running on commodity hardware. It also provides high-throughput data access and high fault tolerance.
2. Yet Another Resource Negotiator (YARN): Cluster resource manager that schedules tasks and allocates resources (e.g., CPU and memory) to applications.
3. Hadoop MapReduce: Splits big data processing tasks into smaller ones, distributes the small tasks across different nodes, then runs each task.
4. Hadoop Common (Hadoop Core): Set of common libraries and utilities that the other three modules depend on.
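To make the MapReduce module more concrete, the following is a minimal single-process sketch of the map, shuffle and reduce steps, using word counting as the task. It does not use the Hadoop API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are illustrative assumptions, and the cluster nodes are simulated by an ordinary loop.

```python
from collections import defaultdict
from itertools import chain


def map_phase(split):
    """Map step: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine the values collected for one key."""
    return key, sum(values)


def word_count(splits):
    # On a real cluster each split would be mapped on a different node;
    # here the "nodes" are simulated by a plain loop in one process.
    mapped = chain.from_iterable(map_phase(split) for split in splits)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input: two small text splits standing in for file blocks.
splits = ["big data needs big clusters", "data mining on clusters"]
counts = word_count(splits)
print(counts)
```

Because each split is mapped independently and each key is reduced independently, both steps can in principle be spread across many machines, which is the property the module description relies on.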
The Spark ecosystem

Apache Spark, the largest open-source project in data processing, is the only processing framework that combines data and artificial intelligence (AI). This enables users to perform large-scale data transformations and analyses, and then run state-of-the-art machine learning (ML) and AI algorithms.

The Spark ecosystem consists of five primary modules:

1. Spark Core: Underlying execution engine that schedules and dispatches tasks and coordinates input and output (I/O) operations.
2. Spark SQL: Gathers information about structured data to enable users to optimize structured data processing.
3. Spark Streaming and Structured Streaming: Both add stream processing capabilities. Spark Streaming takes data from different streaming sources and divides it into micro-batches for a continuous stream. Structured Streaming, built on Spark SQL, reduces latency and simplifies programming.
4. Machine Learning Library (MLlib): A set of machine learning algorithms for scalability plus tools for feature selection and building ML pipelines. The primary API for MLlib is DataFrames, which provides uniformity across different programming languages like Java, Scala and Python.
5. GraphX: User-friendly computation engine that enables interactive building, modification and analysis of scalable, graph-structured data.
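The micro-batch idea behind Spark Streaming can be illustrated with a small pure-Python sketch. This is not Spark code: the helper names and the sample readings are hypothetical, and batches are cut by record count for simplicity, whereas a streaming engine would normally cut them by time interval.

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Yield consecutive lists of at most batch_size records from an iterable."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def running_totals(stream, batch_size):
    """Process each micro-batch and keep a running aggregate across batches."""
    total = 0
    totals = []
    for batch in micro_batches(stream, batch_size):
        total += sum(batch)  # per-batch work, then update the carried state
        totals.append(total)
    return totals


# Hypothetical sensor readings arriving as a stream.
sensor_readings = [3, 1, 4, 1, 5, 9, 2]
batches = list(micro_batches(sensor_readings, 3))
totals = running_totals(sensor_readings, 3)
print(batches)
print(totals)
```

Each small batch is handled with ordinary batch logic, while the state carried between batches (here, a running total) gives the appearance of a continuous computation.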
Comparing Hadoop and Spark

Spark is a Hadoop enhancement to MapReduce. The primary difference between Spark and MapReduce is that Spark processes and retains data in memory for subsequent steps, whereas MapReduce processes data on disk. As a result, for smaller workloads, Spark's data processing speeds are up to 100x faster than MapReduce.

Furthermore, as opposed to the two-stage execution process in MapReduce, Spark creates a Directed Acyclic Graph (DAG) to schedule tasks and the orchestration of nodes across the Hadoop cluster. This task-tracking process enables fault tolerance, which reapplies recorded operations to data from a previous state.

Let's take a closer look at the key differences between Hadoop and Spark in six critical contexts:

1. Performance: Spark is faster because it uses random access memory (RAM) instead of reading and writing intermediate data to disks. Hadoop stores data on multiple sources and processes it in batches via MapReduce.
2. Cost: Hadoop runs at a lower cost since it relies on any disk storage type for data processing. Spark runs at a higher cost because it relies on in-memory computations for real-time data processing, which requires it to use high quantities of RAM to spin up nodes.
3. Processing: Though both platforms process data in a distributed environment, Hadoop is ideal for batch processing and linear data processing. Spark is ideal for real-time processing and processing live unstructured data streams.
4. Scalability: When data volume rapidly grows, Hadoop quickly scales to accommodate the demand via Hadoop Distributed File System (HDFS). In turn, Spark relies on the fault-tolerant HDFS for large volumes of data.
5. Security: Spark enhances security with authentication via shared secret or event logging, whereas Hadoop uses multiple authentication and access control methods. Though, overall, Hadoop is more secure, Spark can integrate with Hadoop to reach a higher security level.
6. Machine learning (ML): Spark is the superior platform in this category because it includes MLlib, which performs iterative in-memory ML computations. It also includes tools that perform regression, classification, persistence, pipeline construction, evaluation, etc.
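The comparison states that Spark keeps results in memory for later steps and recovers from failures by reapplying recorded operations. The sketch below illustrates that idea under simplifying assumptions: `LineageDataset` is a hypothetical toy class, not part of Spark, and it records a linear chain of operations instead of a full DAG.

```python
class LineageDataset:
    """Toy dataset that records its transformations instead of running them eagerly."""

    def __init__(self, source, operations=()):
        self._source = source                 # callable that re-reads the base data
        self._operations = tuple(operations)  # recorded lineage, in order
        self._cache = None                    # in-memory result, once computed

    def map(self, func):
        return LineageDataset(self._source, self._operations + (("map", func),))

    def filter(self, predicate):
        return LineageDataset(self._source, self._operations + (("filter", predicate),))

    def collect(self):
        """Return the result, replaying the lineage only when no cached copy exists."""
        if self._cache is None:
            data = list(self._source())
            for kind, func in self._operations:
                if kind == "map":
                    data = [func(x) for x in data]
                else:
                    data = [x for x in data if func(x)]
            self._cache = data
        return self._cache

    def lose_cache(self):
        """Simulate a node failure that wipes the in-memory copy."""
        self._cache = None


reads = []


def read_source():
    reads.append(1)  # count how often the base data has to be re-read
    return range(10)


squares_of_evens = (
    LineageDataset(read_source)
    .filter(lambda x: x % 2 == 0)
    .map(lambda x: x * x)
)

first = squares_of_evens.collect()      # computed once, then kept in memory
second = squares_of_evens.collect()     # served from memory, no re-read
squares_of_evens.lose_cache()           # simulated failure
recovered = squares_of_evens.collect()  # rebuilt by replaying the lineage
print(first, recovered, len(reads))
```

The second `collect()` costs nothing because the result is already in memory, and the simulated failure is repaired by replaying the recorded operations on the source rather than by restoring a copy from disk.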
Discussion

Data mining has many advantages, as explained below:

1. Marketing/Retail
Marketing companies use data mining to build models. Based on historical data, these models forecast who will respond to new marketing campaigns such as direct mail, online marketing, etc. This means that marketers can sell profitable products to targeted customers.

2. Finance/Banking
Because data mining gives financial institutions information on loans and credit reports, a model built on historical customers can distinguish good credit from bad. It also helps banks detect fraudulent credit card transactions, which protects the card owner.

3. Researchers
Data mining can speed up researchers' analysis of their data, leaving them more time for other projects. It can also detect shopping behaviours. New problems often arise when specific shopping patterns are being defined, and data mining is used to solve them: mining methods can find the information on these shopping patterns and also bring out unexpected ones. This is beneficial once shopping patterns have been identified.

4. Determining Customer Groups
Data mining is used to analyze how customers respond to marketing campaigns. It also provides information for identifying customer groups. Surveys, which are themselves a form of data mining, can be used to establish these new customer groups.

5. Increases Brand Loyalty
Mining techniques are used in marketing campaigns to understand customers' needs and habits. From that, customers can also choose their brand's clothes. A company can thus be more self-reliant with the help of this technique, although it only provides information that may support decisions.

6. Helps in Decision Making
People use data mining techniques to help them make decisions in marketing or business. With this technology, a great deal of information can be determined, including what was previously unknown or unexpected.

7. Increases Company Revenue
Data mining is a technology-driven process. By collecting information on goods sold online, a company can eventually reduce the cost of its products and services, which is one of the benefits of data mining.

8. To Predict Future Trends
All the information a system records reflects the way it works, and data mining systems can be built from it. They can help predict future trends and the behavioural changes that people adopt.

9. Increases Website Optimization
We use data mining to find all kinds of otherwise unseen information. Applying data mining to this information helps to optimize a website.

Conclusion

Data mining has many advantages for businesses, governments and individuals. In this article, we have seen areas where data mining can be used efficiently.

References

http://eric.univ-lyon2.fr/~ricco/cours/slides/IntroDM Draft2002.pdf
[5] Y. Han, P. Brezany and I. Janciak (2009). "Cloud-Enabled Scalable Decision Tree Construction", Fifth International Conference on Semantics, Knowledge and Grid, ISBN: 978-0-7695-3810-5, pp. 128–135.
[6] J. Wang, J. Wan, Z. Liu and P. Wang (2010). "Data Mining of Mass Storage based on Cloud Computing", Ninth International Conference on Grid and Cloud Computing, ISBN: 978-1-4244-9334-0, pp. 426–431.
[7] T. G. Nair and K. L. Madhuri (2011). "Data Mining using Hierarchical Virtual K-means Approach Integrating Data Fragments in Cloud Computing Environment", IEEE CCIS, ISBN: 978-1-61284-203-5, pp. 230–234.
[8] T. Yang, B. Shia, J. Wei and K. Fang (2012). "Mass Data Analysis and Forecasting Based on Cloud Computing", Journal of Software, vol. 7, no. 10, October.
[9] N. R. Sheth and J. S. Shah (2012). "Implementing Parallel Data Mining Algorithm on High Performance Data Cloud", International Journal of Advanced Research in Computer Science and Electronics Engineering, Volume 1.
[10] J. Li, P. Roy, S. Khan, L. Wang and Y. Bai (2012). "Data Mining Using Clouds: An Experimental Implementation of Apriori over MapReduce", 12th International Conference on Scalable Computing and Communications (ScalCom), Changzhou, China, December.
[13] N. Mishra, S. Sharma and A. Pandey (2013). "High performance Cloud data mining algorithm and Data mining in Clouds", IOSR Journal of Computer Engineering (IOSRJCE), Volume 8, Issue 4.
[14] K. W. Lin and Yu-Chin Lo (2013). "Efficient algorithms for frequent pattern mining in many-task computing environments", Journal of Knowledge-Based Systems.