You are on page 1of 3

REVIEW ARTICLE

A Review Paper On Data Mining And Big Data


Sakhr CHAIB
ISIMS SFAX

Abstract- In recent world, big data is very popular term. The main purpose of data mining of either classification or
Big data is generated from many sources such as social prediction. In classification, sorting a data into groups e.g.
media, digital images or videos and so on. The data marketers are only interested in those who responded or not
mining is very helpful for extracting useful information those who did not respond to promotion. In prediction, to
from big data.It is a process which finds useful patterns predict a value e.g. marketers are only interested in
from large amount of data There are so many techniques
predicting for those who responded in promotion only.
of data mining such as clustering, prediction, and
classification and decision tree available for solving the
problems of big data. In this paper, we present the
overview of data mining and big data issues, challenges II. ALGORITHMS USED IN DATA MINING FOR
and its solutions. BIG DATA

Keywords: Big data, data mining


A. Classification trees A popular data-mining
I. INTRODUCTION technique that is used to classify a dependent
categorical variable based on measurements of one
There is large amount of data is available in information or more predictor variables. The result is a tree
industry and it is exceeding day by day. The term data with nodes and links between the nodes that can be
mining is firstly originated in the year of 1990 before read to form if-then rules.
statisticians used terms like Data Fishing or Data Dredging. B. Logistic regression A statistical technique that is a
The most important purpose of data mining is to find the variant of standard regression but extends the
valuable information from large data sets. In an others concept to deal with classification. It produces a
words, data mining is a process of mining knowledge from formula that predicts the probability of the
data. Data mining used in many modern applications these occurrence as a function of the independent
days such as Market Analysis and Management, Corporate variables.
Analysis & Risk Management, Fraud Detection ,science C. Neural networks A software algorithm that is
exploration, sports, astrology, and Internet Web Surf-Aid modeled after the parallel architecture of animal
and so on. brains. The network consists of input nodes, hidden
Data mining is the process of exploration and analysis, by layers, and output nodes. Each unit is assigned a
automatic or semiautomatic means, of large quantities of weight. Data is given to the input node, and by a
data [1]. system of trial and error, the algorithm adjusts the
Data mining is used for exploring and analyzing large weights until it meets a certain stopping criteria.
amounts of data to find patterns for big data. The advent of Some people have likened this to a black–box
big data, the data mining is more prevalent. Four or five approach.
years ago, companies collected all data of transaction stored D. Clustering techniques like K-nearest neighbors
in a single database. Today, volume of data is collected have A technique that identifies groups of similar
expoled. Marketers can also collect information about every records. The K-nearest neighbor technique
conversation people are having about their brand. It requires calculates the distances between the record and
the implementation of new processes, technology and points in the historical (training) data. It then
governance mechanisms that are collectively being referred assigns this record to the class of its nearest
to as big data. Today, big data is a big business. neighbor in a data set.
We can define big data is a process that allows companies to
extract an information from large amount of data. Big data is
used data mining techniques because size of information is
larger.
III. TYPES OF DATA MINING SYSTEM follows
a) Poor data quality e.g. noisy data, dirty data and
Data mining systems can be categorized according to inadequate size of data.
various criteria the classification is as follows [5]: b) Redundant data is uploaded from various sources
such as multimedia files.
A. Classification of data mining systems according to c) Security, privacy of the companies
the type of data source mined: In an organization a d) Algorithm of data mining is not effective.
huge amount of data is available where we need to e) Difficult to processing an unstructured data into
classify these data but these are available most of structured data.
times in a similar fashion. We ask to sort out these f) Higher cost, less flexibility.
data according to its character (maybe
audio/picture, text formatting and so forth) C. Major challenges
B. Classification of data mining systems, according to a. Big Data Mining Platform
the data model: There are so many numbers of data b. Dig Data Semantics and Application
mining models (Relational data model, Object Knowledge
Model, Object Oriented Data Model, Hierarchical  Information Sharing and Data
data Model/W data model) are available and each Privacy
and every model we are practicing the different  Domain and Application
data. Agreeing to these data model the data mining Knowledge
system classifies the information. c. Big Data Mining Algorithm
C. Classification of data mining systems, according to  Local Learning and Model
the sort of knowledge discovered: This Fusion for Multiple Information
classification based on the variety of knowledge Sources
discovered or data mining functionalities, such as  mining from Sparse, Uncertain,
characterization, discrimination, association, and Incomplete Data
categorization, clustering, and so on some systems  Mining Complex and Dynamic
tend to be comprehensive systems offering several Data
data mining functionalities together.
D. Classification of data mining systems, according to V. SOLUTIONS
excavation techniques used: This classification is
according to the data analysis approach used such A. Hadoop: It is open-source software framework for
as machine learning, neural nets, genetic distributed storage of very large datasets on
algorithms, statistics, visualization, database computer clusters. Hadoop provides massive
oriented or data warehouse-oriented, and so on The amounts of storage for any kind of data, enormous
classification can also take into account the degree processing power and the ability to handle virtually
of user interaction involved in the data mining limitless concurrent tasks or jobs.Hadoop is widely
process such as query-driven systems, interactive used in industrial applications with Big Data,
exploratory systems, or autonomous systems. A including spam filtering, network searching, click
comprehensive system would offer a broad stream analysis, and social recommendation. To
assortment of data mining techniques to suit distribute its products and services, such as spam
different situations and options, and offer different filtering and searching, Yahoo has run Hardtop in
levels of user interaction. 42,000 servers at four data centers as of June 2012.
Currently, the largest Hadoop cluster contains
IV. ISSUES, CHALLENGES AND PROBLEMS 4,000 nodes, which is expected to increase to
OF BIG DATA IN DATA MINING 10,000 with the release of Hadop2.0 [2].
A. Problems
The main problems in big data has grown tremendously B. Cloudera
.This large amount of data is beyond the of software tools to
Cloudera is similar to hadoop with extra services. It is help
manage. The exploring a large amount of data, exacting a
in business, to allow people in the companies is easy to
useful information from data sets and knowledge is a
access the data from larger database. It also provides a data
challenge, sometimes it is a major problems. Also big data is
security which is highly important for storing sensitive and
unstructured, huge size and it is not easy to handle.
personal information.
B. Issues
The main issues of data mining in big data are
C. Monod CONCLUSION
Monodb is the modern approach to databases. It is very Today, all the IT professionals, engineers and researchers
good approach for managing data that changes frequently or are working on big data. Big data is term of concerning
data or unstructured. Common use cases include storing data about large volume of complex data sets. In order to solve
problems of big data challenges, many researchers proposed
for mobile apps, product catalogs, real-time personalization,
a different system models, techniques for big data. The high
content management and applications delivering a single performance computing paradigm is required for data
view across multiple systems. mining to solve the problem of big data. We conclude that
Map Reduce is the hub of Hadoop and is a programming there are still chances to improve the algorithms and
paradigm that enables mass scalability across numerous techniques for data mining .In this paper, big data are facing
servers in a Hadoop cluster. In this cluster, each server lots of challenges, issues and provide a solutions to handle
contains a set of internal disk drives that are inexpensive. To the big data.
enhance performance, Map Reduce assigns workloads to the
REFERENCES
servers in which the processed data are stored. Data
processing is scheduled based on the cluster nodes. A node [1] Xingquan Zhu, Ian Davidson, “Knowledge Discovery and Data
may be assigned a task that requires data foreign to that Mining: Challenges and Realities”, ISBN 978-1-59904-252,
Hershey, New York, 2007.
node.
[2] Chen M, Mao S, Liu Y. Big data: a survey. Mobile Networks
and Applications. 2014; 19(2):171–209.

You might also like