Seminar

On
BIG DATA MINING: A CHALLENGE
AND HOW TO MANAGE IT

Submitted To:
Submitted By: Dinesh and Jitender

INTRODUCTION
• Big Data is a new term used to identify datasets that, due to their large size and complexity, we cannot manage with our current methodologies or data mining software tools.
• Big Data mining is the capability of extracting useful information from these large datasets or streams of data, which, due to their volume, variability, and velocity, was not possible before.
• The Big Data challenge is becoming one of the most exciting opportunities for the coming years.
• We present in this issue a broad overview of the topic and the current status of Big Data mining.
• This paper presents the challenges, and the tools for managing heterogeneous information, at the frontier of Big Data mining research.

WHAT IS BIG DATA?
• 'Big Data' is similar to 'small data', but bigger in size.
• Having bigger data requires different approaches: techniques, tools, and architectures.
• The aim is to solve new problems, or old problems in a better way.
• Big Data generates value from the storage and processing of very large quantities of digital information that cannot be analyzed with traditional computing techniques.

WHAT IS BIG DATA?
• Walmart handles more than 1 million customer transactions every hour.
• Facebook handles 40 billion photos from its user base.

THREE CHARACTERISTICS OF BIG DATA: THE 3VS
• Volume: data quantity
• Velocity: data speed
• Variety: data types

1ST CHARACTERISTIC OF BIG DATA: VOLUME
• A typical PC might have had 10 gigabytes of storage in 2000.
• Today, Facebook ingests 500 terabytes of new data every day.
• Smartphones, and the data they create and consume, add to this volume.
• Sensors embedded into everyday objects will soon result in billions of new, constantly-updated data feeds containing environmental, location, and other information, including video.

2ND CHARACTERISTIC OF BIG DATA: VELOCITY (SPEED)
• High-frequency stock trading algorithms reflect market changes within microseconds.
• Machine-to-machine processes exchange data between billions of devices.
• Infrastructure and sensors generate massive log data in real time.
• On-line gaming systems support millions of concurrent users, each producing multiple inputs per second.

3RD CHARACTERISTIC OF BIG DATA: VARIETY (DATA TYPES: IMAGES, VIDEO, SOUND)
• Big Data isn't just numbers, dates, and strings. Big Data is also geospatial data, 3D data, audio and video, and unstructured text, including log files and social media.
• Traditional database systems were designed to address smaller volumes of structured data, fewer updates, or a predictable, consistent data structure.
• Big Data analysis includes different types of data.

PROCESSING BIG DATA
• Integrating disparate data stores
• Mapping data to the programming framework
• Connecting and extracting data from storage
• Transforming data for processing
• Subdividing data in preparation for Hadoop MapReduce
• Employing Hadoop MapReduce (see the word-count sketch below)
• Creating the components of Hadoop MapReduce jobs
• Distributing data processing across server farms
• Executing Hadoop MapReduce jobs
• Monitoring the progress of job flows
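To make the "Employing Hadoop MapReduce" step concrete, here is a minimal word-count sketch in the Hadoop Streaming style, where the mapper and reducer are ordinary scripts that read standard input and write tab-separated key/value pairs. The script name, sample data, and invocation are illustrative assumptions, not part of any particular deployment.

```python
#!/usr/bin/env python
# wordcount_streaming.py -- a minimal Hadoop Streaming-style job.
# Local test (simulating Hadoop's sort-based shuffle with `sort`):
#   cat input.txt | python wordcount_streaming.py map \
#                 | sort | python wordcount_streaming.py reduce
import sys

def mapper():
    # Emit one (word, 1) pair per word, tab-separated.
    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t1" % word.lower())

def reducer():
    # Hadoop Streaming delivers mapper output sorted by key, so
    # equal words arrive on consecutive lines and can be summed.
    current, total = None, 0
    for line in sys.stdin:
        word, n = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print("%s\t%d" % (current, total))
            current, total = word, 0
        total += int(n)
    if current is not None:
        print("%s\t%d" % (current, total))

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```

Under Hadoop Streaming the same script would be supplied as both the mapper and reducer commands; locally, the `sort` in the pipeline plays the role of the shuffle phase.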

THE STRUCTURE OF BIG DATA
• Structured: most traditional data sources
• Semi-structured: many sources of big data
• Unstructured: video data, audio data

WHY BIG DATA
• Growth of Big Data is driven by:
  – Increase of storage capacities
  – Increase of processing power
  – Availability of data (different data types)
• Every day we create 2.5 quintillion bytes of data; 90% of the data in the world today has been created in the last two years alone.

WHY BIG DATA
• Facebook generates 10 TB of data daily.
• Twitter generates 7 TB of data daily.
• IBM claims 90% of today's stored data was generated in just the last two years.

HOW IS BIG DATA DIFFERENT?
1) Automatically generated by a machine (e.g. a sensor embedded in an engine)
2) Typically an entirely new source of data (e.g. use of the internet)
3) Not designed to be friendly (e.g. text streams)
4) May not have much value; we need to focus on the important part

DATA GENERATION POINTS: EXAMPLES
• Mobile devices
• Microphones
• Readers/scanners
• Science facilities
• Programs/software
• Social media
• Cameras

BIG DATA ANALYTICS
• Examining large amounts of data
• Extracting appropriate information
• Identification of hidden patterns and unknown correlations
• Competitive advantage
• Better business decisions: strategic and operational
• Effective marketing, customer satisfaction, increased revenue

POTENTIAL VALUE OF BIG DATA
• $300 billion potential annual value to US health care
• $600 billion potential annual consumer surplus from using personal location data
• 60% potential increase in retailers' operating margins

INDIA – BIG DATA
• Gaining traction
• Huge market opportunities for IT services (82.9% of revenues) and analytics firms (17.1%)
• Current market size is $200 million; by 2015, $1 billion
• The opportunity for Indian service providers lies in offering services around Big Data implementation and analytics for global multinationals

BENEFITS OF BIG DATA
• Real-time big data isn't just a process for storing data in a data warehouse. It's about the ability to make better decisions and take meaningful actions at the right time.
• Fast forward to the present, and technologies like Hadoop give you the scale and flexibility to store data before you know how you are going to process it.
• Technologies such as MapReduce, Hive, and Impala enable you to run queries without changing the data structures underneath.

BENEFITS OF BIG DATA
• Our newest research finds that organizations are using big data to target customer-centric outcomes, tap into internal data, and build a better information ecosystem.
• Big Data is already an important part of the $64 billion database and data analytics market.
• It offers commercial opportunities of a scale comparable to enterprise software in the late 1980s, the Internet boom of the 1990s, and the social media explosion of today.

WHAT IS "BIG DATA"?
• "Big Data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization" (Gartner 2012)
• Complicated (intelligent) analysis of data may make small data "appear" to be "big"
• Bottom line: any data that exceeds our current capability of processing can be regarded as "big"

WHAT IS DATA MINING?
• Discovery of useful, possibly unexpected, patterns in data
• Extraction of implicit, previously unknown, and potentially useful information from data
• Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns

DATA MINING TASKS
• Classification [Predictive]
• Clustering [Descriptive]
• Association Rule Discovery [Descriptive]
• Sequential Pattern Discovery [Descriptive]
• Regression [Predictive]

CLASSIFICATION: DEFINITION
• Given a collection of records (the training set), each record contains a set of attributes; usually, one of the attributes is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
• A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
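As a small illustration of the train/test workflow just described, the following sketch builds a classifier on a synthetic dataset and validates it on held-out records. scikit-learn is an assumed dependency not mentioned in the text, and the dataset is generated, not real.

```python
# A minimal sketch of the classification workflow: split records
# into training and test sets, fit a model on the training set,
# and measure accuracy on previously unseen test records.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Records: each row is a set of attributes; y is the class attribute.
X, y = make_classification(n_samples=500, n_features=4, random_state=0)

# Divide the data: training set builds the model, test set validates it.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)

# Accuracy on unseen records estimates how well new records
# will be assigned a class.
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```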

CLUSTERING
[Figure: customer records plotted by income, education, and age, grouped into clusters]

K-MEANS CLUSTERING
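The original slide showed k-means only as a figure, so here is a minimal from-scratch sketch of the algorithm in Python (NumPy assumed, data synthetic): assign each point to its nearest centroid, move each centroid to the mean of its assigned points, and repeat until the centroids stop moving.

```python
# A from-scratch sketch of Lloyd's k-means algorithm on synthetic
# 2-D "customer" data (e.g. income/age-like features).
import numpy as np

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2))
                  for c in ([0, 0], [4, 0], [2, 3])])

def kmeans(points, k, iters=100):
    # Initialize centroids as k distinct data points.
    centroids = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        # Distance from every point to every centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :],
                               axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its points (keep the old
        # centroid if a cluster happens to be empty).
        new_centroids = np.array([
            points[labels == i].mean(axis=0) if np.any(labels == i)
            else centroids[i]
            for i in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

labels, centroids = kmeans(data, k=3)
print("cluster centers:\n", centroids)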

ASSOCIATION RULE MINING
• Input: market-basket sales data, with records of (transaction id, customer id, products bought).
• Trend: products p5 and p8 are often bought together.
• Trend: customer 12 likes product p9.
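A minimal Python sketch of the idea behind trends like "p5 and p8 are often bought together": count how often items and item pairs occur across transactions, then keep rules whose support and confidence clear a threshold. The transactions and thresholds below are made up for illustration.

```python
# Mining pairwise association rules from market-basket records:
# support(a,b) = fraction of baskets containing both items;
# confidence(a -> b) = P(b | a) estimated from counts.
from itertools import combinations
from collections import Counter

transactions = [
    {"p5", "p8", "p9"}, {"p5", "p8"}, {"p1", "p5", "p8"},
    {"p9", "p2"}, {"p5", "p8", "p2"}, {"p9", "p1"},
]
n = len(transactions)

item_counts, pair_counts = Counter(), Counter()
for basket in transactions:
    item_counts.update(basket)
    pair_counts.update(combinations(sorted(basket), 2))

MIN_SUPPORT, MIN_CONFIDENCE = 0.3, 0.6
for (a, b), count in pair_counts.items():
    support = count / n
    if support < MIN_SUPPORT:
        continue
    # Only the rule a -> b for the sorted pair is checked here;
    # a full miner would also test b -> a.
    confidence = count / item_counts[a]
    if confidence >= MIN_CONFIDENCE:
        print(f"{a} -> {b}  support={support:.2f} "
              f"confidence={confidence:.2f}")
```

On this toy data the rule p5 -> p8 emerges with support 0.50 and confidence 0.75, matching the "often bought together" trend on the slide.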

BIG VELOCITY
• Sensor tagging everything of value sends velocity through the roof (e.g. car insurance).
• Smartphones as a mobile platform send velocity through the roof.
• The state of multi-player internet games must be recorded, which sends velocity through the roof.

BIG DATA STANDARDIZATION CHALLENGES (1)
• Big Data use cases, definitions, vocabulary, and reference architectures (e.g. system, data, platforms, online/offline)
• Specifications and standardization of metadata, including data provenance
• Application models (e.g. batch, streaming)
• Query languages, including non-relational queries, to support diverse data types (XML, RDF, JSON, multimedia) and Big Data operations (e.g. matrix operations)
• Domain-specific languages
• Semantics of eventual consistency
• Advanced network protocols for efficient data transfer
• General and domain-specific ontologies and taxonomies for describing data semantics, including interoperation between ontologies
Source: ISO

BIG DATA STANDARDIZATION CHALLENGES (2)
• Big Data security and privacy access controls
• Remote, distributed, and federated analytics (taking the analytics to the data), including data and processing resource discovery and data mining
• Data sharing and exchange
• Data storage, e.g. memory storage systems, distributed file systems, data warehouses, etc.
• Human consumption of the results of big data analysis (e.g. visualization)
• Interfaces between relational (SQL) and non-relational (NoSQL) data stores
• Big Data quality and veracity description and management
Source: ISO

TOOLS FOR MANAGING BIG DATA
• Hadoop is an open-source framework from Apache that allows you to store and process big data in a distributed environment across clusters of computers using simple programming models.
• Hadoop is a large-scale distributed batch processing infrastructure. While it can be used on a single machine, its true power lies in its ability to scale to hundreds or thousands of computers, each with several processor cores. Hadoop is also designed to efficiently distribute large amounts of work across a set of machines.
• Challenges at large scale: performing large-scale computation is difficult. Working with this volume of data requires distributing parts of the problem to multiple machines to handle in parallel. Whenever multiple machines are used in cooperation with one another, the probability of failures rises. In a single-machine environment, failure is not something program designers explicitly worry about very often: if the machine has crashed, there is no way for the program to recover anyway.

R
• The R programming language is the preferred choice amongst data analysts and data scientists.
• There is no doubt that R is the most preferred programming tool for statisticians, data scientists, data analysts, and data architects, but it falls short when working with large datasets.
• One major drawback of R is that all objects are loaded into the main memory of a single machine: datasets of petabyte size cannot be loaded into RAM. To adapt to this in-memory, single-machine limitation, data scientists have to limit their analysis to a sample of the data from the large data set (see the sampling sketch below).
• This is where Hadoop integrated with the R language is an ideal solution. R and Hadoop were not natural friends, but with the advent of novel packages like RHadoop, RHIVE, and RHIPE, the two seemingly different technologies complement each other for big data analytics and visualization.
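As a language-neutral illustration of the sampling workaround mentioned above (shown in Python for consistency with the other sketches, although this section is about R), reservoir sampling draws a uniform fixed-size sample in a single pass over data far too large to hold in memory.

```python
# One-pass reservoir sampling (Algorithm R): keep a uniform random
# sample of k records from a stream or file too large to load into
# RAM -- the kind of sampling used to fit big data into a
# single-machine, in-memory tool.
import random

def reservoir_sample(stream, k, seed=0):
    rng = random.Random(seed)
    sample = []
    for i, record in enumerate(stream):
        if i < k:
            sample.append(record)          # fill the reservoir first
        else:
            # Replace an existing element with decreasing probability
            # so every record seen so far is equally likely to be kept.
            j = rng.randint(0, i)
            if j < k:
                sample[j] = record
    return sample

# Example over a generated "stream" of a million records.
print(reservoir_sample((f"record-{i}" for i in range(1_000_000)), k=5))
```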

STORM
• Storm is a distributed real-time computation system for processing large volumes of high-velocity data.
• Storm is extremely fast, with the ability to process over a million records per second per node on a cluster of modest size.
• Enterprises combine it with other data access applications in Hadoop to prevent undesirable events or to optimize positive outcomes.
• Some specific new business opportunities include: real-time customer service management, data monetization, operational dashboards, and cyber security analytics and threat detection.
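Storm topologies are written against its Java API; the following plain-Python sketch is not Storm code, but it illustrates the kind of continuous, per-record computation a Storm bolt performs: maintaining rolling counts over a sliding time window as records stream in.

```python
# A plain-Python sketch of per-record streaming computation (NOT
# Storm's API): keep rolling event counts over a sliding window,
# evicting expired events as new ones arrive.
import time
from collections import deque, Counter

WINDOW_SECONDS = 60
events = deque()     # (timestamp, key) pairs inside the window
counts = Counter()   # rolling count per key

def process(key, now=None):
    """Handle one incoming record and evict records that expired."""
    now = time.time() if now is None else now
    events.append((now, key))
    counts[key] += 1
    while events and events[0][0] < now - WINDOW_SECONDS:
        _, old = events.popleft()
        counts[old] -= 1

# Simulated stream of page-view records with synthetic timestamps.
for i, page in enumerate(["home", "cart", "home", "checkout"] * 5):
    process(page, now=float(i))
print(counts.most_common(3))
```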

APACHE MAHOUT
• Apache Mahout is a powerful, scalable machine-learning library that runs on top of Hadoop MapReduce.
• We are living in a day and age where information is available in abundance. The information overload has scaled to such heights that sometimes it becomes difficult to manage our little mailboxes! Imagine the volume of data and records some of the popular websites (the likes of Facebook, Twitter, and YouTube) have to collect and manage on a daily basis. It is not uncommon even for lesser-known websites to receive huge amounts of information in bulk.
• Normally we fall back on data mining algorithms to analyze bulk data, identify trends, and draw conclusions. However, no data mining algorithm can be efficient enough to process very large datasets and provide outcomes quickly unless the computational tasks are run on multiple machines distributed over the cloud.
• We now have new frameworks that allow us to break down a computation task into multiple segments and run each segment on a different machine. Mahout is such a data mining framework, normally running coupled with the Hadoop infrastructure in the background to manage huge volumes of data.

• Apache Mahout is an open source project that is primarily used for creating scalable machine learning algorithms. It implements popular machine learning techniques such as:
  – Recommendation (see the sketch below)
  – Classification
  – Clustering
• Apache Mahout started as a sub-project of Apache Lucene in 2008. In 2010, Mahout became a top-level project of Apache.
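As a toy illustration of the recommendation technique listed above, the following pure-Python sketch does user-based collaborative filtering: it scores items a user has not rated using the ratings of similar users. This shows the idea only; Mahout's actual recommenders are Java APIs designed to scale on Hadoop, and the ratings here are invented.

```python
# User-based collaborative filtering sketch: predict a user's score
# for unseen items as a similarity-weighted average of other users'
# ratings.
from math import sqrt

ratings = {  # made-up user -> {item: rating}
    "alice": {"i1": 5, "i2": 3, "i3": 4},
    "bob":   {"i1": 4, "i2": 3, "i4": 5},
    "carol": {"i2": 2, "i3": 5, "i4": 4},
}

def cosine(u, v):
    # Cosine similarity over the items both users rated.
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    return dot / (sqrt(sum(x * x for x in u.values())) *
                  sqrt(sum(x * x for x in v.values())))

def recommend(user, k=2):
    scores, weights = {}, {}
    for other, theirs in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], theirs)
        for item, r in theirs.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * r
                weights[item] = weights.get(item, 0.0) + sim
    ranked = sorted(((s / weights[i], i) for i, s in scores.items()
                     if weights[i] > 0), reverse=True)
    return ranked[:k]

print(recommend("alice"))  # e.g. predicts a high score for i4
```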

scalable. . distributed. pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data.APACHE S4  S4 is a general-purpose. fault-tolerant.

BIG DATA MINING TOOLS
• The Big Data phenomenon is intrinsically related to the open source software revolution. Large companies such as Facebook, Yahoo!, Twitter, and LinkedIn benefit from and contribute to open source projects. Big Data infrastructure deals with Hadoop and other related software, such as:

• Apache Hadoop: software for data-intensive distributed applications, based on the MapReduce programming model and a distributed file system called the Hadoop Distributed File System (HDFS). Hadoop allows writing applications that rapidly process large amounts of data in parallel on large clusters of compute nodes. A MapReduce job divides the input dataset into independent subsets that are processed by map tasks in parallel. This mapping step is then followed by a step of reduce tasks, which use the output of the maps to obtain the final result of the job (see the dataflow sketch below).
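The job structure just described can be mirrored in a few lines of Python. This in-memory simulation is only a sketch of the map -> shuffle -> reduce dataflow; real Hadoop distributes the splits, the shuffle, and the reduce tasks across a cluster backed by HDFS.

```python
# In-memory simulation of the MapReduce dataflow: independent map
# tasks over input splits, a shuffle that groups by key, and reduce
# tasks that turn each key's values into final output.
from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Map phase: each input split is processed independently.
    intermediate = []
    for split in inputs:
        intermediate.extend(map_fn(split))
    # Shuffle phase: group intermediate values by key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce phase: fold each key's values into the final result.
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Word count expressed in this model.
splits = ["big data mining", "mining big graphs", "data streams"]
result = run_mapreduce(
    splits,
    map_fn=lambda line: [(w, 1) for w in line.split()],
    reduce_fn=lambda word, ones: sum(ones),
)
print(result)  # e.g. {'big': 2, 'data': 2, 'mining': 2, ...}
```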

• Apache S4: platform for processing continuous data streams. S4 is designed specifically for managing data streams; S4 apps are built by combining streams and processing elements in real time.
• Storm: software for streaming data-intensive distributed applications, similar to S4, developed by Nathan Marz at Twitter.

In Big Data mining, there are many open source initiatives. The most popular are the following:
• Apache Mahout: scalable machine learning and data mining open source software based mainly on Hadoop. It has implementations of a wide range of machine learning and data mining algorithms: clustering, classification, collaborative filtering, and frequent pattern mining.
• R: open source programming language and software environment designed for statistical computing and visualization. R was designed by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, beginning in 1993, and is used for statistical analysis of very large data sets.

• MOA: stream data mining open source software to perform data mining in real time. It has implementations of classification, regression, clustering, frequent item set mining, and frequent graph mining. It started as a project of the Machine Learning group of the University of Waikato, New Zealand, famous for the WEKA software. The streams framework provides an environment for defining and running stream processes using simple XML-based definitions and is able to use MOA, Android, and Storm.
• SAMOA: a new upcoming software project for distributed stream mining that will combine S4 and Storm with MOA.

• Vowpal Wabbit: open source project started at Yahoo! Research and continuing at Microsoft Research to design a fast, scalable, useful learning algorithm. Via parallel learning, it can exceed the throughput of any single machine's network interface when doing linear learning.

MORE SPECIFIC TO BIG GRAPH MINING, WE FOUND THE FOLLOWING OPEN SOURCE TOOLS:
• Pegasus: big graph mining system built on top of MapReduce. It allows finding patterns and anomalies in massive real-world graphs (see the degree-distribution sketch below).
• GraphLab: high-level graph-parallel system built without using MapReduce. GraphLab computes over dependent records, which are stored as vertices in a large distributed data graph.
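As a small-scale illustration of the kind of pattern such systems mine at billion-edge scale, the following Python sketch computes a graph's degree distribution from an edge list; outliers in such distributions are a classic anomaly signal. The edge list is a toy example, and this is not Pegasus's MapReduce implementation.

```python
# Degree distribution of an undirected graph from an edge list --
# a classic graph-mining statistic whose outliers often flag
# anomalous nodes.
from collections import Counter

edges = [(1, 2), (1, 3), (1, 4), (2, 3), (5, 1), (5, 2)]  # toy graph

degree = Counter()
for src, dst in edges:      # undirected: count both endpoints
    degree[src] += 1
    degree[dst] += 1

# Distribution: how many nodes have each degree?
distribution = Counter(degree.values())
print("node degrees:", dict(degree))
print("degree distribution:", dict(distribution))
```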

REFERENCES
[1] Apache Hadoop, http://hadoop.apache.org.
[2] P. Zikopoulos, C. Eaton, D. deRoos, T. Deutsch, and G. Lapis. IBM Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data. McGraw-Hill Companies, Incorporated, 2011.
[3] L. Neumeyer, B. Robbins, A. Nair, and A. Kesari. S4: Distributed Stream Computing Platform. In ICDM Workshops, pages 170–177, 2010.
[4] Storm, http://storm-project.net.
[5] Apache Mahout, http://mahout.apache.org.
[6] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2012. ISBN 3-900051-07-0.
[7] A. Bifet, G. Holmes, R. Kirkby, and B. Pfahringer. MOA: Massive Online Analysis. Journal of Machine Learning Research (JMLR), 2010. http://moa.cms.waikato.ac.nz/.
[8] D. Laney. 3-D Data Management: Controlling Data Volume, Velocity and Variety. META Group Research Note, February 6, 2001.
[9] U. Kang, D. Chau, and C. Faloutsos. PEGASUS: Mining Billion-Scale Graphs in the Cloud. 2012.
[10] J. Gantz and D. Reinsel. IDC: The Digital Universe in 2020: Big Data, Bigger Digital Shadows, and Biggest Growth in the Far East. December 2012.

THANK YOU.