
BIG DATA ANALYTICS

Introduction
Data → Information → Insights

UNIT-1
Introduction to Big Data and Hadoop
Classification of Digital Data
• Structured Data
• Semi-Structured Data
• Unstructured Data
Structured Data
• Data conforms to a predefined schema/structure
• Most structured data is held in an RDBMS
• Cardinality and degree of a relation
• Steps to create a relation (a minimal sketch follows this list):
– Design the table schema
– Specify the constraints
– Relate tables in the RDBMS using referential integrity (foreign key) constraints
– Insert records into the relation
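A minimal sketch of these steps, using Python's built-in sqlite3 module purely for illustration (the table and column names are invented; a production system would typically use an RDBMS such as MySQL or PostgreSQL):

import sqlite3

# In-memory database standing in for a full RDBMS (illustrative only)
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # enforce referential integrity

# Steps 1 and 2: design the table schema and specify the constraints
conn.execute("""CREATE TABLE department (
    dept_id   INTEGER PRIMARY KEY,
    dept_name TEXT NOT NULL UNIQUE)""")

# Step 3: relate tables using a referential integrity (foreign key) constraint
conn.execute("""CREATE TABLE student (
    roll_no INTEGER PRIMARY KEY,
    name    TEXT NOT NULL,
    dept_id INTEGER REFERENCES department(dept_id))""")

# Step 4: insert records into the relations
conn.execute("INSERT INTO department VALUES (1, 'CSE')")
conn.execute("INSERT INTO student VALUES (101, 'Asha', 1)")

# Degree = number of attributes, cardinality = number of tuples
cur = conn.execute("SELECT * FROM student")
print("Degree:", len(cur.description))       # 3
print("Cardinality:", len(cur.fetchall()))   # 1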
Structured Data
• Sources:
– Oracle
– IBM DB2
– Microsoft SQL Server
– Teradata
– MySQL
– PostgreSQL
• Ease of Working:
– Manipulations
– Security
– Indexing
– Scalability
– Transaction Processing
Semi-Structured Data
• It is referred to as a self-describing structure
• It uses tags to segregate semantic elements
• For tags with the same set of attributes, the order need not be the same
• e.g., XML:
<bookdetails>
  <book category="CSE">
    <title>BDA</title>
    <author>Tom White</author>
  </book>
</bookdetails>
Semi-Structured Data
• Sources
– XML
– JSON (JavaScript Object Notation)
– Other markup languages
o XML describes the structure of data. It is used by web services developed using SOAP principles.
o SOAP (Simple Object Access Protocol) is a protocol specification for exchanging structured information in computer networks. It is like an envelope.
o It requires more bandwidth, extra overhead and more work at both ends.
o JSON is used to transmit data between a server and a web application. It is used by web services developed using REST principles (see the sketch below).
o REST (Representational State Transfer) can use both XML and JSON.
o It is like a postcard: lighter in weight and easier to update and cache.
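For comparison with the XML snippet above, the same book record could be serialised as JSON, the format typically exchanged by REST services. A minimal Python sketch (the structure mirrors the earlier example and is purely illustrative):

import json

# The same book record as the XML example, expressed as JSON
book_details = {
    "book": {
        "category": "CSE",
        "title": "BDA",
        "author": "Tom White"
    }
}
print(json.dumps(book_details, indent=2))  # payload a REST service might transmit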
Unstructured Data
• Data doesn’t conform to any pre-defined data model
• Sources
– Web pages, Images, Free-form text, Audio, Video, Body of email,
Chats, Social Media Data
• Data with some structure may still be labelled as unstructured if the structure doesn't help with the processing task at hand
• Techniques for dealing with unstructured data (a small sketch follows this list):
– Data mining: association rule mining, regression analysis, collaborative filtering
– Text mining
– Natural Language Processing (NLP)
– Noisy text analytics
– Manual tagging with metadata
– POS (Part-of-Speech) tagging
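As a small illustration of text mining over noisy, free-form text, the sketch below cleans a sample message and counts term frequencies using only the Python standard library (the sample text and cleaning rules are invented assumptions, not a prescribed method):

import re
from collections import Counter

# Noisy, free-form text such as a chat message or tweet (invented sample)
text = "Gr8 session on #BigData!!! hadoop & spark r awesome :) visit http://example.com"

# Noisy text analytics: drop URLs, hashtags/mentions and punctuation, normalise case
cleaned = re.sub(r"http\S+|[#@]\w+|[^\w\s]", " ", text).lower()

# Simple text mining: term-frequency counts over the cleaned tokens
tokens = cleaned.split()
print(Counter(tokens).most_common(5))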
Growth in Data
Introduction to Big Data
• Evolution of Big Data
Era | Data and technology | Focus
1970s and before | Primitive and structured data; mainframes: basic data storage | Data generation and storage
1980s and 1990s | Complex and relational data; relational DBs: data-intensive applications | Data utilization
2000s and beyond | Complex and unstructured data; structured, unstructured and multimedia data | Data driven
Introduction to Big Data
• Definition of Big Data (Doug Laney):
Big Data is high-volume, high-velocity and high-variety
information assets that demand cost-effective, innovative
forms of information processing for enhanced insight and
decision making.
Importance of Big Data:
1. Competitive Advantage
2. Decision Making
3. Value of Data
Introduction to Big Data
3V’s of Big Data
1) Volume: from bits and bytes to petabytes and exabytes
2) Velocity: batch processing → periodic processing → near-real-time processing → real-time processing
3) Variety: structured, semi-structured and unstructured data
Other V’s
1) Veracity
2) Validity
3) Volatility
4) Variability
Advantages of Big Data
Case Study of Big Data
Big Data Analytics
• Data Science
• Data Analytics
• Data Analysis
• Data Mining
Types of Analytics
1. Descriptive Analytics → What happened?
2. Diagnostic Analytics → Why did it happen?
3. Predictive Analytics → What will happen?
4. Prescriptive Analytics → How can we make it happen?
Big Data Analytics
It is the process of examining the large data sets of Big Data to
unearth hidden patterns, decipher unknown correlations,
understand the rationale behind market trends, recognize
customer preferences, and surface other useful business information.
Data → Information → Knowledge → Actionable Insights
(moving upward through Collecting, Organizing, Summarizing, Analyzing, Synthesizing and Decision Making)
Distributed Systems
History of Hadoop
• Hadoop is an open source project of the Apache Software Foundation.
• It is a framework written in Java, originally developed by Doug Cutting in 2005.
• Hadoop has its origin in Apache Nutch, an open source web search engine that was part of the Lucene project.

2002: Nutch was started, and a working crawler and search engine quickly emerged; it did not scale to the billions of pages on the web.
2003: Google described its distributed file system (GFS).
2004: Implementation of NDFS (the Nutch Distributed File System) started; Google introduced MapReduce.
2005: The Nutch developers had a working MapReduce implementation inside Nutch.
2006: Hadoop emerged as an independent subproject of Lucene.
2008: Hadoop became the fastest system to sort an entire terabyte of data.
2009: Yahoo! used Hadoop to sort 1 TB in 62 seconds.
2014: Databricks used a 207-node Spark cluster to sort 100 TB of data in 1,406 seconds.
RDBMS vs. Hadoop
Parameter | RDBMS | Hadoop
System | Relational Database Management System | Node-based flat structure
Data | Structured data | Structured, semi-structured and unstructured data
Processing | OLTP | OLAP, Big Data processing
Choice | Choose RDBMS when the data needs a consistent relationship | Does not require any consistent relationship between data
Processor | Needs expensive hardware to store huge volumes of data | In a Hadoop cluster, a node requires only a processor, a network card and a few hard drives
Cost | $10k to $14k per TB of storage | Around $4k per TB of storage
RDBMS vs. MapReduce

Parameter | RDBMS | MapReduce
Data Size | GB | PB
Access | Interactive and batch | Batch
Updates | Read and write many times | Write once, read many times
Transactions | ACID | None
Structure | Schema on write | Schema on read
Integrity | High | Low

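The "schema on write" versus "schema on read" row is worth a small illustration. An RDBMS validates every record against the table schema before storing it, whereas Hadoop stores raw files as they arrive and a schema is imposed only when a job reads them. A minimal Python sketch of schema on read (the field names are invented for illustration):

import csv, io

# Raw, line-oriented data stored exactly as it arrived (no schema enforced at write time)
raw = "101,Asha,CSE\n102,Ravi,ECE\n"

# Schema on read: structure is imposed only when the data is consumed
schema = ["roll_no", "name", "dept"]
for row in csv.reader(io.StringIO(raw)):
    record = dict(zip(schema, row))
    print(record["name"], record["dept"])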
Hadoop Streaming
• It is a generic API that allows writing Mappers and
Reducers in any language.
• It is a utility that comes with Hadoop distribution.
• It can be used to execute programs for Big Data
Analysis.
• Hadoop Streaming can be performed using languages such as
Python, Java, PHP, Scala, Perl and UNIX shell scripts (see the word-count sketch below).
• The utility creates a Map/Reduce job, submits the
job to an appropriate cluster and monitors the
progress of the job until completion.
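A minimal word-count sketch of Hadoop Streaming in Python: the mapper and reducer simply read standard input and write tab-separated key/value pairs to standard output (the file names are illustrative):

#!/usr/bin/env python3
# mapper.py - emits one "word<TAB>1" pair per word on standard output
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

#!/usr/bin/env python3
# reducer.py - input arrives sorted by key, so counts can be accumulated per word
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t")
    if word == current_word:
        count += int(value)
    else:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, int(value)
if current_word is not None:
    print(f"{current_word}\t{count}")

The job could then be submitted with the streaming jar that ships with Hadoop, along the lines of: hadoop jar hadoop-streaming-*.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input /input -output /output (the exact jar path depends on the Hadoop distribution).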
Analyzing Data with UNIX Tools
There are many challenges in processing line-oriented data, especially when trying to speed it up by running pieces in parallel (a single-machine sketch follows this list):
1. Speed of execution
2. Deciding on the size of each chunk of work
3. Time for completion of the task when some tasks depend on others
4. Combining the results of independent tasks
5. A manager is needed for running the overall job
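A single-machine baseline makes these challenges concrete. The sketch below scans line-oriented records sequentially in one Python process (the station/temperature records are invented for illustration):

# max_temperature.py - sequential, single-process pass over line-oriented records
records = [
    "station-1\t22",
    "station-2\t31",
    "station-1\t27",
]

max_temp = {}
for line in records:   # one process scans every line: speed of execution is the first bottleneck
    station, temp = line.split("\t")
    max_temp[station] = max(int(temp), max_temp.get(station, int(temp)))

print(max_temp)        # {'station-1': 27, 'station-2': 31}

To speed this up, the input would be split into chunks processed in parallel, the per-chunk results combined, and a manager process would schedule the chunk jobs and handle failures; this is exactly the coordination that Hadoop's MapReduce framework automates.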
Analyzing Data with Hadoop
Big Data Analytics Pipeline
Hadoop Ecosystem