
BIG DATA ANALYTICS

Introduction
Data → Information → Insights

UNIT-1
Introduction to Big Data and Hadoop
Classification of Digital Data
• Structured Data
• Semi-Structured Data
• Unstructured Data
Structured Data
• Data conforms to a predefined schema/structure
• Most structured data is held in an RDBMS
• Cardinality and degree of a relation
• Steps to create a relation (a minimal sketch follows this list):
– Design the table schema
– Specify the constraints
– Relate tables in the RDBMS using referential integrity (foreign key) constraints
– Insert records into the relation
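A minimal sketch of these steps, using Python's built-in sqlite3 module purely for illustration (the table and column names are invented; a production system would typically use an RDBMS such as MySQL or PostgreSQL):

import sqlite3

# In-memory database standing in for a full RDBMS (illustrative only)
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # enforce referential integrity

# Steps 1 and 2: design the table schema and specify the constraints
conn.execute("""CREATE TABLE department (
    dept_id   INTEGER PRIMARY KEY,
    dept_name TEXT NOT NULL UNIQUE)""")

# Step 3: relate tables using a referential integrity (foreign key) constraint
conn.execute("""CREATE TABLE student (
    roll_no INTEGER PRIMARY KEY,
    name    TEXT NOT NULL,
    dept_id INTEGER REFERENCES department(dept_id))""")

# Step 4: insert records into the relations
conn.execute("INSERT INTO department VALUES (1, 'CSE')")
conn.execute("INSERT INTO student VALUES (101, 'Asha', 1)")

# Degree = number of attributes, cardinality = number of tuples
cur = conn.execute("SELECT * FROM student")
print("Degree:", len(cur.description))       # 3
print("Cardinality:", len(cur.fetchall()))   # 1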
Structured Data
• Sources:
– Oracle
– IBM DB2
– Microsoft SQL Server
– Teradata
– MySQL
– PostgreSQL
• Ease of Working:
– Manipulations
– Security
– Indexing
– Scalability
– Transaction Processing
Semi-Structured Data
• It is referred to as a self-describing structure
• It uses tags to segregate semantic elements
• For tags with the same set of attributes, the order need not be the same
• e.g., XML:
<bookdetails>
  <book category="CSE">
    <title>BDA</title>
    <author>Tom White</author>
  </book>
</bookdetails>
Semi-Structured Data
• Sources
– XML
– JSON (JavaScript Object Notation)
– Other markup languages
o XML describes the structure of data. It is used by web services developed using SOAP principles.
o SOAP (Simple Object Access Protocol) is a protocol specification for exchanging structured information in computer networks. It is like an envelope.
o It requires more bandwidth, extra overhead and more work at both ends.
o JSON is used to transmit data between a server and a web application. It is used by web services developed using REST principles (see the sketch below).
o REST (Representational State Transfer) can use both XML and JSON.
o It is like a postcard: lighter in weight and easier to update and cache.
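For comparison with the XML snippet above, the same book record could be serialised as JSON, the format typically exchanged by REST services. A minimal Python sketch (the structure mirrors the earlier example and is purely illustrative):

import json

# The same book record as the XML example, expressed as JSON
book_details = {
    "book": {
        "category": "CSE",
        "title": "BDA",
        "author": "Tom White"
    }
}
print(json.dumps(book_details, indent=2))  # payload a REST service might transmit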
Unstructured Data
• Data doesn’t conform to any pre-defined data model
• Sources
– Web pages, Images, Free-form text, Audio, Video, Body of email,
Chats, Social Media Data
• Data with some structure may still be labelled as unstructured if the structure doesn't help with the processing task at hand
• Techniques for dealing with unstructured data (a small sketch follows this list):
– Data mining: association rule mining, regression analysis, collaborative filtering
– Text mining
– Natural Language Processing (NLP)
– Noisy text analytics
– Manual tagging with metadata
– POS (Part-of-Speech) tagging
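As a small illustration of text mining over noisy, free-form text, the sketch below cleans a sample message and counts term frequencies using only the Python standard library (the sample text and cleaning rules are invented assumptions, not a prescribed method):

import re
from collections import Counter

# Noisy, free-form text such as a chat message or tweet (invented sample)
text = "Gr8 session on #BigData!!! hadoop & spark r awesome :) visit http://example.com"

# Noisy text analytics: drop URLs, hashtags/mentions and punctuation, normalise case
cleaned = re.sub(r"http\S+|[#@]\w+|[^\w\s]", " ", text).lower()

# Simple text mining: term-frequency counts over the cleaned tokens
tokens = cleaned.split()
print(Counter(tokens).most_common(5))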
Growth in Data
Introduction to Big Data
• Evolution of Big Data
Era | Data and technology | Focus
1970s and before | Primitive and structured data; mainframes: basic data storage | Data generation and storage
1980s and 1990s | Complex and relational data; relational DBs: data-intensive applications | Data utilization
2000s and beyond | Complex and unstructured data; structured, unstructured and multimedia data | Data driven
Introduction to Big Data
• Definition of Big Data (Doug Laney):
Big Data is high-volume, high-velocity and high-variety
information assets that demand cost-effective, innovative
forms of information processing for enhanced insight and
decision making.
Importance of Big Data:
1. Competitive Advantage
2. Decision Making
3. Value of Data
Introduction to Big Data
3V’s of Big Data
1) Volume: from bits and bytes to petabytes and exabytes
2) Velocity: batch processing → periodic processing → near-real-time processing → real-time processing
3) Variety: structured, semi-structured and unstructured data
Other V’s
1) Veracity
2) Validity
3) Volatility
4) Variability
Advantages of Big Data
Case Study of Big Data
Big Data Analytics
• Data Science
• Data Analytics
• Data Analysis
• Data Mining
Types of Analytics
1. Descriptive Analytics → What happened?
2. Diagnostic Analytics → Why did it happen?
3. Predictive Analytics → What will happen?
4. Prescriptive Analytics → How can we make it happen?
Big Data Analytics
It is the process of examining the large data sets of Big Data to
unearth hidden patterns, decipher unknown correlations,
understand the rationale behind market trends, recognize
customer preferences, and surface other useful business information.
Data → Information → Knowledge → Actionable Insights
(moving upward through Collecting, Organizing, Summarizing, Analyzing, Synthesizing and Decision Making)
Distributed Systems
History of Hadoop
• Hadoop is an open source project of the Apache Software Foundation.
• It is a framework written in Java, originally developed by Doug Cutting in 2005.
• Hadoop has its origin in Apache Nutch, an open source web search engine that was part of the Lucene project.

2002: Nutch was started, and a working crawler and search engine quickly emerged; it did not scale to the billions of pages on the web.
2003: Google described its distributed file system (GFS).
2004: Implementation of NDFS (the Nutch Distributed File System) started; Google introduced MapReduce.
2005: The Nutch developers had a working MapReduce implementation inside Nutch.
2006: Hadoop emerged as an independent subproject of Lucene.
2008: Hadoop became the fastest system to sort an entire terabyte of data.
2009: Yahoo! used Hadoop to sort 1 TB in 62 seconds.
2014: Databricks used a 207-node Spark cluster to sort 100 TB of data in 1,406 seconds.
RDBMS vs. Hadoop
Parameter | RDBMS | Hadoop
System | Relational Database Management System | Node-based flat structure
Data | Structured data | Structured, semi-structured and unstructured data
Processing | OLTP | OLAP, Big Data processing
Choice | Choose RDBMS when the data needs a consistent relationship | Does not require any consistent relationship between data
Processor | Needs expensive hardware to store huge volumes of data | In a Hadoop cluster, a node requires only a processor, a network card and a few hard drives
Cost | $10k to $14k per TB of storage | Around $4k per TB of storage
RDBMS vs. MapReduce

Parameter | RDBMS | MapReduce
Data Size | GB | PB
Access | Interactive and batch | Batch
Updates | Read and write many times | Write once, read many times
Transactions | ACID | None
Structure | Schema on write | Schema on read
Integrity | High | Low

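The "schema on write" versus "schema on read" row is worth a small illustration. An RDBMS validates every record against the table schema before storing it, whereas Hadoop stores raw files as they arrive and a schema is imposed only when a job reads them. A minimal Python sketch of schema on read (the field names are invented for illustration):

import csv, io

# Raw, line-oriented data stored exactly as it arrived (no schema enforced at write time)
raw = "101,Asha,CSE\n102,Ravi,ECE\n"

# Schema on read: structure is imposed only when the data is consumed
schema = ["roll_no", "name", "dept"]
for row in csv.reader(io.StringIO(raw)):
    record = dict(zip(schema, row))
    print(record["name"], record["dept"])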
Hadoop Streaming
• It is a generic API that allows writing Mappers and
Reducers in any language.
• It is a utility that comes with Hadoop distribution.
• It can be used to execute programs for Big Data
Analysis.
• Hadoop Streaming can be performed using languages such as
Python, Java, PHP, Scala, Perl and UNIX shell scripts (see the word-count sketch below).
• The utility creates a Map/Reduce job, submits the
job to an appropriate cluster and monitors the
progress of the job until completion.
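A minimal word-count sketch of Hadoop Streaming in Python: the mapper and reducer simply read standard input and write tab-separated key/value pairs to standard output (the file names are illustrative):

#!/usr/bin/env python3
# mapper.py - emits one "word<TAB>1" pair per word on standard output
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

#!/usr/bin/env python3
# reducer.py - input arrives sorted by key, so counts can be accumulated per word
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t")
    if word == current_word:
        count += int(value)
    else:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, int(value)
if current_word is not None:
    print(f"{current_word}\t{count}")

The job could then be submitted with the streaming jar that ships with Hadoop, along the lines of: hadoop jar hadoop-streaming-*.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input /input -output /output (the exact jar path depends on the Hadoop distribution).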
Analyzing Data with UNIX Tools
There are many challenges in processing line-oriented data, especially when trying to speed it up by running pieces in parallel (a single-machine sketch follows this list):
1. Speed of execution
2. Deciding on the size of each chunk of work
3. Time for completion of the task when some tasks depend on others
4. Combining the results of independent tasks
5. A manager is needed for running the overall job
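A single-machine baseline makes these challenges concrete. The sketch below scans line-oriented records sequentially in one Python process (the station/temperature records are invented for illustration):

# max_temperature.py - sequential, single-process pass over line-oriented records
records = [
    "station-1\t22",
    "station-2\t31",
    "station-1\t27",
]

max_temp = {}
for line in records:   # one process scans every line: speed of execution is the first bottleneck
    station, temp = line.split("\t")
    max_temp[station] = max(int(temp), max_temp.get(station, int(temp)))

print(max_temp)        # {'station-1': 27, 'station-2': 31}

To speed this up, the input would be split into chunks processed in parallel, the per-chunk results combined, and a manager process would schedule the chunk jobs and handle failures; this is exactly the coordination that Hadoop's MapReduce framework automates.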
Analyzing Data with Hadoop
Big Data Analytics Pipeline
Hadoop Ecosystem