Introduction
Data Information Insights
UNIT-1
Introduction to Big Data and Hadoop
Classification of Digital Data
• Structured Data
• Semi-Structured Data
• Unstructured Data
Structured Data
• Data conforms to a predefined schema/structure
• Most of the structured data is held in RDBMS
• Cardinality (number of rows) and degree (number of columns) of a relation
• Steps to create a relation:
– Design the table schema
– Specify the constraints
– Relate tables in the RDBMS using referential integrity constraints
– Insert records into the relation
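The steps above can be sketched with Python's built-in sqlite3 module. This is only an illustration: the department and student tables, their columns, and the sample rows are hypothetical and do not come from the slides.

```python
import sqlite3

# In-memory database; all table, column, and row values are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

# Design the schema and specify the constraints.
conn.execute("""
    CREATE TABLE department (
        dept_id INTEGER PRIMARY KEY,
        name    TEXT NOT NULL UNIQUE
    )
""")
# Relate the tables with a referential integrity (foreign key) constraint.
conn.execute("""
    CREATE TABLE student (
        student_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        dept_id    INTEGER NOT NULL REFERENCES department(dept_id)
    )
""")

# Insert records into the relations.
conn.execute("INSERT INTO department VALUES (1, 'CSE')")
conn.executemany(
    "INSERT INTO student VALUES (?, ?, ?)",
    [(101, "Asha", 1), (102, "Ravi", 1)],
)

# Degree = number of attributes (columns); cardinality = number of tuples (rows).
cursor = conn.execute("SELECT * FROM student")
degree = len(cursor.description)
cardinality = len(cursor.fetchall())

# A row pointing at a non-existent department violates referential integrity.
try:
    conn.execute("INSERT INTO student VALUES (103, 'Meena', 99)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(degree, cardinality, rejected)
```

Here the student relation has degree 3 and cardinality 2, and the database rejects the row that refers to a missing department.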
Structured Data
• Sources:
– Oracle
– IBM DB2
– Microsoft SQL Server
– Teradata
– MySQL
– PostgreSQL
• Ease of Working:
– Manipulations
– Security
– Indexing
– Scalability
– Transaction Processing
Semi Structured Data
• It is also referred to as a self-describing structure
• It uses tags to segregate semantic elements
• For tags with the same set of attributes, the order of the attributes
need not be the same
• E.g., XML
<bookdetails>
<book category="CSE">
<title>BDA</title>
<author>Tom White</author>
</book>
</bookdetails>
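A small sketch of how such a self-describing document might be read, using Python's standard xml.etree.ElementTree. The second book and the shelf attribute are made up purely to show that attribute order does not matter.

```python
import xml.etree.ElementTree as ET

# Two <book> elements carry the same attributes in a different order.
# The second book and the "shelf" attribute are hypothetical.
doc = """
<bookdetails>
  <book category="CSE" shelf="A1">
    <title>BDA</title>
    <author>Tom White</author>
  </book>
  <book shelf="B2" category="ECE">
    <title>Signals</title>
    <author>A. Author</author>
  </book>
</bookdetails>
"""

root = ET.fromstring(doc)

# Tags and attribute names describe the data, so fields are looked up by name,
# not by position.
books = [
    {
        "category": book.get("category"),
        "shelf": book.get("shelf"),
        "title": book.findtext("title"),
        "author": book.findtext("author"),
    }
    for book in root.findall("book")
]

for entry in books:
    print(entry)
```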
Semi Structured Data
• Sources
– XML
– JSON (JavaScript Object Notation)
– Other Mark-up Languages
o XML describes the structure of data. It is used by web services developed
using SOAP principles.
o Simple Object Access Protocol (SOAP) is a protocol specification for exchanging
structured information in computer networks. It is like an envelope.
o SOAP requires more bandwidth, carries extra overhead, and needs more work at both ends.
o JSON is used to transmit data between a server and a web application. It is
used by web services developed using REST principles.
o Representational State Transfer (REST) can use both XML and JSON.
o REST is like a postcard: lighter in weight and easier to update and cache.
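To make the two payload formats concrete, the sketch below serializes one record as JSON and as XML using only Python's standard library. It does not implement SOAP or REST; a real SOAP message would also wrap the XML in an envelope. The record reuses the book example from the earlier slide.

```python
import json
import xml.etree.ElementTree as ET

record = {"category": "CSE", "title": "BDA", "author": "Tom White"}

# JSON form, as a REST-style service might return it.
json_text = json.dumps(record)

# XML form of the same record, as it might appear inside a SOAP body.
book = ET.Element("book", category=record["category"])
ET.SubElement(book, "title").text = record["title"]
ET.SubElement(book, "author").text = record["author"]
xml_text = ET.tostring(book, encoding="unicode")

# Both representations round-trip to the same data.
from_json = json.loads(json_text)
parsed = ET.fromstring(xml_text)
from_xml = {
    "category": parsed.get("category"),
    "title": parsed.findtext("title"),
    "author": parsed.findtext("author"),
}

print(json_text)
print(xml_text)
```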
Unstructured Data
• Data doesn’t conform to any pre-defined data model
• Sources
– Web pages, Images, Free-form text, Audio, Video, Body of email,
Chats, Social Media Data
• Data with some structure may still be labelled as unstructured if the
structure doesn’t help with the processing task at hand
• Techniques for Unstructured Data:
– Data Mining: Association Rule Mining, Regression Analysis, Collaborative
Filtering
– Text Mining
– Natural Language Processing (NLP)
– Noisy Text Analytics
– Manual tagging with metadata
– Part-of-speech tagging (POST)
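As one very small illustration of text mining on free-form text, the sketch below counts term frequencies with Python's standard library. The sample messages and the stop-word list are hypothetical, and real text-mining pipelines do far more than this.

```python
import re
from collections import Counter

# Hypothetical free-form text, e.g. chat messages or email bodies.
messages = [
    "Hadoop stores big data across many machines",
    "Big data needs distributed storage",
    "hadoop processes data in parallel",
]

# Illustrative stop-word list; real pipelines use much larger ones.
STOP_WORDS = {"in", "across", "many", "needs"}

def term_frequencies(texts):
    """Lower-case each text, split it into alphabetic tokens,
    drop stop words, and count the remaining terms."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts

freq = term_frequencies(messages)
print(freq.most_common(3))
```

Counting terms imposes a simple structure (term, frequency) on text that had none, which is the first step of many text-mining workflows.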
Growth in Data
Introduction to Big Data
• Evolution of Big Data
– 1970s and before
– 1980s and 1990s
– 2000s and beyond
Data Science
Data Analytics
Data Analysis
Data Mining
Types of Analytics
1. Descriptive Analytics: What happened?
2. Diagnostic Analytics: Why did it happen?
3. Predictive Analytics: What will happen?
4. Prescriptive Analytics: How can we make it happen?
Big Data Analytics
It is the process of examining large data sets to unearth hidden
patterns, decipher unknown correlations, understand the rationale
behind market trends, recognize customer preferences, and uncover
other useful business information.
Figure (pyramid, bottom to top): Data; Collecting; Organizing; Information;
Summarizing; Analyzing; Knowledge; Synthesizing; Decision Making; Actionable Insights
Distributed Systems
History of Hadoop
Hadoop is an open-source project of the Apache Software Foundation.
It is a framework written in Java, originally developed by Doug Cutting in 2005.
Hadoop has its origins in Apache Nutch, an open-source web search engine that is
part of the Lucene project.
Data Size GB PB