
RV College of Engineering
Go, change the world

Big Data Analytics
16CS7F1
Prof. Mamatha T

Syllabus Overview

Unit 1 - Introduction to Big Data
Unit 2 - Hadoop & Hadoop Architecture
Unit 3 - Hadoop Ecosystem & YARN
Unit 4 - Hive & Pig
Unit 5 - Cassandra Basics
Course Outcomes

After completing the course, the students will be able to:

CO1. Understand and explore the concepts of big data analytics.
CO2. Analyze MapReduce concepts to solve complex problems.
CO3. Review and explore the use of different Hadoop ecosystem components.
CO4. Apply big data analytics techniques using Hive, Pig, and Cassandra for querying big datasets.

UNIT 1
Big Data Introduction
• Data: Sources, Types, Characteristics, Issues
• Big Data: Evolution, the Four V's of Big Data, Comparing Big Data with Traditional Systems
• CAP Theorem
• Types of Architecture: Parallel, Distributed, Shared-Nothing
• Classification of Analytics
• https://youtu.be/bAyrObl7TYE
UNIT 1
Big Data Introduction
UNIT 1
Big Data Application Areas

How Brands Are Using Big Data Analytics

Big data analytics involves examining large amounts of data to uncover hidden patterns and correlations, and to provide insights that support sound business decisions.

Example of a brand that uses big data for targeted adverts

Netflix is a good example of a big brand that uses big data analytics for targeted advertising. With over 100 million subscribers, the company collects a huge amount of data. Netflix suggests the next movie you should watch based on your past search and viewing data; this data gives it insight into what interests each subscriber most.

Facts of Big Data

• By 2020, each person is expected to generate 1.7 megabytes of data every second.
• Total data is expected to reach 40 trillion gigabytes by 2020; the size of big data in 2010 was 1.2 zettabytes.
• A 2017 IBM study showed that 90% of the world's data had been created in the preceding two years. This sudden surge of data can be attributed to the stupendous growth of the internet: internet users, only 2.5 billion in 2012, reached 3 billion in 2014 and touched 4.1 billion in 2019.
• Netflix saves $1 billion every year by channelling big data toward customer retention, using it to improve the user experience through personalized recommendations.
• 79% of executives consider that failing to adopt big data is tantamount to embracing bankruptcy.

Unit 2
Hadoop

• Hadoop Architecture
• Hadoop Storage: HDFS
• Common Hadoop Shell Commands
• Anatomy of File Write and Read
• Map and Reduce Programming

Unit 3
Hadoop Ecosystem & YARN
• Spark Architecture
• Flume Architecture
• New Features of Hadoop 2.x
• Comparison of Hadoop 1.x and Hadoop 2.x
• Workflow of the Hadoop 2.x Architecture

Unit 4
Hive & PIG
• Hive Architecture
• Comparison with Traditional Databases
• HiveQL - Querying Data:
  - Sorting and Aggregating
  - MapReduce Scripts
  - Subqueries
• Pig Latin - Structure, Statements, Expressions, Types, Schemas, Functions, Macros, User-Defined Functions

Unit 5
Cassandra Basics
• Apache Cassandra - An Introduction, Features of Cassandra
• CQL Data Types, CQLSH, Keyspaces
• CRUD - Insert, Update, Delete, Select
• Alter - Alter Table to:
  - Change the Data Type of a Column
  - Delete a Column
  - Drop a Table
  - Drop a Database
• Collections - Set, List, Map
• Import and Export CSV
• Import from STDIN, Export to STDOUT
Reference Books

1. Tom White, Hadoop: The Definitive Guide, 4th Edition, O'Reilly, 2015, ISBN: 978-144936107.
2. Seema Acharya and Subhashini C, Big Data and Analytics, 1st Edition, Wiley India Private Limited, 2015, ISBN: 978-8126554782.
3. Frank J Ohlhorst, Big Data Analytics: Turning Big Data into Big Money, Wiley and SAS Business Series, 2012 Edition, ISBN: 978-1118147597.
4. Anand Rajaraman and Jeffrey David Ullman, Mining of Massive Datasets, 2nd Edition, Cambridge University Press, 2017, ISBN-13: 978-1107015357.
Continuous Internal Evaluation (CIE); Theory (100 Marks)

• CIE is executed by way of quizzes (Q), tests (T), and assignments.
• A minimum of three quizzes is conducted; each quiz is evaluated for 10 marks, adding up to 30 marks.
• Three tests are conducted for 50 marks each, and the sum of the marks scored across the three tests is reduced to 60.
• The marks component for the assignment is 10. The assignment is evaluated by executing case studies using MATLAB or other analytics tools.
• The total marks of CIE are 100.

UNIT I
Big Data Analytics
16CS7F1
Prof. Mamatha T

Introduction to Big Data



UNIT 1
Big Data Introduction (Recap)
• Data: Characteristics of Data
• Big Data: Evolution, the Four V's of Big Data
• CAP Theorem
• Challenges of Big Data

UNIT 1
Big Data Introduction (Recap)
Data
• The quantities, characters, or symbols on which operations are performed by a computer, which may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media.

Characteristics of Data
• Composition:
  - Source of data: where the data is generated
  - Type of data: image, audio, text, PDF
  - Nature of data: static or real-time streaming
• Condition of data: can the data be used as-is for analysis?
• Context:
  - Where, when, and why was the data generated?
  - Is the data sensitive?

Big Data: data whose huge volume cannot be stored and processed by traditional techniques and methodologies.

Big Data
• Characterised by the 3 Vs: Volume, Variety, and Velocity of the data.
• John R. Mashey coined the term "big data" as early as 1990.
• Doug Laney came up with the 3 Vs of big data.

Examples of Big Data
• The New York Stock Exchange generates about one terabyte of new trade data per day.
• Social media: statistics show that 500+ terabytes of new data are ingested into the databases of the social media site Facebook every day, generated through photo and video uploads, message exchanges, and comments.
• A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With many thousands of flights per day, data generation reaches many petabytes.

Characteristics of Big Data: the 3 Vs

Volume: There is a high quantity of data to analyse, i.e., the scale of the data:
• Number of users: Facebook users numbered 2.45 billion as of 2019; Instagram had 26.9 million users (2020).
• Size of data: 500+ terabytes of data each day on Facebook.
• Facebook is available in 101 languages, with over 300,000 users helping with translation.
• According to IDC, by 2025, 80 billion devices will be connected to the Internet, versus approximately 11 billion devices connected now (machine-generated data).

Variety: Structured, semi-structured, and unstructured data.

Velocity: The rate at which data keeps getting generated:
• 350 million photos are uploaded to Facebook every day: 14.58 million photo uploads per hour, 243,000 per minute, and 4,000 per second.
• Every 20 minutes, 1 million links are shared, 20 million friend requests are sent, and 3 million messages are sent.
• 55 million status updates are made every day.

3 Vs of Big Data
Variety of Data (Structured)

• Structured data is easily organisable and follows a rigid format of rows and columns (databases and Excel sheets). It has a defined model, which defines how you capture, store, and access the data.
• Structured data can be created by both machines and humans.
  - Machine-generated data is produced by devices or sensors without human intervention. Examples: data from sensors such as GPS units, RFID tags, and medical devices; data from network and web logs; retail and e-commerce data.
  - Human-generated data is produced through interaction with online forms, kiosks, games, and so on.

Examples of structured data include financial data such as accounting transactions, address details, demographic information, star ratings by customers, machine logs, and location data from smartphones and smart devices.


3 Vs of Big Data
Variety of Data (Semi-Structured)

• Semi-structured data is a subset of structured data.
• Its format includes the capability to add tags, keywords, and metadata to data types.
• It is commonly managed using XML.
• Adding descriptive elements to images and email are examples of semi-structured data:
  - Email messages: while the actual content is unstructured, a message does contain structured data such as the name and email address of the sender and recipient, the time sent, etc.
  - Digital photographs: the image itself is unstructured, but if the photo was taken on a smartphone, for example, it would be date- and time-stamped, geotagged, and would carry a device ID. Once stored, the photo could also be given tags that provide structure, such as 'dog' or 'pet.'

Advantages and Disadvantages of Digital Data

Digital Data
• Structured
• Semi-Structured
• Unstructured

CAP Theorem

CAP Theorem
C - Consistency
A - Availability
P - Partition Tolerance

The CAP theorem (also called Brewer's theorem) states that a distributed system with data replication cannot simultaneously guarantee consistency, availability, and partition tolerance.

Consistency: A system is said to be consistent if all nodes see the same data at the same time.
Availability: Every request receives a response about whether it was successful or failed.
Partition Tolerance:
• Partition tolerance means that the system can continue operating if the network connecting the nodes has a fault that results in two or more partitions, where the nodes in each partition can only communicate among themselves.
• As long as no partition occurs, there is no challenge in providing both A and C (except for latency issues).
• The problem comes when a partition does occur: the system must then choose between availability and consistency.
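To make the trade-off concrete, here is a toy sketch in Java (not a real database; the classes, replicas, and values are all illustrative assumptions) of two replicas that must choose between availability and consistency once the link between them fails:

```java
// Toy sketch of the CAP trade-off under a partition. Illustrative only.
import java.util.Optional;

public class CapSketch {
    static class Replica { String value = "v0"; }

    static final Replica R1 = new Replica();
    static final Replica R2 = new Replica();
    static boolean partitioned = true; // the link between R1 and R2 is down

    // CP choice: refuse the write rather than let the replicas diverge.
    static Optional<String> writeCP(String v) {
        if (partitioned) return Optional.empty(); // unavailable, but consistent
        R1.value = v;
        R2.value = v;
        return Optional.of(v);
    }

    // AP choice: accept the write on the reachable replica; R2 keeps "v0".
    static void writeAP(String v) {
        R1.value = v; // available, but R1 and R2 now disagree
    }

    public static void main(String[] args) {
        System.out.println("CP write accepted? " + writeCP("v1").isPresent()); // false
        writeAP("v1");
        System.out.println("AP state: R1=" + R1.value + ", R2=" + R2.value); // diverged
    }
}
```

The figures on the following slides illustrate the same scenarios pictorially.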

[Figure: three cases - not available; available but not consistent; available and consistent]



[Figure: no issues, all running fine - available, consistent data, network not partitioned]

[Figure: no partition tolerance because of broken communication between nodes]



AP: When a network partition occurs, the system remains available, but the data may be inconsistent.

CP: When a network partition occurs, the system is not fully available, but the data remains consistent.

CA System: Data is consistent between all nodes, and you can read/write from any node (available), but you cannot afford to let your network partition.

Challenges of Big Data


Requirements for the Challenges Posed by Big Data

• Cheap and abundant storage
• Fast processors to enable quick processing of big data
• Open-source distributed big data platforms, such as Hadoop
• Support for:
  - Parallel processing
  - Clustering
  - High connectivity
  - High throughput

Which framework meets these requirements? The Apache Hadoop distributed framework.

Unit 2 - Hadoop Architecture

• Hadoop Architecture
• Hadoop Storage: HDFS
• Common Hadoop Shell Commands
• Anatomy of File Write and Read
• NameNode, Secondary NameNode
• DataNode
• Hadoop MapReduce Paradigm
• Map and Reduce Programming
• Job and Task Trackers, MapReduce Example

UNIT II
Hadoop Architecture
16CS7F1
Big Data Analytics

• Hadoop - Introduction
• Features
• Advantages of Hadoop
• Hadoop Versions
• Limitations of Hadoop 1.x

Hadoop Introduction
• The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
• The framework is written in Java.
• It was developed by Doug Cutting in 2005.
• Hadoop was named after his son's toy elephant.
• Hadoop is an open-source project of the Apache Foundation.
• Hadoop is used by:
  - Facebook
  - LinkedIn
  - Yahoo
  - Twitter

Hadoop Features

1. It is used to handle massive quantities of structured, semi-structured, and unstructured data, stored in HDFS (the Hadoop Distributed File System).
2. It uses inexpensive commodity hardware both to store the data and to process it.
3. Hadoop has a shared-nothing architecture: neither memory nor disk is shared among processors; each processor has its own memory and hard disk.

Shared Nothing Architecture



Advantages of Shared Nothing Architecture

Fault isolation:
Faults are easy to isolate: a fault in a single node is contained and confined to that node.

Scalability:
In a shared-disk system, the disk controller and bandwidth are shared, and synchronization must be implemented to maintain a consistent state. In Hadoop, which follows a shared-nothing architecture, more and more commodity computers can simply be added.

Reliability and availability:
Hadoop replicates data across multiple computers, so when one system is down, the data can still be processed from another machine.

Advantages of Hadoop

Data storage: The data storage framework of Hadoop is called HDFS. There is no fixed format for storing data (it can store structured, semi-structured, and unstructured data); HDFS is schema-less. Structure is imposed on the raw data only when we need to process it.

Scalable: Hadoop can store and distribute very large data sets (terabytes to zettabytes) across hundreds of commodity servers that operate in parallel.

Cost-effective: Storage and processing are inexpensive, since Hadoop uses commodity hardware.

Resilient to failure: Hadoop is fault tolerant, since a copy of the data is always available on multiple nodes.

Flexible: It can store any type of data as-is (structured, semi-structured, and unstructured).

Fast: Processing is faster because, while conventional systems move the data to the code, Hadoop moves the code to the location where the data resides.

Hadoop 1.x Components

Apache Hadoop 1.x has the following two major components, also known as the "Two Pillars" of Hadoop:
• HDFS (HDFS V1) - data storage framework
• MapReduce (MR V1) - processing framework

Hadoop 2.x Components

Apache Hadoop 2.x has the following three major components, also known as the "Three Pillars" of Hadoop:
• HDFS V2
• YARN (MR V2)
• MapReduce (MR V1)

Hadoop 1.x Major Components

• The Hadoop 1.x major components are HDFS and MapReduce, also known as the "Two Pillars" of Hadoop 1.x.

HDFS:
HDFS is the Hadoop Distributed File System, where big data is stored using commodity hardware.
• It is designed to work with large data sets, with a default block size of 64 MB.

The HDFS component is divided into two sub-components:

NameNode (Master Node)
Used to store metadata about the DataNodes, such as:
• How many blocks are stored on the DataNodes
• Which DataNodes hold which data
• Slave node details
• DataNode locations, timestamps, etc.

DataNode (Slave Node)
Used to store the application's actual data. It stores data in data slots (blocks) of 64 MB by default.
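As a rough illustration of this metadata split, the toy model below uses plain Java collections (this is not Hadoop code; all file, block, and node names are made up) to show the two mappings a NameNode maintains: file to blocks, and block to DataNode replicas.

```java
// Toy model of NameNode metadata: file -> blocks, block -> DataNode replicas.
import java.util.List;
import java.util.Map;

public class NameNodeMetadataSketch {
    public static void main(String[] args) {
        // File namespace: which blocks, in order, make up each file.
        Map<String, List<String>> fileToBlocks =
            Map.of("/logs/app.log", List.of("blk_1", "blk_2"));

        // Block map: which DataNodes hold a replica of each block
        // (replication factor 3 in this example).
        Map<String, List<String>> blockToNodes = Map.of(
            "blk_1", List.of("dn1", "dn4", "dn7"),
            "blk_2", List.of("dn2", "dn4", "dn9"));

        // A client reading /logs/app.log asks the NameNode for this mapping,
        // then fetches each block directly from one of its DataNodes.
        for (String blk : fileToBlocks.get("/logs/app.log")) {
            System.out.println(blk + " -> " + blockToNodes.get(blk));
        }
    }
}
```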

Quiz

Both the Master Node and Slave Nodes contain two Hadoop components:
• ------------------
• ------------------

• The Master Node's HDFS component is also known as -----------
• The Slave Node's HDFS component is also known as ----------
• The Master Node's "Name Node" component is used to store -----------.
• The Slave Node's "Data Node" component is used to store our actual --------------.
• HDFS stores data with a default size of ------------ in "Data Slots" or "Data Blocks".
• The Master Node's MapReduce component is also known as ------------
• The Slave Node's MapReduce component is also known as ---------------
• The Master Node's "Job Tracker" takes care of assigning tasks to ---------- and --------------------------
• The Slave Node's MapReduce component "Task Tracker" contains two MapReduce tasks: ----------------- and ------------

Hadoop Cluster Sizing

We need to consider three parameters for sizing:
1. SLA (service level agreement, e.g., 1 year, 2 years)
2. Daily data growth (daily ingestion of data into the Hadoop cluster)
3. Replication factor on Hadoop (default replication: 3)

Example:
1. SLA = 1 year = 365 days
2. Daily data growth = 2 GB/day, so 2 GB × 365 days = 730 GB
3. Replication factor = 3 (default), so 3 × 730 GB = 2190 GB

The total size with 3-way replication for a 1-year SLA is 2190 GB.
A threshold of 20 to 30% extra space is required for MapReduce processing:
Total cluster size = 2190 GB + 20% for MapReduce processing
Total cluster size = 2190 GB + 438 GB = 2628 GB
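The same arithmetic, written out as a small self-contained Java program (the variable names are illustrative; this is just the slide's calculation, not a Hadoop API):

```java
// Cluster-sizing arithmetic from the example above.
public class ClusterSizing {
    public static void main(String[] args) {
        int slaDays = 365;                 // SLA: 1 year
        double dailyGrowthGb = 2.0;        // daily ingestion, GB/day
        int replicationFactor = 3;         // HDFS default
        double mapReduceHeadroom = 0.20;   // 20% threshold for MapReduce

        double rawGb = dailyGrowthGb * slaDays;                  // 730 GB
        double replicatedGb = rawGb * replicationFactor;         // 2190 GB
        double totalGb = replicatedGb * (1 + mapReduceHeadroom); // 2628 GB

        System.out.printf("Raw: %.0f GB, replicated: %.0f GB, total: %.0f GB%n",
                rawGb, replicatedGb, totalGb);
    }
}
```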

Hadoop 1.x - Components of the Hadoop Architecture

• A Hadoop cluster is a collection of commodity hardware (devices that are inexpensive and amply available).
• These hardware components work together as a single unit.
• A Hadoop cluster contains many nodes (computers), comprising masters and slaves.

HDFS - data storage framework
MapReduce - data processing framework

Master Node: NameNode (HDFS), JobTracker (MapReduce)
Slave Node: DataNode (HDFS), TaskTracker (MapReduce)
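To preview the MapReduce programming model listed in the syllabus, here is the classic word-count job, sketched with the org.apache.hadoop.mapreduce API. This is the standard textbook example, not code from these slides: map tasks run on TaskTrackers near the data and emit (word, 1) pairs, and reduce tasks sum the counts per word.

```java
// Classic WordCount sketch using the org.apache.hadoop.mapreduce API.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: for each input line, emit (word, 1) for every token.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts emitted for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input dir
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output dir
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

A job like this would typically be packaged into a JAR and submitted with something like `hadoop jar wordcount.jar WordCount /input /output` (the paths are illustrative).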

Role of HDFS (Hadoop Distributed File System)

• HDFS holds a very large amount of data and provides easy access.
• To store such huge data, files are stored across multiple machines (a distributed file system).
• Files are stored in a redundant fashion to rescue the system from possible data loss in case of failure (fault tolerance).
• HDFS also makes applications available for parallel processing.
• It is suitable for distributed storage and processing.
• Hadoop provides a command interface to interact with HDFS, as sketched below.
• HDFS provides file permissions and authentication.
• It offers streaming access to file system data.
• HDFS follows a master-slave architecture.
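As a minimal sketch of that interaction through the Java FileSystem client (the cluster URI hdfs://namenode:9000 and the file path are assumptions, not from the slides), the program below writes a small file and reads it back; HDFS transparently splits the written data into blocks and replicates them across DataNodes:

```java
// Minimal HDFS write-and-read sketch via the Hadoop FileSystem API.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // assumed cluster URI

        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/user/demo/hello.txt");     // assumed path

        // Write: the client streams bytes; HDFS splits them into blocks
        // and replicates each block across DataNodes.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read: the NameNode supplies block locations; the data itself
        // is read directly from the DataNodes.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }
        fs.close();
    }
}
```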

Hadoop 1.x Architecture

• Which of the following is NOT true of Hadoop?
  a) It is a tool for big data analysis
  b) It supports structured and unstructured data analysis
  c) It aims for vertical scaling out/in scenarios
  d) Both (a) and (c)

• What is the default HDFS block size?
  a) 32 MB  b) 64 KB  c) 128 KB  d) 64 MB

• HDFS works in a __________ fashion.

• What is the default HDFS replication factor?
  a) 4  b) 1  c) 3  d) 2

• HDFS provides a command line interface called __________ used to interact with HDFS.
• HDFS is implemented in the _____________ programming language.

NameNode

• All metadata related to HDFS, including information about DataNodes, files stored on HDFS, replication, etc., is stored and maintained on the NameNode (Master Node).
• The NameNode serves as the master, and there is only one NameNode per cluster.
• The NameNode and DataNode are pieces of software designed to run on commodity machines, which typically run a GNU/Linux operating system (OS).
• It contains metadata about the DataNodes:
  - It knows exactly which DataNode contains which blocks, and where the DataNodes are located within the machine cluster.
• It contains metadata about the files distributed across the cluster:
  - It manages information such as the location of file blocks across the cluster and their permissions.
• The NameNode also manages access to the files, including reads, writes, creates, deletes, and replication of data blocks across different DataNodes.

NameNode

• HDFS stores each file as a sequence of blocks.
• All blocks in a file except the last block are the same size.
  Example: Suppose the NameNode receives a file of size 196 MB. How is the file stored in HDFS, and what is the size of each block? Three 64 MB blocks (B1, B2, B3) = 192 MB, plus a last block B4 of 4 MB.
• The blocks of a file are replicated for fault tolerance.
• The block size and replication factor are configurable per file.
• An application can specify the number of replicas of a file.
• The replication factor can be specified at file creation time and can be changed later.
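The worked example above is just integer division and remainder; a tiny sketch of the same arithmetic (class and variable names are illustrative):

```java
// How a 196 MB file splits into 64 MB blocks, per the example above.
public class BlockLayout {
    public static void main(String[] args) {
        long fileMb = 196;
        long blockMb = 64; // Hadoop 1.x default block size

        long fullBlocks = fileMb / blockMb;  // 3 full blocks (B1, B2, B3)
        long lastBlockMb = fileMb % blockMb; // 4 MB last block (B4)

        System.out.println("Full 64 MB blocks: " + fullBlocks);          // 3
        System.out.println("Last block size:   " + lastBlockMb + " MB"); // 4
    }
}
```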

NameNode

• The NameNode makes all decisions regarding the replication of blocks.
• It periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster.
• Receipt of a Heartbeat implies that the DataNode is functioning properly.
• A Blockreport contains a list of all blocks on a DataNode.

In Hadoop, the NameNode maintains two types of files:
1. Edit log files
2. FsImage files
These files reside on the NameNode's disk.

NameNode

• The FsImage contains details about the location of the data in the data blocks and which blocks are stored on which node.
• When the NameNode starts, the latest FsImage file is loaded into memory.
• The EditLog is a transaction log that records changes to the HDFS file system and actions performed on the HDFS cluster, such as the addition of a new block, replication, deletion, etc.
• It records the changes since the last FsImage was created; those changes are then merged into the FsImage file to create a new FsImage file.
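A toy sketch of that checkpointing idea, using plain Java collections rather than Hadoop's actual on-disk formats (the paths and operations are made up): the namespace is a snapshot (FsImage) plus a log of changes (EditLog), and a checkpoint replays the log into a new snapshot.

```java
// Toy sketch of the FsImage + EditLog checkpoint idea. Illustrative only.
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class CheckpointSketch {
    public static void main(String[] args) {
        Set<String> fsImage = new HashSet<>(Set.of("/a", "/b"));       // last snapshot
        List<String> editLog = new ArrayList<>(List.of("ADD /c", "DEL /a"));

        // Checkpoint: replay the edit log onto the image...
        for (String edit : editLog) {
            String[] op = edit.split(" ");
            if (op[0].equals("ADD")) fsImage.add(op[1]);
            else fsImage.remove(op[1]);
        }
        editLog.clear(); // ...then start a fresh edit log

        System.out.println("New FsImage: " + fsImage); // [/b, /c]
    }
}
```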

DataNode

DataNodes are the slave nodes in HDFS.
A DataNode is commodity hardware, that is, an inexpensive system that is not of high quality or high availability.
The DataNode is a block server that stores the data in local files.

Functions of the DataNode:
• These are slave daemons or processes that run on each slave machine.
• The actual data is stored on the DataNodes.
• They send heartbeats to the NameNode periodically to report the overall health of HDFS; by default, this frequency is set to 3 seconds.

1. The number of copies of a file is called the ----------- of that file. This information is stored by the -------------.
2. How are files stored in HDFS?
3. What are blocks in HDFS? What is the default block size? Where are blocks stored?
4. What happens if the NameNode fails in Hadoop?
5. For frequently accessed HDFS files, the blocks are cached in
   A - the memory of the DataNode
   B - the memory of the NameNode
   C - both A and B
   D - the memory of the client application that requested access to these files.
6. The inter-process communication between different nodes in Hadoop uses ----------.
7. The current limiting factor on the size of a Hadoop cluster is ---------.
