
Big Data Technologies – IIMCal APDS 03

- Arijit Sen (arijit.sen@theinventa.com)


Modules Overview

Day 1 (Overview)
 Introduction to Course – Objectives and Modules (Brief)
 Big Data – A brief History
 Big Data – Opportunities/Case Studies/Examples
 Distributions – Installation and Configuration of a Cloudera Single Node Cluster in a local environment and in AWS

Day 2 (Hadoop Architecture and Ecosystem)
 Distributed File System - Hadoop/HDFS
 Architecture – NameNode and DataNode, Jobs
 The Ecosystem Part 1 - MapReduce Explained
 Hands on – MapReduce and Practice Exercise

Day 3, 4 & 5 (Ecosystem contd..)
 Dissecting the components of MapReduce
 The Ecosystem Part 2 - Hive, Pig
 Hands on – Pig Scripts and Hive QL
 The Ecosystem Part 3 – Sqoop, Flume
 Demonstration and Example

Day 6 (Non-Relational Paradigm)
 Why NoSQL? Ex. Graph (Neo4j) and Document (MongoDB) databases
 Installation and Configuration of MongoDB and Neo4j
 Basics of JSON and Cypher query language (Neo4j)

Day 7 (Optional - Spark)
 Introduction to Spark Core - Installation and Configuration
 Spark-SQL, Streaming
 Hands On – Spark-SQL
 Practice exercises on Spark SQL

Day 7 (Case Study – Flight Prediction)
 Introduction to Flight Prediction Case Study
 Data Collection and Analysis
 R Program to connect to Hadoop and Predict
So what is Big Data!!

 Many terabytes, petabytes, exabytes…
 3 Vs – Volume, Velocity, Variety…

Big Data – and is it really Big?
What is Big Data?
 Big data is a buzzword, or catch-phrase, used to describe a massive volume of human- or machine-generated
structured, semi-structured or unstructured data, arriving in large volumes and at high speed, which requires
complex processing and analysis to help companies improve operations and make faster, more intelligent decisions.
 "Big data is any data that is really expensive to manage and hard to extract value from" – Michael Franklin

Bytes ++

1 Bit = Binary Digit


8 Bits = 1 Byte
1024 Bytes = 1 Kilobyte
1024 Kilobytes = 1 Megabyte
1024 Megabytes = 1 Gigabyte
1024 Gigabytes = 1 Terabyte
1024 Terabytes = 1 Petabyte
1024 Petabytes = 1 Exabyte
1024 Exabytes = 1 Zettabyte
1024 Zettabytes = 1 Yottabyte
1024 Yottabytes = 1 Brontobyte
1024 Brontobytes = 1 Geopbyte
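
Each step above is a factor of 2^10 = 1,024, i.e. successive powers of two. A tiny Java check of a few of the units (a sketch, not part of the original deck):

public class ByteUnits {
    public static void main(String[] args) {
        long kb = 1L << 10;   // 1,024 bytes
        long mb = 1L << 20;   // 1,024 KB
        long gb = 1L << 30;   // 1,024 MB
        long tb = 1L << 40;   // 1,024 GB
        long pb = 1L << 50;   // 1,024 TB
        System.out.printf("1 KB = %,d bytes%n", kb);
        System.out.printf("1 TB = %,d bytes = %,d MB%n", tb, tb / mb);
        System.out.printf("1 PB = %,d bytes = %,d GB%n", pb, pb / gb);
    }
}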
Big Data – 4Vs (Volume, Velocity, Variety and Veracity)
Evolution of Big Data
Big Data – How Big is it?

Every 2 days we now produce more data than we created from the beginning of time until 3–4 years ago.

Over 5,000 exabytes of data sit in the cloud – roughly 90 times the Earth-to-Pluto distance if printed as books and stacked.

~3 zettabytes of data are floating around.

More than 90% of all data was created in the last 3 years alone.

Every minute we send ~400 million emails, generate ~2.5 million Facebook likes, post ~350 thousand tweets,
and upload ~300,000 photos to Facebook.

Burn all the data we have created so far onto DVDs and stack them: the pile would reach the Moon 3 times (up and down).

The big data market is worth ~$150B+ today, growing at ~13% annually.

Decoding the human genome originally took 10 years to process; now it can be achieved in one week.

Walmart handles more than 1 million customer transactions every hour, which are imported into
databases estimated to contain more than 2.5 petabytes of data.
The Gartner Hype Cycle for Big Data 2013
The Hype comparison in the last 3 years

"Big Data has become prevalent in our life" – Betsy Burton (Gartner)

Smart Machine Technologies with NLP- and ANN-based Deep Learning are the current hype

https://www.youtube.com/watch?v=SQipnBNVjv0
Big Opportunities !
Reference Terminology

Astronomy

 LSST - The Large Synoptic Survey Telescope (LSST) is a wide-field survey reflecting telescope with an 8.4-
meter primary mirror
 PAN-STARRS - The Panoramic Survey Telescope and Rapid Response System (Pan-STARRS; code: F51
and F52) located at Haleakala Observatory, Hawaii, consists of astronomical cameras, telescopes and a
computing facility that is surveying the sky for moving objects on a continual basis, including accurate
astrometry and photometry of already detected objects
 SDSS - The Sloan Digital Sky Survey or SDSS is a major multi-filter imaging and spectroscopic redshift
survey using a dedicated 2.5-m wide-angle optical telescope at Apache Point Observatory in New Mexico,
United States. The project was named after the Alfred P. Sloan Foundation, which contributed significant
funding.
 n-body-SIMS - In physics and astronomy, an N-body simulation is a simulation of a dynamical system of
particles, usually under the influence of physical forces, such as gravity (see n-body problem). N-body
simulations are widely used tools in astrophysics, from investigating the dynamics of few-body systems like
the Earth-Moon-Sun system to understanding the evolution of the large-scale structure of the universe. In
physical cosmology, N-body simulations are used to study processes of non-linear structure formation such
as galaxy filaments and galaxy halos from the influence of dark matter. Direct N-body simulations are used
to study the dynamical evolution of star clusters.
Reference Terminology

Ocean Sciences

 AUV - Autonomous Underwater Vehicles


 ADCP - The Acoustic Doppler Current Profiler (ADCP) measures the speed and direction of ocean currents
using the principle of “Doppler shift”
 CTD - A CTD or Sonde is an oceanography instrument used to measure the conductivity, temperature, and
pressure of seawater (the D stands for "depth," which is closely related to pressure)
 OOI - The National Science Foundation-funded Ocean Observatories Initiative (OOI) is an integrated
infrastructure program composed of science-driven platforms and sensor systems that measure physical,
chemical, geological and biological properties and processes from the seafloor to the air-sea interface.
 IOOS - The U.S. Integrated Ocean Observing System (IOOS)
 CMOP - Center for Coastal Margin Observation & Prediction
 Glider - An ocean glider is an autonomous, unmanned underwater vehicle used for ocean science. Since
gliders require little or no human assistance while traveling, these little robots are uniquely suited for
collecting data in remote locations, safely and at relatively low cost
Ref: Rick Smolan, Patrick Meier and Deb Roy
Rick Smolan is a former TIME, LIFE, and National Geographic photographer best known as the co-creator of the Day
in the Life book series. He is currently CEO of Against All Odds Productions, a cross-media organization. More than
five million of his books have been sold around the world; many have appeared on The New York Times best-seller
lists and have been featured on the covers of Fortune, Time, and Newsweek. Smolan is also a member of the
CuriosityStream Advisory Board.
https://en.wikipedia.org/wiki/Rick_Smolan
https://www.youtube.com/watch?v=4VeITe6EJDU
https://www.youtube.com/watch?v=OV1y6ZUV_Q4

Patrick Meier is an internationally recognized expert and consultant on Humanitarian Technology and Innovation. His book,
Digital Humanitarians, has been praised by Harvard, MIT, Stanford, Oxford, UN, Red Cross, World Bank, USAID and
others.
https://irevolutions.org/bio/

Deb Roy is a tenured professor at MIT and served as Chief Media Scientist of Twitter from 2013-2017.
A native of Winnipeg, Manitoba, Canada, Roy received his PhD in Media Arts and Sciences from MIT.
MIT researcher Deb Roy wanted to understand how his infant son learned language -- so he wired up his house with video
cameras to catch every moment (with exceptions) of his son's life, then parsed 90,000 hours of home video to watch "gaaaa"
slowly turn into "water."
https://dkroy.media.mit.edu/
https://www.ted.com/talks/deb_roy_the_birth_of_a_word?language=en
Big Opportunities !
Big Opportunities ! Wordscapes (Deb Roy MIT)

https://www.ted.com/talks/deb_roy_the_birth_of_a_word?language=en
Big Opportunities ! AI in Rock Climbing

Stanford Paper
- http://cs229.stanford.edu/proj2017/final-reports/5232206.pdf
Challenges & Traditional Solutions..
Big Data vs Traditional DW

Traditional Data Warehouse Systems
 Centralized and powerful servers
 Not all important data (unstructured or semi-structured) can be loaded, due to structural mismatch or size
 Old data needs to be archived or deleted frequently
 Scalability is more often vendor-locked and difficult (in time and money) to upgrade
 Most of the S/W is vendor-locked, i.e. proprietary
 Costly
 Cloud is only recently an option

Big Data Systems
 Leverages distributed clusters of machines/nodes; incorporating commodity/old hardware is quick
 All types of data (structured, semi-structured or unstructured) can be stored
 Size of data is no longer a restriction
 Analysis and analytics are easier to perform, providing a 360-degree view of the data
 Scalability can be achieved easily by adding nodes in a linear fashion; cloud implementations are readily available
 Low cost and a huge variety of open source tools
 Integrates easily with DW systems via adapters
Traditional Big Data Implementation
Hadoop – The Solution..
The Problem !!

What about the following?

 Split the 1 TB file into 100 equal-size blocks and read them in parallel.

 Time to read = 150 mins / 100 = 1.5 mins (< 2 mins)

 Computation time = 60 mins / 100 = 0.6 mins (< 1 min) – a quick check of this arithmetic follows below.

[Diagram: a 1 TB file split into 100 blocks (1, 2, 3, 4, 5 … 98, 99, 100); caption: Network Congestion]
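
A back-of-the-envelope check of the slide's numbers in Java (a sketch, not part of the original deck):

public class ParallelReadCheck {
    public static void main(String[] args) {
        double readMinutesSequential = 150.0;     // read 1 TB with a single reader
        double computeMinutesSequential = 60.0;   // process 1 TB on a single node
        int blocks = 100;                          // 1 TB split into 100 equal blocks

        // Ideal speed-up: each of the 100 nodes reads and processes only its own block
        double readMinutesParallel = readMinutesSequential / blocks;       // 1.5 min  (< 2 min)
        double computeMinutesParallel = computeMinutesSequential / blocks; // 0.6 min  (< 1 min)

        System.out.printf("Parallel read:    %.1f min%n", readMinutesParallel);
        System.out.printf("Parallel compute: %.1f min%n", computeMinutesParallel);
        // Caveat from the slide: shipping all blocks over the network to one machine
        // would re-introduce the bottleneck as network congestion.
    }
}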
The Solution !!
Hadoop – What it is NOT..
 Not a database substitute
Hadoop is not meant to store and manage data the way an RDBMS does.
 Not a real-time data processing engine
Hadoop is a batch processing system. Using Hadoop with the expectation of analyzing data
as soon as it is generated is inappropriate. If the need is to analyze data at the point of
generation without a time lag, look for alternative technologies.
 Not an analytic engine
Hadoop by itself does not provide any inbuilt analytic capabilities; you write MapReduce
programs for each data processing requirement.
A brief History – The Journey from 2002
Big Data – Ecosystem

Distributions

 Cloudera
 Hortonworks
 MapR (Fastest)
Why Another File System..
Let us understand Blocks..
HDFS..

 Distributed file system designed to run on commodity hardware.

 Highly fault tolerant and reliable.

 Data is split into blocks and distributed across multiple nodes in the cluster.

 HDFS has rack awareness and distributes blocks accordingly.

 HDFS is immutable – write once, read multiple times.

 Suitable for applications that require high throughput access to large data sets.

 Default block size is 64 MB or 128 MB, and it is configurable – see the configuration sketch below.
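
As a hedged illustration of that configurability (this sketch is not part of the deck), the Java snippet below sets client-side defaults through a Hadoop Configuration object and then overrides the block size and replication factor for a single file via the FileSystem.create() overload. The fs.defaultFS URL and the /user/demo path are assumptions for a single-node setup.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Defaults picked up by the HDFS client (normally configured in hdfs-site.xml / core-site.xml)
        conf.set("fs.defaultFS", "hdfs://localhost:8020");   // assumption: single-node cluster
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);   // 128 MB blocks
        conf.setInt("dfs.replication", 3);                   // 3 replicas

        FileSystem fs = FileSystem.get(conf);

        // Per-file override: create() lets the client choose replication and block size
        Path out = new Path("/user/demo/sample.txt");        // hypothetical path
        short replication = 2;
        long blockSize = 64L * 1024 * 1024;                  // 64 MB for this file only
        try (FSDataOutputStream os = fs.create(out, true,
                conf.getInt("io.file.buffer.size", 4096), replication, blockSize)) {
            os.writeUTF("hello hdfs");
        }
        System.out.println("Block size used: " + fs.getFileStatus(out).getBlockSize());
        fs.close();
    }
}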


WHY HDFS..
 Hardware failure
An HDFS instance consists of hundreds of machines, each of which can fail. A key goal of the HDFS architecture is to support
detection of such faults and quick recovery.

 Streaming data access
HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern. It is more
suitable for batch processing than interactive use. The emphasis is on high throughput of data access rather than low latency of
data access.

 Large data sets
Applications that run on HDFS have large datasets. A typical file in HDFS is gigabytes to terabytes in size. A typical cluster stores
petabytes worth of data.

 Data Locality
Achieves greater efficiency by moving computation to the data. Since files are spread across the distributed file system as chunks,
each compute process running on a node operates on a subset of the data. Which data a node operates on is chosen based on its
locality to the node: most data is read from the local disk straight into the CPU, alleviating strain on network bandwidth and
preventing unnecessary network transfers.

 Portability
Designed to be portable from one platform to another, facilitating wider adoption.

 Economy
Hadoop doesn't require expensive, highly reliable hardware to run on. It's designed to run on clusters of commodity hardware
(commonly available hardware from multiple vendors) for which the chance of node failure across the cluster is high,
at least for large clusters.
When NOT to use HDFS..

Low-latency data access


Applications that require low-latency access to data, in the tens of milliseconds range, will not work
well with HDFS. Remember, HDFS is optimized for delivering a high throughput of data, and this
may be at the expense of latency.

Lots of small files


Since the namenode holds file system metadata in memory, the limit to the number of files in a file
system is governed by the amount of memory on the namenode.

Multiple writers, arbitrary file modifications


Files in HDFS may be written to by a single writer. Writes are always made at the end of the file.
There is no support for multiple writers, or for modifications at arbitrary offsets in the file.
HDFS Nodes..
Master & Slaves..
HDFS has a master/slave architecture.

An HDFS cluster consists of a single NameNode, a master server that manages the file system
namespace and regulates access to files by clients.

In addition, there are a number of DataNodes, usually one per node in the cluster, which manage
storage attached to the nodes that they run on.

HDFS exposes a file system namespace and allows user data to be stored in files.
Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes.
NameNode..
The NameNode (master) executes file system namespace operations like opening, closing, and renaming files and directories. It also
determines the mapping of blocks to DataNodes.

The NameNode maintains the file system tree and metadata for all files and directories in the tree. Any change to the file system
namespace or its properties is recorded by the NameNode.
This information is stored persistently on the local disk in the form of two files: the FsImage and the EditLog.

The NameNode uses a transaction log called the EditLog to persistently record every change that occurs to file system metadata. The
NameNode uses a file in its local host OS file system to store the EditLog.

o The entire file system namespace, including the mapping of blocks to files and file system properties, is stored in a file called the FsImage
(stored in local file system).

o The NameNode keeps an image of the entire file system namespace and file Blockmap in memory.

o When the NameNode starts up, it reads the FsImage and EditLog from disk, applies all the transactions from the EditLog to the in-memory
representation of the FsImage, and flushes out this new version into a new FsImage on disk.
o It then truncates the old EditLog. This process is called a checkpoint (it generally occurs during startup).

The namenode also knows the datanodes on which all the blocks are stored for a given file.

It does not store block locations persistently, since this information is reconstructed from datanodes when the system starts.

A client accesses the filesystem on behalf of the user by communicating with the namenode and datanodes.

The client presents a filesystem interface, so the user code does not need to know about the namenode and datanode to function.
DataNode..
The DataNodes are responsible for serving read and write requests from the file system’s clients.

The DataNodes also perform block creation, deletion, and replication upon instruction from the
NameNode.

DataNode periodically sends a Heartbeat and a Blockreport to the NameNode in the cluster.
Receipt of a Heartbeat implies that the DataNode is functioning properly. A Blockreport contains a
list of all blocks on a DataNode.
Replication facts..
HDFS is designed to reliably store very large files across machines in a large cluster.

It stores each file as a sequence of blocks; all blocks in a file except the last block are the same size.

The blocks of a file are replicated for fault tolerance.


o The block size and replication factor are configurable per file.
o An application can specify the number of replicas of a file.
o The replication factor can be specified at file creation time and can be changed later (see the sketch after this slide).

Files in HDFS are write-once and have strictly one writer at any time.

The NameNode makes all decisions regarding replication of blocks.


o It periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster.
o Receipt of a Heartbeat implies that the DataNode is functioning properly.
o A Blockreport contains a list of all blocks on a DataNode.
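
Because replication is per file and changeable later (as noted above), here is a hedged Java sketch of asking the NameNode for a new replication factor through the standard FileSystem API; the path is hypothetical and the cluster address comes from the local Hadoop configuration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/sample.txt");   // hypothetical existing file

        FileStatus before = fs.getFileStatus(file);
        System.out.println("Current replication: " + before.getReplication());

        // Ask the NameNode to raise the replication factor to 3;
        // the actual re-replication happens asynchronously on the DataNodes.
        boolean accepted = fs.setReplication(file, (short) 3);
        System.out.println("Replication change accepted: " + accepted);

        fs.close();
    }
}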
Replication facts..
Large HDFS instances run on a cluster of computers that commonly spread across many racks.

Communication between two nodes in different racks has to go through switches.

In most cases, network bandwidth between machines in the same rack is greater than network
bandwidth between machines in different racks.

 The NameNode determines the rack id each DataNode belongs to via a process called Hadoop
Rack Awareness.

 A simple but non-optimal policy is to place replicas on unique racks.
o This prevents losing data when an entire rack fails and allows use of bandwidth from multiple racks when
reading data.
o This policy evenly distributes replicas in the cluster, which makes it easy to balance load on component
failure.
o However, this policy increases the cost of writes because a write needs to transfer blocks to multiple racks.
Replication facts..
For the common case, when the replication factor is three, HDFS’s placement policy is to put one
replica on one node in the local rack, another on a node in a different (remote) rack, and the last
on a different node in the same remote rack.

This policy cuts the inter-rack write traffic which generally improves write performance.

The chance of rack failure is far less than that of node failure - this policy does not impact data
reliability and availability guarantees.

However, it does reduce the aggregate network bandwidth used when reading data since a block is
placed in only two unique racks rather than three.
Hands on - HDFS Sample Commands practice

 Listing, Copy (Local <--> HDFS), Move
 Replication check
 Block locations
 Etc…

(A Java FileSystem API sketch of the same operations follows below.)
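
The exercise itself is normally driven from the shell (hdfs dfs -ls / -put / -get / -mv, and hdfs fsck <path> -files -blocks -locations for replication and block locations). As a hedged companion, the sketch below performs the same operations through the Java FileSystem API; all paths are made up for the example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCommandsDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Listing (hdfs dfs -ls /user/demo)
        for (FileStatus st : fs.listStatus(new Path("/user/demo"))) {       // hypothetical directory
            System.out.printf("%s  repl=%d  size=%d%n",
                    st.getPath(), st.getReplication(), st.getLen());
        }

        // Copy local -> HDFS (hdfs dfs -put) and HDFS -> local (hdfs dfs -get)
        fs.copyFromLocalFile(new Path("/tmp/local.txt"), new Path("/user/demo/local.txt"));
        fs.copyToLocalFile(new Path("/user/demo/local.txt"), new Path("/tmp/copy.txt"));

        // Move/rename within HDFS (hdfs dfs -mv)
        fs.rename(new Path("/user/demo/local.txt"), new Path("/user/demo/moved.txt"));

        // Block locations (hdfs fsck <file> -files -blocks -locations)
        FileStatus st = fs.getFileStatus(new Path("/user/demo/moved.txt"));
        for (BlockLocation loc : fs.getFileBlockLocations(st, 0, st.getLen())) {
            System.out.println("Block hosts: " + String.join(",", loc.getHosts()));
        }
        fs.close();
    }
}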
HDFS Read Write – In Detail..
Hands on - HDFS Java Programs for Read and Write..

 Read from HDFS
 Write to HDFS

(A minimal sketch of both follows below.)
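
A minimal sketch of the two hands-on programs (assumptions: a cluster reachable at hdfs://localhost:8020 and a hypothetical path). It writes a small file to HDFS and reads it back:

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:8020");   // assumption: single-node cluster
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/hello.txt");        // hypothetical path

        // Write to HDFS (write-once: the file is created, not edited in place)
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("Hello HDFS\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read from HDFS and copy the bytes to stdout
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}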
Architecture..
Introduction of MapReduce – Divide and Conquer

 A technique by which large problems are divided into smaller sub-problems
 Sub-problems are worked upon in parallel and in isolation
 Intermediate results are combined into a final result

(The classic word-count example below sketches this pattern.)
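
The standard Hadoop word-count example illustrates exactly this pattern: the map phase works on input splits in isolation, and the reduce phase combines the intermediate counts. The sketch below follows the classic example from the Hadoop documentation; input and output paths are supplied on the command line.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: each input split is processed in isolation, emitting (word, 1) pairs
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: intermediate results for the same word are combined into a final count
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // combine locally before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}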
Thank You and Enjoy Learning..
- Arijit Sen
Email: arijit.sen@theinventa.com / Phone: 9007055155
Appendix
Big Data – Data Visualization Showcase

Why?
 A picture is worth a thousand words – more insight, i.e.
 Comprehend information quickly
 Discover emerging trends
 Identify relationships and patterns
 Helps data scientists create mathematical models for predictions

Tools/Products examples
 Tableau
 Frameworks – D3.js
 Amazon QuickSight
 IBM Watson Explorer
 Microsoft PowerBI

A video by Hans Rosling (Founder of Gapminder) at TED in 2006
- https://www.ted.com/talks/hans_rosling_the_best_stats_you_ve_ever_seen?language=en

A video showing the athletic power of quadcopters and the data they generate for processing
- https://www.youtube.com/watch?v=w2itwFJCgFQ
- https://www.youtube.com/watch?v=RCXGpEmFbOw
Big Data – Data Visualization Showcase 2
WorldBank Data
World Bank Demo Terminology

Terms

 Fertility Rate - Number of live births per 1000 women between the ages of 15 and 44 years

 Life Expectancy - Life expectancy equals the average number of years a person born in a given country is
expected to live if mortality rates at each age were to remain steady in the future

 GDP Per Capita - GDP per capita is a measure of a country's economic output that accounts for population.
It divides the country's gross domestic product by its total population. That makes it the best measurement
of a country's standard of living. It tells you how prosperous a country feels to each of its citizens.
 Why the Largest Economies Aren't the Richest per Capita - GDP per capita allows you to compare the
prosperity of countries with different population sizes. For example, U.S. GDP was $18.56 trillion in 2016.
But one reason America is so prosperous is it has so many people. It's the third most populous country after
China and India.
 The United States must spread its wealth among 324 million people. As a result, its GDP per capita is only
$57,300. That makes it the 18th most prosperous country per person.
 China has the largest GDP in the world, producing $21.2 trillion in 2016. But its GDP per capita was only
$15,400 because it has four times the number of people as the United States.
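
A quick check of the per-capita arithmetic quoted above (a sketch, not from the slides; China's ~1.38 billion population is an added figure, not stated on the slide):

public class GdpPerCapita {
    public static void main(String[] args) {
        // United States, 2016 figures quoted above
        double usGdp = 18.56e12;          // $18.56 trillion
        double usPopulation = 324e6;      // 324 million people
        System.out.printf("US GDP per capita:    $%,.0f%n", usGdp / usPopulation);      // ~ $57,300

        // China, 2016 figure quoted above; population ~1.38 billion (added here for the calculation)
        double chinaGdp = 21.2e12;        // $21.2 trillion, as quoted above
        double chinaPopulation = 1.38e9;
        System.out.printf("China GDP per capita: $%,.0f%n", chinaGdp / chinaPopulation); // ~ $15,400
    }
}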
