
CS439 DATA SCIENCE

Muhammad Salman Saeed


Big Data
What is Big Data?
• Big Data is a term for data sets that are so large or
complex that traditional data processing application
software is inadequate to deal with them.

• Big Data challenges include capturing data, data storage, processing,
data analysis, search, sharing, transfer, visualization, querying,
updating and information privacy/security.
Can you identify these trend lines in the field of data?
[Figure: five trend lines, labelled Trend 1 - Trend 5]

VOLUME + VARIETY + VELOCITY + VERACITY + COST = BIG DATA
Big Data
Process A lot of Data

High Speed

With Less Cost


What is Big Data?

• We often use the concept of the 4 V's to describe Big Data:
1. Volume - Amount of Data (Petabytes, Zettabytes)
2. Variety - Forms of Data (Structured, Unstructured)
3. Velocity - Speed of Data (GBs/sec)
4. Veracity - Uncertainty of Data (Accuracy)
What is Big Data?
1. Volume
• Transaction-based data stored through the years.
[Examples: banks, telcos, health care, data warehousing, etc.]

• Unstructured data streaming from social media: images, videos, audio, tweets.

• Sensor and machine-to-machine (M2M) data: IoT devices.
2. Variety
• Structured data in traditional databases

• Semi-structured data like XML or JSON.

• Unstructured data like emails, images, and clickstream data.

3. Velocity
• Megabytes per second, Gigabytes per second.

• Data needs to be dealt with in a timely manner.

• Inconsistent data flows with periodic peaks.


Speed of Creating Data
Speed of Storing Data
Speed of Processing Data
Speed of Analyzing Data
3. Velocity (cont.)
Batch Processing
Collect Data → Clean Data → Feed in Batches → Wait for Process → Act

Real-Time Processing
Capture Streaming Data → Feed Real-time to Machine → Process Real-time → Act
Big Data enables real-time decision pipelines!
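
To make the contrast concrete, here is a minimal, hypothetical Python sketch (not from the slides): the batch path collects and cleans the whole dataset before acting once, while the streaming path acts on each record as it arrives.

def batch_pipeline(records):
    """Collect -> clean -> process the whole batch -> act once at the end."""
    cleaned = [r.strip().lower() for r in records]   # collect and clean everything first
    total = sum(len(r) for r in cleaned)             # process the whole batch
    print(f"[batch] acted once, after {len(cleaned)} records (total chars = {total})")

def stream_pipeline(record_source):
    """Capture streaming data -> process each record in real time -> act immediately."""
    for r in record_source:
        value = len(r.strip().lower())               # process one record
        print(f"[stream] acted immediately on {r!r} -> {value}")

if __name__ == "__main__":
    data = ["Sensor A: 20", "Sensor B: 21", "Sensor C: 19"]
    batch_pipeline(data)          # waits for the whole batch before acting
    stream_pipeline(iter(data))   # acts record by record as data arrives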


3. Velocity (cont.)
Speed of Data Generation: slow or fast
Speed of Data Processing: slow or fast

Which path to choose in what scenario?


4. Veracity
• Untrusted and unreliable data.

• Data inconsistency and incompleteness.

• Biased, unclean and ambiguous data.
Big Data Problem & Big Data Platforms
We buy machines for storage and processing - a Big Data platform.
Hadoop is one of the platforms to solve the Big Data problem:
distributed storage and parallel processing.
Why Big Data Platforms?
Scalable Cost Effective

Flexible Fast

Resilient
1. Scalable
• It can store and distribute very large data sets across
hundreds of inexpensive servers that operate in
parallel.
• It enables businesses to run applications on
thousands of nodes involving thousands of terabytes
of data.

• It manages horizontal scalability seamlessly.


2. Cost Effective
• A scale-out architecture (as seen on the previous slide) that can
affordably store all of a company's data for later use.
• In the past, many companies would have had to down-sample data in an
effort to reduce costs.
• The raw data would be deleted from relational DBs, as it would be too
cost-prohibitive to keep.

The cost savings are staggering!


3. Flexible
• Enables businesses to easily access new data
sources and tap into different types of data
(structured, unstructured, semistructured).
• A single system deriving valuable business insights from data sources
as varied as social media, email conversations or clickstream data.

• A single system used for a wide variety of purposes, such as log
processing, recommendation systems, data warehousing, market campaign
analysis and fraud detection.
4. Fast
• Storage method is based on a
distributed file system that basically
'maps' data wherever it is located on
a cluster.
• The tools for data processing are
often on the same servers where the
data is located, resulting in much
faster data processing.

• If you're dealing with large volumes of unstructured data, it is able
to efficiently process terabytes of data in just minutes, and petabytes
in hours.
5. Resilient
• Data is replicated to many nodes in the cluster,
which means that in the event of failure, there
is another copy available for use.
Sources of Big Data
Machines

People

Organizations
1. Machines
• Machine-generated data is the biggest source of Big Data.

A Boeing 787 produces half a terabyte of data per flight!

• Internet of Things, smart devices - phones & sensors.

‘A lot of smart devices’ x ‘A lot of data capture’ = Big Data


2. People
• Mostly unstructured and text-heavy.

• 80-90% of the total data in the world is unstructured.

• 75% of the total data on the internet is images/videos; it's called
the Dark Matter of the web.
3. Organizations
• Most data is structured: commercial transactions, govt. open data,
banking and stock records, medical health records, e-commerce, etc.
• It is at least as important as unstructured data.
• It often gets 'compartmentalized' into isolated information islands
called Data Silos.
• Benefits can be generated only by linking it with other structured
and unstructured data.
• Walmart collects 2.5 petabytes
of data per hour!
Applications of Big Data
• Personalized Marketing.

• Recommendation Engines.

• Sentiment Analysis.

• Mobile Advertising.

• Biomedical Applications.

• Smart Cities.
Traditional Data Warehouse
Modern Data Warehouse
Modern Data Pipelines
Differentiate between DBMS and DSMS
How to Get Value Out of
Big Data?

Data Science
Exploratory Data Analysis (EDA)
1) Question → Acquire → Ingest/ETL → Wrangling → Visualize

Modelling
2) Choose → Build/Train → Validate → Deploy → Test
5 P’s of Data Science

People

Process Programmability

Purpose

Platform
What is Apache Hadoop?
• The Apache Hadoop software library is a framework that allows for the
distributed processing of large data sets across clusters of computers
using simple programming models.

• It is designed to scale up from single servers to thousands of
machines, each offering local computation and storage.

• Rather than rely on hardware to deliver high availability, the library
itself is designed to detect and handle failures at the application layer.
Basic Hadoop Stack
Data Management Frameworks

HDFS: Hadoop Distributed File System. A Java-based, distributed file
system that provides scalable, reliable, high-throughput access to
application data stored across commodity servers.

YARN: Yet Another Resource Negotiator. A framework for cluster resource
management and job scheduling.
Basic Hadoop Stack
Operations Frameworks

Ambari: A web-based framework for provisioning, managing and monitoring
Hadoop clusters.

Zookeeper: A high-performance coordination service for distributed
applications.

Cloudbreak: A tool for provisioning and managing Hadoop clusters in the
cloud.

Oozie: A server-based workflow engine used to execute Hadoop jobs.
Basic Hadoop Stack
Data Access Frameworks

Pig: A high-level platform for extracting, transforming and analyzing
large datasets.

Hive: A data warehouse infrastructure that supports ad hoc SQL queries.

HCatalog: A table information, schema and metadata management layer
supporting Hive, Pig, MapReduce, and Tez processing.

Cascading: An application development framework for building data
applications, abstracting the details of complex MapReduce programming.

HBase: A scalable, distributed NoSQL database that supports structured
data storage for large tables.
Basic Hadoop Stack
Data Access Frameworks (cont.)

Phoenix: A client-side SQL layer over HBase that provides low-latency
access to HBase data.

Accumulo: A low-latency, large-table data storage and retrieval system
with cell-level security.

Storm: A distributed computation system for processing continuous
streams of real-time data.

Solr: A distributed search platform capable of indexing petabytes of
data.

Spark: A fast, general-purpose processing engine used to build and run
sophisticated SQL, streaming, machine learning or graph workloads.
Basic Hadoop Stack
Governance and Integration Frameworks

Falcon: A data governance tool providing workflow orchestration, data
lifecycle management, and data replication services.

WebHDFS: A REST API that uses standard HTTP verbs to access, operate on
and manage HDFS.

HDFS NFS Gateway: A gateway that enables access to HDFS as an
NFS-mounted file system.

Flume: A distributed, reliable and highly available service that
efficiently collects, aggregates and moves streaming data.
Basic Hadoop Stack
Governance and Integration Frameworks (cont.)

Sqoop: A set of tools for importing and exporting data between Hadoop
and RDBMS systems.

Kafka: A fast, scalable, durable, and fault-tolerant publish-subscribe
messaging system.

Atlas: A scalable and extensible set of core governance services
enabling enterprises to meet compliance and data integration
requirements.
Basic Hadoop Stack
Security Frameworks

HDFS: A storage management service providing file and directory
permissions, even more granular file and directory access control
lists, and transparent data encryption.

YARN: A resource management service with access control lists
controlling access to compute resources and YARN administrative
functions.

Hive: A data warehouse infrastructure service providing granular access
controls to table columns and rows.
Basic Hadoop Stack
Security Frameworks (cont.)

Falcon: A data governance tool providing access control lists that
limit who may submit Hadoop jobs.

Knox: A gateway providing perimeter security to a Hadoop cluster.

Ranger: A centralized security framework offering fine-grained policy
controls for HDFS, Hive, HBase, Knox, Storm, Kafka and Solr.
Hadoop as +1 Architecture
• Though it has the potential to replace other systems, it can also be
used to complement existing systems that cannot be removed due to
constraints.
Cloudera
Hortonworks Data Platform (HDP)
MapR
Cloudera vs Hortonworks vs MapR (1)
Cloudera vs Hortonworks vs MapR (2)
Distributed File System
• We said Big Data has a lot of volume. Is it then possible to store
Big Data in a single system?
• It's not. We need to distribute Big Data across multiple systems, for
which we need a Distributed File System (DFS).

[Figure: a node and a rack of nodes]
Distributed File System
• To achieve parallelization, we distribute data across
nodes and also move computation to each node.

[Figure: data blocks 1-5 distributed across the nodes of the cluster]
Distributed File System
Reading 1 TB of Data

1 Machine: 4 I/O channels, ~100 MB/s per channel → 43 minutes
10 Machines: 4 I/O channels, ~100 MB/s per channel → 4.3 minutes

10 times faster!
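
The 43-minute figure works out if each channel sustains roughly 100 MB/s; the quick Python check below (an illustrative sketch, not part of the slides) reproduces both numbers.

# Back-of-the-envelope check of the 1 TB read times (assumes ~100 MB/s per channel).
TB_IN_MB = 1024 * 1024          # 1 TB expressed in MB
CHANNEL_MB_PER_S = 100          # assumed throughput per I/O channel
CHANNELS_PER_MACHINE = 4

def read_minutes(machines: int) -> float:
    total_throughput = machines * CHANNELS_PER_MACHINE * CHANNEL_MB_PER_S  # MB/s
    return TB_IN_MB / total_throughput / 60                                # minutes

print(f"1 machine:   {read_minutes(1):.1f} minutes")    # ~43.7
print(f"10 machines: {read_minutes(10):.1f} minutes")   # ~4.4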
Distributed Computing
• It can be defined as the use of a distributed system to
solve a single large problem by breaking it down
into several tasks where each task is computed in the
individual computers of the distributed system.

• All the computers connected in a network communicate with each other
to attain a common goal by making use of their own local memory.

• Hadoop makes use of Distributed Computing.


Computer Cluster
• A group of network-connected computers working as a single unit to
perform a task.
Hadoop Server Roles

Client Machines, Master Nodes, Slave Nodes
Hadoop Server Roles
Clients

Master Nodes
- Distributed Data Processing (MapReduce): JobTracker
- Distributed Data Storage (HDFS): NameNode, Secondary NameNode

Slave Nodes
- Each slave node runs a DataNode (HDFS) and a TaskTracker (MapReduce)
Typical Hadoop Cluster
[Figure: a Hadoop cluster of racks (Rack 1 ... Rack n), each with its
own switch and nodes (Node 1 ... Node n), connected through core
switches]
HDFS Overview
• A fault-tolerant distributed file system for very large files.
• Write Once, Read Many times (WORM).
• Divides files into big blocks and distributes them across the cluster.

• Stores multiple replicas of each block for reliability.

• Programs can ask "Where do the pieces of my file live?"
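
As a rough illustration (assuming a 128 MB block size and a replication factor of 3, which are common defaults but are not stated on the slide), splitting a file into blocks and fanning out replicas looks like this:

import math

BLOCK_SIZE_MB = 128   # assumed HDFS block size
REPLICATION = 3       # assumed replication factor

def plan_blocks(file_size_mb: int):
    """Return (block count, total replicas stored) for one logical file."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    return blocks, blocks * REPLICATION

blocks, copies = plan_blocks(1000)   # a 1 GB logical file
print(f"1000 MB file -> {blocks} blocks, {copies} block replicas across the cluster")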
HDFS Overview
[Figure: a logical file divided into blocks 1-4; each block is
replicated and spread across Racks 1-3 of the Hadoop cluster]
HDFS Overview
• It looks and acts just like a file system.
hdfs dfs -command [args]

• A few of almost 30 HDFS commands:
- -cat :displays file content (uncompressed).
- -text :just like cat, but also works on compressed files.
- -chgrp, -chmod, -chown :change file permissions.
- -put, -get, -copyFromLocal, -copyToLocal :copy files from the local
file system to HDFS and vice versa.
- -ls, -ls -R :list files/directories.
- -mv, -moveFromLocal, -moveToLocal :move files.
- -stat :statistical info for any given file.
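
A few of these commands strung together from Python (a hypothetical session; it assumes a configured Hadoop client on the PATH, and the paths are made up for illustration):

# Hypothetical use of the HDFS shell from Python via subprocess.
import subprocess

def hdfs(*args):
    """Run an 'hdfs dfs' sub-command and return its standard output."""
    return subprocess.run(["hdfs", "dfs", *args], check=True,
                          capture_output=True, text=True).stdout

hdfs("-mkdir", "-p", "/user/student/logs")           # create a directory in HDFS
hdfs("-put", "access.log", "/user/student/logs/")    # copy from the local FS to HDFS
print(hdfs("-ls", "/user/student/logs"))             # list the directory
print(hdfs("-cat", "/user/student/logs/access.log")[:200])  # peek at the content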
HDFS Components

Name Data
Node Node
NameNode
• It acts as HDFS Master Component.
• It determines and maintains how chunks of data are
distributed and replicated across the DataNodes.

• It maintains critical HDFS system state information.
• To enhance HDFS performance, it maintains and serves this information
from memory.
• Therefore it is critical to ensure the NameNode always has sufficient
memory.
• If NameNode fails, HDFS fails.
NameNode
• Overview of information stored by NameNode.

Namespace: hierarchy, directory names, file names
Metadata: permissions & ownership, ACLs, block size & replication
levels, access & modification times, user quotas
Block Map: file names → block IDs
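
One way to picture this in-memory state (a simplified, hypothetical sketch; the real NameNode structures are far richer) is a pair of maps: file names to block IDs, and block IDs to the DataNodes holding replicas.

# Simplified, hypothetical model of the state the NameNode keeps in memory.
namespace = {
    "/user/student/logs/access.log": {
        "owner": "student", "permissions": "rw-r--r--",
        "block_size_mb": 128, "replication": 3,
        "blocks": ["blk_001", "blk_002"],          # file name -> block IDs
    }
}
block_map = {                                       # block ID -> DataNodes with a replica
    "blk_001": ["datanode1", "datanode4", "datanode7"],
    "blk_002": ["datanode2", "datanode5", "datanode8"],
}

def locate(path):
    """Answer 'where do the pieces of my file live?' entirely from memory."""
    return {b: block_map[b] for b in namespace[path]["blocks"]}

print(locate("/user/student/logs/access.log"))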
Secondary NameNode
• It does housekeeping and backup of the NameNode namespace and metadata.
• It is not a hot-standby for NameNode.

• It connects to NameNode every hour.

• Saved metadata can rebuild a failed NameNode.


NameNode High Availability
• The HDFS NameNode is a single point of failure.

• NameNode HA:
- Uses a redundant NameNode.
- Enables fast fail-over in response to
NameNode failure.
- Is configured by Ambari.
- Is configured in an active/standby
configuration.
- Permits administrator-initiated failover for
maintenance.
DataNode
• It acts as HDFS Slave Component.
• It is the only place where chunks of data are actually
stored.
• Other than storing data, it is also responsible for
replicating data.

• It keeps sending heartbeats to the NameNode to signal its availability.

• Every 10th heartbeat is a Block Report.

• DataNodes are heterogeneous: they support different types of storage:
disks, SSDs, memory.
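
A toy sketch of that heartbeat loop (the interval, message shapes and names here are assumptions; real DataNodes talk to the NameNode over an RPC protocol):

# Toy heartbeat loop: every heartbeat announces availability,
# and every 10th heartbeat carries a full block report.
import time

def datanode_loop(node_id, stored_blocks, send, interval_s=3, beats=20):
    for beat in range(1, beats + 1):
        message = {"node": node_id, "type": "heartbeat"}
        if beat % 10 == 0:                       # every 10th heartbeat
            message["type"] = "block_report"
            message["blocks"] = sorted(stored_blocks)
        send(message)                            # in reality: an RPC to the NameNode
        time.sleep(interval_s)

if __name__ == "__main__":
    # Print instead of contacting a real NameNode.
    datanode_loop("datanode1", {"blk_001", "blk_002"}, print, interval_s=0, beats=10)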
HDFS Architecture
[Figure: HDFS architecture - the HDFS client issues fs/namespace/metadata
ops to the NameNode; the Secondary NameNode keeps a namespace backup;
DataNodes serve data to clients, exchange heartbeats, balancing and
block ops with the NameNode, and replicate blocks among themselves]
HDFS Rack Awareness
• Never lose all data if an entire rack fails. How?

• Store replicas on multiple racks.

• Keep bulky flows in-rack.

• There is higher bandwidth and lower latency in-rack.

[Figure: Hadoop cluster with Racks 1-3, each with its own switch and nodes]
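
HDFS's default placement policy for a replication factor of 3 is commonly described as: first replica on the writer's node, second on a node in a different rack, third on another node in that same remote rack. Below is a hedged sketch of that policy (the topology and node names are made up):

import random

def place_replicas(writer_node, topology):
    """Sketch of the default 3-replica, rack-aware placement policy.
    topology: dict mapping rack name -> list of node names."""
    rack_of = {n: r for r, nodes in topology.items() for n in nodes}
    first = writer_node                                   # replica 1: the writer's node
    remote_rack = random.choice([r for r in topology if r != rack_of[first]])
    second = random.choice(topology[remote_rack])         # replica 2: a different rack
    third = random.choice([n for n in topology[remote_rack] if n != second])
    return [first, second, third]                         # replica 3: same remote rack

topology = {"rack1": ["n1", "n2", "n3"], "rack2": ["n4", "n5", "n6"], "rack3": ["n7", "n8"]}
print(place_replicas("n1", topology))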
HDFS Write Pipeline
Note: All DataNodes are in constant communication with
NameNode so no arrows are drawn for that.

[Figure: the HDFS client writes through its switch to DataNodes spread
across Racks 1-3; the NameNode coordinates block placement]
HDFS Write Pipeline
• HDFS manages writing of file block by block.
• Many files are being written in parallel to save time.

• All communication happens over TCP connections, so high bandwidth is
required.

• The HDFS write operation has two major cycles: acknowledgements and
data transfer.
• The starting node for each block isn't necessarily the same.

• The NameNode updates its metadata with the help of Block Reports sent
by DataNodes.
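
A simplified sketch of one block's write pipeline (data flows client → DN1 → DN2 → DN3 and acknowledgements flow back up; packet names and structures here are illustrative, not the real protocol):

def write_block(packets, pipeline):
    """Push packets down the DataNode pipeline, then collect acks back up."""
    for packet in packets:
        for datanode in pipeline:            # data transfer cycle: DN1 -> DN2 -> DN3
            datanode.setdefault("received", []).append(packet)
        for datanode in reversed(pipeline):  # ack cycle: DN3 -> DN2 -> DN1 -> client
            assert packet in datanode["received"], "missing packet -> pipeline failure"
    return "block committed"

pipeline = [{"name": "dn1"}, {"name": "dn2"}, {"name": "dn3"}]
print(write_block(["pkt0", "pkt1", "pkt2"], pipeline))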
Spanning HDFS Cluster
• Keep Block Size small.
• Small Block Size means more Blocks for a file.
• More Blocks means file is spread on more machines.

• More CPU cores and disk drives holding a block of the file mean more
parallel processing power and faster results.
• This is why we build large wide clusters.
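
The effect of block size on how widely a file is spread is simple arithmetic; the sketch below is illustrative (the file size, block sizes and cluster size are just example numbers):

import math

def spread(file_size_mb, block_size_mb, nodes_in_cluster):
    blocks = math.ceil(file_size_mb / block_size_mb)
    # at most one node can work on each block in a single parallel wave
    return blocks, min(blocks, nodes_in_cluster)

for block_size in (256, 128, 64):
    blocks, parallel = spread(10_000, block_size, nodes_in_cluster=50)
    print(f"{block_size} MB blocks -> {blocks} blocks, up to {parallel} nodes in parallel")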
Re-replicating in HDFS Cluster
• NameNode automatically takes care of recovering
missing and corrupted blocks.
• Missing heartbeats signify lost nodes.
• NameNode consults its metadata and finds the affected data.

• NameNode consults the Rack Awareness script.
• NameNode tells a DataNode to re-replicate its data to a specified
DataNode that is available.

[Figure: Hadoop cluster with Racks 1-3]
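
A sketch of that re-replication decision (the data structures are hypothetical; the real logic lives inside the NameNode, which also consults the rack awareness script when picking the target):

REPLICATION = 3

def handle_lost_node(lost_node, block_map, live_nodes):
    """When heartbeats stop, find under-replicated blocks and schedule new copies."""
    for block, holders in block_map.items():
        if lost_node in holders:
            holders.remove(lost_node)                      # consult metadata: affected blocks
            if len(holders) < REPLICATION:
                source = holders[0]                        # a surviving replica
                target = next(n for n in live_nodes if n not in holders)
                holders.append(target)                     # tell 'source' to copy the block to 'target'
                print(f"{block}: re-replicating from {source} to {target}")

block_map = {"blk_001": ["dn1", "dn4", "dn7"], "blk_002": ["dn2", "dn4", "dn8"]}
handle_lost_node("dn4", block_map, live_nodes=["dn1", "dn2", "dn3", "dn5", "dn7", "dn8"])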
HDFS Read
• HDFS manages reading of file block by block.
• Many files are being read in parallel to save time.

• All communication happens over TCP connections, so high bandwidth is
required.
• The NameNode updates its metadata with the help of Block Reports sent
by DataNodes.
• In case a DataNode needs a block that it does not have, the NameNode
provides rack-local DataNodes first to leverage in-rack bandwidth.
