
Big Data Processing

Jiaul Paik
IIT Kharagpur
Lecture 1
Today’s topics

• Course Information and Logistics

• Introduction to Big Data Processing


Course Information
Teacher
• Jiaul Paik

• Email IDs:
• jiaul@cet.iitkgp.ac.in
• jia.paik@gmail.com
Prerequisites
• Knowledge of Data Structures

• Knowledge of Algorithm Design

• Programming
• Python is highly recommended
Evaluation Policy

Type                   | # times | Break-up (midpoint: 30 April)
-----------------------|---------|------------------------------
Quiz                   | 2       | 1 + 1
Written Test           | 2       | 1 + 1
Programming Assignment | 4       | 2 + 2


Grading Policy

Type                                   | Weightage (%)
---------------------------------------|--------------
Quiz (MCQ)                             | 30
Written Test (MCQ + short-answer type) | 50
Programming Assignments                | 20
Hands-on/Tutorial Sessions
• There will be several hands-on/tutorial sessions

• They will cover
  • Using Hadoop
    • Dealing with distributed data storage
    • MapReduce programming with Hadoop
  • Spark
    • Basics
    • Streaming data
    • Relational data
    • Graph data
Programming Assignments
• Objectives
  • Make you conversant with the basics of big data processing technologies
  • Gain experience with algorithm/program design for big data

• Submission
  • Through Moodle (link will be provided)
  • Typical deadline: 7-10 days (depending on the complexity of the assignment)
What can you expect from this course?
1. Limitations of classical data processing systems

2. Basics of cluster computing for big data

3. Distributed storage for big data

4. Functional programming with Python and Scala

5. Hadoop internals and applications

6. Programming with Hadoop MapReduce

7. Spark internals and programming

8. Large-scale machine learning with Spark


Important Notes
• This is a general-purpose practical course

• The techniques you learn can be applied to any form of data that is 'big'

• It requires a new kind of programming
  • NOT difficult, but a new programming style

• Thus, hands-on programming experience with modern big data systems is absolutely essential
Books
• Mining of Massive Datasets: Rajaraman and Ullman

• Data-Intensive Text Processing with MapReduce: Lin and Dyer

• Learning Spark: Konwinski et al.

• Spark - The Definitive Guide: Chambers and Zaharia

• MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems: Miner and Shook

• Hadoop: The Definitive Guide: Tom White


Introduction to Big Data

Acknowledgements: Some of the slides are taken from Jimmy Lin, University of Waterloo
What is big data?

Why big data?

How to deal with big data?


How much data?

Google:
• Processes 20 PB a day (2008)
• Crawls 20B web pages a day (2012)
• Search index is 100+ PB (5/2014)
• Bigtable serves 2+ EB, 600M QPS (5/2014)
How much data?

Hadoop: 10K nodes, 150K cores, 150 PB (4/2014)

Amazon S3: 2T objects, 1.1M requests/second (4/2013)
How much data?

Facebook: 300 PB of data in Hive + 600 TB/day (4/2014)

150 PB on 50k+ servers running 15k apps (6/2011)

LHC: ~15 PB a year


Why big data?
• Science
• Engineering
• Commerce
• Society

Source: Wikipedia (Everest)


Science

Data-intensive Science
Large Hadron Collider

Maximilien Brice, © CERN


Human Genome

Source: Wikipedia (DNA)


[Figure: a sequencer produces billions of short reads that must be assembled against the subject genome]

• Human genome: ~3 Gbp
• A few billion short reads (~100 GB of compressed data)
Engineering
The unreasonable effectiveness of data
Search, recommendation, prediction, …

Source: Wikipedia (Three Gorges Dam)


Language Translation

Commerce
Know the customers
Data → Insights → Competitive advantages

Source: Wikipedia (Shinjuku, Tokyo)


Business Intelligence

An organization should retain the data that result from carrying out its mission and exploit those data to generate insights that benefit the organization, e.g., market analysis, strategic planning, and decision making.
Virtuous Product Cycle

[Cycle diagram: a useful service → analyze user behavior to extract insights (data science) → transform insights into action (data products) → revenue (hopefully) → back into the service]

Google. Facebook. Twitter. Amazon. Uber.


Society
Humans as social sensors
Computational social science

Source: Guardian
Predicting X with Twitter

2010 US Midterm Elections: 60M users shown "I Voted" messages

Summary: increased turnout by 60k directly and 280k indirectly

(Paul and Dredze, ICWSM 2011; Bond et al., Nature 2012)


Focus of this course

The "big data stack" (top to bottom):
• Data Science Tools
• Analytics Infrastructure   ← this course
• Execution Infrastructure   ← this course
Buzzwords

By layer of the big data stack:
• Data Science Tools: data analytics, business intelligence, OLAP, ETL, data warehouses and data lakes
• Analytics/Execution Infrastructure (this course): MapReduce, Spark, NoSQL, Flink, Pig, Hive, Dryad, Pregel, Giraph, Storm

By type of data and task:
• Text: frequency estimation, language models, inverted indexes
• Graphs: graph traversals, random walks (PageRank)
• Relational data: SQL, joins, column stores
• Data mining: hashing, clustering (k-means), classification, recommendations
• Streams: probabilistic data structures (Bloom filters, CMS, HLL counters)

This course focuses on algorithm design and “programming at scale”


What is the Goal of Big Data Processing?
• Finding useful patterns in large data in a reasonable amount of time

• The primary focus is on efficiency, without losing much accuracy

• Scalability of an algorithm
  • How its complexity grows with the problem size
  • How well can it handle big data?


Two Common Routes to Scalability

1. Improving Algorithmic Efficiency

2. Parallel Processing

• Scale-up architecture: a powerful server with lots of RAM, disk, and CPU cores

• Scale-out architecture: a cluster of low-cost computers
  • e.g., MapReduce, Spark

An Example: Data Clustering

Create 10,000 clusters from 1 billion vectors of dimension 1000

STEP 1: Start with k initial cluster centers (hence the name k-means)

Iterative steps:
  STEP 2: Assign each member to the closest center
  STEP 3: Recalculate the centers
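
A minimal sketch of these three steps in Python with NumPy (the function name and toy sizes are illustrative assumptions, not course code):

```python
import numpy as np

def kmeans(data, k, iters=10, seed=0):
    """Plain k-means, mirroring the three steps above (empty clusters not handled)."""
    rng = np.random.default_rng(seed)
    # STEP 1: pick k initial centers at random from the data
    centers = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(iters):
        # STEP 2: assign each point to its closest center
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # STEP 3: recalculate each center as the mean of its members
        centers = np.stack([data[labels == j].mean(axis=0) for j in range(k)])
    return centers, labels

# Toy usage (the real problem above is 1e9 points of dimension 1000)
points = np.random.default_rng(1).normal(size=(1000, 2))
centers, labels = kmeans(points, k=5)
```
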
K-means: Illustration (sequence of figure slides)
• Initialize centers randomly
• Assign points to the nearest center
• Readjust centers
• Assign points to the nearest center
• Readjust centers
K-means: The Expensive Part

Create 10,000 clusters from 1 billion vectors of dimension 1000

STEP 1: Start with k initial cluster centers

Iterative steps:
  STEP 2: Assign each member to the closest center. This is the expensive part:
          10^9 points × 10^4 centers × 10^3 ops per distance = 10^16 operations per iteration
  STEP 3: Recalculate the centers
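
To see why this is hopeless on one machine, a back-of-envelope calculation (the 10^9 operations/second throughput is an assumed figure, not from the slides):

```python
# Back-of-envelope cost of one assignment pass (assumed core speed: 1e9 ops/s)
n, d, k = 10**9, 10**3, 10**4     # points, dimensions, clusters
ops_per_iter = n * d * k          # 1e16 multiply-adds per iteration
days = ops_per_iter / 1e9 / 86400
print(f"~{days:.0f} days per iteration on a single core")  # ~116 days
```
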
Solution 1: Improving Algorithmic Efficiency
Sampling-based k-means

• Take a random sample of the data

• Apply k-means to that sample to produce approximate centroids (see the sketch below)

• Key assumption:
  • The centroids of the random sample are very close to the centroids of the original data

Selective Search: Efficient and Effective Search of Large Textual Collections, by Kulkarni and Callan, ACM TOIS, 2016
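
A sketch of this idea, reusing the illustrative kmeans() function from the earlier sketch (the 30% sample fraction and all names are assumptions):

```python
import numpy as np

def sampled_kmeans(data, k, sample_frac=0.3, seed=0):
    """Sampling-based k-means: cluster a sample, then one cheap assignment pass.
    Assumes the kmeans() sketch defined earlier is in scope."""
    rng = np.random.default_rng(seed)
    # Run full (iterative) k-means only on a random sample of the data
    idx = rng.choice(len(data), size=int(sample_frac * len(data)), replace=False)
    centers, _ = kmeans(data[idx], k)          # approximate centroids
    # Assign ALL points to the approximate centroids (single pass, no iteration)
    dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
    return centers, dists.argmin(axis=1)
```
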
Illustration: Sampling-based k-means

[Diagram: take a random sample (30%) of the data, run k-means on the sample to obtain approximate centroids (close to the original centroids), then assign clusters to the original data]
Solution 2: Parallel Processing

STEP 1: Start with k initial cluster centers

Iterative steps:
  STEP 2: Assign each member to the closest center
  STEP 3: Recalculate the centers

1. Split the data into small chunks
2. Process each chunk on a different core / node in a cluster (see the sketch below)
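
A single-machine sketch of this scheme using Python's process pool (the chunk count, helper names, and partial-sum protocol are my assumptions, not the course's code):

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def assign_chunk(args):
    """Worker: the expensive STEP 2 on one chunk; returns partial sums for STEP 3."""
    chunk, centers = args
    dists = np.linalg.norm(chunk[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    k, d = centers.shape
    sums, counts = np.zeros((k, d)), np.zeros(k, dtype=int)
    for j in range(k):
        members = chunk[labels == j]
        sums[j], counts[j] = members.sum(axis=0), len(members)
    return sums, counts

def parallel_kmeans_iter(data, centers, n_chunks=8):
    """One k-means iteration with chunks processed on separate cores."""
    chunks = np.array_split(data, n_chunks)
    with ProcessPoolExecutor() as pool:   # needs a __main__ guard on some platforms
        partials = list(pool.map(assign_chunk, [(c, centers) for c in chunks]))
    sums = sum(s for s, _ in partials)
    counts = sum(c for _, c in partials)
    return sums / counts[:, None]         # new centers (empty clusters not handled)
```
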
Pros and Cons
• Sampling-based method
  • Pros: fast processing
  • Cons: lossy inference, often low accuracy

• Scale-up architecture
  • Pros: fast processing (if the data fits into RAM)
  • Cons: risk of data loss (system failure), scalability issues for very large data

• Scale-out architecture
  • Pros: can handle very large data, fault tolerant
  • Cons: communication bottleneck, harder to write code for
How to Tackle Big Data?

Source: Google
Divide and Conquer

              "Work"
                 |
             Partition
            /    |    \
          w1     w2     w3
           |      |      |
        worker  worker  worker
           |      |      |
          r1     r2     r3
            \    |    /
              Combine
                 |
             "Result"
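
In code, the same partition/compute/combine skeleton might look like this (a generic sketch; the word-count worker is just a stand-in example, not from the slides):

```python
from concurrent.futures import ProcessPoolExecutor

def count_words(lines):
    """Worker: processes one partition of the work."""
    return sum(len(line.split()) for line in lines)

def divide_and_conquer(work, worker, combine, n_parts=3):
    """Partition the work, run workers in parallel, combine the partial results."""
    size = -(-len(work) // n_parts)                                # ceiling division
    parts = [work[i:i + size] for i in range(0, len(work), size)]  # w1 .. wN
    with ProcessPoolExecutor() as pool:
        partials = list(pool.map(worker, parts))                   # r1 .. rN
    return combine(partials)                                       # "Result"

if __name__ == "__main__":
    lines = ["big data processing"] * 1000
    print(divide_and_conquer(lines, count_words, combine=sum))     # 3000
```
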
What are the Challenges?

• How do we assign work units to workers?


• What if we have more work units than workers?
• What if workers need to share partial results?
• How do we aggregate partial results?
• How do we know all the workers have finished?
• What if workers die?

What’s the common theme of all of these problems?

You might also like