
Big Data Processing

Jiaul Paik
IIT Kharagpur
Lecture 1
Today’s topics

• Course Information and Logistics

• Introduction to Big Data Processing


Course Information
Teacher
• Jiaul Paik

• Email IDs:
• jiaul@cet.iitkgp.ac.in
• jia.paik@gmail.com
Prerequisites
• Knowledge of Data Structures

• Knowledge of Algorithm Design

• Programming
• Python is highly recommended
Evaluation Policy

Type                   | # times | Break-up (midpoint: 30 April)
-----------------------|---------|------------------------------
Quiz                   | 2       | 1 + 1
Written Test           | 2       | 1 + 1
Programming Assignment | 4       | 2 + 2


Grading Policy

Type                                   | Weightage (%)
---------------------------------------|--------------
Quiz (MCQ)                             | 30
Written Test (MCQ + short-answer type) | 50
Programming Assignments                | 20
Hands-on/Tutorial Sessions
• There will be several hands-on/tutorial sessions

• They will cover
  • Using Hadoop
    • Dealing with distributed data storage
    • MapReduce programming with Hadoop
  • Spark
    • Basics
    • Streaming data
    • Relational data
    • Graph data
Programming Assignments
• Objectives
  • Make you conversant with the basics of big data processing technologies
  • Gain experience with algorithm/program design for big data

• Submission
  • Through Moodle (link will be provided)
  • Typical deadline: 7-10 days (depending on the complexity of the assignment)
What can you expect from this course?
1. Limitations of classical data processing systems

2. Basics of cluster computing for big data

3. Distributed storage for big data

4. Functional programming with Python and Scala

5. Hadoop internals and applications

6. Programming with Hadoop MapReduce

7. Spark internals and programming

8. Large-scale machine learning with Spark


Important Notes
• This is a general-purpose practical course

• The techniques you learn can be applied to any form of data that is 'big'

• It requires a new kind of programming
  • NOT difficult, but a new programming style

• Thus, hands-on programming experience with modern big data systems is absolutely essential
Books
• Mining of Massive Datasets: Rajaraman and Ullman

• Data-Intensive Text Processing with MapReduce: Lin and Dyer

• Learning Spark: Konwinski et al.

• Spark - The Definitive Guide: Chambers and Zaharia

• MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems: Miner and Shook

• Hadoop: The Definitive Guide: Tom White


Introduction to Big Data

Acknowledgements: Some of the slides are taken from Jimmy Lin, University of Waterloo
What is big data?

Why big data?

How to deal with big data?


How much data?

Google:
• Processes 20 PB a day (2008)
• Crawls 20B web pages a day (2012)
• Search index is 100+ PB (5/2014)
• Bigtable serves 2+ EB, 600M QPS (5/2014)
How much data?

Hadoop: 10K nodes, 150K cores, 150 PB (4/2014)

Amazon S3: 2T objects, 1.1M requests/second (4/2013)
How much data?

Facebook: 300 PB of data in Hive + 600 TB/day (4/2014)

150 PB on 50k+ servers running 15k apps (6/2011)

LHC: ~15 PB a year


Why big data?
• Science
• Engineering
• Commerce
• Society

Source: Wikipedia (Everest)


Science

Data-intensive Science
Large Hadron Collider

Maximilien Brice, © CERN


Human Genome

Source: Wikipedia (DNA)


[Figure: a sequencer produces billions of short reads that must be assembled against the subject genome]

• Human genome: ~3 Gbp
• A few billion short reads (~100 GB of compressed data)
Engineering
The unreasonable effectiveness of data
Search, recommendation, prediction, …

Source: Wikipedia (Three Gorges Dam)


Language Translation

Commerce
Know the customers
Data → Insights → Competitive advantages

Source: Wikipedia (Shinjuku, Tokyo)


Business Intelligence

An organization should retain the data that result from carrying out its mission and exploit those data to generate insights that benefit the organization, e.g., market analysis, strategic planning, and decision making.
Virtuous Product Cycle

[Cycle diagram: a useful service → analyze user behavior to extract insights (data science) → transform insights into action (data products) → revenue (hopefully) → back into the service]

Google. Facebook. Twitter. Amazon. Uber.


Society
Humans as social sensors
Computational social science

Source: Guardian
Predicting X with Twitter

2010 US Midterm Elections: 60M users shown "I Voted" messages

Summary: increased turnout by 60k directly and 280k indirectly

(Paul and Dredze, ICWSM 2011; Bond et al., Nature 2012)


Focus of this course

The "big data stack" (top to bottom):
• Data Science Tools
• Analytics Infrastructure   ← this course
• Execution Infrastructure   ← this course
Buzzwords

By layer of the big data stack:
• Data Science Tools: data analytics, business intelligence, OLAP, ETL, data warehouses and data lakes
• Analytics/Execution Infrastructure (this course): MapReduce, Spark, NoSQL, Flink, Pig, Hive, Dryad, Pregel, Giraph, Storm

By type of data and task:
• Text: frequency estimation, language models, inverted indexes
• Graphs: graph traversals, random walks (PageRank)
• Relational data: SQL, joins, column stores
• Data mining: hashing, clustering (k-means), classification, recommendations
• Streams: probabilistic data structures (Bloom filters, CMS, HLL counters)

This course focuses on algorithm design and “programming at scale”


What is the Goal of Big Data Processing?
• Finding useful patterns in large data in a reasonable amount of time

• The primary focus is on efficiency, without losing much accuracy

• Scalability of an algorithm
  • How its complexity grows with the problem size
  • How well can it handle big data?


Two Common Routes to Scalability

1. Improving Algorithmic Efficiency

2. Parallel Processing

• Scale-up architecture: a powerful server with lots of RAM, disk, and CPU cores

• Scale-out architecture: a cluster of low-cost computers
  • e.g., MapReduce, Spark

An Example: Data Clustering

Create 10,000 clusters from 1 billion vectors of dimension 1000

STEP 1: Start with k initial cluster centers (hence the name k-means)

Iterative steps:
  STEP 2: Assign each member to the closest center
  STEP 3: Recalculate the centers
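
A minimal sketch of these three steps in Python with NumPy (the function name and toy sizes are illustrative assumptions, not course code):

```python
import numpy as np

def kmeans(data, k, iters=10, seed=0):
    """Plain k-means, mirroring the three steps above (empty clusters not handled)."""
    rng = np.random.default_rng(seed)
    # STEP 1: pick k initial centers at random from the data
    centers = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(iters):
        # STEP 2: assign each point to its closest center
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # STEP 3: recalculate each center as the mean of its members
        centers = np.stack([data[labels == j].mean(axis=0) for j in range(k)])
    return centers, labels

# Toy usage (the real problem above is 1e9 points of dimension 1000)
points = np.random.default_rng(1).normal(size=(1000, 2))
centers, labels = kmeans(points, k=5)
```
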
K-means: Illustration (sequence of figure slides)
• Initialize centers randomly
• Assign points to the nearest center
• Readjust centers
• Assign points to the nearest center
• Readjust centers
K-means: The Expensive Part

Create 10,000 clusters from 1 billion vectors of dimension 1000

STEP 1: Start with k initial cluster centers

Iterative steps:
  STEP 2: Assign each member to the closest center. This is the expensive part:
          10^9 points × 10^4 centers × 10^3 ops per distance = 10^16 operations per iteration
  STEP 3: Recalculate the centers
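
To see why this is hopeless on one machine, a back-of-envelope calculation (the 10^9 operations/second throughput is an assumed figure, not from the slides):

```python
# Back-of-envelope cost of one assignment pass (assumed core speed: 1e9 ops/s)
n, d, k = 10**9, 10**3, 10**4     # points, dimensions, clusters
ops_per_iter = n * d * k          # 1e16 multiply-adds per iteration
days = ops_per_iter / 1e9 / 86400
print(f"~{days:.0f} days per iteration on a single core")  # ~116 days
```
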
Solution 1: Improving Algorithmic Efficiency
Sampling-based k-means

• Take a random sample of the data

• Apply k-means to that sample to produce approximate centroids (see the sketch below)

• Key assumption:
  • The centroids of the random sample are very close to the centroids of the original data

Selective Search: Efficient and Effective Search of Large Textual Collections, by Kulkarni and Callan, ACM TOIS, 2016
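
A sketch of this idea, reusing the illustrative kmeans() function from the earlier sketch (the 30% sample fraction and all names are assumptions):

```python
import numpy as np

def sampled_kmeans(data, k, sample_frac=0.3, seed=0):
    """Sampling-based k-means: cluster a sample, then one cheap assignment pass.
    Assumes the kmeans() sketch defined earlier is in scope."""
    rng = np.random.default_rng(seed)
    # Run full (iterative) k-means only on a random sample of the data
    idx = rng.choice(len(data), size=int(sample_frac * len(data)), replace=False)
    centers, _ = kmeans(data[idx], k)          # approximate centroids
    # Assign ALL points to the approximate centroids (single pass, no iteration)
    dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
    return centers, dists.argmin(axis=1)
```
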
Illustration: Sampling-based k-means

[Diagram: take a random sample (30%) of the data, run k-means on the sample to obtain approximate centroids (close to the original centroids), then assign clusters to the original data]
Solution 2: Parallel Processing

STEP 1: Start with k initial cluster centers

Iterative steps:
  STEP 2: Assign each member to the closest center
  STEP 3: Recalculate the centers

1. Split the data into small chunks
2. Process each chunk on a different core / node in a cluster (see the sketch below)
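
A single-machine sketch of this scheme using Python's process pool (the chunk count, helper names, and partial-sum protocol are my assumptions, not the course's code):

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def assign_chunk(args):
    """Worker: the expensive STEP 2 on one chunk; returns partial sums for STEP 3."""
    chunk, centers = args
    dists = np.linalg.norm(chunk[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    k, d = centers.shape
    sums, counts = np.zeros((k, d)), np.zeros(k, dtype=int)
    for j in range(k):
        members = chunk[labels == j]
        sums[j], counts[j] = members.sum(axis=0), len(members)
    return sums, counts

def parallel_kmeans_iter(data, centers, n_chunks=8):
    """One k-means iteration with chunks processed on separate cores."""
    chunks = np.array_split(data, n_chunks)
    with ProcessPoolExecutor() as pool:   # needs a __main__ guard on some platforms
        partials = list(pool.map(assign_chunk, [(c, centers) for c in chunks]))
    sums = sum(s for s, _ in partials)
    counts = sum(c for _, c in partials)
    return sums / counts[:, None]         # new centers (empty clusters not handled)
```
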
Pros and Cons
• Sampling-based method
  • Pros: fast processing
  • Cons: lossy inference, often low accuracy

• Scale-up architecture
  • Pros: fast processing (if the data fits into RAM)
  • Cons: risk of data loss (system failure), scalability issues for very large data

• Scale-out architecture
  • Pros: can handle very large data, fault tolerant
  • Cons: communication bottleneck, harder to write code for
How to Tackle Big Data?

Source: Google
Divide and Conquer

              "Work"
                 |
             Partition
            /    |    \
          w1     w2     w3
           |      |      |
        worker  worker  worker
           |      |      |
          r1     r2     r3
            \    |    /
              Combine
                 |
             "Result"
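
In code, the same partition/compute/combine skeleton might look like this (a generic sketch; the word-count worker is just a stand-in example, not from the slides):

```python
from concurrent.futures import ProcessPoolExecutor

def count_words(lines):
    """Worker: processes one partition of the work."""
    return sum(len(line.split()) for line in lines)

def divide_and_conquer(work, worker, combine, n_parts=3):
    """Partition the work, run workers in parallel, combine the partial results."""
    size = -(-len(work) // n_parts)                                # ceiling division
    parts = [work[i:i + size] for i in range(0, len(work), size)]  # w1 .. wN
    with ProcessPoolExecutor() as pool:
        partials = list(pool.map(worker, parts))                   # r1 .. rN
    return combine(partials)                                       # "Result"

if __name__ == "__main__":
    lines = ["big data processing"] * 1000
    print(divide_and_conquer(lines, count_words, combine=sum))     # 3000
```
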
What are the Challenges?

• How do we assign work units to workers?


• What if we have more work units than workers?
• What if workers need to share partial results?
• How do we aggregate partial results?
• How do we know all the workers have finished?
• What if workers die?

What’s the common theme of all of these problems?

You might also like