Jiaul Paik
IIT Kharagpur
Lecture 1
Today’s topics
• Email ids:
• jiaul@cet.iitkgp.ac.in
• jia.paik@gmail.com
Prerequisites
• Knowledge of Data Structures
• Programming
• Python is highly recommended
Evaluation Policy
• The course will cover
• Using Hadoop
• Dealing with distributed data storage
• Mapreduce programming with Hadoop
• Spark
• Basics
• Streaming data
• Relational data
• Graph data
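As a taste of the MapReduce programming model listed above, here is a minimal word-count mapper and reducer sketched in plain Python (the function names are illustrative; with Hadoop Streaming, Hadoop itself performs the shuffle-and-sort between the two phases):

```python
from itertools import groupby

def mapper(lines):
    # Map phase: emit a (word, 1) pair for every word in the input.
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reducer(pairs):
    # Reduce phase: sum the counts for each word.
    # Requires the pairs to be sorted by key (Hadoop's shuffle does this).
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

# Simulate the full pipeline locally: map, sort (the "shuffle"), reduce.
def word_count(lines):
    return dict(reducer(sorted(mapper(lines))))
```

In a real Hadoop Streaming job the mapper and reducer would be two separate scripts reading from stdin and writing tab-separated key/value lines to stdout; the sketch above simulates the shuffle with a local sort.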
Programming Assignments
• Objectives
• Make you conversant with the basics of big-data processing technologies
• Submission
• Through moodle (link will be provided)
• Typical deadline
• 7-10 days (depending upon the complexity of the assignment)
What can you expect from this course?
1. Limitations of classical data processing systems
• MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems, by Donald Miner and Adam Shook
Acknowledgements: Some of the slides are taken from Jimmy Lin, University of Waterloo
What is big data?
Data-intensive Science
Large Hadron Collider
[Figure: a DNA sequencer produces large numbers of short reads, e.g. AATGCTTAGCTATGCGGGC, which must be stored and analyzed]
Engineering
The unreasonable effectiveness of data
Search, recommendation, prediction, …
Commerce: a useful service → revenue (hopefully)
Source: Guardian
Predicting X with Twitter
Data Science Tools: the “big data stack”
This course covers two layers:
• Analytics Infrastructure
• Relational data: SQL, joins, column stores
• Data mining: hashing, clustering (k-means), classification, recommendations
• Streams: probabilistic data structures (Bloom filters, CMS, HLL counters)
• Execution Infrastructure
• MapReduce, Spark, noSQL, Flink, Pig, Hive, Dryad, Pregel, Giraph, Storm
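As a taste of the streams layer above, here is a minimal Bloom filter sketch in Python (the bit-array size, hash count, and hashing scheme are illustrative choices, not a production design):

```python
import hashlib

class BloomFilter:
    # A Bloom filter answers set-membership queries with no false
    # negatives but a tunable false-positive rate, using far less
    # memory than storing the items themselves.

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos] for pos in self._positions(item))
```

With 1024 bits and 3 hashes the false-positive rate stays low for small sets; real deployments size these parameters from the expected number of items and the acceptable error rate.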
• Scalability of an algorithm
2. Parallel Processing
• Scale-up architecture: a powerful server with lots of RAM, disk, and CPU cores
k-means: iterative steps
• STEP 1: Pick k initial centers
• STEP 2: Assign each point to its nearest center
• STEP 3: Recalculate centers as cluster means; repeat Steps 2–3 until convergence
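The iterative steps of k-means can be sketched in plain Python for 1-D points (an illustrative Lloyd's-algorithm sketch, not the course's reference implementation):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    # Lloyd's algorithm on 1-D points.
    rng = random.Random(seed)
    centers = rng.sample(points, k)        # STEP 1: pick k initial centers
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                   # STEP 2: assign to nearest center
            nearest = min(range(k), key=lambda i: (p - centers[i]) ** 2)
            clusters[nearest].append(p)
        for i, cluster in enumerate(clusters):
            if cluster:                    # STEP 3: recalculate centers
                centers[i] = sum(cluster) / len(cluster)
    return sorted(centers)
```

Each pass over the data is one iteration; on a single machine the cost grows with (number of points) x (number of iterations), which is exactly what becomes painful at big-data scale.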
Solution 1: Improving Algorithmic Efficiency
Sampling-based k-means
• Key assumption:
• The centroids from the random samples are very close to the centroids of the
original data
[Figure: run k-means on a random sample (e.g., 30%) of the data; the resulting approximate centroids lie close to the original centroids]
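A sketch of this sampling idea, assuming 1-D points and a plain-Python k-means helper (the function names and the 30% fraction are illustrative):

```python
import random

def kmeans_1d(points, k, iters=20, rng=None):
    # Plain Lloyd's algorithm on 1-D points (illustrative helper).
    rng = rng or random.Random(0)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: (p - centers[i]) ** 2)].append(p)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

def sampled_kmeans(points, k, fraction=0.3, seed=0):
    # Key assumption from the slide: centroids of a random sample
    # are very close to the centroids of the full data, so we can
    # cluster the (much smaller) sample instead.
    rng = random.Random(seed)
    sample = rng.sample(points, max(k, int(fraction * len(points))))
    return kmeans_1d(sample, k, rng=rng)
```

The win is that every k-means iteration now touches only the sample, at the cost of centroids that are approximate rather than exact.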
Solution 2: Parallel Processing
• Scale-up Architecture
• Pros: Fast processing (if data fits into RAM)
• Cons: risk of data loss on system failure; scalability issues for very large data
• Scale-out architecture
• Pros: Can handle very large data, fault tolerant
• Cons: Communication bottleneck, difficulty in writing code
How to Tackle Big Data?
Source: Google
Divide and Conquer
“Work” → Partition → w1, w2, w3 → r1, r2, r3 → Combine → “Result”
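The partition/combine pattern can be sketched as follows, using word counting as the “work” (the thread pool and function names are illustrative; a real system would run the workers on separate machines):

```python
from concurrent.futures import ThreadPoolExecutor

def work(partition):
    # The per-worker task w_i: count words in one chunk of documents,
    # producing a partial result r_i.
    counts = {}
    for doc in partition:
        for word in doc.split():
            counts[word] = counts.get(word, 0) + 1
    return counts

def combine(results):
    # Merge the partial results r1, r2, r3, ... into the final "Result".
    total = {}
    for partial in results:
        for word, n in partial.items():
            total[word] = total.get(word, 0) + n
    return total

def divide_and_conquer(docs, num_workers=3):
    # Partition the "Work" into num_workers roughly equal chunks w1, w2, w3.
    chunks = [docs[i::num_workers] for i in range(num_workers)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return combine(pool.map(work, chunks))
```

Note that word counting combines cleanly because the partial counts are independent and addition is associative; the hard parts listed next (coordination, failures, skew) are exactly what this toy version hides.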
What are the Challenges?