Welcome to Scribd!

Skip carousel

1 ThinkingAtScale

Uploaded by

banala.kalyan

0% found this document useful (0 votes)

7 views20 pages

sdgsgdgs

Original Title

1-ThinkingAtScale

Copyright

Available Formats

PDF, TXT or read online from Scribd

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Report this Document

sdgsgdgs

Copyright:

Available Formats

Download as PDF, TXT or read online from Scribd

Flag for inappropriate content

0% found this document useful (0 votes)

7 views20 pages

1 ThinkingAtScale

Uploaded by

banala.kalyan

sdgsgdgs

Copyright:

Available Formats

Download as PDF, TXT or read online from Scribd

Flag for inappropriate content

Jump to Page

You are on page 1of 20

Search inside document

Thinking at Scale

This presentation includes course content University of Washington

Redistributed under the Creative Commons Attribution 3.0 license.
All other contents:
2009 Cloudera, Inc.

Overview
Characteristics of large-scale computation
Pitfalls of distributed systems
Designing for scalability

2009 Cloudera, Inc.

By the numbers
Max data in memory: 32 GB
Max data per computer: 12 TB
Data processed by Google every month:
400 PB in 2007
Average job size: 180 GB
Time that would take to read sequentially
off a single drive: 45 minutes
2009 Cloudera, Inc.

What does this mean?

We can process data very quickly but we
can read/write it very slowly
Solution: parallel reads
1 HDD = 75 MB/sec
1000 HDDs = 75 GB/sec
Now youre talking!

2009 Cloudera, Inc.

Sharing is Slow
Grid computing: not new
MPI, PVM, Condor

Grid focus: distribute the workload

NetApp filer or other SAN drives many
compute nodes

Modern focus: distribute the data

Reading 100 GB off a single filer would leave
nodes starved just store data locally
2009 Cloudera, Inc.

Sharing is Tricky
Exchanging data requires synchronization
Deadlock becomes a problem

Finite bandwidth is available

Distributed systems can drown themselves
Failovers can cause cascading failure

Temporal dependencies are complicated

Difficult to reason about partial restarts

2009 Cloudera, Inc.

Ken Arnold, CORBA designer:

Failure is the defining difference between
distributed and local programming

2009 Cloudera, Inc.

Reliability Demands
Support partial failure
Total system must support graceful decline in
application performance rather than a full halt

2009 Cloudera, Inc.

Reliability Demands
Data Recoverability
If components fail, their workload must be
picked up by still-functioning units

2009 Cloudera, Inc.

Reliability Demands
Individual Recoverability
Nodes that fail and restart must be able to
rejoin the group activity without a full group
restart

2009 Cloudera, Inc.

Reliability Demands
Consistency
Concurrent operations or partial internal
failures should not cause externally visible
nondeterminism

2009 Cloudera, Inc.

Reliability Demands
Scalability
Adding increased load to a system should not
cause outright failure, but a graceful decline
Increasing resources should support a
proportional increase in load capacity

2009 Cloudera, Inc.

A Radical Way Out

Nodes talk to each other as little as
possible maybe never
Shared nothing architecture

Programmer should not explicitly be

allowed to communicate between nodes
Data is spread throughout machines in
advance, computation happens where its
stored.
2009 Cloudera, Inc.

Motivations for MapReduce

Data processing: > 1 TB
Massively parallel (hundreds or thousands
of CPUs)
Must be easy to use
High-level applications written in MapReduce
Programmers dont worry about socket(), etc.

2009 Cloudera, Inc.

Locality
Master program divvies up tasks based on
location of data: tries to have map tasks
on same machine as physical file data, or
at least same rack
Map task inputs are divided into 64128
MB blocks: same size as filesystem
chunks
Process components of a single file in parallel
2009 Cloudera, Inc.

Fault Tolerance
Tasks designed for independence
Master detects worker failures
Master re-executes tasks that fail while in
progress
Restarting one task does not require
communication with other tasks
Data is replicated to increase availability,
durability
2009 Cloudera, Inc.

Optimizations
No reduce can start until map is complete:
A single slow disk controller can rate-limit the
whole process

Master redundantly executes slowmoving map tasks; uses results of first

copy to finish

2009 Cloudera, Inc.

Conclusions
Computing with big datasets is a
fundamentally different challenge than
doing big compute over a small dataset
New ways of thinking about problems
needed
New tools provide means to capture this
MapReduce, HDFS, etc. can help

2009 Cloudera, Inc.

Hadoop: Data Processing and Modelling
From Everand
Hadoop: Data Processing and Modelling
Garry Turkington
No ratings yet
Apache Oozie Essentials
From Everand
Apache Oozie Essentials
Singh Jagat Jasjit
No ratings yet
Hadoop Cluster Deployment
From Everand
Hadoop Cluster Deployment
Danil Zburivsky
No ratings yet
Monitoring Hadoop
From Everand
Monitoring Hadoop
Gurmukh Singh
No ratings yet
Learn Hive in 24 Hours
From Everand
Learn Hive in 24 Hours
Alex Nordeen
No ratings yet
Getting Started with Big Data Query using Apache Impala
From Everand
Getting Started with Big Data Query using Apache Impala
Agus Kurniawan
No ratings yet
Exploring Hadoop Ecosystem (Volume 2): Stream Processing
From Everand
Exploring Hadoop Ecosystem (Volume 2): Stream Processing
Wei Liu
No ratings yet
Fast Data Processing with Spark 2 - Third Edition
From Everand
Fast Data Processing with Spark 2 - Third Edition
Krishna Sankar
No ratings yet
Cloudera A Complete Guide - 2019 Edition
From Everand
Cloudera A Complete Guide - 2019 Edition
Gerardus Blokdyk
No ratings yet
Exploring Hadoop Ecosystem (Volume 1): Batch Processing
From Everand
Exploring Hadoop Ecosystem (Volume 1): Batch Processing
Wei Liu
No ratings yet
Hadoop BIG DATA Interview Questions You'll Most Likely Be Asked
From Everand
Hadoop BIG DATA Interview Questions You'll Most Likely Be Asked
Vibrant Publishers
No ratings yet
Hadoop in Practice
From Everand
Hadoop in Practice
Alex Holmes
No ratings yet
Hadoop Training #5: MapReduce Algorithm
Document31 pages
Hadoop Training #5: MapReduce Algorithm
Dmytro Shteflyuk
100% (2)
HBase Administration Cookbook
From Everand
HBase Administration Cookbook
Yifeng Jiang
No ratings yet
Big Data Architecture A Complete Guide - 2019 Edition
From Everand
Big Data Architecture A Complete Guide - 2019 Edition
Gerardus Blokdyk
No ratings yet
Apache Spark 2.x Cookbook
From Everand
Apache Spark 2.x Cookbook
Rishi Yadav
No ratings yet
NiFi A Complete Guide - 2021 Edition
From Everand
NiFi A Complete Guide - 2021 Edition
Gerardus Blokdyk
No ratings yet
Learn Hadoop in 24 Hours
From Everand
Learn Hadoop in 24 Hours
Alex Nordeen
No ratings yet
Configuring Hadoop Security With Cloudera Manager
Document52 pages
Configuring Hadoop Security With Cloudera Manager
Boo Cori
No ratings yet
Mastering Hadoop
From Everand
Mastering Hadoop
Sandeep Karanth
No ratings yet
Cloudera Developer Training
Document483 pages
Cloudera Developer Training
equest7916
100% (1)
Instant Redis Optimization How-to
From Everand
Instant Redis Optimization How-to
Arun Chinnachamy
No ratings yet
Hadoop Real-World Solutions Cookbook - Second Edition
From Everand
Hadoop Real-World Solutions Cookbook - Second Edition
Deshpande Tanmay
No ratings yet
Apache Hive Cookbook
From Everand
Apache Hive Cookbook
Shrey Mehrotra
No ratings yet
Amazon Redshift Complete Self-Assessment Guide
From Everand
Amazon Redshift Complete Self-Assessment Guide
Gerardus Blokdyk
No ratings yet
Apache Mahout Clustering Designs
From Everand
Apache Mahout Clustering Designs
Gupta Ashish
No ratings yet
Cloudera Administrator Training
Document373 pages
Cloudera Administrator Training
Naga
100% (6)
Learning HBase
From Everand
Learning HBase
Shashwat Shriparv
No ratings yet
Hadoop Blueprints
From Everand
Hadoop Blueprints
Anurag Shrivastava
No ratings yet
Mastering Azure Synapse Analytics: Learn how to develop end-to-end analytics solutions with Azure Synapse Analytics (English Edition)
From Everand
Mastering Azure Synapse Analytics: Learn how to develop end-to-end analytics solutions with Azure Synapse Analytics (English Edition)
Debananda Ghosh
No ratings yet
Cloudera Administration Handbook
From Everand
Cloudera Administration Handbook
Rohit Menon
No ratings yet
Hadoop 2.x Administration Cookbook
From Everand
Hadoop 2.x Administration Cookbook
Gurmukh Singh
No ratings yet
Learning Apache Spark 2
From Everand
Learning Apache Spark 2
Muhammad Asif Abbasi
No ratings yet
Apache Hive Interview Questions
Document6 pages
Apache Hive Interview Questions
sourashree
100% (1)
Optimizing Hadoop for MapReduce
From Everand
Optimizing Hadoop for MapReduce
Khaled Tannir
No ratings yet
Azure Databricks A Complete Guide - 2019 Edition
From Everand
Azure Databricks A Complete Guide - 2019 Edition
Gerardus Blokdyk
No ratings yet
Snowflake Security: Securing Your Snowflake Data Cloud
From Everand
Snowflake Security: Securing Your Snowflake Data Cloud
Ben Herzberg
No ratings yet
BMC Control-M 7: A Journey from Traditional Batch Scheduling to Workload Automation
From Everand
BMC Control-M 7: A Journey from Traditional Batch Scheduling to Workload Automation
Qiang Ding
No ratings yet
Instant Pentaho Data Integration Kitchen
From Everand
Instant Pentaho Data Integration Kitchen
Sergio Ramazzina
No ratings yet
Pentaho 3.2 Data Integration Beginner's Guide
From Everand
Pentaho 3.2 Data Integration Beginner's Guide
Maria Carina Roldan
No ratings yet
Mastering Apache Cassandra - Second Edition
From Everand
Mastering Apache Cassandra - Second Edition
Nishant Neeraj
No ratings yet
Hadoop Administration
Document97 pages
Hadoop Administration
arjun.ec633
No ratings yet
Big Data & Hadoop Training Material 0 1 PDF
Document168 pages
Big Data & Hadoop Training Material 0 1 PDF
haranadh
50% (2)
Learning Hadoop 2
From Everand
Learning Hadoop 2
Garry Turkington
Rating: 4 out of 5 stars
4/5 (1)
Hadoop Tutorial
Document199 pages
Hadoop Tutorial
satya2003m
50% (2)
Getting Started with Greenplum for Big Data Analytics
From Everand
Getting Started with Greenplum for Big Data Analytics
Gollapudi Sunila
No ratings yet
Learn Hbase in 24 Hours
From Everand
Learn Hbase in 24 Hours
Alex Nordeen
No ratings yet
Azure Databricks A Complete Guide - 2020 Edition
From Everand
Azure Databricks A Complete Guide - 2020 Edition
Gerardus Blokdyk
No ratings yet
ETL A Clear and Concise Reference
From Everand
ETL A Clear and Concise Reference
Gerardus Blokdyk
No ratings yet
DataOps A Complete Guide - 2020 Edition
From Everand
DataOps A Complete Guide - 2020 Edition
Gerardus Blokdyk
No ratings yet
Big Data Hadoop Architect - V4
Document20 pages
Big Data Hadoop Architect - V4
Ali Akbar
No ratings yet
Hadoop Training #4: Programming With Hadoop
Document46 pages
Hadoop Training #4: Programming With Hadoop
Dmytro Shteflyuk
100% (2)
Data Engineer A Complete Guide - 2021 Edition
From Everand
Data Engineer A Complete Guide - 2021 Edition
Gerardus Blokdyk
No ratings yet
Professional Hadoop Solutions
From Everand
Professional Hadoop Solutions
Boris Lublinsky
Rating: 4 out of 5 stars
4/5 (2)
Application Observability with Elastic: Real-time metrics, logs, errors, traces, root cause analysis, and anomaly detection
From Everand
Application Observability with Elastic: Real-time metrics, logs, errors, traces, root cause analysis, and anomaly detection
Navin Sabharwal
No ratings yet
Apache Hive Essentials
From Everand
Apache Hive Essentials
Dayong Du
No ratings yet
Data Architects A Complete Guide - 2019 Edition
From Everand
Data Architects A Complete Guide - 2019 Edition
Gerardus Blokdyk
No ratings yet
DeZyre - Apache - Spark
Document12 pages
DeZyre - Apache - Spark
Madhu
No ratings yet
Teradata A Complete Guide - 2019 Edition
From Everand
Teradata A Complete Guide - 2019 Edition
Gerardus Blokdyk
No ratings yet
02 Hadoop Architecture and HDFS
Document74 pages
02 Hadoop Architecture and HDFS
arjun.ec633
No ratings yet
Redis Cluster
Document17 pages
Redis Cluster
최현준
50% (2)
Redis Cluster
Document17 pages
Redis Cluster
최현준
50% (2)
CSS 3 Help Cheat Sheet
Document1 page
CSS 3 Help Cheat Sheet
Dmytro Shteflyuk
100% (1)
Why RSA Works PDF
Document19 pages
Why RSA Works PDF
a3kira
No ratings yet
Sigmod278 Silberstein
Document12 pages
Sigmod278 Silberstein
Dmytro Shteflyuk
No ratings yet
Authors: Thanks To:: Miek Gieben Go Authors Google Go Nuts Mailing List
Document272 pages
Authors: Thanks To:: Miek Gieben Go Authors Google Go Nuts Mailing List
Dmytro Shteflyuk
No ratings yet
Go Programming
Document60 pages
Go Programming
Dmytro Shteflyuk
80% (5)
CSS 2.1 Help Cheat Sheet
Document1 page
CSS 2.1 Help Cheat Sheet
Dmytro Shteflyuk
No ratings yet
Window Vista Business 20070810
Document29 pages
Window Vista Business 20070810
josephpaulkr
No ratings yet
Performance Tuning For The InfiniDB Analytics Database (For Version 1.0.3)
Document72 pages
Performance Tuning For The InfiniDB Analytics Database (For Version 1.0.3)
Dmytro Shteflyuk
100% (1)
Backup Strategies With MySQL Enterprise Backup
Document33 pages
Backup Strategies With MySQL Enterprise Backup
Dmytro Shteflyuk
No ratings yet
Scaling MySQL Writes Through Partitioning
Document38 pages
Scaling MySQL Writes Through Partitioning
Dmytro Shteflyuk
No ratings yet
Calpont InfiniDB Administrator Guide (For Version 1.0.3)
Document106 pages
Calpont InfiniDB Administrator Guide (For Version 1.0.3)
Dmytro Shteflyuk
100% (2)
RFM: A Precursor To Data Mining
Document10 pages
RFM: A Precursor To Data Mining
Dmytro Shteflyuk
No ratings yet
The Complete Google Analytics Power User Guide PDF
Document45 pages
The Complete Google Analytics Power User Guide PDF
Sergi Soto Roldan
No ratings yet
Scaling Rails Applications in The Cloud
Document59 pages
Scaling Rails Applications in The Cloud
Dmytro Shteflyuk
100% (3)
Amdahl's Law in The Multicore Era
Document6 pages
Amdahl's Law in The Multicore Era
Dmytro Shteflyuk
100% (1)
Scribd Architecture Overview
Document19 pages
Scribd Architecture Overview
Dmytro Shteflyuk
100% (20)
Amdahl's Law in The Multicore Era - HPCA Keynote 02/2008
Document61 pages
Amdahl's Law in The Multicore Era - HPCA Keynote 02/2008
Dmytro Shteflyuk
No ratings yet
Hadoop Training #4: Programming With Hadoop
Document46 pages
Hadoop Training #4: Programming With Hadoop
Dmytro Shteflyuk
100% (2)
Hadoop Training #5: MapReduce Algorithm
Document31 pages
Hadoop Training #5: MapReduce Algorithm
Dmytro Shteflyuk
100% (2)
Unit-6: Data Visualization and Hadoop
Document96 pages
Unit-6: Data Visualization and Hadoop
SE02 Purvesh Agrawal
No ratings yet
DE Sample Resume
Document6 pages
DE Sample Resume
Sri Guru
No ratings yet
PIG Interview Qusetions
Document15 pages
PIG Interview Qusetions
spsoftspsoft
No ratings yet
Apache Hadoop Developer Training PDF
Document394 pages
Apache Hadoop Developer Training PDF
imankit
No ratings yet
Hive Succinctly
Document114 pages
Hive Succinctly
mayboyhan
No ratings yet
ICGTETM 2016 Proceedings PDF
Document690 pages
ICGTETM 2016 Proceedings PDF
Dheeraj
No ratings yet
Slide 3 Hadoop MapReduce Tutorial
Document119 pages
Slide 3 Hadoop MapReduce Tutorial
Hưng
No ratings yet
Information Storage and Management System (Isms) Using Hadoop
Document5 pages
Information Storage and Management System (Isms) Using Hadoop
IJIERT-International Journal of Innovations in Engineering Research and Technology
No ratings yet
DSCI 5350 - Lecture 2 PDF
Document54 pages
DSCI 5350 - Lecture 2 PDF
Praz
No ratings yet
Unit - 1 Notes - Introduction To Data-Analytics PDF
Document106 pages
Unit - 1 Notes - Introduction To Data-Analytics PDF
Kanav Kesar
No ratings yet
Unit 3
Document27 pages
Unit 3
Radhamani V
No ratings yet
Bda
Document2 pages
Bda
Jigar
No ratings yet
Arduino
Document120 pages
Arduino
Malajebamaniraj Mala
No ratings yet
Top Answers To Spark Interview Questions
Document32 pages
Top Answers To Spark Interview Questions
srinivas75k
No ratings yet
Module-3 (Part-2)
Document46 pages
Module-3 (Part-2)
sujith
No ratings yet
1.4 Map Reduce
Document30 pages
1.4 Map Reduce
bhattsb
No ratings yet
Chandralekha Rao Yachamaneni
Document7 pages
Chandralekha Rao Yachamaneni
Kritika Shukla
No ratings yet
Course Introduction To Big Data (2021-2022)
Document118 pages
Course Introduction To Big Data (2021-2022)
Mihaela Rînja
No ratings yet
Chapter 06 Part1
Document20 pages
Chapter 06 Part1
Can Canan
No ratings yet
Map Reduce
Document33 pages
Map Reduce
cellule HD HD
100% (1)
Benefits, Challenges and Tools of Big Data Management PDF
Document9 pages
Benefits, Challenges and Tools of Big Data Management PDF
Francisco Javier Medina Villarreal
No ratings yet
ESDL Lab Manual
Document7 pages
ESDL Lab Manual
anbhute3484
No ratings yet
A Deep Dive Into Nosql Databases The Use Cases and Applications First Edition
Document1,091 pages
A Deep Dive Into Nosql Databases The Use Cases and Applications First Edition
fpm89470
No ratings yet
Cloud Computing II.M.Sc (CS) Prepared by M.Sathya Important Questions 2 Marks
Document24 pages
Cloud Computing II.M.Sc (CS) Prepared by M.Sathya Important Questions 2 Marks
kavitha sree
No ratings yet
Apache Hadoop - A Course For Undergraduates Homework Labs With Professor's Notes
Document135 pages
Apache Hadoop - A Course For Undergraduates Homework Labs With Professor's Notes
Sat
33% (6)
Big Data Notes
Document51 pages
Big Data Notes
Sadanand
No ratings yet
CIT-650 Introduction To Big Data, Developing With Spark and Hadoop
Document4 pages
CIT-650 Introduction To Big Data, Developing With Spark and Hadoop
Amgad Alsisi
No ratings yet
Big Data Analytics For Healthcare Organization A S
Document8 pages
Big Data Analytics For Healthcare Organization A S
Andi Marsali
No ratings yet
Test Blanc
Document23 pages
Test Blanc
قطرة الندى
No ratings yet
BDA Lab Manual AI&DS
Document60 pages
BDA Lab Manual AI&DS
Ridha
No ratings yet