The Google File System
Reporter: You-Wei Zhang
Abstract
We have designed and implemented the Google File System, a scalable distributed file system for large distributed data-intensive applications.
Outline
Introduction - Design Point
Design Overview
System Interactions
Master Operation
Fault Tolerance
Conclusion
Introduction - Design Point
Traditional Issues
Performance
Scalability
Reliability
Availability
Different Points in GFS
Component failures are the norm rather than the
exception
Files are huge; multi-GB files are common
Most files are mutated by appending new data rather
than overwriting existing data
2. Design Overview
Interface
Files are organized hierarchically in directories and identified by path names
Usual operations
create, delete, open, close, read, and write
Additional operations
Snapshot: creates a copy of a file or a directory tree at low cost
Record append: allows multiple clients to append data to the same file concurrently
2. Design Overview
Architecture (1)
Single master
Multiple chunkservers
Multiple clients
2. Design Overview
Architecture (2)
Chunk
Files are divided into fixed-size chunks
Each chunk is identified by a globally unique 64-bit chunk handle
Chunkservers store chunks on local disk as Linux files
Replication for reliability (default 3 replicas)
Master
Maintains all file system metadata
Namespace, access control information, file-to-chunk mapping, chunk locations
Manages chunk leases, garbage collection, and chunk migration
Periodically communicates with chunkservers (HeartBeat messages)
2. Design Overview
Single Master
Simple read procedure
The client converts (file name, byte offset) into a chunk index, asks the master for the chunk handle and replica locations, and reads the data directly from a chunkserver (see the sketch below)
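A minimal sketch of this read path in Python, assuming hypothetical client, master, and chunkserver objects; lookup(), read(), and the cache field are illustrative names, not GFS's actual API.

    # Sketch of the GFS read path (illustrative names only).
    CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB fixed chunk size

    def gfs_read(client, filename, offset, length):
        # 1. Translate the byte offset into a chunk index within the file.
        chunk_index = offset // CHUNK_SIZE
        # 2. Ask the master for the chunk handle and its replica locations.
        handle, replicas = client.master.lookup(filename, chunk_index)
        # 3. Cache the reply so further reads of this chunk skip the master.
        client.cache[(filename, chunk_index)] = (handle, replicas)
        # 4. Read the data directly from one (e.g., the closest) chunkserver.
        chunkserver = replicas[0]
        return chunkserver.read(handle, offset % CHUNK_SIZE, length)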
2. Design Overview
Chunk Size
64MB
Much larger than typical file system block sizes
Advantages of a large chunk size (worked example below)
Reduces interaction between the client and the master
A client can perform many operations on a given chunk
Reduces network overhead by keeping a persistent TCP connection
Reduces the size of metadata stored on the master
The metadata can reside in memory
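A back-of-the-envelope illustration of the metadata advantage; the figure of less than 64 bytes of master metadata per 64 MB chunk is the paper's, the rest is simple arithmetic.

    CHUNK_SIZE = 64 * 1024 * 1024           # 64 MB per chunk
    METADATA_PER_CHUNK = 64                 # < 64 bytes of master metadata per chunk

    file_size = 1024**4                     # a 1 TB file
    chunks = file_size // CHUNK_SIZE        # 16,384 chunks
    metadata_bytes = chunks * METADATA_PER_CHUNK
    print(f"{chunks} chunks, ~{metadata_bytes // 1024} KB of master metadata")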
2. Design Overview
Metadata (1)
Three major types are stored
File and chunk namespaces
Mapping from files to chunks
Locations of each chunk's replicas
In-memory data structures
All metadata is kept in the master's memory (see the sketch below)
Periodically scanning the entire state is easy and efficient
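A sketch of the three metadata types as plain in-memory structures; the class and field names are illustrative, not the master's actual data structures.

    from dataclasses import dataclass, field

    @dataclass
    class MasterMetadata:
        # 1. File and chunk namespaces (with per-file access control info).
        namespace: dict = field(default_factory=dict)       # "/dir/file" -> attributes
        # 2. Mapping from each file to its ordered list of chunk handles.
        file_chunks: dict = field(default_factory=dict)     # "/dir/file" -> [handles]
        # 3. Locations of each chunk's replicas (volatile, rebuilt from heartbeats).
        chunk_locations: dict = field(default_factory=dict) # handle -> {chunkserver ids}

        def scan(self):
            # With everything in memory, periodic whole-state scans (for garbage
            # collection, re-replication, rebalancing) are cheap.
            for handle, locations in self.chunk_locations.items():
                yield handle, locations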
2. Design Overview
Metadata (2)
Chunk locations
The master does not keep a persistent record of chunk locations
Instead, it simply polls chunkservers at startup and periodically thereafter (HeartBeat messages)
With chunkservers constantly failing, joining, and leaving, a persistent record would be hard to keep correct
Operation log
The master maintains a historical record of critical metadata changes (namespace and mapping)
For reliability and consistency, the operation log is replicated on multiple remote machines (sketched below)
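A sketch of the two persistence decisions above: chunk locations are only rebuilt from chunkserver reports, while namespace and mapping changes are appended to a replicated operation log before being applied. All names are illustrative.

    class Master:
        def __init__(self, log_replicas):
            self.chunk_locations = {}         # chunk handle -> set of chunkserver ids (volatile)
            self.log_replicas = log_replicas  # local + remote copies of the operation log

        def on_heartbeat(self, chunkserver_id, reported_handles):
            # Chunk locations are never persisted; the master simply believes
            # whatever each chunkserver reports it currently holds.
            for handle in reported_handles:
                self.chunk_locations.setdefault(handle, set()).add(chunkserver_id)

        def mutate_namespace(self, record):
            # Critical metadata changes (namespace, file-to-chunk mapping) are
            # appended to the operation log on all replicas before the change
            # becomes visible to clients.
            for log in self.log_replicas:
                log.append(record)
            self.apply(record)

        def apply(self, record):
            pass  # update the in-memory namespace / mapping structures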
3. System Interactions
Leases and Mutation Order
Leases are used to maintain a consistent mutation order across replicas
The master grants a lease to one of the replicas, which becomes the primary
The primary picks a serial order for all mutations
The other replicas follow the primary's order
This minimizes management overhead at the master
Data is pushed to the replicas in a pipelined fashion to fully utilize network bandwidth (the ordering step is sketched below)
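A sketch of how the primary imposes a single mutation order; lease bookkeeping, the pipelined data push, and error handling are omitted, and all names are illustrative.

    class PrimaryReplica:
        # The replica currently holding the master's lease for a chunk.
        def __init__(self, secondaries):
            self.secondaries = secondaries
            self.next_serial = 0

        def mutate(self, mutation):
            # The data itself has already been pushed to all replicas along a
            # pipelined chain; this is the control step that orders it.
            serial = self.next_serial
            self.next_serial += 1
            self.apply(serial, mutation)
            # Every secondary applies mutations in the primary's serial order,
            # so all replicas end up with the same sequence of changes.
            for secondary in self.secondaries:
                secondary.apply(serial, mutation)
            return serial

        def apply(self, serial, mutation):
            pass  # write the mutation to the local chunk replica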
3. System Interactions
Atomic Record Appends
An atomic append operation called record append
Record append is heavily used; without it, clients would need complicated and expensive synchronization
The primary checks whether the append would exceed the maximum chunk size
If so, the primary pads the chunk to the maximum chunk size
The secondaries do the same
The primary replies to the client that the operation should be retried on the next chunk (sketched below)
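A sketch of the primary's record-append decision; the chunk object and its pad_to/write_at helpers are illustrative.

    MAX_CHUNK_SIZE = 64 * 1024 * 1024

    def record_append(chunk, data, secondaries):
        # Would appending this record overflow the current chunk?
        if chunk.size + len(data) > MAX_CHUNK_SIZE:
            # Pad the chunk to its maximum size on all replicas and tell the
            # client to retry the append on the next chunk.
            chunk.pad_to(MAX_CHUNK_SIZE)
            for s in secondaries:
                s.pad_to(chunk.handle, MAX_CHUNK_SIZE)
            return "RETRY_ON_NEXT_CHUNK"
        # Otherwise append at an offset the primary chooses and have the
        # secondaries write the record at exactly the same offset.
        offset = chunk.size
        chunk.write_at(offset, data)
        for s in secondaries:
            s.write_at(chunk.handle, offset, data)
        return offset  # the client learns where its record landed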
3. System Interactions
Snapshot
Makes a copy of a file or a directory tree
The master revokes any outstanding leases on the chunks of that file
The metadata is then duplicated
On the first write to a chunk after the snapshot operation
Each chunkserver holding a replica creates a new chunk
The data is copied locally (copy-on-write, sketched below)
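A sketch of snapshot as copy-on-write: the snapshot itself only duplicates metadata and bumps reference counts; the first later write to a shared chunk triggers a local copy on each chunkserver. Handles are treated as strings and all names are illustrative.

    class SnapshotMaster:
        def __init__(self):
            self.file_chunks = {}  # path -> [chunk handles]
            self.refcount = {}     # chunk handle -> number of files referencing it

        def snapshot(self, src, dst):
            # (After revoking any outstanding leases on src's chunks.)
            # Duplicate only the metadata; both files now share the same chunks.
            handles = list(self.file_chunks[src])
            self.file_chunks[dst] = handles
            for h in handles:
                self.refcount[h] = self.refcount.get(h, 1) + 1

        def before_write(self, path, index, chunkservers):
            # First write to a shared chunk after a snapshot: every chunkserver
            # holding a replica copies the chunk locally into a new chunk.
            handle = self.file_chunks[path][index]
            if self.refcount.get(handle, 1) > 1:
                new_handle = handle + "-copy"  # illustrative handle allocation
                for cs in chunkservers:
                    cs.copy_local(handle, new_handle)
                self.refcount[handle] -= 1
                self.refcount[new_handle] = 1
                self.file_chunks[path][index] = new_handle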
4. Master Operation
Namespace Management and Locking
The GFS master maintains a table that maps full pathnames to metadata
Each node in the namespace has an associated read-write lock
Concurrent operations are properly serialized by this locking mechanism (sketched below)
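A sketch of the locking rule: an operation acquires read locks on every ancestor directory of a path plus a read or write lock on the final component; the function below only computes which locks would be taken.

    def locks_for(path, write_leaf):
        # e.g. "/home/user/foo" -> read locks on "/home" and "/home/user",
        # plus a write (or read) lock on "/home/user/foo" itself.
        parts = path.strip("/").split("/")
        prefixes = ["/" + "/".join(parts[:i + 1]) for i in range(len(parts))]
        read_locks = prefixes[:-1]
        leaf_lock = (prefixes[-1], "write" if write_leaf else "read")
        return read_locks, leaf_lock

    # Creating /home/user/foo needs only a read lock on /home/user, so two
    # creations in the same directory can run concurrently, while an operation
    # that write-locks /home/user (e.g. snapshotting it) serializes with them.
    print(locks_for("/home/user/foo", write_leaf=True))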
4. Master Operation
Replica Placement
GFS places replicas across different racks for reliability and availability
Reads can exploit the aggregate bandwidth of multiple racks, but write traffic has to flow through multiple racks
-> a deliberate tradeoff
4. Master Operation
Creation, Re-replication, Rebalancing
Creation
Equalize disk space utilization
Limit the number of recent creations on each chunkserver
Spread replicas across racks (placement sketch below)
Re-replication
Re-replication happens when a chunkserver becomes unavailable and the number of available replicas drops
Rebalancing
Replicas are periodically rebalanced for better disk space utilization and load balancing
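A sketch of a placement policy matching the creation criteria above: prefer chunkservers with low disk utilization and few recent creations, and spread the chosen replicas across racks. The scoring is illustrative, not the paper's exact heuristic.

    def place_replicas(chunkservers, num_replicas=3):
        # chunkservers: objects with .rack, .disk_utilization (0..1) and
        # .recent_creations attributes (illustrative).
        ranked = sorted(chunkservers,
                        key=lambda cs: (cs.disk_utilization, cs.recent_creations))
        chosen, racks_used = [], set()
        # First pass: at most one replica per rack, favouring lightly loaded servers.
        for cs in ranked:
            if len(chosen) == num_replicas:
                break
            if cs.rack not in racks_used:
                chosen.append(cs)
                racks_used.add(cs.rack)
        # If there are fewer racks than replicas, fill up from the ranking.
        for cs in ranked:
            if len(chosen) == num_replicas:
                break
            if cs not in chosen:
                chosen.append(cs)
        return chosen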
4. Master Operation
Garbage Collection
When a file is deleted, the master just logs the deletion and renames the file to a hidden name that includes the deletion timestamp
During the master's regular namespace scan, hidden files whose timestamps are within the last 3 days (for example) are not removed
Until then, such files can still be read under the hidden name and undeleted by renaming them back to the original name
Orphaned chunks are identified in a similar periodic scan and erased (sketched below)
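A sketch of lazy deletion: a delete just renames the file to a hidden, timestamped name, and a periodic scan reclaims hidden files older than the grace period. Names and the namespace layout are illustrative.

    import time

    GRACE_PERIOD = 3 * 24 * 3600  # e.g. 3 days

    class GCMaster:
        def __init__(self):
            self.namespace = {}  # path -> [chunk handles]

        def delete(self, path):
            # Deletion is only logged; the file is renamed to a hidden name
            # carrying the deletion timestamp, and nothing is reclaimed yet.
            hidden = f"{path}.deleted.{int(time.time())}"
            self.namespace[hidden] = self.namespace.pop(path)
            return hidden  # undelete = rename back to the original name

        def scan(self, now=None):
            now = now or time.time()
            for name in list(self.namespace):
                if ".deleted." in name:
                    deleted_at = int(name.rsplit(".", 1)[1])
                    if now - deleted_at > GRACE_PERIOD:
                        # Drop the metadata; the chunks become orphaned and are
                        # erased by chunkservers in later HeartBeat exchanges.
                        del self.namespace[name]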
4. Master Operation
Stale Replica Detection
A chunkserver misses mutations to a chunk while it is down
The master maintains a chunk version number to distinguish stale replicas from up-to-date ones
The version number is increased whenever the master grants a new lease on the chunk
The master periodically removes stale replicas (sketched below)
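A sketch of the version-number bookkeeping: the master bumps a chunk's version whenever it grants a new lease, and any replica that reports an older version is treated as stale. Names are illustrative.

    class VersionTracker:
        def __init__(self):
            self.current_version = {}  # chunk handle -> latest version number
            self.replica_version = {}  # (handle, chunkserver id) -> version seen there

        def grant_lease(self, handle, replicas):
            # A new lease means new mutations may follow: bump the version and
            # inform all up-to-date replicas before the lease is used.
            version = self.current_version.get(handle, 0) + 1
            self.current_version[handle] = version
            for cs in replicas:
                self.replica_version[(handle, cs)] = version

        def is_stale(self, handle, chunkserver):
            # A chunkserver that was down misses the version bump, so the
            # version it reports lags behind the master's.
            seen = self.replica_version.get((handle, chunkserver), 0)
            return seen < self.current_version.get(handle, 0)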
5. Fault Tolerance
Fast Recovery
The master and chunkservers are designed to restore their state and restart in seconds
Chunk Replication
Each chunk is replicated on multiple chunkservers on
different racks
The replication factor can be adjusted according to user demand for reliability
Master Replication
Operation log
Historical record of critical metadata changes
Operation log is replicated on multiple machines
6. Conclusion
GFS is a distributed file system that supports large-scale data processing workloads on commodity hardware
GFS occupies a different point in the file system design space
Component failures are treated as the norm
Optimized for huge files
GFS provides fault tolerance by
Replicating data
Fast and automatic recovery
Chunk replication
GFS has a simple, centralized master that does not become a bottleneck
GFS is a successful file system
An important tool that enables Google to continue to innovate on its ideas