
facebook’s haystack

classical organization
NFS based Design

[Diagram: Browser ↔ CDN ↔ Photo Store Servers ↔ NAS over NFS; the Web Server brokers requests (steps 1–8)]
what is the metadata

• file size

• device ID (where the file is hosted)

• user ID (owner)

• group ID

• file mode (who can access it, and how)

• timestamp for the inode

• timestamps for the file (modified/accessed)

• number of hard links

• pointers to the pieces of disk containing the file

• signature and other flags…
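To make the list concrete, here is a minimal sketch (Python, with a hypothetical file name) of the metadata a single stat() call must pull out of the inode:

```python
import os

# a minimal sketch: every field below lives in (or hangs off) the
# file's inode, so even asking "how big is this photo?" touches all
# of this metadata; "photo.jpg" is a hypothetical file name
info = os.stat("photo.jpg")

print(info.st_size)    # file size
print(info.st_dev)     # device ID (where the file is hosted)
print(info.st_uid)     # user ID (owner)
print(info.st_gid)     # group ID
print(info.st_mode)    # file mode (permission bits)
print(info.st_ctime)   # inode change timestamp
print(info.st_mtime)   # file modification timestamp
print(info.st_atime)   # file access timestamp
print(info.st_nlink)   # number of hard links
```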


just some problems…
classical

• more than 99% of CDN hits

Facebook

• less than 80% of CDN hits

just some problems…
classical, made for

• most web-sites out there

• small content

• seldom accessed old stuff

Facebook

• HUGE working set

• frequent access to old stuff (the perks of a single tag & stalking)
oldie but goldie…
[Figure 7: Cumulative distribution function of the number of photos requested in a day, categorized by age (time…); axes: age in days vs. cumulative % of accesses]

[Figure 8: Volume … different write-enab…; axis: operations per minute, 4/19–4/20]
just some problems 2…

classical

• CDN caching is enough

Facebook

• CDN caching not enough for pics

• CDN caching not enough for metadata

• Facebook doesn't much like CDNs anyway
metadata, metadata
everywhere…
classical

• images stored as files (1-to-1 mapping)

• correlated metadata explodes

• reading an image —> lots of disk reads
the inode: pointers, pointers
everywhere…
a single inode is hundreds of bytes loooong

AND

Facebook got billions of them

AND

the number is increasing every day


just some numbers…
                2009            1 year after
uploads         220 M/week      1 B/week
                25 TB           60 TB
requests        550 K img/sec   1 M img/sec
total           15 B photos     65 B photos
                60 B images     260 B images
                1.5 PB          20 PB
story of a single access: the metadata and disk reads

directories with thousands of files

• large block map

• NAS couldn't cache it

• >10 disk ops/file

directories with hundreds of files

• 3 disk ops/file

adding the file handle to the cache

• 2.5 disk ops/file (the perks of tags & stalking)
long story put short

too many files, all of them accessed frequently:

• NAS caching does not help

• memcache does not help

• CDN does not help

AND

• caching is expensive

• facebook does not like CDNs


new approach

• CDN ok for popular images

• haystack for the rest… which is a lot


new approach
[Diagram: NFS-based design vs. Haystack-based design; Browser ↔ CDN ↔ Haystack Cache ↔ Web Server & Haystack Store, coordinated by the Haystack Directory]
BUT…

• what to do with the metadata?!

idea: reduce the metadata per photo
haystack pieces

[Diagram: Haystack-based design; Browser, CDN, Haystack Cache, Web Server, Haystack Directory, Haystack Store]
the directory

stores its info in a replicated database, with a PHP interface

• maps logical to physical volumes (3 haystacks on 3 nodes per logical volume)

• generates URLs for images

  • http://<CDN>/<Cache>/<Node>/<logical vol. ID, image ID>

• load balancing (writes across logical volumes & reads across physical volumes)

• decides caching per request (a CDN miss will not be loaded into the cache)

• determines read-only volumes

  • reached storage capacity

  • other reasons…
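A minimal sketch of the URL construction above (host names and IDs are hypothetical):

```python
# a minimal sketch of how the directory might assemble the URL
# pattern above; all host names and IDs here are hypothetical
def photo_url(cdn: str, cache: str, node: str,
              logical_vol_id: int, image_id: int) -> str:
    return f"http://{cdn}/{cache}/{node}/{logical_vol_id},{image_id}"

# usage: the directory picks one of the 3 physical volumes mapped
# to the photo's logical volume, then hands this URL to the browser
print(photo_url("cdn.example.com", "cache.example.com", "node17", 7, 123456789))
```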
the cache

• a distributed hash table that handles HTTP requests

  • key = photo ID

• bridge between the CDN (or browser) & the store

• caching policy: cache a photo only if

  • the request comes directly from a browser (a CDN miss is mostly a cache miss too) &

  • the photo was fetched from a write-enabled machine

• shelters write-enabled stores from requests for newly uploaded pics

  • doing either reads or writes, but not both, is more efficient
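A minimal sketch of that caching policy (class and parameter names are hypothetical):

```python
# a minimal sketch of the admission policy above: a hash table keyed
# by photo ID that stores a photo only when the request came straight
# from a browser AND the photo sits on a write-enabled store
class HaystackCache:
    def __init__(self) -> None:
        self._table: dict[int, bytes] = {}

    def get(self, photo_id: int) -> bytes | None:
        return self._table.get(photo_id)

    def put(self, photo_id: int, data: bytes,
            direct_from_browser: bool, write_enabled_store: bool) -> None:
        if direct_from_browser and write_enabled_store:
            self._table[photo_id] = data
```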
the store

• multiple physical volumes per machine

• one volume (a very large file) ~ millions of photos

• locating a photo within the store:

  • /hay/haystack_<logical volume ID> + photo offset

goal:

• retrieve filename, offset, & size without needing disk ops
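A one-function sketch of that path scheme:

```python
# a minimal sketch of the path scheme above: the volume file's name
# is derived from the logical volume ID, so finding a photo on disk
# needs only this path plus the photo's offset
def volume_path(logical_volume_id: int) -> str:
    return f"/hay/haystack_{logical_volume_id}"
```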
the store 2

goal:

• retrieve filename, offset, & size without needing disk ops

properties of a store machine:

• keeps an open file descriptor for each physical volume

• keeps an in-memory mapping of photo IDs to filesystem metadata (filename, offset, size)
the store: file layout

• a physical volume is one large file: a superblock followed by needles

• each needle = one photo stored in haystack, wrapped in a header and a footer (each starting with a magic number)

• important fields within a needle:

  • cookie: random number assigned by the directory and stored to mitigate brute-force attacks

  • key: photo ID (64 bit)

  • alternate key: which scaled version of the photo this is

  • flags: deleted status

  • size: size of the data

  • data: actual photo data

  • checksum: to check the data's integrity

  • padding: needle size aligned to 8 bytes
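A hedged sketch of serializing one needle with these fields; the magic values and exact field widths below are illustrative guesses, not the real on-disk format:

```python
import struct
import zlib

# illustrative header: magic, cookie, key (64-bit), alternate key,
# flags, data size; widths here are assumptions, not the real format
HEADER_FMT = ">IqqiBI"
HEADER_MAGIC = 0xDEADBEEF   # hypothetical magic number
FOOTER_MAGIC = 0xFEEDFACE   # hypothetical magic number

def pack_needle(cookie: int, key: int, alt_key: int,
                flags: int, data: bytes) -> bytes:
    header = struct.pack(HEADER_FMT, HEADER_MAGIC, cookie, key,
                         alt_key, flags, len(data))
    footer = struct.pack(">II", FOOTER_MAGIC, zlib.crc32(data))
    needle = header + data + footer
    return needle + b"\x00" * (-len(needle) % 8)  # pad to 8 bytes
```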
photo server in-mem mapping

64-bit photo key —>
  1st scaled image: 32-bit offset / 16-bit size
  2nd scaled image: 32-bit offset / 16-bit size
  3rd scaled image: 32-bit offset / 16-bit size
  4th scaled image: 32-bit offset / 16-bit size

• maps keys (+ alternate keys) to flags, size, & volume offset

• alternate key = photo type; 4 scaled versions of the same image

• http requests —> haystack operations

• creates & maintains an index of all haystack images

• 32 bytes/photo (8 bytes/image vs ~600 bytes/inode)

• roughly 5 GB of index for 10 TB of images
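A minimal sketch of that mapping (names are hypothetical; field widths as above):

```python
from dataclasses import dataclass

# a minimal sketch of the in-memory mapping above: per photo, the
# 64-bit key plus an alternate key (which scaled version) map to an
# offset and size, so a lookup costs zero disk ops
@dataclass
class ImageLocation:
    offset: int   # 32-bit offset within the volume file
    size: int     # 16-bit size of the needle

index: dict[tuple[int, int], ImageLocation] = {}

index[(123456789, 1)] = ImageLocation(offset=4096, size=2048)
loc = index[(123456789, 1)]   # in-memory lookup, no disk reads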


the store: index file layout

• the index file: a superblock followed by one index record per needle

• each index record: key, alternate key, flags, offset, size
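A hedged sketch of one index record; the field widths are illustrative guesses matching the needle sketch earlier, not the real on-disk format:

```python
import struct

# key, alternate key, flags, offset, size; widths are assumptions
INDEX_RECORD_FMT = ">qiBII"

def pack_index_record(key: int, alt_key: int, flags: int,
                      offset: int, size: int) -> bytes:
    return struct.pack(INDEX_RECORD_FMT, key, alt_key, flags, offset, size)
```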


handling operations within
haystack
[Diagram: photo download; Browser → CDN → Haystack Cache → Web Server / Haystack Store, with the Haystack Directory consulted first (steps 1–10)]

reads

the Haystack Directory builds the URL:

• http://<CDN>/<Cache>/<machine ID>/<logical vol. ID, photo ID, cookie>
if not in CDN and Cache, the store:

• gets offset & size from the in-memory index

• reads the file at that offset
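A hedged sketch of that read path, reusing the ImageLocation index and volume_path sketches from earlier:

```python
# a hedged sketch of the store-side read: one in-memory lookup, one
# seek, one read; cookie/checksum verification is noted but elided
def read_photo(index, logical_volume_id, photo_key, alt_key):
    loc = index.get((photo_key, alt_key))
    if loc is None:
        return None                        # unknown (or deleted) photo
    with open(volume_path(logical_volume_id), "rb") as vol:
        vol.seek(loc.offset)
        needle = vol.read(loc.size)        # a single disk read
    # a real store would verify the needle's cookie and checksum here
    return needle
```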
handling operations within
haystack
[Diagram: photo upload; Browser → Web Server ↔ Haystack Directory, then Web Server → Haystack Store (steps 1–5)]

writes

the Haystack Directory hands the Web Server a write-enabled logical volume; the Web Server sends:

• logical_vol_ID, key, alternate key, cookie, & data to the store servers
the Store Servers then:

• append the image to the haystack file (all physical volumes)

• append the index record to the index file (all physical volumes)

• update the main-memory index
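A hedged sketch of those three steps, building on the pack_needle and pack_index_record sketches above:

```python
# a hedged sketch of the store's write path: append the needle to the
# volume file, append the matching record to the index file, then
# update the in-memory index, in that order
def store_write(volume_file, index_file, index,
                cookie, key, alt_key, data):
    needle = pack_needle(cookie, key, alt_key, flags=0, data=data)
    offset = volume_file.seek(0, 2)            # append at end of volume
    volume_file.write(needle)
    index_file.write(pack_index_record(key, alt_key, 0, offset, len(needle)))
    index[(key, alt_key)] = ImageLocation(offset, len(needle))
```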


handling operations within
haystack

deletes

• get offset & size from index

• mark image as ‘deleted’ in the needle header

• update the in-memory index
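A hedged sketch of delete; the flags-byte offset matches the illustrative needle layout sketched earlier, not the real format:

```python
DELETED = 0x1        # illustrative flag value
FLAGS_OFFSET = 24    # flags byte position in the illustrative header

# a hedged sketch of delete: flip the deleted flag both on disk and
# in memory; the bytes themselves are reclaimed later by compaction
def store_delete(volume_file, index, key, alt_key):
    loc = index.pop((key, alt_key), None)      # drop from memory
    if loc is None:
        return
    volume_file.seek(loc.offset + FLAGS_OFFSET)
    volume_file.write(bytes([DELETED]))        # mark needle deleted
```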


handling operations within
haystack

compaction

• an infrequent online operation

• create a copy of the haystack, skipping duplicate and deleted photos
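A hedged sketch of compaction; parse_needles is a hypothetical inverse of pack_needle that yields (key, alt_key, flags, raw_bytes) tuples:

```python
# a hedged sketch of compaction: copy live needles into a fresh
# volume, dropping deleted ones, and rebuild the index as we go
# (a fuller version would also skip superseded duplicate keys)
def compact(old_volume_path: str, new_volume_path: str) -> dict:
    new_index = {}
    with open(old_volume_path, "rb") as old, \
         open(new_volume_path, "wb") as new:
        for key, alt_key, flags, raw in parse_needles(old):  # hypothetical
            if flags & DELETED:
                continue                     # reclaim deleted space
            offset = new.tell()
            new.write(raw)
            new_index[(key, alt_key)] = ImageLocation(offset, len(raw))
    return new_index
```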
Thank you!
