
facebook’s haystack

classical organization
NFS based Design

[Diagram: Browser ↔ CDN ↔ Photo Store Servers ↔ NAS over NFS; the Web Server brokers requests (steps 1–8)]
what is the metadata

• file size

• device ID (where the file is hosted)

• user ID (owner)

• group ID

• file mode (who can access it, and how)

• timestamp for the inode

• timestamps for the file (modified/accessed)

• number of hard links

• pointers to the pieces of disk containing the file

• signature and other flags…
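To make the list concrete, here is a minimal sketch (Python, with a hypothetical file name) of the metadata a single stat() call must pull out of the inode:

```python
import os

# a minimal sketch: every field below lives in (or hangs off) the
# file's inode, so even asking "how big is this photo?" touches all
# of this metadata; "photo.jpg" is a hypothetical file name
info = os.stat("photo.jpg")

print(info.st_size)    # file size
print(info.st_dev)     # device ID (where the file is hosted)
print(info.st_uid)     # user ID (owner)
print(info.st_gid)     # group ID
print(info.st_mode)    # file mode (permission bits)
print(info.st_ctime)   # inode change timestamp
print(info.st_mtime)   # file modification timestamp
print(info.st_atime)   # file access timestamp
print(info.st_nlink)   # number of hard links
```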


just some problems…
classical

• more than 99% of CDN hits

Facebook

• less than 80% of CDN hits

just some problems…
classical, made for

• most web-sites out there

• small content

• seldom accessed old stuff

Facebook

• HUGE working set

• frequent access to old stuff (the perks of a single tag & stalking)
oldie but goldie…
[Figure 7: Cumulative distribution function of the number of photos requested in a day, categorized by age (time…); axes: age in days vs. cumulative % of accesses]

[Figure 8: Volume … different write-enab…; axis: operations per minute, 4/19–4/20]
just some problems 2…

classical

• CDN caching is enough

Facebook

• CDN caching not enough for pics

• CDN caching not enough for metadata

• Facebook doesn't much like CDNs anyway
metadata, metadata
everywhere…
classical

• images stored as files (1-to-1 mapping)

• correlated metadata explodes

• reading an image —> lots of disk reads
the inode: pointers, pointers
everywhere…
a single inode is hundreds of bytes loooong

AND

Facebook got billions of them

AND

the number is increasing every day


just some numbers…
                2009            1 year after
uploads         220 M/week      1 B/week
                25 TB           60 TB
requests        550 K img/sec   1 M img/sec
total           15 B photos     65 B photos
                60 B images     260 B images
                1.5 PB          20 PB
story of a single access: the metadata and disk reads

directories with thousands of files

• large block map

• NAS couldn't cache it

• >10 disk ops/file

directories with hundreds of files

• 3 disk ops/file

adding the file handle to the cache

• 2.5 disk ops/file (the perks of tags & stalking)
long story put short

too many files, all of them accessed frequently:

• NAS caching does not help

• memcache does not help

• CDN does not help

AND

• caching is expensive

• facebook does not like CDNs


new approach

• CDN ok for popular images

• haystack for the rest… which is a lot


new approach
[Diagram: NFS-based design vs. Haystack-based design; Browser ↔ CDN ↔ Haystack Cache ↔ Web Server & Haystack Store, coordinated by the Haystack Directory]
BUT…

• what to do with the metadata?!

idea: reduce the metadata per photo
haystack pieces

[Diagram: Haystack-based design; Browser, CDN, Haystack Cache, Web Server, Haystack Directory, Haystack Store]
the directory

stores its info in a replicated database, with a PHP interface

• maps logical to physical volumes (3 haystacks on 3 nodes per logical volume)

• generates URLs for images

  • http://<CDN>/<Cache>/<Node>/<logical vol. ID, image ID>

• load balancing (writes across logical volumes & reads across physical volumes)

• decides caching per request (a CDN miss will not be loaded into the cache)

• determines read-only volumes

  • reached storage capacity

  • other reasons…
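A minimal sketch of the URL construction above (host names and IDs are hypothetical):

```python
# a minimal sketch of how the directory might assemble the URL
# pattern above; all host names and IDs here are hypothetical
def photo_url(cdn: str, cache: str, node: str,
              logical_vol_id: int, image_id: int) -> str:
    return f"http://{cdn}/{cache}/{node}/{logical_vol_id},{image_id}"

# usage: the directory picks one of the 3 physical volumes mapped
# to the photo's logical volume, then hands this URL to the browser
print(photo_url("cdn.example.com", "cache.example.com", "node17", 7, 123456789))
```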
the cache

• a distributed hash table that handles HTTP requests

  • key = photo ID

• bridge between the CDN (or browser) & the store

• caching policy: cache a photo only if

  • the request comes directly from a browser (a CDN miss is mostly a cache miss too) &

  • the photo was fetched from a write-enabled machine

• shelters write-enabled stores from requests for newly uploaded pics

  • doing either reads or writes, but not both, is more efficient
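A minimal sketch of that caching policy (class and parameter names are hypothetical):

```python
# a minimal sketch of the admission policy above: a hash table keyed
# by photo ID that stores a photo only when the request came straight
# from a browser AND the photo sits on a write-enabled store
class HaystackCache:
    def __init__(self) -> None:
        self._table: dict[int, bytes] = {}

    def get(self, photo_id: int) -> bytes | None:
        return self._table.get(photo_id)

    def put(self, photo_id: int, data: bytes,
            direct_from_browser: bool, write_enabled_store: bool) -> None:
        if direct_from_browser and write_enabled_store:
            self._table[photo_id] = data
```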
the store

• multiple physical volumes per machine

• one volume (a very large file) ~ millions of photos

• locating a photo within the store:

  • /hay/haystack_<logical volume ID> + photo offset

goal:

• retrieve filename, offset, & size without needing disk ops
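A one-function sketch of that path scheme:

```python
# a minimal sketch of the path scheme above: the volume file's name
# is derived from the logical volume ID, so finding a photo on disk
# needs only this path plus the photo's offset
def volume_path(logical_volume_id: int) -> str:
    return f"/hay/haystack_{logical_volume_id}"
```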
the store 2

goal:

• retrieve filename, offset, & size without needing disk ops

properties of a store machine:

• keeps an open file descriptor for each physical volume

• keeps an in-memory mapping of photo IDs to filesystem metadata (filename, offset, size)
the store: file layout

• a physical volume is one large file: a superblock followed by needles

• each needle = one photo stored in haystack, wrapped in a header and a footer (each starting with a magic number)

• important fields within a needle:

  • cookie: random number assigned by the directory and stored to mitigate brute-force attacks

  • key: photo ID (64 bit)

  • alternate key: which scaled version of the photo this is

  • flags: deleted status

  • size: size of the data

  • data: actual photo data

  • checksum: to check the data's integrity

  • padding: needle size aligned to 8 bytes
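A hedged sketch of serializing one needle with these fields; the magic values and exact field widths below are illustrative guesses, not the real on-disk format:

```python
import struct
import zlib

# illustrative header: magic, cookie, key (64-bit), alternate key,
# flags, data size; widths here are assumptions, not the real format
HEADER_FMT = ">IqqiBI"
HEADER_MAGIC = 0xDEADBEEF   # hypothetical magic number
FOOTER_MAGIC = 0xFEEDFACE   # hypothetical magic number

def pack_needle(cookie: int, key: int, alt_key: int,
                flags: int, data: bytes) -> bytes:
    header = struct.pack(HEADER_FMT, HEADER_MAGIC, cookie, key,
                         alt_key, flags, len(data))
    footer = struct.pack(">II", FOOTER_MAGIC, zlib.crc32(data))
    needle = header + data + footer
    return needle + b"\x00" * (-len(needle) % 8)  # pad to 8 bytes
```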
photo server in-mem mapping

64-bit photo key —>
  1st scaled image: 32-bit offset / 16-bit size
  2nd scaled image: 32-bit offset / 16-bit size
  3rd scaled image: 32-bit offset / 16-bit size
  4th scaled image: 32-bit offset / 16-bit size

• maps keys (+ alternate keys) to flags, size, & volume offset

• alternate key = photo type; 4 scaled versions of the same image

• http requests —> haystack operations

• creates & maintains an index of all haystack images

• 32 bytes/photo (8 bytes/image vs ~600 bytes/inode)

• roughly 5 GB of index for 10 TB of images
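A minimal sketch of that mapping (names are hypothetical; field widths as above):

```python
from dataclasses import dataclass

# a minimal sketch of the in-memory mapping above: per photo, the
# 64-bit key plus an alternate key (which scaled version) map to an
# offset and size, so a lookup costs zero disk ops
@dataclass
class ImageLocation:
    offset: int   # 32-bit offset within the volume file
    size: int     # 16-bit size of the needle

index: dict[tuple[int, int], ImageLocation] = {}

index[(123456789, 1)] = ImageLocation(offset=4096, size=2048)
loc = index[(123456789, 1)]   # in-memory lookup, no disk reads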


the store: index file layout

• the index file: a superblock followed by one index record per needle

• each index record: key, alternate key, flags, offset, size
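A hedged sketch of one index record; the field widths are illustrative guesses matching the needle sketch earlier, not the real on-disk format:

```python
import struct

# key, alternate key, flags, offset, size; widths are assumptions
INDEX_RECORD_FMT = ">qiBII"

def pack_index_record(key: int, alt_key: int, flags: int,
                      offset: int, size: int) -> bytes:
    return struct.pack(INDEX_RECORD_FMT, key, alt_key, flags, offset, size)
```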


handling operations within
haystack
[Diagram: photo download; Browser → CDN → Haystack Cache → Web Server / Haystack Store, with the Haystack Directory consulted first (steps 1–10)]

reads

the Haystack Directory builds the URL:

• http://<CDN>/<Cache>/<machine ID>/<logical vol. ID, photo ID, cookie>
if not in CDN and Cache, the store:

• gets offset & size from the in-memory index

• reads the file at that offset
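A hedged sketch of that read path, reusing the ImageLocation index and volume_path sketches from earlier:

```python
# a hedged sketch of the store-side read: one in-memory lookup, one
# seek, one read; cookie/checksum verification is noted but elided
def read_photo(index, logical_volume_id, photo_key, alt_key):
    loc = index.get((photo_key, alt_key))
    if loc is None:
        return None                        # unknown (or deleted) photo
    with open(volume_path(logical_volume_id), "rb") as vol:
        vol.seek(loc.offset)
        needle = vol.read(loc.size)        # a single disk read
    # a real store would verify the needle's cookie and checksum here
    return needle
```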
handling operations within
haystack
[Diagram: photo upload; Browser → Web Server ↔ Haystack Directory, then Web Server → Haystack Store (steps 1–5)]

writes

the Haystack Directory hands the Web Server a write-enabled logical volume; the Web Server sends:

• logical_vol_ID, key, alternate key, cookie, & data to the store servers
the Store Servers then:

• append the image to the haystack file (all physical volumes)

• append the index record to the index file (all physical volumes)

• update the main-memory index
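A hedged sketch of those three steps, building on the pack_needle and pack_index_record sketches above:

```python
# a hedged sketch of the store's write path: append the needle to the
# volume file, append the matching record to the index file, then
# update the in-memory index, in that order
def store_write(volume_file, index_file, index,
                cookie, key, alt_key, data):
    needle = pack_needle(cookie, key, alt_key, flags=0, data=data)
    offset = volume_file.seek(0, 2)            # append at end of volume
    volume_file.write(needle)
    index_file.write(pack_index_record(key, alt_key, 0, offset, len(needle)))
    index[(key, alt_key)] = ImageLocation(offset, len(needle))
```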


handling operations within
haystack

deletes

• get offset & size from index

• mark image as ‘deleted’ in the needle header

• update the in-memory index
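A hedged sketch of delete; the flags-byte offset matches the illustrative needle layout sketched earlier, not the real format:

```python
DELETED = 0x1        # illustrative flag value
FLAGS_OFFSET = 24    # flags byte position in the illustrative header

# a hedged sketch of delete: flip the deleted flag both on disk and
# in memory; the bytes themselves are reclaimed later by compaction
def store_delete(volume_file, index, key, alt_key):
    loc = index.pop((key, alt_key), None)      # drop from memory
    if loc is None:
        return
    volume_file.seek(loc.offset + FLAGS_OFFSET)
    volume_file.write(bytes([DELETED]))        # mark needle deleted
```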


handling operations within
haystack

compaction

• an infrequent online operation

• create a copy of the haystack, skipping duplicate and deleted photos
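A hedged sketch of compaction; parse_needles is a hypothetical inverse of pack_needle that yields (key, alt_key, flags, raw_bytes) tuples:

```python
# a hedged sketch of compaction: copy live needles into a fresh
# volume, dropping deleted ones, and rebuild the index as we go
# (a fuller version would also skip superseded duplicate keys)
def compact(old_volume_path: str, new_volume_path: str) -> dict:
    new_index = {}
    with open(old_volume_path, "rb") as old, \
         open(new_volume_path, "wb") as new:
        for key, alt_key, flags, raw in parse_needles(old):  # hypothetical
            if flags & DELETED:
                continue                     # reclaim deleted space
            offset = new.tell()
            new.write(raw)
            new_index[(key, alt_key)] = ImageLocation(offset, len(raw))
    return new_index
```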
Thank you!
