
Case Study
On
Facebook
Presented by:
Gayatri Narayan Dastela &
Ritu Kataria
Contents:
Introduction
3 V’s and Facebook
Problem Statement
What Did Facebook Do?
Big Data at Facebook
Use cases of the data
Facebook Architecture
Big Data Terms and Facebook
Introduction
 Facebook is a website that allows users, who sign up for free profiles, to connect online with friends,
work colleagues, or people they don’t know. It allows users to share pictures, music, videos,
and articles, as well as their own thoughts and opinions, with as many people as they like.
 Established in 2004, from the college dorm room of Mark Zuckerberg, a Harvard student, the
website is now worth billions of dollars and is one of the world’s most recognizable brands.
 Users send “friend requests” to people who they may – or may not – know.

 Once accepted, the two profiles are connected with both users able to see whatever the other person
posts. “Facebookers” can post almost anything to their “timeline”, a snapshot of what is happening in
their social circle at any given time, and can also enter private chat with other friends who are online.
3 V’s and Facebook
 Volume: The world’s most popular social media network with more than two billion monthly
active users worldwide, Facebook stores enormous amounts of user data, making it a massive data
wonderland. Facebook generates 4 petabytes of data per day, that is, four million gigabytes. All that
data is stored in what is known as the Hive, which contains about 300 petabytes of data.
 Velocity: Consider Facebook’s per-minute statistics. According to Social Skinny’s insight, 293,000 statuses are
updated, 136,000 photos are uploaded, and 500,000 comments are posted on Facebook every minute.
 Variety: Facebook stores data about us in numerous forms. Facebook knows who our friends are,
what we look like, where we are, what we are doing, our likes, our dislikes, and so much more.
Apart from analyzing user data, Facebook has other ways of determining user behavior, such as tracking
cookies, tag suggestions, analysis of likes, the types of ads we prefer to watch, and much more.
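As a quick sanity check, the volume and velocity figures above can be combined with some back-of-envelope arithmetic. This is a small Python sketch using only the numbers quoted above:

```python
# Back-of-envelope check on the volume and velocity figures quoted above.
PB_IN_GB = 10**6  # 1 petabyte = 1,000,000 gigabytes

daily_ingest_gb = 4 * PB_IN_GB    # ~4 PB of new data generated per day
warehouse_gb = 300 * PB_IN_GB     # ~300 PB held in the Hive warehouse

# 4 PB/day works out to roughly 46 GB of new data every second:
gb_per_second = daily_ingest_gb / (24 * 60 * 60)
print(f"~{gb_per_second:.0f} GB ingested per second")

# Scaling the per-minute activity figures to a full day:
statuses_per_day = 293_000 * 60 * 24
comments_per_day = 500_000 * 60 * 24
print(f"~{statuses_per_day:,} status updates per day")
print(f"~{comments_per_day:,} comments per day")
```

At these rates the 300 PB warehouse represents on the order of 75 days of raw ingest, which is why long-term storage relies on summaries rather than raw logs.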
Problem Statement
 As Facebook started gaining immense popularity, the challenges of storing data and scaling the system
also increased.
 The amount of log and contextual data in Facebook that needs to be processed and stored has
exploded.
 A key requirement for any data processing platform for this environment is the ability to be
able to scale rapidly.
 Further, with engineering resources limited, the system had to be very reliable and easy to use
and maintain.
What Did Facebook Do?
 Initially, data warehousing at Facebook was performed entirely on an Oracle instance.

 But after Facebook started hitting scalability and performance problems, they investigated
whether there were any open source technologies that could be used.
 As part of this investigation, they deployed a relatively small Hadoop instance and started
publishing some of their core datasets into this instance.
Big Data at Facebook
 The initial prototype was very successful: Facebook gained the ability to process massive amounts
of data in reasonable timeframes, an ability it did not have earlier.
 Programmers also liked the fact that they could use a familiar programming language for
processing.
 With this success, Facebook also started developing Hive, a data warehousing platform.

 This made it even easier for users to process data in the Hadoop cluster, since they could express
common computations in the form of SQL, a language with which most engineers and
analysts are familiar.
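The idea of expressing a computation in SQL and having it run as a MapReduce job can be illustrated in miniature. This is not Hive's actual compiler, just a toy Python analogue with made-up data:

```python
# A minimal sketch of the idea behind Hive: a SQL-like aggregation such as
#   SELECT page, COUNT(*) FROM page_views GROUP BY page
# is executed as a map step (emit a key per row), a shuffle (group by key),
# and a reduce step (aggregate values sharing a key). The rows are made up.
from itertools import groupby
from operator import itemgetter

page_views = [
    {"user": "alice", "page": "/home"},
    {"user": "bob",   "page": "/photos"},
    {"user": "carol", "page": "/home"},
]

# Map: emit (key, 1) pairs, one per input row.
mapped = [(row["page"], 1) for row in page_views]

# Shuffle: group pairs by key (in a real cluster this happens across machines).
mapped.sort(key=itemgetter(0))

# Reduce: sum the counts for each key.
counts = {key: sum(v for _, v in group)
          for key, group in groupby(mapped, key=itemgetter(0))}
print(counts)  # {'/home': 2, '/photos': 1}
```

The appeal for analysts is that Hive lets them write only the SQL-like line in the comment; the map, shuffle, and reduce stages are generated for them.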
Cont…
 The cluster size and usage grew by leaps and bounds, and according to the last published figures Facebook was running
the second largest Hadoop cluster in the world.
 They held more than 2 PB of data in Hadoop and loaded more than 10 TB of data every day.

 The Hadoop instance had 2,400 cores and about 9 TB of memory, and ran at 100% utilization at many
points during the day.
 Facebook was able to scale out this cluster rapidly in response to growth, and has been able
to take advantage of open source by modifying Hadoop where required to suit its needs.
 Facebook has open-sourced the Hive project, and it is now available under the Apache Hadoop project.
Use cases of the data
Facebook produces daily and hourly summaries over large amounts of data. These summaries are used for a
number of different purposes within the company:
 Reports based on these summaries are used by engineering and non-engineering functional teams to drive
product decisions. These include reports on user growth, page views, and average time spent
on the site by users.
 Providing performance numbers for advertisement campaigns that run on Facebook.
 Backend processing for site features such as “people you may know” and “applications you may like”.
 Running ad hoc jobs over historical data. These analyses help answer questions from Facebook product groups
and the executive team.
 Serving as a de facto long-term archival store for log datasets.
Big Data Terms and Facebook
Scribe
 Facebook uses Scribe as its core log aggregation service. “Scribe is a server for aggregating log
data streamed in real time from a large number of servers.” A network of Scribe servers forms a
directed graph. Each server is a node and directed edges represent lines of communication.
Usually, Scribe is installed on every node, and logs are funneled to one giant “aggregator” node.
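The aggregation pattern can be sketched in miniature. Real Scribe is a Thrift-based service with local buffering and failover; the classes, method names, and log lines below are purely illustrative:

```python
# Toy sketch of Scribe-style log aggregation: many source nodes forward
# (category, message) pairs toward an aggregator node, which buckets them
# by category. Names here are illustrative, not Scribe's real API.
from collections import defaultdict

class AggregatorNode:
    """Terminal node of the Scribe graph: collects logs per category."""
    def __init__(self):
        self.logs = defaultdict(list)

    def receive(self, category, message):
        self.logs[category].append(message)

class SourceNode:
    """A server whose local Scribe client forwards to a downstream node."""
    def __init__(self, downstream):
        self.downstream = downstream

    def log(self, category, message):
        # Real Scribe would buffer locally if the downstream is unreachable.
        self.downstream.receive(category, message)

aggregator = AggregatorNode()
web1 = SourceNode(aggregator)
web2 = SourceNode(aggregator)

web1.log("page_views", "alice viewed /home")
web2.log("page_views", "bob viewed /photos")
web2.log("errors", "timeout talking to mysql")

print({cat: len(msgs) for cat, msgs in aggregator.logs.items()})
# {'page_views': 2, 'errors': 1}
```

Because each edge in the graph is just "forward downstream", intermediate relay nodes can be chained between sources and the aggregator without changing either end.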

HDFS
 The collected logs are written into HDFS (Hadoop Distributed File System) and later analyzed
by Hadoop MapReduce or Hive.
Hive / Hadoop
 Facebook uses Hive to build a data warehouse over all the data collected in HDFS. Files in
HDFS, including log data from Scribe and dimension data from the MySQL tier, are made
available as tables with logical partitions. A SQL-like query language provided by Hive is used
in conjunction with MapReduce to create and publish a variety of summaries and reports, as well as
to perform historical analysis over these tables.
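The "logical partitions" idea can be sketched as follows. Hive commonly partitions tables by a key such as the date, so a query restricted to one day scans only that partition's files. The paths and function below are illustrative, not Hive's real API:

```python
# Sketch of logical partitioning: a warehouse table is split by a partition
# key (here the date, "ds"), and each partition maps to its own set of files.
# A query filtered on the key reads only the matching partition's files
# (partition pruning) instead of the whole table. Paths are made up.
partitions = {
    "2010-03-01": ["/warehouse/page_views/ds=2010-03-01/part-0000"],
    "2010-03-02": ["/warehouse/page_views/ds=2010-03-02/part-0000",
                   "/warehouse/page_views/ds=2010-03-02/part-0001"],
}

def files_for_query(date):
    """Return only the files a date-filtered query needs to scan."""
    return partitions.get(date, [])

print(files_for_query("2010-03-02"))
```

With daily log loads in the tens of terabytes, this pruning is what keeps a "yesterday's page views" query from scanning hundreds of petabytes.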

Tools
 Browser-based interfaces built on top of Hive allow users to compose and launch Hive queries
(which in turn launch MapReduce jobs) with just a few mouse clicks.
Thank You!
