
SCALING LINKEDIN

A BRIEF HISTORY

Josh Clemm
www.linkedin.com/in/joshclemm

Scaling = replacing all the components
of a car while driving it at 100mph

Via Mike Krieger, “Scaling Instagram”


LinkedIn started back in 2003 to “connect to your network for better job opportunities.”

It had 2,700 members in its first week.


First week growth guesses from founding team

Fast forward to today...

[Chart: LinkedIn member growth, 2003–2015, climbing to over 400 million members]
LINKEDIN SCALE TODAY

● LinkedIn is a global site with over 400 million members

● Web pages and mobile traffic are served at tens of thousands of queries per second

● Backend systems serve millions of queries per second

How did we get there?
Let’s start from the beginning

LINKEDIN’S ORIGINAL ARCHITECTURE

● Huge monolithic app called Leo

● Java, JSP, Servlets, JDBC

● Served every page, same SQL database

[Diagram: the LEO monolith backed by a single DB]

Circa 2003
So far so good, but two areas to improve:

1. The growing member-to-member connection graph

2. The ability to search those members

MEMBER CONNECTION GRAPH

● Needed to live in-memory for top performance

● Used graph traversal queries not suitable for the shared SQL database

● Different usage profile than other parts of site



So, a dedicated service was created.


LinkedIn’s first service.
MEMBER SEARCH

● Social networks need powerful search

● Lucene was used on top of our member graph



LinkedIn’s second service.


LINKEDIN WITH CONNECTION GRAPH AND SEARCH

[Diagram: LEO makes RPC calls to the Member Graph service and the Lucene-based search service; connection / profile updates flow to the DB]

Circa 2004
Getting better, but the single database was under heavy load.

Vertically scaling helped, but we needed to offload the read traffic...
REPLICA DBs

● Master/slave concept

● Read-only traffic from replica

● Writes go to main DB

● Early version of Databus kept DBs in sync

[Diagram: Main DB → Databus relay → Replica DBs]
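
The read/write split above can be sketched in a few lines. A minimal sketch, assuming two plain JDBC DataSources (names are hypothetical): reads are served by a replica, writes are pinned to the main DB.

import java.sql.Connection;
import java.sql.SQLException;
import javax.sql.DataSource;

public class ReplicaAwareRouter {
    private final DataSource mainDb;     // accepts writes
    private final DataSource replicaDb;  // read-only copy, kept in sync by a Databus-style relay

    public ReplicaAwareRouter(DataSource mainDb, DataSource replicaDb) {
        this.mainDb = mainDb;
        this.replicaDb = replicaDb;
    }

    /** Read-only traffic is served from the replica. */
    public Connection forRead() throws SQLException {
        return replicaDb.getConnection();
    }

    /** Writes always go to the main DB. */
    public Connection forWrite() throws SQLException {
        return mainDb.getConnection();
    }
}

Since replication is asynchronous, a replica can briefly lag the main DB, so any read that must see its own write would still be routed to the main DB.
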
REPLICA DBs TAKEAWAYS

● Good medium term solution

● We could vertically scale servers for a while

● Master DBs have finite scaling limits

● These days, LinkedIn DBs use partitioning

LINKEDIN WITH REPLICA DBs

[Diagram: LEO makes RPC calls to the Member Graph and Search services; read-only traffic goes to replica DBs and read/write traffic (connection and profile updates) goes to the main DB, with the Databus relay keeping replicas in sync]

Circa 2006
As LinkedIn continued to grow, the
monolithic application Leo was becoming
problematic.

Leo was difficult to release and debug, and the site kept going down...
IT WAS TIME TO... Kill LEO
SERVICE ORIENTED ARCHITECTURE

Extracting services (Java Spring MVC) from the legacy Leo monolithic application

[Diagram: Recruiter Web App, Public Profile Web App, Profile Service, and yet more services being carved out of LEO]
Circa 2008 on
SERVICE ORIENTED ARCHITECTURE

● Goal - create vertical stack of stateless services

● Frontend servers fetch data from many domains, build HTML or JSON response

● Mid-tier services host APIs, business logic

● Data-tier or back-tier services encapsulate data domains

[Diagram: an example vertical stack of Profile Web App → Profile Service → Profile DB]
EXAMPLE MULTI-TIER ARCHITECTURE AT LINKEDIN

[Diagram: Browser / App → Frontend Web App → mid-tier services (Profile, Connections, Groups, Content) → data services and stores (Edu Data Service, Data Service, Kafka, Hadoop, DB, Voldemort)]
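
To give a feel for the fan-out in the diagram above, here is a rough sketch (all client interfaces and names are hypothetical) of a frontend calling several mid-tier services in parallel and assembling one response:

import java.util.concurrent.CompletableFuture;

public class ProfilePageAssembler {
    // Hypothetical mid-tier clients; each wraps a remote call to a domain service.
    interface ProfileClient     { CompletableFuture<String> fetchProfile(long memberId); }
    interface ConnectionsClient { CompletableFuture<Integer> fetchConnectionCount(long memberId); }
    interface GroupsClient      { CompletableFuture<String> fetchGroups(long memberId); }

    private final ProfileClient profiles;
    private final ConnectionsClient connections;
    private final GroupsClient groups;

    ProfilePageAssembler(ProfileClient p, ConnectionsClient c, GroupsClient g) {
        this.profiles = p;
        this.connections = c;
        this.groups = g;
    }

    /** Fan out to several domains in parallel, then build one JSON-ish response. */
    CompletableFuture<String> render(long memberId) {
        CompletableFuture<String> profile = profiles.fetchProfile(memberId);
        CompletableFuture<Integer> count = connections.fetchConnectionCount(memberId);
        CompletableFuture<String> grps = groups.fetchGroups(memberId);

        return CompletableFuture.allOf(profile, count, grps)
                .thenApply(ignored -> "{\"profile\":" + profile.join()
                        + ",\"connections\":" + count.join()
                        + ",\"groups\":" + grps.join() + "}");
    }
}

The parallel fan-out keeps page latency close to the slowest single call rather than the sum of all calls, which is exactly why the complex call graphs listed as a con below still pay off.
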
SERVICE ORIENTED ARCHITECTURE COMPARISON

PROS
● Stateless services easily scale
● Decoupled domains
● Build and deploy independently

CONS
● Ops overhead
● Introduces backwards compatibility issues
● Leads to complex call graphs and fanout
SERVICES AT LINKEDIN

● In 2003, LinkedIn had one service (Leo)

● By 2010, LinkedIn had over 150 services

● Today in 2015, LinkedIn has over 750 services

bash$ eh -e %%prod | awk -F. '{ print $2 }' | sort | uniq | wc -l

756
Getting better, but LinkedIn was
experiencing hypergrowth...
CACHING

● Simple way to reduce load on servers and speed up responses

● Mid-tier caches store derived objects from different domains, reduce fanout

● Caches in the data layer

● We use memcache, couchbase, even Voldemort

[Diagram: caches sitting alongside the mid-tier service and the data layer, between the frontend web app and the DB]
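
As a rough illustration of the pattern, a minimal cache-aside sketch; the in-memory map stands in for a real memcache/couchbase/Voldemort client, and the DAO is hypothetical:

import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class CachedProfileReader {
    // Stand-in for a real cache client (memcached, Couchbase, Voldemort, ...).
    private final ConcurrentMap<Long, String> cache = new ConcurrentHashMap<>();
    private final ProfileDao dao;  // hypothetical data-access object backed by the DB

    interface ProfileDao {
        String loadProfile(long memberId);
        void saveProfile(long memberId, String profile);
    }

    public CachedProfileReader(ProfileDao dao) {
        this.dao = dao;
    }

    /** Cache-aside read: check the cache first, fall back to the DB, then populate the cache. */
    public String getProfile(long memberId) {
        return Optional.ofNullable(cache.get(memberId))
                .orElseGet(() -> {
                    String profile = dao.loadProfile(memberId);
                    cache.put(memberId, profile);
                    return profile;
                });
    }

    /** Write path: persist, then invalidate so the next read repopulates the cache. */
    public void updateProfile(long memberId, String profile) {
        dao.saveProfile(memberId, profile);
        cache.remove(memberId);  // invalidation is the part that gets hard once caches hold derived objects
    }
}
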

There are only two hard problems in
Computer Science:
Cache invalidation, naming things, and
off-by-one errors.

Via Twitter by Kellan Elliott-McCrea and later Jonathan Feinberg
CACHING TAKEAWAYS

● Caches are easy to add in the beginning, but complexity adds up over time.

● Over time LinkedIn removed many mid-tier caches because of the complexity around invalidation

● We kept caches closer to data layer


CACHING TAKEAWAYS (cont.)

● Services must handle full load - caches improve speed, they are not permanent load-bearing solutions

● We’ll use a low-latency solution like Voldemort when appropriate and precompute results
LinkedIn’s hypergrowth was extending to
the vast amounts of data it collected.

Individual pipelines to route that data weren’t scaling. A better solution was needed...
KAFKA MOTIVATIONS

● LinkedIn generates a ton of data


○ Pageviews
○ Edits on profile, companies, schools
○ Logging, timing
○ Invites, messaging
○ Tracking

● Billions of events every day

● Separate and independently created pipelines routed this data
A WHOLE LOT OF CUSTOM PIPELINES...

As LinkedIn needed to scale, each pipeline needed to scale.
KAFKA

Distributed pub-sub messaging platform as LinkedIn’s universal data pipeline

[Diagram: frontend and backend services publish to Kafka; DWH, Oracle, monitoring, analytics, and Hadoop consume from it]
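
To make the pub-sub flow concrete, a small sketch using the Apache Kafka Java producer client; the broker address, topic name, and payload are made up for illustration:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class PageViewPublisher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumed local broker for the sketch
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // Publish a tracking event once; monitoring, analytics, and Hadoop consumers
        // can all read the same topic independently, which is the whole point of the pipeline.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("page-views", "member-123", "{\"page\":\"/feed\"}"));
        }
    }
}
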


KAFKA AT LINKEDIN

BENEFITS
● Enabled near realtime access to any data source

● Empowered Hadoop jobs

● Allowed LinkedIn to build realtime analytics

● Vastly improved site monitoring capability

● Enabled devs to visualize and track call graphs

● Over 1 trillion messages published per day, 10 million messages per second
OVER 1 TRILLION PUBLISHED DAILY
Let’s end with
the modern years
REST.LI

● Services extracted from Leo or newly created were inconsistent and often tightly coupled

● Rest.li was our move to a data-model-centric architecture

● It ensured a consistent stateless RESTful API model across the company.
REST.LI (cont.)

● By using JSON over HTTP, our new APIs supported non-Java-based clients.

● By using Dynamic Discovery (D2), we got load balancing, discovery, and scalability of each service API.

● Today, LinkedIn has 1130+ Rest.li resources and over 100 billion Rest.li calls per day
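
For flavor, a sketch of what a Rest.li collection resource looks like, loosely modeled on Rest.li’s public quickstart; the resource name is illustrative, and the Greeting record would normally be generated from a Pegasus schema rather than hand-written:

import com.linkedin.restli.server.annotations.RestLiCollection;
import com.linkedin.restli.server.resources.CollectionResourceTemplate;

// Exposes a consistent, stateless REST endpoint at /greetings.
// Greeting is a record template generated from a Pegasus (.pdsc) schema (not shown here).
@RestLiCollection(name = "greetings", namespace = "com.example.restli")
public class GreetingsResource extends CollectionResourceTemplate<Long, Greeting> {

    // GET /greetings/{id} -- Rest.li maps the HTTP verb and URI to this method.
    @Override
    public Greeting get(Long id) {
        return new Greeting().setId(id).setMessage("Hello from Rest.li");
    }
}

Because the resource is defined against the data model, clients in any language can call it over JSON/HTTP, and D2 handles discovery and load balancing in front of it.
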
REST.LI (cont.)

[Screenshot: Rest.li automatic API documentation]

REST.LI (cont.)

[Diagram: Rest.li R2/D2 tech stack]


LinkedIn’s success with Data infrastructure
like Kafka and Databus led to the
development of more and more scalable
Data infrastructure solutions...
DATA INFRASTRUCTURE

● It was clear LinkedIn could build data infrastructure that enables long term growth

● LinkedIn doubled down on infra solutions like:
○ Storage solutions
■ Espresso, Voldemort, Ambry (media)
○ Analytics solutions like Pinot
○ Streaming solutions
■ Kafka, Databus, and Samza
○ Cloud solutions like Helix and Nuage
DATABUS
LinkedIn is a global company and was
continuing to see large growth. How else
to scale?
MULTIPLE DATA CENTERS

● Natural progression of scaling horizontally

● Replicate data across many data centers using storage technology like Espresso

● Pin users to a geographically close data center (see the sketch below)

● Difficult but necessary
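
A toy sketch of the “pin users to a data center” idea, with all names hypothetical: each member gets a home data center, and traffic falls back to the nearest one when no assignment exists.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class StickyRouter {
    // In practice the assignment would live in a replicated store; a map keeps the sketch simple.
    private final Map<Long, String> homeDataCenter = new ConcurrentHashMap<>();

    /** Pin a member to a data center (e.g. when the account is created or migrated). */
    public void pin(long memberId, String dataCenter) {
        homeDataCenter.put(memberId, dataCenter);
    }

    /** Route to the member's home data center, or the geographically nearest one otherwise. */
    public String routeFor(long memberId, String nearestDataCenter) {
        return homeDataCenter.getOrDefault(memberId, nearestDataCenter);
    }
}
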


MULTIPLE DATA CENTERS

● Multiple data centers are imperative to maintain high availability.

● You need to avoid any single point of failure, not just for each service, but for the entire site.

● LinkedIn runs out of three main data centers, additional PoPs around the globe, and more coming online every day...
MULTIPLE DATA CENTERS

[Map: LinkedIn’s operational setup as of 2015 (circles represent data centers, diamonds represent PoPs)]
Of course LinkedIn’s scaling story is never
this simple, so what else have we done?
WHAT ELSE HAVE WE DONE?

● Each of LinkedIn’s critical systems has undergone its own rich history of scale (graph, search, analytics, profile backend, comms, feed)

● LinkedIn uses Hadoop / Voldemort for insights like People You May Know, Similar Profiles, Notable Alumni, and profile browse maps.
WHAT ELSE HAVE WE DONE? (cont.)

● Re-architected frontend approach using
○ Client templates
○ BigPipe
○ Play Framework

● LinkedIn added multiple tiers of proxies using Apache Traffic Server and HAProxy

● We improved the performance of servers with new hardware, advanced system tuning, and newer Java runtimes.
Scaling sounds easy and quick to do, right?

Hofstadter's Law: It always takes longer
than you expect, even when you take
into account Hofstadter's Law.

Via Douglas Hofstadter, Gödel, Escher, Bach: An Eternal Golden Braid
THANKS!

Josh Clemm
www.linkedin.com/in/joshclemm
LEARN MORE
● Blog version of this slide deck
https://engineering.linkedin.com/architecture/brief-history-scaling-linkedin

● Visual story of LinkedIn’s history
https://ourstory.linkedin.com/

● LinkedIn Engineering blog
https://engineering.linkedin.com

● LinkedIn Open-Source
https://engineering.linkedin.com/open-source

● LinkedIn’s communication system slides which include earliest LinkedIn architecture
http://www.slideshare.net/linkedin/linkedins-communication-architecture

● Slides which include earliest LinkedIn data infra work
http://www.slideshare.net/r39132/linkedin-data-infrastructure-qcon-london-2012
LEARN MORE (cont.)
● Project Inversion - internal project to enable developer productivity (trunk based model), faster deploys, unified services
http://www.bloomberg.com/bw/articles/2013-04-10/inside-operation-inversion-the-code-freeze-that-saved-linkedin

● LinkedIn’s use of Apache Traffic Server
http://www.slideshare.net/thenickberry/reflecting-a-year-after-migrating-to-apache-traffic-server

● Multi Data Center - testing failovers
https://www.linkedin.com/pulse/armen-hamstra-how-he-broke-linkedin-got-promoted-angel-au-yeung
LEARN MORE - KAFKA
● History and motivation around Kafka
http://www.confluent.io/blog/stream-data-platform-1/

● Thinking about streaming solutions as a commit log
https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying

● Kafka enabling monitoring and alerting
http://engineering.linkedin.com/52/autometrics-self-service-metrics-collection

● Kafka enabling real-time analytics (Pinot)
http://engineering.linkedin.com/analytics/real-time-analytics-massive-scale-pinot

● Kafka’s current use and future at LinkedIn
http://engineering.linkedin.com/kafka/kafka-linkedin-current-and-future

● Kafka processing 1 trillion events per day
https://engineering.linkedin.com/apache-kafka/how-we_re-improving-and-advancing-kafka-linkedin
LEARN MORE - DATA INFRASTRUCTURE
● Open sourcing Databus
https://engineering.linkedin.com/data-replication/open-sourcing-databus-linkedins-low-latency-change-data-capture-system

● Samza streams to help LinkedIn view call graphs
https://engineering.linkedin.com/samza/real-time-insights-linkedins-performance-using-apache-samza

● Real-time analytics (Pinot)
http://engineering.linkedin.com/analytics/real-time-analytics-massive-scale-pinot

● Introducing Espresso data store
http://engineering.linkedin.com/espresso/introducing-espresso-linkedins-hot-new-distributed-document-store
LEARN MORE - FRONTEND TECH
● LinkedIn’s use of client templates
○ Dust.js
http://www.slideshare.net/brikis98/dustjs
○ Profile
http://engineering.linkedin.com/profile/engineering-new-linkedin-profile

● Big Pipe on LinkedIn’s homepage
http://engineering.linkedin.com/frontend/new-technologies-new-linkedin-home-page

● Play Framework
○ Introduction at LinkedIn
https://engineering.linkedin.com/play/composable-and-streamable-play-apps
○ Switching to a non-blocking asynchronous model
https://engineering.linkedin.com/play/play-framework-async-io-without-thread-pool-and-callback-hell
LEARN MORE - REST.LI
● Introduction to Rest.li and how it helps LinkedIn scale
http://engineering.linkedin.com/architecture/restli-restful-service-architecture-scale

● How Rest.li expanded across the company
http://engineering.linkedin.com/restli/linkedins-restli-moment
LEARN MORE - SYSTEM TUNING
● JVM memory tuning
http://engineering.linkedin.com/garbage-collection/garbage-collection-optimization-high-throughput-and-low-latency-java-applications

● System tuning
http://engineering.linkedin.com/performance/optimizing-linux-memory-management-low-latency-high-throughput-databases

● Optimizing JVM tuning automatically
https://engineering.linkedin.com/java/optimizing-java-cms-garbage-collections-its-difficulties-and-using-jtune-solution
WE’RE HIRING

LinkedIn continues to grow quickly and there’s still a ton of work we can do to improve.

We’re working on problems that very few ever get to solve - come join us!
