
Scaling myYearbook.com
Lessons Learned From Rapid Growth

Gavin M. Roy
Chief Technology Officer
myYearbook.com

Surge 2010
About myYearbook.com

• Founded in 2005

• 2007 - 100M Dynamic HTTP Requests per Month

• 2010 - 2.5B Dynamic HTTP Requests per Month

• Top 5 Social Network in the United States as measured by Hitwise

• Top 25 trafficked site in the United States as measured by ComScore

• 99% Uptime
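
To put the monthly request counts above in per-second terms, here is a rough back-of-the-envelope sketch. It assumes a 30-day month and evenly spread traffic (neither is stated in the slides), so real peak rates would be considerably higher than these averages.

```python
# Assumption: a 30-day month with traffic spread evenly across it.
SECONDS_PER_MONTH = 30 * 24 * 60 * 60


def average_rps(requests_per_month: float) -> float:
    """Average requests per second for a monthly request count."""
    return requests_per_month / SECONDS_PER_MONTH


rps_2007 = average_rps(100e6)  # 2007: 100M dynamic requests per month
rps_2010 = average_rps(2.5e9)  # 2010: 2.5B dynamic requests per month
growth = 2.5e9 / 100e6         # growth factor between the two figures

print(f"2007: ~{rps_2007:.0f} req/s, 2010: ~{rps_2010:.0f} req/s, {growth:.0f}x growth")
```
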
myYearbook.com Friend Discovery
Q1 2007

• All Managed Servers

• 1 PostgreSQL server

• 1 Web application server (Apache/PHP)

• Multiple static content servers

• 1 Phone call a night during peak due to outages


Q4 2010

• 2 Data centers

• 16 Message Brokers (ActiveMQ/RabbitMQ)

• 35 PostgreSQL Servers

• 100 memcached Servers w/ 1.2 TB of active cache

• ~100 Other servers (Consumers, R&D, Email, Monitoring, etc)

• ~400 Web Application Servers


Key Architecture Components

• Languages: PHP, Python

• Webservers: Apache HTTPD, Cherokee, Lighttpd

• Squid

• F5 Networks BigIP

• PostgreSQL

• Memcached

• Message Brokers: ActiveMQ, RabbitMQ

• Tornado

• Isilon NAS

• Message Systems Momentum

• Subversion, Git


Growth is a Double-Edged Sword
Internet Startup Growth Cycle

1. Prototype

2. Launch

3. Re-Engineer (Fix problems)

4. Add new functionality

5. Repeat Steps 3 and 4


Internet Startup Growth Cycle

• Steps 1 & 2

• Limited Budget

• Limited Time & Resources

• Steps 3 & 4

• Increased Budget

• Limited Time & Resources


“The best laid schemes o’ mice an’ men gang aft agley”

- Robert Burns, To a Mouse


Instrument Early and Deep
Analytics

• Know your systems

• Know your key application metrics

• 238 Categories

• Applications, Services, etc

• 5,558 Items

• 1 to ~25 Datapoints
Gathering Analytics

• Staplr

• Posuta

• Nagios

• Cacti

• 3rd Party External

• Traffic

• Availability and Performance


Scaling Databases
Plan for Growth

• Hardware

• CPU Horsepower based upon need

• Disk based upon need

• RAM based upon budget.

• Get 2
General DB Scaling

• Scale Up

• Bigger, Faster Hardware

• Better, Faster Software

• Scale Out

• Sharding

• Service Specific
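
As an illustration of the sharding bullet above, here is a minimal sketch of routing a key to one of several databases by a stable hash. The shard names and the `shard_for` helper are hypothetical, not taken from the talk, and a production scheme would also need a plan for adding shards later.

```python
import hashlib

# Hypothetical shard names, for illustration only.
SHARDS = ["users_db_0", "users_db_1", "users_db_2", "users_db_3"]


def shard_for(user_id: int, shards=SHARDS) -> str:
    """Pick a shard from a stable hash of the key.

    A cryptographic digest is used only because it is stable across
    processes and machines, unlike Python's built-in hash() for strings.
    """
    digest = hashlib.sha256(str(user_id).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % len(shards)
    return shards[bucket]


# The same user always lands on the same shard.
print(shard_for(42), shard_for(42))
```
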
PostgreSQL Scaling

• Connection pooling

• PgBouncer, pgpool, language-specific pooling

• Horizontal via PL/Proxy

• Read-only nodes

• Londiste, Slony, Bucardo

• PostgreSQL 9.0
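
The connection pooling idea above can be sketched in a few lines. This is a simplified in-process pool, not how PgBouncer or pgpool work internally. `fake_connect` is a stand-in for a real driver's connect call so the example runs without a database.

```python
import queue
from contextlib import contextmanager


class ConnectionPool:
    """Fixed-size pool: reuse connections instead of opening one per request."""

    def __init__(self, connect, size):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(connect())

    @contextmanager
    def connection(self, timeout=None):
        conn = self._idle.get(timeout=timeout)  # blocks while all are in use
        try:
            yield conn
        finally:
            self._idle.put(conn)  # hand back to the pool rather than closing


opened = []  # records every connection the stand-in driver creates


def fake_connect():
    """Hypothetical stand-in for a database driver's connect()."""
    conn = object()
    opened.append(conn)
    return conn


pool = ConnectionPool(fake_connect, size=2)
for _ in range(10):
    with pool.connection() as conn:
        pass  # queries would run here

print(f"{len(opened)} connections served 10 requests")
```
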
Table Partitioning

• Supported in PostgreSQL as of 8.1

• Excellent method for maintaining data

• Allows for removal of aged data without bloat

• Focused SELECTs against a single partition, while still allowing ad-hoc SELECTs across all partitions
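
To make the "removal of aged data without bloat" point concrete, here is a sketch of the bookkeeping around monthly partitions: naming the partition a row belongs to and listing partitions that have aged out and can be dropped whole. The `base_YYYY_MM` naming scheme and both helper names are assumptions for illustration.

```python
from datetime import date


def partition_name(base: str, day: date) -> str:
    """Name of the monthly partition that holds rows for the given day."""
    return f"{base}_{day.year:04d}_{day.month:02d}"


def expired_partitions(base, existing, today, keep_months):
    """Partitions older than the current month plus keep_months previous months.

    Dropping a whole partition removes old rows without leaving the dead
    tuples that a large DELETE on a single table would.
    """
    cutoff = today.year * 12 + (today.month - 1) - keep_months
    expired = []
    for name in existing:
        year, month = name[len(base) + 1:].split("_")
        if int(year) * 12 + (int(month) - 1) < cutoff:
            expired.append(name)
    return expired


existing = [f"events_2010_{m:02d}" for m in range(5, 10)]
print(partition_name("events", date(2010, 9, 30)))
print(expired_partitions("events", existing, date(2010, 9, 30), keep_months=3))
```
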


Internals Monitoring

• Sophisticated system catalog

• Beyond configuration data

• Statistics

• Index utilization

• Cache hit/miss data

• Lock data
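
As one example of the cache hit/miss data mentioned above, the `pg_stat_database` view exposes `blks_hit` and `blks_read` counters per database. The sketch below shows a query that could be run through any PostgreSQL driver and a helper that turns the counters into a ratio. The sample numbers are made up.

```python
# Query against PostgreSQL's pg_stat_database statistics view.
# Running it requires a database connection, which is omitted here.
CACHE_HIT_QUERY = """
SELECT datname, blks_hit, blks_read
FROM pg_stat_database
"""


def cache_hit_ratio(blks_hit: int, blks_read: int) -> float:
    """Fraction of block requests served from PostgreSQL's shared buffers."""
    total = blks_hit + blks_read
    return blks_hit / total if total else 0.0


# Made-up counters for illustration.
print(f"{cache_hit_ratio(990, 10):.1%}")
```

Note that `blks_read` counts blocks requested from the operating system, some of which may still come from the OS page cache rather than disk.
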
“Anything that can possibly go wrong, does.”

- Jack Sack
Recovering from Server Failures

• Daily backup is not enough

• Disaster recovery option

• Replicate data for failover and maintenance

• Warm standby: PostgreSQL 8.2 and later

• Hot standby: PostgreSQL 9.0


The Importance of Caching
Static Content

• Content Delivery Networks

• HTTP Reverse Proxy

• Web server

• Storage

• Operating System
Data Caching

• Reduces system load

• Databases

• Filesystem IO

• Tiered approach:

• In Application Execution

• In Application Server (APC/SHM)

• Distributed Across Network (memcached)
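
The tiered approach above can be sketched as a lookup that tries the fastest tier first and falls back to slower ones. Plain dictionaries stand in for the request-local, server-local (APC/SHM) and distributed (memcached) tiers. `TieredCache` and `load_from_database` are hypothetical names, and expiry and invalidation are left out.

```python
class TieredCache:
    """Check tiers fastest-first; on a hit, back-fill the faster tiers."""

    def __init__(self, tiers, loader):
        self.tiers = tiers    # dict-like objects, fastest first
        self.loader = loader  # called only when every tier misses

    def get(self, key):
        for depth, tier in enumerate(self.tiers):
            if key in tier:
                value = tier[key]
                for faster in self.tiers[:depth]:
                    faster[key] = value
                return value
        value = self.loader(key)
        for tier in self.tiers:
            tier[key] = value
        return value


request_local, server_local, distributed = {}, {}, {}
loads = []  # records each trip to the backing store


def load_from_database(key):
    """Hypothetical stand-in for an expensive database read."""
    loads.append(key)
    return f"profile:{key}"


cache = TieredCache([request_local, server_local, distributed], load_from_database)
first = cache.get("user:1")
request_local.clear()  # simulate a new request on the same server
second = cache.get("user:1")
print(first, second, len(loads))
```
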


memcached at Scale

• TCP vs UDP

• Binary vs ASCII

• Many Packets per Second

• Client Implementation

• Inconsistent Hashing
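
The slide does not spell out the hashing bullet. One common reading is that the key-to-server mapping chosen by a client may not stay stable as the pool changes. The sketch below assumes simple modulo hashing (not necessarily what any particular memcached client does) and measures how many keys map to a different server when one server is added to a pool of 100.

```python
import hashlib


def server_index(key: str, server_count: int) -> int:
    """Modulo hashing: stable hash of the key, reduced by the pool size."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % server_count


def fraction_remapped(keys, old_count, new_count):
    """Share of keys whose server changes when the pool is resized."""
    moved = sum(
        1 for k in keys if server_index(k, old_count) != server_index(k, new_count)
    )
    return moved / len(keys)


keys = [f"user:{i}" for i in range(10_000)]
moved = fraction_remapped(keys, 100, 101)
print(f"{moved:.0%} of keys map to a different server after adding one node")
```

With modulo hashing, nearly every key moves, which amounts to a cold cache. Consistent hashing schemes are designed to move only about one key in N instead.
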
Strategies

• Monitor Utilization
Appliances

• Still young market

• Rack dense cache

• Leveraging lower cost SSDs over RAM

• Kitchen sink in a box

• Replication

• Fancy UIs
Balance in Development Practices
Few web developers start by planning for scale.
Deep thinking?

• First to market

• Drive market share / grow business

• Engage users

• Meet the spec

• Get it done yesterday


Application Codebase History

• 2005-2007: Monolithic Code Base

• 2008: Expanded to use a Services Oriented Architecture

• Why SOA?

• Applications get own resources

• Loosely-Coupled architecture

• Selective Maintenance

• 2010: Improved process and performance


Avoid Unnecessary Disruptions
Avoid Reengineering

• Reengineering for reengineering’s sake is an unnecessary distraction

• Address engineering faults with a forward facing purpose

• Do not introduce new products or product redesigns at the same time as a wholesale change in application code

• Hard to find primary reason for shift in traffic patterns

• Problems impact users' impression of new products and user-facing changes
Decouple Processes
Why decouple code and processes?

• Faster page generation

• Distribution of CPU-intensive tasks

• Scale consumer servers, not application front-end servers

• Throttle activity

• Tap data streams for other purposes
Message Processing

• Enqueue into ActiveMQ or RabbitMQ

• Elastic processing via rejected consumer framework

• Targeted workloads

• Image uploads

• Comment and Message processing

• Email spooling
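
The enqueue-and-consume pattern above can be sketched with the standard library alone. Here `queue.Queue` stands in for a broker such as ActiveMQ or RabbitMQ, and threads stand in for consumer servers. The handler and job names are hypothetical, and this is not the consumer framework named in the slide.

```python
import queue
import threading

jobs = queue.Queue()  # stands in for a broker queue
processed = []
processed_lock = threading.Lock()


def consumer():
    """Pull jobs until a None sentinel arrives."""
    while True:
        job = jobs.get()
        if job is None:
            jobs.task_done()
            break
        with processed_lock:
            processed.append(f"resized {job}")  # the CPU-heavy work goes here
        jobs.task_done()


def handle_upload(filename):
    """Front-end handler: enqueue the work and return right away."""
    jobs.put(filename)
    return "202 Accepted"


# Scale by adding consumers, not front-end servers.
workers = [threading.Thread(target=consumer) for _ in range(3)]
for w in workers:
    w.start()

statuses = [handle_upload(f"photo_{i}.jpg") for i in range(5)]
jobs.join()  # wait until every enqueued job has been processed

for _ in workers:
    jobs.put(None)  # one shutdown sentinel per consumer
for w in workers:
    w.join()

print(len(processed), "jobs processed")
```
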
Storage
User Generated Content

• Shared Storage

• Isilon IQ Series

• Scale out NAS

• FreeBSD Based Appliance

• NFS
Database Servers

• Direct Attached Storage

• Fastest single-node disk implementation

• Cost Effective

• SAN and NAS

• Different performance focus

• More management required

• Expensive
Managing Vendors
Not All Vendors Are Equal

• Migrating CDN vendors resulted in a notable increase in page views and decrease in page load latency

• Hardware support by vendor differs greatly

• Communication is key

• Foster good relationships

• Extra effort on your part should yield extra effort on their part
Focus on Team Not Just Tech
Questions?
Follow me on Twitter: @Crad
Blog: http://gavinroy.com
