Surge 2010 - Scaling at

Scaling myYearbook.com
Lessons Learned From Rapid Growth
Gavin M. Roy
Chief Technology Officer
myYearbook.com
Surge 2010
About myYearbook.com
• Founded in 2005
• 2007 - 100M Dynamic HTTP Requests per Month
• 2010 - 2.5B Dynamic HTTP Requests per Month
• Top 5 Social Network in the United States as measured by Hitwise
• Top 25 trafficked site in the United States as measured by ComScore
• 99% Uptime
myYearbook.com Friend Discovery
Q1 2007
• All Managed Servers
• 1 PostgreSQL server
• 1 Web application server (Apache/PHP)
• Multiple static content servers
• 1 Phone call a night during peak due to outages

Q4 2010
• 2 Data centers
• 16 Message Brokers (ActiveMQ/RabbitMQ)
• 35 PostgreSQL Servers
• 100 memcached Servers w/ 1.2 TB of active cache
• ~100 Other servers (Consumers, R&D, Email, Monitoring, etc)
• ~400 Web Application Servers

Key Architecture Components
• Languages: PHP, Python
• Webservers: Apache HTTPD, Cherokee, Lighttpd
• Squid
• F5 Networks BigIP
• PostgreSQL
• Memcached
• Message Brokers: ActiveMQ, RabbitMQ
• Tornado
• Isilon NAS
• Message Systems Momentum
• Subversion, Git

Growth is a Double-Edged Sword
Internet Startup Growth Cycle
1. Prototype
2. Launch
3. Re-Engineer (Fix problems)
4. Add new functionality
5. Repeat Steps 3 and 4

Internet Startup Growth Cycle
• Steps 1 & 2
• Limited Budget
• Limited Time & Resources
• Steps 3 & 4
• Increased Budget
• Limited Time & Resources

“The best laid schemes o’ mice an’ men gang aft agley”
- Robert Burns, To a Mouse

Instrument Early and Deep
Analytics
• Know your systems
• Know your key application metrics
• 238 Categories
• Applications, Services, etc
• 5,558 Items
• 1 to ~25 Datapoints
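One way to picture the category / item / datapoint layout is a small in-process collector. This is a minimal sketch only: the `Metrics` class and the category and item names are hypothetical and are not the Staplr or Posuta APIs.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class Metrics:
    """Collects datapoints keyed by (category, item)."""

    def __init__(self):
        self.datapoints = defaultdict(list)

    def record(self, category, item, value):
        self.datapoints[(category, item)].append(value)

    @contextmanager
    def timed(self, category, item):
        # Record elapsed wall-clock seconds for the wrapped block,
        # even if the block raises.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.record(category, item, time.perf_counter() - start)


metrics = Metrics()
with metrics.timed("web", "profile_render"):
    sum(range(1000))  # stand-in for real application work
metrics.record("web", "logins", 1)
```

A real collector would ship these values to a graphing or alerting system rather than hold them in memory.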
Gathering Analytics
• Staplr
• Posuta
• Nagios
• Cacti
• 3rd Party External
• Traffic
• Availability and Performance

Scaling Databases
Plan for Growth
• Hardware
• CPU Horsepower based upon need
• Disk based upon need
• RAM based upon budget.
• Get 2
General DB Scaling
• Scale Up
• Bigger, Faster Hardware
• Better, Faster Software
• Scale Out
• Sharding
• Service Specific
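The sharding option can be illustrated with a minimal sketch of key-based shard routing. The shard names and the `shard_for` helper are hypothetical and not taken from the talk; the point is only that a stable hash of the key decides which database holds a given user's rows.

```python
import hashlib

# Hypothetical shard names; a real deployment would map these to
# connection settings.
SHARDS = ["users_db_0", "users_db_1", "users_db_2", "users_db_3"]


def shard_for(user_id, shards=SHARDS):
    """Pick a shard from a stable hash of the key.

    hashlib is used instead of the built-in hash() so the mapping is
    identical in every process and across restarts.
    """
    digest = hashlib.md5(str(user_id).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % len(shards)
    return shards[bucket]
```

Plain modulo routing like this reshuffles most keys when the shard count changes, so it suits a shard count that is fixed up front or changed only with a planned migration.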
PostgreSQL Scaling
• Connection pooling
• pgBouncer, pgPool, language specific pooling
• Horizontal via plProxy
• Read-only nodes
• Londiste, Slony, Bucardo
• PostgreSQL 9.0
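Read-only nodes only help if the application sends reads to them. The following is a rough sketch of that routing under simple assumptions: the `Router` class and node names are hypothetical, and the "starts with SELECT" test is deliberately naive.

```python
import itertools


class Router:
    """Send reads to replicas (round robin) and everything else to the primary."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self._replicas = itertools.cycle(replicas) if replicas else None

    def target_for(self, sql):
        # Naive classification: a real router must also treat
        # SELECT ... FOR UPDATE and writing functions as writes.
        is_read = sql.lstrip().upper().startswith("SELECT")
        if is_read and self._replicas is not None:
            return next(self._replicas)
        return self.primary


router = Router("pg-primary", ["pg-replica-1", "pg-replica-2"])
router.target_for("SELECT * FROM users WHERE id = 1")   # a replica
router.target_for("UPDATE users SET name = 'x'")        # the primary
```

Londiste, Slony and Bucardo replicate asynchronously, so a replica can lag behind the primary; reads that must see a just-committed write are safer on the primary.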
Table Partitioning
• Supported in PostgreSQL as of 8.1
• Excellent method for maintaining data
• Allows for removal of aged data without bloat
• Focused SELECTS while allowing ad-hoc SELECT across all partitions
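In PostgreSQL releases of that era, partitioning is built from table inheritance plus CHECK constraints. The sketch below generates the DDL for one monthly child table and for dropping an aged one; the table and column names are hypothetical, and the helper functions are illustrative rather than anything shown in the talk.

```python
import datetime


def monthly_partition_ddl(parent, column, year, month):
    """Build CREATE TABLE DDL for one monthly child of `parent`."""
    start = datetime.date(year, month, 1)
    # First day of the following month (rolls over to January).
    end = datetime.date(year + (month == 12), month % 12 + 1, 1)
    child = f"{parent}_{start:%Y_%m}"
    return (
        f"CREATE TABLE {child} (\n"
        f"    CHECK ({column} >= DATE '{start}' AND {column} < DATE '{end}')\n"
        f") INHERITS ({parent});"
    )


def drop_partition_ddl(parent, year, month):
    """Dropping a whole child table removes aged rows without leaving bloat."""
    return f"DROP TABLE {parent}_{year:04d}_{month:02d};"


print(monthly_partition_ddl("page_views", "viewed_at", 2010, 9))
print(drop_partition_ddl("page_views", 2009, 9))
```

With inheritance-based partitioning, inserts still have to be directed to the right child (by a trigger, a rule, or the application), and the planner skips non-matching children only when the `constraint_exclusion` setting allows it.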

Internals Monitoring
• Sophisticated system catalog
• Beyond configuration data
• Statistics
• Index utilization
• Cache hit/miss data
• Lock data
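As a small illustration of what the statistics views expose, the sketch below pairs two catalog queries with a helper that turns block counters into a hit ratio. The queries use the standard `pg_stat_database` and `pg_stat_user_indexes` views; the helper name and the idea of running them from Python are assumptions for the example.

```python
# Buffer cache counters per database.
CACHE_HIT_SQL = """
SELECT datname, blks_hit, blks_read
FROM pg_stat_database
"""

# Indexes ordered by how rarely they are scanned (unused ones first).
INDEX_USAGE_SQL = """
SELECT relname, indexrelname, idx_scan
FROM pg_stat_user_indexes
ORDER BY idx_scan
"""


def cache_hit_ratio(blks_hit, blks_read):
    """Fraction of block requests served from PostgreSQL's buffer cache.

    Returns None when there has been no activity yet.
    """
    total = blks_hit + blks_read
    if total == 0:
        return None
    return blks_hit / total
```

A block counted in `blks_read` missed PostgreSQL's own buffers but may still have been served from the operating system cache, so the ratio is an upper bound on actual disk reads. Lock data lives in the separate `pg_locks` view.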
“Anything that can possibly go wrong, does.”
- Jack Sack
Recovering from Server Failures
• Daily backup is not enough
• Disaster recovery option
• Replicate data for failover and maintenance
• Warm standby: PostgreSQL 8.2 and later
• Hot standby: PostgreSQL 9.0

The Importance of Caching
Static Content
• Content Delivery Networks
• HTTP Reverse Proxy
• Web server
• Storage
• Operating System
Data Caching
• Reduces system load
• Databases
• Filesystem IO
• Tiered approach:
• In Application Execution
• In Application Server (APC/SHM)
• Distributed Across Network (memcached)
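The tiered approach can be sketched as a lookup that tries the fastest tier first and backfills faster tiers on a hit further down. Plain dicts stand in for the request-local cache, APC/SHM, and memcached here; the `tiered_get` function is hypothetical and ignores expiry and invalidation.

```python
def tiered_get(key, tiers, load):
    """Look up `key` in each tier, fastest first; fall back to `load`.

    `tiers` is a list of dict-like caches ordered fastest to slowest.
    `load` fetches the value from the system of record (e.g. the database).
    """
    for depth, tier in enumerate(tiers):
        if key in tier:
            value = tier[key]
            break
    else:
        # Missed every tier: go to the source and fill all tiers.
        depth = len(tiers)
        value = load(key)
    # Backfill every tier that is faster than where the value was found.
    for tier in tiers[:depth]:
        tier[key] = value
    return value


request_cache, server_cache, network_cache = {}, {}, {}
tiers = [request_cache, server_cache, network_cache]
tiered_get("user:1", tiers, lambda k: {"id": 1})  # loads, then fills all tiers
tiered_get("user:1", tiers, lambda k: {"id": 1})  # served from request_cache
```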

memcached at Scale
• TCP vs UDP
• Binary vs Ascii
• Many Packets per Second
• Client Implementation
• Inconsistent Hashing
Strategies
• Monitor Utilization
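One reading of the hashing bullet is that how a client maps keys to servers matters a great deal at this scale. The sketch below compares naive modulo placement with a consistent hash ring when one server is added to a pool of ten. All names are hypothetical, and this is not the algorithm of any particular memcached client.

```python
import bisect
import hashlib


def _hash(value):
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest()[:8], "big")


def modulo_server(key, servers):
    """Naive placement: most keys move when len(servers) changes."""
    return servers[_hash(key) % len(servers)]


class HashRing:
    """Consistent hashing: each server owns many points on a ring."""

    def __init__(self, servers, points_per_server=100):
        self._ring = sorted(
            (_hash(f"{server}#{i}"), server)
            for server in servers
            for i in range(points_per_server)
        )
        self._hashes = [h for h, _ in self._ring]

    def server_for(self, key):
        # First ring point clockwise from the key's hash, wrapping around.
        idx = bisect.bisect(self._hashes, _hash(key)) % len(self._ring)
        return self._ring[idx][1]


def moved_fraction(keys, before, after):
    """Share of keys whose server changes between two placement functions."""
    return sum(1 for k in keys if before(k) != after(k)) / len(keys)


old = [f"cache{i}" for i in range(10)]
new = old + ["cache10"]
keys = [f"user:{i}" for i in range(2000)]
print(moved_fraction(keys, lambda k: modulo_server(k, old),
                     lambda k: modulo_server(k, new)))
print(moved_fraction(keys, HashRing(old).server_for, HashRing(new).server_for))
```

The ring moves only the keys claimed by the new server, while modulo placement invalidates most of the cache at once. Clients in different languages also need to agree on the hash and placement scheme, or they will look for the same key on different servers.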
Appliances
• Still young market
• Rack dense cache
• Leveraging lower cost SSDs over RAM
• Kitchen sink in a box
• Replication
• Fancy UIs
Balance in Development Practices
Few web developers start by planning for scale.
Deep thinking?
• First to market
• Drive market share / grow business
• Engage users
• Meet the spec
• Get it done yesterday

Application Codebase History
• 2005-2007: Monolithic Code Base
• 2008: Expanded to use a Services Oriented Architecture
• Why SOA?
• Applications get own resources
• Loosely-Coupled architecture
• Selective Maintenance
• 2010: Improved process and performance

Avoid Unnecessary Disruptions
Avoid Reengineering
• Reengineering for reengineering’s sake is an unnecessary distraction
• Address engineering faults with a forward facing purpose
• Do not introduce new products or product redesigns at the same time as a wholesale change in application code
• Hard to find primary reason for shift in traffic patterns
• Problems impact users' impression of new products and user-facing changes
Decouple Processes
Why decouple code and processes?
• Faster page generation
• Distribution of CPU intensive tasks
• Scale consumer servers, not application front-end servers
• Throttle activity
• Tap data streams for other purposes
Message Processing
• Enqueue into ActiveMQ or RabbitMQ
• Elastic processing via the rejected consumer framework
• Targeted workloads
• Image uploads
• Comment and Message processing
• Email spooling
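The enqueue-then-consume pattern can be sketched with the standard library alone. Here an in-process `queue.Queue` stands in for ActiveMQ or RabbitMQ and worker threads stand in for consumer servers; `run_consumers` is a hypothetical helper and does not reflect the API of the rejected framework.

```python
import queue
import threading


def run_consumers(tasks, handler, workers=4):
    """Enqueue every task, process them on `workers` threads, return results."""
    q = queue.Queue()
    results = []
    lock = threading.Lock()

    def consume():
        while True:
            task = q.get()
            if task is None:  # sentinel: no more work for this consumer
                return
            out = handler(task)
            with lock:
                results.append(out)

    threads = [threading.Thread(target=consume) for _ in range(workers)]
    for t in threads:
        t.start()
    # The front end only enqueues; it never does the heavy work itself.
    for task in tasks:
        q.put(task)
    for _ in threads:
        q.put(None)  # one sentinel per consumer
    for t in threads:
        t.join()
    return results


# Example: "resize" uploaded images off the request path.
done = run_consumers(["a.jpg", "b.jpg", "c.jpg"], lambda name: f"resized {name}")
```

Raising or lowering `workers` changes throughput without touching the enqueueing side, which is the property that lets consumer capacity scale separately from the web tier. Results arrive in completion order, not submission order.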
Storage
User Generated Content
• Shared Storage
• Isilon IQ Series
• Scale out NAS
• FreeBSD Based Appliance
• NFS
Database Servers
• Direct Attached Storage
• Fastest single node disk implementation
• Cost Effective
• SAN and NAS
• Different performance focus
• More management required
• Expensive
Managing Vendors
Not All Vendors Are Equal
• Migrating CDN vendors resulted in a notable increase in page views and decrease in page load latency
• Hardware support by vendor differs greatly
• Communication is key
• Foster good relationships
• Extra effort on your part should yield extra effort on their part
Focus on Team Not Just Tech
Questions?
Follow me on Twitter: @Crad
Blog: http://gavinroy.com
