Scaling myYearbook.com

Lessons Learned From Rapid Growth
Gavin M. Roy
Chief Technology Officer, myYearbook.com
Surge 2010

About myYearbook.com
• Founded in 2005
• 2007 - 100M Dynamic HTTP Requests per Month
• 2010 - 2.5B Dynamic HTTP Requests per Month
• Top 5 Social Network in the United States as measured by Hitwise
• Top 25 trafficked site in the United States as measured by ComScore
• 99% Uptime

myYearbook.com

Friend Discovery

Q1 2007
• All Managed Servers
• 1 PostgreSQL server
• 1 Web application server (Apache/PHP)
• Multiple static content servers
• 1 Phone call a night during peak due to outages

Q4 2010
• 2 Data centers
• 16 Message Brokers (ActiveMQ/RabbitMQ)
• 35 PostgreSQL Servers
• 100 memcached Servers w/ 1.2 TB of active cache
• ~100 Other servers (Consumers, R&D, Email, Monitoring, etc.)
• ~400 Web Application Servers

Key Architecture Components
• Languages: PHP, Python
• Webservers:
  • Apache HTTPD
  • Cherokee
  • Lighttpd
• Squid
• F5 Networks BigIP
• PostgreSQL
• Memcached
• Message Brokers:
  • ActiveMQ
  • RabbitMQ
• Tornado
• Isilon NAS
• Message Systems Momentum
• Subversion, Git

Growth is a Double-Edged Sword

Internet Startup Growth Cycle
1. Prototype
2. Launch
3. Re-Engineer (Fix problems)
4. Add new functionality
5. Repeat Steps 3 and 4

Internet Startup Growth Cycle
• Steps 1 & 2
  • Limited Budget
  • Limited Time & Resources
• Steps 3 & 4
  • Increased Budget
  • Limited Time & Resources

“The best laid schemes o’ mice an’ men gang aft agley” - Robert Burns, To a Mouse

Instrument Early and Deep

Analytics
• Know your systems
• Know your key application metrics
• 238 Categories
  • Applications, Services, etc.
• 5,558 Items
  • 1 to ~25 Datapoints

Gathering Analytics
• Staplr
• Posuta
• Nagios
• Cacti
• 3rd Party External
  • Traffic
  • Availability and Performance

Scaling Databases

Plan for Growth
• Hardware
  • CPU Horsepower based upon need
  • Disk based upon need
  • RAM based upon budget
  • Get 2

General DB Scaling
• Scale Up
  • Bigger, Faster Hardware
  • Better, Faster Software
• Scale Out
  • Sharding
  • Service Specific

PostgreSQL Scaling
• Connection pooling
  • pgBouncer, pgPool, language-specific pooling
• Horizontal via PL/Proxy
• Read-only nodes
  • Londiste, Slony, Bucardo
  • PostgreSQL 9.0
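
Horizontal scaling through a proxy layer comes down to mapping a key to a shard. The following is a minimal sketch of that idea, not PL/Proxy itself: it assumes a power-of-two shard count and masks the hash, and `zlib.crc32` stands in for PostgreSQL's own hash function. The name `shard_for` is hypothetical.

```python
import zlib


def shard_for(key: str, shard_count: int) -> int:
    """Return the shard number for a key (illustrative stand-in for proxy routing)."""
    # A power-of-two count lets us mask the hash instead of taking a modulo.
    if shard_count <= 0 or shard_count & (shard_count - 1):
        raise ValueError("shard_count must be a power of two")
    # crc32 is an assumption here; a real deployment would use the database's hash.
    return zlib.crc32(key.encode("utf-8")) & (shard_count - 1)


if __name__ == "__main__":
    for username in ("alice", "bob", "carol"):
        print(username, "-> shard", shard_for(username, 4))
```

One consequence of masking: when the shard count doubles from 4 to 8, a key in shard i either stays in shard i or moves to shard i + 4, so each old shard splits cleanly in two.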

Table Partitioning
• Supported in PostgreSQL as of 8.1
• Excellent method for maintaining data
• Allows for removal of aged data without bloat
• Focused SELECTs while allowing ad-hoc SELECT across all partitions
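
As a rough illustration of the points above, this sketch generates DDL for monthly partitions using table inheritance with CHECK constraints, the mechanism available in PostgreSQL 8.1. The table name `page_views`, the column `viewed_on`, and all helper names are hypothetical. Dropping an aged partition is a single DROP TABLE, which is why no bloat is left behind.

```python
from datetime import date


def next_month(d: date) -> date:
    """First day of the month after d."""
    return date(d.year + (d.month == 12), d.month % 12 + 1, 1)


def partition_name(parent: str, d: date) -> str:
    return f"{parent}_{d.year:04d}_{d.month:02d}"


def create_partition_sql(parent: str, column: str, d: date) -> str:
    """DDL for a child table holding one calendar month of rows."""
    start = d.replace(day=1)
    end = next_month(start)
    # The CHECK constraint lets the planner skip this child when a query's
    # WHERE clause excludes the month (constraint exclusion).
    return (
        f"CREATE TABLE {partition_name(parent, start)} (\n"
        f"    CHECK ({column} >= DATE '{start}' AND {column} < DATE '{end}')\n"
        f") INHERITS ({parent});"
    )


def drop_partition_sql(parent: str, d: date) -> str:
    """Removing aged data is one DROP TABLE rather than a bulk DELETE."""
    return f"DROP TABLE {partition_name(parent, d)};"


if __name__ == "__main__":
    print(create_partition_sql("page_views", "viewed_on", date(2010, 12, 15)))
    print(drop_partition_sql("page_views", date(2010, 1, 1)))
```

A SELECT against the parent table still sees every partition, while a SELECT against one child stays focused on that month.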

Internals Monitoring
• Sophisticated system catalog
• Beyond configuration data
  • Statistics
  • Index utilization
  • Cache hit/miss data
  • Lock data
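
The statistics listed above can be turned into simple ratios. This sketch assumes the counters come from the `pg_stat_database` and `pg_stat_user_tables` views (lock data lives in `pg_locks` and is not covered here). The function names are hypothetical, and the queries are shown only as strings so the example runs without a database.

```python
# Queries a monitoring job might run; the column names are those of the
# PostgreSQL statistics views.
CACHE_SQL = "SELECT datname, blks_hit, blks_read FROM pg_stat_database"
INDEX_SQL = "SELECT relname, idx_scan, seq_scan FROM pg_stat_user_tables"


def _ratio(part: int, rest: int) -> float:
    total = part + rest
    return part / total if total else 0.0


def cache_hit_ratio(blks_hit: int, blks_read: int) -> float:
    """Share of block requests served from PostgreSQL's buffer cache."""
    return _ratio(blks_hit, blks_read)


def index_usage(idx_scan, seq_scan) -> float:
    """Share of scans that used an index; None (NULL) counters count as zero."""
    return _ratio(idx_scan or 0, seq_scan or 0)


if __name__ == "__main__":
    print(f"cache hit ratio: {cache_hit_ratio(990, 10):.2%}")
    print(f"index usage:     {index_usage(75, 25):.2%}")
```

Tracked over time, a falling cache hit ratio or a table dominated by sequential scans is an early sign that memory or indexing needs attention.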

“Anything that can possibly go wrong, does.” - John Sack

Recovering from Server Failures
• Daily backup is not enough
• Disaster recovery option
• Replicate data for failover and maintenance
  • Warm standby: PostgreSQL 8.2 and later
  • Hot standby: PostgreSQL 9.0

The Importance of Caching

Static Content
• Content Delivery Networks
• HTTP Reverse Proxy
• Web server
• Storage
• Operating System

Data Caching
• Reduces system load
  • Databases
  • Filesystem IO
• Tiered approach:
  • In Application Execution
  • In Application Server (APC/SHM)
  • Distributed Across Network (memcached)
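
The tiered approach above can be sketched as a lookup that walks the tiers from fastest to slowest and backfills the faster ones on a hit. This is a minimal illustration under stated assumptions: plain dicts stand in for the per-request, per-server (APC/SHM), and memcached tiers, expiry is ignored, and `TieredCache` is a hypothetical name.

```python
from typing import Any, Callable, List, MutableMapping


class TieredCache:
    """Check each tier in order; fall back to the loader on a full miss."""

    def __init__(self, tiers: List[MutableMapping[str, Any]]):
        self.tiers = tiers  # ordered fastest first

    def get(self, key: str, loader: Callable[[str], Any]) -> Any:
        for depth, tier in enumerate(self.tiers):
            if key in tier:
                value = tier[key]
                break
        else:
            # Missed every tier: hit the database (or other origin) once.
            value = loader(key)
            depth = len(self.tiers)
        # Backfill every tier that sits in front of where the value was found.
        for tier in self.tiers[:depth]:
            tier[key] = value
        return value


if __name__ == "__main__":
    request_local, server_local, distributed = {}, {}, {}
    cache = TieredCache([request_local, server_local, distributed])
    print(cache.get("user:1", lambda k: f"row for {k}"))  # loads from origin
    request_local.clear()                                 # a new request starts
    print(cache.get("user:1", lambda k: "not called"))    # served from a tier
```

The second lookup never reaches the loader, which is the load reduction on databases and filesystem IO that the slide describes.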

memcached at Scale
• TCP vs UDP
• Binary vs ASCII
• Many Packets per Second
• Client Implementation
  • Inconsistent Hashing Strategies
• Monitor Utilization
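
One reading of the hashing bullet is that different client implementations can map the same key to different servers. The sketch below compares two common strategies, modulo hashing and a consistent-hash ring, and measures how many keys move when a server is added. Server names, the MD5-based hash, and 100 ring points per server are all assumptions for illustration, not the behavior of any particular memcached client.

```python
import bisect
import hashlib
from typing import Callable, List


def stable_hash(value: str) -> int:
    """Deterministic 64-bit hash (Python's built-in hash() is salted per run)."""
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")


def modulo_server(key: str, servers: List[str]) -> str:
    return servers[stable_hash(key) % len(servers)]


class HashRing:
    """Consistent hashing: each server owns many points on a ring."""

    def __init__(self, servers: List[str], points_per_server: int = 100):
        self._ring = sorted(
            (stable_hash(f"{server}#{i}"), server)
            for server in servers
            for i in range(points_per_server)
        )
        self._hashes = [h for h, _ in self._ring]

    def server_for(self, key: str) -> str:
        # First ring point clockwise from the key's hash, wrapping at the end.
        index = bisect.bisect(self._hashes, stable_hash(key)) % len(self._ring)
        return self._ring[index][1]


def moved_fraction(keys: List[str], before: Callable, after: Callable) -> float:
    return sum(1 for k in keys if before(k) != after(k)) / len(keys)


if __name__ == "__main__":
    keys = [f"user:{i}" for i in range(2000)]
    old = ["cache1", "cache2", "cache3", "cache4"]
    new = old + ["cache5"]
    old_ring, new_ring = HashRing(old), HashRing(new)
    print("modulo moved:", moved_fraction(
        keys, lambda k: modulo_server(k, old), lambda k: modulo_server(k, new)))
    print("ring moved:  ", moved_fraction(keys, old_ring.server_for, new_ring.server_for))
```

Two clients that disagree on the strategy will read and write different servers for the same key, which shows up as unexplained cache misses, so every client touching a pool needs the same hashing configuration.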

Appliances
• Still young market
• Rack dense cache
• Leveraging lower cost SSDs over RAM
• Kitchen sink in a box
  • Replication
  • Fancy UIs

Balance in Development Practices

Few web developers start by planning for scale.

Deep thinking?
• First to market
• Drive market share / grow business
• Engage users
• Meet the spec
• Get it done yesterday

Application Codebase History
• 2005-2007: Monolithic Code Base
• 2008: Expanded to use a Services Oriented Architecture
  • Why SOA?
    • Applications get own resources
    • Loosely-Coupled architecture
    • Selective Maintenance
• 2010: Improved process and performance

Avoid Unnecessary Disruptions

Avoid Reengineering
• Reengineering for reengineering’s sake is an unnecessary distraction
• Address engineering faults with a forward-facing purpose
• Do not introduce new products or product redesigns at the same time as a wholesale change in application code
  • Hard to find primary reason for shift in traffic patterns
  • Problems impact users’ impression of new products and user-facing changes

Decouple Processes

Why decouple code and processes?
• Faster page generation
• Distribution of CPU intensive tasks
• Scale consumer servers, not application front-end servers
• Throttle activity
• Tap data streams for other purposes

Message Processing
• Enqueue into ActiveMQ or RabbitMQ
• Elastic processing via the "rejected" consumer framework
• Targeted workloads
  • Image uploads
  • Comment and Message processing
  • Email spooling
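
The enqueue-and-consume pattern above can be sketched in-process. This is only an analogy: Python's standard `queue.Queue` stands in for ActiveMQ or RabbitMQ, threads stand in for consumer servers, and the helper names are hypothetical. The bounded queue shows the throttling idea, since producers block once the backlog reaches `maxsize`.

```python
import queue
import threading
from typing import Callable, List


def start_consumers(work_queue: queue.Queue, handler: Callable, count: int) -> List[threading.Thread]:
    """Start `count` workers; scaling consumers does not touch the front end."""

    def run() -> None:
        while True:
            message = work_queue.get()
            try:
                if message is None:  # sentinel: shut this worker down
                    return
                handler(message)
            finally:
                work_queue.task_done()

    threads = [threading.Thread(target=run, daemon=True) for _ in range(count)]
    for thread in threads:
        thread.start()
    return threads


def stop_consumers(work_queue: queue.Queue, threads: List[threading.Thread]) -> None:
    for _ in threads:
        work_queue.put(None)  # one sentinel per worker
    for thread in threads:
        thread.join()


if __name__ == "__main__":
    work = queue.Queue(maxsize=10)  # bounded backlog throttles producers
    workers = start_consumers(work, lambda m: print("processed", m), count=3)
    for upload_id in range(5):
        # The page request only enqueues; the heavy work happens elsewhere.
        work.put({"type": "image_upload", "id": upload_id})
    work.join()
    stop_consumers(work, workers)
```

The web request returns as soon as `put` succeeds, which is where the faster page generation comes from; the number of consumers can then be raised or lowered independently of the front-end servers.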

Storage

User Generated Content
• Shared Storage
  • Isilon IQ Series
  • Scale out NAS
  • FreeBSD Based Appliance
  • NFS

Database Servers
• Direct Attached Storage
  • Fastest single node disk implementation
  • Cost Effective
• SAN and NAS
  • Different performance focus
  • More management required
  • Expensive

Managing Vendors

Not All Vendors Are Equal
• Migrating CDN vendors resulted in a notable increase in page views and decrease in page load latency
• Hardware support by vendor differs greatly
• Communication is key
  • Foster good relationships
  • Extra effort on your part should yield extra effort on their part

Focus on Team

Not Just Tech

Questions?
Follow me on Twitter: @Crad
Blog: http://gavinroy.com