Building Scalable Web Apps with Google App Engine

Brett Slatkin June 14, 2008

Agenda

Using the Python runtime effectively
Numbers everyone should know
Tools for storing and scaling large data sets
Example: Distributed counters
Example: A blog

Prevent repeated, wasteful work

Prevent repeated, wasteful work
Loading Python modules on every request can be slow
Reuse main() to address this:

    def main():
        wsgiref.handlers.CGIHandler().run(my_app)

    if __name__ == "__main__":
        main()

Lazy-load big modules to reduce the "warm-up" cost:

    def my_expensive_operation():
        import big_module
        big_module.do_work()

Take advantage of "preloaded" modules

Prevent repeated, wasteful work 2
Avoid large result sets
  In-memory sorting and filtering can be slow
  Make the Datastore work for you
Avoid repeated queries
  Landing pages that use the same query for everyone
  Incoherent caching
Use memcache for a consistent view:

    results = memcache.get('main_results')
    if results is None:
        results = db.GqlQuery('...').fetch(10)
        memcache.add('main_results', results, 60)

Numbers everyone should know

Numbers everyone should know
Writes are expensive!
  The Datastore is transactional: writes require disk access
  Disk access means disk seeks
Rule of thumb: 10ms for a disk seek
Simple math: 1s / 10ms = 100 seeks/sec maximum
Depends on:
  The size and shape of your data
  Doing work in batches (batch puts and gets)
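To illustrate the batching point, here is a minimal sketch of batch puts and gets with the db API; the Vote kind is hypothetical and only for illustration:

    from google.appengine.ext import db

    class Vote(db.Model):                    # hypothetical kind, for illustration
        choice = db.StringProperty()

    # One batch put is a single round trip instead of 20 separate puts,
    # each of which would pay its own write overhead.
    votes = [Vote(choice='yes') for _ in range(20)]
    db.put(votes)                                  # batch write
    entities = db.get([v.key() for v in votes])    # batch read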

Numbers everyone should know 2
Reads are cheap!
  Reads do not need to be transactional, just consistent
  Data is read from disk once, then it's easily cached
  All subsequent reads come straight from memory
Rule of thumb: 250 usec for 1MB of data from memory
Simple math: 1s / 250 usec = 4GB/sec maximum
  For a 1MB entity, that's 4,000 fetches/sec

Tools for storing data

Tools for storing data: Entities
Fundamental storage type in App Engine
Schemaless
  Set of property name/value pairs
Most properties indexed and efficient to query
  Other large properties not indexed (Blobs, Text)
Think of it as an object store, not relational
  Kinds are like classes
  Entities are like object instances
Relationships between Entities using Keys
  Reference properties
  One to many, many to many
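A minimal sketch of kinds as classes with a reference property between them; the Author and Post names are illustrative, not from the talk:

    from google.appengine.ext import db

    class Author(db.Model):                       # a Kind, like a class
        name = db.StringProperty(required=True)

    class Post(db.Model):                         # Entities are instances of this Kind
        title = db.StringProperty(required=True)
        body = db.TextProperty()                  # Text: large, not indexed
        author = db.ReferenceProperty(Author)     # one Author, many Posts

    # The "many" side of the relationship is just a query on the key:
    #   Post.all().filter('author =', some_author).fetch(10)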

Tools for storing data: Keys
Key corresponds to the Bigtable row for an Entity
  Bigtable accessible as a distributed hashtable
Get() by Key: Very fast!
  No scanning, just copying data
Limitations:
  Only one ID or key_name per Entity
  Cannot change ID or key_name later
  500 bytes maximum for a key_name
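A minimal sketch of a Get() by Key; the Greeting kind and its key_name are hypothetical:

    from google.appengine.ext import db

    class Greeting(db.Model):                     # hypothetical kind
        message = db.StringProperty()

    Greeting(key_name='hello', message='hi').put()

    # Get() by Key: a direct row lookup, no scanning
    greeting = Greeting.get_by_key_name('hello')
    same = db.get(db.Key.from_path('Greeting', 'hello'))   # equivalent, via an explicit Key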

Tools for storing data: Transactions
ACID transactions
  Atomicity, Consistency, Isolation, Durability
No queries in transactions
  Transactional read and write with Get() and Put()
Common practice
  Query, find what you need
  Transact with Get() and Put()
How to provide a consistent view in queries?
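A sketch of the query-then-transact pattern under assumed names (the Account kind and the balance update are illustrative):

    from google.appengine.ext import db

    class Account(db.Model):                      # hypothetical kind
        email = db.StringProperty(required=True)
        balance = db.IntegerProperty(default=0)

    # 1) Query (outside any transaction) to find the entity you need
    account = Account.gql('WHERE email = :1', 'user@example.com').get()

    # 2) Transact with Get() and Put(): re-read by key inside the
    #    transaction so the update is atomic
    def txn(key):
        acct = db.get(key)
        acct.balance += 10
        acct.put()

    db.run_in_transaction(txn, account.key())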

Tools for storing data: Entity groups
Closely related Entities can form an Entity group
  Stored logically/physically close to each other
Define your transactionality
  RDBMS: Row and table locking
  Datastore: Transactions across a single Entity group
"Locking" one Entity in a group locks them all
  Serialized writes to the whole group (in transactions)
  Not a traditional lock; writers attempt to complete in parallel

Tools for storing data: Entity groups 2
Hierarchical
  Each Entity may have a parent
  A "root" node defines an Entity group
  Hierarchy of child Entities can go many levels deep
Watch out! Serialized writes for all children of the root
Datastore scales wide
  Each Entity group has serialized writes
  No limit to the number of Entity groups to use in parallel
  Think of it as many independent hierarchies of data
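A minimal sketch of forming an Entity group with parent=; the Album and Photo kinds are assumptions for illustration:

    from google.appengine.ext import db

    class Album(db.Model):                        # root entity of the group
        title = db.StringProperty()

    class Photo(db.Model):                        # child in the same group
        caption = db.StringProperty()

    album = Album(key_name='vacation', title='Vacation')
    album.put()

    # parent= places the Photo in the Album's entity group, so both can be
    # updated in one transaction; writes to the whole group are serialized.
    Photo(parent=album, caption='Beach').put()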

Tools for storing data: Entity groups 3

Entity groups all transacting in parallel:

[Diagram: four independent Entity groups, each a Root with a Child entity, committing Txn 1 through Txn 4 in parallel]

Tools for storing data: Entity groups 4
Pitfalls
  Large Entity groups = high contention = failed transactions
  Not thinking about write throughput is bad
  Structure your data to match your usage patterns
Good news
  Query across entity groups without serialized access!
  Consistent view across all entity groups
    No partial commits visible
    All Entities in a group are the latest committed version

Example: Counters

Counters

Using Model.count()
  Bigtable doesn't know counts by design
  O(N); cannot be O(1); must scan every Entity row!
Use an Entity with a count property:

    class Counter(db.Model):
        count = db.IntegerProperty()

Frequent updates = high contention!
  Transactional writes are serialized and too slow
  Fundamental limitation of distributed systems

Counters: Before and after

[Diagram: a single Counter entity on the left vs. the same count sharded across multiple Counter entities on the right]

Counters: Sharded

Shard counters into multiple Entity groups
  Pick an Entity at random and update it transactionally
  Combine sharded Entities together on reads
Contention reduced to roughly 1/N with N shards
Sharding factor can be changed with little difficulty

Counters: Models

    class CounterConfig(db.Model):
        name = db.StringProperty(required=True)
        num_shards = db.IntegerProperty(required=True, default=1)

    class Counter(db.Model):
        name = db.StringProperty(required=True)
        count = db.IntegerProperty(required=True, default=0)

Counters: Get the count

    def get_count(name):
        total = 0
        for counter in Counter.gql('WHERE name = :1', name):
            total += counter.count
        return total

Counters: Increment the count

    def increment(name):
        config = CounterConfig.get_or_insert(name, name=name)
        def txn():
            index = random.randint(0, config.num_shards - 1)
            shard_name = name + str(index)
            counter = Counter.get_by_key_name(shard_name)
            if counter is None:
                counter = Counter(key_name=shard_name, name=name)
            counter.count += 1
            counter.put()
        db.run_in_transaction(txn)

Counters: Cache reads

    def get_count(name):
        total = memcache.get(name)
        if total is None:
            total = 0
            for counter in Counter.gql('WHERE name = :1', name):
                total += counter.count
            memcache.add(name, str(total), 60)
        return total

Counters: Cache writes

    def increment(name):
        config = CounterConfig.get_or_insert(name, name=name)
        def txn():
            index = random.randint(0, config.num_shards - 1)
            shard_name = name + str(index)
            counter = Counter.get_by_key_name(shard_name)
            if counter is None:
                counter = Counter(key_name=shard_name, name=name)
            counter.count += 1
            counter.put()
        db.run_in_transaction(txn)
        memcache.incr(name)

Example: Building a Blog

Building a Blog

Standard blog
  Multiple blog posts
  Each post has comments
Efficient paging without using queries with offsets
  Remember, Bigtable doesn't know counts!

Building a Blog: Blog entries

Blog entries with an index
  Having an index establishes a rigid ordering
  Index enables efficient paging
This is a global counter, but it's okay
  Low write throughput of overall posts = no contention

Building a Blog: Models

    class BlogIndex(db.Model):
        max_index = db.IntegerProperty(required=True, default=0)

    class BlogEntry(db.Model):
        index = db.IntegerProperty(required=True)
        title = db.StringProperty(required=True)
        body = db.TextProperty(required=True)

Building a Blog: Posting an entry

    def post_entry(blogname, title, body):
        def txn():
            blog_index = BlogIndex.get_by_key_name(blogname)
            if blog_index is None:
                blog_index = BlogIndex(key_name=blogname)
            new_index = blog_index.max_index
            blog_index.max_index += 1
            blog_index.put()
            new_entry = BlogEntry(
                key_name=blogname + str(new_index),
                parent=blog_index,
                index=new_index,
                title=title,
                body=body)
            new_entry.put()
        db.run_in_transaction(txn)

Building a Blog: Posting an entry 2

Hierarchy of Entities:

[Diagram: a BlogIndex root entity with its BlogEntry children in one entity group]

Building a Blog: Getting one entry

    def get_entry(blogname, index):
        entry = BlogEntry.get_by_key_name(
            blogname + str(index),
            parent=db.Key.from_path('BlogIndex', blogname))
        return entry

That's it! Super fast!

Building a Blog: Paging
    def get_entries(start_index):
        extra = None
        if start_index is None:
            entries = BlogEntry.gql(
                'ORDER BY index DESC').fetch(POSTS_PER_PAGE + 1)
        else:
            start_index = int(start_index)
            entries = BlogEntry.gql(
                'WHERE index <= :1 ORDER BY index DESC',
                start_index).fetch(POSTS_PER_PAGE + 1)
        if len(entries) > POSTS_PER_PAGE:
            extra = entries[-1]
            entries = entries[:POSTS_PER_PAGE]
        return entries, extra
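As a possible usage sketch (this handler is an assumption, not from the talk), the extra entry's index becomes the start_index of the link to the next page:

    from google.appengine.ext import webapp

    class BlogPage(webapp.RequestHandler):        # hypothetical handler
        def get(self):
            entries, extra = get_entries(self.request.get('start_index') or None)
            next_link = None
            if extra is not None:
                # The extra entry's index is where the next page starts
                next_link = '/?start_index=%d' % extra.index
            # ... render entries, plus an "older posts" link when next_link is set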

Building a Blog: Comments

High write-throughput
  Can't use a shared index
Would like to order by post date
  Post dates aren't unique, so we can't use them to page:
    2008-05-26 22:11:04.1000   Before
    2008-05-26 22:11:04.1234   My post
    2008-05-26 22:11:04.1234   This is another post
    2008-05-26 22:11:04.1234   And one more post
    2008-05-26 22:11:04.1234   The last post
    2008-05-26 22:11:04.2000   After

Building a Blog: Composite properties

Make our own composite string property: "post time | user ID | comment ID"
Use a shared index for each user's comment ID
  Each index is in a separate Entity group
Guaranteed a unique ordering, querying across entity groups:
    2008-05-26 22:11:04.1000|brett|3   Before
    2008-05-26 22:11:04.1234|jon|3     My post
    2008-05-26 22:11:04.1234|jon|4     This is another post
    2008-05-26 22:11:04.1234|ryan|4    And one more post
    2008-05-26 22:11:04.1234|ryan|5    The last post
    2008-05-26 22:11:04.2000|ryan|2    After
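A rough sketch of how such a composite ordering property might be built; the UserCommentIndex and Comment models and the post_comment helper are assumptions based on the description above, not code from the talk:

    import datetime
    from google.appengine.ext import db

    class UserCommentIndex(db.Model):     # one per user: its own Entity group
        next_comment_id = db.IntegerProperty(required=True, default=0)

    class Comment(db.Model):
        body = db.TextProperty(required=True)
        # "post time | user ID | comment ID": sorts by time, with user ID and
        # the per-user comment ID breaking ties so the ordering is unique
        sort_key = db.StringProperty(required=True)

    def post_comment(user_id, body):
        def txn():
            index = UserCommentIndex.get_by_key_name(user_id)
            if index is None:
                index = UserCommentIndex(key_name=user_id)
            comment_id = index.next_comment_id
            index.next_comment_id += 1
            index.put()
            return comment_id
        comment_id = db.run_in_transaction(txn)
        now = datetime.datetime.utcnow().isoformat()
        Comment(body=body,
                sort_key='%s|%s|%d' % (now, user_id, comment_id)).put()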

Building a Blog: Composite properties 2

High throughput because of parallelism

[Diagram: several User Index entity groups, each writing its own Comment entities in parallel]

What to remember

What to remember

Minimize Python runtime overhead
Minimize waste
  Why Query when you can Get?
Structure your data to match your load
  Optimize for low write contention
  Think about Entity groups
Memcache is awesome -- use it!

Learn more: code.google.com