
MongoDB

Schema Design & Scaling


Alvin Richards
Technical Director, EMEA
alvin@10gen.com | @jonnyeight

Agenda
18:00 - 18:15 : Why MongoDB
18:15 - 18:45 : Schema Design
18:45 - 19:00 : Break
19:00 - 19:45 : Scaling
19:45 - 20:00 : Q & A
20:00 - 22:00 : After Party!

Part 1 - Why MongoDB?

10gen - Company Profile


Company behind MongoDB
- aGPL license, owns copyrights, engineering team
- Support, consulting, training, license revenue

Funding
- $73.5 million total funding
- Sequoia, Union Square, Flybridge, NEA

Management team
- Google/DoubleClick, Oracle, Apple, NetApp

NYC, Palo Alto, London, Dublin & Sydney; 110+ employees

Today's challenges

Current technology stack adds significant complexity:
custom caching, vertical scaling, sharding

Current technology stack reduces productivity:
denormalizing data, removing joins, removing transactions

Why we exist

More than 500 customers worldwide

Use cases: Archiving, Complex Data, Flexible Data, Content and Document Management, MultiMedia, eCommerce, Finance, Gaming, Infrastructure, Operational Datastore for Web Infrastructure, Real-time Analytics, Media, Mobile

NoSQL Market Leadership

Part 2 - Schema Design

Topics
Schema design is easy!
Data as Objects in code
Common patterns: Single Table Inheritance, One-to-Many & Many-to-Many, Buckets, Trees, Queues, Inventory

Terminology
RDBMS         -> MongoDB
Table         -> Collection
Row(s)        -> JSON Document
Index         -> Index
Join          -> Embedding & Linking
Partition     -> Shard
Partition Key -> Shard Key

Schema Design Relational Database

Schema Design MongoDB

Schema Design MongoDB

embedding

Schema Design MongoDB

embedding

linking

So today's example will use...

Design Session
Design documents that simply map to your application
> post = {author: "Hergé",
          date: ISODate("2011-09-18T09:56:06.298Z"),
          text: "Destination Moon",
          tags: ["comic", "adventure"]}
> db.posts.insert(post)

Find the document


> db.posts.find()
{ _id: ObjectId("4c4ba5c0672c685e5e8aabf3"),
  author: "Hergé",
  date: ISODate("2011-09-18T09:56:06.298Z"),
  text: "Destination Moon",
  tags: [ "comic", "adventure" ] }

Notes:
- _id must be unique, but can be anything you'd like
- MongoDB will generate a default _id if one is not supplied

Add an index, find via index


Secondary index for author
// 1 means ascending, -1 means descending
> db.posts.ensureIndex({author: 1})
> db.posts.find({author: 'Hergé'})
{ _id: ObjectId("4c4ba5c0672c685e5e8aabf3"),
  date: ISODate("2011-09-18T09:56:06.298Z"),
  author: "Hergé",
  ... }

Query operators
Conditional operators:
$lt, $lte, $gt, $gte, $ne, $in, $nin, $mod, $all, $size, $exists, $type, ...

// find posts with any tags
> db.posts.find({tags: {$exists: true}})


Regular expressions:

// posts where author starts with h
> db.posts.find({author: /^h/i})


Counting:

// number of posts written by Hergé
> db.posts.find({author: "Hergé"}).count()

Extending the schema

http://nysi.org.uk/kids_stuff/rocket/rocket.htm

Extending the Schema



new_comment = {author: "Kyle",
               date: new Date(),
               text: "great book"}

> db.posts.update(
    {text: "Destination Moon"},
    {"$push": {comments: new_comment},
     "$inc": {comments_count: 1}})

Extending the Schema


> db.posts.find({_id: ObjectId("4c4ba5c0672c685e5e8aabf3")})
{ _id: ObjectId("4c4ba5c0672c685e5e8aabf3"),
  author: "Hergé",
  date: ISODate("2011-09-18T09:56:06.298Z"),
  text: "Destination Moon",
  tags: [ "comic", "adventure" ],
  comments: [
    { author: "Kyle",
      date: ISODate("2011-09-19T09:56:06.298Z"),
      text: "great book" } ],
  comments_count: 1 }

Extending the Schema


// create index on nested documents:
> db.posts.ensureIndex({"comments.author": 1})
> db.posts.find({"comments.author": "Kyle"})
// find last 5 posts:
> db.posts.find().sort({date: -1}).limit(5)
// most commented post:
> db.posts.find().sort({comments_count: -1}).limit(1)

When sorting, check if you need an index

Use MongoDB with your language


10gen Supported Drivers:
Ruby, Python, Perl, PHP, JavaScript (node.js), Java, C/C++, C#, Scala, Erlang, Haskell

Object Data Mappers:
Morphia (Java); Mongoid, MongoMapper (Ruby); MongoEngine (Python)

Community Drivers:
F#, Smalltalk, Clojure, Go, Groovy, Delphi, Lua, PowerShell, R

Using your schema - example Java Driver


// Get a connection to the collection
DBCollection coll = new Mongo().getDB("blog").getCollection("posts");
// Create the object
BasicDBObject obj = new BasicDBObject();
obj.put("author", "Hergé");
obj.put("text", "Destination Moon");
obj.put("date", new Date());
// Insert the object into MongoDB
coll.insert(obj);

Using your schema - example Morphia mapper


// Use Morphia annotations
@Entity
class Post {
  @Id String author;
  @Indexed Date date;
  String text;
}

Using your schema - example Morphia


// Create the datastore
Datastore ds = new Morphia().createDatastore();
// Create the object
Post entry = new Post("Hergé", new Date(), "Destination Moon");
// Insert the object into MongoDB
ds.save(entry);

Common Patterns

Common Patterns

http://www.flickr.com/photos/colinwarren/158628063

Inheritance

http://www.flickr.com/photos/dysonstarr/5098228295

Inheritance

Single Table Inheritance RDBMS


shapes table:

id | type   | area | radius | length | width
1  | circle | 3.14 | 1      |        |
2  | square | 4    |        | 2      |
3  | rect   | 10   |        | 5      | 2

Single Table Inheritance MongoDB


> db.shapes.find()
{ _id: "1", type: "circle", area: 3.14, radius: 1 }
{ _id: "2", type: "square", area: 4, length: 2 }
{ _id: "3", type: "rect", area: 10, length: 5, width: 2 }

missing values are not stored!

Single Table Inheritance MongoDB


> db.shapes.find()
{ _id: "1", type: "circle", area: 3.14, radius: 1 }
{ _id: "2", type: "square", area: 4, length: 2 }
{ _id: "3", type: "rect", area: 10, length: 5, width: 2 }

// find shapes where radius > 0
> db.shapes.find({radius: {$gt: 0}})

Single Table Inheritance MongoDB


> db.shapes.find()
{ _id: "1", type: "circle", area: 3.14, radius: 1 }
{ _id: "2", type: "square", area: 4, length: 2 }
{ _id: "3", type: "rect", area: 10, length: 5, width: 2 }

// find shapes where radius > 0
> db.shapes.find({radius: {$gt: 0}})
// create a sparse index
> db.shapes.ensureIndex({radius: 1}, {sparse: true})

the index only contains documents where the value is present!

One to Many

http://www.flickr.com/photos/j-sh/6502708899/

One to Many
One-to-Many relationships can specify the degree of association between objects: containment, life-cycle

One to Many - Embedding


- Embedded array
- $slice operator to return a subset of comments
- Some queries are harder, e.g. find the latest comments across all posts

posts:
{ author: "Hergé",
  date: ISODate("2011-09-18T09:56:06.298Z"),
  comments: [
    { author: "Kyle",
      date: ISODate("2011-09-21T09:52:06.298Z"),
      text: "great book" } ] }

One to Many - Linking


- Normalized (2 collections)
- Most flexible
- More queries

posts:
{ _id: 1000,
  author: "Hergé",
  date: ISODate("2011-09-18T09:56:06.298Z"),
  comments: [ { comment: 1 } ] }

comments:
{ _id: 1,
  blog: 1000,
  author: "Kyle",
  date: ISODate("2011-09-21T09:52:06.298Z") }

> blog = db.posts.findOne({text: "Destination Moon"})
> db.comments.find({blog: blog._id})

Linking versus Embedding


Embedding:
- 1 seek to load the entire object
- 1 roundtrip to the database
- Read cost relative to object size
- Write cost relative to object size

Linking:
- 1 seek to read the master
- 1 seek to read each detail
- 2 roundtrips to the database
- Reads longer but consistent
- Writes longer but consistent

Many to Many

http://www.flickr.com/photos/pats0n/6013379192

Many - Many
Example:
- A product can be in many categories
- A category can have many products

Many - Many
products:
{ _id: 10, name: "Destination Moon", category_ids: [ 20, 30 ] }

categories:
{ _id: 20, name: "comic", product_ids: [ 10, 11, 12 ] }
{ _id: 30, name: "movie", product_ids: [ 10 ] }

Many - Many
products:
{ _id: 10, name: "Destination Moon", category_ids: [ 20, 30 ] }

categories:
{ _id: 20, name: "comic", product_ids: [ 10, 11, 12 ] }
{ _id: 30, name: "movie", product_ids: [ 10 ] }

// All categories for a given product
> db.categories.find({product_ids: 10})

Alternative
products: { _id: 10, name: "Destination Moon", category_ids: [ 20, 30 ] }


categories: { _id: 20, name: "comic"}

Alternative
products: { _id: 10, name: "Destination Moon", category_ids: [ 20, 30 ] }


categories: { _id: 20, name: "comic" }

// All products for a given category
> db.products.find({category_ids: 20})
// All categories for a given product
> product = db.products.findOne({_id: some_id})
> db.categories.find({_id: {$in: product.category_ids}})

Trees

http://www.ickr.com/photos/cubagallery/5949819558

Trees
Hierarchical information

Trees
Full Tree in Document
{ comments: [
    { author: "Kyle", text: ...,
      replies: [
        { author: "James", text: ..., replies: [] } ] } ] }

Pros: single document, performance, intuitive
Cons: hard to search, partial results, 16 MB document limit

Array of Ancestors

Tree: A has children B and E; B has children C and D; E has child F

- Store all ancestors of a node:
{ _id: "a" }
{ _id: "b", thread: [ "a" ], replyTo: "a" }
{ _id: "c", thread: [ "a", "b" ], replyTo: "b" }
{ _id: "d", thread: [ "a", "b" ], replyTo: "b" }
{ _id: "e", thread: [ "a" ], replyTo: "a" }
{ _id: "f", thread: [ "a", "e" ], replyTo: "e" }

Array of Ancestors

Tree: A has children B and E; B has children C and D; E has child F

- Store all ancestors of a node:
{ _id: "a" }
{ _id: "b", thread: [ "a" ], replyTo: "a" }
{ _id: "c", thread: [ "a", "b" ], replyTo: "b" }
{ _id: "d", thread: [ "a", "b" ], replyTo: "b" }
{ _id: "e", thread: [ "a" ], replyTo: "a" }
{ _id: "f", thread: [ "a", "e" ], replyTo: "e" }

// find all threads "b" is in
> db.posts.find({thread: "b"})
// find replies to "e"
> db.posts.find({replyTo: "e"})
// find history of "f"
> threads = db.posts.findOne({_id: "f"}).thread
> db.posts.find({_id: {$in: threads}})

Trees as Paths
Store hierarchy as a path expression
- Separate each node by a delimiter, e.g. /
- Use a prefix regular expression to find parts of a tree

{ comments: [
    { author: "Kyle", text: "initial post", path: "/" },
    { author: "Jim", text: "Jim's comment", path: "/jim" },
    { author: "Kyle", text: "Kyle's reply to Jim", path: "/jim/kyle" } ] }

// Find the conversations Jim was part of
> db.posts.find({path: /^\/jim/})

Queue

http://www.flickr.com/photos/deanspic/4960440218

Queue
Need to maintain order and state
Ensure that updates are atomic

db.jobs.save({ inprogress: false, priority: 1, ... });

// find the highest-priority job and mark it in-progress
job = db.jobs.findAndModify({
  query: {inprogress: false},
  sort: {priority: -1},
  update: {$set: {inprogress: true, started: new Date()}},
  new: true})

Use case - Inventory

User has a number of "votes" they can use
A finite stock that you can "sell"
A resource that can be "provisioned"

Inventory
User has a number of "votes" they can use
A finite stock that you can "sell"
A resource that can be "provisioned"

// Number of votes and who the user voted for
{ _id: "alvin", votes: 42, voted_for: [] }

// Subtract a vote and record the blog voted for
db.user.update(
  { _id: "alvin",
    votes: { $gt: 0 },
    voted_for: { $ne: "Destination Moon" } },
  { $push: { voted_for: "Destination Moon" },
    $inc: { votes: -1 } })

Don't try this...



- Large, deeply nested documents
- One-size-fits-all collections
- One collection per user
- Incorrect indexing: too many indexes, wrong keys indexed, frequent queries that do not use an index

Summary
Schema design is different in MongoDB
Basic data design principles stay the same
Focus on how the application manipulates data
Rapidly evolve the schema to meet your requirements
Enjoy your new freedom, use it wisely :-)

Part 3 - Scaling

Scaling

Operations/sec go up
Storage needs go up (capacity, IOPS)
Complexity goes up (caching)

How do you scale now?


Optimization & Tuning:
- Schema & index design
- O/S tuning
- Hardware configuration

Vertical scaling:
- Hardware is expensive ($$$)
- Hard to scale in the cloud

Horizontal scaling - Sharding


read
300 GB Data

shard1

A-Z

write

Horizontal scaling - Sharding


read
150 GB Data 150 GB Data

shard1

shard2

A-M

N-Z

write

Horizontal scaling - Sharding


read
100 GB Data 100 GB Data 100 GB Data

shard1

shard2

shard3

A-H

I-Q

R-Z

write

Sharding for caching


read
300 GB Data 96 GB Mem

shard1

A-H I-Q R-Z

3:1 Data/Mem

write

Sharding for caching


read
300 GB Data, 3 × 96 GB Mem

shard1

shard2

shard3

A-H

I-Q

R-Z

1:1 Data/Mem

write

Replication
read
300 GB Data

A-Z

write

Replication
read
300 GB Data

A-Z A-Z A-Z


900 GB Data

write

Sharding internals

Range based partitioning


a b c d e f g h ... s t u v w x y z

Large Dataset, Primary Key as username

MongoDB's sharding handles the scale problem by chunking:
- Break up data into smaller chunks, spread across many data nodes
- Each data node contains many chunks
- If a chunk gets too large or a node is overloaded, data can be rebalanced
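The chunking scheme above can be sketched in plain JavaScript. This is an illustrative model only, not MongoDB internals: the chunk table and the routeKey function are invented for the example.

```javascript
// Illustrative sketch of range-based chunk routing (not MongoDB internals).
// Each chunk owns a half-open key range [min, max) and lives on one node.
const chunks = [
  { min: "a", max: "i", node: "shard1" },
  { min: "i", max: "r", node: "shard2" },
  { min: "r", max: "{", node: "shard3" }  // "{" sorts just after "z"
];

// Route a key to the node owning the chunk that contains it.
function routeKey(key) {
  const chunk = chunks.find(c => key >= c.min && key < c.max);
  return chunk ? chunk.node : null;
}

console.log(routeKey("herge"));  // shard1
console.log(routeKey("ziggy"));  // shard3
```

Because the chunk table is small, the router can hold it in memory and send each read or write directly to the one node that owns the key's range.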

Range based partitioning


a b c d e f g h ... s t u v w x y z

Large Dataset, Primary Key as username

Big Data at a Glance


Large Dataset, Primary Key as username

MongoDB sharding breaks data into chunks (~64 MB)

Scaling
Large Dataset, Primary Key as username

Data Node 1: 25% of chunks
Data Node 2: 25% of chunks
Data Node 3: 25% of chunks
Data Node 4: 25% of chunks

Representing data as chunks allows many levels of scale across n data nodes

Scaling
Data Node 1 | Data Node 2 | Data Node 3 | Data Node 4 | Data Node 5

The set of chunks can be evenly distributed across n data nodes

Add Nodes: Chunk Rebalancing


Data Node 1
x a c s

Data Node 2
b u z g

Data Node 3
t e f w

Data Node 4
v h y d

Data Node 5

The goal is equilibrium: an equal distribution. As nodes are added (or even removed), chunks can be redistributed for balance.
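The balancing goal can be sketched as repeatedly moving one chunk from the fullest node to the emptiest until no node holds more than one chunk above any other. This is a deliberate simplification of the real balancer, which also weighs migration cost; the rebalance function here is invented for illustration.

```javascript
// Illustrative rebalancer sketch: migrate chunks from the fullest node to
// the emptiest until chunk counts differ by at most one.
function rebalance(nodes) {
  // nodes: { nodeName: [chunk, ...] }
  const names = Object.keys(nodes);
  while (true) {
    names.sort((a, b) => nodes[b].length - nodes[a].length);
    const fullest = names[0];
    const emptiest = names[names.length - 1];
    if (nodes[fullest].length - nodes[emptiest].length <= 1) break;
    nodes[emptiest].push(nodes[fullest].pop());  // migrate one chunk
  }
  return nodes;
}

// Adding an empty Data Node 5 to four loaded nodes:
const cluster = { n1: ["x","a","c","s"], n2: ["b","u","z","g"],
                  n3: ["t","e","f","w"], n4: ["v","h","y","d"], n5: [] };
rebalance(cluster);
// Every node now holds 3 or 4 of the 16 chunks.
```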

Writes Routed to Correct Chunk


Data Node 1
c a s u

Data Node 2
z g

Data Node 3
t e f

Data Node 4
v h d

Data Node 5
w b y x


Write to key "ziggy"

Writes are efficiently routed to the appropriate node & chunk

Chunk Splitting & Balancing


Data Node 1
c a s u

Data Node 2
z g

Data Node 3
t e f

Data Node 4
v h d

Data Node 5
w b y x

Write to key "ziggy"

If a chunk gets too large (MongoDB default: 64 MB per chunk), it is split into two new chunks
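The split can be sketched as cutting a chunk's range at the median of the keys it actually holds, yielding two chunks that together cover the original range. The splitChunk function and its sample data are invented for the example.

```javascript
// Illustrative chunk split: cut at the median of the keys present.
function splitChunk(chunk) {
  const keys = chunk.keys.slice().sort();
  const mid = Math.floor(keys.length / 2);
  const splitPoint = keys[mid];
  return [
    { min: chunk.min, max: splitPoint, keys: keys.slice(0, mid) },  // "z1"
    { min: splitPoint, max: chunk.max, keys: keys.slice(mid) }      // "z2"
  ];
}

// The overgrown "z" chunk from the diagram:
const z = { min: "z", max: "{", keys: ["za", "ze", "zi", "zo"] };
const [z1, z2] = splitChunk(z);
// z1 covers [z, zi) with ["za","ze"]; z2 covers [zi, {) with ["zi","zo"]
```

After the split, z1 and z2 are independent chunks, so the balancer is free to move one of them to another node.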


Chunk Splitting & Balancing


Data Node 1
c a s u

Data Node 2
z g

Data Node 3
t e f

Data Node 4
v h d

Data Node 5
w b y x

z1

z2

If a chunk gets too large (MongoDB default: 64 MB per chunk), it is split into two new chunks


Chunk Splitting & Balancing


Data Node 1
c a s

Data Node 2
z2 u z1 g

Data Node 3
t e f

Data Node 4
v h d

Data Node 5
w b y x

Each new part of the Z chunk (left & right) now contains half of the keys

Chunk Splitting & Balancing


Data Node 1
c a s u

Data Node 2
z1 g

Data Node 3
t e f

Data Node 4
v h z2 d

Data Node 5
w b y x

As chunks continue to grow and split, they can be rebalanced to keep an equal share of data on each server.

Reads with Key Routed Efficiently


Data Node 1
c a s u

Data Node 2
z1 g

Data Node 3
t e f

Data Node 4
v h z2 d

Data Node 5
w b y x

Read key "xavier"

Reading a single value by primary key

Read routed efficiently to the specific chunk containing the key


Reads with Key Routed Efficiently


Data Node 1
c a s u

Data Node 2
z1 g

Data Node 3
t e f

Data Node 4
v h z2 d

Data Node 5
w b y x

Read keys "t" to "x"

Reading multiple values by primary key

Reads routed efficiently to the specific chunks in range

Summary
Scaling is simple
Add capacity before you need it
System automatically rebalances your data
No downtime to add capacity
No code changes required

download at mongodb.org
alvin@10gen.com

conferences, appearances, and meetups


http://www.10gen.com/events

http://bit.ly/mongo

Facebook | Twitter | LinkedIn


@mongodb

http://linkd.in/joinmongo
