
MongoDB

Schema Design & Scaling


Alvin Richards
Technical Director, EMEA
alvin@10gen.com | @jonnyeight

Agenda
18:00 - 18:15 : Why MongoDB
18:15 - 18:45 : Schema Design
18:45 - 19:00 : Break
19:00 - 19:45 : Scaling
19:45 - 20:00 : Q & A
20:00 - 22:00 : After Party!

Part 1 - Why MongoDB?

10gen - Company Profile


Company behind MongoDB
- aGPL license, owns copyrights, engineering team
- Support, consulting, training, license revenue

Funding
- $73.5 million total funding
- Sequoia, Union Square, Flybridge, NEA

Management team
- Google/DoubleClick, Oracle, Apple, NetApp

NYC, Palo Alto, London, Dublin & Sydney; 110+ employees

Today's challenges

Current technology stack adds significant complexity:
custom caching, vertical scaling, sharding

Current technology stack reduces productivity:
denormalizing data, removing joins, removing transactions

Why we exist

More than 500 customers worldwide

Use cases: Archiving, Complex Data, Flexible Data, Content and Document Management, MultiMedia, eCommerce, Finance, Gaming, Infrastructure, Operational Datastore for Web Infrastructure, Real-time Analytics, Media, Mobile

NoSQL Market Leadership

Part 2 - Schema Design

Topics
Schema design is easy!
Data as Objects in code
Common patterns: Single Table Inheritance, One-to-Many & Many-to-Many, Buckets, Trees, Queues, Inventory

Terminology
RDBMS         -> MongoDB
Table         -> Collection
Row(s)        -> JSON Document
Index         -> Index
Join          -> Embedding & Linking
Partition     -> Shard
Partition Key -> Shard Key

Schema Design Relational Database

Schema Design MongoDB

Schema Design MongoDB

embedding

Schema Design MongoDB

embedding

linking

So today's example will use...

Design Session
Design documents that simply map to your application
> post = {author: "Hergé",
          date: ISODate("2011-09-18T09:56:06.298Z"),
          text: "Destination Moon",
          tags: ["comic", "adventure"]}
> db.posts.insert(post)

Find the document


> db.posts.find()
{ _id: ObjectId("4c4ba5c0672c685e5e8aabf3"),
  author: "Hergé",
  date: ISODate("2011-09-18T09:56:06.298Z"),
  text: "Destination Moon",
  tags: [ "comic", "adventure" ] }

Notes:
- _id must be unique, but can be anything you'd like
- MongoDB will generate a default _id if one is not supplied

Add an index, find via index


Secondary index for author
// 1 means ascending, -1 means descending
> db.posts.ensureIndex({author: 1})
> db.posts.find({author: 'Hergé'})
{ _id: ObjectId("4c4ba5c0672c685e5e8aabf3"),
  date: ISODate("2011-09-18T09:56:06.298Z"),
  author: "Hergé",
  ... }

Query operators
Conditional operators:
$lt, $lte, $gt, $gte, $ne, $in, $nin, $mod, $all, $size, $exists, $type, ...

// find posts with any tags
> db.posts.find({tags: {$exists: true}})


Regular expressions:

// posts where author starts with h
> db.posts.find({author: /^h/i})


Counting:

// number of posts written by Hergé
> db.posts.find({author: "Hergé"}).count()

Extending the schema

http://nysi.org.uk/kids_stuff/rocket/rocket.htm

Extending the Schema



new_comment = {author: "Kyle",
               date: new Date(),
               text: "great book"}

> db.posts.update(
    {text: "Destination Moon"},
    {"$push": {comments: new_comment},
     "$inc": {comments_count: 1}})

Extending the Schema


> db.posts.find({_id: ObjectId("4c4ba5c0672c685e5e8aabf3")})
{ _id: ObjectId("4c4ba5c0672c685e5e8aabf3"),
  author: "Hergé",
  date: ISODate("2011-09-18T09:56:06.298Z"),
  text: "Destination Moon",
  tags: [ "comic", "adventure" ],
  comments: [
    { author: "Kyle",
      date: ISODate("2011-09-19T09:56:06.298Z"),
      text: "great book" } ],
  comments_count: 1 }

Extending the Schema


// create index on nested documents:
> db.posts.ensureIndex({"comments.author": 1})
> db.posts.find({"comments.author": "Kyle"})
// find last 5 posts:
> db.posts.find().sort({date: -1}).limit(5)
// most commented post:
> db.posts.find().sort({comments_count: -1}).limit(1)

When sorting, check if you need an index

Use MongoDB with your language


10gen Supported Drivers:
Ruby, Python, Perl, PHP, JavaScript (node.js), Java, C/C++, C#, Scala, Erlang, Haskell

Object Data Mappers:
Morphia (Java); Mongoid, MongoMapper (Ruby); MongoEngine (Python)

Community Drivers:
F#, Smalltalk, Clojure, Go, Groovy, Delphi, Lua, PowerShell, R

Using your schema - example Java Driver


// Get a connection to the collection
DBCollection coll = new Mongo().getDB("blog").getCollection("posts");
// Create the object
BasicDBObject obj = new BasicDBObject();
obj.put("author", "Hergé");
obj.put("text", "Destination Moon");
obj.put("date", new Date());
// Insert the object into MongoDB
coll.insert(obj);

Using your schema - example Morphia mapper


// Use Morphia annotations
@Entity
class Post {
  @Id String author;
  @Indexed Date date;
  String text;
}

Using your schema - example Morphia


// Create the datastore
Datastore ds = new Morphia().createDatastore();
// Create the object
Post entry = new Post("Hergé", new Date(), "Destination Moon");
// Insert the object into MongoDB
ds.save(entry);

Common Patterns

Common Patterns

http://www.flickr.com/photos/colinwarren/158628063

Inheritance

http://www.flickr.com/photos/dysonstarr/5098228295

Inheritance

Single Table Inheritance RDBMS


shapes table:

id | type   | area | radius | length | width
1  | circle | 3.14 | 1      |        |
2  | square | 4    |        | 2      |
3  | rect   | 10   |        | 5      | 2

Single Table Inheritance MongoDB


> db.shapes.find()
{ _id: "1", type: "circle", area: 3.14, radius: 1 }
{ _id: "2", type: "square", area: 4, length: 2 }
{ _id: "3", type: "rect", area: 10, length: 5, width: 2 }

missing values are not stored!

Single Table Inheritance MongoDB


> db.shapes.find()
{ _id: "1", type: "circle", area: 3.14, radius: 1 }
{ _id: "2", type: "square", area: 4, length: 2 }
{ _id: "3", type: "rect", area: 10, length: 5, width: 2 }

// find shapes where radius > 0
> db.shapes.find({radius: {$gt: 0}})

Single Table Inheritance MongoDB


> db.shapes.find()
{ _id: "1", type: "circle", area: 3.14, radius: 1 }
{ _id: "2", type: "square", area: 4, length: 2 }
{ _id: "3", type: "rect", area: 10, length: 5, width: 2 }

// find shapes where radius > 0
> db.shapes.find({radius: {$gt: 0}})
// create a sparse index
> db.shapes.ensureIndex({radius: 1}, {sparse: true})

the index only contains documents where the value is present!

One to Many

http://www.flickr.com/photos/j-sh/6502708899/

One to Many
One-to-Many relationships can specify the degree of association between objects: containment, life-cycle

One to Many - Embedding


- Embedded array
- $slice operator to return a subset of comments
- Some queries are harder, e.g. find the latest comments across all posts

posts:
{ author: "Hergé",
  date: ISODate("2011-09-18T09:56:06.298Z"),
  comments: [
    { author: "Kyle",
      date: ISODate("2011-09-21T09:52:06.298Z"),
      text: "great book" } ] }

One to Many - Linking


- Normalized (2 collections)
- Most flexible
- More queries

posts:
{ _id: 1000,
  author: "Hergé",
  date: ISODate("2011-09-18T09:56:06.298Z"),
  comments: [ { comment: 1 } ] }

comments:
{ _id: 1,
  blog: 1000,
  author: "Kyle",
  date: ISODate("2011-09-21T09:52:06.298Z") }

> blog = db.posts.findOne({text: "Destination Moon"})
> db.comments.find({blog: blog._id})

Linking versus Embedding


Embedding:
- 1 seek to load the entire object
- 1 roundtrip to the database
- Read cost relative to object size
- Write cost relative to object size

Linking:
- 1 seek to read the master
- 1 seek to read each detail
- 2 roundtrips to the database
- Reads longer but consistent
- Writes longer but consistent

Many to Many

http://www.flickr.com/photos/pats0n/6013379192

Many - Many
Example:
- A product can be in many categories
- A category can have many products

Many - Many
products:
{ _id: 10, name: "Destination Moon", category_ids: [ 20, 30 ] }

categories:
{ _id: 20, name: "comic", product_ids: [ 10, 11, 12 ] }
{ _id: 30, name: "movie", product_ids: [ 10 ] }

Many - Many
products:
{ _id: 10, name: "Destination Moon", category_ids: [ 20, 30 ] }

categories:
{ _id: 20, name: "comic", product_ids: [ 10, 11, 12 ] }
{ _id: 30, name: "movie", product_ids: [ 10 ] }

// All categories for a given product
> db.categories.find({product_ids: 10})

Alternative
products: { _id: 10, name: "Destination Moon", category_ids: [ 20, 30 ] }


categories: { _id: 20, name: "comic"}

Alternative
products: { _id: 10, name: "Destination Moon", category_ids: [ 20, 30 ] }


categories: { _id: 20, name: "comic" }

// All products for a given category
> db.products.find({category_ids: 20})
// All categories for a given product
> product = db.products.findOne({_id: some_id})
> db.categories.find({_id: {$in: product.category_ids}})

Trees

http://www.ickr.com/photos/cubagallery/5949819558

Trees
Hierarchical information

Trees
Full Tree in Document
{ comments: [
    { author: "Kyle", text: ...,
      replies: [
        { author: "James", text: ..., replies: [] } ] } ] }

Pros: single document, performance, intuitive
Cons: hard to search, partial results, 16 MB document limit

Array of Ancestors

Tree: A has children B and E; B has children C and D; E has child F

- Store all ancestors of a node:
{ _id: "a" }
{ _id: "b", thread: [ "a" ], replyTo: "a" }
{ _id: "c", thread: [ "a", "b" ], replyTo: "b" }
{ _id: "d", thread: [ "a", "b" ], replyTo: "b" }
{ _id: "e", thread: [ "a" ], replyTo: "a" }
{ _id: "f", thread: [ "a", "e" ], replyTo: "e" }

Array of Ancestors

Tree: A has children B and E; B has children C and D; E has child F

- Store all ancestors of a node:
{ _id: "a" }
{ _id: "b", thread: [ "a" ], replyTo: "a" }
{ _id: "c", thread: [ "a", "b" ], replyTo: "b" }
{ _id: "d", thread: [ "a", "b" ], replyTo: "b" }
{ _id: "e", thread: [ "a" ], replyTo: "a" }
{ _id: "f", thread: [ "a", "e" ], replyTo: "e" }

// find all threads "b" is in
> db.posts.find({thread: "b"})
// find replies to "e"
> db.posts.find({replyTo: "e"})
// find history of "f"
> threads = db.posts.findOne({_id: "f"}).thread
> db.posts.find({_id: {$in: threads}})

Trees as Paths
Store hierarchy as a path expression
- Separate each node by a delimiter, e.g. /
- Use a prefix regular expression to find parts of a tree

{ comments: [
    { author: "Kyle", text: "initial post", path: "/" },
    { author: "Jim", text: "Jim's comment", path: "/jim" },
    { author: "Kyle", text: "Kyle's reply to Jim", path: "/jim/kyle" } ] }

// Find the conversations Jim was part of
> db.posts.find({path: /^\/jim/})

Queue

http://www.flickr.com/photos/deanspic/4960440218

Queue
Need to maintain order and state
Ensure that updates are atomic

db.jobs.save({ inprogress: false, priority: 1, ... });

// find the highest-priority job and mark it in-progress
job = db.jobs.findAndModify({
  query: {inprogress: false},
  sort: {priority: -1},
  update: {$set: {inprogress: true, started: new Date()}},
  new: true})

Use case - Inventory

User has a number of "votes" they can use
A finite stock that you can "sell"
A resource that can be "provisioned"

Inventory
User has a number of "votes" they can use
A finite stock that you can "sell"
A resource that can be "provisioned"

// Number of votes and who the user voted for
{ _id: "alvin", votes: 42, voted_for: [] }

// Subtract a vote and record the blog voted for
db.user.update(
  { _id: "alvin",
    votes: { $gt: 0 },
    voted_for: { $ne: "Destination Moon" } },
  { $push: { voted_for: "Destination Moon" },
    $inc: { votes: -1 } })

Don't try this...



- Large, deeply nested documents
- One-size-fits-all collections
- One collection per user
- Incorrect indexing: too many indexes, wrong keys indexed, frequent queries that do not use an index

Summary
Schema design is different in MongoDB
Basic data design principles stay the same
Focus on how the application manipulates data
Rapidly evolve the schema to meet your requirements
Enjoy your new freedom, use it wisely :-)

Part 3 - Scaling

Scaling

Operations/sec go up
Storage needs go up (capacity, IOPS)
Complexity goes up (caching)

How do you scale now?


Optimization & Tuning:
- Schema & index design
- O/S tuning
- Hardware configuration

Vertical scaling:
- Hardware is expensive ($$$)
- Hard to scale in the cloud

Horizontal scaling - Sharding


read
300 GB Data

shard1

A-Z

write

Horizontal scaling - Sharding


read
150 GB Data 150 GB Data

shard1

shard2

A-M

N-Z

write

Horizontal scaling - Sharding


read
100 GB Data 100 GB Data 100 GB Data

shard1

shard2

shard3

A-H

I-Q

R-Z

write

Sharding for caching


read
300 GB Data 96 GB Mem

shard1

A-H I-Q R-Z

3:1 Data/Mem

write

Sharding for caching


read
300 GB Data, 3 × 96 GB Mem

shard1

shard2

shard3

A-H

I-Q

R-Z

1:1 Data/Mem

write

Replication
read
300 GB Data

A-Z

write

Replication
read
300 GB Data

A-Z A-Z A-Z


900 GB Data

write

Sharding internals

Range based partitioning


a b c d e f g h ... s t u v w x y z

Large Dataset, Primary Key as username

MongoDB's sharding handles the scale problem by chunking:
- Break up data into smaller chunks, spread across many data nodes
- Each data node contains many chunks
- If a chunk gets too large or a node is overloaded, data can be rebalanced
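The chunking scheme above can be sketched in plain JavaScript. This is an illustrative model only, not MongoDB internals: the chunk table and the routeKey function are invented for the example.

```javascript
// Illustrative sketch of range-based chunk routing (not MongoDB internals).
// Each chunk owns a half-open key range [min, max) and lives on one node.
const chunks = [
  { min: "a", max: "i", node: "shard1" },
  { min: "i", max: "r", node: "shard2" },
  { min: "r", max: "{", node: "shard3" }  // "{" sorts just after "z"
];

// Route a key to the node owning the chunk that contains it.
function routeKey(key) {
  const chunk = chunks.find(c => key >= c.min && key < c.max);
  return chunk ? chunk.node : null;
}

console.log(routeKey("herge"));  // shard1
console.log(routeKey("ziggy"));  // shard3
```

Because the chunk table is small, the router can hold it in memory and send each read or write directly to the one node that owns the key's range.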

Range based partitioning


a b c d e f g h ... s t u v w x y z

Large Dataset, Primary Key as username

Big Data at a Glance


Large Dataset, Primary Key as username

MongoDB sharding breaks data into chunks (~64 MB)

Scaling
Large Dataset, Primary Key as username

Data Node 1: 25% of chunks
Data Node 2: 25% of chunks
Data Node 3: 25% of chunks
Data Node 4: 25% of chunks

Representing data as chunks allows many levels of scale across n data nodes

Scaling
Data Node 1 | Data Node 2 | Data Node 3 | Data Node 4 | Data Node 5

The set of chunks can be evenly distributed across n data nodes

Add Nodes: Chunk Rebalancing


Data Node 1
x a c s

Data Node 2
b u z g

Data Node 3
t e f w

Data Node 4
v h y d

Data Node 5

The goal is equilibrium: an equal distribution. As nodes are added (or even removed), chunks can be redistributed for balance.
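The balancing goal can be sketched as repeatedly moving one chunk from the fullest node to the emptiest until no node holds more than one chunk above any other. This is a deliberate simplification of the real balancer, which also weighs migration cost; the rebalance function here is invented for illustration.

```javascript
// Illustrative rebalancer sketch: migrate chunks from the fullest node to
// the emptiest until chunk counts differ by at most one.
function rebalance(nodes) {
  // nodes: { nodeName: [chunk, ...] }
  const names = Object.keys(nodes);
  while (true) {
    names.sort((a, b) => nodes[b].length - nodes[a].length);
    const fullest = names[0];
    const emptiest = names[names.length - 1];
    if (nodes[fullest].length - nodes[emptiest].length <= 1) break;
    nodes[emptiest].push(nodes[fullest].pop());  // migrate one chunk
  }
  return nodes;
}

// Adding an empty Data Node 5 to four loaded nodes:
const cluster = { n1: ["x","a","c","s"], n2: ["b","u","z","g"],
                  n3: ["t","e","f","w"], n4: ["v","h","y","d"], n5: [] };
rebalance(cluster);
// Every node now holds 3 or 4 of the 16 chunks.
```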

Writes Routed to Correct Chunk


Data Node 1
c a s u

Data Node 2
z g

Data Node 3
t e f

Data Node 4
v h d

Data Node 5
w b y x


Write to key "ziggy"

Writes are efficiently routed to the appropriate node & chunk

Chunk Splitting & Balancing


Data Node 1
c a s u

Data Node 2
z g

Data Node 3
t e f

Data Node 4
v h d

Data Node 5
w b y x

Write to key "ziggy"

If a chunk gets too large (MongoDB default: 64 MB per chunk), it is split into two new chunks
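The split can be sketched as cutting a chunk's range at the median of the keys it actually holds, yielding two chunks that together cover the original range. The splitChunk function and its sample data are invented for the example.

```javascript
// Illustrative chunk split: cut at the median of the keys present.
function splitChunk(chunk) {
  const keys = chunk.keys.slice().sort();
  const mid = Math.floor(keys.length / 2);
  const splitPoint = keys[mid];
  return [
    { min: chunk.min, max: splitPoint, keys: keys.slice(0, mid) },  // "z1"
    { min: splitPoint, max: chunk.max, keys: keys.slice(mid) }      // "z2"
  ];
}

// The overgrown "z" chunk from the diagram:
const z = { min: "z", max: "{", keys: ["za", "ze", "zi", "zo"] };
const [z1, z2] = splitChunk(z);
// z1 covers [z, zi) with ["za","ze"]; z2 covers [zi, {) with ["zi","zo"]
```

After the split, z1 and z2 are independent chunks, so the balancer is free to move one of them to another node.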


Chunk Splitting & Balancing


Data Node 1
c a s u

Data Node 2
z g

Data Node 3
t e f

Data Node 4
v h d

Data Node 5
w b y x

z1

z2

If a chunk gets too large (MongoDB default: 64 MB per chunk), it is split into two new chunks


Chunk Splitting & Balancing


Data Node 1
c a s

Data Node 2
z2 u z1 g

Data Node 3
t e f

Data Node 4
v h d

Data Node 5
w b y x

Each new part of the Z chunk (left & right) now contains half of the keys

Chunk Splitting & Balancing


Data Node 1
c a s u

Data Node 2
z1 g

Data Node 3
t e f

Data Node 4
v h z2 d

Data Node 5
w b y x

As chunks continue to grow and split, they can be rebalanced to keep an equal share of data on each server.

Reads with Key Routed Efficiently


Data Node 1
c a s u

Data Node 2
z1 g

Data Node 3
t e f

Data Node 4
v h z2 d

Data Node 5
w b y x

Read key "xavier"

Reading a single value by primary key

Read routed efficiently to the specific chunk containing the key


Reads with Key Routed Efficiently


Data Node 1
c a s u

Data Node 2
z1 g

Data Node 3
t e f

Data Node 4
v h z2 d

Data Node 5
w b y x

Read keys "t" to "x"

Reading multiple values by primary key

Reads routed efficiently to the specific chunks in range

Summary
Scaling is simple
Add capacity before you need it
System automatically rebalances your data
No downtime to add capacity
No code changes required

download at mongodb.org
alvin@10gen.com

conferences, appearances, and meetups


http://www.10gen.com/events

http://bit.ly/mongo

Facebook | Twitter | LinkedIn


@mongodb

http://linkd.in/joinmongo
