MongoDB

Schema Design & Scaling
Alvin Richards Technical Director, EMEA alvin@10gen.com @jonnyeight

Agenda
• 18:00 - 18:15 : Why MongoDB
• 18:15 - 18:45 : Schema Design
• 18:45 - 19:00 : Break
• 19:00 - 19:45 : Scaling
• 19:45 - 20:00 : Q & A
• 20:00 - 22:00 : After Party!

Part 1 - Why MongoDB?

10gen - Company Profile
• Company behind MongoDB
  • AGPL license, own copyrights, engineering team
  • support, consulting, training, license revenue
• Funding
  • $73.5 million total funding
  • Sequoia, Union Square, Flybridge, NEA
• Management team
  • Google/DoubleClick, Oracle, Apple, NetApp
• NYC, Palo Alto, London, Dublin & Sydney
• 110+ employees

Today's challenges

Current technology stack adds significant complexity
• custom sharding
• caching
• vertical scaling

Current technology stack reduces productivity
• denormalize
• remove joins
• remove transactions

Why we exist

More than 500 customers worldwide
• Archiving
• Complex Data
• Flexible Data
• Content and Document Management, MultiMedia
• eCommerce
• Finance
• Gaming
• Infrastructure
• Operational Datastore for Web Infrastructure
• Real-time Analytics
• Media
• Mobile

NoSQL Market Leadership

Part 2 - Schema Design

Topics
Schema design is easy!
• Data as Objects in code
Common patterns
• Single table inheritance
• One-to-Many & Many-to-Many
• Buckets
• Trees
• Queues
• Inventory

Terminology
RDBMS            MongoDB
Table            Collection
Row(s)           JSON Document
Index            Index
Join             Embedding & Linking
Partition        Shard
Partition Key    Shard Key

Schema Design Relational Database

Schema Design MongoDB
[Diagram: documents related by embedding and by linking]

So today’s example will use...

Design Session
Design documents that simply map to your application
> post = {author: "Hergé",
          date: ISODate("2011-09-18T09:56:06.298Z"),
          text: "Destination Moon",
          tags: ["comic", "adventure"]}
> db.posts.insert(post)

Find the document
> db.posts.find()
  { _id: ObjectId("4c4ba5c0672c685e5e8aabf3"),
    author: "Hergé",
    date: ISODate("2011-09-18T09:56:06.298Z"),
    text: "Destination Moon",
    tags: [ "comic", "adventure" ] }

Notes:
• ID must be unique, but can be anything you’d like
• MongoDB will generate a default ID if one is not supplied

Add an index, find via index
Secondary index for "author"
// 1 means ascending, -1 means descending
> db.posts.ensureIndex({author: 1})
> db.posts.find({author: 'Hergé'})
  { _id: ObjectId("4c4ba5c0672c685e5e8aabf3"),
    date: ISODate("2011-09-18T09:56:06.298Z"),
    author: "Hergé",
    ... }

Query operators
Conditional operators:
$ne, $in, $nin, $mod, $all, $size, $exists, $type, $lt, $lte, $gt, $gte, ...

// find posts with any tags
> db.posts.find({tags: {$exists: true}})

Regular expressions:
// posts where author starts with h
> db.posts.find({author: /^h/i})

Counting:
// number of posts written by Hergé
> db.posts.find({author: "Hergé"}).count()

Extending the schema

http://nysi.org.uk/kids_stuff/rocket/rocket.htm

Extending the Schema
     
> new_comment = {author: "Kyle",
                 date: new Date(),
                 text: "great book"}

> db.posts.update(
    {text: "Destination Moon"},
    {"$push": {comments: new_comment},
     "$inc":  {comments_count: 1}})

Extending the Schema
> db.posts.find({_id: ObjectId("4c4ba5c0672c685e5e8aabf3")})
  { _id: ObjectId("4c4ba5c0672c685e5e8aabf3"),
    author: "Hergé",
    date: ISODate("2011-09-18T09:56:06.298Z"),
    text: "Destination Moon",
    tags: [ "comic", "adventure" ],
    comments: [
      { author: "Kyle",
        date: ISODate("2011-09-19T09:56:06.298Z"),
        text: "great book" }
    ],
    comments_count: 1 }
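
The effect of the two update operators can be sketched in plain JavaScript, without a database. This is only an in-memory illustration: applyUpdate is a hypothetical helper, not a driver or shell function, and the comment's date is left out to keep the output stable.

```javascript
// In-memory sketch of how "$push" and "$inc" change a document.
// applyUpdate is a hypothetical helper, not part of MongoDB.
function applyUpdate(doc, update) {
  const result = { ...doc };
  for (const [field, value] of Object.entries(update.$push || {})) {
    // $push appends to the array, creating it if the field is missing
    result[field] = [...(result[field] || []), value];
  }
  for (const [field, amount] of Object.entries(update.$inc || {})) {
    // $inc adds to the number, treating a missing field as 0
    result[field] = (result[field] || 0) + amount;
  }
  return result;
}

const post = { author: "Hergé", text: "Destination Moon", tags: ["comic", "adventure"] };
const newComment = { author: "Kyle", text: "great book" };

const updated = applyUpdate(post, {
  $push: { comments: newComment },
  $inc: { comments_count: 1 },
});

console.log(updated.comments.length, updated.comments_count);
```

Neither field existed before the update, so the schema was extended without any migration step.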

Extending the Schema
// create index on nested documents:
> db.posts.ensureIndex({"comments.author": 1})
> db.posts.find({"comments.author": "Kyle"})

// find last 5 posts:
> db.posts.find().sort({date: -1}).limit(5)

// most commented post:
> db.posts.find().sort({comments_count: -1}).limit(1)

When sorting, check if you need an index

Use MongoDB with your language
10gen Supported Drivers
• Ruby, Python, Perl, PHP, Javascript, node.js
• Java, C/C++, C#, Scala
• Erlang, Haskell
Object Data Mappers
• Morphia - Java
• Mongoid, MongoMapper - Ruby
• MongoEngine - Python
Community Drivers
• F#, Smalltalk, Clojure, Go, Groovy, Delphi
• Lua, PowerShell, R

Using your schema - example Java Driver
import com.mongodb.*;
import java.util.*;

// Get a connection to the database, then the "posts" collection
DBCollection coll = new Mongo().getDB("posts").getCollection("posts");
// Create the object
Map<String, Object> obj = new HashMap<String, Object>();
obj.put("author", "Hergé");
obj.put("text", "Destination Moon");
obj.put("date", new Date());
// Insert the object into MongoDB
coll.insert(new BasicDBObject(obj));

Using your schema - example Morphia mapper
// Use Morphia annotations
@Entity
class Post {
  @Id String author;
  @Indexed Date date;
  String text;
}

Using your schema - example Morphia
// Create the data store (connection arguments are omitted on the slide)
Datastore ds = new Morphia().createDatastore();
// Create the object
Post entry = new Post("Hergé",
                      new Date(),
                      "Destination Moon");
// Insert object into MongoDB
ds.save(entry);

Common Patterns

http://www.flickr.com/photos/colinwarren/158628063

Inheritance

http://www.flickr.com/photos/dysonstarr/5098228295

Single Table Inheritance RDBMS
shapes table
id  type    area  radius  length  width
1   circle  3.14  1
2   square  4             2
3   rect    10            5       2

Single Table Inheritance MongoDB
> db.shapes.find()
  { _id: "1", type: "circle", area: 3.14, radius: 1 }
  { _id: "2", type: "square", area: 4, length: 2 }
  { _id: "3", type: "rect", area: 10, length: 5, width: 2 }

missing values not stored!

// find shapes where radius > 0
> db.shapes.find({radius: {$gt: 0}})

// create index
> db.shapes.ensureIndex({radius: 1}, {sparse: true})

index only values present!
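
What "index only values present" means can be sketched with a plain JavaScript Map. This is an in-memory illustration under the assumption that an index maps a field value to document ids; buildSparseIndex is a hypothetical helper and says nothing about how MongoDB stores indexes internally.

```javascript
const shapes = [
  { _id: "1", type: "circle", area: 3.14, radius: 1 },
  { _id: "2", type: "square", area: 4, length: 2 },
  { _id: "3", type: "rect", area: 10, length: 5, width: 2 },
];

// Hypothetical helper: build a sparse index (value -> list of _ids) on one field.
function buildSparseIndex(docs, field) {
  const index = new Map();
  for (const doc of docs) {
    if (!(field in doc)) continue; // sparse: skip documents without the field
    const ids = index.get(doc[field]) || [];
    ids.push(doc._id);
    index.set(doc[field], ids);
  }
  return index;
}

const radiusIndex = buildSparseIndex(shapes, "radius");
console.log(radiusIndex.size, radiusIndex.get(1));
```

Only the circle has a radius, so the square and the rectangle take no space in this index.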

One to Many

http://www.flickr.com/photos/j-fish/6502708899/

One to Many
One to Many relationships can specify
• degree of association between objects
• containment
• life-cycle

One to Many - Embedding
- Embedded Array
- $slice operator to return subset of comments
- some queries harder, e.g. find latest comments across all blogs
posts: { author: "Hergé",
         date: ISODate("2011-09-18T09:56:06.298Z"),
         comments: [
           { author: "Kyle",
             date: ISODate("2011-09-21T09:52:06.298Z"),
             text: "great book" }
         ] }

One to Many - Linking
- Normalized (2 collections)
- most flexible
- more queries
posts: { _id: 1000,
         author: "Hergé",
         date: ISODate("2011-09-18T09:56:06.298Z"),
         comments: [ {comment: 1} ] }
comments: { _id: 1,
            blog: 1000,
            author: "Kyle",
            date: ISODate("2011-09-21T09:52:06.298Z") }
// findOne returns a single document, so blog._id is defined
> blog = db.posts.findOne({text: "Destination Moon"});
> db.comments.find({blog: blog._id});

Linking versus Embedding
Embedding
• 1 seek to load entire object
• 1 roundtrip to database
• Read relative to object size
• Write relative to object size

Linking
• 1 seek to read master
• 1 seek to read each detail
• 2 roundtrips to database
• Reads longer but consistent
• Writes longer but consistent

Many to Many

http://www.flickr.com/photos/pats0n/6013379192

Many - Many
Example:
- Product can be in many categories
- Category can have many products

Many - Many
products:
  { _id: 10,
    name: "Destination Moon",
    category_ids: [ 20, 30 ] }

categories:
  { _id: 20,
    name: "comic",
    product_ids: [ 10, 11, 12 ] }
categories:
  { _id: 21,
    name: "movie",
    product_ids: [ 10 ] }

// All categories for a given product
> db.categories.find({product_ids: 10})

Alternative
products:
  { _id: 10,
    name: "Destination Moon",
    category_ids: [ 20, 30 ] }

categories:
  { _id: 20,
    name: "comic" }

// All products for a given category
> db.products.find({category_ids: 20})

// All categories for a given product
> product = db.products.findOne({_id: some_id})
> db.categories.find({_id: {$in: product.category_ids}})
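
The two lookups of this alternative can be sketched against plain arrays. This is an in-memory illustration only: the second product and the name of category 30 are made up for the example, and both helper functions are hypothetical.

```javascript
const products = [
  { _id: 10, name: "Destination Moon", category_ids: [20, 30] },
  { _id: 11, name: "Hypothetical second product", category_ids: [20] },
];
const categories = [
  { _id: 20, name: "comic" },
  { _id: 30, name: "adventure" }, // hypothetical name for category 30
];

// All products for a given category: one query, matching on array membership
function productsInCategory(categoryId) {
  return products.filter((p) => p.category_ids.includes(categoryId));
}

// All categories for a given product: load the product, then an $in-style lookup
function categoriesOfProduct(productId) {
  const product = products.find((p) => p._id === productId);
  if (!product) return [];
  return categories.filter((c) => product.category_ids.includes(c._id));
}

console.log(productsInCategory(20).map((p) => p._id));
console.log(categoriesOfProduct(10).map((c) => c.name));
```

The relationship is stored on one side only, so there is a single array to keep up to date, at the cost of a second lookup in one direction.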

Trees

http://www.flickr.com/photos/cubagallery/5949819558

Trees
Hierarchical information

   

Trees
Full Tree in Document
{ comments: [
    { author: "Kyle", text: "...",
      replies: [
        { author: "James", text: "...",
          replies: [] }
      ] }
  ] }

Pros: Single Document, Performance, Intuitive
Cons: Hard to search, Partial Results, 16MB limit
   

Array of Ancestors

      A
    /   \
   B     E
  / \     \
 C   D     F

- Store all Ancestors of a node
  { _id: "a" }
  { _id: "b", thread: [ "a" ],      replyTo: "a" }
  { _id: "c", thread: [ "a", "b" ], replyTo: "b" }
  { _id: "d", thread: [ "a", "b" ], replyTo: "b" }
  { _id: "e", thread: [ "a" ],      replyTo: "a" }
  { _id: "f", thread: [ "a", "e" ], replyTo: "e" }

// find all threads where "b" is in
> db.posts.find({thread: "b"})

// find replies to "e"
> db.posts.find({replyTo: "e"})

// find history of "f"
> threads = db.posts.findOne({_id: "f"}).thread
> db.posts.find({_id: {$in: threads}})
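
The three queries can be checked against the same six documents in plain JavaScript. This is an in-memory sketch, not shell code; the helper names are hypothetical.

```javascript
const posts = [
  { _id: "a" },
  { _id: "b", thread: ["a"], replyTo: "a" },
  { _id: "c", thread: ["a", "b"], replyTo: "b" },
  { _id: "d", thread: ["a", "b"], replyTo: "b" },
  { _id: "e", thread: ["a"], replyTo: "a" },
  { _id: "f", thread: ["a", "e"], replyTo: "e" },
];

// Every node that has `id` somewhere among its ancestors
const descendantsOf = (id) =>
  posts.filter((p) => (p.thread || []).includes(id)).map((p) => p._id);

// Direct replies only
const repliesTo = (id) =>
  posts.filter((p) => p.replyTo === id).map((p) => p._id);

// The ancestors of a node: read its thread array, then fetch those ids
const historyOf = (id) => {
  const node = posts.find((p) => p._id === id);
  const thread = (node && node.thread) || [];
  return posts.filter((p) => thread.includes(p._id)).map((p) => p._id);
};

console.log(descendantsOf("b"), repliesTo("e"), historyOf("f"));
```

Each question is answered by a single filter over the collection, with no recursive walk of the tree.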

Trees as Paths
Store hierarchy as a path expression
- Separate each node by a delimiter, e.g. "/"
- Use text search to find parts of a tree
{ comments: [
    { author: "Kyle", text: "initial post",
      path: "/" },
    { author: "Jim", text: "jim’s comment",
      path: "/jim" },
    { author: "Kyle", text: "Kyle’s reply to Jim",
      path: "/jim/kyle" } ] }

// Find the conversations Jim was part of
// (paths start with the delimiter, and path is a field of each comment)
> db.posts.find({"comments.path": /^\/jim/})
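
The prefix match on paths can be tried in plain JavaScript. This is an in-memory sketch over the three comments above; the stricter pattern, which also requires a delimiter or the end of the string after "jim", is an addition for the example so that a path such as "/jimmy" would not match.

```javascript
const comments = [
  { author: "Kyle", text: "initial post", path: "/" },
  { author: "Jim", text: "jim's comment", path: "/jim" },
  { author: "Kyle", text: "Kyle's reply to Jim", path: "/jim/kyle" },
];

// Anchored prefix match: the comment at "/jim" and everything below it
const underJim = comments.filter((c) => /^\/jim(\/|$)/.test(c.path));

console.log(underJim.map((c) => c.path));
```

Anchoring the pattern at the start of the string matters: a left-anchored prefix is the form of regular expression that a database index on the path field can serve.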

Queue

http://www.flickr.com/photos/deanspic/4960440218

Queue
• Need to maintain order and state
• Ensure that updates are atomic

> db.jobs.save(
    { inprogress: false,
      priority: 1,
      ...
    });

// find highest priority job and mark as in-progress
> job = db.jobs.findAndModify({
    query:  {inprogress: false},
    sort:   {priority: -1},
    update: {$set: {inprogress: true,
                    started: new Date()}},
    new:    true})
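
The query, sort and update steps of the queue pattern can be sketched as one function over an array. This is an in-memory illustration: claimNextJob is a hypothetical helper and the job documents are made up. In MongoDB the atomicity comes from findAndModify running on the server; here it only holds because the function runs to completion without interleaving.

```javascript
const jobs = [
  { _id: 1, inprogress: false, priority: 1 },
  { _id: 2, inprogress: false, priority: 5 },
  { _id: 3, inprogress: true, priority: 9 }, // already claimed by someone else
];

// Hypothetical helper: pick the highest-priority free job and mark it in one step.
function claimNextJob(queue, now) {
  const candidates = queue
    .filter((job) => !job.inprogress)          // query: {inprogress: false}
    .sort((x, y) => y.priority - x.priority);  // sort:  {priority: -1}
  if (candidates.length === 0) return null;
  const job = candidates[0];
  job.inprogress = true;                       // update: $set
  job.started = now;
  return job;
}

const first = claimNextJob(jobs, new Date());
const second = claimNextJob(jobs, new Date());
const third = claimNextJob(jobs, new Date());
console.log(first._id, second._id, third);
```

Because a claimed job no longer matches the query, two workers cannot be handed the same job.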

Use case - Inventory
• User has a number of "votes" they can use
• A finite stock that you can "sell"
• A resource that can be "provisioned"

// Number of votes and who the user voted for
{ _id: "alvin",
  votes: 42,
  voted_for: [] }

// Subtract a vote and add the blog voted for
> db.user.update(
    { _id: "alvin",
      votes: {$gt: 0},
      voted_for: {$ne: "Destination Moon"} },
    { "$push": {voted_for: "Destination Moon"},
      "$inc":  {votes: -1} })
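
The point of the inventory update is that the condition and the change travel together: the update applies only if the user still has votes and has not voted for this post. A minimal in-memory sketch, assuming a user with a single vote left; castVote and the second title are made up for the example.

```javascript
const users = [{ _id: "alvin", votes: 1, voted_for: [] }];

// Hypothetical helper: returns true if the vote was recorded.
function castVote(userId, title) {
  // Same conditions as the update's query document
  const user = users.find(
    (u) => u._id === userId && u.votes > 0 && !u.voted_for.includes(title)
  );
  if (!user) return false; // no votes left, already voted, or unknown user
  user.voted_for.push(title); // $push
  user.votes -= 1;            // $inc: -1
  return true;
}

const firstVote = castVote("alvin", "Destination Moon");
const repeatVote = castVote("alvin", "Destination Moon");
const extraVote = castVote("alvin", "Hypothetical other post");
console.log(firstVote, repeatVote, extraVote, users[0].votes);
```

The stock can never go below zero, because a failed condition leaves the document untouched.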

Don't try this...
• Large, deeply nested documents
• One size fits all collections
• One collection per user
• Incorrect indexing
  • Too many indexes; wrong keys indexed
  • Frequent queries do not use index

Summary
• Schema design is different in MongoDB
• Basic data design principles stay the same
• Focus on how the application manipulates data
• Rapidly evolve schema to meet your requirements
• Enjoy your new freedom, use it wisely :-)

Part 3 - Scaling

Scaling

• Operations/sec go up
• Storage needs go up
  • Capacity
  • IOPs
• Complexity goes up
  • Caching

How do you scale now?
• Optimization & Tuning
  • Schema & Index Design
  • O/S tuning
  • Hardware configuration
• Vertical scaling
  • Hardware is expensive
  • Hard to scale in cloud
[Chart: cost ($$$) against throughput]

Horizontal scaling - Sharding
[Diagram: reads and writes go to a single shard, shard1, holding 300 GB of data, key range A-Z]
[Diagram: two shards with 150 GB of data each: shard1 holds A-M, shard2 holds N-Z]
[Diagram: three shards with 100 GB of data each: shard1 holds A-H, shard2 holds I-Q, shard3 holds R-Z]

Sharding for caching
[Diagram: a single shard, shard1, holds all 300 GB of data (A-H, I-Q, R-Z) with 96 GB of memory: 3:1 Data/Mem]
[Diagram: three shards (shard1 A-H, shard2 I-Q, shard3 R-Z), labelled 96 GB Mem and 300 GB Data: 1:1 Data/Mem]

Replication
[Diagram: reads and writes go to a single node holding 300 GB of data, A-Z]
[Diagram: three copies of A-Z, 300 GB of data each, 900 GB of data in total]

Sharding internals

Range based partitioning
[Diagram: a large dataset with primary key "username", laid out as the key range a ... z]

• MongoDB’s sharding handles the scale problem by chunking
• Break up pieces of data into smaller chunks, spread across many data nodes
• Each data node contains many chunks
• If a chunk gets too large or a node is overloaded, data can be rebalanced

Big Data at a Glance
[Diagram: the large dataset, primary key "username", broken into 16 chunks labelled a-h and s-z]

MongoDB Sharding breaks data into chunks (~64 MB)

Scaling
[Diagram: the 16 chunks of the large dataset spread over Data Nodes 1-4, 25% of chunks on each]

Representing data as chunks allows many levels of scale across n data nodes

Scaling
[Diagram: the same chunks with a fifth data node added alongside Data Nodes 1-4]

The set of chunks can be evenly distributed across n data nodes

Add Nodes: Chunk Rebalancing
Data Node 1: x a c s
Data Node 2: b u z g
Data Node 3: t e f w
Data Node 4: v h y d
Data Node 5: (empty)

• The goal is equilibrium - an equal distribution.
• As nodes are added (or even removed), chunks can be redistributed for balance.
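
The idea of moving chunks until the nodes are level can be sketched in a few lines. This is a toy, in-memory model that starts from the layout above; the greedy rule (move one chunk from the fullest node to the emptiest) is an assumption made for illustration and is not MongoDB's balancer algorithm.

```javascript
const nodes = {
  node1: ["x", "a", "c", "s"],
  node2: ["b", "u", "z", "g"],
  node3: ["t", "e", "f", "w"],
  node4: ["v", "h", "y", "d"],
  node5: [], // newly added, still empty
};

// Move one chunk at a time from the fullest node to the emptiest node
// until the chunk counts differ by at most one. Returns the number of moves.
function rebalance(cluster) {
  let moves = 0;
  for (;;) {
    const byCount = Object.keys(cluster).sort(
      (p, q) => cluster[p].length - cluster[q].length
    );
    const emptiest = byCount[0];
    const fullest = byCount[byCount.length - 1];
    if (cluster[fullest].length - cluster[emptiest].length <= 1) return moves;
    cluster[emptiest].push(cluster[fullest].pop());
    moves += 1;
  }
}

const moves = rebalance(nodes);
const counts = Object.values(nodes).map((chunks) => chunks.length);
console.log(moves, counts);
```

Only whole chunks move, so adding a node costs a handful of chunk migrations rather than a reshuffle of every document.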

Writes Routed to Correct Chunk
Data Node 1: c a s u
Data Node 2: z g
Data Node 3: t e f
Data Node 4: v h d
Data Node 5: w b y x

Write to key "ziggy"

Writes are efficiently routed to the appropriate node & chunk
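
Routing a key to its chunk comes down to a search over sorted range boundaries. A minimal sketch, assuming each chunk is described by the lowest key it covers; the boundaries and the node assigned to each range are made up for the example, and routeKey is a hypothetical helper.

```javascript
// Each chunk covers keys from `min` (inclusive) up to the next chunk's `min`.
// Boundaries and node names are assumptions for this sketch.
const chunks = [
  { min: "", node: "node1" },
  { min: "h", node: "node4" },
  { min: "t", node: "node3" },
  { min: "z", node: "node2" },
];

// Binary search for the last chunk whose lower bound is <= key
function routeKey(key) {
  let low = 0;
  let high = chunks.length - 1;
  while (low < high) {
    const mid = Math.ceil((low + high) / 2);
    if (chunks[mid].min <= key) low = mid;
    else high = mid - 1;
  }
  return chunks[low];
}

console.log(routeKey("ziggy").node, routeKey("kyle").node, routeKey("alvin").node);
```

A write to "ziggy" touches exactly one node; the other nodes never see it.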

Chunk Splitting & Balancing
Data Node 1: c a s u
Data Node 2: z g
Data Node 3: t e f
Data Node 4: v h d
Data Node 5: w b y x

Write to key "ziggy"

• If a chunk gets too large (default in MongoDB: 64 MB per chunk), it is split into two new chunks

After the split:
Data Node 1: c a s u
Data Node 2: z1 z2 g
Data Node 3: t e f
Data Node 4: v h d
Data Node 5: w b y x

Each new part of the Z chunk (left & right) now contains half of the keys
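
Splitting a chunk at its middle key can be sketched as follows. This is a toy model: the limit is counted in keys rather than megabytes, the usernames are made up, and splitChunk is a hypothetical helper, not MongoDB's split logic.

```javascript
const MAX_KEYS = 4; // stand-in for the 64 MB size limit; an assumption

// Split an oversized chunk into a left and a right half around the middle key
function splitChunk(chunk) {
  if (chunk.keys.length <= MAX_KEYS) return [chunk];
  const sorted = [...chunk.keys].sort();
  const middle = Math.floor(sorted.length / 2);
  return [
    { name: chunk.name + "1", keys: sorted.slice(0, middle) },
    { name: chunk.name + "2", keys: sorted.slice(middle) },
  ];
}

const z = { name: "z", keys: ["zoe", "zack", "ziggy", "zed", "zara", "zelda"] };
const parts = splitChunk(z);
console.log(parts.map((part) => part.name + ": " + part.keys.join(", ")));
```

The two halves cover adjacent key ranges, so either one can later be moved to another node on its own.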

Chunk Splitting & Balancing
Data Node 1: c a s u
Data Node 2: z1 g
Data Node 3: t e f
Data Node 4: v h z2 d
Data Node 5: w b y x

As chunks continue to grow and split, they can be rebalanced to keep an equal share of data on each server.

Reads with Key Routed Efficiently
Data Node 1: c a s u
Data Node 2: z1 g
Data Node 3: t e f
Data Node 4: v h z2 d
Data Node 5: w b y x

Read Key "xavier"
Reading a single value by Primary Key
Read routed efficiently to specific chunk containing key

Read Keys "T"->"X"
Reading multiple values by Primary Key
Reads routed efficiently to specific chunks in range
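
Which nodes a range read has to visit can be worked out from the chunk layout alone. A minimal sketch using the layout above, under the assumption that each chunk is named after the first letter of the keys it holds; nodesForRange is a hypothetical helper.

```javascript
const layout = {
  node1: ["c", "a", "s", "u"],
  node2: ["z1", "g"],
  node3: ["t", "e", "f"],
  node4: ["v", "h", "z2", "d"],
  node5: ["w", "b", "y", "x"],
};

// Nodes holding at least one chunk whose letter falls within [from, to]
function nodesForRange(from, to) {
  return Object.keys(layout).filter((node) =>
    layout[node].some((chunk) => chunk[0] >= from && chunk[0] <= to)
  );
}

console.log(nodesForRange("x", "x")); // single key such as "xavier"
console.log(nodesForRange("t", "x")); // range "T" -> "X"
```

The single-key read visits one node, and the range read skips node2 entirely because it holds no chunk between t and x.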

Summary
• Scaling is simple
• Add capacity before you need it
• System automatically re-balances your data
• No downtime to add capacity
• No code changes required

download at mongodb.org
alvin@10gen.com

conferences,  appearances,  and  meetups
http://www.10gen.com/events

http://bit.ly/mongo>  

Facebook                    |                  Twitter                  |                  LinkedIn
@mongodb

http://linkd.in/joinmongo
