
Amazon Dynamo DB

- a term project presentation


CSc 244 | CSU Sacramento
Spring 2014
Instructor : Dr. Ying Jin
8th May, 2014
Team 1 :
Chetan Nagarkar
Saransh
Agenda
NoSQL background
CAP theorem
AWS offerings
Dynamo DB
Dynamo DB Architecture
How to use
Pricing
Advantages
Limitations
References
Questions

NoSQL background
Traditional databases have a concrete schema.


NoSQL systems are designed to be schema-less.

Entry 1
{name:emp1}
Entry 2
{name:emp2,e_id:1,e_addr:Cupertino}
Entry 3
{name:emp3,e_id:3}
Entry 4
{name:emp4,e_id:6, dob:03-Sep-1964}
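
As a rough illustration of what schema-less means in practice, the four entries above can be modelled as plain Python dictionaries. The helper names (`attribute_names`, `items_with`) are hypothetical and exist only for this sketch; no fixed column list is declared anywhere.

```python
# The four entries from the slide, each carrying a different set of attributes.
entries = [
    {"name": "emp1"},
    {"name": "emp2", "e_id": 1, "e_addr": "Cupertino"},
    {"name": "emp3", "e_id": 3},
    {"name": "emp4", "e_id": 6, "dob": "03-Sep-1964"},
]


def attribute_names(items):
    """Return the union of attribute names used across all items."""
    names = set()
    for item in items:
        names.update(item)
    return names


def items_with(items, attribute):
    """Return only the items that actually carry the given attribute."""
    return [item for item in items if attribute in item]
```

A query for an attribute that some items lack simply skips those items, instead of returning NULL columns as a fixed schema would.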

Types of NoSQL systems
1. Key-Value Pair:
Every item is a set of key-value pairs.
Voldemort (LinkedIn), DynamoDB (Amazon), Redis (VMware), etc.
2. Document Oriented:
Each key is paired with a complex data structure called a document. Documents may
contain multiple key-value pairs.
CouchDB, MongoDB, etc.
3. Column Family Stores:
Each key maps to a set of columns, optimized for queries over large datasets.
Cassandra (Facebook), Hypertable, HBase, BigTable (Google), etc.
4. Graph Databases:
Store information about networks, such as social connections.
Neo4j, FlockDB (Twitter).

ACID properties in traditional RDBMSs

Atomicity
Requires each transaction to be all or nothing
Consistency
Brings the database from one valid state to another
Isolation
Ensures concurrent transactions do not interfere with or overwrite each other
Durability
Committed transactions remain so in the event of power loss, crashes, etc.

CAP theorem
It states: a distributed system cannot guarantee all three of the following
simultaneously
Consistency: All nodes see the same data at the same time

Availability: The system is always on, i.e., every request receives a response indicating
whether it succeeded or failed.

Partition Tolerance: The system continues to operate despite partial system failure or message loss.
Amazon Web Services(AWS) offerings
Launched in 2006, AWS is used by thousands of companies today for cloud-based
computing services.
Includes an array of services: Remote Computing (EC2), Identity (IAM),
Database services (SimpleDB, RDS, ElastiCache, DynamoDB, etc.)
8 Regions and Availability Zones:

Code             Name
ap-northeast-1   Asia Pacific (Tokyo) Region
ap-southeast-1   Asia Pacific (Singapore) Region
ap-southeast-2   Asia Pacific (Sydney) Region
eu-west-1        EU (Ireland) Region
sa-east-1        South America (Sao Paulo) Region
us-east-1        US East (Northern Virginia) Region
us-west-1        US West (Northern California) Region
us-west-2        US West (Oregon) Region

Dynamo DB
DynamoDB is a managed, eventually consistent NoSQL database service that
supports highly available data access.
A flexible data model with key/attribute pairs; no schema required.
Optimized for availability while maximizing:
Data consistency
Durability
Performance

ALL THIS WITHOUT THE OPERATIONAL BURDEN
What is Fully Managed?
Never worry about:
Hardware provisioning
Cross-availability zone replication
Hardware and Software updates
Monitoring and handling of hardware failures
Replicas automatically generated.

DynamoDB Architecture
True distributed architecture
Data is spread across hundreds of servers called storage nodes
Hundreds of servers form a cluster in the form of a ring
A client application can connect using one of two approaches:
Routing through a load balancer
A client library that reflects Dynamo's partitioning scheme and can determine the
storage host to connect to
DynamoDB is designed to be an always-writable storage solution
Allows multiple versions of data on multiple storage nodes
Conflict resolution happens during reads and NOT during writes
Syntactic conflict resolution
Semantic conflict resolution

Put(key, context, object)
Determines where replicas of the object should be placed based on the key, and writes the replicas to disk
Context info is stored along with the object so that the validity of the context object can be verified
If at least W-1 other nodes respond to the coordinator, the write is successful
Get(key)
Locates all object replicas associated with the key in the storage system
Returns a single object, or a list of objects with conflicting versions, along with a context
The client reconciles divergent versions and supersedes the current version based on the context
The node handling a read or write operation is called the coordinator
For reads and writes, Dynamo uses a consistency protocol
The consistency protocol has 3 variables:
N: number of replicas of the data to be read or written
W: number of nodes that must participate in a successful write operation
R: number of nodes that must participate in a successful read operation
R + W > N gives a quorum-like system. Latency is determined by the slowest of the R (or W) replicas, so R and W are usually set to less than N
DynamoDB System Interface - get() and put()
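
The R + W > N rule can be checked by brute force. The sketch below is only an illustration, not Dynamo's implementation: the function names are hypothetical, and it enumerates every possible read set and write set to confirm that they always share at least one replica exactly when R + W > N.

```python
from itertools import combinations


def quorum_condition(n, r, w):
    """The rule from the slide: read and write sets are forced to overlap."""
    return r + w > n


def always_overlap(n, r, w):
    """Brute force: does every R-node read set intersect every W-node write set?"""
    nodes = range(n)
    return all(
        set(read_set) & set(write_set)
        for read_set in combinations(nodes, r)
        for write_set in combinations(nodes, w)
    )


def write_succeeds(acks_from_other_nodes, w):
    """The coordinator stores the object itself, so it needs W - 1 further acks."""
    return acks_from_other_nodes >= w - 1
```

With N = 3, R = 2, W = 2 every read is guaranteed to touch at least one replica that saw the latest write; with R = 1, W = 1 a read can miss it entirely.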
DynamoDB Architecture - Partitioning
Data is partitioned over multiple hosts called storage nodes (ring)
Uses consistent hashing to dynamically partition data across storage hosts
Two problems are associated with consistent hashing:
Hashing storage hosts to random ring positions can cause an imbalance of data and load
Consistent hashing treats every storage host as having the same capacity

[Figure: ring of storage hosts and the key ranges they own. Initial situation: A [3,4], B [1], C [2]. Situation after C left and D joined: A [4], B [1], D [2,3].]
DynamoDB Architecture - Partitioning
Solution: virtual hosts (tokens) mapped to physical hosts
The number of virtual hosts for a physical host depends on the capacity of the physical host
Virtual nodes are function-mapped to physical nodes
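
A minimal sketch of consistent hashing with virtual hosts follows. The `Ring` class, its method names, and the `host#i` token labels are assumptions made for illustration; MD5 is used only as a convenient way to spread positions around the ring. A host with more capacity is simply given more tokens.

```python
import bisect
import hashlib


def ring_position(label):
    """Map a string to a position on the ring (MD5 is an illustrative choice)."""
    return int(hashlib.md5(label.encode("utf-8")).hexdigest(), 16)


class Ring:
    def __init__(self):
        self._positions = []  # sorted token positions on the ring
        self._owner = {}      # token position -> physical host

    def add_host(self, host, tokens):
        """Place `tokens` virtual hosts for one physical host on the ring."""
        for i in range(tokens):
            position = ring_position(f"{host}#{i}")
            bisect.insort(self._positions, position)
            self._owner[position] = host

    def remove_host(self, host):
        """Drop every virtual host that belongs to the given physical host."""
        self._positions = [p for p in self._positions if self._owner[p] != host]
        self._owner = {p: h for p, h in self._owner.items() if h != host}

    def host_for(self, key):
        """Walk clockwise from the key's position to the next token."""
        position = ring_position(key)
        index = bisect.bisect_right(self._positions, position) % len(self._positions)
        return self._owner[self._positions[index]]
```

When a host leaves, only the keys it owned move to other hosts; every other key keeps its owner, which is the property that makes the scheme attractive for a cluster whose membership changes.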

[Figure: virtual hosts placed around the ring, each mapped to a physical host]
DynamoDB Architecture - Scaling
Gossip protocol is used for node membership
Every second, each storage node contacts a randomly chosen peer to bilaterally reconcile their persisted
membership history
In this way, membership changes spread and an eventually consistent membership view is
formed
When a new member joins, adjacent nodes adjust their object and replica ownership
When a member leaves, adjacent nodes adjust their object and replica ownership and take over the keys
owned by the leaving member
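
The gossip exchange can be simulated in a few lines. This is a toy model under stated assumptions, not Dynamo's code: each membership view is a hypothetical dictionary of `member -> (version, status)`, reconciliation keeps the entry with the higher version, and a seeded random generator stands in for the once-per-second peer selection.

```python
import random


def reconcile(view_a, view_b):
    """Merge two membership views, keeping the newer entry for each member."""
    merged = dict(view_a)
    for member, (version, status) in view_b.items():
        if member not in merged or version > merged[member][0]:
            merged[member] = (version, status)
    return merged


def gossip_round(views, rng):
    """Each node contacts one random peer; both adopt the reconciled view."""
    nodes = list(views)
    for node in nodes:
        peer = rng.choice([n for n in nodes if n != node])
        merged = reconcile(views[node], views[peer])
        views[node] = dict(merged)
        views[peer] = dict(merged)


def rounds_until_consistent(views, rng, max_rounds=1000):
    """Gossip until every node holds the same view; None if the cap is hit."""
    for round_number in range(1, max_rounds + 1):
        gossip_round(views, rng)
        first = next(iter(views.values()))
        if all(view == first for view in views.values()):
            return round_number
    return None
```

Starting from eight nodes that each know only themselves, repeated rounds spread every membership entry to every node without any central coordinator.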
[Figure: ring of nodes A through H, replication factor = 3]
DynamoDB Architecture - Data Versioning
Eventual consistency: data propagates asynchronously
It is possible to have multiple versions on different storage nodes
If the latest data is not available on a storage node, the client will update an older version of the data
Conflict resolution is done using vector clocks
A vector clock is metadata attached to the data when using get() or put()
A vector clock is effectively a list of (node, counter) pairs
One vector clock is associated with every version of every object
By examining vector clocks, one can determine whether two versions are causally ordered or are
parallel branches
The client specifies vector clock information when reading or writing data, in the form of a
context
Dynamo tries to resolve conflicts among multiple versions using syntactic reconciliation
If syntactic reconciliation does not work, the client has to resolve the conflict using semantic
reconciliation
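
The vector clock comparison described above can be sketched as follows. The function names and the node labels Sx, Sy, Sz are illustrative assumptions; a clock is modelled as a dictionary of `node -> counter`. One version supersedes another (syntactic reconciliation) when its clock is at least as large on every node; otherwise the versions are parallel branches and the client must merge them (semantic reconciliation).

```python
def increment(clock, node):
    """Return a copy of the clock with this node's counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def descends(a, b):
    """True if clock a has seen every update recorded in clock b."""
    return all(a.get(node, 0) >= counter for node, counter in b.items())


def compare(a, b):
    """Classify two versions by their vector clocks."""
    if a == b:
        return "equal"
    if descends(a, b):
        return "a supersedes b"
    if descends(b, a):
        return "b supersedes a"
    return "conflict"


def merge(a, b):
    """Clock for a client-reconciled version: the per-node maximum."""
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in set(a) | set(b)}


# Two writes handled by Sx, then concurrent writes handled by Sy and Sz.
d1 = increment({}, "Sx")
d2 = increment(d1, "Sx")
d3 = increment(d2, "Sy")
d4 = increment(d2, "Sz")
# The client reads both branches, reconciles them, and writes back through Sx.
d5 = increment(merge(d3, d4), "Sx")
```

Here d3 and d4 both descend from d2 but not from each other, so a read returns both; d5 supersedes both once the client has reconciled them.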


Strongly consistent vs Eventually Consistent
Multiple copies of the same item are kept to ensure durability
It takes time for an update to propagate to all copies
Eventually consistent
After a write, an immediate read may not return the latest value
Strongly consistent
Requires additional read capacity units
Returns the most up-to-date version of the item value
An eventually consistent read consumes half the read capacity units of a strongly consistent read.


How to use
AWS provides SDKs to interact with DynamoDB for the following
platforms/languages:
Android
iOS
Java
.NET
Python
PHP
Node.js
Ruby

In Java, there are two means to interact with your DynamoDB table:
AWS Management Console
AWS Toolkit for Eclipse
Management Console
Console create table
Inside Amazon DynamoDB
Primary key (mandatory for every table)
Hash or Hash + Range
Data model in the form of tables
Data stored in the form of items (name value attributes)
Secondary Indexes for improved performance
Local secondary index
Global secondary index
Scalar data type (number, string etc) or multi-valued data type (sets)

[Figure: a table contains items (64 KB max each); each item is a set of key=value attributes]
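
To make the data model concrete, here is a small in-memory sketch of a table with a hash or hash + range primary key. It does not use the AWS SDK; the `Table` class and its methods are hypothetical, and measuring item size as the length of its JSON encoding is a simplifying assumption rather than DynamoDB's actual size accounting.

```python
import json

MAX_ITEM_BYTES = 64 * 1024  # item size limit quoted in this deck


class Table:
    def __init__(self, hash_key, range_key=None):
        self.hash_key = hash_key
        self.range_key = range_key
        self._items = {}  # primary key tuple -> item

    def _primary_key(self, item):
        if self.range_key is None:
            return (item[self.hash_key],)
        return (item[self.hash_key], item[self.range_key])

    def put_item(self, item):
        """Store an item; attributes other than the primary key are free-form."""
        if len(json.dumps(item).encode("utf-8")) > MAX_ITEM_BYTES:
            raise ValueError("item exceeds the 64 KB limit")
        self._items[self._primary_key(item)] = item

    def get_item(self, hash_value, range_value=None):
        """Look up one item by its full primary key."""
        if self.range_key is None:
            return self._items.get((hash_value,))
        return self._items.get((hash_value, range_value))

    def query(self, hash_value):
        """Return all items sharing a hash key, ordered by range key if present."""
        matches = [i for k, i in self._items.items() if k[0] == hash_value]
        if self.range_key is None:
            return matches
        return sorted(matches, key=lambda item: item[self.range_key])
```

With a hash + range key, items that share a hash value are grouped together and can be retrieved in range-key order, while a hash-only key identifies exactly one item.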
Create Instance in Java
Using AWS SDK: AWS Toolkit for Eclipse
Pricing
In the free tier, AWS offers 100 MB of free data storage per month.
Read capacity: 1 read unit covers up to 4 KB of data
Write capacity: 1 write unit covers up to 1 KB of data
With eventually consistent reads, you get twice as many reads/sec
In the free tier you get 5 writes/sec and 10 reads/sec
That is 432,000 writes and 864,000 reads per day, or about 40 million data operations per
month.
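
The capacity arithmetic above can be worked through in code. This is a sketch using only the figures quoted in this deck (4 KB per read unit, 1 KB per write unit, eventually consistent reads at half cost); the function names are hypothetical and current AWS pricing rules may differ.

```python
import math

READ_UNIT_BYTES = 4 * 1024   # one read unit covers up to 4 KB
WRITE_UNIT_BYTES = 1 * 1024  # one write unit covers up to 1 KB
SECONDS_PER_DAY = 24 * 60 * 60


def read_units(item_bytes, reads_per_sec, strongly_consistent=True):
    """Read capacity needed; item size is rounded up to the next 4 KB."""
    total = math.ceil(item_bytes / READ_UNIT_BYTES) * reads_per_sec
    return total if strongly_consistent else math.ceil(total / 2)


def write_units(item_bytes, writes_per_sec):
    """Write capacity needed; item size is rounded up to the next 1 KB."""
    return math.ceil(item_bytes / WRITE_UNIT_BYTES) * writes_per_sec


def operations_per_day(per_second):
    """Sustained operations per day at a given per-second rate."""
    return per_second * SECONDS_PER_DAY


free_tier_writes_per_day = operations_per_day(5)
free_tier_reads_per_day = operations_per_day(10)
free_tier_ops_per_30_days = 30 * (free_tier_writes_per_day + free_tier_reads_per_day)
```

Note the rounding: a 5,000-byte item costs two read units per strongly consistent read even though it is only slightly larger than 4 KB, which is the throughput wastage mentioned under Limitations.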
Advantages
Easy administration: no h/w or s/w provisioning, setup or configuration
Flexible: secondary indexes allow queries on whichever attribute is indexed
Fast, predictable performance: SSDs, single-digit millisecond latency
Built-in fault tolerance
Strong consistency and atomic counters
Automatic data replication: data stored across 3 availability zones
Security: AWS Identity and Access Management (IAM)
CloudWatch alarms: alarms are raised when you are close to exceeding provisioned throughput or
storage.
Provisioned throughput/cost effective: pay for what you use

Limitations
64 KB limit on item size (row size)
1 MB limit on data fetched per request
Pay more if you want strongly consistent reads
Read size is rounded up to a multiple of 4 KB (wastes provisioned throughput on small items)
Cannot join tables
Indexes must be created at table creation time
No triggers or server-side scripts
Limited comparison capability (no not_null, contains, etc.)

References:
http://aws.amazon.com/dynamodb/
http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
http://www.allthingsdistributed.com/2012/01/amazon-dynamodb.html
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Introduction.html
Questions ?
Thank You !