
DAT309-R

Amazon Aurora storage demystified: How it all works

Murali Brahmadesam, Director of Engineering, Amazon Aurora, Amazon Web Services
Tobias Ternstrom, Director of Product Mgmt, Amazon Aurora, Amazon Web Services

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Agenda
• What is Amazon Aurora?
• Quick recap: Database internals & motivation for building Aurora
• Cloud-native database architecture
• Durability at scale
• Performance results

Features & demos
• Global databases
• Fast database cloning
• Database backtrack
What is Amazon Aurora?
Enterprise-class, cloud-native database

 Speed and availability of high-end commercial databases
 Simplicity and cost-effectiveness of open-source databases
 Drop-in compatibility with MySQL and PostgreSQL
 Simple pay-as-you-go pricing
 Delivered as a managed service


Quick recap: Database B+ tree

[Diagram: a B+ tree with a root node, two intermediate nodes, and four leaf nodes (Leaf 1–Leaf 4); the same nodes are shown again as a flat sequence of pages.]
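As a rough illustration of the structure sketched above, here is a minimal B+ tree lookup in Python. The Node class and the two-level example tree are purely illustrative; they are not MySQL's or Aurora's actual page format.

```python
# Minimal, illustrative B+ tree lookup: interior nodes route a key to the
# correct child; leaves hold the actual records (as on the slide's diagram).
import bisect

class Node:
    def __init__(self, keys, children=None, records=None):
        self.keys = keys          # separator keys (interior) or record keys (leaf)
        self.children = children  # child nodes, for interior nodes only
        self.records = records    # record values, for leaf nodes only

def search(node, key):
    # Descend from the root through intermediate nodes down to a leaf page.
    while node.children is not None:
        node = node.children[bisect.bisect_right(node.keys, key)]
    i = bisect.bisect_left(node.keys, key)
    return node.records[i] if i < len(node.keys) and node.keys[i] == key else None

# A tiny tree matching the slide: root, two intermediate nodes, four leaves.
leaves = [Node([1, 2], records=["a", "b"]), Node([3, 4], records=["c", "d"]),
          Node([5, 6], records=["e", "f"]), Node([7, 8], records=["g", "h"])]
inner = [Node([3], children=leaves[0:2]), Node([7], children=leaves[2:4])]
root = Node([5], children=inner)
print(search(root, 6))  # -> "f"
```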
Quick recap: DO-REDO-UNDO protocol

Data is modified "in place" in the buffer pool using a DO/REDO/UNDO operation.

Log records with before and after images are stored in a write-ahead log (WAL).

[Diagram: each DO, REDO, and UNDO operation transforms a page version and emits a log record.]
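A small sketch of this protocol, assuming a toy buffer pool and log; the LogRecord shape and the do/redo/undo helpers below are illustrative names, not engine internals.

```python
# Toy WAL sketch: modify the page in place (DO) and append a log record that
# carries both the before image (for UNDO) and the after image (for REDO).
from dataclasses import dataclass

@dataclass
class LogRecord:
    page_id: int
    before: bytes  # before image, used by UNDO
    after: bytes   # after image, used by REDO

log = []                          # the write-ahead log
buffer_pool = {1: b"old value"}   # pages cached in memory

def do(page_id, new_value):
    rec = LogRecord(page_id, buffer_pool[page_id], new_value)
    log.append(rec)                   # log first (write-ahead)...
    buffer_pool[page_id] = new_value  # ...then modify the page in place
    return rec

def redo(rec):
    buffer_pool[rec.page_id] = rec.after

def undo(rec):
    buffer_pool[rec.page_id] = rec.before

do(1, b"new value")
undo(log[-1])   # roll the change back using the before image
redo(log[-1])   # roll it forward again using the after image
```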
Quick recap: Crash recovery

[Diagram: pages on durable storage and log records on durable storage; a checkpoint is followed by transactions Tx1–Tx4, a system failure at time tf, and system recovery at time tr.]

• Tx2 and Tx3 are redone by using the REDO procedure
• Tx4 is undone by using the REDO/UNDO procedure
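The recovery rule on this slide can be sketched roughly as follows, assuming the log since the last checkpoint is just a list of records tagged with their transaction; this mirrors the idea (repeat history with REDO, then UNDO the losers), not any engine's actual recovery code.

```python
# Simplified crash-recovery sketch: REDO everything since the checkpoint,
# then UNDO, in reverse order, the changes of transactions that never committed.
pages = {}

def recover(log_after_checkpoint, committed_txns):
    for rec in log_after_checkpoint:            # repeat history (e.g. Tx2, Tx3)
        pages[rec["page"]] = rec["after"]
    for rec in reversed(log_after_checkpoint):  # roll back losers (e.g. Tx4)
        if rec["txn"] not in committed_txns:
            pages[rec["page"]] = rec["before"]

log_after_checkpoint = [
    {"txn": "Tx2", "page": 1, "before": "a", "after": "b"},
    {"txn": "Tx3", "page": 2, "before": "x", "after": "y"},
    {"txn": "Tx4", "page": 1, "before": "b", "after": "c"},
]
recover(log_after_checkpoint, committed_txns={"Tx2", "Tx3"})
print(pages)  # {1: 'b', 2: 'y'} -- Tx2/Tx3 redone, Tx4 undone
```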
Quick recap: I/Os required for persistence

[Diagram: pages on durable storage and log records on durable storage.]

• Log record write: typically a few bytes
• Torn-page protection write: page sized, e.g., 16KB
• Checkpoint write: page sized, e.g., 16KB

User data change size << I/O size (32KB+)

Databases are all about I/O
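A back-of-the-envelope illustration of that point; the 16KB page size and the "32KB+" total come from the slide, while the 100-byte row change is an assumed example.

```python
# Write amplification for a single small change, using the slide's numbers.
row_change       = 100        # bytes of user data actually modified (assumed)
log_write        = 100        # log record: typically a few bytes plus headers
torn_page_write  = 16 * 1024  # torn-page protection (e.g. double-write), page sized
checkpoint_write = 16 * 1024  # checkpoint, page sized

total_io = log_write + torn_page_write + checkpoint_write
print(f"~{total_io // 1024}KB of I/O for a {row_change}-byte change")  # ~32KB
```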


Traditional database architecture

Databases are all about I/O

[Diagram: a compute layer (SQL, Transactions, Caching, Logging) on top of attached storage.]

Design principles for > 40 years:
• Increase I/O bandwidth
• Decrease the number of I/Os!
Aurora approach: Log is the database

The page version at t5 can be created by applying the log records between t1 and t5 to the page version at t1.
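A hedged sketch of that idea: given the page version at t1 and the log records up to t5, the storage tier can materialize the page at t5 without the database ever writing a full page. The record format and the materialize function below are illustrative only.

```python
# "The log is the database": rebuild the t5 page version by applying the log
# records in (t1, t5] to the page version already held at t1.
def materialize(page_at_t1, log_records, t1, t5):
    page = dict(page_at_t1)
    for rec in sorted(log_records, key=lambda r: r["lsn"]):
        if t1 < rec["lsn"] <= t5:
            page[rec["slot"]] = rec["after"]
    return page

page_t1 = {"slot0": "a"}
log = [{"lsn": 2, "slot": "slot0", "after": "b"},
       {"lsn": 4, "slot": "slot1", "after": "c"},
       {"lsn": 5, "slot": "slot0", "after": "d"}]
print(materialize(page_t1, log, t1=1, t5=5))  # {'slot0': 'd', 'slot1': 'c'}
```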


Aurora approach: Offload checkpointing to the storage fleet
Problem 1:

Solution:

Problem 2:

Solution:
Aurora approach: Compute & storage separation

Compute and storage have different lifetimes.

Compute instances:
• fail and are replaced
• are shut down to save cost
• are scaled up/down/out on the basis of load needs

Storage, on the other hand, has to be long-lived.

Decouple compute and storage for scalability, availability, and durability.

[Diagram: a compute layer (SQL, Transactions, Caching, Logging) on top of network storage.]
Aurora uses a service-oriented architecture

We built a log-structured distributed storage system that is multi-tenant, multi-attach, and purpose-built for databases.

[Diagram: SQL, Transactions, and Caching remain in the database instance; Logging + storage move into the storage service.]
I/O flow in Amazon Aurora storage node

① Receive log records, add them to the in-memory queue, and durably persist them
② ACK to the database
③ Organize records and identify gaps in the log
④ Gossip with peers to fill in holes
⑤ Coalesce log records into new page versions
⑥ Periodically stage the log and new page versions to S3
⑦ Periodically garbage collect old versions
⑧ Periodically validate CRC codes on blocks

Note:
• All steps are asynchronous
• Only steps 1 and 2 are in the foreground latency path

[Diagram: a storage node with an incoming queue, update queue, hot log, data pages, peer-to-peer gossip with other storage nodes, scrubbing/GC, and continuous backup to S3.]
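A highly simplified model of this flow, using a toy in-memory queue; it is meant only to show that steps 1–2 are synchronous while coalescing and the later steps run asynchronously, not to reflect the storage node's real implementation.

```python
# Steps 1-2 (enqueue the log record and ACK) are in the foreground; everything
# else (sort/group, gossip, coalesce, backup, GC, scrub) happens in the background.
import queue, threading

incoming = queue.Queue()
pages = {}  # materialized page versions

def receive(log_record):
    incoming.put(log_record)   # 1. add to in-memory queue (durable persist omitted here)
    return "ACK"               # 2. acknowledge to the database instance

def background_worker():
    while True:
        rec = incoming.get()                  # 3./4. sort, group, gossip for gaps
        page = pages.setdefault(rec["page"], {})
        page.update(rec["delta"])             # 5. coalesce log records into page versions
        # 6.-8. stage to S3, garbage-collect old versions, verify CRCs (omitted)
        incoming.task_done()

threading.Thread(target=background_worker, daemon=True).start()
print(receive({"page": 1, "delta": {"slot0": "x"}}))  # returns "ACK" immediately
incoming.join()                                       # background work completes later
```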
Uncorrelated and independent failures

At scale there are continuous independent failures due to failing nodes, disks, and switches.

The solution is replication.

One common straw man:
• Replicate 3 ways with 1 copy per AZ
• Use write and read quorums of 2/3

[Diagram: a shared storage volume spanning Availability Zones 1–3, with storage nodes (SSDs) failing in each AZ.]
What about an AZ failure?
 Still have 2/3 copies
 Can establish quorum
 No data loss

[Diagram: the shared storage volume with one Availability Zone down.]
What about AZ + 1 failures?

Losing 1 node in an AZ while another AZ is down:
 Lose 2/3 copies
 Lose quorum
 Lose data

[Diagram: the shared storage volume with one Availability Zone down plus one storage node failed in another AZ.]
Aurora tolerates AZ + 1 failures
Replicate 6 ways with 2 copies per AZ
Write quorum of 4/6

What if an AZ fails?
 Still have 4/6 copies
 Maintain write availability

What if there is an AZ + 1 failure?
 Still have 3 copies
 No data loss
 Rebuild the failed copy by copying from the 3 remaining copies
 Recover write availability

[Diagram: the shared storage volume with two copies per Availability Zone.]
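A small sketch of the quorum arithmetic behind these bullets, assuming the 4/6 write quorum from the slide and the complementary 3/6 read quorum (so that any read set overlaps any previously acknowledged write set); node names are illustrative.

```python
# Six copies, two per AZ: a whole-AZ failure leaves 4 copies (writes continue);
# an "AZ + 1" failure leaves 3 copies (reads continue, no data loss).
from itertools import combinations

copies = {"az1-a", "az1-b", "az2-a", "az2-b", "az3-a", "az3-b"}
WRITE_QUORUM, READ_QUORUM = 4, 3

def survives(failed):
    alive = copies - failed
    return {"can_write": len(alive) >= WRITE_QUORUM,
            "can_read": len(alive) >= READ_QUORUM}

print(survives({"az1-a", "az1-b"}))           # AZ down: reads and writes both OK
print(survives({"az1-a", "az1-b", "az2-a"}))  # AZ + 1: reads OK, writes paused, no data loss

# Any 3-copy read set intersects any 4-copy write set, so reads always see
# the latest acknowledged write.
assert all(set(r) & set(w)
           for r in combinations(copies, READ_QUORUM)
           for w in combinations(copies, WRITE_QUORUM))
```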
Aurora uses segmented storage
➢ Partition the volume into 𝑛 fixed-size segments
  • Replicate each segment 6 ways into a protection group (PG)
➢ Trade-off between likelihood of faults and time to repair
  • If segments are too small, failures are more likely
  • If segments are too big, repairs take too long
➢ Choose the biggest size that lets us repair "fast enough"
  • We currently picked a segment size of 10 GB, as we can repair a 10-GB segment in less than a minute
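A quick sanity check on the repair claim in the last bullet; the 10-GB segment size and the sub-minute budget are from the slide, and the rest is arithmetic.

```python
# Minimum sustained throughput needed to re-replicate one segment within a minute.
segment_bytes = 10 * 1024**3   # 10-GB segment
repair_budget_s = 60           # "less than a minute"
min_throughput = segment_bytes / repair_budget_s
print(f"Need >= {min_throughput / 1024**2:.0f} MB/s")  # ~171 MB/s, comfortably
                                                       # within a modern NIC
```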
Fast and reversible membership changes

Use quorum sets and epochs to:
• Enable quicker transitions with epoch advances
• Create richer temporary quorums during changes
• Reverse changes by more quorum transitions

Epoch 1: All nodes (A, B, C, D, E, F) are healthy
Epoch 2: Node F is in a suspect state; a second quorum group (A, B, C, D, E, G) is formed with node G; both quorums are active
Epoch 3: Node F is confirmed unhealthy; the new quorum group with node G is active
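A hypothetical sketch of the epoch/quorum-set idea above: while F is suspect, a write must reach a quorum in both the old and the new quorum set, and advancing the epoch either finalizes the change or reverses it. This mirrors the concept, not Aurora's actual membership protocol.

```python
# Writes must satisfy every quorum set that is active in the current epoch.
QUORUM = 4

def write_ok(acks, active_quorum_sets):
    return all(len(acks & qs) >= QUORUM for qs in active_quorum_sets)

epoch1 = [{"A", "B", "C", "D", "E", "F"}]                                  # F healthy
epoch2 = [{"A", "B", "C", "D", "E", "F"}, {"A", "B", "C", "D", "E", "G"}]  # F suspect
epoch3 = [{"A", "B", "C", "D", "E", "G"}]                                  # F replaced by G

acks = {"A", "B", "C", "D"}  # 4 ACKs, none from F or G
print(write_ok(acks, epoch1), write_ok(acks, epoch2), write_ok(acks, epoch3))
# True True True -> the transition is invisible to writers, and reversing it is
# just another epoch advance back to a set that keeps F and drops G.
```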
Aurora I/O profile

[Diagram: MySQL with a replica writes the log, binlog, data, double-write buffer, and FRM files to Amazon EBS and its mirror in each AZ, ships changes asynchronously to the replica in another AZ, and backs up to Amazon S3; Aurora's primary and replica instances send only log records to the storage fleet across three AZs with a 4/6 quorum, plus continuous backup to Amazon S3.]

MySQL I/O profile for a 30-min Sysbench run:
• 780K transactions
• Average of 7.4 I/Os per transaction

Aurora I/O profile for a 30-min Sysbench run:
• 27M transactions: 35× more
• 0.95 I/Os per transaction (6× amplification): 7.7× less
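The ratios quoted above follow directly from the Sysbench counts on this slide:

```python
mysql_txns, mysql_ios_per_txn = 780_000, 7.4
aurora_txns, aurora_ios_per_txn = 27_000_000, 0.95

print(f"{aurora_txns / mysql_txns:.0f}x more transactions")                 # ~35x
print(f"{mysql_ios_per_txn / aurora_ios_per_txn:.1f}x fewer I/Os per txn")  # ~7.7-7.8x
```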


Write and read throughput
Aurora MySQL is 5× faster than MySQL

Write throughput:
• Aurora 5.7: 200,000
• Aurora 5.6: 170,000
• MySQL 5.7: 9,536
• MySQL 5.6: 5,592

Read throughput:
• Aurora 5.7: 705,000
• Aurora 5.6: 705,000
• MySQL 5.7: 290,787
• MySQL 5.6: 257,122

Using Sysbench with 250 tables and 200,000 rows per table on R4.16XL
Global physical replication

[Diagram: the primary region has a primary instance, replica instances, a replication server, a storage fleet with a 4/6 quorum, and continuous backup to Amazon S3; the secondary region has replica instances, a replication agent, its own storage fleet with a 4/6 quorum, and continuous backup to Amazon S3.]

① Primary instance sends log records in parallel to storage nodes, replica instances, and the replication server
② Replication server streams log records to the replication agent in the secondary region
③ Replication agent sends log records in parallel to storage nodes and replica instances
④ Replication server pulls log records from storage nodes to catch up after outages

High throughput: Up to 150K writes/second with negligible performance impact
Low replica lag: <1 second cross-region replica lag under heavy load
Fast recovery: <1 minute to accept full read-write workloads after region failure
Global replication performance
Logical vs. physical replication

[Charts: QPS and replica lag (seconds) for logical replication vs. physical replication; the lag axis spans 0–600 seconds for logical replication and only 0–5 seconds for physical replication at comparable QPS.]
Global databases

[Diagram: primary region (US West) with a primary instance, replica instance, replication server, storage fleet, and continuous backup to Amazon S3; secondary region (US East) with a replication agent, replica instance, storage fleet, and continuous backup to Amazon S3, following the global physical replication architecture above.]

Demo: continuous inserts in US West (writer), continuous reads in US East (reader)

Replication lag: ~110ms
(write at 2019-11-28 07:48:15.614124, read at 2019-11-28 07:48:15.725479)
Fast database cloning

Create a copy of a database without duplicate storage costs
• Creation of a clone is instantaneous because it doesn't require a deep copy
• Data copy happens only on write, when original and cloned volume data differ

Typical use cases
• Clone a production database to run tests
• Reorganize a database
• Save a point-in-time snapshot for analysis without impacting the production system

[Diagram: a production database with clones used for benchmarks and dev/test applications, including clones of clones.]
Database cloning: How does it work?

Both databases reference the same pages on the shared distributed storage system.

[Diagram: the source database and the cloned database each expose pages 1–4, all backed by the same pages in protection groups 1 and 2 of the shared distributed storage system.]
Database cloning: How does it work?

As the databases diverge, new pages are added appropriately to each database while both still reference the pages they have in common.

[Diagram: the source database now exposes pages 1–5 and the clone pages 1–6; the protection groups hold the shared pages plus the new page versions that belong to only one of the two databases.]
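A hedged copy-on-write sketch of the mechanism in these two diagrams; Storage and Volume below are illustrative stand-ins for the shared distributed storage system and a database's page map, not Aurora's data structures.

```python
# Cloning copies only the page map, not the pages; new page versions are
# written only when either database diverges.
class Storage:
    def __init__(self):
        self.versions = {}   # version_id -> page contents (shared, immutable)
        self._next = 0
    def put(self, data):
        self._next += 1
        self.versions[self._next] = data
        return self._next

class Volume:
    def __init__(self, storage, page_map=None):
        self.storage = storage
        self.page_map = dict(page_map or {})   # page_id -> version_id
    def read(self, page_id):
        return self.storage.versions[self.page_map[page_id]]
    def write(self, page_id, data):
        self.page_map[page_id] = self.storage.put(data)  # copy-on-write
    def clone(self):
        return Volume(self.storage, self.page_map)       # instantaneous, no deep copy

store = Storage()
source = Volume(store)
for i in range(1, 5):
    source.write(i, f"page{i}")

clone = source.clone()
print(clone.read(3))                        # "page3", the same shared version
source.write(3, "page3-v2")                 # source diverges
clone.write(5, "page5")                     # clone diverges
print(clone.read(3), len(store.versions))   # still "page3"; only 6 page versions stored
```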
Database State
• ~40M rows
• ~25GB space used

Clone

Database State (Cloned)
• Sees ~40M rows
• ~0.1GB space used in the clone (vs. 25GB)

Database State (Cloned)
• Update ~10M rows
• Now ~1.2GB used (vs. 25GB)
Database backtrack

[Diagram: a timeline t0–t4; rewinding to t3 makes t4 invisible, and rewinding again to t1 makes t2 and t3 invisible as well.]

Backtrack is a quick way to bring the database to a particular point in time without having to restore from backups
• Rewind the database to quickly recover from unintentional DML/DDL operations
• Rewind multiple times to determine the desired point in time in the database state; for example, quickly iterate over schema changes without having to restore multiple times
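Conceptually, backtrack follows from the same "log is the database" design: because earlier page versions are retained, rewinding to time t means serving each page at its latest version at or before t and treating later log records as invisible. A toy sketch of that idea (not the actual implementation):

```python
# Rewind by picking, for each page, the newest retained version at or before t.
def page_as_of(versions, t):
    """versions: list of (lsn, contents) for one page, in LSN order."""
    data = None
    for lsn, contents in versions:
        if lsn <= t:
            data = contents
    return data

history = [(0, "t0"), (1, "t1"), (2, "t2"), (3, "t3"), (4, "t4")]
print(page_as_of(history, 3))  # "t3" -> rewind to t3; t4 becomes invisible
print(page_as_of(history, 1))  # "t1" -> rewind again to t1; t2-t4 become invisible
```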
Database State (@07:05:07)
• Table with ~10K rows

Database State (@07:09:58)
• Added a column
• Added two rows

Oops! Let's backtrack to 07:06:00

Database State (after Backtrack)
• The added rows are missing
• Single column
• The good old state!
Publications
• Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases. In SIGMOD 2017.
• Amazon Aurora: On Avoiding Distributed Consensus for I/Os, Commits, and Membership Changes. In SIGMOD 2018.
Thank you!
Murali & Tobias
brahmade@amazon.com
tobiasql@amazon.com

