
CIS 5550: INTERNET & WEB SYSTEMS

Storage

February 20, 2023


Complex service, simple storage
[Diagram: applications see variable-size files with rich operations (read, write, append; move, rename; lock, unlock; ...); the operating system translates these onto a storage device that supports only reading and writing fixed-size blocks]

• PC users see a rich, powerful interface
  • Hierarchical namespace (directories); can move, rename, append to, truncate, (de)compress, view, delete files, ...
• But the actual storage device is very simple!
  • An HDD only knows how to read and write fixed-size data blocks
• The translation is done by the operating system



Analogy to cloud storage
[Diagram: a web service offers shopping carts, friend lists, user accounts, profiles, ...; underneath, everything is stored in a key/value store that supports only read, write, and delete]

• Many cloud services have a similar structure
  • Users see a rich interface (shopping carts, product categories, searchable index, ...)
  • But the actual storage service is very simple
  • Read/write 'blocks', similar to a giant hard disk
  • The translation is done by the web service



Key-value stores
Example (key, value) pairs:
  (bob, bschmitt@foo.com)
  (gettysburg, "Four score and seven years ago...")
  (29ck2dxa1, 0128ckso1$9#*!!8349e)
  (windows, [binary object, e.g., an image])

• The key-value store (KVS) is a simple abstraction for managing persistent state
• Data is organized as (key, value) pairs
• Three basic operations (sketched below):
  • PUT(key, value)
  • GET(key) → value
  • DELETE(key)
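A rough sketch of this interface in Java (illustrative only; the KVS type and byte[] values are our assumptions, not a real library):

  interface KVS {
      void put(String key, byte[] value);   // store or overwrite the value for a key
      byte[] get(String key);               // return the stored value, or null if absent
      void delete(String key);              // remove the key and its value
  }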
Examples of KVS
• Where have you seen this concept before?

• Conventional examples outside the cloud:
  • In-memory associative arrays and hash tables – limited to a single application, and persistent only until the program ends
  • On-disk indices (like BerkeleyDB)
  • Database management systems – essentially multiple KVSs, plus much more
  • "Inverted indices" behind search engines
  • Distributed hashtables (e.g., on top of Chord/Pastry)



Key-Multi-Value stores
• What if I want to have multiple values for the same key in a KVS?
  • Example: Multiple pages with the same search keyword

• Option 1: Make the "value" a collection object like a set
  • Then PUT really becomes GET → add → PUT (sketched below)

• Option 2: Allow the KVS to store multiple values per key
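A sketch of Option 1, assuming a hypothetical KVS client whose values are sets (all names here are ours):

  // "Adding" a value means: GET the set, add the new element, PUT it back
  Set<String> pages = kvs.get(keyword);            // GET the current collection
  if (pages == null) pages = new HashSet<>();      // first value for this key
  pages.add(newPageUrl);                           // add the new value locally
  kvs.put(keyword, pages);                         // PUT the whole set back
  // Caveat: without concurrency control, two clients doing this at the
  // same time can overwrite each other's additions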



Summary: Key-value stores
• KVS: A simple abstraction for managing persistent state
  • Interface consists only of PUT and GET (+ possibly DELETE)
  • Extremely scalable implementations exist
  • Some variants allow multiple values per key
  • Examples: Distributed hashtables, associative arrays, ...

• KVS are very widely used
  • Common storage abstraction for the Cloud



Plan for today
• Key-value stores
• KVS on the Cloud ← NEXT
• Sharding and coordination
• Case study: S3
• Case study: DynamoDB



Key-Value stores on the Cloud
• Many situations need hosting of large data sets
  • Examples: Amazon catalog, eBay listings, Facebook pages, ...

• Ideal: Abstraction of a 'big disk in the clouds', which would have:
  • Perfect durability – nothing would ever disappear in a crash
  • 100% availability – we could always get to the service
  • Zero latency from anywhere on earth – no delays!
  • Minimal bandwidth utilization – we only send what we absolutely need
  • Isolation under concurrent updates – make sure data stays consistent



The inconveniences of the real world
• Why isn't this feasible?

• The "cloud" exists over a physical network
  • Communication takes time, especially across the globe
  • Network capacity is limited, both on the backbone and at the endpoints

• The "cloud" has imperfect hardware
  • Hard disks crash
  • Servers crash
  • Software has bugs

• Can you map these to the previous desiderata?


Finding the right tradeoff
• In practice, we can't have everything
• ... but most applications don't really need 'everything'!

• Some observations:
  1. Read-only (or read-mostly) data is easiest to support. (Replicate everywhere! No concurrency issues! But only some kinds of data fit this pattern.)
  2. Granularity matters: "Few large-object" tasks generally tolerate longer latencies than "many small-object" tasks. (Fewer requests + often more processing at the client, but much more expensive to replicate or update!)
  3. Maybe it makes sense to develop separate solutions for large read-mostly objects vs. small read-write objects! (Different requirements → different technical solutions)



Specialized KVS
• Cloud KVS are often specialized for a particular tradeoff or usage scenario
• Example: Amazon's Simple Storage Service (S3)
  • large objects – files, virtual machines, etc.
  • assumes objects change infrequently
  • objects are opaque to the storage system
• Example: Amazon's DynamoDB
  • small objects – Java objects, records, etc.
  • generally updated more frequently; greater need for consistency
  • generally multiple attributes or properties, which are exposed to the storage system



Summary: KVS on the cloud
• Ideally, we would like the abstraction of a 'big disk in the cloud'
  • Perfect durability, availability, consistency, throughput, ...

• Practical constraints require compromises
  • Propagation delay, unreliable hardware/software, ...

• Hence, we need to make the right tradeoff
  • For example, specialize KVS for particular workloads
  • No one-size-fits-all solution; different solutions are useful in different situations



Plan for today
• Key-value stores
• KVS on the Cloud
• Sharding and coordination ← NEXT
• Case study: S3
• Case study: DynamoDB



Single-node KVS worker
• Implementing a single-node KVS is not difficult
• All you need is:
  • A server loop that accepts connections
  • A hashtable for the keys and values
  • A way to read and decode the requests
  • A way to execute the requests + respond
• But is this a good solution?

  void main() {
    HashMap<String,String> data = new HashMap<String,String>();
    ServerSocket ssock = new ServerSocket(1234);
    while (true) {
      Socket sock = ssock.accept();          // accept the next connection
      Request r = readRequest(sock);         // read and decode the request
      if (r.isGET()) {
        send(sock, data.get(r.key()));       // look up the value and return it
      } else if (r.isPUT()) {
        data.put(r.key(), r.value());        // store the new value
        send(sock, "OK");
      } else {
        send(sock, "ERROR");
      }
      sock.close();                          // done with this connection
    }
  }



What would it take to scale this?
• We need multiple workers (potentially lots of them!)

• This requires:
  • A way to divide up the data between the workers
  • A way for the clients to find the worker that has the data they need
  • A way to add or remove workers, and handle worker failures



Sharding
[Figure: a key ring from 0 to 2^128-1]

• Sharding is a simple way to divide up the work in a large system
  • Each object and each worker has a key from a large space (say, a 0..2^128-1 ring)
  • Each worker is responsible for a certain range of keys

• Example: Consistent hashing
  • If a worker X has key A and its successor has key B, X could be responsible for key range [A,B) (see the lookup sketch below)
  • More about this in another lecture
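A minimal sketch of this lookup, assuming worker positions are kept in a sorted map (the Worker type and the hashing scheme are our assumptions):

  // Map from each worker's position on the ring to the worker itself
  TreeMap<BigInteger, Worker> ring = new TreeMap<BigInteger, Worker>();

  // The responsible worker is the one with the greatest position <= the
  // key's hash (the start A of its range [A,B)); if there is none, we
  // wrap around to the worker with the largest position on the ring
  Worker lookup(BigInteger keyHash) {
      Map.Entry<BigInteger, Worker> e = ring.floorEntry(keyHash);
      return (e != null) ? e.getValue() : ring.lastEntry().getValue();
  }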



Coordination
• Sharding requires some coordination
  • Workers need to know who else is in the system (e.g., who their successor is)
  • Clients need to know which worker is responsible for a given key

• Simple approach: Have a central master node
  • Workers can check in periodically with the master (to demonstrate liveness; see the sketch below)
  • The master can assign each worker a range of keys
  • Clients can download a list of ranges and workers from the master

• How does this affect scalability and robustness?
  • What are some alternatives?
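For instance, liveness could be demonstrated with a simple heartbeat loop on each worker (a sketch; the master API and the 5-second interval are our assumptions):

  // Periodically tell the master we are still alive; if the heartbeats
  // stop, the master can reassign this worker's key range
  while (true) {
      master.heartbeat(myWorkerId, myAddress);   // hypothetical master call
      Thread.sleep(5000);                        // e.g., every 5 seconds
  }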



Replication
[Figure: the 0..2^128-1 key ring, with each value replicated on several workers]

• If each value is stored on a single worker, a worker failure will cause data loss

• We can fix this with replication
  • Multiple copies of each value, on different workers
  • For instance, if worker X has key A and its successors have keys B and C, X could be responsible for [A,C) (see the sketch below)

• Possible consistency issues!
  • Which consistency guarantees do we want (if any)?
  • How many replicas should we have? And can data loss still happen?
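A sketch of how a replicated PUT might work (the replication factor and the helper functions are our assumptions):

  final int REPLICATION_FACTOR = 3;   // assumption: three copies of each value

  // Write the value to the responsible worker and to the next
  // REPLICATION_FACTOR-1 workers along the ring
  void replicatedPut(String key, String value) {
      BigInteger h = hash(key);                             // position on the ring
      for (Worker w : successors(h, REPLICATION_FACTOR))    // responsible worker + successors
          w.put(key, value);   // if some of these PUTs fail, replicas can diverge!
  }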



Persistence
• Where should the values be stored?
  • Only in memory? Or on disk as well?

• In a surprising number of use cases, memory is just fine!
  • Example: KVS is a cache for data that is stored elsewhere (e.g., memcached)
  • Example: KVS contains session state that could be rebuilt if necessary

• If we want persistence, the data has to go to disk
  • There are lots of ways to do this (see another lecture)



Joins and departures
• What if a new worker is added?
  • Need to move some of the data from existing workers to the new worker (see the sketch below)
  • Otherwise the data can't be found!

• What if an existing worker fails?
  • Not immediately dangerous, if other replicas remain
  • But: Need to reestablish the invariant that each value is replicated a certain number of times!
  • Some of the remaining workers have to 'take over' the vacant parts of the key space & create replicas
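A sketch of the data movement on a join, using the [A,B) convention from the sharding slide (all helper methods are our assumptions):

  // New worker N (key C) joins between worker X (key A) and X's old
  // successor (key B); X was responsible for [A,B) and keeps only [A,C)
  void onJoin(Worker x, Worker n, BigInteger b, BigInteger c) {
      for (String key : x.keysInRange(c, b))   // keys that now belong to N: [C,B)
          n.put(key, x.get(key));              // copy them to the new worker
      x.deleteRange(c, b);                     // ... then drop them from X
  }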



How scalable is this?
• Sharded KVS can be extremely scalable
  • Why?

• Reason: No cross-shard operations!
  • Single-key GET and PUT are local to a specific shard
  • No coordination necessary!

• Not necessarily true for fancier designs
  • Example: Cross-shard transactions
  • Fully-fledged distributed databases (with cross-shard joins, etc.) are much harder to scale



Summary: Sharding and coordination
• Sharding is a way to distribute data across multiple workers
  • Data has a primary key, and each worker is responsible for a key range

• This requires some coordination
  • Both clients and workers need to know who is responsible for a given key
  • Simple implementation: Central 'master' node

• If membership is not static, additional steps are required
  • Replication can take care of worker failures (more in other lectures)
  • Joins and departures may require redistributing some of the data



Plan for today
• Key-value stores
• KVS on the Cloud
• Sharding and coordination
• Case study: S3 ← NEXT
• Case study: DynamoDB



Big Objects: Amazon S3
• S3 = Simple Storage Service

• Stores large objects (=values) that may have access permissions
  • Used in various "cloud backup" services
  • Used to distribute software packages
  • Used internally by Amazon to store virtual machines
  • "Up to 99.999999999% durability over a given year, and 99.99% availability" ("eleven nines" and "four nines")



S3: Key concepts
• S3 consists of:
  • objects – named items stored in S3
  • buckets of objects – think of these as volumes in a filesystem
  • the console includes a notion of folders, but these are not intrinsic to S3

• Names within a bucket must uniquely identify a single object
  • i.e., keys must be unique



S3: Keys and objects
• What can we use as keys?
  • Keys can be any string
• What can we use as objects?
  • Objects can be from 1 byte to 5 TB, in any format
  • Number of objects is 'unlimited'
• Where can objects be stored?
  • Can be assigned to specific geographic regions (Washington, Virginia, California, Ireland, Singapore, Tokyo, ...)
  • Why is this important? (name at least four reasons!)
    - low latency to the customer
    - regulatory/legal requirements
    - minimizing fault correlation
    - low-storage-cost regions



S3: Different ways to access objects
• Objects in S3 can be accessed
  • ... via REST
  • ... via BitTorrent
  • ... over the web: http://s3.amazonaws.com/bucket/key (see the sketch below)
  • Web Services use HTTP (the Web browser protocol over sockets) and XML to send requests and data
  • AWS Console also enables configuration

• We'll mostly be using Java(script) libraries to interact with S3
  • You'll just call normal functions; they will open and close sockets as necessary
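As a sketch, a publicly readable object could be fetched with nothing more than an HTTP GET (the bucket and key here are made up):

  URL url = new URL("http://s3.amazonaws.com/my-bucket/my-key");
  try (InputStream in = url.openStream()) {    // plain HTTP request
      byte[] object = in.readAllBytes();       // the object's raw bytes
  }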



S3: Access permissions
• Permissions are assigned through Access Control Lists (ACLs)
  • Essentially, a list of users/groups → permissions
  • Bucket permissions are inherited by objects unless overridden at the object level

• What can you control?
  • Can be set at the level of buckets or individual objects
  • Available rights: Read, write, read ACL, write ACL
  • Possible grantees: Everyone, authenticated users, specific users (by AWS account email address)



S3: Pricing and usage
• You'll pay for:
  • Storage (mostly per GB)
  • Requests (PUT is more expensive than GET)
  • Network bandwidth (upload is free, download costs per GB)
  • Various management operations, such as inventory or tagging



S3: Object operations
• Standard PUT/GET/DELETE
• Modify object permissions
• A few specialized operations (copy, undelete, ...)

• The key issue: How do we manage concurrent updates?
  • Will I see objects you delete? the latest version? etc.



S3: Consistency models
• Consistency model depends on the operation
  • read-after-write consistency for PUTs of new objects
  • eventual consistency for overwrite PUTs and DELETEs

[Timeline figure: Client 1 performs write W1 ("Cat") and then read R1; Client 2 concurrently performs write W2 ("Dog") and then read R2]

• Read-after-write consistency:
  • Each read or write operation becomes effective at some point between its start time and its completion time
  • Reads return the value of the last effective write



S3: Versioning
• S3 uses versioning for consistency, rather than locking
  • The idea: every bucket + key maps to a list of versions
    [bucket+key] → [object v1] [object v2] [object v3] ...
  • Each time we PUT an object, it gets a new version
    • The last-received PUT overwrites any previous ones!
  • When we GET:
    • An unversioned request likely receives the last version – but this is not guaranteed, depending on propagation delays
    • A request for bucket+key+version uniquely maps to a single object!
  • Versioning can be enabled for each bucket
    • Why would you (not) want versioning?



Summary: Amazon S3
• A key-value store for large objects
  • Buckets, keys, objects, folders
  • Various ways to access objects, e.g., HTTP and BitTorrent

• Supports versioning and access control
  • Access control is based on ACLs



Plan for today
• Key-value stores
• KVS on the Cloud
• Sharding and coordination
• Case study: S3
• Case study: DynamoDB ← NEXT



What is Amazon DynamoDB?
• A highly scalable, non-relational data store
  • Despite its name, not really a database
  • Stronger consistency guarantees than S3
  • Highly scalable; built-in replication; automatic indexing
  • Fine-grained access control
  • No 'real' transactions, just a conditional put/delete
  • No 'real' relations, just a fairly basic select

[Spectrum: KVS ←→ RDBMS, with S3 at the KVS end, DynamoDB in the middle, and RDS at the RDBMS end]



A bit of history
• Early 2000s: Amazon is bursting at the seams
  • So far, it had relied on commercially available technologies
  • But: various outages were directly attributable to scalability limits, e.g., during the 2004 holiday shopping season
  • Need an ultra-scalable, highly reliable KV database!
• 2007: Dynamo paper published at SOSP
  • Used to power various core Amazon services, such as S3
  • But many developers preferred SimpleDB
  • Dynamo was never adopted much beyond the core services. Reason: Operational complexity. SimpleDB "just works"!
  • But SimpleDB has a number of important limitations (e.g., some operations assume that all of a table's data is on a single server!)



A bit of history
 2012: DynamoDB introduced
 Combines "lessons learned" from both SimpleDB and Dynamo
 Inherits scalability & high performance from Dynamo
 Inherits SimpleDB's richer data model & ease of use



Data model
Example table ("Customer ID" is the key; the remaining columns are attributes, each of which can be multi-valued; the rows are items):

  Customer ID | First name | Last name | Street address | City        | State | Zip   | Email
  123         | Bob        | Smith     | 123 Main St    | Springfield | MO    | 65801 |
  456         | James      | Johnson   | 456 Front St   | Seattle     | WA    | 98104 | james@foo.com

• Somewhat analogous to a spreadsheet:
  • Tables: Analogous to buckets
  • Items: Names with attribute-multivalue sets
    • For example, an item could have more than one street address
  • It is possible to add attributes later
  • 'No' pre-defined schema



Keys and indexes
• Each table must have a partition key
  • Either one attribute (hash), or two (hash+range)
  • Must be unique, and must be included in every request
  • This is used internally to create a kind of index
  • You can also create secondary indexes; these are specified when the table is created
• Tables can also have an (optional) sort key
  • Used when there are multiple items with the same partition key
  • The combination of partition key and sort key must be unique (primary key)
• Values have one of several basic data types
  • Strings (S), Numbers (N), Binary (B), Boolean (BOOL), List (L), Map (M)
  • Plus various set types (BS, NS, SS)



Basic operations
• ListTables, CreateTable, DeleteTable
• DescribeTable, UpdateTable
• GetItem / PutItem
  • Also Batch{Get,Write}Item
  • Supports conditional put; can choose strong (per-key) or eventual consistency
  • Requires the full primary key (partition+sort) → see HW1MS1
• UpdateItem / DeleteItem
• Query
  • Requires only the partition key; can refine the search, e.g., using filters
• Scan
  • WARNING: Expensive! Conceptually, reads through the entire table!



PutItem and GetItem
• PutItem has a very simple model:
  • Specify the table and the primary key
  • [key] → [list of name/value pairs], where we list Attribute.1.Name, Attribute.1.Value, etc.

• GetItem (sketched below)
  • Specify the table and the primary key
  • Can have a 'projection expression' that specifies which attributes to get
  • Can choose whether the read should be consistent or not
    • What are the advantages of each choice?
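A sketch of both calls with the AWS SDK for Java v2 (the table and attribute names are made up; ddb is a DynamoDbClient):

  // PutItem: write an item as a map of attribute names to typed values
  Map<String, AttributeValue> item = Map.of(
      "CustomerID", AttributeValue.builder().s("123").build(),          // primary key
      "Email",      AttributeValue.builder().s("bob@foo.com").build());
  ddb.putItem(PutItemRequest.builder().tableName("Customers").item(item).build());

  // GetItem: specify the full primary key; optionally restrict the
  // attributes returned and request a strongly consistent read
  GetItemResponse resp = ddb.getItem(GetItemRequest.builder()
      .tableName("Customers")
      .key(Map.of("CustomerID", AttributeValue.builder().s("123").build()))
      .projectionExpression("Email")   // fetch only this attribute
      .consistentRead(true)            // strong (per-key) consistency
      .build());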



Conditional Put
• DynamoDB also supports a conditional put
  • The item is updated only if a certain predicate holds for the existing item
  • Can test for presence/absence of attributes, attribute values, ...

• Can we use this to guarantee consistency?
  • Idea: implement a version number, e.g., like this (pseudocode; the kvs.* helpers are not a real API):

    do {
      List<Attributes> attribs = kvs.getAttributesFor(key);
      ... update the attribute values as we like ...
      // the put succeeds only if "version" still has the value we read;
      // otherwise someone updated the item concurrently, and we retry
      retCode = kvs.conditionalPut(key, attribs,
                    ("version", attribs.get("version") + 1));
    } while (retCode == ErrorCode.ConditionalCheckFailed);



Query
• Used to retrieve items via the primary index or a secondary index
  • Can ask for a specific hash key, or a certain range of range-key values for a specific hash key. Example: Query the "DiscussionForum" table for a particular "Forum" name (hash key), or for a particular "Forum" (hash key) with a certain range of post times (range key) – see the sketch below
  • Can filter results (similar to SQL's "where")
  • Can specify which attributes to return (similar to SQL's "select a,b,c from ...")
  • Can choose whether or not reads should be consistent
  • Supports a cursor (via ExclusiveStartKey/LastEvaluatedKey)
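A sketch of the forum example with the AWS SDK for Java v2 (the exact attribute names are our assumption; ddb is a DynamoDbClient):

  // All posts in one forum (hash key) within a range of post times (range key)
  QueryResponse resp = ddb.query(QueryRequest.builder()
      .tableName("DiscussionForum")
      .keyConditionExpression("ForumName = :f AND PostTime > :t")
      .expressionAttributeValues(Map.of(
          ":f", AttributeValue.builder().s("CIS 5550").build(),
          ":t", AttributeValue.builder().n("20230101000000").build()))
      .consistentRead(true)   // or false for an eventually consistent read
      .build());
  // resp.lastEvaluatedKey() acts as a cursor: pass it back as
  // exclusiveStartKey to fetch the next page of results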



Scan
• Used to retrieve arbitrary items
  • Internally performs a scan of the entire (!) table, then applies whatever filters and projections you specify
  • Flexible, but expensive!



Summary: DynamoDB
• DynamoDB is a highly scalable, non-relational data store
  • A compromise between SimpleDB's simplicity and Dynamo's scalability

• DynamoDB has a structured data model, and a fairly rich API
  • Rows ('items') with attribute-multivalue sets
  • No predefined schema
  • Partition keys, sort keys, secondary indexes
  • Some richer operations (Conditional Put, Query, Scan)

