CC - W1 (Intr To CC)

Cloud Computing
What is Cloud Computing?

What is Cloud Computing? (1/4)
• Cloud Computing is an on-demand model

• Shared pool of computing resources
– Servers
– Storage
– Applications
– Services
• Rapidly provisioned
• Rapidly released
• Minimal management effort or service-provider interaction
• Other definitions also exist
• Cloud computing is the delivery of hosting services that are
provided to a client over the Internet.
- Enables large-scale services without up-front investment.
Informal: computing with large datacenters
Our focus: computing as a utility
» Outsourced to a third party or internal org
Different Models of Cloud Computing
Deployment Model
• There are four primary cloud deployment models:
- Public Cloud
- Private Cloud
- Community Cloud
- Hybrid Cloud
Public Clouds
• Public clouds are owned by cloud service
providers who charge for the use of cloud
resources.
• Basic characteristics:
- Homogeneous infrastructure, Common policies
- Shared resources and multi-tenancy
- Leased or rented infrastructure
- Economies of scale
• Examples include:
- AWS/EC2 (Amazon)
- Azure (Microsoft)
- Google Cloud Platform
Private Clouds
• The cloud infrastructure belongs to and is
operated by only one organization.
• Basic characteristics :
- Heterogeneous infrastructure; Customized policies
- Dedicated resources
- In-house infrastructure; End-to-end control
• Examples include:
Other types of Clouds
• Community cloud
- The cloud infrastructure is shared by several
organizations and supports a specific community that
has shared concerns (e.g., mission, security
requirements, policy and compliance considerations).
• Hybrid cloud
- The cloud infrastructure is a composition of two or more
clouds (private, community, or public) that remain
unique entities but are bound together by standardized
or proprietary technology that enables data and
application portability.
Types of Cloud Services
Infrastructure as a Service (IaaS): VMs, disks
Platform as a Service (PaaS): Web, MapReduce
Software as a Service (SaaS): Email, GitHub
Public vs private clouds:
Shared across arbitrary orgs/customers vs internal to one organization
IaaS, PaaS and SaaS
• Infrastructure as a Service (IaaS)
• Platform as a Service (PaaS)
• Software as a Service (SaaS)
[Figure: layered view of the three service models]
- IaaS: Infrastructure (Servers · Storage · Network)
- PaaS: Platform (OS & Application Stack) on top of Infrastructure (Servers · Storage · Network)
- SaaS: Applications (Packaged Software) on top of OS & Application Stack and Infrastructure (Servers · Storage · Network)
Spectrum of Cloud Users
Image credit:
http://blogs.msdn.com/b/seliot/archive/2010/03/04/what-the-heck-is-cloud-computing-another-re-look-with-pretty-pictures.aspx
Cloud Service Models
Software as a Service (SaaS), Platform as a Service (PaaS), Infrastructure as a Service (IaaS)
Example
AWS Lambda functions-as-a-service
» Runs functions in a Linux container on events
» Used for web apps, stream processing, highly parallel MapReduce and video encoding
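An event-driven function of this kind can be sketched as follows. The `handler(event, context)` signature is the one AWS Lambda uses for Python functions; the event shape (a dict with an optional `name` key) and the response fields are assumptions made only for illustration.

```python
import json


def handler(event, context):
    """Entry point invoked once per event (hypothetical event shape)."""
    # Assumption: the triggering event is a dict that may carry a "name" key.
    name = event.get("name", "world")
    body = {"message": f"Hello, {name}!"}
    # Return a JSON-serializable response, as an HTTP-style trigger would expect.
    return {"statusCode": 200, "body": json.dumps(body)}


if __name__ == "__main__":
    # Local invocation; on the platform, the runtime supplies event and context.
    print(handler({"name": "cloud"}, None))
```

The function holds no state between calls, which is what lets the platform run many copies in parallel, one per event.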
Cloud Software Stack
[Figure: layers of the cloud software stack, with example systems]
- Web Server: Java, PHP, JS, …
- Cache: memcached, TAO, …
- Operational Stores: SQL, Spanner, Dynamo, Cassandra, BigTable, …
- Other Services: model serving, search, Unicorn, Druid, …
- Message Bus: Kafka, Kinesis, …
- Analytics UIs: Hive, Pig, HiPal, …
- Analytics Engines: MapReduce, Dryad, Pregel, Spark, …
- Metadata: Hive, AWS Catalog, …
- Coordination: Chubby, ZK, …
- Distributed Storage: Amazon S3, GFS, Hadoop FS, …
- Resource Manager: EC2, Borg, Mesos, Kubernetes, …
- Security (e.g. IAM)
- Metering + Billing
Example: Web Application
[Figure: cloud software stack diagram]
Example: Analytics Warehouse
[Figure: cloud software stack diagram]
Components Offered as PaaS
[Figure: cloud software stack diagram]
Cloud Computing Properties & Essentials
Cloud Properties (1/2)
• Resource efficiency: computing and network
resources are pooled to provide services to
multiple users. Resource allocation is
dynamically adapted according to user demand.
• Elasticity: computing resources can be rapidly
and elastically provisioned to scale up, and
released to scale down based on consumer’s
demand.
Cloud Properties (2/2)
• Self-managing services: a consumer can
provision cloud services, such as web
applications, server time, processing, storage and
network as needed and automatically without
requiring human interaction with each service’s
provider.
• Accessible and highly available: cloud
resources are available over the network
anytime and anywhere and are accessed
through standard mechanisms that promote use
by different types of platforms (e.g., mobile
phones, laptops, and PDAs).
Cloud Computing Essentials
• Cloud computing is Utility Computing
- Cloud services are controlled and monitored by the
cloud provider through a pay-per-use business model.
• An ideal cloud computing platform is:

- efficient in its use of resources
- scalable
- elastic
- self-managing
- highly available and accessible
- inter-operable and portable
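Over or Under-Provisioning
[Figure: capacity vs. demand under static provisioning]
- Over-provisioning: shaded area is unused capability.
- Under-provisioning: shaded area represents requests not served.
- Unserved users leave over time: less and less demand.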
Dynamic Provisioning
• In the traditional computing model, two common problems arise:
- Overestimating system utilization, which results in over-provisioning
- Underestimating system utilization, which results in under-provisioning
[Figure: three plots of resources vs. time (days) showing capacity and demand; under-provisioning leads to lost revenue and lost users]
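The two shaded areas can be computed directly for a fixed capacity. This is a minimal sketch: the demand series and the two capacity levels are made-up numbers in arbitrary resource units, not data from the slides.

```python
def provisioning_gap(demand, capacity):
    """Return (unused_capacity, unserved_demand) summed over all time steps."""
    unused = sum(max(capacity - d, 0) for d in demand)
    unserved = sum(max(d - capacity, 0) for d in demand)
    return unused, unserved


# Hypothetical demand over six time steps (arbitrary resource units).
demand = [20, 40, 90, 100, 50, 30]

over = provisioning_gap(demand, capacity=100)  # capacity sized for the peak
under = provisioning_gap(demand, capacity=50)  # capacity sized below the peak
print(over, under)
```

Sizing for the peak leaves no request unserved but wastes capacity in every other step; sizing below the peak wastes less but drops requests. Dynamic provisioning aims to shrink both numbers by letting capacity follow demand.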
Real world Estimates
• Average server utilization is 5% to 20%.
• Peak workload exceeds the average by factors of
2 to 10.
• Users provision for the peak.
• Peak loads may occur based on the time of day
or based on other factors (e.g. photo sharing
after the holidays, drop/add within two weeks of
start of term, etc.)
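A rough link between the peak-to-average factor and utilization, as a sketch: it assumes capacity is sized exactly for the peak with no extra headroom, so average utilization is simply average load divided by peak load.

```python
def utilization_when_sized_for_peak(peak_to_average):
    """Average utilization if capacity equals the peak load (no headroom)."""
    return 1.0 / peak_to_average


# Factors taken from the 2x to 10x range quoted above.
for factor in (2, 5, 10):
    print(factor, f"{utilization_when_sized_for_peak(factor):.0%}")
```

Under this simple model a peak of 10 times the average already caps utilization at 10%.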
Cloud Economics: For Users
Elasticity:
» Using 1000 servers for 1 hour costs the same as
1 server for 1000 hours
» Same price to get a result faster!
[Figure: two plots of resources vs. time]
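The elasticity argument is plain server-hour arithmetic. A minimal sketch, assuming billing is strictly per server-hour and the job parallelizes perfectly; the price is a hypothetical placeholder.

```python
def cost(servers, hours, price_per_server_hour):
    """Pay-per-use cost: billed by server-hours consumed."""
    return servers * hours * price_per_server_hour


PRICE = 0.10  # hypothetical price in dollars per server-hour

wide = cost(servers=1000, hours=1, price_per_server_hour=PRICE)
narrow = cost(servers=1, hours=1000, price_per_server_hour=PRICE)
print(wide, narrow)
```

Both runs consume 1000 server-hours, so the bill is identical while the wide run finishes in one hour instead of a thousand.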
Cloud Economics: For Providers
Economies of scale:
» Purchasing, powering, and managing machines at
scale gives lower per-unit costs than customers
can achieve on their own
Other Interesting Features
Spot market for preemptible machines
Reserved instances and RI market
Ability to quickly try exotic hardware
Common Cloud Applications
1. Web/mobile applications
2. Data analytics (MapReduce, SQL, ML, etc)
3. Stream processing
4. Batch computation (HPC, video, etc)

Datacenter Hardware
[Figure: datacenter hardware components]
- 2-socket server
- >10GbE NIC
- Flash Storage
- JBOD disk array
- GPU/accelerators
- >10GbE Switch
Datacenter Hardware
Rows of rack-mounted servers
Datacenters hold 50K – 200K servers and draw 10 – 100 MW
Storage: distributed with compute or NAS systems
Remote storage access for many use cases (why?)
Hardware Heterogeneity
[Facebook server configurations]
Custom-design servers
Configurations optimized for major app classes
Few configurations to allow reuse across many apps
Roughly constant power budget per volume
Useful Latency Numbers
Initial list from Jeff Dean, Google
L1 cache reference                         0.5 ns
Branch mispredict                            5 ns
L3 cache reference                          20 ns
Mutex lock/unlock                           25 ns
Main memory reference                      100 ns
Compress 1K bytes with Snappy            3,000 ns
Send 2K bytes over 10GbE                 2,000 ns
Read 1 MB sequentially from memory     100,000 ns
Read 4KB from NVMe Flash                50,000 ns
Round trip within same datacenter      500,000 ns
Disk seek
Read 1 MB sequentially from disk
Send packet CA → Europe → CA
Useful Throughput Numbers
DDR4 channel bandwidth       20 GB/sec
PCIe gen3 x16 channel        15 GB/sec
NVMe Flash bandwidth         2 GB/sec
GbE link bandwidth           10 – 100 Gbps
Disk bandwidth               6 Gbps
NVMe Flash 4KB IOPS          500K – 1M
Disk 4K IOPS                 100 – 200
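To get a feel for these bandwidths, the sketch below estimates how long it takes to move 1 TB at each rate. It assumes decimal units (1 GB = 10^9 bytes, 1 TB = 10^12 bytes) and sustained peak bandwidth, which real workloads rarely reach.

```python
GB = 10**9  # bytes, decimal units assumed

# Bandwidths from the list above, converted to bytes per second.
BANDWIDTH_BYTES_PER_SEC = {
    "DDR4 channel": 20 * GB,
    "PCIe gen3 x16": 15 * GB,
    "NVMe Flash": 2 * GB,
    "Disk (6 Gbps)": 6 * GB / 8,  # bits per second converted to bytes
}


def transfer_seconds(num_bytes, bytes_per_sec):
    """Time to move num_bytes at a sustained rate of bytes_per_sec."""
    return num_bytes / bytes_per_sec


ONE_TB = 10**12
for name, bandwidth in BANDWIDTH_BYTES_PER_SEC.items():
    print(f"{name}: {transfer_seconds(ONE_TB, bandwidth):.1f} s")
```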
Performance Metrics
Throughput
Requests per second
Concurrent users
Gbytes/sec processed
...
Latency
Execution time
Per request latency
Tail Latency
[Dean & Barroso,’13]
The 95th or 99th percentile request latency
End-to-end with all tiers included
Larger scale → more prone to high tail latency
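A small sketch of both ideas: reading a percentile off a set of latency samples, and why scale hurts the tail. The latency samples, the 1% slow-call probability, and the fan-out of 100 are illustrative values, and the fan-out formula assumes the calls are slow independently of each other.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the value at rank ceil(p/100 * n) in sorted order."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def prob_any_slow(fanout, p_slow):
    """Probability that at least one of `fanout` independent calls is slow."""
    return 1 - (1 - p_slow) ** fanout


# Hypothetical latencies in ms: 98 fast requests and 2 slow ones.
latencies_ms = [10] * 98 + [200] * 2
print(percentile(latencies_ms, 95), percentile(latencies_ms, 99))
print(round(prob_any_slow(fanout=100, p_slow=0.01), 2))
```

With these numbers the 95th percentile still looks fast while the 99th exposes the slow requests, and a request that waits on 100 backends is more likely than not to hit at least one slow one.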
Total Cost of Ownership (TCO)
TCO = capital (CapEx) + operational (OpEx) expenses
Operator’s perspective
CapEx: building, generators, A/C, compute/storage/net HW
Including spares, amortized over 3 – 15 years
OpEx: electricity (5-7c/kWh), repairs, people, WAN, insurance, …
User’s perspective
CapEx: cost of long-term leases on HW and services
OpEx: pay-per-use cost on HW and services, people
Operator’s TCO Example
[Figure: pie chart of operator TCO. Source: James Hamilton]
- Servers: 61%
- Energy: 16%
- Cooling: 14%
- Networking: 6%
- Other: 3%
Hardware dominates TCO, make it cheap
Must utilize it as well as possible
Reliability
Failure in time (FIT)
Failures per billion hours of operation = 10^9 / MTTF
Mean time to failure (MTTF)
Time to produce first incorrect output
Mean time to repair (MTTR)
Time to detect and repair a failure
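The FIT definition converts directly to and from MTTF. A minimal sketch, assuming MTTF is expressed in hours; the example MTTF value is hypothetical.

```python
HOURS_PER_BILLION = 10**9


def fit_from_mttf(mttf_hours):
    """Failures per billion hours of operation, given MTTF in hours."""
    return HOURS_PER_BILLION / mttf_hours


def mttf_from_fit(fit):
    """MTTF in hours, given a FIT rate."""
    return HOURS_PER_BILLION / fit


mttf_hours = 1_000_000  # hypothetical component MTTF
print(fit_from_mttf(mttf_hours))
```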
Availability
[Figure: timeline alternating between correct operation (MTTF) and failure (MTTR)]
Steady state availability = MTTF / (MTTF + MTTR)
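The steady-state formula can be turned into expected downtime per year. A sketch under stated assumptions: the MTTF and MTTR values are made up, and a year is taken as 8760 hours.

```python
def availability(mttf_hours, mttr_hours):
    """Steady-state availability = MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def downtime_hours_per_year(avail, hours_per_year=8760):
    """Expected hours of downtime in a year at the given availability."""
    return (1 - avail) * hours_per_year


# Hypothetical service: fails on average every 999 hours, takes 1 hour to repair.
a = availability(mttf_hours=999, mttr_hours=1)
print(f"{a:.3f}", f"{downtime_hours_per_year(a):.2f} h/year")
```

Halving MTTR improves availability about as much as doubling MTTF, which is why fast detection and repair matter as much as avoiding failures.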
Yearly Datacenter Flakiness
~0.5 overheating (power down most machines in <5 mins, ~1-2 days to recover)
~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hrs to come back)
~1 rack-move (plenty of warning, ~500-1000 machines powered down, ~6 hrs)
~1 network rewiring (rolling ~5% of machines down over 2-day span)
~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back)
~5 racks go wonky (40-80 machines see 50% packet loss)
~8 network maintenances (4 might cause ~30-minute random connectivity losses)
~12 router reloads (takes out DNS and external vIPs for a couple minutes)
~3 router failures (have to immediately pull traffic for an hour)
~dozens of minor 30-second blips for DNS
~1000 individual machine failures (2-4% failure rate, machines crash at least twice)
~thousands of hard drive failures (1-5% of all disks will die)
Add to these SW bugs, config errors, human errors, …
Key Availability Techniques
Technique Performance Availability
Replication ✔ ✔
Partitioning (sharding) ✔ ✔
Load-balancing ✔
Watchdog timers ✔
Integrity checks ✔
Canaries ✔
Eventual consistency ✔ ✔
Make apps do something reasonable when not all is right

Better to give users limited functionality than an error page
Aggressive load balancing or request dropping
Better to satisfy 80% of the users rather than none
The CAP Theorem
In distributed systems, choose 2 out of 3
Consistency
Every read returns data from most recent write
Availability
Every request executes & receives a (non-error)
response
Partition-tolerance
The system continues to function when network
partitions occur (messages dropped or
delayed)
Useful Tips
Check for single points of failure
Keep it simple, stupid (KISS)
The reason many systems use centralized control
If it’s not tested, do not rely on it
Question: how do you test availability techniques
with hundreds of loosely coupled services
running on thousands of machines?
How Much Does It Cost 1/2
2 Networking – per GiB and month
→ Data going in: $0.10 / GiB
← Data coming out:
- 0 .. 1 GiB: $0
- < 10 TB: $0.15 / GiB ← max: use for estimates
- 11 .. 49 TB: $0.11 / GiB
- 50 .. 149 TB: $0.09 / GiB
- > 150 TB: $0.08 / GiB
2011-06-17 GRITS 2011

How Much Does It Cost 2/2
3 Storage – per GiB and month
- EBS: $0.10 / GiB * month
- S3: $0.15 / GiB * month
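The per-GiB prices on these slides are enough for a rough monthly estimate. This is a sketch only: it uses the 2011 prices quoted here, applies the flat $0.15 / GiB outbound rate suggested for estimates (ignoring the free first GiB and the cheaper volume tiers), and the workload numbers are hypothetical.

```python
# Prices quoted on the slides (2011 figures).
PRICE_IN_PER_GIB = 0.10   # data going in
PRICE_OUT_PER_GIB = 0.15  # flat estimate: the highest outbound tier
PRICE_STORAGE_PER_GIB_MONTH = {"EBS": 0.10, "S3": 0.15}


def monthly_cost(gib_in, gib_out, gib_stored, store="S3"):
    """Rough monthly bill: inbound + outbound transfer + storage."""
    transfer = gib_in * PRICE_IN_PER_GIB + gib_out * PRICE_OUT_PER_GIB
    storage = gib_stored * PRICE_STORAGE_PER_GIB_MONTH[store]
    return transfer + storage


# Hypothetical workload: upload 100 GiB once, keep it for the month, download 20 GiB.
print(f"${monthly_cost(100, 20, 100):.2f}")
```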

Other Considerations
• Security
- Is your data yours? Safe in transit? OK that it shares space with strangers?
• Vendor lock-in
- No standards (yet)
- Using vendor services ➡ dependency
• Efficacy
- Benchmark machine types to find the cost-performance optimum for your application.
- Use caching (reduce transfer-$$)
- Transfer data once and store it for a month.
- Reuse during the month many times.
• Consider your time-line
- Clouds are good for short-term needs
- Or highly bursty cycle requirements
- Long-term: better to invest in your own HW
• Deploying distributed applications
- RightScale, Chef, Puppet (, Wrangler)
• System administration
- Clouds: Onus is on you to get it right
- How well do you know Linux sys admin tasks?
- Or will you have to pay someone?
- HPC/Grids: Remote admin responsibility
• Overhead
- Virtualization slower than bare metal
- Commodity Gig-E versus Myrinet et al.
- Amazon CC solves some of it, but $$$
• Application size
- Good fit: 1,000…10,000 CPU hours
- >10k CPU hours: Put costs into budget
- Maybe HPC elsewhere a better fit?
• No queue
- Cloud is a finite resource
- No queuing, just error “no capacity”
- Happy retrying…
- HPC can achieve 90% resource utilization

Questions?
