
Cloud Computing

What is Cloud Computing?


What is Cloud Computing? (1/4)

• Cloud computing is an on-demand model


• Shared pool of computing resources
– Servers
– Storage
– Applications
– Services
What is Cloud Computing? (2/4)

• Rapidly provisioned
• Rapidly released
• Minimal management effort or service-provider interaction
• Other definitions also exist
What is Cloud Computing? (3/4)
• Cloud computing is the delivery of hosted
services to a client over the Internet.
- Enables large-scale services
without up-front investment.
What is Cloud Computing? (4/4)
Informal: computing with large datacenters

Our focus: computing as a utility


» Outsourced to a third party or internal org
Different Models of Cloud Computing
Deployment Model
• There are four primary cloud deployment models:
- Public Cloud
- Private Cloud
- Community Cloud
- Hybrid Cloud
Public Clouds
• Public clouds are owned by cloud service
providers who charge for the use of cloud
resources.
• Basic characteristics:
- Homogeneous infrastructure, Common policies
- Shared resources and multi-tenancy
- Leased or rented infrastructure
- Economies of scale
• Examples include:
- AWS/EC2 (Amazon)
- Azure (Microsoft)
- Google Cloud Platform
Private Clouds
• The cloud infrastructure belongs to and is
operated by only one organization.
• Basic characteristics :
- Heterogeneous infrastructure; Customized policies
- Dedicated resources
- In-house infrastructure; End-to-end control
• Examples include:
Other types of Clouds
• Community cloud
- The cloud infrastructure is shared by several
organizations and supports a specific community that
has shared concerns (e.g., mission, security
requirements, policy and compliance considerations).

• Hybrid cloud
- The cloud infrastructure is a composition of two or more
clouds (private, community, or public) that remain
unique entities but are bound together by standardized
or proprietary technology that enables data and
application portability.
Types Of Cloud Services
Types of Cloud Services
Infrastructure as a Service (IaaS): VMs, disks
Platform as a Service (PaaS): Web, MapReduce
Software as a Service (SaaS): Email, GitHub

Public vs private clouds:
Shared across arbitrary orgs/customers
vs internal to one organization
IaaS, PaaS and SaaS
• Infrastructure as a Service (IaaS)
• Platform as a Service (PaaS)
• Software as a Service (SaaS)
[Figure: service stacks. IaaS: Infrastructure (Servers · Storage · Network). PaaS: Platform (OS & Application Stack) over Infrastructure. SaaS: Applications (Packaged Software) over Platform over Infrastructure.]
Spectrum of Cloud Users

Image credit:
http://blogs.msdn.com/b/seliot/archive/2010/03/04/what-the-heck-is-cloud-computing-another-re-look-with-pretty-pictures.aspx
Cloud Service Models

Software as a Service (SaaS)
Platform as a Service (PaaS)
Infrastructure as a Service (IaaS)
Example
AWS Lambda functions-as-a-service
» Runs functions in a Linux container on events
» Used for web apps, stream processing, highly
parallel MapReduce and video encoding
Cloud Software Stack
Web Server: Java, PHP, JS, …
Analytics UIs: Hive, Pig, HiPal, …
Cache: memcached, TAO, …
Other Services: model serving, search, Unicorn, Druid, …
Analytics Engines: MapReduce, Dryad, Pregel, Spark, …
Security (e.g. IAM)
Metering + Billing
Operational Stores: SQL, Spanner, Dynamo, Cassandra, BigTable, …
Message Bus: Kafka, Kinesis, …
Metadata: Hive, AWS Catalog, …
Distributed Storage: Amazon S3, GFS, Hadoop FS, …
Coordination: Chubby, ZK, …
Resource Manager: EC2, Borg, Mesos, Kubernetes, …
Example: Web Application
[Figure: the cloud software stack diagram, as on the "Cloud Software Stack" slide]
Example: Analytics Warehouse
[Figure: the cloud software stack diagram, as on the "Cloud Software Stack" slide]
Components Offered as PaaS
[Figure: the cloud software stack diagram, as on the "Cloud Software Stack" slide]
Cloud Computing Properties
& Essentials.
Cloud Properties (1/2)
• Resource efficiency: computing and network
resources are pooled to provide services to
multiple users. Resource allocation is
dynamically adapted according to user demand.

• Elasticity: computing resources can be rapidly
and elastically provisioned to scale up, and
released to scale down, based on consumer
demand.
Cloud Properties (2/2)
• Self-managing services: a consumer can
provision cloud services, such as web
applications, server time, processing, storage and
network, as needed and automatically, without
requiring human interaction with each service’s
provider.

• Accessible and highly available: cloud
resources are available over the network
anytime and anywhere and are accessed
through standard mechanisms that promote use
by different types of platform (e.g., mobile
phones, laptops, and PDAs).
Cloud Computing Essentials
• Cloud computing is Utility Computing
- Cloud services are controlled and monitored by the
cloud provider through a pay-per-use business model.

• An ideal cloud computing platform is:


- efficient in its use of resources
- scalable
- elastic
- self-managing
- highly available and accessible
- inter-operable and portable
Over or Under-Provisioning
[Figure: capacity vs. demand plots. Captions: "Shaded area is unused capability." "Shaded area represents requests not served." "Less and less demand."]
Dynamic Provisioning
• In the traditional computing model, two problems are common:
- Underestimating system utilization, which results in
under-provisioning
[Figure: resources vs. time (days 1 to 3) with capacity and demand curves; labels: "Loss Revenue", "Loss Users"]
Real world Estimates
• Average server utilization is 5% to 20%.
• Peak workload exceeds the average by factors of
2 to 10.
• Users provision for the peak.
• Peak loads may occur based on the time of day
or based on other factors (e.g. photo sharing
after the holidays, drop/add within two weeks of
start of term, etc.)
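
A rough sketch of why provisioning for the peak leaves servers mostly idle. The demand trace and the function name are hypothetical illustrations, not data from the slides.

```python
def utilization_when_provisioned_for_peak(demand):
    """Average utilization if capacity is fixed at the peak of `demand`."""
    capacity = max(demand)               # users provision for the peak
    average = sum(demand) / len(demand)  # what is actually used on average
    return average / capacity


# Hypothetical hourly demand (servers needed): mostly quiet, one short peak.
demand = [10] * 22 + [100] * 2
util = utilization_when_provisioned_for_peak(demand)
print(f"peak/average = {max(demand) / (sum(demand) / len(demand)):.1f}x")
print(f"utilization  = {util:.1%}")
```

With this made-up trace the peak is a bit under 6x the average and utilization is 17.5%, which falls inside the ranges quoted above.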
Cloud Economics: For Users
Elasticity:
» Using 1000 servers for 1 hour costs the same as
1 server for 1000 hours
» Same price to get a result faster!
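
The arithmetic behind this claim, as a small sketch. It assumes purely linear per-server-hour billing and a job that parallelizes perfectly; the price is a hypothetical placeholder.

```python
def job_cost(servers, hours, price_per_server_hour):
    """Cost under linear pay-per-use billing (an assumption of this sketch)."""
    return servers * hours * price_per_server_hour


PRICE = 0.10  # hypothetical $ per server-hour

wide = job_cost(servers=1000, hours=1, price_per_server_hour=PRICE)
narrow = job_cost(servers=1, hours=1000, price_per_server_hour=PRICE)

print(f"1000 servers x 1 hour:  ${wide:.2f}")
print(f"1 server x 1000 hours:  ${narrow:.2f}")
print("same cost, result arrives 1000x sooner:", wide == narrow)
```

In practice, billing granularity, startup time, and imperfect parallel speedup would make the two cases differ somewhat.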

[Figure: two plots of resources vs. time]
Cloud Economics: For Providers
Economies of scale:
» Purchasing, powering, managing machines at
scale gives lower per-unit costs than
customers’
Other Interesting Features
Spot market for preemptible machines

Reserved instances and RI market

Ability to quickly try exotic hardware
Common Cloud Applications
1. Web/mobile applications

2. Data analytics (MapReduce, SQL, ML, etc.)
3. Stream processing

4. Batch computation (HPC, video, etc)


Datacenter Hardware
2-socket server
>10GbE NIC
Flash storage
JBOD disk array
GPU/accelerators
>10GbE switch
Datacenter Hardware

Rows of rack-mounted servers

Datacenters hold 50K – 200K servers and burn 10 – 100 MW

Storage: distributed with compute or NAS systems


Remote storage access for many use cases (why?)
Hardware Heterogeneity

[Facebook server configurations]

Custom-design servers
Configurations optimized for major app classes
Few configurations to allow reuse across many apps
Roughly constant power budget per volume
Useful Latency Numbers
Initial list from Jeff Dean, Google

L1 cache reference                        0.5 ns
Branch mispredict                           5 ns
L3 cache reference                         20 ns
Mutex lock/unlock                          25 ns
Main memory reference                     100 ns
Compress 1K bytes with Snappy           3,000 ns
Send 2K bytes over 10GbE                2,000 ns
Read 1 MB sequentially from memory    100,000 ns
Read 4KB from NVMe Flash               50,000 ns
Round trip within same datacenter     500,000 ns
Disk seek
Read 1 MB sequentially from disk
Send packet CA → Europe → CA
Useful Throughput Numbers

DDR4 channel bandwidth      20 GB/sec
PCIe gen3 x16 channel       15 GB/sec
NVMe Flash bandwidth        2 GB/sec
GbE link bandwidth          10 – 100 Gbps
Disk bandwidth              6 Gbps
NVMe Flash 4KB IOPS         500K – 1M
Disk 4K IOPS                100 – 200
Performance
Metrics
Throughput
Requests per second
Concurrent users
Gbytes/sec processed
...

Latency
Execution time
Per request latency
Tail Latency
[Dean & Barroso,’13]

The 95th or 99th percentile request latency


End-to-end with all tiers included

Larger scale → more prone to high tail latency
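
A minimal sketch of both points: how a percentile latency is read off a set of samples, and why fan-out makes the tail matter. The latency samples are synthetic, the helper names are hypothetical, and the fan-out formula assumes sub-requests are slow independently of each other.

```python
import math
import random


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def prob_any_slow(fanout, p_slow):
    """Chance that at least one of `fanout` independent sub-requests is slow."""
    return 1 - (1 - p_slow) ** fanout


random.seed(0)
# Synthetic latencies in ms: roughly 1% of requests stall for a full second.
latencies = [1000.0 if random.random() < 0.01 else random.uniform(5, 15)
             for _ in range(10_000)]
for p in (50, 95, 99):
    print(f"p{p} = {percentile(latencies, p):.1f} ms")

# A request that waits on 100 servers, each slow 1% of the time:
print(f"P(at least one slow) = {prob_any_slow(100, 0.01):.2f}")  # about 0.63
```

Under these assumptions, a request that touches 100 servers hits a slow one nearly two times out of three, even though each server is slow only 1% of the time.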


Total Cost of Ownership (TCO)
TCO = capital (CapEx) + operational (OpEx) expenses

Operator’s perspective
CapEx: building, generators, A/C, compute/storage/net HW
Including spares, amortized over 3 – 15 years
OpEx: electricity (5-7c/kWh), repairs, people, WAN, insurance, …

User’s perspective
CapEx: cost of long term leases on HW and services
OpEx: pay per use cost on HW and services, people
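
A small sketch of the TCO sum on a monthly basis. It assumes straight-line amortization of CapEx, which the slides do not specify, and every dollar figure is a hypothetical placeholder.

```python
def monthly_tco(capex, amortization_years, monthly_opex):
    """TCO per month = amortized CapEx + OpEx (straight-line amortization assumed)."""
    return capex / (amortization_years * 12) + monthly_opex


# Hypothetical server: $3,600 purchase amortized over 3 years, $50/month to run.
print(f"${monthly_tco(capex=3600, amortization_years=3, monthly_opex=50):.2f} per month")
```

The same function covers both perspectives: the operator plugs in building and hardware costs, while the user plugs in long-term lease costs as CapEx and pay-per-use charges as OpEx.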
Operator’s TCO Example
[Pie chart: Servers, Energy, Cooling, Networking, Other; slices of 61%, 16%, 14%, 6%, and 3%]

[Source: James Hamilton]

Hardware dominates TCO, make it cheap


Must utilize it as well as possible
Reliability
Failure in time (FIT)
Failures per billion hours of operation = 10^9/MTTF

Mean time to failure (MTTF)
Time to produce first incorrect output

Mean time to repair (MTTR)
Time to detect and repair a failure
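
A worked instance of the FIT definition, as a sketch. The MTTF and fleet size are hypothetical, and the fleet estimate assumes a constant failure rate and independent components.

```python
HOURS_PER_YEAR = 24 * 365


def fit_from_mttf(mttf_hours):
    """FIT = failures per billion (10^9) hours of operation."""
    return 1e9 / mttf_hours


def expected_failures_per_year(fit, n_components):
    """Expected yearly failures across a fleet, assuming a constant failure rate."""
    return fit * n_components * HOURS_PER_YEAR / 1e9


fit = fit_from_mttf(1_000_000)  # hypothetical component with a 1M-hour MTTF
print(f"FIT = {fit:.0f}")
print(f"failures/year across 10,000 components = {expected_failures_per_year(fit, 10_000):.1f}")
```

Even a component with a very long MTTF fails regularly once there are thousands of them, which is why the fleet-level number matters more to an operator than the per-device one.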
Availability
[Timeline: Correct (MTTF) → Failure (MTTR) → Correct (MTTF) → Failure (MTTR) → Correct]

Steady state availability = MTTF / (MTTF + MTTR)
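
The availability formula applied to numbers, as a sketch. The MTTF and MTTR values are hypothetical.

```python
def availability(mttf, mttr):
    """Steady state availability = MTTF / (MTTF + MTTR); both in the same time unit."""
    return mttf / (mttf + mttr)


def downtime_hours_per_year(avail):
    """Expected hours of downtime in a 365-day year at the given availability."""
    return (1 - avail) * 24 * 365


a = availability(mttf=999, mttr=1)  # hypothetical: fails every 999 h, 1 h to repair
print(f"availability = {a:.3%}")
print(f"downtime     = {downtime_hours_per_year(a):.2f} hours/year")
```

Because MTTR sits in the denominator, shortening detection and repair time raises availability just as effectively as making failures rarer.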
Yearly Datacenter Flakiness
~0.5 overheating (power down most machines in <5 mins, ~1-2 days to recover)
~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hrs to come back)
~1 rack-move (plenty of warning, ~500-1000 machines powered down, ~6 hrs)
~1 network rewiring (rolling ~5% of machines down over 2-day span)
~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back)
~5 racks go wonky (40-80 machines see 50% packet loss)
~8 network maintenances (4 might cause ~30-minute random connectivity losses)
~12 router reloads (takes out DNS and external vIPs for a couple minutes)
~3 router failures (have to immediately pull traffic for an hour)
~dozens of minor 30-second blips for DNS
~1000 individual machine failures (2-4% failure rate, machines crash at least twice)
~thousands of hard drive failures (1-5% of all disks will die)
Add to these SW bugs, config errors, human errors, …

Key Availability Techniques
Technique Performance Availability
Replication ✔ ✔
Partitioning (sharding) ✔ ✔
Load-balancing ✔
Watchdog timers ✔
Integrity checks ✔
Canaries ✔
Eventual consistency ✔ ✔

Make apps do something reasonable when not all is right


Better to give users limited functionality than an error page
Aggressive load balancing or request dropping
Better to satisfy 80% of the users rather than none
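
One way the last three points could look in code: admit requests only up to a capacity limit, and serve a reduced response instead of an error when the limit is hit or the full path fails. This is a single-threaded sketch with hypothetical names (`LoadShedder`, `handle`); a real server would need locking or an existing rate-limiting library.

```python
class LoadShedder:
    """Admit at most `max_in_flight` concurrent requests; reject the rest quickly."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self.in_flight = 0

    def try_admit(self):
        if self.in_flight >= self.max_in_flight:
            return False  # over capacity: shed this request
        self.in_flight += 1
        return True

    def release(self):
        self.in_flight -= 1


def handle(shedder, work, fallback):
    """Run `work` if admitted; otherwise, or if it fails, return `fallback()`."""
    if not shedder.try_admit():
        return fallback()      # limited functionality instead of an error page
    try:
        return work()
    except Exception:
        return fallback()      # degrade when a backend is unhealthy
    finally:
        shedder.release()      # only reached for admitted requests


shedder = LoadShedder(max_in_flight=1)
print(handle(shedder, lambda: "full page", lambda: "limited page"))
```

Shedding at admission keeps the admitted requests fast, which is the sense in which serving 80% of users beats overloading the system and serving none.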
The CAP Theorem
In distributed systems, choose 2 out of 3

Consistency
Every read returns data from most recent write

Availability
Every request executes & receives a (non-error) response

Partition-tolerance
The system continues to function when network
partitions occur (messages dropped or delayed)
Useful Tips
Check for single points of failure
Keep it simple stupid (KISS)
The reason many systems use centralized control

If it’s not tested, do not rely on it

Question: how do you test availability techniques
with hundreds of loosely coupled services
running on thousands of machines?
How Much Does It Cost 1/2
2. Networking – per GiB and month
→ Data going in: $0.10 / GiB
← Data coming out:
- 0 .. 1 GiB: $0
- < 10 TB: $0.15 / GiB ← max: use for estimates
- 11 .. 49 TB: $0.11 / GiB
- 50 .. 149 TB: $0.09 / GiB
- > 150 TB: $0.08 / GiB

2011-06-17 GRITS 2011


How Much Does It Cost 2/2
3. Storage – per GiB and month
- EBS: $0.10 / GiB * month
- S3: $0.15 / GiB * month
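
A back-of-the-envelope estimator built from the networking and storage prices on these two slides. It uses the flat $0.15 / GiB outbound rate that the slides mark for estimates, ignores the free first GiB and the volume tiers, and the function name is hypothetical. The prices are the 2011 figures quoted here, not current ones.

```python
# Prices as quoted on the slides (2011); illustrative only.
PRICE_IN_PER_GIB = 0.10
PRICE_OUT_PER_GIB = 0.15  # top-tier rate, recommended on the slide for estimates
STORAGE_PER_GIB_MONTH = {"EBS": 0.10, "S3": 0.15}


def monthly_estimate(gib_in, gib_out, gib_stored, storage="S3"):
    """Rough monthly bill in dollars for data transfer plus storage."""
    transfer = gib_in * PRICE_IN_PER_GIB + gib_out * PRICE_OUT_PER_GIB
    return transfer + gib_stored * STORAGE_PER_GIB_MONTH[storage]


# Upload 100 GiB, keep it in S3 for the month, download 50 GiB of results.
print(f"${monthly_estimate(gib_in=100, gib_out=50, gib_stored=100):.2f}")
```

For that example the estimate is $10.00 in, $7.50 out, and $15.00 of storage, so $32.50 for the month.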



Other Considerations 1/4
• Security
- Is your data yours? Safe in transit? OK that
it shares space with strangers?
• Vendor lock-in
- No standards (yet)
- Using vendor services → dependency
• Efficacy
- Benchmark machine types to find the
cost-performance optimum for your application.
Other Considerations 2/4
• Use caching (reduce transfer-$$)
- Transfer data once and store it for a month.
- Reuse during the month many times.
• Consider your time-line
- Clouds are good for short-term needs
- Or highly bursty cycle requirements
- Long-term: better to invest in your own HW
• Deploying distributed applications
- RightScale, Chef, Puppet (, Wrangler)
Other Considerations 3/4
• System administration
- Clouds: Onus is on you to get it right
- How well do you know Linux sys admin tasks?
- Or will you have to pay someone?
- HPC/Grids: Remote admin responsibility
• Overhead
- Virtualization slower than bare metal
- Commodity Gig-E versus Myrinet et al.
- Amazon CC solves some of it, but $$$



Other Considerations 4/4
• Application size
- Good fit: 1,000…10,000 CPU hours
- >10k CPU hours: Put costs into budget
- Maybe HPC elsewhere a better fit?
• No queue
- Cloud is a finite resource
- No queuing, just error “no capacity”
- Happy retrying…
- HPC can achieve 90% resource utilization



Questions?
