
Three Talks

• Scalability Terminology
– Gray (with help from Devlin, Laing, Spix)
• What Windows is doing re this
– Laing
• The M$ PetaByte (as time allows)
– Gray

1
Terminology for Scaleability
Bill Devlin, Jim Gray, Bill Laing, George Spix
paper at: ftp://ftp.research.microsoft.com/pub/tr/tr-99-85.doc

• Farms of servers:
– Clones: identical
• Scaleability + availability
– Partitions:
• Scaleability
– Packs
• Partition availability via fail-over
• GeoPlex
– for disaster tolerance.

[Figure: the farm taxonomy: a Farm contains Clones (shared-nothing or shared-disk) and Partitions grouped into Packs (shared-nothing, active-active or active-passive); farms are paired into a GeoPlex.]
2


Unpredictable Growth
• The TerraServer Story:
– Expected 5 M hits per day
– Got 50 M hits on day 1
– Peak at 20 M hpd on a “hot” day
– Average 5 M hpd over last 2 years
• Most of us cannot predict demand
– Must be able to deal with NO demand
– Must be able to deal with HUGE demand
3
Web Services Requirements
• Scalability: Need to be able to add capacity
– New processing
– New storage
– New networking

• Availability: Need continuous service


– Online change of all components (hardware and software)
– Multiple service sites
– Multiple network providers

• Agility: Need great tools


– Manage the system
– Change the application several times per year.
– Add new services several times per year. 4
Premise:
Each Site is a Farm
• Buy computing by the slice (brick):
– Rack of servers + disks.
– Functionally specialized servers
• Grow by adding slices
– Spread data and computation to new slices
• Two styles:
– Clones: anonymous servers
– Parts+Packs: Partitions fail over within a pack
• In both cases, GeoPlex remote farm for disaster recovery

[Figure: "The Microsoft.Com Site" network diagram (12/15/97): functionally specialized server groups (www, premium, home, search, register, support, msid, FTP/download, SQL, staging, replication) in the Building 11, MOSWest, European, and Japan data centers, with average configurations (4xP5/4xP6, 256 MB to 1 GB RAM, 12 to 180 GB disk), average costs ($24K to $128K), FY98 forecasts, and FDDI rings, Gigaswitches, and routers out to the Internet.]

Microsoft.com late 2000 (servers by function):
FTP 6, Build Servers 32, IIS 210, Application 2, Exchange 24,
Network/Monitoring 12, SQL 120, Search 2, NetShow 3, NNTP 16,
SMTP 6, Stagers 26, total 459
5
Scaleable Systems
• ScaleUp: grow by adding components to a single system.
• ScaleOut: grow by adding more systems.

[Figure: Scale UP (one bigger box) vs. Scale OUT (many small boxes).]
6
ScaleUP and Scale OUT
• Everyone does both.
• Choices:
– Size of a brick
– Clones or partitions
– Size of a pack
• Whose software?
– ScaleUp and ScaleOut both have a large software component
• Price per slice:
– 1 M$/slice: IBM S390? Sun E 10,000?
– 100 K$/slice: Wintel 8X
– 10 K$/slice: Wintel 4x
– 1 K$/slice: Wintel 1x
7
Clones: Availability+Scalability
• Some applications are
– Read-mostly
– Low consistency requirements
– Modest storage requirement (less than 1TB)
• Examples:
– HTML web servers (IP sprayer/sieve + replication)
– LDAP servers (replication via gossip)
• Replicate app at all nodes (clones)
• Load Balance:
– Spray & sieve requests across nodes (sketched after this slide)
– Route requests across nodes
• Grow: adding clones
• Fault tolerance: stop sending to that clone. 8
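A minimal sketch (not from the talk) of the spray/sieve idea above: round-robin requests across identical clones, sieve out the ones that fail health checks, and grow by adding clones. The clone names, the health-check hooks, and the Python form are mine; real farms do this in the network layer with an IP sprayer such as WLBS/NLB or a Cisco LocalDirector, both mentioned later in the deck.

```python
import itertools

class CloneSprayer:
    """Round-robin 'spray' across identical clones; 'sieve' out dead ones."""
    def __init__(self, clones):
        self.clones = list(clones)            # hypothetical clone addresses
        self.alive = set(self.clones)         # clones currently passing health checks
        self._rr = itertools.cycle(self.clones)

    def mark_down(self, clone):
        """Fault tolerance: stop sending to a failed clone."""
        self.alive.discard(clone)

    def mark_up(self, clone):
        self.alive.add(clone)

    def add_clone(self, clone):
        """Grow by adding clones: new capacity, same stateless service."""
        self.clones.append(clone)
        self.alive.add(clone)
        self._rr = itertools.cycle(self.clones)

    def route(self):
        """Return the next live clone for an incoming request."""
        for _ in range(len(self.clones)):
            c = next(self._rr)
            if c in self.alive:
                return c
        raise RuntimeError("no live clones")

# Usage: spray requests across three web clones, then lose one.
lb = CloneSprayer(["web1", "web2", "web3"])
print([lb.route() for _ in range(4)])   # ['web1', 'web2', 'web3', 'web1']
lb.mark_down("web2")
print([lb.route() for _ in range(4)])   # web2 is sieved out of the rotation
```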
Two Clone Geometries
• Shared-Nothing: exact replicas
• Shared-Disk (state stored in server)
Shared Nothing Clones Shared Disk Clones

If clones have any state: make it disposable.


Manage clones by reboot, failing that replace.
9
One person can manage thousands of clones.
Clone Requirements
• Automatic replication (if they have any state)
– Applications (and system software)
– Data
• Automatic request routing
– Spray or sieve
• Management:
– Who is up?
– Update management & propagation
– Application monitoring.
• Clones are very easy to manage:
– Rule of thumb: 100’s of clones per admin.
10
Partitions for Scalability
• Clones are not appropriate for some apps.
– Stateful apps do not replicate well
– High update rates do not replicate well
• Examples
– Email
– Databases
– Read/write file server…
– Cache managers
– chat

• Partition state among servers


• Partitioning:
– must be transparent to client
– split & merge partitions online (see the sketch after this slide) 11
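A minimal sketch of partitioned routing, assuming hash-range partitioning rather than the theme/geography or per-mailbox partitioning the examples slide uses later; the server names, the MD5 hash, and the split policy are illustrative. The point is the two requirements above: clients only ever call lookup(), and a partition can be split online by adding one boundary to the map.

```python
import bisect
import hashlib

class PartitionMap:
    """Route each key (e.g. a mailbox name) to the partition that owns it.

    Keys are hashed onto [0, 2**32); each partition owns a contiguous hash
    range, so a partition can be split online by inserting a new boundary.
    """
    def __init__(self, servers):
        # Start with equal-width hash ranges, one per (hypothetical) server.
        self.bounds = [(i + 1) * (2**32 // len(servers)) for i in range(len(servers))]
        self.bounds[-1] = 2**32
        self.owners = list(servers)

    @staticmethod
    def _hash(key):
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big")

    def lookup(self, key):
        """Client-transparent routing: key -> owning server."""
        return self.owners[bisect.bisect_right(self.bounds, self._hash(key))]

    def split(self, index, new_server):
        """Split partition `index` in half and hand the upper half to new_server."""
        lo = self.bounds[index - 1] if index > 0 else 0
        mid = (lo + self.bounds[index]) // 2
        self.bounds.insert(index, mid)
        self.owners.insert(index + 1, new_server)

pm = PartitionMap(["mail1", "mail2", "mail3"])
print(pm.lookup("alice@example.com"))   # some partition, say mail2
pm.split(1, "mail4")                    # grow: mail2's range is split with mail4
print(pm.lookup("alice@example.com"))   # unchanged unless alice's hash fell in the split-off half
```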
Packs for Availability
• Each partition may fail (independent of others)
• Partitions migrate to a new node via fail-over
– Fail-over in seconds (sketched after this slide)
• Pack: the nodes supporting a partition
– VMS Cluster, Tandem, SP2 HACMP,..
– IBM Sysplex™
– WinNT MSCS (wolfpack)
• Partitions typically grow in packs.
• Active-Active: all nodes provide service
• Active-Passive: hot standby is idle
• Cluster-In-A-Box now commodity
12
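In the same spirit, a minimal sketch of a pack: a few nodes jointly hosting partitions, with a failed node's partitions migrating to the survivors. Node names and the placement policy are hypothetical; real packs (MSCS, Sysplex, VMS clusters) also handle quorum, disk arbitration, and state recovery. The usage mirrors the TerraServer example later in the deck: a 4-node pack hosting 3 partitions, so one node is effectively a hot standby.

```python
class Pack:
    """Nodes that jointly host partitions; a partition fails over within its pack."""
    def __init__(self, nodes, partitions):
        nodes = list(nodes)
        self.up = set(nodes)
        # Primary assignment: spread partitions round-robin over the pack (hypothetical policy).
        self.primary = {p: nodes[i % len(nodes)] for i, p in enumerate(partitions)}

    def fail(self, node):
        """Node failure: migrate its partitions to surviving pack members."""
        self.up.discard(node)
        if not self.up:
            raise RuntimeError("whole pack is down")
        survivors = sorted(self.up)
        for i, (part, owner) in enumerate(sorted(self.primary.items())):
            if owner == node:
                self.primary[part] = survivors[i % len(survivors)]

    def owner(self, partition):
        return self.primary[partition]

# A 4-node pack hosting 3 partitions: 3 active nodes, 1 spare.
pack = Pack(["n1", "n2", "n3", "n4"], ["P1", "P2", "P3"])
print(pack.owner("P2"))     # n2
pack.fail("n2")             # n2 dies; P2 migrates within the pack in seconds
print(pack.owner("P2"))     # now one of the survivors (n3 here)
```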
Partitions and Packs
Partitions Packed Partitions
Scalability Scalability + Availability

13
Parts+Packs Requirements
• Automatic partitioning (in dbms, mail, files,…)
– Location transparent
– Partition split/merge
– Grow without limits (100x10TB)
– Application-centric request routing
• Simple fail-over model
– Partition migration is transparent
– MSCS-like model for services
• Management:
– Automatic partition management (split/merge)
– Who is up?
– Application monitoring.
14
GeoPlex: Farm Pairs
• Two farms (or more)
• State (your mailbox, bank account)
stored at both farms
• Changes from one
sent to other
• When one farm fails
other provides service
• Masks
– Hardware/Software faults
– Operations tasks (reorganize, upgrade move)
– Environmental faults (power fail, earthquake, fire) 15
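A minimal sketch of the "changes from one sent to other" idea above, assuming a simple asynchronous log-shipping scheme; the farm objects, the operation format, and the apply logic are invented for illustration. Real GeoPlexes use database log shipping, replication, or remote disk mirroring, as the Geoclusters slide later notes.

```python
class Farm:
    def __init__(self, name):
        self.name = name
        self.state = {}          # e.g. mailbox -> contents, account -> balance
        self.log = []            # ordered change log

    def apply(self, op):
        key, value = op
        self.state[key] = value
        self.log.append(op)

def ship_changes(primary, secondary):
    """Asynchronously replay the primary's log tail onto the secondary farm."""
    tail = primary.log[len(secondary.log):]
    for op in tail:
        secondary.apply(op)

# Two farms; clients normally write to the primary.
seattle, dublin = Farm("seattle"), Farm("dublin")
seattle.apply(("alice-mailbox", ["msg1"]))
seattle.apply(("bob-account", 100))
ship_changes(seattle, dublin)            # periodic background shipping

# Primary site fails (power, quake, fire): the other farm serves from its copy.
assert dublin.state["alice-mailbox"] == ["msg1"]
```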
Directory
Fail-Over
Load Balancing
• Routes request to right farm
– Farm can be clone or partition
• At farm, routes request to right service
• At service routes request to
– Any clone
– Correct partition.
• Routes around failures. 16
Availability
• The goal: 99999 (five nines)
• Well-managed nodes
– Masks some hardware failures
• Well-managed packs & clones
– Masks hardware failures
– Masks operations tasks (e.g. software upgrades)
– Masks some software failures
• Well-managed GeoPlex
– Masks site failures (power, network, fire, move,…)
– Masks some operations failures 17
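The 99999 goal above is five nines of availability; a quick back-of-the-envelope (my arithmetic, not from the slide) of what each count of nines allows in downtime per year:

```python
# Downtime per year allowed by N nines of availability.
minutes_per_year = 365.25 * 24 * 60
for nines in range(2, 6):
    availability = 1 - 10 ** -nines
    downtime = minutes_per_year * 10 ** -nines
    print(f"{nines} nines ({availability:.5f}): {downtime:9.1f} minutes/year of downtime")
# 5 nines (0.99999):       5.3 minutes/year, roughly what a well-managed
# GeoPlex of packs and clones is reaching for.
```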
Cluster Scale Out Scenarios
The FARM: Clones and Packs of Partitions
[Figure: the farm, bottom to top: clients; cloned, load-balanced front ends (firewall, sprayer, web server); cloned and packed file servers (Web File Store A/B, SQL temp state); and packed SQL partitions (Partitions 1-3) with database replication, labeled "Packed Partitions: Database Transparency".]
18
Some Examples:
• TerraServer:
– 6 IIS clone front-ends (wlbs)
– 3-partition 4-pack backend: 3 active 1 passive
– Partition by theme and geography (longitude)
– 1/3 sysadmin
• Hotmail:
– 1000 IIS clone HTTP login
– 3400 IIS clone HTTP front door
– + 1000 clones for ad rotator, in/out bound…
– 115 partition backend (partition by mailbox)
– Cisco local director for load balancing
– 50 sysadmin
• Google: (inktomi is similar but smaller)
– 700 clone spider
– 300 clone indexer
– 5-node geoplex (full replica)
– 1,000 clones/farm do search
– 100 clones/farm for http
– 10 sysadmin
See Challenges to Building Scalable Services: A Survey of Microsoft’s Internet Services,
Steven Levi and Galen Hunt http://big/megasurvey/megasurvey.doc. 19
Acronyms
• RACS: Reliable Arrays of Cloned Servers
• RAPS: Reliable Arrays of Partitioned and Packed Servers (the first p is silent).

20
Emissaries and Fiefdoms
• Emissaries are stateless (nearly)
Emissaries are easy to clone.
• Fiefdoms are stateful
Fiefdoms get partitioned.

21
Summary
• Terminology for scaleability
• Farms of servers:
– Clones: identical
• Scaleability + availability
– Partitions:
• Scaleability
– Packs
• Partition availability via fail-over
• GeoPlex for disaster tolerance.

[Figure: the farm taxonomy again: Clones (shared-nothing or shared-disk), Partitions grouped into Packs (shared-nothing, active-active or active-passive), and farms paired into a GeoPlex.]
Architectural Blueprint for Large eSites
Bill Laing http://msdn.microsoft.com/msdn-online/start/features/DNAblueprint.asp
Scalability Terminology: Farms, Clones, Partitions, and Packs: RACS and RAPS
Bill Devlin, Jim Gray, Bill Laing, George Spix MS-TR-99-85
ftp://ftp.research.microsoft.com/pub/tr/tr-99-85.doc
22
Three Talks
• Scalability Terminology
– Gray (with help from Devlin, Laing, Spix)
• What Windows is doing re this
– Laing
• The M$ PetaByte (as time allows)
– Gray

23
What Windows is Doing
• Continued architecture and analysis work
• AppCenter, BizTalk, SQL, SQL Service Broker, ISA,…
all key to Clones/Partitions
• Exchange is an archetype
– Front ends, directory, partitioned, packs, transparent mobility.
• NLB (clones) and MSCS (Packs)
• High Performance Technical Computing
• Appliances and hardware trends
• Management of these kinds of systems
• Still need good ideas on….
24
Architecture and Design work
• Produced an architectural Blueprint for large eSites
published on MSDN
– http://msdn.microsoft.com/msdn-online/start/features/DNAblueprint.asp
• Creating and testing instances of the architecture
– Team led by Per Vonge Neilsen
– Actually building and testing examples of the architecture
with partners. (sometimes known as MICE)
• Built a scalability “Megalab” run by Robert Barnes
– 1000 node cyber wall,
315 1U Compaq DL360s,
32 8ways, 7000 disks
25
26
Clones and Packs aka Clustering
• Integrated the NLB and MSCS teams
– Both focused on scalability and availability
– NLB for Clones
– MSCS for Partitions/Packs
• Vision is a single communications and group
membership infrastructure and a set of management
tools for Clones, Partitions, and Packs
• Unify management for clones/partitions at
BOTH: OS and app level
(e.g. IIS, Biztalk, AppCenter, Yukon, Exchange…)
27
Clustering in Whistler Server
• Microsoft Cluster Server
– Much improved setup and installation
– 4 node support in Advanced server
• Kerberos support for Virtual Servers
• Password change without restarting cluster service
• 8 node support in Datacenter
• SAN enhancements (Device reset not bus reset for disk arbitration,
Shared disk and boot disk on same bus)
• Quorum of nodes supported (no shared disk needed)
• Network Load Balancer
– New NLB manager
• Bi-Directional affinity for ISA as a Proxy/Firewall
• Virtual cluster support (Different port rules for each IP addr)
• Dual NIC support
28
Geoclusters
• AKA - Geographically dispersed (Packs)
– Essentially the nodes and storage are replicated at 2
sites, disks are remotely mirrored
• Being deployed today; helping vendors get certified; we still need better tools
• Working with
– EMC, Compaq, NSISoftware, StorageApps
• Log shipping (SQL) and extended VLANs (IIS)
are also solutions

29
High Performance Computing
Last year (CY2000):
• This work is part of the server scale-out efforts (BLaing)
• Web site and HPC Tech Preview CD late last year
– A W2000 “Beowulf” equivalent w/ 3rd-party tools
• Better than the competition
– 10-25% faster than Linux on SMPs (2, 4 & 8 ways)
– More reliable than SP2 (!)
– Better performance & integration w/ IBM periphs (!)
• But it lacks MPP debugger, tools, evangelism, reputation
• See ../windows2000/hpc
• Also \\jcbach\public\cornell*

This year (CY2001):
• Partner w/ Cornell/MPI-Soft/+
– Unix to W2000 projects
– Evangelism of commercial HPC (start w/ financial svcs)
– Showcase environment & apps (EBC support)
– First Itanium FP “play-offs”
– BIG tools integration / beta
• Dell & Compaq offer web HPC buy and support experience (buy capacity by-the-slice)
• Beowulf-on-W2000 book by Tom Sterling (author of Beowulf on Linux)
• Gain on Sun in the www.top500.org list
• Address the win-by-default assumption for Linux in HPC

No vendor has succeeded in bringing MPP to non-sci/eng venues & $$$… we will.
30
Appliances and Hardware Trends
• The appliances team under TomPh is focused on
dramatically simplifying the user experience of
installing these kinds of devices
– Working with OEMs to adopt WindowsXP
• Ultradense servers are on the horizon
– 100s of servers per rack
– Manage the rack as one
• Infiniband and 10 Gbps Ethernet change things.

31
Operations and Management
• Great research work done in MSR on this topic
– The Mega services paper by Levi and Hunt
– The follow on BIG project developed the ideas of
• Scale Invariant Service Descriptions with
• automated monitoring and
• deployment of servers.
• Building on that work in Windows Server group
• AppCenter doing similar things at app level

32
Still Need Good Ideas on…
• Automatic partitioning
• Stateful load balancing
• Unified management of clones/partitions
at both app and OS level

33
Three Talks
• Scalability Terminology
– Gray (with help from Devlin, Laing, Spix)
• What Windows is doing re this
– Laing
• The M$ PetaByte (as time allows)
– Gray

34
We're Building Petabyte Stores
• Soon everything can be recorded and indexed
• Hotmail 100TB now
• MSN 100TB now
• List price is 800M$/PB (including FC switches & brains)
• Must GeoPlex it.
• Can we get it for 1M$/PB?
• Personal 1TB stores for 1k$

[Figure: the storage-scale ladder, Kilo up to Yotta: a book or a photo is megabytes, a movie is gigabytes, all LoC books (as words) are terabytes, all books as multimedia land in the peta-to-exa range, and "everything recorded!" pushes toward zetta and yotta.]
24 yocto, 21 zepto, 18 atto, 15 femto, 12 pico, 9 nano, 6 micro, 3 milli
35
Building a Petabyte Store
• EMC ~ 500k$/TB = 500 M$/PB, plus FC switches plus… 800 M$/PB
• TPC-C SANs (Dell 18GB/…) 62 M$/PB
• Dell local SCSI, 3ware 20 M$/PB
• Do it yourself: 5 M$/PB
• A billion here, a billion there, soon you're talking about real money!

[Figure: bar chart of M$/PB, EMC vs. Dell/3ware.]
36
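A small unit sanity check behind the chart above (my restatement, not from the slide): since 1 PB = 1,000 TB = 1,000,000 GB and 1 M$ = 1,000 k$, a price of N M$/PB is the same number as N k$/TB and N $/GB, which makes these options easy to compare against the street disk prices later in the talk.

```python
# 1 PB = 1,000 TB = 1,000,000 GB and 1 M$ = 1,000 k$, so N M$/PB == N k$/TB == N $/GB.
options_m_per_pb = {
    "EMC + FC switches etc.": 800,       # ~500 k$/TB of raw EMC storage before switches
    "TPC-C SAN (Dell)": 62,
    "Dell local SCSI / 3ware": 20,
    "Do it yourself": 5,
}
for name, n in options_m_per_pb.items():
    print(f"{name:24s} {n:4d} M$/PB = {n:4d} k$/TB = {n:4d} $/GB")
```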
320 GB, 2k$ (now): ~6 M$/PB
• 4x80 GB IDE (2 hot pluggable)
– (1,000$)
• SCSI-IDE bridge
– 200$
• Box
– 500 Mhz cpu
– 256 MB RAM
– Fan, power, Enet
– 500$
• Ethernet Switch:
– 150$/port
• Or 8 disks/box: 640 GB for ~3K$ (or 300 GB RAID)
37
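A quick check of the parts list above against the ~6 M$/PB callout, reading the bridge price as 200$ (the parts breakdown is from the slide; the arithmetic is mine):

```python
parts = {
    "4 x 80 GB IDE disks": 1000,
    "SCSI-IDE bridge": 200,                          # reading the slide's price as 200$
    "box: 500 MHz cpu, 256 MB RAM, fan/power/Enet": 500,
    "Ethernet switch port": 150,
}
box_cost = sum(parts.values())                       # 1,850 $, i.e. roughly 2 k$
gb = 4 * 80
dollars_per_pb = box_cost / gb * 1e6                 # 1 PB = 1,000,000 GB
print(f"{box_cost} $ for {gb} GB")
print(f"{dollars_per_pb / 1e6:.1f} M$/PB")           # ~5.8 M$/PB, the slide's ~6 M$/PB
```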
Hot Swap Drives for
Archive or Data Interchange
• 25 MBps write (so can write N x 80 GB in 3 hours)
• 80 GB/overnite = ~N x 2 MB/second @ 19.95$/nite
• Compare to 1$/GB via Internet
38
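The overnight-shipping numbers above are bandwidth and price arithmetic; a rough check, assuming "overnight" means on the order of 12 hours in transit (the 12-hour figure is my assumption):

```python
gb = 80
transit_hours = 12                                   # assumption: rough overnight transit time
mb_per_s = gb * 1000 / (transit_hours * 3600)
print(f"{mb_per_s:.1f} MB/s per shipped 80 GB drive")          # ~1.9 MB/s, the slide's ~2 MB/s per drive
print(f"{19.95 / gb:.2f} $/GB shipped vs ~1 $/GB via the Internet")   # ~0.25 $/GB
```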
A Storage Brick
• 2 x 80GB disks
• 500 Mhz cpu (intel/ amd/ arm)
• 256MB ram
• 2 eNet RJ45
• Fan(s)
• Current disk form factor
• 30 watt
• 600$ (?)
• Per rack (48U, 3U/module, 16 units/U):
– 400 disks, 200 whistler nodes
– 32 TB
– 100 Billion Instructions Per Second
– 120 K$/rack, 4 M$/PB
• Per Petabyte (33 racks):
– 4 M$
– 3 TeraOps (6,600 nodes)
– 13 k disk arms (1/2 TBps IO)
39
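The per-rack and per-petabyte figures above follow from the brick spec; a quick arithmetic check (200 bricks per rack, as the slide's 400-disk count implies; the slide's 33 racks and 6,600 nodes allow a little headroom over the raw 31.25 racks):

```python
disks_per_brick, gb_per_disk, brick_mhz, brick_cost = 2, 80, 500, 600   # one brick, per the slide

bricks_per_rack = 200                            # 400 disks per rack / 2 disks per brick
rack_tb   = bricks_per_rack * disks_per_brick * gb_per_disk / 1000
rack_bips = bricks_per_rack * brick_mhz / 1000   # billions of instructions/sec (1 instr/cycle assumed)
rack_kd   = bricks_per_rack * brick_cost / 1000  # k$ per rack

racks_per_pb = 1000 / rack_tb                    # 31.25 raw; the slide rounds up to 33 racks
print(f"per rack: {rack_tb:.0f} TB, {rack_bips:.0f} BIPS, {rack_kd:.0f} k$")   # 32 TB, 100 BIPS, 120 k$
print(f"per PB:   ~{racks_per_pb:.0f} racks, ~{racks_per_pb * rack_kd / 1000:.1f} M$, "
      f"~{racks_per_pb * bricks_per_rack:.0f} nodes")                          # ~31 racks, ~3.8 M$, ~6250 nodes
```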
What Software Do The Bricks Run?
• Each node has an OS
• Each node has local resources: A federation.
• Each node does not completely trust the others.
• Nodes use RPC to talk to each other
– COM+ SOAP, BizTalk
• Huge leverage in high-level interfaces.
• Same old distributed system story.

[Figure: two brick software stacks side by side (Applications over RPC, streams, and datagrams, running on the CLR), talking to each other over Infiniband / Gbps Ethernet.]
40
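A minimal sketch of "nodes use RPC to talk to each other", using a plain HTTP+JSON interface in Python rather than the COM+/SOAP/BizTalk stack the slide names; the /status endpoint and its payload are invented. The point is the high-level, federated interface between bricks rather than raw block access.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class BrickHandler(BaseHTTPRequestHandler):
    """Each brick exposes a small, high-level interface instead of raw disk blocks."""
    def do_GET(self):
        if self.path == "/status":                       # hypothetical endpoint
            body = json.dumps({"node": "brick-42", "free_gb": 71}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, *args):                        # keep the demo output quiet
        pass

# One brick in the federation serves requests on a local port.
server = HTTPServer(("127.0.0.1", 0), BrickHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# Another node calls it over the network: plain RPC, no shared memory, limited trust.
with urllib.request.urlopen(f"http://127.0.0.1:{port}/status") as resp:
    print(json.load(resp))                               # {'node': 'brick-42', 'free_gb': 71}
server.shutdown()
```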
Storage Rack in 2 years?
• 300 arms
• 50TB (160 GB/arm)
• 24 racks
48 storage processors
2x6+1 in rack
• Disks = 2.5 GBps IO
• Controllers = 1.2 GBps IO
• Ports 500 MBps IO
• My suggestion: move
the processors into
the storage racks.
41
Auto Manage Storage
• 1980 rule of thumb:
– A DataAdmin per 10GB, SysAdmin per mips
• 2000 rule of thumb
– A DataAdmin per 5TB
– SysAdmin per 100 clones (varies with app).
• Problem:
– 5TB is 60k$ today, 10k$ in a few years.
– Admin cost >> storage cost???
• Challenge:
– Automate ALL storage admin tasks

42
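The "admin cost >> storage cost" worry is simple arithmetic once you put a loaded cost on a human administrator; the 100 k$/year figure below is my assumption, not from the slide:

```python
admin_cost_per_year = 100_000   # assumption: loaded cost of one DataAdmin, $/year
disk_cost_today = 60_000        # 5 TB of disk today, per the slide
disk_cost_soon = 10_000         # 5 TB in a few years, per the slide

print("admin $ per disk $, today:", round(admin_cost_per_year / disk_cost_today, 1))  # 1.7
print("admin $ per disk $, soon: ", round(admin_cost_per_year / disk_cost_soon, 1))   # 10.0
# Unless the admin tasks are automated away, people dominate the cost of storage.
```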
It’s Hard to Archive a Petabyte
It takes a LONG time to restore it.
• At 1 GBps it takes 12 days! (arithmetic after this slide)
• Store it in two (or more) places online (on disk?).
A geo-plex
• Scrub it continuously (look for errors)
• On failure,
– use other copy until failure repaired,
– refresh lost copy from safe copy.
• Can organize the two copies differently
(e.g.: one by time, one by space)
43
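The 12-day figure above is straight bandwidth arithmetic (decimal units, 1 PB = 10^15 bytes):

```python
petabyte = 1e15                  # bytes
rate = 1e9                       # 1 GB per second
seconds = petabyte / rate        # 1,000,000 seconds
print(seconds / 86_400, "days")  # ~11.6 days to restore at 1 GBps, the slide's "12 days"
# Which is why the slide keeps two online copies and scrubs them
# rather than relying on restore from an archive copy.
```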
Call To Action
• Let's work together to make storage bricks
– Low cost
– High function
• NAS (network attached storage)
not SAN (storage area network)
• Ship NT8/CLR/IIS/SQL/Exchange/…
with every disk drive
44
Three Talks
• Scalability Terminology
– Gray (with help from Devlin, Laing, Spix)
• What Windows is doing re this
– Laing
• The M$ PetaByte (as time allows)
– Gray

45
Cheap Storage
• Disks are getting cheap:
• 3 k$/TB disks (12 80 GB disks @ 250$ each)
[Figure: two scatter plots of disk price vs. raw disk unit size (0-80 GB) for SCSI and IDE drives, one in $ per drive and one in k$/TB, with linear fits y = 15.895x + 13.446, y = 13.322x - 1.4332, y = 5.7156x + 47.857, and y = 3.0635x + 40.542.]
46
All Device Controllers will be Super-Computers
• TODAY
– Disk controller is a 10 mips risc engine with 2MB DRAM
– NIC is similar power
• SOON
– Will become 100 mips systems with 100 MB DRAM.
• They are nodes in a federation
(can run Oracle on NT in disk controller).
• Advantages
– Uniform programming model
– Great tools
– Security
– Economics (cyberbricks)
– Move computation to data (minimize traffic)

[Figure: a central processor & memory connected to intelligent device controllers over a Tera Byte Backplane.]
54
