
Rethinking data centers

"There is nothing more difficult to take in hand, more perilous to conduct, or more uncertain in its success, than to take the lead in the introduction of a new order of things." – Niccolò Machiavelli

Chuck Thacker
Microsoft Research
October 2007

Problems
• Today, we need a new order of things.
• Problem: Today, data centers aren't designed as systems. We need to apply system engineering to:
  – Packaging
  – Power distribution
  – Computers themselves
  – Management
  – Networking

• I'll discuss all of these.
• Note: Data centers are not mini-Internets.

System design: Packaging

• We build large buildings, load them up with lots of expensive power and cooling, and only then start installing computers.
• We build them in remote areas.
  – Near hydro dams, but not near construction workers.
• They must be human-friendly, since we have to tinker with them a lot.
• Pretty much need to design for the peak load (power distribution, cooling, and networking).

Packaging: Another way

• Sun's Black Box
• Use a shipping container
  – and build a parking lot instead of a building.
• Don't need to be human-friendly.
• Assembled at one location, computers and all. A global shipping infrastructure already exists.
• Requires only networking, power, and cooled water. Need a pump, but no chiller (might have environmental problems), so build near a river.
• Expand as needed, in sensible increments.
• Sun's version uses a 20-foot box. 40 might be better (they've got other problems, too).
• Rackable has a similar system. So (rumored) does Google.

The Black Box

Sun's problems

• Servers are off-the-shelf (with fans).
• Servers have a lot of packaging (later slide).
• They go into the racks sideways, since they're designed for front-to-back airflow.
• Rack as a whole is supported on springs.
• Rack must be extracted to replace a server.
  – They have a tool, but the procedure is complex.
• Cables exit front and back (mostly back).
• Fixing these problems: Later slides.

Packaging

• A 40' container holds two rows of 16 racks.
• Each rack holds 40 "1U" servers, plus network switch. Total container: 1280 servers.
• If each server draws 200 W, the rack is 8 kW, the container is 256 kW.
• A 64-container data center is 16 MW, plus inefficiency and cooling. 82K computers.
• Each container has independent fire suppression. Reduces insurance cost.
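A quick check of the arithmetic, in Python, using only the figures on this slide:

    # Back-of-the-envelope check of the container arithmetic above.
    SERVERS_PER_RACK = 40
    RACKS_PER_CONTAINER = 2 * 16          # two rows of 16 racks in a 40' container
    WATTS_PER_SERVER = 200
    CONTAINERS = 64

    servers_per_container = SERVERS_PER_RACK * RACKS_PER_CONTAINER   # 1280
    rack_kw = SERVERS_PER_RACK * WATTS_PER_SERVER / 1000             # 8 kW
    container_kw = servers_per_container * WATTS_PER_SERVER / 1000   # 256 kW
    total_servers = servers_per_container * CONTAINERS               # 81,920 (~82K)
    total_mw = container_kw * CONTAINERS / 1000                      # ~16.4 MW, before
                                                                      # inefficiency and cooling
    print(servers_per_container, rack_kw, container_kw, total_servers, total_mw)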

System Design: Power distribution

• The current picture:
  – Servers run on 110/220 V, 60 Hz, single-phase AC.
  – Grid -> Transformer -> Transfer switch -> Servers.
  – Transfer switch connects batteries -> UPS / Diesel generator.
  – Lots and lots of batteries. Batteries require safety systems (acid, hydrogen).
  – The whole thing nets out at ~40-50% efficient. => Large power bills.

Power: Another way

• Servers run on 12 V, 400 Hz, 3-phase AC.
  – Synchronous rectifiers on motherboard make DC (with little filtering).
• Distribution chain: Grid -> 60 Hz motor -> 400 Hz generator -> Transformer -> Servers. Distribution voltage is 480 VAC.
• Rack contains a step-down transformer.
• Diesel generator supplies backup power.
  – Uses a flywheel to store energy until the generator comes up.
  – When engaged, the frequency and voltage drop slightly, but nobody cares.
• Much more efficient (probably > 85%).
• Extremely high reliability.
• The Cray 1 (300 kW, 1971) was powered this way.
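To see how the two chains compare, multiply the per-stage efficiencies; the end-to-end figure is just their product. A minimal sketch: the individual stage values below are illustrative assumptions, only the overall ~40-50% and >85% totals come from the slides.

    # End-to-end efficiency is the product of the per-stage efficiencies.
    # Stage values are illustrative assumptions; only the overall ~40-50%
    # (conventional) and >85% (motor-generator) figures come from the slides.

    def chain_efficiency(stages):
        eff = 1.0
        for stage_eff in stages.values():
            eff *= stage_eff
        return eff

    conventional = {        # Grid -> Transformer -> UPS/batteries -> Server PSU, plus cooling
        "transformer": 0.97,
        "double_conversion_ups": 0.80,
        "server_power_supply": 0.70,
        "cooling_overhead": 0.80,
    }

    motor_generator = {     # Grid -> 60 Hz motor -> 400 Hz generator -> Transformer -> Servers
        "motor_generator_set": 0.93,
        "step_down_transformer": 0.98,
        "onboard_rectifiers": 0.95,
    }

    print(f"conventional:    {chain_efficiency(conventional):.0%}")     # ~43%
    print(f"motor-generator: {chain_efficiency(motor_generator):.0%}")  # ~87%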

The Computers

• We currently use commodity servers designed by HP, Rackable, others.
• Higher quality and reliability than the average PC, but they still operate in the PC ecosystem.
  – IBM doesn't.
• Why not roll our own?

Designing our own

• Minimize SKUs
  – One for computes. Lots of CPU, memory, relatively few disks.
  – One for storage. Lots of disks.
    – Like Petabyte?
  – Maybe one for DB apps. Lots of memory. Properties TBD.
• What do they look like?

More on Computers

• Use custom motherboards, optimizing for our needs, not the needs of disparate applications. We design them.
• Don't necessarily use Intel/AMD processors.
  – Although both seem to be beginning to understand that low power/energy efficiency is important.
• Use no power supplies other than on-board inverters (we have these now).
• For the storage SKU, use commodity disks.
  – Maybe Flash memory has a home here.
• Error correct everything possible, and check what we can't correct.
• When a component doesn't work to its spec, fix it or get another, rather than just turning the "feature" off.
• All cabling exits at the front of the rack.
  – So that it's easy to extract the server from the rack.

Management

• We're doing pretty well here.
  – Automatic management is getting some traction. Opex isn't overwhelming.
• Could probably do better:
  – Better management of servers to provide more fine-grained control than just reboot, reimage.
  – Better sensors to diagnose problems, predict failures.
  – Measure more temperatures than just the CPU.
  – Measuring airflow is very important (not just finding out whether the fan blades turn, since there are no fans in the server).
• Modeling to predict failures?
  – Machine learning is your friend.
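As a gesture at the sensor and modeling points above, here is a minimal sketch of flagging servers whose readings drift from the fleet norm. The sensor names, values, and threshold rule are hypothetical, and this simple heuristic is a stand-in for the learned models the slide has in mind.

    # Hypothetical sketch: flag servers whose sensor readings drift far from the
    # fleet norm.  Sensor names, values, and the threshold rule are illustrative
    # assumptions; a real system would learn thresholds from failure history.
    from statistics import median

    fleet = {
        "srv-001": {"inlet_temp_c": 24.1, "disk_temp_c": 38.0, "airflow_cfm": 95.0},
        "srv-002": {"inlet_temp_c": 24.5, "disk_temp_c": 39.2, "airflow_cfm": 93.0},
        "srv-003": {"inlet_temp_c": 31.8, "disk_temp_c": 47.5, "airflow_cfm": 61.0},
        "srv-004": {"inlet_temp_c": 23.9, "disk_temp_c": 37.4, "airflow_cfm": 96.0},
    }

    def suspects(readings, rel_threshold=0.20):
        """Flag servers with any sensor more than rel_threshold away from the fleet median."""
        flagged = set()
        sensors = next(iter(readings.values())).keys()
        for sensor in sensors:
            med = median(r[sensor] for r in readings.values())
            for server, r in readings.items():
                if abs(r[sensor] - med) > rel_threshold * med:
                    flagged.add(server)
        return flagged

    print(suspects(fleet))   # {'srv-003'}: running hot and starved for airflow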

Networking

• A large part of the total cost.
• Large routers are very expensive, and command astronomical margins.
• They are relatively unreliable – we sometimes see correlated failures in replicated units.
• Serviced by the manufacturer, and they're never on site.
• Router software is large, old, and incomprehensible – frequently the cause of problems.
• By designing our own, we save money and improve reliability.
• We also can get exactly what we want, rather than paying for a lot of features we don't need (data centers aren't mini-Internets).

Data center network differences

• We know (and define) the topology.
• The number of nodes is limited.
• Broadcast/multicast is not required.
• Security is simpler.
  – No malicious attacks, just buggy software and misconfiguration.
• We can distribute applications to servers to distribute load and minimize hot spots.

Data center network design goals

• Reduce the need for large switches in the core.
• Simplify the software. Push complexity to the edge of the network.
• Improve reliability.
• Reduce capital and operating cost.

One approach to data center networks

• Use a 64-node destination-addressed hypercube at the core.
• In the containers, use source-routed trees.
• Use standard link technology, but don't need standard protocols.
  – Mix of optical and copper cables. Short runs are copper, long runs are optical.

Basic switches

• Type 1: Forty 1 Gb/sec ports, plus two 10 Gb/sec ports.
  – These are the leaf switches, one per rack. Center uses 2048 switches.
  – Not replicated (if a switch fails, lose 1/2048 of total capacity).
• Type 2: Ten 10 Gb/sec ports.
  – These are the interior switches.
  – There are two variants, one for destination routing, one for tree routing. Hardware is identical, only FPGA bit streams differ.
  – Hypercube uses 64 switches, containers use 10 each. 704 switches total.
• Both implemented with Xilinx Virtex 5 FPGAs, plus buffer memories and logic.
  – These contain 10 Gb Phys.
  – Type 1 also uses Gb Ethernet Phy chips.
• We can prototype both types with BEE3.
  – Real switches use less expensive FPGAs.
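The switch counts follow from the container layout on the Packaging slide; a quick check:

    # Switch-count check, using the 64-container, 32-racks-per-container layout
    # from the Packaging slide.
    CONTAINERS = 64
    RACKS_PER_CONTAINER = 32

    type1_leaf = CONTAINERS * RACKS_PER_CONTAINER     # one Type 1 per rack -> 2048
    type2_container = CONTAINERS * 10                 # ten Type 2 switches per container
    type2_hypercube = 64                              # one per hypercube node
    type2_total = type2_container + type2_hypercube   # 704

    print(type1_leaf, type2_total)                    # 2048 704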

Data center network

Hypercube properties

• Minimum hop count.
• Even load distribution for all-all communication.
• Reasonable bisection bandwidth.
• Can route around switch/link failures.
• Simple routing:
  – Outport = f(Dest xor NodeNum)
  – No routing tables
• Switches are identical.
• Link-by-link flow control to eliminate drops.
  – Use destination-based input buffering, since a given link carries packets for a limited number of destinations.
  – Links are short enough that this doesn't cause a buffering problem.
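A minimal sketch of the table-free routing rule, reading "Outport = f(Dest xor NodeNum)" as "correct the lowest-numbered differing address bit next". Picking the lowest bit is an assumption; any fixed choice over the set bits of the XOR gives a minimal route.

    # Dimension-order routing in a hypercube: no routing tables, just XOR.
    # Choosing the lowest differing bit is an assumption -- any deterministic
    # choice over the set bits of (node ^ dest) yields a minimal route.

    def next_outport(node: int, dest: int):
        """Return the dimension (output port) to take from `node` toward `dest`,
        or None if the packet has arrived."""
        diff = node ^ dest
        if diff == 0:
            return None
        return (diff & -diff).bit_length() - 1   # index of the lowest set bit

    def route(src: int, dest: int):
        """Full minimal path, one hop per differing address bit."""
        path, node = [src], src
        while (port := next_outport(node, dest)) is not None:
            node ^= 1 << port                    # crossing dimension `port` flips that bit
            path.append(node)
        return path

    # In the dimension-4 (16-node) cube, node 5 (0101) to node 10 (1010)
    # differs in all four bits, so the minimal path has four hops:
    print(route(5, 10))    # [5, 4, 6, 2, 10]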

A 16-node (dimension 4) hypercube

[Figure: nodes numbered 0-15, with each link labeled by its dimension, 0-3.]

Routing within a container

Bandwidths

• Servers -> network: 82 Tb/sec.
• Containers -> cube: 2.5 Tb/sec.
• Average will be lower.
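These peak figures follow from the earlier counts; a quick check (the per-container share and the oversubscription ratio at the container boundary are derived observations, not figures from the talk):

    # Peak-bandwidth check, from the counts on the earlier slides.
    SERVERS = 64 * 32 * 40                     # 81,920 servers
    SERVER_LINK_GBPS = 1                       # each server has a 1 Gb/sec leaf port

    servers_to_network_tbps = SERVERS * SERVER_LINK_GBPS / 1000
    print(f"servers -> network: {servers_to_network_tbps:.1f} Tb/sec")   # ~82

    CONTAINERS_TO_CUBE_TBPS = 2.5              # stated figure
    per_container_gbps = CONTAINERS_TO_CUBE_TBPS * 1000 / 64
    ratio = servers_to_network_tbps / CONTAINERS_TO_CUBE_TBPS
    print(f"per container: ~{per_container_gbps:.0f} Gb/sec into the cube, "
          f"~{ratio:.0f}:1 oversubscription at the container boundary")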

Objections to this scheme

• "Commodity hardware is cheaper"
  – This is commodity hardware.
  – And in the case of the network, it's not cheaper, even for one center (and we build many). Large switches command very large margins.
• "Standards are better"
  – Yes, but only if they do what you need, at acceptable cost.
• "It requires too many different skills"
  – Not as many as you might think.
  – And we would work with engineering/manufacturing partners who would be the ultimate manufacturers. This model has worked before.
• "If this stuff is so great, why aren't others doing it?"
  – They are.

Conclusion

• By treating data centers as systems, and doing full-system optimization, we can achieve:
  – Lower cost, both opex and capex.
  – Higher reliability.
  – Incremental scale-out.
  – More rapid innovation as technology improves.

Questions? .