Specializing a Planet's Computation: ASIC Clouds
ASIC Clouds seem improbable due to high non-recurring engineering (NRE) costs and ASIC inflexibility, but large-scale Bitcoin ASIC Clouds already exist. This article distills lessons from these primordial ASIC Clouds and proposes new designs showing superior total cost of ownership.

ASIC Cloud NRE and Economics
Figure 1. High-level abstract architecture of an ASIC Cloud. Specialized replicated compute accelerators (RCAs) are multiplied up by having multiple copies per application-specific integrated circuit (ASIC), multiple ASICs per server, multiple servers per rack, and multiple racks per datacenter. The server controller can be a field-programmable gate array (FPGA), a microcontroller, or a Xeon processor. The power delivery and cooling systems are customized based on ASIC needs. If required, there would be DRAMs on the printed circuit board (PCB) as well. (PSU: power supply unit.)
engineering (NRE) costs of ASIC development and deployment. As a workload grows, the ASIC Cloud can be scaled in the datacenter by adding more ASIC servers, unlike accelerators in, say, a mobile phone population,3 in which the accelerator/processor mix is fixed at tape out.

Our research examines ASIC Clouds in the context of four key applications that show great potential for ASIC Clouds, including YouTube-style video transcoding, Bitcoin and Litecoin mining, and deep learning. ASICs achieve large reductions in silicon area and energy consumption versus CPUs, GPUs, and FPGAs. We specialize the ASIC server to maximize efficiency, employing optimized ASICs, a customized printed circuit board (PCB), custom-designed cooling systems and specialized power delivery systems, and tailored DRAM and I/O subsystems. ASIC voltages are customized to tweak energy efficiency and minimize total cost of ownership (TCO). The datacenter itself can also be specialized, optimizing rack-level and datacenter-level thermals and power delivery to exploit the knowledge of the computation. We developed tools that consider all aspects of ASIC Cloud design in a bottom-up way, and methodologies that reveal how the designers of these novel systems can optimize TCO in real-world ASIC Clouds. Finally, we propose a new rule that explains when it makes sense to design and deploy an ASIC Cloud, considering NRE.

ASIC Cloud Architecture
At the heart of any ASIC Cloud is an energy-efficient, high-performance, specialized RCA that is multiplied up by having multiple copies per ASIC, multiple ASICs per server, multiple servers per rack, and multiple racks per datacenter (see Figure 1). Work requests from outside the datacenter will be distributed across these RCAs in a scale-out fashion. All system components can be customized for the application to minimize TCO.

Each ASIC interconnects its RCAs using a customized on-chip network. The ASIC's control plane unit also connects to this network and schedules incoming work from the ASIC's off-chip router onto the RCAs. Next, the packaged ASICs are arranged in lanes on a customized PCB and connected to a controller that bridges to the off-PCB interface (1 to 100 Gigabit Ethernet, Remote Direct Memory Access, and PCI Express). In some cases, DRAMs can connect directly to the ASICs. The controller can be implemented by an FPGA, a microcontroller, or a Xeon processor. It schedules remote procedure calls (RPCs) that come from the off-PCB interface onto the ASICs.
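The multiplicative structure of this hierarchy can be captured in a small model. The sketch below is ours, not the authors' tooling, and all example parameters are invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class AsicCloud:
    """Replication hierarchy of an ASIC Cloud: RCAs per ASIC, ASICs per
    server, servers per rack, and racks per datacenter (Figure 1)."""
    rcas_per_asic: int
    asics_per_server: int
    servers_per_rack: int
    racks: int

    def total_rcas(self) -> int:
        # Total accelerator count is the product of all four levels.
        return (self.rcas_per_asic * self.asics_per_server
                * self.servers_per_rack * self.racks)

    def throughput(self, ops_per_rca: float) -> float:
        """Aggregate ops/s when work is distributed scale-out style."""
        return self.total_rcas() * ops_per_rca

# Hypothetical sizing: 8 RCAs/ASIC, 36 ASICs/server, 40 servers/rack, 100 racks.
cloud = AsicCloud(8, 36, 40, 100)
print(cloud.total_rcas())  # 1152000 replicated compute accelerators
```

Because capacity is added one ASIC server at a time, the accelerator/processor mix can track workload growth, unlike a fleet of fixed-function mobile SoCs.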
MAY/JUNE 2017 63
TOP PICKS
64 IEEE MICRO
Evaluating an ASIC Server Configuration
Our ASIC Cloud server configuration evaluator, shown in Figure A1, starts with a Verilog implementation of an accelerator, or a detailed evaluation of the accelerator's properties from the research literature. In the design of an ASIC server, we must decide how many chips should be placed on the printed circuit board (PCB) and how large, in mm² of silicon, each chip should be. The size of each chip determines how many replicated compute accelerators (RCAs) will be on each chip. In each duct-enclosed lane of ASIC chips, each chip receives around the same amount of airflow from the intake fans, but the most downstream chip receives the hottest air, which includes the waste heat from the other chips. Therefore, the thermally bottlenecking ASIC is the one in the back, shown in our detailed computational fluid dynamics (CFD) simulations in Figure A2. Our simulations show that breaking a fixed heat source into smaller ones with the same total heat output improves the mixing of warm and cold areas, resulting in lower temperatures. Using thermal optimization techniques, we established a fundamental connection between an RCA's properties, the number of RCAs placed in an ASIC, and how many ASICs go on a PCB in a server. Given these properties, our heat sink solver determines the optimal heat sink configuration. Results are validated with the CFD simulator. In the "Design Space Evaluation" sidebar, we show how we apply this evaluation flow across the design space to determine TCO and Pareto-optimal points that trade off cost per operation per second (ops/s) and watts per ops/s.
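The downstream-chip bottleneck follows from a simple steady-state energy balance: each chip dumps its waste heat into the airstream, so the chip at the back of the duct breathes the sum of all upstream power. This is a toy model of that effect, not the article's CFD; the airflow and heat-capacity values are our own placeholder assumptions:

```python
def inlet_temps(chip_powers_w, t_ambient_c=25.0,
                airflow_kg_s=0.02, cp_j_kg_k=1005.0):
    """Inlet air temperature seen by each chip in a duct-enclosed lane.

    Chip i's inlet air carries the waste heat of chips 0..i-1:
        T_in[i] = T_ambient + sum(P[0..i-1]) / (m_dot * c_p)
    """
    temps, upstream_heat_w = [], 0.0
    for p in chip_powers_w:
        temps.append(t_ambient_c + upstream_heat_w / (airflow_kg_s * cp_j_kg_k))
        upstream_heat_w += p
    return temps

# Four identical 60 W chips: the last chip's inlet air is the hottest,
# so it bounds power per ASIC at a fixed voltage and energy efficiency.
print(inlet_temps([60.0] * 4))
```

The same balance explains why splitting one large heat source into several smaller ones with equal total output helps: the heat is introduced gradually, improving mixing and lowering peak temperatures.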
Figure A. ASIC server evaluation flow. (1) The server cost, per-server hash rate, and energy efficiency are evaluated using replicated compute accelerator (RCA) properties and a flow that optimizes server heatsinks, die size, voltage, and power density. (2) Thermal verification of an ASIC Cloud server using computational fluid dynamics (CFD) tools to validate the flow results. The farthest ASIC from the fan has the highest temperature and is the bottleneck for power per ASIC at a fixed voltage and energy efficiency.
net accelerator has a tight low-latency service-level agreement.

For our Bitcoin and Litecoin studies, we developed the RCA and got the required parameters such as gate count from placed-and-routed designs in UMC 28 nm using Synopsys IC Compiler and analysis tools (such as PrimeTime). For deep learning and video transcoding, we extracted properties from accelerators in the research literature.
number of ASICs placed in a lane decreases for space reasons. We model the more expensive PCBs required by DRAM, with more layers and better signal/power integrity. We employ two 10-Gigabit Ethernet ports as the off-PCB interface for network-intensive clouds, and
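A server cost model of the kind described can be sketched as a small bill of materials. Every price below is a placeholder of ours, not a figure from the article; the point is only the structure: DRAM-bearing designs pay for the DRAM plus a pricier PCB, and network-intensive designs pay for the dual 10-Gigabit Ethernet ports:

```python
def server_cost(num_asics, die_cost, has_dram, dram_per_asic_cost=0.0,
                base_pcb=100.0, dram_pcb_premium=60.0, nic_10gbe=80.0,
                psu=120.0, fans=40.0, heatsink_per_asic=8.0):
    """Rough ASIC-server bill of materials in dollars (placeholder prices).

    A DRAM-bearing design pays for the DRAM itself plus a more expensive
    PCB (more layers, better signal/power integrity); a network-intensive
    design pays for two 10-Gigabit Ethernet ports.
    """
    pcb = base_pcb + (dram_pcb_premium if has_dram else 0.0)
    per_asic = (die_cost + heatsink_per_asic
                + (dram_per_asic_cost if has_dram else 0.0))
    return pcb + 2 * nic_10gbe + psu + fans + num_asics * per_asic

# Hypothetical 36-ASIC server without DRAM:
print(server_cost(num_asics=36, die_cost=12.0, has_dram=False))  # 1140.0
```

Such a model makes the trade-off explicit: adding DRAM raises both per-ASIC and PCB cost, which must be won back in throughput or efficiency.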
Table 1. ASIC Cloud optimization results for four applications: (a) Bitcoin, (b) Litecoin, (c) video transcoding,
and (d) deep learning.
Property | Energy-optimal server | TCO-optimal server | Cost-optimal server
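Table 1's three columns fall out of a TCO objective that adds lifetime energy cost to capital cost: the energy-optimal point minimizes W per op/s, the cost-optimal point minimizes $ per op/s, and the TCO-optimal point minimizes their weighted sum. A hedged sketch, with an electricity price, lifetime, and candidate operating points invented purely for illustration:

```python
def tco_per_ops(capex_per_ops, watts_per_ops, years=1.5, dollars_per_kwh=0.07):
    """$/op/s of total cost of ownership: hardware plus lifetime energy.

    TCO/ops = capex/ops + (W per op/s) * hours * ($/kWh) / 1000
    """
    hours = years * 365 * 24
    return capex_per_ops + watts_per_ops * hours * dollars_per_kwh / 1000.0

# Hypothetical operating points for one design ($/op/s, W per op/s):
# lowering voltage cuts W/op/s but needs more silicon, raising $/op/s.
points = {
    "cost optimal":   (1.00, 0.60),
    "tco optimal":    (1.20, 0.35),
    "energy optimal": (2.00, 0.22),
}
for name, (capex, watts) in points.items():
    # The TCO-optimal point yields the lowest combined $/op/s.
    print(name, round(tco_per_ops(capex, watts), 3))
```

With these made-up numbers the middle point wins: it spends a little more silicon than the cost-optimal design to save energy, without overpaying for the last increment of efficiency.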
than CPUs within fixed environmental energy limits.

With the coming explosive growth of planet-scale computation, we must work to contain the exponentially growing environmental impact of datacenters across the world. ASIC Clouds promise to help address this problem. By specializing the datacenter, they can do greater amounts of computation under environmentally determined energy limits. The future is planet-scale, and specialized ASICs will be everywhere.

Acknowledgments
This work was partially supported by NSF awards 1228992, 1563767, and 1565446, and by STARnet's Center for Future Architectures Research, an SRC program sponsored by MARCO and DARPA.

References
1. M.B. Taylor, "A Landscape of the Dark Silicon Design Regime," IEEE Micro, vol. 33, no. 5, 2013, pp. 8–19.
2. I. Magaki et al., "ASIC Clouds: Specializing the Datacenter," Proc. 43rd Int'l Symp. Computer Architecture, 2016, pp. 178–190.
3. N. Goulding-Hotta et al., "The GreenDroid Mobile Application Processor: An Architecture for Silicon's Dark Future," IEEE Micro, vol. 31, no. 2, 2011, pp. 86–95.
4. M.B. Taylor, "Bitcoin and the Age of Bespoke Silicon," Proc. Int'l Conf. Compilers, Architectures and Synthesis for Embedded Systems, 2013, article 16.
5. M. Khazraee et al., "Moonwalk: NRE Optimization in ASIC Clouds," Proc. 22nd Int'l Conf. Architectural Support for Programming Languages and Operating Systems, 2017, pp. 511–526.

ence and Engineering at the University of California, San Diego. His research interests include ASIC Clouds, low-cost ASIC design, and systems. Vega received an MSc in electrical and computer engineering from the University of Kaiserslautern, Germany. Contact him at lvgutierrez@ucsd.edu.

Ikuo Magaki is an engineer at Apple. He performed the work for this article as a Toshiba visiting scholar in the Department of Computer Science and Engineering at the University of California, San Diego. His research interests include ASIC design and ASIC Clouds. Magaki received an MSc in computer science from Keio University, Japan. Contact him at ikuo.magaki@icloud.com.

Michael Bedford Taylor advises his PhD students at various well-known west coast universities. He performed the work for this article while at the University of California, San Diego. His research interests include tiled multicore architecture, dark silicon, HLS accelerators for mobile, Bitcoin mining hardware, and ASIC Clouds. Taylor received a PhD in electrical engineering and computer science from the Massachusetts Institute of Technology. Contact him at profmbt@uw.edu.