Compiling Algorithms
for Heterogeneous Systems
Synthesis Lectures on
Computer Architecture
Editor
Margaret Martonosi, Princeton University
Founding Editor Emeritus
Mark D. Hill, University of Wisconsin, Madison
Synthesis Lectures on Computer Architecture publishes 50- to 100-page publications on topics
pertaining to the science and art of designing, analyzing, selecting and interconnecting hardware
components to create computers that meet functional, performance and cost goals. The scope will
largely follow the purview of premier computer architecture conferences, such as ISCA, HPCA,
MICRO, and ASPLOS.

Compiling Algorithms for Heterogeneous Systems


Steven Bell, Jing Pu, James Hegarty, and Mark Horowitz
2018

Architectural and Operating System Support for Virtual Memory


Abhishek Bhattacharjee and Daniel Lustig
2017

Deep Learning for Computer Architects


Brandon Reagen, Robert Adolf, Paul Whatmough, Gu-Yeon Wei, and David Brooks
2017

On-Chip Networks, Second Edition


Natalie Enright Jerger, Tushar Krishna, and Li-Shiuan Peh
2017

Space-Time Computing with Temporal Neural Networks


James E. Smith
2017

Hardware and Software Support for Virtualization


Edouard Bugnion, Jason Nieh, and Dan Tsafrir
2017
Datacenter Design and Management: A Computer Architect’s Perspective
Benjamin C. Lee
2016

A Primer on Compression in the Memory Hierarchy


Somayeh Sardashti, Angelos Arelakis, Per Stenström, and David A. Wood
2015

Research Infrastructures for Hardware Accelerators


Yakun Sophia Shao and David Brooks
2015

Analyzing Analytics
Rajesh Bordawekar, Bob Blainey, and Ruchir Puri
2015

Customizable Computing
Yu-Ting Chen, Jason Cong, Michael Gill, Glenn Reinman, and Bingjun Xiao
2015

Die-stacking Architecture
Yuan Xie and Jishen Zhao
2015

Single-Instruction Multiple-Data Execution


Christopher J. Hughes
2015

Power-Efficient Computer Architectures: Recent Advances


Magnus Själander, Margaret Martonosi, and Stefanos Kaxiras
2014

FPGA-Accelerated Simulation of Computer Systems


Hari Angepat, Derek Chiou, Eric S. Chung, and James C. Hoe
2014

A Primer on Hardware Prefetching


Babak Falsafi and Thomas F. Wenisch
2014

On-Chip Photonic Interconnects: A Computer Architect’s Perspective


Christopher J. Nitta, Matthew K. Farrens, and Venkatesh Akella
2013
Optimization and Mathematical Modeling in Computer Architecture
Tony Nowatzki, Michael Ferris, Karthikeyan Sankaralingam, Cristian Estan, Nilay Vaish, and
David Wood
2013

Security Basics for Computer Architects


Ruby B. Lee
2013

The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale


Machines, Second edition
Luiz André Barroso, Jimmy Clidaras, and Urs Hölzle
2013

Shared-Memory Synchronization
Michael L. Scott
2013

Resilient Architecture Design for Voltage Variation


Vijay Janapa Reddi and Meeta Sharma Gupta
2013

Multithreading Architecture
Mario Nemirovsky and Dean M. Tullsen
2013

Performance Analysis and Tuning for General Purpose Graphics Processing Units
(GPGPU)
Hyesoon Kim, Richard Vuduc, Sara Baghsorkhi, Jee Choi, and Wen-mei Hwu
2012

Automatic Parallelization: An Overview of Fundamental Compiler Techniques


Samuel P. Midkiff
2012

Phase Change Memory: From Devices to Systems


Moinuddin K. Qureshi, Sudhanva Gurumurthi, and Bipin Rajendran
2011

Multi-Core Cache Hierarchies


Rajeev Balasubramonian, Norman P. Jouppi, and Naveen Muralimanohar
2011

A Primer on Memory Consistency and Cache Coherence


Daniel J. Sorin, Mark D. Hill, and David A. Wood
2011
Dynamic Binary Modification: Tools, Techniques, and Applications
Kim Hazelwood
2011

Quantum Computing for Computer Architects, Second Edition


Tzvetan S. Metodi, Arvin I. Faruque, and Frederic T. Chong
2011

High Performance Datacenter Networks: Architectures, Algorithms, and Opportunities


Dennis Abts and John Kim
2011

Processor Microarchitecture: An Implementation Perspective


Antonio González, Fernando Latorre, and Grigorios Magklis
2010

Transactional Memory, 2nd edition


Tim Harris, James Larus, and Ravi Rajwar
2010

Computer Architecture Performance Evaluation Methods


Lieven Eeckhout
2010

Introduction to Reconfigurable Supercomputing


Marco Lanzagorta, Stephen Bique, and Robert Rosenberg
2009

On-Chip Networks
Natalie Enright Jerger and Li-Shiuan Peh
2009

The Memory System: You Can’t Avoid It, You Can’t Ignore It, You Can’t Fake It
Bruce Jacob
2009

Fault Tolerant Computer Architecture


Daniel J. Sorin
2009

The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale


Machines
Luiz André Barroso and Urs Hölzle
2009
Computer Architecture Techniques for Power-Efficiency
Stefanos Kaxiras and Margaret Martonosi
2008

Chip Multiprocessor Architecture: Techniques to Improve Throughput and Latency


Kunle Olukotun, Lance Hammond, and James Laudon
2007

Transactional Memory
James R. Larus and Ravi Rajwar
2006

Quantum Computing for Computer Architects


Tzvetan S. Metodi and Frederic T. Chong
2006
Copyright © 2018 by Morgan & Claypool

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in
any form or by any means—electronic, mechanical, photocopy, recording, or any other except for brief quotations
in printed reviews, without the prior permission of the publisher.

Compiling Algorithms for Heterogeneous Systems


Steven Bell, Jing Pu, James Hegarty, and Mark Horowitz
www.morganclaypool.com

ISBN: 9781627059619 paperback


ISBN: 9781627057301 ebook
ISBN: 9781681732633 hardcover

DOI 10.2200/S00816ED1V01Y201711CAC043

A Publication in the Morgan & Claypool Publishers series


SYNTHESIS LECTURES ON COMPUTER ARCHITECTURE

Lecture #43
Series Editor: Margaret Martonosi, Princeton University
Founding Editor Emeritus: Mark D. Hill, University of Wisconsin, Madison
Series ISSN
Print 1935-3235 Electronic 1935-3243
Compiling Algorithms
for Heterogeneous Systems

Steven Bell
Stanford University

Jing Pu
Google

James Hegarty
Oculus

Mark Horowitz
Stanford University

SYNTHESIS LECTURES ON COMPUTER ARCHITECTURE #43

Morgan & Claypool Publishers
ABSTRACT
Most emerging applications in imaging and machine learning must perform immense amounts
of computation while holding to strict limits on energy and power. To meet these goals, archi-
tects are building increasingly specialized compute engines tailored for these specific tasks. The
resulting computer systems are heterogeneous, containing multiple processing cores with wildly
different execution models. Unfortunately, the cost of producing this specialized hardware—and
the software to control it—is astronomical. Moreover, the task of porting algorithms to these
heterogeneous machines typically requires that the algorithm be partitioned across the machine
and rewritten for each specific architecture, which is time consuming and prone to error.
Over the last several years, the authors have approached this problem using domain-
specific languages (DSLs): high-level programming languages customized for specific domains,
such as database manipulation, machine learning, or image processing. By giving up general-
ity, these languages are able to provide high-level abstractions to the developer while producing
high-performance output. The purpose of this book is to spur the adoption and the creation of
domain-specific languages, especially for the task of creating hardware designs.
In the first chapter, a short historical journey explains the forces driving computer archi-
tecture today. Chapter 2 describes the various methods for producing designs for accelerators,
outlining the push for more abstraction and the tools that enable designers to work at a higher
conceptual level. From there, Chapter 3 provides a brief introduction to image processing al-
gorithms and hardware design patterns for implementing them. Chapters 4 and 5 describe and
compare Darkroom and Halide, two domain-specific languages created for image processing
that produce high-performance designs for both FPGAs and CPUs from the same source code,
enabling rapid design cycles and quick porting of algorithms. The final section describes how
the DSL approach also simplifies the problem of interfacing between application code and the
accelerator by generating the driver stack in addition to the accelerator configuration.
This book should serve as a useful introduction to domain-specialized computing for com-
puter architecture students and as a primer on domain-specific languages and image processing
hardware for those with more experience in the field.

KEYWORDS
domain-specific languages, high-level synthesis, compilers, image processing accel-
erators, stencil computation

Contents
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii

Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 CMOS Scaling and the Rise of Specialization . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 What Will We Build Now? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.1 Performance, Power, and Area . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.2 Flexibility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.3 The Cost of Specialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.4 Good Applications for Acceleration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

2 Computations and Compilers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17


2.1 Direct Specification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2 Compilers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3 High-level Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.4 Domain-specific Languages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

3 Image Processing with Stencil Pipelines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27


3.1 Image Signal Processors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2 Example Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

4 Darkroom: A Stencil Language for Image Processing . . . . . . . . . . . . . . . . . . . . 33


4.1 Language Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.2 A Simple Pipeline in Darkroom . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.3 Optimal Synthesis of Line-buffered Pipelines . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.3.1 Generating Line-buffered Pipelines . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.3.2 Shift Operator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.3.3 Finding Optimal Shifts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.4 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.4.1 ASIC and FPGA Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.4.2 CPU Compilation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.5 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.5.1 Scheduling for Hardware Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.5.2 Scheduling for General-purpose Processors . . . . . . . . . . . . . . . . . . . . . . 49
4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

5 Programming CPU/FPGA Systems from Halide . . . . . . . . . . . . . . . . . . . . . . . . 51


5.1 The Halide Language . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.2 Mapping Halide to Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.3 Compiler Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.3.1 Architecture Parameter Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.3.2 IR Transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.3.3 Loop Perfection Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.3.4 Code Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.4 Implementation and Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.4.1 Programmability and Efficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.4.2 Quality of Hardware Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

6 Interfacing with Specialized Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69


6.1 Common Interfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
6.2 The Challenge of Interfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
6.3 Solutions to the Interface Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
6.3.1 Compiler Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
6.3.2 Library Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
6.3.3 API plus DSL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
6.4 Drivers for Darkroom and Halide on FPGA . . . . . . . . . . . . . . . . . . . . . . . . . . 72
6.4.1 Memory and Coherency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
6.4.2 Running the Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
6.4.3 Generating Systems and Drivers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
6.4.4 Generating the Whole Stack with Halide . . . . . . . . . . . . . . . . . . . . . . . 76
6.4.5 Heterogeneous System Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

7 Conclusions and Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

Authors’ Biographies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

Preface
Cameras are ubiquitous, and computers are increasingly being used to process image data to
produce better images, recognize objects, build representations of the physical world, and extract
salient bits from massive streams of video, among countless other things. But while the data
deluge continues to increase, and while the number of transistors that can be cost-effectively
placed on a silicon die is still going up (for now), limitations on power and energy mean that
traditional CPUs alone are insufficient to meet the demand. As a result, architects are building
more and more specialized compute engines tailored to provide energy and performance gains
on these specific tasks.
Unfortunately, the cost of producing this specialized hardware—and the software to con-
trol it—is astronomical. Moreover, the resulting computer systems are heterogeneous, contain-
ing multiple processing cores with wildly different execution models. The task of porting al-
gorithms to these heterogeneous machines typically requires that the algorithm be partitioned
across the machine and rewritten for each specific architecture, which is time consuming and
prone to error.
Over the last several years, we have approached this problem using domain-specific lan-
guages (DSLs)—high-level programming languages customized for specific domains, such as
database manipulation, machine learning, or image processing. By giving up generality, these
languages are able to provide high-level abstractions to the developer while producing high-
performance output. Our purpose in writing this book is to spur the adoption and the creation
of domain-specific languages, especially for the task of creating hardware designs.
This book is not an exhaustive description of image processing accelerators, nor of domain-
specific languages. Instead, we aim to show why DSLs make sense in light of the current state
of computer architecture and development tools, and to illustrate with some specific examples
what advantages DSLs provide, and what tradeoffs must be made when designing them. Our
examples will come from image processing, and our primary targets are mixed CPU/FPGA
systems, but the underlying techniques and principles apply to other domains and platforms as
well. We assume only passing familiarity with image processing, and focus our discussion on the
architecture and compiler sides of the problem.
In the first chapter, we take a short historical journey to explain the forces driving com-
puter architecture today. Chapter 2 describes the various methods for producing designs for
accelerators, outlining the push for more abstraction and the tools that enable designers to work
at a higher conceptual level. In Chapter 3, we provide a brief introduction to image processing
algorithms and hardware design patterns for implementing them, which we use through the
rest of the book. Chapters 4 and 5 describe Darkroom and Halide, two domain-specific lan-
guages created for image processing. Both are able to produce high-performance designs for
both FPGAs and CPUs from the same source code, enabling rapid design cycles and quick
porting of algorithms. We present both of these examples because comparing and contrasting
them illustrates some of the tradeoffs and design decisions encountered when creating a DSL.
The final portion of the book discusses the task of controlling specialized hardware within a het-
erogeneous system running a multiuser operating system. We give a brief overview of how this
works on Linux and show how DSLs enable us to automatically generate the necessary driver
and interface code, greatly simplifying the creation of that interface.
This book assumes at least some background in computer architecture, such as an advanced
undergraduate or early graduate course in CPU architecture. We also build on ideas from com-
pilers, programming languages, FPGA synthesis, and operating systems, but the book should
be accessible to those without extensive study on these topics.

Steven Bell, Jing Pu, James Hegarty, and Mark Horowitz


January 2018

Acknowledgments
Any work of this size is necessarily the result of many collaborations. We are grateful to John
Brunhaver, Zachary DeVito, Pat Hanrahan, Jonathan Ragan-Kelley, Steve Richardson, Jeff Set-
ter, Artem Vasilyev, and Xuan Yang, who influenced our thinking on these topics and helped
develop portions of the systems described in this book. We’re also thankful to Mike Morgan,
Margaret Martonosi, and the team at Morgan & Claypool for shepherding us through the
writing and production process, and to the reviewers whose feedback made this a much bet-
ter manuscript than it would have been otherwise.

Steven Bell, Jing Pu, James Hegarty, and Mark Horowitz


January 2018

CHAPTER 1

Introduction
When the International Technology Roadmap for Semiconductors organization announced its
final roadmap in 2016, it was widely heralded as the official end of Moore’s law [ITRS, 2016].
As we write this, 7 nm technology is still projected to provide cheaper transistors than current
technology, so it isn’t over just yet. But after decades of transistor scaling, the ITRS report
revealed at least modest agreement across the industry that cost-effective scaling to 5 nm and
below was hardly a guarantee.
While the death of Moore’s law remains a topic of debate, there isn’t any debate that the
nature and benefit of scaling has decreased dramatically. Since the early 2000s, scaling has not
brought the power reductions it used to provide. As a result, computing devices are limited by
the electrical power they can dissipate, and this limitation has forced designers to find more
energy-efficient computing structures. In the 2000s this power limitation led to the rise of mul-
ticore processing, and is the reason that practically all current computing devices (outside of
embedded systems) contain multiple CPUs on each die. But multiprocessing was not enough to
continue to scale performance, and specialized processors were also added to systems to make
them more energy efficient. GPUs were added for graphics and data-parallel floating point op-
erations, specialized image and video processors were added to handle video, and digital signal
processors were added to handle the processing required for wireless communication.
On one hand, this shift in structure has made computation more energy efficient; on the
other, it has made programming the resulting systems much more complex. The vast major-
ity of algorithms and programming languages were created for an abstract computing machine
running a single thread of control, with access to the entire memory of the machine. Changing
these algorithms and languages to leverage multiple threads is difficult, and mapping them to
use the specialized processors is near impossible. As a result, accelerators only get used when
performance is essential to the application; otherwise, the code is written for CPU and declared
“good enough.” Unless we develop new languages and tools that dramatically simplify the task
of mapping algorithms onto these modern heterogeneous machines, computing performance
will stagnate.
This book describes one approach to address this issue. By restricting the application do-
main, it is possible to create programming languages and compilers that can ease the burden of
creating and mapping applications to specialized computing resources, allowing us to run com-
plete applications on heterogeneous platforms. We will illustrate this with examples from image
processing and computer vision, but the underlying principles extend to other domains.
The rest of this chapter explains the constraints that any solution to this problem must
work within. The next section briefly reviews how computers were initially able to take advantage
of Moore’s law scaling without changing the programming model, why that is no longer the case,
and why energy efficiency is now key to performance scaling. Section 1.2 then shows how to
compare different power-constrained designs to determine which is best. Since performance
and power are tightly coupled, they both need to be considered to make the best decision. Using
these metrics, and some information about the energy and area cost of different operations, this
section also points out the types of algorithms that benefit the most from specialized compute
engines. While these metrics show the potential of specialization, Section 1.3 describes the costs
of this approach, which historically required large teams to design the customized hardware and
develop the software that ran on it. The remaining chapters in this book describe one approach
that addresses these cost issues.

1.1 CMOS SCALING AND THE RISE OF SPECIALIZATION


From the earliest days of electronic computers, improvements in physical technology have con-
tinually driven computer performance. The first few technology changes were discrete jumps,
first from vacuum tubes to bipolar transistors in the 1950s, and then from discrete transistors to
bipolar integrated circuits (ICs) in the 1960s. Once computers were built with ICs, they were
able to take advantage of Moore’s law, the prediction-turned-industry-roadmap which stated
that the number of components that could be economically packed onto an integrated circuit
would double every two years [Moore, 1965].
As MOS transistor technology matured, gates built with MOS transistors used less power
and area than gates built with bipolar transistors, and it became clear in the late 1970s that MOS
technology would dominate. During this time Robert Dennard at IBM Research published his
paper on MOS scaling rules, which showed different approaches that could be taken to scale
MOS transistors [Dennard et al., 1974]. In particular, he observed that if a transistor’s operating
voltage and doping concentration were scaled along with its physical dimensions, then a number
of other properties scaled nicely as well, and the resized transistor would behave predictably.
If a MOS transistor is shrunk by a factor of 1/κ in each linear dimension, and the operating
voltage is lowered by the same 1/κ, then several things follow:
1. Transistors get smaller, allowing κ² more logic gates in the same silicon area.
2. Voltages and currents inside the transistor scale by a factor of 1/κ.
3. The effective resistance of the transistor, V/I, remains constant, due to 2 above.
4. The gate capacitance C shrinks by a factor of 1/κ (1/κ² due to decreased area, multiplied
by κ due to reduced electrode spacing).
The switching time for a logic gate is proportional to the resistance of the driving transistor
multiplied by the capacitance of the driven transistor. If the effective resistance remains constant
while the capacitance decreases by 1/κ, then the overall delay also decreases by 1/κ, and the chip
can be run faster by a factor of κ.
Taken together, these scaling factors mean that κ² more logic gates are switched κ× faster,
for a total increase of κ³ more gate evaluations per second. At the same time, the energy required
to switch a logic gate is proportional to CV². With both capacitance and voltage decreasing by
a factor of 1/κ, the energy per gate evaluation decreased by a factor of 1/κ³.
During this period, roughly every other year, a new technology process yielded transistors
which were about 1/√2 as large in each dimension. Following Dennard scaling, this would give
a chip with twice as many gates and a faster clock by a factor of 1.4, making it 2.8× more
powerful than the previous one. Simultaneously, however, the energy dissipated by each gate
evaluation dropped by 2.8×, meaning that total power required was the same as the previous
chip. This remarkable result allowed each new generation to achieve nearly a 3× improvement
for the same die area and power.
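To make the arithmetic concrete, the short sketch below plugs in the generational shrink factor κ = √2 described above and reports how gate count, clock rate, throughput, and total power change under ideal Dennard scaling. It is only an illustration of the preceding paragraph, not code from the original lecture.

```cpp
#include <cmath>
#include <cstdio>

int main() {
    const double k = std::sqrt(2.0);  // linear scale factor per generation (dimensions shrink by 1/k)

    double gates      = k * k;                // k^2 more logic gates in the same area
    double clock      = k;                    // gates switch k times faster
    double throughput = gates * clock;        // k^3 more gate evaluations per second
    double energy     = 1.0 / (k * k * k);    // energy per gate evaluation falls by 1/k^3
    double power      = throughput * energy;  // total power relative to the previous chip

    std::printf("gates %.2fx, clock %.2fx, throughput %.2fx, power %.2fx\n",
                gates, clock, throughput, power);
    // Prints roughly: gates 2.00x, clock 1.41x, throughput 2.83x, power 1.00x
    return 0;
}
```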
This scaling is great in theory, but what happened in practice is somewhat more circuitous.
First, until the mid-1980s, most complex ICs were made with nMOS rather than CMOS gates,
which dissipate power even when they aren’t switching (known as static power). Second, during
this period power supply voltages remained at 5 V, a standard set in the bipolar IC days. As
a result of both of these, the power per gate did not change much even as transistors scaled
down. As nMOS chips grew more complex, the power dissipation of these chips became a
serious problem. This eventually forced the entire industry to transition from nMOS to CMOS
technology, despite the additional manufacturing complexity and lower intrinsic gate speed of
CMOS.
After transitioning to CMOS ICs in the mid-1980s, power supply voltages began to scale
down, but not exactly in sync with technology. While transistor density and clock speed contin-
ued to scale, the energy per logic gate dropped more slowly. With the number of gate evaluations
per second increasing faster than the energy of gate evaluation was scaling down, the overall chip
power grew exponentially.
This power scaling is exactly what we see when we look at historical data from CMOS
microprocessors, shown in Figure 1.1. From 1980 to 2000, the number of transistors on a chip
increased by about 500× (Figure 1.1a), which corresponds to scaling transistor feature size by
roughly 20×. During this same period of time, processor clock frequency increased by 100×,
which is 5× faster than one would expect from simple gate speed (Figure 1.1b). Most of this ad-
ditional clock speed gain came from microarchitectural changes to create more deeply pipelined
“short tick” machines with fewer gates per cycle, which were enabled by better circuit designs
of key functional units. While these fast clocks were good for performance, they were bad from
a power perspective.
By 2000, computers were executing 50,000× more gate evaluations per second than they
had in the 1980s. During this time the average capacitance had scaled down, providing a 20×
energy savings, but power supply voltages had only scaled by 4–5× (Figure 1.1c), giving roughly
a 25× savings. Taken together the capacitance and supply scaling only reduce the gate energy
by around 500×, which means that the power dissipation of the processors should increase by
two orders of magnitude during this period. Figure 1.1d shows that is exactly what happened.

[Figure 1.1 shows four panels covering 1970 to 2020: (a) Transistors Per Chip, (b) CPU Frequency, (c) Operating Voltage, and (d) Power Dissipation (thermal design power).]

Figure 1.1: From the 1960s until the early 2000s, transistor density and operating frequency
scaled up exponentially, providing exponential performance improvements. Power dissipa-
tion increased but was kept in check by lowering the operating voltage. Data from CPUDB
[Danowitz et al., 2012].

Up to this point, all of these additional transistors were used for a host of architectural im-
provements that increased performance even further, including pipelined datapaths, superscalar
instruction issue, and out-of-order execution. However, the instruction set architectures (ISAs)
for various processors generally remained the same through multiple hardware revisions, mean-
ing that existing software could run on the newer machine without modification—and reap a
performance improvement.
But around 2004, Dennard scaling broke down. Lowering the gate threshold voltage fur-
ther caused the leakage power to rise unacceptably high, so the supply voltage leveled out just below 1 V.
Without the possibility to manage the power density by scaling voltage, manufacturers hit
the “power wall” (the red line in Figure 1.1d). Chips such as the Intel Pentium 4 were dissipating
a little over 100 W at peak performance, which is roughly the limit of a traditional package
with a heatsink-and-fan cooling system. Running a CPU at significantly higher power than this
requires an increasingly complex cooling system, both at a system level and within the chip itself.
Pushed up against the power wall, the only choice was to stop increasing the clock fre-
quency and find other ways to increase performance. Although Intel had predicted processor
clock rates over 10 GHz, actual numbers peaked around 4 GHz and settled back between 2 and
4 GHz (Figure 1.1b).
Even though Dennard scaling had stopped, taking down frequency scaling with it,
Moore’s law continued its steady march forward. This left architects with an abundance of tran-
sistors, but the traditional microarchitectural approaches to improving performance had been
mostly mined out. As a result, computer architecture has turned in several new directions to
improve performance without increasing power consumption.
The first major tack was symmetric multicore, which stamped down two (and then four,
and then eight) copies of the CPU on each chip. This has the obvious benefit of delivering more
computational power for the same clock rate. Doubling the core count still doubles the total
power, but if the clock frequency is dialed back, the chip runs at a lower voltage, keeping the
energy constant while maintaining some of the performance advantage of having multiple cores.
This is especially true if the parallel cores are simplified and designed for energy efficiency rather
than single-thread performance. Nonetheless, even simple CPU cores incur significant overhead
to compute their results, and there is a limit to how much efficiency can be achieved simply by
making more copies.
The next theme was to build processors to exploit regularity in certain applications, lead-
ing to the rise of single-instruction-multiple-data (SIMD) instruction sets and general-purpose
GPU computing (GPGPU). These go further than symmetric multicore in that they amortize
the instruction fetch and decode steps across many hardware units, taking advantage of data
parallelism. Neither SIMD nor GPUs were new; SIMD had existed for decades as a staple
of supercomputer architectures and made its way into desktop processors for multimedia ap-
plications along with GPUs in the late 1990s. But in the mid-2000s, they started to become
prominent as a way to accelerate traditional compute-intensive applications.
A third major tack in architecture was the proliferation of specialized accelerators, which
go even further in stripping out control flow and optimizing data movement for particular appli-
cations. This trend was hastened by the widespread migration to mobile devices and “the cloud,”
where power is paramount and typical use is dominated by a handful of tasks. A modern smart-
phone System-on-chip (SoC) contains more than a dozen custom compute engines, created
specifically to perform intensive tasks that would be impossible to run in real time on the main
CPU. For example, communicating over WiFi and cellular networks requires complex coding
and modulation/demodulation, which is performed on a small collection of hardware units spe-
cialized for these signal processing tasks. Likewise, decoding or encoding video—whether for
watching Netflix, video chatting, or camera filming—is handled by hardware blocks that only
perform this specific task. And the process of capturing raw pixels and turning them into a
pleasing (or at least presentable) image is performed by a long pipeline of hardware units that
demosaic, color balance, denoise, sharpen, and gamma-correct the image.
Even low-intensity tasks are getting accelerators. For example, playing music from an
MP3 file requires relatively little computational work, but the CPU must wake up a few dozen
times per second to fill a buffer with sound samples. For power efficiency, it may be better to
have a dedicated chip (or accelerator within the SoC, decoupled from the CPU) that just handles
audio.
While there remain some performance gains still to be squeezed out of thread and data
parallelism by incrementally advancing CPU and GPU architectures, they cannot close the gap
to a fully customized ASIC. The reason, as we’ve already hinted, comes down to power.
Cell phones are power-limited both by their battery capacity (roughly 8–12 Wh) and the
amount of heat it is acceptable to dissipate in the user’s hand (around 2 W). The datacenter is
the same story at a different scale. A warehouse-sized datacenter consumes tens of megawatts,
requiring a dedicated substation and a cheap source of electrical power. And like phones, data
center performance is constrained partly by the limits of our ability to get heat out, as evidenced
by recent experiments and plans to build datacenters in caves or in frigid parts of the ocean.
Thus, in today’s power-constrained computing environment, the formula for improvement is
simple: performance per watt is performance.
Only specialized architectures can optimize the data storage and movement to achieve the
energy reduction we want. As we will discuss in Section 1.4, specialized accelerators are able to
eliminate the overhead of instructions by “baking” them into the computation hardware itself.
They also eliminate waste for data movement by designing the storage to match the algorithm.
Of course, general-purpose processors are still necessary for most code, and so modern
systems are increasingly heterogeneous. As mentioned earlier, SoCs for mobile devices contain
dozens of processors and specialized hardware units, and datacenters are increasingly adding
GPUs, FPGAs, and ASIC accelerators [AWS, 2017, Norman P. Jouppi et al., 2017].
In the remainder of this chapter, we’ll describe the metrics that characterize a “good”
accelerator and explain how these factors will determine the kind of systems we will build in the
future. Then we lay out the challenges to specialization and describe the kinds of applications
for which we can expect accelerators to be most effective.
1.2 WHAT WILL WE BUILD NOW?
Given that specialized accelerators are—and will continue to be—an important part of computer
architecture for the foreseeable future, the question arises: What makes a good accelerator? Or
said another way, if I have a potential set of designs, how do I choose what to add to my SoC
or datacenter, if anything?

1.2.1 PERFORMANCE, POWER, AND AREA


On the surface, the good things we want are obvious. We want high performance, low power,
and low cost.
Raw performance—the speed at which a device is able to perform a computation—is
the most obvious measure of “good-ness.” Consumers will throw down cash for faster devices,
whether that performance means quicker web page loads or richer graphics. Unfortunately, this
isn’t easy to quantify with the most commonly advertised metrics.
Clock speed matters, but we also need to account for how much work is done on each
clock cycle. Multiplying clock speed by the number of instructions issued per cycle is better, but
still ignores the fact that some instructions might do much more work than others. And on top
of this, we have the fact that utilization is rarely 100% and depends heavily on the architecture
and application.
We can quantify performance in a device-independent way by counting the number of
essential operations performed per unit time. For the purposes of this metric, we define “essen-
tial operations” to include only the operations that form the actual result of the computation.
Most devices require a great deal of non-essential computation, such as decoding instructions or
loading and storing intermediate data. These are “non-essential” not because they are pointless
or unnecessary but because they are not intrinsically required to perform the computation. They
are simply overhead incurred by the specific architecture.
With this definition, adding two pieces of data to produce an intermediate result is an
essential operation, but incrementing a loop counter is not since the latter is required by the
implementation and not the computation itself.
To make things concrete, a 3 × 3 convolution on a single-channel image requires nine mul-
tiplications (multiplying 3 × 3 pixels by their corresponding weights) and eight 2-input additions
per output pixel. For a 640 × 480 image (307,200 pixels), this is a little more than 5.2 million
total operations.
A CPU implementation requires many more instructions than this to compute the result
since the instruction stream includes conditional branches, loop index computations, and so
forth. On the flip side, some implementations might require fewer instructions than operations,
if they process multiple pieces of data on each instruction or have complex instructions that
fuse multiple operations. But implementations across this whole spectrum can be compared if
we calculate everything in terms of device-independent operations, rather than device-specific
instructions.
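As a sanity check on the numbers above, the sketch below counts essential operations for a K × K convolution. The 640 × 480 image and 3 × 3 kernel are the ones used in the text; the function itself is purely illustrative.

```cpp
#include <cstdint>
#include <cstdio>

// Essential operations for a KxK convolution over a WxH single-channel image:
// K*K multiplies and K*K - 1 two-input additions per output pixel. Border
// handling is ignored, matching the back-of-the-envelope estimate in the text.
std::uint64_t essential_ops(std::uint64_t width, std::uint64_t height, std::uint64_t k) {
    std::uint64_t per_pixel = k * k + (k * k - 1);
    return width * height * per_pixel;
}

int main() {
    std::uint64_t ops = essential_ops(640, 480, 3);  // 307,200 pixels x 17 ops each
    std::printf("3x3 convolution on 640x480: %llu essential operations\n",
                static_cast<unsigned long long>(ops));
    // Prints 5222400 -- a little more than 5.2 million operations.
    return 0;
}
```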
The second metric is power consumption, measured in Watts. In a datacenter context,
the power consumption is directly related to the operating cost, and thus to the total cost of
ownership (TCO). In a mobile device, power consumption determines how long the battery will
last (or how large a battery is necessary for the device to survive all day). Power consumption also
determines the maximum computational load that can be sustained without causing the device
to overheat and throttle back.
The third metric is cost. We’ll discuss development costs further in the following section,
but for now it is sufficient to observe that the production cost of the final product is closely related
to the silicon area of the chip, typically measured in square millimeters (mm²). More chips of a
smaller design will fit on a fixed-size wafer, and smaller chips are likely to have somewhat higher
yield percentages, both of which reduce the manufacturing cost.
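To see why area translates into cost, consider a rough sketch using standard dies-per-wafer and yield formulas. The wafer size, defect density, yield model, and wafer price below are illustrative assumptions, not figures from this lecture.

```cpp
#include <cmath>
#include <cstdio>

// Rough silicon cost model. The 300 mm wafer, defect density, and wafer price
// are assumptions; the yield estimate uses the simple Poisson defect model.
const double kPi = 3.141592653589793;

double dies_per_wafer(double die_area_mm2, double wafer_diameter_mm = 300.0) {
    double r = wafer_diameter_mm / 2.0;
    return kPi * r * r / die_area_mm2
         - kPi * wafer_diameter_mm / std::sqrt(2.0 * die_area_mm2);  // edge-loss term
}

double die_yield(double die_area_mm2, double defects_per_mm2 = 0.002) {
    return std::exp(-defects_per_mm2 * die_area_mm2);  // Poisson defect model
}

int main() {
    const double wafer_cost = 5000.0;  // USD per wafer (assumed)
    const double areas[] = {50.0, 100.0, 200.0};
    for (double area : areas) {
        double good_dies = dies_per_wafer(area) * die_yield(area);
        std::printf("%5.0f mm^2 die: %5.0f good dies, about $%.2f each\n",
                    area, good_dies, wafer_cost / good_dies);
    }
    // Halving the die area more than halves the cost per good die, because both
    // dies-per-wafer and yield improve.
    return 0;
}
```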
However, as important as performance, power, and silicon area are as metrics, they can’t
be used directly to compare designs, because it is relatively straightforward to trade one for the
other.
Running a chip at a higher operating voltage causes its transistors to switch more rapidly,
allowing us to increase the clock frequency and get increased performance, at the cost of in-
creased power consumption. Conversely, lowering the operating voltage along with the clock
frequency saves energy, at the cost of lower performance.1
It isn’t fair to compare the raw performance of a desktop Intel Core i7 to an ARM phone
SoC, if for no other reason than that the desktop processor has a 20–50× power advantage.
Instead, it is more appropriate to divide the power ( Joules per second) by the performance (op-
erations per second) to get the average energy used per computation ( Joules per operation).
Throughout the rest of this book, we’ll refer to this as “energy per operation” or pJ/op. We
could equivalently think about maximizing the inverse: operations/Joule.
For a battery-powered device, energy per operation relates directly to the amount of com-
putation that can be performed with a single battery charge; for anything plugged into the wall,
this relates the amount of useful computation that was done with the money you paid to the
electric company.
A similar difficulty is related to the area metric. For applications with sufficient parallelism,
we can double performance simply by stamping down two copies of the same processor on a chip.
This benefit requires no increase in clock speed or operating voltage—only more silicon. This
was, of course, the basic impetus behind going to multicore computation.
Even further, it is possible to lower the voltage and clock frequency of the two cores,
trading performance for energy efficiency as described earlier. As a result, it is possible to improve
either power or performance by increasing silicon area as long as there is enough parallelism.
Thus, when comparing between architectures for highly parallel applications, it is helpful to

1 Of course, modern CPUs do this scaling on the fly to match their performance to the ever-changing CPU load, known as “Dynamic Voltage and Frequency Scaling” (DVFS).
normalize performance by the silicon area used. This gives us operations/Joule divided by area,
or ops/(mm²·J).
These two compound metrics, pJ/operation and ops/(mm²·J), give us meaningful ways to com-
pare and evaluate vastly different architectures. However, it isn't sufficient to simply minimize
these in the abstract; we must consider the overall system and application workload.
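As a simple illustration of how these compound metrics are computed from measured throughput, power, and area, consider the sketch below. The two device entries are invented numbers for illustration, not measurements from any real system.

```cpp
#include <cstdio>

// Derive energy per operation (pJ/op) and area-normalized efficiency
// (ops per Joule per mm^2) from throughput, power, and die area.
struct Design {
    const char* name;
    double ops_per_sec;  // essential operations per second
    double watts;        // average power draw
    double area_mm2;     // silicon area
};

int main() {
    const Design designs[] = {
        {"general-purpose CPU", 50e9, 30.0, 100.0},
        {"specialized engine", 500e9,  2.0,  10.0},
    };
    for (const Design& d : designs) {
        double pj_per_op    = d.watts / d.ops_per_sec * 1e12;          // J/op -> pJ/op
        double ops_per_Jmm2 = (d.ops_per_sec / d.watts) / d.area_mm2;  // ops / (J * mm^2)
        std::printf("%-20s %8.1f pJ/op   %.2e ops/(J*mm^2)\n",
                    d.name, pj_per_op, ops_per_Jmm2);
    }
    return 0;
}
```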

1.2.2 FLEXIBILITY
Engineers building a system are concerned with a particular application, or perhaps a collection
of applications, and the metrics discussed are only helpful insofar as they represent performance
on the applications of interest. If a specialized hardware module cannot run our problem, its
energy and area efficiency are irrelevant. Likewise, if a module can only accelerate parts of the
application, or only some applications out of a larger suite, then its benefit is capped by Am-
dahl’s law. As a result, we have a flexibility tradeoff: more flexible devices allow us to accelerate
computation that would otherwise remain on the CPU, but increased flexibility often means
reduced efficiency.
Suppose a hypothetical fixed-function device can accelerate 50% of a computation by a
factor of 100, reducing the total computation time from 1 second to 0.505 seconds. If adding
some flexibility to the device drops the performance to only 10× but allows us to accelerate 70%
of the computation, we will now complete the computation in 0.37 seconds—a clear win.
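The arithmetic behind this comparison is just Amdahl's law applied to the accelerated fraction; the sketch below reproduces the two hypothetical devices from the paragraph above.

```cpp
#include <cstdio>

// Runtime after accelerating a fraction f of the work by a factor s,
// relative to a 1-second baseline (Amdahl's law).
double accelerated_time(double f, double s) {
    return (1.0 - f) + f / s;
}

int main() {
    std::printf("fixed-function: 50%% at 100x -> %.3f s\n", accelerated_time(0.50, 100.0));
    std::printf("more flexible:  70%% at  10x -> %.3f s\n", accelerated_time(0.70, 10.0));
    // Prints 0.505 s and 0.370 s: the flexible device wins despite being 10x slower.
    return 0;
}
```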
Moreover, many applications demand flexibility, whether the product is a networking de-
vice that needs to support new protocols or an augmented-reality headset that must incorporate
the latest advances in computer vision. As more and more devices are connected to the internet,
consumers increasingly expect that features can be upgraded and bugs can be fixed via over-the-
air updates. In this market, a fixed-function device that cannot support rapid iteration during
prototyping and cannot be reconfigured once deployed is a major liability.
The tradeoff is that flexibility isn’t free, as we have already alluded to. It almost always hurts
efficiency (performance per watt or ops/(mm²·J)) since overhead is spent processing the configuration.
Figure 1.2 illustrates this by comparing the performance and efficiency for a range of designs
proposed at ISSCC a number of years ago. While newer semiconductor processes have reduced
energy across the board, the same trend holds: the most flexible devices (CPUs) are the least
efficient, and increasing specialization also increases performance, by as much as three orders of
magnitude.
In certain domains, this tension has created something of a paradox: applications that were
traditionally performed completely in hardware are moving toward software implementations,
even while competing forces push related applications away from software toward hardware. For
example, the fundamental premise of software defined radio (SDR) is that moving much (or all)
of the signal processing for a radio from hardware to software makes it possible to build a system
that is simpler, cheaper, and more flexible. With only a minimal analog front-end, an SDR
system can easily run numerous different coding and demodulation schemes, and be upgraded
over the air. But because real-time signal processing requires extremely high computation rates,
many SDR platforms use an FPGA, and carefully optimized libraries have been written to fully
exploit the SIMD and digital signal processing (DSP) hardware in common SoCs. Likewise,
software-defined networking aims to provide software-based reconfigurability to networks, but
at the same time more and more effort is being poured into custom networking chips.

[Figure 1.2 plots energy efficiency (MOPS/mW) for designs grouped as microprocessors, general-purpose DSPs, and dedicated hardware.]

Figure 1.2: Comparison of efficiency for a number of designs from ISSCC, showing the clear
tradeoff between flexibility and efficiency. Designs are sorted by efficiency and grouped by overall
design paradigm. Figure from Marković and Brodersen [2012].

1.3 THE COST OF SPECIALIZATION


To fit these metrics together, we must consider one more factor: cost. After all, given the enor-
mous benefits of specialization, the only thing preventing us from making a specialized acceler-
ator for everything is the expense.
Figure 1.3 compares the non-recurring engineering (NRE) cost of building a new high-
end SoC on the past few silicon process nodes. The price tags for the most recent technologies
are now well out of reach for all but the largest companies. Most ASICs are less expensive than
this, by virtue of being less complex, using purchased or existing IP, having lower performance
targets, and being produced on older and mature processes [Khazraee et al., 2017]. Yet these
costs still run into the millions of dollars and remain risky undertakings for many businesses.
Several components contribute to this cost. The most obvious is the price of the lithogra-
phy masks and tooling setup, which has been driven up by the increasingly high precision of each
process node. Likewise, these processes have ever-more-stringent design rules, which require
more engineering effort during the place and route process and in verification. The exponen-
tial increase in number of transistors has enabled a corresponding growth in design complexity,
which comes with increased development expense. Some of these additional transistors are used
in ways that do not appreciably increase the design complexity, such as additional copies of pro-
cessor cores or larger caches. But while the exact slope of the correlation is debatable, the trend
is clear: More transistors means more complexity, and therefore higher design costs. Moreover,
with increased complexity comes increased costs for testing and verification.

[Figure 1.3 plots estimated SoC cost (million USD) for process nodes from 65 nm (2006) to 5 nm, broken into IP, architecture, verification, physical design, software, prototype, and validation.]

Figure 1.3: Estimated cost breakdown to build a large SoC. The overall cost is increasing expo-
nentially, and software comprises nearly half of the total cost. (Data from International Business
Strategies [IBS, 2017].)
Last, but particularly relevant to this book, is the cost of developing software to run the
chip, which in the IBS estimates accounts for roughly 40% of the total cost. The accelerator
must be configured, whether with microcode, a set of registers, or something else, and it must
be interfaced with the software running on the rest of the system. Even the most rigid of “fixed”
devices usually have some degree of configurability, such as the ability to set an operating mode
or to control specific parameters or coefficients.
This by itself is unremarkable, except that all of these “configurations” are tied to a pro-
gramming model very different than the idealized CPU that most developers are used to. Timing
details become crucial, instructions execute out of order or in a massively parallel fashion, and
concurrency and synchronization are handled with device-specific primitives. Accelerators are,
almost by definition, difficult to program.
To state the obvious, the more configurable a device is, the more effort must go into con-
figuring it. In highly configurable accelerators such as GPUs or FPGAs, it is quite easy—even
typical—to produce configurations that do not perform well. Entire job descriptions revolve
around being able to work the magic to create high-performance configurations for accelera-
tors. These people, informally known as “the FPGA wizards” or “GPU gurus,” have an intimate
knowledge of the device hardware and carry a large toolbox of techniques for optimizing appli-
cations. They also have excellent job security.
This difficulty is exacerbated by a lack of tools. Specialized accelerators need specialized
tools, often including a compiler toolchain, debugger, and perhaps even an operating system.
This is not a problem in the CPU space: there are only a handful of competitive CPU archi-
tectures, and many groups are developing tools, both commercial and open source. Intel is but
one of many groups with an x86 C++ compiler, and the same is true for ARM. But specialized
accelerators are not as widespread, and making tools for them is less profitable. Unsurprisingly,
NVIDIA remains the primary source of compilers, debuggers, and development tools for their
GPUs. This software design effort cannot easily be pushed onto third-party companies or the
open-source community, and becomes part of the chip development cost.
As we stand today, bringing a new piece of silicon to market is as much about writing
software as it is designing logic. It isn’t sufficient to just “write a driver” for the hardware; what
is needed is an effective bridge to application-level code.
Ultimately, companies will only create and use accelerators if the improvement justifies
the expense. That is, an accelerator is only worthwhile if the engineering cost can be recouped
by savings in the operating cost, or if the accelerator enables an application that was previously
impossible. The operating cost is closely tied to the efficiency of the computing system, both in
terms of the number of units necessary (buying a dozen CPUs vs. a single customized accelerator)
and in terms of time and electricity. Because it is almost always easier to implement an algorithm
on a more flexible device, this cost optimization results in a tug-of-war between performance
and flexibility, illustrated in Figure 1.4.
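One way to make this tug-of-war concrete is a simple break-even estimate: the accelerator pays off only once its NRE is recovered by per-unit savings. The sketch below uses placeholder costs; none of the figures come from this lecture.

```cpp
#include <cstdio>

// Break-even volume for a specialized accelerator: how many deployed units are
// needed before per-unit operating savings repay the engineering (NRE) cost.
// Every figure below is a placeholder chosen only for illustration.
int main() {
    double nre_cost          = 5e6;    // engineering cost of building the accelerator (USD)
    double cpu_cost_per_unit = 400.0;  // lifetime operating cost per unit, CPU-only solution
    double acc_cost_per_unit = 50.0;   // lifetime operating cost per unit with the accelerator

    double savings_per_unit = cpu_cost_per_unit - acc_cost_per_unit;
    double breakeven_units  = nre_cost / savings_per_unit;

    std::printf("break-even at roughly %.0f units\n", breakeven_units);
    // ~14,286 units here; below that volume the flexible solution is cheaper overall.
    return 0;
}
```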
This is particularly true for low-volume products, where the NRE cost dominates the
overall expense. In such cases, the cheapest solution—rather than the most efficient—might be
the best. Often, the most cost-effective solution to speed up an application is to buy a more
powerful computer (or a whole rack of computers!) and run the same horribly inefficient code
on it. This is why an enormous amount of code, even deployed production code, is written in
languages like Python and Matlab, which have poor runtime performance but terrific developer
productivity.
Our goal is to reduce the cost of developing accelerators and of mapping emerging applica-
tions onto heterogeneous systems, pushing down the NRE of the high-cost/high-performance
areas of this tradeoff space. Unless we do so, it will remain more cost effective to use general-
purpose systems, and computer performance in many areas will suffer.

[Figure 1.4 plots operating cost against engineering cost, with CPU, optimized CPU, GPU, FPGA, and ASIC arranged along the tradeoff curve.]

Figure 1.4: Tradeoff of operating cost (which is inversely related to runtime performance) vs.
non-recurring engineering cost (which is inversely related to flexibility). More flexible devices
(CPUs and GPUs) require less development effort but achieve worse performance compared to
FPGAs and ASICs. We aim to reduce the engineering development cost (red arrows), making
it more feasible to adopt specialized computing.

1.4 GOOD APPLICATIONS FOR ACCELERATION


Before we launch into systems for programming accelerators, we’ll examine which applications
can be accelerated most effectively. Can all applications be accelerated with specialized proces-
sors, or just some of them?
The short answer is that only a few types of applications are worth accelerating. To see
why, we have to go back to the fundamentals of power and energy. Because a modern chip is
limited by its power budget, performance per watt is effectively equivalent to performance, and
we want to minimize the energy consumed per unit of computation. That is, if the way to maximize
operations per second is to maximize operations per second per watt, we can cancel "seconds" and
simply maximize operations per joule.
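Spelled out, the cancellation is a one-line unit identity (P below denotes the chip's fixed power
budget; the symbol is introduced here only for illustration):

\[
\frac{\text{ops/s}}{\text{W}} = \frac{\text{ops/s}}{\text{J/s}} = \frac{\text{ops}}{\text{J}},
\qquad \text{so} \qquad
\frac{\text{ops}}{\text{s}} = P \cdot \frac{\text{ops}}{\text{J}}.
\]

With P fixed, maximizing throughput (operations per second) and maximizing operations per
joule are the same objective.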
Table 1.1 shows the energy required for a handful of fundamental operations in a 45 nm
process. The numbers are smaller for more recent process nodes, but the relative scale remains
essentially the same.
The crucial observation here is that a DRAM fetch requires 500× more energy than a
32-bit multiplication, and 50,000× more than an 8-bit addition. The cost of fetching data from
memory completely dwarfs the cost of computing with it. The cache hierarchy helps, of course,
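To make the imbalance concrete, the following back-of-the-envelope sketch estimates the energy
to produce one output pixel of a 3×3 convolution. The 500× and 50,000× ratios are the ones quoted
above; the absolute energy unit, the per-multiply cost, and the line-buffer scenario are illustrative
assumptions, not entries from Table 1.1.

# Back-of-the-envelope energy model for one output pixel of a 3x3 convolution.
# The ratios (DRAM fetch ~ 500x a 32-bit multiply ~ 50,000x an 8-bit add)
# come from the text above; the absolute values are arbitrary illustrative units.

E_ADD8 = 1.0          # 8-bit add (baseline unit)
E_MUL32 = 100.0       # 32-bit multiply, assumed ~100x an 8-bit add
E_DRAM = 50_000.0     # DRAM fetch, ~50,000x an 8-bit add (hence ~500x a multiply)

def energy_per_pixel(adds: int, muls: int, dram_fetches: int) -> float:
    """Total energy (in 8-bit-add units) to produce one output pixel."""
    return adds * E_ADD8 + muls * E_MUL32 + dram_fetches * E_DRAM

# A 3x3 convolution performs 9 multiplies and 8 adds per output pixel.
naive = energy_per_pixel(adds=8, muls=9, dram_fetches=9)     # re-fetch every input from DRAM
buffered = energy_per_pixel(adds=8, muls=9, dram_fetches=1)  # on-chip line buffer holds the window

print(f"naive:    {naive:10,.0f} units, {9 * E_DRAM / naive:6.1%} spent on DRAM")
print(f"buffered: {buffered:10,.0f} units, {1 * E_DRAM / buffered:6.1%} spent on DRAM")

Even in the buffered case, where on-chip storage cuts DRAM traffic to a single fetch per output,
that one fetch still costs far more than all of the arithmetic combined, which is why accelerators
work so hard to do as much computation as possible per byte moved.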