CAN HELP POWER UP MORE CORES, RESEARCHERS STILL NEED CMOS TECHNOLOGY TO
THEY NEED SMART RESOURCE MANAGEMENT TO REALIZE THIS PROMISE. THIS ARTICLE
multicores gives us new insights on heterogeneous computing. Furthermore, the unparalleled near-threshold and subthreshold performance of TFETs results in hitherto unexplored cost functions and operating points.
[Figure 1, panel (b): x-axis Vcc (V), from 0.50 down to 0.15, with Vt marked; y-axis ticks from 0.0 to 1.2.]
SEPTEMBER/OCTOBER 2013 51
DARK SILICON
node with a fixed threshold voltage, reducing the supply voltage further pushes devices into the subthreshold regime on account of reduced Vcc − Vt. Figure 1b shows that, in the subthreshold region, CMOS device delay grows exponentially. This degradation is due to the intrinsic 60 mV/decade minimum subthreshold slope of CMOS devices, which leads to very low subthreshold drive currents. This, in turn, causes CMOS circuits to operate at extremely low frequencies. Consequently, CMOS multicores show poor performance in a low-voltage, dim silicon configuration, and prefer operating at a higher-voltage, dark silicon configuration.

The use of alternative device technologies has been proposed in order to overcome this barrier imposed by CMOS technology. Recently, various new steep-slope devices have emerged that can implement energy-efficient low-voltage circuits. These devices’ physics let them achieve sub-60 mV/decade subthreshold slopes. This leads to higher Ion/Ioff ratios at low voltages, which translates into higher drive currents and lower off-state leakage currents. At near-threshold and subthreshold voltages, steep-slope devices have the potential to achieve superior performance with energy efficiencies that are orders of magnitude higher than those of CMOS devices. One instance of steep-slope devices, the interband TFET, is a promising candidate because of its superior operation stability and better fabrication compatibility compared with other alternatives. TFETs show tremendous potential for scaling supply voltages and reducing power consumption. Researchers have already demonstrated logic and memory applications using TFET devices operating at 0.1 V [2], and processors designed using TFETs are projected to be in production by 2020.

Although steep-slope devices are promising, they have some disadvantages compared to CMOS technology. For instance, TFET energy efficiency is superior to CMOS only at low voltages. As the supply voltage increases, the inherent limitation in the TFET charge-carrying mechanism causes the current to saturate above a certain operating voltage. This escalates power consumption rapidly and restricts maximum operating frequency. Although a multicore processor made solely of TFETs can achieve much better performance on a dim silicon configuration, it will not be able to reach high frequencies on a dark silicon configuration. On the other hand, different applications prefer different configurations; for instance, scalable applications prefer using all cores and exploiting thread-level parallelism (TLP), but unscalable applications, or applications with large sequential regions, benefit more from using a few cores at higher frequencies. Clearly, neither a homogeneous CMOS nor a homogeneous TFET multicore can serve both purposes.

In this article, we consider a heterogeneous multicore comprising a few CMOS cores (that are particularly useful for accelerating sequential or unscalable codes) and many TFET cores (that are optimized to operate efficiently at low voltages to cater to highly parallel workloads). Figure 2 compares a homogeneous CMOS multicore with a heterogeneous CMOS-TFET multicore in both dark and dim silicon configurations. The heterogeneous multicore can match the dark silicon performance of the homogeneous configuration because it can activate the same number of CMOS cores at high frequencies. In addition, it can outperform the homogeneous processor in a dim silicon setting because it employs low-voltage optimized TFET cores. The heterogeneous multicore can thus use the same power budget to either turn on more cores at the same frequency or use the same number of cores at higher frequencies.

Although a CMOS-TFET heterogeneous multicore can operate efficiently on both dark and dim silicon configurations, an application could prefer one configuration over another owing to factors such as peak instruction throughput and thread and core scalability. Therefore, mapping applications on a heterogeneous system poses several interesting questions. Given a number of applications to execute, how many cores of each type and how much power should each application be allocated? How should the applications’ threads share these resources? In order to answer these questions, we can formulate an optimization problem by reducing power consumption under performance
52 IEEE MICRO
[Figure 2 panels: homogeneous CMOS multicore (left) and heterogeneous CMOS-TFET multicore (right); center graphs plot the number of cores for CMOS and TFET. Labeled operating points include Dark (1), with N1 active cores at frequency f1, and Dark: equal performance (4), with N4 active CMOS cores at frequency f4, where N1 = N4 and f1 = f4.]
Figure 2. A homogeneous CMOS multicore (left) and a heterogeneous CMOS-TFET (tunnel field-effect transistor) multicore
(right) operating at dark and dim silicon settings because of limited available power. The graphs (center; not to scale) show
the frequency and number of cores, and frequency and power per core trade-offs, between the two types of cores. In a
dark silicon setting (fewer cores, higher voltage), the heterogeneous multicore can match the homogeneous multicore’s
performance as long as it contains enough CMOS cores (1 versus 4). In a dim silicon setting (more cores, smaller voltage),
the heterogeneous multicore can outperform the homogeneous multicore either by using the same number of TFET cores
at a higher frequency (2 versus 5) or by using more TFET cores at the same frequency (2 versus 6). Further dimming the
CMOS multicore can enable more cores to be turned on but forces these cores to operate at extremely low frequencies,
leading to very poor performance (3).
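[Figure 3 panels: (a) threads Thr1 to Thr4 on paired CMOS (C) and TFET (T) cores; (b) CMOS and TFET domains with power and work allocations (Pc, Wc) and (Pt, Wt), exchanging ΔP and ΔW; (c) CMOS and TFET domains with power allocations Pc and Pt, exchanging ΔP.]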
Figure 3. Dynamic optimizations proposed for improving the performance and energy effi-
ciency of heterogeneous CMOS-TFET multicores. Migrating threads across CMOS and TFET
cores (a); power and work partitioning across threads belonging to a single application (b);
and power and resource partitioning across threads belonging to different applications (c).
[Figure 4 plot: normalized performance (1 to 64) versus power (0.5 W to 128 W) for CMOS and TFET multicores with 2, 4, 8, and 16 cores, with iso-power and iso-performance points marked.]
migrating the thread running on the CMOS core to the TFET core coupled with this CMOS core puts the thread into a more energy-efficient execution state [3], as Figure 3a shows. Similarly, an increase in TFET core frequency above fc triggers a thread migration from the TFET core to the corresponding CMOS core. This scheme enables each tile in the heterogeneous multicore to search among the iso-performance configurations (Figure 4) and use the one with the smallest power consumption. Hence, each tile acts as a low-voltage optimized core at low voltages and as a high-voltage optimized core at high voltages. To analyze this method’s benefits, we simulated an eight-core processor with four CMOS cores and four TFET cores. We assumed that, due to power limitations, the maximum number of cores that can simultaneously be powered on is restricted to 4, but the choice of the exact CMOS/TFET core combination to use can vary dynamically. The baseline homogeneous system and the proposed heterogeneous system both employ the two energy-saving DVFS mechanisms: EDP-aware DVFS and barrier-aware DVFS.
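The migration rule described above can be illustrated with a minimal Python sketch. It assumes a symmetric rule (a DVFS request below the crossover frequency fc moves the thread to the TFET core, a request above fc moves it back to the CMOS core) and a single fixed fc per tile; the `Tile` class, its field names, and the example frequencies are hypothetical and not taken from the article.

```python
from dataclasses import dataclass


@dataclass
class Tile:
    """One CMOS core coupled with one TFET core; the thread runs on one of them."""
    crossover_ghz: float        # hypothetical value of fc for this tile
    active_core: str = "CMOS"   # either "CMOS" or "TFET"

    def on_dvfs_update(self, target_ghz: float) -> bool:
        """Apply a DVFS frequency request; return True if the thread migrated."""
        if self.active_core == "CMOS" and target_ghz < self.crossover_ghz:
            # Below fc the TFET core is assumed to be the more efficient choice.
            self.active_core = "TFET"
            return True
        if self.active_core == "TFET" and target_ghz > self.crossover_ghz:
            # Above fc the thread returns to the coupled CMOS core.
            self.active_core = "CMOS"
            return True
        return False  # the request stays on the same side of fc


tile = Tile(crossover_ghz=1.0)
for request in (0.6, 0.8, 1.5):
    moved = tile.on_dvfs_update(request)
    print(f"{request} GHz -> {tile.active_core} (migrated: {moved})")
```

A real implementation would also have to account for migration overhead, for example by adding hysteresis around fc; the sketch leaves that out.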
To determine our thread migration
benchmarks. Our results, presented in Figure 5, show that, on average, the heterogeneous multicore has 20 percent better EDP than the homogeneous multicore. As the impact of the DVFS schemes varies on each benchmark, the major source of benefits with thread migration also varies. For instance, in lu, we observed the biggest EDP improvements when we used EDP-aware DVFS allied with thread migration, whereas in water-spa, thread migration makes the biggest impact when used in conjunction with barrier-aware DVFS. These results indicate that significant energy savings can be obtained by exploiting the energy-efficient behavior of TFET cores at low voltages.

[Figure 5 plot: y-axis in percent (0 to 60), one group of bars per benchmark plus the average; series: EDP DVFS, Barrier DVFS, EDP+Barrier DVFS.]
Figure 5. The reduction in energy-delay product (EDP) obtained by enabling thread migration on the heterogeneous multicore when EDP-aware dynamic voltage and frequency scaling (DVFS), barrier-aware DVFS, and both EDP-aware and barrier-aware DVFS are used. We observe that the EDP improves by nearly 20 percent when both DVFS techniques are implemented.
As a generalization, we can treat this 4-CMOS/4-TFET configuration as one tile of a larger many-core, and we can use the energy saved on one tile to turn on or ramp up the power budget of other tiles in the system.

A dim silicon approach
An alternative formulation of the energy-efficiency optimization problem is to maximize performance under a fixed power budget. Here, instead of letting large parts of the multicore remain unused because of dark silicon, we adopt a dim silicon approach. We now have more cores sharing the available power budget, and to find an energy-efficient runtime configuration, we exploit the applications’ characteristics when distributing resources.

Based on the type of workload we are running on the heterogeneous multicore, we classify this resource distribution problem as
• a multithreaded application executing alone, or
• two (or more) multithreaded applications sharing the cores and power budget.

For both types of workloads, we designed and evaluated static and dynamic schemes that map the available cores and redistribute the available power to threads and applications on a heterogeneous multicore to improve performance. (See the “Experimental Setup” sidebar for additional details on our simulation infrastructure and experiments.)

A single multithreaded application on a heterogeneous multicore
Given an application to be executed on a heterogeneous multicore, we can consider two possible thread-to-core mapping schemes:
• using only one type of core (either CMOS or TFET, exclusively) at any time (homogeneous mapping), and
• using both types of cores simultaneously (heterogeneous mapping).

In homogeneous mapping, cores of the unused type are left dark, allowing the active cores to use the entire power budget. In heterogeneous mapping, all cores will share the total available power budget, and the application threads will be mapped to both CMOS and TFET cores. Because cores of different device types have different V/f characteristics, these cores will run at different operating points although they are allocated equal per-core power budgets. This will result in different types of cores having unequal performance. We thus employ a dynamic load-balancing scheme (for example, using dynamic parallel loop scheduling) to avoid any inefficiencies that could arise due to equal work partitioning across application threads. After load balancing, the core type that operates at a more energy-efficient point would complete more work in the same time frame because all cores have an equal per-core power budget
Experimental Setup
We performed our experiments using the Simics full-system simulator. For our thread migration study, we simulated a 4-CMOS homogeneous and a 4-CMOS/4-TFET heterogeneous multicore [1]. For our power and work partitioning experiments, we modeled the multicores listed in Table 2 in the main article [2]. Our Si-FinFET and TFET cores are architecturally similar to the Intel Atom Z520 [3]. These cores were also equipped to run DVFS with a 1-ms epoch. In our dynamic work partitioning study, we modified the SPEC-OMP 2001 benchmarks to incorporate dynamic loop scheduling. For our experiments with multiple multithreaded applications, we built eight workloads by randomly pairing Parsec benchmarks. In our workloads, each application is associated with a user-defined weight that represents its relative importance. Further details on our experimental setup and simulation parameters can be found elsewhere [2].

References
1. K. Swaminathan et al., “Improving Energy Efficiency of Multithreaded Applications Using Heterogeneous CMOS-TFET Multicores,” Proc. Int’l Symp. Low Power Electronics and Design (ISLPED 11), IEEE CS, 2011, pp. 247-252.
2. E. Kultursay et al., “Performance Enhancement under Power Constraints Using Heterogeneous CMOS-TFET Multicores,” Proc. Int’l Conf. Hardware/Software Codesign and System Synthesis (CODES+ISSS 12), ACM, 2012, pp. 245-254.
3. Intel Atom Processor Z5xx Series Datasheet, Intel, tech. report 319535-003US, June 2010.
Processor            Thread mapping   Workload partitioning   Power partitioning   Code
                                      across threads          across threads
32 CMOS or 32 TFET   CMOS or TFET     Equal                   Equal                BestBase
8 CMOS and 24 TFET   CMOS and TFET    Equal                   Equal                Hetero-BestManual
                     CMOS and TFET    Equal                   Equal                Hetero-Naive
                     CMOS and TFET    Dynamic                 Equal                Hetero-DynWork
                     CMOS and TFET    Dynamic                 Dynamic              Hetero-DynWork-DynPow
(iso-power points in Figure 4). In this case, repartitioning the available power across cores, as in Figure 3b, can further improve the multicore’s overall performance.

We implemented and evaluated a power partitioning scheme that treats CMOS and TFET cores as two independent, homogeneous power domains and redistributes the total chip power among the domains using a perturb-and-observe method [4]. This scheme periodically transfers a small amount of power from one domain to the other and observes the resulting performance improvement or degradation. Depending on the outcome, it either continues to transfer power in the same direction or reverses the direction of the power transfer. By combining power partitioning with heterogeneous application mapping and dynamic loop scheduling, we automatically optimize performance at runtime.

Tables 1 and 2 list the mapping techniques we evaluated for this study. Starting with two baseline homogeneous 32-core processors that are all-CMOS and all-TFET, we first determined the configuration that shows the better performance for each application (BestBase). We then analyzed our heterogeneous 8-CMOS/24-TFET multicore with a homogeneous mapping. An application can prefer running on 8 CMOS or 24 TFET cores on the basis of its scaling behavior. We assume that, in our baseline heterogeneous configuration (Hetero-BestManual), the user selects the best-performing option. To explore the benefits of using both types of cores in the heterogeneous processor simultaneously, we first evaluated a naive technology substitution scheme without any runtime mapping or scheduling optimizations (Hetero-Naive). We then enabled
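The perturb-and-observe power partitioning described in this section can be sketched in Python as follows. This is an illustrative loop under stated assumptions, not the authors' implementation: the function names, the fixed step size, the epoch count, and the toy performance model `toy_perf` are all hypothetical, and a real system would measure performance from hardware counters at each epoch.

```python
from typing import Callable, Tuple


def toy_perf(cmos_w: float, tfet_w: float) -> float:
    """Hypothetical concave performance model; it peaks when tfet_w = 4 * cmos_w."""
    return cmos_w ** 0.5 + 2.0 * tfet_w ** 0.5


def perturb_and_observe(
    measure_perf: Callable[[float, float], float],
    total_power_w: float,
    cmos_power_w: float,
    step_w: float,
    epochs: int,
) -> Tuple[float, float]:
    """Shift power between the CMOS and TFET domains one small step per epoch."""
    direction = 1.0  # +1 moves power from TFET to CMOS, -1 moves it back
    prev_perf = measure_perf(cmos_power_w, total_power_w - cmos_power_w)
    for _ in range(epochs):
        moved = min(max(cmos_power_w + direction * step_w, 0.0), total_power_w)
        if moved == cmos_power_w:
            direction = -direction  # hit a budget limit, so turn around
            continue
        cmos_power_w = moved
        perf = measure_perf(cmos_power_w, total_power_w - cmos_power_w)
        if perf < prev_perf:
            direction = -direction  # the transfer hurt, so reverse direction
        prev_perf = perf
    return cmos_power_w, total_power_w - cmos_power_w


cmos_w, tfet_w = perturb_and_observe(toy_perf, 10.0, 5.0, 0.5, 40)
print(f"CMOS domain: {cmos_w:.1f} W, TFET domain: {tfet_w:.1f} W")
```

Because every epoch applies a transfer, the allocation keeps oscillating within one step of the best split once it has been reached; a smaller step trades convergence speed for a tighter oscillation.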
Table 2. Evaluated schemes for multiprogrammed workloads in a dim silicon approach.
(Hetero-DynWork-DynPow).
Figure 6 shows our results with three heterogeneous multicore experiments normalized to BestBase. Homogeneous mapping (Hetero-BestManual) results in a 5 percent performance degradation. A simple technology substitution (Hetero-Naive) yields only a 4 percent improvement. Adding dynamic work partitioning (Hetero-DynWork) brings an additional 12 percent performance improvement. Our combined scheme (Hetero-DynWork-DynPow) performs best, achieving 21 percent better performance than the baseline.

[Figure 6 plot: speedup normalized to BestBase (0 to 1.75) for applu, apsi, equake, gafort, swim, wupwise, and their average; series: Hetero-BestManual, Hetero-Naive, Hetero-DynWork, Hetero-DynWork-DynPow.]
Figure 6. Performance improvement obtained with the CMOS-TFET multicore using homogeneous mapping, heterogeneous mapping, dynamic work partitioning, and dynamic work and power partitioning combined. Our combined scheme (Hetero-DynWork-DynPow) achieves 21 percent better performance than the baseline.

Multiple multithreaded applications sharing a heterogeneous multicore
We propose static and dynamic optimizations to improve the performance of a power-constrained multicore when two applications are running concurrently. To simplify the problem, we use a homogeneous application-to-core mapping, where each application is assigned to either CMOS or TFET cores. In our static scheme, we first examine the relative scalability of applications using static profiling. Working with two applications scheduled to run together, the application that scales better with the number of cores is mapped to TFET cores and the application that scales better with frequency runs on CMOS cores. The total power budget is partitioned among the two applications (that is, the CMOS and TFET domains) based on the ratio of the user-defined application weights.

Because power and core allocation in our profile-based scheme is fixed throughout the entire execution of the workload, it cannot capture the changing behavior of applications. Hence, we propose a dynamic scheme that starts with the initial power allocation identified by the static scheme and dynamically repartitions power based on the energy efficiency the applications achieve (see Figure 3c). It uses a scheme similar to the perturb-and-observe mechanism. In this case, because each application runs on only one type of core, we distribute the power allocated to each domain equally across its cores. To address fairness concerns, we limit the maximum performance degradation that an application can suffer to 10 percent.

Tables 1 and 2 list the configurations we evaluated for this study. For our baseline, we choose the best-performing homogeneous configuration out of two 32-core processors with all-CMOS and all-TFET cores when using static power partitioning (BestBase-StaticPow). The amount of power allocated to each application is decided statically based on the weights and the power budget. The number of cores to use for each application is selected using this power allocation and profile-based scaling information. Each
[Figure: bar chart with y-axis label "Weighted" (0 to 2); x-axis lists the paired-benchmark workloads and their average.]

The inherent physical limitations of CMOS transistors at near-threshold and subthreshold operating voltages have led researchers to search for new device technologies and examine the adoption of device-level heterogeneous processors for next-generation architectures. Processors designed using steep-slope transistors, especially
Design Automation Conf. (DAC 12), ACM, 2012, pp. 1131-1136.
2. S. Mookerjea et al., “Experimental Demonstration of 100 nm Channel Length In0.53Ga0.47As-based Vertical Inter-band Tunnel Field Effect Transistors (TFETs) for Ultra Low-Power Logic and SRAM Applications,” Proc. IEEE Int’l Electron Devices Meeting (IEDM 09), IEEE CS, 2009, doi:10.1109/IEDM.2009.5424355.
3. K. Swaminathan et al., “Improving Energy Efficiency of Multithreaded Applications Using Heterogeneous CMOS-TFET Multicores,” Proc. Int’l Symp. Low Power Electronics and Design (ISLPED 11), IEEE CS, 2011, pp. 247-252.
4. E. Kultursay et al., “Performance Enhancement under Power Constraints Using Heterogeneous CMOS-TFET Multicores,” Proc. Int’l Conf. Hardware/Software Codesign and System Synthesis (CODES+ISSS 12), ACM, 2012, pp. 245-254.

Karthik Swaminathan is a PhD student in the Computer Science and Engineering Department at Pennsylvania State University. His research focuses on power-aware computer architectures. He is currently working on leveraging emerging device technologies in the architectural domain to improve performance, power, and reliability. Swaminathan has bachelor’s and master’s degrees in electrical engineering from the Indian Institute of Technology, Madras.

Emre Kultursay is a PhD student in the Computer Science and Engineering Department at Pennsylvania State University. His research interests include compiler optimizations for high-performance computing systems, energy-efficient multicore architectures, and heterogeneous processors and systems. Kultursay has a BS in electrical engineering and computer engineering from the Middle East Technical University, Turkey.

Vinay Saripalli is a design, technology, and CAD services engineer at Intel. His research interests involve development of CAD automation flows for analog circuit design productivity improvement, and process migration methods for analog circuits. Saripalli has a PhD in computer science and engineering from Pennsylvania State University, where he completed the work for this article.

Vijaykrishnan Narayanan is a professor in the Departments of Computer Science and Engineering and Electrical Engineering at Pennsylvania State University. His research interests include power-aware and reliable systems, embedded systems, nanoscale devices and interactions with system architectures, reconfigurable systems, networks-on-chips, and domain-specific computing. Narayanan has a PhD in computer science and engineering from the University of South Florida.

Mahmut T. Kandemir is a professor in the Computer Science and Engineering Department at Pennsylvania State University. His research interests include optimizing compilers, runtime systems, embedded systems, I/O and high-performance storage, and power-aware computing. Kandemir has a PhD in computer science and engineering from Syracuse University.

Suman Datta is a professor in the Department of Electrical Engineering at Pennsylvania State University. His research interests include exploring new materials, novel nanofabrication techniques, and device structures for CMOS enhancement and replacement for future energy-efficient, high-performance information processing systems. Datta has a PhD in electrical and computer engineering from the University of Cincinnati.

Direct questions and comments about this article to Karthik Swaminathan, 111, Information Sciences and Technology Building, Pennsylvania State University, University Park, PA 16802; kvs120@cse.psu.edu.