Power and Temperature-Aware Clock Frequency and Thread Assignment in Multi-layer MPSoC

Kyungsu Kang*, Sungjoo Yoo** and Chong-Min Kyung* * KAIST ** POSTECH

Contents
• Introduction
– Challenges in multi-layer MPSoC
– Problem definition and preliminaries
– Related works
– Motivational example

• Proposed method
– Temperature-slack based DVFS
– Thread assignment exploiting memory-boundness

• Experimental result
• Conclusion

MPSoC Workshop, Gifu, 2010, Kyung


3D integration of MPSoC
Merits:
1. Small footprint
2. Short wire length
3. Heterogeneous integration
4. Wide bandwidth
Challenges:
1. Temperature
2. Yield
3. CAD support


Challenges of multi-layer MPSoC
[Figure: performance and cooling cost versus temperature; cooling options range from fan heat sink to liquid (water, nitrogen, etc.)]
Temperature-related problems:
• Leakage power: $P_l = A \cdot T^2 \cdot \exp\left(\frac{-B}{T}\right)$
• Reliability: $MTTF = A \cdot \exp\left(\frac{E_a}{k \cdot T}\right)$

Objective of the research
• Developing temperature-aware power management methods (i.e., DVFS, thread assignment) to maximize instruction throughput in 3D multi-processor systems.
[Figure: system overview. Blocks: application; operating system (DVFS, thread assignment); performance monitor; temperature monitor; 3D multi-processor system]

Challenges
• Different thermal characteristics compared with 2D systems
• Instantaneous (not steady-state) temperature analysis
• Consideration of workload characteristics (e.g., instructions per cycle, memory-boundness)
• Many systems with peak power constraint [ISCA05][Intel]

[ISCA05] M. Annavaram et al., “Mitigating Amdahl’s law through EPI throttling,” in Proc. ISCA, June 2005, pp. 298-309.
[Intel] Intel Turbo Boost White Paper. [Online]. Available: http://www.intel.com/technology/turboboost/

Problem definition
Given N cores and N threads to be assigned, find schedules of voltage/frequency and thread assignment on each core such that total IPS (instructions per second, summed over all cores) is maximized, subject to both the peak instantaneous power constraint and the peak instantaneous temperature constraint.

[Figure: processor cores 1, 2, …, N arranged in core layers stacked on a heat sink]

Thermal characteristics of 3D systems
• Heat flow
– Heat is propagated vertically to the heat sink through other cores in between and dissipated at the heat sink.

[Figure: simplified thermal model [Zhu, TCAD08]: Cores 1 to 3 with power sources P1 to P3, thermal capacitances C, thermal resistances Rintra, Rinter and Rhs, and ambient temperature Tamb]
Rhs = 1.22 K/W, Rinter = 0.15 K/W, Rintra = 2.44 K/W, so Rintra ≈ 16 · Rinter

[Zhu, TCAD08] C. Zhu et al., “Three-dimensional chip-multiprocessor runtime thermal management,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 27, no. 8, pp. 1479-1492, Aug. 2008.

3D thermal characteristics
• Thermal coupling
– Thinning of the silicon layers creates strong mutual thermal coupling among vertically adjacent cores.
Thermal resistance: $R_{th} = \frac{H}{k \cdot A}$, where H is the thickness, k the thermal conductivity and A the surface area.
[Figure: stacked core layers above a heat sink, annotated with thickness values in µm: "> 50 ~ 300" and "< 6,000" (Source: Zhou, TPDS10)]

[Zhou, TPDS10] X. Zhou et al., “Thermal-aware task scheduling for 3D multi-core processors,” IEEE Trans. Parallel and Distributed Syst., vol. 21, no. 1, pp. 60-71, Jan. 2010.

3D thermal characteristics
• Layer-dependent cooling efficiency
– Cores near the heat sink have lower temperatures than those far from the heat sink.
Steady-state temperatures of Core 2 and Core 3:
$T_2^{ss} = (P_2 + P_3) \cdot R_{hs} + T_{amb}$
$T_3^{ss} = P_3 \cdot R_{inter} + T_2^{ss}$
$T_2^{ss} \le T_3^{ss}$; equality holds if and only if $P_3 = 0$.
[Figure: thermal model of Core 2 and Core 3 with Rinter, Rhs, C and Tamb; a cool job runs on Core 3 and a hot job on Core 2 (Source: Zhou, TPDS10)]
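The two steady-state formulas above can be tried numerically. The following is a minimal sketch, not the authors' code: the resistances are the values quoted on the simplified thermal model slide, while `T_AMB` and the power values are assumptions chosen only for illustration.

```python
# Resistances as quoted for the simplified thermal model; T_AMB is an assumed value.
R_HS = 1.22     # K/W, heat sink resistance
R_INTER = 0.15  # K/W, inter-layer resistance
T_AMB = 40.0    # deg C (assumption, not from the slides)


def steady_state_temps(p2, p3, r_hs=R_HS, r_inter=R_INTER, t_amb=T_AMB):
    """Return (T2_ss, T3_ss): Core 2 sits next to the heat sink, Core 3 above it."""
    t2_ss = (p2 + p3) * r_hs + t_amb
    t3_ss = p3 * r_inter + t2_ss
    return t2_ss, t3_ss


# Same total power (25 W, hypothetical), two placements of the hot job.
hot_near_sink = steady_state_temps(p2=20.0, p3=5.0)
hot_far_from_sink = steady_state_temps(p2=5.0, p3=20.0)
```

Under these assumed numbers, placing the hot job on the core next to the heat sink gives the lower peak steady-state temperature, which is the placement shown in the figure.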

Instantaneous vs. steady-state temperature
• High-level temperature model [Liao, TCAD05]:
$T(t) = \left(T_{init} - (T_{amb} + R \cdot P)\right) \cdot e^{-t/(R \cdot C)} + R \cdot P + T_{amb} = (T_{init} - T_{ss}) \cdot e^{-t/(R \cdot C)} + T_{ss}$
where $T_{init}$ ($T_{ss}$) is the initial (steady-state) temperature of a core, P is the power consumption of the core, and R and C are the thermal resistance and capacitance of the core, respectively. The difference between $T_{ss}$ and $T_{init}$ is the driving temperature force.
• Instantaneous temperature needs to be considered because the thermal time constant, several hundred milliseconds, is much longer than the DVFS time step [Skadron, TACO04].

[Liao, TCAD05] W. Liao et al., “Temperature and supply voltage aware performance and power modeling at microarchitecture level,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 24, no. 7, pp. 1042-1053, July 2005.
[Skadron, TACO04] K. Skadron et al., “Temperature-aware microarchitecture: modeling and implementation,” ACM Trans. Architecture and Code Optimization, vol. 1, pp. 94-125, Mar. 2004.
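As an illustration of the high-level temperature model, the sketch below evaluates T(t) directly. The function name and all numeric values are hypothetical; only the formula comes from the slide.

```python
import math


def core_temperature(t, t_init, power, r_th, c_th, t_amb):
    """Instantaneous temperature T(t) of a core under constant power.

    T(t) = (T_init - T_ss) * exp(-t / (R * C)) + T_ss, with T_ss = R * P + T_amb.
    """
    t_ss = r_th * power + t_amb
    return (t_init - t_ss) * math.exp(-t / (r_th * c_th)) + t_ss


# Hypothetical numbers: R*C = 0.3 s, i.e. a thermal time constant of a few hundred ms.
R_TH, C_TH = 1.5, 0.2
after_one_dvfs_step = core_temperature(0.005, 50.0, 20.0, R_TH, C_TH, 40.0)
```

With a 5 ms step and a 0.3 s time constant, only a small fraction of the gap between the initial and steady-state temperature is closed within one step, which is why a steady-state analysis can be pessimistic.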

Effect of memory-boundness in DVFS
• Execution time of an application: $t_{ex}(f_{core}) = \frac{w_{comp}}{f_{core}} + t_{stall}$
$w_{comp}$: computation workload; $f_{core}$: clock frequency of the core; $t_{stall}$: stall time spent by the core on external memory accesses

• Speedup (SU) for three programs in SPEC2000: $SU = \frac{t_{ex}(f_{core}^{ref} = 1.0\ \mathrm{GHz})}{t_{ex}(f_{core}^{new})}$
[Figure: speedup (SU) versus $f_{core}^{new}$ (GHz)]
Low SU is due to high memory-boundness.
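The execution-time and speedup definitions can be sketched as follows. The workload and stall values in the example are hypothetical; frequencies are in Hz and the workload is expressed in cycles, which is an assumption about units.

```python
def execution_time(w_comp, f_core, t_stall):
    """t_ex(f_core) = w_comp / f_core + t_stall (w_comp in cycles, f_core in Hz)."""
    return w_comp / f_core + t_stall


def speedup(w_comp, t_stall, f_new, f_ref=1.0e9):
    """SU = t_ex(f_ref = 1.0 GHz) / t_ex(f_new)."""
    return execution_time(w_comp, f_ref, t_stall) / execution_time(w_comp, f_new, t_stall)


# Compute-bound thread (no stall) versus memory-bound thread (1 s of stall), both at 2 GHz.
su_compute_bound = speedup(w_comp=1.0e9, t_stall=0.0, f_new=2.0e9)
su_memory_bound = speedup(w_comp=1.0e9, t_stall=1.0, f_new=2.0e9)
```

In this example doubling the frequency doubles the speed of the compute-bound thread, while the memory-bound thread gains only about a third.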

Related works
Columns: Previous works; Constraint; Temperature analysis; Memory-boundness; Solution type

2D integration
Previous works: [Coskun, TVLSI08] [Zhang, ICCAD08] [Murali, CODES+ISSS07] [Annavaram, ISCA05] [Isci, Int. Symp. Microarchitecture06] [Bergamaschi, ASPDAC08] [Donald, ISCA06]
Constraint: Temperature; Temperature and power; Power; Temperature
Temperature analysis: Instantaneous temperature; Instantaneous temperature; Steady-state temperature; Instantaneous temperature
Memory-boundness: Not handled; Not handled; Handled; Handled
Solution type: Design-time; Design-time; Runtime; Runtime

3D integration
[Zhao, APCCAS08]: Temperature and power; Instantaneous temperature; Not handled; Design-time
[Zhu, TCAD08]: Temperature; Steady-state temperature; Not handled; Runtime
Proposed: Temperature and power; Instantaneous temperature; Handled; Runtime

Motivational Example
• Preliminaries
– Threads 1 and 2 are assigned to cores 1 and 2, respectively.
– Each core has four frequency levels (1.00, 1.33, 1.66 and 2.00 GHz).
– Constraints: Pmax = 52 W, Tmax = 63 °C
[Figure: (a) an example platform with Core 1 and Core 2 stacked on a heat sink; (b) threads to be assigned: Thread 1 (high memory-boundness) and Thread 2 (low memory-boundness)]

Motivational Example
(Effect of considering instantaneous temperature)
[Figure: temperature slack-based (instantaneous temperature-based) DVFS compared with steady-state temperature-based DVFS [Zhu, TCAD08]]
8.8 % IPS improvement while keeping the temperature constraint

Motivational Example
(Effect of task assignment)
[Figure: temperature slack-based DVFS compared with temperature slack-based DVFS with a different thread assignment]
11.1 % IPS improvement while keeping the temperature constraint

Temperature slack-based DVFS
• Two-step approach
– Power budgeting based on per-core temperature slack
– Optimal frequency assignment based on the assigned per-core power budget
[Figure: example of the two-step approach. Step 1, power budgeting among cores: the power budget Pmax = 200 W is split into per-core budgets (22 W, 25 W and 30 W are shown) for Core 1 @ 60 °C, Core 2 @ 57 °C, …, Core N @ 52 °C. Step 2, optimal frequency assignment for each core: 1.6 GHz @ 21.6 W, 1.6 GHz @ 21.6 W, …, 2.0 GHz @ 27.7 W]

Terminology
• Core-set: cores in the same horizontal position
• Top core: the core farthest from the heat sink within a core-set
[Figure: X × Y × Z grid of cores, indexed core (i, j, k) along axes x, y, z, stacked in core layers on a heat sink; N = X·Y·Z. Example: core-set (4, 3), whose top core (4, 3) is core (4, 3, Z)]

Temperature slack-based DVFS
• Two-step approach
– Power budgeting among core-sets based on the temperature slack of the top core in each core-set
– Optimal frequency assignment of cores within a core-set based on the assigned power budget of the core-set
[Figure: example of the two-step approach. Step 1, power budgeting among core-sets: the power budget Pmax = 200 W is divided among core-set 1, …, core-set X·Y (values shown: 60 W and 30 W, with top-core temperatures of 52 °C and 60 °C). Step 2, optimal frequency assignment within a core-set: Core 1 at 2.0 GHz @ 27.7 W, Core 2 at 1.6 GHz @ 21.6 W, …, Core Z at 1.0 GHz @ 11.3 W]

Reducing complexity in power budgeting
Find the power budget of each core-set, $P_{i,j}^{core-set}$, such that total IPS (instructions per second) is maximized, subject to
$\forall i, \forall j:\ T_{i,j,Z}(t) \le T_{max}$ and $\sum_{i=1}^{X} \sum_{j=1}^{Y} P_{i,j}^{core-set} \le P_{max}$
[Figure: power budgeting among the individual cores (i, j, k) of the X × Y × Z stack compared with power budgeting among the X × Y core-sets (i, j), both above the heat sink]

Power budgeting among core-sets
• Power budgeting among core-sets in the form of assigning a steady-state temperature:
$\forall i, \forall j:\ \frac{T_{max} - T_{i,j,Z}(t)}{T_{i,j,Z}^{ss} - T_{i,j,Z}(t)} = C$ … Eqn. (x)
The numerator is the temperature slack of the core-set and the denominator is the temperature driving force of the core-set.
$T_{i,j,Z}^{ss} = R_{i,j}^{core-set} \cdot P_{i,j}^{core-set} + T_{amb}$
where $T_{i,j,Z}(t)$ is the temperature of top core (i, j) at time t,
$T_{i,j,Z}^{ss}$ is the steady-state temperature of top core (i, j),
$R_{i,j}^{core-set}$ is the thermal resistance of core-set (i, j), and
C is a constant.
Theorem: Performance is maximized when power is assigned to each core-set such that Eqn. (x) is satisfied.
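For a given constant C, Eqn. (x) and the steady-state relation can be solved for the power budget of one core-set. The sketch below does only that algebra; the function name and the numbers in the example are hypothetical.

```python
def core_set_budget(c, t_now, r_core_set, t_max, t_amb):
    """Solve Eqn. (x) for T_ss, then T_ss = R * P + T_amb for the power budget P.

    (T_max - T(t)) / (T_ss - T(t)) = C  =>  T_ss = T(t) + (T_max - T(t)) / C
    """
    t_ss = t_now + (t_max - t_now) / c
    power_budget = (t_ss - t_amb) / r_core_set
    return t_ss, power_budget


# Hypothetical core-set: top core at 60 deg C, T_max = 70 deg C, R = 1.0 K/W, T_amb = 40 deg C.
t_ss, p_budget = core_set_budget(c=2.0, t_now=60.0, r_core_set=1.0, t_max=70.0, t_amb=40.0)
```

A smaller C yields a higher steady-state target and therefore a larger power budget, so the search for C trades performance against the power and temperature constraints.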

Proof of power budgeting theorem (1)
$\frac{T_{max} - T_i(t)}{T_i^{ss} - T_i(t)} = C \;\Rightarrow\; \frac{T_{max} - T_i(t)}{dT_i(t)/dt} = C'$
because $\frac{dT_i(t)}{dt} = (T_i^{ss} - T_i^{init}) \cdot e^{-t/RC} \cdot \frac{1}{RC} \propto T_i^{ss} - T_i^{init}$

$C'$ represents the time each core needs to completely close the temperature slack remaining at time t.

$\frac{T_{max} - T_i(t)}{dT_i(t)/dt} = C' \;\Rightarrow\; \frac{T_{max} - T_i(t)}{\eta \cdot (R_i \cdot P_i(t) + T_{amb} - T_i^{init})} = C' \;\Rightarrow\; \frac{T_{max} - T_i(t)}{K \cdot P_i(t) + L} = C'$
because $T_i^{ss} = R_i \cdot P_i + T_{amb}$

$dT_i(t)/dt$ increases linearly as power ($P_i$) increases.

Proof of power budgeting theorem (2)
[Figure: power versus time for Core 1 and Core 2 under the total budget Pmax. (a) The power budgets $P_1^{C'}$ and $P_2^{C'}$ are set such that both cores completely close the temperature slack at the same time $C'$. (b) The power budgets are $P_1^{C'} + \alpha$ and $P_2^{C'} - \alpha$, such that Core 1 closes its temperature slack earlier than Core 2, at time $t_1$, which is $\Delta t$ before $C'$]

Executed workload (w) of (b), assuming $P \propto f^{\beta}$:
$w = \sqrt[\beta]{P_1^{C'} + \alpha} \cdot (C' - \Delta t) + \sqrt[\beta]{P_2^{C'} - \alpha} \cdot (C' - \Delta t) + \sqrt[\beta]{P_{max}} \cdot \Delta t$
$\;\; = \sqrt[\beta]{P_1^{C'} + \alpha} \cdot (C' - \Delta t) + \sqrt[\beta]{P_{max} - P_1^{C'} - \alpha} \cdot (C' - \Delta t) + \sqrt[\beta]{P_{max}} \cdot \Delta t$

Proof of power budgeting theorem (3)
[Figure: the same power-versus-time plots (a) and (b) as on the previous slide]

Derivative of the executed workload (w) of (b) with respect to $\Delta t$:
$\frac{dw}{d\Delta t} = -\sqrt[\beta]{P_1^{C'} + \alpha} - \sqrt[\beta]{P_{max} - (P_1^{C'} + \alpha)} + \sqrt[\beta]{P_{max}} < 0$
because $\sqrt[\beta]{x}$ is a concave function ($2 < \beta < 3$).
w is maximized when $\Delta t$ is zero. (QED)

Power budgeting among core-sets
• Constraints in power budgeting
(1) $\forall i, \forall j:\ T_{i,j,Z}(t + \Delta t) \le T_{max}$
(2) $\sum_{i=1}^{X} \sum_{j=1}^{Y} P_{i,j}^{core-set} \le P_{max}$
where $P_{i,j}^{core-set}$ is the power budget of core-set (i, j) and $\Delta t$ is the time interval for DVFS.

$T_{i,j,Z}(t + \Delta t) = (T_{i,j,Z}(t) - T_{i,j,Z}^{ss}) \cdot e^{-\Delta t / RC} + T_{i,j,Z}^{ss}$
Binary search is used to find the smallest C, as defined below, such that constraints (1) and (2) are both satisfied.

$\frac{T_{max} - T_i(t)}{T_i^{ss} - T_i(t)} = C$

$P_{i,j}^{core-set}$ and $T_{i,j,Z}(t + \Delta t)$ are non-decreasing as C decreases.
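The binary search over C can be sketched as below. This is an illustrative reading of the slide, not the authors' implementation: a single thermal time constant `tau` (= RC) is assumed for all core-sets, the bisection is done geometrically over an assumed range [1e-6, 1e6], and all names and numbers are hypothetical.

```python
import math


def feasible(c, temps, r_sets, t_max, t_amb, p_max, dt, tau):
    """Check constraints (1) and (2) for a candidate C."""
    decay = math.exp(-dt / tau)
    total_power = 0.0
    for t_now, r in zip(temps, r_sets):
        t_ss = t_now + (t_max - t_now) / c       # from (T_max - T) / (T_ss - T) = C
        t_next = (t_now - t_ss) * decay + t_ss   # temperature after one DVFS interval
        if t_next > t_max + 1e-9:                # constraint (1), small numeric tolerance
            return False
        total_power += (t_ss - t_amb) / r        # from T_ss = R * P + T_amb
    return total_power <= p_max                  # constraint (2)


def smallest_c(temps, r_sets, t_max, t_amb, p_max, dt, tau, lo=1e-6, hi=1e6, iters=100):
    """Bisect for the smallest feasible C; feasibility is monotone in C."""
    if not feasible(hi, temps, r_sets, t_max, t_amb, p_max, dt, tau):
        raise ValueError("no feasible C in the search range")
    for _ in range(iters):
        mid = math.sqrt(lo * hi)                 # geometric midpoint, C spans many decades
        if feasible(mid, temps, r_sets, t_max, t_amb, p_max, dt, tau):
            hi = mid
        else:
            lo = mid
    return hi


# Two hypothetical core-sets with top cores at 60 and 52 deg C.
c_star = smallest_c([60.0, 52.0], [1.0, 1.0], t_max=70.0, t_amb=40.0,
                    p_max=50.0, dt=0.005, tau=0.3)
```

In this example the power constraint is the binding one; with a very large `p_max` the temperature constraint takes over and the smallest feasible C approaches 1 − exp(−Δt/RC).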

Frequency assignment in a core-set
• Theorem: The following relation among frequencies within a core-set must hold to maximize instruction throughput:
$\forall k:\ R_k \cdot \frac{dP(f_k)}{df_k} = M$, where M is a constant,
subject to $\sum_{k=1}^{Z} R_k \cdot P(f_k) + T_{amb} \le T_Z^{ss}$, with $R_k = \sum_{l=1}^{k} r_l$
$P(f_k)$: power consumption of the core running at $f_k$
[Figure: thermal resistance ladder of a core-set: the cores on Layer 1, Layer 2, …, Layer Z are connected through resistances $r_1$, $r_2$, …, $r_Z$ to $T_{amb}$]

Implication of the equation: With the same total amount of workload executed for all layers, the core on a layer farther from the heat sink, which has a larger $R_k$, is assigned a lower clock frequency (due to the upward concavity of $dP(f_k)/df_k$).

Proof of frequency assignment (1)
Find $f_k$ for all k = 1, 2, …, Z, where $f_k$ is the clock frequency of the core located on layer k,

such that $f_{total} = \sum_{k=1}^{Z} f_k$ is maximized (for maximum performance)

subject to $\sum_{k=1}^{Z} R_k \cdot P(f_k) + T_{amb} = T_Z^{ss}$ (for maximum frequency)

Proof of frequency assignment (2)

Lagrange function:
$L(f_1, f_2, \ldots, f_Z, \lambda) = \sum_{k=1}^{Z} f_k + \lambda \cdot \left( \sum_{k=1}^{Z} R_k \cdot P(f_k) + T_{amb} - T_Z^{ss} \right)$
The first term is the objective and the second term is the constraint; $\lambda$ is a Lagrange multiplier.

Partial derivatives of the Lagrange function must be set to zero to maximize the objective:
$\frac{\partial L(f_1, f_2, \ldots, f_Z, \lambda)}{\partial f_k} = 1 + \lambda \cdot R_k \cdot \frac{dP(f_k)}{df_k} = 0$

Therefore, the following relation among frequencies within a core-set must hold:
$\forall k:\ R_k \cdot \frac{dP(f_k)}{df_k} = M$, where $M = -1/\lambda$ is a constant.
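As a worked illustration of this condition, assume the power model $P(f) = a \cdot f^{\beta}$ (the proportionality $P \propto f^{\beta}$ with $2 < \beta < 3$ is the one used in the power budgeting proof; the coefficient $a$ and the numeric case below are assumptions).

```latex
\begin{align*}
R_k \cdot \frac{dP(f_k)}{df_k} &= R_k \cdot a \beta f_k^{\beta - 1} = M
  \quad\Rightarrow\quad
  f_k = \left( \frac{M}{a \beta R_k} \right)^{\frac{1}{\beta - 1}} \\
\frac{f_k}{f_1} &= \left( \frac{R_1}{R_k} \right)^{\frac{1}{\beta - 1}} \\
\text{e.g. } \beta = 2.5,\ R_k = 2 R_1 :\quad
\frac{f_k}{f_1} &= 0.5^{1/1.5} \approx 0.63
\end{align*}
```

Under this assumed model, a layer with twice the cumulative thermal resistance runs at roughly 63 % of the frequency of the layer next to the heat sink.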

Frequency assignment in a core-set
• Finding optimal discrete frequency levels
1. Set $f_1$ to the highest available clock frequency.
2. Determine the frequencies of the remaining cores from $\forall k:\ R_k \cdot \frac{dP(f_k)}{df_k} = M$.
3. Constraint check: $\sum_{k=1}^{Z} R_k \cdot P(f_k) + T_{amb} \le T_Z^{ss}$?
– Yes: the assignment is complete.
– No: set $f_1$ to the next highest available frequency and go back to step 2.
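The flow above can be sketched in a few lines. Several choices here are assumptions rather than the authors' method: the power model P(f) = a·f^β with a = 1.0 and β = 2.5, the rule that each remaining core takes the highest discrete level whose R_k·dP/df does not exceed M, and the per-layer resistances in the example.

```python
from itertools import accumulate


def power(f, a=1.0, beta=2.5):
    """Assumed power model P(f) = a * f**beta (f in GHz, hypothetical coefficients)."""
    return a * f ** beta


def dpower_df(f, a=1.0, beta=2.5):
    """Derivative of the assumed power model."""
    return a * beta * f ** (beta - 1.0)


def assign_frequencies(levels, r_layers, t_ss_top, t_amb):
    """Return per-layer frequencies (layer 1 first) or None if no level fits."""
    big_r = list(accumulate(r_layers))            # R_k = r_1 + ... + r_k
    for f1 in sorted(levels, reverse=True):       # try f_1 from highest to lowest
        m = big_r[0] * dpower_df(f1)              # M fixed by layer 1
        freqs = [f1]
        for r_k in big_r[1:]:
            allowed = [f for f in levels if r_k * dpower_df(f) <= m]
            freqs.append(max(allowed) if allowed else min(levels))
        t_top = sum(r_k * power(f) for r_k, f in zip(big_r, freqs)) + t_amb
        if t_top <= t_ss_top:                     # constraint check
            return freqs
    return None


LEVELS = [1.0, 1.33, 1.66, 2.0]                   # GHz, the four levels of the example
loose = assign_frequencies(LEVELS, [1.22, 0.15, 0.15], t_ss_top=1000.0, t_amb=40.0)
tight = assign_frequencies(LEVELS, [1.22, 0.15, 0.15], t_ss_top=50.0, t_amb=40.0)
```

With a loose temperature target, the layer next to the heat sink keeps the highest level and the upper layers drop one level; tightening the target lowers all layers.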

Temperature-aware thread assignment
• Two objectives
– Balancing temperatures among cores
– Maximizing instruction throughput performance (i.e., total IPS)

[Figure: how to assign Thread 1, Thread 2, …, Thread N to Core 1 @ 60 °C, Core 2 @ 55 °C, …, Core N @ 58 °C?]
Points to consider:
1) IPC (instructions per cycle)
2) Memory-boundness

Temperature-aware thread assignment
• Two-step approach
– Thread-set assignment among core-sets to balance temperatures among core-sets
– Thread assignment within a core-set to maximize IPS

[Figure: example of the two-step approach (step 1): threads 1, 2, 3, …, N are grouped into thread-sets 1, 2, …, X·Y of Z threads each, which are then assigned to the core-sets (i, j) above the heat sink]

Thread-set assignment among core-sets
• Objective of thread-set formation
– Balancing IPC sums among thread-sets

• Procedure of thread-set formation
– 1) Assume X·Y empty sets, each of which can store at most Z threads.
– 2) Sort all threads in descending order of IPC.
– 3) Put the thread with the highest IPC into a set with the lowest sum of IPCs.
– 4) Repeat 3) until all threads are assigned to one of the X·Y sets.

Thread-set assignment among core-sets
Example of thread-set formation: forming three thread-sets when Z = 2
Threads (IPC): 0.9, 0.8, 0.6, 0.4, 0.3, 0.3
Thread-set 1: IPC 0.9 and IPC 0.3; IPC sum: 1.2
Thread-set 2: IPC 0.8 and IPC 0.3; IPC sum: 1.1
Thread-set 3: IPC 0.6 and IPC 0.4; IPC sum: 1.0
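The thread-set formation procedure is a greedy balancing pass and can be sketched as follows. The function name is hypothetical, and the choice to skip sets that are already full and to break ties by taking the first set is an assumption.

```python
def form_thread_sets(ipcs, num_sets, set_size):
    """Greedily group threads (given by IPC) into num_sets sets of at most set_size."""
    if len(ipcs) > num_sets * set_size:
        raise ValueError("more threads than available slots")
    sets = [[] for _ in range(num_sets)]
    for ipc in sorted(ipcs, reverse=True):                  # descending IPC
        open_sets = [s for s in sets if len(s) < set_size]  # sets with a free slot
        target = min(open_sets, key=sum)                    # lowest IPC sum so far
        target.append(ipc)
    return sets


# The example from the slide: six threads, three thread-sets, Z = 2.
thread_sets = form_thread_sets([0.9, 0.8, 0.6, 0.4, 0.3, 0.3], num_sets=3, set_size=2)
```

On the slide's input this reproduces the three thread-sets shown in the example, with IPC sums of 1.2, 1.1 and 1.0.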

Thread-set assignment among core-sets
• Procedure
1. Sort all thread-sets in ascending order of IPC sum.
2. Assign the thread-set with the lowest IPC sum to the core-set whose top core has the highest temperature.
3. Repeat 2 until all the thread-sets are assigned.

Relation between IPC and switching power: $P_s(V_{dd}, f) = C_s \cdot V_{dd}^2 \cdot f \propto IPC$
Example of IPC sum: a thread-set holding threads with IPC 0.9 and IPC 0.3 has IPC sum 0.9 + 0.3 = 1.2.
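Because step 2 always pairs the coolest remaining thread-set with the hottest remaining core-set, the procedure reduces to sorting both lists and pairing them. The sketch below assumes that reading; the function name and the top-core temperatures are hypothetical.

```python
def assign_thread_sets(thread_sets, top_core_temps):
    """Map core-set index -> thread-set index.

    The thread-set with the lowest IPC sum goes to the core-set whose top core is hottest.
    """
    if len(thread_sets) != len(top_core_temps):
        raise ValueError("need exactly one thread-set per core-set")
    by_ipc_sum = sorted(range(len(thread_sets)), key=lambda s: sum(thread_sets[s]))
    by_temp = sorted(range(len(top_core_temps)),
                     key=lambda c: top_core_temps[c], reverse=True)
    return {core_set: thread_set for thread_set, core_set in zip(by_ipc_sum, by_temp)}


# Thread-sets from the formation example; the temperatures (deg C) are made up.
mapping = assign_thread_sets([[0.9, 0.3], [0.8, 0.3], [0.6, 0.4]], [52.0, 60.0, 57.0])
```

Here the hottest core-set (index 1) receives the thread-set with IPC sum 1.0 and the coolest (index 0) receives the one with IPC sum 1.2.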

Temperature-aware thread assignment
• Two-step approach
– Thread-set assignment among core-sets to balance temperatures among core-sets
– Thread assignment within a core-set to maximize IPS

[Figure: example of the two-step approach (step 2): the threads 1, 2, …, Z of a thread-set are assigned to the cores 1, 2, …, Z of a core-set above the heat sink]

Thread assignment within a core-set
• Procedure
1. Sort threads in a core-set in ascending order of SU.
2. Assign the thread with the lowest SU to the core farthest from the heat sink.
3. Repeat 2 until all threads in the core-set are assigned.

Higher temperature slack (i.e., lower core temperature) allows the assignment of a larger power budget (i.e., higher voltage/frequency) and, therefore, a thread with larger SU.
$\frac{T_{max} - T_i(t)}{T_i^{ss} - T_i(t)} = C$
[Figure: a core-set with cores 1, 2, …, Z above the heat sink, where core Z has the highest temperature, and a thread-set with SU values 1.9, 1.2 and 2.0]
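The within-core-set procedure is again a sort followed by a pairing. In the sketch below the layer numbering (layer 1 next to the heat sink, layer Z farthest from it) follows the terminology slide; the function name is hypothetical.

```python
def assign_within_core_set(speedups):
    """Map layer -> SU of the thread placed there (layer 1 is next to the heat sink).

    The thread with the lowest SU goes to layer Z, the core farthest from the heat sink.
    """
    z = len(speedups)
    ordered = sorted(speedups)                    # ascending SU
    return {z - idx: su for idx, su in enumerate(ordered)}


# SU values of the thread-set shown in the figure.
placement = assign_within_core_set([1.9, 1.2, 2.0])
```

For the SU values in the figure, the thread with SU 1.2 lands on the top layer and the thread with SU 2.0 on the layer next to the heat sink, where the larger power budget benefits it most.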

Experimental setup
Simulation environment
Item | Description | Reference
Target processor | Intel 65-nm Merom processor | [Sakran, ISSCC07]
Temperature estimation | 3D grid-based temperature model | [Huang, TVLSI06]
Power estimation | Switching power + temperature-aware leakage power | [W. Liao, TCAD05]
Performance profiling | Performance profiling API on Intel Core2 processor in LW25 laptop | PAPI

Thermal characteristics [Coskun, DATE09]
Layer | Thermal conductivity (W/(m·K)) | Heat capacitance (J/(m³·K)) | Thickness (µm)
Heat sink | 400.0 | 3.55E+6 | 6,900
Heat spreader | 400.0 | 3.55E+6 | 1,000
TIM | 4.0 | 4.00E+6 | 20
Core / L2 cache | 100.0 | 1.75E+6 | 150
Interlayer | 4.0 | 4.00E+6 | 20

Other parameters
Parameter | Value
Frequency range (4 steps) | 1 ~ 2 GHz
∆t (DVFS time interval) | 5 ms
DVFS overhead | 10 µs [Lee, Tcomput.10]
Thread migration overhead | 1 ms [Coskun, DATE09]
∆l (thread assignment interval) | 100 ms
Psleep | 2 W [datasheet]
Maximum temperature | 70 °C
Maximum power | 200 W

Experimental setup
Comparison with existing solutions
Algorithm | Power budgeting | Thread assignment
2DI_SST | Steady-state temperature analysis | IPC and 2D floorplan-awareness
3DI_SST | Steady-state temperature analysis | IPC and 3D floorplan-awareness
3DIS_SST | Steady-state temperature analysis | IPC, SU and 3D floorplan-awareness
3DIS_IT (proposed) | Instantaneous temperature analysis | IPC, SU and 3D floorplan-awareness

Benchmark combinations (SPEC2000)
Criteria | Contents
IPC | hipc, lipc, mipc
Speedup (SU) | hm, lm, mm

Ex.) hipc-hm: combination of applications with high IPC and high memory-boundness

Experimental results
IPS result of each combination of applications:
1) 2DI_SST → 3DI_SST: 8.0 % IPS improvement
2) 3DI_SST → 3DIS_SST: 8.5 % IPS improvement
3) 3DIS_SST → 3DIS_IT: 14.5 % IPS improvement
[Figure: IPS of 2DI_SST, 3DI_SST, 3DIS_SST and 3DIS_IT (proposed) for each combination of applications]

Experimental results
Reasons for instruction throughput improvement:
1) Utilizing the thermal characteristics of the 3D floorplan
2) Utilizing the memory-boundness of each application
3) Aggressive power budgeting by exploiting instantaneous temperature analysis

Experimental results
EPI (energy per instruction) result:
2DI_SST → 3DIS_IT: 3.4 % EPI increase
3DI_SST → 3DIS_IT: 1.3 % EPI increase

Reasonable overhead in energy efficiency
Reason for the EPI increase: the proposed method assigns higher frequencies to cores through aggressive power budgeting (which gives a 24 % instruction throughput improvement on average).

Experimental results
Computational time measured on an LG xnote LW25 laptop running at 2 GHz:
Method | Measured time | Time interval to invoke
DVFS | 110 µs | 5 ms
Thread assignment | 3 µs | 100 ms

Number of clock frequency changes per second: on average 61 clock frequency changes per second in 3DIS_IT.

The overhead of DVFS is negligible.

Summary
• Temperature-aware power management in multi-layer MPSoC
– Dynamic voltage frequency scaling based on temperature slack
• Power budgeting among core-sets based on the temperature slack of the top core in each core-set
• Optimal frequency assignment of cores within a core-set based on the assigned power budget of the core-set

– Temperature-aware thread assignment
• Thread-set assignment among core-sets to balance temperatures among core-sets
• Thread assignment within a core-set to maximize IPS

– Experimental results show up to 41 % (24 % on average) IPS improvement compared with existing methods

