## Are you sure?

This action might not be possible to undo. Are you sure you want to continue?

Los Angeles

Statistical Analysis and Optimization for Timing and

Power of VLSI Circuits

A dissertation submitted in partial satisfaction

of the requirements for the degree

Doctor of Philosophy in Electrical Engineering

by

Lerong Cheng

2010

c Copyright by

Lerong Cheng

2010

The dissertation of Lerong Cheng is approved.

Lieven Vandenberghe

Glenn Reinman

Sudhakar Pamarti

Lei He, Committee Co-chair

Puneet Gupta, Committee Co-chair

University of California, Los Angeles

2010

ii

To my family.

iii

TABLE OF CONTENTS

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.1 Literature Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

1.1.1 FPGA Power, Performance, and Area Optimization . . . . . . 6

1.1.2 Statistical Timing Modeling, Analysis, and Optimization . . . 8

1.2 Contributions of the Dissertation . . . . . . . . . . . . . . . . . . . . 9

1.3 Dissertation Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

I Statistical Modeling and Optimization for FPGA Circuits 15

2 Deterministic Device and Architecture Co-Optimization for FPGA Power,

Delay, and Area Minimization . . . . . . . . . . . . . . . . . . . . . . . . . 16

2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

2.2 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

2.2.1 Conventional FPGA Architecture . . . . . . . . . . . . . . . 20

2.2.2 V

dd

Gatable FPGA Architecture . . . . . . . . . . . . . . . . 21

2.2.3 FPGA Architecture Evaluation Flow . . . . . . . . . . . . . . 23

2.3 Trace-Based Estimation . . . . . . . . . . . . . . . . . . . . . . . . . 24

2.3.1 Trace Collection . . . . . . . . . . . . . . . . . . . . . . . . 25

2.3.2 Power and Delay Model . . . . . . . . . . . . . . . . . . . . 26

2.3.3 Validation of Ptrace . . . . . . . . . . . . . . . . . . . . . . . 30

2.4 Hyper-Architecture Evaluation . . . . . . . . . . . . . . . . . . . . . 32

iv

2.4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

2.4.2 Necessity of Device and Architecture Co-Optimization . . . . 33

2.4.3 Energy and Delay Tradeoff . . . . . . . . . . . . . . . . . . . 36

2.4.4 ED and Area Tradeoff . . . . . . . . . . . . . . . . . . . . . 38

2.4.5 Impact of Utilization Rate . . . . . . . . . . . . . . . . . . . 40

2.4.6 Impact of Interconnect Structure . . . . . . . . . . . . . . . . 41

2.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

3 FPGADevice and Architecture Co-Optimization Considering Process Vari-

ation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

3.2 Delay and Leakage Variation Model . . . . . . . . . . . . . . . . . . 50

3.3 Hyper-Architecture Evaluation Considering Process Variation . . . . 58

3.3.1 Impact of Process Variation . . . . . . . . . . . . . . . . . . 59

3.3.2 Impact of Device and Architecture Tuning . . . . . . . . . . . 60

3.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

4 FPGA Concurrent Development of Process and Architecture Considering

Process Variation and Reliability . . . . . . . . . . . . . . . . . . . . . . . 63

4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

4.2 Device Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

4.3 Circuit-Level Delay and Power . . . . . . . . . . . . . . . . . . . . . 70

4.3.1 Delay Model . . . . . . . . . . . . . . . . . . . . . . . . . . 70

4.3.2 Power Model . . . . . . . . . . . . . . . . . . . . . . . . . . 73

v

4.4 Chip-level Delay and Power . . . . . . . . . . . . . . . . . . . . . . 75

4.4.1 Delay Model . . . . . . . . . . . . . . . . . . . . . . . . . . 75

4.4.2 Leakage Power Model . . . . . . . . . . . . . . . . . . . . . 76

4.4.3 Dynamic Power Model . . . . . . . . . . . . . . . . . . . . . 76

4.5 Chip Level Delay and Power Variation . . . . . . . . . . . . . . . . . 78

4.5.1 Variation Models . . . . . . . . . . . . . . . . . . . . . . . . 78

4.5.2 Delay Variation . . . . . . . . . . . . . . . . . . . . . . . . . 79

4.5.3 Chip Leakage Power Variation . . . . . . . . . . . . . . . . . 80

4.6 Process Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

4.6.1 Power and Delay Optimization . . . . . . . . . . . . . . . . . 82

4.6.2 Variation Analysis . . . . . . . . . . . . . . . . . . . . . . . 83

4.7 FPGA Reliability . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

4.7.1 Impact of Device Aging . . . . . . . . . . . . . . . . . . . . 86

4.7.2 Permanent Soft Error Rate (SER) . . . . . . . . . . . . . . . 89

4.8 Interaction between Process Variation and Reliability . . . . . . . . . 90

4.8.1 Impact of Device Aging on Power and Delay Variation . . . . 91

4.8.2 Impacts of Device Aging and Process Variation on SER . . . 92

4.9 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93

II Statistical Timing Modeling and Analysis 95

5 Non-Gaussian Statistical Timing Analysis Using Second-Order Polyno-

mial Fitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

vi

5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

5.2 Second-Order Polynomial Fitting of Max Operation . . . . . . . . . . 99

5.2.1 Review and Preliminary . . . . . . . . . . . . . . . . . . . . 99

5.2.2 New Fitting Method for Max Operation . . . . . . . . . . . . 100

5.3 Quadratic SSTA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

5.3.1 Quadratic Delay Model . . . . . . . . . . . . . . . . . . . . . 107

5.3.2 Max Operation for Quadratic Delay Model . . . . . . . . . . 109

5.3.3 Sum Operation for Quadratic Delay Model . . . . . . . . . . 115

5.3.4 Computational Complexity of Quadratic SSTA . . . . . . . . 115

5.4 Semi-Quadratic SSTA . . . . . . . . . . . . . . . . . . . . . . . . . . 115

5.4.1 Semi-Quadratic Delay Model . . . . . . . . . . . . . . . . . 115

5.4.2 Max Operation for Semi-Quadratic Delay Model . . . . . . . 116

5.5 Linear SSTA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118

5.5.1 Linear Delay Model . . . . . . . . . . . . . . . . . . . . . . 118

5.5.2 Max Operation for Linear Delay Model . . . . . . . . . . . . 119

5.6 Experimental Result . . . . . . . . . . . . . . . . . . . . . . . . . . . 121

5.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127

6 Physically Justiﬁable Die-Level Modeling of Spatial Variation in View of

Systematic Across-Wafer Variability . . . . . . . . . . . . . . . . . . . . . 128

6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129

6.2 Physical Origins of Spatial Variation . . . . . . . . . . . . . . . . . . 131

6.3 Analysis of Wafer Level Variation and Spatial Correlation . . . . . . . 133

vii

6.3.1 Variation of Mean and Variance with Location . . . . . . . . 134

6.3.2 Appearance of Spatial Correlation . . . . . . . . . . . . . . . 137

6.3.3 Dependence between Inter-Die and Within-Die Variation . . . 139

6.3.4 When can Spatial Variation be Ignored? . . . . . . . . . . . . 140

6.4 General Across-Wafer Variation Model . . . . . . . . . . . . . . . . 141

6.5 Modeling Spatial Variability . . . . . . . . . . . . . . . . . . . . . . 144

6.5.1 Slope Augmented Across-Wafer Model . . . . . . . . . . . . 145

6.5.2 Quadratic Across-Wafer Model . . . . . . . . . . . . . . . . 145

6.5.3 Location Dependent Across-Wafer Model . . . . . . . . . . . 146

6.6 Application of Spatial Variation Models on Statistical Timing Analysis 146

6.6.1 Delay Model . . . . . . . . . . . . . . . . . . . . . . . . . . 147

6.6.2 Experimental Result . . . . . . . . . . . . . . . . . . . . . . 148

6.7 Application of Spatial Variation Models on Statistical Leakage Analysis 156

6.8 Summary of Different Models . . . . . . . . . . . . . . . . . . . . . 158

6.9 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161

7 On Conﬁdence in Characterization and Application of Variation Models 162

7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163

7.2 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . 165

7.3 Conﬁdence Interval for Variation Sources . . . . . . . . . . . . . . . 167

7.3.1 Mean and Variance . . . . . . . . . . . . . . . . . . . . . . . 167

7.3.2 Simplifying the Large Lot Case . . . . . . . . . . . . . . . . 169

7.3.3 One Production Lot Case . . . . . . . . . . . . . . . . . . . . 170

viii

7.3.4 Experimental Validation for One Lot Case . . . . . . . . . . . 171

7.3.5 Fast/Slow Corner Estimation . . . . . . . . . . . . . . . . . . 172

7.4 Conﬁdence in SPICE Fast/Slow Corner . . . . . . . . . . . . . . . . 177

7.5 Conﬁdence in Statistical Timing Analysis . . . . . . . . . . . . . . . 183

7.6 Practical Questions: Determining Conﬁdence Induced Guardband for

Real Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188

7.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189

8 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191

9 Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198

ix

LIST OF FIGURES

1.1 Increasing of power density. . . . . . . . . . . . . . . . . . . . . . . 1

1.2 Leakage power and frequency variation. . . . . . . . . . . . . . . . . 3

1.3 Delay yield. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

1.4 SSTA ﬂow. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2.1 (a) Island style routing architecture; (b) Connection block; (c) Switch

block; (d) Routing switches. . . . . . . . . . . . . . . . . . . . . . . 21

2.2 (a) V

dd

-gateable switch; (b) V

dd

-gateable routing switch; (c) V

dd

-gateable

connection block; (d) V

dd

-gateable logic block. . . . . . . . . . . . . 22

2.3 Existing FPGA architecture evaluation ﬂow for a given device setting. 24

2.4 New trace-based evaluation ﬂow. We perform the same ﬂow as Fig-

ure 2.3 under one device setting to collect the trace information. . . . 25

2.5 Comparison between Psim and Ptrace. . . . . . . . . . . . . . . . . 31

2.6 hyper-architectures under different device settings. . . . . . . . . . . 35

2.7 Dominant hyper-architectures. (a) Homo-V

th

and Hetero-V

th

; (b) Homo-

V

th

+G and Hetero-V

th

+G. . . . . . . . . . . . . . . . . . . . . . . . 44

2.8 ED and area trade-off. ED and area are normalized with respect to the

baseline (N = 8, K = 4, V

dd

= 0.9V, and V

th

= 0.3V). . . . . . . . . 45

3.1 Leakage and delay of baseline architecture hyper-arch. . . . . . . . . 60

4.1 Trace-based estimation ﬂow. . . . . . . . . . . . . . . . . . . . . . . 65

4.2 The ﬂow for on current, I

on

, calculation. . . . . . . . . . . . . . . . . 67

4.3 The ﬂow for sub-threshold leakage current, I

of f

, calculation. . . . . . 68

x

4.4 The ﬂow for gate leakage currents I

gon

and I

gof f

calculation. . . . . . 68

4.5 The ﬂow for transistor gate and diffusion capacitances C

g

and C

di f f

calculation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

4.6 Voltage and current in an inverter during transition. . . . . . . . . . . 71

4.7 The pull-down delay of a 1X inverter at ITRS 2005 HP 32nm tech-

nology nodes. (a) Input and output voltages, and short circuit current

with a small input slope; (b) Input and output voltages, and short cir-

cuit current with a large input slope; (c) The NMOS IV curves and the

transition of transient current I

ds

with small and large input slopes. . . 72

4.8 Energy and delay tradeoff. . . . . . . . . . . . . . . . . . . . . . . . 83

4.9 Delay and leakage distribution. (a) Leakage PDF; (b) Delay PDF. . . 85

4.10 The calculation ﬂow for ΔV

th

due to NBTI and HCI. . . . . . . . . . 87

4.11 V

th

increase caused by NBTI and HCI. . . . . . . . . . . . . . . . . . 87

4.12 Impact of device aging. (a) Leakage change; (b) Delay change. . . . 88

4.13 The ﬂow for SRAM SER calculation. . . . . . . . . . . . . . . . . . 90

4.14 Impact of device aging on delay and leakage PDF. (a) Leakage PDF

comparison; (b) Delay PDF comparison. . . . . . . . . . . . . . . . 91

5.1 (a) Comparison of exact computation, linear ﬁtting, and second-order

ﬁtting for max(V, 0); (b) PDF comparison of exact computation, linear

ﬁtting, and second-order ﬁtting for max(V, 0). . . . . . . . . . . . . . 107

5.2 Algorithm for computing max(D

1

,D

2

). . . . . . . . . . . . . . . . . . 110

5.3 PDF comparison for circuit s15850. . . . . . . . . . . . . . . . . . . 123

6.1 Ring oscillator frequency within a wafer. (a) Process 1; (b) Process 2. 133

xi

6.2 Mean and variance for different r

d

. . . . . . . . . . . . . . . . . . . . 137

6.3 Apparent spatial correlation and covariance as a function of distance. . 138

6.4 Correlation coefﬁcient for within-die spatial variation after inter-die

variation is removed. . . . . . . . . . . . . . . . . . . . . . . . . . . 139

6.5 Approximating across-wafer variation. . . . . . . . . . . . . . . . . . 141

6.6 Ring oscillator frequency within a wafer. . . . . . . . . . . . . . . . . 143

6.7 PDF of S

x

and S

y

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144

7.1 Comparison of S-L, L-S, and L-L case. . . . . . . . . . . . . . . . . . 175

7.2 Number of ˆ n or ˜ n which can be considered as large for different frac-

tion of lot-to-lot variation assuming maximum 3% error at 90% conﬁ-

dence. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176

7.3 Flow to simulate conﬁdence. . . . . . . . . . . . . . . . . . . . . . . 180

7.4 90% conﬁdent D

w

f

with varying ˆ n and ˜ n. . . . . . . . . . . . . . . . . 182

7.5 Worst case mean, standard deviation, and 95-percentile point normal-

ized with respect to the nominal value for ISCAS85 benchmarks, as-

suming ˆ n = 10 and ˜ n = 15. . . . . . . . . . . . . . . . . . . . . . . . 185

7.6 Normalized worst case mean, standard deviation, and 95-percentile

point for ISCAS85 benchmarks, assuming ˆ n = 10 and ˜ n is large. . . . 186

7.7 90%conﬁdence worst case 95-percentile point with varying ˆ n for C7552.

Percentile point is normalized with respect to the nominal value. . . . 187

xii

LIST OF TABLES

1.1 Impact of process variation in near-term future. . . . . . . . . . . . . 3

2.1 Trace information, device and circuit parameters. . . . . . . . . . . . 26

2.2 switching activities for different technology nodes, V

d

d and V

th

. Ar-

chitecture setting: N = 10, K = 4. Unit: switch per clock cycle. . . . 27

2.3 Short circuit power ratios for different technology nodes, V

d

d and V

th

.

Architecture setting: N = 10, K = 4. . . . . . . . . . . . . . . . . . . 28

2.4 MCNCbenchmark list. The number of LUTs and ﬂip ﬂops are counted

under the architecture N=8 and K=4. . . . . . . . . . . . . . . . . . 31

2.5 Baseline hyper-architecture and evaluation ranges. . . . . . . . . . . 33

2.6 Power and delay ranges for different device settings. . . . . . . . . . 34

2.7 Min-ED hyper-architecture of optimizing device and architecture sep-

arately and simultaneously. . . . . . . . . . . . . . . . . . . . . . . 36

2.8 Minimum delay and minimum energy hyper-architectures. . . . . . . 37

2.9 Comparison between baseline and min-EDhyper-architecture in Homo-

V

th

, Hetero-V

th

, Homo-V

th

+G, and Hetero-V

th

+G. . . . . . . . . . . . 38

2.10 Minimum ED-area product hyper-architectures for different classes.

ED, Area, and ED-area product are normalized with respect to the

baseline. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

2.11 Min-ED hyper-architecture under different utilization rates. . . . . . 41

2.12 Min-AED hyper-architecture under different utilization rates. Note:

AED is normalized with respect to the baseline. . . . . . . . . . . . . 41

2.13 Min-delay hyper-architecture under different routing structures. . . . 42

xiii

2.14 Min-energy hyper-architecture under different routing structures. . . 43

2.15 Min-ED hyper-architecture under different routing structures. . . . . 43

3.1 Veriﬁcation of yield model. . . . . . . . . . . . . . . . . . . . . . . . 58

3.2 Experimental setting. . . . . . . . . . . . . . . . . . . . . . . . . . . 59

3.3 Optimum leakage yield hyper-architecture. . . . . . . . . . . . . . . . 61

3.4 Optimum Timing yield hyper-architecture. . . . . . . . . . . . . . . . 61

3.5 Optimum leakage and timing combined yield hyper-architecture. . . . 62

4.1 Experimental setting. . . . . . . . . . . . . . . . . . . . . . . . . . . 82

4.2 Power, delay, energy, and ED comparison for different device settings. 83

4.3 Experimental setting for process variation. . . . . . . . . . . . . . . . 84

4.4 Nominal value, mean, and standard deviation comparison for leakage

power and delay. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

4.5 Delay and leakage change due to device aging. . . . . . . . . . . . . 89

4.6 Soft error rate comparison. . . . . . . . . . . . . . . . . . . . . . . . 90

4.7 Impact of device aging on process variation. . . . . . . . . . . . . . . 92

4.8 The impact of device aging and process variation on SER of one SRAM

under ITRS HP 32nm technology. . . . . . . . . . . . . . . . . . . . 93

5.1 Experiment setting. The 3σ value is the normalized with respect to the

nominal value. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122

xiv

5.2 Error percentage of mean, standard deviation, skewness, and 95-percentile

point for variation setting (1). Note:the error percentage of mean (µ),

standard deviation (σ), and 95-percentile point (95%) is computed as

100 ×(MC value −SSTA value)/σ

MC

, and the error percentage of

skewness γ is computed as 100×(γ

MC

−γ

SSTA

)/γ

MC

. . . . . . . . . . 125

5.3 Mean, standard deviation, and skewness comparison for variation set-

ting (2). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126

6.1 Notations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135

6.2 Delay percentage error for different variation models. Note: the µ, σ,

and 95-percentile point for exact simulation is in ns. Run time (T) is in

ms.

∗

for LDAW, SPC, and IW, linear SSTA is applied when assuming

linear cell delay model. . . . . . . . . . . . . . . . . . . . . . . . . . 152

6.3 percentage error for ISCAS85 benchmark stretching on a chip with

different chip size. Note: We assume square chips and chip size means

edge length in mm. The values of exact simulation are in ns. . . . . . 153

6.4 Delay percentage error at different locations in a 2cm×2cm chip.

Note: The values of exact simulation are in ns. . . . . . . . . . . . . . 154

6.5 Delay comparison for ISCAS85 benchmark in 1cm×1cm chip. Note:

The values of exact simulation are in ns. . . . . . . . . . . . . . . . . 154

6.6 Delay comparison for ISCAS85 benchmark in 6mm×6mmchip. Note:

The values of exact simulation are in ns. . . . . . . . . . . . . . . . . 155

6.7 Delay comparison for ISCAS85 benchmark in 3mm×3mmchip. Note:

The values of exact simulation are in ns. . . . . . . . . . . . . . . . . 155

6.8 Leakage error percentage for different models in 2cm×2cmchip. Note:

The exact values are in mW. Run time (T) is in s. . . . . . . . . . . . 157

xv

6.9 Leakage error for different variation model on different size chips.

Note: exact values are in mW. . . . . . . . . . . . . . . . . . . . . . 158

6.10 Summary of different variation models. . . . . . . . . . . . . . . . . 160

7.1 Reliability of statistical analysis for different number of measured and

production lots. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167

7.2 Notations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168

7.3 Experimental validation of estimation of conﬁdence interval for mean

for ˆ n = 5 and ˜ n = 1. . . . . . . . . . . . . . . . . . . . . . . . . . . . 171

7.4 Guardband of fast/slow corner for different degrees of conﬁdence as-

suming ˆ n = 10, ˜ n = 15. . . . . . . . . . . . . . . . . . . . . . . . . . 174

7.5 Guardband of fast/slow corner for different degrees of conﬁdence as-

suming large ˆ n and ˜ n = 15. . . . . . . . . . . . . . . . . . . . . . . . 174

7.6 Guardband of fast/slowcorner for different conﬁdence levels assuming

ˆ n = 10 and large ˜ n. . . . . . . . . . . . . . . . . . . . . . . . . . . . 175

7.7 Guardband of fast/slowcorner for different conﬁdence levels assuming

ˆ n = 10, ˜ n = 1, and ˜ n

= 25. . . . . . . . . . . . . . . . . . . . . . . . 177

7.8 Percentage guardband of worst case delay corner and corresponding

variation source corners under different conﬁdence levels assuming

ˆ n = 10, ˜ n = 15. Note: delay value is in ps, L

gate

value is in nm, and

V

th

value is in V. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179

7.9 Conﬁdence comparison for inverter chain based corner assuming ˆ n =

10, ˜ n = 15. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179

xvi

7.10 Percentage guardband of worst case delay corner and corresponding

variation sources corners under different conﬁdence levels assuming

large and ˜ n = 15. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181

7.11 Percentage guardband of worst case delay corner and corresponding

variation source corners under different conﬁdence levels assuming

ˆ n = 10 and large ˜ n. . . . . . . . . . . . . . . . . . . . . . . . . . . . 181

7.12 Percentage guardband of worst case delay corner and corresponding

variation source corners under different conﬁdence levels assuming

ˆ n = 10, ˜ n = 1, and ˜ n

= 25. . . . . . . . . . . . . . . . . . . . . . . . 182

7.13 Conﬁdence comparison for ISCAS85 benchmark 95-percentile point

delay assuming ˆ n = 10, ˜ n = 15. . . . . . . . . . . . . . . . . . . . . . 185

xvii

Acknowledgments

Many people offered me signiﬁcant help during the development of this dissertation.

I would like to acknowledge them here.

First of all, I would like to thank my adviser, Prof. Lei He and my co-adviser, Prof.

Puneet Gupta. They not only helped me to develop skills to solve problems, but also

motivated and enlightened me at all time. This dissertation would not exist without

their enormous encouragement, support, and inspiration.

I am indebted to Prof. Sudhakar Pamarti, Prof. Glenn Reinman, and Prof. Lieven

Vandenberghe for serving as members of my dissertation committee and their sugges-

tions to improve this work.

I treasure the opportunity that I have worked with a group of wonderful people

and would like to thank them for all the helps and support through my research work,

especially those colleagues at the UCLA Design Automation Lab and NanoCAD Lab.

In particular, I thank Dr. Fei Li, Dr. Yan Lin, Miss Phoebe Wong, and Dr. Jinjun

Xiong. The work on device and architecture co-optimization for FPGA in Chapter 2

and Chapter 3 is a joint project with Fei Li, Yan Li, and Phoebe Wong. Yan Li also

helped me on developing the transistor and circuit level power and delay model for

FPGA process and architecture concurrent design in Chapter 4. Jinjun Xiong helped

me to develop the experiment and revise the writing for the statistical static timing

analysis work in Chapter 5. He also helped me to develop the chip-wise optimization

ﬂow to minimize FPGA delay considering process variation.

I sincerely thank Dr. Kevin Cao for his help with my research. He helped me to

develop the FPGA device aging model, which is reported in Chapter 4.

I also indebted to Dr. Costas Spanos and Mr. Kun Qian for helping me to develop

the across-wafer variation model, which is reported in Chapter 6.

xviii

I would also like to thank many researchers include Dr. Qi Xiang, Dr. Jeffery Tang,

Mr. Tim Tuan, Miss Wenping Wang, for collaboration and helpful discussions.

Finally, and most of all, I am grateful to my family. I would like to thank my

wife Wen Wang for her constant understanding and supporting me during my Ph.D.

study. I would also like to thank my parents, my brother, and my sister in law, for

their dedication, and support. This dissertation becomes so humble compared to their

constant inspiration, support, and love.

xix

Vita

1979 Born in People’s Republic of China.

1997-2001 B.S. in Electronic and Communication Engineering, Zhongshan

University, Guangzhou, P.R. China.

2001-2003 M.S. in Electrical and Computer Engineering, Portland State Uni-

versity, Portland, Oregon, USA. Teaching Assistant, Analogue Cir-

cuit Design, 2002. Graduate Student Researcher, 2002-2003.

2003-2004 First year Ph.D. program, in Electrical and Computer Engineering,

University of Colorado, Boulder, Colorado, USA. Graduate Student

Researcher, 2003-2004.

2004-present P.D. program, in Electrical Engineering, University of Califor-

nia, Los Angeles, California, USA. Teaching Assistant, Advanced

Computer Architecture, 2006. Graduate Student Researcher,

UCLA. Design Automation Lab, 2004-2008. Graduate Student Re-

searcher, UCLA. Nano CAD Lab, 2008-present.

2006 Intern Engineer, Nvidia Corp., San Jose, California, USA. Worked

on GPU testing.

2006-2007 Research Intern, Altera Corp., San Jose, California, USA. Worked

on statistical power and delay estimation.

xx

PUBLICATIONS

L. Cheng, F. Li, Y. Lin, P. Wong and L. He, “Device and Architecture Co-Optimization

for FPGA Architecture Power Reduction,” IEEE Transactions on Computer-Aided De-

sign of Integrated Circuits and Systems (TCAD), volume: 26, issue: 7, July 2007, pp:

1211-1221.

L. Cheng, P. Gupta, and L. He , “On Conﬁdence in Characterization and Application

of Variation Models,” accepted by IEEE/ACM Asia South Paciﬁc Design Automation

Conference (ASPDAC), February 2010.

L. Cheng, P. Gupta, and L. He , “Accounting for Non-linear Dependence Using Func-

tion Driven Component Analysis,” IEEE/ACM Asia South Paciﬁc Design Automation

Conference (ASPDAC), February 2009, pp: 474-479.

L. Cheng, P. Gupta, and L. He , “Efﬁcient Additive Statistical Leakage Estimation,”

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

(TCAD), volume: 28, issue: 11, November 2009, pp:1777-1781.

L. Cheng, P. Gupta, C. Spanos, K. Qian, and L. He , “Physically Justiﬁable Die-Level

Modeling of Spatial Variation in View of Systematic Across Wafer Variability,” Design

Automation Conference (DAC), July 2009, pp: 104-109.

L. Cheng, W. Hung, X.Song, and G. Yang, “Congestion Estimation for 3D Routing,”

IEEE Computer Society annual symposium on VLSI (ISVLSI), February 2004, pp:

xxi

239-240.

L. Cheng, W. Hung, X.Song, and G. Yang, “Congestion Estimation for Three-Dimensional

Architectures,” IEEE Transactions on Circuits and Systems II: Express Briefs (TCAS

II), volume: 51, issue: 12, December 2004, pp: 655- 659.

L. Cheng, Y. Lin, L. He, and Y. Cao, “Trace-Based Framework for Concurrent De-

velopment of Process and FPGA Architecture Considering Process Variation and Re-

liability,” International Symposium on Field-Programmable Gate Arrays (ISFPGA),

February 2008, pp: 159-168.

L. Cheng, X.Song, G. Yang, W. Hung, Z. Tang, and S. Gao, “A Fast Congestion

Estimator for Routing with Bounded Detours,” Integration, the VLSI Journal (Integrat

VLSI J), volume: 41, issue: 3, May 2008, pp: 360-370.

L. Cheng, X.Song, G. Yang, and Z. Tang, “A Fast Congestion Estimator for Routing

with Bounded Detours,” IEEE/ACM Asia South Paciﬁc Design Automation Confer-

ence (ASPDAC), February 2004, pp: 666-670.

L. Cheng, P. Wong, F. Li, Y. Lin, and L. He, “Device and Architecture Co-Optimization

for FPGA Power Reduction,” Design Automation Conference (DAC), June 2005, pp:

915-920.

L. Cheng, J. Xiong, and L. He, “FPGA Performance Optimization via Chipwise

Placement Considering Process Variations,” International Conference on Field Pro-

grammable Logic and Applications (FPL), August 2006, pp:1-6.

xxii

L. Cheng, J. Xiong, and L. He, “Non-Linear Statistical Static Timing Analysis for

Non-Gaussian Variation Sources,” ACM/IEEE International Workshop on Timing is-

sue: in the Speciﬁcation and Synthesis of Digital Systems(TAU), February 2007, pp:

80-85.

L. Cheng, J. Xiong, and L. He, “Non-Linear Statistical Static Timing Analysis for

Non-Gaussian Variation Sources,” Design Automation Conference (DAC), June 2007,

pp: 250-255.

L. Cheng, J. Xiong, and L. He, “Non-Gaussian Statistical Timing Analysis Using

Second-Order Polynomial Fitting,” IEEE International Workshop on Design for Man-

ufacturability and Yield (DFM&Y), October 2007.

L. Cheng, J. Xiong, and L. He, “Non-Gaussian Statistical Timing Analysis Using

Second-Order Polynomial Fitting,” IEEE/ACM Asia South Paciﬁc Design Automation

Conference (ASPDAC), February 2008, pp: 298-303.

L. Cheng, J. Xiong, and L. He, “Non-Gaussian Statistical Timing Analysis Using

Second-Order Polynomial Fitting,” IEEE Transactions on Computer-Aided Design of

Integrated Circuits and Systems (TCAD), volume: 28, issue: 1, January 2009, pp:

130-140.

F. He, L. Cheng, X. Song, and G. Yang, “An Analytical Congestion Model With

Bounded-Based Detours,”. accepted by Journal of Circuits, Systems, and Computers

(JCSC).

xxiii

F. He, L. Cheng, G. Yang, X. Song, M. Gu, and J. Sun, “On Theoretical Upper Bounds

for Routing Estimation,” Journal of Universal Computer Science (J.UCS), volume: 11,

Number 6, June 2005, pp: 916-925.

F. He, M. Gu, X. Song, Z. Tang, and L. Cheng, “Probabilistic Estimation for Routing

Space,” The Computer Journal (TCJ), May 2005, pp: 667-676.

F. He, X. Song, L. Cheng, G. Yang, Z. Tang, M. Gu, and J. Sun, “A Hierarchical

Method for Wiring and Congestion Prediction,” IEEE Computer Society annual sym-

posium on VLSI (ISVLSI), May 2005, pp: 307-308.

F. He, X. Song, M. Gu, L. Cheng, G. Yang, Z. Tang, and J.Sun, “ACombinatorial Con-

gestion Estimation Approach with Generalized Detours,” Computers & Mathematics

with Applications (CAM), volume: 51, : 6-7, March 2006, pp: 1113-1126.

W. Hung, X.Song, T. Kam, L. Cheng, and G. Yang, “Routability Checking For Three-

Dimensional Architectures,” IEEE Transactions on Very Large Scale Integration Sys-

tems (TVLSI), volume: 12, issue: 12, December 2004, pp: 1371-1374.

Y. Jeng and L. Cheng, “Digital Spectrumof a Nonuniformly Sampled Two-Dimensional

Signal and Its Reconstruction,” IEEE Transaction on Instrumentation and Measure-

ment (TIM), volume: 54, issue: 3, June 2005, pp: 1180-1187.

P. Wong, L. Cheng, Y. Lin, and L. He, “FPGA Device and Architecture Evaluation

Considering Process Variation,” International Conference on Computer Aided Design

(ICCAD), November 2005, pp: 19-24.

xxiv

Abstract of the Dissertation

Statistical Analysis and Optimization for Timing and

Power of VLSI Circuits

by

Lerong Cheng

Doctor of Philosophy in Electrical Engineering

University of California, Los Angeles, 2010

Professor Puneet Gupta, Co-chair

Professor Lei He, Co-chair

As CMOS technology scales down, process variation introduces signiﬁcant uncertainty

in power and performance to VLSI circuits and signiﬁcantly affects their reliability. If

this uncertainty is not properly handled, it may become the bottleneck of CMOS tech-

nology improvement. This dissertation proposes novel techniques to model, analyze,

and optimize power and performance of FPGAs and ASICs considering process varia-

tion. This dissertation focuses on two aspects: (1) Process and architecture concurrent

optimization for FPGAs; (2) Statistical timing modeling and analysis.

To perform process and architecture concurrent optimization, an efﬁcient and ac-

curate FPGA power, delay, and variation evaluator, Ptrace, is proposed. With Ptrace,

we present the ﬁrst in-depth study on device and FPGA architecture co-optimization

to minimize power, delay, area, and variation considering hundreds of device and ar-

chitecture combinations. Furthermore, to enable early stage process and architecture

co-optimization without stable device models, we develop transistor level and circuit

level power, delay, and reliability models and incorporate them with Ptrace. With the

extended Ptrace, we perform architecture and process parameters concurrent optimiza-

xxv

tion for FPGA power, delay, variation, and reliability.

To perform statistical timing modeling and analysis, we ﬁrst present an efﬁcient

and accurate statistical static timing analysis (SSTA) ﬂow for non-linear cell delay

model with non-Gaussian variation sources. All operations in this ﬂow are performed

by analytical equations without any time consuming numerical approach. Then, to

further improve the efﬁciency and accuracy of statistical timing analysis, we develop a

new die-level spatial variation model which accurately models the across-wafer vari-

ation. Besides modeling spatial variation, mean and variance uncertainty introduced

by limited number of samples is another problem in SSTA. To solve this problem,

we evaluate the conﬁdence for statistical analysis and estimate the guardband value to

ensure a target conﬁdence.

To the best of our knowledge, this dissertation is the ﬁrst novel study of device,

process, and architecture concurrent co-optimization for FPGA power, delay, variation,

and reliability; and is the ﬁrst work to model across-wafer variation at die-level and to

consider conﬁdence guardband in statistical analysis.

xxvi

CHAPTER 1

Introduction

Complementary Metal Oxide Semiconductor (CMOS) manufacturing in nanometer re-

gion has resulted in great achievements in the world of electronics. The performance

and capabilities of Very Large Scale Integration (VLSI) circuits improve rapidly with

the aggressive scaling of transistors in CMOS. However, the increase of leakage power

and process variation have emerged to slow down this trend.

Since leakage power is exponentially related to channel length, it increases rapidly

as CMOS technology scales down. Figure 1.1 [114] illustrates the power density for

different technology nodes in the past thirty years. It can be seen that power density is

increasing at a dramatic rate. Therefore, power consumption becomes a crucial design

constraint for nano-scale VLSI designs.

•1980

P

o

w

e

r

D

e

n

s

i

t

y

(

W

/

c

m

)

2

•1985 •1990 •1995 •2000 •2005 •2010

•10

•10

•10

•10

•10

•4

•3

•2

•1

•0

Figure 1.1: Increasing of power density.

1

In comparison to challenges created by power, process variation is more difﬁcult to

handle in VLSI design. Due to the inevitable manufacturing ﬂuctuations and imperfec-

tions, transistors on different copies of the same design, or even at different locations

on the same chip will vary signiﬁcantly. Process variation is the uncertainty of physical

process parameters, such as channel length, doping density, oxide thickness, channel

width, metal thickness, metal width, and so on. According to the causes of variation,

process variation can be classiﬁed into two types [162]:

• Catastrophic defects are caused by isolated random events (such as particles or

other contaminations) during manufacturing, which render chips non-functional.

• Parametric variations are caused by random ﬂuctuations in process conditions so

that the physical properties of some parameters on a chip differ from the original

design. The ﬂuctuations may include aberrations in stepper lens, doping density,

and manufacturing temperature.

Process variations introduce signiﬁcant uncertainty for both circuit performance

and leakage power. It has been shown in [25] that even for the 180nm technology,

process variation can lead to 1.3X variation in frequency and 20X variation in leak-

age power, as illustrated in Figure 1.2. Such impact will become even larger in future

technology generations. In recent years, many methods, such as statistical model-

ing, analysis, and optimization for VLSI circuits, have been developed to alleviate

the variation effects. This research is called “Design for Manufacturability ” (DFM).

Internation Technology Roadmap for Semiconductors (ITRS) predicts the DFM tech-

nology requirement for process variation in the near-term future, as shown in Table 1.1

[69]. From the table, it is predicted that leakage power variation will increase to 331%

and performance variation will increase to 88% in the year 2015. Therefore, as CMOS

technology scales down to nano-meter region, process variation will become a poten-

tial show-stopper if it is not appropriately handled.

2

Figure 1.2: Leakage power and frequency variation.

Year 2007 2008 2009 2010 2011 2012 2013 2014 2015

% of V

dd

variability 10 10 10 10 10 10 10 10 10

% of V

th

variability (minimum size) 33 37 42 42 42 58 58 81 81

% of V

th

variability (typical size) 16 18 20 20 20 26 26 36 36

% of CD (L

gate

) variability 12 12 12 12 12 12 12 12 12

% circuit performance variability 46 48 49 51 60 63 63 63 63

% circuit total power variability 56 57 63 68 72 76 80 84 88

% circuit leakage power variability 124 143 186 229 255 281 287 294 331

Table 1.1: Impact of process variation in near-term future.

There are four common design styles in current VLSI: Application Speciﬁc Integrated

Circuit (ASIC), Field-Programmable Gate Array (FPGA), Application Speciﬁc Instruction

set Processor (ASIP), and general purpose processor or microprocessor. Different de-

sign styles may have different power consumption requirement and vulnerability to

process variation. Therefore, different modeling, analysis, and optimization techniques

should be developed for different design styles.

Among the design styles, FPGA is the most ﬂexible one. Since FPGA allows the

same silicon implementation to be programmed or re-programmed for a variety of

applications, it provides low non-recurring engineering (NRE) cost and short time to

3

market. Due to this advantage, FPGA industry has grown rapidly since its invention

in 1984. However, in order to achieve programmability, FPGA pays the penalty with

decrease of performance, power, and area. Previous studies [83, 82] have shown a

100X energy difference, a 4.3X delay difference, and a 40X area difference between

FPGA designs and their ASIC counterparts. As discussed before, power consumption

is a crucial design constraint for nano-scale VLSI circuits. The power problem is

more signiﬁcant for FPGAs than other design styles because FPGA has higher power

consumptions. On the other hand, since FPGAs have lots of regularity, the impact of

process variation on FPGAs are not as signiﬁcant as other design styles. Nevertheless,

the impact of process variation on FPGAs should not be ignored and should be properly

handled. To solve the above problems, this dissertation focuses on optimizing power

and performance for FPGAs considering process variation. The ﬁrst objective of this

dissertation is:

Objective 1: Concurrently evaluate FPGA architecture and pro-

cess parameters to optimize power, delay, and area considering pro-

cess variation and reliability.

In comparison to FPGAs, ASICs have higher performance, lower power, and smaller

area. However, due to the increased design complexity, ASICs has higher NRE cost,

design cost, and longer time to market than FPGAs. Due to the above advantages

and disadvantages, ASICs are usually used in high performance or low power applica-

tions, wherein the impact of process variation is very signiﬁcant. Therefore, variation

modeling and analysis in ASICs are more important than in FPGAs.

Process variation may cause manufacturing yield loss. Manufacturing yield loss is

deﬁned as the ratio between the number of chips that fail to meet the design speciﬁ-

cations and the total number of manufactured chips [112]. There are several reasons

causing chip failure, such as catastrophic defects and parametric variations. Figure 1.3

4

illustrates the performance variation caused by parametric variations. It can be seen

that chip delay is spread over a wide range and the delay of some chips is higher than

the speciﬁcation value. The chips failing to meet the delay speciﬁcation contribute to

yield loss.

Yield

Delay spec

Yield Loss

Measured Delay

Figure 1.3: Delay yield.

In order to analyze and optimize performance yield, Statistical Static Timing Analysis

(SSTA) is proposed. Instead of estimating the chip delay deterministically as in the

traditional Static Timing Analysis (STA), SSTA estimates the chip delay statistically.

SSTA not only estimates the nominal value of chip delay but also provides the de-

lay distribution. Figure 1.4 shows the SSTA ﬂow. In this ﬂow, there are three key

components [122]:

• Modeling of process variation from silicon process characterization.

• Modeling of sensitivities of delay for standard cell libraries with regards to pro-

cess parameter variations (library modeling).

• Statistics aware engines for STA and optimization.

In this dissertation, we mainly focus on modeling of process variation and devel-

oping an accurate SSTA engine. The second objective of this dissertation is:

5

Modeling of

process

Variation

Measurement

Library

Modeling

Net List

SSTA

Engine

Chip

Delay

Variation

Figure 1.4: SSTA ﬂow.

Objective 2: Statistical Timing Modeling and Analysis.

In the following of this chapter, Section 1.1 reviews some related research works;

Section 1.2 discusses the major contributions of this dissertation; and Section 1.3 out-

lines this dissertation.

1.1 Literature Review

1.1.1 FPGA Power, Performance, and Area Optimization

It is well known that architecture, including logic block (or cluster) size, Look-Up

Table (LUT) size, and routing wire segment length, has great impact on FPGA power,

performance, and area. Therefore, architecture evaluation for FPGA optimization at-

tracts lots of concerns. In early 1990s, architecture evaluation mainly focuses on opti-

mizing delay [143] and area [130]. These works showed that for non-clustered FPGAs,

LUT size of 4 achieves the smallest area and LUT size 5 or 6 minimizes delay. After

cluster-based island style FPGA became popular, architecture evaluation for this type

of FPGA was performed. [10] showed that LUT sizes ranging from 4 to 6 and cluster

sizes between 4 and 10 produce the best area-delay product.

6

As discussed before, power consumption is one of the most important design con-

straints of FPGA design when technology scales down to nano-meter region. There-

fore, after the year 2000, FPGA power modeling became a hot research topic. [158]

quantiﬁed the leakage power of commercial FPGA architecture and [48] introduced a

high level FPGA power estimation methodology. [123, 89, 92] presented power eval-

uation frameworks for generic parameterized FPGAs and showed that leakage power

takes higher portion of total power consumption in FPGAs than in ASICs. It was also

shown that interconnects consume even more power than logic elements in FPGA.

At the same time, FPGA power optimization was also studied. Power aware FPGA

CAD algorithms was proposed in [85]; power driven partition method was presented

in [117]; a leakage power saving routing multiplexer and an input control method were

developed [150]; and a conﬁguration inversion method to save the leakage power was

proposed in [12]. Architecture evaluation for power and delay minimization is then

studied in [93, 123, 92].

Due to programmability, utilization rate of circuit elements in FPGA is very low.

To reduce leakage power consumption of unused circuit elements, body bias [110]

and power gating [59, 102] were applied. To further reduce power on used circuit

elements, programmable dual-V

dd

was ﬁrst applied to logic blocks [93, 90, 98, 91]

and then extended to interconnects [58, 51, 71]. Architecture evaluation considering

power gating and dual-V

dd

was performed [101]. Besides leakage power minimization,

a glitch minimization technique [84] was proposed to reduce dynamic power.

Recently, FPGA modeling and optimization considering process variation was

studied. [31, 43, 100, 115, 147, 99, 135, 79, 115, 154, 137, 28] presented techniques to

statistically optimize FPGA performance. [136] performs parametric yield estimation

for FPGA considering within-die variation.

7

1.1.2 Statistical Timing Modeling, Analysis, and Optimization

Process Variation Modeling- Process variation introduces signiﬁcant uncertainty to

circuit power and delay. In order to analyze the power and delay uncertainty, process

variation modeling is needed. An early work was proposed in [25] whereby process

variation is separated into inter-die variation and within-die variation. All transistors

on the same die share the same inter-die variation and different dies may have different

inter-die variation; each transistor has its own within-die variation and the within-die

variation for different transistors are assumed to be independent. Later on, it was

observed that within-die variation is not independent but spatially correlated and the

correlation depends on the distance between two within-die locations. In this case,

spatial variation was modeled as correlated random variables [6, 32] and principle

component analysis was applied to perform statistical timing analysis. In this model, a

chip is divided into several grids and each grid has its own spatial variation. The spatial

variations of different grids are correlated and the correlation coefﬁcients depend on

the distance between two grids. However, obtaining the correlation coefﬁcient between

two grids became a problem for this model. [172, 173, 105] solved this problem by

modeling the spatial variation as a random ﬁeld [174] and assuming the correlation

coefﬁcient to be a function of distance. After that, several more complicated spatial

variation models were proposed [45, 105, 189, 55, 64].

Statistical Analysis- With the model of process variation, statistical analysis and

optimization are then performed. Statistical static timing [103, 72, 7, 120, 157, 128,

32, 5, 49, 9, 22, 8, 163, 86, 53, 182, 13, 181, 113, 75, 180, 178, 76, 3, 183, 2,

142, 19, 179, 40, 30, 52, 144, 165, 116, 36, 41, 56, 63, 97, 35, 104, 189, 145, 141,

88, 161, 81, 107, 65, 61, 169, 138, 146, 70, 155, 29, 171, 118, 42] and leakage

[26, 111, 21, 37, 176, 140, 109, 15, 156, 33, 95, 62, 73, 44, 119, 151, 188, 20, 133]

analysis are hot research topics in recent years. There are two major approaches in

8

Statistical static timing analysis techniques: path-based [103, 72, 7, 120, 157, 128]

and block-based [32, 5, 49, 9, 22, 8, 163, 86, 53, 182, 13, 181, 113, 75, 180, 178, 76,

3, 183, 2, 142, 19, 179, 40, 41, 42] SSTA. Since path-based SSTA is not scalable to

large circuit sizes, block-based SSTA is more commonly used. Since Gaussian random

variables are easy to be handled, the early block-based SSTA ﬂows [32, 163] modeled

the gate delay as linear functions of variation sources and assumed all the variation

sources are mutually independent Gaussian random variables. Later, it was observed

that the linear delay model is no longer accurate when the scale of variation becomes

larger [94] and a higher-order delay model is thus used [180, 178]. Moreover, it was

also observed that some variation sources do not follow a Gaussian distribution. For

example, the via resistance has an asymmetric distribution [13], while dopant concen-

tration is more suitably modeled as a Poisson distribution [142] rather than Gaussian.

The nonlinearity of delay model and non-Gaussian variation sources make SSTA much

more difﬁcult. [142] applied independent component analysis to de-correlate the non-

Gaussian random variables, but it was still based on a linear delay model. Some recent

works [76, 13, 19, 179, 40, 41, 42] considered both non-linear delay model and non-

Gaussian variation sources.

Conﬁdence Analysis- In all the statistical modeling and analysis methods dis-

cussed above, the statistical characteristics (such as mean and variance) are assumed to

be given and reliable. However, when the number of production samples is not large,

the production statistical characteristics may signiﬁcantly deviate from their popula-

tion values. Therefore, uncertainty in the statistics of measured data as well as produc-

tion data should be considered in statistical analysis. [177] modeled the uncertainty

of mean, variance, and correlation coefﬁcients as an interval, and then estimated the

range of mean and variance of circuit performance.

9

1.2 Contributions of the Dissertation

The contributions of this dissertation are two-fold:

1. Device and architecture concurrent optimization for FPGAs.

2. Statistical timing modeling and analysis.

For device and architecture concurrent optimization for FPGAs, we have the fol-

lowing contributions:

• We ﬁrst develop an efﬁcient yet accurate power and delay estimator for FPGAs,

which we refer to as Ptrace. Experimental results show that compared to cycle

accurate power simulator and VPR [18], Ptrace predicts chip level power and

delay within 4% and 5% error, respectively. Based on such framework, we per-

form device and architecture co-optimization for FPGA circuits. Compared to

the baseline, which uses the VPR architecture model [18] with the same LUT

size and cluster size as the commercial FPGAs used by Xilinx Virtex-II [170],

and the device settings from ITRS roadmap[66], our co-optimization reduces

energy-delay product by 18.4% and chip area by 23%. Furthermore, consider-

ing FPGA architecture with power-gating capability, our architecture and device

co-optimization reduces energy-delay product by 55.0% and chip area by 8.2%

compared to the baseline. We also study the impact of utilization rate and inter-

connect structure.

• We then extend Ptrace to estimate power and delay variation of FPGAs. We pro-

posed closed-form models to estimate chip level leakage and timing variations

for FPGA. It has been shown that the mean and standard deviation computed by

our models are within 3% error compared to Monte-Carlo simulation. With the

10

extended Ptrace, we perform FPGA device and architecture evaluation consider-

ing process variations. Compared to the baseline, device and architecture tuning

improves leakage yield by 4.8%, timing yield by 12.4%, and leakage and timing

combined yield by 9.2%. We also observe that LUT size of 4 gives the highest

leakage yield, LUT size of 7 gives the highest timing yield, but LUT size of 5

achieves the maximum leakage and timing combined yield.

• We further extend (Ptrace) to consider process parameters directly so that FPGA

circuit and architecture evaluation can be conducted when only the ﬁrst or-

der process parameters are available. Such evaluation may be used to select

circuits and architectures that are less sensitive to process changes or process

variations. We call the resulting framework as Ptrace2. With Ptrace2, we in-

corporate analytical calculations for two types of FPGA reliability, device ag-

ing (Negative-Bias-Temperature-Instability, NBTI [11, 149, 160, 23] and Hot-

Carrier-Injection, HCI [34, 149, 166]) and permanent soft error rate (SER) [60],

again in the ‘from device to chip’ fashion. We observe that device aging reduces

standard deviation of leakage by 65% over 10 years while it has relatively small

impact on delay variation. Moreover, we also ﬁnd that neither device aging due

to NBTI and HCI nor process variation has signiﬁcant impact on SER.

For statistical timing modeling and analysis, we have the following contributions:

• We introduce an efﬁcient and accurate SSTA ﬂow for non-linear SSTA with

non-Gaussian variation sources. All operations in our ﬂow are based on efﬁcient

closed-form formulae. Experimental results showthat compared to Monte-Carlo

simulation, our approach predicts the mean, standard deviation, skewness, and

95-percentile point within 1%, 1%, 6%, and 1% error, respectively.

• Our proposed SSTA ﬂow solves the problem of computing the chip delay vari-

11

ation, but does not consider modeling of process variation. In order to improve

the accuracy and efﬁciency of SSTA, we presents an accurate model for across-

wafer variation. Compared to the exact value, the error of the traditional grid-

based spatial variation model is up to 8% while the error of our model is only

2%. Morever, our new model is 6X faster than the traditional model.

• The above SSTA ﬂow also assumes that the statistical variation models are reli-

able. However, due to limited number of samples (especially in the case of lot-to-

lot variation), calibrated models have low degree of conﬁdence. The problem is

further exacerbated when low production volumes cause additional loss of con-

ﬁdence in the statistical analysis (since production only sees a small snapshot

of the entire distribution). We mathematically derive the conﬁdence intervals

for commonly used statistical measures (mean, variance, percentile corner) and

analysis (SPICE corner extraction, statistical timing). Our estimations are within

2% of simulated conﬁdence values. Our experiments indicate that for moderate

characterization volumes (10 lots) and low-to-medium production volumes (15

lots), a signiﬁcant guardband is needed (e.g., 34.7%of standard deviation for sin-

gle parameter corner, 38.7% of standard deviation for SPICE corner, and 52% of

standard deviation for 95%-tile point of circuit delay are needed) to ensure 95%

conﬁdence in the results. The guardbands are non-negligible for all cases when

either production or characterization volume is not large.

1.3 Dissertation Outline

The remaining of this dissertation is organized as follows:

Part I, including Chapter 2, 3, 4, proposes techniques to statistically model and

optimize FPGA power and delay.

12

Chapter 2 ﬁrst introduces a time efﬁcient trace-based delay and power estimation

framework for FPGA circuits, Ptrace. With Ptrace, we simultaneously optimize de-

vice setting (supply voltage V

dd

and threshold voltage V

th

) and FPGA architecture

(look-up table size K, cluster size N, and interconnect wire segment length W) to min-

imize power, delay and area.

Chapter 3 extends the architecture and device co-optimization ﬂow in Chapter 2

to considering process variation. We ﬁrst develop a set of closed-form formulas for

chip level leakage and timing variations considering both die-to-die and within-die

variations. Based on such model, we extend Ptrace to estimate the power and delay

variation of FPGAs. Applying extended Ptrace, we optimize device setting and FPGA

architecture to maximize delay yield, leakage yield, and delay and leakage combine

yield.

Chapter 4 proposes a framework to perform concurrent process and architecture

optimization for FPGAs in early stage without detailed process modeling. We ﬁrst

implement models to calculate electrical characteristics of advanced CMOS transis-

tors based on the ITRS MASTAR4 (Model for Assessment of cmoS Technologies And

Roadmaps) tool [67, 68, 148], and then develop circuit level models of delay, leakage

power, and input/output capacitance for FPGA basic circuit elements. With this cir-

cuit level model, we further extend Ptrace in Chapter 2 and 3 to evaluate chip level

power, delay, and reliability. We refer to the new trace-based model as Ptrace2. With

Ptrace2, we can conduct FPGA circuit and architecture evaluation when only the ﬁrst

order process parameters are available. We may also consider reliability issues, such

as device aging and permanent soft error rate, in addition to power and delay during

evaluation.

Part II, including Chapter 5, Chapter 6, and Chapter 7 introduces statistical tim-

ing modeling and analysis.

13

Chapter 5 presents an efﬁcient and accurate statistical static timing analysis ﬂowfor

non-linear delay model with non-Gaussian variation sources. We ﬁrst propose a second

order polynomial approximation for the max operation of two randomvariables. Based

on this approximation, we develop an SSTA ﬂow for three different delay models:

quadratic delay model, quadratic delay model without crossing terms (semi-quadratic

model), and linear delay model.

Chapter 6 studies another key component of SSTA: the modeling of process vari-

ation. We analyze the impact of deterministic across-wafer variation and ﬁnd that

the spatially correlated within-die variation is mainly caused by deterministic across

wafer variation. Then, we propose an efﬁcient and accurate die-level variation model

to model the across-wafer variation. We apply our proposed variation model to the

SSTA ﬂow in Chapter 5 to verify its efﬁciency and accuracy.

Chapter 7 studies the conﬁdence interval of statistical characteristics, such as mean

and variance, of statistical timing analysis. We estimates the conﬁdence interval for a

single variation sources, for the fast/slow corner of an inverter chain, and for full chip

timing analysis.

Finally, Chapter 8 concludes this dissertation with discussion of the ongoing work

and future work as well.

In Chapter 9, I brieﬂy summarize the works I have ﬁnished in my Ph.D. program

which are not included in this dissertation.

14

Part I

Statistical Modeling and Optimization

for FPGA Circuits

15

CHAPTER 2

Deterministic Device and Architecture Co-Optimization

for FPGA Power, Delay, and Area Minimization

Device optimization considering supply voltage V

dd

and threshold voltage V

th

has lit-

tle chip area increase, but a great impact on power and performance in the nanometer

technology. This chapter studies simultaneous evaluation of device and architecture

optimization for FPGAs. We ﬁrst develop an efﬁcient yet accurate timing and power

evaluation method, called trace-based model. By collecting trace information from

cycle-accurate simulation of placed and routed FPGA benchmark circuits and re-using

the trace for different V

dd

and V

th

, we enable device and architecture co-optimization

considering hundreds of device and architecture combinations. Compared to the base-

line FPGA architecture, which uses the VPR architecture model and the same LUT

and cluster sizes as those used by the Xilinx Virtex-II, V

dd

suggested by ITRS, and V

th

optimized with respect to the above architecture and V

dd

, architecture and device co-

optimization can reduce energy-delay product by 20.5% and chip area by 23.3%. Fur-

thermore, considering power-gating of unused logic blocks and interconnect switches

(in this case sleep transistor size is a parameter of device tuning), our co-optimization

reduces energy-delay product by 55.0% and chip area by 8.2% compared to the base-

line FPGA architecture.

16

2.1 Introduction

FPGAs allow the same silicon implementation to be programmed or re-programmed

for a variety of applications. It provides low NRE (non-recurring engineering) cost

and short time to market. Due to the large number of transistors required for ﬁeld pro-

grammability and the low utilization rate of FPGA resources (typically 62.5% [158]),

existing FPGAs consume more power compared to ASICs [83]. As the process ad-

vances to nanometer technologies and low-energy embedded applications are explored

for FPGAs, power consumption becomes a crucial design constraint for FPGAs.

Recent work has studied FPGA power modeling and optimization. The leakage

power of a commercial FPGA architecture was quantiﬁed [158], and a high level FPGA

power estimation methodology was presented [48]. Power evaluation frameworks

were introduced for generic parameterized FPGA [123, 89, 92] and it was shown that

both interconnect and leakage power are signiﬁcant for FPGAs in nanometer technolo-

gies. As to power optimization, the interaction of a suit of power-aware FPGA CAD

algorithms without changing the existing FPGAs was studied in [85]. Power-driven

partition algorithm for mapping applications to FPGAs with different V

dd

-levels [117]

was proposed. A conﬁguration inversion method to reduce the leakage power of mul-

tiplexers without any additional hardware [12] was investigated. Besides the power

optimization CAD algorithms, low power FPGA circuits and architectures have also

been studied. Region based power gating for FPGA logic blocks [59] and ﬁne-grained

power-gating for FPGA interconnects [102] were proposed, and V

dd

programmability

was applied to both FPGA logic blocks [93, 90, 98, 91] and interconnects [58, 51, 71].

Anewtype of routing multiplexer and an input control method were developed [150] to

reduce leakage of unused routing multiplexers, and a circuitry combining gate biasing,

body biasing and multi-threshold techniques was designed [110] to reduce intercon-

nect leakage. Besides leakage power reduction, a glitch minimization technique [84]

17

was proposed to reduce dynamic power.

Architecture evaluation has been performed ﬁrst for area and delay. For non-

clustered FPGAs, it was shown that an LUT size of 4 achieves the smallest area [130]

and an LUT size of 5 or 6 leads to the best performance [143]. Later on, the cluster-

based island style FPGA was studied to optimize area-delay product and it showed

that LUT sizes ranging from 4 to 6 and cluster sizes between 4 and 10 can produce the

best area-delay product [10]. Besides area and delay, FPGA architecture evaluation

considering energy was studied recently [93, 123, 92]. It was shown that in 0.35µm

technology, an LUT size of 3 consumes the smallest energy [123]. In 100nm tech-

nology, an LUT size of 4 consumes the smallest energy [92]. [101] further evaluated

FPGA architectures with ﬁeld programmable dual-V

dd

and power gating, and consid-

ering area, delay, and energy.

However, all the aforementioned architecture evaluation assumed ﬁxed supply volt-

age V

dd

and threshold voltage V

th

, and sleep transistor size (if power gating is applied),

and have not conducted simultaneous evaluation of device optimization such as V

dd

and V

th

tuning and architecture optimization such as tuning LUT and cluster sizes.

Architecture and device co-optimization may obtain a better power and performance

tradeoff compared to architecture tuning alone. We deﬁne hyper-architecture as the

combination of device and architectural parameters. The co-optimization requires the

exploration of the following dimensions: cluster size N, LUT size K, V

dd

, V

th

, and sleep

transistor size if power gating is applied. The total number of hyper-architecture com-

binations can be easily over a few hundreds considering the interaction between these

dimensions. In the existing power evaluation frameworks, such as cycle-accurate sim-

ulation [89] and transition density based estimation [123], timing and power are cal-

culated for each circuit element. Therefore, it is time-consuming to explore the huge

hyper-architecture solution space using methods from [89, 123]. In order to perform

18

device and architecture co-optimization, an accurate yet extremely efﬁcient timing and

power evaluation method is required.

The ﬁrst contribution of this chapter is that we develop a trace-based estimator

(called Ptrace) for FPGA power, delay, and area. We proﬁle placed and routed bench-

mark circuits and collect statistical information (called trace) on switching activity,

short circuit power, near-critical path structure, and circuit element utilization rate for

a given set of benchmark circuits (MCNC [175] benchmark set in this chapter). We

show that the trace is independent of device tuning and it can be used to calculate

FPGA chip-level performance and power for a number of device designs. Compared

to performing placement-and-routing by VPR [18] followed by cycle-accurate simula-

tion (called Psim) from [89], Ptrace has a high ﬁdelity and an average error of 1.3% for

energy and of 0.8% for delay. The trace collecting has the same runtime as evaluating

FPGA architecture for one combination of V

dd

, V

th

and sleep transistor size using VPR

and Psim. It took one week to collect the trace for the MCNC benchmark set using

eight 1.2GHz Intel Xeon servers while all the hyper-architecture evaluation reported

in this chapter with over hundreds of hyper-architecture combinations took only a few

minutes on one server.

The second contribution is that we performthe architecture and device co-optimization

for conventional FPGAs and FPGAs with power gating capability. We explore differ-

ent V

dd

, V

th

, and sleep transistor size combinations in addition to cluster size and LUT

size combinations. For comparison, we obtain the baseline FPGA hyper-architecture

which uses the VPR architecture model [18] and the same LUT size and cluster size as

the commercial FPGAs used by Xilinx Virtex-II [170], and V

dd

suggested by ITRS

[66], but V

th

optimized by our device optimization. Such baseline is signiﬁcantly

better than the ones with no device optimization. Compared to the baseline hyper-

architecture, architecture and device co-optimization can reduce energy-delay product

19

(product of energy per clock cycle and critical path delay, in short, ED) by 18.4% and

chip area by 23.3%. Furthermore, considering FPGA architecture with power-gating

capability, our architecture and device co-optimization reduces ED by 55.0% and chip

area by 8.2% compared to the baseline. We also study the impact of utilization rate

and interconnect structure.

The rest of the chapter is organized as follows. Section 2.2 presents the background

of FPGA architecture and existing FPGA architecture evaluation ﬂow. Section 2.3

introduces our trace-based estimation models. Section 2.4 applies the new estimation

models to architecture and device co-optimization. Section 2.5 concludes this chapter.

2.2 Preliminaries

2.2.1 Conventional FPGA Architecture

We assume cluster-based island style FPGA architecture such as that in [18] for all

classes of FPGAs studied in this chapter. A cluster-based logic block (see Figure 2.1)

includes N fully connected Basic Logic Elements (BLEs). Each BLE includes one

K-input lookup table (LUT) and one ﬂip-ﬂop (DFF). The combination of cluster size

N and LUT size K is the architectural issue we evaluate in this chapter. The logic

blocks are surrounded by routing channels consisting of wire segments. The input and

output pins of a logic block can be connected to the wire segments in routing chan-

nels via a connection block (see Figure 2.1 (b)). A routing switch block is located at

the intersection of a horizontal channel and a vertical channel. Figure 2.1 (c) shows a

subset switch block [87], where the incoming track can be connected to the outgoing

tracks with the same track number.

1

The connections in a switch block (represented by

the dashed lines in Figure 2.1 (c)) are programmable routing switches. We implement

1

Without loss of generality, we assume subset switch block in this chapter.

20

routing switches by tri-state buffers and use two tri-state buffers for each connection

so that it can be programmed independently for either direction. We deﬁne an inter-

connect segment as a wire segment driven by a tri-state buffer or a buffer.

2

In this

chapter, we investigate both uniform length-4 interconnect wire segments and mixed-

length interconnects. We decide the routing channel width CW in the same way as the

architecture study in [18], i.e., CW = 1.2CW

min

where CW

min

is the minimum channel

width required to route the given circuit successfully.

0

1

2

3

3 1 2 0

3

2

1

0 0

1

2

3

0 1 3 2

A C

D

B

switch

connection

connection switch

(b) Connection block and

SR

SR

c

o

n

n

e

c

t

i

o

n

b

l

o

c

k

SR

0123

(c) Switch block

(d) Routing switches

SR

A B

logic

block

in1

block

connection

block

switch

(a) Island style

routing architecte

Figure 2.1: (a) Island style routing architecture; (b) Connection block; (c) Switch

block; (d) Routing switches.

2

We interchangeably use the terms of switch and buffer/tri-state buffer.

21

2.2.2 V

dd

Gatable FPGA Architecture

Power gating can be applied to interconnects and logic blocks to reduce FPGA power.

Figure 2.2 illustrates the circuit design of the V

dd

-gateable interconnect switch and

logic block from [101]. We insert a PMOS transistor (called a sleep transistor) between

the power rail and the buffer (or logic block) to provide the power-gating capability.

When a buffer or logic block is not used, the sleep transistor is turned off by the

conﬁguration cell. SPICE simulation shows that power-gating can reduce the leakage

power of an unused buffer or a logic block by a factor of over 300.

S

w

i

t

c

h

S

w

i

t

c

h

D

e

c

o

d

e

r

SR

SR

SR

l

o

g

N

S

R

A

M

c

e

l

l

s

2

Switch

Switch

Logic block

in1

(b) Vdd−gateable routing switches

(a) Vdd−gateable switch

dN

d1

d1

dN

N

s

w

i

t

c

h

e

s

(c) Vdd−gateable connection block

Connection switches

wire track

Dec_Disable

B

A

M1

M2

Vdd

SR

BUFF

Switch

SR

SR

SR

SR

(d) Vdd gateable logic block

Logic block

Sram

Vdd

Figure 2.2: (a) V

dd

-gateable switch; (b) V

dd

-gateable routing switch; (c) V

dd

-gateable

connection block; (d) V

dd

-gateable logic block.

22

There is power and delay overhead associated with the sleep transistor insertion.

The dynamic power overhead is almost negligible. This is due to the fact that the sleep

transistors stay either ON or OFF after conﬁguration and there is no charging and

discharging at their source/drain capacitors. The delay overhead associated with the

sleep transistor insertion can be bounded when the sleep transistor is properly sized.

Moreover, the connection box with power gating is different from that in the con-

ventional architecture. For the conventional FPGA, an input MUX, which is effectively

an implicit decoder [18], is used in the connection box as illustrated in Figure 2.1(b).

However, for the power gating architecture, an explicit decoder is used in the con-

nection box to enable both the signal path and power supply for the single input, as

illustrated in Figure 2.2(c) [101]. Because the decoder is no longer in the critical path,

the delay of connection box in Figure 2.2(c) is smaller than that in Figure 2.1(b). The

connection box in Figure 2.2(c) also has a much smaller leakage power but a larger

area and a slightly larger dynamic power compared to the conventional architecture.

2.2.3 FPGA Architecture Evaluation Flow

The existing FPGA architecture evaluation ﬂow considering area, delay, and power

[92] is illustrated in Figure 2.3. For a given benchmark set, we ﬁrst optimized the

logic then map the circuit to a given LUT size. TV-Pack is used to pack the mapped

circuit to a given cluster size. After packing, we place and route the circuit using VPR

[18] and obtain the chip level delay and area. Finally, cycle-accurate power simulator

[89] (in short Psim) is used to estimate the chip level power consumption.

The architecture evaluation ﬂow discussed above is time consuming because we

need to place and route every circuit under different architectures and a large number

of randomly generated input vectors need to be simulated for each circuit. However,

the device and architecture co-optimization requires the exploration of the following

23

Parasitic

Extraction

Cycle-accurate

Power

Simulator

(Psim)

Power

Arch

Spec

Logic Optimization(SIS)

Tech-Mapping

(RASP)

Timing-Driven Packing

(TV-Pack)

Placement & Routing

(VPR)

Delay Area

Benchmark

Figure 2.3: Existing FPGA architecture evaluation ﬂow for a given device setting.

dimensions: cluster size N, LUT size K, supply voltage V

dd

, threshold voltage V

th

, and

sleep transistor size if power gating is applied. The total number of hyper-architecture

combinations can be easily over a few hundreds considering the interaction between

these dimensions. Therefore, the conventional evaluation ﬂow is not practical for de-

vice and architecture co-optimization due to its time inefﬁciency. In order to perform

device and architecture co-optimization, a fast yet accurate FPGA power and delay

estimator is required. Such estimator is just the one we are going to introduce in Sec-

tion 2.3, the trace-based estimator.

2.3 Trace-Based Estimation

In this section, we will introduce the efﬁcient power and delay estimation model, trace-

based estimation method (Ptrace). We speculate that during hyper-architecture evalu-

ation, there are following two classes of information as summarized in Table 2.1. The

ﬁrst class only depends on architecture and is called trace of the architecture. The sec-

ond class only depends on device setting and circuit design. The basic idea of Ptrace is

24

as follows: For a given benchmark set, we proﬁle placed and routed benchmark circuits

and collect trace information under one device setting and for one FPGA architecture.

We then obtain chip-level performance and power for a set of device for a given archi-

tecture parameter values based on the trace information. Figure 2.4 illustrates the ﬂow

of trace-based evaluation.

Arch

Spec

Trace-Based

Estimation

Trace

Collection

Chip Level Area,

Delay, and Power

Circuit Level

Area, Delay,

and Power

Figure 2.4: New trace-based evaluation ﬂow. We perform the same ﬂow as Figure 2.3

under one device setting to collect the trace information.

2.3.1 Trace Collection

As mentioned before, the trace information only depends on architecture and remains

the same when device setting changes. For a given benchmark set and a given FPGA

architecture, the trace includes number of used type i circuit elements (N

u

i

), total num-

ber of type i circuit elements (N

t

i

), average switching activity of used type i circuit

elements (S

u

i

), short circuit power ratio (α

sc

), and near-critical path structure (see Ta-

ble 2.1). Near-critical path structure is the number of each type of circuit elements

(N

p

i

) on the near-critical path. Device parameters include V

dd

and V

t

, which depend on

technology scale, and average leakage power of type i circuit elements (P

s

i

), average

load capacitance of type i circuit elements (C

u

i

), and average delay of type i circuit

25

elements (D

i

), which depend on circuit design. The device parameters, such as cir-

cuit level delay and power, can be collected by SPICE simulation or by measurement.

After collecting the trace and device level delay and power, we can perform Ptrace

to estimate the chip level delay, power, and area. Below, we will show that the trace

information is insensitive to the device parameters and discuss our trace-based models.

Trace Parameters (depend on architecture)

N

u

i

# of used type i circuit elements

N

t

i

total # of type i circuit elements

S

u

i

avg. switching activity for used type i circuit elements

N

p

i

# of type i circuit elements on the near-critical path

α

sc

ratio between short circuit power and switch power

Device Parameters

(depend on processing technology and circuit design)

V

dd

power supply voltage

V

th

threshold voltage

P

s

i

avg. leakage power for type i circuit elements

C

u

i

avg. load capacitance of type i circuit elements

D

i

avg. delay of type i circuit elements

Table 2.1: Trace information, device and circuit parameters.

2.3.2 Power and Delay Model

2.3.2.1 Dynamic Power Model

Dynamic power includes switch power and short-circuit power. A circuit implemented

on an FPGA cannot utilize all circuit elements. Dynamic power is only consumed by

the used FPGA resources. Our trace-based switch power model distinguishes different

types of used FPGA resources and applies the following formula:

P

sw

=

∑

i

1

2

N

u

i

· f · V

2

dd

· C

sw

i

(2.1)

The summation is over different types of circuit elements, i.e., LUTs, buffers, input

pins and output pins. For type i circuit elements, C

sw

i

is the average switch capacitance

26

and N

u

i

is the number of used circuit elements, f is the operating frequency. In this

chapter, we assume that the circuit works at its maximumfrequency, i.e., the reciprocal

of the near-critical path delay. The switch capacitance is further calculated as:

C

sw

i

= (

∑

j∈El

i

C

i, j

/N

u

i

) · S

u

i

= C

u

i

· S

u

i

(2.2)

For type i circuit elements, C

u

i

is the average load capacitance of used circuit elements,

which is averaged over C

i, j

, the local load capacitance for used circuit element j. El

i

is

the set of used type i circuit elements, and S

u

i

is the average switching activity of used

type i circuit elements. We assume that the average switching activity of the circuit

elements is determined by the circuit logic functionality and FPGA architecture. The

device parameters of V

d

d and V

th

have a limited effect on switching activity. We verify

this assumption in Table 2.2 by demonstrating the average switching activity of ﬁve

benchmarks at different technology nodes, and V

d

d and V

th

levels.

benchmark 70nm V

d

d=1.1 100nm V

d

d=1.3 70nm V

d

d=1.0

V

th

=0.25 V

th

=0.32 V

th

=0.20

logic inter- logic inter- logic inter-

connect connect connect

alu4 2.06 0.55 2.01 0.54 2.03 0.59

apex2 1.73 0.47 1.75 0.47 1.70 0.47

apex4 1.23 0.27 1.19 0.26 1.16 0.29

bigkey 1.75 0.56 1.96 0.59 1.71 0.55

clma 0.90 0.21 0.87 0.21 0.91 0.23

Table 2.2: switching activities for different technology nodes, V

d

d and V

th

. Architec-

ture setting: N = 10, K = 4. Unit: switch per clock cycle.

The short circuit power is related to signal transition time, which is difﬁcult to

obtain without detailed simulation using a real delay model. In our trace-based model,

we model the short circuit power as:

P

sc

= P

sw

· α

sc

(2.3)

27

Where α

sc

is the ratio between short circuit power and switch power. Although this

ratio value at the circuit level depends on FPGA circuit design and architecture, we

assume that α

sc

does not depend on device and technology at the chip level. We ver-

ify this assumption in Table 2.3

3

by showing the average short circuit power ratio at

different technology nodes, V

d

d, and V

th

levels.

benchmark 70nm V

d

d=1.1 100nm V

d

d=1.3 70nm V

d

d=1.0

V

th

=0.25 V

th

=0.32 V

th

=0.20

logic inter- logic inter- logic inter-

connect connect connect

alu4 1.43 1.12 1.44 1.16 1.46 1.15

apex2 1.44 0.89 1.42 0.93 1.48 0.92

apex4 1.08 0.86 1.15 0.79 1.18 0.82

bigkey 0.74 1.64 0.76 1.71 0.72 1.68

clma 1.11 1.72 1.21 1.62 1.16 1.63

Table 2.3: Short circuit power ratios for different technology nodes, V

d

d and V

th

. Ar-

chitecture setting: N = 10, K = 4.

2.3.2.2 Leakage Power Model

The leakage power is modeled as follows,

P

static

=

∑

i

N

t

i

P

s

i

(2.4)

For resource type i, N

t

i

is the total number of circuit elements, and P

s

i

is the leakage

power for a type i element. Notice that usually N

t

i

>N

u

i

because the resource utilization

rate is low in FPGAs (typically 62.5% [158]). For an FPGA architecture with power-

gating capability, an unused circuit element can be power-gated to reduce leakage

3

Because the delay between buffers is usually large for FPGA circuits, the short circuit power ratio

could be big.

28

power.

4

In this case, the total leakage power is modeled by the following formula:

P

static

=

∑

i

N

u

i

P

i

+α

gating

·

∑

i

(N

t

i

−N

u

i

)P

i

(2.5)

where α

gating

is the average leakage ratio between a power-gated circuit element and a

circuit element in normal operation. SPICE simulation shows that sleep transistors can

reduce leakage power by a factor of 300 and α

gating

= 0.003 is used in this chapter.

2.3.2.3 Delay Model

To avoid the static timing analysis for the entire circuit implemented on a given FPGA

fabric, we obtain the structure of the ten longest circuit paths including the near-critical

path for each circuit. The path structure is the number of elements of different resource

types, i.e., LUT, wire segment and interconnect switch, on one circuit path. We assume

that the new near-critical path due to different V

d

d and V

th

levels is among these ten

longest paths found by our benchmark proﬁling. When V

d

d and V

th

change, we can

calculate delay values for the ten longest paths under new V

d

d and V

th

levels, and

choose the largest one as the new near-critical path delay. Therefore, the FPGA delay

can be calculated as follows:

D =

∑

i

N

p

i

D

i

(2.6)

For resource type i, N

p

i

is the number of circuit elements that the near-critical path

goes through, and D

i

is the delay of such a circuit element. D

i

is a circuit parameter

depending on V

d

d, V

th

, process technology, and FPGA architecture.

5

To get the path

4

In this chapter, we assume the wire segment length change introduce by power gating can be ig-

nored. In our experiment, wire length is calculated as 4 ×

√

area

tile

[18], where area

tile

is the area of

a logic block with the interconnect surrounding it. The area overhead introduced by power gating is

less than 30%. Therefore, the change of wire segment length is less than 14%. Moreover the load ca-

pacitance of a routing buffer is mainly determined by the input and output capacitances of buffers. The

wire capacitance is a small portion (about 20%) of the load capacitance. Therefore, the area increase

introduced by power gating will have small impact on the delay and power estimation.

5

Delay of each circuit element is measured with the worst case switch. When power gating is

applied, a single gating transistor is used for the entire logic block. In order to measure the delay of

29

statistical information N

p

i

, we only need to place and route the circuit once for a given

FPGA architecture.

2.3.3 Validation of Ptrace

2.3.3.1 Veriﬁcation of Deterministic Model

To validate Ptrace, we compare it to the conventional architecture evaluation ﬂow

for ITRS [66] 70nm technologies. We assume V

d

d=1.0V and V

th

=0.2V and map 20

MCNC benchmarks, as illustrated in Table 2.4, to two architectures: {N = 8, K = 4}

and {N =6, K =7}. In Table 2.4, the number of LUTs and ﬂip ﬂops are counted under

the architecture N=8 and K=4. Notice that we can map the benchmarks to different

LUT and cluster sizes. We collect trace using the conventional architecture evaluation

ﬂow in 70nm technology, V

d

d=1.0V and V

th

=0.2V. Figure 2.5 compares energy and

delay between the conventional evaluation ﬂow and Ptrace for each benchmark. The

average energy error of Ptrace is 1.3% and average delay error is 0.8%. From the ﬁg-

ure, the Ptrace has the same trends for energy as Psim does and for delay as VPR does.

Therefore, Ptrace has a high ﬁdelity. Moreover, the run time of Ptrace is 2s, while that

of the conventional evaluation ﬂow is 120 hours.

each circuit element in a logic block with gating transistor, a series of randomly generated vectors are

used as the inputs to the block and the worst case delay is measured for each circuit element under such

random input vectors.

30

benchmark # LUTs # Flip ﬂops benchmark # LUTs # Flip ﬂops

alu4 1607 0 ex5p 1201 0

apex2 2118 0 frisc 5926 900

apex4 1291 0 misex3 1501 0

bigkey 2935 224 pdc 5608 0

clma 13537 33 s298 2548 8

des 2172 0 s38417 8458 1463

diffeq 1933 377 s38584 7040 1260

dsip 1619 224 seq 1953 0

elliptic 4185 1138 spla 3938 0

ex1010 4721 0 tseng 1299 382

Table 2.4: MCNC benchmark list. The number of LUTs and ﬂip ﬂops are counted

under the architecture N=8 and K=4.

0

9

18

27

36

45

0 9 18 27 36 45

Delay from VPR(ns)

D

e

l

a

y

f

r

o

m

P

t

r

a

c

e

(

n

s

)

8x4

6x7

4% Error

0

2

4

6

8

10

0 2 4 6 8 10

Energy from Psim (nJ)

E

n

e

r

g

y

f

r

o

m

P

t

r

a

c

e

(

n

J

) 8x4

6x7

5% Error

Figure 2.5: Comparison between Psim and Ptrace.

31

2.4 Hyper-Architecture Evaluation

2.4.1 Overview

In this section, we use Ptrace to perform device and architecture evaluation. We

consider 70nm ITRS technology and evaluate four FPGA hyper-architecture classes:

Homo-V

th

, Hetero-V

th

, Homo-V

th

+G and Hetero-V

th

+G. Homo-V

th

is the conventional

FPGA using homogeneous V

th

for interconnects and logic blocks. Hetero-V

th

applies

different V

th

to logic blocks and interconnects. Homo-V

th

+G and Hetero-V

th

+G are

the same as Homo-V

th

and Hetero-V

th

, respectively, except that unused logic blocks

and interconnects are power-gated [101]. We compare them with the baseline hyper-

architecture, which uses the VPR architecture model [18] and has the same LUT size

and cluster size as those used by the Xilinx Virtex-II [170] (cluster size of 8, LUT

size of 4), V

dd

suggested by ITRS [66] (0.9v), and V

th

of 0.3v that is optimized with re-

spect to the above architecture and V

dd

. The baseline hyper-architecture and evaluation

ranges for device and architecture are presented in Table 2.5. Note that a high V

th

is

applied to all SRAM cells for conﬁguration to reduce their leakage power as suggested

by [93].

In the subsections 2.4.2 to 2.4.4, we assume the following:

• The utilization rate (deﬁned as the utilization rate of logic blocks, i.e., number

of used logic blocks over the number of total available logic blocks) is 0.5.

• All interconnect wire segments span 4 logic blocks with fully buffered routing

switches.

• The routing channel width is 1.2 times of the minimumchannel width that allows

the FPGA circuit being routed.

• All benchmark circuits work at their highest frequency (1/critical path delay).

32

The impact of utilization rate and interconnect architecture will be discussed in sub-

sections 2.4.5 and 2.4.6, respectively. For each hyper-architecture, we compute the

energy, delay and area as the geometric mean of 20 MCNC benchmarks in Table 2.4.

Moreover, in the rest of this chapter, we use CV

t

for V

th

of logic blocks and IV

th

for

V

th

for global interconnects. Note that CV

t

= IV

th

in Homo-V

th

and Homo-V

th

+G. To

illustrate the tradeoff between energy and delay, we introduce the concept of dominant

hyper-architecture: If hyper-architecture A has less energy consumption and a smaller

delay than hyper-architecture B, then we say that B is inferior to A. We deﬁne the dom-

inant hyper-architectures as the set of hyper-architectures that are not inferior to any

other hyper-architectures.

We organize the rest of this section as follows: First, Section 2.4.2 illustrates the

necessity of device and architecture co-optimization. Then Section 2.4.3 presents the

energy and delay tradeoff and ED reduction achieved by device and architecture co-

optimization. Section 2.4.4 discusses the device and architecture co-optimization con-

sidering area. Finally, Sections 2.4.5 and 2.4.6 analyze the impact of utilization rate

and compare the evaluation result of different routing architectures, respectively.

Baseline FPGA device/arch parameter values

V

dd

V

th

N K

0.9v 0.3v 8 4

Value range for device/arch optimization

V

dd

V

th

N K

0.8v-1.1v 0.2v-0.4v 6-12 3-7

Table 2.5: Baseline hyper-architecture and evaluation ranges.

2.4.2 Necessity of Device and Architecture Co-Optimization

In this section, we show the necessity of device and architecture co-optimization. We

ﬁrst discuss the need of device tuning, then compare the results of optimizing device

33

and architecture separately and simultaneously.

Architecture evaluation has been studied in previous research [131, 80, 10, 89, 123,

101]. However, device tuning has not been reported in the literature. Our experiments

show that device tuning has a much greater impact on delay and energy than archi-

tecture tuning does, which is demonstrated in Figure 2.6 and Table 2.6. Each set of

data points in Figure 2.6 is the dominant hyper-architectures for a given device set-

ting.

6

For example, set D4 is the dominant hyper-architectures under V

dd

=1.0V and

V

th

=0.25V. From the ﬁgure, we observe that a change on the device leads to a more

signiﬁcant change in energy and delay than architecture change does. For example,

for device setting V

dd

= 0.9V and V

th

=0.25v, energy for different architectures ranges

from 1.82nJ to 2.11nJ, and delay ranges from 13.68ns to 16.46ns. However, if we

increase V

th

by 0.05v, i.e., V

dd

= 0.9V and V

th

= 0.3V, the energy ranges from 1.17nJ

to 1.32nJ and the delay ranges from 18.99ns to 23.24ns. Therefore, it is important to

evaluate both device and architecture instead of evaluating architecture only.

Set V

dd

V

th

Min energy Max energy Min delay Max delay

(V) (V) (nJ) (nJ) (ns) (ns)

D1 0.9 0.25 1.82 2.11 13.68 16.46

D2 0.9 0.30 1.17 1.32 18.99 23.24

D3 0.9 0.35 0.97 1.05 31.01 36.40

D4 1.0 0.25 2.30 3.17 11.90 14.06

D5 1.0 0.30 1.37 1.95 15.60 17.50

D6 1.0 0.35 1.11 1.30 21.31 24.66

D7 1.1 0.25 5.58 17.01 10.60 12.43

D8 1.1 0.30 3.10 9.03 13.05 15.40

D9 1.1 0.35 1.95 4.73 17.01 19.92

Table 2.6: Power and delay ranges for different device settings.

There are three methods to perform device and architecture optimization. In the

ﬁrst method, we ﬁrst optimize device using one architecture then optimize the ar-

6

Dominant hyper-architectures for a given device setting are the hyper-architectures that are not

inferior to any other hyper-architectures under such device setting.

34

0

2

4

6

8

10

12

14

16

18

10 15 20 25 30 35 40

Delay (ns)

E

n

e

r

g

y

p

e

r

c

y

c

l

e

(

n

J

)

D1 Vdd 0.9 Vt 0.25

D2 Vdd 0.9 Vt 0.30

D3 Vdd 0.9 Vt 0.35

D4 Vdd 1.0 Vt 0.25

D5 Vdd 1.0 Vt 0.30

D6 Vdd 1.0 Vt 0.35

D7 Vdd 1.1 Vt 0.25

D8 Vdd 1.1 Vt 0.30

D9 Vdd 1.1 Vt 0.35

D1

D9

D8

D7

D6

D5

D4

D3

D2

Figure 2.6: hyper-architectures under different device settings.

chitecture under the optimized device setting and call it device-arch method. In the

second method, we ﬁrst optimize the architecture within one device setting then opti-

mize the device setting according to the optimized architecture and call it arch-device

method. In the third method, we optimize architecture and device simultaneously and

call it simultaneous method. Both methods arch-device and device-arch cannot guar-

antee the optimal solution. Table 2.7 compares the min-ED hyper-architectures for

Homo-V

th

found by three different methods. For both arch-device and device-arch,

we start search from the baseline case {N = 8, K = 4, V

dd

= 0.9V, and V

th

= 0.3V}.

In our experiment, we ﬁnd that there are two local optimal hyper-architectures in the

whole solution space: {N = 6, K = 7, V

dd

= 0.9V, V

th

= 0.3V} and {N = 10, K = 4,

V

dd

= 1.0V, V

th

= 0.3V} which is also the global optimal. In this particular example,

both arch-device and device-arch achieve the same local optimal hyper-architecture

{N = 6, K = 7, V

dd

= 0.9V, V

th

= 0.3V}. If we start search from some other hyper-

35

architectures, arch-device and device-arch may achieve other local optimum but there

is no guarantee of the global optimum. In order to obtain the global optimal solution,

we have to perform simultaneous method. From Table 2.7, we observe that simul-

taneous method can reduce ED by 13.3% compared to arch-device and device-arch.

We also see that the runtime of arch-device and device-arch is shorter than that of

simultaneous method. However, due to the time efﬁciency of Ptrace, the runtime of

simultaneous method is also small (only 34.1s). Therefore, it is still worthwhile per-

forming simultaneous device and architecture optimization.

V

dd

(V) CV

t

(V) IV

th

(V) (N, k) Energy (nJ) Delay (ns) ED (nJ· ns) runtime (s)

arch-device 0.9 0.30 0.30 (6,7) 1.38 19.8 27.3 5.1

device-arch 0.9 0.30 0.30 (6,7) 1.38 19.8 27.3 3.4

simultaneous 1.0 0.30 0.30 (10,4) 1.37 17.5 24.1 (-13.3%) 34.1

Table 2.7: Min-ED hyper-architecture of optimizing device and architecture separately

and simultaneously.

2.4.3 Energy and Delay Tradeoff

In this section, we ﬁrst compare the impact of device tuning and architecture tuning,

then present min-energy and min-delay hyper-architectures, and ﬁnally discuss the

energy and delay tradeoff. For the classes Homo-V

th

+G and Hetero-V

th

+Gwith power-

gating, we assume the following ﬁxed sleep transistor size: 210X PMOS for a logic

block, 10X PMOS for a switch buffer, and 1X PMOS for a connection buffer. We then

discuss the sleep transistor tuning in Section 2.4.4.

Table 2.8 summarizes the minimumdelay and minimumenergy hyper-architectures

for each class. The minimum delay hyper-architectures have cluster size of 6 and LUT

size of 7 which are the same for all classes. The minimum energy hyper-architectures

have LUT size of 4 for all classes. This is similar to the previous evaluation result

[101, 89]. As expected, the min-delay hyper-architectures have the highest V

dd

and

36

lowest V

th

. However, the min-energy hyper-architectures have the lowest V

dd

but not

the highest V

th

. This is because we assume that each circuit works at its highest pos-

sible frequency (1/critical path delay). The energy is calculated as E = delay · power.

When V

th

is too high, the delay is so large that the energy per clock cycle increases.

Hyper- Minimum delay hyper-architecture Minimum energy hyper-architecture

Arch V

dd

CV

t

IV

th

(N,K) E D V

dd

CV

t

IV

th

(N,K) E D

Class (V) (V) (V) (nJ) (ns) (V) (V) (V) (nJ) (ns)

Homo-V

th

1.1 0.20 0.20 (6,7) 31.22 8.86 0.8 0.35 0.35 (10,4) 0.942 59.2

Hetero-V

th

1.1 0.20 0.20 (6,7) 31.22 8.86 0.8 0.30 0.35 (12,4) 0.920 43.6

Homo-V

th

+G 1.1 0.20 0.20 (6,7) 15.98 9.45 0.8 0.30 0.30 (12,4) 0.550 30.5

Hetero-V

th

+G 1.1 0.20 0.20 (6,7) 15.98 9.45 0.8 0.30 0.25 (10,4) 0.549 24.3

Table 2.8: Minimum delay and minimum energy hyper-architectures.

Usually, higher performance hyper-architectures consume more energy. To illus-

trate the energy and delay tradeoff, we present the dominant hyper-architectures of

Homo-V

th

and Hetero-V

th

in Figure 2.7 (a) and those of Homo-V

th

+G and Hetero-

V

th

+G in Figure 2.7 (b). From the ﬁgure, we ﬁnd that the energy difference between

Homo-V

th

+G and Hetero-V

th

+G is smaller than that between Homo-V

th

and Hetero-

V

th

. This is because leakage power is signiﬁcantly reduced by power-gating and there-

fore more detailed V

th

tuning such as heterogeneous-V

th

has a smaller impact. We also

see that dominant hyper-architectures of Homo-V

th

+G and Hetero-V

th

+G have smaller

delay than that of Homo-V

th

and Hetero-V

th

. This is due to the fact that the connection

box with power gating has smaller delay than that without power gating as discussed

in Section 2.2.2. Moreover, with the dominant hyper-architecture ﬁgure, we can obtain

the minimum energy solution for a given performance range. For example, if we want

to ﬁnd the minimumenergy solution for Homo-V

th

with delay limit 15ns, we only need

to pick the dominant hyper-architecture whose delay is closest to 15ns.

In order to achieve the best energy and delay tradeoff, we ﬁnd the hyper-architectures

with the minimum energy delay product (in short min-ED) in Table 2.9. Compared to

37

the baseline, ED reduction is 14.5% and 18.4% for Homo-V

th

and Hetero-V

th

, respec-

tively. If power gating is applied, ED can be reduced by about 60% for both Homo-

V

th

+G and Hetero-V

th

+G. The similar ED reduction for power gating classes is due to

the fact that leakage power is greatly reduced by power-gating and therefore the more

detailed V

th

tuning such as heterogeneous-V

th

has a small impact on power reduction

as discussed before. We also see that, compared to the min-ED hyper-architectures

without power gating, the min-ED hyper-architectures with power gating has a lower

V

th

. This is because leakage power is greatly reduced when power gating is applied,

therefore a lower V

th

can improve performance without much penalty on leakage.

hyper-architecture.Class V

dd

(V) CV

t

(V) IV

th

(V) (N, k) E(nJ) D (ns) ED (nJ· ns) Area %

Baseline 0.9 0.30 0.30 (8,4) 1.20 23.5 28.2 100.00

Homo-V

th

1.0 0.30 0.30 (10,4) 1.37 17.5 24.1 (-14.5%) 81.90

Hetero-V

th

0.9 0.25 0.30 (12,4) 1.27 18.1 23.0 (-18.4%) 79.52

Homo-V

th

+G 0.9 0.25 0.25 (12,4) 0.74 16.10 11.9 (-57.8%) 127.43

Hetero-V

th

+G 0.8 0.25 0.20 (10,4) 0.65 17.30 11.2 (-60.3%) 126.19

Table 2.9: Comparison between baseline and min-ED hyper-architecture in Homo-V

th

,

Hetero-V

th

, Homo-V

th

+G, and Hetero-V

th

+G.

2.4.4 ED and Area Tradeoff

In the previous sections, we assume ﬁxed sleep transistor sizes for Homo-V

th

and

Hetero-V

th

+G and discuss hyper-architecture evaluation to minimize ED without con-

sidering area. However, area is important for FPGA design. Power-gating using sleep

transistors may change delay and area tradeoff for FPGA architecture. Usually, the

larger the sleep transistor size, the smaller the delay is. In this section, we perform

device and architecture co-optimization to achieve the best ED and area tradeoff. Al-

though dual-V

th

may change the layout area due to extra diffusion well area, such

change depends on technology and is often very small. Therefore, in this section, we

38

assume that V

dd

and V

th

change does not affect area.

For Homo-V

th

and Hetero-V

th

, because no power gating is applied, we do not need

to tune sleep transistor size. To achieve the best ED-area tradeoff, we ﬁnd out the

hyper-architectures with minimum product of energy, delay, and area (in short AED),

which are summarized in Table 2.10. Compared to the baseline, the min-AED hyper-

architecture of Homo-V

th

reduces ED by 14.7% and area by 18.3%,

7

and the min-AED

hyper-architecture of Hetero-V

th

reduces ED by 18.4% and area by 23.3%. Figure 2.8

presents the chip-level ED and area tradeoff. We prune inferior solutions with both

ED and area larger than any alternative solutions. From the ﬁgure, we see that, for the

classes without power gating, the min-ED hyper-architecture is exactly the min-AED

hyper-architecture. This is because that when no power gating is applied, the larger the

area, the more leakage power is consumed. Therefore, the min-ED hyper-architectures

use less area than other hyper-architectures.

V

dd

(V) CV

t

(V) IV

th

(V) (N,K) S ED Area AED AED reduction %

Baseline 0.9 0.30 0.30 (8,4) - 1 1 - -

Homo-V

th

1.0 0.30 0.30 (10,4) - 0.853 0.817 0.697 30.3

Hetero-V

th

0.9 0.25 0.30 (12,4) - 0.816 0.767 0.626 37.4

Homo-V

th

+G 0.9 0.25 0.25 (12,4) 2 0.470 0.918 0.432 56.9

Hetero-V

th

+G 0.9 0.25 0.20 (12,4) 2 0.450 0.918 0.413 58.7

Table 2.10: Minimum ED-area product hyper-architectures for different classes. ED,

Area, and ED-area product are normalized with respect to the baseline.

For Homo-V

th

+G and Hetero-V

th

+G with power gating, the sleep transistor size

has to be considered. Because only one sleep transistor is used for one logic block,

as illustrated in Figure 2.2(d), we assume the 210X PMOS for the sleep transistor

with negligible area overhead. Moreover, we observe that a 1X PMOS as the sleep

7

In our evluation, we assume that the area depends only on architecture and but not device setting.

Our experimental result shows that N = 12, K = 4 is most area efﬁcient architecture, this architecture

reduces area by 23.6% compared to the baseline (N =8, K =4). The architecture N =10, K =4 is the

second area efﬁcient architecture which reduces area by 18.3% compared to the baseline.

39

transistor for one switch in connection box (see the circuit in Figure 2.2(c)) provides

good performance, and any further increase of the sleep transistor size cannot improve

the performance much.

The sleep transistors for the switches in the routing box, however, may affect delay

greatly. We consider four sleep transistor sizes: 2X, 4X, 7X, and 10X PMOS for a rout-

ing switch. From the Figure 2.8, we ﬁnd that device and architecture co-optimization

can reduce ED and area simultaneously even when power gating is applied. This

is due to the fact that tuning device setting and architecture offers a bigger solution

space to explore in chip level.

8

Table 2.10 summarizes the minimum AED prod-

uct hyper-architectures. Compared to the baseline case, the minimum AED product

hyper-architecture of Homo-V

th

+G reduces ED by 53.0% and area by 8.2% and the

minimumAED product hyper-architecture in Hetero-V

th

+G reduces ED by 55.0% and

area by 8.2%.

2.4.5 Impact of Utilization Rate

In the previous part of this section, we assume ﬁxed utilization rate (0.5). In this

subsection, we will further discuss the impact of utilization rate on FPGA architec-

ture evaluation. We compare the min-ED and min-AED hyper-architectures under

three different utilization rates: 0.3, 0.5, and 0.8 in Tables 2.11 and 2.12. We see that

the min-ED and min-AED hyper-architectures under different utilization rates are the

same. Therefore, we conclude that utilization rate in practice does not affect hyper-

architecture evaluation. We guess that the reason why the utilization rate does not

affect hyper-architecture evaluation is as follows: In our models the performance and

8

As discussed before, the most area efﬁcient architectures (N = 12 and K = 4) reduces area by

about 20% compared to the baseline. The area increase introduced by power gating leads to only about

15% area increase. Therefore, for the minimum AED hyper-architecture of the power gating classes

consumes less area compared to the baseline.

40

dynamic power of the circuit under different utilization rate is the same,

9

and the only

difference is leakage power. For the classes with power gating, the leakage power is de-

termined by the active elements and the leakage power of the unused circuit elements is

negligible (as we can see from Table 2.11, in classes Homo-V

th

+G and Hetero-V

th

+G,

ED under different utilization rate is very close). When no power gating is applied,

for given benchmark circuits, leakage power is the dominant power component and

changes inversely proportional to the utilization rate. Therefore the relationship be-

tween the leakage power of different hyper-architectures does not change with respect

to the change of the utilization rate.

Utilization 0.3 0.5 0.8

rate V

dd

CV

th

IV

th

(N, k) ED V

dd

CV

th

IV

th

(N, k) ED V

dd

CV

th

IV

th

(N, k) ED

Homo-V

th

1.0 0.30 0.30 (10,4) 32.0 1.0 0.30 0.30 (10,4) 24.1 1.0 0.30 0.30 (10,4) 19.4

Hetero-V

th

0.9 0.25 0.30 (12,4) 31.5 0.9 0.25 0.30 (12,4) 23.0 0.9 0.25 0.30 (12,4) 18.1

Homo-V

th

+G 0.9 0.25 0.25 (12,4) 12.3 0.9 0.25 0.25 (12,4) 11.9 0.9 0.25 0.25 (12,4) 11.8

Hetero-V

th

+G 0.8 0.25 0.20 (10,4) 11.6 0.8 0.25 0.20 (10,4) 11.2 0.8 0.25 0.20 (10,4) 11.0

Table 2.11: Min-ED hyper-architecture under different utilization rates.

Utilization 0.3 0.5 0.8

rate V

dd

CV

th

IV

th

(N, k) S AED V

dd

CV

th

IV

th

(N, k) S AED V

dd

CV

th

IV

th

(N, k) S AED

Homo-V

th

1.0 0.30 0.30 (10,4) - 0.649 1.0 0.30 0.30 (10,4) - 0.697 1.0 0.30 0.30 (10,4) - 0.698

Hetero-V

th

0.9 0.25 0.30 (12,4) - 0.637 0.9 0.25 0.30 (12,4) - 0.626 0.9 0.25 0.30 (12,4) - 0.617

Homo-V

th

+G 0.9 0.25 0.25 (12,4) 2 0.330 0.9 0.25 0.25 (12,4) 2 0.432 0.9 0.25 0.25 (12,4) 2 0.533

Hetero-V

th

+G 0.8 0.25 0.20 (10,4) 2 0.324 0.8 0.25 0.20 (10,4) 2 0.413 0.8 0.25 0.20 (10,4) 2 0.510

Table 2.12: Min-AED hyper-architecture under different utilization rates. Note: AED

is normalized with respect to the baseline.

2.4.6 Impact of Interconnect Structure

In the previous discussion, we always assume all the wire segments span four logic

blocks (i.e. length-4 interconnect wires). In this section, we will compare two routing

9

We assume the placement and route is the same for different utilization rate.

41

structures to show the impact of routing structure on delay and power. We compare a

structure with uniform length-4 wire segments (in short uniform-interconnect) and a

routing structure with 60% length-4 wire segments and 40% length-8 wire segments

(in short mixed-interconnect). The mixed-interconnect is optimal for minimizing delay

[47]. Tables 2.13, 2.14, and 2.15 compare the min delay, min energy and min ED

hyper-architectures between the two routing structures, respectively. We see that the

min-delay hyper-architectures for both interconnect structures are same. The min-

energy and min-ED hyper-architectures for the two routing structure have the same the

device setting and LUT size, but mixed-interconnect tends to use smaller cluster size

as the interconnect delay is reduced in mixed-interconnect. We also see that uniform-

interconnect has lower energy but higher delay than mixed-interconnect. This is due to

the fact that mixed-interconnect applies length-8 wire-segments and therefore a buffer

with a larger size is used, which reduces delay but increases power.

uniform-interconnect

hyper-architecture V

dd

CV

th

IV

th

(N,K) Energy Delay

Class (V) (V) (V) (nJ) (ns)

Homo-V

th

1.1 0.20 0.20 (6,7) 31.22 8.86

Hetero-V

th

1.1 0.20 0.20 (6,7) 31.22 8.86

Homo-V

th

+G 1.1 0.20 0.20 (6,7) 15.98 9.45

Hetero-V

th

+G 1.1 0.20 0.20 (6,7) 15.98 9.45

mixed-interconnect

Homo-V

th

1.1 0.20 0.20 (6,7) 35.56 8.42

Hetero-V

th

1.1 0.20 0.20 (6,7) 35.56 8.42

Homo-V

th

+G 1.1 0.20 0.20 (6,7) 16.25 9.13

Hetero-V

th

+G 1.1 0.20 0.20 (6,7) 16.25 9.13

Table 2.13: Min-delay hyper-architecture under different routing structures.

42

uniform-interconnect

hyper-architecture V

dd

CV

th

IV

th

(N,K) Energy Delay

Class (V) (V) (V) (nJ) (ns)

Homo-V

th

0.8 0.35 0.35 (10,4) 0.942 59.2

Hetero-V

th

0.8 0.30 0.35 (12,4) 0.920 43.6

Homo-V

th

+G 0.8 0.30 0.30 (12,4) 0.550 30.5

Hetero-V

th

+G 0.8 0.30 0.25 (10,4) 0.549 24.3

mixed-interconnect

Homo-V

th

0.8 0.35 0.35 (8,4) 1.052 55.6

Hetero-V

th

0.8 0.35 0.35 (8,4) 1.052 55.6

Homo-V

th

+G 0.8 0.30 0.30 (12,4) 0.610 29.3

Hetero-V

th

+G 0.8 0.30 0.25 (8,4) 0.602 22.8

Table 2.14: Min-energy hyper-architecture under different routing structures.

uniform interconnect

hyper-architecture V

dd

CV

th

IV

th

(N,K) Energy Delay ED

Class (V) (V) (V) (nJ) (ns) (nJ·ns)

Homo-V

th

1.0 0.30 0.30 (10,4) 1.37 17.50 24.1

Hetero-V

th

0.9 0.25 0.30 (12,4) 1.27 18.10 23.0

Homo-V

th

+G 0.9 0.25 0.25 (12,4) 0.74 16.10 11.9

Hetero-V

th

+G 0.8 0.25 0.20 (10,4) 0.65 17.30 11.2

mixed-interconnect

Homo-V

th

1.0 0.30 0.30 (8,4) 1.51 17.00 25.7

Hetero-V

th

0.9 0.25 0.30 (12,4) 1.39 18.40 25.6

Homo-V

th

+G 0.9 0.25 0.25 (8,4) 0.82 16.40 13.4

Hetero-V

th

+G 0.8 0.25 0.20 (8,4) 0.74 16.67 12.3

Table 2.15: Min-ED hyper-architecture under different routing structures.

43

0

8

16

24

32

8 21 34 47 60

Delay (ns)

E

n

e

r

g

y

p

e

r

c

y

c

l

e

(

n

J

)

Homo-Vt

Hetero-Vt

Min-ED

hyper-arch for

Homo-Vt

Min-ED

hyper-arch

for Hetero-Vt

Min delay

hyper-arch

for Homo-Vt

Min energy

hyper-arch

for Homo-Vt

Min energy

hyper-arch

for Hetero-Vt

(a)

0

3

6

9

12

15

18

8 13 18 23 28 33

Delay (ns)

E

n

e

r

g

y

p

e

r

c

y

c

l

e

(

n

J

)

Homo-Vt+G

Hetero-Vt+G

㋏߫5

㋏߫6

㋏߫7

Min-ED

hyper-arch for

Homo-Vt+G

Min-ED

hyper-arch

for Hetero-

Min delay hyper-

arch for Homo-

Vt+G and Hetero-

Min energy

hyper-arch for

Homo-Vt

Min energy

hyper-arch for

Hetero-Vt+G

(b)

Figure 2.7: Dominant hyper-architectures. (a) Homo-V

th

and Hetero-V

th

; (b)

Homo-V

th

+G and Hetero-V

th

+G.

44

0.7

0.9

1.1

1.3

0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0

Normalized ED

N

o

r

m

a

l

i

z

e

d

a

r

e

a

Homo-Vt

Hetero-Vt

Homo-Vt+G

Hetero-Vt+G

Min AED

hyper-arch for

Homo-Vt+G

Min AED

hyper-arch for

Hetero-Vt+G

Min AED

hyper-arch for

Hetero-Vt

Min AED

hyper-arch for

Homo-Vt

Figure 2.8: ED and area trade-off. ED and area are normalized with respect to the

baseline (N = 8, K = 4, V

dd

= 0.9V, and V

th

= 0.3V).

45

2.5 Conclusions

In this chapter, we have developed trace-based power and performance evaluation

(Ptrace) for FPGA. The one-time use of placement, routing and cycle-accurate power

simulation is applied to collect the timing and power trace for a given benchmark set

and a given FPGA architecture. The trace can then be re-used to calculate timing and

power via closed-form formulae for different device parameters and technology scal-

ing. Ptrace is much faster, yet accurate compared to the conventional evaluation based

on placement and routing by VPR [18] followed by cycle-accurate simulation (Psim)

[89].

Using the trace-based estimation, we have performed device (V

dd

, V

th

and sleep

transistor size if power gating is applied) and architecture (cluster and LUT size) co-

optimizations for low power FPGAs. We assume the ITRS [66] 70nm technology

and use the following baseline for comparison: Cluster size of 8 and LUT size of 4

as in the Xilinx Virtex-II [170], V

dd

of 0.9v suggested by ITRS, V

th

of 0.3v which is

optimized for min-ED (i.e., minimum energy delay product) with respect to the above

architecture and V

dd

. Compared to the baseline case, simultaneous optimization of

FPGA architecture and device reduces the min-ED by 14.7% and area by 18.3% for

FPGA using homogeneous-V

th

for the logic blocks and interconnects without power

gating. Optimizing V

th

separately (i.e., heterogeneous-V

th

) for the logic block and

interconnect reduces min-EDby 18.4%and area by 23.3%. Furthermore, power gating

unused logic and interconnect reduces the min-ED by up to 55.0% and reduces area by

8.2%. Compared to the classes without power gating, the min-ED hyper-architectures

of the classes with power gating have lower V

th

. This is due to the fact that, when

power gating is applied, leakage power is signiﬁcantly reduced and therefore a lower

V

th

can be applied to reduce delay.

We observe that min-ED hyper-architectures and min-AED hyper-architectures for

46

different utilization rate (between 30% and 80%) are the same. Therefore, the utiliza-

tion rate in practice does not affect the device and architecture co-optimization result.

Moreover, we also test two different routing structures, one with uniform length 4 wire

segments (uniform-interconnect) and the other with 60% length 4 wire segments and

40% length 8 wire segments (mixed-interconnect). We observe that the min-ED, min-

energy and min-delay hyper-architectures under two different interconnect structures

are similar, except that mix-interconnect tends to use slightly smaller cluster size due

to the reduced interconnect delay.

47

CHAPTER 3

FPGA Device and Architecture Co-Optimization

Considering Process Variation

Process variations in nanometer technologies are becoming an important issue for

cutting-edge FPGAs with a multi-million gate capacity. In this chapter, we extend

Ptrace to handle process variation. We ﬁrst develop closed-form models of chip level

FPGA leakage and timing variations considering both die-to-die and within-die varia-

tions in effective channel length, threshold voltage, and gate oxide thickness. Experi-

ments show that the mean and standard deviation computed by our models are within

3% from those computed by Monte Carlo simulation. We also observe that the leakage

and delay variations can be up to 5.5X and 1.9X, respectively. We then derive analyti-

cal yield models considering both leakage and timing variations, and use such models

to evaluate FPGA device and architecture considering process variations. Compared to

the baseline, which uses the VPR architecture and device setting from ITRS roadmap,

device and architecture tuning improves leakage yield by 10.4%, timing yield by 5.7%,

and leakage and timing combined yield by 9.4%. We also observe that LUT size of 4

gives the highest leakage yield, LUT size of 7 gives the highest timing yield, but LUT

size of 5 achieves the maximum leakage and timing combined yield.

48

3.1 Introduction

In the previous chapter, we have introduced a time efﬁcient trace-based power and

delay estimator for FPGA circuit. However, such estimator is only for determinis-

tic values and does not consider process variation. Modern VLSI designs see a large

impact from process variation as devices scale down to nanometer technologies. Vari-

ability in effective channel length, threshold voltage, and gate oxide thickness incurs

uncertainties in both chip performance and power consumption. For example, mea-

sured variation in chip-level leakage can be as high as 20X compared to the nominal

value for high performance microprocessors [25]. In addition to meeting the perfor-

mance constraint under timing variation, dice with excessively large leakage due to

such a high variation have to be rejected to meet the given power budget.

Recent work has studied parametric yield estimation for both timing and leakage

power. Statistical timing analysis considering path correlation was studied in [120]

[86, 178] and [180, 33] further introduced non-Gaussian variation and non-linear vari-

ation models. Timing yield estimation was discussed in [57, 53] and [127] purposed a

methodology to improve timing yield. As device scale down, leakage power becomes a

signiﬁcant component of total power consumption and it is greatly affected by process

variation. [129, 187, 152] studied the parametric yield considering both leakage and

timing variations. Power minimization by gate sizing and threshold voltage assign-

ment under timing yield constrains were studied in [114]. However, all these studies

only focus on ASIC and do not consider FPGA.

FPGA has a great deal of regularity, therefore process variation may have smaller

impact on FPGAs than on ASICs. Yet the parametric yield for FPGAs still should be

studied. Some recent works [43, 100, 115, 136] has studied the impact of process vari-

ation and presented various statistical optimization methods. However, architecture

and device co-optimization considering process variation has not be studied yet.

49

In this chapter,we ﬁrst develop closed-form formulas of chip level leakage and

timing variations considering both die-to-die and within-die variations. Based on such

model, we extend the Ptrace discussed in Chapter 2 to estimate the power and delay

variation of FPGAs. Experiments showthat the mean and standard deviation computed

by our models are within 3% from those copmuted by Monte Carlo simulation. We

also observe that the leakage and delay variations can be up to 5.5X and 1.5X, respec-

tively. With the extended Ptrace, we perform FPGA device and architecture evaluation

considering process variations. The evaluation requires the exploration of the follow-

ing dimensions: cluster size N, LUT size K,

1

supply voltage V

dd

, and threshold volt-

age V

t

. We deﬁned the combinations of the above parameters as hyper-architecture.

For comparison, we obtain the baseline FPGA hyper-architecture which uses the VPR

architecture model [18] and the same LUT size and Cluster size as the commercial

FPGAs used by Xilinx Virtex-II [170], and device setting from ITRS roadmap[66].

Compared to the baseline, device and architecture tuning improves leakage yield by

4.8%, timing yield by 12.4%, and leakage and timing combined yield by 9.2%. We

also observe that LUT size of 4 gives the highest leakage yield, LUT size of 7 gives

the highest timing yield, but LUT size of 5 achieves the maximum leakage and timing

combined yield.

The rest of the chapter is organized as follows: Section 3.2 derives closed-form

models for leakage and delay variations and develops the leakage and timing yield

models. Section 3.3 perform device and architecture evaluation to improve yield rate.

Finally, Section 3.4 concludes the chapter.

1

In this chapter, N refers to cluster size and K refers to LUT size

50

3.2 Delay and Leakage Variation Model

In this chapter, we consider the variation in gate channel length (L

gate

), threshold volt-

age (V

th

), and gate oxide thickness (T

ox

). According to [190], spatial correlation is not

signiﬁcant. Therefore, in this chapter, we assume each variation source is decomposed

into global (inter-die) variation and local (intra-die) variation as follows:

L = L

g

+L

l

(3.1)

V =V

g

+V

l

(3.2)

T = T

g

+T

l

(3.3)

where L, V, and T are variations of L

gate

, V

th

, and T

ox

respectively, L

g

, V

g

, and T

g

are

inter-die variations, and L

l

, V

l

, and T

l

are intra-die variations. In the rest of this chapter,

we assume both inter-die (L

g

, V

g

, and T

g

) and intra-die (L

l

, V

l

, and T

l

) variations are

normal random variables. And we also assume that inter-die variation and intra-die

variation are independent, and all variation sources are also independent.

3.2.0.1 Leakage under Variation

We extend the leakage model in FPGA power and delay estimation framework Ptrace

[38] to consider variations. In Ptrace, the total leakage current of an FPGA chip is

calculated as follows:

I

chi p

=

∑

i

N

t

i

· I

i

(3.4)

where N

t

i

is the number of FPGA circuit elements of resource type i, i.e., an intercon-

nect switch, buffer, LUT, conﬁguration SRAM cell, or ﬂip-ﬂop, and I

i

is the leakage

current of a type i circuit element. Different sizes of interconnect switches and buffers

are considered as different circuit elements.

The leakage current I

i

of a type i circuit element is the sum of the sub-threshold

51

and gate leakages:

I

i

= I

sub

+I

gate

(3.5)

Variation in I

sub

mainly sources fromvariation in L

gate

andV

th

. Variation in I

gate

mainly

sources from variation in T

ox

.

Different from [129] which models sub-threshold leakage and gate leakage sepa-

rately, we model the total leakage current I

i

of circuit element in resource type i as

follows:

I

i

= I

n

(i) · e

f

Li

(L)

· e

f

Vi

(V)

· e

f

Ti

(T)

(3.6)

where I

n

(i) is the nominal value of the leakage current of type i circuit element and f

is the function that represents the impact of each type of process variation on leakage.

The dependency between these functions has been shown to be negligible in [129].

From MASTAR4 model [68], we ﬁnd that it is sufﬁcient to express these functions as

simple linear functions as follows:

f

Li

(L) = −c

i1

· L (3.7)

f

Vi

(V) = −c

i2

· V (3.8)

f

Ti

(T) = −c

i3

· T (3.9)

where c

i1

, c

i2

, c

i3

are ﬁtting parameters obtained from MASTAR4 model. The negative

sign in the exponent indicates that the transistors with shorter channel length, lower

threshold voltage, and smaller oxide thickness lead to higher leakage current. We

rewrite (3.6) as follows by decomposing L, V and T in to intra-die (L

l

,V

l

, T

l

) and inter-

die (L

g

,V

g

, T

g

) components.

I

i

= I

n

(i) · e

−(c

i1

L

g

+c

i2

V

g

+c

i3

T

g

)

· e

−(c

i1

L

l

+c

i2

V

l

+c

i3

T

l

)

(3.10)

To extend the leakage model (3.4) under variations, we assume that each element

has unique intra-die variations but all elements in one die share the same inter-die

52

variations. Both inter-die and intra-die variations are modeled as normal random vari-

ables. The leakage distribution of a circuit element is a lognormal distribution. The

total leakage is the sum of all lognormals. The state-of-the-art FPGA chip usually has

a large number of circuit elements, therefore the relative random variance of the total

leakage due to intra-die variation approaches zero. Similar to [129], for given inter-die

variations, we apply the Central Limit Theorem and use the sum of mean to approx-

imate the total leakage current. After integration, we can write the expression of the

chip-level leakage as the follows:

I

chi p

≈

∑

i

N

t

i

· E[I

i

|L

g

,V

g

, T

g

]

=

∑

i

N

t

i

S

i

I

L

g

,V

g

,T

g

(i) (3.11)

S

i

= e

(c

i1

σ

L

l

2

+c

i2

σ

V

l

2

+c

i3

σ

T

l

2

)/2

(3.12)

I

L

g

,V

g

,T

g

(i) = I

n

(i)e

−(c

i1

L

g

+c

i2

V

g

+c

i3

T

g

)

(3.13)

where S

i

is the scale factor introduced by intra-die variability in L,V, and T. I

L

g

,V

g

,T

g

(i)

is the leakage as a function of inter-die variations. σ

L

l

, σ

V

l

and σ

T

l

are the variances of

L

l

, V

l

, and T

l

, respectively.

3.2.0.2 Leakage Yield

From (3.11), (3.12), and (3.13), we can see that the chip leakage current is a sum of

log-normal random variables and it can be expressed as follows,

I

chi p

=

∑

i

X

i

(3.14)

X

i

∼Lognormal(log(A

i

), ((c

i1

σ

L

g

)

2

+(c

i2

σ

V

g

)

2

+(c

i3

σ

T

g

)

2

)) (3.15)

A

i

= N

i

I

n

(i) (3.16)

Same as [129], we model I

chi p

, the sum of the lognormal variables X

i

, as another log-

normal random variable. The lognormal variable X

i

shares the same random variables

53

σ

L

g

, σ

V

g

, and σ

T

g

, and therefore these variables are dependent of each other. Consider-

ing the dependency, we calculate the mean and variance of the new lognormal I

chi p

as

follows,

µ

I

chip

=

∑

i

{exp[log(A

i

) +

(c

i1

σ

L

g

)

2

2

+

(c

i2

σ

V

g

)

2

2

+

(c

i3

σ

T

g

)

2

2

]} (3.17)

σ

2

I

chip

=

∑

i

{exp[2log(A

i

) +(c

i1

σ

L

g

)

2

+(c

i2

σ

V

g

)

2

+(c

i3

σ

T

g

)

2

]

·[exp(c

ii

2

σ

2

L

g

+c

i2

2

σ

2

V

g

+c

2

i3

σ

2

T

g

) −1]}+

∑

i, j

2COV(X

i

, X

j

) (3.18)

where the mean of I

chi p

, µ

I

chip

, is the sum of means of X

i

and the variance of I

chi p

, σ

I

chip

,

is the sum of variance of X

i

and the covariance of each pair of X

i

. The covariance is

calculated as follows,

COV(X

i

, X

j

) = E[X

i

X

j

] −E[X

i

]E[X

j

] (3.19)

E[X

i

X

j

] = exp[log(A

i

A

j

) +

(c

i1

+c

j2

)

2

σ

L

g

2

2

+

(c

i2

+c

j2

)

2

σ

V

g

2

2

+

(c

i3

+c

j3

)

2

σ

T

g

2

2

] (3.20)

E[X

i

] = exp[log(A

i

) +

(c

i1

σ

L

g

)

2

2

+

(c

i2

σ

V

g

)

2

2

+

(c

i3

σ

T

g

)

2

2

] (3.21)

We then use the method from [129] to obtain the mean and variance (µ

N,I

chip

, σ

N,I

chip

2

)

of the normal randomvariable corresponding to the lognormal I

chi p

. As the exponential

function that relates the lognormal variable I

chi p

with the normal variable I

N,chi p

is a

monotone increasing function, the CDF of I

chi p

can be expressed as follows using the

standard expression for the CDF of a lognormal random variable,

µ

N,I

chip

=

log[µ

I

chip

4

/(µ

I

chip

2

+σ

I

chip

2

)]

2

(3.22)

σ

N,I

chip

2

= log[1+(σ

I

chip

2

/µ

I

chip

2

)] (3.23)

CDF(I

chi p

) =

1

2

[1+er f (

log(I

chi p

) −µ

N,I

chip

√

2σ

N,I

chip

)] (3.24)

where er f (·) is the error function. Given a leakage limit I

cut

for I

chi p

,

Y

leak

=CDF(I

cut

) ×100% (3.25)

54

gives the leakage yield rate Y

leak

(I

cut

|L

g

), i.e., the percentage of FPGA chips that is

smaller than I

cut

.

3.2.0.3 Timing under Variation

The performance depends on L

gate

, V

th

, and T

ox

, but its variation is primarily affected

by L

gate

andV

th

variation [129]. Belowwe extend the delay model in Ptrace to consider

inter-die and intra-die variations of L

gate

. In Ptrace, the path delay is calculated as

follows:

D =

∑

i

d

i

(3.26)

where d

i

is the delay of the i

th

circuit element in the path. Considering process varia-

tion,the path delay is calculated as follows:

D =

∑

i

d

i

(L

g

, L

l

,V

g

,V

l

) (3.27)

For circuit element i in the path, d

i

(L

g

, L

l

,V

G

,V

l

) is the delay considering inter-die

variation L

g

, V

g

and intra-die variation L

l

, V

l

. L

g

and V

g

the same for all the circuit

elements in the critical path. Given L

g

and V

g

, we evenly sample a few (eleven in this

chapter) points within range of [L

g

−3σ

L

l

, L

g

+3σ

L

l

]. We then use circuit level delay

model in [39] to obtain the delay for each circuit element with these variations. As

the delay monotonically decreases when L

gate

and V

th

increase, we can directly map

the probability of a channel length to the probability of a delay and obtain the delay

distribution of a circuit element. We assume that the intra-die channel length and

threshold voltage variation of each element is independent from each other. Therefore,

we can obtain the PDF (probability density function) of the critical path delay for a

given L

g

and V

g

as follows by convolution operation,

PDF(D|L

g

,V

g

) = PDF(d

1

|L

g

,V

g

) ⊗PDF(d

2

|L

g

,V

g

) ⊗· · ·

⊗PDF(d

i

|L

g

,V

g

) ⊗· · · ⊗PDF(d

n

|L

g

,V

g

) (3.28)

55

3.2.0.4 Timing Yield

The timing yield is calculated on a bin-by-bin basis where each bin corresponds to a

speciﬁc value L

g

and V

g

. We further consider intra-die variation of channel length in

timing yield analysis. Given the inter-die channel length variation L

g

, and threshold

voltage variation V

g

, (3.28) gives the PDF of the critical path delay D of the circuit. We

can obtain the CDF of delay, CDF(D|L

g

,V

g

), by integrating PDF(D|L

g

,V

g

). Given a

cutoff delay (D

cut

), CDF(D

cut

|L

g

) gives the probability that the path delay is smaller

than D

cut

considering L

gate

and V

th

variations. However, it is not sufﬁcient to only

analyze the original critical path in the absence of process variations. The close-to-be

critical paths may become critical considering variations and an FPGA chip that meets

the performance requirement should have the delay of all paths no greater than D

cut

.

We assume that for a given L

g

the delay of each path is independent and we can

calculate the timing yield as follows,

Y

per f

(D

cut

|L

g

,V

g

) =

n

∏

i=1

CDF

i

(D

cut

|L

g

,V

g

) (3.29)

where CDF

i

(D

cut

|L

g

,V

g

) gives the probability that the delay of the i

th

longest path

is no greater than D

cut

. In this chapter, we only consider the ten longest paths, i.e.,

n = 10 because the simulation result shows that the ten longest paths have already

covered all the paths with a delay larger than 75% of the critical path delay under the

nominal condition. We then integrate Y

per f

(D

cut

|L

g

,V

g

) over L

g

and V

g

to calculate the

performance yield Y

per f

as follows,

Y

per f

=

Z Z

+∞

−∞

PDF(L

g

)PDF(V

g

) · Y

per f

(D

cut

|L

g

,V

g

) · dL

g

dV

g

(3.30)

3.2.0.5 Leakage and Timing Combined Yield

To analyze the yield of a lot, we need to consider both leakage and delay limit. In order

to compute the leakage and delay combined yield, we ﬁrst need to calculate the leakage

56

yield for a given inter-die variation of gate channel length L

g

and threshold voltage

V

g

, Y

leak|L

g

,V

g

. Similar to Section 3.2.0.2, we ﬁrst calculate the mean and variance of

leakage current for given L

g

and V

g

,

µ

I

chip|L

g

,V

g

=

∑

i

{exp[log(

¯

A

i|L

g

,V

g

) +

(c

i3

σ

T

g

)

2

2

]} (3.31)

σ

2

I

chip|L

g

,V

g

=

∑

i

{exp[2log(

¯

A

i|L

g

,V

g

) +(c

i3

σ

T

g

)

2

]

·[exp(c

2

i3

σ

2

T

g

) −1]}+

∑

i, j

2COV(

¯

X

i|L

g

,V

g

,

¯

X

j|L

g

,V

g

)

(3.32)

where

¯

A

i|L

g

,V

g

= A

i

· exp(−c

i1

L

g

−c

i2

V

g

) (3.33)

¯

X

i|L

g

,V

g

∼ Lognormal(

¯

A

i|L

g

,V

g

, (c

i3

σ

T

g

)

2

) (3.34)

Similar to X

i

’s, the covariance between

¯

X

i|L

g

,V

g

’s are computed as:

COV(

¯

X

i|L

g

,V

g

,

¯

X

j|L

g

,V

g

) = E[

¯

X

i|L

g

,V

g

·

¯

X

j|L

g

,V

g

] −E[

¯

X

i|L

g

,V

g

]E[

¯

X

j|L

g

,V

g

] (3.35)

E[

¯

X

i|L

g

,V

g

·

¯

X

j|L

g

,V

g

] = exp[log(

¯

A

i|L

g

,V

g

·

¯

A

j|L

g

,V

g

) +

(c

i3

+c

j3

)

2

σ

T

g

2

2

] (3.36)

E[X

i

] = exp[log(

¯

A

i|L

g

,V

g

) +

(c

i3

σ

T

g

)

2

2

] (3.37)

Finally, the CDF of leakage current for given L

g

and V

g

, I

leak|L

g

,V

g

, is calculated as:

µ

N,I

chip|L

g

,V

g

=

log[µ

4

I

chip|L

g

,V

g

/(µ

2

I

chip|L

g

,V

g

+σ

2

I

chip|L

g

,V

g

)]

2

(3.38)

σ

N,I

chip|L

g

,V

g

2

= log[1+(σ

2

I

chip|L

g

,V

g

/µ

2

I

chip|L

g

,V

g

)] (3.39)

CDF(I

chi p

|L

g

,V

g

) =

1

2

[1+er f (

log(I

chi p

) −µ

N,I

chip|L

g

,V

g

√

2σ

N,I

chip

)] (3.40)

With the CDF of I

leak|L

g

,V

g

, it is easy to compute the leakage yield for given L

g

and

V

g

,

Y

leak|L

g

,V

g

=CDF(I

cut

|L

g

,V

g

) ×100% (3.41)

57

MC sim Our model

Y

leak

% Y

per f

% Y

com

% Y

leak

% Y

per f

% Y

com

%

90.2 74.1 64.4 89.3 (-0.9) 72.6 (-1.5) 62.1 (-2.1)

Table 3.1: Veriﬁcation of yield model.

Because for given a speciﬁc inter-die variation of channel length L

g

and threshold

voltage variation V

g

, the leakage variability only depends on the variability of random

variable T

g

as shown in (3.31), and the timing variability only depends on the variabil-

ity of random variable L

l

and V

l

as shown in (3.29). Therefore, we assume that the

leakage yield and timing yield are independent of each other for given L

g

and V

g

. The

yield considering the imposed leakage and timing limit can be calculated as follows,

Y

com

=

Z Z

+∞

−∞

PDF(L

g

)PDF(V

g

)Y

leak

(I

cut

|L

g

,V

g

)Y

per f

(D

cut

|L

g

,V

g

) · dL

g

dV

g

(3.42)

3.2.0.6 Veriﬁcation of Yield Model

In this section, we verify our yield model by comparing it to 10,000 sample Monte-

Carlo simulation. In our experiment, we assume 70nm ITRS technology and use de-

vice and architecture same as the baseline hyper-architecture as in Section 2.4. We also

assume that all 20 MCNC benchmarks are put into one FPGA chip. The cut of leak-

age power is 2X of the nominal value and the cut of delay is 1.1X of nominal value.

Table 3.1 compares the yield estimated from our model and that from the Monte-Carlo

simulation. From the table, we see that our yield model is within 3% error compared

to the Monte-Carlo simulation.

3.3 Hyper-Architecture Evaluation Considering Process Variation

In this section, we use our yield model to perform device and architecture evaluation

for leakage and delay yield optimization. We consider ITRS 70nm technology and

58

N K W V

th

(V) V

dd

(V)

Evaluation range 4,5,6,7 6,8,10,12 4 0.2 0.4 0.8 1.1

Baseline 8 4 4 0.3 0.9

Source Distribution 3σ

g

3σ

l

L

gate

Normal 5.0% 3.0%

V

th

Normal 2.5% 1.9%

T

ox

Normal 2.5% 1.9%

Table 3.2: Experimental setting.

change V

dd

and V

th

around such setting. For architecture, we consider LUT size K from

4 to 7, and cluster size N from 6 to 12. For interconnect, we assume that all the global

routing track (W) span 4 logic blocks with all buffer switch box. In the experiment, we

assume that all 20 MCNC benchmarks are put into one chip and obtain the longest 10

critical paths from them. For comparison, we use the same baseline hyper-architecture

as in Section 2.4. For process variation, we assume that all the variation sources has

normal distribution. For all variation sources, we assume that the 3σ value of the inter-

die variation is 10% of nominal value, and the 3σ value of the intra-die variation is 5%

of nominal value.

3.3.1 Impact of Process Variation

In this section, we analyze the impact of process variation on FPGA leakage power

and delay. Figure 3.1 illustrates the leakage and delay variation from Monte-Carlo

simulation for the baseline hyper-arch. In the ﬁgure, each dot is sample of Monte-

Carlo simulation. From the ﬁgure, we can see with process variation, the range of

leakage power is up to 5.5X and the range of delay variation is up to 1.5X.

59

0.8 0.85 0.9 0.95 1 1.05 1.1 1.15 1.2

0

0.5

1

1.5

2

2.5

3

3.5

4

4.5

5

Normalized Delay

N

o

r

m

a

l

i

z

e

d

L

e

a

k

a

g

e

P

o

w

e

r

Cutoff delay

Cutoff power

1.5X

5.5X

Figure 3.1: Leakage and delay of baseline architecture hyper-arch.

3.3.2 Impact of Device and Architecture Tuning

In this section, we perform device and architecture evaluation to optimize the delay

and leakage power yield for two FPGA classes, Homo-V

th

and Hetero-V

th

as deﬁned

in in Section 2.4. In the rest of this section, we assume that the cutoff leakage power

is 2X of the nominal value and the cutoff delay is 1.1X of the nominal value, as shown

in Figure 3.1.

3.3.2.1 Leakage Yield

We ﬁrst optimize leakage yield. Table 3.3 illustrates the hyper-archs with maximum

leakage yield for both two classes.

In the table, CV

th

refers to the V

th

of logic blocks and IV

th

refers to the V

th

of

interconnect. Notice that for Homo-V

th

, CV

th

= IV

th

. From the table, we see that

Homo-L

gate

and Hetero-L

gate

give the same maximum leakage yield result. That is,

the optimum L

gate

for logic blocks and interconnect is the same.

This is because the larger L

gate

gives better leakage yield, both logic block and

interconnect using largest L

gate

(33nm) results in optimum leakage yield. Moreover,

60

N K CV

th

(V) IV

th

(V) V

dd

(V) Y

leak

% Y

per f

% Y

com

%

Baseline 8 4 0.3 0.3 0.9 89.5 85.5 67.1

Homo-V

th

10 4 0.4 0.4 0.9 94.3(+4.8) 79.3 69.1

Hetero-V

th

10 4 0.4 0.4 0.9 94.3(+4.8) 79.3 69.1

Table 3.3: Optimum leakage yield hyper-architecture.

N K CV

th

(V) IV

th

(V) V

dd

(V) Y

leak

% Y

per f

% Y

com

%

Baseline 8 4 0.3 0.3 0.9 89.5 85.5 67.1

Homo-V

th

6 7 0.2 0.2 1.1 63.5 97.9(+12.4) 57.9

Hetero-V

th

6 7 0.2 0.2 1.1 67.6 97.9(+12.4) 57.9

Table 3.4: Optimum Timing yield hyper-architecture.

we can also ﬁnd that K = 4 gives the optimum leakage yield, which improve leakage

yield by 4.8% compared to the baseline.

3.3.2.2 Timing Yield

Secondly, we analyze the timing yield. For timing yield analysis, we only analyze the

delay of the largest MCNC benchmark clma. Similarly, the timing yield is often stud-

ied using selected test circuit such as ring oscillator for ASICin the literature. Table 3.4

illustrates the optimumtiming yield hyper-arch. Similar to leakage yield analysis, both

Homo-L

gate

and Hetero-L

gate

achieve the same hyper-arch for optimum timing yield.

The reason is similar to the leakage yield. That is the smaller V

th

gives better timing

yield, therefore both logic block and interconnect using smallest V

th

(0.2V) results in

optimum timing yield. From the table, we can also ﬁnd that the optimum timing yield

hyper-arch has K = 7 and improve timing yield by 12.4% compared to the baseline.

3.3.2.3 Leakage and Timing Combined Yield

Finally, we discuss about leakage and timing combined yield. Table 3.5 illustrates the

optimum leakage and timing combined yield hyper-arch. From the table, we see that

61

N K CV

th

(V) IV

th

(V) V

dd

(V) Y

leak

% Y

per f

% Y

com

%

Baseline 8 4 0.3 0.3 0.9 89.5 85.5 67.1

Homo-V

th

10 5 0.3 0.3 1.0 87.6 86.5 75.2 (+8.1)

Hetero-V

th

8 5 0.35 0.3 1.0 91.6 84.1 76.3 (+9.2)

Table 3.5: Optimum leakage and timing combined yield hyper-architecture.

compared to the baseline, the optimum hyper-arch for Homo-V

th

improves combined

yield by 8.1%and the optimumhyper-arch for Hetero-V

th

improves the combined yield

by 9.2%. Unlike the leakage yield and timing yield analysis, Homo-V

th

and Hetero-V

th

give different result for combined yield optimization. This is because both leakage and

timing should be considered to optimize combined yield, the largest (or smallest) V

th

not necessary gives the optimum combined yield. We also ﬁnd that Hetero-V

th

gives

better result than Homo-V

th

. This is because Hetero-V

th

provides larger search space.

But for both Homo-V

th

and Hetero-V

th

, K = 5 gives the best combined yield.

3.4 Conclusions

In this chapter, we have developed efﬁcient models for chip-level leakage variation

and system timing variation in FPGAs. Experiments show that our models are within

3% from Monte Carlo simulation, and the FPGA chip level leakage and delay varia-

tions can be up to 5.5X and 1.5X, respectively. We have shown that architecture and

device tuning has a signiﬁcant impact on FPGA parametric yield rate. Compared to

the baseline, the optimum hyper-architecture (combination of architecture and device

parameters) improves leakage and timing combined yield by 9.2%. In addition, LUT

size 4 has the highest leakage yield, 7 has the highest timing yield, but LUT size 5

achieves the maximum combined leakage and timing yield.

62

CHAPTER 4

FPGA Concurrent Development of Process and

Architecture Considering Process Variation and

Reliability

In Chapter 2 and Chapter 3, we develop a trace-based framework (Ptrace) to perform

device and architecture co-optimization. In this chapter, we further improve the trace-

based framework (Ptrace2) to enable concurrent process and FPGA architecture co-

development. Applying the new trace-based framework (Ptrace2), the user can tune

eight parameters for bulk CMOS processes and obtain the chip level performance and

power distribution and soft error rate (SER) considering process variations and device

aging. The Ptrace2 framework is efﬁcient as it is based on closed-form formulas. It

is also ﬂexible as process parameters can be customized for different FPGA elements

and no SPICE models and simulations are needed for these elements. Therefore, this

framework is suitable for early stage process and FPGA architecture co-development.

The chapter further presents a few examples to utilize the framework. We show that

applying heterogeneous gate lengths to logic and interconnect may lead to 1.3X delay

difference, 3.1X energy difference, and reduce standard deviation of leakage variation

by 87%. This offers a large room for power and delay tradeoff. We further show that

the device aging has a knee point over time, and device burn-in to reach the point

could reduce the performance change over 10 years from 8.5% to 5.5% and reduce die

to die leakage signiﬁcantly. In addition, we also study the interaction between process

63

variation, device aging and SER. We observe that device aging reduces standard de-

viation of leakage by 65% over 10 years while it has relatively small impact on delay

variation. Moreover, we also ﬁnd that neither device aging due to NBTI and HCI nor

process variation have signiﬁcant impact on SER.

4.1 Introduction

In Chapter 2 and Chapter 3, we proposed a time efﬁcient FPGA power and delay

evaluator Ptrace. However, Ptrace assumes that the processes are mature and stable

device models are available so that SPICE simulations can be carried out to obtain

circuit level power and delay. The assumption that FPGA architecture development

starts only after the process technology is stable may be valid in the past, but it no

longer holds as we begin to develop process and architecture concurrently in order to

shorten the time to market.

In this chapter we further extend the trace-based architecture framework (Ptrace)

discussed in the previous section to consider process parameters directly, therefore we

can conduct FPGA circuit and architecture evaluation when only the ﬁrst order process

parameters are available. Such evaluation may be used to select circuits and architec-

tures less sensitive to process changes or process variations. It may further provide

inputs for process tuning, given that the FPGA is a large volume product and an FPGA

company may convince a foundry to tune process when there are large enough ben-

eﬁts. We call the resulting framework as ptrace2. As illustrated in Figure 4.1, for

performance and power, ptrace2 calculates ﬁrst electrical characteristics of advanced

CMOS transistors, then delay, leakage power, input/output capacitance for FPGA ba-

sic circuit elements, and ﬁnally the chip level performance and power based on trace

similar to that in Chapter 2 and Chapter 3. The new process variation analysis can

handle non-Gaussian variation sources which is ignored by Ptrace.

64

Trace

Cricuit Element Statistics

Critical Path Structure

Switching Activity

Transistor

Model

Chip Level

Leakage Power

Dynamic Power

Delay

Reliability

Soft Error Rate

Device Aging

Process Variation

Power Distrubution

Delay Distribution

Input Output

PTrace2

Reliability

Chip Level

Power and

Delay

Estimation

Variation

Analysis

Circuit Level

Power and

Delay

Estimation

Transistor

Electrical

Charateristics

Figure 4.1: Trace-based estimation ﬂow.

With ptrace2, we incorporate analytical calculations for two types of FPGA relia-

bility, device aging (Negative-Bias-Temperature-Instability, NBTI [11, 149] [160, 23]

and Hot-Carrier-Injection, HCI [34, 149, 166]) and permanent soft error rate (SER)

[60], again in the from device to chip fashion. Furthermore, we illustrate how to use

this framework to improve power and performance by process and FPGA concurrent

development, and to study the interaction between process variation, device aging and

SER.

The rest of this chapter is organized as follows: Section 4.2 presents our imple-

mentation of ITRS device model. Sections 4.3, 4.4 and 4.5 introduce the circuit-

and chip-level power and delay models and chip-level variation models, respectively.

Section 4.6 presents device tuning for power and delay optimization, and Section 4.7

analyzes the reliability for FPGA. Finally, Section 4.8 studies the interaction between

reliability and process variation, and Section 4.9 concludes this chapter.

65

4.2 Device Models

In this chapter, we implement the bulk transistor device model from ITRS 2005 MAS-

TAR4 (Model for Assessment of cmoS Technologies And Roadmaps) tool [67, 68,

148]. MASTAR4 is a computing tool which calculates the electrical characteristics

of advance CMOS transistors

1

. It can handle different technologies including planar

bulk, double gate and silicon on isolator (SOI) while we only consider traditional bulk

transistor in this work. We brieﬂy review the calculation ﬂow in MASTAR4 below.

The calculation in MASTAR4 is based on analytical equations, which directly de-

pends on various major technological parameters including gate length (L

gate

), gate

oxide thickness (T

ox

), channel doping density (N

bulk

), channel width (W), extension

depth (X

jext

) and the total series resistance for source and drain (R

acc

). The output

electrical characteristics include the on current (I

on

), the sub-threshold leakage current

(off current, I

of f

), the gate leakage current when the channel is on/off (I

gon

/I

gof f

), the

gate capacitance (C

g

) and the drain/source diffusion capacitance (C

di f f

). Temperature

(T) and the supply voltage (V

dd

) also have a signiﬁcant impact on these output char-

acteristics. There are also other inputs related to mobility, velocity and gate stack etc.,

as well as some intermediate outputs such as effective mobility u

e f f

which are used

for the ﬁnal output calculation. We follow MASTAR4 tool and only tune the major

process inputs while all the other inputs can be tuned as well if needed.

Figure 4.2 shows the calculation ﬂow for the on current I

on

. I

on

depends on all the

main inputs, i.e. L

gate

, T

ox

, N

bulk

, X

jext

, W, R

acc

, T and V

dd

. The feature intermedi-

ate parameters, the electrical (or effective) gate length (L

elec

) and the electrical gate

oxide thickness (T

ox elec

), are ﬁrst calculated. The saturated threshold voltage (V

th

)

is then calculated considering the short channel effect (SCE) and drain induced bar-

1

The device model in MASTAR4 tool is a predicted model for advanced processes and there is no correspondent SPICE

models for these technologies.

66

Inputs: Lgate Tox Nbulk Xjext W Racc T Vdd

Output: Ion

Intermediate variables: Lelec Tox_elec

Intermediate variables: Vth

Intermediate variables: u_eff

Figure 4.2: The ﬂow for on current, I

on

, calculation.

rier lowering (DIBL). The effective mobility (u

e f f

) is also calculated. With the main

input parameters, main intermediate parameters and all other parameters, I

on

is then

calculated as,

V

gt

= V

gs

−V

th

(4.1)

I

dsat0

= 0.5u

e f f

· C

ox elec

·

W

L

elec

· V

gt

· V

dsat

(4.2)

I

on

=

I

dsat0

1+

2R

acc

I

dsat0

V

gt

−

R

acc

I

dsat0

V

gt

+L

elec

E

c

(1+d)

(4.3)

where I

dsat0

is the on current without considering the series resistance (R

acc

), V

gs

is

the difference between the gate and source voltage levels, V

dsat

is the velocity satura-

tion voltage, E

c

is the critical electrical ﬁeld and d is an intermediate parameter. The

derivation of these intermediate parameters are provided in the ITRS device model

[68].

Figure 4.3 shows the calculation ﬂow for the sub-threshold leakage current I

of f

.

I

of f

depends on all the main inputs except for R

acc

. Similar to I

on

calculation, the

67

Inputs: Lgate Tox Nbulk Xjext W Racc T Vdd

Output: Ioff

Intermediate variables: Lelec Tox_elec

Intermediate variables: Vth Slope

Figure 4.3: The ﬂow for sub-threshold leakage current, I

of f

, calculation.

intermediate feature parameters L

elec

and T

ox elec

, and the saturated threshold voltage

V

th

are ﬁrst calculated. In order to calculate I

of f

, the sub-threshold slope (Slope) is

also calculated as,

Slope =

kT

q

ln

1

0(1+

ε

s

· T

ox elec

ε

ox

· T

dep

) (4.4)

where k is the Boltzmann constant, T is the temperature, q is electric unit, ε

s

and ε

ox

are the dielectric permittivities of silicon and oxide, respectively. With Slope, V

th

and

other parameters, we can then calculate I

of f

as,

I

of f

= I

th

W

L

elec

e

−V

th

/Slope

(4.5)

where I

th

= 0.5uA.

Inputs: Lgate Tox Nbulk Xjext W Racc T Vdd

Outputs: Igon Igoff

Figure 4.4: The ﬂow for gate leakage currents I

gon

and I

gof f

calculation.

Figure 4.4 shows the calculation ﬂow for the gate leakage currents when the chan-

nel is on (I

gon

) and off (I

gof f

), respectively. I

gon

and I

gof f

depend on L

gate

, T

ox

, X

jext

, W

68

Outputs: Cg Cdiff

Inputs: Lgate Tox Nbulk Xjext W Racc T Vdd

Intermediate variables: Tox_elec

Figure 4.5: The ﬂow for transistor gate and diffusion capacitances C

g

and C

di f f

calcu-

lation.

and V

dd

. In order to calculate I

gon

and I

gof f

, we ﬁrst calculate the gate leakage current

density J

g

as,

J

g

= a

1

e

a

2

V

2

g

+a

3

Vg

e

−a

4

T

ox

(4.6)

where a

1

= 1.44E5A/cm

2

, a

2

= −4.02V

−2

, a

3

= 13.05V

−1

and a

4

= 1/(1.17E −

10m). With J

g

, I

gon

and I

gof f

are then calculated as,

I

gon

= L

gate

· W · J

g

(4.7)

ΔL = L

gate

−L

elec

(4.8)

I

gof f

= 0.5ΔL· W · J

g

(4.9)

where ΔL is the difference between L

gate

and L

elec

.

Figure 4.5 shows the calculation ﬂow for transistor gate and drain/source capaci-

tances, C

g

and C

di f f

, respectively. The capacitances depend on L

gate

, T

ox

, W, T and

V

dd

. The intermediate feature parameter T

ox elec

is ﬁrst calculated for capacitance cal-

culation. C

g

and C

di f f

are calculated as,

C

g

= (

ε

ox

T

ox elec

L

gate

+C

total f ringing

+C

overlap

)W (4.10)

C

di f f

=C

overlap

+C

junc

(4.11)

where the C

g

calculation considers gate oxide capacitance, fringing capacitance C

total f ringing

69

and overlap capacitance C

overlap

, and C

di f f

calculation considers C

overlap

and junction

capacitance C

junc

.

4.3 Circuit-Level Delay and Power

In this section, we present basic circuit models for delay and power characteristics us-

ing the device model in Section 4.2. We consider buffers, LUT, SRAM, pass transistor

gate, ﬂip-ﬂop (FF) and multiplexer [18] for the FPGA circuits, where the multiplexer

is implemented as an NMOS pass transistor tree. Essentially, these FPGA circuits can

be further decomposed into net-lists containing the most basic circuit elements, i.e.

inverters and pass transistors. We therefore mainly discuss the power and delay for

these basic circuit elements as below.

4.3.1 Delay Model

The pass transistor and multiplexer tree are modeled as a lumped capacitance, which

is treated as part of the loading capacitance of an inverter. We calculate the inverter

delay based on numerical integration through the transistor IV curves. In MASTAR4

tool, only the calculation for the maximum on current I

on

, i.e. the current in velocity

saturation region, is provided. We use the equations from [74] to calculate the drain-

source current I

ds

in different working regions, i.e. sub-threshold, linear, and saturation

regions. I

ds

is calculated as a function of drain, source, gate and body voltage levels as

following. In the sub-threshold region, i.e. V

gs

<V

th

, I

ds

is calculated as,

I

ds

= I

th

· (W/L

elec

) · e

((V

gs

−V

th

)/Slope)

(4.12)

70

where I

th

is 0.5uA, V

gs

is the gate source voltage difference. In linear region, i.e.

V

gs

>V

th

and V

ds

<V

gs

−V

th

, I

ds

is calculated as,

I

ds

=

W

L

elec

· u

e f f

·

C

ox elec

1+V

ds

/(E

c

· L

elec

)

· (V

gs

−V

th

−V

ds

/2) · V

ds

(4.13)

where C

ox elec

is the gate oxide capacitance, V

ds

is the drain source voltage difference

and E

c

is the critical electrical ﬁeld. C

ox elec

and E

c

are both calculated using equations

in the ITRS MARSTAR4 model. In the saturation region, i.e. V

gs

> V

th

and V

ds

>

V

gs

−V

th

, I

ds

is calculated as,

I

ds

= 0.5·

W

L

elec

· u

e f f

·

C

ox elec

1+V

ds

/(E

c

· L

elec

)

· V

2

ds

(4.14)

In velocity saturation region, i.e. V

gs

>V

th

and V

ds

>V

dsat

, I

ds

is equal to the on current

I

on

calculated by (4.3), where the velocity saturation voltage V

dsat

is calculated using

equations in the MASTAR4 tool. V

th

in the above equations is calculated considering

body-bias, short channel effect (SCE) and drain-induced barrier lowering (DIBL). Us-

ing these equations, Figure 4.7(c) shows the IV curves for an NMOS transistor under

ITRS 2005 HP 32nm technology node.

Vg = Vin

Vd = Vout

Vs = GND

C

Ids(N)

Vb = GND

Vs = Vdd

Vb = Vdd

Ids(P)

Figure 4.6: Voltage and current in an inverter during transition.

71

Given an inverter (see Figure 4.6), its loading capacitance C and the input voltage

waveform, we can then calculate the inverter delay. At time t, we can obtain the

transient PMOS and NMOS drain-source current i

ds

(P) and i

ds

(N), respectively. We

can then perform numerical integration to obtain the output voltage waveform based

on the following equation,

dV(out) = (i

ds

(P) −i

ds

(N)) · dt/C (4.15)

The delay is the time difference between when the output and input voltages reach

0.5V

dd

. We calculate the pull-down and pull-up delay for an inverter and then obtain

the worst case delay as the inverter delay. Note that the input slew rate is automatically

considered in this delay model. The output voltage waveform can be propagated for

delay calculation of the next stage.

ITRS05 HP (32nm) 1X INV, Cload =1fF

0

0.2

0.4

0.6

0.8

1

1.2

0 0.01 0.02 0.03 0.04

time (ns)

v

o

l

t

a

g

e

(

v

)

0

2

4

6

8

10

12

14

d

r

a

i

n

c

u

r

r

e

n

t

(

u

A

)

Vin

Vout (NMOS)

Vout (Inv)

PMOS current

ITRS05 HP (32nm) 1X INV, Cload =1fF

0

0.2

0.4

0.6

0.8

1

1.2

0 0.02 0.04 0.06 0.08 0.1 0.12

time (ns)

v

o

l

t

a

g

e

(

v

)

0

2

4

6

8

10

12

14

d

r

a

i

n

c

u

r

r

e

n

t

(

u

A

)

Vin

Vout (NMOS)

Vout (Inv)

PMOS current

(a) (b) (c)

ITRS05 HP NMOS (32nm) IV curves

0

200

400

600

800

1000

1200

0 0.2 0.4 0.6 0.8 1 1.2

Vds (v)

I

d

s

(

u

A

/

u

m

)

transient

transient II

Vgs=0.42v

Vgs=0.52v

Vgs=1.1v

Vgs=0.61v

Vgs=0.71v

Vgs=0.81v

Vgs=0.91v

Vgs=1.0v

small input

slope

large input

slope

Figure 4.7: The pull-down delay of a 1X inverter at ITRS 2005 HP 32nm technology

nodes. (a) Input and output voltages, and short circuit current with a small input slope;

(b) Input and output voltages, and short circuit current with a large input slope; (c) The

NMOS IV curves and the transition of transient current I

ds

with small and large input

slopes.

Figure 4.7 (a) shows the transient voltage transition of a 1X inverter with 1 f F as

loading capacitance C at ITRS 2005 HP 32nm technology node. The input voltage

transition is from 0 to V

dd

. The input slew rate is deﬁned as the transition time from

72

0.1V

dd

(or 0.9V

dd

) to 0.9V

dd

(or 0.1V

dd

). The short circuit current, i.e. the PMOS

drain-source transient current is also shown in this ﬁgure. If this short circuit current

is ignored, the output voltage waveform (V

out

(NMOS) in the ﬁgure) is almost over-

lapped the the waveform (V

out

(Inv) in the ﬁgure) considering the short circuit current.

Figure 4.7 (b) shows the transient voltage transition of this inverter with a larger input

slope, i.e. 5X large as the that in Figure 4.7 (a). The short circuit current becomes more

signiﬁcant due to a larger input slope and can no longer be ignored, i.e. the output volt-

age waveform without considering short circuit current differs the waveform consider-

ing short circuit current which results in a 20% delay difference. In FPGAs, the input

voltage slope may be large due to a large loading capacitance. Therefore, the short

circuit current is necessary to be considered for delay calculation. Figure 4.7(c) shows

the NMOS transient drain-source current i

ds

transitions with the two input slopes. With

a larger input slope, the transition is slower with a larger delay. While not presented

here, a similar trend for pull-up transition, i.e. input voltage transition is from V

dd

down to 0, is observed.

ITRS MASTAR4 tool uses CV

dd

/I

on

to predict transistor delay. As shown in Fig-

ure 4.7, the input slope has a signiﬁcant impact on the inverter delay. Since the FPGA

circuit element usually has a large loading capacitance and a large input slope, our

delay model is more accurate than that in MASTAR4. In addition, our delay model is

ﬂexible and can be extended to other complex gates easily.

For other FPGA circuit elements with loading capacitance, e.g. LUTs, FFs and

buffers, we ﬁrst breakdown them into the netlist of inverters and pass transistors. The

delay of each circuit element is then calculated using the above method, e.g. the LUT

delay is decomposed into the delay from LUT input passing through input buffers

to the multiplexer control inputs and the delay from the SRAM passing through the

multiplexer and the output buffer.

73

4.3.2 Power Model

In this section, we ﬁrst discuss leakage power including sub-threshold and gate leakage

power and then discuss dynamic power including switching power and short circuit

power for basic circuit elements.

An inverter consumes both sub-threshold and gate leakage power, which depends

on the input logic value. We calculate the average leakage power for an inverter,

P

leak

(inv), as,

P

leak

(inv) =V

dd

· (I

of f

(inv) +I

gate

(inv)) (4.16)

I

of f

(inv) = (I

of f

(P) +I

of f

(N))/2 (4.17)

I

gate

(inv) =

I

gon

(P) +I

gof f

(P) +I

gon

(N) +I

gof f

(N)

2

(4.18)

where I

o

f f (P) and I

of f

(N) are the PMOS and NMOS sub-threshold leakage currents,

respectively, I

gon

(P), I

gof f

(P), I

gon

(N) and I

gof f

(N) are the PMOS and NMOS gate

leakage currents when the channel is on and off, respectively. The two inverters in

an SRAM cell are identical with one input as V

dd

and one input as 0. Therefore, the

average leakage calculation of an inverter can be applied to SRAM leakage calculation.

For an NMOS pass transistor, only gate leakage power is consumed, which can be

either V

dd

· I

gon

(N) or V

dd

· I

gof f

(N) depending on if the channel of this pass transistor

is on or off. The pass transistor in a used/unused routing switch implemented by a

tri-state buffer is on/off. For a multiplexer containing N NMOS pass transistors, N/2

of them are on while the other half are off. Based on the leakage model for inverters

and pass transistors/muxes, we can calculate the leakage power for other FPGA circuit

elements.

We consider switching and short circuit power for inverter dynamic power con-

sumption. For switching power, we calculate the gate (C

g

(inv)) and self-loading ca-

74

pacitances (C

di f f

) for an inverter as,

C

g

(inv) = C

g

(P) +C

g

(N) (4.19)

C

di f f

(inv) = C

di f f

(P) +C

di f f

(N) (4.20)

where C

g

(P) and C

g

(N) are the PMOS and NMOS gate capacitances, respectively,

C

di f f

(P) and C

di f f

(N) are the PMOS and NMOS diffusion capacitances, respectively.

The input capacitance, internal capacitance and self-loading capacitance can then be

easily extracted for each FPGA circuit element for switching power calculation pur-

pose.

As shown in Figure 4.7 (a) and (b), short circuit current depends on input slew

rate. The transient short circuit current has been calculated during delay calculation.

We can simply perform a numerical integration on the transient short circuit current

and obtain the short circuit energy per switch SC(Sl), where Sl is the input slew rate.

4.4 Chip-level Delay and Power

With the delay and power model for basic circuits discussed in Section 4.3, we in-

troduce the chip level delay and power estimation model in this section. In order to

perform chip level estimation, we apply the similar idea as the trace-based estimation

in Chapter 2. We collect the tract information in the same way as in Section 2.3, then

calculate the chip level power and delay.

4.4.1 Delay Model

Similar to Section 2.3.2.3, we compute the critical path delay by adding the delay of

all circuit elements in the critical path, i.e., LUT, wire segment, interconnect switch

75

buffer, and MUX:

D

crit

=

∑

i∈P

D

i

(4.21)

where D

i

is the delay of the i

th

circuit element in the critical path, which can be cal-

culated by the circuit level delay model as discussed in Section 4.3.1. For each circuit

element, the loading capacitance can be calculated from the in/out capacitance of the

basic circuit elements as introduced in Section 4.3.2. For an interconnect wire seg-

ment, we use the Elmore delay model with resistance and capacitance suggested by

ITRS [67].

4.4.2 Leakage Power Model

In the same way as in Section 2.3.2.2, the chip leakage power is modeled as the sum

of the leakage power of all circuit elements:

P

leak

=

∑

i∈T

N

t

i

P

s

i

(4.22)

where P

s

i

is the leakage power for type i circuit element, which can be computed by

circuit level leakage power model as discussed in Section 4.3.2.

4.4.3 Dynamic Power Model

Dynamic power includes switch power and short-circuit power. A circuit implemented

on an FPGA cannot utilize all circuit elements. Dynamic power is only consumed by

the used FPGA resources. Our switch power model distinguishes different types of

used FPGA circuit elements and applies the following expression:

P

sw

=

1

2

· f · V

2

dd

· SW ·

∑

i∈T

N

u

i

· C

i

(4.23)

where f is the operating frequency, V

dd

is the supply voltage, SW is the average switch-

ing activity, and C

i

is the total capacitance per switch of type i circuit elements. The

76

total capacitance per switch C

i

can be computed from the in/out capacitance of a basic

circuit element as discussed in Section 4.3.2. The switching activity is related to the

input vector and logic, which is difﬁcult to obtained without detailed simulation using

a large set of input vectors. In our model, the average switching activity SW can be set

by the users. We also assume that the operating frequency to be 5/6 of the maximum

frequency, that is f = 1/(1.2 · D

crit

). We do not set the operating frequency to maxi-

mum frequency since the actual critical path delay of some chips may increase due to

process variation.

The short circuit power is related to the input slew rate. Here we model the chip

short circuit power as:

P

sc

= f · SW ·

∑

i∈T

∑

1≤j≤N

u

i

SC

i

(Sl

i, j

) (4.24)

where SC

i

(·) is the function to compute short circuit energy per switch for type i circuit

element and Sl

i, j

is the input slew rate of the j

th

type i circuit element. The short circuit

energy function SC

i

(·) can be obtained from the short circuit power model of basic

circuit elements as discussed in Section 4.3.2. Because of the regularity of FPGAs, the

input slew rate for the same type of circuit elements is very close. Therefore, the chip

short circuit power can be further simpliﬁed as:

P

sc

= f · SW ·

∑

i∈T

SC

i

(Sl

i

) (4.25)

where Sl

i

is the input slew rate of type i circuit element, which is computed as the

output slew rate of the element driving type i element. For example, a connection box

buffer is driven by a global interconnect buffer, then the input slew rate of a connec-

tion box buffer is the output slew rate of the global interconnect buffer. Such output

slew rate can be computed from the basic circuit element delay model discussed in

Section 4.3.1.

77

With the switch power and short circuit power, the total dynamic power is com-

puted as:

P

dynamic

= P

sw

+P

sc

(4.26)

Notice that dynamic power model in this chapter is different from that in Sec-

tion 2.3.2.1. In Section 2.3.2.1, short circuit power is calculated as a ratio to switching

power, while in this chapter, we calculate the circuit level short circuit power directly

from transistor model. Therefore, the short circuit power model introduced in this

section is more accurate than that in Section 2.3.2.1.

4.5 Chip Level Delay and Power Variation

Based on the chip-level power and delay model introduced in Section 4.4, we study

the impact of process variation on power and delay in this section. First we introduce

the variation model as below.

4.5.1 Variation Models

In general, delay and leakage power of a circuit element are functions of the underlying

process parameters and they can be described as:

D

t

= F

D

t

(X

1

, X

2

, ...) (4.27)

P

t

= F

L

t

(X

1

, X

2

, ...) (4.28)

where F

L

t

and F

D

t

are functions of process parameters to calculate leakage power and

delay for type t circuit element, respectively, and the process variation sources (such

as gate change length and dopant density) are modeled as a random variable X. To

evaluate the chip-level leakage power and delay considering process variation, we ﬁrst

78

apply randomly generated process parameters to the device model as in Section 4.2 and

then calculate the leakage power and delay for basic circuit elements as in Section 4.3.

The sensitivity functions F

L

i

and F

D

i

can be obtained from the leakage power and delay

model of basic circuit elements discussed in Section 4.3.

For each variation source X, all inter-die, intra-die spatial, and intra-die random

variation is considered. That is:

X = X

g

+X

s

+X

r

(4.29)

where X

g

, X

s

and X

r

are the inter-die, intra-die spatial, and intra-die random variation

for variation source X. We assume that X

g

, X

s

, and X

r

are independent from each

other. Each circuit element has its own random variation X

r

, and all circuit element

of the whole chip share the same inter-die variation X

g

. We use the grid based model

in [173] to model the spatial variation. A chip is divided into several grids, all circuit

elements in the same grid share the same spatial variation X

s

and the spatial variation

between different grids are correlated. The correlation coefﬁcient decreases when the

distance between two grids increases.

4.5.2 Delay Variation

With process variation, some close-to-be critical path may become critical path. There-

fore, in order to analyze delay variation, circuit delay is calculated as the maximum of

a set of near critical paths:

D

var

= max

j∈C

D

j

(4.30)

where C is the set of close-to-be critical paths and D

j

is the delay of the j

th

close-to-be

critical path, which is calculated as:

D

j

=

∑

i∈P

j

F

D

ji

(X

ji

1

, X

ji

2

, ...) (4.31)

79

where P

j

is the set of path elements of the j

th

path and F

D

ji

(·) is the delay variation

function for the i

th

circuit element in P

j

. Because there is no closed-form formula for

the delay variation function F

D

ji

(·), we do not have closed-form formula to compute

the chip delay distribution. In this chapter, we use the Monte-Carlo simulation to

obtain the chip delay distribution. We ﬁrst generate M samples of variation sources

(X

ji

1

, X

ji

2

, ...) for all i, and then compute the the path delay under such samples to

obtain M samples of path delay. Finally we can get the delay distribution from those

M samples of path delay. The Monte-Carlo simulation allows us to consider non-

Gaussian variation sources and spatial correlation, which is ignored in the previous

Ptrace [167].

4.5.3 Chip Leakage Power Variation

The chip leakage power with process variation is modeled as the sum of leakage power

of all circuit elements:

P

leak

=

∑

i∈T

N

t

i

∑

j=1

P

j

i

=

∑

i∈T

N

t

i

∑

j=1

F

L

i

(X

i j

1

, X

i j

2

, ...) (4.32)

where P

j

i

is the leakage power of the j

th

type i circuit element, and X

i j

is the variation

of the j

th

type i circuit element.

Similar to the chip delay variation model, we use Monte-Carlo simulation to com-

pute chip leakage distribution. We ﬁrst generate M samples of variation sources and

then compute the M samples of chip leakage power. However, because the random

variation is unique for each circuit element, when the circuit size becomes large, a

large number of random variation samples need to be computed. This makes the sim-

ulation very inefﬁcient. In order to solve such problem, we simplify the chip leakage

power variation model by applying central limit theorem similar to [25, 168].

Given the inter-die and within-die spatial variation, the total leakage power for all

80

type i circuit elements in the l

th

grid can be calculated as:

P

i

leak

=

N

t

li

∑

j=1

F

L

i

(X

g1

+X

l

s1

+X

il j

r1

, X

g2

+X

il j

r2

, ...) (4.33)

where X

g

is the inter-die variation, X

l

s

is the spatial variation in the l

th

grid, X

il j

r

is

the random variation for the j

th

type i circuit elements in the l

th

grid, N

t

li

is the number

of type i circuit element in the l

th

grid. Usually, N

t

li

is a very large number. Therefore,

by applying the central limit theorem, we have:

N

t

li

∑

j=1

F

L

i

(X

g1

+X

l

s1

+X

il j

r1

, X

g2

+X

il j

r2

, ...) ≈N

t

i

· µ

i

(X

g1

+X

l

s1

, X

g2

+X

l

s2

, ...) (4.34)

where

µ

il

(X

g1

+X

l

s1

, X

g2

+X

l

s2

, ...)

= E[F

L

i

(X

i

g1

+X

i j

r1

, X

i

g2

+X

i j

r2

, ...)|X

g1

, X

l

s1

, X

g2

, X

l

s2

, ...] (4.35)

In order to compute µ

il

(·), we ﬁrst generate M

r

samples of random variations, and

then compute the mean of leakage power under such samples. That is:

µ

il

(X

g1

+X

l

s1

, X

g2

+X

l

s2

, ...)

=

1

M

r

M

r

∑

k=1

F

L

j

(X

g1

+X

l

s1

+X

k

r1

, X

g2

+X

l

s1

+X

k

r2

, ...) (4.36)

where x

k

r

is the k

th

sample of random variation. Here, notice that usually M

r

is much

smaller than N

t

li

and does not depend on circuit size. Therefore such method is scalable

to large circuit size. Then the chip leakage power can be modeled as:

P

leak

=

∑

i∈T

G

∑

l=1

N

t

li

· µ

il

(X

g1

+X

l

s1

, X

g2

+X

l

s2

, ...) (4.37)

where G is the number of grids. With the above equation, we may generate M samples

of inter-die and within-die spatial variations to compute the distribution of leakage

power.

81

Notice that the variation model introduced in this section may handle non-Gaussian

variation sources and spatial correlation, which are ignored in the variation model in

Section 3.2.

4.6 Process Evaluation

Based on the chip level power and delay model presented in Section 4.4 and 4.5, we

perform device tuning to minimize power and delay.

4.6.1 Power and Delay Optimization

First we discuss the optimization for nominal value of power and delay. In this chapter,

we consider two device parameters, supply voltage Vdd and the physical gate length

L

gate

. In our experiment, we start from the device setting of ITRS High Performance

32nm technology (HP32) [67] (with Vdd=1.1V, L

gate

=32nm), and then change Vdd

and L

gate

around such setting. Moreover, we assume that the FPGA has cluster size

N=6, LUT size K=7, and the global interconnect wire length W=4, which is opti-

mized for high performance [91]. For dynamic power estimation, we assume that the

switching activity SW=0.5. In addition, we also consider heterogeneous L

gate

for logic

blocks (L

c

) and interconnects (L

i

). Because SRAMs have little impact on FPGA delay,

we assume that all SRAMs use high V

th

(1.5X dopant density compared to the other

circuit elements) and ﬁx L

gate

=32nm. During our evaluation, Vdd is tuned from 1.0V

to 1.1V, L

gate

(both L

c

and L

c

) is tuned from 31nm to 33nm. The experimental setting

is summarized in Table 5.1. In our experiment, we assumed that 20 MCNC bench-

marks [175] are put in one FPGA chip. Therefore, the power is computed as the sum

of all 20 benchmarks and delay is computed as the longest critical path delay of 20

benchmarks.

82

N K W SW Vdd (V) L

c

(nm) L

i

(nm)

6 7 4 0.5 1.0, 1.05, 1.1 31, 32, 33 31, 32, 33

Table 4.1: Experimental setting.

Figure 4.8 shows the energy per clock cycle and delay tradeoff between different

device settings. From the ﬁgure, we ﬁnd that there is a 3X energy span and a 1.3X

delay span within our search space. We also ﬁnd that that the knee point in the energy

delay tradeoff ﬁgure is exactly HP32, which has high performance without paying too

much power penalty. On the other hand, we also optimize device setting by minimizing

energy and delay product (ED). We show the top 2 minimum ED device settings in the

ﬁgure, which we refer to as min-ED1 and min-ED2.

2

5

8

11

14

3.5 4.0 4.5 5.0

Delay (ns)

E

n

e

r

g

y

(

n

J

)

All device

Min-ED

HP32

Min-ED

HP32

3.1X

1.3X

Figure 4.8: Energy and delay tradeoff.

Table 4.2 shows the power (P), delay (D), energy per clock cycle (E) , and energy

delay product (ED) for HP32 and min-ED device settings. From the table, we see that

by applying device tuning, ED can be reduced by up to 29.4% (min-ED1) compared to

HP32. We also ﬁnd that it is worthwhile to apply heterogeneous L

gate

. The optimum

heterogeneous L

gate

device setting (min-ED1) has lower ED (0.7nJ·ns lower) than the

83

optimum homogeneous L

gate

device setting (min-ED2).

Device Vdd L

c

L

i

P D E ED

(V) (nm) (nm) (W) (ns) (nJ) (nJ·ns)

HP32 1.1 32 32 1.19 3.90 1.88 22.6

min-ED1 1.0 32 33 0.77 4.55 3.50 15.9(-29.4%)

min-ED2 1.0 33 33 0.74 4.73 3.50 16.5(-26.8%)

Table 4.2: Power, delay, energy, and ED comparison for different device settings.

4.6.2 Variation Analysis

In this section, we apply the chip level variation model as introduced in Section 4.5

to analyze the chip level leakage and delay variation. In our experiment, we consider

four types of variation sources, dopant density (N

bulk

), channel gate length (L

gate

),

oxide thickness (T

ox

), and mobility (µ

e f f

) [190]. Moreover, because for technology

under 65nm, the spatially correlated intra-die variation is small compared to the intra-

die random variation [190], therefore, we ignore spatial variation in this chapter. For

all variation sources, we assume that they follow normal distribution. Notice that our

process variation model is based on Monte-Carlo simulation, therefore it can handle

any non-Gaussian variation sources. The 3-sigma value of inter-die (3σ

g

), intra-die

spatial 3σ

s

, and intra-die random (3σ

r

) variation for all types of variation sources are

summarized in Table 4.3. In the table, the 3-sigma values are the represented as the

percentage of nominal value. Moreover, for spatial variation, we assume that the cor-

relation coefﬁcient decreases to 1% when distance is 1cm (D

corr

= 1cm). In our ex-

periment, we consider two different variation settings, low variation (LV) and high

variation (HV). We generate M = 10, 000 samples for Monte-Carlo simulation and use

M

r

= 100 samples to compute the leakage mean µ

i

(·).

Figure 4.9 illustrates the probability density function (PDF) of delay and leakage

power for min-ED1 and HP32 under variation setting LV. From the ﬁgure, we see that

84

M = 10, 000 M

r

= 100 D

corr

= 1cm

Source Distribution LV HV

3σ

g

3σ

s

3σ

r

3σ

g

3σ

s

3σ

r

N

bulk

Normal 4.0% 4.0% 2.0% 6.0% 6.0% 4.0%

L

gate

Normal 2.0% 2.0% 1.5 3.0% 3.0% 2.2%

T

xo

Normal 2.0% 2.0% 1.5 3.0% 3.0% 2.2%

µ

e f f

Normal 4.0% 4.0% 2.0% 6.0% 6.0% 4.0%

Table 4.3: Experimental setting for process variation.

min-ED1 has much less leakage variation compared to HP32 while its delay variation

is only a little bit larger than HP32. We also observe that the leakage roughly has a

log-normal distribution and delay roughly has a normal distribution. This is because

leakage power is almost exponential to the variation sources while delay is almost

linear to the variation sources.

0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2

0

2

4

6

8

10

12

14

16

Leakage (W)

(a)

min−ED1

HP32

3.6 3.8 4 4.2 4.4 4.6 4.8 5 5.2

0

0.5

1

1.5

2

2.5

3

3.5

4

Delay (ns)

(b)

min−ED1

HP32

Figure 4.9: Delay and leakage distribution. (a) Leakage PDF; (b) Delay PDF.

Table 4.4 summarizes the nominal value (nom), mean (µ), and standard deviation

(σ) of leakage power and delay for the min-ED device settings and HP32. From the

table, we observe that compared to HP32 the min-ED device settings greatly reduce

leakage variation (reduce standard deviation by up to 92%) with only a small increase

of delay variation (increase standard deviation by up to 37%). From the table, we also

see that HP32 is more sensitive to process variation than the min-ED device settings.

85

Device Leakage (mW) Delay (ns)

nom LV HV nom LV HV

µ σ µ σ µ σ µ σ

HP32 854 982 348 1120 575 3.90 3.95 0.122 3.99 0.185

min-ED1 328 359 51 (-86%) 398.2 68 (-88%) 4.55 4.55 0.162(+33%) 4.58 0.244(+32%)

min-ED2 315 328 32 (-89%) 375.2 48 (-92%) 4.73 4.779 0.164(+37%) 4.782 0.245(+32%)

Table 4.4: Nominal value, mean, and standard deviation comparison for leakage power

and delay.

4.7 FPGA Reliability

As technology scales down to nanometer, reliability issues such as device aging with

V

th

degradation and soft error rate (SER) due to high-energy particle or cosmic rays be-

come more and more signiﬁcant. In this section we study these two types of reliability

for FPGA, i.e. device aging and permanent soft error rate (SER).

4.7.1 Impact of Device Aging

First, we introduce the impact of device aging effect on FPGA power and delay. It

has been shown that negative-bias-temperature-instability (NBTI) effect increases the

threshold voltage of PMOS transistor [11][160][149][23], and hot-carrier-injection

(HCI) increases the threshold voltage (V

th

)of NMOS transistor [34][149][166]. Due

to the effect of NBTI, PMOS V

th

increases during stress mode, i.e. with input 0, and

recovers during recovery mode, i.e. with input V

dd

. (4.38) and (4.39) are the equations

for PMOS ΔV

th

over time in the stress and recovery modes, respectively.

ΔV

th

= (K

v

(t −t

0

)

0.5

+(ΔV

th0

)

(1/2n)

)

2n

(4.38)

ΔV

th

=−ΔV

th0

(1−

2ξ

1

t

e

+

ξ

2

C(t −t

0

)

2t

ox

+

√

Ct

) (4.39)

86

(4.40) is the equation for NMOS ΔV

th

over time due to HCI.

ΔV

th

=

q

C

ox

K

2

Q

i

e

E

ox

E

o2

e

−

φ

it

qλE

m

t

n

(4.40)

The key parameters that determine the degradation rate include the inversion charge,

Q

i

, electrical ﬁeld, E

ox

, temperature, supply voltage V

dd

and nominal threshold voltage

V

th

. Due to the space limit, we do not include detailed explanation for each parameters.

Interested readers please refer to [23] for more details. Figure 4.10 shows the calcu-

lation ﬂow for ΔV

th

due to NBTI and HCI. We ﬁrst calculate the effective gate length

L

elec

and oxide thickness T

ox elec

and then calculate V

th

. With these intermediate pa-

rameters, we can obtain ΔV

th

due to device aging over time. A higher V

dd

or lower V

th

leads to larger V

th

increase. The V

th

degradation due to device aging results in increase

of critical path delay and decrease of leakage power.

Inputs: Lgate Tox Nbulk Xjext W Racc T Vdd

Output: HCI HBTI

Intermediate variables: Lelec Tox_elec

Intermediate variables: Vth

Figure 4.10: The calculation ﬂow for ΔV

th

due to NBTI and HCI.

Figure 4.11 illustrates the V

th

increase (ΔV) for PMOS and NMOS transistors of

both min-ED1 and HP32 within 10 years. From the ﬁgure, we see that HP32 is much

more sensitive to NBTI and HCI than min-ED1. This is due to the fact that HP32 uses

higher V

dd

and lower V

th

(due to smaller L

gate

) than min-ED1. Hence HP32 has larger

ΔV

th

for both NMOS and PMOS than min-ED1.

87

0 1 2 3 4 5 6 7 8 9 10

0

5

10

15

20

25

30

35

40

45

Time (year)

Δ

V

t

h

(

m

V

)

Min−ED1 PMOS (NBTI)

Min−ED1 NMOS (HCI)

HP32 PMOS (NBTI)

HP32 NMOS (HCI)

Figure 4.11: V

th

increase caused by NBTI and HCI.

Figure 4.12 illustrates the critical path delay increase and chip leakage power de-

crease within 10 years for min-ED1 and HP32. It is interesting to see in the ﬁgure that

the leakage and delay change is most signiﬁcant at the ﬁrst one year. Therefore, device

burn-in [96] that has been used for microprocessors but not FPGAs yet could be used

to reduce leakage and delay change over time after product shipment.

Table 4.5 illustrates the current value, value after 1 year, and value after 10 years of

leakage power (L) and delay (D). In the table, we also compare the change percentage

with and without burn-in (burn to 1 year). The change percentage without burn-in is

computed as the percentage of the difference between the 10 year value and current

value, and that with burn-in is computed as the percentage of the difference between

the 10 year value and the 1 year value. From the ﬁgure and table, we ﬁnd that the

min-ED device settings is less sensitive to device aging compared to HP32. This is be-

cause HP32 has higher V

dd

and smaller L

gate

compared to the min-ED device settings.

Therefore it has larger V

th

degradation, hence more sensitive to device aging. We may

see that compare to HP32, min-ED device settings can not only reduce ED but also

88

Device Current 1 Year 10 Year Change

L D L D L D w/o burn-in w/ burn-in

(mW) (ns) (mW) (ns) (mW) (ns) L D L D

HP32 854 3.90 711 4.01 640 4.23 -25.1% +8.5% -10.0% +5.5%

min-ED1 328 4.55 317 4.59 311 4.64 -5.2% +2.0% -1.9% +1.1%

min-ED2 315 4.73 307 4.76 302 4.82 -2.5% +1.9% -1.6% +1.3%

Table 4.5: Delay and leakage change due to device aging.

increase aging reliability of FPGAs. Moreover, by applying burn-in, we can greatly

reduce the delay degradation. For example, for HP32, burn-in reduces 10 year delay

degradation from 8.5% to 5.5%.

0 2 4 6 8 10

−250

−200

−150

−100

−50

0

Time (year)

(a)

Δ

P

l

e

a

k

(

m

W

)

Min−ED1

HP32

1 Year

0 2 4 6 8 10

0

0.05

0.1

0.15

0.2

0.25

0.3

0.35

Time (year)

(b)

Δ

D

e

l

a

y

(

n

s

)

min−ED1

HP32

1 Year

Figure 4.12: Impact of device aging. (a) Leakage change; (b) Delay change.

4.7.2 Permanent Soft Error Rate (SER)

We use the model in [60] to calculate the soft error rate (SER) for conﬁguration

SRAMs in our study. The SER in SRAMs is permanent in FPGAs, which cannot

be recovered unless re-writing those affected conﬁguration bits [14][17]. The SER for

one SRAM cell is shown in (4.41),

SER(SRAM) = F · A· K· e

−Q

crit

/Q

s

(4.41)

89

where F is the neutron ﬂux, A is the transistor drain area, K is a technology independent

constant and is same for all device settings, Q

crit

is the critical charge to incur an error,

and Q

s

is the charge collection slope. Figure 4.13 shows the ﬂow to calculate the

SRAM SER. We ﬁrst calculate intermediate parameters including L

elec

, T

ox elec

and

V

th

. Q

s

, Q

crit

and A are then calculated. With all these intermediate parameters, we can

obtain the SRAM SER. A transistor with a larger Q

crit

needs more energy to incur an

error and has a smaller SER. Q

crit

mainly depends on supply voltage V

dd

and loading

capacitance C, and slightly depends on V

th

. On the other hand, a transistor with a larger

Q

s

is more effective in collecting charge and has a larger SER. Q

s

depends on channel

doping density N

bulk

and V

dd

. We use the models in [60] with linear interpolation to

calculate Q

crit

and Q

s

under different device settings, i.e. different V

dd

and V

th

, and

then calculate the SER of an SRAM cell correspondingly using (4.41). The chip-level

permanent SER can be calculated as,

SER = Num SRAM· SER(SRAM) (4.42)

where Num SRAM is the total number of SRAM cells.

Inputs: Lgate Tox Nbulk Xjext W Racc T Vdd

Output: SER

Intermediate variables: Lelec Tox_elec

Intermediate variables: Vth

Intermediate variables: Qs Qcrit A

Figure 4.13: The ﬂow for SRAM SER calculation.

Table 4.6 compares the SER for HP32, and the two min-ED device settings. For

90

the min-ED device settings, L

gate

for SRAM cells is still 32nm. Therefor only V

dd

may

affect SER. In the table, SER is measured as number of failures in billion hours (FIT).

The supply voltage V

dd

in HP32 is higher, hence the critical charge, Q

crit

, is larger and

may result in a smaller SER. For min-ED1/2 with V

dd

of 1.0v, SER is 1.6% higher than

that of HP32. It is worthwhile to use Min-ED device settings to obtain a signiﬁcant

ED reduction with virtually no impact on SER.

Device V

dd

(V) # SRAMs SER (FIT)

HP32 1.1 12438596 362.45

min-ED1/2 1.0 12438596 368.25 (+1.6%)

Table 4.6: Soft error rate comparison.

4.8 Interaction between Process Variation and Reliability

In this section, we study the impact of device aging on process variation and the impact

of device aging and process variation on SER.

4.8.1 Impact of Device Aging on Power and Delay Variation

First, we discuss the impact of device aging on process variation induced leakage and

delay variation. Figure 4.14 compares the PDF of delay and leakage between current

value and the value after 10 years for the device setting min-ED1. From the ﬁgure,

we ﬁnd that device aging actually reduce the leakage variation. But device aging only

increase delay but has no signiﬁcant impact on delay variation.

Table 4.7 summarizes the mean and standard deviation of leakage and delay for

different device settings. The table compares both the current value and the value after

10 years in both low and high variation cases. From the table, we ﬁnd that device

aging has great impact of leakage power variation. For HP32 device setting under low

91

0.3 0.35 0.4 0.45 0.5 0.55

0

5

10

15

20

25

Leakage (W)

(a)

Current

After 10 Years

4 4.2 4.4 4.6 4.8 5 5.2

0

0.5

1

1.5

2

2.5

3

3.5

Delay (ns)

(b)

Current

After 10 Years

Figure 4.14: Impact of device aging on delay and leakage PDF. (a) Leakage PDF

comparison; (b) Delay PDF comparison.

variation case (LV), device aging reduces mean by 28.8% and standard deviation by

65.2%. However, the impact of device aging on delay variation is relatively small. For

HP32 device setting, device aging only increases standard deviation by 1.7%. This

is because leakage power is rough exponential to V

th

while delay is roughly linear to

V

th

. Moreover, we observe that compared to HP32 device setting, device aging has

less impact on power and delay variation for min-ED settings. This is because HP32

is more sensitive to device aging as discussed in Section 4.7. From the result, we also

ﬁnd that in high variation case, the impact of device aging on delay variation is similar

to that of the regular variation case (min-ED1). However, when variation is high, the

impact of device aging on leakage variation is much larger than that in the regular

variation case. That is, when variation scale becomes larger, there is a larger impact of

device aging on leakage variation.

4.8.2 Impacts of Device Aging and Process Variation on SER

As shown in Figure 4.13, the SRAM SER depends on V

dd

, loading capacitance, V

th

and

channel doping density N

bulk

while V

th

and N

bulk

may be affected by device aging and

92

Device Leakage (mW) Delay (ns)

current 10 years current 10 years

µ σ µ σ µ σ µ σ

HP32 LV 982 348 678 (-28.8%) 118 (-65.2%) 3.95 0.122 4.23 (+8.4%) 0.121 (+1.67%)

min-ED1 LV 359 51.0 321 (-6.2%) 31.2 (-32.7%) 4.55 0.162 4.65 (+2.0%) 0.162 (+0.16%)

min-ED2 LV 328 32.0 309 (-4.6%) 22.4 (-30.4%) 4.78 0.164 4.83 (+1.9%) 0.164 (+0.25%)

HP32 HV 1120 575 712 (-36.4%) 158 (-72.5%) 3.99 1.85 4.09 (+2.6%) 1.86 (+0.54%)

min-ED1 HV 398.2 68 345.1 (-13.3%) 38.1 (-44.0%) 4.55 0.318 4.66 (+2.1%) 0.319 (+0.28%)

min-ED2 HV 375.2 48 332 (-12.4%) 25 (-47.9%) 4.78 0.245 4.79 (+0.21%) 0.245 (+0.21%)

Table 4.7: Impact of device aging on process variation.

process variation. We perform the study on the impacts of device aging and process

variation on SER as below.

V

th

may increase with device aging due to NBTI and HCI. For ITRS HP 32nm

technology (see Figure 4.11, V

th

may degrade (or increase) by around 10% (or 20mV)

for PMOS due to NBTI and by around 25% (or 50mV) for NMOS due to HCI after 10

years. A larger V

th

may result in a slightly smaller critical charge Q

crit

for an SRAM

and therefore a larger SER. Table 4.8 presents the impact of device aging on SER.

After 10 years, the SRAM SER may increase by 0.3% due to degraded V

th

. Even if

V

th

degrades by 100mv for both NMOS and PMOS, the SRAM SER only increases by

0.9%.

0 year 10 3σ=6% variation

or nominal years in N

bulk

SRAM SER (FIT) 2.914E-5 +0.3% -0.18% to +0.17%

Table 4.8: The impact of device aging and process variation on SER of one SRAM

under ITRS HP 32nm technology.

With process variation, N

bulk

and channel gate length L

gate

become random vari-

ables. N

bulk

variation directly affects charge collection slope Q

s

and therefore affects

SER. L

gate

variation may affect V

th

and therefore implicitly affect SER. For ITRS HP

32nm technology, a variation in L

gate

from 31nm to 33 nm may result in a variation

93

in V

th

from -25% to 21% [68]. It has been shown that V

th

variation does not have a

signiﬁcant impact on SER. Table 4.8 presents the impact of N

bulk

variation on SER. A

variation of 3σ as 6% in N

bulk

only results in a variation in SER from -0.18% to 0.17%.

A larger N

bulk

may result in a larger charge collection slope Q

s

and therefor a smaller

SER. On the other hand, a larger N

bulk

may result in a higher V

th

and a smaller critical

charge Q

crit

, therefore a larger SER. These two effects compensate with each other and

therefore N

bulk

does not have a signiﬁcant impact on SER either. It is clear that neither

device aging due to NBTI and HCI nor process variation have any signiﬁcant impact

on SER.

4.9 Conclusions

In this chapter, we have developed a trace-based framework to enable concurrent pro-

cess and FPGA architecture co-development. Same as the MASTAR4 model used by

the 2005 ITRS [67][68][148], the user can tune eight parameters for bulk CMOS pro-

cesses and obtain the chip level performance and power distribution and soft error rate

(SER) considering process variations and device aging. The framework is efﬁcient as it

is based on closed-form formulas. It is also ﬂexible as process parameters can be cus-

tomized for different FPGA elements and no SPICE models and simulation are needed

for these elements. Therefore, this framework is suitable for early stage process and

FPGA architecture co-development.

A few examples of using this framework have been presented. We show that ap-

plying heterogeneous gate lengths to logic and interconnect may lead to 1.3X delay

difference, 3.1X energy difference, and reduce standard deviation of leakage variation

by 87%. This offers a large room for power and delay tradeoff. We further show that

the device aging has a knee point over time, and device burn-in to reach the point

could reduce the performance change over 10 years from 8.5% to 5.5% and reduce die

94

to die leakage signiﬁcantly. In addition, we also study the interaction between process

variation, device aging and SER. We observe that device aging reduces standard de-

viation of leakage by 65% over 10 years while it has relatively small impact on delay

variation. Moreover, we also ﬁnd that neither device aging due to NBTI and HCI nor

process variation have signiﬁcant impact on SER.

95

Part II

Statistical Timing Modeling and

Analysis

96

CHAPTER 5

Non-Gaussian Statistical Timing Analysis Using

Second-Order Polynomial Fitting

In Part I, we have discussed statistical modeling and optimization for FPGAs. In Part

II of this dissertation, we study statistical modeling and analysis for ASICs. In this

chapter, we focus on statistical static timing analysis (SSTA). Lots of research works

on SSTA have been published in recent years. However, most of the existing SSTA

techniques have difﬁculty in handling the non-Gaussian variation distribution and non-

linear dependency of delay on variation sources. To solve this problem, we ﬁrst pro-

pose a new method to approximate the max operation of two non-Gaussian random

variables through second-order polynomial ﬁtting. Applying the approximation, we

present present new non-Gaussian SSTA algorithms for three delay models: quadratic

model, quadratic model without crossing terms (semi-quadratic model), and linear

model. All the atomic operations (max and sum) of our algorithms are performed by

closed-form formulas, hence they scale well for large designs. Experimental results

show that compared to the Monte-Carlo simulation, our approach predicts the mean,

standard deviation, skewness, and 95-percentile point within 1%, 1%, 6%, and 1%

error, respectively.

97

5.1 Introduction

Process variation introduces signiﬁcant uncertainty to circuit delay as discussed

in Chapter 1. In Part I, we have studied statistical optimization for FPGA circuits.

Compared to FPGA, ASIC has higher performance, lower power, and smaller area.

Therefore, ASICs are widely used in high performance or low power applications,

where the impact of process variation is signiﬁcant.

To analyze the impact of process variation on circuit delay, statistical static timing

analysis (SSTA) is developed for full chip timing analysis under process variation. By

performing SSTA, designers can obtain the timing distribution and its sensitivity to

various process parameters.

In recent years, two types of SSTA techniques are proposed, the path-based [103,

72, 7, 120, 157, 128] and the block-based [32, 5, 49, 9, 22, 8, 163, 86, 53, 182, 13, 181,

113, 75, 180, 178, 76, 3, 183, 2, 142, 19, 179, 40] SSTA. Since the number of paths

is exponential with respect to the circuit size, the path-based SSTA is not scalable to

large circuits. The block-based SSTA was proposed to solve this problem. The goal

of block-based SSTA is to parameterize timing characteristics of the timing graph as a

function of the underlying sources of process parameters which are modeled as random

variables.

The early SSTA methods [32, 163] modeled the gate delay as linear functions

of variation sources and assumed all the variation sources are mutually independent

Gaussian random variables. Based on such assumption, [163] presented time efﬁcient

methods with closed-form formulas [46] for all atomic operations (max and sum).

However, when the amount of variation becomes larger, the linear delay model is no

longer accurate [94]. In order to capture the non-linear dependence of delay on the

variation sources, a higher-order delay model is thus needed [180, 178].

98

As more complicated or large-scale variation sources are taken into account, the

assumption of Gaussian variation sources is also not valid. For example, the via resis-

tance has an asymmetric distribution [13], while dopant concentration is more suitably

modeled as a Poisson distribution [142] than Gaussian. Some of the most recent works

on SSTA [76, 142, 13, 19, 179, 40] started to take non-Gaussian variation sources into

account. For example, [142] applied independent component analysis to de-correlate

the non-Gaussian random variables, but it was still based on a linear delay model.

[76, 13, 19, 179] considered both non-linear delay model and non-Gaussian variation

sources, but the computation cost of these techniques is too high to be applicable for

large designs. For example, [76] proposed to compute the max operation by a regres-

sion based on Monte-Carlo simulation, which is slow. [13] computed the max opera-

tion through tightness probability while [179] applied moment matching to reconstruct

the max. However, to do so, both had to resort to expensive multi-dimension numer-

ical integration techniques. [19] handled the atomic operations by approximating the

gate delay using a set of orthogonal polynomials, which needs to be constructed for

different variation distributions. [40] proposed to approximate the probability density

function (PDF) of max results as a Fourier Series, but it lacks the capability to handle

the crossing term effects on timing.

In this chapter, we introduce a time efﬁcient non-linear SSTA for arbitrary non-

Gaussian variation sources. The major contribution of this work is two-fold. (1) We

propose a new method to approximate the max of two non-Gaussian random vari-

ables as a second-order polynomial function based on least-square-error curve ﬁtting.

Experimental results shows that such approximation is much more accurate than the

linear approximation based on tightness probability [163, 180, 13]. (2) Based on the

new approximation of the max operation, we present our SSTA technique for three

different delay models: quadratic delay model, quadratic delay model without cross-

ing terms (semi-quadratic model), and linear delay model. In our method, only the

99

ﬁrst few moments are required for different distributions and no extra effort is needed,

while [19] needs to obtain different sets of orthogonal polynomials for different vari-

ation distributions. Moreover, all the atomic operations are performed by closed-form

formulas, hence they are very time efﬁcient. For the linear and semi-quadratic delay

model, the computational complexity of our method is linear in both the number of

variation sources and circuit size. For the quadratic delay model, the computational

complexity is cubic (third-order) to the number of variation sources and linear to the

circuit size. Experimental results show that for the semi-quadratic delay model, our

approach is 70 times faster than the non-linear SSTA in [40], and indicates less than

1% error in mean, 1% error in standard deviation, 30% error in skewness, and 1% er-

ror in 95-percentile point. Moreover, for the more accurate quadratic delay model, our

approach incurs less than 1% error in mean, 1% error in standard deviation, 6% error

in skewness, and 1% error in 95-percentile point.

The rest of the chapter is organized as follows: Section 5.2 introduces the approx-

imation of the max operation using second-order polynomial ﬁtting; with the approxi-

mation of max, Section 5.3 presents a novel SSTA algorithmfor quadratic delay model

with non-Gaussian variation sources; we further apply this technique to handle both

semi-quadratic and linear delay model in Section 5.4 and Section 5.5, respectively;

experimental results are presented in Section 5.6; Section 5.7 concludes this chapter.

5.2 Second-Order Polynomial Fitting of Max Operation

5.2.1 Review and Preliminary

According to [163], given two random variables, A and B, the tightness probability is

deﬁned as the probability of A greater than B, i.e., T

A

= P{A > B} = P{A−B > 0}.

100

Then the max operation is approximated as:

max(A, B) ≈T

A

· A+(1−T

A

) · B+c, (5.1)

where c is a term used to match the mean and variance of max(A, B). Because (5.1)

can be further written as max(A, B) = max(A−B, 0) +B, we arrive at:

max(A−B, 0) ≈ T

A

· (A−B) +c. (5.2)

According to (5.2), we can see that the max operation in [163] is in fact approximated

by a linear function subject to certain conditions (such as matching the exact mean

and variance). Such a linear approximation is efﬁcient and reasonably accurate when

both A and B are Gaussian. As shown in [163, 164], the PDF predicted by such linear

approximation is very close the Monte Carlo simulation. Moreover, in this case the

coefﬁcients can be computed easily, as both T

A

and E[max(A, B)] can be obtained by

closed-form formulas when A and B are Gaussian [163]. However, when A and B are

non-Gaussian randomvariables, the tightness probability T

A

and E[max(A, B)] are hard

to obtain. For example, T

A

in [13] has to be computed via expensive multi-dimensional

numerical integration, thus preventing its scalability to large designs. Moreover, be-

cause the max operation is an inherently non-linear function, linear approximation

would become less and less accurate, particularly when the amount of variation in-

creases and the number of non-Gaussian variation sources increases. To overcome

these difﬁculties, we develop a more efﬁcient and accurate approximation method in

the next section.

5.2.2 New Fitting Method for Max Operation

In this section, we introduce a new ﬁtting method to approximate the max operation.

Instead of using the linear function, we propose to use a second-order polynomial

101

function to approximate the max operation, i.e.,

max(V, 0) ≈h(V, Θ) = θ

2

V

2

+θ

1

V +θ

0

, (5.3)

where Θ = (θ

0

, θ

1

, θ

2

)

T

are three coefﬁcients of the second-order polynomial h(v, Θ).

The problem thus becomes how to obtain the ﬁtting parameters of Θ. Different from

the linear ﬁtting method through tightness probability, we compute Θ by matching

the mean of the max operation while minimizing the square error (SE) between

h(V, P) and max(V, 0) within the ±ε range of V. Mathematically, this problem can be

formulated as the following optimization problem:

Θ = arg min

E[h(V,Θ)]=µ

m

Z

µ

v

+ε

µ

v

−ε

h(v, Θ) −max(v, 0)

2

dv (5.4)

where µ

v

, σ

2

v

, and γ

v

are the mean, variance, and skewness of random variable of

V, respectively; while µ

m

and E[h(V, Θ)] are the exact and approximated mean of

max(V, 0), respectively. In other words,

µ

m

= E[max(V, 0)] (5.5)

E[h(V, Θ)] = θ

2

(σ

2

v

+µ

2

v

) +θ

1

µ

v

+θ

0

. (5.6)

In (5.4), ε is the approximation range. The value of ε may affect the accuracy of the

second order approximation. Intuitively, when ε is large, the probability that V lies

outside the ±ε range is low. Then the impact of ignoring the difference in the low

probability region is justiﬁed. However, when ε is too large, we may unnecessarily

ﬁtting the curve in a wide range without focusing on the high probability region. In

this chapter, we ﬁnd that assuming:

ε = 3σ

v

(5.7)

provides a good approximation of max. This can be explained by the fact that, although

the circuit delay is not Gaussian, it would still be more or less Gaussian like. That is,

102

its PDF would be most likely centering around the ±3σ range of the mean. In practice,

the users can choose the range ε according to the delay distributions.

In order to solve (5.4), we ﬁrst need to compute µ

m

. When V is a non-Gaussian

randomvariable, exact computation of µ

m

is difﬁcult in general. Therefore, we propose

to use the following two-step procedure to approximately compute µ

m

. In the ﬁrst

step, we approximate the non-Gaussian random variable V as a quadratic function of a

standard Gaussian random variable W similar to [184], i.e.,

V ≈P(W) = c

2

· W

2

+c

1

· W +c

0

. (5.8)

The coefﬁcients c

2

, c

1

, and c

0

can be obtained by matching P(W) and V’s mean,

variance and skewness simultaneously, i.e.,

E[c

2

· W

2

+c

1

· W +c

0

] = µ

v

(5.9)

E[(c

2

· W

2

+c

1

· W +c

0

−µ

v

)

2

] = σ

2

v

(5.10)

E[(c

2

· W

2

+c

1

· W +c

0

−µ

v

)

3

] = γ

v

· σ

3

v

(5.11)

As shown in [184], solving the above equations, c

2

will be one of the following values:

c

2,1

= −

2σ

2

v

+Δ

2/3

2Δ

1/3

(5.12)

c

2,2

=

2(1+ j

√

3)σ

2

v

+(1− j

√

3)Δ

2/3

4Δ

1/3

(5.13)

c

2,3

=

2(1− j

√

3)σ

2

v

+(1+ j

√

3)Δ

2/3

4Δ

1/3

(5.14)

where Δ = γ

v

· σ

3

v

+ j

8σ

6

v

−γ

2

v

· σ

6

v

, with j =

√

−1. It is proved in [184] that, when

|γ

v

| < 2

√

2 there exists one and only one of the above three values in the range |c

2

| <

σ

v

/

√

2. We will pick such value for c

2

. When |γ

v

| > 2

√

2, we may let:

c

2

=

σ

v

/

√

2 γ

v

> 2

√

2

−σ

v

/

√

2 γ

v

<−2

√

2

(5.15)

103

It is proved in [19] that the above equations gives P(W) which matches the mean and

variance of V and has the skewness closest to γ

v

. With c

2

, c

1

and c

0

can be computed

as:

c

1

=

σ

2

v

−2c

2

2

(5.16)

c

0

= µ

v

−c

2

(5.17)

After obtaining the coefﬁcients c

2

, c

1

, and c

0

in the second step, we approximate

the exact mean of max(V, 0) by the exact mean of max(P(W), 0), i.e.,

µ

m

≈E[max(P(W), 0)] =

Z

P(w)>0

P(w)φ(w)dw (5.18)

where φ(·) is the PDF of the standard normal distribution. In the above equation, the

integration range P(w) > 0 can be computed in four different cases:

Case 1: c

2

> 0

P(w) > 0 ⇔w <t

1

∨w >t

2

(5.19)

where

t

1

= (−c

1

−

c

2

1

−4c

2

c

0

)/2c

2

(5.20)

t

2

= (−c

1

+

c

2

1

−4c

2

c

0

)/2c

2

(5.21)

Case 2: c

2

< 0

P(w) > 0 ⇔t

2

< w < t

1

(5.22)

Case 3: c

2

= 0∧c

1

> 0

P(w) > 0 ⇔w >−c

0

/c

1

(5.23)

Case 4: c

2

= 0∧c

1

< 0

P(w) > 0 ⇔w <−c

0

/c

1

(5.24)

104

Knowing the integration range, we can compute E[max(P(W), 0)] under such four

cases:

E[max(P(W), 0)] = (5.25)

(c

2

+c

0

)

1+Φ(t

1

) −Φ(t

2

)

+(c

1

+t

1

)

φ(t

2

) −φ(t

1

)

c

2

> 0

(c

2

+c

0

)

Φ(t

1

) −Φ(t

2

)

+(c

1

+t

1

)

φ(t

2

) −φ(t

1

)

c

2

< 0

c

0

· Φ(c

0

/c

1

) +c

1

· φ(c

0

/c

1

) c

2

= 0∧c

1

> 0

c

0

·

1−Φ(c

0

/c

1

)

−c

1

· φ(c

0

/c

1

) c

2

= 0∧c

1

< 0

where Φ(·) is the cumulative density function (CDF) of the standard normal distribu-

tion. According to (5.25), we can compute µ

m

easily through analytical formulas.

In practice, the above approximation of µ

m

is accurate only when the distribution

of V is close to the normal distribution. When V is not close to the normal distribution,

for example, V is uniform distribution, the above approximation of µ

m

is no longer

accurate. Fortunately, in practice, the circuit delay has close-to-normal distribution as

discussed above. Therefore, such approximation is accurate for SSTA.

After obtaining µ

m

, we need to ﬁnd Θ in (5.3) by solving the constrained optimiza-

tion problem of (5.4). In the following, we show that (5.4) can be solved analytically

as well. We ﬁrst write the constraint in (5.4) as follows:

θ

0

= µ

m

−θ

2

(µ

2

v

+σ

2

v

) −θ

1

µ

v

. (5.26)

Replacing θ

0

in (5.4) by (5.26), the square error in (5.4) can be written as:

SE =

0

Z

µ

v

−3σ

v

h

2

(v, Θ)dv +

µ

v

+3σ

v

Z

0

h(v, Θ)dv −v

2

dv

=

Z

0

l

1

θ

2

(v

2

−µ

2

v

−σ

2

v

) +θ

1

(v −µ

v

) +µ

m

2

dv +

Z

l

2

0

θ

2

(v

2

−µ

2

v

−σ

2

v

) +θ

1

(v −µ

v

) +µ

m

−v

2

dv, (5.27)

105

where l

1

= µ

v

−3σ

v

and l

2

= µ

v

+3σ

v

. By expanding the square and integral, we

can transform the constrained optimization of (5.4) to the following unconstrained

optimization problem, which is a quadratic form of Θ

= (θ

1

, θ

2

)

T

:

Θ

= argminSE (5.28)

SE = Θ

T

SΘ

+QΘ

+t, (5.29)

where S = (s

i j

) is a 2 ×2 matrix, Q = (q

i

) is a 1 ×2 vector, and t is a constant. The

parameters of S, Q, and t can be computed as:

s

11

= (l

3

2

−l

3

1

)/3+(l

2

−l

1

)µ

2

v

−(l

2

2

−l

2

1

)µ

v

(5.30)

s

22

= (l

5

2

−l

5

1

)/5+(l

2

2

−l

2

1

)(µ

2

v

+σ

2

v

)

2

−2(l

3

2

−l

3

1

)(µ

2

v

+σ

2

v

)/3 (5.31)

s

12

= s

21

= (l

4

2

−l

4

1

)/4+(l

2

−l

1

)µ

v

(µ

2

v

+σ

2

v

) +

(l

3

2

−l

3

1

)µ

v

/3−(l

2

2

−l

2

1

)(µ

2

v

+σ

2

v

)/2 (5.32)

q

1

= l

2

2

µ

v

+(l

2

2

−l

2

1

)µ

m

−2l

3

2

/3−2(l

2

−l

1

)µ

v

µ

m

(5.33)

q

2

= l

2

2

(µ

2

v

+σ

2

v

) +2(l

3

2

−l

3

1

)µ

m

/3− (5.34)

2l

3

2

/3−2(l

2

−l

1

)(µ

2

v

+σ

2

v

)µ

v

t = l

3

2

/3+(l

2

−l

1

)µ

2

m

−l

2

2

µ

m

(5.35)

Because the square error is always positive no matter what value the Θ

is, S is a pos-

itive deﬁnite matrix. Therefore, (5.29) is to minimize a second-order convex function

of Θ

without constraints. Then the optimum of Θ

**can be obtained by setting the
**

derivative of (5.29) to zero, resulting a 2×2 system of linear equations:

∂SE

∂θ

1

= 2s

11

· θ

1

+2s

12

· θ

2

+q

1

= 0 (5.36)

∂SE

∂θ

2

= 2s

22

· θ

2

+2s

12

· θ

1

+q

2

= 0 (5.37)

106

Such a system of linear equations can be solved efﬁciently:

θ

1

=

s

12

· q

2

−s

22

· q

1

2s

11

· s

22

−2s

2

12

(5.38)

θ

2

=

s

12

· q

1

−s

11

· q

2

2s

11

· s

22

−2s

2

12

(5.39)

With θ

1

and θ

2

, we can compute θ

0

from (5.26).

From the above discussion, we see that for a random variable V with any distribu-

tion, if we know its mean µ

v

, variance σ

2

v

, and skewness γ

v

, we can obtain the ﬁtting

parameters Θ for max(V, 0) through closed-form formulas.

Notice that in (5.4) we try to minimize the mean square error within the ±3σ range.

If µ

v

>3σ

v

, V is always larger than zero within the ±3σ range. That is: max(V, 0) =V

when V ∈ (µ

v

−3σ

v

, µ

v

+3σ

v

). From (5.4), it is easy to ﬁnd that in this case, Θ =

(0, 1, 0). That means:

max(V, 0) ≈V when µ

v

> 3σ

v

(5.40)

To show the accuracy of our second-order ﬁtting approach to the max approxima-

tion, we compare the results obtain from our approach, the linear ﬁtting method, and

the exact (or Monte Carlo) computation. For example, by assuming V ∼N(0.7, 1), the

exact max operation max(V, 0), linear ﬁtting through through tightness probability, and

our second-order ﬁtting can be obtained as shown in Fig. 5.2.2, where the x-axis is V,

and y-axis is the results of max(V, 0). The corresponding PDFs of the three approaches

are shown in Fig. 5.2.2. From the ﬁgures, we see that our proposed second-order ﬁtting

method is more accurate than the linear ﬁtting method. In particular, the PDF of our

second-order ﬁtting method has a sharp peak which approximates the impulse of exact

PDF well as shown in Fig. 5.2.2. In contrast, the linear ﬁtting method can only give a

smooth PDF, which is far away from the exact max result.

107

−2 −1 0 1 2 3

−1

−0.5

0

0.5

1

1.5

2

2.5

3

3.5

4

Max(V,0)

Second−Order Fit

Linear Fit

(a)

−2 −1 0 1 2 3

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Max(V,0)

Second−Order Fit

Linear Fit

(b)

Figure 5.1: (a) Comparison of exact computation, linear ﬁtting, and second-order ﬁt-

ting for max(V, 0); (b) PDF comparison of exact computation, linear ﬁtting, and sec-

ond-order ﬁtting for max(V, 0).

5.3 Quadratic SSTA

5.3.1 Quadratic Delay Model

In Section 5.2, we introduce the second-order ﬁtting of max operation. Here we will

apply such ﬁtting in SSTA.

We ﬁrst discuss the delay model. In practice, the circuit delay is a complicate

function of variation sources:

D = f (X

1

, X

2

, . . . X

n

, R) (5.41)

where X = (X

1

, X

2

, . . . X

n

)

T

are global variation sources, R is local random variation, n

is the number of global variation sources. Here, we assume that X

i

’s and R are mutually

independent. Without loss of generality, we assume all X

i

’s and R have zero mean

and unit variance. In order to simplify the above delay model, we apply the Taylor

expansion to approximate it. Considering that the scale of local random variation is

108

usually smaller than that of global variation, we use second order Taylor expansion to

approximate X

i

’s and use ﬁrst order expansion to approximate R:

D ≈ d

0

+

n

∑

i, j=1

b

i j

X

i

X

j

+

n

∑

i=1

a

i

X

i

+rR

= d

0

+AX +X

T

BX +rR (5.42)

where d

0

is the nominal delay, A = (a

1

, a

2

, . . . a

n

) are the linear delay sensitivity co-

efﬁcients of the global variation sources, B = (b

i j

) are the second-order sensitivity

coefﬁcients, which is an n×n matrix, and r is the linear delay sensitivity coefﬁcient of

the local random variation. A, B, and r are calculated in the similar way as [180]:

a

i

=

∂f

∂X

i

(5.43)

b

i j

=

∂

2

f

∂X

i

∂X

j

(5.44)

r =

∂f

∂R

(5.45)

In practice, f is a complex function and can not be expressed as closed-form formula.

In this case, the coefﬁcients A, B, and r can be obtained by measurement or SPICE

simulation. Moreover, in practice, the local random variation is caused by several

independent factors; therefore, the local random variation R is the sum of several in-

dependent random variables. According to the central limit theorem, R will be close

to a Gaussian random variable. In this chapter, for simplicity, we assume that R is a

random variable with standard normal distribution. Finally, we obtain our quadratic

delay model as follows:

D = d

0

+AX +X

T

BX +rR (5.46)

In the rest of this chapter, we use m

i,k

and m

r,k

to represent the kth central moment

for X

i

and the kth moment for R, respectively. Notice that all X

i

’s and R are with zero

mean, that means their central moments and raw moments are the same. Moreover,

109

we also assume that the moments of the variation sources, m

i,k

and m

r,k

, are known. In

practice, such moments can be computed from the samples of variation sources.

To compute the arrival time in a block-based SSTA framework, two atomic opera-

tions, max and sum, are needed. That is, given D

1

and D

2

with the quadratic form of

(5.46):

D

1

= d

01

+A

1

X +X

T

B

1

X +r

1

R

1

, (5.47)

D

2

= d

02

+A

2

X +X

T

B

2

X +r

2

R

2

, (5.48)

we want to compute

D

m

= max(D

1

, D

2

)

= d

m0

+A

m

X +X

T

B

m

X +r

m

R

m

, (5.49)

D

s

= D

1

+D

2

= d

s0

+A

s

X +X

T

B

s

X +r

s

R

s

. (5.50)

We will present how these two operations are handled in the rest of this section.

5.3.2 Max Operation for Quadratic Delay Model

The max operation is the most difﬁcult operation for block-based SSTA. Based on the

second-order polynomial ﬁtting method as discussed in Section 5.2.2, we propose a

novel technique to compute the max of two random variables. The overall ﬂow of the

max operation is illustrated in Figure 5.2. Considering

D

m

= max(D

1

, D

2

) = max(D

1

−D

2

, 0) +D

2

, (5.51)

let D

p

= D

1

−D

2

. Without loss of generality, we assume E[D

p

] > 0. We ﬁrst compute

the mean and variance of D

p

, if µ

D

p

> 3σ

D

p

, then D

m

= D

1

, otherwise, we compute

the joint moments between D

p

and X

i

’s and the skewness of D

p

. Knowing the mean,

variance, and skewness of D

p

, we apply the method as shown in Section 5.2.2 to ﬁnd

110

the ﬁtting coefﬁcients Θ = (θ

0

, θ

1

, θ

2

) for max(D

p

, 0). Finally, we use the moment

matching method to reconstruct the quadratic form of D

m

.

Input: Quadratic form of D

1

and D

2

Output: Quadratic form of D

m

1. Let D

p

= D

1

−D

2

, compute µ

Dp

and σ

Dp

if µ

Dp

≥3σ

Dp

{

2. Dm = D

1

}

else {

3. Compute joint moments between D

2

p

and X

i

4. Compute γ

Dp

5. Get ﬁtting coefﬁcients Θ

6. Compute joint moments between D

m

and X

i

7. Reconstruct the quadratic form of D

m

}

Figure 5.2: Algorithm for computing max(D

1

,D

2

).

5.3.2.1 Mean and Variance of D

p

In order to compute the mean and variance, we ﬁrst obtain the quadratic form of D

p

as

follows:

D

p

= D

1

−D

2

= d

p0

+A

p

· X +X

T

B

p

X +r

1

· R

1

−r

2

· R

2

(5.52)

d

p0

= d

01

−d

02

(5.53)

A

p

= A

1

−A

2

(5.54)

B

p

= B

1

−B

2

(5.55)

Because X

i

’s, R

1

, and R

2

are mutually independent random variables with mean 0

111

and variance 1, the mean and variance of D

p

can be computed as:

µ

D

p

= E[D

p

] = d

p0

+

n

∑

i=1

b

pii

(5.56)

σ

2

D

p

= E[(D

p

−µ

D

p

)

2

]

= r

2

1

+r

2

2

+

n

∑

i=1

(a

2

pi

+b

pii

m

i,4

+2a

pi

b

pii

m

i,3

) +

∑

1≤i<j≤n

2(b

2

pi j

+b

pii

b

pj j

) −(

n

∑

i=1

b

pii

)

2

(5.57)

5.3.2.2 Joint Moments and Skewness of D

p

From the deﬁnition of D

p

as shown in (5.52), the joint moments between D

2

p

and X

i

,

X

2

i

, and R can be computed as:

E[X

i

D

2

p

] = 2d

p0

a

pi

+(a

2

pi

+2d

p0

b

pii

)m

i,3

+2a

pi

b

pii

m

i,4

+b

2

pii

m

i,5

+

2·

∑

1≤j≤n

j=i

a

pi

b

pj j

+2a

pj

b

p

i j +

(b

pii

b

pj j

+2b

2

pi j

)m

i,3

+2b

pi j

b

pj j

m

j,3

(5.58)

E[R

1

D

2

p

] = r

2

1

m

r,3

+2r

1

µ

D

p

(5.59)

E[R

2

D

2

p

] = r

2

2

m

r,3

−2r

1

µ

D

p

(5.60)

E[X

2

i

D

2

p

] = d

2

p0

+r

2

1

+r

2

2

+2d

p0

a

pi

m

i,3

+2a

pi

b

pii

m

i,5

+

(a

2

pi

+2d

p0

b

pii

) · m

i,4

+b

2

pii

m

i,6

+

∑

1≤j≤n

j=i

a

2

pj

+2d

p0

b

pj j

+2(a

pi

b

pj j

+2a

pj

b

pi j

)m

i,3

+2a

pj

b

pj j

m

j,3

+

2(b

pii

b

pj j

+4b

2

pi j

) · m

i,4

+b

2

pj j

m

j,4

+4b

pj j

b

pi j

m

i,3

m

j,3

+

2·

∑

1≤j<k≤n

j,k=i

(b

pj j

b

p

kk +2b

2

pjk

) (5.61)

112

E[X

i

X

j

D

2

p

] = 2a

pi

a

pj

+4d

p0

b

pi j

+2· (2a

pi

b

pi j

+a

pj

b

pii

)m

i,3

+

4b

pii

b

pi j

m

i,4

+2· (2a

pj

b

pi j

+a

pi

b

pj j

)m

j,3

+4b

pj j

b

pi j

m

j,4

+

2· (2b

2

pi j

+b

pii

b

p

j j)m

i,3

m

j,3

+4·

∑

1≤k≤n

k=i, j

b

pi j

b

pkk

(5.62)

With the joint moments computed above, the raw moments, central moments, and

skewness of D

p

is computed as:

E[D

2

p

] = µ

2

D

p

+σ

2

D

p

(5.63)

E[D

3

p

] = E[D

p

· D

2

p

]

= E[(d

p0

+A

p

X +X

T

B

p

X +r

1

R

1

−r

2

R

2

)D

2

p

]

= d

p0

· E[D

2

p

] +r

1

· E[R

1

D

2

p

] −r

2

· E[R

2

D

2

p

] +

n

∑

i=1

a

pi

· E[X

i

D

2

p

] +b

pii

· E[X

2

i

D

2

p

]

+

∑

0≤i<j≤n

b

pi j

E[X

i

X

j

· D

2

p

] (5.64)

m

D

p

(3) = E[(D−µ

D

)

3

]

= E[D

3

p

] −3µ

D

p

· E[D

2

p

] +2µ

3

D

p

(5.65)

γ

D

p

= m

D

p

(3)/σ

3

D

p

(5.66)

5.3.2.3 Reconstruct the Quadratic Form of Dm

Knowing µ

D

p

, σ

D

p

, and γ

D

p

, we apply the second-order ﬁtting method to compute the

ﬁtting parameters Θ = (θ

0

, θ

1

, θ

2

) for max(D

p

, 0). Then D

m

can be represented as:

D

m

= max(D

1

, D

2

) = max(D

p

, 0) +D

2

≈ θ

2

· D

2

p

+θ

1

· D

p

+θ

0

+D

2

= θ

2

· D

2

p

+D

q

+θ

0

(5.67)

D

q

= θ

1

· D

p

+D

2

(5.68)

113

The above equation gives the closed- form formula of D

m

. However, because D

p

and

D

q

are in quadratic form as (5.46), the representation of D

m

in (5.67) is a 4

th

order

polynomial of X

i

’s. In order to reconstruct the quadratic form for D

m

, we ﬁrst compute

the the mean of D

m

, and joint moments between D

m

and variation sources:

E[D

m

] = θ

2

· E[D

2

p

] +E[D

q

] +θ

0

(5.69)

E[X

i

D

m

] = θ

2

· E[X

i

D

2

p

] +E[X

i

D

q

] (5.70)

E[X

2

i

D

m

] = θ

2

· E[X

2

i

D

2

p

] +E[X

2

i

D

q

] +θ

0

(5.71)

E[X

i

X

j

D

m

] = θ

2

· E[X

i

X

j

D

2

p

] +E[X

i

X

j

D

q

] (5.72)

E[R

1

D

m

] = θ

2

· E[R

1

D

2

p

] +E[R

1

D

q

] (5.73)

E[R

2

D

m

] = θ

2

· E[R

2

D

2

p

] +E[R

2

D

q

] (5.74)

The joint moments between D

2

p

and X

i

’s are computed in (5.58)-(5.62). The joint

moments between D

q

and variation sources are:

E[D

q

] = d

q0

+

n

∑

i=1

b

qii

(5.75)

E[X

i

D

q

] = a

qi

m

i,2

+b

qii

m

i,3

(5.76)

E[X

2

i

D

q

] = d

q0

+a

qi

m

i,3

+b

qii

m

i,4

+

∑

1≤j≤n

j=i

b

qj j

(5.77)

E[X

i

X

j

D

q

] = 2b

qi j

(5.78)

E[R

1

D

q

] = θ

1

· r

1

(5.79)

E[R

2

D

q

] = (1−θ

1

) · r

2

(5.80)

114

Because we want to reconstruct D

m

in the quadratic form, as shown in (5.49), by

applying the moment matching technique similar to [178], we have:

E[D

m

] =

n

∑

i=1

b

mii

+d

m0

(5.81)

E[X

i

D

m

] = a

mi

+b

mii

· m

i,3

(5.82)

E[X

2

i

D

m

] = a

mi

· m

i,3

+b

nii

· m

i,4

+

∑

1≤j≤n

j=i

b

mj j

(5.83)

E[X

i

X

j

D

m

] = 2b

mi j

(5.84)

With E[D

m

], E[X

i

D

m

], E[X

2

i

D

m

], and E[X

i

X

j

D

m

] computed in (5.69)-(5.72), the sen-

sitivity coefﬁcients A

m

= (a

mi

) and B

m

= (b

mi j

) can be obtained by solving the linear

equations above:

b

mii

=

E[X

2

i

D

m

] −E[D

m

] −E[X

i

D

m

]m

i,3

m

i,4

−m

2

i,3

−1

(5.85)

b

mi j

= E[X

i

X

j

D

m

] (5.86)

a

mi

= E[X

i

D

m

] −b

mii

· m

i,3

(5.87)

d

m0

= E[D

m

] −

n

∑

i=1

b

mii

(5.88)

Finally, we consider the random term of D

m

. Because the random term in D

m

comes from the random terms in D

1

and D

2

, we assume that r

m

R

m

= r

m1

R

1

+r

m2

R

2

.

Because the random variation sources R

m

, R

1

, and R

2

are Gaussian random variables,

by applying the moment matching technique similar to (5.84), we have:

r

m1

= E[R

1

· D

m

] (5.89)

r

m2

= E[R

2

· D

m

] (5.90)

r

m

=

r

2

m1

+r

2

m2

(5.91)

Where E[R

1

· D

m

] and E[R

2

· D

m

] are computed in (5.73) and (5.74).

115

5.3.3 Sum Operation for Quadratic Delay Model

The sum operation is straight forward. The coefﬁcients of D

s

can be computed by

adding the correspondent coefﬁcients of D

1

and D

2

:

d

s0

= d

01

+d

02

(5.92)

A

s

= A

1

+A

2

(5.93)

B

s

= B

1

+B

2

(5.94)

r

s

=

r

2

1

+r

2

2

(5.95)

5.3.4 Computational Complexity of Quadratic SSTA

For each max operation in SSTA based on a quadratic delay model, we need to cal-

culate n

2

joint moments between D

p

and X

i

X

j

s; for each joint moment, we need to

compute the sum of n numbers; hence the computational complexity is O{n

3

}, where

n is the number of variation sources. For the sum operation, we need to compute the

sum of two n ×n metric; hence the computational complexity is O{n

2

}. Because the

total number of max and sum operations is linear with respect to circuit sizes N, the

total complexity is O{n

3

N}.

5.4 Semi-Quadratic SSTA

5.4.1 Semi-Quadratic Delay Model

In Section 5.3, we introduce the SSTA for quadratic delay model. However, in practice,

the impact of the crossing terms are usually very weak [178, 19]. Therefore, if we

ignore the crossing terms, we may speed up the SSTA process without affecting the

accuracy too much. Ignoring the crossing terms, the quadratic delay model in (5.46) is

116

re-written as:

D = d

0

+AX +BX

2

+rR (5.96)

where d

0

, A, r, and R are deﬁned in the similar way as the quadratic delay model in

(5.46), X

2

= (X

2

1

, X

2

2

, . . . , X

2

n

)

T

are the square of variation sources, and B= (b

1

, b

2

, . . . , b

n

)

are the second order sensitivity coefﬁcients for the square terms. We deﬁned such

quadratic model without crossing terms as Semi-Quadratic Delay Model. We will in-

troduce the atomic operations, max and sum, for the semi-quadratic delay model in the

rest of this section.

5.4.2 Max Operation for Semi-Quadratic Delay Model

The overall ﬂow of the max operation of semi-quadratic delay model is similar to that

of quadratic delay model as illustrated in Figure 5.2. The only difference is that we do

not need to compute the joint moments between the crossing terms and D

m

.

5.4.2.1 Moments of D

p

We deﬁned D

p

= D

1

−D

2

in a similar way as (5.52). In order to compute the central

moments of D

p

, we ﬁrst rewrite D

p

to the following form:

D

p

= d

p0

+

n

∑

i=1

Y

pi

+r

1

R

1

−r

2

R

2

(5.97)

d

p0

= d

p0

+

n

∑

i=1

b

pi

(5.98)

Y

pi

= a

pi

X

i

+b

pi

X

2

i

−b

pi

(5.99)

with a

pi

= a

1i

−a

2i

and b

pi

= b

1i

−b

2i

. Because X

i

s are mutually independent random

variables with mean 0 and variance 1, Y

pi

s are independent random variables with zero

117

mean. Therefore, the ﬁrst 3 central moments of D

p

are:

µ

D

p

= d

p0

(5.100)

σ

2

D

p

= r

2

1

+r

2

2

+

n

∑

i=1

E[Y

2

pi

] (5.101)

m

D

p

(3) = E[(D

p

−µ

D

p

)

3

] = r

3

1

+r

3

2

+

n

∑

i=1

E[Y

3

pi

] (5.102)

where

E[Y

2

pi

] = b

2

pi

(m

i,4

−1) +2a

pi

b

pi

m

i,3

+a

2

pi

(5.103)

E[Y

3

pi

] = 3a

pi

b

2

pi

m

i,5

−2m

i,3

+3a

2

pi

b

pi

m

i,4

−1

+

a

3

pi

m

i,3

+b

3

pi

m

i,6

−3m

i,4

+2

(5.104)

With the central moments of D

p

, the raw moments and skewness can be computed

easily.

5.4.2.2 Reconstruct the Semi-Quadratic Form of D

m

Similar to the quadratic SSTA, by knowing the mean, variance and skewness of D

p

,

the ﬁtting coefﬁcients Θ can be obtained. Then the joint moments between X

i

s and D

m

can be computed in the similar ways as (5.69)-(5.74). Here the joint moments between

X

i

s and D

2

p

, D

q

are:

E[X

i

D

2

p

] = E[X

i

Y

2

pi

] +2d

0

E[X

i

Y

pi

] (5.105)

E[X

2

i

D

2

p

] = E[X

2

i

Y

2

pi

] +2d

0

E[X

2

i

Y

pi

] +E[X

2

i

]E[(D

p

−Y

pi

)

2

] (5.106)

E[X

i

D

q

] = E[X

i

Y

qi

] (5.107)

E[X

2

i

D

q

] = E[X

2

Y

qi

] (5.108)

where E[Y

2

pi

] are computed in (5.102), E[(D

p

−Y

pi

)

2

] = σ

2

D

p

−E[Y

2

pi

], and Y

qi

s are

deﬁned in the similar way as Y

pi

s:

Y

qi

= a

qi

X

i

+b

qi

X

2

i

−b

qi

(5.109)

118

with a

qi

= θ

1

(a

1

−a

2

) +a

2

and b

qi

= θ

1

(b

1

−b

2

) +b

2

. The joint moments between

X

i

s and Y

pi

s are:

E[X

2

i

Y

2

pi

] =

a

2

pi

−2

2

pi

m

i,4

+b

2

pi

m

i,6

−1

+2a

pi

b

pi

m

i,5

−m

i,3

(5.110)

E[X

2

i

Y

pi

] = a

pi

m

i,3

+b

pi

m

i,4

−1

(5.111)

E[X

i

Y

2

pi

] = b

2

pi

m

i,5

+2a

pi

b

pi

m

i,4

−1

+

a

2

pi

−2b

2

pi

m

i,3

(5.112)

E[X

i

Y

pi

] = a

pi

+b

pi

m

i,3

(5.113)

The joint moments between X

i

s and Y

qi

s can be computed in the same way.

Knowing the joint moments between X

i

s and D

m

, by applying the moment match-

ing technique, we will have similar equations as (5.84). And then A

m

and B

m

can be

obtained by solving such equations. Finally, we can compute the random term of D

m

in the same way as the quadratic model, as shown in (5.91).

The sum operation of the semi-quadratic delay model is similar to that of the

quadratic delay model. We can compute D

s

= D

1

+D

2

by summing up the corre-

spondent coefﬁcients.

From the discussion above, it is easy to see that for the semi-quadratic SSTA, the

computational complexity for both max and sum operation is O{n}, where n is the

number of variation sources. The number of max and sum operations is linear to the

circuit size.

5.5 Linear SSTA

5.5.1 Linear Delay Model

In the previous sections, we introduce the SSTA for non-linear delay model. In prac-

tice, when the variation scale is small, the circuit delay can be approximated as a linear

119

function of variation sources:

D = d

0

+A· X +r · R (5.114)

where X, d

0

, A, r, and R are deﬁned in the similar way as the quadratic delay model

in (5.46). The difference is that there are no second order terms. Hence the atomic

operations of the linear delay model are much simpler than those of quadratic delay

model. We will introduce such operations in the rest of this section.

5.5.2 Max Operation for Linear Delay Model

The overall ﬂow of the max operation of linear delay model is similar to that of

quadratic delay model as illustrated in Figure 5.2 except that we do not need to com-

pute the high order joint moments between D

m

and X

i

s.

5.5.2.1 Mean and Variance of D

p

The linear form of D

p

= D

1

−D

2

can be computed in the similar way as (5.52). Then

the mean and variance of D

p

can be computed as:

µ

D

p

= E[D

p

] = d

p0

; (5.115)

σ

2

D

p

= E[(D−µ

D

)

2

] =

n

∑

i=1

a

2

pi

+r

2

1

+r

2

2

(5.116)

5.5.2.2 Joint Moments and Skewness of D

p

The joint moments between D

2

p

and the variation sources and random variation can be

computed as:

E[X

i

· D

2

p

] = a

2

pi

· m

i,3

+2a

pi

· d

p0

(5.117)

120

With the joint moments computed above, the third order raw moments D

p

is computed

as:

E[D

3

p

] =

n

∑

i=1

a

i

E[X

i

· D

2

p

] +r

1

E[R

1

· D

2

p

] +r

2

E[R

2

· D

2

p

] (5.118)

The second order raw moment, central moments, and skewness of D

p

can be computed

in the same way as the quadratic SSTA in (5.66).

5.5.2.3 Reconstruct the Linear Canonical Form of Dm

Knowing the mean, variance, and skewness of D

p

, we may apply the method in Sec-

tion 5.2.2 to compute the ﬁtting parameters Θ for max(D

p

, 0), and then represent D

m

in the form of:

D

m

≈ θ

2

· D

2

p

+D

q

+ p

0

(5.119)

where D

q

is deﬁned in the similar way as (5.67). The above equation gives the closed-

form formula of D

m

, however, such formula is not in the linear canonical form. We

still apply moment matching technique to reconstruct D

m

to the linear form. First, the

mean of D

m

and the joint moments between Dm and variation sources are:

E[D

m

] = θ

2

· E[D

2

p

] +E[D

q

] + p

0

(5.120)

E[X

i

· D

m

] = θ

2

· E[X

i

· D

2

p

] +E[X

i

· D

q

] (5.121)

Where E[D

2

p

] is computed in (5.118). The joint moments between D

q

and variation

sources are computed as:

E[D

q

] = d

s0

(5.122)

E[X

i

· D

q

] = a

qi

(5.123)

121

With the joint moments, by applying the moment matching technique in [178], we

have:

a

mi

= E[X

i

· D

m

] (5.124)

d

m0

= E[D

m

] (5.125)

Finally, the random term r

m

can be computed in the same way as the quadratic model

in (5.91).

For the linear SSTA we discussed above, the computational complexity for both

max and sum operation is O{n}. Similar to the quadratic SSTA, the number of max

and sum operations is linear to the circuit size.

5.6 Experimental Result

We have implemented our SSTA algorithmin Cfor all three delay models we discussed

before: quadratic delay model (Quad SSTA), semi-quadratic delay model (Semi-Quad

SSTA), and linear delay model (Lin SSTA). We also deﬁne three comparison cases:

(1) our implementation of the linear SSTA for Gaussian variation sources in [163],

which we refer to as Lin Gau; (2) the non-linear SSTA using Fourier Series approxi-

mation for non-Gaussian variation sources in [40] (Fourier SSTA); (3) 100,000-sample

Monte-Carlo simulation (MC). We apply all the above methods to the ISCAS89 suite

of benchmarks in TSMC 90nm technology.

In our experiment, we consider two types of variation sources gate channel length

L

gate

and threshold voltage V

th

. For each type of variation source, inter-die, intra-die

spatial, and intra-die random variation are considered. We use the grid-based model

in [172, 173] to model the spatial variation. The number of grids (the number of

spatial variation sources) is determined by the circuit size and larger circuits have more

variation sources. We also assume that the 3σ value of the inter-die (σ

g

), intra-die

122

spatial (σ

s

), and intra-die random (σ

r

) variation are 10%, 10%, and 5% of the nominal

value, respectively. In the following, we perform the experiments for two variation

setting: (1) both L

gate

and V

th

have skew-normal distributions [16]; and (2) L

gate

has

a normal distribution and V

th

has a Poisson distribution. The experimental setting is

shown in Table 5.1.

Setting L

gate

dist V

th

dist 3σ

g

3σ

s

3σ

r

(1) Skewnormal Skewnormal 10% 10% 5%

(2) Normal Poisson 10% 10% 5%

Table 5.1: Experiment setting. The 3σ value is the normalized with respect to the

nominal value.

Fig. 5.3 illustrates the PDF comparison for circuit s15850 under variation setting

(1). In the ﬁgure, delay is normalized with respect to the nominal value. From the ﬁg-

ure, we ﬁnd that, compared to the Monte-Carlo simulation, the Quad SSTA is the most

accurate, the next is Semi-Quad SSTA, and Lin SSTA is least accurate. Such result is

expected, because the Quad SSTA captures all the second-order effects; the Semi-Quad

SSTA captures only partial second order effects; while the Lin SSTA captures only lin-

ear effects and ignores all non-linear effects. Moreover, Fourier SSTA [40] has similar

accuracy as Semi-Quad SSTA because they both apply semi-quadratic delay model. In

addition, we also ﬁnd that all these three non-Gaussian SSTA is more accurate than

Lin Gau [163].

Table 5.2 compares the run time in second (T), and the error percentage of mean

(µ), standard deviation (σ), and skewness (γ) under variation setting (1). In the table,

the error percentage of mean (µ), standard deviation (σ), and 95-percentile point (95%)

is computed as 100 ×(MC value −SSTA value)/σ

MC

, and the error percentage of

skewness γ is computed as 100 ×(γ

MC

−γ

SSTA

)/γ

MC

. Moreover, the average error

in the table is average of the absolute value; and the average runtime is the average

runtime ratio between SSTA and Monte-Carlo simulation. For each benchmark, G

123

0.6 0.7 0.8 0.9 1 1.1 1.2 1.3 1.4 1.5

0

0.5

1

1.5

2

2.5

3

3.5

Normalized Delay

MC

Quad SSTA

Semi−Quad SSTA

Lin SSTA

Lin Gau

Fourier SSTA

Figure 5.3: PDF comparison for circuit s15850.

refers to the number of gates; and N refers to the total number of variation sources.

For fair comparison, when we perform Lin Gau on non-Gaussian variation sources, we

ﬁrst approximate the non-Gaussian random variables to the Gaussian random variables

by matching the mean and variance, then use Lin Gau to obtain the linear canonical

form of the circuit delay, ﬁnally, we still use the non-Gaussian variation sources to

reconstruct the PDF of the circuit delay. From the table, we see that for the Quad

SSTA, the error of mean, standard deviation, and 95-percentile point is within 0.3%,

and the error of skewness is within 5%. Semi-Quad SSTA, Lin SSTA result similar

mean and standard deviation error, but the error of skewness and 95-percentile point

is much larger, especially for the Lin SSTA, the error of skewness is up to 30% and

the error of 95-percentile point is up to 2%. This is because the Lin SSTA ignores all

non-linear effects which signiﬁcantly affect the skewness, and the inaccurate skewness

results larger error of 95-percentile point. Compared to Lin Gau, all the three non-

Gaussian SSTA methods predict more accurate mean values and 95-percentile points,

which is the most important timing characteristic, and the non-linear SSTA methods,

124

Quad SSTA and Semi-Quad SSTA also give much more accurate skewness. Moreover,

we also ﬁnd that both Semi-Quad SSTA and Lin SSTA have similar run time as Lin

Gau, but the run time of Quad SSTA is longer especially when the number of variation

sources is large. This is because the computational complexity of Semi-Quad SSTA,

Lin SSTA, and Lin Gau is the same, but Quad SSTA has higher complexity than the

other two SSTA methods. We also observe that compared to Fourier SSTA, Semi-

Quad SSTA has similar accuracy with almost 70X speed up. This is due to the fact

that Semi-Quad SSTA uses the same semi-quadratic delay model as Fourier SSTA,

while the computational complexity of Semi-Quad SSTA (O{n}) is lower than that

of Fourier SSTA (O{nk

2

}), where n is the number of variation sources, and k is the

maximum number of orders of Fourier Series [40]. After all, the run time of all the

SSTA methods is signiﬁcantly shorter than the Monte-Carlo simulation.

Table 5.3 illustrates the results under variation setting (2). From this table, we

can ﬁnd the similar trend as Table 5.2. Moreover, we also ﬁnd that our methods are

still highly accurate under such variation setting. For Quad SSTA, the error of mean,

standard deviation, and 95-percentile point is within 1%, and the error of skewness is

within 6%. This shows that our approach works well for different distributions.

125

bench G N Quad SSTA Semi-Quad SSTA Lin SSTA Lin Gau Fourier SSTA MC

name µ σ γ 95% T µ σ γ 95% T µ σ γ 95% T µ σ γ 95% T µ σ γ 95% T T

s208 61 12 -0.02 0.04 1.03 -0.03 0.01 -0.02 0.04 -10.7 -0.23 0.01 -0.02 -0.32 -22.8 -1.1 0.01 -0.02 -0.32 -22.8 -2.99 0.01 -0.04 -0.07 -9.5 -0.4 1.01 15.3

s386 118 18 -0.08 0.01 2.52 -0.03 0.01 -0.05 -0.05 -9.45 -0.36 0.01 -0.06 -0.4 -22.5 -1.23 0.01 -0.04 -0.36 -23 -2.96 0.01 0.02 -0.02 -10.9 0.36 1.17 33.6

s400 106 25 0.23 -0.06 -0.88 0.13 0.03 0.3 -0.12 -15.2 -0.21 0.01 0.29 -0.65 -31.4 -1.51 0.01 0.35 -0.66 -31.7 -3.56 0.01 -0.17 -0.24 -10 -0.2 1.28 38.7

s444 119 25 -0.1 -0.12 2.92 -0.17 0.04 -0.09 -0.15 -12.2 -0.51 0.01 -0.09 -0.56 -26.8 -1.56 0.01 -0.07 -0.5 -27.1 -3.37 0.01 -0.02 -0.22 -9.95 -0.57 1.38 41.9

s832 262 33 -0.06 -0.04 -0.11 -0.17 0.16 -0.06 -0.05 -12.1 -0.4 0.01 -0.06 -0.5 -27.6 -1.53 0.01 -0.06 -0.5 -27.6 -3.5 0.01 -0.1 -0.25 -14 -0.32 1.91 103

s953 311 43 -0.09 -0.11 2.03 -0.32 0.17 -0.15 -0.24 -9.87 -0.86 0.01 -0.17 -0.68 -27.7 -1.98 0.01 -0.01 -0.67 -28.1 -3.7 0.01 -0.02 -0.14 -7.23 -0.59 1.99 122

s1238 428 55 0.12 0.01 0.59 0.15 0.43 0.22 -0.04 -10.4 -0.04 0.01 0.21 -0.47 -26.5 -1.12 0.01 0.28 -0.45 -26.6 -2.98 0.01 -0.16 -0.04 -6.59 -0.63 2.17 201

s1423 490 68 -0.01 -0.08 -1.25 0.06 0.75 -0.01 -0.08 -10.2 -0.14 0.01 -0.01 -0.49 -25.2 -1.18 0.01 -0.01 -0.47 -25.3 -3.12 0.01 -0.21 -0.01 -4.15 -0.3 2.18 288

s1494 588 68 -0.08 -0.08 3.63 0.08 1.06 -0.07 -0.16 -6.02 -0.24 0.01 -0.08 -0.52 -20.6 -1.14 0.01 -0.01 -0.44 -21.4 -2.85 0.01 -0.2 -0.12 -4.41 -0.22 2.28 320

s5378 1004 82 0.06 -0.12 4.71 -0.16 4.95 0.09 -0.2 -6.72 -0.49 0.05 0.08 -0.66 -26.7 -1.65 0.02 0.18 -0.56 -27.4 -3.31 0.02 -0.07 -0.11 -5.68 -0.6 5.42 1238

s9234 2027 99 0.01 -0.11 0.98 -0.24 10.4 0.04 -0.23 -8.74 -0.61 0.09 0.02 -0.71 -28.4 -1.82 0.04 0.29 -0.66 -28.7 -3.43 0.03 0.04 -0.07 -13.7 0.61 6.89 2801

s13207 2573 115 -0.01 -0.01 3.04 -0.06 25.9 0.02 -0.09 -4.4 -0.33 0.15 0.01 -0.49 -22.2 -1.34 0.07 0.16 -0.49 -22.2 -3.09 0.06 0.09 -0.19 -13.3 0.54 9.04 4717

s15850 3448 135 0.18 -0.02 2.66 0.27 38.9 0.23 -0.07 -4.85 0.06 0.19 0.22 -0.52 -24.6 -1.11 0.08 0.51 -0.53 -24.7 -2.92 0.08 0.18 -0.09 -4.38 0.64 11.5 6484

s38417 8709 176 0.11 -0.01 2.22 0.07 202 0.13 -0.04 -3.8 -0.07 0.62 0.13 -0.49 -23.1 -1.21 0.26 0.24 -0.47 -23.2 -3.16 0.24 0.08 -0.19 -10.5 0.79 23.2 19116

s38584 11448 176 -0.05 -0.02 0.1 0.01 218 -0.04 -0.12 -6.37 -0.29 0.68 -0.06 -0.57 -25.1 -1.44 0.29 0.11 -0.57 -25.2 -3.31 0.25 -0.11 -0.19 -14.4 -0.53 25.1 19459

Ave - - 0.08 0.06 1.91 0.13 1/720 0.1 0.11 8.74 0.32 1/19814 0.1 0.54 25.4 1.39 1/35819 0.15 0.51 25.7 3.22 1/39248 0.1 0.13 9.25 0.49 1/260 -

Table 5.2: Error percentage of mean, standard deviation, skewness, and 95-percentile point for variation setting

(1). Note:the error percentage of mean (µ), standard deviation (σ), and 95-percentile point (95%) is computed as

100×(MC value−SSTA value)/σ

MC

, and the error percentage of skewness γ is computed as 100×(γ

MC

−γ

SSTA

)/γ

MC

.

1

2

6

bench G N Quad SSTA Semi-Quad SSTA Lin SSTA Lin Gau Fourier SSTA MC

name µ σ γ 95% T µ σ γ 95% T µ σ γ 95% T µ σ γ 95% T µ σ γ 95% T T

s208 61 12 -0.03 0.05 1.64 -0.05 0.01 -0.03 0.04 -20 -0.35 0.01 -0.05 -0.07 -58.9 -0.81 0.01 -0.05 -0.07 -58.9 -1.17 0.01 0.02 -0.25 -15 0.63 0.99 15.4

s386 118 18 -0.07 0.01 5.54 -0.05 0.02 -0.05 -0.04 -16.5 -0.4 0.01 -0.08 -0.16 -58.2 -0.91 0.01 -0.08 -0.11 -58.8 -1.15 0.01 -0.01 -0.02 -16.5 -0.34 1.2 33.9

s400 106 25 0.25 -0.07 -2.48 0.25 0.02 0.32 -0.13 -25.7 -0.18 0.01 0.29 -0.3 -71.3 -0.93 0.01 0.35 -0.3 -71.9 -1.21 0.01 -0.15 -0.27 -20.8 -0.36 1.27 39

s444 119 25 -0.09 -0.12 3.27 -0.23 0.03 -0.08 -0.14 -22.4 -0.62 0.01 -0.1 -0.28 -65 -1.22 0.01 -0.09 -0.19 -65.1 -1.43 0.01 0.01 -0.09 -13.6 0.3 1.38 42.1

s832 262 33 -0.05 -0.04 -3.26 -0.14 0.14 -0.05 -0.05 -23.5 -0.48 0.01 -0.08 -0.19 -66.9 -1.1 0.01 -0.08 -0.19 -66.9 -1.45 0.01 -0.01 -0.18 -13.4 -0.38 1.93 103

s953 311 43 -0.09 -0.11 3.96 -0.44 0.18 -0.15 -0.23 -16.6 -1 0.01 -0.19 -0.38 -66.9 -1.67 0.01 -0.02 -0.35 -67.4 -1.75 0.01 -0.12 -0.22 -16.5 -0.38 1.95 122

s1238 428 55 0.11 -0.02 -3.92 0.09 0.43 0.21 -0.05 -22 -0.13 0.01 0.18 -0.19 -69.8 -0.76 0.01 0.24 -0.16 -70.2 -1 0.01 -0.17 -0.05 -16.1 -0.58 2.15 201

s1423 490 68 -0.01 -0.08 -4.96 0.01 0.75 -0.01 -0.09 -20.2 -0.22 0.01 -0.03 -0.22 -64.1 -0.82 0.01 -0.02 -0.2 -64 -1.2 0.01 0.16 -0.26 -16.9 0.59 2.19 288

s1494 588 68 -0.08 -0.11 1.71 0.03 1.05 -0.07 -0.18 -16.9 -0.32 0.01 -0.1 -0.3 -59 -0.85 0.01 -0.04 -0.21 -58.3 -1.05 0.01 0.17 -0.15 -16.7 0.6 2.29 322

s5378 1004 82 0.05 -0.17 1.26 -0.14 4.95 0.08 -0.25 -16.1 -0.54 0.04 0.04 -0.41 -68.3 -1.23 0.02 0.12 -0.27 -68.5 -1.24 0.01 -0.02 -0.1 -19.4 -0.39 5.44 1238

s9234 2027 99 0.01 -0.13 2.52 -0.22 10.4 0.04 -0.25 -12.5 -0.65 0.08 0.01 -0.42 -68.6 -1.39 0.03 0.25 -0.34 -69 -1.27 0.04 -0.1 -0.2 -15.3 -0.41 6.76 2795

s13207 2573 115 0.01 -0.02 5.21 0.01 25.7 0.04 -0.09 -7.95 -0.27 0.15 0.01 -0.23 -62.3 -0.89 0.06 0.14 -0.21 -62.4 -1.01 0.06 -0.11 -0.04 -13.2 -0.28 9.11 4720

s15850 3448 135 0.18 -0.07 0.4 0.18 38.8 0.22 -0.12 -12.1 -0.02 0.19 0.18 -0.27 -66.4 -0.77 0.08 0.47 -0.25 -66.5 -0.84 0.07 0.18 -0.16 -19.9 0.4 11.3 6477

s38417 8709 176 0.1 -0.03 4.59 0.05 202 0.12 -0.06 -6.09 -0.14 0.62 0.09 -0.2 -63.8 -0.84 0.27 0.2 -0.17 -63.9 -1.07 0.24 0.13 -0.27 -17.4 0.6 23.6 19097

s38584 11448 176 -0.06 -0.08 -3.38 -0.04 218 -0.05 -0.16 -14.7 -0.32 0.68 -0.09 -0.31 -65.2 -1.03 0.29 0.08 -0.3 -65.3 -1.22 0.26 -0.07 -0.07 -19.8 -0.6 25.2 19460

Ave - - 0.08 0.07 3.21 0.13 1/681 0.1 0.13 16.9 0.38 1/20497 0.1 0.26 65 1.01 1/37942 0.15 0.22 65.1 1.2 1/42392 0.09 0.16 16.7 0.46 1/260 -

Table 5.3: Mean, standard deviation, and skewness comparison for variation setting (2).

1

2

7

5.7 Conclusions

In this chapter, we have proposed a new method to approximate the max operation of

two non-Gaussian random variables using second-order polynomial ﬁtting. It has been

shown that such approximation is more accurate than the approximation using lin-

ear ﬁtting through tightness probability. By applying such approximation, we present

new SSTA algorithms for three different delay models, i.e., quadratic model, quadratic

model without crossing terms (semi-quadratic model), and linear model. All atomic

operations of these algorithms are performed by closed-form formulas, hence they are

very time efﬁcient. The computational complexity of both the linear delay model and

the semi-quadratic delay model is linear to the number of variation sources, and that

of the quadratic delay model is cubic (third-order) to the number of variation sources.

Moreover, the computational complexity is linear to the circuit size for all three delay

models. The SSTA with semi-quadratic delay model results similar error as the SSTA

using Fourier Series [40] with almost 70X speed up. Moreover, compared to Monte-

Carlo simulation for non-Gaussian variation sources, the SSTA with quadratic delay

model results the error of mean, standard deviation, skewness, and 95-percentile point

is within 1%, 1%, 6%, and 1%, respectively.

128

CHAPTER 6

Physically Justiﬁ able Die-Level Modeling of Spatial

Variation in View of Systematic Across-Wafer

Variability

Modeling spatial variation is important for statistical analysis. Most existing works

model spatial variation as spatially correlated random variables. We discuss process

origins of spatial variability, all of which indicate that spatial variation comes from

deterministic across-wafer variation, and purely random spatial variation is not signiﬁ-

cant. We analytically study the impact of across-wafer variation and show how it gives

an appearance of correlation. We have developed a new die-level variation model

considering deterministic across-wafer variation and derived the range of conditions

under which ignoring spatial variation altogether may be acceptable. Experimental

results show that for statistical timing and leakage analysis, our model is within 2%

and 5% error from exact simulation result, respectively, while the error of the existing

distance-based spatial variation model is up to 6.5% and 17%, respectively. Moreover,

our new model is also 6X faster than the spatial variation model for statistical timing

analysis and 7X faster for statistical leakage analysis.

129

6.1 Introduction

In Chapter 5, we developed an efﬁcient and accurate SSTA ﬂow in which process

variation is modeled as inter-die, within-die spatial, and within-die random variations.

However, new measurement results show that the variation model in Chapter 5 is not

accurate. Therefore, we developed a more accurate and efﬁcient variation model and

applied it on the SSTA ﬂow presented in Chapter 5.

There are several existing works that focus on analyzing and modeling process

variation [6, 32, 172, 106, 108, 134, 24, 25, 173]. The simplest method models pro-

cess variation as the sum of inter-die (global) variation and independent within-die

(local random) variation [25]. Later, it was observed that within-die variation is spa-

tially correlated and the correlation depends on the distance between two within-die

locations. [6, 32] model spatial variation as correlated random variables, and principle

component analysis is applied to perform statistical timing analysis. In this model, a

chip is divided into several grids and each grid has its own spatial variation. The spatial

variations of different grids are correlated and the correlation coefﬁcient depends on

the distance between two grids. [172, 173] focuses on the extraction of spatial corre-

lation and it models the correlation coefﬁcient as a function of distance. Several more

complex spatial correlation models have been proposed in [45, 105, 189, 55, 64].

In contrast to the spatial correlation models, process oriented modeling has con-

cluded that within-die spatial variation is caused by deterministic across wafer and

across-ﬁeld variation while purely random within-die spatial variation is not signif-

icant [190, 54, 50]. However, in practical design ﬂow, designers do not know the

within-wafer location or within-ﬁeld location of each die; therefore, we need to ana-

lyze the impact of across-wafer variation and across-ﬁeld variation on die-scale. Since

silicon measurements cited in this chapter indicate that across-wafer variation is much

more signiﬁcant than the across-ﬁeld variation, we consider only across-wafer varia-

130

tion in this chapter, but the approach can be easily extended to account for across-ﬁeld

variation.

In this chapter, we ﬁrst analyze the impact of deterministic across-wafer variation

on spatial correlation. We observe that when quadratic across-wafer variation model

is used as in [54, 186, 125]:

1. Different locations of the chip may have different mean and variance. Such

differences increase when the chip size increases.

2. When chip size is small, the correlation coefﬁcients for a certain Euclidean dis-

tance are within a narrow range. This explains why most existing works ﬁnd that

spatial correlation is a function of distance.

3. Within-die spatial variation is NOT spatially correlated when across-wafer sys-

tematic variation is removed.

4. Within-die spatial variation is NOT independent from inter-die variation.

5. If chip size is not large, the two-level inter-/within-die decomposition of process

variation is still very accurate.

Based on our analysis, we propose three accurate and efﬁcient spatial variation

models considering across-wafer variation. We then apply our new model to the SSTA

ﬂow introduced in Chapter 5. Experimental results show that our model is more accu-

rate and efﬁcient compared to the distance-based spatial variation model in [172, 173].

Compared to the exact simulation, the error of our model for statistical timing anal-

ysis is within 2% and the error for statistical leakage analysis is within 5%. On the

other hand, the error of the distance-based spatial correlation model is up to 6.5% for

statistical timing analysis and up to 17% for statistical leakage analysis. Moreover,

131

our model is 6X faster than the distance-based spatial correlation model for statistical

timing analysis and 7X faster for statistical leakage analysis.

The rest of this chapter is organized as follows: Section 6.2 discusses the physical

causes for across-wafer variation; Section 6.3 analyzes the impact of across-wafer

variation on die-scale; Section 6.4 discusses the case when the across-wafer variation

is not a perfect parabola; Section 6.5 introduces the new variation models; the new

models are applied to statistical timing analysis in Section 6.6 and statistical leakage

analysis in Section 6.7; Section 6.8 summarizes the advantages and disadvantages of

different variation models; and ﬁnally Section 6.9 concludes this chapter.

6.2 Physical Origins of Spatial Variation

In silicon manufacturing, there are many steps that cause non-uniformity in devices

across the wafer. Interestingly, most of these processes by the very nature of the equip-

ment follow a radially varying trend across the wafer. Most processes are “center-fed”

or “edge-fed” with the boundary conditions at the edge of wafer being substantially

different. Moreover, wafers are often rotated to increase process uniformity across

them which further leads to radial behavior of non-uniformity. This is further exacer-

bated by advent of single-wafer processing for 300mm wafers.

For example, overlay error includes errors in the position and rotation of the wafer

stage during exposure, wafer stage vibration, and the distortion of the wafer with re-

spect to the exposure pattern [139]. Magniﬁcation and rotation components of overlay

error increase from center of the wafer outwards.

1

During chemical vapor deposition

(CVD) step, species depletion and temperature non-uniformity on the wafer at lower

temperatures may cause thickness non-uniformity [159, 132]. Redeposition effect in

1

Overlay error can directly impact critical dimension in double patterning.

132

physical vapor deposition (PVD) [27] may cause non-uniformity. Moreover, center

peak shape of the RF electric ﬁeld distribution [77] also leads to a center peak shape of

etch rate, and chamber wall conditions [78] also cause etch rate non-uniformity. In real

processes, the wafers are rotated to improve uniformity. [78, 27] show that the etch

rate varies radially across the wafer: the etch rate is high at the center of the wafer and

decreases toward the edges. Post-exposure bake (PEB) temperatures are higher at the

center of the wafer and decrease outwards [185]. Similarly, other processes ranging

from resist coat to wafer deformation due to vacuum chuck holding it follow a bowl-

shaped trend across the wafer. All these processes cause a systematic across-wafer

variation in physical dimensions.

Across-wafer variation of gate length observed in several recent silicon measure-

ments [153, 54, 186, 125] validates our arguments. [126] also shows that ring oscillator

frequency and leakage current decrease from the center to the edge of the wafer. Fig-

ure 6.1 shows industry data of ring oscillator frequency for wafers from two different

industry processes. Process 1 is with 45nm technology and process 2 is with 65nm

technology. From the ﬁgure, we see that for both process, ring oscillator frequency

decreases from the center to the edge of the wafer. Moreover, it has also been shown

that there is no spatial correlation for threshold voltage variation [190]. Therefore, the

across-wafer frequency and leakage variation is mainly caused by gate length varia-

tion.

It has been shown that for process 1, the across-wafer frequency variation can be

approximated as a quadratic function (a parabola) [126]. For process 2, the across-

wafer variation is not a perfect parabola as process 1. However, it follows a systematic

trend and the ring oscillator frequency decreases from the center to the edge of the

wafer. Since the amount measurement data of for process 2 (more than 300 wafers)

is much more than process 1, in the rest of this chapter, all of our simulation and

133

Figure 6.1: Ring oscillator frequency within a wafer. (a) Process 1; (b) Process 2.

experiments are based on the measurement result of process 2.

Besides across-wafer variation, lithography-induced effects such as lens aberra-

tions can lead to systematic across-ﬁeld variation and across-die variation. Across-die

variation can be modeled as within-die deterministic mean shift and will not cause

within-die spatial correlation. Moreover, silicon measurements cited in this chapter in-

dicate that across-wafer variation is much more signiﬁcant (probably due to advance-

ments in resolution enhancement and lithographic equipment) than across-ﬁeld and

across-die variation. Hence, for simplicity, we consider only across-wafer variation in

this chapter.

6.3 Analysis of Wafer Level Variation and Spatial Correlation

To analyze the impact of across-wafer variation on die-scale, we assume the across-

wafer variation to be a quadratic function as in[24, 54, 186, 125]:

v

p

= ax

2

w

+by

2

w

+cx

w

+dy

w

+m

w

+m

f

+m

d

+m

r

(6.1)

134

where v

p

is a variation source, such as L

e f f

, a, b, c, and d are function coefﬁcients,

which are obtained from ﬁtting the measurement data from industry process as shown

in Figure 6.1, m

w

comprises inter-die random, inter-wafer, inter-lot variation and the

ﬁtting error of quadratic ﬁtting as in (6.1); m

f

and m

d

are the across-ﬁeld and across-

die variation, respectively, as discussed in Section 6.2, we consider only across-wafer

variation and ignore these two types of variations (m

f

and m

d

) in this chapter; m

r

is

the random noise, and (x

w

, y

w

) is across-wafer location. In the rest of this section, we

will analyze the spatial variation based on the above model. Table 7.3 summarizes the

mathematical notations used in this section.

We obtain the coefﬁcients of the above across-wafer variation model by ﬁtting the

industrial 65nm process measured ring oscillator delay with 300 wafers from 23 lots.

In the rest of this section, all simulations are based-on this extracted model.

6.3.1 Variation of Mean and Variance with Location

According to (6.1), it is easy to ﬁnd that for a die, whose center lies on (x

c

, y

c

) wafer

coordinates, the variation at location (x, y) in a die (assuming the coordinate of the

center of the die to be (0, 0)) is:

v

p

(x, y) = a(x

c

+x

)

2

+b(y

c

+y

)

2

+c(x

c

+x

) +d(y

c

+y

) +m

w

+m

r

(x, y) (6.2)

The deﬁnitions of (x

c

, y

c

) and (x

, y

**) are in Table 7.3. In this section, we assume that
**

a, b, c, and d are ﬁxed for a process. In practice, these coefﬁcients may vary slightly

for wafer-to-wafer or lot-to-lot. We will further discuss this in Section 6.4.

In real design ﬂow, the die location in the wafer (x

c

, y

c

) is not known to designers.

We can convert the wafer-level systematic variation model to a die-level model by

noting that the dies are always distributed evenly on the wafer. Therefore, we may

model x

c

and y

c

as random variables which are evenly distributed in the center at (0, 0)

135

Symbols Description

(x

c

, y

c

) Location of the center of the die in the wafer

r

w

wafer radius

m

w

Inter-die variation

m

r

Within-die random variation

σ

2

m

Variance of m

w

σ

2

r

Variance of m

r

a, b, c, d Across-wafer variation coefﬁcients

(l

x

, l

y

) x and y dimension die size

(x

w

, y

w

) Within-wafer location

(x, y) Within-die location

ω Angle between the die and wafer coordinate

v

p

(x, y) Variation of within-die location (x, y)

x

x

= xcos ω+ysinω

y

y

= xsinω−ycos ω

l

x

l

x

= l

x

cos ω+l

y

sinω

l

y

l

y

= l

x

sinω−l

y

cos ω

x

x

= ax

+c/2

y

y

= by

+d/2

l

x

l

x

= ax

+c/2

l

y

l

y

= by

+d/2

r

dµ

r

dµ

=

x

2

/a+y

2

/b

r

dσ

r

dσ

=

x

2

+y

2

δ Euclidean distance between (x

1

, y

1

) and (x

2

, y

2

)

δ =

(x

1

−x

2

)

2

+(y

1

−y

2

)

2

r

m

r

m

=

l

2

x

+l

2

y

r

m

r

m

=

l

2

x

+l

2

y

r

m

r

m

=

l

2

x

+l

2

y

k

0

k

0

= r

2

w

(a+b)/4−c

2

/4a−d

2

/4b

k

1

k

1

= r

4

w

(a

2

+b

2

)/16−r

4

w

ab/24+σ

2

m

k

2

k

2

= k

1

/r

2

w

α α = x

1

x

2

+y

1

y

2

β β =σ

2

r

/r

2

w

in the wafer s

0

s

0

= cos

2

ω(al

2

x

+bl

2

y

) +sin

2

ω(bl

2

x

+al

2

y

)

v

g

Inter-die variation

v

s

Within-die spatial variation

v

l

Within-die random variation

Table 6.1: Notations.

136

with radius r

w

(radius of the wafer). It is easy to see that x

c

and y

c

are identical and

their PDF is

2

:

PDF(x

c

) = 2

r

2

w

−x

2

c

/(πr

w

) −r

w

< x

c

< r

w

(6.3)

In this case, the variation at location (x, y), v

p

(x, y), is expressed as a function of

four random variables: x

c

, y

c

, m

w

, and m

r

(x, y). We may easily calculate the mean and

variance of v

p

(x, y):

µ

v

p

(x,y)

= k

0

+x

2

/a+y

2

/b = k

0

+r

2

dµ

(6.4)

σ

2

v

p

(x,y)

= k

1

+r

2

w

(x

2

+y

2

) = k

1

+σ

2

r

+r

2

w

r

2

dσ

(6.5)

where x

, y

, r

dµ

, r

dσ

, k

0

, and k

1

are deﬁned in Table 7.3.

From (6.4) and (6.5), it is interesting to note that different die locations may have

different means and variances

3

. The location (x

0

, y

0

) having the smallest mean and

variance is given by:

x

0

= −ccos ω/2a−d sinω/2b (6.6)

y

0

= d cosω/2b−csinω/2a (6.7)

The locations farther off from(x

0

, y

0

) will have larger mean and variance. Figures 6.2(a)

and 6.2(b) illustrate the mean and variance of ring oscillator delay for different r

dµ

(or

r

dσ

).

2

x

c

and y

c

are not independent. In order to generate samples of x

c

and y

c

, we may assume x

c

=r

s

cosθ

and y

c

= r

s

sinθ, where r

s

and θ are independent. r

s

follows triangle distribution ranging from 0 to r

w

,

θ follows uniform distribution ranging from 0 to 2π

3

Such difference is caused by the nonlinearity of the across-wafer variation function (we assume

quadratic function as in Equation(6.1)). If the across wafer variation function is a piecewise linear

function, the mean and variance will be the same for all locations on a die.

137

0 0.2 0.4 0.6

0.1292

0.1296

r

dµ

µ

(a) µV.S. r

dµ

0 0.2 0.4 0.6

3.5

4

4.5

x 10

−5

r

dσ

σ

2

(b) σ

2

V.S. r

dσ

Figure 6.2: Mean and variance for different r

d

.

6.3.2 Appearance of Spatial Correlation

Besides mean and variance, we are also interested in the covariance between two loca-

tions (x

1

, y

1

) and (x

2

, y

2

). From (6.2), we may calculate the covariance as:

Cov = k

1

+r

2

w

(x

1

x

2

+y

1

y

2

) (6.8)

With covariance and variance calculated as above, we may obtain the correlation coef-

ﬁcient as:

ρ =

k

2

2

+2k

2

α+α

2

(k

2

+β)

2

+(r

2

dσ1

+r

2

dσ2

)(k

2

+β) +r

2

dσ1

r

2

dσ2

(6.9)

where α, β, and k

2

are deﬁned in Table 7.3. From (6.9), we obtain the upper bound

and lower bound of the correlation coefﬁcient for a certain Euclidean distance:

ρ ≤ρ

u

=

1−

δ

2

k

2

+δβ/2+2βk

2

+β

2

(k

2

+β)

2

+2r

2

m

(k

2

+β) +r

4

m

(6.10)

ρ ≥ρ

l

=

1−

δ

2

(k

2

+r

2

m

−δ

2

/4+β) +β(2k

2

+r

2

m

)

(k

2

+β)

2

+δ

2

(k

2

+β)/2+δ

4

/16

(6.11)

where δ is the Euclidean distance between (x

1

, y

1

) and (x

2

, y

2

), l

x

, l

y

, and r

m

are

deﬁned in Table 7.3. From the upper bound and lower bound, we may also calculate

138

the range of correlation coefﬁcient:

ρ

u

−ρ

l

≤

r

2

m

/(r

2

m

+k

2

+β) (6.12)

Notice that usually the wafer size is much larger than the die size, that is k

2

r

2

m

, there-

fore, the range of correlation coefﬁcient for a certain distance is very small. Moreover,

from the above equation, we also ﬁnd that when the variances of the inter-die residual

and within-die random variation increase, the range decreases. This explains why most

existing works [172, 173, 45] ﬁnd that spatial correlation is a function of distance.

Figure 6.3(a) illustrates the exact data for 40 locations, the upper bound and the

lower bound. From the ﬁgure, we ﬁnd that the range of ρ for a certain distance is very

small. Although the correlation coefﬁcient is within a narrow range, covariance is not,

as shown in Figure 6.3(b). This is because of the differences of variance across the die.

0 0.5 1

0.86

0.88

0.9

0.92

0.94

0.96

0.98

Distance (cm)

ρ

Exact

Upper Bound

Lower Bound

LB

UB

(a) ρ

0 0.5 1 1.5

3

3.5

4

x 10

−5

Distance (cm)

C

o

v

(b) Covariance

Figure 6.3: Apparent spatial correlation and covariance as a function of distance.

Figure 6.4 shows the correlation coefﬁcient for within-die variation after subtract-

ing the mean variation of the die (mainly caused by across wafer variation)

4

. We

4

In the ﬁgure, the correlation coefﬁcient can be a negative number when distance is large. This is

because after subtracting the mean, when the within-die variation of one corner increases, the within-

die variation of the opposite corner must decrease. That means, the within-die variations of opposite

corners are negative correlated. Moreover even when two locations are very close, if they lie on the

opposite side of the center, their correlation is still near zero.

139

observe that the within-die spatial variation is almost NOT spatially correlated, as em-

pirically observed in [172, 173]. This further validates that the spatial variation is

caused by systematic across-wafer variation.

0 0.5 1 1.5

−1

−0.5

0

0.5

1

Distance

ρ

Figure 6.4: Correlation coefﬁcient for within-die spatial variation after inter-die varia-

tion is removed.

6.3.3 Dependence between Inter-Die and Within-Die Variation

In most existing variation models, process variation is decomposed into inter-die,

within-die spatial, and within-die random variation:

v

p

= v

g

+v

s

+v

l

(6.13)

where v

g

is the inter-die variation, v

s

is the within-die spatial variation, and v

l

is the

within-die variation. Usually v

g

is modeled as the variation of the chip mean, v

l

is the

pure random local variation, and v

s

is the residual. v

g

, v

s

, and v

l

are assumed to be

independent.

With the variation model in (6.2), we may also calculate the inter-die, within-die

spatial, and within-die random variation. Inter die variation is calculated as the varia-

140

tion of the chip mean:

v

g

=

1

l

x

l

y

ZZ

|x|<l

x

/2

|y|<l

y

/2

v

p

(x, y)dxdy

= ax

2

c

+by

2

c

+cx

c

+dy

c

+m

w

+s

0

(6.14)

where s

0

is deﬁned in Table 7.3.

Within-die random variation is the local random variation: v

l

= R, and within-die

spatial variation is calculated as the remaining variation:

v

s

(x, y) = v

p

(x, y) −v

g

−v

l

= r

2

dµ

+2ax

x

c

+2by

y

c

−s

1

(6.15)

where s

1

is deﬁned in Table 7.3. From the above equations, we ﬁnd that both inter-die

and within-die spatial variations are functions of random variables x

c

and y

c

. Hence,

we may not decompose process variation into independent inter-die and within-die

spatial variation.

6.3.4 When can Spatial Variation be Ignored?

In this section, we analyze the accuracy of the simple two-level inter-/within-die vari-

ation model for different chip sizes. If we only consider inter-/within-die variation,

we may lump the across-wafer variation into inter-die variation, that is, approximate

the across-wafer variation as a piecewise constant function, as shown in Figure 6.5(a).

To evaluate the impact of the approximation error, we may treat such approximation

error as noise and the process variation as signal; and then evaluate the signal to noise

ratio. In order to do this, we calculate the mean square approximation error and the

total variance of variation. The signal to noise ratio when ignoring the spatial variation

141

is given as:

SNR = σ

2

total

/MSE

≈

6abr

4

w

+6(c +d)r

2

w

+σ

2

M

+σ

2

R

abr

2

w

(l

2

x

+l

2

y

) +2(c +d)l

x

l

y

(6.16)

It can be seen that MSE depends on chip size. When chip size is small, MSE is

small. This is because we approximate the across-wafer variation as a piecewise con-

stant function with small steps, hence such approximation is accurate. Figure 6.5(b)

illustrates the SNR for different die sizes. It can be seen that the SNR decreases when

die size increases as expected. We also observe that when chip size (l

x

and l

y

) is

smaller than 1cm, the SNR is up to 100. That means, two-level inter-/within-die vari-

ation model is accurate.

Exact

Piecewise

(a) Piecewise constant

0 1 2 3 4 5

10

0

10

1

10

2

10

3

10

4

10

5

Chip size (cm)

S

N

R

(b) SNR V.S. chip size.

Figure 6.5: Approximating across-wafer variation.

6.4 General Across-Wafer Variation Model

In the previous section, we assumed that the across-wafer variation is a quadratic func-

tion as shown in Equation 6.2. In practice, across-wafer variation may not be an exact

142

parabola. Moreover, the across-wafer variation may be slightly different for different

wafers. Therefore, there will be some ﬁtting residual after subtracting the across-wafer

parabola:

v(x, y) = v

p

+v

r

(6.17)

where v

p

is the quadratic across-wafer variation model as shown in (6.2) and V

r

is ﬁt-

ting residual. In the previous section, we assume that the ﬁtting residual is lumped into

inter-die random variation. However, the ﬁtting residual contains not only inter-die

random variation but a systematic trend of within-die variation. Figure 6.6(a) illus-

trates the original delay variation across the wafer, and Figure 6.6(b) illustrates the

ﬁtting residual of a wafer delay variation after subtracting the quadratic across-wafer

variation function. From the ﬁgure, we observe that for each die, there is a systematic

trend: If one corner of the die is faster, then the opposite corner is slower. From the

die point of view, such trend will also introduce some spatial correlation. In order to

model this trend, we approximate the ﬁtting residual as a linear function of within-die

location:

v

r

(x

, y

) = s

x

x +s

y

y +m

w

(6.18)

where s

x

and s

y

are x-dimension and y-dimension slope of within-die trend, respec-

tively, and m

w

is inter-die part of the residual, which can be lumped into inter-die

random variation. We also ﬁnd that the trends of within-die variation for each die are

different for different dies. In this case, we may model s

x

and s

y

as random variables.

Figure 6.7 illustrates the distribution of s

x

and s

y

obtained from measurement data of

process 2. From the ﬁgure, we ﬁnd that both s

x

and s

y

are almost uncorrelated and they

both follow Gaussian distribution.

Notice that when we model the ﬁtting residual as a linear within-die variation trend,

two more random variables s

x

and s

y

are introduced. This makes the variation model

143

more complicated. When the across-wafer variation is a perfect parabola and the ﬁtting

residual is not signiﬁcant, we may just lump the ﬁtting residual in to inter-die and

random variation.

0 50 100 150

0

50

100

150

0.102592

0.110338

0.118083

0.126474

0.13422

0.142611

(a) Original delay variation.

0 50 100 150

0

50

100

150

−0.0147271

−0.00816225

−0.00159736

0.0055146

0.0120795

0.0191914

(b) Residual after subtracting

Equation(6.2).

0 50 100 150

0

50

100

150

−0.00986836

−0.00634698

−0.0028256

0.000989223

0.0045106

0.00832543

(c) Residual after subtracting (6.18).

Figure 6.6: Ring oscillator frequency within a wafer.

In addition, we also observe that after subtracting the model of ﬁtting residual in

(6.18), the remaining variation is almost uncorrelated, as illustrated in Figure 6.6(b)

5

.

5

It seems that there is a systematic within-die variation pattern. In this chapter, we assume that all

this within-die pattern to be independent randomvariation. In practice, such within-die variation pattern

can be modeled as mean shift.

144

−2 0 2 4 6 8 10 12 14

x 10

−4

0

200

400

600

800

1000

1200

1400

1600

1800

S

x

PDF

Gaussian Approx

(a) PDF of S

x

−5 −4 −3 −2 −1 0 1 2 3 4 5

x 10

−4

0

500

1000

1500

2000

2500

S

y

PDF

Gaussian Approx

(b) PDF of S

y

Figure 6.7: PDF of S

x

and S

y

.

Combining (6.18) and (6.2) we obtain a general die-level across-wafer variation

model:

v

p

(x, y) = a(x

c

+x

)

2

+b(y

c

+y

)

2

+c(x

c

+x

) +

d(y

c

+y

) +s

x

x +s

y

y +m

w

+m

r

(x, y) (6.19)

6.5 Modeling Spatial Variability

As discussed in Section 6.1, spatial variation largely comes from the deterministic

across-wafer variation. Hence, modeling the within-die variation as spatial-correlated

random variable is not accurate as discussed in Section 6.3.

In this section, we introduce three new spatial variation models considering across-

wafer variation:

• Slop Augmented Across-Wafer variation model (SAAW).

• Quadratic Across-Wafer variation model (QAW).

145

• Location Dependent Across-Wafer variation model (LDAW).

In the following of this section, we will discuss these models in detail:

6.5.1 Slope Augmented Across-Wafer Model

(6.19) calculates the variation for a given location (x, y). In the equation, the die lo-

cation (x

c

, y

c

) are modeled as random variables and their PDF is shown in Equation

(6.3). (6.19) provides a new spatial variation model. We refer to this new model as

Slope Augmented Across-Wafer variation model (SAAW) .

Notice that in SAAW model, there are only six random variables, inter-die random

variation m

w

, within-die random variation m

r

, die location within the wafer x

c

and y

c

,

and slope of ﬁtting residual s

x

and s

y

. However, for the traditional distance-based spa-

tial variation model, the number of spatial variation sources depends on the number

of grids. Larger chip needs more variables. Therefore, our new model not only mod-

els the across-wafer variation accurately but also is more efﬁcient than the traditional

spatial correlation model.

6.5.2 Quadratic Across-Wafer Model

As discussed in Section 6.4, when the across-wafer variation is a perfect parabola, the

ﬁtting residual is not signiﬁcant

6

, we may just lump the ﬁtting residual into inter-die

random variation and simplify (6.19) to (6.2). In this case, there are only four random

variables, x

c

and y

c

, m

w

, and m

r

. We refer to this model as Quadratic Across-Wafer

variation model (QAW). Since QAW does not consider ﬁtting residual, it is not as

accurate as SAAW.

6

For example, process 1 as discussed in Section 6.2.

146

6.5.3 Location Dependent Across-Wafer Model

As discussed in Section 6.3.4, when die size is small enough, applying the two-level

inter-/within-die variation model does not introduce much error. However, inter-/within-

die variation model still does not consider the mean and variance difference at different

locations of a chip, as discussed in Section 6.3.1. To further improve the accuracy of

inter-/within-die variation model, we account for this:

v(x, y) ≈m

d

+µ

v

p

(x, y) +σ

v

p

(x, y)m

r

(x, y) (6.20)

where m

d

is inter-die variation including inter-lot random, inter-wafer random, inter-

die random, and across-wafer variation; m

r

(x, y) is within-die variation including within-

die random variation and residual of across-wafer variation; µ

v

p

(x, y) and σ

v

p

are mean

and variance difference at different locations of a chip, which can be calculated from

(6.4) and (6.5). We refer to the above model as Location Dependent Across-Wafer

variation model (LDAW). Notice that in Equation 6.20, µ

v

p

(x, y) and σ

v

p

(x, y) are de-

terministic value for a certain within-die location (x, y). Therefore, LDAW model has

only two random variables: m

d

and m

r

which is the same as two level inter-/within-

die model. Hence compared to inter-/within-die model, LDAW has similar efﬁciency

but higher accuracy because LDAW considers mean and variance difference across the

chip while inter-/within-die model does not.

6.6 Application of Spatial Variation Models on Statistical Timing

Analysis

In this section, we apply our across-wafer variation models to statistical static timing

analysis.

147

6.6.1 Delay Model

As discussed in Chapter 5, cell delay can be modeled as a quadratic function of vari-

ation sources, as shown in (5.46). For SAAW in (6.19) and QAW (6.2), each variation

source is a quadratic function of random variables. Therefore, by applying SAAW or

QAW on quadratic cell delay model, the cell delay becomes a 4

th

order function of

random variables. However, our SSTA ﬂow in Chapter 5 can only handle a quadratic

function of random variables. Therefore, we apply the moment matching technique

in the way as Section 5.3 to match the 4

th

function to a quadratic function of ran-

dom variables by matching the mean and ﬁrst two order joint moments. And then,

we may apply the quadratic SSTA ﬂow in Section 5.3 to estimate the chip delay vari-

ation. Notice that moment matching approximation is performed only once for all

cells and does not increase the run time of SSTA. LDAW is a linear function of vari-

ation sources. Therefore, when applying LDAW, cell delay is a quadratic function of

random variables, hence quadratic-SSTA can be applied. Moreover, as discussed in

Section 5.6, Semi-quadratic SSTA greatly improves the speed with only a little ac-

curacy loss compared to quadratic SSTA. Therefore, in this application, we may also

ignore the crossing terms and apply semi-quadratic SSTA to improve efﬁciency.

Sometimes, cell delay model can be simpliﬁed as a linear function of variation

sources, as shown in (5.114). In this case, applying SAAW or QAW results in a

quadratic function of random variable and hence quadratic SSTA or semi-quadratic

SSTA can be applied; and applying LDAW results in a linear function of random vari-

ables where linear SSTA can be applied.

148

6.6.2 Experimental Result

We have applied our new model to the SSTA ﬂow introduced in Chapter 5. In order

to verify the efﬁciency and accuracy, three comparison cases are deﬁned: 1) Monte-

Carlo simulation with the exact deterministic across-wafer variation model

7

, which

is the golden case for comparison; 2) distance-based spatial correlation model from

[172, 173], which is referred to as SPatial Correlation model (SPC); 3) two-level inter-

/within-die variation model, which is referred to as Inter-/Within-die variation model

(IW).

We apply all the above methods to the ISCAS85 suite of benchmarks in PTM

45nm technology[124]. We assume random placement for ISCAS85 circuits. Since

process variation has smaller impact on interconnect delay than on logic cell delay,

we only consider logic cell delay when calculating the full chip delay variation. In

the experiment, we consider the gate length variation obtained from minimum square

error ﬁtting on the ring oscillator delay from industrial 65nm process (Process 2 as

discussed in Section 6.2) measurement from the model as shown in (6.19). We ob-

tain the across-wafer coefﬁcients a, b, c, and d

8

, ﬁtting residual coefﬁcients s

x

and

s

y

9

, standard deviation of random inter-wafer, inter-die, and within-die variation as

percentage with respect to the nominal value. Then we assume that the percentages of

all the above coefﬁcients to nominal value are the same at 45nm technology node and

65nm technology node.

7

In the simulation, each wafer may have different across-wafer variation which is obtained from

measurement data of process 2. We have simulated 318 wafers correspondent to 318 measured wafers.

8

We apply quadratic function to ﬁt the across-wafer variation for each wafer to obtain the ﬁtting

coefﬁcients for each wafer, then use average coefﬁcients of all wafers for our experiment.

9

We obtain the slope of ﬁtting residual s

x

and s

y

for each chip, and then calculate the mean and

variance of s

x

and s

y

for all chips. In the experiment, we assume s

x

and s

y

to be Gaussian random

variables with mean and variance obtained from the measurement data.

149

6.6.2.1 Full Chip Delay

In the experiment, we assume that the chip size is 2cm×2cm and the wafer radius is

15cm. Since ISCAS85 benchmarks are very small, the impact of spatial variation on

delay is not signiﬁcant within the circuit. In order to show such impact,we assume the

benchmarks are stretched on a 2cm×2cm chip. In our experiment, for the SPC model,

we divide the chip to 10 ×10 = 100 grids. Table 6.2 illustrates the percentage error

of mean (µ), standard deviation (σ), and 95-percentile point (95%) and run time (T) of

different variation models. In the table, we also compared the result of using quadratic

cell delay model (Quad) and linear cell delay model (Lin). We only use quadratic

cell delay model for golden case simulation (exact). The error is calculated as error

of different variation models compared to the golden case simulation. For SAAW and

QAW, we also compare the results of applying quadratic SSTA with crossing terms

(SAAW Quad and QAW Quad) and applying SSTA without crossing terms (SAAW S-

Quad and QAW S-Quad). From the table, we have the following observations:

• Compared to full quadratic SSTA, semi-quadratic SSTA (SSTA without crossing

terms) achieves up to 8X speed up with less than 1% accuracy loss.

• SAAW is more accurate than QAW. This is because the ﬁtting residual is signif-

icant for the measurement data, QAW ignores ﬁtting residual and hence intro-

duces more error.

• The error of SAAW using semi-quadratic SSTA is within 2% while the error of

spatial correlation model is up to 6.5%.

• Compared to quadratic cell delay model, linear cell delay has less than 2% ac-

curacy loss. This is because in our experiment, the cell delay variation is well

approximated by a linear function.

150

• For linear cell delay model, SAAW achieves about 6X speed up compared to

SPC. This is because there are 100 grids in the spatial correlation model, result-

ing in 37 spatial random variables

10

, while SAAW has only 6 random variables.

• LDAW and IW are very efﬁcient. However, both model has much larger error

than others. This is because both model ignore correlation. LDAW is signiﬁ-

cantly more accurate than IW with no runtime penalty.

Since linear cell delay model and semi-quadratic SSTA are accurate. In the rest of

this subsection, we assume linear cell delay and apply semi-quadratic SSTA for all

experiments. Moreover, since SAAW is more accurate than QAW with only a small run

time overhead, in the following experiment, we do not consider the QAW model in the

following experiments.

10

There are 100 correlated spatial random variables, we apply PCA to truncate some insigniﬁcant

principle components and there remains 37 signiﬁcant principle components.

151

In the above experiment, we only considered big chips. As discussed in Sec-

tion 6.3.4, when the chip size is small, the impact on across-wafer variation at die

level is not signiﬁcant. In order to verify this, we perform delay estimation of IS-

CAS85 benchmarks stretching on different size chips. Table 6.3 shows the percentage

error for different models under different chip size. From the table, we ﬁnd that when

chip size is small, LDAW and IW are accurate. Considering that LDAW has similar run

time but is more accurate (although when chip size is small, the accuracy improvement

is limited) compared to IW, LDAW is always better than IW.

6.6.2.2 Delay of Blocks on Different Locations on a Chip

The above experiment assumes that the benchmarks are stretched on a chip. How-

ever, in real design, especially for big chips, the design is separated into several blocks

and each block only occupies a small region of a chip. In this case, the critical path

is within a small region instead of spanning all over the chip. As discussed in Sec-

tion 6.3.1, different chip locations may have different mean and variance. Therefore,

when a block is placed at different locations on a chip, its delay variation may be dif-

ferent. In order to show such effect, we assume that the ISCAS85 benchmark circuit

is placed (no stretched) in different locations on a chip: center (C), lower left corner

(LL), lower right corner (LR), upper left corner (UL), and upper right corner (UR),

and then calculate the delay variation with location. Since ISCAS85 benchmarks are

very small, the impact of spatial variation on delay is not signiﬁcant within the circuit.

Therefore, in this experiment, we only compare two models LDAW and IW

11

.

Table 6.4 compares the percentage error of LDAW and IW for ISCAS85 bench-

marks placing on different locations on a 2cm×2cm chip. From the table, we ﬁnd that

11

When the circuit is in a small region, SAAW and QAW will give similar result as LDAW, and SPC

will give similar result as IW.

152

Bench- delay Exact SAAW Quad SAAW S-Quad QAW Quad QAW S-Quad LDAW

∗

SPC

∗

IW

∗

mark model µ σ 95% µ σ 95% T µ σ 95% T µ σ 95% T µ σ 95% T µ σ 95% T µ σ 95% T µ σ 95% T

c1908 Quad 17.6 2.27 24.5 -0.4 -0.9 -1.0 146 -0.9 -1.7 -1.7 27 -0.8 -1.9 -2.3 54 -1.3 -2.5 -3.2 19 -2.1 -7.5 -6.9 9 -1.5 -4.2 -3.8 1450 -2.6 -10.2 -8.9 8

Lin - - - -0.8 -1.5 -1.4 150 -1.4 -1.8 -2.0 26 -0.9 -3.6 -3.1 53 -1.2 -3.4 -3.9 18 -2.6 -7.5 -8.1 10 -2.1 -4.4 -4.0 135 -3.0 -11.5 -10.3 10

c3540 Quad 25.7 3.43 34.5 +0.4 +0.9 +0.7 212 -0.4 -1.3 -1.1 36 +0.4 -1.8 -1.2 76 -0.9 -2.1 -1.9 25 +0.4 -5.8 -4.6 13 -1.2 -4.8 -4.0 4210 -1.4 -7.3 -6.5 12

Lin - - - -0.6 -1.2 -1.2 209 -1.1 -1.9 -1.6 35 -0.9 -3.6 -3.1 77 -1.2 -5.1 -3.9 27 -1.8 -6.5 -6.0 9 -2.0 -6.5 -5.7 202 -2.9 -9.3 -8.8 10

c7552 Quad 48.9 6.47 64.7 -0.6 +0.3 +0.2 435 -0.8 -0.2 -0.9 67 -0.8 -1.6 -1.4 115 -1.6 -1.5 -1.7 48 -2.7 -3.6 -4.0 20 -1.0 -2.3 -2.9 8182 -2.1 -6.7 -6.5 22

Lin - - - -0.6 -0.5 -0.6 430 -1.5 -1.4 -1.6 101 -1.1 -1.3 -1.4 109 -1.9 -2.2 -2.8 79 -3.3 -4.6 -4.9 16 -3.3 -3.5 -4.3 433 -3.9 -8.9 -8.2 15

Table 6.2: Delay percentage error for different variation models. Note: the µ, σ, and 95-percentile point for exact simulation

is in ns. Run time (T) is in ms.

∗

for LDAW, SPC, and IW, linear SSTA is applied when assuming linear cell delay model.

1

5

3

Bench- Chip Exact SAAW LDAW SPC IW

mark size µ σ 95% µ σ 95% µ σ 95% µ σ 95% µ σ 95%

c1908 10 17.4 2.24 24.2 -1.2 -1.8 -1.9 -2.2 -5.5 -5.9 -1.7 -3.5 -3.3 -2.5 -6.8 -7.1

6 17.5 2.23 24.3 -1.1 -1.5 -1.6 -2.2 -4.1 -4.2 -1.2 -2.6 -2.4 -2.2 -4.5 -4.3

3 17.6 2.22 24.4 -0.9 -1.2 -1.1 -1.1 -1.8 -1.6 -1.0 -1.5 -1.5 -1.9 -2.1 -2.5

c3540 10 25.6 3.42 34.6 -1.0 -2.2 -1.6 -1.4 -6.2 -4.9 -1.8 -5.4 -4.8 -2.2 -7.5 -6.9

6 25.8 3.45 34.3 -0.8 -1.2 -1.0 -1.2 -3.8 -3.3 -1.5 -3.4 -3.0 -1.3 -4.2 -3.7

3 25.8 3.44 34.4 -0.5 -1.0 -1.0 -1.1 -2.1 -2.0 -1.0 -1.9 -1.9 -1.1 -2.2 -2.3

c7552 10 48.9 6.47 64.7 -1.2 -1.1 -1.4 -2.8 -3.5 -3.7 -2.5 -2.2 -2.7 -3.2 -5.6 -6.3

6 48.9 6.47 64.7 -1.0 -1.1 -1.3 -1.4 -2.5 -2.3 -2.2 -2.1 -2.5 -2.9 -3.1 -3.3

3 48.9 6.47 64.7 -0.6 -0.9 -1.0 -1.0 -1.1 -1.2 -0.9 -1.3 -1.4 -1.3 -1.4 -1.6

Table 6.3: percentage error for ISCAS85 benchmark stretching on a chip with different

chip size. Note: We assume square chips and chip size means edge length in mm. The

values of exact simulation are in ns.

the estimation of LDAW is within 1% error from the exact simulation and the error of

IW is up to 8%. This is because LDAW predicts different mean and variance for dif-

ferent location correctly, as discussed in Section 6.5 while IW can only give the same

mean and variance for all locations.

Table 6.5, 6.6, and 6.7 shows percentage error of LDAW and IW for ISCAS85

benchmarks placing on different locations on a 1cm×1cm, 6mm×6mm, and 3mm×3mm

chip, respectively. From the tables, we ﬁnd that the error of IW becomes smaller when

chip size is small.

154

bench loca- Exact LDAW IW

mark tion µ σ 95% µ σ 95% µ σ 95%

c3540 C 25.4 3.29 32.2 +0.4 +0.3 +0.3 +1.1 +2.4 +2.8

LL 24.8 3.22 31.9 +0.8 +0.6 +0.3 +3.5 +1.4 +1.9

LR 26.2 3.35 33.1 -0.7 +0.6 -0.6 -1.9 -2.4 -1.8

UL 26.5 3.36 33.3 -0.2 +0.3 +0.3 -3.3 -3.0 -4.4

UR 27.1 3.41 34.1 -0.3 -0.3 -0.6 -6.1 -4.1 -5.2

c7552 C 48.2 6.37 60.2 +0.8 +0.3 +0.2 +1.0 +0.9 +0.9

LL 47.2 6.11 58.5 +0.4 +0.3 +0.7 +3.6 +4.6 +3.1

LR 49.4 6.51 62.3 -0.2 +0.3 +0.3 -0.8 -1.9 -3.0

UL 49.5 6.65 63.1 -0.2 +0.1 +0.1 -1.0 -4.0 -4.1

UR 50.1 6.91 65.3 -0.4 +0.3 +0.1 -1.0 -7.4 -7.4

Table 6.4: Delay percentage error at different locations in a 2cm×2cm chip. Note:

The values of exact simulation are in ns.

bench loca- Exact LDAW IW

mark tion µ σ 95% µ σ 95% µ σ 95%

c3540 C 25.4 3.26 31.9 +0.5 +0.4 +0.2 +1.3 +1.6 +1.9

LL 25.0 3.25 31.6 +0.5 +0.3 +0.3 +2.4 +1.3 +2.4

LR 25.8 3.30 32.4 -0.4 -0.4 -0.4 -1.1 -0.9 -0.8

UL 26.1 3.32 32.6 -0.4 +0.3 -0.2 -2.0 -1.5 -1.4

UR 26.2 3.34 33.0 -0.3 -0.3 -0.6 -2.6 -1.8 -2.2

c7552 C 48.6 6.38 60.4 +0.2 -0.1 -0.2 +1.0 +0.5 +0.6

LL 48.2 6.35 60.3 -0.2 -0.2 +0.2 +1.7 +1.0 +1.1

LR 48.9 6.42 60.4 +0.2 -0.2 +0.2 -0.2 -0.7 -0.5

UL 49.1 6.43 61.0 -0.2 -0.2 -0.2 -0.9 +1.0 -1.3

UR 49.2 6.45 61.2 -0.3 -0.4 -0.5 -1.1 -1.4 -1.8

Table 6.5: Delay comparison for ISCAS85 benchmark in 1cm×1cm chip. Note: The

values of exact simulation are in ns.

155

bench loca- Exact LDAW IW

mark tion µ σ 95% µ σ 95% µ σ 95%

c3540 C 25.5 3.27 32.0 +0.4 +0.2 +0.3 +0.9 +0.7 +1.4

LL 25.1 3.25 31.7 +0.5 +0.3 +0.3 +2.4 +1.3 +2.4

LR 26.0 3.32 32.6 -0.4 -0.4 -0.4 -1.1 -0.9 -0.8

UL 26.2 3.34 32.8 -0.4 +0.3 -0.2 -2.0 -1.5 -1.4

UR 26.3 3.35 33.1 -0.3 -0.3 -0.6 -2.6 -1.8 -2.2

c7552 C 48.4 6.37 60.1 +0.2 -0.1 -0.2 +1.0 +0.5 +0.6

LL 48.1 6.34 59.9 -0.2 -0.2 +0.2 +1.7 +1.0 +1.1

LR 49.0 6.43 60.7 +0.2 -0.2 +0.2 -0.2 -0.7 -0.5

UL 49.3 6.46 61.3 -0.2 -0.2 -0.2 -0.9 +1.0 -1.3

UR 49.4 6.48 61.5 -0.3 -0.4 -0.5 -1.1 -1.4 -1.8

Table 6.6: Delay comparison for ISCAS85 benchmark in 6mm×6mm chip. Note: The

values of exact simulation are in ns.

bench loca- Exact LDAW IW

mark tion µ σ 95% µ σ 95% µ σ 95%

c3540 C 25.6 3.28 32.2 +0.4 +0.2 +0.3 +0.5 +0.3 +0.8

LL 25.4 3.26 320 +0.5 +0.6 +0.3 +1.2 +1.0 +1.2

LR 25.8 3.31 32.4 -0.2 -0.4 -0.6 -0.5 -0.7 -0.7

UL 25.9 3.22 32.5 -0.2 +0.3 -0.2 -1.0 -1.1 -0.9

UR 26.0 3.31 32.7 -0.2 -0.2 -0.3 -0.7 -0.8 -1.1

c7552 C 48.6 6.38 60.3 +0.2 -0.3 +0.7 +0.2 +0.5 +1.1

LL 48.4 6.37 60.1 -0.2 0.1 +0.2 +0.4 +0.3 +0.7

LR 48.8 6.42 60.6 +0.2 -0.3 +0.3 -0.6 -0.9 -0.5

UL 48.9 6.43 60.9 -0.2 +0.2 -0.2 -0.4 +0.0 -0.6

UR 49.2 6.45 61.2 -0.2 -0.2 -0.3 -0.7 -0.8 -1.1

Table 6.7: Delay comparison for ISCAS85 benchmark in 3mm×3mm chip. Note: The

values of exact simulation are in ns.

156

6.7 Application of Spatial Variation Models on Statistical Leakage

Analysis

Besides SSTA, we also apply our variation model to statistical leakage power analysis.

Usually, cell leakage power variation is modeled as exponential function of variation

sources:

P

leak

= P

0

· e

∑c

i

V

i

(6.21)

where P

0

is the nominal leakage power and c

i

’s are sensitivity coefﬁcients. The full

chip leakage power is calculated as the sum of leakage power of all cells:

P

chi p

=

∑

i∈Cell

P

i,leak

(6.22)

where Cell is the set of all cells in the chip and P

i,leak

is leakage power of the i

th

cell. Since each variation source is a quadratic function as in (6.19), the cell leakage

power is an exponential of a quadratic function of random variables. Considering

that the random variables may be non-Gaussian, there is no closed-form equation to

calculate the full chip leakage power. Therefore, in this chapter, we apply Monte-Carlo

simulation to obtain the full chip leakage power variation.

We have implemented leakage variation analysis with different models in Matlab.

In the experiment, we use the same setting and comparison cases as the SSTA exper-

iment in Section 6.6. For each variation model, we use 100,000 sample Monte-Carlo

simulation to obtain the full chip leakage power for all variation models. For the leak-

age analysis, we assume that 900 copies of ISCAS benchmark circuits are placed in

a 30 ×30 array on a 2cm×2cm chip. Table 6.8 compares the leakage variation for

ISCAS85 benchmarks. From the table, we observe that:

• Error of SAAW is within 5% while error of SPC is up to 17%. Moreover, SAAW

is 7X faster than SPC because there are fewer random variables for SAAW.

157

Bench- Exact SAAW QAW LDAW SPC IW

mark µ σ 95% µ σ 95% T µ σ 95% T µ σ 95% T µ σ 95% T µ σ 95% T

c1355 62.1 14.5 92.5 +1.5 +3.3 +3.2 16.3 +3.4 +6.5 +5.4 9.8 -5.5 -15.6 -16.9 2.5 +5.3 +10.4 +12.2 123 -7.9 -19.6 -20.9 2.7

c1908 95.6 20.3 144 +0.9 +4.4 +3.7 15.5 +2.3 +7.8 +6.3 9.7 -6.5 -17.8 -19.6 2.6 +5.7 +14.8 +14.3 122 -8.6 -19.2 -23.5 2.9

c2670 131 22.9 181 +1.4 +2.7 +1.7 15.7 +2.9 +8.5 +4.7 9.5 -7.9 -16.9 -22.0 2.8 +6.8 +12.2 +9.4 122 -9.2 -20.3 -25.5 2.4

c3540 201 37.4 282 +1.5 +2.3 +1.8 15.4 +3.1 +5.8 +4.4 10.4 -5.6 -16.5 -20.2 2.6 +4.9 +11.2 +8.2 123 -8.3 -18.2 -23.5 2.7

c7552 403 73.2 562 +1.6 +2.7 +1.9 15.3 +3.7 +6.0 +5.0 10.1 -7.3 -12.6 -16.9 2.6 +7.1 +13.8 +10.7 122 -9.2 -20.3 -24.5 2.5

Table 6.8: Leakage error percentage for different models in 2cm×2cm chip. Note:

The exact values are in mW. Run time (T) is in s.

• SAAW is more accurate than QAW, but is about 50% slower.

• Both LDAW and IW are not accurate. This is because both these models do not

consider correlation and hence under estimate the leakage power variation.

Similar to SSTA, for leakage variation analysis, we also perform leakage esti-

mation in different size chips: 1cm×1cm, 6mm×6mm, and 3mm×3mm. For the

1cm×1cm chip, we assume that 225 copies of ISCAS85 benchmark circuits are placed

in a 15×15 array, for the 6mm×6mm chip, we assume 100 copies of ISCAS85 bench-

mark circuits are placed in a 10 ×10 array, and for the 3mm×3mm chip, we assume

25 copies of ISCAS benchmark circuits are placed in a 5 ×5 array. Table 6.9 shows

the error percentage for different model on different size chips. From the table, we

ﬁnd that the error of LDAW, SPC, and IW reduces when chip size becomes smaller as

expected.

158

Bench- Chip Exact SAAW LDAW SPC IW

mark size µ σ 95% µ σ 95% µ σ 95% µ σ 95% µ σ 95%

c1355 10 15.4 3.52 23.2 +1.2 +3.1 +3.0 -4.7 -10.6 -11.2 +4.8 +7.5 +9.2 -5.4 -12.3 -13.8

6 5.92 1.62 10.3 +1.0 +2.7 +2.9 -3.5 -7.5 -8.3 +2.7 +6.2 +6.7 -3.9 -8.1 -9.2

3 1.48 0.40 2.58 +0.6 +1.8 +2.0 -1.8 -3.5 -3.7 +1.6 +2.9 +3.1 -2.0 -3.5 -4.0

c1908 10 23.9 5.7 36.1 +1.0 +2.7 +2.9 -4.2 -12.3 -14.1 +3.7 +8.5 +9.2 -5.5 -13.9 -15.1

6 10.6 2.25 16.1 +0.9 +2.0 +2.1 -3.4 -7.3 -8.2 +1.9 +5.6 +6.8 -3.5 -7.3 -9.0

3 2.65 0.57 4.03 +1.0 +2.2 +2.2 -1.3 -3.5 -4.0 +1.2 +2.9 +3.1 -1.9 -4.0 -4.4

c2670 10 32.8 5.72 45.2 +1.2 +2.8 +2.9 -5.3 -11.1 -14.0 +4.2 +8.2 +9.1 -6.5 -12.3 -17.1

6 14.6 2.55 20.1 +1.0 +1.8 +1.9 -3.5 -6.2 -7.1 +2.4 +4.3 +6.0 -4.3 -7.1 -8.3

3 3.65 0.65 5.03 +0.8 +1.4 +1.7 -1.9 -3.2 -3.7 +2.0 +3.6 +3.5 -2.2 -4.5 -5.1

c3540 10 50.5 9.37 71.1 +1.2 +1.8 +2.1 -3.9 -7.8 -10.5 +2.9 +5.3 +6.2 -5.0 -9.6 -14.5

6 22.3 4.16 30.2 +1.0 +1.4 +1.7 -2.3 -4.5 -5.5 +1.8 +3.5 +3.6 -3.4 -6.1 -8.5

3 5.58 1.05 7.55 +0.9 +1.2 +1.4 -1.2 -3.1 -3.0 +1.0 +1.4 +2.0 -1.4 -3.4 -3.9

c7552 10 102 18.5 141 +1.4 +2.3 +2.4 -4.6 -7.3 -9.9 +4.0 +6.2 +8.6 -6.3 -8.2 -12.5

6 45.0 8.19 6.26 +1.2 +1.9 +1.9 -3.0 -5.2 -7.3 +2.7 +4.2 +5.1 -3.5 -6.0 -8.2

3 11.4 2.06 1.58 +0.9 +0.7 +1.3 -1.2 -1.8 -2.0 +1.5 +1.6 +1.6 -2.3 -2.9 -3.4

Table 6.9: Leakage error for different variation model on different size chips. Note:

exact values are in mW.

6.8 Summary of Different Models

In previous sections, we compared the accuracy and efﬁciency of different models.

Table 6.10 summarizes the advantages and disadvantages of our proposed spatial vari-

ation models (SAAW, QAW, and LDAW), and the traditional variation models (SPC

and IW). Our proposed across-wafer variation models exactly model the across-wafer

variation and the number of random variables does not depend on chip size. Therefore

they are accurate and efﬁcient. SAAW has six random variables and it can be applied to

any across-wafer variation models. QAW has four random variables, hence it is more

efﬁcient than SAAW. However, it can be applied only when the across-wafer variation

is a perfect parabola. LDAW is the most efﬁcient, ignores correlation and only works

for small chips. Moreover, SAAW and QAW need to know the across-wafer variation,

therefore, one needs to track the die locations within the wafer to build up the model.

159

On the other hand, the traditional variation models (as well as LDAW) only require

measurement on a die without tracking die locations. Therefore, they are somewhat

easier to build. However, such models are not accurate compared to our proposed

models. Moreover, for SPC, since the number of random variables depends on number

of grids, it is not as efﬁcient as our proposed models.

160

Model Type Advantages Disadvantages Models # of RVs Case to Apply

Across- Accurate Need die tracking SAAW Equ(6.19) 6 large chip, non-parabola across-wafer variation

wafer Efﬁcient to extract QAW Equ (6.2) 4 large chip, parabola across-wafer variation

models LDAW Equ (6.20) 2 small chip

Traditional Easy to Not SPC Depend on # of grids large chip

models extract accurate IW 2 small chip

Table 6.10: Summary of different variation models.

1

6

1

6.9 Conclusions

In this chapter, we analytically study the impact of systematic across-wafer variation

on within-die spatial variation. For simplicity, we assume that across-wafer variation is

a quadratic function. We ﬁrst observe that different locations of a chip may have differ-

ent means and variances and such difference becomes more signiﬁcant when chip size

increases. Secondly, we ﬁnd that spatial correlation is visible only when the across

wafer systematic is not taken into account. When it is taken into account, we show

that within-die random variability does not exhibit a strong or useful pattern of spatial

correlation. We exploited these observations in order to create a much more accurate

and efﬁcient model for performance variability prediction. Thirdly, we ﬁnd that the

within-die spatial variation is NOT independent of the inter-die variation. However,

when chip size is small enough, such dependence is weak and the across-wafer varia-

tion can be lumped in to inter-die variation. In this case, the two level inter-/within-die

variation model is still accurate. Later, we also analyze the case when the across-wafer

is not a perfect quadratic function. Based on the above analysis, we have proposed

an accurate and efﬁcient variation model for deterministic across wafer variation. We

further apply our new variation model to two applications: statistical static timing anal-

ysis and statistical leakage analysis. Experimental result shows that compared to the

distance-based spatial variation model, our new model reduces the error from 6.5%

to 2% for statistical timing analysis and reduces error from 17% to 5% for statistical

leakage analysis. Our model also improves the run time by 6X for statistical timing

analysis and by 7X for statistical leakage analysis.

162

CHAPTER 7

On Conﬁ dence in Characterization and Application of

Variation Models

In this chapter we study statistics of statistics. Statistical modeling and analysis have

become the mainstay of modern design-manufacturing ﬂows. Most analysis tech-

niques assume that the statistical variation models are reliable. However, due to limited

number of samples (especially in the case of lot-to-lot variation), calibrated models

have low degree of conﬁdence. The problem is further exacerbated when production

volumes are low (≤65 lots) causing additional loss of conﬁdence in statistical analysis

(since production only sees a small snapshot of the entire distribution). The problem

of conﬁdence in statistical analysis is going to be further worsened with advent of

450mm wafers. We mathematically derive conﬁdence intervals for commonly used

statistical measures (mean, variance, percentile corner) and analysis (SPICE corner

extraction, statistical timing). Our estimates are within 2% of simulated conﬁdence

values. Our experiments (with variability assumptions derived from test silicon data

from a 45nm industrial process) indicate that for moderate characterization volumes

(10 lots) and low-to-medium production volumes (15 lots), a signiﬁcant guardband

(e.g., 34.7% of standard deviation for single parameter corner, 38.7% of standard de-

viation for SPICE corner, and 52% of standard deviation for 95-percentile point of

circuit delay are needed) to ensure 95% conﬁdence in the results. The guardbands are

non-negligible for all cases when either production or characterization volume is not

163

large. We also study the interesting one production lot case which may be common for

prototyping as well as for academic designs. The proposed methods are not runtime-

intensive (always within 10s) as they require Monte-Carlo simulations on closed form

expressions.

7.1 Introduction

In process variation modeling, analysis or optimization, statistical characteristics of

variation sources, such as mean, variance, and skewness, are often assumed to be

known and reliable.

In practice, the statistical characteristics of the variation sources are obtained from

measurement of a set of samples. Due to limited number of measurements, the mea-

sured statistical characteristics may be unreliable. This is especially true for lot-to-lot

variation, since the number of lots measured for characterization is usually small.

Moreover, the existing statistical analysis, implicitly assume that the production

chips have the same statistical characteristics as the corresponding population values

1

.

When the number of production samples is not small, the production statistical char-

acteristics may signiﬁcantly deviate from their population values. Therefore, the un-

certainty in statistics of measured data as well as production data should be considered

in statistical analysis. [177] models the uncertainty of mean, variance, and correlation

coefﬁcients as an interval, and then estimates the range of mean and variance of circuit

performance. However, this work only models the uncertainty as an interval, ignoring

their true distributions. It also ignores the uncertainty introduced by limited number of

production lots.

Before we move further, it is important to brieﬂy review the concept of conﬁdence

1

Here by “population”, we mean the statistical values in the case of inﬁnite measured as well as

production samples.

164

intervals in statistical analysis. Consider a standard normal random variable X, we

choose n samples of it (X

1

, X

2

, . . . , X

n

). Once the samples are chosen, the sample mean

ˆ µ and variance ˆ σ

2

are ﬁxed. However, if we repeatedly choose n samples for 1,000

times, and calculate the sample mean and variance for each time, the results may dif-

fer. When the sample mean is smaller than a certain value µ

conf

in 900 out of the 1000

runs, we say that we have 90% conﬁdence that the sample mean is smaller than µ

conf

.

Notice that the conﬁdence is not yield but the probability that certain statistical char-

acteristic is within a given interval. Once chips are produced, the distribution of (say)

circuit delay is ﬁxed. But due to the uncertainty of the statistical model, we do not

know exactly the production mean and variance prior. For a given conﬁdence, one can

estimate the distribution of mean and variance (of course the actual mean and variance

are going to be just one sample out of these distributions).

In this chapter, we study the uncertainty of statistical models and analyze the im-

pact of such uncertainty on statistical analysis. The contributions of this chapter are:

1. Given the number of measured lots and the number of production lots,we esti-

mate the distribution of the production mean and variance of variation sources.

2. For a given conﬁdence level, we estimate the worst case fast/slow corner (µ±kσ

corner) for variation sources.

3. We extend the ideas to extract SPICE corners depending on desired conﬁdence

level as well as number of measured/produced silicon lots.

4. We estimate the conﬁdence interval of mean, variance and quantile for statistical

timing analysis.

Experimental results show that the required guardband (to reach desired conﬁdence)

estimated by our method is very close to the exact simulation value. We also observe

165

that the guardband value increases dramatically when conﬁdence level increases. We

need to introduce up to 30% guardband value to achieve 95% conﬁdence.

The rest of this chapter is organized as follows: Section 7.2 gives the problem

formulation; then Section 7.3 studies the conﬁdence interval estimation for variation

sources; Section 7.4 presents the estimation of worst case delay for SPICE corner

estimation and Section 7.5 analyzes the impact of mean and variance uncertainty on

statistical timing analysis; Section 7.6 discusses how to choose conﬁdence level in real

circuit design; and ﬁnally Section 7.7 concludes this chapter.

7.2 Problem Formulation

Process variation is decomposed into inter-lot (V

l

), inter-wafer(V

w

), inter-die (V

d

), and

within-die (V

r

) variation:

V =V

l

+V

w

+V

d

+V

r

(7.1)

µ

V

= µ

l

+µ

w

+µ

d

+µ

r

(7.2)

σ

2

V

= σ

2

l

+σ

2

w

+σ

2

d

+σ

2

r

(7.3)

where µ

l

(σ

2

l

), µ

w

(σ

2

w

), µ

d

(σ

2

d

), and µ

r

(σ

2

r

) are means (variances) of inter-lot, inter-

wafer, inter-die, and within-die variations, respectively. It has been shown that in case

of ﬁnite number of samples the sample mean follows Gaussian distribution and the

sample variance follow χ

2

distribution [4]. The variances of the means and variances

are:

σ

2

µl

= σ

2

l

/N

l

(7.4)

σ

2

σ

2

l

= σ

2

l

/N

l

(7.5)

σ

2

µw

= σ

2

w

/N

w

(7.6)

σ

2

σ

2

w

= σ

2

w

/N

w

(7.7)

166

σ

2

µd

=σ

2

d

/N

d

(7.8)

σ

2

σ

2

d

=σ

2

d

/N

d

(7.9)

σ

2

µr

= σ

2

r

/N

r

(7.10)

σ

2

σ

2

r

= σ

2

r

/N

r

(7.11)

where N

l

, N

w

, N

d

, and N

r

are total number of lots, wafers, dies, and circuit el-

ements. There are usually 20 to 50 wafers per lot, more than one hundred dies per

wafer, and tens of measured circuit elements within each die

2

. Therefore, N

r

N

d

N

w

N

l

. Hence σ

2

µl

σ

2

µw

σ

2

µd

σ

2

µr

, σ

2

σ

2

l

σ

2

σ

2

w

σ

2

σ

2

d

σ

2

σ

2

r

. Therefore, the

uncertainty of mean and variance comes largely from lot-to-lot variation. Hence in this

chapter, we focus our attention on lot-to-lot variation.

Table 7.1 illustrates ﬁve cases of number of measured lots and number of produc-

tion lots.When the number of measured lots or production lots is small, the conﬁdence

interval is large and hence statistical analysis is not reliable. In this case, we may want

to guardband statistical analysis to ensure a certain degree of conﬁdence. Example use

models for these different scenarios can be: commodity process/high volume part (L-

L); niche process/high column part (S-L); commodity process/niche part (L-S); niche

process/niche part (S-S); and commodity (or niche) process/prototyping (L-1).

In practice, only the measured as opposed to the population value is known. There-

fore, our the problem is:

Given the number of measured lots and production lots and the measured

values of mean and variance of variation sources, estimate the distribution of

statistical measures of production lots.

In rest of the chapter, we assume that the lot-to-lot variation of all the variation

sources follows Gaussian distribution.

2

Note that 85 lots of 25 300mmwafers each with die-size of 100mm

2

amounts to production volume

of 1.5 million chips.

167

Case # Measured # Production Conﬁdence Reliability of

lots lots interval Statistical analysis

L-L Large Large Small High

S-L Small Large Large Low

L-S Large Small Large Low

S-S Small Small Large Low

L-1 Any 1 Large Low

Table 7.1: Reliability of statistical analysis for different number of measured and pro-

duction lots.

7.3 Conﬁ dence Interval for Variation Sources

In this section, we will discuss conﬁdence interval estimation for a single variation

source. Table 7.3 summarizes the notation used in the rest of the chapter.

7.3.1 Mean and Variance

Let’s ﬁrst discuss the mean and variance. From (7.1), we calculate the total mean and

variance of measured value as:

ˆ µ

t

= ˆ µ

l

+µ

o

(7.12)

ˆ σ

2

t

= ˆ σ

2

l

+σ

2

o

(7.13)

As discussed in Section 7.2, the uncertainty of mean and variance comes largely from

lot-to-lot variation. Therefore, we assume the mean µ

o

and variance σ

2

o

of all other

variation are reliable. In the same way as above, we calculate the total mean and

variance of the production as:

˜ µ

t

= ˜ µ

l

+µ

o

(7.14)

˜ σ

2

t

= ˜ σ

2

l

+σ

2

o

(7.15)

In the rest of this subsection, we will focus on analyzing conﬁdence interval for lot-to-

lot variation.

168

n number of lots

µ, σ

2

mean and variance

cn

f /s

fast/slow corner

no hat population value (µ, σ

2

and so on)

ˆ value obtain from measured lots ( ˆ n, ˆ µ, and so on)

˜ value of production lots ( ˜ n, ˜ µ, and so on)

2

t

total value including inter-lot, inter-wafer, inter-die and

within-die variation (µ

t

, σ

2

t

, and so on)

2

l

inter-lot variation value (µ

l

, σ

2

l

, and so on)

2

w

inter-wafer variation value (µ

w

, σ

2

w

, and so on)

2

d

inter-die variation value (µ

d

, σ

2

d

, and so on)

2

r

within-die variation value (µ

r

, σ

2

r

, and so on)

2

o

total value except inter-lot variation (µ

o

, σ

2

o

, and so on)

µ

o

= µ

w

+µ

d

+µ

r

, σ

2

o

=σ

2

w

+σ

2

d

+σ

2

r

Table 7.2: Notations.

In order to estimate the conﬁdence interval, we ﬁrst estimate the population values

of mean and variance from measurement data [4]:

µ

l

= ˆ µ

l

+ ˆ σ

l

· M

1

·

ˆ n−1

ˆ nQ

1

(7.16)

σ

2

l

= ( ˆ n−1) ˆ σ

2

l

/Q

1

(7.17)

where M

1

∼N(0, 1) is a random variable with standard normal distribution, and Q

1

∼

χ

2

ˆ n−1

is a random variable with χ

2

distribution with ˆ n −1 degrees of freedom [4].

Then, we estimate the production mean and variance with respect to population mean

and variance:

˜ µ

l

= µ

l

+σ

l

· M

2

/

√

˜ n (7.18)

˜ σ

2

l

= σ

2

l

· Q

2

/( ˜ n−1) (7.19)

where M

2

∼N(0, 1) is a random variable with standard normal distribution and Q

2

∼

χ

2

˜ n−1

is a random variable with χ

2

distribution with ˜ n−1 degrees of freedom. Finally,

169

we obtain the production mean and variance as:

˜ µ

l

= ˆ µ

l

+ ˆ σ

l

·

ˆ n−1

Q

1

M

1

√

ˆ n

+

M

2

√

˜ n

= ˆ µ

l

+ ˆ σ

l

ηT (7.20)

˜ σ

2

l

= ˆ σ

2

l

·

Q

2

( ˆ n−1)

Q

1

( ˜ n−1)

= ˆ σ

2

l

· ζ

2

· U (7.21)

where

η =

ˆ n+ ˜ n

ˆ n˜ n

(7.22)

ζ =

ˆ n−1

˜ n−1

(7.23)

T =

√

ˆ n−1(

√

˜ nM

1

+

√

ˆ nM

2

)

( ˆ n+ ˜ n)Q

1

(7.24)

U =

Q

2

Q

1

(7.25)

It is easy to ﬁnd that:

√

˜ nM

1

+

√

ˆ nM

2

√

ˆ n+˜ n

∼ N(0, 1), hence, T is the ratio between a standard

normal random variable and square root of a χ

2

random variable. Therefore, T is a

random variable with t-distribution with ˆ n degrees of freedom [4]. Moreover, U is the

ratio of two χ

2

random variables. The PDF of U is [4]:

PDF

U

(u) =

Γ(

ˆ n+˜ n

2

−1)u

ˆ n

2

−1

Γ(

ˆ n

2

)Γ(

˜ n

2

)(u+1)

ˆ n+˜ n

2

(7.26)

where Γ(·) is Gamma function [4] which is calculated as:

Γ(x) =

Z

∞

0

t

x−1

e

−t

dt (7.27)

7.3.2 Simplifying the Large Lot Case

In the previous subsection, we analyzed the case when both ˆ n and ˜ n are small (S-S

case). We may simplify our model when either ˆ n or ˜ n is large.

When the number of measured lots ˆ n is large (L-S case), we can approximate the

170

population mean and variance as measured mean and variance:

µ

l

≈ ˆ µ

l

(7.28)

σ

2

l

≈ ˆ σ

2

l

(7.29)

In this case, (7.20) and (7.21) can be simpliﬁed to:

˜ µ

l

= ˆ µ

l

+ ˆ σ

l

· M

2

/

√

˜ n (7.30)

˜ σ

2

l

= ˆ σ

2

l

· Q

2

/( ˜ n−1) (7.31)

In the other case when the number of production lots ˜ n is large (L-S case), the

production values of mean and variance are very close to the population value, i.e.:

˜ µ

l

≈ µ

l

(7.32)

˜ σ

2

l

≈ σ

2

l

(7.33)

Therefore, (7.20) and (7.21) can be simpliﬁed to:

˜ µ

l

= ˆ µ

l

+ ˆ σ

l

M

1

·

ˆ n−1

ˆ nQ

1

(7.34)

˜ σ

2

l

= ( ˆ n−1) ˆ σ

2

l

/Q

1

(7.35)

7.3.3 One Production Lot Case

Besides S-S, L-S, and S-L cases, we also consider the special case of one production

lot (L-1 case), that is ˜ n = 1. In this case, production mean still follows the same

distribution as shown in (7.20) and (7.21). However, since there is only one lot, lot-to-

lot variation has no impact on variance. Uncertainty of variance estimation is mainly

caused by wafer-to-wafer variation. Hence, in this case, the production variance is

calculated as:

˜ σ

2

t

= ˆ σ

2

w

· Q

2

/( ˜ n

−1) +σ

2

d

+σ

2

r

(7.36)

171

Targeted conf.% 50 60 70 80 90

Measured conf.% 51.72 60.34 72.41 81.03 93.10

Table 7.3: Experimental validation of estimation of conﬁdence interval for mean for

ˆ n = 5 and ˜ n = 1.

where ˜ n

is number of production wafers, Q

2

is a χ

2

random variable with ˜ n

−1 free-

dom, and ˆ σ

2

w

is the measured variance of wafer-to-wafer variation.

7.3.4 Experimental Validation for One Lot Case

The problems in validation of the conﬁdence-based guardband models proposed in this

work should be obvious by now. We need exceedingly large amounts of data spread

out over hundreds of lots preferably over multiple independent designs. Fortunately,

the mathematical derivations in this work make very few approximations if any. Nev-

ertheless, in this section we try to validate the theory above for the case with small

number of measured lots and one production lot. To get over the hurdle of huge data

requirement, we use wafer-to-wafer variation instead of lot-to-lot variation.

We obtain wafer-to-wafer ring oscillator delay variation data from an industrial

65nm process with 348 wafers spread over 23 lots. We randomly divide these wafers

into 58 sets with 6 wafers per set. We model each set as a design and assume that 5 of

the wafers are used to generate the statistical model ( ˆ n = 5) and the other wafer is the

production wafer ( ˜ n = 1). For each set we apply (7.20) to estimate the distribution of

production mean and calculate the conﬁdence interval for a targeted conﬁdence level.

We obtain the measured conﬁdence by counting the number of sets that production

value is within the conﬁdence interval: Measured Conf = Number of match sets/ Total

number of sets.

Table 7.3 compares the targeted conﬁdence level and the measured conﬁdence

level. The small error (≤4%) is due to limited data.

172

7.3.5 Fast/Slow Corner Estimation

The next statistical characteristics we consider are fast and slow corners cn

f /s

for vari-

ation sources.

Usually, the fast/slow corner is expressed as:

cn

f /s

= ˜ µ

t

±k

f /s

˜ σ

t

(7.37)

From (7.20) and (7.21), we have:

cn

f

= ˆ µ

l

+µ

o

+ ˆ σ

l

ηT −k

f

ˆ σ

2

ζ

2

U +σ

2

o

(7.38)

cn

s

= ˆ µ

l

+µ

o

+ ˆ σ

l

ηT +k

s

ˆ σ

2

ζ

2

U +σ

2

o

(7.39)

For a given conﬁdence level, we want to estimate the worst case fast/slow corner,

cn

w

f /s

. However, fast corner and slow corners are dependent on each other. Therefore,

we need to handle them together. Assuming that hold time violation is harder to ﬁx,

higher conﬁdence may be needed on fast corner. We estimate the worst case fast/slow

corner such that there is c f

f

conﬁdence that the fast corner is larger than cn

w

f

and there

is c f

t

conﬁdence that the fast corner is larger than cn

w

f

and the slow corner is smaller

than cn

w

s

, that is:

3

P{ cn

f

> cn

w

f

} = c f

f

(7.40)

P{ cn

f

> cn

w

f

, cn

s

< cn

w

s

} = c f

t

(7.41)

Due to the complexity of (7.38), we use Monte-Carlo simulations to obtain the distri-

butions of the fast/slow corners. Then the worst case fast corner is calculated as:

cn

w

f

=CDF

−1

cn

f

(c f

f

) (7.42)

With the worst case fast corner value, from the samples, we may also obtain the con-

ditional CDF of cn

s

given cn

f

> cn

w

f

, CDF

cn

s

| cn

f

<cn

w

f

. Then the worst case slow corner

3

Without loss of generality, we assume fast implies smaller parameter value (e.g. channel width).

173

is calculated as:

cn

w

s

=CDF

−1

cn

s

| cn

f

<cn

w

f

(c f

t

/c f

f

) (7.43)

We then express the worst case corner value as:

cn

w

f /s

= ˆ µ±k

f /s

ˆ σ (7.44)

When the worst case corner values are known, we can easily obtain the value of k

f /s

.

Table 7.4 shows the value of k

f /s

under different conﬁdence levels. In the table, the

guardband percentage is calculated as: 100×(k

f /s

−k

f /s

)/k

f /s

. In the experiment, we

let ˆ n = 10, ˜ n = 15, k

f

= k

s

= 3 and we assume that the variance of inter-lot variation

is 18% of the total variance (derived from the data described in section 7.3.4). We

also assume that the variation source is with standard normal distribution

4

. In the

experiment, we perform the estimation for 1000 runs. For each run, we generate ˆ n

measured samples to obtain the measured mean ˆ µ and variance ˆ σ

2

, and then apply the

method above to calculate the worst case value of k

f /s

. Notice that the k

f /s

depends on

the measured values ˆ µ and ˆ σ

2

, hence each run may result in a different k

f /s

value. In

the table, the k

f /s

values are calculated as the average of 1000 runs. From the table, we

can ﬁnd that to ensure high conﬁdence, k

f /s

needs to be 30% over k

f /s

, i.e., we need a

large guardband.

As discussed in Section 7.3.2, when the number of measured lots is large (L-S), the

mean and variance can be simpliﬁed as (7.30). Hence the corner model can be also

simpliﬁed as:

cn

f

= ˆ µ

l

+ ˆ σ

l

M

2

√

˜ n

−k

f

ˆ σ

2

l

· Q

2

( ˜ n−1)

+σ

2

o

(7.45)

cn

s

= ˆ µ

l

+ ˆ σ

l

M

2

√

˜ n

+k

s

ˆ σ

2

l

· Q

2

( ˜ n−1)

+σ

2

o

(7.46)

4

The nominal mean and variance of variation sources does not affect the value of k

f

/s.

174

c f

t

% c f

f

% k

f

k

s

50 60 3.080 (2.67%) 3.246 (8.20%)

60 70 3.155 (5.13%) 3.275 (9.13%)

70 80 3.255 (8.50%) 3.307 (10.2%)

80 90 3.422 (14.7%) 3.345 (11.5%)

90 95 3.594 (19.8%) 3.518 (17.2%)

95 99 4.042 (34.7%) 3.619 (20.6%)

Table 7.4: Guardband of fast/slow corner for different degrees of conﬁdence assuming

ˆ n = 10, ˜ n = 15.

c f

t

% c f

f

% k

f

k

s

50 60 3.035 (1.17%) 3.103 (3.43%)

60 70 3.112 (3.73%) 3.121 (4.03%)

70 80 3.124 (4.14%) 3.138 (4.60%)

80 90 3.236 (7.87%) 3.215 (7.17%)

90 95 3.401 (13.4%) 3.342 (11.4%)

95 99 3.625 (20.8%) 3.502 (16.7%)

Table 7.5: Guardband of fast/slow corner for different degrees of conﬁdence assuming

large ˆ n and ˜ n = 15.

Table 7.5 shows the average k

f /s

value for the L-S case. As expected, the guardband

in this case is smaller than the S-S case.

Similarly, when the number of production lots is large (S-L), corner model can be

simpliﬁed to:

cn

f

= ˆ µ

l

+ ˆ σ

l

· M

1

·

ˆ n−1

ˆ nQ1

−k

f

( ˆ n−1) ˆ σ

2

l

Q

1

+σ

2

o

(7.47)

cn

s

= ˆ µ

l

+ ˆ σ

l

· M

1

·

ˆ n−1

ˆ nQ1

+k

s

( ˆ n−1) ˆ σ

2

l

Q

1

+σ

2

o

(7.48)

Table 7.6 shows the average k

f /s

value for the S-L case. From the table, it is evident

that the guardband is smaller than that of the S-S case.

Figure 7.1 compares the 90% conﬁdence value of k

f

between the L-S (or S-L) and

L-L. We observe that when ˜ n ≥ 60 (or ˆ n ≥ 85), L-S (or S-L) can be treated as L-L.

175

c f

t

% c f

f

% k

f

k

s

50 60 3.052 (1.73%) 3.132 (4.40%)

60 70 3.134 (4.47%) 3.155 (5.17%)

70 80 3.165 (5.50%) 3.193 (6.43%)

80 90 3.272 (9.07%) 3.267 (8.90%)

90 95 3.431 (14.4%) 3.370 (12.3%)

95 99 3.690 (23.0%) 3.523 (17.4%)

Table 7.6: Guardband of fast/slow corner for different conﬁdence levels assuming

ˆ n = 10 and large ˜ n.

That is, the statistical analysis is reliable and we do not need to perform guardband

estimation. In the experiment, when error of k

f

value between L-L and L-S (or S-L)

cases is within 3%, we consider ˆ n (or ˜ n) value is large.

50 100 150

3.05

3.1

3.15

3.2

3.25

3.3

n

~

(n)

^

k

’

f

L−S

S−L

n

^

=80

n

~

=60

k’

f

=3.09

Figure 7.1: Comparison of S-L, L-S, and L-L case.

Note that increasing lot-to-lot variation will increase the value of ˆ n and ˜ n that can be

considered large as shown in Figure 7.2. This is because when the lot-to-lot variation

increase, the uncertainty of mean and variance increases. Therefore, we need a larger

number of measured lots (or production lots) to be considered as large.

As discussed in Section 7.3.2, in the special case when there is only one production

lot (L-1), lot-to-lot variation has no impact on the variance. Instead, the uncertainty of

176

20 40 60

0

200

400

600

800

σ

2

l

%

L

a

r

g

e

n ~

(

n

)

^

n

~

n

^

Figure 7.2: Number of ˆ n or ˜ n which can be considered as large for different fraction of

lot-to-lot variation assuming maximum 3% error at 90% conﬁdence.

the variance comes from wafer-to-wafer variation. From (7.36), we can calculate the

fast/slow corners as:

cn

f

= ˆ µ

l

+µ

o

+ ˆ σ

l

ηT −k

f

ˆ σ

2

w

Q

2

/( ˜ n

−1) +σ

2

o

(7.49)

cn

s

= ˆ µ

l

+µ

o

+ ˆ σ

l

ηT +k

s

ˆ σ

2

w

Q

2

/( ˜ n

−1) +σ

2

o

(7.50)

Table 7.7 shows the k

f /s

for the L-1 case. In the experiment, we assume that there

are 25 wafers per lot, ˜ n

**= 25. From the table, we see that the guardband is smaller
**

than that of the L-S and S-L cases. This somewhat surprising result is explained by

the fact that lot-to-lot variation has no impact on variance (but still affects the mean),

hence the uncertainty of the corner is smaller.

177

c f

t

% c f

f

% k

f

k

s

50 60 3.051 (1.70%) 3.261 (8.70%)

60 70 3.116 (3.87%) 3.278 (9.27%)

70 80 3.197 (6.57%) 3.293 (9.77%)

80 90 3.320 (10.7%) 3.307 (10.2%)

90 95 3.438 (14.6%) 3.432 (14.4%)

95 99 3.574 (19.1%) 3.475 (15.8%)

Table 7.7: Guardband of fast/slow corner for different conﬁdence levels assuming

ˆ n = 10, ˜ n = 1, and ˜ n

= 25.

7.4 Conﬁ dence in SPICE Fast/Slow Corner

Fast/slow corners for circuit simulation is the major interface between design and pro-

cess. A common approach to derive the corners is to use measured values from canon-

ical circuits such as an inverter chain. Variation source values corresponding to the

corner delay D

w

f /s

are then calculated.

We assume that the inverter chain delay is a linear function of variation sources:

D = d

0

+

m

∑

i=1

c

i

X

i

(7.51)

where m is the number of variation sources, d

0

is the nominal delay, X

i

’s are variation

sources, and c

i

’s are sensitivity coefﬁcients of inverter chain delay to variation sources.

Considering all variation sources to be independent, the mean and variance of the

product inverter chain delay can be calculated as:

˜ µ

D

= d

0

+

m

∑

i=1

c

i

(µ

oi

+ ˜ µ

li

) (7.52)

˜ σ

2

D

=

m

∑

i=1

c

2

i

(σ

2

oi

+ ˜ σ

2

li

) (7.53)

where µ

oi

and σ

2

oi

are the mean and variance of other variation for the i

th

variation

source, respectively, ˜ µ

li

and ˜ σ

2

li

are the mean and variance of lot-to-lot variation for

the i

th

variation source, respectively. ˜ µ

li

and ˜ σ

2

li

can be calculated as shown in Sec-

tion 7.3.1.

178

Usually, the delay corner is expressed as

˜

D

f /s

= ˜ µ

D

±k

f /s

˜ σ

D

(7.54)

As in Section 7.3.5 we use Monte-Carlo simulation to obtain the PDF of the fast and

slow delay corners. One sided conﬁdence interval is then used to obtain the worst case

fast and slow delay corners D

w

f

/s. Corresponding corner values of variation sources

are calculated by solving the following [121]:

D

w

f /s

= d

0

+

m

∑

i=1

c

i

X

w

i

(7.55)

X

w

1f /s

−ˆ µ

1

c

1

ˆ σ

1

=

X

w

2f /s

−ˆ µ

2

c

2

ˆ σ

2

. . . =

X

w

mf /s

−ˆ µ

m

c

m

ˆ σ

m

(7.56)

Table 7.8 illustrates the guardband percentage for the worst case delay and the cor-

responding variation source values under different conﬁdence levels. In the table, the

guardband percentage is calculated as: 100×|worst case value - nominal value|/nominal value.

In the experiment, we use PTM bulk low power SPICE model for 45nm technology

[1]. We assume three variation sources, gate length L, threshold voltage for NMOS V

tn

and PMOS V

t p

. For gate length variation, we assume 3σ = 3nm, for threshold voltage

variation, we assume that the 3σ value is 20% of the nominal value. We also assume

that the inter-lot variation is 18% as in previous sections. In the experiment, we let

ˆ n = 10 and ˜ n = 15. Since measured corner values affect production corners,we show

the average of 1000 runs.

From the table, we observe that under the same conﬁdence level, the guardband

percentage of delay corner is less than that of k

f

value of variation sources as shown

in Table 7.4. This is because the k

f

value of variation sources does not depend on

the nominal mean and variance. However, for the worst case delay, the guard band

percentage does depend on nominal mean. When the nominal mean increases, the

guardband percentage decreases.

179

c f

t

c f

f

D

w

f

L

w

f

V

w

tnf

V

w

t pf

D

w

s

L

w

s

V

w

tns

V

w

t ps

50 60 0.29 0 07 0.21 0.20 0.52 0.13 0.39 0.36

60 70 0.57 0 14 0.43 0.40 0.77 0.19 0.58 0.55

70 80 1.15 0 29 0.86 0.81 1.29 0.32 0.97 0.91

80 90 2.01 0 50 1.50 1.41 1.80 0.45 1.35 1.27

90 95 3.15 0 79 2.36 2.22 2.84 0.71 2.13 2.00

95 99 4.58 1 15 3.44 3.23 3.61 0.90 2.71 2.54

Nominal 349 42.62 0.558 0.603 388 47.41 0.612 0.643

Table 7.8: Percentage guardband of worst case delay corner and corresponding varia-

tion source corners under different conﬁdence levels assuming ˆ n = 10, ˜ n = 15. Note:

delay value is in ps, L

gate

value is in nm, and V

th

value is in V.

Targeted Conf Simulated Conf

c f

t

c f

f

c f

t

c f

f

50 60 51.9 61.6

60 70 61.7 72.4

70 80 71.6 81.4

80 90 81.1 91.3

90 95 90.8 95.7

95 99 95.5 99.2

Table 7.9: Conﬁdence comparison for inverter chain based corner assuming ˆ n = 10,

˜ n = 15.

In Table 7.9, we also compare the targeted conﬁdence and the exact conﬁdence of

the estimated worst case corner value. The ﬂow to obtain the simulated conﬁdence is

shown in Figure 7.3. In the experiment, we let n

run

= 1, 000. From the table, we ﬁnd

that the exact simulated conﬁdence is a little bit higher than the target value. Such

error largely comes from the ﬁtting of the linear delay model to SPICE.

Table 7.10 and 7.11 show the average guardband percentage for the L-S and S-L

cases, respectively. Note that the guardband, though appears to be small, is signiﬁcant

when compared to variation itself (e.g. σ for V

t p

variation is only 6.7% of nominal

value, while the guardband of fast corner is up to 2.58% of nominal value. That is, the

180

1. n

f

= 0, n

t

= 0

2. for i=1 to n

run

3. Generate ˆ n samples of measured lots /* calculate estimated value */

4. Calculate ˆ µ and ˆ σ

2

for each variation sources from samples

5. Perform 10,000-sample MC simulation on (7.54)

according to ˆ µand ˆ σ

2

.

6. Obtain worst case corners: D

w

f

, D

w

s

.

7. Generate ˜ n samples of production lots /* calculate simulated value */

8. Calculate ˜ µ and ˜ σ

2

for each variation source from samples.

9. Perform 10,000-sample SPICE MC simulation on inverter

chain delay based on ˜ µ and ˜ σ

2

.

10. Calculate mean µ

D

and variance σ

2

D

of SPICE MC samples.

11. Calculate simulated corners: D

f

= µ

D

−k

f

σ

D

, D

s

= µ

D

+k

s

σ

D

.

12. if D

f

> D

w

f

/* compare simulated value and estimated value */

13. n

f

=n

f

+1.

14. if D

s

< D

w

s

15. n

t

= n

t

+1

16. c f

f

= n

f

/n

run

, c f

t

= n

t

/n

run

Figure 7.3: Flow to simulate conﬁdence.

guardband value is up to 38.7% of standard deviation).

Figure 7.1 compares the 90% conﬁdence value of D

w

f

between the L-S (or S-L) and

L-L. In the ﬁgure, D

w

f

is normalized with respect to the nominal value. From the ﬁgure,

We observe that ˜ n ≥30 (or ˆ n ≥45), can be considered as “large”. Similar to the single

source case, when error of k

f

value between L-L and L-S (or S-L) case are within 3%,

we consider ˆ n (or ˜ n) value is large. We also ﬁnd that for the worst case delay corner,

the value of ˆ n (or ˜ n) to be considered as large is smaller than that of the single variation

source as in Figure 7.1. This is because the guardband of worst case delay is smaller

than that of single variation source as discussed above. Hence, we need fewer lots to

achieve the same model reliability.

Table 7.12 shows the average guardband percentage for the L-1 case. The guard-

band percentage of this case is smaller that of S-S, L-S, or S-L cases. As discussed in

Section 7.3.5, this is because lot-to-lot variation has no impact on variance. Therefore,

181

c f

t

c f

f

D

w

f

L

w

f

V

w

tnf

V

w

t pf

D

w

s

L

w

s

V

w

tns

V

w

t ps

50 60 0.21 0.05 0.16 0.15 0.39 0.10 0.29 0.27

60 70 0.43 0.11 0.32 0.30 0.58 0.14 0.43 0.41

70 80 0.86 0.21 0.64 0.61 0.97 0.24 0.72 0.68

80 90 1.50 0.38 1.13 1.06 1.35 0.34 1.01 0.95

90 95 2.36 0.59 1.77 1.67 2.13 0.53 1.59 1.50

95 99 3.44 0.86 2.58 2.42 2.71 0.68 2.03 1.91

Table 7.10: Percentage guardband of worst case delay corner and corresponding vari-

ation sources corners under different conﬁdence levels assuming large and ˜ n = 15.

c f

t

c f

f

D

w

f

L

w

f

V

w

tnf

V

w

t pf

D

w

s

L

w

s

V

w

tns

V

w

t ps

50 60 0.24 0.06 0.18 0.17 0.44 0.11 0.33 0.31

60 70 0.49 0.12 0.37 0.34 0.66 0.16 0.49 0.46

70 80 0.97 0.24 0.73 0.69 1.10 0.27 0.82 0.77

80 90 1.70 0.43 1.28 1.20 1.53 0.38 1.15 1.08

90 95 2.68 0.67 2.01 1.89 2.41 0.60 1.81 1.70

95 99 3.90 0.97 2.92 2.75 3.07 0.77 2.30 2.16

Table 7.11: Percentage guardband of worst case delay corner and corresponding vari-

ation source corners under different conﬁdence levels assuming ˆ n = 10 and large ˜ n.

182

50 100 150

0.9

0.92

0.94

0.96

0.98

1

n

~

(n)

^

N

o

r

m

a

l

i

z

e

d

D

w f

L−S

S−L

n

^

=45

n

~

=30

3% error

Figure 7.4: 90% conﬁdent D

w

f

with varying ˆ n and ˜ n.

c f

t

c f

f

D

w

f

L

w

f

V

w

tnf

V

w

t pf

D

w

s

L

w

s

V

w

tns

V

w

t ps

50 60 0.23 0.06 0.17 0.16 0.42 0.11 0.32 0.30

60 70 0.38 0.10 0.29 0.27 0.53 0.13 0.40 0.37

70 80 0.69 0.17 0.52 0.49 0.71 0.18 0.53 0.50

80 90 0.95 0.24 0.71 0.67 0.90 0.23 0.68 0.63

90 95 1.46 0.37 1.10 1.03 1.21 0.30 0.91 0.85

95 99 2.05 0.51 1.54 1.45 1.87 0.47 1.40 1.32

Table 7.12: Percentage guardband of worst case delay corner and corresponding vari-

ation source corners under different conﬁdence levels assuming ˆ n = 10, ˜ n = 1, and

˜ n

= 25.

we need lower guardband to achieve the same conﬁdence.

In this section, we illustrated the extra guardband required in corners (which ironi-

cally are themselves guardbands by deﬁnition) to ensure conﬁdence in statistical anal-

ysis when the number of lots used to characterize variation or number of production

silicon lots are small. Next we discuss the impact of limited characterization or pro-

duction data on one of the most talked about chip-level statistical analysis methods,

namely, statistical timing analysis.

183

7.5 Conﬁ dence in Statistical Timing Analysis

In this section, we will discuss about conﬁdence estimation for statistical static timing

analysis (SSTA). Since quadratic SSTA with non-Gaussian variation sources is very

complicated, we only analyze linear SSTA with Gaussian variation sources in this

section.

By performing linear SSTA in Section 5.5, the chip delay is expressed as a linear

canonical form of variation sources:

D = d

0

+

m

∑

i=1

c

i

X

i

(7.57)

Since circuit delay has a linear canonical form, its mean and variance can be cal-

culated in the same way as shown in (7.52). We again use Monte-Carlo simulation

on the linear equation to obtain the CDF of the output mean (CDF

µ

D

(·)) and variance

(CDF

σ

2

D

(·)). Then for a given conﬁdence level Con f , it is easy to obtain the worst case

mean and variance:

5

µ

w

= CDF

−1

µ

D

(Con f ) (7.58)

σ

w2

= CDF

−1

σ

2

D

(Con f ) σ

w

=

σ

w2

(7.59)

Another quantity of interest is a certain percentile point. Since we assume Gaussian

variation sources and apply linear canonical form to approximate circuit delay, the

circuit delay also follows Gaussian distribution. Therefore, the percentile point C(p%)

can be calculated as:

C(p%) = µ+kσ (7.60)

k = Φ

−1

(p%) (7.61)

5

Notice that we have known the distribution of mean and variance, it is easy for us to obtain any

interval of mean and variance under a certain conﬁdence.

184

where Φ(·) is the CDF of standard normal distribution. In this way, the percentile point

is converted to a µ+kσ corner. We may apply the same method as in Section 7.4 to

obtain the CDF of the percentile point, and then calculate its worst case value.

Figure 7.5 illustrates the worst case mean, standard deviation, and 95% percentile

point for ISCAS85 benchmarks. In the experiment, we apply the same experimental

setting as that for the inverter chain as in Section 7.4. As discussed before, the worst

case value depends on the measured values of mean and variance. Hence, different

runs may result in different worst case values. In the ﬁgure, the worst case value is

averaged over 1,000 runs and normalized with respect to the nominal value. From the

ﬁgure, we observe that the guardband of chip delay is lower than that of single varia-

tion source and inverter chain delay. This is because chip delay has a larger nominal

mean to variance ratio than inverter chain. However, although the guardband value

seems not large, it is very signiﬁcant compared to the scale of variation. For exam-

ple, for the 95-percentile point, the nominal value is 6.9X of standard deviation and

the guardband value is up to 7.6% of nominal value to ensure 95% conﬁdence. That

means the guardband value is up to 52% of standard deviation.

In Table 7.13, we compare the targeted conﬁdence and the simulated conﬁdence of

the estimated conﬁdence interval. The simulated conﬁdence is obtained in Figure 7.3.

The only difference is that instead of running SPICE Monte-Carlo simulation, we ap-

ply SSTA [41] to obtain the chip delay distribution. Due to our simplifying assumption

that the linear canonical form is independent of small changes in mean and variance,

there is a small error between targeted and simulated conﬁdence.

Figure 7.6 illustrates the mean, standard deviation, and 95-percentile point of IS-

CAS85 benchmarks under different conﬁdence level for S-L case. We ignore L-S and

L-1 cases where the need of SSTA is not well motivated.

Figure 7.7 compares the 90% conﬁdence value of 95-percentile point for C7552

185

50 60 70 80 90

1

1.02

1.04

1.06

Confidence

µ

w

C17

C444

c1355

c7552

50 60 70 80 90

1

1.05

1.1

1.15

Confidence

σ

w

C17

C444

C1355

C7552

50 60 70 80 90

1.02

1.04

1.06

1.08

Confidence

W

o

r

s

t

9

5

%

C17

C444

C1355

C7552

Figure 7.5: Worst case mean, standard deviation, and 95-percentile point normalized

with respect to the nominal value for ISCAS85 benchmarks, assuming ˆ n = 10 and

˜ n = 15.

Targeted Simulated conf

conf µ σ

2

95%

50 51.3 51.4 51.3

60 60.8 60.9 61.2

70 70.1 70.7 70.8

80 80.9 81.0 80.3

90 90.5 90.4 90.2

95 95.5 95.2 95.3

Table 7.13: Conﬁdence comparison for ISCAS85 benchmark 95-percentile point delay

assuming ˆ n = 10, ˜ n = 15.

186

50 60 70 80 90

1

1.01

1.02

1.03

1.04

1.05

Confidence

µ

w

C17

C444

C1355

C7552

50 60 70 80 90

1

1.05

1.1

Confidence

σ

w

C17

C444

C1355

C7552

50 60 70 80 90

1

1.02

1.04

1.06

Confidence

w

o

r

s

t

9

5

%

C17

C444

C1355

C7552

Figure 7.6: Normalized worst case mean, standard deviation, and 95-percentile point

for ISCAS85 benchmarks, assuming ˆ n = 10 and ˜ n is large.

187

50 100 150

1

1.01

1.02

1.03

1.04

1.05

1.06

n

^

w

o

r

s

t

9

5

%

n

^

=80

n

^

=40

n

^

=24

Figure 7.7: 90% conﬁdence worst case 95-percentile point with varying ˆ n for C7552.

Percentile point is normalized with respect to the nominal value.

between S-L and L-L. In the ﬁgure, percentile point is normalized with respect to the

nominal value. From the ﬁgure, we observe that the guardband of chip delay is lower

than that of single variation source and inverter chain delay. We ﬁnd that we only need

a small number of lots ( ˆ n ≥ 24) to bound the error between S-L and L-L under 3%. If

we want to bound the error to 2% or 1%, we need ˆ n ≥40 and ˆ n ≥80, respectively.

Again, we see that limited data can lead to signiﬁcantly large guardbanding to en-

sure good conﬁdence in analysis results. The method discussed in this section is fairly

straightforward, fast (essentially Monte Carlo on a linear expression, runtime is within

10s) and accurate to estimate conﬁdence in results of statistical timing analysis espe-

cially when the variability models are known to suffer from limited lot-to-lot variation

data.

188

7.6 Practical Questions: Determining Conﬁ dence Induced Guard-

band for Real Designs

So far in the chapter, we have proposed techniques to quantify the conﬁdence in sta-

tistical circuit analysis and estimate guardband required to attain a desired level of

conﬁdence. The answer to “what is the desired level of conﬁdence” is not an easy

one especially given the steep increase in guardband to ensure higher levels of conﬁ-

dence. Besides quality of models ( ˆ n) and production volume ( ˜ n), two issues inﬂuence

the choice of conﬁdence level:

1. Risk (in terms of schedule, performance, power, etc) tolerable for the design.

A lower conﬁdence does not necessarily translate to lower yield but it implies

that the probability of the statistical analysis actually matching real silicon is

smaller. Therefore, conﬁdence level is directly related to risk. For designs where

power/performance constraints are ﬂexible or time to volume is not critical (i.e.,

if yield is lower than expectation, running more lots is acceptable), a lower conﬁ-

dence is acceptable. Conﬁdence levels provide a tradeoff between risk of higher

manufacturing cost (due to low yield) and higher upfront design cost (due to

guardbanding).

2. Extent of post-silicon tuning options available. Ability to change design char-

acteristics during or post-manufacturing mitigates the need for high conﬁdence

in models. If the silicon foundry can extensively adjust the process (which es-

sentially shifts the mean) for the design

6

, the silicon can be made to match the

analysis and ensuring high conﬁdence in design-side estimates is not essential.

Similarly, tunability using knobs such as adaptive body biasing, adaptive voltage

scaling can also alleviate downside of low conﬁdence.

6

This is usually possible only for high-volume, high-value designs.

189

In the end, the choice of conﬁdence will be as much a business decision as a techni-

cal decision as it quantiﬁes the desire for predictability of statistics of manufacturing.

7.7 Conclusions

In this chapter, we have studied impact of characterization and production silicon vol-

umes on believability of statistical variation models and analysis. We have developed

methods to estimate the distribution of production silicon statistical characteristics

(e.g., mean, variance, percentile corners, etc) in terms of (unreliable) measured sta-

tistical characteristics, measured data volume, and intended production volume. The

loss of reliability largely stems from lot-to-lot variation. Our experiments and results

indicate the following.

1. For small characterization or production volumes, best/worst corners for a vari-

ation source may need as much as 30 % guardband to attain 95% conﬁdence.

Our estimates show strong agreement with conﬁdence measured from a set of

real silicon data (maximum error ≤ 4%).

2. For a typical SPICE corner extraction methodology, the guardband needed for

95% conﬁdence is about 14% of the nominal uncertainty (difference between

best/worst corners) for low-to-medium volume production ( ˆ n = 10, ˜ n = 15).

3. Finally, the 95

th

percentile delay as estimated using statistical timing analysis

needs a guardband of 5%-7%.

Predicted statistics are reliable for number of measured lots larger than 85 and

number of production lots larger than 60. We assumed lot-to-lot variation to be 18%

of total variance for our experiments. The required guardband will increase if lot-

to-lot variation increases as a fraction of total variation. We believe that conﬁdence-

190

level induced guardband should be considered for all low-to-medium volume designs.

Moreover, the problem is likely to become more severe when 450mm wafer sizes are

adopted by the industry (since less number of lots would be required to meet the same

production volume).

191

CHAPTER 8

Conclusions

In modern Very Large Scale Integration (VLSI) circuit industry, the scaling of Complementary

Metal Oxide Semiconductor (CMOS) technology enables the ever growing number of

transistors on a chip, hence improves the chip functionality and reduces the manu-

facturing cost. However, the process variation caused by inevitable manufacturing

ﬂuctuations and imperfections also increases when technology scales down. Process

variation introduces signiﬁcant power and performance uncertainty to VLSI circuits

and becomes a potential stopper for further technology scaling.

In this dissertation, we consider two types of the most common VLSI circuits:

Field-Programmable Gate Array (FPGA) and Application Speciﬁc Integrated Circuit

(ASIC). We have modeled, analyzed, and optimized power and delay for both FPGAs

and ASICs.

In Part I, including Chapter 2, 3, 4, we have studied modeling, analysis, and

optimization of FPGA power and performance with existence of process variation.

In Chapter 2, we have developed a trace-based power and performance evalua-

tion framework(Ptrace) for FPGA. Then we performed device and architecture co-

evaluation on hundreds of device and architecture combinations to optimize power,

delay, and area. Compared to the baseline, our optimization reduces the energy-delay

product by 18.4% and area by 23.3%.

In Chapter 3, we have extended Ptrace to consider process variation. Then we per-

192

formed device and architecture co-optimization to maximize power and delay yield.

Compared to the baseline, our optimization improves leakage power and timing com-

bined yield by 9.4%.

In Chapter 4, we have developed transistor level voltage and current model based

on ITRS MASTAR4 (Model for Assessment of cmoS Technologies And Roadmaps),

and then developed circuit level power, delay, and reliability model. Combining the

circuit level model with Ptrace, we have developed a framework to evaluate FPGA

power, delay, and reliability simultaneously. We have shown that applying heteroge-

neous gate lengths to logic and interconnect may lead to 1.3X delay difference, 3.1X

energy difference. This offers a large room for power and delay tradeoff. We have fur-

ther shown that the device aging has a knee point over time and device burn-in to reach

the point could reduce the performance change over 10 years from 8.5% to 5.5% and

signiﬁcantly reduce leakage variation. In addition, we have also studied the interaction

between process variation, device aging, and SER. We observe that device aging re-

duces standard deviation of leakage by 65% over 10 years while it has relatively small

impact on delay variation. Moreover, we also ﬁnd that neither device aging due to

NBTI and HCI nor process variation have signiﬁcant impact on SER.

In Part II, including Chapter 5, Chapter 6, and Chapter 7, we have studied sta-

tistical timing modeling and analysis for ASICs.

In Chapter 5, we have proposed a new method to approximate the max opera-

tion of two non-Gaussian random variables using second-order polynomial ﬁtting. By

applying such approximation, we have presented new SSTA algorithms for three dif-

ferent delay models: quadratic model, quadratic model without crossing terms (semi-

quadratic model), and linear model. All atomic operations of these algorithms are

performed by closed-form formulas, hence they are very time efﬁcient. Experimental

results show that compared to Monte-Carlo simulation, our SSTA ﬂow predicts the

193

mean, standard deviation, skewness, and 95-percentile point within 1%, 1%, 6%, and

1% error, respectively.

In Chapter 6, we have studied the impact of systematic across-wafer variation on

within-die variation. Based on the analysis, we have developed an efﬁcient and ac-

curate spatial variation model to model across-wafer variation at die level. We then

implemented the new model to the SSTA ﬂow in Chapter 5. Experimental results

show that our new model reduces error from 6.5% to 2% and improve run time by 6X

compared to the traditional spatial variation model.

In Chapter 7, we have studied impact of characterization and production silicon

volumes on believability of statistical variation models and analysis. We mathemati-

cally derived the conﬁdence intervals for commonly used statistical measures (mean,

variance, percentile corner) and analysis (SPICE corner extraction, statistical timing).

Our estimates are within 2% of simulated conﬁdence values. Our experiments indicate

that for moderate characterization volumes (10 lots) and low-to-medium production

volumes (15 lots), a signiﬁcant guardband (e.g., 34.7% of standard deviation for single

parameter corner, 38.7% of standard deviation for SPICE corner, and 52% of standard

deviation for 95%-tile point of circuit delay are needed) to ensure 95% conﬁdence in

the results. The guardbands are non-negligible for all cases when either production or

characterization volume is not large.

This dissertation makes contributions to statistical modeling, analysis, and opti-

mization for FPGAs and ASICs. However, there are still lots of research works need

to be done in this ﬁeld. In practice, the statistical modeling and analysis techniques

presented in Part II can not only be applied to ASICs, but also to FPGAs. The

SSTA method introduced in Chapter 5 can be applied to FPGA delay variation for

more accurate and efﬁcient delay variation estimation; the model of spatial correlation

in Chapter 6 can also be applied to FPGA power and delay variation estimation to im-

194

prove the accuracy and efﬁciency; and we may also apply the method in Chapter 7 to

estimate the conﬁdence for FPGA power and delay variation analysis. In the future,

applying the statistical analysis techniques for ASICs to FPGAs is one of our research

topics.

195

CHAPTER 9

Appendix

In the appendix, I brieﬂy introduce the works I have ﬁnished in my Ph.D. program and

are not introduced in this dissertation:

• FPGAPerformance Optimization via Chipwise Placement Considering Pro-

cess Variations. We develop a statistical chipwise placement ﬂow to improve

FPGA performance. First, we propose an efﬁcient high-level trace-based es-

timation method, similar to Ptrace, to evaluate the potential performance gain

achievable through chipwise FPGA placement without detailed placement. Sec-

ond, we develop a variation-aware detailed placement algorithmvaPL within the

VPR framework [18] to leverage process variation and optimize performance

for each chip. Chipwise placement vaPL is deterministic for each chip when the

chip’s variation map is known, and leads to different placements for different

chips of the same application. Our experimental results show that, compared

to the existing FPGA placement, variation aware chipwise placement improves

circuit performance by up to 12.1% for the tested variation maps. This work

was published in International Conference on Field Programmable Logic and

Applications (FPL), August 2006.

• Fourier Series Approximation for Max Operation in Non-Gaussian and

Quadratic Statistical Static Timing Analysis. We propose efﬁcient algorithms

to handle the max operation in SSTA with both quadratic delay dependency and

196

non-Gaussian variation sources simultaneously. Based on such algorithms, we

develop an SSTA ﬂow with quadratic delay model and non-Gaussian variation

sources. All the atomic operations, max and add, are calculated efﬁciently via

either closed-form formulas or low dimension (at most 2D) lookup tables. We

prove that the complexity of our algorithm is linear in both variation sources and

circuit sizes, hence our algorithm scales well for large designs. Compared to

Monte Carlo simulation for non-Gaussian variation sources and nonlinear delay

models, our approach predicts the mean, standard deviation and 95% percentile

point with less than 2% error, and the skewness with less than 10% error. This

work was published in Design Automation Conference (DAC), June 2007.

• Accounting for Non-linear Dependence Using Function Driven Component

Analysis. We ﬁrst analyze the impact of non-linear dependence on statisti-

cal analysis and show that ignoring non-linear dependence may cause error in

both statistical timing and power analysis. Then we introduce a function driven

component analysis (FCA) to minimize the error introduced by non-linear de-

pendence. Experimental results show that the proposed FCA method is more

accurate compared to the traditional PCA or ICA. This work was published

in IEEE/ACM Asia South Paciﬁc Design Automation Conference (ASPDAC),

February 2009.

• Efﬁcient Additive Statistical Leakage Estimation. We present an efﬁcient

additive leakage variation model. We use polynomial instead of exponential

function to represents the leakage variation considering inter-die variation, then

extend such model to consider both within-die spatial and within-die random

variation. Due to the additivity, this new model is more efﬁcient than the ex-

ponential model when calculating the full chip leakage power. Experimental

results show that our method is 5X faster than the existing Wilkinson’s approach

197

with no accuracy loss in mean estimation, about 1% accuracy loss in standard

deviation estimation and 99% percentile point estimation. This work was pub-

lished in IEEE Transactions on Computer-Aided Design of Integrated Circuits

and Systems (TCAD), Volume 28, issue: 11, November 2009, pp:1777 - 1781.

198

REFERENCES

[1] PTM SPICE model. In http://www.eas.asu.edu/ ptm/latest.html.

[2] Soroush Abbaspour, Hanif Fatemi, and Massoud Pedram. Non-gaussian statis-

tical interconnect timing analysis. In Proc. Design Automation and Test Conf.

in Europe, March 2006.

[3] Soroush Abbaspour, Hanif Fatemi, and Massoud Pedram. Parameterized block-

based non-gaussian statistical gate timing analysis. In Proc. Asia South Paciﬁc

Design Automation Conf., January 2006.

[4] Milton Abramowitz and Irene A. Stegun. Handbook of Mathematical Functions

with Formulas, Graphs, and Mathematical Tables. New York: Dover, 1965.

[5] Aseem Agarwal, David Blaauw, and Vladimir Zolotov. Statistical timing analy-

sis for intra-die process variations with spatial correlations. In Proc. Intl. Conf.

Computer-Aided Design, pages 900 – 907, Nov. 2003.

[6] Aseem Agarwal, David Blaauw, Vladimir Zolotov, Savithri Sundareswaran,

Min Zhao, Kaushik Gala, and Rajendran Panda. Statistical delay computation

considering spatial correlations. In Proc. Asia South Paciﬁc Design Automation

Conf., 2003.

[7] Aseem Agarwal, David Blaauw, Vladimir Zolotov, Savithri Sundareswaran,

Min Zhao, Kaushik Gala, and Rajendran Panda. Statistical delay computation

considering spatial correlations. In Proc. Asia South Paciﬁc Design Automation

Conf., pages 271 – 276, January 2003.

[8] Aseem Agarwal, David Blaauw, Vladimir Zolotov, and Sarma Vrudhula. Com-

putation and reﬁnement of statistical bounds on circuit delay. In Proc. Design

Automation Conf., June 2003.

[9] Aseem Agarwal, Vladimir Zolotov, and David Blaauw. Statistical timing anal-

ysis using bounds and selective enumeration. In IEEE Trans. Computer-Aided

Design of Integrated Circuits and Systems, volume 22, pages 1243 – 1260, Sept.

2003.

[10] Elias Ahmed and Jonathan Rose. The effect of LUT and cluster size on deep-

submicron FPGA performance and density. In Proc. ACM Intl. Symp. Field-

Programmable Gate Arrays, pages 3–12, February 2000.

[11] M.A. Alam and S. Mahapatra. A comprehensive model of pmos nbti degrada-

tion. In Microeletronics Reliability, August 2005.

199

[12] Jason H. Anderson, Farid N. Najm, and Tim Tuan. Active leakage power op-

timization for FPGAs. In Proc. ACM Intl. Symp. Field-Programmable Gate

Arrays, February 2004.

[13] Hongliang Chang andVladimir Zolotov, Sambasivan Narayan, and Chandu

Visweswariah. Parameterized block-based statistical timing analysis with non-

Gaussian and nonlinear parameters. In Proc. Design Automation Conf., pages

71–76, June 2005. Anaheim, CA.

[14] Ghazanfar Asadi and Mehdi B. Tahoori. Soft error rate estimation and mitiga-

tion for SRAM-based FPGAs. In Proc. ACM Intl. Symp. Field-Programmable

Gate Arrays, February 2005.

[15] Maryam Ashouei, Abhijit Chatterjee, Adit D. Singh, and Vivek De. A dual-

vt layout approach for statistical leakage variability minimization in nanometer

CMOS. In Proc. Int. Conf. on Computer Design, October 2005.

[16] Adelchi Azzalini. The skew-normal distribution and related multivariate fami-

lies. Scandinavian Journal of Statistics, 32:159–188, June 2005.

[17] M. Bellato, P. Bernardi1, D. Bortolato, A. Candelori, M. Ceschia,

A. Paccagnella, M. Rebaudengo1, M. Sonza Reorda, M. Violante1, and P. Zam-

bolin. Evaluating the effects of SEUs affecting the conﬁguration memory of an

SRAM-based FPGA. In Design Automation and Test in Europe, March 2004.

[18] Vaughn Betz, Jonathan Rose, and Alexander Marquardt. Architecture and CAD

for Deep-Submicron FPGAs. Kluwer Academic Publishers, February 1999.

[19] Sarvesh Bhardwaj, Praveen Ghanta, and Sarma Vrudhula. A framework for

statistical timing analysis using non-linear delay and slew models. In Proc. Intl.

Conf. Computer-Aided Design, November 2006.

[20] Sarvesh Bhardwaj and Sarma Vrudhula. Leakage minimization of nano-scale

circuits in the presence of systematic and random variations. In Proc. Design

Automation Conf., 2005.

[21] Sarvesh Bhardwaj and Sarma Vrudhula. Leakage minimization of digital cir-

cuits using gate sizing in the presence of process variations. IEEE Trans.

Computer-Aided Design of Integrated Circuits and Systems, March 2008.

[22] Sarvesh Bhardwaj, Sarma Vrudhula, and David Blaauw. τau: Timing analysis

under uncertainty. In Proc. Intl. Conf. Computer-Aided Design, pages 615 –

620, Nov. 2003.

200

[23] Sarvesh Bhardwaj, Wenping Wang, Rakesh Vattikonda, Yu Cao, and S. Vrud-

hula. Predictive modeling of the nbti effect for reliable design. In Proc. IEEE

Custom Integrated Circuits Conf., September 2006.

[24] Duane Boning, James Chung, Dennis Ouma, and Rajesh Divecha. Spatial varia-

tion in semiconductor processes: Modeling for control. In Proceedings Process

Control Diagnostics and Modeling in Semiconductor Manufacturing II Electro-

chemical Society Meeting, May 1997.

[25] Shekhar Borkar, Tanay Karnik, Siva Narendra, Jim Tschanz, Ali Keshavarzi,

and Vivek De. Parameter variations and impact on circuits and microarchitec-

ture. In Proc. Design Automation Conf., June 2003.

[26] Shekhar Borkar, Tanay Karnik, Siva Narendra, Jim Tschanz, Ali Keshavarzi,

and Vivek De. Parameter variations and impacts on circuits and microarchitec-

ture. In Proc. Design Automation Conf., June 2003.

[27] Jozef Brcka and Rodney L. Robison. Wafer redeposition impact on etch rate

uniformity in ipvd system. IEEE Transactions on Plasma Sciences, February

2007.

[28] Michael Brown, Cyrus Bazeghi, Matthew Guthaus, and Jose Renau. Measur-

ing and modeling variabilityusing low-cost FPGAs. In Proc. ACM Intl. Symp.

Field-Programmable Gate Arrays, February 2009.

[29] James Burnham, Chih-Kong Ken Yang, and Haitham Hindi. A stochastic jitter

model for analyzing digital timing-recovery circuits. In Proc. Design Automa-

tion Conf., July 2009.

[30] Steven M. Burns, Mahesh C. Ketkar, Noel Menezes, Keith A. Bowman,

James W. Tschanz, and Vivek De. Comparative analysis of conventional and

statistical design techniques. In Proc. Design Automation Conf., June 2007.

[31] Nicola Campregher, Peter Y. K. Cheung, George A. Constantinides, and Mi-

lan Vasilko. Yield enhancements of design-speciﬁc fpgas. In Proc. ACM Intl.

Symp. Field-Programmable Gate Arrays, February 2006.

[32] Hongliang Chang and Sachin S. Sapatnekar. Statistical timing analysis consid-

ering spatial correlations using a single PERT-like traversal. In Proc. Intl. Conf.

Computer-Aided Design, pages 621 – 625, Nov. 2003.

[33] Hongliang Chang and Sachin S. Sapatnekar. Full-chip analysis of leakage

power under process variations, including spatial correlations. In Proc. Design

Automation Conf., June 2005.

201

[34] Kueing-Long Chen, Stephen A. Saller, Imelda A. Groves, and David B. Scott.

Reliability effects on mos transistors due to hot-carrier injection. IEEE Journal

of Solid-State Circuits, February 1985.

[35] Ruiming Chen, Lizheng Zhang, Vladimir Zolotov, Chandu Visweswariah, and

Jinjun Xiong. Static timing: Back to our roots. In Proc. Asia South Paciﬁc

Design Automation Conf., February 2008.

[36] Ruiming Chen and Hai Zhou. Timing budgeting under arbitrary process varia-

tions. In Proc. Intl. Conf. Computer-Aided Design, November 2007.

[37] Ruiming Chen and Hai Zhou. Fast estimation of timing yield bounds for process

variations. IEEE Trans. VLSI Syst., March 2008.

[38] Lerong Cheng, Fei Li, Yan Lin, Phoebe Wong, and Lei He. Device and architec-

ture cooptimization for FPGA power reduction. IEEE Trans. Computer-Aided

Design of Integrated Circuits and Systems, 19:1211–1221, July 2007.

[39] Lerong Cheng, Yan Lin, and Lei He. Tracebased framework for concurrent

development of process and FPGA architecture considering process variation

and reliability. In Proc. ACM Intl. Symp. Field-Programmable Gate Arrays,

February 2008.

[40] Lerong Cheng, Jinjun Xiong, and Lei He. Non-linear statistical static timing

analysis for non-gaussian variation sources. In Proc. Design Automation Conf.,

June 2007.

[41] Lerong Cheng, Jinjun Xiong, and Lei He. Non-gaussian statistical timing anal-

ysis using Second-Order polynomial ﬁtting. In Proc. Asia South Paciﬁc Design

Automation Conf., February 2008.

[42] Lerong Cheng, Jinjun Xiong, and Lei He. Non-Gaussian statistical timing anal-

ysis using Second-Order polynomial ﬁtting. IEEE Trans. Computer-Aided De-

sign of Integrated Circuits and Systems, January 2009.

[43] Lerong Cheng, Jinjun Xiong, Lei He, and Mike Hutton. FPGA performance

optimization via chipwise placement considering process variations. In Inter-

national Conference on Field-Programmable Logic and Applications, August

2006.

[44] Kaviraj Chopra, Saumil Shah, Ashish Srivastava, David Blaauw, and Dennis

Sylvester. Parametric yield maximization using gate sizing based on efﬁcient

statistical power and delay gradient computation. In Proc. Intl. Conf. Computer-

Aided Design, November 2005.

202

[45] Kaviraj Chopra, Narendra Shenoy, and David Blaauw. Variogram based ro-

bust extraction of process variation. In Proc. τ, Timing and Analysis Utilities,

February 2007.

[46] Charles E. Clark. The greatest of a ﬁnite set of random variables. In Operations

Research, pages 85 – 91, 1961.

[47] David Lewis ect. The stratix routing and logic architecture. In Proc. ACM Intl.

Symp. Field-Programmable Gate Arrays, February 2003.

[48] Vijay Degalahal and TimTuan. Methodology for high level estimation of FPGA

power consumption. In Proc. Asia South Paciﬁc Design Automation Conf.,

January 2005.

[49] Anirudh Devgan and Chandramouli Kashyap. Block-Based static timing anal-

ysis with uncertainty. In Proc. Intl. Conf. Computer-Aided Design, November

2003.

[50] Nigel Drego, Anantha Chandrakasan, and Duane Boning. Lack of spatial corre-

lation in mosfet threshold voltage variation and implications for voltage scaling.

IEEE Trans. Semiconductor Manufacturing, May 2009.

[51] Fei Li and Yan Lin and Lei He. Vdd programmability to reduce FPGA inter-

connect power. In Proc. Intl. Conf. Computer-Aided Design, November 2004.

[52] Zhuo Feng, Peng Li, and Yaping Zhan. Fast second-order statistical static tim-

ing analysis using parameter dimension reduction. In Proc. Design Automation

Conf., June 2007.

[53] F.N.Najm and N.menezes. Statistical timing analysis based on a timing yield

model. In Proc. Design Automation Conf., June 2004.

[54] Paul Friedberg, Willy Cheung, and Costas J. Spanos. Spatial modeling of

micron-scale gate length variation. In Data Analysis and Modeling for Process

Control III, March 2006.

[55] Qiang Fu, Wai-Shing Luk, Jun Tao, Changhao Yan, and Xuan Zeng. Character-

izing intra-die spatial correlation using spectral density method. In Proc. IEEE

International Symposium on Quality Electronic Design, March 2008.

[56] Takayuki Fukuoka, Akira Tsuchiya, and Hidetoshi Onodera. Statistical gate

delay model for multiple input switching. In Proc. Asia South Paciﬁc Design

Automation Conf., February 2008.

203

[57] Anne Gattiker, Sani Nassif, Rashmi Dinakar, and Chris Long. Timing yield

estimation from static timing analysis. In International Symposium on Quality

of Electronic Design, 2001.

[58] A. Gayasen, K. Lee, N. Vijaykrishnan, M. Kandemir, M. J. Irwin, and

T. Tuan. A dual-vdd low power FPGA architecture. In Proc. Intl. Conf. Field-

Programmable Logic and its Application, August 2004.

[59] A. Gayasen, Y. Tsai, N. Vijaykrishnan, M. Kandemir, M. J. Irwin, and T. Tuan.

Reducing leakage energy in FPGAs using region-constrained placement. In

Proc. ACM Intl. Symp. Field-Programmable Gate Arrays, February 2004.

[60] P. Hazucha and C. Svensson. Impact of CMOS technology scaling on the atmo-

spheric neutron soft error rate. IEEE Trans. on Nuclear Science, 2000.

[61] Khaled R. Heloue, Sari Onaissi, and Farid N. Najm. Efﬁcient block-based pa-

rameterized timing analysis covering all potentially critical paths. In Proc. Intl.

Conf. Computer-Aided Design, November 2008.

[62] Dadgour H.F., Sheng-Chih Lin, and Banerjee K. A statistical framework for

estimation of full-chip leakage-power distribution under parameter variations.

Electron Devices, IEEE Transactions on, November 2007.

[63] Katsumi Homma, Izumi Nitta, and Toshiyuki Shibuya. Non-gaussian statistical

timing models of die-to-die and within-die parameter variations for full chip

analysis. In Proc. Asia South Paciﬁc Design Automation Conf., February 2008.

[64] Shin ichi OHKAWA, HirooMASUDA, and Yasuaki INOUE. A novel expres-

sion of spatial correlation by a random curved surface model and its application

to LSI designs. IEICE TRANS. FUNDAMENTALS, 2008.

[65] Masanori Imai, Takashi Sato Noriaki Nakayama, and Kazuya Masu. Non-

parametric statistical static timing analysis: An ssta framework for arbitrary

distribution. In Proc. Design Automation Conf., June 2008.

[66] International Technology Roadmap for Semiconductor. In http://public.itrs.net/,

2002.

[67] International Technology Roadmap for Semiconductor. In http://public.itrs.net/,

2005.

[68] International Technology Roadmap for Semiconductor. A user’s guide to MAS-

TAR4. In http://www.itrs.net/models.html, 2005.

204

[69] International Technology Roadmap for Semiconductor. Design. In

http://www.itrs.net/Links/2007ITRS/2007 Chapters/2007 Design.pdf, 2007.

[70] Javid Jaffari and Mohab Anis. On efﬁcient monte carlo-based statistical static

timing analysis of digital circuits. In Proc. Intl. Conf. Computer-Aided Design,

November 2008.

[71] Jason H. Anderson and Farid N. Najm. Low-power programmable routing cir-

cuitry for FPGAs. In Proc. Intl. Conf. Computer-Aided Design, November

2004.

[72] J. Jess, K. Kalafala, S. Naidu, R. Otten, and C. Visweswariah. Statistical timing

for parametric yield prediction of digital integrated circuits. In Proc. Design

Automation Conf., June 2003.

[73] Jochen Jess. Parametric yield estimation for deep submicron VLSI circuits. In

15th Symposium on Integrated Circuits and Systems Design, September 2002.

[74] J.M.Rabaey. Digital Integrated Circuits - A Design Perspective. Prentice Hall

Inc., 2003.

[75] Kunhyuk Kang, Bipul C. Paul, and Kaushik Roy. Statistical timing analysis

using levelized covariance propagation. In Proc. Design Automation and Test

Conf. in Europe, March 2005.

[76] Vishal Khandelwal and Ankur Srivastava. A general framework for accurate

statistical timing analysis considering correlations. In Proc. Design Automation

Conf., pages 89 – 94, June 2005.

[77] Tae Won Kim and Eray S.Aydil. Investigation of etch rate uniformity of 60 mhz

plasma etching equipment. Japanese Journal of Applied Physics, November

2001.

[78] Tae Won Kim and Eray S.Aydil. Effects of chamber wall conditions on cl con-

certration and si etch rate uniformity in palsma etching reactors. Journal of The

Electrochemical Society, June 2003.

[79] Dirk Koch, Christian Haubelt, and Juergen Teich. Efﬁcient hardware check-

pointing – concepts, overhead analysis, and implementation. In Proc. ACM

Intl. Symp. Field-Programmable Gate Arrays, February 2007.

[80] J. Kouloheris and A. El Gamal. FPGA area vs. cell granularity - lookup tables

and PLA cells. In 1st ACM Workshop on FPGAs, berkeley, CA, February 1992.

205

[81] Sanjay V. Kumar, Chandramouli V. Kashyap, and Sachin Sapatnekar. A frame-

work for block-based timing sensitivity analysis. In Proc. Design Automation

Conf., June 2008.

[82] Ian Kuon and Jonathan Rose. Measuring the gap between fpgas and asics. IEEE

Trans. Computer-Aided Design of Integrated Circuits and Systems, February

2007.

[83] Eric Kusse and Jan M. Rabaey. Low-energy embedded FPGA structures. In

Proc. Intl. Symp. Low Power Electronics and Design, pages 155–160, August

1998.

[84] Julien Lamoureux, Guy G. Lemieux, and Steven J.E. Wilton. Glitchless: An

active glitch minimization techniques for FPGAs. In Proc. ACM Intl. Symp.

Field-Programmable Gate Arrays, February 2007.

[85] Julien Lamoureux and Steven J.E. Wilton. On the interaction between power-

aware FPGA CAD algorithms. In Proc. Intl. Conf. Computer-Aided Design,

pages 701–708, November 2003.

[86] Jiayong Le, Xin Li, and Lawrence T. Pileggi. STAC: Statisical timing analysis

with correlation. In Proc. Design Automation Conf., Jun 2004.

[87] G. G. Lemieux and S. D. Brown. A detailed router for allocating wire segments

in ﬁeld-programmable gate arrays. In Proceedings of the ACM Physical Design

Workshop, April 1993.

[88] Bing Li, Ning Chen, Manuel Schmidt, Walter Schneider, and Ulf Schlichtmann.

On hierarchical statistical static timing analysis. In Proc. Design Automation

and Test Conf. in Europe, March 2009.

[89] Fei Li, Deming Chen, Lei He, and Jason Cong. Architecture evaluation for

power-efﬁcient FPGAs. In Proc. ACM Intl. Symp. Field-Programmable Gate

Arrays, February 2003.

[90] Fei Li, Yan Lin, and Lei He. FPGA power reduction using conﬁgurable dual-

vdd. In Proc. Design Automation Conf., June 2004.

[91] Fei Li, Yan Lin, and Lei He. Field programmability of supply voltages for

FPGA power reduction. IEEE Trans. Computer-Aided Design of Integrated

Circuits and Systems, 26, April 2007.

[92] Fei Li, Yan Lin, Lei He, Deming Chen, and Jason Congs. Power modeling and

characteristics of ﬁeld programmable gate arrays. IEEE Trans. Computer-Aided

Design of Integrated Circuits and Systems, 24:1712–1724, November 2005.

206

[93] Fei Li, Yan Lin, Lei He, and Jason Cong. Low-power FPGA using pre-deﬁned

dual-vdd/dual-vt fabrics. In Proc. ACM Intl. Symp. Field-Programmable Gate

Arrays, February 2004.

[94] Xin Li, Jiayong Le, Padmini Gopalakrishnan, and Lawrence T Pileggi. Asymp-

totic probability extraction for non-normal distributions of circuit performance.

In Proc. Intl. Conf. Computer-Aided Design, November 2004.

[95] Xin Li, Jiayong Le, and L.T. Pileggi. Projection based statistical analysis of

full-chip leakage power with non-log-normal distributions. In Proc. Design Au-

tomation Conf., June 2006.

[96] Weiping Liao, Lei He, and Kevin Lepak. Temperature and supply voltage

aware performance and power modeling at microarchitecture level. IEEE Trans.

Computer-Aided Design of Integrated Circuits and Systems, July 2005.

[97] Saihua Lin, Yu Wang, Rong Luo, and Huazhong Yang. A capacitive boosted

buffer technique for high-speed process-variation-tolerant interconnect in udvs

application. In Proc. Asia South Paciﬁc Design Automation Conf., February

2008.

[98] Yan Lin and Lei He. Dual-vdd interconnect with chip-level time slack allocation

for FPGA power reduction. IEEE Trans. Computer-Aided Design of Integrated

Circuits and Systems, 25, October 2006.

[99] Yan Lin and Lei He. Stochastic physical synthesis for FPGAs with pre-routing

interconnect uncertainty and process variation. In Proc. ACM Intl. Symp. Field-

Programmable Gate Arrays, February 2007.

[100] Yan Lin, Mike Hutton, and Lei He. Placement and timing for FPGAs consider-

ing variations. In Proc. Intl. Conf. Field-Programmable Logic and its Applica-

tion, August 2006.

[101] Yan Lin, Fei Li, and Lei He. Circuits and architectures for vdd programmable

FPGAs. In Proc. ACM Intl. Symp. Field-Programmable Gate Arrays, February

2005.

[102] Yan Lin, Fei Li, and Lei He. Routing track duplication with ﬁne-grained power-

gating for FPGA interconnect power reduction. In Proc. Asia South Paciﬁc

Design Automation Conf., January 2005.

[103] Jing-Jia Liou, A. Krstic, L.-C. Wang, and Kwang-Ting Cheng. False-path-

aware statistical timing analysis and efﬁcient path selection for delay testing

and timing validation. In Proc. Design Automation Conf., June 2002.

207

[104] Bao Liu. Signal probability based statistical timing analysis. In Proc. Design

Automation and Test Conf. in Europe, March 2008.

[105] Bao Liu. Spatial correlation extraction via random ﬁeld simulation and produc-

tion chip performance regression. In Proc. Design Automation and Test Conf.

in Europe, March 2008.

[106] Frank Liu. A general framework for spatial correlation modeling in vlsi design.

In Proc. Design Automation Conf., June 2007.

[107] Jui-Hsiang Liu, Lumdo Chen, and Chen Charlie Chung-Ping. Accurate and

analytical statistical spatial correlation modeling for vlsi dfm applications. In

Proc. Design Automation Conf., June 2008.

[108] Jui-Hsiang Liu, Ming-Feng Tsai, Lumdo Chen, and Charlie Chung-Ping Chen.

Accurate and analytical statistical spatial correlation modeling for vlsi dfm ap-

plications. In Proc. Design Automation Conf., June 2008.

[109] M. Liu, Wang Wei-Shen, and M. Orshanskyg. Leakage power reduction by

dual-vth designs under probabilistic analysis of vth variation. In Proc. Intl.

Symp. Low Power Electronics and Design, August 2004.

[110] Andrea Lodi, Luca Ciccarelli, and Roberto Giansante. Combining low-leakage

techniques for FPGA routing design. In Proc. ACM Intl. Symp. Field-

Programmable Gate Arrays, Februray 2005.

[111] Yuanlin Lu and V.D. Agrawal. Statistical leakage and timing optimization for

submicron process variation. In VLSI Design, January 2007.

[112] W. Maly, A.J. Strojwas, and S.W. Director. VLSI yield prediction and estima-

tion: A uniﬁed framework. IEEE Trans. Computer-Aided Design of Integrated

Circuits and Systems, 5:114–130, Jan. 1986.

[113] Hratch Mangassarian and Mohab Anis. On statistical timing analysis with inter-

and intra-die variations. In Proc. Design Automation and Test Conf. in Europe,

March 2005.

[114] Murari Mani, Anirudh Devgan, and Michael Orshansky. An efﬁcient algorithm

for statistical minimization of total power under timing yield constraints. In

Proc. Design Automation Conf., pages 309–314, June 2005.

[115] Yohei Matsumoto, Masakazu Hioki, Takashi Kawanami, Toshiyuki Tsutsumi,

Tadashi Nakagawa, Toshihiro Sekigawa, and Hanpei Koike. Performance and

yield enhancement of FPGAs with with-in die variation using multiple conﬁgu-

rations. In Proc. ACM Intl. Symp. Field-Programmable Gate Arrays, February

2007.

208

[116] Hushrav D. Mogal, Haifeng Qian, Sachin S. Sapatnekar, and Kia Bazargan.

Clustering based pruning for statistical criticality computation under process

variations. In Proc. Intl. Conf. Computer-Aided Design, November 2007.

[117] Rajarshi Mukherjee and Seda Ogrenci Memik. Power-driven design partition-

ing. In Proc. Intl. Conf. Field-Programmable Logic and its Application, August

2004.

[118] Ayhan Mutlu, Jiayong Le, Ruben Molina, and Mustafa Celik. A parametric

approach for handling local variation effects in timing analysis. In Proc. Design

Automation Conf., July 2009.

[119] Siva Narendra, Vivek De, Shekhar Borkar andDimitri A. Antoniadis, and Anan-

tha P. Chandrakasan. Full-chip subthreshold leakage power prediction and re-

duction techniques for sub-0.18-um CMOS. Solid-State Circuits, IEEE Journal

of, March 2004.

[120] Michael Orshansky and Arnab Bandyopadhyay. Fast statistical timing analysis

handling arbitrary delay correlations. In Proc. Design Automation Conf., June

2004.

[121] Michael Orshansky, Sani R. Nassif, and Duane Boning. Design for Manufac-

turability and Statistical Design. Springer, 2007.

[122] Pierrick Pedron. Tutorial on statistical static timing analysis (SSTA). In Sophia

Antipolis MicroElectronics Forum, 2008.

[123] K. Poon, A. Yan, and S. Wilton. A ﬂexible power model for FPGAs. In Proc. of

12th International conference on Field-Programmable Logic and Applications,

September 2002.

[124] Predictive Technology Model. In http://www.eas.asu.edu/ ptm/.

[125] Kun Qian and Costas J. Spanos. A comprehensive model of process variabil-

ity for statistical timing optimization. In Design for Manufacturability through

Design-Process Integration II, April 2008.

[126] Kun Qian and Costas J. Spanos. Hierachical modeling of spatial variability

of a 45nm test chip. In Design for Manufacturability through Design-Process

Integration III, March 2009.

[127] Sreeja Raj, Sarma B.K. Vrudhula, and Janet Wang. A methodology to improve

timing yield in the presence of process variations. In Proc. Design Automation

Conf., June 2004.

209

[128] Anand Ramalingam, Ashish Kumar Singh, Sani R. Nassif, Gi-Joon Nam,

Michael Orshansky, and David Z. Pan. An accurate sparse matrix based frame-

work for statistical static timing analysis. In Proc. Intl. Conf. Computer-Aided

Design, November 2006.

[129] Rajeev Rao, Anirudh Devgan, David Blaauw, and Dennis Sylveste. Parametric

yield estimation considering leakage variability. In Proc. Design Automation

Conf., June 2004.

[130] Jonathan Rose, Robert J. Francis, David Lewis, and Paul Chow. Architecture

of ﬁeld-programmable gate arrays: The effect of logic functionality on area

efﬁciency. IEEE Journal of Solid-State Circuits, 1990.

[131] Jonathan Rose, Robert J. Francis, David Lewis, and Paul Chow. Architecture

of ﬁeld-programmable gate arrays: The effect of logic functionality on area

efﬁciency. Proc. IEEE Int. Solid-State Circuits Conf., 1990.

[132] Shigeru Sakai, Masaaki Ogino, Ryosuke Shimizu, and Yukihiro Shimogaki.

Deposition uniformity control in a commercial scale HTO-CVD reactor. MA-

TERIALS RESEARCH SOCIETY SYMPOSIUM PROCEEDINGS, 2007.

[133] Ashoka Sathanur, Antonio Pullini, Giovanni De Micheli, Luca Benini, and En-

rico Macii. Physically clustered forward body biasing for variability compensa-

tion in nano-meter CMOS design. In Proc. Design Automation and Test Conf.

in Europe, March 2009.

[134] Takashi Sato, Hiroyuki Ueyama, Noriaki Nakayama, and Kazuya Masu. Deter-

mination of optimal polynomial regression function to decompose on-die sys-

tematic and random variations. In Proc. Asia South Paciﬁc Design Automation

Conf., February 2008.

[135] Pete Sedcole and Peter Y. K. Cheung. Parametric yield in FPGAs due to within-

die delay variations: A quantative analysis. In Proc. ACM Intl. Symp. Field-

Programmable Gate Arrays, February 2007.

[136] Pete Sedcole and Peter Y. K. Cheung. Parametric yield in FPGAs due to within-

die delay variations: a quantitative analysis. In Proc. ACM Intl. Symp. Field-

Programmable Gate Arrays, February 2007.

[137] Pete Sedcole, Justin S. Wong, and Peter Y.K. Cheung. Measuring and modeling

FPGA clock variability. In Proc. ACM Intl. Symp. Field-Programmable Gate

Arrays, February 2008.

210

[138] Davood Shamsi, Petros Boufounos, and Farinaz Koushanfar. Post-silicon tim-

ing characterization by compressed sensing. In Proc. Intl. Conf. Computer-

Aided Design, November 2008.

[139] James R. Sheats. Microlithography Science and Technology. CRC, 1998.

[140] Xukai Shen, Yu Wang, Rong Luo, and Huazhong Yang. Leakage power reduc-

tion through dual vth assignment considering threshold voltage variation. In

ASICON, October 2007.

[141] Sean X. Shi, Anand Ramalingam, Daifeng Wang, and David Z. Pan. Latch

modeling for statistical timing analysis. In Proc. Design Automation and Test

Conf. in Europe, March 2008.

[142] Jaskirat Singh and Sachin Sapatnekar. Statistical timing analysis with cor-

related non-gaussian parameters using independent componenet analysis. In

ACM/IEEE International Workshop on Timing Issues, pages 143–148, Feb.

2006.

[143] Satwant Singh, Jonathan Rose, Paul Chow, and David Lewis. The effect of

logic block architecture on FPGA performance. IEEE Journal of Solid-State

Circuits, 1992.

[144] Amith Singhee and Rob A. Rutenbar. Beyond low-order statistical response

surfaces: Latent variable regression for efﬁcient, highly nonlinear ﬁtting. In

Proc. Design Automation Conf., June 2007.

[145] Amith Singhee, Sonia Singhal, and Rob A. Rutenbar. Exploiting correlation

kernels for efﬁcient handling of intra-die spatial correlation, with application to

statistical timing. In Proc. Design Automation and Test Conf. in Europe, March

2008.

[146] Amith Singhee, Sonia Singhal, and Rob A. Rutenbar. Practical, fast monte carlo

statistical static timing analysis: Why and how. In Proc. Intl. Conf. Computer-

Aided Design, November 2008.

[147] Satish Sivaswamy and Kia Bazargan. Variation-aware routing for FPGAs. In

Proc. ACM Intl. Symp. Field-Programmable Gate Arrays, February 2007.

[148] Thomas Skotnicki. Heading for decananometer CMOS - Is navigation among

icebergs still a viable strategy? In Solid-State Device Research Conference,

pages 19–33, September 2000.

211

[149] Mahapatra Souvik, Saha Dipankar, Varghese Dhanoop, and Bharath P. Kumar.

On the generation and recovery of interface traps in MOSFETs subjected to

NBTI, FN, and HCI stress. IEEE. Trans. on Electron Device, 53, July 2006.

[150] Suresh Srinivasan, Aman Gayasen, Narayanan Vijaykrishnan, and Tim Tuan.

Leakage control in FPGA routing fabric. In Proc. Asia South Paciﬁc Design

Automation Conf., January 2005.

[151] Ashish Srivastava, Robert Bai, David Blaauw, and Dennis Sylvester. Modeling

and analysis of leakage power considering within-die process variations. In

Proc. Intl. Symp. Low Power Electronics and Design, 2002.

[152] Ashish Srivastava, Saumil S. Shah, Kanak B. Agarwal, Dennis M. Sylvester,

David Blaauw, and Stephen Director. Accurate and efﬁcient parametric yield

estimation considering correlated variations in leakage power and performance.

In Proc. Design Automation Conf., June 2005.

[153] Brian E. Stine., Duane S. Boning, and James E. Chung. Analysis and decompo-

sition of spatial variation in integrated circuit processes and devices. In Semi-

conductor Manufacturing, IEEE Transactions on, pages 24 – 41, Feb. 1997.

[154] Yuuri Sugihara, Yohei Kume, Kazutoshi Kobayashi, and Hidetoshi Onodera.

Speed and yield enhancement by track swapping on critical paths utilizing ran-

dom variations for FPGAs. In Proc. ACM Intl. Symp. Field-Programmable

Gate Arrays, February 2008.

[155] Shingo Takahashi, Yuki Yoshida, and Shuji Tsukiyama. Gaussian mixture

model for statistical timing analysis. In Proc. Design Automation Conf., July

2009.

[156] Li Tao and Yu Zhiping. Statistical analysis of full-chip leakage power consider-

ing junction tunneling leakage. In Proc. Design Automation Conf., June 2007.

[157] Shuji Tsukiyama. Toward stochastic design for digital circuits: statistical static

timing analysis. In Proc. Asia South Paciﬁc Design Automation Conf., January

2005.

[158] TimTuan and Bocheng Lai. Leakage power analysis of a 90nm FPGA. In Proc.

IEEE Custom Integrated Circuits Conf., 2003.

[159] SALI Jaydeep V., PATIL S. B., JADKAR S. R., and TAKWALE M. G. Hot-

Wire CVD growth simulation for thickness uniformity. In International Confer-

ence on Cat-CVD Process, 2001.

212

[160] Rakesh Vattikonda, Wenping Wang, and Yu Cao. Modeling and minimization

of pmos nbti effect for robust nanometer design. In Proc. Design Automation

Conf., July 2006.

[161] Vineeth Veetil, Dennis Sylvester, and David Blaauw. Efﬁcient monte carlo

based incremental statistical timing analysis. In Proc. Design Automation

Conf., June 2008.

[162] Chandu Visweswariah. Death, taxes and failing chips. In Proc. Design Au-

tomation Conf., June 2003.

[163] Chandu Visweswariah, Kaushik Ravindran, Kerim Kalafala, Steven G. Walker,

and Sambasivan Narayan. First-order incremental block-based statistical timing

analysis. In Proc. Design Automation Conf., June 2004.

[164] Chandu Visweswariah, Kaushik Ravindran, Kerim Kalafala, Steven G. Walker,

Sambasivan Narayan, Daniel K. Beece, Jeff Piaget, Natesan Venkateswaran,

and Jeffrey G. Hemmett. First-order incremental block-based statistical tim-

ing analysis. IEEE Trans. Computer-Aided Design of Integrated Circuits and

Systems, 2006.

[165] Ruilin Wang and Cheng Kok Koh. A frequency-domain technique for statistical

timing analysis of clock meshes. In Proc. Intl. Conf. Computer-Aided Design,

November 2007.

[166] Wenping Wang, Vijay Reddy, Anand T. Krishnan, Rakesh Vattikonda, Srikanth

Krishnan, and Yu Cao. An integrated modeling paradigm of circuit reliability

for 65nm CMOS technology. In Proc. IEEE Custom Integrated Circuits Conf.,

September 2007.

[167] Ho-Yan Wong, Lerong Cheng, Yan Lin, and Lei He. FPGA device and architec-

ture evaluation considering process variations. In Proc. Intl. Conf. Computer-

Aided Design, November 2005.

[168] Phoebe Wong, Lerong Cheng, Yan Lin, and Lei He. FPGA device and arhi-

tecture evaluation consdering process variation. In Proc. Intl. Conf. Computer-

Aided Design, November 2005.

[169] Lin Xie, Azadeh Davoodi, Jun Zhang, and Tai-Hsuan Wu. Adjustment-based

modeling for statistical static timing analysis with high dimension of variability.

In Proc. Intl. Conf. Computer-Aided Design, November 2008.

[170] Xilinx Corporation. Virtex-II 1.5v platform FPGA complete data sheet. July

2002.

213

[171] Jinjun Xiong, Chandu Visweswariah, and Vladimir Zolotov. Statistical ordering

of correlated timing quantities and its application for path ranking. In Proc.

Design Automation Conf., July 2009.

[172] Jinjun Xiong, Vladimir Zolotov, and Lei He. Robust extraction of spatial corre-

lation. In Proc. Intl. Symp. Physical Design, April 2006.

[173] Jinjun Xiong, Vladimir Zolotov, and Lei He. Robust extraction of spatial cor-

relation. IEEE Trans. Computer-Aided Design of Integrated Circuits and Sys-

tems, 2007.

[174] A. M. Yaglom. Some classes of random ﬁelds in n-Dimensional space, related

to stationary randomprocesses. Theory Probability Applications, January 1957.

[175] S. Yang. Logic synthesis and optimization benchmarks, version 3.0. Technical

report, Microelectronics Center of North Carolina (MCNC), 1991.

[176] Xiaoji Ye, Yaping Zhan, and Peng Li. Statistical leakage power minimization

using fast equi-slack shell based optimization. In Proc. Design Automation

Conf., 2007.

[177] Guo Yu, Wei Dong, Zhuo Fang, and Peng Li. Statistical static timing analysis

considering process variation model uncertainty. IEEE Trans. Computer-Aided

Design of Integrated Circuits and Systems, October 2008.

[178] Yaping Zhan, Andrzej J. Strojwas, Xin Li, and Lawrence T. Pileggi.

Correlaiton-aware statistical timing analysis with non-gaussian delay distribu-

tion. In Proc. Design Automation Conf., pages 77–82, June 2005. Anaheim,

CA.

[179] Yaping Zhan, Andrzej J. Strojwas, David Newmark, and Mahesh Sharma.

Generic statistical timing analysis with non-gaussian process parameters. In

Austin Conference on Integrated Systems and Circuits, May 2006.

[180] Lizheng Zhang, Weijen Chen, Yuhen Hu, John A. Gubner, and Charlie Chung-

Ping Chen. Correlation-preserved non-gaussian statistical timing analysis with

quadratic timing model. In Proc. Design Automation Conf., pages 83 – 88, June

2005.

[181] Lizheng Zhang, Weijen Chen, Yuhen Hu, John A. Gubner, and Charlie Chung-

Ping Chen. Statistical timing analysis with extended pseudo-canonical timing

model. In Proc. Design Automation and Test Conf. in Europe, pages 952 – 957,

March 2005.

214

[182] Lizheng Zhang, Yuhen Hu, and Charlie Chung-Ping Chen. Block based statisti-

cal timing analysis with extended canonical timing model. In Proc. Asia South

Paciﬁc Design Automation Conf., January 2005.

[183] Lizheng Zhang, Yuhen Hu, and Chung-Ping Chen. Statistical timing analysis

with path reconvergence and spatial correlations. In Proc. Design Automation

and Test Conf. in Europe, March 2006.

[184] Lizheng Zhang, Jun Shao, and Charlie Chung-Ping Chen. Non-gaussian statis-

tical parameter modeling for SSTA with conﬁdence interval analysis. In Proc.

Intl. Symp. Physical Design, Aprial 2006.

[185] Qiaolin Zhang, Kameshwar Poolla, and Costas J. Spanos. One step forward

from run-to-run critical dimension control: Across-wafer level critical dimen-

sion control through lithography and etch process. Journal of Process Control,

18(10):937 – 945, 2008. Advanced Process Control for Semiconductor Manu-

facturing, First IFAC Workshop on Advanced Process Control for Semiconduc-

tor Manufacturing.

[186] Qiaolin Zhang, Cherry Tang, Jason Cain, Angela Hui, Tony Hsieh, Nick Mac-

crae, Bhanwar Singh, Kameshwar Poolla, and Costas J. Spanos. Across-wafer

cd uniformity control through lithography and etch process: experimental ver-

iﬁcation. In Metrology, Inspection, and Process Control for Microlithography

XXI, April 2007.

[187] Songqing Zhang, Vineet Wason, and Kaustav Banerjee. A probabilistic frame-

work to estimate full-chip subthreshold leakage power distribution considering

within-die and die-to-die p-t-v variations. In Proc. Intl. Symp. Low Power Elec-

tronics and Design, Aug 2004.

[188] Songqing Zhang, Vineet Wason, and Kaustav Banerjee. A probabilistic frame-

work to estimate full-chip subthreshold leakage power distribution considering

within-die and die-to-die variations. In Proc. Intl. Symp. Low Power Electron-

ics and Design, 2004.

[189] Wangyang Zhang, Wenjian Yu, Zeyi Wang, Zhiping Yu, Rong Jiang, and Jinjun

Xiong. An efﬁcient method for chip-level statistical capacitance extraction con-

sidering process variations with spatial correlation. In Proc. Design Automation

and Test Conf. in Europe, March 2008.

[190] Wei Zhao, Yu Cao, Frank Liu, Kanak Agarwal, Dhruva Acharyya, Sani Nassif,

and Kevin Nowka. Rigorous extraction of process variations for 65nm CMOS

design. In European Solid-State Circuits Conference, September 2007.

215

c Copyright by Lerong Cheng 2010

The dissertation of Lerong Cheng is approved.

Lieven Vandenberghe

Glenn Reinman

Sudhakar Pamarti

Lei He, Committee Co-chair

Puneet Gupta, Committee Co-chair

University of California, Los Angeles 2010

ii

To my family.

iii

. . . 15 Delay. . and Area Minimization . . . . . Dissertation Outline . 2. . . . . . Statistical Timing Modeling. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1. . 16 2. . . . .1 2.3.4 Hyper-Architecture Evaluation . . . . . . . . . . .1. . Performance. . . . . . . . . . Vdd Gatable FPGA Architecture . . . . . . . 1 6 6 8 9 12 Contributions of the Dissertation . . . . . . . . . . . iv . . . . . . . . . . . . . . . . .3. . . . . . Preliminaries . . . . .3 Conventional FPGA Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2.1 Literature Review . . . . . . . . . . . . . Power and Delay Model . . . .3. . . . . . . . Analysis. . . . . .3 2. . . . . . . . . . . . . . . . . 2. . . . . . . . . . . . . . . . 2. . . . . . . . . . . . . . . .1 2. . . . . . . . . . . and Area Optimization . . and Optimization . . . . . . . . . . .2 1. . . .2. . . . . . . . . . . 17 20 20 21 23 24 25 26 30 32 Trace-Based Estimation . . . . . . . . . . . . .1 1. FPGA Architecture Evaluation Flow . . . . . . . . . . . . . .1 2. . . . Validation of Ptrace . . . . .3 Trace Collection .2. . . . . . . . 1. . . .TABLE OF C ONTENTS 1 Introduction . . . . . . . . . . . . . . .3 FPGA Power. . . . . .2 2. . . . . . . . . . . . . . .2 2. . . . I Statistical Modeling and Optimization for FPGA Circuits 2 Deterministic Device and Architecture Co-Optimization for FPGA Power. . . . .2 Introduction .2 1.1. . . . . .

. Energy and Delay Tradeoff . . . . . . . . . . . . . . . . . .2 3.4. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Power Model . . . . . . . . . . . . . . . . . . . . . . . . . . .2 4. . . . . . . . . . . . . . . . . . . . . . . . . .1 3. . . . . . . .1 2. . .3. 64 66 70 70 73 v . . . . . . . . Circuit-Level Delay and Power . . . . . . . . . . . .4. . . . . . . . . . . . . . . 49 50 58 59 60 62 Conclusions . . . . . . . . . . . . Necessity of Device and Architecture Co-Optimization . . .5 Overview . . . . . . . . . .4 2. . .4. 3 FPGA Device and Architecture Co-Optimization Considering Process Variation . . . . . . . . . . . . . . . 48 3. .4. . .3 Introduction . . . . . . . . . 32 33 36 38 40 41 46 Conclusions . . . . . . . . . . . . . . . . .4. . .3. . . . .2 2. 63 4.1 4. . . . . . . . . . . Impact of Utilization Rate . . . . . . . .3. . . . . . . . . . .3 2. . .1 4. . . . . . . . . . . . . . . . . 4. . . . . . . . . . . .4. Device Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hyper-Architecture Evaluation Considering Process Variation . . . . . . . .4 Impact of Process Variation . . . . . . . . . . . . . . . . . . . . . . Impact of Interconnect Structure . .3. . . . . .6 2. . ED and Area Tradeoff . . . .2 Delay Model . . . . . . . . . . . . . . .3 Introduction . . . . . . . . . . . . . . . . . Delay and Leakage Variation Model . . . 3. . . . . . . . . . . . . . . . . . . . . 4 FPGA Concurrent Development of Process and Architecture Considering Process Variation and Reliability . .5 2. .2 3. . . . . . . . . . .2. .1 3. . . . . . . . . . . Impact of Device and Architecture Tuning . . . . . . . . . . . . .

2 Impact of Device Aging . . . .2 4. . . . . . . . . .6 Process Evaluation . . . . . . . . . . . . . . 4. . . . . . . 4. . . . . . . 4. . . . . . . . . . Permanent Soft Error Rate (SER) .8 Interaction between Process Variation and Reliability . . . .4. . . Variation Analysis . . .5. . . . . . . . . . . . 4. . . . . . 4. . . . . . . . . .7. . . . .3 Delay Model . . . . . .4. . . . . . . . . . . . . . . . Dynamic Power Model . .2 4. . . . . . . . . . . . . . Delay Variation . . . .8. . . . . . . . . . . . .1 4. . . .9 Conclusions . . . . . . . 4. . . . . . . . . . . . . . .4. . . . . . . . . . . . . . . . . . .8. . . . . . . . . .1 4. . . . . . .7. . . . . . . . . . 75 75 76 76 78 78 79 80 81 82 83 85 86 89 90 91 92 93 4.5. .5 Chip Level Delay and Power Variation . . . . . . . . . . II Statistical Timing Modeling and Analysis 5 Non-Gaussian Statistical Timing Analysis Using Second-Order Polyno- 95 mial Fitting .3 Variation Models . . .7 FPGA Reliability . . .5. . .2 Impact of Device Aging on Power and Delay Variation . . . . . . . . . . . . . . . . 4. . . . . . . . . . . . . . . . . .4 Chip-level Delay and Power . . . . . .1 4. . . . .2 Power and Delay Optimization . . . . . . . . . . . . . . . . . . . . . . .1 4. . . . . . . . . . . . . . . . . . . . . . . . 96 vi . . . . . . . . Leakage Power Model . . . . 4. . .1 4. . .6. . . Chip Leakage Power Variation . . .6. . . . . . . . . . . . Impacts of Device Aging and Process Variation on SER . . . . . . . . . 4. . . . . . . . . . . . . . . . . . . . . . . .4. . . . . . . . .

. . . . . . . .2 5. . . . . . . . . . . . .5. . . . . . . .4 Quadratic Delay Model . . . . . . . . . . . . . . . . . .2 Review and Preliminary . . . .5. . . .3. . . . . . . . . . . . . 109 Sum Operation for Quadratic Delay Model . . . . . . . . . . . . . . . . . . . .2 6. . . . . . . . . .1 5. . .3 . . . . . . . . . 127 6 Physically Justiﬁable Die-Level Modeling of Spatial Variation in View of Systematic Across-Wafer Variability 6. . . 118 5. . . . . . . . . . 118 Max Operation for Linear Delay Model . . . . .6 5. .5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121 Conclusions . . . . . . . . . 128 Introduction . . . 115 Max Operation for Semi-Quadratic Delay Model . . . . . . . . Second-Order Polynomial Fitting of Max Operation . . . . . . . . . .1 5. . . . . . . . . . . . . . .2. .3. . . . . . . . . . . . 115 Computational Complexity of Quadratic SSTA . . . . . . . 100 5. . . . . . . . . . . . . . . . . . 131 Analysis of Wafer Level Variation and Spatial Correlation . . . . . . . . . . 119 5. . . . .2 Introduction . . . . . .4. . . .3. . . . . . . . .5 Linear SSTA . . . .1 5. . . . . . . . . 133 vii . . . . . . . . . . .2 Semi-Quadratic Delay Model . . . . . . . . . . . . . . . 129 Physical Origins of Spatial Variation . . .4 Semi-Quadratic SSTA .1 5.2 Linear Delay Model . . . . . . . . . . 5. . 107 5. . . . . . .2.3. . . . 115 5. .7 Experimental Result . .1 6. . .1 5. . . . . . . . . . . . . . .3 5. . . . . . . 107 Max Operation for Quadratic Delay Model . . . . . . . . . . . . . . . . .3 Quadratic SSTA . . . 116 5. . . . . . . . . . . . . 97 99 99 New Fitting Method for Max Operation . . . . . . 115 5. . . . .4. . . .

. .5. . . .3. . . . . . . . . . . . . . . . . . 161 7 On Conﬁdence in Characterization and Application of Variation Models 162 7. . 144 6. . . . . . . . . . . . . .6. . . . . . . . . .3. . . . . . . . . .5 Variation of Mean and Variance with Location . . . . . . . . . .1 7. . . . . . . . . . . . . . . . . . . . . . .6 Application of Spatial Variation Models on Statistical Timing Analysis 146 6.4 6. . . . . . . . . 170 viii . . .3 Mean and Variance . . . . 148 6. . .1 6. . . . . . .2 7. . . . . . . . . . . 167 Simplifying the Large Lot Case . . . . . . . . . . . . . . . . . . . . . . . . 169 One Production Lot Case . . .3 Introduction . . . 163 Problem Formulation . . . . . . . . . .8 6. . . . . . .3. . .2 7. . . . . . . . . . . . . . . . . . . . . . . . . .9 Application of Spatial Variation Models on Statistical Leakage Analysis 156 Summary of Different Models . . . 145 Location Dependent Across-Wafer Model . .6. . . . . .3. . . . . . . . . . . . . .3. . .5. . . . . .7 6.4 6. .5. . . . 146 6. . . . 145 Quadratic Across-Wafer Model . . . . . . 147 Experimental Result . . .2 6. 140 General Across-Wafer Variation Model . . . . . .3 6. . . . . . . 141 Modeling Spatial Variability . . . . . 137 Dependence between Inter-Die and Within-Die Variation . . . . . . . .2 6. . . . . .2 Delay Model . . . . . . . . . . 139 When can Spatial Variation be Ignored? . . . 134 Appearance of Spatial Correlation . .1 6. . . . . . . . . . . . . . . . .3. . . . . . . . . . 158 Conclusions . . . . . . . . . . . . 165 Conﬁdence Interval for Variation Sources . . . . . . . . . . . .3 Slope Augmented Across-Wafer Model . . . . . . .1 7. . . .6. 167 7.3. . . . .1 6. .

.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6 Experimental Validation for One Lot Case . . . .3. . . . . . . . . . . .5 7. 188 7. . . . . . . 172 Conﬁdence in SPICE Fast/Slow Corner . . . .4 7. . .7. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3. . . . . . . .5 7. . 191 9 Appendix . . . . . . . . . . . . . . . . . . . . . . . . 189 8 Conclusions . . . . . . . . . 177 Conﬁdence in Statistical Timing Analysis . . . . . . . . . . . . . . . . 198 ix . . . . . . 195 References . . . . . . . . . . . . . . .4 7. . . . . . . . . . . . . . . . . . . . . . 171 Fast/Slow Corner Estimation . . . . . . 183 Practical Questions: Determining Conﬁdence Induced Guardband for Real Designs . . . . .

. . . . . . calculation. . . . . . . . . . . . . . . . . . (a) Homo-Vth and Hetero-Vth. .1 4. . . calculation. (d) Vdd -gateable logic block. .1 4. . . Trace-based estimation ﬂow. . . . . . . . . . . . .3 1. . . . . . . . . x . .9V . . (b) HomoVth +G and Hetero-Vth +G. 2. . . . . . . . . (b) Connection block. . .3 under one device setting to collect the trace information. Vdd = 0. . .L IST OF F IGURES 1.3 2. . . .8 ED and area trade-off. . . . . . . . . . . . . . . . . . . . . . (c) Switch block. . . . . Delay yield. . The ﬂow for on current. . . . . . . . . Dominant hyper-architectures. . . The ﬂow for sub-threshold leakage current. . . . . SSTA ﬂow. . .5 2. . .2 (a) Vdd -gateable switch. . . . . . . . . . . 44 2. (c) Vdd -gateable connection block. . New trace-based evaluation ﬂow. . .2 4. . . .2 1. . . . . . . . . . . . K = 4. 2. . . .1 Increasing of power density. . . . . . . . . .7 Comparison between Psim and Ptrace. .6 2. . . . . We perform the same ﬂow as Figure 2. . ED and area are normalized with respect to the baseline (N = 8. . . . . 25 31 35 22 24 21 1 3 5 6 hyper-architectures under different device settings. . . . . . . . . (a) Island style routing architecture. . . .1 1. . . . . . .3 Leakage and delay of baseline architecture hyper-arch. . . . . . . . . . . . (b) Vdd -gateable routing switch. . . . . Leakage power and frequency variation. . and Vth = 0. . . . . . Ion . . . (d) Routing switches. . . . . 45 60 65 67 68 3. . . . . . . . . . . . .4 Existing FPGA architecture evaluation ﬂow for a given device setting. . .3V ). . . . . . . . . . . Io f f . . . . . 2.4 2. . . . . . . . . . . . . . .

. . .10 The calculation ﬂow for ΔVth due to NBTI and HCI. (b) Delay PDF. . . . . . . . . . (b) PDF comparison of exact computation. . . . . . 123 Ring oscillator frequency within a wafer. . 68 69 71 4. . (b) Delay PDF comparison. . (a) Process 1.13 The ﬂow for SRAM SER calculation. . . . (b) Process 2.4.14 Impact of device aging on delay and leakage PDF. linear ﬁtting. 4. . . . . . 133 xi . . The pull-down delay of a 1X inverter at ITRS 2005 HP 32nm technology nodes. . . .2 5. . . . (a) Leakage change. . . . . . 0).9 Energy and delay tradeoff. . . . . (a) Input and output voltages. . . and short circuit current with a small input slope. . .12 Impact of device aging. . . . . 4. . . . . . . . . . . . .D2 ). . . . . .1 . . . . . . . . . 4. . . 72 83 85 87 87 88 90 4. . . . .3 6. . 110 PDF comparison for circuit s15850. . . . . . . .5 The ﬂow for gate leakage currents Igon and Igo f f calculation. . . . . . . Delay and leakage distribution. . . . . 5. (b) Input and output voltages. .4 4. . . . . .11 Vth increase caused by NBTI and HCI. . . . . . .7 Voltage and current in an inverter during transition. 4. . . . (b) Delay change. . 91 (a) Comparison of exact computation. . . . . . . . . . . . . . . . . . . .8 4. 0). linear ﬁtting. . . . . . . . . . . . . (c) The NMOS IV curves and the transition of transient current Ids with small and large input slopes. . . 4. . 107 5. . . . . . (a) Leakage PDF. . . . and short circuit current with a large input slope. . .6 4. . and second-order ﬁtting for max(V.1 Algorithm for computing max(D1 . . . . . . . (a) Leakage PDF comparison. . The ﬂow for transistor gate and diffusion capacitances Cg and Cdi f f calculation. . . . . . and second-order ﬁtting for max(V. . . . . . . .

137 Apparent spatial correlation and covariance as a function of distance. . . . . . . . . . . . . . . . . . . . . and L-L case. . . . . .5 6. . assuming n = 10 and n is large. . . . . . . . . . . . . . . 143 PDF of Sx and Sy . . 144 Comparison of S-L. .4 7. . . . .5 Flow to simulate conﬁdence. . . . and 95-percentile point for ISCAS85 benchmarks. 185 ˆ ˜ 7. . . . . . . . . . . . . . . . . . . ˆ Percentile point is normalized with respect to the nominal value. . . . . .6 Normalized worst case mean. .2 6. . L-S. 175 Number of n or n which can be considered as large for different fracˆ ˜ tion of lot-to-lot variation assuming maximum 3% error at 90% conﬁdence. . . . . . . . . . . . . . . 182 ˆ ˜ f Worst case mean. . .1 7. . . . . . . . .3 7. standard deviation. . . . . .6. . . . . . . . . . . . . . . . . . standard deviation.7 90% conﬁdence worst case 95-percentile point with varying n for C7552. 138 Correlation coefﬁcient for within-die spatial variation after inter-die variation is removed. . 176 7. . . . . . . .3 6. . . . . . . . . . . . . . assuming n = 10 and n = 15. . . . and 95-percentile point normalized with respect to the nominal value for ISCAS85 benchmarks. .6 6. 141 Ring oscillator frequency within a wafer. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186 ˆ ˜ 7. . . . . .4 Mean and variance for different rd . . . . . . . . . . . . . . . . . . . . . . . . . . .7 7. . . . . . . 187 xii . . . . . .2 Approximating across-wafer variation. . . . . 139 6. . 180 90% conﬁdent Dw with varying n and n. . . . . .

. . 36 37 2. and Hetero-Vth +G. . The number of LUTs and ﬂip ﬂops are counted under the architecture N=8 and K=4. Min-ED hyper-architecture of optimizing device and architecture separately and simultaneously. . . . . . . . . . . . . . . Vd d and Vth . . . .9 Minimum delay and minimum energy hyper-architectures. .12 Min-AED hyper-architecture under different utilization rates. . . . . . . . . . . . 2. . . . . . . Trace information. . .11 Min-ED hyper-architecture under different utilization rates. K = 4. 39 41 2. . and ED-area product are normalized with respect to the baseline. . . 41 42 xiii . .13 Min-delay hyper-architecture under different routing structures. . . . . Architecture setting: N = 10.5 2. . . . . . . . Unit: switch per clock cycle. . . 2. . . Architecture setting: N = 10.1 2. . .10 Minimum ED-area product hyper-architectures for different classes. . 2. . . . . . . . . . . Homo-Vth +G. Vd d and Vth . . . . . . . . . . . . . . .4 MCNC benchmark list. 28 2. . . . . 38 2. . . Power and delay ranges for different device settings. . . . . . 27 3 26 Short circuit power ratios for different technology nodes. . . . . . . . . 31 33 34 2. . Area. . Comparison between baseline and min-ED hyper-architecture in HomoVth . . . .2 Impact of process variation in near-term future. . . . . . . . Hetero-Vth . . . . .L IST OF TABLES 1. . . . . . . . . . . . . . . switching activities for different technology nodes. . . . . . . . .6 2. . . . . ED. . . . . .8 2. . . . . . .3 . . . . . . . . . . device and circuit parameters. K = 4. . .7 Baseline hyper-architecture and evaluation ranges. Note: AED is normalized with respect to the baseline. .1 2.

. . . . . . . . . . . .2 3. . . . 3. . . . . . . . . . . . . . . . . . . . . . . . . 85 89 90 92 4. . . . . . mean. . . . . . . . and standard deviation comparison for leakage power and delay. . . . . . . . . . . . Optimum leakage yield hyper-architecture. . Optimum Timing yield hyper-architecture. . . .3 3. Soft error rate comparison.3 4. . . . . . . . . . . . . . .1 Experiment setting. . . . .7 4. . Experimental setting. . . . . . . . . . . . . . Optimum leakage and timing combined yield hyper-architecture. . . . .2 4.15 Min-ED hyper-architecture under different routing structures. delay. .6 4.8 Delay and leakage change due to device aging. . . . . . . . . . . . . . . . . . Impact of device aging on process variation. . . .5 4. . . . Experimental setting. . . . . energy. . . . . . . . . . . .2. . . . . . .4 3. . . . . . and ED comparison for different device settings. . Nominal value. . . . . . .14 Min-energy hyper-architecture under different routing structures. . . . . . . . . . . . . . . . . .5 4. . . . . . . . . . . The impact of device aging and process variation on SER of one SRAM under ITRS HP 32nm technology. . . . . . . . . . . 122 xiv . . . Power. . . . . . .1 3. . . Veriﬁcation of yield model. . . . . 93 5. . . . . The 3σ value is the normalized with respect to the nominal value. . . . . . 43 43 58 59 61 61 62 82 83 84 . . . . . . .4 . . . . Experimental setting for process variation. . . . . . . . . . . . 2. . . . .1 4. . . . . . . . . . .

. and IW. . . . . . . .1 6. . . . . . Note:the error percentage of mean (µ). . 125 100 × (MC value − SSTA value)/σMC . . . . . . . . . . σ. . . . standard deviation. . . Note: The values of exact simulation are in ns. . . 154 6. . . . . . . . . . . . . . . . . .2 Error percentage of mean. . . . and 95-percentile point for exact simulation is in ns. . . 153 6. standard deviation. . .3 Mean. . . . Run time (T) is in s. . . . . . . . . . . skewness. . . . and skewness comparison for variation setting (2). and 95-percentile point (95%) is computed as skewness γ is computed as 100 × (γMC − γSSTA )/γMC . . The values of exact simulation are in ns. . . . . . 152 6. . . Note: The values of exact simulation are in ns. . SPC.3 percentage error for ISCAS85 benchmark stretching on a chip with different chip size. . . . Note: We assume square chips and chip size means edge length in mm. . . . .5. . ∗ for LDAW. . . . . . .4 Delay percentage error at different locations in a 2cm × 2cm chip. . . . and 95-percentile point for variation setting (1). . . . . . Note: The exact values are in mW . . . . . . . . . .8 Leakage error percentage for different models in 2cm×2cm chip. . . . . . . . 155 6. . 157 xv . . . . . 155 6. Run time (T) is in ms. . . . . . . .7 Delay comparison for ISCAS85 benchmark in 3mm×3mm chip.6 The values of exact simulation are in ns. . . . . linear SSTA is applied when assuming linear cell delay model. . .5 Delay comparison for ISCAS85 benchmark in 1cm × 1cm chip. Note: The values of exact simulation are in ns. . . . Note: Delay comparison for ISCAS85 benchmark in 6mm×6mm chip. 126 6. . . . . 135 Delay percentage error for different variation models. . . . 154 6. . and the error percentage of 5. . standard deviation (σ). . . . . Note: the µ. . . . . . . . . . . . . . . . . . . .2 Notations. . . . . . .

. . . . . . 168 Experimental validation of estimation of conﬁdence interval for mean for n = 5 and n = 1. 167 7. . 179 7. n = 1. . . . .6. . . . . . . . n = 15. . . . . . . . . . . . . . . . . . 177 ˆ ˜ ˜ 7. . Lgate value is in nm.9 Leakage error for different variation model on different size chips. . . . . . . . . . . . . . . . . . . . Note: exact values are in mW . . . . . . . . . . . . . 158 6. . . . . . .1 Reliability of statistical analysis for different number of measured and production lots. . . . . . . . and ˆ ˜ Vth value is in V . . . . . . . . . . . n = 15. 174 ˆ ˜ 7. . . . .10 Summary of different variation models. . . . . 179 ˜ xvi . . . 160 7. . . and n = 25. . . . . 174 ˆ ˜ 7. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Note: delay value is in ps. . . . . . . .6 Guardband of fast/slow corner for different conﬁdence levels assuming n = 10 and large n. . . . . . . .2 7. .9 Conﬁdence comparison for inverter chain based corner assuming n = ˆ 10. . . . . . . . . . . . . . . . . . . . . . . .5 Guardband of fast/slow corner for different degrees of conﬁdence assuming large n and n = 15. . . . . . . . . n = 15. . . . . 171 ˆ ˜ 7. . . . . . . . . . .4 Guardband of fast/slow corner for different degrees of conﬁdence assuming n = 10. . .7 Guardband of fast/slow corner for different conﬁdence levels assuming n = 10. . . . . . . . . . . . . . . . . . . . . 175 ˆ ˜ 7. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3 Notations. . . . . . . . . . . . . . .8 Percentage guardband of worst case delay corner and corresponding variation source corners under different conﬁdence levels assuming n = 10. . . . . .

7.12 Percentage guardband of worst case delay corner and corresponding variation source corners under different conﬁdence levels assuming n = 10. . . . . . . . . . . . . . . . . . . .10 Percentage guardband of worst case delay corner and corresponding variation sources corners under different conﬁdence levels assuming large and n = 15. . . . . . . . n = 1. 181 ˆ ˜ 7. . . . . . . . . . . . 182 ˆ ˜ ˜ 7. . .13 Conﬁdence comparison for ISCAS85 benchmark 95-percentile point delay assuming n = 10. . . . . . . 181 ˜ 7. . . and n = 25. 185 ˆ ˜ xvii .11 Percentage guardband of worst case delay corner and corresponding variation source corners under different conﬁdence levels assuming n = 10 and large n. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . n = 15. . . . . . . .

I also indebted to Dr. which is reported in Chapter 4. Yan Li. Yan Lin. and Phoebe Wong. Fei Li. This dissertation would not exist without their enormous encouragement. and Dr. and Prof. Puneet Gupta. especially those colleagues at the UCLA Design Automation Lab and NanoCAD Lab. In particular. I am indebted to Prof. I would like to thank my adviser. Prof. Yan Li also helped me on developing the transistor and circuit level power and delay model for FPGA process and architecture concurrent design in Chapter 4. Lei He and my co-adviser. Miss Phoebe Wong. Jinjun Xiong. They not only helped me to develop skills to solve problems. support. which is reported in Chapter 6. and inspiration. I would like to acknowledge them here. I thank Dr. Kun Qian for helping me to develop the across-wafer variation model. The work on device and architecture co-optimization for FPGA in Chapter 2 and Chapter 3 is a joint project with Fei Li. He helped me to develop the FPGA device aging model. but also motivated and enlightened me at all time. Glenn Reinman. He also helped me to develop the chip-wise optimization ﬂow to minimize FPGA delay considering process variation. I sincerely thank Dr. Prof. Prof. Jinjun Xiong helped me to develop the experiment and revise the writing for the statistical static timing analysis work in Chapter 5. Kevin Cao for his help with my research. Dr. Sudhakar Pamarti. I treasure the opportunity that I have worked with a group of wonderful people and would like to thank them for all the helps and support through my research work. Costas Spanos and Mr. xviii . Lieven Vandenberghe for serving as members of my dissertation committee and their suggestions to improve this work.Acknowledgments Many people offered me signiﬁcant help during the development of this dissertation. First of all.

D. for their dedication. I am grateful to my family.I would also like to thank many researchers include Dr. Mr. study. Dr. I would like to thank my wife Wen Wang for her constant understanding and supporting me during my Ph. and most of all. This dissertation becomes so humble compared to their constant inspiration. Jeffery Tang. Miss Wenping Wang. for collaboration and helpful discussions. Qi Xiang. and love. xix . I would also like to thank my parents. my brother. and my sister in law. Finally. and support. support. Tim Tuan.

China. San Jose. Portland State University. in Electrical Engineering. in Electrical and Computer Engineering. xx . Nano CAD Lab. Teaching Assistant. 2003-2004 First year Ph. in Electronic and Communication Engineering. Worked on GPU testing. 2001-2003 M. Advanced Computer Architecture.Vita 1979 1997-2001 Born in People’s Republic of China. Worked on statistical power and delay estimation. 2006-2007 Research Intern.D. 2006. Altera Corp. Teaching Assistant. Nvidia Corp. California. UCLA. Guangzhou. B. Los Angeles. program. San Jose.R. USA. Design Automation Lab. USA. California. Zhongshan University. USA.S. 2004-present P. 2002-2003. Oregon. in Electrical and Computer Engineering. USA. P.D. Graduate Student Researcher. program. Graduate Student Researcher. 2006 Intern Engineer. USA. Portland. 2004-2008. Boulder. Graduate Student Researcher. California.. University of Colorado. Analogue Circuit Design. University of California. Colorado. UCLA. 2008-present. 2003-2004. 2002.S. Graduate Student Researcher..

L. November 2009. and L. pp:1777-1781.Song.” Design Automation Conference (DAC).” accepted by IEEE/ACM Asia South Paciﬁc Design Automation Conference (ASPDAC). February 2004. issue: 11. Cheng. Lin.” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD). July 2007. pp: 474-479. L.” IEEE Computer Society annual symposium on VLSI (ISVLSI). P. Cheng. P.” IEEE Transactions on Computer-Aided De- sign of Integrated Circuits and Systems (TCAD). Yang. volume: 26. “Congestion Estimation for 3D Routing. Gupta. “Device and Architecture Co-Optimization for FPGA Architecture Power Reduction. He . and L. Cheng. P. Wong and L. Qian.P UBLICATIONS L. He . Cheng. P. July 2009. L. Y. F. pp: xxi . pp: 104-109. February 2010. “On Conﬁdence in Characterization and Application of Variation Models. February 2009. “Efﬁcient Additive Statistical Leakage Estimation. C. issue: 7. “Physically Justiﬁable Die-Level Modeling of Spatial Variation in View of Systematic Across Wafer Variability. P. He . Spanos. Cheng. He. and L. volume: 28. Gupta. and L. Li. W.” IEEE/ACM Asia South Paciﬁc Design Automation Conference (ASPDAC). Gupta. “Accounting for Non-linear Dependence Using Function Driven Component Analysis. L. K. L. X. He . and G. pp: 1211-1221. Hung. Gupta. Cheng.

L.” IEEE Transactions on Circuits and Systems II: Express Briefs (TCAS II). “Congestion Estimation for Three-Dimensional Architectures. L. W. X. “Device and Architecture Co-Optimization for FPGA Power Reduction. Cheng. Y. pp: 655.Song.Song. Cheng. pp: 360-370. December 2004. He. L. L. Yang. Gao. Li. W. Hung. L. Cheng. X. L. “A Fast Congestion Estimator for Routing with Bounded Detours. Cheng. and L.” Integration. pp: 159-168. and G. Y. August 2006.” Design Automation Conference (DAC). Xiong. P. and Z. pp:1-6. Cheng. Lin.Song. Lin. G. and Y. February 2004. volume: 41. February 2008. pp: 915-920. and L. Tang. Tang. G. issue: 12. L. He. the VLSI Journal (Integrat VLSI J).239-240. May 2008.” International Conference on Field Pro- grammable Logic and Applications (FPL). “A Fast Congestion Estimator for Routing with Bounded Detours. J. He.659. volume: 51. June 2005. F. Cao. Yang. Wong. Cheng. Yang. Hung. “Trace-Based Framework for Concurrent Development of Process and FPGA Architecture Considering Process Variation and Reliability.” International Symposium on Field-Programmable Gate Arrays (ISFPGA). and S. xxii . “FPGA Performance Optimization via Chipwise Placement Considering Process Variations.” IEEE/ACM Asia South Paciﬁc Design Automation Confer- ence (ASPDAC). Z. pp: 666-670. X. issue: 3.

and G. October 2007. and L. pp: 80-85. Song. Systems.” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD). Yang. L.” IEEE/ACM Asia South Paciﬁc Design Automation Conference (ASPDAC). pp: 250-255. and L. He. Cheng. pp: 130-140. January 2009. L. Xiong. xxiii . “Non-Gaussian Statistical Timing Analysis Using Second-Order Polynomial Fitting. L. and L.”. J. June 2007. Xiong. L. He. and L. Cheng. Cheng. and L. J. Xiong. Cheng.” ACM/IEEE International Workshop on Timing is- sue: in the Speciﬁcation and Synthesis of Digital Systems(TAU).L. February 2007. He. Cheng. L. X. He. and Computers (JCSC). J. February 2008. “Non-Linear Statistical Static Timing Analysis for Non-Gaussian Variation Sources. “Non-Gaussian Statistical Timing Analysis Using Second-Order Polynomial Fitting. “Non-Gaussian Statistical Timing Analysis Using Second-Order Polynomial Fitting. pp: 298-303.” Design Automation Conference (DAC). volume: 28. Xiong. Cheng. Xiong. “Non-Linear Statistical Static Timing Analysis for Non-Gaussian Variation Sources. F. J. accepted by Journal of Circuits.” IEEE International Workshop on Design for Man- ufacturability and Yield (DFM&Y). issue: 1. “An Analytical Congestion Model With Bounded-Based Detours. He. He. J.

Gu. volume: 11. L. F. “Routability Checking For ThreeDimensional Architectures. pp: 667-676. Tang.” Journal of Universal Computer Science (J. M. volume: 51. Yang. “Digital Spectrum of a Nonuniformly Sampled Two-Dimensional Signal and Its Reconstruction. Yang. Yang. He. G. “On Theoretical Upper Bounds for Routing Estimation. P. Song. pp: 307-308.Sun. Number 6. He. Cheng. M. L. June 2005. pp: 1371-1374. Tang. Song. He. Kam. pp: 916-925. Cheng.F. X. “A Combinatorial Congestion Estimation Approach with Generalized Detours. X. M. May 2005. F. Y. Cheng. G. Sun. volume: 12. Cheng. X. pp: 1180-1187. Song. issue: 12. and G. May 2005. pp: 19-24. Cheng. and J. Jeng and L. pp: 1113-1126. Z. L. Wong. Song. Y.” Computers & Mathematics with Applications (CAM). X. Gu. M. L.UCS). L. June 2005.” The Computer Journal (TCJ). December 2004. and J. and L.” International Conference on Computer Aided Design (ICCAD).” IEEE Transactions on Very Large Scale Integration Sys- tems (TVLSI). X. Z. “A Hierarchical Method for Wiring and Congestion Prediction. G. volume: 54. Yang. W. “Probabilistic Estimation for Routing Space. Gu.Song. November 2005. He. He. “FPGA Device and Architecture Evaluation Considering Process Variation. Lin. Tang. Z. T. Gu. F. xxiv .” IEEE Computer Society annual sym- posium on VLSI (ISVLSI). : 6-7. and J. issue: 3. Cheng.” IEEE Transaction on Instrumentation and Measure- ment (TIM). Cheng. March 2006. and L. Sun. Hung.

Ptrace. 2010 Professor Puneet Gupta. To perform process and architecture concurrent optimization. we present the ﬁrst in-depth study on device and FPGA architecture co-optimization to minimize power. This dissertation proposes novel techniques to model. Furthermore. and variation evaluator. This dissertation focuses on two aspects: (1) Process and architecture concurrent optimization for FPGAs. Co-chair Professor Lei He. and reliability models and incorporate them with Ptrace. (2) Statistical timing modeling and analysis.Abstract of the Dissertation Statistical Analysis and Optimization for Timing and Power of VLSI Circuits by Lerong Cheng Doctor of Philosophy in Electrical Engineering University of California. an efﬁcient and accurate FPGA power. process variation introduces signiﬁcant uncertainty in power and performance to VLSI circuits and signiﬁcantly affects their reliability. we perform architecture and process parameters concurrent optimiza- xxv . Los Angeles. analyze. area. Co-chair As CMOS technology scales down. is proposed. With Ptrace. If this uncertainty is not properly handled. With the extended Ptrace. and variation considering hundreds of device and architecture combinations. it may become the bottleneck of CMOS technology improvement. delay. to enable early stage process and architecture co-optimization without stable device models. and optimize power and performance of FPGAs and ASICs considering process variation. we develop transistor level and circuit level power. delay. delay.

To solve this problem. process. mean and variance uncertainty introduced by limited number of samples is another problem in SSTA. Then. we evaluate the conﬁdence for statistical analysis and estimate the guardband value to ensure a target conﬁdence. variation. and architecture concurrent co-optimization for FPGA power. To perform statistical timing modeling and analysis. to further improve the efﬁciency and accuracy of statistical timing analysis. delay. and reliability. Besides modeling spatial variation. All operations in this ﬂow are performed by analytical equations without any time consuming numerical approach. delay. this dissertation is the ﬁrst novel study of device. and reliability. we develop a new die-level spatial variation model which accurately models the across-wafer variation. To the best of our knowledge. and is the ﬁrst work to model across-wafer variation at die-level and to consider conﬁdence guardband in statistical analysis. xxvi .tion for FPGA power. variation. we ﬁrst present an efﬁcient and accurate statistical static timing analysis (SSTA) ﬂow for non-linear cell delay model with non-Gaussian variation sources.

it increases rapidly as CMOS technology scales down. Therefore.CHAPTER 1 Introduction Complementary Metal Oxide Semiconductor (CMOS) manufacturing in nanometer region has resulted in great achievements in the world of electronics. Figure 1. 1 .1: Increasing of power density.1 [114] illustrates the power density for different technology nodes in the past thirty years. power consumption becomes a crucial design constraint for nano-scale VLSI designs. It can be seen that power density is increasing at a dramatic rate. •10 •4 Power Density (W/cm 2 ) •10 •3 •10 •2 •10 •1 •10 •1980 •0 •1985 •1990 •1995 •2000 •2005 •2010 Figure 1. The performance and capabilities of Very Large Scale Integration (VLSI) circuits improve rapidly with the aggressive scaling of transistors in CMOS. However. the increase of leakage power and process variation have emerged to slow down this trend. Since leakage power is exponentially related to channel length.

such as statistical modeling. channel width. as illustrated in Figure 1. oxide thickness.3X variation in frequency and 20X variation in leakage power. metal thickness.1 [69]. process variation will become a potential show-stopper if it is not appropriately handled. which render chips non-functional.2. Process variation is the uncertainty of physical process parameters. process variation is more difﬁcult to handle in VLSI design. it is predicted that leakage power variation will increase to 331% and performance variation will increase to 88% in the year 2015. doping density.In comparison to challenges created by power. From the table. as CMOS technology scales down to nano-meter region. In recent years. transistors on different copies of the same design. It has been shown in [25] that even for the 180nm technology. or even at different locations on the same chip will vary signiﬁcantly. as shown in Table 1. This research is called “Design for Manufacturability ” (DFM). Therefore. Due to the inevitable manufacturing ﬂuctuations and imperfections. and manufacturing temperature. Internation Technology Roadmap for Semiconductors (ITRS) predicts the DFM technology requirement for process variation in the near-term future. Process variations introduce signiﬁcant uncertainty for both circuit performance and leakage power. doping density. and so on. 2 . process variation can be classiﬁed into two types [162]: • Catastrophic defects are caused by isolated random events (such as particles or other contaminations) during manufacturing. According to the causes of variation. metal width. have been developed to alleviate the variation effects. analysis. process variation can lead to 1. many methods. • Parametric variations are caused by random ﬂuctuations in process conditions so that the physical properties of some parameters on a chip differ from the original design. and optimization for VLSI circuits. The ﬂuctuations may include aberrations in stepper lens. such as channel length. Such impact will become even larger in future technology generations.

FPGA is the most ﬂexible one. Among the design styles. and optimization techniques should be developed for different design styles. analysis. Year % of Vdd variability % of Vth variability (minimum size) % of Vth variability (typical size) % of CD (Lgate ) variability % circuit performance variability % circuit total power variability % circuit leakage power variability 2007 10 33 16 12 46 56 124 2008 10 37 18 12 48 57 143 2009 10 42 20 12 49 63 186 2010 10 42 20 12 51 68 229 2011 10 42 20 12 60 72 255 2012 10 58 26 12 63 76 281 2013 10 58 26 12 63 80 287 2014 10 81 36 12 63 84 294 2015 10 81 36 12 63 88 331 Table 1. different modeling. Application Speciﬁc Instruction set Processor (ASIP). and general purpose processor or microprocessor. Since FPGA allows the same silicon implementation to be programmed or re-programmed for a variety of applications. Therefore. it provides low non-recurring engineering (NRE) cost and short time to 3 . Field-Programmable Gate Array (FPGA).1: Impact of process variation in near-term future. There are four common design styles in current VLSI: Application Speciﬁc Integrated Circuit (ASIC). Different design styles may have different power consumption requirement and vulnerability to process variation.2: Leakage power and frequency variation.Figure 1.

and a 40X area difference between FPGA designs and their ASIC counterparts.market. since FPGAs have lots of regularity. FPGA pays the penalty with decrease of performance. variation modeling and analysis in ASICs are more important than in FPGAs. The power problem is more signiﬁcant for FPGAs than other design styles because FPGA has higher power consumptions. delay. wherein the impact of process variation is very signiﬁcant. Figure 1. Due to the above advantages and disadvantages. design cost. ASICs have higher performance. The ﬁrst objective of this dissertation is: Objective 1: Concurrently evaluate FPGA architecture and process parameters to optimize power. ASICs has higher NRE cost. Therefore. On the other hand. and area. in order to achieve programmability. Due to this advantage. due to the increased design complexity. and area considering process variation and reliability. the impact of process variation on FPGAs should not be ignored and should be properly handled. Manufacturing yield loss is deﬁned as the ratio between the number of chips that fail to meet the design speciﬁcations and the total number of manufactured chips [112]. There are several reasons causing chip failure. As discussed before. Process variation may cause manufacturing yield loss. Nevertheless. a 4. lower power. and longer time to market than FPGAs. power consumption is a crucial design constraint for nano-scale VLSI circuits. To solve the above problems.3 4 . Previous studies [83.3X delay difference. In comparison to FPGAs. ASICs are usually used in high performance or low power applications. such as catastrophic defects and parametric variations. this dissertation focuses on optimizing power and performance for FPGAs considering process variation. power. and smaller area. the impact of process variation on FPGAs are not as signiﬁcant as other design styles. 82] have shown a 100X energy difference. However. However. FPGA industry has grown rapidly since its invention in 1984.

In this dissertation. The second objective of this dissertation is: 5 . we mainly focus on modeling of process variation and developing an accurate SSTA engine. • Statistics aware engines for STA and optimization. there are three key components [122]: • Modeling of process variation from silicon process characterization. Instead of estimating the chip delay deterministically as in the traditional Static Timing Analysis (STA). Figure 1. In this ﬂow. Statistical Static Timing Analysis (SSTA) is proposed.illustrates the performance variation caused by parametric variations.4 shows the SSTA ﬂow. It can be seen that chip delay is spread over a wide range and the delay of some chips is higher than the speciﬁcation value. Dela s e c y p Yield Yield Loss Measured Delay Figure 1. SSTA not only estimates the nominal value of chip delay but also provides the delay distribution. In order to analyze and optimize performance yield. • Modeling of sensitivities of delay for standard cell libraries with regards to process parameter variations (library modeling). SSTA estimates the chip delay statistically.3: Delay yield. The chips failing to meet the delay speciﬁcation contribute to yield loss.

[10] showed that LUT sizes ranging from 4 to 6 and cluster sizes between 4 and 10 produce the best area-delay product. and routing wire segment length. Performance.3 outlines this dissertation. LUT size of 4 achieves the smallest area and LUT size 5 or 6 minimizes delay. Therefore.Measurement Modeling of process Variation SSTA Engine Chip Delay Variation Library Modeling Net List Figure 1.1 reviews some related research works. After cluster-based island style FPGA became popular.2 discusses the major contributions of this dissertation. Section 1. Section 1. and Area Optimization It is well known that architecture. In the following of this chapter. and Section 1. including logic block (or cluster) size. 1. performance.1 FPGA Power. architecture evaluation mainly focuses on optimizing delay [143] and area [130]. architecture evaluation for FPGA optimization attracts lots of concerns. These works showed that for non-clustered FPGAs. architecture evaluation for this type of FPGA was performed. 6 .1 Literature Review 1. Look-Up Table (LUT) size. has great impact on FPGA power. In early 1990s. and area. Objective 2: Statistical Timing Modeling and Analysis.4: SSTA ﬂow.1.

123. body bias [110] and power gating [59. Architecture evaluation considering power gating and dual-Vdd was performed [101]. [136] performs parametric yield estimation for FPGA considering within-die variation. 89. To reduce leakage power consumption of unused circuit elements. power consumption is one of the most important design constraints of FPGA design when technology scales down to nano-meter region. a glitch minimization technique [84] was proposed to reduce dynamic power. 7 . 137. 51. Besides leakage power minimization. utilization rate of circuit elements in FPGA is very low. 102] were applied. 92]. Due to programmability. 115. power driven partition method was presented in [117]. 100. Therefore. 92] presented power evaluation frameworks for generic parameterized FPGAs and showed that leakage power takes higher portion of total power consumption in FPGAs than in ASICs. 90. and a conﬁguration inversion method to save the leakage power was proposed in [12]. [158] quantiﬁed the leakage power of commercial FPGA architecture and [48] introduced a high level FPGA power estimation methodology. 115. 71]. a leakage power saving routing multiplexer and an input control method were developed [150]. programmable dual-Vdd was ﬁrst applied to logic blocks [93. Architecture evaluation for power and delay minimization is then studied in [93. 91] and then extended to interconnects [58. 135. 79. [123. It was also shown that interconnects consume even more power than logic elements in FPGA.As discussed before. 43. 154. 147. FPGA power modeling became a hot research topic. 98. [31. Recently. FPGA modeling and optimization considering process variation was studied. 99. FPGA power optimization was also studied. At the same time. To further reduce power on used circuit elements. 28] presented techniques to statistically optimize FPGA performance. after the year 2000. Power aware FPGA CAD algorithms was proposed in [85].

165. 88. 111. 128. 97. There are two major approaches in 8 . 52.2 Statistical Timing Modeling. 41. 73. 119. 173. 171. 181. In this model. Statistical Analysis. obtaining the correlation coefﬁcient between two grids became a problem for this model. 189. 36. 118. and Optimization Process Variation Modeling. 65. 113. 179. 146. 35. 33. 5. Statistical static timing [103. 163. 157. [172. 7. 142. each transistor has its own within-die variation and the within-die variation for different transistors are assumed to be independent.With the model of process variation. 9. 182. 2. 49. 95. The spatial variations of different grids are correlated and the correlation coefﬁcients depend on the distance between two grids. Later on. 104. 189. 76. 29. 63. 56. 178. 53. 138. 21. 169. statistical analysis and optimization are then performed. 40. 144. 116. 105] solved this problem by modeling the spatial variation as a random ﬁeld [174] and assuming the correlation coefﬁcient to be a function of distance. 151. 107. 44. 81. 133] analysis are hot research topics in recent years. 55. 32] and principle component analysis was applied to perform statistical timing analysis. 161. 64]. 13. 75. several more complicated spatial variation models were proposed [45. 120.1. 70. 72. 19. 188. 105. 32. 62. In this case. process variation modeling is needed. 86. 3. 30. In order to analyze the power and delay uncertainty. 8.Process variation introduces signiﬁcant uncertainty to circuit power and delay. 176. it was observed that within-die variation is not independent but spatially correlated and the correlation depends on the distance between two within-die locations. 183. After that. a chip is divided into several grids and each grid has its own spatial variation. 156. An early work was proposed in [25] whereby process variation is separated into inter-die variation and within-die variation. 22. 42] and leakage [26. 20. 15. 37. 61. Analysis. However. 180.1. 155. All transistors on the same die share the same inter-die variation and different dies may have different inter-die variation. 109. 140. 145. 141. spatial variation was modeled as correlated random variables [6.

19. 182. Since Gaussian random variables are easy to be handled. 179. the via resistance has an asymmetric distribution [13]. 40. 86. block-based SSTA is more commonly used. 9 . 180. 183. 13. the early block-based SSTA ﬂows [32. the statistical characteristics (such as mean and variance) are assumed to be given and reliable. and correlation coefﬁcients as an interval. However. when the number of production samples is not large. Moreover. Some recent works [76. 40. 49. 42] considered both non-linear delay model and nonGaussian variation sources. 163. 2. [142] applied independent component analysis to de-correlate the nonGaussian random variables. 42] SSTA. 3. 53. 9. 8. 179. 128] and block-based [32.In all the statistical modeling and analysis methods discussed above. it was also observed that some variation sources do not follow a Gaussian distribution.Statistical static timing analysis techniques: path-based [103. 178]. 142. 75. variance. 76. 22. 72. 5. The nonlinearity of delay model and non-Gaussian variation sources make SSTA much more difﬁcult. 41. and then estimated the range of mean and variance of circuit performance. 19. uncertainty in the statistics of measured data as well as production data should be considered in statistical analysis. 7. For example. 157. Therefore. the production statistical characteristics may signiﬁcantly deviate from their population values. Later. 41. 178. [177] modeled the uncertainty of mean. 13. 181. 113. but it was still based on a linear delay model. it was observed that the linear delay model is no longer accurate when the scale of variation becomes larger [94] and a higher-order delay model is thus used [180. Since path-based SSTA is not scalable to large circuit sizes. Conﬁdence Analysis. 163] modeled the gate delay as linear functions of variation sources and assumed all the variation sources are mutually independent Gaussian random variables. 120. while dopant concentration is more suitably modeled as a Poisson distribution [142] rather than Gaussian.

Experimental results show that compared to cycle posed closed-form models to estimate chip level leakage and timing variations 10 .2% compared to the baseline. Based on such framework. which uses the VPR architecture model [18] with the same LUT size and cluster size as the commercial FPGAs used by Xilinx Virtex-II [170]. we have the following contributions: • We ﬁrst develop an efﬁcient yet accurate power and delay estimator for FPGAs. respectively. We also study the impact of utilization rate and interconnect structure. Furthermore. It has been shown that the mean and standard deviation computed by our models are within 3% error compared to Monte-Carlo simulation. We profor FPGA. 2. Ptrace predicts chip level power and delay within 4% and 5% error. • We then extend Ptrace to estimate power and delay variation of FPGAs. With the which we refer to as Ptrace.1. and the device settings from ITRS roadmap[66].4% and chip area by 23%. For device and architecture concurrent optimization for FPGAs. accurate power simulator and VPR [18]. Device and architecture concurrent optimization for FPGAs. we perform device and architecture co-optimization for FPGA circuits. Compared to the baseline.2 Contributions of the Dissertation The contributions of this dissertation are two-fold: 1. considering FPGA architecture with power-gating capability. our co-optimization reduces energy-delay product by 18. Statistical timing modeling and analysis.0% and chip area by 8. our architecture and device co-optimization reduces energy-delay product by 55.

160. For statistical timing modeling and analysis. device and architecture tuning improves leakage yield by 4. Such evaluation may be used to select circuits and architectures that are less sensitive to process changes or process variations. • Our proposed SSTA ﬂow solves the problem of computing the chip delay vari11 . device aging (Negative-Bias-Temperature-Instability. 23] and HotCarrier-Injection. With Ptrace2. respectively. we incorporate analytical calculations for two types of FPGA reliability. skewness. Moreover. Compared to the baseline. 6%. we perform FPGA device and architecture evaluation considering process variations. HCI [34. Experimental results show that compared to Monte-Carlo simulation. 149.2%. We call the resulting framework as Ptrace2. but LUT size of 5 achieves the maximum leakage and timing combined yield. NBTI [11. All operations in our ﬂow are based on efﬁcient closed-form formulae. timing yield by 12.4%. our approach predicts the mean.extended Ptrace. • We further extend (Ptrace) to consider process parameters directly so that FPGA circuit and architecture evaluation can be conducted when only the ﬁrst order process parameters are available. we have the following contributions: • We introduce an efﬁcient and accurate SSTA ﬂow for non-linear SSTA with non-Gaussian variation sources. We also observe that LUT size of 4 gives the highest leakage yield. and leakage and timing combined yield by 9. and 95-percentile point within 1%. standard deviation. LUT size of 7 gives the highest timing yield.8%. 1%. again in the ‘from device to chip’ fashion. we also ﬁnd that neither device aging due to NBTI and HCI nor process variation has signiﬁcant impact on SER. We observe that device aging reduces standard deviation of leakage by 65% over 10 years while it has relatively small impact on delay variation. 166]) and permanent soft error rate (SER) [60]. 149. and 1% error.

38. Morever. Our estimations are within 2% of simulated conﬁdence values. the error of the traditional gridbased spatial variation model is up to 8% while the error of our model is only 2%. 1. 4.. percentile corner) and analysis (SPICE corner extraction.3 Dissertation Outline The remaining of this dissertation is organized as follows: Part I. In order to improve the accuracy and efﬁciency of SSTA. but does not consider modeling of process variation. The guardbands are non-negligible for all cases when either production or characterization volume is not large. statistical timing). we presents an accurate model for acrosswafer variation. The problem is ﬁdence in the statistical analysis (since production only sees a small snapshot of the entire distribution). proposes techniques to statistically model and optimize FPGA power and delay.7% of standard deviation for single parameter corner. 12 . Compared to the exact value. variance. calibrated models have low degree of conﬁdence. and 52% of standard deviation for 95%-tile point of circuit delay are needed) to ensure 95% conﬁdence in the results. 34. However. including Chapter 2. We mathematically derive the conﬁdence intervals for commonly used statistical measures (mean. due to limited number of samples (especially in the case of lot-tofurther exacerbated when low production volumes cause additional loss of con- lot variation).g. 3.ation. • The above SSTA ﬂow also assumes that the statistical variation models are reli- able. our new model is 6X faster than the traditional model. a signiﬁcant guardband is needed (e.7% of standard deviation for SPICE corner. Our experiments indicate that for moderate characterization volumes (10 lots) and low-to-medium production volumes (15 lots).

With this circuit level model. We ﬁrst develop a set of closed-form formulas for chip level leakage and timing variations considering both die-to-die and within-die variations. Based on such model. delay and area. and reliability. and interconnect wire segment length W ) to minimize power. cluster size N. Chapter 6. 68. leakage power. With Ptrace. in addition to power and delay during evaluation. leakage yield. We ﬁrst implement models to calculate electrical characteristics of advanced CMOS transistors based on the ITRS MASTAR4 (Model for Assessment of cmoS Technologies And Roadmaps) tool [67. 148]. Part II. We refer to the new trace-based model as Ptrace2. Chapter 3 extends the architecture and device co-optimization ﬂow in Chapter 2 to considering process variation. Applying extended Ptrace. and then develop circuit level models of delay. we simultaneously optimize device setting (supply voltage Vdd and threshold voltage Vth ) and FPGA architecture (look-up table size K. delay. we can conduct FPGA circuit and architecture evaluation when only the ﬁrst order process parameters are available. Chapter 4 proposes a framework to perform concurrent process and architecture optimization for FPGAs in early stage without detailed process modeling.Chapter 2 ﬁrst introduces a time efﬁcient trace-based delay and power estimation framework for FPGA circuits. With Ptrace2. and delay and leakage combine yield. such as device aging and permanent soft error rate. we optimize device setting and FPGA architecture to maximize delay yield. we further extend Ptrace in Chapter 2 and 3 to evaluate chip level power. Ptrace. 13 . and input/output capacitance for FPGA basic circuit elements. we extend Ptrace to estimate the power and delay variation of FPGAs. and Chapter 7 introduces statistical timing modeling and analysis. including Chapter 5. We may also consider reliability issues.

and for full chip timing analysis. Chapter 6 studies another key component of SSTA: the modeling of process variation. program which are not included in this dissertation. for the fast/slow corner of an inverter chain.Chapter 5 presents an efﬁcient and accurate statistical static timing analysis ﬂow for non-linear delay model with non-Gaussian variation sources. 14 . We analyze the impact of deterministic across-wafer variation and ﬁnd that the spatially correlated within-die variation is mainly caused by deterministic across wafer variation. we propose an efﬁcient and accurate die-level variation model to model the across-wafer variation. Chapter 8 concludes this dissertation with discussion of the ongoing work and future work as well. Finally. Based on this approximation. In Chapter 9. We apply our proposed variation model to the SSTA ﬂow in Chapter 5 to verify its efﬁciency and accuracy. of statistical timing analysis. quadratic delay model without crossing terms (semi-quadratic model). Then. we develop an SSTA ﬂow for three different delay models: quadratic delay model. We ﬁrst propose a second order polynomial approximation for the max operation of two random variables. and linear delay model. Chapter 7 studies the conﬁdence interval of statistical characteristics.D. We estimates the conﬁdence interval for a single variation sources. I brieﬂy summarize the works I have ﬁnished in my Ph. such as mean and variance.

Part I Statistical Modeling and Optimization for FPGA Circuits 15 .

By collecting trace information from cycle-accurate simulation of placed and routed FPGA benchmark circuits and re-using the trace for different Vdd and Vth . Compared to the baseline FPGA architecture. and Vth optimized with respect to the above architecture and Vdd .2% compared to the baseline FPGA architecture. Delay. but a great impact on power and performance in the nanometer technology.3%. considering power-gating of unused logic blocks and interconnect switches (in this case sleep transistor size is a parameter of device tuning). and Area Minimization Device optimization considering supply voltage Vdd and threshold voltage Vth has little chip area increase. architecture and device cooptimization can reduce energy-delay product by 20. called trace-based model. Vdd suggested by ITRS.5% and chip area by 23. Furthermore. we enable device and architecture co-optimization considering hundreds of device and architecture combinations.CHAPTER 2 Deterministic Device and Architecture Co-Optimization for FPGA Power. We ﬁrst develop an efﬁcient yet accurate timing and power evaluation method. This chapter studies simultaneous evaluation of device and architecture optimization for FPGAs. 16 . which uses the VPR architecture model and the same LUT and cluster sizes as those used by the Xilinx Virtex-II. our co-optimization reduces energy-delay product by 55.0% and chip area by 8.

89. Besides leakage power reduction. 51. a glitch minimization technique [84] 17 . and a high level FPGA power estimation methodology was presented [48]. The leakage power of a commercial FPGA architecture was quantiﬁed [158]. Due to the large number of transistors required for ﬁeld programmability and the low utilization rate of FPGA resources (typically 62. Besides the power optimization CAD algorithms. body biasing and multi-threshold techniques was designed [110] to reduce interconnect leakage.1 Introduction FPGAs allow the same silicon implementation to be programmed or re-programmed for a variety of applications.5% [158]). 98. 90. and a circuitry combining gate biasing.2. A conﬁguration inversion method to reduce the leakage power of multiplexers without any additional hardware [12] was investigated. power consumption becomes a crucial design constraint for FPGAs. As to power optimization. the interaction of a suit of power-aware FPGA CAD algorithms without changing the existing FPGAs was studied in [85]. 92] and it was shown that both interconnect and leakage power are signiﬁcant for FPGAs in nanometer technologies. 91] and interconnects [58. existing FPGAs consume more power compared to ASICs [83]. and Vdd programmability was applied to both FPGA logic blocks [93. low power FPGA circuits and architectures have also been studied. Region based power gating for FPGA logic blocks [59] and ﬁne-grained power-gating for FPGA interconnects [102] were proposed. Power evaluation frameworks were introduced for generic parameterized FPGA [123. A new type of routing multiplexer and an input control method were developed [150] to reduce leakage of unused routing multiplexers. Recent work has studied FPGA power modeling and optimization. As the process advances to nanometer technologies and low-energy embedded applications are explored for FPGAs. It provides low NRE (non-recurring engineering) cost and short time to market. 71]. Power-driven partition algorithm for mapping applications to FPGAs with different Vdd -levels [117] was proposed.

35µm technology. Vth . FPGA architecture evaluation considering energy was studied recently [93. We deﬁne hyper-architecture as the combination of device and architectural parameters. Besides area and delay. 92]. and have not conducted simultaneous evaluation of device optimization such as Vdd and Vth tuning and architecture optimization such as tuning LUT and cluster sizes. The total number of hyper-architecture combinations can be easily over a few hundreds considering the interaction between these dimensions. Later on. 123. and energy. and sleep transistor size if power gating is applied. and considering area. Architecture evaluation has been performed ﬁrst for area and delay. For nonclustered FPGAs. delay. such as cycle-accurate simulation [89] and transition density based estimation [123]. the clusterbased island style FPGA was studied to optimize area-delay product and it showed that LUT sizes ranging from 4 to 6 and cluster sizes between 4 and 10 can produce the best area-delay product [10]. it was shown that an LUT size of 4 achieves the smallest area [130] and an LUT size of 5 or 6 leads to the best performance [143]. it is time-consuming to explore the huge hyper-architecture solution space using methods from [89. an LUT size of 3 consumes the smallest energy [123]. and sleep transistor size (if power gating is applied). Therefore. all the aforementioned architecture evaluation assumed ﬁxed supply voltage Vdd and threshold voltage Vth . In order to perform 18 . timing and power are calculated for each circuit element. In 100nm technology. It was shown that in 0.was proposed to reduce dynamic power. However. Architecture and device co-optimization may obtain a better power and performance tradeoff compared to architecture tuning alone. an LUT size of 4 consumes the smallest energy [92]. Vdd . LUT size K. [101] further evaluated FPGA architectures with ﬁeld programmable dual-Vdd and power gating. The co-optimization requires the exploration of the following dimensions: cluster size N. In the existing power evaluation frameworks. 123].

and circuit element utilization rate for a given set of benchmark circuits (MCNC [175] benchmark set in this chapter). an accurate yet extremely efﬁcient timing and power evaluation method is required. Compared to the baseline hyperarchitecture. architecture and device co-optimization can reduce energy-delay product 19 . and Vdd suggested by ITRS [66]. we obtain the baseline FPGA hyper-architecture which uses the VPR architecture model [18] and the same LUT size and cluster size as the commercial FPGAs used by Xilinx Virtex-II [170]. delay.8% for delay. Vth and sleep transistor size using VPR and Psim. We explore different Vdd . The second contribution is that we perform the architecture and device co-optimization for conventional FPGAs and FPGAs with power gating capability. For comparison. near-critical path structure.2GHz Intel Xeon servers while all the hyper-architecture evaluation reported in this chapter with over hundreds of hyper-architecture combinations took only a few minutes on one server. Compared to performing placement-and-routing by VPR [18] followed by cycle-accurate simulation (called Psim) from [89]. short circuit power. The trace collecting has the same runtime as evaluating FPGA architecture for one combination of Vdd .device and architecture co-optimization. Vth . and area. We proﬁle placed and routed benchmark circuits and collect statistical information (called trace) on switching activity. It took one week to collect the trace for the MCNC benchmark set using eight 1. but Vth optimized by our device optimization. and sleep transistor size combinations in addition to cluster size and LUT size combinations. The ﬁrst contribution of this chapter is that we develop a trace-based estimator (called Ptrace) for FPGA power. We show that the trace is independent of device tuning and it can be used to calculate FPGA chip-level performance and power for a number of device designs.3% for energy and of 0. Such baseline is signiﬁcantly better than the ones with no device optimization. Ptrace has a high ﬁdelity and an average error of 1.

2% compared to the baseline.(product of energy per clock cycle and critical path delay. we assume subset switch block in this chapter. We also study the impact of utilization rate and interconnect structure. our architecture and device co-optimization reduces ED by 55.0% and chip area by 8. The rest of the chapter is organized as follows. A cluster-based logic block (see Figure 2. considering FPGA architecture with power-gating capability.4% and chip area by 23. Section 2.1) includes N fully connected Basic Logic Elements (BLEs). Each BLE includes one K-input lookup table (LUT) and one ﬂip-ﬂop (DFF).2 presents the background of FPGA architecture and existing FPGA architecture evaluation ﬂow. The logic blocks are surrounded by routing channels consisting of wire segments. ED) by 18.2.5 concludes this chapter. Section 2. in short.3%. Furthermore. 2. The input and output pins of a logic block can be connected to the wire segments in routing channels via a connection block (see Figure 2.1 Conventional FPGA Architecture We assume cluster-based island style FPGA architecture such as that in [18] for all classes of FPGAs studied in this chapter.1 The connections in a switch block (represented by the dashed lines in Figure 2. where the incoming track can be connected to the outgoing tracks with the same track number.2 Preliminaries 2. Section 2.1 (c) shows a subset switch block [87].1 (c)) are programmable routing switches. A routing switch block is located at the intersection of a horizontal channel and a vertical channel. Figure 2. 20 .4 applies the new estimation models to architecture and device co-optimization.3 introduces our trace-based estimation models.1 (b)). We implement 1 Without loss of generality. The combination of cluster size N and LUT size K is the architectural issue we evaluate in this chapter. Section 2.

connection block 0123 switch block logic block in1 SR SR 0 1 2 3 connection block connection switch (a) Island style routing architecte 012 3 0 1 2 3 A B C 0 1 2 3 (b) Connection block and connection switch SR A B SR D 012 3 (c) Switch block (d) Routing switches Figure 2. We deﬁne an inter- connect segment as a wire segment driven by a tri-state buffer or a buffer. 2 We interchangeably use the terms of switch and buffer/tri-state buffer. We decide the routing channel width CW in the same way as the architecture study in [18].2CWmin where CWmin is the minimum channel width required to route the given circuit successfully.1: (a) Island style routing architecture. (b) Connection block.. (c) Switch block. 2 In this chapter. (d) Routing switches. CW = 1.e. 21 . i. we investigate both uniform length-4 interconnect wire segments and mixedlength interconnects.routing switches by tri-state buffers and use two tri-state buffers for each connection so that it can be programmed independently for either direction.

2. the sleep transistor is turned off by the conﬁguration cell.2 illustrates the circuit design of the Vdd -gateable interconnect switch and logic block from [101]. Figure 2. SPICE simulation shows that power-gating can reduce the leakage power of an unused buffer or a logic block by a factor of over 300.2: (a) Vdd -gateable switch. When a buffer or logic block is not used. We insert a PMOS transistor (called a sleep transistor) between the power rail and the buffer (or logic block) to provide the power-gating capability. 22 Switch SR d1 Switch SR log2 N SRAM cells (a) Vdd−gateable switch d1 . Vdd M2 SR SR SR A M1 Logic block B BUFF Switch Connection switches in1 N switches Decoder Switch Switch SR (b) Vdd−gateable routing switches dN dN SR SR Dec_Disable wire track (c) Vdd−gateable connection block Vdd Sram Logic block (d) Vdd gateable logic block Figure 2.2.2 Vdd Gatable FPGA Architecture Power gating can be applied to interconnects and logic blocks to reduce FPGA power. (d) Vdd -gateable logic block. (b) Vdd -gateable routing switch. (c) Vdd -gateable connection block.

2. TV-Pack is used to pack the mapped circuit to a given cluster size. Because the decoder is no longer in the critical path.There is power and delay overhead associated with the sleep transistor insertion. an explicit decoder is used in the connection box to enable both the signal path and power supply for the single input. delay. we place and route the circuit using VPR [18] and obtain the chip level delay and area. The delay overhead associated with the sleep transistor insertion can be bounded when the sleep transistor is properly sized. we ﬁrst optimized the logic then map the circuit to a given LUT size. For a given benchmark set. the device and architecture co-optimization requires the exploration of the following 23 . However. Finally.3. is used in the connection box as illustrated in Figure 2. After packing. an input MUX. the connection box with power gating is different from that in the conventional architecture. as illustrated in Figure 2. the delay of connection box in Figure 2.1(b).2(c) [101]. which is effectively an implicit decoder [18].1(b).2.2(c) is smaller than that in Figure 2.2(c) also has a much smaller leakage power but a larger area and a slightly larger dynamic power compared to the conventional architecture. For the conventional FPGA. The architecture evaluation ﬂow discussed above is time consuming because we need to place and route every circuit under different architectures and a large number of randomly generated input vectors need to be simulated for each circuit. However. cycle-accurate power simulator [89] (in short Psim) is used to estimate the chip level power consumption. Moreover. and power [92] is illustrated in Figure 2. for the power gating architecture. This is due to the fact that the sleep transistors stay either ON or OFF after conﬁguration and there is no charging and discharging at their source/drain capacitors. The dynamic power overhead is almost negligible. The connection box in Figure 2.3 FPGA Architecture Evaluation Flow The existing FPGA architecture evaluation ﬂow considering area.

a fast yet accurate FPGA power and delay estimator is required. dimensions: cluster size N. we will introduce the efﬁcient power and delay estimation model. there are following two classes of information as summarized in Table 2. LUT size K. 2. The total number of hyper-architecture combinations can be easily over a few hundreds considering the interaction between these dimensions.3 Trace-Based Estimation In this section. The second class only depends on device setting and circuit design.1. We speculate that during hyper-architecture evaluation. In order to perform device and architecture co-optimization. The ﬁrst class only depends on architecture and is called trace of the architecture. the trace-based estimator. the conventional evaluation ﬂow is not practical for device and architecture co-optimization due to its time inefﬁciency. threshold voltage Vth .3: Existing FPGA architecture evaluation ﬂow for a given device setting. tracebased estimation method (Ptrace). supply voltage Vdd . and sleep transistor size if power gating is applied.3. Such estimator is just the one we are going to introduce in Section 2. Therefore.Benchmark Logic Optimization(SIS) Tech-Mapping (RASP) Timing-Driven Packing (TV-Pack) Placement & Routing (VPR) Area Delay Parasitic Extraction Cycle-accurate Power Simulator (Psim) Arch Spec Power Figure 2. The basic idea of Ptrace is 24 .

short circuit power ratio (αsc ). Delay. Delay. which depend on p technology scale. and average delay of type i circuit 25 . We then obtain chip-level performance and power for a set of device for a given architecture parameter values based on the trace information.3. Figure 2. Trace Collection Arch Spec Trace-Based Estimation Circuit Level Area. the trace information only depends on architecture and remains the same when device setting changes.4: New trace-based evaluation ﬂow. and average leakage power of type i circuit elements (Pis ). average load capacitance of type i circuit elements (Ciu ). Device parameters include Vdd and Vt .4 illustrates the ﬂow of trace-based evaluation. and Power Figure 2. and near-critical path structure (see Table 2. We perform the same ﬂow as Figure 2.3 under one device setting to collect the trace information. average switching activity of used type i circuit elements (Siu ). and Power Chip Level Area. Near-critical path structure is the number of each type of circuit elements (Ni ) on the near-critical path. total number of type i circuit elements (Nit ). 2. the trace includes number of used type i circuit elements (Niu ).1 Trace Collection As mentioned before.as follows: For a given benchmark set. we proﬁle placed and routed benchmark circuits and collect trace information under one device setting and for one FPGA architecture.1). For a given benchmark set and a given FPGA architecture.

such as circuit level delay and power.1: Trace information.2. leakage power for type i circuit elements avg. power.3. Dynamic power is only consumed by the used FPGA resources. we can perform Ptrace to estimate the chip level delay. buffers. Trace Parameters (depend on architecture) Niu Nit Siu Nip αsc # of used type i circuit elements total # of type i circuit elements avg. For type i circuit elements. Cisw is the average switch capacitance 26 .elements (Di ).2 Power and Delay Model 2.e. After collecting the trace and device level delay and power..1 Dynamic Power Model Dynamic power includes switch power and short-circuit power.3. and area. switching activity for used type i circuit elements # of type i circuit elements on the near-critical path ratio between short circuit power and switch power Device Parameters (depend on processing technology and circuit design) Vdd Vth Pis Ciu Di power supply voltage threshold voltage avg. Below. which depend on circuit design. LUTs. load capacitance of type i circuit elements avg. we will show that the trace information is insensitive to the device parameters and discuss our trace-based models. The device parameters. i. 2. can be collected by SPICE simulation or by measurement. input pins and output pins. A circuit implemented on an FPGA cannot utilize all circuit elements. delay of type i circuit elements Table 2. Our trace-based switch power model distinguishes different types of used FPGA resources and applies the following formula: 1 2 Psw = ∑ Niu · f ·Vdd ·Cisw 2 i (2. device and circuit parameters.1) The summation is over different types of circuit elements.

f is the operating frequency.26 0. In our trace-based model. the reciprocal of the near-critical path delay.e.96 0.01 1. we model the short circuit power as: Psc = Psw · αsc 27 (2.0 Vth =0.54 0. Ciu is the average load capacitance of used circuit elements.25 logic alu4 apex2 apex4 bigkey clma 2.59 0. Architecture setting: N = 10.55 0.71 0.1 Vth =0.and Niu is the number of used circuit elements. the local load capacitance for used circuit element j.87 100nm Vd d=1. The switch capacitance is further calculated as: Cisw = ( = j∈Eli Ciu · Siu ∑ Ci. Unit: switch per clock cycle.75 0. The short circuit power is related to signal transition time.23 Table 2.29 0.21 2.06 1. Eli is the set of used type i circuit elements.59 0. and Siu is the average switching activity of used type i circuit elements. which is difﬁcult to obtain without detailed simulation using a real delay model.23 1.20 logic interconnect 0. We verify this assumption in Table 2.3) .56 0.75 1.2) For type i circuit elements.19 1.73 1.70 1. In this chapter.32 logic interconnect 0.16 1. which is averaged over Ci.03 1. we assume that the circuit works at its maximum frequency.47 0.47 0.55 0.21 2.2 by demonstrating the average switching activity of ﬁve benchmarks at different technology nodes.3 Vth =0.27 0. We assume that the average switching activity of the circuit elements is determined by the circuit logic functionality and FPGA architecture. Vd d and Vth . i. benchmark 70nm Vd d=1.2: switching activities for different technology nodes..90 interconnect 0.47 0. The device parameters of Vd d and Vth have a limited effect on switching activity.91 70nm Vd d=1. and Vd d and Vth levels. K = 4. j . j /Niu ) · Siu (2.

1 Vth =0. We verify this assumption in Table 2.44 1.71 1.72 1.48 1.63 Table 2.72 1.4) For resource type i.93 0.76 1.15 0.21 100nm Vd d=1. For an FPGA architecture with powergating capability. 3 Because the delay between buffers is usually large for FPGA circuits. Although this ratio value at the circuit level depends on FPGA circuit design and architecture. Vd d and Vth .74 1. 2.18 0. Nit is the total number of circuit elements.32 logic interconnect 1.16 0.3.3 Vth =0.86 1.20 logic interconnect 1.64 1.79 1.5% [158]). and Vth levels.11 interconnect 1. benchmark 70nm Vd d=1.43 1.68 1.82 1.12 0. an unused circuit element can be power-gated to reduce leakage could be big.25 logic alu4 apex2 apex4 bigkey clma 1.08 0.46 1.62 1. K = 4. we assume that αsc does not depend on device and technology at the chip level. Pstatic = ∑ Nit Pis i (2. Notice that usually Nit > Niu because the resource utilization rate is low in FPGAs (typically 62.3: Short circuit power ratios for different technology nodes. Architecture setting: N = 10.15 0.92 0.44 1.2.16 70nm Vd d=1.33 by showing the average short circuit power ratio at different technology nodes. Vd d. and Pis is the leakage power for a type i element.0 Vth =0.89 0.Where αsc is the ratio between short circuit power and switch power.2 Leakage Power Model The leakage power is modeled as follows. the short circuit power ratio 28 .42 1.

the FPGA delay can be calculated as follows: D = ∑ Ni Di i p (2. LUT.power. Therefore.. the area increase introduced by power gating will have small impact on the delay and power estimation. i.6) For resource type i. When power gating is applied. Moreover the load capacitance of a routing buffer is mainly determined by the input and output capacitances of buffers. we can calculate delay values for the ten longest paths under new Vd d and Vth levels.3. we assume the wire segment length change introduce by power gating can be ig√ nored. process technology. and FPGA architecture. When Vd d and Vth change. and Di is the delay of such a circuit element. on one circuit path.e.2. a single gating transistor is used for the entire logic block.5) where αgating is the average leakage ratio between a power-gated circuit element and a circuit element in normal operation. we obtain the structure of the ten longest circuit paths including the near-critical path for each circuit. Therefore. The area overhead introduced by power gating is less than 30%. We assume that the new near-critical path due to different Vd d and Vth levels is among these ten longest paths found by our benchmark proﬁling.003 is used in this chapter. where areatile is the area of a logic block with the interconnect surrounding it. The path structure is the number of elements of different resource types. p Ni is the number of circuit elements that the near-critical path goes through.5 To get the path In this chapter. Di is a circuit parameter depending on Vd d. the total leakage power is modeled by the following formula: Pstatic = ∑ Niu Pi + αgating · ∑ (Nit − Niu )Pi i i (2. the change of wire segment length is less than 14%. Therefore. and choose the largest one as the new near-critical path delay. The wire capacitance is a small portion (about 20%) of the load capacitance. 5 Delay of each circuit element is measured with the worst case switch. 2. wire segment and interconnect switch.4 In this case. wire length is calculated as 4 × areatile [18]. SPICE simulation shows that sleep transistors can reduce leakage power by a factor of 300 and αgating = 0.3 Delay Model To avoid the static timing analysis for the entire circuit implemented on a given FPGA fabric. Vth . In order to measure the delay of 4 29 . In our experiment.

3.3.0V and Vth =0. We assume Vd d=1. each circuit element in a logic block with gating transistor. K = 4} the architecture N=8 and K=4.4. The average energy error of Ptrace is 1.statistical information Ni . Vd d=1. the number of LUTs and ﬂip ﬂops are counted under LUT and cluster sizes.2V. as illustrated in Table 2.8%. a series of randomly generated vectors are used as the inputs to the block and the worst case delay is measured for each circuit element under such random input vectors. K = 7}. From the ﬁgure. the run time of Ptrace is 2s. Notice that we can map the benchmarks to different and {N = 6. We collect trace using the conventional architecture evaluation ﬂow in 70nm technology. 30 . we compare it to the conventional architecture evaluation ﬂow for ITRS [66] 70nm technologies.1 Veriﬁcation of Deterministic Model To validate Ptrace.2V and map 20 MCNC benchmarks. Figure 2. we only need to place and route the circuit once for a given p FPGA architecture. to two architectures: {N = 8.3 Validation of Ptrace 2. Moreover.0V and Vth =0.3% and average delay error is 0.4. In Table 2.3. while that of the conventional evaluation ﬂow is 120 hours. the Ptrace has the same trends for energy as Psim does and for delay as VPR does.5 compares energy and delay between the conventional evaluation ﬂow and Ptrace for each benchmark. Therefore. Ptrace has a high ﬁdelity. 2.

31 .4: MCNC benchmark list. The number of LUTs and ﬂip ﬂops are counted under the architecture N=8 and K=4.benchmark alu4 apex2 apex4 bigkey clma des diffeq dsip elliptic ex1010 # LUTs 1607 2118 1291 2935 13537 2172 1933 1619 4185 4721 # Flip ﬂops 0 0 0 224 33 0 377 224 1138 0 benchmark ex5p frisc misex3 pdc s298 s38417 s38584 seq spla tseng # LUTs 1201 5926 1501 5608 2548 8458 7040 1953 3938 1299 # Flip ﬂops 0 900 0 0 8 1463 1260 0 0 382 Table 2.5: Comparison between Psim and Ptrace. 45 36 27 Energy from Ptrace (nJ) Delay from Ptrace(ns) 10 8x4 6x7 8 6 8x4 6x7 4% Error 18 9 0 5% Error 4 2 0 0 9 18 27 36 Delay from VPR(ns) 45 0 2 4 6 8 Energy from Psim (nJ) 10 Figure 2.

9v). number of used logic blocks over the number of total available logic blocks) is 0. LUT size of 4). Homo-Vth +G and Hetero-Vth +G. and Vth of 0. Homo-Vth is the conventional FPGA using homogeneous Vth for interconnects and logic blocks.2. Vdd suggested by ITRS [66] (0. Hetero-Vth applies different Vth to logic blocks and interconnects. i. The baseline hyper-architecture and evaluation ranges for device and architecture are presented in Table 2.4. We consider 70nm ITRS technology and evaluate four FPGA hyper-architecture classes: Homo-Vth .5. • The routing channel width is 1.. Note that a high Vth is applied to all SRAM cells for conﬁguration to reduce their leakage power as suggested by [93]. Homo-Vth +G and Hetero-Vth +G are the same as Homo-Vth and Hetero-Vth. except that unused logic blocks and interconnects are power-gated [101]. In the subsections 2. we assume the following: • The utilization rate (deﬁned as the utilization rate of logic blocks.4.1 Overview In this section. We compare them with the baseline hyperarchitecture. • All interconnect wire segments span 4 logic blocks with fully buffered routing switches.4 Hyper-Architecture Evaluation 2. • All benchmark circuits work at their highest frequency (1/critical path delay).5. Hetero-Vth . we use Ptrace to perform device and architecture evaluation.4. which uses the VPR architecture model [18] and has the same LUT size and cluster size as those used by the Xilinx Virtex-II [170] (cluster size of 8. 32 .e.4. respectively.2 times of the minimum channel width that allows the FPGA circuit being routed.2 to 2.3v that is optimized with respect to the above architecture and Vdd .

4.4v N 6-12 K 3-7 Table 2.9v Vth 0.6 analyze the impact of utilization rate and compare the evaluation result of different routing architectures.3v N 8 K 4 Value range for device/arch optimization Vdd 0.5 and 2. To illustrate the tradeoff between energy and delay. Moreover.5 and 2. For each hyper-architecture. we use CVt for Vth of logic blocks and IVth for Vth for global interconnects.4.4. then compare the results of optimizing device 33 .1v Vth 0.2v-0.4. then we say that B is inferior to A. We ﬁrst discuss the need of device tuning.6. Section 2. we introduce the concept of dominant hyper-architecture: If hyper-architecture A has less energy consumption and a smaller delay than hyper-architecture B. We deﬁne the dominant hyper-architectures as the set of hyper-architectures that are not inferior to any other hyper-architectures. respectively.2 illustrates the necessity of device and architecture co-optimization.8v-1. 2. We organize the rest of this section as follows: First.5: Baseline hyper-architecture and evaluation ranges. respectively.4. Section 2.4 discusses the device and architecture co-optimization considering area. Then Section 2. in the rest of this chapter.4. Sections 2. delay and area as the geometric mean of 20 MCNC benchmarks in Table 2.2 Necessity of Device and Architecture Co-Optimization In this section.3 presents the energy and delay tradeoff and ED reduction achieved by device and architecture cooptimization. Note that CVt = IVth in Homo-Vth and Homo-Vth +G.4. we show the necessity of device and architecture co-optimization.The impact of utilization rate and interconnect architecture will be discussed in subsections 2. Baseline FPGA device/arch parameter values Vdd 0. we compute the energy.4.4. Finally.

05 17.32 1. Our experiments show that device tuning has a much greater impact on delay and energy than architecture tuning does.6 and Table 2.58 3.99 31. set D4 is the dominant hyper-architectures under Vdd =1.37 1.92 Table 2.11nJ.9V and Vth = 0.05 3.99ns to 23.10 1.25v. 10. From the ﬁgure.35 0.46 23.9 0.40 14.50 24.31 10.0 1.73 Min delay (ns) 13.9 0.95 Max energy (nJ) 2.6.46ns.0 1.25V.9 1.e.82nJ to 2. it is important to evaluate both device and architecture instead of evaluating architecture only. the energy ranges from 1.30 0.01 9. and delay ranges from 13. energy for different architectures ranges from 1.0V and Vth =0.24ns.1 1. In the ﬁrst method.3V .1 Vth (V) 0.05v.1 1.17nJ to 1. if we increase Vth by 0.60 21.24 36.01 11. device tuning has not been reported in the literature.97 2.11 5.35 0.66 12.68 18.17 0. which is demonstrated in Figure 2. 80.30 0.6: Power and delay ranges for different device settings.17 1.60 13.03 4.30 17. Architecture evaluation has been studied in previous research [131. Set D1 D2 D3 D4 D5 D6 D7 D8 D9 Vdd (V) 0.6 For example. we observe that a change on the device leads to a more signiﬁcant change in energy and delay than architecture change does.90 15.32nJ and the delay ranges from 18.43 15. 123. For example.0 1.9V and Vth =0.40 19..01 Max delay (ns) 16.25 0. 6 Dominant 34 .30 0. However. i.95 1.25 0.06 17.11 1. However.82 1. for device setting Vdd = 0.30 1.68ns to 16. There are three methods to perform device and architecture optimization. 89. 101].6 is the dominant hyper-architectures for a given device setting.25 0.and architecture separately and simultaneously. Therefore. Vdd = 0. Each set of data points in Figure 2.35 Min energy (nJ) 1. we ﬁrst optimize device using one architecture then optimize the arhyper-architectures for a given device setting are the hyper-architectures that are not inferior to any other hyper-architectures under such device setting.

Both methods arch-device and device-arch cannot guarantee the optimal solution.18 16 Energy per cycle (nJ) 14 12 10 8 6 4 2 0 10 D4 D1 D5 15 20 D9 D2 25 Delay (ns) 30 35 40 D6 D8 D7 D1 Vdd 0. For both arch-device and device-arch.9V .6: hyper-architectures under different device settings.3V }.0V . and Vth = 0.25 D2 Vdd 0.3V }. Vth = 0. we optimize architecture and device simultaneously and call it simultaneous method. chitecture under the optimized device setting and call it device-arch method.9 Vt 0.25 D8 Vdd 1. In the second method.9V .35 D7 Vdd 1.0 Vt 0.9 Vt 0. In the third method.3V } which is also the global optimal.1 Vt 0. K = 7. K = 4. Vdd = 0.25 D5 Vdd 1.35 D3 Figure 2. In our experiment.30 D3 Vdd 0.30 D6 Vdd 1. both arch-device and device-arch achieve the same local optimal hyper-architecture whole solution space: {N = 6.0 Vt 0.1 Vt 0.7 compares the min-ED hyper-architectures for Homo-Vth found by three different methods. K = 7. Table 2. Vdd = 0. If we start search from some other hyper- 35 . we ﬁnd that there are two local optimal hyper-architectures in the we start search from the baseline case {N = 8.0 Vt 0.9V .3V } and {N = 10.30 D9 Vdd 1. we ﬁrst optimize the architecture within one device setting then optimize the device setting according to the optimized architecture and call it arch-device method. Vdd = 0.1 Vt 0. K = 4. In this particular example.35 D4 Vdd 1.9 Vt 0. Vdd = 1. Vth = 0. Vth = 0. {N = 6.

3 27. we ﬁrst compare the impact of device tuning and architecture tuning.9 0.1s).30 0. As expected.38 1.1 (-13.3 Energy and Delay Tradeoff In this section.1 3. The minimum energy hyper-architectures have LUT size of 4 for all classes. We also see that the runtime of arch-device and device-arch is shorter than that of simultaneous method.4. From Table 2. This is similar to the previous evaluation result [101. For the classes Homo-Vth +G and Hetero-Vth +G with powergating.4 34.30 IVth (V) 0. it is still worthwhile performing simultaneous device and architecture optimization. and 1X PMOS for a connection buffer.3%) runtime (s) 5.8 17.38 1.7) (6.4) Energy (nJ) 1. we observe that simultaneous method can reduce ED by 13.architectures.0 CVt (V) 0. we assume the following ﬁxed sleep transistor size: 210X PMOS for a logic block.30 0.30 0.8 summarizes the minimum delay and minimum energy hyper-architectures for each class. then present min-energy and min-delay hyper-architectures. arch-device and device-arch may achieve other local optimum but there is no guarantee of the global optimum.8 19.3 24. the min-delay hyper-architectures have the highest Vdd and 36 . Table 2. 2.30 0. Vdd (V) arch-device device-arch simultaneous 0.30 (N. 89].1 Table 2. due to the time efﬁciency of Ptrace.7) (10. and ﬁnally discuss the energy and delay tradeoff. The minimum delay hyper-architectures have cluster size of 6 and LUT size of 7 which are the same for all classes. In order to obtain the global optimal solution.4.37 Delay (ns) 19.9 1.3% compared to arch-device and device-arch. However. we have to perform simultaneous method. We then discuss the sleep transistor tuning in Section 2.4. Therefore. k) (6. the runtime of simultaneous method is also small (only 34.7: Min-ED hyper-architecture of optimizing device and architecture separately and simultaneously. 10X PMOS for a switch buffer.7.5 ED (nJ· ns) 27.

The energy is calculated as E = delay · power.8 0.8 Minimum energy hyper-architecture CVt (V) 0. we only need to pick the dominant hyper-architecture whose delay is closest to 15ns. From the ﬁgure. In order to achieve the best energy and delay tradeoff. we present the dominant hyper-architectures of Homo-Vth and Hetero-Vth in Figure 2.1 1. For example.86 9. with the dominant hyper-architecture ﬁgure.lowest Vth .7) (N.3 sible frequency (1/critical path delay). Table 2.7) (6. higher performance hyper-architectures consume more energy.35 0.6 30.22 15.5 24.1 1. This is because leakage power is signiﬁcantly reduced by power-gating and therefore more detailed Vth tuning such as heterogeneous-Vth has a smaller impact.22 31. This is because we assume that each circuit works at its highest posWhen Vth is too high.30 0.K) E (nJ) 31.4) (N.4) (10.7) (6.98 D (ns) 8.20 0. However.35 0.7 (a) and those of Homo-Vth +G and HeteroVth +G in Figure 2.8 0.8 0.20 IVth (V) 0.20 0.30 0.942 0.2.20 0.4) (12.20 0.20 0. This is due to the fact that the connection box with power gating has smaller delay than that without power gating as discussed in Section 2. HyperArch Class Homo-Vth Hetero-Vth Homo-Vth +G Hetero-Vth +G Vdd (V) 1.920 0.20 0.549 D (ns) 59. if we want to ﬁnd the minimum energy solution for Homo-Vth with delay limit 15ns.7) (6.25 (10. we ﬁnd that the energy difference between Homo-Vth +G and Hetero-Vth+G is smaller than that between Homo-Vth and HeteroVth . the delay is so large that the energy per clock cycle increases.20 (6.550 0.45 9.35 0.86 8.1 1. we can obtain the minimum energy solution for a given performance range.2 43.30 0.1 Minimum delay hyper-architecture CVt (V) 0. we ﬁnd the hyper-architectures with the minimum energy delay product (in short min-ED) in Table 2. the min-energy hyper-architectures have the lowest Vdd but not the highest Vth .2.30 IVth (V) 0.7 (b).98 15.8: Minimum delay and minimum energy hyper-architectures. Compared to 37 . We also see that dominant hyper-architectures of Homo-Vth +G and Hetero-Vth +G have smaller delay than that of Homo-Vth and Hetero-Vth .9.4) (12.45 Vdd (V) 0.K) E (nJ) 0. To illustrate the energy and delay tradeoff. Moreover. Usually.

1 16. the larger the sleep transistor size.20 1.25 IVth (V) 0.8 CVt (V) 0. we assume ﬁxed sleep transistor sizes for Homo-Vth and Hetero-Vth +G and discuss hyper-architecture evaluation to minimize ED without considering area.43 126. Therefore.74 0.9 0.30 0.5 18. This is because leakage power is greatly reduced when power gating is applied. compared to the min-ED hyper-architectures without power gating.4) (10.the baseline.00 81.9 0.Class Baseline Homo-Vth Hetero-Vth Homo-Vth +G Hetero-Vth +G Vdd (V) 0. we perform device and architecture co-optimization to achieve the best ED and area tradeoff.0 (-18.90 79.8%) 11.30 0. However. Homo-Vth +G.4) (12.30 ED (nJ· ns) 28. k) (8. We also see that.9: Comparison between baseline and min-ED hyper-architecture in Homo-Vth .5 17.3%) Area % 100. and Hetero-Vth +G.19 Table 2.52 127. in this section.5% and 18.30 0.25 0. ED reduction is 14.20 (N. we 38 .10 17. area is important for FPGA design. therefore a lower Vth can improve performance without much penalty on leakage. 2. Usually.4.0 0.4%) 11.5%) 23.2 (-60. Power-gating using sleep transistors may change delay and area tradeoff for FPGA architecture.30 0.65 D (ns) 23. Although dual-Vth may change the layout area due to extra diffusion well area.30 0.25 0.4) (12.4) E(nJ) 1.27 0. hyper-architecture.4 ED and Area Tradeoff In the previous sections. The similar ED reduction for power gating classes is due to the fact that leakage power is greatly reduced by power-gating and therefore the more detailed Vth tuning such as heterogeneous-Vth has a small impact on power reduction as discussed before.4) (10. such change depends on technology and is often very small.37 1.9 (-57. the min-ED hyper-architectures with power gating has a lower Vth . the smaller the delay is. If power gating is applied.4% for Homo-Vth and Hetero-Vth . ED can be reduced by about 60% for both HomoVth +G and Hetero-Vth +G.9 1. In this section. respectively.25 0.1 (-14.2 24. Hetero-Vth .

K = 4 is the second area efﬁcient architecture which reduces area by 18. we assume the 210X PMOS for the sleep transistor with negligible area overhead.3% compared to the baseline.3%.470 0.4) (12. This is because that when no power gating is applied.3%.450 Area 1 0.3 37. the min-ED hyper-architecture is exactly the min-AED hyper-architecture. ED. we ﬁnd out the hyper-architectures with minimum product of energy. we see that.9 0. the min-AED hyperarchitecture of Homo-Vth reduces ED by 14. For Homo-Vth +G and Hetero-Vth+G with power gating. Figure 2.25 0.626 0. because no power gating is applied.816 0.7% and area by 18. we assume that the area depends only on architecture and but not device setting.9 1. Our experimental result shows that N = 12.10.432 0.4) (10. Vdd (V) Baseline Homo-Vth Hetero-Vth Homo-Vth +G Hetero-Vth +G 0. and area (in short AED).9 58. Area.9 CVt (V) 0.8 presents the chip-level ED and area tradeoff. Moreover. To achieve the best ED-area tradeoff.25 IVth (V) 0. K = 4).4) (12.10: Minimum ED-area product hyper-architectures for different classes. as illustrated in Figure 2.25 0. Because only one sleep transistor is used for one logic block.817 0.30 0. Therefore. K = 4 is most area efﬁcient architecture.6% compared to the baseline (N = 8.assume that Vdd and Vth change does not affect area.413 AED reduction % 30.853 0.4) S 2 2 ED 1 0. delay.697 0.K) (8.0 0. We prune inferior solutions with both ED and area larger than any alternative solutions.4 56. the more leakage power is consumed.25 0. For Homo-Vth and Hetero-Vth. for the classes without power gating.30 0. we observe that a 1X PMOS as the sleep our evluation. Compared to the baseline. 7 In 39 .30 0.918 AED 0. this architecture reduces area by 23.767 0.30 0. The architecture N = 10.20 (N. the min-ED hyper-architectures use less area than other hyper-architectures.7 and the min-AED hyper-architecture of Hetero-Vth reduces ED by 18.9 0.918 0. From the ﬁgure.4) (12. the sleep transistor size has to be considered. which are summarized in Table 2. the larger the area. we do not need to tune sleep transistor size.7 Table 2.2(d). and ED-area product are normalized with respect to the baseline.4% and area by 23.30 0.

and 10X PMOS for a routing switch.2(c)) provides good performance.5.10 summarizes the minimum AED prod- uct hyper-architectures. We compare the min-ED and min-AED hyper-architectures under three different utilization rates: 0.11 and 2.8.2%.12. may affect delay greatly. for the minimum AED hyper-architecture of the power gating classes consumes less area compared to the baseline.5 Impact of Utilization Rate In the previous part of this section. we conclude that utilization rate in practice does not affect hyperarchitecture evaluation.5). We guess that the reason why the utilization rate does not affect hyper-architecture evaluation is as follows: In our models the performance and discussed before.3. however.0% and area by 8. Therefore.4.0% and area by 8. we assume ﬁxed utilization rate (0. 8 As 40 . Compared to the baseline case. we will further discuss the impact of utilization rate on FPGA architecture evaluation. We see that the min-ED and min-AED hyper-architectures under different utilization rates are the same. We consider four sleep transistor sizes: 2X. 7X. and any further increase of the sleep transistor size cannot improve the performance much. In this subsection. 0. This is due to the fact that tuning device setting and architecture offers a bigger solution space to explore in chip level. we ﬁnd that device and architecture co-optimization can reduce ED and area simultaneously even when power gating is applied. the minimum AED product hyper-architecture of Homo-Vth +G reduces ED by 53.2% and the minimum AED product hyper-architecture in Hetero-Vth+G reduces ED by 55. 2.8 in Tables 2. The area increase introduced by power gating leads to only about 15% area increase. From the Figure 2.transistor for one switch in connection box (see the circuit in Figure 2. 4X. Therefore. 8 Table 2. The sleep transistors for the switches in the routing box. the most area efﬁcient architectures (N = 12 and K = 4) reduces area by about 20% compared to the baseline. and 0.

20 (10. in classes Homo-Vth +G and Hetero-Vth +G.11.4) 18.4) 24. for given benchmark circuits.4) 2 0.432 0.4) . length-4 interconnect wires). leakage power is the dominant power component and changes inversely proportional to the utilization rate.25 0.9 0.30 (10.4) 11. k) ED 1.9 0.25 0.0 0.9 0.4) 2 0.30 (10.4) .20 (10. 41 .4) 12.9 0.30 0.25 0.4) 23.25 (12. k) ED 1.4) .25 0. k) S AED 1.4) 11.6 Impact of Interconnect Structure In the previous discussion.5 0.8 0.1 0.3 Vdd CVth IVth (N.12: Min-AED hyper-architecture under different utilization rates.4) 11. Note: AED is normalized with respect to the baseline.30 (10.8 Vdd CVth IVth (N.8 0.9 0.30 0.30 (12.0.617 0.4) 2 0.5 Vdd CVth IVth (N.0.0.11: Min-ED hyper-architecture under different utilization rates.698 0.4) 19.25 0.4.25 0.3 Vdd CVth IVth (N.9 0.4) 2 0.324 Table 2.dynamic power of the circuit under different utilization rate is the same. k) S AED 1. In this section.25 (12.8 0.649 0.25 (12.0 0.30 0.25 0. the leakage power is determined by the active elements and the leakage power of the unused circuit elements is negligible (as we can see from Table 2. Therefore the relationship between the leakage power of different hyper-architectures does not change with respect to the change of the utilization rate.30 (10.697 0.0 Homo-Vth +G 0.4) 31.25 (12.8 0.9 0. Utilization rate Homo-Vth Hetero-Vth 0. When no power gating is applied. 2.25 0.20 (10.0 0.25 0.4 0.4) .30 (12.30 (12.9 0.8 Vdd CVth IVth (N. k) ED 1.30 (12.20 (10. Utilization rate Homo-Vth Hetero-Vth 0. ED under different utilization rate is very close).30 (10.0.4) 2 0.1 0. 9 and the only difference is leakage power.0 0.3 Hetero-Vth +G 0.30 (10. For the classes with power gating.626 0.8 0.0.0 0.25 0.9 0.4) 32.9 0.0.9 0.2 0.533 0.30 0.6 Table 2.4) 2 0.4) .20 (10. we will compare two routing 9 We assume the placement and route is the same for different utilization rate.330 Hetero-Vth +G 0.8 0.9 0.510 Homo-Vth +G 0.413 0.25 0.25 (12.30 0.0 0.25 0.30 0.25 0.0 0.25 (12.4) 11.637 0.25 0.0 0. k) S AED 1.4) 11.20 (10.8 0.5 Vdd CVth IVth (N.25 0.30 (12.25 0.4) .9 0.25 0.25 0.e.30 (12. we always assume all the wire segments span four logic blocks (i.

25 16.structures to show the impact of routing structure on delay and power.7) (6.20 0.13 9. 2.20 0.1 1.45 9.86 8. which reduces delay but increases power.20 0.20 0.14.1 1.1 CVth (V) 0. respectively. The minenergy and min-ED hyper-architectures for the two routing structure have the same the device setting and LUT size.20 0.15 compare the min delay.20 0.1 1.20 0.7) (6.20 (6.K) Energy (nJ) 31.20 0.56 35.7) (6. The mixed-interconnect is optimal for minimizing delay [47]. 42 . and 2.13.22 15.20 0.13 mixed-interconnect Table 2.1 1.56 16.20 0. min energy and min ED hyper-architectures between the two routing structures. We also see that uniforminterconnect has lower energy but higher delay than mixed-interconnect.20 IVth (V) 0. This is due to the fact that mixed-interconnect applies length-8 wire-segments and therefore a buffer with a larger size is used.20 0.7) (6.42 8.86 9.98 15.7) (6.13: Min-delay hyper-architecture under different routing structures. We compare a structure with uniform length-4 wire segments (in short uniform-interconnect) and a routing structure with 60% length-4 wire segments and 40% length-8 wire segments (in short mixed-interconnect).42 9.7) (6. Tables 2. We see that the min-delay hyper-architectures for both interconnect structures are same.7) (6.22 31.1 1. but mixed-interconnect tends to use smaller cluster size as the interconnect delay is reduced in mixed-interconnect.20 0.7) (N.45 8. uniform-interconnect hyper-architecture Class Homo-Vth Hetero-Vth Homo-Vth +G Hetero-Vth +G Homo-Vth Hetero-Vth Homo-Vth +G Hetero-Vth +G Vdd (V) 1.1 1.25 Delay (ns) 8.98 35.1 1.20 0.20 0.

3 55.8 CVth (V) 0.30 0.K) Energy (nJ) 0.20 (10.9 11.25 0.39 0.25 0.14: Min-energy hyper-architecture under different routing structures.8 0.40 16.35 0.74 Delay (ns) 17.25 0.1 23.550 0.052 0.6 55.35 0.30 0.30 0.052 1.4) (10.30 0.20 0.27 0.25 0.6 13.30 17.35 0.920 0.549 1.2 43.4) (12.10 16.30 0.942 0.35 0.6 29.610 0.3 22.4) (10.25 0.50 18.35 0.4) (12.7 25.25 0.K) Energy (nJ) 1.30 0.65 1.51 1.9 0.30 0.4) (N.15: Min-ED hyper-architecture under different routing structures.4) (8.8 CVth (V) 0.8 0.25 IVth (V) 0.74 0.30 0.10 17.35 0.4) (N.2 25.8 0.8 1.82 0.4) (8.4) (12.00 18.4) (12.0 11.25 0.30 0.9 0.0 0. 43 .25 (10.4) (12.uniform-interconnect hyper-architecture Class Homo-Vth Hetero-Vth Homo-Vth +G Hetero-Vth +G Homo-Vth Hetero-Vth Homo-Vth +G Hetero-Vth +G Vdd (V) 0.3 mixed-interconnect Table 2.25 0.30 IVth (V) 0.8 0.40 16.9 0.602 Delay (ns) 59.30 0.5 24.9 0.8 0.4) (8.6 30.0 0.30 0.8 0.35 0.4) (8.67 ED (nJ·ns) 24.4) (12. uniform interconnect hyper-architecture Class Homo-Vth Hetero-Vth Homo-Vth +G Hetero-Vth +G Homo-Vth Hetero-Vth Homo-Vth +G Hetero-Vth +G Vdd (V) 1.8 0.4 12.37 1.4) (8.8 mixed-interconnect Table 2.4) (8.30 0.

**32 Energy per cycle (nJ) 24 16 8 0 8
**

18 Energy per cycle (nJ) 15 12 9 6 3 0 8

Min delay hyper-arch for Homo-Vt Min-ED hyper-arch for Homo-Vt Min-ED Min energy hyper-arch hyper-arch for Hetero-Vt for Hetero-Vt

Homo-Vt Hetero-Vt

Min energy hyper-arch for Homo-Vt

21

34 (a) Delay (ns)

47

60

Min delay hyperarch for HomoVt+G and HeteroMin-ED hyper-arch for Homo-Vt+G Min-ED hyper-arch for Hetero-

Homo-Vt+G Hetero-Vt+G

Min energy hyper-arch for 5 Homo-Vt Min energy hyper-arch for Hetero-Vt+G

6 7

13

18 23 Delay (ns) (b)

28

33

Figure 2.7: Dominant hyper-architectures. Homo-Vth +G and Hetero-Vth +G.

(a) Homo-Vth and Hetero-Vth ; (b)

44

1.3 Normalized area

Homo-Vt Hetero-Vt Homo-Vt+G Hetero-Vt+G

Min AED hyper-arch for Homo-Vt

1.1

Min AED hyper-arch for Homo-Vt+G

0.9

Min AED hyper-arch for Hetero-Vt+G Min AED hyper-arch for Hetero-Vt

0.7 0.3 0.4 0.5 0.6 0.7 0.8 Normalized ED 0.9 1.0

Figure 2.8: ED and area trade-off. ED and area are normalized with respect to the baseline (N = 8, K = 4, Vdd = 0.9V , and Vth = 0.3V ).

45

2.5 Conclusions

In this chapter, we have developed trace-based power and performance evaluation (Ptrace) for FPGA. The one-time use of placement, routing and cycle-accurate power simulation is applied to collect the timing and power trace for a given benchmark set and a given FPGA architecture. The trace can then be re-used to calculate timing and power via closed-form formulae for different device parameters and technology scaling. Ptrace is much faster, yet accurate compared to the conventional evaluation based on placement and routing by VPR [18] followed by cycle-accurate simulation (Psim) [89]. Using the trace-based estimation, we have performed device (Vdd , Vth and sleep transistor size if power gating is applied) and architecture (cluster and LUT size) cooptimizations for low power FPGAs. We assume the ITRS [66] 70nm technology and use the following baseline for comparison: Cluster size of 8 and LUT size of 4 as in the Xilinx Virtex-II [170], Vdd of 0.9v suggested by ITRS, Vth of 0.3v which is optimized for min-ED (i.e., minimum energy delay product) with respect to the above architecture and Vdd . Compared to the baseline case, simultaneous optimization of FPGA architecture and device reduces the min-ED by 14.7% and area by 18.3% for FPGA using homogeneous-Vth for the logic blocks and interconnects without power gating. Optimizing Vth separately (i.e., heterogeneous-Vth ) for the logic block and interconnect reduces min-ED by 18.4% and area by 23.3%. Furthermore, power gating unused logic and interconnect reduces the min-ED by up to 55.0% and reduces area by 8.2%. Compared to the classes without power gating, the min-ED hyper-architectures of the classes with power gating have lower Vth . This is due to the fact that, when power gating is applied, leakage power is signiﬁcantly reduced and therefore a lower Vth can be applied to reduce delay. We observe that min-ED hyper-architectures and min-AED hyper-architectures for

46

different utilization rate (between 30% and 80%) are the same. Therefore, the utilization rate in practice does not affect the device and architecture co-optimization result. Moreover, we also test two different routing structures, one with uniform length 4 wire segments (uniform-interconnect) and the other with 60% length 4 wire segments and 40% length 8 wire segments (mixed-interconnect). We observe that the min-ED, minenergy and min-delay hyper-architectures under two different interconnect structures are similar, except that mix-interconnect tends to use slightly smaller cluster size due to the reduced interconnect delay.

47

We then derive analytical yield models considering both leakage and timing variations. We also observe that LUT size of 4 gives the highest leakage yield.4%. we extend Ptrace to handle process variation. and use such models to evaluate FPGA device and architecture considering process variations. and gate oxide thickness. LUT size of 7 gives the highest timing yield. timing yield by 5. but LUT size of 5 achieves the maximum leakage and timing combined yield.5X and 1. Experiments show that the mean and standard deviation computed by our models are within 3% from those computed by Monte Carlo simulation. which uses the VPR architecture and device setting from ITRS roadmap. device and architecture tuning improves leakage yield by 10.7%. 48 . Compared to the baseline.4%. threshold voltage.CHAPTER 3 FPGA Device and Architecture Co-Optimization Considering Process Variation Process variations in nanometer technologies are becoming an important issue for cutting-edge FPGAs with a multi-million gate capacity. We ﬁrst develop closed-form models of chip level FPGA leakage and timing variations considering both die-to-die and within-die variations in effective channel length. respectively.9X. In this chapter. We also observe that the leakage and delay variations can be up to 5. and leakage and timing combined yield by 9.

53] and [127] purposed a methodology to improve timing yield. However. Variability in effective channel length. Timing yield estimation was discussed in [57. we have introduced a time efﬁcient trace-based power and delay estimator for FPGA circuit. and gate oxide thickness incurs uncertainties in both chip performance and power consumption. However. For example. 115. 152] studied the parametric yield considering both leakage and timing variations. therefore process variation may have smaller impact on FPGAs than on ASICs. such estimator is only for deterministic values and does not consider process variation. 136] has studied the impact of process variation and presented various statistical optimization methods. 100. 49 . threshold voltage. [129. Modern VLSI designs see a large impact from process variation as devices scale down to nanometer technologies. In addition to meeting the performance constraint under timing variation. all these studies only focus on ASIC and do not consider FPGA. However. Power minimization by gate sizing and threshold voltage assignment under timing yield constrains were studied in [114]. Statistical timing analysis considering path correlation was studied in [120] [86. Yet the parametric yield for FPGAs still should be studied. Some recent works [43. 178] and [180. dice with excessively large leakage due to such a high variation have to be rejected to meet the given power budget. measured variation in chip-level leakage can be as high as 20X compared to the nominal value for high performance microprocessors [25].3. FPGA has a great deal of regularity. leakage power becomes a signiﬁcant component of total power consumption and it is greatly affected by process variation. architecture and device co-optimization considering process variation has not be studied yet. As device scale down. 33] further introduced non-Gaussian variation and non-linear variation models. Recent work has studied parametric yield estimation for both timing and leakage power.1 Introduction In the previous chapter. 187.

2%. we obtain the baseline FPGA hyper-architecture which uses the VPR architecture model [18] and the same LUT size and Cluster size as the commercial FPGAs used by Xilinx Virtex-II [170].5X and 1.4%. We also observe that LUT size of 4 gives the highest leakage yield. We also observe that the leakage and delay variations can be up to 5. Experiments show that the mean and standard deviation computed by our models are within 3% from those copmuted by Monte Carlo simulation.2 derives closed-form models for leakage and delay variations and develops the leakage and timing yield models. but LUT size of 5 achieves the maximum leakage and timing combined yield.3 perform device and architecture evaluation to improve yield rate.we ﬁrst develop closed-form formulas of chip level leakage and timing variations considering both die-to-die and within-die variations. 1 In this chapter. we extend the Ptrace discussed in Chapter 2 to estimate the power and delay variation of FPGAs.8%. For comparison. Section 3. Section 3. and device setting from ITRS roadmap[66]. and threshold voltage Vt . N refers to cluster size and K refers to LUT size 50 . timing yield by 12.4 concludes the chapter. The evaluation requires the exploration of the following dimensions: cluster size N. We deﬁned the combinations of the above parameters as hyper-architecture. Based on such model. and leakage and timing combined yield by 9. LUT size K.In this chapter. we perform FPGA device and architecture evaluation considering process variations. respectively. 1 supply voltage Vdd . Compared to the baseline. LUT size of 7 gives the highest timing yield. device and architecture tuning improves leakage yield by 4. Finally. The rest of the chapter is organized as follows: Section 3. With the extended Ptrace.5X.

Vl . Therefore. In the rest of this chapter. According to [190]. spatial correlation is not signiﬁcant. and Ll . and gate oxide thickness (Tox ).0.2. we consider the variation in gate channel length (Lgate ). V . The leakage current Ii of a type i circuit element is the sum of the sub-threshold 51 . and all variation sources are also independent. and Ii is the leakage current of a type i circuit element. Different sizes of interconnect switches and buffers are considered as different circuit elements. we assume each variation source is decomposed into global (inter-die) variation and local (intra-die) variation as follows: L = Lg + Ll V = Vg +Vl T = Tg + Tl (3. And we also assume that inter-die variation and intra-die variation are independent.e. In Ptrace. and Tg are inter-die variations.1) (3. the total leakage current of an FPGA chip is calculated as follows: Ichip = ∑ Nit · Ii i (3. in this chapter.3) where L. 3. conﬁguration SRAM cell. Vg . and T are variations of Lgate . or ﬂip-ﬂop. Vl . buffer.4) where Nit is the number of FPGA circuit elements of resource type i.1 Leakage under Variation We extend the leakage model in FPGA power and delay estimation framework Ptrace [38] to consider variations. and Tl are intra-die variations. threshold voltage (Vth ). i. LUT.3.2 Delay and Leakage Variation Model In this chapter.. Vg . an interconnect switch.2) (3. and Tg ) and intra-die (Ll . we assume both inter-die (Lg . Lg . and Tox respectively. and Tl ) variations are normal random variables. Vth .

8) (3.Vl . Variation in Igate mainly sources from variation in Tox .10) To extend the leakage model (3. we model the total leakage current Ii of circuit element in resource type i as follows: Ii = In (i) · e fLi(L) · e fVi (V ) · e fTi(T ) (3. Tl ) and interdie (Lg . ci2 . lower threshold voltage.4) under variations. The negative sign in the exponent indicates that the transistors with shorter channel length. we ﬁnd that it is sufﬁcient to express these functions as simple linear functions as follows: fLi (L) = −ci1 · L fVi (V ) = −ci2 ·V fTi (T ) = −ci3 · T (3.9) where ci1 . Ii = In (i) · e−(ci1 Lg+ci2Vg +ci3 Tg ) · e−(ci1 Ll +ci2Vl +ci3 Tl ) (3.Vg .5) Variation in Isub mainly sources from variation in Lgate and Vth . V and T in to intra-die (Ll . Different from [129] which models sub-threshold leakage and gate leakage separately.6) as follows by decomposing L. Tg ) components. we assume that each element has unique intra-die variations but all elements in one die share the same inter-die 52 .7) (3. We rewrite (3. The dependency between these functions has been shown to be negligible in [129].6) where In (i) is the nominal value of the leakage current of type i circuit element and f is the function that represents the impact of each type of process variation on leakage.and gate leakages: Ii = Isub + Igate (3. From MASTAR4 model [68]. ci3 are ﬁtting parameters obtained from MASTAR4 model. and smaller oxide thickness lead to higher leakage current.

we apply the Central Limit Theorem and use the sum of mean to approximate the total leakage current.12). Vl . The lognormal variable Xi shares the same random variables 53 .2 Leakage Yield From (3. we can write the expression of the chip-level leakage as the follows: Ichip ≈ = ∑ Nit · E[Ii|Lg.Vg. Tg] i ∑ Nit SiILg. The state-of-the-art FPGA chip usually has a large number of circuit elements. the sum of the lognormal variables Xi .13) Si = e i (ci1 σLl 2 +ci2 σVl 2 +ci3 σTl 2 )/2 ILg .14) (3.variations. for given inter-die variations.Tg (i) is the leakage as a function of inter-die variations. and Tl .12) (3.13).15) (3. as another lognormal random variable.11). therefore the relative random variance of the total leakage due to intra-die variation approaches zero. The total leakage is the sum of all lognormals. Similar to [129]. we model Ichip .0. 3. (3.Vg. ILg . ((ci1σLg )2 + (ci2 σVg )2 + (ci3 σTg )2 )) = Ni In (i) i (3.11) (3. The leakage distribution of a circuit element is a lognormal distribution.2. and (3.V . σLl .Tg (i) (3.Vg .Tg (i) = In (i)e−(ci1 Lg +ci2Vg +ci3 Tg ) where Si is the scale factor introduced by intra-die variability in L. we can see that the chip leakage current is a sum of log-normal random variables and it can be expressed as follows. After integration.Vg .16) Same as [129]. σVl and σTl are the variances of Ll . and T . respectively. Ichip = ∑ Xi Xi Ai ∼ Lognormal(log(Ai). Both inter-die and intra-die variations are modeled as normal random variables.

Ichip .Ichip = log[1 + (σIchip 2 /µIchip 2 )] log(Ichip) − µN.22) (3. Given a leakage limit Icut for Ichip . µIchip (ci1 σLg )2 (ci2 σVg )2 (ci3 σTg )2 = ∑ {exp[log(Ai ) + + + ]} 2 2 2 i (3.23) (3.20) (3.24) . log[µIchip 4 /(µIchip 2 + σIchip 2 )] 2 2 σN. The covariance is calculated as follows. σN. µIchip . and therefore these variables are dependent of each other.Ichip 1 √ CDF(Ichip) = [1 + er f ( )] 2 2σN. X j ) L i3 T (3.chip is a monotone increasing function. As the exponential function that relates the lognormal variable Ichip with the normal variable IN.Ichip = where er f (·) is the error function. σVg . is the sum of means of Xi and the variance of Ichip .Ichip 2 ) of the normal random variable corresponding to the lognormal Ichip .21) We then use the method from [129] to obtain the mean and variance (µN.25) (3. σIchip . and σTg .19) 2 (ci1 + c j2 ) σLg 2 + 2 (ci2 + c j2 )2 σVg 2 (ci3 + c j3 )2 σTg 2 + ] 2 2 (ci1 σLg )2 (ci2 σVg )2 (ci3 σTg )2 E[Xi ] = exp[log(Ai ) + + + ] 2 2 2 (3. COV (Xi . Considering the dependency.σLg .17) σ2chip = I ∑{exp[2log(Ai) + (ci1σLg )2 + (ci2 σVg )2 + (ci3 σTg )2] i i. j 2 ·[exp(cii 2 σ2 g + ci2 2 σVg + c2 σ2g ) − 1]} + ∑ 2COV (Xi.18) where the mean of Ichip . the CDF of Ichip can be expressed as follows using the standard expression for the CDF of a lognormal random variable.Ichip µN. we calculate the mean and variance of the new lognormal Ichip as follows. X j ) = E[Xi X j ] − E[Xi ]E[X j ] E[Xi X j ] = exp[log(Ai A j ) + (3. is the sum of variance of Xi and the covariance of each pair of Xi . Yleak = CDF(Icut ) × 100% 54 (3.

We assume that the intra-die channel length and threshold voltage variation of each element is independent from each other. Lg and Vg the same for all the circuit elements in the critical path. di (Lg .Vg ) ⊗ · · · ⊗ PDF(dn |Lg .Vg ) ⊗ PDF(d2 |Lg . PDF(D|Lg .Vg ) ⊗ · · · ⊗PDF(di |Lg . we can directly map 55 . we evenly sample a few (eleven in this chapter) points within range of [Lg − 3σLl .VG . we can obtain the PDF (probability density function) of the critical path delay for a given Lg and Vg as follows by convolution operation. Considering process variation.e. the path delay is calculated as follows: D = ∑ di i (3. Given Lg and Vg . Vl . Vg and intra-die variation Ll .Vg ) (3.Vg ) = PDF(d1 |Lg . the percentage of FPGA chips that is smaller than Icut . Below we extend the delay model in Ptrace to consider inter-die and intra-die variations of Lgate .. but its variation is primarily affected by Lgate and Vth variation [129].26) where di is the delay of the ith circuit element in the path. and Tox . i.the path delay is calculated as follows: D = ∑ di (Lg . As the probability of a channel length to the probability of a delay and obtain the delay distribution of a circuit element. Therefore. Lg + 3σLl ].Vg .27) For circuit element i in the path. Ll .28) the delay monotonically decreases when Lgate and Vth increase.gives the leakage yield rate Yleak (Icut |Lg ).3 Timing under Variation The performance depends on Lgate . Vth . In Ptrace. We then use circuit level delay model in [39] to obtain the delay for each circuit element with these variations. Ll .Vl ) i (3.0.2. 3.Vl ) is the delay considering inter-die variation Lg .

However.2.. we ﬁrst need to calculate the leakage 56 . and threshold voltage variation Vg . i. In this chapter.e. CDF(Dcut |Lg ) gives the probability that the path delay is smaller Yper f (Dcut |Lg . The close-to-be critical paths may become critical considering variations and an FPGA chip that meets the performance requirement should have the delay of all paths no greater than D cut . We further consider intra-die variation of channel length in timing yield analysis.2. n cutoff delay (Dcut ). Given a than Dcut considering Lgate and Vth variations.30) PDF(Lg )PDF(Vg ) ·Yper f (Dcut |Lg .5 Leakage and Timing Combined Yield To analyze the yield of a lot.Vg ). We then integrate Yper f (Dcut |Lg . Yper f = Z Z +∞ −∞ nominal condition. by integrating PDF(D|Lg .3. We can obtain the CDF of delay.28) gives the PDF of the critical path delay D of the circuit.Vg ) over Lg and Vg to calculate the (3.Vg ) = ∏ CDFi (Dcut |Lg . we need to consider both leakage and delay limit.Vg ). CDF(D|Lg . In order to compute the leakage and delay combined yield.4 Timing Yield The timing yield is calculated on a bin-by-bin basis where each bin corresponds to a speciﬁc value Lg and Vg .Vg ) gives the probability that the delay of the ith longest path is no greater than Dcut .0. it is not sufﬁcient to only analyze the original critical path in the absence of process variations.0. We assume that for a given Lg the delay of each path is independent and we can calculate the timing yield as follows.Vg ) i=1 (3.Vg ) · dLg dVg 3.29) where CDFi (Dcut |Lg . Given the inter-die channel length variation Lg . we only consider the ten longest paths. (3. n = 10 because the simulation result shows that the ten longest paths have already covered all the paths with a delay larger than 75% of the critical path delay under the performance yield Yper f as follows.

Vg ) × 100% (3.Vg .2.Vg ] ¯ ¯ ¯ ¯ E[Xi|Lg.Vg ) i3 T i.Vg = log[µ4chip|L I g .Vg ]E[X j|Lg. the CDF of leakage current for given Lg and Vg .Vg 2 = log[1 + (σ2chip|L I CDF(Ichip|Lg .Vg ) + (ci3 σTg )2 ] 2 (3.Vg log(Ichip) − µN.Vg ∼ Lognormal(Ai|Lg.Vg = CDF(Icut |Lg .Vg .39) (3. Yleak|Lg .Vg ) + i (ci3 σTg )2 ]} 2 (3.Ichip With the CDF of Ileak|Lg .yield for a given inter-die variation of gate channel length Lg and threshold voltage Vg .Vg · X j|Lg.Vg ] = exp[log(Ai|Lg.Ichip|Lg . j i (3.Vg . Ileak|Lg .37) (ci3 + c j3 )2 σTg 2 ] 2 Finally.Vg ) + (ci3σTg )2] ¯ ¯ ·[exp(c2 σ2g ) − 1]} + ∑ 2COV (Xi|Lg.Ichip|Lg . is calculated as: µN.Vg = Ai · exp(−ci1 Lg − ci2Vg ) (3.Vg + σ2chip|L I )] g .Vg . the covariance between Xi|Lg.31) = ¯ ∑{exp[2log(Ai|Lg. it is easy to compute the leakage yield for given Lg and Vg . (ci3 σTg )2 ) ¯ Similar to Xi ’s.41) 57 .Vg )] (3.Vg 1 √ [1 + er f ( )] 2 2σN.Vg ¯ = ∑ {exp[log(Ai|Lg.Vg ’s are computed as: ¯ ¯ ¯ ¯ ¯ ¯ COV (Xi|Lg.Vg .Vg ] − E[Xi|Lg.38) (3.Vg ) + ¯ E[Xi ] = exp[log(Ai|Lg.35) (3. X j|Lg.34) ¯ ¯ Xi|Lg.40) g . Yleak|Lg . X j|Lg.Vg · X j|Lg.Vg ) = E[Xi|Lg.Vg 2 /µ2chip|L I g .Vg /(µ2chip|L I σN.33) (3.0.Vg · A j|Lg. µIchip|Lg .32) where ¯ Ai|Lg. Similar to Section 3.Vg .36) (3.2.Ichip|Lg .Vg ) = g .Vg σ2chip|L I g . we ﬁrst calculate the mean and variance of leakage current for given Lg and Vg .

2 Yper f % 74.29).42) 3.0.Vg ) · dLg dVg (3. 3. We consider ITRS 70nm technology and 58 . From the table.9) Our model Yper f % 72. Therefore.6 Veriﬁcation of Yield Model In this section.1) Table 3.3 (-0. we verify our yield model by comparing it to 10.000 sample MonteCarlo simulation. In our experiment. we use our yield model to perform device and architecture evaluation for leakage and delay yield optimization.3 Hyper-Architecture Evaluation Considering Process Variation In this section. we assume 70nm ITRS technology and use device and architecture same as the baseline hyper-architecture as in Section 2.4 Yleak % 89. The cut of leakage power is 2X of the nominal value and the cut of delay is 1.4.1 (-2.6 (-1.31). we see that our yield model is within 3% error compared to the Monte-Carlo simulation. we assume that the leakage yield and timing yield are independent of each other for given L g and Vg .5) Ycom % 62. the leakage variability only depends on the variability of random variable Tg as shown in (3.1: Veriﬁcation of yield model.MC sim Yleak % 90.1 compares the yield estimated from our model and that from the Monte-Carlo simulation. and the timing variability only depends on the variability of random variable Ll and Vl as shown in (3.Vg )Yper f (Dcut |Lg .2. The yield considering the imposed leakage and timing limit can be calculated as follows. Ycom = Z Z +∞ −∞ PDF(Lg )PDF(Vg )Yleak (Icut |Lg . Table 3.1 Ycom % 64. Because for given a speciﬁc inter-die variation of channel length Lg and threshold voltage variation Vg . We also assume that all 20 MCNC benchmarks are put into one FPGA chip.1X of nominal value.

6.3. the range of leakage power is up to 5.5X and the range of delay variation is up to 1.1 0.7 8 K 6. each dot is sample of MonteCarlo simulation. we assume that all the global routing track (W ) span 4 logic blocks with all buffer switch box.0% 1.4 0. In the ﬁgure. we assume that the 3σ value of the interdie variation is 10% of nominal value.9% Distribution Normal Normal Normal Table 3.9 3σl 3.5% Vdd (V) 0. we can see with process variation. change Vdd and Vth around such setting.5.8 1. we consider LUT size K from 4 to 7.2: Experimental setting.2 0.9% 1.10. Figure 3. For all variation sources.4.1 illustrates the leakage and delay variation from Monte-Carlo simulation for the baseline hyper-arch. In the experiment. we use the same baseline hyper-architecture as in Section 2.1 Impact of Process Variation In this section.12 4 W 4 4 Vth (V) 0.N Evaluation range Baseline Source Lgate Vth Tox 4.8.0% 2. For comparison. From the ﬁgure. we assume that all the variation sources has normal distribution. and the 3σ value of the intra-die variation is 5% of nominal value. For interconnect. 3. For process variation. we assume that all 20 MCNC benchmarks are put into one chip and obtain the longest 10 critical paths from them.3 3σg 5. we analyze the impact of process variation on FPGA leakage power and delay. and cluster size N from 6 to 12. For architecture. 59 .5X.5% 2.

That is. In the rest of this section. 3. This is because the larger Lgate gives better leakage yield. Homo-Vth and Hetero-Vth as deﬁned in in Section 2.3.85 0.1 1.5 3 2. we perform device and architecture evaluation to optimize the delay and leakage power yield for two FPGA classes.1X of the nominal value. the optimum Lgate for logic blocks and interconnect is the same. 60 .95 1 1.4.1: Leakage and delay of baseline architecture hyper-arch.1.5 4. Moreover.05 1.2. From the table.1 Leakage Yield We ﬁrst optimize leakage yield.2 Figure 3. we see that Homo-Lgate and Hetero-Lgate give the same maximum leakage yield result. as shown in Figure 3. Table 3. we assume that the cutoff leakage power is 2X of the nominal value and the cutoff delay is 1. CVth refers to the Vth of logic blocks and IVth refers to the Vth of interconnect.5 Normalized Leakage Power 4 3.3 illustrates the hyper-archs with maximum leakage yield for both two classes.2 Impact of Device and Architecture Tuning In this section.5 0 0. 3.15 1.5 1 0. Notice that for Homo-Vth .5X Cutoff power 1.8 0. In the table. CVth = IVth .5X Normalized Delay 0.5 2 1.3.9 Cutoff delay 5. both logic block and interconnect using largest Lgate (33nm) results in optimum leakage yield.

Similarly.9 Table 3.2.9 Yleak % 89.9(+12. Table 3.1 Yleak % 89.4 Vdd (V) 0. That is the smaller Vth gives better timing yield.2 Vdd (V) 0.8) Yper f % 85.4) 97.1 57. we only analyze the delay of the largest MCNC benchmark clma.4 illustrates the optimum timing yield hyper-arch.2 Timing Yield Secondly.4% compared to the baseline.2V) results in optimum timing yield. we can also ﬁnd that K = 4 gives the optimum leakage yield. Table 3.4: Optimum Timing yield hyper-architecture.9 0. therefore both logic block and interconnect using smallest Vth (0. which improve leakage yield by 4.5 79.1 1.1 69. The reason is similar to the leakage yield.3 79. Similar to leakage yield analysis.3 0.3 0. N Baseline Homo-Vth Hetero-Vth 8 6 6 K 4 7 7 CVth (V) 0.3.3(+4.4 0. For timing yield analysis.2 IVth (V) 0.5 97. we discuss about leakage and timing combined yield.5 67.6 Yper f % 85.3 0. 3.2.1 Table 3. the timing yield is often studied using selected test circuit such as ring oscillator for ASIC in the literature.2 0.9 57.3: Optimum leakage yield hyper-architecture.8) 94.5 63.3 Ycom % 67. From the table.N Baseline Homo-Vth Hetero-Vth 8 10 10 K 4 4 4 CVth (V) 0.3.5 94.4) Ycom % 67. we see that 61 . From the table.9 1. we analyze the timing yield. we can also ﬁnd that the optimum timing yield hyper-arch has K = 7 and improve timing yield by 12.9 0.5 illustrates the optimum leakage and timing combined yield hyper-arch.3(+4. both Homo-Lgate and Hetero-Lgate achieve the same hyper-arch for optimum timing yield.8% compared to the baseline.9(+12.4 0. 3.4 IVth (V) 0.3 0.2 0.3 Leakage and Timing Combined Yield Finally.1 69.

and the FPGA chip level leakage and delay variations can be up to 5.N Baseline Homo-Vth Hetero-Vth 8 10 8 K 4 5 5 CVth (V) 0.5 84. the optimum hyper-architecture (combination of architecture and device parameters) improves leakage and timing combined yield by 9.3 0.2 (+8.5X.1) 76.3 0.2%.2%. the optimum hyper-arch for Homo-Vth improves combined yield by 8. we have developed efﬁcient models for chip-level leakage variation and system timing variation in FPGAs. We have shown that architecture and device tuning has a signiﬁcant impact on FPGA parametric yield rate. Homo-Vth and Hetero-Vth give different result for combined yield optimization. 3. In addition.1 75.0 1.2) Table 3. 62 .6 Yper f % 85. 7 has the highest timing yield. LUT size 4 has the highest leakage yield.4 Conclusions In this chapter. K = 5 gives the best combined yield.3 0.3 0. the largest (or smallest) Vth not necessary gives the optimum combined yield.0 Yleak % 89. We also ﬁnd that Hetero-Vth gives better result than Homo-Vth .5: Optimum leakage and timing combined yield hyper-architecture.1 Ycom % 67. This is because Hetero-Vth provides larger search space. Compared to the baseline. Experiments show that our models are within 3% from Monte Carlo simulation. Unlike the leakage yield and timing yield analysis. respectively.9 1. but LUT size 5 achieves the maximum combined leakage and timing yield.5 87. This is because both leakage and timing should be considered to optimize combined yield.1% and the optimum hyper-arch for Hetero-Vth improves the combined yield by 9.6 91.3 Vdd (V) 0. compared to the baseline.5 86.3 (+9. But for both Homo-Vth and Hetero-Vth.5X and 1.35 IVth (V) 0.

1X energy difference. we further improve the tracebased framework (Ptrace2) to enable concurrent process and FPGA architecture codevelopment. In addition. 3.5% to 5. We show that applying heterogeneous gate lengths to logic and interconnect may lead to 1.5% and reduce die to die leakage signiﬁcantly. the user can tune eight parameters for bulk CMOS processes and obtain the chip level performance and power distribution and soft error rate (SER) considering process variations and device aging. this framework is suitable for early stage process and FPGA architecture co-development.CHAPTER 4 FPGA Concurrent Development of Process and Architecture Considering Process Variation and Reliability In Chapter 2 and Chapter 3. we develop a trace-based framework (Ptrace) to perform device and architecture co-optimization. and device burn-in to reach the point could reduce the performance change over 10 years from 8.3X delay difference. and reduce standard deviation of leakage variation by 87%. In this chapter. It is also ﬂexible as process parameters can be customized for different FPGA elements and no SPICE models and simulations are needed for these elements. we also study the interaction between process 63 . We further show that the device aging has a knee point over time. Applying the new trace-based framework (Ptrace2). The Ptrace2 framework is efﬁcient as it is based on closed-form formulas. Therefore. This offers a large room for power and delay tradeoff. The chapter further presents a few examples to utilize the framework.

device aging and SER. Moreover. input/output capacitance for FPGA basic circuit elements. In this chapter we further extend the trace-based architecture framework (Ptrace) discussed in the previous section to consider process parameters directly. The new process variation analysis can handle non-Gaussian variation sources which is ignored by Ptrace. then delay. We call the resulting framework as ptrace2. However. and ﬁnally the chip level performance and power based on trace similar to that in Chapter 2 and Chapter 3. It may further provide inputs for process tuning. Such evaluation may be used to select circuits and architectures less sensitive to process changes or process variations. The assumption that FPGA architecture development starts only after the process technology is stable may be valid in the past. We observe that device aging reduces standard deviation of leakage by 65% over 10 years while it has relatively small impact on delay variation. we proposed a time efﬁcient FPGA power and delay evaluator Ptrace. leakage power. Ptrace assumes that the processes are mature and stable device models are available so that SPICE simulations can be carried out to obtain circuit level power and delay. given that the FPGA is a large volume product and an FPGA company may convince a foundry to tune process when there are large enough beneﬁts. for performance and power.1 Introduction In Chapter 2 and Chapter 3. As illustrated in Figure 4.1. ptrace2 calculates ﬁrst electrical characteristics of advanced CMOS transistors. 4. therefore we can conduct FPGA circuit and architecture evaluation when only the ﬁrst order process parameters are available. we also ﬁnd that neither device aging due to NBTI and HCI nor process variation have signiﬁcant impact on SER. 64 .variation. but it no longer holds as we begin to develop process and architecture concurrently in order to shorten the time to market.

8 studies the interaction between reliability and process variation. and to study the interaction between process variation. 166]) and permanent soft error rate (SER) [60]. Furthermore.9 concludes this chapter. device aging and SER.3. we incorporate analytical calculations for two types of FPGA reliability. 149] [160. and Section 4.5 introduce the circuitand chip-level power and delay models and chip-level variation models. device aging (Negative-Bias-Temperature-Instability. 65 . respectively. 23] and Hot-Carrier-Injection. Section 4.7 analyzes the reliability for FPGA. NBTI [11. The rest of this chapter is organized as follows: Section 4. Section 4. we illustrate how to use this framework to improve power and performance by process and FPGA concurrent development. With ptrace2. 4. again in the from device to chip fashion.Input Transistor Model PTrace2 Output Chip Level Process Variation Trace Reliability Figure 4.6 presents device tuning for power and delay optimization.2 presents our implementation of ITRS device model. HCI [34. 149.4 and 4.1: Trace-based estimation ﬂow. and Section 4. Sections 4. Finally.

Figure 4.4. Nbulk .e. channel width (W ). The saturated threshold voltage (Vth ) is then calculated considering the short channel effect (SCE) and drain induced bar1 The device model in MASTAR4 tool is a predicted model for advanced processes and there is no correspondent SPICE models for these technologies. which directly depends on various major technological parameters including gate length (L gate ). MASTAR4 is a computing tool which calculates the electrical characteristics of advance CMOS transistors 1 . the gate leakage current when the channel is on/off (Igon /Igo f f ). T and Vdd . as well as some intermediate outputs such as effective mobility u e f f which are used for the ﬁnal output calculation. Io f f ). double gate and silicon on isolator (SOI) while we only consider traditional bulk transistor in this work. It can handle different technologies including planar bulk. channel doping density (Nbulk ). Ion depends on all the main inputs. i. The output electrical characteristics include the on current (Ion ).. 148]. W . the electrical (or effective) gate length (Lelec ) and the electrical gate oxide thickness (Tox elec ). The calculation in MASTAR4 is based on analytical equations. are ﬁrst calculated. Temperature (T ) and the supply voltage (Vdd ) also have a signiﬁcant impact on these output characteristics. We brieﬂy review the calculation ﬂow in MASTAR4 below. There are also other inputs related to mobility. We follow MASTAR4 tool and only tune the major process inputs while all the other inputs can be tuned as well if needed. we implement the bulk transistor device model from ITRS 2005 MASTAR4 (Model for Assessment of cmoS T echnologies And Roadmaps) tool [67. The feature intermediate parameters. the gate capacitance (Cg ) and the drain/source diffusion capacitance (Cdi f f ). extension depth (X jext ) and the total series resistance for source and drain (Racc ). gate oxide thickness (Tox ). 66 . Lgate .2 Device Models In this chapter. 68. velocity and gate stack etc. Racc . X jext .2 shows the calculation ﬂow for the on current Ion . Tox . the sub-threshold leakage current (off current.

Vgt = Vgs −Vth Idsat0 = 0. Ec is the critical electrical ﬁeld and d is an intermediate parameter. Ion is then calculated as. the 67 . rier lowering (DIBL). The effective mobility (ue f f ) is also calculated. The derivation of these intermediate parameters are provided in the ITRS device model [68].Inputs: Lgate Tox Nbulk Xjext W Racc T Vdd Intermediate variables: Lelec Tox_elec Intermediate variables: Vth Intermediate variables: u_eff Output: Ion Figure 4. Ion .3 shows the calculation ﬂow for the sub-threshold leakage current Io f f .5ue f f ·Cox elec · Ion = 1+ 2Racc Idsat0 Vgt (4.1) W ·Vgt ·Vdsat Lelec elec (4. Io f f depends on all the main inputs except for Racc .2) (4. main intermediate parameters and all other parameters. calculation.2: The ﬂow for on current.3) Idsat0 R − Vgt +Lacc Idsat0 Ec (1+d) where Idsat0 is the on current without considering the series resistance (Racc ). Similar to Ion calculation. With the main input parameters. Figure 4. Vgs is the difference between the gate and source voltage levels. Vdsat is the velocity saturation voltage.

intermediate feature parameters Lelec and Tox elec . we can then calculate Io f f as.4) where k is the Boltzmann constant.4: The ﬂow for gate leakage currents Igon and Igo f f calculation. Vth and other parameters. q is electric unit. the sub-threshold slope (Slope) is also calculated as. In order to calculate Io f f . respectively. respectively. Igon and Igo f f depend on Lgate .4 shows the calculation ﬂow for the gate leakage currents when the channel is on (Igon ) and off (Igo f f ).5uA. W 68 . T is the temperature.3: The ﬂow for sub-threshold leakage current. Io f f = Ith where Ith = 0. With Slope.5) Outputs: Igon Igoff Figure 4. ε s and εox are the dielectric permittivities of silicon and oxide. Figure 4. Slope = kT εs · Tox elec ln1 0(1 + ) q εox · Tdep (4. Io f f . Inputs: Lgate Tox Nbulk Xjext W Racc T Vdd W −Vth /Slope e Lelec (4. X jext . Tox . and the saturated threshold voltage Vth are ﬁrst calculated. calculation.Inputs: Lgate Tox Nbulk Xjext W Racc T Vdd Intermediate variables: Lelec Tox_elec Intermediate variables: Vth Slope Output: Ioff Figure 4.

7) (4.02V −2 . respectively.6) where a1 = 1.Inputs: Lgate Tox Nbulk Xjext W Racc T Vdd Intermediate variables: Tox_elec Outputs: Cg Cdiff Figure 4. Tox . a2 = −4.5: The ﬂow for transistor gate and diffusion capacitances Cg and Cdi f f calculation. Cg and Cdi f f . T and Vdd .05V −1 and a4 = 1/(1. In order to calculate Igon and Igo f f . fringing capacitance Ctotal 69 . The intermediate feature parameter Tox elec is ﬁrst calculated for capacitance calculation. W . (4. a3 = 13. Jg = a1 ea2Vg +a3V g e−a4 Tox 2 (4.10) (4. Igon and Igo f f are then calculated as. Cg and Cdi f f are calculated as. and Vdd . Igon = Lgate ·W · Jg ΔL = Lgate − Lelec Igo f f = 0.5ΔL ·W · Jg where ΔL is the difference between Lgate and Lelec . With Jg . we ﬁrst calculate the gate leakage current density Jg as.5 shows the calculation ﬂow for transistor gate and drain/source capacitances.11) f ringing where the Cg calculation considers gate oxide capacitance.8) (4. The capacitances depend on Lgate .9) Figure 4.17E − 10m). Cg = ( Cdi f f Lgate +Ctotal Tox elec = Coverlap +C junc εox f ringing +Coverlap )W (4.44E5A/cm2 .

Vgs < Vth .e. gate and body voltage levels as following. i. i. and saturation regions.12) 70 . the current in velocity saturation region. i. linear.and overlap capacitance Coverlap . We calculate the inverter delay based on numerical integration through the transistor IV curves. Ids is calculated as. only the calculation for the maximum on current Ion . We use the equations from [74] to calculate the drainsource current Ids in different working regions.2. In MASTAR4 tool. these FPGA circuits can be further decomposed into net-lists containing the most basic circuit elements. is provided. Ids = Ith · (W /Lelec ) · e((Vgs −Vth )/Slope) (4. In the sub-threshold region. which is treated as part of the loading capacitance of an inverter.1 Delay Model The pass transistor and multiplexer tree are modeled as a lumped capacitance. and Cdi f f calculation considers Coverlap and junction capacitance C junc .e. We therefore mainly discuss the power and delay for these basic circuit elements as below. 4. Ids is calculated as a function of drain. pass transistor gate. LUT. sub-threshold. Essentially.3 Circuit-Level Delay and Power In this section. source. 4. SRAM. i. We consider buffers.3.e.e. inverters and pass transistors. we present basic circuit models for delay and power characteristics using the device model in Section 4. ﬂip-ﬂop (FF) and multiplexer [18] for the FPGA circuits. where the multiplexer is implemented as an NMOS pass transistor tree.

Ids is equal to the on current Ion calculated by (4.3). i. Ids is calculated as. Cox elec and Ec are both calculated using equations in the ITRS MARSTAR4 model. i. i. Vgs is the gate source voltage difference. 71 . where the velocity saturation voltage Vdsat is calculated using equations in the MASTAR4 tool. Vgs > Vth and Vds < Vgs −Vth . Vgs > Vth and Vds > Vdsat .5uA.13) where Cox elec is the gate oxide capacitance.e. Vgs > Vth and Vds > Vgs −Vth .14) In velocity saturation region.where Ith is 0. Ids is calculated as.6: Voltage and current in an inverter during transition. Ids = W Cox elec · ue f f · · (Vgs −Vth −Vds /2) ·Vds Lelec 1 +Vds /(Ec · Lelec ) (4. Using these equations. Vds is the drain source voltage difference and Ec is the critical electrical ﬁeld. In linear region.e. Ids = 0. short channel effect (SCE) and drain-induced barrier lowering (DIBL).5 · W Cox elec 2 · ue f f · ·Vds Lelec 1 +Vds /(Ec · Lelec ) (4. Figure 4. Vs = Vdd Vb = Vdd Ids(P) Vg = Vin Ids(N) Vd = Vout C Vb = GND Vs = GND Figure 4. In the saturation region.7(c) shows the IV curves for an NMOS transistor under ITRS 2005 HP 32nm technology node. Vth in the above equations is calculated considering body-bias.e.

06 0.52v Vgs=0.02 0.04 0. (b) Input and output voltages. (c) The NMOS IV curves and the transition of transient current Ids with small and large input slopes. and short circuit current with a small input slope.4 0.6 Vds (v) (c) 0.61v Vgs=0.12 drain current (uA) Ids (uA/um) 12 1200 1000 800 600 400 200 0 0 ITRS05 HP NMOS (32nm) IV curves Vgs=1.2 Figure 4.8 0.81v Vgs=0. its loading capacitance C and the input voltage waveform. Cload =1fF Vin Vout (NMOS) Vout (Inv) PMOS current 14 drain current (uA) 12 10 8 6 4 2 0 0.6).2 0.08 time (ns) 0.Given an inverter (see Figure 4.1 Vin Vout (NMOS) Vout (Inv) PMOS current 14 10 8 6 4 2 0 0. Cload =1fF 1.7 (a) shows the transient voltage transition of a 1X inverter with 1 f F as loading capacitance C at ITRS 2005 HP 32nm technology node.2 0 0 0.42v small input slope large input slope transient transient II 0.6 0. Figure 4.8 0. respectively. The input voltage transition is from 0 to Vdd . 1.8 1 1.5Vdd . The input slew rate is deﬁned as the transition time from 72 .15) The delay is the time difference between when the output and input voltages reach 0.04 ITRS05 HP (32nm) 1X INV. We can then perform numerical integration to obtain the output voltage waveform based on the following equation. Note that the input slew rate is automatically considered in this delay model. we can then calculate the inverter delay. The output voltage waveform can be propagated for delay calculation of the next stage.4 (a) (b) 0.7: The pull-down delay of a 1X inverter at ITRS 2005 HP 32nm technology nodes.2 1 voltage (v) 0.03 ITRS05 HP (32nm) 1X INV. (a) Input and output voltages.01 0.02 time (ns) 0.71v Vgs=0.2 0 0 0. We calculate the pull-down and pull-up delay for an inverter and then obtain the worst case delay as the inverter delay. and short circuit current with a large input slope.2 1 voltage (v) 0.91v Vgs=0. we can obtain the transient PMOS and NMOS drain-source current ids (P) and ids (N). dV (out) = (ids (P) − ids(N)) · dt/C (4.1v Vgs=1.4 0. At time t.6 0.0v Vgs=0.

1Vdd (or 0. we ﬁrst breakdown them into the netlist of inverters and pass transistors. the PMOS drain-source transient current is also shown in this ﬁgure. Therefore. a similar trend for pull-up transition. 5X large as the that in Figure 4. ITRS MASTAR4 tool uses CVdd /Ion to predict transistor delay.7(c) shows the NMOS transient drain-source current ids transitions with the two input slopes.9Vdd ) to 0.7. is observed.g.e. With a larger input slope.7 (b) shows the transient voltage transition of this inverter with a larger input slope. i.e. the input voltage slope may be large due to a large loading capacitance. The short circuit current. the transition is slower with a larger delay. our delay model is more accurate than that in MASTAR4.7 (a). i. For other FPGA circuit elements with loading capacitance. the output voltage waveform without considering short circuit current differs the waveform considering short circuit current which results in a 20% delay difference. input voltage transition is from Vdd down to 0. LUTs. the output voltage waveform (Vout (NMOS) in the ﬁgure) is almost overlapped the the waveform (Vout (Inv) in the ﬁgure) considering the short circuit current.e. 73 .1Vdd ). our delay model is ﬂexible and can be extended to other complex gates easily. The delay of each circuit element is then calculated using the above method. While not presented here. e. the input slope has a signiﬁcant impact on the inverter delay. In addition. FFs and buffers.0. the short circuit current is necessary to be considered for delay calculation. e. Figure 4. Figure 4. As shown in Figure 4. i. In FPGAs. The short circuit current becomes more signiﬁcant due to a larger input slope and can no longer be ignored.e. Since the FPGA circuit element usually has a large loading capacitance and a large input slope. If this short circuit current is ignored. i.g.9Vdd (or 0. the LUT delay is decomposed into the delay from LUT input passing through input buffers to the multiplexer control inputs and the delay from the SRAM passing through the multiplexer and the output buffer.

16) (4. which can be is on or off.4. Pleak (inv). we ﬁrst discuss leakage power including sub-threshold and gate leakage power and then discuss dynamic power including switching power and short circuit power for basic circuit elements. For a multiplexer containing N NMOS pass transistors. Igon (P). We calculate the average leakage power for an inverter. we calculate the gate (Cg (inv)) and self-loading caeither Vdd · Igon (N) or Vdd · Igo f f (N) depending on if the channel of this pass transistor 74 . Igo f f (P). respectively. The two inverters in an SRAM cell are identical with one input as Vdd and one input as 0. we can calculate the leakage power for other FPGA circuit elements. Pleak (inv) = Vdd · (Io f f (inv) + Igate (inv)) Io f f (inv) = (Io f f (P) + Io f f (N))/2 Igon (P) + Igo f f (P) + Igon (N) + Igo f f (N) Igate (inv) = 2 (4. the average leakage calculation of an inverter can be applied to SRAM leakage calculation.17) (4.3. We consider switching and short circuit power for inverter dynamic power consumption.2 Power Model In this section. For an NMOS pass transistor. Igon (N) and Igo f f (N) are the PMOS and NMOS gate leakage currents when the channel is on and off. For switching power. Based on the leakage model for inverters and pass transistors/muxes.18) where Io f f (P) and Io f f (N) are the PMOS and NMOS sub-threshold leakage currents. as. only gate leakage power is consumed. respectively. which depends on the input logic value. An inverter consumes both sub-threshold and gate leakage power. Therefore. The pass transistor in a used/unused routing switch implemented by a tri-state buffer is on/off. N/2 of them are on while the other half are off.

3. wire segment. then calculate the chip level power and delay. interconnect switch 75 . Cdi f f (P) and Cdi f f (N) are the PMOS and NMOS diffusion capacitances. we apply the similar idea as the trace-based estimation in Chapter 2.e. 4. respectively.. i.3. The transient short circuit current has been calculated during delay calculation.3.3.1 Delay Model Similar to Section 2.2. 4. LUT. The input capacitance. We collect the tract information in the same way as in Section 2.pacitances (Cdi f f ) for an inverter as. We can simply perform a numerical integration on the transient short circuit current and obtain the short circuit energy per switch SC(Sl). internal capacitance and self-loading capacitance can then be easily extracted for each FPGA circuit element for switching power calculation purpose. Cg (inv) = Cg (P) +Cg (N) Cdi f f (inv) = Cdi f f (P) +Cdi f f (N) (4. where Sl is the input slew rate. As shown in Figure 4.19) (4. we compute the critical path delay by adding the delay of all circuit elements in the critical path. In order to perform chip level estimation.4 Chip-level Delay and Power With the delay and power model for basic circuits discussed in Section 4. respectively.7 (a) and (b).20) where Cg (P) and Cg (N) are the PMOS and NMOS gate capacitances. we introduce the chip level delay and power estimation model in this section.4. short circuit current depends on input slew rate.

Dynamic power is only consumed by the used FPGA resources.4. Vdd is the supply voltage.2.2. which can be computed by circuit level leakage power model as discussed in Section 4.3.3.2 Leakage Power Model In the same way as in Section 2.4.23) where f is the operating frequency. we use the Elmore delay model with resistance and capacitance suggested by ITRS [67]. For each circuit element.buffer. For an interconnect wire segment.21) where Di is the delay of the ith circuit element in the critical path.2.2. 4. 4. A circuit implemented on an FPGA cannot utilize all circuit elements.3 Dynamic Power Model Dynamic power includes switch power and short-circuit power. SW is the average switching activity. and MUX: Dcrit = ∑ Di i∈P (4.22) is the leakage power for type i circuit element. The 76 . which can be calculated by the circuit level delay model as discussed in Section 4. the chip leakage power is modeled as the sum of the leakage power of all circuit elements: Pleak = where Pis i∈T ∑ Nit Pis (4. and Ci is the total capacitance per switch of type i circuit elements.1.3. the loading capacitance can be calculated from the in/out capacitance of the basic circuit elements as introduced in Section 4.3. Our switch power model distinguishes different types of used FPGA circuit elements and applies the following expression: Psw = 1 2 · f ·Vdd · SW · ∑ Niu ·Ci 2 i∈T (4.

25) where Sli is the input slew rate of type i circuit element.1. In our model. which is difﬁcult to obtained without detailed simulation using a large set of input vectors.3. then the input slew rate of a connection box buffer is the output slew rate of the global interconnect buffer. We also assume that the operating frequency to be 5/6 of the maximum frequency. the average switching activity SW can be set by the users.2 · Dcrit ). Here we model the chip short circuit power as: Psc = f · SW · ∑ i∈T 1≤ j≤Niu ∑ SCi (Sli. which is computed as the output slew rate of the element driving type i element.2. The switching activity is related to the input vector and logic. Because of the regularity of FPGAs.3.3. We do not set the operating frequency to maximum frequency since the actual critical path delay of some chips may increase due to process variation.24) where SCi (·) is the function to compute short circuit energy per switch for type i circuit element and Sli. the input slew rate for the same type of circuit elements is very close.2. a connection box buffer is driven by a global interconnect buffer. Such output slew rate can be computed from the basic circuit element delay model discussed in Section 4. 77 . For example. The short circuit power is related to the input slew rate. j is the input slew rate of the jth type i circuit element. The short circuit energy function SCi (·) can be obtained from the short circuit power model of basic circuit elements as discussed in Section 4. j ) (4. that is f = 1/(1.total capacitance per switch Ci can be computed from the in/out capacitance of a basic circuit element as discussed in Section 4. the chip short circuit power can be further simpliﬁed as: Psc = f · SW · ∑ SCi (Sli ) i∈T (4. Therefore.

X2 . X2 . . the short circuit power model introduced in this section is more accurate than that in Section 2.27) (4.3.5.1. 4.) Pt = FtL (X1.2..) (4. To evaluate the chip-level leakage power and delay considering process variation.3.28) where FtL and FtD are functions of process parameters to calculate leakage power and delay for type t circuit element. respectively.1.. delay and leakage power of a circuit element are functions of the underlying process parameters and they can be described as: Dt = FtD (X1. we calculate the circuit level short circuit power directly from transistor model. . and the process variation sources (such as gate change length and dopant density) are modeled as a random variable X . First we introduce the variation model as below.26) Notice that dynamic power model in this chapter is different from that in Section 2..4.5 Chip Level Delay and Power Variation Based on the chip-level power and delay model introduced in Section 4. while in this chapter. Therefore.1 Variation Models In general. 4. short circuit power is calculated as a ratio to switching power. we study the impact of process variation on power and delay in this section..1. we ﬁrst 78 . In Section 2.With the switch power and short circuit power.2. the total dynamic power is computed as: Pdynamic = Psw + Psc (4.2.3.

Each circuit element has its own random variation Xr . and intra-die random variation for variation source X . We use the grid based model in [173] to model the spatial variation. which is calculated as: Dj = D ∑ Fji (X1 . some close-to-be critical path may become critical path. That is: X = Xg + Xs + Xr (4. Therefore. The sensitivity functions FiL and FiD can be obtained from the leakage power and delay model of basic circuit elements discussed in Section 4. and all circuit element of the whole chip share the same inter-die variation Xg . and intra-die random variation is considered.30) where C is the set of close-to-be critical paths and D j is the delay of the jth close-to-be critical path. X2 .29) where Xg .) ji ji (4. and Xr are independent from each other.. A chip is divided into several grids. intra-die spatial. all inter-die. in order to analyze delay variation.31) i∈P j 79 .2 Delay Variation With process variation.3.apply randomly generated process parameters to the device model as in Section 4.2 and then calculate the leakage power and delay for basic circuit elements as in Section 4. intra-die spatial.3. The correlation coefﬁcient decreases when the distance between two grids increases.5. circuit delay is calculated as the maximum of a set of near critical paths: Dvar = max D j j∈C (4. 4. For each variation source X . Xs and Xr are the inter-die. all circuit elements in the same grid share the same spatial variation Xs and the spatial variation between different grids are correlated. Xs . We assume that Xg . ..

In this chapter. However. Because there is no closed-form formula for D the delay variation function Fji (·).3 Chip Leakage Power Variation The chip leakage power with process variation is modeled as the sum of leakage power of all circuit elements: Nit Pleak = j i∈T j=1 ∑ ∑ Pi j = i∈T j=1 ∑ ∑ FiL (X1 . 4. In order to solve such problem. when the circuit size becomes large. a large number of random variation samples need to be computed. . We ﬁrst generate M samples of variation sources (X1 . and then compute the the path delay under such samples to ji ji obtain M samples of path delay. X2 . because the random variation is unique for each circuit element.. and X i j is the variation of the jth type i circuit element. Given the inter-die and within-die spatial variation. Similar to the chip delay variation model.. we use the Monte-Carlo simulation to obtain the chip delay distribution.. which is ignored in the previous Ptrace [167]. We ﬁrst generate M samples of variation sources and then compute the M samples of chip leakage power.32) where Pi is the leakage power of the jth type i circuit element.D where Pj is the set of path elements of the jth path and Fji (·) is the delay variation function for the ith circuit element in Pj .. we simplify the chip leakage power variation model by applying central limit theorem similar to [25. we do not have closed-form formula to compute the chip delay distribution. 168].) Nit ij ij (4. The Monte-Carlo simulation allows us to consider nonGaussian variation sources and spatial correlation.5. Finally we can get the delay distribution from those M samples of path delay. the total leakage power for all 80 .) for all i. we use Monte-Carlo simulation to compute chip leakage distribution. . X2 . This makes the simulation very inefﬁcient.

**type i circuit elements in the lth grid can be calculated as:
**

i Pleak = l ∑ FiL (Xg1 + Xs1 + Xr1 , Xg2 + Xr2 , ...)

t Nli

il j

il j

(4.33)

il j

j=1

**where Xg is the inter-die variation, Xsl is the spatial variation in the lth grid, Xr is
**

t the random variation for the jth type i circuit elements in the lth grid, Nli is the number t of type i circuit element in the lth grid. Usually, Nli is a very large number. Therefore,

**by applying the central limit theorem, we have:
**

l l l ∑ FiL (Xg1 + Xs1 + Xr1 , Xg2 + Xr2 , ...) ≈ Nit · µi(Xg1 + Xs1, Xg2 + Xs2, ...)

t Nli

il j

il j

(4.34)

j=1

where

l l µil (Xg1 + Xs1 , Xg2 + Xs2 , ...) i i l l = E[FiL (Xg1 + Xr1 , Xg2 + Xr2 , ...)|Xg1, Xs1 , Xg2 , Xs2, ...] ij ij

(4.35)

In order to compute µil (·), we ﬁrst generate Mr samples of random variations, and then compute the mean of leakage power under such samples. That is:

l l µil (Xg1 + Xs1 , Xg2 + Xs2 , ...)

=

1 Mr L l k l k ∑ F (Xg1 + Xs1 + Xr1, Xg2 + Xs1 + Xr2, ...) Mr k=1 j

(4.36)

**where xk is the kth sample of random variation. Here, notice that usually Mr is much r
**

t smaller than Nli and does not depend on circuit size. Therefore such method is scalable

**to large circuit size. Then the chip leakage power can be modeled as: Pleak =
**

t l l ∑ ∑ Nli · µil (Xg1 + Xs1, Xg2 + Xs2, ...) G

(4.37)

i∈T l=1

where G is the number of grids. With the above equation, we may generate M samples of inter-die and within-die spatial variations to compute the distribution of leakage power.

81

Notice that the variation model introduced in this section may handle non-Gaussian variation sources and spatial correlation, which are ignored in the variation model in Section 3.2.

4.6 Process Evaluation

Based on the chip level power and delay model presented in Section 4.4 and 4.5, we perform device tuning to minimize power and delay.

4.6.1 Power and Delay Optimization First we discuss the optimization for nominal value of power and delay. In this chapter, we consider two device parameters, supply voltage Vdd and the physical gate length Lgate . In our experiment, we start from the device setting of ITRS High Performance 32nm technology (HP32) [67] (with Vdd=1.1V, Lgate =32nm), and then change Vdd and Lgate around such setting. Moreover, we assume that the FPGA has cluster size N=6, LUT size K=7, and the global interconnect wire length W =4, which is optimized for high performance [91]. For dynamic power estimation, we assume that the switching activity SW =0.5. In addition, we also consider heterogeneous L gate for logic blocks (Lc ) and interconnects (Li ). Because SRAMs have little impact on FPGA delay, we assume that all SRAMs use high Vth (1.5X dopant density compared to the other circuit elements) and ﬁx Lgate =32nm. During our evaluation, Vdd is tuned from 1.0V to 1.1V, Lgate (both Lc and Lc ) is tuned from 31nm to 33nm. The experimental setting is summarized in Table 5.1. In our experiment, we assumed that 20 MCNC benchmarks [175] are put in one FPGA chip. Therefore, the power is computed as the sum of all 20 benchmarks and delay is computed as the longest critical path delay of 20 benchmarks.

82

N 6

K 7

W 4

SW 0.5

Vdd (V) 1.0, 1.05, 1.1

Lc (nm) 31, 32, 33

Li (nm) 31, 32, 33

Table 4.1: Experimental setting. Figure 4.8 shows the energy per clock cycle and delay tradeoff between different device settings. From the ﬁgure, we ﬁnd that there is a 3X energy span and a 1.3X delay span within our search space. We also ﬁnd that that the knee point in the energy delay tradeoff ﬁgure is exactly HP32, which has high performance without paying too much power penalty. On the other hand, we also optimize device setting by minimizing energy and delay product (ED). We show the top 2 minimum ED device settings in the ﬁgure, which we refer to as min-ED1 and min-ED2.

14

All device Min-ED HP32

11

Energy (nJ)

8

3.1X Min-ED

5 HP32 1.3X 2 3.5 4.0 4.5 5.0

Delay (ns)

Figure 4.8: Energy and delay tradeoff. Table 4.2 shows the power (P), delay (D), energy per clock cycle (E) , and energy delay product (ED) for HP32 and min-ED device settings. From the table, we see that by applying device tuning, ED can be reduced by up to 29.4% (min-ED1) compared to HP32. We also ﬁnd that it is worthwhile to apply heterogeneous Lgate . The optimum heterogeneous Lgate device setting (min-ED1) has lower ED (0.7nJ·ns lower) than the

83

**optimum homogeneous Lgate device setting (min-ED2).
**

Device Vdd (V) HP32 min-ED1 min-ED2 1.1 1.0 1.0 Lc 32 32 33 Li 32 33 33 P (W) 1.19 0.77 0.74 D (ns) 3.90 4.55 4.73 E (nJ) 1.88 3.50 3.50 ED (nJ·ns) 22.6 15.9(-29.4%) 16.5(-26.8%)

(nm) (nm)

Table 4.2: Power, delay, energy, and ED comparison for different device settings.

4.6.2 Variation Analysis In this section, we apply the chip level variation model as introduced in Section 4.5 to analyze the chip level leakage and delay variation. In our experiment, we consider four types of variation sources, dopant density (Nbulk ), channel gate length (Lgate ), oxide thickness (Tox ), and mobility (µe f f ) [190]. Moreover, because for technology under 65nm, the spatially correlated intra-die variation is small compared to the intradie random variation [190], therefore, we ignore spatial variation in this chapter. For all variation sources, we assume that they follow normal distribution. Notice that our process variation model is based on Monte-Carlo simulation, therefore it can handle any non-Gaussian variation sources. The 3-sigma value of inter-die (3σ g ), intra-die spatial 3σs , and intra-die random (3σr ) variation for all types of variation sources are summarized in Table 4.3. In the table, the 3-sigma values are the represented as the percentage of nominal value. Moreover, for spatial variation, we assume that the correlation coefﬁcient decreases to 1% when distance is 1cm (Dcorr = 1cm). In our experiment, we consider two different variation settings, low variation (LV) and high variation (HV). We generate M = 10, 000 samples for Monte-Carlo simulation and use Mr = 100 samples to compute the leakage mean µi (·). Figure 4.9 illustrates the probability density function (PDF) of delay and leakage power for min-ED1 and HP32 under variation setting LV. From the ﬁgure, we see that

84

85 .0% 3σr 2.8 5 5.0% 2. (b) Delay PDF.0% 3. min-ED1 has much less leakage variation compared to HP32 while its delay variation is only a little bit larger than HP32.0% 6. 4 16 14 12 10 8 6 4 2 0 0. From the table.0% 2. mean (µ).0% 3σg 6.5 1.0% 2.0% Table 4.6 3.0% Dcorr = 1cm HV 3σs 6. We also observe that the leakage roughly has a log-normal distribution and delay roughly has a normal distribution.5 3 2.8 1 1. This is because leakage power is almost exponential to the variation sources while delay is almost linear to the variation sources.0% 3.2% 4.2 4.0% 3.5 1 0.6 Delay (ns) (b) 4.0% 3.3: Experimental setting for process variation.9: Delay and leakage distribution. (a) Leakage PDF.2 min−ED1 HP32 Figure 4.2 0.4 4.0% 4.0% 1.0% 2.0% 4.5 2.5 2 1. we observe that compared to HP32 the min-ED device settings greatly reduce leakage variation (reduce standard deviation by up to 92%) with only a small increase of delay variation (increase standard deviation by up to 37%).4 0.4 summarizes the nominal value (nom).6 0.000 Source Distribution 3σg Nbulk Lgate Txo µe f f Normal Normal Normal Normal 4. we also see that HP32 is more sensitive to process variation than the min-ED device settings.5 2 0 3.4 1.0% Mr = 100 LV 3σs 4.2 Leakage (W) (a) 1.0% 6. and standard deviation (σ) of leakage power and delay for the min-ED device settings and HP32.8 min−ED1 HP32 3.6 1. From the table.M = 10. Table 4.2% 2.0% 3σr 4.0% 2.8 4 4.

90 4.38) (4.245(+32%) Table 4.1 Impact of Device Aging First.122 0. with input Vdd .164(+37%) µ 3. mean. we introduce the impact of device aging effect on FPGA power and delay.162(+33%) 0.4: Nominal value.39) 2ξ1te + ξ2C(t − t0 ) √ ) 2tox + Ct 86 .55 4.185 0. with input 0.2 HV σ 575 68 (-88%) 48 (-92%) 3.2 375.244(+32%) 0.58 4. i. In this section we study these two types of reliability for FPGA. reliability issues such as device aging with Vth degradation and soft error rate (SER) due to high-energy particle or cosmic rays become more and more signiﬁcant.782 σ 0.99 4.95 4.e.7. respectively.e. and hot-carrier-injection (HCI) increases the threshold voltage (Vth )of NMOS transistor [34][149][166]. 4. and standard deviation comparison for leakage power and delay.779 LV Delay (ns) HV σ 0. and recovers during recovery mode.5 + (ΔVth0 )(1/2n) )2n ΔVth = −ΔVth0 (1 − (4. It has been shown that negative-bias-temperature-instability (NBTI) effect increases the threshold voltage of PMOS transistor [11][160][149][23]. PMOS Vth increases during stress mode.7 FPGA Reliability As technology scales down to nanometer. device aging and permanent soft error rate (SER).e. ΔVth = (Kv (t − t0 )0.38) and (4.Device nom µ HP32 min-ED1 min-ED2 854 328 315 982 359 328 Leakage (mW) LV σ 348 51 (-86%) 32 (-89%) µ 1120 398. Due to the effect of NBTI.39) are the equations for PMOS ΔVth over time in the stress and recovery modes.73 nom µ 3.55 4. i. 4. (4. i.

Hence HP32 has larger ΔVth for both NMOS and PMOS than min-ED1. we can obtain ΔVth due to device aging over time. We ﬁrst calculate the effective gate length Lelec and oxide thickness Tox elec and then calculate Vth . supply voltage Vdd and nominal threshold voltage Vth . Interested readers please refer to [23] for more details. With these intermediate parameters. Qi . 87 . From the ﬁgure. Inputs: Lgate Tox Nbulk Xjext W Racc T Vdd Intermediate variables: Lelec Tox_elec Intermediate variables: Vth Output: HCI HBTI Figure 4. Figure 4.40) is the equation for NMOS ΔVth over time due to HCI.40) The key parameters that determine the degradation rate include the inversion charge. The Vth degradation due to device aging results in increase of critical path delay and decrease of leakage power. ΔVth = q K2 Cox Qi e Eo2 e Eox it − qλE φ m tn (4. Figure 4.10 shows the calculation ﬂow for ΔVth due to NBTI and HCI. temperature. A higher Vdd or lower Vth leads to larger Vth increase. This is due to the fact that HP32 uses higher Vdd and lower Vth (due to smaller Lgate ) than min-ED1. Eox . electrical ﬁeld. we do not include detailed explanation for each parameters.(4.11 illustrates the Vth increase (ΔV ) for PMOS and NMOS transistors of both min-ED1 and HP32 within 10 years. Due to the space limit. we see that HP32 is much more sensitive to NBTI and HCI than min-ED1.10: The calculation ﬂow for ΔVth due to NBTI and HCI.

Therefore it has larger Vth degradation. we also compare the change percentage with and without burn-in (burn to 1 year). Table 4. Figure 4.45 40 35 30 Min−ED1 PMOS (NBTI) Min−ED1 NMOS (HCI) HP32 PMOS (NBTI) HP32 NMOS (HCI) Δ V (mV) 25 20 15 10 5 0 0 1 2 3 4 5 6 7 8 9 10 th Time (year) Figure 4. device burn-in [96] that has been used for microprocessors but not FPGAs yet could be used to reduce leakage and delay change over time after product shipment. and that with burn-in is computed as the percentage of the difference between the 10 year value and the 1 year value. min-ED device settings can not only reduce ED but also 88 . we ﬁnd that the min-ED device settings is less sensitive to device aging compared to HP32. In the table. and value after 10 years of leakage power (L) and delay (D). The change percentage without burn-in is computed as the percentage of the difference between the 10 year value and current value. value after 1 year. This is because HP32 has higher Vdd and smaller Lgate compared to the min-ED device settings. It is interesting to see in the ﬁgure that the leakage and delay change is most signiﬁcant at the ﬁrst one year.12 illustrates the critical path delay increase and chip leakage power decrease within 10 years for min-ED1 and HP32. Therefore.5 illustrates the current value. From the ﬁgure and table. hence more sensitive to device aging.11: Vth increase caused by NBTI and HCI. We may see that compare to HP32.

76 (mW) 711 317 307 10 Year L (mW) 640 311 302 D (ns) 4.1 −200 0. increase aging reliability of FPGAs.2 Permanent Soft Error Rate (SER) We use the model in [60] to calculate the soft error rate (SER) for conﬁguration SRAMs in our study.82 w/o burn-in L -25. we can greatly reduce the delay degradation.15 0.41).5% D Change w/ burn-in L -10.59 4.1% -5. The SER in SRAMs is permanent in FPGAs. burn-in reduces 10 year delay degradation from 8.0% -1.73 L 1 Year D (ns) 4.5% to 5.64 4.55 4.05 −250 0 2 4 6 Time (year) (a) 8 10 0 0 2 4 6 Time (year) (b) 8 10 1 Year min−ED1 HP32 ΔP leak −150 Figure 4.1% +1.41) 89 . (a) Leakage change.5: Delay and leakage change due to device aging.12: Impact of device aging.7. for HP32.5% +1.9% Table 4.01 4. The SER for one SRAM cell is shown in (4. by applying burn-in. Moreover. which cannot be recovered unless re-writing those affected conﬁguration bits [14][17].0% +1. 0 0.23 4.5% +2. For example.5%.3 −50 (mW) 1 Year Min−ED1 −100 HP32 Δ Delay (ns) 0.2% -2. SER(SRAM) = F · A · K · e−Qcrit /Qs (4.6% D +5.25 0.35 0. 4.3% HP32 min-ED1 min-ED2 +8. (b) Delay change.9% -1.Device Current L (mW) 854 328 315 D (ns) 3.2 0.90 4.

A transistor with a larger Qcrit needs more energy to incur an error and has a smaller SER. different Vdd and Vth .6 compares the SER for HP32. A is the transistor drain area. i.e. We use the models in [60] with linear interpolation to calculate Qcrit and Qs under different device settings. and the two min-ED device settings. and then calculate the SER of an SRAM cell correspondingly using (4. a transistor with a larger Qs is more effective in collecting charge and has a larger SER. Qcrit and A are then calculated. SER = Num SRAM · SER(SRAM) where Num SRAM is the total number of SRAM cells. With all these intermediate parameters.13 shows the ﬂow to calculate the SRAM SER. Qcrit is the critical charge to incur an error. Inputs: Lgate Tox Nbulk Xjext W Racc T Vdd (4. Qs . and slightly depends on Vth . K is a technology independent constant and is same for all device settings.41). We ﬁrst calculate intermediate parameters including Lelec . On the other hand. Tox elec and Vth .where F is the neutron ﬂux. we can obtain the SRAM SER. and Qs is the charge collection slope. For 90 . Figure 4. Table 4.13: The ﬂow for SRAM SER calculation. Qcrit mainly depends on supply voltage Vdd and loading capacitance C.42) Intermediate variables: Lelec Tox_elec Intermediate variables: Vth Intermediate variables: Qs Qcrit A Output: SER Figure 4. Qs depends on channel doping density Nbulk and Vdd . The chip-level permanent SER can be calculated as.

4. But device aging only increase delay but has no signiﬁcant impact on delay variation.6% higher than that of HP32. In the table. Therefor only Vdd may affect SER.8 Interaction between Process Variation and Reliability In this section. For min-ED1/2 with Vdd of 1.25 (+1.1 Impact of Device Aging on Power and Delay Variation First.6%) Table 4.the min-ED device settings.0v.7 summarizes the mean and standard deviation of leakage and delay for different device settings. we ﬁnd that device aging has great impact of leakage power variation. SER is 1. The supply voltage Vdd in HP32 is higher.1 1. Qcrit .14 compares the PDF of delay and leakage between current value and the value after 10 years for the device setting min-ED1. SER is measured as number of failures in billion hours (FIT).45 368. It is worthwhile to use Min-ED device settings to obtain a signiﬁcant ED reduction with virtually no impact on SER. we discuss the impact of device aging on process variation induced leakage and delay variation. is larger and may result in a smaller SER.6: Soft error rate comparison. hence the critical charge. 4. Device HP32 min-ED1/2 Vdd (V) 1. Figure 4. For HP32 device setting under low 91 . Table 4. we ﬁnd that device aging actually reduce the leakage variation. From the table.0 # SRAMs 12438596 12438596 SER (FIT) 362. From the ﬁgure. Lgate for SRAM cells is still 32nm. The table compares both the current value and the value after 10 years in both low and high variation cases.8. we study the impact of device aging on process variation and the impact of device aging and process variation on SER.

8 5 5. loading capacitance.8% and standard deviation by 65.45 Leakage (W) (a) 0.2 4. However.6 Delay (ns) (b) 4.5 0. we also ﬁnd that in high variation case.7%. This is because HP32 is more sensitive to device aging as discussed in Section 4. when variation scale becomes larger.4 0. device aging reduces mean by 28. device aging only increases standard deviation by 1. Moreover. variation case (LV). For HP32 device setting. the impact of device aging on delay variation is relatively small.13. That is. the impact of device aging on leakage variation is much larger than that in the regular variation case.2 Figure 4.4 4. we observe that compared to HP32 device setting. (b) Delay PDF comparison.2 Impacts of Device Aging and Process Variation on SER As shown in Figure 4. However. From the result. device aging has less impact on power and delay variation for min-ED settings.8.35 0. the SRAM SER depends on Vdd . 4.5 0 0. the impact of device aging on delay variation is similar to that of the regular variation case (min-ED1).55 0 4 4.5 3 2. (a) Leakage PDF comparison. Vth and channel doping density Nbulk while Vth and Nbulk may be affected by device aging and 92 .25 Current After 10 Years 3.7.14: Impact of device aging on delay and leakage PDF.2%. This is because leakage power is rough exponential to Vth while delay is roughly linear to Vth .5 1 10 5 0. when variation is high.3 0.5 Current After 10 Years 20 15 2 1. there is a larger impact of device aging on leakage variation.

245 Delay (ns) 10 years µ 4.1 (-13.162 0.3% 3σ=6% variation in Nbulk -0.79 (+0.2 (-32.7: Impact of device aging on process variation.18% to +0. the SRAM SER only increases by 0.55 4.9%) 4.78 3.9%. After 10 years.21%) Table 4.16%) 0. Nbulk and channel gate length Lgate become random variables.23 (+8.164 (+0.55 4. We perform the study on the impacts of device aging and process variation on SER as below. Table 4.86 (+0.09 (+2. A larger Vth may result in a slightly smaller critical charge Qcrit for an SRAM and therefore a larger SER.4 (-30. Lgate variation may affect Vth and therefore implicitly affect SER.6%) 4.4%) 158 (-72.122 0.914E-5 10 years +0.5%) 38. Nbulk variation directly affects charge collection slope Qs and therefore affects SER. 0 year or nominal SRAM SER (FIT) 2.4%) 4.Device current µ HP32 LV min-ED1 LV min-ED2 LV HP32 HV min-ED1 HV min-ED2 HV 982 359 328 1120 398.8%) 321 (-6.78 current σ 0.4%) 345.2 σ 348 51. a variation in Lgate from 31nm to 33 nm may result in a variation 93 . process variation.2%) 31.66 (+2.0 32.99 4. For ITRS HP 32nm technology.6%) 712 (-36.2%) 309 (-4.0 575 68 48 Leakage (mW) 10 years µ 678 (-28.0%) 4.21%) σ 0.1 (-44.121 (+1.3%) 332 (-12.85 0.9%) µ 3. For ITRS HP 32nm technology (see Figure 4.1%) 4.67%) 0.65 (+2.319 (+0.0%) 25 (-47.11.8: The impact of device aging and process variation on SER of one SRAM under ITRS HP 32nm technology.8 presents the impact of device aging on SER. With process variation. Vth may increase with device aging due to NBTI and HCI.245 (+0.4%) σ 118 (-65.2 375.318 0.7%) 22.95 4.28%) 0.162 (+0. the SRAM SER may increase by 0.3% due to degraded Vth . Even if Vth degrades by 100mv for both NMOS and PMOS. Vth may degrade (or increase) by around 10% (or 20mV) for PMOS due to NBTI and by around 25% (or 50mV) for NMOS due to HCI after 10 years.25%) 1.17% Table 4.54%) 0.164 1.83 (+1.

These two effects compensate with each other and therefore Nbulk does not have a signiﬁcant impact on SER either. We further show that the device aging has a knee point over time.18% to 0. and reduce standard deviation of leakage variation by 87%. We show that applying heterogeneous gate lengths to logic and interconnect may lead to 1. a larger Nbulk may result in a higher Vth and a smaller critical charge Qcrit . Table 4. The framework is efﬁcient as it is based on closed-form formulas. A variation of 3σ as 6% in Nbulk only results in a variation in SER from -0. therefore a larger SER. A few examples of using this framework have been presented.17%. the user can tune eight parameters for bulk CMOS processes and obtain the chip level performance and power distribution and soft error rate (SER) considering process variations and device aging. 3. we have developed a trace-based framework to enable concurrent process and FPGA architecture co-development. On the other hand. Therefore. It has been shown that Vth variation does not have a signiﬁcant impact on SER. A larger Nbulk may result in a larger charge collection slope Qs and therefor a smaller SER. It is clear that neither device aging due to NBTI and HCI nor process variation have any signiﬁcant impact on SER.in Vth from -25% to 21% [68]. this framework is suitable for early stage process and FPGA architecture co-development.8 presents the impact of Nbulk variation on SER.3X delay difference.5% to 5. Same as the MASTAR4 model used by the 2005 ITRS [67][68][148]. 4. and device burn-in to reach the point could reduce the performance change over 10 years from 8. This offers a large room for power and delay tradeoff.1X energy difference.9 Conclusions In this chapter.5% and reduce die 94 . It is also ﬂexible as process parameters can be customized for different FPGA elements and no SPICE models and simulation are needed for these elements.

95 . In addition. we also study the interaction between process variation. We observe that device aging reduces standard deviation of leakage by 65% over 10 years while it has relatively small impact on delay variation. we also ﬁnd that neither device aging due to NBTI and HCI nor process variation have signiﬁcant impact on SER.to die leakage signiﬁcantly. Moreover. device aging and SER.

Part II Statistical Timing Modeling and Analysis 96 .

All the atomic operations (max and sum) of our algorithms are performed by closed-form formulas. skewness. we ﬁrst propose a new method to approximate the max operation of two non-Gaussian random variables through second-order polynomial ﬁtting. we have discussed statistical modeling and optimization for FPGAs. quadratic model without crossing terms (semi-quadratic model). To solve this problem. we present present new non-Gaussian SSTA algorithms for three delay models: quadratic model. we focus on statistical static timing analysis (SSTA). hence they scale well for large designs. Experimental results show that compared to the Monte-Carlo simulation. In Part II of this dissertation. Lots of research works on SSTA have been published in recent years. respectively. 97 . 1%. In this chapter. and 1% error. and 95-percentile point within 1%. Applying the approximation. we study statistical modeling and analysis for ASICs. and linear model. our approach predicts the mean. most of the existing SSTA techniques have difﬁculty in handling the non-Gaussian variation distribution and nonlinear dependency of delay on variation sources. However. standard deviation.CHAPTER 5 Non-Gaussian Statistical Timing Analysis Using Second-Order Polynomial Fitting In Part I. 6%.

182. 163. The goal of block-based SSTA is to parameterize timing characteristics of the timing graph as a function of the underlying sources of process parameters which are modeled as random variables. 19.5. 120. lower power. designers can obtain the timing distribution and its sensitivity to various process parameters. the path-based [103. statistical static timing analysis (SSTA) is developed for full chip timing analysis under process variation. a higher-order delay model is thus needed [180. 5. 163] modeled the gate delay as linear functions of variation sources and assumed all the variation sources are mutually independent Gaussian random variables. 128] and the block-based [32. 2. 181. However. 13. The block-based SSTA was proposed to solve this problem. the linear delay model is no longer accurate [94]. two types of SSTA techniques are proposed. we have studied statistical optimization for FPGA circuits. To analyze the impact of process variation on circuit delay. 142. 178]. the path-based SSTA is not scalable to large circuits. ASICs are widely used in high performance or low power applications. In Part I. In order to capture the non-linear dependence of delay on the variation sources. [163] presented time efﬁcient methods with closed-form formulas [46] for all atomic operations (max and sum). 22. 7. 98 . 75. 9. ASIC has higher performance. Compared to FPGA. Based on such assumption. 157. 49. where the impact of process variation is signiﬁcant. In recent years. Therefore. 3. 8. 179. 76. By performing SSTA. Since the number of paths is exponential with respect to the circuit size.1 Introduction Process variation introduces signiﬁcant uncertainty to circuit delay as discussed in Chapter 1. 53. and smaller area. 113. 183. The early SSTA methods [32. 40] SSTA. 86. 178. when the amount of variation becomes larger. 180. 72.

which is slow. [76. 19. to do so. 19. [19] handled the atomic operations by approximating the gate delay using a set of orthogonal polynomials. 179] considered both non-linear delay model and non-Gaussian variation sources. 40] started to take non-Gaussian variation sources into account. [76] proposed to compute the max operation by a regression based on Monte-Carlo simulation. In this chapter. and linear delay model. but it lacks the capability to handle the crossing term effects on timing. the via resistance has an asymmetric distribution [13]. 13. both had to resort to expensive multi-dimension numerical integration techniques. only the 99 . However. In our method. while dopant concentration is more suitably modeled as a Poisson distribution [142] than Gaussian.As more complicated or large-scale variation sources are taken into account. 179. we introduce a time efﬁcient non-linear SSTA for arbitrary nonGaussian variation sources. we present our SSTA technique for three different delay models: quadratic delay model. the assumption of Gaussian variation sources is also not valid. The major contribution of this work is two-fold. [40] proposed to approximate the probability density function (PDF) of max results as a Fourier Series. quadratic delay model without crossing terms (semi-quadratic model). (1) We propose a new method to approximate the max of two non-Gaussian random variables as a second-order polynomial function based on least-square-error curve ﬁtting. 142. For example. [13] computed the max operation through tightness probability while [179] applied moment matching to reconstruct the max. which needs to be constructed for different variation distributions. (2) Based on the new approximation of the max operation. For example. but the computation cost of these techniques is too high to be applicable for large designs. 13]. 180. Some of the most recent works on SSTA [76. but it was still based on a linear delay model. Experimental results shows that such approximation is much more accurate than the linear approximation based on tightness probability [163. For example. 13. [142] applied independent component analysis to de-correlate the non-Gaussian random variables.

6% error in skewness. For the quadratic delay model. we further apply this technique to handle both semi-quadratic and linear delay model in Section 5.1 Review and Preliminary According to [163].. Moreover. The rest of the chapter is organized as follows: Section 5. i. all the atomic operations are performed by closed-form formulas. 30% error in skewness. 5.ﬁrst few moments are required for different distributions and no extra effort is needed. while [19] needs to obtain different sets of orthogonal polynomials for different variation distributions. Experimental results show that for the semi-quadratic delay model. and indicates less than 1% error in mean.2. A and B. our approach incurs less than 1% error in mean. the tightness probability is deﬁned as the probability of A greater than B.2 Second-Order Polynomial Fitting of Max Operation 5. and 1% error in 95-percentile point.e. given two random variables. the computational complexity of our method is linear in both the number of variation sources and circuit size. experimental results are presented in Section 5.3 presents a novel SSTA algorithm for quadratic delay model with non-Gaussian variation sources.6.5. for the more accurate quadratic delay model. hence they are very time efﬁcient. Section 5. 1% error in standard deviation. 1% error in standard deviation. and 1% error in 95-percentile point. 100 . respectively.2 introduces the approximation of the max operation using second-order polynomial ﬁtting.7 concludes this chapter. our approach is 70 times faster than the non-linear SSTA in [40].4 and Section 5. For the linear and semi-quadratic delay model. the computational complexity is cubic (third-order) to the number of variation sources and linear to the circuit size. with the approximation of max. Moreover. TA = P{A > B} = P{A − B > 0}. Section 5.

However. 164]. particularly when the amount of variation increases and the number of non-Gaussian variation sources increases. we introduce a new ﬁtting method to approximate the max operation. Such a linear approximation is efﬁcient and reasonably accurate when both A and B are Gaussian. TA in [13] has to be computed via expensive multi-dimensional numerical integration. in this case the coefﬁcients can be computed easily. we arrive at: max(A − B. Because (5. B). For example. 0) ≈ TA · (A − B) + c.2). when A and B are non-Gaussian random variables. Moreover. (5. we propose to use a second-order polynomial 101 . B) = max(A − B. B)] can be obtained by closed-form formulas when A and B are Gaussian [163].2. the PDF predicted by such linear approximation is very close the Monte Carlo simulation. because the max operation is an inherently non-linear function.Then the max operation is approximated as: max(A. To overcome these difﬁculties.1) where c is a term used to match the mean and variance of max(A. thus preventing its scalability to large designs. Instead of using the linear function. we can see that the max operation in [163] is in fact approximated by a linear function subject to certain conditions (such as matching the exact mean and variance). as both TA and E[max(A. (5.1) can be further written as max(A. As shown in [163.2 New Fitting Method for Max Operation In this section. B)] are hard to obtain.2) According to (5. we develop a more efﬁcient and accurate approximation method in the next section. 5. Moreover. B) ≈ TA · A + (1 − TA ) · B + c. 0) + B. linear approximation would become less and less accurate. the tightness probability TA and E[max(A.

0). we compute Θ by matching the mean of the max operation while minimizing the square error (SE) between h(V. Θ) = θ2V 2 + θ1V + θ0 . 0) ≈ h(V. θ2 )T are three coefﬁcients of the second-order polynomial h(v. P) and max(V. we may unnecessarily ﬁtting the curve in a wide range without focusing on the high probability region. we ﬁnd that assuming: ε = 3σv (5. Then the impact of ignoring the difference in the low provides a good approximation of max. Θ). and γv are the mean. i.7) outside the ±ε range is low. Mathematically. Intuitively. and skewness of random variable of v V . µm = E[max(V. 0) dv (5. respectively. v v (5. That is. θ1 . However..4). the probability that V lies probability region is justiﬁed.4) where µv .Θ)]=µm µv −ε h(v. (5. when ε is too large.function to approximate the max operation.6) In (5. 0) within the ±ε range of V . In other words. 0)] E[h(V. In this chapter. σ2 . The problem thus becomes how to obtain the ﬁtting parameters of Θ. Θ)] are the exact and approximated mean of max(V. ε is the approximation range. respectively. Different from the linear ﬁtting method through tightness probability.3) where Θ = (θ0 . The value of ε may affect the accuracy of the second order approximation. Θ) − max(v. max(V. Θ)] = θ2 (σ2 + µ2 ) + θ1 µv + θ0 .e. it would still be more or less Gaussian like. variance. when ε is large. this problem can be formulated as the following optimization problem: Θ = arg min Z µv +ε 2 E[h(V. while µm and E[h(V. This can be explained by the fact that. although the circuit delay is not Gaussian.5) (5. 102 .

14) 4Δ1/3 √ where Δ = γv · σ3 + j 8σ6 − γ2 · σ6 . It is proved in [184] that. i. c1 . E[c2 ·W 2 + c1 ·W + c0 ] = µv (5. When V is a non-Gaussian random variable.e. when v v v v √ |γv | < 2 2 there exists one and only one of the above three values in the range |c 2 | < √ √ σv / 2. and c0 can be obtained by matching P(W ) and V ’s mean.3 = (5. we approximate the non-Gaussian random variable V as a quadratic function of a standard Gaussian random variable W similar to [184].4). (5. i.9) (5.11) E[(c2 ·W 2 + c1 ·W + c0 − µv )2 ] = σ2 v E[(c2 ·W 2 + c1 ·W + c0 − µv )3 ] = γv · σ3 v As shown in [184]. we propose to use the following two-step procedure to approximately compute µm . we ﬁrst need to compute µm . c2 will be one of the following values: 2σ2 + Δ2/3 v (5.. solving the above equations.its PDF would be most likely centering around the ±3σ range of the mean. with j = −1. When |γv| > 2 2. We will pick such value for c2 .10) (5. variance and skewness simultaneously. In the ﬁrst step.15) √ −σ /√2 γ < −2 2 c2. we may let: √ √ σv / 2 γv > 2 2 c2 = (5. Therefore.e. In practice. the users can choose the range ε according to the delay distributions.13) 4Δ1/3 √ √ 2(1 − j 3)σ2 + (1 + j 3)Δ2/3 v c2.1 = − v v 103 . In order to solve (5..8) The coefﬁcients c2 . V ≈ P(W ) = c2 ·W 2 + c1 ·W + c0 .2 = (5.12) 1/3 2Δ√ √ 2(1 + j 3)σ2 + (1 − j 3)Δ2/3 v c2. exact computation of µm is difﬁcult in general.

20) (5. 0) by the exact mean of max(P(W ). 0)..24) (5. the integration range P(w) > 0 can be computed in four different cases: Case 1: c2 > 0 P(w) > 0 ⇔ w < t1 ∨ w > t2 where t1 = (−c1 − t2 = (−c1 + Case 2: c2 < 0 P(w) > 0 ⇔ t2 < w < t1 Case 3: c2 = 0 ∧ c1 > 0 P(w) > 0 ⇔ w > −c0 /c1 Case 4: c2 = 0 ∧ c1 < 0 P(w) > 0 ⇔ w < −c0 /c1 104 (5.19) . µm ≈ E[max(P(W ).16) (5.18) where φ(·) is the PDF of the standard normal distribution. we approximate the exact mean of max(V. and c0 in the second step.e. With c2 .It is proved in [19] that the above equations gives P(W ) which matches the mean and variance of V and has the skewness closest to γv .17) c0 = µv − c2 After obtaining the coefﬁcients c2 .21) (5. 0)] = Z P(w)>0 P(w)φ(w)dw (5. c1 . c1 and c0 can be computed as: c1 = σ2 − 2c2 v 2 (5.23) (5. In the above equation.22) c2 − 4c2 c0 )/2c2 1 c2 − 4c2 c0 )/2c2 1 (5. i.

the circuit delay has close-to-normal distribution as discussed above.27) 105 .25). v v 2 Z l2 (5. Θ)dv − v dv 2 2 θ2 (v2 − µ2 − σ2 ) + θ1 (v − µv ) + µm dv + v v θ2 (v2 − µ2 − σ2 ) + θ1 (v − µv ) + µm − v dv. Fortunately. we can compute E[max(P(W ). After obtaining µm . we need to ﬁnd Θ in (5. Therefore. When V is not close to the normal distribution.25) c2 > 0 c2 < 0 c2 = 0 ∧ c1 > 0 c2 = 0 ∧ c1 < 0 where Φ(·) is the cumulative density function (CDF) of the standard normal distribution. the square error in (5. According to (5. We ﬁrst write the constraint in (5.3) by solving the constrained optimization problem of (5.4) can be written as: SE = = Z0 µv −3σv Z 0 l1 0 (5. we show that (5. v v Replacing θ0 in (5. in practice. 0)] = (c2 + c0 ) 1 + Φ(t1 ) − Φ(t2) + (c1 + t1 ) φ(t2) − φ(t1) (c2 + c0 ) Φ(t1 ) − Φ(t2) + (c1 + t1 ) φ(t2) − φ(t1) c · Φ(c /c ) + c · φ(c /c ) 0 0 1 1 0 1 c · 1 − Φ(c /c ) − c · φ(c /c ) 0 0 1 1 0 1 (5. Θ)dv + µvZ v +3σ 0 h(v.4) by (5.Knowing the integration range.26) h2 (v. we can compute µm easily through analytical formulas.4). the above approximation of µm is no longer accurate. In practice. In the following. V is uniform distribution.4) as follows: θ0 = µm − θ2 (µ2 + σ2 ) − θ1 µv .26). the above approximation of µm is accurate only when the distribution of V is close to the normal distribution.4) can be solved analytically as well. 0)] under such four cases: E[max(P(W ). for example. such approximation is accurate for SSTA.

Then the optimum of Θ can be obtained by setting the derivative of (5. we can transform the constrained optimization of (5.30) (5. where S = (si j ) is a 2 × 2 matrix.4) to the following unconstrained optimization problem.29) to zero.28) (5. Therefore. Q = (qi ) is a 1 × 2 vector. By expanding the square and integral. which is a quadratic form of Θ = (θ1 . Q.37) 106 .29) SE = Θ T SΘ + QΘ + t.32) (5.35) Because the square error is always positive no matter what value the Θ is.34) 2 2 2 3 q1 = l2 µv + (l2 − l1 )µm − 2l2 /3 − 2(l2 − l1 )µvµm 2 3 3 q2 = l2 (µ2 + σ2 ) + 2(l2 − l1 )µm /3 − v v 3 2l2 /3 − 2(l2 − l1 )(µ2 + σ2 )µv v v 3 2 t = l2 /3 + (l2 − l1 )µ2 − l2 µm m (5. The parameters of S. S is a positive deﬁnite matrix. resulting a 2 × 2 system of linear equations: ∂SE = 2s11 · θ1 + 2s12 · θ2 + q1 = 0 ∂θ1 ∂SE = 2s22 · θ2 + 2s12 · θ1 + q2 = 0 ∂θ2 (5.where l1 = µv − 3σv and l2 = µv + 3σv .29) is to minimize a second-order convex function of Θ without constraints.31) 5 5 2 2 3 3 s22 = (l2 − l1 )/5 + (l2 − l1 )(µ2 + σ2 )2 − 2(l2 − l1 )(µ2 + σ2 )/3 v v v v 4 4 s12 = s21 = (l2 − l1 )/4 + (l2 − l1 )µv(µ2 + σ2 ) + v v 3 3 2 2 (l2 − l1 )µv /3 − (l2 − l1 )(µ2 + σ2 )/2 v v (5.33) (5. θ2 )T : Θ = arg min SE (5.36) (5. and t can be computed as: 3 3 2 2 s11 = (l2 − l1 )/3 + (l2 − l1 )µ2 − (l2 − l1 )µv v (5. and t is a constant. (5.

7. 5. 0). we can obtain the ﬁtting v parameters Θ for max(V. the PDF of our second-order ﬁtting method has a sharp peak which approximates the impulse of exact PDF well as shown in Fig. we see that for a random variable V with any distribution. linear ﬁtting through through tightness probability. Θ = (0. and our second-order ﬁtting can be obtained as shown in Fig.2. In particular.2.38) (5. µv + 3σv ). where the x-axis is V . From (5.2. by assuming V ∼ N(0. we compare the results obtain from our approach. 0) through closed-form formulas. the exact max operation max(V. 107 .2. That means: If µv > 3σv . we see that our proposed second-order ﬁtting method is more accurate than the linear ﬁtting method. the linear ﬁtting method. when V ∈ (µv − 3σv . variance σ2 . The corresponding PDFs of the three approaches are shown in Fig. In contrast. it is easy to ﬁnd that in this case. and skewness γv .40) To show the accuracy of our second-order ﬁtting approach to the max approximation. From the ﬁgures. 5. the linear ﬁtting method can only give a smooth PDF. Notice that in (5. 0). From the above discussion.4). if we know its mean µv .2. and y-axis is the results of max(V.26). 0). 0) = V max(V. 5.4) we try to minimize the mean square error within the ±3σ range. 1). For example.2. 0) ≈ V when µv > 3σv (5. V is always larger than zero within the ±3σ range.Such a system of linear equations can be solved efﬁciently: θ1 = θ2 s12 · q2 − s22 · q1 2s11 · s22 − 2s2 12 s12 · q1 − s11 · q2 = 2s11 · s22 − 2s2 12 (5. and the exact (or Monte Carlo) computation. 1. That is: max(V.39) With θ1 and θ2 . which is far away from the exact max result. we can compute θ0 from (5.

5 2 1.Xn . R is local random variation. In order to simplify the above delay model. R) (5.3. we apply the Taylor expansion to approximate it. 5. Xn )T are global variation sources.6 0. we assume that Xi ’s and R are mutually independent. Here.5 1 0. n is the number of global variation sources. linear ﬁtting. and second-order ﬁtting for max(V. and second-order ﬁtting for max(V. . X2 .5 0 −0. the circuit delay is a complicate function of variation sources: D = f (X1 . we assume all Xi ’s and R have zero mean and unit variance. .1 Quadratic Delay Model In Section 5.0) Second−Order Fit Linear Fit (a) (b) Figure 5.2. . Without loss of generality. We ﬁrst discuss the delay model. .2 0.41) where X = (X1 .8 0.5 3 2. we introduce the second-order ﬁtting of max operation. 0). .1 0 −2 −1 0 1 2 3 Max(V.3 0.3 Quadratic SSTA 5.5 0. Considering that the scale of local random variation is 108 . 0).7 0. linear ﬁtting. In practice. X2. (b) PDF comparison of exact computation.9 0. .1: (a) Comparison of exact computation.5 −1 −2 −1 0 1 2 3 Max(V.4 3.4 0. Here we will apply such ﬁtting in SSTA.0) Second−Order Fit Linear Fit 1 0.

in practice.44) (5. the coefﬁcients A. B. a2 . respectively. . A = (a1 . an ) are the linear delay sensitivity coefﬁcients of the global variation sources. we assume that R is a random variable with standard normal distribution.k and mr. Moreover. B = (bi j ) are the second-order sensitivity the local random variation.k to represent the kth central moment for Xi and the kth moment for R.usually smaller than that of global variation. and r can be obtained by measurement or SPICE simulation. the local random variation is caused by several independent factors.43) (5. Finally. R will be close to a Gaussian random variable. In this chapter. therefore. Moreover. . . we use mi. Notice that all Xi ’s and R are with zero mean. f is a complex function and can not be expressed as closed-form formula. j=1 ∑ bi j Xi X j + ∑ ai Xi + rR i=1 T = d0 + AX + X BX + rR (5. we use second order Taylor expansion to approximate Xi ’s and use ﬁrst order expansion to approximate R: n n D ≈ d0 + i. According to the central limit theorem.45) In practice.42) where d0 is the nominal delay.46) In the rest of this chapter. 109 . and r is the linear delay sensitivity coefﬁcient of (5. the local random variation R is the sum of several independent random variables. B. for simplicity. which is an n × n matrix. In this case. A. that means their central moments and raw moments are the same. and r are calculated in the similar way as [180]: ∂f ∂Xi ∂2 f bi j = ∂Xi ∂X j ∂f r = ∂R ai = coefﬁcients. we obtain our quadratic delay model as follows: D = d0 + AX + X T BX + rR (5.

2. and skewness of D p .2. are needed. Ds = D1 + D2 = ds0 + As X + X T Bs X + rs Rs .we also assume that the moments of the variation sources. otherwise. 0) + D2 .48) 5. D2 ) = dm0 + Am X + X T Bm X + rm Rm .46): D1 = d01 + A1 X + X T B1 X + r1 R1 . such moments can be computed from the samples of variation sources.47) (5.k and mr. We will present how these two operations are handled in the rest of this section. we want to compute Dm = max(D1 .2. That is. Without loss of generality. D2 ) = max(D1 − D2 .2.51) let D p = D1 − D2 .49) (5. The overall ﬂow of the max operation is illustrated in Figure 5. we compute the joint moments between D p and Xi ’s and the skewness of D p . we propose a novel technique to compute the max of two random variables. (5. mi. if µD p > 3σD p .k .2 to ﬁnd 110 .50) (5. max and sum. Knowing the mean.3. are known. we apply the method as shown in Section 5. D2 = d02 + A2 X + X T B2 X + r2 R2 . We ﬁrst compute the mean and variance of D p . given D1 and D2 with the quadratic form of (5. Based on the second-order polynomial ﬁtting method as discussed in Section 5. In practice. then Dm = D1 . To compute the arrival time in a block-based SSTA framework. two atomic operations. (5. we assume E[D p ] > 0. variance.2 Max Operation for Quadratic Delay Model The max operation is the most difﬁcult operation for block-based SSTA. Considering Dm = max(D1 .

Compute joint moments between D2 and Xi p 4.the ﬁtting coefﬁcients Θ = (θ0 . Finally. Input: Quadratic form of D1 and D2 Output: Quadratic form of Dm 1. Compute joint moments between Dm and Xi 7. we use the moment matching method to reconstruct the quadratic form of Dm .2.54) (5. θ1 . we ﬁrst obtain the quadratic form of D p as follows: D p = D1 − D2 = d p0 + A p · X + X T B p X + r1 · R1 − r2 · R2 d p0 = d01 − d02 A p = A1 − A2 B p = B1 − B2 (5. Compute γD p 5. 0). compute µD p and σD p if µD p ≥ 3σD p { 2.3. R1 .52) (5. θ2 ) for max(D p .D2 ). and R2 are mutually independent random variables with mean 0 111 . Reconstruct the quadratic form of Dm } Figure 5. Let D p = D1 − D2 . Get ﬁtting coefﬁcients Θ 6.55) Because Xi ’s. Dm = D1 } else { 3.53) (5. 5.2: Algorithm for computing max(D1 .1 Mean and Variance of D p In order to compute the mean and variance.

4 + 2a pi b pii mi.k=i ∑ (b p j j b p kk + 2b2 jk ) p (5.3 + 2a p j b p j j m j.3 + 2a pi b pii mi.5 + p p0 (a2 + 2d p0 b pii ) · mi.3 ) + pi 1≤i< j≤n ∑ 2(b2 j + b pii b p j j ) − ( ∑ b pii )2 pi i=1 n (5.60) (b pii b p j j + 2b2 j )mi.and variance 1.3.4 + b2 j j m j.5 + p pi pii 2· 1≤ j≤n j=i ∑ a pi b p j j + 2a p j b p i j + (5. the joint moments between D2 and Xi .4 + b2 mi.56) 2 2 = r1 + r2 + ∑ (a2 + b pii mi.59) (5.2 Joint Moments and Skewness of D p From the deﬁnition of D p as shown in (5.3 − 2r1 µD p p 2 2 E[Xi2 D2 ] = d 2 + r1 + r2 + 2d p0 a pi mi.3 pi 2 E[R1 D2 ] = r1 mr. the mean and variance of D p can be computed as: n µD p = E[D p] = d p0 + ∑ b pii σ2 p = E[(D p − µD p )2 ] D n i=1 i=1 (5.3 + p 2(b pii b p j j + 4b2 j ) · mi.2.4 + 4b p j j b pi j mi.58) (5.3 + 2r1 µD p p 2 E[R2 D2 ] = r2 mr.57) 5.3 + 2a pi b pii mi.61) 112 .4 + b2 mi.3 + pi p 2· 1≤ j<k≤n j.3 + 2b pi j b p j j m j. p Xi2 .52).6 + pi pii ∑ 1≤ j≤n j=i a2 j + 2d p0 b p j j + 2(a pi b p j j + 2a p j b pi j )mi.3 m j. and R can be computed as: E[Xi D2 ] = 2d p0 a pi + (a2 + 2d p0 b pii )mi.

E[Xi X j D2 ] = 2a pi a p j + 4d p0 b pi j + 2 · (2a pib pi j + a p j b pii )mi.65) (5.68) Dq = θ 1 · D p + D 2 113 . and skewness of D p is computed as: E[D2 ] = µ2 p + σ2 p p D D E[D3 ] = E[D p · D2 ] p p = E[(d p0 + A p X + X T B p X + r1 R1 − r2 R2 )D2 ] p = d p0 · E[D2 ] + r1 · E[R1 D2 ] − r2 · E[R2D2 ] + p p p n i=1 (5.3 + 4 · pi 1≤k≤n k=i.3 + 4b p j j b pi j m j. θ2 ) for max(D p .3 Reconstruct the Quadratic Form of Dm Knowing µD p .4 + 2 · (2b2 j + b pii b p j j)mi.3.2. central moments.3m j.3 + p 4b pii b pi j mi. we apply the second-order ﬁtting method to compute the ﬁtting parameters Θ = (θ0 . and γD p . σD p . θ1 . D2 ) = max(D p .62) With the joint moments computed above.66) 5. j ∑ b pi j b pkk (5.63) ∑ a pi · E[XiD2 ] + b pii · E[Xi2D2 ] + p p b pi j E[Xi X j · D2 ] p (5. 0). the raw moments.4 + 2 · (2a p j b pi j + a pi b p j j )m j. 0) + D2 ≈ θ 2 · D2 + θ 1 · D p + θ 0 + D 2 p = θ 2 · D2 + D q + θ 0 p (5. Then Dm can be represented as: Dm = max(D1 .67) (5.64) 0≤i< j≤n ∑ mD p (3) = E[(D − µD )3 ] γD p = mD p (3)/σ3 p D = E[D3 ] − 3µD p · E[D2 ] + 2µ3 p p p D (5.

because D p and Dq are in quadratic form as (5.62).75) (5. and joint moments between Dm and variation sources: E[Dm ] = θ2 · E[D2 ] + E[Dq ] + θ0 p (5.67) is a 4th order polynomial of Xi ’s.58)-(5.72) (5.3 E[Xi2 Dq ] = dq0 + aqi mi.76) 1≤ j≤n j=i E[Xi Dq ] = aqi mi.70) (5.80) 114 . we ﬁrst compute the the mean of Dm .46).79) (5.69) (5. In order to reconstruct the quadratic form for Dm .74) E[Xi X j Dm ] = θ2 · E[Xi X j D2 ] + E[Xi X j Dq ] p E[R1 Dm ] = θ2 · E[R1 D2 ] + E[R1Dq ] p E[R2 Dm ] = θ2 · E[R2 D2 ] + E[R2Dq ] p E[Xi2 Dm ] = θ2 · E[Xi2D2 ] + E[Xi2Dq ] + θ0 p E[Xi Dm ] = θ2 · E[Xi D2 ] + E[XiDq ] p The joint moments between D2 and Xi ’s are computed in (5.78) (5.3 + bqii mi.71) (5.73) (5. However. the representation of Dm in (5.form formula of Dm .2 + bqii mi.4 + E[Xi X j Dq ] = 2bqi j E[R1 Dq ] = θ1 · r1 E[R2 Dq ] = (1 − θ1 ) · r2 ∑ bq j j (5. The joint p moments between Dq and variation sources are: n E[Dq ] = dq0 + ∑ bqii i=1 (5.The above equation gives the closed.77) (5.

74).86) (5. by applying the moment matching technique similar to [178].4 − m2 − 1 i.82) (5.87) (5.91) Where E[R1 · Dm ] and E[R2 · Dm ] are computed in (5. Because the random term in Dm comes from the random terms in D1 and D2 . Because the random variation sources Rm . and E[Xi X j Dm ] computed in (5. the sensitivity coefﬁcients Am = (ami ) and Bm = (bmi j ) can be obtained by solving the linear equations above: bmii = E[Xi2Dm ] − E[Dm ] − E[Xi Dm ]mi. we have: rm1 = E[R1 · Dm ] rm2 = E[R2 · Dm ] rm = 2 2 rm1 + rm2 (5.72).85) (5. as shown in (5. we assume that rm Rm = rm1 R1 + rm2 R2 .3 n dm0 = E[Dm ] − ∑ bmii i=1 Finally.83) (5.89) (5.90) (5.84). we have: n E[Dm ] = i=1 ∑ bmii + dm0 ∑ bm j j (5. R1 .88) bmi j = E[Xi X j Dm ] ami = E[Xi Dm ] − bmii · mi. E[Xi Dm ]. E[Xi2Dm ].3 mi.3 With E[Dm ].4 + E[Xi X j Dm ] = 2bmi j E[Xi Dm ] = ami + bmii · mi.69)-(5. and R2 are Gaussian random variables.73) and (5. 115 .3 (5.3 + bnii · mi.Because we want to reconstruct Dm in the quadratic form.81) (5. by applying the moment matching technique similar to (5.84) 1≤ j≤n j=i E[Xi2Dm ] = ami · mi.49). we consider the random term of Dm .

46) is 116 . Ignoring the crossing terms.4 Semi-Quadratic SSTA 5.4.3 Sum Operation for Quadratic Delay Model The sum operation is straight forward.92) (5. 19]. we need to compute the sum of two n × n metric. the total complexity is O {n3 N}. if we ignore the crossing terms. For the sum operation.4 Computational Complexity of Quadratic SSTA For each max operation in SSTA based on a quadratic delay model. where n is the number of variation sources.95) 5. we need to compute the sum of n numbers. the impact of the crossing terms are usually very weak [178.1 Semi-Quadratic Delay Model In Section 5.3. we need to calculate n2 joint moments between D p and Xi X j s.3.5.94) (5. in practice. Because the total number of max and sum operations is linear with respect to circuit sizes N. we introduce the SSTA for quadratic delay model. The coefﬁcients of Ds can be computed by adding the correspondent coefﬁcients of D1 and D2 : ds0 = d01 + d02 As = A 1 + A 2 Bs = B 1 + B 2 rs = 2 2 r1 + r2 (5.93) (5. for each joint moment. the quadratic delay model in (5.3. Therefore. However. we may speed up the SSTA process without affecting the accuracy too much. hence the computational complexity is O {n 3 }. hence the computational complexity is O {n 2 }. 5.

bn ) are the second order sensitivity coefﬁcients for the square terms. 5.99) d p0 = d p0 + ∑ b pi Ypi = a pi Xi + b pi Xi2 − b pi i=1 variables with mean 0 and variance 1. Xn )T are the square of variation sources.. A. . In order to compute the central moments of D p . . . and R are deﬁned in the similar way as the quadratic delay model in 2 2 2 (5. .2 Max Operation for Semi-Quadratic Delay Model The overall ﬂow of the max operation of semi-quadratic delay model is similar to that of quadratic delay model as illustrated in Figure 5.4. X2 . X 2 = (X1 . we ﬁrst rewrite D p to the following form: n D p = d p0 + ∑ Ypi + r1 R1 − r2 R2 i=1 n (5.97) (5.1 Moments of D p We deﬁned D p = D1 − D2 in a similar way as (5.re-written as: D = d0 + AX + BX 2 + rR (5.46). and B = (b1 . . for the semi-quadratic delay model in the rest of this section.96) where d0 .52).4.98) (5. 5. Because Xi s are mutually independent random 117 . b2 . Ypi s are independent random variables with zero with a pi = a1i − a2i and b pi = b1i − b2i . max and sum. . We will introduce the atomic operations. .2. The only difference is that we do not need to compute the joint moments between the crossing terms and D m . r. We deﬁned such quadratic model without crossing terms as Semi-Quadratic Delay Model.2.

105) (5.106) (5. Therefore.4 − 1) + 2a pi b pi mi. Here the joint moments between Xi s and D2 .6 − 3mi.107) (5.108) E[Xi Dq ] = E[XiYqi ] E[Xi2 Dq ] = E[X 2Yqi ] 2 2 where E[Ypi ] are computed in (5.69)-(5.104) With the central moments of D p .2 Reconstruct the Semi-Quadratic Form of Dm Similar to the quadratic SSTA.4 − 1 + pi pi a3 mi.4 + 2 pi pi (5. 5. Dq are: p 2 E[Xi D2 ] = E[XiYpi ] + 2d0 E[XiYpi ] p 2 E[Xi2D2 ] = E[Xi2Ypi ] + 2d0 E[Xi2Ypi ] + E[Xi2]E[(D p −Ypi )2 ] p (5.109) .5 − 2mi.4.2. the ﬁtting coefﬁcients Θ can be obtained. Then the joint moments between Xi s and Dm can be computed in the similar ways as (5.74). E[(D p − Ypi )2 ] = σ2 p − E[Ypi ].3 + b3 mi. the raw moments and skewness can be computed easily.101) 3 ∑ E[Ypi] n mD p (3) = E[(D p − µD p ) where 3 3 3 ] = r1 + r2 + (5.mean.100) (5. the ﬁrst 3 central moments of D p are: µD p = d p0 2 2 2 σ2 p = r1 + r2 + ∑ E[Ypi ] D i=1 n (5. and Yqi s are D deﬁned in the similar way as Ypi s: Yqi = aqi Xi + bqi Xi2 − bqi 118 (5. variance and skewness of D p .3 + 3a2 b pi mi.3 + a2 pi pi (5. by knowing the mean.102).103) 3 E[Ypi ] = 3a pi b2 mi.102) i=1 2 E[Ypi ] = b2 (mi.

And then A m and Bm can be obtained by solving such equations.110) (5.5. The number of max and sum operations is linear to the circuit size.3 + b pi mi. as shown in (5. The sum operation of the semi-quadratic delay model is similar to that of the quadratic delay model.1 Linear Delay Model In the previous sections. From the discussion above. In practice. by applying the moment matching technique.5 Linear SSTA 5. we introduce the SSTA for non-linear delay model.4 + b2 mi.4 − 1 a2 − 22 mi.5 − mi.113) 2 E[XiYpi ] = b2 mi. we can compute the random term of D m in the same way as the quadratic model. the number of variation sources.111) (5.6 − 1 + 2a pi b pi mi. Knowing the joint moments between Xi s and Dm .4 − 1 + a2 − 2b2 mi. it is easy to see that for the semi-quadratic SSTA.84).3 The joint moments between Xi s and Yqi s can be computed in the same way.5 + 2a pi b pi mi. We can compute Ds = D1 + D2 by summing up the correspondent coefﬁcients. when the variation scale is small.3 pi pi pi E[XiYpi ] = a pi + b pi mi. The joint moments between Xi s and Ypi s are: 2 E[Xi2Ypi ] = E[Xi2Ypi ] = a pi mi.112) (5. Finally. we will have similar equations as (5. where n is the 5. the circuit delay can be approximated as a linear 119 .3 pi pi pi (5. computational complexity for both max and sum operation is O {n}.91).with aqi = θ1 (a1 − a2 ) + a2 and bqi = θ1 (b1 − b2 ) + b2 .

function of variation sources: D = d0 + A · X + r · R (5.114)

where X , d0 , A, r, and R are deﬁned in the similar way as the quadratic delay model in (5.46). The difference is that there are no second order terms. Hence the atomic operations of the linear delay model are much simpler than those of quadratic delay model. We will introduce such operations in the rest of this section.

5.5.2 Max Operation for Linear Delay Model The overall ﬂow of the max operation of linear delay model is similar to that of quadratic delay model as illustrated in Figure 5.2 except that we do not need to compute the high order joint moments between Dm and Xi s.

5.5.2.1 Mean and Variance of D p The linear form of D p = D1 − D2 can be computed in the similar way as (5.52). Then the mean and variance of D p can be computed as: µD p = E[D p ] = d p0 ;

2 2 σ2 p = E[(D − µD )2 ] = ∑ a2 + r1 + r2 D pi i=1 n

(5.115) (5.116)

5.5.2.2 Joint Moments and Skewness of D p The joint moments between D2 and the variation sources and random variation can be p computed as: E[Xi · D2 ] = a2 · mi,3 + 2a pi · d p0 p pi (5.117)

120

**With the joint moments computed above, the third order raw moments D p is computed as: E[D3 ] = ∑ ai E[Xi · D2 ] + r1E[R1 · D2 ] + r2 E[R2 · D2 ] p p p p
**

i=1 n

(5.118)

The second order raw moment, central moments, and skewness of D p can be computed in the same way as the quadratic SSTA in (5.66).

5.5.2.3 Reconstruct the Linear Canonical Form of Dm Knowing the mean, variance, and skewness of D p , we may apply the method in Section 5.2.2 to compute the ﬁtting parameters Θ for max(D p , 0), and then represent Dm in the form of: Dm ≈ θ 2 · D2 + D q + p 0 p (5.119)

where Dq is deﬁned in the similar way as (5.67). The above equation gives the closedform formula of Dm , however, such formula is not in the linear canonical form. We still apply moment matching technique to reconstruct Dm to the linear form. First, the mean of Dm and the joint moments between Dm and variation sources are: E[Dm ] = θ2 · E[D2 ] + E[Dq ] + p0 p (5.120) (5.121)

E[Xi · Dm ] = θ2 · E[Xi · D2 ] + E[Xi · Dq ] p

Where E[D2 ] is computed in (5.118). The joint moments between Dq and variation p sources are computed as: E[Dq ] = ds0 E[Xi · Dq ] = aqi (5.122) (5.123)

121

With the joint moments, by applying the moment matching technique in [178], we have: ami = E[Xi · Dm ] dm0 = E[Dm ] (5.124) (5.125)

Finally, the random term rm can be computed in the same way as the quadratic model in (5.91). For the linear SSTA we discussed above, the computational complexity for both max and sum operation is O {n}. Similar to the quadratic SSTA, the number of max and sum operations is linear to the circuit size.

5.6 Experimental Result

We have implemented our SSTA algorithm in C for all three delay models we discussed before: quadratic delay model (Quad SSTA), semi-quadratic delay model (Semi-Quad SSTA), and linear delay model (Lin SSTA). We also deﬁne three comparison cases: (1) our implementation of the linear SSTA for Gaussian variation sources in [163], which we refer to as Lin Gau; (2) the non-linear SSTA using Fourier Series approximation for non-Gaussian variation sources in [40] (Fourier SSTA); (3) 100,000-sample Monte-Carlo simulation (MC). We apply all the above methods to the ISCAS89 suite of benchmarks in TSMC 90nm technology. In our experiment, we consider two types of variation sources gate channel length Lgate and threshold voltage Vth . For each type of variation source, inter-die, intra-die spatial, and intra-die random variation are considered. We use the grid-based model in [172, 173] to model the spatial variation. The number of grids (the number of spatial variation sources) is determined by the circuit size and larger circuits have more variation sources. We also assume that the 3σ value of the inter-die (σ g ), intra-die

122

spatial (σs ), and intra-die random (σr ) variation are 10%, 10%, and 5% of the nominal value, respectively. In the following, we perform the experiments for two variation setting: (1) both Lgate and Vth have skew-normal distributions [16]; and (2) Lgate has a normal distribution and Vth has a Poisson distribution. The experimental setting is shown in Table 5.1.

Setting (1) (2) Lgate dist Skewnormal Normal Vth dist Skewnormal Poisson 3σg 10% 10% 3σs 10% 10% 3σr 5% 5%

Table 5.1: Experiment setting. The 3σ value is the normalized with respect to the nominal value. Fig. 5.3 illustrates the PDF comparison for circuit s15850 under variation setting (1). In the ﬁgure, delay is normalized with respect to the nominal value. From the ﬁgure, we ﬁnd that, compared to the Monte-Carlo simulation, the Quad SSTA is the most accurate, the next is Semi-Quad SSTA, and Lin SSTA is least accurate. Such result is expected, because the Quad SSTA captures all the second-order effects; the Semi-Quad SSTA captures only partial second order effects; while the Lin SSTA captures only linear effects and ignores all non-linear effects. Moreover, Fourier SSTA [40] has similar accuracy as Semi-Quad SSTA because they both apply semi-quadratic delay model. In addition, we also ﬁnd that all these three non-Gaussian SSTA is more accurate than Lin Gau [163]. Table 5.2 compares the run time in second (T ), and the error percentage of mean (µ), standard deviation (σ), and skewness (γ) under variation setting (1). In the table, the error percentage of mean (µ), standard deviation (σ), and 95-percentile point (95%) skewness γ is computed as 100 × (γMC − γSSTA )/γMC . Moreover, the average error in the table is average of the absolute value; and the average runtime is the average runtime ratio between SSTA and Monte-Carlo simulation. For each benchmark, G is computed as 100 × (MC value − SSTA value)/σMC , and the error percentage of

123

and N refers to the total number of variation sources.1 1. standard deviation. ﬁnally. all the three nonGaussian SSTA methods predict more accurate mean values and 95-percentile points.3 1. refers to the number of gates.5 Figure 5. the error of skewness is up to 30% and the error of 95-percentile point is up to 2%. Lin SSTA result similar mean and standard deviation error.8 0.9 Normalized Delay 1 1.5 1 0. Semi-Quad SSTA.5 0 0.4 1. and the non-linear SSTA methods. and the error of skewness is within 5%.2 1. From the table. 124 . we see that for the Quad SSTA. and the inaccurate skewness results larger error of 95-percentile point. but the error of skewness and 95-percentile point is much larger.3%.3: PDF comparison for circuit s15850.3. the error of mean.5 MC Quad SSTA Semi−Quad SSTA Lin SSTA Lin Gau Fourier SSTA 3 2.7 0. we still use the non-Gaussian variation sources to reconstruct the PDF of the circuit delay. which is the most important timing characteristic.5 2 1.6 0. For fair comparison. we ﬁrst approximate the non-Gaussian random variables to the Gaussian random variables by matching the mean and variance. especially for the Lin SSTA. then use Lin Gau to obtain the linear canonical form of the circuit delay. Compared to Lin Gau. when we perform Lin Gau on non-Gaussian variation sources. and 95-percentile point is within 0. This is because the Lin SSTA ignores all non-linear effects which signiﬁcantly affect the skewness.

This is because the computational complexity of Semi-Quad SSTA. we can ﬁnd the similar trend as Table 5. This shows that our approach works well for different distributions.Quad SSTA and Semi-Quad SSTA also give much more accurate skewness. but the run time of Quad SSTA is longer especially when the number of variation sources is large. while the computational complexity of Semi-Quad SSTA (O {n}) is lower than that 125 . We also observe that compared to Fourier SSTA. Table 5. we also ﬁnd that our methods are still highly accurate under such variation setting. Moreover. the error of mean. the run time of all the SSTA methods is signiﬁcantly shorter than the Monte-Carlo simulation. After all. but Quad SSTA has higher complexity than the other two SSTA methods. This is due to the fact that Semi-Quad SSTA uses the same semi-quadratic delay model as Fourier SSTA. we also ﬁnd that both Semi-Quad SSTA and Lin SSTA have similar run time as Lin Gau.2. and k is the maximum number of orders of Fourier Series [40].3 illustrates the results under variation setting (2). For Quad SSTA. SemiQuad SSTA has similar accuracy with almost 70X speed up. standard deviation. Lin SSTA. From this table. and the error of skewness is within 6%. and 95-percentile point is within 1%. and Lin Gau is the same. where n is the number of variation sources. Moreover. of Fourier SSTA (O {nk2 }).

04 -0.01 -0.11 0.13 -0.57 -0.4 -22.01 0.66 -28.01 0.61 25.16 -0.01 0.01 0.02 0.13 T µ Semi-Quad SSTA σ γ 95% T 0.09 -0.4 0.06 0.06 -0.85 0.01 0.36 0.01 -0.56 -0.54 0.bench G name s208 s386 s400 s444 s832 s953 s1238 s1423 s1494 61 118 106 119 262 311 428 490 588 N µ σ Quad SSTA γ 95% -0.7 -0.04 -0.51 0.09 -0.2 -0.01 2.01 0.17 43 -0.06 1.14 -7.01 0.19 0.53 -0.68 -27.2 -0.22 -0.3 33.02 -0.4 1/35819 0.01 0.9 103 122 201 288 320 12 -0.04 -10.98 0.8 -0.23 -0.52 25 0.13 -0.03 0.15 0.82 0.01 0.19 -13.38 1.44 -21.24 0.18 -0.01 0.57 -25.4 -1.25 µ σ -0.03 0.02 0.1 0.01 0.33 0.07 -0.11 0.02 -0.01 -0.01 -0.1 -0.04 -0.03 55 0.09 -0.1 -3.22 33 -0.68 0.61 0.15 -12.65 -31.06 -0.1 0.63 0.37 -0.04 -6.4 -1.01 -0. and the error percentage of skewness γ is computed as 100 × (γ MC − γSSTA )/γMC .67 -28.03 18 -0.66 -31.6 0.24 4.02 -0.6 -1.1 19459 1/260 - 218 -0.22 -9.59 -0.04 1.06 -0.02 -0.16 -0.12 0.8 Lin Gau γ 95% -2.01 0. standard deviation (σ).91 5.74 -0.04 -0.72 -0.32 -22.47 -26.4 -2.99 -0.12 -4.4 -3.01 0.9 0.49 -22.24 -0.01 0.09 0.25 Table 5.01 0.06 -0.32 0.04 -0.24 -0.04 4717 11.25 -14 MC T 15.08 -0.01 0. skewness. and 95-percentile point (95%) is computed as 100 × (MC value − SSTA value)/σMC .22 126 s5378 1004 82 0.31 0.32 -22.06 -0.07 0.42 1238 6. Note:the error percentage of mean (µ).15 -0.24 -10 -0.38 0.09 -0.17 -0.07 -9.17 -0.8 -2.08 -0.01 2.49 10.08 0.01 3.23 0.12 -0.96 T 0.06 0.37 -0.6 -2.02 0.92 0.12 4.56 -0.3 0.08 -0.91 1.15 0.11 -0.01 0.06 -0.12 -6.01 0.5 -1.2 -6.95 0.16 -0.1 Ave 0.21 -0.51 -0.08 0.2 -0.3 -0.56 -27.66 s38417 8709 176 0.01 -0.79 0.01 -4.14 0.17 -0.15 0.49 -25.05 -9.9 0.17 1.08 -10.2 -0.02 -0.54 25.59 68 -0.14 1.13 9.04 -10.16 0.2 -1.5 -27.11 -5.18 -0.64 0.98 s13207 2573 115 -0.2 19116 25.1 0.4 0.11 8.12 -0.7 1/39248 0.3 -3.2 -3.52 -24.5 -0.06 -0.4 -0.65 0.1 0.07 0.04 0.95 -0.12 2.16 -0.04 s15850 3448 135 0.43 0.18 -0.68 µ σ 0.09 -4.01 0.86 0.01 0.08 -0.8 -1.18 2.28 -0.88 25 -0.02 0.02 -0.66 -26.21 -0.23 0.08 0.32 T 1.51 -0.5 -27.53 -24.04 0.21 -0.01 1.03 0.7 -1.17 2.07 -0.7 -2.01 -0.07 -4.5 6484 23.27 0.11 2.15 -0.01 0.08 3.02 -0.7 -3.98 -0.06 -0.36 -0.2 -0.4 38.02 0.29 1/720 0.23 -0.11 -0.29 µ σ -0.02 -0.51 25.29 -0.23 -8.01 0.1 -1.1 T 0.71 s9234 2027 99 0.49 -23.11 -0.21 -0.59 -0.01 0.22 s38584 11448 176 -0.18 -0.71 -28.75 -0.87 -0.6 -1.12 -15.5 -3. .28 1.57 -25.89 2801 9.04 -3.13 -0.85 202 0.47 -23.1 -1.9 -0.6 -1.06 0.01 0.35 -0.2 -3.2: Error percentage of mean.01 0.08 0.01 0.7 41.99 2.02 -10.01 0.62 0.5 0.07 -13.05 0.4 -0.07 0.63 -0.49 -22.6 -0.1 -0.05 -0.2 -1.1 -3.09 -4.01 -0.01 -0.5 -27.92 -0.19 -10.23 -0. standard deviation.7 0.41 -0.26 0.74 -0.7 -1.31 3.32 Lin SSTA γ 95% -1.08 -1.04 -0.2 -3.07 -0.04 -0.56 -26.49 -0.47 -25.05 -0.3 -0. and 95-percentile point for variation setting (1).7 Fourier SSTA γ 95% -0.01 0.22 -0.36 -23 0.06 -0.6 38.34 0.01 -0.43 0.02 2.52 -20.25 68 -0.16 -6.44 1.03 -0.11 -0.24 -9.17 -0.08 -0.05 -12.06 -0.09 0.45 -0.5 -1.53 1/19814 0.39 -0.45 -26.4 0.01 0.28 -0.09 -0.01 -0.7 -3.29 -0.19 -14.

4 -0.27 -66.4 -0.13 -25.07 0.15 0.3 -1.76 -0.5 -0.15 -0.39 0.12 3.01 0.04 -3.03 0.02 -16.14 -0.05 -16.3 -0.9 127 -0.24 0.93 1.3 6477 23.29 0.02 0.17 -0.05 -0.12 -0.27 38.5 -0.35 -0.9 -1.01 0.2 -0.2 1.5 -0.08 0.67 0.1 -19.07 -0.2 -0.07 -2.06 0.27 0.48 0.41 -68.01 0.47 -0.01 0.19 2.09 0.4 0.2 -15.23 -0.82 -0.25 -0.15 0.95 -0.01 0.07 0.09 -0.01 0.03 1.8 0.22 -64.15 2.09 -13.16 -19.2 -0.01 0. standard deviation.28 0.59 Ave 0.01 0.84 -0.01 5.01 0.1 -1.26 s9234 2027 99 0.43 -0.05 -22 -1 -0.7 Table 5.2 -63.01 0.02 -0.75 -0.18 -0.16 -0.21 -0.24 0.3 -1.07 3.3 -0.71 -0.38 -0.19 -0.09 0.22 -1.05 1.14 -0.1 103 122 201 288 322 12 -0.05 -0.9 39 42.05 -0.03 -0.01 0.2 -0.95 2.01 -0.2 Fourier SSTA γ 95% 0.62 0.25 -0.22 -16.11 4720 11.52 s13207 2573 115 0.11 1.03 -0.06 0.2 -64 -1 -1.26 µ σ -0.11 3.4 -0.9 -1.01 0.5 -1.4 -0.84 0.02 -0.9 -1.11 -58.04 1/20497 0.38 -66.26 -16.2 -0.16 -58.01 -0.3 -0.45 -0.07 -58.8 -0.06 -0.8 0.08 -0.58 0.05 0.12 -12.38 -0.05 -0.07 -19.01 -0.27 -20.3 -1.bench G name s208 s386 s400 s444 s832 s953 s1238 s1423 s1494 61 118 106 119 262 311 428 490 588 N µ σ Quad SSTA γ 95% -0.13 2.25 -66.01 0.68 µ σ 0.38 1.03 -0.23 0.23 -62.64 18 -0.8 -1.14 -0.13 -0.7 -0.02 -3.07 -58.16 -14.24 -0. and skewness comparison for variation setting (2).03 4.81 -0.1 -0.6 0.08 -0.2 -1.11 -0.01 -0. .05 0.1 0.7 0.21 -58.17 -63.21 -0.35 -67.13 16.06 -6.17 -0.01 0.6 0.1 -0.11 -0.1 -0.32 4.04 -0.04 -0.3 -65.01 -0.4 0.05 -0.92 68 -0.5 Lin SSTA γ 95% T 0.7 0.01 -0.08 -0.07 0.05 0.99 1.01 0.96 68 -0.3 -71.08 -3.38 218 -0.19 -66.4 T 0.22 0.5 -0.01 0.04 0.25 -15 MC T 15.89 0.9 -0.09 -0.05 -23.18 -0.17 -0.09 -20.16 16.01 -0.22 1.08 -4.01 0.08 -0.19 -65.4 -1.01 0.65 25.04 -0.13 T µ Semi-Quad SSTA σ γ 95% -0.48 25 -0.44 1238 6.05 -0.43 0.9 -1.1 -0.39 -0.07 0.93 -0.01 -0.9 -1.1 -0.25 -12.28 -65 -0.15 -0.02 -0.42 -68.29 µ σ -0.03 0.9 0.05 -0.07 -0.63 T 0.12 -0.15 0.96 55 0.25 -0.04 0.35 -0.23 -16.21 -62.09 -7.1 Lin Gau γ 95% T 0.3 -59 -0.01 0.02 -0.08 -0.21 s15850 3448 135 0.34 -69 -1.01 0.6 -0.44 0.27 0.32 1/681 0.01 0.85 0.8 -0.02 0.18 -0.04 -13.17 1.04 -16.76 2795 9.04 -20 0.01 0.2 s5378 1004 82 0.1 -0.91 0.01 0.22 1.02 -0.9 -0.1 1/42392 0.14 -0.54 25 0.08 -0.02 5.01 0.05 -0.26 -0.02 202 0.14 -22.08 0.8 -0.3 -71.09 -0.01 0.19 0.14 0.6 -1.16 -70.6 0.18 0.29 -0.15 -16.04 -0.77 0.5 -0.46 5.2 19460 1/260 - s38584 11448 176 -0.4 -1.9 0.01 0.13 33 -0.38 -0.19 -66.75 0.27 -17.08 0.25 -16.31 -65.01 0.18 0.01 -0.03 0.09 -0.54 10.27 1.4 -0.3 -0.01 0.01 0.4 s38417 8709 176 0.27 43 -0.4 33.22 -0.09 -0.36 0.18 -0.08 -0.18 -16.01 -0.1 -0.09 -0.32 -0.1 0.59 0.1 -0.01 0.6 0.6 19097 25.26 65 1/37942 0.18 -13.62 0.7 -0.18 -0.41 -0.21 -0.9 -1.12 -0.3: Mean.22 65.01 0.01 0.05 -0.27 -68.34 -0.08 -0.19 -69.95 0.

5.7 Conclusions

In this chapter, we have proposed a new method to approximate the max operation of two non-Gaussian random variables using second-order polynomial ﬁtting. It has been shown that such approximation is more accurate than the approximation using linear ﬁtting through tightness probability. By applying such approximation, we present new SSTA algorithms for three different delay models, i.e., quadratic model, quadratic model without crossing terms (semi-quadratic model), and linear model. All atomic operations of these algorithms are performed by closed-form formulas, hence they are very time efﬁcient. The computational complexity of both the linear delay model and the semi-quadratic delay model is linear to the number of variation sources, and that of the quadratic delay model is cubic (third-order) to the number of variation sources. Moreover, the computational complexity is linear to the circuit size for all three delay models. The SSTA with semi-quadratic delay model results similar error as the SSTA using Fourier Series [40] with almost 70X speed up. Moreover, compared to MonteCarlo simulation for non-Gaussian variation sources, the SSTA with quadratic delay model results the error of mean, standard deviation, skewness, and 95-percentile point is within 1%, 1%, 6%, and 1%, respectively.

128

CHAPTER 6 Physically Justiﬁ able Die-Level Modeling of Spatial Variation in View of Systematic Across-Wafer Variability

Modeling spatial variation is important for statistical analysis. Most existing works model spatial variation as spatially correlated random variables. We discuss process origins of spatial variability, all of which indicate that spatial variation comes from deterministic across-wafer variation, and purely random spatial variation is not signiﬁcant. We analytically study the impact of across-wafer variation and show how it gives an appearance of correlation. We have developed a new die-level variation model considering deterministic across-wafer variation and derived the range of conditions under which ignoring spatial variation altogether may be acceptable. Experimental results show that for statistical timing and leakage analysis, our model is within 2% and 5% error from exact simulation result, respectively, while the error of the existing distance-based spatial variation model is up to 6.5% and 17%, respectively. Moreover, our new model is also 6X faster than the spatial variation model for statistical timing analysis and 7X faster for statistical leakage analysis.

129

6.1 Introduction

In Chapter 5, we developed an efﬁcient and accurate SSTA ﬂow in which process variation is modeled as inter-die, within-die spatial, and within-die random variations. However, new measurement results show that the variation model in Chapter 5 is not accurate. Therefore, we developed a more accurate and efﬁcient variation model and applied it on the SSTA ﬂow presented in Chapter 5. There are several existing works that focus on analyzing and modeling process variation [6, 32, 172, 106, 108, 134, 24, 25, 173]. The simplest method models process variation as the sum of inter-die (global) variation and independent within-die (local random) variation [25]. Later, it was observed that within-die variation is spatially correlated and the correlation depends on the distance between two within-die locations. [6, 32] model spatial variation as correlated random variables, and principle component analysis is applied to perform statistical timing analysis. In this model, a chip is divided into several grids and each grid has its own spatial variation. The spatial variations of different grids are correlated and the correlation coefﬁcient depends on the distance between two grids. [172, 173] focuses on the extraction of spatial correlation and it models the correlation coefﬁcient as a function of distance. Several more complex spatial correlation models have been proposed in [45, 105, 189, 55, 64]. In contrast to the spatial correlation models, process oriented modeling has concluded that within-die spatial variation is caused by deterministic across wafer and across-ﬁeld variation while purely random within-die spatial variation is not significant [190, 54, 50]. However, in practical design ﬂow, designers do not know the within-wafer location or within-ﬁeld location of each die; therefore, we need to analyze the impact of across-wafer variation and across-ﬁeld variation on die-scale. Since silicon measurements cited in this chapter indicate that across-wafer variation is much more signiﬁcant than the across-ﬁeld variation, we consider only across-wafer varia-

130

tion in this chapter, but the approach can be easily extended to account for across-ﬁeld variation. In this chapter, we ﬁrst analyze the impact of deterministic across-wafer variation on spatial correlation. We observe that when quadratic across-wafer variation model is used as in [54, 186, 125]: 1. Different locations of the chip may have different mean and variance. Such differences increase when the chip size increases. 2. When chip size is small, the correlation coefﬁcients for a certain Euclidean distance are within a narrow range. This explains why most existing works ﬁnd that spatial correlation is a function of distance. 3. Within-die spatial variation is NOT spatially correlated when across-wafer systematic variation is removed. 4. Within-die spatial variation is NOT independent from inter-die variation. 5. If chip size is not large, the two-level inter-/within-die decomposition of process variation is still very accurate. Based on our analysis, we propose three accurate and efﬁcient spatial variation models considering across-wafer variation. We then apply our new model to the SSTA ﬂow introduced in Chapter 5. Experimental results show that our model is more accurate and efﬁcient compared to the distance-based spatial variation model in [172, 173]. Compared to the exact simulation, the error of our model for statistical timing analysis is within 2% and the error for statistical leakage analysis is within 5%. On the other hand, the error of the distance-based spatial correlation model is up to 6.5% for statistical timing analysis and up to 17% for statistical leakage analysis. Moreover,

131

Redeposition effect in 1 Overlay error can directly impact critical dimension in double patterning. wafer stage vibration.6 and statistical leakage analysis in Section 6. the new models are applied to statistical timing analysis in Section 6.3 analyzes the impact of across-wafer variation on die-scale. 132 .5 introduces the new variation models. For example. Section 6. wafers are often rotated to increase process uniformity across them which further leads to radial behavior of non-uniformity.2 Physical Origins of Spatial Variation In silicon manufacturing. and the distortion of the wafer with respect to the exposure pattern [139]. Section 6. Section 6. most of these processes by the very nature of the equipment follow a radially varying trend across the wafer.our model is 6X faster than the distance-based spatial correlation model for statistical timing analysis and 7X faster for statistical leakage analysis.2 discusses the physical causes for across-wafer variation. 132].9 concludes this chapter.4 discusses the case when the across-wafer variation is not a perfect parabola. overlay error includes errors in the position and rotation of the wafer stage during exposure. 6. species depletion and temperature non-uniformity on the wafer at lower temperatures may cause thickness non-uniformity [159. there are many steps that cause non-uniformity in devices across the wafer. Interestingly. and ﬁnally Section 6. Most processes are “center-fed” or “edge-fed” with the boundary conditions at the edge of wafer being substantially different. Moreover.8 summarizes the advantages and disadvantages of different variation models.1 During chemical vapor deposition (CVD) step.7. Magniﬁcation and rotation components of overlay error increase from center of the wafer outwards. The rest of this chapter is organized as follows: Section 6. This is further exacerbated by advent of single-wafer processing for 300mm wafers. Section 6.

Process 1 is with 45nm technology and process 2 is with 65nm technology. Moreover. Since the amount measurement data of for process 2 (more than 300 wafers) is much more than process 1. ring oscillator frequency decreases from the center to the edge of the wafer. we see that for both process. 27] show that the etch rate varies radially across the wafer: the etch rate is high at the center of the wafer and decreases toward the edges.1 shows industry data of ring oscillator frequency for wafers from two different industry processes. center peak shape of the RF electric ﬁeld distribution [77] also leads to a center peak shape of etch rate. 54. and chamber wall conditions [78] also cause etch rate non-uniformity. it has also been shown that there is no spatial correlation for threshold voltage variation [190]. Therefore. it follows a systematic trend and the ring oscillator frequency decreases from the center to the edge of the wafer. All these processes cause a systematic across-wafer variation in physical dimensions. For process 2. all of our simulation and 133 . in the rest of this chapter. [78. Figure 6. Moreover. 186. the across-wafer frequency and leakage variation is mainly caused by gate length variation. Post-exposure bake (PEB) temperatures are higher at the center of the wafer and decrease outwards [185]. Across-wafer variation of gate length observed in several recent silicon measurements [153. It has been shown that for process 1. In real processes. However. From the ﬁgure. the acrosswafer variation is not a perfect parabola as process 1. 125] validates our arguments. [126] also shows that ring oscillator frequency and leakage current decrease from the center to the edge of the wafer. the across-wafer frequency variation can be approximated as a quadratic function (a parabola) [126]. other processes ranging from resist coat to wafer deformation due to vacuum chuck holding it follow a bowlshaped trend across the wafer.physical vapor deposition (PVD) [27] may cause non-uniformity. the wafers are rotated to improve uniformity. Similarly.

Besides across-wafer variation. for simplicity. Across-die variation can be modeled as within-die deterministic mean shift and will not cause within-die spatial correlation. 186.Figure 6. (a) Process 1. silicon measurements cited in this chapter indicate that across-wafer variation is much more signiﬁcant (probably due to advancements in resolution enhancement and lithographic equipment) than across-ﬁeld and across-die variation. we consider only across-wafer variation in this chapter. Hence. (b) Process 2. 125]: v p = ax2 + by2 + cxw + dyw + mw + m f + md + mr w w (6. we assume the acrosswafer variation to be a quadratic function as in[24. Moreover. lithography-induced effects such as lens aberrations can lead to systematic across-ﬁeld variation and across-die variation. experiments are based on the measurement result of process 2. 6.3 Analysis of Wafer Level Variation and Spatial Correlation To analyze the impact of across-wafer variation on die-scale.1) 134 . 54.1: Ring oscillator frequency within a wafer.

we may model xc and yc as random variables which are evenly distributed in the center at (0.2.1). and d are ﬁxed for a process. b. In real design ﬂow. inter-wafer. which are obtained from ﬁtting the measurement data from industry process as shown in Figure 6. yc ) is not known to designers. such as Le f f . In the rest of this section. yc ) and (x . we assume that a. y) (6.1. we consider only across-wafer variation and ignore these two types of variations (m f and md ) in this chapter. c. and (xw . m f and md are the across-ﬁeld and acrossdie variation. yw ) is across-wafer location.3 summarizes the mathematical notations used in this section. y) in a die (assuming the coordinate of the center of the die to be (0. In this section. 6. In the rest of this section. c.3. yc ) wafer coordinates. b.where v p is a variation source. mw comprises inter-die random. these coefﬁcients may vary slightly for wafer-to-wafer or lot-to-lot.4. it is easy to ﬁnd that for a die. the die location in the wafer (xc .3. Therefore. inter-lot variation and the ﬁtting error of quadratic ﬁtting as in (6. We will further discuss this in Section 6. mr is the random noise. whose center lies on (x c . as discussed in Section 6. y ) are in Table 7. respectively.1). and d are function coefﬁcients.2) The deﬁnitions of (xc .1 Variation of Mean and Variance with Location According to (6. the variation at location (x. a. 0) 135 . all simulations are based-on this extracted model. 0)) is: v p (x. y) = a(xc + x )2 + b(yc + y )2 + c(xc + x ) + d(yc + y ) + mw + mr (x. In practice. we will analyze the spatial variation based on the above model. We obtain the coefﬁcients of the above across-wafer variation model by ﬁtting the industrial 65nm process measured ring oscillator delay with 300 wafers from 23 lots. Table 7. We can convert the wafer-level systematic variation model to a die-level model by noting that the dies are always distributed evenly on the wafer.

y1 ) and (x2 .1: Notations.c.y) ω v p (x.b.d (lx .Symbols (xc .y) x = x cos ω + ysin ω lx = lx cos ω + ly sin ω ly = lx sin ω − ly cos ω x = ax + c/2 y = by + d/2 lx = ax + c/2 ly = by + d/2 rdµ = rdσ = x 2 /a + y 2 /b x 2 +y 2 y = x sin ω − ycos ω Euclidean distance between (x1 .ly ) (xw . 136 .y2 ) δ= (x1 − x2 )2 + (y1 − y2 )2 rm = rm = rm = 2 2 lx + ly rm rm rm k0 k1 k2 α β in the wafer s0 vg vs vl lx2 + ly2 lx 2 + ly 2 4 4 k1 = rw (a2 + b2 )/16 − rw ab/24 + σ2 m 2 k2 = k1 /rw 2 k0 = rw (a + b)/4 − c2 /4a − d 2 /4b α = x1 x2 + y1 y2 2 β = σ2 /rw r 2 2 2 2 s0 = cos2 ω(alx + bly ) + sin2 ω(blx + aly ) Inter-die variation Within-die spatial variation Within-die random variation Table 6.yw ) (x.y) x y lx ly x y lx ly rdµ rdσ δ Description Location of the center of the die in the wafer wafer radius Inter-die variation Within-die random variation Variance of mw Variance of mr Across-wafer variation coefﬁcients x and y dimension die size Within-wafer location Within-die location Angle between the die and wafer coordinate Variation of within-die location (x.yc ) rw mw mr σ2 m σ2 r a.

rs follows triangle distribution ranging from 0 to rw . mw . rdµ.1)).2(a) and 6.7) The locations farther off from (x0 . Figures 6.5). it is interesting to note that different die locations may have different means and variances3 . From (6. the mean and variance will be the same for all locations on a die. c and yc are not independent.6) (6. If the across wafer variation function is a piecewise linear function. y): 2 µv p (x. rdσ . is expressed as a function of four random variables: xc . y0 ) will have larger mean and variance. the variation at location (x. and mr (x. 2x 137 . y . and k1 are deﬁned in Table 7. In order to generate samples of x c and yc .5) where x . y0 ) having the smallest mean and variance is given by: x0 = −c cos ω/2a − d sin ω/2b y0 = d cos ω/2b − c sin ω/2a (6. k0 .2(b) illustrate the mean and variance of ring oscillator delay for different r dµ (or rdσ ). y). We may easily calculate the mean and variance of v p (x.3.y) = k1 + rw (x 2 + y 2 ) = k1 + σ2 + rw rdσ r v (6.3) In this case.4) and (6. y). we may assume xc = rs cos θ and yc = rs sin θ.4) (6. The location (x0 . v p (x.with radius rw (radius of the wafer). where rs and θ are independent. θ follows uniform distribution ranging from 0 to 2π 3 Such difference is caused by the nonlinearity of the across-wafer variation function (we assume quadratic function as in Equation(6. yc . y).y) = k0 + x 2 /a + y 2 /b = k0 + rdµ 2 2 2 σ2p (x. It is easy to see that xc and yc are identical and their PDF is 2 : PDF(xc ) = 2 2 rw − x2 /(πrw ) c −rw < xc < rw (6.

lx .4 0.S.0.5 0. we obtain the upper bound and lower bound of the correlation coefﬁcient for a certain Euclidean distance: ρ ≤ ρu = ρ ≥ ρl = 1− 1− δ2 k2 + δβ/2 + 2βk2 + β2 (k2 + β)2 + 2rm2 (k2 + β) + rm4 δ2 (k2 + rm2 − δ2 /4 + β) + β(2k2 + rm2 ) (k2 + β)2 + δ2 (k2 + β)/2 + δ4 /16 (6. we may also calculate 138 . and rm are deﬁned in Table 7.8) With covariance and variance calculated as above. we may obtain the correlation coefﬁcient as: ρ= 2 k2 + 2k2 α + α2 2 2 2 2 (k2 + β)2 + (rdσ1 + rdσ2 )(k2 + β) + rdσ1 rdσ2 (6. rdµ (b) σ2 V. y2 ). and k2 are deﬁned in Table 7. From (6. rdσ Figure 6.5 x 10 −5 4 µ σ2 3.2 Appearance of Spatial Correlation Besides mean and variance.10) (6.2 0.6 dσ (a) µ V.2: Mean and variance for different rd .4 rdµ 0.9). β.S. From (6.2 r 0. ly . we may calculate the covariance as: 2 Cov = k1 + rw (x1 x2 + y1 y2 ) (6.3.2).6 0 0. y1 ) and (x2 . we are also interested in the covariance between two locations (x1 .3.11) where δ is the Euclidean distance between (x1 . From the upper bound and lower bound.3.9) where α.1296 4. 6. y1 ) and (x2 . y2 ).1292 0 0.

3(b).94 ρ 0. From the ﬁgure. that is k2 fore. the within-die variations of opposite corners are negative correlated. 4 In 139 .5 1 Distance (cm) 1.86 0 Exact Upper Bound Lower Bound 0. 0. Figure 6.98 0. Moreover. This is because of the differences of variance across the die.92 0.9 0. the upper bound and the lower bound.88 0.12) 2 rm . the withindie variation of the opposite corner must decrease. if they lie on the opposite side of the center. the range decreases.5 UB 4 x 10 −5 3 0 0. We the ﬁgure.the range of correlation coefﬁcient: ρu − ρl ≤ 2 2 rm /(rm + k2 + β) (6.3(a) illustrates the exact data for 40 locations. Moreover even when two locations are very close.5 1 Distance (cm) LB Cov 3. That means. when the within-die variation of one corner increases. Although the correlation coefﬁcient is within a narrow range. the correlation coefﬁcient can be a negative number when distance is large. This explains why most existing works [172. their correlation is still near zero. 45] ﬁnd that spatial correlation is a function of distance. This is because after subtracting the mean. covariance is not.96 0. 173. the range of correlation coefﬁcient for a certain distance is very small.5 (a) ρ (b) Covariance Figure 6. Figure 6. there- Notice that usually the wafer size is much larger than the die size. we also ﬁnd that when the variances of the inter-die residual and within-die random variation increase.4 shows the correlation coefﬁcient for within-die variation after subtracting the mean variation of the die (mainly caused by across wafer variation) 4 . we ﬁnd that the range of ρ for a certain distance is very small. from the above equation. as shown in Figure 6.3: Apparent spatial correlation and covariance as a function of distance.

observe that the within-die spatial variation is almost NOT spatially correlated. 1 0.5 Distance 1 1. This further validates that the spatial variation is caused by systematic across-wafer variation.13) where vg is the inter-die variation.4: Correlation coefﬁcient for within-die spatial variation after inter-die variation is removed.5 −1 0 ρ 0. we may also calculate the inter-die.3 Dependence between Inter-Die and Within-Die Variation In most existing variation models. and vl is the within-die variation. as empirically observed in [172.5 0 −0. 173]. 6. and within-die random variation. process variation is decomposed into inter-die. and vs is the residual. With the variation model in (6. and within-die random variation: v p = vg + vs + vl (6. within-die spatial. vs is the within-die spatial variation. vs . Inter die variation is calculated as the varia- 140 .5 Figure 6.3. vl is the pure random local variation. within-die spatial. Usually vg is modeled as the variation of the chip mean. vg .2). and vl are assumed to be independent.

we may treat such approximation error as noise and the process variation as signal. In order to do this. we ﬁnd that both inter-die and within-die spatial variations are functions of random variables xc and yc . 6. |x|<lx /2 |y|<ly /2 Within-die random variation is the local random variation: vl = R.5(a). y) = v p (x. and then evaluate the signal to noise ratio. If we only consider inter-/within-die variation. To evaluate the impact of the approximation error.tion of the chip mean: vg = 1 lx ly ZZ v p (x. we may lump the across-wafer variation into inter-die variation.15) where s1 is deﬁned in Table 7. as shown in Figure 6.14) = ax2 + by2 + cxc + dyc + mw + s0 c c where s0 is deﬁned in Table 7. approximate the across-wafer variation as a piecewise constant function. y)dxdy (6. From the above equations. The signal to noise ratio when ignoring the spatial variation 141 . we analyze the accuracy of the simple two-level inter-/within-die variation model for different chip sizes. Hence. we may not decompose process variation into independent inter-die and within-die spatial variation.4 When can Spatial Variation be Ignored? In this section. that is.3. we calculate the mean square approximation error and the total variance of variation. and within-die spatial variation is calculated as the remaining variation: vs (x.3. y) − vg − vl 2 = rdµ + 2ax xc + 2by yc − s1 (6.3.

4 General Across-Wafer Variation Model In the previous section. the SNR is up to 100. When chip size is small. MSE is small.S. That means.is given as: 2 SNR = σtotal /MSE 4 2 6abrw + 6(c + d)rw + σ2 + σ2 M R ≈ 2 2 2 abrw (lx + ly ) + 2(c + d)lx ly (6. Figure 6.16) It can be seen that MSE depends on chip size. We also observe that when chip size (l x and ly ) is smaller than 1cm. 10 Exact Piecewise SNR 10 10 10 10 10 5 4 3 2 1 0 0 1 2 3 Chip size (cm) 4 5 (a) Piecewise constant (b) SNR V. hence such approximation is accurate. Figure 6. we assumed that the across-wafer variation is a quadratic function as shown in Equation 6. two-level inter-/within-die variation model is accurate. This is because we approximate the across-wafer variation as a piecewise constant function with small steps.5(b) illustrates the SNR for different die sizes.5: Approximating across-wafer variation.2. In practice. chip size. It can be seen that the SNR decreases when die size increases as expected. 6. across-wafer variation may not be an exact 142 .

there will be some ﬁtting residual after subtracting the across-wafer parabola: v(x. However.7 illustrates the distribution of sx and sy obtained from measurement data of process 2.2) and Vr is ﬁtting residual. respectively. In the previous section. From the ﬁgure. the across-wafer variation may be slightly different for different wafers. such trend will also introduce some spatial correlation. Moreover. Notice that when we model the ﬁtting residual as a linear within-die variation trend. we may model sx and sy as random variables. In order to model this trend. we observe that for each die. Figure 6.6(b) illustrates the ﬁtting residual of a wafer delay variation after subtracting the quadratic across-wafer variation function. there is a systematic trend: If one corner of the die is faster. and Figure 6. which can be lumped into inter-die random variation.18) where sx and sy are x-dimension and y-dimension slope of within-die trend. we ﬁnd that both sx and sy are almost uncorrelated and they both follow Gaussian distribution. We also ﬁnd that the trends of within-die variation for each die are different for different dies. Therefore. From the ﬁgure. two more random variables sx and sy are introduced. the ﬁtting residual contains not only inter-die random variation but a systematic trend of within-die variation. we approximate the ﬁtting residual as a linear function of within-die location: vr (x .parabola. we assume that the ﬁtting residual is lumped into inter-die random variation. This makes the variation model 143 . then the opposite corner is slower.17) where v p is the quadratic across-wafer variation model as shown in (6. and mw is inter-die part of the residual. y ) = sx x + sy y + mw (6.6(a) illustrates the original delay variation across the wafer. In this case. Figure 6. From the die point of view. y) = v p + vr (6.

102592 0 50 100 150 0 −0. Figure 6.0120795 0.0028256 50 −0. 150 150 0.142611 0.13422 100 0. (b) Residual after subtracting Equation(6. In practice.126474 100 0.more complicated. we assume that all this within-die pattern to be independent random variation.000989223 −0.118083 50 50 −0. seems that there is a systematic within-die variation pattern.0147271 0 50 100 150 (a) Original delay variation.00986836 0 50 100 150 (c) Residual after subtracting (6.18).6(b) 5 . as illustrated in Figure 6. 150 0.0055146 0.00832543 0. the remaining variation is almost uncorrelated. 5 It 144 .110338 −0. When the across-wafer variation is a perfect parabola and the ﬁtting residual is not signiﬁcant.0191914 0.18). such within-die variation pattern can be modeled as mean shift. we also observe that after subtracting the model of ﬁtting residual in (6. In addition. we may just lump the ﬁtting residual in to inter-die and random variation.00816225 0 0.00634698 0 −0.6: Ring oscillator frequency within a wafer.00159736 0.2).0045106 100 0. In this chapter.

5 Modeling Spatial Variability As discussed in Section 6.1800 2500 S PDF x 1600 S PDF y Gaussian Approx 2000 Gaussian Approx 1400 1200 1500 1000 800 1000 600 400 500 200 0 −2 0 2 4 6 8 10 12 x 10 14 −4 0 −5 −4 −3 −2 −1 0 1 2 3 4 5 x 10 −4 (a) PDF of Sx (b) PDF of Sy Figure 6.3.1. • Quadratic Across-Wafer variation model (QAW). Hence. 145 .2) we obtain a general die-level across-wafer variation model: v p (x. modeling the within-die variation as spatial-correlated random variable is not accurate as discussed in Section 6. y) (6.7: PDF of Sx and Sy . In this section.18) and (6. spatial variation largely comes from the deterministic across-wafer variation. y) = a(xc + x )2 + b(yc + y )2 + c(xc + x ) + d(yc + y ) + sx x + sy y + mw + mr (x. we introduce three new spatial variation models considering acrosswafer variation: • Slop Augmented Across-Wafer variation model (SAAW).19) 6. Combining (6.

and mr .4. In the following of this section. the die location (xc . inter-die random variation mw . Larger chip needs more variables. We refer to this new model as Slope Augmented Across-Wafer variation model (SAAW) . (6.5. In the equation. the number of spatial variation sources depends on the number of grids. process 1 as discussed in Section 6. the ﬁtting residual is not signiﬁcant6.5. xc and yc . when the across-wafer variation is a perfect parabola.19) to (6. our new model not only models the across-wafer variation accurately but also is more efﬁcient than the traditional spatial correlation model.3).2). die location within the wafer xc and yc . Therefore. 6 For example.19) provides a new spatial variation model. we will discuss these models in detail: 6. 146 . for the traditional distance-based spatial variation model. Notice that in SAAW model. However. there are only six random variables. In this case. and slope of ﬁtting residual sx and sy . We refer to this model as Quadratic Across-Wafer variation model (QAW).2. there are only four random variables. Since QAW does not consider ﬁtting residual. it is not as accurate as SAAW. 6. mw . within-die random variation mr . yc ) are modeled as random variables and their PDF is shown in Equation (6.2 Quadratic Across-Wafer Model As discussed in Section 6.19) calculates the variation for a given location (x. we may just lump the ﬁtting residual into inter-die random variation and simplify (6. y).• Location Dependent Across-Wafer variation model (LDAW).1 Slope Augmented Across-Wafer Model (6.

1. 6.3. y) ≈ md + µv p (x.4. To further improve the accuracy of inter-/within-die variation model. µv p (x.4) and (6. Hence compared to inter-/within-die model. y) are deterministic value for a certain within-die location (x. as discussed in Section 6. µv p (x. inter-wafer random. y). applying the two-level inter-/within-die variation model does not introduce much error. y) is within-die variation including withindie random variation and residual of across-wafer variation. LDAW model has only two random variables: md and mr which is the same as two level inter-/withindie model.3.5. y) + σv p (x. y) and σv p (x. Notice that in Equation 6.3 Location Dependent Across-Wafer Model As discussed in Section 6. mr (x. y) (6. which can be calculated from (6.6 Application of Spatial Variation Models on Statistical Timing Analysis In this section. However. y) and σv p are mean and variance difference at different locations of a chip.20) where md is inter-die variation including inter-lot random. Therefore. LDAW has similar efﬁciency but higher accuracy because LDAW considers mean and variance difference across the chip while inter-/within-die model does not. y)mr (x.5).20. we account for this: v(x.6. we apply our across-wafer variation models to statistical static timing analysis. inter-/withindie variation model still does not consider the mean and variance difference at different locations of a chip. when die size is small enough. 147 . interdie random. We refer to the above model as Location Dependent Across-Wafer variation model (LDAW). and across-wafer variation.

when applying LDAW.6. Therefore.3 to estimate the chip delay variation. each variation source is a quadratic function of random variables.6.114). cell delay is a quadratic function of random variables. as shown in (5. we may apply the quadratic SSTA ﬂow in Section 5. the cell delay becomes a 4th order function of random variables. Therefore. In this case. Sometimes. cell delay can be modeled as a quadratic function of variation sources. applying SAAW or QAW results in a quadratic function of random variable and hence quadratic SSTA or semi-quadratic SSTA can be applied. cell delay model can be simpliﬁed as a linear function of variation sources. we may also ignore the crossing terms and apply semi-quadratic SSTA to improve efﬁciency. Semi-quadratic SSTA greatly improves the speed with only a little accuracy loss compared to quadratic SSTA.6.19) and QAW (6. Notice that moment matching approximation is performed only once for all cells and does not increase the run time of SSTA. by applying SAAW or QAW on quadratic cell delay model. our SSTA ﬂow in Chapter 5 can only handle a quadratic function of random variables. LDAW is a linear function of variation sources. as discussed in Section 5. Moreover. hence quadratic-SSTA can be applied.3 to match the 4th function to a quadratic function of random variables by matching the mean and ﬁrst two order joint moments. 148 .2).1 Delay Model As discussed in Chapter 5. And then. However. we apply the moment matching technique in the way as Section 5. Therefore.46). in this application. Therefore. as shown in (5. and applying LDAW results in a linear function of random variables where linear SSTA can be applied. For SAAW in (6.

173].6. ﬁtting residual coefﬁcients sx and sy 9 .2 Experimental Result We have applied our new model to the SSTA ﬂow introduced in Chapter 5. In the experiment. b. In the experiment. We assume random placement for ISCAS85 circuits. We obtain the across-wafer coefﬁcients a. and then calculate the mean and x y variance of sx and sy for all chips. We apply all the above methods to the ISCAS85 suite of benchmarks in PTM 45nm technology[124]. which is referred to as SPatial Correlation model (SPC). inter-die. 2) distance-based spatial correlation model from [172. and within-die variation as percentage with respect to the nominal value. we only consider logic cell delay when calculating the full chip delay variation. three comparison cases are deﬁned: 1) MonteCarlo simulation with the exact deterministic across-wafer variation model 7 .19). which is referred to as Inter-/Within-die variation model (IW). 7 In 149 . Since process variation has smaller impact on interconnect delay than on logic cell delay. Then we assume that the percentages of all the above coefﬁcients to nominal value are the same at 45nm technology node and 65nm technology node. and d 8 . standard deviation of random inter-wafer. we consider the gate length variation obtained from minimum square error ﬁtting on the ring oscillator delay from industrial 65nm process (Process 2 as discussed in Section 6. which is the golden case for comparison. c. 9 We obtain the slope of ﬁtting residual s and s for each chip.6. we assume sx and sy to be Gaussian random variables with mean and variance obtained from the measurement data.2) measurement from the model as shown in (6. 8 We apply quadratic function to ﬁt the across-wafer variation for each wafer to obtain the ﬁtting coefﬁcients for each wafer. the simulation. then use average coefﬁcients of all wafers for our experiment. We have simulated 318 wafers correspondent to 318 measured wafers. 3) two-level inter/within-die variation model. each wafer may have different across-wafer variation which is obtained from measurement data of process 2. In order to verify the efﬁciency and accuracy.

we also compare the results of applying quadratic SSTA with crossing terms (SAAW Quad and QAW Quad) and applying SSTA without crossing terms (SAAW SQuad and QAW S-Quad). the impact of spatial variation on delay is not signiﬁcant within the circuit. This is because in our experiment.we assume the benchmarks are stretched on a 2cm×2cm chip. • The error of SAAW using semi-quadratic SSTA is within 2% while the error of spatial correlation model is up to 6. • SAAW is more accurate than QAW.2 illustrates the percentage error of mean (µ). In the table. for the SPC model. we divide the chip to 10 × 10 = 100 grids. The error is calculated as error of different variation models compared to the golden case simulation. • Compared to quadratic cell delay model. the cell delay variation is well 150 . In our experiment. curacy loss. Since ISCAS85 benchmarks are very small. Table 6.1 Full Chip Delay In the experiment. and 95-percentile point (95%) and run time (T) of different variation models. we also compared the result of using quadratic cell delay model (Quad) and linear cell delay model (Lin).6. From the table. In order to show such impact.5%. We only use quadratic cell delay model for golden case simulation (exact). semi-quadratic SSTA (SSTA without crossing terms) achieves up to 8X speed up with less than 1% accuracy loss.2. linear cell delay has less than 2% acapproximated by a linear function. This is because the ﬁtting residual is significant for the measurement data.6. QAW ignores ﬁtting residual and hence introduces more error. standard deviation (σ). For SAAW and QAW. we have the following observations: • Compared to full quadratic SSTA. we assume that the chip size is 2cm×2cm and the wafer radius is 15cm.

• For linear cell delay model. we assume linear cell delay and apply semi-quadratic SSTA for all experiments. However. in the following experiment. than others. both model has much larger error cantly more accurate than IW with no runtime penalty. 10 There 151 . result- • LDAW and IW are very efﬁcient. Since linear cell delay model and semi-quadratic SSTA are accurate. In the rest of this subsection. LDAW is signiﬁ- are 100 correlated spatial random variables. since SAAW is more accurate than QAW with only a small run time overhead. we do not consider the QAW model in the following experiments. This is because both model ignore correlation. SPC. Moreover. while SAAW has only 6 random variables. This is because there are 100 grids in the spatial correlation model. SAAW achieves about 6X speed up compared to ing in 37 spatial random variables10 . we apply PCA to truncate some insigniﬁcant principle components and there remains 37 signiﬁcant principle components.

SAAW and QAW will give similar result as LDAW.4. we ﬁnd that when chip size is small. upper left corner (UL). its delay variation may be different.3. and SPC will give similar result as IW. Table 6. and then calculate the delay variation with location. the accuracy improvement is limited) compared to IW. we only considered big chips.1. LDAW is always better than IW. when a block is placed at different locations on a chip. especially for big chips. and upper right corner (UR). As discussed in Section 6. 11 When 152 .In the above experiment. in this experiment. Therefore. From the table. the design is separated into several blocks and each block only occupies a small region of a chip. the impact on across-wafer variation at die level is not signiﬁcant. we ﬁnd that the circuit is in a small region. Table 6. However. we only compare two models LDAW and IW 11 .6. the impact of spatial variation on delay is not signiﬁcant within the circuit. when the chip size is small.3. we perform delay estimation of ISCAS85 benchmarks stretching on different size chips. the critical path is within a small region instead of spanning all over the chip. Since ISCAS85 benchmarks are very small. Considering that LDAW has similar run time but is more accurate (although when chip size is small. 6.2. in real design.2 Delay of Blocks on Different Locations on a Chip The above experiment assumes that the benchmarks are stretched on a chip. we assume that the ISCAS85 benchmark circuit is placed (no stretched) in different locations on a chip: center (C). LDAW and IW are accurate. Therefore. In order to verify this. lower right corner (LR). lower left corner (LL). In this case. From the table.3 shows the percentage error for different models under different chip size.4 compares the percentage error of LDAW and IW for ISCAS85 benchmarks placing on different locations on a 2cm×2cm chip. In order to show such effect. different chip locations may have different mean and variance. As discussed in Section 6.

9 9 -1.7 -3.8 10 -0.5 -1.2 15 c3540 Quad 25. σ.0 -11.1 -4.0 135 -3.2 435 -0.0 146 -0.3 -1.1 77 -1.9 -3.9 16 -3.43 34.3 -2.3 54 -1.8 -4.7 -0.0 9 -2.5 -5.4 -7.5 -0.9 27 -1.5 12 153 Lin Lin c7552 Quad 48.3 10 -0.3 -8.3 +0. ∗ for LDAW.7 -6.7 3.0 4210 -1.8 -1. linear SSTA is applied when assuming linear cell delay model.1 -1.5 -3.3 433 -3.9 -1.5 -8.4 -5.6 -10. SPC.2 76 -0. Run time (T) is in ms.8 1450 -2.5 -1.7 202 -2.delay mark model µ Lin - Exact σ 95% µ SAAW Quad σ 95% T SAAW S-Quad µ σ 95% T µ QAW Quad σ 95% T QAW S-Quad µ σ 95% T µ LDAW ∗ σ 95% T µ SPC∗ σ 95% T µ IW ∗ σ 95% T c1908 Quad 17.5 -4.2 -5.8 -6.9 -2.4 109 -1.9 6.9 8 -0.8 -1.3 -2.2 -3.7 27 -0.6 -3.1 -1.4 115 -1.2 -2.4 -1. .9 -8.9 25 +0.4 +0.5 -0.3 -4.0 20 -1.7 48 -2.0 -6.3 -3.3 -6. and 95-percentile point for exact simulation is in ns.5 -4.9 18 -2.2: Delay percentage error for different variation models.8 -4.5 -6.5 +0.9 -8.2 -0.6 101 -1.1 53 -1.9 -3.1 -1.2 -3.1 -6.9 -9.3 -1.8 -1.2 -1.4 -4.8 79 -3.8 -1.5 -10.0 26 -0.6 2.7 212 -0.9 +0.9 -1.2 -4.Bench. and IW.4 -0.6 -0.1 36 +0.6 13 -1.6 430 -1.6 +0.9 -2.9 -2.8 -2.4 150 -1.6 -4.4 -3.6 -1.6 35 -0.2 19 -2.6 -1.9 -1.5 22 Table 6.8 -0.1 -3.4 -1.6 -7.6 -3.6 -4.5 -6.9 8182 -2.2 -8.0 -2.47 64.4 -1.1 10 -2.1 -7.9 67 -0.7 -1.4 -1. Note: the µ.6 -1.2 209 -1.27 24.5 -1.

From the tables.2 -3.9 -1.9 -1.5 -3.2 -7.4 -3.9 -1.1 -2.1 -2.3 -1.5 -6.2 -1.0 -1.0 -1.2 -1.9 -1.6 -2.5 -2.47 64.0 -2.Bench.8 3.0 -1.3 -1.47 64.9 6.4 -2.44 34.8 -2.9 -3.0 -1.1 -1.7 -3.45 34.8 -1.6 -2.9 -1.2 -4.6 2.9 6 25.9 6. as discussed in Section 6.3 -2.1 -2. 6mm×6mm.7 -2.7 -0.0 -1. The values of exact simulation are in ns.6 -0.5 -1.8 -1.5 -5.3: percentage error for ISCAS85 benchmark stretching on a chip with different chip size.0 -1. Note: We assume square chips and chip size means edge length in mm.6 -6.7 shows percentage error of LDAW and IW for ISCAS85 benchmarks placing on different locations on a 1cm×1cm.3 3 17.1 -2.6.8 3. This is because LDAW predicts different mean and variance for different location correctly.4 -6.23 24.3 -4.9 -2.2 -1.1 -4.5 -1.8 -5.2 -2.3 -0.3 6 48.1 -1. and 3mm×3mm chip.9 -1.5 2. 154 .1 -1.0 -1. Table 6.1 -1.3 -1.9 -1. respectively.1 -3.7 -1.2 -2.2 -4.5 -1.2 -1.2 -2.5 -2.6 -1.4 -2.22 24.0 -1.2 -5.5 while IW can only give the same mean and variance for all locations.4 -2. and 6.2 -3.1 -1.2 -1.2 -0.8 -3.1 -1.3 -2.7 -3. 6.3 -1.9 6.5 -2.6 Table 6.5 -4.5 -3.8 -7.42 34.7 3 25.1 6 17.24 24.6 3.5. the estimation of LDAW is within 1% error from the exact simulation and the error of IW is up to 8%.3 3 48.4 2.0 -1.7 -1.3 c7552 10 48.3 -1. we ﬁnd that the error of IW becomes smaller when chip size is small.4 -0.4 -4.2 -1.2 -5.2 -2.1 -2.2 -1.0 -1.Chip mark size µ Exact σ 95% µ SAAW σ 95% µ LDAW σ 95% µ SPC σ 95% µ IW σ 95% c1908 10 17.9 -2.2 -4.5 -1.47 64.6 -1.5 c3540 10 25.8 -3.4 -0.4 -1.5 -3.5 -6.6 -1.8 -1.4 -1.

2 +0.4 +0.37 60.3 -0.4 +0.2 -0.8 -1.1 -0.6 -2.1 LR 48.4 +0.3 +2.2 -0.1 -0.2 c7552 C 48.0 +0.11 58.3 UR 49.4 -0.3 +1.4 +2.4 -1.1 -1.4 +1.5 -1.3 +0.2 6.36 33.8 LL 24.1 +0.3 -0.2 -0.3 -0.3 -3.4 +0.4 -0.3 +0.0 -4.2 +0.9 -2.2 +1. Note: The 155 .6 +0. Note: bench locamark tion µ Exact σ 95% µ LDAW σ 95% µ IW σ 95% c3540 C 25.3 +0.1 6.5 3.35 60.30 32.8 +0.5 -1.5: Delay comparison for ISCAS85 benchmark in 1cm × 1cm chip.45 61.29 32.6 -0.0 +0.1 LR 49.5 +0.1 -0.2 -0.7 +1.42 60.0 -7.4 The values of exact simulation are in ns.34 33.8 UL 26.3 +0.3 -0.2 -0.2 -0.2 +1.3 +0.5 +0.1 -1.9 +0.4 UR 27.2 -0.4 +0.91 65.2 6.3 -0.9 +0.1 3.9 -0.0 -4.43 61.3 -0.1 3.9 6.3 +0.2 -2.3 +2.3 -0.6 LL 48.9 +1.26 31.8 values of exact simulation are in ns.0 3.2 3.bench locamark tion µ Exact σ 95% µ LDAW σ 95% µ IW σ 95% c3540 C 25.0 -4.4 +1.25 31.6 -0.4 +0.6 +1.7 -0.0 -1.4 -0. Table 6.0 -0.4 LR 25.6 -1.4 3.6 +4.7 +0.2 +1.8 +0.51 62.4 -0.5 +0.0 UL 49.6 -6.9 -3.2 -0.2 c7552 C 48.7 +3.65 63.4 +0.5 6.2 6.1 -0.9 LL 25.3 +3.8 -2.0 -0.1 -0.4 6.1 -4.0 +1.0 -1.9 +0.8 3.38 60.2 +0.3 -0.3 -0.1 -1.3 -0.1 6.6 6.35 33.2 +0.2 -0.41 34.2 +0.4: Delay percentage error at different locations in a 2cm × 2cm chip.4 3.5 +0.1 +2.8 3.22 31.4 -7.4 -1.4 -1.6 +0.9 LL 47.1 -5.5 +1.6 -1.6 +3.3 +0.2 +1.4 UR 26.5 UL 49.1 UR 50.3 +1.2 6.9 LR 26.3 -3.3 -0.32 32. Table 6.2 3.8 UL 26.2 +0.2 +0.

2 -0.2 +0.4 -0.31 32.0 6.2 6.9 -0.1 +0.2 -0.3 +0.9 -0.4 +0.6 +0.2 -1.7 -0.7 -0.2 -0.4 6.3 -0.3 -0.2 +0.2 +0.3 +1.0 -1.6 -1.45 61.3 +0.0 3.5 -0.6 LL 48.38 60.7 +1.2 +0.2 +0.2 +1.3 +0. Note: The bench locamark tion µ Exact σ 95% µ LDAW σ 95% µ IW σ 95% c3540 C 25.1 -0.6 +0.5 -1.3 +0.0 +0.3 -0.2 -0.3 -0.7 UL 25.37 60.1 -0.5 UL 49.9 UR 26.6 -0.25 31.1 c7552 C 48.8 LL 25.4 +0.5 -0.7 LR 48.32 32.22 32.4 -0.8 3.3 +0.2 LR 25.2 +1.3 3.2 -0.1 LL 48.6 UR 49.5 +1.3 -0.37 60.3 -0.2 -0.7 -0.3 UR 49.1 LR 49.0 -0.4 -0. Note: The 156 .43 60.6 6.8 -0.3 -0.8 -2.4 +0.2 -0.2 +0.8 -1.5 +0.1 -0.1 3.43 60.0 +1.4 +0.8 values of exact simulation are in ns.8 -1.8 6.1 values of exact simulation are in ns.2 +0.6 -0.2 0.9 3.28 32.6: Delay comparison for ISCAS85 benchmark in 6mm × 6mm chip.4 -1.42 60.7 -0.1 -1.4 +0. Table 6.4 +1.7 +0.4 UR 26.3 -0.9 +1.3 6.9 +0.7: Delay comparison for ISCAS85 benchmark in 3mm × 3mm chip.34 59.9 -0.48 61.3 +0.1 +0.34 32.7 -0.9 6.5 +0.3 +0.1 -0.4 -1.6 -2.bench locamark tion µ Exact σ 95% µ LDAW σ 95% µ IW σ 95% c3540 C 25.5 -0.7 +0.2 -0.7 +0.2 -0.3 -0.2 c7552 C 48.4 6.27 32.46 61.5 +0.3 +0.2 -0.7 +1.3 +2.2 -0.0 3.0 -1.0 -1.2 -2.9 -0. Table 6.5 +0.4 LL 25.6 3.2 -0.4 -0.2 -0.4 6.2 -0.4 3.2 -0.26 320 +0.2 +1.2 3.35 33.6 -0.2 +0.1 -0.4 LR 26.5 3.1 6.31 32.3 +2.2 -0.2 +0.5 -1.4 -0.2 -0.2 -0.5 UL 48.8 UL 26.0 +1.0 +0.

Considering that the random variables may be non-Gaussian. For each variation model. 157 . we also apply our variation model to statistical leakage power analysis. We have implemented leakage variation analysis with different models in Matlab. For the leakage analysis. in this chapter. we observe that: • Error of SAAW is within 5% while error of SPC is up to 17%. cell leakage power variation is modeled as exponential function of variation sources: Pleak = P0 · e∑ ciVi chip leakage power is calculated as the sum of leakage power of all cells: Pchip = (6.22) where Cell is the set of all cells in the chip and Pi.8 compares the leakage variation for ISCAS85 benchmarks. there is no closed-form equation to calculate the full chip leakage power. we apply Monte-Carlo simulation to obtain the full chip leakage power variation.19). SAAW is 7X faster than SPC because there are fewer random variables for SAAW. Usually.21) where P0 is the nominal leakage power and ci ’s are sensitivity coefﬁcients. The full i∈Cell ∑ Pi. we use the same setting and comparison cases as the SSTA experiment in Section 6. In the experiment.leak is leakage power of the ith cell.6. the cell leakage power is an exponential of a quadratic function of random variables.6.000 sample Monte-Carlo simulation to obtain the full chip leakage power for all variation models. Since each variation source is a quadratic function as in (6. we use 100. Moreover.leak (6. Therefore. we assume that 900 copies of ISCAS benchmark circuits are placed in a 30 × 30 array on a 2cm×2cm chip. Table 6.7 Application of Spatial Variation Models on Statistical Leakage Analysis Besides SSTA. From the table.

4 c3540 201 37.7 -6. Run time (T) is in s.5 +1.8 +10.1 14. 6mm×6mm.5 +5.6 -19.2 +9. Note: The exact values are in mW .1 +5.7 c7552 403 73.8 +6.2 -20.4 +2. From the table.9 2.7 15.5 -17.2 16.8 +14. we assume 100 copies of ISCAS85 benchmark circuits are placed in a 10 × 10 array.9 2.7 +1.4 -5.5 +3.0 +5. for leakage variation analysis.5 2.5 -7.3 -24. 158 .9 +11.3 122 -8.7 c1908 95.3 -12.6 +5.2 +8.7 +14.9 -19.3 +3.7 15.2 -20.4 9. This is because both these models do not consider correlation and hence under estimate the leakage power variation.9 -16.7 +1.3 9.4 282 +1.2 -23.4 +3.9 -22.4 122 -9.3 144 +0.3 +1.2 2.5 -15.6 +7.1 -7.6 -16.0 10.6 2.8 -5. for the 6mm×6mm chip.7 9. and for the 3mm×3mm chip.4 +6.4 +12. SPC. Table 6.6 +4.5 Table 6.9 shows the error percentage for different model on different size chips.6 20.2 123 -7.3 +3.5 2.Benchmark µ Exact σ 95% µ SAAW σ 95% T µ QAW σ 95% T µ LDAW σ 95% T µ SPC σ 95% T µ σ IW 95% T c1355 62.8 +12.6 -16.8 +6. For the 1cm×1cm chip.4 +3.9 +8. • SAAW is more accurate than QAW.5 92. and 3mm×3mm.6 -20.2 123 -8.9 2. we ﬁnd that the error of LDAW.9 181 +1.3 -18. Similar to SSTA.8 15.4 10.9 c2670 131 22.5 2.7 +2.3 +7.6 -16.5 +2. and IW reduces when chip size becomes smaller as expected.1 +13.5 +2.2 562 +1.5 +5.5 -20. • Both LDAW and IW are not accurate.3 +10.7 +6.2 -23.9 +4.5 +4.6 +2.8: Leakage error percentage for different models in 2cm × 2cm chip.9 15. we assume 25 copies of ISCAS benchmark circuits are placed in a 5 × 5 array.3 -25.8 +4. we assume that 225 copies of ISCAS85 benchmark circuits are placed in a 15 × 15 array. but is about 50% slower.3 +3. we also perform leakage estimation in different size chips: 1cm×1cm.7 122 -9.8 -19.5 2.0 2.

5 -5.7 +2.1 +1.0 +2.2 +1.8 +2.0 +4.8 +1.26 +1.3 -3.9 -3.0 c1908 10 23. QAW has four random variables.0 +1.5 141 +1. Note: exact values are in mW .40 2.5 -4.2 -12.2 -5.9 +4.8 -3.0 -3.6 -2.4 +1.1 -3.3 -1.3 -8.2 +6. therefore.6 +2.8 +2.1 6 10.Chip mark size µ Exact σ 95% µ SAAW σ 95% µ LDAW σ 95% µ SPC σ 95% µ IW σ 95% c1355 10 15.37 71.1 +1.3 -9.8 +7. LDAW is the most efﬁcient.3 -11.3 -13.7 -10.9 5.10 summarizes the advantages and disadvantages of our proposed spatial variation models (SAAW.5 +2.7 +4.1 +3.1 -9.8 Summary of Different Models In previous sections.92 1.1 +0. Our proposed across-wafer variation models exactly model the across-wafer variation and the number of random variables does not depend on chip size.1 -6.9 +1.4 3.2 +2.9 -3.0 +6.1 -3.0 -1.5 -3.9 +1.7 +2.5 9.2 +1.7 +8.4 -1.4 +4.5 6 45.5 -12.6 2.3 +2.4 c2670 10 32.2 +4.7 +6. it can be applied only when the across-wafer variation is a perfect parabola.2 +3.9 +2.2 3 1.9 -3.25 16.7 +2.19 6.0 +3.7 -3.3 -9.2 -5.6 +1.4 +2.3 +6.4 -4.9 -3.3 4.6 +6.3 +1.3 -8.6 +3.58 +0. 6.5 -6. and the traditional variation models (SPC and IW).0 -5.8 +2.5 -5.8 5.8 +1.5 +1.2 -4.8 6 5.5 +3.5 -4.5 -2. 159 .3 -4.2 +1.4 -12.7 +1. Therefore they are accurate and efﬁcient.9 +5.4 +2.0 3 2. and LDAW).2 +1.0 -1.2 +9.0 +2.9 +5.0 +1.5 -7.3 -17.9 +0.5 3 5.2 +2. hence it is more efﬁcient than SAAW.2 -7.2 -3.5 6 22.3 +6.7 36.03 +0.5 +9.57 4.3 -14.1 +1.5 +9.4 -6.4 -7. Moreover.0 +1.9 -4.0 8. However.0 +1.5 +1.5 -8.06 1.1 -14.2 -1.7 -1.7 +1.1 c3540 10 50.9 -5.72 45.3 +2.6 -14.55 20.4 +1.9 +3. we compared the accuracy and efﬁciency of different models.4 Table 6.0 -8. QAW.3 -7.55 +0.0 +2.2 +8.0 -9.58 1.4 -3.2 -7.1 6 14.9 -8.Bench. Table 6. ignores correlation and only works for small chips.1 -3.3 3 3.0 -4.9 -7.65 0.9 -15.6 -11.2 +1.2 -1.5 -6.1 -3.52 23.0 -4.0 +2.4 2.4 -3.6 -7.3 -2.9 +3. SAAW has six random variables and it can be applied to any across-wafer variation models.6 +1.0 +1.8 +3. SAAW and QAW need to know the across-wafer variation.2 -5.1 +3.2 +8. one needs to track the die locations within the wafer to build up the model.2 3 11.3 +2.1 -8.9: Leakage error for different variation model on different size chips.0 -4.58 +0.48 0.7 -2.1 +2.8 -3.03 +1.1 -1.9 -3.1 -2.65 5.5 -7.65 0.62 10.5 -13.6 -3.1 -8.6 -6.6 2.0 +1.16 30.2 -12.2 +5.2 +1.05 7.9 -4.2 +1.8 -2.8 -10.2 -3.9 c7552 10 102 18.2 +2.

such models are not accurate compared to our proposed models. However. they are somewhat easier to build. for SPC. the traditional variation models (as well as LDAW) only require measurement on a die without tracking die locations. Moreover.On the other hand. 160 . it is not as efﬁcient as our proposed models. Therefore. since the number of random variables depends on number of grids.

non-parabola across-wafer variation large chip. .19) QAW Equ (6.2) LDAW Equ (6.20) SPC IW # of RVs 6 4 2 Depend on # of grids 2 Case to Apply large chip.10: Summary of different variation models.Model Type Acrosswafer models Traditional models Advantages Accurate Efﬁcient Easy to extract Disadvantages Need die tracking to extract Not accurate Models SAAW Equ(6. parabola across-wafer variation small chip large chip small chip 161 Table 6.

For simplicity. We exploited these observations in order to create a much more accurate and efﬁcient model for performance variability prediction. we ﬁnd that spatial correlation is visible only when the across wafer systematic is not taken into account.9 Conclusions In this chapter. We ﬁrst observe that different locations of a chip may have different means and variances and such difference becomes more signiﬁcant when chip size increases. we show that within-die random variability does not exhibit a strong or useful pattern of spatial correlation.6. we have proposed an accurate and efﬁcient variation model for deterministic across wafer variation. when chip size is small enough. we ﬁnd that the within-die spatial variation is NOT independent of the inter-die variation. we analytically study the impact of systematic across-wafer variation on within-die spatial variation. When it is taken into account. Secondly. Based on the above analysis. the two level inter-/within-die variation model is still accurate. our new model reduces the error from 6. we also analyze the case when the across-wafer is not a perfect quadratic function. However. we assume that across-wafer variation is a quadratic function. 162 . We further apply our new variation model to two applications: statistical static timing analysis and statistical leakage analysis. Experimental result shows that compared to the distance-based spatial variation model.5% to 2% for statistical timing analysis and reduces error from 17% to 5% for statistical leakage analysis. such dependence is weak and the across-wafer variation can be lumped in to inter-die variation. Our model also improves the run time by 6X for statistical timing analysis and by 7X for statistical leakage analysis. Thirdly. In this case. Later.

Statistical modeling and analysis have become the mainstay of modern design-manufacturing ﬂows. percentile corner) and analysis (SPICE corner extraction. Our experiments (with variability assumptions derived from test silicon data from a 45nm industrial process) indicate that for moderate characterization volumes (10 lots) and low-to-medium production volumes (15 lots).. The guardbands are non-negligible for all cases when either production or characterization volume is not 163 . statistical timing). due to limited number of samples (especially in the case of lot-to-lot variation). calibrated models have low degree of conﬁdence. However. The problem of conﬁdence in statistical analysis is going to be further worsened with advent of 450mm wafers. Most analysis techniques assume that the statistical variation models are reliable.7% of standard deviation for single parameter corner.CHAPTER 7 On Conﬁ dence in Characterization and Application of Variation Models In this chapter we study statistics of statistics. a signiﬁcant guardband (e. We mathematically derive conﬁdence intervals for commonly used statistical measures (mean.g. 38. variance. Our estimates are within 2% of simulated conﬁdence values. 34. The problem is further exacerbated when production volumes are low (≤ 65 lots) causing additional loss of conﬁdence in statistical analysis (since production only sees a small snapshot of the entire distribution).7% of standard deviation for SPICE corner. and 52% of standard deviation for 95-percentile point of circuit delay are needed) to ensure 95% conﬁdence in the results.

This is especially true for lot-to-lot variation. However. It also ignores the uncertainty introduced by limited number of production lots. we mean the statistical values in the case of inﬁnite measured as well as production samples. implicitly assume that the production chips have the same statistical characteristics as the corresponding population values 1. analysis or optimization. the existing statistical analysis. Moreover. Due to limited number of measurements. ignoring their true distributions. the measured statistical characteristics may be unreliable. The proposed methods are not runtimeintensive (always within 10s) as they require Monte-Carlo simulations on closed form expressions. and skewness. Therefore. variance. 1 164 .large. [177] models the uncertainty of mean. the statistical characteristics of the variation sources are obtained from measurement of a set of samples. We also study the interesting one production lot case which may be common for prototyping as well as for academic designs. are often assumed to be known and reliable. the production statistical characteristics may signiﬁcantly deviate from their population values. such as mean. and then estimates the range of mean and variance of circuit performance. variance. Before we move further. it is important to brieﬂy review the concept of conﬁdence Here by “population”. 7. and correlation coefﬁcients as an interval. the uncertainty in statistics of measured data as well as production data should be considered in statistical analysis. this work only models the uncertainty as an interval. statistical characteristics of variation sources.1 Introduction In process variation modeling. since the number of lots measured for characterization is usually small. When the number of production samples is not small. In practice.

the results may differ. the distribution of (say) circuit delay is ﬁxed. . However. Xn). We estimate the conﬁdence interval of mean. Experimental results show that the required guardband (to reach desired conﬁdence) estimated by our method is very close to the exact simulation value. 2. we say that we have 90% conﬁdence that the sample mean is smaller than µcon f . we choose n samples of it (X1.intervals in statistical analysis. . we do not know exactly the production mean and variance prior. one can estimate the distribution of mean and variance (of course the actual mean and variance are going to be just one sample out of these distributions). Notice that the conﬁdence is not yield but the probability that certain statistical characteristic is within a given interval. When the sample mean is smaller than a certain value µcon f in 900 out of the 1000 runs.000 ˆ times. We extend the ideas to extract SPICE corners depending on desired conﬁdence level as well as number of measured/produced silicon lots. For a given conﬁdence level. In this chapter. X2 . the sample mean ˆ µ and variance σ2 are ﬁxed. Once chips are produced. we estimate the worst case fast/slow corner (µ± kσ corner) for variation sources. if we repeatedly choose n samples for 1. we study the uncertainty of statistical models and analyze the impact of such uncertainty on statistical analysis. . variance and quantile for statistical timing analysis. . Consider a standard normal random variable X . Once the samples are chosen. and calculate the sample mean and variance for each time. The contributions of this chapter are: 1. 3.we estimate the distribution of the production mean and variance of variation sources. 4. For a given conﬁdence. But due to the uncertainty of the statistical model. Given the number of measured lots and the number of production lots. We also observe 165 .

interw r l d wafer. 7. The variances of the means and variances are: σ2 = σ2 /Nl µl l σ2 2 = σ2 /Nl l σ l (7. and µr (σ2 ) are means (variances) of inter-lot. inter-die (Vd ).2 gives the problem formulation.6) (7.2) (7.6 discusses how to choose conﬁdence level in real circuit design. We need to introduce up to 30% guardband value to achieve 95% conﬁdence.7) σ2 = σ2 /Nw µw w σ2 2 = σ2 /Nw w σ w 166 .3 studies the conﬁdence interval estimation for variation sources. inter-wafer(Vw ).2 Problem Formulation Process variation is decomposed into inter-lot (Vl ). and within-die (Vr ) variation: V = Vl +Vw +Vd +Vr µV = µl + µw + µd + µr 2 σV = σ2 + σ2 + σ2 + σ2 w r l d (7. and within-die variations. inter-die. µw (σ2 ).4 presents the estimation of worst case delay for SPICE corner estimation and Section 7.that the guardband value increases dramatically when conﬁdence level increases. It has been shown that in case of ﬁnite number of samples the sample mean follows Gaussian distribution and the sample variance follow χ2 distribution [4].3) where µl (σ2 ).5 analyzes the impact of mean and variance uncertainty on statistical timing analysis. Section 7. then Section 7. and ﬁnally Section 7.1) (7.7 concludes this chapter. The rest of this chapter is organized as follows: Section 7.4) (7.5) (7. respectively. µd (σ2 ). Section 7.

estimate the distribution of statistical measures of production lots. only the measured as opposed to the population value is known. niche process/high column part (S-L). of 1. Therefore. we may want to guardband statistical analysis to ensure a certain degree of conﬁdence. Table 7. In practice. wafers. Nw . the σ r uncertainty of mean and variance comes largely from lot-to-lot variation. Hence σ2 µl σ2 µw σ2 µd σ2 . and tens of measured circuit elements within each die2 . 2 Note that 85 lots of 25 300mm wafers each with die-size of 100mm 2 amounts to production volume 167 . we focus our attention on lot-to-lot variation. Hence in this chapter. dies.1 illustrates ﬁve cases of number of measured lots and number of production lots.10) (7. Example use models for these different scenarios can be: commodity process/high volume part (LL).11) σ2 = σ2 /Nr µr r σ2 2 = σ2 /Nr r σ r where Nl . we assume that the lot-to-lot variation of all the variation sources follows Gaussian distribution.5 million chips. commodity process/niche part (L-S). σ 2 2 µr σ l Nd σ2 2 σ w σ2 2 σ d σ2 2 . In rest of the chapter. Therefore. the conﬁdence interval is large and hence statistical analysis is not reliable.8) (7. In this case. and circuit elements. more than one hundred dies per wafer. niche process/niche part (S-S). and Nr are total number of lots.σ2 = σ2 /Nd µd d σ2 2 = σ2 /Nd d σ d (7.9) (7. and commodity (or niche) process/prototyping (L-1).When the number of measured lots or production lots is small. There are usually 20 to 50 wafers per lot. Therefore. Nr Nw Nl . our the problem is: Given the number of measured lots and production lots and the measured values of mean and variance of variation sources. Nd .

Table 7.3 Conﬁ dence Interval for Variation Sources In this section. we calculate the total mean and variance of the production as: µt = µl + µo ˜ ˜ ˜ ˜l σt2 = σ2 + σ2 o (7. we calculate the total mean and variance of measured value as: µt = µl + µo ˆ ˆ ˆ ˆl σt2 = σ2 + σ2 o (7.1: Reliability of statistical analysis for different number of measured and production lots. 7. 7.1).3.2. In the same way as above.12) (7.15) In the rest of this subsection. Therefore.13) As discussed in Section 7. the uncertainty of mean and variance comes largely from lot-to-lot variation.1 Mean and Variance Let’s ﬁrst discuss the mean and variance. we will discuss conﬁdence interval estimation for a single variation source. we assume the mean µo and variance σ2 of all other o variation are reliable. we will focus on analyzing conﬁdence interval for lot-tolot variation. From (7. 168 .14) (7.3 summarizes the notation used in the rest of the chapter.Case # Measured lots Large Small Large Small Any # Production lots Large Large Small Small 1 Conﬁdence interval Small Large Large Large Large Reliability of Statistical analysis High Low Low Low Low L-L S-L L-S S-S L-1 Table 7.

Finally. σ2 . σ2 = σ2 + σ2 + σ2 o w r d cn f /s no hat ˆ ˜ 2t 2l 2w 2d 2r 2o Table 7. and so on) inter-lot variation value (µl . σ2 number of lots mean and variance fast/slow corner population value (µ. σ2 . σ2 . In order to estimate the conﬁdence interval. and so on) l inter-wafer variation value (µw .17) where M1 ∼ N(0. and so on) r total value except inter-lot variation (µo . σ2 . inter-die and within-die variation (µt . we estimate the production mean and variance with respect to population mean and variance: √ µl = µl + σl · M2 / n ˜ ˜ (7. inter-wafer. we ﬁrst estimate the population values of mean and variance from measurement data [4]: ˆ µl = µl + σl · M1 · ˆ ˆl σ2 = (n − 1)σ2 /Q1 ˆ l n−1 ˆ nQ1 ˆ (7. ˜ ˜ 169 . µ.18) (7. and so on) o µo = µw + µd + µr . and so on) ˆ ˆ value of production lots (n.16) (7. and so on) ˜ ˜ total value including inter-lot. ˆ ˆ ˜l σ2 = σ2 · Q2/(n − 1) ˜ l where M2 ∼ N(0. and Q 1 ∼ Then. σ2 . σt2 . σ2 and so on) value obtain from measured lots (n. and so on) w inter-die variation value (µd .2: Notations. and so on) d within-die variation value (µr .19) 2 χn−1 is a random variable with χ2 distribution with n − 1 degrees of freedom [4]. 1) is a random variable with standard normal distribution and Q 2 ∼ 2 χn−1 is a random variable with χ2 distribution with n − 1 degrees of freedom.n µ. µ. 1) is a random variable with standard normal distribution.

T is a random variable with t-distribution with n degrees of freedom [4]. We may simplify our model when either n or n is large. we analyzed the case when both n and n are small (S-S ˆ ˜ case).3.24) (7.27) 7. U is the ˆ ratio of two χ2 random variables. hence.26) where Γ(·) is Gamma function [4] which is calculated as: Γ(x) = Z ∞ 0 t x−1 e−t dt (7.25) normal random variable and square root of a χ2 random variable. Therefore. 1). Moreover.22) (7. T is the ratio between a standard ˆ ˜ Γ( n )Γ( n )(u + 1) 2 2 (7.23) (7.21) (7.we obtain the production mean and variance as: ˆ µl = µl + σl · ˜ ˆ ˜l σ2 where n+n ˆ ˜ nn ˆ˜ n−1 ˆ ζ = n−1 ˜ √ √ √ n − 1( nM1 + nM2 ) ˆ ˜ ˆ T = (n + n)Q1 ˆ ˜ Q2 U = Q1 η = It is easy to ﬁnd that: √ √ nM1 + nM2 ˜√ ˆ n+n ˆ ˜ n − 1 M1 M2 ˆ ˆ √ + √ = µl + σl ηT ˆ Q1 n ˜ n ˆ Q2 (n − 1) ˆ ˆl ˆl = σ2 · = σ2 · ζ2 ·U Q1 (n − 1) ˜ (7.2 Simplifying the Large Lot Case In the previous subsection. The PDF of U is [4]: PDFU (u) = ˆ ˜ Γ( n+n − 1)u 2 −1 2 n ˆ n+n ˆ ˜ 2 ∼ N(0. ˆ ˜ When the number of measured lots n is large (L-S case). we can approximate the ˆ 170 .20) (7.

3.28) (7. However.30) (7.32) (7.21).20) and (7. since there is only one lot. we also consider the special case of one production lot (L-1 case).population mean and variance as measured mean and variance: µl ≈ µl ˆ (7.e. lot-tolot variation has no impact on variance.33) ˜l σ2 ≈ σ2 l Therefore. L-S.: µl ≈ µl ˜ (7.20) and (7.35) Besides S-S.36) .21) can be simpliﬁed to: ˆ µl = µl + σl M1 · ˜ ˆ ˜l ˆl σ2 = (n − 1)σ2 /Q1 ˆ 7. i.34) (7. Uncertainty of variance estimation is mainly caused by wafer-to-wafer variation. that is n = 1. (7. Hence.29) ˆl σ2 ≈ σ2 l In this case.21) can be simpliﬁed to: √ ˆ µl = µl + σl · M2 / n ˜ ˆ ˜ (7. In this case. and S-L cases. the production variance is calculated as: ˜ ˆw σt2 = σ2 · Q2 /(n − 1) + σ2 + σ2 ˜ r d 171 (7. (7.31) ˜l ˆl σ2 = σ2 · Q2/(n − 1) ˜ In the other case when the number of production lots n is large (L-S case). in this case.20) and (7. the ˜ production values of mean and variance are very close to the population value. production mean still follows the same ˜ distribution as shown in (7.3 One Production Lot Case n−1 ˆ nQ1 ˆ (7.

we use wafer-to-wafer variation instead of lot-to-lot variation. Table 7.3 compares the targeted conﬁdence level and the measured conﬁdence level.34 70 72. ˆ ˜ where n is number of production wafers. We model each set as a design and assume that 5 of the wafers are used to generate the statistical model (n = 5) and the other wafer is the ˆ production wafer (n = 1). To get over the hurdle of huge data requirement. Q2 is a χ2 random variable with n − 1 free˜ ˜ ˆw dom.10 Table 7. 7. Fortunately.% Measured conf. Nevertheless. We randomly divide these wafers into 58 sets with 6 wafers per set.% 50 51. the mathematical derivations in this work make very few approximations if any. We need exceedingly large amounts of data spread out over hundreds of lots preferably over multiple independent designs.4 Experimental Validation for One Lot Case The problems in validation of the conﬁdence-based guardband models proposed in this work should be obvious by now. and σ2 is the measured variance of wafer-to-wafer variation.3.Targeted conf. 172 . We obtain wafer-to-wafer ring oscillator delay variation data from an industrial 65nm process with 348 wafers spread over 23 lots. For each set we apply (7.3: Experimental validation of estimation of conﬁdence interval for mean for n = 5 and n = 1.41 80 81.03 90 93. The small error (≤ 4%) is due to limited data. We obtain the measured conﬁdence by counting the number of sets that production value is within the conﬁdence interval: Measured Conf = Number of match sets/ Total number of sets.72 60 60. in this section we try to validate the theory above for the case with small number of measured lots and one production lot.20) to estimate the distribution of ˜ production mean and calculate the conﬁdence interval for a targeted conﬁdence level.

5 Fast/Slow Corner Estimation The next statistical characteristics we consider are fast and slow corners cn f /s for variation sources. we use Monte-Carlo simulations to obtain the distributions of the fast/slow corners.g. Assuming that hold time violation is harder to ﬁx.39) (7. we have: ˆ cn f = µl + µo + σl ηT − k f ˆ ˆ cns = µl + µo + σl ηT + ks ˆ ˆ σ2 ζ2U + σ2 o ˆ σ2 ζ2U + σ2 o (7. fast corner and slow corners are dependent on each other.21). 173 . we assume fast implies smaller parameter value (e. Therefore.3.38).38) (7. from the samples.37) For a given conﬁdence level.41) Due to the complexity of (7. we want to estimate the worst case fast/slow corner. that is: s 3 P{cn f > cnw } = c f f f P{cn f > cnw .40) (7.20) and (7. Usually. channel width).7. We estimate the worst case fast/slow corner such that there is c f f conﬁdence that the fast corner is larger than cnw and there f is c ft conﬁdence that the fast corner is larger than cnw and the slow corner is smaller f than cnw . cns < cnw } = c ft s f (7.42) With the worst case fast corner value. Then the worst case slow corner f f 3 Without loss of generality. CDFcns |cn f <cnw . the fast/slow corner is expressed as: ˜ cn f /s = µt ± k f /s σt ˜ From (7. we may also obtain the conditional CDF of cns given cn f > cnw . However. cnw/s . Then the worst case fast corner is calculated as: −1 cnw = CDFcn f (c f f ) f (7. f we need to handle them together. higher conﬁdence may be needed on fast corner.

the mean and variance can be simpliﬁed as (7. We also assume that the variation source is with standard normal distribution 4 . 174 .46) nominal mean and variance of variation sources does not affect the value of k f /s.43) We then express the worst case corner value as: ˆ cnw/s = µ ± k f /s σ ˆ f (7.is calculated as: −1 cnw = CDFcns |cn f <cnw (c ft /c f f ) s f (7.30).e. Hence the corner model can be also simpliﬁed as: M ˆ 2 cn f = µl + σl √ − k f ˆ n ˜ M ˆ 2 cns = µl + σl √ + ks ˆ n ˜ 4 The ˆl σ2 · Q2 + σ2 o (n − 1) ˜ ˆl σ2 · Q2 + σ2 o (n − 1) ˜ (7. when the number of measured lots is large (L-S). Notice that the k f /s depends on ˆ the measured values µ and σ2 . In the experiment. k f = ks = 3 and we assume that the variance of inter-lot variation ˆ ˜ is 18% of the total variance (derived from the data described in section 7.3. In ˆ the table. the guardband percentage is calculated as: 100 × (k f /s − k f /s )/k f /s . and then apply the ˆ method above to calculate the worst case value of k f /s . From the table.45) (7.4).3.4 shows the value of k f /s under different conﬁdence levels. k f /s needs to be 30% over k f /s .44) When the worst case corner values are known. For each run.. we let n = 10. we generate n ˆ ˆ measured samples to obtain the measured mean µ and variance σ2 . we can ﬁnd that to ensure high conﬁdence. we need a large guardband. Table 7. i. n = 15. we perform the estimation for 1000 runs. the k f /s values are calculated as the average of 1000 runs. In the experiment. As discussed in Section 7. hence each run may result in a different k f /s value. we can easily obtain the value of k f /s .2. In the table.

4%) 3. Similarly.60%) 3.17%) 3.625 (20.422 (14.4%) 3.17%) 3. From the table.8%) 4.345 (11.138 (4. when the number of production lots is large (S-L).155 (5.48) Table 7.20%) 3. corner model can be simpliﬁed to: ˆ cn f = µl + σl · M1 · ˆ ˆ cns = µl + σl · M1 · ˆ n−1 ˆ −kf nQ1 ˆ n−1 ˆ + ks nQ1 ˆ ˆl (n − 1)σ2 ˆ + σ2 o Q1 ˆl (n − 1)σ2 ˆ + σ2 o Q1 (7.121 (4.2%) 3.4: Guardband of fast/slow corner for different degrees of conﬁdence assuming n = 10.43%) 3.13%) 3.124 (4.080 (2.342 (11.67%) 3.5 shows the average k f /s value for the L-S case. it is evident that the guardband is smaller than that of the S-S case.255 (8.50%) 3. ˆ ˜ Table 7.502 (16.14%) 3.6 shows the average k f /s value for the S-L case.8%) ks 3.594 (19.47) (7. ˜ ˆ 175 .246 (8.619 (20.112 (3.042 (34.5%) 3.275 (9. Figure 7. As expected.035 (1.03%) 3. ˆ ˜ c ft % 50 60 70 80 90 95 cff % 60 70 80 90 95 99 kf 3.236 (7.401 (13.5: Guardband of fast/slow corner for different degrees of conﬁdence assuming large n and n = 15.215 (7.87%) 3.7%) 3.6%) Table 7. the guardband in this case is smaller than the S-S case.103 (3.13%) 3.7%) ks 3. L-S (or S-L) can be treated as L-L.518 (17.1 compares the 90% conﬁdence value of k f between the L-S (or S-L) and L-L.c ft % 50 60 70 80 90 95 cff % 60 70 80 90 95 99 kf 3.2%) 3.73%) 3. We observe that when n ≥ 60 (or n ≥ 85).307 (10.7%) Table 7. n = 15.

15 3.40%) 3.1: Comparison of S-L. and L-L case.3 3.134 (4.272 (9. we need a larger number of measured lots (or production lots) to be considered as large.6: Guardband of fast/slow corner for different conﬁdence levels assuming n = 10 and large n. the uncertainty of 176 .25 L−S S−L k’f 3. This is because when the lot-to-lot variation increase.193 (6.132 (4.09 f 150 Figure 7.4%) 3. the statistical analysis is reliable and we do not need to perform guardband estimation.052 (1. when error of k f value between L-L and L-S (or S-L) cases is within 3%. ˆ ˜ That is.1 3.4%) Table 7.43%) 3.267 (8.50%) 3.47%) 3.90%) 3. ˆ ˜ ~ n =60 ^ n =80 3. L-S.690 (23.523 (17.0%) ks 3.17%) 3.73%) 3.2.165 (5. we consider n (or n) value is large.3. Therefore.2 3.07%) 3.155 (5. in the special case when there is only one production lot (L-1).3%) 3. lot-to-lot variation has no impact on the variance. In the experiment.2. the uncertainty of mean and variance increases. Note that increasing lot-to-lot variation will increase the value of n and n that can be ˆ ˜ considered large as shown in Figure 7.431 (14.c ft % 50 60 70 80 90 95 cff % 60 70 80 90 95 99 kf 3.05 50 ~ 100 ^ n (n) k’ =3. As discussed in Section 7.370 (12. Instead.

we assume that there are 25 wafers per lot. we see that the guardband is smaller ˜ than that of the L-S and S-L cases.800 600 ~ ^ Large n (n) ~ n ^ n 400 200 0 20 40 σ2% l 60 Figure 7.2: Number of n or n which can be considered as large for different fraction of ˆ ˜ lot-to-lot variation assuming maximum 3% error at 90% conﬁdence. hence the uncertainty of the corner is smaller. From the table. the variance comes from wafer-to-wafer variation.49) (7.50) Table 7. This somewhat surprising result is explained by the fact that lot-to-lot variation has no impact on variance (but still affects the mean).36). In the experiment.7 shows the k f /s for the L-1 case. 177 . n = 25. we can calculate the fast/slow corners as: ˆ cn f = µl + µo + σl ηT − k f ˆ ˆ cns = µl + µo + σl ηT + ks ˆ ˆw σ2 Q2 /(n − 1) + σ2 ˜ o ˆw σ2 Q2 /(n − 1) + σ2 ˜ o (7. From (7.

c ft % 50 60 70 80 90 95 cff % 60 70 80 90 95 99 kf 3.1%) ks 3.77%) 3.438 (14.475 (15.574 (19.6%) 3.7%) 3.116 (3.2%) 3.51) where m is the number of variation sources.197 (6.27%) 3.4%) 3.4 Conﬁ dence in SPICE Fast/Slow Corner Fast/slow corners for circuit simulation is the major interface between design and process.52) (7.432 (14. respectively.7: Guardband of fast/slow corner for different conﬁdence levels assuming n = 10. Considering all variation sources to be independent.320 (10. Xi ’s are variation sources. Variation source values corresponding to the corner delay Dw/s are then calculated.307 (10.57%) 3.53) ˜D σ2 = i=1 ˜ li ∑ c2(σ2 + σ2 ) i oi m where µoi and σ2 are the mean and variance of other variation for the ith variation oi ˜ li source. µli and σ2 can be calculated as shown in Sec˜ tion 7.051 (1. f We assume that the inverter chain delay is a linear function of variation sources: m D = d0 + ∑ ci Xi i=1 (7. and ci ’s are sensitivity coefﬁcients of inverter chain delay to variation sources.278 (9. respectively. µli and σ2 are the mean and variance of lot-to-lot variation for ˜ ˜ li the ith variation source. 178 .70%) 3. ˆ ˜ ˜ 7.261 (8.8%) Table 7. and n = 25.87%) 3.3. n = 1.70%) 3. the mean and variance of the product inverter chain delay can be calculated as: µD = d0 + ∑ ci (µoi + µli ) ˜ ˜ i=1 m (7.1. A common approach to derive the corners is to use measured values from canonical circuits such as an inverter chain.293 (9. d0 is the nominal delay.

we assume that the 3σ value is 20% of the nominal value. When the nominal mean increases. 179 . In the table. We assume three variation sources. From the table.. we use PTM bulk low power SPICE model for 45nm technology [1]. In the experiment.we show ˆ ˜ the average of 1000 runs. the guardband percentage decreases. We also assume that the inter-lot variation is 18% as in previous sections.5 we use Monte-Carlo simulation to obtain the PDF of the fast and slow delay corners. In the experiment. Corresponding corner values of variation sources f are calculated by solving the following [121]: Dw/s = d0 + ∑ ci Xiw f w X1 f /s − µ1 ˆ m (7. we observe that under the same conﬁdence level.3. we assume 3σ = 3nm. This is because the k f value of variation sources does not depend on the nominal mean and variance. One sided conﬁdence interval is then used to obtain the worst case fast and slow delay corners Dw /s. the guard band percentage does depend on nominal mean. = ˆ cm σm (7. for threshold voltage variation. the guardband percentage of delay corner is less than that of k f value of variation sources as shown in Table 7. we let n = 10 and n = 15.Usually. For gate length variation. However.nominal value|/nominal value. the guardband percentage is calculated as: 100×|worst case value .56) Table 7.. for the worst case delay.8 illustrates the guardband percentage for the worst case delay and the corresponding variation source values under different conﬁdence levels. Since measured corner values affect production corners. the delay corner is expressed as ˜ ˜ D f /s = µD ± k f /s σD ˜ (7. gate length L.54) As in Section 7.55) w Xm f /s − µm ˆ ˆ c1 σ1 = w X2 f /s − µ2 ˆ i=1 ˆ c2 σ2 . threshold voltage for NMOS Vtn and PMOS Vt p .4.

3 95. though appears to be small. 000.11 show the average guardband percentage for the L-S and S-L cases.58 0.58% of nominal value.91 1.32 0.62 w Vtn f Vtw f p 0.8: Percentage guardband of worst case delay corner and corresponding variation source corners under different conﬁdence levels assuming n = 10.80 2.20 0. Note that the guardband.41 2. is signiﬁcant when compared to variation itself (e.6 72.29 1.7% of nominal value.9 61.603 Dw s 0. and Vth value is in V . we ﬁnd that the exact simulated conﬁdence is a little bit higher than the target value.81 1.77 1.43 0.71 0.10 and 7.61 388 Lw s 0.4 81.50 2.22 3. n = 15. we also compare the targeted conﬁdence and the exact conﬁdence of the estimated worst case corner value.57 1.2 Table 7.7 99.55 0.612 Nominal Table 7.643 0. Such error largely comes from the ﬁtting of the linear delay model to SPICE.27 2. That is. ˜ In Table 7. In the experiment. From the table.13 2.5 cff 61.9. respectively. we let nrun = 1.01 3.15 4.54 0.41 w Vtns Vtw ps 0.58 349 Lw f 0 07 0 14 0 29 0 50 0 79 1 15 42. The ﬂow to obtain the simulated conﬁdence is shown in Figure 7. ˆ n = 15.29 0.36 0.45 0.g.1 90.52 0.71 0.8 95. Table 7.97 1.9: Conﬁdence comparison for inverter chain based corner assuming n = 10.c ft 50 60 70 80 90 95 cff 60 70 80 90 95 99 Dw f 0. Lgate value is in nm.86 1. Note: ˆ ˜ delay value is in ps.35 2.84 3. σ for Vt p variation is only 6.4 91. Targeted Con f c ft 50 60 70 80 90 95 cff 60 70 80 90 95 99 Simulated Con f c ft 51.36 3.15 2. the 180 .19 0.23 0.3.40 0. while the guardband of fast corner is up to 2.44 0.7 71.00 2.21 0.558 0.13 0.90 47.6 81.39 0.

D 11. c ft = nt /nrun Figure 7. Hence. Obtain worst case corners: Dw . Therefore.000-sample SPICE MC simulation on inverter ˜ chain delay based on µ and σ2 .1.54) ˆ according to µ and σ2 . Similar to the single ˜ ˆ we consider n (or n) value is large. In the ﬁgure. or S-L cases. ˜ 9. if D f > Dw /* compare simulated value and estimated value */ f 13.1 compares the 90% conﬁdence value of Dw between the L-S (or S-L) and f L-L. Calculate µ and σ2 for each variation sources from samples ˆ 5. We also ﬁnd that for the worst case delay corner. this is because lot-to-lot variation has no impact on variance. s f 7.7% of standard deviation). Table 7. L-S. n f = 0. 181 .12 shows the average guardband percentage for the L-1 case. source case. c f f = n f /nrun .000-sample MC simulation on (7. Dw is normalized with respect to the nominal value. can be considered as “large”. 12. Generate n samples of measured lots /* calculate estimated value */ ˆ ˆ 4. This is because the guardband of worst case delay is smaller than that of single variation source as discussed above. ˆ 6. if Ds < Dw s 15. we need fewer lots to achieve the same model reliability. Calculate simulated corners: D f = µD − k f σD .3. Calculate µ and σ2 for each variation source from samples. Generate n samples of production lots /* calculate simulated value */ ˜ ˜ 8. for i=1 to nrun 3. ˜ 10. when error of k f value between L-L and L-S (or S-L) case are within 3%. 14. Figure 7. n f =n f + 1. Dw .5. nt = nt + 1 16. nt = 0 2. Ds = µD + ks σD . As discussed in Section 7. Perform 10. From the ﬁgure.1. Perform 10. Calculate mean µD and variance σ2 of SPICE MC samples. The guardband percentage of this case is smaller that of S-S. ˆ ˜ the value of n (or n) to be considered as large is smaller than that of the single variation ˆ ˜ source as in Figure 7. guardband value is up to 38.3: Flow to simulate conﬁdence. f We observe that n ≥ 30 (or n ≥ 45).

12 0. ˜ c ft 50 60 70 80 90 95 cff 60 70 80 90 95 99 Dw f 0.44 Lw f 0.11 0.05 0.41 3.28 2.21 0.24 0.34 0.70 2.30 Table 7.13 1.03 Table 7.34 0.86 w Vtn f Vtw f p 0.59 2.13 2.86 1.58 0.90 Lw f 0.07 Lw s 0.91 0.33 0. ˆ ˜ 182 .42 Dw s 0.16 0.10: Percentage guardband of worst case delay corner and corresponding variation sources corners under different conﬁdence levels assuming large and n = 15.01 1.27 0.15 0.44 0.24 0.38 0.17 0.61 1.30 0.36 3.67 2.60 0.11: Percentage guardband of worst case delay corner and corresponding variation source corners under different conﬁdence levels assuming n = 10 and large n.53 2.24 0.97 1.08 1.77 2.89 2.49 0.95 1.43 0.06 1.15 1.81 2.69 1.68 w Vtns Vtw ps 0.20 1.06 0.21 0.41 0.75 Dw s 0.16 0.37 0.01 2.38 0.97 w Vtn f Vtw f p 0.43 0.31 0.58 0.29 0.49 0.68 3.77 1.70 2.39 0.c ft 50 60 70 80 90 95 cff 60 70 80 90 95 99 Dw f 0.82 1.18 0.73 1.50 1.72 1.10 0.77 w Vtns Vtw ps 0.71 Lw s 0.27 0.32 0.97 1.53 0.35 2.43 0.16 0.10 1.59 0.66 1.14 0.46 0.68 0.64 1.11 0.50 2.67 0.92 0.

98 0.50 0.71 0.51 w Vtn f Vtw f p 0.96 0.54 0.32 0.63 0.90 1.30 0.85 1.4: 90% conﬁdent Dw with varying n and n.46 2.11 0.29 0.03 1.53 0. n = 1. Next we discuss the impact of limited characterization or production data on one of the most talked about chip-level statistical analysis methods.37 0.95 1.67 1.10 0.52 0.53 0. statistical timing analysis.27 0.21 1.23 0.1 w 0.24 0.40 0.16 0.45 Dw s 0.42 0.32 0.9 50 3% error ~ n =30 ^ n =45 Normalized D f L−S S−L 100 ~ ^ n (n) 150 Figure 7.71 1.18 0.49 0. namely.30 0.05 Lw f 0.17 0.37 0.68 0.69 0.12: Percentage guardband of worst case delay corner and corresponding variation source corners under different conﬁdence levels assuming n = 10.06 0.10 1. and ˆ ˜ n = 25.17 0.87 Lw s 0. we illustrated the extra guardband required in corners (which ironically are themselves guardbands by deﬁnition) to ensure conﬁdence in statistical analysis when the number of lots used to characterize variation or number of production silicon lots are small.40 Table 7.47 w Vtns Vtw ps 0. ˆ ˜ f c ft 50 60 70 80 90 95 cff 60 70 80 90 95 99 Dw f 0.91 1.38 0. ˜ we need lower guardband to achieve the same conﬁdence.23 0.94 0. 183 . In this section.13 0.92 0.

7.59) Another quantity of interest is a certain percentile point. Then for a given conﬁdence level Con f . it is easy for us to obtain any interval of mean and variance under a certain conﬁdence. its mean and variance can be calculated in the same way as shown in (7.60) (7. We again use Monte-Carlo simulation on the linear equation to obtain the CDF of the output mean (CDFµD (·)) and variance (CDFσ2 (·)). Therefore.5 Conﬁ dence in Statistical Timing Analysis In this section.52). the percentile point C(p%) can be calculated as: C(p%) = µ+ kσ k = Φ−1 (p%) 5 Notice (7. the chip delay is expressed as a linear canonical form of variation sources: m D = d0 + ∑ ci Xi i=1 (7. it is easy to obtain the worst case D mean and variance: 5 −1 µw = CDFµD (Con f ) −1 σw2 = CDFσ2 (Con f ) D (7.58) σw = σw2 (7.61) that we have known the distribution of mean and variance.5. 184 .57) Since circuit delay has a linear canonical form. Since quadratic SSTA with non-Gaussian variation sources is very complicated. we only analyze linear SSTA with Gaussian variation sources in this section. we will discuss about conﬁdence estimation for statistical static timing analysis (SSTA). Since we assume Gaussian variation sources and apply linear canonical form to approximate circuit delay. By performing linear SSTA in Section 5. the circuit delay also follows Gaussian distribution.

we compare the targeted conﬁdence and the simulated conﬁdence of the estimated conﬁdence interval.5 illustrates the worst case mean.4 to obtain the CDF of the percentile point. standard deviation. The only difference is that instead of running SPICE Monte-Carlo simulation. we apply the same experimental setting as that for the inverter chain as in Section 7. although the guardband value seems not large. standard deviation. we apply SSTA [41] to obtain the chip delay distribution. However. We may apply the same method as in Section 7. In this way. Figure 7.6% of nominal value to ensure 95% conﬁdence. From the ﬁgure. and then calculate its worst case value.3. In the experiment. In the ﬁgure. Figure 7. We ignore L-S and L-1 cases where the need of SSTA is not well motivated. we observe that the guardband of chip delay is lower than that of single variation source and inverter chain delay. different runs may result in different worst case values. the worst case value depends on the measured values of mean and variance. As discussed before.9X of standard deviation and the guardband value is up to 7. and 95% percentile point for ISCAS85 benchmarks. The simulated conﬁdence is obtained in Figure 7.6 illustrates the mean. This is because chip delay has a larger nominal mean to variance ratio than inverter chain. the worst case value is averaged over 1. the percentile point is converted to a µ+ kσ corner. Figure 7.4. for the 95-percentile point. Due to our simplifying assumption that the linear canonical form is independent of small changes in mean and variance.where Φ(·) is the CDF of standard normal distribution. For example.7 compares the 90% conﬁdence value of 95-percentile point for C7552 185 . Hence. it is very signiﬁcant compared to the scale of variation.000 runs and normalized with respect to the nominal value. That means the guardband value is up to 52% of standard deviation. the nominal value is 6.13. there is a small error between targeted and simulated conﬁdence. In Table 7. and 95-percentile point of ISCAS85 benchmarks under different conﬁdence level for S-L case.

8 70.3 90.5 95. ˜ Targeted con f 50 60 70 80 90 95 Simulated con f µ 51.08 90 60 70 80 Confidence 90 Worst 95% 1.02 σw 1.06 C17 C444 C1355 C7552 1. standard deviation.1.3 Table 7.5: Worst case mean.1 80.7 81. n = 15.8 80.06 µw 1.04 1. assuming n = 10 and ˆ n = 15.1 1. and 95-percentile point normalized with respect to the nominal value for ISCAS85 benchmarks.02 50 60 70 80 Confidence 90 Figure 7.5 σ2 51.9 90.2 95.3 60. ˆ ˜ 186 .4 95.3 61.04 C17 C444 c1355 c7552 1.2 95% 51.4 60.05 1 50 C17 C444 C1355 C7552 1 50 60 70 80 Confidence 1.9 70.0 90.15 1.2 70.13: Conﬁdence comparison for ISCAS85 benchmark 95-percentile point delay assuming n = 10.

05 60 70 80 Confidence 90 1 50 60 70 80 Confidence 90 1. ˆ ˜ 187 .01 1 50 C17 C444 C1355 C7552 1.1.06 worst 95% 1.6: Normalized worst case mean. and 95-percentile point for ISCAS85 benchmarks.02 1 50 60 70 80 Confidence 90 Figure 7.1 σw C17 C444 C1355 C7552 1.02 1.04 C17 C444 C1355 C7552 1. standard deviation. assuming n = 10 and n is large.03 µw 1.04 1.05 1.

188 .7: 90% conﬁdence worst case 95-percentile point with varying n for C7552.05 worst 95% ^ n =24 ^ n =80 1. percentile point is normalized with respect to the nominal value. ˆ ˆ Again. we see that limited data can lead to signiﬁcantly large guardbanding to ensure good conﬁdence in analysis results. between S-L and L-L. we observe that the guardband of chip delay is lower than that of single variation source and inverter chain delay. fast (essentially Monte Carlo on a linear expression.03 1. From the ﬁgure. The method discussed in this section is fairly straightforward. respectively.1. In the ﬁgure. ˆ Percentile point is normalized with respect to the nominal value.02 1. runtime is within 10s) and accurate to estimate conﬁdence in results of statistical timing analysis especially when the variability models are known to suffer from limited lot-to-lot variation data.04 1.01 1 ^ n =40 50 100 ^ n 150 Figure 7. we need n ≥ 40 and n ≥ 80.06 1. We ﬁnd that we only need a small number of lots (n ≥ 24) to bound the error between S-L and L-L under 3%. If ˆ we want to bound the error to 2% or 1%.

6 This is usually possible only for high-volume. adaptive voltage scaling can also alleviate downside of low conﬁdence. conﬁdence level is directly related to risk. we have proposed techniques to quantify the conﬁdence in statistical circuit analysis and estimate guardband required to attain a desired level of conﬁdence. A lower conﬁdence does not necessarily translate to lower yield but it implies that the probability of the statistical analysis actually matching real silicon is smaller.. Therefore. running more lots is acceptable). For designs where power/performance constraints are ﬂexible or time to volume is not critical (i. Ability to change design characteristics during or post-manufacturing mitigates the need for high conﬁdence in models. If the silicon foundry can extensively adjust the process (which essentially shifts the mean) for the design6 . high-value designs.7.6 Practical Questions: Determining Conﬁ dence Induced Guardband for Real Designs So far in the chapter. performance. The answer to “what is the desired level of conﬁdence” is not an easy one especially given the steep increase in guardband to ensure higher levels of conﬁdence.e. Risk (in terms of schedule. two issues inﬂuence ˆ ˜ the choice of conﬁdence level: 1. the silicon can be made to match the analysis and ensuring high conﬁdence in design-side estimates is not essential. Besides quality of models (n) and production volume (n). 189 . if yield is lower than expectation. power. Similarly. Conﬁdence levels provide a tradeoff between risk of higher manufacturing cost (due to low yield) and higher upfront design cost (due to guardbanding). 2. etc) tolerable for the design. tunability using knobs such as adaptive body biasing. a lower conﬁdence is acceptable. Extent of post-silicon tuning options available.

We assumed lot-to-lot variation to be 18% of total variance for our experiments. the 95th percentile delay as estimated using statistical timing analysis needs a guardband of 5%-7%. variance.In the end. mean. we have studied impact of characterization and production silicon volumes on believability of statistical variation models and analysis. The required guardband will increase if lotto-lot variation increases as a fraction of total variation. the choice of conﬁdence will be as much a business decision as a technical decision as it quantiﬁes the desire for predictability of statistics of manufacturing. the guardband needed for 95% conﬁdence is about 14% of the nominal uncertainty (difference between best/worst corners) for low-to-medium volume production (n = 10. 1. 2. Finally. Predicted statistics are reliable for number of measured lots larger than 85 and number of production lots larger than 60. measured data volume. ˆ ˜ 3. For a typical SPICE corner extraction methodology. best/worst corners for a variation source may need as much as 30 % guardband to attain 95% conﬁdence.. 7. For small characterization or production volumes. We believe that conﬁdence- 190 . percentile corners. n = 15). We have developed methods to estimate the distribution of production silicon statistical characteristics (e.g. and intended production volume. etc) in terms of (unreliable) measured statistical characteristics.7 Conclusions In this chapter. Our experiments and results indicate the following. Our estimates show strong agreement with conﬁdence measured from a set of real silicon data (maximum error ≤ 4%). The loss of reliability largely stems from lot-to-lot variation.

level induced guardband should be considered for all low-to-medium volume designs. 191 . the problem is likely to become more severe when 450mm wafer sizes are adopted by the industry (since less number of lots would be required to meet the same production volume). Moreover.

analysis. We have modeled. Then we per- 192 . However. In Chapter 3. and optimization of FPGA power and performance with existence of process variation. the process variation caused by inevitable manufacturing ﬂuctuations and imperfections also increases when technology scales down. delay.CHAPTER 8 Conclusions In modern Very Large Scale Integration (VLSI) circuit industry. we have extended Ptrace to consider process variation. In Part I. In this dissertation. the scaling of Complementary Metal Oxide Semiconductor (CMOS) technology enables the ever growing number of transistors on a chip. including Chapter 2. 3. we consider two types of the most common VLSI circuits: Field-Programmable Gate Array (FPGA) and Application Speciﬁc Integrated Circuit (ASIC). our optimization reduces the energy-delay product by 18.4% and area by 23. we have developed a trace-based power and performance evaluation framework(Ptrace) for FPGA.3%. Compared to the baseline. hence improves the chip functionality and reduces the manufacturing cost. and optimized power and delay for both FPGAs and ASICs. Then we performed device and architecture coevaluation on hundreds of device and architecture combinations to optimize power. Process variation introduces signiﬁcant power and performance uncertainty to VLSI circuits and becomes a potential stopper for further technology scaling. analyzed. we have studied modeling. 4. and area. In Chapter 2.

and reliability model. and linear model. our optimization improves leakage power and timing combined yield by 9. delay. we have studied sta- tistical timing modeling and analysis for ASICs. Chapter 6. This offers a large room for power and delay tradeoff. hence they are very time efﬁcient. 3.5% and signiﬁcantly reduce leakage variation. In Part II.formed device and architecture co-optimization to maximize power and delay yield. and SER. we have developed transistor level voltage and current model based on ITRS MASTAR4 (Model for Assessment of cmoS Technologies And Roadmaps). our SSTA ﬂow predicts the 193 . we have presented new SSTA algorithms for three different delay models: quadratic model. All atomic operations of these algorithms are performed by closed-form formulas. In Chapter 5. In addition. delay.3X delay difference. we have proposed a new method to approximate the max operation of two non-Gaussian random variables using second-order polynomial ﬁtting. and reliability simultaneously. By applying such approximation.5% to 5. Experimental results show that compared to Monte-Carlo simulation. including Chapter 5. we have also studied the interaction between process variation. device aging.4%. and then developed circuit level power. we have developed a framework to evaluate FPGA power. Moreover. We observe that device aging reduces standard deviation of leakage by 65% over 10 years while it has relatively small impact on delay variation.1X energy difference. We have further shown that the device aging has a knee point over time and device burn-in to reach the point could reduce the performance change over 10 years from 8. Combining the circuit level model with Ptrace. and Chapter 7. In Chapter 4. We have shown that applying heterogeneous gate lengths to logic and interconnect may lead to 1. Compared to the baseline. quadratic model without crossing terms (semiquadratic model). we also ﬁnd that neither device aging due to NBTI and HCI nor process variation have signiﬁcant impact on SER.

and optimization for FPGAs and ASICs. 1%. we have studied impact of characterization and production silicon volumes on believability of statistical variation models and analysis.5% to 2% and improve run time by 6X compared to the traditional spatial variation model. and 52% of standard deviation for 95%-tile point of circuit delay are needed) to ensure 95% conﬁdence in the results. but also to FPGAs. Our estimates are within 2% of simulated conﬁdence values. the model of spatial correlation in Chapter 6 can also be applied to FPGA power and delay variation estimation to im- 194 . The SSTA method introduced in Chapter 5 can be applied to FPGA delay variation for more accurate and efﬁcient delay variation estimation. In Chapter 6. skewness. there are still lots of research works need to be done in this ﬁeld. statistical timing). we have studied the impact of systematic across-wafer variation on within-die variation. The guardbands are non-negligible for all cases when either production or characterization volume is not large. 38. In practice. We mathematically derived the conﬁdence intervals for commonly used statistical measures (mean.. Experimental results show that our new model reduces error from 6. Based on the analysis.g. Our experiments indicate that for moderate characterization volumes (10 lots) and low-to-medium production volumes (15 lots). we have developed an efﬁcient and accurate spatial variation model to model across-wafer variation at die level. 34. and 1% error. In Chapter 7. standard deviation. percentile corner) and analysis (SPICE corner extraction. analysis.mean. and 95-percentile point within 1%. the statistical modeling and analysis techniques presented in Part II can not only be applied to ASICs. However.7% of standard deviation for single parameter corner. a signiﬁcant guardband (e.7% of standard deviation for SPICE corner. 6%. This dissertation makes contributions to statistical modeling. We then implemented the new model to the SSTA ﬂow in Chapter 5. respectively. variance.

prove the accuracy and efﬁciency. and we may also apply the method in Chapter 7 to estimate the conﬁdence for FPGA power and delay variation analysis. In the future. applying the statistical analysis techniques for ASICs to FPGAs is one of our research topics. 195 .

First. Second. and leads to different placements for different chips of the same application.D. • Fourier Series Approximation for Max Operation in Non-Gaussian and to handle the max operation in SSTA with both quadratic delay dependency and Quadratic Statistical Static Timing Analysis. Our experimental results show that. We propose efﬁcient algorithms 196 . we propose an efﬁcient high-level trace-based estimation method. program and are not introduced in this dissertation: • FPGA Performance Optimization via Chipwise Placement Considering Process Variations. We develop a statistical chipwise placement ﬂow to improve FPGA performance. we develop a variation-aware detailed placement algorithm vaPL within the VPR framework [18] to leverage process variation and optimize performance for each chip. variation aware chipwise placement improves circuit performance by up to 12.1% for the tested variation maps. similar to Ptrace. to evaluate the potential performance gain achievable through chipwise FPGA placement without detailed placement.CHAPTER 9 Appendix In the appendix. I brieﬂy introduce the works I have ﬁnished in my Ph. compared to the existing FPGA placement. August 2006. Chipwise placement vaPL is deterministic for each chip when the chip’s variation map is known. This work was published in International Conference on Field Programmable Logic and Applications (FPL).

standard deviation and 95% percentile point with less than 2% error. max and add. This work was published in Design Automation Conference (DAC). this new model is more efﬁcient than the exponential model when calculating the full chip leakage power. Due to the additivity. then extend such model to consider both within-die spatial and within-die random variation. Based on such algorithms. June 2007. February 2009.non-Gaussian variation sources simultaneously. We present an efﬁcient function to represents the leakage variation considering inter-die variation. Compared to Monte Carlo simulation for non-Gaussian variation sources and nonlinear delay models. This work was published in IEEE/ACM Asia South Paciﬁc Design Automation Conference (ASPDAC). and the skewness with less than 10% error. we develop an SSTA ﬂow with quadratic delay model and non-Gaussian variation sources. our approach predicts the mean. Experimental results show that our method is 5X faster than the existing Wilkinson’s approach Analysis. We prove that the complexity of our algorithm is linear in both variation sources and circuit sizes. • Efﬁcient Additive Statistical Leakage Estimation. are calculated efﬁciently via either closed-form formulas or low dimension (at most 2D) lookup tables. • Accounting for Non-linear Dependence Using Function Driven Component cal analysis and show that ignoring non-linear dependence may cause error in both statistical timing and power analysis. We use polynomial instead of exponential 197 . We ﬁrst analyze the impact of non-linear dependence on statisti- additive leakage variation model. Experimental results show that the proposed FCA method is more accurate compared to the traditional PCA or ICA. Then we introduce a function driven component analysis (FCA) to minimize the error introduced by non-linear dependence. All the atomic operations. hence our algorithm scales well for large designs.

Volume 28. This work was published in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD).with no accuracy loss in mean estimation. 198 . about 1% accuracy loss in standard deviation estimation and 99% percentile point estimation. pp:1777 . issue: 11. November 2009.1781.

In Proc. In Proc. In Proc. [7] Aseem Agarwal.html. in Europe.eas. Vladimir Zolotov. Conf. and Massoud Pedram. Non-gaussian statistical interconnect timing analysis.. The effect of LUT and cluster size on deepsubmicron FPGA performance and density. Savithri Sundareswaran. and Rajendran Panda. January 2006. 2003. Asia South Paciﬁc Design Automation Conf. Statistical timing analysis for intra-die process variations with spatial correlations. and Rajendran Panda. and Sarma Vrudhula. [10] Elias Ahmed and Jonathan Rose. Design Automation and Test Conf. Vladimir Zolotov. pages 900 – 907. [5] Aseem Agarwal. volume 22. Design Automation Conf. Hanif Fatemi.. Savithri Sundareswaran. Alam and S. [3] Soroush Abbaspour. Min Zhao. In http://www. In Proc. pages 1243 – 1260. March 2006. Min Zhao. Statistical timing analysis using bounds and selective enumeration.. In Microeletronics Reliability. Computer-Aided Design. [2] Soroush Abbaspour. FieldProgrammable Gate Arrays. In IEEE Trans. Statistical delay computation considering spatial correlations. 1965. June 2003. David Blaauw. [8] Aseem Agarwal. New York: Dover. David Blaauw.asu. Kaushik Gala. Vladimir Zolotov. and Mathematical Tables. In Proc. [4] Milton Abramowitz and Irene A. Asia South Paciﬁc Design Automation Conf. and Massoud Pedram.R EFERENCES [1] PTM SPICE model. Handbook of Mathematical Functions with Formulas. Mahapatra. Nov.. pages 3–12. A comprehensive model of pmos nbti degradation. ACM Intl. Intl. January 2003. Parameterized blockbased non-gaussian statistical gate timing analysis. [11] M. August 2005. In Proc. Asia South Paciﬁc Design Automation Conf. Hanif Fatemi. and Vladimir Zolotov. Computation and reﬁnement of statistical bounds on circuit delay. [9] Aseem Agarwal. Sept. February 2000. In Proc. Computer-Aided Design of Integrated Circuits and Systems. Statistical delay computation considering spatial correlations. Stegun.A. 2003. David Blaauw. 199 . Vladimir Zolotov. Kaushik Gala. David Blaauw.edu/ ptm/latest. Symp. pages 271 – 276. Graphs. 2003. and David Blaauw. [6] Aseem Agarwal.

CA. 200 . Active leakage power optimization for FPGAs. and Sarma Vrudhula. Intl. Design Automation Conf. Bellato. [16] Adelchi Azzalini. A.[12] Jason H. Tahoori. February 1999. Singh. Leakage minimization of digital circuits using gate sizing in the presence of process variations. Praveen Ghanta. Paccagnella. [15] Maryam Ashouei. October 2005. [20] Sarvesh Bhardwaj and Sarma Vrudhula. March 2004. on Computer Design. A dualvt layout approach for statistical leakage variability minimization in nanometer CMOS. March 2008. Symp. In Proc. Ceschia. Sambasivan Narayan. Computer-Aided Design of Integrated Circuits and Systems. In Design Automation and Test in Europe. Abhijit Chatterjee. Bernardi1. June 2005. and Tim Tuan. τau: Timing analysis under uncertainty. 2005. 2003. Conf. M. Int. Sarma Vrudhula. Anderson. Design Automation Conf. D. Field-Programmable Gate Arrays. Adit D. February 2004. February 2005. Kluwer Academic Publishers.. Computer-Aided Design. and Chandu Visweswariah. M. In Proc. Intl. P. Jonathan Rose. Symp. Parameterized block-based statistical timing analysis with nonGaussian and nonlinear parameters. M. Evaluating the effects of SEUs affecting the conﬁguration memory of an SRAM-based FPGA. Rebaudengo1. Nov. [19] Sarvesh Bhardwaj. Field-Programmable Gate Arrays. [22] Sarvesh Bhardwaj.. and Vivek De. Sonza Reorda. Najm. In Proc. M. ACM Intl. [13] Hongliang Chang andVladimir Zolotov. Leakage minimization of nano-scale circuits in the presence of systematic and random variations. [14] Ghazanfar Asadi and Mehdi B. In Proc. [18] Vaughn Betz. Bortolato. November 2006. Computer-Aided Design. Soft error rate estimation and mitigation for SRAM-based FPGAs. In Proc. Candelori. 32:159–188. Conf. [17] M. and David Blaauw. pages 615 – 620. A framework for statistical timing analysis using non-linear delay and slew models. and P. June 2005. [21] Sarvesh Bhardwaj and Sarma Vrudhula. Zambolin. pages 71–76. In Proc. Violante1. and Alexander Marquardt. IEEE Trans. Architecture and CAD for Deep-Submicron FPGAs. Scandinavian Journal of Statistics. The skew-normal distribution and related multivariate families. Anaheim. In Proc. A. Conf. Farid N. ACM Intl.

. In Proceedings Process Control Diagnostics and Modeling in Semiconductor Manufacturing II Electrochemical Society Meeting. Design Automation Conf. Yield enhancements of design-speciﬁc fpgas. Ketkar. Ali Keshavarzi. Symp. June 2003. and Vivek De. Keith A. Ali Keshavarzi. [26] Shekhar Borkar. May 1997. 201 . In Proc. Design Automation Conf.. Wenping Wang.. In Proc. Rakesh Vattikonda. and S. Design Automation Conf. June 2003. ACM Intl. [24] Duane Boning. and Jose Renau. In Proc. Chih-Kong Ken Yang. September 2006. In Proc. A stochastic jitter model for analyzing digital timing-recovery circuits. Constantinides. In Proc. and Haitham Hindi. Nov. Burns. Design Automation Conf. [33] Hongliang Chang and Sachin S. Siva Narendra. Parameter variations and impacts on circuits and microarchitecture. Parameter variations and impact on circuits and microarchitecture. Matthew Guthaus. [25] Shekhar Borkar. Design Automation Conf. James W.. and Vivek De. Tschanz. [28] Michael Brown. Tanay Karnik. June 2007. Wafer redeposition impact on etch rate uniformity in ipvd system. Conf. and Milan Vasilko.. June 2005. February 2006. Intl.. Bowman. Jim Tschanz. 2003. Field-Programmable Gate Arrays. Sapatnekar. K. February 2009. Full-chip analysis of leakage power under process variations. Robison. Noel Menezes. [31] Nicola Campregher. Tanay Karnik. Measuring and modeling variabilityusing low-cost FPGAs. and Vivek De. James Chung. Statistical timing analysis considering spatial correlations using a single PERT-like traversal. Symp. IEEE Transactions on Plasma Sciences. In Proc. February 2007. pages 621 – 625. Vrudhula. [30] Steven M. and Rajesh Divecha. Peter Y. Mahesh C. Spatial variation in semiconductor processes: Modeling for control. In Proc. Sapatnekar. Yu Cao. including spatial correlations. [27] Jozef Brcka and Rodney L.[23] Sarvesh Bhardwaj. Cyrus Bazeghi. IEEE Custom Integrated Circuits Conf. Computer-Aided Design. In Proc. [32] Hongliang Chang and Sachin S. Dennis Ouma. Field-Programmable Gate Arrays. [29] James Burnham. Siva Narendra. Comparative analysis of conventional and statistical design techniques. July 2009. In Proc. George A. Cheung. Predictive modeling of the nbti effect for reliable design. Jim Tschanz. ACM Intl.

Intl. [42] Lerong Cheng. Asia South Paciﬁc Design Automation Conf. Chandu Visweswariah. January 2009. February 1985. Intl. August 2006. Jinjun Xiong. Field-Programmable Gate Arrays. Computer-Aided Design of Integrated Circuits and Systems. In Proc. and Dennis Sylvester. November 2005. Device and architecture cooptimization for FPGA power reduction. In Proc.. Computer-Aided Design. Jinjun Xiong. In Proc. [37] Ruiming Chen and Hai Zhou. Asia South Paciﬁc Design Automation Conf. IEEE Trans. Fast estimation of timing yield bounds for process variations. February 2008. FPGA performance optimization via chipwise placement considering process variations. and David B. [43] Lerong Cheng. Static timing: Back to our roots. ComputerAided Design. and Lei He. [36] Ruiming Chen and Hai Zhou. Yan Lin. [39] Lerong Cheng. Design Automation Conf. In Proc. IEEE Journal of Solid-State Circuits. March 2008. Conf. Scott. Non-linear statistical static timing analysis for non-gaussian variation sources. Computer-Aided Design of Integrated Circuits and Systems. Lei He. David Blaauw. IEEE Trans. Fei Li. In International Conference on Field-Programmable Logic and Applications. February 2008. Groves. and Jinjun Xiong. November 2007. Timing budgeting under arbitrary process variations. Jinjun Xiong. July 2007. Conf.. Symp. Non-gaussian statistical timing analysis using Second-Order polynomial ﬁtting. Yan Lin. February 2008. [44] Kaviraj Chopra. Imelda A. 202 . and Mike Hutton. Saller. IEEE Trans. Ashish Srivastava. 19:1211–1221. Tracebased framework for concurrent development of process and FPGA architecture considering process variation and reliability. In Proc. Saumil Shah.. [41] Lerong Cheng. [40] Lerong Cheng. and Lei He. Non-Gaussian statistical timing analysis using Second-Order polynomial ﬁtting. and Lei He. In Proc. Reliability effects on mos transistors due to hot-carrier injection. Parametric yield maximization using gate sizing based on efﬁcient statistical power and delay gradient computation. [35] Ruiming Chen. [38] Lerong Cheng. and Lei He. ACM Intl.. Jinjun Xiong. June 2007. VLSI Syst. Phoebe Wong.[34] Kueing-Long Chen. Vladimir Zolotov. and Lei He. Stephen A. Lizheng Zhang.

Intl. [50] Nigel Drego. Design Automation Conf. and Duane Boning.menezes. Computer-Aided Design. In Operations Research. Intl. Block-Based static timing analysis with uncertainty. Conf. In Proc. Asia South Paciﬁc Design Automation Conf. Clark. February 2007. March 2008. The stratix routing and logic architecture. Willy Cheung. In Proc. In Proc. 1961. June 2007. [46] Charles E. March 2006.. Asia South Paciﬁc Design Automation Conf. [49] Anirudh Devgan and Chandramouli Kashyap. [48] Vijay Degalahal and Tim Tuan. In Proc. [55] Qiang Fu. Akira Tsuchiya. and Costas J. IEEE Trans. Fast second-order statistical static timing analysis using parameter dimension reduction. Variogram based robust extraction of process variation.Najm and N. [56] Takayuki Fukuoka. τ. The greatest of a ﬁnite set of random variables. [47] David Lewis ect. Computer-Aided Design. Characterizing intra-die spatial correlation using spectral density method. Conf. [54] Paul Friedberg. Statistical gate delay model for multiple input switching. ACM Intl. November 2003. Jun Tao. Statistical timing analysis based on a timing yield model. Wai-Shing Luk. January 2005.N. February 2003.. [52] Zhuo Feng. May 2009. Vdd programmability to reduce FPGA interconnect power. In Data Analysis and Modeling for Process Control III. In Proc. 203 . and Hidetoshi Onodera. Timing and Analysis Utilities. IEEE International Symposium on Quality Electronic Design. Lack of spatial correlation in mosfet threshold voltage variation and implications for voltage scaling. In Proc. [53] F. and David Blaauw. and Yaping Zhan. In Proc. Spanos. [51] Fei Li and Yan Lin and Lei He.[45] Kaviraj Chopra. In Proc. Methodology for high level estimation of FPGA power consumption.. Design Automation Conf. and Xuan Zeng.. Field-Programmable Gate Arrays. In Proc. February 2008. Semiconductor Manufacturing. Narendra Shenoy. June 2004. Changhao Yan. Peng Li. Spatial modeling of micron-scale gate length variation. Anantha Chandrakasan. Symp. November 2004. pages 85 – 91.

net/models. Tuan.[57] Anne Gattiker. and Toshiyuki Shibuya. M. and Farid N. Computer-Aided Design.itrs. on Nuclear Science. In http://www.. Kandemir. [63] Katsumi Homma. [61] Khaled R... Nonparametric statistical static timing analysis: An ssta framework for arbitrary distribution. A user’s guide to MASTAR4. A dual-vdd low power FPGA architecture. Impact of CMOS technology scaling on the atmospheric neutron soft error rate. and Chris Long. In International Symposium on Quality of Electronic Design. In Proc. 2002. Reducing leakage energy in FPGAs using region-constrained placement. 204 . HirooMASUDA. Gayasen. and T. February 2008. Field-Programmable Gate Arrays. Rashmi Dinakar. [62] Dadgour H. 2001. Design Automation Conf. and T. [58] A.html. Tuan. N. FieldProgrammable Logic and its Application.net/. N. Lee. Y. Sheng-Chih Lin. A statistical framework for estimation of full-chip leakage-power distribution under parameter variations. Symp. Asia South Paciﬁc Design Automation Conf. Najm. J. Timing yield estimation from static timing analysis. M. Kandemir. [59] A. Gayasen. In Proc. J. Efﬁcient block-based parameterized timing analysis covering all potentially critical paths. August 2004. In Proc. Sani Nassif. [66] International Technology Roadmap for Semiconductor. IEEE Trans. M. Vijaykrishnan. and Yasuaki INOUE. November 2008. June 2008. In http://public. [68] International Technology Roadmap for Semiconductor. Takashi Sato Noriaki Nakayama. Izumi Nitta. 2005. 2008. In Proc. Sari Onaissi. Svensson. Heloue. Hazucha and C.itrs.net/. IEEE Transactions on. Electron Devices. In Proc. Intl. [64] Shin ichi OHKAWA. and Banerjee K.F. FUNDAMENTALS. IEICE TRANS. K. M.itrs. February 2004. 2005. Irwin. A novel expression of spatial correlation by a random curved surface model and its application to LSI designs. and Kazuya Masu. [67] International Technology Roadmap for Semiconductor. Intl. Non-gaussian statistical timing models of die-to-die and within-die parameter variations for full chip analysis. [65] Masanori Imai. Conf. Tsai. [60] P. Conf. Vijaykrishnan. Irwin. 2000. ACM Intl. November 2007. In http://public.

cell granularity . [75] Kunhyuk Kang. Design. 2007. Naidu. Conf. Investigation of etch rate uniformity of 60 mhz plasma etching equipment. In Proc. Design Automation and Test Conf. Bipul C.pdf. November 2008. and implementation. ACM Intl. Journal of The Electrochemical Society. FPGA area vs. Parametric yield estimation for deep submicron VLSI circuits. Najm. February 1992. Design Automation Conf. [70] Javid Jaffari and Mohab Anis. Statistical timing analysis using levelized covariance propagation. and Juergen Teich. 2003.itrs. Computer-Aided Design. Intl. berkeley.net/Links/2007ITRS/2007 Chapters/2007 Design. [72] J. Otten. in Europe. and Kaushik Roy.. November 2004. In Proc. [79] Dirk Koch. September 2002. Visweswariah. Prentice Hall Inc. In 15th Symposium on Integrated Circuits and Systems Design.Aydil. Design Automation Conf.. Efﬁcient hardware checkpointing – concepts. [80] J. Jess.[69] International Technology Roadmap for Semiconductor. CA. February 2007. Field-Programmable Gate Arrays. and C. June 2005. Anderson and Farid N. Christian Haubelt. overhead analysis.Rabaey. In Proc. [71] Jason H. On efﬁcient monte carlo-based statistical static timing analysis of digital circuits. In Proc. In 1st ACM Workshop on FPGAs. Kouloheris and A. Low-power programmable routing circuitry for FPGAs. In Proc. R. El Gamal. Digital Integrated Circuits . Statistical timing for parametric yield prediction of digital integrated circuits. K. [73] Jochen Jess. A general framework for accurate statistical timing analysis considering correlations. November 2001. Effects of chamber wall conditions on cl concertration and si etch rate uniformity in palsma etching reactors. June 2003. Symp.M. In http://www. In Proc. S. June 2003. 205 . Intl. pages 89 – 94. Computer-Aided Design. [77] Tae Won Kim and Eray S. [74] J. [76] Vishal Khandelwal and Ankur Srivastava..Aydil. March 2005. Paul. Kalafala. [78] Tae Won Kim and Eray S. Japanese Journal of Applied Physics.lookup tables and PLA cells.A Design Perspective. Conf.

and Lei He. and Lawrence T. [84] Julien Lamoureux. In Proc. Symp. April 1993.E. February 2003. IEEE Trans. A framework for block-based timing sensitivity analysis. Intl. Measuring the gap between fpgas and asics. On hierarchical statistical static timing analysis. Brown. Yan Lin. [82] Ian Kuon and Jonathan Rose. Power modeling and characteristics of ﬁeld programmable gate arrays. Symp. and Jason Congs. [90] Fei Li. Chandramouli V. Lemieux. Design Automation Conf. Deming Chen. IEEE Trans. Kashyap. Yan Lin. Walter Schneider. Rabaey. November 2003. Lemieux and S. pages 701–708. 206 . and Steven J. In Proc. [85] Julien Lamoureux and Steven J. Low-energy embedded FPGA structures.. Design Automation Conf. Lei He. [89] Fei Li. [83] Eric Kusse and Jan M. Kumar.. Jun 2004. In Proceedings of the ACM Physical Design Workshop. In Proc. and Sachin Sapatnekar. and Ulf Schlichtmann. [91] Fei Li. March 2009. pages 155–160. In Proc. [92] Fei Li.E. Symp. Deming Chen. Computer-Aided Design of Integrated Circuits and Systems. Yan Lin.. June 2004. Xin Li. November 2005. Low Power Electronics and Design. Design Automation and Test Conf. ACM Intl.[81] Sanjay V. Wilton. April 2007. D. Conf. Field-Programmable Gate Arrays. Architecture evaluation for power-efﬁcient FPGAs. Computer-Aided Design. In Proc. A detailed router for allocating wire segments in ﬁeld-programmable gate arrays. ACM Intl. Ning Chen. In Proc. Glitchless: An active glitch minimization techniques for FPGAs. Pileggi. Manuel Schmidt. 24:1712–1724. June 2008. Computer-Aided Design of Integrated Circuits and Systems. August 1998. February 2007. On the interaction between poweraware FPGA CAD algorithms. STAC: Statisical timing analysis with correlation. Guy G. February 2007. G. Computer-Aided Design of Integrated Circuits and Systems. Design Automation Conf. Lei He. Intl. [86] Jiayong Le. [88] Bing Li. Field programmability of supply voltages for FPGA power reduction. and Jason Cong. Field-Programmable Gate Arrays. In Proc. IEEE Trans. Wilton. in Europe. In Proc. and Lei He. 26. FPGA power reduction using conﬁgurable dualvdd. [87] G.

January 2005. June 2002.. Design Automation Conf. and Kevin Lepak. Lei He. Pileggi.[93] Fei Li. [95] Xin Li. In Proc. In Proc. Conf. [94] Xin Li. Field-Programmable Logic and its Application. In Proc. February 2007. ACM Intl. Padmini Gopalakrishnan. February 2008. Symp. [96] Weiping Liao. Routing track duplication with ﬁne-grained powergating for FPGA interconnect power reduction. Temperature and supply voltage aware performance and power modeling at microarchitecture level. February 2005. Symp. July 2005. Circuits and architectures for vdd programmable FPGAs. IEEE Trans. A. and Huazhong Yang. and L. Stochastic physical synthesis for FPGAs with pre-routing interconnect uncertainty and process variation.. Asymptotic probability extraction for non-normal distributions of circuit performance. Mike Hutton. Field-Programmable Gate Arrays. Fei Li.-C. [97] Saihua Lin. Rong Luo. and Lei He. [103] Jing-Jia Liou. 25. Intl. June 2006. Field-Programmable Gate Arrays. Jiayong Le. [102] Yan Lin. In Proc. In Proc. October 2006.T. Fei Li. ACM Intl. and Lei He. Conf. Design Automation Conf. Dual-vdd interconnect with chip-level time slack allocation for FPGA power reduction. Jiayong Le. L. Asia South Paciﬁc Design Automation Conf. Yu Wang. A capacitive boosted buffer technique for high-speed process-variation-tolerant interconnect in udvs application. Wang. Intl. Projection based statistical analysis of full-chip leakage power with non-log-normal distributions. Lei He. Symp. ACM Intl. In Proc. [98] Yan Lin and Lei He. Low-power FPGA using pre-deﬁned dual-vdd/dual-vt fabrics. Computer-Aided Design. Computer-Aided Design of Integrated Circuits and Systems. and Jason Cong. False-pathaware statistical timing analysis and efﬁcient path selection for delay testing and timing validation. In Proc. February 2004. Asia South Paciﬁc Design Automation Conf. and Kwang-Ting Cheng.. and Lei He. Krstic. [100] Yan Lin.. In Proc. Yan Lin. Computer-Aided Design of Integrated Circuits and Systems. FieldProgrammable Gate Arrays. and Lawrence T Pileggi. IEEE Trans. [101] Yan Lin. Placement and timing for FPGAs considering variations. [99] Yan Lin and Lei He. 207 . August 2006. November 2004. In Proc.

A general framework for spatial correlation modeling in vlsi design. In Proc. [107] Jui-Hsiang Liu. Design Automation Conf... Statistical leakage and timing optimization for submicron process variation. Ming-Feng Tsai. Toshiyuki Tsutsumi. June 2008. ACM Intl. March 2008. In Proc. [105] Bao Liu. IEEE Trans. In Proc. June 2007. Design Automation Conf. In Proc. March 2005. Symp. Agrawal. Wang Wei-Shen.. Lumdo Chen. Intl. Combining low-leakage techniques for FPGA routing design. In Proc. [108] Jui-Hsiang Liu. August 2004. A. Design Automation and Test Conf. 1986. In Proc. Low Power Electronics and Design. June 2008. Director. and Hanpei Koike. Computer-Aided Design of Integrated Circuits and Systems.W. [112] W. and Charlie Chung-Ping Chen. Signal probability based statistical timing analysis. Spatial correlation extraction via random ﬁeld simulation and production chip performance regression. [113] Hratch Mangassarian and Mohab Anis. 208 . March 2008. ACM Intl. In Proc. Strojwas. Design Automation and Test Conf. [115] Yohei Matsumoto. Design Automation and Test Conf. Jan. An efﬁcient algorithm for statistical minimization of total power under timing yield constraints. Maly. [106] Frank Liu. Symp. On statistical timing analysis with interand intra-die variations. and Roberto Giansante. Accurate and analytical statistical spatial correlation modeling for vlsi dfm applications. [109] M. and S. In Proc. [114] Murari Mani. in Europe. Masakazu Hioki. Anirudh Devgan. in Europe. Tadashi Nakagawa. Lumdo Chen.[104] Bao Liu. and Chen Charlie Chung-Ping. [111] Yuanlin Lu and V. In VLSI Design.D. Design Automation Conf. and M. Performance and yield enhancement of FPGAs with with-in die variation using multiple conﬁgurations. Design Automation Conf. in Europe. Toshihiro Sekigawa. Takashi Kawanami. Orshanskyg. Accurate and analytical statistical spatial correlation modeling for vlsi dfm applications. Liu. In Proc. Luca Ciccarelli. Symp. pages 309–314. and Michael Orshansky. FieldProgrammable Gate Arrays. February 2007. VLSI yield prediction and estimation: A uniﬁed framework. January 2007.J. June 2005. 5:114–130. Leakage power reduction by dual-vth designs under probabilistic analysis of vth variation. Februray 2005. In Proc. [110] Andrea Lodi. Field-Programmable Gate Arrays..

June 2004. Mogal. Field-Programmable Logic and its Application. August 2004. Vrudhula. 2008. In Design for Manufacturability through Design-Process Integration III. Computer-Aided Design. and Mustafa Celik. and Duane Boning. [118] Ayhan Mutlu. In Proc. Fast statistical timing analysis handling arbitrary delay correlations. [125] Kun Qian and Costas J. Jiayong Le. and S. Design Automation Conf. Antoniadis. and Anantha P. [127] Sreeja Raj. Wilton. [123] K. 2007. Nassif.K. November 2007. A ﬂexible power model for FPGAs.. April 2008. March 2004. In Sophia Antipolis MicroElectronics Forum.18-um CMOS. Conf. A methodology to improve timing yield in the presence of process variations. Vivek De. July 2009. Conf. In Design for Manufacturability through Design-Process Integration II. [124] Predictive Technology Model. and Janet Wang. Spanos. In http://www. In Proc. and Kia Bazargan. [126] Kun Qian and Costas J.asu.edu/ ptm/. Yan. Sani R.[116] Hushrav D. Full-chip subthreshold leakage power prediction and reduction techniques for sub-0. In Proc. In Proc. March 2009.. Ruben Molina. Spanos. Solid-State Circuits. Design Automation Conf. Intl. Hierachical modeling of spatial variability of a 45nm test chip. A parametric approach for handling local variation effects in timing analysis. Tutorial on statistical static timing analysis (SSTA). Haifeng Qian. Shekhar Borkar andDimitri A. Sarma B. Sachin S. Sapatnekar. Design for Manufacturability and Statistical Design.eas. A. Clustering based pruning for statistical criticality computation under process variations. September 2002. IEEE Journal of. In Proc. Power-driven design partitioning. Design Automation Conf. A comprehensive model of process variability for statistical timing optimization. [117] Rajarshi Mukherjee and Seda Ogrenci Memik. of 12th International conference on Field-Programmable Logic and Applications. [119] Siva Narendra. Chandrakasan. Intl.. June 2004. 209 . [122] Pierrick Pedron. Poon. Springer. [121] Michael Orshansky. [120] Michael Orshansky and Arnab Bandyopadhyay. In Proc.

Determination of optimal polynomial regression function to decompose on-die systematic and random variations. 210 . Gi-Joon Nam. 1990. David Lewis. Antonio Pullini. Sani R. IEEE Journal of Solid-State Circuits. IEEE Int. Parametric yield in FPGAs due to withindie delay variations: A quantative analysis. Computer-Aided Design. and David Z. and Enrico Macii. K. An accurate sparse matrix based framework for statistical static timing analysis. ACM Intl. Wong. [134] Takashi Sato. In Proc. ACM Intl. 1990. Symp. Giovanni De Micheli. Parametric yield in FPGAs due to withindie delay variations: a quantitative analysis. and Kazuya Masu. Measuring and modeling FPGA clock variability. and Peter Y. Proc. Michael Orshansky. [131] Jonathan Rose. Cheung.. In Proc. Design Automation Conf. [129] Rajeev Rao. Architecture of ﬁeld-programmable gate arrays: The effect of logic functionality on area efﬁciency. [130] Jonathan Rose. 2007. Anirudh Devgan. Pan. Cheung. and Yukihiro Shimogaki. and Dennis Sylveste. In Proc. Ryosuke Shimizu. June 2004. David Lewis. February 2008. [136] Pete Sedcole and Peter Y. in Europe. March 2009. K. Symp. [135] Pete Sedcole and Peter Y. February 2007. and Paul Chow. Field-Programmable Gate Arrays. Ashish Kumar Singh. Physically clustered forward body biasing for variability compensation in nano-meter CMOS design. Justin S. Asia South Paciﬁc Design Automation Conf. February 2008. February 2007. In Proc. Luca Benini.[128] Anand Ramalingam.. Intl. In Proc. FieldProgrammable Gate Arrays. In Proc. David Blaauw. Robert J. Conf. Robert J. Hiroyuki Ueyama. Architecture of ﬁeld-programmable gate arrays: The effect of logic functionality on area efﬁciency. and Paul Chow. November 2006. In Proc.. ACM Intl. Symp. [137] Pete Sedcole. [133] Ashoka Sathanur. Masaaki Ogino. Francis. Cheung. FieldProgrammable Gate Arrays. Noriaki Nakayama. Nassif. Solid-State Circuits Conf.K. Parametric yield estimation considering leakage variability. Francis. [132] Shigeru Sakai. Design Automation and Test Conf. MATERIALS RESEARCH SOCIETY SYMPOSIUM PROCEEDINGS. Deposition uniformity control in a commercial scale HTO-CVD reactor.

In ASICON. The effect of logic block architecture on FPGA performance. March 2008. In Proc. Symp. ACM Intl. Pan. highly nonlinear ﬁtting. and Huazhong Yang. in Europe. Conf. and Rob A. CRC. with application to statistical timing. Rong Luo. and Rob A. [142] Jaskirat Singh and Sachin Sapatnekar. Petros Boufounos. October 2007. and Farinaz Koushanfar. Statistical timing analysis with correlated non-gaussian parameters using independent componenet analysis. IEEE Journal of Solid-State Circuits. ComputerAided Design. November 2008. Sonia Singhal. [147] Satish Sivaswamy and Kia Bazargan. Rutenbar. Intl. fast monte carlo statistical static timing analysis: Why and how. [140] Xukai Shen. Anand Ramalingam. Design Automation Conf. Sheats. Field-Programmable Gate Arrays. 211 . In Proc. [143] Satwant Singh. Exploiting correlation kernels for efﬁcient handling of intra-die spatial correlation. [148] Thomas Skotnicki. [144] Amith Singhee and Rob A. Heading for decananometer CMOS . Post-silicon timing characterization by compressed sensing. [139] James R. Leakage power reduction through dual vth assignment considering threshold voltage variation.[138] Davood Shamsi. and David Z. Conf. 1998. Intl. Design Automation and Test Conf. 2006. September 2000. Paul Chow. Daifeng Wang. 1992. pages 143–148. Practical. In Proc. Shi. February 2007. Microlithography Science and Technology. In Proc. Rutenbar. [145] Amith Singhee. and David Lewis. Sonia Singhal.Is navigation among icebergs still a viable strategy? In Solid-State Device Research Conference. Jonathan Rose. [141] Sean X. [146] Amith Singhee. In Proc. November 2008. Feb. in Europe. March 2008. In Proc. pages 19–33. Design Automation and Test Conf.. Yu Wang. In ACM/IEEE International Workshop on Timing Issues. ComputerAided Design. Rutenbar. Variation-aware routing for FPGAs. June 2007. Beyond low-order statistical response surfaces: Latent variable regression for efﬁcient. Latch modeling for statistical timing analysis.

Saha Dipankar. Agarwal. In Proc.. In Proc. Design Automation Conf. January 2005.. Varghese Dhanoop. In Proc. David Blaauw. Intl. Robert Bai. Leakage power analysis of a 90nm FPGA. G. Symp. Low Power Electronics and Design.. Yohei Kume. 1997. June 2005. Kanak B. ACM Intl. Sylvester. [152] Ashish Srivastava. and TAKWALE M. On the generation and recovery of interface traps in MOSFETs subjected to NBTI. Analysis and decomposition of spatial variation in integrated circuit processes and devices. Trans. and James E. 2001. [157] Shuji Tsukiyama. 53. JADKAR S. Accurate and efﬁcient parametric yield estimation considering correlated variations in leakage power and performance. B.[149] Mahapatra Souvik. In Proc. July 2009. February 2008. [158] Tim Tuan and Bocheng Lai. Design Automation Conf. Duane S. Speed and yield enhancement by track swapping on critical paths utilizing random variations for FPGAs. Design Automation Conf. and Stephen Director. In Proc. IEEE. June 2007. on Electron Device. 2002. Feb. Stine. IEEE Custom Integrated Circuits Conf. [151] Ashish Srivastava. In International Conference on Cat-CVD Process. Statistical analysis of full-chip leakage power considering junction tunneling leakage. and HCI stress. and Shuji Tsukiyama.. FN. [156] Li Tao and Yu Zhiping. Symp. [150] Suresh Srinivasan.. [154] Yuuri Sugihara. and Bharath P. Kazutoshi Kobayashi. In Proc. HotWire CVD growth simulation for thickness uniformity... and Hidetoshi Onodera. pages 24 – 41. Shah. Narayanan Vijaykrishnan. R. Asia South Paciﬁc Design Automation Conf. and Dennis Sylvester. IEEE Transactions on. and Tim Tuan. Yuki Yoshida. [153] Brian E. In Proc. Aman Gayasen. July 2006... [155] Shingo Takahashi. Asia South Paciﬁc Design Automation Conf. January 2005. Toward stochastic design for digital circuits: statistical static timing analysis.. Kumar. PATIL S. Saumil S. Leakage control in FPGA routing fabric. Boning. 212 . Chung. Gaussian mixture model for statistical timing analysis. Dennis M. Field-Programmable Gate Arrays. In Semiconductor Manufacturing. [159] SALI Jaydeep V. 2003. Modeling and analysis of leakage power considering within-die process variations. In Proc. David Blaauw.

2006. [167] Ho-Yan Wong. In Proc. Intl. and Tai-Hsuan Wu.. Modeling and minimization of pmos nbti effect for robust nanometer design.. Kaushik Ravindran. [165] Ruilin Wang and Cheng Kok Koh.. In Proc. Intl. Intl. Adjustment-based modeling for statistical static timing analysis with high dimension of variability. IEEE Trans. November 2005. Steven G. Lerong Cheng. Krishnan. [168] Phoebe Wong.5v platform FPGA complete data sheet. November 2008. Lerong Cheng. [161] Vineeth Veetil.[160] Rakesh Vattikonda. and David Blaauw. Virtex-II 1. [163] Chandu Visweswariah. and Sambasivan Narayan. Efﬁcient monte carlo based incremental statistical timing analysis. Jeff Piaget. and Yu Cao. Design Automation Conf. Steven G. Kaushik Ravindran. November 2007. In Proc. and Lei He. Conf. Conf. First-order incremental block-based statistical timing analysis. 213 . Yan Lin. Conf. July 2006. [169] Lin Xie. Kerim Kalafala. and Yu Cao. A frequency-domain technique for statistical timing analysis of clock meshes. Conf. ComputerAided Design. Computer-Aided Design of Integrated Circuits and Systems. Sambasivan Narayan. Design Automation Conf. November 2005. Walker. An integrated modeling paradigm of circuit reliability for 65nm CMOS technology. ComputerAided Design. Intl. Hemmett. Computer-Aided Design. Design Automation Conf. Dennis Sylvester. Computer-Aided Design. June 2003. [170] Xilinx Corporation. Design Automation Conf. In Proc. Wenping Wang. In Proc. and Jeffrey G. First-order incremental block-based statistical timing analysis. Vijay Reddy. [164] Chandu Visweswariah. Anand T. July 2002.. Beece. In Proc. Natesan Venkateswaran.. June 2004. In Proc. Srikanth Krishnan. Walker. Rakesh Vattikonda. FPGA device and architecture evaluation considering process variations. Death. taxes and failing chips. June 2008. Daniel K. FPGA device and arhitecture evaluation consdering process variation. In Proc. [162] Chandu Visweswariah. Yan Lin. Azadeh Davoodi. Jun Zhang. Kerim Kalafala. September 2007. IEEE Custom Integrated Circuits Conf. [166] Wenping Wang. In Proc. and Lei He.

CA. 2007. and Vladimir Zolotov. Xin Li. [176] Xiaoji Ye. John A. version 3. Gubner. [174] A. Gubner. Andrzej J. [181] Lizheng Zhang. and Charlie ChungPing Chen. Generic statistical timing analysis with non-gaussian process parameters. Design Automation Conf. Robust extraction of spatial correlation. [179] Yaping Zhan. and Lei He.. Correlaiton-aware statistical timing analysis with non-gaussian delay distribution. Statistical ordering of correlated timing quantities and its application for path ranking. Design Automation Conf. and Lawrence T. March 2005. In Proc.[171] Jinjun Xiong. M. and Mahesh Sharma. Intl. [172] Jinjun Xiong. Correlation-preserved non-gaussian statistical timing analysis with quadratic timing model. Statistical leakage power minimization using fast equi-slack shell based optimization. and Peng Li. Weijen Chen. [180] Lizheng Zhang. [178] Yaping Zhan. 1991. Zhuo Fang. related to stationary random processes. Strojwas. and Lei He. Microelectronics Center of North Carolina (MCNC). Yuhen Hu. In Austin Conference on Integrated Systems and Circuits. Vladimir Zolotov. Design Automation Conf. October 2008. David Newmark. Wei Dong. April 2006. Yaglom. Some classes of random ﬁelds in n-Dimensional space. [177] Guo Yu. In Proc. Yang. Symp. pages 77–82. Weijen Chen. Yuhen Hu. Anaheim.0. and Charlie ChungPing Chen. Design Automation and Test Conf. Computer-Aided Design of Integrated Circuits and Systems.. 214 . January 1957. Chandu Visweswariah.. Andrzej J. Statistical static timing analysis considering process variation model uncertainty. Pileggi. John A. June 2005. pages 83 – 88. Design Automation Conf. In Proc. Technical report. Physical Design. IEEE Trans. and Peng Li. [173] Jinjun Xiong. In Proc. Statistical timing analysis with extended pseudo-canonical timing model. Theory Probability Applications. Strojwas. Robust extraction of spatial correlation. in Europe. May 2006.. Logic synthesis and optimization benchmarks. Vladimir Zolotov. June 2005. Computer-Aided Design of Integrated Circuits and Systems. In Proc. Yaping Zhan. 2007. IEEE Trans. [175] S. July 2009. In Proc. pages 952 – 957.

and Kaustav Banerjee. [183] Lizheng Zhang. and Charlie Chung-Ping Chen.. First IFAC Workshop on Advanced Process Control for Semiconductor Manufacturing. One step forward from run-to-run critical dimension control: Across-wafer level critical dimension control through lithography and etch process. and Costas J. Symp. in Europe. Angela Hui. An efﬁcient method for chip-level statistical capacitance extraction considering process variations with spatial correlation. Symp. and Costas J. Spanos. Zhiping Yu. [188] Songqing Zhang. Inspection. A probabilistic framework to estimate full-chip subthreshold leakage power distribution considering within-die and die-to-die p-t-v variations. Vineet Wason. and Kaustav Banerjee. Journal of Process Control. Cherry Tang. Tony Hsieh. Wenjian Yu. [185] Qiaolin Zhang. and Kevin Nowka. Vineet Wason. March 2006. [184] Lizheng Zhang. [187] Songqing Zhang. and Chung-Ping Chen. Low Power Electronics and Design. In Proc. 18(10):937 – 945.[182] Lizheng Zhang. Physical Design. Intl. In Proc. Frank Liu. and Process Control for Microlithography XXI. January 2005. Aprial 2006. Kameshwar Poolla. [189] Wangyang Zhang. [186] Qiaolin Zhang. Rigorous extraction of process variations for 65nm CMOS design. Kanak Agarwal. A probabilistic framework to estimate full-chip subthreshold leakage power distribution considering within-die and die-to-die variations. April 2007. In Metrology. Design Automation and Test Conf. Zeyi Wang. Intl. Yuhen Hu. Asia South Paciﬁc Design Automation Conf. March 2008. Symp. 2004. Yu Cao. Dhruva Acharyya. Jun Shao. Across-wafer cd uniformity control through lithography and etch process: experimental veriﬁcation. and Charlie Chung-Ping Chen. Bhanwar Singh. 215 . 2008. Rong Jiang. In Proc. Block based statistical timing analysis with extended canonical timing model. Advanced Process Control for Semiconductor Manufacturing. In Proc. Spanos. Non-gaussian statistical parameter modeling for SSTA with conﬁdence interval analysis. In Proc. In European Solid-State Circuits Conference. In Proc. Yuhen Hu. Low Power Electronics and Design. Kameshwar Poolla. Intl. September 2007. Nick Maccrae. Jason Cain. Sani Nassif. [190] Wei Zhao. Design Automation and Test Conf. Aug 2004. Statistical timing analysis with path reconvergence and spatial correlations. in Europe. and Jinjun Xiong.

- Sub1 1 Theory IC Design Flow NguyenHungQuan 201309 (1)
- IIJCS-2014-06-27-5
- vaagdevi title-1.doc
- Altera 40-Nm Stratix IV FPGAs and HardCopy IV ASICs
- AT17C65
- Questions
- EEG Circuit
- FFSA brief
- Syllabus
- JPMorgan FGPA Overview
- xilinx power analysis
- Basics of Fpga Design
- digital systems
- Stereo Vision Ip Design for Fpga Implementation of Obstacle
- Spartan II
- Cmos Power
- r05410207 Vlsi Design
- Programmable Logic and Software
- Memristor LUT Base Study
- VLSI Project
- Benchmarking Final
- ir2109
- 1_2
- lec0Fab
- A Multi-Granularity FPGA With Hierarchical Interconnects for Efficient and Flexible Mobile Computing.pdf
- Chameleon Chips Seminar Report
- uPD77016
- Uln 2803
- AnandTech _ Understanding the Cell Microprocessor
- ee500_syllabus_Spring_2014(1)

- wrewfsd
- cfthj
- zaef
- New Text Docudsdadment
- New Text Document
- Nw Tfext Doxxcuddsafsdment
- wrewfsd
- wrewfsd
- wrewfsd
- Nw Tfext Docuddsafsdment
- Nw Tfext Docuddsafsdment
- wrewfsd
- Zzzzz
- New SText Docusfment
- Nw Tfext Doxxcduddsafsdment
- Schmitt Trigger
- رول اليوزرات
- Is CH2 Sheet3
- Nw Text Docuddsafsdment
- Nw Text Docuddsafsdment
- New Text Docuddsafsdment
- Microwaves (1)
- New SText Docddumenxdst
- IS_sheet1
- Is CH2 Sheet3fbvgas
- New Text Documenxdst
- New Text Docddumenxdst
- App of Microwave
- Nw Tfext Docuddsafsdment
- pepooo

Read Free for 30 Days

Cancel anytime.

Close Dialog## Are you sure?

This action might not be possible to undo. Are you sure you want to continue?

Loading