University of California

Los Angeles
Statistical Analysis and Optimization for Timing and
Power of VLSI Circuits
A dissertation submitted in partial satisfaction
of the requirements for the degree
Doctor of Philosophy in Electrical Engineering
by
Lerong Cheng
2010
c Copyright by
Lerong Cheng
2010
The dissertation of Lerong Cheng is approved.
Lieven Vandenberghe
Glenn Reinman
Sudhakar Pamarti
Lei He, Committee Co-chair
Puneet Gupta, Committee Co-chair
University of California, Los Angeles
2010
ii
To my family.
iii
TABLE OF CONTENTS
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Literature Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.1.1 FPGA Power, Performance, and Area Optimization . . . . . . 6
1.1.2 Statistical Timing Modeling, Analysis, and Optimization . . . 8
1.2 Contributions of the Dissertation . . . . . . . . . . . . . . . . . . . . 9
1.3 Dissertation Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
I Statistical Modeling and Optimization for FPGA Circuits 15
2 Deterministic Device and Architecture Co-Optimization for FPGA Power,
Delay, and Area Minimization . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2.1 Conventional FPGA Architecture . . . . . . . . . . . . . . . 20
2.2.2 V
dd
Gatable FPGA Architecture . . . . . . . . . . . . . . . . 21
2.2.3 FPGA Architecture Evaluation Flow . . . . . . . . . . . . . . 23
2.3 Trace-Based Estimation . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3.1 Trace Collection . . . . . . . . . . . . . . . . . . . . . . . . 25
2.3.2 Power and Delay Model . . . . . . . . . . . . . . . . . . . . 26
2.3.3 Validation of Ptrace . . . . . . . . . . . . . . . . . . . . . . . 30
2.4 Hyper-Architecture Evaluation . . . . . . . . . . . . . . . . . . . . . 32
iv
2.4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.4.2 Necessity of Device and Architecture Co-Optimization . . . . 33
2.4.3 Energy and Delay Tradeoff . . . . . . . . . . . . . . . . . . . 36
2.4.4 ED and Area Tradeoff . . . . . . . . . . . . . . . . . . . . . 38
2.4.5 Impact of Utilization Rate . . . . . . . . . . . . . . . . . . . 40
2.4.6 Impact of Interconnect Structure . . . . . . . . . . . . . . . . 41
2.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3 FPGADevice and Architecture Co-Optimization Considering Process Vari-
ation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.2 Delay and Leakage Variation Model . . . . . . . . . . . . . . . . . . 50
3.3 Hyper-Architecture Evaluation Considering Process Variation . . . . 58
3.3.1 Impact of Process Variation . . . . . . . . . . . . . . . . . . 59
3.3.2 Impact of Device and Architecture Tuning . . . . . . . . . . . 60
3.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4 FPGA Concurrent Development of Process and Architecture Considering
Process Variation and Reliability . . . . . . . . . . . . . . . . . . . . . . . 63
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.2 Device Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.3 Circuit-Level Delay and Power . . . . . . . . . . . . . . . . . . . . . 70
4.3.1 Delay Model . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.3.2 Power Model . . . . . . . . . . . . . . . . . . . . . . . . . . 73
v
4.4 Chip-level Delay and Power . . . . . . . . . . . . . . . . . . . . . . 75
4.4.1 Delay Model . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.4.2 Leakage Power Model . . . . . . . . . . . . . . . . . . . . . 76
4.4.3 Dynamic Power Model . . . . . . . . . . . . . . . . . . . . . 76
4.5 Chip Level Delay and Power Variation . . . . . . . . . . . . . . . . . 78
4.5.1 Variation Models . . . . . . . . . . . . . . . . . . . . . . . . 78
4.5.2 Delay Variation . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.5.3 Chip Leakage Power Variation . . . . . . . . . . . . . . . . . 80
4.6 Process Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.6.1 Power and Delay Optimization . . . . . . . . . . . . . . . . . 82
4.6.2 Variation Analysis . . . . . . . . . . . . . . . . . . . . . . . 83
4.7 FPGA Reliability . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.7.1 Impact of Device Aging . . . . . . . . . . . . . . . . . . . . 86
4.7.2 Permanent Soft Error Rate (SER) . . . . . . . . . . . . . . . 89
4.8 Interaction between Process Variation and Reliability . . . . . . . . . 90
4.8.1 Impact of Device Aging on Power and Delay Variation . . . . 91
4.8.2 Impacts of Device Aging and Process Variation on SER . . . 92
4.9 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
II Statistical Timing Modeling and Analysis 95
5 Non-Gaussian Statistical Timing Analysis Using Second-Order Polyno-
mial Fitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
vi
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.2 Second-Order Polynomial Fitting of Max Operation . . . . . . . . . . 99
5.2.1 Review and Preliminary . . . . . . . . . . . . . . . . . . . . 99
5.2.2 New Fitting Method for Max Operation . . . . . . . . . . . . 100
5.3 Quadratic SSTA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
5.3.1 Quadratic Delay Model . . . . . . . . . . . . . . . . . . . . . 107
5.3.2 Max Operation for Quadratic Delay Model . . . . . . . . . . 109
5.3.3 Sum Operation for Quadratic Delay Model . . . . . . . . . . 115
5.3.4 Computational Complexity of Quadratic SSTA . . . . . . . . 115
5.4 Semi-Quadratic SSTA . . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.4.1 Semi-Quadratic Delay Model . . . . . . . . . . . . . . . . . 115
5.4.2 Max Operation for Semi-Quadratic Delay Model . . . . . . . 116
5.5 Linear SSTA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
5.5.1 Linear Delay Model . . . . . . . . . . . . . . . . . . . . . . 118
5.5.2 Max Operation for Linear Delay Model . . . . . . . . . . . . 119
5.6 Experimental Result . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
5.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
6 Physically Justifiable Die-Level Modeling of Spatial Variation in View of
Systematic Across-Wafer Variability . . . . . . . . . . . . . . . . . . . . . 128
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
6.2 Physical Origins of Spatial Variation . . . . . . . . . . . . . . . . . . 131
6.3 Analysis of Wafer Level Variation and Spatial Correlation . . . . . . . 133
vii
6.3.1 Variation of Mean and Variance with Location . . . . . . . . 134
6.3.2 Appearance of Spatial Correlation . . . . . . . . . . . . . . . 137
6.3.3 Dependence between Inter-Die and Within-Die Variation . . . 139
6.3.4 When can Spatial Variation be Ignored? . . . . . . . . . . . . 140
6.4 General Across-Wafer Variation Model . . . . . . . . . . . . . . . . 141
6.5 Modeling Spatial Variability . . . . . . . . . . . . . . . . . . . . . . 144
6.5.1 Slope Augmented Across-Wafer Model . . . . . . . . . . . . 145
6.5.2 Quadratic Across-Wafer Model . . . . . . . . . . . . . . . . 145
6.5.3 Location Dependent Across-Wafer Model . . . . . . . . . . . 146
6.6 Application of Spatial Variation Models on Statistical Timing Analysis 146
6.6.1 Delay Model . . . . . . . . . . . . . . . . . . . . . . . . . . 147
6.6.2 Experimental Result . . . . . . . . . . . . . . . . . . . . . . 148
6.7 Application of Spatial Variation Models on Statistical Leakage Analysis 156
6.8 Summary of Different Models . . . . . . . . . . . . . . . . . . . . . 158
6.9 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
7 On Confidence in Characterization and Application of Variation Models 162
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
7.2 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . 165
7.3 Confidence Interval for Variation Sources . . . . . . . . . . . . . . . 167
7.3.1 Mean and Variance . . . . . . . . . . . . . . . . . . . . . . . 167
7.3.2 Simplifying the Large Lot Case . . . . . . . . . . . . . . . . 169
7.3.3 One Production Lot Case . . . . . . . . . . . . . . . . . . . . 170
viii
7.3.4 Experimental Validation for One Lot Case . . . . . . . . . . . 171
7.3.5 Fast/Slow Corner Estimation . . . . . . . . . . . . . . . . . . 172
7.4 Confidence in SPICE Fast/Slow Corner . . . . . . . . . . . . . . . . 177
7.5 Confidence in Statistical Timing Analysis . . . . . . . . . . . . . . . 183
7.6 Practical Questions: Determining Confidence Induced Guardband for
Real Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
7.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
8 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
9 Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
ix
LIST OF FIGURES
1.1 Increasing of power density. . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Leakage power and frequency variation. . . . . . . . . . . . . . . . . 3
1.3 Delay yield. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 SSTA flow. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1 (a) Island style routing architecture; (b) Connection block; (c) Switch
block; (d) Routing switches. . . . . . . . . . . . . . . . . . . . . . . 21
2.2 (a) V
dd
-gateable switch; (b) V
dd
-gateable routing switch; (c) V
dd
-gateable
connection block; (d) V
dd
-gateable logic block. . . . . . . . . . . . . 22
2.3 Existing FPGA architecture evaluation flow for a given device setting. 24
2.4 New trace-based evaluation flow. We perform the same flow as Fig-
ure 2.3 under one device setting to collect the trace information. . . . 25
2.5 Comparison between Psim and Ptrace. . . . . . . . . . . . . . . . . 31
2.6 hyper-architectures under different device settings. . . . . . . . . . . 35
2.7 Dominant hyper-architectures. (a) Homo-V
th
and Hetero-V
th
; (b) Homo-
V
th
+G and Hetero-V
th
+G. . . . . . . . . . . . . . . . . . . . . . . . 44
2.8 ED and area trade-off. ED and area are normalized with respect to the
baseline (N = 8, K = 4, V
dd
= 0.9V, and V
th
= 0.3V). . . . . . . . . 45
3.1 Leakage and delay of baseline architecture hyper-arch. . . . . . . . . 60
4.1 Trace-based estimation flow. . . . . . . . . . . . . . . . . . . . . . . 65
4.2 The flow for on current, I
on
, calculation. . . . . . . . . . . . . . . . . 67
4.3 The flow for sub-threshold leakage current, I
of f
, calculation. . . . . . 68
x
4.4 The flow for gate leakage currents I
gon
and I
gof f
calculation. . . . . . 68
4.5 The flow for transistor gate and diffusion capacitances C
g
and C
di f f
calculation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.6 Voltage and current in an inverter during transition. . . . . . . . . . . 71
4.7 The pull-down delay of a 1X inverter at ITRS 2005 HP 32nm tech-
nology nodes. (a) Input and output voltages, and short circuit current
with a small input slope; (b) Input and output voltages, and short cir-
cuit current with a large input slope; (c) The NMOS IV curves and the
transition of transient current I
ds
with small and large input slopes. . . 72
4.8 Energy and delay tradeoff. . . . . . . . . . . . . . . . . . . . . . . . 83
4.9 Delay and leakage distribution. (a) Leakage PDF; (b) Delay PDF. . . 85
4.10 The calculation flow for ΔV
th
due to NBTI and HCI. . . . . . . . . . 87
4.11 V
th
increase caused by NBTI and HCI. . . . . . . . . . . . . . . . . . 87
4.12 Impact of device aging. (a) Leakage change; (b) Delay change. . . . 88
4.13 The flow for SRAM SER calculation. . . . . . . . . . . . . . . . . . 90
4.14 Impact of device aging on delay and leakage PDF. (a) Leakage PDF
comparison; (b) Delay PDF comparison. . . . . . . . . . . . . . . . 91
5.1 (a) Comparison of exact computation, linear fitting, and second-order
fitting for max(V, 0); (b) PDF comparison of exact computation, linear
fitting, and second-order fitting for max(V, 0). . . . . . . . . . . . . . 107
5.2 Algorithm for computing max(D
1
,D
2
). . . . . . . . . . . . . . . . . . 110
5.3 PDF comparison for circuit s15850. . . . . . . . . . . . . . . . . . . 123
6.1 Ring oscillator frequency within a wafer. (a) Process 1; (b) Process 2. 133
xi
6.2 Mean and variance for different r
d
. . . . . . . . . . . . . . . . . . . . 137
6.3 Apparent spatial correlation and covariance as a function of distance. . 138
6.4 Correlation coefficient for within-die spatial variation after inter-die
variation is removed. . . . . . . . . . . . . . . . . . . . . . . . . . . 139
6.5 Approximating across-wafer variation. . . . . . . . . . . . . . . . . . 141
6.6 Ring oscillator frequency within a wafer. . . . . . . . . . . . . . . . . 143
6.7 PDF of S
x
and S
y
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
7.1 Comparison of S-L, L-S, and L-L case. . . . . . . . . . . . . . . . . . 175
7.2 Number of ˆ n or ˜ n which can be considered as large for different frac-
tion of lot-to-lot variation assuming maximum 3% error at 90% confi-
dence. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
7.3 Flow to simulate confidence. . . . . . . . . . . . . . . . . . . . . . . 180
7.4 90% confident D
w
f
with varying ˆ n and ˜ n. . . . . . . . . . . . . . . . . 182
7.5 Worst case mean, standard deviation, and 95-percentile point normal-
ized with respect to the nominal value for ISCAS85 benchmarks, as-
suming ˆ n = 10 and ˜ n = 15. . . . . . . . . . . . . . . . . . . . . . . . 185
7.6 Normalized worst case mean, standard deviation, and 95-percentile
point for ISCAS85 benchmarks, assuming ˆ n = 10 and ˜ n is large. . . . 186
7.7 90%confidence worst case 95-percentile point with varying ˆ n for C7552.
Percentile point is normalized with respect to the nominal value. . . . 187
xii
LIST OF TABLES
1.1 Impact of process variation in near-term future. . . . . . . . . . . . . 3
2.1 Trace information, device and circuit parameters. . . . . . . . . . . . 26
2.2 switching activities for different technology nodes, V
d
d and V
th
. Ar-
chitecture setting: N = 10, K = 4. Unit: switch per clock cycle. . . . 27
2.3 Short circuit power ratios for different technology nodes, V
d
d and V
th
.
Architecture setting: N = 10, K = 4. . . . . . . . . . . . . . . . . . . 28
2.4 MCNCbenchmark list. The number of LUTs and flip flops are counted
under the architecture N=8 and K=4. . . . . . . . . . . . . . . . . . 31
2.5 Baseline hyper-architecture and evaluation ranges. . . . . . . . . . . 33
2.6 Power and delay ranges for different device settings. . . . . . . . . . 34
2.7 Min-ED hyper-architecture of optimizing device and architecture sep-
arately and simultaneously. . . . . . . . . . . . . . . . . . . . . . . 36
2.8 Minimum delay and minimum energy hyper-architectures. . . . . . . 37
2.9 Comparison between baseline and min-EDhyper-architecture in Homo-
V
th
, Hetero-V
th
, Homo-V
th
+G, and Hetero-V
th
+G. . . . . . . . . . . . 38
2.10 Minimum ED-area product hyper-architectures for different classes.
ED, Area, and ED-area product are normalized with respect to the
baseline. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.11 Min-ED hyper-architecture under different utilization rates. . . . . . 41
2.12 Min-AED hyper-architecture under different utilization rates. Note:
AED is normalized with respect to the baseline. . . . . . . . . . . . . 41
2.13 Min-delay hyper-architecture under different routing structures. . . . 42
xiii
2.14 Min-energy hyper-architecture under different routing structures. . . 43
2.15 Min-ED hyper-architecture under different routing structures. . . . . 43
3.1 Verification of yield model. . . . . . . . . . . . . . . . . . . . . . . . 58
3.2 Experimental setting. . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.3 Optimum leakage yield hyper-architecture. . . . . . . . . . . . . . . . 61
3.4 Optimum Timing yield hyper-architecture. . . . . . . . . . . . . . . . 61
3.5 Optimum leakage and timing combined yield hyper-architecture. . . . 62
4.1 Experimental setting. . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.2 Power, delay, energy, and ED comparison for different device settings. 83
4.3 Experimental setting for process variation. . . . . . . . . . . . . . . . 84
4.4 Nominal value, mean, and standard deviation comparison for leakage
power and delay. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.5 Delay and leakage change due to device aging. . . . . . . . . . . . . 89
4.6 Soft error rate comparison. . . . . . . . . . . . . . . . . . . . . . . . 90
4.7 Impact of device aging on process variation. . . . . . . . . . . . . . . 92
4.8 The impact of device aging and process variation on SER of one SRAM
under ITRS HP 32nm technology. . . . . . . . . . . . . . . . . . . . 93
5.1 Experiment setting. The 3σ value is the normalized with respect to the
nominal value. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
xiv
5.2 Error percentage of mean, standard deviation, skewness, and 95-percentile
point for variation setting (1). Note:the error percentage of mean (µ),
standard deviation (σ), and 95-percentile point (95%) is computed as
100 ×(MC value −SSTA value)/σ
MC
, and the error percentage of
skewness γ is computed as 100×(γ
MC
−γ
SSTA
)/γ
MC
. . . . . . . . . . 125
5.3 Mean, standard deviation, and skewness comparison for variation set-
ting (2). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
6.1 Notations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
6.2 Delay percentage error for different variation models. Note: the µ, σ,
and 95-percentile point for exact simulation is in ns. Run time (T) is in
ms.

for LDAW, SPC, and IW, linear SSTA is applied when assuming
linear cell delay model. . . . . . . . . . . . . . . . . . . . . . . . . . 152
6.3 percentage error for ISCAS85 benchmark stretching on a chip with
different chip size. Note: We assume square chips and chip size means
edge length in mm. The values of exact simulation are in ns. . . . . . 153
6.4 Delay percentage error at different locations in a 2cm×2cm chip.
Note: The values of exact simulation are in ns. . . . . . . . . . . . . . 154
6.5 Delay comparison for ISCAS85 benchmark in 1cm×1cm chip. Note:
The values of exact simulation are in ns. . . . . . . . . . . . . . . . . 154
6.6 Delay comparison for ISCAS85 benchmark in 6mm×6mmchip. Note:
The values of exact simulation are in ns. . . . . . . . . . . . . . . . . 155
6.7 Delay comparison for ISCAS85 benchmark in 3mm×3mmchip. Note:
The values of exact simulation are in ns. . . . . . . . . . . . . . . . . 155
6.8 Leakage error percentage for different models in 2cm×2cmchip. Note:
The exact values are in mW. Run time (T) is in s. . . . . . . . . . . . 157
xv
6.9 Leakage error for different variation model on different size chips.
Note: exact values are in mW. . . . . . . . . . . . . . . . . . . . . . 158
6.10 Summary of different variation models. . . . . . . . . . . . . . . . . 160
7.1 Reliability of statistical analysis for different number of measured and
production lots. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
7.2 Notations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
7.3 Experimental validation of estimation of confidence interval for mean
for ˆ n = 5 and ˜ n = 1. . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
7.4 Guardband of fast/slow corner for different degrees of confidence as-
suming ˆ n = 10, ˜ n = 15. . . . . . . . . . . . . . . . . . . . . . . . . . 174
7.5 Guardband of fast/slow corner for different degrees of confidence as-
suming large ˆ n and ˜ n = 15. . . . . . . . . . . . . . . . . . . . . . . . 174
7.6 Guardband of fast/slowcorner for different confidence levels assuming
ˆ n = 10 and large ˜ n. . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
7.7 Guardband of fast/slowcorner for different confidence levels assuming
ˆ n = 10, ˜ n = 1, and ˜ n

= 25. . . . . . . . . . . . . . . . . . . . . . . . 177
7.8 Percentage guardband of worst case delay corner and corresponding
variation source corners under different confidence levels assuming
ˆ n = 10, ˜ n = 15. Note: delay value is in ps, L
gate
value is in nm, and
V
th
value is in V. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
7.9 Confidence comparison for inverter chain based corner assuming ˆ n =
10, ˜ n = 15. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
xvi
7.10 Percentage guardband of worst case delay corner and corresponding
variation sources corners under different confidence levels assuming
large and ˜ n = 15. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
7.11 Percentage guardband of worst case delay corner and corresponding
variation source corners under different confidence levels assuming
ˆ n = 10 and large ˜ n. . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
7.12 Percentage guardband of worst case delay corner and corresponding
variation source corners under different confidence levels assuming
ˆ n = 10, ˜ n = 1, and ˜ n

= 25. . . . . . . . . . . . . . . . . . . . . . . . 182
7.13 Confidence comparison for ISCAS85 benchmark 95-percentile point
delay assuming ˆ n = 10, ˜ n = 15. . . . . . . . . . . . . . . . . . . . . . 185
xvii
Acknowledgments
Many people offered me significant help during the development of this dissertation.
I would like to acknowledge them here.
First of all, I would like to thank my adviser, Prof. Lei He and my co-adviser, Prof.
Puneet Gupta. They not only helped me to develop skills to solve problems, but also
motivated and enlightened me at all time. This dissertation would not exist without
their enormous encouragement, support, and inspiration.
I am indebted to Prof. Sudhakar Pamarti, Prof. Glenn Reinman, and Prof. Lieven
Vandenberghe for serving as members of my dissertation committee and their sugges-
tions to improve this work.
I treasure the opportunity that I have worked with a group of wonderful people
and would like to thank them for all the helps and support through my research work,
especially those colleagues at the UCLA Design Automation Lab and NanoCAD Lab.
In particular, I thank Dr. Fei Li, Dr. Yan Lin, Miss Phoebe Wong, and Dr. Jinjun
Xiong. The work on device and architecture co-optimization for FPGA in Chapter 2
and Chapter 3 is a joint project with Fei Li, Yan Li, and Phoebe Wong. Yan Li also
helped me on developing the transistor and circuit level power and delay model for
FPGA process and architecture concurrent design in Chapter 4. Jinjun Xiong helped
me to develop the experiment and revise the writing for the statistical static timing
analysis work in Chapter 5. He also helped me to develop the chip-wise optimization
flow to minimize FPGA delay considering process variation.
I sincerely thank Dr. Kevin Cao for his help with my research. He helped me to
develop the FPGA device aging model, which is reported in Chapter 4.
I also indebted to Dr. Costas Spanos and Mr. Kun Qian for helping me to develop
the across-wafer variation model, which is reported in Chapter 6.
xviii
I would also like to thank many researchers include Dr. Qi Xiang, Dr. Jeffery Tang,
Mr. Tim Tuan, Miss Wenping Wang, for collaboration and helpful discussions.
Finally, and most of all, I am grateful to my family. I would like to thank my
wife Wen Wang for her constant understanding and supporting me during my Ph.D.
study. I would also like to thank my parents, my brother, and my sister in law, for
their dedication, and support. This dissertation becomes so humble compared to their
constant inspiration, support, and love.
xix
Vita
1979 Born in People’s Republic of China.
1997-2001 B.S. in Electronic and Communication Engineering, Zhongshan
University, Guangzhou, P.R. China.
2001-2003 M.S. in Electrical and Computer Engineering, Portland State Uni-
versity, Portland, Oregon, USA. Teaching Assistant, Analogue Cir-
cuit Design, 2002. Graduate Student Researcher, 2002-2003.
2003-2004 First year Ph.D. program, in Electrical and Computer Engineering,
University of Colorado, Boulder, Colorado, USA. Graduate Student
Researcher, 2003-2004.
2004-present P.D. program, in Electrical Engineering, University of Califor-
nia, Los Angeles, California, USA. Teaching Assistant, Advanced
Computer Architecture, 2006. Graduate Student Researcher,
UCLA. Design Automation Lab, 2004-2008. Graduate Student Re-
searcher, UCLA. Nano CAD Lab, 2008-present.
2006 Intern Engineer, Nvidia Corp., San Jose, California, USA. Worked
on GPU testing.
2006-2007 Research Intern, Altera Corp., San Jose, California, USA. Worked
on statistical power and delay estimation.
xx
PUBLICATIONS
L. Cheng, F. Li, Y. Lin, P. Wong and L. He, “Device and Architecture Co-Optimization
for FPGA Architecture Power Reduction,” IEEE Transactions on Computer-Aided De-
sign of Integrated Circuits and Systems (TCAD), volume: 26, issue: 7, July 2007, pp:
1211-1221.
L. Cheng, P. Gupta, and L. He , “On Confidence in Characterization and Application
of Variation Models,” accepted by IEEE/ACM Asia South Pacific Design Automation
Conference (ASPDAC), February 2010.
L. Cheng, P. Gupta, and L. He , “Accounting for Non-linear Dependence Using Func-
tion Driven Component Analysis,” IEEE/ACM Asia South Pacific Design Automation
Conference (ASPDAC), February 2009, pp: 474-479.
L. Cheng, P. Gupta, and L. He , “Efficient Additive Statistical Leakage Estimation,”
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
(TCAD), volume: 28, issue: 11, November 2009, pp:1777-1781.
L. Cheng, P. Gupta, C. Spanos, K. Qian, and L. He , “Physically Justifiable Die-Level
Modeling of Spatial Variation in View of Systematic Across Wafer Variability,” Design
Automation Conference (DAC), July 2009, pp: 104-109.
L. Cheng, W. Hung, X.Song, and G. Yang, “Congestion Estimation for 3D Routing,”
IEEE Computer Society annual symposium on VLSI (ISVLSI), February 2004, pp:
xxi
239-240.
L. Cheng, W. Hung, X.Song, and G. Yang, “Congestion Estimation for Three-Dimensional
Architectures,” IEEE Transactions on Circuits and Systems II: Express Briefs (TCAS
II), volume: 51, issue: 12, December 2004, pp: 655- 659.
L. Cheng, Y. Lin, L. He, and Y. Cao, “Trace-Based Framework for Concurrent De-
velopment of Process and FPGA Architecture Considering Process Variation and Re-
liability,” International Symposium on Field-Programmable Gate Arrays (ISFPGA),
February 2008, pp: 159-168.
L. Cheng, X.Song, G. Yang, W. Hung, Z. Tang, and S. Gao, “A Fast Congestion
Estimator for Routing with Bounded Detours,” Integration, the VLSI Journal (Integrat
VLSI J), volume: 41, issue: 3, May 2008, pp: 360-370.
L. Cheng, X.Song, G. Yang, and Z. Tang, “A Fast Congestion Estimator for Routing
with Bounded Detours,” IEEE/ACM Asia South Pacific Design Automation Confer-
ence (ASPDAC), February 2004, pp: 666-670.
L. Cheng, P. Wong, F. Li, Y. Lin, and L. He, “Device and Architecture Co-Optimization
for FPGA Power Reduction,” Design Automation Conference (DAC), June 2005, pp:
915-920.
L. Cheng, J. Xiong, and L. He, “FPGA Performance Optimization via Chipwise
Placement Considering Process Variations,” International Conference on Field Pro-
grammable Logic and Applications (FPL), August 2006, pp:1-6.
xxii
L. Cheng, J. Xiong, and L. He, “Non-Linear Statistical Static Timing Analysis for
Non-Gaussian Variation Sources,” ACM/IEEE International Workshop on Timing is-
sue: in the Specification and Synthesis of Digital Systems(TAU), February 2007, pp:
80-85.
L. Cheng, J. Xiong, and L. He, “Non-Linear Statistical Static Timing Analysis for
Non-Gaussian Variation Sources,” Design Automation Conference (DAC), June 2007,
pp: 250-255.
L. Cheng, J. Xiong, and L. He, “Non-Gaussian Statistical Timing Analysis Using
Second-Order Polynomial Fitting,” IEEE International Workshop on Design for Man-
ufacturability and Yield (DFM&Y), October 2007.
L. Cheng, J. Xiong, and L. He, “Non-Gaussian Statistical Timing Analysis Using
Second-Order Polynomial Fitting,” IEEE/ACM Asia South Pacific Design Automation
Conference (ASPDAC), February 2008, pp: 298-303.
L. Cheng, J. Xiong, and L. He, “Non-Gaussian Statistical Timing Analysis Using
Second-Order Polynomial Fitting,” IEEE Transactions on Computer-Aided Design of
Integrated Circuits and Systems (TCAD), volume: 28, issue: 1, January 2009, pp:
130-140.
F. He, L. Cheng, X. Song, and G. Yang, “An Analytical Congestion Model With
Bounded-Based Detours,”. accepted by Journal of Circuits, Systems, and Computers
(JCSC).
xxiii
F. He, L. Cheng, G. Yang, X. Song, M. Gu, and J. Sun, “On Theoretical Upper Bounds
for Routing Estimation,” Journal of Universal Computer Science (J.UCS), volume: 11,
Number 6, June 2005, pp: 916-925.
F. He, M. Gu, X. Song, Z. Tang, and L. Cheng, “Probabilistic Estimation for Routing
Space,” The Computer Journal (TCJ), May 2005, pp: 667-676.
F. He, X. Song, L. Cheng, G. Yang, Z. Tang, M. Gu, and J. Sun, “A Hierarchical
Method for Wiring and Congestion Prediction,” IEEE Computer Society annual sym-
posium on VLSI (ISVLSI), May 2005, pp: 307-308.
F. He, X. Song, M. Gu, L. Cheng, G. Yang, Z. Tang, and J.Sun, “ACombinatorial Con-
gestion Estimation Approach with Generalized Detours,” Computers & Mathematics
with Applications (CAM), volume: 51, : 6-7, March 2006, pp: 1113-1126.
W. Hung, X.Song, T. Kam, L. Cheng, and G. Yang, “Routability Checking For Three-
Dimensional Architectures,” IEEE Transactions on Very Large Scale Integration Sys-
tems (TVLSI), volume: 12, issue: 12, December 2004, pp: 1371-1374.
Y. Jeng and L. Cheng, “Digital Spectrumof a Nonuniformly Sampled Two-Dimensional
Signal and Its Reconstruction,” IEEE Transaction on Instrumentation and Measure-
ment (TIM), volume: 54, issue: 3, June 2005, pp: 1180-1187.
P. Wong, L. Cheng, Y. Lin, and L. He, “FPGA Device and Architecture Evaluation
Considering Process Variation,” International Conference on Computer Aided Design
(ICCAD), November 2005, pp: 19-24.
xxiv
Abstract of the Dissertation
Statistical Analysis and Optimization for Timing and
Power of VLSI Circuits
by
Lerong Cheng
Doctor of Philosophy in Electrical Engineering
University of California, Los Angeles, 2010
Professor Puneet Gupta, Co-chair
Professor Lei He, Co-chair
As CMOS technology scales down, process variation introduces significant uncertainty
in power and performance to VLSI circuits and significantly affects their reliability. If
this uncertainty is not properly handled, it may become the bottleneck of CMOS tech-
nology improvement. This dissertation proposes novel techniques to model, analyze,
and optimize power and performance of FPGAs and ASICs considering process varia-
tion. This dissertation focuses on two aspects: (1) Process and architecture concurrent
optimization for FPGAs; (2) Statistical timing modeling and analysis.
To perform process and architecture concurrent optimization, an efficient and ac-
curate FPGA power, delay, and variation evaluator, Ptrace, is proposed. With Ptrace,
we present the first in-depth study on device and FPGA architecture co-optimization
to minimize power, delay, area, and variation considering hundreds of device and ar-
chitecture combinations. Furthermore, to enable early stage process and architecture
co-optimization without stable device models, we develop transistor level and circuit
level power, delay, and reliability models and incorporate them with Ptrace. With the
extended Ptrace, we perform architecture and process parameters concurrent optimiza-
xxv
tion for FPGA power, delay, variation, and reliability.
To perform statistical timing modeling and analysis, we first present an efficient
and accurate statistical static timing analysis (SSTA) flow for non-linear cell delay
model with non-Gaussian variation sources. All operations in this flow are performed
by analytical equations without any time consuming numerical approach. Then, to
further improve the efficiency and accuracy of statistical timing analysis, we develop a
new die-level spatial variation model which accurately models the across-wafer vari-
ation. Besides modeling spatial variation, mean and variance uncertainty introduced
by limited number of samples is another problem in SSTA. To solve this problem,
we evaluate the confidence for statistical analysis and estimate the guardband value to
ensure a target confidence.
To the best of our knowledge, this dissertation is the first novel study of device,
process, and architecture concurrent co-optimization for FPGA power, delay, variation,
and reliability; and is the first work to model across-wafer variation at die-level and to
consider confidence guardband in statistical analysis.
xxvi
CHAPTER 1
Introduction
Complementary Metal Oxide Semiconductor (CMOS) manufacturing in nanometer re-
gion has resulted in great achievements in the world of electronics. The performance
and capabilities of Very Large Scale Integration (VLSI) circuits improve rapidly with
the aggressive scaling of transistors in CMOS. However, the increase of leakage power
and process variation have emerged to slow down this trend.
Since leakage power is exponentially related to channel length, it increases rapidly
as CMOS technology scales down. Figure 1.1 [114] illustrates the power density for
different technology nodes in the past thirty years. It can be seen that power density is
increasing at a dramatic rate. Therefore, power consumption becomes a crucial design
constraint for nano-scale VLSI designs.
•1980
P
o
w
e
r

D
e
n
s
i
t
y

(
W
/
c
m


)
2
•1985 •1990 •1995 •2000 •2005 •2010
•10
•10
•10
•10
•10
•4
•3
•2
•1
•0
Figure 1.1: Increasing of power density.
1
In comparison to challenges created by power, process variation is more difficult to
handle in VLSI design. Due to the inevitable manufacturing fluctuations and imperfec-
tions, transistors on different copies of the same design, or even at different locations
on the same chip will vary significantly. Process variation is the uncertainty of physical
process parameters, such as channel length, doping density, oxide thickness, channel
width, metal thickness, metal width, and so on. According to the causes of variation,
process variation can be classified into two types [162]:
• Catastrophic defects are caused by isolated random events (such as particles or
other contaminations) during manufacturing, which render chips non-functional.
• Parametric variations are caused by random fluctuations in process conditions so
that the physical properties of some parameters on a chip differ from the original
design. The fluctuations may include aberrations in stepper lens, doping density,
and manufacturing temperature.
Process variations introduce significant uncertainty for both circuit performance
and leakage power. It has been shown in [25] that even for the 180nm technology,
process variation can lead to 1.3X variation in frequency and 20X variation in leak-
age power, as illustrated in Figure 1.2. Such impact will become even larger in future
technology generations. In recent years, many methods, such as statistical model-
ing, analysis, and optimization for VLSI circuits, have been developed to alleviate
the variation effects. This research is called “Design for Manufacturability ” (DFM).
Internation Technology Roadmap for Semiconductors (ITRS) predicts the DFM tech-
nology requirement for process variation in the near-term future, as shown in Table 1.1
[69]. From the table, it is predicted that leakage power variation will increase to 331%
and performance variation will increase to 88% in the year 2015. Therefore, as CMOS
technology scales down to nano-meter region, process variation will become a poten-
tial show-stopper if it is not appropriately handled.
2
Figure 1.2: Leakage power and frequency variation.
Year 2007 2008 2009 2010 2011 2012 2013 2014 2015
% of V
dd
variability 10 10 10 10 10 10 10 10 10
% of V
th
variability (minimum size) 33 37 42 42 42 58 58 81 81
% of V
th
variability (typical size) 16 18 20 20 20 26 26 36 36
% of CD (L
gate
) variability 12 12 12 12 12 12 12 12 12
% circuit performance variability 46 48 49 51 60 63 63 63 63
% circuit total power variability 56 57 63 68 72 76 80 84 88
% circuit leakage power variability 124 143 186 229 255 281 287 294 331
Table 1.1: Impact of process variation in near-term future.
There are four common design styles in current VLSI: Application Specific Integrated
Circuit (ASIC), Field-Programmable Gate Array (FPGA), Application Specific Instruction
set Processor (ASIP), and general purpose processor or microprocessor. Different de-
sign styles may have different power consumption requirement and vulnerability to
process variation. Therefore, different modeling, analysis, and optimization techniques
should be developed for different design styles.
Among the design styles, FPGA is the most flexible one. Since FPGA allows the
same silicon implementation to be programmed or re-programmed for a variety of
applications, it provides low non-recurring engineering (NRE) cost and short time to
3
market. Due to this advantage, FPGA industry has grown rapidly since its invention
in 1984. However, in order to achieve programmability, FPGA pays the penalty with
decrease of performance, power, and area. Previous studies [83, 82] have shown a
100X energy difference, a 4.3X delay difference, and a 40X area difference between
FPGA designs and their ASIC counterparts. As discussed before, power consumption
is a crucial design constraint for nano-scale VLSI circuits. The power problem is
more significant for FPGAs than other design styles because FPGA has higher power
consumptions. On the other hand, since FPGAs have lots of regularity, the impact of
process variation on FPGAs are not as significant as other design styles. Nevertheless,
the impact of process variation on FPGAs should not be ignored and should be properly
handled. To solve the above problems, this dissertation focuses on optimizing power
and performance for FPGAs considering process variation. The first objective of this
dissertation is:
Objective 1: Concurrently evaluate FPGA architecture and pro-
cess parameters to optimize power, delay, and area considering pro-
cess variation and reliability.
In comparison to FPGAs, ASICs have higher performance, lower power, and smaller
area. However, due to the increased design complexity, ASICs has higher NRE cost,
design cost, and longer time to market than FPGAs. Due to the above advantages
and disadvantages, ASICs are usually used in high performance or low power applica-
tions, wherein the impact of process variation is very significant. Therefore, variation
modeling and analysis in ASICs are more important than in FPGAs.
Process variation may cause manufacturing yield loss. Manufacturing yield loss is
defined as the ratio between the number of chips that fail to meet the design specifi-
cations and the total number of manufactured chips [112]. There are several reasons
causing chip failure, such as catastrophic defects and parametric variations. Figure 1.3
4
illustrates the performance variation caused by parametric variations. It can be seen
that chip delay is spread over a wide range and the delay of some chips is higher than
the specification value. The chips failing to meet the delay specification contribute to
yield loss.

Yield
Delay spec
Yield Loss
Measured Delay
Figure 1.3: Delay yield.
In order to analyze and optimize performance yield, Statistical Static Timing Analysis
(SSTA) is proposed. Instead of estimating the chip delay deterministically as in the
traditional Static Timing Analysis (STA), SSTA estimates the chip delay statistically.
SSTA not only estimates the nominal value of chip delay but also provides the de-
lay distribution. Figure 1.4 shows the SSTA flow. In this flow, there are three key
components [122]:
• Modeling of process variation from silicon process characterization.
• Modeling of sensitivities of delay for standard cell libraries with regards to pro-
cess parameter variations (library modeling).
• Statistics aware engines for STA and optimization.
In this dissertation, we mainly focus on modeling of process variation and devel-
oping an accurate SSTA engine. The second objective of this dissertation is:
5
Modeling of
process
Variation
Measurement
Library
Modeling
Net List
SSTA
Engine
Chip
Delay
Variation
Figure 1.4: SSTA flow.
Objective 2: Statistical Timing Modeling and Analysis.
In the following of this chapter, Section 1.1 reviews some related research works;
Section 1.2 discusses the major contributions of this dissertation; and Section 1.3 out-
lines this dissertation.
1.1 Literature Review
1.1.1 FPGA Power, Performance, and Area Optimization
It is well known that architecture, including logic block (or cluster) size, Look-Up
Table (LUT) size, and routing wire segment length, has great impact on FPGA power,
performance, and area. Therefore, architecture evaluation for FPGA optimization at-
tracts lots of concerns. In early 1990s, architecture evaluation mainly focuses on opti-
mizing delay [143] and area [130]. These works showed that for non-clustered FPGAs,
LUT size of 4 achieves the smallest area and LUT size 5 or 6 minimizes delay. After
cluster-based island style FPGA became popular, architecture evaluation for this type
of FPGA was performed. [10] showed that LUT sizes ranging from 4 to 6 and cluster
sizes between 4 and 10 produce the best area-delay product.
6
As discussed before, power consumption is one of the most important design con-
straints of FPGA design when technology scales down to nano-meter region. There-
fore, after the year 2000, FPGA power modeling became a hot research topic. [158]
quantified the leakage power of commercial FPGA architecture and [48] introduced a
high level FPGA power estimation methodology. [123, 89, 92] presented power eval-
uation frameworks for generic parameterized FPGAs and showed that leakage power
takes higher portion of total power consumption in FPGAs than in ASICs. It was also
shown that interconnects consume even more power than logic elements in FPGA.
At the same time, FPGA power optimization was also studied. Power aware FPGA
CAD algorithms was proposed in [85]; power driven partition method was presented
in [117]; a leakage power saving routing multiplexer and an input control method were
developed [150]; and a configuration inversion method to save the leakage power was
proposed in [12]. Architecture evaluation for power and delay minimization is then
studied in [93, 123, 92].
Due to programmability, utilization rate of circuit elements in FPGA is very low.
To reduce leakage power consumption of unused circuit elements, body bias [110]
and power gating [59, 102] were applied. To further reduce power on used circuit
elements, programmable dual-V
dd
was first applied to logic blocks [93, 90, 98, 91]
and then extended to interconnects [58, 51, 71]. Architecture evaluation considering
power gating and dual-V
dd
was performed [101]. Besides leakage power minimization,
a glitch minimization technique [84] was proposed to reduce dynamic power.
Recently, FPGA modeling and optimization considering process variation was
studied. [31, 43, 100, 115, 147, 99, 135, 79, 115, 154, 137, 28] presented techniques to
statistically optimize FPGA performance. [136] performs parametric yield estimation
for FPGA considering within-die variation.
7
1.1.2 Statistical Timing Modeling, Analysis, and Optimization
Process Variation Modeling- Process variation introduces significant uncertainty to
circuit power and delay. In order to analyze the power and delay uncertainty, process
variation modeling is needed. An early work was proposed in [25] whereby process
variation is separated into inter-die variation and within-die variation. All transistors
on the same die share the same inter-die variation and different dies may have different
inter-die variation; each transistor has its own within-die variation and the within-die
variation for different transistors are assumed to be independent. Later on, it was
observed that within-die variation is not independent but spatially correlated and the
correlation depends on the distance between two within-die locations. In this case,
spatial variation was modeled as correlated random variables [6, 32] and principle
component analysis was applied to perform statistical timing analysis. In this model, a
chip is divided into several grids and each grid has its own spatial variation. The spatial
variations of different grids are correlated and the correlation coefficients depend on
the distance between two grids. However, obtaining the correlation coefficient between
two grids became a problem for this model. [172, 173, 105] solved this problem by
modeling the spatial variation as a random field [174] and assuming the correlation
coefficient to be a function of distance. After that, several more complicated spatial
variation models were proposed [45, 105, 189, 55, 64].
Statistical Analysis- With the model of process variation, statistical analysis and
optimization are then performed. Statistical static timing [103, 72, 7, 120, 157, 128,
32, 5, 49, 9, 22, 8, 163, 86, 53, 182, 13, 181, 113, 75, 180, 178, 76, 3, 183, 2,
142, 19, 179, 40, 30, 52, 144, 165, 116, 36, 41, 56, 63, 97, 35, 104, 189, 145, 141,
88, 161, 81, 107, 65, 61, 169, 138, 146, 70, 155, 29, 171, 118, 42] and leakage
[26, 111, 21, 37, 176, 140, 109, 15, 156, 33, 95, 62, 73, 44, 119, 151, 188, 20, 133]
analysis are hot research topics in recent years. There are two major approaches in
8
Statistical static timing analysis techniques: path-based [103, 72, 7, 120, 157, 128]
and block-based [32, 5, 49, 9, 22, 8, 163, 86, 53, 182, 13, 181, 113, 75, 180, 178, 76,
3, 183, 2, 142, 19, 179, 40, 41, 42] SSTA. Since path-based SSTA is not scalable to
large circuit sizes, block-based SSTA is more commonly used. Since Gaussian random
variables are easy to be handled, the early block-based SSTA flows [32, 163] modeled
the gate delay as linear functions of variation sources and assumed all the variation
sources are mutually independent Gaussian random variables. Later, it was observed
that the linear delay model is no longer accurate when the scale of variation becomes
larger [94] and a higher-order delay model is thus used [180, 178]. Moreover, it was
also observed that some variation sources do not follow a Gaussian distribution. For
example, the via resistance has an asymmetric distribution [13], while dopant concen-
tration is more suitably modeled as a Poisson distribution [142] rather than Gaussian.
The nonlinearity of delay model and non-Gaussian variation sources make SSTA much
more difficult. [142] applied independent component analysis to de-correlate the non-
Gaussian random variables, but it was still based on a linear delay model. Some recent
works [76, 13, 19, 179, 40, 41, 42] considered both non-linear delay model and non-
Gaussian variation sources.
Confidence Analysis- In all the statistical modeling and analysis methods dis-
cussed above, the statistical characteristics (such as mean and variance) are assumed to
be given and reliable. However, when the number of production samples is not large,
the production statistical characteristics may significantly deviate from their popula-
tion values. Therefore, uncertainty in the statistics of measured data as well as produc-
tion data should be considered in statistical analysis. [177] modeled the uncertainty
of mean, variance, and correlation coefficients as an interval, and then estimated the
range of mean and variance of circuit performance.
9
1.2 Contributions of the Dissertation
The contributions of this dissertation are two-fold:
1. Device and architecture concurrent optimization for FPGAs.
2. Statistical timing modeling and analysis.
For device and architecture concurrent optimization for FPGAs, we have the fol-
lowing contributions:
• We first develop an efficient yet accurate power and delay estimator for FPGAs,
which we refer to as Ptrace. Experimental results show that compared to cycle
accurate power simulator and VPR [18], Ptrace predicts chip level power and
delay within 4% and 5% error, respectively. Based on such framework, we per-
form device and architecture co-optimization for FPGA circuits. Compared to
the baseline, which uses the VPR architecture model [18] with the same LUT
size and cluster size as the commercial FPGAs used by Xilinx Virtex-II [170],
and the device settings from ITRS roadmap[66], our co-optimization reduces
energy-delay product by 18.4% and chip area by 23%. Furthermore, consider-
ing FPGA architecture with power-gating capability, our architecture and device
co-optimization reduces energy-delay product by 55.0% and chip area by 8.2%
compared to the baseline. We also study the impact of utilization rate and inter-
connect structure.
• We then extend Ptrace to estimate power and delay variation of FPGAs. We pro-
posed closed-form models to estimate chip level leakage and timing variations
for FPGA. It has been shown that the mean and standard deviation computed by
our models are within 3% error compared to Monte-Carlo simulation. With the
10
extended Ptrace, we perform FPGA device and architecture evaluation consider-
ing process variations. Compared to the baseline, device and architecture tuning
improves leakage yield by 4.8%, timing yield by 12.4%, and leakage and timing
combined yield by 9.2%. We also observe that LUT size of 4 gives the highest
leakage yield, LUT size of 7 gives the highest timing yield, but LUT size of 5
achieves the maximum leakage and timing combined yield.
• We further extend (Ptrace) to consider process parameters directly so that FPGA
circuit and architecture evaluation can be conducted when only the first or-
der process parameters are available. Such evaluation may be used to select
circuits and architectures that are less sensitive to process changes or process
variations. We call the resulting framework as Ptrace2. With Ptrace2, we in-
corporate analytical calculations for two types of FPGA reliability, device ag-
ing (Negative-Bias-Temperature-Instability, NBTI [11, 149, 160, 23] and Hot-
Carrier-Injection, HCI [34, 149, 166]) and permanent soft error rate (SER) [60],
again in the ‘from device to chip’ fashion. We observe that device aging reduces
standard deviation of leakage by 65% over 10 years while it has relatively small
impact on delay variation. Moreover, we also find that neither device aging due
to NBTI and HCI nor process variation has significant impact on SER.
For statistical timing modeling and analysis, we have the following contributions:
• We introduce an efficient and accurate SSTA flow for non-linear SSTA with
non-Gaussian variation sources. All operations in our flow are based on efficient
closed-form formulae. Experimental results showthat compared to Monte-Carlo
simulation, our approach predicts the mean, standard deviation, skewness, and
95-percentile point within 1%, 1%, 6%, and 1% error, respectively.
• Our proposed SSTA flow solves the problem of computing the chip delay vari-
11
ation, but does not consider modeling of process variation. In order to improve
the accuracy and efficiency of SSTA, we presents an accurate model for across-
wafer variation. Compared to the exact value, the error of the traditional grid-
based spatial variation model is up to 8% while the error of our model is only
2%. Morever, our new model is 6X faster than the traditional model.
• The above SSTA flow also assumes that the statistical variation models are reli-
able. However, due to limited number of samples (especially in the case of lot-to-
lot variation), calibrated models have low degree of confidence. The problem is
further exacerbated when low production volumes cause additional loss of con-
fidence in the statistical analysis (since production only sees a small snapshot
of the entire distribution). We mathematically derive the confidence intervals
for commonly used statistical measures (mean, variance, percentile corner) and
analysis (SPICE corner extraction, statistical timing). Our estimations are within
2% of simulated confidence values. Our experiments indicate that for moderate
characterization volumes (10 lots) and low-to-medium production volumes (15
lots), a significant guardband is needed (e.g., 34.7%of standard deviation for sin-
gle parameter corner, 38.7% of standard deviation for SPICE corner, and 52% of
standard deviation for 95%-tile point of circuit delay are needed) to ensure 95%
confidence in the results. The guardbands are non-negligible for all cases when
either production or characterization volume is not large.
1.3 Dissertation Outline
The remaining of this dissertation is organized as follows:
Part I, including Chapter 2, 3, 4, proposes techniques to statistically model and
optimize FPGA power and delay.
12
Chapter 2 first introduces a time efficient trace-based delay and power estimation
framework for FPGA circuits, Ptrace. With Ptrace, we simultaneously optimize de-
vice setting (supply voltage V
dd
and threshold voltage V
th
) and FPGA architecture
(look-up table size K, cluster size N, and interconnect wire segment length W) to min-
imize power, delay and area.
Chapter 3 extends the architecture and device co-optimization flow in Chapter 2
to considering process variation. We first develop a set of closed-form formulas for
chip level leakage and timing variations considering both die-to-die and within-die
variations. Based on such model, we extend Ptrace to estimate the power and delay
variation of FPGAs. Applying extended Ptrace, we optimize device setting and FPGA
architecture to maximize delay yield, leakage yield, and delay and leakage combine
yield.
Chapter 4 proposes a framework to perform concurrent process and architecture
optimization for FPGAs in early stage without detailed process modeling. We first
implement models to calculate electrical characteristics of advanced CMOS transis-
tors based on the ITRS MASTAR4 (Model for Assessment of cmoS Technologies And
Roadmaps) tool [67, 68, 148], and then develop circuit level models of delay, leakage
power, and input/output capacitance for FPGA basic circuit elements. With this cir-
cuit level model, we further extend Ptrace in Chapter 2 and 3 to evaluate chip level
power, delay, and reliability. We refer to the new trace-based model as Ptrace2. With
Ptrace2, we can conduct FPGA circuit and architecture evaluation when only the first
order process parameters are available. We may also consider reliability issues, such
as device aging and permanent soft error rate, in addition to power and delay during
evaluation.
Part II, including Chapter 5, Chapter 6, and Chapter 7 introduces statistical tim-
ing modeling and analysis.
13
Chapter 5 presents an efficient and accurate statistical static timing analysis flowfor
non-linear delay model with non-Gaussian variation sources. We first propose a second
order polynomial approximation for the max operation of two randomvariables. Based
on this approximation, we develop an SSTA flow for three different delay models:
quadratic delay model, quadratic delay model without crossing terms (semi-quadratic
model), and linear delay model.
Chapter 6 studies another key component of SSTA: the modeling of process vari-
ation. We analyze the impact of deterministic across-wafer variation and find that
the spatially correlated within-die variation is mainly caused by deterministic across
wafer variation. Then, we propose an efficient and accurate die-level variation model
to model the across-wafer variation. We apply our proposed variation model to the
SSTA flow in Chapter 5 to verify its efficiency and accuracy.
Chapter 7 studies the confidence interval of statistical characteristics, such as mean
and variance, of statistical timing analysis. We estimates the confidence interval for a
single variation sources, for the fast/slow corner of an inverter chain, and for full chip
timing analysis.
Finally, Chapter 8 concludes this dissertation with discussion of the ongoing work
and future work as well.
In Chapter 9, I briefly summarize the works I have finished in my Ph.D. program
which are not included in this dissertation.
14
Part I
Statistical Modeling and Optimization
for FPGA Circuits
15
CHAPTER 2
Deterministic Device and Architecture Co-Optimization
for FPGA Power, Delay, and Area Minimization
Device optimization considering supply voltage V
dd
and threshold voltage V
th
has lit-
tle chip area increase, but a great impact on power and performance in the nanometer
technology. This chapter studies simultaneous evaluation of device and architecture
optimization for FPGAs. We first develop an efficient yet accurate timing and power
evaluation method, called trace-based model. By collecting trace information from
cycle-accurate simulation of placed and routed FPGA benchmark circuits and re-using
the trace for different V
dd
and V
th
, we enable device and architecture co-optimization
considering hundreds of device and architecture combinations. Compared to the base-
line FPGA architecture, which uses the VPR architecture model and the same LUT
and cluster sizes as those used by the Xilinx Virtex-II, V
dd
suggested by ITRS, and V
th
optimized with respect to the above architecture and V
dd
, architecture and device co-
optimization can reduce energy-delay product by 20.5% and chip area by 23.3%. Fur-
thermore, considering power-gating of unused logic blocks and interconnect switches
(in this case sleep transistor size is a parameter of device tuning), our co-optimization
reduces energy-delay product by 55.0% and chip area by 8.2% compared to the base-
line FPGA architecture.
16
2.1 Introduction
FPGAs allow the same silicon implementation to be programmed or re-programmed
for a variety of applications. It provides low NRE (non-recurring engineering) cost
and short time to market. Due to the large number of transistors required for field pro-
grammability and the low utilization rate of FPGA resources (typically 62.5% [158]),
existing FPGAs consume more power compared to ASICs [83]. As the process ad-
vances to nanometer technologies and low-energy embedded applications are explored
for FPGAs, power consumption becomes a crucial design constraint for FPGAs.
Recent work has studied FPGA power modeling and optimization. The leakage
power of a commercial FPGA architecture was quantified [158], and a high level FPGA
power estimation methodology was presented [48]. Power evaluation frameworks
were introduced for generic parameterized FPGA [123, 89, 92] and it was shown that
both interconnect and leakage power are significant for FPGAs in nanometer technolo-
gies. As to power optimization, the interaction of a suit of power-aware FPGA CAD
algorithms without changing the existing FPGAs was studied in [85]. Power-driven
partition algorithm for mapping applications to FPGAs with different V
dd
-levels [117]
was proposed. A configuration inversion method to reduce the leakage power of mul-
tiplexers without any additional hardware [12] was investigated. Besides the power
optimization CAD algorithms, low power FPGA circuits and architectures have also
been studied. Region based power gating for FPGA logic blocks [59] and fine-grained
power-gating for FPGA interconnects [102] were proposed, and V
dd
programmability
was applied to both FPGA logic blocks [93, 90, 98, 91] and interconnects [58, 51, 71].
Anewtype of routing multiplexer and an input control method were developed [150] to
reduce leakage of unused routing multiplexers, and a circuitry combining gate biasing,
body biasing and multi-threshold techniques was designed [110] to reduce intercon-
nect leakage. Besides leakage power reduction, a glitch minimization technique [84]
17
was proposed to reduce dynamic power.
Architecture evaluation has been performed first for area and delay. For non-
clustered FPGAs, it was shown that an LUT size of 4 achieves the smallest area [130]
and an LUT size of 5 or 6 leads to the best performance [143]. Later on, the cluster-
based island style FPGA was studied to optimize area-delay product and it showed
that LUT sizes ranging from 4 to 6 and cluster sizes between 4 and 10 can produce the
best area-delay product [10]. Besides area and delay, FPGA architecture evaluation
considering energy was studied recently [93, 123, 92]. It was shown that in 0.35µm
technology, an LUT size of 3 consumes the smallest energy [123]. In 100nm tech-
nology, an LUT size of 4 consumes the smallest energy [92]. [101] further evaluated
FPGA architectures with field programmable dual-V
dd
and power gating, and consid-
ering area, delay, and energy.
However, all the aforementioned architecture evaluation assumed fixed supply volt-
age V
dd
and threshold voltage V
th
, and sleep transistor size (if power gating is applied),
and have not conducted simultaneous evaluation of device optimization such as V
dd
and V
th
tuning and architecture optimization such as tuning LUT and cluster sizes.
Architecture and device co-optimization may obtain a better power and performance
tradeoff compared to architecture tuning alone. We define hyper-architecture as the
combination of device and architectural parameters. The co-optimization requires the
exploration of the following dimensions: cluster size N, LUT size K, V
dd
, V
th
, and sleep
transistor size if power gating is applied. The total number of hyper-architecture com-
binations can be easily over a few hundreds considering the interaction between these
dimensions. In the existing power evaluation frameworks, such as cycle-accurate sim-
ulation [89] and transition density based estimation [123], timing and power are cal-
culated for each circuit element. Therefore, it is time-consuming to explore the huge
hyper-architecture solution space using methods from [89, 123]. In order to perform
18
device and architecture co-optimization, an accurate yet extremely efficient timing and
power evaluation method is required.
The first contribution of this chapter is that we develop a trace-based estimator
(called Ptrace) for FPGA power, delay, and area. We profile placed and routed bench-
mark circuits and collect statistical information (called trace) on switching activity,
short circuit power, near-critical path structure, and circuit element utilization rate for
a given set of benchmark circuits (MCNC [175] benchmark set in this chapter). We
show that the trace is independent of device tuning and it can be used to calculate
FPGA chip-level performance and power for a number of device designs. Compared
to performing placement-and-routing by VPR [18] followed by cycle-accurate simula-
tion (called Psim) from [89], Ptrace has a high fidelity and an average error of 1.3% for
energy and of 0.8% for delay. The trace collecting has the same runtime as evaluating
FPGA architecture for one combination of V
dd
, V
th
and sleep transistor size using VPR
and Psim. It took one week to collect the trace for the MCNC benchmark set using
eight 1.2GHz Intel Xeon servers while all the hyper-architecture evaluation reported
in this chapter with over hundreds of hyper-architecture combinations took only a few
minutes on one server.
The second contribution is that we performthe architecture and device co-optimization
for conventional FPGAs and FPGAs with power gating capability. We explore differ-
ent V
dd
, V
th
, and sleep transistor size combinations in addition to cluster size and LUT
size combinations. For comparison, we obtain the baseline FPGA hyper-architecture
which uses the VPR architecture model [18] and the same LUT size and cluster size as
the commercial FPGAs used by Xilinx Virtex-II [170], and V
dd
suggested by ITRS
[66], but V
th
optimized by our device optimization. Such baseline is significantly
better than the ones with no device optimization. Compared to the baseline hyper-
architecture, architecture and device co-optimization can reduce energy-delay product
19
(product of energy per clock cycle and critical path delay, in short, ED) by 18.4% and
chip area by 23.3%. Furthermore, considering FPGA architecture with power-gating
capability, our architecture and device co-optimization reduces ED by 55.0% and chip
area by 8.2% compared to the baseline. We also study the impact of utilization rate
and interconnect structure.
The rest of the chapter is organized as follows. Section 2.2 presents the background
of FPGA architecture and existing FPGA architecture evaluation flow. Section 2.3
introduces our trace-based estimation models. Section 2.4 applies the new estimation
models to architecture and device co-optimization. Section 2.5 concludes this chapter.
2.2 Preliminaries
2.2.1 Conventional FPGA Architecture
We assume cluster-based island style FPGA architecture such as that in [18] for all
classes of FPGAs studied in this chapter. A cluster-based logic block (see Figure 2.1)
includes N fully connected Basic Logic Elements (BLEs). Each BLE includes one
K-input lookup table (LUT) and one flip-flop (DFF). The combination of cluster size
N and LUT size K is the architectural issue we evaluate in this chapter. The logic
blocks are surrounded by routing channels consisting of wire segments. The input and
output pins of a logic block can be connected to the wire segments in routing chan-
nels via a connection block (see Figure 2.1 (b)). A routing switch block is located at
the intersection of a horizontal channel and a vertical channel. Figure 2.1 (c) shows a
subset switch block [87], where the incoming track can be connected to the outgoing
tracks with the same track number.
1
The connections in a switch block (represented by
the dashed lines in Figure 2.1 (c)) are programmable routing switches. We implement
1
Without loss of generality, we assume subset switch block in this chapter.
20
routing switches by tri-state buffers and use two tri-state buffers for each connection
so that it can be programmed independently for either direction. We define an inter-
connect segment as a wire segment driven by a tri-state buffer or a buffer.
2
In this
chapter, we investigate both uniform length-4 interconnect wire segments and mixed-
length interconnects. We decide the routing channel width CW in the same way as the
architecture study in [18], i.e., CW = 1.2CW
min
where CW
min
is the minimum channel
width required to route the given circuit successfully.
0
1
2
3
3 1 2 0
3
2
1
0 0
1
2
3
0 1 3 2
A C
D
B
switch
connection
connection switch
(b) Connection block and
SR
SR
c
o
n
n
e
c
t
i
o
n

b
l
o
c
k
SR
0123
(c) Switch block
(d) Routing switches
SR
A B
logic
block
in1
block
connection
block
switch
(a) Island style
routing architecte
Figure 2.1: (a) Island style routing architecture; (b) Connection block; (c) Switch
block; (d) Routing switches.
2
We interchangeably use the terms of switch and buffer/tri-state buffer.
21
2.2.2 V
dd
Gatable FPGA Architecture
Power gating can be applied to interconnects and logic blocks to reduce FPGA power.
Figure 2.2 illustrates the circuit design of the V
dd
-gateable interconnect switch and
logic block from [101]. We insert a PMOS transistor (called a sleep transistor) between
the power rail and the buffer (or logic block) to provide the power-gating capability.
When a buffer or logic block is not used, the sleep transistor is turned off by the
configuration cell. SPICE simulation shows that power-gating can reduce the leakage
power of an unused buffer or a logic block by a factor of over 300.
S
w
i
t
c
h
S
w
i
t
c
h
D
e
c
o
d
e
r
SR
SR
SR
l
o
g


N


S
R
A
M

c
e
l
l
s
2
Switch
Switch
Logic block
in1
(b) Vdd−gateable routing switches
(a) Vdd−gateable switch
dN
d1
d1
dN
N


s
w
i
t
c
h
e
s
(c) Vdd−gateable connection block
Connection switches
wire track
Dec_Disable
B
A
M1
M2
Vdd
SR
BUFF
Switch
SR
SR
SR
SR
(d) Vdd gateable logic block
Logic block
Sram
Vdd
Figure 2.2: (a) V
dd
-gateable switch; (b) V
dd
-gateable routing switch; (c) V
dd
-gateable
connection block; (d) V
dd
-gateable logic block.
22
There is power and delay overhead associated with the sleep transistor insertion.
The dynamic power overhead is almost negligible. This is due to the fact that the sleep
transistors stay either ON or OFF after configuration and there is no charging and
discharging at their source/drain capacitors. The delay overhead associated with the
sleep transistor insertion can be bounded when the sleep transistor is properly sized.
Moreover, the connection box with power gating is different from that in the con-
ventional architecture. For the conventional FPGA, an input MUX, which is effectively
an implicit decoder [18], is used in the connection box as illustrated in Figure 2.1(b).
However, for the power gating architecture, an explicit decoder is used in the con-
nection box to enable both the signal path and power supply for the single input, as
illustrated in Figure 2.2(c) [101]. Because the decoder is no longer in the critical path,
the delay of connection box in Figure 2.2(c) is smaller than that in Figure 2.1(b). The
connection box in Figure 2.2(c) also has a much smaller leakage power but a larger
area and a slightly larger dynamic power compared to the conventional architecture.
2.2.3 FPGA Architecture Evaluation Flow
The existing FPGA architecture evaluation flow considering area, delay, and power
[92] is illustrated in Figure 2.3. For a given benchmark set, we first optimized the
logic then map the circuit to a given LUT size. TV-Pack is used to pack the mapped
circuit to a given cluster size. After packing, we place and route the circuit using VPR
[18] and obtain the chip level delay and area. Finally, cycle-accurate power simulator
[89] (in short Psim) is used to estimate the chip level power consumption.
The architecture evaluation flow discussed above is time consuming because we
need to place and route every circuit under different architectures and a large number
of randomly generated input vectors need to be simulated for each circuit. However,
the device and architecture co-optimization requires the exploration of the following
23
Parasitic
Extraction
Cycle-accurate
Power
Simulator
(Psim)
Power
Arch
Spec
Logic Optimization(SIS)
Tech-Mapping
(RASP)
Timing-Driven Packing
(TV-Pack)
Placement & Routing
(VPR)
Delay Area
Benchmark
Figure 2.3: Existing FPGA architecture evaluation flow for a given device setting.
dimensions: cluster size N, LUT size K, supply voltage V
dd
, threshold voltage V
th
, and
sleep transistor size if power gating is applied. The total number of hyper-architecture
combinations can be easily over a few hundreds considering the interaction between
these dimensions. Therefore, the conventional evaluation flow is not practical for de-
vice and architecture co-optimization due to its time inefficiency. In order to perform
device and architecture co-optimization, a fast yet accurate FPGA power and delay
estimator is required. Such estimator is just the one we are going to introduce in Sec-
tion 2.3, the trace-based estimator.
2.3 Trace-Based Estimation
In this section, we will introduce the efficient power and delay estimation model, trace-
based estimation method (Ptrace). We speculate that during hyper-architecture evalu-
ation, there are following two classes of information as summarized in Table 2.1. The
first class only depends on architecture and is called trace of the architecture. The sec-
ond class only depends on device setting and circuit design. The basic idea of Ptrace is
24
as follows: For a given benchmark set, we profile placed and routed benchmark circuits
and collect trace information under one device setting and for one FPGA architecture.
We then obtain chip-level performance and power for a set of device for a given archi-
tecture parameter values based on the trace information. Figure 2.4 illustrates the flow
of trace-based evaluation.
Arch
Spec
Trace-Based
Estimation
Trace
Collection
Chip Level Area,
Delay, and Power
Circuit Level
Area, Delay,
and Power
Figure 2.4: New trace-based evaluation flow. We perform the same flow as Figure 2.3
under one device setting to collect the trace information.
2.3.1 Trace Collection
As mentioned before, the trace information only depends on architecture and remains
the same when device setting changes. For a given benchmark set and a given FPGA
architecture, the trace includes number of used type i circuit elements (N
u
i
), total num-
ber of type i circuit elements (N
t
i
), average switching activity of used type i circuit
elements (S
u
i
), short circuit power ratio (α
sc
), and near-critical path structure (see Ta-
ble 2.1). Near-critical path structure is the number of each type of circuit elements
(N
p
i
) on the near-critical path. Device parameters include V
dd
and V
t
, which depend on
technology scale, and average leakage power of type i circuit elements (P
s
i
), average
load capacitance of type i circuit elements (C
u
i
), and average delay of type i circuit
25
elements (D
i
), which depend on circuit design. The device parameters, such as cir-
cuit level delay and power, can be collected by SPICE simulation or by measurement.
After collecting the trace and device level delay and power, we can perform Ptrace
to estimate the chip level delay, power, and area. Below, we will show that the trace
information is insensitive to the device parameters and discuss our trace-based models.
Trace Parameters (depend on architecture)
N
u
i
# of used type i circuit elements
N
t
i
total # of type i circuit elements
S
u
i
avg. switching activity for used type i circuit elements
N
p
i
# of type i circuit elements on the near-critical path
α
sc
ratio between short circuit power and switch power
Device Parameters
(depend on processing technology and circuit design)
V
dd
power supply voltage
V
th
threshold voltage
P
s
i
avg. leakage power for type i circuit elements
C
u
i
avg. load capacitance of type i circuit elements
D
i
avg. delay of type i circuit elements
Table 2.1: Trace information, device and circuit parameters.
2.3.2 Power and Delay Model
2.3.2.1 Dynamic Power Model
Dynamic power includes switch power and short-circuit power. A circuit implemented
on an FPGA cannot utilize all circuit elements. Dynamic power is only consumed by
the used FPGA resources. Our trace-based switch power model distinguishes different
types of used FPGA resources and applies the following formula:
P
sw
=

i
1
2
N
u
i
· f · V
2
dd
· C
sw
i
(2.1)
The summation is over different types of circuit elements, i.e., LUTs, buffers, input
pins and output pins. For type i circuit elements, C
sw
i
is the average switch capacitance
26
and N
u
i
is the number of used circuit elements, f is the operating frequency. In this
chapter, we assume that the circuit works at its maximumfrequency, i.e., the reciprocal
of the near-critical path delay. The switch capacitance is further calculated as:
C
sw
i
= (

j∈El
i
C
i, j
/N
u
i
) · S
u
i
= C
u
i
· S
u
i
(2.2)
For type i circuit elements, C
u
i
is the average load capacitance of used circuit elements,
which is averaged over C
i, j
, the local load capacitance for used circuit element j. El
i
is
the set of used type i circuit elements, and S
u
i
is the average switching activity of used
type i circuit elements. We assume that the average switching activity of the circuit
elements is determined by the circuit logic functionality and FPGA architecture. The
device parameters of V
d
d and V
th
have a limited effect on switching activity. We verify
this assumption in Table 2.2 by demonstrating the average switching activity of five
benchmarks at different technology nodes, and V
d
d and V
th
levels.
benchmark 70nm V
d
d=1.1 100nm V
d
d=1.3 70nm V
d
d=1.0
V
th
=0.25 V
th
=0.32 V
th
=0.20
logic inter- logic inter- logic inter-
connect connect connect
alu4 2.06 0.55 2.01 0.54 2.03 0.59
apex2 1.73 0.47 1.75 0.47 1.70 0.47
apex4 1.23 0.27 1.19 0.26 1.16 0.29
bigkey 1.75 0.56 1.96 0.59 1.71 0.55
clma 0.90 0.21 0.87 0.21 0.91 0.23
Table 2.2: switching activities for different technology nodes, V
d
d and V
th
. Architec-
ture setting: N = 10, K = 4. Unit: switch per clock cycle.
The short circuit power is related to signal transition time, which is difficult to
obtain without detailed simulation using a real delay model. In our trace-based model,
we model the short circuit power as:
P
sc
= P
sw
· α
sc
(2.3)
27
Where α
sc
is the ratio between short circuit power and switch power. Although this
ratio value at the circuit level depends on FPGA circuit design and architecture, we
assume that α
sc
does not depend on device and technology at the chip level. We ver-
ify this assumption in Table 2.3
3
by showing the average short circuit power ratio at
different technology nodes, V
d
d, and V
th
levels.
benchmark 70nm V
d
d=1.1 100nm V
d
d=1.3 70nm V
d
d=1.0
V
th
=0.25 V
th
=0.32 V
th
=0.20
logic inter- logic inter- logic inter-
connect connect connect
alu4 1.43 1.12 1.44 1.16 1.46 1.15
apex2 1.44 0.89 1.42 0.93 1.48 0.92
apex4 1.08 0.86 1.15 0.79 1.18 0.82
bigkey 0.74 1.64 0.76 1.71 0.72 1.68
clma 1.11 1.72 1.21 1.62 1.16 1.63
Table 2.3: Short circuit power ratios for different technology nodes, V
d
d and V
th
. Ar-
chitecture setting: N = 10, K = 4.
2.3.2.2 Leakage Power Model
The leakage power is modeled as follows,
P
static
=

i
N
t
i
P
s
i
(2.4)
For resource type i, N
t
i
is the total number of circuit elements, and P
s
i
is the leakage
power for a type i element. Notice that usually N
t
i
>N
u
i
because the resource utilization
rate is low in FPGAs (typically 62.5% [158]). For an FPGA architecture with power-
gating capability, an unused circuit element can be power-gated to reduce leakage
3
Because the delay between buffers is usually large for FPGA circuits, the short circuit power ratio
could be big.
28
power.
4
In this case, the total leakage power is modeled by the following formula:
P
static
=

i
N
u
i
P
i

gating
·

i
(N
t
i
−N
u
i
)P
i
(2.5)
where α
gating
is the average leakage ratio between a power-gated circuit element and a
circuit element in normal operation. SPICE simulation shows that sleep transistors can
reduce leakage power by a factor of 300 and α
gating
= 0.003 is used in this chapter.
2.3.2.3 Delay Model
To avoid the static timing analysis for the entire circuit implemented on a given FPGA
fabric, we obtain the structure of the ten longest circuit paths including the near-critical
path for each circuit. The path structure is the number of elements of different resource
types, i.e., LUT, wire segment and interconnect switch, on one circuit path. We assume
that the new near-critical path due to different V
d
d and V
th
levels is among these ten
longest paths found by our benchmark profiling. When V
d
d and V
th
change, we can
calculate delay values for the ten longest paths under new V
d
d and V
th
levels, and
choose the largest one as the new near-critical path delay. Therefore, the FPGA delay
can be calculated as follows:
D =

i
N
p
i
D
i
(2.6)
For resource type i, N
p
i
is the number of circuit elements that the near-critical path
goes through, and D
i
is the delay of such a circuit element. D
i
is a circuit parameter
depending on V
d
d, V
th
, process technology, and FPGA architecture.
5
To get the path
4
In this chapter, we assume the wire segment length change introduce by power gating can be ig-
nored. In our experiment, wire length is calculated as 4 ×

area
tile
[18], where area
tile
is the area of
a logic block with the interconnect surrounding it. The area overhead introduced by power gating is
less than 30%. Therefore, the change of wire segment length is less than 14%. Moreover the load ca-
pacitance of a routing buffer is mainly determined by the input and output capacitances of buffers. The
wire capacitance is a small portion (about 20%) of the load capacitance. Therefore, the area increase
introduced by power gating will have small impact on the delay and power estimation.
5
Delay of each circuit element is measured with the worst case switch. When power gating is
applied, a single gating transistor is used for the entire logic block. In order to measure the delay of
29
statistical information N
p
i
, we only need to place and route the circuit once for a given
FPGA architecture.
2.3.3 Validation of Ptrace
2.3.3.1 Verification of Deterministic Model
To validate Ptrace, we compare it to the conventional architecture evaluation flow
for ITRS [66] 70nm technologies. We assume V
d
d=1.0V and V
th
=0.2V and map 20
MCNC benchmarks, as illustrated in Table 2.4, to two architectures: {N = 8, K = 4}
and {N =6, K =7}. In Table 2.4, the number of LUTs and flip flops are counted under
the architecture N=8 and K=4. Notice that we can map the benchmarks to different
LUT and cluster sizes. We collect trace using the conventional architecture evaluation
flow in 70nm technology, V
d
d=1.0V and V
th
=0.2V. Figure 2.5 compares energy and
delay between the conventional evaluation flow and Ptrace for each benchmark. The
average energy error of Ptrace is 1.3% and average delay error is 0.8%. From the fig-
ure, the Ptrace has the same trends for energy as Psim does and for delay as VPR does.
Therefore, Ptrace has a high fidelity. Moreover, the run time of Ptrace is 2s, while that
of the conventional evaluation flow is 120 hours.
each circuit element in a logic block with gating transistor, a series of randomly generated vectors are
used as the inputs to the block and the worst case delay is measured for each circuit element under such
random input vectors.
30
benchmark # LUTs # Flip flops benchmark # LUTs # Flip flops
alu4 1607 0 ex5p 1201 0
apex2 2118 0 frisc 5926 900
apex4 1291 0 misex3 1501 0
bigkey 2935 224 pdc 5608 0
clma 13537 33 s298 2548 8
des 2172 0 s38417 8458 1463
diffeq 1933 377 s38584 7040 1260
dsip 1619 224 seq 1953 0
elliptic 4185 1138 spla 3938 0
ex1010 4721 0 tseng 1299 382
Table 2.4: MCNC benchmark list. The number of LUTs and flip flops are counted
under the architecture N=8 and K=4.
0
9
18
27
36
45
0 9 18 27 36 45
Delay from VPR(ns)
D
e
l
a
y

f
r
o
m

P
t
r
a
c
e
(
n
s
)
8x4
6x7
4% Error
0
2
4
6
8
10
0 2 4 6 8 10
Energy from Psim (nJ)
E
n
e
r
g
y

f
r
o
m

P
t
r
a
c
e

(
n
J
) 8x4
6x7
5% Error
Figure 2.5: Comparison between Psim and Ptrace.
31
2.4 Hyper-Architecture Evaluation
2.4.1 Overview
In this section, we use Ptrace to perform device and architecture evaluation. We
consider 70nm ITRS technology and evaluate four FPGA hyper-architecture classes:
Homo-V
th
, Hetero-V
th
, Homo-V
th
+G and Hetero-V
th
+G. Homo-V
th
is the conventional
FPGA using homogeneous V
th
for interconnects and logic blocks. Hetero-V
th
applies
different V
th
to logic blocks and interconnects. Homo-V
th
+G and Hetero-V
th
+G are
the same as Homo-V
th
and Hetero-V
th
, respectively, except that unused logic blocks
and interconnects are power-gated [101]. We compare them with the baseline hyper-
architecture, which uses the VPR architecture model [18] and has the same LUT size
and cluster size as those used by the Xilinx Virtex-II [170] (cluster size of 8, LUT
size of 4), V
dd
suggested by ITRS [66] (0.9v), and V
th
of 0.3v that is optimized with re-
spect to the above architecture and V
dd
. The baseline hyper-architecture and evaluation
ranges for device and architecture are presented in Table 2.5. Note that a high V
th
is
applied to all SRAM cells for configuration to reduce their leakage power as suggested
by [93].
In the subsections 2.4.2 to 2.4.4, we assume the following:
• The utilization rate (defined as the utilization rate of logic blocks, i.e., number
of used logic blocks over the number of total available logic blocks) is 0.5.
• All interconnect wire segments span 4 logic blocks with fully buffered routing
switches.
• The routing channel width is 1.2 times of the minimumchannel width that allows
the FPGA circuit being routed.
• All benchmark circuits work at their highest frequency (1/critical path delay).
32
The impact of utilization rate and interconnect architecture will be discussed in sub-
sections 2.4.5 and 2.4.6, respectively. For each hyper-architecture, we compute the
energy, delay and area as the geometric mean of 20 MCNC benchmarks in Table 2.4.
Moreover, in the rest of this chapter, we use CV
t
for V
th
of logic blocks and IV
th
for
V
th
for global interconnects. Note that CV
t
= IV
th
in Homo-V
th
and Homo-V
th
+G. To
illustrate the tradeoff between energy and delay, we introduce the concept of dominant
hyper-architecture: If hyper-architecture A has less energy consumption and a smaller
delay than hyper-architecture B, then we say that B is inferior to A. We define the dom-
inant hyper-architectures as the set of hyper-architectures that are not inferior to any
other hyper-architectures.
We organize the rest of this section as follows: First, Section 2.4.2 illustrates the
necessity of device and architecture co-optimization. Then Section 2.4.3 presents the
energy and delay tradeoff and ED reduction achieved by device and architecture co-
optimization. Section 2.4.4 discusses the device and architecture co-optimization con-
sidering area. Finally, Sections 2.4.5 and 2.4.6 analyze the impact of utilization rate
and compare the evaluation result of different routing architectures, respectively.
Baseline FPGA device/arch parameter values
V
dd
V
th
N K
0.9v 0.3v 8 4
Value range for device/arch optimization
V
dd
V
th
N K
0.8v-1.1v 0.2v-0.4v 6-12 3-7
Table 2.5: Baseline hyper-architecture and evaluation ranges.
2.4.2 Necessity of Device and Architecture Co-Optimization
In this section, we show the necessity of device and architecture co-optimization. We
first discuss the need of device tuning, then compare the results of optimizing device
33
and architecture separately and simultaneously.
Architecture evaluation has been studied in previous research [131, 80, 10, 89, 123,
101]. However, device tuning has not been reported in the literature. Our experiments
show that device tuning has a much greater impact on delay and energy than archi-
tecture tuning does, which is demonstrated in Figure 2.6 and Table 2.6. Each set of
data points in Figure 2.6 is the dominant hyper-architectures for a given device set-
ting.
6
For example, set D4 is the dominant hyper-architectures under V
dd
=1.0V and
V
th
=0.25V. From the figure, we observe that a change on the device leads to a more
significant change in energy and delay than architecture change does. For example,
for device setting V
dd
= 0.9V and V
th
=0.25v, energy for different architectures ranges
from 1.82nJ to 2.11nJ, and delay ranges from 13.68ns to 16.46ns. However, if we
increase V
th
by 0.05v, i.e., V
dd
= 0.9V and V
th
= 0.3V, the energy ranges from 1.17nJ
to 1.32nJ and the delay ranges from 18.99ns to 23.24ns. Therefore, it is important to
evaluate both device and architecture instead of evaluating architecture only.
Set V
dd
V
th
Min energy Max energy Min delay Max delay
(V) (V) (nJ) (nJ) (ns) (ns)
D1 0.9 0.25 1.82 2.11 13.68 16.46
D2 0.9 0.30 1.17 1.32 18.99 23.24
D3 0.9 0.35 0.97 1.05 31.01 36.40
D4 1.0 0.25 2.30 3.17 11.90 14.06
D5 1.0 0.30 1.37 1.95 15.60 17.50
D6 1.0 0.35 1.11 1.30 21.31 24.66
D7 1.1 0.25 5.58 17.01 10.60 12.43
D8 1.1 0.30 3.10 9.03 13.05 15.40
D9 1.1 0.35 1.95 4.73 17.01 19.92
Table 2.6: Power and delay ranges for different device settings.
There are three methods to perform device and architecture optimization. In the
first method, we first optimize device using one architecture then optimize the ar-
6
Dominant hyper-architectures for a given device setting are the hyper-architectures that are not
inferior to any other hyper-architectures under such device setting.
34
0
2
4
6
8
10
12
14
16
18
10 15 20 25 30 35 40
Delay (ns)
E
n
e
r
g
y

p
e
r

c
y
c
l
e

(
n
J
)
D1 Vdd 0.9 Vt 0.25
D2 Vdd 0.9 Vt 0.30
D3 Vdd 0.9 Vt 0.35
D4 Vdd 1.0 Vt 0.25
D5 Vdd 1.0 Vt 0.30
D6 Vdd 1.0 Vt 0.35
D7 Vdd 1.1 Vt 0.25
D8 Vdd 1.1 Vt 0.30
D9 Vdd 1.1 Vt 0.35
D1
D9
D8
D7
D6
D5
D4
D3
D2
Figure 2.6: hyper-architectures under different device settings.
chitecture under the optimized device setting and call it device-arch method. In the
second method, we first optimize the architecture within one device setting then opti-
mize the device setting according to the optimized architecture and call it arch-device
method. In the third method, we optimize architecture and device simultaneously and
call it simultaneous method. Both methods arch-device and device-arch cannot guar-
antee the optimal solution. Table 2.7 compares the min-ED hyper-architectures for
Homo-V
th
found by three different methods. For both arch-device and device-arch,
we start search from the baseline case {N = 8, K = 4, V
dd
= 0.9V, and V
th
= 0.3V}.
In our experiment, we find that there are two local optimal hyper-architectures in the
whole solution space: {N = 6, K = 7, V
dd
= 0.9V, V
th
= 0.3V} and {N = 10, K = 4,
V
dd
= 1.0V, V
th
= 0.3V} which is also the global optimal. In this particular example,
both arch-device and device-arch achieve the same local optimal hyper-architecture
{N = 6, K = 7, V
dd
= 0.9V, V
th
= 0.3V}. If we start search from some other hyper-
35
architectures, arch-device and device-arch may achieve other local optimum but there
is no guarantee of the global optimum. In order to obtain the global optimal solution,
we have to perform simultaneous method. From Table 2.7, we observe that simul-
taneous method can reduce ED by 13.3% compared to arch-device and device-arch.
We also see that the runtime of arch-device and device-arch is shorter than that of
simultaneous method. However, due to the time efficiency of Ptrace, the runtime of
simultaneous method is also small (only 34.1s). Therefore, it is still worthwhile per-
forming simultaneous device and architecture optimization.
V
dd
(V) CV
t
(V) IV
th
(V) (N, k) Energy (nJ) Delay (ns) ED (nJ· ns) runtime (s)
arch-device 0.9 0.30 0.30 (6,7) 1.38 19.8 27.3 5.1
device-arch 0.9 0.30 0.30 (6,7) 1.38 19.8 27.3 3.4
simultaneous 1.0 0.30 0.30 (10,4) 1.37 17.5 24.1 (-13.3%) 34.1
Table 2.7: Min-ED hyper-architecture of optimizing device and architecture separately
and simultaneously.
2.4.3 Energy and Delay Tradeoff
In this section, we first compare the impact of device tuning and architecture tuning,
then present min-energy and min-delay hyper-architectures, and finally discuss the
energy and delay tradeoff. For the classes Homo-V
th
+G and Hetero-V
th
+Gwith power-
gating, we assume the following fixed sleep transistor size: 210X PMOS for a logic
block, 10X PMOS for a switch buffer, and 1X PMOS for a connection buffer. We then
discuss the sleep transistor tuning in Section 2.4.4.
Table 2.8 summarizes the minimumdelay and minimumenergy hyper-architectures
for each class. The minimum delay hyper-architectures have cluster size of 6 and LUT
size of 7 which are the same for all classes. The minimum energy hyper-architectures
have LUT size of 4 for all classes. This is similar to the previous evaluation result
[101, 89]. As expected, the min-delay hyper-architectures have the highest V
dd
and
36
lowest V
th
. However, the min-energy hyper-architectures have the lowest V
dd
but not
the highest V
th
. This is because we assume that each circuit works at its highest pos-
sible frequency (1/critical path delay). The energy is calculated as E = delay · power.
When V
th
is too high, the delay is so large that the energy per clock cycle increases.
Hyper- Minimum delay hyper-architecture Minimum energy hyper-architecture
Arch V
dd
CV
t
IV
th
(N,K) E D V
dd
CV
t
IV
th
(N,K) E D
Class (V) (V) (V) (nJ) (ns) (V) (V) (V) (nJ) (ns)
Homo-V
th
1.1 0.20 0.20 (6,7) 31.22 8.86 0.8 0.35 0.35 (10,4) 0.942 59.2
Hetero-V
th
1.1 0.20 0.20 (6,7) 31.22 8.86 0.8 0.30 0.35 (12,4) 0.920 43.6
Homo-V
th
+G 1.1 0.20 0.20 (6,7) 15.98 9.45 0.8 0.30 0.30 (12,4) 0.550 30.5
Hetero-V
th
+G 1.1 0.20 0.20 (6,7) 15.98 9.45 0.8 0.30 0.25 (10,4) 0.549 24.3
Table 2.8: Minimum delay and minimum energy hyper-architectures.
Usually, higher performance hyper-architectures consume more energy. To illus-
trate the energy and delay tradeoff, we present the dominant hyper-architectures of
Homo-V
th
and Hetero-V
th
in Figure 2.7 (a) and those of Homo-V
th
+G and Hetero-
V
th
+G in Figure 2.7 (b). From the figure, we find that the energy difference between
Homo-V
th
+G and Hetero-V
th
+G is smaller than that between Homo-V
th
and Hetero-
V
th
. This is because leakage power is significantly reduced by power-gating and there-
fore more detailed V
th
tuning such as heterogeneous-V
th
has a smaller impact. We also
see that dominant hyper-architectures of Homo-V
th
+G and Hetero-V
th
+G have smaller
delay than that of Homo-V
th
and Hetero-V
th
. This is due to the fact that the connection
box with power gating has smaller delay than that without power gating as discussed
in Section 2.2.2. Moreover, with the dominant hyper-architecture figure, we can obtain
the minimum energy solution for a given performance range. For example, if we want
to find the minimumenergy solution for Homo-V
th
with delay limit 15ns, we only need
to pick the dominant hyper-architecture whose delay is closest to 15ns.
In order to achieve the best energy and delay tradeoff, we find the hyper-architectures
with the minimum energy delay product (in short min-ED) in Table 2.9. Compared to
37
the baseline, ED reduction is 14.5% and 18.4% for Homo-V
th
and Hetero-V
th
, respec-
tively. If power gating is applied, ED can be reduced by about 60% for both Homo-
V
th
+G and Hetero-V
th
+G. The similar ED reduction for power gating classes is due to
the fact that leakage power is greatly reduced by power-gating and therefore the more
detailed V
th
tuning such as heterogeneous-V
th
has a small impact on power reduction
as discussed before. We also see that, compared to the min-ED hyper-architectures
without power gating, the min-ED hyper-architectures with power gating has a lower
V
th
. This is because leakage power is greatly reduced when power gating is applied,
therefore a lower V
th
can improve performance without much penalty on leakage.
hyper-architecture.Class V
dd
(V) CV
t
(V) IV
th
(V) (N, k) E(nJ) D (ns) ED (nJ· ns) Area %
Baseline 0.9 0.30 0.30 (8,4) 1.20 23.5 28.2 100.00
Homo-V
th
1.0 0.30 0.30 (10,4) 1.37 17.5 24.1 (-14.5%) 81.90
Hetero-V
th
0.9 0.25 0.30 (12,4) 1.27 18.1 23.0 (-18.4%) 79.52
Homo-V
th
+G 0.9 0.25 0.25 (12,4) 0.74 16.10 11.9 (-57.8%) 127.43
Hetero-V
th
+G 0.8 0.25 0.20 (10,4) 0.65 17.30 11.2 (-60.3%) 126.19
Table 2.9: Comparison between baseline and min-ED hyper-architecture in Homo-V
th
,
Hetero-V
th
, Homo-V
th
+G, and Hetero-V
th
+G.
2.4.4 ED and Area Tradeoff
In the previous sections, we assume fixed sleep transistor sizes for Homo-V
th
and
Hetero-V
th
+G and discuss hyper-architecture evaluation to minimize ED without con-
sidering area. However, area is important for FPGA design. Power-gating using sleep
transistors may change delay and area tradeoff for FPGA architecture. Usually, the
larger the sleep transistor size, the smaller the delay is. In this section, we perform
device and architecture co-optimization to achieve the best ED and area tradeoff. Al-
though dual-V
th
may change the layout area due to extra diffusion well area, such
change depends on technology and is often very small. Therefore, in this section, we
38
assume that V
dd
and V
th
change does not affect area.
For Homo-V
th
and Hetero-V
th
, because no power gating is applied, we do not need
to tune sleep transistor size. To achieve the best ED-area tradeoff, we find out the
hyper-architectures with minimum product of energy, delay, and area (in short AED),
which are summarized in Table 2.10. Compared to the baseline, the min-AED hyper-
architecture of Homo-V
th
reduces ED by 14.7% and area by 18.3%,
7
and the min-AED
hyper-architecture of Hetero-V
th
reduces ED by 18.4% and area by 23.3%. Figure 2.8
presents the chip-level ED and area tradeoff. We prune inferior solutions with both
ED and area larger than any alternative solutions. From the figure, we see that, for the
classes without power gating, the min-ED hyper-architecture is exactly the min-AED
hyper-architecture. This is because that when no power gating is applied, the larger the
area, the more leakage power is consumed. Therefore, the min-ED hyper-architectures
use less area than other hyper-architectures.
V
dd
(V) CV
t
(V) IV
th
(V) (N,K) S ED Area AED AED reduction %
Baseline 0.9 0.30 0.30 (8,4) - 1 1 - -
Homo-V
th
1.0 0.30 0.30 (10,4) - 0.853 0.817 0.697 30.3
Hetero-V
th
0.9 0.25 0.30 (12,4) - 0.816 0.767 0.626 37.4
Homo-V
th
+G 0.9 0.25 0.25 (12,4) 2 0.470 0.918 0.432 56.9
Hetero-V
th
+G 0.9 0.25 0.20 (12,4) 2 0.450 0.918 0.413 58.7
Table 2.10: Minimum ED-area product hyper-architectures for different classes. ED,
Area, and ED-area product are normalized with respect to the baseline.
For Homo-V
th
+G and Hetero-V
th
+G with power gating, the sleep transistor size
has to be considered. Because only one sleep transistor is used for one logic block,
as illustrated in Figure 2.2(d), we assume the 210X PMOS for the sleep transistor
with negligible area overhead. Moreover, we observe that a 1X PMOS as the sleep
7
In our evluation, we assume that the area depends only on architecture and but not device setting.
Our experimental result shows that N = 12, K = 4 is most area efficient architecture, this architecture
reduces area by 23.6% compared to the baseline (N =8, K =4). The architecture N =10, K =4 is the
second area efficient architecture which reduces area by 18.3% compared to the baseline.
39
transistor for one switch in connection box (see the circuit in Figure 2.2(c)) provides
good performance, and any further increase of the sleep transistor size cannot improve
the performance much.
The sleep transistors for the switches in the routing box, however, may affect delay
greatly. We consider four sleep transistor sizes: 2X, 4X, 7X, and 10X PMOS for a rout-
ing switch. From the Figure 2.8, we find that device and architecture co-optimization
can reduce ED and area simultaneously even when power gating is applied. This
is due to the fact that tuning device setting and architecture offers a bigger solution
space to explore in chip level.
8
Table 2.10 summarizes the minimum AED prod-
uct hyper-architectures. Compared to the baseline case, the minimum AED product
hyper-architecture of Homo-V
th
+G reduces ED by 53.0% and area by 8.2% and the
minimumAED product hyper-architecture in Hetero-V
th
+G reduces ED by 55.0% and
area by 8.2%.
2.4.5 Impact of Utilization Rate
In the previous part of this section, we assume fixed utilization rate (0.5). In this
subsection, we will further discuss the impact of utilization rate on FPGA architec-
ture evaluation. We compare the min-ED and min-AED hyper-architectures under
three different utilization rates: 0.3, 0.5, and 0.8 in Tables 2.11 and 2.12. We see that
the min-ED and min-AED hyper-architectures under different utilization rates are the
same. Therefore, we conclude that utilization rate in practice does not affect hyper-
architecture evaluation. We guess that the reason why the utilization rate does not
affect hyper-architecture evaluation is as follows: In our models the performance and
8
As discussed before, the most area efficient architectures (N = 12 and K = 4) reduces area by
about 20% compared to the baseline. The area increase introduced by power gating leads to only about
15% area increase. Therefore, for the minimum AED hyper-architecture of the power gating classes
consumes less area compared to the baseline.
40
dynamic power of the circuit under different utilization rate is the same,
9
and the only
difference is leakage power. For the classes with power gating, the leakage power is de-
termined by the active elements and the leakage power of the unused circuit elements is
negligible (as we can see from Table 2.11, in classes Homo-V
th
+G and Hetero-V
th
+G,
ED under different utilization rate is very close). When no power gating is applied,
for given benchmark circuits, leakage power is the dominant power component and
changes inversely proportional to the utilization rate. Therefore the relationship be-
tween the leakage power of different hyper-architectures does not change with respect
to the change of the utilization rate.
Utilization 0.3 0.5 0.8
rate V
dd
CV
th
IV
th
(N, k) ED V
dd
CV
th
IV
th
(N, k) ED V
dd
CV
th
IV
th
(N, k) ED
Homo-V
th
1.0 0.30 0.30 (10,4) 32.0 1.0 0.30 0.30 (10,4) 24.1 1.0 0.30 0.30 (10,4) 19.4
Hetero-V
th
0.9 0.25 0.30 (12,4) 31.5 0.9 0.25 0.30 (12,4) 23.0 0.9 0.25 0.30 (12,4) 18.1
Homo-V
th
+G 0.9 0.25 0.25 (12,4) 12.3 0.9 0.25 0.25 (12,4) 11.9 0.9 0.25 0.25 (12,4) 11.8
Hetero-V
th
+G 0.8 0.25 0.20 (10,4) 11.6 0.8 0.25 0.20 (10,4) 11.2 0.8 0.25 0.20 (10,4) 11.0
Table 2.11: Min-ED hyper-architecture under different utilization rates.
Utilization 0.3 0.5 0.8
rate V
dd
CV
th
IV
th
(N, k) S AED V
dd
CV
th
IV
th
(N, k) S AED V
dd
CV
th
IV
th
(N, k) S AED
Homo-V
th
1.0 0.30 0.30 (10,4) - 0.649 1.0 0.30 0.30 (10,4) - 0.697 1.0 0.30 0.30 (10,4) - 0.698
Hetero-V
th
0.9 0.25 0.30 (12,4) - 0.637 0.9 0.25 0.30 (12,4) - 0.626 0.9 0.25 0.30 (12,4) - 0.617
Homo-V
th
+G 0.9 0.25 0.25 (12,4) 2 0.330 0.9 0.25 0.25 (12,4) 2 0.432 0.9 0.25 0.25 (12,4) 2 0.533
Hetero-V
th
+G 0.8 0.25 0.20 (10,4) 2 0.324 0.8 0.25 0.20 (10,4) 2 0.413 0.8 0.25 0.20 (10,4) 2 0.510
Table 2.12: Min-AED hyper-architecture under different utilization rates. Note: AED
is normalized with respect to the baseline.
2.4.6 Impact of Interconnect Structure
In the previous discussion, we always assume all the wire segments span four logic
blocks (i.e. length-4 interconnect wires). In this section, we will compare two routing
9
We assume the placement and route is the same for different utilization rate.
41
structures to show the impact of routing structure on delay and power. We compare a
structure with uniform length-4 wire segments (in short uniform-interconnect) and a
routing structure with 60% length-4 wire segments and 40% length-8 wire segments
(in short mixed-interconnect). The mixed-interconnect is optimal for minimizing delay
[47]. Tables 2.13, 2.14, and 2.15 compare the min delay, min energy and min ED
hyper-architectures between the two routing structures, respectively. We see that the
min-delay hyper-architectures for both interconnect structures are same. The min-
energy and min-ED hyper-architectures for the two routing structure have the same the
device setting and LUT size, but mixed-interconnect tends to use smaller cluster size
as the interconnect delay is reduced in mixed-interconnect. We also see that uniform-
interconnect has lower energy but higher delay than mixed-interconnect. This is due to
the fact that mixed-interconnect applies length-8 wire-segments and therefore a buffer
with a larger size is used, which reduces delay but increases power.
uniform-interconnect
hyper-architecture V
dd
CV
th
IV
th
(N,K) Energy Delay
Class (V) (V) (V) (nJ) (ns)
Homo-V
th
1.1 0.20 0.20 (6,7) 31.22 8.86
Hetero-V
th
1.1 0.20 0.20 (6,7) 31.22 8.86
Homo-V
th
+G 1.1 0.20 0.20 (6,7) 15.98 9.45
Hetero-V
th
+G 1.1 0.20 0.20 (6,7) 15.98 9.45
mixed-interconnect
Homo-V
th
1.1 0.20 0.20 (6,7) 35.56 8.42
Hetero-V
th
1.1 0.20 0.20 (6,7) 35.56 8.42
Homo-V
th
+G 1.1 0.20 0.20 (6,7) 16.25 9.13
Hetero-V
th
+G 1.1 0.20 0.20 (6,7) 16.25 9.13
Table 2.13: Min-delay hyper-architecture under different routing structures.
42
uniform-interconnect
hyper-architecture V
dd
CV
th
IV
th
(N,K) Energy Delay
Class (V) (V) (V) (nJ) (ns)
Homo-V
th
0.8 0.35 0.35 (10,4) 0.942 59.2
Hetero-V
th
0.8 0.30 0.35 (12,4) 0.920 43.6
Homo-V
th
+G 0.8 0.30 0.30 (12,4) 0.550 30.5
Hetero-V
th
+G 0.8 0.30 0.25 (10,4) 0.549 24.3
mixed-interconnect
Homo-V
th
0.8 0.35 0.35 (8,4) 1.052 55.6
Hetero-V
th
0.8 0.35 0.35 (8,4) 1.052 55.6
Homo-V
th
+G 0.8 0.30 0.30 (12,4) 0.610 29.3
Hetero-V
th
+G 0.8 0.30 0.25 (8,4) 0.602 22.8
Table 2.14: Min-energy hyper-architecture under different routing structures.
uniform interconnect
hyper-architecture V
dd
CV
th
IV
th
(N,K) Energy Delay ED
Class (V) (V) (V) (nJ) (ns) (nJ·ns)
Homo-V
th
1.0 0.30 0.30 (10,4) 1.37 17.50 24.1
Hetero-V
th
0.9 0.25 0.30 (12,4) 1.27 18.10 23.0
Homo-V
th
+G 0.9 0.25 0.25 (12,4) 0.74 16.10 11.9
Hetero-V
th
+G 0.8 0.25 0.20 (10,4) 0.65 17.30 11.2
mixed-interconnect
Homo-V
th
1.0 0.30 0.30 (8,4) 1.51 17.00 25.7
Hetero-V
th
0.9 0.25 0.30 (12,4) 1.39 18.40 25.6
Homo-V
th
+G 0.9 0.25 0.25 (8,4) 0.82 16.40 13.4
Hetero-V
th
+G 0.8 0.25 0.20 (8,4) 0.74 16.67 12.3
Table 2.15: Min-ED hyper-architecture under different routing structures.
43
0
8
16
24
32
8 21 34 47 60
Delay (ns)
E
n
e
r
g
y

p
e
r

c
y
c
l
e

(
n
J
)
Homo-Vt
Hetero-Vt
Min-ED
hyper-arch for
Homo-Vt
Min-ED
hyper-arch
for Hetero-Vt
Min delay
hyper-arch
for Homo-Vt
Min energy
hyper-arch
for Homo-Vt
Min energy
hyper-arch
for Hetero-Vt
(a)
0
3
6
9
12
15
18
8 13 18 23 28 33
Delay (ns)
E
n
e
r
g
y

p
e
r

c
y
c
l
e

(
n
J
)
Homo-Vt+G
Hetero-Vt+G
㋏߫5
㋏߫6
㋏߫7
Min-ED
hyper-arch for
Homo-Vt+G
Min-ED
hyper-arch
for Hetero-
Min delay hyper-
arch for Homo-
Vt+G and Hetero-
Min energy
hyper-arch for
Homo-Vt
Min energy
hyper-arch for
Hetero-Vt+G
(b)
Figure 2.7: Dominant hyper-architectures. (a) Homo-V
th
and Hetero-V
th
; (b)
Homo-V
th
+G and Hetero-V
th
+G.
44
0.7
0.9
1.1
1.3
0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
Normalized ED
N
o
r
m
a
l
i
z
e
d

a
r
e
a
Homo-Vt
Hetero-Vt
Homo-Vt+G
Hetero-Vt+G
Min AED
hyper-arch for
Homo-Vt+G
Min AED
hyper-arch for
Hetero-Vt+G
Min AED
hyper-arch for
Hetero-Vt
Min AED
hyper-arch for
Homo-Vt
Figure 2.8: ED and area trade-off. ED and area are normalized with respect to the
baseline (N = 8, K = 4, V
dd
= 0.9V, and V
th
= 0.3V).
45
2.5 Conclusions
In this chapter, we have developed trace-based power and performance evaluation
(Ptrace) for FPGA. The one-time use of placement, routing and cycle-accurate power
simulation is applied to collect the timing and power trace for a given benchmark set
and a given FPGA architecture. The trace can then be re-used to calculate timing and
power via closed-form formulae for different device parameters and technology scal-
ing. Ptrace is much faster, yet accurate compared to the conventional evaluation based
on placement and routing by VPR [18] followed by cycle-accurate simulation (Psim)
[89].
Using the trace-based estimation, we have performed device (V
dd
, V
th
and sleep
transistor size if power gating is applied) and architecture (cluster and LUT size) co-
optimizations for low power FPGAs. We assume the ITRS [66] 70nm technology
and use the following baseline for comparison: Cluster size of 8 and LUT size of 4
as in the Xilinx Virtex-II [170], V
dd
of 0.9v suggested by ITRS, V
th
of 0.3v which is
optimized for min-ED (i.e., minimum energy delay product) with respect to the above
architecture and V
dd
. Compared to the baseline case, simultaneous optimization of
FPGA architecture and device reduces the min-ED by 14.7% and area by 18.3% for
FPGA using homogeneous-V
th
for the logic blocks and interconnects without power
gating. Optimizing V
th
separately (i.e., heterogeneous-V
th
) for the logic block and
interconnect reduces min-EDby 18.4%and area by 23.3%. Furthermore, power gating
unused logic and interconnect reduces the min-ED by up to 55.0% and reduces area by
8.2%. Compared to the classes without power gating, the min-ED hyper-architectures
of the classes with power gating have lower V
th
. This is due to the fact that, when
power gating is applied, leakage power is significantly reduced and therefore a lower
V
th
can be applied to reduce delay.
We observe that min-ED hyper-architectures and min-AED hyper-architectures for
46
different utilization rate (between 30% and 80%) are the same. Therefore, the utiliza-
tion rate in practice does not affect the device and architecture co-optimization result.
Moreover, we also test two different routing structures, one with uniform length 4 wire
segments (uniform-interconnect) and the other with 60% length 4 wire segments and
40% length 8 wire segments (mixed-interconnect). We observe that the min-ED, min-
energy and min-delay hyper-architectures under two different interconnect structures
are similar, except that mix-interconnect tends to use slightly smaller cluster size due
to the reduced interconnect delay.
47
CHAPTER 3
FPGA Device and Architecture Co-Optimization
Considering Process Variation
Process variations in nanometer technologies are becoming an important issue for
cutting-edge FPGAs with a multi-million gate capacity. In this chapter, we extend
Ptrace to handle process variation. We first develop closed-form models of chip level
FPGA leakage and timing variations considering both die-to-die and within-die varia-
tions in effective channel length, threshold voltage, and gate oxide thickness. Experi-
ments show that the mean and standard deviation computed by our models are within
3% from those computed by Monte Carlo simulation. We also observe that the leakage
and delay variations can be up to 5.5X and 1.9X, respectively. We then derive analyti-
cal yield models considering both leakage and timing variations, and use such models
to evaluate FPGA device and architecture considering process variations. Compared to
the baseline, which uses the VPR architecture and device setting from ITRS roadmap,
device and architecture tuning improves leakage yield by 10.4%, timing yield by 5.7%,
and leakage and timing combined yield by 9.4%. We also observe that LUT size of 4
gives the highest leakage yield, LUT size of 7 gives the highest timing yield, but LUT
size of 5 achieves the maximum leakage and timing combined yield.
48
3.1 Introduction
In the previous chapter, we have introduced a time efficient trace-based power and
delay estimator for FPGA circuit. However, such estimator is only for determinis-
tic values and does not consider process variation. Modern VLSI designs see a large
impact from process variation as devices scale down to nanometer technologies. Vari-
ability in effective channel length, threshold voltage, and gate oxide thickness incurs
uncertainties in both chip performance and power consumption. For example, mea-
sured variation in chip-level leakage can be as high as 20X compared to the nominal
value for high performance microprocessors [25]. In addition to meeting the perfor-
mance constraint under timing variation, dice with excessively large leakage due to
such a high variation have to be rejected to meet the given power budget.
Recent work has studied parametric yield estimation for both timing and leakage
power. Statistical timing analysis considering path correlation was studied in [120]
[86, 178] and [180, 33] further introduced non-Gaussian variation and non-linear vari-
ation models. Timing yield estimation was discussed in [57, 53] and [127] purposed a
methodology to improve timing yield. As device scale down, leakage power becomes a
significant component of total power consumption and it is greatly affected by process
variation. [129, 187, 152] studied the parametric yield considering both leakage and
timing variations. Power minimization by gate sizing and threshold voltage assign-
ment under timing yield constrains were studied in [114]. However, all these studies
only focus on ASIC and do not consider FPGA.
FPGA has a great deal of regularity, therefore process variation may have smaller
impact on FPGAs than on ASICs. Yet the parametric yield for FPGAs still should be
studied. Some recent works [43, 100, 115, 136] has studied the impact of process vari-
ation and presented various statistical optimization methods. However, architecture
and device co-optimization considering process variation has not be studied yet.
49
In this chapter,we first develop closed-form formulas of chip level leakage and
timing variations considering both die-to-die and within-die variations. Based on such
model, we extend the Ptrace discussed in Chapter 2 to estimate the power and delay
variation of FPGAs. Experiments showthat the mean and standard deviation computed
by our models are within 3% from those copmuted by Monte Carlo simulation. We
also observe that the leakage and delay variations can be up to 5.5X and 1.5X, respec-
tively. With the extended Ptrace, we perform FPGA device and architecture evaluation
considering process variations. The evaluation requires the exploration of the follow-
ing dimensions: cluster size N, LUT size K,
1
supply voltage V
dd
, and threshold volt-
age V
t
. We defined the combinations of the above parameters as hyper-architecture.
For comparison, we obtain the baseline FPGA hyper-architecture which uses the VPR
architecture model [18] and the same LUT size and Cluster size as the commercial
FPGAs used by Xilinx Virtex-II [170], and device setting from ITRS roadmap[66].
Compared to the baseline, device and architecture tuning improves leakage yield by
4.8%, timing yield by 12.4%, and leakage and timing combined yield by 9.2%. We
also observe that LUT size of 4 gives the highest leakage yield, LUT size of 7 gives
the highest timing yield, but LUT size of 5 achieves the maximum leakage and timing
combined yield.
The rest of the chapter is organized as follows: Section 3.2 derives closed-form
models for leakage and delay variations and develops the leakage and timing yield
models. Section 3.3 perform device and architecture evaluation to improve yield rate.
Finally, Section 3.4 concludes the chapter.
1
In this chapter, N refers to cluster size and K refers to LUT size
50
3.2 Delay and Leakage Variation Model
In this chapter, we consider the variation in gate channel length (L
gate
), threshold volt-
age (V
th
), and gate oxide thickness (T
ox
). According to [190], spatial correlation is not
significant. Therefore, in this chapter, we assume each variation source is decomposed
into global (inter-die) variation and local (intra-die) variation as follows:
L = L
g
+L
l
(3.1)
V =V
g
+V
l
(3.2)
T = T
g
+T
l
(3.3)
where L, V, and T are variations of L
gate
, V
th
, and T
ox
respectively, L
g
, V
g
, and T
g
are
inter-die variations, and L
l
, V
l
, and T
l
are intra-die variations. In the rest of this chapter,
we assume both inter-die (L
g
, V
g
, and T
g
) and intra-die (L
l
, V
l
, and T
l
) variations are
normal random variables. And we also assume that inter-die variation and intra-die
variation are independent, and all variation sources are also independent.
3.2.0.1 Leakage under Variation
We extend the leakage model in FPGA power and delay estimation framework Ptrace
[38] to consider variations. In Ptrace, the total leakage current of an FPGA chip is
calculated as follows:
I
chi p
=

i
N
t
i
· I
i
(3.4)
where N
t
i
is the number of FPGA circuit elements of resource type i, i.e., an intercon-
nect switch, buffer, LUT, configuration SRAM cell, or flip-flop, and I
i
is the leakage
current of a type i circuit element. Different sizes of interconnect switches and buffers
are considered as different circuit elements.
The leakage current I
i
of a type i circuit element is the sum of the sub-threshold
51
and gate leakages:
I
i
= I
sub
+I
gate
(3.5)
Variation in I
sub
mainly sources fromvariation in L
gate
andV
th
. Variation in I
gate
mainly
sources from variation in T
ox
.
Different from [129] which models sub-threshold leakage and gate leakage sepa-
rately, we model the total leakage current I
i
of circuit element in resource type i as
follows:
I
i
= I
n
(i) · e
f
Li
(L)
· e
f
Vi
(V)
· e
f
Ti
(T)
(3.6)
where I
n
(i) is the nominal value of the leakage current of type i circuit element and f
is the function that represents the impact of each type of process variation on leakage.
The dependency between these functions has been shown to be negligible in [129].
From MASTAR4 model [68], we find that it is sufficient to express these functions as
simple linear functions as follows:
f
Li
(L) = −c
i1
· L (3.7)
f
Vi
(V) = −c
i2
· V (3.8)
f
Ti
(T) = −c
i3
· T (3.9)
where c
i1
, c
i2
, c
i3
are fitting parameters obtained from MASTAR4 model. The negative
sign in the exponent indicates that the transistors with shorter channel length, lower
threshold voltage, and smaller oxide thickness lead to higher leakage current. We
rewrite (3.6) as follows by decomposing L, V and T in to intra-die (L
l
,V
l
, T
l
) and inter-
die (L
g
,V
g
, T
g
) components.
I
i
= I
n
(i) · e
−(c
i1
L
g
+c
i2
V
g
+c
i3
T
g
)
· e
−(c
i1
L
l
+c
i2
V
l
+c
i3
T
l
)
(3.10)
To extend the leakage model (3.4) under variations, we assume that each element
has unique intra-die variations but all elements in one die share the same inter-die
52
variations. Both inter-die and intra-die variations are modeled as normal random vari-
ables. The leakage distribution of a circuit element is a lognormal distribution. The
total leakage is the sum of all lognormals. The state-of-the-art FPGA chip usually has
a large number of circuit elements, therefore the relative random variance of the total
leakage due to intra-die variation approaches zero. Similar to [129], for given inter-die
variations, we apply the Central Limit Theorem and use the sum of mean to approx-
imate the total leakage current. After integration, we can write the expression of the
chip-level leakage as the follows:
I
chi p


i
N
t
i
· E[I
i
|L
g
,V
g
, T
g
]
=

i
N
t
i
S
i
I
L
g
,V
g
,T
g
(i) (3.11)
S
i
= e
(c
i1
σ
L
l
2
+c
i2
σ
V
l
2
+c
i3
σ
T
l
2
)/2
(3.12)
I
L
g
,V
g
,T
g
(i) = I
n
(i)e
−(c
i1
L
g
+c
i2
V
g
+c
i3
T
g
)
(3.13)
where S
i
is the scale factor introduced by intra-die variability in L,V, and T. I
L
g
,V
g
,T
g
(i)
is the leakage as a function of inter-die variations. σ
L
l
, σ
V
l
and σ
T
l
are the variances of
L
l
, V
l
, and T
l
, respectively.
3.2.0.2 Leakage Yield
From (3.11), (3.12), and (3.13), we can see that the chip leakage current is a sum of
log-normal random variables and it can be expressed as follows,
I
chi p
=

i
X
i
(3.14)
X
i
∼Lognormal(log(A
i
), ((c
i1
σ
L
g
)
2
+(c
i2
σ
V
g
)
2
+(c
i3
σ
T
g
)
2
)) (3.15)
A
i
= N
i
I
n
(i) (3.16)
Same as [129], we model I
chi p
, the sum of the lognormal variables X
i
, as another log-
normal random variable. The lognormal variable X
i
shares the same random variables
53
σ
L
g
, σ
V
g
, and σ
T
g
, and therefore these variables are dependent of each other. Consider-
ing the dependency, we calculate the mean and variance of the new lognormal I
chi p
as
follows,
µ
I
chip
=

i
{exp[log(A
i
) +
(c
i1
σ
L
g
)
2
2
+
(c
i2
σ
V
g
)
2
2
+
(c
i3
σ
T
g
)
2
2
]} (3.17)
σ
2
I
chip
=

i
{exp[2log(A
i
) +(c
i1
σ
L
g
)
2
+(c
i2
σ
V
g
)
2
+(c
i3
σ
T
g
)
2
]
·[exp(c
ii
2
σ
2
L
g
+c
i2
2
σ
2
V
g
+c
2
i3
σ
2
T
g
) −1]}+

i, j
2COV(X
i
, X
j
) (3.18)
where the mean of I
chi p
, µ
I
chip
, is the sum of means of X
i
and the variance of I
chi p
, σ
I
chip
,
is the sum of variance of X
i
and the covariance of each pair of X
i
. The covariance is
calculated as follows,
COV(X
i
, X
j
) = E[X
i
X
j
] −E[X
i
]E[X
j
] (3.19)
E[X
i
X
j
] = exp[log(A
i
A
j
) +
(c
i1
+c
j2
)
2
σ
L
g
2
2
+
(c
i2
+c
j2
)
2
σ
V
g
2
2
+
(c
i3
+c
j3
)
2
σ
T
g
2
2
] (3.20)
E[X
i
] = exp[log(A
i
) +
(c
i1
σ
L
g
)
2
2
+
(c
i2
σ
V
g
)
2
2
+
(c
i3
σ
T
g
)
2
2
] (3.21)
We then use the method from [129] to obtain the mean and variance (µ
N,I
chip
, σ
N,I
chip
2
)
of the normal randomvariable corresponding to the lognormal I
chi p
. As the exponential
function that relates the lognormal variable I
chi p
with the normal variable I
N,chi p
is a
monotone increasing function, the CDF of I
chi p
can be expressed as follows using the
standard expression for the CDF of a lognormal random variable,
µ
N,I
chip
=
log[µ
I
chip
4
/(µ
I
chip
2

I
chip
2
)]
2
(3.22)
σ
N,I
chip
2
= log[1+(σ
I
chip
2

I
chip
2
)] (3.23)
CDF(I
chi p
) =
1
2
[1+er f (
log(I
chi p
) −µ
N,I
chip


N,I
chip
)] (3.24)
where er f (·) is the error function. Given a leakage limit I
cut
for I
chi p
,
Y
leak
=CDF(I
cut
) ×100% (3.25)
54
gives the leakage yield rate Y
leak
(I
cut
|L
g
), i.e., the percentage of FPGA chips that is
smaller than I
cut
.
3.2.0.3 Timing under Variation
The performance depends on L
gate
, V
th
, and T
ox
, but its variation is primarily affected
by L
gate
andV
th
variation [129]. Belowwe extend the delay model in Ptrace to consider
inter-die and intra-die variations of L
gate
. In Ptrace, the path delay is calculated as
follows:
D =

i
d
i
(3.26)
where d
i
is the delay of the i
th
circuit element in the path. Considering process varia-
tion,the path delay is calculated as follows:
D =

i
d
i
(L
g
, L
l
,V
g
,V
l
) (3.27)
For circuit element i in the path, d
i
(L
g
, L
l
,V
G
,V
l
) is the delay considering inter-die
variation L
g
, V
g
and intra-die variation L
l
, V
l
. L
g
and V
g
the same for all the circuit
elements in the critical path. Given L
g
and V
g
, we evenly sample a few (eleven in this
chapter) points within range of [L
g
−3σ
L
l
, L
g
+3σ
L
l
]. We then use circuit level delay
model in [39] to obtain the delay for each circuit element with these variations. As
the delay monotonically decreases when L
gate
and V
th
increase, we can directly map
the probability of a channel length to the probability of a delay and obtain the delay
distribution of a circuit element. We assume that the intra-die channel length and
threshold voltage variation of each element is independent from each other. Therefore,
we can obtain the PDF (probability density function) of the critical path delay for a
given L
g
and V
g
as follows by convolution operation,
PDF(D|L
g
,V
g
) = PDF(d
1
|L
g
,V
g
) ⊗PDF(d
2
|L
g
,V
g
) ⊗· · ·
⊗PDF(d
i
|L
g
,V
g
) ⊗· · · ⊗PDF(d
n
|L
g
,V
g
) (3.28)
55
3.2.0.4 Timing Yield
The timing yield is calculated on a bin-by-bin basis where each bin corresponds to a
specific value L
g
and V
g
. We further consider intra-die variation of channel length in
timing yield analysis. Given the inter-die channel length variation L
g
, and threshold
voltage variation V
g
, (3.28) gives the PDF of the critical path delay D of the circuit. We
can obtain the CDF of delay, CDF(D|L
g
,V
g
), by integrating PDF(D|L
g
,V
g
). Given a
cutoff delay (D
cut
), CDF(D
cut
|L
g
) gives the probability that the path delay is smaller
than D
cut
considering L
gate
and V
th
variations. However, it is not sufficient to only
analyze the original critical path in the absence of process variations. The close-to-be
critical paths may become critical considering variations and an FPGA chip that meets
the performance requirement should have the delay of all paths no greater than D
cut
.
We assume that for a given L
g
the delay of each path is independent and we can
calculate the timing yield as follows,
Y
per f
(D
cut
|L
g
,V
g
) =
n

i=1
CDF
i
(D
cut
|L
g
,V
g
) (3.29)
where CDF
i
(D
cut
|L
g
,V
g
) gives the probability that the delay of the i
th
longest path
is no greater than D
cut
. In this chapter, we only consider the ten longest paths, i.e.,
n = 10 because the simulation result shows that the ten longest paths have already
covered all the paths with a delay larger than 75% of the critical path delay under the
nominal condition. We then integrate Y
per f
(D
cut
|L
g
,V
g
) over L
g
and V
g
to calculate the
performance yield Y
per f
as follows,
Y
per f
=
Z Z
+∞
−∞
PDF(L
g
)PDF(V
g
) · Y
per f
(D
cut
|L
g
,V
g
) · dL
g
dV
g
(3.30)
3.2.0.5 Leakage and Timing Combined Yield
To analyze the yield of a lot, we need to consider both leakage and delay limit. In order
to compute the leakage and delay combined yield, we first need to calculate the leakage
56
yield for a given inter-die variation of gate channel length L
g
and threshold voltage
V
g
, Y
leak|L
g
,V
g
. Similar to Section 3.2.0.2, we first calculate the mean and variance of
leakage current for given L
g
and V
g
,
µ
I
chip|L
g
,V
g
=

i
{exp[log(
¯
A
i|L
g
,V
g
) +
(c
i3
σ
T
g
)
2
2
]} (3.31)
σ
2
I
chip|L
g
,V
g
=

i
{exp[2log(
¯
A
i|L
g
,V
g
) +(c
i3
σ
T
g
)
2
]
·[exp(c
2
i3
σ
2
T
g
) −1]}+

i, j
2COV(
¯
X
i|L
g
,V
g
,
¯
X
j|L
g
,V
g
)
(3.32)
where
¯
A
i|L
g
,V
g
= A
i
· exp(−c
i1
L
g
−c
i2
V
g
) (3.33)
¯
X
i|L
g
,V
g
∼ Lognormal(
¯
A
i|L
g
,V
g
, (c
i3
σ
T
g
)
2
) (3.34)
Similar to X
i
’s, the covariance between
¯
X
i|L
g
,V
g
’s are computed as:
COV(
¯
X
i|L
g
,V
g
,
¯
X
j|L
g
,V
g
) = E[
¯
X
i|L
g
,V
g
·
¯
X
j|L
g
,V
g
] −E[
¯
X
i|L
g
,V
g
]E[
¯
X
j|L
g
,V
g
] (3.35)
E[
¯
X
i|L
g
,V
g
·
¯
X
j|L
g
,V
g
] = exp[log(
¯
A
i|L
g
,V
g
·
¯
A
j|L
g
,V
g
) +
(c
i3
+c
j3
)
2
σ
T
g
2
2
] (3.36)
E[X
i
] = exp[log(
¯
A
i|L
g
,V
g
) +
(c
i3
σ
T
g
)
2
2
] (3.37)
Finally, the CDF of leakage current for given L
g
and V
g
, I
leak|L
g
,V
g
, is calculated as:
µ
N,I
chip|L
g
,V
g
=
log[µ
4
I
chip|L
g
,V
g
/(µ
2
I
chip|L
g
,V
g

2
I
chip|L
g
,V
g
)]
2
(3.38)
σ
N,I
chip|L
g
,V
g
2
= log[1+(σ
2
I
chip|L
g
,V
g

2
I
chip|L
g
,V
g
)] (3.39)
CDF(I
chi p
|L
g
,V
g
) =
1
2
[1+er f (
log(I
chi p
) −µ
N,I
chip|L
g
,V
g


N,I
chip
)] (3.40)
With the CDF of I
leak|L
g
,V
g
, it is easy to compute the leakage yield for given L
g
and
V
g
,
Y
leak|L
g
,V
g
=CDF(I
cut
|L
g
,V
g
) ×100% (3.41)
57
MC sim Our model
Y
leak
% Y
per f
% Y
com
% Y
leak
% Y
per f
% Y
com
%
90.2 74.1 64.4 89.3 (-0.9) 72.6 (-1.5) 62.1 (-2.1)
Table 3.1: Verification of yield model.
Because for given a specific inter-die variation of channel length L
g
and threshold
voltage variation V
g
, the leakage variability only depends on the variability of random
variable T
g
as shown in (3.31), and the timing variability only depends on the variabil-
ity of random variable L
l
and V
l
as shown in (3.29). Therefore, we assume that the
leakage yield and timing yield are independent of each other for given L
g
and V
g
. The
yield considering the imposed leakage and timing limit can be calculated as follows,
Y
com
=
Z Z
+∞
−∞
PDF(L
g
)PDF(V
g
)Y
leak
(I
cut
|L
g
,V
g
)Y
per f
(D
cut
|L
g
,V
g
) · dL
g
dV
g
(3.42)
3.2.0.6 Verification of Yield Model
In this section, we verify our yield model by comparing it to 10,000 sample Monte-
Carlo simulation. In our experiment, we assume 70nm ITRS technology and use de-
vice and architecture same as the baseline hyper-architecture as in Section 2.4. We also
assume that all 20 MCNC benchmarks are put into one FPGA chip. The cut of leak-
age power is 2X of the nominal value and the cut of delay is 1.1X of nominal value.
Table 3.1 compares the yield estimated from our model and that from the Monte-Carlo
simulation. From the table, we see that our yield model is within 3% error compared
to the Monte-Carlo simulation.
3.3 Hyper-Architecture Evaluation Considering Process Variation
In this section, we use our yield model to perform device and architecture evaluation
for leakage and delay yield optimization. We consider ITRS 70nm technology and
58
N K W V
th
(V) V
dd
(V)
Evaluation range 4,5,6,7 6,8,10,12 4 0.2 0.4 0.8 1.1
Baseline 8 4 4 0.3 0.9
Source Distribution 3σ
g

l
L
gate
Normal 5.0% 3.0%
V
th
Normal 2.5% 1.9%
T
ox
Normal 2.5% 1.9%
Table 3.2: Experimental setting.
change V
dd
and V
th
around such setting. For architecture, we consider LUT size K from
4 to 7, and cluster size N from 6 to 12. For interconnect, we assume that all the global
routing track (W) span 4 logic blocks with all buffer switch box. In the experiment, we
assume that all 20 MCNC benchmarks are put into one chip and obtain the longest 10
critical paths from them. For comparison, we use the same baseline hyper-architecture
as in Section 2.4. For process variation, we assume that all the variation sources has
normal distribution. For all variation sources, we assume that the 3σ value of the inter-
die variation is 10% of nominal value, and the 3σ value of the intra-die variation is 5%
of nominal value.
3.3.1 Impact of Process Variation
In this section, we analyze the impact of process variation on FPGA leakage power
and delay. Figure 3.1 illustrates the leakage and delay variation from Monte-Carlo
simulation for the baseline hyper-arch. In the figure, each dot is sample of Monte-
Carlo simulation. From the figure, we can see with process variation, the range of
leakage power is up to 5.5X and the range of delay variation is up to 1.5X.
59
0.8 0.85 0.9 0.95 1 1.05 1.1 1.15 1.2
0
0.5
1
1.5
2
2.5
3
3.5
4
4.5
5
Normalized Delay
N
o
r
m
a
l
i
z
e
d

L
e
a
k
a
g
e

P
o
w
e
r
Cutoff delay
Cutoff power
1.5X
5.5X
Figure 3.1: Leakage and delay of baseline architecture hyper-arch.
3.3.2 Impact of Device and Architecture Tuning
In this section, we perform device and architecture evaluation to optimize the delay
and leakage power yield for two FPGA classes, Homo-V
th
and Hetero-V
th
as defined
in in Section 2.4. In the rest of this section, we assume that the cutoff leakage power
is 2X of the nominal value and the cutoff delay is 1.1X of the nominal value, as shown
in Figure 3.1.
3.3.2.1 Leakage Yield
We first optimize leakage yield. Table 3.3 illustrates the hyper-archs with maximum
leakage yield for both two classes.
In the table, CV
th
refers to the V
th
of logic blocks and IV
th
refers to the V
th
of
interconnect. Notice that for Homo-V
th
, CV
th
= IV
th
. From the table, we see that
Homo-L
gate
and Hetero-L
gate
give the same maximum leakage yield result. That is,
the optimum L
gate
for logic blocks and interconnect is the same.
This is because the larger L
gate
gives better leakage yield, both logic block and
interconnect using largest L
gate
(33nm) results in optimum leakage yield. Moreover,
60
N K CV
th
(V) IV
th
(V) V
dd
(V) Y
leak
% Y
per f
% Y
com
%
Baseline 8 4 0.3 0.3 0.9 89.5 85.5 67.1
Homo-V
th
10 4 0.4 0.4 0.9 94.3(+4.8) 79.3 69.1
Hetero-V
th
10 4 0.4 0.4 0.9 94.3(+4.8) 79.3 69.1
Table 3.3: Optimum leakage yield hyper-architecture.
N K CV
th
(V) IV
th
(V) V
dd
(V) Y
leak
% Y
per f
% Y
com
%
Baseline 8 4 0.3 0.3 0.9 89.5 85.5 67.1
Homo-V
th
6 7 0.2 0.2 1.1 63.5 97.9(+12.4) 57.9
Hetero-V
th
6 7 0.2 0.2 1.1 67.6 97.9(+12.4) 57.9
Table 3.4: Optimum Timing yield hyper-architecture.
we can also find that K = 4 gives the optimum leakage yield, which improve leakage
yield by 4.8% compared to the baseline.
3.3.2.2 Timing Yield
Secondly, we analyze the timing yield. For timing yield analysis, we only analyze the
delay of the largest MCNC benchmark clma. Similarly, the timing yield is often stud-
ied using selected test circuit such as ring oscillator for ASICin the literature. Table 3.4
illustrates the optimumtiming yield hyper-arch. Similar to leakage yield analysis, both
Homo-L
gate
and Hetero-L
gate
achieve the same hyper-arch for optimum timing yield.
The reason is similar to the leakage yield. That is the smaller V
th
gives better timing
yield, therefore both logic block and interconnect using smallest V
th
(0.2V) results in
optimum timing yield. From the table, we can also find that the optimum timing yield
hyper-arch has K = 7 and improve timing yield by 12.4% compared to the baseline.
3.3.2.3 Leakage and Timing Combined Yield
Finally, we discuss about leakage and timing combined yield. Table 3.5 illustrates the
optimum leakage and timing combined yield hyper-arch. From the table, we see that
61
N K CV
th
(V) IV
th
(V) V
dd
(V) Y
leak
% Y
per f
% Y
com
%
Baseline 8 4 0.3 0.3 0.9 89.5 85.5 67.1
Homo-V
th
10 5 0.3 0.3 1.0 87.6 86.5 75.2 (+8.1)
Hetero-V
th
8 5 0.35 0.3 1.0 91.6 84.1 76.3 (+9.2)
Table 3.5: Optimum leakage and timing combined yield hyper-architecture.
compared to the baseline, the optimum hyper-arch for Homo-V
th
improves combined
yield by 8.1%and the optimumhyper-arch for Hetero-V
th
improves the combined yield
by 9.2%. Unlike the leakage yield and timing yield analysis, Homo-V
th
and Hetero-V
th
give different result for combined yield optimization. This is because both leakage and
timing should be considered to optimize combined yield, the largest (or smallest) V
th
not necessary gives the optimum combined yield. We also find that Hetero-V
th
gives
better result than Homo-V
th
. This is because Hetero-V
th
provides larger search space.
But for both Homo-V
th
and Hetero-V
th
, K = 5 gives the best combined yield.
3.4 Conclusions
In this chapter, we have developed efficient models for chip-level leakage variation
and system timing variation in FPGAs. Experiments show that our models are within
3% from Monte Carlo simulation, and the FPGA chip level leakage and delay varia-
tions can be up to 5.5X and 1.5X, respectively. We have shown that architecture and
device tuning has a significant impact on FPGA parametric yield rate. Compared to
the baseline, the optimum hyper-architecture (combination of architecture and device
parameters) improves leakage and timing combined yield by 9.2%. In addition, LUT
size 4 has the highest leakage yield, 7 has the highest timing yield, but LUT size 5
achieves the maximum combined leakage and timing yield.
62
CHAPTER 4
FPGA Concurrent Development of Process and
Architecture Considering Process Variation and
Reliability
In Chapter 2 and Chapter 3, we develop a trace-based framework (Ptrace) to perform
device and architecture co-optimization. In this chapter, we further improve the trace-
based framework (Ptrace2) to enable concurrent process and FPGA architecture co-
development. Applying the new trace-based framework (Ptrace2), the user can tune
eight parameters for bulk CMOS processes and obtain the chip level performance and
power distribution and soft error rate (SER) considering process variations and device
aging. The Ptrace2 framework is efficient as it is based on closed-form formulas. It
is also flexible as process parameters can be customized for different FPGA elements
and no SPICE models and simulations are needed for these elements. Therefore, this
framework is suitable for early stage process and FPGA architecture co-development.
The chapter further presents a few examples to utilize the framework. We show that
applying heterogeneous gate lengths to logic and interconnect may lead to 1.3X delay
difference, 3.1X energy difference, and reduce standard deviation of leakage variation
by 87%. This offers a large room for power and delay tradeoff. We further show that
the device aging has a knee point over time, and device burn-in to reach the point
could reduce the performance change over 10 years from 8.5% to 5.5% and reduce die
to die leakage significantly. In addition, we also study the interaction between process
63
variation, device aging and SER. We observe that device aging reduces standard de-
viation of leakage by 65% over 10 years while it has relatively small impact on delay
variation. Moreover, we also find that neither device aging due to NBTI and HCI nor
process variation have significant impact on SER.
4.1 Introduction
In Chapter 2 and Chapter 3, we proposed a time efficient FPGA power and delay
evaluator Ptrace. However, Ptrace assumes that the processes are mature and stable
device models are available so that SPICE simulations can be carried out to obtain
circuit level power and delay. The assumption that FPGA architecture development
starts only after the process technology is stable may be valid in the past, but it no
longer holds as we begin to develop process and architecture concurrently in order to
shorten the time to market.
In this chapter we further extend the trace-based architecture framework (Ptrace)
discussed in the previous section to consider process parameters directly, therefore we
can conduct FPGA circuit and architecture evaluation when only the first order process
parameters are available. Such evaluation may be used to select circuits and architec-
tures less sensitive to process changes or process variations. It may further provide
inputs for process tuning, given that the FPGA is a large volume product and an FPGA
company may convince a foundry to tune process when there are large enough ben-
efits. We call the resulting framework as ptrace2. As illustrated in Figure 4.1, for
performance and power, ptrace2 calculates first electrical characteristics of advanced
CMOS transistors, then delay, leakage power, input/output capacitance for FPGA ba-
sic circuit elements, and finally the chip level performance and power based on trace
similar to that in Chapter 2 and Chapter 3. The new process variation analysis can
handle non-Gaussian variation sources which is ignored by Ptrace.
64
Trace
Cricuit Element Statistics
Critical Path Structure
Switching Activity
Transistor
Model
Chip Level
Leakage Power
Dynamic Power
Delay
Reliability
Soft Error Rate
Device Aging
Process Variation
Power Distrubution
Delay Distribution
Input Output
PTrace2
Reliability
Chip Level
Power and
Delay
Estimation
Variation
Analysis
Circuit Level
Power and
Delay
Estimation
Transistor
Electrical
Charateristics
Figure 4.1: Trace-based estimation flow.
With ptrace2, we incorporate analytical calculations for two types of FPGA relia-
bility, device aging (Negative-Bias-Temperature-Instability, NBTI [11, 149] [160, 23]
and Hot-Carrier-Injection, HCI [34, 149, 166]) and permanent soft error rate (SER)
[60], again in the from device to chip fashion. Furthermore, we illustrate how to use
this framework to improve power and performance by process and FPGA concurrent
development, and to study the interaction between process variation, device aging and
SER.
The rest of this chapter is organized as follows: Section 4.2 presents our imple-
mentation of ITRS device model. Sections 4.3, 4.4 and 4.5 introduce the circuit-
and chip-level power and delay models and chip-level variation models, respectively.
Section 4.6 presents device tuning for power and delay optimization, and Section 4.7
analyzes the reliability for FPGA. Finally, Section 4.8 studies the interaction between
reliability and process variation, and Section 4.9 concludes this chapter.
65
4.2 Device Models
In this chapter, we implement the bulk transistor device model from ITRS 2005 MAS-
TAR4 (Model for Assessment of cmoS Technologies And Roadmaps) tool [67, 68,
148]. MASTAR4 is a computing tool which calculates the electrical characteristics
of advance CMOS transistors
1
. It can handle different technologies including planar
bulk, double gate and silicon on isolator (SOI) while we only consider traditional bulk
transistor in this work. We briefly review the calculation flow in MASTAR4 below.
The calculation in MASTAR4 is based on analytical equations, which directly de-
pends on various major technological parameters including gate length (L
gate
), gate
oxide thickness (T
ox
), channel doping density (N
bulk
), channel width (W), extension
depth (X
jext
) and the total series resistance for source and drain (R
acc
). The output
electrical characteristics include the on current (I
on
), the sub-threshold leakage current
(off current, I
of f
), the gate leakage current when the channel is on/off (I
gon
/I
gof f
), the
gate capacitance (C
g
) and the drain/source diffusion capacitance (C
di f f
). Temperature
(T) and the supply voltage (V
dd
) also have a significant impact on these output char-
acteristics. There are also other inputs related to mobility, velocity and gate stack etc.,
as well as some intermediate outputs such as effective mobility u
e f f
which are used
for the final output calculation. We follow MASTAR4 tool and only tune the major
process inputs while all the other inputs can be tuned as well if needed.
Figure 4.2 shows the calculation flow for the on current I
on
. I
on
depends on all the
main inputs, i.e. L
gate
, T
ox
, N
bulk
, X
jext
, W, R
acc
, T and V
dd
. The feature intermedi-
ate parameters, the electrical (or effective) gate length (L
elec
) and the electrical gate
oxide thickness (T
ox elec
), are first calculated. The saturated threshold voltage (V
th
)
is then calculated considering the short channel effect (SCE) and drain induced bar-
1
The device model in MASTAR4 tool is a predicted model for advanced processes and there is no correspondent SPICE
models for these technologies.
66
Inputs: Lgate Tox Nbulk Xjext W Racc T Vdd
Output: Ion
Intermediate variables: Lelec Tox_elec
Intermediate variables: Vth
Intermediate variables: u_eff
Figure 4.2: The flow for on current, I
on
, calculation.
rier lowering (DIBL). The effective mobility (u
e f f
) is also calculated. With the main
input parameters, main intermediate parameters and all other parameters, I
on
is then
calculated as,
V
gt
= V
gs
−V
th
(4.1)
I
dsat0
= 0.5u
e f f
· C
ox elec
·
W
L
elec
· V
gt
· V
dsat
(4.2)
I
on
=
I
dsat0
1+
2R
acc
I
dsat0
V
gt

R
acc
I
dsat0
V
gt
+L
elec
E
c
(1+d)
(4.3)
where I
dsat0
is the on current without considering the series resistance (R
acc
), V
gs
is
the difference between the gate and source voltage levels, V
dsat
is the velocity satura-
tion voltage, E
c
is the critical electrical field and d is an intermediate parameter. The
derivation of these intermediate parameters are provided in the ITRS device model
[68].
Figure 4.3 shows the calculation flow for the sub-threshold leakage current I
of f
.
I
of f
depends on all the main inputs except for R
acc
. Similar to I
on
calculation, the
67
Inputs: Lgate Tox Nbulk Xjext W Racc T Vdd
Output: Ioff
Intermediate variables: Lelec Tox_elec
Intermediate variables: Vth Slope
Figure 4.3: The flow for sub-threshold leakage current, I
of f
, calculation.
intermediate feature parameters L
elec
and T
ox elec
, and the saturated threshold voltage
V
th
are first calculated. In order to calculate I
of f
, the sub-threshold slope (Slope) is
also calculated as,
Slope =
kT
q
ln
1
0(1+
ε
s
· T
ox elec
ε
ox
· T
dep
) (4.4)
where k is the Boltzmann constant, T is the temperature, q is electric unit, ε
s
and ε
ox
are the dielectric permittivities of silicon and oxide, respectively. With Slope, V
th
and
other parameters, we can then calculate I
of f
as,
I
of f
= I
th
W
L
elec
e
−V
th
/Slope
(4.5)
where I
th
= 0.5uA.
Inputs: Lgate Tox Nbulk Xjext W Racc T Vdd
Outputs: Igon Igoff
Figure 4.4: The flow for gate leakage currents I
gon
and I
gof f
calculation.
Figure 4.4 shows the calculation flow for the gate leakage currents when the chan-
nel is on (I
gon
) and off (I
gof f
), respectively. I
gon
and I
gof f
depend on L
gate
, T
ox
, X
jext
, W
68
Outputs: Cg Cdiff
Inputs: Lgate Tox Nbulk Xjext W Racc T Vdd
Intermediate variables: Tox_elec
Figure 4.5: The flow for transistor gate and diffusion capacitances C
g
and C
di f f
calcu-
lation.
and V
dd
. In order to calculate I
gon
and I
gof f
, we first calculate the gate leakage current
density J
g
as,
J
g
= a
1
e
a
2
V
2
g
+a
3
Vg
e
−a
4
T
ox
(4.6)
where a
1
= 1.44E5A/cm
2
, a
2
= −4.02V
−2
, a
3
= 13.05V
−1
and a
4
= 1/(1.17E −
10m). With J
g
, I
gon
and I
gof f
are then calculated as,
I
gon
= L
gate
· W · J
g
(4.7)
ΔL = L
gate
−L
elec
(4.8)
I
gof f
= 0.5ΔL· W · J
g
(4.9)
where ΔL is the difference between L
gate
and L
elec
.
Figure 4.5 shows the calculation flow for transistor gate and drain/source capaci-
tances, C
g
and C
di f f
, respectively. The capacitances depend on L
gate
, T
ox
, W, T and
V
dd
. The intermediate feature parameter T
ox elec
is first calculated for capacitance cal-
culation. C
g
and C
di f f
are calculated as,
C
g
= (
ε
ox
T
ox elec
L
gate
+C
total f ringing
+C
overlap
)W (4.10)
C
di f f
=C
overlap
+C
junc
(4.11)
where the C
g
calculation considers gate oxide capacitance, fringing capacitance C
total f ringing
69
and overlap capacitance C
overlap
, and C
di f f
calculation considers C
overlap
and junction
capacitance C
junc
.
4.3 Circuit-Level Delay and Power
In this section, we present basic circuit models for delay and power characteristics us-
ing the device model in Section 4.2. We consider buffers, LUT, SRAM, pass transistor
gate, flip-flop (FF) and multiplexer [18] for the FPGA circuits, where the multiplexer
is implemented as an NMOS pass transistor tree. Essentially, these FPGA circuits can
be further decomposed into net-lists containing the most basic circuit elements, i.e.
inverters and pass transistors. We therefore mainly discuss the power and delay for
these basic circuit elements as below.
4.3.1 Delay Model
The pass transistor and multiplexer tree are modeled as a lumped capacitance, which
is treated as part of the loading capacitance of an inverter. We calculate the inverter
delay based on numerical integration through the transistor IV curves. In MASTAR4
tool, only the calculation for the maximum on current I
on
, i.e. the current in velocity
saturation region, is provided. We use the equations from [74] to calculate the drain-
source current I
ds
in different working regions, i.e. sub-threshold, linear, and saturation
regions. I
ds
is calculated as a function of drain, source, gate and body voltage levels as
following. In the sub-threshold region, i.e. V
gs
<V
th
, I
ds
is calculated as,
I
ds
= I
th
· (W/L
elec
) · e
((V
gs
−V
th
)/Slope)
(4.12)
70
where I
th
is 0.5uA, V
gs
is the gate source voltage difference. In linear region, i.e.
V
gs
>V
th
and V
ds
<V
gs
−V
th
, I
ds
is calculated as,
I
ds
=
W
L
elec
· u
e f f
·
C
ox elec
1+V
ds
/(E
c
· L
elec
)
· (V
gs
−V
th
−V
ds
/2) · V
ds
(4.13)
where C
ox elec
is the gate oxide capacitance, V
ds
is the drain source voltage difference
and E
c
is the critical electrical field. C
ox elec
and E
c
are both calculated using equations
in the ITRS MARSTAR4 model. In the saturation region, i.e. V
gs
> V
th
and V
ds
>
V
gs
−V
th
, I
ds
is calculated as,
I
ds
= 0.5·
W
L
elec
· u
e f f
·
C
ox elec
1+V
ds
/(E
c
· L
elec
)
· V
2
ds
(4.14)
In velocity saturation region, i.e. V
gs
>V
th
and V
ds
>V
dsat
, I
ds
is equal to the on current
I
on
calculated by (4.3), where the velocity saturation voltage V
dsat
is calculated using
equations in the MASTAR4 tool. V
th
in the above equations is calculated considering
body-bias, short channel effect (SCE) and drain-induced barrier lowering (DIBL). Us-
ing these equations, Figure 4.7(c) shows the IV curves for an NMOS transistor under
ITRS 2005 HP 32nm technology node.
Vg = Vin
Vd = Vout
Vs = GND
C
Ids(N)
Vb = GND
Vs = Vdd
Vb = Vdd
Ids(P)
Figure 4.6: Voltage and current in an inverter during transition.
71
Given an inverter (see Figure 4.6), its loading capacitance C and the input voltage
waveform, we can then calculate the inverter delay. At time t, we can obtain the
transient PMOS and NMOS drain-source current i
ds
(P) and i
ds
(N), respectively. We
can then perform numerical integration to obtain the output voltage waveform based
on the following equation,
dV(out) = (i
ds
(P) −i
ds
(N)) · dt/C (4.15)
The delay is the time difference between when the output and input voltages reach
0.5V
dd
. We calculate the pull-down and pull-up delay for an inverter and then obtain
the worst case delay as the inverter delay. Note that the input slew rate is automatically
considered in this delay model. The output voltage waveform can be propagated for
delay calculation of the next stage.
ITRS05 HP (32nm) 1X INV, Cload =1fF
0
0.2
0.4
0.6
0.8
1
1.2
0 0.01 0.02 0.03 0.04
time (ns)
v
o
l
t
a
g
e

(
v
)
0
2
4
6
8
10
12
14
d
r
a
i
n

c
u
r
r
e
n
t

(
u
A
)
Vin
Vout (NMOS)
Vout (Inv)
PMOS current
ITRS05 HP (32nm) 1X INV, Cload =1fF
0
0.2
0.4
0.6
0.8
1
1.2
0 0.02 0.04 0.06 0.08 0.1 0.12
time (ns)
v
o
l
t
a
g
e

(
v
)
0
2
4
6
8
10
12
14
d
r
a
i
n

c
u
r
r
e
n
t

(
u
A
)
Vin
Vout (NMOS)
Vout (Inv)
PMOS current
(a) (b) (c)
ITRS05 HP NMOS (32nm) IV curves
0
200
400
600
800
1000
1200
0 0.2 0.4 0.6 0.8 1 1.2
Vds (v)
I
d
s

(
u
A
/
u
m
)
transient
transient II
Vgs=0.42v
Vgs=0.52v
Vgs=1.1v
Vgs=0.61v
Vgs=0.71v
Vgs=0.81v
Vgs=0.91v
Vgs=1.0v
small input
slope
large input
slope
Figure 4.7: The pull-down delay of a 1X inverter at ITRS 2005 HP 32nm technology
nodes. (a) Input and output voltages, and short circuit current with a small input slope;
(b) Input and output voltages, and short circuit current with a large input slope; (c) The
NMOS IV curves and the transition of transient current I
ds
with small and large input
slopes.
Figure 4.7 (a) shows the transient voltage transition of a 1X inverter with 1 f F as
loading capacitance C at ITRS 2005 HP 32nm technology node. The input voltage
transition is from 0 to V
dd
. The input slew rate is defined as the transition time from
72
0.1V
dd
(or 0.9V
dd
) to 0.9V
dd
(or 0.1V
dd
). The short circuit current, i.e. the PMOS
drain-source transient current is also shown in this figure. If this short circuit current
is ignored, the output voltage waveform (V
out
(NMOS) in the figure) is almost over-
lapped the the waveform (V
out
(Inv) in the figure) considering the short circuit current.
Figure 4.7 (b) shows the transient voltage transition of this inverter with a larger input
slope, i.e. 5X large as the that in Figure 4.7 (a). The short circuit current becomes more
significant due to a larger input slope and can no longer be ignored, i.e. the output volt-
age waveform without considering short circuit current differs the waveform consider-
ing short circuit current which results in a 20% delay difference. In FPGAs, the input
voltage slope may be large due to a large loading capacitance. Therefore, the short
circuit current is necessary to be considered for delay calculation. Figure 4.7(c) shows
the NMOS transient drain-source current i
ds
transitions with the two input slopes. With
a larger input slope, the transition is slower with a larger delay. While not presented
here, a similar trend for pull-up transition, i.e. input voltage transition is from V
dd
down to 0, is observed.
ITRS MASTAR4 tool uses CV
dd
/I
on
to predict transistor delay. As shown in Fig-
ure 4.7, the input slope has a significant impact on the inverter delay. Since the FPGA
circuit element usually has a large loading capacitance and a large input slope, our
delay model is more accurate than that in MASTAR4. In addition, our delay model is
flexible and can be extended to other complex gates easily.
For other FPGA circuit elements with loading capacitance, e.g. LUTs, FFs and
buffers, we first breakdown them into the netlist of inverters and pass transistors. The
delay of each circuit element is then calculated using the above method, e.g. the LUT
delay is decomposed into the delay from LUT input passing through input buffers
to the multiplexer control inputs and the delay from the SRAM passing through the
multiplexer and the output buffer.
73
4.3.2 Power Model
In this section, we first discuss leakage power including sub-threshold and gate leakage
power and then discuss dynamic power including switching power and short circuit
power for basic circuit elements.
An inverter consumes both sub-threshold and gate leakage power, which depends
on the input logic value. We calculate the average leakage power for an inverter,
P
leak
(inv), as,
P
leak
(inv) =V
dd
· (I
of f
(inv) +I
gate
(inv)) (4.16)
I
of f
(inv) = (I
of f
(P) +I
of f
(N))/2 (4.17)
I
gate
(inv) =
I
gon
(P) +I
gof f
(P) +I
gon
(N) +I
gof f
(N)
2
(4.18)
where I
o
f f (P) and I
of f
(N) are the PMOS and NMOS sub-threshold leakage currents,
respectively, I
gon
(P), I
gof f
(P), I
gon
(N) and I
gof f
(N) are the PMOS and NMOS gate
leakage currents when the channel is on and off, respectively. The two inverters in
an SRAM cell are identical with one input as V
dd
and one input as 0. Therefore, the
average leakage calculation of an inverter can be applied to SRAM leakage calculation.
For an NMOS pass transistor, only gate leakage power is consumed, which can be
either V
dd
· I
gon
(N) or V
dd
· I
gof f
(N) depending on if the channel of this pass transistor
is on or off. The pass transistor in a used/unused routing switch implemented by a
tri-state buffer is on/off. For a multiplexer containing N NMOS pass transistors, N/2
of them are on while the other half are off. Based on the leakage model for inverters
and pass transistors/muxes, we can calculate the leakage power for other FPGA circuit
elements.
We consider switching and short circuit power for inverter dynamic power con-
sumption. For switching power, we calculate the gate (C
g
(inv)) and self-loading ca-
74
pacitances (C
di f f
) for an inverter as,
C
g
(inv) = C
g
(P) +C
g
(N) (4.19)
C
di f f
(inv) = C
di f f
(P) +C
di f f
(N) (4.20)
where C
g
(P) and C
g
(N) are the PMOS and NMOS gate capacitances, respectively,
C
di f f
(P) and C
di f f
(N) are the PMOS and NMOS diffusion capacitances, respectively.
The input capacitance, internal capacitance and self-loading capacitance can then be
easily extracted for each FPGA circuit element for switching power calculation pur-
pose.
As shown in Figure 4.7 (a) and (b), short circuit current depends on input slew
rate. The transient short circuit current has been calculated during delay calculation.
We can simply perform a numerical integration on the transient short circuit current
and obtain the short circuit energy per switch SC(Sl), where Sl is the input slew rate.
4.4 Chip-level Delay and Power
With the delay and power model for basic circuits discussed in Section 4.3, we in-
troduce the chip level delay and power estimation model in this section. In order to
perform chip level estimation, we apply the similar idea as the trace-based estimation
in Chapter 2. We collect the tract information in the same way as in Section 2.3, then
calculate the chip level power and delay.
4.4.1 Delay Model
Similar to Section 2.3.2.3, we compute the critical path delay by adding the delay of
all circuit elements in the critical path, i.e., LUT, wire segment, interconnect switch
75
buffer, and MUX:
D
crit
=

i∈P
D
i
(4.21)
where D
i
is the delay of the i
th
circuit element in the critical path, which can be cal-
culated by the circuit level delay model as discussed in Section 4.3.1. For each circuit
element, the loading capacitance can be calculated from the in/out capacitance of the
basic circuit elements as introduced in Section 4.3.2. For an interconnect wire seg-
ment, we use the Elmore delay model with resistance and capacitance suggested by
ITRS [67].
4.4.2 Leakage Power Model
In the same way as in Section 2.3.2.2, the chip leakage power is modeled as the sum
of the leakage power of all circuit elements:
P
leak
=

i∈T
N
t
i
P
s
i
(4.22)
where P
s
i
is the leakage power for type i circuit element, which can be computed by
circuit level leakage power model as discussed in Section 4.3.2.
4.4.3 Dynamic Power Model
Dynamic power includes switch power and short-circuit power. A circuit implemented
on an FPGA cannot utilize all circuit elements. Dynamic power is only consumed by
the used FPGA resources. Our switch power model distinguishes different types of
used FPGA circuit elements and applies the following expression:
P
sw
=
1
2
· f · V
2
dd
· SW ·

i∈T
N
u
i
· C
i
(4.23)
where f is the operating frequency, V
dd
is the supply voltage, SW is the average switch-
ing activity, and C
i
is the total capacitance per switch of type i circuit elements. The
76
total capacitance per switch C
i
can be computed from the in/out capacitance of a basic
circuit element as discussed in Section 4.3.2. The switching activity is related to the
input vector and logic, which is difficult to obtained without detailed simulation using
a large set of input vectors. In our model, the average switching activity SW can be set
by the users. We also assume that the operating frequency to be 5/6 of the maximum
frequency, that is f = 1/(1.2 · D
crit
). We do not set the operating frequency to maxi-
mum frequency since the actual critical path delay of some chips may increase due to
process variation.
The short circuit power is related to the input slew rate. Here we model the chip
short circuit power as:
P
sc
= f · SW ·

i∈T

1≤j≤N
u
i
SC
i
(Sl
i, j
) (4.24)
where SC
i
(·) is the function to compute short circuit energy per switch for type i circuit
element and Sl
i, j
is the input slew rate of the j
th
type i circuit element. The short circuit
energy function SC
i
(·) can be obtained from the short circuit power model of basic
circuit elements as discussed in Section 4.3.2. Because of the regularity of FPGAs, the
input slew rate for the same type of circuit elements is very close. Therefore, the chip
short circuit power can be further simplified as:
P
sc
= f · SW ·

i∈T
SC
i
(Sl
i
) (4.25)
where Sl
i
is the input slew rate of type i circuit element, which is computed as the
output slew rate of the element driving type i element. For example, a connection box
buffer is driven by a global interconnect buffer, then the input slew rate of a connec-
tion box buffer is the output slew rate of the global interconnect buffer. Such output
slew rate can be computed from the basic circuit element delay model discussed in
Section 4.3.1.
77
With the switch power and short circuit power, the total dynamic power is com-
puted as:
P
dynamic
= P
sw
+P
sc
(4.26)
Notice that dynamic power model in this chapter is different from that in Sec-
tion 2.3.2.1. In Section 2.3.2.1, short circuit power is calculated as a ratio to switching
power, while in this chapter, we calculate the circuit level short circuit power directly
from transistor model. Therefore, the short circuit power model introduced in this
section is more accurate than that in Section 2.3.2.1.
4.5 Chip Level Delay and Power Variation
Based on the chip-level power and delay model introduced in Section 4.4, we study
the impact of process variation on power and delay in this section. First we introduce
the variation model as below.
4.5.1 Variation Models
In general, delay and leakage power of a circuit element are functions of the underlying
process parameters and they can be described as:
D
t
= F
D
t
(X
1
, X
2
, ...) (4.27)
P
t
= F
L
t
(X
1
, X
2
, ...) (4.28)
where F
L
t
and F
D
t
are functions of process parameters to calculate leakage power and
delay for type t circuit element, respectively, and the process variation sources (such
as gate change length and dopant density) are modeled as a random variable X. To
evaluate the chip-level leakage power and delay considering process variation, we first
78
apply randomly generated process parameters to the device model as in Section 4.2 and
then calculate the leakage power and delay for basic circuit elements as in Section 4.3.
The sensitivity functions F
L
i
and F
D
i
can be obtained from the leakage power and delay
model of basic circuit elements discussed in Section 4.3.
For each variation source X, all inter-die, intra-die spatial, and intra-die random
variation is considered. That is:
X = X
g
+X
s
+X
r
(4.29)
where X
g
, X
s
and X
r
are the inter-die, intra-die spatial, and intra-die random variation
for variation source X. We assume that X
g
, X
s
, and X
r
are independent from each
other. Each circuit element has its own random variation X
r
, and all circuit element
of the whole chip share the same inter-die variation X
g
. We use the grid based model
in [173] to model the spatial variation. A chip is divided into several grids, all circuit
elements in the same grid share the same spatial variation X
s
and the spatial variation
between different grids are correlated. The correlation coefficient decreases when the
distance between two grids increases.
4.5.2 Delay Variation
With process variation, some close-to-be critical path may become critical path. There-
fore, in order to analyze delay variation, circuit delay is calculated as the maximum of
a set of near critical paths:
D
var
= max
j∈C
D
j
(4.30)
where C is the set of close-to-be critical paths and D
j
is the delay of the j
th
close-to-be
critical path, which is calculated as:
D
j
=

i∈P
j
F
D
ji
(X
ji
1
, X
ji
2
, ...) (4.31)
79
where P
j
is the set of path elements of the j
th
path and F
D
ji
(·) is the delay variation
function for the i
th
circuit element in P
j
. Because there is no closed-form formula for
the delay variation function F
D
ji
(·), we do not have closed-form formula to compute
the chip delay distribution. In this chapter, we use the Monte-Carlo simulation to
obtain the chip delay distribution. We first generate M samples of variation sources
(X
ji
1
, X
ji
2
, ...) for all i, and then compute the the path delay under such samples to
obtain M samples of path delay. Finally we can get the delay distribution from those
M samples of path delay. The Monte-Carlo simulation allows us to consider non-
Gaussian variation sources and spatial correlation, which is ignored in the previous
Ptrace [167].
4.5.3 Chip Leakage Power Variation
The chip leakage power with process variation is modeled as the sum of leakage power
of all circuit elements:
P
leak
=

i∈T
N
t
i

j=1
P
j
i
=

i∈T
N
t
i

j=1
F
L
i
(X
i j
1
, X
i j
2
, ...) (4.32)
where P
j
i
is the leakage power of the j
th
type i circuit element, and X
i j
is the variation
of the j
th
type i circuit element.
Similar to the chip delay variation model, we use Monte-Carlo simulation to com-
pute chip leakage distribution. We first generate M samples of variation sources and
then compute the M samples of chip leakage power. However, because the random
variation is unique for each circuit element, when the circuit size becomes large, a
large number of random variation samples need to be computed. This makes the sim-
ulation very inefficient. In order to solve such problem, we simplify the chip leakage
power variation model by applying central limit theorem similar to [25, 168].
Given the inter-die and within-die spatial variation, the total leakage power for all
80
type i circuit elements in the l
th
grid can be calculated as:
P
i
leak
=
N
t
li

j=1
F
L
i
(X
g1
+X
l
s1
+X
il j
r1
, X
g2
+X
il j
r2
, ...) (4.33)
where X
g
is the inter-die variation, X
l
s
is the spatial variation in the l
th
grid, X
il j
r
is
the random variation for the j
th
type i circuit elements in the l
th
grid, N
t
li
is the number
of type i circuit element in the l
th
grid. Usually, N
t
li
is a very large number. Therefore,
by applying the central limit theorem, we have:
N
t
li

j=1
F
L
i
(X
g1
+X
l
s1
+X
il j
r1
, X
g2
+X
il j
r2
, ...) ≈N
t
i
· µ
i
(X
g1
+X
l
s1
, X
g2
+X
l
s2
, ...) (4.34)
where
µ
il
(X
g1
+X
l
s1
, X
g2
+X
l
s2
, ...)
= E[F
L
i
(X
i
g1
+X
i j
r1
, X
i
g2
+X
i j
r2
, ...)|X
g1
, X
l
s1
, X
g2
, X
l
s2
, ...] (4.35)
In order to compute µ
il
(·), we first generate M
r
samples of random variations, and
then compute the mean of leakage power under such samples. That is:
µ
il
(X
g1
+X
l
s1
, X
g2
+X
l
s2
, ...)
=
1
M
r
M
r

k=1
F
L
j
(X
g1
+X
l
s1
+X
k
r1
, X
g2
+X
l
s1
+X
k
r2
, ...) (4.36)
where x
k
r
is the k
th
sample of random variation. Here, notice that usually M
r
is much
smaller than N
t
li
and does not depend on circuit size. Therefore such method is scalable
to large circuit size. Then the chip leakage power can be modeled as:
P
leak
=

i∈T
G

l=1
N
t
li
· µ
il
(X
g1
+X
l
s1
, X
g2
+X
l
s2
, ...) (4.37)
where G is the number of grids. With the above equation, we may generate M samples
of inter-die and within-die spatial variations to compute the distribution of leakage
power.
81
Notice that the variation model introduced in this section may handle non-Gaussian
variation sources and spatial correlation, which are ignored in the variation model in
Section 3.2.
4.6 Process Evaluation
Based on the chip level power and delay model presented in Section 4.4 and 4.5, we
perform device tuning to minimize power and delay.
4.6.1 Power and Delay Optimization
First we discuss the optimization for nominal value of power and delay. In this chapter,
we consider two device parameters, supply voltage Vdd and the physical gate length
L
gate
. In our experiment, we start from the device setting of ITRS High Performance
32nm technology (HP32) [67] (with Vdd=1.1V, L
gate
=32nm), and then change Vdd
and L
gate
around such setting. Moreover, we assume that the FPGA has cluster size
N=6, LUT size K=7, and the global interconnect wire length W=4, which is opti-
mized for high performance [91]. For dynamic power estimation, we assume that the
switching activity SW=0.5. In addition, we also consider heterogeneous L
gate
for logic
blocks (L
c
) and interconnects (L
i
). Because SRAMs have little impact on FPGA delay,
we assume that all SRAMs use high V
th
(1.5X dopant density compared to the other
circuit elements) and fix L
gate
=32nm. During our evaluation, Vdd is tuned from 1.0V
to 1.1V, L
gate
(both L
c
and L
c
) is tuned from 31nm to 33nm. The experimental setting
is summarized in Table 5.1. In our experiment, we assumed that 20 MCNC bench-
marks [175] are put in one FPGA chip. Therefore, the power is computed as the sum
of all 20 benchmarks and delay is computed as the longest critical path delay of 20
benchmarks.
82
N K W SW Vdd (V) L
c
(nm) L
i
(nm)
6 7 4 0.5 1.0, 1.05, 1.1 31, 32, 33 31, 32, 33
Table 4.1: Experimental setting.
Figure 4.8 shows the energy per clock cycle and delay tradeoff between different
device settings. From the figure, we find that there is a 3X energy span and a 1.3X
delay span within our search space. We also find that that the knee point in the energy
delay tradeoff figure is exactly HP32, which has high performance without paying too
much power penalty. On the other hand, we also optimize device setting by minimizing
energy and delay product (ED). We show the top 2 minimum ED device settings in the
figure, which we refer to as min-ED1 and min-ED2.
2
5
8
11
14
3.5 4.0 4.5 5.0
Delay (ns)
E
n
e
r
g
y

(
n
J
)
All device
Min-ED
HP32
Min-ED
HP32
3.1X
1.3X
Figure 4.8: Energy and delay tradeoff.
Table 4.2 shows the power (P), delay (D), energy per clock cycle (E) , and energy
delay product (ED) for HP32 and min-ED device settings. From the table, we see that
by applying device tuning, ED can be reduced by up to 29.4% (min-ED1) compared to
HP32. We also find that it is worthwhile to apply heterogeneous L
gate
. The optimum
heterogeneous L
gate
device setting (min-ED1) has lower ED (0.7nJ·ns lower) than the
83
optimum homogeneous L
gate
device setting (min-ED2).
Device Vdd L
c
L
i
P D E ED
(V) (nm) (nm) (W) (ns) (nJ) (nJ·ns)
HP32 1.1 32 32 1.19 3.90 1.88 22.6
min-ED1 1.0 32 33 0.77 4.55 3.50 15.9(-29.4%)
min-ED2 1.0 33 33 0.74 4.73 3.50 16.5(-26.8%)
Table 4.2: Power, delay, energy, and ED comparison for different device settings.
4.6.2 Variation Analysis
In this section, we apply the chip level variation model as introduced in Section 4.5
to analyze the chip level leakage and delay variation. In our experiment, we consider
four types of variation sources, dopant density (N
bulk
), channel gate length (L
gate
),
oxide thickness (T
ox
), and mobility (µ
e f f
) [190]. Moreover, because for technology
under 65nm, the spatially correlated intra-die variation is small compared to the intra-
die random variation [190], therefore, we ignore spatial variation in this chapter. For
all variation sources, we assume that they follow normal distribution. Notice that our
process variation model is based on Monte-Carlo simulation, therefore it can handle
any non-Gaussian variation sources. The 3-sigma value of inter-die (3σ
g
), intra-die
spatial 3σ
s
, and intra-die random (3σ
r
) variation for all types of variation sources are
summarized in Table 4.3. In the table, the 3-sigma values are the represented as the
percentage of nominal value. Moreover, for spatial variation, we assume that the cor-
relation coefficient decreases to 1% when distance is 1cm (D
corr
= 1cm). In our ex-
periment, we consider two different variation settings, low variation (LV) and high
variation (HV). We generate M = 10, 000 samples for Monte-Carlo simulation and use
M
r
= 100 samples to compute the leakage mean µ
i
(·).
Figure 4.9 illustrates the probability density function (PDF) of delay and leakage
power for min-ED1 and HP32 under variation setting LV. From the figure, we see that
84
M = 10, 000 M
r
= 100 D
corr
= 1cm
Source Distribution LV HV

g

s

r

g

s

r
N
bulk
Normal 4.0% 4.0% 2.0% 6.0% 6.0% 4.0%
L
gate
Normal 2.0% 2.0% 1.5 3.0% 3.0% 2.2%
T
xo
Normal 2.0% 2.0% 1.5 3.0% 3.0% 2.2%
µ
e f f
Normal 4.0% 4.0% 2.0% 6.0% 6.0% 4.0%
Table 4.3: Experimental setting for process variation.
min-ED1 has much less leakage variation compared to HP32 while its delay variation
is only a little bit larger than HP32. We also observe that the leakage roughly has a
log-normal distribution and delay roughly has a normal distribution. This is because
leakage power is almost exponential to the variation sources while delay is almost
linear to the variation sources.
0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2
0
2
4
6
8
10
12
14
16
Leakage (W)
(a)
min−ED1
HP32
3.6 3.8 4 4.2 4.4 4.6 4.8 5 5.2
0
0.5
1
1.5
2
2.5
3
3.5
4
Delay (ns)
(b)
min−ED1
HP32
Figure 4.9: Delay and leakage distribution. (a) Leakage PDF; (b) Delay PDF.
Table 4.4 summarizes the nominal value (nom), mean (µ), and standard deviation
(σ) of leakage power and delay for the min-ED device settings and HP32. From the
table, we observe that compared to HP32 the min-ED device settings greatly reduce
leakage variation (reduce standard deviation by up to 92%) with only a small increase
of delay variation (increase standard deviation by up to 37%). From the table, we also
see that HP32 is more sensitive to process variation than the min-ED device settings.
85
Device Leakage (mW) Delay (ns)
nom LV HV nom LV HV
µ σ µ σ µ σ µ σ
HP32 854 982 348 1120 575 3.90 3.95 0.122 3.99 0.185
min-ED1 328 359 51 (-86%) 398.2 68 (-88%) 4.55 4.55 0.162(+33%) 4.58 0.244(+32%)
min-ED2 315 328 32 (-89%) 375.2 48 (-92%) 4.73 4.779 0.164(+37%) 4.782 0.245(+32%)
Table 4.4: Nominal value, mean, and standard deviation comparison for leakage power
and delay.
4.7 FPGA Reliability
As technology scales down to nanometer, reliability issues such as device aging with
V
th
degradation and soft error rate (SER) due to high-energy particle or cosmic rays be-
come more and more significant. In this section we study these two types of reliability
for FPGA, i.e. device aging and permanent soft error rate (SER).
4.7.1 Impact of Device Aging
First, we introduce the impact of device aging effect on FPGA power and delay. It
has been shown that negative-bias-temperature-instability (NBTI) effect increases the
threshold voltage of PMOS transistor [11][160][149][23], and hot-carrier-injection
(HCI) increases the threshold voltage (V
th
)of NMOS transistor [34][149][166]. Due
to the effect of NBTI, PMOS V
th
increases during stress mode, i.e. with input 0, and
recovers during recovery mode, i.e. with input V
dd
. (4.38) and (4.39) are the equations
for PMOS ΔV
th
over time in the stress and recovery modes, respectively.
ΔV
th
= (K
v
(t −t
0
)
0.5
+(ΔV
th0
)
(1/2n)
)
2n
(4.38)
ΔV
th
=−ΔV
th0
(1−

1
t
e
+

ξ
2
C(t −t
0
)
2t
ox
+

Ct
) (4.39)
86
(4.40) is the equation for NMOS ΔV
th
over time due to HCI.
ΔV
th
=
q
C
ox
K
2

Q
i
e
E
ox
E
o2
e

φ
it
qλE
m
t
n

(4.40)
The key parameters that determine the degradation rate include the inversion charge,
Q
i
, electrical field, E
ox
, temperature, supply voltage V
dd
and nominal threshold voltage
V
th
. Due to the space limit, we do not include detailed explanation for each parameters.
Interested readers please refer to [23] for more details. Figure 4.10 shows the calcu-
lation flow for ΔV
th
due to NBTI and HCI. We first calculate the effective gate length
L
elec
and oxide thickness T
ox elec
and then calculate V
th
. With these intermediate pa-
rameters, we can obtain ΔV
th
due to device aging over time. A higher V
dd
or lower V
th
leads to larger V
th
increase. The V
th
degradation due to device aging results in increase
of critical path delay and decrease of leakage power.
Inputs: Lgate Tox Nbulk Xjext W Racc T Vdd
Output: HCI HBTI
Intermediate variables: Lelec Tox_elec
Intermediate variables: Vth
Figure 4.10: The calculation flow for ΔV
th
due to NBTI and HCI.
Figure 4.11 illustrates the V
th
increase (ΔV) for PMOS and NMOS transistors of
both min-ED1 and HP32 within 10 years. From the figure, we see that HP32 is much
more sensitive to NBTI and HCI than min-ED1. This is due to the fact that HP32 uses
higher V
dd
and lower V
th
(due to smaller L
gate
) than min-ED1. Hence HP32 has larger
ΔV
th
for both NMOS and PMOS than min-ED1.
87
0 1 2 3 4 5 6 7 8 9 10
0
5
10
15
20
25
30
35
40
45
Time (year)
Δ

V
t
h

(
m
V
)
Min−ED1 PMOS (NBTI)
Min−ED1 NMOS (HCI)
HP32 PMOS (NBTI)
HP32 NMOS (HCI)
Figure 4.11: V
th
increase caused by NBTI and HCI.
Figure 4.12 illustrates the critical path delay increase and chip leakage power de-
crease within 10 years for min-ED1 and HP32. It is interesting to see in the figure that
the leakage and delay change is most significant at the first one year. Therefore, device
burn-in [96] that has been used for microprocessors but not FPGAs yet could be used
to reduce leakage and delay change over time after product shipment.
Table 4.5 illustrates the current value, value after 1 year, and value after 10 years of
leakage power (L) and delay (D). In the table, we also compare the change percentage
with and without burn-in (burn to 1 year). The change percentage without burn-in is
computed as the percentage of the difference between the 10 year value and current
value, and that with burn-in is computed as the percentage of the difference between
the 10 year value and the 1 year value. From the figure and table, we find that the
min-ED device settings is less sensitive to device aging compared to HP32. This is be-
cause HP32 has higher V
dd
and smaller L
gate
compared to the min-ED device settings.
Therefore it has larger V
th
degradation, hence more sensitive to device aging. We may
see that compare to HP32, min-ED device settings can not only reduce ED but also
88
Device Current 1 Year 10 Year Change
L D L D L D w/o burn-in w/ burn-in
(mW) (ns) (mW) (ns) (mW) (ns) L D L D
HP32 854 3.90 711 4.01 640 4.23 -25.1% +8.5% -10.0% +5.5%
min-ED1 328 4.55 317 4.59 311 4.64 -5.2% +2.0% -1.9% +1.1%
min-ED2 315 4.73 307 4.76 302 4.82 -2.5% +1.9% -1.6% +1.3%
Table 4.5: Delay and leakage change due to device aging.
increase aging reliability of FPGAs. Moreover, by applying burn-in, we can greatly
reduce the delay degradation. For example, for HP32, burn-in reduces 10 year delay
degradation from 8.5% to 5.5%.
0 2 4 6 8 10
−250
−200
−150
−100
−50
0
Time (year)
(a)
Δ

P
l
e
a
k

(
m
W
)
Min−ED1
HP32
1 Year
0 2 4 6 8 10
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
Time (year)
(b)
Δ

D
e
l
a
y

(
n
s
)
min−ED1
HP32
1 Year
Figure 4.12: Impact of device aging. (a) Leakage change; (b) Delay change.
4.7.2 Permanent Soft Error Rate (SER)
We use the model in [60] to calculate the soft error rate (SER) for configuration
SRAMs in our study. The SER in SRAMs is permanent in FPGAs, which cannot
be recovered unless re-writing those affected configuration bits [14][17]. The SER for
one SRAM cell is shown in (4.41),
SER(SRAM) = F · A· K· e
−Q
crit
/Q
s
(4.41)
89
where F is the neutron flux, A is the transistor drain area, K is a technology independent
constant and is same for all device settings, Q
crit
is the critical charge to incur an error,
and Q
s
is the charge collection slope. Figure 4.13 shows the flow to calculate the
SRAM SER. We first calculate intermediate parameters including L
elec
, T
ox elec
and
V
th
. Q
s
, Q
crit
and A are then calculated. With all these intermediate parameters, we can
obtain the SRAM SER. A transistor with a larger Q
crit
needs more energy to incur an
error and has a smaller SER. Q
crit
mainly depends on supply voltage V
dd
and loading
capacitance C, and slightly depends on V
th
. On the other hand, a transistor with a larger
Q
s
is more effective in collecting charge and has a larger SER. Q
s
depends on channel
doping density N
bulk
and V
dd
. We use the models in [60] with linear interpolation to
calculate Q
crit
and Q
s
under different device settings, i.e. different V
dd
and V
th
, and
then calculate the SER of an SRAM cell correspondingly using (4.41). The chip-level
permanent SER can be calculated as,
SER = Num SRAM· SER(SRAM) (4.42)
where Num SRAM is the total number of SRAM cells.
Inputs: Lgate Tox Nbulk Xjext W Racc T Vdd
Output: SER
Intermediate variables: Lelec Tox_elec
Intermediate variables: Vth
Intermediate variables: Qs Qcrit A
Figure 4.13: The flow for SRAM SER calculation.
Table 4.6 compares the SER for HP32, and the two min-ED device settings. For
90
the min-ED device settings, L
gate
for SRAM cells is still 32nm. Therefor only V
dd
may
affect SER. In the table, SER is measured as number of failures in billion hours (FIT).
The supply voltage V
dd
in HP32 is higher, hence the critical charge, Q
crit
, is larger and
may result in a smaller SER. For min-ED1/2 with V
dd
of 1.0v, SER is 1.6% higher than
that of HP32. It is worthwhile to use Min-ED device settings to obtain a significant
ED reduction with virtually no impact on SER.
Device V
dd
(V) # SRAMs SER (FIT)
HP32 1.1 12438596 362.45
min-ED1/2 1.0 12438596 368.25 (+1.6%)
Table 4.6: Soft error rate comparison.
4.8 Interaction between Process Variation and Reliability
In this section, we study the impact of device aging on process variation and the impact
of device aging and process variation on SER.
4.8.1 Impact of Device Aging on Power and Delay Variation
First, we discuss the impact of device aging on process variation induced leakage and
delay variation. Figure 4.14 compares the PDF of delay and leakage between current
value and the value after 10 years for the device setting min-ED1. From the figure,
we find that device aging actually reduce the leakage variation. But device aging only
increase delay but has no significant impact on delay variation.
Table 4.7 summarizes the mean and standard deviation of leakage and delay for
different device settings. The table compares both the current value and the value after
10 years in both low and high variation cases. From the table, we find that device
aging has great impact of leakage power variation. For HP32 device setting under low
91
0.3 0.35 0.4 0.45 0.5 0.55
0
5
10
15
20
25
Leakage (W)
(a)
Current
After 10 Years
4 4.2 4.4 4.6 4.8 5 5.2
0
0.5
1
1.5
2
2.5
3
3.5
Delay (ns)
(b)
Current
After 10 Years
Figure 4.14: Impact of device aging on delay and leakage PDF. (a) Leakage PDF
comparison; (b) Delay PDF comparison.
variation case (LV), device aging reduces mean by 28.8% and standard deviation by
65.2%. However, the impact of device aging on delay variation is relatively small. For
HP32 device setting, device aging only increases standard deviation by 1.7%. This
is because leakage power is rough exponential to V
th
while delay is roughly linear to
V
th
. Moreover, we observe that compared to HP32 device setting, device aging has
less impact on power and delay variation for min-ED settings. This is because HP32
is more sensitive to device aging as discussed in Section 4.7. From the result, we also
find that in high variation case, the impact of device aging on delay variation is similar
to that of the regular variation case (min-ED1). However, when variation is high, the
impact of device aging on leakage variation is much larger than that in the regular
variation case. That is, when variation scale becomes larger, there is a larger impact of
device aging on leakage variation.
4.8.2 Impacts of Device Aging and Process Variation on SER
As shown in Figure 4.13, the SRAM SER depends on V
dd
, loading capacitance, V
th
and
channel doping density N
bulk
while V
th
and N
bulk
may be affected by device aging and
92
Device Leakage (mW) Delay (ns)
current 10 years current 10 years
µ σ µ σ µ σ µ σ
HP32 LV 982 348 678 (-28.8%) 118 (-65.2%) 3.95 0.122 4.23 (+8.4%) 0.121 (+1.67%)
min-ED1 LV 359 51.0 321 (-6.2%) 31.2 (-32.7%) 4.55 0.162 4.65 (+2.0%) 0.162 (+0.16%)
min-ED2 LV 328 32.0 309 (-4.6%) 22.4 (-30.4%) 4.78 0.164 4.83 (+1.9%) 0.164 (+0.25%)
HP32 HV 1120 575 712 (-36.4%) 158 (-72.5%) 3.99 1.85 4.09 (+2.6%) 1.86 (+0.54%)
min-ED1 HV 398.2 68 345.1 (-13.3%) 38.1 (-44.0%) 4.55 0.318 4.66 (+2.1%) 0.319 (+0.28%)
min-ED2 HV 375.2 48 332 (-12.4%) 25 (-47.9%) 4.78 0.245 4.79 (+0.21%) 0.245 (+0.21%)
Table 4.7: Impact of device aging on process variation.
process variation. We perform the study on the impacts of device aging and process
variation on SER as below.
V
th
may increase with device aging due to NBTI and HCI. For ITRS HP 32nm
technology (see Figure 4.11, V
th
may degrade (or increase) by around 10% (or 20mV)
for PMOS due to NBTI and by around 25% (or 50mV) for NMOS due to HCI after 10
years. A larger V
th
may result in a slightly smaller critical charge Q
crit
for an SRAM
and therefore a larger SER. Table 4.8 presents the impact of device aging on SER.
After 10 years, the SRAM SER may increase by 0.3% due to degraded V
th
. Even if
V
th
degrades by 100mv for both NMOS and PMOS, the SRAM SER only increases by
0.9%.
0 year 10 3σ=6% variation
or nominal years in N
bulk
SRAM SER (FIT) 2.914E-5 +0.3% -0.18% to +0.17%
Table 4.8: The impact of device aging and process variation on SER of one SRAM
under ITRS HP 32nm technology.
With process variation, N
bulk
and channel gate length L
gate
become random vari-
ables. N
bulk
variation directly affects charge collection slope Q
s
and therefore affects
SER. L
gate
variation may affect V
th
and therefore implicitly affect SER. For ITRS HP
32nm technology, a variation in L
gate
from 31nm to 33 nm may result in a variation
93
in V
th
from -25% to 21% [68]. It has been shown that V
th
variation does not have a
significant impact on SER. Table 4.8 presents the impact of N
bulk
variation on SER. A
variation of 3σ as 6% in N
bulk
only results in a variation in SER from -0.18% to 0.17%.
A larger N
bulk
may result in a larger charge collection slope Q
s
and therefor a smaller
SER. On the other hand, a larger N
bulk
may result in a higher V
th
and a smaller critical
charge Q
crit
, therefore a larger SER. These two effects compensate with each other and
therefore N
bulk
does not have a significant impact on SER either. It is clear that neither
device aging due to NBTI and HCI nor process variation have any significant impact
on SER.
4.9 Conclusions
In this chapter, we have developed a trace-based framework to enable concurrent pro-
cess and FPGA architecture co-development. Same as the MASTAR4 model used by
the 2005 ITRS [67][68][148], the user can tune eight parameters for bulk CMOS pro-
cesses and obtain the chip level performance and power distribution and soft error rate
(SER) considering process variations and device aging. The framework is efficient as it
is based on closed-form formulas. It is also flexible as process parameters can be cus-
tomized for different FPGA elements and no SPICE models and simulation are needed
for these elements. Therefore, this framework is suitable for early stage process and
FPGA architecture co-development.
A few examples of using this framework have been presented. We show that ap-
plying heterogeneous gate lengths to logic and interconnect may lead to 1.3X delay
difference, 3.1X energy difference, and reduce standard deviation of leakage variation
by 87%. This offers a large room for power and delay tradeoff. We further show that
the device aging has a knee point over time, and device burn-in to reach the point
could reduce the performance change over 10 years from 8.5% to 5.5% and reduce die
94
to die leakage significantly. In addition, we also study the interaction between process
variation, device aging and SER. We observe that device aging reduces standard de-
viation of leakage by 65% over 10 years while it has relatively small impact on delay
variation. Moreover, we also find that neither device aging due to NBTI and HCI nor
process variation have significant impact on SER.
95
Part II
Statistical Timing Modeling and
Analysis
96
CHAPTER 5
Non-Gaussian Statistical Timing Analysis Using
Second-Order Polynomial Fitting
In Part I, we have discussed statistical modeling and optimization for FPGAs. In Part
II of this dissertation, we study statistical modeling and analysis for ASICs. In this
chapter, we focus on statistical static timing analysis (SSTA). Lots of research works
on SSTA have been published in recent years. However, most of the existing SSTA
techniques have difficulty in handling the non-Gaussian variation distribution and non-
linear dependency of delay on variation sources. To solve this problem, we first pro-
pose a new method to approximate the max operation of two non-Gaussian random
variables through second-order polynomial fitting. Applying the approximation, we
present present new non-Gaussian SSTA algorithms for three delay models: quadratic
model, quadratic model without crossing terms (semi-quadratic model), and linear
model. All the atomic operations (max and sum) of our algorithms are performed by
closed-form formulas, hence they scale well for large designs. Experimental results
show that compared to the Monte-Carlo simulation, our approach predicts the mean,
standard deviation, skewness, and 95-percentile point within 1%, 1%, 6%, and 1%
error, respectively.
97
5.1 Introduction
Process variation introduces significant uncertainty to circuit delay as discussed
in Chapter 1. In Part I, we have studied statistical optimization for FPGA circuits.
Compared to FPGA, ASIC has higher performance, lower power, and smaller area.
Therefore, ASICs are widely used in high performance or low power applications,
where the impact of process variation is significant.
To analyze the impact of process variation on circuit delay, statistical static timing
analysis (SSTA) is developed for full chip timing analysis under process variation. By
performing SSTA, designers can obtain the timing distribution and its sensitivity to
various process parameters.
In recent years, two types of SSTA techniques are proposed, the path-based [103,
72, 7, 120, 157, 128] and the block-based [32, 5, 49, 9, 22, 8, 163, 86, 53, 182, 13, 181,
113, 75, 180, 178, 76, 3, 183, 2, 142, 19, 179, 40] SSTA. Since the number of paths
is exponential with respect to the circuit size, the path-based SSTA is not scalable to
large circuits. The block-based SSTA was proposed to solve this problem. The goal
of block-based SSTA is to parameterize timing characteristics of the timing graph as a
function of the underlying sources of process parameters which are modeled as random
variables.
The early SSTA methods [32, 163] modeled the gate delay as linear functions
of variation sources and assumed all the variation sources are mutually independent
Gaussian random variables. Based on such assumption, [163] presented time efficient
methods with closed-form formulas [46] for all atomic operations (max and sum).
However, when the amount of variation becomes larger, the linear delay model is no
longer accurate [94]. In order to capture the non-linear dependence of delay on the
variation sources, a higher-order delay model is thus needed [180, 178].
98
As more complicated or large-scale variation sources are taken into account, the
assumption of Gaussian variation sources is also not valid. For example, the via resis-
tance has an asymmetric distribution [13], while dopant concentration is more suitably
modeled as a Poisson distribution [142] than Gaussian. Some of the most recent works
on SSTA [76, 142, 13, 19, 179, 40] started to take non-Gaussian variation sources into
account. For example, [142] applied independent component analysis to de-correlate
the non-Gaussian random variables, but it was still based on a linear delay model.
[76, 13, 19, 179] considered both non-linear delay model and non-Gaussian variation
sources, but the computation cost of these techniques is too high to be applicable for
large designs. For example, [76] proposed to compute the max operation by a regres-
sion based on Monte-Carlo simulation, which is slow. [13] computed the max opera-
tion through tightness probability while [179] applied moment matching to reconstruct
the max. However, to do so, both had to resort to expensive multi-dimension numer-
ical integration techniques. [19] handled the atomic operations by approximating the
gate delay using a set of orthogonal polynomials, which needs to be constructed for
different variation distributions. [40] proposed to approximate the probability density
function (PDF) of max results as a Fourier Series, but it lacks the capability to handle
the crossing term effects on timing.
In this chapter, we introduce a time efficient non-linear SSTA for arbitrary non-
Gaussian variation sources. The major contribution of this work is two-fold. (1) We
propose a new method to approximate the max of two non-Gaussian random vari-
ables as a second-order polynomial function based on least-square-error curve fitting.
Experimental results shows that such approximation is much more accurate than the
linear approximation based on tightness probability [163, 180, 13]. (2) Based on the
new approximation of the max operation, we present our SSTA technique for three
different delay models: quadratic delay model, quadratic delay model without cross-
ing terms (semi-quadratic model), and linear delay model. In our method, only the
99
first few moments are required for different distributions and no extra effort is needed,
while [19] needs to obtain different sets of orthogonal polynomials for different vari-
ation distributions. Moreover, all the atomic operations are performed by closed-form
formulas, hence they are very time efficient. For the linear and semi-quadratic delay
model, the computational complexity of our method is linear in both the number of
variation sources and circuit size. For the quadratic delay model, the computational
complexity is cubic (third-order) to the number of variation sources and linear to the
circuit size. Experimental results show that for the semi-quadratic delay model, our
approach is 70 times faster than the non-linear SSTA in [40], and indicates less than
1% error in mean, 1% error in standard deviation, 30% error in skewness, and 1% er-
ror in 95-percentile point. Moreover, for the more accurate quadratic delay model, our
approach incurs less than 1% error in mean, 1% error in standard deviation, 6% error
in skewness, and 1% error in 95-percentile point.
The rest of the chapter is organized as follows: Section 5.2 introduces the approx-
imation of the max operation using second-order polynomial fitting; with the approxi-
mation of max, Section 5.3 presents a novel SSTA algorithmfor quadratic delay model
with non-Gaussian variation sources; we further apply this technique to handle both
semi-quadratic and linear delay model in Section 5.4 and Section 5.5, respectively;
experimental results are presented in Section 5.6; Section 5.7 concludes this chapter.
5.2 Second-Order Polynomial Fitting of Max Operation
5.2.1 Review and Preliminary
According to [163], given two random variables, A and B, the tightness probability is
defined as the probability of A greater than B, i.e., T
A
= P{A > B} = P{A−B > 0}.
100
Then the max operation is approximated as:
max(A, B) ≈T
A
· A+(1−T
A
) · B+c, (5.1)
where c is a term used to match the mean and variance of max(A, B). Because (5.1)
can be further written as max(A, B) = max(A−B, 0) +B, we arrive at:
max(A−B, 0) ≈ T
A
· (A−B) +c. (5.2)
According to (5.2), we can see that the max operation in [163] is in fact approximated
by a linear function subject to certain conditions (such as matching the exact mean
and variance). Such a linear approximation is efficient and reasonably accurate when
both A and B are Gaussian. As shown in [163, 164], the PDF predicted by such linear
approximation is very close the Monte Carlo simulation. Moreover, in this case the
coefficients can be computed easily, as both T
A
and E[max(A, B)] can be obtained by
closed-form formulas when A and B are Gaussian [163]. However, when A and B are
non-Gaussian randomvariables, the tightness probability T
A
and E[max(A, B)] are hard
to obtain. For example, T
A
in [13] has to be computed via expensive multi-dimensional
numerical integration, thus preventing its scalability to large designs. Moreover, be-
cause the max operation is an inherently non-linear function, linear approximation
would become less and less accurate, particularly when the amount of variation in-
creases and the number of non-Gaussian variation sources increases. To overcome
these difficulties, we develop a more efficient and accurate approximation method in
the next section.
5.2.2 New Fitting Method for Max Operation
In this section, we introduce a new fitting method to approximate the max operation.
Instead of using the linear function, we propose to use a second-order polynomial
101
function to approximate the max operation, i.e.,
max(V, 0) ≈h(V, Θ) = θ
2
V
2

1
V +θ
0
, (5.3)
where Θ = (θ
0
, θ
1
, θ
2
)
T
are three coefficients of the second-order polynomial h(v, Θ).
The problem thus becomes how to obtain the fitting parameters of Θ. Different from
the linear fitting method through tightness probability, we compute Θ by matching
the mean of the max operation while minimizing the square error (SE) between
h(V, P) and max(V, 0) within the ±ε range of V. Mathematically, this problem can be
formulated as the following optimization problem:
Θ = arg min
E[h(V,Θ)]=µ
m
Z
µ
v

µ
v
−ε

h(v, Θ) −max(v, 0)

2
dv (5.4)
where µ
v
, σ
2
v
, and γ
v
are the mean, variance, and skewness of random variable of
V, respectively; while µ
m
and E[h(V, Θ)] are the exact and approximated mean of
max(V, 0), respectively. In other words,
µ
m
= E[max(V, 0)] (5.5)
E[h(V, Θ)] = θ
2

2
v

2
v
) +θ
1
µ
v

0
. (5.6)
In (5.4), ε is the approximation range. The value of ε may affect the accuracy of the
second order approximation. Intuitively, when ε is large, the probability that V lies
outside the ±ε range is low. Then the impact of ignoring the difference in the low
probability region is justified. However, when ε is too large, we may unnecessarily
fitting the curve in a wide range without focusing on the high probability region. In
this chapter, we find that assuming:
ε = 3σ
v
(5.7)
provides a good approximation of max. This can be explained by the fact that, although
the circuit delay is not Gaussian, it would still be more or less Gaussian like. That is,
102
its PDF would be most likely centering around the ±3σ range of the mean. In practice,
the users can choose the range ε according to the delay distributions.
In order to solve (5.4), we first need to compute µ
m
. When V is a non-Gaussian
randomvariable, exact computation of µ
m
is difficult in general. Therefore, we propose
to use the following two-step procedure to approximately compute µ
m
. In the first
step, we approximate the non-Gaussian random variable V as a quadratic function of a
standard Gaussian random variable W similar to [184], i.e.,
V ≈P(W) = c
2
· W
2
+c
1
· W +c
0
. (5.8)
The coefficients c
2
, c
1
, and c
0
can be obtained by matching P(W) and V’s mean,
variance and skewness simultaneously, i.e.,
E[c
2
· W
2
+c
1
· W +c
0
] = µ
v
(5.9)
E[(c
2
· W
2
+c
1
· W +c
0
−µ
v
)
2
] = σ
2
v
(5.10)
E[(c
2
· W
2
+c
1
· W +c
0
−µ
v
)
3
] = γ
v
· σ
3
v
(5.11)
As shown in [184], solving the above equations, c
2
will be one of the following values:
c
2,1
= −

2
v

2/3

1/3
(5.12)
c
2,2
=
2(1+ j

3)σ
2
v
+(1− j

3)Δ
2/3

1/3
(5.13)
c
2,3
=
2(1− j

3)σ
2
v
+(1+ j

3)Δ
2/3

1/3
(5.14)
where Δ = γ
v
· σ
3
v
+ j


6
v
−γ
2
v
· σ
6
v
, with j =

−1. It is proved in [184] that, when

v
| < 2

2 there exists one and only one of the above three values in the range |c
2
| <
σ
v
/

2. We will pick such value for c
2
. When |γ
v
| > 2

2, we may let:
c
2
=



σ
v
/

2 γ
v
> 2

2
−σ
v
/

2 γ
v
<−2

2
(5.15)
103
It is proved in [19] that the above equations gives P(W) which matches the mean and
variance of V and has the skewness closest to γ
v
. With c
2
, c
1
and c
0
can be computed
as:
c
1
=

σ
2
v
−2c
2
2
(5.16)
c
0
= µ
v
−c
2
(5.17)
After obtaining the coefficients c
2
, c
1
, and c
0
in the second step, we approximate
the exact mean of max(V, 0) by the exact mean of max(P(W), 0), i.e.,
µ
m
≈E[max(P(W), 0)] =
Z
P(w)>0
P(w)φ(w)dw (5.18)
where φ(·) is the PDF of the standard normal distribution. In the above equation, the
integration range P(w) > 0 can be computed in four different cases:
Case 1: c
2
> 0
P(w) > 0 ⇔w <t
1
∨w >t
2
(5.19)
where
t
1
= (−c
1

c
2
1
−4c
2
c
0
)/2c
2
(5.20)
t
2
= (−c
1
+

c
2
1
−4c
2
c
0
)/2c
2
(5.21)
Case 2: c
2
< 0
P(w) > 0 ⇔t
2
< w < t
1
(5.22)
Case 3: c
2
= 0∧c
1
> 0
P(w) > 0 ⇔w >−c
0
/c
1
(5.23)
Case 4: c
2
= 0∧c
1
< 0
P(w) > 0 ⇔w <−c
0
/c
1
(5.24)
104
Knowing the integration range, we can compute E[max(P(W), 0)] under such four
cases:
E[max(P(W), 0)] = (5.25)















(c
2
+c
0
)

1+Φ(t
1
) −Φ(t
2
)

+(c
1
+t
1
)

φ(t
2
) −φ(t
1
)

c
2
> 0
(c
2
+c
0
)

Φ(t
1
) −Φ(t
2
)

+(c
1
+t
1
)

φ(t
2
) −φ(t
1
)

c
2
< 0
c
0
· Φ(c
0
/c
1
) +c
1
· φ(c
0
/c
1
) c
2
= 0∧c
1
> 0
c
0
·

1−Φ(c
0
/c
1
)

−c
1
· φ(c
0
/c
1
) c
2
= 0∧c
1
< 0
where Φ(·) is the cumulative density function (CDF) of the standard normal distribu-
tion. According to (5.25), we can compute µ
m
easily through analytical formulas.
In practice, the above approximation of µ
m
is accurate only when the distribution
of V is close to the normal distribution. When V is not close to the normal distribution,
for example, V is uniform distribution, the above approximation of µ
m
is no longer
accurate. Fortunately, in practice, the circuit delay has close-to-normal distribution as
discussed above. Therefore, such approximation is accurate for SSTA.
After obtaining µ
m
, we need to find Θ in (5.3) by solving the constrained optimiza-
tion problem of (5.4). In the following, we show that (5.4) can be solved analytically
as well. We first write the constraint in (5.4) as follows:
θ
0
= µ
m
−θ
2

2
v

2
v
) −θ
1
µ
v
. (5.26)
Replacing θ
0
in (5.4) by (5.26), the square error in (5.4) can be written as:
SE =
0
Z
µ
v
−3σ
v
h
2
(v, Θ)dv +
µ
v
+3σ
v
Z
0

h(v, Θ)dv −v

2
dv
=
Z
0
l
1

θ
2
(v
2
−µ
2
v
−σ
2
v
) +θ
1
(v −µ
v
) +µ
m

2
dv +
Z
l
2
0

θ
2
(v
2
−µ
2
v
−σ
2
v
) +θ
1
(v −µ
v
) +µ
m
−v

2
dv, (5.27)
105
where l
1
= µ
v
−3σ
v
and l
2
= µ
v
+3σ
v
. By expanding the square and integral, we
can transform the constrained optimization of (5.4) to the following unconstrained
optimization problem, which is a quadratic form of Θ

= (θ
1
, θ
2
)
T
:
Θ

= argminSE (5.28)
SE = Θ
T

+QΘ

+t, (5.29)
where S = (s
i j
) is a 2 ×2 matrix, Q = (q
i
) is a 1 ×2 vector, and t is a constant. The
parameters of S, Q, and t can be computed as:
s
11
= (l
3
2
−l
3
1
)/3+(l
2
−l
1

2
v
−(l
2
2
−l
2
1

v
(5.30)
s
22
= (l
5
2
−l
5
1
)/5+(l
2
2
−l
2
1
)(µ
2
v

2
v
)
2
−2(l
3
2
−l
3
1
)(µ
2
v

2
v
)/3 (5.31)
s
12
= s
21
= (l
4
2
−l
4
1
)/4+(l
2
−l
1

v

2
v

2
v
) +
(l
3
2
−l
3
1

v
/3−(l
2
2
−l
2
1
)(µ
2
v

2
v
)/2 (5.32)
q
1
= l
2
2
µ
v
+(l
2
2
−l
2
1

m
−2l
3
2
/3−2(l
2
−l
1

v
µ
m
(5.33)
q
2
= l
2
2

2
v

2
v
) +2(l
3
2
−l
3
1

m
/3− (5.34)
2l
3
2
/3−2(l
2
−l
1
)(µ
2
v

2
v

v
t = l
3
2
/3+(l
2
−l
1

2
m
−l
2
2
µ
m
(5.35)
Because the square error is always positive no matter what value the Θ

is, S is a pos-
itive definite matrix. Therefore, (5.29) is to minimize a second-order convex function
of Θ

without constraints. Then the optimum of Θ

can be obtained by setting the
derivative of (5.29) to zero, resulting a 2×2 system of linear equations:
∂SE
∂θ
1
= 2s
11
· θ
1
+2s
12
· θ
2
+q
1
= 0 (5.36)
∂SE
∂θ
2
= 2s
22
· θ
2
+2s
12
· θ
1
+q
2
= 0 (5.37)
106
Such a system of linear equations can be solved efficiently:
θ
1
=
s
12
· q
2
−s
22
· q
1
2s
11
· s
22
−2s
2
12
(5.38)
θ
2
=
s
12
· q
1
−s
11
· q
2
2s
11
· s
22
−2s
2
12
(5.39)
With θ
1
and θ
2
, we can compute θ
0
from (5.26).
From the above discussion, we see that for a random variable V with any distribu-
tion, if we know its mean µ
v
, variance σ
2
v
, and skewness γ
v
, we can obtain the fitting
parameters Θ for max(V, 0) through closed-form formulas.
Notice that in (5.4) we try to minimize the mean square error within the ±3σ range.
If µ
v
>3σ
v
, V is always larger than zero within the ±3σ range. That is: max(V, 0) =V
when V ∈ (µ
v
−3σ
v
, µ
v
+3σ
v
). From (5.4), it is easy to find that in this case, Θ =
(0, 1, 0). That means:
max(V, 0) ≈V when µ
v
> 3σ
v
(5.40)
To show the accuracy of our second-order fitting approach to the max approxima-
tion, we compare the results obtain from our approach, the linear fitting method, and
the exact (or Monte Carlo) computation. For example, by assuming V ∼N(0.7, 1), the
exact max operation max(V, 0), linear fitting through through tightness probability, and
our second-order fitting can be obtained as shown in Fig. 5.2.2, where the x-axis is V,
and y-axis is the results of max(V, 0). The corresponding PDFs of the three approaches
are shown in Fig. 5.2.2. From the figures, we see that our proposed second-order fitting
method is more accurate than the linear fitting method. In particular, the PDF of our
second-order fitting method has a sharp peak which approximates the impulse of exact
PDF well as shown in Fig. 5.2.2. In contrast, the linear fitting method can only give a
smooth PDF, which is far away from the exact max result.
107
−2 −1 0 1 2 3
−1
−0.5
0
0.5
1
1.5
2
2.5
3
3.5
4
Max(V,0)
Second−Order Fit
Linear Fit
(a)
−2 −1 0 1 2 3
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Max(V,0)
Second−Order Fit
Linear Fit
(b)
Figure 5.1: (a) Comparison of exact computation, linear fitting, and second-order fit-
ting for max(V, 0); (b) PDF comparison of exact computation, linear fitting, and sec-
ond-order fitting for max(V, 0).
5.3 Quadratic SSTA
5.3.1 Quadratic Delay Model
In Section 5.2, we introduce the second-order fitting of max operation. Here we will
apply such fitting in SSTA.
We first discuss the delay model. In practice, the circuit delay is a complicate
function of variation sources:
D = f (X
1
, X
2
, . . . X
n
, R) (5.41)
where X = (X
1
, X
2
, . . . X
n
)
T
are global variation sources, R is local random variation, n
is the number of global variation sources. Here, we assume that X
i
’s and R are mutually
independent. Without loss of generality, we assume all X
i
’s and R have zero mean
and unit variance. In order to simplify the above delay model, we apply the Taylor
expansion to approximate it. Considering that the scale of local random variation is
108
usually smaller than that of global variation, we use second order Taylor expansion to
approximate X
i
’s and use first order expansion to approximate R:
D ≈ d
0
+
n

i, j=1
b
i j
X
i
X
j
+
n

i=1
a
i
X
i
+rR
= d
0
+AX +X
T
BX +rR (5.42)
where d
0
is the nominal delay, A = (a
1
, a
2
, . . . a
n
) are the linear delay sensitivity co-
efficients of the global variation sources, B = (b
i j
) are the second-order sensitivity
coefficients, which is an n×n matrix, and r is the linear delay sensitivity coefficient of
the local random variation. A, B, and r are calculated in the similar way as [180]:
a
i
=
∂f
∂X
i
(5.43)
b
i j
=

2
f
∂X
i
∂X
j
(5.44)
r =
∂f
∂R
(5.45)
In practice, f is a complex function and can not be expressed as closed-form formula.
In this case, the coefficients A, B, and r can be obtained by measurement or SPICE
simulation. Moreover, in practice, the local random variation is caused by several
independent factors; therefore, the local random variation R is the sum of several in-
dependent random variables. According to the central limit theorem, R will be close
to a Gaussian random variable. In this chapter, for simplicity, we assume that R is a
random variable with standard normal distribution. Finally, we obtain our quadratic
delay model as follows:
D = d
0
+AX +X
T
BX +rR (5.46)
In the rest of this chapter, we use m
i,k
and m
r,k
to represent the kth central moment
for X
i
and the kth moment for R, respectively. Notice that all X
i
’s and R are with zero
mean, that means their central moments and raw moments are the same. Moreover,
109
we also assume that the moments of the variation sources, m
i,k
and m
r,k
, are known. In
practice, such moments can be computed from the samples of variation sources.
To compute the arrival time in a block-based SSTA framework, two atomic opera-
tions, max and sum, are needed. That is, given D
1
and D
2
with the quadratic form of
(5.46):
D
1
= d
01
+A
1
X +X
T
B
1
X +r
1
R
1
, (5.47)
D
2
= d
02
+A
2
X +X
T
B
2
X +r
2
R
2
, (5.48)
we want to compute
D
m
= max(D
1
, D
2
)
= d
m0
+A
m
X +X
T
B
m
X +r
m
R
m
, (5.49)
D
s
= D
1
+D
2
= d
s0
+A
s
X +X
T
B
s
X +r
s
R
s
. (5.50)
We will present how these two operations are handled in the rest of this section.
5.3.2 Max Operation for Quadratic Delay Model
The max operation is the most difficult operation for block-based SSTA. Based on the
second-order polynomial fitting method as discussed in Section 5.2.2, we propose a
novel technique to compute the max of two random variables. The overall flow of the
max operation is illustrated in Figure 5.2. Considering
D
m
= max(D
1
, D
2
) = max(D
1
−D
2
, 0) +D
2
, (5.51)
let D
p
= D
1
−D
2
. Without loss of generality, we assume E[D
p
] > 0. We first compute
the mean and variance of D
p
, if µ
D
p
> 3σ
D
p
, then D
m
= D
1
, otherwise, we compute
the joint moments between D
p
and X
i
’s and the skewness of D
p
. Knowing the mean,
variance, and skewness of D
p
, we apply the method as shown in Section 5.2.2 to find
110
the fitting coefficients Θ = (θ
0
, θ
1
, θ
2
) for max(D
p
, 0). Finally, we use the moment
matching method to reconstruct the quadratic form of D
m
.
Input: Quadratic form of D
1
and D
2
Output: Quadratic form of D
m
1. Let D
p
= D
1
−D
2
, compute µ
Dp
and σ
Dp
if µ
Dp
≥3σ
Dp
{
2. Dm = D
1
}
else {
3. Compute joint moments between D
2
p
and X
i
4. Compute γ
Dp
5. Get fitting coefficients Θ
6. Compute joint moments between D
m
and X
i
7. Reconstruct the quadratic form of D
m
}
Figure 5.2: Algorithm for computing max(D
1
,D
2
).
5.3.2.1 Mean and Variance of D
p
In order to compute the mean and variance, we first obtain the quadratic form of D
p
as
follows:
D
p
= D
1
−D
2
= d
p0
+A
p
· X +X
T
B
p
X +r
1
· R
1
−r
2
· R
2
(5.52)
d
p0
= d
01
−d
02
(5.53)
A
p
= A
1
−A
2
(5.54)
B
p
= B
1
−B
2
(5.55)
Because X
i
’s, R
1
, and R
2
are mutually independent random variables with mean 0
111
and variance 1, the mean and variance of D
p
can be computed as:
µ
D
p
= E[D
p
] = d
p0
+
n

i=1
b
pii
(5.56)
σ
2
D
p
= E[(D
p
−µ
D
p
)
2
]
= r
2
1
+r
2
2
+
n

i=1
(a
2
pi
+b
pii
m
i,4
+2a
pi
b
pii
m
i,3
) +

1≤i<j≤n
2(b
2
pi j
+b
pii
b
pj j
) −(
n

i=1
b
pii
)
2
(5.57)
5.3.2.2 Joint Moments and Skewness of D
p
From the definition of D
p
as shown in (5.52), the joint moments between D
2
p
and X
i
,
X
2
i
, and R can be computed as:
E[X
i
D
2
p
] = 2d
p0
a
pi
+(a
2
pi
+2d
p0
b
pii
)m
i,3
+2a
pi
b
pii
m
i,4
+b
2
pii
m
i,5
+


1≤j≤n
j=i

a
pi
b
pj j
+2a
pj
b
p
i j +
(b
pii
b
pj j
+2b
2
pi j
)m
i,3
+2b
pi j
b
pj j
m
j,3

(5.58)
E[R
1
D
2
p
] = r
2
1
m
r,3
+2r
1
µ
D
p
(5.59)
E[R
2
D
2
p
] = r
2
2
m
r,3
−2r
1
µ
D
p
(5.60)
E[X
2
i
D
2
p
] = d
2
p0
+r
2
1
+r
2
2
+2d
p0
a
pi
m
i,3
+2a
pi
b
pii
m
i,5
+
(a
2
pi
+2d
p0
b
pii
) · m
i,4
+b
2
pii
m
i,6
+

1≤j≤n
j=i

a
2
pj
+2d
p0
b
pj j
+2(a
pi
b
pj j
+2a
pj
b
pi j
)m
i,3
+2a
pj
b
pj j
m
j,3
+
2(b
pii
b
pj j
+4b
2
pi j
) · m
i,4
+b
2
pj j
m
j,4
+4b
pj j
b
pi j
m
i,3
m
j,3

+


1≤j<k≤n
j,k=i
(b
pj j
b
p
kk +2b
2
pjk
) (5.61)
112
E[X
i
X
j
D
2
p
] = 2a
pi
a
pj
+4d
p0
b
pi j
+2· (2a
pi
b
pi j
+a
pj
b
pii
)m
i,3
+
4b
pii
b
pi j
m
i,4
+2· (2a
pj
b
pi j
+a
pi
b
pj j
)m
j,3
+4b
pj j
b
pi j
m
j,4
+
2· (2b
2
pi j
+b
pii
b
p
j j)m
i,3
m
j,3
+4·

1≤k≤n
k=i, j
b
pi j
b
pkk
(5.62)
With the joint moments computed above, the raw moments, central moments, and
skewness of D
p
is computed as:
E[D
2
p
] = µ
2
D
p

2
D
p
(5.63)
E[D
3
p
] = E[D
p
· D
2
p
]
= E[(d
p0
+A
p
X +X
T
B
p
X +r
1
R
1
−r
2
R
2
)D
2
p
]
= d
p0
· E[D
2
p
] +r
1
· E[R
1
D
2
p
] −r
2
· E[R
2
D
2
p
] +
n

i=1

a
pi
· E[X
i
D
2
p
] +b
pii
· E[X
2
i
D
2
p
]

+

0≤i<j≤n
b
pi j
E[X
i
X
j
· D
2
p
] (5.64)
m
D
p
(3) = E[(D−µ
D
)
3
]
= E[D
3
p
] −3µ
D
p
· E[D
2
p
] +2µ
3
D
p
(5.65)
γ
D
p
= m
D
p
(3)/σ
3
D
p
(5.66)
5.3.2.3 Reconstruct the Quadratic Form of Dm
Knowing µ
D
p
, σ
D
p
, and γ
D
p
, we apply the second-order fitting method to compute the
fitting parameters Θ = (θ
0
, θ
1
, θ
2
) for max(D
p
, 0). Then D
m
can be represented as:
D
m
= max(D
1
, D
2
) = max(D
p
, 0) +D
2
≈ θ
2
· D
2
p

1
· D
p

0
+D
2
= θ
2
· D
2
p
+D
q

0
(5.67)
D
q
= θ
1
· D
p
+D
2
(5.68)
113
The above equation gives the closed- form formula of D
m
. However, because D
p
and
D
q
are in quadratic form as (5.46), the representation of D
m
in (5.67) is a 4
th
order
polynomial of X
i
’s. In order to reconstruct the quadratic form for D
m
, we first compute
the the mean of D
m
, and joint moments between D
m
and variation sources:
E[D
m
] = θ
2
· E[D
2
p
] +E[D
q
] +θ
0
(5.69)
E[X
i
D
m
] = θ
2
· E[X
i
D
2
p
] +E[X
i
D
q
] (5.70)
E[X
2
i
D
m
] = θ
2
· E[X
2
i
D
2
p
] +E[X
2
i
D
q
] +θ
0
(5.71)
E[X
i
X
j
D
m
] = θ
2
· E[X
i
X
j
D
2
p
] +E[X
i
X
j
D
q
] (5.72)
E[R
1
D
m
] = θ
2
· E[R
1
D
2
p
] +E[R
1
D
q
] (5.73)
E[R
2
D
m
] = θ
2
· E[R
2
D
2
p
] +E[R
2
D
q
] (5.74)
The joint moments between D
2
p
and X
i
’s are computed in (5.58)-(5.62). The joint
moments between D
q
and variation sources are:
E[D
q
] = d
q0
+
n

i=1
b
qii
(5.75)
E[X
i
D
q
] = a
qi
m
i,2
+b
qii
m
i,3
(5.76)
E[X
2
i
D
q
] = d
q0
+a
qi
m
i,3
+b
qii
m
i,4
+

1≤j≤n
j=i
b
qj j
(5.77)
E[X
i
X
j
D
q
] = 2b
qi j
(5.78)
E[R
1
D
q
] = θ
1
· r
1
(5.79)
E[R
2
D
q
] = (1−θ
1
) · r
2
(5.80)
114
Because we want to reconstruct D
m
in the quadratic form, as shown in (5.49), by
applying the moment matching technique similar to [178], we have:
E[D
m
] =
n

i=1
b
mii
+d
m0
(5.81)
E[X
i
D
m
] = a
mi
+b
mii
· m
i,3
(5.82)
E[X
2
i
D
m
] = a
mi
· m
i,3
+b
nii
· m
i,4
+

1≤j≤n
j=i
b
mj j
(5.83)
E[X
i
X
j
D
m
] = 2b
mi j
(5.84)
With E[D
m
], E[X
i
D
m
], E[X
2
i
D
m
], and E[X
i
X
j
D
m
] computed in (5.69)-(5.72), the sen-
sitivity coefficients A
m
= (a
mi
) and B
m
= (b
mi j
) can be obtained by solving the linear
equations above:
b
mii
=
E[X
2
i
D
m
] −E[D
m
] −E[X
i
D
m
]m
i,3
m
i,4
−m
2
i,3
−1
(5.85)
b
mi j
= E[X
i
X
j
D
m
] (5.86)
a
mi
= E[X
i
D
m
] −b
mii
· m
i,3
(5.87)
d
m0
= E[D
m
] −
n

i=1
b
mii
(5.88)
Finally, we consider the random term of D
m
. Because the random term in D
m
comes from the random terms in D
1
and D
2
, we assume that r
m
R
m
= r
m1
R
1
+r
m2
R
2
.
Because the random variation sources R
m
, R
1
, and R
2
are Gaussian random variables,
by applying the moment matching technique similar to (5.84), we have:
r
m1
= E[R
1
· D
m
] (5.89)
r
m2
= E[R
2
· D
m
] (5.90)
r
m
=

r
2
m1
+r
2
m2
(5.91)
Where E[R
1
· D
m
] and E[R
2
· D
m
] are computed in (5.73) and (5.74).
115
5.3.3 Sum Operation for Quadratic Delay Model
The sum operation is straight forward. The coefficients of D
s
can be computed by
adding the correspondent coefficients of D
1
and D
2
:
d
s0
= d
01
+d
02
(5.92)
A
s
= A
1
+A
2
(5.93)
B
s
= B
1
+B
2
(5.94)
r
s
=

r
2
1
+r
2
2
(5.95)
5.3.4 Computational Complexity of Quadratic SSTA
For each max operation in SSTA based on a quadratic delay model, we need to cal-
culate n
2
joint moments between D
p
and X
i
X
j
s; for each joint moment, we need to
compute the sum of n numbers; hence the computational complexity is O{n
3
}, where
n is the number of variation sources. For the sum operation, we need to compute the
sum of two n ×n metric; hence the computational complexity is O{n
2
}. Because the
total number of max and sum operations is linear with respect to circuit sizes N, the
total complexity is O{n
3
N}.
5.4 Semi-Quadratic SSTA
5.4.1 Semi-Quadratic Delay Model
In Section 5.3, we introduce the SSTA for quadratic delay model. However, in practice,
the impact of the crossing terms are usually very weak [178, 19]. Therefore, if we
ignore the crossing terms, we may speed up the SSTA process without affecting the
accuracy too much. Ignoring the crossing terms, the quadratic delay model in (5.46) is
116
re-written as:
D = d
0
+AX +BX
2
+rR (5.96)
where d
0
, A, r, and R are defined in the similar way as the quadratic delay model in
(5.46), X
2
= (X
2
1
, X
2
2
, . . . , X
2
n
)
T
are the square of variation sources, and B= (b
1
, b
2
, . . . , b
n
)
are the second order sensitivity coefficients for the square terms. We defined such
quadratic model without crossing terms as Semi-Quadratic Delay Model. We will in-
troduce the atomic operations, max and sum, for the semi-quadratic delay model in the
rest of this section.
5.4.2 Max Operation for Semi-Quadratic Delay Model
The overall flow of the max operation of semi-quadratic delay model is similar to that
of quadratic delay model as illustrated in Figure 5.2. The only difference is that we do
not need to compute the joint moments between the crossing terms and D
m
.
5.4.2.1 Moments of D
p
We defined D
p
= D
1
−D
2
in a similar way as (5.52). In order to compute the central
moments of D
p
, we first rewrite D
p
to the following form:
D
p
= d

p0
+
n

i=1
Y
pi
+r
1
R
1
−r
2
R
2
(5.97)
d

p0
= d
p0
+
n

i=1
b
pi
(5.98)
Y
pi
= a
pi
X
i
+b
pi
X
2
i
−b
pi
(5.99)
with a
pi
= a
1i
−a
2i
and b
pi
= b
1i
−b
2i
. Because X
i
s are mutually independent random
variables with mean 0 and variance 1, Y
pi
s are independent random variables with zero
117
mean. Therefore, the first 3 central moments of D
p
are:
µ
D
p
= d

p0
(5.100)
σ
2
D
p
= r
2
1
+r
2
2
+
n

i=1
E[Y
2
pi
] (5.101)
m
D
p
(3) = E[(D
p
−µ
D
p
)
3
] = r
3
1
+r
3
2
+
n

i=1
E[Y
3
pi
] (5.102)
where
E[Y
2
pi
] = b
2
pi
(m
i,4
−1) +2a
pi
b
pi
m
i,3
+a
2
pi
(5.103)
E[Y
3
pi
] = 3a
pi
b
2
pi

m
i,5
−2m
i,3

+3a
2
pi
b
pi

m
i,4
−1

+
a
3
pi
m
i,3
+b
3
pi

m
i,6
−3m
i,4
+2

(5.104)
With the central moments of D
p
, the raw moments and skewness can be computed
easily.
5.4.2.2 Reconstruct the Semi-Quadratic Form of D
m
Similar to the quadratic SSTA, by knowing the mean, variance and skewness of D
p
,
the fitting coefficients Θ can be obtained. Then the joint moments between X
i
s and D
m
can be computed in the similar ways as (5.69)-(5.74). Here the joint moments between
X
i
s and D
2
p
, D
q
are:
E[X
i
D
2
p
] = E[X
i
Y
2
pi
] +2d

0
E[X
i
Y
pi
] (5.105)
E[X
2
i
D
2
p
] = E[X
2
i
Y
2
pi
] +2d

0
E[X
2
i
Y
pi
] +E[X
2
i
]E[(D
p
−Y
pi
)
2
] (5.106)
E[X
i
D
q
] = E[X
i
Y
qi
] (5.107)
E[X
2
i
D
q
] = E[X
2
Y
qi
] (5.108)
where E[Y
2
pi
] are computed in (5.102), E[(D
p
−Y
pi
)
2
] = σ
2
D
p
−E[Y
2
pi
], and Y
qi
s are
defined in the similar way as Y
pi
s:
Y
qi
= a
qi
X
i
+b
qi
X
2
i
−b
qi
(5.109)
118
with a
qi
= θ
1
(a
1
−a
2
) +a
2
and b
qi
= θ
1
(b
1
−b
2
) +b
2
. The joint moments between
X
i
s and Y
pi
s are:
E[X
2
i
Y
2
pi
] =

a
2
pi
−2
2
pi

m
i,4
+b
2
pi

m
i,6
−1

+2a
pi
b
pi

m
i,5
−m
i,3

(5.110)
E[X
2
i
Y
pi
] = a
pi
m
i,3
+b
pi

m
i,4
−1

(5.111)
E[X
i
Y
2
pi
] = b
2
pi
m
i,5
+2a
pi
b
pi

m
i,4
−1

+

a
2
pi
−2b
2
pi

m
i,3
(5.112)
E[X
i
Y
pi
] = a
pi
+b
pi
m
i,3
(5.113)
The joint moments between X
i
s and Y
qi
s can be computed in the same way.
Knowing the joint moments between X
i
s and D
m
, by applying the moment match-
ing technique, we will have similar equations as (5.84). And then A
m
and B
m
can be
obtained by solving such equations. Finally, we can compute the random term of D
m
in the same way as the quadratic model, as shown in (5.91).
The sum operation of the semi-quadratic delay model is similar to that of the
quadratic delay model. We can compute D
s
= D
1
+D
2
by summing up the corre-
spondent coefficients.
From the discussion above, it is easy to see that for the semi-quadratic SSTA, the
computational complexity for both max and sum operation is O{n}, where n is the
number of variation sources. The number of max and sum operations is linear to the
circuit size.
5.5 Linear SSTA
5.5.1 Linear Delay Model
In the previous sections, we introduce the SSTA for non-linear delay model. In prac-
tice, when the variation scale is small, the circuit delay can be approximated as a linear
119
function of variation sources:
D = d
0
+A· X +r · R (5.114)
where X, d
0
, A, r, and R are defined in the similar way as the quadratic delay model
in (5.46). The difference is that there are no second order terms. Hence the atomic
operations of the linear delay model are much simpler than those of quadratic delay
model. We will introduce such operations in the rest of this section.
5.5.2 Max Operation for Linear Delay Model
The overall flow of the max operation of linear delay model is similar to that of
quadratic delay model as illustrated in Figure 5.2 except that we do not need to com-
pute the high order joint moments between D
m
and X
i
s.
5.5.2.1 Mean and Variance of D
p
The linear form of D
p
= D
1
−D
2
can be computed in the similar way as (5.52). Then
the mean and variance of D
p
can be computed as:
µ
D
p
= E[D
p
] = d
p0
; (5.115)
σ
2
D
p
= E[(D−µ
D
)
2
] =
n

i=1
a
2
pi
+r
2
1
+r
2
2
(5.116)
5.5.2.2 Joint Moments and Skewness of D
p
The joint moments between D
2
p
and the variation sources and random variation can be
computed as:
E[X
i
· D
2
p
] = a
2
pi
· m
i,3
+2a
pi
· d
p0
(5.117)
120
With the joint moments computed above, the third order raw moments D
p
is computed
as:
E[D
3
p
] =
n

i=1
a
i
E[X
i
· D
2
p
] +r
1
E[R
1
· D
2
p
] +r
2
E[R
2
· D
2
p
] (5.118)
The second order raw moment, central moments, and skewness of D
p
can be computed
in the same way as the quadratic SSTA in (5.66).
5.5.2.3 Reconstruct the Linear Canonical Form of Dm
Knowing the mean, variance, and skewness of D
p
, we may apply the method in Sec-
tion 5.2.2 to compute the fitting parameters Θ for max(D
p
, 0), and then represent D
m
in the form of:
D
m
≈ θ
2
· D
2
p
+D
q
+ p
0
(5.119)
where D
q
is defined in the similar way as (5.67). The above equation gives the closed-
form formula of D
m
, however, such formula is not in the linear canonical form. We
still apply moment matching technique to reconstruct D
m
to the linear form. First, the
mean of D
m
and the joint moments between Dm and variation sources are:
E[D
m
] = θ
2
· E[D
2
p
] +E[D
q
] + p
0
(5.120)
E[X
i
· D
m
] = θ
2
· E[X
i
· D
2
p
] +E[X
i
· D
q
] (5.121)
Where E[D
2
p
] is computed in (5.118). The joint moments between D
q
and variation
sources are computed as:
E[D
q
] = d
s0
(5.122)
E[X
i
· D
q
] = a
qi
(5.123)
121
With the joint moments, by applying the moment matching technique in [178], we
have:
a
mi
= E[X
i
· D
m
] (5.124)
d
m0
= E[D
m
] (5.125)
Finally, the random term r
m
can be computed in the same way as the quadratic model
in (5.91).
For the linear SSTA we discussed above, the computational complexity for both
max and sum operation is O{n}. Similar to the quadratic SSTA, the number of max
and sum operations is linear to the circuit size.
5.6 Experimental Result
We have implemented our SSTA algorithmin Cfor all three delay models we discussed
before: quadratic delay model (Quad SSTA), semi-quadratic delay model (Semi-Quad
SSTA), and linear delay model (Lin SSTA). We also define three comparison cases:
(1) our implementation of the linear SSTA for Gaussian variation sources in [163],
which we refer to as Lin Gau; (2) the non-linear SSTA using Fourier Series approxi-
mation for non-Gaussian variation sources in [40] (Fourier SSTA); (3) 100,000-sample
Monte-Carlo simulation (MC). We apply all the above methods to the ISCAS89 suite
of benchmarks in TSMC 90nm technology.
In our experiment, we consider two types of variation sources gate channel length
L
gate
and threshold voltage V
th
. For each type of variation source, inter-die, intra-die
spatial, and intra-die random variation are considered. We use the grid-based model
in [172, 173] to model the spatial variation. The number of grids (the number of
spatial variation sources) is determined by the circuit size and larger circuits have more
variation sources. We also assume that the 3σ value of the inter-die (σ
g
), intra-die
122
spatial (σ
s
), and intra-die random (σ
r
) variation are 10%, 10%, and 5% of the nominal
value, respectively. In the following, we perform the experiments for two variation
setting: (1) both L
gate
and V
th
have skew-normal distributions [16]; and (2) L
gate
has
a normal distribution and V
th
has a Poisson distribution. The experimental setting is
shown in Table 5.1.
Setting L
gate
dist V
th
dist 3σ
g

s

r
(1) Skewnormal Skewnormal 10% 10% 5%
(2) Normal Poisson 10% 10% 5%
Table 5.1: Experiment setting. The 3σ value is the normalized with respect to the
nominal value.
Fig. 5.3 illustrates the PDF comparison for circuit s15850 under variation setting
(1). In the figure, delay is normalized with respect to the nominal value. From the fig-
ure, we find that, compared to the Monte-Carlo simulation, the Quad SSTA is the most
accurate, the next is Semi-Quad SSTA, and Lin SSTA is least accurate. Such result is
expected, because the Quad SSTA captures all the second-order effects; the Semi-Quad
SSTA captures only partial second order effects; while the Lin SSTA captures only lin-
ear effects and ignores all non-linear effects. Moreover, Fourier SSTA [40] has similar
accuracy as Semi-Quad SSTA because they both apply semi-quadratic delay model. In
addition, we also find that all these three non-Gaussian SSTA is more accurate than
Lin Gau [163].
Table 5.2 compares the run time in second (T), and the error percentage of mean
(µ), standard deviation (σ), and skewness (γ) under variation setting (1). In the table,
the error percentage of mean (µ), standard deviation (σ), and 95-percentile point (95%)
is computed as 100 ×(MC value −SSTA value)/σ
MC
, and the error percentage of
skewness γ is computed as 100 ×(γ
MC
−γ
SSTA
)/γ
MC
. Moreover, the average error
in the table is average of the absolute value; and the average runtime is the average
runtime ratio between SSTA and Monte-Carlo simulation. For each benchmark, G
123
0.6 0.7 0.8 0.9 1 1.1 1.2 1.3 1.4 1.5
0
0.5
1
1.5
2
2.5
3
3.5
Normalized Delay
MC
Quad SSTA
Semi−Quad SSTA
Lin SSTA
Lin Gau
Fourier SSTA
Figure 5.3: PDF comparison for circuit s15850.
refers to the number of gates; and N refers to the total number of variation sources.
For fair comparison, when we perform Lin Gau on non-Gaussian variation sources, we
first approximate the non-Gaussian random variables to the Gaussian random variables
by matching the mean and variance, then use Lin Gau to obtain the linear canonical
form of the circuit delay, finally, we still use the non-Gaussian variation sources to
reconstruct the PDF of the circuit delay. From the table, we see that for the Quad
SSTA, the error of mean, standard deviation, and 95-percentile point is within 0.3%,
and the error of skewness is within 5%. Semi-Quad SSTA, Lin SSTA result similar
mean and standard deviation error, but the error of skewness and 95-percentile point
is much larger, especially for the Lin SSTA, the error of skewness is up to 30% and
the error of 95-percentile point is up to 2%. This is because the Lin SSTA ignores all
non-linear effects which significantly affect the skewness, and the inaccurate skewness
results larger error of 95-percentile point. Compared to Lin Gau, all the three non-
Gaussian SSTA methods predict more accurate mean values and 95-percentile points,
which is the most important timing characteristic, and the non-linear SSTA methods,
124
Quad SSTA and Semi-Quad SSTA also give much more accurate skewness. Moreover,
we also find that both Semi-Quad SSTA and Lin SSTA have similar run time as Lin
Gau, but the run time of Quad SSTA is longer especially when the number of variation
sources is large. This is because the computational complexity of Semi-Quad SSTA,
Lin SSTA, and Lin Gau is the same, but Quad SSTA has higher complexity than the
other two SSTA methods. We also observe that compared to Fourier SSTA, Semi-
Quad SSTA has similar accuracy with almost 70X speed up. This is due to the fact
that Semi-Quad SSTA uses the same semi-quadratic delay model as Fourier SSTA,
while the computational complexity of Semi-Quad SSTA (O{n}) is lower than that
of Fourier SSTA (O{nk
2
}), where n is the number of variation sources, and k is the
maximum number of orders of Fourier Series [40]. After all, the run time of all the
SSTA methods is significantly shorter than the Monte-Carlo simulation.
Table 5.3 illustrates the results under variation setting (2). From this table, we
can find the similar trend as Table 5.2. Moreover, we also find that our methods are
still highly accurate under such variation setting. For Quad SSTA, the error of mean,
standard deviation, and 95-percentile point is within 1%, and the error of skewness is
within 6%. This shows that our approach works well for different distributions.
125
bench G N Quad SSTA Semi-Quad SSTA Lin SSTA Lin Gau Fourier SSTA MC
name µ σ γ 95% T µ σ γ 95% T µ σ γ 95% T µ σ γ 95% T µ σ γ 95% T T
s208 61 12 -0.02 0.04 1.03 -0.03 0.01 -0.02 0.04 -10.7 -0.23 0.01 -0.02 -0.32 -22.8 -1.1 0.01 -0.02 -0.32 -22.8 -2.99 0.01 -0.04 -0.07 -9.5 -0.4 1.01 15.3
s386 118 18 -0.08 0.01 2.52 -0.03 0.01 -0.05 -0.05 -9.45 -0.36 0.01 -0.06 -0.4 -22.5 -1.23 0.01 -0.04 -0.36 -23 -2.96 0.01 0.02 -0.02 -10.9 0.36 1.17 33.6
s400 106 25 0.23 -0.06 -0.88 0.13 0.03 0.3 -0.12 -15.2 -0.21 0.01 0.29 -0.65 -31.4 -1.51 0.01 0.35 -0.66 -31.7 -3.56 0.01 -0.17 -0.24 -10 -0.2 1.28 38.7
s444 119 25 -0.1 -0.12 2.92 -0.17 0.04 -0.09 -0.15 -12.2 -0.51 0.01 -0.09 -0.56 -26.8 -1.56 0.01 -0.07 -0.5 -27.1 -3.37 0.01 -0.02 -0.22 -9.95 -0.57 1.38 41.9
s832 262 33 -0.06 -0.04 -0.11 -0.17 0.16 -0.06 -0.05 -12.1 -0.4 0.01 -0.06 -0.5 -27.6 -1.53 0.01 -0.06 -0.5 -27.6 -3.5 0.01 -0.1 -0.25 -14 -0.32 1.91 103
s953 311 43 -0.09 -0.11 2.03 -0.32 0.17 -0.15 -0.24 -9.87 -0.86 0.01 -0.17 -0.68 -27.7 -1.98 0.01 -0.01 -0.67 -28.1 -3.7 0.01 -0.02 -0.14 -7.23 -0.59 1.99 122
s1238 428 55 0.12 0.01 0.59 0.15 0.43 0.22 -0.04 -10.4 -0.04 0.01 0.21 -0.47 -26.5 -1.12 0.01 0.28 -0.45 -26.6 -2.98 0.01 -0.16 -0.04 -6.59 -0.63 2.17 201
s1423 490 68 -0.01 -0.08 -1.25 0.06 0.75 -0.01 -0.08 -10.2 -0.14 0.01 -0.01 -0.49 -25.2 -1.18 0.01 -0.01 -0.47 -25.3 -3.12 0.01 -0.21 -0.01 -4.15 -0.3 2.18 288
s1494 588 68 -0.08 -0.08 3.63 0.08 1.06 -0.07 -0.16 -6.02 -0.24 0.01 -0.08 -0.52 -20.6 -1.14 0.01 -0.01 -0.44 -21.4 -2.85 0.01 -0.2 -0.12 -4.41 -0.22 2.28 320
s5378 1004 82 0.06 -0.12 4.71 -0.16 4.95 0.09 -0.2 -6.72 -0.49 0.05 0.08 -0.66 -26.7 -1.65 0.02 0.18 -0.56 -27.4 -3.31 0.02 -0.07 -0.11 -5.68 -0.6 5.42 1238
s9234 2027 99 0.01 -0.11 0.98 -0.24 10.4 0.04 -0.23 -8.74 -0.61 0.09 0.02 -0.71 -28.4 -1.82 0.04 0.29 -0.66 -28.7 -3.43 0.03 0.04 -0.07 -13.7 0.61 6.89 2801
s13207 2573 115 -0.01 -0.01 3.04 -0.06 25.9 0.02 -0.09 -4.4 -0.33 0.15 0.01 -0.49 -22.2 -1.34 0.07 0.16 -0.49 -22.2 -3.09 0.06 0.09 -0.19 -13.3 0.54 9.04 4717
s15850 3448 135 0.18 -0.02 2.66 0.27 38.9 0.23 -0.07 -4.85 0.06 0.19 0.22 -0.52 -24.6 -1.11 0.08 0.51 -0.53 -24.7 -2.92 0.08 0.18 -0.09 -4.38 0.64 11.5 6484
s38417 8709 176 0.11 -0.01 2.22 0.07 202 0.13 -0.04 -3.8 -0.07 0.62 0.13 -0.49 -23.1 -1.21 0.26 0.24 -0.47 -23.2 -3.16 0.24 0.08 -0.19 -10.5 0.79 23.2 19116
s38584 11448 176 -0.05 -0.02 0.1 0.01 218 -0.04 -0.12 -6.37 -0.29 0.68 -0.06 -0.57 -25.1 -1.44 0.29 0.11 -0.57 -25.2 -3.31 0.25 -0.11 -0.19 -14.4 -0.53 25.1 19459
Ave - - 0.08 0.06 1.91 0.13 1/720 0.1 0.11 8.74 0.32 1/19814 0.1 0.54 25.4 1.39 1/35819 0.15 0.51 25.7 3.22 1/39248 0.1 0.13 9.25 0.49 1/260 -
Table 5.2: Error percentage of mean, standard deviation, skewness, and 95-percentile point for variation setting
(1). Note:the error percentage of mean (µ), standard deviation (σ), and 95-percentile point (95%) is computed as
100×(MC value−SSTA value)/σ
MC
, and the error percentage of skewness γ is computed as 100×(γ
MC
−γ
SSTA
)/γ
MC
.
1
2
6
bench G N Quad SSTA Semi-Quad SSTA Lin SSTA Lin Gau Fourier SSTA MC
name µ σ γ 95% T µ σ γ 95% T µ σ γ 95% T µ σ γ 95% T µ σ γ 95% T T
s208 61 12 -0.03 0.05 1.64 -0.05 0.01 -0.03 0.04 -20 -0.35 0.01 -0.05 -0.07 -58.9 -0.81 0.01 -0.05 -0.07 -58.9 -1.17 0.01 0.02 -0.25 -15 0.63 0.99 15.4
s386 118 18 -0.07 0.01 5.54 -0.05 0.02 -0.05 -0.04 -16.5 -0.4 0.01 -0.08 -0.16 -58.2 -0.91 0.01 -0.08 -0.11 -58.8 -1.15 0.01 -0.01 -0.02 -16.5 -0.34 1.2 33.9
s400 106 25 0.25 -0.07 -2.48 0.25 0.02 0.32 -0.13 -25.7 -0.18 0.01 0.29 -0.3 -71.3 -0.93 0.01 0.35 -0.3 -71.9 -1.21 0.01 -0.15 -0.27 -20.8 -0.36 1.27 39
s444 119 25 -0.09 -0.12 3.27 -0.23 0.03 -0.08 -0.14 -22.4 -0.62 0.01 -0.1 -0.28 -65 -1.22 0.01 -0.09 -0.19 -65.1 -1.43 0.01 0.01 -0.09 -13.6 0.3 1.38 42.1
s832 262 33 -0.05 -0.04 -3.26 -0.14 0.14 -0.05 -0.05 -23.5 -0.48 0.01 -0.08 -0.19 -66.9 -1.1 0.01 -0.08 -0.19 -66.9 -1.45 0.01 -0.01 -0.18 -13.4 -0.38 1.93 103
s953 311 43 -0.09 -0.11 3.96 -0.44 0.18 -0.15 -0.23 -16.6 -1 0.01 -0.19 -0.38 -66.9 -1.67 0.01 -0.02 -0.35 -67.4 -1.75 0.01 -0.12 -0.22 -16.5 -0.38 1.95 122
s1238 428 55 0.11 -0.02 -3.92 0.09 0.43 0.21 -0.05 -22 -0.13 0.01 0.18 -0.19 -69.8 -0.76 0.01 0.24 -0.16 -70.2 -1 0.01 -0.17 -0.05 -16.1 -0.58 2.15 201
s1423 490 68 -0.01 -0.08 -4.96 0.01 0.75 -0.01 -0.09 -20.2 -0.22 0.01 -0.03 -0.22 -64.1 -0.82 0.01 -0.02 -0.2 -64 -1.2 0.01 0.16 -0.26 -16.9 0.59 2.19 288
s1494 588 68 -0.08 -0.11 1.71 0.03 1.05 -0.07 -0.18 -16.9 -0.32 0.01 -0.1 -0.3 -59 -0.85 0.01 -0.04 -0.21 -58.3 -1.05 0.01 0.17 -0.15 -16.7 0.6 2.29 322
s5378 1004 82 0.05 -0.17 1.26 -0.14 4.95 0.08 -0.25 -16.1 -0.54 0.04 0.04 -0.41 -68.3 -1.23 0.02 0.12 -0.27 -68.5 -1.24 0.01 -0.02 -0.1 -19.4 -0.39 5.44 1238
s9234 2027 99 0.01 -0.13 2.52 -0.22 10.4 0.04 -0.25 -12.5 -0.65 0.08 0.01 -0.42 -68.6 -1.39 0.03 0.25 -0.34 -69 -1.27 0.04 -0.1 -0.2 -15.3 -0.41 6.76 2795
s13207 2573 115 0.01 -0.02 5.21 0.01 25.7 0.04 -0.09 -7.95 -0.27 0.15 0.01 -0.23 -62.3 -0.89 0.06 0.14 -0.21 -62.4 -1.01 0.06 -0.11 -0.04 -13.2 -0.28 9.11 4720
s15850 3448 135 0.18 -0.07 0.4 0.18 38.8 0.22 -0.12 -12.1 -0.02 0.19 0.18 -0.27 -66.4 -0.77 0.08 0.47 -0.25 -66.5 -0.84 0.07 0.18 -0.16 -19.9 0.4 11.3 6477
s38417 8709 176 0.1 -0.03 4.59 0.05 202 0.12 -0.06 -6.09 -0.14 0.62 0.09 -0.2 -63.8 -0.84 0.27 0.2 -0.17 -63.9 -1.07 0.24 0.13 -0.27 -17.4 0.6 23.6 19097
s38584 11448 176 -0.06 -0.08 -3.38 -0.04 218 -0.05 -0.16 -14.7 -0.32 0.68 -0.09 -0.31 -65.2 -1.03 0.29 0.08 -0.3 -65.3 -1.22 0.26 -0.07 -0.07 -19.8 -0.6 25.2 19460
Ave - - 0.08 0.07 3.21 0.13 1/681 0.1 0.13 16.9 0.38 1/20497 0.1 0.26 65 1.01 1/37942 0.15 0.22 65.1 1.2 1/42392 0.09 0.16 16.7 0.46 1/260 -
Table 5.3: Mean, standard deviation, and skewness comparison for variation setting (2).
1
2
7
5.7 Conclusions
In this chapter, we have proposed a new method to approximate the max operation of
two non-Gaussian random variables using second-order polynomial fitting. It has been
shown that such approximation is more accurate than the approximation using lin-
ear fitting through tightness probability. By applying such approximation, we present
new SSTA algorithms for three different delay models, i.e., quadratic model, quadratic
model without crossing terms (semi-quadratic model), and linear model. All atomic
operations of these algorithms are performed by closed-form formulas, hence they are
very time efficient. The computational complexity of both the linear delay model and
the semi-quadratic delay model is linear to the number of variation sources, and that
of the quadratic delay model is cubic (third-order) to the number of variation sources.
Moreover, the computational complexity is linear to the circuit size for all three delay
models. The SSTA with semi-quadratic delay model results similar error as the SSTA
using Fourier Series [40] with almost 70X speed up. Moreover, compared to Monte-
Carlo simulation for non-Gaussian variation sources, the SSTA with quadratic delay
model results the error of mean, standard deviation, skewness, and 95-percentile point
is within 1%, 1%, 6%, and 1%, respectively.
128
CHAPTER 6
Physically Justifi able Die-Level Modeling of Spatial
Variation in View of Systematic Across-Wafer
Variability
Modeling spatial variation is important for statistical analysis. Most existing works
model spatial variation as spatially correlated random variables. We discuss process
origins of spatial variability, all of which indicate that spatial variation comes from
deterministic across-wafer variation, and purely random spatial variation is not signifi-
cant. We analytically study the impact of across-wafer variation and show how it gives
an appearance of correlation. We have developed a new die-level variation model
considering deterministic across-wafer variation and derived the range of conditions
under which ignoring spatial variation altogether may be acceptable. Experimental
results show that for statistical timing and leakage analysis, our model is within 2%
and 5% error from exact simulation result, respectively, while the error of the existing
distance-based spatial variation model is up to 6.5% and 17%, respectively. Moreover,
our new model is also 6X faster than the spatial variation model for statistical timing
analysis and 7X faster for statistical leakage analysis.
129
6.1 Introduction
In Chapter 5, we developed an efficient and accurate SSTA flow in which process
variation is modeled as inter-die, within-die spatial, and within-die random variations.
However, new measurement results show that the variation model in Chapter 5 is not
accurate. Therefore, we developed a more accurate and efficient variation model and
applied it on the SSTA flow presented in Chapter 5.
There are several existing works that focus on analyzing and modeling process
variation [6, 32, 172, 106, 108, 134, 24, 25, 173]. The simplest method models pro-
cess variation as the sum of inter-die (global) variation and independent within-die
(local random) variation [25]. Later, it was observed that within-die variation is spa-
tially correlated and the correlation depends on the distance between two within-die
locations. [6, 32] model spatial variation as correlated random variables, and principle
component analysis is applied to perform statistical timing analysis. In this model, a
chip is divided into several grids and each grid has its own spatial variation. The spatial
variations of different grids are correlated and the correlation coefficient depends on
the distance between two grids. [172, 173] focuses on the extraction of spatial corre-
lation and it models the correlation coefficient as a function of distance. Several more
complex spatial correlation models have been proposed in [45, 105, 189, 55, 64].
In contrast to the spatial correlation models, process oriented modeling has con-
cluded that within-die spatial variation is caused by deterministic across wafer and
across-field variation while purely random within-die spatial variation is not signif-
icant [190, 54, 50]. However, in practical design flow, designers do not know the
within-wafer location or within-field location of each die; therefore, we need to ana-
lyze the impact of across-wafer variation and across-field variation on die-scale. Since
silicon measurements cited in this chapter indicate that across-wafer variation is much
more significant than the across-field variation, we consider only across-wafer varia-
130
tion in this chapter, but the approach can be easily extended to account for across-field
variation.
In this chapter, we first analyze the impact of deterministic across-wafer variation
on spatial correlation. We observe that when quadratic across-wafer variation model
is used as in [54, 186, 125]:
1. Different locations of the chip may have different mean and variance. Such
differences increase when the chip size increases.
2. When chip size is small, the correlation coefficients for a certain Euclidean dis-
tance are within a narrow range. This explains why most existing works find that
spatial correlation is a function of distance.
3. Within-die spatial variation is NOT spatially correlated when across-wafer sys-
tematic variation is removed.
4. Within-die spatial variation is NOT independent from inter-die variation.
5. If chip size is not large, the two-level inter-/within-die decomposition of process
variation is still very accurate.
Based on our analysis, we propose three accurate and efficient spatial variation
models considering across-wafer variation. We then apply our new model to the SSTA
flow introduced in Chapter 5. Experimental results show that our model is more accu-
rate and efficient compared to the distance-based spatial variation model in [172, 173].
Compared to the exact simulation, the error of our model for statistical timing anal-
ysis is within 2% and the error for statistical leakage analysis is within 5%. On the
other hand, the error of the distance-based spatial correlation model is up to 6.5% for
statistical timing analysis and up to 17% for statistical leakage analysis. Moreover,
131
our model is 6X faster than the distance-based spatial correlation model for statistical
timing analysis and 7X faster for statistical leakage analysis.
The rest of this chapter is organized as follows: Section 6.2 discusses the physical
causes for across-wafer variation; Section 6.3 analyzes the impact of across-wafer
variation on die-scale; Section 6.4 discusses the case when the across-wafer variation
is not a perfect parabola; Section 6.5 introduces the new variation models; the new
models are applied to statistical timing analysis in Section 6.6 and statistical leakage
analysis in Section 6.7; Section 6.8 summarizes the advantages and disadvantages of
different variation models; and finally Section 6.9 concludes this chapter.
6.2 Physical Origins of Spatial Variation
In silicon manufacturing, there are many steps that cause non-uniformity in devices
across the wafer. Interestingly, most of these processes by the very nature of the equip-
ment follow a radially varying trend across the wafer. Most processes are “center-fed”
or “edge-fed” with the boundary conditions at the edge of wafer being substantially
different. Moreover, wafers are often rotated to increase process uniformity across
them which further leads to radial behavior of non-uniformity. This is further exacer-
bated by advent of single-wafer processing for 300mm wafers.
For example, overlay error includes errors in the position and rotation of the wafer
stage during exposure, wafer stage vibration, and the distortion of the wafer with re-
spect to the exposure pattern [139]. Magnification and rotation components of overlay
error increase from center of the wafer outwards.
1
During chemical vapor deposition
(CVD) step, species depletion and temperature non-uniformity on the wafer at lower
temperatures may cause thickness non-uniformity [159, 132]. Redeposition effect in
1
Overlay error can directly impact critical dimension in double patterning.
132
physical vapor deposition (PVD) [27] may cause non-uniformity. Moreover, center
peak shape of the RF electric field distribution [77] also leads to a center peak shape of
etch rate, and chamber wall conditions [78] also cause etch rate non-uniformity. In real
processes, the wafers are rotated to improve uniformity. [78, 27] show that the etch
rate varies radially across the wafer: the etch rate is high at the center of the wafer and
decreases toward the edges. Post-exposure bake (PEB) temperatures are higher at the
center of the wafer and decrease outwards [185]. Similarly, other processes ranging
from resist coat to wafer deformation due to vacuum chuck holding it follow a bowl-
shaped trend across the wafer. All these processes cause a systematic across-wafer
variation in physical dimensions.
Across-wafer variation of gate length observed in several recent silicon measure-
ments [153, 54, 186, 125] validates our arguments. [126] also shows that ring oscillator
frequency and leakage current decrease from the center to the edge of the wafer. Fig-
ure 6.1 shows industry data of ring oscillator frequency for wafers from two different
industry processes. Process 1 is with 45nm technology and process 2 is with 65nm
technology. From the figure, we see that for both process, ring oscillator frequency
decreases from the center to the edge of the wafer. Moreover, it has also been shown
that there is no spatial correlation for threshold voltage variation [190]. Therefore, the
across-wafer frequency and leakage variation is mainly caused by gate length varia-
tion.
It has been shown that for process 1, the across-wafer frequency variation can be
approximated as a quadratic function (a parabola) [126]. For process 2, the across-
wafer variation is not a perfect parabola as process 1. However, it follows a systematic
trend and the ring oscillator frequency decreases from the center to the edge of the
wafer. Since the amount measurement data of for process 2 (more than 300 wafers)
is much more than process 1, in the rest of this chapter, all of our simulation and
133
Figure 6.1: Ring oscillator frequency within a wafer. (a) Process 1; (b) Process 2.
experiments are based on the measurement result of process 2.
Besides across-wafer variation, lithography-induced effects such as lens aberra-
tions can lead to systematic across-field variation and across-die variation. Across-die
variation can be modeled as within-die deterministic mean shift and will not cause
within-die spatial correlation. Moreover, silicon measurements cited in this chapter in-
dicate that across-wafer variation is much more significant (probably due to advance-
ments in resolution enhancement and lithographic equipment) than across-field and
across-die variation. Hence, for simplicity, we consider only across-wafer variation in
this chapter.
6.3 Analysis of Wafer Level Variation and Spatial Correlation
To analyze the impact of across-wafer variation on die-scale, we assume the across-
wafer variation to be a quadratic function as in[24, 54, 186, 125]:
v
p
= ax
2
w
+by
2
w
+cx
w
+dy
w
+m
w
+m
f
+m
d
+m
r
(6.1)
134
where v
p
is a variation source, such as L
e f f
, a, b, c, and d are function coefficients,
which are obtained from fitting the measurement data from industry process as shown
in Figure 6.1, m
w
comprises inter-die random, inter-wafer, inter-lot variation and the
fitting error of quadratic fitting as in (6.1); m
f
and m
d
are the across-field and across-
die variation, respectively, as discussed in Section 6.2, we consider only across-wafer
variation and ignore these two types of variations (m
f
and m
d
) in this chapter; m
r
is
the random noise, and (x
w
, y
w
) is across-wafer location. In the rest of this section, we
will analyze the spatial variation based on the above model. Table 7.3 summarizes the
mathematical notations used in this section.
We obtain the coefficients of the above across-wafer variation model by fitting the
industrial 65nm process measured ring oscillator delay with 300 wafers from 23 lots.
In the rest of this section, all simulations are based-on this extracted model.
6.3.1 Variation of Mean and Variance with Location
According to (6.1), it is easy to find that for a die, whose center lies on (x
c
, y
c
) wafer
coordinates, the variation at location (x, y) in a die (assuming the coordinate of the
center of the die to be (0, 0)) is:
v
p
(x, y) = a(x
c
+x

)
2
+b(y
c
+y

)
2
+c(x
c
+x

) +d(y
c
+y

) +m
w
+m
r
(x, y) (6.2)
The definitions of (x
c
, y
c
) and (x

, y

) are in Table 7.3. In this section, we assume that
a, b, c, and d are fixed for a process. In practice, these coefficients may vary slightly
for wafer-to-wafer or lot-to-lot. We will further discuss this in Section 6.4.
In real design flow, the die location in the wafer (x
c
, y
c
) is not known to designers.
We can convert the wafer-level systematic variation model to a die-level model by
noting that the dies are always distributed evenly on the wafer. Therefore, we may
model x
c
and y
c
as random variables which are evenly distributed in the center at (0, 0)
135
Symbols Description
(x
c
, y
c
) Location of the center of the die in the wafer
r
w
wafer radius
m
w
Inter-die variation
m
r
Within-die random variation
σ
2
m
Variance of m
w
σ
2
r
Variance of m
r
a, b, c, d Across-wafer variation coefficients
(l
x
, l
y
) x and y dimension die size
(x
w
, y
w
) Within-wafer location
(x, y) Within-die location
ω Angle between the die and wafer coordinate
v
p
(x, y) Variation of within-die location (x, y)
x

x

= xcos ω+ysinω
y

y

= xsinω−ycos ω
l

x
l

x
= l
x
cos ω+l
y
sinω
l

y
l

y
= l
x
sinω−l
y
cos ω
x

x

= ax

+c/2
y

y

= by

+d/2
l

x
l

x
= ax

+c/2
l

y
l

y
= by

+d/2
r

r

=

x
2
/a+y
2
/b
r

r

=

x
2
+y
2
δ Euclidean distance between (x

1
, y

1
) and (x

2
, y

2
)
δ =

(x

1
−x

2
)
2
+(y

1
−y

2
)
2
r
m
r
m
=

l
2
x
+l
2
y
r

m
r

m
=

l
2
x
+l
2
y
r

m
r

m
=

l
2
x
+l
2
y
k
0
k
0
= r
2
w
(a+b)/4−c
2
/4a−d
2
/4b
k
1
k
1
= r
4
w
(a
2
+b
2
)/16−r
4
w
ab/24+σ
2
m
k
2
k
2
= k
1
/r
2
w
α α = x

1
x

2
+y

1
y

2
β β =σ
2
r
/r
2
w
in the wafer s
0
s
0
= cos
2
ω(al
2
x
+bl
2
y
) +sin
2
ω(bl
2
x
+al
2
y
)
v
g
Inter-die variation
v
s
Within-die spatial variation
v
l
Within-die random variation
Table 6.1: Notations.
136
with radius r
w
(radius of the wafer). It is easy to see that x
c
and y
c
are identical and
their PDF is
2
:
PDF(x
c
) = 2

r
2
w
−x
2
c
/(πr
w
) −r
w
< x
c
< r
w
(6.3)
In this case, the variation at location (x, y), v
p
(x, y), is expressed as a function of
four random variables: x
c
, y
c
, m
w
, and m
r
(x, y). We may easily calculate the mean and
variance of v
p
(x, y):
µ
v
p
(x,y)
= k
0
+x
2
/a+y
2
/b = k
0
+r
2

(6.4)
σ
2
v
p
(x,y)
= k
1
+r
2
w
(x
2
+y
2
) = k
1

2
r
+r
2
w
r
2

(6.5)
where x

, y

, r

, r

, k
0
, and k
1
are defined in Table 7.3.
From (6.4) and (6.5), it is interesting to note that different die locations may have
different means and variances
3
. The location (x
0
, y
0
) having the smallest mean and
variance is given by:
x
0
= −ccos ω/2a−d sinω/2b (6.6)
y
0
= d cosω/2b−csinω/2a (6.7)
The locations farther off from(x
0
, y
0
) will have larger mean and variance. Figures 6.2(a)
and 6.2(b) illustrate the mean and variance of ring oscillator delay for different r

(or
r

).
2
x
c
and y
c
are not independent. In order to generate samples of x
c
and y
c
, we may assume x
c
=r
s
cosθ
and y
c
= r
s
sinθ, where r
s
and θ are independent. r
s
follows triangle distribution ranging from 0 to r
w
,
θ follows uniform distribution ranging from 0 to 2π
3
Such difference is caused by the nonlinearity of the across-wafer variation function (we assume
quadratic function as in Equation(6.1)). If the across wafer variation function is a piecewise linear
function, the mean and variance will be the same for all locations on a die.
137
0 0.2 0.4 0.6
0.1292
0.1296
r

µ
(a) µV.S. r

0 0.2 0.4 0.6
3.5
4
4.5
x 10
−5
r

σ
2
(b) σ
2
V.S. r

Figure 6.2: Mean and variance for different r
d
.
6.3.2 Appearance of Spatial Correlation
Besides mean and variance, we are also interested in the covariance between two loca-
tions (x
1
, y
1
) and (x
2
, y
2
). From (6.2), we may calculate the covariance as:
Cov = k
1
+r
2
w
(x

1
x

2
+y

1
y

2
) (6.8)
With covariance and variance calculated as above, we may obtain the correlation coef-
ficient as:
ρ =

k
2
2
+2k
2
α+α
2
(k
2
+β)
2
+(r
2
dσ1
+r
2
dσ2
)(k
2
+β) +r
2
dσ1
r
2
dσ2
(6.9)
where α, β, and k
2
are defined in Table 7.3. From (6.9), we obtain the upper bound
and lower bound of the correlation coefficient for a certain Euclidean distance:
ρ ≤ρ
u
=

1−
δ
2
k
2
+δβ/2+2βk
2

2
(k
2
+β)
2
+2r
2
m
(k
2
+β) +r
4
m
(6.10)
ρ ≥ρ
l
=

1−
δ
2
(k
2
+r
2
m
−δ
2
/4+β) +β(2k
2
+r
2
m
)
(k
2
+β)
2

2
(k
2
+β)/2+δ
4
/16
(6.11)
where δ is the Euclidean distance between (x

1
, y

1
) and (x

2
, y

2
), l

x
, l

y
, and r

m
are
defined in Table 7.3. From the upper bound and lower bound, we may also calculate
138
the range of correlation coefficient:
ρ
u
−ρ
l

r
2
m
/(r
2
m
+k
2
+β) (6.12)
Notice that usually the wafer size is much larger than the die size, that is k
2
r
2
m
, there-
fore, the range of correlation coefficient for a certain distance is very small. Moreover,
from the above equation, we also find that when the variances of the inter-die residual
and within-die random variation increase, the range decreases. This explains why most
existing works [172, 173, 45] find that spatial correlation is a function of distance.
Figure 6.3(a) illustrates the exact data for 40 locations, the upper bound and the
lower bound. From the figure, we find that the range of ρ for a certain distance is very
small. Although the correlation coefficient is within a narrow range, covariance is not,
as shown in Figure 6.3(b). This is because of the differences of variance across the die.
0 0.5 1
0.86
0.88
0.9
0.92
0.94
0.96
0.98
Distance (cm)
ρ
Exact
Upper Bound
Lower Bound
LB
UB
(a) ρ
0 0.5 1 1.5
3
3.5
4
x 10
−5
Distance (cm)
C
o
v
(b) Covariance
Figure 6.3: Apparent spatial correlation and covariance as a function of distance.
Figure 6.4 shows the correlation coefficient for within-die variation after subtract-
ing the mean variation of the die (mainly caused by across wafer variation)
4
. We
4
In the figure, the correlation coefficient can be a negative number when distance is large. This is
because after subtracting the mean, when the within-die variation of one corner increases, the within-
die variation of the opposite corner must decrease. That means, the within-die variations of opposite
corners are negative correlated. Moreover even when two locations are very close, if they lie on the
opposite side of the center, their correlation is still near zero.
139
observe that the within-die spatial variation is almost NOT spatially correlated, as em-
pirically observed in [172, 173]. This further validates that the spatial variation is
caused by systematic across-wafer variation.
0 0.5 1 1.5
−1
−0.5
0
0.5
1
Distance
ρ
Figure 6.4: Correlation coefficient for within-die spatial variation after inter-die varia-
tion is removed.
6.3.3 Dependence between Inter-Die and Within-Die Variation
In most existing variation models, process variation is decomposed into inter-die,
within-die spatial, and within-die random variation:
v
p
= v
g
+v
s
+v
l
(6.13)
where v
g
is the inter-die variation, v
s
is the within-die spatial variation, and v
l
is the
within-die variation. Usually v
g
is modeled as the variation of the chip mean, v
l
is the
pure random local variation, and v
s
is the residual. v
g
, v
s
, and v
l
are assumed to be
independent.
With the variation model in (6.2), we may also calculate the inter-die, within-die
spatial, and within-die random variation. Inter die variation is calculated as the varia-
140
tion of the chip mean:
v
g
=
1
l
x
l
y
ZZ
|x|<l
x
/2
|y|<l
y
/2
v
p
(x, y)dxdy
= ax
2
c
+by
2
c
+cx
c
+dy
c
+m
w
+s
0
(6.14)
where s
0
is defined in Table 7.3.
Within-die random variation is the local random variation: v
l
= R, and within-die
spatial variation is calculated as the remaining variation:
v
s
(x, y) = v
p
(x, y) −v
g
−v
l
= r
2

+2ax

x
c
+2by

y
c
−s
1
(6.15)
where s
1
is defined in Table 7.3. From the above equations, we find that both inter-die
and within-die spatial variations are functions of random variables x
c
and y
c
. Hence,
we may not decompose process variation into independent inter-die and within-die
spatial variation.
6.3.4 When can Spatial Variation be Ignored?
In this section, we analyze the accuracy of the simple two-level inter-/within-die vari-
ation model for different chip sizes. If we only consider inter-/within-die variation,
we may lump the across-wafer variation into inter-die variation, that is, approximate
the across-wafer variation as a piecewise constant function, as shown in Figure 6.5(a).
To evaluate the impact of the approximation error, we may treat such approximation
error as noise and the process variation as signal; and then evaluate the signal to noise
ratio. In order to do this, we calculate the mean square approximation error and the
total variance of variation. The signal to noise ratio when ignoring the spatial variation
141
is given as:
SNR = σ
2
total
/MSE

6abr
4
w
+6(c +d)r
2
w

2
M

2
R
abr
2
w
(l
2
x
+l
2
y
) +2(c +d)l
x
l
y
(6.16)
It can be seen that MSE depends on chip size. When chip size is small, MSE is
small. This is because we approximate the across-wafer variation as a piecewise con-
stant function with small steps, hence such approximation is accurate. Figure 6.5(b)
illustrates the SNR for different die sizes. It can be seen that the SNR decreases when
die size increases as expected. We also observe that when chip size (l
x
and l
y
) is
smaller than 1cm, the SNR is up to 100. That means, two-level inter-/within-die vari-
ation model is accurate.
Exact
Piecewise
(a) Piecewise constant
0 1 2 3 4 5
10
0
10
1
10
2
10
3
10
4
10
5
Chip size (cm)
S
N
R
(b) SNR V.S. chip size.
Figure 6.5: Approximating across-wafer variation.
6.4 General Across-Wafer Variation Model
In the previous section, we assumed that the across-wafer variation is a quadratic func-
tion as shown in Equation 6.2. In practice, across-wafer variation may not be an exact
142
parabola. Moreover, the across-wafer variation may be slightly different for different
wafers. Therefore, there will be some fitting residual after subtracting the across-wafer
parabola:
v(x, y) = v
p
+v
r
(6.17)
where v
p
is the quadratic across-wafer variation model as shown in (6.2) and V
r
is fit-
ting residual. In the previous section, we assume that the fitting residual is lumped into
inter-die random variation. However, the fitting residual contains not only inter-die
random variation but a systematic trend of within-die variation. Figure 6.6(a) illus-
trates the original delay variation across the wafer, and Figure 6.6(b) illustrates the
fitting residual of a wafer delay variation after subtracting the quadratic across-wafer
variation function. From the figure, we observe that for each die, there is a systematic
trend: If one corner of the die is faster, then the opposite corner is slower. From the
die point of view, such trend will also introduce some spatial correlation. In order to
model this trend, we approximate the fitting residual as a linear function of within-die
location:
v
r
(x

, y

) = s
x
x +s
y
y +m

w
(6.18)
where s
x
and s
y
are x-dimension and y-dimension slope of within-die trend, respec-
tively, and m

w
is inter-die part of the residual, which can be lumped into inter-die
random variation. We also find that the trends of within-die variation for each die are
different for different dies. In this case, we may model s
x
and s
y
as random variables.
Figure 6.7 illustrates the distribution of s
x
and s
y
obtained from measurement data of
process 2. From the figure, we find that both s
x
and s
y
are almost uncorrelated and they
both follow Gaussian distribution.
Notice that when we model the fitting residual as a linear within-die variation trend,
two more random variables s
x
and s
y
are introduced. This makes the variation model
143
more complicated. When the across-wafer variation is a perfect parabola and the fitting
residual is not significant, we may just lump the fitting residual in to inter-die and
random variation.
0 50 100 150
0
50
100
150


0.102592
0.110338
0.118083
0.126474
0.13422
0.142611
(a) Original delay variation.
0 50 100 150
0
50
100
150


−0.0147271
−0.00816225
−0.00159736
0.0055146
0.0120795
0.0191914
(b) Residual after subtracting
Equation(6.2).
0 50 100 150
0
50
100
150


−0.00986836
−0.00634698
−0.0028256
0.000989223
0.0045106
0.00832543
(c) Residual after subtracting (6.18).
Figure 6.6: Ring oscillator frequency within a wafer.
In addition, we also observe that after subtracting the model of fitting residual in
(6.18), the remaining variation is almost uncorrelated, as illustrated in Figure 6.6(b)
5
.
5
It seems that there is a systematic within-die variation pattern. In this chapter, we assume that all
this within-die pattern to be independent randomvariation. In practice, such within-die variation pattern
can be modeled as mean shift.
144
−2 0 2 4 6 8 10 12 14
x 10
−4
0
200
400
600
800
1000
1200
1400
1600
1800


S
x
PDF
Gaussian Approx
(a) PDF of S
x
−5 −4 −3 −2 −1 0 1 2 3 4 5
x 10
−4
0
500
1000
1500
2000
2500


S
y
PDF
Gaussian Approx
(b) PDF of S
y
Figure 6.7: PDF of S
x
and S
y
.
Combining (6.18) and (6.2) we obtain a general die-level across-wafer variation
model:
v
p
(x, y) = a(x
c
+x

)
2
+b(y
c
+y

)
2
+c(x
c
+x

) +
d(y
c
+y

) +s
x
x +s
y
y +m
w
+m
r
(x, y) (6.19)
6.5 Modeling Spatial Variability
As discussed in Section 6.1, spatial variation largely comes from the deterministic
across-wafer variation. Hence, modeling the within-die variation as spatial-correlated
random variable is not accurate as discussed in Section 6.3.
In this section, we introduce three new spatial variation models considering across-
wafer variation:
• Slop Augmented Across-Wafer variation model (SAAW).
• Quadratic Across-Wafer variation model (QAW).
145
• Location Dependent Across-Wafer variation model (LDAW).
In the following of this section, we will discuss these models in detail:
6.5.1 Slope Augmented Across-Wafer Model
(6.19) calculates the variation for a given location (x, y). In the equation, the die lo-
cation (x
c
, y
c
) are modeled as random variables and their PDF is shown in Equation
(6.3). (6.19) provides a new spatial variation model. We refer to this new model as
Slope Augmented Across-Wafer variation model (SAAW) .
Notice that in SAAW model, there are only six random variables, inter-die random
variation m
w
, within-die random variation m
r
, die location within the wafer x
c
and y
c
,
and slope of fitting residual s
x
and s
y
. However, for the traditional distance-based spa-
tial variation model, the number of spatial variation sources depends on the number
of grids. Larger chip needs more variables. Therefore, our new model not only mod-
els the across-wafer variation accurately but also is more efficient than the traditional
spatial correlation model.
6.5.2 Quadratic Across-Wafer Model
As discussed in Section 6.4, when the across-wafer variation is a perfect parabola, the
fitting residual is not significant
6
, we may just lump the fitting residual into inter-die
random variation and simplify (6.19) to (6.2). In this case, there are only four random
variables, x
c
and y
c
, m
w
, and m
r
. We refer to this model as Quadratic Across-Wafer
variation model (QAW). Since QAW does not consider fitting residual, it is not as
accurate as SAAW.
6
For example, process 1 as discussed in Section 6.2.
146
6.5.3 Location Dependent Across-Wafer Model
As discussed in Section 6.3.4, when die size is small enough, applying the two-level
inter-/within-die variation model does not introduce much error. However, inter-/within-
die variation model still does not consider the mean and variance difference at different
locations of a chip, as discussed in Section 6.3.1. To further improve the accuracy of
inter-/within-die variation model, we account for this:
v(x, y) ≈m

d

v
p
(x, y) +σ
v
p
(x, y)m

r
(x, y) (6.20)
where m

d
is inter-die variation including inter-lot random, inter-wafer random, inter-
die random, and across-wafer variation; m

r
(x, y) is within-die variation including within-
die random variation and residual of across-wafer variation; µ
v
p
(x, y) and σ
v
p
are mean
and variance difference at different locations of a chip, which can be calculated from
(6.4) and (6.5). We refer to the above model as Location Dependent Across-Wafer
variation model (LDAW). Notice that in Equation 6.20, µ
v
p
(x, y) and σ
v
p
(x, y) are de-
terministic value for a certain within-die location (x, y). Therefore, LDAW model has
only two random variables: m

d
and m

r
which is the same as two level inter-/within-
die model. Hence compared to inter-/within-die model, LDAW has similar efficiency
but higher accuracy because LDAW considers mean and variance difference across the
chip while inter-/within-die model does not.
6.6 Application of Spatial Variation Models on Statistical Timing
Analysis
In this section, we apply our across-wafer variation models to statistical static timing
analysis.
147
6.6.1 Delay Model
As discussed in Chapter 5, cell delay can be modeled as a quadratic function of vari-
ation sources, as shown in (5.46). For SAAW in (6.19) and QAW (6.2), each variation
source is a quadratic function of random variables. Therefore, by applying SAAW or
QAW on quadratic cell delay model, the cell delay becomes a 4
th
order function of
random variables. However, our SSTA flow in Chapter 5 can only handle a quadratic
function of random variables. Therefore, we apply the moment matching technique
in the way as Section 5.3 to match the 4
th
function to a quadratic function of ran-
dom variables by matching the mean and first two order joint moments. And then,
we may apply the quadratic SSTA flow in Section 5.3 to estimate the chip delay vari-
ation. Notice that moment matching approximation is performed only once for all
cells and does not increase the run time of SSTA. LDAW is a linear function of vari-
ation sources. Therefore, when applying LDAW, cell delay is a quadratic function of
random variables, hence quadratic-SSTA can be applied. Moreover, as discussed in
Section 5.6, Semi-quadratic SSTA greatly improves the speed with only a little ac-
curacy loss compared to quadratic SSTA. Therefore, in this application, we may also
ignore the crossing terms and apply semi-quadratic SSTA to improve efficiency.
Sometimes, cell delay model can be simplified as a linear function of variation
sources, as shown in (5.114). In this case, applying SAAW or QAW results in a
quadratic function of random variable and hence quadratic SSTA or semi-quadratic
SSTA can be applied; and applying LDAW results in a linear function of random vari-
ables where linear SSTA can be applied.
148
6.6.2 Experimental Result
We have applied our new model to the SSTA flow introduced in Chapter 5. In order
to verify the efficiency and accuracy, three comparison cases are defined: 1) Monte-
Carlo simulation with the exact deterministic across-wafer variation model
7
, which
is the golden case for comparison; 2) distance-based spatial correlation model from
[172, 173], which is referred to as SPatial Correlation model (SPC); 3) two-level inter-
/within-die variation model, which is referred to as Inter-/Within-die variation model
(IW).
We apply all the above methods to the ISCAS85 suite of benchmarks in PTM
45nm technology[124]. We assume random placement for ISCAS85 circuits. Since
process variation has smaller impact on interconnect delay than on logic cell delay,
we only consider logic cell delay when calculating the full chip delay variation. In
the experiment, we consider the gate length variation obtained from minimum square
error fitting on the ring oscillator delay from industrial 65nm process (Process 2 as
discussed in Section 6.2) measurement from the model as shown in (6.19). We ob-
tain the across-wafer coefficients a, b, c, and d
8
, fitting residual coefficients s
x
and
s
y
9
, standard deviation of random inter-wafer, inter-die, and within-die variation as
percentage with respect to the nominal value. Then we assume that the percentages of
all the above coefficients to nominal value are the same at 45nm technology node and
65nm technology node.
7
In the simulation, each wafer may have different across-wafer variation which is obtained from
measurement data of process 2. We have simulated 318 wafers correspondent to 318 measured wafers.
8
We apply quadratic function to fit the across-wafer variation for each wafer to obtain the fitting
coefficients for each wafer, then use average coefficients of all wafers for our experiment.
9
We obtain the slope of fitting residual s
x
and s
y
for each chip, and then calculate the mean and
variance of s
x
and s
y
for all chips. In the experiment, we assume s
x
and s
y
to be Gaussian random
variables with mean and variance obtained from the measurement data.
149
6.6.2.1 Full Chip Delay
In the experiment, we assume that the chip size is 2cm×2cm and the wafer radius is
15cm. Since ISCAS85 benchmarks are very small, the impact of spatial variation on
delay is not significant within the circuit. In order to show such impact,we assume the
benchmarks are stretched on a 2cm×2cm chip. In our experiment, for the SPC model,
we divide the chip to 10 ×10 = 100 grids. Table 6.2 illustrates the percentage error
of mean (µ), standard deviation (σ), and 95-percentile point (95%) and run time (T) of
different variation models. In the table, we also compared the result of using quadratic
cell delay model (Quad) and linear cell delay model (Lin). We only use quadratic
cell delay model for golden case simulation (exact). The error is calculated as error
of different variation models compared to the golden case simulation. For SAAW and
QAW, we also compare the results of applying quadratic SSTA with crossing terms
(SAAW Quad and QAW Quad) and applying SSTA without crossing terms (SAAW S-
Quad and QAW S-Quad). From the table, we have the following observations:
• Compared to full quadratic SSTA, semi-quadratic SSTA (SSTA without crossing
terms) achieves up to 8X speed up with less than 1% accuracy loss.
• SAAW is more accurate than QAW. This is because the fitting residual is signif-
icant for the measurement data, QAW ignores fitting residual and hence intro-
duces more error.
• The error of SAAW using semi-quadratic SSTA is within 2% while the error of
spatial correlation model is up to 6.5%.
• Compared to quadratic cell delay model, linear cell delay has less than 2% ac-
curacy loss. This is because in our experiment, the cell delay variation is well
approximated by a linear function.
150
• For linear cell delay model, SAAW achieves about 6X speed up compared to
SPC. This is because there are 100 grids in the spatial correlation model, result-
ing in 37 spatial random variables
10
, while SAAW has only 6 random variables.
• LDAW and IW are very efficient. However, both model has much larger error
than others. This is because both model ignore correlation. LDAW is signifi-
cantly more accurate than IW with no runtime penalty.
Since linear cell delay model and semi-quadratic SSTA are accurate. In the rest of
this subsection, we assume linear cell delay and apply semi-quadratic SSTA for all
experiments. Moreover, since SAAW is more accurate than QAW with only a small run
time overhead, in the following experiment, we do not consider the QAW model in the
following experiments.
10
There are 100 correlated spatial random variables, we apply PCA to truncate some insignificant
principle components and there remains 37 significant principle components.
151
In the above experiment, we only considered big chips. As discussed in Sec-
tion 6.3.4, when the chip size is small, the impact on across-wafer variation at die
level is not significant. In order to verify this, we perform delay estimation of IS-
CAS85 benchmarks stretching on different size chips. Table 6.3 shows the percentage
error for different models under different chip size. From the table, we find that when
chip size is small, LDAW and IW are accurate. Considering that LDAW has similar run
time but is more accurate (although when chip size is small, the accuracy improvement
is limited) compared to IW, LDAW is always better than IW.
6.6.2.2 Delay of Blocks on Different Locations on a Chip
The above experiment assumes that the benchmarks are stretched on a chip. How-
ever, in real design, especially for big chips, the design is separated into several blocks
and each block only occupies a small region of a chip. In this case, the critical path
is within a small region instead of spanning all over the chip. As discussed in Sec-
tion 6.3.1, different chip locations may have different mean and variance. Therefore,
when a block is placed at different locations on a chip, its delay variation may be dif-
ferent. In order to show such effect, we assume that the ISCAS85 benchmark circuit
is placed (no stretched) in different locations on a chip: center (C), lower left corner
(LL), lower right corner (LR), upper left corner (UL), and upper right corner (UR),
and then calculate the delay variation with location. Since ISCAS85 benchmarks are
very small, the impact of spatial variation on delay is not significant within the circuit.
Therefore, in this experiment, we only compare two models LDAW and IW
11
.
Table 6.4 compares the percentage error of LDAW and IW for ISCAS85 bench-
marks placing on different locations on a 2cm×2cm chip. From the table, we find that
11
When the circuit is in a small region, SAAW and QAW will give similar result as LDAW, and SPC
will give similar result as IW.
152
Bench- delay Exact SAAW Quad SAAW S-Quad QAW Quad QAW S-Quad LDAW

SPC

IW

mark model µ σ 95% µ σ 95% T µ σ 95% T µ σ 95% T µ σ 95% T µ σ 95% T µ σ 95% T µ σ 95% T
c1908 Quad 17.6 2.27 24.5 -0.4 -0.9 -1.0 146 -0.9 -1.7 -1.7 27 -0.8 -1.9 -2.3 54 -1.3 -2.5 -3.2 19 -2.1 -7.5 -6.9 9 -1.5 -4.2 -3.8 1450 -2.6 -10.2 -8.9 8
Lin - - - -0.8 -1.5 -1.4 150 -1.4 -1.8 -2.0 26 -0.9 -3.6 -3.1 53 -1.2 -3.4 -3.9 18 -2.6 -7.5 -8.1 10 -2.1 -4.4 -4.0 135 -3.0 -11.5 -10.3 10
c3540 Quad 25.7 3.43 34.5 +0.4 +0.9 +0.7 212 -0.4 -1.3 -1.1 36 +0.4 -1.8 -1.2 76 -0.9 -2.1 -1.9 25 +0.4 -5.8 -4.6 13 -1.2 -4.8 -4.0 4210 -1.4 -7.3 -6.5 12
Lin - - - -0.6 -1.2 -1.2 209 -1.1 -1.9 -1.6 35 -0.9 -3.6 -3.1 77 -1.2 -5.1 -3.9 27 -1.8 -6.5 -6.0 9 -2.0 -6.5 -5.7 202 -2.9 -9.3 -8.8 10
c7552 Quad 48.9 6.47 64.7 -0.6 +0.3 +0.2 435 -0.8 -0.2 -0.9 67 -0.8 -1.6 -1.4 115 -1.6 -1.5 -1.7 48 -2.7 -3.6 -4.0 20 -1.0 -2.3 -2.9 8182 -2.1 -6.7 -6.5 22
Lin - - - -0.6 -0.5 -0.6 430 -1.5 -1.4 -1.6 101 -1.1 -1.3 -1.4 109 -1.9 -2.2 -2.8 79 -3.3 -4.6 -4.9 16 -3.3 -3.5 -4.3 433 -3.9 -8.9 -8.2 15
Table 6.2: Delay percentage error for different variation models. Note: the µ, σ, and 95-percentile point for exact simulation
is in ns. Run time (T) is in ms.

for LDAW, SPC, and IW, linear SSTA is applied when assuming linear cell delay model.
1
5
3
Bench- Chip Exact SAAW LDAW SPC IW
mark size µ σ 95% µ σ 95% µ σ 95% µ σ 95% µ σ 95%
c1908 10 17.4 2.24 24.2 -1.2 -1.8 -1.9 -2.2 -5.5 -5.9 -1.7 -3.5 -3.3 -2.5 -6.8 -7.1
6 17.5 2.23 24.3 -1.1 -1.5 -1.6 -2.2 -4.1 -4.2 -1.2 -2.6 -2.4 -2.2 -4.5 -4.3
3 17.6 2.22 24.4 -0.9 -1.2 -1.1 -1.1 -1.8 -1.6 -1.0 -1.5 -1.5 -1.9 -2.1 -2.5
c3540 10 25.6 3.42 34.6 -1.0 -2.2 -1.6 -1.4 -6.2 -4.9 -1.8 -5.4 -4.8 -2.2 -7.5 -6.9
6 25.8 3.45 34.3 -0.8 -1.2 -1.0 -1.2 -3.8 -3.3 -1.5 -3.4 -3.0 -1.3 -4.2 -3.7
3 25.8 3.44 34.4 -0.5 -1.0 -1.0 -1.1 -2.1 -2.0 -1.0 -1.9 -1.9 -1.1 -2.2 -2.3
c7552 10 48.9 6.47 64.7 -1.2 -1.1 -1.4 -2.8 -3.5 -3.7 -2.5 -2.2 -2.7 -3.2 -5.6 -6.3
6 48.9 6.47 64.7 -1.0 -1.1 -1.3 -1.4 -2.5 -2.3 -2.2 -2.1 -2.5 -2.9 -3.1 -3.3
3 48.9 6.47 64.7 -0.6 -0.9 -1.0 -1.0 -1.1 -1.2 -0.9 -1.3 -1.4 -1.3 -1.4 -1.6
Table 6.3: percentage error for ISCAS85 benchmark stretching on a chip with different
chip size. Note: We assume square chips and chip size means edge length in mm. The
values of exact simulation are in ns.
the estimation of LDAW is within 1% error from the exact simulation and the error of
IW is up to 8%. This is because LDAW predicts different mean and variance for dif-
ferent location correctly, as discussed in Section 6.5 while IW can only give the same
mean and variance for all locations.
Table 6.5, 6.6, and 6.7 shows percentage error of LDAW and IW for ISCAS85
benchmarks placing on different locations on a 1cm×1cm, 6mm×6mm, and 3mm×3mm
chip, respectively. From the tables, we find that the error of IW becomes smaller when
chip size is small.
154
bench loca- Exact LDAW IW
mark tion µ σ 95% µ σ 95% µ σ 95%
c3540 C 25.4 3.29 32.2 +0.4 +0.3 +0.3 +1.1 +2.4 +2.8
LL 24.8 3.22 31.9 +0.8 +0.6 +0.3 +3.5 +1.4 +1.9
LR 26.2 3.35 33.1 -0.7 +0.6 -0.6 -1.9 -2.4 -1.8
UL 26.5 3.36 33.3 -0.2 +0.3 +0.3 -3.3 -3.0 -4.4
UR 27.1 3.41 34.1 -0.3 -0.3 -0.6 -6.1 -4.1 -5.2
c7552 C 48.2 6.37 60.2 +0.8 +0.3 +0.2 +1.0 +0.9 +0.9
LL 47.2 6.11 58.5 +0.4 +0.3 +0.7 +3.6 +4.6 +3.1
LR 49.4 6.51 62.3 -0.2 +0.3 +0.3 -0.8 -1.9 -3.0
UL 49.5 6.65 63.1 -0.2 +0.1 +0.1 -1.0 -4.0 -4.1
UR 50.1 6.91 65.3 -0.4 +0.3 +0.1 -1.0 -7.4 -7.4
Table 6.4: Delay percentage error at different locations in a 2cm×2cm chip. Note:
The values of exact simulation are in ns.
bench loca- Exact LDAW IW
mark tion µ σ 95% µ σ 95% µ σ 95%
c3540 C 25.4 3.26 31.9 +0.5 +0.4 +0.2 +1.3 +1.6 +1.9
LL 25.0 3.25 31.6 +0.5 +0.3 +0.3 +2.4 +1.3 +2.4
LR 25.8 3.30 32.4 -0.4 -0.4 -0.4 -1.1 -0.9 -0.8
UL 26.1 3.32 32.6 -0.4 +0.3 -0.2 -2.0 -1.5 -1.4
UR 26.2 3.34 33.0 -0.3 -0.3 -0.6 -2.6 -1.8 -2.2
c7552 C 48.6 6.38 60.4 +0.2 -0.1 -0.2 +1.0 +0.5 +0.6
LL 48.2 6.35 60.3 -0.2 -0.2 +0.2 +1.7 +1.0 +1.1
LR 48.9 6.42 60.4 +0.2 -0.2 +0.2 -0.2 -0.7 -0.5
UL 49.1 6.43 61.0 -0.2 -0.2 -0.2 -0.9 +1.0 -1.3
UR 49.2 6.45 61.2 -0.3 -0.4 -0.5 -1.1 -1.4 -1.8
Table 6.5: Delay comparison for ISCAS85 benchmark in 1cm×1cm chip. Note: The
values of exact simulation are in ns.
155
bench loca- Exact LDAW IW
mark tion µ σ 95% µ σ 95% µ σ 95%
c3540 C 25.5 3.27 32.0 +0.4 +0.2 +0.3 +0.9 +0.7 +1.4
LL 25.1 3.25 31.7 +0.5 +0.3 +0.3 +2.4 +1.3 +2.4
LR 26.0 3.32 32.6 -0.4 -0.4 -0.4 -1.1 -0.9 -0.8
UL 26.2 3.34 32.8 -0.4 +0.3 -0.2 -2.0 -1.5 -1.4
UR 26.3 3.35 33.1 -0.3 -0.3 -0.6 -2.6 -1.8 -2.2
c7552 C 48.4 6.37 60.1 +0.2 -0.1 -0.2 +1.0 +0.5 +0.6
LL 48.1 6.34 59.9 -0.2 -0.2 +0.2 +1.7 +1.0 +1.1
LR 49.0 6.43 60.7 +0.2 -0.2 +0.2 -0.2 -0.7 -0.5
UL 49.3 6.46 61.3 -0.2 -0.2 -0.2 -0.9 +1.0 -1.3
UR 49.4 6.48 61.5 -0.3 -0.4 -0.5 -1.1 -1.4 -1.8
Table 6.6: Delay comparison for ISCAS85 benchmark in 6mm×6mm chip. Note: The
values of exact simulation are in ns.
bench loca- Exact LDAW IW
mark tion µ σ 95% µ σ 95% µ σ 95%
c3540 C 25.6 3.28 32.2 +0.4 +0.2 +0.3 +0.5 +0.3 +0.8
LL 25.4 3.26 320 +0.5 +0.6 +0.3 +1.2 +1.0 +1.2
LR 25.8 3.31 32.4 -0.2 -0.4 -0.6 -0.5 -0.7 -0.7
UL 25.9 3.22 32.5 -0.2 +0.3 -0.2 -1.0 -1.1 -0.9
UR 26.0 3.31 32.7 -0.2 -0.2 -0.3 -0.7 -0.8 -1.1
c7552 C 48.6 6.38 60.3 +0.2 -0.3 +0.7 +0.2 +0.5 +1.1
LL 48.4 6.37 60.1 -0.2 0.1 +0.2 +0.4 +0.3 +0.7
LR 48.8 6.42 60.6 +0.2 -0.3 +0.3 -0.6 -0.9 -0.5
UL 48.9 6.43 60.9 -0.2 +0.2 -0.2 -0.4 +0.0 -0.6
UR 49.2 6.45 61.2 -0.2 -0.2 -0.3 -0.7 -0.8 -1.1
Table 6.7: Delay comparison for ISCAS85 benchmark in 3mm×3mm chip. Note: The
values of exact simulation are in ns.
156
6.7 Application of Spatial Variation Models on Statistical Leakage
Analysis
Besides SSTA, we also apply our variation model to statistical leakage power analysis.
Usually, cell leakage power variation is modeled as exponential function of variation
sources:
P
leak
= P
0
· e
∑c
i
V
i
(6.21)
where P
0
is the nominal leakage power and c
i
’s are sensitivity coefficients. The full
chip leakage power is calculated as the sum of leakage power of all cells:
P
chi p
=

i∈Cell
P
i,leak
(6.22)
where Cell is the set of all cells in the chip and P
i,leak
is leakage power of the i
th
cell. Since each variation source is a quadratic function as in (6.19), the cell leakage
power is an exponential of a quadratic function of random variables. Considering
that the random variables may be non-Gaussian, there is no closed-form equation to
calculate the full chip leakage power. Therefore, in this chapter, we apply Monte-Carlo
simulation to obtain the full chip leakage power variation.
We have implemented leakage variation analysis with different models in Matlab.
In the experiment, we use the same setting and comparison cases as the SSTA exper-
iment in Section 6.6. For each variation model, we use 100,000 sample Monte-Carlo
simulation to obtain the full chip leakage power for all variation models. For the leak-
age analysis, we assume that 900 copies of ISCAS benchmark circuits are placed in
a 30 ×30 array on a 2cm×2cm chip. Table 6.8 compares the leakage variation for
ISCAS85 benchmarks. From the table, we observe that:
• Error of SAAW is within 5% while error of SPC is up to 17%. Moreover, SAAW
is 7X faster than SPC because there are fewer random variables for SAAW.
157
Bench- Exact SAAW QAW LDAW SPC IW
mark µ σ 95% µ σ 95% T µ σ 95% T µ σ 95% T µ σ 95% T µ σ 95% T
c1355 62.1 14.5 92.5 +1.5 +3.3 +3.2 16.3 +3.4 +6.5 +5.4 9.8 -5.5 -15.6 -16.9 2.5 +5.3 +10.4 +12.2 123 -7.9 -19.6 -20.9 2.7
c1908 95.6 20.3 144 +0.9 +4.4 +3.7 15.5 +2.3 +7.8 +6.3 9.7 -6.5 -17.8 -19.6 2.6 +5.7 +14.8 +14.3 122 -8.6 -19.2 -23.5 2.9
c2670 131 22.9 181 +1.4 +2.7 +1.7 15.7 +2.9 +8.5 +4.7 9.5 -7.9 -16.9 -22.0 2.8 +6.8 +12.2 +9.4 122 -9.2 -20.3 -25.5 2.4
c3540 201 37.4 282 +1.5 +2.3 +1.8 15.4 +3.1 +5.8 +4.4 10.4 -5.6 -16.5 -20.2 2.6 +4.9 +11.2 +8.2 123 -8.3 -18.2 -23.5 2.7
c7552 403 73.2 562 +1.6 +2.7 +1.9 15.3 +3.7 +6.0 +5.0 10.1 -7.3 -12.6 -16.9 2.6 +7.1 +13.8 +10.7 122 -9.2 -20.3 -24.5 2.5
Table 6.8: Leakage error percentage for different models in 2cm×2cm chip. Note:
The exact values are in mW. Run time (T) is in s.
• SAAW is more accurate than QAW, but is about 50% slower.
• Both LDAW and IW are not accurate. This is because both these models do not
consider correlation and hence under estimate the leakage power variation.
Similar to SSTA, for leakage variation analysis, we also perform leakage esti-
mation in different size chips: 1cm×1cm, 6mm×6mm, and 3mm×3mm. For the
1cm×1cm chip, we assume that 225 copies of ISCAS85 benchmark circuits are placed
in a 15×15 array, for the 6mm×6mm chip, we assume 100 copies of ISCAS85 bench-
mark circuits are placed in a 10 ×10 array, and for the 3mm×3mm chip, we assume
25 copies of ISCAS benchmark circuits are placed in a 5 ×5 array. Table 6.9 shows
the error percentage for different model on different size chips. From the table, we
find that the error of LDAW, SPC, and IW reduces when chip size becomes smaller as
expected.
158
Bench- Chip Exact SAAW LDAW SPC IW
mark size µ σ 95% µ σ 95% µ σ 95% µ σ 95% µ σ 95%
c1355 10 15.4 3.52 23.2 +1.2 +3.1 +3.0 -4.7 -10.6 -11.2 +4.8 +7.5 +9.2 -5.4 -12.3 -13.8
6 5.92 1.62 10.3 +1.0 +2.7 +2.9 -3.5 -7.5 -8.3 +2.7 +6.2 +6.7 -3.9 -8.1 -9.2
3 1.48 0.40 2.58 +0.6 +1.8 +2.0 -1.8 -3.5 -3.7 +1.6 +2.9 +3.1 -2.0 -3.5 -4.0
c1908 10 23.9 5.7 36.1 +1.0 +2.7 +2.9 -4.2 -12.3 -14.1 +3.7 +8.5 +9.2 -5.5 -13.9 -15.1
6 10.6 2.25 16.1 +0.9 +2.0 +2.1 -3.4 -7.3 -8.2 +1.9 +5.6 +6.8 -3.5 -7.3 -9.0
3 2.65 0.57 4.03 +1.0 +2.2 +2.2 -1.3 -3.5 -4.0 +1.2 +2.9 +3.1 -1.9 -4.0 -4.4
c2670 10 32.8 5.72 45.2 +1.2 +2.8 +2.9 -5.3 -11.1 -14.0 +4.2 +8.2 +9.1 -6.5 -12.3 -17.1
6 14.6 2.55 20.1 +1.0 +1.8 +1.9 -3.5 -6.2 -7.1 +2.4 +4.3 +6.0 -4.3 -7.1 -8.3
3 3.65 0.65 5.03 +0.8 +1.4 +1.7 -1.9 -3.2 -3.7 +2.0 +3.6 +3.5 -2.2 -4.5 -5.1
c3540 10 50.5 9.37 71.1 +1.2 +1.8 +2.1 -3.9 -7.8 -10.5 +2.9 +5.3 +6.2 -5.0 -9.6 -14.5
6 22.3 4.16 30.2 +1.0 +1.4 +1.7 -2.3 -4.5 -5.5 +1.8 +3.5 +3.6 -3.4 -6.1 -8.5
3 5.58 1.05 7.55 +0.9 +1.2 +1.4 -1.2 -3.1 -3.0 +1.0 +1.4 +2.0 -1.4 -3.4 -3.9
c7552 10 102 18.5 141 +1.4 +2.3 +2.4 -4.6 -7.3 -9.9 +4.0 +6.2 +8.6 -6.3 -8.2 -12.5
6 45.0 8.19 6.26 +1.2 +1.9 +1.9 -3.0 -5.2 -7.3 +2.7 +4.2 +5.1 -3.5 -6.0 -8.2
3 11.4 2.06 1.58 +0.9 +0.7 +1.3 -1.2 -1.8 -2.0 +1.5 +1.6 +1.6 -2.3 -2.9 -3.4
Table 6.9: Leakage error for different variation model on different size chips. Note:
exact values are in mW.
6.8 Summary of Different Models
In previous sections, we compared the accuracy and efficiency of different models.
Table 6.10 summarizes the advantages and disadvantages of our proposed spatial vari-
ation models (SAAW, QAW, and LDAW), and the traditional variation models (SPC
and IW). Our proposed across-wafer variation models exactly model the across-wafer
variation and the number of random variables does not depend on chip size. Therefore
they are accurate and efficient. SAAW has six random variables and it can be applied to
any across-wafer variation models. QAW has four random variables, hence it is more
efficient than SAAW. However, it can be applied only when the across-wafer variation
is a perfect parabola. LDAW is the most efficient, ignores correlation and only works
for small chips. Moreover, SAAW and QAW need to know the across-wafer variation,
therefore, one needs to track the die locations within the wafer to build up the model.
159
On the other hand, the traditional variation models (as well as LDAW) only require
measurement on a die without tracking die locations. Therefore, they are somewhat
easier to build. However, such models are not accurate compared to our proposed
models. Moreover, for SPC, since the number of random variables depends on number
of grids, it is not as efficient as our proposed models.
160
Model Type Advantages Disadvantages Models # of RVs Case to Apply
Across- Accurate Need die tracking SAAW Equ(6.19) 6 large chip, non-parabola across-wafer variation
wafer Efficient to extract QAW Equ (6.2) 4 large chip, parabola across-wafer variation
models LDAW Equ (6.20) 2 small chip
Traditional Easy to Not SPC Depend on # of grids large chip
models extract accurate IW 2 small chip
Table 6.10: Summary of different variation models.
1
6
1
6.9 Conclusions
In this chapter, we analytically study the impact of systematic across-wafer variation
on within-die spatial variation. For simplicity, we assume that across-wafer variation is
a quadratic function. We first observe that different locations of a chip may have differ-
ent means and variances and such difference becomes more significant when chip size
increases. Secondly, we find that spatial correlation is visible only when the across
wafer systematic is not taken into account. When it is taken into account, we show
that within-die random variability does not exhibit a strong or useful pattern of spatial
correlation. We exploited these observations in order to create a much more accurate
and efficient model for performance variability prediction. Thirdly, we find that the
within-die spatial variation is NOT independent of the inter-die variation. However,
when chip size is small enough, such dependence is weak and the across-wafer varia-
tion can be lumped in to inter-die variation. In this case, the two level inter-/within-die
variation model is still accurate. Later, we also analyze the case when the across-wafer
is not a perfect quadratic function. Based on the above analysis, we have proposed
an accurate and efficient variation model for deterministic across wafer variation. We
further apply our new variation model to two applications: statistical static timing anal-
ysis and statistical leakage analysis. Experimental result shows that compared to the
distance-based spatial variation model, our new model reduces the error from 6.5%
to 2% for statistical timing analysis and reduces error from 17% to 5% for statistical
leakage analysis. Our model also improves the run time by 6X for statistical timing
analysis and by 7X for statistical leakage analysis.
162
CHAPTER 7
On Confi dence in Characterization and Application of
Variation Models
In this chapter we study statistics of statistics. Statistical modeling and analysis have
become the mainstay of modern design-manufacturing flows. Most analysis tech-
niques assume that the statistical variation models are reliable. However, due to limited
number of samples (especially in the case of lot-to-lot variation), calibrated models
have low degree of confidence. The problem is further exacerbated when production
volumes are low (≤65 lots) causing additional loss of confidence in statistical analysis
(since production only sees a small snapshot of the entire distribution). The problem
of confidence in statistical analysis is going to be further worsened with advent of
450mm wafers. We mathematically derive confidence intervals for commonly used
statistical measures (mean, variance, percentile corner) and analysis (SPICE corner
extraction, statistical timing). Our estimates are within 2% of simulated confidence
values. Our experiments (with variability assumptions derived from test silicon data
from a 45nm industrial process) indicate that for moderate characterization volumes
(10 lots) and low-to-medium production volumes (15 lots), a significant guardband
(e.g., 34.7% of standard deviation for single parameter corner, 38.7% of standard de-
viation for SPICE corner, and 52% of standard deviation for 95-percentile point of
circuit delay are needed) to ensure 95% confidence in the results. The guardbands are
non-negligible for all cases when either production or characterization volume is not
163
large. We also study the interesting one production lot case which may be common for
prototyping as well as for academic designs. The proposed methods are not runtime-
intensive (always within 10s) as they require Monte-Carlo simulations on closed form
expressions.
7.1 Introduction
In process variation modeling, analysis or optimization, statistical characteristics of
variation sources, such as mean, variance, and skewness, are often assumed to be
known and reliable.
In practice, the statistical characteristics of the variation sources are obtained from
measurement of a set of samples. Due to limited number of measurements, the mea-
sured statistical characteristics may be unreliable. This is especially true for lot-to-lot
variation, since the number of lots measured for characterization is usually small.
Moreover, the existing statistical analysis, implicitly assume that the production
chips have the same statistical characteristics as the corresponding population values
1
.
When the number of production samples is not small, the production statistical char-
acteristics may significantly deviate from their population values. Therefore, the un-
certainty in statistics of measured data as well as production data should be considered
in statistical analysis. [177] models the uncertainty of mean, variance, and correlation
coefficients as an interval, and then estimates the range of mean and variance of circuit
performance. However, this work only models the uncertainty as an interval, ignoring
their true distributions. It also ignores the uncertainty introduced by limited number of
production lots.
Before we move further, it is important to briefly review the concept of confidence
1
Here by “population”, we mean the statistical values in the case of infinite measured as well as
production samples.
164
intervals in statistical analysis. Consider a standard normal random variable X, we
choose n samples of it (X
1
, X
2
, . . . , X
n
). Once the samples are chosen, the sample mean
ˆ µ and variance ˆ σ
2
are fixed. However, if we repeatedly choose n samples for 1,000
times, and calculate the sample mean and variance for each time, the results may dif-
fer. When the sample mean is smaller than a certain value µ
conf
in 900 out of the 1000
runs, we say that we have 90% confidence that the sample mean is smaller than µ
conf
.
Notice that the confidence is not yield but the probability that certain statistical char-
acteristic is within a given interval. Once chips are produced, the distribution of (say)
circuit delay is fixed. But due to the uncertainty of the statistical model, we do not
know exactly the production mean and variance prior. For a given confidence, one can
estimate the distribution of mean and variance (of course the actual mean and variance
are going to be just one sample out of these distributions).
In this chapter, we study the uncertainty of statistical models and analyze the im-
pact of such uncertainty on statistical analysis. The contributions of this chapter are:
1. Given the number of measured lots and the number of production lots,we esti-
mate the distribution of the production mean and variance of variation sources.
2. For a given confidence level, we estimate the worst case fast/slow corner (µ±kσ
corner) for variation sources.
3. We extend the ideas to extract SPICE corners depending on desired confidence
level as well as number of measured/produced silicon lots.
4. We estimate the confidence interval of mean, variance and quantile for statistical
timing analysis.
Experimental results show that the required guardband (to reach desired confidence)
estimated by our method is very close to the exact simulation value. We also observe
165
that the guardband value increases dramatically when confidence level increases. We
need to introduce up to 30% guardband value to achieve 95% confidence.
The rest of this chapter is organized as follows: Section 7.2 gives the problem
formulation; then Section 7.3 studies the confidence interval estimation for variation
sources; Section 7.4 presents the estimation of worst case delay for SPICE corner
estimation and Section 7.5 analyzes the impact of mean and variance uncertainty on
statistical timing analysis; Section 7.6 discusses how to choose confidence level in real
circuit design; and finally Section 7.7 concludes this chapter.
7.2 Problem Formulation
Process variation is decomposed into inter-lot (V
l
), inter-wafer(V
w
), inter-die (V
d
), and
within-die (V
r
) variation:
V =V
l
+V
w
+V
d
+V
r
(7.1)
µ
V
= µ
l

w

d

r
(7.2)
σ
2
V
= σ
2
l

2
w

2
d

2
r
(7.3)
where µ
l

2
l
), µ
w

2
w
), µ
d

2
d
), and µ
r

2
r
) are means (variances) of inter-lot, inter-
wafer, inter-die, and within-die variations, respectively. It has been shown that in case
of finite number of samples the sample mean follows Gaussian distribution and the
sample variance follow χ
2
distribution [4]. The variances of the means and variances
are:
σ
2
µl
= σ
2
l
/N
l
(7.4)
σ
2
σ
2
l
= σ
2
l
/N
l
(7.5)
σ
2
µw
= σ
2
w
/N
w
(7.6)
σ
2
σ
2
w
= σ
2
w
/N
w
(7.7)
166
σ
2
µd

2
d
/N
d
(7.8)
σ
2
σ
2
d

2
d
/N
d
(7.9)
σ
2
µr
= σ
2
r
/N
r
(7.10)
σ
2
σ
2
r
= σ
2
r
/N
r
(7.11)
where N
l
, N
w
, N
d
, and N
r
are total number of lots, wafers, dies, and circuit el-
ements. There are usually 20 to 50 wafers per lot, more than one hundred dies per
wafer, and tens of measured circuit elements within each die
2
. Therefore, N
r
N
d

N
w
N
l
. Hence σ
2
µl
σ
2
µw
σ
2
µd
σ
2
µr
, σ
2
σ
2
l
σ
2
σ
2
w
σ
2
σ
2
d
σ
2
σ
2
r
. Therefore, the
uncertainty of mean and variance comes largely from lot-to-lot variation. Hence in this
chapter, we focus our attention on lot-to-lot variation.
Table 7.1 illustrates five cases of number of measured lots and number of produc-
tion lots.When the number of measured lots or production lots is small, the confidence
interval is large and hence statistical analysis is not reliable. In this case, we may want
to guardband statistical analysis to ensure a certain degree of confidence. Example use
models for these different scenarios can be: commodity process/high volume part (L-
L); niche process/high column part (S-L); commodity process/niche part (L-S); niche
process/niche part (S-S); and commodity (or niche) process/prototyping (L-1).
In practice, only the measured as opposed to the population value is known. There-
fore, our the problem is:
Given the number of measured lots and production lots and the measured
values of mean and variance of variation sources, estimate the distribution of
statistical measures of production lots.
In rest of the chapter, we assume that the lot-to-lot variation of all the variation
sources follows Gaussian distribution.
2
Note that 85 lots of 25 300mmwafers each with die-size of 100mm
2
amounts to production volume
of 1.5 million chips.
167
Case # Measured # Production Confidence Reliability of
lots lots interval Statistical analysis
L-L Large Large Small High
S-L Small Large Large Low
L-S Large Small Large Low
S-S Small Small Large Low
L-1 Any 1 Large Low
Table 7.1: Reliability of statistical analysis for different number of measured and pro-
duction lots.
7.3 Confi dence Interval for Variation Sources
In this section, we will discuss confidence interval estimation for a single variation
source. Table 7.3 summarizes the notation used in the rest of the chapter.
7.3.1 Mean and Variance
Let’s first discuss the mean and variance. From (7.1), we calculate the total mean and
variance of measured value as:
ˆ µ
t
= ˆ µ
l

o
(7.12)
ˆ σ
2
t
= ˆ σ
2
l

2
o
(7.13)
As discussed in Section 7.2, the uncertainty of mean and variance comes largely from
lot-to-lot variation. Therefore, we assume the mean µ
o
and variance σ
2
o
of all other
variation are reliable. In the same way as above, we calculate the total mean and
variance of the production as:
˜ µ
t
= ˜ µ
l

o
(7.14)
˜ σ
2
t
= ˜ σ
2
l

2
o
(7.15)
In the rest of this subsection, we will focus on analyzing confidence interval for lot-to-
lot variation.
168
n number of lots
µ, σ
2
mean and variance
cn
f /s
fast/slow corner
no hat population value (µ, σ
2
and so on)
ˆ value obtain from measured lots ( ˆ n, ˆ µ, and so on)
˜ value of production lots ( ˜ n, ˜ µ, and so on)
2
t
total value including inter-lot, inter-wafer, inter-die and
within-die variation (µ
t
, σ
2
t
, and so on)
2
l
inter-lot variation value (µ
l
, σ
2
l
, and so on)
2
w
inter-wafer variation value (µ
w
, σ
2
w
, and so on)
2
d
inter-die variation value (µ
d
, σ
2
d
, and so on)
2
r
within-die variation value (µ
r
, σ
2
r
, and so on)
2
o
total value except inter-lot variation (µ
o
, σ
2
o
, and so on)
µ
o
= µ
w

d

r
, σ
2
o

2
w

2
d

2
r
Table 7.2: Notations.
In order to estimate the confidence interval, we first estimate the population values
of mean and variance from measurement data [4]:
µ
l
= ˆ µ
l
+ ˆ σ
l
· M
1
·

ˆ n−1
ˆ nQ
1
(7.16)
σ
2
l
= ( ˆ n−1) ˆ σ
2
l
/Q
1
(7.17)
where M
1
∼N(0, 1) is a random variable with standard normal distribution, and Q
1

χ
2
ˆ n−1
is a random variable with χ
2
distribution with ˆ n −1 degrees of freedom [4].
Then, we estimate the production mean and variance with respect to population mean
and variance:
˜ µ
l
= µ
l

l
· M
2
/

˜ n (7.18)
˜ σ
2
l
= σ
2
l
· Q
2
/( ˜ n−1) (7.19)
where M
2
∼N(0, 1) is a random variable with standard normal distribution and Q
2

χ
2
˜ n−1
is a random variable with χ
2
distribution with ˜ n−1 degrees of freedom. Finally,
169
we obtain the production mean and variance as:
˜ µ
l
= ˆ µ
l
+ ˆ σ
l
·

ˆ n−1
Q
1

M
1

ˆ n
+
M
2

˜ n

= ˆ µ
l
+ ˆ σ
l
ηT (7.20)
˜ σ
2
l
= ˆ σ
2
l
·
Q
2
( ˆ n−1)
Q
1
( ˜ n−1)
= ˆ σ
2
l
· ζ
2
· U (7.21)
where
η =

ˆ n+ ˜ n
ˆ n˜ n
(7.22)
ζ =

ˆ n−1
˜ n−1
(7.23)
T =

ˆ n−1(

˜ nM
1
+

ˆ nM
2
)

( ˆ n+ ˜ n)Q
1
(7.24)
U =
Q
2
Q
1
(7.25)
It is easy to find that:

˜ nM
1
+

ˆ nM
2

ˆ n+˜ n
∼ N(0, 1), hence, T is the ratio between a standard
normal random variable and square root of a χ
2
random variable. Therefore, T is a
random variable with t-distribution with ˆ n degrees of freedom [4]. Moreover, U is the
ratio of two χ
2
random variables. The PDF of U is [4]:
PDF
U
(u) =
Γ(
ˆ n+˜ n
2
−1)u
ˆ n
2
−1
Γ(
ˆ n
2
)Γ(
˜ n
2
)(u+1)
ˆ n+˜ n
2
(7.26)
where Γ(·) is Gamma function [4] which is calculated as:
Γ(x) =
Z

0
t
x−1
e
−t
dt (7.27)
7.3.2 Simplifying the Large Lot Case
In the previous subsection, we analyzed the case when both ˆ n and ˜ n are small (S-S
case). We may simplify our model when either ˆ n or ˜ n is large.
When the number of measured lots ˆ n is large (L-S case), we can approximate the
170
population mean and variance as measured mean and variance:
µ
l
≈ ˆ µ
l
(7.28)
σ
2
l
≈ ˆ σ
2
l
(7.29)
In this case, (7.20) and (7.21) can be simplified to:
˜ µ
l
= ˆ µ
l
+ ˆ σ
l
· M
2
/

˜ n (7.30)
˜ σ
2
l
= ˆ σ
2
l
· Q
2
/( ˜ n−1) (7.31)
In the other case when the number of production lots ˜ n is large (L-S case), the
production values of mean and variance are very close to the population value, i.e.:
˜ µ
l
≈ µ
l
(7.32)
˜ σ
2
l
≈ σ
2
l
(7.33)
Therefore, (7.20) and (7.21) can be simplified to:
˜ µ
l
= ˆ µ
l
+ ˆ σ
l
M
1
·

ˆ n−1
ˆ nQ
1
(7.34)
˜ σ
2
l
= ( ˆ n−1) ˆ σ
2
l
/Q
1
(7.35)
7.3.3 One Production Lot Case
Besides S-S, L-S, and S-L cases, we also consider the special case of one production
lot (L-1 case), that is ˜ n = 1. In this case, production mean still follows the same
distribution as shown in (7.20) and (7.21). However, since there is only one lot, lot-to-
lot variation has no impact on variance. Uncertainty of variance estimation is mainly
caused by wafer-to-wafer variation. Hence, in this case, the production variance is
calculated as:
˜ σ
2
t
= ˆ σ
2
w
· Q

2
/( ˜ n

−1) +σ
2
d

2
r
(7.36)
171
Targeted conf.% 50 60 70 80 90
Measured conf.% 51.72 60.34 72.41 81.03 93.10
Table 7.3: Experimental validation of estimation of confidence interval for mean for
ˆ n = 5 and ˜ n = 1.
where ˜ n

is number of production wafers, Q

2
is a χ
2
random variable with ˜ n

−1 free-
dom, and ˆ σ
2
w
is the measured variance of wafer-to-wafer variation.
7.3.4 Experimental Validation for One Lot Case
The problems in validation of the confidence-based guardband models proposed in this
work should be obvious by now. We need exceedingly large amounts of data spread
out over hundreds of lots preferably over multiple independent designs. Fortunately,
the mathematical derivations in this work make very few approximations if any. Nev-
ertheless, in this section we try to validate the theory above for the case with small
number of measured lots and one production lot. To get over the hurdle of huge data
requirement, we use wafer-to-wafer variation instead of lot-to-lot variation.
We obtain wafer-to-wafer ring oscillator delay variation data from an industrial
65nm process with 348 wafers spread over 23 lots. We randomly divide these wafers
into 58 sets with 6 wafers per set. We model each set as a design and assume that 5 of
the wafers are used to generate the statistical model ( ˆ n = 5) and the other wafer is the
production wafer ( ˜ n = 1). For each set we apply (7.20) to estimate the distribution of
production mean and calculate the confidence interval for a targeted confidence level.
We obtain the measured confidence by counting the number of sets that production
value is within the confidence interval: Measured Conf = Number of match sets/ Total
number of sets.
Table 7.3 compares the targeted confidence level and the measured confidence
level. The small error (≤4%) is due to limited data.
172
7.3.5 Fast/Slow Corner Estimation
The next statistical characteristics we consider are fast and slow corners cn
f /s
for vari-
ation sources.
Usually, the fast/slow corner is expressed as:
cn
f /s
= ˜ µ
t
±k
f /s
˜ σ
t
(7.37)
From (7.20) and (7.21), we have:
cn
f
= ˆ µ
l

o
+ ˆ σ
l
ηT −k
f

ˆ σ
2
ζ
2
U +σ
2
o
(7.38)
cn
s
= ˆ µ
l

o
+ ˆ σ
l
ηT +k
s

ˆ σ
2
ζ
2
U +σ
2
o
(7.39)
For a given confidence level, we want to estimate the worst case fast/slow corner,
cn
w
f /s
. However, fast corner and slow corners are dependent on each other. Therefore,
we need to handle them together. Assuming that hold time violation is harder to fix,
higher confidence may be needed on fast corner. We estimate the worst case fast/slow
corner such that there is c f
f
confidence that the fast corner is larger than cn
w
f
and there
is c f
t
confidence that the fast corner is larger than cn
w
f
and the slow corner is smaller
than cn
w
s
, that is:
3
P{ cn
f
> cn
w
f
} = c f
f
(7.40)
P{ cn
f
> cn
w
f
, cn
s
< cn
w
s
} = c f
t
(7.41)
Due to the complexity of (7.38), we use Monte-Carlo simulations to obtain the distri-
butions of the fast/slow corners. Then the worst case fast corner is calculated as:
cn
w
f
=CDF
−1
cn
f
(c f
f
) (7.42)
With the worst case fast corner value, from the samples, we may also obtain the con-
ditional CDF of cn
s
given cn
f
> cn
w
f
, CDF
cn
s
| cn
f
<cn
w
f
. Then the worst case slow corner
3
Without loss of generality, we assume fast implies smaller parameter value (e.g. channel width).
173
is calculated as:
cn
w
s
=CDF
−1
cn
s
| cn
f
<cn
w
f
(c f
t
/c f
f
) (7.43)
We then express the worst case corner value as:
cn
w
f /s
= ˆ µ±k

f /s
ˆ σ (7.44)
When the worst case corner values are known, we can easily obtain the value of k

f /s
.
Table 7.4 shows the value of k

f /s
under different confidence levels. In the table, the
guardband percentage is calculated as: 100×(k

f /s
−k
f /s
)/k
f /s
. In the experiment, we
let ˆ n = 10, ˜ n = 15, k
f
= k
s
= 3 and we assume that the variance of inter-lot variation
is 18% of the total variance (derived from the data described in section 7.3.4). We
also assume that the variation source is with standard normal distribution
4
. In the
experiment, we perform the estimation for 1000 runs. For each run, we generate ˆ n
measured samples to obtain the measured mean ˆ µ and variance ˆ σ
2
, and then apply the
method above to calculate the worst case value of k

f /s
. Notice that the k

f /s
depends on
the measured values ˆ µ and ˆ σ
2
, hence each run may result in a different k

f /s
value. In
the table, the k

f /s
values are calculated as the average of 1000 runs. From the table, we
can find that to ensure high confidence, k

f /s
needs to be 30% over k
f /s
, i.e., we need a
large guardband.
As discussed in Section 7.3.2, when the number of measured lots is large (L-S), the
mean and variance can be simplified as (7.30). Hence the corner model can be also
simplified as:
cn
f
= ˆ µ
l
+ ˆ σ
l
M
2

˜ n
−k
f

ˆ σ
2
l
· Q
2
( ˜ n−1)

2
o
(7.45)
cn
s
= ˆ µ
l
+ ˆ σ
l
M
2

˜ n
+k
s

ˆ σ
2
l
· Q
2
( ˜ n−1)

2
o
(7.46)
4
The nominal mean and variance of variation sources does not affect the value of k

f
/s.
174
c f
t
% c f
f
% k

f
k

s
50 60 3.080 (2.67%) 3.246 (8.20%)
60 70 3.155 (5.13%) 3.275 (9.13%)
70 80 3.255 (8.50%) 3.307 (10.2%)
80 90 3.422 (14.7%) 3.345 (11.5%)
90 95 3.594 (19.8%) 3.518 (17.2%)
95 99 4.042 (34.7%) 3.619 (20.6%)
Table 7.4: Guardband of fast/slow corner for different degrees of confidence assuming
ˆ n = 10, ˜ n = 15.
c f
t
% c f
f
% k

f
k

s
50 60 3.035 (1.17%) 3.103 (3.43%)
60 70 3.112 (3.73%) 3.121 (4.03%)
70 80 3.124 (4.14%) 3.138 (4.60%)
80 90 3.236 (7.87%) 3.215 (7.17%)
90 95 3.401 (13.4%) 3.342 (11.4%)
95 99 3.625 (20.8%) 3.502 (16.7%)
Table 7.5: Guardband of fast/slow corner for different degrees of confidence assuming
large ˆ n and ˜ n = 15.
Table 7.5 shows the average k

f /s
value for the L-S case. As expected, the guardband
in this case is smaller than the S-S case.
Similarly, when the number of production lots is large (S-L), corner model can be
simplified to:
cn
f
= ˆ µ
l
+ ˆ σ
l
· M
1
·

ˆ n−1
ˆ nQ1
−k
f

( ˆ n−1) ˆ σ
2
l
Q
1

2
o
(7.47)
cn
s
= ˆ µ
l
+ ˆ σ
l
· M
1
·

ˆ n−1
ˆ nQ1
+k
s

( ˆ n−1) ˆ σ
2
l
Q
1

2
o
(7.48)
Table 7.6 shows the average k

f /s
value for the S-L case. From the table, it is evident
that the guardband is smaller than that of the S-S case.
Figure 7.1 compares the 90% confidence value of k

f
between the L-S (or S-L) and
L-L. We observe that when ˜ n ≥ 60 (or ˆ n ≥ 85), L-S (or S-L) can be treated as L-L.
175
c f
t
% c f
f
% k

f
k

s
50 60 3.052 (1.73%) 3.132 (4.40%)
60 70 3.134 (4.47%) 3.155 (5.17%)
70 80 3.165 (5.50%) 3.193 (6.43%)
80 90 3.272 (9.07%) 3.267 (8.90%)
90 95 3.431 (14.4%) 3.370 (12.3%)
95 99 3.690 (23.0%) 3.523 (17.4%)
Table 7.6: Guardband of fast/slow corner for different confidence levels assuming
ˆ n = 10 and large ˜ n.
That is, the statistical analysis is reliable and we do not need to perform guardband
estimation. In the experiment, when error of k

f
value between L-L and L-S (or S-L)
cases is within 3%, we consider ˆ n (or ˜ n) value is large.
50 100 150
3.05
3.1
3.15
3.2
3.25
3.3
n
~
(n)
^
k

f


L−S
S−L
n
^
=80
n
~
=60
k’
f
=3.09
Figure 7.1: Comparison of S-L, L-S, and L-L case.
Note that increasing lot-to-lot variation will increase the value of ˆ n and ˜ n that can be
considered large as shown in Figure 7.2. This is because when the lot-to-lot variation
increase, the uncertainty of mean and variance increases. Therefore, we need a larger
number of measured lots (or production lots) to be considered as large.
As discussed in Section 7.3.2, in the special case when there is only one production
lot (L-1), lot-to-lot variation has no impact on the variance. Instead, the uncertainty of
176
20 40 60
0
200
400
600
800
σ
2
l
%
L
a
r
g
e

n ~

(
n
)
^


n
~
n
^
Figure 7.2: Number of ˆ n or ˜ n which can be considered as large for different fraction of
lot-to-lot variation assuming maximum 3% error at 90% confidence.
the variance comes from wafer-to-wafer variation. From (7.36), we can calculate the
fast/slow corners as:
cn
f
= ˆ µ
l

o
+ ˆ σ
l
ηT −k
f

ˆ σ
2
w
Q

2
/( ˜ n

−1) +σ
2
o
(7.49)
cn
s
= ˆ µ
l

o
+ ˆ σ
l
ηT +k
s

ˆ σ
2
w
Q

2
/( ˜ n

−1) +σ
2
o
(7.50)
Table 7.7 shows the k

f /s
for the L-1 case. In the experiment, we assume that there
are 25 wafers per lot, ˜ n

= 25. From the table, we see that the guardband is smaller
than that of the L-S and S-L cases. This somewhat surprising result is explained by
the fact that lot-to-lot variation has no impact on variance (but still affects the mean),
hence the uncertainty of the corner is smaller.
177
c f
t
% c f
f
% k

f
k

s
50 60 3.051 (1.70%) 3.261 (8.70%)
60 70 3.116 (3.87%) 3.278 (9.27%)
70 80 3.197 (6.57%) 3.293 (9.77%)
80 90 3.320 (10.7%) 3.307 (10.2%)
90 95 3.438 (14.6%) 3.432 (14.4%)
95 99 3.574 (19.1%) 3.475 (15.8%)
Table 7.7: Guardband of fast/slow corner for different confidence levels assuming
ˆ n = 10, ˜ n = 1, and ˜ n

= 25.
7.4 Confi dence in SPICE Fast/Slow Corner
Fast/slow corners for circuit simulation is the major interface between design and pro-
cess. A common approach to derive the corners is to use measured values from canon-
ical circuits such as an inverter chain. Variation source values corresponding to the
corner delay D
w
f /s
are then calculated.
We assume that the inverter chain delay is a linear function of variation sources:
D = d
0
+
m

i=1
c
i
X
i
(7.51)
where m is the number of variation sources, d
0
is the nominal delay, X
i
’s are variation
sources, and c
i
’s are sensitivity coefficients of inverter chain delay to variation sources.
Considering all variation sources to be independent, the mean and variance of the
product inverter chain delay can be calculated as:
˜ µ
D
= d
0
+
m

i=1
c
i

oi
+ ˜ µ
li
) (7.52)
˜ σ
2
D
=
m

i=1
c
2
i

2
oi
+ ˜ σ
2
li
) (7.53)
where µ
oi
and σ
2
oi
are the mean and variance of other variation for the i
th
variation
source, respectively, ˜ µ
li
and ˜ σ
2
li
are the mean and variance of lot-to-lot variation for
the i
th
variation source, respectively. ˜ µ
li
and ˜ σ
2
li
can be calculated as shown in Sec-
tion 7.3.1.
178
Usually, the delay corner is expressed as
˜
D
f /s
= ˜ µ
D
±k
f /s
˜ σ
D
(7.54)
As in Section 7.3.5 we use Monte-Carlo simulation to obtain the PDF of the fast and
slow delay corners. One sided confidence interval is then used to obtain the worst case
fast and slow delay corners D
w
f
/s. Corresponding corner values of variation sources
are calculated by solving the following [121]:
D
w
f /s
= d
0
+
m

i=1
c
i
X
w
i
(7.55)
X
w
1f /s
−ˆ µ
1
c
1
ˆ σ
1
=
X
w
2f /s
−ˆ µ
2
c
2
ˆ σ
2
. . . =
X
w
mf /s
−ˆ µ
m
c
m
ˆ σ
m
(7.56)
Table 7.8 illustrates the guardband percentage for the worst case delay and the cor-
responding variation source values under different confidence levels. In the table, the
guardband percentage is calculated as: 100×|worst case value - nominal value|/nominal value.
In the experiment, we use PTM bulk low power SPICE model for 45nm technology
[1]. We assume three variation sources, gate length L, threshold voltage for NMOS V
tn
and PMOS V
t p
. For gate length variation, we assume 3σ = 3nm, for threshold voltage
variation, we assume that the 3σ value is 20% of the nominal value. We also assume
that the inter-lot variation is 18% as in previous sections. In the experiment, we let
ˆ n = 10 and ˜ n = 15. Since measured corner values affect production corners,we show
the average of 1000 runs.
From the table, we observe that under the same confidence level, the guardband
percentage of delay corner is less than that of k

f
value of variation sources as shown
in Table 7.4. This is because the k

f
value of variation sources does not depend on
the nominal mean and variance. However, for the worst case delay, the guard band
percentage does depend on nominal mean. When the nominal mean increases, the
guardband percentage decreases.
179
c f
t
c f
f
D
w
f
L
w
f
V
w
tnf
V
w
t pf
D
w
s
L
w
s
V
w
tns
V
w
t ps
50 60 0.29 0 07 0.21 0.20 0.52 0.13 0.39 0.36
60 70 0.57 0 14 0.43 0.40 0.77 0.19 0.58 0.55
70 80 1.15 0 29 0.86 0.81 1.29 0.32 0.97 0.91
80 90 2.01 0 50 1.50 1.41 1.80 0.45 1.35 1.27
90 95 3.15 0 79 2.36 2.22 2.84 0.71 2.13 2.00
95 99 4.58 1 15 3.44 3.23 3.61 0.90 2.71 2.54
Nominal 349 42.62 0.558 0.603 388 47.41 0.612 0.643
Table 7.8: Percentage guardband of worst case delay corner and corresponding varia-
tion source corners under different confidence levels assuming ˆ n = 10, ˜ n = 15. Note:
delay value is in ps, L
gate
value is in nm, and V
th
value is in V.
Targeted Conf Simulated Conf
c f
t
c f
f
c f
t
c f
f
50 60 51.9 61.6
60 70 61.7 72.4
70 80 71.6 81.4
80 90 81.1 91.3
90 95 90.8 95.7
95 99 95.5 99.2
Table 7.9: Confidence comparison for inverter chain based corner assuming ˆ n = 10,
˜ n = 15.
In Table 7.9, we also compare the targeted confidence and the exact confidence of
the estimated worst case corner value. The flow to obtain the simulated confidence is
shown in Figure 7.3. In the experiment, we let n
run
= 1, 000. From the table, we find
that the exact simulated confidence is a little bit higher than the target value. Such
error largely comes from the fitting of the linear delay model to SPICE.
Table 7.10 and 7.11 show the average guardband percentage for the L-S and S-L
cases, respectively. Note that the guardband, though appears to be small, is significant
when compared to variation itself (e.g. σ for V
t p
variation is only 6.7% of nominal
value, while the guardband of fast corner is up to 2.58% of nominal value. That is, the
180
1. n
f
= 0, n
t
= 0
2. for i=1 to n
run
3. Generate ˆ n samples of measured lots /* calculate estimated value */
4. Calculate ˆ µ and ˆ σ
2
for each variation sources from samples
5. Perform 10,000-sample MC simulation on (7.54)
according to ˆ µand ˆ σ
2
.
6. Obtain worst case corners: D
w
f
, D
w
s
.
7. Generate ˜ n samples of production lots /* calculate simulated value */
8. Calculate ˜ µ and ˜ σ
2
for each variation source from samples.
9. Perform 10,000-sample SPICE MC simulation on inverter
chain delay based on ˜ µ and ˜ σ
2
.
10. Calculate mean µ
D
and variance σ
2
D
of SPICE MC samples.
11. Calculate simulated corners: D
f
= µ
D
−k
f
σ
D
, D
s
= µ
D
+k
s
σ
D
.
12. if D
f
> D
w
f
/* compare simulated value and estimated value */
13. n
f
=n
f
+1.
14. if D
s
< D
w
s
15. n
t
= n
t
+1
16. c f
f
= n
f
/n
run
, c f
t
= n
t
/n
run
Figure 7.3: Flow to simulate confidence.
guardband value is up to 38.7% of standard deviation).
Figure 7.1 compares the 90% confidence value of D
w
f
between the L-S (or S-L) and
L-L. In the figure, D
w
f
is normalized with respect to the nominal value. From the figure,
We observe that ˜ n ≥30 (or ˆ n ≥45), can be considered as “large”. Similar to the single
source case, when error of k

f
value between L-L and L-S (or S-L) case are within 3%,
we consider ˆ n (or ˜ n) value is large. We also find that for the worst case delay corner,
the value of ˆ n (or ˜ n) to be considered as large is smaller than that of the single variation
source as in Figure 7.1. This is because the guardband of worst case delay is smaller
than that of single variation source as discussed above. Hence, we need fewer lots to
achieve the same model reliability.
Table 7.12 shows the average guardband percentage for the L-1 case. The guard-
band percentage of this case is smaller that of S-S, L-S, or S-L cases. As discussed in
Section 7.3.5, this is because lot-to-lot variation has no impact on variance. Therefore,
181
c f
t
c f
f
D
w
f
L
w
f
V
w
tnf
V
w
t pf
D
w
s
L
w
s
V
w
tns
V
w
t ps
50 60 0.21 0.05 0.16 0.15 0.39 0.10 0.29 0.27
60 70 0.43 0.11 0.32 0.30 0.58 0.14 0.43 0.41
70 80 0.86 0.21 0.64 0.61 0.97 0.24 0.72 0.68
80 90 1.50 0.38 1.13 1.06 1.35 0.34 1.01 0.95
90 95 2.36 0.59 1.77 1.67 2.13 0.53 1.59 1.50
95 99 3.44 0.86 2.58 2.42 2.71 0.68 2.03 1.91
Table 7.10: Percentage guardband of worst case delay corner and corresponding vari-
ation sources corners under different confidence levels assuming large and ˜ n = 15.
c f
t
c f
f
D
w
f
L
w
f
V
w
tnf
V
w
t pf
D
w
s
L
w
s
V
w
tns
V
w
t ps
50 60 0.24 0.06 0.18 0.17 0.44 0.11 0.33 0.31
60 70 0.49 0.12 0.37 0.34 0.66 0.16 0.49 0.46
70 80 0.97 0.24 0.73 0.69 1.10 0.27 0.82 0.77
80 90 1.70 0.43 1.28 1.20 1.53 0.38 1.15 1.08
90 95 2.68 0.67 2.01 1.89 2.41 0.60 1.81 1.70
95 99 3.90 0.97 2.92 2.75 3.07 0.77 2.30 2.16
Table 7.11: Percentage guardband of worst case delay corner and corresponding vari-
ation source corners under different confidence levels assuming ˆ n = 10 and large ˜ n.
182
50 100 150
0.9
0.92
0.94
0.96
0.98
1
n
~
(n)
^
N
o
r
m
a
l
i
z
e
d

D
w f


L−S
S−L
n
^
=45
n
~
=30
3% error
Figure 7.4: 90% confident D
w
f
with varying ˆ n and ˜ n.
c f
t
c f
f
D
w
f
L
w
f
V
w
tnf
V
w
t pf
D
w
s
L
w
s
V
w
tns
V
w
t ps
50 60 0.23 0.06 0.17 0.16 0.42 0.11 0.32 0.30
60 70 0.38 0.10 0.29 0.27 0.53 0.13 0.40 0.37
70 80 0.69 0.17 0.52 0.49 0.71 0.18 0.53 0.50
80 90 0.95 0.24 0.71 0.67 0.90 0.23 0.68 0.63
90 95 1.46 0.37 1.10 1.03 1.21 0.30 0.91 0.85
95 99 2.05 0.51 1.54 1.45 1.87 0.47 1.40 1.32
Table 7.12: Percentage guardband of worst case delay corner and corresponding vari-
ation source corners under different confidence levels assuming ˆ n = 10, ˜ n = 1, and
˜ n

= 25.
we need lower guardband to achieve the same confidence.
In this section, we illustrated the extra guardband required in corners (which ironi-
cally are themselves guardbands by definition) to ensure confidence in statistical anal-
ysis when the number of lots used to characterize variation or number of production
silicon lots are small. Next we discuss the impact of limited characterization or pro-
duction data on one of the most talked about chip-level statistical analysis methods,
namely, statistical timing analysis.
183
7.5 Confi dence in Statistical Timing Analysis
In this section, we will discuss about confidence estimation for statistical static timing
analysis (SSTA). Since quadratic SSTA with non-Gaussian variation sources is very
complicated, we only analyze linear SSTA with Gaussian variation sources in this
section.
By performing linear SSTA in Section 5.5, the chip delay is expressed as a linear
canonical form of variation sources:
D = d
0
+
m

i=1
c
i
X
i
(7.57)
Since circuit delay has a linear canonical form, its mean and variance can be cal-
culated in the same way as shown in (7.52). We again use Monte-Carlo simulation
on the linear equation to obtain the CDF of the output mean (CDF
µ
D
(·)) and variance
(CDF
σ
2
D
(·)). Then for a given confidence level Con f , it is easy to obtain the worst case
mean and variance:
5
µ
w
= CDF
−1
µ
D
(Con f ) (7.58)
σ
w2
= CDF
−1
σ
2
D
(Con f ) σ
w
=

σ
w2
(7.59)
Another quantity of interest is a certain percentile point. Since we assume Gaussian
variation sources and apply linear canonical form to approximate circuit delay, the
circuit delay also follows Gaussian distribution. Therefore, the percentile point C(p%)
can be calculated as:
C(p%) = µ+kσ (7.60)
k = Φ
−1
(p%) (7.61)
5
Notice that we have known the distribution of mean and variance, it is easy for us to obtain any
interval of mean and variance under a certain confidence.
184
where Φ(·) is the CDF of standard normal distribution. In this way, the percentile point
is converted to a µ+kσ corner. We may apply the same method as in Section 7.4 to
obtain the CDF of the percentile point, and then calculate its worst case value.
Figure 7.5 illustrates the worst case mean, standard deviation, and 95% percentile
point for ISCAS85 benchmarks. In the experiment, we apply the same experimental
setting as that for the inverter chain as in Section 7.4. As discussed before, the worst
case value depends on the measured values of mean and variance. Hence, different
runs may result in different worst case values. In the figure, the worst case value is
averaged over 1,000 runs and normalized with respect to the nominal value. From the
figure, we observe that the guardband of chip delay is lower than that of single varia-
tion source and inverter chain delay. This is because chip delay has a larger nominal
mean to variance ratio than inverter chain. However, although the guardband value
seems not large, it is very significant compared to the scale of variation. For exam-
ple, for the 95-percentile point, the nominal value is 6.9X of standard deviation and
the guardband value is up to 7.6% of nominal value to ensure 95% confidence. That
means the guardband value is up to 52% of standard deviation.
In Table 7.13, we compare the targeted confidence and the simulated confidence of
the estimated confidence interval. The simulated confidence is obtained in Figure 7.3.
The only difference is that instead of running SPICE Monte-Carlo simulation, we ap-
ply SSTA [41] to obtain the chip delay distribution. Due to our simplifying assumption
that the linear canonical form is independent of small changes in mean and variance,
there is a small error between targeted and simulated confidence.
Figure 7.6 illustrates the mean, standard deviation, and 95-percentile point of IS-
CAS85 benchmarks under different confidence level for S-L case. We ignore L-S and
L-1 cases where the need of SSTA is not well motivated.
Figure 7.7 compares the 90% confidence value of 95-percentile point for C7552
185
50 60 70 80 90
1
1.02
1.04
1.06
Confidence
µ
w


C17
C444
c1355
c7552
50 60 70 80 90
1
1.05
1.1
1.15
Confidence
σ
w


C17
C444
C1355
C7552
50 60 70 80 90
1.02
1.04
1.06
1.08
Confidence
W
o
r
s
t

9
5
%


C17
C444
C1355
C7552
Figure 7.5: Worst case mean, standard deviation, and 95-percentile point normalized
with respect to the nominal value for ISCAS85 benchmarks, assuming ˆ n = 10 and
˜ n = 15.
Targeted Simulated conf
conf µ σ
2
95%
50 51.3 51.4 51.3
60 60.8 60.9 61.2
70 70.1 70.7 70.8
80 80.9 81.0 80.3
90 90.5 90.4 90.2
95 95.5 95.2 95.3
Table 7.13: Confidence comparison for ISCAS85 benchmark 95-percentile point delay
assuming ˆ n = 10, ˜ n = 15.
186
50 60 70 80 90
1
1.01
1.02
1.03
1.04
1.05
Confidence
µ
w


C17
C444
C1355
C7552
50 60 70 80 90
1
1.05
1.1
Confidence
σ
w


C17
C444
C1355
C7552
50 60 70 80 90
1
1.02
1.04
1.06
Confidence
w
o
r
s
t

9
5
%


C17
C444
C1355
C7552
Figure 7.6: Normalized worst case mean, standard deviation, and 95-percentile point
for ISCAS85 benchmarks, assuming ˆ n = 10 and ˜ n is large.
187
50 100 150
1
1.01
1.02
1.03
1.04
1.05
1.06
n
^
w
o
r
s
t

9
5
%
n
^
=80
n
^
=40
n
^
=24
Figure 7.7: 90% confidence worst case 95-percentile point with varying ˆ n for C7552.
Percentile point is normalized with respect to the nominal value.
between S-L and L-L. In the figure, percentile point is normalized with respect to the
nominal value. From the figure, we observe that the guardband of chip delay is lower
than that of single variation source and inverter chain delay. We find that we only need
a small number of lots ( ˆ n ≥ 24) to bound the error between S-L and L-L under 3%. If
we want to bound the error to 2% or 1%, we need ˆ n ≥40 and ˆ n ≥80, respectively.
Again, we see that limited data can lead to significantly large guardbanding to en-
sure good confidence in analysis results. The method discussed in this section is fairly
straightforward, fast (essentially Monte Carlo on a linear expression, runtime is within
10s) and accurate to estimate confidence in results of statistical timing analysis espe-
cially when the variability models are known to suffer from limited lot-to-lot variation
data.
188
7.6 Practical Questions: Determining Confi dence Induced Guard-
band for Real Designs
So far in the chapter, we have proposed techniques to quantify the confidence in sta-
tistical circuit analysis and estimate guardband required to attain a desired level of
confidence. The answer to “what is the desired level of confidence” is not an easy
one especially given the steep increase in guardband to ensure higher levels of confi-
dence. Besides quality of models ( ˆ n) and production volume ( ˜ n), two issues influence
the choice of confidence level:
1. Risk (in terms of schedule, performance, power, etc) tolerable for the design.
A lower confidence does not necessarily translate to lower yield but it implies
that the probability of the statistical analysis actually matching real silicon is
smaller. Therefore, confidence level is directly related to risk. For designs where
power/performance constraints are flexible or time to volume is not critical (i.e.,
if yield is lower than expectation, running more lots is acceptable), a lower confi-
dence is acceptable. Confidence levels provide a tradeoff between risk of higher
manufacturing cost (due to low yield) and higher upfront design cost (due to
guardbanding).
2. Extent of post-silicon tuning options available. Ability to change design char-
acteristics during or post-manufacturing mitigates the need for high confidence
in models. If the silicon foundry can extensively adjust the process (which es-
sentially shifts the mean) for the design
6
, the silicon can be made to match the
analysis and ensuring high confidence in design-side estimates is not essential.
Similarly, tunability using knobs such as adaptive body biasing, adaptive voltage
scaling can also alleviate downside of low confidence.
6
This is usually possible only for high-volume, high-value designs.
189
In the end, the choice of confidence will be as much a business decision as a techni-
cal decision as it quantifies the desire for predictability of statistics of manufacturing.
7.7 Conclusions
In this chapter, we have studied impact of characterization and production silicon vol-
umes on believability of statistical variation models and analysis. We have developed
methods to estimate the distribution of production silicon statistical characteristics
(e.g., mean, variance, percentile corners, etc) in terms of (unreliable) measured sta-
tistical characteristics, measured data volume, and intended production volume. The
loss of reliability largely stems from lot-to-lot variation. Our experiments and results
indicate the following.
1. For small characterization or production volumes, best/worst corners for a vari-
ation source may need as much as 30 % guardband to attain 95% confidence.
Our estimates show strong agreement with confidence measured from a set of
real silicon data (maximum error ≤ 4%).
2. For a typical SPICE corner extraction methodology, the guardband needed for
95% confidence is about 14% of the nominal uncertainty (difference between
best/worst corners) for low-to-medium volume production ( ˆ n = 10, ˜ n = 15).
3. Finally, the 95
th
percentile delay as estimated using statistical timing analysis
needs a guardband of 5%-7%.
Predicted statistics are reliable for number of measured lots larger than 85 and
number of production lots larger than 60. We assumed lot-to-lot variation to be 18%
of total variance for our experiments. The required guardband will increase if lot-
to-lot variation increases as a fraction of total variation. We believe that confidence-
190
level induced guardband should be considered for all low-to-medium volume designs.
Moreover, the problem is likely to become more severe when 450mm wafer sizes are
adopted by the industry (since less number of lots would be required to meet the same
production volume).
191
CHAPTER 8
Conclusions
In modern Very Large Scale Integration (VLSI) circuit industry, the scaling of Complementary
Metal Oxide Semiconductor (CMOS) technology enables the ever growing number of
transistors on a chip, hence improves the chip functionality and reduces the manu-
facturing cost. However, the process variation caused by inevitable manufacturing
fluctuations and imperfections also increases when technology scales down. Process
variation introduces significant power and performance uncertainty to VLSI circuits
and becomes a potential stopper for further technology scaling.
In this dissertation, we consider two types of the most common VLSI circuits:
Field-Programmable Gate Array (FPGA) and Application Specific Integrated Circuit
(ASIC). We have modeled, analyzed, and optimized power and delay for both FPGAs
and ASICs.
In Part I, including Chapter 2, 3, 4, we have studied modeling, analysis, and
optimization of FPGA power and performance with existence of process variation.
In Chapter 2, we have developed a trace-based power and performance evalua-
tion framework(Ptrace) for FPGA. Then we performed device and architecture co-
evaluation on hundreds of device and architecture combinations to optimize power,
delay, and area. Compared to the baseline, our optimization reduces the energy-delay
product by 18.4% and area by 23.3%.
In Chapter 3, we have extended Ptrace to consider process variation. Then we per-
192
formed device and architecture co-optimization to maximize power and delay yield.
Compared to the baseline, our optimization improves leakage power and timing com-
bined yield by 9.4%.
In Chapter 4, we have developed transistor level voltage and current model based
on ITRS MASTAR4 (Model for Assessment of cmoS Technologies And Roadmaps),
and then developed circuit level power, delay, and reliability model. Combining the
circuit level model with Ptrace, we have developed a framework to evaluate FPGA
power, delay, and reliability simultaneously. We have shown that applying heteroge-
neous gate lengths to logic and interconnect may lead to 1.3X delay difference, 3.1X
energy difference. This offers a large room for power and delay tradeoff. We have fur-
ther shown that the device aging has a knee point over time and device burn-in to reach
the point could reduce the performance change over 10 years from 8.5% to 5.5% and
significantly reduce leakage variation. In addition, we have also studied the interaction
between process variation, device aging, and SER. We observe that device aging re-
duces standard deviation of leakage by 65% over 10 years while it has relatively small
impact on delay variation. Moreover, we also find that neither device aging due to
NBTI and HCI nor process variation have significant impact on SER.
In Part II, including Chapter 5, Chapter 6, and Chapter 7, we have studied sta-
tistical timing modeling and analysis for ASICs.
In Chapter 5, we have proposed a new method to approximate the max opera-
tion of two non-Gaussian random variables using second-order polynomial fitting. By
applying such approximation, we have presented new SSTA algorithms for three dif-
ferent delay models: quadratic model, quadratic model without crossing terms (semi-
quadratic model), and linear model. All atomic operations of these algorithms are
performed by closed-form formulas, hence they are very time efficient. Experimental
results show that compared to Monte-Carlo simulation, our SSTA flow predicts the
193
mean, standard deviation, skewness, and 95-percentile point within 1%, 1%, 6%, and
1% error, respectively.
In Chapter 6, we have studied the impact of systematic across-wafer variation on
within-die variation. Based on the analysis, we have developed an efficient and ac-
curate spatial variation model to model across-wafer variation at die level. We then
implemented the new model to the SSTA flow in Chapter 5. Experimental results
show that our new model reduces error from 6.5% to 2% and improve run time by 6X
compared to the traditional spatial variation model.
In Chapter 7, we have studied impact of characterization and production silicon
volumes on believability of statistical variation models and analysis. We mathemati-
cally derived the confidence intervals for commonly used statistical measures (mean,
variance, percentile corner) and analysis (SPICE corner extraction, statistical timing).
Our estimates are within 2% of simulated confidence values. Our experiments indicate
that for moderate characterization volumes (10 lots) and low-to-medium production
volumes (15 lots), a significant guardband (e.g., 34.7% of standard deviation for single
parameter corner, 38.7% of standard deviation for SPICE corner, and 52% of standard
deviation for 95%-tile point of circuit delay are needed) to ensure 95% confidence in
the results. The guardbands are non-negligible for all cases when either production or
characterization volume is not large.
This dissertation makes contributions to statistical modeling, analysis, and opti-
mization for FPGAs and ASICs. However, there are still lots of research works need
to be done in this field. In practice, the statistical modeling and analysis techniques
presented in Part II can not only be applied to ASICs, but also to FPGAs. The
SSTA method introduced in Chapter 5 can be applied to FPGA delay variation for
more accurate and efficient delay variation estimation; the model of spatial correlation
in Chapter 6 can also be applied to FPGA power and delay variation estimation to im-
194
prove the accuracy and efficiency; and we may also apply the method in Chapter 7 to
estimate the confidence for FPGA power and delay variation analysis. In the future,
applying the statistical analysis techniques for ASICs to FPGAs is one of our research
topics.
195
CHAPTER 9
Appendix
In the appendix, I briefly introduce the works I have finished in my Ph.D. program and
are not introduced in this dissertation:
• FPGAPerformance Optimization via Chipwise Placement Considering Pro-
cess Variations. We develop a statistical chipwise placement flow to improve
FPGA performance. First, we propose an efficient high-level trace-based es-
timation method, similar to Ptrace, to evaluate the potential performance gain
achievable through chipwise FPGA placement without detailed placement. Sec-
ond, we develop a variation-aware detailed placement algorithmvaPL within the
VPR framework [18] to leverage process variation and optimize performance
for each chip. Chipwise placement vaPL is deterministic for each chip when the
chip’s variation map is known, and leads to different placements for different
chips of the same application. Our experimental results show that, compared
to the existing FPGA placement, variation aware chipwise placement improves
circuit performance by up to 12.1% for the tested variation maps. This work
was published in International Conference on Field Programmable Logic and
Applications (FPL), August 2006.
• Fourier Series Approximation for Max Operation in Non-Gaussian and
Quadratic Statistical Static Timing Analysis. We propose efficient algorithms
to handle the max operation in SSTA with both quadratic delay dependency and
196
non-Gaussian variation sources simultaneously. Based on such algorithms, we
develop an SSTA flow with quadratic delay model and non-Gaussian variation
sources. All the atomic operations, max and add, are calculated efficiently via
either closed-form formulas or low dimension (at most 2D) lookup tables. We
prove that the complexity of our algorithm is linear in both variation sources and
circuit sizes, hence our algorithm scales well for large designs. Compared to
Monte Carlo simulation for non-Gaussian variation sources and nonlinear delay
models, our approach predicts the mean, standard deviation and 95% percentile
point with less than 2% error, and the skewness with less than 10% error. This
work was published in Design Automation Conference (DAC), June 2007.
• Accounting for Non-linear Dependence Using Function Driven Component
Analysis. We first analyze the impact of non-linear dependence on statisti-
cal analysis and show that ignoring non-linear dependence may cause error in
both statistical timing and power analysis. Then we introduce a function driven
component analysis (FCA) to minimize the error introduced by non-linear de-
pendence. Experimental results show that the proposed FCA method is more
accurate compared to the traditional PCA or ICA. This work was published
in IEEE/ACM Asia South Pacific Design Automation Conference (ASPDAC),
February 2009.
• Efficient Additive Statistical Leakage Estimation. We present an efficient
additive leakage variation model. We use polynomial instead of exponential
function to represents the leakage variation considering inter-die variation, then
extend such model to consider both within-die spatial and within-die random
variation. Due to the additivity, this new model is more efficient than the ex-
ponential model when calculating the full chip leakage power. Experimental
results show that our method is 5X faster than the existing Wilkinson’s approach
197
with no accuracy loss in mean estimation, about 1% accuracy loss in standard
deviation estimation and 99% percentile point estimation. This work was pub-
lished in IEEE Transactions on Computer-Aided Design of Integrated Circuits
and Systems (TCAD), Volume 28, issue: 11, November 2009, pp:1777 - 1781.
198
REFERENCES
[1] PTM SPICE model. In http://www.eas.asu.edu/ ptm/latest.html.
[2] Soroush Abbaspour, Hanif Fatemi, and Massoud Pedram. Non-gaussian statis-
tical interconnect timing analysis. In Proc. Design Automation and Test Conf.
in Europe, March 2006.
[3] Soroush Abbaspour, Hanif Fatemi, and Massoud Pedram. Parameterized block-
based non-gaussian statistical gate timing analysis. In Proc. Asia South Pacific
Design Automation Conf., January 2006.
[4] Milton Abramowitz and Irene A. Stegun. Handbook of Mathematical Functions
with Formulas, Graphs, and Mathematical Tables. New York: Dover, 1965.
[5] Aseem Agarwal, David Blaauw, and Vladimir Zolotov. Statistical timing analy-
sis for intra-die process variations with spatial correlations. In Proc. Intl. Conf.
Computer-Aided Design, pages 900 – 907, Nov. 2003.
[6] Aseem Agarwal, David Blaauw, Vladimir Zolotov, Savithri Sundareswaran,
Min Zhao, Kaushik Gala, and Rajendran Panda. Statistical delay computation
considering spatial correlations. In Proc. Asia South Pacific Design Automation
Conf., 2003.
[7] Aseem Agarwal, David Blaauw, Vladimir Zolotov, Savithri Sundareswaran,
Min Zhao, Kaushik Gala, and Rajendran Panda. Statistical delay computation
considering spatial correlations. In Proc. Asia South Pacific Design Automation
Conf., pages 271 – 276, January 2003.
[8] Aseem Agarwal, David Blaauw, Vladimir Zolotov, and Sarma Vrudhula. Com-
putation and refinement of statistical bounds on circuit delay. In Proc. Design
Automation Conf., June 2003.
[9] Aseem Agarwal, Vladimir Zolotov, and David Blaauw. Statistical timing anal-
ysis using bounds and selective enumeration. In IEEE Trans. Computer-Aided
Design of Integrated Circuits and Systems, volume 22, pages 1243 – 1260, Sept.
2003.
[10] Elias Ahmed and Jonathan Rose. The effect of LUT and cluster size on deep-
submicron FPGA performance and density. In Proc. ACM Intl. Symp. Field-
Programmable Gate Arrays, pages 3–12, February 2000.
[11] M.A. Alam and S. Mahapatra. A comprehensive model of pmos nbti degrada-
tion. In Microeletronics Reliability, August 2005.
199
[12] Jason H. Anderson, Farid N. Najm, and Tim Tuan. Active leakage power op-
timization for FPGAs. In Proc. ACM Intl. Symp. Field-Programmable Gate
Arrays, February 2004.
[13] Hongliang Chang andVladimir Zolotov, Sambasivan Narayan, and Chandu
Visweswariah. Parameterized block-based statistical timing analysis with non-
Gaussian and nonlinear parameters. In Proc. Design Automation Conf., pages
71–76, June 2005. Anaheim, CA.
[14] Ghazanfar Asadi and Mehdi B. Tahoori. Soft error rate estimation and mitiga-
tion for SRAM-based FPGAs. In Proc. ACM Intl. Symp. Field-Programmable
Gate Arrays, February 2005.
[15] Maryam Ashouei, Abhijit Chatterjee, Adit D. Singh, and Vivek De. A dual-
vt layout approach for statistical leakage variability minimization in nanometer
CMOS. In Proc. Int. Conf. on Computer Design, October 2005.
[16] Adelchi Azzalini. The skew-normal distribution and related multivariate fami-
lies. Scandinavian Journal of Statistics, 32:159–188, June 2005.
[17] M. Bellato, P. Bernardi1, D. Bortolato, A. Candelori, M. Ceschia,
A. Paccagnella, M. Rebaudengo1, M. Sonza Reorda, M. Violante1, and P. Zam-
bolin. Evaluating the effects of SEUs affecting the configuration memory of an
SRAM-based FPGA. In Design Automation and Test in Europe, March 2004.
[18] Vaughn Betz, Jonathan Rose, and Alexander Marquardt. Architecture and CAD
for Deep-Submicron FPGAs. Kluwer Academic Publishers, February 1999.
[19] Sarvesh Bhardwaj, Praveen Ghanta, and Sarma Vrudhula. A framework for
statistical timing analysis using non-linear delay and slew models. In Proc. Intl.
Conf. Computer-Aided Design, November 2006.
[20] Sarvesh Bhardwaj and Sarma Vrudhula. Leakage minimization of nano-scale
circuits in the presence of systematic and random variations. In Proc. Design
Automation Conf., 2005.
[21] Sarvesh Bhardwaj and Sarma Vrudhula. Leakage minimization of digital cir-
cuits using gate sizing in the presence of process variations. IEEE Trans.
Computer-Aided Design of Integrated Circuits and Systems, March 2008.
[22] Sarvesh Bhardwaj, Sarma Vrudhula, and David Blaauw. τau: Timing analysis
under uncertainty. In Proc. Intl. Conf. Computer-Aided Design, pages 615 –
620, Nov. 2003.
200
[23] Sarvesh Bhardwaj, Wenping Wang, Rakesh Vattikonda, Yu Cao, and S. Vrud-
hula. Predictive modeling of the nbti effect for reliable design. In Proc. IEEE
Custom Integrated Circuits Conf., September 2006.
[24] Duane Boning, James Chung, Dennis Ouma, and Rajesh Divecha. Spatial varia-
tion in semiconductor processes: Modeling for control. In Proceedings Process
Control Diagnostics and Modeling in Semiconductor Manufacturing II Electro-
chemical Society Meeting, May 1997.
[25] Shekhar Borkar, Tanay Karnik, Siva Narendra, Jim Tschanz, Ali Keshavarzi,
and Vivek De. Parameter variations and impact on circuits and microarchitec-
ture. In Proc. Design Automation Conf., June 2003.
[26] Shekhar Borkar, Tanay Karnik, Siva Narendra, Jim Tschanz, Ali Keshavarzi,
and Vivek De. Parameter variations and impacts on circuits and microarchitec-
ture. In Proc. Design Automation Conf., June 2003.
[27] Jozef Brcka and Rodney L. Robison. Wafer redeposition impact on etch rate
uniformity in ipvd system. IEEE Transactions on Plasma Sciences, February
2007.
[28] Michael Brown, Cyrus Bazeghi, Matthew Guthaus, and Jose Renau. Measur-
ing and modeling variabilityusing low-cost FPGAs. In Proc. ACM Intl. Symp.
Field-Programmable Gate Arrays, February 2009.
[29] James Burnham, Chih-Kong Ken Yang, and Haitham Hindi. A stochastic jitter
model for analyzing digital timing-recovery circuits. In Proc. Design Automa-
tion Conf., July 2009.
[30] Steven M. Burns, Mahesh C. Ketkar, Noel Menezes, Keith A. Bowman,
James W. Tschanz, and Vivek De. Comparative analysis of conventional and
statistical design techniques. In Proc. Design Automation Conf., June 2007.
[31] Nicola Campregher, Peter Y. K. Cheung, George A. Constantinides, and Mi-
lan Vasilko. Yield enhancements of design-specific fpgas. In Proc. ACM Intl.
Symp. Field-Programmable Gate Arrays, February 2006.
[32] Hongliang Chang and Sachin S. Sapatnekar. Statistical timing analysis consid-
ering spatial correlations using a single PERT-like traversal. In Proc. Intl. Conf.
Computer-Aided Design, pages 621 – 625, Nov. 2003.
[33] Hongliang Chang and Sachin S. Sapatnekar. Full-chip analysis of leakage
power under process variations, including spatial correlations. In Proc. Design
Automation Conf., June 2005.
201
[34] Kueing-Long Chen, Stephen A. Saller, Imelda A. Groves, and David B. Scott.
Reliability effects on mos transistors due to hot-carrier injection. IEEE Journal
of Solid-State Circuits, February 1985.
[35] Ruiming Chen, Lizheng Zhang, Vladimir Zolotov, Chandu Visweswariah, and
Jinjun Xiong. Static timing: Back to our roots. In Proc. Asia South Pacific
Design Automation Conf., February 2008.
[36] Ruiming Chen and Hai Zhou. Timing budgeting under arbitrary process varia-
tions. In Proc. Intl. Conf. Computer-Aided Design, November 2007.
[37] Ruiming Chen and Hai Zhou. Fast estimation of timing yield bounds for process
variations. IEEE Trans. VLSI Syst., March 2008.
[38] Lerong Cheng, Fei Li, Yan Lin, Phoebe Wong, and Lei He. Device and architec-
ture cooptimization for FPGA power reduction. IEEE Trans. Computer-Aided
Design of Integrated Circuits and Systems, 19:1211–1221, July 2007.
[39] Lerong Cheng, Yan Lin, and Lei He. Tracebased framework for concurrent
development of process and FPGA architecture considering process variation
and reliability. In Proc. ACM Intl. Symp. Field-Programmable Gate Arrays,
February 2008.
[40] Lerong Cheng, Jinjun Xiong, and Lei He. Non-linear statistical static timing
analysis for non-gaussian variation sources. In Proc. Design Automation Conf.,
June 2007.
[41] Lerong Cheng, Jinjun Xiong, and Lei He. Non-gaussian statistical timing anal-
ysis using Second-Order polynomial fitting. In Proc. Asia South Pacific Design
Automation Conf., February 2008.
[42] Lerong Cheng, Jinjun Xiong, and Lei He. Non-Gaussian statistical timing anal-
ysis using Second-Order polynomial fitting. IEEE Trans. Computer-Aided De-
sign of Integrated Circuits and Systems, January 2009.
[43] Lerong Cheng, Jinjun Xiong, Lei He, and Mike Hutton. FPGA performance
optimization via chipwise placement considering process variations. In Inter-
national Conference on Field-Programmable Logic and Applications, August
2006.
[44] Kaviraj Chopra, Saumil Shah, Ashish Srivastava, David Blaauw, and Dennis
Sylvester. Parametric yield maximization using gate sizing based on efficient
statistical power and delay gradient computation. In Proc. Intl. Conf. Computer-
Aided Design, November 2005.
202
[45] Kaviraj Chopra, Narendra Shenoy, and David Blaauw. Variogram based ro-
bust extraction of process variation. In Proc. τ, Timing and Analysis Utilities,
February 2007.
[46] Charles E. Clark. The greatest of a finite set of random variables. In Operations
Research, pages 85 – 91, 1961.
[47] David Lewis ect. The stratix routing and logic architecture. In Proc. ACM Intl.
Symp. Field-Programmable Gate Arrays, February 2003.
[48] Vijay Degalahal and TimTuan. Methodology for high level estimation of FPGA
power consumption. In Proc. Asia South Pacific Design Automation Conf.,
January 2005.
[49] Anirudh Devgan and Chandramouli Kashyap. Block-Based static timing anal-
ysis with uncertainty. In Proc. Intl. Conf. Computer-Aided Design, November
2003.
[50] Nigel Drego, Anantha Chandrakasan, and Duane Boning. Lack of spatial corre-
lation in mosfet threshold voltage variation and implications for voltage scaling.
IEEE Trans. Semiconductor Manufacturing, May 2009.
[51] Fei Li and Yan Lin and Lei He. Vdd programmability to reduce FPGA inter-
connect power. In Proc. Intl. Conf. Computer-Aided Design, November 2004.
[52] Zhuo Feng, Peng Li, and Yaping Zhan. Fast second-order statistical static tim-
ing analysis using parameter dimension reduction. In Proc. Design Automation
Conf., June 2007.
[53] F.N.Najm and N.menezes. Statistical timing analysis based on a timing yield
model. In Proc. Design Automation Conf., June 2004.
[54] Paul Friedberg, Willy Cheung, and Costas J. Spanos. Spatial modeling of
micron-scale gate length variation. In Data Analysis and Modeling for Process
Control III, March 2006.
[55] Qiang Fu, Wai-Shing Luk, Jun Tao, Changhao Yan, and Xuan Zeng. Character-
izing intra-die spatial correlation using spectral density method. In Proc. IEEE
International Symposium on Quality Electronic Design, March 2008.
[56] Takayuki Fukuoka, Akira Tsuchiya, and Hidetoshi Onodera. Statistical gate
delay model for multiple input switching. In Proc. Asia South Pacific Design
Automation Conf., February 2008.
203
[57] Anne Gattiker, Sani Nassif, Rashmi Dinakar, and Chris Long. Timing yield
estimation from static timing analysis. In International Symposium on Quality
of Electronic Design, 2001.
[58] A. Gayasen, K. Lee, N. Vijaykrishnan, M. Kandemir, M. J. Irwin, and
T. Tuan. A dual-vdd low power FPGA architecture. In Proc. Intl. Conf. Field-
Programmable Logic and its Application, August 2004.
[59] A. Gayasen, Y. Tsai, N. Vijaykrishnan, M. Kandemir, M. J. Irwin, and T. Tuan.
Reducing leakage energy in FPGAs using region-constrained placement. In
Proc. ACM Intl. Symp. Field-Programmable Gate Arrays, February 2004.
[60] P. Hazucha and C. Svensson. Impact of CMOS technology scaling on the atmo-
spheric neutron soft error rate. IEEE Trans. on Nuclear Science, 2000.
[61] Khaled R. Heloue, Sari Onaissi, and Farid N. Najm. Efficient block-based pa-
rameterized timing analysis covering all potentially critical paths. In Proc. Intl.
Conf. Computer-Aided Design, November 2008.
[62] Dadgour H.F., Sheng-Chih Lin, and Banerjee K. A statistical framework for
estimation of full-chip leakage-power distribution under parameter variations.
Electron Devices, IEEE Transactions on, November 2007.
[63] Katsumi Homma, Izumi Nitta, and Toshiyuki Shibuya. Non-gaussian statistical
timing models of die-to-die and within-die parameter variations for full chip
analysis. In Proc. Asia South Pacific Design Automation Conf., February 2008.
[64] Shin ichi OHKAWA, HirooMASUDA, and Yasuaki INOUE. A novel expres-
sion of spatial correlation by a random curved surface model and its application
to LSI designs. IEICE TRANS. FUNDAMENTALS, 2008.
[65] Masanori Imai, Takashi Sato Noriaki Nakayama, and Kazuya Masu. Non-
parametric statistical static timing analysis: An ssta framework for arbitrary
distribution. In Proc. Design Automation Conf., June 2008.
[66] International Technology Roadmap for Semiconductor. In http://public.itrs.net/,
2002.
[67] International Technology Roadmap for Semiconductor. In http://public.itrs.net/,
2005.
[68] International Technology Roadmap for Semiconductor. A user’s guide to MAS-
TAR4. In http://www.itrs.net/models.html, 2005.
204
[69] International Technology Roadmap for Semiconductor. Design. In
http://www.itrs.net/Links/2007ITRS/2007 Chapters/2007 Design.pdf, 2007.
[70] Javid Jaffari and Mohab Anis. On efficient monte carlo-based statistical static
timing analysis of digital circuits. In Proc. Intl. Conf. Computer-Aided Design,
November 2008.
[71] Jason H. Anderson and Farid N. Najm. Low-power programmable routing cir-
cuitry for FPGAs. In Proc. Intl. Conf. Computer-Aided Design, November
2004.
[72] J. Jess, K. Kalafala, S. Naidu, R. Otten, and C. Visweswariah. Statistical timing
for parametric yield prediction of digital integrated circuits. In Proc. Design
Automation Conf., June 2003.
[73] Jochen Jess. Parametric yield estimation for deep submicron VLSI circuits. In
15th Symposium on Integrated Circuits and Systems Design, September 2002.
[74] J.M.Rabaey. Digital Integrated Circuits - A Design Perspective. Prentice Hall
Inc., 2003.
[75] Kunhyuk Kang, Bipul C. Paul, and Kaushik Roy. Statistical timing analysis
using levelized covariance propagation. In Proc. Design Automation and Test
Conf. in Europe, March 2005.
[76] Vishal Khandelwal and Ankur Srivastava. A general framework for accurate
statistical timing analysis considering correlations. In Proc. Design Automation
Conf., pages 89 – 94, June 2005.
[77] Tae Won Kim and Eray S.Aydil. Investigation of etch rate uniformity of 60 mhz
plasma etching equipment. Japanese Journal of Applied Physics, November
2001.
[78] Tae Won Kim and Eray S.Aydil. Effects of chamber wall conditions on cl con-
certration and si etch rate uniformity in palsma etching reactors. Journal of The
Electrochemical Society, June 2003.
[79] Dirk Koch, Christian Haubelt, and Juergen Teich. Efficient hardware check-
pointing – concepts, overhead analysis, and implementation. In Proc. ACM
Intl. Symp. Field-Programmable Gate Arrays, February 2007.
[80] J. Kouloheris and A. El Gamal. FPGA area vs. cell granularity - lookup tables
and PLA cells. In 1st ACM Workshop on FPGAs, berkeley, CA, February 1992.
205
[81] Sanjay V. Kumar, Chandramouli V. Kashyap, and Sachin Sapatnekar. A frame-
work for block-based timing sensitivity analysis. In Proc. Design Automation
Conf., June 2008.
[82] Ian Kuon and Jonathan Rose. Measuring the gap between fpgas and asics. IEEE
Trans. Computer-Aided Design of Integrated Circuits and Systems, February
2007.
[83] Eric Kusse and Jan M. Rabaey. Low-energy embedded FPGA structures. In
Proc. Intl. Symp. Low Power Electronics and Design, pages 155–160, August
1998.
[84] Julien Lamoureux, Guy G. Lemieux, and Steven J.E. Wilton. Glitchless: An
active glitch minimization techniques for FPGAs. In Proc. ACM Intl. Symp.
Field-Programmable Gate Arrays, February 2007.
[85] Julien Lamoureux and Steven J.E. Wilton. On the interaction between power-
aware FPGA CAD algorithms. In Proc. Intl. Conf. Computer-Aided Design,
pages 701–708, November 2003.
[86] Jiayong Le, Xin Li, and Lawrence T. Pileggi. STAC: Statisical timing analysis
with correlation. In Proc. Design Automation Conf., Jun 2004.
[87] G. G. Lemieux and S. D. Brown. A detailed router for allocating wire segments
in field-programmable gate arrays. In Proceedings of the ACM Physical Design
Workshop, April 1993.
[88] Bing Li, Ning Chen, Manuel Schmidt, Walter Schneider, and Ulf Schlichtmann.
On hierarchical statistical static timing analysis. In Proc. Design Automation
and Test Conf. in Europe, March 2009.
[89] Fei Li, Deming Chen, Lei He, and Jason Cong. Architecture evaluation for
power-efficient FPGAs. In Proc. ACM Intl. Symp. Field-Programmable Gate
Arrays, February 2003.
[90] Fei Li, Yan Lin, and Lei He. FPGA power reduction using configurable dual-
vdd. In Proc. Design Automation Conf., June 2004.
[91] Fei Li, Yan Lin, and Lei He. Field programmability of supply voltages for
FPGA power reduction. IEEE Trans. Computer-Aided Design of Integrated
Circuits and Systems, 26, April 2007.
[92] Fei Li, Yan Lin, Lei He, Deming Chen, and Jason Congs. Power modeling and
characteristics of field programmable gate arrays. IEEE Trans. Computer-Aided
Design of Integrated Circuits and Systems, 24:1712–1724, November 2005.
206
[93] Fei Li, Yan Lin, Lei He, and Jason Cong. Low-power FPGA using pre-defined
dual-vdd/dual-vt fabrics. In Proc. ACM Intl. Symp. Field-Programmable Gate
Arrays, February 2004.
[94] Xin Li, Jiayong Le, Padmini Gopalakrishnan, and Lawrence T Pileggi. Asymp-
totic probability extraction for non-normal distributions of circuit performance.
In Proc. Intl. Conf. Computer-Aided Design, November 2004.
[95] Xin Li, Jiayong Le, and L.T. Pileggi. Projection based statistical analysis of
full-chip leakage power with non-log-normal distributions. In Proc. Design Au-
tomation Conf., June 2006.
[96] Weiping Liao, Lei He, and Kevin Lepak. Temperature and supply voltage
aware performance and power modeling at microarchitecture level. IEEE Trans.
Computer-Aided Design of Integrated Circuits and Systems, July 2005.
[97] Saihua Lin, Yu Wang, Rong Luo, and Huazhong Yang. A capacitive boosted
buffer technique for high-speed process-variation-tolerant interconnect in udvs
application. In Proc. Asia South Pacific Design Automation Conf., February
2008.
[98] Yan Lin and Lei He. Dual-vdd interconnect with chip-level time slack allocation
for FPGA power reduction. IEEE Trans. Computer-Aided Design of Integrated
Circuits and Systems, 25, October 2006.
[99] Yan Lin and Lei He. Stochastic physical synthesis for FPGAs with pre-routing
interconnect uncertainty and process variation. In Proc. ACM Intl. Symp. Field-
Programmable Gate Arrays, February 2007.
[100] Yan Lin, Mike Hutton, and Lei He. Placement and timing for FPGAs consider-
ing variations. In Proc. Intl. Conf. Field-Programmable Logic and its Applica-
tion, August 2006.
[101] Yan Lin, Fei Li, and Lei He. Circuits and architectures for vdd programmable
FPGAs. In Proc. ACM Intl. Symp. Field-Programmable Gate Arrays, February
2005.
[102] Yan Lin, Fei Li, and Lei He. Routing track duplication with fine-grained power-
gating for FPGA interconnect power reduction. In Proc. Asia South Pacific
Design Automation Conf., January 2005.
[103] Jing-Jia Liou, A. Krstic, L.-C. Wang, and Kwang-Ting Cheng. False-path-
aware statistical timing analysis and efficient path selection for delay testing
and timing validation. In Proc. Design Automation Conf., June 2002.
207
[104] Bao Liu. Signal probability based statistical timing analysis. In Proc. Design
Automation and Test Conf. in Europe, March 2008.
[105] Bao Liu. Spatial correlation extraction via random field simulation and produc-
tion chip performance regression. In Proc. Design Automation and Test Conf.
in Europe, March 2008.
[106] Frank Liu. A general framework for spatial correlation modeling in vlsi design.
In Proc. Design Automation Conf., June 2007.
[107] Jui-Hsiang Liu, Lumdo Chen, and Chen Charlie Chung-Ping. Accurate and
analytical statistical spatial correlation modeling for vlsi dfm applications. In
Proc. Design Automation Conf., June 2008.
[108] Jui-Hsiang Liu, Ming-Feng Tsai, Lumdo Chen, and Charlie Chung-Ping Chen.
Accurate and analytical statistical spatial correlation modeling for vlsi dfm ap-
plications. In Proc. Design Automation Conf., June 2008.
[109] M. Liu, Wang Wei-Shen, and M. Orshanskyg. Leakage power reduction by
dual-vth designs under probabilistic analysis of vth variation. In Proc. Intl.
Symp. Low Power Electronics and Design, August 2004.
[110] Andrea Lodi, Luca Ciccarelli, and Roberto Giansante. Combining low-leakage
techniques for FPGA routing design. In Proc. ACM Intl. Symp. Field-
Programmable Gate Arrays, Februray 2005.
[111] Yuanlin Lu and V.D. Agrawal. Statistical leakage and timing optimization for
submicron process variation. In VLSI Design, January 2007.
[112] W. Maly, A.J. Strojwas, and S.W. Director. VLSI yield prediction and estima-
tion: A unified framework. IEEE Trans. Computer-Aided Design of Integrated
Circuits and Systems, 5:114–130, Jan. 1986.
[113] Hratch Mangassarian and Mohab Anis. On statistical timing analysis with inter-
and intra-die variations. In Proc. Design Automation and Test Conf. in Europe,
March 2005.
[114] Murari Mani, Anirudh Devgan, and Michael Orshansky. An efficient algorithm
for statistical minimization of total power under timing yield constraints. In
Proc. Design Automation Conf., pages 309–314, June 2005.
[115] Yohei Matsumoto, Masakazu Hioki, Takashi Kawanami, Toshiyuki Tsutsumi,
Tadashi Nakagawa, Toshihiro Sekigawa, and Hanpei Koike. Performance and
yield enhancement of FPGAs with with-in die variation using multiple configu-
rations. In Proc. ACM Intl. Symp. Field-Programmable Gate Arrays, February
2007.
208
[116] Hushrav D. Mogal, Haifeng Qian, Sachin S. Sapatnekar, and Kia Bazargan.
Clustering based pruning for statistical criticality computation under process
variations. In Proc. Intl. Conf. Computer-Aided Design, November 2007.
[117] Rajarshi Mukherjee and Seda Ogrenci Memik. Power-driven design partition-
ing. In Proc. Intl. Conf. Field-Programmable Logic and its Application, August
2004.
[118] Ayhan Mutlu, Jiayong Le, Ruben Molina, and Mustafa Celik. A parametric
approach for handling local variation effects in timing analysis. In Proc. Design
Automation Conf., July 2009.
[119] Siva Narendra, Vivek De, Shekhar Borkar andDimitri A. Antoniadis, and Anan-
tha P. Chandrakasan. Full-chip subthreshold leakage power prediction and re-
duction techniques for sub-0.18-um CMOS. Solid-State Circuits, IEEE Journal
of, March 2004.
[120] Michael Orshansky and Arnab Bandyopadhyay. Fast statistical timing analysis
handling arbitrary delay correlations. In Proc. Design Automation Conf., June
2004.
[121] Michael Orshansky, Sani R. Nassif, and Duane Boning. Design for Manufac-
turability and Statistical Design. Springer, 2007.
[122] Pierrick Pedron. Tutorial on statistical static timing analysis (SSTA). In Sophia
Antipolis MicroElectronics Forum, 2008.
[123] K. Poon, A. Yan, and S. Wilton. A flexible power model for FPGAs. In Proc. of
12th International conference on Field-Programmable Logic and Applications,
September 2002.
[124] Predictive Technology Model. In http://www.eas.asu.edu/ ptm/.
[125] Kun Qian and Costas J. Spanos. A comprehensive model of process variabil-
ity for statistical timing optimization. In Design for Manufacturability through
Design-Process Integration II, April 2008.
[126] Kun Qian and Costas J. Spanos. Hierachical modeling of spatial variability
of a 45nm test chip. In Design for Manufacturability through Design-Process
Integration III, March 2009.
[127] Sreeja Raj, Sarma B.K. Vrudhula, and Janet Wang. A methodology to improve
timing yield in the presence of process variations. In Proc. Design Automation
Conf., June 2004.
209
[128] Anand Ramalingam, Ashish Kumar Singh, Sani R. Nassif, Gi-Joon Nam,
Michael Orshansky, and David Z. Pan. An accurate sparse matrix based frame-
work for statistical static timing analysis. In Proc. Intl. Conf. Computer-Aided
Design, November 2006.
[129] Rajeev Rao, Anirudh Devgan, David Blaauw, and Dennis Sylveste. Parametric
yield estimation considering leakage variability. In Proc. Design Automation
Conf., June 2004.
[130] Jonathan Rose, Robert J. Francis, David Lewis, and Paul Chow. Architecture
of field-programmable gate arrays: The effect of logic functionality on area
efficiency. IEEE Journal of Solid-State Circuits, 1990.
[131] Jonathan Rose, Robert J. Francis, David Lewis, and Paul Chow. Architecture
of field-programmable gate arrays: The effect of logic functionality on area
efficiency. Proc. IEEE Int. Solid-State Circuits Conf., 1990.
[132] Shigeru Sakai, Masaaki Ogino, Ryosuke Shimizu, and Yukihiro Shimogaki.
Deposition uniformity control in a commercial scale HTO-CVD reactor. MA-
TERIALS RESEARCH SOCIETY SYMPOSIUM PROCEEDINGS, 2007.
[133] Ashoka Sathanur, Antonio Pullini, Giovanni De Micheli, Luca Benini, and En-
rico Macii. Physically clustered forward body biasing for variability compensa-
tion in nano-meter CMOS design. In Proc. Design Automation and Test Conf.
in Europe, March 2009.
[134] Takashi Sato, Hiroyuki Ueyama, Noriaki Nakayama, and Kazuya Masu. Deter-
mination of optimal polynomial regression function to decompose on-die sys-
tematic and random variations. In Proc. Asia South Pacific Design Automation
Conf., February 2008.
[135] Pete Sedcole and Peter Y. K. Cheung. Parametric yield in FPGAs due to within-
die delay variations: A quantative analysis. In Proc. ACM Intl. Symp. Field-
Programmable Gate Arrays, February 2007.
[136] Pete Sedcole and Peter Y. K. Cheung. Parametric yield in FPGAs due to within-
die delay variations: a quantitative analysis. In Proc. ACM Intl. Symp. Field-
Programmable Gate Arrays, February 2007.
[137] Pete Sedcole, Justin S. Wong, and Peter Y.K. Cheung. Measuring and modeling
FPGA clock variability. In Proc. ACM Intl. Symp. Field-Programmable Gate
Arrays, February 2008.
210
[138] Davood Shamsi, Petros Boufounos, and Farinaz Koushanfar. Post-silicon tim-
ing characterization by compressed sensing. In Proc. Intl. Conf. Computer-
Aided Design, November 2008.
[139] James R. Sheats. Microlithography Science and Technology. CRC, 1998.
[140] Xukai Shen, Yu Wang, Rong Luo, and Huazhong Yang. Leakage power reduc-
tion through dual vth assignment considering threshold voltage variation. In
ASICON, October 2007.
[141] Sean X. Shi, Anand Ramalingam, Daifeng Wang, and David Z. Pan. Latch
modeling for statistical timing analysis. In Proc. Design Automation and Test
Conf. in Europe, March 2008.
[142] Jaskirat Singh and Sachin Sapatnekar. Statistical timing analysis with cor-
related non-gaussian parameters using independent componenet analysis. In
ACM/IEEE International Workshop on Timing Issues, pages 143–148, Feb.
2006.
[143] Satwant Singh, Jonathan Rose, Paul Chow, and David Lewis. The effect of
logic block architecture on FPGA performance. IEEE Journal of Solid-State
Circuits, 1992.
[144] Amith Singhee and Rob A. Rutenbar. Beyond low-order statistical response
surfaces: Latent variable regression for efficient, highly nonlinear fitting. In
Proc. Design Automation Conf., June 2007.
[145] Amith Singhee, Sonia Singhal, and Rob A. Rutenbar. Exploiting correlation
kernels for efficient handling of intra-die spatial correlation, with application to
statistical timing. In Proc. Design Automation and Test Conf. in Europe, March
2008.
[146] Amith Singhee, Sonia Singhal, and Rob A. Rutenbar. Practical, fast monte carlo
statistical static timing analysis: Why and how. In Proc. Intl. Conf. Computer-
Aided Design, November 2008.
[147] Satish Sivaswamy and Kia Bazargan. Variation-aware routing for FPGAs. In
Proc. ACM Intl. Symp. Field-Programmable Gate Arrays, February 2007.
[148] Thomas Skotnicki. Heading for decananometer CMOS - Is navigation among
icebergs still a viable strategy? In Solid-State Device Research Conference,
pages 19–33, September 2000.
211
[149] Mahapatra Souvik, Saha Dipankar, Varghese Dhanoop, and Bharath P. Kumar.
On the generation and recovery of interface traps in MOSFETs subjected to
NBTI, FN, and HCI stress. IEEE. Trans. on Electron Device, 53, July 2006.
[150] Suresh Srinivasan, Aman Gayasen, Narayanan Vijaykrishnan, and Tim Tuan.
Leakage control in FPGA routing fabric. In Proc. Asia South Pacific Design
Automation Conf., January 2005.
[151] Ashish Srivastava, Robert Bai, David Blaauw, and Dennis Sylvester. Modeling
and analysis of leakage power considering within-die process variations. In
Proc. Intl. Symp. Low Power Electronics and Design, 2002.
[152] Ashish Srivastava, Saumil S. Shah, Kanak B. Agarwal, Dennis M. Sylvester,
David Blaauw, and Stephen Director. Accurate and efficient parametric yield
estimation considering correlated variations in leakage power and performance.
In Proc. Design Automation Conf., June 2005.
[153] Brian E. Stine., Duane S. Boning, and James E. Chung. Analysis and decompo-
sition of spatial variation in integrated circuit processes and devices. In Semi-
conductor Manufacturing, IEEE Transactions on, pages 24 – 41, Feb. 1997.
[154] Yuuri Sugihara, Yohei Kume, Kazutoshi Kobayashi, and Hidetoshi Onodera.
Speed and yield enhancement by track swapping on critical paths utilizing ran-
dom variations for FPGAs. In Proc. ACM Intl. Symp. Field-Programmable
Gate Arrays, February 2008.
[155] Shingo Takahashi, Yuki Yoshida, and Shuji Tsukiyama. Gaussian mixture
model for statistical timing analysis. In Proc. Design Automation Conf., July
2009.
[156] Li Tao and Yu Zhiping. Statistical analysis of full-chip leakage power consider-
ing junction tunneling leakage. In Proc. Design Automation Conf., June 2007.
[157] Shuji Tsukiyama. Toward stochastic design for digital circuits: statistical static
timing analysis. In Proc. Asia South Pacific Design Automation Conf., January
2005.
[158] TimTuan and Bocheng Lai. Leakage power analysis of a 90nm FPGA. In Proc.
IEEE Custom Integrated Circuits Conf., 2003.
[159] SALI Jaydeep V., PATIL S. B., JADKAR S. R., and TAKWALE M. G. Hot-
Wire CVD growth simulation for thickness uniformity. In International Confer-
ence on Cat-CVD Process, 2001.
212
[160] Rakesh Vattikonda, Wenping Wang, and Yu Cao. Modeling and minimization
of pmos nbti effect for robust nanometer design. In Proc. Design Automation
Conf., July 2006.
[161] Vineeth Veetil, Dennis Sylvester, and David Blaauw. Efficient monte carlo
based incremental statistical timing analysis. In Proc. Design Automation
Conf., June 2008.
[162] Chandu Visweswariah. Death, taxes and failing chips. In Proc. Design Au-
tomation Conf., June 2003.
[163] Chandu Visweswariah, Kaushik Ravindran, Kerim Kalafala, Steven G. Walker,
and Sambasivan Narayan. First-order incremental block-based statistical timing
analysis. In Proc. Design Automation Conf., June 2004.
[164] Chandu Visweswariah, Kaushik Ravindran, Kerim Kalafala, Steven G. Walker,
Sambasivan Narayan, Daniel K. Beece, Jeff Piaget, Natesan Venkateswaran,
and Jeffrey G. Hemmett. First-order incremental block-based statistical tim-
ing analysis. IEEE Trans. Computer-Aided Design of Integrated Circuits and
Systems, 2006.
[165] Ruilin Wang and Cheng Kok Koh. A frequency-domain technique for statistical
timing analysis of clock meshes. In Proc. Intl. Conf. Computer-Aided Design,
November 2007.
[166] Wenping Wang, Vijay Reddy, Anand T. Krishnan, Rakesh Vattikonda, Srikanth
Krishnan, and Yu Cao. An integrated modeling paradigm of circuit reliability
for 65nm CMOS technology. In Proc. IEEE Custom Integrated Circuits Conf.,
September 2007.
[167] Ho-Yan Wong, Lerong Cheng, Yan Lin, and Lei He. FPGA device and architec-
ture evaluation considering process variations. In Proc. Intl. Conf. Computer-
Aided Design, November 2005.
[168] Phoebe Wong, Lerong Cheng, Yan Lin, and Lei He. FPGA device and arhi-
tecture evaluation consdering process variation. In Proc. Intl. Conf. Computer-
Aided Design, November 2005.
[169] Lin Xie, Azadeh Davoodi, Jun Zhang, and Tai-Hsuan Wu. Adjustment-based
modeling for statistical static timing analysis with high dimension of variability.
In Proc. Intl. Conf. Computer-Aided Design, November 2008.
[170] Xilinx Corporation. Virtex-II 1.5v platform FPGA complete data sheet. July
2002.
213
[171] Jinjun Xiong, Chandu Visweswariah, and Vladimir Zolotov. Statistical ordering
of correlated timing quantities and its application for path ranking. In Proc.
Design Automation Conf., July 2009.
[172] Jinjun Xiong, Vladimir Zolotov, and Lei He. Robust extraction of spatial corre-
lation. In Proc. Intl. Symp. Physical Design, April 2006.
[173] Jinjun Xiong, Vladimir Zolotov, and Lei He. Robust extraction of spatial cor-
relation. IEEE Trans. Computer-Aided Design of Integrated Circuits and Sys-
tems, 2007.
[174] A. M. Yaglom. Some classes of random fields in n-Dimensional space, related
to stationary randomprocesses. Theory Probability Applications, January 1957.
[175] S. Yang. Logic synthesis and optimization benchmarks, version 3.0. Technical
report, Microelectronics Center of North Carolina (MCNC), 1991.
[176] Xiaoji Ye, Yaping Zhan, and Peng Li. Statistical leakage power minimization
using fast equi-slack shell based optimization. In Proc. Design Automation
Conf., 2007.
[177] Guo Yu, Wei Dong, Zhuo Fang, and Peng Li. Statistical static timing analysis
considering process variation model uncertainty. IEEE Trans. Computer-Aided
Design of Integrated Circuits and Systems, October 2008.
[178] Yaping Zhan, Andrzej J. Strojwas, Xin Li, and Lawrence T. Pileggi.
Correlaiton-aware statistical timing analysis with non-gaussian delay distribu-
tion. In Proc. Design Automation Conf., pages 77–82, June 2005. Anaheim,
CA.
[179] Yaping Zhan, Andrzej J. Strojwas, David Newmark, and Mahesh Sharma.
Generic statistical timing analysis with non-gaussian process parameters. In
Austin Conference on Integrated Systems and Circuits, May 2006.
[180] Lizheng Zhang, Weijen Chen, Yuhen Hu, John A. Gubner, and Charlie Chung-
Ping Chen. Correlation-preserved non-gaussian statistical timing analysis with
quadratic timing model. In Proc. Design Automation Conf., pages 83 – 88, June
2005.
[181] Lizheng Zhang, Weijen Chen, Yuhen Hu, John A. Gubner, and Charlie Chung-
Ping Chen. Statistical timing analysis with extended pseudo-canonical timing
model. In Proc. Design Automation and Test Conf. in Europe, pages 952 – 957,
March 2005.
214
[182] Lizheng Zhang, Yuhen Hu, and Charlie Chung-Ping Chen. Block based statisti-
cal timing analysis with extended canonical timing model. In Proc. Asia South
Pacific Design Automation Conf., January 2005.
[183] Lizheng Zhang, Yuhen Hu, and Chung-Ping Chen. Statistical timing analysis
with path reconvergence and spatial correlations. In Proc. Design Automation
and Test Conf. in Europe, March 2006.
[184] Lizheng Zhang, Jun Shao, and Charlie Chung-Ping Chen. Non-gaussian statis-
tical parameter modeling for SSTA with confidence interval analysis. In Proc.
Intl. Symp. Physical Design, Aprial 2006.
[185] Qiaolin Zhang, Kameshwar Poolla, and Costas J. Spanos. One step forward
from run-to-run critical dimension control: Across-wafer level critical dimen-
sion control through lithography and etch process. Journal of Process Control,
18(10):937 – 945, 2008. Advanced Process Control for Semiconductor Manu-
facturing, First IFAC Workshop on Advanced Process Control for Semiconduc-
tor Manufacturing.
[186] Qiaolin Zhang, Cherry Tang, Jason Cain, Angela Hui, Tony Hsieh, Nick Mac-
crae, Bhanwar Singh, Kameshwar Poolla, and Costas J. Spanos. Across-wafer
cd uniformity control through lithography and etch process: experimental ver-
ification. In Metrology, Inspection, and Process Control for Microlithography
XXI, April 2007.
[187] Songqing Zhang, Vineet Wason, and Kaustav Banerjee. A probabilistic frame-
work to estimate full-chip subthreshold leakage power distribution considering
within-die and die-to-die p-t-v variations. In Proc. Intl. Symp. Low Power Elec-
tronics and Design, Aug 2004.
[188] Songqing Zhang, Vineet Wason, and Kaustav Banerjee. A probabilistic frame-
work to estimate full-chip subthreshold leakage power distribution considering
within-die and die-to-die variations. In Proc. Intl. Symp. Low Power Electron-
ics and Design, 2004.
[189] Wangyang Zhang, Wenjian Yu, Zeyi Wang, Zhiping Yu, Rong Jiang, and Jinjun
Xiong. An efficient method for chip-level statistical capacitance extraction con-
sidering process variations with spatial correlation. In Proc. Design Automation
and Test Conf. in Europe, March 2008.
[190] Wei Zhao, Yu Cao, Frank Liu, Kanak Agarwal, Dhruva Acharyya, Sani Nassif,
and Kevin Nowka. Rigorous extraction of process variations for 65nm CMOS
design. In European Solid-State Circuits Conference, September 2007.
215

c Copyright by Lerong Cheng 2010

The dissertation of Lerong Cheng is approved.

Lieven Vandenberghe

Glenn Reinman

Sudhakar Pamarti

Lei He, Committee Co-chair

Puneet Gupta, Committee Co-chair

University of California, Los Angeles 2010

ii

To my family.

iii

. . . 15 Delay. . and Area Minimization . . . . . Dissertation Outline . 2. . . . . . Statistical Timing Modeling. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1. . 16 2. . . . .1 2.3.4 Hyper-Architecture Evaluation . . . . . . . . . . .1. . Performance. . . . . . . . . . Vdd Gatable FPGA Architecture . . . . . . . 1 6 6 8 9 12 Contributions of the Dissertation . . . . . . . . . . . iv . . . . . . . . . . . . . . . . .3. . . . . . Preliminaries . . . . .3 Conventional FPGA Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2.1 Literature Review . . . . . . . . . . . . . Power and Delay Model . . . .3. . . . . . . . Analysis. . . . . .3 2. . . . . . . . . . . . . . . . . 2. . . . . . . . . . . . . . . . 2. . . . . . . . . . . . . . . .1 2. . . . . . . . . . . and Area Optimization . . and Optimization . . . . . . . . . . .2 1. . . .2. . . . . . . . . . . 17 20 20 21 23 24 25 26 30 32 Trace-Based Estimation . . . . . . . . . . . . .1 1. FPGA Architecture Evaluation Flow . . . . . . . . . . . . . .1 2. . . . Validation of Ptrace . . . . .3 Trace Collection .2. . . . . . . . 1. . . .TABLE OF C ONTENTS 1 Introduction . . . . . . . . . . . . . . .3 FPGA Power. . . . . .2 2. . . . . . . . . . . . . . .2 2. . . . I Statistical Modeling and Optimization for FPGA Circuits 2 Deterministic Device and Architecture Co-Optimization for FPGA Power. . . . .2 Introduction .2 1.1. . . . . .

. Energy and Delay Tradeoff . . . . . . . . . . . . . . . . . .2 3.4. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Power Model . . . . . . . . . . . . . . . . . . . . . . . . . . .2 4. . . . . . . . . . . . . . . . . . . . . . . . . .1 3. . . . . . . .1 2. . .3. 64 66 70 70 73 v . . . . . . . . Circuit-Level Delay and Power . . . . . . . . . . . .4. . . . . . . . . . . . . . . 49 50 58 59 60 62 Conclusions . . . . . . . . . . . . Necessity of Device and Architecture Co-Optimization . . .5 Overview . . . . . . . . . .4 2. . .4. 3 FPGA Device and Architecture Co-Optimization Considering Process Variation . . . . . . . . . . . . . . . 48 3. .4. . .3 Introduction . . . . . . . . . 32 33 36 38 40 41 46 Conclusions . . . . . . . . . . . . . . . . .4. . .3. . . . .2 2. 63 4.1 4. . . . . . . . . . . Impact of Utilization Rate . . . . . . . .3. . . . . . . . . . .3 2. . .1 4. . . . . . . . . . . . . . . . . 4. . . . . . . . . . . .4. Device Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hyper-Architecture Evaluation Considering Process Variation . . . . . . . .4 Impact of Process Variation . . . . . . . . . . . . . . . . . . . . . . Impact of Interconnect Structure . .3. . . . . .6 2. . ED and Area Tradeoff . . . .2 Delay Model . . . . . . . . . . . . . . .3 Introduction . . . . . . . . . . . . . . . . . Delay and Leakage Variation Model . . . 3. . . . . . . . . . . . . . . . . . . . . 4 FPGA Concurrent Development of Process and Architecture Considering Process Variation and Reliability . .5 2. .2 3. . . . . . . . . . .2. .1 3. . . . . . . . . . . Impact of Device and Architecture Tuning . . . . . . . . . . . . .

2 Impact of Device Aging . . . .2 4. . . . . . . . . .6 Process Evaluation . . . . . . . . . . . . . . 4. . . . . . . 4. . . . . . . 4. . . . . . . . . . Permanent Soft Error Rate (SER) .8 Interaction between Process Variation and Reliability . . . .4. . . Variation Analysis . . .5. . . . . . . . . . . . 4. . . . . . 4. . . . . . . . . .7. . . . .3 Delay Model . . . . . .4. . . . . . . . . . . . . . . . Dynamic Power Model . .2 4. . . . . . . . . . . . . . Delay Variation . . . .8. . . . . . . . . . . . .1 4. . . .9 Conclusions . . . . . . . 4. . . . . . . . . . . . . . .4. . . . . . . . . . . . . . . . . . .8. . . . . . . . . .1 4. . . . . . .7. . . . . . . . . . 75 75 76 76 78 78 79 80 81 82 83 85 86 89 90 91 92 93 4.5. .5 Chip Level Delay and Power Variation . . . . . . . . . . II Statistical Timing Modeling and Analysis 5 Non-Gaussian Statistical Timing Analysis Using Second-Order Polyno- 95 mial Fitting .3 Variation Models . . .7 FPGA Reliability . . .5. . .2 Impact of Device Aging on Power and Delay Variation . . . . . . . . . . . . . . . . 4. . . . . . . . . . . . . . . . . .4 Chip-level Delay and Power . . . . . .1 4. . . . .2 Power and Delay Optimization . . . . . . . . . . . . . . . . . . . . . . .1 4. . . . . . . . . . . . . . . . . . . . . . . . 96 vi . . . . . . . . Leakage Power Model . . . . 4. . .1 4. . .6. . . Chip Leakage Power Variation . . .6. . . . . . . . . . . . Impacts of Device Aging and Process Variation on SER . . . . . . . . . 4. . . . . . . . . . . . . . . . . . . . . . . .4. . . . . . . . .

. . . . . . . .2 5. . . . . . . . . . . . .5. . . . . . . .4 Quadratic Delay Model . . . . . . . . . . . . . . . . . .2 Review and Preliminary . . . .5. . . .3. . . . . . . . . . . . . 109 Sum Operation for Quadratic Delay Model . . . . . . . . . . . . . . . . . . . .2 6. . . . . . . . . .1 5. . .3 . . . . . . . . . 127 6 Physically Justifiable Die-Level Modeling of Spatial Variation in View of Systematic Across-Wafer Variability 6. . . 118 5. . . . . . . . . . 118 Max Operation for Linear Delay Model . . . . .6 5. .5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121 Conclusions . . . . . . . . . 128 Introduction . . . 115 Max Operation for Semi-Quadratic Delay Model . . . . . . . . Second-Order Polynomial Fitting of Max Operation . . . . . . . . . .1 5. . . . . . . . . . . . . . .2. .3. . . . . . . . . . . . 115 Computational Complexity of Quadratic SSTA . . . . . . . 100 5. . . . . . . . . . . . . . . . . . 131 Analysis of Wafer Level Variation and Spatial Correlation . . . . . . . . . . 119 5. . . . .2 Introduction . . . . . .4. . . .3. . . . . . . . .5 Linear SSTA . . . .1 5. . . . . . . . . 133 vii . . . . . . . . . . .2 Semi-Quadratic Delay Model . . . . . . . . . . . . . . . 129 Physical Origins of Spatial Variation . . .4 Semi-Quadratic SSTA .1 5.2 Linear Delay Model . . . . . . . . . . 5. . 107 5. . . . . . .2.3. . . . 115 5. .7 Experimental Result . .1 6. . .1 5. . . . . . . . . . . . . . .3 5. . . . . . . 107 Max Operation for Quadratic Delay Model . . . . . . . . . . . . . . . . .3 Quadratic SSTA . . . 116 5. . . . . . . . . . . . . 97 99 99 New Fitting Method for Max Operation . . . . . . 115 5. . . . .4. . . .

. .5. . . .3. . . . . . . . . . . . . . . . . . 161 7 On Confidence in Characterization and Application of Variation Models 162 7. . 144 6. . . . . . . . . . . . . .6. . . . . . . . . .3. . . . . . . . . .5 Variation of Mean and Variance with Location . . . . . . . . . .1 7. . . . . . . . . . . . . . . . . . . . . . .6 Application of Spatial Variation Models on Statistical Timing Analysis 146 6.4 6. . . . . . . . . 170 viii . . .3 Mean and Variance . . . . 148 6. . .1 6. . . . . . .2 7. . . . . . . . . . . 167 Simplifying the Large Lot Case . . . . . . . . . . . . . . . . . . . . . . . . 169 One Production Lot Case . . .3 Introduction . . . 163 Problem Formulation . . . . . . . . . .8 6. . . . . . .3. . .2 7. . . . . . . . . . . . . . . . . . . . . . . . . .9 Application of Spatial Variation Models on Statistical Leakage Analysis 156 Summary of Different Models . . . 145 Location Dependent Across-Wafer Model . .6. . . . . .3. . . . . . . . . . . . . .3. . .5. . . . . .7 6.4 6. .5. . . . 146 6. . . . 145 Quadratic Across-Wafer Model . . . . . . 147 Experimental Result . . .2 6. 140 General Across-Wafer Variation Model . . . . . .3 6. . . . . . . 141 Modeling Spatial Variability . . . . . 137 Dependence between Inter-Die and Within-Die Variation . . . . . . . .2 6. . . . . .2 Delay Model . . . . . . . . . . 139 When can Spatial Variation be Ignored? . . . 134 Appearance of Spatial Correlation . .1 6. . . . . . . . . . . . . . . . .3. . . . . . . . . . 158 Conclusions . . . . . . . . . . . . 165 Confidence Interval for Variation Sources . . . . . . . . . . . .3 Slope Augmented Across-Wafer Model . . . . . . .1 7. . . .6. 167 7.3. . . . .1 6. .

.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6 Experimental Validation for One Lot Case . . . .3. . . . . . . . . . . .5 7. 188 7. . . . . . . 172 Confidence in SPICE Fast/Slow Corner . . . .4 7. . .7. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3. . . . . . . .5 7. . 191 9 Appendix . . . . . . . . . . . . . . . . . . . . . . . . 189 8 Conclusions . . . . . . . . . 177 Confidence in Statistical Timing Analysis . . . . . . . . . . . . . . . . 198 ix . . . . . . 195 References . . . . . . . . . . . . . . .4 7. . . . . . . . . . . . . . . . . . . . . . 171 Fast/Slow Corner Estimation . . . . . . 183 Practical Questions: Determining Confidence Induced Guardband for Real Designs . . . . .

. . . . . . calculation. . . . . . . . . . . . . . . . . . (a) Homo-Vth and Hetero-Vth. .1 4. . . calculation. (d) Vdd -gateable logic block. .1 4. . . Trace-based estimation flow. . . . . . . . . . . . .3 1. . . . . . . . . x . .9V . . (b) HomoVth +G and Hetero-Vth +G. 2. . . . . . . . . (b) Connection block. . .3 under one device setting to collect the trace information. Vdd = 0. . .L IST OF F IGURES 1.3 2. . . .8 ED and area trade-off. . . . . . . . . . . . . . . . . . . . . . (c) Switch block. . . . . Delay yield. . The flow for on current. . . . . . . . . Dominant hyper-architectures. . . The flow for sub-threshold leakage current. . . . . SSTA flow. . .5 2. . .2 (a) Vdd -gateable switch. . . . . . . . . . . 44 2. (c) Vdd -gateable connection block. . New trace-based evaluation flow. . .2 4. . . .2 1. . . . . . . . . . . . K = 4. 2. . . .1 Increasing of power density. . . . . . . . . .7 Comparison between Psim and Ptrace. .6 2. . . . . We perform the same flow as Figure 2. . ED and area are normalized with respect to the baseline (N = 8. . . . . 25 31 35 22 24 21 1 3 5 6 hyper-architectures under different device settings. . . . . . . . . (a) Island style routing architecture. . . .1 1. . . . . . .3 Leakage and delay of baseline architecture hyper-arch. . . . . . . . . . . . (b) Vdd -gateable routing switch. . . . . Leakage power and frequency variation. . and Vth = 0. . . . . . Ion . . . (d) Routing switches. . . . . 45 60 65 67 68 3. . . . . . . . . . . . .4 Existing FPGA architecture evaluation flow for a given device setting. . .3V ). . . . . . . . . . . Io f f . . . . . 2.4 2. . . . . . . . . . . . . . .

. . .10 The calculation flow for ΔVth due to NBTI and HCI. (b) Delay PDF. . . . . . . . . . (b) PDF comparison of exact computation. . . . . . 123 Ring oscillator frequency within a wafer. . 68 69 71 4. . (b) Delay PDF comparison. . (a) Process 1.13 The flow for SRAM SER calculation. . . . (b) Process 2.4.14 Impact of device aging on delay and leakage PDF. linear fitting. 4. . . . . . 133 xi . . The pull-down delay of a 1X inverter at ITRS 2005 HP 32nm technology nodes. . . .2 5. . . . (a) Leakage change. . . . . . 0).9 Energy and delay tradeoff. . . . . (a) Input and output voltages. . . and short circuit current with a small input slope. . .12 Impact of device aging. . . . . 4. . . . . . . . . . . . .D2 ). . . . . .1 . . . . . . . . . 4. . . 72 83 85 87 87 88 90 4. . . . .3 6. . 110 PDF comparison for circuit s15850. . . . . . . .5 The flow for gate leakage currents Igon and Igo f f calculation. . . . . . . Delay and leakage distribution. . . . . 5. (b) Input and output voltages. .4 4. . . . . .11 Vth increase caused by NBTI and HCI. . . . . . .7 Voltage and current in an inverter during transition. 4. . . . (b) Delay change. . 91 (a) Comparison of exact computation. . . . . . . . . . . . . . . . . . . .8 4. 0). linear fitting. . . . . . . . . . . . . (c) The NMOS IV curves and the transition of transient current Ids with small and large input slopes. . . 4. . 107 5. . . . . . (a) Leakage PDF. . . . and short circuit current with a large input slope. . .6 4. . and second-order fitting for max(V.1 Algorithm for computing max(D1 . . . . . . . (a) Leakage PDF comparison. . The flow for transistor gate and diffusion capacitances Cg and Cdi f f calculation. . . . . . and second-order fitting for max(V. . . . . . . .

137 Apparent spatial correlation and covariance as a function of distance. . . . . . . . . . . . . . . . . . . . . and L-L case. . . . . .5 6. . assuming n = 10 and n is large. . . . . . . . . . . . . . . 143 PDF of Sx and Sy . . 144 Comparison of S-L. .4 7. . . . .5 Flow to simulate confidence. . . . and 95-percentile point for ISCAS85 benchmarks. 185 ˆ ˜ 7. . . . . . . . . . . . . . . . . . . ˆ Percentile point is normalized with respect to the nominal value. . . . . .6 Normalized worst case mean. .2 6. . L-S. 175 Number of n or n which can be considered as large for different fracˆ ˜ tion of lot-to-lot variation assuming maximum 3% error at 90% confidence. . . . . . . . . . . . . . . 182 ˆ ˜ f Worst case mean. . .1 7. . . . . . . . .3 7. standard deviation. . . . . .6. . . . . . . . . . . . . . . . . . standard deviation.7 90% confidence worst case 95-percentile point with varying n for C7552. 138 Correlation coefficient for within-die spatial variation after inter-die variation is removed. . 176 7. . . . . . . .3 6. . . . . . . . . . . . . . assuming n = 10 and n = 15. . . . and 95-percentile point normalized with respect to the nominal value for ISCAS85 benchmarks. .6 6. 141 Ring oscillator frequency within a wafer. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186 ˆ ˜ 7. . . . . .4 Mean and variance for different rd . . . . . . . . . . . . . . . . . . . . . . . . . . .7 7. . . . . . . 187 xii . . . . . .2 Approximating across-wafer variation. . . . . 139 6. . 180 90% confident Dw with varying n and n. . . . . .

. . 36 37 2. and Hetero-Vth +G. . The number of LUTs and flip flops are counted under the architecture N=8 and K=4. Min-ED hyper-architecture of optimizing device and architecture separately and simultaneously. . . . . . . . . . . . . . . Vd d and Vth . . . .9 Minimum delay and minimum energy hyper-architectures. .12 Min-AED hyper-architecture under different utilization rates. . . . . . . . . . . . 2. . . . . . . Trace information. . .11 Min-ED hyper-architecture under different utilization rates. K = 4. 39 41 2. . and ED-area product are normalized with respect to the baseline. . . 41 42 xiii . .13 Min-delay hyper-architecture under different routing structures. . . . . Architecture setting: N = 10.5 2. . . . . . . . Unit: switch per clock cycle. . . 2. . . Architecture setting: N = 10.1 2. . .10 Minimum ED-area product hyper-architectures for different classes. . 2. . . . . . . . . . . Homo-Vth +G. Vd d and Vth . . . . . . . . . . . . . . .4 MCNC benchmark list. 28 2. . . . . 38 2. . . Power and delay ranges for different device settings. . . . . . 27 3 26 Short circuit power ratios for different technology nodes. . . . . . . . . 31 33 34 2. . Area. . Comparison between baseline and min-ED hyper-architecture in HomoVth . . . .2 Impact of process variation in near-term future. . . . . . . . Hetero-Vth . . . . .L IST OF TABLES 1. . . . . . . . . . . . . . . switching activities for different technology nodes. . . . . . . . .6 2. . . . . ED. . . . . .8 2. . . . . . .3 . . . . . . . . . . device and circuit parameters. K = 4. . .7 Baseline hyper-architecture and evaluation ranges. Note: AED is normalized with respect to the baseline. .1 2.

. . . . . . . . . . . .2 3. . . . 3. . . . . . . . . . . . . . . . . . . . . . . . . 85 89 90 92 4. . . . . . mean. . . . . . . . and standard deviation comparison for leakage power and delay. . . . . . . . . . . . Optimum leakage yield hyper-architecture. . Optimum Timing yield hyper-architecture. . . .3 3. Soft error rate comparison.3 4. . . . . . . . . . . . . . .1 Experiment setting. . . . .7 4. . Experimental setting. . . . . . . . . . . . . . Optimum leakage and timing combined yield hyper-architecture. . . . .2 4.15 Min-ED hyper-architecture under different routing structures. delay. .6 4.8 Delay and leakage change due to device aging. . . . . . . . . . . . . . . . . . Impact of device aging on process variation. . . .5 4. . . . Experimental setting. . . . . energy. . . . . . . . . . . .2. . . . . . .4 3. . . . . . and ED comparison for different device settings. . Nominal value. . . . . . .14 Min-energy hyper-architecture under different routing structures. . . . . . . . . . . . . . . . . .5 4. . . . . . . . . . . The impact of device aging and process variation on SER of one SRAM under ITRS HP 32nm technology. . . . . . . . . . . 122 xiv . . . Power. . . . . . .1 3. . . Verification of yield model. . . . . 93 5. . . . . The 3σ value is the normalized with respect to the nominal value. . . . . . 43 43 58 59 61 61 62 82 83 84 . . . . . . .4 . . . . Experimental setting for process variation. . . . . . . . . . . . 2. . . . .1 4. . . . . . . . . . .

. and IW. . . . . . . .1 6. . . . . . Note:the error percentage of mean (µ). . 125 100 × (MC value − SSTA value)/σMC . . . . . . . . . . σ. . . . standard deviation. . . Note: The values of exact simulation are in ns. . . 154 6. . . . . . . . . . . . . . . . . .2 Error percentage of mean. . . . and 95-percentile point for exact simulation is in ns. . . 153 6. standard deviation. . .3 Mean. . . . Run time (T) is in s. . . . . . . . . . . skewness. . . . and skewness comparison for variation setting (2). and 95-percentile point (95%) is computed as skewness γ is computed as 100 × (γMC − γSSTA )/γMC . . The values of exact simulation are in ns. . . . . . 152 6. . . Note: The values of exact simulation are in ns. . SPC.3 percentage error for ISCAS85 benchmark stretching on a chip with different chip size. . . . Note: We assume square chips and chip size means edge length in mm. . . . .5. . ∗ for LDAW. . . . . . .4 Delay percentage error at different locations in a 2cm × 2cm chip. . . . and 95-percentile point for variation setting (1). . . . . . Note: The exact values are in mW . . . . . . . . . .8 Leakage error percentage for different models in 2cm×2cm chip. . . . . . . . 155 6. . 157 xv . . . . . 155 6. Run time (T) is in ms. . . . . . . .7 Delay comparison for ISCAS85 benchmark in 3mm×3mm chip.6 The values of exact simulation are in ns. . . . . linear SSTA is applied when assuming linear cell delay model. . .5 Delay comparison for ISCAS85 benchmark in 1cm × 1cm chip. Note: The values of exact simulation are in ns. . . . Note: Delay comparison for ISCAS85 benchmark in 6mm×6mm chip. 126 6. . . . . 135 Delay percentage error for different variation models. . . . 154 6. . and the error percentage of 5. . standard deviation (σ). . . . . Note: the µ. . . . . . . . . . . . . . . . . . . .2 Notations. . . . . . .

. . . . . . 168 Experimental validation of estimation of confidence interval for mean for n = 5 and n = 1. 167 7. . 179 7. n = 1. . . . .6. . . . . . . . n = 15. . . . . . . . . . . . . . . . . . 177 ˆ ˜ ˜ 7. . Lgate value is in nm.9 Leakage error for different variation model on different size chips. . . . . . . . . . . . . . . . . . . . Note: exact values are in mW . . . . . . . . . . . . . 158 6. . . . . . .1 Reliability of statistical analysis for different number of measured and production lots. . . . . . . . and ˆ ˜ Vth value is in V . . . . . . . . . . . n = 15. 174 ˆ ˜ 7. . . . .10 Summary of different variation models. . . . . 179 ˜ xvi . . . 160 7. . . and n = 25. . . . . 174 ˆ ˜ 7. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Note: delay value is in ps. . . . . . . .6 Guardband of fast/slow corner for different confidence levels assuming n = 10 and large n. . . . . . . .2 7. .9 Confidence comparison for inverter chain based corner assuming n = ˆ 10. . . . . . . . . . . . . . . . . . . . . . . .5 Guardband of fast/slow corner for different degrees of confidence assuming large n and n = 15. . . . . . . . . n = 15. . . . . 171 ˆ ˜ 7. . . . . . . . . . .4 Guardband of fast/slow corner for different degrees of confidence assuming n = 10. . .7 Guardband of fast/slow corner for different confidence levels assuming n = 10. . . . . . . . . . . . . . . . . . . . . 175 ˆ ˜ 7. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3 Notations. . . . . . . . . . . . . . .8 Percentage guardband of worst case delay corner and corresponding variation source corners under different confidence levels assuming n = 10. . . . . .

7.12 Percentage guardband of worst case delay corner and corresponding variation source corners under different confidence levels assuming n = 10. . . . . . . . . . . . . . . . . . . .10 Percentage guardband of worst case delay corner and corresponding variation sources corners under different confidence levels assuming large and n = 15. . . . . . . . n = 1. 181 ˆ ˜ 7. . . . . . . . . . . . 182 ˆ ˜ ˜ 7. . .13 Confidence comparison for ISCAS85 benchmark 95-percentile point delay assuming n = 10. . . . . . . 181 ˜ 7. . . and n = 25. 185 ˆ ˜ xvii .11 Percentage guardband of worst case delay corner and corresponding variation source corners under different confidence levels assuming n = 10 and large n. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . n = 15. . . . . . . .

I also indebted to Dr. which is reported in Chapter 4. Yan Li. Yan Lin. and Phoebe Wong. Fei Li. This dissertation would not exist without their enormous encouragement. and Dr. and Prof. Puneet Gupta. especially those colleagues at the UCLA Design Automation Lab and NanoCAD Lab. In particular. I am indebted to Prof. I would like to thank my adviser. Prof. Yan Li also helped me on developing the transistor and circuit level power and delay model for FPGA process and architecture concurrent design in Chapter 4. Lei He and my co-adviser. Miss Phoebe Wong. Jinjun Xiong. They not only helped me to develop skills to solve problems. support. which is reported in Chapter 6. and inspiration. I would like to acknowledge them here. I thank Dr. Kun Qian for helping me to develop the across-wafer variation model. The work on device and architecture co-optimization for FPGA in Chapter 2 and Chapter 3 is a joint project with Fei Li. He helped me to develop the FPGA device aging model. but also motivated and enlightened me at all time. Glenn Reinman. He also helped me to develop the chip-wise optimization flow to minimize FPGA delay considering process variation. I sincerely thank Dr. Prof. Prof. Jinjun Xiong helped me to develop the experiment and revise the writing for the statistical static timing analysis work in Chapter 5. Kevin Cao for his help with my research. Dr. Sudhakar Pamarti. I treasure the opportunity that I have worked with a group of wonderful people and would like to thank them for all the helps and support through my research work. Costas Spanos and Mr. xviii . Lieven Vandenberghe for serving as members of my dissertation committee and their suggestions to improve this work.Acknowledgments Many people offered me significant help during the development of this dissertation. First of all.

D. for their dedication. I am grateful to my family.I would also like to thank many researchers include Dr. Mr. study. Dr. I would like to thank my wife Wen Wang for her constant understanding and supporting me during my Ph. and most of all. This dissertation becomes so humble compared to their constant inspiration. Jeffery Tang. Miss Wenping Wang. for collaboration and helpful discussions. Qi Xiang. and love. xix . I would also like to thank my parents. my brother. and my sister in law. Finally. and support. support. Tim Tuan.

China. San Jose. Portland State University. in Electrical Engineering. in Electrical and Computer Engineering. xx . Nano CAD Lab. Teaching Assistant. 2003-2004 First year Ph. in Electronic and Communication Engineering. Worked on GPU testing. 2001-2003 M. Advanced Computer Architecture.Vita 1979 1997-2001 Born in People’s Republic of China. Worked on statistical power and delay estimation. 2006-2007 Research Intern.D. 2006. Altera Corp. Teaching Assistant. Nvidia Corp. California. UCLA. Guangzhou. B. Los Angeles. program. San Jose.R. USA. Design Automation Lab. USA. California. Zhongshan University. USA.S. 2004-present P. 2002-2003. Oregon. in Electrical and Computer Engineering. USA. P.D. Graduate Student Researcher. program. Graduate Student Researcher. 2006 Intern Engineer. USA. Portland. 2004-2008. Boulder. Graduate Student Researcher. California.. University of Colorado. Analogue Circuit Design. University of California. Colorado. UCLA. 2008-present. 2003-2004. 2002.S. Graduate Student Researcher..

L. November 2009. and L. pp:1777-1781.Song.” Design Automation Conference (DAC).” accepted by IEEE/ACM Asia South Pacific Design Automation Conference (ASPDAC). February 2004. issue: 11. Cheng. Lin.” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD). July 2007. pp: 474-479. L.” IEEE Computer Society annual symposium on VLSI (ISVLSI). P. Cheng. P.” IEEE Transactions on Computer-Aided De- sign of Integrated Circuits and Systems (TCAD). Yang. volume: 26. “Congestion Estimation for 3D Routing. Gupta. “Device and Architecture Co-Optimization for FPGA Architecture Power Reduction. He . and L. Cheng. P. Wong and L. Qian.P UBLICATIONS L. He . Cheng. P. July 2009. L. Y. F. pp: xxi . pp: 104-109. February 2010. “On Confidence in Characterization and Application of Variation Models. February 2009. “Efficient Additive Statistical Leakage Estimation. C. issue: 7. “Physically Justifiable Die-Level Modeling of Spatial Variation in View of Systematic Across Wafer Variability. P. He . Spanos. Cheng. He. and L. volume: 28. Gupta. and L. Li. W.” IEEE/ACM Asia South Pacific Design Automation Conference (ASPDAC). Gupta. “Accounting for Non-linear Dependence Using Function Driven Component Analysis. L. K. L. X. He . and G. pp: 1211-1221. Hung. Gupta. Cheng.

L.” IEEE Transactions on Circuits and Systems II: Express Briefs (TCAS II). “Congestion Estimation for Three-Dimensional Architectures. L. W. X. “Device and Architecture Co-Optimization for FPGA Power Reduction. Cheng. Y. pp: 655.Song.Song. Cheng. pp: 360-370. December 2004. He. L. L. Yang. Gao. Li. W. Hung. L. Cheng. X. L. “A Fast Congestion Estimator for Routing with Bounded Detours. Cheng. and L.” Integration. pp: 159-168. and G. Y. August 2006.” Design Automation Conference (DAC). Xiong. P. and Z. pp:1-6. Cheng. Lin.Song. Lin. G. and Y. February 2004. volume: 41. February 2008. pp: 915-920. and L. Tang. Tang. G. issue: 12. L. He. the VLSI Journal (Integrat VLSI J).239-240. May 2008.” International Conference on Field Pro- grammable Logic and Applications (FPL). “A Fast Congestion Estimator for Routing with Bounded Detours. J. He.659. volume: 51. June 2005. F. Cao. Yang. Wong. Cheng. Yang. Hung. “Trace-Based Framework for Concurrent Development of Process and FPGA Architecture Considering Process Variation and Reliability.” International Symposium on Field-Programmable Gate Arrays (ISFPGA). and S. xxii . “FPGA Performance Optimization via Chipwise Placement Considering Process Variations.” IEEE/ACM Asia South Pacific Design Automation Confer- ence (ASPDAC). Z. pp: 666-670. X. issue: 3.

and G. October 2007. and L. pp: 80-85. Song. Systems.” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD). Yang. L.” IEEE/ACM Asia South Pacific Design Automation Conference (ASPDAC). pp: 250-255. and L. He. Cheng. pp: 130-140. January 2009. L. Xiong. xxiii . “Non-Gaussian Statistical Timing Analysis Using Second-Order Polynomial Fitting. L. and L.”. J. June 2007. Xiong. L. He. and L. Cheng. Cheng. and L. J. Xiong. Cheng.” ACM/IEEE International Workshop on Timing is- sue: in the Specification and Synthesis of Digital Systems(TAU).L. February 2007. He. Cheng. L. X. He. and Computers (JCSC). J. February 2008. “Non-Linear Statistical Static Timing Analysis for Non-Gaussian Variation Sources. “Non-Gaussian Statistical Timing Analysis Using Second-Order Polynomial Fitting. “Non-Gaussian Statistical Timing Analysis Using Second-Order Polynomial Fitting. pp: 298-303.” Design Automation Conference (DAC). volume: 28. Xiong. Cheng. Xiong. “Non-Linear Statistical Static Timing Analysis for Non-Gaussian Variation Sources. F. J. accepted by Journal of Circuits.” IEEE International Workshop on Design for Man- ufacturability and Yield (DFM&Y). issue: 1. “An Analytical Congestion Model With Bounded-Based Detours. He. He. J.

Gu. volume: 11. L. F. “Routability Checking For ThreeDimensional Architectures. pp: 667-676. Tang.” Journal of Universal Computer Science (J. M. volume: 51. Yang. “Digital Spectrum of a Nonuniformly Sampled Two-Dimensional Signal and Its Reconstruction. Yang. Yang. He. G. “On Theoretical Upper Bounds for Routing Estimation. P. Song. pp: 307-308.Sun. Number 6. He. Cheng. M. L. June 2005. pp: 1371-1374. Tang. Song. He. Kam. pp: 916-925. Cheng.F. X. “A Combinatorial Congestion Estimation Approach with Generalized Detours. X. M. May 2005. F. Y. Cheng. G. Sun. volume: 12. Cheng. X. pp: 1180-1187. Song. issue: 12. and G. May 2005. pp: 19-24. Cheng. and J. Jeng and L. pp: 1113-1126. Z. L. Wong. Song. Y.” Computers & Mathematics with Applications (CAM). X. Gu. M. L.UCS). L. June 2005.” The Computer Journal (TCJ). December 2004. and J. and L.” International Conference on Computer Aided Design (ICCAD).” IEEE Transactions on Very Large Scale Integration Sys- tems (TVLSI). X. Z. “A Hierarchical Method for Wiring and Congestion Prediction. G. volume: 54. Yang. W. “Probabilistic Estimation for Routing Space. Gu.Song. November 2005. He. He. “FPGA Device and Architecture Evaluation Considering Process Variation. Lin. Tang. Z. T. Gu. F. xxiv .” IEEE Computer Society annual sym- posium on VLSI (ISVLSI). : 6-7. and J. issue: 3. Cheng.” IEEE Transaction on Instrumentation and Measure- ment (TIM). Cheng. March 2006. and L. Sun. Hung.

Ptrace. 2010 Professor Puneet Gupta. To perform process and architecture concurrent optimization. we present the first in-depth study on device and FPGA architecture co-optimization to minimize power. This dissertation proposes novel techniques to model. Furthermore. and variation evaluator. This dissertation focuses on two aspects: (1) Process and architecture concurrent optimization for FPGAs. Co-chair Professor Lei He. and reliability models and incorporate them with Ptrace. (2) Statistical timing modeling and analysis.Abstract of the Dissertation Statistical Analysis and Optimization for Timing and Power of VLSI Circuits by Lerong Cheng Doctor of Philosophy in Electrical Engineering University of California. an efficient and accurate FPGA power. process variation introduces significant uncertainty in power and performance to VLSI circuits and significantly affects their reliability. we perform architecture and process parameters concurrent optimiza- xxv . Los Angeles. analyze. area. Co-chair As CMOS technology scales down. is proposed. With Ptrace. If this uncertainty is not properly handled. With the extended Ptrace. and variation considering hundreds of device and architecture combinations. it may become the bottleneck of CMOS technology improvement. delay. to enable early stage process and architecture co-optimization without stable device models. and optimize power and performance of FPGAs and ASICs considering process variation. we develop transistor level and circuit level power. delay. delay.

To solve this problem. process. mean and variance uncertainty introduced by limited number of samples is another problem in SSTA. Then. we evaluate the confidence for statistical analysis and estimate the guardband value to ensure a target confidence. variation. and architecture concurrent co-optimization for FPGA power. To perform statistical timing modeling and analysis. to further improve the efficiency and accuracy of statistical timing analysis. delay. and reliability. Besides modeling spatial variation. All operations in this flow are performed by analytical equations without any time consuming numerical approach. delay. this dissertation is the first novel study of device. and reliability. we develop a new die-level spatial variation model which accurately models the across-wafer variation. To the best of our knowledge. and is the first work to model across-wafer variation at die-level and to consider confidence guardband in statistical analysis. xxvi .tion for FPGA power. variation. we first present an efficient and accurate statistical static timing analysis (SSTA) flow for non-linear cell delay model with non-Gaussian variation sources.

it increases rapidly as CMOS technology scales down. Therefore.CHAPTER 1 Introduction Complementary Metal Oxide Semiconductor (CMOS) manufacturing in nanometer region has resulted in great achievements in the world of electronics. Figure 1. 1 .1: Increasing of power density.1 [114] illustrates the power density for different technology nodes in the past thirty years. power consumption becomes a crucial design constraint for nano-scale VLSI designs. It can be seen that power density is increasing at a dramatic rate. •10 •4 Power Density (W/cm 2 ) •10 •3 •10 •2 •10 •1 •10 •1980 •0 •1985 •1990 •1995 •2000 •2005 •2010 Figure 1. The performance and capabilities of Very Large Scale Integration (VLSI) circuits improve rapidly with the aggressive scaling of transistors in CMOS. However. the increase of leakage power and process variation have emerged to slow down this trend. Since leakage power is exponentially related to channel length.

such as statistical modeling. channel width. as illustrated in Figure 1. oxide thickness.3X variation in frequency and 20X variation in leakage power. metal thickness.1 [69]. process variation will become a potential show-stopper if it is not appropriately handled. which render chips non-functional.2. Process variation is the uncertainty of physical process parameters. process variation is more difficult to handle in VLSI design. it is predicted that leakage power variation will increase to 331% and performance variation will increase to 88% in the year 2015. doping density.In comparison to challenges created by power. From the table. as CMOS technology scales down to nano-meter region. In recent years. transistors on different copies of the same design. It has been shown in [25] that even for the 180nm technology. or even at different locations on the same chip will vary significantly. as shown in Table 1. This research is called “Design for Manufacturability ” (DFM). Therefore. Due to the inevitable manufacturing fluctuations and imperfections. and manufacturing temperature. Internation Technology Roadmap for Semiconductors (ITRS) predicts the DFM technology requirement for process variation in the near-term future. Process variations introduce significant uncertainty for both circuit performance and leakage power. doping density. and so on. 2 . process variation can be classified into two types [162]: • Catastrophic defects are caused by isolated random events (such as particles or other contaminations) during manufacturing. According to the causes of variation. metal width. have been developed to alleviate the variation effects. analysis. process variation can lead to 1. many methods. • Parametric variations are caused by random fluctuations in process conditions so that the physical properties of some parameters on a chip differ from the original design. and optimization for VLSI circuits. The fluctuations may include aberrations in stepper lens. such as channel length. Such impact will become even larger in future technology generations.

FPGA is the most flexible one. Among the design styles. and optimization techniques should be developed for different design styles. analysis. Year % of Vdd variability % of Vth variability (minimum size) % of Vth variability (typical size) % of CD (Lgate ) variability % circuit performance variability % circuit total power variability % circuit leakage power variability 2007 10 33 16 12 46 56 124 2008 10 37 18 12 48 57 143 2009 10 42 20 12 49 63 186 2010 10 42 20 12 51 68 229 2011 10 42 20 12 60 72 255 2012 10 58 26 12 63 76 281 2013 10 58 26 12 63 80 287 2014 10 81 36 12 63 84 294 2015 10 81 36 12 63 88 331 Table 1. different modeling. Application Specific Instruction set Processor (ASIP). and general purpose processor or microprocessor. Since FPGA allows the same silicon implementation to be programmed or re-programmed for a variety of applications. Therefore. it provides low non-recurring engineering (NRE) cost and short time to 3 . Field-Programmable Gate Array (FPGA).1: Impact of process variation in near-term future. There are four common design styles in current VLSI: Application Specific Integrated Circuit (ASIC). Different design styles may have different power consumption requirement and vulnerability to process variation.2: Leakage power and frequency variation.Figure 1.

and a 40X area difference between FPGA designs and their ASIC counterparts.market. since FPGAs have lots of regularity. FPGA pays the penalty with decrease of performance. variation modeling and analysis in ASICs are more important than in FPGAs. The power problem is more significant for FPGAs than other design styles because FPGA has higher power consumptions. delay. wherein the impact of process variation is very significant. Figure 1. Due to the above advantages and disadvantages. design cost. ASICs have higher performance. The first objective of this dissertation is: Objective 1: Concurrently evaluate FPGA architecture and process parameters to optimize power. ASICs has higher NRE cost. Therefore. On the other hand. and area. in order to achieve programmability. Due to this advantage. due to the increased design complexity. and area considering process variation and reliability. the impact of process variation on FPGAs should not be ignored and should be properly handled. Manufacturing yield loss is defined as the ratio between the number of chips that fail to meet the design specifications and the total number of manufactured chips [112]. There are several reasons causing chip failure. As discussed before. Process variation may cause manufacturing yield loss. Nevertheless. a 4. lower power. and longer time to market than FPGAs. power consumption is a crucial design constraint for nano-scale VLSI circuits. To solve the above problems.3 4 . Previous studies [83.3X delay difference. In comparison to FPGAs. ASICs are usually used in high performance or low power applications. such as catastrophic defects and parametric variations. this dissertation focuses on optimizing power and performance for FPGAs considering process variation. power. and smaller area. the impact of process variation on FPGAs are not as significant as other design styles. 82] have shown a 100X energy difference. However. However. FPGA industry has grown rapidly since its invention in 1984.

In this dissertation. The second objective of this dissertation is: 5 . we mainly focus on modeling of process variation and developing an accurate SSTA engine. • Statistics aware engines for STA and optimization. there are three key components [122]: • Modeling of process variation from silicon process characterization. Instead of estimating the chip delay deterministically as in the traditional Static Timing Analysis (STA). Figure 1. In this flow. Statistical Static Timing Analysis (SSTA) is proposed.illustrates the performance variation caused by parametric variations.4 shows the SSTA flow. It can be seen that chip delay is spread over a wide range and the delay of some chips is higher than the specification value. Dela s e c y p Yield Yield Loss Measured Delay Figure 1. SSTA not only estimates the nominal value of chip delay but also provides the delay distribution. In order to analyze and optimize performance yield. • Modeling of sensitivities of delay for standard cell libraries with regards to process parameter variations (library modeling). SSTA estimates the chip delay statistically.3: Delay yield. The chips failing to meet the delay specification contribute to yield loss.

[10] showed that LUT sizes ranging from 4 to 6 and cluster sizes between 4 and 10 produce the best area-delay product. and routing wire segment length. Performance.3 outlines this dissertation. LUT size of 4 achieves the smallest area and LUT size 5 or 6 minimizes delay. Therefore.Measurement Modeling of process Variation SSTA Engine Chip Delay Variation Library Modeling Net List Figure 1.1 reviews some related research works. After cluster-based island style FPGA became popular.2 discusses the major contributions of this dissertation. Section 1. Section 1. and Area Optimization It is well known that architecture. In the following of this chapter. and Section 1. including logic block (or cluster) size. 1. performance.1 FPGA Power. architecture evaluation mainly focuses on optimizing delay [143] and area [130]. architecture evaluation for FPGA optimization attracts lots of concerns. These works showed that for non-clustered FPGAs. architecture evaluation for this type of FPGA was performed. 6 .1 Literature Review 1. Look-Up Table (LUT) size. has great impact on FPGA power. In early 1990s. and area. Objective 2: Statistical Timing Modeling and Analysis.4: SSTA flow.1.

123. body bias [110] and power gating [59. Architecture evaluation considering power gating and dual-Vdd was performed [101]. [136] performs parametric yield estimation for FPGA considering within-die variation. 89. To reduce leakage power consumption of unused circuit elements. power consumption is one of the most important design constraints of FPGA design when technology scales down to nano-meter region. a glitch minimization technique [84] was proposed to reduce dynamic power. 7 . 137. 51. Besides leakage power minimization. utilization rate of circuit elements in FPGA is very low. 102] were applied. 92]. Due to programmability. 115. power driven partition method was presented in [117]. 100. Therefore. 92] presented power evaluation frameworks for generic parameterized FPGAs and showed that leakage power takes higher portion of total power consumption in FPGAs than in ASICs. 90. and a configuration inversion method to save the leakage power was proposed in [12]. [158] quantified the leakage power of commercial FPGA architecture and [48] introduced a high level FPGA power estimation methodology. 115. 71]. a leakage power saving routing multiplexer and an input control method were developed [150]. programmable dual-Vdd was first applied to logic blocks [93. Architecture evaluation for power and delay minimization is then studied in [93. 91] and then extended to interconnects [58. 135. 79. [123. It was also shown that interconnects consume even more power than logic elements in FPGA.As discussed before. 43. 154. 147. FPGA power modeling became a hot research topic. 98. [31. Recently. FPGA modeling and optimization considering process variation was studied. 99. FPGA power optimization was also studied. At the same time. To further reduce power on used circuit elements. 28] presented techniques to statistically optimize FPGA performance. after the year 2000. Power aware FPGA CAD algorithms was proposed in [85].

165. 88. 111. 128. 97. There are two major approaches in 8 . 52.2 Statistical Timing Modeling. 41. 73. 119. 173. 171. 181. In this model. Statistical Analysis. obtaining the correlation coefficient between two grids became a problem for this model. 189. 36. 118. and Optimization Process Variation Modeling. 65. 113. 179. 146. 35. 33. 5. Statistical static timing [103. 163. 157. [172. 7. 142. each transistor has its own within-die variation and the within-die variation for different transistors are assumed to be independent.With the model of process variation. 9. 182. 2. 49. 95. The spatial variations of different grids are correlated and the correlation coefficients depend on the distance between two grids. Later on. 104. 189. 76. 29. 63. 56. 178. 53. 138. 21. 169. statistical analysis and optimization are then performed. 40. 144. 116. 105] solved this problem by modeling the spatial variation as a random field [174] and assuming the correlation coefficient to be a function of distance. 151. 107. 44. 81. 133] analysis are hot research topics in recent years. 55. 32] and principle component analysis was applied to perform statistical timing analysis. 161. 64]. 13. 75. several more complicated spatial variation models were proposed [45. 120.1. 70. 72. 19. 188. 105. 32. 62. In this case. process variation modeling is needed. 86. 3. 30. In order to analyze the power and delay uncertainty. 8.Process variation introduces significant uncertainty to circuit power and delay. 176. it was observed that within-die variation is not independent but spatially correlated and the correlation depends on the distance between two within-die locations. 183. After that. a chip is divided into several grids and each grid has its own spatial variation. 156. An early work was proposed in [25] whereby process variation is separated into inter-die variation and within-die variation. 22. 42] and leakage [26. 20. 15. 37. 61. Analysis. However. 180.1. 155. All transistors on the same die share the same inter-die variation and different dies may have different inter-die variation. 109. 140. 145. 141. spatial variation was modeled as correlated random variables [6.

19. 182. Since Gaussian random variables are easy to be handled. 179. the via resistance has an asymmetric distribution [13]. 40. 86. block-based SSTA is more commonly used. 9 . 180. 183. 13. the early block-based SSTA flows [32. the statistical characteristics (such as mean and variance) are assumed to be given and reliable. and correlation coefficients as an interval. However. when the number of production samples is not large. Moreover. Some recent works [76. 40. 49. 42] considered both non-linear delay model and nonGaussian variation sources. 163. 2. [142] applied independent component analysis to de-correlate the nonGaussian random variables. 42] SSTA. 3. 53. 9. 8. 179. 128] and block-based [32.In all the statistical modeling and analysis methods discussed above. it was also observed that some variation sources do not follow a Gaussian distribution.Statistical static timing analysis techniques: path-based [103. 178]. 142. 75. variance. 76. 22. 72. 5. The nonlinearity of delay model and non-Gaussian variation sources make SSTA much more difficult. 41. and then estimated the range of mean and variance of circuit performance. 19. uncertainty in the statistics of measured data as well as production data should be considered in statistical analysis. 7. For example. 157. Therefore. the production statistical characteristics may significantly deviate from their population values. Later. 41. 178. [177] modeled the uncertainty of mean. 13. 181. 113. but it was still based on a linear delay model. it was observed that the linear delay model is no longer accurate when the scale of variation becomes larger [94] and a higher-order delay model is thus used [180. Since path-based SSTA is not scalable to large circuit sizes. Confidence Analysis. 163] modeled the gate delay as linear functions of variation sources and assumed all the variation sources are mutually independent Gaussian random variables. 120. while dopant concentration is more suitably modeled as a Poisson distribution [142] rather than Gaussian.

Experimental results show that compared to cycle posed closed-form models to estimate chip level leakage and timing variations 10 .2% compared to the baseline. Based on such framework. which uses the VPR architecture model [18] with the same LUT size and cluster size as the commercial FPGAs used by Xilinx Virtex-II [170]. we have the following contributions: • We first develop an efficient yet accurate power and delay estimator for FPGAs. respectively. We also study the impact of utilization rate and interconnect structure. Furthermore. It has been shown that the mean and standard deviation computed by our models are within 3% error compared to Monte-Carlo simulation. We profor FPGA. 2. Ptrace predicts chip level power and delay within 4% and 5% error. • We then extend Ptrace to estimate power and delay variation of FPGAs. With the which we refer to as Ptrace.1. and the device settings from ITRS roadmap[66].4% and chip area by 23%. For device and architecture concurrent optimization for FPGAs. accurate power simulator and VPR [18]. Device and architecture concurrent optimization for FPGAs. we perform device and architecture co-optimization for FPGA circuits. Compared to the baseline.2 Contributions of the Dissertation The contributions of this dissertation are two-fold: 1. considering FPGA architecture with power-gating capability. our co-optimization reduces energy-delay product by 18. Statistical timing modeling and analysis.0% and chip area by 8. our architecture and device co-optimization reduces energy-delay product by 55.

160. For statistical timing modeling and analysis. device and architecture tuning improves leakage yield by 4. Such evaluation may be used to select circuits and architectures that are less sensitive to process changes or process variations. • Our proposed SSTA flow solves the problem of computing the chip delay vari11 . device aging (Negative-Bias-Temperature-Instability. 23] and HotCarrier-Injection. With Ptrace2. respectively. we incorporate analytical calculations for two types of FPGA reliability. skewness. Moreover. Compared to the baseline. 6%. we perform FPGA device and architecture evaluation considering process variations. HCI [34. Experimental results show that compared to Monte-Carlo simulation. 149.2%. We call the resulting framework as Ptrace2. but LUT size of 5 achieves the maximum leakage and timing combined yield. NBTI [11. All operations in our flow are based on efficient closed-form formulae. timing yield by 12.4%. our approach predicts the mean.extended Ptrace. • We further extend (Ptrace) to consider process parameters directly so that FPGA circuit and architecture evaluation can be conducted when only the first order process parameters are available. we have the following contributions: • We introduce an efficient and accurate SSTA flow for non-linear SSTA with non-Gaussian variation sources. We also observe that LUT size of 4 gives the highest leakage yield. and leakage and timing combined yield by 9. and 95-percentile point within 1%. standard deviation. LUT size of 7 gives the highest timing yield.8%. 1%. again in the ‘from device to chip’ fashion. we also find that neither device aging due to NBTI and HCI nor process variation has significant impact on SER. We observe that device aging reduces standard deviation of leakage by 65% over 10 years while it has relatively small impact on delay variation. 166]) and permanent soft error rate (SER) [60]. 149. and 1% error.

38. Morever. Our estimations are within 2% of simulated confidence values. the error of the traditional gridbased spatial variation model is up to 8% while the error of our model is only 2%. 1. 4.. percentile corner) and analysis (SPICE corner extraction.3 Dissertation Outline The remaining of this dissertation is organized as follows: Part I. In order to improve the accuracy and efficiency of SSTA. but does not consider modeling of process variation. The guardbands are non-negligible for all cases when either production or characterization volume is not large. statistical timing). we presents an accurate model for acrosswafer variation. The problem is fidence in the statistical analysis (since production only sees a small snapshot of the entire distribution). proposes techniques to statistically model and optimize FPGA power and delay.7% of standard deviation for single parameter corner. 12 . Compared to the exact value. variance. calibrated models have low degree of confidence. and 52% of standard deviation for 95%-tile point of circuit delay are needed) to ensure 95% confidence in the results. 34. However. including Chapter 2. We mathematically derive the confidence intervals for commonly used statistical measures (mean. due to limited number of samples (especially in the case of lot-tofurther exacerbated when low production volumes cause additional loss of con- lot variation).g. 3.ation. • The above SSTA flow also assumes that the statistical variation models are reli- able. our new model is 6X faster than the traditional model. a significant guardband is needed (e.7% of standard deviation for SPICE corner. Our experiments indicate that for moderate characterization volumes (10 lots) and low-to-medium production volumes (15 lots).

With this circuit level model. We first develop a set of closed-form formulas for chip level leakage and timing variations considering both die-to-die and within-die variations. Based on such model. delay and area. and reliability. and interconnect wire segment length W ) to minimize power. cluster size N. Chapter 6. 68. leakage power. With Ptrace. in addition to power and delay during evaluation. leakage yield. We first implement models to calculate electrical characteristics of advanced CMOS transistors based on the ITRS MASTAR4 (Model for Assessment of cmoS Technologies And Roadmaps) tool [67. 148]. Part II. We refer to the new trace-based model as Ptrace2. Chapter 3 extends the architecture and device co-optimization flow in Chapter 2 to considering process variation. Applying extended Ptrace. and then develop circuit level models of delay. we simultaneously optimize device setting (supply voltage Vdd and threshold voltage Vth ) and FPGA architecture (look-up table size K. delay. we can conduct FPGA circuit and architecture evaluation when only the first order process parameters are available. Chapter 4 proposes a framework to perform concurrent process and architecture optimization for FPGAs in early stage without detailed process modeling.Chapter 2 first introduces a time efficient trace-based delay and power estimation framework for FPGA circuits. With Ptrace2. and delay and leakage combine yield. such as device aging and permanent soft error rate. we optimize device setting and FPGA architecture to maximize delay yield. we further extend Ptrace in Chapter 2 and 3 to evaluate chip level power. Ptrace. 13 . and input/output capacitance for FPGA basic circuit elements. we extend Ptrace to estimate the power and delay variation of FPGAs. and Chapter 7 introduces statistical timing modeling and analysis. including Chapter 5. We may also consider reliability issues.

and for full chip timing analysis. Chapter 6 studies another key component of SSTA: the modeling of process variation. program which are not included in this dissertation. for the fast/slow corner of an inverter chain.Chapter 5 presents an efficient and accurate statistical static timing analysis flow for non-linear delay model with non-Gaussian variation sources. 14 . We analyze the impact of deterministic across-wafer variation and find that the spatially correlated within-die variation is mainly caused by deterministic across wafer variation. we propose an efficient and accurate die-level variation model to model the across-wafer variation. Chapter 8 concludes this dissertation with discussion of the ongoing work and future work as well. Finally. Based on this approximation. In Chapter 9. We apply our proposed variation model to the SSTA flow in Chapter 5 to verify its efficiency and accuracy. of statistical timing analysis. quadratic delay model without crossing terms (semi-quadratic model). Then. we develop an SSTA flow for three different delay models: quadratic delay model. We first propose a second order polynomial approximation for the max operation of two random variables. and linear delay model. Chapter 7 studies the confidence interval of statistical characteristics.D. We estimates the confidence interval for a single variation sources. I briefly summarize the works I have finished in my Ph. such as mean and variance.

Part I Statistical Modeling and Optimization for FPGA Circuits 15 .

By collecting trace information from cycle-accurate simulation of placed and routed FPGA benchmark circuits and re-using the trace for different Vdd and Vth . Compared to the baseline FPGA architecture. and Vth optimized with respect to the above architecture and Vdd .2% compared to the baseline FPGA architecture. Delay. but a great impact on power and performance in the nanometer technology.3%. considering power-gating of unused logic blocks and interconnect switches (in this case sleep transistor size is a parameter of device tuning). and Area Minimization Device optimization considering supply voltage Vdd and threshold voltage Vth has little chip area increase. architecture and device cooptimization can reduce energy-delay product by 20. called trace-based model. Vdd suggested by ITRS.5% and chip area by 23. Furthermore. we enable device and architecture co-optimization considering hundreds of device and architecture combinations.CHAPTER 2 Deterministic Device and Architecture Co-Optimization for FPGA Power. We first develop an efficient yet accurate timing and power evaluation method. This chapter studies simultaneous evaluation of device and architecture optimization for FPGAs. 16 . which uses the VPR architecture model and the same LUT and cluster sizes as those used by the Xilinx Virtex-II. our co-optimization reduces energy-delay product by 55.0% and chip area by 8.

89. Besides leakage power reduction. 51. a glitch minimization technique [84] 17 . and a high level FPGA power estimation methodology was presented [48]. The leakage power of a commercial FPGA architecture was quantified [158]. Due to the large number of transistors required for field programmability and the low utilization rate of FPGA resources (typically 62. Besides the power optimization CAD algorithms. body biasing and multi-threshold techniques was designed [110] to reduce interconnect leakage.1 Introduction FPGAs allow the same silicon implementation to be programmed or re-programmed for a variety of applications.5% [158]). 98. 90. and a circuitry combining gate biasing.2. A configuration inversion method to reduce the leakage power of multiplexers without any additional hardware [12] was investigated. power consumption becomes a crucial design constraint for FPGAs. As to power optimization. the interaction of a suit of power-aware FPGA CAD algorithms without changing the existing FPGAs was studied in [85]. 92] and it was shown that both interconnect and leakage power are significant for FPGAs in nanometer technologies. 91] and interconnects [58. existing FPGAs consume more power compared to ASICs [83]. and Vdd programmability was applied to both FPGA logic blocks [93. low power FPGA circuits and architectures have also been studied. Region based power gating for FPGA logic blocks [59] and fine-grained power-gating for FPGA interconnects [102] were proposed. Power evaluation frameworks were introduced for generic parameterized FPGA [123. A new type of routing multiplexer and an input control method were developed [150] to reduce leakage of unused routing multiplexers. Recent work has studied FPGA power modeling and optimization. As the process advances to nanometer technologies and low-energy embedded applications are explored for FPGAs. It provides low NRE (non-recurring engineering) cost and short time to market. 71]. Power-driven partition algorithm for mapping applications to FPGAs with different Vdd -levels [117] was proposed.

35µm technology. Vth . FPGA architecture evaluation considering energy was studied recently [93. We define hyper-architecture as the combination of device and architectural parameters. Besides area and delay. 92]. and have not conducted simultaneous evaluation of device optimization such as Vdd and Vth tuning and architecture optimization such as tuning LUT and cluster sizes. The total number of hyper-architecture combinations can be easily over a few hundreds considering the interaction between these dimensions. Later on. 123. and energy. and sleep transistor size if power gating is applied. and considering area. Architecture evaluation has been performed first for area and delay. For nonclustered FPGAs. delay. such as cycle-accurate simulation [89] and transition density based estimation [123]. the clusterbased island style FPGA was studied to optimize area-delay product and it showed that LUT sizes ranging from 4 to 6 and cluster sizes between 4 and 10 can produce the best area-delay product [10]. it was shown that an LUT size of 4 achieves the smallest area [130] and an LUT size of 5 or 6 leads to the best performance [143]. it is time-consuming to explore the huge hyper-architecture solution space using methods from [89. an LUT size of 3 consumes the smallest energy [123]. and sleep transistor size (if power gating is applied). Therefore. all the aforementioned architecture evaluation assumed fixed supply voltage Vdd and threshold voltage Vth . In order to perform 18 . timing and power are calculated for each circuit element. In 100nm technology. It was shown that in 0.was proposed to reduce dynamic power. However. Architecture and device co-optimization may obtain a better power and performance tradeoff compared to architecture tuning alone. an LUT size of 4 consumes the smallest energy [92]. Vdd . LUT size K. [101] further evaluated FPGA architectures with field programmable dual-Vdd and power gating. The co-optimization requires the exploration of the following dimensions: cluster size N. In the existing power evaluation frameworks. 123].

and circuit element utilization rate for a given set of benchmark circuits (MCNC [175] benchmark set in this chapter). an accurate yet extremely efficient timing and power evaluation method is required. Compared to the baseline hyperarchitecture. architecture and device co-optimization can reduce energy-delay product 19 . and Vdd suggested by ITRS [66]. we obtain the baseline FPGA hyper-architecture which uses the VPR architecture model [18] and the same LUT size and cluster size as the commercial FPGAs used by Xilinx Virtex-II [170]. delay.8% for delay. Vth and sleep transistor size using VPR and Psim. We explore different Vdd . The second contribution is that we perform the architecture and device co-optimization for conventional FPGAs and FPGAs with power gating capability. For comparison. near-critical path structure.2GHz Intel Xeon servers while all the hyper-architecture evaluation reported in this chapter with over hundreds of hyper-architecture combinations took only a few minutes on one server. Compared to performing placement-and-routing by VPR [18] followed by cycle-accurate simulation (called Psim) from [89]. short circuit power. The trace collecting has the same runtime as evaluating FPGA architecture for one combination of Vdd .device and architecture co-optimization. Vth . and area. We profile placed and routed benchmark circuits and collect statistical information (called trace) on switching activity. It took one week to collect the trace for the MCNC benchmark set using eight 1. but Vth optimized by our device optimization. and sleep transistor size combinations in addition to cluster size and LUT size combinations. The first contribution of this chapter is that we develop a trace-based estimator (called Ptrace) for FPGA power. We show that the trace is independent of device tuning and it can be used to calculate FPGA chip-level performance and power for a number of device designs.3% for energy and of 0. Such baseline is significantly better than the ones with no device optimization. Ptrace has a high fidelity and an average error of 1.

2% compared to the baseline.(product of energy per clock cycle and critical path delay. we assume subset switch block in this chapter. We also study the impact of utilization rate and interconnect structure. our architecture and device co-optimization reduces ED by 55.0% and chip area by 8. The rest of the chapter is organized as follows. A cluster-based logic block (see Figure 2. considering FPGA architecture with power-gating capability.4% and chip area by 23. Section 2.1) includes N fully connected Basic Logic Elements (BLEs). Each BLE includes one K-input lookup table (LUT) and one flip-flop (DFF).2 presents the background of FPGA architecture and existing FPGA architecture evaluation flow. The logic blocks are surrounded by routing channels consisting of wire segments. ED) by 18.2.5 concludes this chapter. Section 2. in short.3%. Furthermore. 2. The input and output pins of a logic block can be connected to the wire segments in routing channels via a connection block (see Figure 2.1 Conventional FPGA Architecture We assume cluster-based island style FPGA architecture such as that in [18] for all classes of FPGAs studied in this chapter.1 The connections in a switch block (represented by the dashed lines in Figure 2. where the incoming track can be connected to the outgoing tracks with the same track number.2 Preliminaries 2. Section 2.1 (c) shows a subset switch block [87].1 (c)) are programmable routing switches. A routing switch block is located at the intersection of a horizontal channel and a vertical channel. Figure 2. 20 .4 applies the new estimation models to architecture and device co-optimization.3 introduces our trace-based estimation models.1 (b)). We implement 1 Without loss of generality. The combination of cluster size N and LUT size K is the architectural issue we evaluate in this chapter. Section 2.

connection block 0123 switch block logic block in1 SR SR 0 1 2 3 connection block connection switch (a) Island style routing architecte 012 3 0 1 2 3 A B C 0 1 2 3 (b) Connection block and connection switch SR A B SR D 012 3 (c) Switch block (d) Routing switches Figure 2. We define an inter- connect segment as a wire segment driven by a tri-state buffer or a buffer. 2 We interchangeably use the terms of switch and buffer/tri-state buffer. We decide the routing channel width CW in the same way as the architecture study in [18].2CWmin where CWmin is the minimum channel width required to route the given circuit successfully.1: (a) Island style routing architecture. (b) Connection block.. (c) Switch block. 2 In this chapter. (d) Routing switches. CW = 1.e. 21 . i. we investigate both uniform length-4 interconnect wire segments and mixedlength interconnects.routing switches by tri-state buffers and use two tri-state buffers for each connection so that it can be programmed independently for either direction.

2. the sleep transistor is turned off by the configuration cell.2 illustrates the circuit design of the Vdd -gateable interconnect switch and logic block from [101]. Figure 2. SPICE simulation shows that power-gating can reduce the leakage power of an unused buffer or a logic block by a factor of over 300.2: (a) Vdd -gateable switch. When a buffer or logic block is not used. We insert a PMOS transistor (called a sleep transistor) between the power rail and the buffer (or logic block) to provide the power-gating capability. 22 Switch SR d1 Switch SR log2 N SRAM cells (a) Vdd−gateable switch d1 . Vdd M2 SR SR SR A M1 Logic block B BUFF Switch Connection switches in1 N switches Decoder Switch Switch SR (b) Vdd−gateable routing switches dN dN SR SR Dec_Disable wire track (c) Vdd−gateable connection block Vdd Sram Logic block (d) Vdd gateable logic block Figure 2.2.2 Vdd Gatable FPGA Architecture Power gating can be applied to interconnects and logic blocks to reduce FPGA power. (d) Vdd -gateable logic block. (b) Vdd -gateable routing switch. (c) Vdd -gateable connection block.

2. TV-Pack is used to pack the mapped circuit to a given cluster size. Because the decoder is no longer in the critical path.There is power and delay overhead associated with the sleep transistor insertion. an explicit decoder is used in the connection box to enable both the signal path and power supply for the single input. delay. we place and route the circuit using VPR [18] and obtain the chip level delay and area. The delay overhead associated with the sleep transistor insertion can be bounded when the sleep transistor is properly sized. we first optimized the logic then map the circuit to a given LUT size. For a given benchmark set. the device and architecture co-optimization requires the exploration of the following 23 . However. Finally.3. is used in the connection box as illustrated in Figure 2. After packing. an input MUX. the connection box with power gating is different from that in the conventional architecture. as illustrated in Figure 2. the delay of connection box in Figure 2.1(b).2(c) [101]. which is effectively an implicit decoder [18].1(b).2.2(c) is smaller than that in Figure 2.2(c) also has a much smaller leakage power but a larger area and a slightly larger dynamic power compared to the conventional architecture. For the conventional FPGA. The architecture evaluation flow discussed above is time consuming because we need to place and route every circuit under different architectures and a large number of randomly generated input vectors need to be simulated for each circuit. However. cycle-accurate power simulator [89] (in short Psim) is used to estimate the chip level power consumption. Moreover. and power [92] is illustrated in Figure 2. for the power gating architecture. This is due to the fact that the sleep transistors stay either ON or OFF after configuration and there is no charging and discharging at their source/drain capacitors. The dynamic power overhead is almost negligible. The connection box in Figure 2.3 FPGA Architecture Evaluation Flow The existing FPGA architecture evaluation flow considering area.

a fast yet accurate FPGA power and delay estimator is required. dimensions: cluster size N. we will introduce the efficient power and delay estimation model. there are following two classes of information as summarized in Table 2. LUT size K. 2. The total number of hyper-architecture combinations can be easily over a few hundreds considering the interaction between these dimensions.3 Trace-Based Estimation In this section. The second class only depends on device setting and circuit design.1. We speculate that during hyper-architecture evaluation. In order to perform device and architecture co-optimization. The first class only depends on architecture and is called trace of the architecture. the trace-based estimator. the conventional evaluation flow is not practical for device and architecture co-optimization due to its time inefficiency. threshold voltage Vth .3: Existing FPGA architecture evaluation flow for a given device setting. tracebased estimation method (Ptrace). supply voltage Vdd . and sleep transistor size if power gating is applied.3. Such estimator is just the one we are going to introduce in Section 2. Therefore.Benchmark Logic Optimization(SIS) Tech-Mapping (RASP) Timing-Driven Packing (TV-Pack) Placement & Routing (VPR) Area Delay Parasitic Extraction Cycle-accurate Power Simulator (Psim) Arch Spec Power Figure 2. The basic idea of Ptrace is 24 .

short circuit power ratio (αsc ). Delay. Delay. which depend on p technology scale. and average delay of type i circuit 25 . We then obtain chip-level performance and power for a set of device for a given architecture parameter values based on the trace information.3. Figure 2. Trace Collection Arch Spec Trace-Based Estimation Circuit Level Area. the trace information only depends on architecture and remains the same when device setting changes.4: New trace-based evaluation flow. and average leakage power of type i circuit elements (Pis ). average load capacitance of type i circuit elements (Ciu ). Device parameters include Vdd and Vt .4 illustrates the flow of trace-based evaluation. and Power Figure 2. and near-critical path structure (see Table 2. We perform the same flow as Figure 2.3 under one device setting to collect the trace information. average switching activity of used type i circuit elements (Siu ). and Power Chip Level Area. Near-critical path structure is the number of each type of circuit elements (Ni ) on the near-critical path. total number of type i circuit elements (Nit ). 2. the trace includes number of used type i circuit elements (Niu ).1 Trace Collection As mentioned before.as follows: For a given benchmark set. we profile placed and routed benchmark circuits and collect trace information under one device setting and for one FPGA architecture.1). For a given benchmark set and a given FPGA architecture.

such as circuit level delay and power.1: Trace information.2. leakage power for type i circuit elements avg. power.3. Dynamic power is only consumed by the used FPGA resources. we can perform Ptrace to estimate the chip level delay. buffers. Trace Parameters (depend on architecture) Niu Nit Siu Nip αsc # of used type i circuit elements total # of type i circuit elements avg. For type i circuit elements. Cisw is the average switch capacitance 26 .elements (Di ).2 Power and Delay Model 2.e. After collecting the trace and device level delay and power..1 Dynamic Power Model Dynamic power includes switch power and short-circuit power.3. and area. switching activity for used type i circuit elements # of type i circuit elements on the near-critical path ratio between short circuit power and switch power Device Parameters (depend on processing technology and circuit design) Vdd Vth Pis Ciu Di power supply voltage threshold voltage avg. Below. which depend on circuit design. LUTs. load capacitance of type i circuit elements avg. we will show that the trace information is insensitive to the device parameters and discuss our trace-based models. The device parameters. i. 2. can be collected by SPICE simulation or by measurement. input pins and output pins. A circuit implemented on an FPGA cannot utilize all circuit elements. delay of type i circuit elements Table 2. Our trace-based switch power model distinguishes different types of used FPGA resources and applies the following formula: 1 2 Psw = ∑ Niu · f ·Vdd ·Cisw 2 i (2. device and circuit parameters.1) The summation is over different types of circuit elements.

f is the operating frequency.26 0. In our trace-based model. the reciprocal of the near-critical path delay.e.96 0.01 1. we model the short circuit power as: Psc = Psw · αsc 27 (2.0 Vth =0.54 0. Ciu is the average load capacitance of used circuit elements.25 logic alu4 apex2 apex4 bigkey clma 2.59 0. Architecture setting: N = 10.55 0.71 0.1 Vth =0.and Niu is the number of used circuit elements. the local load capacitance for used circuit element j.87 100nm Vd d=1. The switch capacitance is further calculated as: Cisw = ( = j∈Eli Ciu · Siu ∑ Ci. Unit: switch per clock cycle.75 0. The short circuit power is related to signal transition time.23 Table 2.29 0.21 2.06 1. Eli is the set of used type i circuit elements.59 0. and Siu is the average switching activity of used type i circuit elements. which is difficult to obtain without detailed simulation using a real delay model.23 1.20 logic interconnect 0. We verify this assumption in Table 2.3) .56 0.75 1.2) For type i circuit elements.19 1.73 1.70 1. In this chapter.32 logic interconnect 0.16 1. which is averaged over Ci.03 1. we assume that the circuit works at its maximum frequency.47 0.47 0.55 0.21 2.2 by demonstrating the average switching activity of five benchmarks at different technology nodes.3 Vth =0.27 0. We assume that the average switching activity of the circuit elements is determined by the circuit logic functionality and FPGA architecture. Vd d and Vth . i. benchmark 70nm Vd d=1.2: switching activities for different technology nodes..90 interconnect 0.47 0. The device parameters of Vd d and Vth have a limited effect on switching activity.91 70nm Vd d=1. and Vd d and Vth levels. K = 4. j . j /Niu ) · Siu (2.

1 Vth =0. We verify this assumption in Table 2.44 1.71 1.72 1.48 1.63 Table 2.72 1.4) For resource type i.93 0.76 1.15 0.21 100nm Vd d=1. For an FPGA architecture with powergating capability. 3 Because the delay between buffers is usually large for FPGA circuits. Although this ratio value at the circuit level depends on FPGA circuit design and architecture. Vd d and Vth .74 1. 2.18 0. Nit is the total number of circuit elements.32 logic interconnect 1.16 0.3.3 Vth =0.86 1.20 logic interconnect 1.64 1.79 1.5% [158]). and Vth levels.11 interconnect 1. benchmark 70nm Vd d=1.43 1.68 1.82 1.12 0. an unused circuit element can be power-gated to reduce leakage could be big.25 logic alu4 apex2 apex4 bigkey clma 1.08 0.46 1.62 1. K = 4. we assume that αsc does not depend on device and technology at the chip level. Pstatic = ∑ Nit Pis i (2. Notice that usually Nit > Niu because the resource utilization rate is low in FPGAs (typically 62.3: Short circuit power ratios for different technology nodes. Architecture setting: N = 10.15 0.92 0.44 1.2.16 70nm Vd d=1.33 by showing the average short circuit power ratio at different technology nodes. Vd d. and Pis is the leakage power for a type i element.0 Vth =0.89 0.Where αsc is the ratio between short circuit power and switch power.2 Leakage Power Model The leakage power is modeled as follows. the short circuit power ratio 28 .42 1.

the FPGA delay can be calculated as follows: D = ∑ Ni Di i p (2. LUT.power. Therefore.. the area increase introduced by power gating will have small impact on the delay and power estimation. i.6) For resource type i. When power gating is applied. Moreover the load capacitance of a routing buffer is mainly determined by the input and output capacitances of buffers. we can calculate delay values for the ten longest paths under new Vd d and Vth levels.3. we assume the wire segment length change introduce by power gating can be ig√ nored. process technology. and FPGA architecture. When Vd d and Vth change. and Di is the delay of such a circuit element. on one circuit path.e.2. a single gating transistor is used for the entire logic block.5) where αgating is the average leakage ratio between a power-gated circuit element and a circuit element in normal operation. we obtain the structure of the ten longest circuit paths including the near-critical path for each circuit. Therefore. The area overhead introduced by power gating is less than 30%. We assume that the new near-critical path due to different Vd d and Vth levels is among these ten longest paths found by our benchmark profiling.003 is used in this chapter. where areatile is the area of a logic block with the interconnect surrounding it. The path structure is the number of elements of different resource types. p Ni is the number of circuit elements that the near-critical path goes through.5 To get the path In this chapter. Di is a circuit parameter depending on Vd d. the total leakage power is modeled by the following formula: Pstatic = ∑ Niu Pi + αgating · ∑ (Nit − Niu )Pi i i (2. the change of wire segment length is less than 14%. Therefore. and choose the largest one as the new near-critical path delay. The wire capacitance is a small portion (about 20%) of the load capacitance. 5 Delay of each circuit element is measured with the worst case switch. 2. wire segment and interconnect switch.4 In this case. wire length is calculated as 4 × areatile [18]. SPICE simulation shows that sleep transistors can reduce leakage power by a factor of 300 and αgating = 0.3 Delay Model To avoid the static timing analysis for the entire circuit implemented on a given FPGA fabric. Vth . In order to measure the delay of 4 29 . In our experiment.

3.3.0V and Vth =0. We assume Vd d=1. each circuit element in a logic block with gating transistor. K = 4} the architecture N=8 and K=4.4. The average energy error of Ptrace is 1.statistical information Ni . Vd d=1. the number of LUTs and flip flops are counted under LUT and cluster sizes.2V. as illustrated in Table 2.8%. a series of randomly generated vectors are used as the inputs to the block and the worst case delay is measured for each circuit element under such random input vectors. K = 7}. From the figure. the run time of Ptrace is 2s. Notice that we can map the benchmarks to different and {N = 6. We collect trace using the conventional architecture evaluation flow in 70nm technology. 30 . we compare it to the conventional architecture evaluation flow for ITRS [66] 70nm technologies.1 Verification of Deterministic Model To validate Ptrace.2V and map 20 MCNC benchmarks. Figure 2. we only need to place and route the circuit once for a given p FPGA architecture. to two architectures: {N = 8.3 Validation of Ptrace 2. Moreover.0V and Vth =0.3% and average delay error is 0.4. In Table 2.3. while that of the conventional evaluation flow is 120 hours. the Ptrace has the same trends for energy as Psim does and for delay as VPR does.5 compares energy and delay between the conventional evaluation flow and Ptrace for each benchmark. Therefore. Ptrace has a high fidelity. 2.

31 .4: MCNC benchmark list. The number of LUTs and flip flops are counted under the architecture N=8 and K=4.benchmark alu4 apex2 apex4 bigkey clma des diffeq dsip elliptic ex1010 # LUTs 1607 2118 1291 2935 13537 2172 1933 1619 4185 4721 # Flip flops 0 0 0 224 33 0 377 224 1138 0 benchmark ex5p frisc misex3 pdc s298 s38417 s38584 seq spla tseng # LUTs 1201 5926 1501 5608 2548 8458 7040 1953 3938 1299 # Flip flops 0 900 0 0 8 1463 1260 0 0 382 Table 2.5: Comparison between Psim and Ptrace. 45 36 27 Energy from Ptrace (nJ) Delay from Ptrace(ns) 10 8x4 6x7 8 6 8x4 6x7 4% Error 18 9 0 5% Error 4 2 0 0 9 18 27 36 Delay from VPR(ns) 45 0 2 4 6 8 Energy from Psim (nJ) 10 Figure 2.

9v). number of used logic blocks over the number of total available logic blocks) is 0. LUT size of 4). Homo-Vth +G and Hetero-Vth +G. and Vth of 0. Homo-Vth is the conventional FPGA using homogeneous Vth for interconnects and logic blocks.2. Vdd suggested by ITRS [66] (0. Hetero-Vth applies different Vth to logic blocks and interconnects. i. The baseline hyper-architecture and evaluation ranges for device and architecture are presented in Table 2.4. We consider 70nm ITRS technology and evaluate four FPGA hyper-architecture classes: Homo-Vth .5. • The routing channel width is 1.. Note that a high Vth is applied to all SRAM cells for configuration to reduce their leakage power as suggested by [93]. Homo-Vth +G and Hetero-Vth +G are the same as Homo-Vth and Hetero-Vth. except that unused logic blocks and interconnects are power-gated [101]. In the subsections 2. we assume the following: • The utilization rate (defined as the utilization rate of logic blocks.4.1 Overview In this section. We compare them with the baseline hyperarchitecture. • All interconnect wire segments span 4 logic blocks with fully buffered routing switches.4 Hyper-Architecture Evaluation 2. • All benchmark circuits work at their highest frequency (1/critical path delay).5. Hetero-Vth . we use Ptrace to perform device and architecture evaluation.4. which uses the VPR architecture model [18] and has the same LUT size and cluster size as those used by the Xilinx Virtex-II [170] (cluster size of 8. 32 .e.4. respectively.2 times of the minimum channel width that allows the FPGA circuit being routed.2 to 2.3v that is optimized with respect to the above architecture and Vdd .

4.4v N 6-12 K 3-7 Table 2.9v Vth 0.6 analyze the impact of utilization rate and compare the evaluation result of different routing architectures.3v N 8 K 4 Value range for device/arch optimization Vdd 0.5 and 2. To illustrate the tradeoff between energy and delay. Moreover.5 and 2. For each hyper-architecture. we use CVt for Vth of logic blocks and IVth for Vth for global interconnects.4.4. then compare the results of optimizing device 33 .1v Vth 0.2v-0.4. then we say that B is inferior to A. We first discuss the need of device tuning.6. Section 2. we introduce the concept of dominant hyper-architecture: If hyper-architecture A has less energy consumption and a smaller delay than hyper-architecture B. We define the dominant hyper-architectures as the set of hyper-architectures that are not inferior to any other hyper-architectures. respectively.2 illustrates the necessity of device and architecture co-optimization.8v-1. 2. We organize the rest of this section as follows: First.5: Baseline hyper-architecture and evaluation ranges. respectively.4. Section 2.4 discusses the device and architecture co-optimization considering area. Then Section 2. in the rest of this chapter.4. Sections 2. delay and area as the geometric mean of 20 MCNC benchmarks in Table 2.2 Necessity of Device and Architecture Co-Optimization In this section.3 presents the energy and delay tradeoff and ED reduction achieved by device and architecture cooptimization. Note that CVt = IVth in Homo-Vth and Homo-Vth +G.4. we show the necessity of device and architecture co-optimization.The impact of utilization rate and interconnect architecture will be discussed in subsections 2. Baseline FPGA device/arch parameter values Vdd 0. we compute the energy.4.4. Finally.

05 17.32 1. Our experiments show that device tuning has a much greater impact on delay and energy than architecture tuning does.6 and Table 2.58 3.99 31. set D4 is the dominant hyper-architectures under Vdd =1.37 1.92 Table 2.11nJ.9V and Vth = 0.05 3.99ns to 23.10 1.25v. 10. From the figure.35 0.46 23.9 0.40 14.50 24.31 10.0 1.73 Min delay (ns) 13.9 0.95 Max energy (nJ) 2.6.46ns.0 1.25V.9 1.e.82nJ to 2. it is important to evaluate both device and architecture instead of evaluating architecture only. the energy ranges from 1.30 0.01 9. and delay ranges from 13. energy for different architectures ranges from 1.0V and Vth =0.24ns.1 1. In the first method.3V .1 Vth (V) 0.05v.1 1.17nJ to 1. if we increase Vth by 0.60 21.24 36.01 11. device tuning has not been reported in the literature.97 2.11 5.35 0.66 12.68 18.17 0. which is demonstrated in Figure 2. 80.30 0.6: Power and delay ranges for different device settings.17 1.60 13.03 4.30 17. Architecture evaluation has been studied in previous research [131. Set D1 D2 D3 D4 D5 D6 D7 D8 D9 Vdd (V) 0.6 For example. we observe that a change on the device leads to a more significant change in energy and delay than architecture change does.90 15.32nJ and the delay ranges from 18.43 15. 123. For example.0 1.9V and Vth =0.40 19..01 Max delay (ns) 16.25 0. 6 Dominant 34 .30 0. However. i.95 1.25 0.06 17.11 1. However.82 1. for device setting Vdd = 0.30 1.68ns to 16. There are three methods to perform device and architecture optimization. 89. 101].6 is the dominant hyper-architectures for a given device setting.25 0.and architecture separately and simultaneously. Therefore. Vdd = 0. Each set of data points in Figure 2.35 Min energy (nJ) 1. we first optimize device using one architecture then optimize the arhyper-architectures for a given device setting are the hyper-architectures that are not inferior to any other hyper-architectures under such device setting.

Both methods arch-device and device-arch cannot guarantee the optimal solution.18 16 Energy per cycle (nJ) 14 12 10 8 6 4 2 0 10 D4 D1 D5 15 20 D9 D2 25 Delay (ns) 30 35 40 D6 D8 D7 D1 Vdd 0. For both arch-device and device-arch.9V .6: hyper-architectures under different device settings.3V }.0V . and Vth = 0.25 D2 Vdd 0.3V }. Vth = 0. we optimize architecture and device simultaneously and call it simultaneous method. chitecture under the optimized device setting and call it device-arch method.9 Vt 0.25 D8 Vdd 1. In the second method.9V .35 D7 Vdd 1.0 Vt 0.9 Vt 0. In the third method.3V } which is also the global optimal.1 Vt 0. K = 7. K = 4. Vdd = 0.25 D5 Vdd 1.35 D3 Figure 2. In our experiment.30 D3 Vdd 0.30 D6 Vdd 1. both arch-device and device-arch achieve the same local optimal hyper-architecture whole solution space: {N = 6.0 Vt 0.1 Vt 0.7 compares the min-ED hyper-architectures for Homo-Vth found by three different methods. K = 7. Table 2. Vdd = 0. If we start search from some other hyper- 35 . we find that there are two local optimal hyper-architectures in the we start search from the baseline case {N = 8.0 Vt 0.9V .3V } and {N = 10.30 D9 Vdd 1. we first optimize the architecture within one device setting then optimize the device setting according to the optimized architecture and call it arch-device method. Vdd = 0.1 Vt 0. K = 4. In this particular example.35 D4 Vdd 1.9 Vt 0. Vdd = 1. Vth = 0. Vth = 0. {N = 6.

3 27. we first compare the impact of device tuning and architecture tuning.9 0.1s).30 0. As expected.38 1.1 (-13.3 Energy and Delay Tradeoff In this section.1 3. The minimum energy hyper-architectures have LUT size of 4 for all classes. We also see that the runtime of arch-device and device-arch is shorter than that of simultaneous method.4. From Table 2. This is similar to the previous evaluation result [101. For the classes Homo-Vth +G and Hetero-Vth +G with powergating.4 34.30 IVth (V) 0. it is still worthwhile performing simultaneous device and architecture optimization. and 1X PMOS for a connection buffer.3%) runtime (s) 5.8 17.38 1.7) (6.4) Energy (nJ) 1. we observe that simultaneous method can reduce ED by 13.architectures.0 CVt (V) 0. we assume the following fixed sleep transistor size: 210X PMOS for a logic block.30 0.30 0.8 summarizes the minimum delay and minimum energy hyper-architectures for each class. then present min-energy and min-delay hyper-architectures. arch-device and device-arch may achieve other local optimum but there is no guarantee of the global optimum.8 19.3 24. the min-delay hyper-architectures have the highest Vdd and 36 . Table 2. 2.30 0. Vdd (V) arch-device device-arch simultaneous 0.30 (N. 89].1 Table 2. due to the time efficiency of Ptrace.7) (10. and finally discuss the energy and delay tradeoff. The minimum delay hyper-architectures have cluster size of 6 and LUT size of 7 which are the same for all classes. In order to obtain the global optimal solution.4.37 Delay (ns) 19.9 1.3% compared to arch-device and device-arch. However. we have to perform simultaneous method. We then discuss the sleep transistor tuning in Section 2.4. Therefore. k) (6. the runtime of simultaneous method is also small (only 34.7: Min-ED hyper-architecture of optimizing device and architecture separately and simultaneously. 10X PMOS for a switch buffer.7.5 ED (nJ· ns) 27.

The energy is calculated as E = delay · power.8 0.8 Minimum energy hyper-architecture CVt (V) 0. we only need to pick the dominant hyper-architecture whose delay is closest to 15ns. From the figure. In order to achieve the best energy and delay tradeoff. we present the dominant hyper-architectures of Homo-Vth and Hetero-Vth in Figure 2.1 1. For example.86 9. with the dominant hyper-architecture figure.lowest Vth .7) (N.3 sible frequency (1/critical path delay). Table 2.7) (6. higher performance hyper-architectures consume more energy.35 0.6 30.22 15.5 24.1 1. This is because leakage power is significantly reduced by power-gating and therefore more detailed Vth tuning such as heterogeneous-Vth has a smaller impact.22 31. This is because we assume that each circuit works at its highest posWhen Vth is too high.30 0.K) E (nJ) 31.4) (N.4) (10.7) (6.98 D (ns) 8.20 0. However.35 0.7 (a) and those of Homo-Vth +G and HeteroVth +G in Figure 2.8 0.8 0.20 IVth (V) 0.20 0.30 0.942 0.2.20 0.4) (12.20 0.20 0. This is due to the fact that the connection box with power gating has smaller delay than that without power gating as discussed in Section 2. HyperArch Class Homo-Vth Hetero-Vth Homo-Vth +G Hetero-Vth +G Vdd (V) 1.920 0.20 0.549 D (ns) 59. if we want to find the minimum energy solution for Homo-Vth with delay limit 15ns.7) (6.25 (10. we find that the energy difference between Homo-Vth +G and Hetero-Vth+G is smaller than that between Homo-Vth and HeteroVth . the delay is so large that the energy per clock cycle increases.20 (6.550 0.45 9.35 0.86 8.1 1. we can obtain the minimum energy solution for a given performance range.2 43.30 0.1 Minimum delay hyper-architecture CVt (V) 0. we find the hyper-architectures with the minimum energy delay product (in short min-ED) in Table 2. the min-energy hyper-architectures have the lowest Vdd but not the highest Vth .2.30 IVth (V) 0.7 (b).98 15.8: Minimum delay and minimum energy hyper-architectures. Compared to 37 . We also see that dominant hyper-architectures of Homo-Vth +G and Hetero-Vth +G have smaller delay than that of Homo-Vth and Hetero-Vth .9.4) (12.45 Vdd (V) 0.K) E (nJ) 0. To illustrate the energy and delay tradeoff. Moreover. Usually.

1 16. the larger the sleep transistor size.20 1.25 IVth (V) 0.8 CVt (V) 0. we assume fixed sleep transistor sizes for Homo-Vth and Hetero-Vth +G and discuss hyper-architecture evaluation to minimize ED without considering area.43 126. Therefore.74 0.9 0.30 0.5 18. This is because leakage power is greatly reduced when power gating is applied. compared to the min-ED hyper-architectures without power gating.4) (10.the baseline.00 81.9 0.Class Baseline Homo-Vth Hetero-Vth Homo-Vth +G Hetero-Vth +G Vdd (V) 0. we perform device and architecture co-optimization to achieve the best ED and area tradeoff.0 (-18.90 79.8%) 11.30 0. However. Homo-Vth +G.4) (12.30 ED (nJ· ns) 28. k) (8. We also see that.9: Comparison between baseline and min-ED hyper-architecture in Homo-Vth .5 17.3%) Area % 100. and Hetero-Vth +G.19 Table 2.52 127. in this section.5% and 18.30 0.25 0. ED reduction is 14.20 (N. we 38 .10 17. area is important for FPGA design. therefore a lower Vth can improve performance without much penalty on leakage. 2. Usually.4.0 0.4%) 11.5%) 23.2 (-60. Power-gating using sleep transistors may change delay and area tradeoff for FPGA architecture.30 0.65 D (ns) 23. Although dual-Vth may change the layout area due to extra diffusion well area.30 0.25 0.4) (12.4) E(nJ) 1.27 0. hyper-architecture.4 ED and Area Tradeoff In the previous sections. The similar ED reduction for power gating classes is due to the fact that leakage power is greatly reduced by power-gating and therefore the more detailed Vth tuning such as heterogeneous-Vth has a small impact on power reduction as discussed before.4) (10. such change depends on technology and is often very small.37 1.9 (-57. the min-ED hyper-architectures with power gating has a lower Vth . the smaller the delay is. If power gating is applied.4% for Homo-Vth and Hetero-Vth . ED can be reduced by about 60% for both HomoVth +G and Hetero-Vth +G.9 1. In this section. respectively.25 0.1 (-14.2 24. Hetero-Vth .

K = 4 is the second area efficient architecture which reduces area by 18. we assume the 210X PMOS for the sleep transistor with negligible area overhead.3% compared to the baseline.3%.470 0.4) (12. This is because that when no power gating is applied.3%.450 Area 1 0.3 37. the min-ED hyper-architecture is exactly the min-AED hyper-architecture. ED. we find out the hyper-architectures with minimum product of energy. we see that.9 0. the min-AED hyperarchitecture of Homo-Vth reduces ED by 14. For Homo-Vth +G and Hetero-Vth+G with power gating. Figure 2.25 0.626 0. because no power gating is applied.816 0.7% and area by 18. we assume that the area depends only on architecture and but not device setting.9 1. Our experimental result shows that N = 12.10.432 0.4) (10. Vdd (V) Baseline Homo-Vth Hetero-Vth Homo-Vth +G Hetero-Vth +G 0. and area (in short AED).9 58. Area.9 CVt (V) 0.8 presents the chip-level ED and area tradeoff. Moreover. To achieve the best ED-area tradeoff.25 IVth (V) 0. K = 4).4) (12.10: Minimum ED-area product hyper-architectures for different classes. as illustrated in Figure 2.25 0. Because only one sleep transistor is used for one logic block.817 0.30 0. Therefore. K = 4 is most area efficient architecture.6% compared to the baseline (N = 8.assume that Vdd and Vth change does not affect area.413 AED reduction % 30.853 0.4) S 2 2 ED 1 0. delay.697 0.K) (8.0 0. We prune inferior solutions with both ED and area larger than any alternative solutions.4 56. the more leakage power is consumed.25 0. For Homo-Vth and Hetero-Vth. for the classes without power gating.30 0. we observe that a 1X PMOS as the sleep our evluation. Compared to the baseline. 7 In 39 .30 0.918 AED 0. this architecture reduces area by 23.767 0.30 0. The architecture N = 10.20 (N. the min-ED hyper-architectures use less area than other hyper-architectures.7 and the min-AED hyper-architecture of Hetero-Vth reduces ED by 18.9 0.918 0. From the figure.4) (12. the sleep transistor size has to be considered. which are summarized in Table 2. the larger the area. we do not need to tune sleep transistor size.7 Table 2.2(d). and ED-area product are normalized with respect to the baseline.4% and area by 23.30 0.

and 10X PMOS for a routing switch.2(c)) provides good performance.5.10 summarizes the minimum AED prod- uct hyper-architectures. We compare the min-ED and min-AED hyper-architectures under three different utilization rates: 0.11 and 2.8.2%.12. may affect delay greatly. for the minimum AED hyper-architecture of the power gating classes consumes less area compared to the baseline.5 Impact of Utilization Rate In the previous part of this section. we conclude that utilization rate in practice does not affect hyperarchitecture evaluation.5). We guess that the reason why the utilization rate does not affect hyper-architecture evaluation is as follows: In our models the performance and discussed before.3. however.0% and area by 8. Therefore.4.0% and area by 8. we assume fixed utilization rate (0. 8 As 40 . Compared to the baseline case. we will further discuss the impact of utilization rate on FPGA architecture evaluation. We see that the min-ED and min-AED hyper-architectures under different utilization rates are the same. We consider four sleep transistor sizes: 2X. 7X. and any further increase of the sleep transistor size cannot improve the performance much. In this subsection. 0. This is due to the fact that tuning device setting and architecture offers a bigger solution space to explore in chip level. we find that device and architecture co-optimization can reduce ED and area simultaneously even when power gating is applied. the minimum AED product hyper-architecture of Homo-Vth +G reduces ED by 53.2% and the minimum AED product hyper-architecture in Hetero-Vth+G reduces ED by 55. 2.8 in Tables 2. The area increase introduced by power gating leads to only about 15% area increase. From the Figure 2.transistor for one switch in connection box (see the circuit in Figure 2. 4X. Therefore. 8 Table 2. The sleep transistors for the switches in the routing box. the most area efficient architectures (N = 12 and K = 4) reduces area by about 20% compared to the baseline. and 0.

20 (10. in classes Homo-Vth +G and Hetero-Vth +G.11.4) 18.4) 24. for given benchmark circuits.4) 2 0.432 0.4) . length-4 interconnect wires). leakage power is the dominant power component and changes inversely proportional to the utilization rate.25 0.9 0.30 (10.4) 11. k) ED 1.9 0.25 0.0 0.9 0.4) 2 0.30 (10.4) .20 (10. 41 .4) 12.9 0.30 0.25 0.4) 23.25 (12. k) ED 1.4) .25 0. k) S AED 1.4) 11.6 Impact of Interconnect Structure In the previous discussion.5 0.8 0.1 0.3 Vdd CVth IVth (N.12: Min-AED hyper-architecture under different utilization rates.4) 11. Note: AED is normalized with respect to the baseline.30 (10.8 Vdd CVth IVth (N.8 0.9 0.30 0.30 (12.0.617 0.4) 2 0.5 Vdd CVth IVth (N.0.0.11: Min-ED hyper-architecture under different utilization rates.698 0.4) 19.25 0.4.25 0.3 Vdd CVth IVth (N.9 0.4) 2 0.324 Table 2.dynamic power of the circuit under different utilization rate is the same. k) S AED 1. In this section.25 (12.8 0.649 0.25 (12.0 0.30 0.25 0. the leakage power is determined by the active elements and the leakage power of the unused circuit elements is negligible (as we can see from Table 2. Therefore the relationship between the leakage power of different hyper-architectures does not change with respect to the change of the utilization rate.30 (10.697 0.0 Homo-Vth +G 0.4) 31.25 (12.8 0.9 0. Utilization rate Homo-Vth Hetero-Vth 0. When no power gating is applied. 2.25 0.20 (10.0 0.25 0.4 0.4) .30 (12.30 (12.9 0.8 Vdd CVth IVth (N. k) ED 1.30 (12.20 (10. Utilization rate Homo-Vth Hetero-Vth 0. ED under different utilization rate is very close).30 (10.0.4) 2 0.1 0. 9 and the only difference is leakage power.0 0.3 Hetero-Vth +G 0.30 (10. For the classes with power gating.626 0.8 0.0.0 0.25 0.9 0.4) 32.9 0.0.9 0.2 0.533 0.30 0.6 Table 2.4) 2 0.4) .20 (10. we will compare two routing 9 We assume the placement and route is the same for different utilization rate.330 Hetero-Vth +G 0.8 0.9 0.510 Homo-Vth +G 0.413 0.25 0.25 (12.30 0.0 0.25 0.30 0.25 0.0 0.25 (12.4) 11.637 0.25 0.0 0. k) S AED 1.4) 11.20 (10.8 0.5 Vdd CVth IVth (N.25 0.30 (12.25 0.4) .9 0.25 0.25 0.e.30 (12. we always assume all the wire segments span four logic blocks (i.

25 16.structures to show the impact of routing structure on delay and power.7) (6.20 0.13 9. 2.20 0.1 1.45 9.86 8. which reduces delay but increases power.20 0.20 0.14.1 1.1 CVth (V) 0. respectively. The minenergy and min-ED hyper-architectures for the two routing structure have the same the device setting and LUT size.20 0.15 compare the min delay.20 0.1 1.20 0.7) (6.20 (6.K) Energy (nJ) 31.20 0.56 35.7) (6. The mixed-interconnect is optimal for minimizing delay [47]. 42 . and 2.13.22 15.20 0.13 mixed-interconnect Table 2.1 1.56 16.20 0. min energy and min ED hyper-architectures between the two routing structures. We also see that uniforminterconnect has lower energy but higher delay than mixed-interconnect.20 IVth (V) 0. This is due to the fact that mixed-interconnect applies length-8 wire-segments and therefore a buffer with a larger size is used.20 0.7) (6.42 8.86 9.98 15.7) (6.13: Min-delay hyper-architecture under different routing structures. We compare a structure with uniform length-4 wire segments (in short uniform-interconnect) and a routing structure with 60% length-4 wire segments and 40% length-8 wire segments (in short mixed-interconnect).42 9.7) (6. Tables 2. We see that the min-delay hyper-architectures for both interconnect structures are same.7) (6.22 31.1 1. but mixed-interconnect tends to use smaller cluster size as the interconnect delay is reduced in mixed-interconnect.20 0.7) (N.45 8. uniform-interconnect hyper-architecture Class Homo-Vth Hetero-Vth Homo-Vth +G Hetero-Vth +G Homo-Vth Hetero-Vth Homo-Vth +G Hetero-Vth +G Vdd (V) 1.1 1.25 Delay (ns) 8.98 35.1 1.20 0.20 0.

3 55.8 CVth (V) 0.30 0.K) Energy (nJ) 0.20 (10.9 11.25 0.39 0.25 0.14: Min-energy hyper-architecture under different routing structures.8 0.40 16.35 0.74 Delay (ns) 17.25 0.1 23.550 0.052 0.6 55.35 0.30 0.30 0.052 1.4) (10.30 0.20 0.27 0.25 0.6 13.30 17.35 0.920 0.549 1.2 43.4) (12.10 16.30 0.942 0.35 0.6 29.610 0.3 22.4) (10.25 0.50 18.35 0.4) (12.7 25.25 0.K) Energy (nJ) 1.30 0.65 1.51 1.9 0.30 0.4) (N.15: Min-ED hyper-architecture under different routing structures.4) (8.8 CVth (V) 0.8 0.25 IVth (V) 0.74 0.30 0.10 17.35 0.4) (N.2 25.8 0.8 1.82 0.4) (8.4) (12.00 18.4) (12.0 11.25 0.30 0.9 0.0 0. 43 .25 (10.4) (12.uniform-interconnect hyper-architecture Class Homo-Vth Hetero-Vth Homo-Vth +G Hetero-Vth +G Homo-Vth Hetero-Vth Homo-Vth +G Hetero-Vth +G Vdd (V) 0.3 mixed-interconnect Table 2.25 0.30 IVth (V) 0.8 0.40 16.9 0.602 Delay (ns) 59.30 0.5 24.9 0.8 0.4) (8.6 30.0 0.30 0.8 0.35 0.4) (8.67 ED (nJ·ns) 24.4) (12. uniform interconnect hyper-architecture Class Homo-Vth Hetero-Vth Homo-Vth +G Hetero-Vth +G Homo-Vth Hetero-Vth Homo-Vth +G Hetero-Vth +G Vdd (V) 1.8 0.4 12.37 1.4) (8.8 mixed-interconnect Table 2.4) (8.30 0.

32 Energy per cycle (nJ) 24 16 8 0 8
18 Energy per cycle (nJ) 15 12 9 6 3 0 8

Min delay hyper-arch for Homo-Vt Min-ED hyper-arch for Homo-Vt Min-ED Min energy hyper-arch hyper-arch for Hetero-Vt for Hetero-Vt

Homo-Vt Hetero-Vt

Min energy hyper-arch for Homo-Vt

21

34 (a) Delay (ns)

47

60

Min delay hyperarch for HomoVt+G and HeteroMin-ED hyper-arch for Homo-Vt+G Min-ED hyper-arch for Hetero-

Homo-Vt+G Hetero-Vt+G
Min energy hyper-arch for 5 Homo-Vt Min energy hyper-arch for Hetero-Vt+G

6 7

13

18 23 Delay (ns) (b)

28

33

Figure 2.7: Dominant hyper-architectures. Homo-Vth +G and Hetero-Vth +G.

(a) Homo-Vth and Hetero-Vth ; (b)

44

1.3 Normalized area
Homo-Vt Hetero-Vt Homo-Vt+G Hetero-Vt+G
Min AED hyper-arch for Homo-Vt

1.1

Min AED hyper-arch for Homo-Vt+G

0.9
Min AED hyper-arch for Hetero-Vt+G Min AED hyper-arch for Hetero-Vt

0.7 0.3 0.4 0.5 0.6 0.7 0.8 Normalized ED 0.9 1.0

Figure 2.8: ED and area trade-off. ED and area are normalized with respect to the baseline (N = 8, K = 4, Vdd = 0.9V , and Vth = 0.3V ).

45

2.5 Conclusions
In this chapter, we have developed trace-based power and performance evaluation (Ptrace) for FPGA. The one-time use of placement, routing and cycle-accurate power simulation is applied to collect the timing and power trace for a given benchmark set and a given FPGA architecture. The trace can then be re-used to calculate timing and power via closed-form formulae for different device parameters and technology scaling. Ptrace is much faster, yet accurate compared to the conventional evaluation based on placement and routing by VPR [18] followed by cycle-accurate simulation (Psim) [89]. Using the trace-based estimation, we have performed device (Vdd , Vth and sleep transistor size if power gating is applied) and architecture (cluster and LUT size) cooptimizations for low power FPGAs. We assume the ITRS [66] 70nm technology and use the following baseline for comparison: Cluster size of 8 and LUT size of 4 as in the Xilinx Virtex-II [170], Vdd of 0.9v suggested by ITRS, Vth of 0.3v which is optimized for min-ED (i.e., minimum energy delay product) with respect to the above architecture and Vdd . Compared to the baseline case, simultaneous optimization of FPGA architecture and device reduces the min-ED by 14.7% and area by 18.3% for FPGA using homogeneous-Vth for the logic blocks and interconnects without power gating. Optimizing Vth separately (i.e., heterogeneous-Vth ) for the logic block and interconnect reduces min-ED by 18.4% and area by 23.3%. Furthermore, power gating unused logic and interconnect reduces the min-ED by up to 55.0% and reduces area by 8.2%. Compared to the classes without power gating, the min-ED hyper-architectures of the classes with power gating have lower Vth . This is due to the fact that, when power gating is applied, leakage power is significantly reduced and therefore a lower Vth can be applied to reduce delay. We observe that min-ED hyper-architectures and min-AED hyper-architectures for

46

different utilization rate (between 30% and 80%) are the same. Therefore, the utilization rate in practice does not affect the device and architecture co-optimization result. Moreover, we also test two different routing structures, one with uniform length 4 wire segments (uniform-interconnect) and the other with 60% length 4 wire segments and 40% length 8 wire segments (mixed-interconnect). We observe that the min-ED, minenergy and min-delay hyper-architectures under two different interconnect structures are similar, except that mix-interconnect tends to use slightly smaller cluster size due to the reduced interconnect delay.

47

We then derive analytical yield models considering both leakage and timing variations. We also observe that LUT size of 4 gives the highest leakage yield.4%. we extend Ptrace to handle process variation. and use such models to evaluate FPGA device and architecture considering process variations. and gate oxide thickness. LUT size of 7 gives the highest timing yield. timing yield by 5. but LUT size of 5 achieves the maximum leakage and timing combined yield.5X and 1. Experiments show that the mean and standard deviation computed by our models are within 3% from those computed by Monte Carlo simulation. which uses the VPR architecture and device setting from ITRS roadmap. device and architecture tuning improves leakage yield by 10.7%. 48 . Compared to the baseline.4%. threshold voltage.CHAPTER 3 FPGA Device and Architecture Co-Optimization Considering Process Variation Process variations in nanometer technologies are becoming an important issue for cutting-edge FPGAs with a multi-million gate capacity. We first develop closed-form models of chip level FPGA leakage and timing variations considering both die-to-die and within-die variations in effective channel length. respectively.9X. In this chapter. We also observe that the leakage and delay variations can be up to 5. and leakage and timing combined yield by 9.

53] and [127] purposed a methodology to improve timing yield. However. Variability in effective channel length. Timing yield estimation was discussed in [57. we have introduced a time efficient trace-based power and delay estimator for FPGA circuit. and gate oxide thickness incurs uncertainties in both chip performance and power consumption. However. For example. 115. 152] studied the parametric yield considering both leakage and timing variations. therefore process variation may have smaller impact on FPGAs than on ASICs. such estimator is only for deterministic values and does not consider process variation. 136] has studied the impact of process variation and presented various statistical optimization methods. 100. 49 . threshold voltage. [129. Modern VLSI designs see a large impact from process variation as devices scale down to nanometer technologies. In addition to meeting the performance constraint under timing variation. all these studies only focus on ASIC and do not consider FPGA. However. Power minimization by gate sizing and threshold voltage assignment under timing yield constrains were studied in [114]. Statistical timing analysis considering path correlation was studied in [120] [86. Yet the parametric yield for FPGAs still should be studied. Some recent works [43. 178] and [180. dice with excessively large leakage due to such a high variation have to be rejected to meet the given power budget. measured variation in chip-level leakage can be as high as 20X compared to the nominal value for high performance microprocessors [25].3. FPGA has a great deal of regularity. leakage power becomes a significant component of total power consumption and it is greatly affected by process variation. architecture and device co-optimization considering process variation has not be studied yet. As device scale down. 33] further introduced non-Gaussian variation and non-linear variation models. Recent work has studied parametric yield estimation for both timing and leakage power.1 Introduction In the previous chapter. 187.

2%. we obtain the baseline FPGA hyper-architecture which uses the VPR architecture model [18] and the same LUT size and Cluster size as the commercial FPGAs used by Xilinx Virtex-II [170].5X and 1.4%. We also observe that LUT size of 4 gives the highest leakage yield. We also observe that the leakage and delay variations can be up to 5. Experiments show that the mean and standard deviation computed by our models are within 3% from those copmuted by Monte Carlo simulation.2 derives closed-form models for leakage and delay variations and develops the leakage and timing yield models. but LUT size of 5 achieves the maximum leakage and timing combined yield.3 perform device and architecture evaluation to improve yield rate.we first develop closed-form formulas of chip level leakage and timing variations considering both die-to-die and within-die variations. 1 In this chapter. we extend the Ptrace discussed in Chapter 2 to estimate the power and delay variation of FPGAs.8%. For comparison. Section 3. Section 3. and device setting from ITRS roadmap[66]. and threshold voltage Vt . N refers to cluster size and K refers to LUT size 50 . timing yield by 12.4 concludes the chapter. The evaluation requires the exploration of the following dimensions: cluster size N. We defined the combinations of the above parameters as hyper-architecture. Based on such model. and leakage and timing combined yield by 9. LUT size K.In this chapter. we perform FPGA device and architecture evaluation considering process variations. respectively. 1 supply voltage Vdd . Compared to the baseline. LUT size of 7 gives the highest timing yield. device and architecture tuning improves leakage yield by 4. Finally. The rest of the chapter is organized as follows: Section 3. With the extended Ptrace.5X.

Vl . Therefore. In the rest of this chapter. According to [190]. spatial correlation is not significant. and Ll . and gate oxide thickness (Tox ).0.2. we consider the variation in gate channel length (Lgate ). V . The leakage current Ii of a type i circuit element is the sum of the sub-threshold 51 . and all variation sources are also independent. and Ii is the leakage current of a type i circuit element. Different sizes of interconnect switches and buffers are considered as different circuit elements. we assume each variation source is decomposed into global (inter-die) variation and local (intra-die) variation as follows: L = Lg + Ll V = Vg +Vl T = Tg + Tl (3. And we also assume that inter-die variation and intra-die variation are independent.e. In Ptrace. and Tg are inter-die variations.1) (3. the total leakage current of an FPGA chip is calculated as follows: Ichip = ∑ Nit · Ii i (3. in this chapter.3) where L. 3. configuration SRAM cell. Vg . and T are variations of Lgate . or flip-flop. Vl . buffer.4) where Nit is the number of FPGA circuit elements of resource type i.1 Leakage under Variation We extend the leakage model in FPGA power and delay estimation framework Ptrace [38] to consider variations. and Tl are intra-die variations. threshold voltage (Vth ). i. LUT.3.2 Delay and Leakage Variation Model In this chapter.. Vg . an interconnect switch.2) (3. and Tg ) and intra-die (Ll . we assume both inter-die (Lg . Lg . and Tox respectively. and Tl ) variations are normal random variables. Vth .

8) (3.Vl . Variation in Igate mainly sources from variation in Tox .10) To extend the leakage model (3. we model the total leakage current Ii of circuit element in resource type i as follows: Ii = In (i) · e fLi(L) · e fVi (V ) · e fTi(T ) (3. Tl ) and interdie (Lg . ci2 . lower threshold voltage.4) under variations. The negative sign in the exponent indicates that the transistors with shorter channel length. we find that it is sufficient to express these functions as simple linear functions as follows: fLi (L) = −ci1 · L fVi (V ) = −ci2 ·V fTi (T ) = −ci3 · T (3.9) where ci1 . Ii = In (i) · e−(ci1 Lg+ci2Vg +ci3 Tg ) · e−(ci1 Ll +ci2Vl +ci3 Tl ) (3.Vg .5) Variation in Isub mainly sources from variation in Lgate and Vth . V and T in to intra-die (Ll . Different from [129] which models sub-threshold leakage and gate leakage separately.6) as follows by decomposing L. Tg ) components. we assume that each element has unique intra-die variations but all elements in one die share the same inter-die 52 .7) (3. We rewrite (3. The dependency between these functions has been shown to be negligible in [129].6) where In (i) is the nominal value of the leakage current of type i circuit element and f is the function that represents the impact of each type of process variation on leakage.and gate leakages: Ii = Isub + Igate (3. From MASTAR4 model [68]. ci3 are fitting parameters obtained from MASTAR4 model. and smaller oxide thickness lead to higher leakage current.

we apply the Central Limit Theorem and use the sum of mean to approximate the total leakage current.12). Vl . The lognormal variable Xi shares the same random variables 53 .2 Leakage Yield From (3. we can write the expression of the chip-level leakage as the follows: Ichip ≈ = ∑ Nit · E[Ii|Lg.Vg. Tg] i ∑ Nit SiILg. The state-of-the-art FPGA chip usually has a large number of circuit elements. the sum of the lognormal variables Xi .13) Si = e i (ci1 σLl 2 +ci2 σVl 2 +ci3 σTl 2 )/2 ILg .14) (3.variations. for given inter-die variations.Tg (i) is the leakage as a function of inter-die variations. and Tl .12) (3.13).15) (3. as another lognormal random variable.11). therefore the relative random variance of the total leakage due to intra-die variation approaches zero. The total leakage is the sum of all lognormals. Similar to [129]. we model Ichip .0. 3. (3.Vg. ILg . ((ci1σLg )2 + (ci2 σVg )2 + (ci3 σTg )2 )) = Ni In (i) i (3.11) (3. The leakage distribution of a circuit element is a lognormal distribution.2. and (3.V . σLl .Tg (i) (3.Vg .Tg (i) = In (i)e−(ci1 Lg +ci2Vg +ci3 Tg ) where Si is the scale factor introduced by intra-die variability in L. we can see that the chip leakage current is a sum of log-normal random variables and it can be expressed as follows. After integration.Vg .16) Same as [129]. σVl and σTl are the variances of Ll . and T . respectively. Ichip = ∑ Xi Xi Ai ∼ Lognormal(log(Ai). Both inter-die and intra-die variations are modeled as normal random variables.

Ichip .Ichip = log[1 + (σIchip 2 /µIchip 2 )] log(Ichip) − µN.22) (3. Given a leakage limit Icut for Ichip . µIchip (ci1 σLg )2 (ci2 σVg )2 (ci3 σTg )2 = ∑ {exp[log(Ai ) + + + ]} 2 2 2 i (3.23) (3.20) (3.24) . log[µIchip 4 /(µIchip 2 + σIchip 2 )] 2 2 σN. The covariance is calculated as follows. σN. µIchip . and therefore these variables are dependent of each other.Ichip 1 √ CDF(Ichip) = [1 + er f ( )] 2 2σN. X j ) L i3 T (3.chip is a monotone increasing function. As the exponential function that relates the lognormal variable Ichip with the normal variable IN.Ichip = where er f (·) is the error function. σVg . is the sum of means of Xi and the variance of Ichip .Ichip 2 ) of the normal random variable corresponding to the lognormal Ichip .21) We then use the method from [129] to obtain the mean and variance (µN.25) (3. σIchip . and σTg .19) 2 (ci1 + c j2 ) σLg 2 + 2 (ci2 + c j2 )2 σVg 2 (ci3 + c j3 )2 σTg 2 + ] 2 2 (ci1 σLg )2 (ci2 σVg )2 (ci3 σTg )2 E[Xi ] = exp[log(Ai ) + + + ] 2 2 2 (3. COV (Xi . Considering the dependency.σLg .17) σ2chip = I ∑{exp[2log(Ai) + (ci1σLg )2 + (ci2 σVg )2 + (ci3 σTg )2] i i. j 2 ·[exp(cii 2 σ2 g + ci2 2 σVg + c2 σ2g ) − 1]} + ∑ 2COV (Xi.18) where the mean of Ichip . the CDF of Ichip can be expressed as follows using the standard expression for the CDF of a lognormal random variable.Ichip µN. we calculate the mean and variance of the new lognormal Ichip as follows. X j ) = E[Xi X j ] − E[Xi ]E[X j ] E[Xi X j ] = exp[log(Ai A j ) + (3. is the sum of variance of Xi and the covariance of each pair of Xi . Yleak = CDF(Icut ) × 100% 54 (3.

We assume that the intra-die channel length and threshold voltage variation of each element is independent from each other. Lg and Vg the same for all the circuit elements in the critical path. di (Lg .Vg ) ⊗ · · · ⊗ PDF(dn |Lg .Vg ) ⊗ PDF(d2 |Lg . PDF(D|Lg .Vg ) ⊗ · · · ⊗PDF(di |Lg . we can directly map 55 . we evenly sample a few (eleven in this chapter) points within range of [Lg − 3σLl .VG . we can obtain the PDF (probability density function) of the critical path delay for a given Lg and Vg as follows by convolution operation. Considering process variation.e. the path delay is calculated as follows: D = ∑ di i (3. Given Lg and Vg . Vl . Vg and intra-die variation Ll .Vg ) (3.Vg ) = PDF(d1 |Lg . the percentage of FPGA chips that is smaller than Icut . Below we extend the delay model in Ptrace to consider inter-die and intra-die variations of Lgate .. but its variation is primarily affected by Lgate and Vth variation [129].26) where di is the delay of the ith circuit element in the path. and Tox . i.the path delay is calculated as follows: D = ∑ di (Lg . As the probability of a channel length to the probability of a delay and obtain the delay distribution of a circuit element. Therefore. Lg + 3σLl ].Vg .27) For circuit element i in the path. Ll .28) the delay monotonically decreases when Lgate and Vth increase.gives the leakage yield rate Yleak (Icut |Lg ).3 Timing under Variation The performance depends on Lgate . Vth . In Ptrace. We then use circuit level delay model in [39] to obtain the delay for each circuit element with these variations. Ll .Vl ) i (3.0.2. 3.Vl ) is the delay considering inter-die variation Lg .

However.2.. we first need to calculate the leakage 56 . and threshold voltage variation Vg . i. In this chapter.e. CDF(Dcut |Lg ) gives the probability that the path delay is smaller Yper f (Dcut |Lg . The close-to-be critical paths may become critical considering variations and an FPGA chip that meets the performance requirement should have the delay of all paths no greater than D cut . We further consider intra-die variation of channel length in timing yield analysis.2. n cutoff delay (Dcut ). Given a than Dcut considering Lgate and Vth variations.30) PDF(Lg )PDF(Vg ) ·Yper f (Dcut |Lg .5 Leakage and Timing Combined Yield To analyze the yield of a lot.Vg ). We then integrate Yper f (Dcut |Lg . Yper f = Z Z +∞ −∞ nominal condition. by integrating PDF(D|Lg .3. We can obtain the CDF of delay.28) gives the PDF of the critical path delay D of the circuit.Vg ) over Lg and Vg to calculate the (3.Vg ) = ∏ CDFi (Dcut |Lg . we need to consider both leakage and delay limit.Vg ). CDF(D|Lg . In order to compute the leakage and delay combined yield.4 Timing Yield The timing yield is calculated on a bin-by-bin basis where each bin corresponds to a specific value Lg and Vg .Vg ) gives the probability that the delay of the ith longest path is no greater than Dcut .0. it is not sufficient to only analyze the original critical path in the absence of process variations.0. We assume that for a given Lg the delay of each path is independent and we can calculate the timing yield as follows.Vg ) i=1 (3.Vg ) · dLg dVg 3.29) where CDFi (Dcut |Lg . Given the inter-die channel length variation Lg . we only consider the ten longest paths. (3. n = 10 because the simulation result shows that the ten longest paths have already covered all the paths with a delay larger than 75% of the critical path delay under the performance yield Yper f as follows.

Vg ) × 100% (3.Vg .2.Vg ] ¯ ¯ ¯ ¯ E[Xi|Lg.Vg ) i3 T i.Vg = log[µ4chip|L I g .Vg ]E[X j|Lg. the CDF of leakage current for given Lg and Vg .Vg 2 = log[1 + (σ2chip|L I CDF(Ichip|Lg .Vg ) + (ci3 σTg )2 ] 2 (3.Vg log(Ichip) − µN.Vg ∼ Lognormal(Ai|Lg.Vg = CDF(Icut |Lg .Vg .39) (3. Yleak|Lg .Vg ) + i (ci3 σTg )2 ]} 2 (3.Ichip With the CDF of Ileak|Lg .yield for a given inter-die variation of gate channel length Lg and threshold voltage Vg .Vg · X j|Lg.Vg ] = exp[log(Ai|Lg.Ichip|Lg . j i (3.Vg . Ileak|Lg .37) (ci3 + c j3 )2 σTg 2 ] 2 Finally.Vg ) + (ci3σTg )2] ¯ ¯ ·[exp(c2 σ2g ) − 1]} + ∑ 2COV (Xi|Lg.Ichip|Lg . is calculated as: µN.Vg = Ai · exp(−ci1 Lg − ci2Vg ) (3.Vg + σ2chip|L I )] g .Vg . the covariance between Xi|Lg.31) = ¯ ∑{exp[2log(Ai|Lg. it is easy to compute the leakage yield for given Lg and Vg . (ci3 σTg )2 ) ¯ Similar to Xi ’s.41) 57 .Vg )] (3.Vg 1 √ [1 + er f ( )] 2 2σN.Vg ¯ = ∑ {exp[log(Ai|Lg.Vg ’s are computed as: ¯ ¯ ¯ ¯ ¯ ¯ COV (Xi|Lg.Vg .Vg ] − E[Xi|Lg.38) (3.Vg ) + ¯ E[Xi ] = exp[log(Ai|Lg.35) (3. X j|Lg.34) ¯ ¯ Xi|Lg.40) g . Yleak|Lg . X j|Lg.Vg · X j|Lg.Vg ) = E[Xi|Lg.Vg 2 /µ2chip|L I g .Vg /(µ2chip|L I σN.33) (3.0.Vg · A j|Lg. µIchip|Lg .32) where ¯ Ai|Lg. Similar to Section 3.Vg .36) (3.2.Ichip|Lg .Vg ) = g .Vg σ2chip|L I g . we first calculate the mean and variance of leakage current for given Lg and Vg .

2 Yper f % 74.29).42) 3.0.Vg ) · dLg dVg (3. 3. We consider ITRS 70nm technology and 58 . From the table.9) Our model Yper f % 72. Therefore.6 Verification of Yield Model In this section.1) Table 3.3 (-0. we verify our yield model by comparing it to 10.000 sample MonteCarlo simulation. In our experiment. we use our yield model to perform device and architecture evaluation for leakage and delay yield optimization.3 Hyper-Architecture Evaluation Considering Process Variation In this section. we assume 70nm ITRS technology and use device and architecture same as the baseline hyper-architecture as in Section 2.4 Yleak % 89. The cut of leakage power is 2X of the nominal value and the cut of delay is 1.4.1 (-2.6 (-1.31). we see that our yield model is within 3% error compared to the Monte-Carlo simulation. we assume that the leakage yield and timing yield are independent of each other for given L g and Vg .5) Ycom % 62. the leakage variability only depends on the variability of random variable Tg as shown in (3.1: Verification of yield model.MC sim Yleak % 90.1 compares the yield estimated from our model and that from the Monte-Carlo simulation. and the timing variability only depends on the variability of random variable Ll and Vl as shown in (3.Vg )Yper f (Dcut |Lg .2. The yield considering the imposed leakage and timing limit can be calculated as follows. Ycom = Z Z +∞ −∞ PDF(Lg )PDF(Vg )Yleak (Icut |Lg . Table 3.1 Ycom % 64. Because for given a specific inter-die variation of channel length Lg and threshold voltage variation Vg . We also assume that all 20 MCNC benchmarks are put into one FPGA chip.1X of nominal value.

6.3. the range of leakage power is up to 5.5X and the range of delay variation is up to 1.1 0.7 8 K 6. each dot is sample of MonteCarlo simulation. we assume that all the global routing track (W ) span 4 logic blocks with all buffer switch box.0% 1.4 0. In the figure. we assume that the 3σ value of the interdie variation is 10% of nominal value.9% Distribution Normal Normal Normal Table 3.9 3σl 3.5% Vdd (V) 0. we can see with process variation. change Vdd and Vth around such setting.5.8 1. we consider LUT size K from 4 to 7.2: Experimental setting.2 0.9% 1.10. Figure 3. For all variation sources.4.1 illustrates the leakage and delay variation from Monte-Carlo simulation for the baseline hyper-arch. In the experiment. we use the same baseline hyper-architecture as in Section 2.1 Impact of Process Variation In this section.12 4 W 4 4 Vth (V) 0.N Evaluation range Baseline Source Lgate Vth Tox 4.8.0% 2. For comparison. From the figure. we assume that all the variation sources has normal distribution. and the 3σ value of the intra-die variation is 5% of nominal value. For interconnect. 3. For process variation. we assume that all 20 MCNC benchmarks are put into one chip and obtain the longest 10 critical paths from them.3 3σg 5. we analyze the impact of process variation on FPGA leakage power and delay. and cluster size N from 6 to 12. For architecture. 59 .5X.5% 2.

That is. In the rest of this section. 3. This is because the larger Lgate gives better leakage yield. Homo-Vth and Hetero-Vth as defined in in Section 2.3.85 0.1 1.5 3 2. we perform device and architecture evaluation to optimize the delay and leakage power yield for two FPGA classes.1X of the nominal value. the optimum Lgate for logic blocks and interconnect is the same. 60 .95 1 1.4.1: Leakage and delay of baseline architecture hyper-arch.1.5 4. Moreover.05 1.2. From the table.1 Leakage Yield We first optimize leakage yield.2 Figure 3. we see that Homo-Lgate and Hetero-Lgate give the same maximum leakage yield result. as shown in Figure 3. Table 3. we assume that the cutoff leakage power is 2X of the nominal value and the cutoff delay is 1. CVth refers to the Vth of logic blocks and IVth refers to the Vth of interconnect.5 Normalized Leakage Power 4 3.3 illustrates the hyper-archs with maximum leakage yield for both two classes.2 Impact of Device and Architecture Tuning In this section.5 0 0. 3.15 1.5 1 0. Notice that for Homo-Vth .5X Cutoff power 1.8 0. In the table. CVth = IVth .5X Normalized Delay 0.5 2 1.3.9 Cutoff delay 5. both logic block and interconnect using largest Lgate (33nm) results in optimum leakage yield.

Similarly.9 Table 3.2.9 Yleak % 89.9(+12. Table 3.1 Yleak % 89.4 Vdd (V) 0. That is the smaller Vth gives better timing yield.2 Vdd (V) 0.8) Yper f % 85.4) 97.1 57. we only analyze the delay of the largest MCNC benchmark clma.4 illustrates the optimum timing yield hyper-arch.2 Timing Yield Secondly.4% compared to the baseline.2V) results in optimum timing yield. we can also find that K = 4 gives the optimum leakage yield. Table 3.4: Optimum Timing yield hyper-architecture.9 0. therefore both logic block and interconnect using smallest Vth (0. which improve leakage yield by 4.5 79.1 1.1 69. The reason is similar to the leakage yield.3 79. Similar to leakage yield analysis.3 0.3 0. N Baseline Homo-Vth Hetero-Vth 8 6 6 K 4 7 7 CVth (V) 0.3.3(+4.4 0. For timing yield analysis.2 IVth (V) 0.5 97. we discuss about leakage and timing combined yield.5 67.6 Yper f % 85.3 0. 3.2.1 Table 3. the timing yield is often studied using selected test circuit such as ring oscillator for ASIC in the literature.2 0.9 57.3: Optimum leakage yield hyper-architecture.8) 94.5 63.3 Ycom % 67. From the table.N Baseline Homo-Vth Hetero-Vth 8 10 10 K 4 4 4 CVth (V) 0.3.5 94.4) Ycom % 67. we see that 61 . From the table.9 1. we analyze the timing yield. we can also find that the optimum timing yield hyper-arch has K = 7 and improve timing yield by 12.9 0.5 illustrates the optimum leakage and timing combined yield hyper-arch.3(+4. both Homo-Lgate and Hetero-Lgate achieve the same hyper-arch for optimum timing yield.8% compared to the baseline.9(+12.4 0. 3.4 IVth (V) 0.3 0.2 0.3 Leakage and Timing Combined Yield Finally.1 69.

and the FPGA chip level leakage and delay variations can be up to 5.N Baseline Homo-Vth Hetero-Vth 8 10 8 K 4 5 5 CVth (V) 0.5 84. the optimum hyper-architecture (combination of architecture and device parameters) improves leakage and timing combined yield by 9.3 0.2 (+8.5X.1) 76.3 0.2%.2%. the optimum hyper-arch for Homo-Vth improves combined yield by 8. we have developed efficient models for chip-level leakage variation and system timing variation in FPGAs. We have shown that architecture and device tuning has a significant impact on FPGA parametric yield rate. Homo-Vth and Hetero-Vth give different result for combined yield optimization. 3. In addition.1 75.0 1.2) Table 3. 62 .6 Yper f % 85. 7 has the highest timing yield. LUT size 4 has the highest leakage yield.4 Conclusions In this chapter. K = 5 gives the best combined yield.3 0.3 0. the largest (or smallest) Vth not necessary gives the optimum combined yield.0 Yleak % 89. We also find that Hetero-Vth gives better result than Homo-Vth .5: Optimum leakage and timing combined yield hyper-architecture.1 Ycom % 67. This is because Hetero-Vth provides larger search space. Compared to the baseline. Experiments show that our models are within 3% from Monte Carlo simulation. Unlike the leakage yield and timing yield analysis. respectively.9 1. but LUT size 5 achieves the maximum combined leakage and timing yield.5 87. This is because both leakage and timing should be considered to optimize combined yield.1% and the optimum hyper-arch for Hetero-Vth improves the combined yield by 9.6 91.3 Vdd (V) 0. compared to the baseline.5 86.3 (+9. But for both Homo-Vth and Hetero-Vth.5X and 1.35 IVth (V) 0.

1X energy difference. we further improve the tracebased framework (Ptrace2) to enable concurrent process and FPGA architecture codevelopment. In addition. 3.5% to 5. We show that applying heterogeneous gate lengths to logic and interconnect may lead to 1.5% and reduce die to die leakage significantly. the user can tune eight parameters for bulk CMOS processes and obtain the chip level performance and power distribution and soft error rate (SER) considering process variations and device aging. this framework is suitable for early stage process and FPGA architecture co-development.CHAPTER 4 FPGA Concurrent Development of Process and Architecture Considering Process Variation and Reliability In Chapter 2 and Chapter 3. we develop a trace-based framework (Ptrace) to perform device and architecture co-optimization. and device burn-in to reach the point could reduce the performance change over 10 years from 8.3X delay difference. and reduce standard deviation of leakage variation by 87%. In this chapter. It is also flexible as process parameters can be customized for different FPGA elements and no SPICE models and simulations are needed for these elements. we also study the interaction between process 63 . We further show that the device aging has a knee point over time. Applying the new trace-based framework (Ptrace2). The Ptrace2 framework is efficient as it is based on closed-form formulas. Therefore. This offers a large room for power and delay tradeoff. The chapter further presents a few examples to utilize the framework.

device aging and SER. Moreover. input/output capacitance for FPGA basic circuit elements. In this chapter we further extend the trace-based architecture framework (Ptrace) discussed in the previous section to consider process parameters directly. The new process variation analysis can handle non-Gaussian variation sources which is ignored by Ptrace. then delay. We call the resulting framework as ptrace2. However. and finally the chip level performance and power based on trace similar to that in Chapter 2 and Chapter 3. It may further provide inputs for process tuning. Such evaluation may be used to select circuits and architectures less sensitive to process changes or process variations. The assumption that FPGA architecture development starts only after the process technology is stable may be valid in the past. We observe that device aging reduces standard deviation of leakage by 65% over 10 years while it has relatively small impact on delay variation. we proposed a time efficient FPGA power and delay evaluator Ptrace. leakage power. Ptrace assumes that the processes are mature and stable device models are available so that SPICE simulations can be carried out to obtain circuit level power and delay. given that the FPGA is a large volume product and an FPGA company may convince a foundry to tune process when there are large enough benefits. for performance and power.1 Introduction In Chapter 2 and Chapter 3. As illustrated in Figure 4.1. ptrace2 calculates first electrical characteristics of advanced CMOS transistors. 4. therefore we can conduct FPGA circuit and architecture evaluation when only the first order process parameters are available. we also find that neither device aging due to NBTI and HCI nor process variation have significant impact on SER. 64 .variation. but it no longer holds as we begin to develop process and architecture concurrently in order to shorten the time to market.

8 studies the interaction between reliability and process variation. and to study the interaction between process variation. 166]) and permanent soft error rate (SER) [60]. Furthermore.9 concludes this chapter. device aging and SER.3. we incorporate analytical calculations for two types of FPGA reliability. 149] [160. and Section 4.5 introduce the circuitand chip-level power and delay models and chip-level variation models. device aging (Negative-Bias-Temperature-Instability. 65 . respectively. 23] and Hot-Carrier-Injection. Section 4.7 analyzes the reliability for FPGA. NBTI [11. The rest of this chapter is organized as follows: Section 4. Section 4. we illustrate how to use this framework to improve power and performance by process and FPGA concurrent development. With ptrace2. 4. again in the from device to chip fashion.Input Transistor Model PTrace2 Output Chip Level Process Variation Trace Reliability Figure 4.6 presents device tuning for power and delay optimization.2 presents our implementation of ITRS device model. HCI [34. 149.4 and 4.1: Trace-based estimation flow. and Section 4. Sections 4. Finally.

Figure 4.4. Nbulk .e. channel width (W ). The saturated threshold voltage (Vth ) is then calculated considering the short channel effect (SCE) and drain induced bar1 The device model in MASTAR4 tool is a predicted model for advanced processes and there is no correspondent SPICE models for these technologies. which directly depends on various major technological parameters including gate length (L gate ). MASTAR4 is a computing tool which calculates the electrical characteristics of advance CMOS transistors 1 . the gate leakage current when the channel is on/off (Igon /Igo f f ). T and Vdd . as well as some intermediate outputs such as effective mobility u e f f which are used for the final output calculation. Io f f ). double gate and silicon on isolator (SOI) while we only consider traditional bulk transistor in this work. It can handle different technologies including planar bulk. channel doping density (Nbulk ). Ion depends on all the main inputs. i. The output electrical characteristics include the on current (Ion ).. 148]. W . the electrical (or effective) gate length (Lelec ) and the electrical gate oxide thickness (Tox elec ). The calculation in MASTAR4 is based on analytical equations. are first calculated. Temperature (T ) and the supply voltage (Vdd ) also have a significant impact on these output characteristics. We briefly review the calculation flow in MASTAR4 below. There are also other inputs related to mobility. We follow MASTAR4 tool and only tune the major process inputs while all the other inputs can be tuned as well if needed. we implement the bulk transistor device model from ITRS 2005 MASTAR4 (Model for Assessment of cmoS T echnologies And Roadmaps) tool [67. The feature intermediate parameters. the gate capacitance (Cg ) and the drain/source diffusion capacitance (Cdi f f ). extension depth (X jext ) and the total series resistance for source and drain (Racc ). gate oxide thickness (Tox ). 66 . Lgate .2 Device Models In this chapter. 68. velocity and gate stack etc. Racc . X jext .2 shows the calculation flow for the on current Ion . Tox . the sub-threshold leakage current (off current.

Vgt = Vgs −Vth Idsat0 = 0. Ec is the critical electrical field and d is an intermediate parameter. Ion is then calculated as. the 67 . rier lowering (DIBL). The effective mobility (ue f f ) is also calculated. The derivation of these intermediate parameters are provided in the ITRS device model [68].Inputs: Lgate Tox Nbulk Xjext W Racc T Vdd Intermediate variables: Lelec Tox_elec Intermediate variables: Vth Intermediate variables: u_eff Output: Ion Figure 4. Ion .3 shows the calculation flow for the sub-threshold leakage current Io f f .5ue f f ·Cox elec · Ion = 1+ 2Racc Idsat0 Vgt (4.1) W ·Vgt ·Vdsat Lelec elec (4. Io f f depends on all the main inputs except for Racc .2) (4. main intermediate parameters and all other parameters. calculation.2: The flow for on current.3) Idsat0 R − Vgt +Lacc Idsat0 Ec (1+d) where Idsat0 is the on current without considering the series resistance (Racc ). Similar to Ion calculation. With the main input parameters. Figure 4. Vgs is the difference between the gate and source voltage levels. Vdsat is the velocity saturation voltage.

intermediate feature parameters Lelec and Tox elec . we can then calculate Io f f as.4) where k is the Boltzmann constant.4: The flow for gate leakage currents Igon and Igo f f calculation. Vth and other parameters. q is electric unit. the sub-threshold slope (Slope) is also calculated as. In order to calculate Io f f . respectively. respectively. Igon and Igo f f depend on Lgate .4 shows the calculation flow for the gate leakage currents when the channel is on (Igon ) and off (Igo f f ).5uA. W 68 . T is the temperature.3: The flow for sub-threshold leakage current. Io f f = Ith where Ith = 0. With Slope.5) Outputs: Igon Igoff Figure 4. ε s and εox are the dielectric permittivities of silicon and oxide. Figure 4. Slope = kT εs · Tox elec ln1 0(1 + ) q εox · Tdep (4. Io f f . Inputs: Lgate Tox Nbulk Xjext W Racc T Vdd W −Vth /Slope e Lelec (4. X jext . Tox . and the saturated threshold voltage Vth are first calculated. calculation.Inputs: Lgate Tox Nbulk Xjext W Racc T Vdd Intermediate variables: Lelec Tox_elec Intermediate variables: Vth Slope Output: Ioff Figure 4.

7) (4.02V −2 . respectively.6) where a1 = 1.Inputs: Lgate Tox Nbulk Xjext W Racc T Vdd Intermediate variables: Tox_elec Outputs: Cg Cdiff Figure 4. Tox . a2 = −4.5: The flow for transistor gate and diffusion capacitances Cg and Cdi f f calculation. Cg and Cdi f f . T and Vdd .05V −1 and a4 = 1/(1. In order to calculate Igon and Igo f f . fringing capacitance Ctotal 69 . The intermediate feature parameter Tox elec is first calculated for capacitance calculation. W . (4. a3 = 13. Jg = a1 ea2Vg +a3V g e−a4 Tox 2 (4.10) (4. Igon and Igo f f are then calculated as. Cg and Cdi f f are calculated as. and Vdd . Igon = Lgate ·W · Jg ΔL = Lgate − Lelec Igo f f = 0.5ΔL ·W · Jg where ΔL is the difference between Lgate and Lelec . With Jg . we first calculate the gate leakage current density Jg as.5 shows the calculation flow for transistor gate and drain/source capacitances.11) f ringing where the Cg calculation considers gate oxide capacitance.8) (4. The capacitances depend on Lgate .9) Figure 4.17E − 10m). Cg = ( Cdi f f Lgate +Ctotal Tox elec = Coverlap +C junc εox f ringing +Coverlap )W (4.44E5A/cm2 .

Vgs < Vth .e. gate and body voltage levels as following. i. i. and saturation regions.12) 70 . the current in velocity saturation region. i. linear.and overlap capacitance Coverlap . We calculate the inverter delay based on numerical integration through the transistor IV curves. Ids is calculated as. only the calculation for the maximum on current Ion . We use the equations from [74] to calculate the drainsource current Ids in different working regions.2. In MASTAR4 tool. these FPGA circuits can be further decomposed into net-lists containing the most basic circuit elements. is provided. Ids = Ith · (W /Lelec ) · e((Vgs −Vth )/Slope) (4. In the sub-threshold region. which is treated as part of the loading capacitance of an inverter.1 Delay Model The pass transistor and multiplexer tree are modeled as a lumped capacitance. and Cdi f f calculation considers Coverlap and junction capacitance C junc .e. We therefore mainly discuss the power and delay for these basic circuit elements as below. 4. Ids is calculated as a function of drain. pass transistor gate. LUT. sub-threshold. Essentially.3 Circuit-Level Delay and Power In this section. source. 4. SRAM. i. We consider buffers.3.e.e. inverters and pass transistors. we present basic circuit models for delay and power characteristics using the device model in Section 4. flip-flop (FF) and multiplexer [18] for the FPGA circuits. where the multiplexer is implemented as an NMOS pass transistor tree.

Ids is equal to the on current Ion calculated by (4.3). i. Ids is calculated as. Cox elec and Ec are both calculated using equations in the ITRS MARSTAR4 model. i. i. Vgs is the gate source voltage difference. 71 . where the velocity saturation voltage Vdsat is calculated using equations in the MASTAR4 tool. Vgs > Vth and Vds < Vgs −Vth . Vgs > Vth and Vds > Vdsat .5uA.13) where Cox elec is the gate oxide capacitance.e. Vgs > Vth and Vds > Vgs −Vth .14) In velocity saturation region.where Ith is 0. Ids is calculated as.6: Voltage and current in an inverter during transition. Ids = W Cox elec · ue f f · · (Vgs −Vth −Vds /2) ·Vds Lelec 1 +Vds /(Ec · Lelec ) (4. Using these equations. Vds is the drain source voltage difference and Ec is the critical electrical field. In linear region.e. Ids = 0. short channel effect (SCE) and drain-induced barrier lowering (DIBL).5 · W Cox elec 2 · ue f f · ·Vds Lelec 1 +Vds /(Ec · Lelec ) (4. Figure 4. Vs = Vdd Vb = Vdd Ids(P) Vg = Vin Ids(N) Vd = Vout C Vb = GND Vs = GND Figure 4. In the saturation region.7(c) shows the IV curves for an NMOS transistor under ITRS 2005 HP 32nm technology node. Vth in the above equations is calculated considering body-bias.e.

06 0.52v Vgs=0.02 0.04 0. (b) Input and output voltages. (c) The NMOS IV curves and the transition of transient current Ids with small and large input slopes. and short circuit current with a small input slope.4 0.6 Vds (v) (c) 0.61v Vgs=0.12 drain current (uA) Ids (uA/um) 12 1200 1000 800 600 400 200 0 0 ITRS05 HP NMOS (32nm) IV curves Vgs=1.2 Figure 4.8 0.81v Vgs=0. its loading capacitance C and the input voltage waveform. Cload =1fF Vin Vout (NMOS) Vout (Inv) PMOS current 14 drain current (uA) 12 10 8 6 4 2 0 0.6).2 0.08 time (ns) 0.Given an inverter (see Figure 4.1 Vin Vout (NMOS) Vout (Inv) PMOS current 14 10 8 6 4 2 0 0. Cload =1fF 1.7 (a) shows the transient voltage transition of a 1X inverter with 1 f F as loading capacitance C at ITRS 2005 HP 32nm technology node.2 0 0 0.42v small input slope large input slope transient transient II 0.6 0. Figure 4.8 0. respectively. The input voltage transition is from 0 to Vdd . 1.8 1 1.5Vdd . The input slew rate is defined as the transition time from 72 .15) The delay is the time difference between when the output and input voltages reach 0.04 ITRS05 HP (32nm) 1X INV. We can then perform numerical integration to obtain the output voltage waveform based on the following equation. Note that the input slew rate is automatically considered in this delay model. we can then calculate the inverter delay. The output voltage waveform can be propagated for delay calculation of the next stage.4 (a) (b) 0.7: The pull-down delay of a 1X inverter at ITRS 2005 HP 32nm technology nodes.2 1 voltage (v) 0.03 ITRS05 HP (32nm) 1X INV. (a) Input and output voltages.01 0.02 time (ns) 0.71v Vgs=0.2 0 0 0. We calculate the pull-down and pull-up delay for an inverter and then obtain the worst case delay as the inverter delay. and short circuit current with a large input slope.2 1 voltage (v) 0.91v Vgs=0. we can obtain the transient PMOS and NMOS drain-source current ids (P) and ids (N). dV (out) = (ids (P) − ids(N)) · dt/C (4.1v Vgs=1.4 0. At time t.6 0.0v Vgs=0.

1Vdd (or 0. we first breakdown them into the netlist of inverters and pass transistors. the PMOS drain-source transient current is also shown in this figure. Therefore. a similar trend for pull-up transition. 5X large as the that in Figure 4. ITRS MASTAR4 tool uses CVdd /Ion to predict transistor delay.7(c) shows the NMOS transient drain-source current ids transitions with the two input slopes.9Vdd ) to 0.7. is observed.g.e. With a larger input slope.7 (b) shows the transient voltage transition of this inverter with a larger input slope. i.e. the input voltage slope may be large due to a large loading capacitance. The short circuit current. the transition is slower with a larger delay. our delay model is more accurate than that in MASTAR4.7 (a). i. For other FPGA circuit elements with loading capacitance. the output voltage waveform without considering short circuit current differs the waveform considering short circuit current which results in a 20% delay difference. input voltage transition is from Vdd down to 0. LUTs. the output voltage waveform (Vout (NMOS) in the figure) is almost overlapped the the waveform (Vout (Inv) in the figure) considering the short circuit current.e. 73 .1Vdd ). our delay model is flexible and can be extended to other complex gates easily. The delay of each circuit element is then calculated using the above method. While not presented here. e. the input slope has a significant impact on the inverter delay. In addition. FFs and buffers.0. the short circuit current is necessary to be considered for delay calculation. e. Figure 4. Figure 4. As shown in Figure 4. i. In FPGAs. The short circuit current becomes more significant due to a larger input slope and can no longer be ignored.e. Since the FPGA circuit element usually has a large loading capacitance and a large input slope. If this short circuit current is ignored. i.g.9Vdd (or 0. the LUT delay is decomposed into the delay from LUT input passing through input buffers to the multiplexer control inputs and the delay from the SRAM passing through the multiplexer and the output buffer.

16) (4. which can be is on or off.4. Pleak (inv). we first discuss leakage power including sub-threshold and gate leakage power and then discuss dynamic power including switching power and short circuit power for basic circuit elements. For a multiplexer containing N NMOS pass transistors. Igon (P). We calculate the average leakage power for an inverter. we calculate the gate (Cg (inv)) and self-loading caeither Vdd · Igon (N) or Vdd · Igo f f (N) depending on if the channel of this pass transistor 74 . Igo f f (P). respectively. The two inverters in an SRAM cell are identical with one input as Vdd and one input as 0. we can calculate the leakage power for other FPGA circuit elements. Pleak (inv) = Vdd · (Io f f (inv) + Igate (inv)) Io f f (inv) = (Io f f (P) + Io f f (N))/2 Igon (P) + Igo f f (P) + Igon (N) + Igo f f (N) Igate (inv) = 2 (4. the average leakage calculation of an inverter can be applied to SRAM leakage calculation.17) (4.3. We consider switching and short circuit power for inverter dynamic power consumption.2 Power Model In this section. For an NMOS pass transistor. Igon (N) and Igo f f (N) are the PMOS and NMOS gate leakage currents when the channel is on and off. For switching power. Based on the leakage model for inverters and pass transistors/muxes.18) where Io f f (P) and Io f f (N) are the PMOS and NMOS sub-threshold leakage currents. as. only gate leakage power is consumed. respectively. which depends on the input logic value. An inverter consumes both sub-threshold and gate leakage power. Therefore. The pass transistor in a used/unused routing switch implemented by a tri-state buffer is on/off. N/2 of them are on while the other half are off.

3. wire segment. then calculate the chip level power and delay. interconnect switch 75 . Cdi f f (P) and Cdi f f (N) are the PMOS and NMOS diffusion capacitances. we apply the similar idea as the trace-based estimation in Chapter 2.e. 4. respectively.. i.3. The transient short circuit current has been calculated during delay calculation.3.3.1 Delay Model Similar to Section 2.2. 4. LUT. The input capacitance. We collect the tract information in the same way as in Section 2.pacitances (Cdi f f ) for an inverter as. We can simply perform a numerical integration on the transient short circuit current and obtain the short circuit energy per switch SC(Sl). internal capacitance and self-loading capacitance can then be easily extracted for each FPGA circuit element for switching power calculation purpose. Cg (inv) = Cg (P) +Cg (N) Cdi f f (inv) = Cdi f f (P) +Cdi f f (N) (4. where Sl is the input slew rate. As shown in Figure 4.19) (4. we compute the critical path delay by adding the delay of all circuit elements in the critical path. In order to perform chip level estimation.4 Chip-level Delay and Power With the delay and power model for basic circuits discussed in Section 4. respectively.7 (a) and (b).20) where Cg (P) and Cg (N) are the PMOS and NMOS gate capacitances. we introduce the chip level delay and power estimation model in this section.4. short circuit current depends on input slew rate.

Dynamic power is only consumed by the used FPGA resources.4. Vdd is the supply voltage.2.2. which can be computed by circuit level leakage power model as discussed in Section 4.3.3.2 Leakage Power Model In the same way as in Section 2.4.23) where f is the operating frequency. we use the Elmore delay model with resistance and capacitance suggested by ITRS [67]. For each circuit element.buffer. For an interconnect wire segment.21) where Di is the delay of the ith circuit element in the critical path.2.2. 4. 4. A circuit implemented on an FPGA cannot utilize all circuit elements.3 Dynamic Power Model Dynamic power includes switch power and short-circuit power. SW is the average switching activity. and MUX: Dcrit = ∑ Di i∈P (4.22) is the leakage power for type i circuit element. The 76 . which can be calculated by the circuit level delay model as discussed in Section 4. the chip leakage power is modeled as the sum of the leakage power of all circuit elements: Pleak = where Pis i∈T ∑ Nit Pis (4. and Ci is the total capacitance per switch of type i circuit elements.1.3. the loading capacitance can be calculated from the in/out capacitance of the basic circuit elements as introduced in Section 4.3. Our switch power model distinguishes different types of used FPGA circuit elements and applies the following expression: Psw = 1 2 · f ·Vdd · SW · ∑ Niu ·Ci 2 i∈T (4.

25) where Sli is the input slew rate of type i circuit element.1. In our model. which is difficult to obtained without detailed simulation using a large set of input vectors.3. then the input slew rate of a connection box buffer is the output slew rate of the global interconnect buffer. We also assume that the operating frequency to be 5/6 of the maximum frequency. the average switching activity SW can be set by the users.2 · Dcrit ). Here we model the chip short circuit power as: Psc = f · SW · ∑ i∈T 1≤ j≤Niu ∑ SCi (Sli. which is computed as the output slew rate of the element driving type i element.2. The switching activity is related to the input vector and logic. Because of the regularity of FPGAs.3.3. We do not set the operating frequency to maximum frequency since the actual critical path delay of some chips may increase due to process variation.24) where SCi (·) is the function to compute short circuit energy per switch for type i circuit element and Sli. the input slew rate for the same type of circuit elements is very close.2. a connection box buffer is driven by a global interconnect buffer. Such output slew rate can be computed from the basic circuit element delay model discussed in Section 4. 77 . For example. The short circuit power is related to the input slew rate. j is the input slew rate of the jth type i circuit element. The short circuit energy function SCi (·) can be obtained from the short circuit power model of basic circuit elements as discussed in Section 4. j ) (4. that is f = 1/(1.total capacitance per switch Ci can be computed from the in/out capacitance of a basic circuit element as discussed in Section 4. the chip short circuit power can be further simplified as: Psc = f · SW · ∑ SCi (Sli ) i∈T (4. Therefore.

X2 . X2 . . the short circuit power model introduced in this section is more accurate than that in Section 2.27) (4.3.5.1. 4.) Pt = FtL (X1.2..) (4. To evaluate the chip-level leakage power and delay considering process variation.3.28) where FtL and FtD are functions of process parameters to calculate leakage power and delay for type t circuit element. respectively.1.. delay and leakage power of a circuit element are functions of the underlying process parameters and they can be described as: Dt = FtD (X1. we calculate the circuit level short circuit power directly from transistor model. . and the process variation sources (such as gate change length and dopant density) are modeled as a random variable X . First we introduce the variation model as below.26) Notice that dynamic power model in this chapter is different from that in Section 2..4.5 Chip Level Delay and Power Variation Based on the chip-level power and delay model introduced in Section 4. while in this chapter. Therefore.1 Variation Models In general. 4. short circuit power is calculated as a ratio to switching power. we study the impact of process variation on power and delay in this section..1. we first 78 . In Section 2.With the switch power and short circuit power.2. the total dynamic power is computed as: Pdynamic = Psw + Psc (4.2.3.

Each circuit element has its own random variation Xr . and intra-die random variation for variation source X . We use the grid based model in [173] to model the spatial variation. which is calculated as: Dj = D ∑ Fji (X1 . some close-to-be critical path may become critical path. That is: X = Xg + Xs + Xr (4. Therefore. The sensitivity functions FiL and FiD can be obtained from the leakage power and delay model of basic circuit elements discussed in Section 4. and all circuit element of the whole chip share the same inter-die variation Xg . and intra-die random variation is considered.30) where C is the set of close-to-be critical paths and D j is the delay of the jth close-to-be critical path. X2 .29) where Xg .) ji ji (4. and Xr are independent from each other.. A chip is divided into several grids. intra-die spatial. all inter-die. in order to analyze delay variation.31) i∈P j 79 .2 Delay Variation With process variation.3.apply randomly generated process parameters to the device model as in Section 4.2 and then calculate the leakage power and delay for basic circuit elements as in Section 4. intra-die spatial.3. The correlation coefficient decreases when the distance between two grids increases.5. circuit delay is calculated as the maximum of a set of near critical paths: Dvar = max D j j∈C (4. 4. For each variation source X . Xs and Xr are the inter-die. all circuit elements in the same grid share the same spatial variation Xs and the spatial variation between different grids are correlated. Xs . We assume that Xg . ..

In this chapter. However. Because there is no closed-form formula for D the delay variation function Fji (·).3 Chip Leakage Power Variation The chip leakage power with process variation is modeled as the sum of leakage power of all circuit elements: Nit Pleak = j i∈T j=1 ∑ ∑ Pi j = i∈T j=1 ∑ ∑ FiL (X1 . 4. In order to solve such problem. when the circuit size becomes large. a large number of random variation samples need to be computed. . We first generate M samples of variation sources (X1 . and then compute the the path delay under such samples to ji ji obtain M samples of path delay. X2 . because the random variation is unique for each circuit element.. and X i j is the variation of the jth type i circuit element. Given the inter-die and within-die spatial variation. Similar to the chip delay variation model.. we use the Monte-Carlo simulation to obtain the chip delay distribution.. which is ignored in the previous Ptrace [167]. We first generate M samples of variation sources and then compute the M samples of chip leakage power.32) where Pi is the leakage power of the jth type i circuit element.D where Pj is the set of path elements of the jth path and Fji (·) is the delay variation function for the ith circuit element in Pj .. we simplify the chip leakage power variation model by applying central limit theorem similar to [25. we do not have closed-form formula to compute the chip delay distribution. 168].) Nit ij ij (4. The Monte-Carlo simulation allows us to consider nonGaussian variation sources and spatial correlation.5. Finally we can get the delay distribution from those M samples of path delay. the total leakage power for all 80 .) for all i. we use Monte-Carlo simulation to compute chip leakage distribution. . X2 . This makes the simulation very inefficient.

type i circuit elements in the lth grid can be calculated as:
i Pleak = l ∑ FiL (Xg1 + Xs1 + Xr1 , Xg2 + Xr2 , ...)
t Nli

il j

il j

(4.33)
il j

j=1

where Xg is the inter-die variation, Xsl is the spatial variation in the lth grid, Xr is
t the random variation for the jth type i circuit elements in the lth grid, Nli is the number t of type i circuit element in the lth grid. Usually, Nli is a very large number. Therefore,

by applying the central limit theorem, we have:
l l l ∑ FiL (Xg1 + Xs1 + Xr1 , Xg2 + Xr2 , ...) ≈ Nit · µi(Xg1 + Xs1, Xg2 + Xs2, ...)
t Nli

il j

il j

(4.34)

j=1

where
l l µil (Xg1 + Xs1 , Xg2 + Xs2 , ...) i i l l = E[FiL (Xg1 + Xr1 , Xg2 + Xr2 , ...)|Xg1, Xs1 , Xg2 , Xs2, ...] ij ij

(4.35)

In order to compute µil (·), we first generate Mr samples of random variations, and then compute the mean of leakage power under such samples. That is:
l l µil (Xg1 + Xs1 , Xg2 + Xs2 , ...)

=

1 Mr L l k l k ∑ F (Xg1 + Xs1 + Xr1, Xg2 + Xs1 + Xr2, ...) Mr k=1 j

(4.36)

where xk is the kth sample of random variation. Here, notice that usually Mr is much r
t smaller than Nli and does not depend on circuit size. Therefore such method is scalable

to large circuit size. Then the chip leakage power can be modeled as: Pleak =
t l l ∑ ∑ Nli · µil (Xg1 + Xs1, Xg2 + Xs2, ...) G

(4.37)

i∈T l=1

where G is the number of grids. With the above equation, we may generate M samples of inter-die and within-die spatial variations to compute the distribution of leakage power.

81

Notice that the variation model introduced in this section may handle non-Gaussian variation sources and spatial correlation, which are ignored in the variation model in Section 3.2.

4.6 Process Evaluation
Based on the chip level power and delay model presented in Section 4.4 and 4.5, we perform device tuning to minimize power and delay.

4.6.1 Power and Delay Optimization First we discuss the optimization for nominal value of power and delay. In this chapter, we consider two device parameters, supply voltage Vdd and the physical gate length Lgate . In our experiment, we start from the device setting of ITRS High Performance 32nm technology (HP32) [67] (with Vdd=1.1V, Lgate =32nm), and then change Vdd and Lgate around such setting. Moreover, we assume that the FPGA has cluster size N=6, LUT size K=7, and the global interconnect wire length W =4, which is optimized for high performance [91]. For dynamic power estimation, we assume that the switching activity SW =0.5. In addition, we also consider heterogeneous L gate for logic blocks (Lc ) and interconnects (Li ). Because SRAMs have little impact on FPGA delay, we assume that all SRAMs use high Vth (1.5X dopant density compared to the other circuit elements) and fix Lgate =32nm. During our evaluation, Vdd is tuned from 1.0V to 1.1V, Lgate (both Lc and Lc ) is tuned from 31nm to 33nm. The experimental setting is summarized in Table 5.1. In our experiment, we assumed that 20 MCNC benchmarks [175] are put in one FPGA chip. Therefore, the power is computed as the sum of all 20 benchmarks and delay is computed as the longest critical path delay of 20 benchmarks.

82

N 6

K 7

W 4

SW 0.5

Vdd (V) 1.0, 1.05, 1.1

Lc (nm) 31, 32, 33

Li (nm) 31, 32, 33

Table 4.1: Experimental setting. Figure 4.8 shows the energy per clock cycle and delay tradeoff between different device settings. From the figure, we find that there is a 3X energy span and a 1.3X delay span within our search space. We also find that that the knee point in the energy delay tradeoff figure is exactly HP32, which has high performance without paying too much power penalty. On the other hand, we also optimize device setting by minimizing energy and delay product (ED). We show the top 2 minimum ED device settings in the figure, which we refer to as min-ED1 and min-ED2.
14
All device Min-ED HP32

11

Energy (nJ)

8

3.1X Min-ED

5 HP32 1.3X 2 3.5 4.0 4.5 5.0

Delay (ns)

Figure 4.8: Energy and delay tradeoff. Table 4.2 shows the power (P), delay (D), energy per clock cycle (E) , and energy delay product (ED) for HP32 and min-ED device settings. From the table, we see that by applying device tuning, ED can be reduced by up to 29.4% (min-ED1) compared to HP32. We also find that it is worthwhile to apply heterogeneous Lgate . The optimum heterogeneous Lgate device setting (min-ED1) has lower ED (0.7nJ·ns lower) than the

83

optimum homogeneous Lgate device setting (min-ED2).
Device Vdd (V) HP32 min-ED1 min-ED2 1.1 1.0 1.0 Lc 32 32 33 Li 32 33 33 P (W) 1.19 0.77 0.74 D (ns) 3.90 4.55 4.73 E (nJ) 1.88 3.50 3.50 ED (nJ·ns) 22.6 15.9(-29.4%) 16.5(-26.8%)

(nm) (nm)

Table 4.2: Power, delay, energy, and ED comparison for different device settings.

4.6.2 Variation Analysis In this section, we apply the chip level variation model as introduced in Section 4.5 to analyze the chip level leakage and delay variation. In our experiment, we consider four types of variation sources, dopant density (Nbulk ), channel gate length (Lgate ), oxide thickness (Tox ), and mobility (µe f f ) [190]. Moreover, because for technology under 65nm, the spatially correlated intra-die variation is small compared to the intradie random variation [190], therefore, we ignore spatial variation in this chapter. For all variation sources, we assume that they follow normal distribution. Notice that our process variation model is based on Monte-Carlo simulation, therefore it can handle any non-Gaussian variation sources. The 3-sigma value of inter-die (3σ g ), intra-die spatial 3σs , and intra-die random (3σr ) variation for all types of variation sources are summarized in Table 4.3. In the table, the 3-sigma values are the represented as the percentage of nominal value. Moreover, for spatial variation, we assume that the correlation coefficient decreases to 1% when distance is 1cm (Dcorr = 1cm). In our experiment, we consider two different variation settings, low variation (LV) and high variation (HV). We generate M = 10, 000 samples for Monte-Carlo simulation and use Mr = 100 samples to compute the leakage mean µi (·). Figure 4.9 illustrates the probability density function (PDF) of delay and leakage power for min-ED1 and HP32 under variation setting LV. From the figure, we see that

84

85 .0% 3σr 2.8 5 5.0% 2. (b) Delay PDF.0% 3. min-ED1 has much less leakage variation compared to HP32 while its delay variation is only a little bit larger than HP32.0% 6. 4 16 14 12 10 8 6 4 2 0 0. From the table.0% 2. mean (µ).0% 3σg 6.5 1.0% 2.0% Table 4.6 3.0% Dcorr = 1cm HV 3σs 6. We also observe that the leakage roughly has a log-normal distribution and delay roughly has a normal distribution.5 3 2.8 1 1. This is because leakage power is almost exponential to the variation sources while delay is almost linear to the variation sources.0% 3.2% 4.2 4.0% 3.5 1 0.6 Delay (ns) (b) 4.0% 3.3: Experimental setting for process variation.9: Delay and leakage distribution. (a) Leakage PDF.2 min−ED1 HP32 Figure 4.2 0.4 4.0% 4.0% 1.0% 2.0% 4.5 2.5 2 1. we observe that compared to HP32 the min-ED device settings greatly reduce leakage variation (reduce standard deviation by up to 92%) with only a small increase of delay variation (increase standard deviation by up to 37%).4 0.4 summarizes the nominal value (nom).6 0.000 Source Distribution 3σg Nbulk Lgate Txo µe f f Normal Normal Normal Normal 4. we also see that HP32 is more sensitive to process variation than the min-ED device settings.5 2 0 3.4 1.0% Mr = 100 LV 3σs 4.2 Leakage (W) (a) 1.0% 6. and standard deviation (σ) of leakage power and delay for the min-ED device settings and HP32.8 min−ED1 HP32 3.6 1. From the table.M = 10. Table 4.2% 2.0% 3σr 4.0% 2.8 4 4.

90 4.38) (4.245(+32%) Table 4.1 Impact of Device Aging First.122 0. with input Vdd .164(+37%) µ 3. mean. we introduce the impact of device aging effect on FPGA power and delay.162(+33%) 0.4: Nominal value.39) 2ξ1te + ξ2C(t − t0 ) √ ) 2tox + Ct 86 .55 4.185 0. with input 0.2 HV σ 575 68 (-88%) 48 (-92%) 3.2 375.244(+32%) 0.58 4. i. In this section we study these two types of reliability for FPGA. reliability issues such as device aging with Vth degradation and soft error rate (SER) due to high-energy particle or cosmic rays become more and more significant.782 σ 0.99 4.95 4.e.7. respectively.e. and hot-carrier-injection (HCI) increases the threshold voltage (Vth )of NMOS transistor [34][149][166]. 4. and standard deviation comparison for leakage power and delay.779 LV Delay (ns) HV σ 0. and recovers during recovery mode.5 + (ΔVth0 )(1/2n) )2n ΔVth = −ΔVth0 (1 − (4. It has been shown that negative-bias-temperature-instability (NBTI) effect increases the threshold voltage of PMOS transistor [11][160][149][23]. PMOS Vth increases during stress mode.7 FPGA Reliability As technology scales down to nanometer. device aging and permanent soft error rate (SER).e. ΔVth = (Kv (t − t0 )0.38) and (4.Device nom µ HP32 min-ED1 min-ED2 854 328 315 982 359 328 Leakage (mW) LV σ 348 51 (-86%) 32 (-89%) µ 1120 398. Due to the effect of NBTI.39) are the equations for PMOS ΔVth over time in the stress and recovery modes.73 nom µ 3.55 4. i. 4. (4. i.

Hence HP32 has larger ΔVth for both NMOS and PMOS than min-ED1. we can obtain ΔVth due to device aging over time. We first calculate the effective gate length Lelec and oxide thickness Tox elec and then calculate Vth . supply voltage Vdd and nominal threshold voltage Vth . Interested readers please refer to [23] for more details. With these intermediate parameters. Qi . 87 . From the figure. Inputs: Lgate Tox Nbulk Xjext W Racc T Vdd Intermediate variables: Lelec Tox_elec Intermediate variables: Vth Output: HCI HBTI Figure 4. Figure 4.40) is the equation for NMOS ΔVth over time due to HCI.40) The key parameters that determine the degradation rate include the inversion charge. The Vth degradation due to device aging results in increase of critical path delay and decrease of leakage power. ΔVth = q K2 Cox Qi e Eo2 e Eox it − qλE φ m tn (4. Figure 4.10 shows the calculation flow for ΔVth due to NBTI and HCI. temperature. A higher Vdd or lower Vth leads to larger Vth increase. This is due to the fact that HP32 uses higher Vdd and lower Vth (due to smaller Lgate ) than min-ED1. Eox . electrical field. we do not include detailed explanation for each parameters.(4.11 illustrates the Vth increase (ΔV ) for PMOS and NMOS transistors of both min-ED1 and HP32 within 10 years. Due to the space limit. we see that HP32 is much more sensitive to NBTI and HCI than min-ED1.10: The calculation flow for ΔVth due to NBTI and HCI.

Therefore it has larger Vth degradation. we also compare the change percentage with and without burn-in (burn to 1 year). Table 4. Figure 4.45 40 35 30 Min−ED1 PMOS (NBTI) Min−ED1 NMOS (HCI) HP32 PMOS (NBTI) HP32 NMOS (HCI) Δ V (mV) 25 20 15 10 5 0 0 1 2 3 4 5 6 7 8 9 10 th Time (year) Figure 4. device burn-in [96] that has been used for microprocessors but not FPGAs yet could be used to reduce leakage and delay change over time after product shipment. and that with burn-in is computed as the percentage of the difference between the 10 year value and the 1 year value. min-ED device settings can not only reduce ED but also 88 . we find that the min-ED device settings is less sensitive to device aging compared to HP32. In the table. and value after 10 years of leakage power (L) and delay (D). The change percentage without burn-in is computed as the percentage of the difference between the 10 year value and current value. value after 1 year. This is because HP32 has higher Vdd and smaller Lgate compared to the min-ED device settings. It is interesting to see in the figure that the leakage and delay change is most significant at the first one year.12 illustrates the critical path delay increase and chip leakage power decrease within 10 years for min-ED1 and HP32. Therefore.5 illustrates the current value. From the figure and table. hence more sensitive to device aging.11: Vth increase caused by NBTI and HCI. We may see that compare to HP32.

76 (mW) 711 317 307 10 Year L (mW) 640 311 302 D (ns) 4.1 −200 0. increase aging reliability of FPGAs.2 Permanent Soft Error Rate (SER) We use the model in [60] to calculate the soft error rate (SER) for configuration SRAMs in our study.82 w/o burn-in L -25. we can greatly reduce the delay degradation.15 0.41).5% D Change w/ burn-in L -10.59 4.1% -5. The SER in SRAMs is permanent in FPGAs. burn-in reduces 10 year delay degradation from 8.0% -1.73 L 1 Year D (ns) 4.5% to 5.64 4.55 4.05 −250 0 2 4 6 Time (year) (a) 8 10 0 0 2 4 6 Time (year) (b) 8 10 1 Year min−ED1 HP32 ΔP leak −150 Figure 4.1% +1.41) 89 . (a) Leakage change.5: Delay and leakage change due to device aging.12: Impact of device aging.7. for HP32.5% +1.9% Table 4.01 4. The SER for one SRAM cell is shown in (4. by applying burn-in. Moreover. which cannot be recovered unless re-writing those affected configuration bits [14][17].0% +1. 0 0.23 4.5% +2. For example.5%.3 −50 (mW) 1 Year Min−ED1 −100 HP32 Δ Delay (ns) 0.2% -2. SER(SRAM) = F · A · K · e−Qcrit /Qs (4.6% D +5.25 0.35 0. 4.3% HP32 min-ED1 min-ED2 +8. (b) Delay change.9% -1.Device Current L (mW) 854 328 315 D (ns) 3.2 0.90 4.

A transistor with a larger Qcrit needs more energy to incur an error and has a smaller SER. different Vdd and Vth .6 compares the SER for HP32. A is the transistor drain area. i.e. We use the models in [60] with linear interpolation to calculate Qcrit and Qs under different device settings. and the two min-ED device settings. and then calculate the SER of an SRAM cell correspondingly using (4. a transistor with a larger Qs is more effective in collecting charge and has a larger SER. Qcrit and A are then calculated. SER = Num SRAM · SER(SRAM) where Num SRAM is the total number of SRAM cells. With all these intermediate parameters.13 shows the flow to calculate the SRAM SER. Qcrit is the critical charge to incur an error. Inputs: Lgate Tox Nbulk Xjext W Racc T Vdd (4. Qs . and slightly depends on Vth . K is a technology independent constant and is same for all device settings.41). We first calculate intermediate parameters including Lelec . On the other hand. Tox elec and Vth .where F is the neutron flux. we can obtain the SRAM SER. and Qs is the charge collection slope. For 90 . Figure 4. Table 4.13: The flow for SRAM SER calculation. Qcrit mainly depends on supply voltage Vdd and loading capacitance C.42) Intermediate variables: Lelec Tox_elec Intermediate variables: Vth Intermediate variables: Qs Qcrit A Output: SER Figure 4. Qs depends on channel doping density Nbulk and Vdd . The chip-level permanent SER can be calculated as.

4. But device aging only increase delay but has no significant impact on delay variation.6% higher than that of HP32. In the table. Therefor only Vdd may affect SER.8 Interaction between Process Variation and Reliability In this section. For min-ED1/2 with Vdd of 1.25 (+1.1 Impact of Device Aging on Power and Delay Variation First.6%) Table 4.the min-ED device settings.0v.7 summarizes the mean and standard deviation of leakage and delay for different device settings. we find that device aging has great impact of leakage power variation. SER is 1. The supply voltage Vdd in HP32 is higher.1 1. Qcrit .14 compares the PDF of delay and leakage between current value and the value after 10 years for the device setting min-ED1. SER is measured as number of failures in billion hours (FIT).45 368. It is worthwhile to use Min-ED device settings to obtain a significant ED reduction with virtually no impact on SER. we discuss the impact of device aging on process variation induced leakage and delay variation. is larger and may result in a smaller SER.6: Soft error rate comparison. hence the critical charge. 4. Device HP32 min-ED1/2 Vdd (V) 1. Figure 4. For HP32 device setting under low 91 . Table 4. we find that device aging actually reduce the leakage variation. From the table.0 # SRAMs 12438596 12438596 SER (FIT) 362. From the figure. Lgate for SRAM cells is still 32nm. The table compares both the current value and the value after 10 years in both low and high variation cases.8. we study the impact of device aging on process variation and the impact of device aging and process variation on SER.

8 5 5. loading capacitance.8% and standard deviation by 65.45 Leakage (W) (a) 0.2 4. However.6 Delay (ns) (b) 4.5 0. we also find that in high variation case.7%. This is because HP32 is more sensitive to device aging as discussed in Section 4. when variation scale becomes larger.4 0. device aging reduces mean by 28. device aging only increases standard deviation by 1. Moreover. variation case (LV). For HP32 device setting. the impact of device aging on delay variation is relatively small.13. That is. the impact of device aging on leakage variation is much larger than that in the regular variation case.2 Figure 4.4 4. we observe that compared to HP32 device setting. (b) Delay PDF comparison.2 Impacts of Device Aging and Process Variation on SER As shown in Figure 4. However. From the result. device aging has less impact on power and delay variation for min-ED settings.8.35 0. the SRAM SER depends on Vdd . 4.5 0 0. the impact of device aging on delay variation is similar to that of the regular variation case (min-ED1).55 0 4 4.5 3 2. (a) Leakage PDF comparison. Vth and channel doping density Nbulk while Vth and Nbulk may be affected by device aging and 92 .25 Current After 10 Years 3.7.14: Impact of device aging on delay and leakage PDF.2%. This is because leakage power is rough exponential to Vth while delay is roughly linear to Vth .5 1 10 5 0. when variation is high.3 0.5 Current After 10 Years 20 15 2 1. there is a larger impact of device aging on leakage variation.

245 Delay (ns) 10 years µ 4.1 (-13.162 0.3% 3σ=6% variation in Nbulk -0.79 (+0.2 (-32.7: Impact of device aging on process variation.18% to +0. the SRAM SER only increases by 0.55 4.9%) 4.78 3.9%. After 10 years.21%) Table 4.16%) 0. Nbulk and channel gate length Lgate become random variables.23 (+8.164 (+0.55 4. We perform the study on the impacts of device aging and process variation on SER as below. Table 4.86 (+0.09 (+2. A larger Vth may result in a slightly smaller critical charge Qcrit for an SRAM and therefore a larger SER.4 (-30. Lgate variation may affect Vth and therefore implicitly affect SER.6%) 4.4%) 158 (-72.122 0.914E-5 10 years +0.5%) 38. Nbulk variation directly affects charge collection slope Qs and therefore affects SER. 0 year or nominal SRAM SER (FIT) 2.4%) 4.Device current µ HP32 LV min-ED1 LV min-ED2 LV HP32 HV min-ED1 HV min-ED2 HV 982 359 328 1120 398.8%) 321 (-6.78 current σ 0.4%) 345.2 σ 348 51. a variation in Lgate from 31nm to 33 nm may result in a variation 93 . process variation.2%) 31.66 (+2.0 32.99 4. For ITRS HP 32nm technology.6%) 712 (-36.2%) 309 (-4.0 575 68 48 Leakage (mW) 10 years µ 678 (-28.0%) 4.21%) σ 0.1 (-44.121 (+1.3%) 332 (-12.85 0.9%) µ 3. For ITRS HP 32nm technology (see Figure 4.1%) 4.67%) 0.65 (+2.319 (+0.0%) 25 (-47.11.8: The impact of device aging and process variation on SER of one SRAM under ITRS HP 32nm technology.8 presents the impact of device aging on SER. With process variation. Vth may increase with device aging due to NBTI and HCI.245 (+0.4%) σ 118 (-65.2 375.318 0.7%) 22.95 4.28%) 0.162 (+0. the SRAM SER may increase by 0.3% due to degraded Vth . Even if Vth degrades by 100mv for both NMOS and PMOS. Vth may degrade (or increase) by around 10% (or 20mV) for PMOS due to NBTI and by around 25% (or 50mV) for NMOS due to HCI after 10 years.25%) 1.17% Table 4.54%) 0.164 1.83 (+1.

These two effects compensate with each other and therefore Nbulk does not have a significant impact on SER either. We further show that the device aging has a knee point over time.18% to 0. and reduce standard deviation of leakage variation by 87%. We show that applying heterogeneous gate lengths to logic and interconnect may lead to 1. a larger Nbulk may result in a higher Vth and a smaller critical charge Qcrit . Table 4. The framework is efficient as it is based on closed-form formulas. A variation of 3σ as 6% in Nbulk only results in a variation in SER from -0. therefore a larger SER. A few examples of using this framework have been presented.17%. the user can tune eight parameters for bulk CMOS processes and obtain the chip level performance and power distribution and soft error rate (SER) considering process variations and device aging. 3. we have developed a trace-based framework to enable concurrent process and FPGA architecture co-development. On the other hand. Therefore. It has been shown that Vth variation does not have a significant impact on SER. A larger Nbulk may result in a larger charge collection slope Qs and therefor a smaller SER. It is clear that neither device aging due to NBTI and HCI nor process variation have any significant impact on SER.in Vth from -25% to 21% [68]. this framework is suitable for early stage process and FPGA architecture co-development.8 presents the impact of Nbulk variation on SER.3X delay difference.5% to 5. Same as the MASTAR4 model used by the 2005 ITRS [67][68][148]. 4. and device burn-in to reach the point could reduce the performance change over 10 years from 8. This offers a large room for power and delay tradeoff.1X energy difference.9 Conclusions In this chapter.5% and reduce die 94 . It is also flexible as process parameters can be customized for different FPGA elements and no SPICE models and simulation are needed for these elements.

95 . In addition. we also study the interaction between process variation. We observe that device aging reduces standard deviation of leakage by 65% over 10 years while it has relatively small impact on delay variation. we also find that neither device aging due to NBTI and HCI nor process variation have significant impact on SER.to die leakage significantly. Moreover. device aging and SER.

Part II Statistical Timing Modeling and Analysis 96 .

All the atomic operations (max and sum) of our algorithms are performed by closed-form formulas. skewness. we first propose a new method to approximate the max operation of two non-Gaussian random variables through second-order polynomial fitting. we have discussed statistical modeling and optimization for FPGAs. quadratic model without crossing terms (semi-quadratic model). To solve this problem. we present present new non-Gaussian SSTA algorithms for three delay models: quadratic model. we focus on statistical static timing analysis (SSTA). hence they scale well for large designs. Experimental results show that compared to the Monte-Carlo simulation. In Part II of this dissertation. Lots of research works on SSTA have been published in recent years. respectively. 97 . 1%. In this chapter. and 1% error. and 95-percentile point within 1%. Applying the approximation. we study statistical modeling and analysis for ASICs. and linear model. our approach predicts the mean. most of the existing SSTA techniques have difficulty in handling the non-Gaussian variation distribution and nonlinear dependency of delay on variation sources. However. standard deviation.CHAPTER 5 Non-Gaussian Statistical Timing Analysis Using Second-Order Polynomial Fitting In Part I. 6%.

182. 163. The goal of block-based SSTA is to parameterize timing characteristics of the timing graph as a function of the underlying sources of process parameters which are modeled as random variables. 19.5. 120. lower power. designers can obtain the timing distribution and its sensitivity to various process parameters. the path-based [103. statistical static timing analysis (SSTA) is developed for full chip timing analysis under process variation. a higher-order delay model is thus needed [180. 5. 163] modeled the gate delay as linear functions of variation sources and assumed all the variation sources are mutually independent Gaussian random variables. 128] and the block-based [32. 2. 181. However. 13. The block-based SSTA was proposed to solve this problem. the linear delay model is no longer accurate [94]. two types of SSTA techniques are proposed. we have studied statistical optimization for FPGA circuits. To analyze the impact of process variation on circuit delay. 142. 178]. the path-based SSTA is not scalable to large circuits. ASICs are widely used in high performance or low power applications. In Part I. In order to capture the non-linear dependence of delay on the variation sources. [163] presented time efficient methods with closed-form formulas [46] for all atomic operations (max and sum). 22. 7. 98 . 75. 9. ASIC has higher performance. Compared to FPGA. Based on such assumption. 157. 49. where the impact of process variation is significant. In recent years. Therefore. 3. 8. 179. 76. By performing SSTA. Since the number of paths is exponential with respect to the circuit size.1 Introduction Process variation introduces significant uncertainty to circuit delay as discussed in Chapter 1. 53. and smaller area. 113. 183. The early SSTA methods [32. 40] SSTA. 86. 178. when the amount of variation becomes larger. 180. 72.

which is slow. [76. 19. to do so. 19. [19] handled the atomic operations by approximating the gate delay using a set of orthogonal polynomials. 179] considered both non-linear delay model and non-Gaussian variation sources. 40] started to take non-Gaussian variation sources into account. [76] proposed to compute the max operation by a regression based on Monte-Carlo simulation. In this chapter. and linear delay model. but it lacks the capability to handle the crossing term effects on timing. the via resistance has an asymmetric distribution [13]. 13. both had to resort to expensive multi-dimension numerical integration techniques. only the 99 . However. In our method. while dopant concentration is more suitably modeled as a Poisson distribution [142] than Gaussian.As more complicated or large-scale variation sources are taken into account. 179. we introduce a time efficient non-linear SSTA for arbitrary nonGaussian variation sources. we present our SSTA technique for three different delay models: quadratic delay model. the assumption of Gaussian variation sources is also not valid. The major contribution of this work is two-fold. [40] proposed to approximate the probability density function (PDF) of max results as a Fourier Series. quadratic delay model without crossing terms (semi-quadratic model). (1) We propose a new method to approximate the max of two non-Gaussian random variables as a second-order polynomial function based on least-square-error curve fitting. 142. For example. [13] computed the max operation through tightness probability while [179] applied moment matching to reconstruct the max. which needs to be constructed for different variation distributions. (2) Based on the new approximation of the max operation. For example. but the computation cost of these techniques is too high to be applicable for large designs. 13]. 180. Some of the most recent works on SSTA [76. but it was still based on a linear delay model. Experimental results shows that such approximation is much more accurate than the linear approximation based on tightness probability [163. For example. 13. [142] applied independent component analysis to de-correlate the non-Gaussian random variables.

6% error in skewness. For the quadratic delay model. we further apply this technique to handle both semi-quadratic and linear delay model in Section 5.1 Review and Preliminary According to [163].. Moreover. The rest of the chapter is organized as follows: Section 5. i. all the atomic operations are performed by closed-form formulas. 30% error in skewness. 5.first few moments are required for different distributions and no extra effort is needed. while [19] needs to obtain different sets of orthogonal polynomials for different variation distributions. Experimental results show that for the semi-quadratic delay model. and indicates less than 1% error in mean.2. A and B. our approach incurs less than 1% error in mean. the tightness probability is defined as the probability of A greater than B.2 Second-Order Polynomial Fitting of Max Operation 5. and 1% error in 95-percentile point.e. given two random variables. the computational complexity of our method is linear in both the number of variation sources and circuit size. experimental results are presented in Section 5.3 presents a novel SSTA algorithm for quadratic delay model with non-Gaussian variation sources.6.5. for the more accurate quadratic delay model. hence they are very time efficient. Section 5. 1% error in standard deviation. 1% error in standard deviation. and 1% error in 95-percentile point. 100 . respectively.2 introduces the approximation of the max operation using second-order polynomial fitting.7 concludes this chapter. our approach is 70 times faster than the non-linear SSTA in [40].4 and Section 5. For the linear and semi-quadratic delay model. the computational complexity is cubic (third-order) to the number of variation sources and linear to the circuit size. with the approximation of max. Moreover. TA = P{A > B} = P{A − B > 0}. Section 5.

However. 164]. particularly when the amount of variation increases and the number of non-Gaussian variation sources increases. we introduce a new fitting method to approximate the max operation. Such a linear approximation is efficient and reasonably accurate when both A and B are Gaussian. TA in [13] has to be computed via expensive multi-dimensional numerical integration. in this case the coefficients can be computed easily. we arrive at: max(A − B. Because (5. B). For example. 0) ≈ TA · (A − B) + c.2). when A and B are non-Gaussian random variables. Moreover. (5. we propose to use a second-order polynomial 101 . B) = max(A − B. B)] can be obtained by closed-form formulas when A and B are Gaussian [163].2. the PDF predicted by such linear approximation is very close the Monte Carlo simulation. because the max operation is an inherently non-linear function.Then the max operation is approximated as: max(A. To overcome these difficulties.1) where c is a term used to match the mean and variance of max(A. thus preventing its scalability to large designs. Instead of using the linear function. we can see that the max operation in [163] is in fact approximated by a linear function subject to certain conditions (such as matching the exact mean and variance). as both TA and E[max(A. (5.1) can be further written as max(A. As shown in [163.2 New Fitting Method for Max Operation In this section. B)] are hard to obtain.2) According to (5. we develop a more efficient and accurate approximation method in the next section. 5. Moreover. B) ≈ TA · A + (1 − TA ) · B + c. 0) + B. linear approximation would become less and less accurate. the tightness probability TA and E[max(A.

0). we compute Θ by matching the mean of the max operation while minimizing the square error (SE) between h(V. Θ) = θ2V 2 + θ1V + θ0 . 0) ≈ h(V. θ2 )T are three coefficients of the second-order polynomial h(v. P) and max(V. we may unnecessarily fitting the curve in a wide range without focusing on the high probability region. we find that assuming: ε = 3σv (5. Then the impact of ignoring the difference in the low provides a good approximation of max. Θ). and γv are the mean. i.7) outside the ±ε range is low. Mathematically. Intuitively. and skewness of random variable of v V . µm = E[max(V. 0) dv (5. respectively. v v (5. That is. θ1 . However..4). the probability that V lies probability region is justified.4) where µv .Θ)]=µm µv −ε h(v. (5. when ε is too large.function to approximate the max operation.6) In (5. 0) within the ±ε range of V . In other words. 0)] E[h(V. In this chapter. σ2 . The problem thus becomes how to obtain the fitting parameters of Θ. Θ)] are the exact and approximated mean of max(V. ε is the approximation range. respectively. Different from the linear fitting method through tightness probability.3) where Θ = (θ0 . The value of ε may affect the accuracy of the second order approximation. Θ) − max(v. max(V. Θ)] = θ2 (σ2 + µ2 ) + θ1 µv + θ0 .e. it would still be more or less Gaussian like. variance. when ε is large. this problem can be formulated as the following optimization problem: Θ = arg min Z µv +ε 2 E[h(V. while µm and E[h(V. This can be explained by the fact that. although the circuit delay is not Gaussian.5) (5. 102 .

14) 4Δ1/3 √ where Δ = γv · σ3 + j 8σ6 − γ2 · σ6 . It is proved in [184] that. i. c1 . E[c2 ·W 2 + c1 ·W + c0 ] = µv (5. When V is a non-Gaussian random variable.e. when v v v v √ |γv | < 2 2 there exists one and only one of the above three values in the range |c 2 | < √ √ σv / 2. and c0 can be obtained by matching P(W ) and V ’s mean.3 = (5. we approximate the non-Gaussian random variable V as a quadratic function of a standard Gaussian random variable W similar to [184].4). (5. i.9) (5.11) E[(c2 ·W 2 + c1 ·W + c0 − µv )2 ] = σ2 v E[(c2 ·W 2 + c1 ·W + c0 − µv )3 ] = γv · σ3 v As shown in [184]. we propose to use the following two-step procedure to approximately compute µm . we first need to compute µm . c2 will be one of the following values: 2σ2 + Δ2/3 v (5.. solving the above equations.its PDF would be most likely centering around the ±3σ range of the mean. with j = −1. When |γv| > 2 2. We will pick such value for c2 .10) (5. variance and skewness simultaneously. In the first step.15) √  −σ /√2 γ < −2 2 c2. we may let:  √ √  σv / 2 γv > 2 2 c2 = (5. Therefore.e. In practice. the users can choose the range ε according to the delay distributions.13) 4Δ1/3 √ √ 2(1 − j 3)σ2 + (1 + j 3)Δ2/3 v c2.1 = − v v 103 . In order to solve (5..8) The coefficients c2 . V ≈ P(W ) = c2 ·W 2 + c1 ·W + c0 .2 = (5.12) 1/3 2Δ√ √ 2(1 + j 3)σ2 + (1 − j 3)Δ2/3 v c2. exact computation of µm is difficult in general.

20) (5. 0) by the exact mean of max(P(W ). 0)..24) (5. the integration range P(w) > 0 can be computed in four different cases: Case 1: c2 > 0 P(w) > 0 ⇔ w < t1 ∨ w > t2 where t1 = (−c1 − t2 = (−c1 + Case 2: c2 < 0 P(w) > 0 ⇔ t2 < w < t1 Case 3: c2 = 0 ∧ c1 > 0 P(w) > 0 ⇔ w > −c0 /c1 Case 4: c2 = 0 ∧ c1 < 0 P(w) > 0 ⇔ w < −c0 /c1 104 (5.19) . µm ≈ E[max(P(W ).16) (5.18) where φ(·) is the PDF of the standard normal distribution. we approximate the exact mean of max(V. and c0 in the second step.e. With c2 .It is proved in [19] that the above equations gives P(W ) which matches the mean and variance of V and has the skewness closest to γv .17) c0 = µv − c2 After obtaining the coefficients c2 .21) (5. 0)] = Z P(w)>0 P(w)φ(w)dw (5. c1 . c1 and c0 can be computed as: c1 = σ2 − 2c2 v 2 (5.23) (5. In the above equation.22) c2 − 4c2 c0 )/2c2 1 c2 − 4c2 c0 )/2c2 1 (5. i.

the circuit delay has close-to-normal distribution as discussed above.27) 105 .25). v v 2 Z l2 (5. Θ)dv − v dv 2 2 θ2 (v2 − µ2 − σ2 ) + θ1 (v − µv ) + µm dv + v v θ2 (v2 − µ2 − σ2 ) + θ1 (v − µv ) + µm − v dv. Fortunately. we can compute E[max(P(W ). After obtaining µm . we need to find Θ in (5. Therefore. When V is not close to the normal distribution.25) c2 > 0 c2 < 0 c2 = 0 ∧ c1 > 0 c2 = 0 ∧ c1 < 0 where Φ(·) is the cumulative density function (CDF) of the standard normal distribution. the square error in (5. According to (5. We first write the constraint in (5.3) by solving the constrained optimization problem of (5.4) can be written as: SE = = Z0 µv −3σv Z 0 l1 0 (5. we show that (5. v v Replacing θ0 in (5. in practice. 0)] =   (c2 + c0 ) 1 + Φ(t1 ) − Φ(t2) + (c1 + t1 ) φ(t2) − φ(t1)       (c2 + c0 ) Φ(t1 ) − Φ(t2) + (c1 + t1 ) φ(t2) − φ(t1)  c · Φ(c /c ) + c · φ(c /c )  0 0 1 1 0 1      c · 1 − Φ(c /c ) − c · φ(c /c ) 0 0 1 1 0 1 (5. Θ)dv + µvZ v +3σ 0 h(v.4) by (5.Knowing the integration range.26) h2 (v. we can compute µm easily through analytical formulas.4). the above approximation of µm is no longer accurate. In practice. In the following. V is uniform distribution.4) as follows: θ0 = µm − θ2 (µ2 + σ2 ) − θ1 µv .26). the above approximation of µm is accurate only when the distribution of V is close to the normal distribution.4) can be solved analytically as well. 0)] under such four cases: E[max(P(W ). for example. such approximation is accurate for SSTA.

Then the optimum of Θ can be obtained by setting the derivative of (5. we can transform the constrained optimization of (5.30) (5. where S = (si j ) is a 2 × 2 matrix.4) to the following unconstrained optimization problem.29) to zero.28) (5. Therefore. Q = (qi ) is a 1 × 2 vector. By expanding the square and integral. which is a quadratic form of Θ = (θ1 . Q.37) 106 .29) SE = Θ T SΘ + QΘ + t.32) (5.35) Because the square error is always positive no matter what value the Θ is.34) 2 2 2 3 q1 = l2 µv + (l2 − l1 )µm − 2l2 /3 − 2(l2 − l1 )µvµm 2 3 3 q2 = l2 (µ2 + σ2 ) + 2(l2 − l1 )µm /3 − v v 3 2l2 /3 − 2(l2 − l1 )(µ2 + σ2 )µv v v 3 2 t = l2 /3 + (l2 − l1 )µ2 − l2 µm m (5. The parameters of S. S is a positive definite matrix. resulting a 2 × 2 system of linear equations: ∂SE = 2s11 · θ1 + 2s12 · θ2 + q1 = 0 ∂θ1 ∂SE = 2s22 · θ2 + 2s12 · θ1 + q2 = 0 ∂θ2 (5.where l1 = µv − 3σv and l2 = µv + 3σv .29) is to minimize a second-order convex function of Θ without constraints.31) 5 5 2 2 3 3 s22 = (l2 − l1 )/5 + (l2 − l1 )(µ2 + σ2 )2 − 2(l2 − l1 )(µ2 + σ2 )/3 v v v v 4 4 s12 = s21 = (l2 − l1 )/4 + (l2 − l1 )µv(µ2 + σ2 ) + v v 3 3 2 2 (l2 − l1 )µv /3 − (l2 − l1 )(µ2 + σ2 )/2 v v (5.33) (5. θ2 )T : Θ = arg min SE (5.36) (5. and t can be computed as: 3 3 2 2 s11 = (l2 − l1 )/3 + (l2 − l1 )µ2 − (l2 − l1 )µv v (5. and t is a constant. (5.

7. 5. 0). we can obtain the fitting v parameters Θ for max(V. the PDF of our second-order fitting method has a sharp peak which approximates the impulse of exact PDF well as shown in Fig. we see that for a random variable V with any distribution. linear fitting through through tightness probability. Θ = (0. and our second-order fitting can be obtained as shown in Fig.2. In particular.2.38) (5. µv + 3σv ). where the x-axis is V . From (5.2. by assuming V ∼ N(0. we compare the results obtain from our approach. 0) through closed-form formulas. the exact max operation max(V. 107 .2. That means: If µv > 3σv . we see that our proposed second-order fitting method is more accurate than the linear fitting method. the linear fitting method. when V ∈ (µv − 3σv . variance σ2 . The corresponding PDFs of the three approaches are shown in Fig. In contrast. it is easy to find that in this case. and skewness γv .40) To show the accuracy of our second-order fitting approach to the max approximation. From the figures. 5. the linear fitting method can only give a smooth PDF. Notice that in (5. 0). From the above discussion.4). if we know its mean µv .2. and y-axis is the results of max(V.26). 0). 0) = V max(V. 5.4) we try to minimize the mean square error within the ±3σ range. 1). For example.2. 0) ≈ V when µv > 3σv (5. V is always larger than zero within the ±3σ range.Such a system of linear equations can be solved efficiently: θ1 = θ2 s12 · q2 − s22 · q1 2s11 · s22 − 2s2 12 s12 · q1 − s11 · q2 = 2s11 · s22 − 2s2 12 (5. and the exact (or Monte Carlo) computation. 1. That is: max(V.39) With θ1 and θ2 . which is far away from the exact max result. we can compute θ0 from (5.

5 2 1.Xn . R is local random variation. In order to simplify the above delay model. R) (5.3. we apply the Taylor expansion to approximate it. 5. Xn )T are global variation sources.6 0. we assume that Xi ’s and R are mutually independent. Here.5 1 0. n is the number of global variation sources. linear fitting. and second-order fitting for max(V. and second-order fitting for max(V. . X2 .5 0 −0. the circuit delay is a complicate function of variation sources: D = f (X1 . we assume all Xi ’s and R have zero mean and unit variance. .1 Quadratic Delay Model In Section 5.0) Second−Order Fit Linear Fit (a) (b) Figure 5.2. . Without loss of generality. We first discuss the delay model. .2 0.41) where X = (X1 .8 0.5 3 2. we introduce the second-order fitting of max operation. 0). .1 0 −2 −1 0 1 2 3 Max(V.3 0.3 Quadratic SSTA 5.5 0. Considering that the scale of local random variation is 108 . 0).7 0. linear fitting. In practice. X2. (b) PDF comparison of exact computation.9 0. .1: (a) Comparison of exact computation.5 −1 −2 −1 0 1 2 3 Max(V.4 3.4 0. Here we will apply such fitting in SSTA.0) Second−Order Fit Linear Fit 1 0.

in practice.44) (5. the coefficients A. B. a2 . respectively. . A = (a1 . an ) are the linear delay sensitivity coefficients of the global variation sources. we assume that R is a random variable with standard normal distribution.k and mr. Moreover. B = (bi j ) are the second-order sensitivity the local random variation.k to represent the kth central moment for Xi and the kth moment for R.usually smaller than that of global variation. and r can be obtained by measurement or SPICE simulation. the local random variation is caused by several independent factors.43) (5. Finally. R will be close to a Gaussian random variable. In this chapter. therefore. Moreover. . . we use mi. Notice that all Xi ’s and R are with zero mean. f is a complex function and can not be expressed as closed-form formula. j=1 ∑ bi j Xi X j + ∑ ai Xi + rR i=1 T = d0 + AX + X BX + rR (5. we use second order Taylor expansion to approximate Xi ’s and use first order expansion to approximate R: n n D ≈ d0 + i. According to the central limit theorem.45) In practice.42) where d0 is the nominal delay.46) In the rest of this chapter. 109 . and r is the linear delay sensitivity coefficient of (5. the local random variation R is the sum of several independent random variables. B. for simplicity. which is an n × n matrix. In this case. A. that means their central moments and raw moments are the same. and r are calculated in the similar way as [180]: ∂f ∂Xi ∂2 f bi j = ∂Xi ∂X j ∂f r = ∂R ai = coefficients. we obtain our quadratic delay model as follows: D = d0 + AX + X T BX + rR (5.

2. and skewness of D p .2. are needed. Ds = D1 + D2 = ds0 + As X + X T Bs X + rs Rs .we also assume that the moments of the variation sources. otherwise. 0) + D2 .48) 5. D2 ) = dm0 + Am X + X T Bm X + rm Rm .46): D1 = d01 + A1 X + X T B1 X + r1 R1 . such moments can be computed from the samples of variation sources.47) (5.k and mr. We will present how these two operations are handled in the rest of this section. we want to compute Dm = max(D1 .2. That is. Without loss of generality. D2 ) = max(D1 − D2 .2.51) let D p = D1 − D2 .49) (5. The overall flow of the max operation is illustrated in Figure 5. we compute the joint moments between D p and Xi ’s and the skewness of D p . we propose a novel technique to compute the max of two random variables. (5. mi. if µD p > 3σD p .k .2 to find 110 .50) (5. max and sum. Knowing the mean.3. are known. we apply the method as shown in Section 5. D2 = d02 + A2 X + X T B2 X + r2 R2 . We first compute the mean and variance of D p . given D1 and D2 with the quadratic form of (5. Based on the second-order polynomial fitting method as discussed in Section 5. In practice. then Dm = D1 . To compute the arrival time in a block-based SSTA framework. two atomic operations. (5. we assume E[D p ] > 0. variance.2 Max Operation for Quadratic Delay Model The max operation is the most difficult operation for block-based SSTA. Considering Dm = max(D1 .

Compute joint moments between D2 and Xi p 4.the fitting coefficients Θ = (θ0 . Finally. Input: Quadratic form of D1 and D2 Output: Quadratic form of Dm 1. Compute joint moments between Dm and Xi 7. we use the moment matching method to reconstruct the quadratic form of Dm .2.54) (5. θ1 . we first obtain the quadratic form of D p as follows: D p = D1 − D2 = d p0 + A p · X + X T B p X + r1 · R1 − r2 · R2 d p0 = d01 − d02 A p = A1 − A2 B p = B1 − B2 (5. Compute γD p 5. 0). compute µD p and σD p if µD p ≥ 3σD p { 2.3. R1 .52) (5. θ2 ) for max(D p .D2 ). and R2 are mutually independent random variables with mean 0 111 . Reconstruct the quadratic form of Dm } Figure 5. Let D p = D1 − D2 . Get fitting coefficients Θ 6.55) Because Xi ’s. Dm = D1 } else { 3.53) (5. 5.2: Algorithm for computing max(D1 .1 Mean and Variance of D p In order to compute the mean and variance.

4 + 2a pi b pii mi.k=i ∑ (b p j j b p kk + 2b2 jk ) p (5.3 + 2a p j b p j j m j.3 + 2a pi b pii mi.5 + p p0 (a2 + 2d p0 b pii ) · mi.3 ) + pi 1≤i< j≤n ∑ 2(b2 j + b pii b p j j ) − ( ∑ b pii )2 pi i=1 n (5.60) (b pii b p j j + 2b2 j )mi.and variance 1.3.4 + b2 j j m j.5 + p pi pii 2· 1≤ j≤n j=i ∑ a pi b p j j + 2a p j b p i j + (5. the joint moments between D2 and Xi .4 + b2 mi.56) 2 2 = r1 + r2 + ∑ (a2 + b pii mi.59) (5.2 Joint Moments and Skewness of D p From the definition of D p as shown in (5.3 − 2r1 µD p p 2 2 E[Xi2 D2 ] = d 2 + r1 + r2 + 2d p0 a pi mi.3 pi 2 E[R1 D2 ] = r1 mr. the mean and variance of D p can be computed as: n µD p = E[D p] = d p0 + ∑ b pii σ2 p = E[(D p − µD p )2 ] D n i=1 i=1 (5.3 + p 2(b pii b p j j + 4b2 j ) · mi.2.4 + 4b p j j b pi j mi.58) (5.3 + 2r1 µD p p 2 E[R2 D2 ] = r2 mr.57) 5.3 + 2a pi b pii mi.61) 112 .4 + b2 mi.3 + pi p 2· 1≤ j<k≤n j.3 + 2b pi j b p j j m j. p Xi2 .52).6 + pi pii ∑ 1≤ j≤n j=i a2 j + 2d p0 b p j j + 2(a pi b p j j + 2a p j b pi j )mi.3 m j. and R can be computed as: E[Xi D2 ] = 2d p0 a pi + (a2 + 2d p0 b pii )mi.

E[Xi X j D2 ] = 2a pi a p j + 4d p0 b pi j + 2 · (2a pib pi j + a p j b pii )mi.65) (5.68) Dq = θ 1 · D p + D 2 113 . and skewness of D p is computed as: E[D2 ] = µ2 p + σ2 p p D D E[D3 ] = E[D p · D2 ] p p = E[(d p0 + A p X + X T B p X + r1 R1 − r2 R2 )D2 ] p = d p0 · E[D2 ] + r1 · E[R1 D2 ] − r2 · E[R2D2 ] + p p p n i=1 (5.3 + 4 · pi 1≤k≤n k=i.3 + 4b p j j b pi j m j. θ2 ) for max(D p .3 Reconstruct the Quadratic Form of Dm Knowing µD p .4 + 2 · (2b2 j + b pii b p j j)mi.3.2. central moments.3m j.3 + p 4b pii b pi j mi. we apply the second-order fitting method to compute the fitting parameters Θ = (θ0 . and γD p . σD p . θ1 . D2 ) = max(D p .62) With the joint moments computed above.66) 5. j ∑ b pi j b pkk (5.63) ∑ a pi · E[XiD2 ] + b pii · E[Xi2D2 ] + p p b pi j E[Xi X j · D2 ] p (5. 0). the raw moments.4 + 2 · (2a p j b pi j + a pi b p j j )m j. 0) + D2 ≈ θ 2 · D2 + θ 1 · D p + θ 0 + D 2 p = θ 2 · D2 + D q + θ 0 p (5. Then Dm can be represented as: Dm = max(D1 .67) (5.64) 0≤i< j≤n ∑ mD p (3) = E[(D − µD )3 ] γD p = mD p (3)/σ3 p D = E[D3 ] − 3µD p · E[D2 ] + 2µ3 p p p D (5.

because D p and Dq are in quadratic form as (5.62).75) (5. and joint moments between Dm and variation sources: E[Dm ] = θ2 · E[D2 ] + E[Dq ] + θ0 p (5.67) is a 4th order polynomial of Xi ’s.58)-(5.72) (5.3 E[Xi2 Dq ] = dq0 + aqi mi.76) 1≤ j≤n j=i E[Xi Dq ] = aqi mi.70) (5.80) 114 . we first compute the the mean of Dm .46).79) (5.69) (5. In order to reconstruct the quadratic form for Dm .74) E[Xi X j Dm ] = θ2 · E[Xi X j D2 ] + E[Xi X j Dq ] p E[R1 Dm ] = θ2 · E[R1 D2 ] + E[R1Dq ] p E[R2 Dm ] = θ2 · E[R2 D2 ] + E[R2Dq ] p E[Xi2 Dm ] = θ2 · E[Xi2D2 ] + E[Xi2Dq ] + θ0 p E[Xi Dm ] = θ2 · E[Xi D2 ] + E[XiDq ] p The joint moments between D2 and Xi ’s are computed in (5.78) (5.3 + bqii mi.71) (5.73) (5. However. the representation of Dm in (5.form formula of Dm .2 + bqii mi.4 + E[Xi X j Dq ] = 2bqi j E[R1 Dq ] = θ1 · r1 E[R2 Dq ] = (1 − θ1 ) · r2 ∑ bq j j (5. The joint p moments between Dq and variation sources are: n E[Dq ] = dq0 + ∑ bqii i=1 (5.The above equation gives the closed.77) (5.

74).86) (5. by applying the moment matching technique similar to [178].4 − m2 − 1 i.82) (5.87) (5.91) Where E[R1 · Dm ] and E[R2 · Dm ] are computed in (5. Because the random term in Dm comes from the random terms in D1 and D2 . Because the random variation sources Rm . and E[Xi X j Dm ] computed in (5. the sensitivity coefficients Am = (ami ) and Bm = (bmi j ) can be obtained by solving the linear equations above: bmii = E[Xi2Dm ] − E[Dm ] − E[Xi Dm ]mi. we have: rm1 = E[R1 · Dm ] rm2 = E[R2 · Dm ] rm = 2 2 rm1 + rm2 (5.72).85) (5. as shown in (5. we assume that rm Rm = rm1 R1 + rm2 R2 .3 n dm0 = E[Dm ] − ∑ bmii i=1 Finally.83) (5.89) (5.90) (5.84). we have: n E[Dm ] = i=1 ∑ bmii + dm0 ∑ bm j j (5. R1 .88) bmi j = E[Xi X j Dm ] ami = E[Xi Dm ] − bmii · mi. E[Xi Dm ]. E[Xi2Dm ].3 mi.3 With E[Dm ].4 + E[Xi X j Dm ] = 2bmi j E[Xi Dm ] = ami + bmii · mi.69)-(5. and R2 are Gaussian random variables.73) and (5. 115 .3 (5.3 + bnii · mi.Because we want to reconstruct Dm in the quadratic form.81) (5. by applying the moment matching technique similar to (5.84) 1≤ j≤n j=i E[Xi2Dm ] = ami · mi.49). we consider the random term of Dm .

46) is 116 . Ignoring the crossing terms.4 Semi-Quadratic SSTA 5.4.3 Sum Operation for Quadratic Delay Model The sum operation is straight forward.92) (5. 19]. we need to compute the sum of two n × n metric. the total complexity is O {n3 N}. if we ignore the crossing terms. For the sum operation.4 Computational Complexity of Quadratic SSTA For each max operation in SSTA based on a quadratic delay model. where n is the number of variation sources.95) 5. we need to compute the sum of n numbers. the impact of the crossing terms are usually very weak [178.1 Semi-Quadratic Delay Model In Section 5.3. we need to calculate n2 joint moments between D p and Xi X j s.3.5.94) (5. in practice. Because the total number of max and sum operations is linear with respect to circuit sizes N. we introduce the SSTA for quadratic delay model. The coefficients of Ds can be computed by adding the correspondent coefficients of D1 and D2 : ds0 = d01 + d02 As = A 1 + A 2 Bs = B 1 + B 2 rs = 2 2 r1 + r2 (5.93) (5. for each joint moment. the quadratic delay model in (5.3. Therefore. However. we may speed up the SSTA process without affecting the accuracy too much. hence the computational complexity is O {n 3 }. hence the computational complexity is O {n 2 }. 5.

bn ) are the second order sensitivity coefficients for the square terms. 5.99) d p0 = d p0 + ∑ b pi Ypi = a pi Xi + b pi Xi2 − b pi i=1 variables with mean 0 and variance 1. Xn )T are the square of variation sources.. A. . In order to compute the central moments of D p . . . and R are defined in the similar way as the quadratic delay model in 2 2 2 (5. .2 Max Operation for Semi-Quadratic Delay Model The overall flow of the max operation of semi-quadratic delay model is similar to that of quadratic delay model as illustrated in Figure 5.4. X2 . X 2 = (X1 . we first rewrite D p to the following form: n D p = d p0 + ∑ Ypi + r1 R1 − r2 R2 i=1 n (5.97) (5.1 Moments of D p We defined D p = D1 − D2 in a similar way as (5.re-written as: D = d0 + AX + BX 2 + rR (5.46). and B = (b1 . . for the semi-quadratic delay model in the rest of this section.96) where d0 .52).4.98) (5. 5. Because Xi s are mutually independent random 117 . b2 . Ypi s are independent random variables with zero with a pi = a1i − a2i and b pi = b1i − b2i . max and sum. . We will introduce the atomic operations. .2. The only difference is that we do not need to compute the joint moments between the crossing terms and D m . r. We defined such quadratic model without crossing terms as Semi-Quadratic Delay Model.2.

105) (5.106) (5. Therefore.4 − 1) + 2a pi b pi mi. Here the joint moments between Xi s and D2 .6 − 3mi.107) (5.108) E[Xi Dq ] = E[XiYqi ] E[Xi2 Dq ] = E[X 2Yqi ] 2 2 where E[Ypi ] are computed in (5.69)-(5.104) With the central moments of D p .2 Reconstruct the Semi-Quadratic Form of Dm Similar to the quadratic SSTA.4 − 1 + pi pi a3 mi.4 + 2 pi pi (5. 5. Dq are: p 2 E[Xi D2 ] = E[XiYpi ] + 2d0 E[XiYpi ] p 2 E[Xi2D2 ] = E[Xi2Ypi ] + 2d0 E[Xi2Ypi ] + E[Xi2]E[(D p −Ypi )2 ] p (5.109) .5 − 2mi.4.2. the fitting coefficients Θ can be obtained. Then the joint moments between Xi s and Dm can be computed in the similar ways as (5.74). E[(D p − Ypi )2 ] = σ2 p − E[Ypi ].3 + b3 mi. the raw moments and skewness can be computed easily.101) 3 ∑ E[Ypi] n mD p (3) = E[(D p − µD p ) where 3 3 3 ] = r1 + r2 + (5.mean.100) (5. the first 3 central moments of D p are: µD p = d p0 2 2 2 σ2 p = r1 + r2 + ∑ E[Ypi ] D i=1 n (5. and Yqi s are D defined in the similar way as Ypi s: Yqi = aqi Xi + bqi Xi2 − bqi 118 (5. variance and skewness of D p .3 + 3a2 b pi mi.3 + a2 pi pi (5. by knowing the mean.102).103) 3 E[Ypi ] = 3a pi b2 mi.102) i=1 2 E[Ypi ] = b2 (mi.

And then A m and Bm can be obtained by solving such equations.110) (5.5. The number of max and sum operations is linear to the circuit size.3 + b pi mi. as shown in (5. The sum operation of the semi-quadratic delay model is similar to that of the quadratic delay model.1 Linear Delay Model In the previous sections. From the discussion above. In practice. by applying the moment matching technique.5 Linear SSTA 5. we introduce the SSTA for non-linear delay model.4 + b2 mi.4 − 1 a2 − 22 mi.5 − mi.113) 2 E[XiYpi ] = b2 mi. we can compute the random term of D m in the same way as the quadratic model. the number of variation sources.111) (5.6 − 1 + 2a pi b pi mi. Knowing the joint moments between Xi s and Dm .4 − 1 + a2 − 2b2 mi. it is easy to see that for the semi-quadratic SSTA.84).3 The joint moments between Xi s and Yqi s can be computed in the same way.5 + 2a pi b pi mi. We can compute Ds = D1 + D2 by summing up the correspondent coefficients. when the variation scale is small.3 pi pi pi E[XiYpi ] = a pi + b pi mi. The joint moments between Xi s and Ypi s are: 2 E[Xi2Ypi ] = E[Xi2Ypi ] = a pi mi.112) (5. Finally. we will have similar equations as (5. where n is the 5. the circuit delay can be approximated as a linear 119 .3 pi pi pi (5. computational complexity for both max and sum operation is O {n}.91).with aqi = θ1 (a1 − a2 ) + a2 and bqi = θ1 (b1 − b2 ) + b2 .

function of variation sources: D = d0 + A · X + r · R (5.114)

where X , d0 , A, r, and R are defined in the similar way as the quadratic delay model in (5.46). The difference is that there are no second order terms. Hence the atomic operations of the linear delay model are much simpler than those of quadratic delay model. We will introduce such operations in the rest of this section.

5.5.2 Max Operation for Linear Delay Model The overall flow of the max operation of linear delay model is similar to that of quadratic delay model as illustrated in Figure 5.2 except that we do not need to compute the high order joint moments between Dm and Xi s.

5.5.2.1 Mean and Variance of D p The linear form of D p = D1 − D2 can be computed in the similar way as (5.52). Then the mean and variance of D p can be computed as: µD p = E[D p ] = d p0 ;
2 2 σ2 p = E[(D − µD )2 ] = ∑ a2 + r1 + r2 D pi i=1 n

(5.115) (5.116)

5.5.2.2 Joint Moments and Skewness of D p The joint moments between D2 and the variation sources and random variation can be p computed as: E[Xi · D2 ] = a2 · mi,3 + 2a pi · d p0 p pi (5.117)

120

With the joint moments computed above, the third order raw moments D p is computed as: E[D3 ] = ∑ ai E[Xi · D2 ] + r1E[R1 · D2 ] + r2 E[R2 · D2 ] p p p p
i=1 n

(5.118)

The second order raw moment, central moments, and skewness of D p can be computed in the same way as the quadratic SSTA in (5.66).

5.5.2.3 Reconstruct the Linear Canonical Form of Dm Knowing the mean, variance, and skewness of D p , we may apply the method in Section 5.2.2 to compute the fitting parameters Θ for max(D p , 0), and then represent Dm in the form of: Dm ≈ θ 2 · D2 + D q + p 0 p (5.119)

where Dq is defined in the similar way as (5.67). The above equation gives the closedform formula of Dm , however, such formula is not in the linear canonical form. We still apply moment matching technique to reconstruct Dm to the linear form. First, the mean of Dm and the joint moments between Dm and variation sources are: E[Dm ] = θ2 · E[D2 ] + E[Dq ] + p0 p (5.120) (5.121)

E[Xi · Dm ] = θ2 · E[Xi · D2 ] + E[Xi · Dq ] p

Where E[D2 ] is computed in (5.118). The joint moments between Dq and variation p sources are computed as: E[Dq ] = ds0 E[Xi · Dq ] = aqi (5.122) (5.123)

121

With the joint moments, by applying the moment matching technique in [178], we have: ami = E[Xi · Dm ] dm0 = E[Dm ] (5.124) (5.125)

Finally, the random term rm can be computed in the same way as the quadratic model in (5.91). For the linear SSTA we discussed above, the computational complexity for both max and sum operation is O {n}. Similar to the quadratic SSTA, the number of max and sum operations is linear to the circuit size.

5.6 Experimental Result
We have implemented our SSTA algorithm in C for all three delay models we discussed before: quadratic delay model (Quad SSTA), semi-quadratic delay model (Semi-Quad SSTA), and linear delay model (Lin SSTA). We also define three comparison cases: (1) our implementation of the linear SSTA for Gaussian variation sources in [163], which we refer to as Lin Gau; (2) the non-linear SSTA using Fourier Series approximation for non-Gaussian variation sources in [40] (Fourier SSTA); (3) 100,000-sample Monte-Carlo simulation (MC). We apply all the above methods to the ISCAS89 suite of benchmarks in TSMC 90nm technology. In our experiment, we consider two types of variation sources gate channel length Lgate and threshold voltage Vth . For each type of variation source, inter-die, intra-die spatial, and intra-die random variation are considered. We use the grid-based model in [172, 173] to model the spatial variation. The number of grids (the number of spatial variation sources) is determined by the circuit size and larger circuits have more variation sources. We also assume that the 3σ value of the inter-die (σ g ), intra-die

122

spatial (σs ), and intra-die random (σr ) variation are 10%, 10%, and 5% of the nominal value, respectively. In the following, we perform the experiments for two variation setting: (1) both Lgate and Vth have skew-normal distributions [16]; and (2) Lgate has a normal distribution and Vth has a Poisson distribution. The experimental setting is shown in Table 5.1.
Setting (1) (2) Lgate dist Skewnormal Normal Vth dist Skewnormal Poisson 3σg 10% 10% 3σs 10% 10% 3σr 5% 5%

Table 5.1: Experiment setting. The 3σ value is the normalized with respect to the nominal value. Fig. 5.3 illustrates the PDF comparison for circuit s15850 under variation setting (1). In the figure, delay is normalized with respect to the nominal value. From the figure, we find that, compared to the Monte-Carlo simulation, the Quad SSTA is the most accurate, the next is Semi-Quad SSTA, and Lin SSTA is least accurate. Such result is expected, because the Quad SSTA captures all the second-order effects; the Semi-Quad SSTA captures only partial second order effects; while the Lin SSTA captures only linear effects and ignores all non-linear effects. Moreover, Fourier SSTA [40] has similar accuracy as Semi-Quad SSTA because they both apply semi-quadratic delay model. In addition, we also find that all these three non-Gaussian SSTA is more accurate than Lin Gau [163]. Table 5.2 compares the run time in second (T ), and the error percentage of mean (µ), standard deviation (σ), and skewness (γ) under variation setting (1). In the table, the error percentage of mean (µ), standard deviation (σ), and 95-percentile point (95%) skewness γ is computed as 100 × (γMC − γSSTA )/γMC . Moreover, the average error in the table is average of the absolute value; and the average runtime is the average runtime ratio between SSTA and Monte-Carlo simulation. For each benchmark, G is computed as 100 × (MC value − SSTA value)/σMC , and the error percentage of

123

and N refers to the total number of variation sources.1 1. standard deviation. finally. all the three nonGaussian SSTA methods predict more accurate mean values and 95-percentile points.3 1. refers to the number of gates.5 Figure 5. the error of skewness is up to 30% and the error of 95-percentile point is up to 2%. Lin SSTA result similar mean and standard deviation error.8 0.9 Normalized Delay 1 1.5 1 0. Semi-Quad SSTA.5 0 0.4 1. and the non-linear SSTA methods. and the error of skewness is within 5%.2 1. From the table. 124 . we see that for the Quad SSTA. and the inaccurate skewness results larger error of 95-percentile point. but the error of skewness and 95-percentile point is much larger.3%.3: PDF comparison for circuit s15850.3. the error of mean.5 MC Quad SSTA Semi−Quad SSTA Lin SSTA Lin Gau Fourier SSTA 3 2.7 0. we still use the non-Gaussian variation sources to reconstruct the PDF of the circuit delay. which is the most important timing characteristic.5 2 1.6 0. For fair comparison. we first approximate the non-Gaussian random variables to the Gaussian random variables by matching the mean and variance. especially for the Lin SSTA. then use Lin Gau to obtain the linear canonical form of the circuit delay. Compared to Lin Gau. when we perform Lin Gau on non-Gaussian variation sources. and 95-percentile point is within 0. This is because the Lin SSTA ignores all non-linear effects which significantly affect the skewness.

This is because the computational complexity of Semi-Quad SSTA. we can find the similar trend as Table 5. This shows that our approach works well for different distributions.Quad SSTA and Semi-Quad SSTA also give much more accurate skewness. but the run time of Quad SSTA is longer especially when the number of variation sources is large. while the computational complexity of Semi-Quad SSTA (O {n}) is lower than that 125 . We also observe that compared to Fourier SSTA. Table 5. we also find that our methods are still highly accurate under such variation setting. Moreover. the error of mean. the run time of all the SSTA methods is significantly shorter than the Monte-Carlo simulation. After all. but Quad SSTA has higher complexity than the other two SSTA methods. This is due to the fact that Semi-Quad SSTA uses the same semi-quadratic delay model as Fourier SSTA. we also find that both Semi-Quad SSTA and Lin SSTA have similar run time as Lin Gau.2. and k is the maximum number of orders of Fourier Series [40].3 illustrates the results under variation setting (2). For Quad SSTA. SemiQuad SSTA has similar accuracy with almost 70X speed up. standard deviation. Lin SSTA. From this table. and the error of skewness is within 6%. and 95-percentile point is within 1%. and Lin Gau is the same. where n is the number of variation sources. Moreover. of Fourier SSTA (O {nk2 }).

04 -0.01 -0.11 0.13 -0.57 -0.4 -22.01 0.66 -28.01 0.61 25.16 -0.01 0.01 0.02 0.13 T µ Semi-Quad SSTA σ γ 95% T 0.09 -0.4 0.06 0.06 -0.85 0.01 0.36 0.01 -0.56 -0.54 0.bench G name s208 s386 s400 s444 s832 s953 s1238 s1423 s1494 61 118 106 119 262 311 428 490 588 N µ σ Quad SSTA γ 95% -0.7 -0.04 -0.51 0.09 -0.2 -0.01 2.01 0.17 43 -0.06 1.14 -7.01 0.19 0.53 -0.68 -27.2 -0.22 -0.3 33.02 -0.4 1/35819 0.01 0.9 103 122 201 288 320 12 -0.04 -10.98 0.8 -0.23 -0.52 25 0.13 -0.03 0.15 0.82 0.01 0.19 -13.38 1.44 -21.24 0.18 -0.01 0.57 -25.4 -1.25 µ σ -0.03 0.02 0.1 0.01 0.33 0.07 -0.11 0.02 -0.01 -0.01 -0.1 -0.04 -0.03 55 0.09 -0.1 -3.22 33 -0.68 0.61 0.15 -12.65 -31.06 -0.1 0.63 0.37 -0.04 -6.4 -1.01 -0. and the error percentage of skewness γ is computed as 100 × (γ MC − γSSTA )/γMC .67 -28.03 18 -0.66 -31.6 0.24 4.02 -0.6 -1.1 19459 1/260 - 218 -0.22 -9.59 -0.04 1.06 -0.02 -0.16 -0.12 0.8 Lin Gau γ 95% -2.01 0. standard deviation (σ).91 5.74 -0.04 -0.72 -0.32 -22.47 -26.4 -2.99 -0.12 -4.4 -3.01 0.9 0.49 -22.24 -0.01 0.09 0.25 Table 5.01 0.06 -0.32 0.04 -0.24 -0.04 4717 11.25 -14 MC T 15.08 -0.01 0. skewness. and 95-percentile point (95%) is computed as 100 × (MC value − SSTA value)/σMC .22 126 s5378 1004 82 0.31 0.32 -22.06 -0.07 0.42 1238 6. Note:the error percentage of mean (µ).15 -0.24 -10 -0.38 0.09 -0.17 -0.07 -9.17 -0.8 -2.08 -0.01 2.49 10.08 0.01 3.23 0.12 -0.96 T 0.06 0.37 -0.6 -2.02 0.92 0.12 4.56 -0.3 0.08 -0.91 1.15 0.11 -0.01 0.06 -0.12 -6.01 0.5 -1.2 -6.95 0.16 -0.1 Ave 0.21 -0.51 -0.08 0.2 -0.3 -0.56 -27.66 s38417 8709 176 0.01 -0.79 0.01 -4.14 0.17 -0.15 0.49 -25.05 -9.9 0.17 1.08 -10.2 -0.02 -0.54 25.59 68 -0.14 1.13 9.04 -10.16 0.2 -1.5 -27.11 -5.18 -0.64 0.98 s13207 2573 115 -0.2 19116 25.1 0.4 0.11 8.12 -0.7 1/39248 0.3 -3.2 -3.52 -24.5 -0.06 -0.4 -0.65 0.1 0.07 0.04 0.95 -0.12 2.16 -0.04 s15850 3448 135 0.43 0.18 -0.68 µ σ 0.09 -4.01 0.86 0.01 0.08 -0.8 -1.18 2.28 -0.88 25 -0.02 0.02 -0.66 -26.21 -0.23 0.08 0.32 T 1.51 -0.5 -27.53 -24.04 0.21 -0.01 1.03 0.7 -1.17 2.07 -0.7 -2.01 -0.07 -4.5 6484 23.27 0.11 2.15 -0.01 0.08 3.02 -0.7 -3.98 -0.06 -0.36 -0.2 -0.4 38.02 0.29 1/720 0.23 -0.11 -0.29 µ σ -0.02 -0.51 25.29 -0.23 -8.01 0.1 -1.1 T 0.71 s9234 2027 99 0.49 -23.11 -0.21 -0.59 -0.01 0.22 s38584 11448 176 -0.18 -0.71 -28.75 -0.87 -0.6 -1.12 -15.5 -3. .28 1.57 -25.89 2801 9.04 -3.13 -0.85 202 0.47 -23.1 -1.9 -0.6 -1.06 0.01 0.35 -0.2 -3.2: Error percentage of mean.01 0.08 0.01 0.7 41.99 2.02 -10.01 0.62 0.5 0.07 -13.05 0.4 -0.07 0.63 -0.49 -22.6 -0.1 -0.05 -0.2 -1.1 -3.09 -4.01 -0.01 -0.5 -27.92 -0.19 -10.23 -0. standard deviation.7 0.41 -0.26 0.74 -0.7 -1.31 3.32 Lin SSTA γ 95% -1.08 -1.04 -0.2 -3.07 -0.04 -0.56 -26.49 -0.47 -25.05 -0.3 -0. and 95-percentile point for variation setting (1).7 Fourier SSTA γ 95% -0.01 0.22 -0.36 -23 0.06 -0.6 38.34 0.01 -0.43 0.02 2.52 -20.25 68 -0.16 -6.44 1.03 -0.11 -0.24 -9.17 -0.08 -0.05 -12.06 -0.09 0.45 -0.5 -1.53 1/19814 0.39 -0.45 -26.4 0.01 0.28 -0.09 -0.01 -0.7 -3.29 -0.19 -14.

4 -0.27 -66.4 -0.13 -25.07 0.15 0.3 -1.76 -0.5 -0.15 -0.39 0.12 3.01 0.04 -3.03 0.02 -16.14 -0.05 -16.3 -0.9 127 -0.24 0.93 1.3 6477 23.29 0.02 0.17 -0.05 -0.12 -0.27 38.5 -0.35 -0.9 -1.01 0.2 -0.2 1.5 -0.08 0.67 0.1 -19.07 -0.2 -0.07 -2.06 0.27 0.48 0.41 -68.01 0.47 -0.01 0.19 2.09 0.4 0.2 -15.23 -0.82 -0.25 -0.15 0.95 -0.01 0.07 0.09 -0.01 0.03 1.8 0.22 -64.15 2.09 -13.16 -19.2 -0.01 0. standard deviation.28 0.59 Ave 0.01 0.84 -0.01 5.01 0.1 -1.26 s9234 2027 99 0.43 -0.05 -22 -1 -0.7 Table 5.2 -63.01 0.02 -0.75 -0.18 -0.16 -0.21 -0.24 0.3 -1.07 3.3 -0.71 -0.38 -0.19 -0.09 0.22 -1.05 1.14 -0.1 103 122 201 288 322 12 -0.05 -0.9 39 42.05 -0.03 -0.01 0.2 -0.95 2.01 -0.2 Fourier SSTA γ 95% 0.62 0.25 -0.22 -16.11 4720 11.52 s13207 2573 115 0.11 1.03 -0.06 0.2 -64 -1 -1.26 µ σ -0.11 3.4 -0.9 -1.01 0.5 -1.4 -0.84 0.02 -0.9 -1.11 -58.04 1/20497 0.38 -66.26 -16.2 -0.16 -58.01 -0.3 -0.45 -0.07 -58.8 -0.06 -0.8 0.08 -0.58 0.05 0.12 -12.38 -0.05 -0.07 -19.01 -0.27 -20.3 -1.bench G name s208 s386 s400 s444 s832 s953 s1238 s1423 s1494 61 118 106 119 262 311 428 490 588 N µ σ Quad SSTA γ 95% -0.13 2.25 -66.01 0.68 µ σ 0.38 1.03 -0.23 0.23 -62.64 18 -0.8 -1.14 -0.13 -0.7 -0.02 -3.07 -58.16 -14.24 -0. and skewness comparison for variation setting (2).03 4.81 -0.1 -0.6 0.08 -0.2 -1.11 -0.01 -0. .05 0.1 0.7 0.21 -58.17 -63.21 -0.35 -67.13 16.06 -6.17 -0.01 0.6 0.1 -0.11 -0.1 -0.32 4.04 -0.04 -0.3 -65.01 -0.4 0.05 -0.92 68 -0.5 Lin SSTA γ 95% T 0.7 0.01 -0.08 -0.07 0.05 0.99 1.01 0.96 68 -0.3 -71.08 -3.38 218 -0.19 -66.4 T 0.22 0.5 -0.01 0.04 0.25 -15 MC T 15.89 0.9 -0.09 -0.05 -23.18 -0.17 -0.09 -20.16 16.01 -0.22 1.08 -4.01 0.08 -0.19 -65.4 -1.01 0.65 25.04 -0.13 T µ Semi-Quad SSTA σ γ 95% -0.48 25 -0.44 1238 6.05 -0.43 0.9 -1.1 -0.39 -0.07 0.93 -0.01 -0.9 -1.1 -0.25 -12.28 -65 -0.15 -0.02 -0.42 -68.29 µ σ -0.03 0.9 0.05 -0.07 -0.63 T 0.12 -0.15 0.96 55 0.25 -0.04 0.35 -0.23 -16.21 -62.09 -7.1 Lin Gau γ 95% T 0.3 -59 -0.01 0.02 -0.08 -0.21 s15850 3448 135 0.34 -69 -1.01 0.6 -0.44 0.27 0.32 1/681 0.01 0.85 0.8 -0.02 0.18 -0.04 -13.17 1.04 -16.76 2795 9.04 -20 0.01 0.2 s5378 1004 82 0.1 -0.91 0.01 0.22 1.02 -0.9 -0.1 1/42392 0.14 -0.54 25 0.08 -0.02 5.01 0.05 -0.26 -0.02 202 0.14 -22.08 0.8 -0.3 -71.09 -0.01 0.19 0.14 0.6 -1.16 -70.6 0.18 0.29 -0.15 -16.04 -0.77 0.5 -0.46 5.2 19460 1/260 - s38584 11448 176 -0.4 -1.9 0.01 0.13 33 -0.38 -0.19 -66.75 0.27 -17.08 0.25 -16.31 -65.01 0.18 0.01 -0.03 0.09 -0.54 10.27 1.4 -0.3 -0.01 0.01 0.4 s38417 8709 176 0.27 43 -0.4 33.22 -0.09 -0.36 0.18 -0.08 -0.18 -16.01 -0.1 -0.09 -0.32 -0.1 0.59 0.1 -0.01 0.6 0.6 19097 25.26 65 1/37942 0.18 -13.62 0.7 -0.18 -0.41 -0.21 -0.9 -1.12 -0.3: Mean.22 65.01 0.01 0.05 -0.27 -68.34 -0.08 -0.19 -69.95 0.

5.7 Conclusions
In this chapter, we have proposed a new method to approximate the max operation of two non-Gaussian random variables using second-order polynomial fitting. It has been shown that such approximation is more accurate than the approximation using linear fitting through tightness probability. By applying such approximation, we present new SSTA algorithms for three different delay models, i.e., quadratic model, quadratic model without crossing terms (semi-quadratic model), and linear model. All atomic operations of these algorithms are performed by closed-form formulas, hence they are very time efficient. The computational complexity of both the linear delay model and the semi-quadratic delay model is linear to the number of variation sources, and that of the quadratic delay model is cubic (third-order) to the number of variation sources. Moreover, the computational complexity is linear to the circuit size for all three delay models. The SSTA with semi-quadratic delay model results similar error as the SSTA using Fourier Series [40] with almost 70X speed up. Moreover, compared to MonteCarlo simulation for non-Gaussian variation sources, the SSTA with quadratic delay model results the error of mean, standard deviation, skewness, and 95-percentile point is within 1%, 1%, 6%, and 1%, respectively.

128

CHAPTER 6 Physically Justifi able Die-Level Modeling of Spatial Variation in View of Systematic Across-Wafer Variability
Modeling spatial variation is important for statistical analysis. Most existing works model spatial variation as spatially correlated random variables. We discuss process origins of spatial variability, all of which indicate that spatial variation comes from deterministic across-wafer variation, and purely random spatial variation is not significant. We analytically study the impact of across-wafer variation and show how it gives an appearance of correlation. We have developed a new die-level variation model considering deterministic across-wafer variation and derived the range of conditions under which ignoring spatial variation altogether may be acceptable. Experimental results show that for statistical timing and leakage analysis, our model is within 2% and 5% error from exact simulation result, respectively, while the error of the existing distance-based spatial variation model is up to 6.5% and 17%, respectively. Moreover, our new model is also 6X faster than the spatial variation model for statistical timing analysis and 7X faster for statistical leakage analysis.

129

6.1 Introduction
In Chapter 5, we developed an efficient and accurate SSTA flow in which process variation is modeled as inter-die, within-die spatial, and within-die random variations. However, new measurement results show that the variation model in Chapter 5 is not accurate. Therefore, we developed a more accurate and efficient variation model and applied it on the SSTA flow presented in Chapter 5. There are several existing works that focus on analyzing and modeling process variation [6, 32, 172, 106, 108, 134, 24, 25, 173]. The simplest method models process variation as the sum of inter-die (global) variation and independent within-die (local random) variation [25]. Later, it was observed that within-die variation is spatially correlated and the correlation depends on the distance between two within-die locations. [6, 32] model spatial variation as correlated random variables, and principle component analysis is applied to perform statistical timing analysis. In this model, a chip is divided into several grids and each grid has its own spatial variation. The spatial variations of different grids are correlated and the correlation coefficient depends on the distance between two grids. [172, 173] focuses on the extraction of spatial correlation and it models the correlation coefficient as a function of distance. Several more complex spatial correlation models have been proposed in [45, 105, 189, 55, 64]. In contrast to the spatial correlation models, process oriented modeling has concluded that within-die spatial variation is caused by deterministic across wafer and across-field variation while purely random within-die spatial variation is not significant [190, 54, 50]. However, in practical design flow, designers do not know the within-wafer location or within-field location of each die; therefore, we need to analyze the impact of across-wafer variation and across-field variation on die-scale. Since silicon measurements cited in this chapter indicate that across-wafer variation is much more significant than the across-field variation, we consider only across-wafer varia-

130

tion in this chapter, but the approach can be easily extended to account for across-field variation. In this chapter, we first analyze the impact of deterministic across-wafer variation on spatial correlation. We observe that when quadratic across-wafer variation model is used as in [54, 186, 125]: 1. Different locations of the chip may have different mean and variance. Such differences increase when the chip size increases. 2. When chip size is small, the correlation coefficients for a certain Euclidean distance are within a narrow range. This explains why most existing works find that spatial correlation is a function of distance. 3. Within-die spatial variation is NOT spatially correlated when across-wafer systematic variation is removed. 4. Within-die spatial variation is NOT independent from inter-die variation. 5. If chip size is not large, the two-level inter-/within-die decomposition of process variation is still very accurate. Based on our analysis, we propose three accurate and efficient spatial variation models considering across-wafer variation. We then apply our new model to the SSTA flow introduced in Chapter 5. Experimental results show that our model is more accurate and efficient compared to the distance-based spatial variation model in [172, 173]. Compared to the exact simulation, the error of our model for statistical timing analysis is within 2% and the error for statistical leakage analysis is within 5%. On the other hand, the error of the distance-based spatial correlation model is up to 6.5% for statistical timing analysis and up to 17% for statistical leakage analysis. Moreover,

131

Redeposition effect in 1 Overlay error can directly impact critical dimension in double patterning. wafer stage vibration.6 and statistical leakage analysis in Section 6. the new models are applied to statistical timing analysis in Section 6.3 analyzes the impact of across-wafer variation on die-scale. 132 .5 introduces the new variation models. For example. Section 6. wafers are often rotated to increase process uniformity across them which further leads to radial behavior of non-uniformity.2 Physical Origins of Spatial Variation In silicon manufacturing. and the distortion of the wafer with respect to the exposure pattern [139]. Section 6. Section 6. most of these processes by the very nature of the equipment follow a radially varying trend across the wafer.our model is 6X faster than the distance-based spatial correlation model for statistical timing analysis and 7X faster for statistical leakage analysis.2 discusses the physical causes for across-wafer variation. 132].9 concludes this chapter.4 discusses the case when the across-wafer variation is not a perfect parabola. overlay error includes errors in the position and rotation of the wafer stage during exposure. 6. species depletion and temperature non-uniformity on the wafer at lower temperatures may cause thickness non-uniformity [159. there are many steps that cause non-uniformity in devices across the wafer. Interestingly. and finally Section 6. Most processes are “center-fed” or “edge-fed” with the boundary conditions at the edge of wafer being substantially different. Moreover.8 summarizes the advantages and disadvantages of different variation models.1 During chemical vapor deposition (CVD) step.7. Magnification and rotation components of overlay error increase from center of the wafer outwards. The rest of this chapter is organized as follows: Section 6. This is further exacerbated by advent of single-wafer processing for 300mm wafers. Section 6.

Process 1 is with 45nm technology and process 2 is with 65nm technology. Moreover. Since the amount measurement data of for process 2 (more than 300 wafers) is much more than process 1. ring oscillator frequency decreases from the center to the edge of the wafer. we see that for both process. 27] show that the etch rate varies radially across the wafer: the etch rate is high at the center of the wafer and decreases toward the edges.1 shows industry data of ring oscillator frequency for wafers from two different industry processes. center peak shape of the RF electric field distribution [77] also leads to a center peak shape of etch rate. 54. and chamber wall conditions [78] also cause etch rate non-uniformity. it has also been shown that there is no spatial correlation for threshold voltage variation [190]. Therefore. it follows a systematic trend and the ring oscillator frequency decreases from the center to the edge of the wafer. All these processes cause a systematic across-wafer variation in physical dimensions. For process 2. all of our simulation and 133 . in the rest of this chapter. [78. Figure 6. Moreover. 186. the across-wafer frequency and leakage variation is mainly caused by gate length variation. Post-exposure bake (PEB) temperatures are higher at the center of the wafer and decrease outwards [185]. Across-wafer variation of gate length observed in several recent silicon measurements [153. It has been shown that for process 1. In real processes. However. From the figure. the acrosswafer variation is not a perfect parabola as process 1. 125] validates our arguments. [126] also shows that ring oscillator frequency and leakage current decrease from the center to the edge of the wafer. the across-wafer frequency variation can be approximated as a quadratic function (a parabola) [126]. other processes ranging from resist coat to wafer deformation due to vacuum chuck holding it follow a bowlshaped trend across the wafer.physical vapor deposition (PVD) [27] may cause non-uniformity. the wafers are rotated to improve uniformity. Similarly.

Besides across-wafer variation. for simplicity. Across-die variation can be modeled as within-die deterministic mean shift and will not cause within-die spatial correlation. 186.Figure 6. (a) Process 1. silicon measurements cited in this chapter indicate that across-wafer variation is much more significant (probably due to advancements in resolution enhancement and lithographic equipment) than across-field and across-die variation. we consider only across-wafer variation in this chapter. Hence. (b) Process 2. 125]: v p = ax2 + by2 + cxw + dyw + mw + m f + md + mr w w (6. we assume the acrosswafer variation to be a quadratic function as in[24. Moreover. lithography-induced effects such as lens aberrations can lead to systematic across-field variation and across-die variation. experiments are based on the measurement result of process 2. 6.3 Analysis of Wafer Level Variation and Spatial Correlation To analyze the impact of across-wafer variation on die-scale.1) 134 . 54.1: Ring oscillator frequency within a wafer.

we may model xc and yc as random variables which are evenly distributed in the center at (0.2.1). and d are fixed for a process. b. In real design flow. inter-wafer. which are obtained from fitting the measurement data from industry process as shown in Figure 6. yc ) is not known to designers. such as Le f f . In the rest of this section. yc ) and (x . we assume that a. y) (6.1. we consider only across-wafer variation and ignore these two types of variations (m f and md ) in this chapter. c. and (xw . m f and md are the across-field and acrossdie variation. yw ) is across-wafer location.3 summarizes the mathematical notations used in this section. y) in a die (assuming the coordinate of the center of the die to be (0. In this section. 6. In the rest of this section. c.3. yc ) wafer coordinates. b.where v p is a variation source. mw comprises inter-die random. these coefficients may vary slightly for wafer-to-wafer or lot-to-lot.4. it is easy to find that for a die. the die location in the wafer (xc .3. Therefore. inter-lot variation and the fitting error of quadratic fitting as in (6. We will further discuss this in Section 6. mr is the random noise. whose center lies on (x c . as discussed in Section 6. y ) are in Table 7. respectively.1). and d are function coefficients.2) The definitions of (xc .1 Variation of Mean and Variance with Location According to (6. the variation at location (x. a. 0) 135 . all simulations are based-on this extracted model. 0)) is: v p (x. y) = a(xc + x )2 + b(yc + y )2 + c(xc + x ) + d(yc + y ) + mw + mr (x. In practice. we will analyze the spatial variation based on the above model. We obtain the coefficients of the above across-wafer variation model by fitting the industrial 65nm process measured ring oscillator delay with 300 wafers from 23 lots. Table 7. We can convert the wafer-level systematic variation model to a die-level model by noting that the dies are always distributed evenly on the wafer.

y1 ) and (x2 .1: Notations.c.y) ω v p (x.b.d (lx .Symbols (xc .y) x = x cos ω + ysin ω lx = lx cos ω + ly sin ω ly = lx sin ω − ly cos ω x = ax + c/2 y = by + d/2 lx = ax + c/2 ly = by + d/2 rdµ = rdσ = x 2 /a + y 2 /b x 2 +y 2 y = x sin ω − ycos ω Euclidean distance between (x1 .ly ) (xw . 136 .y2 ) δ= (x1 − x2 )2 + (y1 − y2 )2 rm = rm = rm = 2 2 lx + ly rm rm rm k0 k1 k2 α β in the wafer s0 vg vs vl lx2 + ly2 lx 2 + ly 2 4 4 k1 = rw (a2 + b2 )/16 − rw ab/24 + σ2 m 2 k2 = k1 /rw 2 k0 = rw (a + b)/4 − c2 /4a − d 2 /4b α = x1 x2 + y1 y2 2 β = σ2 /rw r 2 2 2 2 s0 = cos2 ω(alx + bly ) + sin2 ω(blx + aly ) Inter-die variation Within-die spatial variation Within-die random variation Table 6.yw ) (x.y) x y lx ly x y lx ly rdµ rdσ δ Description Location of the center of the die in the wafer wafer radius Inter-die variation Within-die random variation Variance of mw Variance of mr Across-wafer variation coefficients x and y dimension die size Within-wafer location Within-die location Angle between the die and wafer coordinate Variation of within-die location (x.yc ) rw mw mr σ2 m σ2 r a.

rs follows triangle distribution ranging from 0 to rw . mw . rdµ.1)).2(a) and 6.7) The locations farther off from (x0 . Figures 6.5). it is interesting to note that different die locations may have different means and variances3 . From (6. the mean and variance will be the same for all locations on a die. c and yc are not independent.6) (6. If the across wafer variation function is a piecewise linear function. y): 2 µv p (x. rdσ . is expressed as a function of four random variables: xc . y0 ) will have larger mean and variance. the variation at location (x. and mr (x. 2x 137 . y . and k1 are defined in Table 7. In order to generate samples of x c and yc .5) where x . y0 ) having the smallest mean and variance is given by: x0 = −c cos ω/2a − d sin ω/2b y0 = d cos ω/2b − c sin ω/2a (6. k0 .2(b) illustrate the mean and variance of ring oscillator delay for different r dµ (or rdσ ). y). We may easily calculate the mean and variance of v p (x.3.y) = k1 + rw (x 2 + y 2 ) = k1 + σ2 + rw rdσ r v (6.3) In this case.4) and (6. y). we may assume xc = rs cos θ and yc = rs sin θ.4) (6. The location (x0 . v p (x.with radius rw (radius of the wafer). where rs and θ are independent. θ follows uniform distribution ranging from 0 to 2π 3 Such difference is caused by the nonlinearity of the across-wafer variation function (we assume quadratic function as in Equation(6. yc . y).y) = k0 + x 2 /a + y 2 /b = k0 + rdµ 2 2 2 σ2p (x. It is easy to see that xc and yc are identical and their PDF is 2 : PDF(xc ) = 2 2 rw − x2 /(πrw ) c −rw < xc < rw (6.

lx .4 0.S.0.5 0. we obtain the upper bound and lower bound of the correlation coefficient for a certain Euclidean distance: ρ ≤ ρu = ρ ≥ ρl = 1− 1− δ2 k2 + δβ/2 + 2βk2 + β2 (k2 + β)2 + 2rm2 (k2 + β) + rm4 δ2 (k2 + rm2 − δ2 /4 + β) + β(2k2 + rm2 ) (k2 + β)2 + δ2 (k2 + β)/2 + δ4 /16 (6. we may also calculate 138 . and rm are defined in Table 7.8) With covariance and variance calculated as above. we may obtain the correlation coefficient as: ρ= 2 k2 + 2k2 α + α2 2 2 2 2 (k2 + β)2 + (rdσ1 + rdσ2 )(k2 + β) + rdσ1 rdσ2 (6. rdµ (b) σ2 V. y2 ). and k2 are defined in Table 7. From (6. rdσ Figure 6.5 x 10 −5 4 µ σ2 3.2 Appearance of Spatial Correlation Besides mean and variance.10) (6.2 0.6 dσ (a) µ V.2: Mean and variance for different rd .4 rdµ 0.9). β.S. From (6.2 r 0. ly . we may calculate the covariance as: 2 Cov = k1 + rw (x1 x2 + y1 y2 ) (6.3.2).6 0 0. y1 ) and (x2 . we are also interested in the covariance between two locations (x1 .3.11) where δ is the Euclidean distance between (x1 . From the upper bound and lower bound.3.9) where α.1296 4. 6. y1 ) and (x2 . y2 ).1292 0 0.

3(b).94 ρ 0. From the figure. that is k2 fore. the within-die variations of opposite corners are negative correlated. 4 In 139 .5 1 Distance (cm) 1.86 0 Exact Upper Bound Lower Bound 0. 0. Figure 6.98 0. Moreover. This is because of the differences of variance across the die.92 0.9 0. the upper bound and the lower bound.88 0.12) 2 rm . the withindie variation of the opposite corner must decrease. if they lie on the opposite side of the center. the range decreases.5 UB 4 x 10 −5 3 0 0. We the figure.the range of correlation coefficient: ρu − ρl ≤ 2 2 rm /(rm + k2 + β) (6.3(a) illustrates the exact data for 40 locations. Moreover even when two locations are very close.5 1 Distance (cm) LB Cov 3. That means. when the within-die variation of one corner increases. Although the correlation coefficient is within a narrow range. the correlation coefficient can be a negative number when distance is large. This explains why most existing works [172. their correlation is still near zero. 45] find that spatial correlation is a function of distance. This is because after subtracting the mean. covariance is not.96 0. 173. the range of correlation coefficient for a certain distance is very small.5 (a) ρ (b) Covariance Figure 6. Figure 6. there- Notice that usually the wafer size is much larger than the die size. we also find that when the variances of the inter-die residual and within-die random variation increase.4 shows the correlation coefficient for within-die variation after subtracting the mean variation of the die (mainly caused by across wafer variation) 4 . we find that the range of ρ for a certain distance is very small. from the above equation. as shown in Figure 6.3: Apparent spatial correlation and covariance as a function of distance.

observe that the within-die spatial variation is almost NOT spatially correlated. 1 0.5 Distance 1 1. This further validates that the spatial variation is caused by systematic across-wafer variation.13) where vg is the inter-die variation.4: Correlation coefficient for within-die spatial variation after inter-die variation is removed.5 −1 0 ρ 0. we may also calculate the inter-die.3 Dependence between Inter-Die and Within-Die Variation In most existing variation models. and vl is the within-die variation. as empirically observed in [172.5 0 −0. 173]. 6. and within-die random variation. process variation is decomposed into inter-die. and vs is the residual. With the variation model in (6. and within-die random variation: v p = vg + vs + vl (6. within-die spatial. vs is the within-die spatial variation. vs . Inter die variation is calculated as the varia- 140 .5 Figure 6.3. vl is the pure random local variation. within-die spatial. Usually vg is modeled as the variation of the chip mean. vg .2). and vl are assumed to be independent.

we may treat such approximation error as noise and the process variation as signal. In order to do this. we find that both inter-die and within-die spatial variations are functions of random variables xc and yc . 6. |x|<lx /2 |y|<ly /2 Within-die random variation is the local random variation: vl = R.5(a). y) = v p (x. and then evaluate the signal to noise ratio. If we only consider inter-/within-die variation. To evaluate the impact of the approximation error.tion of the chip mean: vg = 1 lx ly ZZ v p (x. we may lump the across-wafer variation into inter-die variation.15) where s1 is defined in Table 7. as shown in Figure 6.14) = ax2 + by2 + cxc + dyc + mw + s0 c c where s0 is defined in Table 7. approximate the across-wafer variation as a piecewise constant function. y)dxdy (6. From the above equations. The signal to noise ratio when ignoring the spatial variation 141 . we analyze the accuracy of the simple two-level inter-/within-die variation model for different chip sizes. Hence. we may not decompose process variation into independent inter-die and within-die spatial variation.4 When can Spatial Variation be Ignored? In this section. that is.3. we calculate the mean square approximation error and the total variance of variation. and within-die spatial variation is calculated as the remaining variation: vs (x.3. y) − vg − vl 2 = rdµ + 2ax xc + 2by yc − s1 (6.3.

4 General Across-Wafer Variation Model In the previous section. the SNR is up to 100. When chip size is small. MSE is small.S. That means.is given as: 2 SNR = σtotal /MSE 4 2 6abrw + 6(c + d)rw + σ2 + σ2 M R ≈ 2 2 2 abrw (lx + ly ) + 2(c + d)lx ly (6. Figure 6.16) It can be seen that MSE depends on chip size. We also observe that when chip size (l x and ly ) is smaller than 1cm. 10 Exact Piecewise SNR 10 10 10 10 10 5 4 3 2 1 0 0 1 2 3 Chip size (cm) 4 5 (a) Piecewise constant (b) SNR V. hence such approximation is accurate. Figure 6. we assumed that the across-wafer variation is a quadratic function as shown in Equation 6. two-level inter-/within-die variation model is accurate. This is because we approximate the across-wafer variation as a piecewise constant function with small steps.5(b) illustrates the SNR for different die sizes.5: Approximating across-wafer variation.2. In practice. chip size. It can be seen that the SNR decreases when die size increases as expected. 6. across-wafer variation may not be an exact 142 .

there will be some fitting residual after subtracting the across-wafer parabola: v(x. However.7 illustrates the distribution of sx and sy obtained from measurement data of process 2.2) and Vr is fitting residual. respectively. In the previous section. From the figure. the across-wafer variation may be slightly different for different wafers. such trend will also introduce some spatial correlation. Moreover. Notice that when we model the fitting residual as a linear within-die variation trend. we may model sx and sy as random variables. In order to model this trend. we observe that for each die. Figure 6.6(b) illustrates the fitting residual of a wafer delay variation after subtracting the quadratic across-wafer variation function. there is a systematic trend: If one corner of the die is faster. and Figure 6. which can be lumped into inter-die random variation.18) where sx and sy are x-dimension and y-dimension slope of within-die trend. we find that both sx and sy are almost uncorrelated and they both follow Gaussian distribution. We also find that the trends of within-die variation for each die are different for different dies. Therefore. From the figure. two more random variables sx and sy are introduced. the fitting residual contains not only inter-die random variation but a systematic trend of within-die variation. we approximate the fitting residual as a linear function of within-die location: vr (x .parabola. we assume that the fitting residual is lumped into inter-die random variation. This makes the variation model 143 . then the opposite corner is slower.17) where v p is the quadratic across-wafer variation model as shown in (6. and mw is inter-die part of the residual. y ) = sx x + sy y + mw (6.6(a) illustrates the original delay variation across the wafer. In this case. Figure 6. From the die point of view. y) = v p + vr (6.

102592 0 50 100 150 0 −0. Figure 6.0120795 0.0028256 50 −0. 150 150 0.142611 0.13422 100 0. (b) Residual after subtracting Equation(6. In practice.126474 100 0.more complicated. we assume that all this within-die pattern to be independent random variation.000989223 −0.118083 50 50 −0. seems that there is a systematic within-die variation pattern.0147271 0 50 100 150 (a) Original delay variation.00986836 0 50 100 150 (c) Residual after subtracting (6.18).6(b) 5 . as illustrated in Figure 6. 150 0.0055146 0.00832543 0. the remaining variation is almost uncorrelated. 5 It 144 .110338 −0. When the across-wafer variation is a perfect parabola and the fitting residual is not significant.0191914 0.18). such within-die variation pattern can be modeled as mean shift. we also observe that after subtracting the model of fitting residual in (6. In addition. we may just lump the fitting residual in to inter-die and random variation.00816225 0 0.00634698 0 −0.6: Ring oscillator frequency within a wafer.00159736 0.2).0045106 100 0. In this chapter.

5 Modeling Spatial Variability As discussed in Section 6.1800 2500 S PDF x 1600 S PDF y Gaussian Approx 2000 Gaussian Approx 1400 1200 1500 1000 800 1000 600 400 500 200 0 −2 0 2 4 6 8 10 12 x 10 14 −4 0 −5 −4 −3 −2 −1 0 1 2 3 4 5 x 10 −4 (a) PDF of Sx (b) PDF of Sy Figure 6.3.1. • Quadratic Across-Wafer variation model (QAW). Hence. 145 .2) we obtain a general die-level across-wafer variation model: v p (x. modeling the within-die variation as spatial-correlated random variable is not accurate as discussed in Section 6. y) (6.7: PDF of Sx and Sy . In this section.18) and (6. spatial variation largely comes from the deterministic across-wafer variation. y) = a(xc + x )2 + b(yc + y )2 + c(xc + x ) + d(yc + y ) + sx x + sy y + mw + mr (x. we introduce three new spatial variation models considering acrosswafer variation: • Slop Augmented Across-Wafer variation model (SAAW).19) 6. Combining (6.

and mr .4. In the following of this section. the die location (xc . inter-die random variation mw . Larger chip needs more variables. We refer to this new model as Slope Augmented Across-Wafer variation model (SAAW) . (6.5. In the equation. the number of spatial variation sources depends on the number of grids. process 1 as discussed in Section 6. the fitting residual is not significant6.5. xc and yc . when the across-wafer variation is a perfect parabola.19) to (6. our new model not only models the across-wafer variation accurately but also is more efficient than the traditional spatial correlation model.3).2). die location within the wafer xc and yc . Therefore. 6 For example.19) provides a new spatial variation model. we will discuss these models in detail: 6. 146 . for the traditional distance-based spatial variation model. Notice that in SAAW model. However. there are only six random variables. In this case. and slope of fitting residual sx and sy . We refer to this model as Quadratic Across-Wafer variation model (QAW).2. there are only four random variables. Since QAW does not consider fitting residual. it is not as accurate as SAAW. 6. mw . within-die random variation mr . yc ) are modeled as random variables and their PDF is shown in Equation (6.2 Quadratic Across-Wafer Model As discussed in Section 6.19) calculates the variation for a given location (x. we may just lump the fitting residual into inter-die random variation and simplify (6. y).• Location Dependent Across-Wafer variation model (LDAW).1 Slope Augmented Across-Wafer Model (6.

1. 6.3. y) ≈ md + µv p (x.4. To further improve the accuracy of inter-/within-die variation model. µv p (x.4) and (6. Hence compared to inter-/within-die model. y) are deterministic value for a certain within-die location (x. as discussed in Section 6. µv p (x. inter-wafer random. y). applying the two-level inter-/within-die variation model does not introduce much error. y) is within-die variation including withindie random variation and residual of across-wafer variation. LDAW model has only two random variables: md and mr which is the same as two level inter-/withindie model.3.5. y) + σv p (x. y) and σv p (x. Notice that in Equation 6.3 Location Dependent Across-Wafer Model As discussed in Section 6. mr (x. y) (6. which can be calculated from (6.6 Application of Spatial Variation Models on Statistical Timing Analysis In this section. However. y) and σv p are mean and variance difference at different locations of a chip.20) where md is inter-die variation including inter-lot random. Therefore. LDAW has similar efficiency but higher accuracy because LDAW considers mean and variance difference across the chip while inter-/within-die model does not. y)mr (x.5).20. we account for this: v(x.6. we apply our across-wafer variation models to statistical static timing analysis. inter-/withindie variation model still does not consider the mean and variance difference at different locations of a chip. when die size is small enough. 147 . interdie random. We refer to the above model as Location Dependent Across-Wafer variation model (LDAW). and across-wafer variation.

when applying LDAW.6. Therefore.3 to estimate the chip delay variation. each variation source is a quadratic function of random variables.6.114). cell delay is a quadratic function of random variables. as shown in (5. we may apply the quadratic SSTA flow in Section 5. the cell delay becomes a 4th order function of random variables. Therefore. In this case. Sometimes. cell delay can be modeled as a quadratic function of variation sources. applying SAAW or QAW results in a quadratic function of random variable and hence quadratic SSTA or semi-quadratic SSTA can be applied. cell delay model can be simplified as a linear function of variation sources. we may also ignore the crossing terms and apply semi-quadratic SSTA to improve efficiency. Semi-quadratic SSTA greatly improves the speed with only a little accuracy loss compared to quadratic SSTA.6.19) and QAW (6. Notice that moment matching approximation is performed only once for all cells and does not increase the run time of SSTA. by applying SAAW or QAW on quadratic cell delay model. our SSTA flow in Chapter 5 can only handle a quadratic function of random variables. LDAW is a linear function of variation sources. as discussed in Section 5. Moreover. hence quadratic-SSTA can be applied.3 to match the 4th function to a quadratic function of random variables by matching the mean and first two order joint moments. 148 .2).1 Delay Model As discussed in Chapter 5. And then. However. we apply the moment matching technique in the way as Section 5. Therefore.46). in this application. Therefore. as shown in (5. and applying LDAW results in a linear function of random variables where linear SSTA can be applied. For SAAW in (6.

173].6. fitting residual coefficients sx and sy 9 .2 Experimental Result We have applied our new model to the SSTA flow introduced in Chapter 5. In the experiment. b. In the experiment. We assume random placement for ISCAS85 circuits. We obtain the across-wafer coefficients a. and then calculate the mean and x y variance of sx and sy for all chips. We apply all the above methods to the ISCAS85 suite of benchmarks in PTM 45nm technology[124]. which is referred to as SPatial Correlation model (SPC). inter-die. 2) distance-based spatial correlation model from [172. and within-die variation as percentage with respect to the nominal value. we only consider logic cell delay when calculating the full chip delay variation. three comparison cases are defined: 1) MonteCarlo simulation with the exact deterministic across-wafer variation model 7 .19). which is referred to as Inter-/Within-die variation model (IW). 7 In 149 . Since process variation has smaller impact on interconnect delay than on logic cell delay. Then we assume that the percentages of all the above coefficients to nominal value are the same at 45nm technology node and 65nm technology node. and d 8 . standard deviation of random inter-wafer. we consider the gate length variation obtained from minimum square error fitting on the ring oscillator delay from industrial 65nm process (Process 2 as discussed in Section 6. which is the golden case for comparison. c. 9 We obtain the slope of fitting residual s and s for each chip.6. we assume sx and sy to be Gaussian random variables with mean and variance obtained from the measurement data.2) measurement from the model as shown in (6. 8 We apply quadratic function to fit the across-wafer variation for each wafer to obtain the fitting coefficients for each wafer. the simulation. then use average coefficients of all wafers for our experiment. We have simulated 318 wafers correspondent to 318 measured wafers. 3) two-level inter/within-die variation model. each wafer may have different across-wafer variation which is obtained from measurement data of process 2. In order to verify the efficiency and accuracy.

we also compare the results of applying quadratic SSTA with crossing terms (SAAW Quad and QAW Quad) and applying SSTA without crossing terms (SAAW SQuad and QAW S-Quad). the impact of spatial variation on delay is not significant within the circuit. This is because in our experiment.we assume the benchmarks are stretched on a 2cm×2cm chip. • The error of SAAW using semi-quadratic SSTA is within 2% while the error of spatial correlation model is up to 6. • SAAW is more accurate than QAW.2 illustrates the percentage error of mean (µ). In the table. for the SPC model. we divide the chip to 10 × 10 = 100 grids. The error is calculated as error of different variation models compared to the golden case simulation. • Compared to quadratic cell delay model. the cell delay variation is well 150 . In our experiment. curacy loss. Since ISCAS85 benchmarks are very small. Table 6.1 Full Chip Delay In the experiment. and 95-percentile point (95%) and run time (T) of different variation models. we also compared the result of using quadratic cell delay model (Quad) and linear cell delay model (Lin).6. From the table. In order to show such impact.5%. We only use quadratic cell delay model for golden case simulation (exact). semi-quadratic SSTA (SSTA without crossing terms) achieves up to 8X speed up with less than 1% accuracy loss.2. linear cell delay has less than 2% acapproximated by a linear function. This is because the fitting residual is significant for the measurement data.6. QAW ignores fitting residual and hence introduces more error. standard deviation (σ). For SAAW and QAW. we have the following observations: • Compared to full quadratic SSTA. we assume that the chip size is 2cm×2cm and the wafer radius is 15cm.

• For linear cell delay model. we assume linear cell delay and apply semi-quadratic SSTA for all experiments. However. in the following experiment. than others. both model has much larger error cantly more accurate than IW with no runtime penalty. 10 There 151 . result- • LDAW and IW are very efficient. Since linear cell delay model and semi-quadratic SSTA are accurate. In the rest of this subsection. LDAW is signifi- are 100 correlated spatial random variables. since SAAW is more accurate than QAW with only a small run time overhead. we do not consider the QAW model in the following experiments. This is because both model ignore correlation. SPC. Moreover. while SAAW has only 6 random variables. This is because there are 100 grids in the spatial correlation model. SAAW achieves about 6X speed up compared to ing in 37 spatial random variables10 . we apply PCA to truncate some insignificant principle components and there remains 37 significant principle components.

SAAW and QAW will give similar result as LDAW.4. we find that when chip size is small. upper left corner (UL). its delay variation may be different.3. and SPC will give similar result as IW. Table 6. and then calculate the delay variation with location. the accuracy improvement is limited) compared to IW. we only considered big chips.1. LDAW is always better than IW. when a block is placed at different locations on a chip. especially for big chips. and upper right corner (UR). As discussed in Section 6. 11 When 152 .In the above experiment. in this experiment. Therefore. From the table. the design is separated into several blocks and each block only occupies a small region of a chip. the impact on across-wafer variation at die level is not significant. we find that the circuit is in a small region. Table 6. However. we only compare two models LDAW and IW 11 .6. the impact of spatial variation on delay is not significant within the circuit. when the chip size is small.3. we perform delay estimation of ISCAS85 benchmarks stretching on different size chips. the critical path is within a small region instead of spanning all over the chip. Since ISCAS85 benchmarks are very small. Considering that LDAW has similar run time but is more accurate (although when chip size is small. 6.2. in real design.2 Delay of Blocks on Different Locations on a Chip The above experiment assumes that the benchmarks are stretched on a chip. we assume that the ISCAS85 benchmark circuit is placed (no stretched) in different locations on a chip: center (C). LDAW and IW are accurate. Therefore. In order to verify this. lower right corner (LR). lower left corner (LL). In this case. From the table.3 shows the percentage error for different models under different chip size.4 compares the percentage error of LDAW and IW for ISCAS85 benchmarks placing on different locations on a 2cm×2cm chip. In order to show such effect. different chip locations may have different mean and variance. As discussed in Section 6.

9 9 -1.7 -3.8 10 -0.5 -1.2 15 c3540 Quad 25. σ.0 -11.1 -4.0 135 -3.2 435 -0.0 146 -0.3 -1.1 77 -1.9 -3.9 16 -3.43 34.3 -2.3 54 -1.8 -4.7 -0.0 9 -2.5 -5.4 -7.5 -0.9 27 -1.5 12 153 Lin Lin c7552 Quad 48.3 10 -0.3 -8.3 +0. ∗ for LDAW.7 -6.7 3.0 4210 -1.8 -1. linear SSTA is applied when assuming linear cell delay model.1 -1.5 -3.3 433 -3.9 -1.5 -8.4 -5.6 -10. SPC.2 76 -0. Run time (T) is in ms.8 1450 -2.5 -1.7 202 -2.delay mark model µ Lin - Exact σ 95% µ SAAW Quad σ 95% T SAAW S-Quad µ σ 95% T µ QAW Quad σ 95% T QAW S-Quad µ σ 95% T µ LDAW ∗ σ 95% T µ SPC∗ σ 95% T µ IW ∗ σ 95% T c1908 Quad 17.5 -4.2 -5.8 -6.9 -2.4 109 -1.9 6.9 8 -0.8 -1.3 -2.2 -3.7 27 -0.6 -3.1 -1.4 115 -1.2 -2.4 -1. .9 -8.9 25 +0.4 +0.5 -0.3 -4.0 20 -1.7 48 -2.0 -6.3 -3.3 -6. and 95-percentile point for exact simulation is in ns.5 -4.9 18 -2.2: Delay percentage error for different variation models.8 -4.5 -6.5 +0.9 -8.2 -0.6 101 -1.1 53 -1.9 -3.1 -1.2 -3.1 -6.9 -9.3 -1.8 -1.2 -1.4 -4.8 79 -3.8 -1.5 -10.0 26 -0.6 2.7 212 -0.9 +0.9 -1.2 -4.Bench. and IW.4 -0.6 -0.1 36 +0.6 13 -1.6 430 -1.6 +0.9 -2.9 -2.8 -2.4 150 -1.6 -4.4 -3.6 -1.6 35 -0.2 19 -2.6 -1.9 -1.5 22 Table 6.8 -0.1 -3.4 -1.6 -7.6 -3.6 -4.5 -6.9 8182 -2.2 -8.0 -2.47 64.4 -1.1 10 -2.1 -7.9 67 -0.7 -1.4 -1. Note: the µ.6 -1.2 209 -1.27 24.5 -1.

From the tables.2 -3.9 -1.9 -1.5 -3.2 -7.4 -3.9 -1.1 -2.1 -2.3 -1.5 -6.2 -1.0 -1.0 -1.2 -1.9 -1.6 -2.5 -2.47 64.0 -2.Bench.8 3.0 -1.3 -1.47 64.9 6.4 -2.44 34.8 -2.9 -3.0 -1.1 -1.7 -3.45 34.8 -1.6 -2.9 -1.2 -4.6 2.9 6 25.9 6. as discussed in Section 6.3 -2.1 -2. 6mm×6mm.7 -2.7 -0.0 -1. The values of exact simulation are in ns.6 -0.5 -1.8 -1.5 -5.3: percentage error for ISCAS85 benchmark stretching on a chip with different chip size.0 -1. Note: We assume square chips and chip size means edge length in mm.6 -6.7 shows percentage error of LDAW and IW for ISCAS85 benchmarks placing on different locations on a 1cm×1cm.3 3 17.1 -2.6.8 3. This is because LDAW predicts different mean and variance for different location correctly.4 -6.23 24.3 -4.9 -2.2 -1.1 -4.5 -1.8 -5.2 -2.3 -0.3 6 48.1 -1. and 3mm×3mm chip.9 -1.5 2. 154 .1 -1.0 -1. Table 6.1 -1.3 -1.9 -1. respectively.1 -3.7 -1.2 -2.2 -4.5 -1.2 -1.2 -2.5 -2.6 -1.4 -2.22 24.0 -1.2 -5.5 while IW can only give the same mean and variance for all locations.4 -2. and 6.2 -3.1 -1.2 -1.2 -0.8 -3.1 -1.3 -2.7 -3. 6.3 -1.9 6.5 -2.6 Table 6.5 -4.5 -3.8 -7.42 34.7 3 25.1 6 17.24 24.6 3.5. the estimation of LDAW is within 1% error from the exact simulation and the error of IW is up to 8%.3 3 48.4 2.0 -1.7 -1.3 c7552 10 48.3 -1. we find that the error of IW becomes smaller when chip size is small.4 -0.4 -4.2 -1.2 -5.2 -2.1 -2.2 -1.0 -1.Chip mark size µ Exact σ 95% µ SAAW σ 95% µ LDAW σ 95% µ SPC σ 95% µ IW σ 95% c1908 10 17.9 -2.2 -4.5 -1.47 64.6 -1.5 c3540 10 25.8 -3.4 -0.4 -1.5 -3.5 -6.6 -1.8 -1.4 -1.

2 +0.4 +0.37 60.3 -0.4 +0.2 -0.8 -1.1 -0.6 -2.1 LR 48.4 +0.3 +2.2 -0.1 -0.2 c7552 C 48.0 +0.11 58.3 UR 49.4 -0.3 +1.4 +2.4 -1.1 -1.4 +1.5 -1.3 +0.2 6.36 33.8 LL 24.1 +0.3 -0.2 -0.3 -0.3 -3.4 +0.4 -0.3 +0.0 -4.2 +0.9 -2.2 +1. Note: The 155 .6 +0. Note: bench locamark tion µ Exact σ 95% µ LDAW σ 95% µ IW σ 95% c3540 C 25.3 +0.1 6.5 3.35 60.30 32.8 +0.5 -1.5: Delay comparison for ISCAS85 benchmark in 1cm × 1cm chip.45 61.29 32.6 -0.0 +0.1 LR 49.5 +0.1 -0.2 -0.7 +1.42 60.0 -7.4 The values of exact simulation are in ns.34 33.8 UL 26.3 +0.3 -0.2 -0.2 -0.2 +1.3 +0.5 +0.1 -1.9 +0.4 UR 27.2 -0.4 +0.91 65.2 6.3 -0.9 +0.1 3.9 -0.0 -4.43 61.3 -0.1 3.9 6.3 +0.2 -2.3 +2.3 -0.6 LL 48.9 +1.26 31.8 values of exact simulation are in ns.0 3.2 3.bench locamark tion µ Exact σ 95% µ LDAW σ 95% µ IW σ 95% c3540 C 25.0 -4.4 +1.25 31.6 -0.4 +0.6 +1.7 -0.0 -1.4 -0. Table 6.0 -0.4 LR 25.6 -1.4 3.6 +4.7 +0.2 +1.8 +0.51 62.4 -0.5 +0.0 UL 49.6 -6.9 -3.2 -0.2 c7552 C 48.7 +3.65 63.4 +0.5 6.2 6.1 -0.9 LL 25.3 +3.8 -2.0 -0.1 -0.4 6.1 -4.0 +1.0 -1.9 +0.8 3.38 60.2 +0.3 -0.3 -0.1 -1.3 -0.1 6.6 6.35 33.2 +0.2 -0.41 34.2 +0.4: Delay percentage error at different locations in a 2cm × 2cm chip.4 3.5 +0.1 +2.8 3.22 31.4 -7.4 -1.4 -1.6 +0.9 LL 47.1 -5.5 +1.6 -1.6 +3.3 +0.2 +1.4 UR 26.5 UL 49.1 UR 50.3 +1.2 6.9 LR 26.3 -3.3 -0.32 32. Table 6.2 3.8 UL 26.2 +0.2 +0.

2 -0.2 +0.4 -0.31 32.0 6.2 6.9 -0.1 +0.2 -0.3 +0.9 -0.4 +0.6 +0.2 -1.7 -0.7 -0.2 -0.4 6.3 -0.3 -0.2 +0.2 +0.3 +1.0 -1.6 -1.45 61.3 +0.0 3.5 -0.6 LL 48.38 60.7 +1.2 +0.2 +0.2 +1.3 +0. Note: The bench locamark tion µ Exact σ 95% µ LDAW σ 95% µ IW σ 95% c3540 C 25.1 -0.6 +0.5 -1.3 +0.0 +0.3 -0.2 -0.3 -0.7 UL 25.37 60.1 -0.5 UL 49.9 UR 26.6 -0.25 31.1 c7552 C 48.8 LL 25.4 +0.5 -0.7 LR 48.32 32.22 32.4 -0.8 3.3 +0.2 LR 25.2 +1.3 3.2 -0.1 LL 48.6 UR 49.5 +1.3 -0.37 60.3 -0.2 -0.7 -0.3 UR 49.1 LR 49.0 -0.4 -0. Note: The 156 .43 60.6 6.8 -0.3 -0.8 -2.4 +0.2 -0.2 +0.8 -1.5 +0.1 -0.1 3.43 60.0 +1.4 +0.8 values of exact simulation are in ns.8 -1.8 6.1 values of exact simulation are in ns.2 +0.6 -0.2 0.9 3.28 32.6: Delay comparison for ISCAS85 benchmark in 6mm × 6mm chip.4 -1.42 60.7 -0.1 -1.4 +0. Table 6.4 +1.7 +0.4 UR 26.3 -0.9 +1.3 6.9 +0.7: Delay comparison for ISCAS85 benchmark in 3mm × 3mm chip.34 59.9 -0.48 61.3 +0.1 +0.34 32.7 -0.9 6.5 +0.3 +0.1 -0.4 -1.6 -2.bench locamark tion µ Exact σ 95% µ LDAW σ 95% µ IW σ 95% c3540 C 25.5 -0.7 +0.2 -0.7 +0.2 -0.3 -0.2 c7552 C 48.4 6.27 32.46 61.5 +0.3 +0.2 -0.7 +1.3 +2.2 -0.0 3.0 -1.0 -1.2 -2.9 -0. Table 6.5 +0.4 LL 25.6 3.2 -0.4 -0.2 -0.4 6.2 -0.4 3.2 -0.26 320 +0.2 +1.2 3.35 33.6 -0.2 +0.1 -0.4 LR 26.5 3.1 6.31 32.3 +2.2 -0.2 +0.5 -1.4 -0.2 -0.2 -0.5 UL 48.8 UL 26.0 +1.0 +0.

Considering that the random variables may be non-Gaussian. For each variation model. 157 . we also apply our variation model to statistical leakage power analysis. We have implemented leakage variation analysis with different models in Matlab. For the leakage analysis. in this chapter. we observe that: • Error of SAAW is within 5% while error of SPC is up to 17%. cell leakage power variation is modeled as exponential function of variation sources: Pleak = P0 · e∑ ciVi chip leakage power is calculated as the sum of leakage power of all cells: Pchip = (6.22) where Cell is the set of all cells in the chip and Pi.8 compares the leakage variation for ISCAS85 benchmarks. there is no closed-form equation to calculate the full chip leakage power. we apply Monte-Carlo simulation to obtain the full chip leakage power variation.19). SAAW is 7X faster than SPC because there are fewer random variables for SAAW. Usually.21) where P0 is the nominal leakage power and ci ’s are sensitivity coefficients. The full i∈Cell ∑ Pi. we use the same setting and comparison cases as the SSTA experiment in Section 6. In the experiment.leak is leakage power of the ith cell.6. the cell leakage power is an exponential of a quadratic function of random variables.6.000 sample Monte-Carlo simulation to obtain the full chip leakage power for all variation models. Since each variation source is a quadratic function as in (6. we use 100. Moreover.leak (6. Therefore. we assume that 900 copies of ISCAS benchmark circuits are placed in a 30 × 30 array on a 2cm×2cm chip. Table 6.7 Application of Spatial Variation Models on Statistical Leakage Analysis Besides SSTA. From the table.

4 c3540 201 37.7 -6. Run time (T) is in s.5 +1.8 +10.1 14. 6mm×6mm.5 +5.6 -19.2 +9. Note: The exact values are in mW .1 +5.7 c7552 403 73.8 +6.2 -20.4 +2. From the table.9 2.7 15.5 -17.2 16.8 +14. we assume 100 copies of ISCAS85 benchmark circuits are placed in a 10 × 10 array.9 2.7 +1.4 -5.5 +3.0 +5. for leakage variation analysis.5 2.5 -7.3 -24. 158 .9 +11.3 122 -8.7 c1908 95.3 -12.6 +5.2 +8.7 +14.9 -19.3 +3.7 15.2 -20.4 9. This is because both these models do not consider correlation and hence under estimate the leakage power variation.9 -16.7 +1.3 9.4 282 +1.2 -23.4 +3.9 -22.4 122 -9.3 144 +0.3 +1.2 2.5 -15.6 +7.1 -7.6 -16.0 10.6 2.8 -5. for the 6mm×6mm chip.7 9. and for the 3mm×3mm chip.4 +6.4 +12. SPC. Table 6.6 +4.5 Table 6.9 shows the error percentage for different model on different size chips.6 20.2 123 -7.3 +3.5 2.Benchmark µ Exact σ 95% µ SAAW σ 95% T µ QAW σ 95% T µ LDAW σ 95% T µ SPC σ 95% T µ σ IW 95% T c1355 62.8 +12.6 -16.8 +6. For the 1cm×1cm chip.4 +3.9 +8. • SAAW is more accurate than QAW.5 92. and 3mm×3mm.6 -20.2 123 -8.9 2. we find that the error of LDAW.9 181 +1.3 -18. Similar to SSTA.8 15.4 10.9 c2670 131 22.5 2.7 +2.3 +7.6 -16.5 +2. and IW reduces when chip size becomes smaller as expected.1 +13.5 +2.2 562 +1.5 +5.5 -20. • Both LDAW and IW are not accurate.3 +10.7 +6.2 -23.9 +4.5 +4.6 +2.8: Leakage error percentage for different models in 2cm × 2cm chip.9 15. we assume 25 copies of ISCAS benchmark circuits are placed in a 5 × 5 array.3 -25.8 +4. we assume that 225 copies of ISCAS85 benchmark circuits are placed in a 15 × 15 array. but is about 50% slower.3 +3. we also perform leakage estimation in different size chips: 1cm×1cm.7 122 -9.8 -19.5 2.0 2.

5 -5.7 +2.1 +1.0 +2.2 +1.8 +2.0 +4.8 +1.26 +1.3 -3.9 -3.0 c1908 10 23. QAW has four random variables.0 +1.5 141 +1. Note: exact values are in mW .40 2.5 -4.2 -12.2 -5.9 +4.8 -3.0 -3.6 -2.4 +1.1 -3.3 -1.3 -8.2 +6. therefore.6 +2.8 +2.1 6 10.Chip mark size µ Exact σ 95% µ SAAW σ 95% µ LDAW σ 95% µ SPC σ 95% µ IW σ 95% c1355 10 15.37 71.1 +1.3 -9.8 +7. LDAW is the most efficient.3 -11.3 -13.7 -10.9 5.10 summarizes the advantages and disadvantages of our proposed spatial variation models (SAAW.5 +2.7 +4.1 +3.1 -9.8 Summary of Different Models In previous sections.92 1.1 +0. Our proposed across-wafer variation models exactly model the across-wafer variation and the number of random variables does not depend on chip size.1 -6.9 +1.4 3.2 +2.9 -3.0 +6.1 -3.0 -1.5 -3.9 +1.7 +2.5 9.2 +1.7 +8.4 -1.4 +4.5 6 45.5 -12.6 2.3 +2.4 c2670 10 32.2 +4.7 +6. it can be applied only when the across-wafer variation is a perfect parabola.2 +3.9 +2.2 3 1.9 -3.25 16.7 +2.19 6.0 +3.7 -3.3 -9.2 -5.6 +1.4 +2.3 +6.4 -4.9 -3.3 4.6 +6.3 +1.3 -8.6 +3.58 +0. 6.5 -6. and the traditional variation models (SPC and IW).0 -5.8 +2.5 -5.8 5.8 +1.5 +1.2 -4.8 6 5.5 +3.5 -4.5 -2. 159 .3 -4.2 +1.4 -12.7 +1. Therefore they are accurate and efficient.9 +5.4 +2.0 3 2. and LDAW).2 +1.0 -1.2 +9.0 +2.9 +5.0 +1.5 -7.3 -17.9 +0.5 3 5.2 +2. hence it is more efficient than SAAW.2 -7.2 -3.5 6 22.3 +6.7 36.03 +0.5 +9.57 4.3 -14.1 +1.5 +9.4 -6.4 -7. Moreover.0 +1.9 -4.0 8. However.0 +1.5 +1.5 -8.06 1.1 -14.2 -1.7 -1.7 +1.1 c3540 10 50.9 -5.72 45.3 +2.6 -14.55 20.4 +1.9 +3. we compared the accuracy and efficiency of different models.4 Table 6.0 -8. QAW.3 -7.55 +0.0 +2.2 +8.0 -9.58 1.4 -3.2 -7.1 6 14.9 -8.Bench. Table 6. ignores correlation and only works for small chips.1 -3.3 3 3.0 -4.9 -7.65 0.9 -15.6 -11.2 +1.2 -1.5 -6.1 -3.52 23.0 -4.0 +2.4 2.4 -3.6 -7.3 -2.9 +3. SAAW has six random variables and it can be applied to any across-wafer variation models.6 +1.0 +1.8 +3. SAAW and QAW need to know the across-wafer variation.2 -5.1 +3.2 +8. one needs to track the die locations within the wafer to build up the model.2 3 11.3 +2.1 -8.9: Leakage error for different variation model on different size chips.0 -4.58 +0.48 0.7 -2.1 +2.8 -3.03 +1.1 -1.9 -3.1 -2.65 5.5 -7.65 0.62 10.5 -13.6 -3.1 -8.6 -6.6 2.0 +1.16 30.2 -12.2 +5.2 +1.05 7.9 -4.2 +1.8 -2.8 -10.2 -3.9 c7552 10 102 18.2 +2.

such models are not accurate compared to our proposed models. However. they are somewhat easier to build. for SPC. the traditional variation models (as well as LDAW) only require measurement on a die without tracking die locations. Moreover.On the other hand. 160 . it is not as efficient as our proposed models. Therefore. since the number of random variables depends on number of grids.

non-parabola across-wafer variation large chip. .19) QAW Equ (6.2) LDAW Equ (6.20) SPC IW # of RVs 6 4 2 Depend on # of grids 2 Case to Apply large chip.10: Summary of different variation models.Model Type Acrosswafer models Traditional models Advantages Accurate Efficient Easy to extract Disadvantages Need die tracking to extract Not accurate Models SAAW Equ(6. parabola across-wafer variation small chip large chip small chip 161 Table 6.

For simplicity. We exploited these observations in order to create a much more accurate and efficient model for performance variability prediction. we find that spatial correlation is visible only when the across wafer systematic is not taken into account.9 Conclusions In this chapter. We first observe that different locations of a chip may have different means and variances and such difference becomes more significant when chip size increases. we show that within-die random variability does not exhibit a strong or useful pattern of spatial correlation.6. we have proposed an accurate and efficient variation model for deterministic across wafer variation. when chip size is small enough. we find that the within-die spatial variation is NOT independent of the inter-die variation. we analytically study the impact of systematic across-wafer variation on within-die spatial variation. When it is taken into account. Secondly. Based on the above analysis. the two level inter-/within-die variation model is still accurate. our new model reduces the error from 6. we also analyze the case when the across-wafer is not a perfect quadratic function. However. we assume that across-wafer variation is a quadratic function. 162 . We further apply our new variation model to two applications: statistical static timing analysis and statistical leakage analysis. Experimental result shows that compared to the distance-based spatial variation model.5% to 2% for statistical timing analysis and reduces error from 17% to 5% for statistical leakage analysis. such dependence is weak and the across-wafer variation can be lumped in to inter-die variation. Our model also improves the run time by 6X for statistical timing analysis and by 7X for statistical leakage analysis. Thirdly. In this case. Later.

Statistical modeling and analysis have become the mainstay of modern design-manufacturing flows. percentile corner) and analysis (SPICE corner extraction. Our experiments (with variability assumptions derived from test silicon data from a 45nm industrial process) indicate that for moderate characterization volumes (10 lots) and low-to-medium production volumes (15 lots).. The guardbands are non-negligible for all cases when either production or characterization volume is not 163 . statistical timing). due to limited number of samples (especially in the case of lot-to-lot variation). calibrated models have low degree of confidence. However. The problem of confidence in statistical analysis is going to be further worsened with advent of 450mm wafers. Most analysis techniques assume that the statistical variation models are reliable.7% of standard deviation for single parameter corner.CHAPTER 7 On Confi dence in Characterization and Application of Variation Models In this chapter we study statistics of statistics. a significant guardband (e. We mathematically derive confidence intervals for commonly used statistical measures (mean.g. 38. variance. Our estimates are within 2% of simulated confidence values. 34. The problem is further exacerbated when production volumes are low (≤ 65 lots) causing additional loss of confidence in statistical analysis (since production only sees a small snapshot of the entire distribution).7% of standard deviation for SPICE corner. and 52% of standard deviation for 95-percentile point of circuit delay are needed) to ensure 95% confidence in the results.

This is especially true for lot-to-lot variation. However. It also ignores the uncertainty introduced by limited number of production lots. we mean the statistical values in the case of infinite measured as well as production samples. implicitly assume that the production chips have the same statistical characteristics as the corresponding population values 1. analysis or optimization. the existing statistical analysis. Moreover. Due to limited number of measurements. ignoring their true distributions. the measured statistical characteristics may be unreliable. The proposed methods are not runtimeintensive (always within 10s) as they require Monte-Carlo simulations on closed form expressions. and skewness. Therefore. variance. 1 164 .large. [177] models the uncertainty of mean. the statistical characteristics of the variation sources are obtained from measurement of a set of samples. We also study the interesting one production lot case which may be common for prototyping as well as for academic designs. are often assumed to be known and reliable. the production statistical characteristics may significantly deviate from their population values. such as mean. and then estimates the range of mean and variance of circuit performance. variance. Before we move further. it is important to briefly review the concept of confidence Here by “population”. 7. and correlation coefficients as an interval. the uncertainty in statistics of measured data as well as production data should be considered in statistical analysis. this work only models the uncertainty as an interval. statistical characteristics of variation sources.1 Introduction In process variation modeling. since the number of lots measured for characterization is usually small. When the number of production samples is not small. In practice.

the results may differ. the distribution of (say) circuit delay is fixed. . However. Xn). We estimate the confidence interval of mean. Experimental results show that the required guardband (to reach desired confidence) estimated by our method is very close to the exact simulation value. 2. we say that we have 90% confidence that the sample mean is smaller than µcon f . we choose n samples of it (X1.intervals in statistical analysis. . we do not know exactly the production mean and variance prior. one can estimate the distribution of mean and variance (of course the actual mean and variance are going to be just one sample out of these distributions). Notice that the confidence is not yield but the probability that certain statistical characteristic is within a given interval. When the sample mean is smaller than a certain value µcon f in 900 out of the 1000 runs.000 ˆ times. We extend the ideas to extract SPICE corners depending on desired confidence level as well as number of measured/produced silicon lots. For a given confidence level. In this chapter. X2 . the sample mean ˆ µ and variance σ2 are fixed. Once chips are produced. we estimate the worst case fast/slow corner (µ± kσ corner) for variation sources. if we repeatedly choose n samples for 1. we study the uncertainty of statistical models and analyze the impact of such uncertainty on statistical analysis. . variance and quantile for statistical timing analysis. . Consider a standard normal random variable X . Once the samples are chosen. and calculate the sample mean and variance for each time. The contributions of this chapter are: 1. 3.we estimate the distribution of the production mean and variance of variation sources. 4. For a given confidence. But due to the uncertainty of the statistical model. Given the number of measured lots and the number of production lots. We also observe 165 .

interw r l d wafer. 7. The variances of the means and variances are: σ2 = σ2 /Nl µl l σ2 2 = σ2 /Nl l σ l (7. and µr (σ2 ) are means (variances) of inter-lot. inter-die (Vd ).2 gives the problem formulation.6) (7.2) (7.6 discusses how to choose confidence level in real circuit design. We need to introduce up to 30% guardband value to achieve 95% confidence.7) σ2 = σ2 /Nw µw w σ2 2 = σ2 /Nw w σ w 166 .3 studies the confidence interval estimation for variation sources. inter-wafer(Vw ).2 Problem Formulation Process variation is decomposed into inter-lot (Vl ). and within-die (Vr ) variation: V = Vl +Vw +Vd +Vr µV = µl + µw + µd + µr 2 σV = σ2 + σ2 + σ2 + σ2 w r l d (7. and within-die variations. inter-die. µw (σ2 ).4 presents the estimation of worst case delay for SPICE corner estimation and Section 7.that the guardband value increases dramatically when confidence level increases. It has been shown that in case of finite number of samples the sample mean follows Gaussian distribution and the sample variance follow χ2 distribution [4].3) where µl (σ2 ).5 analyzes the impact of mean and variance uncertainty on statistical timing analysis. Section 7. then Section 7. and finally Section 7.1) (7.7 concludes this chapter. The rest of this chapter is organized as follows: Section 7.4) (7.5) (7. respectively. µd (σ2 ). Section 7.

estimate the distribution of statistical measures of production lots. only the measured as opposed to the population value is known. niche process/high column part (S-L). of 1. Therefore. we may want to guardband statistical analysis to ensure a certain degree of confidence. Table 7. In practice. wafers. Nw . the σ r uncertainty of mean and variance comes largely from lot-to-lot variation. Hence σ2 µl σ2 µw σ2 µd σ2 . and tens of measured circuit elements within each die2 . 2 Note that 85 lots of 25 300mm wafers each with die-size of 100mm 2 amounts to production volume 167 . we focus our attention on lot-to-lot variation. Hence in this chapter. dies.1 illustrates five cases of number of measured lots and number of production lots.10) (7. Example use models for these different scenarios can be: commodity process/high volume part (LL).11) σ2 = σ2 /Nr µr r σ2 2 = σ2 /Nr r σ r where Nl . we assume that the lot-to-lot variation of all the variation sources follows Gaussian distribution.5 million chips. commodity process/niche part (L-S). σ 2 2 µr σ l Nd σ2 2 σ w σ2 2 σ d σ2 2 . In rest of the chapter. Therefore. the confidence interval is large and hence statistical analysis is not reliable.8) (7. In this case. and circuit elements. more than one hundred dies per wafer. niche process/niche part (S-S). and Nr are total number of lots.σ2 = σ2 /Nd µd d σ2 2 = σ2 /Nd d σ d (7.9) (7. and commodity (or niche) process/prototyping (L-1).When the number of measured lots or production lots is small. There are usually 20 to 50 wafers per lot. Therefore. Nr Nw Nl . our the problem is: Given the number of measured lots and production lots and the measured values of mean and variance of variation sources. Nd .

Table 7.3 Confi dence Interval for Variation Sources In this section. we calculate the total mean and variance of the production as: µt = µl + µo ˜ ˜ ˜ ˜l σt2 = σ2 + σ2 o (7. we calculate the total mean and variance of measured value as: µt = µl + µo ˆ ˆ ˆ ˆl σt2 = σ2 + σ2 o (7.1: Reliability of statistical analysis for different number of measured and production lots. 7. 7.1).3.2. In the same way as above.12) (7.15) In the rest of this subsection. Therefore.13) As discussed in Section 7. the uncertainty of mean and variance comes largely from lot-to-lot variation.1 Mean and Variance Let’s first discuss the mean and variance. we will discuss confidence interval estimation for a single variation source. we assume the mean µo and variance σ2 of all other o variation are reliable. we will focus on analyzing confidence interval for lot-tolot variation. From (7. 168 .14) (7.3 summarizes the notation used in the rest of the chapter.Case # Measured lots Large Small Large Small Any # Production lots Large Large Small Small 1 Confidence interval Small Large Large Large Large Reliability of Statistical analysis High Low Low Low Low L-L S-L L-S S-S L-1 Table 7.

Finally. σ2 . σ2 = σ2 + σ2 + σ2 o w r d cn f /s no hat ˆ ˜ 2t 2l 2w 2d 2r 2o Table 7. and so on) inter-lot variation value (µl . σ2 number of lots mean and variance fast/slow corner population value (µ. σ2 . σ2 . In order to estimate the confidence interval. and so on) l inter-wafer variation value (µw .17) where M1 ∼ N(0. and so on) r total value except inter-lot variation (µo . σ2 . inter-die and within-die variation (µt . we estimate the production mean and variance with respect to population mean and variance: √ µl = µl + σl · M2 / n ˜ ˜ (7. inter-wafer. we first estimate the population values of mean and variance from measurement data [4]: ˆ µl = µl + σl · M1 · ˆ ˆl σ2 = (n − 1)σ2 /Q1 ˆ l n−1 ˆ nQ1 ˆ (7. ˜ ˜ 169 . µ.18) (7. and so on) o µo = µw + µd + µr . and so on) ˆ ˆ value of production lots (n.16) (7. and so on) ˜ ˜ total value including inter-lot. ˆ ˆ ˜l σ2 = σ2 · Q2/(n − 1) ˜ l where M2 ∼ N(0. and Q 1 ∼ Then. σ2 . σt2 . σ2 and so on) value obtain from measured lots (n. and so on) w inter-die variation value (µd .2: Notations. and so on) d within-die variation value (µr .19) 2 χn−1 is a random variable with χ2 distribution with n − 1 degrees of freedom [4]. 1) is a random variable with standard normal distribution and Q 2 ∼ 2 χn−1 is a random variable with χ2 distribution with n − 1 degrees of freedom.n µ. µ. 1) is a random variable with standard normal distribution.

T is a random variable with t-distribution with n degrees of freedom [4]. We may simplify our model when either n or n is large. we analyzed the case when both n and n are small (S-S ˆ ˜ case).3.24) (7.27) 7. U is the ˆ ratio of two χ2 random variables. hence.26) where Γ(·) is Gamma function [4] which is calculated as: Γ(x) = Z ∞ 0 t x−1 e−t dt (7.25) normal random variable and square root of a χ2 random variable. Therefore. 1). Moreover.22) (7. T is the ratio between a standard ˆ ˜ Γ( n )Γ( n )(u + 1) 2 2 (7.23) (7.21) (7.we obtain the production mean and variance as: ˆ µl = µl + σl · ˜ ˆ ˜l σ2 where n+n ˆ ˜ nn ˆ˜ n−1 ˆ ζ = n−1 ˜ √ √ √ n − 1( nM1 + nM2 ) ˆ ˜ ˆ T = (n + n)Q1 ˆ ˜ Q2 U = Q1 η = It is easy to find that: √ √ nM1 + nM2 ˜√ ˆ n+n ˆ ˜ n − 1 M1 M2 ˆ ˆ √ + √ = µl + σl ηT ˆ Q1 n ˜ n ˆ Q2 (n − 1) ˆ ˆl ˆl = σ2 · = σ2 · ζ2 ·U Q1 (n − 1) ˜ (7.2 Simplifying the Large Lot Case In the previous subsection. The PDF of U is [4]: PDFU (u) = ˆ ˜ Γ( n+n − 1)u 2 −1 2 n ˆ n+n ˆ ˜ 2 ∼ N(0. ˆ ˜ When the number of measured lots n is large (L-S case). we can approximate the ˆ 170 .20) (7.

3.28) (7. However.30) (7.32) (7.21).20) and (7. since there is only one lot. we also consider the special case of one production lot (L-1 case).population mean and variance as measured mean and variance: µl ≈ µl ˆ (7.e. lot-tolot variation has no impact on variance.33) ˜l σ2 ≈ σ2 l Therefore. L-S.: µl ≈ µl ˜ (7.20) and (7.35) Besides S-S.36) .21) can be simplified to: ˆ µl = µl + σl M1 · ˜ ˆ ˜l ˆl σ2 = (n − 1)σ2 /Q1 ˆ 7. i.34) (7. Uncertainty of variance estimation is mainly caused by wafer-to-wafer variation. that is n = 1. (7. Hence.29) ˆl σ2 ≈ σ2 l In this case.21) can be simplified to: √ ˆ µl = µl + σl · M2 / n ˜ ˆ ˜ (7. In this case. and S-L cases. the production variance is calculated as: ˜ ˆw σt2 = σ2 · Q2 /(n − 1) + σ2 + σ2 ˜ r d 171 (7. (7.31) ˜l ˆl σ2 = σ2 · Q2/(n − 1) ˜ In the other case when the number of production lots n is large (L-S case). in this case.20) and (7. the ˜ production values of mean and variance are very close to the population value. production mean still follows the same ˜ distribution as shown in (7.3 One Production Lot Case n−1 ˆ nQ1 ˆ (7.

we use wafer-to-wafer variation instead of lot-to-lot variation. Table 7.3 compares the targeted confidence level and the measured confidence level.34 70 72. ˆ ˜ where n is number of production wafers. We model each set as a design and assume that 5 of the wafers are used to generate the statistical model (n = 5) and the other wafer is the ˆ production wafer (n = 1). To get over the hurdle of huge data requirement. Q2 is a χ2 random variable with n − 1 free˜ ˜ ˆw dom.10 Table 7. 7. Fortunately.% Measured conf. Nevertheless. We randomly divide these wafers into 58 sets with 6 wafers per set.% 50 51. the mathematical derivations in this work make very few approximations if any. We need exceedingly large amounts of data spread out over hundreds of lots preferably over multiple independent designs.4 Experimental Validation for One Lot Case The problems in validation of the confidence-based guardband models proposed in this work should be obvious by now. and σ2 is the measured variance of wafer-to-wafer variation.3.Targeted conf. 172 . We obtain wafer-to-wafer ring oscillator delay variation data from an industrial 65nm process with 348 wafers spread over 23 lots. For each set we apply (7.3: Experimental validation of estimation of confidence interval for mean for n = 5 and n = 1.41 80 81.03 90 93. The small error (≤ 4%) is due to limited data. We obtain the measured confidence by counting the number of sets that production value is within the confidence interval: Measured Conf = Number of match sets/ Total number of sets.72 60 60. in this section we try to validate the theory above for the case with small number of measured lots and one production lot.20) to estimate the distribution of ˜ production mean and calculate the confidence interval for a targeted confidence level.

5 Fast/Slow Corner Estimation The next statistical characteristics we consider are fast and slow corners cn f /s for variation sources. we use Monte-Carlo simulations to obtain the distributions of the fast/slow corners.g. Assuming that hold time violation is harder to fix.39) (7. we have: ˆ cn f = µl + µo + σl ηT − k f ˆ ˆ cns = µl + µo + σl ηT + ks ˆ ˆ σ2 ζ2U + σ2 o ˆ σ2 ζ2U + σ2 o (7. fast corner and slow corners are dependent on each other.21). 173 . we assume fast implies smaller parameter value (e. Therefore.3.38).38) (7. from the samples.37) For a given confidence level.41) Due to the complexity of (7. we want to estimate the worst case fast/slow corner. that is: s 3 P{cn f > cnw } = c f f f P{cn f > cnw .40) (7.20) and (7. Usually. channel width).7. We estimate the worst case fast/slow corner such that there is c f f confidence that the fast corner is larger than cnw and there f is c ft confidence that the fast corner is larger than cnw and the slow corner is smaller f than cnw . cns < cnw } = c ft s f (7.42) With the worst case fast corner value. Then the worst case slow corner f f 3 Without loss of generality. CDFcns |cn f <cnw . the fast/slow corner is expressed as: ˜ cn f /s = µt ± k f /s σt ˜ From (7. we may also obtain the conditional CDF of cns given cn f > cnw . However. cnw/s . Then the worst case fast corner is calculated as: −1 cnw = CDFcn f (c f f ) f (7. f we need to handle them together. higher confidence may be needed on fast corner.

the mean and variance can be simplified as (7. We also assume that the variation source is with standard normal distribution 4 . 174 .46) nominal mean and variance of variation sources does not affect the value of k f /s.43) We then express the worst case corner value as: ˆ cnw/s = µ ± k f /s σ ˆ f (7.is calculated as: −1 cnw = CDFcns |cn f <cnw (c ft /c f f ) s f (7.30).e. Hence the corner model can be also simplified as: M ˆ 2 cn f = µl + σl √ − k f ˆ n ˜ M ˆ 2 cns = µl + σl √ + ks ˆ n ˜ 4 The ˆl σ2 · Q2 + σ2 o (n − 1) ˜ ˆl σ2 · Q2 + σ2 o (n − 1) ˜ (7. when the number of measured lots is large (L-S). Notice that the k f /s depends on ˆ the measured values µ and σ2 . In the experiment. k f = ks = 3 and we assume that the variance of inter-lot variation ˆ ˜ is 18% of the total variance (derived from the data described in section 7.3. In ˆ the table. the guardband percentage is calculated as: 100 × (k f /s − k f /s )/k f /s . and then apply the ˆ method above to calculate the worst case value of k f /s . From the table.45) (7.4).3.4 shows the value of k f /s under different confidence levels. k f /s needs to be 30% over k f /s .44) When the worst case corner values are known. For each run.. we let n = 10. we generate n ˆ ˆ measured samples to obtain the measured mean µ and variance σ2 . we can find that to ensure high confidence. we need a large guardband. Table 7. i. n = 15. we perform the estimation for 1000 runs. the k f /s values are calculated as the average of 1000 runs. In the experiment. As discussed in Section 7. hence each run may result in a different k f /s value. we can easily obtain the value of k f /s .2. In the table.

4%) 3. Similarly.60%) 3.17%) 3.625 (20.422 (14.4%) 3.17%) 3. From the table.8%) 4.345 (11.138 (4. when the number of production lots is large (S-L).155 (5.48) Table 7.20%) 3. corner model can be simplified to: ˆ cn f = µl + σl · M1 · ˆ ˆ cns = µl + σl · M1 · ˆ n−1 ˆ −kf nQ1 ˆ n−1 ˆ + ks nQ1 ˆ ˆl (n − 1)σ2 ˆ + σ2 o Q1 ˆl (n − 1)σ2 ˆ + σ2 o Q1 (7.121 (4.2%) 3.4: Guardband of fast/slow corner for different degrees of confidence assuming n = 10.43%) 3.13%) 3.124 (4.080 (2.342 (11.67%) 3.5 shows the average k f /s value for the L-S case. it is evident that the guardband is smaller than that of the S-S case.255 (8.50%) 3. ˆ ˜ Table 7.502 (16.14%) 3.6 shows the average k f /s value for the S-L case.8%) ks 3.594 (19.47) (7. ˜ ˆ 175 .246 (8.619 (20.112 (3.042 (34.5%) 3.275 (9. Figure 7. As expected.035 (1.03%) 3. ˆ ˜ c ft % 50 60 70 80 90 95 cff % 60 70 80 90 95 99 kf 3.236 (7.401 (13.5: Guardband of fast/slow corner for different degrees of confidence assuming large n and n = 15.215 (7.87%) 3.7%) 3.6%) Table 7. the guardband in this case is smaller than the S-S case.103 (3.13%) 3.7%) ks 3. L-S (or S-L) can be treated as L-L.518 (17.1 compares the 90% confidence value of k f between the L-S (or S-L) and L-L.c ft % 50 60 70 80 90 95 cff % 60 70 80 90 95 99 kf 3.2%) 3.73%) 3. We observe that when n ≥ 60 (or n ≥ 85).307 (10.7%) Table 7. n = 15.

15 3.40%) 3.1: Comparison of S-L. and L-L case.3 3.134 (4.272 (9. we need a larger number of measured lots (or production lots) to be considered as large.6: Guardband of fast/slow corner for different confidence levels assuming n = 10 and large n. the uncertainty of 176 .25 L−S S−L k’f 3. This is because when the lot-to-lot variation increase.193 (6.132 (4.09 f 150 Figure 7.4%) 3. the statistical analysis is reliable and we do not need to perform guardband estimation.052 (1. when error of k f value between L-L and L-S (or S-L) cases is within 3%. ˆ ˜ That is.1 3.4%) Table 7.43%) 3.267 (8.50%) 3.47%) 3.90%) 3. ˆ ˜ ~ n =60 ^ n =80 3. L-S.690 (23.523 (17.0%) ks 3.17%) 3.73%) 3.2.165 (5. we consider n (or n) value is large.3. Therefore.2 3.07%) 3.155 (5. in the special case when there is only one production lot (L-1).3%) 3. lot-to-lot variation has no impact on the variance. In the experiment.2. the uncertainty of mean and variance increases. Note that increasing lot-to-lot variation will increase the value of n and n that can be ˆ ˜ considered large as shown in Figure 7.431 (14.c ft % 50 60 70 80 90 95 cff % 60 70 80 90 95 99 kf 3.05 50 ~ 100 ^ n (n) k’ =3. As discussed in Section 7.370 (12. Instead.

we assume that there are 25 wafers per lot. we see that the guardband is smaller ˜ than that of the L-S and S-L cases.800 600 ~ ^ Large n (n) ~ n ^ n 400 200 0 20 40 σ2% l 60 Figure 7.2: Number of n or n which can be considered as large for different fraction of ˆ ˜ lot-to-lot variation assuming maximum 3% error at 90% confidence. hence the uncertainty of the corner is smaller. From the table. the variance comes from wafer-to-wafer variation.49) (7.50) Table 7. This somewhat surprising result is explained by the fact that lot-to-lot variation has no impact on variance (but still affects the mean).36). In the experiment.7 shows the k f /s for the L-1 case. 177 . n = 25. we can calculate the fast/slow corners as: ˆ cn f = µl + µo + σl ηT − k f ˆ ˆ cns = µl + µo + σl ηT + ks ˆ ˆw σ2 Q2 /(n − 1) + σ2 ˜ o ˆw σ2 Q2 /(n − 1) + σ2 ˜ o (7. From (7.

c ft % 50 60 70 80 90 95 cff % 60 70 80 90 95 99 kf 3.1%) ks 3.77%) 3.438 (14.475 (15.574 (19.6%) 3.7%) 3.116 (3.2%) 3.51) where m is the number of variation sources.197 (6.27%) 3.4%) 3.4 Confi dence in SPICE Fast/Slow Corner Fast/slow corners for circuit simulation is the major interface between design and process.52) (7.432 (14. respectively.7: Guardband of fast/slow corner for different confidence levels assuming n = 10. Considering all variation sources to be independent.320 (10. Xi ’s are variation sources. Variation source values corresponding to the corner delay Dw/s are then calculated.307 (10.57%) 3.53) ˜D σ2 = i=1 ˜ li ∑ c2(σ2 + σ2 ) i oi m where µoi and σ2 are the mean and variance of other variation for the ith variation oi ˜ li source. µli and σ2 can be calculated as shown in Sec˜ tion 7.051 (1. f We assume that the inverter chain delay is a linear function of variation sources: m D = d0 + ∑ ci Xi i=1 (7. and ci ’s are sensitivity coefficients of inverter chain delay to variation sources.278 (9. respectively. µli and σ2 are the mean and variance of lot-to-lot variation for ˜ ˜ li the ith variation source. 178 .70%) 3. ˆ ˜ ˜ 7.261 (8.8%) Table 7. and n = 25.87%) 3.3. n = 1.70%) 3. the mean and variance of the product inverter chain delay can be calculated as: µD = d0 + ∑ ci (µoi + µli ) ˜ ˜ i=1 m (7.1. A common approach to derive the corners is to use measured values from canonical circuits such as an inverter chain.293 (9. d0 is the nominal delay.

we assume that the 3σ value is 20% of the nominal value. When the nominal mean increases. 179 . In the table. We assume three variation sources. From the table.. we use PTM bulk low power SPICE model for 45nm technology [1]. In the experiment.we show ˆ ˜ the average of 1000 runs. the guardband percentage decreases. We also assume that the inter-lot variation is 18% as in previous sections.5 we use Monte-Carlo simulation to obtain the PDF of the fast and slow delay corners. In the experiment. Corresponding corner values of variation sources f are calculated by solving the following [121]: Dw/s = d0 + ∑ ci Xiw f w X1 f /s − µ1 ˆ m (7. we observe that under the same confidence level.3. we assume 3σ = 3nm. This is because the k f value of variation sources does not depend on the nominal mean and variance. One sided confidence interval is then used to obtain the worst case fast and slow delay corners Dw /s. the guard band percentage does depend on nominal mean. = ˆ cm σm (7. for threshold voltage variation. the guardband percentage of delay corner is less than that of k f value of variation sources as shown in Table 7. we let n = 10 and n = 15.Usually. For gate length variation. However.nominal value|/nominal value. the guardband percentage is calculated as: 100×|worst case value .56) Table 7.. for the worst case delay.8 illustrates the guardband percentage for the worst case delay and the corresponding variation source values under different confidence levels. Since measured corner values affect production corners. the delay corner is expressed as ˜ ˜ D f /s = µD ± k f /s σD ˜ (7. gate length L.54) As in Section 7.55) w Xm f /s − µm ˆ ˆ c1 σ1 = w X2 f /s − µ2 ˆ i=1 ˆ c2 σ2 . threshold voltage for NMOS Vtn and PMOS Vt p .4.

3 95. though appears to be small. 000.11 show the average guardband percentage for the L-S and S-L cases.58 0.58% of nominal value.91 1.32 0.62 w Vtn f Vtw f p 0.8: Percentage guardband of worst case delay corner and corresponding variation source corners under different confidence levels assuming n = 10.80 2.20 0. Note that the guardband.41 2. is significant when compared to variation itself (e.6 72.29 1.7% of nominal value.9 61.603 Dw s 0. and Vth value is in V . we find that the exact simulated confidence is a little bit higher than the target value.81 1.77 1.43 0.71 0.10 and 7.61 388 Lw s 0.4 81.50 2.22 3. n = 15. we also compare the targeted confidence and the exact confidence of the estimated worst case corner value.57 1.2 Table 7.7 99.55 0.612 Nominal Table 7.643 0. Such error largely comes from the fitting of the linear delay model to SPICE.27 2. That is. ˜ In Table 7. In the experiment. From the table.13 2.5 cff 61.9. respectively. we let nrun = 1.01 3.15 4.54 0.41 w Vtns Vtw ps 0.58 349 Lw f 0 07 0 14 0 29 0 50 0 79 1 15 42. The flow to obtain the simulated confidence is shown in Figure 7. ˆ n = 15.29 0.36 0.45 0.g.1 90.52 0.71 0.8 95. Table 7.97 1.9: Confidence comparison for inverter chain based corner assuming n = 10.c ft 50 60 70 80 90 95 cff 60 70 80 90 95 99 Dw f 0. Lgate value is in nm.86 1. Note: ˆ ˜ delay value is in ps.35 2.84 3. σ for Vt p variation is only 6.4 91. Targeted Con f c ft 50 60 70 80 90 95 cff 60 70 80 90 95 99 Simulated Con f c ft 51.36 3.15 2. the 180 .19 0.23 0.3.40 0. while the guardband of fast corner is up to 2.44 0.7 71.00 2.21 0.558 0.13 0.90 47.6 81.39 0.

D 11. c ft = nt /nrun Figure 7. Hence. Obtain worst case corners: Dw . Therefore.000-sample SPICE MC simulation on inverter ˜ chain delay based on µ and σ2 .1.54) ˆ according to µ and σ2 . Similar to the single ˜ ˆ we consider n (or n) value is large. In the figure. or S-L cases. ˜ 9. if D f > Dw /* compare simulated value and estimated value */ f 13.1 compares the 90% confidence value of Dw between the L-S (or S-L) and f L-L. Calculate µ and σ2 for each variation sources from samples ˆ 5. We also find that for the worst case delay corner. this is because lot-to-lot variation has no impact on variance. s f 7.7% of standard deviation). Table 7. L-S. n f = 0. 181 .12 shows the average guardband percentage for the L-1 case. source case. c f f = n f /nrun .000-sample MC simulation on (7. Dw is normalized with respect to the nominal value. can be considered as “large”. 12. Generate n samples of measured lots /* calculate estimated value */ ˆ ˆ 4. This is because the guardband of worst case delay is smaller than that of single variation source as discussed above. ˆ 6. if Ds < Dw s 15. we need fewer lots to achieve the same model reliability. Calculate simulated corners: D f = µD − k f σD .3. Calculate µ and σ2 for each variation source from samples. Generate n samples of production lots /* calculate simulated value */ ˜ ˜ 8. for i=1 to nrun 3. ˜ 10. when error of k f value between L-L and L-S (or S-L) case are within 3%. 14. Figure 7. n f =n f + 1. Dw .5. nt = nt + 1 16. nt = 0 2. Ds = µD + ks σD . As discussed in Section 7. Perform 10. From the figure.1. Perform 10. Calculate mean µD and variance σ2 of SPICE MC samples. The guardband percentage of this case is smaller that of S-S. ˆ ˜ the value of n (or n) to be considered as large is smaller than that of the single variation ˆ ˜ source as in Figure 7. guardband value is up to 38.3: Flow to simulate confidence. f We observe that n ≥ 30 (or n ≥ 45).

12 0. ˜ c ft 50 60 70 80 90 95 cff 60 70 80 90 95 99 Dw f 0.44 Lw f 0.11 0.05 0.41 3.28 2.21 0.24 0.34 0.70 2.30 Table 7.13 1.03 Table 7.34 0.86 w Vtn f Vtw f p 0.59 2.13 2.86 1.58 0.90 Lw f 0.07 Lw s 0.91 0.33 0. ˆ ˜ 182 .42 Dw s 0.16 0.10: Percentage guardband of worst case delay corner and corresponding variation sources corners under different confidence levels assuming large and n = 15.01 1.27 0.15 0.44 0.24 0.38 0.17 0.61 1.30 0.36 3.67 2.60 0.11: Percentage guardband of worst case delay corner and corresponding variation source corners under different confidence levels assuming n = 10 and large n.53 2.24 0.97 1.08 1.77 2.89 2.49 0.95 1.43 0.06 1.15 1.81 2.69 1.68 w Vtns Vtw ps 0.20 1.06 0.21 0.41 0.75 Dw s 0.16 0.37 0.01 2.38 0.97 w Vtn f Vtw f p 0.43 0.31 0.58 0.29 0.49 0.68 3.77 1.70 2.39 0.c ft 50 60 70 80 90 95 cff 60 70 80 90 95 99 Dw f 0.82 1.18 0.73 1.50 1.72 1.10 0.77 w Vtns Vtw ps 0.71 Lw s 0.27 0.32 0.97 1.53 0.35 2.43 0.16 0.10 1.59 0.66 1.14 0.46 0.68 0.64 1.11 0.50 2.67 0.92 0.

98 0.50 0.71 0.51 w Vtn f Vtw f p 0.96 0.54 0.32 0.63 0.90 1.30 0.85 1.4: 90% confident Dw with varying n and n.46 2.11 0.29 0.03 1.53 0. n = 1. Next we discuss the impact of limited characterization or production data on one of the most talked about chip-level statistical analysis methods.37 0.95 1.67 1.10 0.52 0.53 0. statistical timing analysis.27 0.21 1.23 0.1 w 0.24 0.40 0.16 0.45 Dw s 0.42 0.32 0.9 50 3% error ~ n =30 ^ n =45 Normalized D f L−S S−L 100 ~ ^ n (n) 150 Figure 7.71 1.18 0.49 0. namely.30 0.05 Lw f 0.17 0.37 0.68 0.69 0.12: Percentage guardband of worst case delay corner and corresponding variation source corners under different confidence levels assuming n = 10.06 0.10 1. and ˆ ˜ n = 25.17 0.87 Lw s 0. we illustrated the extra guardband required in corners (which ironically are themselves guardbands by definition) to ensure confidence in statistical analysis when the number of lots used to characterize variation or number of production silicon lots are small.40 Table 7.47 w Vtns Vtw ps 0. ˆ ˜ f c ft 50 60 70 80 90 95 cff 60 70 80 90 95 99 Dw f 0.91 1.38 0. ˜ we need lower guardband to achieve the same confidence.23 0.94 0. 183 . In this section.13 0.92 0.

7.59) Another quantity of interest is a certain percentile point. Then for a given confidence level Con f . it is easy for us to obtain any interval of mean and variance under a certain confidence. its mean and variance can be calculated in the same way as shown in (7.60) (7. We again use Monte-Carlo simulation on the linear equation to obtain the CDF of the output mean (CDFµD (·)) and variance (CDFσ2 (·)). Therefore.5 Confi dence in Statistical Timing Analysis In this section.52). the percentile point C(p%) can be calculated as: C(p%) = µ+ kσ k = Φ−1 (p%) 5 Notice (7. the chip delay is expressed as a linear canonical form of variation sources: m D = d0 + ∑ ci Xi i=1 (7. it is easy to obtain the worst case D mean and variance: 5 −1 µw = CDFµD (Con f ) −1 σw2 = CDFσ2 (Con f ) D (7.58) σw = σw2 (7.61) that we have known the distribution of mean and variance.5. 184 .57) Since circuit delay has a linear canonical form. Since quadratic SSTA with non-Gaussian variation sources is very complicated. we only analyze linear SSTA with Gaussian variation sources in this section. we will discuss about confidence estimation for statistical static timing analysis (SSTA). Since we assume Gaussian variation sources and apply linear canonical form to approximate circuit delay. By performing linear SSTA in Section 5. the circuit delay also follows Gaussian distribution.

we compare the targeted confidence and the simulated confidence of the estimated confidence interval.5 illustrates the worst case mean.4 to obtain the CDF of the percentile point. standard deviation. The only difference is that instead of running SPICE Monte-Carlo simulation. we apply the same experimental setting as that for the inverter chain as in Section 7. although the guardband value seems not large. standard deviation. we apply SSTA [41] to obtain the chip delay distribution. However. We may apply the same method as in Section 7. In this way. Figure 7.6% of nominal value to ensure 95% confidence. From the figure. and then calculate its worst case value.3. In the experiment. In the figure. Figure 7. We ignore L-S and L-1 cases where the need of SSTA is not well motivated. we observe that the guardband of chip delay is lower than that of single variation source and inverter chain delay. different runs may result in different worst case values. the worst case value depends on the measured values of mean and variance. As discussed before.9X of standard deviation and the guardband value is up to 7. and 95% percentile point for ISCAS85 benchmarks. The simulated confidence is obtained in Figure 7.6 illustrates the mean. This is because chip delay has a larger nominal mean to variance ratio than inverter chain. the worst case value is averaged over 1. the percentile point is converted to a µ+ kσ corner. Figure 7.4. for the 95-percentile point. Due to our simplifying assumption that the linear canonical form is independent of small changes in mean and variance.where Φ(·) is the CDF of standard normal distribution. For example.7 compares the 90% confidence value of 95-percentile point for C7552 185 . Hence. it is very significant compared to the scale of variation.000 runs and normalized with respect to the nominal value. That means the guardband value is up to 52% of standard deviation. the nominal value is 6.13. there is a small error between targeted and simulated confidence. In Table 7. and 95-percentile point of ISCAS85 benchmarks under different confidence level for S-L case.

8 70.3 90.5 95. ˜ Targeted con f 50 60 70 80 90 95 Simulated con f µ 51.08 90 60 70 80 Confidence 90 Worst 95% 1.02 σw 1.06 C17 C444 C1355 C7552 1. standard deviation.1.3 Table 7.5: Worst case mean.1 80.7 81. n = 15.8 80.06 µw 1.04 1. assuming n = 10 and ˆ n = 15.1 1. and 95-percentile point normalized with respect to the nominal value for ISCAS85 benchmarks.02 50 60 70 80 Confidence 90 Figure 7.5 σ2 51.9 90.2 95.3 60. ˆ ˜ 186 .4 95.3 61.04 C17 C444 c1355 c7552 1.2 95% 51.4 60.05 1 50 C17 C444 C1355 C7552 1 50 60 70 80 Confidence 1.9 70.0 90.15 1.2 70.13: Confidence comparison for ISCAS85 benchmark 95-percentile point delay assuming n = 10.

05 60 70 80 Confidence 90 1 50 60 70 80 Confidence 90 1. ˆ ˜ 187 .01 1 50 C17 C444 C1355 C7552 1.1.06 worst 95% 1.6: Normalized worst case mean. and 95-percentile point for ISCAS85 benchmarks.02 1 50 60 70 80 Confidence 90 Figure 7.1 σw C17 C444 C1355 C7552 1.02 1.04 C17 C444 C1355 C7552 1. standard deviation. assuming n = 10 and n is large.03 µw 1.04 1.05 1.

188 .7: 90% confidence worst case 95-percentile point with varying n for C7552.05 worst 95% ^ n =24 ^ n =80 1. percentile point is normalized with respect to the nominal value. ˆ ˆ Again. we see that limited data can lead to significantly large guardbanding to ensure good confidence in analysis results. between S-L and L-L. we observe that the guardband of chip delay is lower than that of single variation source and inverter chain delay. fast (essentially Monte Carlo on a linear expression.03 1. From the figure. The method discussed in this section is fairly straightforward. respectively.1. In the figure. ˆ Percentile point is normalized with respect to the nominal value.02 1. runtime is within 10s) and accurate to estimate confidence in results of statistical timing analysis especially when the variability models are known to suffer from limited lot-to-lot variation data.04 1.01 1 ^ n =40 50 100 ^ n 150 Figure 7. we need n ≥ 40 and n ≥ 80.06 1. We find that we only need a small number of lots (n ≥ 24) to bound the error between S-L and L-L under 3%. If ˆ we want to bound the error to 2% or 1%.

6 This is usually possible only for high-volume. adaptive voltage scaling can also alleviate downside of low confidence. confidence level is directly related to risk. we have proposed techniques to quantify the confidence in statistical circuit analysis and estimate guardband required to attain a desired level of confidence. A lower confidence does not necessarily translate to lower yield but it implies that the probability of the statistical analysis actually matching real silicon is smaller.. Therefore. running more lots is acceptable). For designs where power/performance constraints are flexible or time to volume is not critical (i. Ability to change design characteristics during or post-manufacturing mitigates the need for high confidence in models. If the silicon foundry can extensively adjust the process (which essentially shifts the mean) for the design6 . high-value designs.7.6 Practical Questions: Determining Confi dence Induced Guardband for Real Designs So far in the chapter. performance. The answer to “what is the desired level of confidence” is not an easy one especially given the steep increase in guardband to ensure higher levels of confidence.e. Risk (in terms of schedule. two issues influence ˆ ˜ the choice of confidence level: 1. the silicon can be made to match the analysis and ensuring high confidence in design-side estimates is not essential. Besides quality of models (n) and production volume (n). 189 . if yield is lower than expectation. power. Similarly. Confidence levels provide a tradeoff between risk of higher manufacturing cost (due to low yield) and higher upfront design cost (due to guardbanding). 2. etc) tolerable for the design. tunability using knobs such as adaptive body biasing. a lower confidence is acceptable. Extent of post-silicon tuning options available.

We assumed lot-to-lot variation to be 18% of total variance for our experiments. the 95th percentile delay as estimated using statistical timing analysis needs a guardband of 5%-7%. variance.In the end. mean. we have studied impact of characterization and production silicon volumes on believability of statistical variation models and analysis. The required guardband will increase if lotto-lot variation increases as a fraction of total variation. the choice of confidence will be as much a business decision as a technical decision as it quantifies the desire for predictability of statistics of manufacturing. the guardband needed for 95% confidence is about 14% of the nominal uncertainty (difference between best/worst corners) for low-to-medium volume production (n = 10. 1. 2. Finally. Predicted statistics are reliable for number of measured lots larger than 85 and number of production lots larger than 60. measured data volume. ˆ ˜ 3. For a typical SPICE corner extraction methodology. best/worst corners for a variation source may need as much as 30 % guardband to attain 95% confidence.. 7. For small characterization or production volumes. We believe that confidence- 190 . percentile corners. n = 15). We have developed methods to estimate the distribution of production silicon statistical characteristics (e.g. and intended production volume. etc) in terms of (unreliable) measured statistical characteristics.7 Conclusions In this chapter. Our experiments and results indicate the following. Our estimates show strong agreement with confidence measured from a set of real silicon data (maximum error ≤ 4%). The loss of reliability largely stems from lot-to-lot variation.

level induced guardband should be considered for all low-to-medium volume designs. 191 . the problem is likely to become more severe when 450mm wafer sizes are adopted by the industry (since less number of lots would be required to meet the same production volume). Moreover.

analysis. We have modeled. Then we per- 192 . However. In Chapter 3. and optimization of FPGA power and performance with existence of process variation. the process variation caused by inevitable manufacturing fluctuations and imperfections also increases when technology scales down. delay.CHAPTER 8 Conclusions In modern Very Large Scale Integration (VLSI) circuit industry. we have extended Ptrace to consider process variation. In Part I. In this dissertation. the scaling of Complementary Metal Oxide Semiconductor (CMOS) technology enables the ever growing number of transistors on a chip. including Chapter 2. 3. we consider two types of the most common VLSI circuits: Field-Programmable Gate Array (FPGA) and Application Specific Integrated Circuit (ASIC). our optimization reduces the energy-delay product by 18.4% and area by 23. we have developed a trace-based power and performance evaluation framework(Ptrace) for FPGA.3%. Compared to the baseline. hence improves the chip functionality and reduces the manufacturing cost. and optimized power and delay for both FPGAs and ASICs. Then we performed device and architecture coevaluation on hundreds of device and architecture combinations to optimize power. Process variation introduces significant power and performance uncertainty to VLSI circuits and becomes a potential stopper for further technology scaling. analyzed. we have studied modeling. 4. and area. In Chapter 2.

and reliability model. and linear model. our optimization improves leakage power and timing combined yield by 9. delay. we have studied sta- tistical timing modeling and analysis for ASICs. Chapter 6. This offers a large room for power and delay tradeoff. hence they are very time efficient. 3.5% and significantly reduce leakage variation. In Part II.formed device and architecture co-optimization to maximize power and delay yield. and SER. we have developed transistor level voltage and current model based on ITRS MASTAR4 (Model for Assessment of cmoS Technologies And Roadmaps). our SSTA flow predicts the 193 . we have presented new SSTA algorithms for three different delay models: quadratic model. All atomic operations of these algorithms are performed by closed-form formulas. In Chapter 5. In addition. delay.3X delay difference. we have proposed a new method to approximate the max operation of two non-Gaussian random variables using second-order polynomial fitting. and reliability simultaneously. By applying such approximation.5% to 5. Experimental results show that compared to Monte-Carlo simulation. including Chapter 5. we have also studied the interaction between process variation. device aging.4%. and then developed circuit level power. we have developed a framework to evaluate FPGA power. Moreover. We observe that device aging reduces standard deviation of leakage by 65% over 10 years while it has relatively small impact on delay variation.1X energy difference. We have further shown that the device aging has a knee point over time and device burn-in to reach the point could reduce the performance change over 10 years from 8. Combining the circuit level model with Ptrace. and Chapter 7. In Chapter 4. We have shown that applying heterogeneous gate lengths to logic and interconnect may lead to 1. Compared to the baseline. quadratic model without crossing terms (semiquadratic model). we also find that neither device aging due to NBTI and HCI nor process variation have significant impact on SER.

and optimization for FPGAs and ASICs. 1%. we have studied impact of characterization and production silicon volumes on believability of statistical variation models and analysis.5% to 2% and improve run time by 6X compared to the traditional spatial variation model. and 52% of standard deviation for 95%-tile point of circuit delay are needed) to ensure 95% confidence in the results. but also to FPGAs. Our estimates are within 2% of simulated confidence values. the model of spatial correlation in Chapter 6 can also be applied to FPGA power and delay variation estimation to im- 194 . The SSTA method introduced in Chapter 5 can be applied to FPGA delay variation for more accurate and efficient delay variation estimation. In Chapter 6. skewness. there are still lots of research works need to be done in this field. statistical timing). we have studied the impact of systematic across-wafer variation on within-die variation. The guardbands are non-negligible for all cases when either production or characterization volume is not large. 38. In practice. We mathematically derived the confidence intervals for commonly used statistical measures (mean.. Experimental results show that our new model reduces error from 6. Based on the analysis.g. Our experiments indicate that for moderate characterization volumes (10 lots) and low-to-medium production volumes (15 lots). we have developed an efficient and accurate spatial variation model to model across-wafer variation at die level. 34. and 1% error. In Chapter 7. standard deviation. percentile corner) and analysis (SPICE corner extraction. analysis.mean. and 95-percentile point within 1%. the statistical modeling and analysis techniques presented in Part II can not only be applied to ASICs. However.7% of standard deviation for single parameter corner. a significant guardband (e.7% of standard deviation for SPICE corner. 6%. This dissertation makes contributions to statistical modeling. We then implemented the new model to the SSTA flow in Chapter 5. respectively. variance.

prove the accuracy and efficiency. and we may also apply the method in Chapter 7 to estimate the confidence for FPGA power and delay variation analysis. In the future. applying the statistical analysis techniques for ASICs to FPGAs is one of our research topics. 195 .

First. Second. and leads to different placements for different chips of the same application.D. • Fourier Series Approximation for Max Operation in Non-Gaussian and to handle the max operation in SSTA with both quadratic delay dependency and Quadratic Statistical Static Timing Analysis. Our experimental results show that. We propose efficient algorithms 196 . we propose an efficient high-level trace-based estimation method. program and are not introduced in this dissertation: • FPGA Performance Optimization via Chipwise Placement Considering Process Variations. We develop a statistical chipwise placement flow to improve FPGA performance. we develop a variation-aware detailed placement algorithm vaPL within the VPR framework [18] to leverage process variation and optimize performance for each chip. variation aware chipwise placement improves circuit performance by up to 12.1% for the tested variation maps. similar to Ptrace. to evaluate the potential performance gain achievable through chipwise FPGA placement without detailed placement.CHAPTER 9 Appendix In the appendix. I briefly introduce the works I have finished in my Ph. compared to the existing FPGA placement. August 2006. Chipwise placement vaPL is deterministic for each chip when the chip’s variation map is known. This work was published in International Conference on Field Programmable Logic and Applications (FPL).

standard deviation and 95% percentile point with less than 2% error. max and add. This work was published in Design Automation Conference (DAC). this new model is more efficient than the exponential model when calculating the full chip leakage power. Due to the additivity. then extend such model to consider both within-die spatial and within-die random variation. Based on such algorithms. June 2007. February 2009.non-Gaussian variation sources simultaneously. We present an efficient function to represents the leakage variation considering inter-die variation. Compared to Monte Carlo simulation for non-Gaussian variation sources and nonlinear delay models. This work was published in IEEE/ACM Asia South Pacific Design Automation Conference (ASPDAC). and the skewness with less than 10% error. we develop an SSTA flow with quadratic delay model and non-Gaussian variation sources. our approach predicts the mean. Experimental results show that our method is 5X faster than the existing Wilkinson’s approach Analysis. We prove that the complexity of our algorithm is linear in both variation sources and circuit sizes. • Efficient Additive Statistical Leakage Estimation. are calculated efficiently via either closed-form formulas or low dimension (at most 2D) lookup tables. • Accounting for Non-linear Dependence Using Function Driven Component cal analysis and show that ignoring non-linear dependence may cause error in both statistical timing and power analysis. We use polynomial instead of exponential 197 . We first analyze the impact of non-linear dependence on statisti- additive leakage variation model. Experimental results show that the proposed FCA method is more accurate compared to the traditional PCA or ICA. Then we introduce a function driven component analysis (FCA) to minimize the error introduced by non-linear dependence. All the atomic operations. hence our algorithm scales well for large designs.

Volume 28. This work was published in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD).with no accuracy loss in mean estimation. 198 . about 1% accuracy loss in standard deviation estimation and 99% percentile point estimation. pp:1777 . issue: 11. November 2009.1781.

In Proc. In Proc. In Proc. [7] Aseem Agarwal.html. in Europe.eas. Vladimir Zolotov. Conf. and Massoud Pedram. Non-gaussian statistical interconnect timing analysis.. The effect of LUT and cluster size on deepsubmicron FPGA performance and density. Savithri Sundareswaran. and Rajendran Panda. January 2006. 2003. Asia South Pacific Design Automation Conf. Statistical timing analysis for intra-die process variations with spatial correlations. and Rajendran Panda. and Sarma Vrudhula. [10] Elias Ahmed and Jonathan Rose. Design Automation and Test Conf. Vladimir Zolotov. pages 900 – 907. [5] Aseem Agarwal. volume 22. Design Automation Conf. Hanif Fatemi.. Savithri Sundareswaran. Alam and S. [3] Soroush Abbaspour. Min Zhao. In http://www. In Proc. pages 1243 – 1260. March 2006. Min Zhao. Statistical timing analysis using bounds and selective enumeration.. In Microeletronics Reliability. Computer-Aided Design. [2] Soroush Abbaspour. FieldProgrammable Gate Arrays. In IEEE Trans. Statistical delay computation considering spatial correlations. 1965. June 2003. David Blaauw. [8] Aseem Agarwal. New York: Dover. David Blaauw.asu. Kaushik Gala. Vladimir Zolotov. and Mathematical Tables. In Proc. [4] Milton Abramowitz and Irene A. Asia South Pacific Design Automation Conf. and Massoud Pedram.R EFERENCES [1] PTM SPICE model. Handbook of Mathematical Functions with Formulas. Mahapatra. Nov.. pages 3–12. A comprehensive model of pmos nbti degradation. ACM Intl. Intl. January 2003. Parameterized blockbased non-gaussian statistical gate timing analysis. [11] M. August 2005. In Proc. Asia South Pacific Design Automation Conf. Hanif Fatemi. and Vladimir Zolotov. Computation and refinement of statistical bounds on circuit delay. [9] Aseem Agarwal. Sept. February 2000. In Proc. Computer-Aided Design of Integrated Circuits and Systems. Statistical delay computation considering spatial correlations. Stegun.A. 2003. David Blaauw. 199 . Vladimir Zolotov. Kaushik Gala. David Blaauw.edu/ ptm/latest. Symp. pages 271 – 276. Graphs. 2003. and David Blaauw. [6] Aseem Agarwal.

CA. 200 . Active leakage power optimization for FPGAs. and Sarma Vrudhula. Intl. Design Automation Conf. Bellato. [16] Adelchi Azzalini. A.[12] Jason H. Tahoori. February 1999. Singh. Leakage minimization of digital circuits using gate sizing in the presence of process variations. Praveen Ghanta. Paccagnella. [15] Maryam Ashouei. October 2005. [20] Sarvesh Bhardwaj and Sarma Vrudhula. March 2004. on Computer Design. A dualvt layout approach for statistical leakage variability minimization in nanometer CMOS. March 2008. Symp. In Proc. Ceschia. Sambasivan Narayan. Computer-Aided Design of Integrated Circuits and Systems. In Design Automation and Test in Europe. Abhijit Chatterjee. Bernardi1. June 2005. and Tim Tuan. τau: Timing analysis under uncertainty. 2005. 2003. Conf. M. Int. Sarma Vrudhula. Anderson. Design Automation Conf. D. Field-Programmable Gate Arrays. Adit D. February 2004. February 2005. Kluwer Academic Publishers.. Computer-Aided Design. and Chandu Visweswariah. M. In Proc. Intl. P. Jonathan Rose. Symp. Parameterized block-based statistical timing analysis with nonGaussian and nonlinear parameters. M. Evaluating the effects of SEUs affecting the configuration memory of an SRAM-based FPGA. Rebaudengo1. Nov. [19] Sarvesh Bhardwaj. Field-Programmable Gate Arrays. [22] Sarvesh Bhardwaj.. and Vivek De. Sonza Reorda. Najm. In Proc. M. ACM Intl. [13] Hongliang Chang andVladimir Zolotov. Leakage minimization of nano-scale circuits in the presence of systematic and random variations. [14] Ghazanfar Asadi and Mehdi B. In Proc. [18] Vaughn Betz. Bortolato. November 2006. Computer-Aided Design. Soft error rate estimation and mitigation for SRAM-based FPGAs. In Proc. Candelori. 32:159–188. Conf. [17] M. and David Blaauw. pages 615 – 620. A framework for statistical timing analysis using non-linear delay and slew models. and P. June 2005. [21] Sarvesh Bhardwaj and Sarma Vrudhula. Zambolin. pages 71–76. In Proc. Violante1. and Alexander Marquardt. IEEE Trans. Architecture and CAD for Deep-Submicron FPGAs. Scandinavian Journal of Statistics. The skew-normal distribution and related multivariate families. Anaheim. In Proc. A. Conf. Farid N. ACM Intl.

. In Proceedings Process Control Diagnostics and Modeling in Semiconductor Manufacturing II Electrochemical Society Meeting. Design Automation Conf. Yield enhancements of design-specific fpgas. Ketkar. Ali Keshavarzi. Symp. June 2003. and Vivek De. Keith A. Ali Keshavarzi. [26] Shekhar Borkar. May 1997. 201 . In Proc. Design Automation Conf.. Wenping Wang.. In Proc. Rakesh Vattikonda. and S. Design Automation Conf. June 2003. ACM Intl. [24] Duane Boning. and Jose Renau. In Proc. Chih-Kong Ken Yang. September 2006. In Proc. A stochastic jitter model for analyzing digital timing-recovery circuits. Constantinides. In Proc. and Haitham Hindi. Nov. Burns. Design Automation Conf. [33] Hongliang Chang and Sachin S. Siva Narendra. Parameter variations and impacts on circuits and microarchitecture. Parameter variations and impact on circuits and microarchitecture. Matthew Guthaus. [25] Shekhar Borkar. Design Automation Conf. James W.. and Vivek De. Tschanz. [28] Michael Brown. Tanay Karnik. June 2007. Wafer redeposition impact on etch rate uniformity in ipvd system. Conf. and Milan Vasilko.. June 2005. February 2006. Intl.. Bowman. Jim Tschanz. 2003. Field-Programmable Gate Arrays. Sapatnekar. K. February 2009. Full-chip analysis of leakage power under process variations. Robison. Noel Menezes. [31] Nicola Campregher. Tanay Karnik. Measuring and modeling variabilityusing low-cost FPGAs. and Vivek De. James Chung. Statistical timing analysis considering spatial correlations using a single PERT-like traversal. Symp. IEEE Transactions on Plasma Sciences. In Proc. February 2007. pages 621 – 625. Vrudhula. [30] Steven M. and Rajesh Divecha. Peter Y. Mahesh C. Spatial variation in semiconductor processes: Modeling for control. In Proc. Sapatnekar. Yu Cao. including spatial correlations. [27] Jozef Brcka and Rodney L.[23] Sarvesh Bhardwaj. Cyrus Bazeghi. IEEE Custom Integrated Circuits Conf. Computer-Aided Design. In Proc. [32] Hongliang Chang and Sachin S. Dennis Ouma. Field-Programmable Gate Arrays. [29] James Burnham. Siva Narendra. Comparative analysis of conventional and statistical design techniques. July 2009. In Proc. George A. Cheung. Predictive modeling of the nbti effect for reliable design. Jim Tschanz. ACM Intl.

Intl. [42] Lerong Cheng. Asia South Pacific Design Automation Conf. Chandu Visweswariah. January 2009. February 1985. Intl. August 2006. Jinjun Xiong. Field-Programmable Gate Arrays. Computer-Aided Design of Integrated Circuits and Systems. In Proc. and Dennis Sylvester. November 2005. Device and architecture cooptimization for FPGA power reduction. In Proc.. Computer-Aided Design. Jinjun Xiong. In Proc. [37] Ruiming Chen and Hai Zhou. Asia South Pacific Design Automation Conf. IEEE Trans. Fast estimation of timing yield bounds for process variations. February 2008. FPGA performance optimization via chipwise placement considering process variations. and David B. [43] Lerong Cheng. Static timing: Back to our roots. ComputerAided Design. and Lei He. [36] Ruiming Chen and Hai Zhou. Yan Lin. [39] Lerong Cheng. Design Automation Conf. In Proc. IEEE Journal of Solid-State Circuits. March 2008. Conf. Scott. Non-linear statistical static timing analysis for non-gaussian variation sources. Computer-Aided Design of Integrated Circuits and Systems. Lei He. David Blaauw. IEEE Trans. Fei Li. In International Conference on Field-Programmable Logic and Applications. February 2008. Groves. and Jinjun Xiong. November 2007. Timing budgeting under arbitrary process variations. Jinjun Xiong. July 2007. Conf.. Symp. Non-gaussian statistical timing analysis using Second-Order polynomial fitting. Yan Lin. February 2008. [44] Kaviraj Chopra. Imelda A. 202 . and Mike Hutton. Saller. IEEE Trans. Ashish Srivastava. 19:1211–1221. Tracebased framework for concurrent development of process and FPGA architecture considering process variation and reliability. In Proc. Saumil Shah.. [41] Lerong Cheng. [40] Lerong Cheng. and Lei He. Non-Gaussian statistical timing analysis using Second-Order polynomial fitting. and Lei He. In Proc. Reliability effects on mos transistors due to hot-carrier injection. Parametric yield maximization using gate sizing based on efficient statistical power and delay gradient computation. [35] Ruiming Chen. [38] Lerong Cheng. and Lei He. ACM Intl.. Jinjun Xiong. June 2007. VLSI Syst. Phoebe Wong.[34] Kueing-Long Chen. Vladimir Zolotov. and Lei He. Stephen A. Lizheng Zhang.

Intl. [50] Nigel Drego. Design Automation Conf. and Duane Boning.menezes. Computer-Aided Design. In Operations Research. Intl. Block-Based static timing analysis with uncertainty. Conf. In Proc. Asia South Pacific Design Automation Conf. Clark. February 2007. March 2008. The stratix routing and logic architecture. Willy Cheung. In Proc. In Proc. 1961. June 2007. [46] Charles E. March 2006.. Asia South Pacific Design Automation Conf. [49] Anirudh Devgan and Chandramouli Kashyap. [48] Vijay Degalahal and Tim Tuan. In Proc. [55] Qiang Fu. Akira Tsuchiya. and Costas J. IEEE Trans. Fast second-order statistical static timing analysis using parameter dimension reduction. Variogram based robust extraction of process variation.Najm and N. [56] Takayuki Fukuoka. τ. The greatest of a finite set of random variables. [47] David Lewis ect. Computer-Aided Design. Characterizing intra-die spatial correlation using spectral density method. Conf. [54] Paul Friedberg. Statistical gate delay model for multiple input switching. ACM Intl. November 2003. Jun Tao. Statistical timing analysis based on a timing yield model. Wai-Shing Luk. January 2005.N. February 2003.. [52] Zhuo Feng. May 2009. Vdd programmability to reduce FPGA interconnect power. In Data Analysis and Modeling for Process Control III. In Proc. 203 . and Hidetoshi Onodera. Timing and Analysis Utilities. IEEE International Symposium on Quality Electronic Design. Lack of spatial correlation in mosfet threshold voltage variation and implications for voltage scaling. In Proc. [53] F. and David Blaauw. and Yaping Zhan. In Proc. Spanos. [51] Fei Li and Yan Lin and Lei He.[45] Kaviraj Chopra. In Proc. Methodology for high level estimation of FPGA power consumption.. Design Automation Conf. and Xuan Zeng.. Field-Programmable Gate Arrays. In Proc. February 2008. Semiconductor Manufacturing. Narendra Shenoy. June 2004. Changhao Yan. Peng Li. Spatial modeling of micron-scale gate length variation. Anantha Chandrakasan. Symp. November 2004. pages 85 – 91.

net/models. Tuan.[57] Anne Gattiker. and Toshiyuki Shibuya. M. and Farid N. Computer-Aided Design.itrs. on Nuclear Science. In http://www.. Kandemir. [63] Katsumi Homma. [61] Khaled R... Nonparametric statistical static timing analysis: An ssta framework for arbitrary distribution. A user’s guide to MASTAR4. A dual-vdd low power FPGA architecture. Impact of CMOS technology scaling on the atmospheric neutron soft error rate. and Chris Long. In International Symposium on Quality of Electronic Design. In Proc. 2002. Reducing leakage energy in FPGAs using region-constrained placement. 204 . HirooMASUDA. Gayasen. and T. February 2008. Field-Programmable Gate Arrays. Rashmi Dinakar. [62] Dadgour H. 2001. Design Automation Conf. and T. [58] A.html. Tuan. N. FieldProgrammable Logic and its Application.net/. N. Lee. Y. Sheng-Chih Lin. A statistical framework for estimation of full-chip leakage-power distribution under parameter variations. Symp. Asia South Pacific Design Automation Conf. Najm. J. Timing yield estimation from static timing analysis. M. Kandemir. [59] A. Gayasen. In Proc. J. Efficient block-based parameterized timing analysis covering all potentially critical paths. August 2004. In Proc. Sani Nassif. [66] International Technology Roadmap for Semiconductor. IEEE Trans. M. Vijaykrishnan. and Yasuaki INOUE. November 2008. June 2008. In http://public. [68] International Technology Roadmap for Semiconductor. Takashi Sato Noriaki Nakayama. Izumi Nitta. 2005. 2008. In Proc. Sari Onaissi. Svensson. Heloue. Hazucha and C.itrs.net/. IEEE Transactions on. Electron Devices. In Proc. Intl. [64] Shin ichi OHKAWA. and Banerjee K.F. FUNDAMENTALS. IEICE TRANS. K. M.itrs. February 2004. 2005. Irwin. A novel expression of spatial correlation by a random curved surface model and its application to LSI designs. and Kazuya Masu. [67] International Technology Roadmap for Semiconductor. Intl. Non-gaussian statistical timing models of die-to-die and within-die parameter variations for full chip analysis. [65] Masanori Imai. Conf. Tsai. [60] P. Conf. Vijaykrishnan. Irwin. 2000. ACM Intl. November 2007. In http://public.

cell granularity . [75] Kunhyuk Kang. Design. 2007. Naidu. Conf. Investigation of etch rate uniformity of 60 mhz plasma etching equipment. In Proc. Design Automation and Test Conf. Bipul C.pdf. November 2008. and implementation. ACM Intl. Journal of The Electrochemical Society. FPGA area vs. Parametric yield estimation for deep submicron VLSI circuits. Najm. February 1992. Design Automation Conf. [70] Javid Jaffari and Mohab Anis. Statistical timing analysis using levelized covariance propagation. and Juergen Teich. 2003.itrs. Computer-Aided Design. Intl. berkeley.net/Links/2007ITRS/2007 Chapters/2007 Design. [72] J. Otten. in Europe. and Kaushik Roy.. November 2004. In Proc. [79] Dirk Koch. September 2002. Visweswariah. Prentice Hall Inc. In 15th Symposium on Integrated Circuits and Systems Design.Aydil. Design Automation Conf.. Efficient hardware checkpointing – concepts. [80] J. Jess.[69] International Technology Roadmap for Semiconductor. CA. February 2007. Field-Programmable Gate Arrays. and C. June 2005. Anderson and Farid N. Christian Haubelt. overhead analysis.Rabaey. In Proc. [71] Jason H. On efficient monte carlo-based statistical static timing analysis of digital circuits. In Proc. In 1st ACM Workshop on FPGAs. Kouloheris and A. Low-power programmable routing circuitry for FPGAs. In Proc. R. El Gamal. Digital Integrated Circuits . Statistical timing for parametric yield prediction of digital integrated circuits. K. [73] Jochen Jess. A general framework for accurate statistical timing analysis considering correlations. November 2001. Effects of chamber wall conditions on cl concertration and si etch rate uniformity in palsma etching reactors. June 2003. Symp.M. In http://www. In Proc. S. June 2003. 205 . Intl. pages 89 – 94. Computer-Aided Design. [77] Tae Won Kim and Eray S. [74] J. [76] Vishal Khandelwal and Ankur Srivastava..Aydil. March 2005. Paul. Kalafala. [78] Tae Won Kim and Eray S. Japanese Journal of Applied Physics.lookup tables and PLA cells.A Design Perspective. Conf.

and Lei He. and Lawrence T. [84] Julien Lamoureux. In Proc. Symp. April 1993.E. February 2003. IEEE Trans. A framework for block-based timing sensitivity analysis. Intl. Measuring the gap between fpgas and asics. On hierarchical statistical static timing analysis. Brown. Yan Lin. [82] Ian Kuon and Jonathan Rose. Power modeling and characteristics of field programmable gate arrays. Symp. and Jason Congs. [90] Fei Li. Chandramouli V. Lemieux. Design Automation Conf. Deming Chen. IEEE Trans. Kashyap. Yan Lin. Walter Schneider. Rabaey. November 2003. Lemieux and S. pages 701–708. 206 . and Steven J. In Proc. [85] Julien Lamoureux and Steven J. Low-energy embedded FPGA structures.. Design Automation Conf. Lei He. [89] Fei Li. [83] Eric Kusse and Jan M. Kumar.. Jun 2004. In Proceedings of the ACM Physical Design Workshop. In Proc. and Sachin Sapatnekar. and Ulf Schlichtmann. [91] Fei Li. March 2009. pages 155–160. In Proc. [92] Fei Li.E. Symp. Deming Chen. Computer-Aided Design of Integrated Circuits and Systems. Yan Lin.. June 2004. Xin Li. November 2005. Low Power Electronics and Design. Design Automation and Test Conf. ACM Intl.[81] Sanjay V. Wilton. April 2007. D. Conf. Field-Programmable Gate Arrays. Architecture evaluation for power-efficient FPGAs. Computer-Aided Design. In Proc. A detailed router for allocating wire segments in field-programmable gate arrays. ACM Intl. Ning Chen. In Proc. Glitchless: An active glitch minimization techniques for FPGAs. Pileggi. Manuel Schmidt. 24:1712–1724. June 2008. Computer-Aided Design of Integrated Circuits and Systems. August 1998. February 2007. On the interaction between poweraware FPGA CAD algorithms. STAC: Statisical timing analysis with correlation. Guy G. February 2007. G. Computer-Aided Design of Integrated Circuits and Systems. Design Automation Conf. Lei He. Intl. [86] Jiayong Le. [88] Bing Li. Field programmability of supply voltages for FPGA power reduction. and Jason Cong. Field-Programmable Gate Arrays. In Proc. IEEE Trans. Wilton. in Europe. In Proc. and Lei He. 26. FPGA power reduction using configurable dualvdd. [87] G.

January 2005. June 2002.. Design Automation Conf. and Kevin Lepak. Lei He. Pileggi.[93] Fei Li. [95] Xin Li. In Proc. In Proc. Conf. [94] Xin Li. Field-Programmable Logic and its Application. In Proc. February 2007. ACM Intl. Padmini Gopalakrishnan. February 2008. Symp. [96] Weiping Liao. Routing track duplication with fine-grained powergating for FPGA interconnect power reduction. Temperature and supply voltage aware performance and power modeling at microarchitecture level. February 2005. Symp. July 2005. Circuits and architectures for vdd programmable FPGAs. IEEE Trans. A. and Huazhong Yang. and L. Stochastic physical synthesis for FPGAs with pre-routing interconnect uncertainty and process variation.. Asymptotic probability extraction for non-normal distributions of circuit performance. Mike Hutton. Field-Programmable Gate Arrays. Fei Li.-C. [97] Saihua Lin. Rong Luo. and Lei He. [103] Jing-Jia Liou. 25. Intl. June 2006. Field-Programmable Gate Arrays. Jiayong Le. [102] Yan Lin. In Proc. In Proc. October 2006.T. Fei Li. ACM Intl. and Lei He. Conf. Design Automation Conf. Dual-vdd interconnect with chip-level time slack allocation for FPGA power reduction. Jiayong Le. L. Asia South Pacific Design Automation Conf. Yu Wang. A capacitive boosted buffer technique for high-speed process-variation-tolerant interconnect in udvs application. Wang. Intl. Projection based statistical analysis of full-chip leakage power with non-log-normal distributions. Lei He. Symp. ACM Intl. In Proc. [98] Yan Lin and Lei He. Low-power FPGA using pre-defined dual-vdd/dual-vt fabrics. Computer-Aided Design. Computer-Aided Design of Integrated Circuits and Systems. and Jason Cong. False-pathaware statistical timing analysis and efficient path selection for delay testing and timing validation. In Proc. February 2004. Asia South Pacific Design Automation Conf. and Kwang-Ting Cheng.. and Lei He. Krstic. [100] Yan Lin.. In Proc. Yan Lin. Computer-Aided Design of Integrated Circuits and Systems. FieldProgrammable Gate Arrays. and Lawrence T Pileggi. IEEE Trans. [101] Yan Lin. Placement and timing for FPGAs considering variations. [99] Yan Lin and Lei He. 207 . August 2006. November 2004. In Proc.

A general framework for spatial correlation modeling in vlsi design. In Proc. [107] Jui-Hsiang Liu. Design Automation Conf... Statistical leakage and timing optimization for submicron process variation. Ming-Feng Tsai. Toshiyuki Tsutsumi. June 2008. ACM Intl. March 2008. In Proc. [105] Bao Liu. IEEE Trans. In Proc. June 2007. Design Automation Conf. In Proc. March 2005. Symp. Agrawal. Wang Wei-Shen.. Lumdo Chen. Intl. Combining low-leakage techniques for FPGA routing design. In Proc. [108] Jui-Hsiang Liu. August 2004. A. Design Automation and Test Conf. 1986. In Proc. Low Power Electronics and Design. June 2008. Director. and Hanpei Koike. Computer-Aided Design of Integrated Circuits and Systems.W. [112] W. and Charlie Chung-Ping Chen. Signal probability based statistical timing analysis. Spatial correlation extraction via random field simulation and production chip performance regression. [113] Hratch Mangassarian and Mohab Anis. 208 . March 2008. ACM Intl. In Proc. Strojwas. Design Automation and Test Conf. [115] Yohei Matsumoto. Design Automation and Test Conf. Jan. An efficient algorithm for statistical minimization of total power under timing yield constraints. Maly. [106] Frank Liu. Symp. On statistical timing analysis with interand intra-die variations. and Roberto Giansante. Accurate and analytical statistical spatial correlation modeling for vlsi dfm applications. [109] M. and S. In Proc. [114] Murari Mani. in Europe. Masakazu Hioki. Anirudh Devgan. in Europe. Tadashi Nakagawa. Lumdo Chen.[104] Bao Liu. and Chen Charlie Chung-Ping. [111] Yuanlin Lu and V. In VLSI Design.D. Design Automation Conf. and M. Performance and yield enhancement of FPGAs with with-in die variation using multiple configurations. Design Automation Conf. in Europe. Toshihiro Sekigawa. Takashi Kawanami. Orshanskyg. Accurate and analytical statistical spatial correlation modeling for vlsi dfm applications. Liu. In Proc. Luca Ciccarelli. Symp. pages 309–314. and Michael Orshansky. FieldProgrammable Gate Arrays. February 2007. VLSI yield prediction and estimation: A unified framework. January 2007.J. June 2005. 5:114–130. Leakage power reduction by dual-vth designs under probabilistic analysis of vth variation. Februray 2005. In Proc. [110] Andrea Lodi. Field-Programmable Gate Arrays..

June 2004. Mogal. Field-Programmable Logic and its Application. August 2004. Vrudhula. 2008. In Design for Manufacturability through Design-Process Integration III. Computer-Aided Design. and Mustafa Celik. and Duane Boning. [118] Ayhan Mutlu. In Proc. Fast statistical timing analysis handling arbitrary delay correlations. [125] Kun Qian and Costas J. Jiayong Le. and S. Design Automation Conf. Antoniadis. and Anantha P. [127] Sreeja Raj. Wilton. [123] K. 2007. Nassif.K. November 2007. A flexible power model for FPGAs.. April 2008. March 2004. In Sophia Antipolis MicroElectronics Forum.18-um CMOS. Conf. A methodology to improve timing yield in the presence of process variations. Vivek De. July 2009. Conf. In Design for Manufacturability through Design-Process Integration II. [124] Predictive Technology Model. and Janet Wang. Spanos. In http://www. In Proc. and Kia Bazargan. [126] Kun Qian and Costas J.asu.edu/ ptm/. Yan. Sani R.[116] Hushrav D. Full-chip subthreshold leakage power prediction and reduction techniques for sub-0. In Proc. In Proc. March 2009.. Ruben Molina. Spanos. Solid-State Circuits. Design Automation Conf. Intl. Hierachical modeling of spatial variability of a 45nm test chip. A parametric approach for handling local variation effects in timing analysis. Tutorial on statistical static timing analysis (SSTA). Haifeng Qian. Shekhar Borkar andDimitri A. Sarma B. Sachin S. Sapatnekar. Design for Manufacturability and Statistical Design.eas. A. Clustering based pruning for statistical criticality computation under process variations. September 2002. IEEE Journal of. In Proc. Power-driven design partitioning. Design Automation Conf. A comprehensive model of process variability for statistical timing optimization. [117] Rajarshi Mukherjee and Seda Ogrenci Memik. of 12th International conference on Field-Programmable Logic and Applications. [119] Siva Narendra. Chandrakasan. Intl.. June 2004. 209 . [122] Pierrick Pedron. Poon. Springer. [121] Michael Orshansky. [120] Michael Orshansky and Arnab Bandyopadhyay. In Proc.

Determination of optimal polynomial regression function to decompose on-die systematic and random variations. 210 . Gi-Joon Nam. 1990. David Lewis. Antonio Pullini. Sani R. IEEE Journal of Solid-State Circuits. IEEE Int. Parametric yield in FPGAs due to withindie delay variations: A quantative analysis. Computer-Aided Design. and David Z. and Enrico Macii. K. An accurate sparse matrix based framework for statistical static timing analysis. ACM Intl. Wong. [134] Takashi Sato. In Proc. ACM Intl. 1990. Symp. Giovanni De Micheli. Parametric yield in FPGAs due to withindie delay variations: a quantitative analysis. and Kazuya Masu. Measuring and modeling FPGA clock variability. and Peter Y. Proc. Michael Orshansky. [131] Jonathan Rose. Cheung.. In Proc. Design Automation Conf. [129] Rajeev Rao. Architecture of field-programmable gate arrays: The effect of logic functionality on area efficiency. [130] Jonathan Rose. 2007. Anirudh Devgan. Pan. Cheung. and Yukihiro Shimogaki. and Dennis Sylveste. In Proc. Ryosuke Shimizu. June 2004. David Lewis. February 2008. [136] Pete Sedcole and Peter Y. in Europe. March 2009. K. Symp. [135] Pete Sedcole and Peter Y. February 2007. and Paul Chow. Field-Programmable Gate Arrays. Ashish Kumar Singh. Physically clustered forward body biasing for variability compensation in nano-meter CMOS design. Justin S. Asia South Pacific Design Automation Conf. February 2008. February 2007. In Proc. Luca Benini.[128] Anand Ramalingam.. Intl. In Proc. FieldProgrammable Gate Arrays. In Proc. David Blaauw. Robert J. Conf. Robert J. Hiroyuki Ueyama. Architecture of field-programmable gate arrays: The effect of logic functionality on area efficiency. and Paul Chow. November 2006. In Proc.. ACM Intl. Symp. [137] Pete Sedcole. [133] Ashoka Sathanur. Masaaki Ogino. Francis. Cheung. FieldProgrammable Gate Arrays. Noriaki Nakayama. Nassif. Solid-State Circuits Conf.K. Parametric yield estimation considering leakage variability. Francis. [132] Shigeru Sakai. Design Automation and Test Conf. MATERIALS RESEARCH SOCIETY SYMPOSIUM PROCEEDINGS. Deposition uniformity control in a commercial scale HTO-CVD reactor.

In ASICON. The effect of logic block architecture on FPGA performance. March 2008. In Proc. Symp. ACM Intl. Pan. highly nonlinear fitting. and Huazhong Yang. in Europe. Conf. and Rob A. CRC. with application to statistical timing. Rong Luo. and Rob A. [142] Jaskirat Singh and Sachin Sapatnekar. Petros Boufounos. October 2007. and Farinaz Koushanfar. Statistical timing analysis with correlated non-gaussian parameters using independent componenet analysis. IEEE Journal of Solid-State Circuits. ComputerAided Design. November 2008. Sonia Singhal. [147] Satish Sivaswamy and Kia Bazargan. Rutenbar. Intl. fast monte carlo statistical static timing analysis: Why and how. [140] Xukai Shen. Anand Ramalingam. Design Automation Conf. Sheats. Field-Programmable Gate Arrays. 211 . In Proc. [143] Satwant Singh. Exploiting correlation kernels for efficient handling of intra-die spatial correlation. [148] Thomas Skotnicki. [144] Amith Singhee and Rob A. Heading for decananometer CMOS . Post-silicon timing characterization by compressed sensing. [139] James R. Leakage power reduction through dual vth assignment considering threshold voltage variation.[138] Davood Shamsi. and David Z. Conf. 1998. Intl. Design Automation and Test Conf. 2006. September 2000. Paul Chow. Daifeng Wang. 1992. pages 143–148. Practical. In Proc. Shi. February 2007. Microlithography Science and Technology. In Proc. Rutenbar. [145] Amith Singhee. and David Lewis. Sonia Singhal.Is navigation among icebergs still a viable strategy? In Solid-State Device Research Conference. Jonathan Rose. [141] Sean X. [146] Amith Singhee. In Proc. November 2008. Feb. in Europe. March 2008. In Proc. pages 19–33. Design Automation and Test Conf.. Yu Wang. In ACM/IEEE International Workshop on Timing Issues. ComputerAided Design. Rutenbar. Variation-aware routing for FPGAs. June 2007. Beyond low-order statistical response surfaces: Latent variable regression for efficient. Latch modeling for statistical timing analysis.

Saha Dipankar. Agarwal. In Proc.. In Proc. Design Automation Conf. January 2005.. Varghese Dhanoop. In Proc. David Blaauw. Intl. Robert Bai. Leakage power analysis of a 90nm FPGA. G. Symp. Low Power Electronics and Design.. Yohei Kume. 1997. June 2005. Kanak B. ACM Intl. Sylvester. [152] Ashish Srivastava. and TAKWALE M. On the generation and recovery of interface traps in MOSFETs subjected to NBTI. Analysis and decomposition of spatial variation in integrated circuit processes and devices. Trans. and James E. 2001. [157] Shuji Tsukiyama. 53. JADKAR S. Accurate and efficient parametric yield estimation considering correlated variations in leakage power and performance. B.[149] Mahapatra Souvik. In Proc. July 2009. February 2008. [158] Tim Tuan and Bocheng Lai. Design Automation Conf. Duane S. Speed and yield enhancement by track swapping on critical paths utilizing random variations for FPGAs. Design Automation Conf. and Stephen Director. In Proc. IEEE. June 2007. on Electron Device. 2002. Feb. Stine. IEEE Custom Integrated Circuits Conf. [151] Ashish Srivastava. In International Conference on Cat-CVD Process. Statistical analysis of full-chip leakage power considering junction tunneling leakage. and HCI stress. and Shuji Tsukiyama.. FN. [156] Li Tao and Yu Zhiping. Symp. [150] Suresh Srinivasan.. [154] Yuuri Sugihara. and Bharath P. Kazutoshi Kobayashi. In Proc. HotWire CVD growth simulation for thickness uniformity... and Hidetoshi Onodera. pages 24 – 41. Shah. Narayanan Vijaykrishnan. R. Asia South Pacific Design Automation Conf. and Dennis Sylvester. IEEE Transactions on. and Tim Tuan. Yuki Yoshida. [153] Brian E. In Proc. Aman Gayasen. July 2006... [155] Shingo Takahashi. Asia South Pacific Design Automation Conf. January 2005. Toward stochastic design for digital circuits: statistical static timing analysis.. Kumar. PATIL S. Saumil S. Leakage control in FPGA routing fabric. Boning. 212 . Chung. Gaussian mixture model for statistical timing analysis. Dennis M. Field-Programmable Gate Arrays. In Semiconductor Manufacturing. [159] SALI Jaydeep V. 2003. Modeling and analysis of leakage power considering within-die process variations. In Proc. David Blaauw.

2006. [167] Ho-Yan Wong. In Proc. Intl. and Tai-Hsuan Wu.. Modeling and minimization of pmos nbti effect for robust nanometer design.. Kaushik Ravindran. [165] Ruilin Wang and Cheng Kok Koh.. In Proc. Intl. Intl. Adjustment-based modeling for statistical static timing analysis with high dimension of variability. IEEE Trans. November 2005. Steven G. Lerong Cheng. Krishnan. [168] Phoebe Wong.5v platform FPGA complete data sheet. November 2008. Lerong Cheng. [161] Vineeth Veetil.[160] Rakesh Vattikonda. and David Blaauw. Virtex-II 1. [163] Chandu Visweswariah. and Sambasivan Narayan. Efficient monte carlo based incremental statistical timing analysis. Jeff Piaget. and Yu Cao. Design Automation Conf. Steven G. Kaushik Ravindran. November 2007. In Proc. and Lei He. Conf. Conf. First-order incremental block-based statistical timing analysis. 213 . Yan Lin. Conf. July 2006. [169] Lin Xie. Kerim Kalafala. and Yu Cao. A frequency-domain technique for statistical timing analysis of clock meshes. Conf. ComputerAided Design. Computer-Aided Design of Integrated Circuits and Systems. Sambasivan Narayan. Design Automation Conf. November 2005. Walker. An integrated modeling paradigm of circuit reliability for 65nm CMOS technology. ComputerAided Design. Intl. Hemmett. Computer-Aided Design. Design Automation Conf. Dennis Sylvester. Computer-Aided Design. June 2003. [170] Xilinx Corporation. Design Automation Conf. In Proc. Wenping Wang. In Proc. and Jeffrey G. First-order incremental block-based statistical timing analysis. Vijay Reddy. [164] Chandu Visweswariah. Anand T. July 2002.. Beece. In Proc. Natesan Venkateswaran.. June 2004. In Proc. Srikanth Krishnan. Walker. Rakesh Vattikonda. FPGA device and architecture evaluation considering process variations. Death. taxes and failing chips. June 2008. Daniel K. FPGA device and arhitecture evaluation consdering process variation. In Proc. [162] Chandu Visweswariah. Yan Lin. Azadeh Davoodi. Jun Zhang. Kerim Kalafala. September 2007. IEEE Custom Integrated Circuits Conf. [166] Wenping Wang. In Proc. and Lei He.

CA. 2007. and Vladimir Zolotov. Xin Li. [176] Xiaoji Ye. John A. version 3. Gubner. [174] A. Gubner. Andrzej J. [181] Lizheng Zhang. and Charlie ChungPing Chen. Generic statistical timing analysis with non-gaussian process parameters. Design Automation Conf. Robust extraction of spatial correlation. [179] Yaping Zhan. and Lei He.. Correlaiton-aware statistical timing analysis with non-gaussian delay distribution. Statistical ordering of correlated timing quantities and its application for path ranking. Design Automation Conf. and Lawrence T. March 2005. In Proc.[171] Jinjun Xiong. M. and Mahesh Sharma. Intl. [172] Jinjun Xiong. Correlation-preserved non-gaussian statistical timing analysis with quadratic timing model. Statistical leakage power minimization using fast equi-slack shell based optimization. and Peng Li. Weijen Chen. [180] Lizheng Zhang. [178] Yaping Zhan. 1991. Zhuo Fang. related to stationary random processes. Strojwas. and Lei He. Microelectronics Center of North Carolina (MCNC). Yuhen Hu. In Austin Conference on Integrated Systems and Circuits. Vladimir Zolotov. Design Automation Conf. October 2008. David Newmark. Wei Dong. April 2006. Yaglom. Some classes of random fields in n-Dimensional space. [177] Guo Yu. In Proc. Yang. Symp. pages 77–82. Weijen Chen. Yuhen Hu. Anaheim.0. and Charlie ChungPing Chen. Design Automation and Test Conf. Computer-Aided Design of Integrated Circuits and Systems.. 214 . January 1957. Chandu Visweswariah.. Andrzej J. Statistical static timing analysis considering process variation model uncertainty. Pileggi. John A. June 2005. pages 83 – 88. Design Automation Conf. In Proc. Technical report. Physical Design. IEEE Trans. and Peng Li. [173] Jinjun Xiong. In Proc. Statistical timing analysis with extended pseudo-canonical timing model. Theory Probability Applications. Strojwas. Robust extraction of spatial correlation. in Europe. May 2006.. Logic synthesis and optimization benchmarks. Vladimir Zolotov. June 2005. Computer-Aided Design of Integrated Circuits and Systems. In Proc. Yaping Zhan. 2007. IEEE Trans. [175] S. July 2009. In Proc. pages 952 – 957.

and Kaustav Banerjee. [183] Lizheng Zhang. and Charlie Chung-Ping Chen.. First IFAC Workshop on Advanced Process Control for Semiconductor Manufacturing. One step forward from run-to-run critical dimension control: Across-wafer level critical dimension control through lithography and etch process. and Costas J. Symp. in Europe. Angela Hui. An efficient method for chip-level statistical capacitance extraction considering process variations with spatial correlation. Symp. and Costas J. Spanos. Zhiping Yu. [188] Songqing Zhang. Inspection. A probabilistic framework to estimate full-chip subthreshold leakage power distribution considering within-die and die-to-die p-t-v variations. Vineet Wason. and Kaustav Banerjee. Journal of Process Control. Cherry Tang. Tony Hsieh. Wenjian Yu. [185] Qiaolin Zhang. and Kevin Nowka. Vineet Wason. March 2006. [184] Lizheng Zhang. [187] Songqing Zhang. and Chung-Ping Chen. Low Power Electronics and Design. In Proc. 18(10):937 – 945.[182] Lizheng Zhang. Physical Design. Intl. In Proc. Frank Liu. and Process Control for Microlithography XXI. January 2005. Aprial 2006. Kameshwar Poolla. [189] Wangyang Zhang. [186] Qiaolin Zhang. Rigorous extraction of process variations for 65nm CMOS design. Kanak Agarwal. A probabilistic framework to estimate full-chip subthreshold leakage power distribution considering within-die and die-to-die variations. April 2007. In Metrology. Design Automation and Test Conf. Zeyi Wang. Intl. Yuhen Hu. Asia South Pacific Design Automation Conf. March 2008. Symp. 2004. Yu Cao. Dhruva Acharyya. Jun Shao. Across-wafer cd uniformity control through lithography and etch process: experimental verification. and Charlie Chung-Ping Chen. Bhanwar Singh. 215 . 2008. Rong Jiang. In Proc. Block based statistical timing analysis with extended canonical timing model. Advanced Process Control for Semiconductor Manufacturing. In Proc. Spanos. Non-gaussian statistical parameter modeling for SSTA with confidence interval analysis. In Proc. In European Solid-State Circuits Conference. In Proc. Yuhen Hu. Low Power Electronics and Design. Kameshwar Poolla. Intl. September 2007. Nick Maccrae. Jason Cain. Sani Nassif. [190] Wei Zhao. Design Automation and Test Conf. Aug 2004. Statistical timing analysis with path reconvergence and spatial correlations. in Europe. and Jinjun Xiong.

Master your semester with Scribd & The New York Times

Special offer for students: Only $4.99/month.

Master your semester with Scribd & The New York Times

Cancel anytime.