
E&CE 327: Digital Systems Engineering Lecture Slides

Mark Aagaard
2011t1 (Winter)
University of Waterloo, Dept of Electrical and Computer Engineering

Contents

I Lecture Notes

1 VHDL
  1.1 Introduction to VHDL
    1.1.1 Levels of Abstraction
    1.1.2 VHDL Origins and History
    1.1.3 Semantics
    1.1.4 Synthesis of a Simulation-Based Language
    1.1.5 Solution to Synthesis Sanity
    1.1.6 Standard Logic 1164
  1.2 Comparison of VHDL to Other Hardware Description Languages
  1.3 Overview of Syntax
    1.3.1 Syntactic Categories
    1.3.2 Library Units
    1.3.3 Entities and Architecture
    1.3.4 Concurrent Statements
    1.3.5 Component Declaration and Instantiations
    1.3.6 Processes
    1.3.7 Sequential Statements
    1.3.8 A Few More Miscellaneous VHDL Features
  1.4 Concurrent vs Sequential Statements
    1.4.1 Concurrent Assignment vs Process
    1.4.2 Conditional Assignment vs If Statements
    1.4.3 Selected Assignment vs Case Statement
    1.4.4 Coding Style
  1.5 Overview of Processes
    1.5.1 Combinational Process vs Clocked Process
    1.5.2 Latch Inference
  1.6 Details of Process Execution
    1.6.1 Simple Simulation
    1.6.2 Temporal Granularities of Simulation
    1.6.3 Intuition Behind Delta-Cycle Simulation
    1.6.4 Definitions and Algorithm
      1.6.4.1 Process Modes
      1.6.4.2 Simulation Algorithm
      1.6.4.3 Delta-Cycle Definitions
    1.6.5 Example 1: Process Execution (Bamboozle)
    1.6.6 Example 2: Process Execution (Flummox)
    1.6.7 Ex: Need for Provisional Asn
    1.6.8 Delta-Cycle Simulations of Flip-Flops
  1.7 Register-Transfer-Level Simulation
    1.7.1 Overview
    1.7.2 Technique for Register-Transfer Level Simulation
    1.7.3 Examples of RTL Simulation
      1.7.3.1 RTL Simulation Example 1
  1.8 VHDL and Hardware Building Blocks
    1.8.1 Basic Building Blocks
    1.8.2 Deprecated Building Blocks for RTL
    1.8.3 Hardware and Code for Flops
      1.8.3.1 Flops with Waits and Ifs
      1.8.3.2 Flops with Synchronous Reset
      1.8.3.3 Flop with Chip-Enable and Mux on Input
      1.8.3.4 Flops with Chip-Enable, Muxes, and Reset
    1.8.4 An Example Sequential Circuit
  1.9 Arrays and Vectors
  1.10 Arithmetic
    1.10.1 Arithmetic Packages
    1.10.2 Shift and Rotate Operations
    1.10.3 Overloading of Arithmetic
    1.10.4 Different Widths and Arithmetic
    1.10.5 Overloading of Comparisons
    1.10.6 Different Widths and Comparisons
    1.10.7 Type Conversion
  1.11 Synthesizable vs Non-Synthesizable Code
    1.11.1 Unsynthesizable Code
      1.11.1.1 Initial Values
      1.11.1.2 Wait For
      1.11.1.3 Different Wait Conditions
      1.11.1.4 Multiple if rising_edge in Process
      1.11.1.5 if rising_edge and wait in Same Process
      1.11.1.6 if rising_edge with else Clause
      1.11.1.7 if rising_edge Inside a for Loop
      1.11.1.8 wait Inside of a for loop
  1.12 Synthesizable VHDL Coding Guidelines

2 RTL Design with VHDL
  2.1 Prelude to Chapter
  2.2 FPGA Background and Coding Guidelines
    2.2.1 Generic FPGA Hardware
      2.2.1.1 Generic FPGA Cell
    2.2.2 Area Estimation
      2.2.2.1 Interconnect for Generic FPGA
      2.2.2.2 Clocks for Generic FPGAs
      2.2.2.3 Special Circuitry in FPGAs
    2.2.3 Generic-FPGA Coding Guidelines
  2.3 Design Flow
  2.4 Algorithms and High-Level Models
  2.5 Finite State Machines in VHDL
    2.5.1 Introduction to State-Machine Design
      2.5.1.1 Mealy vs Moore State Machines
      2.5.1.2 Introduction to State Machines and VHDL

      2.5.1.3 Explicit vs Implicit State Machines
    2.5.2 Implementing a Simple Moore Machine
      2.5.2.1 Implicit Moore State Machine
      2.5.2.2 Explicit Moore with Flopped Output
      2.5.2.3 Explicit Moore with Combinational Outputs
      2.5.2.4 Explicit-Current+Next Moore with Concurrent Assignment
      2.5.2.5 E-C+N Moore with Comb Proc
    2.5.3 Implementing a Simple Mealy Machine
    2.5.4 Reset
    2.5.5 State Encoding
  2.6 Dataflow Diagrams
    2.6.1 Dataflow Diagrams Overview
    2.6.2 Dataflow Diagrams, Hardware, and Behaviour
    2.6.3 Dataflow Diagram Execution
    2.6.4 Performance Estimation
    2.6.5 Area Estimation
    2.6.6 Design Analysis
    2.6.7 Area / Performance Tradeoffs
  2.7 Design Example: Massey
  2.8 Design Example: Vanier
    2.8.1 Requirements
    2.8.2 Algorithm
    2.8.3 Initial Dataflow Diagram
    2.8.4 Reschedule to Meet Requirements
    2.8.5 Optimize Resources
    2.8.6 Assign Names to Registered Values
    2.8.7 Input/Output Allocation
    2.8.8 Tangent: Combinational Outputs
    2.8.9 Register Allocation
    2.8.10 Datapath Allocation
    2.8.11 Hardware Block Diagram and State Machine
      2.8.11.1 Control for Registers
      2.8.11.2 Control for Datapath Components
      2.8.11.3 Control for State
      2.8.11.4 Complete State Machine Table
    2.8.12 VHDL Code with Explicit State Machine
    2.8.13 Peephole Optimizations
    2.8.14 Notes and Observations
  2.9 Pipelining
    2.9.1 Introduction to Pipelining
    2.9.2 Partially Pipelined
    2.9.3 Terminology
  2.10 Design Example: Pipelined Massey
  2.11 Memory Arrays and RTL Design
    2.11.1 Memory Operations
    2.11.2 Memory Arrays in VHDL
    2.11.3 Data Dependencies
    2.11.4 Memory and Dataflow Diagrams
    2.11.5 Ex: Mem Array and Dataflow Diagram
  2.12 Input / Output Protocols
  2.13 Example: Moving Average
    2.13.1 Requirements and Environmental Assumptions
    2.13.2 Algorithm
    2.13.3 Pseudocode and Dataflow Diagrams
    2.13.4 Control Tables and State Machine
    2.13.5 VHDL Code

3 Performance Analysis and Optimization
  3.1 Introduction
  3.2 Defining Performance
  3.3 Comparing Performance
    3.3.1 General Equations
    3.3.2 Example: Performance of Printers
  3.4 Clock Speed, CPI, Program Length, and Performance
    3.4.1 Mathematics
    3.4.2 Example: CISC vs RISC and CPI
    3.4.3 Effect of Instruction Set on Performance
    3.4.4 Effect of Time to Market on Relative Performance
    3.4.5 Summary of Equations
  3.5 Performance Analysis and Dataflow Diagrams
    3.5.1 Dataflow Diagrams, CPI, and Clock Speed
    3.5.2 Examples of Dataflow Diagrams for Two Instructions
      3.5.2.1 Scheduling of Operations for Different Clock Periods
      3.5.2.2 Performance Computation for Different Clock Periods
      3.5.2.3 Example: Two Instructions Taking Similar Time
      3.5.2.4 Example: Same Total Time, Different Order for A
    3.5.3 Example: From Algorithm to Optimized Dataflow
  3.6 General Optimizations
    3.6.1 Strength Reduction
      3.6.1.1 Arithmetic Strength Reduction
      3.6.1.2 Boolean Strength Reduction
    3.6.2 Replication and Sharing
      3.6.2.1 Mux-Pushing
      3.6.2.2 Common Subexpression Elimination
      3.6.2.3 Computation Replication
    3.6.3 Arithmetic
  3.7 Retiming

4 Functional Verification
  4.1 Overview
    4.1.1 Terminology: Validation / Verification / Testing
    4.1.2 The Difficulty of Designing Correct Chips
      4.1.2.1 Notes from Kenn Heinrich (UW E&CE grad)
      4.1.2.2 Notes from Aart de Geus (Chairman and CEO of Synopsys)
  4.2 Test Cases and Coverage
    4.2.1 Coverage
    4.2.2 Floating Point Divider Example
  4.3 Testbenches
    4.3.1 Overview of Test Benches
    4.3.2 Reference Model Style Testbench
    4.3.3 Relational Style Testbench
    4.3.4 Coding Structure of a Testbench
    4.3.5 Datapath vs Control
    4.3.6 Verification Tips
  4.4 Functional Verification for Datapath Circuits
    4.4.1 A Spec-Less Testbench
    4.4.2 Use an Array for Test Vectors
    4.4.3 Build Spec into Stimulus
    4.4.4 Have Separate Specification Entity
    4.4.5 Generate Test Vectors Automatically
    4.4.6 Relational Specification
  4.5 Functional Verification of Control Circuits
    4.5.1 Overview of Queues in Hardware
    4.5.2 VHDL Coding
      4.5.2.1 Package
      4.5.2.2 Other VHDL Coding
    4.5.3 Code Structure for Verification
    4.5.4 Instrumentation Code
    4.5.5 Assertions
    4.5.6 VHDL Coding Tips
    4.5.7 Queue Specification
    4.5.8 Queue Testbench
  4.6 Example: Microwave Oven

5 Timing Analysis
  5.1 Delays and Definitions
    5.1.1 Background Definitions
    5.1.2 Clock-Related Timing Definitions
      5.1.2.1 Clock Skew
      5.1.2.2 Clock Latency
      5.1.2.3 Clock Jitter
    5.1.3 Storage-Related Timing Definitions
      5.1.3.1 Flops and Latches
    5.1.4 Propagation Delays
    5.1.5 Timing Constraints
      5.1.5.1 Minimum Clock Period
      5.1.5.2 Hold Constraint
      5.1.5.3 Example Timing Violations
  5.2 Timing Analysis of Latches and Flip Flops
    5.2.1 Simple Multiplexer Latch
      5.2.1.1 Structure and Behaviour of Multiplexer Latch
      5.2.1.2 Strategy for Timing Analysis of Storage Devices
      5.2.1.3 Clock-to-Q Time of a Multiplexer Latch
      5.2.1.4 Setup Timing of a Multiplexer Latch
      5.2.1.5 Hold Time of a Multiplexer Latch
      5.2.1.6 Example of a Bad Latch
  5.3 Critical Paths and False Paths
    5.3.1 Introduction to Critical and False Paths
      5.3.1.1 Example of Critical Path in Full Adder
      5.3.1.2 Preliminaries for Critical Paths
      5.3.1.3 Longest Path and Critical Path
    5.3.2 Longest Path
    5.3.3 Detecting a False Path
      5.3.3.1 Preliminaries
      5.3.3.2 Almost-Correct Algorithm to Detect a False Path
      5.3.3.3 Examples of Detecting False Paths
    5.3.4 Finding the Next Candidate Path
      5.3.4.1 Algorithm to Find Next Candidate Path
      5.3.4.2 Examples of Finding Next Candidate Path
    5.3.5 Correct Algorithm to Find Critical Path
      5.3.5.1 Rules for Late Side Inputs
      5.3.5.2 Monotone Speedup
      5.3.5.3 Analysis of Side-Input-Causes-Glitch Situation
      5.3.5.4 Complete Algorithm
      5.3.5.5 Complete Examples
    5.3.6 Further Extensions to Critical Path Analysis
    5.3.7 Increasing the Accuracy of Critical Path Analysis
  5.4 Elmore Timing Model
    5.4.1 RC-Networks for Timing Analysis
    5.4.2 Derivation of Analog Timing Model
      5.4.2.1 Example Derivation: Equation for Voltage at Node 3
      5.4.2.2 General Derivation
    5.4.3 Elmore Timing Model
    5.4.4 Examples of Using Elmore Delay
      5.4.4.1 Interconnect with Single Fanout
      5.4.4.2 Interconnect with Multiple Gates in Fanout
  5.5 Practical Usage of Timing Analysis
    5.5.1 Speed Binning
      5.5.1.1 FPGAs, Interconnect, and Synthesis
    5.5.2 Worst Case Timing
      5.5.2.1 Fanout delay
      5.5.2.2 Derating Factors

6 Power Analysis and Power-Aware Design
  6.1 Overview
    6.1.1 Importance of Power and Energy
    6.1.2 Industrial Names and Products
    6.1.3 Power vs Energy
    6.1.4 Batteries, Power and Energy
      6.1.4.1 Do Batteries Store Energy or Power?
      6.1.4.2 Battery Life and Efficiency
      6.1.4.3 Battery Life and Power
  6.2 Power Equations
    6.2.1 Switching Power
    6.2.2 Short-Circuited Power
    6.2.3 Leakage Power
    6.2.4 Glossary
    6.2.5 Note on Power Equations
  6.3 Overview of Power Reduction Techniques
  6.4 Voltage Reduction for Power Reduction
  6.5 Data Encoding for Power Reduction
    6.5.1 How Data Encoding Can Reduce Power
    6.5.2 Example Problem: Sixteen Pulser
      6.5.2.1 Problem Statement
      6.5.2.2 Additional Information
      6.5.2.3 Answer
  6.6 Clock Gating
    6.6.1 Introduction to Clock Gating
    6.6.2 Implementing Clock Gating
    6.6.3 Design Process
    6.6.4 Effectiveness of Clock Gating
    6.6.5 Example: Reduced Activity Factor with Clock Gating
    6.6.6 Clock Gating with Valid-Bit Protocol
      6.6.6.1 Valid-Bit Protocol
      6.6.6.2 How Many Clock Cycles for Module?
      6.6.6.3 Adding Clock-Gating Circuitry
    6.6.7 Example: Pipelined Circuit with Clock-Gating

7 Fault Testing and Testability
  7.1 Faults and Testing
    7.1.1 Overview of Faults and Testing
      7.1.1.1 Faults
      7.1.1.2 Causes of Faults
      7.1.1.3 Testing
      7.1.1.4 Burn In
      7.1.1.5 Bin Sorting
      7.1.1.6 Testing Techniques
      7.1.1.7 Design for Testability (DFT)
    7.1.2 Example Problem: Economics of Testing
    7.1.3 Physical Faults
      7.1.3.1 Types of Physical Faults
      7.1.3.2 Locations of Faults
      7.1.3.3 Layout Affects Locations
      7.1.3.4 Naming Fault Locations
    7.1.4 Detecting a Fault
      7.1.4.1 Which Test Vectors will Detect a Fault?
    7.1.5 Mathematical Models of Faults
      7.1.5.1 Single Stuck-At Fault Model
    7.1.6 Generate Test Vector to Find a Mathematical Fault
      7.1.6.1 Algorithm
      7.1.6.2 Example of Finding a Test Vector
    7.1.7 Undetectable Faults
      7.1.7.1 Redundant Circuitry
      7.1.7.2 Curious Circuitry and Fault Detection
  7.2 Test Generation
    7.2.1 A Small Example
    7.2.2 Choosing Test Vectors
      7.2.2.1 Fault Domination
      7.2.2.2 Fault Equivalence
      7.2.2.3 Gate Collapsing
      7.2.2.4 Node Collapsing
      7.2.2.5 Fault Collapsing Summary
    7.2.3 Fault Coverage
    7.2.4 Test Vector Generation and Fault Detection
    7.2.5 Generate Test Vectors for 100% Coverage
      7.2.5.1 Collapse the Faults
      7.2.5.2 Check for Fault Domination
      7.2.5.3 Required Test Vectors
      7.2.5.4 Faults Not Covered by Required Test Vectors
      7.2.5.5 Order to Run Test Vectors
      7.2.5.6 Summary of Technique to Find and Order Test Vectors
    7.2.6 One Fault Hiding Another
  7.3 Scan Testing in General
    7.3.1 Structure and Behaviour of Scan Testing
    7.3.2 Scan Chains
      7.3.2.1 Circuitry in Normal and Scan Mode
      7.3.2.2 Scan in Operation
      7.3.2.3 Scan in Operation with Example Circuit
    7.3.3 Summary of Scan Testing
    7.3.4 Time to Test a Chip
      7.3.4.1 Example: Time to Test a Chip
  7.4 Boundary Scan and JTAG
    7.4.1 Scan Instructions
  7.5 Built In Self Test
    7.5.1 Block Diagram
      7.5.1.1 Components
      7.5.1.2 Linear Feedback Shift Register (LFSR)
      7.5.1.3 Maximal-Length LFSR
    7.5.2 Test Generator
    7.5.3 Signature Analyzer
    7.5.4 Result Checker
    7.5.5 Arithmetic over Binary Fields
    7.5.6 Shift Registers and Characteristic Polynomials
      7.5.6.1 Circuit Multiplication
    7.5.7 Bit Streams and Characteristic Polynomials
    7.5.8 Division
    7.5.9 Signature Analysis: Math and Circuits
  7.6 Scan vs Self Test

8 Review
  8.1 Overview of the Term
  8.2 VHDL
    8.2.1 VHDL Topics
    8.2.2 VHDL Example Problems
  8.3 RTL Design Techniques
    8.3.1 Design Topics
    8.3.2 Design Example Problems
  8.4 Functional Verification
    8.4.1 Verification Topics
    8.4.2 Verification Example Problems
  8.5 Performance Analysis and Optimization
    8.5.1 Performance Topics
    8.5.2 Performance Example Problems
  8.6 Timing Analysis
    8.6.1 Timing Topics
    8.6.2 Timing Example Problems
  8.7 Power
    8.7.1 Power Topics
    8.7.2 Power Example Problems
  8.8 Testing
    8.8.1 Testing Topics
    8.8.2 Testing Example Problems
  8.9 Formulas to be Given on Final Exam

Part I Lecture Notes

Chapter 1 VHDL: The Language

CHAPTER 1. VHDL

1.1 Introduction to VHDL

1.1.1 Levels of Abstraction

Transistor: Signal values and time are continuous (analog). Each transistor is modeled by a resistor-capacitor network.
Switch: Time is continuous, but voltage may be either continuous or discrete. Linear equations are used.
Gate: Transistors are grouped together into gates. Voltages are discrete values such as 0 and 1.
Register transfer level: Hardware is modeled as assignments to registers and combinational signals. Basic unit of time is one clock cycle.
Transaction level: A transaction is an operation such as transferring data across a bus. Building blocks are processors, controllers, etc. Languages: VHDL, SystemC, or SystemVerilog.
Electronic-system level: Looks at an entire electronic system, with both hardware and software.

1.1.2 VHDL Origins and History

VHDL = VHSIC Hardware Description Language
VHSIC = Very High Speed Integrated Circuit

"The VHSIC Hardware Description Language (VHDL) is a formal notation intended for use in all phases of the creation of electronic systems. Because it is both machine readable and human readable, it supports the development, verification, synthesis and testing of hardware designs, the communication of hardware design data, and the maintenance, modification, and procurement of hardware."
Language Reference Manual (IEEE Design Automation Standards Committee, 1993a)

VHDL is a lot more than synthesis of digital hardware

1.1.3 Semantics

The original goal of VHDL was to simulate circuits. The semantics of the language define circuit behaviour.
[Figure: simulating c <= a AND b; produces waveforms for a, b, and c.]

But now, VHDL is used in simulation and synthesis. Synthesis is concerned with the structure of the circuit.
Synthesis: converts one type of description (behavioural) into another, lower-level description (usually a netlist).
[Figure: synthesizing c <= a AND b; produces a gate with inputs a and b and output c.]


Synthesis
Synthesis is a computer-aided design (CAD) technique that transforms a designer's concise, high-level description of a circuit into a structural description of a circuit.
[Figure: synthesizing c <= a AND b; produces a gate with inputs a and b and output c.]


CAD Tools
CAD tools allow designers to automate lower-level design processes in implementing the desired functionality of a system.
NOTE: EDA = Electronic Design Automation. In digital hardware design, EDA = CAD.


Synthesis vs Simulation
For synthesis, we want the code we write to define the structure of the hardware that is generated.
[Figure: synthesizing c <= a AND b; produces a gate with inputs a and b and output c.]


Synthesis vs Simulation
The VHDL semantics define the behaviour of the hardware that is generated, not the structure of the hardware.
[Figure: c <= a AND b; can be synthesized into circuits with different structure; simulating each circuit gives the same behaviour for a, b, and c.]
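The point that behaviour does not pin down structure can be sketched in VHDL itself. The following is a minimal illustration, not code from the lecture: the entity name and2, the architecture names direct and nand_inv, and the internal signal n are all hypothetical. Both architectures simulate identically, yet they are written as different gate structures. Whether a given synthesis tool keeps the structural difference or optimizes both to the same gate depends on the tool.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical two-input entity used only for this illustration.
entity and2 is
  port (
    a, b : in  std_logic;
    c    : out std_logic
  );
end and2;

-- Written as a single AND gate.
architecture direct of and2 is
begin
  c <= a AND b;
end direct;

-- Same behaviour, written as a NAND gate followed by an inverter.
architecture nand_inv of and2 is
  signal n : std_logic;  -- hypothetical internal node
begin
  n <= a NAND b;
  c <= NOT n;
end nand_inv;
```

A simulator cannot tell the two architectures apart from the values on c, which is why the semantics alone do not determine the circuit that synthesis produces.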


1.1.4 Synthesis of a Simulation-Based Language


This section reserved for your reading pleasure

1.1.5 Solution to Synthesis Sanity

Pick a high-quality synthesis tool and study its documentation thoroughly
Learn the idioms of the tool
Different VHDL code with the same behaviour can result in very different circuits
Be careful if you have to port VHDL code from one tool to another
KISS: Keep It Simple Stupid

VHDL examples will illustrate reliable coding techniques for the synthesis tools from Synopsys, Mentor Graphics, Altera, Xilinx, and most other companies as well.
Follow the coding guidelines and examples from lecture
As you write VHDL, think about the hardware you expect to get.
Note: If you can't predict the hardware, then the hardware probably won't be very good (small, fast, correct, etc.)

1.1.6 Standard Logic 1164

std_logic_1164: IEEE standard for signal values in VHDL.

  'U'  uninitialized
  'X'  strong unknown
  '0'  strong 0
  '1'  strong 1
  'Z'  high impedance
  'W'  weak unknown
  'L'  weak 0
  'H'  weak 1
  '-'  don't care

The most common values are: 'U', 'X', '0', '1'. If you see 'X' in a simulation, it usually means that there is a mistake in your code.
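One common mistake that produces 'X' is driving the same signal from two places with conflicting values. The following is a minimal sketch; the entity name x_demo and the signal name s are hypothetical.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical testbench-style entity (no ports).
entity x_demo is
end x_demo;

architecture main of x_demo is
  signal s : std_logic;
begin
  -- Two concurrent statements drive the same signal with conflicting values.
  s <= '0';
  s <= '1';

  check : process
  begin
    wait for 1 ns;
    -- The std_logic resolution of a strong '0' and a strong '1' is 'X'.
    assert s = 'X' report "expected 'X' on s" severity error;
    wait;  -- suspend forever
  end process;
end main;
```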

1.2 Comparison of VHDL to Other Hardware Description Languages


This section reserved for your reading pleasure

1.3 Overview of Syntax

1.3.1 Syntactic Categories


This section reserved for your reading pleasure

1.3.2 Library Units
This section reserved for your reading pleasure

1.3.3 Entities and Architecture

Each hardware module is described with an Entity/Architecture pair

[Figure: Entity and Architecture]

Entity

  library ieee;
  use ieee.std_logic_1164.all;

  entity and_or is
    port (
      a, b, c : in  std_logic;
      z       : out std_logic
    );
  end and_or;

Example of an entity

Architecture

  architecture main of and_or is
    signal x : std_logic;
  begin
    x <= a AND b;
    z <= x OR (a AND c);
  end main;

Example of architecture

1.3.4 Concurrent Statements

Architectures contain concurrent statements
Concurrent statements execute in parallel (Figure 1.4)

Concurrent statements make VHDL fundamentally different from most software languages.
Hardware (gates) naturally executes in parallel
VHDL mimics the behaviour of real hardware.
At each infinitesimally small moment of time, each gate:
  1. samples its inputs
  2. computes the value of its output
  3. drives the output

Concurrent Statements

  architecture main of bowser is
  begin
    x1 <= a AND b;
    x2 <= NOT x1;
    z  <= NOT x2;
  end main;

  architecture main of bowser is
  begin
    z  <= NOT x2;
    x2 <= NOT x1;
    x1 <= a AND b;
  end main;

[Figure: both architectures describe the same circuit: an AND gate with inputs a and b driving x1, followed by two inverters producing x2 and z]

The order of concurrent statements doesn't matter

Types of Concurrent Statements

conditional assignment: similar to conventional if-then-else
    c <= a+b when sel='1' else a+c when sel='0' else "0000";
selected assignment: similar to conventional case/switch
    with color select d <= "00" when red, "01" when ...;
component instantiation: use a hardware module/component
    add1 : adder port map( a => f, b => g, s => h, co => i);
for-generate: create multiple pieces of hardware
    bgen: for i in 1 to 7 generate
      b(i) <= a(7-i);
    end generate;
if-generate: conditionally create some hardware
    okgen : if optgoal /= fast generate
      result <= ((a and b) or (d and not e)) or g;
    end generate;
    fastgen : if optgoal = fast generate
      result <= '1';
    end generate;
process: description of complex behaviour (Section 1.3.6)

1.3.5 Component Declaration and Instantiations


This section reserved for your reading pleasure

1.3.6 Processes

Processes are used to describe complex and potentially unsynthesizable behaviour
A process is a concurrent statement (Section 1.3.4).
The body of a process contains sequential statements (Section 1.3.7)
Processes are the most complex and difficult to understand part of VHDL (Sections 1.5 and 1.6)

Example Process with Sensitivity List

  process (a, b, c) begin
    y <= a AND b;
    if (a = '1') then
      z1 <= b AND c;
      z2 <= NOT c;
    else
      z1 <= b OR c;
      z2 <= c;
    end if;
  end process;

Example Process with Wait Statements

  process begin
    y <= a AND b;
    z <= '0';
    wait until rising_edge(clk);
    if (a = '1') then
      z <= '1';
      y <= '0';
      wait until rising_edge(clk);
    else
      y <= a OR b;
    end if;
  end process;

Sensitivity Lists and Wait Statements


Processes must have either a sensitivity list or at least one wait statement on each execution path through the process. Processes cannot have both a sensitivity list and a wait statement.

Sensitivity List
The sensitivity list contains the signals that are read in the process.
A process is executed when a signal in its sensitivity list changes value.
An important coding guideline to ensure consistent synthesis and simulation results is to include all signals that are read in the sensitivity list.
There is one exception to this rule: for a process that implements a flip-flop with an if rising_edge statement, it is acceptable to include only the clock signal in the sensitivity list. Other signals may be included, but are not needed.
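The guideline about complete sensitivity lists can be illustrated with a small sketch. The entity name sens_demo and the port names y_bad and y_ok are hypothetical. In simulation, the process named bad does not re-execute when only b changes, so y_bad can hold a stale value, whereas a synthesis tool will typically build the same AND gate for both processes.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity used only for this illustration.
entity sens_demo is
  port ( a, b  : in  std_logic;
         y_bad : out std_logic;
         y_ok  : out std_logic );
end sens_demo;

architecture main of sens_demo is
begin
  -- Incomplete sensitivity list: b is read but not listed,
  -- so a change on b alone does not re-run this process in simulation.
  bad : process (a)
  begin
    y_bad <= a AND b;
  end process;

  -- Complete sensitivity list: every signal that is read is listed.
  ok : process (a, b)
  begin
    y_ok <= a AND b;
  end process;
end main;
```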

1.3.7 Sequential Statements

Used inside processes and functions.

  wait               wait until ... ;
  signal assignment  ... <= ... ;
  if-then-else       if ... then ... elsif ... end if;
  case               case ... is
                       when ... | ... => ... ;
                       when ... => ... ;
                     end case;
  loop               loop ... end loop;
  while loop         while ... loop ... end loop;
  for loop           for ... in ... loop ... end loop;
  next               next ... ;

The most commonly used sequential statements

1.3.8 A Few More Miscellaneous VHDL Features


This section reserved for your reading pleasure

1.4 Concurrent vs Sequential Statements

All concurrent assignments can be translated into sequential statements. But, not all sequential statements can be translated into concurrent statements.

1.4.1 Concurrent Assignment vs Process

The two code fragments below have identical behaviour:

  architecture main of tiny is
  begin
    b <= a;
  end main;

  architecture main of tiny is
  begin
    process (a) begin
      b <= a;
    end process;
  end main;

1.4.2 Conditional Assignment vs If Statements

The two code fragments below have identical behaviour:

Concurrent Statements

  t <= <val1> when <cond>
       else <val2>;

Sequential Statements

  if <cond> then
    t <= <val1>;
  else
    t <= <val2>;
  end if;

1.4.3 Selected Assignment vs Case Statement

The two code fragments below have identical behaviour:

Concurrent Statements

  with <expr> select
    t <= <val1> when <choices1>,
         <val2> when <choices2>,
         <val3> when <choices3>;

Sequential Statements

  case <expr> is
    when <choices1> => t <= <val1>;
    when <choices2> => t <= <val2>;
    when <choices3> => t <= <val3>;
  end case;

1.4.4 Coding Style

Code that's easy to write with sequential statements, but difficult with concurrent:

  case <expr> is
    when <choice1> =>
      if <cond> then
        o <= <expr1>;
      else
        o <= <expr2>;
      end if;
    when <choice2> =>
      ...
  end case;

1.5 Overview of Processes

Processes are the most difficult VHDL construct to understand. This section gives an overview of processes. Section 1.6 gives the details of the semantics of processes.
Within a process, statements are executed almost sequentially
Among processes, execution is done in parallel
Remember: a process is a concurrent statement!

Process Semantics
VHDL mimics hardware
Hardware (gates) executes in parallel
Processes execute in parallel with each other
All possible orders of executing processes must produce the same simulation results (waveforms)
If a signal is not assigned a value, then it holds its previous value

All orders of executing concurrent statements must produce the same waveforms

Process Semantics

  architecture
    procA: process
      stmtA1;
      stmtA2;
      stmtA3;
    end process;
    procB: process
      stmtB1;
      stmtB2;
    end process;

[Figure: three execution sequences for the statements A1, A2, A3 and B1, B2: single threaded with procA before procB, single threaded with procB before procA, and multithreaded with procA and procB in parallel]

All execution orders must have the same behaviour

1.5.1 Combinational Process vs Clocked Process

Each well-written synthesizable process is either combinational or clocked.

Combinational process:
Executing the process takes part of one clock cycle
Target signals are outputs of combinational circuitry
A combinational process must have a sensitivity list
A combinational process must not have any wait statements
A combinational process must not have any rising_edges or falling_edges
The hardware for a combinational process is just combinational circuitry

Clocked process:
Executing the process takes one (or more) clock cycles
Target signals are outputs of flops
Process contains one or more wait or if rising_edge statements
Hardware contains combinational circuitry and flip-flops

Note: Clocked processes are sometimes called sequential processes, but this can be easily confused with sequential statements, so in E&CE 327 we'll refer to synthesizable processes as either combinational or clocked.

Combinational or Clocked Process? (1)

  process (a,b,c) begin
    p1 <= a;
    if (b = c) then
      p2 <= b;
    else
      p2 <= a;
    end if;
  end process;

Combinational or Clocked Process? (2)

  process begin
    wait until rising_edge(clk);
    b <= a;
  end process;

Combinational or Clocked Process? (3)

  process (clk) begin
    if rising_edge(clk) then
      b <= a;
    end if;
  end process;

Combinational or Clocked Process? (4)

  process (clk) begin
    a <= clk;
  end process;

Combinational or Clocked Process? (5)

  process begin
    wait until rising_edge(a);
    c <= b;
  end process;

1.5.2 Latch Inference

The semantics of VHDL require that if a signal is assigned a value on some passes through a process and not on other passes, then on a pass through the process when the signal is not assigned a value, it must maintain its value from the previous pass.

  process (a, b, c) begin
    if (a = '1') then
      z1 <= b;
      z2 <= b;
    else
      z1 <= c;
    end if;
  end process;

[Figure: waveforms of a, b, c, z1, and z2]

Example of latch inference

Latch Inference
When a signal's value must be stored, VHDL infers a latch or a flip-flop in the hardware to store the value.
If you want a latch or a flip-flop for the signal, then latch inference is good.
If you want combinational circuitry, then latch inference is bad.
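When combinational circuitry is intended, one common way to avoid an inferred latch is to assign every target signal on every pass through the process, for example with a default assignment at the top. The sketch below reworks the latch-inference example under that assumption. The entity name no_latch and the default value '0' for z2 are hypothetical choices for illustration, not taken from the slides.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity used only for this illustration.
entity no_latch is
  port ( a, b, c : in  std_logic;
         z1, z2  : out std_logic );
end no_latch;

architecture main of no_latch is
begin
  process (a, b, c)
  begin
    -- Default assignment (the value '0' is an assumption):
    -- z2 now receives a value on every pass through the process.
    z2 <= '0';
    if (a = '1') then
      z1 <= b;
      z2 <= b;   -- the last assignment in a pass overrides the default
    else
      z1 <= c;
    end if;
  end process;
end main;
```

Because z1 and z2 are both assigned on every path, neither signal needs to remember its previous value.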

Loop, Latch, Flop

[Figure: three circuits: a combinational loop, a latch, and a flip-flop]

Question: Write VHDL code for each of the above circuits

1.6 Details of Process Execution

1.6.1 Simple Simulation

[Figure: a circuit with signals a, b, c, d, and e, and its waveforms with times marked at 0ns, 10ns, 12ns, and 15ns]

Different Programs, Same Behaviour

All three programs below synthesize to the circuit on the previous slide. The goal of VHDL semantics is that all three programs have the same behaviour.

Program 1:

  process (a,b) begin
    c <= a and b;
  end process;
  process (b,c,d) begin
    d <= not c;
    e <= b and d;
  end process;

Program 2:

  process (a,b,c,d) begin
    c <= a and b;
    d <= not c;
    e <= b and d;
  end process;

Program 3:

  process (a,b) begin
    c <= a and b;
  end process;
  process (c) begin
    d <= not c;
  end process;
  process (b,d) begin
    e <= b and d;
  end process;

1.6.2 Temporal Granularities of Simulation


This section reserved for your reading pleasure

1.6.3 Intuition Behind Delta-Cycle Simulation

In zero-delay simulation, a sequence of dependent events must appear to happen instantaneously (in zero time). In particular, the effect of an event must propagate instantaneously through combinational circuitry.
Two fundamental rules for zero-delay simulation:
  1. events appear to propagate through combinational circuitry instantaneously.
  2. all of the gates appear to operate in parallel

Intuition for Delta Cycles

To make it appear that events propagate instantaneously, VHDL introduces an artificial unit of time, the delta cycle, to represent an infinitesimally small amount of time.
In each delta cycle, every gate in the circuit will sample its inputs, compute its result, and drive its output signal with the result.
Simulators simulate one gate at a time, but the waveforms make it appear that all of the gates were run in parallel.
In each delta cycle, the simulator executes all gates whose inputs changed.
To preserve the illusion that the gates ran in parallel, the effect of simulating a gate remains invisible until the end of the delta cycle.
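The propagation of a change through a chain of gates, one delta cycle per gate, can be observed with a small self-checking testbench. This is only a sketch: the entity name delta_demo and the signals a, x, and y are hypothetical, and the statement wait for 0 ns is used here solely to let the checking process step forward by one delta cycle without advancing simulation time.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical testbench-style entity (no ports).
entity delta_demo is
end delta_demo;

architecture main of delta_demo is
  signal a    : std_logic := '0';
  signal x, y : std_logic;
begin
  x <= NOT a;   -- first inverter
  y <= NOT x;   -- second inverter

  stim : process
  begin
    wait for 10 ns;          -- circuit has settled: x = '1', y = '0'
    a <= '1';                -- provisional; not yet visible
    wait for 0 ns;           -- one delta cycle later, still at 10 ns
    assert a = '1' and x = '1' and y = '0'
      report "delta 1: only a should have changed" severity error;
    wait for 0 ns;           -- second delta cycle
    assert x = '0' and y = '0'
      report "delta 2: x should have changed, y not yet" severity error;
    wait for 0 ns;           -- third delta cycle
    assert y = '1'
      report "delta 3: y should have changed" severity error;
    wait;                    -- suspend forever
  end process;
end main;
```

On a normal time scale all three changes appear to happen at exactly 10 ns, which is the illusion of instantaneous propagation described above.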

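1.6.4 Definitions and Algorithm

1.6.4.1 Process Modes

[Figure: state diagram of the three process modes: suspended, postponed, and active, with transitions resume (suspended to postponed), activate (postponed to active), and suspend (active to suspended)]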

Suspended
Nothing to currently execute
A process stays suspended until the event that it is waiting for occurs: either a change in a signal on its sensitivity list or the condition in a wait statement

Postponed
Wants to execute, but not currently active
A process stays postponed until the simulator chooses it from the pool of postponed processes

Active
Currently executing
A process stays active until it hits a wait statement or sensitivity list, at which point it suspends

1.6.4.2 Simulation Algorithm

The algorithm presented here is a simplification of the actual algorithm in the VHDL Standard. This algorithm does not support delayed assignments; for example: (a <= b after 2 ns;).

A somewhat ironic note: only six of the two hundred pages in the VHDL Standard are devoted to the semantics of executing processes.

The Algorithm
Simulations start at step 1 with all processes postponed and all signals with a default value (e.g., 'U' for std_logic).
1. While there are postponed processes:
   (a) Pick one or more postponed processes to execute (become active).
   (b) Provisionally execute assignments (new values become visible at step 3).
   (c) A process executes until it hits its sensitivity list or a wait statement, at which point it suspends.
   (d) Processes that become suspended stay suspended until there are no more postponed or active processes.
2. Each process checks its sensitivity list or wait condition to see if it should resume.
3. Update signals with their provisional values.
4. If no postponed processes, then increment simulation time to next event.

Notes on Simulation Algorithm


At a wait statement, the process will suspend even if the condition is true in the current simulation cycle. The process will resume when the condition changes to true. In n-threaded execution, at most n processes are active at a time
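The first note can be checked with a small testbench in which the wait condition is already true when the process first reaches it. This is a sketch with hypothetical names (wait_demo, waiter, seen): the waiting process should resume only once a later changes to '1', not at time 0.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical testbench-style entity (no ports).
entity wait_demo is
end wait_demo;

architecture main of wait_demo is
  signal a    : std_logic := '1';   -- condition is already true at time 0
  signal seen : boolean   := false;
begin
  waiter : process
  begin
    wait until a = '1';   -- suspends even though a is already '1'
    seen <= true;
    wait;                 -- suspend forever
  end process;

  stim : process
  begin
    wait for 5 ns;
    assert not seen report "resumed without a change on a" severity error;
    a <= '0';             -- a changes, but the condition is now false
    wait for 5 ns;
    assert not seen report "resumed while condition false" severity error;
    a <= '1';             -- the condition changes to true
    wait for 5 ns;
    assert seen report "did not resume when a became '1'" severity error;
    wait;                 -- suspend forever
  end process;
end main;
```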

1.6.4.3 Delta-Cycle Definitions

Definition simulation step: Executing one sequential assignment or process mode change.

Definition simulation cycle: The operations that occur in one iteration of the simulation algorithm.

Definition delta cycle: A simulation cycle that does not advance simulation time.

Definition simulation round: A sequence of simulation cycles that all have the same simulation time.

1.6.5 Example 1: Process Execution (Bamboozle)


This section reserved for your reading pleasure

1.6.6 Example 2: Process Execution (Flummox)


This example is a variation of the Bamboozle example from section 1.6.5.

[Figure legend: process mode (S = suspended, P = postponed, A = active); simulation-step pointer (one per process); visible-assignment value; provisional-assignment value]

  proc1: process (a, b, c) begin
    c <= a AND b;
    d <= NOT c;
  end process;

  proc2: process (b, d) begin
    e <= b AND d;
  end process;

  proc3: process begin
    a <= '1';
    b <= '0';
    wait for 3 ns;
    b <= '1';
    wait for 99 ns;
  end process;

[Figure: delta-cycle simulation table starting at 0ns, with rows for sim round, sim cycle, delta cycle, the modes of proc1, proc2, and proc3, and the values of signals a, b, c, d, and e, all initially 'U']

From Delta-Time to Real Time

[Figure: the waveforms of a, b, c, d, and e, all initially 'U', drawn twice: on a delta-cycle time scale (0ns, 3ns, and 102ns, with delta cycles +1, +2, +3) and on a real time scale from 0ns to 102ns]

Note and Questions

Note: If a signal is updated with the same value it had in the previous simulation cycle, then it does not change, and therefore does not trigger processes to resume.

Question: What are the different granularities of time that occur when doing delta-cycle simulation?

Question: What is the order of granularity, from finest to coarsest, amongst the different granularities related to delta-cycle simulation?

1.6.7 Ex: Need for Provisional Asn

  architecture main of swindle is
  begin
    p_c: process (a, b) begin
      c <= a AND b;
    end process;
    p_d: process (a, c) begin
      d <= a XOR c;
    end process;
  end main;

Question: draw the circuit

Circuit to illustrate need for provisional assignments
1. Start with all signals at '0'.
2. Simultaneously change to a = '1' and b = '1'.

With Provisional Assignments, c Before d

If assignments are not visible within the same simulation cycle (correct: i.e. provisional assignments are used)

  p_c: process (a, b) begin
    c <= a AND b;
  end process;
  p_d: process (a, c) begin
    d <= a XOR c;
  end process;

[Figure: delta-cycle table with the modes of p_c and p_d and the values of a, b, c, and d, all initially 0]

If p_c is scheduled before p_d, then d will have a '1' pulse.

With Provisional Assignments, d Before c

If assignments are not visible within the same simulation cycle (correct: i.e. provisional assignments are used)

  p_c: process (a, b) begin
    c <= a AND b;
  end process;
  p_d: process (a, c) begin
    d <= a XOR c;
  end process;

[Figure: delta-cycle table with the modes of p_c and p_d and the values of a, b, c, and d, all initially 0]

If p_d is scheduled before p_c, then d will have a '1' pulse.

Without Prov. Assignments, c Before d

If assignments are visible within the same simulation cycle (incorrect)

  p_c: process (a, b) begin
    c <= a AND b;
  end process;
  p_d: process (a, c) begin
    d <= a XOR c;
  end process;

[Figure: delta-cycle table with the modes of p_c and p_d and the values of a, b, c, and d, all initially 0]

If p_c is scheduled before p_d, then d will stay constant '0'.

Without Prov. Assignments, d Before c

If assignments are visible within the same simulation cycle (incorrect)

  p_c: process (a, b) begin
    c <= a AND b;
  end process;
  p_d: process (a, c) begin
    d <= a XOR c;
  end process;

[Figure: delta-cycle table with the modes of p_c and p_d and the values of a, b, c, and d, all initially 0]

If p_d is scheduled before p_c, then d will have a '1' pulse.

Need for Provisional Assignment


With provisional assignments, both orders of scheduling processes result in the same behaviour on all signals. Without provisional assignments, different scheduling orders result in different behaviour.
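Because real VHDL simulators use provisional assignments, the '1' pulse on d should appear regardless of the order in which the simulator happens to run p_c and p_d. The testbench below is a sketch that checks this by counting how often d is seen at '1'. The entity name swindle_tb, the counter d_rises, and the processes count and stim are hypothetical additions; the initial values are used only to start all signals at '0' as in the example.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical testbench-style entity (no ports).
entity swindle_tb is
end swindle_tb;

architecture main of swindle_tb is
  signal a, b, c, d : std_logic := '0';   -- start with all signals at '0'
  signal d_rises    : natural   := 0;     -- number of times d became '1'
begin
  p_c : process (a, b) begin
    c <= a AND b;
  end process;

  p_d : process (a, c) begin
    d <= a XOR c;
  end process;

  -- Runs on every change of d, including delta-width pulses.
  count : process (d) begin
    if d = '1' then
      d_rises <= d_rises + 1;
    end if;
  end process;

  stim : process begin
    wait for 10 ns;
    a <= '1';               -- change a and b simultaneously
    b <= '1';
    wait for 10 ns;
    assert d = '0'
      report "d should settle at '0'" severity error;
    assert d_rises = 1
      report "d should have pulsed to '1' exactly once" severity error;
    wait;                   -- suspend forever
  end process;
end main;
```

The pulse lasts a single delta cycle, so it is invisible on a normal time scale but still counted by the process that is sensitive to d.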

1.6.8 Delta-Cycle Simulations of Flip-Flops

  p_a : process begin
    a <= '0';
    wait for 15 ns;
    a <= '1';
    wait for 20 ns;
  end process;

  p_clk : process begin
    clk <= '0';
    wait for 10 ns;
    clk <= '1';
    wait for 10 ns;
  end process;

  flop : process ( clk ) begin
    if rising_edge( clk ) then
      q <= a;
    end if;
  end process;

[Figure: delta-cycle simulation table starting at 0ns, with rows for sim round, sim cycle, delta cycle, the modes of p_a, p_clk, and flop, and the values of a, clk, and q, all initially 'U']

Redraw with Normal Time Scale

[Figure: waveforms of a, clk, and q on a real time scale from 0ns to 35ns]

Back-to-Back Flops

  p_a : process begin
    a <= '0';
    wait for 15 ns;
    a <= '1';
    wait for 20 ns;
  end process;

  p_clk : process begin
    clk <= '0';
    wait for 10 ns;
    clk <= '1';
    wait for 10 ns;
  end process;

  flops : process ( clk ) begin
    if rising_edge( clk ) then
      q1 <= a;
      q2 <= q1;
    end if;
  end process;

[Figure: delta-cycle simulation table with columns at 10ns, 15ns, 20ns, 30ns, and 35ns, and rows for sim round, sim cycle, delta cycle, the modes of p_a, p_clk, and flops, and the values of a, clk, q1, and q2 (a and clk start at 0; q1 and q2 start at 'U')]

Redraw with Normal Time Scale

[Figure: waveforms of a, clk, and q on a real time scale from 0ns to 35ns]

External Inputs and Flops

Question: Do the signals b1 and b2 have the same behaviour from 20 to 30 ns?

  architecture mathilde of sauve is
    signal clk, a, b : std_logic;
  begin
    process begin
      clk <= '1';
      wait for 10 ns;
      clk <= '0';
      wait for 10 ns;
    end process;
    process begin
      wait for 20 ns;
      a1 <= '1';
    end process;
    process begin
      wait until rising_edge(clk);
      a1 <= '1';
    end process;
    process begin
      wait until rising_edge( clk );
      b1 <= a1;

Testbenches and Clock Phases

  env : process begin
    a <= '1';
    clk <= '0';
    wait for 10 ns;
    a <= '0';
    clk <= '1';
    wait for 10 ns;
  end process;

  flop : process ( clk ) begin
    if rising_edge( clk ) then
      q1 <= a;
    end if;
  end process;

[Figure: delta-cycle simulation table starting at 0ns, with rows for sim round, sim cycle, delta cycle, the modes of env, flop1, and flop2, and the values of a, clk, and q1]

Redraw with Normal Time Scale

[Figure: waveforms of a, clk, and q1 on a real time scale at 0ns, 10ns, and 20ns]

Warning
Note: Testbench signals
For consistent results across different simulators, simulation scripts vs test benches, and timing simulation vs zero-delay simulation, do not change signals in your testbench or script at the same time as the clock changes.

[Figure: three sets of waveforms for a, clk, and q1 from 0ns to 60ns, all starting at 'U': (1) a is the output of a clocked or combinational process; (2) a is the output of a timed process (testbench or environment), POOR DESIGN; (3) a is the output of a timed process (testbench or environment), GOOD DESIGN]
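One way to follow the warning is to change testbench stimulus on the opposite clock edge from the one the flop samples on. The testbench below is a sketch of that idea, not a prescribed course template. The entity name tb_phase and the process check are hypothetical, and changing a on the falling edge is only one of several ways to keep stimulus away from the rising edge.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical testbench-style entity (no ports).
entity tb_phase is
end tb_phase;

architecture main of tb_phase is
  signal clk, a, q1 : std_logic := '0';
begin
  -- Clock with a 20 ns period: rising edges at 10 ns, 30 ns, 50 ns, ...
  p_clk : process
  begin
    clk <= '0'; wait for 10 ns;
    clk <= '1'; wait for 10 ns;
  end process;

  -- Stimulus changes on the falling edge (20 ns, 40 ns, ...),
  -- half a period away from the edge on which the flop samples a.
  env : process
  begin
    wait until falling_edge(clk);
    a <= NOT a;
  end process;

  flop : process (clk)
  begin
    if rising_edge(clk) then
      q1 <= a;
    end if;
  end process;

  check : process
  begin
    wait for 35 ns;   -- after the rising edge at 30 ns
    assert q1 = '1' report "q1 should have sampled a = '1'" severity error;
    wait for 20 ns;   -- after the rising edge at 50 ns
    assert q1 = '0' report "q1 should have sampled a = '0'" severity error;
    wait;             -- suspend forever
  end process;
end main;
```

The clock process never stops, so the simulation needs to be run for a bounded time (for example 100 ns).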

1.7 Register-Transfer-Level Simulation

[Figure: the same example simulated two ways: a delta-cycle simulation table (sim rounds at 0ns, 3ns, and 102ns with their delta cycles, the modes of proc1, proc2, and proc3, and the values of a, b, c, d, and e) and the corresponding RTL simulation table with columns at 0ns, 1ns, 2ns, 3ns, and 102ns]

1.7.1 Overview

Much simpler than delta cycle
Columns are real time: clock cycles, nanoseconds, etc.
Can simulate both synthesizable and unsynthesizable code
Cannot simulate combinational loops
Same values as delta-cycle at end of simulation round

Question: In this code, what value should b have at 10 ns?

  process begin
    a <= '0';
    wait for 10 ns;
    a <= '1';
    ...
  end process;

  process begin
    b <= '0';
    wait for 10 ns;
    b <= a;
    ...
  end process;

1.7.2 Technique for Register-Transfer Level Simulation

1. Pre-processing
   (a) Separate processes into combinational and non-combinational (clocked and timed)
   (b) Decompose each combinational process into separate processes with one target signal per process
   (c) Sort processes into topological order based on dependencies
2. For each clock cycle or unit of time:
   (a) Run non-combinational processes in any order. Non-combinational assignments read from earlier clock cycle / time step, except that clocked processes read the current value of the clock signal.
   (b) Run combinational processes in topological order. Combinational assignments read from current clock cycle / time step.

1.7.3 Examples of RTL Simulation

1.7.3.1 RTL Simulation Example 1

We revisit an earlier example from delta-cycle simulation, but change the code slightly and do register-transfer-level simulation.

  proc1: process (a, b, c) begin
    d <= NOT c;
    c <= a AND b;
  end process;

  proc2: process (b, d) begin
    e <= b AND d;
  end process;

  proc3: process begin
    a <= '1';
    b <= '0';
    wait for 3 ns;
    b <= '1';
    wait for 99 ns;
  end process;

Decompose and sort comb procs

Decomposed

  proc1d: process (c) begin
    d <= NOT c;
  end process;
  proc1c: process (a, b) begin
    c <= a AND b;
  end process;
  proc2: process (b, d) begin
    e <= b AND d;
  end process;

Sorted

  proc1c: process (a, b) begin
    c <= a AND b;
  end process;
  proc1d: process (c) begin
    d <= NOT c;
  end process;
  proc2: process (b, d) begin
    e <= b AND d;
  end process;

Waveforms

[Figure: RTL simulation table for a, b, c, d, and e, all initially 'U', with columns at 0ns, 1ns, 2ns, 3ns, and 102ns]

Example: Communicating State Machines

  huey: process begin
    clk <= '0';
    wait for 10 ns;
    clk <= '1';
    wait for 10 ns;
  end process;

  dewey: process begin
    a <= to_unsigned(0,4);
    wait until re(clk);
    while (a < 4) loop
      a <= a + 1;
      wait until re(clk);
    end loop;
  end process;

  louie: process begin
    d <= '1';
    wait until re(clk);
    if (a >= 2) then
      d <= '0';
      wait until re(clk);
    end if;
  end process;

[Figure: waveforms of clk, a, and d]

1.8 VHDL and Hardware Building Blocks

1.8.1 Basic Building Blocks

Different classes of building blocks:
  Conditional
  Arithmetic
  Storage

Basic Building Blocks: Boolean

[Table columns: Schematic (symbols not reproduced), VHDL, Description]

  VHDL   Description
  and    AND gate
  or     OR gate
  not    inverter
  nand   NAND gate
  nor    NOR gate
  xor    exclusive-or gate

Basic Building Blocks: Conditional

  VHDL                                          Description
  if-then-else, when-else, with-select, case    multiplexer

Basic Building Blocks: Arithmetic

  VHDL        Description
  +           adder
  -           subtracter
  asl, lsl    left shifter
  asr, lsr    right shifter

Basic Building Blocks: Storage

  VHDL                Description
  clocked process     flip-flop (pins D, CE, S, R, Q)
  memory component    single-port memory (pins WE, A, DI, DO)
  memory component    dual-port memory (pins WE, A0, DI0, DO0, A1, DO1)

1.8.2 Deprecated Building Blocks for RTL

Some of the common gates you have encountered in previous courses should be avoided when synthesizing register-transfer-level hardware, particularly if FPGAs are the implementation technology.
Latches: Use flops, not latches
T, JK, SR, etc. flip-flops: Limit yourself to D-type flip-flops
Tri-State Buffers: Use multiplexers, not tri-state buffers
Note: Unfortunately and surprisingly, PalmChip has been awarded a US patent for using uni-directional busses (i.e. multiplexers) for system-on-chip designs. The patent was filed in 2000, so all fourth-year design projects completed after that date will need to pay royalties to PalmChip

What is This?

  process (a) begin
    if rising_edge(a) then
      c <= b;
    end if;
  end process;

1.8.3 Hardware and Code for Flops

1.8.3.1 Flops with Waits and Ifs

  process (clk) begin
    if rising_edge(clk) then
      q <= d;
    end if;
  end process;

VHDL Code for Flip-Flop: Wait-Style

  process begin
    wait until rising_edge(clk);
    q <= d;
  end process;

1.8.3.2 Flops with Synchronous Reset

  process (clk) begin
    if rising_edge(clk) then
      if (reset = '1') then
        q <= '0';
      else
        q <= d;
      end if;
    end if;
  end process;

Flop with Synchronous Reset: Wait-Style

  process begin
    wait until rising_edge(clk);
    if (reset = '1') then
      q <= '0';
    else
      q <= d0;
    end if;
  end process;

Variation on a Floppy Theme

Question: Synchronous or asynchronous reset?

  process (clk, reset) begin
    if (reset = '1') then
      q <= '0';
    else
      if rising_edge(clk) then
        q <= d;
      end if;
    end if;
  end process;

Variated Flop of a Theme

Question: Synchronous or asynchronous reset?

  process begin
    if (reset = '1') then
      q <= '0';
    else
      q <= d0;
    end if;
    wait until rising_edge(clk);
  end process;

Flop with Chip-Enable

  process (clk) begin
    if rising_edge(clk) then
      if (ce = '1') then
        q <= d;
      end if;
    end if;
  end process;

Wait-style flop with chip-enable included in course notes
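For comparison with the if-style code, a wait-style flop with chip-enable might look like the sketch below. This is an assumption based on the wait-style patterns shown earlier in this section, not a copy of the version in the course notes; the entity name flop_ce_wait is hypothetical.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity used only for this illustration.
entity flop_ce_wait is
  port ( clk, ce, d : in  std_logic;
         q          : out std_logic );
end flop_ce_wait;

architecture main of flop_ce_wait is
begin
  process
  begin
    wait until rising_edge(clk);
    -- When ce is not '1', q is not assigned and holds its previous value.
    if (ce = '1') then
      q <= d;
    end if;
  end process;
end main;
```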

Q: Flop with a Mux on the Input?

[Figure: a D flip-flop clocked by clk whose D input is driven by a multiplexer that selects between d0 and d1 under control of sel]

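Q: Flops with a Mux on the Output?

[Figure: two D flip-flops clocked by clk, with inputs d0 and d1 and outputs q0 and q1; a multiplexer controlled by sel selects between q0 and q1 to drive q]

Question: For the circuits with mux-on-input and mux-on-output, does q have the same behaviour in both circuits?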

1.8.3.3 Flop with Chip-Enable and Mux on Input

Hint: Chip Enable

  process (clk) begin
    if rising_edge(clk) then
      if (ce = '1') then
        q <= d;
      end if;
    end if;
  end process;
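Building on the hint, one possible way to combine a chip-enable with a multiplexer on the flop input is to nest the selection inside the chip-enable test. The sketch below is an illustration only: the entity name flop_ce_mux is hypothetical, and the polarity of sel (sel = '1' selects d1) is an assumption, since the figure's mux polarity is not recoverable here.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity used only for this illustration.
entity flop_ce_mux is
  port ( clk, ce, sel : in  std_logic;
         d0, d1       : in  std_logic;
         q            : out std_logic );
end flop_ce_mux;

architecture main of flop_ce_mux is
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if (ce = '1') then
        -- Multiplexer on the flop input (polarity of sel is an assumption).
        if (sel = '1') then
          q <= d1;
        else
          q <= d0;
        end if;
      end if;
    end if;
  end process;
end main;
```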

1.8.3.4 Flops with Chip-Enable, Muxes, and Reset


This section reserved for your reading pleasure

1.8.4 An Example Sequential Circuit


This section reserved for your reading pleasure

1.9 Arrays and Vectors


This section reserved for your reading pleasure

1.10 Arithmetic
VHDL includes all of the common arithmetic and logical operators. Use the VHDL arithmetic operators and let the synthesis tool choose the best implementation for you.

1.10.1 Arithmetic Packages

To do arithmetic with signals, use the numeric_std package. This package defines types signed and unsigned, which are std_logic vectors on which you can do signed or unsigned arithmetic.
numeric_std supersedes earlier arithmetic packages, such as std_logic_arith.

Use only one arithmetic package, otherwise the different denitions will clash and you can get strange error messages.
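A minimal sketch of arithmetic with numeric_std is shown below. The entity name add4 and the 4-bit width are hypothetical choices for illustration. With two 4-bit unsigned operands, the "+" operator from numeric_std returns a 4-bit result, so the sum wraps around modulo 16.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;   -- the only arithmetic package used

-- Hypothetical entity used only for this illustration.
entity add4 is
  port ( a, b : in  unsigned(3 downto 0);
         s    : out unsigned(3 downto 0) );
end add4;

architecture main of add4 is
begin
  -- The synthesis tool chooses the adder implementation.
  s <= a + b;
end main;
```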

1.10.2 Shift and Rotate Operations


This section reserved for your reading pleasure

1.10.3 Overloading of Arithmetic
This section reserved for your reading pleasure

1.10.4 Different Widths and Arithmetic


This section reserved for your reading pleasure

1.10.5 Overloading of Comparisons
This section reserved for your reading pleasure

Overloading of Comparison Operations (=, /=, >=, >, <)

  src1/2      src2/1
  unsigned    integer    OK
  signed      integer    OK
  unsigned    signed     fails in analysis

1.10.6 Different Widths and Comparisons


This section reserved for your reading pleasure

1.10.7 Type Conversion

The functions unsigned, signed, to_integer, to_unsigned and to_signed are used to convert between integers, std_logic vectors, signed vectors and unsigned vectors.
If you convert between two types of the same width, then no additional hardware will be generated.
The listing below summarizes the types of these functions.

Type Conversion

  unsigned( val : std_logic_vector )               return unsigned;
  signed( val : std_logic_vector )                 return signed;
  to_integer( val : signed )                       return integer;
  to_integer( val : unsigned )                     return integer;
  to_unsigned( val : integer; width : natural )    return unsigned;
  to_signed( val : integer; width : natural )      return signed;

Note: More details in course notes
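The conversion functions in the listing can be exercised in a small self-checking testbench. This is a sketch with hypothetical names (conv_demo, v, u, s). It interprets the same four bits "1010" as an unsigned value and as a two's-complement signed value, and converts integers back to vectors.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical testbench-style entity (no ports).
entity conv_demo is
end conv_demo;

architecture main of conv_demo is
  signal v : std_logic_vector(3 downto 0) := "1010";
begin
  process
    variable u : unsigned(3 downto 0);
    variable s : signed(3 downto 0);
  begin
    u := unsigned(v);   -- same bits, interpreted as unsigned
    s := signed(v);     -- same bits, interpreted as two's complement
    assert to_integer(u) = 10
      report "unsigned 1010 should be 10" severity error;
    assert to_integer(s) = -6
      report "signed 1010 should be -6" severity error;
    -- Integer to vector: the second argument is the width in bits.
    assert std_logic_vector(to_unsigned(5, 4)) = "0101"
      report "to_unsigned(5, 4) should be 0101" severity error;
    assert std_logic_vector(to_signed(-1, 4)) = "1111"
      report "to_signed(-1, 4) should be 1111" severity error;
    wait;               -- suspend forever
  end process;
end main;
```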

1.11 Synthesizable vs Non-Synthesizable Code

Synthesis is done by matching VHDL code against templates or patterns.
It's important to use idioms that your synthesis tools recognize.
Think like hardware: when you write VHDL, you should know what hardware you expect to be produced by the synthesizer.

1.11.1 Unsynthesizable Code

1.11.1.1 Initial Values

Initial values on signals (UNSYNTHESIZABLE)

  signal bad_signal : std_logic := '0';

Reason: At powerup, the values on signals are random (except for some FPGAs).

1.11.1.2 Wait For

Wait for length of time (UNSYNTHESIZABLE)

    wait for 10 ns;

Reason: Delays through circuits are dependent upon both the circuit and its operating environment, particularly supply voltage and temperature. For example, imagine trying to build an AND gate that will have exactly a 2 ns delay in all environments.

1.11.1.3 Different Wait Conditions

wait statements with different conditions in a process (UNSYNTHESIZABLE)

    -- different clock signals
    process
    begin
      wait until rising_edge(clk1);
      x <= a;
      wait until rising_edge(clk2);
      x <= a;
    end process;

Reason: Would require the flip-flops to use different clock signals at different times.

    -- different clock edges
    process
    begin
      wait until rising_edge(clk);
      x <= a;
      wait until falling_edge(clk);
      x <= a;
    end process;

Reason: Would require the flip-flop to be sensitive to different clock edges at different times.

1.11.1.4 Multiple if rising_edge in Process

Multiple if rising_edge statements in a process (UNSYNTHESIZABLE)

    process (clk)
    begin
      if rising_edge(clk) then
        q0 <= d0;
      end if;
      if rising_edge(clk) then
        q1 <= d1;
      end if;
    end process;

Reason: The idioms for synthesis tools generally expect just a single if rising_edge statement in each process. The simpler the VHDL code is, the easier it is to synthesize hardware. Programmers of synthesis tools make idiomatic (idiotic?) restrictions to make their jobs simpler.
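A likely synthesizable alternative, by analogy with the for-loop case later in this section, is to merge the two assignments under a single edge test. This is a sketch only; the entity name two_flops is hypothetical.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity, for illustration only.
entity two_flops is
  port (
    clk, d0, d1 : in  std_logic;
    q0, q1      : out std_logic
  );
end two_flops;

architecture main of two_flops is
begin
  process (clk)
  begin
    if rising_edge(clk) then  -- single edge test for the whole process
      q0 <= d0;
      q1 <= d1;
    end if;
  end process;
end main;
```

Splitting the code into two processes, each with its own if rising_edge, should describe the same two flip-flops.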

1.11.1.5 if rising_edge and wait in Same Process

An if rising_edge statement and a wait statement in the same process (UNSYNTHESIZABLE)

    process (clk)
    begin
      if rising_edge(clk) then
        q0 <= d0;
      end if;
      wait until rising_edge(clk);
      q0 <= d1;
    end process;

Reason: The idioms for synthesis tools generally expect just a single type of flop-generating statement in each process.

1.11.1.6 if rising_edge with else Clause

The if statement has a rising_edge condition and an else clause (UNSYNTHESIZABLE).

    process (clk)
    begin
      if rising_edge(clk) then
        q0 <= d0;
      else
        q0 <= d1;
      end if;
    end process;

Reason: Generally, an if-then-else statement synthesizes to a multiplexer.

1.11.1.7 if rising_edge Inside a for Loop

An if rising_edge statement in a for-loop (UNSYNTHESIZABLE-Synopsys)

    process (clk)
    begin
      for i in 0 to 7 loop
        if rising_edge(clk) then
          q(i) <= d;
        end if;
      end loop;
    end process;

Reason: just an idiom of the synthesis tool. Some loop statements are synthesizable (Rushton Section 8.7). For-loops in general are described in Ashenden.

Synthesizable Alternative

A synthesizable alternative to an if rising_edge statement in a for-loop is to put the if rising_edge outside of the for-loop.

    process (clk)
    begin
      if rising_edge(clk) then
        for i in 0 to 7 loop
          q(i) <= d;
        end loop;
      end if;
    end process;

1.11.1.8 wait Inside of a for Loop

wait statements in a for-loop (UNSYNTHESIZABLE)

    process
    begin
      for i in 0 to 7 loop
        wait until rising_edge(clk);
        x <= to_unsigned(i,4);
      end loop;
    end process;

Reason: Unknown. while-loops with the same behaviour are synthesizable.

Note: Combinational for-loops
Combinational for-loops are usually synthesizable. They are often used to build a combinational circuit for each element of an array.

Note: Clocked for-loops
Clocked for-loops are not synthesizable, but are very useful in simulation, particularly to generate test vectors for test benches.
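To make the note on combinational for-loops concrete, here is a minimal sketch of a loop that builds one gate per element of a vector. The entity mask8 and its ports are hypothetical.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity, for illustration only.
entity mask8 is
  port (
    d  : in  std_logic_vector(7 downto 0);
    en : in  std_logic;
    q  : out std_logic_vector(7 downto 0)
  );
end mask8;

architecture main of mask8 is
begin
  process (d, en)  -- combinational: every signal that is read is in the sensitivity list
  begin
    for i in 0 to 7 loop
      q(i) <= d(i) and en;  -- one AND gate per element
    end loop;
  end process;
end main;
```

The loop bounds are constants, so the synthesizer can unroll the loop into eight copies of the same gate.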

Synthesizable Alternative to Wait-Inside-For

while loop (synthesizable)

This is the synthesizable alternative to the wait statement in a for-loop above.

    process
    begin
      -- output values from 0 to 4 on i
      -- sending one value out each clock cycle
      i <= to_unsigned(0,4);
      wait until rising_edge(clk);
      while (4 > i) loop
        i <= i + 1;
        wait until rising_edge(clk);
      end loop;
    end process;


1.12 Synthesizable VHDL Coding Guidelines


This section reserved for your reading pleasure

Chapter 2 RTL Design with VHDL: From Requirements to Optimized Code


CHAPTER 2. RTL DESIGN WITH VHDL

2.1 Prelude to Chapter

This section reserved for your reading pleasure

2.2 FPGA Background and Coding Guidelines

2.2.1 Generic FPGA Hardware

2.2.1.1 Generic FPGA Cell

Cell = Logic Element (LE) in Altera
     = Configurable Logic Block (CLB) in Xilinx

[Figure: generic FPGA cell with a combinational block (comb) and a flip-flop (D, CE); signals carry_in, data_in, ctrl_in, data_out, carry_out]

Configurable Comb/Flop Connection

[Figure: generic FPGA cell with a combinational block (comb) and a flip-flop (D, CE, R); signals carry_in, comb_data_in, comb_data_out, flop_data_in, flop_data_out, ctrl_in, carry_out]

Separate Comb and Flop

[Figure: the same cell, with the combinational block and the flip-flop used separately]

Connect Comb and Flop

[Figure: the same cell, with the combinational block connected to the flip-flop]

Flopped and Unflopped Outputs

[Figure: the same cell, with both flopped and unflopped outputs in use]

2.2.2 Area Estimation

To estimate the number of FPGA cells that will be required to implement a circuit, recall that an FPGA lookup-table can implement any function with up to four inputs and one output. We will describe two methods to estimate the area (number of FPGA cells) required to implement a gate-level circuit:

1. Rough estimate based simply upon the number of flip-flops and primary inputs that are in the fanin of each flip-flop.
2. A more accurate estimate, based upon greedily including as many gates as possible into each FPGA cell.

Lower Bound on Area for Circuit with one Target

Source flops/inputs    Minimum cells
 1                     1
 2                     1
 3                     1
 4                     1
 5                     2
 6                     2
 7                     2
 8                     3
 9                     3
10                     3
11                     4

For a single target signal, this technique gives a lower bound on the number of cells needed. For multiple target signals, this technique might be an overestimate, because a single cell can drive several other cells.
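The table is consistent with a simple closed form. As a sketch, assume each cell is a four-input lookup table and that every cell after the first spends one of its inputs on the output of another cell, so k cells can absorb at most 3k + 1 source signals. With n source flip-flops or inputs (n and MinCells are symbols introduced here for illustration):

$$\text{MinCells}(n) = \max\left(1, \left\lceil \frac{n-1}{3} \right\rceil\right)$$

For example, n = 5 gives ceil(4/3) = 2 and n = 11 gives ceil(10/3) = 4, which agrees with the table.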

Question: How many cells are needed to implement a 4:1 mux?

3 Cells for 10:1 Function

Estimate Area for Circuit

For each flip-flop and output: traverse backward through the fanin gathering as much combinational circuitry as possible into the FPGA cell.

Stopping conditions:
- flip-flop
- more than four inputs

However, a point with more than four input signals is not always a stopping point: further back in the fanin, the circuit might collapse back to four or fewer signals.

Question: Map the combinational circuits below onto generic FPGA cells.

[Figure: combinational circuit with inputs a, b, c, d and output z, beside six generic FPGA cells (comb block plus flip-flop with D, CE, Q)]

2.2.2.1 Interconnect for Generic FPGA

This section reserved for your reading pleasure

2.2.2.2 Clocks for Generic FPGAs

Characteristics of clock signals:
- High fanout (drive many gates)
- Long wires (destination gates scattered all over chip)

Characteristics of FPGAs:
- Very few gates that are large (strong) enough to support a high fanout.
- Very few wires that traverse entire chip and can be connected to every flip-flop.

2.2.2.3 Special Circuitry in FPGAs

Memory

For more than five years, FPGAs have had special circuits for RAM and ROM. In Altera FPGAs, these circuits are called ESBs (Embedded System Blocks). These special circuits are possible because many FPGAs are fabricated on the same processes as SRAM chips. So, the FPGAs simply contain small chunks of SRAM.

Microprocessors

A new feature to appear in FPGAs in 2001 and 2002 is hardwired microprocessors on the same chip as programmable hardware.

                         Hard                            Soft
Altera                   Arm 922T with 200 MIPs          Nios with ?? MIPs
Xilinx: Virtex-II Pro    Power PC 405 with 420 D-MIPs    Microblaze with 100 D-MIPs

The Virtex-II Pro has 4 Power PCs and enough programmable hardware to implement the first-generation Intel Pentium microprocessor.

Arithmetic Circuitry

A new feature to appear in FPGAs in 2001 and 2002 is hardwired circuits for multipliers and adders.

Altera: Mercury          16×16 at 130 MHz
Xilinx: Virtex-II Pro    18×18 at ??? MHz

Using these resources can significantly improve both the area and performance of a design.

Input / Output

Recently, high-end FPGAs have started to include special circuits to increase the bandwidth of communication with the outside world.

          Product
Altera    True-LVDS (1 Gbps)
Xilinx    Rocket I/O (3 Gbps)

2.2.3 Generic-FPGA Coding Guidelines

Flip Flops Are Free

Flip-flops are almost free in FPGAs

reason: In FPGAs, the area consumed by a design is usually determined by the amount of combinational circuitry, not by the number of flip-flops.

Use It or Lose It

Aim for using 80-90% of the cells on a chip.

reason: If you use more than 90% of the cells on a chip, then the place-and-route program might not be able to route the wires to connect the cells.

reason: If you use less than 80% of the cells, then probably:
- there are optimizations that will increase performance and still allow the design to fit on the chip; or
- you spent too much human effort on optimizing for low area; or
- you could use a smaller (cheaper!) chip.

exception: In E&CE 327 (unlike in real life), the mark is based on the actual number of cells used.

Just One Clock

Use just one clock signal

reason: If all flip-flops use the same clock, then the clock does not impose any constraints on where the place-and-route tool puts flip-flops and gates. If different flip-flops used different clocks, then flip-flops that are near each other would probably be required to use the same clock.

Just One Clock Edge

Use only one edge of the clock signal

reason: There are two ways to use both rising and falling edges of a clock signal: have rising-edge and falling-edge flip-flops, or have two different clock signals that are inverses of each other. Most FPGAs have only rising-edge flip-flops. Thus, using both edges of a clock signal is equivalent to having two different clock signals, which is discouraged by the preceding guideline.

2.3 Design Flow

This section reserved for your reading pleasure

2.4 Algorithms and High-Level Models

This section reserved for your reading pleasure

2.5 Finite State Machines in VHDL

2.5.1 Introduction to State-Machine Design

2.5.1.1 Mealy vs Moore State Machines

Moore Machines

- Outputs are dependent upon only the state
- No combinational paths from inputs to outputs

[Figure: Moore state diagram with states s0/0, s1/1, s2/0, and s3/0; s0 goes to s1 on a and to s2 on !a]

Mealy Machines

- Outputs are dependent upon both the state and the inputs
- Combinational paths from inputs to outputs

[Figure: Mealy state diagram with states s0, s1, s2, and s3; s0 goes to s1 on a/1 and to s2 on !a/0; the remaining edges are labelled /0]

2.5.1.2 Introduction to State Machines and VHDL

A state machine is generally written as a single clocked process, or as a pair of processes, where one is clocked and one is combinational.

Design Decisions

- Moore vs Mealy (Sections 2.5.2 and 2.5.3)
- Implicit vs Explicit (Section 2.5.1.3)
- State values in explicit state machines: Enumerated type vs constants (Section 2.5.5)
- State values for constants: encoding scheme (binary, gray, one-hot, ...) (Section 2.5.5)

VHDL Constructs for State Machines

The following VHDL control constructs are useful to steer the transition from state to state:

- if ... then ... else
- case
- loop
- for ... loop
- while ... loop
- next
- exit

2.5.1.3 Explicit vs Implicit State Machines

There are two styles of writing state machines in VHDL: explicit and implicit.

Explicit
- State signal appears explicitly in VHDL code
- At most one wait statement per process
- Two sub-categories of explicit state machines

  Explicit-Current
  - State signal represents current state
  - Next-state computation done in a clocked process

  Explicit-Current+Next
  - Two state signals: current state and next state
  - Next-state computation done in a combinational process
  - Current-state <= next-state is registered assignment

Implicit
- Use multiple wait statements in a process to describe state machine implicitly

Implicit State Machines

For the implicit style of writing state machines, the synthesis program adds an implicit register to hold the state signal and combinational circuitry to update the state signal. In Synopsys synthesis tools, the state signal defined by the synthesizer is named multiple_wait_state_reg. In Mentor Graphics, the state signal is named STATE_VAR.

We can think of the VHDL code for implicit state machines as having zero state signals, explicit-current state machines as having one state signal (state), and explicit-current+next state machines as having two state signals (state and state_next).

State Machine Tradeoffs

Explicit-Current+Next
- Most detailed, closest to hardware
- Greatest opportunity for manual optimization
- Most labour-intensive
- Susceptible to small, subtle, hard-to-find bugs

Explicit-Current
- Almost as much opportunity for manual optimization as Explicit-Current+Next
- Easier to write than Explicit-Current+Next
- Less susceptible to subtle bugs

Implicit
- Taught infrequently
- Least detailed, furthest from actual hardware
- Rely on synthesis for optimization
- Usually least labour to write, shortest code
- Easiest to write correctly (But must understand VHDL synthesis!)

Limitation of Implicit State Machines

Because implicit state machines are written with loops, if-then-elses, cases, etc., it is difficult to write some state machines with complicated control flows in an implicit style. The following example illustrates the point.

[Figure: state diagram with states s0/0, s1/1, s2/0, and s3/0 and edges labelled a, !a, !a, and a]

Terminology

Note: The terminology of explicit and implicit is somewhat standard, in that some descriptions of processes with multiple wait statements describe the processes as having implicit state machines. There is no standard terminology to distinguish between the two explicit styles: explicit-current+next and explicit-current.

2.5.2 Implementing a Simple Moore Machine

[Figure: Moore state diagram with states s0/0, s1/1, s2/0, and s3/0; s0 goes to s1 on a and to s2 on !a]

    entity simple is
      port (
        a, clk : in  std_logic;
        z      : out std_logic
      );
    end simple;

2.5.2.1 Implicit Moore State Machine

Flops: ____   Gates: ____   Delay: ____

    architecture moore_implicit_v1a of simple is
    begin
      process
      begin
        z <= '0';
        wait until rising_edge(clk);
        if (a = '1') then
          z <= '1';
        else
          z <= '0';
        end if;
        wait until rising_edge(clk);
        z <= '0';
        wait until rising_edge(clk);
      end process;
    end moore_implicit_v1a;

2.5.2.2 Explicit Moore with Flopped Output

Flops: ____   Gates: ____   Delay: ____

    architecture moore_explicit_v1 of simple is
      type state_ty is (s0, s1, s2, s3);
      signal state : state_ty;
    begin
      process (clk)
      begin
        if rising_edge(clk) then
          case state is
            when s0 =>
              if (a = '1') then
                state <= s1;
                z     <= '1';
              else
                state <= s2;
                z     <= '0';
              end if;
            when s1 | s2 =>
              state <= s3;
              z     <= '0';
            when s3 =>
              state <= s0;
              z     <= '0';  -- z is flopped with the next state; s0 outputs 0
          end case;
        end if;
      end process;
    end moore_explicit_v1;

2.5.2.3 Explicit Moore with Combinational Outputs

Flops: ____   Gates: ____   Delay: ____

    architecture moore_explicit_v2 of simple is
      type state_ty is (s0, s1, s2, s3);
      signal state : state_ty;
    begin
      process (clk)
      begin
        if rising_edge(clk) then
          case state is
            when s0 =>
              if (a = '1') then
                state <= s1;
              else
                state <= s2;
              end if;
            when s1 | s2 =>
              state <= s3;
            when s3 =>
              state <= s0;
          end case;
        end if;
      end process;
      z <= '1' when (state = s1) else '0';
    end moore_explicit_v2;

2.5.2.4 Explicit-Current+Next Moore with Concurrent Assignment

Flops: ____   Gates: ____   Delay: ____

    architecture moore_explicit_v3 of simple is
      type state_ty is (s0, s1, s2, s3);
      signal state, state_nxt : state_ty;
    begin
      process (clk)
      begin
        if rising_edge(clk) then
          state <= state_nxt;
        end if;
      end process;
      state_nxt <= s1 when (state = s0) and (a = '1')
              else s2 when (state = s0) and (a = '0')
              else s3 when (state = s1) or (state = s2)
              else s0;
      z <= '1' when (state = s1) else '0';
    end moore_explicit_v3;

Explicit-Current+Next Moore with Concurrent Assignment

The hardware synthesized from this architecture is the same as that synthesized from moore_explicit_v2, which is written in the explicit-current style.

2.5.2.5 E-C+N Moore with Comb Proc

Change the conditional assignment to state_nxt into a combinational process using a case statement.

Flops: ____   Gates: ____   Delay: ____

Same hardware as moore_explicit_v2 and v3.

    architecture moore_explicit_v4 of simple is
      type state_ty is (s0, s1, s2, s3);
      signal state, state_nxt : state_ty;
    begin
      process (clk)
      begin
        if rising_edge(clk) then
          state <= state_nxt;
        end if;
      end process;
      process (state, a)
      begin
        case state is
          when s0 =>
            if (a = '1') then
              state_nxt <= s1;
            else
              state_nxt <= s2;
            end if;
          when s1 | s2 =>
            state_nxt <= s3;
          when s3 =>
            state_nxt <= s0;
        end case;
      end process;
      z <= '1' when (state = s1) else '0';
    end moore_explicit_v4;

2.5.3 Implementing a Simple Mealy Machine

Mealy machines have a combinational path from inputs to outputs, which often violates good coding guidelines for hardware. Thus, Moore machines are much more common. You should know how to write a Mealy machine if needed, but most of the state machines that you design will be Moore machines.

This section reserved for your reading pleasure
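For readers who want a starting point, the following is a rough sketch of an explicit-current Mealy machine. It assumes the same state transitions as the simple Moore example in section 2.5.2 and an output that is 1 only on the a transition out of s0, as suggested by the Mealy diagram in section 2.5.1.1. The architecture name mealy_explicit_sketch is hypothetical; the entity simple is repeated so that the example is self-contained.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity simple is
  port (
    a, clk : in  std_logic;
    z      : out std_logic
  );
end simple;

-- Hypothetical architecture, for illustration only.
architecture mealy_explicit_sketch of simple is
  type state_ty is (s0, s1, s2, s3);
  signal state : state_ty;
begin
  -- State register and next-state logic, same as the Moore example
  process (clk)
  begin
    if rising_edge(clk) then
      case state is
        when s0 =>
          if (a = '1') then
            state <= s1;
          else
            state <= s2;
          end if;
        when s1 | s2 =>
          state <= s3;
        when s3 =>
          state <= s0;
      end case;
    end if;
  end process;

  -- Mealy output: depends on both the state and the input,
  -- so there is a combinational path from a to z
  z <= '1' when (state = s0) and (a = '1') else '0';
end mealy_explicit_sketch;
```

Compared with the Moore versions, the only change is the output equation, which now reads the input a directly.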

2.5.4 Reset

All circuits should have a reset signal that puts the circuit back into a good initial state. However, not all flip-flops within the circuit need to be reset. In a circuit that has a datapath and a state machine, the state machine will probably need to be reset, but the datapath may not need to be reset.

There are standard ways to add a reset signal to both explicit and implicit state machines. It is important that reset is tested on every clock cycle, otherwise a reset might not be noticed, or your circuit will be slow to react to reset and could generate illegal outputs after reset is asserted.

Reset with Implicit State Machine

- Insert a loop
- Test for reset after each wait

Example from section 2.5.2.1:

    architecture moore_implicit of simple is
    begin
      process
      begin
        init : loop                        -- outermost loop
          z <= '0';
          wait until rising_edge(clk);
          next init when (reset = '1');    -- test for reset
          if (a = '1') then
            z <= '1';
          else
            z <= '0';
          end if;
          wait until rising_edge(clk);
          next init when (reset = '1');    -- test for reset
          z <= '0';
          wait until rising_edge(clk);
          next init when (reset = '1');    -- test for reset
        end loop init;
      end process;
    end moore_implicit;

Reset with Explicit State Machine

Reset is often easier to include in an explicit state machine, because we need only put a test for reset = '1' in the clocked process for the state. The pattern for an explicit-current style of machine is:

    process (clk)
    begin
      if rising_edge(clk) then
        if reset = '1' then
          state <= S0;
        else
          if ... then
            state <= ...;
          elsif ... then
            ... -- more tests and assignments to state
          end if;
        end if;
      end if;
    end process;

Reset with Explicit State Machine

Applying this pattern to the explicit Moore machine from section 2.5.2.3 produces:

    architecture moore_explicit_v2 of simple is
      type state_ty is (s0, s1, s2, s3);
      signal state : state_ty;
    begin
      process (clk)
      begin
        if rising_edge(clk) then
          if (reset = '1') then
            state <= s0;
          else
            case state is
              ...
            end case;
          end if;
        end if;
      end process;
      z <= '1' when (state = s1) else '0';
    end moore_explicit_v2;

Reset with Explicit-Current+Next

The pattern for an explicit-current+next style is:

    process (clk)
    begin
      if rising_edge(clk) then
        if reset = '1' then
          state_cur <= <reset state>;
        else
          state_cur <= state_nxt;
        end if;
      end if;
    end process;
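As a sketch of how this pattern might be applied, the following adds a synchronous reset to the explicit-current+next Moore machine of section 2.5.2.4. The entity simple_rst (entity simple with an added reset port) and the architecture name are hypothetical.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical variant of entity simple with an added reset port.
entity simple_rst is
  port (
    a, clk, reset : in  std_logic;
    z             : out std_logic
  );
end simple_rst;

architecture moore_explicit_v3_rst of simple_rst is
  type state_ty is (s0, s1, s2, s3);
  signal state, state_nxt : state_ty;
begin
  -- Clocked process: reset is tested on every clock cycle
  process (clk)
  begin
    if rising_edge(clk) then
      if reset = '1' then
        state <= s0;          -- reset state
      else
        state <= state_nxt;
      end if;
    end if;
  end process;

  -- Combinational next-state and output logic, unchanged by reset
  state_nxt <= s1 when (state = s0) and (a = '1')
          else s2 when (state = s0) and (a = '0')
          else s3 when (state = s1) or (state = s2)
          else s0;
  z <= '1' when (state = s1) else '0';
end moore_explicit_v3_rst;
```

Only the clocked process changes; the next-state logic does not need to know about reset.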

2.5.5 State Encoding

This section reserved for your reading pleasure

2.6 Dataflow Diagrams

2.6.1 Dataflow Diagrams Overview

Dataflow diagrams are data-dependency graphs where the computation is divided into clock cycles.

Purpose:
- Provide a disciplined approach for designing datapath-centric circuits
- Guide the design from algorithm, through high-level models, and finally to register transfer level code for the datapath and control circuitry.
- Estimate area and performance
- Make tradeoffs between different design options

Background
- Based on techniques from high-level synthesis tools
- Some similarity between high-level synthesis and software compilation
- Each dataflow diagram corresponds to a basic block in software compiler terminology.

Data-Dependency Graph

[Figure: chain of five adders over inputs a, b, c, d, e, f, producing x1, x2, x3, x4, and z]

Data-dependency graph for z = a + b + c + d + e + f

Dataflow Diagrams

[Figure: chain of five adders over inputs a, b, c, d, e, f, producing x1, x2, x3, x4, and z]

Dataflow diagram for z = a + b + c + d + e + f

Clock Cycle Boundaries

[Figure: the dataflow diagram for z = a + b + c + d + e + f with horizontal lines added]

Horizontal lines mark clock cycle boundaries

Latency

[Figure: the dataflow diagram with clock cycle boundaries numbered up to 6]

Latency = 6 clock cycles

Latency

[Figure: the dataflow diagram with clock cycle boundaries numbered up to 4]

Latency = 4 clock cycles

Question: Why would a good hardware engineer find this design dissatisfying?

Flip Flops

[Figure: the dataflow diagram for z = a + b + c + d + e + f with clock cycle boundaries]

- Horizontal lines mark clock cycle boundaries
- Signals crossing clock boundaries are flip-flops

Registered Inputs and Outputs

[Figure: the same dataflow diagram]

Flops on both inputs and outputs

Registered Inputs, Combinational Outputs

[Figure: the same dataflow diagram without a boundary after the last adder]

Flops on inputs, but not outputs (Latency = 5)

Datapath Components

[Figure: the dataflow diagram for z = a + b + c + d + e + f]

Blocks in clock cycles are datapath components

Inputs

Unconnected signal tails are inputs

Outputs

Unconnected signal heads are outputs

Summary

[Figure: the dataflow diagram for z = a + b + c + d + e + f with all annotations]

- Unconnected signal tails are inputs
- Horizontal lines mark clock cycle boundaries
- Signals crossing clock boundaries are flip-flops
- Blocks in clock cycles are datapath components
- Unconnected signal heads are outputs
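To connect the diagram conventions to code, here is a rough sketch of the z = a + b + c + d + e + f dataflow diagram with one add per clock cycle and registers on both inputs and outputs. Each wait statement corresponds to one horizontal line. The entity sum6, the 8-bit width, and the register names are hypothetical, and no attempt is made to share registers.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity, for illustration only.
entity sum6 is
  port (
    clk              : in  std_logic;
    a, b, c, d, e, f : in  unsigned(7 downto 0);
    z                : out unsigned(7 downto 0)
  );
end sum6;

architecture one_add_per_cycle of sum6 is
  signal r_a, r_b, r_c, r_d, r_e, r_f : unsigned(7 downto 0);
  signal x : unsigned(7 downto 0);
begin
  process
  begin
    wait until rising_edge(clk);  -- boundary 1: register the inputs
    r_a <= a; r_b <= b; r_c <= c;
    r_d <= d; r_e <= e; r_f <= f;
    wait until rising_edge(clk);  -- boundary 2
    x <= r_a + r_b;               -- x1
    wait until rising_edge(clk);  -- boundary 3
    x <= x + r_c;                 -- x2
    wait until rising_edge(clk);  -- boundary 4
    x <= x + r_d;                 -- x3
    wait until rising_edge(clk);  -- boundary 5
    x <= x + r_e;                 -- x4
    wait until rising_edge(clk);  -- boundary 6: registered output
    z <= x + r_f;                 -- sums wrap at 8 bits
  end process;
end one_add_per_cycle;
```

There are six wait statements between reading the inputs and producing z, which matches the latency of 6 clock cycles shown for this schedule.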

2.6.2 Dataflow Diagrams, Hardware, and Behaviour

Primary Input

[Figure: three panels (dataflow diagram, hardware, behaviour) for a primary input i and signal x; the behaviour panel is a waveform of clk, i, and x]

Register Input

[Figure: three panels (hardware, dataflow diagram, behaviour) for an input i registered into x; waveform of clk, i, and x]

Register Signal

[Figure: three panels for x = i1 + i2 with x registered; waveform of clk, i1, i2, and x]

Combinational-Component Output

[Figure: three panels for x = i1 + i2 with x combinational; waveform of clk, i1, i2, and x]

2.6.3 Dataflow Diagram Execution

Execution with Registers on Both Inputs and Outputs

[Figure sequence: dataflow diagram for a chain of adders over inputs a to f with intermediate signals x1 to x5 and output z, beside a waveform of clk, a, x1, x2, x3, x4, x5, and z over clock cycles 0 to 6; successive slides step through the execution one clock cycle at a time]

Execution Without Output Registers

[Figure: the same dataflow diagram and waveform, without a register on the output z]

2.6.4 Performance Estimation

Performance Equations

$$\text{Performance} \propto \frac{1}{\text{TimeExec}}$$

$$\text{TimeExec} = \text{Latency} \times \text{ClockPeriod}$$

Definition Latency: Number of clock cycles from inputs to outputs. A combinational circuit has latency of zero. A single register has a latency of one. A chain of n registers has a latency of n.

Performance of Dataflow Diagrams

- Latency: count horizontal lines in diagram
- Min clock period (Max clock speed) limited by longest path in a clock cycle
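As a worked illustration with hypothetical numbers (not taken from the course), compare a design with a latency of 6 and a clock period of 10 ns against a design with a latency of 4 and a clock period of 14 ns:

$$\text{TimeExec}_1 = 6 \times 10\,\text{ns} = 60\,\text{ns}, \qquad \text{TimeExec}_2 = 4 \times 14\,\text{ns} = 56\,\text{ns}$$

Under these assumed numbers the second design finishes slightly sooner even though its clock is slower, which is why latency and clock period have to be considered together.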

2.6.5 Area Estimation

- Maximum number of blocks in a clock cycle is total number of that component that are needed
- Maximum number of signals that cross a cycle boundary is total number of registers that are needed
- Maximum number of unconnected signal tails in a clock cycle is total number of inputs that are needed
- Maximum number of unconnected signal heads in a clock cycle is total number of outputs that are needed

These estimates give lower bounds. Other constraints might force you to use more components.

Area Estimation

Implementation-technology factors, such as the relative size of registers, multiplexers, and datapath components, might force you to make tradeoffs that increase the number of datapath components to decrease the overall area of the circuit.

- With some FPGA chips, a 2:1 multiplexer has the same area as an adder.
- With some FPGA chips, a 2:1 multiplexer can be combined with an adder into one FPGA cell per bit.
- In FPGAs, registers are usually free, in that the area consumed by a circuit is limited by the amount of combinational logic, not the number of flip-flops.

2.6.6 Design Analysis

[Figure: dataflow diagram over inputs a to f with signals x1 to x4 and output z]

num inputs: ____
num outputs: ____
num registers: ____
num adders: ____
min clock period: ____
latency: ____

Design Analysis (Cont'd)

[Figure: dataflow diagram over inputs a to f with signals x1 to x5 and output z]

num inputs: ____
num outputs: ____
num registers: ____
num adders: ____
min clock period: ____
latency: ____

2.6.7 Area / Performance Tradeoffs

[Figure: two dataflow diagrams for the sum of a to f, side by side: one add per clock cycle (clock cycle boundaries numbered up to 6) and two adds per clock cycle (boundaries numbered up to 4)]

Note: In the two-add design, half of the last clock cycle is wasted.

Two Adds per Clock Cycle

[Figure: dataflow diagram with two adds per clock cycle, beside a waveform of clk, a, x1, x2, x3, x4, and x5 over clock cycles 0 to 6]

Design Comparison

[Figure: the one-add-per-clock-cycle and two-adds-per-clock-cycle dataflow diagrams, side by side]

                One add per clock cycle    Two adds per clock cycle
inputs          6                          6
outputs         1                          1
registers       6                          6
adders          1                          2
clock period    flop + 1 add               flop + 2 add
latency         6                          4

Question: Under what circumstances would each design option be fastest?

2.7 Design Example: Massey

This section reserved for your reading pleasure

2.8 Design Example: Vanier

We'll go through the following artifacts:

1. requirements
2. algorithm
3. dataflow diagram
4. high-level models
5. hardware block diagram
6. RTL code for datapath
7. state machine
8. RTL code for control

Design Process

1. Scheduling (allocate operations to clock cycles)
2. I/O allocation
3. First high-level model
4. Register allocation
5. Datapath allocation
6. Connect datapath components, insert muxes where needed
7. Design implicit state machine
8. Optimize
9. Design explicit-current state machine
10. Optimize

2.8.1 Requirements

Functional requirements: compute the following formula:

$$\text{output} = (a \times d) + c + (d \times b) + b$$

Performance requirements:
- Max clock period: flop plus (2 adds or 1 multiply)
- Max latency: 4

Cost requirements:
- Maximum of two adders
- Maximum of two multipliers
- Unlimited registers
- Maximum of three inputs and one output
- Maximum of 5000 student-minutes of design effort

Registered inputs and outputs

2.8.2 Algorithm

$$\text{output} = (a \times d) + c + (d \times b) + b$$

Create a data-dependency graph for the algorithm.

[Figure: data-dependency graph with inputs a, d, b, c and output z]

2.8.3 Initial Dataflow Diagram

Schedule operations into clock cycles.

[Figure: the data-dependency graph with inputs a, d, b, c and output z, divided into clock cycles]

2.8.4 Reschedule to Meet Requirements

[Figure: two dataflow diagrams with inputs a, d, b, c and output z, before and after rescheduling]

Fix Clock Period Violation

[Figure: two dataflow diagrams with inputs d, b, c and output z, before and after the fix]

2.8.5 Optimize Resources

[Figure: dataflow diagram with inputs a, d, b, c and output z]

Analysis

[Figure: dataflow diagram with inputs d and b and output z]

Question: Should we move the second addition from the third clock cycle to the second?

Define Entity

Having finalized our input/output scheduling, we can write our entity.

Note: we will add a reset signal later, when we design the state machine to control the datapath.

    entity vanier is
      port (
        clk      : in  std_logic;
        i_1, i_2 : in  std_logic_vector(15 downto 0);
        o_1      : out std_logic_vector(15 downto 0)
      );
    end vanier;

2.8.6 Assign Names to Registered Values

[Figure: dataflow diagram with a name assigned to each signal that crosses a clock cycle boundary]

Question: Why do we not need to assign names to combinational signals?

Question: Why do we not need to assign a new name to x1, x2, and x4 the second time they cross a clock cycle boundary?

2.8.7 Input/Output Allocation

[Figure: dataflow diagram with inputs d, b, a, c, registered values named x1 to x8, and output z taken from x8]

VHDL Code!

    architecture hlm_v1 of vanier is
      signal x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8 : unsigned(15 downto 0);
    begin
      process
      begin
        wait until rising_edge(clk);
        x_1 <= unsigned(i_1);
        x_2 <= unsigned(i_2);
        wait until rising_edge(clk);
        x_3 <= unsigned(i_1);
        x_4 <= x_1(7 downto 0) * x_2(7 downto 0);
        x_5 <= unsigned(i_2);
        wait until rising_edge(clk);
        x_6 <= x_3(7 downto 0) * x_1(7 downto 0);
        x_7 <= x_2 + x_5;
        wait until rising_edge(clk);
        x_8 <= x_6 + (x_4 + x_7);
      end process;
      o_1 <= std_logic_vector(x_8);
    end hlm_v1;
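A high-level model like this can be checked in simulation before any register or datapath allocation is done. The following testbench is a sketch only. It assumes that entity vanier and architecture hlm_v1 have been compiled into library work with the usual ieee.std_logic_1164 and ieee.numeric_std clauses; the name vanier_tb, the 10 ns clock period, and the test values (d = 5, b = 2, a = 3, c = 7) are all hypothetical.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical testbench, for simulation only (uses "after" and "wait for").
entity vanier_tb is
end vanier_tb;

architecture main of vanier_tb is
  signal clk      : std_logic := '0';
  signal i_1, i_2 : std_logic_vector(15 downto 0) := (others => '0');
  signal o_1      : std_logic_vector(15 downto 0);
begin
  clk <= not clk after 5 ns;  -- 10 ns clock period, first rising edge at 5 ns

  uut : entity work.vanier(hlm_v1)
    port map (clk => clk, i_1 => i_1, i_2 => i_2, o_1 => o_1);

  process
  begin
    -- first clock cycle: d on i_1, b on i_2
    i_1 <= std_logic_vector(to_unsigned(5, 16));  -- d
    i_2 <= std_logic_vector(to_unsigned(2, 16));  -- b
    wait until rising_edge(clk);
    -- second clock cycle: a on i_1, c on i_2
    i_1 <= std_logic_vector(to_unsigned(3, 16));  -- a
    i_2 <= std_logic_vector(to_unsigned(7, 16));  -- c
    wait until rising_edge(clk);
    wait until rising_edge(clk);
    wait until rising_edge(clk);  -- fourth edge: x_8 is loaded
    wait for 1 ns;                -- let o_1 settle
    -- (a*d) + c + (d*b) + b = 15 + 7 + 10 + 2 = 34
    assert unsigned(o_1) = 34
      report "unexpected result on o_1" severity error;
    wait;  -- stop driving stimulus; end the run with a simulator time limit
  end process;
end main;
```

Because the model has no reset yet, the testbench relies on the process starting at its first wait statement at time zero, so the first rising edge captures d and b.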

[Figure: dataflow diagram for hlm_v1 (d and a arrive on i1, b and c arrive on i2, registered values x1 to x8, output z on o1) beside a timing waveform for i1, i2, x1 to x8 over clock cycles 0 to 5]

2.8.8 Tangent: Combinational Outputs

architecture hlm_v1c of vanier is
  signal x_1, x_2, x_3, x_4, x_5, x_6, x_7 : unsigned(15 downto 0);
begin
  process begin
    wait until rising_edge(clk);
    x_1 <= unsigned(i_1);
    x_2 <= unsigned(i_2);
    wait until rising_edge(clk);
    x_3 <= unsigned(i_1);
    x_4 <= x_1(7 downto 0) * x_2(7 downto 0);
    x_5 <= unsigned(i_2);
    wait until rising_edge(clk);
    x_6 <= x_3(7 downto 0) * x_1(7 downto 0);
    x_7 <= x_2 + x_5;
  end process;
  o_1 <= std_logic_vector(x_6 + (x_4 + x_7));
end hlm_v1c;

[Figure: dataflow diagram with a combinational output; inputs i1 (d, a) and i2 (b, c); registered values x1 to x7; output z on o1]

2.8.9 Register Allocation

[Figure: dataflow diagram before register allocation; inputs i1 (d, a) and i2 (b, c); registered values x1 to x7; output z on o1]

New VHDL Code!

[Figure: dataflow diagram with register allocation: r1 holds x1; r2 holds x2 and x6; r3 holds x3; r4 holds x4; r5 holds x5, x7, and x8; output z on o1]

architecture hlm_v2 of vanier is
  signal r_1, r_2, r_3, r_4, r_5 : unsigned(15 downto 0);
begin
  process begin
    wait until rising_edge(clk);
    r_1 <= unsigned(i_1);
    r_2 <= unsigned(i_2);
    wait until rising_edge(clk);
    r_3 <= unsigned(i_1);
    r_4 <= r_1(7 downto 0) * r_2(7 downto 0);
    r_5 <= unsigned(i_2);
    wait until rising_edge(clk);
    r_2 <= r_3(7 downto 0) * r_1(7 downto 0);
    r_5 <= r_2 + r_5;
    wait until rising_edge(clk);
    r_5 <= r_2 + (r_4 + r_5);
  end process;
  o_1 <= std_logic_vector(r_5);
end hlm_v2;

2.8.10 Datapath Allocation

[Figure: register-allocated dataflow diagram used for datapath allocation; registers r1 to r5; output z on o1]

2.8.11 Hardware Block Diagram and State Machine

1. Calculate number of states that are needed
2. Control signals for registers
   - Chip enable
   - Mux select on input
3. Control signals for datapath components
   - Instruction (e.g. add/sub for ALU)
   - Mux select on inputs

For our example: Use four states: S0..S3, one for each clock cycle.

2.8.11.1 Control for Registers

Build a table with one row per state, one column per register.

[Figure: dataflow diagram annotated with states S0 to S3, registers r1 to r5, multiplier m1, and adders a1 and a2]

[Table: empty register control table; rows S0 to S3; columns r1 to r5, each with sub-columns ce (chip enable) and d (data source)]

Optimize chip enables and muxes

        r1        r2        r3        r4        r5
        ce  d     ce  d     ce  d     ce  d     ce  d
S0      1   i1    1   i2    -   -     -   -     -   -
S1      0   -     0   -     1   i1    1   m1    1   i2
S2      -   -     1   m1    -   -     0   -     1   a1
S3      -   -     -   -     -   -     -   -     1   a1

(- marks an entry that is left blank.)

Chip enable: needed when a register holds a value for multiple clock cycles.
Mux: needed when a register loads values from multiple sources.

Optimized Chip Enables and Muxes

        r1=i1    r2          r3=i1    r4=m1    r5
        ce       ce   d               ce       d
S0      1        1    i2                       -
S1      0        0    -               1        i2
S2      -        1    m1              0        a1
S3      -        -    -               -        a1

(- marks an entry that is left blank.)

2.8.11.2 Control for Datapath Components

Table for datapath components. One row per state. One column per datapath component. Sub-columns for sources and instructions (e.g. add/sub for ALU).

[Figure: dataflow diagram annotated with states S0 to S3 and datapath components m1, a1, and a2]

        a1            a2            m1
        src1  src2    src1  src2    src1  src2
S0      -     -       -     -       -     -
S1      -     -       -     -       r1    r2
S2      r2    r5      -     -       r3    r1
S3      r2    a2      r4    r5      -     -

(- marks an entry that is left blank.)

Optimize Datapath Control Table

        a1            a2            m1
        src1  src2    src1  src2    src1  src2
S0      -     -       -     -       -     -
S1      -     -       -     -       r1    r2
S2      r2    r5      -     -       r1    r3
S3      r2    a2      r4    r5      -     -

(- marks an entry that is left blank.)

2.8.11.3 Control for State

We need to control the transition from one state to the next. For this example, the transition is very simple: each state transitions to its successor, S0 -> S1 -> S2 -> S3 -> S0 -> ...

2.8.11.4 Complete State Machine Table

        r1 ce   r2 ce   r2 sel   r4 ce   r5 sel   a1 src2 sel   m1 src2 sel   state
S0      1       1       i2       -       -        -             -             S1
S1      0       0       -        1       i2       -             r2            S2
S2      -       1       m1       0       a1       r5            r3            S3
S3      -       -       -        -       a1       a2            -             S0

(- marks a don't care.)

Question: What values should we use for don't cares?

Don't Cares Instantiations

        r1 ce   r2 ce   r2 sel   r4 ce   r5 sel   a1 src2 sel   m1 src2 sel   state
S0      1       1       i2       0       a1       a2            r3            S1
S1      0       0       m1       1       i2       a2            r2            S2
S2      1       1       m1       0       a1       r5            r3            S3
S3      1       1       m1       0       a1       a2            r3            S0

2.8.12 VHDL Code with Explicit State Machine

We chose a one-hot encoding of the state, which usually results in small and fast hardware for state machines with sixteen or fewer states.

architecture explicit_v1 of vanier is
  signal r_1, r_2, r_3, r_4, r_5 : std_logic_vector(15 downto 0);
  -- a constrained vector type must be declared as a subtype
  subtype state_ty is std_logic_vector(3 downto 0);
  constant s0 : state_ty := "0001";
  constant s1 : state_ty := "0010";
  constant s2 : state_ty := "0100";
  constant s3 : state_ty := "1000";
  signal state : state_ty;

begin
  ----------------------- r_1
  process (clk) begin
    if rising_edge(clk) then
      if state /= S1 then
        r_1 <= i_1;
      end if;
    end if;
  end process;
  ----------------------- r_2
  process (clk) begin
    if rising_edge(clk) then
      if state /= S1 then
        if state = S0 then
          r_2 <= i_2;
        else
          r_2 <= m_1;
        end if;
      end if;
    end if;
  end process;
  ----------------------- r_3
  process (clk) begin
    if rising_edge(clk) then
      r_3 <= i_1;
    end if;
  end process;
  ----------------------- r_4
  process (clk) begin
    if rising_edge(clk) then
      if state = S1 then
        r_4 <= m_1;
      end if;
    end if;
  end process;
  ----------------------- r_5
  process (clk) begin
    if rising_edge(clk) then
      if state = S1 then
        r_5 <= i_2;
      else
        r_5 <= a_1;
      end if;
    end if;
  end process;
  ----------------------- combinational datapath
  with state select
    a1_src2 <= r_5 when S2,
               a_2 when others;
  with state select
    m1_src2 <= r_2 when S1,
               r_3 when others;
  -- src1 of a1 is r_2 in the datapath control table
  a_1 <= r_2 + a1_src2;
  a_2 <= r_4 + r_5;
  m_1 <= r_1 * m1_src2;
  o_1 <= r_5;
  ----------------------- state machine
  process (clk) begin
    if rising_edge(clk) then
      if reset = '1' then
        state <= S0;
      else
        case state is
          when S0 => state <= S1;
          when S1 => state <= S2;
          when S2 => state <= S3;
          when S3 => state <= S0;
        end case;
      end if;
    end if;
  end process;
  ----------------------
end explicit_v1;

Hardware Block Diagram

[Figure: hardware block diagram with inputs i1 and i2, registers r1 to r5, multiplier m1, adders a1 and a2, and output o1, shown beside the dataflow diagram for states S0 to S3]

2.8.13 Peephole Optimizations

-- r_1
process (clk) begin
  if rising_edge(clk) then
    if state /= S1 then
      r_1 <= i_1;
    end if;
  end if;
end process;

-- r_1 (optimized)
process (clk) begin
  if rising_edge(clk) then
    if state(1) = '0' then
      r_1 <= i_1;
    end if;
  end if;
end process;

Peephole Optimizations

-- r_2
process (clk) begin
  if rising_edge(clk) then
    if state /= S1 then
      if state = S0 then
        r_2 <= i_2;
      else
        r_2 <= m_1;
      end if;
    end if;
  end if;
end process;

-- r_2 (optimized)
process (clk) begin
  if rising_edge(clk) then
    if state(1) = '0' then
      if state(0) = '1' then
        r_2 <= i_2;
      else
        r_2 <= m_1;
      end if;
    end if;
  end if;
end process;

Peephole Optimizations

-- state machine
process (clk) begin
  if rising_edge(clk) then
    if reset = '1' then
      state <= S0;
    else
      case state is
        when S0 => state <= S1;
        when S1 => state <= S2;
        when S2 => state <= S3;
        when S3 => state <= S0;
      end case;
    end if;
  end if;
end process;

-- state machine (optimized)
-- NOTE: "st" = "state"
process (clk) begin
  if rising_edge(clk) then
    if reset = '1' then
      st <= S0;
    else
      for i in 0 to 3 loop
        st( (i+1) mod 4 ) <= st( i );
      end loop;
    end if;
  end if;
end process;
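The optimized state machine works because, with a one-hot encoding, moving to the successor state is just a one-bit rotation of the state vector. A minimal Python sketch of the same loop is shown here; the function name next_state_one_hot is an assumption for illustration, and the state is represented as an integer whose bit i corresponds to st(i).

```python
def next_state_one_hot(st: int, width: int = 4) -> int:
    """Rotate a one-hot state left by one bit, mirroring st((i+1) mod 4) <= st(i)."""
    nxt = 0
    for i in range(width):
        if (st >> i) & 1:
            # Bit i of the old state becomes bit (i+1) mod width of the new state.
            nxt |= 1 << ((i + 1) % width)
    return nxt


S0, S1, S2, S3 = 0b0001, 0b0010, 0b0100, 0b1000
```

Starting from S0 and applying the function four times visits S1, S2, S3 and returns to S0, which matches the case statement it replaces.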

2.8.14 Notes and Observations

Our functional requirements were written as:

output = (a × d) + (d × b) + b + c

Alternatively, we could have achieved exactly the same functionality with the functional requirements written as (the two statements are mathematically equivalent):

output = (a × d) + b + (d × b) + c

Data Dependency Graphs: Clean vs Ugly

The naive data dependency graph for the alternative formulation is much messier than the data dependency graph for the original formulation:

Original: (a × d) + (d × b) + b + c
Alternative: (a × d) + c + (d × b) + b

[Figure: data dependency graphs for the original and alternative formulations; inputs a, d, b, c; output z]

2.9 Pipelining

Pipelining is an optimization that increases performance by overlapping the execution of multiple parcels (instructions). The cost is an increase in area, because we cannot reuse datapath components, registers, inputs, or outputs.

2.9.1 Introduction to Pipelining

Review of unpipelined dataflow diagram

[Figure: unpipelined dataflow diagram summing inputs a to f over clock cycles 0 to 5, reusing adder add1 and registers r1 and r2, with output z; timing diagram for clock cycles 0 to 13]

Question: How soon can we start to execute ?

Pipelined dataflow diagram

Each stage is treated as a separate dataflow diagram. A double line denotes the boundary between stages.

[Figure: pipelined dataflow diagram with stages 1 to 5, adders add1 to add5, registers r1 to r10, inputs a to f, and output z; timing diagram for clock cycles 0 to 13]

Question: How soon can we start to execute ?

Sequential (Unpipelined) Hardware

[Figure: unpipelined hardware with inputs i1 and i2, registers r1 and r2, a single adder add1, state bits State(0) to State(4) with reset, and output o1]

Pipelined Hardware

[Figure: pipelined hardware with five stages; inputs i1 to i6; registers r1 to r10; adders add1 to add5; output o1]

Pipelined VHDL Code

-- stage 1
process begin
  wait until rising_edge(clk);
  r1 <= i1;
  r2 <= i2;
end process;
-- stage 2
process begin
  wait until rising_edge(clk);
  r3 <= r1 + r2;
  r4 <= i3;
end process;
-- stage 3
process begin
  wait until rising_edge(clk);
  r5 <= r3 + r4;
  r6 <= i4;
end process;
-- stage 4
process begin
  wait until rising_edge(clk);
  r7 <= r5 + r6;
  r8 <= i5;
end process;
-- stage 5
process begin
  wait until rising_edge(clk);
  r9 <= r7 + r8;
  r10 <= i6;
end process;
-- output
o1 <= r9 + r10;
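To see the overlap of parcels, the five stage processes can be modelled cycle by cycle in Python. The sketch below is an illustration only: the function simulate_pipeline and its arguments are hypothetical, and it assumes that a parcel entering at clock edge k presents i1 and i2 at edge k, i3 at edge k+1, and so on up to i6 at edge k+4, which is the timing the VHDL code implies.

```python
def simulate_pipeline(parcels, cycles):
    """Cycle-level sketch of the five-stage adder pipeline.

    parcels: list of 6-tuples (a, b, c, d, e, f); parcel k enters at clock edge k.
    Returns the value of o1 after each clock edge.
    """
    r = [0] * 11  # r[1] .. r[10]; r[0] is unused
    outputs = []

    def port(k, j):
        # Value of element j of parcel k, or 0 when no such parcel exists (a bubble).
        return parcels[k][j] if 0 <= k < len(parcels) else 0

    for t in range(cycles):
        nxt = r[:]  # all registers update simultaneously at the clock edge
        nxt[1], nxt[2] = port(t, 0), port(t, 1)               # stage 1
        nxt[3], nxt[4] = r[1] + r[2], port(t - 1, 2)          # stage 2
        nxt[5], nxt[6] = r[3] + r[4], port(t - 2, 3)          # stage 3
        nxt[7], nxt[8] = r[5] + r[6], port(t - 3, 4)          # stage 4
        nxt[9], nxt[10] = r[7] + r[8], port(t - 4, 5)         # stage 5
        r = nxt
        outputs.append(r[9] + r[10])                          # combinational output
    return outputs
```

With two parcels entering on consecutive edges, their sums appear on consecutive cycles: the latency is unchanged, but a new result is produced every clock cycle.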

2.9.2 Partially Pipelined

Fully pipelined: throughput is one parcel per clock cycle.
Partially pipelined: throughput is less than one parcel per clock cycle.
Superscalar: throughput is more than one parcel per clock cycle.

[Figure: partially pipelined dataflow diagram with stages 1 to 3, adders add1 to add3, registers r1 to r6, inputs a to f, and output z; timing diagram for clock cycles 0 to 13]

Question: How do we execute followed by ?

Hardware for Partially Pipelined

[Figure: partially pipelined hardware with inputs i1 and i2, state bits State(0) and State(1) with reset, three stages with registers r1 to r6 and adders add1 to add3, and output o1]

2.9.3 Terminology

Definition Depth: The depth of a pipeline is the number of stages on the longest path through the pipeline.

Definition Latency: The latency of a pipeline is measured the same as for an unpipelined circuit: the number of clock cycles from inputs to outputs.

Definition Throughput: The number of parcels consumed or produced per clock cycle.

Definition Upstream/downstream: Because parcels flow through the pipeline analogously to water in a stream, the terms upstream and downstream are used respectively to refer to earlier and later stages in the pipeline. For example, stage1 is upstream from stage2.

Definition Bubble: When a pipe stage is empty (contains invalid data), it is said to contain a bubble.

Question: How do we know whether the output of the pipeline is a bubble or is valid data?
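One common way to approach this question is to carry a valid bit alongside the data, delayed by one register per pipeline stage. The Python sketch below illustrates that idea only; the function track_valid and the default depth of three stages are assumptions for illustration.

```python
def track_valid(i_valid_stream, depth=3):
    """Shift a valid bit through `depth` registers, one per pipeline stage.

    Returns, for each clock edge, whether the last stage holds valid data.
    """
    v = [False] * depth
    o_valid = []
    for iv in i_valid_stream:
        # Every valid register takes the value of its upstream neighbour.
        v = [bool(iv)] + v[:-1]
        o_valid.append(v[-1])
    return o_valid
```

A bubble at the input then shows up as a deasserted valid bit at the output the same number of cycles later as the data it accompanies.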

2.10 Design Example: Pipelined Massey

Requirements

Functional requirements:
- Compute the sum of six 8-bit numbers: output = a + b + c + d + e + f
- Registered inputs, combinational outputs

Performance requirements:
- Maximum clock period: unlimited
- Maximum latency: four

Cost requirements:
- Maximum of five adders
- Small miscellaneous hardware (e.g. muxes) is unlimited
- Maximum of six inputs and one output
- Design effort is unlimited

Initial Dataflow Diagrams

[Figure: original dataflow diagram and final unpipelined dataflow diagram; inputs a to f; output z]

Dataflow Diagram Exploration

[Figure: variation on the original dataflow diagram (inputs a to f, output z) and pipelined dataflow diagram with i_valid and o_valid]

VHDL Code

-- stage 1
process begin
  wait until rising_edge(clk);
  r1 <= i1;
  r2 <= i2;
  r3 <= i3;
  r4 <= i4;
  v1 <= i_valid;
end process;
a1 <= r1 + r2;
a2 <= r3 + r4;
-- stage 2
process begin
  wait until rising_edge(clk);
  r5 <= a1;
  r6 <= a2;
  r7 <= i5;
  r8 <= i6;
  v2 <= v1;
end process;
a3 <= r5 + r6;
a4 <= r7 + r8;
-- stage 3
process begin
  wait until rising_edge(clk);
  r9  <= a3;
  r10 <= a4;
  v3  <= v2;
end process;
a5 <= r9 + r10;
-- outputs
z       <= a5;
o_valid <= v3;

2.11 Memory Arrays and RTL Design

2.11.1 Memory Operations

Read of Memory with Registered Inputs

[Figure: hardware (memory M with ports WE, A, DI, DO, driven by we, a, and clk, producing do) and behaviour waveform for clk, we, a, M(a), and do]

Write to Memory with Registered Inputs

[Figure: hardware (memory M with ports WE, A, DI, DO, driven by we, a, di, and clk) and behaviour waveform for clk, we, a, di, M(a), and do]

Dual-Port Memory with Registered Inputs

[Figure: dual-port memory M with ports WE, A0, DI0, DO0, A1, DO1 and behaviour waveform for clk, we, a0, di0, a1, M(a), do0, and do1]

Sequence of Memory Operations

[Figure: dual-port memory M and behaviour waveform for a sequence of writes and reads on clk, we, a0, di0, a1, M(a), do0, and do1]

2.11.2 Memory Arrays in VHDL

This section reserved for your reading pleasure

2.11.3 Data Dependencies

Definition of Three Types of Dependencies

Read after Write      Write after Write     Write after Read
(True dependency)     (Load dependency)     (Anti dependency)
M[i] := ...           M[i] := ...           ...  := M[i]
...  := M[i]          M[i] := ...           M[i] := ...

Instructions in a program can be reordered, so long as the data dependencies are preserved.
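The three definitions can be turned into a small classifier. The Python sketch below is illustrative only: the representation of an operation as a (kind, address) pair and the function names are assumptions, and it reports every ordered pair of operations on the same address where at least one is a write.

```python
def classify_dependency(first, second):
    """Classify the dependency of `second` on `first`.

    Each operation is a (kind, address) pair, where kind is "R" or "W".
    Returns "RAW", "WAW", "WAR", or None when there is no dependency.
    """
    (k1, a1), (k2, a2) = first, second
    if a1 != a2:
        return None
    if k1 == "W" and k2 == "R":
        return "RAW"  # true dependency
    if k1 == "W" and k2 == "W":
        return "WAW"
    if k1 == "R" and k2 == "W":
        return "WAR"  # anti dependency
    return None       # two reads never conflict


def find_dependencies(program):
    """Return (i, j, kind) for every dependent pair with i before j."""
    deps = []
    for i in range(len(program)):
        for j in range(i + 1, len(program)):
            kind = classify_dependency(program[i], program[j])
            if kind is not None:
                deps.append((i, j, kind))
    return deps
```

Any reordering of the program that keeps every reported pair in its original order preserves the data dependencies.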

Purpose of Dependencies

W0: R3 := ......
W1: R3 := ......           (producer)
R1: ... := ... R3 ...      (consumer)
W2: R3 := ......

WAW ordering prevents W0 from happening after W1.
RAW ordering prevents R1 from happening before W1.
WAR ordering prevents W2 from happening before R1.

Each of the three types of memory dependencies (RAW, WAW, and WAR) serves a specific purpose in ensuring that producer-consumer relationships are preserved.

Ordering of Memory Operations

Data Dependencies

Initial memory contents: M[3] = 30, M[2] = 20, M[1] = 10, M[0] = 0

Initial Program
M[2] := 21
M[3] := 31
A    := M[2]
B    := M[0]
M[3] := 32
M[0] := 01
C    := M[3]

Data Dependencies (Cont'd)

Initial Program       Valid Modification
M[2] := 21            M[2] := 21
M[3] := 31            B    := M[0]
A    := M[2]          A    := M[2]
B    := M[0]          M[3] := 31
M[3] := 32            M[3] := 32
M[0] := 01            M[0] := 01
C    := M[3]          C    := M[3]

Data Dependencies (Cont'd)

Initial Program       Valid (or Bad?) Modification
M[2] := 21            M[2] := 21
M[3] := 31            B    := M[0]
A    := M[2]          A    := M[2]
B    := M[0]          M[3] := 31
M[3] := 32            C    := M[3]
M[0] := 01            M[3] := 32
C    := M[3]          M[0] := 01

2.11.4 Memory and Dataflow Diagrams

Legend for Dataflow Diagrams

[Figure: legend of symbols for input port, output port, state signal, array read (name(rd)), and array write (name(wr))]

Basic Memory Operations

Memory Read:  data := mem[addr];
Memory Write: mem[addr] := data;

[Figure: dataflow diagram for a memory read (mem and addr feed mem(rd), which produces data, with an anti-dependency arc on mem) and for a memory write (mem, data, and addr feed mem(wr), which produces mem)]

Dataflow Diagrams and Data Dependencies

Read after Write Dependencies

Algo:
  mem[wr_addr] := data_in;
  data_out := mem[rd_addr];

[Figure: Read after Write: mem, data_in, and wr_addr feed mem(wr); its result and rd_addr feed mem(rd), producing mem and data_out]

Read after Write Optimization

Algo:
  mem[wr_addr] := data_in;
  data_out := mem[rd_addr];

[Figure: mem(wr) and mem(rd) placed side by side. Caption: Optimization when rd_addr = wr_addr]

Write after Write Dependencies

Algo:
  mem[wr1_addr] := data1;
  mem[wr2_addr] := data2;

[Figure: Write after Write: mem, data1, and wr1_addr feed the first mem(wr); its result, data2, and wr2_addr feed the second mem(wr), producing mem]

Write after Write Scheduling Option

Algo:
  mem[wr1_addr] := data1;
  mem[wr2_addr] := data2;

[Figure: the Write after Write diagram beside an alternative schedule in which the write of data2 to wr2_addr is placed before the write of data1 to wr1_addr. Caption: Scheduling option when wr1_addr = wr2_addr]

Write after Read Dependencies

Algo:
  rd_data := mem[rd_addr];
  mem[wr_addr] := wr_data;

[Figure: Write after Read: mem and rd_addr feed mem(rd), producing rd_data; wr_data and wr_addr then feed mem(wr), producing mem]

Write after Read Optimization

Algo:
  rd_data := mem[rd_addr];
  mem[wr_addr] := wr_data;

[Figure: mem(rd) and mem(wr) placed side by side. Caption: Optimization when rd_addr = wr_addr]

2.11.5 Ex: Mem Array and Dataflow Diagram

1  M[2] := 21
2  M[3] := 31
3  A    := M[2]
4  B    := M[0]
5  M[3] := 32
6  M[0] := 01
7  C    := M[3]

[Figure: dataflow diagram chaining the seven memory operations in program order (M(wr), M(wr), M(rd), M(rd), M(wr), M(wr), M(rd)) through memory M]

Dependencies for Known Addresses

[Figure: the same dataflow diagram with dependency arcs between operations that use the same address]

Anti-Dependencies for Known Addresses

[Figure: the same dataflow diagram with anti-dependency arcs between operations that use the same address]

Minimal Dependencies

[Figure: memory array with minimal dependencies]

Memory Array with Orderings

[Figure: memory array with orderings]

Place Operations in Clock Cycles

[Figure: the memory operations placed in clock cycles; outputs A, B, and C]

Final Dataflow Diagram

[Figure: final version of the dataflow diagram, with the memory operations scheduled over clock cycles 1 to 4]

2.12 Input / Output Protocols

This section reserved for your reading pleasure

2.13 Example: Moving Average

In this section we will design a circuit that performs a moving average as it receives a stream of data. When each new data item is received, the output is the average of the four most recently received data.

Time     0  1  2  3  4  5  6  7  8  9  10
i_data   2  3  5  6  6  0  2  2  5  3  1
o_avg             4  5  4  3

2.13.1 Requirements and Environmental Assumptions

1. Input data is sent sporadically, with at least 2 clock cycles of bubbles (invalid data) between valid data.
2. When the input data is valid, the signal i_valid is asserted for exactly one clock cycle.
3. Input data will be 8-bit signed numbers.
4. When output data is ready, o_valid shall be asserted.
5. The output data (o_avg) shall be the average of the four most recently received input data. Output numbers shall be truncated to integer values.

2.13.2 Algorithm

Generic equation with input data x_i:
  avg_i = (x_{i-3} + x_{i-2} + x_{i-1} + x_i) / 4

Decompose into sum and avg:
  sum_i = x_{i-3} + x_{i-2} + x_{i-1} + x_i
  avg_i = sum_i / 4

Look for patterns and potential optimizations:
  sum_5 = x_2 + (x_3 + x_4 + x_5)
  sum_6 = (x_3 + x_4 + x_5) + x_6 = sum_5 - x_2 + x_6

Generalized recurrence equation:
  sum_i = sum_{i-1} - x_{i-4} + x_i
  avg_i = sum_i / 4
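The recurrence can be checked against the direct four-term sum with a few lines of Python. This is a sketch with hypothetical function names; it assumes at least four input samples, and it truncates toward zero so that negative sums of signed inputs are also truncated rather than floored.

```python
def window_sums_direct(xs):
    """sum_i = x_{i-3} + x_{i-2} + x_{i-1} + x_i for every i >= 3."""
    return [xs[i - 3] + xs[i - 2] + xs[i - 1] + xs[i] for i in range(3, len(xs))]


def window_sums_recurrence(xs):
    """Same sums, computed as sum_i = sum_{i-1} - x_{i-4} + x_i (needs len(xs) >= 4)."""
    s = sum(xs[:4])
    sums = [s]
    for i in range(4, len(xs)):
        s = s - xs[i - 4] + xs[i]
        sums.append(s)
    return sums


def truncated_avg(s):
    """Divide by four and truncate toward zero."""
    return int(s / 4)


i_data = [2, 3, 5, 6, 6, 0, 2, 2, 5, 3, 1]
```

For the i_data stream from the introduction of this example, both functions give the same sums, and the first four truncated averages are 4, 5, 4, 3. The recurrence needs only one subtraction and one addition per new sample instead of three additions.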

Summary of Behaviour

1. Define a signal new for the value of i_data each time that i_valid is 1.
2. Define a memory array M to store a sliding window of the four most recent values of i_data.
3. Define a signal old for the oldest data value from the sliding window.
4. Update sum_i with sum_{i-1} - old_i + new_i

Sliding Window

Two design patterns to choose from: shift register vs circular buffer

[Figure: shift register (M[3], M[2], M[1], M[0]) and circular buffer (M[0..3]), each with input new and output old]

For FIFO behaviour, a circular buffer is usually preferred: smaller and lower power.
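The circular-buffer pattern can be sketched in Python as follows. The class name CircularWindow and its fields are assumptions for illustration; an integer index stands in for the one-hot idx signal used later in the hardware. Unlike a shift register, only one entry changes on each update.

```python
class CircularWindow:
    """Sliding window of the most recent `size` values, stored as a circular buffer."""

    def __init__(self, size=4):
        self.mem = [0] * size  # the memory array M, initially all zeros
        self.idx = 0           # position of the oldest value
        self.total = 0         # running sum of the window

    def push(self, new):
        """Overwrite the oldest value with `new` and return the value it replaced."""
        old = self.mem[self.idx]
        self.mem[self.idx] = new
        self.idx = (self.idx + 1) % len(self.mem)
        self.total += new - old  # sum_i = sum_{i-1} - old + new
        return old
```

After pushing 2, 3, 5, 6, 6 the window holds the four most recent values and total is 20, and the value returned by the last push is the 2 that fell out of the window.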

Sliding Window with Registers

[Figure: register array with chip-enables and decoded multiplexer: 8-bit input d, write enable we, one-hot address idx[0..3] generating ce[0..3] for registers M[0] to M[3], and 8-bit output q]

2.13.3 Pseudocode and Dataflow Diagrams

First Pseudocode

Real 3-address pseudocode
new    = i_data
old    = M[idx]
tmp    = sum - old
sum    = tmp + new
M[idx] = new
idx    = idx rol 1
o_avg  = sum/4

[Figure: data-dependency graph for the first pseudocode; inputs sum, M, idx, i_data; intermediate values new, old, tmp; outputs sum, o_avg (wired shift), and idx]

Remove intermediate signal old
new    = i_data
tmp    = sum - M[idx]
sum    = tmp + new
M[idx] = new
idx    = idx rol 1
o_avg  = sum/4

Reading new from memory
tmp    = sum - M[idx]
M[idx] = i_data
new    = M[idx]
sum    = tmp + new
idx    = idx rol 1
o_avg  = sum/4

Remove intermediate signal new
tmp    = sum - M[idx]
M[idx] = i_data
sum    = tmp + M[idx]
idx    = idx rol 1
o_avg  = sum/4

Data-dependency graph after removing new

[Figure: data-dependency graph with a memory read, a memory write, and a second memory read; inputs sum, M, idx, i_data; outputs sum, o_avg (wired shift), and idx]

Dataflow Diagram

[Figure: two dataflow diagrams, one with a latency of three clock cycles (states S0, S1, S2) and one with a latency of two clock cycles (states S0, S1); inputs M, i_data, idx, sum; outputs sum, o_avg (wired shift), and idx]

Two clock cycles potentially preferable for performance, but requires an additional multiplexer.

Latency of two clock cycles with registered address

[Figure: dataflow diagram with states S0 and S1; inputs M, i_data, idx, sum; outputs sum, o_avg (wired shift), and idx]

Removes need for multiplexer on address input to circular buffer

Register and Datapath Allocation

[Figure: dataflow diagram with states S0 and S1 after allocation: registers idx and sum, memory M, datapath components as1 (add/sub) and rol; outputs sum, o_avg (wired shift), and idx]

2.13.4 Control Tables and State Machine

[Figure: allocated dataflow diagram with states S0 and S1, registers idx and sum, memory M, and datapath components as1 and rol]

Register control table
        M                  idx          sum
        we   addr   d      ce   d       ce   d
S0      1    idx    x      0    -       1    as1
S1      0    idx    -      1    rol     1    as1

Datapath control table
        as1                    rol
        sub   src1   src2      src1   src2
S0      0     M      sum       -      -
S1      1     sum    M         idx    1

(- marks an entry that is left blank.)

Optimized control table
        M     idx    as1
        we    ce     sub
S0      1     1      0
S1      0     0      1

Static assignments in control table
M.addr   = idx
M.d      = x
idx.d    = rol
sum.d    = as1
as1.src1 = sum
as1.src2 = M

Control Table and Bubbles

Almost final control table
        M     idx    sum    as1
        we    ce     ce     sub
S0      1     0      1      0
S1      0     1      1      1
idle    0     0      0

Final control table
        M     idx    sum    as1
        we    ce     ce     sub
S0      1     0      1      0
S1      0     1      1      1
idle    0     0      0      0

Static assignments
M.addr   = idx
M.d      = x
idx.d    = rol
sum.d    = as1
as1.src1 = sum
as1.src2 = M

State Machine

        i_valid   valid1
S0      1         0
S1      0         1
idle    0         0

Final control table with state encoding
        state               M     idx    sum    as1
        i_valid   valid1    we    ce     ce     sub
S0      1         0         1     0      1      0
S1      0         1         0     1      1      1
idle    0         0         0     0      0      0

M.we    = i_valid
idx.ce  = valid1
sum.ce  = i_valid OR valid1
as1.sub = valid1

2.13.5 VHDL Code

-- valid bits
process begin
  wait until rising_edge(clk);
  valid1  <= i_valid;
  o_valid <= valid1;
end process;

-- idx
process begin
  wait until rising_edge(clk);
  if reset = '1' then
    idx <= "0001";
  else
    if valid1 = '1' then
      idx <= idx rol 1;
    end if;
  end if;
end process;

-- sliding window
process begin
  wait until rising_edge(clk);
  for i in 3 downto 0 loop
    if (i_valid = '1') and (idx(i) = '1') then
      M(i) <= i_data;
    end if;
  end loop;
end process;

mem_out <= M(0) when idx(0) = '1'
      else M(1) when idx(1) = '1'
      else M(2) when idx(2) = '1'
      else M(3);

-- add sub
add_sub <= sum - mem_out when valid1 = '1'
      else sum + mem_out;

-- sum
process begin
  wait until rising_edge(clk);
  if i_valid = '1' or valid1 = '1' then
    sum <= add_sub;
  end if;
end process;

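Hardware

[Figure: hardware block diagram with inputs i_valid and i_data, register valid1, memory M (address A, chip enable CE), register idx (wired shift), an add/sub unit, register sum with chip enable, and outputs o_valid and o_avg (wired shift)]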

Chapter 3 Performance Analysis and Optimization

3.1 Introduction

Hennessy and Patterson's Computer Architecture: A Quantitative Approach (textbook for E&CE 429) has good information on performance. We will use some of the same definitions and formulas as Hennessy and Patterson, but we will move away from generic definitions of performance for computer systems and focus on performance for digital circuits.

3.2 Defining Performance

Performance = Work / Time

You can double your performance by:
- doing twice the work in the same amount of time, OR
- doing the same amount of work in half the time

Benchmarking

Performance = Work / Time

Measuring time is easy, but how do we accurately measure work? The game of benchmarketing is finding a definition of work that makes your system appear to get the most work done in the least amount of time.

Measure of Work       Measure of Performance
clock cycle           MHz
instruction           MIPs
synthetic program     Whetstone, Dhrystone, D-MIPs (Dhrystone MIPs)
real program          SPEC
travel 1/4 mile       drag race

SPEC Benchmarks

The SPEC benchmarks are among the most respected and accurate predictors of real-world performance.

Definition SPEC: Standard Performance Evaluation Corporation. MISSION: To establish, maintain, and endorse a standardized set of relevant benchmarks and metrics for performance evaluation of modern computer systems. http://www.spec.org

The SPEC organization has different benchmarks for integer software, floating-point software, web-serving software, etc.

3.3 Comparing Performance

3.3.1 General Equations

Equation for "Big is n% greater than Small":

  n% = (Big - Small) / Small

Using the "n% greater" formula, the phrase "The performance of A is n% greater than the performance of B" is:

  n% = (Performance_A - Performance_B) / Performance_B

Performance is inversely proportional to time:

  Performance = 1 / Time

Substituting the above equation into the equation for "the performance of A is n% greater than the performance of B" gives:

  n% = (Time_B - Time_A) / Time_A

In general, the equation for a fast system to be n% faster than a slow system is:

  n% = (T_Slow - T_Fast) / T_Fast
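The relationship between the performance form and the time form of the formula can be checked numerically. The Python sketch below uses hypothetical function names and example numbers chosen only for illustration.

```python
def percent_greater(big, small):
    """n% such that `big` is n% greater than `small`."""
    return (big - small) / small * 100.0


def percent_faster(t_slow, t_fast):
    """n% such that the fast system is n% faster than the slow system."""
    return (t_slow - t_fast) / t_fast * 100.0


def performance(time):
    """Performance is inversely proportional to time."""
    return 1.0 / time
```

For example, a task that takes 150 s on the slow system and 100 s on the fast system gives percent_faster(150, 100) = 50, and comparing the two performances with percent_greater gives the same figure. Note that the denominator is the time of the fast system, not the slow one.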

Another useful formula is the average time to do one of k different tasks, each of which happens %_i of the time and takes an amount of time T_i each time it is done:

  T_Avg = sum_{i=1}^{k} (%_i)(T_i)
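The weighted average is straightforward to compute. The sketch below assumes each task is given as a (fraction, time) pair with the fractions summing to one; the function name average_time is hypothetical.

```python
def average_time(tasks):
    """T_Avg = sum of fraction_i * T_i over all tasks.

    tasks: iterable of (fraction, time) pairs; the fractions must sum to 1.
    """
    tasks = list(tasks)
    total_fraction = sum(fraction for fraction, _ in tasks)
    if abs(total_fraction - 1.0) > 1e-9:
        raise ValueError("task fractions must sum to 1")
    return sum(fraction * time for fraction, time in tasks)
```

For example, two tasks that each occur half of the time and take 220 ns and 110 ns average out to 165 ns.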

We can measure the performance of practically anything (cars, computers, vacuum cleaners, printers....)

3.3.2 Example: Performance of Printers

This section reserved for your reading pleasure

3.4 Clock Speed, CPI, Program Length, and Performance

3.4.1 Mathematics

CPI           Cycles per instruction
NumInsts      Number of instructions
ClockSpeed    Clock speed
ClockPeriod   Clock period

  Time = NumInsts × CPI × ClockPeriod

  Time = (NumInsts × CPI) / ClockSpeed
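The two forms of the equation are equivalent because ClockPeriod = 1 / ClockSpeed. A small Python sketch, with hypothetical function names and illustrative numbers, makes the units explicit (seconds for time and period, hertz for clock speed).

```python
import math


def time_from_period(num_insts, cpi, clock_period_s):
    """Time = NumInsts * CPI * ClockPeriod (clock period in seconds)."""
    return num_insts * cpi * clock_period_s


def time_from_speed(num_insts, cpi, clock_speed_hz):
    """Time = NumInsts * CPI / ClockSpeed (clock speed in Hz)."""
    return num_insts * cpi / clock_speed_hz
```

For example, 1000 instructions at a CPI of 2 on a 100 MHz clock (10 ns period) take 20 microseconds with either form.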

3.4.2 Example: CISC vs RISC and CPI

                  Clock Speed   SPECint
AMD Athlon        1.1GHz        409
Fujitsu SPARC64   675MHz        443

The AMD Athlon is a CISC microprocessor (it uses the IA-32 instruction set). The Fujitsu SPARC64 is a RISC microprocessor (it uses Sun's SPARC instruction set). Assume that it requires 20% more instructions to write a program in the SPARC instruction set than the same program requires in IA-32.

SPECint and Performance

                  Clock Speed   SPECint
AMD Athlon        1.1GHz        409
Fujitsu SPARC64   675MHz        443

Question: Which of the two processors has higher performance?

Relative CPI

Question: What is the ratio between the CPIs of the two microprocessors?

Absolute CPI

Question: Can you determine the absolute (actual) CPI of either microprocessor?

3.4.3 Effect of Instruction Set on Performance

Your group designs a microprocessor and you are considering adding a fused multiply-accumulate to the instruction set. (A fused multiply-accumulate is a single instruction that does both a multiply and an addition. It is often used in digital signal processing.) Your studies have shown that, on average, half of the multiply operations are followed by an add instruction that could be done with a fused multiply-add. Additionally, you know:

         cpi              %
ADD      0.8 × CPIavg     15%
MUL      1.2 × CPIavg     5%
Other    1.0 × CPIavg     80%

Options

You have three options:

option 1: no change
option 2: add the MAC instruction, increase the clock period by 20%, and MAC has the same CPI as MUL.
option 3: add the MAC instruction, keep the clock period the same, and the CPI of a MAC is 50% greater than that of a multiply.

Question: Which option will result in the highest overall performance?

3.4.4 Effect of Time to Market on Relative Performance

Assume that performance of the average product in your market segment doubles every 18 months. You are considering an optimization that will improve the performance of your product by 7%.

Question: If you add the optimization, how much can you allow your schedule to slip before the delay hurts your relative performance compared to not doing the optimization and launching the product according to your current schedule?
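One way to set up this calculation is sketched below in Python. It assumes that the market's performance grows exponentially (doubling every 18 months) and that the break-even point is where the market's growth during the delay equals the gain from the optimization; the function name break_even_slip_months is hypothetical.

```python
import math


def break_even_slip_months(gain, doubling_months=18.0):
    """Delay after which market growth cancels a relative performance gain.

    gain: fractional improvement, e.g. 0.07 for 7%.
    Solves 2 ** (t / doubling_months) == 1 + gain for t.
    """
    return doubling_months * math.log2(1.0 + gain)
```

Under these assumptions, a 7% gain is cancelled by a slip of a little under two months, and a gain of 100% is cancelled by a slip of exactly one doubling period.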

3.4.5 Summary of Equations

3.5 Performance Analysis and Dataflow Diagrams

3.5.1 Dataflow Diagrams, CPI, and Clock Speed

One of the challenges in designing a circuit is to choose the clock speed. Choosing a clock period affects many aspects of the design, not just the overall performance.
- Some goals will push you toward a short clock period
- Some goals will push you toward a long clock period

Goal                                                 Action    Effect
Minimize area
Increase scheduling flexibility
Decrease percentage of clock cycle spent in flops
  (overhead: time in flops is not doing useful work)
Decrease time to execute an instruction

Outline to Choose Clock Period

Outline of plan to find optimal latency and clock period for maximum performance:

1. Start with smallest possible clock period.
2. Allocate operations to clock cycles.
3. Calculate average time to execute an instruction.
4. If latency > 1, then: increase clock period until latency is reduced; return to Step 2.
   Else (latency = 1): choose the clock period and dataflow diagram that resulted in highest performance.
5. Optimize dataflow diagram to reduce area.

3.5.2 Examples of Dataflow Diagrams for Two Instructions

- Circuit supports two instructions, A and B
- Each instruction occurs 50% of the time.
- The delay through a register is 5ns.
- Find clock period and dataflow diagram to maximize overall performance.

[Figure: dataflow diagrams for Instruction A and Instruction B, built from operations f (30ns), g (50ns), h (20ns), and i (40ns)]

3.5.2.1 Scheduling of Operations for Different Clock Periods

Scheduling (1)

[Figure: schedules of Instr A and Instr B for a 55ns clock period; operations f (30ns), g (50ns), h (20ns), and i (40ns); slack of 25 ns and 15 ns marked]

Scheduling (2)

[Figure: schedules for two further clock periods; slack of 15 ns and 25 ns marked]

Scheduling (3)

[Figure: schedule for a further clock period; slack of 15 ns and 25 ns marked]

320

CHAPTER 3. PERFORMANCE ANALYSIS AND OPTIMIZATION

3.5.2.2 Performance Computation for Different Clock Periods

Question: Which clock speed will result in the highest overall performance?

Clock Period    CPI_A    CPI_B    T_avg
55 ns
75 ns
85 ns
95 ns
155 ns
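Each row of the table can be evaluated with the average-time relation sketched below. The symbols $f_A$ and $f_B$ (instruction frequencies, both 0.5 in this example) and $T_{clk}$ (clock period) are names assumed here for illustration. The numbers in the second line are hypothetical and are not the answer to any row of the table.

```latex
T_{avg} = \left( f_A \cdot CPI_A + f_B \cdot CPI_B \right) \cdot T_{clk}

% Hypothetical instance: CPI_A = 2, CPI_B = 3, T_{clk} = 60\,\text{ns}
T_{avg} = (0.5 \cdot 2 + 0.5 \cdot 3) \cdot 60\,\text{ns} = 150\,\text{ns}
```

When allocating operations to a cycle, the 5 ns register delay is part of the clock period, so the operations scheduled in one cycle can total at most $T_{clk} - 5\,\text{ns}$.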


3.5.2.3 Example: Two Instructions Taking Similar Time

Question: For the flow below, which clock speed will result in the highest overall performance?

[Figure: dataflow diagrams for instructions A and B; surviving delay labels: 30 ns, 40 ns, 50 ns, 50 ns, 20 ns, 40 ns, 50 ns.]

Clock Period    CPI_A    CPI_B    T_avg
____ ns
____ ns
____ ns
____ ns
____ ns
____ ns

3.5.2.4 Example: Same Total Time, Different Order for A

Question: For the flow below, which clock speed will result in the highest overall performance?

[Figure: dataflow diagrams for instructions A and B; surviving delay labels: 30 ns, 40 ns, 20 ns, 50 ns, 50 ns, 40 ns, 50 ns.]

Clock Period    CPI_A    CPI_B    T_avg
____ ns
____ ns
____ ns
____ ns

3.5.3 Example: From Algorithm to Optimized Dataflow

This question involves doing some of the design work for a circuit that implements InstP and InstQ using the components described below.

Instruction    Algorithm                            Frequency of Occurrence
InstP          a × b × ((a × b) + (b × d) + e)      75%
InstQ          (i + j + k + l) × m                  25%

Component       Delay
2-input Mult    40 ns
2-input Add     25 ns
Register        5 ns

NOTES

There is a resource limitation of a maximum of 3 input ports. (There are no other resource limitations.)

You must put registers on your inputs; you do not need to register your outputs. The environment will directly connect your outputs (its inputs) to registers.

Each input value (a, b, c, d, e, i, j, k, l, m) can be input only once. If you need to use a value in multiple clock cycles, you must store it in a register.

Questions
Question: What clock period will result in the best overall performance?

Question: Find a minimal set of resources that will achieve the performance you calculated.

3.6 General Optimizations

3.6.1 Strength Reduction

Strength reduction replaces one operation with another that is simpler.

3.6.1.1 Arithmetic Strength Reduction

Operation                                   Replacement
Multiply by a constant power of two         wired shift logical left
Multiply by a power of two                  shift logical left
Divide by a constant power of two           wired shift logical right
Divide by a power of two                    shift logical right
Multiply by 3                               wired shift and addition
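The constant cases in the table can be sketched in VHDL as follows. This is a minimal illustration, not code from the course: the entity strength_demo and its port names are hypothetical, and unsigned operands from ieee.numeric_std are assumed.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity illustrating arithmetic strength reduction.
entity strength_demo is
  port (
    x      : in  unsigned(7 downto 0);
    times8 : out unsigned(10 downto 0);  -- x * 8
    div4   : out unsigned(5 downto 0);   -- x / 4, truncating
    times3 : out unsigned(9 downto 0)    -- x * 3
  );
end strength_demo;

architecture main of strength_demo is
begin
  -- Multiply by a constant power of two: wired shift left.
  -- Appending zeros uses only wires, no gates.
  times8 <= x & "000";

  -- Divide by a constant power of two: wired shift right.
  -- Dropping the low-order bits uses only wires.
  div4 <= x(7 downto 2);

  -- Multiply by 3: wired shift (x * 2) plus one addition.
  -- Both operands are widened to 10 bits so the sum cannot overflow.
  times3 <= shift_left(resize(x, 10), 1) + resize(x, 10);
end main;
```

A shift by a non-constant amount still needs a shifter circuit, which is why the table distinguishes "wired shift" from "shift".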


3.6.1.2 Boolean Strength Reduction

Boolean tests that can be implemented as wires:
  is odd, is even
  is neg, is pos

By choosing your encodings carefully, you can sometimes reduce a vector comparison to a wire. For example, if your state uses a one-hot encoding, then the comparison state = S3 reduces to state(3) = '1'. You might expect a reasonable logic-synthesis tool to do this reduction automatically, but most tools do not.

When using encodings other than one-hot, Karnaugh maps can be useful tools for optimizing vector comparisons. By carefully choosing our state assignments when we use a full binary encoding for 8 states, the comparison

  (state = S0 or state = S3 or state = S4) = '1'

can be reduced from looking at 3 bits to looking at just 2 bits. If we have a condition that is true for four states, then we can find an encoding that looks at just 1 bit.
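As a minimal sketch of tests that reduce to wires, the hypothetical entity below assumes a two's-complement input n and a 4-bit one-hot state vector in which S3 is encoded as "1000". All names are illustrative.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity: each output below is a single wire.
entity bool_demo is
  port (
    n      : in  signed(7 downto 0);
    state  : in  std_logic_vector(3 downto 0);  -- one-hot, S3 = "1000" (assumed)
    is_odd : out std_logic;
    is_neg : out std_logic;
    in_s3  : out std_logic
  );
end bool_demo;

architecture main of bool_demo is
begin
  is_odd <= n(0);      -- least significant bit
  is_neg <= n(7);      -- sign bit of a two's-complement number
  in_s3  <= state(3);  -- replaces the 4-bit comparison state = "1000"
end main;
```

Writing the bit select explicitly avoids relying on the synthesis tool to discover that the other three bits of the comparison are redundant.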

3.6.2 Replication and Sharing

3.6.2.1 Mux-Pushing

Pushing multiplexors into the fanin of a signal can reduce area.

Before:
  z <= a + b when (w = '1')
       else a + c;

After:
  tmp <= b when (w = '1')
         else c;
  z   <= a + tmp;

The first circuit will have two adders, while the second will have one adder. Some synthesis tools will perform this optimization automatically, particularly if all of the signals are combinational.

3.6.2.2 Common Subexpression Elimination

Introduce new signals to capture subexpressions that occur in multiple places in the code.

Before:
  y <= a + b + c when (w = '1')
       else d;
  z <= a + c + d when (w = '1')
       else e;

After:
  tmp <= a + c;
  y <= b + tmp when (w = '1')
       else d;
  z <= d + tmp when (w = '1')
       else e;

Subexpression Elimination

Note: Clocked subexpressions. Care must be taken when doing common subexpression elimination in a clocked process. Putting the temporary signal in the clocked process will add a clock cycle to the latency of the computation, because the tmp signal will be a flip-flop. The tmp signal must be combinational to preserve the behaviour of the circuit.
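A minimal sketch of the safe form is shown below, assuming 8-bit unsigned operands and a hypothetical entity cse_demo. The shared term tmp is assigned by a concurrent statement outside the clocked process, so it stays combinational and y and z keep their original one-cycle latency.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity: common subexpression a + c shared by y and z.
entity cse_demo is
  port (
    clk           : in  std_logic;
    w             : in  std_logic;
    a, b, c, d, e : in  unsigned(7 downto 0);
    y, z          : out unsigned(7 downto 0)
  );
end cse_demo;

architecture main of cse_demo is
  signal tmp : unsigned(7 downto 0);
begin
  -- Concurrent assignment: tmp is combinational, not a flip-flop.
  tmp <= a + c;

  process (clk) begin
    if rising_edge(clk) then
      if w = '1' then
        y <= b + tmp;
        z <= d + tmp;
      else
        y <= d;
        z <= e;
      end if;
    end if;
  end process;
end main;
```

Moving the assignment to tmp inside the clocked process would register it, and y and z would then see the value of a + c from the previous cycle.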

3.6.2.3 Computation Replication

To improve performance:
If the same result is needed at two very distant locations and wire delays are significant, it might improve performance (increase clock speed) to replicate the hardware.

To reduce area:
If the same result is needed at two different times that are widely separated, it might be cheaper to reuse the hardware component to repeat the computation than to store the result in a register.

Note: Muxes are not free. Each time a component is reused, multiplexors are added to inputs and/or outputs. Too much sharing of a component can cost more area in additional multiplexors than would be spent in replicating the component.

3.6.3 Arithmetic

VHDL is left-associative. The expression a + b + c + d is interpreted as (((a + b) + c) + d). You can use parentheses to suggest parallelism.

Perform arithmetic on the minimum number of bits needed. If you only need the lower 12 bits of a result, but your input signals are 16 bits wide, trim your inputs to 12 bits. This results in a smaller and faster design than computing all 16 bits of the result and trimming the result to 12 bits.
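Both tips can be combined in a short sketch. The entity arith_demo and its signal names are hypothetical. The parentheses suggest a balanced adder tree, two adder delays deep instead of three, and the inputs are trimmed before the additions because the lower 12 bits of a sum depend only on the lower 12 bits of the operands.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity: 12-bit sum of four 16-bit inputs.
entity arith_demo is
  port (
    a, b, c, d : in  unsigned(15 downto 0);
    sum12      : out unsigned(11 downto 0)
  );
end arith_demo;

architecture main of arith_demo is
  signal a12, b12, c12, d12 : unsigned(11 downto 0);
begin
  -- Trim the inputs first, so every adder is 12 bits wide.
  a12 <= a(11 downto 0);
  b12 <= b(11 downto 0);
  c12 <= c(11 downto 0);
  d12 <= d(11 downto 0);

  -- Parentheses suggest two parallel adders feeding a third.
  sum12 <= (a12 + b12) + (c12 + d12);
end main;
```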

3.7 Retiming

[Figure: circuit and waveform before retiming, with the critical path marked. Signals: state (S0, S1, S2, S3), a, b, c, sel, x, y, z.]

process begin
  wait until rising_edge(clk);
  if state = S1 then
    z <= a + c;
  else
    z <= b + c;
  end if;
end process;

Retimed Circuit and Waveform

[Figure: retimed circuit and waveform. Signals: state (S0, S1, S2, S3), a, b, c, sel, x, y, z.]

With combinational sel:

process (state) begin
  if state = S1 then
    sel <= '1';
  else
    sel <= '0';
  end if;
end process;

process begin
  wait until rising_edge(clk);
  if sel = '1' then
    ... -- code for z
  end if;
end process;

With registered sel:

process begin
  wait until rising_edge(clk);
  if state = ____ then
    sel <= '1';
  else
    sel <= '0';
  end if;
end process;

process begin
  wait until rising_edge(clk);
  if sel = '1' then
    ... -- code for z
  end if;
end process;

Chapter 4 Functional Verification

4.1 Overview

4.1.1 Terminology: Validation / Verification / Testing

4.1.2 The Difficulty of Designing Correct Chips

4.1.2.1 Notes from Kenn Heinrich (UW E&CE grad)

Everyone should get a lecture on why their first industrial design won't work in the field.

Note: There are six reasons in your notes.

4.1.2.2 Notes from Aart de Geus (Chairman and CEO of Synopsys)

More than 60% of the ASIC designs that are fabricated have at least one error, issue, or problem whose severity forced the design to be reworked.

Note: There is a pretty picture in your notes.

4.2 Test Cases and Coverage

4.2.1 Coverage

To be absolutely certain that an implementation is correct, we must check every combination of values. This includes both input values and internal state (flip-flops). If we have $n_i$ bits of inputs and $n_s$ bits in flip-flops, we have to test $2^{n_i + n_s}$ different cases when doing functional verification.

Question: If we have $n_c$ combinational signals, why don't we have to test $2^{n_i + n_s + n_c}$ different cases?

4.2.2 Floating Point Divider Example

This example illustrates the difficulty of achieving significant coverage on realistic circuits. Consider doing the functional simulation for a double-precision (64-bit) floating-point divider.

Given Information
Data width                                                        64 bits
Number of gates in circuit                                        10 000
Number of assembly-language instructions to simulate
  one gate for one test case                                      100
Number of clock cycles required to execute one assembly-language
  instruction on the computer that is running the simulation     0.5
Clock speed of the computer that is running the simulation        1 Gigahertz

Number of Cases

Question: How many cases must be considered?

width = 64 b, gates = 10 000, instrs/gate = 100, cycles/instr = 0.5, cycles/sec = $10^9$

Simulation Run Time

Question: How long will it take to simulate all of the different possible cases using a single computer?

Coverage

Question: If you can run simulations non-stop for one year on ten computers, what coverage will you achieve?
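One way to work through the three questions is sketched below. It rests on assumptions that the slides do not state: the divider takes two 64-bit operands, it is treated as purely combinational (no internal state), and one year is taken as about $3.15 \times 10^{7}$ seconds.

```latex
\text{cases} = 2^{64+64} = 2^{128} \approx 3.4 \times 10^{38}

\text{time per case} = \frac{10^{4}\ \text{gates} \times 100\ \tfrac{\text{instr}}{\text{gate}} \times 0.5\ \tfrac{\text{cycles}}{\text{instr}}}{10^{9}\ \tfrac{\text{cycles}}{\text{s}}} = 5 \times 10^{-4}\ \text{s}

\text{total time} \approx 3.4 \times 10^{38} \times 5 \times 10^{-4}\ \text{s} = 1.7 \times 10^{35}\ \text{s} \approx 5.4 \times 10^{27}\ \text{years}

\text{cases in one year on ten computers} = \frac{10 \times 3.15 \times 10^{7}\ \text{s}}{5 \times 10^{-4}\ \text{s}} = 6.3 \times 10^{11}

\text{coverage} \approx \frac{6.3 \times 10^{11}}{3.4 \times 10^{38}} \approx 1.9 \times 10^{-27}
```

Under these assumptions, exhaustive simulation is out of reach by many orders of magnitude, which is the point of the example.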

Simulation vs the Real World

From "Validating the Intel(R) Pentium(R) Microprocessor" by Bob Bentley, Design Automation Conference 2001. (Link on E&CE 327 web page.)

Simulating the Pentium 4 Processor on a Pentium 3 Processor ran at about 15 MHz.

By tapeout, over 200 billion simulation cycles had been run on a network of computers.

All of these simulations represent less than two minutes of running a real processor.

4.3 Testbenches

4.3.1 Overview of Test Benches

[Figure: a testbench containing stimulus, specification, and check blocks, connected to the implementation.]

Implementation: Circuit that you're checking for bugs. Also known as "design under test" or "unit under test".
Stimulus: Generates test vectors.
Specification: Describes desired behaviour of implementation.
Check: Checks whether implementation obeys specification.

4.3.2 Reference Model Style Testbench

[Figure: reference model testbench containing specification and stimulus blocks, connected to the implementation.]

4.3.3 Relational Style Testbench

[Figure: relational testbench containing stimulus and check blocks, connected to the implementation.]

4.3.4 Coding Structure of a Testbench

[Figure: testbench containing stimulus, specification, and check blocks, connected to the implementation.]

architecture main of athabasca_tb is
  component declaration for implementation;
  other declarations
begin
  implementation instantiation;
  stimulus process;
  specification process (or component instantiation);
  check process;
end main;

4.3.5 Datapath vs Control

Datapath and control circuits tend to use different styles of testbenches.

[Figure: a reference model testbench (specification and stimulus) and a relational testbench (stimulus and check), each connected to an implementation.]

4.3.6 Verification Tips

Suggested order of simulation for functional verification:

1. Write high-level model.
2. Simulate high-level model until it has correct functionality and latency.
3. Write synthesizable model.
4. Use zero-delay simulation (uw-sim) to check behaviour of synthesizable model against high-level model.
5. Optimize the synthesizable model.
6. Use zero-delay simulation (uw-sim) to check behaviour of optimized model against high-level model.
7. Use timing simulation (uw-timsim) to check behaviour of optimized model against high-level model.

Section 4.4 describes a series of testbenches that are particularly useful for debugging datapath circuits in the early phases of the design cycle.

4.4 Functional Verification for Datapath Circuits


In this section we will incrementally develop a testbench for a very simple circuit: an AND gate.

Implementation

entity and2 is
  port (
    a, b : in  std_logic;
    c    : out std_logic
  );
end and2;

architecture main of and2 is
begin
  c <= '1' when (a = '1' AND b = '1')
       else '0';
end main;

4.4.1 A Spec-Less Testbench

First, use a waveform viewer to check that the implementation generates reasonable outputs for a small set of inputs.

entity and2_tb is
end and2_tb;

architecture main_tb of and2_tb is
  component and2 ... end component;
  signal ta, tb, tc_impl : std_logic;
  signal ok : boolean;
begin
  ---------------------------------------------
  impl : and2 port map (a => ta, b => tb, c => tc_impl);
  ---------------------------------------------
  stimulus : process
  begin
    ta <= '0'; tb <= '0';
    wait for 10 ns;
    ta <= '1'; tb <= '1';
    wait for 10 ns;
  end process;
  ---------------------------------------------
end main_tb;

4.4.2 Use an Array for Test Vectors

architecture main_tb of and2_tb is
  ...
begin
  ...
  stimulus : process
    type test_datum_ty is record
      ra, rb : std_logic;
    end record;
    type test_vectors_ty is array(natural range <>) of test_datum_ty;
    constant test_vectors : test_vectors_ty :=
      --  a    b
      ( ( '0', '0'),
        ( '1', '1')
      );
  begin
    for i in test_vectors'low to test_vectors'high loop
      ta <= test_vectors(i).ra;
      tb <= test_vectors(i).rb;
      wait for 10 ns;
    end loop;
  end process;
end main_tb;

4.4.3 Build Spec into Stimulus

stimulus : process
  type test_datum_ty is record
    ra, rb, rc : std_logic;
  end record;
  type test_vectors_ty is array(natural range <>) of test_datum_ty;
  constant test_vectors : test_vectors_ty :=
    -- a, b: inputs
    -- c   : expected output
    --  a    b    c
    ( ( '0', '0', '0'),
      ( '0', '1', '0'),
      ( '1', '1', '1')
    );
begin
  for i in test_vectors'low to test_vectors'high loop
    ta <= test_vectors(i).ra;
    tb <= test_vectors(i).rb;
    tc_spec <= test_vectors(i).rc;
    wait for 10 ns;
  end loop;
end process;

Build Spec into Stimulus (Cont'd)

stimulus : process
  ...
begin
  for i in test_vectors'low to test_vectors'high loop
    ta <= test_vectors(i).ra;
    tb <= test_vectors(i).rb;
    tc_spec <= test_vectors(i).rc;
    wait for 10 ns;
  end loop;
end process;
------------------------------------------
check : process (tc_impl, tc_spec)
begin
  ok <= (tc_impl = tc_spec);
end process;
------------------------------------------
end main_tb;

4.4.4 Have Separate Specification Entity

entity and2_spec is
  ...(same as and2 entity)...
end and2_spec;

architecture spec of and2_spec is
begin
  c <= a AND b;
end spec;

Testbench for Separate Specification

architecture main_tb of and2_tb is
  component and2 ...;
  component and2_spec ...;
  signal ta, tb, tc_impl, tc_spec : std_logic;
  signal ok : boolean;
begin
  ------------------------------------------
  impl : and2 port map (a => ta, b => tb, c => tc_impl);
  spec : and2_spec port map (a => ta, b => tb, c => tc_spec);
  ------------------------------------------
  stimulus process...
  check process...
end

Testbench for Separate Spec (Cont'd)

stimulus : process
  ...
  constant test_vectors : test_vectors_ty :=
    --  a    b
    ( ( '0', '0'),
      ( '1', '1')
    );
begin
  for i in test_vectors'low to test_vectors'high loop
    ta <= test_vectors(i).ra;
    tb <= test_vectors(i).rb;
    wait for 10 ns;
  end loop;
end process;
------------------------------------------
check : process (tc_impl, tc_spec)
begin
  ok <= (tc_impl = tc_spec);
end process;
------------------------------------------
end main_tb;

4.4.5 Generate Test Vectors Automatically

architecture main_tb of and2_tb is
  ...
begin
  ...
  stimulus : process
    -- std_test_ty covers only the values '0' and '1'
    subtype std_test_ty is std_logic range '0' to '1';
  begin
    for va in std_test_ty'low to std_test_ty'high loop
      for vb in std_test_ty'low to std_test_ty'high loop
        ta <= va;
        tb <= vb;
        wait for 10 ns;
      end loop;
    end loop;
  end process;
  ...
end main_tb;

4.4.6 Relational Specification

Sometimes we want to check a relationship between the output and the input, rather than check that the output has a specific value. To do this, we drop the spec process and put the brains into the check process.

architecture main_tb of and2_tb is
  ...
begin
  ------------------------------------------
  impl : and2 port map (a => ta, b => tb, c => tc_impl);
  ------------------------------------------
  stimulus : process
    ...
  end process;
  ------------------------------------------
  -- the sensitivity list names every signal read by the check
  check : process (tc_impl, ta, tb)
  begin
    ok <= NOT (tc_impl = '1' AND (ta = '0' OR tb = '0'));
  end process;
  ------------------------------------------
end main_tb;
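The fragments in this section can be assembled into one self-contained file, sketched below. This is an illustrative combination rather than the course's reference testbench: it uses direct entity instantiation instead of a component declaration, folds the relational check into the stimulus process as an assert, and ends with a bare wait so that the simulation stops. The report strings are made up for the example.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity and2 is
  port (a, b : in std_logic; c : out std_logic);
end and2;

architecture main of and2 is
begin
  c <= '1' when (a = '1' and b = '1') else '0';
end main;

library ieee;
use ieee.std_logic_1164.all;

entity and2_tb is
end and2_tb;

architecture main_tb of and2_tb is
  signal ta, tb, tc_impl : std_logic;
begin
  impl : entity work.and2 port map (a => ta, b => tb, c => tc_impl);

  stimulus : process
    -- generate all four input combinations automatically
    subtype std_test_ty is std_logic range '0' to '1';
  begin
    for va in std_test_ty loop
      for vb in std_test_ty loop
        ta <= va;
        tb <= vb;
        wait for 10 ns;
        -- relational check: the output must not be '1' if any input is '0'
        assert not (tc_impl = '1' and (ta = '0' or tb = '0'))
          report "error: output is 1 while an input is 0"
          severity error;
      end loop;
    end loop;
    report "and2_tb: all vectors applied";
    wait;  -- suspend forever so the simulation ends
  end process;
end main_tb;
```

Note that this relation checks only one direction. An implementation that always drives '0' would pass, so a complete check would also require the output to be '1' when both inputs are '1'.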

4.5 Functional Verification of Control Circuits


Control circuits are often more challenging to verify than datapath circuits.

In this section, we will explore the functional verication of state machines via a First-In First-Out queue.

4.5.1 Overview of Queues in Hardware

[Figure: Structure of queue, with write and read ports.]

[Figure: Write Sequence. Steps Write 1 and Write 2, starting from an empty queue; the value A is written.]

[Figure: A Second Example Write. Steps Write 1 and Write 2; queue contents A and B.]

[Figure: Example Read Sequence. Steps Read 1 and Read 2; queue contents A and B.]

[Figure: Write Illustrating Index Wrap. Steps Write 1 and Write 2; queue contents B C D E F G H I J.]

[Figure: Write Illustrating Full Queue. Steps Write 1 and Write 2; queue contents K B C D E F G H I J.]

[Figure: Queue Signals (control circuitry not shown). Memory mem with ports WE, A0, DI0, A1, DO1, DO0 and signals do_wr, data_wr, wr_idx, do_rd, rd_idx, data_rd, empty.]

[Figure: Incomplete Queue Blocks.]

4.5.2 VHDL Coding

4.5.2.1 Package

package queue_pkg is
  subtype data is std_logic_vector(3 downto 0);
  function to_data(i : integer) return data;
end queue_pkg;

package body queue_pkg is
  function to_data(i : integer) return data is
  begin
    return std_logic_vector(to_unsigned(i, 4));
  end to_data;
end queue_pkg;

4.5.2.2 Other VHDL Coding

This section reserved for your reading pleasure

4.5.3 Code Structure for Verification

Verification things to notice in queue implementation:
1. instrumentation code
2. coverage monitors
3. assertions

architecture ... is
  ...
begin
  ... normal implementation ...
  process (clk) begin
    if rising_edge(clk) then
      ... instrumentation code ...
      prev_signame <= signame;
    end if;
  end process;
  ... assertions ...
  ... coverage monitors ...
end;

4.5.4 Instrumentation Code

Added to implementation to support verification
Usually keeps track of previous values of signals
Does not create hardware (optimized away during synthesis)
Does not feed any output signals
Must use synthesizable subset of VHDL

process (clk) begin
  if rising_edge(clk) then
    prev_rd_idx <= rd_idx;
    prev_wr_idx <= wr_idx;
    prev_do_rd  <= do_rd;
    prev_do_wr  <= do_wr;
  end if;
end process;

Coverage Events for Queue

Question: What events should we monitor to estimate the coverage of our functional tests?

Coverage Monitor Template

process (signals read)
begin
  if (condition) then
    report "coverage: message";
  elsif (condition) then
    report "coverage: message";
  else
    report "error: case fall through on message"
      severity warning;
  end if;
end process;

Coverage Monitor Code

Events related to rd_idx equals wr_idx.

process (prev_rd_idx, prev_wr_idx, rd_idx, wr_idx)
begin
  if (rd_idx = wr_idx) then
    if ( prev_rd_idx = prev_wr_idx ) then
      report "coverage: read = write both moved";
    elsif ( rd_idx /= prev_rd_idx ) then
      report "coverage: Read caught write";
    elsif ( wr_idx /= prev_wr_idx ) then
      report "coverage: Write caught read";
    else
      report "error: case fall through on rd/wr catching"
        severity warning;
    end if;
  end if;
end process;

Events related to rd_idx wrapping.

process (rd_idx)
begin
  if (rd_idx = low_idx) then
    report "coverage: rd mv to low";
  elsif (rd_idx = high_idx) then
    report "coverage: rd mv to high";
  else
    report "coverage: rd mv normal";
  end if;
end process;

4.5.5 Assertions

Assertions for Queue

1. If rd_idx changes, then it increments or wraps.
2. If rd_idx changes, then do_rd was '1', or reset is '1'.
3. If wr_idx changes, then it increments or wraps.
4. If wr_idx changes, then do_wr was '1', or reset is '1'.
5. And many others...

Assertion Template

process (signals read)
begin
  assert (required condition)
    report "error: message" severity warning;
end process;

Assertions: Read Index

process (rd_idx)
begin
  assert ((rd_idx > prev_rd_idx) or (rd_idx = low_idx))
    report "error: rd inc" severity warning;
  assert ((prev_do_rd = '1') or (reset = '1'))
    report "error: rd imp do_rd" severity warning;
end process;

Assertions: Write Index

process (wr_idx)
begin
  assert ((wr_idx > prev_wr_idx) or (wr_idx = low_idx))
    report "error: wr inc" severity warning;
  assert ((prev_do_wr = '1') or (reset = '1'))
    report "error: wr imp do_wr" severity warning;
end process;

4.5.6 VHDL Coding Tips

Vector Type Declaration

type data_array_ty is array(natural range <>) of data;
signal data_array : data_array_ty(7 downto 0);

Functions

function to_idx (i : natural range data_array'low to data_array'high)
  return idx_ty
is
begin
  return to_unsigned(i, idx_ty'length);
end to_idx;

Conversion to Index
  Without function:  rd_idx <= to_unsigned(5, 3);
  With function:     rd_idx <= to_idx(5);

The function code is verbose, but is very maintainable, because neither the function itself nor uses of the function need to know the width of the index vector.

Attributes

function inc_idx (idx : idx_ty) return idx_ty is
begin
  if idx < data_array'high then
    return (idx + 1);
  else
    return (to_idx(data_array'low));
  end if;
end inc_idx;

Feedback Loops, and Functions

Coding guideline: use functions. Don't use procedures.

  inc as function:   wr_idx <= inc_idx(wr_idx);
  inc as procedure:  inc_idx(wr_idx);

Functions clearly distinguish between reading from a signal and writing to a signal. By examining the use of a procedure, you cannot tell which signals are read from and which are written to. You must examine the declaration or implementation of the procedure to determine the modes of the signals.

Modifying a signal within a procedure results in a tri-state signal. This is bad.

File I/O (textio package)

TEXTIO defines the read, write, readline, and writeline procedures. Described in: http://www.eng.auburn.edu/department/ee/mgc/vhdl.html#textio

These procedures can be used to read test vectors from a file and write results to a file.
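As a minimal sketch of reading test vectors from a file, the hypothetical testbench below assumes a text file named vectors.txt in the simulator's working directory, with two bits per line (for example "0 1"). The file name, file format, and entity name are all assumptions made for illustration. Values are read as type bit, which std.textio supports directly, and then converted to std_logic.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use std.textio.all;

entity textio_demo is
end textio_demo;

architecture main of textio_demo is
  signal ta, tb : std_logic := '0';
begin
  stimulus : process
    -- "vectors.txt" is a hypothetical file: one vector per line, e.g. "0 1"
    file     vec_file : text open read_mode is "vectors.txt";
    variable ln       : line;
    variable va, vb   : bit;
  begin
    while not endfile(vec_file) loop
      readline(vec_file, ln);   -- fetch one line of text
      read(ln, va);             -- first bit on the line
      read(ln, vb);             -- second bit on the line
      ta <= to_stdulogic(va);
      tb <= to_stdulogic(vb);
      wait for 10 ns;
    end loop;
    wait;  -- no more vectors: suspend forever
  end process;
end main;
```

Keeping the vectors in a file lets the test suite grow without recompiling the testbench.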

4.5.7 Queue Specification

Most bugs in queues are related to the queue becoming full, becoming empty, and/or the wrapping of indices.

The specification should be obviously correct. Avoid bugs in the specification by making the specification queue larger than the maximum number of writes that we will do in the test suite. Thus, the specification queue will never become full or wrap. However, the implementation queue will become full and wrap.

Write Index Update in Specification

We increment the write index on every write; we never wrap.

process (clk) begin
  if rising_edge(clk) then
    if (reset = '1') then
      wr_idx <= 0;
    elsif (do_wr = '1') then
      wr_idx <= wr_idx + 1;
    end if;
  end if;
end process;

Things to Notice

Things to notice in queue specification:
1. don't-care conditions ('-')
2. uninitialized data (hint: what is the value of rd_data when we do more reads than writes?)

Don't Care

rd_data <= data_array(rd_idx) when (do_rd = '1')
           else (others => '-');

4.5.8 Queue Testbench

Things to notice in queue testbench:
1. running multiple test sequences
2. uninitialized data 'U'
3. std_match to compare spec and impl data

[Table, partially recoverable: values that std_match treats as matching. '0' matches '0' and 'L'; '1' matches '1' and 'H'; '-' matches everything.]

With equality, '-' /= '1', but we want to use '-' to mean "don't care" in the specification. The solution is to use std_match, rather than =, to check implementation signals against the specification.
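The difference between = and std_match can be seen in a small self-checking sketch. The entity name and the two constant vectors are hypothetical. std_match is taken from ieee.numeric_std here; newer language revisions also provide it in std_logic_1164.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;  -- provides std_match

entity match_demo is
end match_demo;

architecture main of match_demo is
begin
  process
    -- hypothetical values: the spec does not care about the middle two bits
    constant spec_data : std_logic_vector(3 downto 0) := "1--0";
    constant impl_data : std_logic_vector(3 downto 0) := "1010";
  begin
    -- plain equality compares '-' literally, so the vectors differ
    assert spec_data /= impl_data
      report "unexpected: = treated '-' as a wildcard" severity error;
    -- std_match treats '-' as matching any value
    assert std_match(spec_data, impl_data)
      report "unexpected: std_match rejected a don't-care" severity error;
    wait;
  end process;
end main;
```

Neither assertion should fire when this is simulated, since "1--0" and "1010" are unequal under = but match under std_match.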

Stimulus Process Structure

The stimulus process runs multiple test vectors in a single simulation run.

stimulus : process
  type test_datum_ty is record
    r_reset, ... normal fields ...
  end record;
  type test_vectors_ty is array(natural range <>) of test_datum_ty;
  constant test_vectors : test_vectors_ty :=
    ( -- reset  ... other signal ...
      ( '1', normal fields),  -- test case 1
      ( '0', normal fields),
      ...
      ( '1', normal fields),  -- test case 2
      ( '0', normal fields),
      ...
    );
begin
  for i in test_vectors'range loop
    if (test_vectors(i).r_reset = '1') then
      ... reset code ...
    end if;
    reset <= '0';
    ... normal sequence ...
    wait until rising_edge(clk);
  end loop;
end process;

4.6 Example: Microwave Oven

This question concerns the VHDL code microwave, which controls a simple microwave oven; the properties prop1...prop3; and two proposed changes to the VHDL code.

INSTRUCTIONS:
1. Assume that the code as currently written is correct; any change to the code that causes a change to the behaviour of the signals heat or count is a bug.
2. For each of the two proposed code changes, answer whether the code change will cause a bug.
3. If the code change will cause a bug, provide a test case that will exercise the bug and identify all of the given properties (prop1, prop2, and prop3) that will detect the bug with the test case you provide.
4. If none of the three properties can detect the bug, provide a property of your own that will detect the bug with the test case you provide.

Question: For each of the three properties prop1...prop3, answer whether the property is best checked as part of a testbench or an assertion. For each property, justify why a testbench or an assertion is the best method to validate that property.

prop1: If start is pushed and the door is closed, then heat remains on for exactly the time specified by the timer when start was pushed, assuming reset remains false and the door remains closed.
prop2: If the door is open, then heat is off.
prop3: If start is not pushed, reset is false, and count is greater than zero, then count is decremented.

Implementation

entity microwave is
  port (
    timer             -- time input from user
      : in unsigned(7 downto 0);
    reset,            -- resets microwave
    clk,              -- clock signal input
    is_open,          -- detects when door is open
    start             -- start button input from user
      : in std_logic;
    heat : out std_logic  -- 1=on, 0=off
  );
end microwave;

architecture main of microwave is
  signal count  : unsigned(7 downto 0);  -- internal time count
  signal x_heat : std_logic;
begin

  -- heat process ------------------------------
  process (clk) begin
    if rising_edge(clk) then
      if reset = '1' then
        x_heat <= '0';
      -- region of change #1 -----
      elsif (is_open = '0') and (start = '1') and (timer > 0) then
        x_heat <= '1';
      elsif (is_open = '0') and (count > 0) then
        x_heat <= x_heat;
      -- -------------------------
      else
        x_heat <= '0';
      end if;
    end if;
  end process;

  -- count process ------------------------------
  process (clk) begin
    if rising_edge(clk) then
      if (reset = '1') then
        count <= to_unsigned(0, 8);
      -- region of change #2 ---
      elsif (start = '1') then
        count <= timer;
      elsif (count > 0) then
        count <= count - 1;
      -- -----------------------
      end if;
    end if;
  end process;

  heat <= x_heat;
end main;

Properties

prop1: If start is pushed and the door is closed, then heat remains on for exactly the time specified by the timer when start was pushed, assuming reset remains false and the door remains closed.
prop2: If the door is open, then heat is off.
prop3: If start is not pushed, reset is false, and count is greater than zero, then count is decremented.

Change #1

From:
  elsif (start = '1') then
    count <= timer;
  elsif (count > 0) then
    count <= count - 1;

To:
  elsif (count > 0) then
    count <= count - 1;
  elsif (start = '1') then
    count <= timer;

Change #2

From:
  elsif (is_open = '0') and (start = '1') and (timer > 0) then
    x_heat <= '1';
  elsif (is_open = '0') and (count > 0) then
    x_heat <= x_heat;

To:
  elsif (is_open = '0') and ((start = '1') or (count > 0)) then
    x_heat <= '1';
  else
    x_heat <= '0';

Coverage

Question: If msb of src1 is 1 and lsb of src2 is 0 or sum(3) is 1, then result is wrong. What is the minimum coverage needed to detect the bug? What is the minimum coverage needed to guarantee that the bug will be detected?

Chapter 5 Timing Analysis

5.1 Delays and Definitions

In this section we will look at the different timing parameters of circuits. Our focus will be on those parameters that limit the maximum clock speed at which a circuit will work correctly.

5.1.1 Background Definitions

This section reserved for your reading pleasure

5.1.2 Clock-Related Timing Definitions

5.1.2.1 Clock Skew


[Figure: clock signals clk1, clk2, clk3, and clk4, with the skew between them marked.]

Definition (Clock Skew): The difference in arrival times for the same clock edge at different flip-flops.

Clock skew is caused by the difference in interconnect delays to different points on the chip.

Clock Tree Design

Clock tree design is critical in high-performance designs to minimize clock skew. Sophisticated synthesis tools put lots of effort into clock tree design, and the techniques for clock tree design still generate PhD theses.

5.1.2.2 Clock Latency

[Figure: master clock, intermediate clock, and final clock waveforms, with the latency between them marked.]

Definition (Clock Latency): The difference in arrival times for the same clock edge at different levels of interconnect along the clock tree. (Intuitively: different points in the clock generation circuitry.)

Note: Clock latency does not affect the limit on the minimum clock period.

5.1.2.3 Clock Jitter

[Figure: an ideal clock and a clock with jitter, with the jitter marked.]

Definition (Clock Jitter): Difference between actual clock period and ideal clock period.

Causes of Clock Jitter

Clock jitter is caused by:
  temperature and voltage variations over time
  temperature and voltage variations across different locations on a chip
  manufacturing variations between different parts

5.1.3 Storage-Related Timing Definitions

5.1.3.1 Flops and Latches

[Figure: waveforms of clk, d, and q illustrating flop behaviour and latch behaviour.]

Storage devices have two modes: load mode and store mode.

Flops are edge sensitive; they are in load mode just before the clock edge.

Latches are level sensitive; they are in load mode while their enable signal is asserted high (low for active-low latches).

Timing Parameters

[Figure: waveforms of d, clk, and q showing the Setup, Hold, and Clock-to-Q intervals for a flip-flop, an active-high latch, and an active-low latch.]

Setup and hold define the window in which input data are required to be constant in order to guarantee that the storage device will store data correctly.

Clock-to-Q defines the delay from the clock edge to when the output is guaranteed to be stable.

5.1.4 Propagation Delays

Propagation delay: the time it takes a signal to travel from the source (driving) flop to the destination flop.

  propagation delay = load delay + interconnect delay

Load delay: combinational gates between the flops.
Interconnect delay: wires between gates and flops.

5.1.5 Timing Constraints

411

5.1.5 Timing Constraints

5.1.5.1 Minimum Clock Period

[Figure: circuit with signals a, clk1, clk2, and b, and waveforms with the clock period marked. Legend: signal is stable, signal may change, signal may rise, signal may fall.]

ClockPeriod >
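The right-hand side of the inequality is not filled in here. As a minimal sketch, assuming the usual textbook form of the constraints (clock period at least clock-to-Q plus maximum propagation delay plus setup; hold no greater than clock-to-Q plus minimum propagation delay) and ignoring skew and jitter, the checks could be written as follows. The function names and all numeric values are hypothetical.

```python
# Sketch only: assumes the common textbook constraints
#   clock period >= clock-to-Q + max propagation delay + setup
#   hold         <= clock-to-Q + min propagation delay
# with no clock skew or jitter. All numbers are made-up illustrative values.

def min_clock_period(clock_to_q, max_prop_delay, setup):
    """Smallest clock period that still meets setup at the destination flop."""
    return clock_to_q + max_prop_delay + setup


def hold_is_met(clock_to_q, min_prop_delay, hold):
    """True if new data cannot reach the destination flop before hold ends."""
    return clock_to_q + min_prop_delay >= hold


if __name__ == "__main__":
    print(min_clock_period(clock_to_q=2, max_prop_delay=10, setup=3))
    print(hold_is_met(clock_to_q=2, min_prop_delay=1, hold=2))
```

Note that the setup check uses the slowest path between the flops, while the hold check uses the fastest one.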


5.1.5.2 Hold Constraint

5.1.5.3 Example Timing Violations

Good Timing


[Figure: good timing. Circuit with signals a, clk, b, c, and d; waveforms with the clock-to-Q, propagation, setup, and hold intervals marked.]

Setup Violation
[Figure: setup violation. Waveforms of a, clk, b, c, and d with the clock-to-Q, propagation, and setup intervals marked; the affected signals are shown as unknown (???).]

Hold Violation
[Figure: hold violation. Waveforms of a, clk, b, c, and d with the clock-to-Q, propagation, and hold intervals marked; the affected signal is shown as unknown (???).]

5.2 Timing Analysis of Latches and Flip Flops


In this section, we show how to find the clock-to-Q, setup, and hold times for latches, flip-flops, and other storage elements.

5.2.1 Simple Multiplexer Latch

5.2.1.1 Structure and Behaviour of Multiplexer Latch


[Figure: multiplexer latch with inputs clk and i and output o, shown in loading / pass-through mode and in storage mode.]

Unfold Multiplexer to Simple Gates


[Figure: multiplexer symbol and implementation (inputs a and b, select s, output o).]

[Figure: latch implementation (inputs clk and i, output o).]

Latch Glitching
[Figure: latch with inputs d and clk and output o.]

Note (inverters on clk): Both of the inverters on the clk signal are needed. Together, they prevent a glitch on the OR gate when clk is deasserted. If there were only one inverter, a glitch would occur. For more on this, see section 5.2.1.6.

Loading and Storing Values


[Figure: four annotated copies of the latch: loading 0 (d=0, clk=1), loading 1 (d=1, clk=1), storing 0 (clk=0, o=0), and storing 1.]

5.2.1.2 Strategy for Timing Analysis of Storage Devices


The key to calculating setup and hold times of a latch, flop, etc. is to identify:
1. how the data is stored when not connected to the input (often a pair of inverters in a loop)
2. the gate(s) that the clock uses to cause the stored data to drive the output (often a transmission gate or multiplexer)
3. the gate(s) that the clock uses to cause the input to drive the output (often a transmission gate or multiplexer)

5.2.1.3 Clock-to-Q Time of a Multiplexer Latch

[Figure: sequence of latch schematics with signals d, clk, cn, c2, l1, l2, s1, s2, qn, and q, stepping through the clock-to-Q behaviour.]

5.2.1.4 Setup Timing of a Multiplexer Latch

[Figure: sequence of annotated latch schematics, one per time step.]

Circuit is stable in load mode.
t=0: Clk transitions from load to store.
t=1: Clk transitions from load to store.
t=2: s1 propagates to s2, because cn turns on AND gate.
t=3: l2 is set to 0, because c2 turns off AND gate.
t=4: Value from store path propagates to q.
t=5: Value from store path completes cycle.

Setup Violation
[Figure: sequence of annotated latch schematics, one per time step, followed by a timing diagram.]

Circuit is stable in load mode with the old value.
t=-1: D transitions from the old value to the new value.
t=0: New value propagates through inverter. Clk transitions from load to store.
t=1: New value propagates through AND. Clk propagates through inverter. Trouble: inconsistent values on load path and store path. Old value still in store path when store path is enabled.
t=2: Old value propagates through AND.
t=3: l2 is set to 0, because c2 turns off AND gate.
t=4: Old/new value from store path propagates to q.
t=5: Old/new value from store path completes cycle.
t=5: Illustrate instability with old value = 0, new value = 1.

[Timing diagram: setup with negative margin. Time axis from -3 to 6; signals d, l1, l2, qn, q, s1, s2, clk, cn, c2.]

We now repeat the analysis of setup violation, but illustrate the minimum violation (input transitions from the old value to the new value 3 time-units before the clock edge).

[Figure: sequence of annotated latch schematics, one per time step, followed by a timing diagram.]

Circuit is stable in load mode with the old value.
t=-3: D transitions from the old value to the new value.
t=-2: New value propagates through inverter.
t=-1: New value propagates through AND.
t=0: Clk transitions from load to store.
t=1: Clk propagates through inverter. Trouble: inconsistent values on load path and store path. Old value still in store path when store path is enabled.
t=2: Old value propagates through AND.
t=3: l2 is set to 0, because c2 turns off AND gate.
t=4: Old/new value from store path propagates to q.
t=5: Old/new value from store path completes cycle.
t=5: Illustrate instability with old value = 0, new value = 1.

[Timing diagram: setup with negative margin. Time axis from -3 to 6; signals d, l1, l2, qn, q, s1, s2, clk, cn, c2.]

Minimum Setup Time


[Figure: latch schematic and timing diagram of d, l1, l2, qn, q, s1, s2, clk, cn, and c2, with the setup time marked.]

5.2.1.5 Hold Time of a Multiplexer Latch

[Figure: latch schematic with signals d, clk, cn, c2, l1, l2, s1, s2, qn, and q.]

Hold Time Behaviour


[Figure: sequence of latch schematics with signals d, clk, cn, c2, l1, l2, s1, s2, qn, and q, stepping through the hold-time behaviour.]

5.2.1.6 Example of a Bad Latch

[Figure: latch schematic and timing diagram of d, l1, l2, qn, q, s1, s2, clk, c2, and cn.]

5.3 Critical Paths and False Paths

5.3.1 Introduction to Critical and False Paths

Definition critical path: The slowest path on the chip between flops, or between flops and pins. The critical path limits the maximum clock speed.

Definition false path: A path along which an edge cannot travel from beginning to end.

Outline
The algorithm that we present comes from McGeer and Brayton in a DAC 198? paper. The algorithm to find the critical path through a circuit is presented in several parts.
1. Section 5.3.2: Find the longest path ignoring the possibility of false paths.
2. Section 5.3.3: Almost-correct algorithm to test whether a candidate critical path is a false path.
3. Section 5.3.4: If a candidate path is a false path, then find the next candidate path, and repeat the false-path detection algorithm.
4. Section 5.3.5: Correct, complete, and complex algorithm to find the critical path in a circuit.

Notes
Note: The analysis of critical paths and false paths assumes that all inputs change values at exactly the same time. Timing differences between inputs are modelled by the skew parameter in timing analysis.

Throughout our discussion of critical paths, we will use the delay values for gates shown in the table below.

gate   delay
NOT    2
AND    4
OR     4
XOR    6

5.3.1.1 Example of Critical Path in Full Adder

Question: Find the critical path through the full-adder circuit shown below.

[Figure: full-adder circuit with inputs ci, a, and b, internal signals i, j, and k, and outputs co and s.]

Alternative Excitation
Question: Do the input values of ci=0, a=, b=1 exercise the critical path?
[Figure: the full-adder circuit with inputs ci, a, and b, internal signals i, j, and k, and outputs co and s.]

5.3.1.2 Preliminaries for Critical Paths

5.3.1.3 Longest Path and Critical Path

The longest path through the circuit might not be the critical path, because the behaviour of the gates might prevent an edge (0→1 or 1→0) from travelling along the path.

Example False Path


Question: Determine whether the longest path in the circuit below is a false path.

[Figure: circuit with inputs a and b and output y, redrawn for four cases: a = 0 with b = 0→1; a = 0 with b = 1→0; a = 1 with b = 0→1; a = 1 with b = 1→0.]

Question: How can we determine analytically that this is a false path?

Preview of Complete Example


Question: Find the critical path through the circuit below.

[Figure: circuit with signals a, b, c, d, e, f, and g.]

5.3.2 Longest Path

Outline of Algorithm to Find Longest Path

The basic idea is to annotate each signal with the maximum delay from it to an output.
- Start at destination signals and traverse through fanin to source signals.
- Destination signals have a delay of 0.
- At each gate, annotate the inputs by the delay through the gate plus the delay of the output.
- When a signal fans out to multiple gates, annotate the output of the source (driving) gate with the maximum delay of the destination signals.
- The primary input signal with the maximum delay is the start of the longest path.
- The delay annotation of this signal is the delay of the longest path.
- The longest path is found by working from the source signal to the destination signals, picking the fanout signal with the maximum delay at each step.
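The annotation procedure outlined above can be sketched in a few lines. This is an illustrative sketch, not the course's reference implementation: the gate delays are those from the table in these notes, but the full-adder wiring in `CIRCUIT` and all function names are assumptions made for the example.

```python
# Sketch of the longest-path annotation. Gate delays come from the notes;
# the full-adder wiring below is an assumption for illustration.
GATE_DELAY = {"NOT": 2, "AND": 4, "OR": 4, "XOR": 6}

# gate output signal -> (gate type, input signals)
CIRCUIT = {
    "i": ("XOR", ["a", "b"]),
    "j": ("AND", ["a", "b"]),
    "k": ("AND", ["i", "ci"]),
    "s": ("XOR", ["i", "ci"]),
    "co": ("OR", ["j", "k"]),
}


def fanout(circuit, signal):
    """Outputs of the gates that `signal` drives."""
    return [out for out, (_, ins) in circuit.items() if signal in ins]


def step_delay(circuit, out, memo):
    """Delay through the gate driving `out` plus the delay from `out` onward."""
    return GATE_DELAY[circuit[out][0]] + delay_to_output(circuit, out, memo)


def delay_to_output(circuit, signal, memo):
    """Maximum delay from `signal` to a destination signal (0 at a destination)."""
    if signal not in memo:
        memo[signal] = max(
            (step_delay(circuit, out, memo) for out in fanout(circuit, signal)),
            default=0,
        )
    return memo[signal]


def longest_path(circuit):
    """Return (list of signals on the longest path, delay of that path)."""
    memo = {}
    signals = set(circuit) | {s for _, ins in circuit.values() for s in ins}
    primary_inputs = sorted(s for s in signals if s not in circuit)
    start = max(primary_inputs, key=lambda s: delay_to_output(circuit, s, memo))
    path = [start]
    while fanout(circuit, path[-1]):
        # At each step pick the fanout signal with the maximum remaining delay.
        path.append(max(fanout(circuit, path[-1]),
                        key=lambda out: step_delay(circuit, out, memo)))
    return path, memo[start]


if __name__ == "__main__":
    print(longest_path(CIRCUIT))
```

Ties between equally long paths (here, starting from a or from b) are broken by input name; the sketch ignores false paths entirely.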

5.3.3 Detecting a False Path

5.3.3.1 Preliminaries

The controlling value of a gate is the value such that if one of the inputs has this value, the output can be determined independently of the other inputs. The controlled output value is the value produced by the controlling input value.

Gate   Controlling Value   Controlled Output
AND
OR
NAND
NOR
XOR
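One way to check an entry in this table is to derive it by brute force from the gate's truth table. The sketch below does that for two-input gates; the `GATES` dictionary and the function name are hypothetical helpers introduced for the illustration.

```python
# Sketch: derive controlling values of two-input gates from their truth tables.
GATES = {
    "AND": lambda x, y: x & y,
    "OR": lambda x, y: x | y,
    "NAND": lambda x, y: 1 - (x & y),
    "NOR": lambda x, y: 1 - (x | y),
    "XOR": lambda x, y: x ^ y,
}


def controlling_value(gate):
    """Return (controlling input value, controlled output), or None.

    A value v is controlling if placing v on either input fixes the output
    regardless of the value on the other input.
    """
    f = GATES[gate]
    for v in (0, 1):
        outputs = {f(v, other) for other in (0, 1)}
        outputs |= {f(other, v) for other in (0, 1)}
        if len(outputs) == 1:
            return v, outputs.pop()
    return None


if __name__ == "__main__":
    for name in GATES:
        print(name, controlling_value(name))
```

A result of `None` means the gate has no controlling value, so an edge on one input always reaches the output whatever stable value the other input holds.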


Path Input, Side Input


Definition path input: For a gate on a path (either a candidate critical path, or a real critical path), the path input is the input signal that is on the path.

Definition side input: For a gate on a path (either a candidate critical path, or a real critical path), the side inputs are the input signals that are not on the path.

Reconvergent Fanout
Definition reconvergent fanout: There are paths from signals in the fanout of a gate that reconverge at another gate.

[Figure: circuit with signals a through h and outputs y and z, illustrating reconvergent fanout.]

If a candidate path has reconvergent fanout, then the rising or falling edge on the input to the path might cause a side input along the path to have a rising or falling edge, rather than a stable 0 or 1.

Rules for Propagating an Edge Along a Path


[Figure: rules for propagating an edge through NOT, AND, OR, and XOR gates, with the required side-input values marked.]

Missing Rules?
Question: Why do the rules not have falling edges for AND gates or rising edges for OR gates on the side input?
a b a c b c


Viability Condition of a Path


Definition Viability condition: For a path (p) through a circuit, the viability condition is a Boolean expression in terms of the input signals that defines the cases where an edge will propagate along the path.

Based upon the rules for propagating an edge that we have seen so far, the viability condition for a path is: every side input has a non-controlling value. As always, section 5.3.5 has the complete viability condition.

5.3.3.2 Almost-Correct Algorithm to Detect a False Path


1. Annotate each side input along the path with its non-controlling value. These annotations are the constraints that must be satisfied for the candidate path to be exercised.
2. Propagate the constraints backward from the side inputs of the path to the inputs of the circuit under consideration.
3. If there is a contradiction amongst the constraints, then the candidate path is a false path.
4. If there is no contradiction, then the constraints on the inputs give the conditions under which an edge will traverse along the candidate path from input to output.
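The four steps above can be approximated with a brute-force check. Instead of propagating constraints backward symbolically, the sketch below enumerates every assignment of the primary inputs and keeps those under which all side inputs on the path take their non-controlling values; an empty result corresponds to a contradiction. The circuit, signal names, and function names are hypothetical, and only NOT, AND, and OR gates are handled.

```python
from itertools import product

# Sketch: brute-force version of the almost-correct false-path test.
GATE_FN = {
    "NOT": lambda v: 1 - v[0],
    "AND": lambda v: min(v),
    "OR": lambda v: max(v),
}
NON_CONTROLLING = {"AND": 1, "OR": 0}  # a NOT gate has no side input

# gate output -> (gate type, inputs), listed in topological order (assumed circuit)
CIRCUIT = {
    "n": ("NOT", ["s"]),
    "p": ("AND", ["a", "s"]),
    "q": ("OR", ["p", "n"]),
    "y": ("AND", ["q", "n"]),
}


def evaluate(circuit, inputs):
    """Compute every signal value from the primary-input values."""
    values = dict(inputs)
    for out, (kind, ins) in circuit.items():
        values[out] = GATE_FN[kind]([values[i] for i in ins])
    return values


def side_constraints(circuit, path):
    """Step 1: (side input, required non-controlling value) for each gate on the path."""
    constraints = []
    for path_input, out in zip(path, path[1:]):
        kind, ins = circuit[out]
        for side in ins:
            if side != path_input:
                constraints.append((side, NON_CONTROLLING[kind]))
    return constraints


def viable_assignments(circuit, primary_inputs, path):
    """Input assignments satisfying all side-input constraints ([] means false path)."""
    constraints = side_constraints(circuit, path)
    result = []
    for bits in product((0, 1), repeat=len(primary_inputs)):
        assignment = dict(zip(primary_inputs, bits))
        values = evaluate(circuit, assignment)
        if all(values[sig] == want for sig, want in constraints):
            result.append(assignment)
    return result


if __name__ == "__main__":
    print(viable_assignments(CIRCUIT, ["a", "s"], ["a", "p", "q", "y"]))
    print(viable_assignments(CIRCUIT, ["a", "s"], ["s", "n", "q", "y"]))
```

Enumeration is exponential in the number of inputs, so it is only suitable for small examples like the ones in these notes.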

5.3.3.3 Examples of Detecting False Paths

False-Path Example 1
Question: Determine if the longest path in the circuit below is a false path.

[Figure: circuit annotated with the delay from each signal to an output: a 16, b 12, c 10, d 14, e 8, f 12, g 8, h 4, i 4, j 0, k 0.]

side input   non-controlling value   constraint

5.3.4 Finding the Next Candidate Path

If the longest path is a false path, we need to find the next longest path in the circuit, which will be our next candidate critical path. If this candidate fails, we continue to find the next longest of the remaining paths, ad infinitum.

5.3.4.1 Algorithm to Find Next Candidate Path

1. Initialize path table with primary inputs, their potential delay, and fanout.
2. Sort path table by potential delay.
3. If the partial path with the max delay has just one unused fanout signal, then extend the partial path with this signal. Otherwise:
   (a) Extend path through unused fanout with max delay.
   (b) Delete this fanout signal from the list of unused fanout signals.
4. Compute the constraint that the side input has a non-controlling value.
5. If the new constraint does not cause a contradiction, then return to step 3. Otherwise:
   (a) Mark this partial path as false.
   (b) For each partial path that is a prefix of the false path: recalculate potential delay of path.
   (c) Return to step 2.
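The path-table bookkeeping above amounts to a best-first search over partial paths ordered by potential delay. The sketch below shows only that ordering idea, using a priority queue in place of the sorted table and omitting the false-path test of steps 4 and 5. It is a simplified variant, not the algorithm as stated: the full-adder wiring and all names are assumptions for illustration.

```python
import heapq

# Sketch: enumerate complete paths in order of decreasing delay.
GATE_DELAY = {"NOT": 2, "AND": 4, "OR": 4, "XOR": 6}
CIRCUIT = {  # gate output -> (gate type, inputs); assumed full-adder wiring
    "i": ("XOR", ["a", "b"]),
    "j": ("AND", ["a", "b"]),
    "k": ("AND", ["i", "ci"]),
    "s": ("XOR", ["i", "ci"]),
    "co": ("OR", ["j", "k"]),
}


def fanout(signal):
    """Outputs of the gates that `signal` drives."""
    return [out for out, (_, ins) in CIRCUIT.items() if signal in ins]


def delay_to_output(signal, memo):
    """Maximum delay from `signal` to a destination signal."""
    if signal not in memo:
        memo[signal] = max(
            (GATE_DELAY[CIRCUIT[out][0]] + delay_to_output(out, memo)
             for out in fanout(signal)),
            default=0,
        )
    return memo[signal]


def paths_by_delay(primary_inputs):
    """Yield (delay, path) for every complete path, longest first."""
    memo = {}
    # Heap entries: (-potential delay, delay so far, partial path).
    heap = [(-delay_to_output(s, memo), 0, [s]) for s in primary_inputs]
    heapq.heapify(heap)
    while heap:
        neg_potential, so_far, path = heapq.heappop(heap)
        outs = fanout(path[-1])
        if not outs:  # reached a destination signal: the path is complete
            yield -neg_potential, path
            continue
        for out in outs:
            d = so_far + GATE_DELAY[CIRCUIT[out][0]]
            potential = d + delay_to_output(out, memo)
            heapq.heappush(heap, (-potential, d, path + [out]))


if __name__ == "__main__":
    for delay, path in paths_by_delay(["a", "b", "ci"]):
        print(delay, path)
```

Because the potential delay of a partial path is exactly the delay of its longest completion, the first complete path produced is the longest path, the second is the next candidate, and so on.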


5.3.4.2 Examples of Finding Next Candidate Path

Next-Path Example 1

Question: Starting from the initial delay calculation and longest path, find the next candidate path and test if it is a false path.

[Figure: circuit annotated with the delay from each signal to an output: a 16, b 12, c 10, d 14, e 8, f 12, g 8, h 4, i 4, j 0, k 0.]

potential delay   unused fanout   path
10                e               c
12                h, g            b
16                d               a

side input   non-controlling value   constraint

5.3.5 Correct Algorithm to Find Critical Path

We now remove the assumption that side inputs always arrive earlier than path inputs.

5.3.5.1 Rules for Late Side Inputs


side=non-ctrl path=non-ctrl side=CTRL path=CTRL side=CTRL path=non-ctrl

side=non-ctrl path=CTRL Early Side


path input causes glitch

path input propagates

side input propagates

neither input propagates

Late Side
monotone speedup monotone speedup path input propagates side input causes glitch

The complete and correct rule: a path input excites the gate if the side-input is non-controlling or the side-input arrives late and the path input is controlling.
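Written as a Boolean function, the rule reads as follows. This is a direct transcription of the sentence above; the function and parameter names are hypothetical.

```python
from itertools import product


def path_input_excites(side_is_controlling, side_is_late, path_is_controlling):
    """The path input excites the gate if the side input is non-controlling,
    or the side input arrives late and the path input is controlling."""
    return (not side_is_controlling) or (side_is_late and path_is_controlling)


if __name__ == "__main__":
    # Print the rule for every combination of the three conditions.
    for side_ctrl, side_late, path_ctrl in product((False, True), repeat=3):
        print(side_ctrl, side_late, path_ctrl,
              path_input_excites(side_ctrl, side_late, path_ctrl))
```

With `side_is_late` fixed to `False`, the function reduces to the earlier viability condition: every side input has a non-controlling value.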

5.3.5.2 Monotone Speedup

Definition monotonic: A function (f) is monotonic if increasing its input causes the output to increase or remain the same. Mathematically: $x < y \implies f(x) \le f(y)$.

Definition monotonous: A lecture is monotonous if increasing the length of the lecture increases the number of people who are asleep.

Definition monotone speedup: The maximum clock speed of a circuit should be monotonic with respect to the speed of any gate or sub-circuit. That is, if we increase the speed of part of the circuit, we should either increase the clock speed of the circuit, or leave it unchanged.

5.3.5.3 Analysis of Side-Input-Causes-Glitch Situation

5.3.5.4 Complete Algorithm

If we find a contradiction on the path, check for side inputs that are on previously discovered false paths. If a gate and its side input are on a previously discovered false path, then the side input defines a prefix of a false path that is a late-arriving side input. For each late-arriving prefix, compute its viability (the conditions under which an edge will propagate along the prefix to the late side input). To the row of the late-arriving side input in the constraint table, add as a disjunction the constraint that the path input has a controlling value and at least one of the prefixes is viable.

5.3.5.5 Complete Examples

Complete Example 1

Question: Find the critical path in the circuit below.

[Figure: circuit with signals a, b, c, d, e, and g.]

potential delay   unused fanout   path
false                             a,b,d,e,f,g
10                g, c            a
10                                a,c,f,g

side input   non-controlling value   constraint
f[e]         1                       a
g[a]         1                       a

Complete Example 2
Question: Find the critical path in the circuit below.
[Figure: circuit with signals a through j, annotated with delays to the output.]

potential delay   unused fanout   path
false                             b,d,e,g,h,i,j
8                 f               a
12                h               c
14                f, g            b,d,e
14                                b,d,e,g,i,j

side input   non-ctrl value   constraint
h[c]         0                c
i[h]         0                cb
j[f]         0                ab

Complete Example 3
Monotone speedup

Critical path: a, c, e, f
Late side input: e[d]
Total delay: 10
Excitation: a = rising edge

[Figure: circuit with signals a, b, c, d, e, and f, annotated with arrival times for rising edge excitation, falling edge excitation, and fast timing.]

Complete Example 4
Late side inputs sometimes must have an edge. Find the second-longest path with contradiction using early sides.

[Figure: circuit with signals a through k, shown twice with arrival-time annotations on each signal.]

Complete Example 5
Late side paths must be viable.

Question: Find the critical path in the circuit below.

[Figure: circuit with signals a through k.]

5.3.6 Further Extensions to Critical Path Analysis

McGeer and Brayton's paper includes extensions to the critical path algorithm presented here that we will not cover:
- gates with more than two inputs
- finding all input values that will exercise the critical path
- multiple paths with the same delay to the same gate

5.3.7 Increasing the Accuracy of Critical Path Analysis

When doing critical path calculations, it is often useful to strike a balance between accuracy and effort. In the examples so far, we assumed that all signals had the same wire and load delays. This assumption simplifies calculations, but reduces accuracy. Section 5.4 discusses how the analog world affects timing analysis.

5.4 Elmore Timing Model

5.4.1 RC-Networks for Timing Analysis

[Figure: a P-transistor and an N-transistor, each shown at mask level, as a cross-section of the fabricated transistor, at transistor level, and at switch level, with source, gate, drain, poly, contact, diffusion, and substrate labelled.]

Different Levels of Abstraction for Inverter


[Figure: inverter with input a and output b at gate level, transistor level, and mask level.]

RC-Network models of P- and N-transistors

[Figure: P-transistor modelled with Rpu and Cp; N-transistor modelled with Rpd and Cp.]

RC-Network for Timing Analysis

[Figure: inverter RC network between VDD and GND with Rpu, Rpd, Cp, and CL.]

A Pair of Inverters
[Figure: a pair of inverters at gate level, transistor level, and mask level.]

RC-Network for Timing Analysis

[Figure: RC network for the pair of inverters (signals a, b, c) with Rpu, Rpd, Cp, CL, RW, CW, and RV.]

RC-Network for Timing Analysis (trimmed)

[Figure: trimmed network with VDD, Rpu, Cp, Rpd, RW, CW, RV, and CL.]

A Circuit with Fanout


[Figure: the fanout circuit (signals a, b, c, d) at gate level, at gate level with physical layout, at transistor level, and at mask level.]

RC-Network for Timing Analysis

[Figure: RC network for the fanout circuit with Rpu, Rpd, Cp, CL, RV, and wire segments RW1/CW1, RW2/CW2, and RW3/CW3.]

RC-Network for Timing Analysis (trimmed)

[Figure: trimmed network with Rpu, Cp, Rpd, RW1, CW1, RW2, CW2, RV, and CL.]

RC-Network for Timing Analysis (cleaned up)

[Figure: cleaned-up version of the trimmed network.]

5.4.2 Derivation of Analog Timing Model

Real Waveforms

[Figure: input voltage and output voltage versus time for a slow input and for a fast input.]

Steps Toward Approximation


We begin with two simplifications as steps toward calculating a single delay value for a circuit.
1. Look at the circuit's response to a step-function input.
2. Measure the delay to go from GND to 65% of VDD and from VDD to 35% of VDD.

Definition Trip Points: A high or 1 trip point is the voltage level where an upwards transition means the signal represents a 1. A low or 0 trip point is the voltage level where a downwards transition means the signal represents a 0.

- The source (VDD in our case) and each capacitor is a node.
- We number the nodes, capacitors, and resistors.
- Resistors are numbered according to the capacitor to their right.
- Multiple resistors in series without an intervening capacitor are lumped into a single resistor.
- All nodes except the source start at GND.
- We calculate the voltage at a node when we turn on the P-transistor (connect to VDD).

The process for analyzing a transition from VDD to GND on a node is the dual of the process just described. The source node is GND, all other nodes start at VDD, and we calculate the voltage when we turn on the N-transistor (connect it to GND).

[Figure: Node Numbering, Initial Conditions. The RC network with source node 0 (VDD), nodes 1 to 5, and resistors R1 to R5.]

Define: Path and Downstream

Definition path: The path from the source node to a node i is the set of all resistors between the source and i. Example: path(3) = {R1, R2, R3}.

Definition down: The set of capacitors downstream from a node is the set of all capacitors where current would flow through the node to charge the capacitor. You can think of this as the set of capacitors that are between the node and ground. Example: down(2) = {C2, C3, C4, C5}. Example: down(3) = {C3, C4}.
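Both definitions can be computed from a parent map of the RC tree. In the sketch below the tree shape (node 5 hanging off node 2, node 4 off node 3) is inferred from the examples path(3), down(2), and down(3); the `PARENT` map and the function names are assumptions made for illustration.

```python
# Sketch: path() and down() on the five-node RC tree.
# PARENT[k] is the node on the source side of resistor Rk; node 0 is the source.
# The tree shape is inferred from the examples in the notes.
PARENT = {1: 0, 2: 1, 3: 2, 4: 3, 5: 2}


def path(i):
    """Resistors between the source (node 0) and node i."""
    resistors = set()
    while i != 0:
        resistors.add(f"R{i}")
        i = PARENT[i]
    return resistors


def down(i):
    """Capacitors that are charged by current flowing through node i."""
    return {f"C{k}" for k in PARENT if f"R{i}" in path(k)}


if __name__ == "__main__":
    print(sorted(path(3)))
    print(sorted(down(2)))
    print(sorted(down(3)))
```

A capacitor Ck is downstream of node i exactly when resistor Ri lies on the path to node k, which is why `down` can be written in terms of `path`.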

5.4.2.1 Example Derivation: Equation for Voltage at Node 3

\[ V_3(t) = V_0(t) - (\text{voltage drop from Node 0 to Node 3}) \]

The voltage drop is the sum of the voltage drops across the resistors on the path from Node 0 to Node 3:

\[ V_3(t) = V_0(t) - \sum_{r \in path(3)} R_r I_r(t) = V_0(t) - \bigl( R_1 I_1(t) + R_2 I_2(t) + R_3 I_3(t) \bigr) \]

The current through a resistor is the sum of the currents through all of the downstream capacitors:

\[ I_r(t) = \sum_{c \in down(r)} I_c \]
\[ I_1(t) = I_{c1} + I_{c2} + I_{c3} + I_{c4} + I_{c5} \]
\[ I_2(t) = I_{c2} + I_{c3} + I_{c4} + I_{c5} \]
\[ I_3(t) = I_{c3} + I_{c4} \]

Substitute I_r into the equation for V_3:

\[ V_3(t) = V_0(t) - \bigl( R_1 (I_{c1} + I_{c2} + I_{c3} + I_{c4} + I_{c5}) + R_2 (I_{c2} + I_{c3} + I_{c4} + I_{c5}) + R_3 (I_{c3} + I_{c4}) \bigr) \]

Use associativity to group terms by currents:

\[ V_3(t) = V_0(t) - \bigl( I_{c1} R_1 + I_{c2} (R_1 + R_2) + I_{c3} (R_1 + R_2 + R_3) + I_{c4} (R_1 + R_2 + R_3) + I_{c5} (R_1 + R_2) \bigr) \]

Current through a capacitor:

\[ I_c(t) = C_c \frac{\partial V_c(t)}{\partial t} \]

Substitute I_c into the equation for V_3:

\[ V_3(t) = V_0(t) - \Bigl( R_1 C_{c1} \frac{\partial V_{c1}(t)}{\partial t} + (R_1 + R_2) C_{c2} \frac{\partial V_{c2}(t)}{\partial t} + (R_1 + R_2 + R_3) C_{c3} \frac{\partial V_{c3}(t)}{\partial t} + (R_1 + R_2 + R_3) C_{c4} \frac{\partial V_{c4}(t)}{\partial t} + (R_1 + R_2) C_{c5} \frac{\partial V_{c5}(t)}{\partial t} \Bigr) \]

Define R_{i,k} as the resistance shared by the paths to nodes i and k:

\[ R_{i,k} = \sum_{r \in path(i) \cap path(k)} R_r \]
\[ R_{3,1} = R_1 \]
\[ R_{3,2} = R_1 + R_2 \]
\[ R_{3,3} = R_1 + R_2 + R_3 \]
\[ R_{3,4} = R_1 + R_2 + R_3 \]
\[ R_{3,5} = R_1 + R_2 \]

Substitute R_{i,k} into V_3:

\[ V_3(t) = V_0(t) - \Bigl( R_{3,1} C_{c1} \frac{\partial V_{c1}(t)}{\partial t} + R_{3,2} C_{c2} \frac{\partial V_{c2}(t)}{\partial t} + R_{3,3} C_{c3} \frac{\partial V_{c3}(t)}{\partial t} + R_{3,4} C_{c4} \frac{\partial V_{c4}(t)}{\partial t} + R_{3,5} C_{c5} \frac{\partial V_{c5}(t)}{\partial t} \Bigr) \]

5.4.2.2 General Derivation

\[ V_i(t) = V_0(t) - (\text{voltage drop from Node 0 to Node } i) \]

The voltage drop is the sum of the voltage drops across the resistors on the path from Node 0 to Node i:

\[ V_i(t) = V_0(t) - \sum_{r \in path(i)} R_r I_r(t) \]

The current through a resistor is the sum of the currents through all of the downstream capacitors:

\[ I_r(t) = \sum_{c \in down(r)} I_c \]

Substitute I_r into the equation for V_i:

\[ V_i(t) = V_0(t) - \sum_{r \in path(i)} R_r \sum_{c \in down(r)} I_c \]

Use associativity to push R_r into the summation over c:

\[ V_i(t) = V_0(t) - \sum_{r \in path(i)} \sum_{c \in down(r)} R_r I_c \]

Current through a capacitor:

\[ I_c(t) = C_c \frac{\partial V_c(t)}{\partial t} \]

Substitute I_c into the equation for V_i:

\[ V_i(t) = V_0(t) - \sum_{r \in path(i)} \sum_{c \in down(r)} R_r C_c \frac{\partial V_c(t)}{\partial t} \]

A little bit of handwaving to prepare for Elmore resistance:

\[ V_i(t) = V_0(t) - \sum_{k \in Nodes} \sum_{r \in path(i) \cap path(k)} R_r C_k \frac{\partial V_k(t)}{\partial t} \]

Define Elmore resistance R_{i,k}:

\[ R_{i,k} = \sum_{r \in path(i) \cap path(k)} R_r \]

Substitute R_{i,k} into V_i:

\[ V_i(t) = V_0(t) - \sum_{k \in Nodes} R_{i,k} C_k \frac{\partial V_k(t)}{\partial t} \]

5.4.3 Elmore Timing Model

- Assume that V0(t) is a step function from 0 to 1 at time 0.
- Derive upper and lower bounds for Vi(t).
- Find RC time constants for upper and lower bounds.
- Elmore delay is guaranteed to be between upper and lower bounds.

[Figure: voltage versus time showing the upper and lower bounds, the Elmore model, and the RC-network model, with TD-TRi, TRi, TP, TP-TRi, and TD marked.]

Equations for Curves


[Table: equations for the upper-bound, Elmore, and lower-bound curves over the time ranges delimited by 0, TDi − TRi, TP − TRi, and ∞. The Elmore curve is 1 − e^(−t/TDi).]

Fact: 0 ≤ TRi ≤ TDi ≤ TP

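Definitions of Time Constants

\[ T_{Ri} = \sum_{k \in Nodes} \frac{R_{k,i}^2 C_k}{R_{i,i}} \qquad \text{Mathematical artifact, no intuitive meaning} \]

\[ T_{Di} = \sum_{k \in Nodes} R_{k,i} C_k \qquad \text{Elmore delay} \]

\[ T_P = \sum_{k \in Nodes} R_{k,k} C_k \qquad \text{RC-time constant for lumped network} \]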
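As a numeric illustration, the three time constants can be computed for the five-node tree used in the derivation. This is a sketch under stated assumptions: the tree shape is inferred from the earlier path and down examples, the resistor and capacitor values are invented (kΩ and fF, giving delays in ps), and the function names are hypothetical.

```python
# Sketch: Elmore time constants on the five-node RC tree.
# PARENT[k] is the node on the source side of resistor Rk; node 0 is the source.
PARENT = {1: 0, 2: 1, 3: 2, 4: 3, 5: 2}
R = {1: 1, 2: 2, 3: 1, 4: 3, 5: 3}  # kilo-ohms (made-up values)
C = {1: 2, 2: 1, 3: 1, 4: 4, 5: 4}  # femtofarads (made-up values)


def path(i):
    """Indices of the resistors between the source and node i."""
    nodes = set()
    while i != 0:
        nodes.add(i)
        i = PARENT[i]
    return nodes


def shared_resistance(i, k):
    """Elmore resistance R_{i,k}: resistance common to the paths to i and k."""
    return sum(R[r] for r in path(i) & path(k))


def elmore_delay(i):
    """T_Di = sum over k of R_{k,i} * C_k."""
    return sum(shared_resistance(k, i) * C[k] for k in C)


def lumped_time_constant():
    """T_P = sum over k of R_{k,k} * C_k."""
    return sum(shared_resistance(k, k) * C[k] for k in C)


def t_r(i):
    """T_Ri = sum over k of R_{k,i}^2 * C_k / R_{i,i}."""
    return sum(shared_resistance(k, i) ** 2 * C[k] for k in C) / shared_resistance(i, i)


if __name__ == "__main__":
    print(t_r(4), elmore_delay(4), lumped_time_constant())
```

For node 4 the values satisfy 0 ≤ TRi ≤ TDi ≤ TP, consistent with the fact quoted for the bounding curves.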


Picking the Trip Point


\[ V_i(t) = V_{DD} \bigl( 1 - e^{-t/T_{Di}} \bigr) \]

Pick a trip point of V_i(t) = 0.65 V_DD, then solve for t:

\[ 0.65\, V_{DD} = V_{DD} \bigl( 1 - e^{-t/T_{Di}} \bigr) \]
\[ 0.35 = e^{-t/T_{Di}} \]

Take ln of both sides:

\[ \ln 0.35 = \ln \bigl( e^{-t/T_{Di}} \bigr) \]
\[ \ln 0.35 = -1.05 \approx -1.0 \]
\[ -1.0 = -t/T_{Di} \]
\[ t = T_{Di} \]

By picking a trip point of 0.65 VDD, the time for Vi to reach the trip point is the Elmore delay.

5.4.4 Examples of Using Elmore Delay

5.4.4.1 Interconnect with Single Fanout

[Figure: gates G1 and G2 connected through antifuses Ra1 to Ra4 and wire segments Rw1 to Rw3 with wire capacitances C1 to C3, together with the equivalent RC network: G1 (Rpu, Rpd, Cp) driving Ra1, Rw1, Ra2, Rw2, Ra3, Rw3, and Ra4, with capacitors C1, C2, C3, and CG2.]

G*    gate
C*    capacitance on wire
Ra*   resistance through antifuse
Rw*   resistance through wire

Question: Calculate delay from gate 1 to gate 2.

Doubling Antifuses
Question: If you double the number of antifuses and wires needed to connect two gates, what will be the approximate effect on the wire delay between the gates?


5.4.4.2 Interconnect with Multiple Gates in Fanout


[Figure: G1 driving G2 and G3 through the interconnect.]

Question: Assuming that wire resistance is much less than antifuse resistance and that all antifuses have equal resistance, calculate the delay from the source inverter (G1) to G2.

Delay to G2 vs G3
Question: Assuming all wire segments at same level have roughly the same capacitance, which is greater, the delay to G2 or the delay to G3?
[Figure: interconnect from G1 to G2 and G3, and its RC network with Rpu, Rpd, Cp, resistors R1 to R6, capacitors C1 to C7, and nodes n1 to n7.]

5.5 Practical Usage of Timing Analysis

Speed Grading

- Fabs sort chips according to their speed (sorting is known as speed grading or speed binning).
- Faster chips are more expensive.
- In FPGAs, sorting is usually based on propagation delay through an FPGA cell. As wires become a larger portion of delay, some analysis of wire delays is also being done.
- Propagation delay is the average of the rising and falling propagation delays.
- Typical speed grades for FPGAs:

Std   standard speed grade
1     15% faster than Std
2     25% faster than Std
3     35% faster than Std

Worst-Case Timing

Maximum delay in CMOS. When?
- Minimum voltage
- Maximum temperature
- Slow-slow conditions (process variation/corner which results in slow p-channel and slow n-channel). We could also have fast-fast, slow-fast, and fast-slow process corners.

Increasing temperature increases delay:
- ↑ temperature ⟹ ↑ resistivity
- ↑ resistivity ⟹ ↑ electron vibration
- ↑ electron vibration ⟹ ↑ colliding with current electrons
- ↑ colliding with current electrons ⟹ ↑ delay

Increasing supply voltage decreases delay:
- ↑ supply voltage ⟹ ↑ current
- ↑ current ⟹ ↓ load capacitor charge time
- ↓ load capacitor charge time ⟹ ↓ total delay

Derating factor is a number used to adjust timing numbers to account for voltage and temperature conditions.

ASIC manufacturers' classes, based on variety of environments:

Class        VDD        TA (ambient temp)   TC (case temp)
Commercial   5V ± 5%    0 to +70C
Industrial   5V ± 10%   -40 to +85C
Military     5V ± 10%                       -55 to +125C

What is important is the transistor temperature inside the chip, TJ (junction temperature).

5.5.1 Speed Binning

Speed binning is the process of testing each manufactured part to determine the maximum clock speed at which it will run reliably. Manufacturers sell chips off of the same manufacturing line at different prices based on how fast they will run. A speed bin is the clock speed that chips will be labeled with when sold.

Overclocking: running a chip at a clock speed faster than what it is rated for (and hoping that your software crashes more frequently than your over-stressed hardware will).

5.5.1.1 FPGAs, Interconnect, and Synthesis


On FPGAs, 40-60% of the clock cycle is consumed by interconnect. When synthesizing, increasing the effort (number of iterations) of place and route can significantly reduce the clock period on large designs.

5.5.2 Worst Case Timing

5.5.2.1 Fanout delay

In Smith's book, Table 5.2 (Fanout delay) combines two separate parameters:
- capacitive load delay
- interconnect delay
into a single parameter (fanout). This is common, and fine. But, when reading a table such as this, you need to know whether fanout delay is combining both capacitive load delay and interconnect delay, or is just capacitive load.

5.5.2.2 Derating Factors

Delays are dependent upon supply voltage and temperature.
- ↑ Temp ⟹ ↑ Delay
- ↑ Supply voltage ⟹ ↓ Delay

Temperature

↑ Temp ⟹ ↑ Delay
- ↑ Temp ⟹ ↑ Resistivity of wires
- As temp goes up, atoms vibrate more, and so have greater probability of colliding with electrons flowing with current.

Supply Voltage

↑ Supply voltage ⟹ ↓ Delay
- ↑ Supply voltage ⟹ ↑ current (V = IR)
- ↑ current ⟹ ↓ time to charge load capacitors to threshold voltage

Derating Factor Definition

A derating factor is a number to adjust timing numbers to account for different temperature and voltage conditions.

Excerpt from table 5.3 in Smith's book (Actel Act 3 derating factors):

Derating factor   Temp    Vdd
1.17              125C    4.5V
1.00              70C     5.0V
0.63              -55C    5.5V
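Applying a derating factor is a single multiplication. The sketch below assumes the excerpt is read as three (factor, temperature, Vdd) rows and that the nominal delay is quoted at the 1.00 condition; the 10 ns nominal delay and the function name are made up for illustration.

```python
# Sketch: adjust a nominal delay with a derating factor.
# Keys are (temperature in C, Vdd in volts), read from the excerpt as three rows.
DERATING = {
    (125, 4.5): 1.17,
    (70, 5.0): 1.00,
    (-55, 5.5): 0.63,
}


def derate(nominal_delay_ns, temp_c, vdd):
    """Delay at the given conditions, given the delay at the 1.00 condition."""
    return nominal_delay_ns * DERATING[(temp_c, vdd)]


if __name__ == "__main__":
    for conditions in DERATING:
        print(conditions, derate(10.0, *conditions))
```

The hot, low-voltage corner gives the largest delay, which matches the worst-case conditions listed earlier in this section.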

Chapter 6 Power Analysis and Power-Aware Design

6.1 Overview

6.1.1 Importance of Power and Energy

- Laptops, PDAs, cell-phones, etc.: obvious!
- For microprocessors in personal computers, every watt above 40W adds $1 to manufacturing cost.
- Approx 25% of operating expense of server farm goes to energy bills.
- (Dis)Comfort of Unix labs in E2.
- Sandia Labs had to build a special sub-station when they took delivery of Teraflops massively parallel supercomputer (over 9000 Pentium Pros).
- High-speed microprocessors today can run so hot that they will damage themselves: Athlon reliability problems, Pentium 4 processor thermal throttling.
- In 2000, information technology consumed 8% of total power in US.
- Future power viruses: cell phone viruses cause cell phone to run in full power mode and consume battery very quickly; PC viruses that cause CPU to meltdown batteries.

6.1.2 Industrial Names and Products

Note: Lots of links from E&CE 327 web pages under Documentation

6.1.3

Power vs Energy

Most people talk about power reduction, but sometimes they mean power and sometimes energy. Power minimization is usually about heat removal

Energy minimization is usually about battery life or energy costs


Type Units Equivalent Types Equations Energy Joules Work = Volts Coulombs = 1 C Volts2 2 Power Watts Energy / Time = Volts I = Joules/ sec

6.1.4 Batteries, Power and Energy

6.1.4.1 Do Batteries Store Energy or Power?


$$\text{Energy} = \text{Volts} \times \text{Coulombs}$$

$$\text{Power} = \frac{\text{Energy}}{\text{Time}}$$

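Batteries are rated in Amp-hours at a voltage.

$$\text{battery} = \text{Amps} \times \text{Seconds} \times \text{Volts} = \frac{\text{Coulombs}}{\text{Seconds}} \times \text{Seconds} \times \text{Volts} = \text{Coulombs} \times \text{Volts} = \text{Energy}$$

Batteries store energy.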

6.1.4.2 Battery Life and Efficiency

To extend battery life, we want to increase the amount of work done and/or decrease the energy consumed. Work and energy have the same units; therefore, to extend battery life, we truly want to improve efficiency. Power efficiency of microprocessors is normally measured in MIPS/Watt. Is this a real measure of efficiency?

$$\frac{\text{MIPS}}{\text{Watts}} = \frac{\text{millions of instructions} / \text{Seconds}}{\text{Energy} / \text{Seconds}} = \frac{\text{millions of instructions}}{\text{Energy}}$$

Both instructions executed and energy are measures of work, so MIPS/Watt is a measure of efficiency.

Question: What is the weakness of this analysis?

6.1.4.3 Battery Life and Power

Question: Running a VHDL simulation requires executing an average of 1 million instructions per simulation step. My computer runs at 700MHz, has a CPI of 1.0, and burns 70W of power. My battery is rated at 10V and 2.5AH. Assuming all of my computer's clock cycles go towards running VHDL simulations, how many simulation steps can I run on one battery charge?

Question: If I use the SpeedStep feature of my computer, my computer runs at 600MHz with 60W of power. With SpeedStep activated, how much longer can I keep the computer running on one battery?



Question: With SpeedStep activated, how many more simulation steps can I run on one battery?
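The three battery questions above reduce to unit arithmetic. The following Python sketch works through them, assuming the battery delivers its full rated energy (10 V at 2.5 AH) regardless of load; the function names are illustrative and not part of the course notes.

```python
def battery_energy_joules(volts, amp_hours):
    # Amp-hours * 3600 = Coulombs; Coulombs * Volts = Joules
    return volts * amp_hours * 3600.0

def run_time_seconds(energy_joules, power_watts):
    # Power = Energy / Time, so Time = Energy / Power
    return energy_joules / power_watts

def simulation_steps(clock_hz, cpi, instr_per_step, seconds):
    # instructions executed = (clock cycles / CPI); steps = instructions / instr_per_step
    return clock_hz / cpi * seconds / instr_per_step

energy = battery_energy_joules(10.0, 2.5)            # 90,000 J
t_full = run_time_seconds(energy, 70.0)              # 700 MHz, 70 W
t_slow = run_time_seconds(energy, 60.0)              # 600 MHz, 60 W (SpeedStep)
steps_full = simulation_steps(700e6, 1.0, 1e6, t_full)
steps_slow = simulation_steps(600e6, 1.0, 1e6, t_slow)

print(round(t_full), round(t_slow))                  # 1286 1500
print(round(steps_full), round(steps_slow))          # 900000 900000
```

Under these assumptions the energy per instruction is the same in both modes (70 W / 700 MHz = 60 W / 600 MHz), so SpeedStep lengthens the run time by a factor of 7/6 without changing the number of simulation steps.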

6.2 Power Equations
$$\text{Power} = \underbrace{\text{SwitchPower} + \text{ShortPower}}_{\text{DynamicPower}} + \underbrace{\text{LeakagePower}}_{\text{StaticPower}}$$

Dynamic Power: dependent upon clock speed
- Switching Power: useful, charges up transistors
- Short Circuit Power: not useful, both N and P transistors are on

Static Power: independent of clock speed
- Leakage Power: not useful, leaks around transistor


Dynamic Power
Dynamic power is proportional to how often signals change their value (switch). Roughly 20% of signals switch during a clock cycle.

Need to take glitches into account when calculating activity factor. Glitches increase the activity factor. Equations for dynamic power contain clock speed and activity factor.

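6.2.1 Switching Power

[Figure: inverter charging a capacitor (input 1->0, output 0->1 on load CapLoad) and inverter discharging a capacitor (input 0->1, output 1->0 on load CapLoad)]

$$\text{energy to (dis)charge capacitor} = \tfrac{1}{2} \times \text{CapLoad} \times \text{VoltSup}^2$$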

When a capacitor C is charged to a voltage V, the energy stored in the capacitor is $\tfrac{1}{2}CV^2$. The energy required to charge the capacitor from 0 to V is $CV^2$. Half of the energy ($\tfrac{1}{2}CV^2$) is dissipated as heat through the pullup resistance. Half of the energy is transferred to the capacitor. When the capacitor discharges from V to 0, the energy stored in the capacitor ($\tfrac{1}{2}CV^2$) is dissipated as heat through the pulldown resistance.

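f: frequency at which the inverter goes through a complete charge-discharge cycle (eqn 15.4 in Smith).

$$\text{average switching power} = f \times \text{CapLoad} \times \text{VoltSup}^2$$

ClockSpeed: clock speed
ActFact: average number of times that a signal switches from 0 to 1 or from 1 to 0 during a clock cycle

A complete charge-discharge cycle contains two switches, so $f = \tfrac{1}{2} \times \text{ActFact} \times \text{ClockSpeed}$, which gives:

$$\text{average switching power} = \tfrac{1}{2} \times \text{ActFact} \times \text{ClockSpeed} \times \text{CapLoad} \times \text{VoltSup}^2$$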
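As a quick numerical illustration of the switching-power equation, the Python sketch below evaluates it at two supply voltages. The 20% activity factor comes from the earlier slide on dynamic power; the clock speed, load capacitance, and voltages are hypothetical values chosen only for the example.

```python
def switching_power(act_fact, clock_hz, cap_load_farads, volt_sup):
    # average switching power = 1/2 * ActFact * ClockSpeed * CapLoad * VoltSup^2
    return 0.5 * act_fact * clock_hz * cap_load_farads * volt_sup ** 2

# Hypothetical operating points: 100 MHz clock, 1 nF total switched capacitance.
p_high = switching_power(0.2, 100e6, 1e-9, 3.3)
p_low = switching_power(0.2, 100e6, 1e-9, 1.65)

# Halving the supply voltage cuts switching power by a factor of about 4.
print(p_high, p_low, p_high / p_low)
```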

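6.2.2 Short-Circuited Power

[Figure: inverter with input Vi, output Vo, and short-circuit current IShort; plot of gate voltage between GND and VoltSup, marking VoltThresh and VoltSup - VoltThresh and the interval TimeShort during which both the P-transistor and the N-transistor are on]

$$\text{PwrShort} = \text{ActFact} \times \text{ClockSpeed} \times \text{TimeShort} \times \text{IShort} \times \text{VoltSup}$$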

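6.2.3 Leakage Power

[Figure: cross section of inverter (input Vi, output Vo, N and P regions in an N-substrate) showing the parasitic diode; I-V plot of the leakage current ILeak through the parasitic diode]

$$\text{PwrLk} = \text{ILeak} \times \text{VoltSup}$$

$$\text{ILeak} \propto e^{-\frac{q \times \text{VoltThresh}}{kT}}$$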

6.2.4 Glossary

This section reserved for your reading pleasure

6.2.5 Note on Power Equations

This section reserved for your reading pleasure

6.3 Overview of Power Reduction Techniques


We can divide power reduction techniques into two classes: analog and digital.

Analog Parameters
Power reduction parameters at the analog level:
- capacitance: for example, Silicon on Insulator (SOI)
- resistance: for example, copper wires
- voltage: low-voltage circuits

Analog Techniques
Power reduction techniques at the analog level:
- dual-VDD: Two different supply voltages: high voltage for performance-critical portions of the design, low voltage for the remainder of the circuit. Alternatively, can vary voltage over time: high voltage when running performance-critical software and low voltage when running software that is less sensitive to performance.
- dual-Vt: Two different threshold voltages: transistors with low threshold voltage for performance-critical portions of the design (can switch more quickly, but more leakage power), transistors with high threshold voltage for the remainder of the circuit (switches more slowly, but reduces leakage power).
- exotic circuits: Special flops, latches, and combinational circuitry that run at a high frequency while minimizing power.
- adiabatic circuits: Special circuitry that consumes power on 0->1 transitions, but not 1->0 transitions. These sacrifice performance for reduced power.
- clock trees: Up to 30% of total power can be consumed in clock generation and the clock tree.

Digital Parameters
Power-reduction parameters at the digital level:
- capacitance (number of gates)
- activity factor
- clock frequency

Digital Techniques
Power-reduction techniques at the digital level:
- multiple clocks: Put a high speed clock in performance-critical parts of the design and a low speed clock for the remainder of the circuit.
- clock gating: Turn off the clock to portions of a chip when they are not being used.
- data encoding: Gray coding vs one-hot vs fully encoded vs ...
- glitch reduction: Adjust circuit delays or add redundant circuitry to reduce or eliminate glitches.
- asynchronous circuits: Get rid of clocks altogether....

Additional low-power design techniques for RTL from a Qualis engineer: http://home.europa.com/celiac/lowpower.html


6.4 Voltage Reduction for Power Reduction


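If our goal is to reduce power, the most promising approach is to reduce the supply voltage, because, from:

$$\begin{aligned}
\text{Power} = {} & (\text{ActFact} \times \text{ClockSpeed} \times \tfrac{1}{2}\,\text{CapLoad} \times \text{VoltSup}^2) \\
& + (\text{ActFact} \times \text{ClockSpeed} \times \text{TimeShort} \times \text{IShort} \times \text{VoltSup}) \\
& + (\text{ILeak} \times \text{VoltSup})
\end{aligned}$$

we observe:

$$\text{Power} \propto \text{VoltSup}^2$$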


Reducing Difference Between Supply and Threshold Voltage


As the supply voltage decreases, it takes longer to charge up the capacitive load, which increases the load delay of a circuit. In the chapter on timing analysis, we saw that increasing the supply voltage will decrease the delay through a circuit. (From V = IR, increasing V causes an increase in I, which causes the capacitive load to charge more quickly.) However, it is more accurate to take into account both the value of the supply voltage, and the difference between the supply voltage and the threshold voltage.

$$\text{MaxClockSpeed} \propto \frac{(\text{VoltSup} - \text{VoltThresh})^2}{\text{VoltSup}}$$


Effect of Decreasing Supply Voltage on Delay


Question: If the delay along the critical path of a circuit is 20 ns, the supply voltage is 2.8 V, and the threshold voltage is 0.7 V, calculate the critical path delay if the supply voltage is dropped to 2.2 V.
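One way to approach this question is to treat the critical path delay as inversely proportional to MaxClockSpeed, which is proportional to (VoltSup - VoltThresh)^2 / VoltSup. The Python sketch below applies that scaling; the function names are illustrative, and the result holds only under this proportionality assumption.

```python
def relative_max_clock_speed(volt_sup, volt_thresh):
    # MaxClockSpeed is proportional to (VoltSup - VoltThresh)^2 / VoltSup
    return (volt_sup - volt_thresh) ** 2 / volt_sup

def scaled_delay(delay, v_old, v_new, v_thresh):
    # Delay is assumed inversely proportional to MaxClockSpeed.
    return (delay
            * relative_max_clock_speed(v_old, v_thresh)
            / relative_max_clock_speed(v_new, v_thresh))

new_delay_ns = scaled_delay(20.0, 2.8, 2.2, 0.7)
print(round(new_delay_ns, 1))   # 30.8
```

Dropping the supply from 2.8 V to 2.2 V therefore lengthens the 20 ns critical path by a factor of 1.54 in this model.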


Reducing Threshold Voltage Increases Leakage Current


If we reduce the supply voltage, we want to also reduce the threshold voltage, so that we do not increase the delay through the circuit. However, as threshold voltage drops, leakage current increases:

$$\text{ILeak} \propto e^{-\frac{q \times \text{VoltThresh}}{kT}}$$

And increasing the leakage current increases the power:

$$\text{Power} \propto \text{ILeak}$$

So, we need to strike a balance between reducing VoltSup (which has a quadratic effect on reducing power), and increasing ILeak, which has a linear effect on increasing power.

6.5 Data Encoding for Power Reduction

6.5.1 How Data Encoding Can Reduce Power


Data encoding is a technique that chooses data values so that normal execution will have a low activity factor. The most common example is Gray coding where exactly one bit changes value each clock cycle when counting.

Decimal   Gray   Binary
0         0000   0000
1         0001   0001
2         0011   0010
3         0010   0011
4         0110   0100
5         0111   0101
6         0101   0110
7         0100   0111
8         1100   1000
9         1101   1001
10        1111   1010
11        1110   1011
12        1010   1100
13        1011   1101
14        1001   1110
15        1000   1111

8-bit Counter
Question: For an eight-bit counter, how much more power will a binary counter consume than a Gray-code counter?
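One way to estimate the answer is to count how many counter bits toggle over one full counting cycle in each encoding. The Python sketch below does this under simplifying assumptions that are not from the notes: every bit has the same capacitive load, and only the counter's state bits are counted (the next-state logic is ignored).

```python
def gray(n):
    # Standard binary-to-Gray conversion.
    return n ^ (n >> 1)

def toggles_per_cycle(encode, bits):
    # Total number of bit flips over one full pass through the count sequence,
    # including the wrap-around from the last value back to the first.
    period = 1 << bits
    total = 0
    for i in range(period):
        cur, nxt = encode(i), encode((i + 1) % period)
        total += bin(cur ^ nxt).count("1")
    return total

binary_toggles = toggles_per_cycle(lambda n: n, 8)
gray_toggles = toggles_per_cycle(gray, 8)
print(binary_toggles, gray_toggles)   # 510 256
```

With these assumptions the binary counter has roughly twice the switching activity of the Gray-code counter (510 versus 256 bit flips per 256 clock cycles).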


Random Data
Question: For completely random eight-bit data, how much more power will a binary circuit consume than a Gray-code circuit?

6.5.2 Example Problem: Sixteen Pulser

6.5.2.1 Problem Statement

Your task is to do the power analysis for a circuit that should send out a one-clock-cycle pulse on the done signal once every 16 clock cycles. (That is, done is 0 for 15 clock cycles, then 1 for one cycle, then repeat with 15 cycles of 0 followed by a 1, etc.)
[Figure: required behaviour. Waveforms of clk and done, with clock cycles labelled 1, 2, 3, ..., 15, 16, 17, ..., 31, 32, 33]

You have been asked to consider three different types of counters: a binary counter, a Gray-code counter, and a one-hot counter. (The table below shows the values from 0 to 15 for the different encodings.)

Question: What is the relative amount of power consumption for the different options?

6.5.2.2 Additional Information

Your implementation technology is an FPGA where each cell has a programmable combinational circuit and a flip-flop. The combinational circuit has 4 inputs and 1 output. The capacitive load of the combinational circuit is twice that of the flip-flop.

[Figure: FPGA cell containing a PLA and a flip-flop]

1. You may neglect power associated with clocks.
2. You may assume that all counters:
   (a) are implemented on the same fabrication process
   (b) run at the same clock speed
   (c) have negligible leakage and short-circuit currents

Data Encoding

Decimal   Gray   One-Hot            Binary
0         0000   0000000000000001   0000
1         0001   0000000000000010   0001
2         0011   0000000000000100   0010
3         0010   0000000000001000   0011
4         0110   0000000000010000   0100
5         0111   0000000000100000   0101
6         0101   0000000001000000   0110
7         0100   0000000010000000   0111
8         1100   0000000100000000   1000
9         1101   0000001000000000   1001
10        1111   0000010000000000   1010
11        1110   0000100000000000   1011
12        1010   0001000000000000   1100
13        1011   0010000000000000   1101
14        1001   0100000000000000   1110
15        1000   1000000000000000   1111

6.5.2.3 Answer

Sketch the Circuitry

Name the output done and the count digits d().

Capacitance

                        cap   number   subtotal cap
Gray     d()    PLAs
                Flops
         done   PLAs
                Flops
1-Hot    d()    PLAs
                Flops
         done   PLAs
                Flops
Binary   d()    PLAs
                Flops
         done   PLAs
                Flops

Activity Factors

Gray Coding Activity Factor

[Figure: waveforms of clk, d(0), d(1), d(2), d(3), and done for Gray coding]

Signal   Activity factor
d(0)     8/16
d(1)     4/16
d(2)     2/16
d(3)     2/16
done     2/16

One-Hot Activity Factor

[Figure: waveforms of clk, d(0), d(1), d(2), ..., and done for one-hot coding]

Signal                   Activity factor
d(0), d(1), d(2), ...    2/16 each
done                     2/16

Binary Coding Activity Factor

[Figure: waveforms of clk, d(0), d(1), d(2), d(3), and done for binary coding]

Signal   Activity factor
d(0)     16/16
d(1)     8/16
d(2)     4/16
d(3)     2/16
done     2/16
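The per-bit activity factors for the three encodings can be cross-checked by counting toggles over one 16-cycle period. The Python sketch below is one such check; the helper names are illustrative, and it counts toggles of the count digits d() only, not of done.

```python
def gray(n):
    return n ^ (n >> 1)

def one_hot(n):
    return 1 << n

def binary(n):
    return n

def toggle_counts(encode, width, period=16):
    # Number of times each bit d(0)..d(width-1) changes value in one period,
    # including the wrap-around from count 15 back to count 0.
    counts = [0] * width
    for i in range(period):
        diff = encode(i) ^ encode((i + 1) % period)
        for bit in range(width):
            counts[bit] += (diff >> bit) & 1
    return counts   # divide each entry by 16 to get the activity factor

print(toggle_counts(gray, 4))       # [8, 4, 2, 2]
print(toggle_counts(binary, 4))     # [16, 8, 4, 2]
print(toggle_counts(one_hot, 16))   # sixteen entries, each 2
```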

Putting it all Together

                        subtotal cap   act fact   power
Gray     d()    PLAs
                Flops
         done   PLAs
                Flops
         Total
1-Hot    d()    PLAs
                Flops
         done   PLAs
                Flops
         Total
Binary   d()    PLAs
                Flops
         done   PLAs
                Flops
         Total

6.6 Clock Gating

The basic idea of clock gating is to reduce power by turning off the clock when a circuit isn't needed. This reduces the activity factor.

6.6.1 Introduction to Clock Gating


Examples of Clock Gating

Condition                                           Circuitry turned off
O/S in standby mode                                 Everything except core state (PC, registers, caches, etc)
No floating point instructions for k clock cycles   floating point circuitry
Instruction cache miss                              Instruction decode circuitry
No instruction in pipe stage i                      Pipe stage i

6.6.2 Implementing Clock Gating

Clock gating is implemented by adding a component that disables the clock when the circuit isn't needed.

[Figure: without clock gating. Circuit with inputs i_data, i_valid, clk and outputs o_data, o_valid]

[Figure: with clock gating. The same circuit clocked by cool_clk, which is derived from clk and the clk_en output of a Clock Enable State Machine with input i_wakeup]

6.6.3 Design Process

6.6.4 Effectiveness of Clock Gating

Parameters to characterize effectiveness of clock gating:
- Eff = effectiveness of clock gating
- PctValid = percentage of clock cycles with valid data in the circuit (the clock must be toggling)
- PctClk = percentage of clock cycles that the clock toggles

Effectiveness measures the percentage of clock cycles with invalid data in which the clock is turned off.

Equation for effectiveness of clock gating:

$$\text{Eff} = \frac{\text{PctClkOff}}{\text{PctInvalid}} = \frac{1 - \text{PctClk}}{1 - \text{PctValid}}$$
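The effectiveness equation is easy to experiment with numerically. The Python sketch below evaluates Eff = (1 - PctClk) / (1 - PctValid) for a few operating points; the 60% valid figure is a hypothetical value, and rejecting PctClk < PctValid is an assumption based on the requirement that the clock must toggle whenever the circuit holds valid data.

```python
def gating_effectiveness(pct_clk, pct_valid):
    # Both arguments are fractions in the range 0.0 to 1.0.
    if pct_valid >= 1.0:
        raise ValueError("effectiveness is undefined when data is always valid")
    if pct_clk < pct_valid:
        # Assumption: the clock must toggle on every cycle with valid data.
        raise ValueError("clock is off during some cycles with valid data")
    return (1.0 - pct_clk) / (1.0 - pct_valid)

pct_valid = 0.6                                  # hypothetical: valid 60% of the time
print(gating_effectiveness(0.6, pct_valid))      # clock toggles only when valid: 1.0
print(gating_effectiveness(1.0, pct_valid))      # clock always toggles: 0.0
print(round(gating_effectiveness(0.7, pct_valid), 2))   # 0.75
```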


Clock Gating Effectiveness Questions


Question: What is the effectiveness if the clock toggles only when there is valid data?

Question: What is the effectiveness of a clock that always toggles?

Question: What does it mean for a clock gating scheme to be 75% effective?

Question: What happens if PctClk < PctValid?

Effect of Effectiveness
We can see the effect of the effectiveness of a clock-gating scheme on the activity factor:

[Figure: plot of the new activity factor A' against Eff, going from A at Eff = 0 to PctValid * A at Eff = 1]

The new activity factor with a clock gating scheme is:

$$A' = A - (1 - \text{PctValid}) \times \text{Eff} \times A$$


6.6.5 Example: Reduced Activity Factor with Clock Gating


Question: How much power will be saved in the following clock-gating scheme?

- 70% of the time the main circuit has valid data
- clock gating circuit is 90% effective (90% of the time that the circuit has invalid data, the clock is off)
- clock gating circuit has 10% of the area of the main circuit
- clock gating circuit has same activity factor as main circuit
- neglect short-circuiting and leakage power
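A possible way to work this example is sketched below in Python. It uses the new-activity-factor relation A' = A - (1 - PctValid) * Eff * A and makes two assumptions that are not stated in the question: capacitance is proportional to area, and the clock-gating circuit runs continuously with the activity factor that the main circuit had before gating. Power is expressed relative to the ungated main circuit.

```python
def gated_activity(act, pct_valid, eff):
    # A' = A - (1 - PctValid) * Eff * A
    return act - (1.0 - pct_valid) * eff * act

ungated_power = 1.0                                  # reference: main circuit, no gating

# Main circuit: same capacitance, reduced activity factor.
main_power = gated_activity(1.0, 0.70, 0.90)         # 0.73 of the reference

# Gating circuit: 10% of the capacitance (assumed proportional to area),
# assumed to keep the original activity factor because it is never gated.
gating_power = 0.10 * 1.0

saved = ungated_power - (main_power + gating_power)
print(round(main_power, 2), round(saved, 2))         # 0.73 0.17
```

Under these assumptions the scheme saves about 17% of the switching power.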

6.6.6 Clock Gating with Valid-Bit Protocol

6.6.6.1 Valid-Bit Protocol

Need a mechanism to tell circuit when to pay attention to data inputs


[Figure: circuit with inputs clk, i_valid, i_data and outputs o_valid, o_data; waveforms of clk, i_valid, i_data, o_valid, o_data]

- i_valid: high when i_data has valid data. Signifies whether the circuit should pay attention to or ignore the data.
- o_valid: high when o_data has valid data. Signifies whether the environment should pay attention to the output of the circuit.

For more on circuit protocols, see section 2.12.

Microscopic Analysis
Which clock edges are needed?

[Figure: circuit with i_valid, clk, o_valid; waveforms of clk, i_valid, o_valid]


6.6.6.2 How Many Clock Cycles for Module?


Given a module with latency Lat, if the module receives a stream of NumPcls consecutive valid parcels, how many clock cycles must the clock-enable signal be asserted?
[Figure: table of example waveforms of i_valid, o_valid, and clk_en, with columns for Latency, NumPcls, and NumClkEn]

6.6.6.3 Adding Clock-Gating Circuitry

Before Clock Gating

[Figure: circuit with inputs data_in, valid_in, clk and outputs data_out, valid_out; waveforms of clk, valid_in, data_in, valid_out, data_out, with a legend for "don't care" and "uninitialized" values]

After Clock Gating: Circuitry

[Figure: circuit with inputs data_in, valid_in and outputs data_out, valid_out, clocked by cool_clk; a Clock Enable State Machine takes hot_clk and wakeup_in and produces clk_en and wakeup_out]

- hot_clk: clock that always toggles
- cool_clk: gated clock, sometimes toggles, sometimes stays low
- wakeup: alerts circuit that valid data will be arriving soon
- clk_en: turns on cool_clk

After Clock Gating: New Signals

[Figure: waveforms of hot_clk, wakeup_in, valid_in, data_in, clk_en, cool_clk, valid_out, data_out, wakeup_out]

6.6.7 Example: Pipelined Circuit with Clock-Gating


Design a clock enable state machine for the pipelined component described below.

- capacitance of pipelined component = 200
- latency varies from 5 to 10 clock cycles, even distribution of latencies
- contains a maximum of 6 instructions (parcels of data)
- 60% of incoming parcels are valid
- average length of continuous sequence of valid parcels is 80
- use input and output valid bits for wakeup
- leakage current is negligible
- short-circuit current is negligible
- LUTs have a capacitance of 1, flops have a capacitance of 2

Waveforms for Parcel Count

[Figure: waveforms of i_valid, o_valid, parcel_count, and parcel_clk_en over clock cycles 1 to 24]

Waveforms for Cycle Count

[Figure: waveforms of i_valid, o_valid, cycle_count, and cycle_clk_en over clock cycles 1 to 24; cycle_count takes the values 0 1 2 0 0 0 1 2 3 4 5 0 1 2 3 4 5 6 7 8 9 10]

Summary of Design Process

Outline:
1. sketch out circuitry for parcel count and cycle count state machine
2. estimate capacitance of each state machine
3. estimate activity factor of main circuit, based on behaviour

Parcel Count Design

Need to count (0..6) parcels, therefore need 3 bits for counter. Counter must be able to increment and decrement.

Equations for counter action (increment/decrement/no-change):

i_valid   o_valid   action
0         0         no change
0         1         decrement
1         0         increment
1         1         no change
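The counter action table can be modelled in a few lines of Python to check its behaviour on a short input stream. This is a behavioural sketch only, not the state machine's hardware implementation; the function name and the example stimulus (three valid parcels with a latency of five cycles) are illustrative assumptions.

```python
def next_parcel_count(count, i_valid, o_valid):
    # Counter action from the table: increment when a parcel enters without
    # one leaving, decrement when one leaves without one entering.
    if i_valid and not o_valid:
        return count + 1
    if o_valid and not i_valid:
        return count - 1
    return count

# Hypothetical stimulus: 3 consecutive valid parcels, each leaving 5 cycles later.
i_valid = [1, 1, 1, 0, 0, 0, 0, 0]
o_valid = [0, 0, 0, 0, 0, 1, 1, 1]

count, trace = 0, []
for iv, ov in zip(i_valid, o_valid):
    count = next_parcel_count(count, iv, ov)
    assert 0 <= count <= 6        # the component holds at most 6 parcels
    trace.append(count)

print(trace)   # [1, 2, 3, 3, 3, 2, 1, 0]
```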

Chapter 7 Fault Testing and Testability

7.1 Faults and Testing

7.1.1 Overview of Faults and Testing

7.1.1.1 Faults

During manufacturing, faults can occur that make the physical product behave incorrectly.

Definition: A fault is a manufacturing defect that causes a wire, poly, diffusion, or via to either break or connect to something it shouldn't.

[Figure: good wires, shorted wires, open wire]

7.1.1.2 Causes of Faults

- Fabrication process (initial construction is bad): chemical mix, impurities, dust
- Manufacturing process (damage during construction)
  - handling: probing, cutting, mounting
  - materials: corrosion, adhesion failure, cracking, peeling

7.1.1.3 Testing

Definition: Testing is the process of checking that the manufactured wafer/chip/board/system has the same functionality as the simulations.

7.1.1.4 Burn In

Definition: Burn-in is the process of subjecting chips to extreme conditions (high and low temps, high and low voltages, high and low clock speeds) before and during testing.

[Figure: soon-to-break wire]

7.1.1.5 Bin Sorting

Each chip (or wafer) is run at a variety of clock speeds. The chips are grouped and labeled (binned) by the maximum clock frequency at which they will work reliably. For example, chips coming off of the same production line might be labelled as 800MHz, 900MHz, and 1000MHz.

7.1.1.6 Testing Techniques

7.1.1.7 Design for Testability (DFT)

7.1.2 Example Problem: Economics of Testing


Note: There is a tradeoff between the amount of money spent on testing chips vs dealing with (e.g. replacing) faulty chips. Usually the best tradeoff is to ship chips with a small, but non-zero probability that the chip has a fault.

7.1.3 Physical Faults

7.1.3.1 Types of Physical Faults

[Figure: a good circuit with wires a, b, c, d, and six bad circuits: open, wired-AND bridging short, wired-OR bridging short, stronger-wins bridging short (b is stronger), short to VDD, short to GND]

7.1.3.2 Locations of Faults

Each segment of wire, poly, diffusion, via, etc is a potential fault location. Different segments affect different gates in the fanout. A potential fault location is a segment or segments where a fault at any position affects the same set of gates in the same way.

7.1.3.3 Layout Affects Locations

[Figure: two layouts of the same circuit with signals a through h; one layout has fault locations L1 to L4, the other has fault locations L1 to L5]

7.1.3.4 Naming Fault Locations

Two ways to name a fault location:
- pin-fault model: Faults are modelled as occurring on input and output pins of gates.
- net-fault model: Faults are modelled as occurring on segments of wires.

In E&CE 327, we'll use the net-fault model, because it is simpler to work with and is closer to what actually happens in hardware.

7.1.4 Detecting a Fault

To detect a fault, we compare the actual output of the circuit against the expected value.

7.1.4.1 Which Test Vectors will Detect a Fault?

Question: For the good circuit and faulty circuit shown below, which test vectors will detect the fault?

[Figure: good circuit and faulty circuit, each with inputs a, b, c and signals d, e]

Answer:

a   b   c   good   faulty
0   0   0   0      0
0   0   1   1      1
0   1   0   0      0
0   1   1   1      1
1   0   0   0      0
1   0   1   1      1
1   1   0   1      0
1   1   1   1      1

Sometimes multiple test vectors will catch the same fault. Sometimes a single test vector can catch multiple faults.

[Figure: good circuit and another faulty circuit, each with inputs a, b, c and signals d, e]

a   b   c   good   faulty
1   1   0   1      0

Another fault: the test vector 110 can catch both this fault and the previous one.

Note: Detect vs. diagnose. Testing detects faults. Testing does not diagnose which fault occurred.

7.1.5 Mathematical Models of Faults

Goal: develop a reliable and predictable technique for detecting faults in circuits.

Observations:
- The possible faults in a circuit are dependent upon the physical layout of the circuit.
- There is a very wide variety of possible faults.
- A single test vector can catch many different faults.

Need: a mathematical model for faults that is abstracted from the complexities of circuit layout and the plethora of possible faults, yet still detects most or all possible faults.

7.1.5.1 Single Stuck-At Fault Model

Two simplifying assumptions:
1. A maximum of one fault per tested circuit (hence "single")
2. All faults are either:
   (a) stuck-at 1: short to VDD
   (b) stuck-at 0: short to GND
   hence, "stuck at"

Example of Stuck-At Faults

[Figure: example circuit with signals a, b, c, d, i]

Question: If we consider all possible stuck-at faults, how many faulty circuits would we need to test for?

Question: If we consider only single-stuck-at faults, how many faulty circuits would we need to test for?


7.1.6 Generate Test Vector to Find a Mathematical Fault


Faults are detected by stimulating circuits (real, manufactured circuit, not a simulation!) with test-vectors and checking that the real circuit gives the correct output.

7.1.6.1 Algorithm

1. compute Karnaugh map for correct circuit
2. compute Karnaugh map for faulty circuit
3. find region of disagreement
4. any assignment in region of disagreement is a test vector that will detect fault
5. any assignment outside of region of disagreement will result in same output on both correct and faulty circuit
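The steps above can be sketched in Python by replacing each Karnaugh map with an exhaustive truth table: the region of disagreement is then simply the set of input assignments where the two tables differ. The example functions below are inferred from the truth table in section 7.1.4.1, whose good and faulty columns match ab + c and c respectively; the helper name is illustrative.

```python
from itertools import product

def detecting_vectors(good, faulty, num_inputs):
    # Steps 1-3: evaluate both circuits on every input assignment and keep
    # the assignments where the outputs disagree (the region of disagreement).
    return [v for v in product((0, 1), repeat=num_inputs)
            if good(*v) != faulty(*v)]

good = lambda a, b, c: (a & b) | c      # correct circuit: ab + c
faulty = lambda a, b, c: c              # faulty circuit: output reduces to c

print(detecting_vectors(good, faulty, 3))   # [(1, 1, 0)]
```

Step 4 says any vector in the returned list detects the fault; step 5 says every other vector gives the same output on both circuits.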

7.1.6.2 Example of Finding a Test Vector

[Figure: good circuit and faulty circuit, each with inputs a, b, c and signals d, e, with blank Karnaugh maps over a, b, c]

Question: Find a test vector that will detect the faulty circuit.

7.1.7 Undetectable Faults

Not all faults are detectable.

1. If a circuit is irredundant then all single stuck-at faults can be detected. A redundant circuit is one where one or more gates can be removed without affecting the functional behaviour.
2. If not trying to find all of the faults in a circuit, then a fault that you aren't looking for can mask a fault that you are looking for.

7.1.7.1 Redundant Circuitry

Some faults are undetectable. Undetectable stuck-at faults are located in redundant parts of a circuit.

Timing Hazards
- Static hazard
- Dynamic hazard

Timing hazards are often removed by adding redundant circuitry.

Redundant Circuitry

[Figure: irredundant circuit with inputs a, b, c and signals d, e, f, g; illustration of timing hazard, with signal transitions annotated on the wires]

Glitch on g is caused because the AND gate for e turns off before f turns on.

Redundant Circuitry
Question: Add one or more gates to the circuit so that the static hazard is guaranteed to be prevented, independent of the delay values through the gates.

[Figure: the circuit from the timing hazard illustration, with signal transitions annotated on the wires, and a blank Karnaugh map]

Redundant Circuitry
Question: Has the redundant circuitry introduced any undetectable faults? If so, identify an undetectable fault.

7.1.7.2 Curious Circuitry and Fault Detection

Curiously, the stuck-at fault at L1 is undetectable, but faults at either L2 or L3 are detectable.

[Figure: circuit with inputs a, b, c, output z, and fault locations L1, L2, L3]

fault   eqn       K-map   diff w/ ckt
L2@0    a (b c)
L2@1    a (b c)

7.2 Test Generation

7.2.1 A Small Example

[Figure: circuit with inputs a, b, c computing ab + bc, with fault locations L2, L4, L5]

fault     eqn   K-map   diff w/ ckt   test vectors
1) L2@1
2) L4@1
3) L5@1

7.2.2 Choosing Test Vectors

The goal of test vector generation is to find the smallest set of test vectors that will detect the faults of interest. Test vector generation requires analyzing the faults. We can simplify the task of fault analysis by reducing the number of faults that we have to analyze. Smith has examples of this in Figures 14.13 and 14.14.

7.2.2.1 Fault Domination

fault     eqn      K-map   diff w/ ckt   test vectors
1) L5@1   ab + c                         101, 001
2) L6@1   1                              101, 001, 100, 010, 000

Definition: dominates. f1 dominates f2: any test vector that detects f1 will also detect f2.

When choosing test vectors, we can ignore the dominated fault, but must keep the dominant fault.

Question: To detect both L5@1 and L6@1, can we ignore one of the faults?

Question: What would happen if we ignored the wrong fault?

7.2.2.2 Fault Equivalence

fault     eqn   K-map   diff w/ ckt
1) L1@1   b
2) L3@1   b

Definition: fault equivalence. f1 is equivalent to f2: f1 and f2 are detected by exactly the same set of test vectors. That is, all of the test vectors that detect f1 will also detect f2, and vice versa. When choosing test vectors we can ignore one of the faults and just include the other.

7.2.2.3 Gate Collapsing

A stuck-at-1 fault on the input to an OR gate is equivalent to a stuck-at-1 fault on the output of the OR gate.

Definition: Gate collapsing is the technique of looking at the functionality of a gate and finding equivalent faults between inputs and outputs.

Sets of collapsible faults for common gates:

gate   inputs   output
AND    @0       @0
OR     @1       @1

Question: What is the set of collapsible faults for a NAND gate?

[Figure: NAND gate]

7.2.2.4 Node Collapsing

Note: Node collapsing is relevant only for the pin-fault model.

7.2.2.5 Fault Collapsing Summary

When calculating the test-vectors to detect a set of faults, apply the fault collapsing techniques of:
- gate collapsing
- node collapsing (if using pin-fault model)
- general fault equivalence (intelligent collapsing)
- fault domination
to reduce the number of faults that you must examine.

7.2.3 Fault Coverage

Definition: Fault coverage is the percentage of detectable faults that are detected by a set of test vectors.

$$\text{FaultCoverage} = \frac{\text{DetectedFaults}}{\text{DetectableFaults}}$$

Some people's definition of fault coverage has a denominator of AllPossibleFaults, not just those that are detectable.

7.2.4 Test Vector Generation and Fault Detection

There are two ways to generate vectors and check results: built-in tests and scan testing. Both require:
- generate test vectors
- override normal datapath to send test-vectors, rather than normal inputs, as inputs to flops
- compare outputs of flops to expected result


7.2.5 Generate Test Vectors for 100% Coverage


In this section we will find the test vectors to achieve 100% coverage of single stuck-at faults for the circuit of the day. We will use a simple algorithm; there are much more sophisticated algorithms that are more efficient. The problem of test vector generation is often called Automatic Test Pattern Generation (ATPG) and continues to be an active area of research.

[Figure: example circuit with inputs a, b, c computing ab + bc, with fault locations L1 to L8, and its Karnaugh map]

Example Circuit with Fault Locations and Karnaugh Map

7.2.5.1 Collapse the Faults

Initial circuit with potential faults:

[Figure: example circuit annotated with the potential faults L1@0,1 through L8@0,1]

Gate Collapsing

gate   faults   kept fault

For each set of equivalent faults, we will keep the fault shown in bold and eliminate the other faults. A good heuristic for choosing which fault to keep: keep the fault closest to the output. The closer a fault is to the output, the easier it is to analyze its behaviour, because the equation for the output will be simpler.

Intelligent Collapsing
1. delete faults that we previously decided could be ignored
2. by intelligent analysis of circuit, find equivalent faults

[Figure: example circuit annotated with the potential faults L1@0,1 through L8@0,1]

7.2.5.2 Check for Fault Domination

fault     eqn       K-map   diff w/ ckt
1) L2@1   a + c
2) L3@1   b
3) L4@1   a + bc
4) L5@1   ab + c
5) L6@0   bc
6) L7@0   ab
7) L8@0   0
8) L8@1   1

Remove dominated faults

Current faults:

[Figure: example circuit annotated with the potential faults L1@0,1 through L8@0,1]

Dominated faults:

7.2.5.3 Required Test Vectors

Definition: required test vector. A test vector tv is required if there is a fault for which tv is the only test vector that will detect the fault.

fault     eqn       K-map   diff w/ ckt
1) L3@1   b
2) L4@1   a + bc
3) L5@1   ab + c
4) L6@0   bc
5) L7@0   ab

7.2.5.4 Faults Not Covered by Required Test Vectors

fault     eqn       K-map   diff w/ ckt
1) L4@1   a + bc
2) L5@1   ab + c

Test vector(s) required to catch these faults:

7.2.5.5 Order to Run Test Vectors

The order in which the test vectors are run is important because it can affect how long a faulty chip stays in the tester before the chip's fault is detected. The first vector to run should be the one that detects the most faults. Build a table for which faults each test vector will detect.

[Table: one row per fault, numbered 1) to 16) (L1@0, L1@1, L2@0, L2@1, ..., L8@0, L8@1), and one column per test vector (110, 010, 011, 101). A 1 marks each fault that the vector detects, and the final row, "Faults detected", totals each column.]

7.2.5.6 Summary of Technique to Find and Order Test Vectors


1. identify all possible faults
2. gate collapsing
3. node collapsing
4. intelligent collapsing
5. fault domination
6. determine required test vectors
7. choose minimal set of test vectors to detect remaining faults
8. order test vectors based on number of faults detected (NOTE: when iterating through this step, need to take into account faults detected by earlier test vectors)
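Step 8 can be sketched as a greedy loop that repeatedly picks the vector detecting the most not-yet-detected faults. This is an illustration only. The line structure in `simulate` (L1 = a, L2 = b stem, L3 = c, L4 and L5 = branches of b, L6 = L1·L4, L7 = L5·L3, L8 = L6 + L7) is an assumption inferred from the fault equations on the slides, and all function names are hypothetical.

```python
def simulate(a, b, c, fault=None):
    """Evaluate z = ab + bc with an optional single stuck-at fault (line, value)."""
    def line(name, value):
        # Override the line's value if this is the faulty line.
        return fault[1] if fault and fault[0] == name else value
    l1 = line("L1", a)
    l2 = line("L2", b)
    l3 = line("L3", c)
    l4 = line("L4", l2)
    l5 = line("L5", l2)
    l6 = line("L6", l1 & l4)
    l7 = line("L7", l5 & l3)
    return line("L8", l6 | l7)

VECTORS = ["110", "010", "011", "101"]
FAULTS = [(f"L{i}", v) for i in range(1, 9) for v in (0, 1)]

def detected_by(vector):
    a, b, c = (int(ch) for ch in vector)
    good = simulate(a, b, c)
    return {f"{n}@{v}" for n, v in FAULTS if simulate(a, b, c, (n, v)) != good}

def greedy_order(vectors):
    remaining = {v: detected_by(v) for v in vectors}
    order, seen = [], set()
    while remaining:
        # Count only faults that earlier vectors have not already detected.
        best = max(remaining, key=lambda v: len(remaining[v] - seen))
        order.append((best, sorted(remaining[best] - seen)))
        seen |= remaining.pop(best)
    return order, seen

order, seen = greedy_order(VECTORS)
for vector, new_faults in order:
    print(vector, len(new_faults), new_faults)
```

With the assumed circuit, vector 101 goes first because it detects six faults on its own; the later picks are ranked by how many new faults they add, which is why the note in step 8 matters.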

7.2.6 One Fault Hiding Another

[Figure: the circuit with inputs a, b, c and lines L1 to L8.]

Assume that we are not trying to detect all faults: L1 is viewed as not being at risk for faults, but L3 is at risk for faults.

[Figure: two drawings of the circuit (inputs a, b, c; output z) with lines L1 and L3 marked.]

Fault Hiding

[Figure: two drawings of the circuit (inputs a, b, c; output z) with lines L1 and L3 marked.]

Problem: If L1 is stuck-at 1, the test vectors that normally detect L3@0 will not detect L3@0.

In the presence of other faults, the set of test vectors to detect a fault will change.

fault(s)       eqn
L3@0           ab
L1@1, L3@0     b

[For each row, the slide also has K-map grids over a, b, c for the columns "K-map" and "Diff w/ ckt".]
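The hiding effect can be reproduced by brute force. The sketch below is an illustration that assumes the circuit is z = ab + bc with L1 on input a and L3 on input c, which is consistent with the equations in the table above; `simulate` and `differs` are hypothetical names.

```python
from itertools import product

def simulate(a, b, c, stuck=None):
    """z = ab + bc; `stuck` maps a line ('L1' = input a, 'L3' = input c) to 0 or 1."""
    stuck = stuck or {}
    a = stuck.get("L1", a)
    c = stuck.get("L3", c)
    return (a & b) | (b & c)

def differs(faults_x, faults_y):
    """Vectors 'abc' on which the circuit with faults_x and with faults_y disagree."""
    return {f"{a}{b}{c}" for a, b, c in product((0, 1), repeat=3)
            if simulate(a, b, c, faults_x) != simulate(a, b, c, faults_y)}

# Vectors that normally detect L3@0 (fault-free circuit versus L3@0 alone).
normal = differs({}, {"L3": 0})

# Vectors on which a chip with both L1@1 and L3@0 gives a wrong answer.
both = differs({}, {"L1": 1, "L3": 0})

# Vectors that can tell "L1@1 only" apart from "L1@1 and L3@0".
hidden = differs({"L1": 1}, {"L1": 1, "L3": 0})

print(sorted(normal), sorted(both), sorted(hidden))
```

Under these assumptions 011 is the only vector that normally detects L3@0, yet with L1 stuck at 1 the chip answers 011 correctly and misbehaves only on 010; once L1@1 is present, no vector distinguishes the chip with L3@0 from the chip without it.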

7.3 Scan Testing in General

7.3.1 Structure and Behaviour of Scan Testing


[Figure: Normal Circuit. The circuit under test between "another circuit #0" and "another circuit #1", with signals data_in(3..0) and zeta_in(3..0).]

[Figure: Circuit with Scan Chains Added. The circuit under test between "another circuit" and "yet another circuit", with scan chain 0 (mode0, scan_in0, scan_out0) and scan chain 1 (mode1, scan_in1, scan_out1).]

7.3.2 Scan Chains

[Figure: the Normal Circuit (signals data_in(3..0), zeta_in(3..0)) next to the Circuit with Scan Chains Added (additional signals mode0, scan_in0, scan_out0, mode1, scan_in1, scan_out1).]

7.3.2.1 Circuitry in Normal and Scan Mode

[Figure: the circuit under test with its two scan chains (mode0, scan_in0, scan_out0; mode1, scan_in1, scan_out1), drawn once in Normal Mode and once in Scan Mode.]

7.3.2.2 Scan in Operation

[Figure: Circuit under test with scan chains.]

[Figure: Sequence of load; test; unload, in three panels: Load Test Vector (1 cycle per bit); Run Test Vector Through Circuit; Unload Result (1 cycle per bit). The waveform shows clk, mode0, scan_in0 and scan_in1 carrying the current vector, and scan_out0 and scan_out1 carrying the current results.]

Unload and Load at Same Time

[Figure: Sequence of load; run; unload, in three panels: Unload Prev Result, Load Cur Test Vector (1 cycle per bit); Run Cur Test Vector Through Circuit; Unload Cur Result, Load New Test Vector (1 cycle per bit). The waveform shows clk, mode0, scan_out0 and scan_out1 carrying the previous results and then the current results, and scan_in0 and scan_in1 carrying the current vector and then the next test vector.]

7.3.2.3 Scan in Operation with Example Circuit


[Figure: Circuit under test (signals a, b, c, d, y, z) and Circuit under test with scan test circuitry (mode0, scan_in0, scan_out0, mode1, scan_in1, scan_out1).]

[Figure sequence, each panel showing the circuit and a clk/mode0 waveform: Start Loading Test Vector; Load (three panels); Run Test Vector; Test Values Propagate; Flop-In Result, Start (Un)loading Test Vector; Continue (Un)loading Test Vector (two panels); Finish (Un)loading Test Vector; Run Next Test Vector.]

7.3.3 Summary of Scan Testing

Adding scan circuitry

1. Registers around circuit to be tested are grouped into scan chains
2. Replace each flop with mux + flop
3. Flops and muxes wired together into scan chains
4. Each scan chain is connected to dedicated I/O pins for loading and unloading test vectors

Running test vectors

1. Put scan chain in scan mode
2. Load in test vector (one element of vector per clock cycle)
3. Put scan chain in normal mode
4. Run circuit for one clock cycle: load result of test into flops
5. Unload results of current test vector while simultaneously loading in next test vector (one element of vector per clock cycle)

7.3.4 Time to Test a Chip

If the length (number of flops) of a scan chain is n, then it takes 2n + 1 clock cycles to run a single test: n clock cycles to scan in the test vector, 1 clock cycle to execute the test vector, and n cycles to scan out the results. Once the results are scanned out, they can be compared to the expected results for a correctly working circuit.

If we run 2 or more tests (and chips generally are subjected to hundreds of thousands of tests), then we speed things up by scanning in the next test vector while we scan out the previous result.

ScanLength = number of flip-flops in a scan chain
NumVectors = number of test vectors in test suite
TimeScan = number of clock cycles to run test suite

$$\text{TimeScan} = \text{NumVectors} \times (\text{ScanLength} + 1) + \text{ScanLength}$$

7.3.4.1 Example: Time to Test a Chip

An 800 MHz chip has scan chains of length 20,000 bits, 18,000 bits, 21,000 bits, 22,000 bits, and two of 15,000 bits. 500,000 test vectors are used for each scan chain. The tests are run at 80% of full speed.

Question: Calculate the total test time.
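One way to set up this calculation is sketched below. It rests on an assumption the slide does not state: all scan chains are shifted in parallel through their own dedicated pins, so the longest chain determines the test time. If the chains were instead run one after another, the cycle counts would add. The function name `scan_cycles` is hypothetical.

```python
def scan_cycles(num_vectors: int, scan_length: int) -> int:
    """TimeScan = NumVectors * (ScanLength + 1) + ScanLength, in clock cycles."""
    return num_vectors * (scan_length + 1) + scan_length

chains = [20_000, 18_000, 21_000, 22_000, 15_000, 15_000]
num_vectors = 500_000
test_clock_hz = 0.80 * 800e6        # tests run at 80% of 800 MHz

# Assumption: chains run in parallel, so the longest chain dominates.
cycles = scan_cycles(num_vectors, max(chains))
seconds = cycles / test_clock_hz

print(cycles, "cycles,", round(seconds, 2), "seconds")
```

As a sanity check, `scan_cycles(1, n)` reduces to 2n + 1, the single-test figure quoted in the previous section. Under the parallel-chain assumption the result is 11,000,522,000 cycles, or roughly 17.2 seconds at 640 MHz.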

7.4 Boundary Scan and JTAG

Boundary scan originated as a technique to test wires on printed circuit boards (PCBs). The goal was to replace "bed-of-nails" style testing with a technique that would work for high-density PCBs (lots of small wires close together).

Now used to test both boards and chip internals. Used both on boundaries (I/O pins) and internal flops.


Boundary Scan with JTAG


Standardized by IEEE (1149) and previously by JTAG:

- 4 required signals (Scan Pins: TDI, TDO, TCK, TMS)
- 1 optional signal (Scan Pin: TRST)
- protocol to connect circuit under test to tester and other circuits
- state machine to drive test circuitry on chip
- Boundary Scan Description Language (BSDL): structural language used to describe which features of JTAG a circuit supports

JTAG circuitry is now commonly built into FPGAs and ASICs, or is part of a cell library. Rarely is a JTAG circuit custom-built as part of a larger part. So, you'll probably be choosing and using JTAG circuits, not constructing new ones.

Using JTAG circuitry is usually done by giving a description of your printed circuit board (PCB) and the JTAG components on each chip (in BSDL) to test generation software. The software then generates a sequence of JTAG commands and data that can be used to test the wires on the circuit board for opens and shorts.


JTAG Structure
[Figure: JTAG structure, in two views. High-level view: a chip whose circuit under test sits between the normal input pins and normal output pins, with scan registers, control, and pins TDI, TCK, TMS, TDO. Detailed view: blocks labelled BSR, BSC, BR, IR, IDCODE, IRC, Instruction Decoder, and TAP Controller.]

7.4.1 Scan Instructions

This is the set of required instructions; other instructions are optional.

EXTEST    Test board-level interconnect. Drive output pins of chip with hard-coded test vector. Sample results on inputs.
SAMPLE    Sample result data
PRELOAD   Load test vector
BYPASS    Directly connect TDI to TDO. This is used when several chips are daisy chained together to skip loading data into some chips.
IDCODE    Output manufacturer and part number

7.5 Built In Self Test

7.5.1 Block Diagram

[Figure: the circuit under test with inputs i_data(0..3), outputs o_data(0..2), a test generator driving d(0..3) under control of mode, and a result checker producing all_ok; drawn as Circuit in Normal Mode and as Circuit in Test Mode.]

Circuit w/ BIST in Normal Mode

[Figure: BIST block diagram: mode, test generator (LFSR) between i_data(0..3) and d(0..3), circuit under test, signature analyzers 0 to 2 on o_data(0..2) producing ok(0..2), and result checker producing all_ok.]

Circuit w/ BIST in Test Mode

[Figure: the same BIST block diagram, drawn in test mode.]

7.5.1.1 Components

Test Generator

[Figure: BIST block diagram (repeated).]

- generates a pseudo-random set of test vectors
- for n output bits, generates all vectors from 1 to $2^n - 1$ in a pseudo-random order
- built with a linear-feedback shift register (shift-register portion is the input flops)

Test Generator

[Figure: test generator LFSR with flop outputs q2, q1, q0.]

Question: Why not just use a counter to generate $1 \ldots 2^n - 1$?

Signature Analyzer

[Figure: BIST block diagram (repeated).]

- checks that the output it is examining has the correct results for the complete set of tests that are run
- only has a meaningful result at the end of the entire test sequence
- built with a linear-feedback shift register
- similar to a hash function or a lossy compression function
- if there are no faults, the signature analyzer will definitely say ok (no false negatives)
- if there is a fault, the signature analyzer might say ok or might say bad (false positives are possible)
- design tradeoff: more accurate signature analyzers require more hardware

Result Checker

[Figure: BIST block diagram (repeated).]

- signature analyzers output ok/bad on every clock cycle, but the result is only meaningful at the end of running the complete set of test vectors
- the result checker looks at test vector inputs to detect the end of the test suite and outputs all_ok if all signature analyzers report ok at that moment
- implemented as an AND gate

7.5.1.2 Linear Feedback Shift Register (LFSR)


Basically, a shift register (sequence of flip-flops) with the output of the last flip-flop fed back into some of the earlier flip-flops with XOR gates.

Design parameters:

- number of flip-flops
- external or internal XOR
- feedback taps (coefficients)
- external-input or self-contained
- reset or set

[Figure: LFSR Example: three flops (d0/q0, d1/q1, d2/q2) with reset and external input i.]

Example LFSRs
[Figure: four example 3-flop LFSRs (d0/q0, d1/q1, d2/q2): External-XOR, input, reset; External-XOR, no input, set; Internal-XOR, input, set; Internal-XOR, input, reset.]

In E&CE 327, we use internal-XOR LFSRs, because the circuitry matches the mathematics of Galois fields. External-XOR LFSRs work just fine, but they are more difficult to analyze, because their behaviour can't be treated as Galois fields.

7.5.1.3 Maximal-Length LFSR

Definition (maximal-length linear feedback shift register): An LFSR that outputs a pseudo-random sequence of all representable bit-vectors except 0...00.

Definition (pseudo random): The same elements in the same order every time, but the relationship between consecutive elements is apparently random.

Maximal-length linear feedback shift registers are used to generate test vectors for built-in self test.

Maximal-Length LFSR Circuits


The figures below illustrate the two maximal-length internal-XOR linear feedback shift registers that can be constructed with 3 flops.

[Figure: two maximal-length internal-XOR LFSRs, each with three flops (d0/q0, d1/q1, d2/q2) and a set input; they differ in where the XOR sits.]

Question: Why do maximal-length LFSRs not generate the test vector 0...00?

Maximal Length LFSR Characteristics


Maximal-length LFSRs:

- set to all 1s initially
- self contained (no external i input)

[Figure: Timing diagram for a 3-flop maximal-length LFSR, with signals reset, clk, d0, q0, d1, q1, q2, and val (first values 7, 6).]
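A small simulation makes the maximal-length property concrete. The sketch below is illustrative and assumes one particular reading of the internal-XOR structure: flop q0 takes the feedback from the last flop, and flop qk (k > 0) takes q(k-1), XORed with the feedback when k is a tap. The names `lfsr_step` and `sequence` are hypothetical.

```python
def lfsr_step(state, taps, i=0):
    """One clock of an internal-XOR LFSR.

    state[k] is flop q_k; taps holds the indices k (0 < k < len(state))
    whose flop input XORs in the feedback from the last flop.
    """
    fb = state[-1]
    nxt = [fb ^ i]
    for k in range(1, len(state)):
        nxt.append(state[k - 1] ^ (fb if k in taps else 0))
    return tuple(nxt)

def sequence(taps, n_flops=3):
    """States visited from the all-1s start until the first repeat."""
    state = (1,) * n_flops
    seen = []
    while state not in seen:
        seen.append(state)
        state = lfsr_step(state, taps)
    return seen

for taps in ({1}, {2}, {1, 2}):
    states = ["".join(map(str, s)) for s in sequence(taps)]
    print(sorted(taps), len(states), states)

# The all-0s state feeds back only 0s, so it can never be left.
print(lfsr_step((0, 0, 0), {2}))
```

Under this reading, a single tap at flop 1 or at flop 2 visits all seven non-zero states before repeating, while taps at both flops give a cycle of only four states.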

7.5.2 Test Generator

[Figure: BIST block diagram (repeated).]

The test generator component is a maximal-length LFSR ...

[Figure: 3-flop maximal-length LFSR (d0/q0, d1/q1, d2/q2) with a set input.]

Test Generator

The test generator component is a maximal-length LFSR with multiplexors on the inputs to each flip-flop. In test mode, the data input on each flip-flop is connected to the output of the previous flip-flop. In normal mode, the input of each flip-flop is connected to the environment.

[Figure: A test generator, reset not shown: a 3-flop LFSR (d0/q0, d1/q1, d2/q2) in which mode selects between the environment inputs i_d(0), i_d(1), i_d(2) and the LFSR connections.]

7.5.3 Signature Analyzer

There are four things that change between different signature analyzers:

- number of flops (more flops = more area, more accuracy)
- choice of feedback taps: a good choice can improve accuracy (more isn't necessarily better)
- bubbles on input to AND gate for ok: determined by expected result from simulating test sequence through circuit under test and LFSR of analyzer.

[Figure: BIST block diagram (repeated).]

Signature Analyzer

This circuit:

- Two flops; most analyzers use more (the HP boards in the 1970s used 37 flops!)
- Feedback taps on both flops. Different signature analyzers have different configurations of feedback taps.
- Also contains ok tester (AND gate).
- Expected output of LFSR at end of test sequence is: q0=1 and q1=1, or 01. (We know this because of bubble on AND gate. To see why this is the expected output of the signature analyzer, we would need to know the correct sequence of outputs of the circuit under test.)

[Figure: two-flop signature analyzer with reset, input i, flops d0/q0 and d1/q1, and an AND gate producing ok.]

Signature Analyzer Timing

[Timing diagram with signals reset, clk, i, d0, q0, d1, q1. After reset, q0 = q1 = 0; each row below gives the input on i and the values clocked into q0 and q1 on that cycle.]

i     q0 (from d0)              q1 (from d1)
i6    i6                        0
i5    i5                        i6
i4    i4⊕i6                     i5⊕i6
i3    i3⊕i5⊕i6                  i4⊕i5
i2    i2⊕i4⊕i5                  i3⊕i4⊕i6
i1    i1⊕i3⊕i4⊕i6               i2⊕i3⊕i5⊕i6
i0    i0⊕i2⊕i3⊕i5⊕i6            i1⊕i2⊕i4⊕i5

The diagram abbreviates these terms by their indices: 356 = i3⊕i5⊕i6, 2356 = i2⊕i3⊕i5⊕i6, etc.

7.5.4 Result Checker

[Figure: BIST block diagram (repeated).]

The purpose of the result checker is to check the ok circuit at the end of the test sequence.

[Figure: result checker with inputs reset, q0, q1, q2, and ok, producing all_ok.]

7.5.5 Arithmetic over Binary Fields

Galois Fields!

- Two operations: + and ×
- Two values: 0 and 1
- Bit vectors and shift-registers are written as polynomials in terms of x.

Addition
+ represents XOR

expression   result
0 + 0        0
0 + 1        1
1 + 0        1
1 + 1        0
x + x        0

Multiplication
× represents concatenating shift registers

expression       result
$x^4 \times 1$     $x^4$
$x^2 \times x^3$   $x^5$

Example

Calculate $(x^3 + x^2 + 1) \times (x^2 + x)$:

$$x^2 \times (x^3 + x^2 + 1) = x^5 + x^4 + x^2$$
$$x \times (x^3 + x^2 + 1) = x^4 + x^3 + x$$
$$\text{sum} = x^5 + x^3 + x^2 + x$$
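The same multiplication can be done with shifts and XORs. The sketch below is illustrative: it stores a polynomial as an integer bit mask (bit k is the coefficient of x^k), which is a representation chosen here rather than one the slides prescribe, and `gf2_mul` and `poly_str` are hypothetical helper names.

```python
def gf2_mul(a: int, b: int) -> int:
    """Multiply two polynomials over GF(2), stored as bit masks."""
    result = 0
    while b:
        if b & 1:
            result ^= a   # addition is XOR, so equal terms cancel
        a <<= 1           # multiply a by x
        b >>= 1
    return result

def poly_str(p: int) -> str:
    """Render a bit mask as a polynomial in x."""
    terms = ["1" if k == 0 else "x" if k == 1 else f"x^{k}"
             for k in range(p.bit_length() - 1, -1, -1) if (p >> k) & 1]
    return " + ".join(terms) or "0"

product = gf2_mul(0b1101, 0b110)   # (x^3 + x^2 + 1) * (x^2 + x)
print(poly_str(product))
```

The two x^4 terms cancel under XOR, which is why the product has no x^4 term.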


7.5.6 Shift Registers and Characteristic Polynomials


Each linear feedback shift register has a corresponding characteristic polynomial.

From polynomials to hardware:

- The maximum exponent denotes the number of flops
- The other exponents denote the flops that tap off of the feedback line from the last flop
- From the characteristic polynomial, we cannot determine whether the shift register has an external input. Stated another way, two shift registers that are identical except that one has an external input and the other does not will have the same characteristic polynomial.

Shift Regs and Polynomials

[Figures: six example shift registers, each drawn with reset and with positions $x^0$, $x^1$, $x^2$, $x^3$ (and $x^4$ for the last) marked along the flops, labelled with their characteristic polynomials:]

$p(x) = x^3$
$p(x) = x^3 + x$
$p(x) = x^3 + 1$
$p(x) = x^3 + x + 1$
$p(x) = x^3 + x^2 + x + 1$
$p(x) = x^4 + x^3 + x + 1$

7.5.6.1 Circuit Multiplication

Redoing the multiplication example $(x^2 + x) \times (x^3 + x^2 + 1)$ as circuits:

[Figure: shift-register circuits for $x^2 + x$ and $x^3 + x^2 + 1$; their product drawn as $x \times (x^3 + x^2 + 1) + x^2 \times (x^3 + x^2 + 1)$; and the resulting circuit for $x^5 + x^3 + x^2 + x$.]

7.5.7 Bit Streams and Characteristic Polynomials


A bit stream, or bit sequence, can be represented as a polynomial. The oldest (first) bit in a sequence of n bits is represented by $x^{n-1}$ and the youngest (last) bit is $x^0$.

The bit sequence 1010011 can be represented as $x^6 + x^4 + x + 1$:

$$1010011 = 1x^6 + 0x^5 + 1x^4 + 0x^3 + 0x^2 + 1x^1 + 1x^0 = x^6 + x^4 + x + 1$$

7.5.8 Division

With rules for multiplication and addition, we can define division. A fundamental theorem of division defines q and r to be the quotient and remainder, respectively, of m ÷ p iff:

$$m(x) = q(x) \times p(x) + r(x)$$

Long Division

In Galois fields, we do division just as with long division in elementary school.

Given:
$m(x) = x^6 + x^4 + x^3$
$p(x) = x^4 + x$

Calculate the quotient, q(x), and remainder, r(x), for m(x) ÷ p(x):

                x^2 + 1
          _________________________________________________
x^4 + x ) x^6 + 0x^5 + 1x^4 + 1x^3 + 0x^2 + 0x^1 + 0x^0
          x^6 +               1x^3
          -------------------------
                       1x^4
                       1x^4 +                      x
                       -----------------------------
                                                   x

Quotient $q(x) = x^2 + 1$
Remainder $r(x) = x$

Long Division (Check)

Check result:

$$m(x) = q(x) \times p(x) + r(x)$$
$$= (x^2 + 1) \times (x^4 + x) + x$$
$$= x^6 + x^3 + x^4 + x + x$$
$$= x^6 + x^4 + x^3$$
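Long division over GF(2) reduces to repeated shift-and-XOR. The sketch below is illustrative and uses an integer bit mask for each polynomial (bit k is the coefficient of x^k), a representation assumed here for convenience; `gf2_divmod` is a hypothetical name.

```python
def gf2_divmod(m: int, p: int):
    """Return (quotient, remainder) of m(x) / p(x) over GF(2)."""
    if p == 0:
        raise ZeroDivisionError("division by the zero polynomial")
    q = 0
    while m.bit_length() >= p.bit_length():
        shift = m.bit_length() - p.bit_length()
        q |= 1 << shift     # next quotient term is x^shift
        m ^= p << shift     # subtracting is the same as XOR
    return q, m

m = 0b1011000   # x^6 + x^4 + x^3
p = 0b10010     # x^4 + x
q, r = gf2_divmod(m, p)
print(bin(q), bin(r))
```

For the worked example this gives q = 0b101 (x^2 + 1) and r = 0b10 (x), matching the long division above.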


7.5.9 Signature Analysis: Math and Circuits


The input to the signature analyzer is a message, m(x), which is a sequence of n bits represented as a polynomial.

After n shifts through an LFSR with l flops:

- The sequence of output bits forms a quotient, q(x), of length n − l
- The flops in the analyzer form a remainder, r(x), of length l

$$m(x) = q(x) \times p(x) + r(x)$$

The remainder is the signature.
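The correspondence between the circuit and the division can be checked by simulating both. The sketch below is illustrative. It assumes an internal-XOR LFSR in which the external input is XORed with the feedback into flop 0 and flop k holds the coefficient of x^k, and it picks p(x) = x^3 + x^2 + 1 and the message 1010011 arbitrarily. The names `analyzer` and `gf2_divmod` are hypothetical.

```python
def analyzer(bits, taps, n_flops):
    """Shift `bits` (oldest first) through an internal-XOR LFSR that starts at 0."""
    state = [0] * n_flops
    outputs = []
    for i in bits:
        fb = state[-1]
        outputs.append(fb)          # bit leaving the last flop this cycle
        state = [fb ^ i] + [state[k - 1] ^ (fb if k in taps else 0)
                            for k in range(1, n_flops)]
    return state, outputs

def gf2_divmod(m, p):
    q = 0
    while m.bit_length() >= p.bit_length():
        shift = m.bit_length() - p.bit_length()
        q |= 1 << shift
        m ^= p << shift
    return q, m

bits = [1, 0, 1, 0, 0, 1, 1]        # m(x) = x^6 + x^4 + x + 1
taps, n_flops = {2}, 3              # p(x) = x^3 + x^2 + 1
p = 0b1101

state, outputs = analyzer(bits, taps, n_flops)
signature = sum(bit << k for k, bit in enumerate(state))
quotient = int("".join(map(str, outputs[n_flops:])), 2)

q, r = gf2_divmod(int("".join(map(str, bits)), 2), p)
print(bin(signature), bin(r), bin(quotient), bin(q))
```

Under these assumptions the flops end up holding the remainder, and the n − l bits that leave the last flop after the first l shifts spell out the quotient.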


Test Generation: Math and Circuits


The mathematics for an LFSR without an input i:

- same polynomial as if the circuit had an input
- input sequence is all 0s

Input Streams and Error Polynomials


An input stream with an error can be represented as m(x) + e(x)

- e(x) is the error polynomial
- bits in the message that are flipped have a coefficient of 1 in e(x)

$$m(x) + e(x) = q'(x) \times p(x) + r'(x)$$

Input Streams and Error Polynomials


The error e(x) will be detected if it results in a different signature (remainder). m(x) and m(x) + e(x) will have the same remainder iff

$$e(x) \bmod p(x) = 0$$

That is, e(x) must be a multiple of p(x). The larger p(x) is, the smaller the chances that e(x) will be a multiple of p(x).
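A short experiment shows both halves of this claim. The sketch below is illustrative: the message, the error patterns, and the two divisor polynomials are arbitrary choices, and polynomials are stored as integer bit masks (bit k is the coefficient of x^k). The names `gf2_mod` and `undetected` are hypothetical.

```python
def gf2_mod(m: int, p: int) -> int:
    """Remainder of m(x) divided by p(x) over GF(2)."""
    while m.bit_length() >= p.bit_length():
        m ^= p << (m.bit_length() - p.bit_length())
    return m

p = 0b1111                 # x^3 + x^2 + x + 1
m = 0b1110000              # a 7-bit message
e_multiple = p << 2        # x^2 * p(x): a multiple of p(x)
e_other = 0b0001011        # not a multiple of p(x)

same = gf2_mod(m ^ e_multiple, p) == gf2_mod(m, p)     # error escapes
different = gf2_mod(m ^ e_other, p) != gf2_mod(m, p)   # error is caught

def undetected(p: int, n_bits: int) -> int:
    """Count the non-zero n-bit error patterns that are multiples of p(x)."""
    return sum(1 for e in range(1, 1 << n_bits) if gf2_mod(e, p) == 0)

print(same, different)
print(undetected(0b111, 7), undetected(0b1111, 7), "of", (1 << 7) - 1)
```

For 7-bit streams, a degree-2 divisor lets 31 of the 127 non-zero error patterns through, while a degree-3 divisor lets through 15, which illustrates why a larger p(x) misses fewer errors.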


BIST for a Simple Circuit


Outline of steps to see if a fault will be detected by BIST:

1. Output sequence from test generator
2. Output sequence from correct circuit
3. Remainder for signature analyzer with correct output sequence
4. Output sequence from faulty circuit
5. Remainder for signature analyzer with faulty output sequence
6. Compare correct and faulty remainder, if different then fault detected

Components
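[Figure: the three components: the circuit under test (inputs a, b, c; lines L1 to L8), a 3-flop test generator (t0, t1, t2), and a 3-flop signature analyzer (r0, r1, r2).]

[Figure: the assembled BIST circuits, with t0, t1, t2 driving a, b, c and the output z feeding the signature analyzer r0, r1, r2; drawn for the correct circuit and for the faulty circuit.]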

Question: Determine if L2@1 will be detected.

Equation for correct circuit: ab + bc
Equation for faulty circuit: a + c

Test generation sequence (the technique is to shift, then compute the result of the XORs) and output sequences for the correct and faulty circuits:

t0 (a)  t1 (b)  t2 (c)  correct z  faulty z
1       1       1       1          1          initial values = 1
1       1       0       1          1
0       1       1       1          1
1       0       0       0          1
0       1       0       0          0
0       0       1       0          1
1       0       1       0          1
1       1       1                             final values are repeat of initial values

Signature analyzer sequence for correct circuit (z is the output sequence from the correct circuit):

z   r0  r1  r2
    0   0   0    initial values = 0
1   1   0   0
1   1   1   0
1   1   1   1
0   1   0   0
0   0   1   0
0   0   0   1
0   1   1   1    remainder

Signature analyzer sequence for faulty circuit (z is the output sequence from the faulty circuit):

z   r0  r1  r2
    0   0   0    initial values = 0
1   1   0   0
1   1   1   0
1   1   1   1
1   0   0   0
0   0   0   0
1   1   0   0
1   1   1   0    remainder

The correct and faulty remainders differ, so L2@1 is detected.
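The six-step outline can be run end to end in a few lines. The sketch below is illustrative. The two LFSR structures are assumptions reverse-engineered from the sequences on the slide: a test generator with a single tap at flop 2 (p(x) = x^3 + x^2 + 1) starting at 111, and a signature analyzer with taps at flops 1 and 2 (p(x) = x^3 + x^2 + x + 1) starting at 000. The names `lfsr_step` and `bist_signature` are hypothetical.

```python
def lfsr_step(state, taps, i=0):
    """One clock of an internal-XOR LFSR; state[k] is flop k."""
    fb = state[-1]
    return tuple([fb ^ i] + [state[k - 1] ^ (fb if k in taps else 0)
                             for k in range(1, len(state))])

def bist_signature(circuit, n_vectors=7):
    t = (1, 1, 1)      # test generator is set to all 1s
    r = (0, 0, 0)      # signature analyzer is reset to all 0s
    zs = []
    for _ in range(n_vectors):
        z = circuit(*t)                 # t0, t1, t2 drive a, b, c
        zs.append(z)
        r = lfsr_step(r, {1, 2}, z)     # analyzer shifts in the output z
        t = lfsr_step(t, {2})           # generator advances to next vector
    return zs, r

def correct(a, b, c):
    return (a & b) | (b & c)            # ab + bc

def faulty(a, b, c):
    return a | c                        # L2@1: a + c

z_ok, sig_ok = bist_signature(correct)
z_bad, sig_bad = bist_signature(faulty)
print(z_ok, sig_ok)
print(z_bad, sig_bad)
print("fault detected:", sig_ok != sig_bad)
```

With these assumed structures the simulation reproduces the two output sequences on the slide, and the two final signatures differ in r2.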

7.6 Scan vs Self Test

Scan                               Self Test
less hardware                      more hardware
slower                             faster
well defined coverage              ill defined coverage
test vectors are easy to modify    test vectors are hard to modify

Chapter 8 Review
This chapter lists the major topics of the term. The Topics List section for each major area is meant to be relatively complete.

8.1 Overview of the Term

The purely digital world

- VHDL
- design and optimization methods
- functional verification
- performance analysis

Analog world effects in the digital

- timing analysis
- power
- faults and testing

8.2 VHDL

8.2.1 VHDL Topics

- simple syntax and semantics: things that you should know simply by having done the labs and project
- behavioural semantics of VHDL
- synthesis semantics of VHDL
- synthesizable and unsynthesizable code

8.2.2 VHDL Example Problems

- identify whether a particular signal will be the output of combinational circuitry or a flop
- identify whether a particular process is combinational or clocked
- legal, synthesizable, and good code
- perform delta-cycle simulation of VHDL
- perform RTL simulation of VHDL
- identify whether two VHDL fragments have same behaviour
- match VHDL code with waveforms
- match VHDL code with hardware
- choose the VHDL fragment that generates smaller or faster hardware

8.3 RTL Design Techniques

8.3.1 Design Topics

- coding guidelines
- generic FPGA hardware
- area estimation
- finite state machines: implicit, explicit-current, explicit-current+next
- from algorithm to hardware
- dependency graph
- dataflow diagram
- scheduling
- input/output allocation
- register allocation
- datapath allocation
- hardware block diagram
- state machine
- memory dependencies
- memory arrays and dataflow diagrams

8.3.2 Design Example Problems

- choose design guidelines to follow in different situations
- estimate area to implement a circuit in an FPGA
- calculate resource usage for a dataflow diagram
- calculate performance data for a dataflow diagram
- given an algorithm, design a dataflow diagram
- given a dataflow diagram, design the datapath and finite state machine
- optimize a dataflow diagram to improve performance or reduce resource usage
- given a dataflow diagram, calculate the clock period that will result in the maximum performance

8.4 Functional Verification

8.4.1 Verification Topics

- test cases
- measuring coverage
- time for verification
- test benches
- assertions
- coverage monitors
- relational specification
- functional specification
- boundary conditions / corner cases

8.4.2 Verification Example Problems

- choose first cases to test
- identify corner cases
- choose technique to detect bug (test case, assertion/test bench)
- determine whether a code change will cause a bug
- identify a test case and either assertion or test bench to catch a bug

8.5 Performance Analysis and Optimization

8.5.1 Performance Topics

- time to execute a program
- definition of performance
- speedup
- n% faster
- calculating performance of different tasks and of average task
- choosing which task to optimize to best improve overall performance
- CPI calculations
- performance increase over time
- design tradeoffs (CPI vs NumInsts vs ClockSpeed vs time-to-market)
- MIPs calculations
- Clock speed vs. performance
- Optimality
- performance / area tradeoffs

8.5.2 Performance Example Problems

- calculate performance / area tradeoffs
- calculate performance / time tradeoffs
- compare performance data between products
- evaluate performance criteria

8.6 Timing Analysis

8.6.1 Timing Topics

- circuit parameters that affect delay
- clock period
- clock skew
- clock jitter
- propagation delay
- load delay
- setup time
- hold time
- clock-to-Q time
- elmore timing model
- derating factors
- timing analysis of latch
- timing analysis of master-slave flip-flop
- timing analysis of hierarchical storage device
- critical path and false path: algorithm to find critical path; algorithm to determine if path is false or critical; signal assignment to exercise critical path

8.6.2 Timing Example Problems

- timing parameters for minimum clock period
- timing parameters for hold constraint
- find the critical path and assignment to exercise it
- compute elmore delay constant
- compare accuracy of different timing models
- determine if a storage device will work correctly
- compute timing parameters of storage device
- identify timing violation, suggest remedy
- suggest design change to increase clock speed

8.7 Power

8.7.1 Power Topics

- power vs energy
- equations for power
- dynamic power
- static power
- switching power
- short circuit power
- leakage power
- activity factor
- leakage current
- threshold voltage
- supply voltage
- analog power reduction techniques
- rtl power reduction techniques
- data encoding
- clock gating

8.7.2 Power Example Problems

- predict effect of new fabrication process on power
- predict effect of environment change (temp, supply voltage, etc) on power consumption
- predict effect of design change on power consumption (capacitance, activity factor)
- design data-encoding scheme for a circuit, predict effect on power consumption
- design clock gating scheme for a circuit, predict effect on power consumption
- assess validity of various power- or energy-consumption metrics

8.8 Testing

8.8.1 Testing Topics

- causes of faults
- locations of faults
- physical faults
- single stuck-at fault model
- testable / untestable fault
- economics of testing
- fault coverage
- test vector generation
- order test vectors to reduce test time
- behaviour of a scan chain
- time to run a scan test
- JTAG
- built-in self-test
- linear feedback shift register
- signature analyzer
- Galois fields
- process and time to run a BIST test

8.8.2 Testing Example Problems

- compute optimal amount of testing to maximize profits
- compute coverage for a given set of test vectors
- find test vectors to catch a set of faults, choose order to run test vectors
- determine if a fault is detectable
- choose an LFSR to use for BIST test generation
- choose an LFSR to use for BIST signature analysis
- determine if a given BIST will catch a given fault
- determine probability that a given BIST technique will report that a faulty circuit is correct
- determine if a given fault-testing scheme will detect a physical fault
- match LFSR to characteristic polynomial
- match BIST hardware to Galois mathematics
- perform Galois field mathematics, compare to waveforms

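8.9 Formulas to be Given on Final Exam

$$T = \frac{Ins \times C}{F}$$

$$Pf = \frac{W}{T}$$

$$S = \frac{T_1}{T_2}$$

$$M = \frac{F / 10^6}{\sum_{i=0}^{n} (PI_i \times C_i)}$$

Formulas II

$$P = \left(\tfrac{1}{2} \times A \times C_L \times V^2 \times F\right) + (\tau \times A \times V \times I_{Sh} \times F) + (V \times I_L)$$

$$q = 1.60218 \times 10^{-19}\ \mathrm{C}$$

$$k = 1.38066 \times 10^{-23}\ \mathrm{J/K}$$

$$F \propto \frac{(V - V_{Th})^2}{V}$$

$$I_L \propto e^{\frac{-q \times V_{Th}}{k \times T}}$$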