Distribution Statement Pending

SERGE LEEF
PROGRAM MANAGER
DARPA/MTO

WORKSHOP INTRO: ASIC FUNCTIONAL VERIFICATION

Growing Gap Between Available Gates and Design Capability

The engineers are not becoming smarter…


DISTRIBUTION A. Approved for public release: distribution unlimited
Verification Gap Has Been Growing for Decades

[Figure: verification gap over time, annotated "30+".]

Source: Mentor Graphics, 2009
Verification Methodologies Are Severely Strained

• Unpredictable, iterative loop during timing closure and system integration/test phase
• Poor partitioning decisions at the front end of the process are impossible to overcome during the design
• Functional verification is strained by even today’s designs; how to verify multi-discipline systems with billions of gates?
• Even though software is a key, growing system component, it is only partially included in hardware verification phases due to insufficient simulation speed
• Interaction with the outside world through sensors and actuators is rarely an integral part of the flow

Source: Mentor Graphics, 2009
% of ASIC/IC Project Time Spent in Verification

[Figure: histogram of design projects (0% to 30%) by percentage of ASIC/IC project time spent in verification, in bins from >0%-20% to >80%, for 2014, 2016, 2018, and 2020.]

2014: Average 57%
2016: Average 54%
2018: Average 53%
2020: Average 56%

Source: Wilson Research Group and Mentor, A Siemens Business, 2020 Functional Verification Study
Next-Gen: Abstraction + Reduced Order Models + HPC + Cloud Scaling
• Simulation reached its physics limits decades ago and relies on Moore’s law for speedups
• Parallel simulation has not been revisited since the emergence of cloud computing
• Machine learning has not yet been employed to drive creation of faster models
• Need to rethink simulation in light of advances in ILA, ML, HPC, and cloud

[Figure: digital and analog design flows (functional verification, logic synthesis, layout generation). SOTA digital simulation (Verilator, UVM support, verification IP) moves to next-gen digital simulation built on instruction-level abstraction, surrogate models, cloud scaling, and scalable HPC. SOTA analog simulation (XYCE) moves to next-gen analog simulation.]

Analog next-gen simulation is not a part of this presentation.
Convergence of Technologies to Drive Next-Gen Simulation Advances

[Figure: converging technologies, including cloud scaling and reduced order models.]


Enabling Linear Scaling / Hyperscaling of Digital Simulation

[Figure: 42 Years of Microprocessor Trend Data, 1970 to 2020, log scale. Transistor count (per chip) plotted against digital logic simulation speed (cps), with the "Simulation Performance Gap" between them.]

Source: https://github.com/karlrupp/microprocessor-trend-data

• Combine limitless and elastic compute and storage available on the cloud with one or more of the following innovations:
    • Simulation engine novel algorithms
    • Parallel partitioning schemes
    • High-performance-compute (HPC) architectures and programmable cloud-based FPGA fabric
    • ML-driven simulation partitioning

• Chip development timelines are bottlenecked by functional verification, where simulation speed is center-stage
• Simulation speed grew with Moore-driven platform performance to ~1K cps
• Post-Moore, the architectural direction is multicore/cloud, but modern simulation algorithms are not designed to take advantage of distributed computational fabric

[Figure: novel algorithms, parallelization, cloud FPGA, and ML partitioning, labeled 1000X.]

Source: http://users.ece.utexas.edu/~valvano/Volume1/
VERIFYING ACCELERATED COMPUTING PLATFORMS: CHALLENGES AND OPPORTUNITIES

The views, opinions and/or findings expressed are those of the author and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.

BRUCEK KHAILANY
SENIOR DIRECTOR OF VLSI RESEARCH
NVIDIA

Distribution Statement A - Approved for public release. Distribution is unlimited.
DESIGN COMPLEXITY
• More transistors: improved capability AND implementation effort [Electronics 1965]
• Complex SoCs that include feature-rich CPUs, GPUs, I/Os, and many accelerators

[Figure: NVIDIA Xavier SoC [CES 2018]]

© NVIDIA 2021
DESIGN COMPLEXITY
• Typical development timeline: 3-5 years from R&D to product
• Design and especially verification dominate implementation effort
    • Majority of all digital IC design effort
    • Limits which features make it into each SoC
• Benefits of dramatically lower design and verification effort
    • Overlap architect & implement phases for faster time-to-market
    • Get more features into each SoC
VERIFICATION APPROACHES (1)

• Constrained random verification is the main workhorse of ASIC verification

[Figure: testbench containing a stimulus generator, a golden model, the DUT, and a checker.]

• Challenges
• Schedule and productivity
• Complexity of building and debugging testbenches
• Getting to signoff quality by iteratively closing coverage
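
The constrained-random loop described above can be sketched in a few lines. This is a minimal illustration, not a real testbench: the 8-bit ALU, the planted bug in `dut`, and the coverage bins are all hypothetical stand-ins chosen to keep the example small.

```python
import random

def golden_model(a, b, op):
    """Reference behavior of a hypothetical 8-bit ALU."""
    if op == "add":
        return (a + b) & 0xFF
    if op == "sub":
        return (a - b) & 0xFF
    return a & b  # "and"

def dut(a, b, op):
    """Stand-in for the design under test, with a planted bug on borrow."""
    result = golden_model(a, b, op)
    if op == "sub" and a < b:
        return (result + 1) & 0xFF  # planted bug, for illustration only
    return result

def random_stimulus(rng):
    """Constrained-random generator: 8-bit operands, op drawn from the legal set."""
    return rng.randrange(256), rng.randrange(256), rng.choice(["add", "sub", "and"])

def run_crv(seed=0, max_tests=10_000):
    """Generate stimulus until every coverage bin is hit or the budget runs out."""
    rng = random.Random(seed)
    # Coverage bins: each op crossed with whether a < b.
    goal = {(op, lt) for op in ("add", "sub", "and") for lt in (False, True)}
    covered, failures, tests = set(), [], 0
    while covered != goal and tests < max_tests:
        a, b, op = random_stimulus(rng)
        covered.add((op, a < b))
        if dut(a, b, op) != golden_model(a, b, op):  # the checker
            failures.append((a, b, op))
        tests += 1
    return tests, covered == goal, failures
```

Calling `run_crv()` returns the number of tests used, whether coverage closed, and the failing stimuli. Because the planted bug sits inside one of the coverage bins, closing coverage is what forces the bug to be exercised, which is the reason signoff is tied to coverage closure.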
VERIFICATION APPROACHES (2)

FORMAL METHODS

• Key use cases
    • Equivalence checking
    • Property checking
• Advantages:
    • Complete state space exploration
    • Efficient debug, quick testing
• Challenges
    • Capacity limits
    • Need experts to drive tools to closure

EMULATION

• Key use cases
    • Verifying HW with real SW
    • Critical for developing SW drivers, integrating SW & HW
RESEARCH OPPORTUNITIES

• Raising level of abstractions for design and verification
• Accelerating RTL simulation
• Machine learning

[Figure: DATA, DEEP NEURAL NETWORK, PROGRAM]
RAISING DESIGN LEVEL OF ABSTRACTION
• Methodologies explored during DARPA CRAFT (2016-2019)

APPROACH
• Use higher-level languages for hardware design (and verification)
    • C++
    • Other groups exploring Chisel/Python/other
• Use tools/IRs to lower to Verilog
    • e.g. HLS
• Use Libraries/Generators
    • e.g. MatchLib, hlslibs.org

ADVANTAGES
• Works great for HW units needing agile design, changing features
• …especially if most verification can leverage native simulation in high-level languages

CHALLENGES
• Need production-ready formal equivalence tools for verifying generated RTL vs. higher-level models
• High bar for productivity improvements
• Significant effort to replace workflows on existing products
ACCELERATING RTL SIMULATION

• Time to revisit GPU-accelerated computing for faster RTL simulations?

• Previous research: Strong scaling single RTL sims for lower latency
(i.e. reduced time-to-solution for a single simulation)
• Gate-level simulation
• 4-60x [Chatterjee et al., DATE 2009], 5-270x [Zhu et al., ToDAES 2011]
• RTL simulation
• 20-50x [Qian et al., ICCAD 2011], 2-15x [Vinco et al., DAC 2012], + commercial efforts

• Recent research: Stimulus-parallel (cycle-parallel) gate-level sims for higher throughput
• 100-1000x [Holst et al., ToDAES 2015], 26-360x [Zhang et al., ICCAD 2020]
• Useful for power, timing, testability, and reliability analysis

• For RTL bug hunting and coverage closure, we need higher throughput RTL-level
simulation at a reasonable latency
ACCELERATING RTL SIMULATION
• What has changed in 10 years?
    • Higher performing GPUs, more features
    • Ubiquitous HPC systems for AI
    • Mature SW stack
    • Performant open-source EDA tools (Verilator, Yosys, FIRRTL, …)
• Can we leverage these developments for higher-throughput stimulus-parallel RTL simulation?

[Figure: parallel testbench driving Stimulus 0 into DUT 0, Stimulus 1 into DUT 1, through Stimulus N-1 into DUT N-1.]
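
To make the stimulus-parallel idea concrete, here is a small sketch that evaluates one copy of a design per stimulus in lockstep. It packs lane k of every signal into bit k of a Python integer, a CPU stand-in for the per-thread lanes a GPU would provide. The 2-bit enabled counter and all function names are assumptions made for illustration; this is not the method of any of the cited papers.

```python
import random

def step_packed(q0, q1, en):
    """One clock of a 2-bit enabled counter, evaluated for all lanes at once.
    Bit k of each integer holds the signal value seen by stimulus k."""
    return q0 ^ en, q1 ^ (q0 & en)

def simulate_parallel(stimuli):
    """stimuli[k][t] is the enable bit of stimulus k at cycle t (equal lengths)."""
    n, cycles = len(stimuli), len(stimuli[0])
    q0 = q1 = 0
    for t in range(cycles):
        en = 0
        for k in range(n):  # pack the cycle-t input of every stimulus into one word
            en |= stimuli[k][t] << k
        q0, q1 = step_packed(q0, q1, en)
    # Unpack the final 2-bit counter value of each lane.
    return [(((q1 >> k) & 1) << 1) | ((q0 >> k) & 1) for k in range(n)]

def simulate_one(stimulus):
    """Scalar reference: the same counter, one stimulus at a time."""
    count = 0
    for en in stimulus:
        count = (count + en) % 4
    return count

rng = random.Random(1)
stimuli = [[rng.randrange(2) for _ in range(50)] for _ in range(64)]
parallel_results = simulate_parallel(stimuli)
scalar_results = [simulate_one(s) for s in stimuli]
```

Each call to `step_packed` advances all 64 simulations with a handful of bitwise operations, so throughput grows with the number of lanes while the latency of any single simulation stays the same.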
MACHINE LEARNING (ML)

• ML has transformed many domains; can it transform design verification?


• Previous and ongoing work in the community
• ML-based predictors for guiding testbench constraints
• ML-assisted triage of failing tests
• Potential opportunity: can we use AI to solve primary DV challenges?
• We generate tons of data during simulations, but most of it is thrown away. Can we use it to train models?
• Constrained random verification
• Using AI more directly for stimulus generation to iterate to coverage closure?
• Using AI to assist/augment designers in bug hunting and debugging?
• Formal verification
• Improving the effectiveness of solvers?
• Other ideas
ASIC VERIFICATION CRISIS

The views, opinions and/or findings expressed are those of the author and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.

DENNIS BROPHY
DIRECTOR OF STRATEGIC BUSINESS DEVELOPMENT
SIEMENS EDA
PREDICTIONS & FORECASTS

No accounting for an inventive or changing world


• Inventions have long since reached their limit, and I see no hope for further
developments.
Julius Sextus Frontinus - Roman engineer, 10 A.D.

• Everything that could be invented had already been invented.


Charles Duell – Commissioner, U.S. Patent Office, 1899

• I think there is a world market for maybe five computers.


Thomas Watson – Chairman of IBM, 1943

• There is no reason anyone would want a computer in their home.


Ken Olsen – Chairman of Digital Equipment Corp., 1977

PREDICTIONS & FORECASTS
No accounting for expanding demand
• No one would need more than 637kb of memory for a personal computer and 640
ought to be enough for anybody.
Bill Gates – Microsoft founder, 1981

• Next Christmas the iPod will be dead, finished, gone, kaput.


Sir Alan Sugar – British entrepreneur, 2005

No accounting for innovation and automation


• This new telephonic apparatus may be all well and good for our colonial cousins
but it will never catch on in Great Britain because we have an adequate supply of
messenger boys.
Sir William Preece – Chief engineer, British Post Office, 1876

• If growth in telephonic communication continues at the current foreseen rate by the year 2000, every woman of working age in the United Kingdom will have to be a telephone operator.
Sir William Preece – Chief engineer, British Post Office, 1886
ENGINEERS ON AN ASIC/IC PROJECT
[Figure: mean peak number of engineers on ASIC/IC projects for 2007, 2012, 2016, and 2020. Series: design engineers and verification engineers, each with a logarithmic trend line. Data labels: 4.8, 7.8, 8.4, 8.5, 10.3, 10.5, 11.4, 11.6.]

Source: Wilson Research Group and Siemens EDA, 2020 Functional Verification Study
PREDICTIONS & FORECASTS

We must account for the absurd


• Will the entire population of India become verification engineers in 50 years?
Wally Rhines – CEO, Mentor Graphics, 2011

We should remind ourselves of what has worked


• Quality cannot be inspected into a product; it must be built into it.
W. Edwards Deming – Author, Out of the Crisis, 1982

LATE ISSUE DISCOVERY – COSTLY TO FIX
[Figure: cost to fix an issue (log scale, $1 to $10,000,000) by the stage at which it is found. Stages: Coding, IP Verification, Integration/Top Verification, System Validation / ECO, Post-Silicon. Relative cost rises from 1x at coding through 8x, 24x, and 130x to 41Kx at post-silicon. Cost components: labor, schedule, and rebuild.]

Source: External Data on Mask Costs, and Internal Research
PROBLEMS FOUND LATE COST MILLIONS
LITTLE CHANGE OVER 8 YEARS DESPITE HIGH COST OF FAILURE

[Figure: average masksets/production issues (per project), ASIC and FPGA, for years between 2012 and 2020. Data labels: 2.1, 2.0, 2.0, 1.6.]

[Figure: % of masksets/production issues: functional, ASIC and FPGA, 2012 and 2020. Data labels: 53%, 49%, 50%, 52%.]

[Figure: average cost (2020) of late functional finds & fixes in hardware (per project), in millions of dollars, for ASIC nodes from 130 nm to 5 nm and for FPGA. Data labels: $4.3M, $1.2M.]

Source: Wilson Research Group and Siemens EDA, 2020 Functional Verification Study
SIMULATION CANNOT FIND ALL FAULTS
CDC ADOPTION IN ASIC REDUCING IMPACT, BUT …
[Figure: % of masksets/production issues: clocking, ASIC and FPGA, 2012 and 2020, with trend. Data labels: 43%, 37%, 33%, 19%. Annotation: "Progress: CDC Adoption".]

[Figure: cost of a clock issue escape, in millions of dollars, for ASIC nodes from 130 nm to 5 nm and for FPGA. Data labels: $530K, $820K.]

Cost of one clock issue escape (5 nm ASIC, minimum): $3.8M

Source: Wilson Research Group and Siemens EDA, 2020 Functional Verification Study
ADD MORE COMPLEXITY
ADOPTION OF SAFETY STANDARDS

[Figure: bar chart of safety critical design projects (0% to 70%) by adopted safety standard, for FPGA and ASIC/IC.]

DO-254 - Avionics
ISO26262 - Automotive
IEC61508 - Industrial
IEC61511 - Process Industry
IEC61513 - Nuclear
IEC60601 - Medical
EN50129 - Railway
ISO25119 - Agriculture & Forestry
MIL-STD-882 - Military
Other

Source: Wilson Research Group and Siemens EDA, 2020 Functional Verification Study
INTENT-FOCUSED INSIGHT
Path to address ASIC Verification Crisis: founded on bug prevention

• Produce
    • Raise the level of abstraction
    • Deep analysis via continuous integration
• Prove
    • Use RTL code checks
    • Embrace sequential analysis
• Protect
    • Design intent preserved
    • Check synthesis, test insertion, energy reduction schemes, implementation, and more don’t alter design intent

Quality cannot be inspected into a product; it must be built into it.
- W. Edwards Deming
UNIFORM PROCESSOR/ACCELERATOR/DEVICE SPECIFICATION FOR SIMULATION-BASED/FORMAL VERIFICATION

This research was developed with funding from the Defense Advanced Research Projects Agency (DARPA).

The views, opinions and/or findings expressed are those of the author and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.

SHARAD MALIK
DARPA ERI SUMMIT
ASIC FUNCTIONAL VERIFICATION WORKSHOP
EMERGING SILICON LANDSCAPE

[Figure: SoC building blocks: CPU, GPU, image signal processing, ML engine, crypto-accelerators, video processing, …]

Die Photo Analysis of Apple A-series SoCs [Source: Harvard Architecture, Circuit and Compilers Groups]
http://vlsiarch.eecs.harvard.edu/research/accelerators/die-photo-analysis/
EMERGING SILICON LANDSCAPE

[Figure: SoC building blocks: CPU, GPU, image signal processing, ML engine, crypto-accelerators, video processing, …]

[Figure: Apple M1 die photo with labeled blocks: 8-core GPU, 4 Firestorm cores + 12MB L2, 4 Icestorm cores + 4MB L2, 16-core Neural Engine, SLC cache.]

Apple M1 Die Photo [Source: AnandTech]
https://www.anandtech.com/show/16226/apple-silicon-m1-a14-deep-dive
FUNCTIONAL AND SECURITY CORRECTNESS

• Communicating (heterogeneous) IPs
    • Processor/micro-controller
    • Firmware
    • Specialized hardware/accelerators
• Critical and complex
    • Secure assets
    • Access control
    • Bypassing DMA
    • Various synchronization mech.

[Figure: a functional module (μC, FW, MMIO, message buffer, DMA) and a security engine (μC, FW, MMIO, message buffer, secure mem., routing logic) attached to the on-chip interconnect alongside other IPs (e.g., attacker).]
VERIFICATION CHALLENGES

• Scale of the system ↑
• Design heterogeneity ↑

[Figure: system stack: software (SW), processor (Proc.), accelerators (Acc.), interconnect, shared memory access, memory-mapped I/O.]
DESIRED SOLUTION SPACE

• Software-driven verification
    • Software/firmware-hardware co-verification
    • High-speed (co-)simulation
• Sound/trusted high-level hardware models
    • Supported by select FV
• Uniform high-level models for co-simulation/FV

[Figure: system stack: software (SW), processor (Proc.), accelerators (Acc.), interconnect, shared memory access, memory-mapped I/O. Annotation: reduce the number of events.]

Enabler: Instruction-Level Abstraction Models for Accelerators/Devices
ACCELERATOR/DEVICE INTERFACE

[Figure: SoC with CPU, GPU, flash, HW accelerators, microcontroller + firmware, MMU + DMA, and DRAM memory on an on-chip interconnect. An accelerator sits behind a NoC interface and exposes memory-mapped registers.]

Address   Register
0xff00    status
0xff02    enable
…         …

Firmware C code:
• Load instruction: accessing registers
• Store instruction: triggering operations
INSTRUCTION-LEVEL ABSTRACTION (ILA)

Interface Commands ≝ Instructions

[Figure: AES block encryption accelerator behind an interconnect interface. Visible state: Address, Length, Key, Counter, State.]

MMIO accesses            Commands
Write, 0xff00, 0x1       START_ENCRYPT
Write, 0xff02, data      WRITE_ADDRESS
…                        …

• Merits (similar to ISA)
    • Software-visible “architectural” state variables
    • Modular: set of instructions
    • Per-instruction state-update
• More than ISA
    • Formal
    • Hierarchical
    • Generalizes ISA to include accelerators
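
The mapping from interface commands to instructions can be sketched as a set of (decode, update) pairs over the software-visible state. The sketch below is a plain-Python illustration, not the ILAng API: the `State` fields, the `status` encoding, and the effect of `START_ENCRYPT` are assumptions, while the two MMIO commands are the ones listed on the slide.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class State:
    """Software-visible architectural state (a subset of the slide's list)."""
    status: int = 0   # 0 = idle, 1 = busy (encoding is an assumption)
    address: int = 0

@dataclass(frozen=True)
class Cmd:
    """One MMIO access seen at the accelerator interface."""
    is_write: bool
    addr: int
    data: int

# Each instruction is (name, decode predicate, per-instruction state update).
INSTRUCTIONS = [
    ("START_ENCRYPT",
     lambda s, c: c.is_write and c.addr == 0xFF00 and c.data == 0x1,
     lambda s, c: replace(s, status=1)),       # assumed effect: mark busy
    ("WRITE_ADDRESS",
     lambda s, c: c.is_write and c.addr == 0xFF02,
     lambda s, c: replace(s, address=c.data)),
]

def step(state, cmd):
    """Decode cmd against the instruction set and apply the matching update."""
    matches = [(name, upd) for name, dec, upd in INSTRUCTIONS if dec(state, cmd)]
    if not matches:
        return None, state  # no instruction triggered: state is unchanged
    assert len(matches) == 1, "decode functions must be mutually exclusive"
    name, update = matches[0]
    return name, update(state, cmd)
```

For example, `step(State(), Cmd(True, 0xFF02, 0x1234))` decodes as `WRITE_ADDRESS` and returns a state whose `address` is `0x1234`. Keeping decode and update separate per instruction is what makes the model modular: each instruction can be read, simulated, or checked on its own.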
ILA FRAMEWORK
[Figure: ILA framework. Elements: ILA (function reference model & properties), modeling, RTL, refinement map, Pono model checker, formal co-verification, high-level simulation model, software, co-simulation, tandem simulation. Legend: design artifact, verification input, generated model, tools. DARPA POSH supported.]

RTL excerpt shown in the figure:

    always @(posedge clk) begin
        if (!resetn) begin
            mem_la_firstword_reg <= 0;
            last_mem_valid <= 0;
        end else begin
            if (!mem_valid)
                mem_la_firstword_reg <= mem_la_firstword;
            last_mem_valid <= mem_valid && !mem_ready;
        end
    end
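ILA-BASED FORMAL HARDWARE VERIFICATION
• Formal verification of RTL implementation
• Modular verification: per-instruction checking
• Automating modular verification

[Figure: refinement diagram. Instruction i takes ILA state U to U'; a sequence of RTL transitions f* takes RTL state V to V'; refinement relations r relate the RTL states to the ILA states.]

i = instruction
r = refinement relations
f = RTL state transitions

For each instruction, check that the RTL transitions and the ILA state update agree under the refinement relations
• in all valid RTL starting states (environment)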
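
A toy version of this per-instruction check is sketched below. It enumerates every valid starting state of a tiny design instead of calling a model checker, which only works because the state space is kept to a few bits. The 4-bit accumulator, its two-cycle RTL, and the choice of `r` are hypothetical; the slides name Pono as the model checker used in the actual flow.

```python
from itertools import product

MASK = 0xF  # 4-bit datapath keeps exhaustive enumeration small

def ila_add(u, x):
    """ILA instruction ADD: one atomic update of architectural state U."""
    return (u + x) & MASK

def rtl_step(v, x):
    """f: one clock of a hypothetical 2-cycle RTL implementation.
    V = (acc, tmp, busy); cycle 1 latches the sum, cycle 2 commits it."""
    acc, tmp, busy = v
    if not busy:
        return (acc, (acc + x) & MASK, 1)
    return (tmp, tmp, 0)

def rtl_run(v, x):
    """f*: step the RTL until the instruction commits (busy returns to 0)."""
    v = rtl_step(v, x)
    while v[2]:
        v = rtl_step(v, x)
    return v

def r(v):
    """Refinement relation: the architectural state is the RTL accumulator."""
    return v[0]

def check_add():
    """For every valid start state and operand, r(f*(V)) must equal i(r(V)).
    Returns a counterexample (V, x), or None if the instruction refines."""
    for acc, tmp, x in product(range(MASK + 1), repeat=3):
        v = (acc, tmp, 0)  # valid starting states: not busy, tmp unconstrained
        if r(rtl_run(v, x)) != ila_add(r(v), x):
            return (v, x)
    return None
```

`check_add()` returns `None` for this design. The restriction to `busy == 0` plays the role of the "valid RTL starting states" condition: without it the check would start mid-instruction and report spurious mismatches.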
ILA AS VERIFIED HARDWARE ABSTRACTION

[Figure: two paths to executable models.
ILA model → ILAtor → C++ instruction-level executable model → high-level co-simulation with system software.
RTL design → Verilator → C++ RTL executable model → low-level co-simulation.
The ILA model is a verified abstraction of the RTL design. HW/SW co-simulation uses QEMU.]
ILA FRAMEWORK

• Auto-extract from RTL models
• Upcoming Target: >100X speedup over RTL simulation using automatically extracted ILA abstraction models

[Figure: the ILA framework diagram repeated, annotated with auto-extraction of the ILA from RTL models.]
ILANG FRAMEWORK

• ILAng: ILA modeling and verification platform
• Open-source (MIT license): https://github.com/PrincetonUniversity/ILAng
• Docker: https://hub.docker.com/r/byhuang/ilang
• ILA modeling database (IMDB): https://github.com/PrincetonUniversity/IMDb
• Document: https://bo-yuan-huang.gitbook.io/ilang/
• API reference: https://princetonuniversity.github.io/ILAng-Doc/namespaceilang.html
SELECTED BIBLIOGRAPHY

• Generating Architecture-Level Abstractions from RTL Designs for Processors and Accelerators - Part I: Determining Architectural State Variables. [ICCAD21]
• ILAng: A Modeling and Verification Platform for SoCs using Instruction-Level Abstractions. [TACAS19]
• Integrating Memory Consistency Models with Instruction-Level Abstractions for Heterogeneous System-on-Chip Verification. [FMCAD18]
• Formal Security Verification of Concurrent Firmware in SoCs using Instruction-Level Abstraction for Hardware. [DAC18]
• Instruction-Level Abstraction (ILA): A Uniform Specification for System-on-Chip (SoC) Verification. [TODAES18] (Best Paper Award)
RAPID DESIGN VALIDATION WITH HIGH-LEVEL SYNTHESIS

This research was developed with funding from the Defense Advanced Research Projects Agency (DARPA).

The views, opinions and/or findings expressed are those of the author and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.

JASON CONG
VOLGENAU CHAIR FOR ENGINEERING EXCELLENCE
UCLA
HLS BASED ACCELERATOR DESIGN ON FPGAS AT UCLA
ASIC DESIGNS ARE USING HLS AS WELL

• Google develops WebM video decompression hardware IP using Catapult High-Level Synthesis [1]
    • “coded and verified primarily in standard C++”
    • “HLS approach makes design implementation and verification 50% faster than a traditional RTL design flow”

[1]: https://resources.sw.siemens.com/en-US/white-paper-google-develops-webm-video-decompression-hardware-ip-using-high-level
[Catapult Verification Flow]: https://cfnewsads.thomasnet.com/spec/40025/40025313.pdf
CURRENT STATE OF HLS VALIDATION

• C++ code compiled using standard compiler (GCC/Clang/...)
    • Closer to reference C/C++ code
    • Well designed for data-parallel programs (parallelism via #pragmas)
    • Behavior simulation >1000× faster than RTL simulation [1]
• But…
    • Hard to scale to larger & more complex (esp. task-parallel) programs
    • Not always possible/correct
    • Not cycle-accurate

[1]: Yuze Chi, Young-kyu Choi, Jason Cong, Jie Wang. Rapid Cycle-Accurate Simulator for High-Level Synthesis. In FPGA, 2019.
CHALLENGE #1: SCALING TO LARGE DESIGNS

• Pure HLS-C is not flexible & expressive enough
• Large designs (e.g. Google’s video decoder) are
    • decomposed
    • synthesized and validated separately at component level
    • System-level validation done at RTL level

[Google VP9 G2 Decoder Hardware]: https://cfnewsads.thomasnet.com/spec/40025/40025313.pdf
OUR EFFORT #1: BETTER SUPPORT FOR TASK-PARALLEL PROGRAMS
• 4 PEs interconnected w/ a ring
• PEs send packets to ring nodes & receive packets from ring nodes
• Ring nodes forward packets conditionally based on the destination PE specified in the packet header

Image source: Yuze Chi, Licheng Guo, Jason Lau, Young-kyu Choi, Jie Wang, Jason Cong. Extending High-Level Synthesis for Task-Parallel Programs. In FCCM, 2021.
EXTENDING KERNEL PROGRAMMING INTERFACES

[Figure: ring-node forwarding example with packets 1→2, 1→3, and 2→3. Case 1: destinations do not conflict. Case 2: destinations do conflict. Shown side by side: kernel code w/ Vivado HLS and equivalent kernel code w/ TAPA.]

For kernel code, TAPA is much shorter to write with peeking and transaction API support.

Image source: Yuze Chi, Licheng Guo, Jason Lau, Young-kyu Choi, Jie Wang, Jason Cong. Extending High-Level Synthesis for Task-Parallel Programs. In FCCM, 2021.
KERNEL LOC REDUCTION

Image source: Yuze Chi, Licheng Guo, Jason Lau, Young-kyu Choi, Jie Wang, Jason Cong. Extending High-Level Synthesis for Task-Parallel Programs. In FCCM, 2021.
SOFTWARE SIMULATION
• Sequential simulator: simulates PE 1, RingNode 1, PE 2, RingNode 2, ... one after another on Thread #1 / Core #1
    • Incorrect (does not match spec.)
• Multi-thread simulator: one thread per task (Thread #1 for PE 1, Thread #2 for RingNode 1, Thread #3 for PE 2, Thread #4 for RingNode 2, ...) sharing Core #1 and Core #2
    • 1.2-2.2 μs per context switch [1]
    • Not scalable
• TAPA’s coroutine-based simulator: one coroutine per task (Coroutine #1 to #4, ...) multiplexed onto Thread #1 / Core #1 and Thread #2 / Core #2
    • 26 ns per context switch [2]
    • Correct & scalable
[1]: https://eli.thegreenplace.net/2018/measuring-context-switching-and-memory-overheads-for-linux-threads/
[2]: https://www.boost.org/doc/libs/1_73_0/libs/coroutine2/doc/html/coroutine2/performance.html
[TAPA]: Yuze Chi, Licheng Guo, Jason Lau, Young-kyu Choi, Jie Wang, Jason Cong. Extending High-Level Synthesis for Task-Parallel Programs. In FCCM, 2021.
Images are authors’ own.
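
The difference between the three simulators shows up in any design with a feedback loop. The sketch below uses Python generators as stand-in coroutines on a single thread; TAPA itself is a C++ framework, and the echoing ring node, the depth-1 FIFOs, and all names here are assumptions made for illustration.

```python
from collections import deque

def pe(to_node, from_node, n, log):
    """PE: sends n packets, waiting for each echo before sending the next."""
    for i in range(n):
        while len(to_node) >= 1:  # FIFO depth 1 is full: yield to the scheduler
            yield
        to_node.append(i)
        while not from_node:      # FIFO empty: a blocking read would stall here
            yield
        log.append(from_node.popleft())

def ring_node(from_pe, to_pe, n):
    """Ring node: echoes each packet back with +100 (hypothetical behavior)."""
    for _ in range(n):
        while not from_pe:
            yield
        pkt = from_pe.popleft()
        while len(to_pe) >= 1:
            yield
        to_pe.append(pkt + 100)

def run(tasks):
    """Round-robin coroutine scheduler on one thread; returns the resume count."""
    live, resumes = deque(tasks), 0
    while live:
        task = live.popleft()
        try:
            next(task)
            live.append(task)  # task yielded because it would block
        except StopIteration:
            pass               # task ran to completion
        resumes += 1
    return resumes

requests, replies, log = deque(), deque(), []
resumes = run([pe(requests, replies, 3, log), ring_node(requests, replies, 3)])
```

After the run, `log` holds `[100, 101, 102]`. A sequential simulator that ran the PE to completion before starting the ring node could not finish, because the PE's first read would wait on a reply that has not been produced yet. The scheduler here has no deadlock detection, so a genuinely deadlocked design would spin forever.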
SIMULATION TIME REDUCTION

Sequential simulation does not produce results for Cannon & PageRank.

Image source: Yuze Chi, Licheng Guo, Jason Lau, Young-kyu Choi, Jie Wang, Jason Cong. Extending High-Level Synthesis for Task-Parallel Programs. In FCCM, 2021.
CHALLENGE #2: LACK OF CYCLE-ACCURACY

• HLS simulation does not always match RTL simulation
• Example: HLS simulation of molecular dynamics

[Figure: particles 1 to 12 are processed by four distance PEs (Dist PE1 to Dist PE4, II=4) that feed one force PE (II=1) through a round-robin non-blocking read.]

              Dist PE1    Dist PE2    Dist PE3    Dist PE4
1st round:    (bubble)    2           (bubble)    (bubble)
2nd round:    5           (bubble)    (bubble)    8
3rd round:    (bubble)    10          11          (bubble)

HLS C code:

    #pragma HLS dataflow
    Dist_PE1();
    Dist_PE2();
    Dist_PE3();
    Dist_PE4();
    Force_PE();

RTL sim output: 2 5 8 10 11
SW sim output: 5 2 11 8 10 (simulated in instantiation order → missing bubbles)
Does not match!

Image source: Yuze Chi, Young-kyu Choi, Jason Cong, Jie Wang. Rapid Cycle-Accurate Simulator for High-Level Synthesis. In FPGA, 2019.
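
The mismatch can be reproduced with a few lines of code. The sketch below takes the per-round PE outputs from the slide and models the two views of the same round-robin non-blocking read. The function names and the assumption that the software FIFOs are unbounded are illustrative choices, not details from the cited paper.

```python
from collections import deque

# Per-round outputs of Dist PE1..PE4 from the slide; None marks a bubble.
ROUNDS = [
    [None, 2, None, None],
    [5, None, None, 8],
    [None, 10, 11, None],
]

def rtl_like(rounds):
    """Cycle-faithful view: in each round the force PE polls the four FIFOs
    round-robin and skips the ones that are empty (bubbles)."""
    out = []
    for outputs in rounds:
        out += [v for v in outputs if v is not None]
    return out

def sw_like(rounds):
    """Software view: each Dist PE runs to completion in instantiation order,
    so every FIFO is already filled before the force PE starts polling."""
    fifos = [deque(r[p] for r in rounds if r[p] is not None) for p in range(4)]
    out = []
    while any(fifos):
        for fifo in fifos:  # same round-robin non-blocking read as in RTL
            if fifo:
                out.append(fifo.popleft())
    return out

rtl_output = rtl_like(ROUNDS)  # [2, 5, 8, 10, 11]
sw_output = sw_like(ROUNDS)    # [5, 2, 11, 8, 10]
```

Both functions perform the same read policy; only the timing of the producers differs. That is why the fix has to come from timing information rather than from changing the functional code.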
OUR EFFORT #2: RAPID CYCLE-ACCURATE SIMULATION FOR HLS
• FLASH: Fast, paralleL, Accurate Simulator for HLS
• Scheduling information extracted to achieve cycle-accurate results

[Figure: HLS design steps from HLS C code to RTL code: compilation; allocation, scheduling, and binding (with a library); generation. Scheduling info (stmt, loop, func, ...) is passed to the proposed simulator.]

• SW simulator: fast, but
    1. Output may not be accurate
    2. No perf estimation
• RTL simulator: accurate, but too slow
• Proposed simulator (FLASH): uses scheduling info

Image source: Yuze Chi, Young-kyu Choi, Jason Cong, Jie Wang. Rapid Cycle-Accurate Simulator for High-Level Synthesis. In FPGA, 2019.
SIMULATION TIME COMPARISON

[Figure: simulation time comparison across benchmarks. Annotations: deep (55) pipeline; frequent FIFO stall (FIFO depth=1); SW simulation does not always work correctly; FLASH is much faster than RTL simulation and has comparable speed with SW simulation.]

Image source: Yuze Chi, Young-kyu Choi, Jason Cong, Jie Wang. Rapid Cycle-Accurate Simulator for High-Level Synthesis. In FPGA, 2019.
OVERALL WORKFLOW

[Figure: overall workflow. Image is the authors’ own.]
BEYOND SIMULATION: A-QED + HLS FORMAL
VERIFICATION
• Simulation-based verification
• May miss critical bugs, exhaustive simulation time consuming
• Formal verification appealing, but traditionally has barriers
• Design-specific properties
• Huge manual effort, error prone, corner cases not guaranteed
• Doesn’t scale for large designs
• New approach: Accelerator Quick Error Detection (A-QED)
• Thorough
• Overcomes above barriers when combined with HLS

[Singh DATE 19]
A-QED INSPIRED BY SYMBOLIC QED FOR PROCESSORS
• Universal property based on self-consistency: sound, complete

Infineon’s Symbolic QED case study: 16 automotive IP cores
• Thorough: detected bugs & spec. errors. Industry flow: 100%. Symbolic QED: +7%.
• Productivity 60× improved. Industry flow: 6 person months. Symbolic QED: 2 person days.

E. Singh, D. Lin, C. Barrett and S. Mitra, “Logic Bug Detection and Localization using Symbolic Quick Error Detection,” IEEE Trans. CAD, 2018.
E. Singh et al. Symbolic QED Pre-silicon Verification for Automotive Microcontroller Cores: Industrial Case Study. DATE’19
F. Lonsing, S. Mitra, and C. Barrett. A Theoretical Framework for Symbolic Quick Error Detection. FMCAD’20
A-QED FOR HARDWARE ACCELERATORS

Thorough
• Universal self-consistency check
• No full specification
• Sound & complete
• Non-interfering HAs: popular HA class

[Figure: input sequence In … I2 I1 enters the HA, which produces output sequence On … O2 O1.]

Check: I1 = In ⇒ O1 = On

Seamless HLS integration
• Verification collateral (A-QED module & interfaces) generated by HLS (for verification only, not fabricated)

[Figure: the A-QED module wraps the HA, and a model checker runs on the combined design.]

E. Singh et al. A-QED Verification of Hardware Accelerators. DAC’20
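
The self-consistency check above needs no functional specification: it only asks whether equal inputs ever yield unequal outputs. The sketch below illustrates that property by bounded enumeration over short input sequences. A-QED itself relies on a model checker rather than enumeration, and both accelerator classes, the input domain, and the sequence bound are hypothetical.

```python
from itertools import product

class GoodHA:
    """Hypothetical non-interfering accelerator: output depends only on the input."""
    def process(self, x):
        return (3 * x + 1) & 0xF

class BuggyHA:
    """Same function, but stale internal state corrupts every third result."""
    def __init__(self):
        self.count = 0

    def process(self, x):
        self.count += 1
        y = (3 * x + 1) & 0xF
        return (y ^ 1) if self.count % 3 == 0 else y

def fc_check(make_ha, domain=range(4), max_len=4):
    """Bounded functional-consistency check: search for an input sequence in
    which two equal inputs yield different outputs.
    Returns (sequence, outputs) as a counterexample, or None."""
    for n in range(2, max_len + 1):
        for seq in product(domain, repeat=n):
            ha = make_ha()  # fresh accelerator for every sequence
            outs = [ha.process(x) for x in seq]
            seen = {}
            for x, y in zip(seq, outs):
                if seen.setdefault(x, y) != y:
                    return seq, outs
    return None
```

`fc_check(GoodHA)` finds nothing, while `fc_check(BuggyHA)` returns the sequence `(0, 0, 0)` with outputs `[1, 1, 0]`. Note that the check never consults the intended function `3x + 1`; the bug is exposed purely by the design disagreeing with itself.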


A-QED SCALEUP FOR LARGE HARDWARE ACCELERATORS
• A-QED compositional: new functional decomposition unique to A-QED

Design                        Traditional formal    A-QED scaleup
Nvidia’s NVDLA (16M gates)    12 hr. timeout        2 mins.
ISmartDNN (42M gates)         12 hr. timeout        10 secs.
AES (382K gates)              Crash                 1 min.
… 109 designs analyzed

S. Chattopadhyay et al. Scaling Up Hardware Accelerator Verification using A-QED with Functional Decomposition. FMCAD’21
TAKE-AWAYS

• HLS can substantially decrease validation time of ASIC design

• TAPA: convenient APIs & scalable software simulation
  • Can scale to large complex systems using HLS productively
• FLASH: fast software simulation w/ scheduling information
  • Can rapidly provide accurate performance estimation
• A-QED + HLS: thorough, scalable, large productivity benefits

[Figure labels: Complex systems; Correct output data; Cycle-accurate performance estimation; Thorough & scalable]

[Complex systems]: https://en.wikipedia.org/wiki/File:Wait-for_graph_example.png
[Check mark]: https://pixabay.com/illustrations/correct-mark-green-continue-right-2214020/
[Dart]: https://www.maxpixel.net/Dartboard-Dart-Board-Arrow-Bulls-Eye-Accurate-25780
[Magnifying glass]: authors’ own
CONTRIBUTORS & ACKNOWLEDGMENTS (UCLA)

• Partially funded by the NSF/Intel CAPA program (CCF-1723773) and NSF RTML program (CCF-1937599).

Jason Cong, Yuze Chi, Young-kyu Choi, Licheng Guo, Jason Lau, Jie Wang

Images provided by each contributor
CONTRIBUTORS & ACKNOWLEDGMENTS (STANFORD)

• Partially funded by DARPA POSH, NSF, Stanford SystemX Alliance

Subhasish Mitra, Clark Barrett, Caroline Trippel, Florian Lonsing, Eshan Singh, Saranyu Chattopadhyay

Images provided by each contributor
ACCELERATING
SIMULATION
THROUGH NOVEL
HARDWARE
ARCHITECTURES

The views, opinions and/or findings expressed are those of the author and should
not be interpreted as representing the official views or policies of the
Department of Defense or the U.S. Government.
DANIEL
SANCHEZ
ASSOCIATE PROFESSOR
MIT

EXECUTIVE SUMMARY

• Slow RTL simulation hinders the design and verification of complex chips
• Software simulators parallelize poorly due to limitations of multicore processors
• Specialized emulators are expensive, hard to scale, slow to compile for

• We are investigating ESANA, a new hardware architecture to accelerate RTL simulation by several orders of magnitude

[Figure: ESANA hardware. Memories and tiles; each tile contains local cache memory, general-purpose cores, specialized PEs, and task management hardware]
• Hardware support for fine-grain parallelism
• Specialized compute and memory
• Decoupled communication → scalable with a commodity interconnect

Targeting >1000x speedups vs. software simulation
SOFTWARE SIMULATION IS HARD TO PARALLELIZE

• Digital systems have plentiful parallelism, so why is simulation hard to scale?


Example: Verilator achieves speedup of only 1.2x when simulating a chip with 256 RISC-V cores

[Figure: Speedup (0 to 1.5) vs. Cores (1 to 8)]

• Simulation requires fine-grained parallelism:
  • Work naturally expressed using small tasks with few operations each
  • Fast communication among simulated modules → frequent synchronization among tasks

• Poor match to existing multicores, which work well only with coarse-grained tasks that synchronize infrequently
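
The mismatch can be made concrete with a toy cost model. This is a sketch under stated assumptions, not a measurement: each task is assumed to pay a fixed synchronization cost that does not shrink with core count, and both the function name `speedup` and the nanosecond figures in the usage note are hypothetical.

```python
def speedup(task_work_ns, sync_cost_ns, cores):
    """Speedup over one core under a toy model.

    Sequential time per task: task_work_ns
    Parallel time per task:   task_work_ns / cores + sync_cost_ns
    """
    return task_work_ns / (task_work_ns / cores + sync_cost_ns)
```

With an assumed 100 ns synchronization cost on 8 cores, a coarse task of 1,000,000 ns gets close to 8x, whereas a fine-grained task of 100 ns runs slower than on a single core because synchronization dominates.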
ARCHITECTURAL SUPPORT
FOR FINE-GRAIN PARALLELISM
• Chronos [ASPLOS’20] targets hard-to-parallelize applications
• Hardware supports extremely short tasks (~10 instructions)
• Implicit synchronization by defining a global order among tasks
• Hardware executes tasks speculatively and out of order to scale

• Chronos FPGA prototype provides efficient, distributed, scalable mechanisms for ordered speculation

[Figure: Chronos framework. Labels: Mem0 to Mem3; Memory Traffic Interconnect; Tile 0, 1, 2, …, N; Task Traffic Interconnect; per tile: Cache (private, non-coherent), Processing Elements (RISC-V cores or application-specific RTL), Task Unit]
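
The programming model described above (tiny tasks, synchronized only through a global order) can be sketched in software. The code below is a sequential reference for timestamp-ordered tasks, applied to a small discrete-event simulation of an inverter chain; it is not the Chronos API, and `OrderedTaskQueue` and `simulate_inverter_chain` are hypothetical names. Hardware such as Chronos would run these tasks speculatively and out of order while producing the same result as this ordered execution.

```python
import heapq
from itertools import count


class OrderedTaskQueue:
    """Runs tasks in timestamp order (sequential reference semantics)."""

    def __init__(self):
        self._heap = []
        self._tie = count()  # tie-breaker: equal timestamps run in FIFO order

    def enqueue(self, timestamp, fn, *args):
        heapq.heappush(self._heap, (timestamp, next(self._tie), fn, args))

    def run(self):
        while self._heap:
            timestamp, _, fn, args = heapq.heappop(self._heap)
            fn(self, timestamp, *args)


def simulate_inverter_chain(n_gates, gate_delay):
    """Drive a 0->1 edge into a chain of inverters.

    Returns a log of (time, wire, new_value) events.
    """
    values = [0]  # wire 0 is the primary input
    for _ in range(n_gates):
        values.append(1 - values[-1])  # settled initial state
    log = []

    def set_wire(queue, now, wire, value):
        # Each task does a few operations and may enqueue one child task.
        if values[wire] == value:
            return  # no change, so no downstream event
        values[wire] = value
        log.append((now, wire, value))
        if wire < n_gates:
            queue.enqueue(now + gate_delay, set_wire, wire + 1, 1 - value)

    queue = OrderedTaskQueue()
    queue.enqueue(0, set_wire, 0, 1)
    queue.run()
    return log
```

For example, `simulate_inverter_chain(3, 2)` logs one event per wire, spaced by the gate delay.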
ARCHITECTURAL SUPPORT
FOR FINE-GRAIN PARALLELISM
• Chronos achieves order-of-magnitude speedups on discrete-event
simulation using existing infrastructure

[Figure: speedup over baseline CPU; Chronos bar labeled 15.3x]

Platform       AWS Instance   Price ($/hr)
Baseline CPU   M4.10xlarge    2.00
FPGA           F1.2xlarge     1.65

Chronos is 15x faster than 40-thread CPU despite 19x lower frequency
because it exploits fine-grained parallelism effectively
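
Combining the speedup with the hourly prices in the table gives a rough cost comparison. This is a back-of-the-envelope sketch: it assumes the 15.3x figure applies to the whole run and that billing is proportional to run time; `cost_ratio` is a hypothetical helper name.

```python
def cost_ratio(speedup, baseline_price_per_hr, accel_price_per_hr):
    """How many times cheaper one simulation run is on the accelerator.

    Run cost = hourly price * run time, and run time shrinks by `speedup`.
    """
    return speedup * baseline_price_per_hr / accel_price_per_hr


# Figures from the slide: 15.3x speedup, $2.00/hr CPU, $1.65/hr FPGA.
chronos_cost_ratio = cost_ratio(15.3, 2.00, 1.65)
```

Under these assumptions the FPGA run costs roughly 18.5x less than the CPU run.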
ESANA’S APPROACH: SCALING
AND ACCELERATING SOFTWARE SIMULATION
ESANA simulator
Input: HDL code (Verilog)
    module foo;
    module bar;
    …
Compile → Parallel simulation code
    class foo {
      void eval() {
        enqueue(bar.eval());
• Fine-grained ordered tasks
• Event-driven execution
• Distributed state
• Uses specialized hardware

ESANA hardware
[Figure: memories and tiles; each tile contains local cache memory, general-purpose cores, specialized PEs, and task management hardware]
• Multiple tiled chips
• Hardware support for tiny ordered tasks
• Speculative execution
• Simulation-specialized compute and memory

ESANA multi-FPGA prototype
Local or cloud deployment
Scalable, latency-tolerant design with commodity interconnect (100GbE)
SIMULATION VS. EMULATION TRADEOFFS
RTL input:
    module alu(…);
      always_comb begin
        case (op)
          ADD: out = a + b;
          MUL: out = a * b;
          AND: out = a & b;
          XOR: out = a ^ b;
        endcase
      end
    endmodule

State-of-the-art simulators compile RTL into efficient code:
    class alu {
      void eval() {
        if (op == 0) out = a + b;
        else if (op == 1) out = a * b;
        else if (op == 2) out = a & b;
        else if (op == 3) out = a ^ b;
      }
    };

Specialized emulators synthesize RTL and map gate-level netlist (e.g., to FPGAs):
    [Figure: datapath with 64-bit inputs a and b, operator blocks (+, *, …), a 4-input multiplexer (inputs 0 to 3) selected by op, and 64-bit output out]

• Simulation introduces interpretation overheads (using instructions)
• However, a single instruction can simulate many gates (e.g., a * b)
• Simulation can avoid ineffectual work, whereas emulation implements whole circuit
• Simulation can trade space for time → system size does not limit circuit size
CODESIGNED SIMULATOR AND HARDWARE

• ESANA relies on a tightly integrated RTL simulator (Verilator-based) to maximize performance in two ways

1. Parallelizing code into tasks and mapping them to massively parallel hardware

2. Adopting event-driven execution to avoid ineffectual work

• Recent work [Beamer DAC’20] shows 10x speedups from this approach without hardware support; hardware enables more aggressive optimization
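
The second point, event-driven execution, can be sketched as follows. This is a minimal illustration and not the algorithm used by ESANA, Verilator, or [Beamer DAC’20]: it assumes a purely combinational netlist listed in topological order, and `full_cycle_eval`, `event_driven_eval`, and `NODES` are hypothetical names.

```python
import operator


def full_cycle_eval(nodes, values):
    """Evaluate every node once per cycle; returns the number of evaluations."""
    evals = 0
    for out, fn, ins in nodes:
        values[out] = fn(*(values[name] for name in ins))
        evals += 1
    return evals


def event_driven_eval(nodes, values, changed_signals):
    """Evaluate only nodes with at least one changed input."""
    changed = set(changed_signals)
    evals = 0
    for out, fn, ins in nodes:  # nodes must be in topological order
        if changed.isdisjoint(ins):
            continue  # inputs unchanged: evaluating would be ineffectual work
        new_value = fn(*(values[name] for name in ins))
        evals += 1
        if new_value != values[out]:
            values[out] = new_value
            changed.add(out)  # propagate the event downstream
    return evals


# Tiny example netlist: s = a + b, p = c * d, y = s & p
NODES = [
    ("s", operator.add, ("a", "b")),
    ("p", operator.mul, ("c", "d")),
    ("y", operator.and_, ("s", "p")),
]
```

After a full evaluation, changing only `a` triggers two evaluations (`s` and `y`) instead of three; the multiplier `p` is skipped because neither of its inputs changed.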

SPECIALIZING FOR HIGHER EFFICIENCY

• Instruction-based simulation is efficient for arithmetic-intensive circuits, but inefficient for control-intensive ones
• Specialized processing elements and custom instructions can execute narrow/control-intensive regions efficiently

[Figure: a tile (local cache memory, general-purpose cores, specialized PEs, task management hardware) with a specialized PE shown as a grid of functional units (FUs) and switches with I/O ports, configuration memory, data inputs, and data outputs. Tasks are pipelined to achieve high throughput (Task 1, …, Task N, Task N+1).]

• In recent work [Fifer MICRO’21] we have shown how to use specialized reconfigurable units for control-intensive, irregular tasks
ESANA MULTI-FPGA PROTOTYPE

• Since all communication is asynchronous and tasks can be executed widely out of order, ESANA tolerates long latencies well
• We are designing a multi-FPGA prototype to demonstrate the scalability of this approach

• Targeting 8-16 FPGA boards connected with 100GbE (~1 µs latency, 200 GB/s bisection)
• ~1000 cores at 200 MHz, ~1 GB on-chip storage, 500 GB/s aggregate off-chip bandwidth
• Expecting speedups >1000x vs. software simulation (execute 1-month simulation in ~10 minutes)
• Expecting similar resource efficiency to FPGA-based emulation (much higher frequency compensates for interpretation overheads)
• Design should scale to 100s of FPGAs/ASICs
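
The relation between the ">1000x" target and the "1-month simulation in ~10 minutes" example can be checked with simple arithmetic. The sketch assumes a 30-day month; `accelerated_minutes` and `required_speedup` are hypothetical helper names.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # assumption: 30-day month = 43,200 minutes


def accelerated_minutes(speedup, baseline_minutes=MINUTES_PER_MONTH):
    """Run time in minutes after applying a given speedup."""
    return baseline_minutes / speedup


def required_speedup(target_minutes, baseline_minutes=MINUTES_PER_MONTH):
    """Speedup needed to finish the baseline run within target_minutes."""
    return baseline_minutes / target_minutes
```

At exactly 1000x, a one-month run takes about 43 minutes; finishing in 10 minutes corresponds to roughly 4300x, which is consistent with ">1000x" being stated as a lower bound.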
WHAT SUCCESS WILL LOOK LIKE

• Today, emulating a cutting-edge design (e.g., top-of-line GPU) requires
  • Renting or acquiring a specialized emulation platform (1000+ FPGAs, with custom boards and interconnect → $$$)
  • Synthesizing and mapping RTL to emulation platform (weeks)

• Instead, ESANA will enable
  • Renting 100s of commodity FPGAs in the cloud by the hour
  • Compiling RTL to an ESANA software program (minutes)
CONCLUSION

• Slow and stagnant simulation performance bottlenecks chip design

• There is abundant parallelism in RTL simulation, but current parallel architectures cannot uncover it

• ESANA seeks to accelerate RTL simulation by several orders of magnitude
  • Hardware support for fine-grained ordered parallelism
  • Co-designed hardware and simulator to maximize parallelism and avoid ineffectual work
  • Specialization to achieve high efficiency while retaining programmability
  • Expect >1000x speedups on RTL simulation of large designs
THANK YOU! QUESTIONS?
