Leef ASIC Workshop
SERGE
LEEF
PROGRAM MANAGER
DARPA/MTO
[Figure: histogram of design projects (y-axis: share of design projects, 0% to 30%+) across the buckets >0%-20%, >20%-30%, >30%-40%, >40%-50%, >50%-60%, >60%-70%, >70%-80%, >80%. Reported averages: 2014: 57%; 2016: 54%; 2018: 53%; 2020: 56%.]
Source: Wilson Research Group and Mentor, A Siemens Business, 2020 Functional Verification Study
DISTRIBUTION A. Approved for public release: distribution unlimited
Next-Gen: Abstraction + Reduced Order Models + HPC + Cloud Scaling
• Simulation reached its physics limits decades ago and has relied on Moore's law for speedups since
• Parallel simulation has not been revisited since the emergence of cloud computing
• Machine learning has not yet been employed to drive the creation of faster models
• Need to rethink simulation in light of advances in ILA, ML, HPC, and cloud
[Figure: current (SOTA) versus next-gen flow. Flow stages in both: Functional Verification, Logic Synthesis, Layout Generation. SOTA: UVM Support (digital), Verification IP (digital), Verilator Digital Sim, XYCE Analog Sim. Next-gen additions: Instruction Level Abstraction, Surrogate Models, Cloud Scaling, Scalable HPC, Next-Gen Digital Sim, Next-Gen Analog Sim.]
CLOUD SCALING
• Combine limitless and elastic compute and storage available on the cloud with one or more of the following innovations:
• Simulation engine novel algorithms
• Parallel partitioning schemes
• High-performance-compute (HPC) architectures and programmable cloud-based FPGA fabric
• ML-driven simulation partitioning
[Figure: 42 Years of Microprocessor Trend Data. Transistor count (per chip) versus digital logic simulation speed (cps) on a log scale, with the simulation performance gap marked. Source: https://github.com/karlrupp/microprocessor-trend-data]
[Figure: Novel Algorithms, Cloud FPGA, and ML Partitioning, labeled 1000X. Source: http://users.ece.utexas.edu/~valvano/Volume1/]
• Simulation speed grew with Moore-driven platform performance to ~1K cps
• Post-Moore, the architectural direction is multicore/cloud, but modern simulation algorithms are not designed to take advantage of distributed computational fabric
VERIFYING ACCELERATED
COMPUTING PLATFORMS:
CHALLENGES AND
OPPORTUNITIES
The views, opinions and/or findings expressed are those of the author and should
not be interpreted as representing the official views or policies of the
Department of Defense or the U.S. Government.
BRUCEK
KHAILANY
SENIOR DIRECTOR OF VLSI RESEARCH
NVIDIA
DESIGN COMPLEXITY
• More transistors: improved capability AND implementation effort [Electronics 1965]
• Complex SoCs that include feature-rich CPUs, GPUs, I/Os, and many accelerators
[Figure: simulation testbench. A stimulus generator drives the DUT and a golden model; a checker compares their outputs.]
• Challenges
• Schedule and productivity
• Complexity of building and debugging testbenches
• Getting to signoff quality by iteratively closing coverage
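The testbench structure and the coverage-closure loop described above can be sketched in a few lines. This is a minimal illustration, not the flow used in the talk: the 8-bit saturating adder, the injected bug, the coverage bins, and every function name are assumptions made up for the example.

```python
import random

ALL_BINS = {"no_saturation", "exact_max", "overflow_by_one", "saturation"}

def golden_add(a, b):
    """Golden model (hypothetical): 8-bit saturating add."""
    return min(a + b, 255)

def dut_add(a, b):
    """Stand-in for the DUT with a deliberate bug: wraps to 0 when a + b == 256."""
    total = a + b
    if total == 256:
        return 0
    return min(total, 255)

def coverage_bin(a, b):
    """Map a stimulus to the functional coverage bin it exercises."""
    total = a + b
    if total < 255:
        return "no_saturation"
    if total == 255:
        return "exact_max"
    if total == 256:
        return "overflow_by_one"
    return "saturation"

def run_testbench(seed, max_tests=100_000):
    """Random stimulus generator plus checker; stops once every bin is hit."""
    rng = random.Random(seed)
    hit = set()
    mismatches = []
    tests = 0
    while hit != ALL_BINS and tests < max_tests:
        a, b = rng.randrange(256), rng.randrange(256)
        tests += 1
        hit.add(coverage_bin(a, b))
        expected, actual = golden_add(a, b), dut_add(a, b)
        if expected != actual:          # checker: DUT versus golden model
            mismatches.append((a, b, expected, actual))
    return tests, hit, mismatches

tests, hit, mismatches = run_testbench(seed=1)
print(f"{tests} tests, {len(hit)}/{len(ALL_BINS)} bins hit, {len(mismatches)} mismatches")
```

The bug only shows up in the rare `overflow_by_one` bin, which is why the loop runs until coverage closes rather than for a fixed number of tests.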
© NVIDIA 2021
VERIFICATION APPROACHES (2)
EMULATION
FORMAL METHODS
• Machine learning
RAISING DESIGN LEVEL OF ABSTRACTION
• Methodologies explored during DARPA CRAFT (2016-2019)
APPROACH
• Use higher-level languages for hardware design (and verification)
• C++
• Other groups exploring Chisel/Python/other
• Use tools/IRs to lower to Verilog
• e.g. HLS
• Use Libraries/Generators
• e.g. MatchLib, hlslibs.org
ADVANTAGES
• Works great for HW units needing agile design, changing features
• …especially if most verification can leverage native simulation in high-level languages
CHALLENGES
• Need production-ready formal equivalence tools for verifying generated RTL vs. higher-level models
• High bar for productivity improvements
• Significant effort to replace workflows on existing products
ACCELERATING RTL SIMULATION
• Previous research: Strong scaling single RTL sims for lower latency
(i.e. reduced time-to-solution for a single simulation)
• Gate-level simulation
• 4-60x [Chatterjee et al., DATE 2009], 5-270x [Zhu et al., ToDAES 2011]
• RTL simulation
• 20-50x [Qian et al., ICCAD 2011], 2-15x [Vinco et al., DAC 2012], + commercial efforts
• For RTL bug hunting and coverage closure, we need higher throughput RTL-level
simulation at a reasonable latency
ACCELERATING RTL SIMULATION
• What has changed in 10 years?
• Higher performing GPUs, more features
• Ubiquitous HPC systems for AI
• Mature SW stack
• Performant open-source EDA tools (Verilator, Yosys, FIRRTL, …)
• Can we leverage these developments for higher-throughput stimulus-parallel RTL simulation?
[Figure: parallel testbench. Stimulus 0 drives DUT 0, Stimulus 1 drives DUT 1, …, Stimulus N-1 drives DUT N-1.]
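One way to picture stimulus-parallel simulation is bit-parallel evaluation: lane k of every machine word carries stimulus k, so a single bitwise operation evaluates one gate for all N copies of the DUT at once. The sketch below is an assumption-laden illustration on a CPU with a ripple-carry adder as the DUT; it is not the GPU-based approach the slide alludes to, and all names are hypothetical.

```python
def pack(values, bit):
    """Pack bit `bit` of every stimulus into one word; lane k holds stimulus k."""
    word = 0
    for lane, value in enumerate(values):
        word |= ((value >> bit) & 1) << lane
    return word

def full_adder(a, b, cin):
    """Gate-level full adder; each operand is a word of N independent lanes."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout

def ripple_add_batch(a_vals, b_vals, width):
    """Simulate a `width`-bit ripple-carry adder for all stimuli in one pass."""
    carry = 0
    sum_words = []
    for bit in range(width):
        s, carry = full_adder(pack(a_vals, bit), pack(b_vals, bit), carry)
        sum_words.append(s)
    results = []
    for lane in range(len(a_vals)):
        value = 0
        for bit in range(width):
            value |= ((sum_words[bit] >> lane) & 1) << bit
        value |= ((carry >> lane) & 1) << width   # final carry-out as the top bit
        results.append(value)
    return results

a_vals = [0, 1, 255, 170, 85]
b_vals = [0, 255, 255, 85, 171]
print(ripple_add_batch(a_vals, b_vals, width=8))
```

The gate network is evaluated once regardless of the number of stimuli, which is the throughput argument: per-simulation latency does not improve, but simulations completed per unit time do.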
MACHINE LEARNING (ML)
The views, opinions and/or findings expressed are those of the author and should
not be interpreted as representing the official views or policies of the
Department of Defense or the U.S. Government.
DENNIS
BROPHY
DIRECTOR OF STRATEGIC BUSINESS
DEVELOPMENT
SIEMENS EDA
PREDICTIONS & FORECASTS
No accounting for expanding demand
• No one would need more than 637 KB of memory for a personal computer, and 640K
ought to be enough for anybody.
Bill Gates – Microsoft founder, 1981
[Figure: Mean Peak Number of Engineers on ASIC/IC Projects, 2007, 2012, 2016, 2020 (y-axis 0.0 to 6.0), with a logarithmic trend line for verification engineers.]
Source: Wilson Research Group and Siemens EDA, 2020 Functional Verification Study
LATE ISSUE DISCOVERY – COSTLY TO FIX
[Figure: relative cost to fix an issue by the stage at which it is found, on a log scale from $1 to $10,000,000. Coding: 1x; IP Verification: 8x; Integration/Top Verification: 24x; System Validation / ECO: 130x; Post-Silicon: 41Kx. Cost components labeled in the figure: Labor/Schedule; Labor, Schedule & Rebuild.]
Source: External Data on Mask Costs, and Internal Research
PROBLEMS FOUND LATE COST MILLIONS
LITTLE CHANGE OVER 8 YEARS DESPITE HIGH COST OF FAILURE
[Figure, left: Average masksets/production issues (per project), 2012 vs. 2020, for ASIC and FPGA. Bar labels: 2.1, 2.0, 2.0, 1.6.]
[Figure, right: Average cost (2020) of late functional finds & fixes in hardware (per project), in $ millions (axis $0.0 to $5.0), by node: 130 nm, 90 nm, 65 nm, 45 nm, 32 nm, 28 nm, 14 nm, 7 nm, 5 nm, and FPGA. Labeled value: $4.3M.]
Source: Wilson Research Group and Siemens EDA, 2020 Functional Verification Study
SIMULATION CANNOT FIND ALL FAULTS
CDC ADOPTION IN ASIC REDUCING IMPACT, BUT …
[Figure, left: % of masksets/production issues: clocking, 2012 vs. 2020, with trend line. Bar labels: 43%, 37%, 33%, 19%. Progress: CDC adoption.]
[Figure, right: Cost of one clock issue escape, in $ millions (axis $0.1 to $1.0). Labeled values: $3.8M (5 nm ASIC, minimum), $820K, $530K.]
DO-254 - Avionics
ISO26262 - Automotive
IEC61508 - Industrial
IEC61513 - Nuclear
IEC60601 - Medical
EN50129 - Railway
Source: Wilson Research Group and Siemens EDA, 2020 Functional Verification Study
INTENT-FOCUSED INSIGHT
Path to address ASIC Verification Crisis: founded on bug prevention
• Produce
• Raise the level of abstraction
• Deep analysis via continuous integration
• Prove
• Use RTL code checks
• Embrace sequential analysis
• Protect
• Design intent preserved
• Check synthesis, test insertion, energy reduction schemes, implementation, and more don't alter design intent
"Quality cannot be inspected into a product; it must be built into it." - W. Edwards Deming
UNIFORM PROCESSOR/ACCELERATOR/DEVICE
SPECIFICATION FOR
SIMULATION-BASED/FORMAL VERIFICATION
This research was developed with funding from the Defense Advanced Research
Projects Agency (DARPA).
The views, opinions and/or findings expressed are those of the author and should
not be interpreted as representing the official views or policies of the
Department of Defense or the U.S. Government.
SHARAD
MALIK
DARPA ERI SUMMIT
ASIC FUNCTIONAL VERIFICATION
WORKSHOP
[Figure: SoC block diagram with CPU, GPU, Image Signal Processing, ML engine, Crypto-accelerators, Video Processing, …]
Die Photo Analysis of Apple A-series SoCs [Source: Harvard Architecture, Circuit and Compilers Groups]
http://vlsiarch.eecs.harvard.edu/research/accelerators/die-photo-analysis/
EMERGING SILICON LANDSCAPE
[Figure: SoC block diagram with CPU, GPU, Image Signal Processing, ML engine, Crypto-accelerators, Video Processing, …]
[Figure: Apple M1 die photo. Labeled blocks: 8-Core GPU; 4 Firestorm Cores + 12MB L2; SLC Cache; 16-Core Neural Engine; 4 Icestorm Core + 4MB L2.]
Apple M1 Die Photo [Source: AnandTech]
https://www.anandtech.com/show/16226/apple-silicon-m1-a14-deep-dive
FUNCTIONAL AND SECURITY CORRECTNESS
• Specialized hardware/accelerators
[Figure: Processor (Proc.), Accelerators (Acc.), and Others attached to the on-chip interconnect, communicating through memory-mapped I/O.]
DESIRED SOLUTION SPACE
• Software-driven verification
• Software/firmware-hardware co-verification
• High-speed (co-)simulation
• Sound/trusted high-level hardware models
• Supported by select FV
• Uniform high-level models for co-simulation/FV
[Figure: Software (SW), Processor (Proc.), Accelerators (Acc.), and Interconnect, communicating through shared memory access and memory-mapped I/O. Annotation: reduce the number of events.]
[Figure: on-chip interconnect connecting Microcontroller + Firmware, MMU + DMA, DRAM, Memory, and an Accelerator with a NoC interface. A store instruction ("// Store instruction …") reaches the accelerator's memory-mapped registers. Annotations: accessing registers, triggering operations.]
Address   Register
0xff00    status
0xff02    enable
…         …
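The register table above suggests treating each store to a memory-mapped address as one command to the accelerator, with the accelerator's registers as its architectural state. The sketch below illustrates that view only. The addresses 0xff00 (status) and 0xff02 (enable) come from the slide; everything else (status being read-only, a set enable bit starting an operation and marking status busy, the class and method names) is an assumption made for illustration.

```python
STATUS_ADDR = 0xFF00   # from the slide's register table
ENABLE_ADDR = 0xFF02   # from the slide's register table

class AcceleratorModel:
    """Hypothetical high-level model: one store on the MMIO interface = one command."""

    def __init__(self):
        self.regs = {STATUS_ADDR: 0, ENABLE_ADDR: 0}
        self.operations_started = 0

    def store(self, addr, data):
        """Decode and execute one store; returns False if the address is not ours."""
        if addr not in self.regs:
            return False
        if addr == STATUS_ADDR:
            return True                      # assumption: status is read-only
        self.regs[ENABLE_ADDR] = data & 1    # accessing a register
        if data & 1:
            # Assumption: setting enable triggers an operation and marks status busy.
            self.regs[STATUS_ADDR] = 1
            self.operations_started += 1
        return True

    def load(self, addr):
        """Read back a memory-mapped register."""
        return self.regs[addr]

model = AcceleratorModel()
model.store(ENABLE_ADDR, 1)                  # firmware-side store instruction
print(model.load(STATUS_ADDR), model.operations_started)
```

Because the model changes state only when a command arrives, a co-simulation driven by firmware stores processes far fewer events than a cycle-by-cycle RTL simulation of the same block.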
INSTRUCTION-LEVEL ABSTRACTION (ILA)
DARPA POSH Supported
[Figure labels: Tandem Simulation; Co-simulation]
ILA-BASED FORMAL HARDWARE
VERIFICATION
• Formal verification of RTL implementation
• Modular verification — per-instruction checking
• Automating modular verification
ILA: $U \xrightarrow{\,i\,} U'$, where $i$ = instruction
RTL: $V \xrightarrow{\,f^*\,} V'$, where $f$ = RTL state transitions
$r$ = refinement relations, relating $U$ to $V$ and $U'$ to $V'$
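The per-instruction check in the diagram says that running the RTL for as many cycles as one instruction takes, then mapping the result through the refinement relation, must agree with applying the instruction directly to the ILA state. The sketch below is a toy instance of that check under several assumptions: a 2-bit counter with a single hypothetical `INC` instruction, a two-cycle RTL implementation, and brute-force enumeration in place of a formal tool.

```python
from itertools import product

MASK = 0b11  # 2-bit counter keeps the state space small enough to enumerate

def ila_inc(u):
    """ILA instruction i (hypothetical INC): one atomic architectural update."""
    return (u + 1) & MASK

def rtl_step(v):
    """One RTL clock cycle f. RTL state v = (count, temp, busy)."""
    count, temp, busy = v
    if not busy:
        return (count, (count + 1) & MASK, True)   # cycle 1: compute into temp
    return (temp, temp, False)                     # cycle 2: commit temp to count

def rtl_run_instruction(v):
    """f*: apply f until the instruction completes (busy is low again)."""
    v = rtl_step(v)
    while v[2]:
        v = rtl_step(v)
    return v

def refine(v):
    """Refinement map r: project the RTL state onto the ILA state."""
    return v[0]

def check_refinement():
    """Check r(f*(v)) == i(r(v)) from every RTL state at an instruction boundary."""
    for count, temp in product(range(MASK + 1), repeat=2):
        v = (count, temp, False)
        if refine(rtl_run_instruction(v)) != ila_inc(refine(v)):
            return v          # counterexample
    return None

print(check_refinement())
```

The check is modular: each instruction is verified on its own, starting from an arbitrary boundary state, so the cost grows with the number of instructions rather than with the length of instruction sequences.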
[Figure: co-simulation flow. ILA model → ILAtor → C++ instruction-level executable model → high-level co-simulation with system software (HW/SW co-simulation, QEMU). RTL model → Verilator → C++ RTL executable → low-level design co-simulation. The ILA model is a verified abstraction of the RTL model.]
ILA FRAMEWORK
Upcoming target: >100X speedup over RTL simulation using automatically extracted ILA abstraction models
[Figure: ILA framework. Legend: design artifact, verification input, generated model, tools. Labels: ILA (function reference model & properties); modeling; auto-extract from RTL models; refinement map; refinement checker; Pono; formal co-verification; RTL; high-level simulation model; software; tandem simulation; co-simulation.]
RTL example shown in the figure:
    always @(posedge clk) begin
      if (!resetn) begin
        mem_la_firstword_reg <= 0;
        last_mem_valid <= 0;
      end else begin
        if (!mem_valid)
          mem_la_firstword_reg <= mem_la_firstword;
        last_mem_valid <= mem_valid && !mem_ready;
      end
    end
ILANG FRAMEWORK
SELECTED BIBLIOGRAPHY
RAPID DESIGN VALIDATION WITH HIGH-LEVEL SYNTHESIS
This research was developed with funding from the Defense Advanced Research
Projects Agency (DARPA).
The views, opinions and/or findings expressed are those of the author and should
not be interpreted as representing the official views or policies of the
Department of Defense or the U.S. Government.
JASON
CONG
VOLGENAU CHAIR FOR ENGINEERING
EXCELLENCE
UCLA
HLS BASED ACCELERATOR DESIGN ON FPGAS
AT UCLA
ASIC DESIGNS ARE USING HLS AS WELL
[1]: https://resources.sw.siemens.com/en-US/white-paper-google-develops-webm-video-decompression-hardware-ip-using-high-level
[Catapult Verification Flow]: https://cfnewsads.thomasnet.com/spec/40025/40025313.pdf
CURRENT STATE OF HLS VALIDATION
[1]: Yuze Chi, Young-kyu Choi, Jason Cong, Jie Wang. Rapid Cycle-Accurate Simulator for High-Level Synthesis. In FPGA, 2019.
CHALLENGE #1: SCALING TO LARGE DESIGNS
OUR EFFORT #1: BETTER SUPPORT FOR TASK-PARALLEL PROGRAMS
• 4 PEs interconnected w/ a ring
• PEs send packets to ring nodes & receive packets from ring nodes
• Ring nodes forward packets conditionally based on the destination PE
specified in the packet header
Image source: Yuze Chi, Licheng Guo, Jason Lau, Young-kyu Choi, Jie Wang, Jason Cong. Extending High-Level Synthesis for Task-Parallel
Programs. In FCCM, 2021.
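The ring example above can be sketched as a small step-based simulation. This is a simplified illustration, not the design from the cited paper: it assumes a unidirectional ring, one packet handled per ring node per step, and unbounded queues, and all names are hypothetical.

```python
from collections import deque

NUM_PES = 4

def simulate_ring(packets, max_steps=100):
    """packets: list of (src_pe, dst_pe, payload). Returns {pe: [payloads received]}."""
    # ring_in[n] holds packets waiting at ring node n, from its PE or the previous node.
    ring_in = [deque() for _ in range(NUM_PES)]
    for src, dst, payload in packets:
        ring_in[src].append((dst, payload))          # PE sends to its own ring node
    received = {pe: [] for pe in range(NUM_PES)}
    for _ in range(max_steps):
        if not any(ring_in):
            break                                    # nothing left in flight
        forwarded = []
        for node in range(NUM_PES):
            if ring_in[node]:
                dst, payload = ring_in[node].popleft()
                if dst == node:
                    received[node].append(payload)   # header matches: deliver to local PE
                else:
                    forwarded.append(((node + 1) % NUM_PES, (dst, payload)))
        for next_node, packet in forwarded:          # forwarding takes effect next step
            ring_in[next_node].append(packet)
    return received

print(simulate_ring([(0, 2, "a"), (1, 3, "b"), (3, 0, "c"), (2, 2, "d")]))
```

Every PE both sends to and receives from its ring node, so the task graph contains cycles. That feedback is what makes such designs awkward to simulate by running one task to completion before starting the next.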
EXTENDING KERNEL PROGRAMMING INTERFACES
Case 1: destinations do not conflict
Case 2: destinations do conflict
[Figure labels: 1→2, 1→3, 2→3]
Image source: Yuze Chi, Licheng Guo, Jason Lau, Young-kyu Choi, Jie Wang, Jason Cong. Extending High-Level Synthesis for Task-Parallel Programs. In
FCCM, 2021.
SOFTWARE SIMULATION
• Sequential simulator: simulates PE 1, RingNode 1, PE 2, RingNode 2, … one after another on Thread #1, Core #1. Incorrect (does not match spec.)
• Multi-thread simulator: one thread per task (Thread #1 for PE 1, Thread #2 for RingNode 1, Thread #3 for PE 2, Thread #4 for RingNode 2, …) sharing Core #1 and Core #2; context switch cost 1.2-2.2 μs [1]. Not scalable
• TAPA's coroutine-based simulator: one coroutine per task (Coroutine #1 for PE 1, Coroutine #2 for RingNode 1, Coroutine #3 for PE 2, Coroutine #4 for RingNode 2, …), with Thread #1 on Core #1 and Thread #2 on Core #2; context switch cost 26 ns [2]. Correct & scalable
[1]: https://eli.thegreenplace.net/2018/measuring-context-switching-and-memory-overheads-for-linux-threads/
[2]: https://www.boost.org/doc/libs/1_73_0/libs/coroutine2/doc/html/coroutine2/performance.html
[TAPA]: Yuze Chi, Licheng Guo, Jason Lau, Young-kyu Choi, Jie Wang, Jason Cong. Extending High-Level Synthesis for Task-Parallel Programs. In FCCM, 2021.
Images are authors’ own.
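The coroutine idea can be illustrated with Python generators standing in for coroutines: each task yields whenever it would block on a channel, and a scheduler resumes the next task on the same thread. This is only a sketch of the scheduling idea under stated assumptions, not TAPA's implementation; the channel class, the task bodies, and the round-robin policy are all hypothetical.

```python
from collections import deque

class Channel:
    """Bounded FIFO between two tasks (hypothetical helper)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = deque()

    def full(self):
        return len(self.items) >= self.capacity

    def empty(self):
        return not self.items

def producer(out_ch, ack_ch, count):
    """Sends `count` tokens and waits for an acknowledgement after each one."""
    for token in range(count):
        while out_ch.full():
            yield                       # blocked on write: hand control back
        out_ch.items.append(token)
        while ack_ch.empty():
            yield                       # blocked on read: hand control back
        ack_ch.items.popleft()

def consumer(in_ch, ack_ch, count, log):
    """Receives `count` tokens and acknowledges each one."""
    for _ in range(count):
        while in_ch.empty():
            yield
        log.append(in_ch.items.popleft())
        while ack_ch.full():
            yield
        ack_ch.items.append(True)

def run(tasks):
    """Round-robin scheduler: resume each coroutine until all have finished."""
    ready = deque(tasks)
    resumes = 0
    while ready:
        task = ready.popleft()
        resumes += 1
        try:
            next(task)
            ready.append(task)          # task yielded: it is still alive
        except StopIteration:
            pass                        # task finished
    return resumes

data, ack, log = Channel(1), Channel(1), []
run([producer(data, ack, 3), consumer(data, ack, 3, log)])
print(log)
```

A sequential simulator that ran `producer` to completion first would wait forever on the first acknowledgement, since `consumer` never gets to run. Threads would avoid that at the cost of an operating-system context switch per blocking point, which is the gap between the two latencies cited on the slide.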
SIMULATION TIME REDUCTION
Image source: Yuze Chi, Licheng Guo, Jason Lau, Young-kyu Choi, Jie Wang, Jason Cong. Extending High-Level Synthesis for Task-Parallel Programs. In FCCM, 2021.
Sequential simulation does not produce Cannon & PageRank
CHALLENGE #2: LACK OF CYCLE-ACCURACY
Image source: Yuze Chi, Young-kyu Choi, Jason Cong, Jie Wang. Rapid Cycle-Accurate Simulator for High-Level Synthesis. In FPGA, 2019.
OUR EFFORT #2: RAPID CYCLE-ACCURATE
SIMULATION FOR HLS
• FLASH: Fast, paralleL, Accurate Simulator for HLS
• Scheduling information extracted to achieve cycle-accurate results
Image source: Yuze Chi, Young-kyu Choi, Jason Cong, Jie Wang. Rapid Cycle-Accurate Simulator for High-Level Synthesis. In FPGA, 2019.
SIMULATION TIME COMPARISON
Image source: Yuze Chi, Young-kyu Choi, Jason Cong, Jie Wang. Rapid Cycle-Accurate Simulator for High-Level Synthesis. In FPGA, 2019.
OVERALL WORKFLOW
A-QED INSPIRED BY SYMBOLIC QED FOR
PROCESSORS
• Universal property based on self-consistency: sound, complete
[Figure: Industry Flow vs. Symbolic QED, two comparisons. Labels on the first: 0%, 100%, +7%. Labels on the second: 6 person months, 2 person days.]
E. Singh, D. Lin, C. Barrett and S. Mitra, “Logic Bug Detection and Localization using Symbolic Quick Error Detection,” IEEE Trans. CAD, 2018.
E. Singh et al. Symbolic QED Pre-silicon Verification for Automotive Microcontroller Cores: Industrial Case Study. DATE’19
F. Lonsing, S. Mitra, and C. Barrett. A Theoretical Framework for Symbolic Quick Error Detection. FMCAD’20
A-QED FOR HARDWARE ACCELERATORS
S. Chattopadhyay et al. Scaling Up Hardware Accelerator Verification using A-QED with Functional Decomposition. FMCAD ‘21
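The self-consistency idea behind QED-style checks can be sketched as a functional consistency search: submit the same input more than once within a sequence and flag any case where the outputs differ. No golden model is needed. The toy accelerator, its injected bug, and the bounded enumeration below are assumptions for illustration; the cited work uses symbolic formal tools rather than enumeration.

```python
from itertools import product

class BuggyAccelerator:
    """Toy accelerator meant to compute 2*x; leftover state corrupts some results."""
    def __init__(self):
        self.stale = False

    def process(self, x):
        result = 2 * x + (1 if self.stale else 0)
        self.stale = (x == 3)       # injected bug: state leaks into the next operation
        return result

class CorrectAccelerator:
    """Same function without the leftover state."""
    def process(self, x):
        return 2 * x

def find_inconsistency(make_accelerator, inputs, max_len):
    """Search input sequences up to max_len for one input with two different outputs."""
    for length in range(2, max_len + 1):
        for seq in product(inputs, repeat=length):
            acc = make_accelerator()
            seen = {}
            for x in seq:
                y = acc.process(x)
                if x in seen and seen[x] != y:
                    return seq, x, seen[x], y   # sequence, input, first and second output
                seen[x] = y
    return None

print(find_inconsistency(CorrectAccelerator, range(4), max_len=3))
print(find_inconsistency(BuggyAccelerator, range(4), max_len=3))
```

The check is the same for any accelerator that is supposed to be a pure function of its input, which is what makes the property universal rather than design-specific.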
TAKE-AWAYS
• Partially funded by DARPA POSH, NSF, Stanford SystemX Alliance
The views, opinions and/or findings expressed are those of the author and should
not be interpreted as representing the official views or policies of the
Department of Defense or the U.S. Government.
DANIEL
SANCHEZ
ASSOCIATE PROFESSOR
MIT
• Slow RTL simulation hinders the design and verification of complex chips
• Software simulators parallelize poorly due to limitations of multicore processors
• Specialized emulators are expensive, hard to scale, slow to compile for
[Figure labels: specialized compute PEs and memory]
[Figure: speedup (y-axis, 0 to 1) versus number of cores (x-axis, 1 to 8)]
… of only 1.2x when simulating a chip with 256 RISC-V cores
• Poor match to existing multicores, which work well only with coarse-grained
tasks that synchronize infrequently
ARCHITECTURAL SUPPORT
FOR FINE-GRAIN PARALLELISM
• Chronos [ASPLOS’20] targets hard-to-parallelize applications
• Hardware supports extremely short tasks (~10 instructions)
• Implicit synchronization by defining a global order among tasks
• Hardware executes tasks speculatively and out of order to scale
[Figure: Tiles 0, 1, 2, …, N connected by a task-traffic interconnect. Each tile has a task unit and PEs (RISC-V cores or application-specific RTL).]
ARCHITECTURAL SUPPORT
FOR FINE-GRAIN PARALLELISM
• Chronos achieves order-of-magnitude speedups on discrete-event
simulation using existing infrastructure
Chronos is 15x faster than 40-thread CPU despite 19x lower frequency
because it exploits fine-grained parallelism effectively
ESANA’S APPROACH: SCALING
AND ACCELERATING SOFTWARE SIMULATION
Input: HDL code (Verilog)
    module foo;
    …
    module bar;
    …
Compile → ESANA simulator: parallel simulation code
    class foo {
      void eval() {
        enqueue(bar.eval());
        …
• Fine-grained ordered tasks
• Event-driven execution
• Distributed state
• Uses specialized hardware
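The task model named above (fine-grained tasks, ordered by time, created only when a signal changes) can be sketched as a sequential reference: a priority queue orders tasks by timestamp, and evaluating a gate enqueues a task for its output. This is an assumption-based illustration that runs one task at a time; it does not show the parallel or speculative execution the slides describe, and all names are hypothetical.

```python
import heapq

def simulate(gates, initial, stimuli):
    """gates: {output: (func, [inputs], delay)}; stimuli: [(time, signal, value)]."""
    values = dict(initial)
    fanout = {}
    for gate, (_, inputs, _) in gates.items():
        for sig in inputs:
            fanout.setdefault(sig, []).append(gate)
    queue = []
    seq = 0                                    # tie-breaker keeps the order deterministic
    for time, sig, val in stimuli:
        heapq.heappush(queue, (time, seq, sig, val))
        seq += 1
    evaluations = 0
    while queue:
        time, _, sig, val = heapq.heappop(queue)   # earliest task first
        if values.get(sig) == val:
            continue                           # no change: nothing downstream runs
        values[sig] = val
        for gate in fanout.get(sig, []):
            func, inputs, delay = gates[gate]
            evaluations += 1
            new_val = func(*(values[i] for i in inputs))
            heapq.heappush(queue, (time + delay, seq, gate, new_val))  # enqueue a task
            seq += 1
    return values, evaluations

gates = {
    "n1": (lambda a, b: a & b, ["a", "b"], 1),
    "out": (lambda n1, c: n1 | c, ["n1", "c"], 1),
}
initial = {"a": 0, "b": 1, "c": 0, "n1": 0, "out": 0}
stimuli = [(0, "a", 1), (5, "c", 0), (10, "b", 1)]
print(simulate(gates, initial, stimuli))
```

Each task touches only one signal and its immediate fan-out, so tasks are a handful of operations long. That granularity is what makes them a poor fit for conventional threads and a natural fit for hardware that manages ordered tasks directly.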
SPECIALIZING FOR HIGHER EFFICIENCY
[Figure: tiled architecture. Tiles contain specialized PEs, general-purpose cores, local cache memory, and task management hardware. Specialized PEs are built from functional units (FU) with data inputs and data outputs. Tiles connect through switches to I/O.]
CONCLUSION
THANK YOU! QUESTIONS?