
Semiconductor Scaling is Failing
What next for processors?

Roddy Urquhart

DESIGN PROCESSORS SIMPLER, FASTER, AND CHEAPER.

Codasip® was founded on a simple belief – that we could bring together the brilliance of
microprocessor architects, RTL engineers, and software engineers and capture it in tools
that made design simpler, faster, and more affordable.

Codasip Studio™ was born in 2014 with the mission of automating processor design. At
that time, we already believed in the power and future of RISC-V, so it was natural for
Codasip to embrace it as an Open Instruction Set Architecture (ISA) and implement it in
all our processors.

After 50 years, Moore’s Law, Dennard Scaling, and Amdahl’s Law are failing. The
semiconductor industry must change, and processor paradigms must change with it.

The future is in:

• Domain-Specific Accelerators
• Customized solutions
• New leaders that disrupt

Some of the information in this document may be incorrect due to changes in product specifications that may
have occurred since publishing. Please ask a Codasip sales representative for the latest information.

© 2022 Codasip Group. All rights reserved.


Codasip, CodAL, Codasip Studio, Codasip CodeSpace and respective RISC-V Processors core names are either
the registered service marks, registered trademarks or trademarks of Codasip Group in the United States and other
jurisdictions. Other brands and names mentioned herein may be the trademarks of their respective owners.

Semiconductor Scaling is Failing. Published in March 2022.


Executive Summary
The electronics industry has demanded increasing functionality and performance,
especially in areas such as mobile phones, automotive, edge processing and
communications systems. Improvements from one product generation to the next are
delivered by new generations of SoCs and associated software.

For decades, the semiconductor industry has relied on scaling laws such as Moore’s Law to
deliver denser and faster chips. Today scaling is failing, with an effective upper bound
on clock frequencies and new technology nodes becoming prohibitively expensive. Design
teams have responded with multi-core systems and more specialized cores such as DSPs
and GPUs, but single thread performance improvements are at the end of the line.

This paper considers the challenge of the end of scaling and how processor
specialization is the way to deliver further performance improvements. The industry must
change from adapting software to fit on available hardware to tailoring computational
units to match their computational load. Many varied, custom designs will be needed. It
will be necessary to design for differentiation.

Finally, new approaches to processor design are considered, in particular the value of
architectural languages and processor design automation. RISC-V is an excellent starting
point for specialized cores given its modularity and provision for custom instructions.
Existing RISC-V designs described in an architecture description language can provide
a strong foundation for tailored processing units. Industry will need to change its
design approach if it is to continue to innovate and deliver cutting-edge future products.

Table of Contents
Introduction

Semiconductor Scaling

Failure of Scaling

The Future is Specialization

Design for Differentiation

Conclusion

Find Out More


Semiconductor Scaling is Failing

Introduction
New generations of electronic products have demanded more and more functionality
and performance. These increases in capability rely a great deal on advances in
integrated circuits.

For about 50 years the semiconductor industry has relied on shrinking silicon geometries
to achieve greater design complexity and processor performance for an acceptable
cost. This shrinking has been most famously described by Moore’s Law and the less well-
known Dennard Scaling. This virtuous and predictable scaling is broken – so how can we
achieve improvements in performance in the future?


Semiconductor Scaling
The remarkable advances in computational performance and in data storage in recent
decades have been achieved by moving to process nodes with successively smaller
geometries. This scaling was predicted by Gordon Moore of Intel and achieved through
scaling rules developed by Robert Dennard of IBM.

Figure 1. Gordon Moore. Source: Intel Free Press.
Figure 2. Robert Dennard. Source: Fred Holland.

Moore’s Law

In an article1 in Electronics in 1965, Gordon Moore predicted that it would be possible to
double the density of integrated circuits every year. This prediction was based on early
advances in manufacturing MOSFETs. A decade later he revised his prediction2 to say
that the transistor density would double approximately every two years.

Moore’s Law has depended on both continuous improvements in silicon processing
technology and electronic design automation (EDA). Finer geometries demand
advances in deposition, lithography, implantation, interconnect and circuit design. EDA
tools have had to accommodate increased design complexity and changing device
physics as they target new, finer technology nodes.

Some of the changes in design methodology, such as the use of automatic place and route,
logic and HDL synthesis, and power optimization strategies, were required to cope with the
design complexity changes enabled by Moore’s Law.

1 Moore, Gordon E., “Cramming more components onto integrated circuits”, Electronics, Volume
38, Number 8, April 19, 1965
2 G. E. Moore, “Progress in Digital Integrated Electronics.” Technical Digest 1975. International
Electron Devices Meeting, IEEE, 1975, pp. 11-13


Dennard Scaling

Robert Dennard is famous for inventing the single-transistor memory cell in 1966. This
invention enabled the development of DRAMs, which have been fundamental to
computers over the last four decades.

Semiconductor scaling not only involves the geometries of transistors but must take
account of power supplies and power dissipation. In 1974, Robert Dennard and colleagues
at IBM proposed3 an approach to scaling that ensured that the electric field remained
constant from one technology generation to the next.

Thus, if the transistor dimensions are reduced by 30%, the transistor area is reduced by 50%.
To keep the electric field constant, the voltage V must also be reduced by 30%.

As a result, the circuit delays also reduce by 30%, enabling a 40% increase in
operating frequency f. Furthermore, capacitance C decreases by 30% too.

Since dynamic power P = CV²f, the combined effects of the new geometry result in a 50%
power reduction.
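
Put compactly, with a linear shrink factor of 0.7 per generation, the relations above work out as:

    L′ = 0.7L,  A′ = (0.7)²A ≈ 0.5A,  V′ = 0.7V,  C′ = 0.7C,  f′ = f/0.7 ≈ 1.4f
    P′ = C′(V′)²f′ = (0.7C)(0.7V)²(1.4f) ≈ 0.5CV²f = 0.5P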

This scaling meant that moving to a smaller silicon geometry brought immediate benefits
to processor design. The finer geometry enabled more complex cores to be designed
with higher clock frequencies while keeping power consumption at the same level as the
older geometry.

A Virtuous Combination for Microprocessors

With semiconductor manufacturing and design automation technology enabling the
adoption of successively smaller silicon geometries, there was remarkable progress in the
development of microprocessors in the late 20th Century and beyond.

Intel launched the first commercial microprocessor in 1971 – the Intel 4004. Just three
years later, Intel launched the 8080 – the first mainstream 8-bit microprocessor. In 1978,
Intel launched their first high volume 16-bit core, the 8086. Its derivative, the 8088, was
designed into the IBM PC, which was the forerunner of modern PCs. The 8086 design had
derivatives which were the first in the hugely successful series of x86 microprocessor designs.
The first 32-bit x86 core was the i386 in 1985 and the x86-64 ISA was announced in 1999. The x86
architecture grew considerably in complexity over time.

While x86 had developed along CISC lines, other architectures such as Arm, SPARC,
PowerPC and MIPS started with RISC principles. But like x86 processors, these too grew in
performance and in wordlength.

The virtuous combination of Moore’s Law and Dennard Scaling was particularly successful
even 40 years after Gordon Moore’s first predictions. Just considering 32-bit cores, the
transition from the i386 in a 1,500 nm process to Prescott in 90 nm was accompanied by an
increase in clock frequency from 33 MHz to 3,800 MHz.

3 Dennard, Robert H.; Gaensslen, Fritz; Yu, Hwa-Nien; Rideout, Leo; Bassous, Ernest; LeBlanc, Andre
(October 1974). “Design of ion-implanted MOSFET’s with very small physical dimensions”. IEEE
Journal of Solid-State Circuits. SC-9 (5): 256–268


Processor complexity increases included larger wordlengths, deeper pipelines and
performance features such as branch prediction, superscalar and out-of-order execution,
and SIMD units.

With semiconductor scaling working so well, for many applications the latest single
core microprocessors were able to deliver the required performance improvements
simply by increasing their complexity and turning up the clock frequency. Applications
requiring either dedicated hardware or special architectures were relatively rare and
were mainly embedded or in supercomputing. Semiconductor scaling seemed to follow
the Goldilocks principle.

When semiconductor scaling worked the main challenge was to get software tuned
to the needs of the available microprocessors.


Failure of Scaling
Well as the saying goes ‘all good things come to an end’ and we are now talking about
the ‘End of Moore’s Law’. Some of the key trends are summarized in the following chart
looking at half a century of processors.

Figure 3. 48 Years of Microprocessor Trend Data. Source: K. Rupp.

The vertical axis is logarithmic, reflecting Gordon Moore’s prediction. However, it is
immediately apparent that maximum clock frequency has leveled off and that even
single thread performance improvements have slowed down considerably.

Leakage Power and End of Dennard Scaling

When Robert Dennard and colleagues formulated their scaling principles, an underlying
assumption was that power dissipation was overwhelmingly dynamic power P = CV²f,
dependent on capacitance, voltage and operating frequency. Leakage power in larger
geometries was negligible.
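
Once leakage matters, a standard first-order approximation (a general model, not specific to any one process) is:

    P_total = αCV²f + V·I_leak

where α is the switching activity and I_leak is the leakage current, which grows exponentially as the threshold voltage is scaled down along with the supply voltage V. The static term V·I_leak then erodes the budget left for dynamic power.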

This changed with the adoption of geometries of 90 nm and below. With smaller
geometries there is ever-increasing leakage power, and the budget for dynamic power
is reduced. Around 90 nm, an immediate consequence was that microprocessors could not
simply rely on increasing clock frequency to deliver enhanced single core performance.
Intel had planned to take advantage of semiconductor scaling with their Tejas/Pentium V
design. It was reported in 20034 that the design would aim for 5 to 7 GHz and that
deeper 40-50 stage pipelines would be involved5.

4 Paul Dutton, “Pentium V will launch with 64-bit Windows Elements”, Computex 2003, 26
September 2003, retrieved 21/01/2022

It was therefore highly significant when Intel cancelled this project in May 2004. Given
that the previous Prescott design had failed to achieve clock speeds of 5 GHz due to heat
dissipation and power consumption, the Tejas/Pentium V would also have failed to reach
its projected frequency.

This marked the end of the road for relying on Dennard Scaling and increased
clock frequencies to achieve performance with single cores. As the microprocessor
trend data (above) shows, there is a clock frequency ceiling of about 5 GHz – and in
practice most designs stop at around half of that for thermal reasons.

Subsequently Intel has focused more on dual cores and architectural features rather than
high clock frequencies to achieve their desired performance.

Moore’s Law Grinds to a Halt

So, with Dennard Scaling ending, can we not simply be careful about clock frequency
and still hope to make progress with Moore’s Law? Although the semiconductor industry
is investing in the most advanced technology nodes, there are only three companies –
TSMC, Samsung and Intel – that can afford to do so. In January 2022, Intel announced
plans for two new wafer fabs in Ohio for $20 billion ($10B per fab)6. In contrast, when Intel
announced their Dalian, China wafer fab in 20077, the cost was $2.5B.

Furthermore, the rate at which it is possible to move to new, finer silicon geometries has
slowed down. Moore’s Law is grinding to a halt due to fundamental limitations in silicon
device physics. Physics suggests that going below 1 nm is going to require very different
technologies (e.g. nanowires). There are simply too few atoms in these fine geometries to
reliably build wires.

Economics is an even bigger problem: the exponentially growing cost of wafer fabs for new,
finer technology nodes means that devices in those nodes are no longer going down in
price. Whereas each new node, with its higher density, used to bring an exponential decline
in cost-per-gate, we now see cost-per-gate flat or rising with finer geometries. Moore’s Law
no longer applies to falling cost.

5 “Chip magicians at work: patching at 45nm”, Redactie Tweakers, retrieved 21/01/2022
6 Stephen Shankland, “Intel’s $100B Ohio ‘megafab’ could become world’s largest chip plant”,
CNET, 21st January 2022
7 Alexei Oreskovic, “Intel Builds Wafer Fab in China”, TheStreet, 26th March 2007


Figure 4. Gate cost trend. Source: Marvell Investor Day 2020.

Multiple Logical Cores and Amdahl’s Law

With the end of Dennard Scaling, using increasing clock frequencies to gain higher
performance was no longer practical, and multi-core designs became standard. A very
obvious area for this was in mobile phone processing. The multi-core approach had two
angles: firstly, using additional processors for specialized tasks, and secondly, using multi-
core configurations for application processing.

Graphics and DSP are two computationally intense activities that are not well-handled
by general-purpose cores. Thus, mobile phone chips adopted specialized GPUs and DSP
cores to handle these activities. Microcontrollers were often used for handling things like
Bluetooth or Wi-Fi protocol stacks or security functions.

Just as Intel had transitioned from single to dual core microprocessors, mobile phone
companies moved from single to dual core designs based on the Arm architecture. For
example, in 2011, Apple’s A5 chip had dual Cortex-A9 cores. This trend to more cores
continued with, for example, Qualcomm’s Snapdragon 200 series having quad-core
configurations in 2013 and Samsung offering 8 cores with their Exynos 7 Octa chips from
2015.

Returning to the 48 years of microprocessor trend data, it should be no surprise to see that
the number of logical cores has been steadily rising since 2005. However, the use of multi-
processing is limited by Amdahl’s Law, which deals with the theoretical speedup possible
when adding processors in parallel. In practice any speedup will be limited by those
parts of the software that must be executed sequentially. For applications such
as running a mobile phone operating system we are probably already achieving what
is possible.
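
Amdahl’s Law makes the ceiling explicit. If a fraction p of a workload can be parallelized across N cores, the overall speedup is:

    S(N) = 1 / ((1 − p) + p/N), so S(N) → 1/(1 − p) as N → ∞

For example, even if 90% of the code parallelizes perfectly, no number of cores can deliver more than a 10× speedup.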


End of the Line

Summarizing the effects of failing scaling: the well-worn approach of moving to new
process nodes and ever higher clock frequencies will not deliver significantly better
performance.

Figure 5. The end of the line. Source: John Hennessy and David Patterson, Computer Architecture: A
Quantitative Approach, 6/e, 2018. [The chart plots single-processor performance relative to the
VAX11-780 on a logarithmic axis, with successive growth eras: CISC, 2X/2.5 years (22%/year); RISC,
2X/1.5 years (52%/year); end of Dennard Scaling and the move to multicore, 2X/3.5 years (23%/year);
the Amdahl’s Law era, 2X/6 years (12%/year); and the end of the line, 2X/20 years (3%/year).]

In their 2018 book Hennessy and Patterson consider processor performance over almost
four decades. They describe improvements in single core performance of 3%/year as
“end of the line”.


The Future is Specialization

Progress in computational performance since the end of Dennard Scaling has been
achieved by using multiple off-the-shelf cores. However, such cores are falling into silos
such as MCU, DSP, GPU and application processor (AP).

Figure 6. Most processor IP is in silos: MCU, DSP, GPU and AP.

Each type of core is optimized for a range of computations which may not match what is
actually required in an on-chip subsystem. For example, if a mixture of protocol handling
and DSP algorithms were needed, the choice might be to instantiate both an MCU and a
DSP core. This could potentially be wasteful if one or both were underutilized. Perhaps the
ideal for this hypothetical case would be a sort of fusion of an MCU and a DSP?

However, to meet a particular set of needs, there will almost certainly not be an ideal
fusion core available as off-the-shelf IP. Some companies have already developed very
specialized cores – often known as application-specific instruction set processors (ASIPs) –
to efficiently handle a narrowly-defined computational workload. However, developing
such cores has required very specialized skills to define the instruction set, develop the
processor microarchitecture and associated software toolchain, and finally verify the
core. This has required a combination of processor architects, RTL developers, compiler
developers and verification engineers. Few companies have been able to afford this
approach, especially as manual development is time-consuming and can be error-prone.

One of the reasons why it has been challenging to develop ASIPs is that an instruction
set usually needs to be developed from scratch. Commercial ISAs such as Arm’s are not
available without a very expensive architectural license and few companies have the
skills to develop their own ISA.

Some open architectures have been available, such as OpenSPARC and OpenRISC.
However, they have not been widely adopted because they have lacked the flexibility
needed for domain-specific architectures.

Domain-Specific Accelerators

The only way in the short term to counter failing scaling is to take specialization a stage
further by creating innovative architectures for tackling specialized processing problems.
Instead of the classic approach of tailoring software to available microprocessors, it is
now necessary to create hardware that is designed to match a software workload. This
can be achieved through customizing an ISA, creating special microarchitectures, or
creating novel processing cores and arrays.


In a well-known paper describing a ‘New Golden Age for Computer Architecture’8, John
Hennessy and David Patterson trace the history of computer architecture and describe the
challenges faced with the end of scaling. They identified domain-specific architectures
or domain-specific accelerators (DSAs) as a new opportunity for computer architecture.
They see DSAs as programmable processors tailored for a specific class of applications.
DSAs are distinct from general-purpose processors and distinct from ASICs, which may
have either no or limited programmability. ASIPs might be considered a subset of DSAs.
They characterized DSAs as exploiting parallelism – such as instruction level parallelism (ILP),
SIMD or systolic arrays – if the class of applications benefits from it. They said that DSAs
make efficient use of memory, since memory accesses are often more costly in energy than
arithmetic operations. Also, DSAs can take advantage of reduced arithmetic precision with,
for example, some AI/ML inference applications working adequately with 4-, 8- or 16-bit
arithmetic. For example, Google’s Tensor Processing Unit (TPU) is a systolic array working
with 8-bit precision.

With the end of Moore’s Law, the future lies in tailoring processor hardware to its
specialized computational workload.

In another paper, William J. Dally, Yatish Turakhia & Song Han identify a number of key
insights on the performance of DSAs9:

• Specialization enables parallelism, which can enable a DSA to deliver efficiency
• Memory dominates the area and power of a DSA subsystem
• Specialized instructions deliver much of the benefit of a DSA at low
development cost
• DSAs are one of the few ways available to raise performance after the end of
semiconductor scaling
• Special operations on domain-specific data types can do in one cycle what
might require tens of cycles on a general-purpose core

Nevertheless, there will continue to be a need for some general-purpose processors, for
example to run operating systems, and also for some dedicated hardware.

Will DSAs mean the End of Makimoto’s Wave?

In the era of Moore’s Law, SoC designers have generally faced a choice between general-
purpose processors and dedicated non-programmable hardware.

8 John L. Hennessy & David A. Patterson, “A New Golden Age for Computer Architecture”,
Communications of the ACM, February 2019, Vol. 62 No. 2, Pages 48-60
9 William J. Dally, Yatish Turakhia & Song Han, “Domain-Specific Hardware Accelerators”,
Communications of the ACM, July 2020, Vol. 63 No. 7, Pages 48-57


Figure 7. Makimoto’s wave. [The diagram shows the industry swinging between the flexible,
standardized general-purpose processor and efficient, customized hardwired logic.]

Dr Tsugio Makimoto, CTO of Sony in the early 1990s, observed that the industry
tended to fluctuate between standardized and customized solutions cyclically with a
period of approximately 10 years10. For example, the electronics industry swung from
microprocessors in the 1970s to ASICs in the 1980s.

An inevitable consequence of standardization is lack of differentiation. For example,
many 32-bit microcontrollers use the same Arm cores, have similar instruction and data
memory sizes and differ little from one another.

With the advent of programmable, domain-specific accelerators the industry is no longer
faced with a binary choice but with a continuum.

Figure 8. The domain-specific continuum: from the flexible general-purpose processor, through
the domain-specific accelerator, to efficient hardwired logic.

A DSA can vary from being close to hardwired logic (limited programmability) to being
close to a general-purpose processor such as an MCU.

10 Tsugio Makimoto, “The Hot Decade of Field Programmable Technologies”, retrieved 3/2/2022


The more specialized the processor, the greater the efficiency in terms of silicon area and
power consumption. The less specialized, the greater the flexibility of the DSA. On the
DSA continuum there is the possibility of fine-tuning a core for performance, area and
power – and design for differentiation is enabled.


Design for Differentiation

If DSAs are to be used to achieve performance, area and power goals through
specialization, there are opportunities for designers to differentiate, as they are no longer
simply using the same off-the-shelf cores as their competitors. When referring to goals,
it is important to note that the commonly used “PPA” measures usually refer to a processor
core in the narrowest sense. However, memory accesses can consume significantly
more power than arithmetic operations, and instruction memory often dominates area,
particularly in embedded systems. Thus, realistic PPA is really a topic at the subsystem, not
the processor core, level.

Specialization is not only a great opportunity; it also means that many different
designs will be created, requiring the involvement of a broader community of designers and a
greater degree of design efficiency.

There are four key enablers that can contribute to efficient design for differentiation:

• The open RISC-V ISA
• Architectural languages
• Processor design automation
• Existing verified RISC-V cores for customization

But before considering design issues, let’s consider how differentiation can be achieved.

Identify the Secret Sauce

Differentiation in an SoC is likely to be some combination of functionality, performance
and cost. Some functions such as operating systems, communication protocols or coding
formats are standardized and therefore are not differentiated. Other functions, such as AI
inference, audio or video processing, often benefit from special algorithms and custom
data types and therefore can be differentiated. Identifying differentiated algorithms and
then designing hardware tailored for them adds value to the final SoC.

Even in an area such as neural networks, where the underlying computations are
dominated by vector and matrix operations, there are opportunities for differentiation.
Unlike training, where typically floating-point arithmetic is used, inference will be based
on quantized integers. A variety of approaches can be taken, such as vector engines,
SIMD or systolic arrays. AI startup Mythic has even created an Analog Compute Engine
(ACE™) to handle such computations.

Finally, and importantly, it makes sense to analyze a software workload to identify
computational bottlenecks. If bottlenecks are understood, then the next step is to define
new instructions to address them. As was noted above, special instructions can be the
key to achieving DSA performance.

RISC-V Enables Specialization

Earlier we noted that commercial ISAs have not been available to teams developing
special processor cores and few teams have had the expertise to develop an ISA from
scratch. The advent of RISC-V has broken this deadlock for many use cases.


Firstly, it is a free and open standard offering the benefits of an architectural license
with no license fee. Secondly, the standard only covers the instruction set architecture
and not the microarchitecture. Therefore, any RISC-V user is able to create their own
microarchitecture without restriction. Thirdly, RISC-V does not prescribe a licensing model,
so both commercially licensed and open-sourced microarchitectures are possible.
Fourthly, by being owned by the community rather than a company, RISC-V has no
vendor lock-in. These compelling reasons have meant that RISC-V has seen widespread
support from companies large and small, researchers and academia. This support has
come from all the major regions of the world involved in IC design.

RISC-V is also modular, making it an ideal ISA for building specialized processors and
accelerators.

Figure 9. The RISC-V modular instruction set: the base integer set, optional standard extensions
and optional non-standard extensions.

There are three categories of instruction:

1. The base integer instruction set, which for a particular wordlength must always
be used. Like the early RISC ISAs there are very few instructions, with just 40 32-bit
instructions in RV32I. For most applications, though, this base integer set is unlikely to be
sufficient on its own.
2. Greater performance is possible through using optional standard extensions. There
are a number of sets including single, double and quad precision floating point, packed
SIMD, compressed, bit manipulation, etc.
3. For special instructions that are not covered well by optional standard extensions, it is
possible to create non-standard custom extensions.

This modularity is very well-aligned with the needs of DSAs as it is possible to tune the
instruction set used to a computational workload. Designers can choose which standard
extensions are needed for an accelerator and can create custom instructions for fine
tuning and differentiation.

Unlike a custom ASIP ISA, which must be developed from scratch, RISC-V is an excellent
starting point for creating a special ISA for an accelerator. Choosing from existing sets of
instructions is far less costly than developing an entire ISA.
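
A small example makes the value of choosing the right extensions concrete. The C routine below (an illustrative sketch, not taken from any Codasip toolchain) shows what multiplication costs on the bare RV32I base, which has no multiply instruction: the compiler must call a software routine like mul_soft(), typically provided in the compiler support library (e.g. as __mulsi3). Selecting the standard M extension (RV32IM) turns every such call into a single mul instruction.

#include <stdint.h>

/* Shift-and-add multiply: roughly what a compiler support routine
 * must do on RV32I, which lacks a hardware multiplier. */
static uint32_t mul_soft(uint32_t a, uint32_t b) {
    uint32_t result = 0;
    while (b != 0) {        /* up to 32 iterations */
        if (b & 1u)
            result += a;    /* conditional add per bit of b */
        a <<= 1;
        b >>= 1;
    }
    return result;          /* tens of cycles per multiply */
}

/* On RV32IM the multiply below compiles to one mul instruction;
 * on RV32I it becomes a call to a routine like mul_soft() above. */
int32_t dot(const int32_t *x, const int32_t *y, int n) {
    int32_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += x[i] * y[i];
    return acc;
}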

Architectural versus Hardware Description Languages

It may be tempting to think that, with the RISC-V ISA already defined, moving to a specific
microarchitecture is easy – after all, there are open-source RTL implementations available.
And if a design team wants to create its own microarchitecture, describing a processor
core in a hardware description language (HDL) is an approach that has been used for the
last quarter of a century.


However, there are fundamental differences between generic hardware described in an
HDL and a processor. Processors have very complex state spaces. Unlike other parts of an
SoC, a processor core executes software, therefore it is essential to cover both hardware
and software aspects of the design.

Figure 10. Hardware description languages only cover hardware.

If a core is described with an HDL, whether an older one like Verilog or a newer one like
Chisel, all that can be described is hardware. For common configurations of the RISC-V
instruction set like RV32IMC, there are open-source software toolchains and instruction set
simulators (ISS) available. However, as soon as custom instructions are needed, it is
necessary to manually change the toolchain or ISS. Another area for manual work is
creating a verification environment to check that the microarchitecture is consistent with
the instruction accurate description. These manual actions – often undertaken in parallel –
add significant technical risk.

Figure 11. An architectural language covers software and verification as well as hardware.

If an architectural language – such as Codasip’s CodAL – is used, then there is a complete
processor description capable of supporting software, hardware and verification
aspects. If custom instructions are implemented by adding to the architectural language
source, then these additions can be reflected in the software toolchain and verification
environment as well as the RTL.

Processor Design Automation


Even with the flexibility and openness of RISC-V, creating a processor or DSA will be
challenging without design automation. If the objective is to tailor the hardware to the
software workload, then it is necessary to profile the software workload. Processor design
automation tools such as Codasip Studio are designed to handle the entire processor
development process, including hardware and software aspects.

As we have seen, the obvious starting point is using an architecture description language,
such as CodAL, to describe the core. An initial core description can be developed, and
the instruction set tailored, by an iterative process described as ‘SDK in the loop’.

Figure 12. SDK in the loop: from an initial instruction accurate (IA) CodAL description, an SDK is
generated; application software is compiled and profiled; the instruction set is analyzed and
updated; and the loop repeats until the instruction set is good, yielding the final instruction set.

The instruction accurate (IA) description is used to automatically generate an SDK. Then
real application code is compiled and profiled in order to assess the performance of the
instruction set. If hotspots in the software are identified, it is possible to define new custom
instructions to address them. These instructions will be implemented in the CodAL source
and then the process repeated.

Once the design team is happy with the instruction set, they can move on to focus
on microarchitecture. The next phase of iteratively improving the microarchitecture is
described as ‘HDK in the loop’.


Figure 13. HDK in the loop: from an initial cycle accurate (CA) CodAL description, an HDK is
generated; application software is profiled on a CA or RTL simulator or an FPGA; the
microarchitecture is analyzed and updated; and once it is good, the RTL is verified against a
golden reference.

Here an initial cycle accurate (CA) description is the starting point. Codasip Studio
can generate an HDK including a CA simulator, RTL, a testbench and a UVM verification
environment. Again, real application software can be profiled using either:

1. A generated CA simulator,
2. An RTL simulator, or
3. An FPGA prototype.

Codasip Studio provides static formal analysis and consistency checks between IA
and CA descriptions. Once the microarchitecture is stable, RTL can be generated, and
the verification of the RTL against an IA golden reference can be undertaken. The RTL
generated by Codasip Studio is human-readable and contains links to the original CodAL
source code to make debugging straightforward.

Figure 14. Using UVM to verify RTL against a golden reference: test cases are used to check the
equivalence of the instruction accurate CodAL processor model (the reference model) with the
cycle accurate CodAL processor model and the synthesizable RTL.

Codasip Studio is used to automatically generate a UVM environment which can be
extended and run by verification engineers. The environment includes functional cover
points and assertions, and permits the simultaneous debug of RTL and C code. A variety of
test cases can be used, including directed tests and assembler programs randomly
generated by Codasip Studio. Codasip also uses a variety of 3rd party models and
verification tools for comprehensive processor verification.

Get a Jump Start

If an existing processor core comes close to meeting the requirements of a particular
workload, then it can be the starting point for a DSA. If it is designed in an architecture
description language, the same design automation approach can be used for
customization. This can significantly reduce the time to market for the DSA.

Alternative traditional approaches to customization involve the simultaneous changing
of RTL, ISS and software toolchain. This not only requires a lot of manual effort but also adds
significant technical risk.

Codasip offers a range of fully verified RISC-V embedded and application processors.
Although these can be licensed in the conventional off-the-shelf way with RTL, software
toolchain and testbenches, Codasip also licenses the CodAL source code – a very valuable
architectural source license. If the CodAL source code is licensed, it is straightforward to
modify the CodAL source to add custom instructions or microarchitectural features.

Thus, the development of a DSA can be incremental rather than a major design effort
starting the microarchitecture from scratch.

Audio Processing Example

A real example of this process is when Microsemi developed a RISC-V audio processing
unit for the IoT. They were interested in replacing an off-the-shelf core and wanted to
evaluate the flexibility of RISC-V in addressing their processing needs. They profiled their
echo cancellation software and then looked at different ISA scenarios.

Figure 15. Trading off RISC-V extensions.

In the example above, Microsemi started with the base 32-bit RISC-V ISA implemented in
a Codasip L30 core. Since this is a minimal set of just 40 instructions, the core was small
at 16 kgates. However, it was unlikely that RV32I would give the performance required,
and at almost 1.8M clock cycles for their test it did not: the high clock frequency needed
to compensate would have resulted in too much power dissipation.

Since their DSP algorithm included many multiplications, the next obvious step was to
add the [M] multiplication/division optional extension. The use of RV32IM combined
with a sequential multiplier improved performance but was below the design’s target.
Using a parallel multiplier brought performance to over 13× faster than the RV32I
design.

Finally, they experimented with custom DSP extensions. The custom instructions increased
the throughput 56.24× over the original RV32I at the cost of making the core 2.43× bigger in
gate count. The increased throughput meant that a slower clock frequency could be used,
reducing dynamic power. With embedded systems, the silicon area is dominated by
instruction memory, and hence code size matters. The use of custom instructions reduced
the code size from 232 kbytes to 64 kbytes, an improvement of 3.62×.
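
To see why multiplication support matters so much here, the sketch below (illustrative only – not Microsemi’s actual code) shows the kind of fixed-point multiply-accumulate loop that dominates echo cancellation. Every tap costs one multiply: on RV32I that is a multi-cycle software routine, on RV32IM a single mul, and with a custom DSP instruction the multiply and add can potentially be fused.

#include <stdint.h>

/* Q15 fixed-point FIR inner loop, typical of echo cancellation.
 * Each iteration is one multiply-accumulate (MAC), so the cost of
 * a multiply dominates the whole algorithm. */
int16_t fir_q15(const int16_t *coeff, const int16_t *sample, int taps) {
    int32_t acc = 0;
    for (int i = 0; i < taps; i++)
        acc += (int32_t)coeff[i] * sample[i];  /* the hot MAC */
    return (int16_t)(acc >> 15);               /* rescale to Q15 */
}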

Neural Networks Empowered by Custom Instructions

In a paper11 on implementing neural networks on processor cores with limited hardware
resources, Alexey Shchekin used TensorFlow Lite Micro with a Codasip L31 RISC-V core to
implement a convolutional neural network for image classification. The neural network
architecture contains two convolutional and pooling layers, at least one fully-connected
layer, vectorized nonlinear functions, and data resize and normalization operations12 (Figure
16). He took the well-known “MNIST handwritten digits classification” benchmark and
used the Codasip Studio profiler to analyze the image classification task.

Figure 16. Convolutional neural network architecture.

As would be expected, ~84% of the cycles were used on the image convolution function.
The convolution is implemented by deeply nested for-loops.

11 Alexey Shchekin, “Embedded AI on L-Series cores”, Codasip whitepaper
12 Sara Aqab & Muhammad Usman Tariq, “Handwriting Recognition using Artificial Intelligence
Neural Network and Image Processing”, International Journal of Advanced Computer Science
and Applications, Vol. 11, No. 7, 2020, pp. 137-146


Figure 17. Profiler identifying ‘hot spots’ in deepest for-loop.

In this case of the TFLite convolution, most time is spent on the multiply-accumulate
operation (a mul followed by a c.add) and on the subsequent (vector) loads from memory
(the lb instructions after the for-statement). Merging multiplication and addition, as well as
loading bytes with an immediate address increment, were promising ideas for creating
RISC-V custom instructions.
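
The hot loop has the shape of the sketch below (schematic only; the mnemonics in the closing comment are hypothetical stand-ins for the two custom instructions, not actual Codasip opcodes):

#include <stdint.h>

/* Innermost loop of an int8 convolution, as in TFLite Micro.
 * On a standard RISC-V core each element costs two loads (lb),
 * a multiply (mul) and an add (c.add). */
int32_t conv_inner(const int8_t *w, const int8_t *x, int n) {
    int32_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += (int32_t)w[i] * x[i];  /* lb, lb, mul, add */
    return acc;
}

/* With the two custom instructions, each element could instead be:
 *   lb.inc  t0, (a0)    ; load byte and post-increment the pointer
 *   lb.inc  t1, (a1)
 *   mac     a2, t0, t1  ; fused multiply-accumulate
 * (hypothetical mnemonics, shown only to illustrate the idea)   */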

Just adding two custom instructions to improve the arithmetic and the vector loads led to
a custom L31 core with better performance and power consumption than the standard
L31. The number of clock cycles required for image classification was reduced by more
than 10% and the power consumption by more than 8%.

This and other examples, in areas such as cryptographic computations, demonstrate
how specialization of computational units delivers better performance in a cost-effective
way.


Conclusion
New design approaches are urgently needed to overcome the end of semiconductor
scaling. Specifically, systems will be heterogeneous with specialized computational units
such as domain-specific accelerators. These need to be fine-tuned to the needs of their
computational workload.

With increased specialization, there will be increased variation in processor and
accelerator design. This inevitably means that more designs will need to be developed
and that they will not be developed solely by the limited number of specialist processor
design teams.

While developing processors from scratch will remain an area for processor experts,
creating derivatives of existing designs will be within the reach of SoC design teams. This
can be achieved if a verified design is available in an architectural language, like CodAL,
and can be customized using design automation tools.

The combination of processor IP cores and processor design automation will enable
many new specialized processors and accelerators to be created. The combination of:

• RISC-V – the open and modular ISA
• An architectural language like CodAL
• Processor design automation like Codasip Studio
• Verified RISC-V cores with an architectural license

not only overcomes the failure of scaling but enables the development of a new generation
of efficient computational solutions.

The semiconductor industry needs to change its approach to designing computational
units to be competitive and to drive further innovation.


Find Out More

To find out more about Codasip’s processor design automation and RISC-V processor
cores, contact us.

Download our paper entitled “Creating Domain-Specific Accelerators using Custom
RISC-V Instructions”.

Learn how Mythic used Codasip Studio and an L-series RISC-V core to create an optimized
processor.


About Codasip

Codasip was founded in 2014 in the Czech Republic. Our technology is based on 10 years
of prior university research at the Brno University of Technology/Faculty of Information
Technology including the PhD theses of our CEO and CTO.

Codasip was a founding member of the RISC-V Foundation in 2015 and has actively
contributed to its activities ever since. In the same year we also introduced our first
RISC-V core to the market and have continued to extend our RISC-V product line.

Today, Codasip is a rapidly expanding business and a strong international player with
sales representation all over the globe including Japan, Taiwan, Korea, India, and Israel.

We are looking for new talent

We are always on the lookout for new talent to join our team of over 120, mostly developers.
We pride ourselves on being a growth company with a startup spirit, but also well-
established in a high-growth technology market, backed by reputable investors. Spend
some time finding out more about what we do, then get in touch!

