Scaling is Failing
What next for processors?
Roddy Urquhart
DESIGN PROCESSORS SIMPLER, FASTER, AND CHEAPER.
Codasip® was founded on a simple belief – that we could bring together the brilliance of
microprocessor architects, RTL engineers, and software engineers and capture it in tools
that made design simpler, faster, and affordable.
Codasip Studio™ was born in 2014 with the mission of automating processor design. At
that time, we already believed in the power and future of RISC-V, so it was natural for
Codasip to embrace it as an Open Instruction Set Architecture (ISA) and implement it in
all our processors.
After 50 years, Moore’s law, Dennard Scaling, and Amdahl’s law are failing. The
semiconductor industry must change, and processor paradigms must change with it.
• Domain-Specific Accelerators
• Customized solutions
• New leaders that disrupt
Some of the information in this document may be incorrect due to changes in product specifications that may
have occurred since publishing. Please ask a Codasip sales representative for the latest information.
For decades, the semiconductor industry has relied on scaling such as Moore’s law to
deliver denser and faster chips. Today scaling is failing with an effective upper bound
on clock frequencies and new technology nodes being prohibitively expensive. Design
teams have responded with multi-core systems and more specialized cores such as DSPs
and GPUs, but single-thread performance improvements are at the end of the line.
This paper considers the challenge of the end of scaling and how deeper processor
specialization is the way to deliver further performance improvements. The industry must
change from adapting software to fit on available hardware to tailoring computational
units to match their computational load. Many varied, custom designs will be needed. It
will be necessary to design for differentiation.
Finally, new approaches to processor design are considered, in particular the value of
architectural languages and processor design automation. RISC-V is an excellent starting
point for specialized cores given its modularity and provision for custom instructions.
Existing RISC-V designs described in an architecture description language can provide
an excellent starting point for tailored processing units. Industry will need to change its
design approach if it is to continue to innovate and deliver cutting edge future products.
Table of Contents
Introduction
Semiconductor Scaling
Failure of Scaling
Conclusion
Introduction
New generations of electronic products have demanded more and more functionality
and performance. These increases in capability rely a great deal on advances in
integrated circuits.
For about 50 years the semiconductor industry has relied on shrinking silicon geometries
to achieve greater design complexity and processor performance for an acceptable
cost. This shrinking has been most famously described by Moore’s Law and the less well-
known Dennard Scaling. This virtuous and predictable scaling is broken – so how can we
achieve improvements in performance in the future?
1 www.codasip.com
Semiconductor Scaling is Failing
Semiconductor Scaling
The remarkable advances in computational performance and in data storage in recent
decades have been achieved by moving to process nodes with successively smaller
geometries. This remarkable scaling has been predicted by Gordon Moore of Intel and
achieved through scaling rules developed by Robert Dennard of IBM.
Figure 1. Gordon Moore. Source: Intel Free Press. Figure 2. Robert Dennard. Source: Fred Holland.
Moore’s Law
Some changes in design methodology, such as the use of automatic place and route,
logic and HDL synthesis, and power optimization strategies, were required to cope with
the growth in design complexity enabled by Moore's Law.
1 Moore, Gordon E., “Cramming more components onto integrated circuits”, Electronics, Volume
38, Number 8, April 19, 1965
2 G. E. Moore, “Progress in Digital Integrated Electronics.” Technical Digest 1975. International
Electron Devices Meeting, IEEE, 1975, pp. 11-13
Dennard Scaling
Robert Dennard is famous for inventing the single transistor memory cell in 1965. This
invention enabled the development of DRAMs which have been fundamental to
computers over the last four decades.
Semiconductor scaling not only involves the geometries of transistors but must also take
account of power supplies and power dissipation. In 1974, Robert Dennard and colleagues
at IBM proposed3 an approach to scaling that ensured that the electric field remained
constant from one technology generation to the next.
Thus, if the transistor dimensions reduced by 30%, the transistor area reduced by 50%. To
keep the electric field constant the voltage V must also be reduced by 30%.
As a result of this the circuit delays also reduce by 30% enabling a 40% increase in
operating frequency f. Furthermore, capacitance C decreases by 30% too.
Since dynamic power P = CV²f, the combined effects of the new geometry result in a 50%
power reduction.
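The scaling arithmetic above can be sanity-checked with a few lines of Python. This is only a sketch; the 0.7× linear shrink is the classic constant-field assumption, not a figure from any specific process:

```python
# Classic constant-field (Dennard) scaling with a 0.7x linear shrink.
k = 0.7

area = k * k        # transistor area: ~0.49, i.e. the ~50% reduction
freq = 1 / k        # delay falls by 30%, so frequency rises by ~40%
cap = k             # capacitance scales with the dimensions
volt = k            # constant field: supply voltage V also drops 30%

# Dynamic power relative to the previous node, P = C * V^2 * f:
power = cap * volt ** 2 * freq

print(round(area, 2), round(freq, 2), round(power, 2))  # 0.49 1.43 0.49
```

The power factor works out to exactly k², which is why each node historically delivered roughly double the transistors at the same total power.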
This scaling meant that moving to a smaller silicon geometry brought immediate benefits
to processor design. The finer geometry enabled more complex cores to be designed
with higher clock frequencies while keeping power consumption at the same level as the
older geometry.
Intel launched the first commercial microprocessor in 1971 – the Intel 4004. Just three
years later, Intel launched the 8080 – the first mainstream 8-bit microprocessor. In 1978,
Intel launched their first high volume 16-bit core, the 8086. This processor was designed
into the IBM PC which was the forerunner of modern PCs. The 8086 design had derivatives
which were the first in the hugely successful series of x86 microprocessor designs. The
first 32-bit x86 core was the i386 in 1985 and the x86-64 ISA was released in 1999. The x86
architecture grew considerably in complexity over time.
While x86 had developed along CISC lines, other architectures such as Arm, SPARC,
PowerPC and MIPS started with RISC principles. But like x86 processors these too grew in
performance and in wordlength.
The virtuous combination of Moore’s Law and Dennard Scaling was particularly successful
even 40 years after Gordon Moore’s first predictions. Just considering 32-bit cores the
transition from the i386 in 3,000 nm to Prescott in 90 nm was accompanied by an increase
in clock frequency from 33 MHz to 3,800 MHz.
3 Dennard, Robert H.; Gaensslen, Fritz; Yu, Hwa-Nien; Rideout, Leo; Bassous, Ernest; LeBlanc, Andre
(October 1974). “Design of ion-implanted MOSFET’s with very small physical dimensions”. IEEE
Journal of Solid-State Circuits. SC-9 (5): 256–268
With semiconductor scaling working so well over many applications, the latest single
core microprocessors were able to deliver the required performance improvements
simply by increasing their complexity and turning up the clock frequency. Applications
requiring either dedicated hardware or special architectures were relatively rare, and
were mainly embedded or in supercomputing. Semiconductor scaling seemed to follow
the Goldilocks principle.
When semiconductor scaling worked the main challenge was to get software tuned
to the needs of the available microprocessors.
Failure of Scaling
As the saying goes, 'all good things come to an end', and we are now talking about
the 'End of Moore's Law'. Some of the key trends are summarized in the following chart
looking at half a century of processors.
When Robert Dennard and colleagues formulated their scaling principles, an underlying
assumption was that power dissipation was overwhelmingly dynamic power, P = CV²f,
dependent on capacitance, voltage and operating frequency. Leakage power in larger
geometries was negligible.
This changed with the adoption of geometries of 90 nm and below. With smaller
geometries there is ever-increasing leakage power, and the budget for dynamic power
is reduced. Around 90 nm, an immediate consequence was that microprocessors could no
longer simply rely on increasing clock frequency to deliver enhanced single-core performance.
Intel had planned to take advantage of semiconductor scaling with their Tejas/Pentium V
design. It was reported in 2003 4 that the design would aim for 5 to 7 GHz.
4 Paul Dutton, “Pentium V will launch with 64-bit Windows Elements”, Computex 2003, 26
September 2003, retrieved 21/01/2022
It was therefore highly significant when Intel cancelled this project in May 2004. Given
that the previous Prescott design had failed to achieve clock speeds of 5 GHz due to heat
dissipation and power consumption, the Tejas/Pentium V was also going to fail to reach
its projected frequency.
This marked the end of the road for relying on Dennard Scaling and increased
clock frequencies to achieve performance with single cores. As the microprocessor
trend data (above) shows, there is a clock frequency ceiling of about 5 GHz – and in
practice most designs stop at around half of that for thermal reasons.
Subsequently Intel has focused more on dual cores and architectural features rather than
high clock frequencies to achieve their desired performance.
So, with Dennard Scaling ending, can we not simply be careful about clock frequency
and still hope to make progress with Moore’s Law? Although the semiconductor industry
is investing in the most advanced technology nodes there are only three companies –
TSMC, Samsung and Intel – that can afford to do so. In January 2022, Intel announced
plans for two new wafer fabs in Ohio for $20 billion ($10B per fab)6. In contrast, when Intel
announced their Dalian, China wafer fab in 20077, the cost was $2.5B.
Furthermore, the rate at which it is possible to move to new finer silicon geometries has
slowed down. Moore’s Law is grinding to a halt due to fundamental limitations in silicon
device physics. Physics suggests that going below 1 nm is going to require very different
technologies (e.g. nanowires). There are simply too few atoms in these fine geometries to
reliably build wires.
Economics is an even bigger problem: exponentially growing costs for wafer fabs at new,
finer technology nodes mean that devices in those nodes are no longer going down in
price. Moore's Law no longer applies to falling cost: whereas each node with its higher
density once meant an exponential decline in cost-per-gate, we now see costs stable
or rising with finer geometries.
With the end of Dennard Scaling, using increasing clock frequencies to gain higher
performance was no longer practical, and multi-core designs became standard. A very
obvious area for this was mobile phone processing. The multi-core approach had two
angles: firstly, using additional processors for specialized tasks and, secondly, using
multi-core configurations for application processing.
Graphics and DSP are two computationally intense activities that are not well-handled
by general-purpose cores. Thus, mobile phone chips adopted specialized GPUs and DSP
cores to handle these activities. Microcontrollers were often used for handling things like
Bluetooth or Wi-Fi protocol stacks or security functions.
Just as Intel had transitioned from single to dual core microprocessors, mobile phone
companies moved from single to dual core designs based on the Arm architecture. For
example, in 2011, Apple's A5 chip had dual Cortex-A9 cores. This trend to more cores
continued with, for example, Qualcomm's Snapdragon 200 series having quad-core
configurations in 2013 and Samsung offering 8 cores with their Exynos 7 Octa chips from
2015.
Returning to the 48 years of microprocessor trend data, it should be no surprise to see that
the number of logical cores has been steadily rising since 2005. However, the use of
multi-processing is limited by Amdahl's Law, which deals with the theoretical speedup possible
when adding processors in parallel. In practice any speedup will be limited by those
parts of the software that are required to be executed sequentially. For applications such
as running a mobile phone operating system we are probably already achieving what
is possible.
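Amdahl's Law puts a hard ceiling on what extra cores can deliver. A quick illustration (the 90% parallel fraction below is an assumed value for the example, not a figure from the text):

```python
def amdahl_speedup(parallel_fraction: float, n_cores: int) -> float:
    """Theoretical speedup from Amdahl's Law: the serial part of the
    workload cannot be accelerated by adding cores."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_cores)

# Even if 90% of the workload is parallelisable, the serial 10%
# caps the speedup at 10x regardless of how many cores are added:
for n in (2, 4, 8, 1024):
    print(n, round(amdahl_speedup(0.9, n), 2))
```

Note how quickly the returns diminish: going from 8 cores to 1,024 cores barely doubles the speedup, which is why core counts alone cannot replace scaling.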
Summarizing the effects of failing scaling, the well-worn approach of moving to new
process nodes and ever higher clock frequencies will not deliver significantly better
performance.
Figure 5. The end of the line. The chart contrasts the RISC era's growth of 2X/1.5 years
(52%/year) with the CISC era's 2X/2.5 years (22%/year). Source: John Hennessy and David
Patterson, Computer Architecture: A Quantitative Approach, 6/e. 2018.
In their 2018 book Hennessy and Patterson consider processor performance over almost
four decades. They describe improvements in single core performance of 3%/year as
“end of the line”.
Each type of core is optimized for a range of computations which may not match what is
actually required in an on-chip subsystem. For example, if a mixture of protocol handling
and DSP algorithms were needed the choice might be to instantiate both an MCU and a
DSP core. This could potentially be wasteful if one or both were underutilized. Perhaps
the ideal for this hypothetical case would be some sort of fusion of an MCU and a DSP?
However, to meet a particular set of needs, there will almost certainly not be an ideal
fusion core available as off-the-shelf IP. Some companies have already developed very
specialized cores – often known as application-specific instruction processors (ASIPs) –
to efficiently handle a narrowly-defined computational workload. However, developing
such cores has required very specialized skills to define the instruction set, develop the
processor microarchitecture and associated software toolchain and finally verify the
core. This has required a combination of processor architects, RTL developers, compiler
developers and verification engineers. Few companies have been able to afford this
approach, especially as manual development is time-consuming and can be error-prone.
One of the reasons why it has been challenging to develop ASIPs is that an instruction
set usually needs to be developed from scratch. Commercial ISAs such as Arm’s are not
available without a very expensive architectural license and few companies have the
skills to develop their own ISA.
Some open architectures have been available, such as OpenSPARC and OpenRISC.
However, they have not been widely adopted because they have lacked the flexibility
needed for domain-specific architectures.
The only way in the short-term to counter failing scaling is to take specialization a stage
further by creating innovative architectures for tackling specialized processing problems.
Instead of the classic approach of tailoring software to available microprocessors it is
now necessary to create hardware that is designed to match a software workload. This
can be achieved through customizing an ISA, creating special microarchitectures, or
creating novel processing cores and arrays.
In a well-known paper describing a 'New Golden Age for Computer Architecture'8, John
Hennessy and David Patterson trace the history of computer architecture and describe the
challenges faced with the end of scaling. They identified domain-specific architectures
or domain-specific accelerators (DSAs) as a new opportunity for computer architecture.
They see DSAs as programmable processors tailored for a specific class of applications.
DSAs are distinct from general-purpose processors and distinct from ASICs which may
have either no or limited programmability. ASIPs might be considered as a subset of DSAs.
They characterized DSAs as exploiting parallelism – such as instruction-level parallelism (ILP),
SIMD or systolic arrays – where the class of applications benefits from it. They said that DSAs
make efficient use of memory – memory accesses are often more costly in energy than
arithmetic operations. Also, DSAs can take advantage of reduced arithmetic precision with,
for example, some AI/ML inference applications working adequately with 4-, 8- or 16-bit
arithmetic. For example, Google’s Tensor Processing Unit (TPU) is a systolic array working
with 8-bit precision.
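The benefit of reduced precision can be illustrated with a toy quantized dot product of the kind a MAC array performs. This is an illustrative sketch, not TPU code, and the operand values are invented:

```python
# Quantized inference sketch: int8 operands, wider accumulator.
INT8_MIN, INT8_MAX = -128, 127

weights = [12, -7, 100, 33]
activations = [-5, 90, 14, 2]
assert all(INT8_MIN <= v <= INT8_MAX for v in weights + activations)

# Each 8x8-bit product fits in 16 bits, but summing many products
# needs a wider (e.g. 32-bit) accumulator -- which is why int8 MAC
# hardware typically accumulates into int32 registers.
acc = sum(w * a for w, a in zip(weights, activations))
print(acc)  # 776
```

Because each operand is a single byte, an 8-bit datapath moves a quarter of the data of a 32-bit one per operation, which is where much of a DSA's energy saving comes from.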
With the end of Moore’s Law, the future lies in tailoring processor hardware to its
specialized computational workload.
In another paper, William J. Dally, Yatish Turakhia and Song Han identify a number of key
insights into the performance of DSAs9.
Nevertheless, there will continue to be a need for some general-purpose processors for
example to run operating systems and also for some dedicated hardware.
In the era of Moore’s Law SoC designers have generally faced a choice between general-
purpose processors and dedicated non-programmable hardware.
8 John L. Hennessy & David A. Patterson, “A New Golden Age for Computer Architecture”,
Communications of the ACM, February 2019, Vol. 62 No. 2, Pages 48-60
9 William J. Dally, Yatish Turakhia & Song Han, “Domain-Specific Hardware Accelerators”,
Communications of the ACM, July 2020, Vol. 63 No. 7, Pages 48-57
[Figure: the traditional choice between a general-purpose processor and hardwired logic.]
Dr Tsugio Makimoto, CTO of Sony in the early 1990s, observed that the industry
tended to fluctuate between standardized and customized solutions cyclically with a
period of approximately 10 years10. For example, the electronics industry swung from
microprocessors in the 1970s to ASICs in the 1980s.
[Figure: the continuum from general-purpose processor through domain-specific accelerator
to hardwired logic, with flexibility increasing toward the general-purpose end and
efficiency increasing toward the hardwired end.]
A DSA can vary from being close to hardwired logic (limited programmability) to being
close to a general-purpose processor such as an MCU.
10 Tsugio Makimoto, “The Hot Decade of Field Programmable Technologies”, retrieved 3/2/2022
The more specialized the processor, the greater the efficiency in terms of silicon area and
power consumption. With less specialization, the greater the flexibility of the DSA. On the
DSA continuum there is the possibility of fine tuning a core for performance, area and
power – and design for differentiation is enabled.
Specialization is not only a great opportunity but means that there will be many different
designs created requiring the involvement of a broader community of designers and a
greater degree of design efficiency.
There are four key enablers that can contribute to efficient design for differentiation.
But before considering design issues, let's consider how differentiation can be achieved.
Even in an area such as neural networks where the underlying computations are
dominated by vector and matrix operations, there are opportunities for differentiation.
Unlike training, where typically floating-point arithmetic is used, inference will be based
on quantized integers. A variety of approaches can be taken such as vector engines,
SIMD or systolic arrays. AI startup Mythic have even created an Analog Compute Engine
(ACE™) to handle such computations.
Earlier we noted that commercial ISAs have not been available to teams developing
special processor cores and few teams have had the expertise to develop an ISA from
scratch. The advent of RISC-V has broken this deadlock for many use cases.
Firstly, it is a free and open standard offering the benefits of an architectural license
with no license fee. Secondly, the standard only covers the instruction set architecture
(ISA) and not microarchitecture. Therefore, any RISC-V user is able to create their own
microarchitecture without restriction. Thirdly, RISC-V does not prescribe a licensing model
so both commercially licensed and open-sourced microarchitectures are possible.
Fourthly, by being owned by the community rather than a company RISC-V has no
vendor lock. These compelling reasons have meant that RISC-V has seen widespread
support from companies large and small, researchers and academia. This support has
come from all the major regions of the world involved in IC design.
RISC-V also is modular, making it an ideal ISA for building specialized processors and
accelerators.
[Figure: the RISC-V ISA is modular, comprising a base integer set, optional standard
extensions and non-standard custom extensions.]
1. The base integer instruction set, which must always be used for a particular wordlength.
Like the early RISC ISAs, it has very few instructions – just 40 in the 32-bit case (RV32I).
For most applications, though, this base integer set is unlikely to be sufficient on its own.
2. Greater performance is possible through using optional standard extensions. There
are a number of sets including single-, double- and quad-precision floating point, packed
SIMD, compressed, bit manipulation, etc.
3. For special instructions that are not covered well by optional standard extensions it is
possible to create non-standard custom extensions.
This modularity is very well-aligned with the needs of DSAs as it is possible to tune the
instruction set used to a computational workload. Designers can choose which standard
extensions are needed for an accelerator and can create custom instructions for fine
tuning and differentiation.
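This pick-and-mix tuning is reflected directly in RISC-V ISA naming. A minimal sketch of composing an ISA string from chosen standard extensions (the ordering here is simplified to a handful of common single-letter extensions; the real naming rules cover many more):

```python
def riscv_isa_string(xlen: int, extensions: str) -> str:
    """Compose a RISC-V ISA name such as 'RV32IMC' from the mandatory
    base integer set plus chosen single-letter standard extensions."""
    assert xlen in (32, 64), "base integer sets here are RV32I / RV64I"
    order = "MAFDC"  # simplified canonical order after the base 'I'
    chosen = "".join(e for e in order if e in extensions.upper())
    return f"RV{xlen}I{chosen}"

# An MCU-class accelerator might pick mul/div and compressed:
print(riscv_isa_string(32, "MC"))    # RV32IMC
# An application-class core adds atomics and floating point:
print(riscv_isa_string(64, "MAFD"))  # RV64IMAFD
```

The point is that the extension set is an explicit design parameter: two teams targeting different workloads can legitimately ship different, conforming ISA strings.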
Unlike with a custom ASIP ISA, with RISC-V there is an excellent starting point for
creating a special ISA for an accelerator. Choosing from existing sets of instructions
is far less costly than developing an ISA from scratch.
It may be tempting to think that with the RISC-V ISA already defined, moving to a specific
microarchitecture is easy – after all, there are open-source RTL implementations available.
Also, if a design team wants to create its own microarchitecture, using a hardware
description language (HDL) to create a processor core is an approach that has been
used for the last quarter of a century.
Figure 11. An architectural language covers software and verification as well as hardware.
Even with the flexibility and openness of RISC-V, creating a processor or DSA will be
challenging without design automation. If the objective is to tailor the hardware to the
software workload, then it is necessary to profile the software workload. Processor design
automation tools such as Codasip Studio are designed to handle the entire processor
development process, including both hardware and software aspects.
As we have seen, the obvious starting point is using an architecture description language,
such as CodAL, to describe the core. An initial core description can be developed, and
the instruction set tailored by an iterative process described as 'SDK in the loop'.
[Figure: the 'SDK in the loop' flow – an initial instruction-accurate CodAL description is
used to generate an SDK, which is evaluated and refined until it is good enough.]
The instruction accurate (IA) description is used to automatically generate an SDK. Then
real application code is compiled and profiled in order to assess the performance of the
instruction set. If hotspots in the software are identified, it is possible to define new custom
instructions to address them. These instructions will be implemented in the CodAL source
and then the process repeated.
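The 'SDK in the loop' iteration can be sketched as a profile-then-extend cycle. The per-function cycle counts below are invented for illustration; a real flow would take them from the profiler generated by Codasip Studio:

```python
# Hypothetical cycle counts from profiling real application code on
# the generated instruction-accurate simulator (invented numbers):
profile = {
    "fir_filter": 1_420_000,
    "fft_stage": 310_000,
    "control_logic": 85_000,
}

total = sum(profile.values())
# Flag hotspots worth considering for custom instructions, say
# anything consuming over 20% of total cycles:
hotspots = [name for name, cycles in profile.items()
            if cycles / total > 0.20]
print(hotspots)  # ['fir_filter']
```

Each hotspot then becomes a candidate for a new instruction in the CodAL source, after which the SDK is regenerated and the same workload is profiled again.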
Once the design team is happy with the instruction set, they can move on to focus
on microarchitecture. The next phase of iteratively improving the microarchitecture is
described as ‘HDK in the loop’.
[Figure: the 'HDK in the loop' flow – an initial cycle-accurate CodAL description is
used to generate an HDK, which is evaluated and refined until it is good enough.]
Here an initial cycle-accurate (CA) description is the starting point. Codasip Studio
can generate an HDK including a CA simulator, RTL, testbench and UVM verification
environment. Again, real application software can be profiled using either:
1. A generated CA simulator,
2. An RTL simulator or
3. An FPGA prototype.
Codasip Studio provides static formal analysis and consistency checks between IA
and CA descriptions. Once the microarchitecture is stable, RTL can be generated, and
the verification of the RTL against an IA golden reference can be undertaken. The RTL
generated by Codasip Studio is human-readable and contains links to the original CodAL
source code to make debugging straightforward.
Codasip offers a range of fully verified RISC-V embedded and application processors.
Although these can be licensed in the conventional off-the-shelf way with RTL, software
toolchain and testbenches, Codasip also licenses the CodAL source code – a very valuable
architectural source license. If the CodAL source code is licensed it is straightforward to
modify the CodAL source to add custom instructions or microarchitectural features.
Thus, the development of a DSA can be incremental rather than a major design effort
with the microarchitecture starting from scratch.
A real example of this process is when Microsemi developed a RISC-V audio processing
unit for the IoT. They were interested in replacing an off-the-shelf core and wanted to
evaluate the flexibility of RISC-V in addressing their processing needs. They profiled their
echo cancellation software and then looked at different ISA scenarios.
In the example above, Microsemi started with the base 32-bit RISC-V ISA implementation
of a Codasip L30 core. Since this is a minimal set of just 40 instructions, the core was small
at 16 kgates. However, it was unlikely that RV32I would give the performance required, and
at almost 1.8M clock cycles for their test the cycle count was too high. The high clock
frequency required would have resulted in too much power dissipation.
Since their DSP algorithm included many multiplications, the next obvious step was to
add the [M] multiplication/division optional extensions. The use of the RV32IM combined
with a sequential multiplier improved performance but was below the design’s target.
Using a parallel multiplier helped bring performance to over 13× faster than the RV32I
design.
Finally, they experimented with custom DSP extensions. The custom instructions increased
the throughput 56.24× over the original RV32I at the cost of making the core 2.43× bigger in
gate count. The increased throughput meant that a slower clock frequency could be used,
reducing dynamic power. With embedded systems, the silicon area is dominated by
instruction memory and hence code size matters. The use of custom instructions reduced
the code size from 232 kbytes to 64 kbytes, an improvement of 3.62×.
As would be expected, ~84% of the cycles were used on the image convolution function.
The convolution is implemented by deeply nested for-loops.
In this case of TFLite convolution, most time is spent on the multiply-accumulate operation
(a mul followed by a c.add) and the subsequent (vector) loads from memory (lb
instructions after the for-statement). Merging multiplication and addition, as well as
loading bytes with an immediate address increment, were promising ideas for creating
RISC-V custom instructions.
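The payoff from such fused instructions can be approximated by counting dynamic operations in the inner loop. The cost model below is invented for illustration and is not Codasip's analysis:

```python
def inner_loop_ops(n: int, fused_mac: bool, autoinc_load: bool) -> int:
    """Rough dynamic-op count for an n-element byte dot product:
    two loads per element, mul + add (or one fused MAC), plus two
    address increments unless the loads auto-increment."""
    loads = 2 * n
    arith = n if fused_mac else 2 * n      # mul+add fused into one MAC
    addr = 0 if autoinc_load else 2 * n    # folded into the load
    return loads + arith + addr

n = 64
base = inner_loop_ops(n, fused_mac=False, autoinc_load=False)
custom = inner_loop_ops(n, fused_mac=True, autoinc_load=True)
print(base, custom, round(base / custom, 2))  # 384 192 2.0
```

Halving the dynamic instruction count in the hottest loop is what allows the measured cycle and power reductions described next.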
Just adding two custom instructions to improve the arithmetic and vector loads led to
a custom L31 core with better performance and power consumption than the standard
L31. The number of clock cycles required for image classification was reduced by more
than 10% and the power consumption by more than 8%.
Conclusion
New design approaches are urgently needed to overcome the end of semiconductor
scaling. Specifically, systems will be heterogeneous with specialized computational units
such as domain-specific accelerators. These need to be fine-tuned to the needs of their
computational workload.
While developing processors from scratch will remain an area for processor experts,
creating derivatives of existing designs will be within the reach of SoC design teams. This
can be achieved if a verified design is available in an architectural language, like CodAL,
and can be customized using design automation tools.
The combination of processor IP cores and processor design automation will enable
many new specialized processors and accelerators to be created.
This combination not only overcomes the failure of scaling but also enables the
development of a new generation of efficient computational solutions.
Learn how Mythic used Codasip Studio and an L-series RISC-V core to create an optimized
processor.
About Codasip
Codasip was founded in 2014 in the Czech Republic. Our technology is based on 10 years
of prior university research at the Brno University of Technology/Faculty of Information
Technology including the PhD theses of our CEO and CTO.
Codasip was a founding member of the RISC-V Foundation in 2015 and has actively
contributed to their activities ever since. In the same year we also introduced our first
RISC-V core to the market and have continued to extend our RISC-V product line.
Today, Codasip is a rapidly expanding business and a strong international player with
sales representation all over the globe including Japan, Taiwan, Korea, India, and Israel.
We are always on the lookout for new talent to join our team of over 120, mostly developers.
We pride ourselves on being a growth company with the startup spirit, but also
well-established in a high-growth technology market, backed by reputable investors.
Spend some time finding out more about what we do, then get in touch!