
Optimized Retargetable Compiler for Embedded

Processors – GCC vs LLVM

Lavinia Ghica
Development Tools - Compiler
Microchip Romania
Bucharest, Romania
lavinia.dragusin@gmail.com

Nicolae Tapus
Computer Science Department
“Politehnica” University of Bucharest
Bucharest, Romania
ntapus@cs.pub.ro

Abstract— Retargetable compilers become more and more popular as they are involved even in the processor design phase. The reduced time-to-market period puts a challenge on optimized retargetable compilers. An optimized retargetable compiler gives reliable feedback to tailor processors towards a certain application domain. The first choice in choosing a retargetable compiler may be an open-source one. This paper aims to compare two well-known open-source compilers: GCC and LLVM. The first is a mature compiler, retargeted for more than 100 processors, while the second is a new one, retargeted for fewer than 10 processors, but built on a very promising approach, one big plus being the latest release of Redhat (a Linux operating system), which replaced the previously used GCC with LLVM. The paper compares the two compilers on both ease of retargetability and the enablement of target specific optimizations.

Keywords— optimized retargetable compiler; GCC; LLVM; retargetable code generation; retargetable optimizations

I. Introduction

A retargetable compiler is a compiler that can be easily modified to generate code for different processors. Optimizing retargetable compilers are very common nowadays, when time-to-market for new processors has become shorter and shorter. They bridge the gap between classic compilers and electronic processor design. In this context, the compilers have a double role: first, they are used in the design phase of the processor to explore its capabilities, and second, they are released with the new processor as part of the build tools chain.

The concern regarding retargetable compilers is their lack of machine-specific code optimization techniques, which prevents them from achieving the highest code quality. While this problem is partially inherent to the retargetable compilation approach, it can be circumvented by designing flexible, configurable code optimization techniques that apply to many target architectures and by defining an interface which can configure the optimizations using the machine description information.

Retargetable compilers are modular compared to traditional ones, having target independent modules but also target specific modules, mainly in the backend.

The scope of this paper is to analyze two open-source compilers: GCC, a mature retargetable compiler, and LLVM, a new retargetable compiler built on different concepts. The GNU (GCC) compiler is one of the most widely used C/C++ compilers in the world. It is the basic build tool for building all Embedded Linux and Android systems, as well as all desktop or server Linux operating systems and their applications. The GNU compiler is also used to build many commercial real-time operating systems, such as those from Enea, QNX, WindRiver and more.

Chapter 2 gives an overview of the machine description representation for GCC and LLVM, pointing out the pluses and minuses of each one. Chapter 3 discusses the compilers' construction and analyzes the code generation phase. Chapter 4 presents the two approaches to target specific optimizations (e.g. register allocation) and the interaction between the machine description and the optimization interrogations. This chapter also underlines the possibility of adding a new target specific optimization and estimates the effort of this task. Chapter 5 focuses on object code generation. Chapter 6 makes a final analysis and concludes.

Each chapter presents first the GCC approach and then the LLVM one, for maturity reasons.

II. Machine Description Representation

The mechanism of GCC machine descriptions has been quite successful, as demonstrated by the wide variety of targets for which GCC has been retargeted. The GNU Compiler Collection uses a retargetable compilation model which is adapted to a given target by reading a description of the target and instantiating the machine dependent parts of the generated compiler.

The first step in the retargeting process is to understand the architecture of the target microprocessor. The key points in understanding are the register file (general purpose registers and special purpose registers, if any), the pipeline model of

978-1-4673-8200-7/15/$31.00 ©2015 European Union


103
Authorized licensed use limited to: NORTHWESTERN POLYTECHNICAL UNIVERSITY. Downloaded on April 02,2024 at 02:04:14 UTC from IEEE Xplore. Restrictions apply.
the processor (are there hazards, or stalls, or a forwarding network), and the instruction set (is it orthogonal). The second step in the retargeting process is to define the ABI (application binary interface). The ABI covers details like the sizes, layout and alignment of data types, and defines calling conventions (how arguments are passed to functions, the layout of the stack frame, and register usage conventions). The third and final step in the retargeting process is to define three key files that describe the microprocessor and its operating environment to GCC. These three files are the target machine macro file (machine.h), the machine description file (machine.md), and a third file containing helper functions for the previous two files (machine.c). However, at first look, the machine descriptions are difficult to read, construct, maintain, and enhance [11]. They require specifying instruction patterns using Register Transfer Language (RTL) templates, employing a mechanism which is verbose, repetitive and requires a lot of detail. Below is a typical RTL template from the MIPS machine description (addsi3).

Fig. 1. GCC – Example of instruction description

The RTL template uses the RTL operators set and plus; the former represents an assignment. The match_operand operator matches an operand using a mode (SI for single integer), a predicate (register_operand), and constraint strings ("=r", "r" and "r"). Given a GIMPLE statement a = b + c, an RTL statement is generated first and the assembly statement is generated eventually.

On many machines, the numbered registers are not all equivalent. For example, certain registers may not be allowed for indexed addressing; certain registers may not be allowed in some instructions. These machine restrictions are described to the compiler using register classes [10].

A register class is defined by giving each class a name and saying which of the registers belong to it. Then, register classes are detailed. They should be allowed as operands to particular instruction patterns.

In general, each register will belong to several classes. In fact, one class must be named ALL_REGS and contain all the registers. Another class must be named NO_REGS and contain no registers. Often the union of two classes will be another class; however, this is not required [10].

One of the classes must be named GENERAL_REGS. There is nothing special about the name (it is just a convention), but the operand constraint letters ‘r’ and ‘g’ specify this class. If there is no difference between GENERAL_REGS and ALL_REGS, GENERAL_REGS can be defined as a macro which expands to ALL_REGS.

The way classes other than GENERAL_REGS are specified in operand constraints is through machine-dependent operand constraint letters. You can define such letters to correspond to various classes, then use them in operand constraints.

In GCC the register description should have the narrowest register classes for allocatable registers, so that each class either has no subclasses, or, for some mode, the move cost between registers within the class is cheaper than moving a register in the class to or from memory.

If an instruction (or a set of instructions) accepts registers from two classes, a new class is defined as the union of the two. For example, if an instruction allows either a floating point (coprocessor) register or a general register for a certain operand, you should define a class FLOAT_OR_GENERAL_REGS which includes both of them. Otherwise you will get suboptimal code, or even internal compiler errors when reload cannot find a register in the class computed via reg_class_subunion.

The machine description register information contains certain redundant information about the register classes: for each class, it specifies which classes contain it and which ones are contained in it; and for each pair of classes, the largest class contained in their union. This allows the compiler to build a class hierarchy and to check overlaps.

When a value occupying several consecutive registers is expected in a certain class, all the respective registers must belong to that class. Therefore, register classes cannot be used to enforce a requirement for a register pair to start with an even-numbered register. The way to specify this requirement is with HARD_REGNO_MODE_OK.

LLVM describes 3 classes of instructions: L-type instructions, which are generally associated with memory operations, A-type instructions for arithmetic operations, and J-type instructions that are typically used when altering control flow (i.e. jumps).

A target description is done in a declarative domain-specific language (a set of .td files) processed by the tblgen tool. Each description file is completed by a .cpp file. For example RegisterInfo.td has as its pair RegisterInfo.cpp, which is used to describe the register file of the target and any interactions between the registers. Physical registers (those that actually exist in the target description) are unique small numbers, and virtual registers are generally large. Note that register #0 is reserved as a flag value.

Each register in the processor description has an associated TargetRegisterDesc entry, which provides a textual name for the register (used for assembly output and debugging dumps) and a set of aliases (used to indicate whether one register overlaps with another).

Each register class contains sets of registers that have the same properties (for example, they are all 32-bit integer registers). Each SSA virtual register created by the instruction selector has an associated register class. When the register allocator runs, it replaces virtual registers with a physical register in the set.
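As an illustration, such a register file might be declared in a target's RegisterInfo.td roughly as follows. This is a minimal sketch for a hypothetical target named "Toy"; the register and class names are invented for this example, while the Register and RegisterClass constructs are the standard TableGen classes from llvm/Target/Target.td:

```tablegen
// Hypothetical target "Toy": four 32-bit general purpose registers.
class ToyReg<bits<16> enc, string n> : Register<n> {
  let HWEncoding = enc;   // binary encoding used by the assembler
  let Namespace = "Toy";
}

def R0 : ToyReg<0, "r0">;
def R1 : ToyReg<1, "r1">;
def R2 : ToyReg<2, "r2">;
def R3 : ToyReg<3, "r3">;

// One register class: all four registers can hold an i32 value,
// with 32-bit spill size and alignment.
def GPR : RegisterClass<"Toy", [i32], 32, (add R0, R1, R2, R3)>;
```

tblgen expands declarations like these into the register enumerations and description tables that the hand-written C++ part of the backend then completes.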

Fig. 2. LLVM – Machine Description through compiler

The TargetInstrInfo class is used to describe the machine instructions supported by the target. It is essentially an array of TargetInstrDescriptor objects, each of which describes one instruction the target supports. Descriptors define characteristics like the mnemonic for the opcode, the number of operands, the list of implicit register uses and defs, whether the instruction has certain target-independent properties (accesses memory, is commutable, etc.), and hold any target-specific flags.

III. Code Generation

GCC uses a modified version of the Davidson-Fraser model of compilation [1]. This contrasts with the traditional Aho-Ullman model [2], which performs instruction selection over an optimized machine independent intermediate representation (IR). In order to ensure the quality of generated code, instruction selection in the Aho-Ullman model is performed using cost based tree tiling [3] that tries to cover a subject tree in the IR with instructions that minimize the cost, using a set of transformer trees.

The Davidson-Fraser model advocates simple instruction selection and optimizes the selected instructions. An expander generates naive machine dependent code using a set of transformer trees (most often RTL trees) by employing simpler structure based tiling. The final code is produced by a recognizer that identifies the instructions corresponding to the register transfers representing the intermediate code. Retargeting a compiler in the Davidson-Fraser model requires rewriting the expander and the recognizer, which employ simple algorithms [4]. A generic optimizer for machine dependent code is possible because of the following key idea: when computations are expressed in the form of allowable register transfers, although the actual register transfer statements are machine dependent, their form is machine independent [5].

One optimization performed during code generation is the merging of comparisons, as seen in Fig. 3.

if (i >= 5) && (i <= 10)   becomes   (unsigned)(i - 5) <= 5

Fig. 3. GCC – Optimization during Code Generation

In LLVM, Code Generation or Instruction Selection is the process of translating LLVM code presented to the code generator into target-specific machine instructions. There are several well-known ways to do this in the literature. LLVM uses a SelectionDAG based instruction selector.

Portions of the DAG instruction selector are generated from the target description (*.td) files. The LLVM goal is for the entire instruction selector to be generated from these .td files, though currently there are still things that require custom C++ code.

The SelectionDAG provides an abstraction for code representation in a way that is amenable to instruction selection using automatic techniques (e.g. dynamic-programming based optimal pattern matching selectors). It is also well-suited to other phases of code generation; in particular, instruction scheduling (SelectionDAGs are very close to scheduling DAGs post-selection). Additionally, the SelectionDAG provides a host representation where a large variety of very low level (but target-independent) optimizations may be performed, ones which require extensive information about the instructions efficiently supported by the target.

The SelectionDAG is a directed acyclic graph whose nodes are instances of the SDNode class. The primary payload of the SDNode is its operation code (Opcode), which indicates what operation the node performs, and the operands to the operation. The various operation node types are described at the top of the include/llvm/CodeGen/SelectionDAGNodes.h file.

A SelectionDAG node contains the following information:

• Opcode. This is an integer that identifies the instruction represented by the node.

• Results (definitions). While most instructions produce exactly one result, some may either define several values (e.g., operations with side effects, combined division/modulo instructions) or no value at all (e.g., branch instructions). The node object keeps a list of the value types for all of its results.

• Operands (uses). Every SDNode keeps a record of all other nodes upon which it has a dependency.

A SelectionDAG has designated “Entry” and “Root” nodes. The Entry node is always a marker node with an Opcode of ISD::EntryToken. The Root node is the final side-effecting node in the token chain. For example, in a single basic block function it would be the return node.

One important concept for SelectionDAGs is the notion of a “legal” vs. “illegal” DAG. A legal DAG for a target is one that only uses supported operations and supported types. On a 32-bit PowerPC, for example, a DAG with a value of type i1, i8, i16, or i64 would be illegal, as would a DAG that uses a SREM or UREM operation. The legalize-types and legalize-
operations phases are responsible for turning an illegal DAG into a legal DAG.

The backend class <target>TargetLowering, a subclass of TargetLowering, offers two-fold functionality. On the one hand, it provides its superclass with target-specific information, including:

• the alignment of machine functions (2 bytes)

• all value types natively supported by the target machine and, for each type, the register class that has been defined for it

• the exact binary representation of boolean values (true: 1, false: 0)

On the other hand, <target>TargetLowering handles all cases of instruction nodes that cannot be lowered automatically but require manual intervention. These cases include calling conventions and special instructions and operands which cannot be lowered automatically because of some properties and restrictions of the architecture. One LLVM drawback in this field is that for a CISC architecture the code generation/lowering part is almost 80% hand-written.

One important issue that the code generator needs to be aware of is the presence of fixed registers. In particular, there are often places in the instruction stream where the register allocator must arrange for a particular value to be in a particular register. This can occur due to limitations of the instruction set (e.g., the x86 can only do a 32-bit divide with the EAX/EDX registers) or external factors like calling conventions. In any case, the instruction selector should emit code that copies a virtual register into or out of a physical register when needed. By the end of code generation, the register allocator will coalesce the registers and delete the resultant identity moves.

MachineInstrs are initially selected in SSA form and are maintained in SSA form until register allocation happens. For the most part, this is trivially simple since LLVM is already in SSA form; LLVM PHI nodes become machine code PHI nodes, and virtual registers are only allowed to have a single definition.

After register allocation, machine code is no longer in SSA form because there are no virtual registers left in the code.

Instruction selection is arguably the most important part of the code generation phase. Its task is to convert a legal selection DAG into a new DAG of target machine code. In other words, the abstract, target-independent input has to be matched to concrete, target-dependent output. For this purpose LLVM uses an elaborate pattern-matching algorithm that consists of two major steps.

The first step happens “offline”, when LLVM itself is being built, and involves the TableGen tool, which generates the pattern-matching tables from instruction definitions. TableGen is an important part of the LLVM ecosystem, and it plays an especially central role in instruction selection, so it is worthwhile to discuss it here in more depth. The problem with TableGen is that some of its uses are so complex (and instruction selection, as we will shortly see, is one of the worst offenders) that it is easy to forget how simple the idea is at its core. The LLVM developers realized a long time ago that a lot of repetitive code has to be written for each new target. Take a machine instruction, for instance. An instruction is used in code generation, in the assembler, in the disassembler, in optimizers, and in many other places. Each such use results in a “table” that maps instructions to some piece of information. Wouldn't it be beneficial if we could just define all instructions in one central place, which collects all the interesting information needed about them, and then generate all the tables automatically? This is precisely what TableGen was designed to do.

As a conclusion for this chapter, LLVM has a good approach, but, for optimized code, the user has to write more code than for GCC.

IV. Target Specific Optimization

GCC cannot generate target specific optimizations, but offers the possibility to configure several of them.

Peephole optimization gathers two or more consecutive instructions into a single one. Each peephole pattern is described in the machine description file. The peephole optimization is called between register allocation and instruction scheduling [6]. The description contains the input instructions, the output instruction and the additional required scratch registers. Below is an example of a peephole pattern description:

Fig. 4. GCC – Peephole definition

Fig. 5. GCC – Peephole Transformation
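A peephole pattern of the kind shown in Fig. 4 might look roughly as follows. This is a hedged sketch, not taken from a real port: it folds a register copy followed by a two-address add into a single three-address add, guarded so that the destination does not overlap the second source (the pattern syntax and reg_overlap_mentioned_p are standard GCC internals; the concrete instructions are invented for illustration):

```lisp
;; Illustrative define_peephole2 pattern for a hypothetical target:
;;   r0 = r1
;;   r0 = r0 + r2
;; becomes
;;   r0 = r1 + r2
(define_peephole2
  [(set (match_operand:SI 0 "register_operand" "")
        (match_operand:SI 1 "register_operand" ""))
   (set (match_dup 0)
        (plus:SI (match_dup 0)
                 (match_operand:SI 2 "register_operand" "")))]
  "!reg_overlap_mentioned_p (operands[0], operands[2])"
  [(set (match_dup 0)
        (plus:SI (match_dup 1) (match_dup 2)))])
```

The guard is needed because if operand 2 were the same register as operand 0, the first copy would change the value the add consumes, and the merged form would no longer be equivalent.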

The define_split expression tells the compiler how to split a complex instruction into two or more simple instructions. The instruction split is useful in the following situations: when the machine has instructions which require delay slots in between, and when the output of some instructions is not available for multiple cycles. This optimization has the ability to place instructions in slots that are empty.

Fig. 6. GCC – Split Expression

GCC offers the possibility to choose where to allocate a local/global variable:

register int force_ra asm ("a1");

Fig. 7. GCC – Allocate a variable in a certain register

The GCC register allocation [10] is based on register information, more precisely on the register class definitions and instruction constraints. To avoid allocation errors, the machine description is based on the narrowest register classes for allocatable registers.

There are multiple macros which allow the user (the one who ports the new architecture) to configure and to improve the default allocation. Some macros, even though they are for configuration, require deep knowledge of both the GCC compiler implementation and the architecture. One example is MODE_BASE_REG_REG_CLASS (mode). This macro is defined if the architecture has indexed addressing modes and the base plus index addresses have different requirements than other base register uses. Target hooks are another mechanism which gives freedom to model the register allocation. The target hook bool TARGET_CLASS_LIKELY_SPILLED_P (reg_class_t rclass) returns true if pseudos that have been assigned to registers of class rclass would likely be spilled because registers of rclass are needed for spill registers.

There are many hooks and macros like those above used to configure the allocator to be in good shape and to get the best performance: macros to set the allocation order, to mark special addressing modes (like preincrement and postdecrement), to define what values fit in registers, and to set leaf registers for leaf functions, stack registers (if any), and registers used by calling conventions for parameter passing.

Linear scan has been the default register allocator in LLVM since 2004 [6]. It has worked surprisingly well for such a simple algorithm. In fact, the simple design made it easier to tweak the algorithm in order to make small improvements to the generated code. More advanced register allocation algorithms often need to build expensive data structures, or they make assumptions about live ranges being invariant. That makes it difficult to, say, commute a two-address instruction on the fly, or rematerialize a constant pool load instead of spilling it to the stack.

As the name implies, linear scan works by visiting live ranges in a linear order. It maintains an active list of live ranges that are live at the current point in the function, and this is how it detects interference without computing the full interference graph. The active list is the key to linear scan's speed, but it is also its greatest weakness. When all physical registers are blocked by interfering live ranges in the active list, a live range is selected for spilling. Live ranges spilled without being split first cause overly complex code that another module, the rewriter, attempts to clean up. We would much rather split them into smaller pieces that might be assignable, but this would require the linear scan algorithm to backtrack. This is very expensive, and full live range splitting is not really feasible with linear scan.

LLVM provides several different register allocation algorithms.

Register allocation is based on the target register description. In LLVM the register description has two parts: a declarative one, in the TargetRegisterInfo.td file, and a C++ one [8].

The register information in the compiler is generated by TableGen from the .td file and placed in targetGenRegisterInfo.h.inc and targetGenRegisterInfo.inc. But some of the code in targetRegisterInfo requires hand-coding.

We should note that the automatically generated files and classes do not provide all register-relevant functionality and information for a target-specific backend (there are some virtual functions defined in the TargetRegisterInfo class that are not implemented in the automatically generated class). Thus we need to manually implement the rest of the functions to retrieve certain register information, for example getCalleeSavedRegs() and getFrameRegister(), etc. [9].

V. Conclusions

Retargetable compilers are a promising approach to meet time-to-market and performance constraints. A compiler is “retargetable” if it can be used to generate code for different processors by reusing significant compiler source code. This has resulted in a paradigm shift towards a language-based design methodology using Architecture Description Languages, exploration of architecture/compiler co-designs, and automatic compiler/simulator generation. However, whatever approach is used, the accuracy and the performance depend on target specific optimizations, i.e. instruction selection, register allocation and instruction scheduling.

At this moment, we can't talk about the best retargetable compiler. Each user should evaluate the needs. Do we target a small microcontroller, a RISC/CISC processor, or a core which in the future may be part of a multicore architecture? Another

decision element is time constraints versus people experience
with one or another of the compilers.
GCC maintains the flexibility to create something unique for each target. GCC is very mature, easy to install (at least for most systems), and is the default compiler for lots of systems.

From my point of view, LLVM has a plus for ease of adding new features, building and debugging, while GCC still has the ownership on performance for most of the targets. But it should also be noted that LLVM was designed in the context of multicore and multithreading programming. For example, starting from this year, LLVM has full support for the OpenMP standard, based on the LLVM OpenMP runtime library. This feature boosts application performance on modern multicore architectures.

TABLE I. GCC VS LLVM

                        Compilers
                      GCC                             LLVM
Machine Description   RTL                             Declarative description language
Code Generation       Davidson-Fraser                 SelectionDAG
Peephole Engine       Peephole patterns in .md file   ---
Split Engine          Split patterns in .md file      ---
Register Allocation   2 steps – local and global RA   Linear scan

References

[1] Norman Ramsey, Jack W. Davidson, “Machine descriptions to build tools for embedded systems”, In Workshop on Languages, Compilers, and Tools for Embedded Systems, Springer Verlag, volume 1474 of LNCS, June 1998, pages 172–188.
[2] Alfred V. Aho, Mahadevan Ganapathi, Steven W. K. Tjiang, “Code generation using tree matching and dynamic programming”, ACM Transactions on Programming Languages and Systems, 11(4), 1989, pages 491–516.
[3] A. V. Aho, R. Sethi, J. D. Ullman, “Compilers: Principles, Techniques, and Tools”, Addison-Wesley, 1986.
[4] Mark W. Bailey, Jack W. Davidson, “Automatic detection and diagnosis of faults in generated code for procedure calls”, IEEE Transactions on Software Engineering, 29(11), 2003, pages 1031–1042.
[5] Christopher W. Fraser, Robert R. Henry, Todd A. Proebsting, “BURG: fast optimal instruction selection and tree parsing”, SIGPLAN Notices, 27(4), 1992, pages 68–76.
[6] llvm.org.
[7] Dmitry Melnik, Andrey Belevantsev, Dmitry Plotnikov, Semun Lee, “A case study: optimizing GCC on ARM for performance of libevas rasterization library”, In Proceedings of GROW 2010.
[8] Ghassan Shobaki, Maxim Shawabkeh, Najm Abu-Rmaileh, “Preallocation Instruction Scheduling with Register Pressure Minimization Using a Combinatorial Optimization Approach”, ACM Transactions on Architecture and Code Optimization (TACO), Sep. 2013.
[9] http://www.pitt.edu/~yol26/notes/llvm3/LLVM3.html
[10] https://gcc.gnu.org/onlinedocs/gccint/Register-Classes.html
[11] http://www.drdobbs.com/retargeting-the-gnu-c-compiler/184401529
[12] https://opus4.kobv.de/opus4-fau/files/1108/tricore_llvm.pdf
