
Compression of Embedded System Programs

Michael Kozuch and Andrew Wolfe


Department of Electrical Engineering
Princeton University
This research has been partially supported by Motorola under Research Agreement JD979-1.

Abstract

Embedded systems are often sensitive to space, weight, and cost considerations. Reducing the size of stored programs can significantly improve these factors. This paper discusses a program compression methodology based on existing processor architectures. The authors examine practical and theoretical measures for the maximum compression rate of a suite of programs across six modern architectures. The theoretical compression rate is reported in terms of the zeroth and first-order entropies, while the practical compression rate is reported in terms of the Huffman-encoded format of the proposed compression methodology and the GNU file compression utility, gzip. These experiments indicate that a practical increase of 15%-30% and a theoretical increase of over 100% in code density can be expected using the techniques examined. In addition, a novel, greedy, variable-length-to-variable-length encoding algorithm is presented with preliminary results.

1. Introduction

An embedded system may be loosely defined as a system which incorporates computer technology but is not, itself, a computer. Examples range from simple devices such as television remote controls to compute-intensive systems such as automotive engine controllers and missile guidance systems. While these systems vary greatly in their cost, complexity, and function, they all share some common characteristics. The system's primary functions are implemented by a microprocessor or microcontroller executing an essentially permanent stored program composed by the system designer rather than the user. These systems are also sensitive to many design constraints including limits on size, weight, power consumption, and cost.

An ideal solution for the high performance embedded systems market would be a processor that provides all of the recognized performance benefits of RISC architectures while also providing denser instruction storage than existing RISC implementations. In pursuit of this goal, we have been exploring methods for compression of programs in embedded systems. We have developed a method whereby embedded programs are stored in compressed form but executed from the instruction cache in the standard format. This provides increased storage density with only a minimal impact on processor performance.

The system consists of a standard processor core augmented with a special code-expanding instruction cache. A traditional compiler and linker are used to generate standard embedded object code. This object code is then compressed on the host development system using a code compression tool similar in principle to the UNIX compress utility. This compressed code is stored in the embedded system instruction memory. The reduced code size provides a cost savings for each production unit. At run time, the stored instructions are decompressed by the instruction cache refill engine. Code in the instruction cache appears to the processor as standard CPU instructions. No changes in operation or performance occur when instructions are found in the instruction cache. On a cache miss, the cache refill engine locates the compressed cache line and expands it. The cache refill time varies depending on the characteristics of the memory system, the coding method, the decompression hardware, and the compressed code.

The reduced program size means that system cost, size, weight, and power consumption can be reduced. Furthermore, fetching compressed instructions requires fewer bus cycles, further reducing power consumption. Even a slight reduction in program size will allow designers to incorporate additional features without increasing system memory. This can greatly increase product competitiveness.

A system for encoding and decoding RISC programs has already been proposed [Wolfe92]. A preliminary study of embedded program compression in that paper showed that such a method is practical and can provide significant compression at a reasonable cost and with minimal impact on performance. This earlier work measures the effectiveness of a single class of compression algorithms on the MIPS architecture. That work is extended here in a more comprehensive exploration of the potential of program compression. Practical and theoretical compression rates are measured for 6 common 32-bit architectures. A new specialized compression method is then presented that may be more effective than existing algorithms for embedded program compression.

2. Embedded System Program Compression

2.1. Prior Research

Lossless data compression as a field of study is relatively mature. However, the best known lossless data compression methods such as Huffman coding [Huffman52] or Lempel-Ziv coding [Lempel76, Ziv77, Ziv78] are primarily designed for text compression. These algorithms are commonly used in off-line, file-based compression programs such as the UNIX utilities compress and gzip. While the theory behind these algorithms is valuable to embedded systems program compression, the existing file-based implementations are not suitable for run-time operation. In addition to these well-known methods, there has been considerable research in the area of lossy compression methods such as DCT-based [Petajan92] or fractal-based compression [Kocsis89]. These produce a much greater rate of compression than lossless methods; however, they are unsuitable for program compression.

Embedded system program compression has cost and speed constraints that differentiate it from typical file-based

1063-640-4 $4.00 © 1994 IEEE
compression applications. Since embedded systems are highly cost sensitive and typically only execute a single program, it is not possible to include temporary storage for an uncompressed version of the program. Instead, the program must be decompressed on demand at run time, so that an uncompressed copy of the next instruction is always available. The system proposed in [Wolfe92] uses the existing instruction cache in high-performance processors as a decompression buffer, storing uncompressed copies of recently used fixed-sized blocks of instructions. These fixed-sized blocks are decompressed by the cache refill hardware whenever there is an instruction cache miss. Since the program must be decompressed in small fixed-sized blocks rather than the more common approach of decompressing the entire program from beginning to end, the most obvious compression methods require that each block has been separately compressed. Furthermore, to retain high performance it must be possible to decompress a block with low latency, preferably no longer than a normal cache line refill.

In addition to the existing work on file-based compression, automated compression and decompression schemes have been implemented in general-purpose systems at slower levels in the memory hierarchy [Taunton91]. Automated file compression systems such as the DoubleSpace utility in MS-DOS 6.2 use file and block based compression to reduce the disk space requirement of files. A similar method is discussed in [Cate91] using compression within memory and disk to compress pages in a demand-paged virtual memory system. These disk-based systems use large data blocks, on the order of 4K-16K bytes, rather than the 14-64 byte blocks common in instruction caches. Furthermore, disk-based systems can tolerate decompression latencies on the order of 10-200 ms rather than the 50-500 ns latency that is tolerable in embedded system program compression. These differences in scale allow the effective use of Lempel-Ziv type algorithms implemented in either hardware or software for disk or virtual-memory based compression. Unfortunately, this class of algorithms does not appear to be practical or effective for short program blocks at cache speeds.

A program compression scheme based on dictionary compression has been proposed in [Devadas94]. This work presents some interesting ideas in software-only compression; however, the experimental results do not yet validate the methods. The experimental results are based on unoptimized assembly code without the inclusion of libraries. These examples contain far more redundancy than fully optimized applications. In fact, simply enabling the optimizer produces smaller code than the dictionary-based compression.

2.2. Mechanisms for Program Compression

The key challenge in the development of a code compression scheme for existing microprocessor architectures is that the system must run all existing programs correctly. Furthermore, the performance of a compressed code processor should be comparable to that of a traditional processor. The use of instruction cache based decompression assures that these requirements can be met. All instructions are fetched through the instruction cache. Since they are stored uncompressed in cache, they can always be fetched from the original program address. Furthermore, since the vast majority of instruction fetches result in cache hits, the performance of the processor is unchanged for these instructions.

In an embedded system, it is not possible to decompress the entire program at once; therefore, a block oriented compression scheme is required. The experiments we have performed are based on compressing 32-byte cache lines into smaller byte aligned blocks as shown in Figure 1. A number of compression techniques are possible, but they all must allow for effective run-time decompression. Compression takes place at program development time; therefore, compression time is immaterial. Decompression time, however, directly impacts cache refill time and thus performance.

[Figure 1 - Block Bounded Compression. An 8-word fully-aligned block (word offsets ...00 through ...1C) is shown above the corresponding n-byte unaligned block.]

Maintaining full compatibility with existing code presents a problem when executing control transfer instructions such as jumps or procedure calls. The address of the jump target in the compressed code is different than it is in the uncompressed code. This problem is one reason why continuous file-based compression is impractical for direct execution. If a program branches to an instruction at a given address, how can that instruction be found in the compressed program? A specific jump target address in the original code may not even correspond to an addressable byte boundary in the compressed code. While it might be possible to place all jump targets on addressable boundaries and replace uncompressed code target addresses in the original code with the new compressed code target addresses, this introduces new problems. Jump targets that happen to be in cache would have different addresses than the same targets in main memory. Furthermore, programs often contain indirect or computed jump targets. To convert these addresses would require modifications to the address computation algorithms in the compiled code.

In-cache expansion solves most addressing problems. The address of a jump target in cache is the same as in the original uncompressed program. If a program jumps to a target that is not in cache, that target is brought into the cache before execution. This only requires that the processor locate the address of the beginning of each compressed cache line. This restricts each compressed cache line such that it must start on an addressable boundary. Some record of the new location of each cache line is required to map the program address of each block to its actual physical storage location.

A new structure is incorporated into the cache refill hardware. The Line Address Table or LAT maps program instruction block addresses into compressed code instruction block addresses. The data in the LAT is generated by the compression tool and stored along with the program. Figure 2 diagrams the LAT for a 32-byte cache line.
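The address mapping performed by the LAT can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' hardware design: it assumes the packed entry layout that the paper describes for the LAT (one base address plus the compressed length of each line in a group of c lines), and the names `LatEntry`, `compressed_address`, `lat_entry_bits`, and `lat_overhead` are hypothetical.

```python
from dataclasses import dataclass
from math import ceil, log2

LINE_BYTES = 32       # l: uncompressed cache line size in bytes
LINES_PER_ENTRY = 8   # c: cache lines located by one LAT entry


@dataclass
class LatEntry:
    """Hypothetical packed LAT entry: a base address plus per-line lengths."""
    base: int            # byte address of the first compressed line in the group
    lengths: list[int]   # compressed length in bytes of each line in the group


def compressed_address(lat: list[LatEntry], program_address: int) -> tuple[int, int]:
    """Map an uncompressed program address to (start, length) of its compressed line."""
    line_index = program_address // LINE_BYTES
    entry = lat[line_index // LINES_PER_ENTRY]
    slot = line_index % LINES_PER_ENTRY
    # Lines are byte aligned and stored back to back, so the start of a line
    # is the group base plus the lengths of the lines that precede it.
    start = entry.base + sum(entry.lengths[:slot])
    return start, entry.lengths[slot]


def lat_entry_bits(b: int, c: int, l: int) -> int:
    """Size of one entry: a b-bit base plus c length fields of ceil(log2(l)) bits."""
    return b + c * ceil(log2(l))


def lat_overhead(b: int, c: int, l: int) -> float:
    """Entry size as a fraction of the c * l uncompressed bytes it locates."""
    return lat_entry_bits(b, c, l) / (8 * c * l)
```

With a 24-bit address space, 8 lines per entry, and 32-byte lines, `lat_entry_bits(24, 8, 32)` gives a 64-bit entry and `lat_overhead(24, 8, 32)` gives 0.03125, which matches the 64-bit LAT entry and 3.125% overhead quoted in the paper.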

[Figure 2 - Line Address Table. Diagram indexed by the Cache Line Address.]

Using a Line Address Table, all compressed code can be accessed normally by the processor without modifying the processor operation or the program. Line Address Table access increases cache line refill time by a marginal amount, at least one memory access time. This is not a major effect since it only occurs during a cache miss; however, this effect can be further reduced by using another small cache to hold the most recently used entries from the LAT. This cache is essentially identical to a TLB and in fact is called the Cache Line Address Lookaside Buffer or CLB. In practice, the LAT is simply stored in the instruction memory. A base register value within the cache refill engine is added to the line address during CLB refill in order to index into this table. Figure 3 shows how the overall instruction memory hierarchy might be implemented in a typical system.

[Figure 3 - Overall Memory System Organization.]

2.3. Codes for Program Compression

The specific requirements of embedded system program compression restrict the class of coding techniques that can be effectively used for compression. The critical requirement that small blocks of data must be separately compressed is the most difficult constraint. Many of the adaptive Lempel-Ziv algorithms that are effective for file compression simply do not provide good compression on small blocks. In fact, it is very difficult to get good compression on small blocks of programs.

Most of the experiments we have done in embedded program compression rely on Huffman coding. Variable length code words are selected to represent each byte in the original program. The length of each code word depends on the frequency of occurrence in the sample program. A Huffman code can be generated for each embedded system program; however, this requires that the code be stored along with the program. For embedded systems programs, we have found that it is preferable to build a fixed Huffman code for each architecture based on a large sample of typical programs. This code, called a Preselected Huffman Code, is then built into the decompression logic rather than stored in memory. The effectiveness of this code is generally independent of block size; however, the embedded system compression mechanisms add additional overhead that limits the effectiveness of coding.

Huffman codes suffer from inherent inefficiency whenever the frequency of occurrence of symbols is not exactly a negative power of two. This is a quantization effect caused by the requirement that an integral number of bits is used to code each symbol. This is further compounded in many cases by the fact that 8-bit symbols have been used rather than 32-bit symbols corresponding to the size of RISC instructions. This is necessary to reduce the complexity of the decoder. Despite these inefficiencies, our experiments show that the effect of these factors is small. The effectiveness of compression is also reduced by the fact that each compressed block must be stored on a byte-addressable boundary. This adds an average of 3.5 bits to every coded block.

Another factor contributing to coding overhead is the storage of the Line Address Table. Storing a full pointer to each compressed line would be prohibitively expensive; however, we have used an ad-hoc compression technique to pack multiple pointers into each LAT entry based on storing a base address plus the length of each compressed line. According to this design, the compressed cache lines are aligned on byte boundaries in the compressed program storage area, and the LAT provides a compact index into these compressed cache lines. Specifically, if the cache line size is l bytes, the address space is b bits, and each LAT entry provides a pointer to each of c cache lines, then each LAT entry occupies:

$$b + c \cdot \lceil \log_2(l) \rceil \ \text{bits}$$

Because each LAT entry locates c l bytes, the overhead associated with this design is:

$$\frac{b + c \cdot \lceil \log_2(l) \rceil \ \text{bits}}{c \cdot l \ \text{bytes} \cdot 8 \ \text{bits/byte}}$$

For the example of a 32-byte cache line, a 24-bit address space and 8 cache lines per entry, the overhead is 8 bytes/256 bytes = 3.125%. Since the number of bits available to represent the length of each line is limited, the compressed length of each line must be limited as well. One can limit the encoded length to no more than the original length by simply not encoding any line that is not reduced in size. This slightly improves the rate of compression.

This design has been shown to provide very good performance and moderate compression for the MIPS R3000 architecture in earlier work. However, it was still not known if this method is effective on a variety of architectures. It is also interesting to explore the possibility that there may be better coding methods for short blocks of program code. The next two sections explore these issues.

3. Investigation

Given the advantages of compressing embedded system code and the feasibility of the proposed architecture, a further

investigation into the compressibility of embedded system code is warranted. The purpose of these experiments is to explore the compressibility of code on several modern architectures using traditional coding methods modified for embedded systems program compression.

Embedded system code is rarely portable among architectures. Therefore, the analysis of this paper is based on a set of computer benchmarks which we believe may be a suitable approximation to embedded system code. These benchmarks are taken from the SPEC benchmark suite, the UNIX operating system, and some speech encoding/decoding programs. The fifteen example programs are presented in Table I. This set was chosen for its mix of floating point and integer code, various size programs, and portability.

Program      Function
awk          pattern scanning/processing
dnasa7       floating point kernels
doduc        thermohydraulical modelization
eqntott      Boolean equation translation
espresso     Boolean function minimization
fpppp        quantum chemistry
grep         regular expression matching
gsmtx        GSM 06.10 speech CODEC
matrix300    matrix multiplication
neqn         typeset mathematics
sed          stream editor
tomcatv      mesh generation
uvselp       NADC speech coding
xlisp        lisp interpreter
yacc         yet another compiler compiler

Table I. Benchmark Set

Each of the programs in Table I was compiled on six different architectures. This was intended to identify architectural differences (e.g. RISC vs. CISC) which might affect compressibility. The architectures used are shown in Table II. Five of these architectures are typical of current and future 32-bit high-performance embedded processor cores. The VAX is used as a reference point as a high-density instruction set.

Architecture        Operating System
VAX 11/750          BSD UNIX 4.3
(SGI) MIPS R4000    IRIX 4.0.5F System V
(Sun) 68020         SunOS 4.1.1
(Sun) SPARC         SunOS 4.1.3
(IBM) RS6000        AIX 3.2

Table II. Tested Architectures and OS.

Each of the fifteen programs was compiled on each of the six architectures except sed, which would not compile on the RS6000 machine. The programs were compiled with the standard compiler on each machine (e.g. cc or f77) and with the highest optimization level allowed. Further, to ensure that the programs contained all the code to be run, each was compiled statically (i.e. no dynamically linked libraries). The same source code was used on each machine, so at an abstract level, the amount of computational work performed was identical on all machines.

On each architecture, an instruction extraction tool was created. These tools extract the actual instructions (text segment) from executable programs compiled on that architecture. After the program set was compiled on each machine, the instruction extraction tool was applied to each program to isolate the actual program. (This step eliminates the data segments, relocation information, symbol tables, etc.)

[Figure 4. Sum of Program Sizes for Each Machine (Normalized to the VAX 11/750). Bar chart titled "Normalized Program Set Size" with bars for VAX, MIPS, 68020, SPARC, RS6000, and MPC603.]

For comparison, the uncompressed text size on each architecture is presented in Figure 4, where the sum of the sizes in the test set is reported for each machine normalized to the total size of the VAX 11/750. The native programs differ significantly in size based only on the architecture and compiler. This results both from differences in the instruction set encoding and in the speed/size tradeoffs made by the compiler as well as differences in library code. This raises the interesting question of whether the less dense instruction sets contain more redundancy and thus are more compressible. We conducted a number of experiments to investigate this issue.

The first experiment measures the entropy of each program using a byte-symbol alphabet. These entropy measures determine the maximum possible compression under a given set of assumptions. The zeroth-order entropy assumes that the probability of an occurrence of a given symbol a_i of alphabet S is given by p(a_i) and is independent of where the byte occurs in the byte stream. That is, bytes occur in random order, but with a certain distribution. The zeroth-order entropy over a source alphabet, S, is then given by

$$H_2(S) = -\sum_{a_i \in S} p(a_i) \log_2 p(a_i)$$

(The subscript 2 in the entropy symbol, H_2(S), denotes that the encoded symbol stream is of radix 2, that is, the encoded stream is in bits.) The entropy of a source may be interpreted as the information content of that source, and hence may be viewed as a measure of the theoretical maximum compressibility of that source [Hamming80]. The entropy is dependent upon the model of the source assumed. Here, we have assumed a zeroth-order model.
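The entropy measurements can be illustrated with a short Python sketch. This is not the authors' encoding tool; it is a minimal example, with hypothetical function names, of computing the zeroth-order entropy from a byte histogram, together with the first-order entropy that the paper introduces for its Markov source model, computed from counts of adjacent byte pairs.

```python
from collections import Counter
from math import log2


def zeroth_order_entropy(data: bytes) -> float:
    """Bits per byte, assuming each byte is drawn independently."""
    if not data:
        return 0.0
    total = len(data)
    histogram = Counter(data)  # occurrences of each byte value
    return -sum((n / total) * log2(n / total) for n in histogram.values())


def first_order_entropy(data: bytes) -> float:
    """Bits per byte, assuming each byte depends only on the previous byte."""
    total_pairs = len(data) - 1
    if total_pairs <= 0:
        return 0.0
    pairs = Counter(zip(data, data[1:]))  # h(i, j): times byte j follows byte i
    row_sums = Counter()                  # sum over j of h(i, j)
    for (prev, _), n in pairs.items():
        row_sums[prev] += n
    # -sum of p(a_i, a_j) * log2 p(a_j | a_i) over all observed pairs
    return -sum(
        (n / total_pairs) * log2(n / row_sums[prev])
        for (prev, _), n in pairs.items()
    )


def entropy_bound_on_ratio(bits_per_byte: float) -> float:
    """Entropy expressed as a compressed/uncompressed size ratio for 8-bit symbols."""
    return bits_per_byte / 8.0
```

For example, `zeroth_order_entropy(text_segment)` on the extracted instruction bytes of a program (here `text_segment` is a placeholder for those bytes) gives the per-byte bound, and dividing by 8 expresses it as a ratio between 0 and 1.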

For our purposes, the alphabet, S, is the set of all possible bytes (8 bits). The probability of a byte occurring was determined by counting the number of occurrences of that byte and dividing by the number of bytes. The encoding tool first builds a histogram from the extracted program. The histogram leads directly to the probability distribution, and the entropy may then be calculated according to the above equation.

If we change the model of our source, we must also determine a new form for the entropy. A more general source model is given if we assume that each byte is dependent upon the previous byte (as in a first-order Markov process). In this case, we have a set of conditional probabilities, p(a_j | a_i), which indicate the probability that symbol a_j occurs given that a_i has just occurred. This model is the first-order model and the entropy is given by the first-order entropy:

$$H_2(S) = -\sum_{i} \sum_{j} p(a_i, a_j) \log_2 p(a_j \mid a_i)$$

Here, p(a_i, a_j) indicates the probability that the pattern a_i, a_j occurs.

The first-order entropy of each program set was determined similarly to the zeroth-order entropy. The significant difference between the two measurements is that calculating the first-order entropy involves generating the n conditional probabilities for each of the n symbols in the alphabet, S. During processing, an n x n matrix, h, is generated where h(i,j) is the number of a_j symbols which follow a_i symbols. The probability of a symbol occurring may then be given by:

$$p(a_i) = \frac{\sum_{j} h(i,j)}{\sum_{i} \sum_{j} h(i,j)}$$

and the conditional probability of a symbol occurring is given by:

$$p(a_j \mid a_i) = \frac{h(i,j)}{\sum_{j} h(i,j)}$$

Further, the pattern probability may be found from:

$$p(a_i, a_j) = p(a_i) \, p(a_j \mid a_i) = \frac{h(i,j)}{\sum_{i} \sum_{j} h(i,j)}$$

The average and aggregate calculated values of the entropy for the program set are shown in Figure 5 and Figure 6, respectively. The entropy may be interpreted as the maximum compression ratio where the compression ratio is given by:

$$\text{Compression Ratio} = \frac{\text{Compressed Size}}{\text{Uncompressed Size}}$$

The aggregate entropy is measured by generating occurrence statistics on the program set as a whole, and the average entropy is calculated by averaging (arithmetically) the separately measured entropy for each program. For our program set, the aggregate entropy is greater (meaning less compression) than the average entropy, but this is not always true. In the context of a compressed program embedded system, the average entropy expresses the typical compression ratio (in the theoretical limit) for a program if the decompression engine is custom designed for that one program. The aggregate entropy represents the limit of average compression if a single code is used for all programs.

[Figure 5. Average Entropy for 6 architectures. Bar chart of Zero Order and First Order entropy; vertical axis from 0 to 1.]

[Figure 6. Aggregate Entropy for 6 architectures. Bar chart over VAX, MIPS, 68020, SPARC, RS6000, and MPC603; vertical axis from 0 to 1.]

It is clear from the zeroth-order entropy numbers that simple compression methods like Huffman coding are not going to achieve very high rates of compression on this type of data. In fact, it appears that the MIPS instruction set originally studied is the most compressible using zeroth-order methods. This indicates that simple coding methods are likely to be inadequate for other architectures.

In order to measure the efficiency of Huffman coding on these architectures, we used the compression method described in Section 2 to actually compress programs from each architecture in cache-line sized blocks. Figure 7 describes the compression ratio obtained and the sources of compression overhead from 32-byte blocks using a 64-bit LAT entry for each 8 blocks. The coding overhead represents the inefficiency in the Huffman code caused by integral length symbols. This difference between the observed compression

rate and the zeroth-order entropy limit is reduced by the ad-hoc optimization of not encoding blocks that would increase in length. Although this can result in better compression than predicted (negative overhead), it requires that the special case be flagged in the LAT. The blocking overhead is the result of byte alignment. This and the LAT overhead are constant across architectures, although sample variations occur in the data. The data shows that for all 6 architectures, the block compression method used is about 94%-95% as effective as the best zeroth-order code. This appears to be essentially constant, indicating that entropy is a good estimator for practical compression ratio for this type of code.

[Figure 7. Compression Efficiency]

3.1. More Aggressive Coding

The data from our experiments indicates that zeroth-order coding is only moderately effective at compression for most of the architectures we have evaluated. However, the first-order entropy data shows that much better compression is possible with more aggressive coding methods. To confirm this hypothesis, we also compressed each set of programs via the GNU file compression utility, gzip. Note that these values, shown in Figure 8, are better than those reported for the above zeroth and first-order entropies because the gzip algorithm operates on a higher-order source model than we have considered. We have assumed that the decompression engine required by the higher order compression techniques (Lempel-Ziv etc.) would be too expensive for implementation in embedded systems, but the potential for greater compression indicates that it is valuable to search for more aggressive compression methods that are feasible for embedded system compression.

[Figure 8. Gzip Compression Ratio. Bar chart over VAX, MIPS, 68020, SPARC, RS6000, and MPC603.]

3.2. Variable Symbol Length Encoding

These experiments have measured three indicators of compressibility: zeroth-order entropy, first-order entropy, and gzip compression. Compression methods that approach zeroth-order entropy provide fast, simple decompression and hence are more suitable for embedded system applications. However, the more aggressive compression methods are much more effective at reducing program size. Ideally, we would like to develop new coding schemes that are more effective than Huffman coding, work well on short blocks, and are easy to decode at high speed. In order to exceed the entropy limits established in section 3, we must modify our model of the source. One key observation is that the traditional model of treating each 8-bit byte as a source symbol, while obvious for ASCII text, is rather arbitrary for programs. One possible modification is to use zeroth-order entropy coding with source symbols of other sizes. An investigation of various other symbol lengths is summarized in Figure 9. It might be argued that the natural symbol size for RISC instructions is 32 bits; however, constructing a hardware decoder for a Huffman code with 2^32 distinct symbols is impractical. Sixteen-bit symbols are also more effective than 8-bit symbols; however, they are also probably too expensive for embedded systems.

[Figure 9. Entropy vs. Source Symbol Size. Horizontal axis: Symbol Size (bits) at 2, 4, 8, 12, and 16; one series each for MPC603, RS6000, MIPS, 68020, SPARC, and VAX.]

A detailed examination of instruction encoding for these architectures shows that instructions are comprised of a sequence of fields of varying lengths, ranging from 4 to 26 bits. This presents the possibility that the best method for encoding of instructions is to create a set of symbols of

varying lengths such that this set of symbols can be Greedy Symbol Selection Algorithm
combined into strings to create any possible cache line. The One alternative to exhaustive search is to employ a greedy
encoding problem thus consists of two parts, selection of a algorithm as in Figure 10. This algorithm successively
set of variable-length symbols for encoding and optimal searches for the next symbol to be included in the set until
partitioning of program blocks into those symbols. One either the maximum set size is reached or the selection criteria
possible set of symbols is the set of all 8-bit symbols; are no longer met. A valid source symbol set must cover the
hence, the byte-symbol Huffman codes are a subset of source program. This requires that some sequence of the
variable-length codes. symbols in the set is equal to the original source program.
Although variable-length codes may provide improved compression for embedded programs, the mechanics of generating these codes are computationally complex. Ideally, one would like to select the optimal set of symbols such that a program encoded using those symbols is of minimum length. Unfortunately, this problem cannot be solved in reasonable time. Assuming that the maximum number of symbols to be included in the source model is n, and the maximum source symbol length is b bits, the exhaustive search yields

$$\sum_{i=1}^{n} \binom{2^{b+1}-2}{i}$$

possible symbol sets. This is simply the enumeration of all distinct sets of the $2^{b+1}-2$ strings of length b or less, such that the set contains at least 1 and no more than n members. For the relatively small example of b=16, n=32, this results in approximately $2 \times 10^{128}$ possibilities, while the more reasonable b=32 and n=256 is not directly computable in 64-bit double precision arithmetic. The difficulty of this problem is compounded by the fact that evaluating the quality of a symbol set involves actually encoding the program sample and measuring the encoded size. The optimal encoding given a set of variable-length symbols also requires exponential time to compute. Consequently, with present computer technology, exhaustive symbol set evaluation is infeasible.

The obvious alternative is to use heuristics to search and evaluate a subset of this design space in the search for a very good, if not optimal, set of coding symbols. We are experimenting with greedy algorithms to construct source symbol sets.

    symbol FindBestSymbol()
    {
        symbol temp, s;
        priority temp_w, w = 0;
        for (all symbol lengths)
        {
            GenerateHistogram();
            temp = GetMaxSymbol();
            temp_w = Priority(temp);
            if (temp_w > w)
            {
                s = temp;
                w = temp_w;
            }
        }
        return s;
    }

    SymbolSetSelect()
    {
        while (bitstream not exhausted)
        {
            symbol s = FindBestSymbol();
            for (all occur. of symbol s in bitstream)
                MarkSymbolUsed();
            SetInclude(s);
        }
    }

Figure 10. Greedy Symbol Selection Algorithm.

Preferably, the source symbol set covers all legal programs. The Priority() function used in FindBestSymbol() uses the number of occurrences of a symbol, count, and the symbol size in bits, size, to determine a priority for inclusion of the symbol in the symbol set (the symbol with the highest priority is to be included). This priority is an estimate of the savings that inclusion of the symbol represents. Two versions of the Priority() function are under evaluation. The first version estimates the total number of bits saved from the sample program by including a given symbol.

$$\text{priority}_{\text{savings}} = \text{count} \times \left( \text{size} - \left\lceil \log_2 \frac{\#\ \text{of symb. of length size}}{\text{count}} \right\rceil \right)$$

The second version estimates the compression rate of the subset of the source data that will be encoded by a given source symbol. If this priority function is adopted, an inclusion threshold is required. Since only a limited number of symbols can practically be included in the source symbol set, only those symbols that provide a good compression rate and cover a significant portion of the source program should be included.

$$\text{priority}_{\text{rate}} = \frac{\text{size} - \left\lceil \log_2 \frac{\#\ \text{of symb. of length size}}{\text{count}} \right\rceil}{\text{size}}$$

After a symbol set is selected, the resulting symbols are Huffman encoded to produce a uniquely decipherable coded symbol set. Figure 11 reports results for several programs compressed under both versions of the priority function. Since the source program is encoded on a cache line basis, each cache line may be encoded using an exhaustive search algorithm.

[Figure 11 plot: results under the Bit Savings Priority and the Compression Rate Priority for awk, eqntott, grep, gsmtx, neqn, sed, uvselp, xlisp, and yacc.]

Figure 11. Variable Length Symbol Compression.

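The greedy procedure of Figure 10 might be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the bitstream is modelled as a string of '0' and '1' characters, the histogram counts every window that contains no already-marked bit, "# of symb. of length size" is read as the total histogram count for that length, and both priority functions follow one reading of the Priority() definitions. All Python names are hypothetical, and the inclusion threshold needed by the rate-based priority is omitted.

```python
from collections import Counter
from math import ceil, log2


def priority_savings(count, size, total):
    """Assumed reading: estimated bits saved by including the symbol."""
    return count * (size - ceil(log2(total / count)))


def priority_rate(count, size, total):
    """Assumed reading: estimated compression rate for the covered data."""
    return (size - ceil(log2(total / count))) / size


def find_best_symbol(bits, used, max_size, priority):
    """Return the highest-priority candidate over all symbol lengths."""
    best, best_w = None, float("-inf")
    for size in range(1, max_size + 1):
        # Histogram of all windows of this length that touch no marked bit.
        hist = Counter(
            bits[i:i + size]
            for i in range(len(bits) - size + 1)
            if not any(used[i:i + size])
        )
        if not hist:
            continue
        symbol, count = hist.most_common(1)[0]
        w = priority(count, size, sum(hist.values()))
        if w > best_w:
            best, best_w = symbol, w
    return best


def select_symbol_set(bits, max_size, priority=priority_savings):
    """Greedily pick symbols until every bit of the stream is covered."""
    used = [False] * len(bits)
    chosen = []
    while not all(used):
        s = find_best_symbol(bits, used, max_size, priority)
        i = 0
        while i + len(s) <= len(bits):
            window_free = not any(used[i:i + len(s)])
            if bits[i:i + len(s)] == s and window_free:
                used[i:i + len(s)] = [True] * len(s)  # mark occurrence used
                i += len(s)
            else:
                i += 1
        chosen.append(s)
    return chosen


print(select_symbol_set("0110100110010110", max_size=4))
```

Because a one-bit symbol is always available while any bit remains unmarked, each pass marks at least one bit and the loop terminates. A practical version would also cap the number of symbols at n and then Huffman-encode the selected set.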
4. Conclusions and Future Work

This paper presents several experiments concerning the effectiveness of program compression for embedded systems. Simple comparisons of six 32-bit architectures show that significant variations in program size exist between architectures using native program coding. Despite the large variations in program density, compressibility varies only moderately among architectures and appears to be uncorrelated with uncompressed program size. Simple compression methods such as Huffman coding and its variants provide only moderate compression on all of these architectures; however, first-order entropy analysis and Lempel-Ziv based coding demonstrate that better compression is possible.

In order to discover an improved coding method for embedded systems programs, a class of variable source symbol length codes is considered. Determining optimal variable source symbol length codes is shown to be intractable; however, two greedy heuristics are evaluated. These heuristics provide compression that is no better than common coding methods.

The most obvious area for continued research is the development of more effective heuristics for variable source symbol length codes. In addition, compiler-based methods are under investigation to improve compressibility. Instruction selection, register allocation, and instruction scheduling may all be optimized for low entropy. This may improve practical compression rates. This may be extended into earlier optimization stages of the compiler in order to increase the similarity among program structures and thus increase opportunities for entropy reduction. Concurrently, we are investigating opportunities for using the compiler to reduce program size prior to compression. This primarily involves detection of opportunities to combine similar code sequences. Finally, some investigations could be made into designing instruction sets which possess good compressed program characteristics.

5. References

[Cate91] V. Cate and T. Gross, "Combining the Concepts of Compression and Caching for a Two-Level Filesystem," Proc. Fourth International Conf. on Architectural Support for Programming Languages and Operating Systems, ACM, April 1991.

[Devedas94] S. Devadas, S. Liao, and K. Keutzer, "On Code Size Minimization Using Data Compression Techniques," Research Laboratory of Electronics Technical Memorandum 94/18, Massachusetts Institute of Technology, 1994.

[Hamming80] R. W. Hamming, Coding and Information Theory, Prentice-Hall, Englewood Cliffs, NJ, 1980.

[Huffman52] D. A. Huffman, "A Method for the Construction of Minimum-Redundancy Codes," Proceedings of the IRE, Volume 40, pp. 1098-1101, 1952.

[Intrater92] G. Intrater and I. Spillinger, "Performance Evaluation of a Decoded Instruction Cache for Variable Instruction-Length Computers," Proc. of the 19th Symp. Comp. Arch., IEEE Computer Society, May 1992.

[Kocsis89] A. M. Kocsis, "Fractal Based Image Compression," 1989 Twenty-Third Asilomar Conference on Signals, Systems, and Computers, Volume 1, pp. 177-181.

[Lempel76] A. Lempel and J. Ziv, "On the Complexity of Finite Sequences," IEEE Transactions on Information Theory, Volume 20, pp. 75-81, 1976.

[Petajan92] E. Petajan, "Digital Video Coding Techniques for US High-Definition TV," IEEE Micro, pp. 13-21, October 1992.

[Storer88] J. A. Storer, Data Compression: Methods and Theory, Computer Science Press, Rockville, MD, 1988.

[Tauton91] M. Taunton, "Compressed Executables: an Exercise in Thinking Small," Proceedings of the Summer 1991 Usenix Conference, pp. 385-403.

[Wolfe92] A. Wolfe and A. Chanin, "Executing Compressed Programs on An Embedded RISC Architecture," in Proc. Micro-25: The 25th Annual International Symposium on Microarchitecture, 1992.

[Ziv77] J. Ziv and A. Lempel, "A Universal Algorithm for Sequential Data Compression," IEEE Transactions on Information Theory, Volume 23, pp. 337-343, 1977.

[Ziv78] J. Ziv and A. Lempel, "Compression of Individual Sequences Via Variable-Rate Coding," IEEE Transactions on Information Theory, Volume 24, pp. 530-536, 1978.
