Arithmetic Coding for Data Compression
Edgar H. Sibley
Panel Editor
The state of the art in data compression is arithmetic coding, not the better-known Huffman method. Arithmetic coding gives greater compression, is faster for adaptive models, and clearly separates the model from the channel encoding.
Communications of the ACM, June 1987, Volume 30, Number 6

Computing Practices

DATA COMPRESSION
THE IDEA OF ARITHMETIC CODING
Symbol   Probability   Range
a        .2            [0, 0.2)
e        .3            [0.2, 0.5)
i        .1            [0.5, 0.6)
o        .2            [0.6, 0.8)
u        .1            [0.8, 0.9)
!        .1            [0.9, 1.0)
After seeing:
(nothing)   [0, 1)
e           [0.2, 0.5)
a           [0.2, 0.26)
i           [0.23, 0.236)
i           [0.233, 0.2336)
!           [0.23354, 0.2336)
Figure 1 shows another representation of the encoding process. The vertical bars with ticks represent the symbol probabilities stipulated by the model. After the first symbol has been processed, the model is scaled into the range [0.2, 0.5), as shown in Figure 1a.
[FIGURE 1a. Representation of the encoding process: vertical number lines marked with the symbol ranges a, e, i, o, u, ! — "Initially" spanning [0, 1), "After seeing e" spanning [0.2, 0.5), and "After seeing a" spanning [0.2, 0.26).]
/* ARITHMETIC ENCODING ALGORITHM. */

/* Call encode_symbol repeatedly for each symbol in the message.    */
/* Ensure that a distinguished "terminator" symbol is encoded last, */
/* then transmit any value in the range [low, high).                */

encode_symbol(symbol, cum_freq)
    range = high - low
    high = low + range*cum_freq[symbol-1]
    low  = low + range*cum_freq[symbol]

/* ARITHMETIC DECODING ALGORITHM. */

/* "Value" is the number that has been received.  Continue calling  */
/* decode_symbol until the terminator symbol is returned.           */

decode_symbol(cum_freq)
    find symbol such that
        cum_freq[symbol] <= (value-low)/(high-low) < cum_freq[symbol-1]
    /* This ensures that value lies within the new [low, high)      */
    /* range that will be calculated by the following lines of code.*/
    range = high - low
    high = low + range*cum_freq[symbol-1]
    low  = low + range*cum_freq[symbol]
    return symbol

FIGURE 2. Pseudocode for the Encoding and Decoding Procedures
A PROGRAM FOR ARITHMETIC CODING
Figure 2 shows a pseudocode fragment that summarizes the encoding and decoding procedures developed in the last section. Symbols are numbered 1, 2, 3, .... The frequency range for the ith symbol is from cum_freq[i] to cum_freq[i-1]. As i decreases, cum_freq[i] increases, and cum_freq[0] = 1. (The reason for this backwards convention is that cum_freq[0] will later contain a normalizing factor, and it will be convenient to have it begin the array.) The current interval is [low, high), and for both encoding and decoding, this should be initialized to [0, 1).
Unfortunately, Figure 2 is overly simplistic. In
practice, there are several factors that complicate
both encoding and decoding:
Incremental transmission and reception. The encode
algorithm as described does not transmit anything
until the entire message has been encoded; neither
does the decode algorithm begin decoding until it
has received the complete transmission. In most
applications an incremental mode of operation is
necessary.
The desire to use integer arithmetic. The precision required to represent the [low, high) interval grows with the length of the message. Incremental operation will help overcome this, but the potential for overflow and underflow must still be addressed.
arithmetic_coding.h

 1  /* DECLARATIONS USED FOR ARITHMETIC ENCODING AND DECODING */
 2
 3  /* SIZE OF ARITHMETIC CODE VALUES. */
 4
 5  #define Code_value_bits 16              /* Number of bits in a code value   */
 6  typedef long code_value;                /* Type of an arithmetic code value */
 7
 8  #define Top_value (((long)1<<Code_value_bits)-1)    /* Largest code value   */
 9
10  /* HALF AND QUARTER POINTS IN THE CODE VALUE RANGE. */
11
12  #define First_qtr (Top_value/4+1)       /* Point after first quarter        */
13  #define Half      (2*First_qtr)         /* Point after first half           */
14  #define Third_qtr (3*First_qtr)         /* Point after third quarter        */

model.h

17  /* INTERFACE TO THE MODEL. */
18
19  /* THE SET OF SYMBOLS THAT MAY BE ENCODED. */
20
21  #define No_of_chars 256                 /* Number of character symbols      */
22  #define EOF_symbol (No_of_chars+1)      /* Index of EOF symbol              */
23  #define No_of_symbols (No_of_chars+1)   /* Total number of symbols          */
24
25  /* TRANSLATION TABLES BETWEEN CHARACTERS AND SYMBOL INDEXES. */
26
27  int char_to_index[No_of_chars];         /* To index from character          */
28  unsigned char index_to_char[No_of_symbols+1];  /* To character from index   */
29
30  /* CUMULATIVE FREQUENCY TABLE. */
31
32  #define Max_frequency 16383             /* Maximum allowed frequency        */
33                                          /* count: 2^14 - 1.                 */
34  int cum_freq[No_of_symbols+1];          /* Cumulative symbol frequencies    */
encode.c

39  /* THE ENCODING MAIN PROGRAM. */
40
41  #include <stdio.h>
42  #include "model.h"
43
44  main()
45  {   start_model();                          /* Set up other modules.    */
46      start_outputing_bits();
47      start_encoding();
48      for (;;) {                              /* Loop through characters. */
49          int ch; int symbol;
50          ch = getc(stdin);                   /* Read the next character. */
51          if (ch==EOF) break;                 /* Exit loop on end-of-file.*/
52          symbol = char_to_index[ch];         /* Translate to an index.   */
53          encode_symbol(symbol,cum_freq);     /* Encode that symbol.      */
54          update_model(symbol);               /* Update the model.        */
55      }
56      encode_symbol(EOF_symbol,cum_freq);     /* Encode the EOF symbol.   */
57      done_encoding();                        /* Send the last few bits.  */
58      done_outputing_bits();
59      exit(0);
60  }
arithmetic_encode.c

61  /* ARITHMETIC ENCODING ALGORITHM. */
62
63  #include "arithmetic_coding.h"
64
65  static void bit_plus_follow();      /* Routine that follows             */
66
67  /* CURRENT STATE OF THE ENCODING. */
68
69  static code_value low, high;        /* Ends of the current code region  */
70  static long bits_to_follow;         /* Number of opposite bits to       */
71                                      /* output after the next bit.       */
72
73  /* START ENCODING A STREAM OF SYMBOLS. */
74
75  start_encoding()
76  {   low = 0;                        /* Full code range.                 */
77      high = Top_value;
78      bits_to_follow = 0;             /* No bits to follow next.          */
79  }
80
81  /* ENCODE A SYMBOL. */
82
83  encode_symbol(symbol,cum_freq)
84      int symbol;                     /* Symbol to encode                 */
85      int cum_freq[];                 /* Cumulative symbol frequencies    */
86  {   long range;                     /* Size of the current code region  */
87
88      /* Narrow the code region to that allotted to this symbol. */
89
90      range = (long)(high-low)+1;
91      high = low +
92          (range*cum_freq[symbol-1])/cum_freq[0]-1;
93      low = low +
94          (range*cum_freq[symbol])/cum_freq[0];
 95      for (;;) {                                  /* Loop to output bits.     */
 96          if (high<Half) {
 97              bit_plus_follow(0);                 /* Output 0 if in low half. */
 98          }
 99          else if (low>=Half) {                   /* Output 1 if in high half.*/
100              bit_plus_follow(1);
101              low -= Half;                        /* Subtract offset to top.  */
102              high -= Half;
103          }
104          else if (low>=First_qtr                 /* Output an opposite bit   */
105              && high<Third_qtr) {                /* later if in middle half. */
106              bits_to_follow += 1;
107              low -= First_qtr;                   /* Subtract offset to middle*/
108              high -= First_qtr;
109          }
110          else break;                             /* Otherwise exit loop.     */
111          low = 2*low;                            /* Scale up code range.     */
112          high = 2*high+1;
113      }
114  }
115
116  /* FINISH ENCODING THE STREAM. */
117
118  done_encoding()
119  {   bits_to_follow += 1;                        /* Output two bits that     */
120      if (low<First_qtr) bit_plus_follow(0);      /* select the quarter that  */
121      else bit_plus_follow(1);                    /* the current code range   */
122  }                                               /* contains.                */
123
124  /* OUTPUT BITS PLUS FOLLOWING OPPOSITE BITS. */
125
126  static void bit_plus_follow(bit)
127      int bit;
128  {   output_bit(bit);                            /* Output the bit.          */
129      while (bits_to_follow>0) {
130          output_bit(!bit);                       /* Output bits_to_follow    */
131          bits_to_follow -= 1;                    /* opposite bits.  Set      */
132      }                                           /* bits_to_follow to zero.  */
133  }
decode.c

136  /* THE DECODING MAIN PROGRAM. */
137
138  #include <stdio.h>
139  #include "model.h"
140
141  main()
142  {   start_model();                          /* Set up other modules.    */
143      start_inputing_bits();
144      start_decoding();
145      for (;;) {                              /* Loop through characters. */
146          int ch; int symbol;
147          symbol = decode_symbol(cum_freq);   /* Decode next symbol.      */
148          if (symbol==EOF_symbol) break;      /* Exit loop on EOF symbol. */
149          ch = index_to_char[symbol];         /* Translate to a character.*/
150          putc(ch,stdout);                    /* Write that character.    */
151          update_model(symbol);               /* Update the model.        */
152      }
153      exit(0);
154  }
arithmetic_decode.c

155  /* ARITHMETIC DECODING ALGORITHM. */
156
157  #include "arithmetic_coding.h"
158
159  /* CURRENT STATE OF THE DECODING. */
160
161  static code_value value;            /* Currently-seen code value        */
162  static code_value low, high;        /* Ends of current code region      */
163
164  /* START DECODING A STREAM OF SYMBOLS. */
165
166  start_decoding()
167  {   int i;
168      value = 0;                                  /* Input bits to fill the */
169      for (i = 1; i<=Code_value_bits; i++) {      /* code value.            */
170          value = 2*value+input_bit();
171      }
172      low = 0;                                    /* Full code range.       */
173      high = Top_value;
174  }
175
176  /* DECODE THE NEXT SYMBOL. */
177
178  int decode_symbol(cum_freq)
179      int cum_freq[];                 /* Cumulative symbol frequencies    */
180  {   long range;                     /* Size of current code region      */
181      int cum;                        /* Cumulative frequency calculated  */
182      int symbol;                     /* Symbol decoded                   */
183
184      range = (long)(high-low)+1;
185      cum =                                       /* Find cum freq for value. */
186          (((long)(value-low)+1)*cum_freq[0]-1)/range;
187      for (symbol = 1;                            /* Then find symbol.        */
188           cum_freq[symbol]>cum;
189           symbol++) ;
190      high = low +                                /* Narrow the code region   */
191          (range*cum_freq[symbol-1])/cum_freq[0]-1;   /* to that allotted     */
192      low = low +                                 /* to this symbol.          */
193          (range*cum_freq[symbol])/cum_freq[0];
194      for (;;) {                                  /* Loop to get rid of bits. */
195          if (high<Half) {
196              /* nothing */                       /* Expand low half.         */
197          }
198          else if (low>=Half) {                   /* Expand high half.        */
199              value -= Half;
200              low -= Half;                        /* Subtract offset to top.  */
201              high -= Half;
202          }
203          else if (low>=First_qtr                 /* Expand middle half.      */
204              && high<Third_qtr) {
205              value -= First_qtr;                 /* Subtract offset to middle*/
206              low -= First_qtr;
207              high -= First_qtr;
208          }
209          else break;                             /* Otherwise exit loop.     */
210          low = 2*low;                            /* Scale up code range.     */
211          high = 2*high+1;
212          value = 2*value+input_bit();            /* Move in next input bit.  */
213      }
214      return symbol;
215  }
bit_input.c

216  /* BIT INPUT ROUTINES. */
217
218  #include <stdio.h>
219  #include "arithmetic_coding.h"
220
221  /* THE BIT BUFFER. */
222
223  static int buffer;              /* Bits waiting to be input         */
224  static int bits_to_go;          /* Number of bits still in buffer   */
225  static int garbage_bits;        /* Number of bits past end-of-file  */
226
227  /* INITIALIZE BIT INPUT. */
228
229  start_inputing_bits()
230  {   bits_to_go = 0;                     /* Buffer starts out with   */
231      garbage_bits = 0;                   /* no bits in it.           */
232  }
233
234  /* INPUT A BIT. */
235
236  int input_bit()
237  {   int t;
238      if (bits_to_go==0) {                /* Read the next byte if no */
239          buffer = getc(stdin);           /* bits are left in buffer. */
240          if (buffer==EOF) {
241              garbage_bits += 1;              /* Return arbitrary bits*/
242              if (garbage_bits>Code_value_bits-2) {   /* after eof,   */
243                  fprintf(stderr,"Bad input file\n"); /* but check    */
244                  exit(-1);                   /* for too many such.   */
245              }
246          }
247          bits_to_go = 8;
248      }
249      t = buffer&1;                       /* Return the next bit from */
250      buffer >>= 1;                       /* the bottom of the byte.  */
251      bits_to_go -= 1;
252      return t;
253  }
bit_output.c

257  /* BIT OUTPUT ROUTINES. */
258
259  #include <stdio.h>
260
261  /* THE BIT BUFFER. */
262
263  static int buffer;              /* Bits buffered for output         */
264  static int bits_to_go;          /* Number of bits free in buffer    */
265
266  /* INITIALIZE FOR BIT OUTPUT. */
267
268  start_outputing_bits()
269  {   buffer = 0;                         /* Buffer is empty to       */
270      bits_to_go = 8;                     /* start with.              */
271  }
272
273  /* OUTPUT A BIT. */
274
275  output_bit(bit)
276      int bit;
277  {   buffer >>= 1;                       /* Put bit in top of buffer.*/
278      if (bit) buffer |= 0x80;
279      bits_to_go -= 1;
280      if (bits_to_go==0) {                /* Output buffer if it is   */
281          putc(buffer,stdout);            /* now full.                */
282          bits_to_go = 8;
283      }
284  }
285
286  /* FLUSH OUT THE LAST BITS. */
287
288  done_outputing_bits()
289  {   putc(buffer>>bits_to_go,stdout);
290  }

FIGURE 3. C Implementation of Arithmetic Encoding and Decoding
fixed_model.c

/* THE FIXED SOURCE MODEL */

#include "model.h"

int freq[No_of_symbols+1] = {
    0,
    /* (Frequency counts for the 256 character symbols plus the EOF  */
    /* symbol, derived from English text; the individual values are  */
    /* not recoverable from this copy.)                              */
};

/* INITIALIZE THE MODEL. */

start_model()
{   int i;
    for (i = 0; i<No_of_chars; i++) {           /* Set up tables that       */
        char_to_index[i] = i+1;                 /* translate between symbol */
        index_to_char[i+1] = i;                 /* indexes and characters.  */
    }
    cum_freq[No_of_symbols] = 0;
    for (i = No_of_symbols; i>0; i--) {         /* Set up cumulative        */
        cum_freq[i-1] = cum_freq[i] + freq[i];  /* frequency counts.        */
    }
    if (cum_freq[0] > Max_frequency) abort();   /* Check counts within limit*/
}

/* UPDATE THE MODEL TO ACCOUNT FOR A NEW SYMBOL. */

update_model(symbol)
    int symbol;
{   /* Do nothing. */
}

FIGURE 4. Fixed and Adaptive Models for Use with Figure 3
adaptive_model.c

 1  /* THE ADAPTIVE SOURCE MODEL */
 2
 3  #include "model.h"
 4
 5  int freq[No_of_symbols+1];              /* Symbol frequencies           */
 6
 7  /* INITIALIZE THE MODEL. */
 8
 9  start_model()
10  {   int i;
11      for (i = 0; i<No_of_chars; i++) {       /* Set up tables that       */
12          char_to_index[i] = i+1;             /* translate between symbol */
13          index_to_char[i+1] = i;             /* indexes and characters.  */
14      }
15      for (i = 0; i<=No_of_symbols; i++) {    /* Set up initial frequency */
16          freq[i] = 1;                        /* counts to be one for all */
17          cum_freq[i] = No_of_symbols-i;      /* symbols.                 */
18      }
19      freq[0] = 0;                            /* Freq[0] must not be the  */
20  }                                           /* same as freq[1].         */
21
22  /* UPDATE THE MODEL TO ACCOUNT FOR A NEW SYMBOL. */
23
24  update_model(symbol)
25      int symbol;                             /* Index of new symbol      */
26  {   int i;                                  /* New index for symbol     */
27
28      if (cum_freq[0]==Max_frequency) {       /* See if frequency counts  */
29          int cum;                            /* are at their maximum.    */
30          cum = 0;
31          for (i = No_of_symbols; i>=0; i--) {    /* If so, halve all the */
32              freq[i] = (freq[i]+1)/2;        /* counts (keeping them     */
33              cum_freq[i] = cum;              /* positive).               */
34              cum += freq[i];
35          }
36      }
37      for (i = symbol; freq[i]==freq[i-1]; i--) ; /* Find symbol's new    */
38                                              /* index.                   */
39      if (i<symbol) {
40          int ch_i, ch_symbol;
41          ch_i = index_to_char[i];            /* Update the translation   */
42          ch_symbol = index_to_char[symbol];  /* tables if the symbol has */
43          index_to_char[i] = ch_symbol;       /* moved.                   */
44          index_to_char[symbol] = ch_i;
45          char_to_index[ch_i] = symbol;
46          char_to_index[ch_symbol] = i;
47      }
48      freq[i] += 1;                           /* Increment the frequency  */
49      while (i>0) {                           /* count for the symbol and */
50          i -= 1;                             /* update the cumulative    */
51          cum_freq[i] += 1;                   /* frequencies.             */
52      }
53  }

FIGURE 4. Fixed and Adaptive Models for Use with Figure 3 (continued)
for (;;) {
    if (high < Half) {
        output_bit(0);                  /* Output 0 if in low half.  */
        low = 2*low;
        high = 2*high+1;
    }
    else if (low >= Half) {             /* Output 1 if in high half. */
        output_bit(1);
        low = 2*(low-Half);
        high = 2*(high-Half)+1;
    }
    else break;
}

which ensures that, upon completion, low < Half <= high. This can be found in lines 95-113 of encode_symbol(), although there are some extra complications caused by underflow possibilities (see the next subsection). Care is taken to shift ones in at the bottom when high is scaled, as noted above.

Incremental reception is done using a number called value as in Figure 2, in which processed bits flow out the top (high-significance) end and newly received ones flow in at the bottom. Initially, start_decoding() (lines 168-176) fills value with received bits. Once decode_symbol() has identified the next input symbol, it shifts out now-useless high-order bits that are the same in low and high, shifting value by the same amount (and replacing lost bits by fresh input bits at the bottom end):

for (;;) {
    if (high < Half) {
        value = 2*value+input_bit();
        low = 2*low;
        high = 2*high+1;
    }
    else if (low >= Half) {
        value = 2*(value-Half)+input_bit();
        low = 2*(low-Half);
        high = 2*(high-Half)+1;
    }
    else break;
}

(see lines 194-213, again complicated by precautions against underflow, as discussed below).
Proof of Decoding Correctness
At this point it is worth checking that identification
of the next symbol by decode-symbol () works
properly. Recall from Figure 2 that decode-symbol ()
must use value to find the symbol that, when encoded, reduces the range to one that still includes
value. Lines 186-188 in decode-symbol0
identify the
symbol for which
cum_freq[symbol] <= floor( ((value - low + 1) * cum_freq[0] - 1) / (high - low + 1) ) < cum_freq[symbol-1],
low < First_qtr < Half <= high    (1a)

or

low < Half < Third_qtr <= high.   (1b)

[Figure: the code value range from 0 to Top_value, showing the points First_qtr, Half, and Third_qtr and the positions of low and high.]
is performed (p = 31 if long integers are used).
of Figure 4, frequency counts, which must be maintained anyway, are used to optimize access by keeping the array in frequency order, an effective kind of self-organizing linear search [9]. Update_model() first checks to see if the new model will exceed the cumulative-frequency limit, and if so scales all frequencies down by a factor of two (taking care to ensure that no count scales to zero) and recomputes cumulative values (Figure 4, lines 29-37). Then, if necessary, update_model() reorders the symbols to place the current one in its correct rank in the frequency ordering, altering the translation tables to reflect the change. Finally, it increments the appropriate frequency count and adjusts cumulative frequencies accordingly.
PERFORMANCE
Now consider the performance of the algorithm of
Figure 3, both in compression efficiency and execution time.
Compression Efficiency
In principle, when a message is coded using arithmetic coding, the number of bits in the encoded string is the same as the entropy of that message with respect to the model used for coding. Three factors cause performance to be worse than this in practice:
(1) message termination overhead;
(2) the use of fixed-length rather than infinite-precision arithmetic; and
(3) scaling of counts so that their total is at most Max_frequency.
None of these effects is significant, as we now show.
In order to isolate the effect of arithmetic coding, the
model will be considered to be exact (as defined
above).
Arithmetic coding must send extra bits at the
end of each message, causing a message termination overhead. Two bits are needed, sent by
done-encoding() (Figure 3, lines 119-123), in order to
disambiguate the final symbol. In cases where a bit
stream must be blocked into 8-bit characters before
encoding, it will be necessary to round out to the
end of a block. Combining these, an extra 9 bits may
be required.
The overhead of using fixed-length arithmetic occurs because remainders are truncated on division. It can be assessed by comparing the algorithm's performance with the figure obtained from a theoretical entropy calculation that derives its frequencies from counts scaled exactly as for coding. It is completely negligible: on the order of 10^-4 bits/symbol.
(1) The procedures input_bit(), output_bit(), and bit_plus_follow() were converted to macros to eliminate procedure-call overhead.
(2) Frequently used quantities were put in register variables.
(3) Multiplies by two were replaced by additions (C +=).
(4) Array indexing was replaced by pointer manipulation in the loops at line 189 in Figure 3 and lines 49-52 of the adaptive model in Figure 4.
This mildly optimized C implementation has an execution time of 214 μs/262 μs per input byte, for encoding/decoding 100,000 bytes of English text on a VAX-11/780, as shown in Table II. Also given are corresponding figures for the same program on an Apple Macintosh and a SUN-3/75. As can be seen, coding a C source program of the same length took slightly longer in all cases, and a binary object program longer still. The reason for this will be discussed shortly. Two artificial test files were included to allow readers to replicate the results. Alphabet consists of enough copies of the 26-letter alphabet to fill out 100,000 characters (ending with a partially completed alphabet). Skew statistics contains
TABLE II. Results for Encoding and Decoding 100,000-Byte Files (times in μs per input byte)

Mildly optimized C implementation, VAX-11/780:

File                  Output size (bytes)   Encode time   Decode time
Text file             57,718                214           262
C program             62,991                230           288
VAX object program    73,546                313           406
Alphabet              59,292                223           277
Skew statistics       12,092                143           170

[The corresponding Macintosh 512K and SUN-3/75 columns, and the assembly-language timings, are not recoverable from this copy.]
10,000 copies of the string aaaabaaaac; it demonstrates that files may be encoded into less than one bit per character (output size of 12,092 bytes = 96,736 bits). All results quoted used the adaptive model of Figure 4.

A further factor of two can be gained by reprogramming in assembly language. A carefully optimized version of Figures 3 and 4 (adaptive model) was written in both VAX and M68000 assembly languages. Full use was made of registers, and advantage taken of the 16-bit code value to expedite some crucial comparisons and make subtractions of Half trivial. The performance of these implementations on the test files is also shown in Table II in order to give the reader some idea of typical execution speeds.

The VAX-11/780 assembly-language timings are broken down in Table III. These figures were obtained with the UNIX profile facility and are accurate only to within perhaps 10 percent. (This mechanism constructs a histogram of program-counter values at real-time clock interrupts and suffers from statistical variation as well as some systematic errors.) Bounds calculation refers to the initial parts of encode_symbol() and decode_symbol() (Figure 3, lines 90-94 and 190-193), which contain multiply and divide operations. Bit shifting is the major loop in both the encode and decode routines (lines 96-113 and 194-213). The cum calculation in decode_symbol(), which requires a multiply/divide, and the following loop to identify the next symbol (lines 187-189), is Symbol decode. Finally, Model
TABLE III. Breakdown of Timings (μs per input byte) for the VAX-11/780 Assembly-Language Implementation

                       Encode time   Decode time
Text file
  Bounds calculation   32            31
  Bit shifting         39            30
  Model update         29            29
  Symbol decode        --            45
  Other                 4             0
  Total               104           135
C program
  Bounds calculation   30            28
  Bit shifting         42            35
  Model update         33            36
  Symbol decode        --            51
  Other                 4             1
  Total               109           151
VAX object program
  Bounds calculation   34            31
  Bit shifting         46            40
  Model update         75            75
  Symbol decode        --            94
  Other                 3             1
  Total               158           241
better. The extra time for decoding over encoding is
due entirely to the symbol decode step. This takes
longer in the C program and object file tests because
the loop of line 189 was executed more often (on
average 9 times, 13 times, and 35 times, respectively). This also affects the model update time because it is the number of cumulative counts that
must be incremented in Figure 4, lines 49-52. In the
worst case, when the symbol frequencies are uniformly distributed, these loops are executed an average of 128 times. Worst-case performance would be
improved by using a more complex tree representation for frequencies, but this would likely be slower
for text files.
SOME APPLICATIONS
Applications of arithmetic coding are legion. By liberating coding with respect to a model from the modeling required for prediction, it encourages a whole
new view of data compression [19]. This separation
of function costs nothing in compression performance, since arithmetic coding is (practically) optimal with respect to the entropy of the model. Here
we intend to do no more than suggest the scope of
this view by briefly considering
(1) adaptive text compression,
(2) nonadaptive coding,
(3) compressing black/white images, and
(4) coding arbitrarily distributed integers.
Of course, as noted earlier, greater coding efficiencies could easily be achieved with more sophisticated models. Modeling, however, is an extensive
topic in its own right and is beyond the scope of this
article.
Adaptive text compression using single-character adaptive frequencies shows off arithmetic coding to good effect. The results obtained using the program in Figures 3 and 4 vary from 4.8 to 5.3 bits/character.
Comparison of arithmetic coding with adaptive Huffman coding:

                      Arithmetic coding                 Adaptive Huffman coding
File                  Output size  Encode   Decode      Output size  Encode   Decode
                      (bytes)      (μs)     (μs)        (bytes)      (μs)     (μs)
Text file             57,718       214      262         57,781       550      414
C program             62,991       230      288         63,731       596      441
VAX object program    73,546       313      406         76,950       822      606
Alphabet              59,292       223      277         60,127       598      411
Skew statistics       12,092       143      170         16,257       215      132

Notes: The mildly optimized C implementation was used for arithmetic coding. UNIX compact was used for adaptive Huffman coding. Times are for a VAX-11/780 and exclude I/O and operating-system overhead in support of I/O.
c[s] <= floor( ((v - l + 1) × c[0] - 1) / (h - l + 1) );

in other words,

(v - l + 1) × c[0] - 1
c[s] <= ---------------------- - ε <= c[s - 1] - 1,        (1)
                  r

where r = h - l + 1 and 0 <= ε <= (r - 1)/r.

[The remaining steps of the derivation, which use Eq. (1) to show that the new interval computed from c[s] and c[s-1] still contains v, are not recoverable here.]
REFERENCES
1. Abramson, N. Information Theory and Coding. McGraw-Hill, New York, 1963. This textbook contains the first reference to what was to become the method of arithmetic coding (pp. 61-62).
2. Bentley, J.L., Sleator, D.D., Tarjan, R.E., and Wei, V.K. A locally adaptive data compression scheme. Commun. ACM 29, 4 (Apr. 1986), 320-330. Shows how recency effects can be incorporated explicitly into a text compression system.
19. Rissanen, J., and Langdon, G.G. Universal modeling and coding. IEEE Trans. Inf. Theory IT-27, 1 (Jan. 1981), 12-23. Shows how data compression can be separated into modeling for prediction and coding with respect to a model.
20. Rubin, F. Arithmetic stream coding using fixed precision registers. IEEE Trans. Inf. Theory IT-25, 6 (Nov. 1979), 672-675. One of the first papers to present all the essential elements of practical arithmetic coding, including fixed-point computation and incremental operation.
21. Shannon, C.E., and Weaver, W. The Mathematical Theory of Communication. University of Illinois Press, Urbana, Ill., 1949. A classic book that develops communication theory from the ground up.
22. Welch, T.A. A technique for high-performance data compression. Computer 17, 6 (June 1984), 8-19. A very fast coding technique based on the method of [23], but whose compression performance is poor by the standards of [4] and [6]. An improved implementation of this method is widely used in UNIX systems under the name compress.
23. Ziv, J., and Lempel, A. Compression of individual sequences via variable-rate coding. IEEE Trans. Inf. Theory IT-24, 5 (Sept. 1978), 530-536. Describes a method of text compression that works by replacing a substring with a pointer to an earlier occurrence of the same substring. Although it performs quite well, it does not provide a clear separation between modeling and coding.
CR Categories and Subject Descriptors: E.4 [Data]: Coding and Information Theory - data compaction and compression; H.1.1 [Models and Principles]: Systems and Information Theory - information theory
General Terms: Algorithms, Performance
Additional Key Words and Phrases: Adaptive modeling, arithmetic coding, Huffman coding
Received 8/86; accepted 12/86

Authors' Present Address: Ian H. Witten, Radford M. Neal, and John G. Cleary, Dept. of Computer Science, The University of Calgary, 2500 University Drive NW, Calgary, Canada T2N 1N4.
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.