
Arithmetic coding

From Wikipedia, the free encyclopedia

Arithmetic coding is a form of variable-length entropy encoding used in lossless data compression. Normally, a string of characters such as the words "hello there" is represented using a fixed number of bits per character, as in the ASCII code. When a string is converted to arithmetic encoding, frequently used characters will be stored with fewer bits and not-so-frequently occurring characters will be stored with more bits, resulting in fewer bits used in total. Arithmetic coding differs from other forms of entropy encoding such as Huffman coding in that rather than separating the input into component symbols and replacing each with a code, arithmetic coding encodes the entire message into a single number, a fraction n where (0.0 ≤ n < 1.0).
Contents

1 Implementation details and examples
1.1 Equal probabilities
1.2 Defining a model
1.3 Encoding and decoding: overview
1.4 Encoding and decoding: example
1.5 Sources of inefficiency
2 Adaptive arithmetic coding
3 Precision and renormalization
4 Arithmetic coding as a generalized change of radix
4.1 Theoretical limit of compressed message
5 Connections with other compression methods
5.1 Huffman coding
5.2 Range encoding
6 US patents
7 Benchmarks and other technical characteristics
8 Teaching aid
9 See also
10 References
11 External links

Implementation details and examples

Equal probabilities

In the simplest case, the probability of each symbol occurring is equal. For example, consider a sequence taken from a set of three symbols, A, B, and C, each equally likely to occur. Simple block encoding would use 2 bits per symbol, which is wasteful: one of the bit variations is never used. A more efficient solution is to represent the sequence as a rational number between 0 and 1 in base 3, where each digit represents a symbol. For example, the sequence "ABBCAB" could become 0.011201 in base 3. The next step is to encode this ternary number using a fixed-point binary number of sufficient precision to recover it, such as 0.001011001 in binary – this is only 9 bits, 25% smaller than the naive block encoding. This is feasible for long sequences because there are efficient, in-place algorithms for converting the base of arbitrarily precise numbers. To decode the value, knowing the original string had length 6, one can simply convert back to base 3, round to 6 digits, and recover the string.
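The conversion just described can be written out directly. Below is a minimal Python sketch (the helper names are illustrative, not from the article): it encodes "ABBCAB" as a base-3 fraction, stores a 9-bit binary approximation of it, and recovers the string by converting back to base 3 and rounding to 6 ternary digits.

```python
# Equal-probability case: 3 symbols, each treated as one ternary digit.
SYMBOLS = "ABC"

def to_ternary_fraction(s):
    """Interpret each symbol as one base-3 digit after the radix point."""
    value = 0.0
    for i, ch in enumerate(s, start=1):
        value += SYMBOLS.index(ch) / 3**i
    return value

def decode(fraction, length):
    """Convert back to base 3, rounding to `length` ternary digits."""
    n = round(fraction * 3**length)        # nearest length-digit ternary integer
    digits = []
    for _ in range(length):
        n, d = divmod(n, 3)
        digits.append(SYMBOLS[d])
    return "".join(reversed(digits))

message = "ABBCAB"
exact = to_ternary_fraction(message)       # 0.011201 in base 3, ~0.1742
bits = 9
approx = round(exact * 2**bits) / 2**bits  # 9-bit binary approximation, 0.001011001
print(decode(approx, len(message)))        # ABBCAB
```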

Defining a model

In general, arithmetic coders can produce near-optimal output for any given set of symbols and probabilities (the optimal value is −log2 P bits for each symbol of probability P; see source coding theorem). Compression algorithms that use arithmetic coding start by determining a model of the data – basically a prediction of what patterns will be found in the symbols of the message. The more accurate this prediction is, the closer to optimal the output will be. Example: a simple, static model for describing the output of a particular monitoring instrument over time might be:

60% chance of symbol NEUTRAL
20% chance of symbol POSITIVE
10% chance of symbol NEGATIVE
10% chance of symbol END-OF-DATA. (The presence of this symbol means that the stream will be 'internally terminated', as is fairly common in data compression; when this symbol appears in the data stream, the decoder will know that the entire stream has been decoded.)

Models can also handle alphabets other than the simple four-symbol set chosen for this example. More sophisticated models are also possible: higher-order modelling changes its estimation of the current probability of a symbol based on the symbols that precede it (the context), so that in a model for English text, for example, the percentage chance of "u" would be much higher when it follows a "Q" or a "q". Models can even be adaptive, so that they continuously change their prediction of the data based on what the stream actually contains. The decoder must have the same model as the encoder.

Encoding and decoding: overview

In general, each step of the encoding process, except for the very last, is the same; the encoder has basically just three pieces of data to consider:

The next symbol that needs to be encoded
The current interval (at the very start of the encoding process, the interval is set to [0, 1), but that will change)

The probabilities the model assigns to each of the various symbols that are possible at this stage (as mentioned earlier, higher-order or adaptive models mean that these probabilities are not necessarily the same in each step)

The encoder divides the current interval into sub-intervals, each representing a fraction of the current interval proportional to the probability of that symbol in the current context. Whichever interval corresponds to the actual symbol that is next to be encoded becomes the interval used in the next step. Example: for the four-symbol model above:

the interval for NEUTRAL would be [0, 0.6)
the interval for POSITIVE would be [0.6, 0.8)
the interval for NEGATIVE would be [0.8, 0.9)
the interval for END-OF-DATA would be [0.9, 1).

When all symbols have been encoded, the resulting interval unambiguously identifies the sequence of symbols that produced it. Anyone who has the same final interval and model that is being used can reconstruct the symbol sequence that must have entered the encoder to result in that final interval. It is not necessary to transmit the final interval, however; it is only necessary to transmit one fraction that lies within that interval. In particular, it is only necessary to transmit enough digits (in whatever base) of the fraction so that all fractions that begin with those digits fall into the final interval.
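As an illustration of the interval-narrowing loop described above, here is a short sketch (not part of the article; the model layout and helper name are illustrative) that encodes a message under the static four-symbol model and prints the final interval.

```python
from fractions import Fraction

MODEL = [                     # (symbol, probability), in a fixed order known to both sides
    ("NEUTRAL",     Fraction(6, 10)),
    ("POSITIVE",    Fraction(2, 10)),
    ("NEGATIVE",    Fraction(1, 10)),
    ("END-OF-DATA", Fraction(1, 10)),
]

def encode_interval(message):
    """Return the final [low, high) interval after encoding all symbols."""
    low, high = Fraction(0), Fraction(1)
    for symbol in message:
        width = high - low
        cursor = low
        for name, prob in MODEL:
            sub_low, sub_high = cursor, cursor + width * prob
            if name == symbol:              # keep the sub-interval of the actual symbol
                low, high = sub_low, sub_high
                break
            cursor = sub_high
    return low, high

low, high = encode_interval(["NEUTRAL", "NEGATIVE", "END-OF-DATA"])
print(float(low), float(high))   # 0.534 0.54 -- any fraction in here (e.g. 0.538) identifies the message
```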

Encoding and decoding: example

A diagram showing decoding of 0.538 (the circular point) in the example model. The region is divided into subregions proportional to symbol frequencies, then the subregion containing the point is successively subdivided in the same way.

"onsider the process for decoding a message encoded with the given four-symbol model. 0he message is encoded in the fraction '.3/H &using decimal for clarity, instead of binaryD also assuming that there are only as many digits as needed to decode the message.+ 0he process starts with the same interval used by the encoder. F',*+, and using the same model, dividing it into the same four sub-intervals that the encoder must have. 0he fraction '.3/H falls into the sub-interval for N9:0;A<, F', '.6+D this indicates that the first symbol the encoder read must have been N9:0;A<, so this is the first symbol of the message. Ne t divide the interval F', '.6+ into sub-intervals.

the interval for NEUTRAL would be [0, 0.36) – 60% of [0, 0.6)
the interval for POSITIVE would be [0.36, 0.48) – 20% of [0, 0.6)
the interval for NEGATIVE would be [0.48, 0.54) – 10% of [0, 0.6)
the interval for END-OF-DATA would be [0.54, 0.6) – 10% of [0, 0.6).

Since .538 is within the interval [0.48, 0.54), the second symbol of the message must have been NEGATIVE. Again divide our current interval into sub-intervals:

the interval for NEUTRAL would be [0.48, 0.516)
the interval for POSITIVE would be [0.516, 0.528)
the interval for NEGATIVE would be [0.528, 0.534)
the interval for END-OF-DATA would be [0.534, 0.540).

Now .538 falls within the interval of the END-OF-DATA symbol; therefore, this must be the next symbol. Since it is also the internal termination symbol, it means the decoding is complete. If the stream is not internally terminated, there needs to be some other way to indicate where the stream stops. Otherwise, the decoding process could continue forever, mistakenly reading more symbols from the fraction than were in fact encoded into it.
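The decoding walk just performed by hand can be sketched the same way; the model and the fraction 0.538 come from the example above, while the function name is illustrative.

```python
from fractions import Fraction

MODEL = [
    ("NEUTRAL",     Fraction(6, 10)),
    ("POSITIVE",    Fraction(2, 10)),
    ("NEGATIVE",    Fraction(1, 10)),
    ("END-OF-DATA", Fraction(1, 10)),
]

def decode(fraction):
    """Repeatedly find the sub-interval containing `fraction` until END-OF-DATA."""
    low, high = Fraction(0), Fraction(1)
    out = []
    while True:
        width = high - low
        cursor = low
        for name, prob in MODEL:
            sub_low, sub_high = cursor, cursor + width * prob
            if sub_low <= fraction < sub_high:
                out.append(name)
                low, high = sub_low, sub_high
                break
            cursor = sub_high
        if out[-1] == "END-OF-DATA":     # internal termination symbol
            return out

print(decode(Fraction(538, 1000)))       # ['NEUTRAL', 'NEGATIVE', 'END-OF-DATA']
```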

Sources of inefficiency

The message 0.538 in the previous example could have been encoded by the equally short fractions 0.534, 0.535, 0.536, 0.537 or 0.539. This suggests that the use of decimal instead of binary introduced some inefficiency. This is correct; the information content of a three-digit decimal is approximately 9.966 bits; the same message could have been encoded in the binary fraction 0.10001010 (equivalent to 0.5390625 decimal) at a cost of only 8 bits. (The final zero must be specified in the binary fraction, or else the message would be ambiguous without external information such as compressed stream size.)

This 8-bit output is larger than the information content, or entropy, of the message, which is 1.57 × 3 or 4.71 bits. The large difference between the example's 8 (or 7 with external compressed data size information) bits of output and the entropy of 4.71 bits is caused by the short example message not being able to exercise the coder effectively. The claimed symbol probabilities were [0.6, 0.2, 0.1, 0.1], but the actual frequencies in this example are [0.33, 0, 0.33, 0.33]. If the intervals are readjusted for these frequencies, the entropy of the message would be 1.58 bits and the same NEUTRAL NEGATIVE END-OF-DATA message could be encoded as intervals [0, 1/3); [1/9, 2/9); [5/27, 6/27); and a binary interval of [0.001011110, 0.001110001). This could yield an output message of 111, or just 3 bits. This is also an example of how statistical coding methods like arithmetic encoding can produce an output message that is larger than the input message, especially if the probability model is off.
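The figures quoted above can be checked with a few lines (an illustrative calculation, not part of the original article), using the claimed model probabilities [0.6, 0.2, 0.1, 0.1]:

```python
from math import log2

probs = {"NEUTRAL": 0.6, "POSITIVE": 0.2, "NEGATIVE": 0.1, "END-OF-DATA": 0.1}

entropy_per_symbol = sum(-p * log2(p) for p in probs.values())
print(round(entropy_per_symbol, 2))      # ~1.57 bits per symbol under the claimed model
print(round(3 * entropy_per_symbol, 2))  # ~4.71 bits for the three-symbol message
print(round(3 * log2(10), 3))            # ~9.966 bits carried by a three-digit decimal
```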

Adaptive arithmetic coding

One advantage of arithmetic coding over other similar methods of data compression is the convenience of adaptation. Adaptation is the changing of the frequency (or probability) tables while processing the data. The decoded data matches the original data as long as the frequency table in decoding is replaced in the same way and at the same step as in encoding. The synchronization is, usually, based on a combination of symbols occurring during the encoding and decoding process. Adaptive arithmetic coding significantly improves the compression ratio compared to static methods; it may be as much as two to three times as effective.
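A minimal sketch of adaptation, assuming a simple count-based model (class and method names are illustrative): the frequency table is updated after every symbol, in exactly the same way and at the same step on the encoding and decoding sides, so the two stay synchronized.

```python
from fractions import Fraction

class AdaptiveModel:
    def __init__(self, alphabet):
        self.counts = {s: 1 for s in alphabet}   # start from uniform counts

    def intervals(self):
        """Current [low, high) interval of each symbol, derived from the counts."""
        total = sum(self.counts.values())
        table, low = {}, Fraction(0)
        for s, c in self.counts.items():
            table[s] = (low, low + Fraction(c, total))
            low += Fraction(c, total)
        return table

    def update(self, symbol):
        """Must be called at the same step during encoding and decoding."""
        self.counts[symbol] += 1

model = AdaptiveModel("AB")
for s in "ABBB":                                 # the intervals drift toward the observed data
    print(s, {k: (str(v[0]), str(v[1])) for k, v in model.intervals().items()})
    model.update(s)
```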

Precision and renormalization

The above explanations of arithmetic coding contain some simplification. In particular, they are written as if the encoder first calculated the fractions representing the endpoints of the interval in full, using infinite precision, and only converted the fraction to its final form at the end of encoding. Rather than try to simulate infinite precision, most arithmetic coders instead operate at a fixed limit of precision which they know the decoder will be able to match, and round the calculated fractions to their nearest equivalents at that precision. An example shows how this would work if the model called for the interval [0, 1) to be divided into thirds, and this was approximated with 8-bit precision. Note that since now the precision is known, so are the binary ranges we'll be able to use.

Symbol | Probability (expressed as fraction) | Interval reduced to eight-bit precision (as fractions) | Interval reduced to eight-bit precision (in binary) | Range in binary
A | 1/3 | [0, 85/256) | [0.00000000, 0.01010101) | 00000000 – 01010100
B | 1/3 | [85/256, 171/256) | [0.01010101, 0.10101011) | 01010101 – 10101010
C | 1/3 | [171/256, 1) | [0.10101011, 1.00000000) | 10101011 – 11111111

A process called renormalization keeps the finite precision from becoming a limit on the total number of symbols that can be encoded. Whenever the range is reduced to the point where all values in the range share certain beginning digits, those digits are sent to the output. For however many digits of precision the computer can handle, it is now handling fewer than that, so the existing digits are shifted left, and at the right, new digits are added to expand the range as widely as possible. Note that this result occurs in two of the three cases from our previous example.

Symbol | Probability | Range | Digits that can be sent to output | Range after renormalization
A | 1/3 | 00000000 – 01010100 | 0 | 00000000 – 10101001
B | 1/3 | 01010101 – 10101010 | None | 01010101 – 10101010
C | 1/3 | 10101011 – 11111111 | 1 | 01010110 – 11111111
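The renormalization step shown in the table above can be sketched with 8-bit integer arithmetic (an illustrative version, not the article's own code): whenever the low and high ends of the range share their leading bit, that bit is emitted and both ends are shifted left, opening the range up again.

```python
PRECISION = 8
TOP = 1 << PRECISION                  # 256

def renormalize(low, high, output):
    """low/high are integers in [0, 256); emit shared leading bits and widen the range."""
    while (low >> (PRECISION - 1)) == (high >> (PRECISION - 1)):
        output.append(low >> (PRECISION - 1))     # the shared leading bit
        low  = (low  << 1) & (TOP - 1)            # shift, new 0 on the right
        high = ((high << 1) & (TOP - 1)) | 1      # shift, new 1 on the right
    return low, high

bits = []
print(renormalize(0b00000000, 0b01010100, bits), bits)  # symbol A: (0, 169) [0] -> 00000000 - 10101001
bits = []
print(renormalize(0b10101011, 0b11111111, bits), bits)  # symbol C: (86, 255) [1] -> 01010110 - 11111111
```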

Arithmetic coding as a generalized change of radix

Recall that in the case where the symbols had equal probabilities, arithmetic coding could be implemented by a simple change of base, or radix. In general, arithmetic (and range) coding may be interpreted as a generalized change of radix. For example, we may look at any sequence of symbols:

DABDDB

as a number in a certain base, presuming that the involved symbols form an ordered set and each symbol in the ordered set denotes a sequential integer A = 0, B = 1, C = 2, D = 3, and so on. This results in the following frequencies and cumulative frequencies:

Symbol | Frequency of occurrence | Cumulative frequency
A | 1 | 0
B | 2 | 1
D | 3 | 3

The cumulative frequency is the total of all frequencies below it in a frequency distribution (a running total of frequencies). In a positional numeral system the radix, or base, is numerically equal to the number of different symbols used to express the number. For example, in the decimal system the number of symbols is 10, namely 0, 1, 2, 3, 4, 5, 6, 7, 8, 9. The radix is used to express any finite integer in a presumed multiplier in polynomial form. For example, the number 457 is actually 4×10^2 + 5×10^1 + 7×10^0, where base 10 is presumed but not shown explicitly. Initially, we will convert DABDDB into a base-6 numeral, because 6 is the length of the string. The string is first mapped into the digit string 301331, which then maps to an integer by the polynomial:

6^5 × 3 + 6^4 × 0 + 6^3 × 1 + 6^2 × 3 + 6^1 × 3 + 6^0 × 1 = 23671

The result 23671 has a length of 15 bits, which is not very close to the theoretical limit (the entropy of the message), which is approximately 9 bits.

To encode a message with a length closer to the theoretical limit imposed by information theory we need to slightly generalize the classic formula for changing the radix. We will compute lower and upper bounds L and U and choose a number between them. For the computation of L we multiply each term in the above expression by the product of the frequencies of all previously occurred symbols:

L = 6^5 × 3 + 3 × (6^4 × 0) + (3 × 1) × (6^3 × 1) + (3 × 1 × 2) × (6^2 × 3) + (3 × 1 × 2 × 3) × (6^1 × 3) + (3 × 1 × 2 × 3 × 3) × (6^0 × 1) = 25002

The difference between this polynomial and the polynomial above is that each term is multiplied by the product of the frequencies of all previously occurring symbols. More generally, L may be computed as:

L = Σ_{i=1..n} [ n^(n−i) × C_i × Π_{k=1..i−1} f_k ]

where C_i are the cumulative frequencies and f_k are the frequencies of occurrences. Indexes denote the position of the symbol in the message. In the special case where all frequencies are 1, this is the change-of-base formula.

The upper bound U will be L plus the product of all frequencies; in this case U = L + (3 × 1 × 2 × 3 × 3 × 2) = 25002 + 108 = 25110. In general, U is given by:

U = L + Π_{k=1..n} f_k
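The bounds for the DABDDB example can be reproduced with a short sketch (helper names are illustrative; the frequencies and cumulative frequencies are those of the table above):

```python
freq    = {"A": 1, "B": 2, "D": 3}
cumfreq = {"A": 0, "B": 1, "D": 3}

def bounds(message):
    """Lower bound L and upper bound U of the generalized change of radix."""
    n = len(message)
    low, prod = 0, 1                       # prod = product of frequencies seen so far
    for i, s in enumerate(message):
        low  += prod * n**(n - 1 - i) * cumfreq[s]
        prod *= freq[s]
    return low, low + prod                 # U = L + product of all frequencies

L, U = bounds("DABDDB")
print(L, U)                                # 25002 25110 -- pick e.g. 25100 from [L, U)
```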

Now we can choose any number from the interval [L, U) to represent the message; one convenient choice is the value with the longest possible trail of zeroes, 25100, since it allows us to achieve compression by representing the result as 251×10^2. The zeroes can also be truncated, giving 251, if the length of the message is stored separately. Longer messages will tend to have longer trails of zeroes. To decode the integer 25100, the polynomial computation can be reversed as shown in the table below. At each stage the current symbol is identified, then the corresponding term is subtracted from the result.

Remainder | Identification | Identified symbol | Corrected remainder
25100 | 25100 / 6^5 = 3 | D | (25100 − 6^5 × 3) / 3 = 590
590 | 590 / 6^4 = 0 | A | (590 − 6^4 × 0) / 1 = 590
590 | 590 / 6^3 = 2 | B | (590 − 6^3 × 1) / 2 = 187
187 | 187 / 6^2 = 5 | D | (187 − 6^2 × 3) / 3 = 26
26 | 26 / 6^1 = 4 | D | (26 − 6^1 × 3) / 3 = 2
2 | 2 / 6^0 = 2 | B | –

During decoding we take the floor after dividing by the corresponding power of 6. The result is then matched against the cumulative intervals, and the appropriate symbol is selected from the look-up table. When the symbol is identified the result is corrected. The process is continued for the known length of the message or while the remaining result is positive. The only difference compared to the classical change of base is that there may be a range of values associated with each symbol. In this example, A is always 0, B is either 1 or 2, and D is any of 3, 4, 5. This is in exact accordance with our intervals, which are determined by the frequencies. When all intervals are equal to 1 we have a special case of the classic base change.
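The same decoding loop, written as a sketch (names are illustrative): each division result is floored and matched against the cumulative intervals A = [0, 1), B = [1, 3), D = [3, 6).

```python
freq    = {"A": 1, "B": 2, "D": 3}
cumfreq = {"A": 0, "B": 1, "D": 3}

def lookup(value):
    """Map a floored digit to the symbol whose cumulative interval contains it."""
    for s in ("D", "B", "A"):              # check the largest thresholds first
        if value >= cumfreq[s]:
            return s

def decode(number, length):
    out = []
    for i in range(length):
        power = length ** (length - 1 - i) # 6^5, 6^4, ..., 6^0 for a length-6 message
        s = lookup(number // power)
        out.append(s)
        number = (number - power * cumfreq[s]) // freq[s]   # corrected remainder
    return "".join(out)

print(decode(25100, 6))                    # DABDDB
```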

Theoretical limit of compressed message


The lower bound L never exceeds n^n, where n is the size of the message, and so can be represented in log2(n^n) = n log2(n) bits. After the computation of the upper bound U and the reduction of the message by selecting a number from the interval [L, U) with the longest trail of zeros, we can presume that this length can be reduced by log2(Π_{k=1..n} f_k) bits. Since each frequency in the product occurs exactly the same number of times as the value of this frequency, we can use the size of the alphabet A to compute the product:

Π_{k=1..n} f_k = Π_{i=1..A} f_i^(f_i)

Applying log2 to the estimated number of bits in the message, the final message (not counting a logarithmic overhead for the message length and frequency tables) will match the number of bits given by entropy, which for long messages is very close to optimal.
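As an illustrative numeric check (not from the article), the DABDDB example gives the same bit count either way: n log2(n) bits for L, minus log2 of the product of frequencies, equals the entropy of the message.

```python
from math import log2

freq = {"A": 1, "B": 2, "D": 3}            # alphabet frequencies in DABDDB
n = sum(freq.values())                     # message length, 6

bits_for_L = n * log2(n)                   # ~15.51 bits suffice, since L <= n**n
product    = 3 * 1 * 2 * 3 * 3 * 2         # product of frequencies along the message
same_thing = 1**1 * 2**2 * 3**3            # the per-alphabet form of the same product
saved_bits = log2(product)                 # ~6.75 bits recovered via the trail of zeros

entropy_bits = sum(-f * log2(f / n) for f in freq.values())          # ~8.75

print(product == same_thing)                                         # True
print(round(bits_for_L - saved_bits, 2), round(entropy_bits, 2))     # 8.75 8.75
```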

Connections with other compression methods

Huffman coding

Main article: Huffman coding

There is great similarity between arithmetic coding and Huffman coding – in fact, it has been shown that Huffman coding is just a specialized case of arithmetic coding – but because arithmetic coding translates the entire message into one number represented in base b, rather than translating each symbol of the message into a series of digits in base b, it will sometimes approach optimal entropy encoding much more closely than Huffman can. In fact, a Huffman code corresponds closely to an arithmetic code where each of the frequencies is rounded to a nearby power of 1/2; for this reason Huffman deals relatively poorly with distributions where symbols have frequencies far from a power of 1/2, such as 0.75 or 0.375. This includes most distributions where there are either a small number of symbols (such as just the bits 0 and 1) or where one or two symbols dominate the rest. For an alphabet {a, b, c} with equal probabilities of 1/3, Huffman coding may produce the following code:

a → 0: 50%
b → 10: 25%
c → 11: 25%

This code has an expected (2 + 2 + 1)/3 ≈ 1.667 bits per symbol for Huffman coding, an inefficiency of 5 percent compared to log2 3 ≈ 1.585 bits per symbol for arithmetic coding.

For an alphabet {0, 1} with probabilities 0.625 and 0.375, Huffman encoding treats them as though they had 0.5 probability each, assigning 1 bit to each value, which does not achieve any compression over naive block encoding. Arithmetic coding approaches the optimal compression ratio of 1 − (−0.625 log2 0.625 − 0.375 log2 0.375) ≈ 4.6%.

When the symbol 0 has a high probability of 0.95, the difference is much greater: 1 − (−0.95 log2 0.95 − 0.05 log2 0.05) ≈ 71.4%.

One simple way to address this weakness is to concatenate symbols to form a new alphabet in which each symbol represents a sequence of symbols in the original alphabet. In the above example, grouping sequences of three symbols before encoding would produce new "super-symbols" with the following frequencies:

000: 85.7%
001, 010, 100: 4.5% each
011, 101, 110: 0.24% each
111: 0.0125%

With this grouping, Huffman coding averages 1.3 bits for every three symbols, or 0.433 bits per symbol, compared with one bit per symbol in the original encoding.
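An illustrative check of the grouping argument (not from the article): build a Huffman code over three-symbol blocks of a 0.95/0.05 source and compare the expected code length per original symbol with the single-symbol entropy.

```python
import heapq
from itertools import product
from math import log2

p = {"0": 0.95, "1": 0.05}
blocks = {"".join(b): p[b[0]] * p[b[1]] * p[b[2]] for b in product("01", repeat=3)}

def huffman_lengths(weights):
    """Code length per symbol of a Huffman code for the given weights."""
    heap = [(w, [s]) for s, w in weights.items()]
    heapq.heapify(heap)
    lengths = {s: 0 for s in weights}
    while len(heap) > 1:
        w1, s1 = heapq.heappop(heap)
        w2, s2 = heapq.heappop(heap)
        for s in s1 + s2:
            lengths[s] += 1                # every merge adds one bit to these symbols
        heapq.heappush(heap, (w1 + w2, s1 + s2))
    return lengths

lengths = huffman_lengths(blocks)
avg = sum(blocks[s] * lengths[s] for s in blocks)
print(round(avg, 2), round(avg / 3, 3))                  # ~1.3 bits per block, ~0.433 per symbol
print(round(-0.95 * log2(0.95) - 0.05 * log2(0.05), 3))  # ~0.286 bits of entropy per symbol
```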

Range encoding

Main article: Range encoding

Range encoding is regarded by one group of engineers as a different technique and by another group only as a different name for arithmetic coding.[who?][verification needed] There is no unique opinion, but some people believe that,[weasel words] when processing is applied as one step per symbol, it is range coding, and when one step is required per every bit, it is arithmetic coding. In another opinion,[who?] arithmetic coding is the computing of two boundaries on the interval [0, 1) and choosing the shortest fraction from it, while range encoding is computing boundaries on the interval and choosing the number with the longest trail of zeros from within it. Many researchers believe[who?] that this slight difference in approach makes range encoding patent free. To back up this idea they provide a reference to the article by G. Nigel N. Martin, which is not reader friendly and is subject to interpretation. It is cited in the Glen Langdon article "An Introduction to Arithmetic Coding", IBM J. Res. Develop., Vol. 28, No. 2, March 1984, which makes the method suggested by Martin prior art recognized by an industry expert. It is close to the first topic of the current article, with the difference that both the LOW and HIGH limits are computed on every step and that probabilities, not frequencies, are used for narrowing down the interval. Martin's article dropped out of the attention of many researchers who were filing patents on arithmetic coding and explaining the matter of their algorithms as building a long proper fraction, which put all their patents at risk of being circumvented by those who do it differently, because a patent is a very formal document and language definitions should be very precise.[verification needed] It does not follow that all patents on arithmetic coding are now void in the light of Martin's article, but it opens the ground for debates, which could have been avoided if the authors had at least mentioned the approach.[verification needed]

US patents

A variety of specific techniques for arithmetic coding have historically been covered by US patents, although various well-known methods have since passed into the public domain as the patents have expired. Techniques covered by patents may be essential for implementing the algorithms for arithmetic coding that are specified in some formal international standards. When this is the case, such patents are generally available for licensing under what is called "reasonable and non-discriminatory" (RAND) licensing terms (at least as a matter of standards-committee policy). In some well-known instances (including some involving IBM patents that have since expired) such licenses were available free, and in other instances, licensing fees have been required. The availability of licenses under RAND terms does not necessarily satisfy everyone who might want to use the technology, as what may seem "reasonable" for a company preparing a proprietary software product may seem much less reasonable for a free software or open source project. At least one significant compression software program, bzip2, deliberately discontinued the use of arithmetic coding in favor of Huffman coding due to the perceived patent situation at the time. Also, encoders and decoders of the JPEG file format, which has options for both Huffman encoding and arithmetic coding, typically only support the Huffman encoding option, which was originally because of patent concerns; the result is that nearly all JPEG images in use today use Huffman encoding,[1] although JPEG's arithmetic coding patents[2] have expired due to the age of the JPEG standard (the design of which was approximately completed by 1990).[3] Some US patents relating to arithmetic coding are listed below.

U.S. Patent 4,122,440 – (IBM) Filed 4 March 1977, granted 24 October 1978 (Now expired)

U.S. Patent 4,286,256 – (IBM) Granted 25 August 1981 (Now expired)

U.S. Patent 4,467,317 – (IBM) Granted 21 August 1984 (Now expired)

U.S. Patent 4,652,856 – (IBM) Granted 4 February 1986 (Now expired)

U.S. Patent 4,891,643 – (IBM) Filed 15 September 1986, granted 2 January 1990 (Now expired)

U.S. Patent 4,905,297 – (IBM) Filed 18 November 1988, granted 27 February 1990 (Now expired)

U.S. Patent 4,933,883 – (IBM) Filed 3 May 1988, granted 12 June 1990 (Now expired)

U.S. Patent 4,935,882 – (IBM) Filed 20 July 1988, granted 19 June 1990 (Now expired)

U.S. Patent 4,989,000 – Filed 19 June 1989, granted 29 January 1991 (Now expired)

U.S. Patent 5,099,440 – (IBM) Filed 5 January 1990, granted 24 March 1992 (Now expired)

U.S. Patent 5,272,478 – (Ricoh) Filed 17 August 1992, granted 21 December 1993

Note: This list is not exhaustive. See the following link for a list of more patents.[4] The Dirac codec uses arithmetic coding and is not patent pending.[5] Patents on arithmetic coding may exist in other jurisdictions; see software patents for a discussion of the patentability of software around the world.

Benchmarks and other technical characteristics
Every programmatic implementation of arithmetic encoding has a different compression ratio and performance. While compression ratios vary only a little (usually under 1%),[verification needed] the code execution time can vary by a factor of 10.

Choosing the right encoder from a list of publicly available encoders is not a simple task because performance and compression ratio depend also on the type of data, particularly on the size of the alphabet (number of different symbols). One of two particular encoders may have better performance for small alphabets while the other may show better performance for large alphabets. Most encoders have limitations on the size of the alphabet and many of them are designed for a dual alphabet only (zero and one).

Teaching aid

An interactive visualization tool for teaching arithmetic coding, dasher.tcl, was also the first prototype of the assistive communication system, Dasher.
