Topics include:
Information source
Information measure
Entropy
Source codes
Data compression
Information source
One standard method of defining an information source is to specify the number and
identity of each possible symbol, together with its associated frequency of occurrence,
expressed as a probability.
Consider the following received message:
THA CIT SAT ON THA MIT
Despite the errors, most readers will correctly interpret this as THE CAT SAT ON THE MAT.
The reason that such a correct interpretation may be made is that the English language
contains a degree of redundancy. That is, the symbols used to encode the information
convey more than is strictly required, as seen in the above trivial example.
This example illustrates that the information content of the original message is not
necessarily the same as the original set of symbols used to convey the message.
The information content of a message may be quantified. An information source may be
modelled by a repertoire of messages from which the desired message may be selected.
Suppose a source contains a set of symbols denoted by:
(x1, x2, x3, . . . , xn)
The probability of any one symbol xi appearing at any moment in the sequence is assumed
known and denoted by P(xi), and we may express a source by the following parameters:
- the number of symbols in the repertoire, or alphabet, denoted by n;
- the symbols themselves are denoted by x1, . . . , xn;
- the probabilities of occurrence of each of the symbols, denoted by P(x1), . . . , P(xn).
Such a source is called a discrete memoryless source. By memoryless we mean that the
probability of a symbol occurring is independent of any other symbol.
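The discrete memoryless property can be sketched in a few lines of Python. The names `symbols`, `probs` and `emit` are illustrative, not taken from the text: each symbol is drawn independently with a fixed probability, so no draw depends on any earlier one.

```python
import random

# Sketch of a discrete memoryless source: each emitted symbol is drawn
# independently with a fixed probability, so the probability of a symbol
# occurring is independent of any other symbol.
symbols = ["0", "1"]
probs = [0.5, 0.5]                    # equiprobable binary source

assert abs(sum(probs) - 1.0) < 1e-9   # probabilities must sum to 1

def emit(n, seed=42):
    """Emit n symbols; each draw is independent of all previous draws."""
    rng = random.Random(seed)
    return [rng.choices(symbols, weights=probs)[0] for _ in range(n)]

print("".join(emit(16)))
```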
As an example let us model a source based upon binary signals:
The number of symbols n is two. That is, one of two possible voltage
levels may be transmitted.
Symbols are denoted as x0 representing binary 0 and x1 representing
binary 1.
Finally, the probability of occurrence of these symbols is denoted:
P(x0) = P(x1) = 0.5, that is, they are equiprobable.
1. An information source consists of the outcomes of tossing a fair coin. The source can be
modelled as follows:
The number of symbols n = 2. The symbols represent the outcome of tossing a fair coin so
that x1 represents heads and x2 tails.
The probabilities of occurrence of the symbols are, assuming that it is a fair coin: P(x1) = P(x2)
= 0.5. Hence there is equal uncertainty with regard to the outcome of a single toss, so that
either outcome removes the same amount of uncertainty and therefore contains the same
amount of information.
2. However, when the symbols represent the answers to the question ‘Did you
watch television last night?’, then the source may be modelled as follows:
The number of symbols n = 2.
The symbols are then x1 = Yes, x2 = No.
On the assumption that 80% of the population watched television last night,
the probabilities of occurrence of the symbols are:
P(x1) = 0.8 and P(x2) = 0.2
3. Consider the source where the number of symbols is n = 2: The symbols are a
binary source with x1 = 1, x2 = 0.
The probabilities of occurrence are: P(x1) = 1 and P(x2) = 0
The receipt of x1 is certain and therefore there is no uncertainty and hence no
information; x2 of course will never be received.
Let the information content conveyed by xi be denoted by I(xi). Then from the
relationships established above we can say that:
1. If P(xi) = P(xj) then I(xi) = I(xj)
2. If P(xi) < P(xj) then I(xi) > I(xj)
3. If P(xi) = 1 then I(xi) = 0
A mathematical function that will satisfy the above constraints is given by:
I(xi) = logb(1/P(xi)) = -logb(P(xi))
The base b may be 2, giving information in bits, or conveniently e or 10, the logarithms
for which are widely available.
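The three constraints above can be checked directly against this function. A minimal sketch (the function name `information` is illustrative):

```python
from math import log2

def information(p):
    """Information content I(x) = log2(1/P(x)) in bits."""
    return log2(1.0 / p)

print(information(0.5))   # 1.0 bit: an equiprobable binary symbol
print(information(0.25))  # 2.0 bits: the rarer symbol carries more information
print(information(1.0))   # 0.0 bits: a certain symbol carries no information
```

Note how the outputs satisfy the constraints: equal probabilities give equal information, a smaller probability gives more information, and a probability of 1 gives none.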
Entropy (H) expresses the average amount of information conveyed by a
single symbol within a set of codewords and may be determined as follows:
H = Σ P(xi) log2(1/P(xi)), summed over i = 1, . . . , n
where log2(1/P(xi)) is, as we have seen, simply the information content of the ith symbol,
in bits. The formula for entropy shown above therefore simply averages the information
content of each symbol, weighted by its probability of occurrence, over all of the symbols
or codewords in the codeword set.
In designing source codes it is the average information content of the code that is of interest
rather than the information content of particular symbols. Entropy effectively provides the
average content and this may be used to estimate the bandwidth of a channel required to
transmit a particular code, or the size of memory to store a certain number of symbols within
a code.
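The entropy formula can be applied to the three sources modelled earlier. A short sketch (the function name `entropy` is illustrative; zero-probability symbols are skipped, since they contribute nothing):

```python
from math import log2

def entropy(probs):
    """H = sum of P(xi) * log2(1/P(xi)); zero-probability symbols are skipped."""
    return sum(p * log2(1.0 / p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))  # 1.0 bit/symbol: the fair coin of example 1
print(entropy([0.8, 0.2]))  # ~0.722 bits/symbol: the television source of example 2
print(entropy([1.0, 0.0]))  # 0.0 bits/symbol: the certain source of example 3
```

Note that the skewed source of example 2 conveys less information per symbol, on average, than the equiprobable source of example 1.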
Source codes
As indicated earlier, the aim of source coding is to produce a code which, on
average, requires the transmission of the maximum amount of information for
the fewest binary digits. This can be quantified by calculating the efficiency η of
the code. However before calculating efficiency we need to establish the length
of the code. The length of a code is the average length of its codewords and is
obtained by:
where li is the number of digits in the ith symbol and n is the number of symbols
the code contains.
The efficiency of a code is obtained by dividing the entropy by the average code
length:
η = H/L
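Entropy, average length and efficiency can be computed together. In this sketch the four-symbol code, its probabilities and its codeword lengths are illustrative, not taken from the text; they happen to describe a code whose lengths exactly match the symbols' information content, so the efficiency is 1.

```python
from math import log2

def entropy(probs):
    """H = sum of P(xi) * log2(1/P(xi)) in bits/symbol."""
    return sum(p * log2(1.0 / p) for p in probs if p > 0)

def avg_length(probs, lengths):
    """L = sum of P(xi) * li, the average codeword length in binary digits."""
    return sum(p * l for p, l in zip(probs, lengths))

# Hypothetical 4-symbol code, e.g. codewords 0, 10, 110, 111:
probs = [0.5, 0.25, 0.125, 0.125]
lengths = [1, 2, 3, 3]

H = entropy(probs)
L = avg_length(probs, lengths)
print(H, L, H / L)   # 1.75 1.75 1.0 -> a perfectly efficient code
```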
Analysis and design of source codes
Where messages consisting of sequences of symbols from an n-symbol source have to be
transmitted to their destination via a binary data transmission channel, each symbol must be
coded into a sequence of binary digits at the transmitter to produce a suitable input to the
channel.
In order to design suitable source codes some descriptors for classifying source codes have
been produced, as follows:
- Run-length encoding: instead of transmitting absolute values, repeated patterns, or
runs, are detected. The repeated value itself is then sent, together with the number of
times that it is repeated.
- Index compression: repeated patterns are placed in a table and both transmitter and
receiver hold a copy of the table. In order to transmit a run, an index is used at the
transmitter to point to the entry of the run in the table. It is this index which is then
transmitted and which the receiver uses to extract the run in question. This is widely
used with zip compression techniques and is suitable for text and images, but not for
speech. A commonly used code is the Lempel–Ziv (LZ) code.
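Run-length encoding is simple enough to sketch directly. The function names below are illustrative; each run is transmitted as a (value, count) pair instead of as the raw values themselves.

```python
def rle_encode(data):
    """Run-length encode: emit (value, count) pairs instead of raw values."""
    runs = []
    for value in data:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([value, 1])   # start a new run
    return [(v, c) for v, c in runs]

def rle_decode(runs):
    """Expand each (value, count) pair back into count copies of value."""
    return [v for v, c in runs for _ in range(c)]

line = [0, 0, 0, 0, 1, 1, 0, 0, 0]    # e.g. a scan line of white/black pels
runs = rle_encode(line)
print(runs)                            # [(0, 4), (1, 2), (0, 3)]
assert rle_decode(runs) == line        # lossless: decoding restores the input
```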
Facsimile compression
Fax, like television, is based upon scanning a document line by line, but differs
inasmuch as only monochrome is provided. Operation is by means of a sharply focused
light source scanned across the document in a series of closely spaced lines. An optical
detector detects the reflected light from the scanned area, which is encoded as either binary
0 for a ‘white’ area or binary 1 for ‘dark’. The receiver then interprets the data as a black dot
for 1 or ‘no printing’ for 0. The individual dots so produced are termed picture elements, or pels.
The ITU has produced a number of fax standards, namely T2 (Group 1), T3
(Group 2), T4 (Group 3) and T6 (Group 4). Only Group 3 and Group 4 are commonly used.
Group 3 is intended to operate over analogue PSTN lines using modulation and operating at
14.4 kbps. Group 4 operates digitally at 64 kbps by means of baseband transmission over ISDN
lines.
Group 3 scans an A4 sheet of information from the top left-hand corner to the bottom right-
hand corner. Each line is subdivided into 1728 picture elements (pels). Each pel is quantized
into either black or white. In the vertical direction the page is scanned to give approximately
1145 lines.
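A quick calculation shows why compression is essential at these resolutions: sent uncompressed over a Group 3 modem link, a single page would take over two minutes.

```python
# Uncompressed size of one Group 3 A4 page at the resolution above
# (one bit per pel, black or white):
pels_per_line = 1728
lines = 1145

bits = pels_per_line * lines
print(bits)              # 1978560 bits, roughly 2 Mbit per page
print(bits / 14_400)     # ~137 s to send uncompressed at 14.4 kbps
```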
Termination codes encode white, or black, runs from 0 to 63 pels long. Make-up
codes (Table 3.4(b)) encode multiples of 64 pels of the same color. Coding is based on
assuming that the first pel of a line is always white. Runs of less than 64 pels are
simply encoded directly and the appropriate termination code selected. Runs of
64 pels or more are encoded using an appropriate make-up code and, if
necessary, one or more termination codes. Runs in excess
of 2623 pels make use of more than one make-up code. This use of Huffman
coding whereby codewords may in some instances be selected from both tables,
rather than directly encoding into a single codeword, is known as modified
Huffman coding.
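The splitting of a run into a make-up component and a termination component can be sketched as follows (the function name is illustrative; the actual codewords come from the standard's tables, which are not reproduced here):

```python
def split_run(run_length):
    """Split a run into a make-up count (a multiple of 64) plus a
    termination count (0-63), as modified Huffman coding requires.
    Runs in excess of 2623 pels would need more than one make-up code."""
    makeup = (run_length // 64) * 64
    termination = run_length % 64
    return makeup, termination

print(split_run(50))    # (0, 50): termination code only
print(split_run(200))   # (192, 8): make-up code for 192, termination code for 8
```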
Group 4 specifies a different coding strategy to that of modified Huffman in
order to cope with images. Although photographs and images may contain a large
range of shades within one line, there is often only a very small change between
adjacent lines.
Much better compression may therefore be achieved if difference encoding
across adjacent lines is used. Group 4 uses a code known as modified relative
element address designate, or modified READ code, which is also optionally
specified for Group 3.
(The term modified is to indicate that it is a variant of an earlier READ code.)
Modified Huffman coding only encodes a single line at a time and is referred to
as a one-dimensional code. Modified READ coding is known as a two-
dimensional code because the encoding is based upon a pair of lines.
a0 is the reference pel on the coding line from which the next run is encoded; at the
start of a line it is an imaginary white pel to the left of the first pel.
a1 is simply the first pel to the right of a0 which has the opposite color, that is, the
first pel of the next run.
a2 is similar to a1 and denotes the start of the next run after that defined by a1.
b1 is the first pel on the reference line to the right of a0 whose color differs from
the pel immediately to its left and also that of a0.
b2 is the start of the next run after b1.
A run to be encoded may be in one of three different modes:
1. Pass mode: This occurs when a run in the reference line is no longer present in the
succeeding coding line, Figure 3.5(a).
The run that has ‘disappeared’, b1b2, is represented by the codeword 0001. The next
codeword to be produced will be based upon the run starting at a1, and which is in this case
a run of only 1 pel. The significance of a1′, which corresponds to b2 in the reference line
above, is that this position in the coding line will be regarded as a0 of the next run to be
encoded. Note that this run, which is the next to be encoded after the run that disappeared,
has the same color.
2. Vertical mode: This in fact applies to most runs and is where a black run in the coding line
is within ±3 pels of the start of a corresponding black run in the reference line. The two
extreme cases are shown in Figure 3.5(b).
There are five other possibilities, namely where the starts of the two runs differ by ±1
or ±2 pels, or where the commencement of the two runs coincides.
3. Horizontal mode: This mode is similar to the vertical mode but applies where the
displacement between the starts of the runs is in excess of ±3 pels. Two examples are
shown in Figure 3.5(c).
Encoding uses the codeword 001 to indicate that it is horizontal mode, followed by
codewords for the run lengths a0a1 and a1a2. In the case of the upper example, the
coding line commences with
the disappearance of a run of two black pels and would be encoded as pass mode. This is
followed by two white pels, a0a1, which do not fall within the category of pass or vertical
mode and must therefore be in horizontal mode.
To complete horizontal coding, the next black run, a1a2, is also encoded. Hence the horizontal
mode is encoded 001 0111 00011. Similarly it is left to readers to satisfy themselves that the
lower example is encoded 001 00111 11.
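The choice between the three modes can be sketched as a simple decision rule. This is a simplified illustration of the rules described above, not a full implementation of the standard; the pel positions in the examples are hypothetical.

```python
def select_mode(a1, b1, b2):
    """Choose the coding mode for the next run (positions are pel indices).
    A simplified sketch of the rules in the text:
    - pass mode:       the reference-line run b1..b2 ends before a1;
    - vertical mode:   a1 starts within +/-3 pels of b1;
    - horizontal mode: otherwise."""
    if b2 < a1:
        return "pass"
    if abs(a1 - b1) <= 3:
        return "vertical"
    return "horizontal"

print(select_mode(a1=10, b1=4, b2=7))    # pass: the run b1..b2 has disappeared
print(select_mode(a1=10, b1=12, b2=20))  # vertical: start within +/-3 pels
print(select_mode(a1=10, b1=20, b2=30))  # horizontal: displacement exceeds 3
```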
Video compression
Current video compression predominantly uses MPEG, where MPEG is the acronym for the
Moving Picture Experts Group set up by ISO in 1990 to develop standards for moving
pictures. MPEG in turn is partly based upon the use of a parallel standard originally
devised for digital coding of still pictures and now used in digital photographic
equipment. This standard is that of the Joint Photographic Experts Group (JPEG), which
drew upon experts from industry, the universities, broadcasters and so on. The group
worked with the then CCITT and ISO and commenced work in the mid-1980s. JPEG
compresses single still images by means of spatial compression.
Video signals are based upon a series of still pictures, or frames, which are
obtained at a constant rate using a scanning technique. Very often interlaced
scanning is used, as in public broadcast transmissions, where odd lines of the picture
are produced on one cycle of scanning and even lines on the next. These
‘half frames’ are known as fields.