
INFORMATION THEORY

• It is a branch of communication theory devoted to problems in coding.
• A unique feature of information theory is its use of
a numerical measure of the amount of information
gained when the contents of a message are
learned.
HISTORY
• Information theory is a branch of applied mathematics
and electrical engineering involving the quantification of
information.
• Historically, information theory was developed by Claude
E. Shannon, known as "the father of information
theory," to find fundamental limits on compressing and
reliably storing and communicating data.
HISTORY
• Since its inception it has broadened to find applications
in many other areas, including statistical inference,
natural language processing, cryptography, networks other
than communication networks (as in neurobiology), the
evolution and function of molecular codes, model selection
in ecology, thermal physics, quantum computing, plagiarism
detection, and other forms of data analysis.
HISTORY

• Applications of fundamental topics of information theory
include lossless data compression (e.g. ZIP files),
lossy data compression (e.g. MP3s), and channel coding
(e.g. for DSL lines). The field is at the intersection of
mathematics, statistics, computer science, physics,
neurobiology, and electrical engineering.
• The main concepts of information theory can be grasped
by considering the most widespread means of human
communication: language.
• Source coding and channel coding are the fundamental
concerns of information theory.
• Information theory, however, does not consider message
importance or meaning, as these are matters of the
quality of data rather than the quantity and readability of
data.
• The convention in information theory is to measure
information in bits.
• Formatting – which transforms information from its original, or
natural, form to a well-defined and standard digital form, e.g.
PCM
• Source Coding – which reduces the average number of symbols
required to transmit a given message
• Encryption – which encodes messages using a cipher to prevent
unauthorized reception or transmission
• Error Control Decoding – which allows a receiver to detect and
sometimes correct symbols which are received in error
• Line Coding / Pulse Shaping – which ensures the transmitted
symbol waveforms are well suited to the characteristics of the
channel
CAN YOU READ THIS?
QUANTITY OF INFORMATION
• The most common unit of information is the bit,
based on the binary logarithm.
• Other units include the nat, which is based on the
natural logarithm, and the Hartley, which is based
on the common logarithm.
• A key measure of information in the theory is
known as entropy, which is usually expressed
by the average number of bits needed for
storage or communication.
ENTROPY
• Is a measure of disorder, or more precisely unpredictability.
If a compression scheme is lossless—that is, you can always
recover the entire original message by uncompressing—then a
compressed message has the same total entropy as the
original, but in fewer bits.
Shannon-Weaver Formula
$$H = -\sum_i P_i \log_2 P_i = \sum_i P_i \log_2 \frac{1}{P_i}$$
ENTROPY
Ex) A, B, C, D
All have an equal chance of transmitting information (Pᵢ = ¼ each).
H = ¼ log₂4 + ¼ log₂4 + ¼ log₂4 + ¼ log₂4
H = ½ + ½ + ½ + ½
H = 2
This result means that the average amount of information
in this situation is worth 2 bits.
A – 00, B – 01, C – 10, D – 11
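This calculation is easy to reproduce programmatically. A minimal Python sketch (the function name and the probability list are illustrative, not from the slides):

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits: H = -sum(p * log2(p)) over nonzero p."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Four equiprobable symbols A, B, C, D, each with probability 1/4
print(entropy([0.25, 0.25, 0.25, 0.25]))  # -> 2.0 bits per symbol
```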
SOURCE THEORY
• Any process that generates successive messages
can be considered a source of information.
• Information rate is the average entropy per
symbol.
• The rate of a source of information is related to its
redundancy and how well it can be compressed, the
subject of source coding.
SOURCE ENCODING

• A telecommunications medium has a limited capacity for data
transmission. This capacity is commonly measured by the
parameter called bandwidth.
• Since the bandwidth of a signal increases with the number of bits
to be transmitted each second, an important function of a digital
communications system is to represent the digitized signal by as
few bits as possible—that is, to reduce redundancy.
• Redundancy reduction is accomplished by a source encoder, which
often operates in conjunction with the analog-to-digital converter.
TYPES OF CODING
SOURCE CODING

• Source coding, or compression, is required for
efficient transmission or storage: it allows more data to be
transmitted over a given channel capacity or stored in a
given amount of space.
• Two types of source coding techniques are
typically named: Lossless and Lossy coding.
SOURCE CODING
• Examples of source coding applications:
1. gzip, compress, winzip, ...
2. Mobile voice, audio, and video transmission
3. Digital Versatile Discs (DVDs) and Blu-ray Discs
• Source Coding in Practice
Source coding often enables applications:
1. Digital television (DVB-T)
2. Internet video streaming (YouTube)
• Source coding makes applications economically feasible:
1. Distribution of digital images
2. High definition television (HDTV) over IPTV
DATA COMPRESSION
• Lossless compression is a class of data compression
algorithms that allow the original data to be perfectly
reconstructed from the compressed data.
• Lossy compression is the class of data encoding methods
that uses inexact approximations to represent the content.
These techniques are used to reduce the data size for
storing, handling, and transmitting content.
RUN-LENGTH ENCODING
Is a form of lossless data compression that takes advantage of the fact that there are
often runs of identical values in files.
Runs of data are stored as a single data value and count, rather than as the
original run.
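A minimal Python sketch of the idea (the function names rle_encode / rle_decode are illustrative, not from the slides):

```python
from itertools import groupby

def rle_encode(data):
    """Store each run of identical values as a (value, count) pair."""
    return [(value, len(list(run))) for value, run in groupby(data)]

def rle_decode(pairs):
    """Reconstruct the original sequence from (value, count) pairs."""
    return "".join(value * count for value, count in pairs)

encoded = rle_encode("WWWWWWBBBWW")
print(encoded)              # [('W', 6), ('B', 3), ('W', 2)]
print(rle_decode(encoded))  # 'WWWWWWBBBWW'
```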
HUFFMAN ENCODING

• Huffman Coding (also known as Huffman Encoding) is an
algorithm for doing data compression and it forms the basic
idea behind file compression.
• Assigns codes to characters such that the length of the code
depends on the relative frequency or weight of the
corresponding character.
• A Huffman coding tree, or Huffman tree, is a full binary tree in
which each leaf of the tree corresponds to a letter in the given
alphabet.
(Pictured: David Albert Huffman)
HUFFMAN ENCODING
• The idea is to assign variable-length codes to input characters; the lengths of
the assigned codes are based on the frequencies of the corresponding characters.
• The most frequent character gets the smallest code and the least frequent
character gets the largest code.
• A Huffman code is prefix-free, that is, no code starts with another complete
code.
Example: Construct a Huffman Tree and encode DEED and MUCK given the
following frequencies.
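The frequency table referred to above was given as a figure and is not reproduced here. As a rough sketch of the construction itself, the following Python builds Huffman codes by repeatedly merging the two least frequent subtrees; using raw character counts of the string "DEEDMUCK" as frequencies is an assumption made only for illustration:

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a Huffman code: repeatedly merge the two least frequent subtrees."""
    # Each heap entry: (frequency, tie_breaker, {symbol: code_so_far})
    heap = [(freq, i, {sym: ""}) for i, (sym, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    count = len(heap)
    if count == 1:
        # Degenerate single-symbol source: give it a 1-bit code
        return {sym: "0" for sym in heap[0][2]}
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        # Prepend 0 to codes in the left subtree, 1 to codes in the right subtree
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, count, merged))
        count += 1
    return heap[0][2]

codes = huffman_codes("DEEDMUCK")
print(codes)  # e.g. {'D': '00', 'E': '01', 'M': '100', ...}; exact codes depend on tie-breaking
```

Note how the frequent characters (D, E) end up with shorter codes than the characters that occur only once.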
HUFFMAN ENCODING

Another example:
SOURCE CODING

• In either case, a fundamental trade-off is made
between bit rate and fidelity. The ability of a
source coding system to make this trade-off well is
called its coding efficiency or rate-distortion
performance, and the coding system itself is
referred to as a source codec.
SOURCE CODING THEOREM
• A typical source-encoding model depicts a discrete
memoryless source X having finite entropy H(X) and a
finite set of source symbols, with corresponding
probabilities of occurrence pₖ, where k = 1, 2, …, K.
• The source-coding theorem states that for a given discrete
memoryless source X, the average codeword length
per symbol is bounded below by the source entropy H(X).
SOURCE CODE ENTROPY
SOURCE CODING

Key parameters of a source coder:
• codeword length
• average data rate
• efficiency of the coder
(i.e. actual output data rate compared to the minimum
achievable rate)
SOURCE CODING: CODEWORD LENGTH
The length of the codeword corresponding to a particular symbol of
the message is simply the number of digits in the codeword assigned
to the symbol.
Classification of Source Codes:
1. Fixed Length Source Codes
2. Variable-Length Source Codes
3. Distinct Codes
4. Uniquely Decodable Codes
SOURCE CODING: CODEWORD LENGTH
Classification of Source Codes:
1. Fixed Length Source Codes
-all characters represented by same number of bits
-ASCII (8 bits) and Unicode (16 bits) are fixed-length codes
2. Variable-Length Source Codes
-When the source symbols are not equiprobable, this can be more efficient than a
fixed-length code.
3. Distinct Codes
-A code is called distinct if each codeword is unique and clearly distinguishable from
the other.
4. Uniquely Decodable Codes
-A distinct code is said to be uniquely decodable if the original source symbol
sequence can be reconstructed perfectly from the received encoded binary
sequence. It can be either a fixed-length or a variable-length code.
SOURCE CODING: CODEWORD LENGTH

Potential problem: how do we know where one character ends and another begins?
SOURCE CODING: CODEWORD LENGTH
Prefix Source Code
• A prefix source code is defined as a code in which no
codeword is the prefix of any other code assigned to the
symbol of the given source.
• It is used to obtain a uniquely decodable source code
representing the source output.
• The end of a code word is always recognizable in prefix
codes.
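A small Python sketch of why this works: decoding reads bits one at a time and emits a symbol as soon as the accumulated bits match a codeword, with no lookahead. The codebook below is an arbitrary illustrative prefix-free code, not one taken from the slides:

```python
def prefix_decode(bits, codebook):
    """Decode a bit string with a prefix-free code.

    Because no codeword is a prefix of another, a match always marks
    the end of a codeword, so no lookahead is needed.
    """
    inverse = {code: symbol for symbol, code in codebook.items()}
    decoded, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in inverse:       # a complete codeword has just ended
            decoded.append(inverse[buffer])
            buffer = ""
    return "".join(decoded)

code = {"A": "0", "B": "10", "C": "110", "D": "111"}   # prefix-free
print(prefix_decode("0100110111", code))  # -> 'ABACD'
```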
SOURCE CODING: CODEWORD LENGTH

• Example: Design a variable-length prefix-free code such
that the message DEAACAAAAABA can be encoded
using 22 bits.
SOURCE CODING: AVERAGE CODEWORD
LENGTH
Average codeword length, Lavg, of the source encoder is
given as

$$L_{avg} = \sum_{k=1}^{K} p_k\, l_k$$

where lₖ is the length (in bits) of the codeword assigned to the k-th source symbol.
SOURCE CODING: CODE EFFICIENCY
Code efficiency of a binary source encoder is defined as

$$\eta = \frac{L_{min}}{L_{avg}} = \frac{H(X)}{L_{avg}}$$

• The condition Lavg = H(X) (i.e. η = 1) gives an optimum source code, and
Lavg > H(X) specifies a suboptimum source code.

• The average codeword length of an optimum source
code is the entropy of the source itself.
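Tying the last few slides together, a minimal Python sketch that computes the average codeword length, the source entropy, and the resulting code efficiency for a hypothetical source and code (the probabilities and codeword lengths below are illustrative assumptions, not taken from the slides):

```python
import math

def average_length(probs, lengths):
    """L_avg = sum over symbols of p_k * l_k."""
    return sum(p * l for p, l in zip(probs, lengths))

def entropy(probs):
    """H(X) = -sum of p_k * log2(p_k)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical source: P = {0.5, 0.25, 0.125, 0.125}, codeword lengths 1, 2, 3, 3
probs = [0.5, 0.25, 0.125, 0.125]
lengths = [1, 2, 3, 3]

L_avg = average_length(probs, lengths)   # 1.75 bits/symbol
H = entropy(probs)                       # 1.75 bits/symbol
efficiency = H / L_avg                   # 1.0 -> optimum code for this source
print(L_avg, H, efficiency)
```

Here Lavg equals H(X), so the efficiency is 1 and the code is optimum for this particular source; a poorer match between codeword lengths and symbol probabilities would push Lavg above H(X) and the efficiency below 1.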
