
Fuzzy Sets and Systems 132 (2002) 11 – 32

Possibilistic information theory: a coding theoretic approach 

Andrea Sgarro ∗
Department of Mathematical Sciences (DSM), University of Trieste, 34100 Trieste, Italy
Received 20 April 2001; accepted 19 November 2001


We define information measures which pertain to possibility theory and which have a coding-theoretic meaning. We put
forward a model for information sources and transmission channels which is possibilistic rather than probabilistic. In the case
of source coding without distortion we define a notion of possibilistic entropy, which is connected to the so-called Hartley's
measure; we tackle also the case of source coding with distortion. In the case of channel coding we define a notion of
possibilistic capacity, which is connected to a combinatorial notion called graph capacity. In the probabilistic case Hartley's
measure and graph capacity are relevant quantities only when the allowed decoding error probability is strictly equal to zero,
while in the possibilistic case they are relevant quantities for whatever value of the allowed decoding error possibility; as the
allowed error possibility becomes larger the possibilistic entropy decreases (one can reliably compress data to smaller sizes),
while the possibilistic capacity increases (one can reliably transmit data at a higher rate). We put forward an interpretation of
possibilistic coding, which is based on distortion measures. We discuss an application, where possibilities are used to cope
with uncertainty as induced by a "vague" linguistic description of the transmission channel.
© 2001 Elsevier Science B.V. All rights reserved.

Keywords: Measures of information; Possibility theory; Possibilistic sources; Possibilistic entropy; Possibilistic channels; Possibilistic
capacity; Zero-error information theory; Graph capacity; Distortion measures

1. Introduction

When one speaks of possibilistic information theory, usually one thinks of possibilistic information measures, like U-uncertainty, say, and of their use in uncertainty management; the approach which one takes is axiomatic, in the spirit of the validation of Shannon's entropy which is obtained by using Hinčin's axioms; cf. e.g. [8,12–14]. In this paper we take a different approach: we define information measures which pertain to possibility theory and which have a coding-theoretic meaning. This kind of operational approach to information measures was first taken by Shannon when he laid down the foundations of information theory in his seminal paper of 1948 [18], and has proved to be quite successful; it has led to such important probabilistic functionals as source entropy and channel capacity. Below we shall adopt a model for information sources and transmission channels which is possibilistic rather than probabilistic (is based on logic rather than statistics); this will lead us to define a notion of possibilistic entropy and a notion of possibilistic capacity in much the same way as one arrives at the corresponding probabilistic notions. An interpretation of possibilistic coding is discussed, which is based on distortion measures, a notion which is currently used in probabilistic coding.

Partially supported by MURST and GNIM-CNR. Part of this paper, based mainly on Section 5, has been submitted for presentation at Ecsqaru-2001, to be held in September 2001 in Toulouse,
∗ Corresponding author. Tel.: +40-6762623; fax: +40-6762636.
E-mail address: (A. Sgarro).

0165-0114/01/$ - see front matter © 2001 Elsevier Science B.V. All rights reserved.
PII: S0165-0114(01)00245-7

We are confident that our operational approach may be a contribution to enlighten, if not to disentangle, the vexed question of defining adequate information measures in possibility theory.

We recall that both the entropy of a probabilistic source and the capacity of a probabilistic channel are asymptotic parameters; more precisely, they are limit values for the rates of optimal codes, compression codes in the case of sources, and error-correction codes in the case of channels; the codes one considers are constrained to satisfy a reliability criterion of the type: the decoding-error probability of the code should be at most equal to a tolerated value $\varepsilon$, $0 \le \varepsilon < 1$. A streamlined description of source codes and channel codes will be given below in Sections 4 and 5; even from our fleeting hints it is however apparent that, at least a priori, both the entropy of a source and the capacity of a channel depend on the value $\varepsilon$ which has been chosen to specify the reliability criterion. If in the probabilistic models the mention of $\varepsilon$ is usually omitted, the reason is that the asymptotic values for the optimal rates are the same whatever the value of $\varepsilon$, provided however that $\varepsilon$ is strictly positive.¹ Zero-error reliability criteria lead instead to quite different quantities, zero-error entropy and zero-error capacity. Now, the problem of compressing information sources at zero error is so trivial that the term zero-error entropy is seldom used, if ever.² Instead, the zero-error problem of data protection in noisy channels is devilishly difficult, and has led to a new and fascinating branch of coding theory, and more generally of information theory and combinatorics, called zero-error information theory, which has been pretty recently overviewed and extensively referenced in [15]. In particular, the zero-error capacity of a probabilistic channel is expressed in terms of a remarkable combinatorial notion called Shannon's graph capacity (graph-theoretic preliminaries are described in Appendix A).

¹ The entropy and the capacity relative to a positive error probability $\varepsilon$ allow one to construct sequences of codes whose probability of a decoding error is actually infinitesimal; it will be argued below that this point of view does not make much sense for possibilistic coding; cf. Remark 4.3.

² No error-free data compression is feasible for probabilistic sources if one insists, as we do below, on using block-codes, i.e., codes whose codewords have all the same length; this is why one has to resort to variable-length codes, e.g., to Huffman codes. As for variable-length coding, the possibilistic theory appears to lack a counterpart for the notion of average length; one should have to choose one of the various aggregation operators which have been proposed in the literature (for the very broad notion of aggregation operators, and of "averaging" aggregations in particular, cf., e.g., [12] or [16]). Even if one insists on using block-codes, the problem of data compression at zero error is far from trivial when a distortion measure is introduced; cf. Appendix B. In this paper we deal only with the basics of Shannon's theory, but extensions are feasible to more involved notions, compound channels, say, or multi-user communication (as for these information-theoretic notions cf., e.g., [3] or [4]).

So, to be fastidious, even in the case of probabilistic entropy and probabilistic capacity one deals with two step-functions of $\varepsilon$, which can assume only two distinct values, one for $\varepsilon = 0$ and the other for $\varepsilon > 0$. We shall adopt a model of the source and a model of the channel which are possibilistic rather than probabilistic, and shall choose a reliability criterion of the type: the decoding-error possibility should be at most equal to $\varepsilon$, $0 \le \varepsilon < 1$. As shown below, the possibilistic analogues of entropy and capacity exhibit quite a perspicuous step-wise behaviour as functions of $\varepsilon$, and so the mention of $\varepsilon$ cannot be disposed of. As for the "form" of the functionals one obtains, it is of the same type as in the case of the zero-error probabilistic measures, even if the tolerated error possibility is strictly positive. In particular, the capacities of possibilistic channels are always expressed in terms of graph capacities; in the possibilistic case, however, as one loosens the reliability criterion by allowing a larger error possibility, the relevant graph changes and the capacity of the possibilistic channel increases.

We describe the contents of the paper. In Section 2, after some preliminaries on possibility theory, possibilistic sources and possibilistic channels are introduced. Section 3 contains two simple lemmas, Lemmas 3.1 and 3.2, which are handy tools apt to "translate" probabilistic zero-error results into the framework of possibility theory. Section 4 is devoted to possibilistic entropy and source coding; we have decided to deal in Section 4 only with the problem of source coding without distortion, and to relegate the more taxing case of source coding with distortion to an appendix (Appendix B); this way we are able to make many of our points in an extremely simple way. In Section 5, after giving a streamlined description of channel coding, possibilistic capacity is defined and a coding theorem is provided. Section 6 explores

the consequences of changing the reliability criterion used in Section 5; one requires that the average error possibility should be small, rather than the maximal error possibility.³ Up to Section 6, our point of view is rather abstract: the goal is simply to understand what happens when one replaces probabilities by possibilities in the standard models for data transmission. A discussion of the practical meaning of our proposal is instead deferred to Section 7: we put forward an interpretation of the possibilistic model which is based on distortion measures. We discuss an application to the design of error-correcting telephone keyboards; in the spirit of "soft" mathematics possibilities are seen as numeric counterparts for linguistic labels, and are used to cope with uncertainty as induced by "vague" linguistic information.

³ The new possibilistic frame includes the traditional zero-error probabilistic frame, as argued in Section 3: it is enough to take possibilities which are equal to zero when the probability is zero, and equal to one when the probability is positive, whatever its value. However, the consideration of possibility values which are intermediate between zero and one does enlarge the frame; cf. Theorem 6.1 in Section 6, and the short comment made there just before giving its proof.

Section 7 points also to future work, which does not simply aim at a possibilistic translation and generalization of the probabilistic approach. Open problems are mentioned, which might prove to be stimulating also from a strictly mathematical viewpoint. In this paper we take the asymptotic point of view which is typical of Shannon theory, but one might prefer to take the constructive point of view of algebraic coding, and try to provide finite-length code constructions, as those hinted at in Section 7. We deem that the need for a solid theoretical foundation of "soft" coding, as possibilistic coding basically is, is proved by the fact that several ad hoc coding algorithms are already successfully used in practice, e.g., those for compressing images, which are not based on probabilistic descriptions of the source or of the channel (an exhaustive list of source coding algorithms is to be found in [21]). Probabilistic descriptions, which are derived from statistical estimates, are often too costly to obtain, or even unfeasible, and at the same time they are uselessly detailed.

The paper aims at a minimum level of self-containment, and so we have shortly re-described certain notions of information theory which are quite standard; for more details we refer the reader, e.g., to [3] or [4]. As for possibility theory, and in particular for a clarification of the elusive notion of non-interactivity, which is often seen as the natural possibilistic analogue of probabilistic independence (cf. Section 2), we mention [5,6,9,11,12,16,23].

2. Possibilistic sources and possibilistic channels

We recall that a possibility distribution $\Pi$ over a finite set $\mathcal{A} = \{a_1, \ldots, a_k\}$, called the alphabet, is defined by giving a possibility vector $\Pi = (\pi_1, \pi_2, \ldots, \pi_k)$ whose components $\pi_i$ are the possibilities $\Pi(a_i)$ of the $k$ singletons $a_i$ ($1 \le i \le k$, $k \ge 2$):

$$\Pi(a_i) = \pi_i, \qquad 0 \le \pi_i \le 1, \qquad \max_{1 \le i \le k} \pi_i = 1.$$

The possibility⁴ of each subset $A \subseteq \mathcal{A}$ is the maximum of the possibilities of its elements:

$$\Pi(A) = \max_{a_i \in A} \pi_i. \tag{2.1}$$

In particular $\Pi(\emptyset) = 0$, $\Pi(\mathcal{A}) = 1$. In logical terms taking a maximum means that event $A$ is $\varepsilon$-possible when at least one of its elements is so, in the sense of a logical disjunction.

⁴ The fact that the symbol $\Pi$ is used both for vectors and for distributions will cause no confusion; below the same symbol will be used also to denote a stationary and non-interactive source, since the behaviour of the latter is entirely specified by the vector $\Pi$. Similar conventions will be tacitly adopted also in the case of probabilistic sources, and of probabilistic and possibilistic channels.

Instead, probability distributions are defined through a probability vector $P = (p_1, p_2, \ldots, p_k)$, $P(a_i) = p_i$, $0 \le p_i \le 1$, $\sum_{1 \le i \le k} p_i = 1$, and have an additive nature:

$$P(A) = \sum_{a_i \in A} p_i.$$

With respect to probabilities, an empirical interpretation of possibilities is less clear. The debate on the meaning and the use of possibilities is an ample and long-standing one; the reader is referred to standard texts on possibility theory, e.g., those quoted at the

end of Section 1; cf. also Section 7, where the applicability of our model to real-world data transmission is discussed.

The probability distribution $P$ over $\mathcal{A}$ can be extended in a stationary and memoryless way to a probability distribution $P^n$ over the Cartesian power $\mathcal{A}^n$ by setting for each sequence $x = x_1 x_2 \ldots x_n \in \mathcal{A}^n$:

$$P^n(x) = \prod_{1 \le i \le n} P(x_i).$$

We recall that the elements of $\mathcal{A}^n$ are the $k^n$ sequences of length $n$ built over the alphabet $\mathcal{A}$. Each such sequence can be interpreted as the information which is output in $n$ time instants by a stationary and memoryless source, or SML source. The memoryless nature of the source is expressed by the fact that the $n$ probabilities $P(x_i)$ are multiplied. Similarly, we shall extend the possibility distribution $\Pi$ in a stationary and non-interactive way to a possibility distribution $\Pi^{[n]}$ over the Cartesian power $\mathcal{A}^n$:

Definition 2.1. A stationary and non-interactive information source over the alphabet $\mathcal{A}$ is defined by setting for each sequence $x \in \mathcal{A}^n$:

$$\Pi^{[n]}(x) = \min_{1 \le i \le n} \Pi(x_i).$$

In logical terms, this means that the occurrence of sequence $x = x_1 x_2 \ldots x_n$ is declared $\varepsilon$-possible when this is so for all of the letters $x_i$, in the sense of a logical conjunction. An interpretation of non-interactivity in our models of sources and channels is discussed in Section 7.

Let $\mathcal{A} = \{a_1, \ldots, a_k\}$ and $\mathcal{B} = \{b_1, \ldots, b_h\}$ be two alphabets, called in this context the input alphabet and the output alphabet, respectively. Probabilistic channels are usually described by giving a stochastic matrix $W$ whose rows are headed to the input alphabet $\mathcal{A}$ and whose columns are headed to the output alphabet $\mathcal{B}$. We recall that the $k$ rows of such a stochastic matrix are probability vectors over the output alphabet $\mathcal{B}$; each entry $W(b|a)$ is interpreted as the transition probability from the input letter $a \in \mathcal{A}$ to the output letter $b \in \mathcal{B}$. A stationary and memoryless channel $W^n$, or SML channel, extends $W$ to $n$-tuples, and is defined by setting for each $x \in \mathcal{A}^n$ and each $y = y_1 y_2 \ldots y_n \in \mathcal{B}^n$:

$$W^n(y|x) = W^n(y_1 y_2 \ldots y_n | x_1 x_2 \ldots x_n) = \prod_{i=1}^{n} W(y_i|x_i). \tag{2.2}$$

Note that $W^n$ is itself a stochastic matrix whose rows are headed to the sequences in $\mathcal{A}^n$, and whose columns are headed to the sequences in $\mathcal{B}^n$. The memoryless nature of the channel is expressed by the fact that the $n$ transition probabilities $W(y_i|x_i)$ are multiplied.

We now define the possibilistic analogue of stochastic (probabilistic) matrices. The $k$ rows of a possibilistic matrix $\Pi$ with $h$ columns are possibility vectors over the output alphabet $\mathcal{B}$. Each entry $\Pi(b|a)$ will be interpreted as the transition possibility⁵ from the input letter $a \in \mathcal{A}$ to the output letter $b \in \mathcal{B}$; cf. the example given below. In Definition 2.2 $\Pi$ is such a possibilistic matrix.

⁵ Of course transition probabilities and transition possibilities are conditional probabilities and conditional possibilities, respectively, as made clear by our notation which uses a conditioning bar. We have avoided mentioning explicitly the notion of conditional possibilities because they are the object of a debate which is far from being closed (cf. e.g., Part II of [5]); actually, the worst problems are met when one starts by assigning a joint distribution and wants to compute the marginal and conditional ones. In our case it is instead conditional possibilities that are the starting point: as argued in [2], "prior" conditional possibilities are not problematic, or rather they are no more problematic than possibilities in themselves.

Definition 2.2. A stationary and non-interactive channel, or SNI channel, $\Pi^{[n]}$, extends $\Pi$ to $n$-tuples and is defined as follows:

$$\Pi^{[n]}(y|x) = \Pi^{[n]}(y_1 y_2 \ldots y_n | x_1 x_2 \ldots x_n) = \min_{1 \le i \le n} \Pi(y_i|x_i). \tag{2.3}$$

Products as in (2.2) are replaced in (2.3) by a minimum operation; this expresses the non-interactive nature of the extension. Note that $\Pi^{[n]}$ is itself a possibilistic matrix whose rows are headed to the sequences in $\mathcal{A}^n$, and whose columns are headed to the sequences in $\mathcal{B}^n$. Taking the minimum of the $n$ transition possibilities $\Pi(y_i|x_i)$ can be interpreted as a

logical conjunction: only when all the transitions are $\varepsilon$-possible, it is $\varepsilon$-possible to obtain output $y$ from input $x$; cf. also Section 7. If $B$ is a subset of $\mathcal{B}^n$, one has in accordance with (2.1):

$$\Pi^{[n]}(B|x) = \max_{y \in B} \Pi^{[n]}(y|x).$$

Example 2.1. For $\mathcal{A} = \mathcal{B} = \{a, b\}$ we show a possibilistic matrix $\Pi$ and its "square" $\Pi^{[2]}$ which specifies the transition possibilities from input couples to output couples. The possibility that $a$ is received when $b$ is sent is $\delta$; this is also the possibility that $aa$ is received when $ab$ is sent, say; $0 \le \delta \le 1$. Take $B = \{aa, bb\}$; then $\Pi^{[2]}(B|ab) = \max[\delta, 0] = \delta$. In Section 6 this example will be used assuming $\varepsilon = 0$, $\delta = 1$.

$$
\begin{array}{c|cc}
\Pi & a & b \\ \hline
a & 1 & 0 \\
b & \delta & 1
\end{array}
\qquad\qquad
\begin{array}{c|cccc}
\Pi^{[2]} & aa & ab & ba & bb \\ \hline
aa & 1 & 0 & 0 & 0 \\
ab & \delta & 1 & 0 & 0 \\
ba & \delta & 0 & 1 & 0 \\
bb & \delta & \delta & \delta & 1
\end{array}
$$

3. A few lemmas

Sometimes the actual value of a probability does not matter, what matters is only whether that probability is zero or non-zero, i.e., whether the corresponding event $E$ is "impossible" or "possible". The canonical transformation maps probabilities to binary (zero-one) possibilities by setting $\mathrm{Poss}\{E\} = 0$ if and only if $\mathrm{Prob}\{E\} = 0$, else $\mathrm{Poss}\{E\} = 1$; this transformation can be applied to the components of a probability vector $P$ or to the components of a stochastic matrix $W$ to obtain a possibility vector $\Pi$ or a possibilistic matrix $\Pi$, respectively. Below we shall introduce an equivalence relation called $\varepsilon$-equivalence which in a way extends the notion of a canonical transformation; here and in the sequel $\varepsilon$ is a real number such as $0 \le \varepsilon < 1$. It will appear that a vector $\Pi$ or a matrix $\Pi$ obtained canonically from $P$ or from $W$ are $\varepsilon$-equivalent to $P$ or to $W$, respectively, for whatever value of $\varepsilon$.

Definition 3.1. A probability vector $P$ and a possibility vector $\Pi$ over alphabet $\mathcal{A}$ are said to be $\varepsilon$-equivalent when the following double implication holds $\forall a \in \mathcal{A}$:

$$P(a) = 0 \iff \Pi(a) \le \varepsilon.$$

The following lemma shows that $\varepsilon$-equivalence, rather than a relation between letters, is a relation pertaining to the extended distributions $P^n$ and $\Pi^{[n]}$, seen as set-functions over $\mathcal{A}^n$:

Lemma 3.1. Fix $n \ge 1$. The probability vector $P$ and the possibility vector $\Pi$ are $\varepsilon$-equivalent if and only if the following double implication holds $\forall A \subseteq \mathcal{A}^n$:

$$P^n(A) = 0 \iff \Pi^{[n]}(A) \le \varepsilon.$$

Proof. To prove that the double implication implies $\varepsilon$-equivalence, just take $A = \{aa \ldots a\}$ for each letter $a \in \mathcal{A}$. Now we prove that if $P$ and $\Pi$ are $\varepsilon$-equivalent then the double implication in Lemma 3.1 holds true. First assume $A$ is a singleton, and contains only sequence $x$. The following chain of double implications holds:

$$P^n(x) = 0 \iff \exists i: P(x_i) = 0 \iff \exists i: \Pi(x_i) \le \varepsilon \iff \min_i \Pi(x_i) \le \varepsilon \iff \Pi^{[n]}(x) \le \varepsilon.$$

This means that, if the two vectors $P$ and $\Pi$ are $\varepsilon$-equivalent, so are also $P^n$ and $\Pi^{[n]}$, seen as vectors with $k^n$ components. Then the following chain holds too, whatever the size of $A$:

$$P^n(A) = 0 \iff \forall x \in A: P^n(x) = 0 \iff \forall x \in A: \Pi^{[n]}(x) \le \varepsilon \iff \max_{x \in A} \Pi^{[n]}(x) \le \varepsilon \iff \Pi^{[n]}(A) \le \varepsilon.$$

However simple, Lemma 3.1 and its straightforward generalization to channels, Lemma 3.2 below, are the basic tools used to convert probabilistic zero-error coding theorems into possibilistic ones.
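As a sanity check of Lemma 3.1, the following minimal sketch compares the letter-level condition of Definition 3.1 with the set-level condition of the lemma, by brute force over every subset of sequences of length 2. The three-letter source, its numeric values and all function names are illustrative assumptions and do not come from the paper.

```python
from itertools import combinations, product


def seq_probability(p, x):
    """P^n(x): product of the letter probabilities (stationary, memoryless)."""
    result = 1.0
    for letter in x:
        result *= p[letter]
    return result


def seq_possibility(pi, x):
    """Pi^[n](x): minimum of the letter possibilities (stationary, non-interactive)."""
    return min(pi[letter] for letter in x)


def set_probability(p, subset):
    """Additive set-function: sum over the sequences in the subset."""
    return sum(seq_probability(p, x) for x in subset)


def set_possibility(pi, subset):
    """Maxitive set-function as in (2.1): maximum over the subset, 0 if empty."""
    return max((seq_possibility(pi, x) for x in subset), default=0.0)


def eps_equivalent(p, pi, eps):
    """Letter-level condition of Definition 3.1."""
    return all((p[a] == 0) == (pi[a] <= eps) for a in p)


def lemma_3_1_condition(p, pi, eps, n):
    """Set-level condition of Lemma 3.1, checked over every subset of A^n."""
    sequences = list(product(sorted(p), repeat=n))
    for size in range(len(sequences) + 1):
        for subset in combinations(sequences, size):
            zero_prob = set_probability(p, subset) == 0
            if zero_prob != (set_possibility(pi, subset) <= eps):
                return False
    return True


# Hypothetical three-letter source, chosen only for illustration.
P = {"a": 0.5, "b": 0.5, "c": 0.0}
PI = {"a": 1.0, "b": 0.5, "c": 0.2}

for eps in (0.1, 0.3):
    print(eps, eps_equivalent(P, PI, eps), lemma_3_1_condition(P, PI, eps, 2))
```

For these values the two conditions agree: both fail at a threshold of 0.1, where the zero-probability letter still has possibility 0.2, and both hold at 0.3. The enumeration is exponential in the number of sequences, so it is only meant for toy alphabets.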

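Before moving from sources to channels, the non-interactive extension (2.3) can be sketched in a few lines. The snippet rebuilds the "square" matrix of Example 2.1 by taking minima of letter-level transition possibilities. The variable name `delta` and its value 0.4 are assumptions made only for illustration: they stand for the one free entry of the matrix in Example 2.1, the possibility that a is received when b is sent.

```python
from itertools import product


def extend_channel(matrix, n):
    """Non-interactive extension (2.3): matrix[a][b] is the possibility of output b given input a.

    Returns ext[x][y], where x and y are strings of length n.
    """
    inputs = sorted(matrix)
    outputs = sorted(matrix[inputs[0]])
    ext = {}
    for x in product(inputs, repeat=n):
        row = {}
        for y in product(outputs, repeat=n):
            # The product of (2.2) is replaced by a minimum.
            row["".join(y)] = min(matrix[xi][yi] for xi, yi in zip(x, y))
        ext["".join(x)] = row
    return ext


def set_transition_possibility(ext, x, subset):
    """Possibility that the output falls in the subset, given input x (maximum, as in (2.1))."""
    return max((ext[x][y] for y in subset), default=0.0)


delta = 0.4  # illustrative value only
PI = {"a": {"a": 1.0, "b": 0.0},
      "b": {"a": delta, "b": 1.0}}

PI2 = extend_channel(PI, 2)
for x in sorted(PI2):
    print(x, [PI2[x][y] for y in sorted(PI2[x])])
print(set_transition_possibility(PI2, "ab", ["aa", "bb"]))
```

With these assumed values, the row for input ab reads (0.4, 1, 0, 0) and the possibility of the output set {aa, bb} given ab is 0.4, in line with the value stated in Example 2.1.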
Definition 3.2. A stochastic matrix $W$ and a possibilistic matrix $\Pi$ are said to be $\varepsilon$-equivalent when the following double implication holds $\forall a \in \mathcal{A}$, $\forall b \in \mathcal{B}$:

$$W(b|a) = 0 \iff \Pi(b|a) \le \varepsilon.$$

Lemma 3.1 soon generalizes as follows:

Lemma 3.2. Fix $n \ge 1$. The stochastic matrix $W$ and the possibility matrix $\Pi$ are $\varepsilon$-equivalent if and only if the following double implication holds $\forall x \in \mathcal{A}^n$, $\forall B \subseteq \mathcal{B}^n$:

$$W^n(B|x) = 0 \iff \Pi^{[n]}(B|x) \le \varepsilon.$$

In Sections 5 and 6 on channel coding we shall need the following notion of confoundability between letters: two input letters $a$ and $a'$ are confoundable for the probabilistic matrix $W$ if and only if there exists at least an output letter $b$ such that the transition probabilities $W(b|a)$ and $W(b|a')$ are both strictly positive. Given matrix $W$, one can construct a confoundability graph $G(W)$ whose vertices are the letters of $\mathcal{A}$ by joining two letters by an edge if and only if they are confoundable (graph-theoretic notions are described in Appendix A).

We now define a similar notion for possibilistic matrices; to this end we introduce a proximity index $\alpha$ between possibility vectors $\Pi = (\pi_1, \pi_2, \ldots)$ and $\Pi' = (\pi'_1, \pi'_2, \ldots)$, which in our case will be possibility vectors over the output set $\mathcal{B}$:

$$\alpha(\Pi, \Pi') = \max_{1 \le i \le h} [\pi_i \wedge \pi'_i].$$

Above the wedge symbol $\wedge$ stands for a minimum and is used only to improve readability. The index $\alpha$ is symmetric: $\alpha(\Pi, \Pi') = \alpha(\Pi', \Pi)$. One has $0 \le \alpha(\Pi, \Pi') \le 1$, with $\alpha(\Pi, \Pi') = 0$ if and only if $\Pi$ and $\Pi'$ have disjoint supports, and $\alpha(\Pi, \Pi') = 1$ if and only if there is at least a letter $a$ for which $\Pi(a) = \Pi'(a) = 1$; in particular, this happens when $\Pi = \Pi'$ (we recall that the support of a possibility vector is made up by those letters whose possibility is strictly positive).

The proximity index $\alpha$ will be extended to input letters $a$ and $a'$, by taking the corresponding rows in the possibilistic matrix $\Pi$:

$$\alpha_\Pi(a, a') = \alpha(\Pi(\cdot|a), \Pi(\cdot|a')) = \max_{b \in \mathcal{B}} [\Pi(b|a) \wedge \Pi(b|a')].$$

Example 3.1. We re-take Example 2.1 above. One has: $\alpha_\Pi(a, a) = \alpha_\Pi(b, b) = 1$, $\alpha_\Pi(a, b) = \delta$. With respect to $\Pi^{[2]}$, the proximity of two letter couples $x$ and $x'$ is either 1 or $\delta$, according whether $x = x'$ or $x \ne x'$ (recall that $\Pi^{[2]}$ can be viewed as a possibilistic matrix over the "alphabet" of letter couples). Cf. also Examples 5.1, 5.2 and the example worked out in Section 7.

Definition 3.3. Once a possibilistic matrix $\Pi$ and a number $\varepsilon$ are given ($0 \le \varepsilon < 1$), two input letters $a$ and $a'$ are said to be $\varepsilon$-confoundable if and only if their proximity exceeds $\varepsilon$:

$$\alpha_\Pi(a, a') > \varepsilon.$$

Given $\Pi$ and $\varepsilon$, one constructs the $\varepsilon$-confoundability graph $G_\varepsilon(\Pi)$, whose vertices are the letters of $\mathcal{A}$, by joining two letters by an edge if and only if they are $\varepsilon$-confoundable for $\Pi$.

Lemma 3.3. If the stochastic matrix $W$ and the possibilistic matrix $\Pi$ are $\varepsilon$-equivalent the two confoundability graphs $G(W)$ and $G_\varepsilon(\Pi)$ coincide.

Proof. We have to prove that, under the assumption of $\varepsilon$-equivalence, any two input letters $a$ and $a'$ are confoundable for the stochastic matrix $W$ if and only if they are $\varepsilon$-confoundable for the possibilistic matrix $\Pi$. The following chain of double implications holds:

$$
\begin{aligned}
a \text{ and } a' \text{ are confoundable for } W
&\iff \exists b: W(b|a) > 0,\ W(b|a') > 0 \\
&\iff \exists b: \Pi(b|a) > \varepsilon,\ \Pi(b|a') > \varepsilon \\
&\iff \max_{b} [\Pi(b|a) \wedge \Pi(b|a')] > \varepsilon \\
&\iff \alpha_\Pi(a, a') > \varepsilon \\
&\iff a \text{ and } a' \text{ are } \varepsilon\text{-confoundable for } \Pi.
\end{aligned}
$$
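The proximity index and the confoundability graphs of this section lend themselves to a direct computation. The sketch below builds the graph of Definition 3.3 for a few thresholds and compares it with the confoundability graph of a stochastic matrix, as in Lemma 3.3. The 3-by-3 matrices `PI` and `W`, the letter names and the function names are hypothetical choices made only for illustration; `W` was picked so that it is equivalent to `PI` at threshold 0.2 in the sense of Definition 3.2.

```python
from itertools import combinations


def proximity(matrix, a1, a2):
    """Proximity of two input letters: max over outputs of the min of the two rows."""
    return max(min(matrix[a1][b], matrix[a2][b]) for b in matrix[a1])


def possibilistic_graph(matrix, eps):
    """Edges of the eps-confoundability graph: pairs whose proximity exceeds eps."""
    return {frozenset(pair) for pair in combinations(sorted(matrix), 2)
            if proximity(matrix, *pair) > eps}


def probabilistic_graph(w):
    """Edges of G(W): pairs sharing an output reachable with positive probability."""
    return {frozenset((a1, a2)) for a1, a2 in combinations(sorted(w), 2)
            if any(w[a1][b] > 0 and w[a2][b] > 0 for b in w[a1])}


# Hypothetical possibilistic matrix: rows are inputs a, b, c; columns are outputs u, v, w.
PI = {"a": {"u": 1.0, "v": 0.5, "w": 0.0},
      "b": {"u": 0.2, "v": 1.0, "w": 0.5},
      "c": {"u": 0.0, "v": 0.2, "w": 1.0}}

# Hypothetical stochastic matrix with W(y|x) = 0 exactly where PI(y|x) <= 0.2.
W = {"a": {"u": 0.5, "v": 0.5, "w": 0.0},
     "b": {"u": 0.0, "v": 0.5, "w": 0.5},
     "c": {"u": 0.0, "v": 0.0, "w": 1.0}}

for eps in (0.0, 0.2, 0.5):
    edges = sorted("".join(sorted(e)) for e in possibilistic_graph(PI, eps))
    print(eps, edges)
print(possibilistic_graph(PI, 0.2) == probabilistic_graph(W))
```

In this toy case the graph has three edges at threshold 0, two at 0.2 and none at 0.5, which illustrates how edges can only disappear as the tolerated error possibility grows. At threshold 0.2 the graph coincides with the confoundability graph of `W`, as Lemma 3.3 predicts.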

Remark 3.1. The index $\alpha_\Pi(a, a')$ establishes a fuzzy relation between input letters, which may be represented by means of a fuzzy graph $G(\Pi)$ with vertex set equal to the input alphabet $\mathcal{A}$: each edge $(a, a')$ belongs to the edge set of $G(\Pi)$ with a degree of membership equal to $\alpha_\Pi(a, a')$. Then the crisp graphs $G_\varepsilon(\Pi)$ are obtained as (strong) $\varepsilon$-cuts of the fuzzy graph $G(\Pi)$. By the way, we observe that $\alpha_\Pi(a, a')$ is a proximity relation in the technical sense of fuzzy set theory [17]. For basic notions in fuzzy set theory cf., e.g., [7,12] or [16].

4. The entropy of a possibilistic source

We start by the following general observation, which applies both to source and channel coding. The elements which define a code $f$, i.e., the encoder $f^+$ and the decoder $f^-$ (cf. below), do not require a probabilistic or a possibilistic description of the source, or of the channel, respectively. One must simply choose the alphabets (or at least the alphabet sizes): a primary alphabet $\mathcal{A}$, which is the source alphabet in the case of sources and the input alphabet in the case of channels, and the secondary alphabet $\mathcal{B}$, which is the reproduction alphabet in the case of sources and the output alphabet in the case of channels.⁶ One must also specify a length $n$, which is the length of the messages which are encoded in the case of sources, and the length of the codewords which are sent through the channel in the case of channels, respectively. Once these elements, $\mathcal{A}$, $\mathcal{B}$ and $n$, have been chosen, one can construct a code $f$, i.e., a couple encoder/decoder. Then one can study the performance of $f$ by varying the "behaviour" of the source (of the channel, respectively): for example one can first assume that this behaviour has a probabilistic nature, while later one changes to a less committal possibilistic description.

⁶ Actually, in source coding without distortion the primary alphabet and the reproduction alphabet coincide and so the latter is not explicitly mentioned in Section 4.

The first coding problem which we tackle is data compression without distortion. The results of this section, or at least their asymptotic counterparts, inclusive of the notion of $\varepsilon$-entropy, might have been obtained as a very special case of data compression with distortion, as explained in Appendix B. The reason for confining the general case with distortion to an appendix is just ease of readability: actually, data compression without distortion as covered in this section offers no real problem from a mathematical point of view, but at the same time is very typical of the novelties which our possibilistic approach to coding presents with respect to the standard probabilistic approach.

⁷ In Shannon theory one often incurs into the slight but convenient inaccuracy of allowing non-integer "lengths". By the way, the logarithms here and below are all to the base 2, and so the unit we choose for information measures is the bit. Bars as in $|C|$ denote size, i.e., number of elements. Notice that, not to overcharge our notation, the mention of the length $n$ is not made explicit in the symbols which denote coding functions $f$ and codebooks $C$.

We give a streamlined description of what a source code $f$ is; for more details we refer to [3] or [4]. A code $f$ is made up of two elements, an encoder $f^+$ and a decoder $f^-$; the encoder maps the $n$-sequences of $\mathcal{A}^n$, i.e., the messages output by the information source, to binary strings of a fixed length $l$ called codewords; the decoder maps back codewords to messages in a way which should be "reliable", as specified below. In practice (and without loss of generality), the basic element of a code is a subset $C \subseteq \mathcal{A}^n$ of messages, called the codebook. The idea is that, out of the $k^n$ messages output by the information source, only those belonging to the codebook $C$ are given separate binary codewords and are properly recognized by the decoder; should the source output a message which does not belong to $C$, then the encoder will use any of the binary codewords meant for the messages in $C$, and so a decoding error will be committed. Thinking of a source which is modelled probabilistically, as is standard in information theory, a good code should trade off two conflicting demands: the binary codewords should be short, so as to ensure compression of data, while the error probability should be small, so as to ensure reliability. In practice, one chooses a tolerated error probability $\varepsilon$, $0 \le \varepsilon < 1$, and then constructs a set $C$ as small as possible with the constraint that its probability be at least as great as $1 - \varepsilon$. The number $n^{-1} \log |C|$ is called the code rate; $\log |C|$ is interpreted as the (non-necessarily integer) length of the binary sequences which encode source sequences⁷ and so the rate is measured as number of

bits per source letter. Consequently, the fundamental whose rate log |{a: P(a)¿0}| is the same whatever
optimization problem of probabilistic source coding the given length n. Consequently, this is also the value
boils down to *nding a suitable codebook C: of the zero-error entropy H0 (P) of the SML source:
1 H0 (P) = log |{a: P(a) ¿ 0}|:
Minimize the code rate log |C| with the
For  strictly positive, one has, as is well known,
constraint Prob{¬C} 6  (4.1)
H (P) = H(P)6H0 (P), the inequality being strict
(the symbol ¬ denotes negation, or set complementa- unless P is uniform over its support. Note that the
tion). As is usual, we shall consider only Bernoullian -entropy is a step-function of : however, the func-
(i.e., stationary and memoryless, or SML) sources, tion’s step is obtained only if one keeps the value
which are completely described by the probability vec-  = 0, which is rather uninteresting because it corre-
tor P over the alphabet letters of A; then in (4:1) the sponds to a situation where no data-compression is
generic indication of probability can be replaced by the more specific symbol $P^n$.

Given the SML source $P$, its $\varepsilon$-entropy $H_\varepsilon(P)$ is defined as the limit of the rates $R_n(P,\varepsilon)$ of optimal codes which solve the optimization problem (4.1), obtained as the length $n$ of the encoded messages goes to infinity:

$$H_\varepsilon(P) = \lim_{n\to\infty} R_n(P,\varepsilon).$$

In the probabilistic theory there is a dramatic difference between the case $\varepsilon \neq 0$ and the case $\varepsilon = 0$. Usually one tackles the case $\varepsilon \neq 0$, the only one which allows actual data compression, as it can be proved. It is well-known that the rates of optimal codes tend to the Shannon entropy $\mathcal{H}(P)$ as $n$ goes to infinity:

$$H_\varepsilon(P) = \mathcal{H}(P) = -\sum_i p_i \log p_i, \qquad 0 < \varepsilon < 1$$

(we use the script symbol $\mathcal{H}$ to distinguish Shannon entropy from the operational entropy $H_\varepsilon$). So Shannon entropy is the asymptotic value of optimal rates. Note that this asymptotic value does not depend on the tolerated error probability $\varepsilon > 0$; only the speed of convergence is affected; this is why the mention of $\varepsilon$ is in most cases altogether omitted. Instead, we find it convenient to explicitly mention $\varepsilon$, and say that the (probabilistic) $\varepsilon$-entropy $H_\varepsilon(P)$ of the SML source ruled by the probability vector $P$ is equal to the Shannon entropy $\mathcal{H}(P)$ for whatever $\varepsilon > 0$.

Let us go to the case $\varepsilon = 0$. In this case the structure of optimal codebooks is extremely simple: each sequence of positive probability must be given its own codeword, and so the optimal codebook is

$$C = \{a : P(a) > 0\}^n \qquad (4.2)$$

feasible, but only data transcription into binary; this happens, say, when one uses ASCII. The zero-error entropy $H_0(P)$ is sometimes called Hartley's measure (cf. [12]); in the present context it might be rather called Hartley's entropy ($\varepsilon = 0$), to be set against Shannon's entropy ($\varepsilon > 0$).

Example 4.1. Take $P = (1/2, 1/4, 1/4, 0)$ over an alphabet $A$ of four letters. For $\varepsilon = 0$ one has $H_0(P) = \log 3 \approx 1.585$, while $H_\varepsilon(P) = \mathcal{H}(P) = 1.5$ whenever $0 < \varepsilon < 1$.

We now go to a stationary and non-interactive source, or SNI source, over alphabet $A$, which is entirely described by the possibilistic vector $\Pi$ over alphabet letters. The source coding optimization problem (4.1) will be replaced by (4.3), where one bounds the decoding error possibility rather than the decoding error probability:

Minimize the code rate $\frac{1}{n}\log |C|$ with the constraint $\Pi^{[n]}(\neg C) \le \varepsilon$. (4.3)

Now we shall define the possibilistic entropy; as in the probabilistic case, the definition is operational, i.e., is given in terms of a coding problem.

Definition 4.1. Given the stationary and non-interactive source $\Pi$, its possibilistic $\varepsilon$-entropy $H_\varepsilon(\Pi)$ is defined as the limit of the rates $R_n(\Pi,\varepsilon)$ of optimal codes which solve the optimization problem (4.3), obtained as the length $n$ goes to infinity:

$$H_\varepsilon(\Pi) = \lim_{n\to\infty} R_n(\Pi,\varepsilon).$$
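
The two values quoted in Example 4.1 can be checked numerically. The sketch below is only an illustration: the function names `shannon_entropy` and `hartley_entropy` are ours and do not come from the paper, and logarithms are taken to base 2, as the value log 3 ≈ 1.585 in the example suggests.

```python
from math import log2


def shannon_entropy(P):
    """Shannon entropy in bits; letters of zero probability contribute nothing."""
    return -sum(p * log2(p) for p in P if p > 0)


def hartley_entropy(P):
    """Zero-error (Hartley) entropy: log of the number of letters with positive probability."""
    return log2(sum(1 for p in P if p > 0))


# Probability vector of Example 4.1
P = [0.5, 0.25, 0.25, 0.0]

print("Shannon entropy (any 0 < eps < 1):", shannon_entropy(P))
print("Hartley entropy (eps = 0):", round(hartley_entropy(P), 3))
```

The gap between the two numbers is the price paid for insisting on zero error: the letter of probability zero is discarded in both cases, but only the Shannon value exploits the unequal probabilities of the remaining three letters.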

Because of Lemma 3.1, the constraint in (4.3) can be re-written as $P^n\{\neg C\} = 0$ for whatever $P$ which is $\varepsilon$-equivalent with $\Pi$. This means that solving the minimization problem (4.3) is the same as solving the minimization problem (4.1) at zero error for whatever $P$ such as to be $\varepsilon$-equivalent with $\Pi$. So, the following lemma holds:

Lemma 4.1. If $P$ and $\Pi$ are $\varepsilon$-equivalent, the very same code $f = (f^+, f^-)$ which is optimal for criterion (4.1) at zero error, with $P^n(\neg C) = 0$, is optimal also for criterion (4.3) at $\varepsilon$-error, with $\Pi^{[n]}(\neg C) \le \varepsilon$, and conversely.

A comparison with (4.2) shows that an optimal codebook $C$ for (4.3) is formed by all the sequences of length $n$ which are built over the sub-alphabet of those letters whose possibility exceeds $\varepsilon$:

$$C = \{a : \Pi(a) > \varepsilon\}^n.$$

Consequently:

Theorem 4.1. The possibilistic $\varepsilon$-entropy $H_\varepsilon(\Pi)$ is given by:

$$H_\varepsilon(\Pi) = \log |\{a : \Pi(a) > \varepsilon\}|, \qquad 0 \le \varepsilon < 1.$$

The fact that the possibilistic entropy is obtained as the limit of a constant sequence of optimal rates is certainly disappointing; however, asymptotic optimal rates are not always so trivially found, as will appear when we discuss channel coding (Section 5) or source coding with distortion (Appendix B); we shall comment there that reaching an optimal asymptotic value "too soon" (for $n = 1$) corresponds to a situation where one is obliged to use trivial code constructions. In a way, we have simply proved that in possibilistic source coding without distortion trivial code constructions are unavoidable.

Below we stress explicitly the obvious fact that the possibilistic entropy $H_\varepsilon(\Pi)$ is a stepwise non-increasing function of $\varepsilon$, $0 \le \varepsilon < 1$. The steps of the function $H_\varepsilon(\Pi)$ begin in correspondence to the distinct possibility components $\pi_i < 1$ which appear in vector $\Pi$, inclusive of the value $\pi_0 = 0$ even if 0 is not a component of $\Pi$; below the term "consecutive" refers to an ordering of the numbers $\pi_i$.

Proposition 4.1. If $0 \le \varepsilon < \varepsilon' < 1$, then $H_\varepsilon(\Pi) \ge H_{\varepsilon'}(\Pi)$. If $\pi_i < \pi_{i+1}$ are two consecutive entries in $\Pi$, then $H_\varepsilon(\Pi)$ is constant for $\pi_i \le \varepsilon < \pi_{i+1}$.

Example 4.2. Take $\Pi = (1, 1, 1/2, 1/2, 1/4, 0)$ over an alphabet $A$ of six letters. Then $H_\varepsilon(\Pi) = \log 5 \approx 2.322$ when $0 \le \varepsilon < 1/4$, $H_\varepsilon(\Pi) = \log 4 = 2$ when $1/4 \le \varepsilon < 1/2$, $H_\varepsilon(\Pi) = \log 2 = 1$ when $1/2 \le \varepsilon < 1$.

Remark 4.1. In the probabilistic case the constraint $\mathrm{Prob}\{\neg C\} \le \varepsilon$ can be re-written in terms of the probability of correct decoding as $\mathrm{Prob}\{C\} \ge 1 - \varepsilon$, because $\mathrm{Prob}\{C\} + \mathrm{Prob}\{\neg C\} = 1$. Instead, the sum $\mathrm{Poss}\{C\} + \mathrm{Poss}\{\neg C\}$ can be strictly larger than 1, and so $\mathrm{Poss}\{C\} \ge 1 - \varepsilon$ is a different constraint. This constraint, however, would be quite loose and quite uninteresting, since the possibility $\mathrm{Poss}\{C\}$ of correct decoding and the error possibility $\mathrm{Poss}\{\neg C\}$ can be both equal to 1 at the same time.

Remark 4.2. Unlike in the probabilistic case, in the possibilistic case replacing the "weak" reliability constraint $\Pi^{[n]}\{\neg C\} \le \varepsilon$ by a strict inequality, $\Pi^{[n]}\{\neg C\} < \varepsilon$, does make a difference even asymptotically. In this case the Definition 3.1 of $\varepsilon$-equivalence should be modified by requiring $P(a) = 0$ if and only if $\Pi(a) < \varepsilon$, $0 < \varepsilon \le 1$. The "strict" possibilistic entropy one would obtain is however the same step-function as $H_\varepsilon(\Pi)$ above, only that the "steps" of the new function would be closed on the right rather than being closed on the left.

Remark 4.3. In the probabilistic case one can produce a sequence of source codes whose rate tends to Shannon entropy and whose error probability goes to zero; in other terms Shannon entropy allows one to code not only with a decoding error probability bounded by any $\varepsilon > 0$, but even with an "infinitesimal" (however, positive) error probability. In our possibilistic case, however, requiring that the error possibility goes to zero is the same as requiring that it is zero for $n$ high enough, as easily perceived by considering that the possibilistic entropy is constant in a right neighbourhood of $\varepsilon = 0$.

Remarks 4.1, 4.2 and 4.3, suitably reformulated, would apply also in the case of source coding with distortion as in Appendix B and channel coding as in Section 5, but we shall not insist on them further.
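
Theorem 4.1 reduces the possibilistic entropy to counting the letters whose possibility exceeds the threshold, and Proposition 4.1 says the result is a step function of the threshold. The sketch below reproduces the three values of Example 4.2; it is an illustration only, the function name `possibilistic_entropy` is ours, logarithms are assumed to be base 2, and exact fractions are used so that the comparison with the threshold at the step points is not disturbed by rounding.

```python
from fractions import Fraction
from math import log2


def possibilistic_entropy(Pi, eps):
    """log2 of the number of letters a with Pi(a) > eps, for 0 <= eps < 1."""
    if not 0 <= eps < 1:
        raise ValueError("eps must satisfy 0 <= eps < 1")
    # A possibilistic vector has at least one entry equal to 1,
    # so the count below is always at least 1 and the logarithm is defined.
    return log2(sum(1 for p in Pi if p > eps))


# Possibilistic vector of Example 4.2
Pi = [Fraction(1), Fraction(1), Fraction(1, 2), Fraction(1, 2), Fraction(1, 4), Fraction(0)]

for eps in (Fraction(0), Fraction(1, 4), Fraction(1, 2), Fraction(3, 4)):
    print(f"eps = {eps}: H = {possibilistic_entropy(Pi, eps):.3f}")
```

Because the comparison is strict, each step is closed on the left: at the threshold 1/4 the letter of possibility 1/4 is already excluded, which is the behaviour stated in Proposition 4.1.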

5. The capacity of a possibilistic channel

Let $A = \{a_1, \ldots, a_k\}$ and $B = \{b_1, \ldots, b_h\}$ be two alphabets, called in this context the input alphabet and the output alphabet, respectively. We give a streamlined description of what a channel code is; for more details we refer to [3,4], and also to [15], which is specifically devoted to zero-error information theory. The basic elements of a code $f$ are the encoder $f^+$ and the decoder $f^-$. The encoder $f^+$ is an injective (invertible) mapping which takes uncoded messages onto a set of codewords $C \subseteq A^n$; the set $M$ of uncoded messages is left unspecified, since its "structure" is irrelevant. Codewords are sent as input sequences through a noisy medium, or noisy channel. They are received at the other end of the channel as output sequences which belong to $B^n$. The decoder $f^-$ takes back output sequences to the codewords of $C$, and so to the corresponding uncoded messages. This gives rise to a partition of $B^n$ into decoding sets, one for each codeword $c \in C$. Namely, the decoding set $D_c$ for codeword $c$ is $D_c = \{y : f^-(y) = c\} \subseteq B^n$. The most important feature of a code $f = (f^+, f^-)$ is its codebook $C \subseteq A^n$ of size $|C|$. The decoder $f^-$, and so the decoding sets $D_c$, are often chosen by use of some statistical principle, e.g., maximum likelihood, but we shall not need any special assumption (possibilistic decoding strategies are described in [1,10]). The encoder $f^+$ will never be used in the sequel, and so its specification is irrelevant. The rate $R_n$ of a code $f$ with codebook $C$ is defined as

$$R_n = \frac{1}{n}\log |C|.$$

The number $\log |C|$ can be seen as the (not necessarily integer) binary length of the uncoded messages, the ones which carry information; then the rate $R_n$ is interpreted as a transmission speed, which is measured in information bits (bit fractions, rather) per transmitted bit. The idea is to design codes which are fast and reliable at the same time. Once a reliability criterion has been chosen, one tries to find the optimal code for each pre-assigned codeword length $n$, i.e., a code with highest rate among those which meet the criterion.

Let us consider a stationary and memoryless channel $W^n$, or SML channel, as defined in (2.2). To declare a code $f$ reliable, one requires that the probability that the output sequence does not belong to the correct decoding set is acceptably low, i.e., below a pre-assigned threshold $\varepsilon$, $0 \le \varepsilon < 1$. If one wants to play safe, one has to insist that the decoding error should be low for each codeword $c \in C$ which might have been transmitted (a looser criterion will be examined in Section 6). The reliability criterion which a code $f$ must meet is so:

$$\max_{c \in C} W^n(\neg D_c \mid c) \le \varepsilon. \qquad (5.1)$$

We recall that the symbol $\neg$ denotes negation, or set-complementation; of course the inequality sign in (5.1) can be replaced by an equality sign whenever $\varepsilon = 0$. Once the length $n$ and the threshold $\varepsilon$ are chosen, one can try to determine the rate $R_n = R_n(W, \varepsilon)$ of an optimal code which solves the optimization problem:

Maximize the code rate $R_n$ so as to satisfy the constraint (5.1).

The job can be quite tough, however, and so one has often to be contented with the asymptotic value of the optimal rates $R_n$, which is obtained when the codeword length $n$ goes to infinity. This asymptotic value is called the $\varepsilon$-capacity of channel $W$. For $0 < \varepsilon < 1$ the capacity $C_\varepsilon$ is always the same, only the speed of convergence of the optimal rates to $C_\varepsilon$ is affected by the choice of $\varepsilon$. When one says "capacity" one refers by default to this positive $\varepsilon$-capacity; cf. [3] or [4].

Instead, when $\varepsilon = 0$ there is a dramatic change. In this case one uses the confoundability graph $G(W)$ associated with channel $W$; we recall that in $G(W)$ two vertices, i.e., two input letters $a$ and $a'$, are adjacent if and only if they are confoundable; cf. Section 3. If $W^n$ is seen as a stochastic matrix with $k^n$ rows headed to $A^n$ and $h^n$ columns headed to $B^n$, one can consider also the confoundability graph $G(W^n)$ for the $k^n$ input sequences of length $n$; two input sequences are confoundable, and so adjacent in the graph, when there is an output sequence which can be reached from any of the two with positive probability. If $C$ is a maximal independent set in $G(W^n)$ the limit of $n^{-1}\log |C|$ when $n$ goes to infinity is by definition the graph
capacity $C(G(W))$ of the confoundability graph $G(W)$.⁸

As easily checked, the codebook $C \subseteq A^n$ of an optimal code is precisely a maximal independent set of $G(W^n)$. Consequently, the zero-error capacity $C_0(W)$ of channel $W$ is equal to the capacity of the corresponding confoundability graph $G(W)$:

$$C_0(W) = C(G(W)).$$

The paper [19] which Shannon published in 1956 and which contains these results inaugurated zero-error information theory. Observe however that the last equality gives no real solution to the problem of assessing the zero-error capacity of the channel, but simply re-phrases it in a neat combinatorial language; actually, a single-letter expression of the zero-error capacity is so far unknown, at least in general ("single-letter" means that one is able to calculate the limit so as to get rid of the codeword length $n$). This unpleasant observation applies also to Theorem 5.1 below.

We now pass to a stationary and non-interactive channel $\Pi^{[n]}$, or SNI channel, as defined in (2.3). The reliability criterion (5.1) is correspondingly replaced by:

$$\max_{c \in C} \Pi^{[n]}(\neg D_c \mid c) \le \varepsilon. \qquad (5.2)$$

The optimization problem is now:

Maximize the code rate $R_n$ so as to satisfy the constraint (5.2).

The number $\varepsilon$ is now the error possibility which we are ready to accept. Again the inequality sign in (5.2) is to be replaced by the equality sign when $\varepsilon = 0$. A looser criterion based on average error possibility rather than maximal error possibility will be examined in Section 6.

Definition 5.1. The $\varepsilon$-capacity of channel $\Pi$ is the limit of optimal code rates $R_n(\Pi, \varepsilon)$, obtained as the codeword length $n$ goes to infinity.

The following lemma is soon obtained from Lemmas 3.2 and 3.3, and in its turn soon implies Theorem 5.1; it states that possibilistic coding and zero-error probabilistic coding are different formulations of the same mathematical problem.

Lemma 5.1. Let the SML channel $W$ and the SNI channel $\Pi$ be $\varepsilon$-equivalent. Then a code $f = (f^+, f^-)$ satisfies the reliability criterion (5.1) at zero error for the probabilistic channel $W$ if and only if it satisfies the reliability criterion (5.2) at $\varepsilon$-error for the possibilistic channel $\Pi$.

Theorem 5.1. The codebook $C \subseteq A^n$ of an optimal code for criterion (5.2) is a maximal independent set of $G_\varepsilon(\Pi^{[n]})$. Consequently, the $\varepsilon$-capacity of the possibilistic channel $\Pi$ is equal to the capacity of the corresponding $\varepsilon$-confoundability graph $G_\varepsilon(\Pi)$:

$$C_\varepsilon(\Pi) = C(G_\varepsilon(\Pi)).$$

Observe that the specification of the decoding sets $D_c$ of an optimal code (and so of the decoding strategy) is obvious: one decodes $y$ to the unique codeword $c$ for which $\Pi^{[n]}(y \mid c) > \varepsilon$; there cannot be two codewords with this property, because they would be $\varepsilon$-confoundable, and this would violate independence. If $\Pi^{[n]}(y \mid c) \le \varepsilon$ for all $c \in C$, then $y$ can be assigned to any decoding set, this choice being irrelevant from the point of view of criterion (5.2).

Below we stress explicitly the obvious fact that the graph capacity $C_\varepsilon(\Pi) = C(G_\varepsilon(\Pi))$ is a stepwise non-decreasing function of $\varepsilon$, $0 \le \varepsilon < 1$; the term "consecutive" refers to an ordering of the distinct components $\pi_i$ which appear in $\Pi$ ($\pi_i$ can be zero even if zero does not appear as an entry in $\Pi$):

Proposition 5.1. If $0 \le \varepsilon < \varepsilon' < 1$, then $C_\varepsilon(\Pi) \le C_{\varepsilon'}(\Pi)$. If $\pi_i < \pi_{i+1}$ are two consecutive entries in $\Pi$, then $C_\varepsilon(\Pi)$ is constant for $\pi_i \le \varepsilon < \pi_{i+1}$.
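
The step-wise behaviour stated in Proposition 5.1 can be observed on a small case. The sketch below builds the ε-confoundability graph of a possibilistic matrix, using the reading suggested by the decoding remark above: two input letters are taken to be ε-confoundable when some output letter is reachable from both with possibility greater than ε. It then finds the independence number by brute force. Everything concrete here is an assumption made for illustration: the function names, the five-letter cyclic matrix and the values 2/3 and 1/3 are ours. Note also that the logarithm of the single-letter independence number is only a lower bound on the graph capacity (words built over an independent set of letters form an independent set of sequences), and the bound need not be tight.

```python
from fractions import Fraction
from itertools import combinations
from math import log2


def confoundable(Pi, a, b, eps):
    """True if some output letter has possibility > eps from both inputs a and b."""
    return any(min(x, y) > eps for x, y in zip(Pi[a], Pi[b]))


def independence_number(Pi, eps):
    """Largest set of input letters no two of which are eps-confoundable (brute force)."""
    k = len(Pi)
    edges = {pair for pair in combinations(range(k), 2) if confoundable(Pi, *pair, eps)}
    for size in range(k, 0, -1):
        for subset in combinations(range(k), size):
            if all(pair not in edges for pair in combinations(subset, 2)):
                return size
    return 0


# Hypothetical 5 x 5 matrix: each row is a cyclic shift of (1, 2/3, 1/3, 0, 0).
row = [Fraction(1), Fraction(2, 3), Fraction(1, 3), Fraction(0), Fraction(0)]
Pi = [row[-i:] + row[:-i] for i in range(5)]

for eps in (Fraction(0), Fraction(1, 3), Fraction(2, 3)):
    alpha = independence_number(Pi, eps)
    print(f"eps = {eps}: independence number {alpha}, capacity >= {log2(alpha):.3f}")
```

As the threshold grows, edges can only disappear from the graph, so the independence number, and with it the lower bound on the capacity, can only grow; it changes only when the threshold crosses one of the entries of the matrix.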
⁸ We recall that an independent set in a graph, called also a stable set, is a set of vertices no two of which are adjacent, and so in our case the vertices of an independent set are never confoundable; all these graph-theoretic notions, inclusive of graph capacity, are explained more diffusely in Appendix A.

Example 5.1. Binary possibilistic channels. The input alphabet is binary, $A = \{0, 1\}$, the output alphabet is either the same ("doubly" binary channel), or is augmented by an erasure symbol 2 (binary erasure channel); the corresponding possibilistic matrices $\Pi_1$ and
$\Pi_2$ are given below:

$$\begin{array}{c|cc} \Pi_1 & 0 & 1 \\ \hline 0 & 1 & \delta \\ 1 & \gamma & 1 \end{array} \qquad\qquad \begin{array}{c|ccc} \Pi_2 & 0 & 2 & 1 \\ \hline 0 & 1 & \delta & 0 \\ 1 & 0 & \gamma & 1 \end{array}$$

with $0 < \delta \le \gamma < 1$. As soon checked, one has for the proximities between input letters: $\sigma_1(0,1) = \gamma$ in the case of the doubly binary channel and $\sigma_2(0,1) = \delta \le \gamma$ in the case of the erasure channel. The relevant confoundability graphs are $G_0(\Pi_1) = G_0(\Pi_2)$, where the input letters 0 and 1 are adjacent, and $G_\gamma(\Pi_1) = G_\gamma(\Pi_2)$, where they are not. One has $C_0(\Pi_1) = C_0(\Pi_2) = 0$, $C_\gamma(\Pi_1) = C_\gamma(\Pi_2) = 1$, and so $C_\varepsilon(\Pi_1) = C_\varepsilon(\Pi_2) = 0$ for $0 \le \varepsilon < \delta$, $C_\varepsilon(\Pi_1) = 0 < C_\varepsilon(\Pi_2) = 1$ for $\delta \le \varepsilon < \gamma$, else $C_\varepsilon(\Pi_1) = C_\varepsilon(\Pi_2) = 1$. Some of these intervals may vanish when the transition possibilities $\gamma$ and $\delta$ are allowed to be equal and to take on also the values 0 and 1. Data transmission is feasible when the corresponding capacity is positive. In this case, however, the capacity is "too high" to be interesting, since a capacity equal to 1 in the binary case means that the reliability criterion is so loose that no data protection is required: for fixed codeword length $n$, the optimal codebook is simply $C = A^n$. Actually, whenever the input alphabet is binary, one is necessarily confronted with two limit situations which are both uninteresting: either the confoundability graph is complete and the capacity is zero (i.e., the reliability criterion is so demanding that reliable transmission of data is hopeless), or the graph is edge-free and the capacity is maximal (the reliability criterion is so undemanding that data protection is not needed). In Section 7 we shall hint at interactive models for possibilistic channels which might prove to be interesting also in the binary case; cf. Remark 7.1.

Example 5.2. A "rotating" channel. Take $k = 5$; the quinary input and output alphabet is the same; the possibilistic matrix $\Pi$ "rotates" the row-vector $(1, \gamma, \delta, 0, 0)$ in which $0 < \delta < \gamma < 1$:

$$\begin{array}{c|ccccc} & a_1 & a_2 & a_3 & a_4 & a_5 \\ \hline a_1 & 1 & \gamma & \delta & 0 & 0 \\ a_2 & 0 & 1 & \gamma & \delta & 0 \\ a_3 & 0 & 0 & 1 & \gamma & \delta \\ a_4 & \delta & 0 & 0 & 1 & \gamma \\ a_5 & \gamma & \delta & 0 & 0 & 1 \end{array}$$

After setting by circularity $a_6 = a_1$, $a_7 = a_2$, one has: $\sigma(a_i, a_i) = 1 > \sigma(a_i, a_{i+1}) = \gamma > \sigma(a_i, a_{i+2}) = \delta$, $1 \le i \le 5$. Capacities can be computed as explained in Appendix A: $C_0(\Pi) = 0$ (the corresponding graph is complete), $C_\delta(\Pi) = \log\sqrt{5}$ (the pentagon graph pops up), $C_\gamma(\Pi) = \log 5$ (the corresponding graph is edge-free). So $C_\varepsilon(\Pi) = 0$ for $0 \le \varepsilon < \delta$, $C_\varepsilon(\Pi) = \log\sqrt{5}$ for $\delta \le \varepsilon < \gamma$, else $C_\varepsilon(\Pi) = \log 5$.

6. Average-error capacity versus maximal-error capacity

Before discussing an interpretation and an application of the possibilistic approach, we indulge in one more "technical" section. In the standard theory of probabilistic coding the reliability criterion (5.1) is often replaced by the looser criterion:

$$\frac{1}{|C|}\sum_{c \in C} W^n(\neg D_c \mid c) \le \varepsilon$$

which requires that the average probability of error, rather than the maximal probability, be smaller than $\varepsilon$ so as to be declared acceptable. Roughly speaking, one no longer requires that all codewords perform well, but is contented whenever "most" codewords do so, and so resorts to an arithmetic mean rather than to a maximum operator. The new criterion being looser for $\varepsilon > 0$, higher rates can be achieved; however one proves that the gain evaporates asymptotically (cf., e.g., [4]). So, the average-error $\varepsilon$-capacity and the maximal-error $\varepsilon$-capacity (the only one we have considered so far) are in fact identical. We shall pursue a similar approach also in the case of possibilistic
channels, and adopt the reliability criterion:

$$\frac{1}{|C|}\sum_{c \in C} \Pi^{[n]}(\neg D_c \mid c) \le \varepsilon \qquad (6.1)$$

rather than (5.2). The corresponding optimization problem is:

Maximize the code rate $R_n$ so as to satisfy the constraint (6.1).

For $\varepsilon > 0$ one can achieve better rates than in the case of maximal error, as the following example shows.

Example 6.1. We re-take the $2 \times 2$ matrix $\Pi$ of Example 2.1, which is basically the matrix $\Pi_1$ of Example 5.1 when $\delta = 0$. We choose $n \ge 1$ and adopt the reliability criterion (5.2) which involves the maximal decoding error. For $0 \le \varepsilon < \gamma$ the graph $G_\varepsilon(\Pi)$ is complete and so is also $G_\varepsilon(\Pi^{[n]})$. Maximal independent sets are made up by just one sequence: the optimal rate is as low as 0; in practice this means that no information is transmittable at that level of reliability. Let us pass instead to the reliability criterion (6.1) which involves the average decoding error. Let us take a codebook whose codewords are all the $2^n$ sequences in $A^n$; each output sequence is decoded to itself. The rate of this code is as high as 1. It is easy to check that the decoding error possibility for each transmitted sequence $c$ is always equal to $\gamma$, except when $c = aa \ldots a$ is sent, in which case the error possibility is zero. This means that with an error possibility $\varepsilon$ such that

$$\gamma\,\frac{2^n - 1}{2^n} \le \varepsilon < \gamma$$

the optimal rate is 0 for criterion (5.2) while it is 1 for criterion (6.1). Observe that the interval where the two optimal rates differ evaporates as $n$ increases.

In analogy to the maximal-error $\varepsilon$-capacity $C_\varepsilon(\Pi)$, the average-error $\varepsilon$-capacity is defined as follows:

Definition 6.1. The average-error $\varepsilon$-capacity $\bar{C}_\varepsilon(\Pi)$ of channel $\Pi$ is the limit of code rates $\bar{R}_n(\Pi, \varepsilon)$ optimal with respect to criterion (6.1), obtained as the codeword length $n$ goes to infinity.

(Our result below will make it clear that such a limit does exist.) We shall prove below that also in the possibilistic case the maximal-error capacity and the average-error capacity coincide for all $\varepsilon$. We stress that, unlike Theorem 5.1, Theorem 6.1 is not solved by simply re-cycling a result already available in the probabilistic framework (even if the "expurgation" technique used below is a standard tool of Shannon theory). This shows that the possibilistic framework is strictly larger than the zero-error probabilistic framework, as soon as one allows possibility values which are intermediate between zero and one.

From now on we shall assume $\varepsilon \neq 0$, else (5.2) and (6.1) become one and the same criterion, and there is nothing new to say. Clearly, (6.1) being a looser criterion, the average-error possibility of any pre-assigned code cannot be larger than the maximal-error possibility, and so the average-error $\varepsilon$-capacity of the channel cannot be smaller than the maximal-error $\varepsilon$-capacity: $\bar{C}_\varepsilon(\Pi) \ge C_\varepsilon(\Pi)$. The theorem below will be proven by showing that also the inverse inequality holds true.

Theorem 6.1. The average-error $\varepsilon$-capacity $\bar{C}_\varepsilon(\Pi)$ and the maximal-error $\varepsilon$-capacity $C_\varepsilon(\Pi)$ of the SNI possibilistic channel $\Pi$ coincide for whatever admissible error possibility $\varepsilon$, $0 \le \varepsilon < 1$:

$$\bar{C}_\varepsilon(\Pi) = C_\varepsilon(\Pi).$$

Proof. Let us consider an optimal code which satisfies the reliability criterion (6.1) for fixed codeword length $n$ and fixed tolerated error possibility $\varepsilon > 0$; since the code is optimal, its codebook $C$ has maximal size $|C|$. Let

$$\pi_1 = 0 < \pi_2 < \cdots < \pi_r = 1 \qquad (6.2)$$

be the distinct components which appear as entries in the possibilistic matrix $\Pi$ which specifies the transition possibilities, and so specifies the possibilistic behaviour of the channel we are using; $r > 1$. Fix codeword $c$: we observe that the error possibility for $c$, i.e., the possibility that $c$ is incorrectly decoded, is necessarily one of the values which appear in (6.2), as it is derived from those values by using maximum and minimum operators (we add $\pi_1 = 0$ even if 0 is not to be found in $\Pi$). This allows us to partition the codebook $C$ into $r$ classes $C_i$, $1 \le i \le r$, by putting into the same class $C_i$ those codewords $c$ whose error possibility is equal precisely to $\pi_i$ (some of the classes $C_i$
may be void). The reliability criterion (6.1) satisfied by our code can be re-written as:

$$\sum_{i=1}^{r} \pi_i\,\frac{|C_i|}{|C|} \le \varepsilon. \qquad (6.3)$$

We can now think of a non-negative random variable $X$ which takes on the values $\pi_i$, each with probability $|C_i|/|C|$; to this random variable $X$ we shall apply the well-known Markov inequality (cf., e.g., [4]), which is written as:

$$\mathrm{Prob}\{X \ge \beta\bar{X}\} \le \frac{1}{\beta},$$

where $\bar{X}$ is the expectation of $X$, i.e., the first side of (6.3), while $\beta$ is any positive number. Because of (6.3), which can be written also as $\bar{X} \le \varepsilon$, one has a fortiori:

$$\mathrm{Prob}\{X \ge \beta\varepsilon\} \le \frac{1}{\beta}$$

or, equivalently:

$$\sum_{i:\,\pi_i \ge \beta\varepsilon} |C_i| \le \frac{|C|}{\beta}.$$

Now we choose $\beta$ and set it equal to:

$$\beta = \frac{\varepsilon + \pi_j}{2\varepsilon},$$

where $\pi_j$ is the smallest value in (6.2) such as to be strictly greater than $\varepsilon$. With this choice one has $\beta > 1$; we stress that $\beta$ is a constant once $\varepsilon$ and $\Pi$ are chosen; in particular $\beta$ does not depend on $n$. The last summation can be now taken over those values of $i$ for which:

$$\pi_i \ge \frac{\varepsilon + \pi_j}{2}$$

i.e., since there is no $\pi_i$ left between $\varepsilon$ and $\pi_j$, the inequality can be re-written as:

$$\sum_{i:\,\pi_i > \varepsilon} |C_i| \le \frac{|C|}{\beta}.$$

The $r$ classes $C_i$ are disjoint and give a partition of $C$; so, if one considers those classes $C_i$ for which the error possibility $\pi_i$ is at most $\varepsilon$, one can equivalently write:

$$\sum_{i:\,\pi_i \le \varepsilon} |C_i| \ge \frac{\beta - 1}{\beta}\,|C|. \qquad (6.4)$$

Now, the union of the classes on the left side of (6.4) can be used as the codebook of a new code with maximal error possibility $\le \varepsilon$. It will be enough to modify the decoder by enlarging in whatever way the decoding sets $D_c$ with error possibility $\Pi^{[n]}(\neg D_c \mid c) = \pi_i \le \varepsilon$, so as to cover $B^n$; by doing so the error possibility cannot become larger. Of course the new code need not be optimal in the class of all codes which satisfy criterion (5.2) for fixed $n$ and $\varepsilon$; so for its rate $R_n^*$ one has:

$$R_n^* = \frac{1}{n}\log \sum_{i:\,\pi_i \le \varepsilon} |C_i| \le R_n, \qquad (6.5)$$

where $R_n$ is the optimal rate with respect to criterion (5.2) relative to maximal error. On the other hand, in terms of the rate $\bar{R}_n = n^{-1}\log |C|$ optimal with respect to criterion (6.1) relative to average error, (6.4) can be re-written as:

$$2^{nR_n^*} \ge \frac{\beta - 1}{\beta}\,2^{n\bar{R}_n}. \qquad (6.6)$$

In (6.6) the term $(\beta - 1)/\beta$ is a positive constant which belongs to the open interval $]0, 1[$, and so its logarithm is negative. Comparing (6.5) and (6.6), and recalling that $R_n \le \bar{R}_n$:

$$\bar{R}_n + \frac{1}{n}\log\frac{\beta - 1}{\beta} \le R_n^* \le R_n \le \bar{R}_n.$$

One obtains the theorem by going to the limit.

7. An interpretation of the possibilistic model based on distortion measures

We have examined a possibilistic model of data transmission and coding which is inspired by the standard probabilistic model: what we did is simply replacing probabilities by possibilities and independence by non-interactivity, a notion which is often seen as the "right" analogue of probabilistic independence in possibility theory. In this section we shall try to give an interpretation of our possibilistic approach. The example of an application to the design of telephone keyboards will be given.

We concentrate on noisy channels and codes for correcting transmission errors; we shall consider sources at the end of the section. The idea is that in some cases statistical likelihood may be effectively
replaced by what one might call "structural resemblance". Suppose that a "grapheme" is sent through a noisy channel which we are unable to describe in all statistical details. A distorted grapheme will be received at the other end of the channel; the repertoire of input graphemes and of output graphemes are supposed to be both finite. We assume that it is plausible⁹ that the grapheme which has been received has a small distortion, or even no distortion at all, from the grapheme which has been sent over the channel; large distortions are instead unplausible. Without real loss of generality we shall "norm" the distortions to the interval $[0, 1]$, so that the occurrence of distortion one is seen as quite unplausible. Correspondingly, the one-complement of the distortion can be seen as an index of "structural resemblance" between the input symbol and the output symbol; with high plausibility this index will have a high value. We shall assign a numeric value to the plausibility by setting it equal precisely to the value of the resemblance index; in other words, we assume the "equality":

plausibility = structural resemblance. (7.1)

⁹ The term plausibility is a technical term of evidence theory; actually, possibilities can be seen as very special plausibilities; cf., e.g., [12]. So, the adoption of a term which is akin to "possibility" is more committal than it may seem at first sight.

Long sequences of graphemes will be sent through the channel. The distortion between the input sequence $x$ and the output sequence $y$ will depend on the distortions between the single graphemes $x_i$ and $y_i$ which make up the sequences; to specify how this happens, we shall take inspiration from rate-distortion theory, which is shortly reviewed in Appendix B; cf. also [3] or [4]. We recall here how distortion measures are defined. One is given two alphabets, the primary alphabet $A$ and the secondary alphabet $B$, which in our case will be the alphabet of possible input graphemes and the alphabet of possible output graphemes, respectively. A distortion measure $d$ is given which specifies the distortions $d(a, b)$ between each primary letter $a \in A$ and each secondary letter $b \in B$; for each primary letter $a$ there is at least one secondary letter $b$ such that $d(a, b) = 0$, which perfectly reproduces $a$. Distortions $d(a, b)$ are always non-negative, but in our case they are also constrained not to exceed 1. The distortion between letters $a$ and $b$ can be extended to a distortion between sequences $x \in A^n$ and $y \in B^n$ in several ways; one resorts, e.g., to peak distortion:

$$d_n^*(x, y) = \max_{1 \le i \le n} d(x_i, y_i) \qquad (7.2)$$

or, more commonly and less demandingly, to average distortion:

$$d_n(x, y) = \frac{1}{n}\sum_{1 \le i \le n} d(x_i, y_i). \qquad (7.3)$$

Let us be very demanding and adopt peak distortion: structurally two sequences resemble each other only when they do so in each position. Following the philosophy of the equality (7.1) above, where the term "plausibility" has been replaced by the more specific term "transition possibility" and where the resemblance is interpreted as the one-complement of the distortion, we set:

$$\pi(b \mid a) = 1 - d(a, b); \qquad \Pi^{[n]}(y \mid x) = 1 - d_n^*(x, y) = \min_{1 \le i \le n} \pi(y_i \mid x_i). \qquad (7.4)$$

This corresponds precisely to a stationary and non-interactive channel $\Pi$.

To make our point we now examine a small-scale example. We assume that sequences of circled graphemes out of the alphabet A = {⊕, ⊗, ⊖, ⊘, ○} are sent through a channel. Because of noise, some of the bars inside the circle can be erased during transmission; instead, in our model the channel cannot add any bars, and so the repertoire of the output graphemes is a superset of A: B = A ∪ {⦶, ⦸}. We do not have any statistical information about the behaviour of the channel; we shall be contented with the following "linguistic judgements":

It is quite plausible that a grapheme is received as it has been sent
It is pretty plausible that a single bar has been erased
It is pretty unplausible that two bars have been erased
Everything else is quite unplausible

We shall "numerize" our judgements by assigning the numeric values 1, 2/3, 1/3, 0 to the corresponding possibilities. This is the same as setting the distortions $d(a, b)$ proportional to the number of bars which have been deleted during transmission. Our choice is
enough to specify a possibilistic channel $\Pi$, whose matrix is given below; zeroes have not been written to help readability. Underneath $\Pi$ we have written the matrix $\sigma$ which specifies the proximities between the input graphemes; since $\sigma$ is symmetric, i.e., $\sigma(a, a') = \sigma(a', a)$, we have written only the upper triangle; cf. the definition of $\sigma$ in Section 3.

 Π | ⊕    ⊗    ⊖    ⦶    ⊘    ⦸    ○
---+----------------------------------
 ⊕ | 1         2/3  2/3            1/3
 ⊗ |      1              2/3  2/3  1/3
 ⊖ |           1                   2/3
 ⊘ |                     1         2/3
 ○ |                               1

 σ | ⊕    ⊗    ⊖    ⊘    ○
---+------------------------
 ⊕ | 1    1/3  2/3  1/3  1/3
 ⊗ |      1    1/3  2/3  1/3
 ⊖ |           1    2/3  2/3
 ⊘ |                1    2/3
 ○ |                     1

Both the components of $\Pi$ and those of $\sigma$ are in their own way "resemblance indices". However, those in $\Pi$ specify the structural resemblance between an input grapheme $a$ and an output grapheme $b$; this resemblance is 1 when $b$ equals $a$, is 2/3 when $b$ can be obtained from $a$ by deletion of a single bar, is 1/3 when $b$ can be obtained from $a$ by deleting two bars, and is 0 when $b$ cannot be obtained from $a$ in any of these ways. Instead the components of $\sigma$ specify how easy it is to confound input graphemes at the other end of the channel: $\sigma(a, a') = 1$ means $a = a'$, $\sigma(a, a') = 2/3$ means that $a$ and $a'$ are different, but there is at least an output grapheme which can be obtained by deletion of a single bar in $a$, or in $a'$, or in both, $\sigma(a, a') = 1/3$ means that one has to delete at least two bars from one of the input graphemes, or from both, to reach a common output grapheme. Assuming that the channel $\Pi$ is stationary and non-interactive means that we are adopting a very strict criterion to evaluate the "structural resemblance" between input sequence and output sequence; this criterion corresponds to peak distortion, as explained above.

Let us consider coding. We use the proximity matrix $\sigma$ to construct the $\varepsilon$-confoundability graph $G_\varepsilon(\Pi)$, which was defined in Section 3. If $0 \le \varepsilon < 1/3$ the $\varepsilon$-confoundability graph is complete and the $\varepsilon$-capacity of the channel is 0: this means that the reliability criterion (5.2) is so strict that no data transmission is feasible. For $\varepsilon \ge 2/3$ the graph is edge-free and so the $\varepsilon$-capacity is $\log 5$: this means that the reliability criterion (5.2) is so loose that the channel $\Pi$ behaves essentially as noise-free. Let us proceed to the more interesting case $1/3 \le \varepsilon < 2/3$; actually, one can take $\varepsilon = 1/3$ (cf. Proposition 5.1). A maximal independent set $I$ in $G_{1/3}(\Pi)$ is made up by the three "vertices" ⊕, ⊗ and ○, as soon checked. Using the notions explained in the appendix, and in particular the inequalities (A.1), one soon shows that the $3^n$ sequences of $I^n$ give a maximal independent set in $G_{1/3}(\Pi^{[n]})$. Fix codeword length $n$; as stated by Theorem 5.1, an optimal codebook is $C = I^n$ and so the optimal code rate is $\log 3$, which is also the value of the capacity $C_{1/3}(\Pi)$. When one uses such a code, a decoding error occurs only when at least one of the $n$ graphemes sent over the channel loses at least two bars, an event which has been judged to be pretty unplausible.

We give the example of an application. Think of the keys in a digital keyboard, as the one of the author's telephone, say, in which digits from 1 to 9 are arranged on a $3 \times 3$ grid, left to right, top row to bottom row, while digit 0 is positioned below digit 8. It may happen that, when a telephone number is keyed in, the wrong key is pressed (because of "channel noise"). We assume the following model of the "noisy channel", in which possibilities are seen as numeric labels for vague linguistic judgements:

(i) it is quite plausible that the correct key is pressed (possibility 1);
(ii) it is pretty plausible that one inadvertently presses a "neighbour" of the correct key, i.e., a key which is positioned on the same row or on the same column and is contiguous to the correct key (possibility 2/3);
(iii) it is pretty implausible that the key one presses is contiguous to the correct key, but is positioned on the same diagonal (possibility 1/3);
(iv) everything else is quite implausible (possibility 0).

When the wrong key is pressed, we shall say that a cross-over of type (ii), of type (iii), or of type (iv) has taken place, according to whether its possibility is 2/3, 1/3, or 0. Using these values¹⁰ one can construct a possibilistic matrix $\Pi$ with the input and the output alphabet both equal to the set of the 10 keys. One has, for example: $\Pi(a|1) = 2/3$ for $a \in \{2, 4\}$, $\Pi(a|1) = 1/3$ for $a = 5$, $\Pi(a|1) = 0$ for $a \in \{3, 6, 7, 8, 9, 0\}$. As for the proximity between keys a and b, it is equal to 2/3 whenever either keys a and b are neighbours as in (ii), or there is a third key c which is a common neighbour of both. One has, for example: the proximity between 1 and a is 2/3 for $a \in \{2, 3, 4, 5, 7\}$; instead, it is 1/3 for $a \in \{6, 8, 9\}$ and 0 for a = 0. A codebook is a set of admissible telephone numbers of length n; since a phone number is wrong whenever there is a collision with another phone number in a single digit, it is natural to assume that the “noisy channel” $\Pi$ is non-interactive. This example had been suggested to us by J. Körner; however, at least in principle, in the standard probabilistic setting one would have to specify three stochastic matrices such as to be 0-, 1/3- and 2/3-equivalent with $\Pi$. In these matrices only the opposition zero/non-zero would count; their entries would have no empirical meaning, and no significant relation with the stochastic matrix of the probabilities with which errors are actually committed by the hand of the operator. So, the adoption of a “hard” probabilistic model is in this case pretty unnatural. Instead, in a “soft” possibilistic approach one specifies just one possibilistic matrix $\Pi$, which contains precisely the information which is needed and nothing more.

¹⁰ Adopting a different “numerization” for the transition possibilities (or, equivalently, for the distortions) does not make any real difference from the point of view of criterion (5.2), provided the order is preserved and the values 0 and 1 are kept fixed. Instead, arithmetic averages as in criterion (6.1) have no sort of insensitivity to order-preserving transformations; criterion (6.1) might prove to be appropriate in a situation where one interprets possibilities in some other way (recall that possibilities can be viewed as a special case of plausibilities, which in their turn can be viewed as a special case of upper probabilities; cf., e.g., [22]). In (iv) we might have chosen a “negligible” positive value, rather than 0: again, this would have made no serious difference, save adding a negligible initial interval where the channel capacity would have been zero.

¹¹ When the value of an asymptotic functional (channel capacity, say, or source entropy, or the rate-distortion function as in Appendix B) is reached already for n = 1, its computation is easy, but, unfortunately, this is so because the situation is so hopeless that one is obliged to use trivial code constructions. By the way, this is always the case when one tries to compress possibilistic sources without distortion, as in Section 4. The interesting situations correspond instead to cases when the computation of the asymptotic functional is difficult, as for the pentagon, or even unfeasible, as for the heptagon (cf. Appendix A).

Unfortunately, the author’s telephone is not especially promising. Let us adopt criterion (5.2). If the allowed error possibility of the code is 2/3 (or more), the confoundability graph is edge-free and no error protection is required. If we choose the error possibility $\epsilon = 1/3$, we have $C_{1/3}(\Pi) = \log \alpha(G_{1/3}(\Pi)) = \log 3$; in other words the 1/3-capacity, which is an asymptotic¹¹ parameter, is reached already for n = 1. To see this use the inequalities (A.1) of Appendix A: the independence number of $G_{1/3}(\Pi)$ is 3, and a maximal independent set of keys, which are far enough from each other so as not to be confoundable, is {0, 1, 6}, as easily checked; however, one checks that 3 is also the chromatic number of the complementary graph. In practice, this means that an optimal codebook as in Theorem 5.1 may be constructed by juxtaposition of the input “letters” 0, 1, 6; the code is disappointing, since everything boils down to allowing only phone numbers which use keys 0, 1, 6. As for decoding, the output sequence y is decoded to the single codeword c for which $\Pi^{[n]}(y|c) > 1/3$; so, error correction is certainly successful if there have been no cross-overs of type (iii) and (iv). If, for example, one dials number 2244 rather than 1111, a successful error correction takes place; actually, $\Pi^{[4]}(2244|c) > 1/3$ only for c = 1111. If instead one is so clumsy as to dial the “pretty implausible” number 2225, this is incorrectly decoded to 1116. Take instead the more demanding threshold $\epsilon = 0$; the 0-capacity, as easily checked, goes down to log 2; the 0-error code remains as disappointing as the 1/3-error code, being obtained by allowing only phone numbers made up of “far-away” digits as are 0 and 1, say. The design of convenient keyboards
such that their possibilistic capacity is not obtained already for n = 1 is a graph-theoretic problem which may be of relevant practical interest in those situations when dialling an incorrect number may cause serious inconveniences. More generally, exhibiting useful finite-length code constructions would have a relation to the material of this paper, which is similar to the relation of coding theory (algebraic coding theory, say) to the asymptotic theory of coding (Shannon theory).

Remark 7.1. Rather than peak distortion, in (7.4) one might use average distortion. This would give rise to a stationary but definitely interactive channel for which:

$$\Pi_n(y|x) = 1 - d_n(x, y) = \frac{1}{n} \sum_{1 \le i \le n} \Pi(y_i|x_i).$$

We leave open the problem of studying such a channel and ascertaining its meaning for real-world data transmission. Actually, one might even define new distortions between sequences based on a different way of averaging single-letter distortions in the general sense of aggregation operators (the very broad notion of aggregation operators and averaging operators is covered, e.g., in [12] or [16]).

Now we pass to source coding and data compression. We shall pursue an interpretation of possibilistic SNI sources and possibilistic source coding which fits in with a meaning of the word “possible” to be found in the Oxford Dictionary of the English Language: possible = tolerable to deal with, i.e., acceptable, because it possesses all the qualities which are required.¹² Assume that certain items are accepted only if they pass n quality controls; each control i is given a numeric mark $\pi_i$ from 0 (totally unacceptable) to 1 (faultless); the marks which one can assign are chosen from a finite subset of [0, 1] of numbers which just stand for linguistic judgements. The quality control as a whole is passed only when all the n controls have been passed. As an example, let us take the source alphabet B equal to the alphabet of the seven graphemes output by the possibilistic channel $\Pi$ which has been considered above. The “items” will be sequences y of n graphemes, and the ith control will be made on the ith grapheme $y_i$. The possibility vector $\pi$ over the seven graphemes of B, in the order in which they are listed, will be:

$\pi = (1, 1, 1, 2/3, 1, 2/3, 1)$ over B = {⊕; ⊗; ; ◦ ; ; ◦\ ; ◦}.

A possibility smaller than 1 has been assigned to the two output graphemes which are not also input graphemes; in practice, vector $\pi$ has been obtained by taking the maximum of the entries in the columns of the possibilistic matrix $\Pi$ which describes the channel. When $\pi(b) = 1$ it is possible that the grapheme b has been received at the end of the channel exactly as it has been transmitted; when $\pi(b) = 2/3$ the grapheme b which has been received is necessarily distorted with respect to the input grapheme, and at least one bar has been erased during transmission.¹³ Let us fix a value $\epsilon$, $0 \le \epsilon < 1$, and rule out all the items whose possibility is $\le \epsilon$. Then the accepted items can be encoded by means of a possibilistic source code as in Section 4: each acceptable item is given a binary number whose length is $nH_\epsilon(\pi)$, or rather $\lceil nH_\epsilon(\pi) \rceil$, by rounding to the integer ceiling, i.e., to the smallest integer which is at least as large as n times the $\epsilon$-entropy of $\pi$. In our case, when $\epsilon \ge 2/3$ only the sequences which do not contain the two graphemes of possibility 2/3 (the fourth and the sixth in the list above) are given a codeword, and so $H_\epsilon(\pi) = \log 5$; instead when $\epsilon < 2/3$ all the sequences have their own codeword, and so $H_\epsilon(\pi) = \log 7$.

¹² An interpretation which may be worth pursuing is: degree of possibility = level of grammaticality. This may be interesting also in channel coding, in those situations when decoding errors are less serious when the encoded message has a low level of grammatical correctness.

¹³ As a matter of fact, we have been using a formula proposed in the literature in order to compute marginal output possibilities $\pi(b)$, when the marginal input possibilities $\pi(a)$ and the conditional possibilities $\Pi(b|a)$ are given, namely

$$\pi(b) = \max_{a \in A} \, [\pi(a) \wedge \Pi(b|a)],$$

the maximum being taken over all letters $a \in A$. In our case we have set all the input possibilities $\pi(a)$ equal to 1. The possibilistic formula is inspired by the corresponding probabilistic one, just replacing sums and products by maxima and minima, as is usual when one passes from probabilities to possibilities; cf., e.g., [5] or [11].
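The telephone keyboard example of this section can be checked by brute force. The sketch below is only illustrative and rests on assumptions that are not spelled out in this part of the text: the proximity of two keys is taken to be the largest value, over output keys c, of the smaller of the two transition possibilities; two distinct keys are taken to be adjacent in the $\epsilon$-confoundability graph when their proximity exceeds $\epsilon$; and key 0 is taken to be diagonally contiguous to keys 7 and 9. All function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations, product

# Key positions as (row, column): digits 1-9 on a 3 x 3 grid, 0 below 8.
POS = {1: (0, 0), 2: (0, 1), 3: (0, 2),
       4: (1, 0), 5: (1, 1), 6: (1, 2),
       7: (2, 0), 8: (2, 1), 9: (2, 2),
       0: (3, 1)}
KEYS = sorted(POS)


def transition(b, a):
    """Possibility that key b is pressed when key a is intended."""
    dr = abs(POS[a][0] - POS[b][0])
    dc = abs(POS[a][1] - POS[b][1])
    if dr == 0 and dc == 0:
        return Fraction(1)       # (i) the correct key
    if dr + dc == 1:
        return Fraction(2, 3)    # (ii) neighbour on the same row or column
    if dr == 1 and dc == 1:
        return Fraction(1, 3)    # (iii) contiguous on a diagonal
    return Fraction(0)           # (iv) everything else


def proximity(a, b):
    """Assumed definition: max over outputs c of min(Pi(c|a), Pi(c|b))."""
    return max(min(transition(c, a), transition(c, b)) for c in KEYS)


def confoundability_edges(eps):
    """Pairs of distinct keys whose proximity exceeds eps (assumed criterion)."""
    return {frozenset(p) for p in combinations(KEYS, 2) if proximity(*p) > eps}


def independence_number(edges):
    """Largest number of keys no two of which are joined by an edge."""
    for size in range(len(KEYS), 0, -1):
        for subset in combinations(KEYS, size):
            if all(frozenset(p) not in edges for p in combinations(subset, 2)):
                return size
    return 0


def chromatic_number_of_complement(edges):
    """Fewest colours such that keys NOT joined in `edges` get different colours."""
    for k in range(1, len(KEYS) + 1):
        for colours in product(range(k), repeat=len(KEYS)):
            col = dict(zip(KEYS, colours))
            if all(col[a] != col[b] for a, b in combinations(KEYS, 2)
                   if frozenset((a, b)) not in edges):
                return k


def decode(y, letters="016", eps=Fraction(1, 3)):
    """Codewords c over `letters` with min_i Pi(y_i|c_i) > eps."""
    hits = []
    for c in product(letters, repeat=len(y)):
        poss = min(transition(int(b), int(a)) for b, a in zip(y, c))
        if poss > eps:
            hits.append("".join(c))
    return hits


edges = confoundability_edges(Fraction(1, 3))
print(independence_number(edges), chromatic_number_of_complement(edges))
print(decode("2244"), decode("2225"))
```

Under these assumptions the independence number and the chromatic number of the complementary graph both come out as 3 for $\epsilon = 1/3$, so the two bounds in (A.1) coincide; 2244 is decoded to 1111 and 2225 to 1116, as described in the text.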
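The codeword lengths in the grapheme example can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming that the $\epsilon$-entropy of a possibility vector is the base-2 logarithm of the number of letters whose possibility exceeds $\epsilon$ (the expression recalled at the end of Appendix B); the names `eps_entropy` and `codeword_length` are hypothetical.

```python
from fractions import Fraction
from math import ceil, log2

# Possibility vector over the seven output graphemes, in the order of the text.
PI = [Fraction(1), Fraction(1), Fraction(1), Fraction(2, 3),
      Fraction(1), Fraction(2, 3), Fraction(1)]


def eps_entropy(pi, eps):
    """log2 of the number of letters whose possibility exceeds eps."""
    return log2(sum(1 for p in pi if p > eps))


def codeword_length(pi, eps, n):
    """Binary codeword length for accepted sequences of n letters."""
    return ceil(n * eps_entropy(pi, eps))


for eps in (Fraction(0), Fraction(1, 3), Fraction(2, 3)):
    print(eps, eps_entropy(PI, eps), codeword_length(PI, eps, 4))
```

The entropy only changes when $\epsilon$ crosses an entry of the possibility vector, which is why it is a step-function of $\epsilon$.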

Acknowledgements

We gladly acknowledge helpful discussions with F. Fabris on the relationship between possibilistic channels and distortion measures as used in probabilistic source coding.

Appendix A. Graph capacity

We consider only simple graphs, i.e., graphs without multiple edges and without loops; we recall that a graph is assigned by giving its vertices and its edges; each edge connects two (distinct) vertices which are then adjacent. If G is a graph, its complementary graph $\bar{G}$ has the same set of vertices, but two vertices are adjacent in $\bar{G}$ if and only if they are not adjacent in G. By $\alpha(G)$ and $\chi(G)$ we denote the independence number and the chromatic number of G, respectively. We recall that the independence number of a graph is the maximum size of a set of vertices none of which are adjacent (of an independent set, also called a stable set); the chromatic number of a graph is the minimum number of colours which can be assigned to its vertices in such a way that no two adjacent vertices have the same colour. From a graph G with k vertices one may wish to construct a “power graph” $G^n$ whose $k^n$ “vertices” are the vertex sequences of length n. Many such powers are described in the literature; of these we need the following one, sometimes called the strong power: two vertices $x = x_1 x_2 \ldots x_n$ and $u = u_1 u_2 \ldots u_n$ are adjacent in $G^n$ if and only if for each component i either $x_i$ and $u_i$ are adjacent in G or $x_i = u_i$, $1 \le i \le n$. The reason for choosing this type of power becomes clear when one thinks of confoundability graphs $G(W)$ and of $\epsilon$-confoundability graphs $G_\epsilon(\Pi)$ as defined in Section 3. Actually, one has:

$$G(W^n) = [G(W)]^n, \qquad G_\epsilon(\Pi^{[n]}) = [G_\epsilon(\Pi)]^n.$$

The first equality is obvious; the second is implied by the first and by Lemma 3.3: just take any stochastic matrix W which is $\epsilon$-equivalent to $\Pi$.

If G is a simple graph, its graph capacity $C(G)$, also called Shannon’s graph capacity, is defined as

$$C(G) = \lim_{n} \frac{1}{n} \log \alpha(G^n).$$

The limit always exists, as can be shown. It is rather easy to prove that

$$\log \alpha(G) \le \frac{1}{n} \log \alpha(G^n) \le \log \chi(\bar{G}) \qquad \text{(A.1)}$$

and so whenever $\alpha(G) = \chi(\bar{G})$ the graph capacity is very simply $C(G) = \log \alpha(G)$. Giving a single-letter characterization of graph capacity can however be a very tough problem, which is still unsolved in its generality [15]. We observe that the minimum value of the graph capacity is zero, and is reached when the graph is complete, i.e., has all the $\binom{k}{2}$ edges; the maximum value of the capacity of a graph with k vertices is log k, and is obtained when the graph is edge-free (has no edges at all). We also observe that “pure” combinatorialists prefer to define graph capacity as the limit of $\sqrt[n]{\alpha(G^n)}$, i.e., as $2^{C(G)}$.

Example A.1. Let us take the case of a polygon $P_k$ with k vertices. For k = 3, we have a triangle $P_3$; then $\alpha(P_3) = \chi(\bar{P}_3) = 1$ and the capacity $C(P_3)$ is zero. Let us go to the quadrangle $P_4$; then $\alpha(P_4) = \chi(\bar{P}_4) = 2$ and so $C(P_4) = 1$. In the case of the pentagon, however, $\alpha(P_5) = 2 < \chi(\bar{P}_5) = 3$. It was quite an achievement of Lovász to prove in 1979 that $C(P_5) = \log \sqrt{5}$, as long conjectured; the conjecture had resisted a proof for more than twenty years. The capacity of the heptagon $P_7$ is still unknown.

Appendix B. The possibilistic rate-distortion function

This appendix generalizes source coding as dealt with in Section 4 and is rather more technical than the body of the paper. The reader is referred to Section 4 for a description of the problem of source coding. In the case of source coding with distortion, besides the primary source alphabet A one has a secondary alphabet B, also called the reproduction alphabet, which is used to reproduce primary sequences. A distortion matrix d is given which specifies the distortions $d(a, b)$ between each primary letter $a \in A$ and each secondary letter $b \in B$; the numbers $d(a, b)$ are non-negative and for each primary letter a there is at least one secondary letter b such that $d(a, b) = 0$, i.e., such as to perfectly reproduce a. We recall that distortion measures have already been
used in Section 7; unlike in Section 7, here we do not require $d(a, b) \le 1$. The distortion between letters a and b is extended to average distortion between sequences $x \in A^n$ and $y \in B^n$ as we did in (7.3), or to peak distortion, also called maximal distortion, as in (7.2). Unlike in the case without distortion, here the decoder $f^-$ maps the binary codeword $f^+(x)$ to a secondary sequence $y \in B^n$ which should have an acceptably small distortion from the encoded primary sequence x. Let us denote by g the composition of encoder and decoder, $g(x) = f^-(f^+(x))$; the set of secondary sequences $C = g(A^n) \subseteq B^n$ which are used to reproduce the primary sequences is called the codebook of the code $f = (f^+, f^-)$. In practice the secondary sequence $y = g(x)$ is usually misinterpreted as if it were the codeword for the primary sequence x, and correspondingly the mapping g is called the encoder (this is slightly abusive, but the specification of $f^+$ and $f^-$ turns out to be irrelevant once g is chosen). The rate of the code is the number

$$R_n = \frac{\log |C|}{n}.$$

The numerator can be interpreted as the (not necessarily integer) length of the binary codewords output by the encoder stricto sensu $f^+$ and fed to the decoder $f^-$, and so the rate is the number of bits per primary letter. From now on we shall forget about $f^+$; the term “encoder” will refer solely to the mapping g which outputs secondary sequences $y \in B^n$.

Let us begin with the average distortion $d_n$, as is common in the probabilistic approach. One fixes a threshold $\Delta \ge 0$, a tolerated error probability $\epsilon \ge 0$, and requires that the following reliability criterion is satisfied:

$$P^n\{x : d_n(x, g(x)) > \Delta\} \le \epsilon. \qquad \text{(B.1)}$$

The encoder g should be constructed in such a way that the codebook $C \subseteq B^n$ be as small as possible, under the constraint that the reliability criterion which has been chosen is satisfied; for fixed n one can equivalently minimize the code rate:

Minimize the code rate $\frac{\log |C|}{n}$ so as to satisfy constraint (B.1).

For fixed $\Delta \ge 0$ and $\epsilon \ge 0$, one is interested in the asymptotic value of the optimal rates. For $\epsilon > 0$ one proves that the solution, i.e., the asymptotic value of optimal code rates, is given by the rate-distortion function

$$R_\epsilon(P, \Delta) = \min_{XY :\, E\,d(X, Y) \le \Delta} I(X \wedge Y), \quad \epsilon > 0 \qquad \text{(B.2)}$$

in whose expression at the right $\epsilon$ does not explicitly appear. Above, $I(X \wedge Y)$ is the mutual information¹⁴ of the random couple XY, X being a random primary letter output by the source according to the probability distribution P. The second random component Y of the random couple XY belongs to the secondary alphabet B, and so is a random secondary letter. The minimum is taken with respect to all random couples XY which are constrained to have an expected distortion $E\,d(X, Y)$ which does not exceed the threshold $\Delta$. The rate-distortion function does not look especially friendly; luckily the problem of its computation has been deeply investigated from a numeric viewpoint [4]. Observe however that, even if the computation of the rate-distortion function involves a minimization, there is no trace of n left and so its expression is single-letter, unlike in the case of graph capacity.

Let us proceed to zero-error coding with distortion. The problem of finding a single-letter expression for the asymptotic value of optimal rates is not at all trivial, but it has been solved; not surprisingly, this value turns out to depend only on the support of P, i.e., on whether the probabilities $P(a)$ of source letters a are zero or non-zero. More precisely, for $\epsilon = 0$ the asymptotic value is given by the zero-error rate-distortion function:

$$R_0(P, \Delta) = \max_{X :\, P(a) = 0 \Rightarrow P_X(a) = 0} \; \min_{XY :\, E\,d(X, Y) \le \Delta} I(X \wedge Y). \qquad \text{(B.3)}$$

Here the maximum is taken with respect to all random variables X whose support is (possibly strictly) included in the support of P, i.e., in the subset of letters a whose probability $P(a)$ is strictly positive; $P_X$ is the probability distribution of the random variable X.

¹⁴ The mutual information can be expressed in terms of Shannon entropies as $I(X \wedge Y) = H(X) + H(Y) - H(XY)$; it is seen as an index of dependence between the random variables X and Y, and assumes its lowest value, i.e., 0, if and only if X and Y are independent.
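The two ingredients of (B.2), the mutual information of footnote 14 and the expected distortion in the constraint, are easy to evaluate for a given random couple XY. The sketch below only evaluates them for a joint distribution supplied by the caller; it does not attempt the minimization in (B.2). The function names and the two sample distributions are illustrative assumptions.

```python
from math import log2


def entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * log2(p) for p in probs if p > 0)


def mutual_information(joint):
    """I(X ^ Y) = H(X) + H(Y) - H(XY), with joint[a][b] = Pr{X = a, Y = b}."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    pxy = [p for row in joint for p in row]
    return entropy(px) + entropy(py) - entropy(pxy)


def expected_distortion(joint, d):
    """E d(X, Y) for a joint distribution and a distortion matrix d[a][b]."""
    return sum(p * d[a][b]
               for a, row in enumerate(joint)
               for b, p in enumerate(row))


hamming = [[0, 1], [1, 0]]
independent = [[0.25, 0.25], [0.25, 0.25]]   # X and Y independent
identical = [[0.5, 0.0], [0.0, 0.5]]         # Y is a perfect copy of X

for joint in (independent, identical):
    print(mutual_information(joint), expected_distortion(joint, hamming))
```

The independent couple has zero mutual information but a positive expected distortion, while the perfect copy has zero distortion at the price of one full bit; the rate-distortion function trades off between such extremes.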
The minimum in (B.3) is to be compared with $R_\epsilon(P, \Delta)$ as in (B.2). In practice, one considers all the rate-distortion functions over the support of P, and then selects the largest value which has been obtained; as for numeric techniques which are available, cf. [4].

If one chooses the peak distortion $d^*_n$ rather than the average distortion $d_n$, the reliability criterion (B.1) and the definition of the rate-distortion function should be modified accordingly; in particular, the new reliability criterion is

$$P^n\{x : d^*_n(x, g(x)) > \Delta\} \le \epsilon. \qquad \text{(B.4)}$$

The asymptotic optimal rate for peak distortion $R^*_\epsilon(P, \Delta)$ has in general a higher value¹⁵ than $R_\epsilon(P, \Delta)$, since (B.4) is more demanding than (B.1). The expression of $R^*_\epsilon(P, \Delta)$ turns out to be the same as in (B.2) and (B.3), only replacing the constraint $E\,d(X, Y) \le \Delta$ which defines the minimization set by the more severe constraint $d(X, Y) \le \Delta$ (cf. [4]; by writing $d(X, Y) = 0$ we mean that the event $d(X, Y) = 0$ has probability 1, i.e., that the support of the random couple XY is made up only of couples (a, b) for which $d(a, b) = 0$).

If the source is an SNI source described by giving the possibility vector $\pi$ over primary letters, one can consider the same codes as before, but judge their reliability by referring to the new reliability criteria:

$$\pi^n\{x : d_n(x, g(x)) > \Delta\} \le \epsilon \qquad \text{(B.5)}$$

or, in the case of peak distortion:

$$\pi^n\{x : d^*_n(x, g(x)) > \Delta\} \le \epsilon \qquad \text{(B.6)}$$

to be compared with (B.1) and (B.4). The corresponding minimization problems are:

Minimize the code rate $\frac{\log |C|}{n}$ so as to satisfy constraint (B.5) or (B.6), respectively.

Definition B.1. The possibilistic rate-distortion function $R_\epsilon(\pi, \Delta)$ for average distortion and the possibilistic rate-distortion function $R^*_\epsilon(\pi, \Delta)$ for peak distortion are the limit of the rates $R_n$ of codes which are optimal for the criterion (B.5) or (B.6), respectively, as the length n goes to infinity; $0 \le \epsilon < 1$.

Lemma 3.1 immediately gives the following lemma:

Lemma B.1. If P and $\pi$ are $\epsilon$-equivalent, a code $f = (f^+, f^-)$ is optimal for criterion (B.1) at zero error if and only if it is optimal for criterion (B.5) at $\epsilon$-error; it is optimal for criterion (B.4) at zero error if and only if it is optimal for criterion (B.6) at $\epsilon$-error.

The following theorem is obtained from Lemma B.1 after a comparison with the expressions of $R_0(P, \Delta)$ and $R^*_0(P, \Delta)$:

Theorem B.1. The possibilistic rate-distortion functions $R_\epsilon(\pi, \Delta)$ and $R^*_\epsilon(\pi, \Delta)$, $0 \le \epsilon < 1$, are given by:

$$R_\epsilon(\pi, \Delta) = R_0(P, \Delta), \qquad R^*_\epsilon(\pi, \Delta) = R^*_0(P, \Delta)$$

for whatever P such as to be $\epsilon$-equivalent to $\pi$; more explicitly:

$$R_\epsilon(\pi, \Delta) = \max_{X :\, \pi(a) \le \epsilon \Rightarrow P_X(a) = 0} \; \min_{XY :\, E\,d(X, Y) \le \Delta} I(X \wedge Y),$$

$$R^*_\epsilon(\pi, \Delta) = \max_{X :\, \pi(a) \le \epsilon \Rightarrow P_X(a) = 0} \; \min_{XY :\, d(X, Y) \le \Delta} I(X \wedge Y).$$

¹⁵ We recall that peak distortion can be taken back to coding with average distortion with a threshold equal to zero; this is true no matter whether the source is probabilistic or possibilistic. Actually, if one sets

$$\delta(a, b) = 0 \text{ iff } d(a, b) \le \Delta, \text{ else } \delta(a, b) = d(a, b)$$

the inequality $d^*_n(x, y) \le \Delta$ is clearly equivalent to the equality $\delta_n(x, y) = 0$. So, coding at distortion level $\Delta$ with peak distortion is the same as coding at distortion level zero with average distortion, after replacing the old distortion measure d by the new distortion measure $\delta$. The case of average distortion with $\Delta = 0$ and the general case of peak distortion with any $\Delta \ge 0$ can be both couched into the inspiring mould of graph theory; then the rate-distortion function is rather called the hypergraph entropy, or the graph entropy in the special case when the two alphabets A and B coincide and when the distortion is Hamming distortion, as defined at the end of this appendix; cf. [4,20]. We recall that graph capacity and hypergraph entropy are the two basic functionals of the zero-error theory; both of them originated in a coding theoretic context, but both have found unexpected and deep applications elsewhere; cf. [15].
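The reduction of peak distortion to average distortion at level zero, as described in footnote 15, can be checked exhaustively on a small example. The sketch below assumes that peak distortion is the maximum, and average distortion the arithmetic mean, of the single-letter distortions; the distortion matrix `D` and the helper names are hypothetical and chosen only for illustration.

```python
from itertools import product


def peak_distortion(d, x, y):
    """Largest single-letter distortion between sequences x and y."""
    return max(d[a][b] for a, b in zip(x, y))


def average_distortion(d, x, y):
    """Arithmetic mean of the single-letter distortions between x and y."""
    return sum(d[a][b] for a, b in zip(x, y)) / len(x)


def thresholded(d, delta):
    """New measure: 0 where d(a, b) <= delta, d(a, b) elsewhere."""
    return [[0 if v <= delta else v for v in row] for row in d]


# Hypothetical distortion matrix over a three-letter alphabet {0, 1, 2}.
D = [[0, 1, 3],
     [1, 0, 2],
     [3, 2, 0]]
DELTA = 1
NEW_D = thresholded(D, DELTA)

mismatches = [
    (x, y)
    for x in product(range(3), repeat=3)
    for y in product(range(3), repeat=3)
    if (peak_distortion(D, x, y) <= DELTA) != (average_distortion(NEW_D, x, y) == 0)
]
print(len(mismatches))
```

An empty list of mismatches means that, for every pair of sequences of length 3, the peak constraint at level `DELTA` and the zero-average constraint under the new measure select exactly the same pairs.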
Observe that the possibilistic rate-distortion functions $R_\epsilon(\pi, \Delta)$ and $R^*_\epsilon(\pi, \Delta)$ are both non-increasing step-functions of $\epsilon$. Actually, if $\pi_i < \pi_{i+1}$ are two consecutive entries in $\pi$, as in Proposition 4.1, the relation of $\epsilon$-equivalence is always the same for whatever $\epsilon$ such that $\pi_i \le \epsilon < \pi_{i+1}$. Unlike $R_\epsilon(\pi, \Delta)$, $R^*_\epsilon(\pi, \Delta)$ is also a step-function of $\Delta$. Actually, if the distinct entries of the matrix d are arranged in increasing order, and if $d_i < d_{i+1}$ are two consecutive entries, the constraint (B.6) is the same for whatever $\Delta$ such that $d_i \le \Delta < d_{i+1}$.

In some simple cases the minima and the maxima which appear in the expression of the various rate-distortion functions can be made explicit; the reader is referred once more to [4]; the results given there are readily adapted to the possibilistic case. We shall just mention one such special case: the two alphabets coincide, A = B, the distortion matrix d is Hamming distortion, i.e., $d(a, b)$ is equal to 0 or to 1 according to whether $a = b$ or $a \ne b$, respectively; $\Delta = 0$. As a matter of fact, one soon realizes that this is just a different formulation of the problem of coding without distortion as in Section 4. A simple computation gives:

$$R_\epsilon(\pi, 0) = R^*_\epsilon(\pi, 0) = \log |\{a : \pi(a) > \epsilon\}|$$

in accordance with the expression of the possibilistic entropy given in Theorem 4.1. A slight generalization of this case is obtained for arbitrary $\Delta \ge 0$ when the inequality $d(a, b) \le \Delta$ is an equivalence relation which partitions the primary alphabet A into equivalence classes E. Then

$$R^*_\epsilon(\pi, \Delta) = \log |\{E : \pi(E) > \epsilon\}|.$$

In practice, optimal codes are constructed by taking a letter $a_E$ for each class E whose possibility exceeds $\epsilon$; each primary letter in E is then reproduced by using precisely $a_E$. This way the asymptotic optimal rate $R^*_\epsilon(\pi, \Delta)$ is achieved already for n = 1, as in the case of coding without distortion. This is bad news, since it means that optimal code constructions are bound to be trivial; cf. footnote 11.

References

[1] M. Borelli, A. Sgarro, A possibilistic distance for sequences of equal and unequal length, in: C. Călude, Gh. Păun (Eds.), Finite VS Infinite, Discrete Mathematics and Theoretical Computer Science, Springer, London, 2000, pp. 27–38.
[2] B. Bouchon-Meunier, G. Coletti, C. Marsala, Possibilistic Conditional Events, IPMU 2000, Madrid, July 3–7 2000, Proceedings, pp. 1561–1566.
[3] Th.M. Cover, J.A. Thomas, Elements of Information Theory, Wiley, New York, 1991.
[4] I. Csiszár, J. Körner, Information Theory, Academic Press, New York, 1981.
[5] G. De Cooman, Possibility Theory, Internat. J. General Systems 25 (4) (1997) 291–371.
[6] D. Dubois, H.T. Nguyen, H. Prade, Possibility theory, probability and fuzzy sets: misunderstandings, bridges and gaps, in: D. Dubois, H. Prade (Eds.), Fundamentals of Fuzzy Sets, Kluwer Academic Publishers, Boston, 2000, pp. 343–438.
[7] D. Dubois, W. Ostasiewicz, H. Prade, Fuzzy Sets: History and Basic Notions, in: D. Dubois, H. Prade (Eds.), Fundamentals of Fuzzy Sets, Kluwer Academic Publishers, Boston, 2000, pp. 21–290.
[8] D. Dubois, H. Prade, Properties of measures of information in evidence and possibility theories, Fuzzy Sets and Systems 24 (1987) 161–182.
[9] D. Dubois, H. Prade, Fuzzy sets in approximate reasoning: inference with possibility distribution, Fuzzy Sets and Systems 40 (1991) 143–202.
[10] F. Fabris, A. Sgarro, Possibilistic data transmission and fuzzy integral decoding, IPMU 2000, Madrid, July 3–7 2000, Proceedings, pp. 1153–1158.
[11] E. Hisdal, Conditional possibilities, independence and non-interaction, Fuzzy Sets and Systems 1 (1978) 283–297.
[12] G.J. Klir, T.A. Folger, Fuzzy Sets, Uncertainty and Information, Prentice-Hall, London, 1988.
[13] G.J. Klir, M.J. Wierman, Uncertainty-Based Information: Elements of Generalized Information Theory, Physica Verlag/Springer Verlag, Heidelberg and New York, 1998.
[14] G.J. Klir, Measures of uncertainty and information, in: D. Dubois, H. Prade (Eds.), Fundamentals of Fuzzy Sets, Kluwer Academic Publishers, Boston, 2000, pp. 439–457.
[15] J. Körner, A. Orlitsky, Zero-error information theory, Trans. Inform. Theory 44 (6) (1998) 2207–2229.
[16] H.T. Nguyen, E.A. Walker, A First Course in Fuzzy Logic, 2nd Edition, Chapman & Hall, London, 2000.
[17] S. Ovchinnikov, An Introduction to Fuzzy Relations, in: D. Dubois, H. Prade (Eds.), Fundamentals of Fuzzy Sets, Kluwer Academic Publishers, Boston, 2000, pp. 233–259.
[18] C.E. Shannon, A mathematical theory of communication, Bell System Technical J. 27 (3&4) (1948) 379–423, 623–656.
[19] C.E. Shannon, The zero-error capacity of a noisy channel, IRE Trans. Inform. Theory IT-2 (1956) 8–19.
[20] G. Simonyi, Graph entropy: a survey, in: W. Cook, L. Lovász, P. Seymour (Eds.), Combinatorial Optimization, DIMACS Series in Discrete Maths and Computer Science, vol. 20, 1995, AMS, Providence, RI, pp. 399–441.
[21] D. Solomon, Data Compression, Springer, New York, 1998.
[22] P. Walley, Statistical Reasoning with Imprecise Probabilities, Chapman & Hall, London, 1991.
[23] L. Zadeh, Fuzzy sets as a basis for a theory of possibility, Fuzzy Sets and Systems 1 (1978) 3–28.