0ukyear Using Concurrent Entropy Models for Textual Interf

Using Concurrent Entropy Models
for Textual Interfaces
Joscha Bach
Humboldt-Universität zu Berlin, Institut für Informatik
bach@informatik.hu-berlin.de
Abstract. The efficiency of textual input of disabled users can be considerably

improved by employing prediction methods as they are used in data
compression algorithms. In this paper, the application of the PPM algorithm is
described along with a technique of utilizing multiple concurrent prediction
models. In a prototypical implementation – called PRIS – the method has been
found to reduce the number of keystrokes necessary to enter textual data by 55
to 62 percent over common text entry. Compared with zero-order prediction as
used in most textual input aids that are currently used, PRIS reduces the number
of keystrokes between 12 and 24 percent.
1 Introduction
The ability to communicate is among the most important human faculties. To be

deprived of it by disease or accident is an unbearable situation, and the provision of
communication aids can be an essential part of rehabilitation and restoration of a
patients quality of life. Computers can often act as such an aid, and a lot of research is
undertaken in the field of rehabilitational computing. In recent years, several methods
for textual data compression have been devised that could also be put to use to make
the entering of textual information faster and more efficient. However, probably due
to the limited attractiveness of rehabilitational computing to commercial exploitation,
human-computer interfaces designed exclusively for people with disabilities
impairing their communication usually take relatively little advantage of current
technology.
In the course of a diploma thesis [Bach 2000], several textual data compression
methods have been examined and compared with respect to their application in
human-computer interfaces. As a result, a prototypical application – PRIS, which
stands for Predictive Interface using Similarity – has been designed and evaluated.
In short, PRIS takes advantage of the redundancy of natural language by offering the
user predictions of what to type next. These predictions are based on dictionaries
primed on common corpora of textual input and are constantly adapted to the user.
Other than most common word-based textual input aids, PRIS takes not only relative
frequencies of words but also their probability based on the preceding word into
account. PRIS makes use of the PPM compression algorithm [Moffat 1990] that I
have combined with a method to switch between concurrent adaptive entropy models
of the user’s input, so the system does not ‘unlearn’ formerly acquired knowledge
when text in a different language or of a different style is entered. Results have shown
the superiority of this approach over single-dictionary systems.
2 Entering text using prediction methods
Usually, textual data are being entered into textual interfaces (word processors,
communication programs etc.) one character at a time. Conceivably, it could make
sense to exploit the redundancy of natural language, for instance by predicting likely
completions for an input (as it is common in the URL address completion of modern
internet browser applications, for instance). Text processing, however, usually does
not resort to such shortcuts, because the need to interrupt typing and check for the
predictions offered would slow down most users beyond their usual typing speed. On
the other hand, for people with motion impediments, it can be more efficient to reduce
the number of strokes necessary for inputting a word. The balance between the
number of predictions to skim through and the reduction of key strokes varies with
the typing speed of the user.
PRIS uses the context of the formerly seen text to offer an adjustable number of
predictions that can be chosen using a hotkey or by a mouse click. If the desired word
is not on the list, the next likely words can be shown or one more character be
entered, removing not only all words not matching, but also all formerly shown words
from the list.

3 Making predictions using PPM
As we know from the theory of data compression, the entropy of HS of a symbol S

with probability PS can be estimated as
H S = − log 2 PS
that is, S can be encoded in HS bits. [Shannon 1950]

The problem for text compression consists of getting the probability PS right, i.e. the
algorithm has to estimate, which symbols are to expect with which probability, based
on the context of the preceding text.
PPM [Cleary/Witten 1984] calculates the probabilities from the context of the n
preceding symbols, generating a adaptive prediction model of order n. (If no
preceding symbols are taken into account and only the relative frequencies of the
words are used to estimate their probabilities, we talk of zero-order prediction).
The probability of a symbol S is thus computed as
f S | S n → S n−1 →...→ S1
PS =
f S n → S n−1 →...→ S1
where fS|Sn→Sn-1→...→S1 specifies, how often S has been seen after the sequence {Sn, Sn-1, ...,
S1}; fSn→Sn-1→...→S1 is the frequency of this sequence. But what if S has not yet occurred
in the context of {Sn, Sn-1, ..., S1} (this situation is commonly referred to as the Zero
Frequency Problem)? – In this case, PPM encodes an Escape symbol, followed by the
prediction of S based on the context of n-1. If S is unknown in this context as well,
another Escape symbol follows, together with a prediction based on the context of
n-2, and so on, down to the zero-order-predictions (for the context of n=0) that are
given by
fS
PS =
m
where fS is the absolute frequency of S and m the number of words encountered so far.
To be more accurate, the probability of the Escape symbol has to be regarded as well:
f S | S n → S n−1 →...→ S1
PS = (1 − PESC )
f S n → S n−1 →...→ S1
Several estimates for the probability of the Escape symbol can be found in the
literature, and for my word-based approach, I have found the PPM-C method [Moffat
et al. 1990] to work best:
rS n → S n −1 →...→S1
PESC =
f S n → S n −1 →...→ S1 + rS n → S n −1 →...→ S1
with rSn→Sn-1→...→S1 as the number of different words that have been encountered in the
context of {Sn, Sn-1, ..., S1} so far.
While PPM is one of the most powerful compression methods known today, [Witten
et al. 1994] its models grow exponentially with the number of different symbols in the
alphabet and the order n. That is not so much a problem for character-based encoding,
as the model size can be limited by “forgetting” rarely used consecutions. However,
for word-based models as I intended to use in PRIS, this leads to difficulties, because
typical English words have frequencies between 1/10.000 and 1/50.000. [Yuret 1998]
Hence, most words would be removed from the dictionary before enough statistics for
meaningful predictions can be accumulated, restricting the use of word-based PPM to
a first-order model (i.e. only the preceding word is used for prediction).
For English text samples, I found word based first order PPM to yield compression
rates of about 7.9 bits per word, which is comparable to character-based fifth order
PPM, and is way favourable over zero order prediction (about 9.5 bits per word).
[Bach/Witten 1999]
4 Combining PPM with multiple dictionaries
User
Selection and
Input
Model n
Model 2
Model 1 Pre-
Priming dictions
Priming- User Output
Text Interface
Selection of
Model by
Similarity
Measuring Entropy
Learning
Figure 2: Structure of PRIS’ architecture
The quality of predictions of a PPM model depends on how closely the contextual
statistics (usually called “compression model” or “dictionary”) matches the currently
encoded text. Therefore it sees desirable to employ not one but an arbitrary number of
different dictionaries to model different languages and styles. Not only the vocabulary
but also the structure of documents determines the succession of individual words, for
instance, an e-mail might require a different dictionary than a literature reference.
PRIS takes care of that demand by maintaining several concurrent dictionaries. Of

these, just one is active at a time, and only the active dictionary undergoes adaptation
to the currently entered text. This leaves the problem of deciding which model to
choose. To measure the similarity of the currently entered text to the individual
prediction models, PRIS determines the PPM-entropy of that text based on each
model, that is, the number of bits that would be necessary to encode the text given the
dictionary would be the PPM model.
The actual paragraph is segmented into words (punctuation marks count as words and
white-spaces are omitted), and the entropy of each word S is calculated:
• Is there a preceding word S* and is S* contained in the model?

- If not:
fS
H S = − log 2
m
- Otherwise:
o has the current word S been seen in the context of S* ?
- If so:
 f S |S * 
H S = − log 2  (1 − PESC ) 
 f S* 
- Otherwise:
fS
H S = − log 2 PESC − log 2
m
(Calculating PESC as above.)
The entropy of completely unknown words is estimated as if it had seen exactly once
before:
1
H SX = − log 2 PESC − log 2
m
The entropy of the paragraph is the sum of the individual word entropies (∑HS). The
model with the lowest entropy is expected to give the best predictions for the text.
5 Results
As mentioned before, the fluency of text input depends not only on the reduction of
necessary keystrokes achieved by the input aid, but also – reciprocally – on the
number of predictions the user is offered. The balance of these depends on the
specific situation of the user: someone that can press a key every second might find it
not helpful to browse through ten possible completions before hitting the next, while a
more impaired user with an input rate of a key per minute would perhaps want more
than 20. This makes it difficult to compare different input aids, and the reader should
bear in mind that for the purpose of comparison, PRIS has been set to offer nine
predictions. Where PRIS’ algorithm is compared to zero-order prediction (as used in
common input aids like WiViK 21 or SofType), nine predictions and identical priming
documents have been used there as well.
In all cases, an algorithm for setting spaces after selected words and around
punctuation marks automatically has been applied, and mistakes caused by this
algorithm have been penalized with the stroke counts necessary to correct them (using
the backspace key).
For the experiments, the following text corpora have been used:
• Dumas Malone’s “Jefferson and his time”, of this the first five volumes for
priming the dictionaries, and the sixth for measurements,
• Jane Austen’s complete works (six novels), where the novel “Persuasion”
was used for measuring,
• 10 megabytes of text from the Berlin newspaper “taz” during the year 1993
!#" $% & '# ( )$*
•
The text used for priming has never been used for the measurements itself.
Probably the most interesting question is the one after PRIS’ efficiency as a textual
input aid. For the measurements, each key-press and each selection of a prediction
were counted as an input stroke. After priming, learning had been disabled; every
unknown word had to be entered completely:
Size Training PPM Stroke

Text-Corpus
(characters) Text (words) (strokes) Reduction
“Jefferson“ Corpus 1,033,528 1,045,404 389,517 62.31%
“taz“ 497,508 1,599,483 189,978 61.81%
“Persuasion“ 464,666 625,365 195,489 57.93%
E-Mail 114,683 101,210 51,763 54.86%
Table: Stroke reduction for primed input
1
www.wivik.com
Even if no prior priming takes place, PRIS can yield considerable improvements, if
learning is turned on: in the case of the novel “Persuasion”, for the first 100,000
characters, 60,590 input strokes were necessary. This is about twice as much as the
best case (when all words are known) that needs only 31,995 key-strokes, but still
means an improvement of almost 40 percent over plain typing.
The stroke reduction gains with the size of the priming text. For measurements, a big
portion of the last volume of “Jefferson and his time” have been entered based on a
variable number of the preceding volumes acting as priming text. To keep the size of
the model constant, learning had been disabled:
65%
63%
61%
59%
Stroke Reduction
57%
55%
53%
51%
49%
47%
45%
50 250 450 650 850 1050
Size of priming text (1000 words)
Figure 3: Stroke reduction compared to size of priming text
After most words in the text have been seen at least once, PRIS delivers stroke
reduction rates of over 40 percent. After this, improvements are made due to the PPM
model. After one million words have been seen, 62 percent stroke reduction are
achieved, that is, each key-press enters on average 2.6 characters.
Text Size Training Zero Order PPM Savings
Corpus (characters) Text (words) (strokes) (strokes) by PPM
“taz“ 497,508 1,599,483 248,458 189,978 23.54%
E-Mail 114,683 101,210 60,902 51,763 15.01%
“Jefferson“ 1,033,528 1,045,404 448,982 389,517 13.24%
“Persuasion“ 464,666 625,365 223,134 195,489 12.39%
Table: Stroke reduction compared to size of priming text
Depending on the size of the model and the style of the input text, PRIS obtains
between 12 and 24 percent better results for the example corpora than zero order
prediction.
While PRIS is still in a prototypic stage of development, it is fast and effective, and it
could be put to good use in a textual interface as an input aid.
References
!" # $% &' ( )* +-, , . )( , ' /102/1.43% & ,5/ 6879. % : : ' /10 ;=<->"@?@AB @C" @D
EF@? G8AB HI4J KJ@A H LGMN@ O GP" QSRF@D T
;D @AUL V8WYXY;D @AULY Z[JT\ !" # ]^), )4( / 6879. % : : ' /104_: ' 0 `^)a@) 7, ' bB%c( /1a ' 0 `^)0a7). , ' )*J: , . ' 0 `
6F), ( d' 0 `Ye UfFf f gAB H9;=G8C\C\hJ GJ HBJ8WYX8i4G8D ;=jlk m n o8 p <JYn8q m X8r o
kVGM M SrskVGM M \tu # v 687* % 6w% 0, ' 0 `x, d%zy y{|a@), )}( / 6879. % : : ' /10~: ( d% 6w% UfFfFfgAB HU GJ HGJ
;=G8C\C\hJ GJ HBuri4G8D Yn8W <J8 m o
<GJ r<GJ;F f y. % a ' ( , ' /10)0a"% 0, . / 7U/ 2=79. ' 0, % a^u0 `Y* ' : d =@D D=< L HU @Cg= J @D YG8hJAB@D
8 r <J r m qYX
@D XYF c !" kVGM M tu =@D D gF ! e # {s)0) `' 0 `zw' `Y) , % : isP"GHU ABT EF@ G8D T
X
whJA18WwhJA1 > # ]u' : ( /1bB% . / 2 * ' 0 `Y_' : , ' (. % * ), ' /10:F_: ' 0 `[* % &' ( )*), , . )( , ' /10e J>gH HB >"@?@AB C"FGM
;=G8C\?h @Ac<8@ =T4fFD AU @DfwJ @AU J8kVUgYkz@L48W

0ukyear Using Concurrent Entropy Models for Textual Interf

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

0ukyear Using Concurrent Entropy Models for Textual Interf

Uploaded by

Copyright:

Available Formats

Using Concurrent Entropy Models

for Textual Interfaces

Abstract. The efficiency of textual input of disabled users can be considerably

The ability to communicate is among the most important human faculties. To be

2 Entering text using prediction methods

    

As we know from the theory of data compression, the entropy of HS of a symbol S

that is, S can be encoded in HS bits. [Shannon 1950]

The probability of a symbol S is thus computed as

4 Combining PPM with multiple dictionaries

Figure 2: Structure of PRIS’ architecture

PRIS takes care of that demand by maintaining several concurrent dictionaries. Of

• Is there a preceding word S* and is S* contained in the model?

o has the current word S been seen in the context of S* ?

Size Training PPM Stroke

“Jefferson“ Corpus 1,033,528 1,045,404 389,517 62.31%

“taz“ 497,508 1,599,483 189,978 61.81%

“Persuasion“ 464,666 625,365 195,489 57.93%

E-Mail 114,683 101,210 51,763 54.86%

Table: Stroke reduction for primed input

Size of priming text (1000 words)

Figure 3: Stroke reduction compared to size of priming text

“taz“ 497,508 1,599,483 248,458 189,978 23.54%

E-Mail 114,683 101,210 60,902 51,763 15.01%

“Jefferson“ 1,033,528 1,045,404 448,982 389,517 13.24%

“Persuasion“ 464,666 625,365 223,134 195,489 12.39%

Table: Stroke reduction compared to size of priming text

You might also like