Joscha Bach
Humboldt-Universität zu Berlin, Institut für Informatik
bach@informatik.hu-berlin.de
1 Introduction
In the course of a diploma thesis [Bach 2000], several textual data compression
methods have been examined and compared with respect to their application in
human-computer interfaces. As a result, a prototypical application – PRIS, which
stands for Predictive Interface using Similarity – has been designed and evaluated.
In short, PRIS takes advantage of the redundancy of natural language by offering the
user predictions of what to type next. These predictions are based on dictionaries
primed on common corpora of textual input and are constantly adapted to the user.
Unlike most common word-based textual input aids, PRIS takes into account not only
the relative frequencies of words but also their probabilities conditioned on the
preceding word. PRIS makes use of the PPM compression algorithm [Moffat 1990], which I
have combined with a method to switch between concurrent adaptive entropy models
of the user’s input, so the system does not ‘unlearn’ formerly acquired knowledge
when text in a different language or of a different style is entered. Results have shown
the superiority of this approach over single-dictionary systems.
Usually, textual data is entered into textual interfaces (word processors,
communication programs etc.) one character at a time. Conceivably, it could make
sense to exploit the redundancy of natural language, for instance by predicting likely
completions for an input (as is common in the URL completion of modern web
browsers). Text processing, however, usually does
not resort to such shortcuts, because the need to interrupt typing and check for the
predictions offered would slow down most users beyond their usual typing speed. On
the other hand, for people with motor impairments, it can be more efficient to reduce
the number of strokes necessary for inputting a word. The balance between the
number of predictions to skim through and the reduction of key strokes varies with
the typing speed of the user.
PRIS uses the context of the formerly seen text to offer an adjustable number of
predictions that can be chosen using a hotkey or by a mouse click. If the desired word
is not on the list, the next most likely words can be shown, or one more character can
be entered, which removes from the list not only all words that no longer match, but
also all words shown before.
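This narrowing behaviour can be sketched in a few lines (a minimal illustration; the class and method names are made up, not taken from PRIS):

```python
class PredictionList:
    """Offers pages of word predictions; typing a character narrows the
    candidate set and also discards every word already shown to the user."""

    def __init__(self, candidates, page_size=9):
        # candidates: words sorted by descending predicted probability
        self.candidates = list(candidates)
        self.page_size = page_size
        self.shown = set()
        self.prefix = ""

    def next_page(self):
        """Return the next page of likely words and remember them as shown."""
        page = [w for w in self.candidates
                if w.startswith(self.prefix) and w not in self.shown]
        page = page[:self.page_size]
        self.shown.update(page)
        return page

    def type_char(self, ch):
        """Narrow the list: keep only matching words not shown before."""
        self.prefix += ch
        self.candidates = [w for w in self.candidates
                           if w.startswith(self.prefix) and w not in self.shown]
```

For example, after a page of predictions has been rejected, typing one more character removes both the non-matching and the already-shown words before the next page is offered.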
The information content of a word S with probability P_S, measured in bits, is

$$ H_S = -\log_2 P_S $$
PPM [Cleary/Witten 1984] calculates the probabilities from the context of the n
preceding symbols, generating an adaptive prediction model of order n. (If no
preceding symbols are taken into account and only the relative frequencies of the
words are used to estimate their probabilities, we speak of zero-order prediction.)
$$ P_S = \frac{f_{S \mid S_n \to S_{n-1} \to \dots \to S_1}}{f_{S_n \to S_{n-1} \to \dots \to S_1}} $$
where f_{S|S_n→S_{n-1}→...→S_1} specifies how often S has been seen after the sequence
{S_n, S_{n-1}, ..., S_1}, and f_{S_n→S_{n-1}→...→S_1} is the frequency of this sequence.
But what if S has not yet occurred in the context of {S_n, S_{n-1}, ..., S_1} (a
situation commonly referred to as the Zero Frequency Problem)? In this case, PPM
encodes an Escape symbol, followed by the
prediction of S based on a context of order n−1. If S is unknown in this context as
well, another Escape symbol follows, together with a prediction based on a context of
order n−2, and so on, down to the zero-order predictions (n = 0), which are
given by
$$ P_S = \frac{f_S}{m} $$
where f_S is the absolute frequency of S and m is the number of words encountered so far.
To be more accurate, the probability of the Escape symbol has to be taken into account as well:
$$ P_S = (1 - P_{ESC}) \, \frac{f_{S \mid S_n \to S_{n-1} \to \dots \to S_1}}{f_{S_n \to S_{n-1} \to \dots \to S_1}} $$
Several estimates for the probability of the Escape symbol can be found in the
literature, and for my word-based approach, I have found the PPM-C method [Moffat
et al. 1990] to work best:
$$ P_{ESC} = \frac{r_{S_n \to S_{n-1} \to \dots \to S_1}}{f_{S_n \to S_{n-1} \to \dots \to S_1} + r_{S_n \to S_{n-1} \to \dots \to S_1}} $$
where r_{S_n→S_{n-1}→...→S_1} is the number of different words that have been
encountered in the context of {S_n, S_{n-1}, ..., S_1} so far.
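Taken together, the first-order case of these formulas can be sketched in a few lines of Python (a simplified illustration, not PRIS's actual code; the class and method names are made up):

```python
from collections import defaultdict

class WordPPM1:
    """Word-based first-order PPM with PPM-C escape estimation."""

    def __init__(self):
        self.bigram = defaultdict(lambda: defaultdict(int))  # f_{S | S_1}
        self.unigram = defaultdict(int)                      # f_S
        self.total = 0                                       # m

    def learn(self, words):
        """Accumulate context statistics from a word sequence."""
        prev = None
        for w in words:
            if prev is not None:
                self.bigram[prev][w] += 1
            self.unigram[w] += 1
            self.total += 1
            prev = w

    def prob(self, word, prev):
        """P(word | prev) with the PPM-C escape weight r / (f + r),
        escaping to the zero-order estimate f_S / m for unseen pairs."""
        ctx = self.bigram.get(prev, {})
        f_ctx = sum(ctx.values())   # frequency of the context
        r = len(ctx)                # distinct successors seen in this context
        p_esc = r / (f_ctx + r) if f_ctx + r > 0 else 1.0
        if ctx.get(word, 0) > 0:
            return (1 - p_esc) * ctx[word] / f_ctx
        return p_esc * self.unigram.get(word, 0) / max(self.total, 1)
```

Note that the factor (1 − P_ESC) is applied only to words actually seen in the context, exactly as in the formula above, while the remaining probability mass is passed down to the zero-order model.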
While PPM is one of the most powerful compression methods known today [Witten
et al. 1994], its models grow exponentially with the number of different symbols in the
alphabet and the order n. That is not so much a problem for character-based encoding,
as the model size can be limited by “forgetting” rarely used sequences. However,
for word-based models such as I intended to use in PRIS, this leads to difficulties,
because typical English words have frequencies between 1/10,000 and 1/50,000
[Yuret 1998]. Hence, most words would be removed from the dictionary before enough
statistics for meaningful predictions can be accumulated, restricting word-based PPM
to a first-order model (i.e. only the preceding word is used for prediction).
For English text samples, I found word-based first-order PPM to yield compression
rates of about 7.9 bits per word, which is comparable to character-based fifth-order
PPM and clearly preferable to zero-order prediction (about 9.5 bits per word)
[Bach/Witten 1999].
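Such bits-per-word figures are simply the average code length the model assigns to a test text. For the zero-order case the measurement reduces to a few lines (an illustrative sketch; the first-order figure additionally requires the escape handling described above, and the function name is made up):

```python
import math

def bits_per_word(words, freqs, total):
    """Average zero-order code length in bits per word: the mean of
    -log2(f_S / m) over the text. All words are assumed known here;
    unknown words would need the Escape mechanism."""
    return sum(-math.log2(freqs[w] / total) for w in words) / len(words)
```

A uniform distribution over two equally frequent words, for instance, costs exactly one bit per word.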
[Figure: PRIS architecture. A priming text primes several models (Model 1 … Model n);
the entropy of the user's input is measured to select the most similar model; the
selected model supplies predictions to the user interface, and learning from the
user's selections and input adapts it.]
The quality of the predictions of a PPM model depends on how closely the contextual
statistics (usually called the “compression model” or “dictionary”) match the currently
encoded text. Therefore it seems desirable to employ not one but an arbitrary number of
different dictionaries to model different languages and styles. Not only the vocabulary
but also the structure of a document determines the succession of individual words;
for instance, an e-mail might require a different dictionary than a literature reference.
The current paragraph is segmented into words (punctuation marks count as words and
white-space is omitted), and the entropy of each word S is calculated:
- If S has already been seen in its context S* (here, after the preceding word):

$$ H_S = -\log_2 \left( (1 - P_{ESC}) \, \frac{f_{S \mid S^*}}{f_{S^*}} \right) $$

- Otherwise:

$$ H_S = -\log_2 P_{ESC} - \log_2 \frac{f_S}{m} $$

(calculating P_{ESC} as above).
The entropy of a completely unknown word S_X is estimated as if it had been seen
exactly once before:

$$ H_{S_X} = -\log_2 P_{ESC} - \log_2 \frac{1}{m} $$
The entropy of the paragraph is the sum of the individual word entropies (∑ H_S). The
model with the lowest entropy is expected to give the best predictions for the text.
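The selection step can be sketched as follows. For brevity, the zero-order entropy with a "seen once" estimate for unknown words stands in for the full escape-based formula above, and all function names are illustrative:

```python
import math
import re

def tokenize(text):
    """Segment a paragraph: punctuation marks count as words,
    white-space is discarded."""
    return re.findall(r"\w+|[^\w\s]", text)

def entropy(words, freqs, total):
    """Total entropy of a word list under a zero-order model with m = total.
    Unknown words are treated as if seen exactly once (a simplification
    of the escape-based estimate)."""
    return sum(-math.log2(max(freqs.get(w, 0), 1) / total) for w in words)

def select_model(words, models):
    """Return the name of the dictionary assigning the lowest entropy,
    i.e. the model expected to give the best predictions."""
    return min(models, key=lambda name: entropy(words, *models[name]))
```

A paragraph of English text thus selects an English-primed dictionary over a German one simply because the former assigns it fewer bits.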
5 Results
As mentioned before, the fluency of text input depends not only on the reduction of
necessary keystrokes achieved by the input aid, but also – reciprocally – on the
number of predictions the user is offered. The balance of these depends on the
specific situation of the user: someone who can press a key every second might not
find it helpful to browse through ten possible completions before hitting the next key,
while a more impaired user with an input rate of one key per minute would perhaps want
more than 20. This makes it difficult to compare different input aids, and the reader
should bear in mind that for the purpose of comparison, PRIS has been set to offer nine
predictions. Where PRIS's algorithm is compared to zero-order prediction (as used in
common input aids like WiViK 2 (www.wivik.com) or SofType), nine predictions and
identical priming documents have been used as well.
In all cases, an algorithm for automatically setting spaces after selected words and
around punctuation marks has been applied, and mistakes caused by this algorithm have
been penalized with the number of strokes necessary to correct them (using the
backspace key).
For the experiments, the following text corpora have been used:
• Dumas Malone’s “Jefferson and his Time”, of which the first five volumes were
used for priming the dictionaries and the sixth for measurements,
• Jane Austen’s complete works (six novels), where the novel “Persuasion”
was used for measuring,
• 10 megabytes of text from the Berlin newspaper “taz” from the year 1993.
The text used for priming was never itself used for the measurements.
Probably the most interesting question is that of PRIS's efficiency as a textual
input aid. For the measurements, each key-press and each selection of a prediction
were counted as one input stroke. After priming, learning was disabled; every
unknown word had to be entered completely.
Even if no prior priming takes place, PRIS can yield considerable improvements if
learning is turned on: in the case of the novel “Persuasion”, 60,590 input strokes
were necessary for the first 100,000 characters. This is about twice as many as in
the best case (when all words are known), which needs only 31,995 key-strokes, but it
still means an improvement of almost 40 percent over plain typing.
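The percentages quoted here follow directly from the stroke counts (the helper name is illustrative):

```python
def stroke_reduction(plain_strokes, aided_strokes):
    """Percentage of input strokes saved compared with plain typing,
    where plain typing costs one stroke per character."""
    return 100.0 * (1.0 - aided_strokes / plain_strokes)
```

For “Persuasion”, `stroke_reduction(100000, 60590)` gives about 39.4 percent, and the best case `stroke_reduction(100000, 31995)` about 68 percent.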
The stroke reduction grows with the size of the priming text. For the measurements, a
large portion of the last volume of “Jefferson and his Time” was entered, with a
variable number of the preceding volumes acting as priming text. To keep the size of
the model constant, learning was disabled:
[Figure: stroke reduction as a function of the priming-text size; the reduction rises
from about 47% to over 60% (y-axis 45%–65%; x-axis 50–1050, presumably thousands of
words).]
After most words in the text have been seen at least once, PRIS delivers stroke
reduction rates of over 40 percent. Beyond this point, further improvements are due
to the PPM model. After one million words have been seen, a stroke reduction of 62
percent is achieved; that is, each key-press enters 2.6 characters on average.
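The figure of 2.6 characters per key-press is just the reciprocal of the remaining stroke fraction:

```python
# At 62 percent stroke reduction, 100,000 characters take 38,000 strokes,
# so each stroke enters 1 / (1 - 0.62) ≈ 2.6 characters on average.
reduction = 0.62
chars_per_stroke = 1.0 / (1.0 - reduction)
```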
[Table: for each corpus — text size (characters), training text (words), zero-order
strokes, PPM strokes, and savings by PPM.]
Depending on the size of the model and the style of the input text, PRIS obtains
between 12 and 24 percent better results for the example corpora than zero-order
prediction.
While PRIS is still at a prototypical stage of development, it is fast and effective,
and it could be put to good use as an input aid in a textual interface.
References
[Bach/Witten 1999] Bach, J., Witten, I.H.: Lexical Attraction for Text Compression. Working paper, Department of Computer Science, University of Waikato, New Zealand.
[Cleary/Witten 1984] Cleary, J.G., Witten, I.H.: Data Compression Using Adaptive Coding and Partial String Matching. IEEE Transactions on Communications, Vol. COM-32, No. 4, pp. 396–402.
[Moffat 1990] Moffat, A.: Implementing the PPM Data Compression Scheme. IEEE Transactions on Communications, Vol. 38, No. 11, pp. 1917–1921.
[Shannon 1951] Shannon, C.E.: Prediction and Entropy of Printed English. Bell System Technical Journal, Vol. 30, pp. 50–64.
[Witten et al. 1994] Witten, I.H., Moffat, A., Bell, T.C.: Managing Gigabytes. Van Nostrand Reinhold, New York.
[Yuret 1998] Yuret, D.: Discovery of Linguistic Relations Using Lexical Attraction. PhD thesis, Department of Electrical Engineering and Computer Science, MIT, May 1998.