A finite state transducer (FST) is a finite state automaton (FSA, FA) which produces output as well as
reading input, which means it is useful for parsing (while a "bare" FSA can only be used for recognizing,
i.e. pattern matching).
An FST consists of a finite number of states which are linked by transitions labeled with an input/output
pair. The FST starts out in a designated start state and jumps to different states depending on the input,
while producing output according to its transition table.
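To make the description above concrete, here is a minimal sketch of a deterministic FST driven by a transition table. The machine itself is a made-up toy (it rewrites every second "a" as "x"), and the names `TRANSITIONS` and `transduce` are illustrative assumptions, not part of any FST library.

```python
from typing import Dict, Tuple

# (current state, input symbol) -> (next state, output symbol)
Transitions = Dict[Tuple[str, str], Tuple[str, str]]

# Hypothetical toy machine: q0 = even number of "a" seen so far, q1 = odd.
TRANSITIONS: Transitions = {
    ("q0", "a"): ("q1", "a"),  # odd-numbered "a" is copied
    ("q1", "a"): ("q0", "x"),  # even-numbered "a" is rewritten as "x"
    ("q0", "b"): ("q0", "b"),  # "b" is always copied, state unchanged
    ("q1", "b"): ("q1", "b"),
}


def transduce(text: str, transitions: Transitions = TRANSITIONS, start: str = "q0") -> str:
    """Start in the designated start state, follow one transition per input symbol."""
    state = start
    output = []
    for symbol in text:
        if (state, symbol) not in transitions:
            raise ValueError(f"no transition from {state!r} on {symbol!r}")
        state, out = transitions[(state, symbol)]
        output.append(out)
    return "".join(output)


print(transduce("abaab"))
```

Each input symbol costs one table lookup, so the running time grows linearly with the input length.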
FSTs are useful in NLP and speech recognition because they have nice algebraic properties, most notably
that they are closed under composition: composing two FSTs yields another FST, which implements relational
composition on regular relations (think of this as non-deterministic function composition), and the result
often stays very compact. A deterministic FST can parse a string of a regular language into an output
string in time linear in the length of the input.
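Composition can be sketched for the simplest case: deterministic, epsilon-free transducers in which every transition reads one symbol and writes one symbol. This is a simplification for illustration (real toolkits also handle epsilons, weights and final states), and `compose` and `run` are hypothetical helper names.

```python
from typing import Dict, Hashable, Tuple

Table = Dict[Tuple[Hashable, str], Tuple[Hashable, str]]


def compose(t1: Table, t2: Table) -> Table:
    """Build one FST whose states are pairs (state of t1, state of t2).

    Whenever t1 maps a -> b and t2 maps that same b -> c,
    the composed machine maps a -> c in a single step.
    """
    composed: Table = {}
    for (p, a), (p_next, b) in t1.items():
        for (q, b_in), (q_next, c) in t2.items():
            if b == b_in:
                composed[((p, q), a)] = ((p_next, q_next), c)
    return composed


def run(table: Table, start: Hashable, text: str) -> str:
    state, out = start, []
    for symbol in text:
        state, o = table[(state, symbol)]
        out.append(o)
    return "".join(out)


# Two toy single-state machines (assumed for the example).
upper = {("s", "a"): ("s", "A"), ("s", "b"): ("s", "B")}
digits = {("t", "A"): ("t", "1"), ("t", "B"): ("t", "2")}

both = compose(upper, digits)
# Running the composed machine once gives the same result as running the two in sequence.
assert run(both, ("s", "t"), "abba") == run(digits, "t", run(upper, "s", "abba"))
```

The composed table is a single machine, so the intermediate string is never materialized.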
As an example, I once implemented morphological parsing as a bunch of FSTs. My main FST for verbs
would turn a regular verb, say "walked", into "walk+PAST". I also had an FST for the verb "to be", which
would turn "is" into "be+PRESENT+3rd" (3rd person), and similarly for other irregular verbs. All the FSTs
were combined into a single one using an FST compiler, which produced a single FST that was much
smaller than the sum of its parts and ran very fast. FSTs can be built by a variety of tools that accept an
extended regular expression syntax.
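A rough idea of what the regular-verb transducer does can be sketched by simulating a small non-deterministic FST by hand. This is not the implementation described above, which was built with an FST compiler; the arc list, the state names and the `analyze` function are assumptions made for illustration, and the toy only strips a literal "-ed" suffix.

```python
from typing import List, Optional, Set, Tuple

# (source state, input char or None for "any char",
#  output string or None for "copy the input char", target state)
Arc = Tuple[str, Optional[str], Optional[str], str]

ARCS: List[Arc] = [
    ("stem", None, None, "stem"),         # copy a stem letter unchanged
    ("stem", "e", "", "suffix_e"),        # guess that the "-ed" suffix starts here
    ("suffix_e", "d", "+PAST", "final"),  # finish the suffix and emit the tag
]
START = "stem"
FINAL = {"final"}


def analyze(word: str) -> Set[str]:
    """Return every output the FST can produce while consuming the whole word."""
    configs = {(START, "")}  # set of (state, output so far)
    for ch in word:
        next_configs = set()
        for state, out in configs:
            for src, inp, outp, dst in ARCS:
                if src == state and (inp is None or inp == ch):
                    next_configs.add((dst, out + (ch if outp is None else outp)))
        configs = next_configs
    # Only paths that end in a final state count as analyses.
    return {out for state, out in configs if state in FINAL}


print(analyze("walked"))
print(analyze("walk"))
```

The non-determinism sits at each "e": the machine both keeps copying and tries the suffix, and wrong guesses die out because they cannot reach the final state at the end of the word.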
The ARPA-MIT LM format
An ARPA-style language model file comes in two parts: the header and
the n-gram definitions. The header contains a description of the contents
of the file.
<header> = { ngram <int>=<int> }
The first <int> gives the n-gram order and the second <int> gives
the number of n-gram entries stored.
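Reading the header entries can be sketched as below. The function name `parse_header` and the counts in the usage line are made up for the example; only the `ngram <int>=<int>` shape comes from the grammar above.

```python
import re
from typing import Dict, Iterable

# One header entry: "ngram <order>=<count>"
HEADER_ENTRY = re.compile(r"^ngram\s+(\d+)\s*=\s*(\d+)$")


def parse_header(lines: Iterable[str]) -> Dict[int, int]:
    """Map each n-gram order to the number of entries declared for it."""
    counts: Dict[int, int] = {}
    for line in lines:
        match = HEADER_ENTRY.match(line.strip())
        if match:
            counts[int(match.group(1))] = int(match.group(2))
    return counts


# Hypothetical header lines; the counts are invented for illustration.
print(parse_header(["ngram 1=5", "ngram 2=4", "ngram 3=5"]))
```

Lines that do not match the pattern are simply ignored, so the function can be fed the whole file.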
\1-grams:
-1.6682 A -2.2371
-5.5975 A'S -0.2818
-2.8755 A. -1.1409
-4.3297 A.'S -0.5886
-5.1432 A.S -0.4862
...
\2-grams:
-3.4627 A BABY -0.2884
-4.8091 A BABY'S -0.1659
-5.4763 A BACH -0.4722
-3.6622 A BACK -0.8814
...
\3-grams:
-4.3813 !SENT_START A CAMBRIDGE
-4.4782 !SENT_START A CAMEL
-4.0196 !SENT_START A CAMERA
-4.9004 !SENT_START A CAMP
-3.4319 !SENT_START A CAMPAIGN
...
\end\
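The listing above can be read with a short parser along the following lines. It assumes the usual ARPA convention that each entry holds a log10 probability, then the n words, then an optional back-off weight (absent for the highest order, as in the 3-gram lines shown). The name `parse_ngrams` is illustrative, and "..." lines are skipped as elisions.

```python
import re
from typing import Dict, Iterable, Optional, Tuple

SECTION = re.compile(r"^\\(\d+)-grams:$")  # e.g. "\2-grams:"

Entry = Tuple[float, Optional[float]]      # (log probability, back-off weight or None)
Model = Dict[int, Dict[Tuple[str, ...], Entry]]


def parse_ngrams(lines: Iterable[str]) -> Model:
    """Collect the entries of every "\\N-grams:" section up to "\\end\\"."""
    model: Model = {}
    order = None
    for raw in lines:
        line = raw.strip()
        if line == "\\end\\":
            break
        section = SECTION.match(line)
        if section:
            order = int(section.group(1))
            model[order] = {}
            continue
        if order is None or not line or line == "...":
            continue  # header text, blank line, or elided entries
        fields = line.split()
        log_prob = float(fields[0])
        words = tuple(fields[1:1 + order])
        backoff = float(fields[1 + order]) if len(fields) > 1 + order else None
        model[order][words] = (log_prob, backoff)
    return model


sample = [
    "\\1-grams:", "-1.6682 A -2.2371", "...",
    "\\2-grams:", "-3.4627 A BABY -0.2884",
    "\\3-grams:", "-4.3813 !SENT_START A CAMBRIDGE",
    "\\end\\",
]
model = parse_ngrams(sample)
print(model[2][("A", "BABY")])
```

Keying each section by a tuple of words keeps lookups for different orders separate and avoids ambiguity from joining words into one string.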
Discrete Cosine Transform