Abstract
Language modeling is the attempt to characterize, capture and exploit regularities in natural language. In statistical language modeling, large amounts of text are used to automatically determine the model’s parameters. Language modeling is useful in automatic speech recognition, machine translation, and any other application that processes natural language with incomplete knowledge.

In this thesis, I view language as an information source which emits a stream of symbols from a finite alphabet (the vocabulary). The goal of language modeling is then to identify and exploit sources of information in the language stream, so as to minimize its perceived entropy.

Most existing statistical language models exploit the immediate past only. To extract information from further back in the document’s history, I use trigger pairs as the basic information bearing elements. This allows the model to adapt its expectations to the topic of discourse.

Next, statistical evidence from many sources must be combined. Traditionally, linear interpolation and its variants have been used, but these are shown here to be seriously deficient. Instead, I apply the principle of Maximum Entropy (ME). Each information source gives rise to a set of constraints, to be imposed on the combined estimate. The intersection of these constraints is the set of probability functions which are consistent with all the information sources. The function with the highest entropy within that set is the ME solution. Given consistent statistical evidence, a unique ME solution is guaranteed to exist, and an iterative algorithm exists which is guaranteed to converge to it. The ME framework is extremely general: any phenomenon that can be described in terms of statistics of the text can be readily incorporated.

An adaptive language model based on the ME approach was trained on the Wall Street Journal corpus, and showed 32%–39% perplexity reduction over the baseline. When interfaced to SPHINX-II, Carnegie Mellon’s speech recognizer, it reduced its error rate by 10%–14%.

The significance of this thesis lies in improving language modeling, reducing speech recognition error rate, and in being the first large-scale test of the approach. It illustrates the feasibility of incorporating many diverse knowledge sources in a single, unified statistical framework.
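
To make the Maximum Entropy idea concrete, the following is a minimal sketch on a toy problem, not the model built in this thesis. It assumes that the iterative algorithm is of the Generalized Iterative Scaling kind; the abstract does not name it. The function names (`model_probs`, `gis_maxent`), the four-word vocabulary, and the target values are all hypothetical and chosen only for illustration. Two constraints are imposed: the two financial words together should receive probability 0.3, and "stock" alone should receive 0.2.

```python
import math

def model_probs(lambdas, feats):
    """Exponential-family distribution: p(x) proportional to exp(sum_i lambda_i * f_i(x))."""
    scores = [math.exp(sum(l * f for l, f in zip(lambdas, fx))) for fx in feats]
    z = sum(scores)
    return [s / z for s in scores]

def gis_maxent(features, targets, n_iter=2000):
    """Fit a maximum-entropy distribution with Generalized Iterative Scaling.

    features: one non-negative feature vector per outcome (hypothetical toy input).
    targets:  desired expectation of each feature under the fitted distribution.
    """
    # GIS requires every outcome's features to sum to the same constant c,
    # so a slack feature is appended to make up the difference.
    c = max(sum(fx) for fx in features)
    feats = [list(fx) + [c - sum(fx)] for fx in features]
    targs = list(targets) + [c - sum(targets)]

    lambdas = [0.0] * len(targs)  # all-zero weights give the uniform distribution
    for _ in range(n_iter):
        p = model_probs(lambdas, feats)
        for i, target in enumerate(targs):
            expected = sum(px * fx[i] for px, fx in zip(p, feats))
            if target > 0 and expected > 0:
                # Nudge each weight so the model expectation moves toward its target.
                lambdas[i] += math.log(target / expected) / c
    return model_probs(lambdas, feats)

# Toy setup (illustrative assumption, not data from the thesis).
vocab = ["stock", "bond", "the", "of"]
features = [
    [1, 1],  # "stock": is financial, is "stock"
    [1, 0],  # "bond":  is financial
    [0, 0],  # "the"
    [0, 0],  # "of"
]
targets = [0.3, 0.2]  # P(financial word) = 0.3, P("stock") = 0.2

p = gis_maxent(features, targets)
for word, prob in zip(vocab, p):
    print(f"{word}: {prob:.4f}")
```

In this toy case the constraints pin down "stock" at 0.2 and "bond" at 0.1, and say nothing about the other two words, so the highest-entropy choice splits the remaining 0.7 evenly between them. The fitted distribution should approach those values, which shows the sense in which ME assumes nothing beyond the stated evidence.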