The categorization must work reliably in spite of textual errors.
The categorization must be efficient, consuming as little storage and processing time as possible, because of the sheer volume of documents to be handled.
The categorization must be able to recognize when a given document does not match any category, or when it falls between two categories. This is because category boundaries are almost never clear-cut.

In this paper we will cover the following topics:
Section 2.0 introduces N-grams and N-gram-based similarity measures.
Section 3.0 discusses text categorization using N-gram frequency statistics.
Section 4.0 discusses testing N-gram-based text categorization on a language classification task.
Section 5.0 discusses testing N-gram-based text categorization on a computer newsgroup classification task.
Section 6.0 discusses some advantages of N-gram-based text categorization over other possible approaches.
Section 7.0 gives some conclusions, and indicates directions for further work.
2.0 N-Grams

An N-gram is an N-character slice of a longer string. Although in the literature the term can include the notion of any co-occurring set of characters in a string (e.g., an N-gram made up of the first and third character of a word), in this paper we use the term for contiguous slices only. Typically, one slices the string into a set of overlapping N-grams. In our system, we use N-grams of several different lengths simultaneously. We also append blanks to the beginning and ending of the string in order to help with matching beginning-of-word and ending-of-word situations. (We will use the underscore character ("_") to represent blanks.) Thus, the word "TEXT" would be composed of the following N-grams:
bi-grams: _T, TE, EX, XT, T_
tri-grams: _TE, TEX, EXT, XT_, T_ _
quad-grams: _TEX, TEXT, EXT_, XT_ _, T_ _ _
In general, a string of length k, padded with blanks, will have k+1 bi-grams, k+1 tri-grams, k+1 quad-grams, and so on.

N-gram-based matching has had some success in dealing with noisy ASCII input in other problem domains, such as in interpreting postal addresses (Cavnar and Vayda 1993), in text retrieval (Cavnar 1993, Pearce and Nicholas 1993), and in a wide variety of other natural language processing applications. The key benefit that N-gram-based matching provides derives from its very nature: since every string is decomposed into small parts, any errors that are present tend to affect only a limited number of those parts, leaving the remainder intact. If we count N-grams that are common to two strings, we get a measure of their similarity that is resistant to a wide variety of textual errors.
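As a concrete sketch of this slicing and matching (the function names char_ngrams and common_ngram_count below are our own illustration, not a published implementation), the following Python fragment pads a string with one leading blank and N-1 trailing blanks, which reproduces the "TEXT" example and the k+1 count above, and then counts the distinct N-grams that two strings share:

```python
def char_ngrams(s, n):
    """Overlapping character N-grams of s, padded with one leading
    blank and N-1 trailing blanks, so a string of length k yields
    exactly k+1 N-grams for every N."""
    padded = "_" + s + "_" * (n - 1)   # "_" stands in for the blank
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def common_ngram_count(a, b, sizes=(2, 3, 4)):
    """Count distinct N-grams (bi- through quad-grams here) shared by
    two strings; a typo corrupts only the few N-grams that overlap it,
    so the shared count degrades gracefully rather than collapsing."""
    return sum(len(set(char_ngrams(a, n)) & set(char_ngrams(b, n)))
               for n in sizes)

print(char_ngrams("TEXT", 2))              # ['_T', 'TE', 'EX', 'XT', 'T_']
print(char_ngrams("TEXT", 3))              # ['_TE', 'TEX', 'EXT', 'XT_', 'T__']
print(common_ngram_count("TEXT", "TEXT"))  # 15 (all 5+5+5 N-grams match)
print(common_ngram_count("TEXT", "TEKT"))  # 6  (_T, TE, T_, _TE, T__, T___)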
3.0 Text Categorization Using N-Gram Frequency Statistics
Human languages invariably have some words which occur more frequently than others. One of the most common ways of expressing this idea has become known as Zipf's Law, which we can re-state as follows:

The nth most common word in a human language text occurs with a frequency inversely proportional to n.
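As an illustrative aside (this sketch is not part of the original study, and the file name sample.txt is hypothetical), the following Python fragment tabulates word frequencies by rank; Zipf's Law predicts that the product rank * frequency stays roughly constant across a sufficiently large text:

```python
from collections import Counter

def zipf_table(text, top=10):
    """Rank words by frequency; under Zipf's Law, frequency is roughly
    proportional to 1/rank, so the rank*freq column is roughly flat."""
    counts = Counter(text.lower().split())
    for rank, (word, freq) in enumerate(counts.most_common(top), start=1):
        print(f"{rank:>3}  {word:<15} freq={freq:<6} rank*freq={rank * freq}")

# Hypothetical usage on any sizable document:
zipf_table(open("sample.txt").read())
```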
The implication of this law is that there is always a set of words which dominates most of the other words of the language in terms of frequency of use. This is true both of words in general, and of words that are specific to a particular subject. Furthermore, there is a smooth continuum of dominance from most frequent to least. The smooth nature of the frequency curves helps us in some ways, because it implies that we do not have to worry too much about specific frequency thresholds. This same law holds, at least