N-Gram-Based Text Categorization

William B. Cavnar and John M. Trenkle
Environmental Research Institute of Michigan, P.O. Box 134001, Ann Arbor, MI 48113-4001

Abstract
Text categorization is a fundamental task in document processing, allowing the automated handling of enormous streams of documents in electronic form. One difficulty in handling some classes of documents is the presence of different kinds of textual errors, such as spelling and grammatical errors in email, and character recognition errors in documents that come through OCR. Text categorization must work reliably on all input, and thus must tolerate some level of these kinds of problems.

We describe here an N-gram-based approach to text categorization that is tolerant of textual errors. The system is small, fast and robust. This system worked very well for language classification, achieving in one test a 99.8% correct classification rate on Usenet newsgroup articles written in different languages. The system also worked reasonably well for classifying articles from a number of different computer-oriented newsgroups according to subject, achieving as high as an 80% correct classification rate. There are also several obvious directions for improving the system's classification performance in those cases where it did not do as well.

The system is based on calculating and comparing profiles of N-gram frequencies. First, we use the system to compute profiles on training set data that represent the various categories, e.g., language samples or newsgroup content samples. Then the system computes a profile for a particular document that is to be classified. Finally, the system computes a distance measure between the document's profile and each of the category profiles. The system selects the category whose profile has the smallest distance to the document's profile. The profiles involved are quite small, typically 10K bytes for a category training set, and less than 4K bytes for an individual document.

Using N-gram frequency profiles provides a simple and reliable way to categorize documents in a wide range of classification tasks.
1.0 Introduction
Electronic documents come from a wide variety of sources. Many are generated with various word processing software packages, and are subjected to various kinds of automatic scrutiny, e.g., spelling checkers, as well as to manual editing and revision. Many other documents, however, do not have the benefit of this kind of scrutiny, and thus may contain significant numbers of errors of various kinds. Email messages and bulletin board postings, for example, are often composed on the fly and sent without even the most cursory levels of inspection and correction. Also, paper documents that are digitally scanned and run through an OCR system will doubtless contain at least some recognition errors. It is precisely on these kinds of documents, where further manual inspection and correction is difficult and costly, that there would be the greatest benefit in automatic processing.

One fundamental kind of document processing is text categorization, in which an incoming document is assigned to some pre-existing category. Routing news articles from a newswire is one application for such a system. Sorting through digitized paper archives would be another. These applications have the following characteristics:
 
 
• The categorization must work reliably in spite of textual errors.

• The categorization must be efficient, consuming as little storage and processing time as possible, because of the sheer volume of documents to be handled.

• The categorization must be able to recognize when a given document does not match any category, or when it falls between two categories. This is because category boundaries are almost never clear-cut.

In this paper we will cover the following topics:
 
• Section 2.0 introduces N-grams and N-gram-based similarity measures.

• Section 3.0 discusses text categorization using N-gram frequency statistics.

• Section 4.0 discusses testing N-gram-based text categorization on a language classification task.

• Section 5.0 discusses testing N-gram-based text categorization on a computer newsgroup classification task.

• Section 6.0 discusses some advantages of N-gram-based text categorization over other possible approaches.

• Section 7.0 gives some conclusions, and indicates directions for further work.
2.0 N-Grams
An N-gram is an N-character slice of a longer string. Although in the literature the term can include the notion of any co-occurring set of characters in a string (e.g., an N-gram made up of the first and third character of a word), in this paper we use the term for contiguous slices only. Typically, one slices the string into a set of overlapping N-grams. In our system, we use N-grams of several different lengths simultaneously. We also append blanks to the beginning and ending of the string in order to help with matching beginning-of-word and ending-of-word situations. (We will use the underscore character ("_") to represent blanks.) Thus, the word "TEXT" would be composed of the following N-grams:
bi-grams:   _T, TE, EX, XT, T_
tri-grams:  _TE, TEX, EXT, XT_, T__
quad-grams: _TEX, TEXT, EXT_, XT__, T___
In general, a string of length k, padded with blanks, will have k+1 bi-grams, k+1 tri-grams, k+1 quad-grams, and so on.

N-gram-based matching has had some success in dealing with noisy ASCII input in other problem domains, such as in interpreting postal addresses ([1] and [2]), in text retrieval ([3] and [4]), and in a wide variety of other natural language processing applications [5]. The key benefit that N-gram-based matching provides derives from its very nature: since every string is decomposed into small parts, any errors that are present tend to affect only a limited number of those parts, leaving the remainder intact. If we count N-grams that are common to two strings, we get a measure of their similarity that is resistant to a wide variety of textual errors.
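To make the slicing concrete, here is a minimal sketch (not the authors' code) that reproduces the "TEXT" example above; the padding convention of one leading blank and N-1 trailing blanks is inferred from that example.

```python
# Minimal sketch of overlapping N-gram generation, inferred from the
# "TEXT" example above: one leading blank and N-1 trailing blanks,
# with "_" standing in for the blank character.

def ngrams(token, n):
    """Return all contiguous N-grams of a blank-padded token."""
    padded = "_" + token + "_" * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(ngrams("TEXT", 2))  # ['_T', 'TE', 'EX', 'XT', 'T_']
print(ngrams("TEXT", 3))  # ['_TE', 'TEX', 'EXT', 'XT_', 'T__']
print(ngrams("TEXT", 4))  # ['_TEX', 'TEXT', 'EXT_', 'XT__', 'T___']
```

Note that for a token of length k this yields k+1 N-grams at each size, matching the count given above.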
3.0 Text Categorization Using N-Gram Frequency Statistics
Human languages invariably have some words which occur more frequently than others. One of the most common ways of expressing this idea has become known as Zipf's Law [6], which we can re-state as follows:

    The nth most common word in a human language text occurs with a frequency inversely proportional to n.
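Stated as a formula (our notation; the paper gives the law only in words), with f(n) the frequency of the nth most common word:

```latex
f(n) \propto \frac{1}{n}, \qquad \text{i.e.,} \quad f(n) \approx \frac{f(1)}{n}
```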
The implication of this law is that there is always a set of words which dominates most of the other words of the language in terms of frequency of use. This is true both of words in general, and of words that are specific to a particular subject. Furthermore, there is a smooth continuum of dominance from most frequent to least. The smooth nature of the frequency curves helps us in some ways, because it implies that we do not have to worry too much about specific frequency thresholds. This same law holds, at least approximately, for other aspects of human languages. In particular, it is true for the frequency of occurrence of N-grams, both as inflection forms and as morpheme-like word components which carry meaning. (See Figure 1 for an example of a Zipfian distribution of N-gram frequencies from a technical document.) Zipf's Law implies that classifying documents with N-gram frequency statistics will not be very sensitive to cutting off the distributions at a particular rank. It also implies that if we are comparing documents from the same category they should have similar N-gram frequency distributions.

We have built an experimental text categorization system that uses this idea. Figure 2 illustrates the overall data flow for the system. In this scheme, we start with a set of pre-existing text categories (such as subject domains) for which we have reasonably sized samples, say, of 10K to 20K bytes each. From these, we would generate a set of N-gram frequency profiles to represent each of the categories. When a new document arrives for classification, the system first computes its N-gram frequency profile. It then compares this profile against the profiles for each of the categories using an easily calculated distance measure. The system classifies the document as belonging to the category having the smallest distance.
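As a concrete illustration of this data flow, here is a minimal sketch (not the authors' code). The paper defines its distance measure later, so this sketch substitutes a simple rank-based "out-of-place" distance as an assumption, and takes a profile to be a list of N-grams ordered from most to least frequent.

```python
# Minimal sketch of the classification data flow described above.
# A profile is assumed to be a list of N-grams, most frequent first.

def profile_distance(doc_profile, cat_profile):
    """Assumed stand-in distance: sum, over the document's N-grams, of
    the difference between their ranks in the two profiles, with a
    fixed maximum penalty for N-grams missing from the category."""
    cat_rank = {ng: r for r, ng in enumerate(cat_profile)}
    max_penalty = len(cat_profile)
    total = 0
    for doc_rank, ng in enumerate(doc_profile):
        if ng in cat_rank:
            total += abs(doc_rank - cat_rank[ng])
        else:
            total += max_penalty
    return total

def classify(doc_profile, category_profiles):
    """Select the category whose profile has the smallest distance
    to the document's profile."""
    return min(category_profiles,
               key=lambda cat: profile_distance(doc_profile,
                                                category_profiles[cat]))
```

For example, given category_profiles built from 10K-20K byte training samples per category, classify(doc_profile, category_profiles) returns the name of the nearest category.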
3.1 Generating N-Gram Frequency Profiles
The bubble in Figure 2 labelled "Generate Profile" is very simple. It merely reads incoming text, and counts the occurrences of all N-grams. To do this, the system performs the following steps:
 
• Split the text into separate tokens consisting only of letters and apostrophes. Digits and punctuation are discarded. Pad the token with sufficient blanks before and after.

• Scan down each token, generating all possible N-grams, for N = 1 to 5. Use positions that span the padding blanks, as well.

• Hash into a table to find the counter for the N-gram, and increment it. The hash table uses a conventional collision handling mechanism to ensure that each N-gram gets its own counter.

• When done, output all N-grams and their counts.

• Sort those counts into reverse order by the number of occurrences. Keep just the N-grams themselves, which are now in reverse order of frequency.
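Put together, the steps above might look like the following sketch (again not the authors' code; lowercasing the text is our assumption, and Python's built-in Counter stands in for the explicit collision-handling hash table):

```python
import re
from collections import Counter

def generate_profile(text, max_rank=None):
    """Count all N-grams (N = 1 to 5) of the blank-padded tokens in
    `text` and return the N-grams ranked by descending frequency."""
    counts = Counter()
    # Tokens consist only of letters and apostrophes; digits and
    # punctuation are discarded. Lowercasing is our assumption.
    for token in re.findall(r"[a-z']+", text.lower()):
        for n in range(1, 6):
            padded = "_" + token + "_" * (n - 1)
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Reverse (descending) order by count; keep just the N-grams.
    ranked = [ng for ng, _ in counts.most_common()]
    return ranked[:max_rank] if max_rank is not None else ranked
```

The optional max_rank cutoff reflects the observation above that classification with N-gram frequency statistics is not very sensitive to cutting off the distributions at a particular rank.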
[Figure 1. N-Gram Frequencies By Rank In A Technical Document: plot of N-gram frequency (0-2000) against N-gram rank (0-500), showing a Zipfian distribution.]
