
Extremely Fast Text Feature Extraction for Classification and Indexing

 
 
 
 
 
Statistic           Doc       Scribd Average
Pages               16        43
Words               12880     13640
Characters          78833     81678
Lines               311       623
Letters per word    6.12      5.99
Words per line      41.41     21.89
Words per page      805.0     317.21


Document Information

1,665 Reads | 0 Comments

Description

Most research in speeding up text mining involves algorithmic improvements to induction algorithms, and
yet for many large scale applications, such as classifying or indexing large document repositories, the time
spent extracting word features from texts can itself greatly exceed the initial training time. This paper
describes a fast method for text feature extraction that folds together Unicode conversion, forced
lowercasing, word boundary detection, and string hash computation. We show empirically that our integer
hash features result in classifiers with equivalent statistical performance to those built using string word
features, but require far less computation and less memory.
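
As a rough illustration of the idea in the description, the sketch below folds lowercasing, word boundary detection, and hash computation into a single pass over the input, so that no intermediate word strings are built. It is not the paper's implementation: the names `build_table` and `hash_features`, the multiplier 31, and the 32-bit mask are assumptions chosen for illustration, and the sketch handles ASCII bytes only, treating every other byte as a word separator rather than performing the Unicode conversion the paper describes.

```python
MASK = 0xFFFFFFFF  # keep hashes in 32 bits (an assumed width, not from the paper)


def build_table():
    """Map each byte to its lowercase code if it is an ASCII letter or digit, else 0."""
    table = [0] * 256
    for b in range(128):
        ch = chr(b)
        if ch.isalnum():
            table[b] = ord(ch.lower())
    return table


TABLE = build_table()


def hash_features(data: bytes):
    """Return one integer hash per word, computed in a single pass over the bytes."""
    features = []
    h = 0
    in_word = False
    for b in data:
        code = TABLE[b]  # one lookup does both lowercasing and word/non-word classification
        if code:
            # Word character: extend the running hash (simple multiplicative hash).
            h = (h * 31 + code) & MASK
            in_word = True
        elif in_word:
            # First separator after a word: emit the hash and reset.
            features.append(h)
            h = 0
            in_word = False
    if in_word:
        features.append(h)  # flush the final word if the input ends mid-word
    return features
```

With this sketch, `hash_features(b"Hello, World")` and `hash_features(b"hello world")` yield the same two integers, since case and punctuation are resolved by the table lookup before hashing. A classifier or indexer could then consume these integers directly in place of string word features.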

16 Pages


Date Added

09/06/2008

Category

Uncategorized.

Copyright

Attribution Non-commercial
