Chapter-12 - Document Classification

Document Classification
• Document classification is a technique
used to predict the group (class) membership
of the documents in a collection.
• Goal: previously unseen documents
should be assigned a class as accurately
as possible.
05/24/2023
Examples
– To classify news articles into “business” and “sports”
– To classify Web pages into personal home pages and
others
– To classify product reviews into positive reviews and
negative reviews
• Approach: supervised machine learning
– For each pre-defined category, we need a set of
training documents known to belong to the category.
– From the training documents, we train a classifier.
Overview
• Step 1—text pre-processing
– to pre-process text and represent each
document as a feature vector
• Step 2—training
– to train a classifier using a classification tool
• Step 3—classification
– to apply the classifier to new documents
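The three steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a method prescribed by these slides: the term-count feature vectors, the nearest-centroid learner standing in for "a classification tool", all function names, and the four toy documents are hypothetical.

```python
import re
from collections import Counter

def tokenize(text):
    """Step 1a: lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def build_vocabulary(docs):
    """Step 1b: map every distinct training token to a vector index."""
    vocab = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(vocab)}

def to_vector(text, vocab):
    """Step 1c: represent a document as a term-count feature vector."""
    counts = Counter(tokenize(text))
    return [counts.get(tok, 0) for tok in vocab]

def train(vectors, labels):
    """Step 2: average the vectors of each class into one centroid
    (an assumed, deliberately simple learner)."""
    sums, sizes = {}, Counter(labels)
    for vec, label in zip(vectors, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
    return {lab: [s / sizes[lab] for s in acc] for lab, acc in sums.items()}

def classify(vector, centroids):
    """Step 3: return the class whose centroid is closest to the vector."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda lab: sq_dist(vector, centroids[lab]))

# Toy training data (hypothetical).
train_docs = [
    "stocks fell as the market closed",
    "the team won the final match",
    "the bank raised interest rates",
    "the striker scored a late goal",
]
train_labels = ["business", "sports", "business", "sports"]

vocab = build_vocabulary(train_docs)
model = train([to_vector(d, vocab) for d in train_docs], train_labels)
prediction = classify(to_vector("the market and the bank", vocab), model)
print(prediction)
```

Words that never occurred in the training documents (here "and") have no index in the vocabulary and are simply ignored when the new document is vectorised.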
Methods (1)
• Manual classification
– Used by Yahoo!, Looksmart, about.com, ODP, Medline
– very accurate when job is done by experts
– consistent when the problem size and team are small
– difficult and expensive to scale
• Automatic document classification
– Hand-coded rule-based systems
• Reuters, CIA, Verity, …
• Commercial systems have complex query
languages (everything in IR query languages +
accumulators)
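To make the idea of a hand-coded rule-based system concrete, here is a toy sketch. It is far simpler than the query languages of the commercial systems named above; the rule format (any-of and none-of keyword sets), the categories, and the keywords are all hypothetical.

```python
import re

# Hypothetical hand-written rules: (category, any_of, none_of).
# A rule fires when the document contains at least one any_of term
# and none of the none_of terms. Rules are tried in order.
RULES = [
    ("business", {"market", "stocks", "shares", "profit"}, {"transfer"}),
    ("sports", {"match", "goal", "league", "striker"}, set()),
]

def classify_by_rules(text, rules=RULES, default="other"):
    """Return the category of the first rule that fires, else the default."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    for category, any_of, none_of in rules:
        if tokens & any_of and not tokens & none_of:
            return category
    return default

print(classify_by_rules("Shares rose after the profit warning"))
print(classify_by_rules("A late goal decided the match"))
print(classify_by_rules("Rain is expected tomorrow"))
```

Every new category or change in vocabulary means editing the rules by hand, which is the maintenance cost that motivates the learning-based methods.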
Methods (2)
• Supervised learning of a document-label
assignment function: Autonomy, Kana, MSN,
Verity, …
• Naive Bayes (simple, common method)
• k-Nearest Neighbors (simple, powerful)
• Support-vector machines (new, more powerful)
• … plus many other methods
• No free lunch: requires hand-classified training
data
• But can be built (and refined) by amateurs
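As an illustration of the first method in the list, the following is a minimal sketch of a multinomial Naive Bayes classifier with add-one (Laplace) smoothing. The smoothing choice, the function names, and the toy review data are assumptions made for this example; the slides do not specify a particular variant.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def train_naive_bayes(docs, labels):
    """Collect class counts and per-class term counts from labelled documents."""
    class_counts = Counter(labels)
    term_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        term_counts[label].update(tokenize(doc))
    vocab = {term for counts in term_counts.values() for term in counts}
    return class_counts, term_counts, vocab

def predict_naive_bayes(text, model):
    """Return the class with the highest log prior plus log likelihood."""
    class_counts, term_counts, vocab = model
    n_docs = sum(class_counts.values())
    tokens = [t for t in tokenize(text) if t in vocab]  # skip unseen terms
    scores = {}
    for label, n in class_counts.items():
        total = sum(term_counts[label].values())
        score = math.log(n / n_docs)  # log P(class)
        for t in tokens:
            # log P(term | class) with add-one smoothing
            score += math.log((term_counts[label][t] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Hand-classified training data (hypothetical).
docs = [
    "great phone, great battery",
    "excellent screen and great sound",
    "terrible battery, poor screen",
    "poor sound and terrible support",
]
labels = ["positive", "positive", "negative", "negative"]

model = train_naive_bayes(docs, labels)
print(predict_naive_bayes("great sound, excellent battery", model))
print(predict_naive_bayes("poor support", model))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied for long documents.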
Machine learning approach
• a general inductive process (the learner) automatically
builds a classifier for a category ci by observing the
characteristics of a set of documents manually classified
under ci or c̄i (not ci) by a domain expert
• from these characteristics the learner extracts the
characteristics that a new unseen document should
have in order to be classified under ci
• use of the classifier: the classifier observes the
characteristics of a new document and decides whether it
should be classified under ci or c̄i
Classification process: classifier construction
[Diagram: the training set (Doc 1; Label: yes, Doc 2; Label: no, ..., Doc n; Label: yes) is fed to the learner, which produces the classifier]
Classification process: testing the classifier
[Diagram: the test set is fed to the classifier]
Classification process: use of the classifier
[Diagram: a new, unseen document is fed to the classifier, which outputs TRUE / FALSE]
Training set, test set, validation set
• initial corpus of manually classified documents
– let dj belong to the initial corpus
– for each pair <dj, ci> it is known if dj should be filed
under ci
• positive examples, negative examples of a category
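The positive and negative examples of a category follow directly from the manual labels. A small sketch, assuming the corpus is stored as a mapping from document id to the set of categories it is filed under (this data layout, the function name, and the ids are hypothetical):

```python
def split_examples(corpus, category):
    """Return (positive, negative) document ids for one category.

    corpus maps a document id dj to the set of categories it is filed under,
    so for every pair <dj, ci> membership is known.
    """
    positive = [d for d, cats in corpus.items() if category in cats]
    negative = [d for d, cats in corpus.items() if category not in cats]
    return positive, negative

corpus = {
    "d1": {"business"},
    "d2": {"sports"},
    "d3": {"business", "politics"},
    "d4": set(),
}
positive, negative = split_examples(corpus, "business")
print(positive, negative)
```

Every document in the corpus ends up in exactly one of the two lists for a given category, so one binary classifier can be trained per category.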
• the initial corpus is divided into two sets
– a training set
– a test set
• the training set is used to build the classifier
• the test set is used for testing the effectiveness of the
classifier
– each document is fed to the classifier and the
decision is compared to the manual category
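A train-and-test run can be sketched as follows. The 25% test fraction, the fixed random seed, the word-overlap learner, and the toy corpus are assumptions chosen for illustration; accuracy is used here as one possible measure of effectiveness.

```python
import random

def train_test_split(examples, test_fraction=0.25, seed=0):
    """Shuffle the corpus and divide it into a training set and a test set."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def train_word_overlap(train_set):
    """Hypothetical learner: remember the words seen with each label and
    predict the label that shares the most words with a new document."""
    words = {}
    for doc, label in train_set:
        words.setdefault(label, set()).update(doc.split())
    return lambda doc: max(words, key=lambda lab: len(words[lab] & set(doc.split())))

def accuracy(classifier, test_set):
    """Fraction of test documents whose decision matches the manual category."""
    correct = sum(1 for doc, label in test_set if classifier(doc) == label)
    return correct / len(test_set)

examples = [
    ("shares fell on the stock market", "business"),
    ("the bank cut interest rates", "business"),
    ("profits rose at the retailer", "business"),
    ("the merger lifted the share price", "business"),
    ("the striker scored twice", "sports"),
    ("the team won the cup final", "sports"),
    ("the coach praised the goalkeeper", "sports"),
    ("the match ended in a draw", "sports"),
]

train_set, test_set = train_test_split(examples)
classifier = train_word_overlap(train_set)   # built from the training set only
test_accuracy = accuracy(classifier, test_set)
print(len(train_set), len(test_set), test_accuracy)
```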
• the documents in the test set are not used in the
construction of the classifier
• alternative: k-fold cross-validation
– the initial corpus is partitioned into k disjoint sets and
k different classifiers are built by iteratively applying
the train-and-test approach, where k-1 sets form the
training set and the remaining set is used as the test set
– the individual results are then averaged
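The k-fold procedure can be sketched as below. The interleaved way of forming the folds, the majority-class learner, and the ten toy examples are assumptions for illustration; any learner and effectiveness measure can be plugged in through train_fn and score_fn.

```python
from collections import Counter

def k_fold_splits(examples, k):
    """Yield (training set, test set) pairs over k disjoint folds."""
    folds = [examples[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        yield train, test

def cross_validate(examples, k, train_fn, score_fn):
    """Build k classifiers, score each on its held-out fold, average the results."""
    scores = [score_fn(train_fn(train), test)
              for train, test in k_fold_splits(examples, k)]
    return sum(scores) / len(scores)

def train_majority(train_set):
    """Hypothetical baseline learner: always predict the most frequent label."""
    majority = Counter(label for _, label in train_set).most_common(1)[0][0]
    return lambda doc: majority

def accuracy(classifier, test_set):
    return sum(classifier(doc) == label for doc, label in test_set) / len(test_set)

labels = ["yes", "yes", "no", "yes", "yes", "no", "yes", "yes", "no", "yes"]
examples = [(f"doc{i}", label) for i, label in enumerate(labels)]
mean_accuracy = cross_validate(examples, 5, train_majority, accuracy)
print(mean_accuracy)
```

Each document is used for testing exactly once and for training k-1 times, which makes better use of a small corpus than a single split.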
• the training set can be split into two parts
• one part (the validation set) is used for optimising parameters
– test which values of parameters yield the best
effectiveness
• test set and validation set must be kept separate
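Parameter optimisation on a validation set might look like the sketch below. The k-nearest-neighbours learner on a single numeric feature, the candidate values of k, and the toy data are assumptions for this example. The test set is touched only once, after the parameter has been chosen.

```python
def select_parameter(candidates, fit, score, train_part, validation_part):
    """Return the candidate value that scores best on the validation part."""
    best_value, best_score = None, float("-inf")
    for value in candidates:
        model = fit(train_part, value)
        current = score(model, validation_part)
        if current > best_score:
            best_value, best_score = value, current
    return best_value, best_score

def fit_knn(train_part, k):
    """Hypothetical learner: k-nearest neighbours on one numeric feature."""
    def predict(x):
        nearest = sorted(train_part, key=lambda ex: abs(ex[0] - x))[:k]
        positives = sum(1 for _, label in nearest if label)
        return 2 * positives > len(nearest)  # majority vote
    return predict

def accuracy(model, examples):
    return sum(model(x) == label for x, label in examples) / len(examples)

# Toy data: (feature value, label); the feature is an assumed numeric score.
train_part = [(-3, False), (-2, False), (-1, True), (1, True), (2, True), (3, False)]
validation_part = [(-2.5, False), (-1.5, False), (1.5, True), (2.5, True)]
test_part = [(-2.2, False), (1.8, True)]

best_k, validation_accuracy = select_parameter(
    [1, 3, 5], fit_knn, accuracy, train_part, validation_part)
final_model = fit_knn(train_part, best_k)
test_accuracy = accuracy(final_model, test_part)  # computed once, after tuning
print(best_k, validation_accuracy, test_accuracy)
```

If the test set were also used to pick the parameter, the reported effectiveness would be optimistically biased, which is why the two sets are kept separate.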
Strengths of machine learning approach
• the learner is domain independent
– usually available ’off-the-shelf’
• the inductive process is easily repeated, if the set of
categories changes
– only the training set has to be replaced
• manually classified documents often already available
– manual process may exist
– if not, it is still easier to manually classify a set of
documents than to build and tune a set of rules