
Document Classification

• Document classification is a technique
used to predict group membership for
document collections.
• Goal: previously unseen documents
should be assigned a class as accurately
as possible.

05/24/2023
Document Classification: Examples
– To classify news articles into “business” and “sports”
– To classify Web pages into personal home pages and
others
– To classify product reviews into positive reviews and
negative reviews
• Approach: supervised machine learning
– For each pre-defined category, we need a set of training
documents known to belong to the category.
– From the training documents, we train a classifier.
Overview
• Step 1—text pre-processing
– to pre-process text and represent each document
as a feature vector
• Step 2—training
– to train a classifier using a classification tool
• Step 3—classification
– to apply the classifier to new documents
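
The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular tool: it assumes a bag-of-words feature vector and a multinomial Naive Bayes classifier with add-one smoothing, and the function names (preprocess, train, classify) and the toy documents are made up for the example.

```python
import math
import re
from collections import Counter, defaultdict

def preprocess(text):
    """Step 1: lowercase, tokenize, and count terms (a bag-of-words feature vector)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def train(labeled_docs):
    """Step 2: collect the counts a multinomial Naive Bayes classifier needs."""
    doc_counts = Counter()              # number of training documents per label
    term_counts = defaultdict(Counter)  # term frequencies per label
    vocab = set()
    for text, label in labeled_docs:
        features = preprocess(text)
        doc_counts[label] += 1
        term_counts[label].update(features)
        vocab.update(features)
    return doc_counts, term_counts, vocab

def classify(model, text):
    """Step 3: return the label with the highest log-probability for a new document."""
    doc_counts, term_counts, vocab = model
    total_docs = sum(doc_counts.values())
    features = preprocess(text)
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)  # log prior
        total_terms = sum(term_counts[label].values())
        for term, count in features.items():
            if term in vocab:  # terms never seen in training are ignored
                # add-one (Laplace) smoothing avoids zero probabilities
                p = (term_counts[label][term] + 1) / (total_terms + len(vocab))
                score += count * math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training documents (invented for illustration).
training_docs = [
    ("stocks fell as the market closed lower", "business"),
    ("the company reported strong quarterly profit", "business"),
    ("the team won the match in extra time", "sports"),
    ("the striker scored two goals in the final", "sports"),
]
model = train(training_docs)
predicted = classify(model, "profit rose as the market recovered")
```

Any other feature representation or learner could be substituted in steps 1 and 2 without changing the overall shape of the pipeline.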
Methods (1)
• Manual classification
– Used by Yahoo!, Looksmart, about.com, ODP, Medline
– very accurate when job is done by experts
– consistent when the problem size and team are small
– difficult and expensive to scale
• Automatic document classification
– Hand-coded rule-based systems
• Reuters, CIA, Verity, …
• Commercial systems have complex query languages
(everything in IR query languages + accumulators)
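
To make the idea of hand-coded rules concrete, here is a deliberately tiny sketch. It does not reproduce the query language of any system named above; the RULES table, its keywords, and the function classify_by_rules are hypothetical.

```python
import re

# Hypothetical hand-written rules: category -> (trigger terms, blocking terms).
RULES = {
    "sports": ({"match", "goal"}, {"stock"}),
    "business": ({"stock", "profit"}, set()),
}

def classify_by_rules(text):
    """Return every category whose rule fires:
    at least one trigger term is present and no blocking term is present."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(
        category
        for category, (triggers, blockers) in RULES.items()
        if tokens & triggers and not tokens & blockers
    )
```

Real rule sets are far larger, and every rule has to be written and maintained by hand, which is the cost the learning-based methods try to avoid.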
Methods (2)
• Supervised learning of document-label assignment
function: Autonomy, Kana, MSN,
Verity, …
• Naive Bayes (simple, common method)
• k-Nearest Neighbors (simple, powerful)
• Support-vector machines (new, more powerful)
• … plus many other methods
• No free lunch: requires hand-classified training
data
• But can be built (and refined) by amateurs
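
As an illustration of one of the listed methods, the following is a minimal k-Nearest Neighbors sketch. It assumes term-count vectors compared by cosine similarity and a simple majority vote; the helper names and the toy reviews are invented for the example.

```python
import math
import re
from collections import Counter

def vectorize(text):
    """Represent a document as a term-count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_classify(training_docs, text, k=3):
    """Label a document by majority vote among its k most similar training documents."""
    query = vectorize(text)
    ranked = sorted(
        training_docs,
        key=lambda pair: cosine(query, vectorize(pair[0])),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hand-classified training data (invented for illustration).
reviews = [
    ("great phone with excellent battery life", "positive"),
    ("excellent screen and great camera", "positive"),
    ("really great value, works well", "positive"),
    ("terrible battery and poor screen", "negative"),
    ("poor quality, broke after a week", "negative"),
    ("terrible support and awful quality", "negative"),
]
label = knn_classify(reviews, "great camera, excellent value")
```

There is no separate training step here: the hand-classified documents themselves act as the model, which is why the method still depends on labeled training data.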
Machine learning approach
• a general inductive process (the learner) automatically
builds a classifier for a category ci by observing the
characteristics of a set of documents manually classified
under ci or c̄i (not ci) by a domain expert
• from these characteristics the learner extracts the
characteristics that a new unseen document should
have in order to be classified under ci
• use of classifier: the classifier observes the characteristics
of a new document and decides whether it
should be classified under ci or c̄i

Classification process: classifier construction

[Figure: a training set of labeled documents (Doc 1; Label: yes, Doc 2; Label: no, ..., Doc n; Label: yes) is fed to the learner, which produces a classifier]

Classification process: testing the classifier

[Figure: the test set is fed to the classifier]

Classification process: use of the classifier

[Figure: a new, unseen document is fed to the classifier, which outputs TRUE / FALSE]

Training set, test set, validation set
• initial corpus of manually classified documents
– let dj belong to the initial corpus
– for each pair <dj, ci> it is known if dj should be filed
under ci
• positive examples, negative examples of a category

• the initial corpus is divided into two sets
– a training set
– a test set
• the training set is used to build the classifier
• the test set is used for testing the effectiveness of the
classifier
– each document is fed to the classifier and the decision
is compared to the manual category
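
The split-and-test procedure can be sketched as follows. The 75/25 split, the fixed random seed, and the names split_corpus, accuracy, and stand_in_classifier are assumptions made for illustration. To keep the focus on evaluation, the classifier is a fixed keyword rule standing in for one that would normally be built from the training set.

```python
import random

def split_corpus(corpus, test_fraction=0.25, seed=0):
    """Shuffle the manually classified corpus and divide it into a training set and a test set."""
    shuffled = corpus[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(classifier, test_set):
    """Fraction of test documents whose predicted label matches the manual label."""
    correct = sum(1 for doc, label in test_set if classifier(doc) == label)
    return correct / len(test_set)

# Toy corpus of (document, manual label) pairs, invented for illustration.
corpus = [
    ("the match ended in a draw", "sports"),
    ("shares rose after the earnings report", "business"),
    ("the home team won the match", "sports"),
    ("the bank raised interest rates", "business"),
    ("a late goal decided the match", "sports"),
    ("the merger was approved by regulators", "business"),
    ("the champion retained her title", "sports"),
    ("oil prices fell sharply", "business"),
]

def stand_in_classifier(doc):
    """Placeholder for a trained classifier (hypothetical keyword rule)."""
    return "sports" if "match" in doc else "business"

training_set, test_set = split_corpus(corpus)
test_accuracy = accuracy(stand_in_classifier, test_set)
```

Accuracy is only one possible effectiveness measure; the comparison of each decision with the manual category works the same way for others.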

• the documents in the test set are not used in the construction
of the classifier
• alternative: k-fold cross-validation
– k different classifiers are built by partitioning the initial
corpus into k disjoint sets and then applying the
train-and-test approach k times: each time, k-1 sets
form the training set and the remaining set is the test set
– the k individual results are then averaged
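
A minimal sketch of k-fold cross-validation, under a few stated assumptions: the folds are formed by simple striding without shuffling, the learner is a trivial majority-class learner used only to keep the example short, and all function names are made up.

```python
from collections import Counter

def k_fold_splits(corpus, k):
    """Partition the corpus into k disjoint folds and yield (training set, test set) pairs."""
    folds = [corpus[i::k] for i in range(k)]
    for i in range(k):
        test_set = folds[i]
        training_set = [doc for j, fold in enumerate(folds) if j != i for doc in fold]
        yield training_set, test_set

def cross_validate(corpus, k, train, evaluate):
    """Build k classifiers, test each on its held-out fold, and average the results."""
    scores = [
        evaluate(train(training_set), test_set)
        for training_set, test_set in k_fold_splits(corpus, k)
    ]
    return sum(scores) / len(scores)

def train_majority(training_set):
    """Toy learner: always predict the most frequent training label."""
    majority = Counter(label for _, label in training_set).most_common(1)[0][0]
    return lambda doc: majority

def evaluate_accuracy(classifier, test_set):
    return sum(classifier(doc) == label for doc, label in test_set) / len(test_set)

# Nine placeholder documents with manual labels (invented for illustration).
labels = ["yes", "yes", "yes", "no", "yes", "yes", "yes", "no", "no"]
corpus = [(f"doc{i}", label) for i, label in enumerate(labels)]
mean_accuracy = cross_validate(corpus, 3, train_majority, evaluate_accuracy)
```

Every document is used for testing exactly once and for training k-1 times, so no document is ever tested by a classifier that saw it during training.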

• the training set can be split into two parts
• one part (the validation set) is used for optimising parameters
– test which values of the parameters yield the best
effectiveness
• test set and validation set must be kept separate
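
Parameter tuning on a validation set can be sketched like this. The learner is a toy invented for the example: it collects terms seen only in "yes" documents and predicts "yes" when at least min_hits of them occur, with min_hits being the parameter to optimise. All names and data are hypothetical.

```python
import re

def tokens(text):
    return set(re.findall(r"[a-z]+", text.lower()))

def build_classifier(train_part, min_hits):
    """Toy learner: terms seen only in 'yes' documents count as evidence for 'yes'."""
    yes_terms, no_terms = set(), set()
    for text, label in train_part:
        (yes_terms if label == "yes" else no_terms).update(tokens(text))
    indicative = yes_terms - no_terms
    return lambda text: "yes" if len(tokens(text) & indicative) >= min_hits else "no"

def accuracy(classifier, docs):
    return sum(classifier(text) == label for text, label in docs) / len(docs)

def tune(train_part, validation_part, candidates):
    """Return the parameter value that scores best on the validation set."""
    return max(
        candidates,
        key=lambda p: accuracy(build_classifier(train_part, p), validation_part),
    )

# The original training set, split into two parts (data invented for illustration).
train_part = [
    ("great phone great battery", "yes"),
    ("excellent camera and screen", "yes"),
    ("poor battery and bad screen", "no"),
    ("bad phone poor camera", "no"),
]
validation_part = [
    ("great value excellent build", "yes"),
    ("great disappointment poor build", "no"),
    ("excellent sound great price", "yes"),
    ("bad sound", "no"),
]
best_min_hits = tune(train_part, validation_part, candidates=[1, 2, 3])
```

The test set plays no role in this selection; it is consulted only afterwards, to measure the effectiveness of the classifier built with the chosen parameter.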

Strengths of machine learning approach
• the learner is domain independent
– usually available ’off-the-shelf’
• the inductive process is easily repeated if the set of
categories changes
– only the training set has to be replaced
• manually classified documents often already available
– manual process may exist
– if not, it is still easier to manually classify a set of
documents than to build and tune a set of rules
