Radošević, D., J. Dobša, D. Mladenić, Z. Stapić, M. Novak: Genre Document Classification Using Flexible Length Phrases, Proceedings of the 17th IIS International Conference on Data and Knowledge Bases, pp. 23-28, Varaždin, 2006.
Genre Document Classification Using Flexible Length Phrases
Danijel Radošević, Jasminka Dobša
University of Zagreb, Faculty of Organization and Informatics, Pavlinska 2, 42 000 Varaždin, Croatia
{danijel.radosevic, jasminka.dobsa}@foi.hr

Dunja Mladenić
Jožef Stefan Institute, Jamova 39, 1000 Ljubljana, Slovenia
Dunja.Mladenic@ijs.si

Zlatko Stapić, Miroslav Novak
University of Zagreb, Faculty of Organization and Informatics, Pavlinska 2, 42 000 Varaždin, Croatia
{zlatko.stapic, miroslav.novak}@foi.hr
Abstract. In this paper we investigate the possibility of using phrases of flexible length in genre classification of textual documents, as an extension to the classic bag-of-words document representation in which documents are represented using single words as features. The investigation is conducted on a collection of articles from a document database gathered from three different sources representing different genres: newspaper reports, abstracts of scientific articles, and legal documents. The investigation includes a comparison between classification results obtained using the classic bag-of-words representation and results obtained using the bag of words extended by flexible-length phrases.

Keywords. Flexible length phrases, bag-of-words representation, genre classification
1. Introduction
The goal of text categorization is the classification of text documents into a fixed number of predefined categories. Document classification is used in many different problem areas involving text documents, such as classifying news articles based on their content or suggesting interesting documents to the web user.

The common way of representing textual documents is the vector space model or, so called, bag-of-words representation [13]. In general, an index term can be any word present in the text of a document, but not all the words in a document have equal importance in representing its semantics. That is why various weighting schemes in the bag-of-words representation give greater weight to words which appear in a smaller number of documents, and thus have greater discrimination power in document classification, and smaller weight to words which are present in many documents. A common preprocessing step in document indexing is the elimination of so-called stop words, i.e. words such as conjunctions and prepositions, because these words have low discrimination power.

Some approaches extend the bag-of-words representation by additional index terms such as n-grams, proposed in [11]. Those index terms consist of sequences of n words. In this work we suggest using statistical phrases of flexible length; in the remainder of the text we refer to them simply as phrases. The main difference between n-grams and phrases, besides the fact that n-grams consist of exactly n words while phrases are of flexible length, is that phrases can contain punctuation. It is important to stress that phrases can also include stop words. The reason for including stop words is the intention to form phrases characteristic of writing style. So, besides index terms consisting of a single word, which are key words characteristic of some topic, we represent documents using phrases which could reveal the style of writing. Nevertheless, some phrases are very common and their classification power is low. That is why we introduce stop phrases: phrases that do not appear dominantly in one single category (we used a threshold of 70% of all occurrences).

Classification of documents according to style of writing or genre has already been recognized as a useful procedure for heterogeneous collections of documents, especially for the Web ([4], [7], [9], [10], [14]). Many algorithms have already been developed for automatic categorization [6]. For the purpose of our experiments we used the algorithm of support vector machines (SVMs), introduced in 1992 by Vapnik and coworkers [2]. Since then, it has been shown that the SVM algorithm is very effective in a wide range of applications, especially for the classification of text documents [8].

The paper is organized as follows:
- section 1 is an introduction to using phrases of flexible length in genre classification of textual documents as an extension to the classic bag-of-words document representation,
- section 2 describes the proposed algorithm for generating statistical phrases of flexible length,
- section 3 gives the experimental design: the algorithm of support vector machines (SVM, used for classification) and the data description,
- section 4 gives the results of the performed experiments,
- section 5 discusses the experimental results and plans for further work.
2. Generating statistical phrases of flexible length
 
The algorithm for generating statistical phrases is based on the frequency of the phrases over all the documents. By application of the algorithm, a list of phrases occurring at least two times in all categories is obtained.
2.1. Algorithm Description
The algorithm [12] starts by dividing all sentences from all the documents into sets of subsentences. A subsentence is a sequence of words that starts at a subsequent word position and consists of the rest of the sentence. The minimum length of a subsentence is two. Starting from a sentence containing n words, in this way we get a set of (n-1) subsentences. All punctuation marks are treated in the same way as words. After that, subsentences are sorted by category. Inside each category they are sorted alphabetically in ascending order. Extracting phrases starts by comparing the beginnings of subsequent subsentences. For each subsentence, the algorithm finds its overlap with the next subsentence, starting from the first word of each subsentence. If the overlap consists of a minimum of two words, it is considered a phrase. The algorithm for extracting phrases is summarized in the following pseudocode.
Given: Set of documents (each document is a sequence of sentences consisting of words).

foreach Document
    foreach Sentence
        form a set of subsentences, starting from the position
        of each word in the Sentence and taking the rest of
        the sentence
    end // Sentence
end // Document
collect all subsentences from all documents
sort subsentences alphabetically
foreach Subsentence
    compare Subsentence to the next subsentence by taking
    the maximum number of overlapping words from the
    first word (forming a phrase)
    extract phrases consisting of minimum two words
end // Subsentence
foreach Phrase
    count number of occurrences in documents
end // Phrase
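The pseudocode above can be sketched as runnable pure Python; the tokenized example documents at the bottom are invented for illustration:

```python
from collections import Counter

def extract_phrases(documents):
    """Extract flexible-length statistical phrases following the
    pseudocode: each document is a list of sentences, each sentence
    a list of tokens (punctuation tokens included)."""
    # Step 1: every suffix ("subsentence") of at least two words
    subsentences = []
    for doc in documents:
        for sentence in doc:
            for i in range(len(sentence) - 1):
                subsentences.append(tuple(sentence[i:]))
    # Step 2: sort alphabetically so similar beginnings are adjacent
    subsentences.sort()
    # Step 3: overlap of each subsentence with the next one,
    # measured word by word from the first word
    phrases = set()
    for a, b in zip(subsentences, subsentences[1:]):
        overlap = 0
        while overlap < min(len(a), len(b)) and a[overlap] == b[overlap]:
            overlap += 1
        if overlap >= 2:              # a phrase needs at least two words
            phrases.add(a[:overlap])
    # Final step: count occurrences of each phrase in the documents
    counts = Counter()
    for doc in documents:
        for sentence in doc:
            for phrase in phrases:
                k = len(phrase)
                counts[phrase] += sum(
                    tuple(sentence[i:i + k]) == phrase
                    for i in range(len(sentence) - k + 1))
    return counts

docs = [[["soil", "erosion", "is", "a", "natural", "process", "."]],
        [["erosion", "is", "a", "slow", "process", "."]]]
counts = extract_phrases(docs)
```

On this toy input the extracted phrases are "erosion is a", "is a", and "process ." (note that, as in the paper, punctuation participates in phrases).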
Some of the generated phrases occur frequently in most of the categories and not dominantly in one single category. We consider such phrases "stop phrases", i.e. phrases that are useless for the given classification task. Consequently, these phrases are discarded from the phrase list.
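The stop-phrase filter with the 70% dominance threshold from Section 1 can be sketched as follows (the per-category counts and category names are invented for illustration):

```python
def remove_stop_phrases(category_counts, threshold=0.7):
    """Discard phrases not dominated by a single category.

    category_counts maps phrase -> {category: occurrence count};
    a phrase is kept only if one category accounts for at least
    `threshold` (70% in the paper) of all its occurrences.
    """
    kept = {}
    for phrase, per_cat in category_counts.items():
        total = sum(per_cat.values())
        if total and max(per_cat.values()) / total >= threshold:
            kept[phrase] = per_cat
    return kept

counts = {
    ("the", "court", "ruled"): {"legal": 9, "news": 1},           # 90% legal
    ("on", "the", "other"): {"legal": 4, "news": 3, "science": 3} # 40% max
}
kept = remove_stop_phrases(counts)
```

Here "the court ruled" survives as characteristic of one genre, while "on the other" is discarded as a stop phrase.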
2.2. Illustration of the algorithm
In the first step, all sentences in all documents are divided into sets of all subsentences, starting from each position in the sentence and taking the rest of the sentence (punctuation signs are treated as words; the last word alone is not considered because it cannot form a phrase). An illustration of the first step of the algorithm is given in Figure 1. In the second step, all subsentences belonging to the same category are put together and sorted. In the third step, the beginning of each subsentence is compared to the beginning of the next subsentence, taking the maximum number of overlapping words. An illustration of the second and third steps of the algorithm is given in Figure 2, where extracted phrases are shadowed.

Figure 1: The first step (making subsentences). The example sentence "Erozija tla vodom je prirodni proces kojega čovjek može ubrzati." ("Soil erosion by water is a natural process which man can accelerate.") is expanded into subsentences:

    tla vodom je prirodni proces kojega čovjek može ubrzati.
    vodom je prirodni proces kojega čovjek može ubrzati.
    je prirodni proces kojega čovjek može ubrzati.
    prirodni proces kojega čovjek može ubrzati.
    proces kojega čovjek može ubrzati.
    kojega čovjek može ubrzati.
    ...
Figure 2: The second step (sorting of subsentences) and the third step (extraction of phrases by comparison of the beginnings of subsequent subsentences). Extracted phrases are shadowed in the original figure; the sorted subsentences include, for example:

    ...
    erozija se može svesti na minimum.
    erozija tla vodom.
    erozija tla vodom je prirodni proces.
    erozija zuba kod osoba koje su preboljele karijes.
    ...
    nižih količina dušika u proizvodnji.
    nižih nadmorskih visina.
    nižih ocjena iz matematike.
    nižih ocjena svojim nastavnicima nego
    nižih prinosa
    ...
3. Experimental design
For classification we used support vector machines ([3], [8]), implemented in the SvmLight v5.0 software by Joachims [8], with default parameters. The evaluation is done by 3-fold cross validation.
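The 3-fold cross-validation protocol can be sketched generically; the classifier is left abstract because the paper uses the external SvmLight tool. The interleaved fold assignment and the toy majority-label baseline below are illustrative assumptions, not the paper's setup:

```python
def three_fold_cv(examples, labels, train_and_test):
    """Evaluate a classifier by 3-fold cross validation: train on two
    folds, test on the held-out third, and average the accuracies.

    train_and_test is any (train_x, train_y, test_x, test_y) -> accuracy
    function; the paper plugs SvmLight in at this point.
    """
    folds = [list(range(i, len(examples), 3)) for i in range(3)]
    accuracies = []
    for held_out in range(3):
        test_idx = folds[held_out]
        train_idx = [i for f in range(3) if f != held_out
                     for i in folds[f]]
        accuracies.append(train_and_test(
            [examples[i] for i in train_idx], [labels[i] for i in train_idx],
            [examples[i] for i in test_idx], [labels[i] for i in test_idx]))
    return sum(accuracies) / 3

def majority_baseline(train_x, train_y, test_x, test_y):
    # predict the most frequent training label for every test example
    prediction = max(set(train_y), key=train_y.count)
    return sum(y == prediction for y in test_y) / len(test_y)

avg = three_fold_cv(list(range(9)), [1, 1, 1, 1, 1, 1, 0, 0, 0],
                    majority_baseline)
```

With six positive and three negative toy examples, the majority baseline scores 2/3 in each fold, so the averaged accuracy is 2/3.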
3.1. The algorithm of support vector machines
A support vector machine ([3], [8]) is an algorithm that finds a hyperplane which separates positive and negative training examples with the maximum possible margin. This means that the distance between the hyperplane and the corresponding closest positive and negative examples is maximized. A classifier of the form $sign(\langle w, x \rangle + b)$ is learned, where $w$ is the weight vector, or normal vector to the hyperplane, and $b$ is the bias. The goal of margin maximization is equivalent to the goal of minimization of the norm of the weight vector when the margin is fixed to be of unit value. Let the training set be the set of pairs $(x_i, y_i)$, $i = 1, 2, \ldots, n$, where $x_i$ are vectors of attributes and $y_i$ are labels which take values 1 and -1. The problem of finding the separating hyperplane is reduced to an optimization problem of the type

$$\min_{w,b} \, \langle w, w \rangle \quad \text{subject to} \quad y_i(\langle w, x_i \rangle + b) \geq 1, \quad i = 1, 2, \ldots, n.$$

This problem can be transformed into its corresponding dual problem, for which efficient algorithms have been developed. The dual problem is the problem of the type

$$\max_{\alpha} \, \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle \quad \text{subject to} \quad \sum_{i=1}^{n} \alpha_i y_i = 0, \quad \alpha_i \geq 0, \quad i = 1, \ldots, n.$$

If $\alpha^* = (\alpha_1^*, \alpha_2^*, \ldots, \alpha_n^*)$ is the solution of the dual problem, then the weight vector

$$w^* = \sum_{i=1}^{n} \alpha_i^* y_i x_i$$

and the bias

$$b^* = -\frac{\max_{y_i = -1} \langle w^*, x_i \rangle + \min_{y_i = 1} \langle w^*, x_i \rangle}{2}$$

realize the maximal margin hyperplane. The training pair
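The relationship between the dual solution and the hyperplane can be made concrete with a small numeric sketch; the two-point toy problem and its closed-form dual solution alpha* = (0.5, 0.5) are our own illustration, not from the paper:

```python
def hyperplane_from_dual(alphas, ys, xs):
    """Recover the maximal-margin hyperplane (w*, b*) from a dual
    solution, following the formulas above:
      w* = sum_i alpha_i * y_i * x_i
      b* = -(max over y_i=-1 of <w*, x_i> + min over y_i=1 of <w*, x_i>) / 2
    """
    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))

    dim = len(xs[0])
    w = [sum(a * y * x[d] for a, y, x in zip(alphas, ys, xs))
         for d in range(dim)]
    b = -(max(dot(w, x) for x, y in zip(xs, ys) if y == -1)
          + min(dot(w, x) for x, y in zip(xs, ys) if y == 1)) / 2
    return w, b

# Toy problem: one positive and one negative point on the x-axis.
# The known dual solution alpha* = (0.5, 0.5) gives w* = (1, 0), b* = 0,
# i.e. the separating hyperplane is the y-axis, midway between the points.
w, b = hyperplane_from_dual([0.5, 0.5], [1, -1], [(1, 0), (-1, 0)])
```

Both training points then satisfy the margin constraint y_i(<w*, x_i> + b*) = 1 with equality, as support vectors must.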