You are on page 1of 38


Peter Náther

N-gram based Text Categorization
Diploma thesis Thesis advisor: Mgr. Ján Habdák



By this, I declare that I wrote this diploma thesis by oneself, only with the help of the referenced literature, under the careful supervision of my thesis advisor.

Bratislava, 2005

Peter Náther

I would like to express sincere appreciation to Mr. Jan Habdák for his assistance and insight throughout the development of this project. Thanks also to my family and friends their support and patience.

. . . 4 Clusterization and Categorization of Long Texts 4. . . .4 3.1 4. . . . . . . . . . . . . . . . . . .3 5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4. . . . . . . . . . . . . . . . . . Observations . . . . . . . . . . . . . . . .3 3. . . . . . . . . . . . . . . . . . . . . . . . . . .1 5. . . . . . . . . . . . . . . . . . . . . . Clusterization . . . . . . . . . . . . .1 2. . . . . . . . . . . . . . . . . . 6 Conclusions and Further Work 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4 Corpus . . . . . . . . . . . . . . . . . Comparing . . . . . . . . . . . . . .2 4. . . . . . . . . . . .4 4. . . . . . . . . . . . . . . . Clusterization . . . . . . . . .5 N-gram Selection . . . . . . Performance Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2 5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . N-gram frequency statistics . . .2 4. . . . . . . . . . . . . . . . . . . . . . . . . . . . .2 2. . . . . . . . . Learning Phase . . . . . . . . . . . . . .6 N-grams . . . . . . Measure of Similarity . . . . . . . . . . . . . . . . . .1 3. . . . . . . . . . . . . . . . .3 3 Feature Construction and Feature Selection . . . .1 Feature Selection . . . . . . . . Possible improvements . . . 5 Experiments 5. 3 5 5 6 8 10 10 11 11 12 12 13 14 15 15 15 16 17 17 18 20 20 20 21 23 25 N-gram based Text Categorization 3. . . . . Document and Category Representation . . . . . . . . .1. . . .3 4. . . . . . . . . . . . . .1. . Categorization . . . . . . . . . . IDF’s Threshold . . . . . . . . . . .Contents 1 2 Introduction Text Categorization 2. . . . . . . . . . . . . . . . . . . . . . . . . . . . .2 3. . . . . . . . . . . . . . . . . IDF table . Categorization . . . . . . . . . Document representation . . . . . . . . . . . . . . . . . .5 3. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

CONTENTS Bibliography A Tables B Implementation 2 28 28 30 .

That is why there is a lot of research in the area of automatic text categorization. but with lower achievement and some other restrictions.). Such a learning algorithm is trained on the set of texts with assigned categories. helping the user to navigate to the information he would like to obtain. and solved this problem with the accuracy of 99. Trained system we can use to assign category(ies) to the previously unseen documents. These tasks are usually solved within the machine learning framework. Usually text categorization tries to categorize documents according to two characteristics . Usually these catalogs are made by people. Trenkle and W. with the usage of sequences of characters. Used methods vary from simple methods to very complex ones.Chapter 1 Introduction We live in the world where information have a great value and the amount of available information (mostly on internet) has been expansively growing during last years. The main problem of the designed system was 3 . this method was designed for categorization by a language. relations between them. which consumes a lot of resources and computational time and the results are often not so good to be used. the automatic text categorization is a hard problem.the language and the subject (or topic) of the text. Most of these information are texts and here the text categorization comes to the scene.8%. They introduced a very simple method based on statistical principles. This is an excellent number and I think that it is easy to increase it even more. etc. The inspiration for this work comes from the article N-gram Based Text Categorization by J. because you have to read each text (or its part) to assign a correct category to it. They tested it also on the problem of categorization by a subject. creating such a database consumes a lot of time. First. Because of this. There are so much information around us. there are many databases and catalogues of information divided into categories. However. Usually the text is represented as the sequence of words and to find a correct category for it you have to use a lot of information about the text (number of words. Because of this. Cavnar [1]. Usually whole words or sequences of words are used in text categorization. Text categorization is the problem of choosing a category from the catalogue or database that has common characteristics with the selected text. that it becomes a problem to find those that are relevant for us.

where the numbers of Dewey Decimal Classification for some publications were. As I have written.loc. Therefore. I have tried many other resources and finally I tried to build up my own database from the books (E-texts) available on the Project Gutenberg’s site 1 with the help of the database of publication on the database located on the web site of the Library of Congress 2 . we need the mentioned training set. This Classification was developed early. and has more levels of categories. What I needed was the database of documents (articles. the one introduced in [1] is not an exception. Graph Theory. books . This means articles or books. but it is a nice experiment. text categorization used to be solved with learning methods. Nevertheless. It means that before the text categorization I have to perform a text clusterization. In my work. Especially to change this system in the way that it will be working well for long texts. these categories may not correspond to categories that people use. In 3 In this thesis I use some standard terminology from the graph theory. To provide sufficient experiments we needed a large set of long documents. You can find all definitions in the book Diestel. This means to divide the training set (in this case E-texts) into the groups according to some common characteristics of the documents in one of such http://www.) with the categories assigned to each of these documents. Of course. which can tell us whether the documents in the categories we used had something common. Springer-Verlag New York 1997. I extracted names of authors and titles from the E-texts. . R. but this also does not work well enough to build and categorized database. in 1873.gutebberg. I asked forthe Reuter’s corpus. even if the “learning phase” is so simple that we can say there is almost no learning... I suggested a method for the text clusterization and I have made some implementation to be able to run some tests of clusterization and categorization. I wanted to find a way to improve this method to make it more general and obtain better results. Than I tried to match them with the entries in this database. unfortunately I got no answer. After all these unsuccessful attempts. which is often used for training and testing automatic text categorization. INTRODUCTION 4 that it was working only for short texts.CHAPTER 1. I gave it up and decided to build my database automatically.3 1 2 http://www.

Other possibilities are the context of word w . not necessarily 5 . for more complete overview in section 2.1 Feature Construction and Feature Selection As I have mentioned above. Some of already used features are simple words (or strings separated by blanks) with the number of occurrences in documents as the value of the feature. Nowadays statistical pattern recognition and neural networks are used to construct text classifiers. Therefore. which is a sequence of nearby. we have to extract some features from the document. I described mainly used performance measures. which will represent it. The problem is that many authors used their own measures.2. learning and classification techniques work with vectors or sets of features. However. 2. The reason is that classifiers use some computations algorithms and these work with some measurable characteristics and not the plain text. that must co occur in the document with w(used in RIPPER [2]) or spare phrases. In the 80’s mainly knowledge-based systems were implemented and used for the text categorization.1. Learning Phase 2. Often very complicated and sophisticated methods are used to construct these inputs.Chapter 2 Text Categorization Text categorization or classification can be defined as a content-based assignment of one or more predefined categories to free texts. in the second task we train the classification machine on features (feature vectors) obtained from documents from the training data set.3. Usually each text categorization system is then tested according to texts from outside of the training set. The first one serves to prepare data for learning machines. The goal of a classifier is to assign a category(ies) to given documents. Of course. and almost each system was tested with different data sets. Text categorization or the process of learning to classify texts can be divided into two main tasks: Feature Construction and Feature Selection 2. there were made only few such comparisons. The best would be to make various measurements of performance and compare them to the results of other systems. each classifier needs some set of data as an input for each single computation.set of words. Unfortunately. also called feature vectors. The most frequently used input is the vector of weighted characteristics.

I have not seen any solution that would use anything else as parts of the text. Generally. For this reason. Last but not least I have to mention character N-grams that are very efficient and easy to be used in various text categorization tasks [1] [4] [5]. the feature of the document can be any characteristics you can observe and describe (e. FuCe and Autofeatures [4]. examined in more details in [6]. information gain . Usually. m-ary classifiers use the same classifier for all categories and produce a ranked list of candidate categories for each document. There are many methods for dimensionality reduction in statistical literature. to use independent binary classifier to produce category ranking is also possible.g. Per-category binary decisions can be obtained by thresholding on ranks or scores of candidate categories. In principle. Let me note. There are several algorithms for converting m-ary classification output into per-category binary decisions. there are some possibilities of term-goodness criteria. The dimension of the features is very high. the feature selection plays an important reason in the text categorization. Feature selection methods are designed to reduce the dimension of the feature. used for their training. Binary classifiers assign YES/NO for a given category and document.2 Learning Phase We can divide classifiers into two main types. the length of the text). These are: document frequency number of documents in which a feature occurs. what can lead to overfitting of classification machine. χ2 . independently from its decisions on other categories. All these features have a common disadvantage. 2. E. Some experiments and more details about most of the following can be found in [8]. Usually the simple threshold cut off method is used. with a confidence score for each candidate. Distances of vectors are the simplest classifiers. with the smallest possible influence on the information represented by feature vector.CHAPTER 2. that all these features where parts of the given text. all these selections have a positive impact on the performance of text categorization.measures the number of bits of information obtained from category prediction by knowing the presence or absence of a feature in a document. Now I will shortly describe some classifiers and machine learning frameworks. which are commonly used in the text categorization.g. algorithms of classifiers are closely related to machine learning algorithms. These work with previous results of classification machines. Nevertheless. Also word N-grams (ordered sequences of words)[3] and morphemes [4] were used. In addition. TEXT CATEGORIZATION 6 consecutive words (used in Sleeping Experts [2]). These are binary classifiers and m-ary (m > 2) classifiers. However. term strength . but it is still not well understood and can be an object of future research. In this case.measures how commonly a term is likely to appear in "closely related" documents. See [7] for details. some other feature selection methods were designed.statistic measure of the lack of independence between the term and the category. For each category and document represen- . The difference between these two types is following.

We count all these distances between vectors of categories and a document’s vector and the category closest to the document is chosen. Naive Bayes approach is far more efficient than many other approaches with the exponential complexity. Often the dot product cosin value of vectors is used instead of the distance. It uses a statistical data to create simple rules for each category and then uses conjunctions of these rules do determine whether the given document belongs to the given category [2]. kNN belongs to m-ary classifiers Rocchio algorithm uses vector-space model for document classification. The weakness of this method is the assumption of one centroid (a group of vectors) per category. TEXT CATEGORIZATION 7 tative vectors are produced. This brings problems. Support Vector Machines(SVM) is a learning method introduced by Vapnik [9]. Decision Trees are algorithms that are used to select informative words based on an information gain criterion and predict categories of document according to occurrences of word combinations Naive Bayes are probabilistic classifiers using joint probabilities of words and categories to calculate the category of a given document. Simple more-layered networks were designed and tested. when documents with very different vectors belong to the same category. which combines results of these classifiers. This system ranks k-nearest documents from the training set and use categories of these documents to predict a category of a given document. The basic idea uses summing vectors of document and categories with positive or negative weights. It is important to obtain a good classifier. Sleeping experts are based on the idea of combining predictions of several classifiers. This method is based on the Structural Risk Minimization principle and mapping of input vectors in highdimensional feature space. Naive Bayes based systems are probably the most frequently used systems in text categorization. Empirical evidence shows. which depends on belonging of a document to a given category. Some measure of the distance must be defined. kNN is the k-nearest neighbors classification. the master algorithm. Experimental results show. that SVM is good method for text catego- . RIPPER is a nonlinear rule learning algorithm. Neural networks were also used to solve this task. that multiplicative updates of weights of classifiers are the most efficient [2].CHAPTER 2.

in ranking based systems.Average eleven precision obtained from previous step to obtain a single number .10%-20%.3 Performance Measures Now I would like to show some measures that are used to evaluate classifiers.11-point average precision. Two basic measures for a given document. use the highest precision value as the representative precision. we can divide measures for category ranking and binary classification. It can be computed using following algorithm: .For each interval between recall values compute average precision over all test documents. . are recall and precision. Again. It reaches very good results in the high-dimensional feature space. The goal of the machine learning is then to train all parameters and thresholds of the algorithm to obtain best similarities for documents that belong to the same category. Often the probability measure is used.For each document.CHAPTER 2. Similarity Measure All these techniques are well-known machine learning algorithms that can be used to solve many other problems. Some systems work also with the dot product of some feature vectors. compute the recall and precision at each position for the ranked list. However. . SVM method has also one of best results in efficiency of text categorization [10]..number of documents correctly assigned to the category. TEXT CATEGORIZATION 8 rization. the essential part of the text categorization is the “similarity measure” of two documents or a document and a category. For evaluation of binary classifiers four are values important (for a given category): a . avoids overfitting and does not need a feature selection. 2.. There are many possibilities how to define such a measure.90%-100% .For each interval between recall values (0%-10%. These are computed as follows: recall = categoriesf oundandcorrect totalcategoriescorrect categoriesf oundandcorrect totalcategoriesf ound precision = Very popular method for global evaluation of a classifier is an 11-point average precision measure. . .. To count this measure the feature sets or vectors are used.therefore 11point)..

the second is F-measure. TEXT CATEGORIZATION b . p) = (β 2 +1)pr β 2 p+r Last two measures are most often used to compare the performance of classifiers. c. Therefore.CHAPTER 2. c . Some of performance measures may be misleading when examined alone. two new measures are of importance. Macro-average gives equal weights to categories. there are two methods. d . b. macro-averaging and micro-averaging. The first one are computed as average from per-category values. The first one is break-even point (BEP) of the system.number of documents correctly rejected from the category. F-measure was also designed to balance weights of recall and precision. 9 Following performance measures are defined and computed from these values: recall r = a/(a + c) precision p = a/(a + b) fallout f = b/(b + d) accuracy Acc = (a + d)/n where n = a + b + c + d error Err = (b + c)/n where n = a + b + c + d For evaluation of average performance over categories. The F-measure is defined as: Fβ (r. when the recall and precision are equal.number of documents incorrectly assigned to the category. micro-average gives it to the document. BEP is a value. . d are computed first over all categories and then the performance measures are computed.number of documents incorrectly rejected from the category. in microaveraging values a.

Nevertheless.1 N-grams N-gram is a sequence of terms. In literature. counted as the number of N-grams. EX. We can build such N-gram representation for the whole document. some improvements were introduced. I chose the solution based on Ngrams. but used methods are also more complicated. Here also leading and trailing spaces were considered as the part of the word. Trenkle and Cavnar chose N-grams based on characters for their work. with the length of N. you can find the definition of N-gram as any co-occurring set of terms. The measure. Mostly words are taken as terms. This representation is then used for comparing documents. EXT_. TE. For example. TEX. XT_. but for this work. In the second one only N-grams of the length of four were used. so any errors that are presented. T_ tri-grams _TE. I would like to explore possible improvements of methods described in [1] and [5]. The benefit of the N-gram based matching is derived form its nature. TEXT.Chapter 3 N-gram based Text Categorization From all possible features presented in the previous chapter. leaving the remainder intact. only contiguous sequences were used. One word from the document was represented as the set of overlapping Ngrams. EXT. that are common for two strings. N-grams with various lengths were used (from 1 to 5-grams). XT__. affects only a limited number of N-grams. I would like to more deeply introduce these works. 10 . In this chapter. the word "TEXT" would be composed of following N-grams: bi-grams _T. T___ In the first of the mentioned texts. XT. Every string is decomposed into small parts. is resistant to various textual errors. In the second article. T__ quad-grams _TEX. 3.

if it will occur only in part of the documents. N-GRAM BASED TEXT CATEGORIZATION 11 3. Zipf’s Law states. The main idea of the categorization used by Trenkle and Cavnar is that if we are comparing documents from the same category. were suggested.3 Document representation In the first article. It simply describes the Zipfian distribution of N-grams in the document. they used representation that was more sophisticated. As the vectors for each document would have a very high dimension and would be very sparse. In the second article.2) where tfj is the term frequency and the IDFj is the inverse document frequency of the jth N-gram in the document. but they are only small deviations from original Zipf’s observations. that there is always a set of words. these have something incommon (at least this N-gram). N-grams starting at rank 300 were used. in terms of frequency of use. that are specific for the particular subject. . The most common observation of some regularity is known as Zipf’s Law. by their N-gram frequency profiles. the same representation as for documents was used also for categories. As N-grams are components. This is true for languages. If N-gram can be founded in all documents. which dominates most of the other words. which carry meanings. IDF was used to reduce the dimension of the vector space. that: f= k r (3. The IDF for a specific N-gram is counted as the ratio of documents in the collection to the number of documents in which the given N-gram occurs. they should have similar N-gram frequency distribution [1]. IDF strongly corresponds to the importance of the N-gram for the text categorization. In both cases. Each document was represented as a vector. but also for words. For the categorization by subject.CHAPTER 3. 3. It is the list of N-grams ordered by the number of occurrences in the given document. IDF was used to help to choose most significant N-grams for the subject identification. with values of term weights for each N-gram. the idea of Zipf’s law can be applicable also on frequencies of N-grams. In [5] Cavnar uses another helpful characterization and that is the inverse document frequency (IDF). This number was obtained by several experiments. documents were represented. The weight for the terms was defined like: wj = (log2 (tfj ) + 1) · IDFj wi i (3.2 N-gram frequency statistics Each word occurs in human languages with a different frequency. If f is the frequency of the word and r is the rank of the word in the list ordered by the frequency. From this we get. it gives us no information about difference between contexts. on the other side.1) Later some modifications.

we can choose most relevant categories for the given document. The problem of longer texts is that there is a higher amount of language specific N-grams in the text and these have usually high frequencies. The problem is.CHAPTER 3. only with a low loss of performance. Unfortunately. they are more subject specific. against different data sets. When we want to choose a category for the document. to eliminate this problem. They used 778 articles from the Usenet and tested them again the 7 selected FAQs. For counting profiles of categories. If an N-gram was not present in one of the profiles. Usenet newsgroups messages articles are quite short and the system was not giving such a good results for longer texts. from the computer science. a very simple method was used. In the second article. that this classifier would obtain a pretty good score in the BEP measure. so they assigned exactly one category to each document.4 Comparing For the comparison of the two profiles. Precision @ 100 . Cosine of this angle can be computed as a dot product of these two vectors. in the first work. Their system achieved 80% correct classification rate. In their work. so we can use them to determine the subject of the text. a maximum value was used. This helped to reduce the dimension of N-gram vectors and thus increase the speed of computation. we can order them and with the help of some threshold. that in short texts. This means that the suggested classifier belongs to m-ary classifiers. the similarity of two documents or categories was defined as the angle between the representation vectors.5 Observations In the first case. The system was tested on the TREC competition. but I think. I tried to make same computation of recall and precision (Cavnar and Trenkle have mentioned only the numbers of documents correctly/incorrectly assigned) and from these computation I assumed that recall and precision of this system are both about 60%. The similarity measure of two profiles was defined as a sum of differences between the rank of the N-grams in one profile and the rank in the other profile. N-grams in first 300 ranks are mostly language specific. In the newer article. They have chosen five closely related categories. categorization was tested on the Usenet newsgroups. and from the rank of 300. As we have the list of distances from all categories. This leads to the fact that subject specific N-grams start on higher ranks than 300. Than we choose the category with the smallest distance (angle) from the document profile. N-GRAM BASED TEXT CATEGORIZATION 12 3. It is not a very high number. they used only the first category as the winner. 3. we have to count distances (angles) from all the categories profiles. they used the FAQs of the given newsgroups. Very popular performance measures like Average Precision. inverse document frequency was used.

than median performance of all system in the TREC93 competition. As it was already mentioned. . N-GRAM BASED TEXT CATEGORIZATION 13 docs and R-precision were used. to see how good the results are. Other advantage. The first system also consumes very little resources. because the frequency of such an “invalid” N-gram will be so low that we will never calculate with it. So such an error does not affect final results. because we do not do any linguistic computation on the data. we suppose that the same errors occur with a low probability. Such a system has some very nice advantages. find a clever function to appoint the correct number to replace the magical ’300’. it is resistant to various textual errors. but I would like to keep it as simple as possible. with a most precise value for a various documents. 3. without any additional effort. that it is possible to implement an N-gram based retrieval system. The other one has a little problem with the table of IDF’s. We can say that the “learning phase” of these systems does not exist (all you need to do is to count the profiles of categories and they can be easily updated).CHAPTER 3. etc. how to improve the systems suggested in these works. to find some way how to apply these methods to large texts. it is a very important benefit of these systems. We can use more sophisticated term weighting. use N-grams with larger N. In my work. Language independence was also verified on a set of Spanish documents. is the simplicity of suggested methods. Of course. The suggested systems are language independent.6 Possible improvements There are some possibilities. Both articles show. larger amount of N-grams to represent the text. The values of all measures were somewhat lower. but I think a very important one. It is easy to implement them. because I think. I would like to explore some of these possibilities.

org 14 . to build up clusters of documents. as a man would set up.Chapter 4 Clusterization and Categorization of Long Texts The first inspiration to this project was the article [1]. from the unknown set of documents (we can look at this as on the reinforcement learning). I did find any resource of this kind. In their experiment they say. that will for each text return the most precise starting rank of the subject specific Ngrams. to subsets according some common signs. For this purposes I used the collection of books form Project Gutenberg 1 . I made following tests: I chouse the train and the test set from the whole collection. I hoped that created subsets would have common subjects and would correspond to categories. This means I will try automatically to divide the set of documents. in which I selected 1 http://www. in this case N-grams. to make it more general. Next step was the clusterization phase. that in the way it was described in [1] it can be used only for short texts (Usenet newsgroups articles). the subject specific N-grams are starting at the rank of 300. Before the text categorization. The first idea was to try to find such a function.gutebberg. But I still wanted to discover the power of N-grams. The suggested system is very easy and has good results. I needed a train set of long texts with assigned categories. Then I counted profiles of all documents. It is a large collection of free books (almost 10 000) and it was easy to get it. Nevertheless. I wanted to find some relationship between the length of the text and this special rank. I had to deal with the text clusterization. Unfortunately. As the main problem of this system is the fact. I decided to use the information given by comparing two texts .“similarity”. that in the sequence of N-grams ranked by the number of occurrences. I wanted to find some way how to improve this system to obtain better results also for longer texts. To do this. so I decided to build categories by my own.

give us some information about relations between these words. I did not have it.1. CLUSTERIZATION AND CATEGORIZATION OF LONG TEXTS 15 documents to categories and counted profiles for them. so I also used IDF to select representative N-grams. or it were very specific N-grams. N-grams “ew Y” and “w Yo” fror words “New York”. I used the N-gram with the length of 4. These tests are described in the chapter 5. The second disadvantage against the system. in which it occurs. Now I will describe the suggested methods.2 IDF table The first task was to build up the table of inverse document frequencies. I was working with the text as one single word and include spaces between words into N-grams.1.1) where nt is the number of documents in which the N-gram t occurs and N is the number of all documents in the collection. The last phase was categorization phase. Let a be N-gram with the lowest rank (ra ) . f is its frequency and k is constant. the table was too huge to store in memory so I threw out N-grams with lowest frequencies. In the first article the N-grams selected as features. so they did not take a care about N-grams. were determined by its rank. Because of this. The IDF of an N-gram t is defined as IDF (t) = nt N (4.1) helps us see that we can cut off quite a large part of k N-grams with a low influence on the final distribution. I compared documents from the test set to profiles of categories and chose one category for each document. In [5] they worked with each word separately. From (3. will be often together. overlapping two words. IDF (see 3.1) we have r = f . I think this has two disadvantages.1 N-gram Selection As I had already written. Even after this. so I decided to use N-grams with the length of 4.1 Feature Selection In the first article [1]. I took a plain text with a single space between words. The table was very huge. I counted the number of documents. in the second article.CHAPTER 4. As I have already written. 4. I suppose these were mistakes. Zipf’s law (see 3. In this part. In this. N-grams that interfere with two words. quad-grams were used. 4. 4. For example. to split the text to words. that completely treats input as a single word is. For each N-gram. in tests.2) was used to determine appropriate N-grams. without tabulators or new line characters. that would give us no help by categorization. N-grams with the length from 1 to 5 were used. in the second article [5]. that we have to search for word boundaries. to determine the correct rank of important N-grams we need a categorized database of documents. I decided to use also N-grams that interfere with more than one word. As an input. where r is the rank of N-gram.

Now if we want to determine the fraction of N-grams with the frequency of β we have From this. This list I used to represent the category.4) where ft is the frequency of the N-gram t and wi is the weight of N-gram i. which occurs only once. but we can highly reduce the number of N-grams we have to hold in the table (If an N-gram is not in the table. This representation has a great advantage. Here the N-grams with the highest frequencies were at first ranks. Because of that. to represent the document. It is very easy to update the category profile. ordered by frequencies. I used these sums to order N-grams into the final list of N-grams. Than the number of N-grams with the frequency β is: k k k − = β (β + 1) β(β + 1) The highest rank of an N-gram in the collection is approximately: ra − rb ≈ rmax ≈ k =k 1 (4. In which the weight wt assigned to N-gram t was defined as wt = ft wi i (4. I used the vector representation. CLUSTERIZATION AND CATEGORIZATION OF LONG TEXTS 16 from N-grams with the frequency β (number of documents in witch it is present) and b the N-gram with the lowest rank (rb ) from N-grams with the frequency β + 1. Because an intuitive way how to create an ordered list of N-grams for a category would be to take all the documents from the category as one text and to count its profile.CHAPTER 4.2 Document and Category Representation From the list created like in previous section I chose 1000 N-grams with the highest frequency (number of occurrences). . This would assign the highest ranks to the N-grams with the highest frequencies. I used two ways of comparing documents (see next section) and I used also two kinds of document profiles. I decided to use only the first style of representation. Then I counted a sum of ranks from all profiles for each N-gram. while it was more successful than the vector representation. the portion of N-grams. I took profiles of all documents in the category and counted the reverse order of N-grams in the profile. The first method is the representation of the document as a list of selected N-grams. After comparing these representations in the clusterization problem. I will usually write only a document or a category in spite of the document or category profile (whether it will be a list or a vector). From now.3) 1 β(β+1) .2) (4. so do not use him for representation). Clearly. is 1 . it means it was only in few documents. 4. if you want to add a new document to the category. such N-grams have no 2 influence on the categorization. For the second method. but I used a slightly different strategy. I made this to prevent N-grams that have very high frequencies in few documents to be on high ranks in the final list.

Surprisingly. This gives us the number from the interval −1. The “out-of-place” measure s(v. Let v and w be lists of N-grams. defined in the previous section between two documents from the same group will be higher. that for any a. After this. results of the first method were better.5) where a (b) goes through all N-grams from v (w) and d(i. The measures I used are normalized to the interval 0. j) is defined as d(x) = |rvx − rwx | x ∈ v ∧ x ∈ w l x∈v∨x∈w / / (4. 1 . This is why I did not use the vector processing method for the text categorization. / So I had the set of N documents and the table of N × N with values of similarity measures between all these documents (we do not need the whole table. than similarities between these documents and any other document outside this group. 4. It seems like the “out-ofplace” measure is creating categories that are more similar to categories created by people). we can say that the measure. b) > s(a. When we have this measure. if and only if holds. b ∈ C and c ∈ T ∧ c ∈ C s(a. I used a transformation to get values into the interval 0. I compared both methods for comparing documents (“out-of-place” measure [1] and the vector processing [5]). . More formally: Let T be the set of all documents. 1 . that the lower values mean higher similarity. 1 . C ⊂ S is the cluster of similar documents.CHAPTER 4. and are build in such a way. As we know. the upper triangle is sufficient. so that similar documents will be in the same group (cluster). I used this measure for the second type of the document profile. Now I will define this measure. The second measure uses the dot product of two vectors. To count this measure we need the ordered list of N-grams. CLUSTERIZATION AND CATEGORIZATION OF LONG TEXTS 17 4. we say that the similarity of two documents or categories v and w increases with increasing value of s(v.3 Measure of Similarity To make a clusterization or categorization we have to define some measure of similarity between the two documents or the category and the document. w) can be defined as following: d(a) + s(v. w) = 1 − a d(b)+ b l2 2· (4. w). the dot product of two vectors corresponds to the cosine value of the angle between these two vectors. This measure gains values from the interval 0. 1 . First is the mentioned “out-of-place” measure. If we want to define this problem.4 Clusterization The task of clusterization is to divide the set of documents to groups (clusters). c).6) where rvx is the rank of the N-gram x in the list v and rwx is the rank in list w. with the same length l.

mark them as clusters and remove them from the graph. where documents can be mapped to nodes and the weight of an edge between nodes v and w is s(v. then two documents can be a base for a cluster). To do this. but some of created categories were “man-like” (see 5.the clusters of a very small size or single documents. Now we can transform our problem to split components of this graph. The problem was that the similarity of documents in one category (or cluster) was different and could be much higher or lower than the similarity of documents in other category. CLUSTERIZATION AND CATEGORIZATION OF LONG TEXTS 18 because s(v. Creating profiles of documents and categories was discussed in 4. We can look at this table as at the complete graph. we need to set up a threshold. The task of categorization assumes that we have the profile of the document we would like to categorize and the profiles of categories from which we chose the one that is most suitable to be the category of the document. In the meaning of similarity. The split graph looked like one big component of many independent documents and few small components that could be considered as categories. until there is a component with more than t nodes. With this system. The algorithm of the categorization is very simple and has only two steps. but the previous small components felt apart to single documents (or very small components from two or three documents). This does not mean they will really be in different clusters. count similarities (4. but now it seems. because they both can be similar to the third document. which says that edges with the lower weight than this threshold will be outgoing edges of components. With increasing of the threshold. this threshold is saying that documents with the lower similarity should not be in the same cluster. 4.. First. I was really disappointed. Increase t and repeat from the step 1. 4. 1.5) between the selected document and all categories. w).3 for more details). There was quite a big junk . I was able to create clusters from the training set. that I have found the solution. v)). so that weights of edges inside of these components will be bigger than outgoing edges of these components. Split the graph using this threshold 3. . The “good” size depends on the size of the graph and on your belief in your similarity measurement (if it is very good. new components were cut form the big one. The idea of the improving the clusterization was as followed: 1.CHAPTER 4. which will bind them together.2. w) = s(w. I made some experiments with clusterization and I came to the problem. This will split up the graph to “unconnected” components (it is something like deleting these edges from the graph).5 Categorization The main goal was to find a good method for text categorization. Set up an low threshold t 2. If the clusters with the “good” size are present.

one document can be a member of more categories and this method of categorization (but not the clusterization from previous section) can handle this. 19 In reality.). etc. They are described in chapter 5. For this work. I used only the first one as the category of the document. sizes.. All we need to do is to order the categories according to similarities with the document and then the categories with the highest similarity are the most probable categories of this document. Unfortunately.. Now if our test set contained documents with assigned categories. I checked results manually and made some statistical observations.CHAPTER 4. . Than choose the category with the highest similarity as the category of the document. . I do not have it. we can compare it with the chosen category and use it to tune up the system (all thresholds. CLUSTERIZATION AND CATEGORIZATION OF LONG TEXTS 2.

I knew the subjects of these documents. which I used in experiments with suggested methods of text clusterization and categorization. For this. Their threshold corresponds to the threshold of 6. 1 http://www. so I was able to say whether the results are clever or not. I did not use the whole collection.75.see section 4. 5. Of course I cut off headers of E-texts. I counted the table of IDF’s. I used the set of 8189 documents and the final table (without the N-grams with the lowest frequencies .75 as a threshold for IDF. In [5] they also used N the value 2.gutebberg.1.72 in my system. So this threshold corresponds to the threshold of 1. In spite of this.46 in their system.1 Corpus For the training and testing text categorization. There is something strange about this value.1).1. I was not able to get a corpus sorted into categories. The set of documents. I think this can be due to different creation of N-grams in their and my system (see 4. As I have already written. we need a corpus. were E-texts of books freely available on the website of the Project Gutenberg1 .75 gives much better 20 .2) contained 346 305 unique N-grams. Later I made some tests with 500 documents. I started measuring the similarity on a small set of 30 documents. but obtained results were giving best results. I used only plain text files smaller than 2MB’s (because of memory consumption). I came to the threshold of 2.2 IDF’s Threshold First.Chapter 5 Experiments 5. it looks like that the threshold of 2. I do not say that it is the best value. but they defined IDF as log2 ( nt but I defined it without a logarithm. The whole collection contains about 11 000 books. because my implementation was not able to work with such a great amount of data.

75 is too high.1 are some examples of categories and documents from which they were created. This set of 197 books was divided into 9 categories. Each category has the size from 10 to 35 documents. In figure 5. The effect of this is that similar measures between documents are very low (it is under 0.2) and so only very similar documents create categories.000 test” . 150 100 50 20 26 19 14 10 16 30 29 33 0 cat0 cat1 cat2 cat3 cat4 cat5 cat6 cat7 cat8 Figure 5. “500 test”. First tests with clusterization were made on a set of 500 books randomly chosen from the set of 11 0000 books I had. 3 I was expecting that if in the previous test categories had the size between 10 and 30.1: Distribution of documents among categories. The results were following: 739 books were divided into 7 categories.CHAPTER 5.1 you can se how 197 books were divided into categories.3%). the size of cluster we want and of course the clusterization algorithm. I think that one the reasons for this may be that the threshold for IDF with the value 2. I do not know whether documents in other categories would be in similar categories as in the “man-like” categorization.000 documents.2 you can see the distribution of books among categories. As clusters are created in iterations. in category2 poems and theatre. In the figure 5. here I chose the size from 50 to 200. Here I must say that there is nothing wrong with the fact that 2 3 I will refer to all tests connected with this set of documents as the “500 test” I will refer to all tests connected with this set of documents as the “3. in category5 political documents. Only 246 books where used to create categories. The best thing about these tests was that some of the created categories where “man-like”. so there was a big junk (60%). EXPERIMENTS 21 5. only 197 were divided into categories I ran next test with the set of 3. In category1 were memories of people.3 Clusterization Dividing documents into categories depends on their similarity. But the junk was very big (75. that the last created cluster will be only the rest of documents that had not fit anywhere but are slightly similar. it is possible. because they were created from more “common books” and I had not read all these books to determine their subjects. In table A.2 I was satisfied with these results.

000 test”. in spite of the fact. in the “500 test”: category1 . There were used different train sets and numbers of categories does not say that these must have something common. “3. But in reality.g.2: Distribution of documents among categories. 600 500 400 300 200 182 196 100 93 78 73 54 64 0 cat0 cat1 cat2 cat3 cat4 cat5 cat6 Figure 5. in the “3. that was creating categories from standalone components.000 test”.000 test” they where in category4. only 739 were divided into categories The problem in clusterization is that I assigned one or none category to each document. the results are quite interesting. their knowledge and senses category4 . French books created catgory1 in the “3.1 are completely different.memories. For example in the “500 test” German books where in category0. I was able to describe some other categories too. also when there is a small set off documents creating a cut between two sets of . that I threw out all frequent N-grams (this is because of usage IDF) documents from other languages than English created their own categories.contained books somehow connected with Netherlands category2 . e. an so there are connections between categories which abuses the clusterization. together with novels from A. Furthermore.CHAPTER 5. diaries and life stories of famous people mainly from England and the USA category5 .memories of famous French people from the times of Napoleon. In spite of all these problems.plays . First. but for the tests I used only a simple version. Balzac (both were writing about these times) category8 .some philosophical books about people. Dumas and H. that had no connections to the rest of the graph.books about theology and psychology category3 .2 and 5. each document can belongs to many categories. This could be solved with more complicated algorithms that will split the graph. EXPERIMENTS 22 distributions in figures 5.

000 test”: category0 . that this document should not belong to any of created categories. In the second test I randomly chose 1.4. which is the rest of the graph. 5. I took the “junk files” . You can see the distribution of documents among categories in figures 5. EXPERIMENTS About categories from the “3. At the end there is a component.CHAPTER 5. in the way I implemented cutting components from one large graph. Slightly disappointing is that these distributions do not correspond to the distribution of categories (fig 5. the reason can be the .poetry As you can see. Except the failure. what would be a great success of the suggested method. but it is enough small to fulfill the conditions to become a category. 5. The clusterization. Many of these files had something common with the categories in which they were assigned and I did not know why they do not become part of this category during clusterization.German literature.mainly some scientific books category2 . This can mean that there is a characteristics of this collection of books.1. that almost all documents that should belong to some category. I made two such tests.except two. which divides the whole collection to created categories. This can lead to categories that are giving no sense. category5 .memories and books about famous French people from the times of Napoleon category3 . together with books about Greek mythology and antic times maybe the reason is that some of German’s books were also from Greek mythology. But this means that these categories are feasible.000 documents from the Project Gutenberg’s books (different from the train set). all by Shakespeare. One of the reasons can be following. However. there where assigned also documents that do not belong to it. but it is easy to see that distributions in figure 5.3 and 5. I checked these results also manually. In the first test. This means that they really describe the whole collection of books. 23 category4 .4 are similar. . Results are lists of documents assigned to each category and the count of documents assigned to each category. but the chosen category had the most similar characteristics. It seems. With both sets of categories. where assigned to this category. I was not able to say anything about categories 6 and 7 from the first test and category6 from the second test.4 Categorization The categorization tests consist of comparing set of documents with created categories.plays . to each category.2).the files from train set that were not chosen to any category and ran the categorization test on them. Tests with “junk files” are giving interesting results.

000 tests” .CHAPTER 5.000 documents Figure 5.000 test” consists of the plays written by Shakespeare (only two texts where by other authors). EXPERIMENTS 24 There is also an example. I was not able to make any classical measurements. 300 1000 250 800 200 600 150 400 100 65 61 46 34 3 23 32 17 22 50 200 99 53 121 96 166 151 92 124 98 0 0 cat0 cat1 cat2 cat3 cat4 cat5 cat6 cat7 cat8 cat0 cat1 cat2 cat3 cat4 cat5 cat6 cat7 cat8 “junk” 1. but also when these documents are very similar. that it is possible that this method will reach good results with only few documents in a category (some categories had only ten documents).000 randomly chosen files. category3 from the “3. because I did not have the set of correctly and incorrectly categorized documents as it used to be common. Then in the tests with “junk files” and the 1. Unfortunately. To be more concrete.000 documents Figure 5.3: Distributions of documents in “500 tests” 1000 2100 1800 1500 600 1200 900 600 300 0 cat0 27 104 26 606 685 581 232 229 140 232 41 51 52 255 800 400 200 0 cat5 cat6 cat0 cat1 cat2 cat3 cat4 cat5 cat6 cat1 cat2 cat3 cat4 “junk” 1.4: Distributions of documents in “3. much other theatre plays by various authors were assigned to this category.

I was able to find some “real” similarities among documents in created categories and I do not say that there is no similarity among books from other categories. This would especially help to find better threshold for IDF’s values. but I think that now we can see. but they are not so bad and we should not throw them into the bin. Obtained results are not so good that I could use this system to solve some real problems. We can look at text clusterization as at the reinforcement learning method of text categorization. The first thing what should be done is to find some database of categorized texts. which helps us to build categories from the set of documents. so I did not find it. I have implemented the system for text categorization. when it correctly categorized previously unseen documents. The implemented system showed also some ability of generalization. I had found a method for the text clusterization. mentioned already by Trenkle and Cavnar [1] is the language independency. That was the reason I did not fulfill the goal stated at the beginning of the work completely. With such a database. Of course. that the N-gram based system has its meaning and it can be valuable for future research. which will for sure bring much better results. because we do not know whether created categories are feasible or not. we can tune up the system for much better results. the text clusterization can be very useful and there are some possibillities how to 25 . I believe that it is possible that with such a database we will be able to find “ranks of important N-grams” in the complete list of N-grams from a document without the help of IDF values. Nevertheless. Other very useful advantage of N-gram based systems. It is a method.Chapter 6 Conclusions and Further Work In this work. Here I would like to sketch some possible suggestions for improvements. without previous knowledge of given texts. Unfortunately. As I said. Used ideas are mainly the mixture of ideas from articles [1] and [5]. it is possible to make many improvements. but I do not know every book. Furthermore. I think that the system with a good ability of text clusterization in unknown environment will be very valuable for problems of information retrieval. This would help us to reduce computational resources we need. I did not have the necessary database to train this method with some classical machine-learning algorithm.

One problem of the implemented method was that each document was assigned only to one category. If you think how the people do the text categorization. But we have the text categorization of short texts that reaches quite good results [1]. with the cut containing only few documents. the k-NN algorithm can be useful. categorize them and from the results choose the category of the whole document? For the last part. This means that these parts must contain the sufficient information for the text categorization. They usually read only short parts. we have to include these nodes in both sets. The suggested improvement is the question of algorithms in the graph of documents and their similarities. This will ensures that one document can be in more categories. CONCLUSIONS AND FURTHER WORK 26 make it better.CHAPTER 6. We can split the graph also when there are two sets of nodes. So why not split the long text into small parts. In this case. The last improvement is a slightly different view of the problem. The idea is not to split the graph only when there are unconnected components. . they usually do not read the whole book to determine its subject.

Carnegie Mellon University. Using an n-gram-based document representation with a vector processing retrieval model. 24th ACM International Conference on Research and Development in Information Retrieval. [9] V. Technical Report CMU-CS-97-127. Proceedings of the European Conference on Machine Learning (ECML). Context-sensitive learning methods for text categorization. In Proceedings of the Third Text REtrieval Conference (TREC-3). Yang and J. 3rd Annual Symposium on Document Analysis and Information Retrieval. A study on thresholding strategies for text categorization. [5] W. 14th International Conference on Machine Learning.Bibliography [1] W. 0. Peng. 2001. 1998. Trenkle. Vapnik and C. [10] T. 2000. Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics. [3] D. N-gram-based text categorization. Computer Science Department. Wolff. Wang. Pedersen. [7] Y. Proceedings of SDAIR94. 1997. 1996. A comparative study on feature selection in text categorization. Joachims. Text categorization with support vector machines: Learning with many relevant features. Singer. B. Cavnar and J.. 1997. 0. 1994. W. C. Proceedings of ICML-97. Cohen and Y. Cavnar. 19th ACM International Conference on Research and Development in Information Retrieval. An evaluation of statistical approaches to text categorization.. 2003. Yang. [2] W. M. 1994. Proceedings of SIGIR-01. F. USA. AT&T Labs-Research. Will T. [8] Y. 27 . Support-vector networks. Loning and W. 1995. Goller. B. Language and task independent text categorization with simple language models. In Internationales Symposium f ur Informationswissenschaft. [4] J. [6] Y. Yang and J. Cortes. Proceedings of SIGIR-96. Automatic document classification: A thorough evaluation of various methods.. Schuurmans and S. Pedersen.

The Confessions of J. Kate Douglas Wiggin. The Book Of The Thousand Nights And One Night. category4 General Philip Henry Sheridan. the Private Life of Napoleon La Marquise De Montespan. Constant. Charlotte M. The Psychology of Beauty. The Getting of Wisdom. o El Optimismo. Indian Boyhood. v1. T. A Christmas Carol. Les grandes dames. Arsene Houssaye. Memoirs of General W. J. Augustus Does His Bit. The Alchemist. The Merchant of Venice. Ethel D. 1564-65. Quotations from Diary of Samuel Pepys. Rise of the Dutch Republic. Davis. George Meredith. A. Dante Alighieri. Marius the Epicurean. Mary Schweidler. William T. Wilhelm Meisters Lehrjahre–Buch 2. There are lists of books from each category. Walter Pater. Jerrold. category6 Anthony Trollope. category7 Trevelyan. The Essays of Montaigne. The Way of the World. category0 Wilhelm Busch . Joseph Butler. Richard Hakluyt. Sir Charles W. Thomas de Quincey. J.Ben Jonson. This is no a complete list. Ferdinand Raimund. Römische Geschichte Book 8. I-III. Johanna Spyri category1 G. Caudle’s Curtain Lectures. The Second-Story Man. Our American Cousin Voltaire. Hon. Barrie. Heinrich Heine. Eben Holden.. Rousseau. L’Avare. Table A. The Prime Minister. Charles Eastman.Appendix A Tables Following tables show results of my experiments. The Pigeon. The City of the Sun.. Georg Ebers. The Memoirs of Madame de Montespan. L’homme qui rit. Duchesse d’Orleans. Introduction to the Old Testament. The Amber Witch. A Treatise Concerning the Principles of Human Knowledge. Old Age and Death. Emile Zola. Voyages. Memorials and Other Papers. The Bride of the Nile. Dante’s Inferno.. Human Nature & Other Sermons. Love’s Labour’s Lost. Irving Bacheller. Pepys The Diary of Samuel Pepys. Beauchamps Career. Abraham Myerson. Candido. Bernard Shaw.. Friedrich Schiller. Jacob Michael Reinhold Lenz. Nana. What Every Woman Knows Dante Alighieri. Stephen Gwynn. Jo’s Boys. I. The Life of the Rt. Charles Dickens.. Elizabeth-Charlotte. Honore de Balzac. All titles and authors were retrieved from the books by a small utility I made and some of them where not parsed correctly. Theodor Mommsen. James M. The Heir of Redclyffe. Meyer. v1. Voltaire. David Widger. Memoirs of Aaron Burr. The Earl of Chesterfield. V4. James Gillman. and the Regency Jean Jacques Rousseau. George Berkeley. Jacques Casanova. Sheridan. Vol. Life and Letters of Lord Macaulay. The Life of Stephen A. By England’s Aid. Der Verschwender. v6. The Meaning of Truth Thomas Paine. Tullia d’Aragona. Jean-Baptiste Poquelin. "Divina Commedia di Dante: Inferno". Yonge. The Life of John of Barneveld. Johann Wolfgang von Goethe. Vol. Charlotte Mary Yonge. Hans Huckebein. John Lothrop Motley. Sherman. Gardner. Victor Hugo. Revolt of Netherlands. v4. Anonymous.1: Categories and assigned documents. the Memoirs of Napoleon. Frederick Chopin as a Man and Musician. The Deputy of Arcis. Thomas. Henty. Puffer. Louisa May Alcott .9. Rime. Matthew L.6. Kabale und Liebe. The Chosen People. Henry Handel Richardson. Mrs. Sherman. Guy de Maupassant Original Maupassant Short Stories. Barbara Blomberg. William James. A Summer in a Canyon. Common Sense. Reineke Fuchs. Johann Wolfgang Goethe. John Edgar McFadyen. Letters to His Son. Frederick Niecks. J. The Principal Navigations. Upton Sinclair. H. Froudacity. Leonardo Da Vinci. George Bernard Shaw. “500 test” 28 . IV. Tom Taylor. The Life of Samuel Taylor Coleridge. 1. Heartbreak House. Book I. Buch der Lieder. Der Hofmeister. History of the United Netherlands. category5 Tommaso Campanells. William James. Memoirs of Louis XIV. Memoirs of General P. Douglas. The Foundations of Personality category3 Speeches and Writings of Edmund Burke. category8 John Galsworthy . Shakespeare. The Varieties of Religious Experience. Shakespeare. Spinoza A Theologico-Political Treatise. Die Versuchung des Pescara.. Louis Antoine Fauvelet de Bourrienne. D. La Bete Humaine. John Lothrop Motley. category2 Michel de Montaigne. 1573-1574. Le Blanc et le Noir. Schiller. The Notebooks of Leonardo Da Vinci ongreve. Georg Ebers. Emile Zola.

Baring-Gould. Joseph Conrad . Simon Newcomb. Rudyard Kipling. Robert Chambers. Shakespeare. Darwin. Muhlbach. Alexandre Dumas. Eight Cousins. Friedrich Hebbel. Iphigenie auf Tauris. Georg Büchner. Reineke Fuchs. Les Visages du temps. Sir Gibbie. Time and Change. The Border Legion. The Spanish Tragedie. Goethe. Private Life of Napoleon. Literary Archive Foundation. Hermann Hesse. On the Origin of Species. Charles Dickens . Chamber Music. Elizabeth Cleghorn Gaskell. Dumas. Romantic Ballads et al. William Makepeace Thackeray. Pierre Loti. Taine. The Angel in the House Esaias Tegne’r. A L’Ombre Des Jeunes Filles en Fleur. Briefe aus der Schweiz. Friedrich Hebbel. Emily Dickinson. Shakespeare. The Mysteries of Paris. The Voyage of the Beagle. Jules Renard. Shakespeare. Anthony Trollope. Madame Hausset. Franz Grillparzer. category4 Goethe. Memoirs Louis XIV. Memoirs of Marie Antoinette. Herodotus. TABLES 29 category0 Trollope. Thais. Vestiges of the Natural History of Creation. The Half-Brothers. Stallman. Moments of Vision. Victor Hugo. Le Dernier Jour d’un Condamne. “3000 test” . Römische Geschichte. poesie. Two Nations. Gentleman at Paris. Shakespeare. Napoleon And Blucher. Vittoria. Robert de Boron. The House of Atreus. Mr.H. Franz Grillparzer. Bangs. Andrew Lang.. George Borrow. Espace perdu. George Walker At Suez. Zane Grey. poesie. Thomas Henry Huxley. Trent’s Last Case F. Schnock Andrew Lang.W. Memoirs of Louis XV. King John . Huxley. Walter Leaf. Thomas Hardy. A History of Science. Shakespeare. Novelle. Moorman. The House of Heine Brothers. Agnes Bernauer. L’Etourdi. William Wordsworth. Siddhartha. Much Ado About Nothing. Elizabeth Gaskell. The Trespasser. Part 1. Twelfth Night. Les Quarante-Cinq category2 The Duc de Saint-Simon. The Man in the Iron Mask. Henri III et sa Cour.APPENDIX A. Dantons Tod. Shakespeare. Eugene Sue. Anatole France. Louise de la Valliere. Frederic Mistral. The Wandering Jew. Lyrical Ballads. King Lear. Shakespeare. Stewarton. Louisa M. Wessex Poems and Other Verses C. The Maid of Orleans (play). Shakespeare. Alfred de Vigny. Essay on Man Charles Swinburne. Robert Alexander Wason. Kim. Romeo and Juliet. The History of Herodotus. Characters of Shakespeare’s Plays. Cinq Mars. Fridthjof’s Saga Swinburne. Edgar Allan Poe. Kipling . The Two Gentlemen of Verona. Schiller. Poems by the Way Ella Wheeler Wilcox. Catherine de Medici category3 William Shakespeare. S. Eugene Sue. The Right to Read. Jules Verne. par Moliere. Actes et Paroles. Memoires et Recits. Trollope. Shakespeare. Alexander Pope. John Burroughs. Titus Andronicus. Thomas Kyd. Goethe. Poems of Progress Hardy. Dumas. Emile Zola. Frederick The Great And His Family. Shakespeare. Alcott . Happy Hawkins. Verses and Rhymes by the way. Boileau. Huguette Bertrand. The Origins of Contemporary France. William Hazlitt. Songs before Sunrise Jean de La Fontaine. to Nobleman in London.„ The Iliad of Homer. Table A. Marcel Proust. Doom of the Griffiths. Cousin Phillis. Side-Lights On Astronomy. Patmore. Robert Falconer Hardy./XVI. Victor Hugo. Huguette Bertrand. Elizabeth Gaskell. Candide. Moliere . The Merry Devil. Captains Courageous. Trollope. Yorkshire Dialect Poems George Meredith. D. George MacDonald . A Set of Six. Poil De Carotte. Dumas. Romantic Adventures of a Milkmaid Bentley. Die Piccolomini. Story of Aeneas category5 Nora Pembroke. Li Romanz de l’estoire dou Graal. George MacDonald. La Bete Humaine. Litt. Waverley. Mes Origines. Hard Times Sir Walter Scott.2: Categories and assigned documents. Measure for Measure. Johann Wolfgang Goethe. Lawrence. How to Fail in Literature Elizabeth Gaskell. Madam Campan. Darwin. Voltaire. The Fables of La Fontaine category6 Kate Langley Bosher. William Morris. Charles Darwin. Constant. Friedrich Schiller. With Other Poems. The Lives Of The Caesars Michael Clarke. Louise Muhlbach. James Joyce . William Shakespeare. Der Traum ein Leben. AEschylus. Was ihr wollt Suetonius. Dumas. Alexandre Dumas. category1 Hippolyte A. Barchester Towers . Twenty Years After. The Pilgrims of Hope William Morris. O’Conors of Castle Conor. Wie es euch gefällt William Shakespeare. Le Lutrin. Sylvia’s Lovers. In Troubador-Land. Trollope. Pecheur d’Islande. Late Lyrics and Earlier. The Tempest. Jules Verne. Bonaparte of Corsica. Poems. The Home Book of Verse. The Courtship of Susan Bell. John K. Tour Du Mond 80 Jours . M.D. Honore de Balzac. The Works of Edgar Allan Poe. Theodor Mommsen. Ballads. The Rise and Progress of Palaeontology. Voyage au Centre de la Terre. Burton Stevenson. The Three Musketeers. Die Argonauten. Pericles. Miss Gibbie Gault. Shakespeare. King Henry VI. Thomas Hardy. Lectures on Evolution. Williams.A.

k e y S e t ( ) ) . which is a part of the implementation. Used algorithms are not optimized. p a c k a g e o r g . Map . ArrayList .Appendix B Implementation I have made an implementation of the described methods of clusterization and categorization. Set . . Manual. ∗ I t implements adding N −grams t o t h e s e q u e n c e . . . p r o t e c t e d Map map = new HashMap ( ) . /∗∗ ∗ Returns the l i s t of N −grams ∗ ∗/ private List getList (){ i f ( f L i s t = = n u l l ) f L i s t = new A r r a y L i s t ( map . To try the categorization again the categories form the “3000 test”. import import import import import import import import import java java java java java java java java java . HashSet . . return f L i s t . t e x t c a t . Here are the most important parts from the source code. . util util util util util util util util util . Arrays . List . . HashMap . . complete source cod and compiled classes are on the CD. ∗ @author pna ∗ ∗/ p u b l i c c l a s s NgramSequence { p r o t e c t e d d o u b l e l o w e s t V a l u e = Double . private List fList . . . Iterator . . return f S e t . . I implemented the solution in Java. } /∗∗ ∗ Returns the se t of a l l N −grams i n t h e s e q u e n c e ∗/ public Set getSet ( ) { i f ( f S e t = = n u l l ) f S e t = new H a s h S e t ( map .MAX_VALUE. . It implements the addition of N-gram and the counting of ranks. . I made the implementation only for the tests. . . ∗ o r d e r i n g and r a n k i n g o f N −grams by t h e a s s i g n e d v a l u e s . /∗∗ ∗ This c l a s s r e p r e s e n t s the se t of p a i r s N −gram and v a l u e . you can use the TextCat demo. } 30 . . ∗ Ranks from o d e r e d l i s t o r f r e q u e n c i e s t h a t where a s s i g n e d t o N −grams ∗ are here for values . p r i v a t e Set fSet . First is the source code of the NgramSequence class. ngram . Comparator . It was used for the representation of documents and categories. k e y S e t ( ) ) .

d . Object [ ] array = l i s t . i f ( d1 <d2 ) r e t u r n 1 . i ++) { Object o = l i s t . put (o . } /∗∗ ∗ ∗ Add an N −gram t o t h e s e q u e n c e . f L i s t = new A r r a y L i s t ( A r r a y s . d o u b l e v a l u e ) { Double d = ( Double ) map . ∗/ p u b l i c void put ( O b j e c t id . ∗ ∗/ p u b l i c v o i d add ( O b j e c t i d ) { add ( i d . i < l i s t . i n t end ) { List l i s t = getList (). getValue ( o ) ) . for ( int i = 0 . s o r t ( a r r a y . double v a l u e ) { i f ( value <lowestValue ) lowestValue = value . ∗ Returns the f i n a l l i s t ∗/ public List orderList (){ List l i s t = getList (). return ( d = = null ) ? − 1 : d . } return r e s . a s L i s t ( a r r a y ) ) . I t means i n c r e a s e t h e v a l u e a s s i g n e d t o N −gram by < code > v a l u e < / code > ∗ I f t h e r e i s no s u c h N −gram i n s e r t i t t o s e q u e n c e w i t h t h e v a l u e o f < code > v a l u e < / code > . } /∗∗ ∗ R e t u r n s new < code >NgramSequence < / code > . new C o m p a r a t o r ( ) { p u b l i c i n t compare ( O b j e c t a . i f ( d = = n u l l ) put ( id . A r r a y s . map . g e t ( i d ) . i n t e = end . d o u b l e d2 = g e t V a l u e ( b ) . List l i s t = orderList (). } /∗∗ ∗ Add an N −gram t o t h e s e q u e n c e . O b j e c t b ) { d o u b l e d1 = g e t V a l u e ( a ) . NgramSequence r e s = new NgramSequence ( ) . v a l u e ) . i f ( end > s i z e ( ) ) e = s i z e ( ) . size ( ) . new Double ( v a l u e ) ) . Ngrams i n t h i s s e q u e n c e w i l l be s u b s e t o f t h e N −grams ∗ from c u r r e n t s e q u e n c e . g e t ( i d ) . doubleValue ( ) . i ++) { . ∗ ∗/ p u b l i c void countRanks ( ) { l o w e s t V a l u e = Double . p u t ( i d . IMPLEMENTATION 31 /∗∗ ∗ Order N −grams by v a l u e s a s s i g n e d t o them . } /∗∗ ∗ Puts N −gram w i t h t h e g i v e n v a l u e t o t h e s e q u e n c e . } /∗∗ ∗ Returns the value assigned to N −gram . return 0 .MAX_VALUE. e l s e put ( id . toArray ( ) . } /∗∗ ∗ Replace the values assigned to N −grams w i t h t h e i r rank i n an o r d e r e d l i s t . for ( int i = s t a r t . I f t h e r e i s a l r e a d y s u c h N −gram . doubleValue ( ) + v a l u e ) . get ( i ) . O d e r s o f N −grams i s g i v e n by t h e < code > l i s t < code > ∗/ p u b l i c NgramSequence g e t S u b S e q u e n c e ( i n t s t a r t .APPENDIX B. 1 ) . res . ∗ U s u a l l y i t i s a f r e q u e n c y o r a rank ∗ @param i d ∗ @return ∗/ p u b l i c double g e t V a l u e ( O b j e c t i d ) { Double d = ( Double ) map . i < e . ∗ ∗/ p u b l i c v o i d add ( O b j e c t i d . ∗ i t w i l l be r e p l a c e d . return f L i s t . i f ( d1 >d2 ) r e t u r n −1. I t means i n c r e a s e t h e v a l u e a s s i g n e d t o ∗ N −gram by one ∗ I f t h e r e i s no s u c h N −gram i n s e r t i t t o s e q u e n c e w i t h t h e v a l u e o f 1 . } }). ∗ The l i s t i s o r d e r e d form h i g h e s t t o l o w e s t v a l u e s .

O b j e c t ) ∗/ p u b l i c boolean a c c e p t ( Object o ) { i f ( IDFTable . get ( i ) . ngram . g e t V a l u e ( o ) > 2 . r e p l a c e A l l ( " [ \ t \ n \ f \ r ] " . t e x t c a t .APPENDIX B. i n s t a n c e ( ) . the filter of N-grams is used. } f i l t e r ){ /∗ ∗ C r e a t e d on 2 0 . t e x t c a t . F i l t e r s o u r c e = s o u r c e . ∗ So z e r o u w i l l be t h e l o w e s t v a l u e h e r e ∗ ∗/ p u b l i c v o i d normRanks ( ) { d o u b l e lw = l o w e s t V a l u e . next ( ) . common . ns . During the creation. i +length ) . i m p o r t o r g . " " ) . g e t V a l u e ( e l e m e n t )−lw ) .75 in this case). which a c c e p t s N −grams w i t h t h e IDF v a l u e ∗ higher than 2. i t . i n t i =0. l a n g . 7 5 ) return t r u e . p u b l i c NgramSequence b u i l d R a n k e d F i l t e r e d S e q u e n c e ( S t r i n g s o u r c e . } } /∗∗ ∗ Decrease values of a l l N −grams . F i l t e r # a c c e p t ( j a v a . " " ) . r e p l a c e A l l ( " {2 . i n s t a n c e ( ) . length ( ) ) { String s = source . by t h e l o w e s t v a l u e . accept ( s )) r e s . f o r ( I t e r a t o r i t = map . If it is greater than the given threshold (2. /∗∗ ∗ T h i s i s a f i l t e r . i n t l e n g t h . s u b s t r i n g ( i . 4 . i t e r a t o r ( ) . NgramSequence r e s = new NgramSequence ( ) . common . return r e s . t e x t c a t . i f ( f i l t e r . k e y S e t ( ) . It checks the value of the IDF. put ( element . i ) .75 ∗ ∗ @author pna ∗/ publi c c l a s s I D F F i l t e r implements F i l t e r { / ∗ ( non−J a v a d o c ) ∗ @see o r g . i ++. } } } Here is the source code of method that implements the creation of the sequence of N-grams from the plain text. it will not use this N-gram. 2 0 0 5 ∗ ∗/ p a c k a g e o r g .}+ " . countRanks ( ) . IMPLEMENTATION 32 String element = ( String ) l i s t . p u t ( e l e m e n t . return f a l s e . ) { Object element = i t . h a s N e x t ( ) . } } . documents / IDFTable . } r e s . while ( i +length <= source . add ( s ) . F i l t e r .

getSet ( ) ) . } } This is the implementation of the clusterization. abs ( av−bv ) . which updates the representation of a category with the new NgramSequence. i f ( b . i t . s i z e ()−1− s e q . g e t V a l u e ( e l e m e n t ) ) . ) { r e s += d i s t a n c e ( a . see comments in the code. g e t V a l u e ( i d ) . n s . For more details. ∗/ p u b l i c d o u b l e compare ( NgramSequence a1 . r e t u r n Math . s i z e ( ) ) b = b . add ( e l e m e n t . ) { Object element = i t . s i z e ( ) ) . i f ( av = = − 1 | | bv = = − 1 ) r e t u r n a . b . s i z e ( ) . The measure is described in 4. S e t c = new H a s h S e t ( a . g e t V a l u e ( i d ) . getSubSequence (0 . addAll ( b . i t . g e t S e t ( ) . /∗∗ ∗ U p d a t e s t h e v a l u e s i n < code > c a t e g o r i e . i f ( a . n e x t ( ) ) . s i z e ()∗ a . i t e r a t o r ( ) . s i z e () > b . s e q . IMPLEMENTATION 33 Here is the method. getSubSequence (0 . d o u b l e bv = b .3. c a t e g o r y .APPENDIX B. b=b1 . hasNext ( ) . s i z e ( ) ) a = a . for ( I t e r a t o r i t = c . O b j e c t i d ) { d o u b l e av = a . } /∗∗ ∗ R e t u r n s d i s t a n c e b e t w e e n an N −gram ( < code > i d < / code > i n two s e q u e n c e s . i t . ∗ So t h e N −grams on t h e f i r s t r a n k s w i l l h a v e h i g h e s t v a l u e s ∗/ p u b l i c v o i d j o i n ( NgramSequence s e q ) { for ( I t e r a t o r i t = seq . s i z e ( ) ) . a . NgramSequence b1 ) { NgramSequence a = a1 . /∗∗ ∗ R e t u r n s t h e " o u t−of−p l a c e−m e a s u r e " ( s i m i l a r i t y ) o f ∗ < code >a1 < / code > and < code >b1 < / code > . } } This is the implementation of the similarity measure. g e t S e t ( ) ) . NgramSequence b . Concretely it counts the components of given graph. ∗/ p r i v a t e d o u b l e d i s t a n c e ( NgramSequence a . c . double r e s = 0 . i t e r a t o r ( ) . } return 1 − r e s / ( c . s i z e ( ) ) . hasNext ( ) . next ( ) . ns < / code > w i t h t h e v a l u e s from < code > seq < / code > ∗ V a l u e s a r e u p d a t e d w i t h t h e v a l u e o f t h e rank from t h e end o f t h e l i s t . I chose only the most important parts of the . b . s i z e () > a .

99. p r i v a t e i n t maxsize . i t . s i z e ( ) + " " +comp . m a x s i z e ) ) { comp = g e t C o m p o n e n t s ( f i l e N a m e s . L i s t comp = new A r r a y L i s t ( ) . T h i s t a b l e p r o v i d e s t h e a c c e s t o t h e s i m i l a r i t y b e t w e e n two d o c u m e n t s . id2 ) . i f ( step <= 0) step = 0. ∗ c a t i s t h e o u t p u t u s e d f o r s t o r i n g l i s t o f c a t e g o r i e s and c r e a t e t h e names o f c a t e g o r i e s . " + fileNames . g e t D i f f e r e n c e ( id1 . I n c r e a s e s t h e t r e s h o l d u n t i l t h e r e a r e only s m a l l e r than r e l e v a n t ∗ components ∗/ public List s p l i t () { c o m p o n e n t s = new A r r a y L i s t ( ) . double v a l u e = 0 . I t s t a r t s a t t r e s h o l d o f 0 . S t r i n g id2 ) { return t a b l e . h a s N e x t ( ) . } log . p r i v a t e L i s t components . return components . i t . ∗ Only t h e c o m p o n e n t s o f r e l e v a n t f i l e s a r e c h o o s e n . IMPLEMENTATION source code. i f ( element . next ( ) . C o m p a r e T a b l e P a r t C o m p a r e T a b l e P a r t } ∗ i s g i v e n .01. 34 /∗∗ ∗ R e t u r n s t h e s i m i l a r i t y o f two d o c u m e n t s ∗/ p u b l i c double g e t D i f f e r e n c e ( S t r i n g id1 . h a s N e x t ( ) . s i z e ( ) < 1 ) return t r u e . } /∗∗ ∗ Removes t h e node from t h e s m a l l c o m p o p n e n t s from t h e g r a p h and ∗ adds r e l e v a n t components t o t h e r r e s u l t ∗/ p r i v a t e v o i d h a n d l e F i n i s h e d ( L i s t nodes .001 ∗ ∗ where t a b l e i s i n p u t f i l e c o n t a i n i n g t h e { @link o r g . 1 and ∗ i t e r a t e s u n t i l l a r g e r a s < code > m a x s i z e < / code > c o m p o n e n t s a r e p r e s e n t . p r i n t l n ( value +" log . flush ( ) . f o r ( I t e r a t o r i t = comp . Here i s an e x a m p l e : ∗ ∗ table = joinedtable ∗ cat = cat ∗ max = 3 5 ∗ min = 1 0 ∗ step = 0. t e x t c a t . removeAll ( element ) . comp ) . ∗ Arguments i s t h e i n i f i l e . p r i v a t e double s t e p . L i s t fileNames . h a n d l e F i n i s h e d ( f i l e N a m e s . t e x t c a t . } /∗∗ ∗ Counts t h e components . I f t h e v a l u e o f s i m i l a r i t y i s l o w e r ∗ t h a n t h e t r e s h o l d we c o n s i d e r i t a s non e x i s t i n g e d g e . w h i l e ( v a l u e < 1 & & h a s L a r g e r C o m p o n e n t ( comp . ) { L i s t element = ( L i s t ) i t . 0 1 . ) { L i s t element = ( L i s t ) i t . s i z e () >= minsize ) { c o m p o n e n t s . i f ( step >= 1) step = 0. next ( ) . . i t e r a t o r ( ) . private int [ ] h . close ( ) . C o m p a r e T a b l e P a r t C o m p a r e T a b l e P a r t } . } } } } /∗∗ ∗ R e t u r n s t r u e i f t h e g r a p h c o n t a i n s component l a r g e r t h a n t h e < code > m a x s i z e 2 < code > ∗/ p r i v a t e b o o l e a n h a s L a r g e r C o m p o n e n t ( L i s t comp . private int [ ] p . add ( e l e m e n t ) . s i z e ( ) < = maxsize ) { nodes . L i s t comp ) { f o r ( I t e r a t o r i t = comp . i n t m a x s i z e 2 ) { i f ( comp . ∗ max and min a r e t h e maximal and m i n i m a l s i z e o f c r e a t e d c l u s t e r s and t h e s t e p i s a i t e r a t i o n v a l u e f o r t h e ∗ similarity treshold . ngram . ∗ ∗ @author pna ∗/ public class SplitTable { CompareTablePart t a b l e . i t e r a t o r ( ) . v a l u e ) . s i z e ( ) ) . ∗ T h i s c l a s s c o m p u t e s t h e c o m p o n e n t s when t h e { @link o r g . s i z e ( ) + " log . i f ( element . ngram .APPENDIX B. p r i v a t e int minsize . value = value + step . "+components .

i ++) { int k= findSet ( i ). i n t j ) { i f ( h [ i ]<h [ j ] ) { p [ i ]= p [ j ] . l e n g t h . d o u b l e v a l u e ) { p = new i n t [ n o d e s 2 . s i z e ( ) ] . } L i s t r e s u l t = new A r r a y L i s t ( ) . return p [ i ] . i f ( comps [ k ] = = n u l l ) comps [ k ] = new A r r a y L i s t ( ) . i < comps . f o r ( i n t i = 0 . add ( new I n t e g e r ( i ) ) . remove ( ( new I n t e g e r ( k1 ) ) ) . s i z e ( ) < 2 ) break . j ++){ i f ( g e t D i f f e r e n c e ( ( S t r i n g ) nodes2 . c . IMPLEMENTATION 35 i f ( element . f o r ( i n t i = 0 . } return r e s u l t . length . i < p . g e t ( i ) ) .APPENDIX B. i ++){ f o r ( i n t j = i + 1 . h [ j ]= h [ i ]+ h [ j ] + 1 . c . Edges w i t h l o w e r v a l u e s t h a n < code > v a l u e < / code > ∗ a r e c o n s i d e r e d a s i f t h e y do n o t e x i s t . ( S t r i n g ) nodes2 . c . s i z e ( ) . remove ( ( new I n t e g e r ( k2 ) ) ) . } return f a l s e . flush ( ) . } f o r ( i n t i = 0 . add ( new I n t e g e r ( p1 ) ) . ( n − 1 ) / 2 i s t h e number o f e d g e s and n number o f n o d e s ∗/ p r o t e c t e d L i s t g e t C o m p o n e n t s ( L i s t nodes2 . s i z e ( ) ] . } } } . s i z e () −1. i <p . get ( j ) ) > value ) { i n t k1= f i n d S e t ( i ) . } /∗∗ ∗ D e t e r m i n e t h e r e p e r e s e n t a t i o n ( p a r e n t ) o f t h e component i n t o ∗ which < code > i < / code > b e l o n g s t o . } /∗∗ ∗ R e t u r n s c o m p o n e n t s o f t h e g r a p h r e p r e s e n t e d by < code > nodes2 < / code > a s t h e v e r t i c e s and ∗ t h e s i m i l a r i t i e s a s e d g e v a l u e s . s i z e ( ) > = maxsize2 ) return t r u e . get ( i ) . return p [ i ] . h [ i ]= h [ i ]+ h [ j ] + 1 . return p [ j ] . i + + ) { i f ( comps [ i ] ! = n u l l ) r e s u l t . } /∗∗ ∗ J o i n s two c o m p o n e n t s t o g e h t e r ∗ @param i ∗ @param j ∗ @return ∗/ p r i v a t e i n t union ( i n t i . comps [ k ] . i f ( i%20 ==0){ l o g . S e t c = new H a s h S e t ( ) . i n t k2= f i n d S e t ( j ) . ∗/ p r i v a t e int findSet ( int i ){ i f ( p [ i ] ! = i ) p [ i ]= f i n d S e t ( p [ i ] ) . } } } i f ( c . s i z e ( ) ) . k2 ) . p r i n t l n ( i + " " +c . h = new i n t [ n o d e s 2 . add ( n o d e s 2 . } } L i s t [ ] comps = new L i s t [ n o d e s 2 . j < n o d e s 2 . } else { p [ j ]= p [ i ] . i < n o d e s 2 . c . log . s i z e ( ) ] . l e n g t h . add ( comps [ i ] ) . for ( int i = 0 . i f ( k1 ! = k2 ) { i n t p1 = union ( k1 . ∗ ∗ F i n d i n g c o m p o n e n t s i s i m p e l m e n t e d w i t h t h e h e l p o f < code >union < / code >/ < code > f i n d S e t < / code > m e t h o d s ∗ which g i v e s t h e c o m p l e x i t y o f O(m + n . i + + ) { p [ i ] = i . l o g (m ) ) where m = n .