Chapter 1 - Text Clustering

1-1 Text clustering
The idea of text clustering long preceded the computer age: “Clustering is one of the most
primitive mental activities of humans, used to handle the huge amount of information they receive
every day” (Theodoridis and Koutroumbas, 2003: 398). The act of indexing, long used in libraries,
is an obvious example. Manual clustering was the only type of document clustering possible prior
to the computer age. This circumstance may have influenced much clustering work that relied
only on immediate intuitive knowledge of the world without making use of quantitative numerical
methods. In other words, text clustering was usually performed in subjective ways that relied
heavily on the perception, knowledge, and judgment of the researcher. With easier access to
electronic data in different disciplines and growing computing power on the one hand, and the
need to maintain standards of objectivity on the other, it has become ever more likely that such
procedures will involve automated computational methods (Arabie et al., 1996), where human
intuition and traditional organization methods are replaced by mathematical and computational
techniques (Golub, 2006; Golub, 2005). Recent years have accordingly witnessed a flourishing of
automated statistical clustering and classification systems, developed to systematize traditional
text classification applications and reduce their inherent subjectivity.
❖ Clustering vs. classification
The two terms clustering and classification are extensively used throughout this chapter. The
question that arises at this point is: are they synonymous or is there a distinction?
In order to answer this question, some overlapping concepts should be considered. Firstly, there
is an overlap between the two terms text classification and text categorization. In the information
retrieval (IR) and text classification literature (Sebastiani, 2006; Svetlana, 2006; Taeho, 2006;
Mirkin, 2005; Sebastiani, 2005a; Sebastiani, 2005b), the two terms are often used interchangeably.
This book too uses them interchangeably. Secondly, there is a frequent confusion between the
terms text clustering and text classification. While many studies (Janos and Balazs, 2007; Wang,
2007; Ozgur, 2006; Jain et al., 1999) use the two terms interchangeably, this book does not. The
idea they share is that they are both concerned with grouping documents into clusters or groups.
However, mechanisms for doing so are different.
Text clustering is the process of automatically grouping natural language texts according to an
analysis of their information/semantic content, by means of clustering algorithms (2004; Debole
and Sebastiani, 2003). It is simply a process of placing similar documents together into distinct
sets without labelling them (Maranis and Babenko, 2009). Text classification, on the other hand,
is the task of automatically sorting a set of documents into a number of classes or categories where
each is given a label (Maranis and Babenko, 2009; Taeho, 2006).
The main difference thus between clustering and classification is that, in the former, there are no
prior assumptions about the data structure. Unlike clustering, “classification relies on a priori
reference structures that divide the space of all possible data points into a set of classes that are
usually, but not necessarily, nonoverlapping” (Maranis and Babenko, 2009: 164-5). In other
words, clustering divides the given data into a set of clusters, whereas classification structures
these clusters and sorts them into categories according to a group structure known in advance
(Sebastiani, 2006). On this view, a text classification task starts by discovering groups that have
similar content and then organizes our perceptions of these groups into categories. Put
differently, clustering places documents into natural classes while classification places them into
predefined, known ones. There is thus a link between clustering and classification, since
clustering is a way of generating taxonomies for classification purposes. The key distinction
between the two is that clustering is an “unsupervised” activity while classification is a
supervised one. In clustering, no one assigns documents to classes; only the distribution and
makeup of the data determine cluster membership (Manning et al., 2008).
To illustrate the argument, let us consider the following example. A set of 1000 documents on
the history of English literature can be both clustered and classified. In a clustering task, the
documents are simply divided into distinct groups where similar or related
documents are grouped together. In classification, on the other hand, predefined sets are given
first. These can be Old English literature, Shakespearean literature, Augustan Literature, Romantic
Literature, and Victorian Literature. Then documents are placed or classified under these
predefined categories.
1-1-1 Applications
Text/document clustering is applied in different disciplines including IR and data mining. In IR,
document clustering is used to automatically group together documents that belong to the same
topic in order to facilitate users’ browsing of retrieval results. This is usually labelled
cluster-based retrieval. The underlying principle of cluster-based retrieval is the organization of
web pages or any documents into a hierarchical structure for subject browsing with the purpose of speeding
up and improving IR operations since search in the vector space model amounts to finding the
nearest neighbours to the query (Manning et al., 2008; Golub, 2006; Golub, 2005; Rijsbergen,
2004). The main assumption of cluster-based retrieval applications is that closely associated
documents tend to be relevant to the same requests (Rijsbergen, 1979). Kobayashi and Aono
(2004) argue that clustering is an effective approach for overcoming many IR problems. They
explain that identifying clusters or sets of documents that cover similar topics helps in ranking
and presenting to users only documents in one or a few selected clusters. Conversely, the lack of a
coherent clustering structure makes searching much more difficult, especially when specific resource types
are being sought.
In data mining, document clustering is one of several computational techniques for carrying out
data mining tasks (Mirkin, 2005). Document clustering has proved successful in carrying out many
important operations for data mining including natural language processing, feature extraction,
annotation and summarization.
1-1-2 Approaches to text classification & clustering
Currently, there are many classification systems. Broadly speaking, these systems fall into two
main categories: binary and multiclass systems. Binary classification systems are concerned with
classifying documents into just two categories or groups; that is, they distinguish between just
two classes of objects. As Maranis and Babenko (2009) explain, these systems provide a Yes/No
answer to the question: Does this document belong to class X? Such systems can be useful in
classifying emails as spam or not spam, or commercial transactions as fraudulent or not. In such
applications, binary classification systems are the natural choice as there are only two classes
or groups. Multiclass systems, in turn, divide documents into more than two classes. As the name
indicates, these classifiers assign each document or data point to one of many classes where each
has a distinct subject area. Newspaper articles, for instance, can be classified under different
categories such as news, sport, culture, business & money, politics, science, etc.
Computational methods of text clustering fall into two main categories: linguistic methods and
statistical/mathematical methods (Srivastava and Sahami, 2009; Justo and Torres, 2005).
Linguistic methods are based on natural language processing techniques. Methods of this kind
usually involve morphological and syntactic processes for extracting meaning and identifying
relationships within documents. Mathematical and statistical classification methods are essentially
based on probabilistic frameworks. The main difference between the two approaches is that
statistical and mathematical methods are not concerned with linguistic properties. Word
order and compositional semantics, for instance, play no part in their classification or
clustering performance. The focus of this chapter is on the use of statistical methods.
❖ Statistical text clustering
Statistical text clustering developed greatly in the 1990s, when it emerged as a subtask of
Information Retrieval (IR) applications (Joachims and Sebastiani, 2002). The hallmark of that
development has been the dramatic improvement in the effectiveness of text clustering systems.
The last two decades have witnessed an unprecedented revolution in developing mechanized
solutions for organizing the vast quantity of unstructured digital documents and providing
powerful tools for turning this unstructured repository into a structured one (Sebastiani, 2006).
Recent years have seen a rapid increase in the amount of stored data in all fields of knowledge
due to the continuous improvement of methods for digitally storing data. In response to the
growing overflow of information, which has made it difficult for many search engines to meet
people’s needs, various computer-based clustering and classification methods have been
developed. IR researchers and internet users have raised concerns about the poor match between
queries and the results generated by search engines. IR researchers have consequently worked to
find ways of automatically analyzing, classifying and summarizing data so that internet users can
access data effectively. They have made considerable progress over the last few years, devising
mechanisms for assessing the relevance of information to user interests. Nonetheless, statistical
text clustering
has not escaped criticism. Perhaps the most serious disadvantage with ATC applications is that in
almost all text classification schemes, semantic relatedness is merely judged at the level of lexical
semantics without taking compositional semantics into account (Gabrilovich and Markovitch,
2007). Another major criticism of ATC applications is that many of the algorithms used for
computing semantic relatedness represent documents as just bags of words where context is not
considered at all.
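The bag-of-words limitation can be illustrated with a minimal Python sketch. The two sentences and the helper name `bag_of_words` are invented for illustration and are not taken from any of the systems cited above; the point is only that a representation built from word counts cannot tell apart texts that differ in word order alone.

```python
from collections import Counter


def bag_of_words(text):
    """Lower-case the text, split on whitespace, and count each word."""
    return Counter(text.lower().split())


# Two invented sentences with the same words in a different order.
first = "the dog bites the man"
second = "the man bites the dog"

# Word order is discarded, so both sentences get an identical representation.
print(bag_of_words(first) == bag_of_words(second))
```

Although the two sentences mean different things, a system that works only on these counts has no basis for distinguishing them.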
❖ Content vs. context
To date, the standard and most widely used approach in statistical text clustering applications
is clustering by content, that is, the clustering of documents by the words they contain. Content
clustering is carried out by computing semantic similarity/distance, or what can be called
measuring proximity, between documents. It is thus a lexical semantic function. It has always been
argued that semantic information within documents is key to understanding and determining the
content of such documents. In some recent studies (Attardi et al., 1999; Attardi et al., 1998),
however, clustering by context has been introduced as a working approach to avoid the problems
that arise in clustering by content. Clustering by context is a new method for grouping web pages
whereby the context surrounding a link is used for categorizing the document referred to by the
link. The conception is based on the assumption that a web page which refers to a document must
contain enough hints about its content to classify the
document (Attardi et al., 1998).
In relation to clustering by context, a number of software programs have been devised to carry
out such tasks, including SenseClusters (Purandare and Pedersen, 2004). Programs of this kind
allow users to cluster similar contexts such as emails and web pages (Pedersen, 2008). Their
working principle is that documents can be grouped on the basis of their mutual contextual
similarities (Purandare and Pedersen, 2004). Such programs have indeed proved successful when
applied to web pages, and their merits are more tangible with multimedia material. Nevertheless,
an approach of this kind has some limitations. The most serious shortcoming is that it is not
concerned with the analysis of the content of documents. A further drawback is that in almost all
context classification applications “identical
replications of controlled experiments result in different conclusions” (Martin et al., 2005: 470).
1-2 Document clustering models
Several document clustering models exist. A clustering model is a formalization of the way of
thinking about document clustering. Such formalism can be established and defined in the form
of algorithms capable of computing semantic similarity among documents. The goal of all
clustering models or systems is to create clusters that are coherent internally, but clearly different
from each other (Manning et al., 2008). The selection of a clustering system however is a
complicated and controversial matter. The idea that clustering results can be used in important
applications makes it crucial to consider the selection of a clustering system or model well.
The main bulk of clustering systems or approaches can be best described under the heading vector
semantics (VS) or vector space clustering (VSC). The underlying principle of VSC is measuring
or computing similarity between the documents to be clustered. VSC is an umbrella approach that
encompasses a number of methods and techniques including vector space model (VSM), latent
semantic indexing (LSI), explicit semantic analysis (ESA), and concept mining.
Prior to discussing the vector space clustering theory and its main methods, however, there are
some concepts that need to be addressed. They are fundamental to understanding VSC theory and
applications. These are data, geometry, and vector space. These are discussed as follows.
❖ The nature of data
‘Data’ is the plural form of ‘datum’, the past participle of the Latin ‘dare’, ‘to give’, and means
‘things that are given’. A datum is therefore something to be accepted as granted or assumed as
fact, a true statement about the world, used as a basis for reasoning, discussion, or calculation
(Simpson and Weiner, 1989). The question ‘what is a true statement about the world?’ has been
intensively studied in cognitive science. For convenience, this discussion adopts the
attitude prevalent in most areas of science: data are abstractions of what we observe using our
senses, often with the aid of instruments (Chalmers, 1999).
There are numerous sources of data. Any object in the world can be a source of data: people,
animals, plants, institutions, texts, sounds, etc. Data itself, however, is ontologically different
from the world: “The world is as it is; the data is an interpretation of it for the purpose of
scientific study” (Moisl, 2008a: 876). To explain the difference between the two concepts, let us
consider this example.
Table 1-1: Age distribution of the Egyptian people
Age group   Population (male)   Population (female)   %
0–18        5,560,489           5,293,871             18.0
19–59       20,193,876          19,736,516            66.3
60+         4,027,721           5,458,235             15.7
The table above describes the age distributions of the Egyptian people. In this example,
populations are the world, the reality while age and sex are just features or observations about this
world. The populations themselves represent the real world while the features of age and sex are
the variables that describe population. Data, therefore, is just these variables or values that
describe populations: “Data never provides more than a pale and hazy shadow, a murky outline,
of the true workings of the world. And yet this gossamer wisp is just enough for us to grasp at the
edges of understanding” (Pyle, 1999: 46).
In linguistics, a text corpus is not the linguist’s data. A text corpus is as it is; measurements of
such things as sentence length, the use of function words, or the density of loan words can be the
linguist’s data. Texts are facts; they are unchangeable entities. Data, on the other hand, can
differ from one analysis to another depending on the nature and purpose of the analysis. Taking
Shakespeare’s corpus as an example, data can differ from one application to another. If the
analysis is concerned, for instance, with the relationship between the use of static/dynamic
adjectives and characterization, the data are likely to be composed only of the adjectives
and proper names within the documents. If the analysis, however, is concerned with the
Latin element in Shakespeare’s texts, data can be abstracted from the lexical
types within the texts to determine the ratio of Latin words to the overall texts.
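The distinction between a text and the data abstracted from it can be made concrete with a short Python sketch. It derives two different measurements, sentence length and function-word ratio, from the same text. The sample text, the small `FUNCTION_WORDS` list and both function names are assumptions made purely for illustration; a real study would use a full corpus and a principled word list.

```python
import re

# A tiny illustrative function-word list (an assumption, not a standard list).
FUNCTION_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it"}


def sentence_lengths(text):
    """Number of word tokens in each sentence (sentences split on ., ! or ?)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(re.findall(r"[A-Za-z']+", s)) for s in sentences]


def function_word_ratio(text):
    """Share of word tokens that belong to FUNCTION_WORDS."""
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    return sum(t in FUNCTION_WORDS for t in tokens) / len(tokens)


text = "The king is in the hall. It is late."  # invented sample text

# The same text yields different data depending on the research question.
print(sentence_lengths(text))
print(function_word_ratio(text))
```

The text itself never changes; only the variables chosen to describe it do.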
❖ Data abstraction
In general, any aspect of the world can be described in a number of ways and to varying degrees of
precision. However, there is no theory-free observation of the world (Popper, 1963; Popper, 1959).
In other words, entities in a domain of inquiry only become relevant to observation in terms of a
research question framed using the ontology and axioms of a theory about the domain. For
example, in marketing analysis variables are selected in terms of the discipline of marketing
broadly defined, which includes the division into sub-disciplines such as direct marketing, brand
marketing, online marketing, promotions, and public relations.
Data can, therefore, only be created in relation to a research question that is defined on the domain
of interest, and that thereby provides an interpretative orientation. Without such an orientation,
how does one know what to observe, what is important, and what is not? The domain of interest
in the present case is the collection of Hardy’s prose works, and the research question defined on
it is:
Can an experimentally replicable, objective and conceptually useful classification based
on empirical evidence abstracted from Thomas Hardy’s prose fiction texts be defined?
❖ Data representation
Given that data is an interpretation of some aspect of the world in terms of variables, it is crucial
to select the appropriate variables for a successful data analysis. Unfortunately, there are no pre-
defined rules for variable selection. The process depends in the first place on the nature and
purpose of the analysis. The fundamental principle of such a process is that the variables must
describe all and only those aspects of the domain that are relevant to the purpose of analysis
(Moisl, 2008 a; Pyle, 1999).
In text clustering, the semantics of each selected variable determines a particular interpretation of
the domain of interest, and the domain is measured in terms of the semantics. Measurement is
fundamental in data preparation because it makes the link between data and the world, and thus
allows the results of data analysis to be applied to the understanding of the world (Moisl, 2008 a;
Pyle, 1999). Measurement is only possible in terms of some scale and there are various types of
measurement scale, but for present purposes the main dichotomy is between numeric and non-
numeric. The cluster analysis methods discussed in due course assume numeric measurement as
the default case, and for that reason the same is done here. If they are to be analyzed using
mathematically-based computational methods, selected variables must be mathematically
represented. A widely used way for doing this in text clustering is vector space representation
(Salton et al., 1975).
❖ The nature of geometry
Geometry is based on human intuitions about the world around us. It is a branch of
mathematics concerned with the properties and relations of magnitudes in space, such as lines,
surfaces, and solids. The earliest systematic discussion of geometry was developed by Euclid
around 300 B.C. (Lang, 1958). Euclid built on the earlier attempts of mathematicians such as
Hippocrates and Eudoxus, who were concerned with
defining the intuitive notions of space, direction, distance, size and shape. He placed the
propositions of the previous mathematicians into a comprehensive deductive and logical system
entitled Elements (Ball, 1935; Taylor, 1893; Euclid, 1826). Euclidean geometry was used
virtually unchanged for 2,000 years for understanding physical reality (Lang and Murrow,
1988; Prenowitz and Jordan, 1965). Throughout that period, Euclidean geometry was seen as a
perfect model for logical reasoning. At the end of the 19th century, however, mathematicians found
logical deficiencies within the Euclidean framework; Prenowitz and Jordan (1965) argue that there
are certain logical gaps in its reasoning. Nevertheless, Euclidean geometry
is still used in a range of disciplines including the analysis of textual data, and as such the
present discussion is based on Euclidean geometry. Some fundamental ideas of that geometry are
presented below.
❖ Euclidean geometry
Euclidean geometry is concerned with modelling the world as it is experienced. It describes the
physical world in any finite number of dimensions using a distance formula. In mathematical
terms, Euclidean geometry is concerned with studying the relationships among distances and
angles in a space. According to Euclid, a 1-dimensional, 2-dimensional, or 3-dimensional space can
be described and defined by axes. For a 1-dimensional space, only a single numerical measure is
required. The distance between two objects can be defined by length and graphically represented
as in the figure below.
Figure 1-1 Axis for a 1-dimensional space
Likewise, a 2-dimensional space can be defined using two numerical measures. A school’s
playground, for instance, can be defined in terms of length and width. The two measurements can
be represented in Euclidean geometry as a 2-dimensional space as in Figure 1-2.
Figure 1-2: Axes for a 2-dimensional space
Euclid observed that there are still other kinds of physical object which cannot be described in
one or two dimensions but require three; a modern example is the Big Ben tower. In such a case,
three measurements are required: length, width, and height, and these can be represented in
Euclidean geometry as a 3-dimensional space as in Figure 1-3.
Figure 1-3: Axes for a 3-dimensional space
Because Euclid’s geometry does not go beyond three dimensions, modern mathematics generalized
Euclid’s concepts of distance, length, and angle so that any number of dimensions can be defined.
The economic growth of developing countries, for example, can be represented by an arbitrarily
large number of dimensions such as the role of physical and human capital, technological progress,
scale of investments, trade, capital mobility, fixed assets, net capital stock, and employment.
These can be represented using N-dimensional space. Some fundamental ideas about N-dimensional
space are presented below.
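The generalization of distance, length, and angle to any number of dimensions can be stated compactly. The formulas below are the standard Euclidean definitions rather than formulas quoted from the sources cited in this chapter; the symbols x, y, n and θ are notation introduced here for illustration.

```latex
% Distance between two points x and y, length of x, and angle between
% x and y, all in n-dimensional space
d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2},
\qquad
\lVert \mathbf{x} \rVert = \sqrt{\sum_{i=1}^{n} x_i^2},
\qquad
\cos\theta = \frac{\sum_{i=1}^{n} x_i y_i}{\lVert \mathbf{x} \rVert \, \lVert \mathbf{y} \rVert}
```

For n = 2 or n = 3 these reduce to the familiar distance, length, and angle of plane and solid geometry.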
❖ Vector space
A vector space is the basic object of study in linear algebra, one of the most basic of all branches
of mathematics. Vector space theory is based on the development of a mathematical structure that
consists of a set of vectors associated with a field of scalars (which are the elements of
these vectors). The theory is used in different applications “and the vectors and scalars for one
application will generally be different from the vectors and scalars for another application”
(Howlett, 2010: 10-11). A vector space is defined as a geometrical interpretation of a vector in
which the dimensionality 𝓃 of the vector defines an 𝓃-dimensional space, the sequence of
numerical values comprising the vector specifies coordinates in the space, and the vector itself is
a point at the specified coordinates (Howlett, 2010; Fraleigh et al., 1995). For example, let X and
Y be two axes with the values 36 and 160 respectively. These two values can be represented as the
vector V = [36, 160]. This is a vector in 2-dimensional space, and it can be
plotted as in Figure 1-4. The components of the 2-dimensional
vector correspond to the coordinates in the 2-dimensional vector space with axes 0…100 and
0…200, counting 36 along the horizontal axis and 160 along the vertical.
Figure 1-4: A vector graphic representation of a 2-dimensional space
This can be extended to include a third dimension, and can be graphically represented as a 3-
dimensional vector space as in Figure 1-5.
Figure 1-5: A vector graphic representation of a 3-dimensional space
Indeed, we are not limited to only using 2 or even 3 dimensions in mathematics. We can have any
number of dimensions and we can use each axis to represent a different value. An axis can be
anything. It can be an image, numbers, or documents. This discussion is only concerned with
documents. The following is an example.
The following are 3 different titles that represent different categories:
Title No. 1- United States and Russia agree historic nuclear deal
Title No. 2- Korean sailors killed in naval incident
Title No. 3- Couple convicted of toddler murder
The titles are given the names A, B, and C respectively and are represented as follows:
A = {United, States, and, Russia, agree, historic, nuclear, deal}
B = {Korean, sailors, killed, in, naval, incident}
C = {Couple, convicted, of, toddler, murder}
These vectors can be included together in just one structure, technically called a matrix. For now,
let’s give it the name X.
X = {A: United, States, and, Russia, agree, historic, nuclear, deal; B: Korean, sailors, killed, in,
naval, incident; C: Couple, convicted, of, toddler, murder}.
Where there is more than one vector in a space as in the above example, they are collected so as
to constitute a matrix in which each row is a vector, as in the figure below.
Figure 1-6: A matrix X of the 3 vectors A, B, and C
    United  States  and  Russia  agree  historic  nuclear  deal  Korean  sailors  killed  in  naval  incident  Couple  convicted  of  toddler  murder
A:  1       1       1    1       1      1         1        1     0       0        0       0   0      0         0       0          0   0        0
B:  0       0       0    0       0      0         0        0     1       1        1       1   1      1         0       0          0   0        0
C:  0       0       0    0       0      0         0        0     0       0        0       0   0      0         1       1          1   1        1
The matrix X in the above example is composed of two main constituents: rows and columns. A
matrix with m rows and n columns defines a manifold in 𝓃-dimensional space, that is, the
shape of the data in that space. When the row vectors are plotted, for example in a 3-dimensional
space, they form a cloud of points, as in Figure 1-7.
Figure 1-7: A manifold in 3-dimensional space
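The construction of the matrix X from the three titles can be sketched in a few lines of Python. This is a minimal illustration, assuming binary entries (1 if a word occurs in a title, 0 otherwise) and columns ordered by first appearance; frequency counts or another column order would be equally valid choices. The variable names are illustrative.

```python
# The three titles from the example above.
titles = {
    "A": "United States and Russia agree historic nuclear deal",
    "B": "Korean sailors killed in naval incident",
    "C": "Couple convicted of toddler murder",
}

# Column labels: every distinct word, in order of first appearance.
vocabulary = []
for text in titles.values():
    for word in text.split():
        if word not in vocabulary:
            vocabulary.append(word)

# One row vector per title: 1 if the word occurs in the title, otherwise 0.
X = {
    name: [1 if word in text.split() else 0 for word in vocabulary]
    for name, text in titles.items()
}

print(len(vocabulary))
for name, row in X.items():
    print(name, row)
```

Each title becomes a point in a 19-dimensional space, one dimension per distinct word, which is why the three rows cannot be drawn directly in the way a 2- or 3-dimensional vector can.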
1-3 Vector space clustering (VSC)
Vector space clustering (VSC) is based on measuring the relative distances between the row
vectors. The distance between any two vectors in a space is jointly determined by the size of the
angle between the lines joining them to the origin of the space’s coordinate system, and by the
lengths of those lines. These are shown in Figure 1-8 and Figure 1-9.
Figure 1-8: The angle between vectors

Figure 1-9: Vector length
In VSC, the interplay between length and angle is what determines the distance relations between
and among vectors in a space, and thereby their clustering structure. This can be explained as
follows.
• If the angle is kept constant and the lengths of the vectors are made unequal by
lengthening or shortening one of them, then the distance typically increases, as shown in
Figure 1-10.
Figure 1-10
• If the lengths are kept equal but the angle is increased, the distance between them
increases, as shown in Figure 1-11.
Figure 1-11
• If the lengths are kept equal and the angle is decreased, the distance decreases too, as shown in
Figure 1-12.
Figure 1-12
Based on these observations, vectors in the first two cases, shown in Figure 1-10 and Figure 1-11,
are set apart and not clustered together, while they are more likely to be clustered together in the
last case shown in Figure 1-12.
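The interplay between length and angle can be checked numerically. The Python sketch below builds 2-dimensional vectors from a length and an angle and compares the Euclidean distance between their end points in the three cases just described. The specific lengths and angles are arbitrary values chosen for illustration, and the helper names are not taken from any library.

```python
import math


def vector(length, angle_degrees):
    """2-dimensional vector of the given length at the given angle from the x-axis."""
    angle = math.radians(angle_degrees)
    return (length * math.cos(angle), length * math.sin(angle))


def distance(u, v):
    """Euclidean distance between the end points of u and v."""
    return math.dist(u, v)


# Baseline: equal lengths, 30 degrees apart.
baseline = distance(vector(1.0, 0), vector(1.0, 30))

# Same angle, one vector lengthened: the distance grows.
unequal_lengths = distance(vector(1.0, 0), vector(2.0, 30))

# Equal lengths, wider angle: the distance grows.
wider_angle = distance(vector(1.0, 0), vector(1.0, 60))

# Equal lengths, narrower angle: the distance shrinks.
narrower_angle = distance(vector(1.0, 0), vector(1.0, 10))

print(baseline, unequal_lengths, wider_angle, narrower_angle)
```

The first case uses lengthening only; the effect of shortening one vector depends on the angle and on how much it is shortened, so it is not shown here.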
In what follows the main applications of VSC are discussed.
1-3-1 Vector space model (VSM)
The vector space model (VSM) is a technique in which documents are compared with each other and
then indexed or classified in terms of their similarity or distance, based on the words they
contain. In clustering, it is used to organize a collection of documents, each represented as a
vector, into distinct clusters based on similarity. The theory was first developed by Salton (1971)
essentially for IR purposes four decades ago and since then it has become a standard tool in IR
systems. The underlying procedure of VSM is first to extract all useful information within a
document collection and record it in an index known as a vector space. Then a proximity
measurement is used to compute the semantic similarity among the documents with the purpose of
grouping similar documents together. The way data is mathematically represented using VSM is
discussed further elsewhere.
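The two steps of VSM, building vectors and then measuring proximity, can be sketched as follows. This is a minimal illustration under stated assumptions: the three toy documents are invented, raw word frequencies are used as vector components, and cosine similarity is used as the proximity measure. Other weightings and measures are common, and nothing here is claimed to be the configuration of any particular IR system.

```python
import math
from collections import Counter


def to_vector(text, vocabulary):
    """Word-frequency vector of `text` over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Invented toy documents.
documents = {
    "d1": "nuclear deal signed by russia",
    "d2": "russia agrees nuclear deal",
    "d3": "toddler murder trial begins",
}

# Step 1: record every document as a vector over the shared vocabulary.
vocabulary = sorted({w for text in documents.values() for w in text.split()})
vectors = {name: to_vector(text, vocabulary) for name, text in documents.items()}

# Step 2: measure proximity between pairs of documents.
print(cosine_similarity(vectors["d1"], vectors["d2"]))
print(cosine_similarity(vectors["d1"], vectors["d3"]))
```

Documents d1 and d2 share three words and so score well above d1 and d3, which share none; a clustering method would use such scores to place d1 and d2 in the same group.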
In spite of being widely used, many studies have doubted the effectiveness of VSM as it is wholly
based on lexical semantics with no regard to the importance of context in identifying intended
meanings (Gabrilovich and Markovitch, 2007; Gabrilovich and Markovitch, 2006; Landauer et
al., 1998; Deerwester et al., 1990). Likewise, some studies have argued that VSM is less effective
in clustering and ranking web pages (Markov et al., 2008; Maguitman et al., 2005), since these
have special features such as hyperlinks and structural information, which carry
additional information that is ignored in VSM applications.
1-3-2 Latent Semantic Indexing
Latent Semantic Indexing (LSI) is a statistical/mathematical technique for extracting and
representing the underlying semantic connections between both the documents and the words in
a large corpus of texts for the purpose of automatic indexing or grouping of documents (Adrian
et al., 2007; Foltz et al., 1998; Landauer et al., 1998; Deerwester et al., 1990; Dumais et al., 1988).
The literature suggests that LSI was originally developed to tackle problems such as
polysemy and synonymy that affected the validity of VSC performance. Today, it has
numerous applications and techniques. Almost all LSI models assume in principle that a document
arises from one single source even if that source is not determined or defined.
The main assumption behind LSI is that linguistic information that is typically ignored in VSM
applications is fundamental; it is not supplemental information that can be ignored (Kuhn et al.,
2007; Berlin, 2006; Kuhn, 2006). This linguistic information, Deerwester et al. (1990) postulate,
has some underlying semantic structure that is essential for IR and clustering applications.
Nevertheless, LSI, Kuhn (2006) explains, is based essentially on the VSM approach. Just like
VSM, the first step in LSI is to represent a document collection in the form of a matrix of rows and
columns where rows stand for unique words and columns for text passages.
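In the standard formulation of LSI (Deerwester et al., 1990), this word-by-passage matrix is then reduced by a truncated singular value decomposition. The notation below (X, U, Σ, V, t, d and k) is introduced here for illustration only.

```latex
% X: t-by-d matrix (t unique words as rows, d text passages as columns)
X = U \Sigma V^{\top}
\qquad\longrightarrow\qquad
X_k = U_k \Sigma_k V_k^{\top}, \quad k \ll \min(t, d)
```

The reduced matrix keeps only the k largest singular values, so words and passages are compared in a k-dimensional space rather than in the original one.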
The underlying principle of LSI is that it uses statistical correlation between word and passage
meaning to create a similarity score between any two documents based entirely on the words that
they contain. Landauer et al. (1998) assert that the relations LSI generates are well correlated with
several human cognitive phenomena involving association or semantic similarity. Unlike VSM,
LSI “uses as its initial data not just the summed contiguous pairwise (or tuple-wise) co-
occurrences of words but the detailed patterns of occurrences of very many words over very large
numbers of local meaning-bearing contexts, such as sentences or paragraphs, treated as unitary
wholes” (Landauer et al., 1998: 5).
Although LSI is reported to achieve good performance in grouping documents of similar topical
meaning together, it has some problems in practice. One major drawback of this
approach, however, is that the resulting dimensions might be difficult to interpret. That is, results
that can be justified on a mathematical level may have no interpretable meaning in natural
language. Equally important, it is theoretically unsuitable for many applications because a document
does not necessarily arise from a single theme. Rather, a document often contains multiple themes.
1-3-3 Explicit Semantic Analysis
Explicit Semantic Analysis (ESA) is a scheme for computing semantic similarity developed by
Gabrilovich and Markovitch (2007). It computes semantic relatedness between two given texts. The
technique is based on Wikipedia, currently the largest online encyclopaedia. It represents
the meaning of texts in a high-dimensional space of concepts derived from Wikipedia. The main
assumption behind ESA is that computing the degree of semantic relatedness between fragments
of natural language text can be improved by explicitly representing the meaning of any text in
terms of Wikipedia-based concepts (Gabrilovich and Markovitch, 2007; Gabrilovich and
Markovitch, 2006).
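The representation can be illustrated with a small sketch. It is not the authors' implementation. Three invented one-line "articles" stand in for Wikipedia concepts, the smoothed TF-IDF weighting is an assumption, and all function names are hypothetical. Each text is mapped to a vector with one weight per concept, and relatedness is taken as the cosine between two such vectors.

```python
import math
from collections import Counter

# Hypothetical stand-ins for Wikipedia articles (concept name -> article text).
concepts = {
    "Linguistics": "grammar syntax language word meaning sentence",
    "Statistics": "cluster data variance matrix sample mean",
    "Music": "melody rhythm song instrument harmony note",
}
names = list(concepts)


def tokenize(text):
    return text.lower().split()


# Term counts per concept and the number of concepts each word occurs in.
term_counts = {c: Counter(tokenize(t)) for c, t in concepts.items()}
doc_freq = Counter(w for counts in term_counts.values() for w in counts)


def word_vector(word):
    """Weight of one word in every concept (smoothed TF-IDF, an assumption)."""
    if word not in doc_freq:
        return [0.0] * len(names)
    idf = math.log(len(names) / doc_freq[word]) + 1.0
    return [term_counts[c][word] * idf for c in names]


def concept_vector(text):
    """Represent a text as the sum of the concept vectors of its words."""
    vec = [0.0] * len(names)
    for w in tokenize(text):
        for i, x in enumerate(word_vector(w)):
            vec[i] += x
    return vec


def relatedness(a, b):
    """Cosine between the concept vectors of two texts (0.0 if either is empty)."""
    u, v = concept_vector(a), concept_vector(b)
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (nu * nv)


same_topic = relatedness("syntax of a sentence", "grammar and meaning")
diff_topic = relatedness("syntax of a sentence", "variance of a sample")
print(round(same_topic, 3), round(diff_topic, 3))
```

The two linguistics phrases share no word but fall on the same concept, so they score as related, whereas the statistics phrase does not.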
Weiping et al. (2008) assert that the newly developed approach has achieved good performance
in computing semantic relatedness. The experimental results, however, indicate that ESA has
some problems. One main problem with this technique is that it cannot exactly determine the
intended sense of an ambiguous word (Weiping et al., 2008), which considerably limits its
usefulness. Moreover, ESA is not an integrated classification system. It is merely a technique for
computing semantic relatedness, and it is not successful in relating similar long documents together.
1-3-4 Concept mining
Concept mining is a process that has long been used to provide an automated categorization of
documents based on their content. It is a workflow used to discover implicit and explicit relationships,
useful associations and groupings in a set of documents or data collection with the purpose of
detecting similar documents in a large corpus and classifying them by topic. It can thus provide
powerful insights into the meaning, provenance and similarity of documents (Looks et al., 2007;
Fang et al., 2006; Han and Kamber, 2001). The assumption is that each word in a given document
relates to several possible concepts, which makes it possible to cluster documents based on their
content. The underlying principle of concept mining is the conversion of words into concepts.
This is done in two successive steps. First, documents are reduced to a sequence of words that
describes the content. Second, these words are mapped onto concepts.
For example, given a number of documents on generative grammar, concept mining
proceeds by identifying relationships and generating facts based on the data within the collection
and the dimensions of the subject. These can be something like Chomsky and generative grammar;
theoretical linguistics and generative grammar; Phrase Structure Rules (PSR) and generative
grammar; deep and surface structures in generative grammar; etc. Documents can also be
classified by topic, such as WH-movement; linguistic competence; etc.
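The two steps can be sketched as follows. This is an illustration under stated assumptions, not a description of any cited system. The word-to-concept lexicon, the three sample documents, and the function names are all hypothetical, and each word is mapped to a single concept for simplicity even though a word may in general relate to several.

```python
from collections import Counter, defaultdict

# Hypothetical word-to-concept lexicon (one concept per word for simplicity).
lexicon = {
    "chomsky": "generative grammar",
    "transformational": "generative grammar",
    "wh": "WH-movement",
    "movement": "WH-movement",
    "competence": "linguistic competence",
    "performance": "linguistic competence",
}

# Hypothetical sample documents.
documents = {
    "doc1": "Chomsky introduced transformational rules.",
    "doc2": "WH movement leaves a trace.",
    "doc3": "Competence differs from performance.",
}


def to_words(text):
    """Step 1: reduce a document to a sequence of lowercase words."""
    return [w.strip(".,;:").lower() for w in text.split()]


def to_concepts(words):
    """Step 2: map the words onto concepts and count each concept."""
    return Counter(lexicon[w] for w in words if w in lexicon)


# Classify each document under its most frequent concept.
topics = defaultdict(list)
for name, text in documents.items():
    counts = to_concepts(to_words(text))
    if counts:
        topic = counts.most_common(1)[0][0]
        topics[topic].append(name)

for topic, members in topics.items():
    print(topic, members)
```

A realistic lexicon would come from a thesaurus or ontology and would need a disambiguation step, which this sketch omits.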
1-4 Summary
This chapter has given an account of the different ways documents can be clustered thematically
and yet objectively. The one approach in the literature that seems theoretically most consistent
with our goal is VSM. In spite of its limitations, VSM remains the most widespread method for
data representation in document clustering and classification applications. So far, most document
clustering approaches work with VSM methods. This can be justified in that it is still suitable for
the majority of clustering and classification purposes. The way data is mathematically represented
using VSM is the subject of the next chapter.
Bibliography
Abdur, C., McCabe, M. C., David, G. and Ophir, F. (2002) 'Document normalization revisited',
Proceedings of the 25th annual international ACM SIGIR conference on Research and
development in information retrieval, pp. 381-382.
Abercrombie, L. (1912) Thomas Hardy. A critical study. Martin Secker: London.
Abu-Salem, H., Mahmoud, A.-O. and Martha, W. E. (1999) 'Stemming methodologies over
individual query words for an Arabic information retrieval system', Journal of the
American Society for Information Science, 50, (6), pp. 524-529.
Adams, R. (2003) Perceptions of innovations: exploring and developing innovation classification.
PhD thesis. Cranfield University.
Adrian, K., Stéphane, D. and Tudor, G. (2007) 'Semantic clustering: Identifying topics in source
code', Information and Software Technology, 49, (3), pp. 230-243.
Afifi, A. A., Clark, V. and May, S. (2004) Computer-aided multivariate analysis. Boca Raton,
Fla. ; London: Chapman & Hall/CRC.
Altintas, K., Can, F. and Patton, J. M. (2007) 'Language Change Quantification Using Time-
separated Parallel Translations', Lit Linguist Computing, 22, (4), pp. 375-393.
Amati, G. and Rijsbergen, C. J. V. (2002) 'Probabilistic models of information retrieval based on
measuring the divergence from randomness', ACM Transactions on Information Systems,
20, (4), pp. 357-389.
Anderberg, M. R. (1973) Cluster analysis for applications. New York ; London: Academic Press.
Anderson, B. R. O. G. (1991) Imagined communities : reflections on the origin and spread of
nationalism. London: Verso.
Arabie, P., Hubert, L. J. and Soete, G. d. (1996) Clustering and classification. Singapore ; London:
World Scientific.
Argamon, S. and Olsen, M. (2006) 'Toward Meaningful Computing', Communications of ACM,
49, (4), pp. 33-35.
Attardi, G., Di Marco, S. and Salvi, D. (1998) 'Categorization by Context', Journal of Universal
Computer Science, 4 (9), pp. 719–736.
Attardi, G., Gulli, A. and Sebastiani, F. (1999) THAI-99.
Atwood, M. E. (1972) Survival : a thematic guide to Canadian literature. Toronto: Anansi.
Austen, J. (1818) Northanger Abbey: and Persuasion. London: John Murray.
Ball, W. W. R. (1935) A Short Account of the History of Mathematics. London: Macmillan.
Baugh, A. C. (1967) A literary history of England. Routledge & K.Paul.
Beer, G. (2004) Darwin's Plots: Evolutionary Narrative in Darwin, George Eliot and Nineteenth-
Century Fiction. Cambridge: Cambridge University Press.
Bellamy, L. (1998) 'Regionalism and Nationalism: Maria Edgeworth, Walter Scott and the
Definition of Britishness', in Snell, K. D. M.(ed), The Regional Novel in Britain and
Ireland 1800-1990.Cambridge University Press.
Benazon, M. (1978) 'Dark and Fair: Character Contrast in Hardy’s "The Fiddler of the Reels"',
Ariel: A Review of International English Literature, 9, (2), pp. 75-82.
Berg, B. L. (1998) Qualitative research methods for the social sciences. Boston: Allyn and Bacon.
Berlin, C. (2006) 'Exploring the use of latent topical information for statistical Chinese spoken
document retrieval', Pattern Recogn. Lett., 27, (1), pp. 9-18.
Berry, M. W. (ed.) (2004) Survey of Text Mining: Clustering, Classification, and Retrieval. New
York: Springer.
Biber, D. (1986) 'Spoken and Written Textual Dimensions in English: Resolving the
Contradictory Findings', Language, 62, (2), pp. 384-413.
Biber, D. (1992) 'The Multidimensional Approach to Linguistic Analyses of Genre Variation: An
Overview of Methodology and Finding', Computers and the Humanities, 26, (5-6), pp.
331-347.
Binongo, J. N. G. and Smith, M. W. A. (1999) 'The application of principal component analysis
to stylometry', Lit Linguist Computing, 14, (4), pp. 445-466.
Bookstein, A. and Swanson, D. R. (1974) 'Probabilistic models for automatic indexing', Journal
of the American Society for Information Science, 25, pp. 312-318.
Boot, P. (2006) 'Decoding Emblem Semantics', Lit Linguist Computing, 21, (suppl_1), pp. 15-27.
Boumelha, P. (1982) Thomas Hardy and Women: Sexual Ideology and Narrative Form. Brighton:
Harvester Wheatsheaf.
Brady, K. (1982) The short stories of Thomas Hardy. New York: St. Martin's Press.
Brady, K. (ed.) (1999) The Withered Arm and other Stories. London: Penguin.
Breckenridge, J. N. (2000) 'Validating Cluster Analysis: Consistent Replication and Symmetry',
Multivariate Behavioral Research, 35, (2), pp. 261-285.
Breton, A. (1997) Anthology of Black Humor. San Francisco: Translated into English by Mark
Polizzotti. City Lights Books ; Subterranean Co.
Brooks, J. R. (1971) Thomas Hardy: The Poetic Structure. Ithaca, N.Y.: Cornell University Press.
Brown, D. (1961) Thomas Hardy. Longmans, Green.
Burrows, J. (2004) 'Textual Analysis', in Schreibman, S., Siemens, R. and Unsworth, J.(eds) A
Companion to Digital Humanities.Oxford: Blackwell, pp. 88-97.
Burrows, J. F. (1986) 'Modal Verbs and Moral Principles: An Aspect of Jane Austen's Style', Lit
Linguist Computing, 1, (1), pp. 9-23.
Burrows, J. F. (1987) Computation into criticism : a study of Jane Austen's novels and an
experiment in method. Oxford: Clarendon.
Burrows, J. F. (2003) 'Questions of Authorship: Attribution and Beyond A Lecture Delivered on
the Occasion of the Roberto Busa Award ACH-ALLC 2001, New York', Computers and
the Humanities, 37, (1), pp. 5-32.
Burrows, J. F. (2005) 'Who wrote Shamela? Verifying the Authorship of a Parodic Text', Lit
Burrows, J. F. (2007) 'All the Way Through: Testing for Authorship in Different Frequency
Strata', Lit Linguist Computing, 22, (1), pp. 27-47.
Carmichael, J. W. and Sneath, P. H. A. (1969) 'Taxometric Maps', Syst Biol, 18, (4), pp. 402-415.
Cecil, D. (1943) Hardy, the novelist; an essay in criticism. London: Constable.
Chalmers, A. F. (1999) What is This Thing Called Science. Indianapolis: Hackett Pub.
Cheong, M.-Y. and Lee, H. (2008) 'Determining the number of clusters in cluster analysis',
Journal of the Korean Statistical Society, 37, (2), pp. 135-143.
Cohen, R. (1989) The Future of literary theory. New York: Routledge.
Coleman, T. (ed.) (1976) An Indiscretion in the Life of an Heiress. London: Hutchinson
Cooley, W. W. and Lohnes, P. R. (1985) Multivariate data analysis. Malabar, Fla.: R.E. Krieger
Pub. Co.
Corns, T. N. (1987) 'Computers in the Humanities: Methods and Applications in the Study of
English Literature', Lit Linguist Computing, 2, (2), pp. 127-130.
Cox, R. G. (1970) Thomas Hardy; the critical heritage. New York: Barnes & Noble.
Craig, H. (1999) 'Contrast and Change in the Idiolects of Ben Jonson Characters', Computers and
the Humanities, 33, (3), pp. 221-240.
Craig, H. (2004) 'Stylistic Analysis and Authorship Studies', in Schreibman, S., Siemens, R. and
Unsworth, J.(eds) A Companion to Digital Humanities .Oxford: Blackwell, pp. 273-288.
Dalziel, P. (1992a) 'Hapless Destiny: an uncollected story of marginalized lives', Thomas Hardy
Journal, (8), pp. 41-2.
Dalziel, P. (1992b) 'Hardy's Unforgotten 'Indiscretion': The Centrality of an Uncontrolled Work',
Review of English Studies, XLIII, (171), pp. 347-366.
Dalziel, P. (ed.) (1992c) Thomas Hardy: The Excluded and Collaborative Stories. Oxford:
Clarendon Press.
De Grazia, M. and Wells, S. W. (2001) The Cambridge Companion to Shakespeare. Cambridge,
U.K.: Cambridge University Press.
Debole, F. and Sebastiani, F. (2003) 'Supervised term weighting for automated text
categorization', Proceedings of the 2003 ACM symposium on Applied computing, pp. 784-
788.
Debole, F. and Sebastiani, F. (2004) 'Supervised Term Weighting for Automated Text
Categorization', ERCIM News 56, pp. 55-56.
Deerwester, S., Susan, T. D., George, W. F., Thomas, K. L. and Richard, H. (1990) 'Indexing by
latent semantic analysis', Journal of the American Society for Information Science, 41, (6),
pp. 391-407.
DeForest, M. and Johnson, E. (2001) 'The Density of Latinate Words in the Speeches of Jane
Austen's Characters', Lit Linguist Computing, 16, (4), pp. 389-401.
Delcourt, C. (1992) 'About the statistical analysis of co-occurrence', Computers and the
Humanities, 26, (1), pp. 21-29.
DeVine, C. (2005) Class in turn-of-the-century Novels of Gissing, James, Hardy, and Wells.
Aldershot, Hants, England: Ashgate.
Dey, I. (1993) Qualitative data analysis : a user-friendly guide for social scientists. London: New
York Routledge.
Dhillon, I., Kogan, J. and Nicholas, C. (2004) 'Feature Selection and Document Clustering', in
Berry, M. W.(ed), Survey of Text Mining: Clustering, Classification, and Retrieval.New
York: Springer.
Dickens, C. (1854) Hard Times. London: Bradbury & Evans.
Dik, L. L., Huei, C. and Kent, S. (1997) 'Document Ranking and the Vector-Space Model', IEEE
Softw., 14, (2), pp. 67-75.
Dimitriadou, E., Dolničar, S. and Weingessel, A. (2002) 'An examination of Indexes for
Determining the Number of Clusters in Binary Data Sets', Psychometrika, 67, (1), pp. 137-
159.
Dingle, H. and Bush, D. (1952) 'Science and Literary Criticism', The British Journal for the
Philosophy of Science, 3, (10), pp. 194-196.
Draper, R. P. and Ray, M. (1989) An annotated critical bibliography of Thomas Hardy. Hemel
Hempstead: Harvester Wheatsheaf.
DuBien, J. L. and Warde, W. D. (1979) 'A Mathematical Comparison of the Members of an
Infinite Family of Agglomerative Clustering Algorithms', The Canadian Journal of
Statistics / La Revue Canadienne de Statistique, 7, (1), pp. 29-38.
Dumais, S. T., Furnas, G. W., Landauer, T. K., Deerwester, S. and Harshman, R. (1988) 'Using
latent semantic analysis to improve access to textual information', Proceedings of the
SIGCHI conference on Human factors in computing systems. Washington, D.C., United
States, ACM, pp.
Duncan, I. (2002) 'The Provincial or Regional Novel', in Brantlinger, P. and Thesing, W.(eds) A
Companion to the Victorian Novel.Oxford: Blackwell.
Dunteman, G. H. (1989) Principal Components Analysis Sage Publications.
Eastman, D. R. (1971) The Concept of Character in the Major Novels of D.H. Lawrence. thesis.
University of Florida.
Eaton, M. L. (2007) Multivariate Statistics: A Vector Space Approach. Beachwood, Ohio:
Institute of Mathematical Statistics.
Ebbatson, R. (1993) ''The Withered Arm’ and History', Critical Survey, 5, (2), pp. 131-35.
Ebbatson, R. (2009) '“A Thickness of Wall”: Hardy and Class ', in Wilson, K.(ed), A Companion
to Thomas Hardy.Malden, MA: Wiley-Blackwell Pub., pp. xiii, 488 p.
El-Hamdouchi, A. and Willett, P. (1986) 'Hierarchic document classification using Ward's
clustering method', Proceedings of the 9th annual international ACM SIGIR conference
on Research and development in information retrieval. Palazzo dei Congressi, Pisa, Italy,
ACM, pp.
Euclid. (1826) Euclid's Elements of Geometry, With notes, critical and exploratory by Phillips,
George. London: Printed for Baldwin, Cradock, and Joy.
Everitt, B. (1993) Cluster analysis. London: E. Arnold.
Everitt, B., Landau, S. and Leese, M. (2001) Cluster analysis. London: Arnold ; New York :
Oxford University Press.
Fang, L., Mehlitz, M., Li, F. and Sheng, H. (2006) 'Web Pages Clustering and Concepts Mining:
An approach towards Intelligent Information Retrieval', Cybernetics and Intelligent
Systems, 2006 IEEE Conference, pp. 1-6.
Fielding, A. (2007) Cluster and Classification Techniques for the Biosciences. Cambridge, UK ;
New York: Cambridge University Press.
Florek, K., Lukaszewicz, P. J., Steinhaus, H. and Zubrzycki, S. (1951) 'Sur la liaison et la division
des points d’un ensemble fini', Colloq. Math, 2, pp. 282-285.
Foltz, P., Kintsch, W. and Landauer, T. K. (1998) 'Measurement of Text Coherence with Latent
Semantic Analysis', Discourse Processes, 25, pp. 285-307.
Fong, J. H. (2008) 'Determining The Genre of Antony and Cleopatra: Categorising Shakespeare's
Roman Play as Tragedy or History', Shakespearean Theatre, [Online]. Available at:
http://shakespeare-tragedies.suite101.com/article.cfm/determining_genre_for_antony_and_cleopatra
(Accessed: 16 March 2010).
Foucault, M. (1985) The history of sexuality. [Harmondsworth]: Viking, 1986.
Fraleigh, J. B., Beauregard, R. A. and Katz, V. J. (1995) Linear Algebra. Reading, Mass. ;
Wokingham: Addison-Wesley.
Frigui, H. and Nasraoui, O. (2004) 'Simultaneous Clustering and Dynamic Keyword Weighting
for Text Documents', in Berry, M. W.(ed), Survey of Text Mining: Clustering,
Classification, and Retrieval.New York: Springer.
Fromkin, V., Rodman, R. and Hyams, N. M. (2007) An Introduction to Language. Boston, Mass.:
Thomson Wadsworth ; [London : Thomson Learning, distributor].
Gabrilovich, E. and Markovitch, S. (2006) 'Overcoming the Brittleness Bottleneck using
Wikipedia: Enhancing Text Categorization with Encyclopedic Knowledge', Proceedings
of the Twenty-First National Conference on Artificial Intelligence, pp. 1301-1306.
Gabrilovich, E. and Markovitch, S. (2007) 'Computing Semantic Relatedness using Wikipedia-
based Explicit Semantic Analysis', Proceedings of the 20th International Joint Conference
on Artificial Intelligence, pp. 1606-1611.
Gan, G., Ma, C. and Wu, J. (2007) Data Clustering: Theory, Algorithms, and Applications.
Philadelphia, Pa.: SIAM, Society for Industrial and Applied Mathematics ; American
Statistical Association.
Garcia, A. M. and Martin, J. C. (2007) 'Function Words in Authorship Attribution Studies', Lit
Gatrell, S. (2003) Thomas Hardy's vision of Wessex. Houndmills, Basingstoke, Hampshire:
Palgrave Macmillan.
Gatrell, S. (2006) 'The Erotics of Dress in A Pair of Blue Eyes', in Thomas Hardy Reappraised:
Essays in Honour of Michael Millgate.Toronto: University of Toronto Press, pp. 118-135.
Gauch, H. G. (1982) Multivariate analysis in community ecology. Cambridge Cambridgeshire ;
New York: Cambridge University Press.
Geer, J. P. v. d. (1971) Introduction to multivariate analysis for the social sciences. San Francisco:
W. H. Freeman.
Gibson, J. (1996) Thomas Hardy. Basingstoke: Macmillan.
Gilmartin, S. and Mengham, R. (2007) Thomas Hardy's shorter fiction : a critical study.
Edinburgh: Edinburgh University Press.
Gittings, R. (ed.) (1978) An Introduction to The Hand of Ethelberta. New York: St. Martin's Press.
Golub, K. (2005) Automated Subject Classification of Textual Web Pages, for Browsing. thesis.
Lund University.
Golub, K. (2006) 'Automated subject classification of textual Web documents', Journal of
Documentation, 62, (3), pp. 350-371.
Gomm, R. (2009) Key concepts in social research methods. Basingstoke: Palgrave Macmillan.
Goode, J. (1988) Thomas Hardy: The Offensive Truth. Oxford [Oxfordshire]: B. Blackwell.
Goodheart, E. (1957) 'Thomas Hardy and The Lyrical Novel', Nineteenth-Century Fiction, 12, (3),
pp. 215-225.
Gordon, A. D. (1996) 'Hierarchical Classification', in Arabie, P., Hubert, L. J. and Soete, G.
d.(eds) Clustering and classification.Singapore ; London: World Scientific, pp. ix,490p.
Gosse, E. (1928) 'Thomas Hardy's Lost Novel', London Times, January 22,
Gosse, E. (1970) 'Thomas Hardy', in Cox, R. G.(ed), Thomas Hardy; The Critical Heritage.New
York: Barnes & Noble. First published in The Speaker (13 September 1890). , pp. 167-
172.
Gottschall, J. (2008) 'Measure for Measure', The Boston Globe, 11/05/2008.
Grabmeier, J. and Rudolph, A. (2002) 'Techniques of Cluster Algorithms in Data Mining', Data
Min. Knowl. Discov., 6, (4), pp. 303-360.
Gray, A. (1997) Modern Differential Geometry of Curves and Surfaces with Mathematica. Boca
Raton, FL: CRC Press.
Hair, J. F. (2006) Multivariate data analysis. Upper Saddle River, N.J. ; London: Prentice Hall
PTR.
Halkidi, M., Batistakis, Y. and Vazirgiannis, M. (2001) 'On Clustering Validation Techniques',
Journal of Intelligent Information Systems, 17, (2-3), pp. 107–145.
Hammill, F. (2007) Canadian Literature. Edinburgh: Edinburgh University Press.
Han, J. and Kamber, M. (2001) Data mining : concepts and techniques. San Francisco, Calif. ;
London: Morgan Kaufmann.
Handl, J., Knowles, J. and Kell, D. B. (2005) 'Computational cluster validation in post-genomic
data analysis', Bioinformatics, 21, (15), pp. 3201-3212.
Härdle, W. and Simar, L. (2003) Applied multivariate statistical analysis. Berlin ; New York:
Springer.
Hardy, F. E. (1933) The Life of Thomas Hardy. London: Macmillan.
Hardy, T. (1879) 'The Distracted Young Preacher', New Quarterly Magazine, i, pp. 324-376.
Hardy, T. (1891) A group of Noble Dames. New York: Harper and brothers.
Hardy, T. (1894) Life's Little Ironies. Leipzig: B. Tauchnitz.
Hardy, T. (1896) Wessex Tales. New York: Harper & brothers.
Hardy, T. (1897) The Well-Beloved : A Sketch of a Temperament. London: J. R. Osgood McIlvaine
& co.
Hardy, T. (1912a) A Group of Noble Dames. London: Macmillan and Co.
Hardy, T. (1912b) Jude the Obscure. London: Macmillan and Co.
Hardy, T. (1912c) Life's Little Ironies. London: Macmillan and Co.
Hardy, T. (1912d) Wessex Tales. London: Macmillan and Co.
Hardy, T. (1912e) The Woodlanders. London: Macmillan and Co.
Hardy, T. (1912f) The works of Thomas Hardy in prose and verse. With prefaces and notes.
(Wessex edition.). London: Macmillan & Co.
Hardy, T. (1913) A Changed Man, The Waiting Supper, and Other tales, Concluding with The
Romantic Adventures of a Milkmaid. London: Macmillan & co.
Hardy, T. and Dugdale-Hardy, F. (1992) 'The Unconquerable', in Dalziel, P.(ed), Thomas Hardy:
The Excluded and Collaborative Stories.Oxford: Clarendon Press.
Harman, D. (1991) 'How effective is suffixing?', Journal of the American Society for Information
Science, 42, (1), pp. 7-15.
Harrington, A. (2004) Art and social theory: sociological arguments in aesthetics. Cambridge,
UK: Polity Press.
Harter, S. P. (1975) 'A probabilistic approach to automatic keyword indexing, Part 1: On the
distribution of speciality words in a technical literature', Journal of the American Society
for Information Science, 26, pp. 280-289.
Harvey, G. (2003) The complete critical guide to Thomas Hardy. London: Routledge.
Hernadi, P. (1972) Beyond Genre: New Directions in Literary Classification. Ithaca New York:
Cornell University Press.
Hersh, W. R. (1996) Information Retrieval: A Health Care Perspective. New York: Springer.
Higonnet, M. R. (1993) The Sense of sex: feminist perspectives on Hardy. Urbana: University of
Illinois Press.
Hockey, S. M. (2000) Electronic Texts in the Humanities: Principles and Practice. Oxford: Oxford
University Press.
Holloway, I. (1997) Basic concepts for qualitative research. Oxford: Blackwell Science.
Holmes, D. I. (1994) 'Authorship Attribution', Computers and the Humanities, 28, pp. 87-106.
Holmes, D. I. (1998) 'The Evolution of Stylometry in Humanities Scholarship', Lit Linguist
Computing, 13, (3), pp. 111-117.
Holmes, D. I. and Forsyth, R. S. (1995) 'The Federalist Revisited: New Directions in Authorship
Attribution', Lit Linguist Computing, 10, (2), pp. 111-127.
Hoover, D. L. (2001) 'Statistical Stylistics and Authorship Attribution: an Empirical
Investigation', Lit Linguist Computing, 16, (4), pp. 421-444.
Hoover, D. L. (2002) 'Frequent Word Sequences and Statistical Stylistics', Lit Linguist
Computing, 17, (2), pp. 157-180.
Hope, K. (1968) Methods of multivariate analysis, with handbook of multivariate methods
programmed in Atlas Autocode. London: University of London P.
Horton, R., Olsen, M., Roe, G. and Voyer, R. (2007) 'Mining Eighteenth Century Ontologies:
Machine Learning and Knowledge Classification in the Encyclopédie', Digital
Humanities. Urbana-Champaign, Illinois, 2-8 June 2007. pp.
Horton, T., Taylor, C., Yu, B. and Xiang, X. (2006) '‘Quite Right, Dear and Interesting’: Seeking
the Sentimental in Nineteenth Century American Fiction', Digital Humanities. Paris-
Sorbonne, France, 5-9 July 2006. pp. 81-82.
Houen, A. (2000) 'Hardy, Thomas, 1840-1928', Literature Online biography, [Online]. Available
at: http://gateway.proquest.com/openurl?ctx_ver=Z39.88-
2003&xri:pqil:res_ver=0.2&res_id=xri:lion&rft_id=xri:lion:ft:ref:BIO002715:0
(Accessed: 25/08/2008).
Howlett, R. (2010) Vector Space Theory. Sydney: The University of Sydney.
Hull, D. A. (1996) 'Stemming algorithms: A case study for detailed evaluation', Journal of the
American Society for Information Science, 47, (1), pp. 70-84.
Hunter, A. (2007) The Cambridge Introduction to the Short Story in English. Leiden: Cambridge
University Press.
Ide, N. (1989) 'A statistical measure of theme and structure', Computers and the Humanities, 23,
(4), pp. 277-283.
Ingham, P. (1989) Thomas Hardy: A Feminist Reading Hemel Hempstead: Harvester Wheatsheaf.
Irwin, M. (1980) 'Readings of Melodrama', in Gregor, I.(ed), Reading the Victorian Novel: Detail
into Form.New York: Harper & Row, Publishers, Inc.
Jackson, J. E. (1991) A user's guide to principal components. New York: Wiley.
Jacobs, H. A. (1861) Incidents in the Life of a Slave Girl. Boston: Jacobs.
Jacobus, M. (1976) 'Tess’s Purity', Essays in Criticism, 26, pp. 318-38.
Jacobus, M. (ed.) (1979) Women Writing and Writing about Women. London: Croom Helm.
Jain, A. K., Murty, M. N. and Flynn, P. J. (1999) 'Data Clustering: A Review', ACM Comput.
Surv., 31, (3), pp. 264-323.
Jain, R. and Koronios, A. (2008) 'Innovation in the cluster validating techniques', Fuzzy
Optimization and Decision Making, 7, (3), pp. 233-241.
James, L. (2006) The Victorian novel. Malden, MA: Blackwell Pub.
Janos, A. and Balazs, F. (2007) Cluster Analysis for Data Mining and System Identification.
Basel: Birkhäuser.
Jekel, P. L. (1986) Thomas Hardy's Heroines: A Chorus of Priorities. New York: Whitston.
Jenkins, M.-C. and Smith, D. (2005) 'Conservative stemming for search and indexing', SIGIR’05.
August 15–19, 2005. ACM, pp.
Jinxi, X. and Croft, W. B. (1998) 'Corpus-based stemming using cooccurrence of word variants',
ACM Trans. Inf. Syst., 16, (1), pp. 61-81.
Joachims, T. (2002) Learning to Classify Text Using Support Vector Machines: Methods, Theory
and Algorithms. Kluwer Academic Publishers.
Joachims, T. and Sebastiani, F. (2002) 'Guest Editors' Introduction to the Special Issue on
Automated Text Categorization', Journal of Intelligent Information Systems, 18, (2), pp.
103-105.
Jockers, M. L. (2009) Machine-Classifying Novels and Plays by Genre. Available at:
https://www.stanford.edu/~mjockers/cgi-bin/drupal/node/27 (Accessed: 16 March 2010).
Jockers, M. L., Witten, D. M. and Criddle, C. S. (2008) 'Reassessing authorship of the Book of
Mormon using delta and nearest shrunken centroid classification', Lit Linguist Computing,
23, (4), pp. 465-491.
Johnson, L. P. (1923) The Art of Thomas Hardy. London: J. Lane.
Johnson, S. L. (2005) Historical Fiction: A Guide to the Genre. Westport, Conn.: Libraries
Unlimited.
Jolliffe, I. T. (2002) Principal component analysis. Berlin ; London: Springer.
Juola, P. (2008) 'Authorship Attribution', Foundations and Trends in Information Retrieval, 1,
(3), pp. 233-334.
Juola, P., Sofko, J. and Brennan, P. (2006) 'A Prototype for Authorship Attribution Studies', Lit
Justo, R. and Torres, I. (2005) Progress in Pattern Recognition, Image Analysis and Applications.
10th Iberoamerican Congress on Pattern Recognition, CIARP 2005,Havana, Cuba,
November 15-18, 2005. Proceedings
Kaplan, R. M. (2005) 'A Method for Tokenizing Text', in Arppe, A., Carlson, L., Lindén, K.,
Piitulainen, J., Suominen, M., Vainio, M., Westerlund, H. and Yli-Jyrä, A.(eds) Inquiries
into Words, Constraints and Contexts.CSLI Studies in Computational Linguistics
Publications, pp. 55-64.
Kaufman, L. and Rousseeuw, P. J. (1990) Finding Groups in Data: An Introduction to Cluster
Analysis. Hoboken, New Jersey: John Wiley & Sons, Inc.
Kaur, M. (2005) The Feminist Sensibility in the Novels of Thomas Hardy. New Delhi: Sarup &
Sons.
Kearney, P. J. (1982) A History of Erotic Literature. London: Macmillan.
Keith, W. J. (1969) 'Thomas Hardy and the Literary Pilgrims', Nineteenth-Century Fiction, 24,
(1), pp. 80-92.
Keith, W. J. (1979) 'A Regional Approach to Hardy's Fiction', in Kramer, D.(ed), Critical
Approaches to the Fiction of Thomas Hardy.London: Macmillan.
Kessler, B., Nunberg, G. and Schütze, H. (1997) 'Automatic Detection of Text Genre', Proceedings
of the eighth conference on European chapter of the Association for Computational
Linguistics. Madrid, Spain, Association for Computational Linguistics, pp. 32-38.
Kettle, A. (1953) 'Tess of the d’Urbervilles', in An Introduction to the English Novel. Vol. 2
London: Hutchinson University Library, pp. 45-56.
Kettle, A. (1967) An introduction to the English novel. London: Hutchinson.
Kettle, A. (1973) The Novel in the Mid-nineteenth Century. Milton Keynes: Open University
Press.
King, J. (1978) 'Thomas Hardy: Tragedy Ancient and Modern', in Tragedy in the Victorian
Novel.Cambridge: Cambridge University Press, pp. 97–126.
Kobayashi, M. and Aono, M. (2004) 'Vector Space Models for Search and Cluster Mining', in
Berry, M. W.(ed), Survey of Text Mining: Clustering, Classification, and Retrieval.New
York: Springer-Verlag.
Koppel, M., Argamon, S. and Shimoni, A. R. (2002) 'Automatically Categorizing Written Texts
by Author Gender', Lit Linguist Computing, 17, (4), pp. 401-412.
Kramer, D. (1975) Thomas Hardy: the Forms of Tragedy. Detroit: Wayne State University Press.
Kramer, D. (1999) The Cambridge companion to Thomas Hardy. Cambridge, UK: Cambridge
University Press.
Kramer, D. and Dalziel, P. (eds.) (2004) The Mayor of Casterbridge. New York: Oxford
University Press.
Krishnaiah, P. R. and Kanal, L. N. (1982) Classification, pattern recognition and reduction of
dimensionality. Amsterdam ; Oxford: North-Holland.
Krovetz, R. (1993) 'Viewing morphology as an inference process', Proceedings of the 16th annual
international ACM SIGIR conference on Research and development in information
retrieval, pp. 191-202.
Kuhn, A. (2006) Semantic Clustering: Making Use of Linguistic Information to Reveal Concepts
in Source Code. thesis. University of Bern.
Kuhn, A., Ducasse, S. and Gîrba, T. (2007) 'Semantic clustering: Identifying topics in source
code', Information and Software Technology, 49, (3), pp. 230-243.
Laan, N. M. (1995) 'Stylometry and Method. The Case of Euripides', Lit Linguist Computing, 10,
(4), pp. 271-278.
Labbe, C. and Labbe, D. (2006) 'A Tool for Literary Studies: Intertextual Distance and Tree
Classification', Lit Linguist Computing, 21, (3), pp. 311-326.
Laffal, J. (1995) 'A concept analysis of Jonathan Swift's A tale of a Tub and Gulliver's Travels',
Computers and the Humanities, 29, (5), pp. 339-361.
Landauer, T. K., Foltz, P. and Laham, D. (1998) 'An Introduction to Latent Semantic Analysis',
Discourse Processes, 25, (2-3), pp. 259-84.
Lang, S. (1958) Introduction to Algebraic Geometry. New York: Interscience Publishers.
Lang, S. and Murrow, G. (1988) Geometry. New York: Springer.
Lawrence, D. H. (1936) 'Study of Thomas Hardy', Phoenix: The Posthumous Papers of D.H.
Lawrence, pp. 398-516.
Lea, H. (1969) Thomas Hardy's Wessex. St. Peter Port: Toucan P.
Leah, S. L., Lisa, B. and Margaret, E. C. (2002) 'Improving stemming for Arabic information
retrieval: light stemming and co-occurrence analysis', Proceedings of the 25th annual
international ACM SIGIR conference on Research and development in information
retrieval, pp. 275-282.
Liebler, N. C. (1995) Shakespeare's Festive Tragedy: The Ritual Foundations of Genre. London:
Routledge.
Lodge, D. (1974) 'Thomas Hardy and Cinematographic Form', NOVEL: A Forum on Fiction, 7,
(3), pp. 246-254.
Looks, M., Levine, A., Covington, G. A., Loui, R. P., Lockwood, J. W.
and Cho, Y. H. (2007) Aerospace Conference, 2007 IEEE.
Losada, D. and Azzopardi, L. (2008) 'An analysis on document length retrieval trends in language
modeling smoothing', Information Retrieval, 11, (2), pp. 109-138.
Love, H. (2002) Attributing authorship : an introduction. Cambridge: Cambridge University
Press.
Lovins, J. B. (1968) 'Development of a Stemming Algorithm', Mechanical Translation and
Computational Linguistics, 11, (1), pp. 23-31.
Luhn, H. (1957) 'A statistical approach to mechanised encoding and searching of library
information', IBM Journal of Research and Development, 1, pp. 309-317.
Macdonell, A. (1894) Thomas Hardy. London: Hodder & Stoughton.
Machan, T. W. (1991) 'Late Middle English Texts and the Higher and Lower Criticisms', in
Machan, T. W. (ed.), Medieval Literature: Texts and Interpretation. Medieval and
Renaissance Texts and Studies. New York: Binghamton, pp. 3–16.
Maguitman, A. G., Menczer, F., Roinestad, H. and Vespignani, A. (2005) International World
Wide Web Conference Committee (IW3C2). Chiba, Japan, May 10-14, 2005. ACM.
Mallett, P. (2003) Thomas Hardy Texts and Contexts. New York, N.Y.: Palgrave Macmillan.
Manning, C. D., Raghavan, P. and Schütze, H. (2008) An Introduction to Information Retrieval.
Cambridge: Cambridge University Press.
Maranis, H. and Babenko, D. (2009) Algorithms of the Intelligent Web. Greenwich: Manning
Publications Co.
Markov, A., Last, M. and Kandel, A. (2008) 'The hybrid representation model for web document
classification', International Journal of Intelligent Systems, 23, (6), pp. 654-679.
Martin, H., st, Claes, W. and Thomas, T. (2005) 'Experimental context classification: incentives
and experience of subjects', Proceedings of the 27th international conference on Software
engineering. St. Louis, MO, USA, ACM, pp.
Mather, P. M. (1976) Computational methods of multivariate analysis in physical geography.
London; New York: Wiley.
Matthews, R. A. J. and Merriam, T. V. N. (1993) 'Neural Computation in Stylometry I: An
Application to the Works of Shakespeare and Fletcher', Lit Linguist Computing, 8, (4), pp.
203-209.
Meisel, P. (1972) Thomas Hardy: The Return of the Repressed; A Study of the Major Fiction.
New Haven: Yale University Press.
Miles, M. and Hurberman, M. (1994) Qualitative Data Analysis: an expanded sourcebook.
London: Beverley Hills.
Miller, J. H. (1970) Thomas Hardy, distance and desire. Cambridge, Mass.: Belknap Press of
Harvard University Press.
Millgate, M. (1985) The Life and Work of Thomas Hardy. Athens: University of Georgia Press.
Millgate, M. (1994) Thomas Hardy: His Career As A Novelist. London: Macmillan.
Milligan, G. and Cooper, M. (1985) 'An examination of procedures for determining the number
of clusters in a data set', Psychometrika, 50, (2), pp. 159-179.
Milligan, G. W. (1996) 'Clustering Validation: Results and Implications for Applied Analyses', in
Arabie, P., Hubert, L. J. and De Soete, G. (eds.), Classification and Clustering. River Edge,
NJ: World Scientific Publishing Co Pte Ltd.
Milton, J. S. and Arnold, J. C. (2002) Introduction to probability and statistics: principles and
applications for engineering and the computing sciences. New York; London: McGraw-Hill.
Mingjin Yan, K. Y. (2007) 'Determining the Number of Clusters Using the Weighted Gap
Statistic', in:
Mirkin, B. (2005) Clustering for Data Mining: A Data Recovery Approach. Taylor & Francis
Group, LLC.
Moisl, H. (2008 a) 'Exploratory Multivariate Analysis', in Lüdeling, A. and Kytö, M. (eds.) Corpus
Linguistics. An International Handbook. Vol. II. Berlin: Mouton de Gruyter, pp. 874-899.
Moisl, H. (2008 b) 'Using electronic corpora to study language variation: the problem of data
sparsity', Studies in Language Variation.
Moisl, H. (2009) 'Using Electronic Corpora in Historical Dialectology Research: The Problem of
Document Length Variation', in Dossena, M. and Lass, R. (eds.) Studies in English and
European Historical Dialectology. Vol. 98, pp. 67-90.
Moisl, H. and Jones, V. (2005) 'Cluster Analysis of the Newcastle Electronic Corpus of Tyneside
English: A Comparison of Methods', Lit Linguist Computing, 20, pp. 125-146.
Moisl, H. and Maguire, W. (2008) 'Identifying the Main Determinants of Phonetic Variation in
the Newcastle Electronic Corpus of Tyneside English', Journal of Quantitative Linguistics,
15, pp. 46-69.
Morgan, R. (1988) Women and Sexuality in the Novels of Thomas Hardy. Routledge.
Morgan, R. (2002) Cancelled Words: Rediscovering Thomas Hardy. London: Routledge.
Morgan, R. (2006) Student Companion to Thomas Hardy. Greenwood Press.
Mort, J. (2002) Christian Fiction: A Guide to the Genre. Greenwood Village, Colo.: Libraries
Unlimited.
Morton, A. Q. (1965) 'The Authorship of Greek Prose', Journal of the Royal Statistical Society,
Series A (128), pp. 169-233.
Morton, A. Q. (1986) 'Once. A Test of Authorship Based on Words which are not Repeated in the
Sample', Lit Linguist Computing, 1, (1), pp. 1-8.
Mosteller, F. and Wallace, D. L. (1964) Inference and Disputed Authorship: The Federalist.
Reading, Mass.: Addison-Wesley Pub. Co.
Mosteller, F. and Wallace, D. L. (1984) Applied Bayesian and classical inference : the case of the
Federalist papers. New York: Springer.
Nakamura, J. and Sinclair, J. (1995) 'The World of Woman in the Bank of English: Internal
Criteria for the Classification of Corpora', Lit Linguist Computing, 10, (2), pp. 99-110.
Nemesvari, R. (2009) '"Genres are not to be mixed. . . . I will not mix them": Discourse, Ideology,
and Generic Hybridity in Hardy’s Fiction', in Wilson, K. (ed.), A Companion to Thomas
Hardy. Malden, MA: Wiley-Blackwell Pub., pp. xiii, 488 p.
Nixon, J. V. (2004) Victorian Religious Discourse: New Directions in Criticism. New York:
Palgrave Macmillan.
Novovičová, J., Malík, A. and Pudil, P. (2004) 'Feature Selection Using Improved Mutual
Information for Text Classification', in Structural, Syntactic, and Statistical Pattern
Recognition. pp. 1010-1017.
Omar, A. A. (2010) 'Addressing Subjectivity in Thematic Classification of Literary Texts: Using
Cluster Analysis to Derive Taxonomies of Thematic Concepts in the Thomas Hardy’s
Prose Fiction', Proceedings of the Chicago Colloquium on Digital Humanities and
Computer Science, 1, (2).
Oulton, C. (2002) Literature and religion in mid-Victorian England: from Dickens to Eliot.
Houndmills, Hampshire: Palgrave Macmillan.
Ozgur, Y. (2006) Empirical selection of nlp-driven document representations for text
categorization. thesis. Syracuse University.
Page, N. (2000) Oxford Reader's Companion to Hardy. Oxford: Oxford University Press.
Paice, C. D. (1990) 'Another stemmer', SIGIR Forum, 24, (3), pp. 56-61.
Paice, C. D. (1996) 'Method for evaluation of stemming algorithms based on error counting', J.
Am. Soc. Inf. Sci., 47, (8), pp. 632-649.
Paterson, J. (1991) 'An Attempt at Grand Tragedy', in Draper, R. P. (ed.), Hardy: The Tragic
Novels. London: Macmillan Press Ltd.
Paton, J. M. and Can, F. (2004) 'A Stylometric Analysis of Yaşar Kemal’s İnce Memed
Tetralogy', Computers and the Humanities, 38, pp. 457–467.
Patton, M. Q. (2002) Qualitative Research & Evaluation Methods. London: Sage.
Payne, G. and Payne, J. (2004) Key concepts in social research. London: SAGE.
Pedersen, T. (2008) 'Computational Approaches to Measuring the Similarity of Short Contexts:
A Review of Applications and Methods', in.
Plain, G., Sellers, S. and Ebooks Corporation. (eds.) (2007) A History of Feminist Literary
Criticism. Leiden: Cambridge University Press.
Plaisant, C., Rose, J. and Yu, B. (2006) 'Exploring Erotics in Emily Dickinson’s Correspondence
with Text Mining and Visual Interfaces', Proceedings of the 6th ACM/IEEE-CS Joint
Conference on Digital Libraries (JCDL ’06), Chapel Hill, North Carolina. 11-15 June
2006. pp. 141-50.
Plietzsch, B. (2003) Hardy's Classification of his Works. Available at:
http://www.st-and.ac.uk/~bp10/wessex/fictional_concept/classification.shtml (Accessed: 20/10/2008).
Plietzsch, B. (2004) The Novels of Thomas Hardy as a Product of Nineteenth-century Social,
Economic, and Cultural Change. Thesis (doctoral) thesis. Tenea
Martin-Luther-Universität, Halle-Wittenberg.
Popper, K. R. (1959) The Logic of Scientific Discovery. New York: Basic Books.
Popper, K. R. (1963) Conjectures and Refutations; The Growth of Scientific Knowledge. London:
Routledge and K. Paul.
Porter, M. F. (1980) 'An Algorithm for Suffix Stripping', Program, 14, (3), pp. 130-137.
Porter, M. F. (1997) 'An algorithm for suffix stripping', in Readings in information
retrieval. Morgan Kaufmann Publishers Inc., pp. 313-316.
Potter, R. (1988) 'Literary criticism and literary computing: The difficulties of a synthesis',
Praz, M. (1933) The Romantic Agony. London: Oxford University Press.
Prenowitz, W. and Jordan, M. (1965) Basic Concepts of Geometry. New York: Blaisdell Pub. Co.
Punj, G. and Stewart, D. W. (1983) 'Cluster Analysis in Marketing Research: Review and
Suggestions for Application', Journal of Marketing Research, 20, (2), pp. 134-148.
Purandare, A. and Pedersen, T. (2004) Proceedings of the Nineteenth National Conference on
Artificial Intelligence (AAAI-04). San Jose, USA, July 25-29, 2004.
Purdy, R. L. (1979) Thomas Hardy: A Bibliographical Study. [S.l.]: Oxford University Press.
Purdy, R. L. and Millgate, M. (eds.) (1978) The Collected Letters of Thomas Hardy (hereafter
Collected Letters). Oxford: Clarendon Press.
Pyle, D. (1999) Data Preparation for Data Mining. San Francisco, California: Morgan Kaufmann;
London: Taylor & Francis.
Quiller-Couch, A. T. (1923) Studies in Literature. Cambridge Eng.: University Press.
Radford, A. D. (2009) Victorian Sensation Fiction. Basingstoke [England]: Palgrave Macmillan.
Ramos, J. (2003) Using TF-IDF to Determine Word Relevance in Document Queries. Available
at: http://www.cs.rutgers.edu/~mlittman/courses/ml03/iCML03/papers/ramos.pdf
(Accessed:
Ramsay, S. (2003) 'Special Section: Reconceiving Text Analysis: Toward an Algorithmic
Criticism', Lit Linguist Computing, 18, (2), pp. 167-174.
Ramsay, S. (2005) 'In Praise of Pattern', TEXT Technology: the Journal of Computer Text
Processing, 14, (2), pp. 177-190.
Ramsay, S. (2007) 'Algorithmic Criticism', in Siemens, R. G. and Schreibman, S. (eds.) A
Companion to Digital Literary Studies. Malden, MA: Blackwell Publishers, pp. xx, 620 p.
Ramsay, S. and Steger, S. (2006) 'Distinguished Speakers: Keyword Extraction and Critical
Analysis with Virginia Woolf’s The Waves', Digital Humanities. Sorbonne, Paris, 5-9 July
2006. pp. 255-257.
Ramsdell, K. (1999) Romance Fiction: A Guide to the Genre. Englewood, Colo.: Libraries
Unlimited.
Ray, M. (1996) 'Thomas Hardy's "The Son's Veto": A Textual History', Review
of English Studies, XLVII, (188), pp. 542-547.
Ray, M. (1997) Thomas Hardy: a textual study of the short stories. Aldershot; Brookfield, Vt.,
USA: Ashgate.
Reger, R. K. and Huff, A. S. (1993) 'Strategic groups: a cognitive perspective', Strategic
Management Journal, 14, (2), pp. 103-124.
Rencher, A. C. (2002) Methods of Multivariate Analysis. John Wiley & Sons, INC.
Rijsbergen, C. J. V. (1979) Information Retrieval. London: Butterworth.
Rijsbergen, C. J. v. (2004) The Geometry of Information Retrieval. Cambridge University Press.
Roberts, J. L. (1962) 'Legend and Symbol in Hardy's "The Three Strangers"', Nineteenth-Century
Fiction, 17, (2), pp. 191-194.
Robertson, S. (2004) 'Understanding inverse document frequency: on theoretical arguments for
IDF', Journal of Documentation, 60, (5), pp. 503-520.
Robertson, S. E. and Walker, S. (1994) 'Some simple effective approximations to the 2-Poisson
model for probabilistic weighted retrieval', Proceedings of the 17th annual international
ACM SIGIR conference on Research and development in information retrieval. Dublin,
Ireland, Springer-Verlag New York, Inc., pp. 232-241.
Rockwell, G. (2003) 'What is Text Analysis, Really?', Lit Linguist Computing, 18, (2), pp. 209-219.
Rogers, M. F. (1991) Novels, novelists, and readers: toward a phenomenological sociology of
literature. Albany: State University of New York Press.
Rommel, T. (2004) 'Literary Studies', in Schreibman, S., Siemens, R. and Unsworth, J. (eds.)
A Companion to Digital Humanities. Oxford: Blackwell, pp. 88-97.
Rousseeuw, P. (1987) 'Silhouettes: a graphical aid to the interpretation and validation of cluster
analysis', J. Comput. Appl. Math., 20, (1), pp. 53-65.
Rowson, S. (1794) Charlotte. A Tale of Truth. Philadelphia: Printed by D. Humphreys, for M.
Carey.
Rowson, S. (1828) Charlotte's Daughter; or, The Three Orphans. A Sequel to Charlotte Temple.
Boston: Richardson & Lord.
Rudman, J. (1997) 'The State of Authorship Attribution Studies: Some Problems and Solutions',
Salton, G. (1971) The Smart retrieval system: experiments in Automatic document processing.
Englewood Cliffs: Prentice Hall Inc.
Salton, G. (1982) Introduction to Modern Information Retrieval. New York: McGraw-Hill.
Salton, G. and Buckley, C. (1987) Term Weighting Approaches in Automatic Text Retrieval.
Cornell University
Salton, G. and Buckley, C. (1988) 'Term-weighting approaches in automatic text retrieval',
Information Processing & Management, 24, (5), pp. 513-523.
Salton, G., Wong, A. and Yang, C. S. (1975) 'A Vector Space Model for Automatic Indexing',
Communications of the ACM, 18, (11), pp. 613–620.
Salvador, S. and Chan, P. (2004) Tools with Artificial Intelligence, 2004. ICTAI 2004. 16th IEEE
International Conference on.
Saricks, J. G. (2009) The Readers' Advisory Guide to Genre Fiction. Chicago: American Library
Association.
Savoy, J. (1999) 'A stemming procedure and stopword list for general French corpora', J. Am. Soc.
Inf. Sci., 50, (10), pp. 944-952.
Saxelby, F. O. (1911) A Thomas Hardy Dictionary. The characters and scenes of the novels and
poems alphabetically arranged and described. George Routledge & Sons: London.
Schreibman, S., Siemens, R. G. and Unsworth, J. (eds.) (2004) A Companion to Digital
Humanities. Oxford: Blackwell.
Scoltock, J. (1984) The Application of Cluster Analysis to Sales, Production, Costing and Design.
thesis. University of Newcastle upon Tyne.
Sebastiani, F. (2005a) 'Text Categorization', in Zanasi, I. A. (ed.), Text Mining and its
Applications. Southampton, UK: WIT Press, pp. 109-129.
Sebastiani, F. (2005b) 'Text categorization', in Rivero, L. C., Doorn, J. H. and Ferraggine, V.
E. (eds.) The Encyclopedia of Database Technologies and Applications. Hershey: Idea
Group Publishing, pp. 683-687.
Sebastiani, F. (2006) 'Classification of Text, Automatic', in Brown, K. (ed.), Encyclopaedia of
Language & Linguistics. Vol. 2. Oxford: Elsevier, pp. 457-462.
Seifoddini, H. K. (1989) 'Single linkage versus average linkage clustering in machine cells
formation applications', Computers & Industrial Engineering, 16, (3), pp. 419-426.
Seymour-Smith, M. (1994) Hardy. Bloomsbury Pub.
Sherman, G. W. (1976) The Pessimism of Thomas Hardy. Rutherford [N.J.]: Fairleigh Dickinson
University Press.
Sherren, W. (1902) The Wessex of Romance ... With illustrations. (Bibliography of Thomas
Hardy.). pp. xi. 312. Chapman & Hall: London.
Sherren, W. (1908) The Wessex of Romance ... New and revised edition. [With a bibliography of
Thomas Hardy.]. pp. 13. 295. Francis Griffiths: London.
Shumaker, J. R. (1999) 'Abjection and Degeneration in Thomas Hardy's "Barbara of the House of
Grebe"', College Literature, 26, (2), pp. 1-17.
Shuttleworth, S. (ed.) (1999) Two on a Tower : A Romance. London: Penguin.
Sichel, H. S. (1974) 'On a Distribution Representing Sentence-Length in Written Prose', Journal
of the Royal Statistical Society, 137, (1), pp. 25-34.
Siemens, R. (2002) 'A New Computer-assisted Literary Criticism?', Computers and the
Humanities, 36, (3), pp. 259-267.
Siemens, R. G. and Schreibman, S. (eds.) (2007) A Companion to Digital Literary Studies.
Malden, MA: Blackwell Publishers.
Simpson, J. A. and Weiner, E. S. C. (1989) The Oxford English Dictionary. Oxford: Clarendon
Press; Oxford University Press.
Singhal, A., Chris, B. and Mandar, M. (1996a) 'Pivoted document length normalization',
Proceedings of the 19th annual international ACM SIGIR conference on Research and
development in information retrieval, pp. 21-29.
Singhal, A., Salton, G., Mitra, M. and Buckley, C. (1996b) 'Document length normalization',
Information Processing & Management, 32, (5), pp. 619-633.
Smith, M. W. A. (1985) 'An investigation of Morton's method to distinguish Elizabethan
playwrights', Comput. Hum., 19, (1), pp. 3-21.
Smith, M. W. A. (1987) 'Hapax Legomena in Prescribed Positions: An Investigation of Recent
Proposals to Resolve Problems of Authorship', Lit Linguist Computing, 2, (3), pp. 145-152.
Smith, M. W. A. (1992) 'Shakespeare, Stylometry and "Sir Thomas More"', Studies in Philology,
89, (4), pp. 434-444.
Smollett, T. G. T. E. o. H. C. A. (1818) Brambleton Hall, a novel, being a sequel to the celebrated
Expedition of Humphrey Clinker, by Tobias Smollet. London: T. H. Green.
Sneath, P. H. A. (1957) 'The application of computers to taxonomy', Journal of General
Microbiology, 17, pp. 201-226.
Sneath, P. H. A. and Sokal, R. R. (1973) Numerical Taxonomy: The Principles and Practice of
Numerical Classification. San Francisco: W. H. Freeman.
Snell, K. D. M. (1998) 'The Regional Novel: Themes for the Interdisciplinary Research', in Snell,
K. D. M. (ed.), The Regional Novel in Britain and Ireland 1800-1990. Cambridge University
Press, pp. 1-53.
Snyder, S. (2001) 'The Genres of Shakespeare’s Plays', in De Grazia, M. and Wells, S. W. (eds.)
The Cambridge Companion to Shakespeare. Cambridge, U.K.: Cambridge University
Press, pp. xx, 328 p.
Sollors, W. (1993) The return of thematic criticism. Cambridge, Mass.: Harvard University Press.
Spärck Jones, K. (1972) 'A statistical interpretation of term specificity and its application in
retrieval', Journal of Documentation, 28, pp. 11-21.
Spiegel, A. (1973) 'Flaubert to Joyce: Evolution of a Cinematographic Form', NOVEL: A Forum
on Fiction, VI, pp. 229-43.
Spiridon, M. (1987) 'Literary criticism and the magnifying glass of sociology', Neohelicon, 14,
(2), pp. 53-60.
Spivey, T. R. (1954) 'Thomas Hardy's Tragic Hero', Nineteenth-Century Fiction, 9, (3), pp. 179-191.
Srivastava, A. N. and Sahami, M. (eds.) (2009) Text Mining Classification, Clustering, and
Applications. Chapman and Hall.
Stageberg, N. C. (1981) An Introductory English Grammar. New York; London: Holt, Rinehart
and Winston.
Stevenson, R. L. (1886) Kidnapped. [London]: Cassell & Company, Limited.
Stowe, H. B. (1859) The Minister's Wooing. New York: Derby and Jackson.
Stowe, H. B. (1897) Uncle Tom's Cabin. New York: T. Y. Crowell & company.
Sumner, R. (1981) Thomas Hardy, psychological novelist. London: Macmillan.
Svetlana, K. (2006) Hierarchical text categorization and its application to bioinformatics. thesis.
University of Ottawa.
Taeho, J. (2006) The implementation of dynamic document organization using the integration of
text clustering and text categorization. thesis. University of Ottawa.
Tambouratzis, G. and Vassiliou, M. (2007) 'Employing Thematic Variables for Enhancing
Classification Accuracy Within Author Discrimination Experiments', Lit Linguist
Computing, 22, (2), pp. 207-224.
Taylor, D. (1998) 'The Need for a Religious Literary Criticism', in Mahoney, J. L. (ed.), Seeing
Into the Life of Things: Essays on Literature and Religious Experience. New York:
Fordham University Press.
Taylor, H. M. (1893) Euclid's Elements of Geometry, Books I-VI. Cambridge: The University
Press.
Thabet, N. (2005) 'Understanding the thematic structure of the Qur'an: an exploratory multivariate
approach', Proceedings of the ACL Student Research Workshop. Ann Arbor, Michigan,
Association for Computational Linguistics, pp.
Theodoridis, S. and Koutroubas, K. (2003) Pattern Recognition. Academic Press.
Thomas, J. (1999) Thomas Hardy, Femininity and Dissent: Reassessing the Minor Novels. New
York: Macmillan.
Todorov, T. (1975) The fantastic: a structural approach to a literary genre. Ithaca, N.Y.: Cornell
University Press.
Tukey, J. W. (1977) Exploratory Data Analysis. [S.l.]: Addison Wesley.
Tuttle, L. (1986) Encyclopedia of feminism. Harlow: Longman.
Unsworth, J. (2000) 'Scholarly Primitives: What Methods do Humanities Researchers Have in
Common, and How Might Our Tools Reflect This?', Symposium on Humanities
Computing: Formal Methods, Experimental Practice, King’s College, London.
Venables, W. N. and Ripley, B. D. (2002) Modern applied statistics with S. New York: Springer.
Wake, W. C. (1957) 'Sentence-length Distributions of Greek Authors', Journal of the Royal
Statistical Society, Series A, (120), pp. 331-346.
Waldoff, L. (1979) 'Psychological Determinism in Tess of the d’Urbervilles', in Kramer, D. (ed.),
Critical Approaches to the Fiction of Thomas Hardy. London: Macmillan, pp. 135-54.
Wang, W. (2007) An Empirical Study on Hierarchical Text Categorization. Master Dissertation
thesis. The University of Guelph.
Ward, J. H., Jr. (1963) 'Hierarchical Grouping to Optimize an Objective Function', Journal of the
American Statistical Association, 58, (301), pp. 236-244.
Watt, G. (1984) The fallen woman in the nineteenth-century English novel. London; Totowa, N.J.:
Croom Helm; Barnes & Noble Books.
Weber, C. J. (ed.) (1935) An Indiscretion in the Life of an Heiress. Hardy's "lost novel" now first
printed in America and edited with introduction and notes by Carl J. Weber. Baltimore,
MD.: The Johns Hopkins Press.
Weiping, W., Peng, C. and Bowen, L. (2008) 'A Self-Adaptive Explicit Semantic Analysis
Method for Computing Semantic Relatedness Using Wikipedia', Proceedings of the 2008
International Seminar on Future Information Technology and Management Engineering.
IEEE Computer Society, pp.
Widdowson, P. (1989) Hardy in history: a study in literary sociology. London; New York:
Routledge.
Widdowson, P. (1998) On Thomas Hardy: late essays and earlier. Basingstoke: Macmillan.
Widdowson, P. (2009) '"........Into the Hands of Pure-minded English Girls": Hardy's Short Stories
and the Late Victorian Literary Marketplace', in Wilson, K. (ed.), A companion to Thomas
Hardy. Malden, MA: Wiley-Blackwell Pub., pp. 364-378.
William, N. (2006) Social Research Methods. SAGE Publications Ltd.
Williams, M. (1974) Thomas Hardy and Rural England. London: Macmillan.
Williams, R. (1970) 'Thomas Hardy', in The English Novel from Dickens to Lawrence. London:
Chatto & Windus.
Wilson, H. G., Boots, B. and Millward, A. A. (2002) Geoscience and Remote Sensing
Symposium, 2002. IGARSS '02. 2002 IEEE International.
Wilson, K. (2009) A companion to Thomas Hardy. Malden, MA: Wiley-Blackwell Pub.
Windle, B. C. A. (1902) The Wessex of Thomas Hardy. London: J. Lane.
Wishart, D. (1998) ClustanGraphics [Computer Software] (3)
Wolters, M. and Kirsten, M. (1999) 'Exploring the Use of Linguistic Features in Domain and
Genre Classification', Proceedings of the ninth conference on European chapter of the
Association for Computational Linguistics. Bergen, Norway, Association for
Computational Linguistics, pp. 142-9.
Wotton, G. (1985) Thomas Hardy: towards a Materialist Criticism. Goldenbridge, Ireland: Gill
and Macmillan; Barnes & Noble.
Wright, T. R. (1984) 'Middlemarch as a Religious Novel, or Life Without God', in Jasper, D. (ed.),
Images of Belief in Literature. Macmillan, pp. 138-52.
Wright, T. R. (1989) Hardy and the Erotic. Palgrave Macmillan.
Wright, T. R. (2003) Hardy and his readers. New York: Palgrave Macmillan.
Xiao, Z. and McEnery, A. (2005) 'Two Approaches to Genre Analysis: Three Genres in Modern
American English', Journal of English Linguistics, 33, (1), pp. 62-82.
Yu, B. (2008) 'An Evaluation of Text Classification Methods for Literary Study', Lit Linguist
Computing, 23, (3), pp. 327-343.
Yu, B. and Unsworth, J. (2006) 'Toward Discovering Potential Data Mining Applications in
Literary Criticism', Digital Humanities, 5-9 July 2006. Paris-Sorbo, pp.
Yule, G. U. (1939) 'On Sentence-length as a Statistical Characteristic of Style in Prose: With
Applications to Two Cases of Disputed Authorship', Biometrika, 30, pp. 363-390.
Zeitler, M. A. (2007) Representations of Culture: Thomas Hardy's Wessex & Victorian
Anthropology. New York: Peter Lang.