1-1 Text clustering

The idea of text clustering long preceded the computer age: “Clustering is one of the most

primitive mental activities of humans, used to handle the huge amount of information they receive

every day” (Theodoridis and Koutroumbas, 2003: 398). The act of indexing long used in libraries

is an obvious example. Manual clustering was the only type of document clustering possible prior

to the computer age. This circumstance may have influenced much clustering work that relied

only on immediate intuitive knowledge of the world without making use of quantitative numerical

methods. In other words, text clustering was usually performed in subjective ways that relied

heavily on the perception, knowledge, and judgment of the researcher. With easier access to

electronic data across disciplines and growing computing power on the one hand, and

the need to maintain standards of objectivity on the other, it has

become increasingly necessary for such procedures to involve automated computational methods

(Arabie et al., 1996) where human intuition and traditional organization methods are replaced by

mathematical and computational techniques (Golub, 2006; Golub, 2005). Recent years have accordingly

witnessed a flourishing of automated statistical clustering and classification

systems that aim to reduce the subjectivity inherent in traditional text classification applications.

• Clustering vs. classification

The two terms clustering and classification are extensively used throughout this chapter. The

question that arises at this point is: are they synonymous or is there a distinction?

In order to answer this question, some overlapping concepts should be considered. Firstly, there

is an overlap between the two terms text classification and text categorization. In information

retrieval (IR) and text classification literature (Sebastiani, 2006; Svetlana, 2006; Taeho, 2006;

Mirkin, 2005; Sebastiani, 2005a; Sebastiani, 2005b), the two terms are often used interchangeably.

This book too uses them interchangeably. Secondly, there is a frequent confusion between the

terms text clustering and text classification. While many studies (Janos and Balazs, 2007; Wang,

2007; Ozgur, 2006; Jain et al., 1999) use the two terms interchangeably, this book does not. The

idea they share is that they are both concerned with grouping documents into clusters or groups.

However, mechanisms for doing so are different.

Text clustering is the process of automatically grouping natural language texts according to an

analysis of their information/ semantic content, by means of clustering algorithms (2004; Debole

and Sebastiani, 2003). It is simply a process of placing similar documents together into distinct

sets without labelling them (Maranis and Babenko, 2009). Text classification, on the other hand,

is the task of automatically sorting a set of documents into a number of classes or categories where

each is given a label (Maranis and Babenko, 2009; Taeho, 2006).

The main difference thus between clustering and classification is that, in the former, there are no

prior assumptions about the data structure. Unlike clustering, “classification relies on a priori

reference structures that divide the space of all possible data points into a set of classes that are

usually, but not necessarily, nonoverlapping” (Maranis and Babenko, 2009: 164-5). In other

words, clustering is the task of dividing given data into a defined set of clusters, and it is the task of

classification to structure these clusters and sort them into categories according to a group

structure known in advance (Sebastiani, 2006). In this, a text classification task starts by

discovering and finding groups that have similar content then organizing our perceptions of these

groups into categories. In other words, clustering places documents into natural classes while

classification places them into predefined known ones. There is a link thus between clustering and

classification since clustering is a way of generating taxonomies for classification purposes. The

key distinction between clustering and classification is that clustering is an

“unsupervised” activity while classification is a supervised one. In clustering, no one

assigns documents to classes; it is only the distribution and makeup of the data that will

determine cluster membership (Manning et al., 2008).

To illustrate the argument, let us consider the following example. Having a set of 1000 documents

on the history of English literature, these can be both clustered and classified. In performing a

clustering task, documents are just clustered into distinct groups where similar or related

documents are grouped together. In classification, on the other hand, predefined sets are given

first. These can be Old English literature, Shakespearean literature, Augustan Literature, Romantic

Literature, and Victorian Literature. Then documents are placed or classified under these

predefined categories.
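
The contrast can be made concrete with a small sketch. The Python fragment below is illustrative only: the four toy documents (reduced to keywords), the two category names with their keyword lists, the similarity threshold, and the helper names `tokens`, `jaccard`, `classify`, and `cluster` are all assumptions introduced here, not part of any cited system. Classification assigns each document to the most similar predefined category; clustering groups documents by their mutual similarity alone and produces groups that carry no labels.

```python
def tokens(text):
    """Split a text into a set of lower-case word tokens."""
    return set(text.lower().split())


def jaccard(a, b):
    """Overlap between two token sets: shared tokens divided by all tokens."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Classification: the categories (and keywords describing them) exist in advance.
# Both the category names and the keyword lists are invented for illustration.
CATEGORIES = {
    "Romantic": tokens("wordsworth coleridge nature lyrical"),
    "Victorian": tokens("dickens industrial london novel"),
}


def classify(doc):
    """Return the label of the predefined category most similar to the document."""
    return max(CATEGORIES, key=lambda label: jaccard(tokens(doc), CATEGORIES[label]))


def cluster(docs, threshold=0.3):
    """Group documents by mutual similarity alone; the groups carry no labels."""
    groups = []
    for doc in docs:
        for group in groups:
            if any(jaccard(tokens(doc), tokens(other)) >= threshold for other in group):
                group.append(doc)
                break
        else:
            groups.append([doc])
    return groups


docs = [
    "wordsworth coleridge nature poems",
    "dickens industrial london novel",
    "nature lyrical poems wordsworth",
    "london dickens novel serial",
]

labels = [classify(doc) for doc in docs]  # supervised: one known label per document
groups = cluster(docs)                    # unsupervised: unlabeled groups
```

In this toy case both procedures separate the same two pairs of documents, but only `classify` can name the result, because only it was given the categories in advance.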

1-1-1 Applications

Text/ document clustering is applied in different disciplines including IR and data mining. In IR,

document clustering is used to automatically group together documents that belong to the same

topic in order to facilitate the user’s browsing of retrieval results. This is usually labelled cluster-based

retrieval. The underlying principle of cluster-based retrieval is the organization of web pages

or any documents into a hierarchical structure for subject browsing with the purpose of speeding

up and improving IR operations since search in the vector space model amounts to finding the

nearest neighbours to the query (Manning et al., 2008; Golub, 2006; Golub, 2005; Rijsbergen,

2004). The main assumption of cluster-based retrieval applications is that closely associated

documents tend to be relevant to the same requests (Rijsbergen, 1979). Kobayashi and Aono

(2004) argue that clustering is an effective approach for overcoming many IR problems. They

explain that to identify clusters or sets of documents that cover similar topics helps in ranking and

presenting only documents in one or a few selected clusters to users. In turn, the lack of a coherent

clustering structure makes searching much more difficult, especially when specific resource types

are being sought.

In data mining, document clustering is one of several computational techniques for carrying out data

mining tasks (Mirkin, 2005). Document clustering has proved successful in carrying out many

important operations for data mining including natural language processing, feature extraction,

annotation and summarization.

1-1-2 Approaches to text classification & clustering

Currently, there are many classification systems. Broadly speaking, these systems fall into two

main categories. These are binary and multiclass systems. Binary classification systems are only

concerned with classifying documents into two main categories or groups. Classification systems

of this kind are used to distinguish between just two classes of objects. As Maranis and Babenko

(2009) explain, these systems provide a Yes/No answer to the question: Does this document belong

to class X? Such systems are useful, for instance, in classifying emails as spam or not spam,

or commercial transactions as fraudulent or not. In such applications, binary classification

systems are the natural choice as there are only two classes. Multiclass systems, in turn, divide documents into

more than two classes. As the name indicates, these classifiers assign each document or data point

to one of many classes where each has a distinct subject area. Newspaper accounts, for instance,

can be classified under different categories such as news, sport, culture, business & money,

politics, science, etc.

Computational methods of text clustering fall into two main categories. These are linguistic and

statistical mathematical methods (Srivastava and Sahami, 2009; Justo and Torres, 2005).

Linguistic methods are based on natural language processing techniques. Methods of this kind

usually involve morphological and syntactic processes for extracting meaning and identifying

relationships within documents. Mathematical and statistical classification methods are essentially

based on probabilistic frameworks. The main difference between the two approaches lies in the

idea that statistical and mathematical methods are not concerned with linguistic properties. Word

order or compositional semantics, for instance, do not have any significance in classification or

clustering performance. The focus of this chapter is on the use of statistical methods.

• Statistical text clustering

Statistical text clustering developed greatly in the 1990s, when it emerged as a subtask of

Information Retrieval (IR) applications (Joachims and Sebastiani, 2002). The hallmark of that

development has been the dramatic improvement of the effectiveness of text clustering systems.

The last two decades have witnessed an unprecedented revolution in developing mechanized

solutions for organizing the vast quantity of unstructured digital documents and providing

powerful tools for turning this unstructured repository into a structured one (Sebastiani, 2006).

The world of knowledge has witnessed over recent years a rapid increase in the amount of stored

data in all fields of knowledge due to the continuous improvement of methods for digitally storing

data. As a response to the growing overflow of information which has made it difficult for many

search engines to fill people’s needs, various computer-based clustering and classification

methods have been developed. Concerns have been raised by IR researchers and internet users

about the poor matching of queries and the results generated by search engines. IR researchers

have worked in consequence to find ways capable of automatically analyzing, classifying and

summarizing data in order to make it easy for internet users to access data effectively. In this they

made extraordinary progress over the last few years as they have managed to devise mechanisms

for assessing the relevance of information for user interests. Nonetheless, statistical text clustering

has not escaped criticism. Perhaps the most serious disadvantage with ATC applications is that in

almost all text classification schemes, semantic relatedness is merely judged at the level of lexical

semantics without taking compositional semantics into account (Gabrilovich and Markovitch,

2007). Another major criticism of ATC applications is that many of the algorithms used for

computing semantic relatedness represent documents as just bags of words where context is not

considered at all.
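
The bag-of-words limitation is easy to demonstrate. In the minimal sketch below, the two example sentences and the helper name `bag_of_words` are assumptions made for illustration. Because only word counts are recorded, two sentences with opposite meanings receive identical representations.

```python
from collections import Counter


def bag_of_words(text):
    """Count how often each word occurs, discarding word order entirely."""
    return Counter(text.lower().split())


a = bag_of_words("the dog bit the man")
b = bag_of_words("the man bit the dog")

# The two sentences differ in meaning, yet their bags of words are identical.
same_bag = a == b
```

Any similarity measure computed on such representations will therefore treat the two sentences as the same text.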

• Content vs. context

To date, the standard approach and the most widely used in statistical text clustering applications

is clustering by content. This is the clustering of documents by the words they contain. Content

clustering is carried out by means of computing semantic similarity/ distance or what can be called

measuring proximity within documents. It is thus a lexical semantic function. It has always been

argued that semantic information within documents is key to understanding and determining the

content of such documents. In some recent studies (Attardi et al., 1999; Attardi et al., 1998),

however, clustering by context has been introduced as a working approach to evade the problems

caused in clustering by content. Clustering by context is a new method for grouping web pages

whereby the context surrounding a link is used for categorizing the document referred to by the link.

The conception is based on the assumption that a web page which refers to a document must

necessarily involve enough hints about its content which themselves are sufficient to classify the

document (Attardi et al., 1998).

In relation to clustering by context, many software programs have been devised to execute such

tasks including SenseClusters (Purandare and Pedersen, 2004). This and others are programs that

allow users to cluster similar contexts such as emails and web pages (Pedersen, 2008). The

working principle of such programs is that data documents can be grouped on the basis of their

mutual contextual similarities (Purandare and Pedersen, 2004). Programs of this kind have indeed

proven successful as a clustering method when applied to web pages, and their merits are more tangible

with multimedia material. Nevertheless, an approach of this kind carries with it some limitations.

The most serious shortcoming is that it is not concerned with the analysis of the content of

documents. One more drawback is that in almost all context classification applications “identical

replications of controlled experiments result in different conclusions” (Martin et al., 2005: 470).

1-2 Document clustering models

Several document clustering models exist. A clustering model is a formalization of the way of

thinking about document clustering. Such formalism can be established and defined in the form

of algorithms capable of computing semantic similarity among documents. The goal of all

clustering models or systems is to create clusters that are coherent internally, but clearly different

from each other (Manning et al., 2008). The selection of a clustering system however is a

complicated and controversial matter. The idea that clustering results can be used in important

applications makes it crucial to consider the selection of a clustering system or model well.

The main bulk of clustering systems or approaches can be best described under the heading vector

semantics (VS) or vector space clustering (VSC). The underlying principle of VSC is measuring

or computing similarity between the documents to be clustered. VSC is an umbrella approach that

encompasses a number of methods and techniques including vector space model (VSM), latent

semantic indexing (LSI), explicit semantic analysis (ESA), and concept mining.

Prior to discussing the vector space clustering theory and its main methods, however, there are

some concepts that need to be addressed. They are fundamental to understanding VSC theory and

applications. These are data, geometry, and vector space, which are discussed in turn below.

• The nature of data

Data is the plural form of ‘datum’, the past participle of the Latin ‘dare’, ‘to give’, and means

‘things that are given’. A datum is therefore something to be taken for granted or assumed as

fact, a true statement about the world and used as a basis for reasoning, discussion, or calculation

(Simpson and Weiner, 1989). The question ‘what is a true statement about the world?’ has been

intensively studied in cognitive science. For convenience reasons, this discussion adopts the

attitude prevalent in most areas of science: data are abstractions of what we observe using our

senses, often with the aid of instruments (Chalmers, 1999).

There are numerous sources of data. Any object in the world can be a source of data: people,

animals, plants, institutions, texts, sounds, etc., but data itself is ontologically different from the

world: “The world is as it is; the data is an interpretation of it for the purpose of scientific study”

(Moisl, 2008 a: 876). To explain the difference between the two concepts, let us consider this

example.

Table 1-1: Age distribution of the Egyptian people

Age group    Population (male)    Population (female)    % of total
0–18         5,560,489            5,293,871              18.0
19–59        20,193,876           19,736,516             66.3
60+          4,027,721            5,458,235              15.7

The table above describes the age distributions of the Egyptian people. In this example,

populations are the world, the reality while age and sex are just features or observations about this

world. The populations themselves represent the real world while the features of age and sex are

the variables that describe population. Data, therefore, is just these variables or values that

describe populations: “Data never provides more than a pale and hazy shadow, a murky outline,

of the true workings of the world. And yet this gossamer wisp is just enough for us to grasp at the

edges of understanding” (Pyle, 1999: 46).

In linguistics, a text corpus is not the linguist’s data. A text corpus is as it is; measurements of

such things as sentence length, the use of function words, or the density of loan words can be the

linguist’s data. Texts are facts; they are unchangeable entities. Data, on the other hand, can be

different from one analysis to another depending on the nature and purpose of analysis. Taking

Shakespeare’s corpus as an example, data can differ from one application to another. If the

analysis is concerned, for instance, with the relationship between the use of static/dynamic

adjectives and characterization, it is likely then that data can be composed only of the adjectives

and proper names within the documents. If the analysis, however, is concerned with the

investigation of the Latin element in Shakespeare’s texts, data can be abstracted from the lexical

types within the texts to determine the ratio of Latin words to the overall texts.

• Data abstraction

In general, any aspect of the world can be described in a number of ways and to varying degrees of

precision. However, there is no theory-free observation of the world (Popper, 1963; Popper, 1959).

In other words, entities in a domain of inquiry only become relevant to observation in terms of a

research question framed using the ontology and axioms of a theory about the domain. For

example, in marketing analysis variables are selected in terms of the discipline of marketing

broadly defined, which includes the division into sub-disciplines such as direct marketing, brand

marketing, online marketing, promotions, and public relations.

Data can, therefore, only be created in relation to a research question that is defined on the domain

of interest, and that thereby provides an interpretative orientation. Without such an orientation,

how does one know what to observe, what is important, and what is not? The domain of interest

in the present case is the collection of Hardy’s prose works, and the research question defined on

it is:

Can an experimentally replicable, objective and conceptually useful classification based

on empirical evidence abstracted from Thomas Hardy’s prose fiction texts be defined?

• Data representation

Given that data is an interpretation of some aspect of the world in terms of variables, it is crucial

to select the appropriate variables for a successful data analysis. Unfortunately, there are no pre-

defined rules for variable selection. The process depends in the first place on the nature and

purpose of the analysis. The fundamental principle of such a process is that the variables must

describe all and only those aspects of the domain that are relevant to the purpose of analysis

(Moisl, 2008 a; Pyle, 1999).

In text clustering, the semantics of each selected variable determines a particular interpretation of

the domain of interest, and the domain is measured in terms of the semantics. Measurement is

fundamental in data preparation because it makes the link between data and the world, and thus

allows the results of data analysis to be applied to the understanding of the world (Moisl, 2008 a;

Pyle, 1999). Measurement is only possible in terms of some scale and there are various types of

measurement scale, but for present purposes the main dichotomy is between numeric and non-

numeric. The cluster analysis methods discussed in due course assume numeric measurement as

the default case, and for that reason the same is done here. If they are to be analyzed using

mathematically-based computational methods, selected variables must be mathematically

represented. A widely used way for doing this in text clustering is vector space representation

(Salton et al., 1975).

• The nature of geometry

Geometry is based on human intuitions about the world around us. It is a branch of

mathematics concerned with the properties and relations of magnitudes in space, such as lines,

surfaces, and solids. The earliest systematic discussion of geometry was developed by Euclid

around 300 B.C. (Lang, 1958). Euclid based his theory on the earlier attempts of many

mathematicians before him, such as Hippocrates and Eudoxus, who were concerned with

defining the intuitive notions of space, direction, distance, size and shape. He placed the

propositions of the previous mathematicians into a comprehensive deductive and logical system

entitled Elements (Ball, 1935; Taylor, 1893; Euclid, 1826). Euclidean geometry was used

virtually unchanged for 2,000 years for understanding physical reality (Lang and Murrow,

1988; Prenowitz and Jordan, 1965). Throughout that period, Euclidean geometry was seen as a

perfect model for logical reasoning. At the end of the 19th century, however, mathematicians found

logical deficiencies within Euclidean framework. Prenowitz and Jordan (1965) argue that there

are certain logical gaps in the reasoning in the Euclidean frame. Nevertheless, Euclidean geometry

is still used in a range of disciplines including analysis of textual data, and as such the present

discussion is based on Euclidean geometry. Some fundamental ideas of that geometry are

presented below.

• Euclidean geometry

Euclidean geometry is concerned with modelling the world as it is experienced. It describes the

physical world in any finite number of dimensions using a distance formula. In mathematical

terms, Euclidean geometry is concerned with studying the relationships among distances and

angles in a space. According to Euclid, a 1-dimensional, 2-dimensional, or 3-dimensional space can be

described and defined by axes. For a 1-dimensional space, only a single numerical measure is

required. The distance between two objects can be defined by length and graphically represented

as in the figure below.

Figure 1-1: Axis for a 1-dimensional space

Likewise, a 2-dimensional space can be defined using two numerical measures. A school’s

playground, for instance, can be defined in terms of length and width. The two measurements can

be represented in Euclidean geometry as a 2-dimensional space as in Figure 1-2.

Figure 1-2: Axes for a 2-dimensional space

Euclid observed that there are still other kinds of physical property which cannot be described in

one or two dimensions but require three, such as Big Ben Tower. In such a case, three

measurements are required: length, width, and height, and these can be represented in Euclidean

geometry as a 3-dimensional space as in Figure 1-3.

Figure 1-3: Axes for a 3-dimensional space

Because Euclid’s framework stops at three dimensions, modern mathematics generalized

Euclid’s concepts of distance, length, and angle so that any number of dimensions can be defined.

The economic growth of developing countries, for instance, can be represented by an arbitrarily large number of

dimensions such as the role of physical and human capital, technological progress, scale of

investments, trade, capital mobility, fixed assets, net capital stock, and employment. These can be

represented using N-dimensional space. Some fundamental ideas about N-dimensional space are

presented below.
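
The generalization of distance to any number of dimensions can be written compactly: for two points p and q with n coordinates each, the Euclidean distance is the square root of the sum of the squared coordinate differences. The short Python sketch below illustrates this; the function name `euclidean_distance` and the sample points are assumptions made for the example.

```python
import math


def euclidean_distance(p, q):
    """Euclidean distance between two points with the same number of coordinates."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


d2 = euclidean_distance((0, 0), (3, 4))              # 2 dimensions
d4 = euclidean_distance((0, 0, 0, 0), (1, 1, 1, 1))  # 4 dimensions
```

The same function works unchanged whether the points have two coordinates or two thousand, which is what makes the generalization useful for data with many variables.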

• Vector space

A vector space is the basic object of study in linear algebra, one of the most basic of all branches

of mathematics. Vector space theory is based on the development of a mathematical structure that

consists of a set of vectors associated with a field of scalars (which are the object elements of

these vectors). The theory is used in different applications “and the vectors and scalars for one

application will generally be different from the vectors and scalars for another application”

(Howlett, 2010:10-11). A vector space is defined as a geometrical interpretation of a vector in

which the dimensionality 𝓃 of the vector defines an 𝓃-dimensional space, the sequence of

numerical values comprising the vector specifies coordinates in the space, and the vector itself is

a point at the specified coordinates (Howlett, 2010; Fraleigh et al., 1995). For example, let X and Y be two

axes with the values 36 and 160 respectively. These values can be represented as the vector

V = [36, 160]. This is a vector in 2-dimensional space, and it can be

plotted as in Figure 1-4. The components of the 2-dimensional

vector correspond to the coordinates in the 2-dimensional vector space with axes 0…100 and

0…200, counting 36 along the horizontal axis and 160 along the vertical.

Figure 1-4: A vector graphic representation of a 2-dimensional space

This can be extended to include a third dimension, and can be graphically represented as a 3-

dimensional vector space as in Figure 1-5.

Figure 1-5: A vector graphic representation of a 3-dimensional space

Indeed, we are not limited to only using 2 or even 3 dimensions in mathematics. We can have any

number of dimensions and we can use each axis to represent a different value. An axis can be

anything. It can be an image, numbers, or documents. This discussion is only concerned with

documents. The following is an example.

The following are 3 different titles that represent different categories

Title No. 1- United States and Russia agree historic nuclear deal

Title No. 2- Korean sailors killed in naval incident

Title No. 3- Couple convicted of toddler murder

The titles are given the names A, B, and C respectively and are represented as follows:

A = {United, States, and, Russia, agree, historic, nuclear, deal},

B = {Korean, sailors, killed, in, naval, incident},

C = {Couple, convicted, of, toddler, murder}.

These vectors can be included together in just one space technically called a matrix. For now, let’s

give it the name X.

X = {A: United, States, and, Russia, agree, historic, nuclear, deal; B: Korean, sailors, killed, in, naval,
incident; C: Couple, convicted, of, toddler, murder}.

Where there is more than one vector in a space as in the above example, they are collected so as

to constitute a matrix in which each row is a vector, as in the figure below.

Figure 1-6: A matrix X of the 3 vectors A, B, and C

    United  States  and  Russia  agree  historic  nuclear  deal  Korean  sailors  killed  in  naval  incident  Couple  convicted  of  toddler  murder
A:  1       1       1    1       1      1         1        1     0       0        0       0   0      0         0       0          0   0        0
B:  0       0       0    0       0      0         0        0     1       1        1       1   1      1         0       0          0   0        0
C:  0       0       0    0       0      0         0        0     0       0        0       0   0      0         1       1          1   1        1

The matrix X in the above example is composed of two main constituents: rows and columns. The

matrix with m rows and 𝓃 columns defines a manifold in 𝓃-dimensional space. This is the

shape of the data in 𝓃-dimensional space. When 𝓃 = 3, for example, the row vectors can be plotted in a 3-dimensional space,

forming a cloud of points, as in Figure 1-7.

Figure 1-7: A manifold in 3-dimensional space
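
The construction of the matrix X from the three titles can be sketched in a few lines of Python. This is an illustrative sketch rather than a prescribed procedure: the variable names `titles`, `vocabulary`, and `matrix` are assumptions, and words are kept exactly as they appear in the titles, with no stop-word removal or case folding.

```python
titles = {
    "A": "United States and Russia agree historic nuclear deal",
    "B": "Korean sailors killed in naval incident",
    "C": "Couple convicted of toddler murder",
}

# Vocabulary: every distinct word, in order of first appearance.
# Each word becomes one column (one dimension) of the matrix.
vocabulary = []
for text in titles.values():
    for word in text.split():
        if word not in vocabulary:
            vocabulary.append(word)

# One row vector per title: 1 if the word occurs in the title, 0 otherwise.
matrix = {
    name: [1 if word in text.split() else 0 for word in vocabulary]
    for name, text in titles.items()
}

for name, row in matrix.items():
    print(name, row)
```

With these three titles the vocabulary has 19 words, so each title is a point in a 19-dimensional space. Because the titles share no words, no column contains more than one non-zero entry.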

1-3 Vector space clustering (VSC)

Vector space clustering (VSC) is based on measuring the relative distances between the row

vectors. The distance between any two vectors in a space is jointly determined by the size of the

angle between the lines joining them to the origin of the space’s coordinate system, and by the

lengths of those lines. These are shown in Figure 1-8 and Figure 1-9.

Figure 1-8: The angle between vectors

Figure 1-9: Vector length

In VSC, the interplay between length and angle is what determines the distance relations between

and among vectors in a space, and thereby their clustering structure. This can be explained as

follows.

• If the angle is kept constant and the lengths of the vectors are made unequal by

lengthening one of them, then the distance increases, as shown in

Figure 1-10.

Figure 1-10

• If the lengths are kept equal but the angle is increased, the distance between them

increases, as shown in Figure 1-11.

Figure 1-11

• If the lengths are kept equal and the angle is decreased, the distance decreases as well, as shown in

Figure 1-12.

Figure 1-12

Based on these observations, vectors in the first two cases (Figure 1-10 and Figure 1-11) are set apart and

are unlikely to be clustered together, while they are more likely to be clustered together in the

last case, shown in Figure 1-12.
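
The interplay between length and angle follows from the law of cosines: for two vectors of lengths a and b separated by an angle θ, the distance d between their end points satisfies d² = a² + b² − 2ab·cos θ. The sketch below checks three cases numerically: lengthening one vector at a fixed angle, widening the angle, and narrowing it. The function name `distance_from_angle` and the particular lengths and angles are assumptions chosen for illustration.

```python
import math


def distance_from_angle(len_a, len_b, angle_degrees):
    """Distance between the end points of two vectors, by the law of cosines."""
    theta = math.radians(angle_degrees)
    return math.sqrt(len_a ** 2 + len_b ** 2 - 2 * len_a * len_b * math.cos(theta))


base = distance_from_angle(1.0, 1.0, 30)      # equal lengths, 30 degree angle
longer = distance_from_angle(1.0, 2.0, 30)    # same angle, one vector lengthened
wider = distance_from_angle(1.0, 1.0, 60)     # equal lengths, larger angle
narrower = distance_from_angle(1.0, 1.0, 15)  # equal lengths, smaller angle
```

Here `longer` and `wider` both exceed `base`, while `narrower` falls below it, so only the last configuration brings the two vectors closer together.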

In what follows the main applications of VSC are discussed.

1-3-1 Vector space model (VSM)

VSM is simply a technique where documents are compared with each other then indexed or

classified in terms of their similarity or distance based on the words they contain. It can be defined

as the organization of a collection of documents usually represented by a vector space model into

distinct clusters based on similarity. The theory was first developed by Salton (1971) essentially

for IR purposes four decades ago and since then it has become a standard tool in IR systems. The

underlying procedure of VSM is first to extract all useful information within a document

collection and record it in an index known as a vector space. Then a proximity measurement is

used to compute the semantic similarity among the documents with the purpose of grouping

similar documents together. The way data is mathematically represented using VSM is discussed

further elsewhere in this book.
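
The two steps described above, building term vectors and then applying a proximity measure, can be sketched as follows. This is a minimal illustration under stated assumptions: raw term counts are used as vector components, cosine similarity is chosen as the proximity measure (one common choice among several), and the three short documents and the helper names `term_vector` and `cosine_similarity` are invented for the example.

```python
import math
from collections import Counter


def term_vector(text):
    """Represent a document as a sparse vector of raw term counts."""
    return Counter(text.lower().split())


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(u[term] * v[term] for term in u if term in v)
    norm_u = math.sqrt(sum(count * count for count in u.values()))
    norm_v = math.sqrt(sum(count * count for count in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


docs = {
    "d1": "nuclear deal agreed",
    "d2": "historic nuclear deal",
    "d3": "toddler murder trial",
}
vectors = {name: term_vector(text) for name, text in docs.items()}

sim_12 = cosine_similarity(vectors["d1"], vectors["d2"])  # two of three words shared
sim_13 = cosine_similarity(vectors["d1"], vectors["d3"])  # no words shared
```

Documents d1 and d2 share two of their three words and so score about 0.67, while d1 and d3 share none and score 0. A clustering procedure built on this measure would group d1 with d2 and leave d3 apart.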

In spite of being widely used, many studies have doubted the effectiveness of VSM as it is wholly

based on lexical semantics with no regard to the importance of context in identifying intended

meanings (Gabrilovich and Markovitch, 2007; Gabrilovich and Markovitch, 2006; Landauer et

al., 1998; Deerwester et al., 1990). Likewise, some studies have argued that VSM is less effective

in clustering and ranking web pages (Markov et al., 2008; Maguitman et al., 2005) since these

have some special features such as hyperlinks and structural information, which inevitably have

additional information and these are ignored in VSM applications.

1-3-2 Latent Semantic Indexing

Latent Semantic Indexing (LSI) is a statistical/ mathematical technique for extracting and

representing the underlying semantic connections between both the documents and the words in

a large corpus of texts for the purpose of automatic indexing or grouping of documents. (Adrian

et al., 2007; Foltz et al., 1998; Landauer et al., 1998; Deerwester et al., 1990; Dumais et al., 1988).

The literature suggests that LSI was originally developed to tackle some problems such as

polysemy and synonymy that used to affect the validity of VSC performance. Today, it has

numerous applications and techniques. Almost all LSI models assume in principle that a document

arises from one single source even if that source is not determined or defined.

The main assumption behind LSI is that linguistic information that is typically ignored in VSM

applications is fundamental; it is not supplemental information that can be ignored (Kuhn et al.,

2007; Berlin, 2006; Kuhn, 2006). This linguistic information, Deerwester et al. (1990) postulate,

has some underlying semantic structure that is essential for IR and clustering applications.

Nevertheless, LSI, Kuhn (2006) explains, is based essentially on the VSM approach. Just like

VSM, the first step in LSI is to represent a document in the form of a matrix of rows and columns

where rows stand for unique words and columns for text passages.

The underlying principle of LSI is that it uses statistical correlation between word and passage

meaning to create a similarity score between any two documents based entirely on the words that

they contain. Landauer et al. (1998) assert that the relations LSI generates are well correlated with

several human cognitive phenomena involving association or semantic similarity. Unlike VSM,

LSI “uses as its initial data not just the summed contiguous pairwise (or tuple-wise) co-

occurrences of words but the detailed patterns of occurrences of very many words over very large

numbers of local meaning-bearing contexts, such as sentences or paragraphs, treated as unitary

wholes” (Landauer et al., 1998: 5).
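
As a rough illustration of how such occurrence patterns yield a similarity score, the sketch below applies a truncated singular value decomposition, the decomposition used by Deerwester et al. (1990), to a toy term-by-passage matrix. The data, the choice of two latent dimensions, and the helper names are assumptions made for illustration; the example requires NumPy.

```python
import numpy as np

# Toy term-by-passage counts (rows: terms, columns: passages); illustrative data only.
# Passages 0 and 1 are about vehicles, passages 2 and 3 about poetry.
X = np.array([
    [0, 1, 0, 0],  # automobile
    [1, 0, 0, 0],  # car
    [1, 1, 0, 0],  # engine
    [0, 0, 1, 1],  # poem
    [0, 0, 1, 1],  # rhyme
    [0, 0, 0, 1],  # stanza
    [0, 0, 1, 0],  # verse
    [1, 1, 0, 0],  # wheel
], dtype=float)

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def lsi_passage_vectors(matrix, k):
    """Coordinates of each passage in a k-dimensional latent space."""
    _, singular_values, vt = np.linalg.svd(matrix, full_matrices=False)
    # Keep the k largest singular values and scale the passage coordinates by them.
    return (np.diag(singular_values[:k]) @ vt[:k]).T  # one row per passage

latent = lsi_passage_vectors(X, k=2)
raw_01 = cosine(X[:, 0], X[:, 1])       # similarity on raw word counts
lsi_01 = cosine(latent[0], latent[1])   # same pair in the latent space
lsi_02 = cosine(latent[0], latent[2])   # vehicle passage vs. poetry passage

print(round(raw_01, 3), round(lsi_01, 3), round(lsi_02, 3))
```

In this toy case the two vehicle passages differ in one word ("car" versus "automobile"), so their raw cosine is below 1, while in the reduced space they coincide and remain unrelated to the poetry passages. Real corpora do not separate this cleanly.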

Although LSI is reported to achieve good performance in grouping documents of similar topical

meaning together, it has some problems in practice. One major drawback of this

approach, however, is that the resulting dimensions might be difficult to interpret. That is, results

that can be justified on a mathematical level may have no interpretable meaning in natural

language. Equally important, it is theoretically unsuitable for many applications, as a document

does not necessarily arise from a single theme. Rather, a document often contains multiple themes.

1-3-3 Explicit Semantic Analysis

Explicit Semantic Analysis (ESA) is a novel scheme for computing semantic similarity developed by Gabrilovich and

Markovitch (2007). It computes semantic relatedness between two given texts. This newly

developed technique is based on Wikipedia, currently the largest online encyclopaedia. It represents

the meaning of texts in a high dimensional space of concepts derived from Wikipedia. The main

assumption behind ESA is that computing the degree of semantic relatedness between fragments

of natural language text can be improved by explicitly representing the meaning of any text in

terms of Wikipedia-based concepts (Gabrilovich and Markovitch, 2007; Gabrilovich and

Markovitch, 2006).
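
The idea of representing a text in a space of concepts can be sketched as follows. This is a deliberately simplified illustration and not the implementation of Gabrilovich and Markovitch (2007): three hypothetical concept descriptions stand in for Wikipedia articles, and raw word counts stand in for the weighting scheme that ESA actually uses.

```python
import math
from collections import Counter

# Hypothetical stand-ins for Wikipedia articles: concept title -> article text.
CONCEPTS = {
    "Banking": "bank money loan account interest deposit",
    "River": "river water bank stream flood",
    "Music": "song melody guitar concert band",
}
# Assumption: raw word counts replace the weighting used in the published method.
CONCEPT_WORDS = {title: Counter(text.split()) for title, text in CONCEPTS.items()}

def concept_vector(text):
    """Represent a text as a weight for every concept."""
    words = Counter(text.lower().split())
    return {
        title: sum(n * article[word] for word, n in words.items())
        for title, article in CONCEPT_WORDS.items()
    }

def relatedness(text_a, text_b):
    """Cosine similarity between the concept vectors of two texts."""
    va, vb = concept_vector(text_a), concept_vector(text_b)
    dot = sum(va[c] * vb[c] for c in CONCEPTS)
    norm = math.sqrt(sum(x * x for x in va.values())) * math.sqrt(sum(x * x for x in vb.values()))
    return dot / norm if norm else 0.0

print(relatedness("loan interest", "money deposit"))   # no shared words, same concept
print(relatedness("loan interest", "guitar concert"))  # different concepts
print(concept_vector("bank"))                          # ambiguous word
```

The two banking phrases share no words yet map to the same concept, whereas the word "bank" alone receives equal weight for the banking and river concepts, which shows how an ambiguous word is left unresolved in this kind of representation.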

Weiping et al. (2008) assert that the newly developed approach has achieved good performance

in computing semantic relatedness. The experimental results, however, indicate that ESA has

some problems. One main problem with this technique is that it cannot exactly determine the

intended sense of an ambiguous word (Weiping et al., 2008), which reduces much of its

significance. Moreover, ESA is not an integrated classification system. It is merely a technique for computing

semantic relatedness, and it is not successful in relating similar long documents together.

1-3-4 Concept mining

This is a process that has long been used to provide an automated categorization of documents

based on their content. It is a workflow that is used to discover implicit and explicit relationships,

useful associations and groupings in a set of documents or data collection with the purpose of

detecting similar documents in a large corpus and classifying them by topic. It can thus provide

powerful insights into the meaning, provenance and similarity of documents (Looks et al., 2007;

Fang et al., 2006; Han and Kamber, 2001). The assumption is that each word in a given document

relates to several possible concepts, which makes it possible to cluster documents based on their

content. The underlying principle of concept mining is the conversion of words into concepts.

This is done in two successive steps. First, documents are reduced into a sequence of words that

describes the content. Second, these words are mapped into concepts.
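
The two steps can be sketched as follows. The word-to-concept lexicon, the function names and the sample sentences are hypothetical; a real system would draw on a much larger lexical resource and would need to handle ambiguity between competing concepts.

```python
import re

# Hypothetical lexicon: each word may relate to several possible concepts.
LEXICON = {
    "chomsky": ["generative grammar"],
    "transformation": ["generative grammar", "syntax"],
    "movement": ["syntax"],
    "phoneme": ["phonology"],
    "syllable": ["phonology"],
}

def extract_words(document):
    """Step 1: reduce a document to a sequence of lower-cased words."""
    return re.findall(r"[a-z]+", document.lower())

def map_to_concepts(words):
    """Step 2: map the words to concepts and count the evidence for each concept."""
    concepts = {}
    for word in words:
        for concept in LEXICON.get(word, []):  # words outside the lexicon are skipped
            concepts[concept] = concepts.get(concept, 0) + 1
    return concepts

def mine_concepts(document):
    return map_to_concepts(extract_words(document))

doc_a = "Chomsky proposed the transformation."
doc_b = "Each syllable contains a phoneme."
concepts_a = mine_concepts(doc_a)
concepts_b = mine_concepts(doc_b)
print(concepts_a)
print(concepts_b)
```

Documents whose concept counts are dominated by the same concept can then be grouped together, even when they share few words.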

For example, given a number of documents on generative grammar, concept mining

proceeds by identifying relationships and generating facts based on the data within the collection

and the dimensions of the subject. These can be something like Chomsky and generative grammar;

theoretical linguistics and generative grammar; Phrase Structure Rules (PSR) and generative

grammar; deep and surface structures in generative grammar; etc. Documents can also be

classified by topic, for example as WH-movement, linguistic competence, etc.

1-4 Summary

This chapter has given an account of the different approaches by which documents can be clustered

thematically yet objectively. The one approach in the literature that seems theoretically most consistent

with our goal is VSM. In spite of its limitations, VSM remains the most widespread method for

data representation in document clustering and classification applications. So far, most document

clustering approaches work with VSM methods. This can be justified in that it is still suitable for

the majority of clustering and classification purposes. The way data is mathematically represented

using VSM is the subject of the next chapter.

Bibliography

Abdur, C., McCabe, M. C., David, G. and Ophir, F. (2002) 'Document normalization revisited',
Proceedings of the 25th annual international ACM SIGIR conference on Research and
development in information retrieval, pp. 381-382.

Abercrombie, L. (1912) Thomas Hardy. A critical study. Martin Secker: London.

Abu-Salem, H., Mahmoud, A.-O. and Martha, W. E. (1999) 'Stemming methodologies over
individual query words for an Arabic information retrieval system', Journal of the
American Society for Information Science, 50, (6), pp. 524-529.

Adams, R. (2003) Perceptions of innovations: exploring and developing innovation classification.
PhD thesis. Cranfield University.

Adrian, K., Stéphane, D. and Tudor, G. (2007) 'Semantic clustering: Identifying topics in source
code', Information and Software Technology, 49, (3), pp. 230-243.

Afifi, A. A., Clark, V. and May, S. (2004) Computer-aided multivariate analysis. Boca Raton,
Fla. ; London: Chapman & Hall/CRC.

Altintas, K., Can, F. and Patton, J. M. (2007) 'Language Change Quantification Using Time-
separated Parallel Translations', Lit Linguist Computing, 22, (4), pp. 375-393.

Amati, G. and Rijsbergen, C. J. V. (2002) 'Probabilistic models of information retrieval based on
measuring the divergence from randomness', ACM Transactions on Information Systems,
20, (4), pp. 357-389.

Anderberg, M. R. (1973) Cluster analysis for applications. New York ; London: Academic Press.

Anderson, B. R. O. G. (1991) Imagined communities : reflections on the origin and spread of
nationalism. London: Verso.

Arabie, P., Hubert, L. J. and Soete, G. d. (1996) Clustering and classification. Singapore ; London:
World Scientific.

Argamon, S. and Olsen, M. (2006) 'Toward Meaningful Computing', Communications of the ACM,
49, (4), pp. 33-35.

Attardi, G., Di Marco, S. and Salvi, D. (1998) 'Categorization by Context', Journal of Universal
Computer Science, 4, (9), pp. 719–736.

Attardi, G., Gulli, A. and Sebastiani, F. (1999) THAI-99.

Atwood, M. E. (1972) Survival : a thematic guide to Canadian literature. Toronto: Anansi.

Austen, J. (1818) Northanger Abbey: and Persuasion. London: John Murray.

Ball, W. W. R. (1935) A Short Account of the History of Mathematics. London: Macmillan.

Baugh, A. C. (1967) A literary history of England. Routledge & K.Paul.

Beer, G. (2004) Darwin's Plots: Evolutionary Narrative in Darwin, George Eliot and Nineteenth-
Century Fiction. Cambridge: Cambridge University Press.

Bellamy, L. (1998) 'Regionalism and Nationalism: Maria Edgeworth, Walter Scott and the
Definition of Britishness', in Snell, K. D. M. (ed), The Regional Novel in Britain and
Ireland 1800-1990. Cambridge University Press.

Benazon, M. (1978) 'Dark and Fair: Character Contrast in Hardy’s "The Fiddler of the Reels"',
Ariel: A Review of International English Literature, 9, (2), pp. 75-82.

Berg, B. L. (1998) Qualitative research methods for the social sciences. Boston: Allyn and Bacon.

Berlin, C. (2006) 'Exploring the use of latent topical information for statistical Chinese spoken
document retrieval', Pattern Recogn. Lett., 27, (1), pp. 9-18.

Berry, M. W. (ed.) (2004) Survey of Text Mining: Clustering, Classification, and Retrieval. New
York: Springer.

Biber, D. (1986) 'Spoken and Written Textual Dimensions in English: Resolving the
Contradictory Findings', Language, 62, (2), pp. 384-413.

Biber, D. (1992) 'The Multidimensional Approach to Linguistic Analyses of Genre Variation: An
Overview of Methodology and Findings', Computers and the Humanities, 26, (5-6), pp.
331-347.

Binongo, J. N. G. and Smith, M. W. A. (1999) 'The application of principal component analysis
to stylometry', Lit Linguist Computing, 14, (4), pp. 445-466.

Bookstein, A. and Swanson, D. R. (1974) 'Probabilistic models for automatic indexing', Journal
of the American Society for Information Science, 25, pp. 312-318.

Boot, P. (2006) 'Decoding Emblem Semantics', Lit Linguist Computing, 21, (suppl_1), pp. 15-27.

Boumelha, P. (1982) Thomas Hardy and Women: Sexual Ideology and Narrative Form. Brighton:
Harvester Wheatsheaf.

Brady, K. (1982) The short stories of Thomas Hardy. New York: St. Martin's Press.

Brady, K. (ed.) (1999) The Withered Arm and other Stories. London: Penguin.

Breckenridge, J. N. (2000) 'Validating Cluster Analysis: Consistent Replication and Symmetry',
Multivariate Behavioral Research, 35, (2), pp. 261-285.

Breton, A. (1997) Anthology of Black Humor. San Francisco: Translated into English by Mark
Polizzotti. City Lights Books ; Subterranean Co.

Brooks, J. R. (1971) Thomas Hardy: The Poetic Structure. Ithaca, N.Y.: Cornell University Press.

Brown, D. (1961) Thomas Hardy. Longmans, Green.

Burrows, J. (2004) 'Textual Analysis', in Schreibman, S., Siemens, R. and Unsworth, J. (eds) A
Companion to Digital Humanities. Oxford: Blackwell, pp. 88-97.

Burrows, J. F. (1986) 'Modal Verbs and Moral Principles: An Aspect of Jane Austen's Style', Lit
Linguist Computing, 1, (1), pp. 9-23.

Burrows, J. F. (1987) Computation into criticism : a study of Jane Austen's novels and an
experiment in method. Oxford: Clarendon.

Burrows, J. F. (2003) 'Questions of Authorship: Attribution and Beyond A Lecture Delivered on
the Occasion of the Roberto Busa Award ACH-ALLC 2001, New York', Computers and
the Humanities, 37, (1), pp. 5-32.

Burrows, J. F. (2005) 'Who wrote Shamela? Verifying the Authorship of a Parodic Text', Lit
Linguist Computing, 20, (4), pp. 437-450.

Burrows, J. F. (2007) 'All the Way Through: Testing for Authorship in Different Frequency
Strata', Lit Linguist Computing, 22, (1), pp. 27-47.

Carmichael, J. W. and Sneath, P. H. A. (1969) 'Taxometric Maps', Syst Biol, 18, (4), pp. 402-415.

Cecil, D. (1943) Hardy, the novelist; an essay in criticism. London: Constable.

Chalmers, A. F. (1999) What is This Thing Called Science. Indianapolis: Hackett Pub.

Cheong, M.-Y. and Lee, H. (2008) 'Determining the number of clusters in cluster analysis',
Journal of the Korean Statistical Society, 37, (2), pp. 135-143.

Cohen, R. (1989) The Future of literary theory. New York: Routledge.

Coleman, T. (ed.) (1976) An Indiscretion in the Life of an Heiress. London: Hutchinson

Cooley, W. W. and Lohnes, P. R. (1985) Multivariate data analysis. Malabar, Fla.: R.E. Krieger
Pub. Co.

Corns, T. N. (1987) 'Computers in the Humanities: Methods and Applications in the Study of
English Literature', Lit Linguist Computing, 2, (2), pp. 127-130.

Cox, R. G. (1970) Thomas Hardy; the critical heritage. New York: Barnes & Noble.

Craig, H. (1999) 'Contrast and Change in the Idiolects of Ben Jonson Characters', Computers and
the Humanities, 33, (3), pp. 221-240.

Craig, H. (2004) 'Stylistic Analysis and Authorship Studies', in Schreibman, S., Siemens, R. and
Unsworth, J. (eds) A Companion to Digital Humanities. Oxford: Blackwell, pp. 273-288.

Dalziel, P. (1992a) 'Hapless Destiny: an uncollected story of marginalized lives', Thomas Hardy
Journal, (8), pp. 41-2.

Dalziel, P. (1992b) 'Hardy's Unforgotten 'Indiscretion': The Centrality of an Uncontrolled Work',
Review of English Studies, XLIII, (171), pp. 347-366.

Dalziel, P. (ed.) (1992c) Thomas Hardy: The Excluded and Collaborative Stories. Oxford:
Clarendon Press.

De Grazia, M. and Wells, S. W. (2001) The Cambridge Companion to Shakespeare. Cambridge,
U.K.: Cambridge University Press.

Debole, F. and Sebastiani, F. (2003) 'Supervised term weighting for automated text
categorization', Proceedings of the 2003 ACM symposium on Applied computing, pp. 784-
788.

Debole, F. and Sebastiani, F. (2004) 'Supervised Term Weighting for Automated Text
Categorization', ERCIM News 56, pp. 55-56.

Deerwester, S., Susan, T. D., George, W. F., Thomas, K. L. and Richard, H. (1990) 'Indexing by
latent semantic analysis', Journal of the American Society for Information Science, 41, (6),
pp. 391-407.

DeForest, M. and Johnson, E. (2001) 'The Density of Latinate Words in the Speeches of Jane
Austen's Characters', Lit Linguist Computing, 16, (4), pp. 389-401.

Delcourt, C. (1992) 'About the statistical analysis of co-occurrence', Computers and the
Humanities, 26, (1), pp. 21-29.

DeVine, C. (2005) Class in turn-of-the-century Novels of Gissing, James, Hardy, and Wells.
Aldershot, Hants, England: Ashgate.

Dey, I. (1993) Qualitative data analysis : a user-friendly guide for social scientists. London: New
York Routledge.

Dhillon, I., Kogan, J. and Nicholas, C. (2004) 'Feature Selection and Document Clustering', in
Berry, M. W. (ed), Survey of Text Mining: Clustering, Classification, and Retrieval. New
York: Springer.

Dickens, C. (1854) Hard Times. London: Bradbury & Evans.

Dik, L. L., Huei, C. and Kent, S. (1997) 'Document Ranking and the Vector-Space Model', IEEE
Softw., 14, (2), pp. 67-75.

Dimitriadou, E., Dolničar, S. and Weingessel, A. (2002) 'An examination of Indexes for
Determining the Number of Clusters in Binary Data Sets', Psychometrika, 67, (1), pp. 137-
159.

Dingle, H. and Bush, D. (1952) 'Science and Literary Criticism', The British Journal for the
Philosophy of Science, 3, (10), pp. 194-196.

Draper, R. P. and Ray, M. (1989) An annotated critical bibliography of Thomas Hardy. Hemel
Hempstead: Harvester Wheatsheaf.

DuBien, J. L. and Warde, W. D. (1979) 'A Mathematical Comparison of the Members of an
Infinite Family of Agglomerative Clustering Algorithms', The Canadian Journal of
Statistics / La Revue Canadienne de Statistique, 7, (1), pp. 29-38.

Dumais, S. T., Furnas, G. W., Landauer, T. K., Deerwester, S. and Harshman, R. (1988) 'Using
latent semantic analysis to improve access to textual information', Proceedings of the
SIGCHI conference on Human factors in computing systems. Washington, D.C., United
States, ACM, pp.

Duncan, I. (2002) 'The Provincial or Regional Novel', in Brantlinger, P. and Thesing, W. (eds) A
Companion to the Victorian Novel. Oxford: Blackwell.

Dunteman, G. H. (1989) Principal Components Analysis. Sage Publications.

Eastman, D. R. (1971) The Concept of Character in the Major Novels of D.H. Lawrence. thesis.
University of Florida.

Eaton, M. L. (2007) Multivariate Statistics: A Vector Space Approach. Beachwood, Ohio:
Institute of Mathematical Statistics.

Ebbatson, R. (1993) '‘The Withered Arm’ and History', Critical Survey, 5, (2), pp. 131-35.

Ebbatson, R. (2009) '“A Thickness of Wall”: Hardy and Class', in Wilson, K. (ed), A Companion
to Thomas Hardy. Malden, MA: Wiley-Blackwell Pub., pp. xiii, 488 p.

El-Hamdouchi, A. and Willett, P. (1986) 'Hierarchic document classification using Ward's
clustering method', Proceedings of the 9th annual international ACM SIGIR conference
on Research and development in information retrieval. Palazzo dei Congressi, Pisa, Italy,
ACM, pp.

Euclid. (1826) Euclid's Elements of Geometry, With notes, critical and exploratory by Phillips,
George. London: Printed for Baldwin, Cradock, and Joy.

Everitt, B. (1993) Cluster analysis. London: E. Arnold.

Everitt, B., Landau, S. and Leese, M. (2001) Cluster analysis. London: Arnold ; New York :
Oxford University Press.

Fang, L., Mehlitz, M., Li, F. and Sheng, H. (2006) 'Web Pages Clustering and Concepts Mining:
An approach towards Intelligent Information Retrieval', Cybernetics and Intelligent
Systems, 2006 IEEE Conference, pp. 1-6.

Fielding, A. (2007) Cluster and Classification Techniques for the Biosciences. Cambridge, UK ;
New York: Cambridge University Press.

Florek, K., Lukaszewicz, P. J., Steinhaus, H. and Zubrzycki, S. (1951) 'Sur la liaison et la division
des points d’un ensemble fini', Colloq. Math, 2, pp. 282-285.

Foltz, P., Kintsch, W. and Landauer, T. K. (1998) 'Measurement of Text Coherence with Latent
Semantic Analysis', Discourse Processes, 25, pp. 285-307.

Fong, J. H. (2008) 'Determining The Genre of Antony and Cleopatra: Categorising Shakespeare's
Roman Play as Tragedy or History', Shakespearean Theatre, [Online]. Available at:
http://shakespeare-tragedies.suite101.com/article.cfm/determining_genre_for_antony_and_cleopatra
(Accessed: 16 March 2010).

Foucault, M. (1985) The history of sexuality. [Harmondsworth]: Viking, 1986.

Fraleigh, J. B., Beauregard, R. A. and Katz, V. J. (1995) Linear Algebra. Reading, Mass. ;
Wokingham: Addison-Wesley.

Frigui, H. and Nasraoui, O. (2004) 'Simultaneous Clustering and Dynamic Keyword Weighting
for Text Documents', in Berry, M. W. (ed), Survey of Text Mining: Clustering,
Classification, and Retrieval. New York: Springer.

Fromkin, V., Rodman, R. and Hyams, N. M. (2007) An Introduction to Language. Boston, Mass.:
Thomson Wadsworth ; [London : Thomson Learning, distributor].

Gabrilovich, E. and Markovitch, S. (2006) 'Overcoming the Brittleness Bottleneck using
Wikipedia: Enhancing Text Categorization with Encyclopedic Knowledge', Proceedings
of the Twenty-First National Conference on Artificial Intelligence, pp. 1301-1306.

Gabrilovich, E. and Markovitch, S. (2007) 'Computing Semantic Relatedness using
Wikipedia-based Explicit Semantic Analysis', Proceedings of the 20th International Joint Conference
on Artificial Intelligence, pp. 1606-1611.

Gan, G., Ma, C. and Wu, J. (2007) Data Clustering: Theory, Algorithms, and Applications.
Philadelphia, Pa.: SIAM, Society for Industrial and Applied Mathematics ; American
Statistical Association.

Garcia, A. M. and Martin, J. C. (2007) 'Function Words in Authorship Attribution Studies', Lit
Linguist Computing, 22, (1), pp. 49-66.

Gatrell, S. (2003) Thomas Hardy's vision of Wessex. Houndmills, Basingstoke, Hampshire:
Palgrave Macmillan.

Gatrell, S. (2006) 'The Erotics of Dress in A Pair of Blue Eyes', in Thomas Hardy Reappraised:
Essays in Honour of Michael Millgate. Toronto: University of Toronto Press, pp. 118-135.

Gauch, H. G. (1982) Multivariate analysis in community ecology. Cambridge Cambridgeshire ;
New York: Cambridge University Press.

Geer, J. P. v. d. (1971) Introduction to multivariate analysis for the social sciences. San Francisco:
W. H. Freeman.

Gibson, J. (1996) Thomas Hardy. Basingstoke: Macmillan.

Gilmartin, S. and Mengham, R. (2007) Thomas Hardy's shorter fiction : a critical study.
Edinburgh: Edinburgh University Press.

Gittings, R. (ed.) (1978) An Introduction to The Hand of Ethelberta. New York: St. Martin's Press.

Golub, K. (2005) Automated Subject Classification of Textual Web Pages, for Browsing. thesis.
Lund University.

Golub, K. (2006) 'Automated subject classification of textual Web documents', Journal of
Documentation, 62, (3), pp. 350-371.

Gomm, R. (2009) Key concepts in social research methods. Basingstoke: Palgrave Macmillan.

Goode, J. (1988) Thomas Hardy: The Offensive Truth. Oxford [Oxfordshire]: B. Blackwell.

Goodheart, E. (1957) 'Thomas Hardy and The Lyrical Novel', Nineteenth-Century Fiction, 12, (3),
pp. 215-225.

Gordon, A. D. (1996) 'Hierarchical Classification', in Arabie, P., Hubert, L. J. and Soete, G.
d. (eds) Clustering and classification. Singapore ; London: World Scientific, pp. ix,490p.

Gosse, E. (1928) 'Thomas Hardy's Lost Novel', London Times, January 22,

Gosse, E. (1970) 'Thomas Hardy', in Cox, R. G. (ed), Thomas Hardy; The Critical Heritage. New
York: Barnes & Noble. First published in The Speaker (13 September 1890), pp. 167-
172.

Gottschall, J. (2008) 'Measure for Measure', The Boston Globe, 11/05/2008.

Grabmeier, J. and Rudolph, A. (2002) 'Techniques of Cluster Algorithms in Data Mining', Data
Min. Knowl. Discov., 6, (4), pp. 303-360.

Gray, A. (1997) Modern Differential Geometry of Curves and Surfaces with Mathematica. Boca
Raton, FL: CRC Press.

Hair, J. F. (2006) Multivariate data analysis. Upper Saddle River, N.J. ; London: Prentice Hall
PTR.

Halkidi, M., Batistakis, Y. and Vazirgiannis, M. (2001) 'On Clustering Validation Techniques',
Journal of Intelligent Information Systems, 17, (2-3), pp. 107–145.

Hammill, F. (2007) Canadian Literature. Edinburgh: Edinburgh University Press.

Han, J. and Kamber, M. (2001) Data mining : concepts and techniques. San Francisco, Calif. ;
London: Morgan Kaufmann.

Handl, J., Knowles, J. and Kell, D. B. (2005) 'Computational cluster validation in post-genomic
data analysis', Bioinformatics, 21, (15), pp. 3201-3212.

Härdle, W. and Simar, L. (2003) Applied multivariate statistical analysis. Berlin ; New York:
Springer.

Hardy, F. E. (1933) The Life of Thomas Hardy. London: Macmillan.

Hardy, T. (1879) 'The Distracted Young Preacher', New Quarterly Magazine, i, pp. 324-376.

Hardy, T. (1891) A group of Noble Dames. New York: Harper and brothers.

Hardy, T. (1894) Life's Little Ironies. Leipzig: B. Tauchnitz.

Hardy, T. (1896) Wessex Tales. New York: Harper & brothers.

Hardy, T. (1897) The Well-Beloved : A Sketch of a Temperament. London: J. R. Osgood McIlvaine
& co.

Hardy, T. (1912a) A Group of Noble Dames. London: Macmillan and Co.

Hardy, T. (1912b) Jude the Obscure. London: Macmillan and Co.

Hardy, T. (1912c) Life's Little Ironies. London: Macmillan and Co.

Hardy, T. (1912d) Wessex Tales. London: Macmillan and Co.

Hardy, T. (1912e) The Woodlanders. London: Macmillan and Co.

Hardy, T. (1912f) The works of Thomas Hardy in prose and verse. With prefaces and notes.
(Wessex edition.). London: Macmillan & Co.

Hardy, T. (1913) A Changed Man, The Waiting Supper, and Other tales, Concluding with The
Romantic Adventures of a Milkmaid. London: Macmillan & co.

Hardy, T. and Dugdale-Hardy, F. (1992) 'The Unconquerable', in Dalziel, P. (ed), Thomas Hardy:
The Excluded and Collaborative Stories. Oxford: Clarendon Press.

Harman, D. (1991) 'How effective is suffixing?', Journal of the American Society for Information
Science, 42, (1), pp. 7-15.

Harrington, A. (2004) Art and social theory: sociological arguments in aesthetics. Cambridge,
UK: Polity Press.

Harter, S. P. (1975) 'A probabilistic approach to automatic keyword indexing, Part 1: On the
distribution of speciality words in a technical literature', Journal of the American Society
for Information Science, 26, pp. 280-289

Harvey, G. (2003) The complete critical guide to Thomas Hardy. London: Routledge.

Hernadi, P. (1972) Beyond Genre: New Directions in Literary Classification. Ithaca New York:
Cornell University Press.

Hersh, W. R. (1996) Information Retrieval: A Health Care Perspective. New York: Springer.

Higonnet, M. R. (1993) The Sense of sex: feminist perspectives on Hardy. Urbana: University of
Illinois Press.

Hockey, S. M. (2000) Electronic Texts in the Humanities: Principles and Practice. Oxford: Oxford
University Press.

Holloway, I. (1997) Basic concepts for qualitative research. Oxford: Blackwell Science.

Holmes, D. I. (1994) 'Authorship Attribution', Computers and the Humanities, 28, pp. 87-106.

Holmes, D. I. (1998) 'The Evolution of Stylometry in Humanities Scholarship', Lit Linguist
Computing, 13, (3), pp. 111-117.

Holmes, D. I. and Forsyth, R. S. (1995) 'The Federalist Revisited: New Directions in Authorship
Attribution', Lit Linguist Computing, 10, (2), pp. 111-127.

Hoover, D. L. (2001) 'Statistical Stylistics and Authorship Attribution: an Empirical
Investigation', Lit Linguist Computing, 16, (4), pp. 421-444.

Hoover, D. L. (2002) 'Frequent Word Sequences and Statistical Stylistics', Lit Linguist
Computing, 17, (2), pp. 157-180.

Hope, K. (1968) Methods of multivariate analysis, with handbook of multivariate methods
programmed in Atlas Autocode. London: University of London P.

Horton, R., Olsen, M., Roe, G. and Voyer, R. (2007) 'Mining Eighteenth Century Ontologies:
Machine Learning and Knowledge Classification in the Encyclopédie', Digital
Humanities. Urbana-Champaign, Illinois, 2-8 June 2007. pp.

Horton, T., Taylor, C., Yu, B. and Xiang, X. (2006) '‘Quite Right, Dear and Interesting’: Seeking
the Sentimental in Nineteenth Century American Fiction', Digital Humanities. Paris-
Sorbonne, France, 5-9 July 2006. pp. 81-82.

Houen, A. (2000) 'Hardy, Thomas, 1840-1928', Literature Online biography, [Online]. Available
at: http://gateway.proquest.com/openurl?ctx_ver=Z39.88-
2003&xri:pqil:res_ver=0.2&res_id=xri:lion&rft_id=xri:lion:ft:ref:BIO002715:0
(Accessed: 25/08/2008).

Howlett, R. (2010) Vector Space Theory. Sydney: The University of Sydney.

Hull, D. A. (1996) 'Stemming algorithms: A case study for detailed evaluation', Journal of the
American Society for Information Science, 47, (1), pp. 70-84.

Hunter, A. (2007) The Cambridge Introduction to the Short Story in English. Leiden: Cambridge
University Press.

Ide, N. (1989) 'A statistical measure of theme and structure', Computers and the Humanities, 23,
(4), pp. 277-283.

Ingham, P. (1989) Thomas Hardy: A Feminist Reading. Hemel Hempstead: Harvester Wheatsheaf.

Irwin, M. (1980) 'Readings of Melodrama', in Gregor, I. (ed), Reading the Victorian Novel: Detail
into Form. New York: Harper & Row, Publishers, Inc.

Jackson, J. E. (1991) A user's guide to principal components. New York: Wiley.

Jacobs, H. A. (1861) Incidents in the Life of a Slave Girl. Boston: Jacobs.

Jacobus, M. (1976) 'Tess’s Purity', Essays in Criticism, 26, pp. 318-38.

Jacobus, M. (ed.) (1979) Women Writing and Writing about Women. London: Croom Helm.

Jain, A. K., Murty, M. N. and Flynn, P. J. (1999) 'Data Clustering: A Review', ACM Comput.
Surv., 31, (3), pp. 264-323.

Jain, R. and Koronios, A. (2008) 'Innovation in the cluster validating techniques', Fuzzy
Optimization and Decision Making, 7, (3), pp. 233-241.

James, L. (2006) The Victorian novel. Malden, MA: Blackwell Pub.

Janos, A. and Balazs, F. (2007) Cluster Analysis for Data Mining and System Identification.
Basel: Birkhauser.

Jekel, P. L. (1986) Thomas Hardy's Heroines: A Chorus of Priorities. New York: Whitston.

Jenkins, M.-C. and Smith, D. (2005) 'Conservative stemming for search and indexing', SIGIR’05.
August 15–19, 2005. ACM, pp.

Jinxi, X. and Croft, W. B. (1998) 'Corpus-based stemming using cooccurrence of word variants',
ACM Trans. Inf. Syst., 16, (1), pp. 61-81.

Joachims, T. (2002) Learning to Classify Text Using Support Vector Machines: Methods, Theory
and Algorithms. Kluwer Academic Publishers.

Joachims, T. and Sebastiani, F. (2002) 'Guest Editors' Introduction to the Special Issue on
Automated Text Categorization', Journal of Intelligent Information Systems, 18, (2), pp.
103-105.

Jockers, M. L. (2009) Machine-Classifying Novels and Plays by Genre. Available at:
https://www.stanford.edu/~mjockers/cgi-bin/drupal/node/27 (Accessed: 16 March 2010).

Jockers, M. L., Witten, D. M. and Criddle, C. S. (2008) 'Reassessing authorship of the Book of
Mormon using delta and nearest shrunken centroid classification', Lit Linguist Computing,
23, (4), pp. 465-491.

Johnson, L. P. (1923) The Art of Thomas Hardy. London: J. Lane.

Johnson, S. L. (2005) Historical Fiction: A Guide to the Genre. Westport, Conn.: Libraries
Unlimited.

Jolliffe, I. T. (2002) Principal component analysis. Berlin ; London: Springer.

Juola, P. (2008) 'Authorship Attribution', Foundations and Trends in Information Retrieval, 1,
(3), pp. 233-334.

Juola, P., Sofko, J. and Brennan, P. (2006) 'A Prototype for Authorship Attribution Studies', Lit
Linguist Computing, 21, (2), pp. 169-178.

Justo, R. and Torres, I. (2005) Progress in Pattern Recognition, Image Analysis and Applications.
10th Iberoamerican Congress on Pattern Recognition, CIARP 2005,Havana, Cuba,
November 15-18, 2005. Proceedings

Kaplan, R. M. (2005) 'A Method for Tokenizing Text', in Arppe, A., Carlson, L., Lindén, K.,
Piitulainen, J., Suominen, M., Vainio, M., Westerlund, H. and Yli-Jyrä, A. (eds) Inquiries
into Words, Constraints and Contexts. CSLI Studies in Computational Linguistics
Publications, pp. 55-64.

Kaufman, L. and Rousseeuw, P. J. (1990) Finding Groups in Data: An Introduction to Cluster
Analysis. Hoboken, New Jersey: John Wiley & Sons, Inc.

Kaur, M. (2005) The Feminist Sensibility in the Novels of Thomas Hardy. New Delhi: Sarup &
Sons.

Kearney, P. J. (1982) A History of Erotic Literature. London: Macmillan.

Keith, W. J. (1969) 'Thomas Hardy and the Literary Pilgrims', Nineteenth-Century Fiction, 24,
(1), pp. 80-92.

Keith, W. J. (1979) 'A Regional Approach to Hardy's Fiction', in Kramer, D. (ed), Critical
Approaches to the Fiction of Thomas Hardy. London: Macmillan.

Kessler, B., Nunberg, G. and Schütze, H. (1997) 'Automatic Detection of Text Genre', Proceedings
of the eighth conference on European chapter of the Association for Computational
Linguistics. Madrid, Spain, Association for Computational Linguistics, pp. 32-38.

Kettle, A. (1953) 'Tess of the d’Urbervilles', in An Introduction to the English Novel. Vol. 2
London: Hutchinson University Library, pp. 45-56.

Kettle, A. (1967) An introduction to the English novel. London,: Hutchinson.

Kettle, A. (1973) The Novel in the Mid-nineteenth Century. Milton Keynes: Open University
Press.

King, J. (1978) 'Thomas Hardy: Tragedy Ancient and Modern', in Tragedy in the Victorian
Novel. Cambridge: Cambridge University Press, pp. 97–126.

Kobayashi, M. and Aono, M. (2004) 'Vector Space Models for Search and Cluster Mining', in
Berry, M. W. (ed), Survey of Text Mining: Clustering, Classification, and Retrieval. New
York: Springer-Verlag.

Koppel, M., Argamon, S. and Shimoni, A. R. (2002) 'Automatically Categorizing Written Texts
by Author Gender', Lit Linguist Computing, 17, (4), pp. 401-412.

Kramer, D. (1975) Thomas Hardy: the Forms of Tragedy. Detroit: Wayne State University Press.

Kramer, D. (1999) The Cambridge companion to Thomas Hardy. Cambridge, UK: Cambridge
University Press.

Kramer, D. and Dalziel, P. (eds.) (2004) The Mayor of Casterbridge. New York: Oxford
University Press.

Krishnaiah, P. R. and Kanal, L. N. (1982) Classification, pattern recognition and reduction of
dimensionality. Amsterdam ; Oxford: North-Holland.

Krovetz, R. (1993) 'Viewing morphology as an inference process', Proceedings of the 16th annual
international ACM SIGIR conference on Research and development in information
retrieval, pp. 191-202.

Kuhn, A. (2006) Semantic Clustering: Making Use of Linguistic Information to Reveal Concepts
in Source Code. thesis. University of Bern.

Kuhn, A., Ducasse, S. and Gîrba, T. (2007) 'Semantic clustering: Identifying topics in source
code', Information and Software Technology, 49, (3), pp. 230-243.

Laan, N. M. (1995) 'Stylometry and Method. The Case of Euripides', Lit Linguist Computing, 10,
(4), pp. 271-278.

Labbe, C. and Labbe, D. (2006) 'A Tool for Literary Studies: Intertextual Distance and Tree
Classification', Lit Linguist Computing, 21, (3), pp. 311-326.

Laffal, J. (1995) 'A concept analysis of Jonathan Swift's A tale of a Tub and Gulliver's Travels',
Computers and the Humanities, 29, (5), pp. 339-361.

Landauer, T. K., Foltz, P. and Laham, D. (1998) 'An Introduction to Latent Semantic Analysis',
Discourse Processes, 25, (2-3), pp. 259-84.

Lang, S. (1958) Introduction to Algebraic Geometry. New York: Interscience Publishers.

Lang, S. and Murrow, G. (1988) Geometry. New York: Springer.

Lawrence, D. H. (1936) 'Study of Thomas Hardy', Phoenix: The Posthumous Papers of D.H.
Lawrence, pp. 398-516.

Lea, H. (1969) Thomas Hardy's Wessex. St. Peter Port: Toucan P.

Leah, S. L., Lisa, B. and Margaret, E. C. (2002) 'Improving stemming for Arabic information
retrieval: light stemming and co-occurrence analysis', Proceedings of the 25th annual
international ACM SIGIR conference on Research and development in information
retrieval, pp. 275-282.

Liebler, N. C. (1995) Shakespeare's Festive Tragedy: The Ritual Foundations of Genre. London:
Routledge.

Lodge, D. (1974) 'Thomas Hardy and Cinematographic Form', NOVEL: A Forum on Fiction, 7,
(3), pp. 246-254.

Looks, M., Levine, A., Covington, G. A., Loui, R. P., Lockwood, J. W.
and Cho, Y. H. (2007) Aerospace Conference, 2007 IEEE.

Losada, D. and Azzopardi, L. (2008) 'An analysis on document length retrieval trends in language
modeling smoothing', Information Retrieval, 11, (2), pp. 109-138.

Love, H. (2002) Attributing authorship : an introduction. Cambridge: Cambridge University
Press.

Lovins, J. B. (1968) 'Development of a Stemming Algorithm', Mechanical Translation and
Computational Linguistics, 11, (1), pp. 23-31.

Luhn, H. (1957) 'A statistical approach to mechanised encoding and searching of library
information', IBM Journal of Research and Development, 1, pp. 309-317.

Macdonell, A. (1894) Thomas Hardy. London: Hodder & Stoughton.

Machan, T. W. (1991) 'Late Middle English Texts and the Higher and Lower Criticisms', in
Machan, T. W. (ed), Medieval Literature: Texts and Interpretation. Medieval and
Renaissance Texts and Studies. New York: Binghamton, pp. 3–16.

Maguitman, A. G., Menczer, F., Roinestad, H. and Vespignani, A. (2005) International World
Wide Web Conference Committee (IW3C2). Chiba, Japan, May 10-14, 2005. ACM.

Mallett, P. (2003) Thomas Hardy Texts and Contexts. New York, N.Y.: Palgrave Macmillan.

Manning, C. D., Raghavan, P. and Schütze, H. (2008) An Introduction to Information Retrieval.
Cambridge: Cambridge University Press.

Maranis, H. and Babenko, D. (2009) Algorithms of the Intelligent Web. Greenwich: Manning
Publications Co.

Markov, A., Last, M. and Kandel, A. (2008) 'The hybrid representation model for web document
classification', International Journal of Intelligent Systems, 23, (6), pp. 654-679.

Martin, H., st, Claes, W. and Thomas, T. (2005) 'Experimental context classification: incentives
and experience of subjects', Proceedings of the 27th international conference on Software
engineering. St. Louis, MO, USA, ACM, pp.

Mather, P. M. (1976) Computational methods of multivariate analysis in physical geography.
London ; New York: Wiley.

Matthews, R. A. J. and Merriam, T. V. N. (1993) 'Neural Computation in Stylometry I: An
Application to the Works of Shakespeare and Fletcher', Lit Linguist Computing, 8, (4), pp.
203-209.

Meisel, P. (1972) Thomas Hardy: The Return of the Repressed; A Study of the Major Fiction.
New Haven: Yale University Press.

Miles, M. and Hurberman, M. (1994) Qualitative Data Analysis: an expanded sourcebook.
London: Beverley Hills.

Miller, J. H. (1970) Thomas Hardy, distance and desire. Cambridge, Mass.: Belknap Press of
Harvard University Press.

Millgate, M. (1985) The Life and Work of Thomas Hardy. Athens: University of Georgia Press.

Millgate, M. (1994) Thomas Hardy : His Career As A Novelist. London: Macmillan.

Milligan, G. and Cooper, M. (1985) 'An examination of procedures for determining the number
of clusters in a data set', Psychometrika, 50, (2), pp. 159-179.

Milligan, G. W. (1996) 'Clustering Validation: Results and Implications for Applied Analyses', in
Arabie, P., Hubert, L. J. and De Soete, G. (eds) Classification and Clustering. River Edge,
NJ: World Scientific Publishing Co Pte Ltd.

Milton, J. S. and Arnold, J. C. (2002) Introduction to probability and statistics : principles and
applications for engineering and the computing sciences. New York ; London: McGraw-Hill.

Mingjin Yan, K. Y. (2007) 'Determining the Number of Clusters Using the Weighted Gap
Statistic', in:

Mirkin, B. (2005) Clustering for Data Mining: A Data Recovery Approach. Taylor & Francis
Group, LLC.

Moisl, H. (2008 a) 'Exploratory Multivariate Analysis', in Lüdeling, A. and Kytö, M. (eds) Corpus
Linguistics. An International Handbook. Vol. II. Berlin: Mouton de Gruyter, pp. 874-899.

Moisl, H. (2008 b) 'Using electronic corpora to study language variation: the problem of data
sparsity', Studies in Language Variation.

Moisl, H. (2009) 'Using Electronic Corpora in Historical Dialectology Research: The Problem of
Document Length Variation', in Dossena, M. and Lass, R. (eds) Studies in English and
European Historical Dialectology. Vol. 98, pp. 67-90.

Moisl, H. and Jones, V. (2005) 'Cluster Analysis of the Newcastle Electronic Corpus of Tyneside
English: A Comparison of Methods', Lit Linguist Computing, 20, pp. 125-146.

Moisl, H. and Maguire, W. (2008) 'Identifying the Main Determinants of Phonetic Variation in
the Newcastle Electronic Corpus of Tyneside English', Journal of Quantitative Linguistics,
15, pp. 46-69.

Morgan, R. (1988) Women and Sexuality in the Novels of Thomas Hardy. Routledge.

Morgan, R. (2002) Cancelled Words: Rediscovering Thomas Hardy. London: Routledge.

Morgan, R. (2006) Student Companion to Thomas Hardy. Greenwood Press.

Mort, J. (2002) Christian Fiction: A Guide to the Genre. Greenwood Village, Colo.: Libraries
Unlimited.

Morton, A. Q. (1965) 'The Authorship of Greek Prose', Journal of the Royal Statistical Society,
Series A (128), pp. 169-233.

Morton, A. Q. (1986) 'Once. A Test of Authorship Based on Words which are not Repeated in the
Sample', Lit Linguist Computing, 1, (1), pp. 1-8.

Mosteller, F. and Wallace, D. L. (1964) Inference and Disputed Authorship: The Federalist.
Reading, Mass.: Addison-Wesley Pub. Co.

Mosteller, F. and Wallace, D. L. (1984) Applied Bayesian and classical inference : the case of the
Federalist papers. New York: Springer.

Nakamura, J. and Sinclair, J. (1995) 'The World of Woman in the Bank of English: Internal
Criteria for the Classification of Corpora', Lit Linguist Computing, 10, (2), pp. 99-110.

Nemesvari, R. (2009) '“Genres are not to be mixed. . . . I will not mix them”: Discourse, Ideology,
and Generic Hybridity in Hardy’s Fiction', in Wilson, K. (ed), A Companion to Thomas
Hardy. Malden, MA: Wiley-Blackwell Pub., pp. xiii, 488 p.

Nixon, J. V. (2004) Victorian Religious Discourse: New Directions in Criticism. New York:
Palgrave Macmillan.

Novovičová, J., Malík, A. and Pudil, P. (2004) 'Feature Selection Using Improved Mutual
Information for Text Classification', in Structural, Syntactic, and Statistical Pattern
Recognition. pp. 1010-1017.

Omar, A. A. (2010) 'Addressing Subjectivity in Thematic Classification of Literary Texts: Using
Cluster Analysis to Derive Taxonomies of Thematic Concepts in the Thomas Hardy’s
Prose Fiction', Proceedings of the Chicago Colloquium on Digital Humanities and
Computer Science, 1, (2).

Oulton, C. (2002) Literature and religion in mid-Victorian England : from Dickens to Eliot.
Houndmills, Hampshire: Palgrave Macmillan.

Ozgur, Y. (2006) Empirical selection of nlp-driven document representations for text
categorization. thesis. Syracuse University.

Page, N. (2000) Oxford Reader's Companion to Hardy. Oxford: Oxford University Press.

Paice, C. D. (1990) 'Another stemmer', SIGIR Forum, 24, (3), pp. 56-61.

Paice, C. D. (1996) 'Method for evaluation of stemming algorithms based on error counting', J.
Am. Soc. Inf. Sci., 47, (8), pp. 632-649.

Paterson, J. (1991) 'An Attempt at Grand Tragedy', in Draper, R. P. (ed), Hardy: The Tragic
Novels. London: Macmillan Press LTD.

Paton, J. M. and Can, F. (2004) 'A Stylometric Analysis of Yaşar Kemal’s İnce Memed
Tetralogy', Computers and the Humanities, 38, pp. 457–467.

Patton, M. Q. (2002) Qualitative Research & Evaluation Methods. London: Sage.

Payne, G. and Payne, J. (2004) Key concepts in social research. London: SAGE.

Pedersen, T. (2008) 'Computational Approaches to Measuring the Similarity of Short Contexts :
A Review of Applications and Methods', in.

Plain, G., Sellers, S. and Ebooks Corporation. (eds.) (2007) A History of Feminist Literary
Criticism. Leiden: Cambridge University Press.

Plaisant, C., Rose, J. and Yu, B. (2006) 'Exploring Erotics in Emily Dickinson’s Correspondence
with Text Mining and Visual Interfaces', Proceedings of the 6th ACM/IEEE-CS Joint
Conference on Digital Libraries (JCDL ’06), Chapel Hill, North Carolina. 11-15 June
2006. pp. 141-50.

Plietzsch, B. (2003) Hardy's Classification of his Works. Available at:
http://www.st-and.ac.uk/~bp10/wessex/fictional_concept/classification.shtml (Accessed: 20/10/2008).

Plietzsch, B. (2004) The Novels of Thomas Hardy as a Product of Nineteenth-century Social,
Economic, and Cultural Change. Doctoral thesis. Tenea; Martin-Luther-Universität,
Halle-Wittenberg.

Popper, K. R. (1959) The Logic of Scientific Discovery. New York: Basic Books.

Popper, K. R. (1963) Conjectures and Refutations; The Growth of Scientific Knowledge. London:
Routledge and K. Paul.

Porter, M. F. (1980) 'An Algorithm for Suffix Stripping', Program, 14, (3), pp. 130-137.

Porter, M. F. (1997) 'An algorithm for suffix stripping', in Readings in information
retrieval. Morgan Kaufmann Publishers Inc., pp. 313-316.

Potter, R. (1988) 'Literary criticism and literary computing: The difficulties of a synthesis',
Computers and the Humanities, 22, (2), pp. 91-97.

Praz, M. (1933) The Romantic Agony. London: Oxford University Press.

Prenowitz, W. and Jordan, M. (1965) Basic Concepts of Geometry. New York: Blaisdell Pub. Co.

Punj, G. and Stewart, D. W. (1983) 'Cluster Analysis in Marketing Research: Review and
Suggestions for Application', Journal of Marketing Research, 20, (2), pp. 134-148.

Purandare, A. and Pedersen, T. (2004) In Proceedings of the Nineteenth National Conference on
Artificial Intelligence (AAAI-04). San Jose, USA, July 25-29, 2004.

Purdy, R. L. (1979) Thomas Hardy : A Bibliographical Study. [S.l.]: Oxford University Press.

Purdy, R. L. and Millgate, M. (eds.) (1978) The Collected Letters of Thomas Hardy (hereafter
Collected Letters). Oxford: Clarendon Press.

Pyle, D. (1999) Data Preparation for Data Mining. San Francisco, California: Morgan Kaufmann;
London: Taylor & Francis.

Quiller-Couch, A. T. (1923) Studies in Literature. Cambridge Eng.: University Press.

Radford, A. D. (2009) Victorian Sensation Fiction. Basingstoke [England]: Palgrave Macmillan.

Ramos, J. (2003) Using TF-IDF to Determine Word Relevance in Document Queries. Available
at: http://www.cs.rutgers.edu/~mlittman/courses/ml03/iCML03/papers/ramos.pdf
(Accessed:

Ramsay, S. (2003) 'Special Section: Reconceiving Text Analysis: Toward an Algorithmic
Criticism', Lit Linguist Computing, 18, (2), pp. 167-174.

Ramsay, S. (2005) 'In Praise of Pattern', TEXT Technology: the Journal of Computer Text
Processing, 14, (2), pp. 177-190.

Ramsay, S. (2007) 'Algorithmic Criticism', in Siemens, R. G. and Schreibman, S. (eds) A
companion to digital literary studies. Malden, MA: Blackwell Publishers, pp. xx, 620 p.

Ramsay, S. and Steger, S. (2006) 'Distinguished Speakers: Keyword Extraction and Critical
Analysis with Virginia Woolf’s The Waves', Digital Humanities. Sorbonne, Paris, 5-9 July
2006. pp. 255-257.

Ramsdell, K. (1999) Romance Fiction: A Guide to the Genre. Englewood, Colo.: Libraries
Unlimited.

Ray, M. (1996) 'Thomas Hardy's 'The Son's Veto': A Textual History', Review
of English Studies, XLVII, (188), pp. 542-547.

Ray, M. (1997) Thomas Hardy : a textual study of the short stories. Aldershot ; Brookfield, Vt.,
USA: Ashgate.

Reger, R. K. and Huff, A. S. (1993) 'Strategic groups: a cognitive perspective', Strategic
Management Journal, 14, (2), pp. 103-124.

Rencher, A. C. (2002) Methods of Multivariate Analysis. John Wiley & Sons, INC.

Rijsbergen, C. J. V. (1979) Information Retrieval. London: Butterworth.

Rijsbergen, C. J. v. (2004) The Geometry of Information Retrieval. Cambridge University Press.

Roberts, J. L. (1962) 'Legend and Symbol in Hardy's "The Three Strangers"', Nineteenth-Century
Fiction, 17, (2), pp. 191-194.

Robertson, S. (2004) 'Understanding inverse document frequency: on theoretical arguments for
IDF', Journal of Documentation, 60, (5), pp. 503-520.

Robertson, S. E. and Walker, S. (1994) 'Some simple effective approximations to the 2-Poisson
model for probabilistic weighted retrieval', Proceedings of the 17th annual international
ACM SIGIR conference on Research and development in information retrieval. Dublin,
Ireland, Springer-Verlag New York, Inc., pp. 232-241.

Rockwell, G. (2003) 'What is Text Analysis, Really?', Lit Linguist Computing, 18, (2),
pp. 209-219.

Rogers, M. F. (1991) Novels, novelists, and readers: toward a phenomenological sociology of
literature. Albany: State University of New York Press.

Rommel, T. (2004) 'Literary Studies', in Schreibman, S., Siemens, R. and Unsworth, J. (eds)
A Companion to Digital Humanities. Oxford: Blackwell, pp. 88-97.

Rousseeuw, P. (1987) 'Silhouettes: a graphical aid to the interpretation and validation of cluster
analysis', J. Comput. Appl. Math., 20, (1), pp. 53-65.

Rowson, S. (1794) Charlotte. A Tale of Truth. Philadelphia: Printed by D. Humphreys, for M.
Carey.

Rowson, S. (1828) Charlotte's Daughter; or, The Three Orphans. A Sequel to Charlotte Temple.
Boston: Richardson & Lord.

Rudman, J. (1997) 'The State of Authorship Attribution Studies: Some Problems and Solutions',
Computers and the Humanities, 31, (4), pp. 351-365.

Salton, G. (1971) The Smart retrieval system : experiments in Automatic document processing.
Englewood Cliffs: Prentice Hall Inc.

Salton, G. (1982) Introduction to Modern Information Retrieval. New York: McGraw-Hill.

Salton, G. and Buckley, C. (1987) Term Weighting Approaches in Automatic Text Retrieval.
Cornell University

Salton, G. and Buckley, C. (1988) 'Term-weighting approaches in automatic text retrieval',
Information Processing & Management, 24, (5), pp. 513-523.

Salton, G., Wong, A. and Yang, C. S. (1975) 'A Vector Space Model for Automatic Indexing',
Communications of the ACM, 18, (11), pp. 613–620.

Salvador, S. and Chan, P. (2004) Tools with Artificial Intelligence, 2004. ICTAI 2004. 16th IEEE
International Conference on.

Saricks, J. G. (2009) The Readers' Advisory Guide to Genre Fiction. Chicago: American Library
Association.

Savoy, J. (1999) 'A stemming procedure and stopword list for general French corpora', J. Am. Soc.
Inf. Sci., 50, (10), pp. 944-952.

Saxelby, F. O. (1911) A Thomas Hardy Dictionary. The characters and scenes of the novels and
poems alphabetically arranged and described. George Routledge & Sons: London.

Schreibman, S., Siemens, R. G. and Unsworth, J. (eds.) (2004) A Companion to Digital
Humanities. Oxford: Blackwell.

Scoltock, J. (1984) The Application of Cluster Analysis to Sales, Production, Costing and Design.
thesis. University of Newcastle upon Tyne.

Sebastiani, F. (2005a) 'Text Categorization', in Zanasi, I. A. (ed), Text Mining and its
Applications. Southampton, UK: WIT Press, pp. 109-129.

Sebastiani, F. (2005b) 'Text categorization', in Rivero, L. C., Doorn, J. H. and Ferraggine, V.
E. (eds) The Encyclopedia of Database Technologies and Applications. Hershey: Idea
Group Publishing, pp. 683-687.

Sebastiani, F. (2006) 'Classification of Text, Automatic', in Brown, K. (ed), Encyclopaedia of
Language & Linguistics. Vol. 2. Oxford: Elsevier, pp. 457-462.

Seifoddini, H. K. (1989) 'Single linkage versus average linkage clustering in machine cells
formation applications', Computers & Industrial Engineering, 16, (3), pp. 419-426.

Seymour-Smith, M. (1994) Hardy. Bloomsbury Pub.

Sherman, G. W. (1976) The Pessimism of Thomas Hardy. Rutherford [N.J.]: Fairleigh Dickinson
University Press.

Sherren, W. (1902) The Wessex of Romance ... With illustrations. (Bibliography of Thomas
Hardy.). pp. xi. 312. Chapman & Hall: London.

Sherren, W. (1908) The Wessex of Romance ... New and revised edition. [With a bibliography of
Thomas Hardy.]. pp. 13. 295. Francis Griffiths: London.

Shumaker, J. R. (1999) 'Abjection and Degeneration in Thomas Hardy's "Barbara of the House of
Grebe" ', College Literature, 26, (2), pp. 1-17.

Shuttleworth, S. (ed.) (1999) Two on a Tower : A Romance. London: Penguin.

Sichel, H. S. (1974) 'On a Distribution Representing Sentence-Length in Written Prose', Journal
of the Royal Statistical Society, 137, (1), pp. 25-34.

Siemens, R. (2002) 'A New Computer-assisted Literary Criticism?', Computers and the
Humanities, 36, (3), pp. 259-267.

Siemens, R. G. and Schreibman, S. (eds.) (2007) A Companion to Digital Literary Studies.
Malden, MA: Blackwell Publishers.

Simpson, J. A. and Weiner, E. S. C. (1989) The Oxford English Dictionary. Oxford: Clarendon
Press ; Oxford University Press.

Singhal, A., Buckley, C. and Mitra, M. (1996a) 'Pivoted document length normalization',
Proceedings of the 19th annual international ACM SIGIR conference on Research and
development in information retrieval, pp. 21-29.

Singhal, A., Salton, G., Mitra, M. and Buckley, C. (1996b) 'Document length normalization',
Information Processing & Management, 32, (5), pp. 619-633.

Smith, M. W. A. (1985) 'An investigation of Morton's method to distinguish Elizabethan
playwrights', Comput. Hum., 19, (1), pp. 3-21.

Smith, M. W. A. (1987) 'Hapax Legomena in Prescribed Positions: An Investigation of Recent
Proposals to Resolve Problems of Authorship', Lit Linguist Computing, 2, (3),
pp. 145-152.

Smith, M. W. A. (1992) 'Shakespeare, Stylometry and "Sir Thomas More"', Studies in Philology,
89, (4), pp. 434-444.

Smollett, T. G. (1818) Brambleton Hall, a novel, being a sequel to the celebrated
Expedition of Humphrey Clinker, by Tobias Smollet. London: T. H. Green.

Sneath, P. H. A. (1957) 'The application of computers to taxonomy', Journal of General
Microbiology, 17, pp. 201-226.

Sneath, P. H. A. and Sokal, R. R. (1973) Numerical Taxonomy: The Principles and Practice of
Numerical Classification. San Francisco: W. H. Freeman.

Snell, K. D. M. (1998) 'The Regional Novel: Themes for the Interdisciplinary Research', in Snell,
K. D. M. (ed), The Regional Novel in Britain and Ireland 1800-1990. Cambridge University
Press, pp. 1-53.

Snyder, S. (2001) 'The Genres of Shakespeare’s Plays', in De Grazia, M. and Wells, S. W. (eds)
The Cambridge Companion to Shakespeare. Cambridge, U.K.: Cambridge University
Press, pp. xx, 328 p.

Sollors, W. (1993) The return of thematic criticism. Cambridge, Mass.: Harvard University Press.

Spärck Jones, K. (1972) 'A statistical interpretation of term specificity and its application in
retrieval', Journal of Documentation, 28, pp. 11-21.

Spiegel, A. (1973) 'Flaubert to Joyce: Evolution of a Cinematographic Form', NOVEL: A Forum
on Fiction, VI, pp. 229-43.

Spiridon, M. (1987) 'Literary criticism and the magnifying glass of sociology', Neohelicon, 14,
(2), pp. 53-60.

Spivey, T. R. (1954) 'Thomas Hardy's Tragic Hero', Nineteenth-Century Fiction, 9, (3),
pp. 179-191.

Srivastava, A. N. and Sahami, M. (eds.) (2009) Text Mining Classification, Clustering, and
Applications. Chapman and Hall.

Stageberg, N. C. (1981) An Introductory English Grammar. New York ; London: Holt, Rinehart
and Winston.

Stevenson, R. L. (1886) Kidnapped. [London]: Cassell & Company, Limited.

Stowe, H. B. (1859) The Minister's Wooing. New York: Derby and Jackson.

Stowe, H. B. (1897) Uncle Tom's Cabin. New York: T. Y. Crowell & company.

Sumner, R. (1981) Thomas Hardy, psychological novelist. London: Macmillan.

Svetlana, K. (2006) Hierarchical text categorization and its application to bioinformatics. thesis.
University of Ottawa.

Taeho, J. (2006) The implementation of dynamic document organization using the integration of
text clustering and text categorization. thesis. University of Ottawa.

Tambouratzis, G. and Vassiliou, M. (2007) 'Employing Thematic Variables for Enhancing
Classification Accuracy Within Author Discrimination Experiments', Lit Linguist
Computing, 22, (2), pp. 207-224.

Taylor, D. (1998) 'The Need for a Religious Literary Criticism', in Mahoney, J. L. (ed), Seeing
Into the Life of Things: Essays on Literature and Religious Experience. New York:
Fordham University Press.

Taylor, H. M. (1893) Euclid's Elements of Geometry, Books I-VI. Cambridge: The University
Press.

Thabet, N. (2005) 'Understanding the thematic structure of the Qur'an: an exploratory multivariate
approach', Proceedings of the ACL Student Research Workshop. Ann Arbor, Michigan,
Association for Computational Linguistics, pp.

Theodoridis, S. and Koutroubas, K. (2003) Pattern Recognition. Academic Press.

Thomas, J. (1999) Thomas Hardy, Femininity and Dissent: Reassessing the Minor Novels. New
York: Macmillan.

Todorov, T. (1975) The fantastic : a structural approach to a literary genre. Ithaca, N.Y.: Cornell
University Press.

Tukey, J. W. (1977) Exploratory Data Analysis. [S.l.]: Addison Wesley.

Tuttle, L. (1986) Encyclopedia of feminism. Harlow: Longman.

Unsworth, J. (2000) 'Scholarly Primitives: What Methods do Humanities Researchers Have in
Common, and How Might Our Tools Reflect This?', Symposium on Humanities
Computing: Formal Methods, Experimental Practice, King’s College, London.

Venables, W. N. and Ripley, B. D. (2002) Modern applied statistics with S. New York: Springer.

Wake, W. C. (1957) 'Sentence-length Distributions of Greek Authors', Journal of the Royal
Statistical Society, Series A, (120), pp. 331-346.

Waldoff, L. (1979) 'Psychological Determinism in Tess of the d’Urbervilles', in Kramer, D. (ed),
Critical Approaches to the Fiction of Thomas Hardy. London: Macmillan, pp. 135-54.

Wang, W. (2007) An Empirical Study on Hierarchical Text Categorization. Master's
thesis. The University of Guelph.

Ward, J. H., Jr. (1963) 'Hierarchical Grouping to Optimize an Objective Function', Journal of the
American Statistical Association, 58, (301), pp. 236-244.

Watt, G. (1984) The fallen woman in the nineteenth-century English novel. London; Totowa, N.J.:
Croom Helm; Barnes & Noble Books.

Weber, C. J. (ed.) (1935) An Indiscretion in the Life of an Heiress. Hardy's "lost novel" now first
printed in America and edited with introduction and notes by Carl J. Weber. Baltimore,
MD.: The Johns Hopkins Press.

Weiping, W., Peng, C. and Bowen, L. (2008) 'A Self-Adaptive Explicit Semantic Analysis
Method for Computing Semantic Relatedness Using Wikipedia', Proceedings of the 2008
International Seminar on Future Information Technology and Management Engineering.
IEEE Computer Society, pp.

Widdowson, P. (1989) Hardy in history : a study in literary sociology. London ; New York:
Routledge.

Widdowson, P. (1998) On Thomas Hardy : late essays and earlier. Basingstoke: Macmillan.

Widdowson, P. (2009) '"........Into the Hands of Pure-minded English Girls": Hardy's Short Stories
and the Late Victorian Literary Marketplace', in Wilson, K. (ed), A companion to Thomas
Hardy. Malden, MA: Wiley-Blackwell Pub., pp. 364-378.

William, N. (2006) Social Research Methods. SAGE Publications Ltd.

Williams, M. (1974) Thomas Hardy and Rural England. London: Macmillan.

Williams, R. (1970) 'Thomas Hardy', in The English Novel from Dickens to Lawrence. London:
Chatto & Windus.

Wilson, H. G., Boots, B. and Millward, A. A. (2002) Geoscience and Remote Sensing
Symposium, 2002. IGARSS '02. 2002 IEEE International.

Wilson, K. (2009) A companion to Thomas Hardy. Malden, MA: Wiley-Blackwell Pub.

Windle, B. C. A. (1902) The Wessex of Thomas Hardy. London: J. Lane.

Wishart, D. (1998) ClustanGraphics [Computer Software] (3)

Wolters, M. and Kirsten, M. (1999) 'Exploring the Use of Linguistic Features in Domain and
Genre Classification', Proceedings of the ninth conference on European chapter of the
Association for Computational Linguistics. Bergen, Norway, Association for
Computational Linguistics, pp. 142-9.

Wotton, G. (1985) Thomas Hardy: towards a Materialist Criticism. Goldenbridge, Ireland: Gill
and Macmillan ; Barnes & Noble.

Wright, T. R. (1984) 'Middlemarch as a Religious Novel, or Life Without God', in Jasper, D. (ed),
Images of Belief in Literature. Macmillan, pp. 138-52.

Wright, T. R. (1989) Hardy and the Erotic. Palgrave Macmillan.

Wright, T. R. (2003) Hardy and his readers. New York: Palgrave Macmillan.

Xiao, Z. and McEnery, A. (2005) 'Two Approaches to Genre Analysis: Three Genres in Modern
American English', Journal of English Linguistics, 33, (1), pp. 62-82.

Yu, B. (2008) 'An Evaluation of Text Classification Methods for Literary Study', Lit Linguist
Computing, 23, (3), pp. 327-343.

Yu, B. and Unsworth, J. (2006) 'Toward Discovering Potential Data Mining Applications in
Literary Criticism', Digital Humanities, 5-9 July 2006. Paris-Sorbo, pp.

Yule, G. U. (1939) 'On Sentence-length as a Statistical Characteristic of Style in Prose: With
Applications to Two Cases of Disputed Authorship', Biometrika, 30, pp. 363-390.

Zeitler, M. A. (2007) Representations of Culture: Thomas Hardy's Wessex & Victorian
Anthropology. New York: Peter Lang.
