
Authorship attribution

Authorship attribution is the process of looking for salient features in a literary work that relate the

work to its author. Craig (2004) points out that “authorship studies aim at ‘yes or no’ resolutions to

existing problems, and avoid perceptible features if possible, working at the base strata of language

where imitation or deliberate variation can be ruled out” (2004: 273).

The idea of authorship attribution is very old. Love (2002) says that it “reaches back as far as the

great library of Alexandria and embraces the formation of the Jewish and Christian biblical canons”

(2002:1). The motive behind authorship attribution studies is that many works were written

anonymously and many others raise suspicion about their real author, and historical evidence is

sparse or lacking. Many of Shakespeare’s plays, for instance, have been attributed to other writers, while

plays by other writers have been attributed to Shakespeare. Authorship attribution

studies have been developed to assign such works correctly.

Traditionally, work on authorship attribution was conceived as an organized scholarly enterprise

where it was not “the work of a specialist in authorship but of a scholar for whom the determination

of authorship has repeatedly been a crucial element in other kinds of investigation” (Love, 2002:1).

There are many examples where the task of identifying the author of a particular document was the

job of politicians, journalists, and lawyers (Juola, 2008; Juola et al., 2006). Studies in this tradition

often used criteria for relating works to authors on chronological and epistemological bases. One

problem with such methods is that reliable historical facts or knowledge-based evidence that would help

identify authors are not always available.

In the face of this limitation, stylistic analysis was proposed: “In the past few decades, stylometric

studies have been mainly based on statistical methods in order to reach solid conclusions regarding

the authorship of a given text” (Tambouratzis and Vassiliou, 2007). In principle, stylistic testing is

based on the assumption that every writer has a unique style by which he or she can be identified. Stylistic critics

employ different strategies to argue that particular linguistic and stylistic

properties are peculiar to a particular author, to a particular literary work, or to a particular stage in

an author’s literary career. Despite their richness, studies of this kind are often criticized for being

subjective. Morton (1965) explains that one main problem with traditional stylistic tests for

authorship attribution is subjectivity: “A critic draws up a list of genuine works, using literary style

as an important criterion in his judgment. Asked how he knows the works to be genuine, he can only

reply, “I see in them the mind and style of the author and the external evidence agrees with this

judgment”. If you then ask him how he knows the mind and style of the author he can only say, “I see

them in the genuine works”. So a large part, and it may be the decisive part, of his analysis is founded

upon a circular argument” (1965: 169).

One way to overcome the problem then was to look for some internal evidence within the texts

through quantitative methods. The chief merit of this quantitative analysis is that it is objective.

Additionally, it makes use of the advantages of traditional stylistic analysis. The rapid advancements

of quantitative investigations for resolving controversial authorship problems helped in the

development of quantitative authorship analysis as a distinct discipline of knowledge. This has been

known as non-traditional authorship attribution, stylometric authorship attribution, or simply

authorship attribution (Juola et al., 2006). The question that arises concerning the reliability of

computational methods in authorship attribution studies is this: can speculations concerning

anonymous and controversial texts be backed up with objective evidence derived from

computational statistical analyses?

There is no definite answer to this question. The authorship literature indicates that there are two

contradictory views concerning the reliability of computational techniques in such studies. Juola et

al. (2006) support the idea that the application of computational approaches to authorship

attribution is neither reliable nor well understood. Rudman (1997) points out that stylometric

authorship attribution is often criticized because a method that succeeds in one case cannot

necessarily be applied to other genres or languages. Smith (1992) adds that the question of who

wrote a given text remains exactly as it was before any stylometric approach was undertaken. On

the other hand, many scholars (Burrows, 2003; Love, 2002; Hoover, 2001; Holmes, 1998;

Smith, 1987) provide much evidence for the validity, reliability, and efficiency of computational

techniques in authorship attribution studies. They generally agree that with many traditional

fundamental questions concerning authorship attribution remaining unresolved and the feasibility

of computer-based technology being available, it is becoming imperative to go beyond traditional

techniques for dealing with such issues. Wells (1996) even argues that the investigation of

authorship of disputed or dubious texts relies more on statistical analysis than on literary

investigation. He also believes that statisticians are more likely to investigate authorship since they

are better equipped to deal with mathematics and statistics.

Apart from the argument concerning the utility of computational tools in authorship attribution

studies, the literature indicates that an impressive body of computer-based work has been done on

authorship studies over the last five decades by means of stylometric approaches.

The basic assumption behind stylometric testing of authorship attribution, as Holmes (1998)

contends, is that “authors have an unconscious aspect of their style, a style which cannot consciously be

manipulated but which possesses features which are quantifiable and which may be distinctive”

(1998: 111). Identifying such personally distinctive stylistic features makes it possible to

detect an author’s signature and distinguish the writing of one author from that of others. To

perform this task, stylometrists often employ statistical techniques that range from frequency

distributions of frequently used words to discriminant analysis in order to investigate linguistic

and stylistic features within texts that can serve as indicators of authorship. The use of

sentence length to identify the author of a disputed text is one of the oldest

quantitative methods in the field. The test is thought to have been first proposed by Yule (1939). The main

assumption of this approach is that sentence lengths can be good indicators of authorship

(Sichel, 1974; Morton, 1965; Wake, 1957; Yule, 1939). However, recent stylometric

studies are more concerned with both frequent and rare expressions. The sections below give an

overall summary of these two methods.
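
As a rough illustration of the sentence-length test mentioned above, the following Python sketch computes a simple sentence-length profile for a text. It is not the procedure used by Yule or the other cited authors. The function names, the naive sentence splitter, and the sample text are all assumptions made for illustration.

```python
import re
from statistics import mean, stdev


def sentence_lengths(text):
    """Return the length in words of each sentence in `text`.

    Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    Abbreviations such as 'Dr.' are not handled by this naive rule.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [len(s.split()) for s in sentences if s.strip()]


def length_profile(text):
    """Summarize sentence lengths as a count, a mean, and a standard deviation."""
    lengths = sentence_lengths(text)
    if not lengths:
        return {"sentences": 0, "mean": 0.0, "stdev": 0.0}
    return {
        "sentences": len(lengths),
        "mean": mean(lengths),
        # The sample standard deviation needs at least two sentences.
        "stdev": stdev(lengths) if len(lengths) > 1 else 0.0,
    }


# Invented sample text, used only to show the shape of the output.
sample = "It was late. The rain had stopped and the street was quiet. Nobody came."
profile = length_profile(sample)
print(sentence_lengths(sample), profile)
```

In practice such profiles would be computed for texts of known authorship and for the disputed text, and then compared. How large a difference counts as evidence is a statistical question that this sketch does not address.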

 The use of frequency counts of common words

The search for the most frequent words has been one of the most widely used methods for

determining the author of a given work (Burrows, 2007; Burrows, 2003). Garcia and Martin (2007)

explain that statisticians attempted over the last decade to solve some controversial authorship

problems by finding a formula grounded on the computation of tokens, word-types, and most

frequently-used words. They contend that computational statisticians have tended to investigate

what they call the ‘Lexical Richness’ of authors in order to propose a reliable approach to authorship

attribution. The authors indicate that although there have been many studies that investigate the

morphological and syntactic properties of authors as well, the study of the lexical properties remains

the most widely used in spite of the fact that there is no consensus on results, nor is there consensus

as to the accepted or correct methodology or technique. Although the idea that texts of a doubtful

origin can be attributed to their true authors by counting the occurrences of peculiar properties had

been developed long before the advent of the computer, the computer enables us to perform

authorship attribution tasks more swiftly and accurately. In almost all such applications,

multivariate analysis techniques, including principal component analysis (PCA), factor analysis, discriminant analysis, and cluster

analysis, have been successfully used (Burrows, 2007; Burrows, 2003; Holmes, 1998).
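
The counts of tokens, word-types, and most frequent words described above can be sketched in a few lines of Python. This is a minimal illustration, not the formula of any cited study. The tokenizer, the function names, and the sample sentence are assumptions, and the type-token ratio stands in here as one simple, commonly used measure of lexical richness.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a naive rule)."""
    return re.findall(r"[a-z']+", text.lower())


def type_token_ratio(tokens):
    """Number of distinct word-types divided by the number of tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def top_words(texts, n):
    """Return the n most frequent words across all the given texts."""
    pooled = Counter()
    for text in texts:
        pooled.update(tokenize(text))
    return [word for word, _ in pooled.most_common(n)]


def frequent_word_profile(tokens, vocabulary):
    """Relative frequency of each vocabulary word in one text.

    The resulting vector is the kind of input that multivariate methods
    such as PCA or cluster analysis would take, one vector per text.
    """
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[word] / total if total else 0.0 for word in vocabulary]


# Invented sample sentence, used only for illustration.
text_a = "the cat sat on the mat and the dog sat by the door"
tokens_a = tokenize(text_a)
vocabulary = top_words([text_a], 2)
print(vocabulary, type_token_ratio(tokens_a), frequent_word_profile(tokens_a, vocabulary))
```

Real studies typically work with far longer texts and a vocabulary of many frequent words. The type-token ratio is also known to depend on text length, so comparisons are usually made between samples of similar size.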

 The use of rare words

The main assumption behind this test is that the use of rare words is a good indicator of

the author of a given text. The basic argument is that the use of rare words enables one

writer to be distinguished from another. Morton (1986) explains: “The once occurring words convey

many of the elements thought to show excellence in writing, the range of a writer's interests, the
precision of his observation, the imaginative power of his comparisons, they demonstrate his

command of rhythm and of alternations” (1986: 1). To put it simply, rare words are quite noticeable,

which makes them easier and more accurate to use as indicators for determining authors.
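
The "once occurring words" that Morton refers to can be extracted with a simple count. The Python sketch below is illustrative only. The function names, the tokenization rule, the choice to measure once-occurring words as a share of distinct word-types, and the sample sentence are assumptions, not Morton's own procedure.

```python
import re
from collections import Counter


def word_counts(text):
    """Count lowercase alphabetic word tokens in `text` (a naive rule)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def once_occurring_words(text):
    """Return, in alphabetical order, the words that occur exactly once."""
    return sorted(word for word, count in word_counts(text).items() if count == 1)


def once_occurring_ratio(text):
    """Share of distinct word-types that occur exactly once in the text."""
    counts = word_counts(text)
    if not counts:
        return 0.0
    return len(once_occurring_words(text)) / len(counts)


# Invented sample sentence, used only for illustration.
sample = "the quick fox saw the lazy dog and the dog saw nothing"
print(once_occurring_words(sample), once_occurring_ratio(sample))
```

Because the number of once-occurring words grows with text length, such ratios are meaningful only when the texts being compared are of similar size.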
