Authorship attribution is the process of looking for salient features in a literary work that relate the
work to its author. Craig (2004) points out that “authorship studies aim at ‘yes or no’ resolutions to
existing problems, and avoid perceptible features if possible, working at the base strata of language
The idea of authorship attribution is very old. Love (2002) says that it “reaches back as far as the
great library of Alexandria and embraces the formation of the Jewish and Christian biblical canons”
(2002:1). The motive behind authorship attribution studies is that many works were written
anonymously and many others raise suspicion about their real author, and historical evidence is
sparse or lacking. Many of Shakespeare’s plays, for instance, have been attributed to other writers, while
plays by other writers have been attributed to Shakespeare. Authorship attribution
where it was not “the work of a specialist in authorship but of a scholar for whom the determination
of authorship has repeatedly been a crucial element in other kinds of investigation” (Love, 2002:1).
There are many examples where the task of identifying the author of a particular document was the
job of politicians, journalists, and lawyers (Juola, 2008; Juola et al., 2006). Studies in this tradition
often used criteria for relating works to authors on chronological and epistemological bases. One
problem with such methods is that it is often difficult to find reliable historical facts or knowledge-
In the face of this limitation, stylistic analysis was proposed: “In the past few decades, stylometric
studies have been mainly based on statistical methods in order to reach solid conclusions regarding
the authorship of a given text” (Tambouratzis and Vassiliou, 2007). In principle, stylistic testing is
based on the assumption that every author has a unique style by which he or she can be identified. Studies
of this kind employ different strategies to argue that particular linguistic and stylistic
properties are peculiar to a particular author, to a particular literary work, or to a particular stage in
an author’s literary career. Despite their richness, such studies are often criticized as
subjective. Morton (1965) explains that one main problem with traditional stylistic tests for
authorship attribution is subjectivity: “A critic draws up a list of genuine works, using literary style
as an important criterion in his judgment. Asked how he knows the works to be genuine, he can only
reply, “I see in them the mind and style of the author and the external evidence agrees with this
judgment”. If you then ask him how he knows the mind and style of the author he can only say, “I see
them in the genuine works”. So a large part, and it may be the decisive part, of his analysis is founded
One way to overcome the problem then was to look for some internal evidence within the texts
through quantitative methods. The chief merit of this quantitative analysis is that it is objective.
Additionally, it makes use of the advantages of traditional stylistic analysis. The rapid advancements
development of quantitative authorship analysis as a distinct discipline of knowledge. This has been
authorship attribution (Juola et al., 2006). The question asked thus concerning the reliability of
anonymous and controversial texts can be backed up with objective evidence derived from
There is no definite answer to this question. The authorship literature indicates that there are two
contradictory views concerning the reliability of computational techniques in such studies. Juola et
al. (2006) support the idea that the application of computational approaches to authorship
attribution is neither reliable nor well understood. Rudman (1997) points out that stylometric
authorship attribution is often criticized because an approach that succeeds in one application cannot be
appropriately applied to other genres or languages. Smith (1992) adds that the question of who
wrote a given text remains exactly as it was before any stylometric approach is undertaken. On
the other hand, many scholars (Burrows, 2003; Love, 2002; Hoover, 2001; Holmes, 1998;
Smith, 1987) provide much evidence for the validity, reliability, and efficiency of computational
techniques in authorship attribution studies. They generally agree that with many traditional
fundamental questions concerning authorship attribution remaining unresolved and the feasibility
techniques for dealing with such issues. Wells (1996) even argues that the investigation of
authorship of disputed or dubious texts relies more on statistical analysis than on literary
investigation. He also believes that statisticians are more likely to investigate authorship since they
are better equipped to deal with the realms of mathematics and statistics.
Apart from the argument concerning the utility of computational tools in authorship attribution
studies, the literature indicates that an impressive body of computer-based work has been done on
authorship studies over the last five decades by means of stylometric approaches.
The basic assumption behind stylometric testing of authorship attribution is, as Holmes (1998)
contends, that “authors have an unconscious aspect of their style, a style which cannot consciously be
manipulated but which possesses features which are quantifiable and which may be distinctive”
(1998: 111), and that the identification of such distinctive personal stylistic features makes it possible to
detect an author’s signature and distinguish the writing of one author from that of others. To
perform this task, stylometrists often employ multivariate techniques that range from frequency
distributions of frequently used words to discriminant analysis in order to investigate linguistic
and stylistic features within texts that can serve as indicators of authorship. The use of
sentence length to determine who wrote a text of uncertain authorship is one of the oldest
quantitative methods in the field. The test is thought to have been first proposed by Yule (1939). The main
assumption of this approach is that sentence lengths can be good indicators for authorship
determination (Sichel, 1974; Morton, 1965; Wake, 1957; Yule, 1939). However, recent stylometric
studies are more concerned with both frequent and rare expressions. The section below gives an
The search for the most frequent words has been one of the most widely used methods for
determining the author of a given work (Burrows, 2007; Burrows, 2003). Garcia and Martin (2007)
explain that statisticians have attempted over the last decade to solve some controversial authorship
problems by finding a formula grounded on the computation of tokens, word types, and the most
frequently used words. They contend that computational statisticians have tended to investigate
what they call the ‘Lexical Richness’ of authors in order to propose a reliable approach to authorship
attribution. The authors indicate that although there have been many studies that investigate the
morphological and syntactic properties of authors as well, the study of the lexical properties remains
the most widely used in spite of the fact that there is no consensus on results, nor is there consensus
as to the accepted or correct methodology or technique. Although the idea that texts of a doubtful
origin can be attributed to their true authors by counting the occurrences of peculiar properties had
been developed long before the advent of the computer, the computer enables us to perform
authorship attribution tasks more swiftly and accurately. In almost all such applications,
multivariate analysis techniques including principal component analysis (PCA), factor analysis,
discriminant analysis, and cluster analysis have been successfully used (Burrows, 2007; Burrows, 2003;
Holmes, 1998).
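
The quantities named above (tokens, word types, and the most frequently used words) can be illustrated with a minimal Python sketch. It is not the formula of Garcia and Martin (2007) or of any cited study: the regular-expression tokenizer, the function name `lexical_profile`, and the use of the type-token ratio as a stand-in for lexical richness are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens.

    Assumption: a word is a run of ASCII letters or apostrophes.
    Real studies usually apply more careful tokenization.
    """
    return re.findall(r"[a-z']+", text.lower())


def lexical_profile(text, top_n=5):
    """Return token count, type count, type-token ratio, and the
    relative frequencies of the top_n most frequent words."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    n_tokens = len(tokens)
    n_types = len(counts)
    # Type-token ratio: one simple (assumed) measure of lexical richness.
    ttr = n_types / n_tokens if n_tokens else 0.0
    # Relative frequency of each of the most frequent words.
    mfw = {word: count / n_tokens for word, count in counts.most_common(top_n)}
    return {"tokens": n_tokens, "types": n_types, "ttr": ttr, "mfw": mfw}


sample = "the cat sat on the mat and the dog sat on the rug"
profile = lexical_profile(sample, top_n=3)
print(profile["tokens"], profile["types"], round(profile["ttr"], 3))
print(profile["mfw"])
```

In practice, the relative frequencies of the same set of frequent words would be computed for every text under comparison, and the resulting table of texts by words would then be passed to a multivariate technique such as PCA or cluster analysis. Note that the type-token ratio depends on text length, so comparisons are only meaningful between samples of similar size.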
The main assumption behind this testing is that the use of rare words is a good indicator of
the author of a given text. The basic argument is that the use of rare words enables one
writer to be distinguished from another. Morton (1986) explains: “The once occurring words convey
many of the elements thought to show excellence in writing, the range of a writer's interests, the
precision of his observation, the imaginative power of his comparisons, they demonstrate his
command of rhythm and of alternations” (1986: 1). To put it simply, rare words are quite noticeable,
which makes it easier and more accurate to use them as indicators for determining authors.
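
As a rough illustration of how the "once occurring words" that Morton describes (commonly called hapax legomena) might be counted, the following Python sketch lists the words that appear exactly once in a text and reports their share of all tokens. The tokenizer, the function name `hapax_profile`, and the choice of the token count as the denominator are assumptions for illustration, not a method taken from the cited studies.

```python
import re
from collections import Counter


def hapax_profile(text):
    """Return the sorted list of once-occurring words and their
    share of all tokens in the text.

    Assumption: a word is a run of ASCII letters or apostrophes,
    compared case-insensitively.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    # Keep only the words that occur exactly once.
    hapaxes = sorted(word for word, count in counts.items() if count == 1)
    ratio = len(hapaxes) / len(tokens) if tokens else 0.0
    return hapaxes, ratio


sample = "the cat sat on the mat and the dog sat on the rug"
words, ratio = hapax_profile(sample)
print(words)
print(round(ratio, 3))
```

As with other vocabulary measures, the proportion of once-occurring words varies with text length, so texts of comparable size should be used when comparing candidate authors.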