Oxford Handbooks Online

Author Identification In The Forensic Setting
Carole E. Chaski
The Oxford Handbook of Language and Law
Edited by Lawrence M. Solan and Peter M. Tiersma

Print Publication Date: Mar 2012 Subject: Linguistics, Forensic Linguistics


Online Publication Date: Nov 2012 DOI: 10.1093/oxfordhb/9780199572120.013.0036

Abstract and Keywords

Author identification can play a role in the investigation of many different types of crimes,
civil transactions, and security issues. Over the last fifteen years, author identification
has become increasingly recognized by research-funding agencies, investigative
agencies, the forensic science community, and the courts, as a forensic science that deals
with pattern evidence. Forensic authorship identification methods, despite their
differences, require that similar steps be taken: choosing an appropriate linguistic level,
coding, engaging in statistical analysis and decision making, and conducting validation
testing. This article presents a review of currently available and emerging methods for
identifying authorship in the forensic setting, presents some current methods based on
the categorization scheme, and provides some ideas about why some methods fail to
obtain high accuracy. Finally, it considers two methods that are currently used in criminal
investigations and trials: forensic stylistics and software-generated analysis.

Keywords: author identification, forensic science, pattern evidence, coding, statistical analysis, decision making,
validation testing, categorization scheme, forensic stylistics, software-generated analysis

A “No” uttered from the deepest conviction is better than a “Yes” merely uttered
to please, or worse, to avoid trouble.

Mahatma Gandhi

Author identification can play a role in the investigation of many different types of crimes,
civil transactions, and security issues. Some documents have an obvious legal
significance: anonymous threatening letters, ransom notes, anonymous missives to the
Securities and Exchange Commission. But in a more subtle way, other documents—
diaries, business and personal emails, business memoranda, personal letters, employee
manuals, blog posts, website copy—can also directly relate to determining a person's
involvement in a situation with criminal, civil, or security implications.

Authorship identification methods have developed out of many different disciplines, with
each one bringing its own set of literary, historical, linguistic, or computational tools to
the workbench (Chaski 1998, 2008). A multitude of potentially useful methods are
available. Within the computational-linguistic paradigm, research in author identification
has boomed since 2000. Stamatatos (2009), Koppel et al. (2009), and Juola (2008, 2009)
provide reviews of this activity.

Forensic authorship identification poses specific challenges to any method. First, there is
usually far less data than non-forensic contexts require. The huge novels-and-newspapers
datasets used for comparison in solving non-forensic problems allow non-forensic
methods to test ideas that could never be implemented with forensic datasets. Second,
the data are far less clean or grammatical than non-forensic methods require. Again,
novels and newspapers provide clean, grammatical data. Most standard tools for natural
language processing or content analysis expect data to be spelled correctly and
grammatically composed. Forensic methods should expect just the opposite. Third,
(p. 490) the method must meet legal requirements that most non-forensic methods need never consider. Error rates are generally foreign to literary discussions but essential to
forensic methods in the post-Daubert judicial system. Thus forensic authorship
identification methods can and should draw upon the research and insights of non-
forensic studies, but in the end methods employed in forensic authorship identification
must be shaped, evaluated, and vetted within the forensic science community and
approved by the courts.

Over the last 15 years, author identification has become increasingly recognized by
research funding agencies, investigative agencies, the forensic science community, and
the courts, as a forensic science that deals with pattern evidence. Pattern evidence relies
on the identification of patterns on suspect or questioned material that can be matched
with patterns on comparable known material. It includes both biometric and non-
biometric data, such as fingerprints, palmprints, handwriting, voice, gait, shoe prints, tire
treads, toolmarks, and ballistics.

Any analysis of pattern evidence requires methods that can answer the following
questions:

(1) What kinds of patterns are significant and reliable?
(2) Are these patterns detectable or countable in specific ways that can be taught, learned, and processed by humans and/or machines?
(3) What amounts of data are required to get significant and reliable patterns?
(4) Are there other conditions, besides quantity, which affect the recovery of significant and reliable patterns from data?

This chapter presents a review of currently available and emerging methods for
identifying authorship in the forensic setting. In Section 35.1, features which help to
compare and contrast current methods are presented. Section 35.1 also covers validation
testing methodology. Section 35.2 presents some current methods based on the
categorization scheme in Section 35.1. Section 35.2 also provides some ideas about why
some methods fail to obtain high accuracy. Section 35.3 provides some future directions
for forensic author identification. It is an exciting time for forensic author identification
because the field poses so many research questions, and validation testing provides a
clear paradigm in which to answer them.

35.1 Comparing and Contrasting Current Methods

Forensic authorship identification methods, despite their differences, require that similar
steps be taken: choosing an appropriate linguistic level, coding, engaging in statistical
analysis and decision making, and conducting validation testing.


(p. 491) 35.1.1 Linguistic level: which linguistic units are used?

Linguists often focus their analysis on specific linguistic levels, such as the phonemic,
morphemic, lexical, syntactic, semantic, discursive, and pragmatic. Forensic author
identification methods, which deal with written data, have focused on analytical units at
the character, word, sentence, and text levels. At the character level, analytical units
include single characters, punctuation marks, or character-level n-grams (units of
adjacent characters). At the word level, analytical units can be function words, content
words, word-level n-grams (units of adjacent words), lexical semantics, lexical overlap,
vocabulary richness, average word length, and variants thereof. Sentence-level analytical
units include part of speech (POS) tags, tag-level n-grams, constituent structures,
anaphoric dependencies, marked and unmarked constituent structures, sentence type,
average sentence length, and variants thereof. At the text level, analytical units can be
text length, paragraph length, and discourse strategies.
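
To make these analytical units concrete, here is a minimal sketch in Python of character- and word-level feature extraction. The sample sentence, the n-gram sizes, and the whitespace tokenizer are illustrative choices, not part of any published method.

```python
from collections import Counter

text = "I'll meet him at the libary tonight."

# Character-level 3-grams: units of three adjacent characters,
# including punctuation and spaces.
char_trigrams = Counter(text[i:i + 3] for i in range(len(text) - 2))

# Word-level units: tokens, average word length, word 2-grams.
tokens = text.split()
avg_word_length = sum(len(t) for t in tokens) / len(tokens)
word_bigrams = Counter(zip(tokens, tokens[1:]))

print(char_trigrams.most_common(3))
print(round(avg_word_length, 2))
print(word_bigrams.most_common(3))
```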

For some linguistic levels, it is extremely easy to detect patterns by machine. For
instance, character and word level features can almost always be extracted and analyzed
automatically, even for messy forensic data. For these reasons, methods based on
character and word features will be highly valued if and when they are shown to work
reliably.

Other linguistic levels are fairly difficult for machine pattern detection with forensic data.
Sentence level features can be extracted automatically, but most parsers cannot deal with
messy forensic data very well. An intern at the Institute for Linguistic Evidence was
testing the Stanford parser, one of the state-of-the-art parsers available, and discovered,
much to his chagrin, that the parser could not properly parse the sentence: “I’ll meet him
at the libary tonight.” The parser tagged “libary” as an adjective, “tonight” as a noun, and
the prepositional phrase as “at the libary tonight.” Most parsers are built to generate
grammatical tags and constituent structures based on morphological and positional
information. The morpheme “-ary” is very common for adjectives, and clearly the parser
was tripped by the misspelling of “libary.”
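
Anyone can reproduce this kind of check with an off-the-shelf tagger. The sketch below uses NLTK's default English tagger as a stand-in for the Stanford parser discussed above; the exact tags assigned to "libary" will vary by tagger, model, and NLTK version, so the output should be read as diagnostic rather than definitive.

```python
import nltk

# One-time model downloads; resource names vary across NLTK versions.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

sentence = "I'll meet him at the libary tonight."
print(nltk.pos_tag(nltk.word_tokenize(sentence)))
# Taggers trained on clean, edited text lean on morphology and position,
# so an out-of-vocabulary form like "libary" (cf. adjectival "-ary")
# is liable to be mis-tagged.
```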

Some discourse features can be even more difficult to extract automatically, while other
discourse features are so clearly lexicalized that they can be extracted accurately.
Consider, for instance, how lexicalized politeness is in contrast to irony or sarcasm.

Since features on all linguistic levels might contribute to a very robust method of author
identification, both machine and human coding may be necessary for high accuracy, and
more importantly, for the minimization of mechanical errors. On the other hand, totally
automated systems might attain slightly less accuracy than mixed machine–human
methods, but fulfill a screening or ranking operation, as will be suggested below.

35.1.2 Coding: how is the linguistic analysis recorded?


Coding refers to the spectrum of methods for keeping track of linguistic features in a
text. Coding includes: a list of examples, binary codes for the presence or absence of
features, frequency counts (or even better, frequency counts normalized to text length so
(p. 492) that texts of different lengths can be directly compared), and a mixture of the frequency counts and binary codes.

Example listing and presence/absence codes are purely qualitative (and also known as
binary variables). Frequency counts are purely quantitative (also known as interval or
count variables, depending on whether they have been normalized or not). A method
might exploit both quantitative and qualitative coding, and some statistical procedures,
for example logistic regression, allow for both interval and binary types of variables.
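
A few lines of Python illustrate the coding options side by side. The two known texts and the crude apostrophe-plural regex are invented for the example; note that the regex will also catch possessives, which a real coding scheme would have to separate out.

```python
import re

texts = {
    "K1": "The vendetta's continue. Check the FAQ's and the doc's.",
    "K2": "The vendettas continue. Check the FAQs.",
}
pattern = re.compile(r"\b\w+'s\b")  # crude: also matches possessives

for name, text in texts.items():
    examples = pattern.findall(text)                  # 1. example listing
    present = int(bool(examples))                     # 2. binary presence/absence
    count = len(examples)                             # 3. raw frequency count
    per_100_words = 100 * count / len(text.split())   # 4. normalized to length
    print(name, examples, present, count, round(per_100_words, 2))
```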

Sometimes analysts conflate example listing and frequency counting in their analysis.
Recently I was asked to review an authorship identification report. There were four
potential authors whose known documents were compared to the questioned document,
an anonymous whistleblower letter. The client for whom the report was issued claimed
that of the four potential authors, A, B, C, and D, the questioned document was authored
by A or B. The first selected stylemarker was the presence of an apostrophe in plurals, for
example “vendetta's, fee's, doc's, FAQ's.” The report listed an example of this in the
questioned document, and then listed examples of this in the known documents from
authors A, B, C, and D. For author D, the report states that the pluralizing apostrophe is
attested but not frequently. The examples were not counted in any of the known authors.
The report states that authors A, B, and C “frequently use apostrophe and s for plurals on
nouns.” With this feature as one of the supporting arguments, the report concludes that
authors A and/or B authored the questioned document. How did authors C and D—who
each also exhibit this feature—not get caught in the same net that caught A and B? If
frequency really mattered, then counts should have been recorded and compared to some
baseline of pluralizing apostrophes in current American English. Since the mere presence
of a pluralizing apostrophe cannot differentiate the four authors, while frequency might really serve as the differentiator among the authors, the counts should have been
taken and subjected to some kind of statistical analysis. If binary coding (presence/
absence) is applied, the four authors cannot be differentiated and two cannot be singled
out as authors. This vignette illustrates how important coding schemes can be, and how
they are connected to decision making and statistical analysis.

35.1.3 Statistical analysis and decision making: how is the identification/elimination made?

Whether a method does or does not use any statistical analysis, all methods require some
kind of decision making. The decision making might be as simple as majority rules (there
are more matches to the questioned document in X than in Y, so X is counted as the
author of the questioned document). For example, the examiner could make the decision
about authorship based on his/her listing of matching points (i.e. similar examples)
between a set of known documents, or segments of one long known document, and a
questioned document.

Alternatively, the decision making could be the output of a statistical procedure, (p. 493) or the statistical output along with other qualitative observations which align with or
contradict the statistical procedure. Statistical procedures are applied to the coding of
linguistic analytical units in a set of texts and determine the decision making for
identification or elimination (inclusion or exclusion).

Using statistical procedures for decision making is one way to minimize the possibility of
confirmation bias—the propensity to reach conclusions that confirm the researcher's
initial hypothesis. Confirmation bias can influence, first, the selection of features; second, the coding of features; and, finally, the decision
making. The example of the pluralizing apostrophe, described earlier, shows confirmation
bias at each of these analytical stages.

Statistical procedures can be fairly simple or complex. For example, the decision making
could be based on a threshold: X exceeds the threshold, Y does not, so X wins, or based
on a rank: X has a lower/higher score than Y, so X wins.

More complicated statistical classifiers actually make the decision regarding authorship
by assigning the questioned document to one of the known author classes. Advantages to
using statistical classifier output as the basis for decision making are objectivity;
simultaneous use of multiple variables; and simplicity. The statistical output is based on
the coding from the data at hand; an analyst cannot easily predict what the statistical
output will be just by eyeballing the coding, especially if the features being coded are
abstract. Since almost all methods use multiple variables (on one or more levels of
linguistic analysis), the classification is multivariate: this kind of hyperdimensional
feature space is usually very difficult for us to visualize but easy for multivariate statistics
to handle. Finally, reliance on the statistical output for decision making is simple, as long
as some standard protocols have been developed. For instance, if the accuracy of a
statistical classifier is only 70 percent, then a standard protocol could preclude going to
the next step of classifying any questioned documents, based on a model which has such
low accuracy on known documents.
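
To make classifier-based decision making concrete, here is a hedged sketch: a linear discriminant model is fitted to fabricated feature vectors for two known authors and then assigns a questioned document. The numbers stand in for normalized counts of coded features and carry no empirical weight.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Invented feature vectors for documents by two known authors.
X_known = np.array([
    [2.1, 0.4, 1.3], [1.9, 0.5, 1.1], [2.3, 0.3, 1.4],   # author A
    [0.7, 1.8, 0.2], [0.9, 1.6, 0.4], [0.6, 1.9, 0.3],   # author B
])
y_known = ["A", "A", "A", "B", "B", "B"]

clf = LinearDiscriminantAnalysis().fit(X_known, y_known)

# The questioned document, coded with the same features.
x_questioned = np.array([[2.0, 0.5, 1.2]])
print(clf.predict(x_questioned))        # class assignment, e.g. ['A']
print(clf.predict_proba(x_questioned))  # posterior probabilities
```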

One disadvantage of statistical classifiers is that they do not perform well on some
datasets. Some statistical classifiers, logistic regression in particular, perform very poorly
or halt if there is a clear 100 percent separation between classes (documents or authors).
On the other hand, some statistical classifiers such as discriminant function analysis work
perfectly when there is complete separation between the classes (documents or authors)
but cannot work with binary data. Calculating frequency base rates is also tricky because
the multiplication rule (which is what allows DNA analysis to get such astronomically
small probabilities) can only be applied if the items are truly independent of each other.
For many linguistic features, this kind of strict independence cannot be justified. For
instance, multiplying the frequency of verbs and the frequency of particles does not meet
the independence criterion because some of the verbs require the particles, or the
particles are dependent on the verbs. So it is important to select the right kind of
statistical procedure for the kind of available data and coding.
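
A toy calculation shows why the multiplication rule breaks down for dependent features; all counts below are invented for illustration.

```python
# If particles depend on verbs, the observed rate of a verb+particle
# pair far exceeds the product of the individual rates.
tokens = 10_000           # corpus size (hypothetical)
count_verb = 150          # occurrences of a verb, e.g. "give"
count_particle = 120      # occurrences of a particle, e.g. "up"
count_pair = 90           # occurrences of the pair "give up"

p_verb = count_verb / tokens
p_particle = count_particle / tokens
expected_if_independent = p_verb * p_particle   # 0.00018
observed = count_pair / tokens                  # 0.009

print(expected_if_independent, observed)
# Observed is ~50x the independence prediction, so multiplying the
# marginal frequencies would drastically misstate the joint rate.
```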

Whether a statistical procedure is included in a method or not, any method (p. 494) should use cross-validation to check its conclusions, as discussed later. Cross-validation is one component of validation testing.

35.1.4 Results of validation testing: how well does the method work on ground truth data?

Forensic author identification methods can also be compared with respect to the results
of validation testing. Validation testing may be for the courts one of the most important
aspects of the method, since it speaks to the legal requirements for admissible scientific
evidence.

Validation testing is an empirical technique that determines how well a procedure works,
under specific and manipulable conditions, on a dataset containing samples with known
provenance and characteristics. Chaski (2009) identifies four steps for validation testing, illustrated in the sketch that follows the list:

(35.1.4.1) Get a database of known samples.
(35.1.4.2) Apply a repeatable analytical method to all texts in the database.
(35.1.4.3) Apply a cross-validation scheme.
(35.1.4.4) Compute the error rate based on hits and misses, analyzing the errors in relation to the amount of data and other characteristics of the data.
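
Here is a minimal end-to-end sketch of the four steps, under the assumption that the analytical method of step 35.1.4.2 has already reduced each known-author document to a numeric feature vector; the vectors and labels below are placeholders, not real data.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# Step 1: ground truth data -- feature vectors with known author labels.
X = np.array([[2.1, 0.4], [1.9, 0.5], [2.3, 0.3], [2.0, 0.6],   # author A
              [0.7, 1.8], [0.9, 1.6], [0.6, 1.9], [0.8, 1.7]])  # author B
y = np.array(["A"] * 4 + ["B"] * 4)

# Steps 2-3: apply a repeatable classifier under leave-one-out
# cross-validation, so each document is classified by a model that
# never saw it during training.
pred = cross_val_predict(LinearDiscriminantAnalysis(), X, y, cv=LeaveOneOut())

# Step 4: error rate from hits and misses, plus the raw misses for
# error analysis.
print(f"cross-validated error rate: {np.mean(pred != y):.2%}")
for true_label, p in zip(y, pred):
    if true_label != p:
        print("miss:", true_label, "->", p)
```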

35.1.4.1 Datasets: get a database of known samples


A dataset of known author documents—ground truth data for the authorship
identification question—is essential. The point of using ground truth data is that we are
trying to find out how often a method is correct at assigning authorship; if we do not
know the correct authorship of known documents, we will never be able to tell if the
method is accurate or not. Any mixture of known and unknown documents will make the
calculation of error rates meaningless.

Ground truth data is all too often overlooked or undervalued. One intriguing study of the
“writeprint” claimed a high degree of accuracy at identifying the authorship of emails,
with over 97 percent accuracy for English, and over 92 percent accuracy for Chinese (Li
et al. 2006). This impressive result, however, is undermined by the fact that the dataset
was not ground truth data, as revealed by the researchers’ comment about a substudy of
three authors in their English dataset: “Clearly, Mike's distinct writeprint from the other
two indicates his unique identity. The high degree of similarity between the writeprints of
Joe and Roy suggests these two IDs might be the same person” (Li et al. 2006:82). Joe and
Roy's ‘writeprints’ are almost identical. Yet it is also possible that Joe and Roy are distinct
people, and the method cannot clearly recognize the difference between Joe's and Roy's
documents. We will never know which explanation is correct because a dataset of ground
truth data was not used. If a ground truth dataset had been used, that is, if known authors had been attached to one or more screennames before validation testing began, the accuracy of the method could have been legitimately tested.

The dataset of known author documents also needs to be tuned to the particular (p. 495) task at hand. For instance, Brennan and Greenstadt (2009) wanted to test whether non-
professional writers could disguise their writing, either through obfuscation or imitation,
to such a degree that stylistic analysis would fail to assign the disguised documents to the
correct author. They designed a ground truth database specifically for this purpose, and
found that non-professional writers could indeed fool a few methods they tested, which
focused on punctuation, letter usage, vocabulary richness, readability, sentence length,
and synonyms. Most of these features have already been shown to be very poor
performers (Chaski 2001) or argued to be potentially poor discriminators because they have high linguistic salience and are easily imitated (Chaski 1997). Nonetheless,
the dataset is an important contribution because it can be studied to determine and to
document the most frequent ways in which non-professional writers disguise their styles
or imitate others’ styles.

Chaski (1997, 2001) designed a ground truth dataset to be used specifically for author
identification in the forensic setting. The writing tasks spanned a wide range of text types,
including love letters, apologies, business letters, narratives, fact-based essays, opinion-
based essays, angry letters, complaints, and threat letters, for each author. The range of
text types reflects the fact that in many authorship cases, the type of document which is
questioned and the types of comparable known documents are not the same. A suspect
suicide note might necessarily be compared to love letters and school essays. An
anonymous threat letter might necessarily be compared to business letters and personal
emails. This dataset has been released for use by researchers in Switzerland, France,
Canada, Spain, and the United States. Through casework, Chaski has also collected a
dataset of ~1,500 spontaneously produced, topically unconstrained, known-author
documents, which has been used for validation testing, as reported below.

Koppel and his colleagues harvested a dataset of blog posts from ~19,000 bloggers,
which is available for research (Schler et al. 2006). The bloggers are identified by a
numerical identifier, gender, age, industry, and zodiacal sign. As with any data collected
from the web, there is an assumption that the screenname belongs to one person at the
keyboard, but the sheer size of this dataset makes it a valuable contribution to authorship
research in the forensic setting.

Funded by the National Science Foundation, Juola and Argamon (2008–11) extracted and
cleaned a set of emails from the famous Enron email database. The Enron email database
is approximately 2 gigabytes of emails authored by known adults in a business
environment, the authorship of which has, to the best of my knowledge, never been
disputed; it is available for download from several sites including www.cs.cmu.edu/
~enron/ and http://enrondata.org/. Email datasets are initially difficult to work with because emails often embed earlier emails from the correspondent to whom the enveloping email is addressed; hence, cleaning the emails makes the data easier to work with when testing authorship methods.

35.1.4.2 Analytical method: apply a repeatable method to all texts in the database

Once a dataset has been designed, collected, and organized for analysis, an analytical method can be applied to each document in the dataset. No matter what method is being (p. 496) tested, in validation testing the method is applied to each document in the dataset. There are basically two ways to apply a method to a document: computer software or human analysis.

[figure 35.1 A protocol for author identification methods]

Forensic authorship identification requires a repeatable method that can be consistently applied by different analysts. The units measured in both qualitative and (p. 497) quantitative analysis must be explicit enough to be operationally defined. Software-implemented (quantitative) methods can produce rapid validation testing results once the methods have been programmed. Qualitative methods can also produce validation testing results, but they require a team of trained linguists, inter-rater reliability, and a longer time commitment than software-implemented methods.

If the analytical units are explicit, then it may be possible to implement the method in
computer software. Software implementation of a method provides instant repeatability
because the software implements the same method every time it runs. Software
implementation of a method also provides speed and efficiency, since most linguistic
analysis performed by humans is time-consuming and mentally exhausting. Once
an authorship identification method has been implemented in software, each document in
any dataset can be processed automatically and subsequently subjected to statistical
analysis.


But sometimes qualitative methods are necessary, and software implementation may not
be available, for example when identifying a particular discourse strategy such as sarcasm or
irony. When the analytical method cannot be implemented using software, then it should
be subjected to inter-rater reliability testing. Several analysts are given the same task
and the same dataset. Their analyses are compared to determine how often the analysts
agree with each other. This agreement level indicates the inter-rater reliability. There are
several statistical procedures for measuring the agreement level (Cronbach's alpha,
Cohen's kappa), but the important point is that qualitative methods can and should be
subjected to validation testing, with the first step being inter-rater reliability. If the
analysts are in high agreement with each other, then they can proceed to apply the
qualitative method to each document in the dataset. If the analysts have very low
agreement, then they need to hammer out their differences, re-test themselves with new
data, and determine their inter-rater reliability, iteratively, until they have reached a high
enough level of agreement to apply the method to the entire dataset. If they do not test
for inter-rater reliability first, then the entire dataset is being used just for inter-rater
reliability, rather than validation testing.
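
For instance, Cohen's kappa can be computed in a few lines with scikit-learn; the two analysts' binary codings below are invented for illustration.

```python
from sklearn.metrics import cohen_kappa_score

# Two analysts code the same ten documents for presence (1) or
# absence (0) of a discourse feature.
rater_1 = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
rater_2 = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]

kappa = cohen_kappa_score(rater_1, rater_2)
print(round(kappa, 3))  # 1.0 = perfect agreement; ~0 = chance level
```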

35.1.4.3 Apply a cross-validation scheme


Once the linguistic units have been accounted for in a text, whether through computer
software or human coding, the next step in validation testing is discovering how well the
method works by checking the method's answers against the correct answer known from
the ground truth data.

The general idea behind cross-validation is simple. First, run a statistical procedure over
a dataset, building a statistical model which differentiates two (or more) authors. The
accuracy of this model may not generalize to any other dataset, so cross-validation is a
way to find out how robust the model really is. If one removes some data from the
dataset, and builds a statistical model with the remaining data, will the removed data be
classified accurately when reinserted into the dataset?

Even if no statistical modeling is used, cross-validation can and should be (p. 498) applied. For instance, one could apply discourse analysis, qualitatively, to two sets of
known documents to determine which set a questioned document belongs to. The two
authors of the known documents first need to be differentiated from each other, at the
discourse level. If author A and author B show two different clusters of discourse
strategies, how stable are their different clusters across each author's own documents? If
I remove author A's document 1, do I find the same cluster of discourse strategies across
the remaining known documents? This kind of procedure would, first, show the analyst
how consistent the known documents are, and second, go a long way toward minimizing
confirmation bias in feature selection. If only one of the known documents contains the
selected discourse features, then the discourse features are not consistent and stable
among the author's known documents, and this kind of instability weakens any conclusion
about authorship. If none of the known documents contains the same cluster of selected
discourse features, then the discourse features are clearly not consistent and stable
among the author's known documents, and this kind of instability weakens any conclusion
about authorship, or makes the selected features unemployable as identifying features.

Cross-validation applied in the traditional statistical way allows us to minimize over-fitting for statistical procedures but, applied in this novel way I am suggesting for
forensic linguistics, also allows us to identify and minimize over-generalization for non-
statistical procedures.

35.1.4.4 Error rate calculation and error analysis


Error rates are calculated from the mis-classifications. When Authors A and B are being
compared, any A documents which jump over into the B bin, and any B documents which
jump over into the A bin are misses: the method missed the true author classification.
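
In code, the A-into-B and B-into-A jumps are simply the off-diagonal cells of a confusion matrix; the labels below are invented for illustration.

```python
from sklearn.metrics import confusion_matrix

true = ["A", "A", "A", "A", "B", "B", "B", "B"]
pred = ["A", "A", "B", "A", "B", "B", "A", "B"]  # one miss each way

cm = confusion_matrix(true, pred, labels=["A", "B"])
print(cm)
# [[3 1]    row A: one A document landed in the B bin
#  [1 3]]   row B: one B document landed in the A bin
misses = cm[0, 1] + cm[1, 0]
print("error rate:", misses / cm.sum())   # 2/8 = 0.25
```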

At some point, any method will fail to distinguish different authors or the method will
assign documents to the wrong author. Any method which never fails—no matter what the
data quantity, no matter what other characteristics of the data are varied—is most likely
over-fitted to the dataset and needs to be tested on different datasets.

In the forensic setting, failure can easily be caused by not having enough data to run the
method robustly. The worst part of this is that there is usually no way to get any more
data. In this scenario, the method cannot be applied with confidence. The forensic data
problem certainly motivates researchers to develop methods which can work on very
small amounts of data. The forensic minimal-quantity requirement also means that most
methods for authorship identification based on literary or newswriting data often
cannot be applied to forensic data because they require far more data than most forensic
cases supply.

At other times in the forensic setting, failure can occur because some quality shared by
the known documents from different authors makes them difficult to differentiate. All of
the different authors may each use a specific, highly-constrained jargon (drug dealing,
numbers running, prostitution, corporate culture) which results in very (p. 499) formulaic
language. When everyone is writing three-sentence emails that begin “Dear ____,”
order has been scheduled….see below,” and “please feel free to contact me directly,” the
difficulty of detecting differences between authors obviously increases. The different
authors might be professional writers with high metalinguistic awareness or one author
might demonstrate a distinctive substyle.

Much more work is needed in error analysis to determine how any particular method is
failing. Brennan and Greenstadt's (2009) work offers a good place to begin error analysis
for the methods they tested, and shows, for example, that lexical choice is a highly salient
linguistic feature which most people will manipulate when they are disguising or
imitating authorship.


35.2 Current Forensic Author Identification Methods

There are two methods which are currently used in investigations and trials: forensic
stylistics and software-generated analysis. There are also other methods in development,
which have real potential for being tested on forensically feasible datasets and employed forensically in the near future. These will be discussed in Section 35.3.

35.2.1 Forensic stylistics

The forensic stylistics method is best represented by McMenamin (1993, 2002), although
it has many other practitioners who borrow extensively from McMenamin's work.
McMenamin (2002) offers the best statement of the method and the fullest list of
potential stylemarkers.

Using forensic stylistics, the examiner selects linguistic features from multiple linguistic
levels as well as formatting and spelling. Examples of these features are listed from the
texts, and the lists from the known and questioned documents are compared. The
authorship decision is based on the matching of examples from the known and questioned
documents.

There are no published validation testing results from any practitioner of forensic
stylistics. Chaski (2001) tested some linguistic features which are common in forensic
stylistic analyses but did not combine them with other linguistic features; when the
singular results are combined, the forensic stylistics features would have achieved only
54 percent accuracy. St Vincent and Hamilton (n.d., post-2001) did combine forensic
stylistics features from Chaski (2001) and found that the method performed very poorly,
the highest accuracy being 51.46 percent. Koppel and Schler (2003) also tested forensic
stylistics features and achieved poor identification results, with the highest accuracy of
72 percent using part of speech tags in addition to the usual forensic stylistics features.
(p. 500) Chaski (2007b) used 42 punctuation and abbreviation features on 9,000 bloggers

from the Koppel blog dataset and achieved only 60 percent differentiation among the
bloggers, showing that merely recognizing punctuation and abbreviation features is not
enough to differentiate authors from each other at a high enough rate to serve the justice
system.

35.2.2 Software-implemented methods

35.2.2.1 ALIAS SynAID


The ALIAS SynAID method (Chaski 2001, 2005, 2007c) is a software-implemented method
which extracts and normalizes frequency measurements at the character, word, and
sentence levels from known and questioned documents. At the character level,
punctuation is classified for its syntactic edge (clause, phrase, or morphemic). At the
word level, average word length is calculated. At the sentence level, within each
sentence, marked and unmarked syntactic structures for each syntactic head (noun, verb,
adjective, etc.) are categorized based on POS tag n-gram sequences and regular
expression programming which allow for non-adjacent tags to be considered in patterns.
The marked and unmarked category examples are then counted within each sentence,
and summed for document level quantification. Once each document is quantified, the
numerical output is processed through linear discriminant function analysis using leave-one-out cross-validation. The authorship decision is based on the classification results, as
long as the statistical model attains an accuracy which is in the very high 80s or better.

Independent of any litigation, ALIAS SynAID has achieved 95 percent accuracy in pair-
wise tests among ten authors from the Chaski dataset, and 94 percent accuracy in pair-
wise tests among ten authors from the Chaski dataset and known case dataset. In 23
litigation-related cases, ALIAS SynAID has created statistical models ranging from 93 to
100 percent cross-validated accuracy at differentiating the known authors’ documents, so
that classification of the questioned document(s) was possible.

ALIAS SynAID has also been tested, independent of litigation, with ten authors chosen at
random from samples produced by 175 known authors, each represented by 20 five-
sentence chunks. The accuracy rate is much lower with these very short texts (from 30 to
75 words), as shown in Table 35.1, from Chaski (2007c). Author 16, tested against the
nine other authors, has documents correctly assigned to her ~95 percent of the time when linear
discriminant function analysis (LDA) and support vector machines with different kernels
(SVM 1 and SVM 2) are used, but only ~92 percent with logistic regression (LR) and
decision tree forest (DTF). Author 91's five-sentence chunks are much more difficult to
assign to him correctly, as shown by the fact that the best classifier overall, LDA, achieves
only 74 percent cross-validated accuracy.

The computational method, as implemented in ALIAS SynAID, has been scrutinized in Daubert and Frye hearings (Washington DC, 2001 (Federal); Atlanta, GA, 2008 (p. 501)
(State); Annapolis, MD (1998)). Presented by one expert, computational linguistics
evidence has attained full admissibility, meaning that the expert was allowed to state a
conclusion about the authorship of the document in question.1


Table 35.1 Results of SynAID with very short samples

Author     LR       LDA      SVM 1    SVM 2    DTF
16         91.67    95.28    95.28    95.83    92.50
23         85.28    88.33    85.28    80.56    82.78
80         80.00    85.56    80.56    77.78    81.39
90         72.22    76.67    81.94    80.28    76.11
91         81.39    74.17    81.94    78.89    75.28
96         79.17    83.61    78.61    76.11    76.11
97         85.56    85.83    83.61    79.44    82.22
98         79.72    84.17    79.44    76.11    82.50
99         70.00    82.50    78.06    74.17    80.83
168        83.33    88.89    76.39    79.72    86.39
Overall    80.83    84.50    82.11    79.89    81.61


Given the current interest and research activities, other computational linguists, or law
enforcement officers trained in the use of specific software, will be presenting evidence
based on validated, litigation-independent techniques in the near future.

35.2.2.2 POS (part of speech) trigrams


Hirst and Feiguina (2007) tagged the documents of ten authors from the Chaski dataset (the same authors
used in Chaski (2005)) and used a Support Vector Machine with 10-fold cross-validation
to classify documents based on POS trigrams. This method achieved 88 percent accuracy.
Hirst and Feiguina (2007) do not manually check the assigned POS tags and they use a
parser designed for standard English. It is highly likely that at least some of the POS tags
have been incorrectly assigned. It is thus possible that they might achieve higher accuracy if the trigrams were checked to confirm they contain the correct tags.
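
The general recipe, tag each document, form trigrams of tags, vectorize, and cross-validate an SVM, can be reproduced with standard tools. The sketch below is a generic reconstruction, not Hirst and Feiguina's exact pipeline; the duplicated toy documents exist only so that ten folds have data to split.

```python
import nltk
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

def pos_tag_sequence(text):
    """Return the document as a space-joined string of POS tags."""
    return " ".join(tag for _, tag in nltk.pos_tag(nltk.word_tokenize(text)))

# Duplicated toy documents for two "authors", just to make the pipeline run.
docs = (["He said that he would meet them at the library tonight."] * 10 +
        ["Meet me at midnight. Bring the money. No police."] * 10)
labels = ["A"] * 10 + ["B"] * 10

tagged = [pos_tag_sequence(d) for d in docs]
# token_pattern=r"\S+" keeps punctuation tags that the default would drop.
X = CountVectorizer(ngram_range=(3, 3), token_pattern=r"\S+").fit_transform(tagged)
print(cross_val_score(SVC(kernel="linear"), X, labels, cv=10).mean())
```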

Baayen (2008:154–9) reports the results of his statistical analysis of POS trigrams from
Spassova's (2006) dataset of three Spanish writers. There are 15 texts, each of
approximately 3,000 words, so this dataset is on the high side for forensic casework.
Applying (p. 502) discriminant function analysis, Baayen shows how the initial result is
perfect, with each text assigned to its true author with high probability. But then Baayen
applies leave-one-out cross-validation, and shows that the cross-validated accuracy is
9/15 correct assignments, or 60 percent. This rather low result is most likely due to the
fact that the data were not run in pair-wise tests, but run as a multi-class problem. The
best results are always obtained through pair-wise comparisons (Chaski 2005; Grieve
2007).

35.3 Future Directions For Forensic Author Identification

New methods are continually being developed, especially in the computational paradigm.
In this section, I will mention a few and point out some areas of consensus that have
developed in the past decade with respect to authorship identification research.

The ALIAS UniAIDE method (Chaski 2007a, 2010) is a software-implemented method which extracts the frequencies of single characters (including punctuation and
whitespace) in a questioned document and each known document. The questioned–known
frequencies are subjected to a variation of the chi-square statistic that enables the
method to overcome length bias. Independently, Chaski (2007a) and Stamatatos (2008)
realized that the unigram methods as described in Peng et al. (2003) and Keselj et al.
(2003) would rank the longest documents as most similar to the questioned document. In
the forensic setting, the chief suspect would typically have more documents collected,
and would thus end up with the longest merged document, ranking as most similar to
the questioned document based on sheer size, unless the length bias is corrected.


Stamatatos's (2008) fix involves text sampling; Chaski's fix is a simple variation on how
the chi-square is calculated, allowing raw counts that would normally be disallowed when chi-square is used to calculate a probability.
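
The flavor of a character-unigram comparison can be conveyed in a few lines. The chi-square variant actually used in ALIAS UniAIDE is not reproduced here; the sketch below simply scales the questioned document's character distribution to each known document's length before computing a per-character chi-square statistic, one crude way of keeping longer documents from winning on size alone. All texts are invented.

```python
from collections import Counter

def chi_square_distance(questioned, known):
    """Per-character chi-square between a known document's character
    counts and the counts expected from the questioned document's
    distribution at the known document's length."""
    q, k = Counter(questioned), Counter(known)
    n_q, n_k = sum(q.values()), sum(k.values())
    stat = 0.0
    for ch in q:  # characters absent from the questioned text are skipped
        expected = n_k * q[ch] / n_q
        stat += (k.get(ch, 0) - expected) ** 2 / expected
    return stat / n_k  # per-character, so document length cancels out

questioned = "meet me at the library tonight."
knowns = {"A": "i will meet you at the library after work.",
          "B": "LOL c u l8r @ the lib!!!"}
ranked = sorted(knowns, key=lambda a: chi_square_distance(questioned, knowns[a]))
print(ranked)  # known author most similar to the questioned document first
```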

Character-based methods are especially intriguing because, in a Unicode-compliant system, they can be applied to any language which has a Unicode-compliant script. This means that a reliable character-based method could be run, in languages for which no linguistic expertise was available, to sort and rank documents before they are sent to linguistic experts.

Koppel et al. (2011) have recently published a character-based method based on trigrams
that include no spaces. The choice of three non-space characters is prudent and clever,
linguistically, since this window will capture almost all function words and roots, in both
Indo-European and Semitic languages. The method can currently achieve high accuracy if a threshold for false positives is implemented: false positives can run as high as 30 percent, but once these are filtered out, accuracy is high. Whether
this can be implemented in a forensic setting remains to be seen, but the work is
important because it opens the door for ways of discovering when a document cannot be
matched to any in the known set.
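
Extracting such trigrams is straightforward; the snippet below keeps only character trigrams that contain no space, over an illustrative string.

```python
from collections import Counter

def nonspace_trigrams(text):
    """Character trigrams that contain no space character."""
    grams = (text[i:i + 3] for i in range(len(text) - 2))
    return Counter(g for g in grams if " " not in g)

print(nonspace_trigrams("the unkindest cut of all").most_common(5))
```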

Argamon et al. (2007) are developing a method based on function words and (p. 503) lexical categories within Halliday's (1994) Systemic Functional Grammar. On a dataset of nineteenth-century novels, the method extracts and counts the lexical categories and
applies Weka's SMO learning algorithm with 10-fold cross-validation. On the literary
dataset, the method is able to distinguish different authors at near 90 percent accuracy.

Some points of consensus and experimental replication have already been attained. For
instance, we know that author pair-wise comparison achieves more accurate results than
multi-author comparisons. We know that methods which combine multiple linguistic
levels are more accurate than methods which focus only on one level. We know that
character-based and word-based methods can show topic bias that must be controlled for
in the known document data collection. We know that syntax-based methods, whether
POS tags or more sophisticated structures, work across topics and registers better than
word-based methods do. We know that there are data requirements for different methods
in terms of known characters, words, or sentences.

The immediate future for forensic author identification is intensely exciting as research
teams collaborate by data sharing, validation testing, and learning what we can and
cannot do with the data at hand. As linguists in service to the legal communities, it is
equally as important for us to say “no” as it is for us to say “yes.”

Notes:


(1) Green v Dalton/US Navy, 164 F.3d 671, District of Columbia District Court,
Washington DC; (for related material see 〈http://caselaw.findlaw.com/us-dc-circuit/
1436391.html〉 accessed September 23, 2011); Arsenault v Dixon, Fulton County Superior
Court, Atlanta, GA (certified transcript of ruling available from author); Zarolia v
Osborne/Buffalo Environmental, Anne Arundel County Superior Court, Annapolis, MD.
Since most opinions in forensic author identification are unpublished, please contact the
author for further information about these cases and others.

Carole E. Chaski

Carole E. Chaski, PhD is the Executive Director of the Institute for Linguistic
Evidence, the first non-profit research organization devoted to linguistic evidence, and the CEO of ALIAS Technology LLC. Dr Chaski held a Visiting Research
Fellowship (1995–98) at the US Department of Justice's National Institute of Justice,
where she began the validation testing which has become an increasingly important
aspect of forensic sciences since the Daubert ruling, and introduced the
computational, pattern recognition paradigm to questioned document examination
and forensic linguistics. Dr Chaski has served as an expert witness in Federal and
State Courts in the USA, in Canada and in The Hague. Primarily a researcher and
software developer, Dr Chaski consults with corporations, defense, and security on
computational linguistic applications. Dr Chaski earned her doctorate and master's in
linguistics at Brown University, her master's in psychology of reading at the
University of Delaware and her bachelor's in English and Ancient Greek from Bryn
Mawr College.
