
SCIENTIFIC & TECHNICAL

Forensic linguistics: an assessment of the CUSUM method for the determination of authorship
RA HARDCASTLE*
The Forensic Science Service, Priory House, Gooch Street North, Birmingham, United Kingdom B5 6QQ

Journal of the Forensic Science Society 1993; 33: 95-106


Received 1 October 1992; accepted 20 January 1993

The application of a proposed technique, based upon the cumulative sum ("CUSUM"), to forensic problems such as the authentication of disputed police interview records, is described. The method is found to be ill-defined and its application amateurish. In its present form it cannot be accepted as providing reliable evidence of authorship when applied to either the written or spoken word.

Key Words: Document examination; Text Analysis; Stylistics.

* Present address: Document Evidence Ltd, Gatsby Court, 172 Holliday Street, Birmingham, United Kingdom B1 1TJ.

JFSS 1993; 33(2): 95-106


Introduction
In an earlier paper [1], the results of an investigation into the application of Stylometry, a method of literary stylistics devised by Andrew Morton, were presented. Briefly, it was found that the method had not been adequately tested on texts of forensic interest of known authorship for it to be accepted as wholly valid. Furthermore, the method could only be applied to relatively lengthy utterances of over 1000, and preferably over 3000, words. Smith [2] has since summarised the findings of other investigations into the method and severely criticised Morton's practical procedures.

Morton is now advocating the use of a technique based upon cumulative sum analysis (abbreviated to CUSUM) [3-5]. He has been consulted and prepared evidence in a number of cases where it has been alleged that confession statements and police records of interviews have been fabricated in part or in whole. The CUSUM method has been applied to texts as short as 20 sentences and is claimed to be capable of locating insertions of just a few sentences of "foreign" material of different authorship.

In his publications [3,4], Morton used many examples from literary texts to support his theory, and claimed that the method was also valid when applied to texts of forensic interest such as statements or interview records. He averred that the method was insensitive to differences between written and spoken utterances even when the speech comprised replies to questions rather than normal conversation. In this paper, tests of these claims are presented in the form of examinations of utterances of known authorship. In addition, difficulties in the practical application of the method are identified.

The basis of CUSUM analysis
The essence of the method is the comparison of the rates of use of a particular class of word. Morton referred to such rates of use, rather inappropriately, as "habits". This nomenclature has been followed here, but it should be noted that these properties of language have not been established to be habits and Morton identified no psychological or linguistic processes which might generate such habits. Examples of such classes are nouns, verbs, short words (i.e., words of 2 or 3 letters), and vowel words (i.e., words beginning with a vowel). The first two identify words by grammatical use, whilst the latter two are more arbitrary definitions. Further classes may be formed by combining two or more simple classes; Morton often used the class of Short words + Vowel words.

The application of the method depends upon two hypotheses: that in all the utterances (written or spoken) of some people, words of a particular class will occur as a consistent proportion of the total number of words in each sentence; and that different people may use words of a particular class at different and distinguishable rates.

Thus a disputed text, containing words of a particular class occurring at a rate which is consistent but which differs significantly from specimen texts of the person concerned, would be judged to be of different authorship. A disputed text, with such words occurring at a rate which varies from one part to another when the person involved is known to be consistent in specimen material, would be judged to be of multiple authorship. A similarity in word rates between two texts would not, however, prove common authorship.

Morton has advanced no theoretical basis as to why the features he examined should be consistent in a person's utterances, or why they should be characteristic of a person. This is a serious limitation, since there is no way of predicting circumstances in which the method might not be valid. Morton claimed that people can be consistent in their rates of use of particular classes of words, irrespective of the type of utterance, written or spoken, personal letter or formal, conversation or interview, etc. Certain instances have been identified as the cause of anomalies: sentences containing long lists, sentences containing reported speech, very short replies to questions and formal modes of address, but there is no comprehensive catalogue of such instances and no rigorous procedure to identify them.

Morton often employed the class of Short words and the combined class of Short + Vowel words because these yielded higher rates of occurrence and reportedly greater consistency than other classes. He referred to Short words as being "filler" words such as "and", "of", "the", "to", etc, which are not related to the subject matter of the text. However, a proportion of words of 2 or 3 letters in any text are subject-related but these are not specifically excluded. Furthermore, words beginning with a vowel (other than those of 2 or 3 letters) are predominantly subject-related, and the notion that occurrences of such words can distinguish authors is preposterous. When these vowel words are added to the Short
words, roughly half of all the words within a text fall into the combined class.

The claim that the CUSUM method can be used to compare texts of written utterances with spoken ones is crucial where statements and interview records are concerned. There are many linguistic differences between writing and speech, and the claim that the method is insensitive to these is a remarkable one. There are obvious examples of such differences.
- People commonly use contracted forms such as "I'm" and "didn't" in speech but the full forms "I am" and "did not" in writing.
- Replies to questions are often incomplete in the sense that they lack grammatical components normally present in written sentences. For example,
  Q: "Where did the car go then?"
  A: "Down the street and round the corner."
  Here the reply contains no verb and no subject for a verb.
- In speech, replies to questions are often prefaced by "Yes" or "No".
- In an interview, the interviewee sometimes echoes part of the question in his reply. For example:
  Q: "Did you already know the answer?"
  A: "How could I know the answer?"

Application of the CUSUM method
Morton used a graphical device called a CUSUM chart to analyse a text. The chart is, in effect, one graph superimposed upon another. The first graph plots a cumulative sum (hence the name CUSUM) derived from the lengths of the sentences in the text, while the second plots a cumulative sum derived from the numbers of words of the chosen class in the sentences. Figure 1 shows the sentence-length CUSUM line and the "habit" CUSUM line, plotted separately, for a specimen text written by subject A. The worked example for this text is given in Appendix 1, to show the construction of a CUSUM chart.

It is obvious that there is a relationship between the shapes of the two CUSUM lines but with a difference of scale. This means, usually, that the longer the sentence the more words there are of the chosen class, as would be expected. The overall shape is not directly relevant to the attribution of authorship but it does provide other information. For example, an upward slope in the sentence-length CUSUM plot indicates that the corresponding sentence is longer than average, and the steeper the slope the greater the departure from the average. A smooth graph shows that changes in sentence length from one sentence to the next are small. Note that both plots always end on the horizontal axis because of the way in which the values are calculated.

FIGURE 1 CUSUM analysis of the first specimen letter written by subject A showing (a) the sentence length CUSUM line and (b) the habit CUSUM line, plotted separately to the same scale.

The correspondence between the two graphs was demonstrated by Morton by applying a scaling-factor to the values in the habit CUSUM plot and then superimposing it upon the sentence-length CUSUM plot. He calculated the scaling-factor as the ratio of the ranges (maximum to minimum CUSUM value) of the individual plots (see Appendix 1). The result is the CUSUM chart as shown in Figure 2a.

If a person produced text with a very consistent proportion of the words in each sentence being of the class under scrutiny, then the habit CUSUM line would closely shadow the sentence-length CUSUM line throughout the CUSUM chart, no matter what its shape. The correspondence will rarely if ever be exact; in real texts, some natural variation in the rate of use of words is to be expected and the lengths of sentences themselves have an influence: if the average proportion of habit words in a sentence was, for example, 0.5, then all the sentences with odd numbers
of words would produce slight divergences between the two CUSUM lines, since it is not possible to have half of the words in these sentences being of the defined class.

FIGURE 2 CUSUM analysis of the first specimen letter written by subject A showing (a) the CUSUM chart constructed according to Morton's method using the scaling factor (see text); (b) the corresponding histogram of the differences between observed and expected numbers of "habit" words (separation change) for each sentence of the text. (solid line) sentence length CUSUM; (dotted line) "habit" CUSUM.

The critical part of the analysis is the identification of a significant divergence between the two CUSUM lines. Morton claimed that such a divergence demonstrated different authorship for part of the text since it indicated a difference in the rate of use of the words under scrutiny. He made no use of standard statistical measures to compare the two sets of data values, but advanced the criterion that when the maximum vertical separation between the CUSUM lines is less than the separation between any two successive points within either of the lines, the use of words is considered to be consistent. This criterion makes no sense at all. Whilst the separation between the CUSUM lines does have something to do with the consistency of the use of the words under scrutiny, the maximum separation between points within a line is determined usually by the longest sentence in the text, since it represents the greatest departure from the average value.

In Figure 2, the maximum separation between the two CUSUM lines occurs at sentence 15, whilst the maximum separation between points within both lines occurs between sentences 13 and 14, because sentence 14 is the longest sentence at 58 words. Morton concluded that this particular text showed no inconsistency in the habit under scrutiny.

Even if Morton's criterion were accepted a priori, there remain other serious objections to its practical application:
- The CUSUM chart does not represent directly the deviations from consistency, sentence by sentence, throughout the text but portrays instead cumulative deviations. As a result, the inter-line separations are more likely to be greatest near the middle of the chart than near to either end. Indeed, the lines are constrained to meet at the right-hand end. It could also be argued that the longer the text the greater the possibility of a chance separation of the lines.
- The scaling-factor used by Morton to allow the superimposition of the two graphs is imprecise (see Appendix 2). The application of the correct scaling factor can make the fit between the two CUSUM lines better in some areas and worse in others. The way in which Morton's factor is calculated means that it is a function of the average sentence length and habit words rate within only that portion of the text represented by the CUSUM chart data points lying between the highest peak and the lowest trough. If the peak and trough lie close together, this will be a minor, perhaps unrepresentative, portion of the text. If the peak precedes the trough, this portion of text will contain a high proportion of short sentences; if the peak follows the trough, it will contain a high proportion of long sentences. In some charts, the portion of text delimited in the sentence-length CUSUM line is somewhat different from that delimited in the habit CUSUM line.
- Morton reported that in general the more frequent the use of the class of words under scrutiny (i.e., the greater the proportion of these words in each sentence), then the better the correspondence between the two CUSUM lines, yet this was not taken into account in setting a level of significance for the inter-line separation.
- If the punctuation of the text is by someone other than the source of the utterance (e.g., by a police notetaker in an interview), a change in punctuation
could directly affect the outcome of the test for consistency, if it altered the value of the maximum separation of successive points within each CUSUM line.
- When a deviation from absolute consistency occurs within a text, thereby causing the CUSUM lines to diverge, the cumulative nature of the chart means that the separation is "carried forward" to the following part of the chart. Morton advocated the use of a transparent overlay of the habit CUSUM graph so that it could be moved around with respect to the sentence-length CUSUM graph. This is a very subjective procedure and appears to result in many ad hoc interpretations.
- The criterion does not take adequate account of the different ranges of natural variation to be expected in the utterances of different people, nor of the possibility that different situations may lead to different ranges of variation in the utterances of the same person.

In reports issued by Morton in particular cases, it is apparent that he did not rely solely upon the stated criterion for consistency, but the procedure by which he arrived at his conclusions is not clear. He has stated that the method simply involves the presentation of the CUSUM chart and the posing of the question "Are they [the two CUSUM graphs] similar or are they different?", and that in this judgement the layman stands on almost the same footing as the expert! If this is the case, then that judgement is seriously hampered by the use of many different y-axis scales in the CUSUM charts that Morton presented. Where Morton produced charts with compressed scales, a consistency was often identified. On the other hand, where he produced charts with expanded scales, differences tended to be reported. The deplorable lack of scientific rigour in the CUSUM method means that in its present form it cannot be accepted as a reliable technique.

Alternative methods of presenting the data
One problem with the CUSUM chart is that judging the vertical separation between the two CUSUM lines by eye is not always easy. Where the lines are rising or falling steeply together, the eye is drawn to the separation perpendicular to the general line direction rather than to the separation of interest. Since it is the vertical separations, and not the shape of the chart, that are supposed to be of prime importance, then perhaps these vertical separation values themselves should be plotted instead of the CUSUM values. Of course, this would preclude the use of transparencies to manipulate the charts.

If the proportion of habit words in a sentence were the same as the proportion of habit words in the text as a whole, then in the CUSUM chart there would be neither a widening nor a narrowing of the separation between the two CUSUM lines. (If this property held for all the text, the CUSUM lines would be identical.) The change in separation between the two CUSUM lines is directly related to how close the observed proportion of habit words in the sentence is to the average proportion for the whole text. The change in separation can therefore be expressed as:

$$\text{Separation change} = \text{Observed habit words} - \text{Expected habit words},$$

where

$$\text{Expected habit words} = \text{Sentence length} \times \frac{\text{Average habit words per sentence}}{\text{Average sentence length}}.$$

A histogram can be constructed of the differences between observed and expected values throughout the text. Figure 2b shows the histogram for the text analysed in Figure 2a. Whilst a positive value indicates a sentence with an "excess" of habit words, it does not necessarily imply an increase in the CUSUM lines' separation at that point in the text, only that the habit line is rising faster (or falling less quickly) than the sentence-length line. What is important is the magnitude of the values. A large value (positive or negative) indicates that the sentence departs markedly from the average rate of use of the habit words. Obviously the more variation there is in this chart, the more variable the person is in their use of the habit words. Inspection of the histogram in Figure 2b shows that there is considerable variation from one sentence to another, with several sentences containing 2-3 habit words more or fewer than predicted by the overall average rate.

Interestingly, the separation changes are counts of words which are not normalised according to the lengths of the sentences themselves. Thus a difference between observed and expected words of, for example, two words in a sentence 10 words long has the same effect upon the separation between the CUSUM lines as does a difference of two words in a sentence 50 words long.

The complications in the CUSUM chart associated with deviations from consistency, in one part of the text, affecting parts of the chart associated with other parts of the text, could be avoided by plotting a
moving average of the separation changes described above.

Examples and discussion
Figures 3 to 8 show various examples of CUSUM charts. Figure 6 relates to the rate of use of the class of Short words while the other figures relate to the rate of use of the combined class of Short and Vowel words. Each chart is displayed together with the corresponding histogram, showing the differences between the observed and predicted habit words for each sentence. All the CUSUM charts have been generated using the scaling factor calculated according to the method advocated by Morton.

Figures 3 and 4 show CUSUM charts and corresponding histograms for further specimen texts written by Subject A. The fit between the two CUSUM lines in Figure 3a is somewhat poorer than that in Figure 2, but in Figure 3b, the lines are well separated between sentences 4 and 22. Inspection of the histogram below reveals that this is the result of a cluster of slightly lower than average habit rates for sentences 1 to 6 and a cluster of slightly higher than average habit rates for sentences 21 to 26. The first cluster causes the CUSUM lines to diverge, while the second causes them to converge again. (The effects of the high habit rate values observed for sentences 13 and 14 are almost cancelled out by the low values for sentences 12 and 15.) It is possible that if a cluster of high or low values occurs within a questioned text, it could be erroneously interpreted as a section of text of different authorship to the rest.

In Figure 4a there is a large divergence between the CUSUM lines. This chart illustrates one of the weaknesses of the CUSUM method already referred to. The vertical separation between the CUSUM lines between sentences 10 and 26 is roughly constant; it arises from deviations in the preceding and following parts of the text. In Morton's procedure, this would be demonstrated by moving a transparency of one CUSUM line with respect to the other. In this particular case, Morton identified only sentence 8 as being abnormal, and inspection of the text showed that this sentence was indeed rather confused. Morton claimed that omitting this sentence from the text was all that was necessary to remove the inconsistency from the CUSUM chart.

FIGURE 3 CUSUM analysis of (a) the second and (b) the fourth specimen letter written by subject A. (solid line) sentence length CUSUM; (dotted line) "habit" CUSUM.

FIGURE 4 CUSUM analysis of (a) the third specimen letter and (b) the third specimen letter, omitting sentence 8. (solid line) sentence length CUSUM; (dotted line) "habit" CUSUM.

FIGURE 5 CUSUM analysis of the first and second specimen letters written by subject A, when joined together as one piece of text. (solid line) sentence length CUSUM; (dotted line) "habit" CUSUM.

Figure 4b shows the result of omitting sentence 8. The separation between the two CUSUM lines is only reduced a little. Inspection of the histograms in Figures 4a and 4b reveals that whilst sentence 8 did have the greatest deviation from the expected habit rate, sentences 3 and 33 were also atypical. Sentence 3
included a list, a feature identified by Morton as a possible cause of problems, but the list was short, containing only four items. Sentence 33 contained no identifiable abnormality. Although there may be some justification for the removal of sentence 3, there is none for sentence 33. In view of the range of variation found in most texts, it is not surprising that such atypical sentences do occur.

Morton often produced a CUSUM chart of a questioned text combined with a specimen text (by simply appending one text to the other), showing a large divergence between the two CUSUM lines as evidence of different authorship for the two texts. However, this procedure merely demonstrates that the two texts have different average rates of use of the habit under scrutiny. That this is not a reliable test of authorship is easily demonstrated; analysis of different specimen texts from the same subject shows that different texts (which individually have closely corresponding CUSUM lines) can have significantly different average habit rates. For example, the text used for Figure 2 has an average habit rate (for the combination of short words and words beginning with a vowel) of 51%, whereas that used for Figure 3a has an average rate of 63%. Figure 5 shows the results obtained when the two texts are combined. The large divergence between the two CUSUM lines is a direct result of the difference between the average habit rates for the two texts.

Figures 2 to 5 relate to specimens of written texts, all letters written by subject A to a relative or friend within a two-year time period. When specimen transcripts of spoken words are analysed, the range of variation encountered appears to be greater than that seen within written texts. An apposite illustration of this is provided by an analysis of Morton's own words. In his publications [3,4], Morton presented a CUSUM analysis of a combination of three of his own written texts, spanning 20 years, to demonstrate a consistency in his own use of short words. However, Figure 6 shows that this apparent consistency does not extend to his spoken words. The text used for Figure 6 consists of the first 40 sentences spoken by Morton when giving evidence during a trial (taken from the official court transcript).

FIGURE 6 CUSUM analysis of words spoken by Morton when giving evidence in court. (solid line) sentence length CUSUM; (dotted line) "habit" CUSUM.

Figure 7 shows CUSUM analyses of the replies of subjects B and C taken from undisputed police interview records. The former interview was recorded contemporaneously as a handwritten transcript by a police notetaker; the latter is a transcription made from a tape-recording of an interview. In both cases the CUSUM chart shows inconsistencies between the two CUSUM lines.

The next two figures are examples of CUSUM analyses of questioned police interview records (contemporaneously written) where Morton concluded that the interviewee's replies could not be accepted as the utterance of one person. Figure 8a shows a CUSUM analysis of the replies attributed to subject A in an interview. In his report for this particular case, Morton stated "that [subject A] is not the source of the utterance attributed to him ... could hardly be clearer." It cannot be discerned how this conclusion was reached. The CUSUM chart satisfies Morton's own criterion regarding the maximum separation between the CUSUM lines, and the histogram in Figure 8a shows that there are no substantial deviations from the average habit rate. Furthermore, the specimen texts of subject A analysed in Figures 3a, 3b and 4b show deviations between the CUSUM lines of greater magnitude than those occurring in Figure 8a.

Figure 8b shows a CUSUM analysis of the replies attributed to subject D in a questioned interview record. Here Morton concluded that sentences 17-22
FIGURE 7 CUSUM analysis of the replies spoken by (a) subject B, taken from an undisputed police record of interview recorded in writing by a police notetaker, and (b) subject C, taken from a transcript of a tape-recorded police interview. (solid line) sentence length CUSUM; (dotted line) "habit" CUSUM.

FIGURE 8 CUSUM analysis of the disputed replies attributed to (a) subject A, in a handwritten police record of interview, and (b) subject D, in a handwritten police record of interview. (solid line) sentence length CUSUM; (dotted line) "habit" CUSUM.

were not the utterances of the same person as 1-16, without examining any specimen material from subject D. The separation of the CUSUM lines for this region of the chart is no greater than discrepancies seen in specimen texts of other authors such as those of subject A. Furthermore, inspection of the histogram below the chart reveals that it is only sentences 18 and 20 which could reasonably be considered to be atypical, but differences in habit rates of a comparable magnitude are found within specimen material. If these two sentences were not the utterances of subject D, it would seem that they must be the utterances of two other people, one with a lower habit rate and one with a higher habit rate.

Conclusions
It is clear that, for texts of forensic interest at least, the CUSUM method in its present form cannot be regarded as objective or reliable. The necessity for the similarity between the two CUSUM lines to be judged by eye is a major failing. Some differences between the lines are always to be expected; what constitutes a significant difference needs to be rigorously defined for the method to be at all credible. On the one hand Morton indicated that it was not the detailed profiles of the CUSUM lines that were important but the separations between them, whilst on the other hand he insisted that transparencies be used to allow one CUSUM line to be translated or even rotated with respect to the other, a procedure which compares details of the line shapes and ignores the separations.

The histograms of (observed-expected) values shown in Figures 2-7 show that a considerable variation in habit rate occurs from one sentence to the next within specimen texts. The CUSUM chart does not appear to be the best way of representing the data; a simple moving average, for example, might be more appropriate.

Without a rigorous procedure to identify all "naturally anomalous" sentences, or to estimate their rate of occurrence, the possibility that spurious discrepancies will arise between the two CUSUM lines ought to be allowed for. A systematic survey of a large number of texts of written and spoken utterances of known authorship is required to set confidence limits on the conclusions drawn from analyses in real cases. Many of the criticisms levelled at Morton's Stylometry method [1] are equally applicable to his CUSUM method. In particular the CUSUM method has not yet been advanced beyond the stage of hypothesis, and should not be accepted in court proceedings as providing trustworthy evidence unless, and until, it is refined and properly validated.

References
1. Totty RN, Hardcastle RA and Pearson J. Forensic linguistics: the determination of authorship from habits of style. Journal of the Forensic Science Society 1987; 27: 13-28.
2. Smith MWA. Forensic stylometry: A theoretical basis for further developments of practical methods. Journal of the Forensic Science Society 1989; 29: 15-33.
3. Morton AQ and Michaelson S. The Qsum Plot. Internal Report CSR-3-90. University of Edinburgh: Department of Computer Science 1990.
4. Morton AQ. Proper Words in Proper Places. Departmental Research Report 1991/R18. University of Glasgow: Department of Computing Science 1991.
5. Morton AQ. The scientific testing of utterances. Cumulative sum analysis. Journal of the Law Society of Scotland 1991; 357-359.

APPENDIX 1
The construction of a CUSUM chart: a worked example
The construction of a CUSUM chart begins with counting the numbers of words of the particular class under scrutiny in each sentence of the text. The identification of words of 2 or 3 letters or words beginning with a vowel is easily accomplished by computer. The identification of other types such as nouns or verbs is more difficult unless a sophisticated computer parsing program is available.

For this example a CUSUM chart will be constructed for the combined class or "habit" of short words (i.e., 2 or 3 letters) and words beginning with a vowel. Morton's nomenclature for this class was "231w + ivw". Note that a word such as "and" which belongs to both categories does not count twice. In Table 1 the counts for each sentence of a specimen text are given. The first column gives the sentence number, the second column gives the total number of words in the sentence and the fifth column gives the number of "habit" words in the sentence.

The next step is to calculate for the whole text the average number of words per sentence and the average number of "habit" words per sentence.

Average number of words per sentence = 568/27 = 21.04

Average number of "habit" words per sentence = [value illegible in source]

Now the cumulative sums or CUSUMs themselves can be calculated. For the sentence-length CUSUM the differences between the average sentence length (21.04 words) and the individual sentence lengths are obtained. These are listed in the third column of Table 1. Then the CUSUM values are
calculated by adding the difference values in succession.

For sentence 1, CUSUM = -4.04 = -4.04
For sentence 2, CUSUM = -4.04 - 11.04 = -15.08
For sentence 3, CUSUM = -4.04 - 11.04 - 10.04 = -25.12
and so on ...

The CUSUM values are rounded to the nearest whole number as listed in the fourth column of Table 1. For the "habit" CUSUM the calculations are performed in a similar manner with the resulting values being listed in columns six and seven of Table 1.

Lastly, the two sets of CUSUM values are plotted as two lines on the same graph but before this can be done the "habit" CUSUM values must be multiplied by a scaling factor. Morton derived his scaling factor from the ranges of values observed in each CUSUM line. In this example

Sentence length: maximum CUSUM value = 2, minimum CUSUM value = -81, range is 83
Habit: maximum CUSUM value = 0, minimum CUSUM value = -44, range is 44

Scaling factor = 83/44 = 1.89

The CUSUM chart produced from this particular example is shown in Figures 1 and 2 of the main paper.

TABLE 1 Example of the calculation of a CUSUM chart
Columns: Sentence no | Sentence-length CUSUM: No of words, Difference from average, CUSUM | 231w + ivw CUSUM: No of words, Difference from average, CUSUM
[Table body not recoverable from the source; the specimen text comprises 27 sentences.]

APPENDIX 2
Derivation of the correct CUSUM chart scaling factor

FIGURE 9 Schematic representation of the two lines in a CUSUM chart showing values for two successive sentences only. (solid line) sentence length CUSUM; (dotted line) "habit" CUSUM.

For simplicity assume that at some sentence n the two CUSUM lines meet at a CUSUM value C, as shown schematically in Figure 9. For the next sentence the CUSUM value for the sentence length line is given by

$$S_{n+1} = C + (L_{n+1} - \bar{L}),$$

where $L_{n+1}$ = length of sentence n + 1, and $\bar{L}$ = average sentence length. The CUSUM value for the habit line is given by

$$H_{n+1} = C + F \cdot (W_{n+1} - \bar{W}),$$

where $W_{n+1}$ = habit words in sentence n + 1, $\bar{W}$ = average habit words per sentence and $F$ = scaling factor.

If the proportion of habit words in sentence n + 1 is exactly the same as the average proportion within the text as a whole (i.e., if the observed number of habit words equals the expected number), then

$$W_{n+1} = L_{n+1} \cdot \frac{\bar{W}}{\bar{L}}.$$

In these circumstances we require that the two CUSUM lines do not diverge between sentences n and n + 1 and therefore

$$S_{n+1} = H_{n+1} \quad \text{or} \quad (L_{n+1} - \bar{L}) = F \cdot (W_{n+1} - \bar{W}).$$

Substituting for $W_{n+1}$ from above gives

$$(L_{n+1} - \bar{L}) = F \cdot \left( L_{n+1} \cdot \frac{\bar{W}}{\bar{L}} - \bar{W} \right) = F \cdot \frac{\bar{W}}{\bar{L}} \cdot (L_{n+1} - \bar{L}),$$

so that

$$F = \frac{\bar{L}}{\bar{W}},$$

i.e.,

$$\text{Scaling factor} = \frac{\text{average number of words per sentence}}{\text{average number of habit words per sentence}}.$$

For the worked example in Appendix 1 the scaling factor calculated by this formula would be 1.93, which is not far from Morton's estimate of 1.89, but for the CUSUM chart in Figure 4b, for instance, the values would be rather different at 1.71 and 1.48 respectively.
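
The comparison between the two scaling factors can be checked numerically. What follows is a minimal Python sketch, not the program used by Morton or in this paper: the function names and the five-sentence counts are hypothetical, chosen only to keep the arithmetic easy to follow, and the data of Table 1 are not reproduced. It builds each CUSUM line as a running total of (value minus average), then computes the range-ratio factor described in Appendix 1 and the ratio-of-averages factor derived in Appendix 2.

```python
from itertools import accumulate


def cusum(values):
    """Running total of (value - mean), one entry per sentence."""
    mean = sum(values) / len(values)
    return list(accumulate(v - mean for v in values))


def range_ratio_factor(length_line, habit_line):
    """Scaling factor as the ratio of the two CUSUM ranges (Appendix 1)."""
    length_range = max(length_line) - min(length_line)
    habit_range = max(habit_line) - min(habit_line)
    return length_range / habit_range


def ratio_of_averages_factor(lengths, habits):
    """Scaling factor as average words / average habit words (Appendix 2).

    Both averages share the same number of sentences, so the ratio of
    averages equals the ratio of totals.
    """
    return sum(lengths) / sum(habits)


# Hypothetical counts for five sentences; these are NOT the data of Table 1.
lengths = [10, 20, 30, 20, 20]  # words per sentence
habits = [4, 10, 16, 10, 10]    # habit words per sentence

length_line = cusum(lengths)
habit_line = cusum(habits)

print(length_line)
print(habit_line)
print(range_ratio_factor(length_line, habit_line))
print(ratio_of_averages_factor(lengths, habits))

# The ranges quoted in Appendix 1 give the range-ratio factor directly.
print(round(83 / 44, 2))
```

With these invented counts both lines finish on the horizontal axis, and the two factors differ (about 1.67 against 2.0), which illustrates how the range-ratio factor depends only on the portion of text lying between the extreme points of each line.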

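
The separation change defined under "Alternative methods of presenting the data", and the moving average suggested there, can be sketched in the same way. This is an illustrative Python sketch under stated assumptions: the function names, the sentence counts and the window of two sentences are all hypothetical, since the paper does not specify a window length or publish the underlying counts.

```python
def separation_changes(lengths, habits):
    """Observed minus expected habit words for each sentence."""
    # Expected habit words = sentence length * (average habit words per
    # sentence / average sentence length); that ratio of averages equals
    # total habit words / total words.
    rate = sum(habits) / sum(lengths)
    return [h - n * rate for n, h in zip(lengths, habits)]


def moving_average(values, window):
    """Unweighted moving average over `window` consecutive sentences."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Hypothetical counts for five sentences (assumed for illustration only).
lengths = [10, 20, 30, 20, 20]  # words per sentence
habits = [4, 10, 16, 10, 10]    # habit words per sentence

changes = separation_changes(lengths, habits)
print(changes)                     # one histogram bar per sentence
print(moving_average(changes, 2))  # window of 2 is an arbitrary choice
```

Because each value refers to a single sentence (or a short run of sentences), a deviation in one part of the text is not carried forward into the rest of the plot, which is the difficulty with the cumulative chart noted in the main text.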