
Science and Justice 54 (2014) 245–256


Review

Distinguishing between forensic science and forensic pseudoscience: Testing of validity and reliability, and approaches to forensic voice comparison☆

Geoffrey Stewart Morrison ⁎
Forensic Voice Comparison Laboratory, School of Electrical Engineering & Telecommunications, University of New South Wales, UNSW Sydney, NSW 2052, Australia

Article info

Article history:
Received 12 January 2013
Received in revised form 28 May 2013
Accepted 17 July 2013

Keywords:
Validity
Reliability
Forensic voice comparison
Aural
Spectrographic
Acoustic–phonetic

Abstract

In this paper it is argued that one should not attempt to directly assess whether a forensic analysis technique is scientifically acceptable. Rather one should first specify what one considers to be appropriate principles governing acceptable practice, then consider any particular approach in light of those principles. This paper focuses on one principle: the validity and reliability of an approach should be empirically tested under conditions reflecting those of the case under investigation using test data drawn from the relevant population. Versions of this principle have been key elements in several reports on forensic science, including forensic voice comparison, published over the last four-and-a-half decades. The aural–spectrographic approach to forensic voice comparison (also known as “voiceprint” or “voicegram” examination) and the currently widely practiced auditory–acoustic–phonetic approach are considered in light of this principle (these two approaches do not appear to be mutually exclusive). Approaches based on data, quantitative measurements, and statistical models are also considered in light of this principle.

© 2013 Forensic Science Society. Published by Elsevier Ireland Ltd. All rights reserved.

Contents

1. Introduction
   1.1. The 2009 National Research Council report's versus Cole's concept of forensic “science”
   1.2. Paradigm
   1.3. The likelihood-ratio framework
   1.4. Approaches based on quantitative measurements, databases representative of the relevant population, and statistical models
2. Testing of validity and reliability under conditions reflecting those of the case under investigation using data drawn from the relevant population
   2.1. Introduction
   2.2. Procedures for measuring validity and reliability
   2.3. Lack of testing of experience-based systems
   2.4. Lack of testing/lack of appropriate testing of systems based on data, quantitative measurements, and statistical models
   2.5. Pre-testing or case-by-case testing?
3. The spectrographic/aural–spectrographic approach
   3.1. Introduction
   3.2. Legal admissibility
   3.3. Reports including consideration of principles for determining acceptable practice
   3.4. Tests of validity
   3.5. Conversions and reasons for conversion
   3.6. Is the fact that a spectrogram is used a key aspect of the criticism of the approach?
   3.7. The IAFPA resolution
4. The auditory–acoustic–phonetic approach
5. Conclusion
Acknowledgments
References

☆ This is a version of the opening presentation of the Special Session on Distinguishing Between Science and Pseudoscience in Forensic Acoustics <http://montreal2013.forensic-acoustics.net/> at ICA 2013: 21st International Congress on Acoustics / 165th Meeting of the Acoustical Society of America / 52nd Meeting of the Canadian Acoustical Association, Montréal, 2–7 June 2013 <http://www.ica2013montreal.org/>. An abridged written version appears in the conference proceedings under the title “Distinguishing between science and pseudoscience in forensic acoustics”. The present written version maintains some of the oral character of the original presentation.
⁎ Now Forensic Consultant, Vancouver, British Columbia, Canada. Tel.: +1 604 637 0896, +44 191 645 0896, +61 2 800 74930. E-mail address: geoff-morrison@forensic-evaluation.net.

http://dx.doi.org/10.1016/j.scijus.2013.07.004

1. Introduction

The title of this paper was deliberately chosen to be provocative, but is probably somewhat (if not highly) inaccurate: I don't plan to actually provide a definition which could be used to include everything one wants to count as science and to exclude everything one does not want to count as science, a problem known in philosophy of science as the demarcation problem. See Edmond & Mercer [1] on the problems with the “junk science” versus “good science” debate. I will, however, provide a discussion of what I consider to be relevant principles governing acceptable practice in forensic science in general and forensic voice comparison in particular. I believe that it is more productive to focus on and potentially debate principles and then consider different approaches in light of these principles, rather than immediately attempt to critique the approaches. I believe that a focus on principles will help us to understand what really matters.

There are serious problems with current practice in forensic science, as documented in the 2009 National Research Council (NRC) report on Strengthening forensic science in the United States: A path forward [2], in the 2012 Frontline documentary The real CSI: How reliable is the science behind forensic science? [3], and elsewhere. Although both the aforementioned report and documentary are from the United States, I would be very surprised if similar problems did not exist in Canada and in other parts of the world.
1.1. The 2009 National Research Council report's versus Cole's concept of forensic “science”

The message of the 2009 NRC report [2] could be summarized as “forensic science should be more scientific”, and it explicitly calls for the adoption of a “scientific culture” [2, p. 125]. From a philosophy and sociology of science perspective, Cole [4] is critical of the NRC report's portrayal of science and scientific culture, arguing among other things that it focused on “discovery science” whereas the majority of forensic science practice is what he calls “mundane science”. Discovery science can be exemplified by the recently completed process of hypothesizing the existence of the Higgs boson then designing and running an experiment to test this hypothesis, whereas mundane science can be exemplified by “laboratory technicians performing routine assays, industrial scientists seeking to refine a product or process, and even physicians trying to diagnose patients or engineers trying to design a safer bridge” [4, p. 447]. Cole points out, however, that the NRC report never claimed that forensic science was “not science”, “unscientific”, or “pseudoscience”, and that it instead made a number of specific claims and recommendations. One of these recommendations, Recommendation 3 [2, pp. 22–23], will be the focus of my presentation, and can be summarized as:

The validity and reliability of forensic analysis approaches and procedures should be tested.
1.2. Paradigm

For several years I have been advocating a paradigm for the evaluation of forensic evidence consisting of the following three elements:

1. obligatory use of the likelihood-ratio framework
2. highly preferred use of approaches based on quantitative measurements, databases representative of the relevant population, and statistical models
3. obligatory testing of validity and reliability under conditions reflecting those of the case under investigation using data drawn from the relevant population.
Recent summaries of the paradigm appear in Morrison, Evett, et al. [5] and Morrison [6]. Details of my thoughts on selecting an appropriate database for forensic-voice-comparison cases appear in Morrison, Ochoa, & Thiruvaran [7], and details of my thoughts on appropriate metrics and methodology for testing validity and reliability for forensic comparison in general appear in Morrison [8].

Below I briefly discuss the first two elements, then discuss the third element in greater detail.

1.3. The likelihood-ratio framework

I (and many others) consider the likelihood-ratio framework to be the logically correct framework for the evaluation and interpretation of forensic evidence irrespective of the approach adopted (several approaches to forensic voice comparison are discussed below). There is increasing support for this position: In 2011, 31 experts in the field signed a position statement that included an affirmation that they consider the likelihood-ratio framework to be the most appropriate framework for the evaluation of evidence (Evett et al. [9]), and this position statement was endorsed by the Board of the European Network of Forensic Science Institutes (ENFSI), representing 58 laboratories in 33 countries.

In the context of forensic voice comparison, the forensic scientist must assess the likelihood of getting the acoustic properties of the recording of a speaker of questioned identity had it been produced by a speaker of known identity (similarity) versus had it been produced by some other speaker from the relevant population (typicality).1 The likelihood-ratio framework requires the forensic scientist to consider both similarity and typicality, and to consider what constitutes the relevant population. Much has been written and said about the likelihood-ratio framework, and I will not focus on this element of the paradigm in the current paper.2

1 The speaker of questioned identity is usually the offender and the speaker of known identity is usually a suspect. This is not always the case, for example the speaker of questioned identity could be a victim, and the recording of the speaker of known identity a recording of a missing person who it is believed could be that victim. For simplicity, I will use the terms “offender” and “suspect” hereafter, rather than the more widely applicable but periphrastic “speaker of questioned identity” and “speaker of known identity”.

2 Introductions to the likelihood-ratio framework include Robertson & Vignaux [10], Balding [11], and Morrison [12].
1.4. Approaches based on quantitative measurements, databases representative of the relevant population, and statistical models

Approaches based on quantitative measurements, databases representative of the relevant population, and statistical models are highly preferred over more human–expert–experience-based approaches because they are more transparent, more easily replicated, and as a practical matter more easily subjected to validity and reliability testing.3 They are more transparent and more easily replicated because it is possible to describe the data used, measurements made, and statistical models applied in sufficient detail that another suitably qualified and equipped forensic scientist can copy what was done — the first forensic scientist can even provide the second with the data and software which they used. If there are major discrepancies in results, these can potentially be traced back to mistakes in the application of the procedures (e.g., measuring the wrong sample or misrecording a measurement) or genuine disagreements with respect to issues such as what constituted the relevant population. Complete objectivity is unachievable, and it may be reasonable to expect that differences in subjective decisions will typically be the cause of the latter type of disagreement. Such disagreements could be discussed by the forensic scientists and potentially resolved before trial,4 or could become matters to be debated before the trier of fact. I would consider these legitimate topics for debate. For example, last year (2012) I was asked to critique a forensic-voice-comparison report written by an expert who used (at least in part of their analysis) databases, quantitative measurements, and statistical models to calculate likelihood ratios; and one of my primary negative criticisms was that in my opinion the data used did not reflect the relevant population or the recording conditions of the case under investigation.5 In contrast, if the final conclusion presented by the forensic expert is directly dependent on an experience-based subjective decision, there is really no way to interrogate the process by which that decision was made, and no way to sensibly debate or resolve a major discrepancy between the conclusions of two forensic experts.

3 Systems with the output based directly on human expert judgments can be fused with systems based on data, quantitative measurements, and statistical models (see Morrison [13]); however, because such a fused system would include a system with the output based directly on human expert judgments it would still be less transparent, harder to replicate, and harder to test than a system based on data, quantitative measurements, and statistical models. Note that the use of systems based on data, quantitative measurements, and statistical models is preferred rather than obligatory within the paradigm; the paradigm does not absolutely preclude the use of systems whose output is based directly on human expert judgments. As discussed in Section 2.1 below, whatever the approach used, the validity and reliability of the system should be tested under conditions reflecting the conditions of the case under investigation, and the best performing system should be used irrespective of whether it is based directly on human expert judgments, based on data, quantitative measurements, and statistical models, or a fusion of the two.

4 “Concurrent evidence” (aka “hot tubbing”) is practiced in some Australian jurisdictions, and has recently been introduced in Canadian Federal Courts [Federal Court Rules SOR/2010-176, s. 9, 282.1–282.2] and in Ontario [Rules of Civil Procedure 20.05(2)(k)].

5 The suspect and offender recordings were spontaneous speech (not necessarily the same speaking style on each recording), neither was of studio quality, the suspect recording was made in a police interview room and the offender recording was made using the audio recording facility of a mobile telephone, and both were recorded in 2009, whereas the recordings constituting the sample of the population were studio-quality recordings of read speech recorded in the 1960s. The questions of whether the data adequately reflect the relevant population and the casework conditions cannot be addressed via empirical testing because they are questions related to the selection of the data which would subsequently be used for empirical testing. A forensic scientist can use their expertise to select what they consider to be appropriate data, but they must also make it clear to the trier of fact that this decision was a subjective decision, and the trier of fact should have the opportunity to decide whether it was an appropriate decision, possibly after also taking into account the subjective opinion of another forensic scientist. In casework reports issued by my lab we make it clear that if the trier of fact does not believe that we have obtained data that adequately reflect the relevant population and the conditions of the case under investigation then all subsequent testing is meaningless. Our data-selection procedure involves the use of a panel of listeners, see Morrison, Ochoa, & Thiruvaran [7]. Although subjective decisions are needed to select relevant databases, these decisions are remote from the ultimate output of the system compared to a system whose output is directly the subjective judgment of an expert. The former type of system is therefore much more robust than the latter with respect to resistance to human bias influencing the outcome.
2. Testing of validity and reliability under conditions reflecting those of the case under investigation using data drawn from the relevant population

2.1. Introduction

I described the second element of the paradigm (use of quantitative measurements, databases representative of the relevant population, and statistical models) as highly preferred rather than obligatory, because it should be subservient to the third element: testing of validity and reliability under conditions reflecting those of the case under investigation using data drawn from the relevant population. If, under the conditions of a particular case, a more subjective experience-based system is found to have better validity and reliability than a system based on data, quantitative measurements, and statistical models, then the former should be employed rather than the latter.

Why is it essential to measure the validity and reliability of a forensic analysis under conditions reflecting those of the case under investigation using samples drawn from the relevant population? Quite simply, such testing is the only way to demonstrate the degree to which a forensic system does what it is claimed to do, and to demonstrate the degree of consistency with which it does that.

2.2. Procedures for measuring validity and reliability

The following summarizes parts of Morrison [8]; see the latter for details.

Validity, synonymous with accuracy, refers to the extent to which on average a set of measurements or estimates approximate the true value of the property being measured or estimated, e.g., how close is the mean of a set of estimates to the true value.6 Reliability, synonymous with precision, refers to the spread of a set of measurements or estimates around the average value of those measurements or estimates, e.g., what is the variance of a set of estimates of a value.

The validity of a forensic-comparison system can be empirically assessed using a database of pairs of test samples. Some pairs must be same-origin pairs and other pairs must be different-origin pairs. The tester must know which pairs are same origin and which are different origin, but the system being tested must not have access to this information. The system is presented with the test pairs and it provides an answer for each pair. The tester compares the system's answers with the truth as to whether each pair was a same-origin or a different-origin pair. The tester assigns a penalty value to each answer according to the correctness of the answer and takes an average of these penalty values as an indicator of the validity of the performance of the system. The function which assigns the penalty value and the averaging function constitute a metric of system validity.

Correct-classification rate (or its inverse, classification-error rate) has been proposed as a metric of system validity; however, it is not consistent with the role of the forensic scientist within the likelihood-ratio framework. It is based on hard-thresholded decisions made on the basis of posterior probabilities. Within the likelihood-ratio framework such decisions are the responsibility of the trier of fact and the forensic scientist should not usurp the trier-of-fact's role. An appropriate metric for assessing the validity of a forensic-comparison system within the likelihood-ratio framework must be based on likelihood ratios, not posterior probabilities, and must assign continuous penalty values, not discrete hard-thresholded values. For example, a likelihood-ratio value of 1000 from a test pair known to be a different-origin pair should attract a greater penalty value than a likelihood-ratio value of 1.1 for the same pair, since the former provides greater support for the contrary-to-fact same-origin hypothesis over the consistent-with-fact different-origin hypothesis than does the latter; likewise, a likelihood-ratio value of 0.001 for that same pair should attract a smaller penalty value than a likelihood-ratio value of 0.9, since the former provides more support for the consistent-with-fact different-origin hypothesis over the contrary-to-fact same-origin hypothesis than does the latter. A metric which has these properties is the log-likelihood-ratio cost (Cllr).
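The paper does not reproduce the Cllr formula. The following minimal sketch uses the definition standard in the speaker-recognition literature, which has exactly the properties described above: continuous penalties that grow with the strength of contrary-to-fact support. Function and variable names are mine:

    import math

    def cllr(same_origin_lrs, different_origin_lrs):
        # Continuous penalties: log2(1 + 1/LR) for same-origin pairs
        # (large penalty when LR is contrary-to-fact small) and
        # log2(1 + LR) for different-origin pairs (large penalty when
        # LR is contrary-to-fact large). The two averages are weighted
        # equally.
        p_same = sum(math.log2(1 + 1 / lr) for lr in same_origin_lrs)
        p_diff = sum(math.log2(1 + lr) for lr in different_origin_lrs)
        return 0.5 * (p_same / len(same_origin_lrs)
                      + p_diff / len(different_origin_lrs))

    # For a known different-origin pair, LR = 1000 is penalised far more
    # heavily than LR = 1.1, and LR = 0.001 far less than LR = 0.9:
    for lr in (1000.0, 1.1, 0.9, 0.001):
        print(lr, math.log2(1 + lr))

Under this definition an uninformative system that always reports LR = 1 scores Cllr = 1, and a perfect system approaches Cllr = 0.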
Imprecision can come from various sources, for example: if the same sample is remeasured multiple times, imprecision in the measurement system may result in different values for the measurements and ultimately different values for the calculated likelihood ratio. If multiple samples are taken from the same object (e.g., multiple recordings of the same speaker) this may also result in different likelihood-ratio values. Different samples of the same population may also result in different likelihood-ratio values. A test set including multiple measurements and/or multiple samples can be used to obtain multiple likelihood-ratio estimates for each pair of test objects, e.g., multiple recordings of each speaker resulting in multiple likelihood-ratio estimates for each same-speaker and each different-speaker comparison (the speaker rather than the recording being the object of interest). Procedures have been proposed for calculating estimates of credible intervals (CI) as metrics of system reliability.

6 One of the reviewers pointed out that ISO/IEC 17025:2005 tests of validity of human-based approaches focus on assessing competence in terms of whether the individual possesses certain qualifications, experience, and knowledge, and whether they follow certain procedures and practices. This would, however, give no indication of how well the system of which this individual is a part would perform under conditions reflecting those of a particular case. That ISO/IEC 17025:2005 sense of “validity” is very different from the sense of “validity” employed in the current paper, and compliance with the former is insufficient in and of itself to produce a system acceptable for casework. Advocates of the use of the aural–spectrographic approach wrote protocols for the application of this approach (see Section 3.7) and then argued that acceptability was based on whether one followed these protocols (see footnote 10). Establishing poor standards can be a means of giving an undeserved imprimatur to poor practice (see Morrison, Evett, et al. [5]). An argument that a particular approach or procedure is valid because it conforms with an existing standard or protocol distracts from defining appropriate principles for acceptability and then determining whether an approach and the procedures which implement that approach conform with those principles.
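As an illustration of the reliability side, the spread of multiple likelihood-ratio estimates for a single comparison can be summarized as an interval on the log10-likelihood-ratio scale. This is only a sketch: it assumes approximate normality of the log10 likelihood ratios and is not the specific credible-interval procedure of Morrison [8]; all names are mine:

    import math
    import statistics

    def log10_lr_spread(lr_estimates, z=1.96):
        # Multiple LR estimates for the same comparison (e.g., from
        # multiple recordings of the same speaker) summarised as a mean
        # and an interval on the log10 scale.
        logs = [math.log10(lr) for lr in lr_estimates]
        mean = statistics.mean(logs)
        half = z * statistics.stdev(logs)  # assumes approx. normality
        return mean, (mean - half, mean + half)

    # Three recordings of the suspect yield three LR estimates:
    mean_llr, (lo, hi) = log10_lr_spread([45.0, 60.0, 38.0])

The narrower the interval, the more precise (reliable) the system is for that comparison.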
The results of empirical assessments of system validity and reliability depend on the test data as well as the system. In order to be informative with respect to the case under investigation, the data used to test a forensic-voice-comparison system should therefore reflect the relevant population, the speaking styles, and the recording conditions of the suspect and offender recordings in the case under investigation. This is discussed in Section 2.4 below.

The procedures for measuring validity and reliability proposed can be applied to any forensic-comparison system irrespective of whether the output is based directly on a human expert's judgment or whether it is based on relevant data, quantitative measurements, and statistical models. The only requirements are that the system accepts pairs of samples as inputs and that it provides likelihood ratios as outputs.
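Because the only requirements are pairs in and likelihood ratios out, a test harness can treat any system, whether its output comes from a statistical model or directly from a human expert, as a black box. A minimal sketch, with all names my own:

    from typing import Callable, Iterable, Tuple

    # Any callable that maps a pair of samples to a likelihood ratio.
    LikelihoodRatioSystem = Callable[[object, object], float]

    def run_test(system: LikelihoodRatioSystem,
                 labelled_pairs: Iterable[Tuple[object, object, bool]]):
        # Each tuple is (sample_a, sample_b, is_same_origin). The flag
        # is known to the tester but never passed to the system.
        same, diff = [], []
        for a, b, is_same_origin in labelled_pairs:
            (same if is_same_origin else diff).append(system(a, b))
        return same, diff  # feed to a validity metric such as Cllr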
2.3. Lack of testing of experience-based systems

The idea that experience-based systems should be tested is not new:

For an expert to say “I think this is true because I have been doing this job for x years” is, in my view, unscientific. On the other hand, for an expert to say “I think this is true and my judgement has been tested in controlled experiments” is fundamentally scientific. (Evett [14, p. 21])

It is my impression, however, that practitioners of experience-based approaches are often unable or unwilling to undergo validity and reliability testing. I have even heard one practitioner of such an approach claim in court that the validity and reliability of forensic voice comparison cannot be tested, and another claim that their approach to forensic voice comparison was scientifically valid because it was reproducible and testable but without presenting any evidence that their system had in fact been reproduced or that their ability to do what they claimed to be able to do had in fact been tested (either under conditions reflecting those of the case at trial or under any other conditions).

Some of this is likely due to a practical issue: systems based on data, quantitative measurements, and statistical models are often wholly or substantially automated, and once the system has been built, tailored, and optimized to the relevant population and the conditions of the case under investigation it may be relatively easy to run a large number of test trials; in contrast, an experience-based system may have to start from scratch on each trial. There may be a large investment in setting up the former, but then little additional cost for each test trial, whereas for the latter there may be moderate investment on the first test trial and the same moderate investment on every other test trial, resulting in a rapidly increasing total investment as the number of test trials increases.
2.4. Lack of testing/lack of appropriate testing of systems based on data, quantitative measurements, and statistical models

The use of approaches based on data, quantitative measurements, and statistical models is not itself a panacea. Leaving aside the issue of whether I am critical of the design of any particular system based on this general approach, I have seen such systems used inappropriately in both research and casework. The principal problems are inappropriate selection of the relevant population and a sample thereof, and no testing or inappropriate testing of validity and reliability. These are issues which we discussed at length in Morrison, Ochoa, & Thiruvaran [7], and so I discuss them only briefly here.

A likelihood ratio as a forensic strength-of-evidence statement is the answer to a particular question, a question defined by the prosecution and defense hypotheses. If the forensic scientist does not properly consider what the appropriate defense hypothesis would be for the case under investigation, and what would be the relevant population specified by that hypothesis, and does not obtain a sample representative of this population with which to model the denominator of the likelihood ratio, then the question for which they provide an answer will not be the question for which the trier of fact requires an answer. Whatever the value of the likelihood ratio calculated under such circumstances, it is meaningless because it answers the wrong question. Note that a very large likelihood ratio can be obtained if the properties of the suspect and offender samples are far out on a tail of the distribution of those properties in a model of the population, but this is very misleading if the model is of the wrong population.

Likewise, if the actually relevant population has not been sampled to build the test database, then testing will be on a sample of the wrong population and the results will not be relevant to the case under investigation.7

The other error in testing procedures is not to test under conditions reflecting the conditions of the particular case under investigation. In forensic voice comparison such conditions can include recording duration, speaking style, and recording conditions, the latter including noise, reverberation, transmission of the speech signal over different communication systems, and lossy compression in the storage format. Mismatches in conditions are typical, e.g., a half-hour police interview with background noise and reverberation recorded directly from a microphone versus a two-minute exchange of information about bank account details with the speaker of interest using a mobile telephone and the recording made by a device attached to a landline telephone and saved in a compressed format. The results of testing under conditions which do not reflect those of the case under investigation may be of little or no relevance with respect to the performance of the system when applied to the actual suspect and offender recordings. I am aware of court cases involving forensic voice comparison using quantitative measurements, databases, and statistical models where the practitioner has either not tested the performance of the system using samples from the relevant population and under conditions reflecting those of the case under investigation, or has not performed any tests at all and has at best relied on tests conducted by commercial manufacturers or tests reported in academic research publications; the latter tests having been conducted using samples of populations and under recording conditions which were very different from those of the cases under investigation.8

7 A sample of the wrong population in forensic voice comparison may include pairs of voice recordings which subjectively sound very different from each other, potentially so different that no one would think that they could be produced by the same speaker and hence so different that they would not be submitted for forensic comparison. Such pairs may also present very easy trials for a forensic-voice-comparison system, leading to validity and reliability test results which are highly optimistic compared to how the system would perform if a truly relevant population were sampled.

8 In my lab we use databases containing multiple non-contemporaneous recordings of a large number of speakers, with each speaker recorded using multiple speaking styles on each occasion (Morrison, Rose, & Zhang [15]). We select recordings in the speaking styles which we judge to be closest to the speaking styles on the suspect and offender recordings, e.g., interview and telephone conversation. Unless we were to collect additional data, we could not perform analyses involving speaking styles which we judge to be dissimilar from any of the speaking styles in our existing databases (we do not have recordings of whispered speech for example). Our databases are of particular languages and dialects and, pending the collection of databases of other languages and dialects, we cannot perform analyses involving other languages and dialects. For a particular case a subset of speakers from a database is selected for inclusion in the sample of the population via a procedure which makes use of a panel of listeners (see Morrison, Ochoa, & Thiruvaran [7]). Our databases consist of high-quality audio recordings which we process to reflect the conditions of the suspect and offender recordings, for example by adding reverberation and noise, and by passing the recordings through different telephone channels (landline, mobile telephone, etc.) or simulations thereof. A paper describing the details of both the data-selection procedure and the recording-condition simulation is in preparation; it is based on the speaking styles and recording conditions which we encountered in an actual case.
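Footnote 8 mentions processing high-quality database recordings to reflect the conditions of the suspect and offender recordings. As a rough, generic illustration of that kind of processing (not the lab's actual simulation procedure; function and parameter names are mine), a clean signal can be band-limited to the traditional landline passband and mixed with noise at a target signal-to-noise ratio:

    import numpy as np
    from scipy.signal import butter, sosfilt

    def simulate_telephone(clean, fs, snr_db=15.0, seed=0):
        # Band-limit to roughly the landline passband (300-3400 Hz)...
        sos = butter(4, [300.0, 3400.0], btype="bandpass", fs=fs, output="sos")
        degraded = sosfilt(sos, clean)
        # ...then add white noise scaled to the requested SNR.
        rng = np.random.default_rng(seed)
        noise = rng.standard_normal(degraded.shape)
        speech_power = np.mean(degraded ** 2)
        noise_power = speech_power / (10.0 ** (snr_db / 10.0))
        noise *= np.sqrt(noise_power / np.mean(noise ** 2))
        return degraded + noise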

2.5. Pre-testing or case-by-case testing?

Cole [4] proposes that forensic science culture/society be reformed on the model of medical culture/society. This would include researchers who develop and validate new techniques; practitioners who are skilled users of those techniques, who understand the theory behind them and the results of the validation studies, and who make informed practical decisions accordingly; and technicians who follow prescribed protocols without necessarily having substantial knowledge about theory or the results of validation studies. Under this system the practitioners would be highly trained individuals and would be required to keep up with research developments, but would not themselves typically conduct validation studies.

Such a model I think assumes that there are a relatively small number of conditions under which forensic systems need to be validated, that the results of validation studies can then be published, and when a practitioner works on a case they have to determine what the conditions are and look up the results of existing validation studies under those conditions for the system or systems they are considering using. Perhaps the number of conditions which need to be pre-tested runs to the tens or hundreds, but testing them all is conceivably achievable in the short term, and then the practitioners are covered for the majority of cases on which they are likely to work. There would be a small proportion of cases when the practitioner recognizes that the conditions are unusual and not covered in the existing validation literature, and in these instances they would call in the researchers to address the problems or even take over the case. I could conceive that this may be the way that some forensic DNA analysis laboratories already work, and that there may be other branches of forensic science where this model could be applied.

If Cole's [4] proposal could be widely implemented, I think this would lead to improvement in forensic-science practice. When it comes to forensic voice comparison, however, I do not think that the pre-testing aspect of the model can be applied in the foreseeable future, if ever. Given the very large (perhaps infinite) number of possible combinations of conditions and the variability in what constitutes the relevant population from case to case, and therefore the low probability of frequent repetitions of the same conditions and relevant population, I think that validity and reliability have to be tested on a scenario-by-scenario basis, which is effectively a case-by-case basis.9 This means that every forensic-voice-comparison laboratory will have to have staff capable of running validity and reliability tests.

9 Testing on a case-by-case basis (scenario-by-scenario basis) should not be confused with optimizing the system for the conditions of each case (scenario). The system must be optimized in the sense that it must be trained using data from a sample of the relevant population and that the data must reflect the conditions of the actual suspect and offender recordings (speaking styles and recording conditions). The system may also be optimized to, for example, attempt to compensate for differences between the suspect and offender recordings due to mismatches in speaking styles and recording conditions. All system optimization should be completed and the system should be frozen (no further changes allowed) before the system is tested, and testing should be performed using test data which are separate from any data used to train and optimize the system. The system must be tested on previously unseen data since training and testing on the same data would give an overly optimistic assessment as to how the system would be expected to perform on previously unseen data such as the actual suspect and offender samples. In our laboratory we maintain a strict chronology in casework whereby the system is first trained and optimized, second the performance of the system is tested, and third the likelihood ratio for the actual suspect and offender recordings is calculated. Each step is completed before starting the subsequent step and no changes to previous steps are allowed once subsequent steps have begun. Preliminary tests may be performed in order to select parameters which optimize the system for the conditions of the case under investigation, but no overlap is allowed between the data used for these preliminary tests (development data) and the data used for the final test of system performance.
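The strict separation described in footnote 9 between training/optimization data, development data, and final test data can be illustrated with a simple speaker-disjoint split. This is only an illustrative sketch under my own assumptions (the recordings are objects with a hypothetical speaker_id attribute):

    import random

    def speaker_disjoint_split(recordings, train_frac=0.6, dev_frac=0.2, seed=0):
        # Split by speaker, not by recording, so no speaker appears in
        # more than one set: train/optimize on the training set, use the
        # development set for preliminary parameter-selection tests,
        # freeze the system, then run the final test once on the
        # held-out test set.
        speakers = sorted({r.speaker_id for r in recordings})
        random.Random(seed).shuffle(speakers)
        n_train = int(len(speakers) * train_frac)
        n_dev = int(len(speakers) * dev_frac)
        train_ids = set(speakers[:n_train])
        dev_ids = set(speakers[n_train:n_train + n_dev])
        train = [r for r in recordings if r.speaker_id in train_ids]
        dev = [r for r in recordings if r.speaker_id in dev_ids]
        test = [r for r in recordings
                if r.speaker_id not in train_ids | dev_ids]
        return train, dev, test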
spectrographic approach may have contributed to a wrongful convic-
tion [State of Texas v David Shawn Pope, 1986, 204th District Court,
3. The spectrographic/aural–spectrographic approach

3.1. Introduction

The aural–spectrographic approach (aka “voiceprinting” and “voicegram identification”) consists of listening to suspect and offender recordings (and depending on the protocol also recordings of foil speakers), and looking at spectrograms made from each of those same recordings, then making a decision on the basis of this aural and visual examination. A spectrogram in this context is a graphical representation of the time, frequency, and amplitude properties of an acoustic signal. Traditionally time is represented on the abscissa of a two-dimensional graph, frequency on the ordinate, and amplitude as the intensity of a monochrome image within those axes (colors can also be used to represent amplitude and on-screen computer graphics can be used to view virtual three-dimensional objects). An example of a spectrogram is shown in Fig. 1.
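As a concrete illustration of the representation just described, the following sketch draws a conventional grayscale spectrogram, time on the abscissa, frequency on the ordinate, and amplitude as image intensity. The random signal is a placeholder of my own; substitute a real mono recording and its sampling rate:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.signal import spectrogram

    fs = 16000                              # sampling rate (Hz)
    audio = np.random.standard_normal(fs)   # placeholder for 1 s of speech
    f, t, sxx = spectrogram(audio, fs=fs, nperseg=512, noverlap=384)
    plt.pcolormesh(t, f / 1000.0, 10 * np.log10(sxx + 1e-12), cmap="gray_r")
    plt.xlabel("Time (s)")
    plt.ylabel("Frequency (kHz)")
    plt.show()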
Using both auditory and visual examination, the aural–spectrographic practitioner forms an experience-based subjective opinion as to whether the suspect and offender recordings were produced by the same speaker. The spectrographic approach is visual-mode only, but was supplanted by the aural–spectrographic approach at the beginning of the 1970s (the term “spectrographic approach” can also be used as a cover for both the visual-only and the aural-plus-visual approaches). In both approaches the practitioner's opinion is based directly and wholly on their subjective experience-based judgment.

There has been much debate as to whether these approaches are appropriate. Unsupported claims of near infallibility have been made, and at times the debate has been acrimonious; see, for example, Hollien [16, pp. 24–25 and ch. 6], and Koenig [17]. The most comprehensive balanced review of the use of the approaches and the debate about their use appears in Gruber & Poza [18]. Other more recent, though less thorough, reviews appear in Meuwly [19, ch. 5]/[20,21], Rose [22, pp. 107–123], and Morrison [12, §99.680–99.690]. Almost all the literature on the topic focuses on the situation in the United States.

3.2. Legal admissibility

In 2003 in United States v Robert N Angleton [2003, 269 F Supp 2nd 892 S D TX] the court conducted a relatively thorough review of the admissibility of the aural–spectrographic approach under Federal Rule of Evidence 702 and case law following Daubert v Merrell Dow Pharmaceuticals [1993, 509 US 579], and ruled that the approach was not admissible.10 The Innocence Project documents a case in which use of the spectrographic approach may have contributed to a wrongful conviction [State of Texas v David Shawn Pope, 1986, 204th District Court, Case No F85-98755-MQ].11

I am not aware of any recent use of the aural–spectrographic approach in Canadian courts. It was admitted in Ontario and Manitoba in the 1970s (see Tosi [23], Appendix A), and in February 2013 the opinion of a practitioner of the aural–spectrographic approach was used by a journalist as part of an investigation into the “robocall” political scandal [24].

I appeared as an expert witness in two cases in Australian courts last year (2012) where I was asked to critique reports submitted by experts who used the aural–spectrographic approach. One of these cases was before a jury, and the lawyer on one side attempted at voir dire to have the testimony based on the aural–spectrographic approach ruled inadmissible (the other case was heard by judge alone). The admissibility rules in Australian jurisdictions are very different from the US federal rules, and the attempt was unsuccessful. The judge decided that they were bound by Regina v Gilmore [1977, 2 NSWLR 935], a case from 1977 in which the approach had been ruled admissible, in part because it had been ruled admissible by a number of US courts in the early to mid 1970s (which was in turn in part on the basis of the results of a study by Tosi et al. [25], which will be discussed in Section 3.4).

10 I am also aware of one US-state-level appeal ruling in 2003 [State of Louisiana v Gary Morrison 2003 KW 1554 (no relation)] and another in 2009 [State of Vermont v Gregory S Forty 2009 VT 118], both Daubert-based, both spending less time on this issue than Angleton, and both coming to the same conclusion of inadmissibility. The appeal court in Forty ruled that the lower court should not have made its decision on the basis of earlier case law such as Angleton which was not actually brought forward by either the prosecution or defense. It did, however, uphold the exclusion of the aural–spectrographic approach in this case on the grounds that the ABRE protocol (see Section 3.7) required that at least ten words be examined and the expert had only been able to examine eight.

11 http://www.innocenceproject.org/Content/David_Shawn_Pope.php

Fig. 1. Example spectrogram of the word “spectrogram”. [Figure not reproduced: grayscale spectrogram; abscissa Time (s), 0.1–0.8; ordinate Frequency (kHz), 0–4.]

3.3. Reports including consideration of principles for determining acceptable practice
In 1968 Peter B. Denes, at that time the Chair of the Speech Communication Technical Committee (SCTC) of the Acoustical Society of America (ASA), appointed six SCTC members (including himself) to investigate the use of the spectrographic approach [26]. In a sense this study group was the forerunner of the current ASA Forensic Acoustics Subcommittee (FAS), although there was approximately a 40 year gap between the study group being active and the formation of the FAS. At the time, a visual-mode only approach was prevalent and this was the focus of the study group's investigation. The study group presented a draft report at the SCTC meeting on 9 April 1969, at which the SCTC endorsed the report. The final version (Bolt et al. [27]) was published in the Journal of the Acoustical Society of America (JASA) in 1970 with a footnote that the views expressed were those of the authors as individuals. The following quotes from the report are of particular interest in relation to my topic of principles governing acceptable practice in forensic science.
1970 ASA SCTC study group report:

What kinds of evidence would convince scientists of the reliability of speaker identification based on voice patterns?
The usual basis for the scientific acceptance of any new procedure is an explicit description of experimental methods and of results of relevant tests. The description must be sufficient to allow the replication of experiments and results by other scientists....
Lacking explicit knowledge and procedures, can individuals nevertheless acquire such expertise in identification from voice patterns that their opinions could be accepted as reliable?… Validation of this approach to voice identification becomes a matter of replicable experiments on the expert himself, considered as a voice identifying machine.
Thus, voice identification might be accomplished either on the basis of explicit knowledge and procedure available to anyone, or on the basis of the unexplained expertise of individuals. In either case, validation requires experimental assessment of performance on relevant tasks.…
It may be objected that this minimal set of tests is unreasonably arduous. We do not believe that it is. As scientists we could accept no less in checking the reliability of a “black box” supposed to perform speaker identification. [27, pp. 601–602]
Court determinations may also depend on the apparent validity of exhibits brought in evidence. [27, p. 602]
We conclude that the available results are inadequate to establish the reliability of voice identification by spectrograms… Procedures exist, as we have suggested, by which the reliability of voice identification methods can be evaluated. We believe that such validation is urgently required. [27, p. 603]

The principles expressed in Bolt et al. [27] parallel the second and third elements of the paradigm I promote: highly preferred use of approaches based on quantitative measurements, databases representative of the relevant population, and statistical models; and obligatory testing of validity and reliability under conditions reflecting those of the case under investigation using test data drawn from the relevant population. The first element, obligatory use of the likelihood-ratio framework, was not introduced to forensic voice comparison until the late 1990s (see Morrison [28] for a history).

The principles and conclusions expressed in Bolt et al. [27] were also similar to principles and conclusions expressed in a 1979 NRC report on the aural–spectrographic approach [29] prepared at the request of the Federal Bureau of Investigation (FBI; the committee included both proponents and opponents of the approach), in the 2009 NRC report on forensic science in general [2], and in the US National Institute of Standards and Technology (NIST) and National Institute of Justice (NIJ) 2012 report on forensic fingerprint analysis [30].

1979 NRC Report:

The degree of accuracy, and the corresponding error rates, of aural-visual voice identification vary widely from case to case, depending upon several conditions including the properties of the voices involved, the conditions under which the voice samples were made, the characteristics of the equipment used, the skill of the examiner making the judgments, and the examiner's knowledge about the case. Estimates of error rates now available pertain to only a few of the many combinations of conditions encountered in real-life situations. These estimates do not constitute a generally adequate basis for a judicial or legislative body to use in making judgments concerning the reliability and acceptability of aural-visual voice identification in forensic applications. [29, p. 60]
The Committee concludes that the full development of voice identification by both aural-visual and automated methods can be attained only through a longer-term program of research and development leading to a science-based technology of voice identification. [29, p. 60]
An important initial step in developing research plans will be the development of a standard data base of voice samples that are representative of the relevant populations and of the characteristics encountered in voice identification. [29, p. 61]
determining the acceptability of a particular error rate for a particular forensic application is a value question and not a question of scientific or technical fact. It can be answered properly not by this Committee and not by the technical examiner, but only by the judicial or legislative body charged with regulating the proceeding in question. [29, p. 62]

2009 NRC Report:

some forensic disciplines are supported by little rigorous systematic research to validate the discipline's basic premises and techniques. There is no evident reason why such research cannot be conducted. [2, p. 22]
The judicial system is encumbered by, among other things, judges and lawyers who generally lack the scientific expertise necessary to comprehend and evaluate forensic evidence in an informed manner, trial judges (sitting alone) who must decide evidentiary issues without the benefit of judicial colleagues and often with little time for extensive research and reflection, and the highly deferential nature of the appellate review afforded trial courts' Daubert rulings. Given these realities, there is a tremendous need for the forensic science community to improve. Judicial review, by itself, will not cure the infirmities of the forensic science community. [footnote omitted] The development of scientific research, training, technology, and databases associated with DNA analysis have resulted from substantial and steady federal support for both academic research and programs employing techniques for DNA analysis. Similar support must be given to all credible forensic science disciplines if they are to achieve the degrees of reliability needed to serve the goals of justice. [2, pp. 12–13]

for DNA analysis. Similar support must be given to all credible forensic same speaker as on the first recording or say that none of the other
science disciplines if they are to achieve the degrees of reliability need- recordings were produced by the same speaker. The examiners also had
ed to serve the goals of justice. [2, pp. 12–13] to indicate how confident they were in their decision.
The interpretation of the results reported in Tosi et al. [25] is hin-
2012 NIST/NIJ report: dered by a conflation of different types of error. In open-set trials there
are four types of error (assuming that the suspect is in the lineup)13:
A basic tenet of experimental science is that “errors and uncertainties
exist that must be reduced by improved experimental techniques and A The suspect is the offender but the examiner picks a foil speaker in-
repeated measurements, and those errors remaining must always be stead.
estimated to establish the validity of our results.” [31, p. 1] What B The suspect is the offender but the examiner says that the offender is
applies to physics and chemistry applies to all of forensic science: “A not in the lineup.
key task … for the analyst applying a scientific method is to conduct C The suspect is not the offender and the examiner picks the suspect.
a particular analysis to identify as many sources of error as possible, D The suspect is not the offender and the examiner picks a foil speaker.
to control or eliminate as many as possible, and to estimate the mag-
In a real case, errors A and D would be immediately detected (as-
nitude of remaining errors so that the conclusions drawn from the
suming that none of the foil speakers could be the offender). In signal-
study are valid.” [2, p. 111] In other words, errors should, to the extent
detection theory error B is known as a miss, in this context it would
possible, be identified and quantified. [30, p. 21]
not be immediately detectable and could contribute to a guilty person
quantified “error rates”… not only can lead to improvements in the
being declared not guilty. In signal-detection theory error C is known
reliability and validity of current practices, but it also could assist in
as a false alarm, in this context it would not be immediately detectable
more appropriate use of the evidence by fact-finders… Many court
and could contribute to an innocent person being declared guilty. Guilty
opinions have discussed error rates of scientific tests such as
versus not guilty decisions would be made by the trier of fact who
polygraphy, speaker identification, and latent print identification as
would weigh the voice-comparison evidence along with all the other
a consideration affecting the admissibility of these tests. [30, p. 22]
evidence presented in the legal trial; other evidence may outweigh
Recommendation 6.3: A testifying expert should be familiar with the
the voice-comparison evidence. In Tosi et al. [25] no speakers were
literature related to error rates. A testifying expert should be prepared
specified as suspects and the situation was simplified to presence of tar-
to describe the steps taken in the examination process to reduce the
get speaker (A and B) and absence of target speaker (C and D). Given
risk of observational and judgmental error. The expert should not
this, Tosi et al. [25] did not have a distinction between error types C
state that errors are inherently impossible or that a method inherently
and D, so conflated C and D as a single error type, and, for no apparent
has a zero error rate. [30, p. 209]
reason, they also conflated this with error type A. Results were therefore
reported as error type B “false elimination” and conflated error type
I think that a clear pattern emerges across all these sober reports is-
A + C + D “wrong matches”.
sued over the last four-and-a-half decades: The key scientific concern
According to Tosi et al. [25], in the most forensically realistic condi-
regarding any approach to forensic analysis (including the spectro-
tion tested (open set and non-contemporaneous recordings) the “false
graphic or aural–spectrographic approach for forensic voice compari-
elimination” rate was 13% and the “wrong matches” rate was 6%. If
son) is whether it has been demonstrated to be sufficiently valid and
only the 74% of decisions when the examiners were “fairly certain” or
reliable under casework conditions. It is up to forensic scientists to dem-
“almost certain” were included, these rates dropped to 5% and 2% re-
onstrate the degree of validity and reliability of the approach under
spectively. These values were averaged across fixed and random word
casework conditions, and it is up to the legal authorities, who may not
context, but results for the more forensically realistic random context
have a good understanding of the approach itself, to decide whether
were said to be poorer than those for the fixed context.
the demonstrated degree of validity and reliability is sufficient.12
3.4. Tests of validity

Over the years there have been a number of tests of the validity of the spectrographic and aural–spectrographic approach. Some of these are summarized in the 1979 NRC report [29], Gruber & Poza [18], and elsewhere, but one research report merits attention here because it described the largest study conducted and it had the greatest impact on practice and (for a number of years) admissibility. In 1972 Tosi et al. [25] reported on a study conducted between 1968 and 1970 using recordings of 250 US-English speakers, and just under 35 thousand experimental trials performed by 29 examiners. These experiments were conducted in visual-only mode. Prior to performing the test, as part of the research protocol the examiners received one month of training in the spectrographic approach (the participants were not previously-trained professional practitioners of the approach). In general, the examiners were presented with spectrograms from a recording of one speaker, and a set of spectrograms from recordings of multiple other speakers one of whom might be the same speaker. Some trials were closed set where the examiner knew that the target speaker was included, but others were open set where the examiner had to either choose a recording as being produced by the […] and D, so conflated C and D as a single error type, and, for no apparent reason, they also conflated this with error type A. Results were therefore reported as error type B “false elimination” and conflated error type A + C + D “wrong matches”.13

13 This discussion is framed in terms of categorical decisions, which are not consistent with the role of the forensic scientist in the likelihood-ratio framework. Error types A and B here are the same as indicated by those letters in Tosi et al. [25], but error types C and D differ.

According to Tosi et al. [25], in the most forensically realistic condition tested (open set and non-contemporaneous recordings) the “false elimination” rate was 13% and the “wrong matches” rate was 6%. If only the 74% of decisions when the examiners were “fairly certain” or “almost certain” were included, these rates dropped to 5% and 2% respectively. These values were averaged across fixed and random word context, but results for the more forensically realistic random context were said to be poorer than those for the fixed context.

Tosi et al. [25] ended their paper by speculating as to how their results might be relevant for casework conditions. Tosi et al. [25] was immediately criticized by the ASA SCTC study group (Bolt et al. [32], see also Gruber & Poza [18]).14 The primary criticisms were that the laboratory tests were methodologically flawed and did not reflect casework conditions, and that Tosi et al.'s attempt to extrapolate to casework conditions was based on dubious assumptions.15

14 In contrast to the opinion expressed by the ASA SCTC study group, Greene [33, pp. 189–190] and Tosi [23, p. 144] quote extracts from a letter written on 23 or 28 March 1973 by then President of the ASA, Karl D. Kryter, which appears to give conditional support to the use of the aural–spectrographic approach: “contrary to the resolution [Bolt et al. [32]?], it can be stated, in my opinion, that by scientific tests it has been proven within normal standards of scientific reliability and validity, that voiceprints for some speakers, under certain conditions and with certain analysis procedures, can provide positive identification…” (Karl D. Kryter quoted in Tosi [23, p. 144]). There is no record of this letter in the minutes of ASA Executive Council meetings (p.c. Elaine Moran, ASA Office Manager, 13 November 2012), so it appears that Kryter wrote this letter in his personal capacity rather than in his capacity as President of the ASA.

15 Rather than discuss the “Tosi extrapolation” here, I recommend the review of this issue in Gruber & Poza [18, part B]. Extrapolation from laboratory studies to casework was also discussed at length in the 1979 NRC report [29].

One criticism of note is that the 250 speakers who Tosi et al. [25] claimed to be from a homogeneous population probably included many pairs of individuals who did not sound very much like each other, and did not therefore constitute members of the same relevant population. If they were not sufficiently similar
sounding that an investigator would submit them for forensic comparison with each other, then they would not constitute forensically realistic different-speaker trials. Those trials would be too easy and the correct-decision rate would be inflated.

Tosi et al. [25] also made reference to a “field study” conducted at the Michigan Department of State Police Crime Laboratory. This was a review of 673 aural–spectrographic cases conducted between 1967 and 1970. The results, as reported in Tosi et al. [25], were that no decision was made in 59% of cases because of poor audio quality or quantity (as compared with 26% “almost uncertain” and “fairly uncertain” responses in the laboratory study), and of the remainder, 38% were declared to be the same speaker and 62% different speakers. It was further reported that of the same-speaker conclusions 29% were confirmed because the suspects “admitted culpability or were convicted by evidence other than that produced by their voice” (p. 2042). The latter clearly has a danger of circularity: a suspect may plead guilty or be found guilty even if they are innocent, and if the voice evidence is presented one cannot determine the extent to which this contributed to a jury's decision (it may be that the voice evidence was not presented in these legal trials).

The argument in Tosi et al. [25] appears to have been that the use of the aural–spectrographic approach (visual and auditory) by professionals taking as much time as they need, as opposed to the use of the spectrographic approach (visual only) by amateurs with only one month of training performing the task in a limited amount of time, combined with the use of a “no decision” option, will lead to higher correct-decision rates. As an argument in favor of the use of the aural–spectrographic approach (not necessarily in contrast with the spectrographic approach), I find this unconvincing. I need to see the results of tests under conditions reflecting those of casework, and for which there can be no dispute as to the same-speaker or different-speaker status of every test pair. I do, however, believe it is self-evident that if one avoids making a decision in cases one judges to be difficult and removes these from the statistics, then one will be left with cases which are generally easier and one will therefore achieve a better correct-decision rate. I also believe that in fact all practitioners, irrespective of their approach, decline to perform analyses when a priori they believe that the quantity or quality of the recorded material is such that their system is unlikely to produce a high strength of evidence in either direction. I believe that it would be inappropriate to proceed with a full analysis without at least making the client aware of the likely limitations. An attack on this practice per se I would not consider appropriate, but neither would I consider it appropriate to make unsubstantiated claims that this practice will eliminate or virtually eliminate errors. If this practice is part of casework conditions, then it should be included when assessing the degree of validity and reliability of a forensic system under casework conditions.
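The selection effect just described is straightforward to demonstrate numerically. The following is a minimal sketch using an invented model of examiner performance, not data from any actual study: each trial is assigned a difficulty, the probability of a correct decision decreases with difficulty, and declining to decide on the hardest trials raises the correct-decision rate computed over the remaining trials even though nothing about the underlying system has changed.

    import random

    random.seed(0)

    # Invented model: trial difficulty is uniform on [0, 1], and the
    # probability of a correct decision falls from 0.95 to 0.55 as
    # difficulty rises.
    difficulties = [random.random() for _ in range(100000)]
    correct = [random.random() < (0.95 - 0.4 * d) for d in difficulties]

    def correct_rate(max_difficulty):
        # Simulate declining to make a decision on the hardest trials
        # and computing the statistics over the remainder only.
        kept = [c for d, c in zip(difficulties, correct) if d < max_difficulty]
        return sum(kept) / len(kept)

    print(f"all decisions counted:         {correct_rate(1.0):.3f}")
    print(f"hardest 30% of trials dropped: {correct_rate(0.7):.3f}")

On this invented model the measured correct-decision rate rises simply because the denominator now contains only easier trials, which is precisely the point made above: if a no-decision option is part of casework practice it should be included when assessing the system, and its rate of use reported, rather than being allowed to silently inflate performance statistics.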
3.5. Conversions and reasons for conversion

It is interesting to note that although some proponents of the spectrographic or aural–spectrographic approach seem to have believed in its efficacy more as a matter of faith than on the basis of evidence, there were a number of researchers and practitioners who converted to or away from supporting the use of the approach on the basis of their assessment of the results of empirical tests of its validity. Even if I were to disagree with someone else's assessment as to the extent to which the results of such tests constituted convincing evidence, I would still consider this a rational basis on which to make such a decision.

Both Oscar Tosi and Peter Ladefoged were initially of the opinion that the validity of the spectrographic approach had not been proven, but later became supporters of the aural–spectrographic approach, apparently on the basis of the studies reported in Tosi et al. [25] and their own personal experience (see Ladefoged [34] and Tosi [23, pp. 137, 138, 140]). Ladefoged, however, seemed to be particularly concerned that the validity of the aural–spectrographic approach not be overstated, and regarded the 6% “wrong matches” rate as a minimum error rate (Ladefoged [34], Gruber & Poza [18, §7], Solan & Tiersma [35, pp. 418–420]).

The FBI began using the aural–spectrographic approach in the 1950s or early 1960s, it commissioned the 1979 NRC report, and continued using the approach until 2011 (Koenig [36], Archer [37]). Throughout that period the approach was used for investigative purposes but not for presentation of evidence in court (Koenig [36], Archer [37]). As of 2012 the FBI no longer uses the aural–spectrographic approach (p.c. Hirotaka Nakasone, Senior Scientist, Digital Evidence Section, FBI, 30 November 2011). In the late 1990s the FBI got involved in automatic-speaker-recognition (ASR) research, and now uses an ASR-based approach to forensic voice comparison, again for investigative purposes only (Archer [37]). Why did the FBI move away from the aural–spectrographic approach and towards an ASR approach? According to Hirotaka Nakasone, the consensus in the laboratory was that16:

• an ASR approach uses quantitative measurements, data, and statistical models, rather than the aural–spectrographic approach's subjective decisions, and is therefore a priori considered more reliable
• an ASR approach allows for easier testing of within- and between-analyst reliability
• ASR approaches in theory satisfy three out of five Daubert criteria (they have not yet been tested at a Daubert hearing), whereas the aural–spectrographic approach satisfies none
• ASR approaches have ample support from the scientific community, which has abandoned the aural–spectrographic approach
• scientists and engineers around the world are conducting serious research and development on ASR algorithms, whereas little or no research is being done on the aural–spectrographic approach
• ASR approaches use recordings of spontaneous speech, whereas the aural–spectrographic approach requires verbatim voice samples which are difficult if not impossible to obtain
• an ASR approach can perform a much larger number of comparisons in a given time
• ASR approaches can be applied regardless of the language spoken
• training an ASR analyst is easier and less time consuming, requiring training in fewer disciplines.

16 The following is a paraphrase of a written personal communication from Nakasone received 26 November 2012. A draft of the paraphrase was sent to Nakasone and revised on the basis of feedback received from him on 9 December 2012.
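Several of the points in the list above turn on the fact that an ASR system outputs a quantitative score which must then be converted into a statement of strength of evidence. As a hedged sketch of one standard way to do this, the following applies logistic-regression calibration of the general kind described in Morrison [13] to invented scores; it is not a description of the FBI's actual system, and all the numbers are illustrative only.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Invented calibration scores: higher score = the ASR system finds the
    # pair of recordings more similar. In practice these would be computed
    # from pairs of recordings drawn from the relevant population under
    # conditions reflecting those of the case under investigation.
    same_speaker_scores = np.array([2.1, 3.0, 1.7, 2.6, 0.9, 2.2])
    diff_speaker_scores = np.array([-1.5, -0.3, 0.4, -2.0, -0.8, -1.1])

    X = np.concatenate([same_speaker_scores, diff_speaker_scores]).reshape(-1, 1)
    y = np.concatenate([np.ones(6), np.zeros(6)])  # 1 = same speaker

    # Logistic regression fits a linear function of the score estimating
    # the log posterior odds; with equal numbers of same-speaker and
    # different-speaker training pairs the training prior odds are 1, so
    # the fitted log odds can be read as a log likelihood ratio.
    model = LogisticRegression().fit(X, y)

    new_score = np.array([[1.8]])
    log_lr = model.decision_function(new_score)[0]
    print(f"estimated likelihood ratio = {np.exp(log_lr):.1f}")

Nothing in this conversion exempts the resulting system from the principle argued for throughout this paper: the calibrated likelihood ratios would still have to be empirically tested for validity and reliability.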
3.6. Is the fact that a spectrogram is used a key aspect of the criticism of the approach?

With respect to the spectrographic and aural–spectrographic approaches, is the fact that a spectrogram is used a key aspect of the criticism of the approach? I would argue that it is not. As I understand it, sober criticisms have always centered on the issue of whether the degree of validity and reliability of the approach has been demonstrated under conditions reflecting those of casework. If a forensic scientist did not use spectrograms, but, for example, instead measured formant values from tokens of a number of vowel phonemes, plotted these on a first-formant by second-formant (F1 by F2) plot, and then on the basis of looking at these plots made an experience-based subjective opinion, this approach would be subject to the same criticisms. Whether such an approach were deemed acceptable should depend on whether it had been tested under casework conditions and found to be sufficiently valid and reliable. The same criterion should apply even if a graphic representation is not used at all. The same criterion should apply to a purely auditory approach, to looking at a table of numbers, and, as argued earlier, to an approach based on data representative of the relevant population, quantitative measurements, and statistical models.

Under cross-examination I was asked by a lawyer whether there were studies which had shown that the aural–spectrographic approach
was more valid than the spectrographic approach (it seemed to be a question from 1972 rather than 2012). The expert that this lawyer had called had used the aural–spectrographic approach, and what the lawyer seemed to be implying was that criticisms of the spectrographic approach did not apply because that expert had listened as well as looked. There was also an argument made that the expert had actually formed their opinion on the basis of listening, and had only used the spectrograms to confirm that opinion. What the lawyer seemed to have failed to understand (or willfully ignored) was that the key point in my testimony with respect to the expert's approach had been that they had failed to present any evidence of the validity and reliability of their approach under the conditions of the particular case at trial and with respect to the relevant population for this case (or under any other conditions or with respect to any other population for that matter). The issue of the use of spectrograms per se was actually a red herring. The opposing lawyer, who tried to have the aural–spectrographic approach ruled inadmissible, may also have fallen into this trap. What one needs to focus on is principles; one should not get fixated on approaches.

3.7. The IAFPA resolution

I was quite surprised when I critiqued the two 2012 reports in which the aural–spectrographic approach had been used, because I knew that the authors of both reports were members of the International Association for Forensic Phonetics and Acoustics (IAFPA),17 and I thought that in 2007 IAFPA had issued a statement to the effect that its members should not use this approach. The resolution is as follows:

IAFPA dissociates itself from the approach to forensic speech comparison known as the “voiceprint” or “voicegram” method in the sense described in Tosi (1979).
This approach to forensic speaker identification involves the holistic, i.e., non-analytic, comparison of speech spectrograms in the absence of interpretation based on understanding of how spectrographic patterns relate to acoustic reflexes of articulatory events and vocal tract configurations. The Association considers this approach to be without scientific foundation, and it should not be used in forensic casework. [38]

17 For the record, I am also a member of this association, but joined after the resolution was passed. The minutes of the IAFPA Annual General Meeting on 24 July 2007 indicate that the resolution was passed by 22 in favor, 3 abstentions, and 0 against.

After a brief conversation with the president of IAFPA (p.c. J. Peter French, 24 October 2012, who was in office in 2007 and has been continually ever since), I came to realize that my interpretation of the resolution as an outright ban on the use of the aural–spectrographic approach was not in fact what the drafters and endorsers had intended. Rather, the resolution was specifically restricted to: “the method in the sense described in Tosi (1979)… involv[ing] the holistic, i.e., non-analytic, comparison of speech spectrograms in the absence of interpretation based on understanding of how spectrographic patterns relate to acoustic reflexes of articulatory events and vocal tract configurations.” But what does this mean? Let us unpack the IAFPA resolution.

First, Tosi disliked the term “voiceprint” but recognized that it referred to the aural–spectrographic approach, the same general approach which he used:

[In this book] special emphasis was placed on the popularly, and wrongly, named “voiceprinting” method because it is the only method presently used for legal evidence. This author, an expert witness on voice identification, refers to this method as “aural and spectrographic examination of speech samples,” that is, the aural examination of tape recordings and the visual examination of their spectrograms. [23, p. ix]

“Voiceprint” was a trademark owned in the 1960s and early 70s by Voiceprint Laboratories, Inc., a company established by Lawrence Kersta. If not the originator, Kersta was definitely the popularizer of the spectrographic approach, and was an advocate of its use in visual-only mode. Tosi cited criticisms of Kersta for making unsubstantiated claims of infallibility (Tosi [23, pp. 68–69], see also Gruber & Poza [18, §10]). The term “voicegram” does not appear in Tosi [23]. The 1979 NRC report used the term “voicegram” rather than “voiceprint” to avoid the implicit suggestion of association with “fingerprint” examination, which was perceived to have high validity [29, pp. 6–7].

Second, what does “holistic, i.e., non-analytic” mean? My best guess is that it refers to a gestalt approach to auditory and visual comparison. This is an approach recommended by Poza & Begault [39], but it is not, at least on the face of it, the approach recommended by Tosi [23]. Tosi [23] discussed a number of acoustic–phonetic features considered in earlier aural and spectrographic studies (p. 43), and it is also clear that he expected examiners to follow the protocol promulgated by the International Association of Voice Identification (IAVI). This required examiners to auditorily compare features such as “melody pattern, pitch, quality, respiratory grouping of words, and any peculiar common features” and to spectrographically compare “mean frequencies and apparent bandwidths (clarity) of formants, rates of change of formant frequencies, levels of components between formants, type of vertical striations and distances between them, spectral distributions of fricatives and plosives, gaps of plosives, and voice onset times of vowels following plosives” (quoted from the 1979 NRC report [29, p. 77]).18 Tosi [23] provided examples of the application of the spectrographic part of his approach (pp. 118–127) which made use of multiple acoustic–phonetic features on tokens of different words appearing in the recordings. He gave a numeric rating (from −10 to +10) to his subjective evaluation of the similarity of each word and a description of the observed similarities/differences which led him to assign each rating, then averaged the ratings to provide a final score on the basis of which he made his decision. One could query the reliability of his assignment of ratings or the appropriateness of the function he used to combine these into a single score, but this approach is clearly not gestalt, holistic, or non-analytic. Even if the approach were gestalt, as in Poza & Begault [39], would that in and of itself be unacceptable? As I argued earlier, what matters is the degree of validity and reliability of a system under conditions reflecting those of the case under investigation. If a gestalt approach were found to have better validity and reliability than any other approach, then that is the approach which should be preferred.

18 Similar lists of features appeared in a protocol promulgated by the American Board of Recorded Evidence (ABRE) [40], and in a description of the FBI's protocol in Koenig [36].
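Tosi's combination procedure, as described above, is simple enough to state exactly. The following minimal sketch, using invented words and ratings rather than any of Tosi's actual examples, reproduces the described calculation: per-word similarity ratings on the −10 to +10 scale are averaged to give the final score on which the decision was based.

    # Invented per-word similarity ratings on Tosi's -10 to +10 scale,
    # one rating per word common to the two recordings.
    ratings = {"hello": 6, "money": 2, "tomorrow": -1, "police": 4, "yes": 3}

    # The combination function described in Tosi [23, pp. 118-127]:
    # an unweighted mean of the per-word ratings.
    final_score = sum(ratings.values()) / len(ratings)
    print(f"final score = {final_score:+.1f}")  # +2.8 for these invented ratings

Whatever one thinks of the unweighted mean as a combination function, a procedure this explicit is analytic, not gestalt, and, like any other procedure, its acceptability should turn on tested validity and reliability rather than on its label.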
Third, does the aural–spectrographic approach lack “interpretation based on understanding of how spectrographic patterns relate to acoustic reflexes of articulatory events and vocal tract configurations”? Let us say that this may have been true and may still be true of some practitioners of the approach (I would even accept something more definitive than “may”),19 but the IAFPA resolution claims, or at least implies, that this was the case for Tosi's [23] approach. Is that claim true? I think
not. Oscar Tosi, among other qualifications and appointments, had a PhD in audiology, speech sciences, and electronics from Ohio State University, and was Director of the Speech and Hearing Research Laboratory at Michigan State University (1979 NRC report [29, p. 161]). Chapter 2 of Tosi [23] is on Acoustics, phonetics, and theory of voice production, and clearly demonstrates an understanding of “how spectrographic patterns relate to acoustic reflexes of articulatory events and vocal tract configurations”. Education and demonstrated knowledge may not be definitive indicators that a practitioner has integrated these into their practice, but it is probably reasonable to assume that they are correlated.

19 Let us consider, for the sake of argument, that the intention of the IAFPA resolution was to ban the use of the aural–spectrographic approach by individuals who lacked sufficient training and qualifications in phonetics but allow it by individuals with sufficient training and qualifications in phonetics. Who would decide what qualifications were necessary and who was sufficiently qualified? Would being qualified in and of itself be sufficient to make the approach and procedures used by the qualified individual acceptable for any particular case? The central thesis of the present paper is that the acceptability of a particular approach should not be based on these sorts of considerations. It should instead be based on first establishing the principles required for acceptability independent of any approach, then considering whether the particular approach conforms to those principles. In particular, the principle proposed is that the validity and reliability of the approach should be tested under conditions reflecting those of the case under investigation and that the results of those tests should be presented to the judge at an admissibility hearing and/or the trier of fact at trial, who can decide if the performance of the system is sufficient to provide them with useful information.

Finally, the IAFPA resolution states that “The Association considers this approach to be without scientific foundation”, but fails to specify what it considers to constitute a “scientific foundation”.

Perhaps I am being too literal, perhaps we all know what the drafters and endorsers of the resolution really meant, but I'm not convinced that this is the case. There is clearly a demarcation problem here, and I don't know what the conditions are under which an IAFPA member is or is not permitted to use the spectrographic or aural–spectrographic approach. My criticisms of the IAFPA resolution are not meant as an endorsement of Tosi's [23] aural–spectrographic approach. There are appropriate criticisms which can be made of the approach. Tosi [23] is far from unbiased, and at a minimum makes convenient omissions. Rather, my point is that the IAFPA resolution failed to address the issue on the level of principles, and came unstuck on a direct attack on the approach.

4. The auditory–acoustic–phonetic approach

Given IAFPA's attack on “the approach to forensic speech comparison known as the ‘voiceprint’ or ‘voicegram’ method”, I think it is fair to examine what approach might be recommended by the majority or a large proportion of IAFPA members. There is no single approach officially endorsed by IAFPA, but I believe that the Position Statement concerning use of impressionistic likelihood terms in forensic speaker comparison cases (French & Harrison [41]) is representative of the practice of a substantial proportion of its members. The Position Statement was endorsed by 25 IAFPA members, claimed to be all except one of the “practising forensic speech scientists and interested academics within the UK” (French & Harrison [41, p. 138]), hence it has come to be known as the UK Position Statement. According to the minutes of the 2007 IAFPA Annual General Meeting, the association had 82 members including 6 students, so the 25 signatories of the Position Statement represented approximately a third of the membership (excluding students, who presumably were not counted as practicing speech scientists or interested academics), and it appears that IAFPA members living outside the United Kingdom (the other two thirds of the membership) were not invited to sign.

We have previously been critical of the UK Position Statement (Rose & Morrison [42], Morrison [28, §2.5], Morrison [12, §99.400]; and see response in French et al. [43]), and I will not repeat our whole set of criticisms in the present paper. Rather I want to focus on the approach espoused in the UK Position Statement from the perspective of what I consider to be appropriate principles as to what constitutes acceptable forensic-science practice. The focus of the UK Position Statement is a framework for the evaluation of evidence, but it also assumes an experience-based “auditory–acoustic–phonetic” approach (French et al. [43, p. 150]). “It may also involve statistical analysis of the features found” (French & Harrison [41, p. 138]), but there is no elaboration as to the nature of this possible statistical analysis, and databases representative of the relevant population are not used: “we consider the lack of demographic data along with the problems of defining relevant reference populations as grounds for precluding the quantitative application of this type of approach [the likelihood-ratio framework] in the present context” (French & Harrison [41, p. 142]). Note that there are no explicit references to the spectrographic/aural–spectrographic approach in either French & Harrison [41] or French et al. [43].

In the survey on International practices in forensic speaker comparison reported in Gold & French [44], 71% of respondents (25 of 35) reported using an auditory–acoustic–phonetic approach (it is not clear whether this category subsumed the aural–spectrographic approach; the latter was not explicitly mentioned), and another 6% (2 of 35) reported using an auditory-only approach. 70% of respondents reported using “some form of population statistics in arriving at their conclusions” (Gold & French [44, p. 299]). The survey included practitioners from outside the UK and likely included practitioners who were not members of IAFPA, but I am still unable to reconcile the latter figure with the statement in the UK Position Statement about the lack of data and the problem of defining the relevant population. Also, I do not know how respondents use “population statistics” and combine them with the experience-based auditory element of their approach.
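One way in which a measured feature could in principle be combined with population statistics is the textbook likelihood-ratio calculation: evaluate the measurement from the questioned recording under a model of the suspect's speech and under a model of the relevant population. The following minimal sketch uses a single invented feature (mean fundamental frequency) and invented distribution parameters; it is not a description of any survey respondent's actual practice.

    from statistics import NormalDist

    # Invented numbers: mean fundamental frequency (Hz) measured on the
    # questioned recording, plus Gaussian models of the suspect's speech
    # and of the relevant population.
    x = 118.0
    suspect = NormalDist(mu=120.0, sigma=5.0)      # same-speaker model
    population = NormalDist(mu=135.0, sigma=20.0)  # relevant-population model

    # Likelihood ratio: density of the evidence under the same-speaker
    # hypothesis divided by its density under the different-speaker
    # (relevant-population) hypothesis.
    lr = suspect.pdf(x) / population.pdf(x)
    print(f"LR = {lr:.1f}")  # about 5.3 on these invented numbers

A real forensic-voice-comparison system would also have to model within-speaker variability across recordings and handle multiple correlated features and mismatched recording conditions, but even this toy calculation makes clear why a database representative of the relevant population is needed: without it, the denominator cannot be estimated.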
My key concern, irrespective of the approach, is whether the validity and reliability of the forensic–voice–comparison system has been assessed under conditions reflecting those of the case under investigation using test samples drawn from the relevant population. Gold & French [44] did not report on this issue. They did, however, note that approaches based on automatic-speaker-recognition techniques (used by only 20% of respondents, 7 of 35) lend themselves more readily to being tested. Testing of validity and reliability was not even mentioned in the description of the UK Position Statement set out in French & Harrison [41]. The word “reliable” occurred once in French et al. [43], “we are of the view that it is unrealistic to see it as merely a matter of time and research before a rigorously and exclusively quantitative LR approach can be regarded as feasible, let alone reliable,…” (pp. 149–150), but there was no discussion of the validity or reliability of their own approach. Ultimately I do not know what proportion of practitioners of auditory–acoustic–phonetic approaches test the validity and reliability of their systems under conditions reflecting those of the case at trial using samples drawn from the relevant population, but the lack of indication that this is a widespread practice is disturbing if one considers it essential practice.

I would further contend that the approach advocated in the UK Position Statement is open to many of the same criticisms as have previously been levied against the aural–spectrographic approach. As a demonstration, I provide the following quotes from French & Harrison [41] and French et al. [43] juxtaposed with quotes criticizing the aural–spectrographic approach.

UK Position Statement:

In considering consistency one would assess the degree to which observable features were similar or different. (French & Harrison [41, p. 141])

Criticism of the aural–spectrographic approach:

The problem here is that… it is never stated what the criteria for similarity are. (Rose [22, p. 114])

UK Position Statement:

This assessment… involves ‘separating out’ the samples into their constituent phonetic and acoustic ‘strands’ (e.g., voice quality, intonation, rhythm, tempo, articulation rate, consonant and vowel realisations) and analysing each one separately. (French & Harrison [41, p. 138])

Amongst the features commonly considered in speaker comparison cases are the following:

1. Vocal setting and voice quality. Full analysis… distinguishes phonation features, overall muscular tension features and vocal tract features, with up to 38 individual elements to be considered.
2. Intonation, potentially including analysis of tone unit nuclei, heads and tails.
3. Pitch, measured as average and variation in fundamental frequency.
4. Articulation rate.
5. Rhythmical features.
6. Connected speech processes such as patterns of assimilation and elision.
7. A large set of consonantal features, including energy loci of fricatives and plosive bursts, durations of nasals, liquids, and fricatives in specific phonological environments, voice onset time of plosives, presence/absence of (pre-)voicing in lenis plosives, and discrete sociolinguistic variables.
8. A large set of vowel features, including acoustic patterns such as formant configurations, centre frequencies, densities, and bandwidths, and auditory qualities of sociolinguistic variables.
9. Higher-level linguistic information including use and patterning of discourse markers, lexical choices, morphological and syntactic variants, pragmatic behaviour such as turn-taking and telephone call opening habits, aspects of multilingual behaviour such as code-switching.
10. Evidence of speech impediment, voice and language pathology.
11. Non-linguistic features characteristic of the speaker, for example patterns of audible breathing, throat-clearing, tongue clicking, and both filled and silent hesitation phenomena. (French et al. [43, pp. 146–147])

Criticism of the aural–spectrographic approach:

As Gruber and Poza (1995: section 59 fn11) point out, these features, although sounding impressively scientific, are not selective but exhaustive, and the protocol amounts to instructions to ‘examine all the characteristics that appear on a spectrogram’. Likewise, they point out that the instructions as to aural cues also amount to telling the examiner to take everything he hears into account. Finally, they point out that it is no use just to name the parameters of comparison: one has to be specific about what particular aspect of the feature it is that is to be compared. (Rose [22, pp. 113–114])

The lists of features to analyze provided by the proponents of the UK Position Statement look remarkably similar, though not identical, to lists of features which appeared in protocols for the aural–spectrographic approach. Also, unlike Tosi [23], which the IAFPA resolution criticized for being “holistic, i.e., non-analytic”, the UK Position Statement fails to explain how the separately analyzed “strands” are ultimately to be combined to provide a single assessment of the strength of evidence.

UK Position Statement:

the rigour and detail of the analysis, together with the education, training and experience pre-requisite to carrying it out, put it well beyond the resources of a layman simply listening to the material. Additionally, by drawing upon research literature and general experience, the analyst may provide an assessment of the degree to which the features common to the questioned voice and that of the suspect are unusual or distinctive. (French & Harrison [41, p. 138])

Criticism of the aural–spectrographic approach:

the competency of forensic examiners, both in absolute terms and relative to laypersons who just listen to voices, is largely unknown;… to assert that the individual examiner's experience, combined with his competence and talent, should, in the end, override any concerns about the problems associated with subjective decision making is to make a very questionable assumption. (Gruber & Poza [18, §6])

Proponents often claim that an examiner's “experience” will enable him or her to distinguish inter- from intra-talker characteristics,… Putting forth “experience” as the answer to difficult scientific questions, without the accompaniment of compelling empirical scientific data, has done little to quell the controversy surrounding this technique. (Gruber & Poza [18, §7])

“‘voiceprint’ enthusiasts” should follow the example of those working on automatic and semi-automatic speaker identification systems and resist premature application of their technique in court and in forensic investigation; specifically… they should apply a “moratorium” to their activities until they can unequivocally demonstrate that their system provides acceptable identification levels… (Nolan [45, p. 25])

Although too small to be contemplated as a representative sample, I critiqued reports written by three IAFPA members last year (2012). Two used aural–spectrographic approaches20 and one used an approach based on data, quantitative measurements, and statistical models. None of the three presented tests of the validity and reliability of their approach, and one even claimed that such testing was impossible.

20 In addition to listening and looking at spectrograms, they also reported some quantitative fundamental–frequency measurements. Thus perhaps their approach should be called aural–spectrographic–acoustic–phonetic, but it doesn't matter what it is called; what matters is whether it has been tested under conditions reflecting those of the case at trial using test samples drawn from the relevant population, and its validity and reliability found to be sufficient.

5. Conclusion

I have argued that if one wants to determine whether a particular approach to forensic analysis is acceptable, one should first specify what one considers to be the principles governing what would be acceptable. Once this has been done, the same principles can be applied to all approaches which one may want to consider.

In my opinion, one of the key principles is that the validity and reliability of the approach be empirically tested under conditions reflecting those of the case under investigation using test samples drawn from the relevant population. This (or a very similar principle) was also proposed in each of the Acoustical Society of America Speech Communication Technical Committee study group's 1970 report on Speaker identification by speech spectrograms: a scientists' view of its reliability for legal purposes (Bolt et al. [27]), the 1979 National Research Council report On the theory and practice of voice identification [29], the 2009 National Research Council report on Strengthening forensic science in the United States [2], and the 2012 National Institute of Standards and Technology/National Institute of Justice report on Latent print examination and human factors [30].

I have considered the aural–spectrographic approach to forensic voice comparison from the perspective of this principle, and the auditory–acoustic–phonetic approach (it does not appear that these two approaches are mutually exclusive), and also approaches based on data, quantitative measurements, and statistical models. In the end I will refrain from making an explicit statement as to whether I think any of these approaches are acceptable. What I want to emphasize instead is that, for whatever approach in whatever branch of forensic science, this decision should be based on principles. I believe that a key principle is that the forensic scientist should test the validity and reliability of the approach under conditions reflecting those of the case under investigation using test samples drawn from the relevant population, that the forensic scientist should present the results of such testing to the judge at an admissibility hearing and/or the trier of fact during a legal trial, and that it is the judge/trier of fact who should ultimately determine acceptability on the basis of the test results.

Acknowledgments

This research was supported by the Australian Research Council, Australian Federal Police, New South Wales Police, Queensland Police, National Institute of Forensic Science, Australasian Speech Science and Technology Association, and the Guardia Civil through Linkage Project
LP100200142. Unless otherwise explicitly attributed, the opinions expressed are those of the author and do not necessarily represent the policies or opinions of any of the above mentioned organizations.

References

[1] G. Edmond, D. Mercer, Trashing “junk science”, Stanford Technology Law Review (3) (1998) 1–86, http://stlr.stanford.edu/STLR/Articles/98_STLR_3.
[2] National Research Council, Strengthening Forensic Science in the United States: A Path Forward, National Academies Press, Washington, DC, 2009, http://www.nap.edu/catalog.php?record_id=12589.
[3] A. Cediel, L. Bergman, The Real CSI: How Reliable is the Science Behind Forensics? PBS Frontline, WGBH Educational Foundation, Boston, MA, April 17 2012, http://www.pbs.org/wgbh/pages/frontline/real-csi/.
[4] S.A. Cole, Acculturating forensic science: what is ‘scientific culture’, and how can forensic scientists adopt it? Fordham Urban Law Journal 38 (2010) 435–472, http://ssrn.com/abstract=1788414.
[5] G.S. Morrison, I.W. Evett, S.M. Willis, C. Champod, C. Grigoras, J. Lindh, N. Fenton, A. Hepler, C.E.H. Berger, J.S. Buckleton, W.C. Thompson, J. González-Rodríguez, C. Neumann, J.M. Curran, C. Zhang, C.G.G. Aitken, D. Ramos, J.J. Lucena-Molina, G. Jackson, D. Meuwly, B. Robertson, G.A. Vignaux, Response to Draft Australian Standard: DR AS 5388.3 Forensic Analysis — Part 3 — Interpretation, 2012, http://forensic-evaluation.net/australian-standards/#Morrison_et_al_2012.
[6] G.S. Morrison, The likelihood-ratio framework and forensic evidence in court: a response to R v T, International Journal of Evidence and Proof 16 (2012) 1–29, http://dx.doi.org/10.1350/ijep.2012.16.1.390.
[7] G.S. Morrison, F. Ochoa, T. Thiruvaran, Database selection for forensic voice comparison, in: Proceedings of Odyssey 2012: The Language and Speaker Recognition Workshop, International Speech Communication Association, Singapore, 2012, pp. 62–77, http://geoff-morrison.net/documents/Morrison,%20Ochoa,%20Thiruvaran%20(2012)%20Database%20selection%20for%20forensic%20voice%20comparison.pdf.
[8] G.S. Morrison, Measuring the validity and reliability of forensic likelihood-ratio systems, Science & Justice 51 (2012) 91–98, http://dx.doi.org/10.1016/j.scijus.2011.03.002.
[9] I.W. Evett, C.G.G. Aitken, C.E.H. Berger, J.S. Buckleton, C. Champod, J.M. Curran, A.P. Dawid, P. Gill, J. González-Rodríguez, G. Jackson, A. Kloosterman, T. Lovelock, D. Lucy, P. Margot, L. McKenna, D. Meuwly, C. Neumann, N. Nic Daeid, A. Nordgaard, R. Puch-Solis, B. Rasmusson, M. Radmayne, P. Roberts, B. Robertson, C. Roux, M.J. Sjerps, F. Taroni, T. Tjin-A-Tsoi, G.A. Vignaux, S.M. Willis, G. Zadora, Expressing evaluative opinions: a position statement, Science & Justice 51 (2011) 1–2, http://dx.doi.org/10.1016/j.scijus.2011.01.002.
[10] B. Robertson, G.A. Vignaux, Interpreting Evidence, Wiley, Chichester, UK, 1995.
[11] D.J. Balding, Weight-of-Evidence for Forensic DNA Profiles, Wiley, Chichester, UK, 2005.
[12] G.S. Morrison, Forensic voice comparison, in: I. Freckelton, H. Selby (Eds.), Expert Evidence, Thomson Reuters, Sydney, Australia, 2010 (ch. 99), http://www.thomsonreuters.com.au/forensic-voice-comparison-expert-evidence/productdetail/91156.
[13] G.S. Morrison, Tutorial on logistic-regression calibration and fusion: converting a score to a likelihood ratio, Australian Journal of Forensic Sciences 45 (2013) 173–197, http://dx.doi.org/10.1080/00450618.2012.733025.
[14] I.W. Evett, Interpretation: a personal odyssey, in: C.G.G. Aitken, D.A. Stoney (Eds.), The Use of Statistics in Forensic Science, Ellis Horwood, Chichester, UK, 1991, pp. 9–22.
[15] G.S. Morrison, P. Rose, C. Zhang, Protocol for the collection of databases of recordings for forensic–voice–comparison research and practice, Australian Journal of Forensic Sciences 44 (2012) 155–167, http://dx.doi.org/10.1080/00450618.2011.630412.
[16] H. Hollien, Forensic Voice Identification, Academic Press, San Diego, CA, 2002.
[17] B.E. Koenig, Review of Hollien (2002) Forensic voice identification, Journal of Forensic Identification 52 (2002) 762–766.
[18] J.S. Gruber, F. Poza, Voicegram identification evidence, American Jurisprudence Trials 54 (1) (1995) §1–§133.
[19] D. Meuwly, Reconnaissance de locuteurs en sciences forensiques: L'apport d'une approche automatique, Doctoral dissertation, University of Lausanne, 2001, www.unil.ch/webdav/site/esc/shared/These.Meuwly.pdf.
[20] D. Meuwly, Le mythe de l'empreinte vocale I, Revue Internationale de Criminologie et Police Technique 56 (2003) 219–236.
[21] D. Meuwly, Le mythe de l'empreinte vocale II, Revue Internationale de Criminologie et Police Technique 56 (2003) 361–374.
[22] P. Rose, Forensic Speaker Identification, Taylor and Francis, London, UK, 2002.
[23] O. Tosi, Voice Identification: Theory and Legal Applications, University Park Press, Baltimore, MD, 1979.
[24] G. McGregor, S. Maher, Tories now admit they sent Saskatchewan robocall: forensic expert links company behind latest push poll to firm behind Pierre Poutine calls, Ottawa Citizen, February 5 2013, http://www.ottawacitizen.com/news/Tories+admit+they+sent+Saskatchewan+robocall/7922470/story.html.
[25] O. Tosi, H. Oyer, L. Lashbrook, C. Pedrey, J. Nicol, E. Nash, Experiment on voice identification, Journal of the Acoustical Society of America 51 (1972) 2030–2043, http://dx.doi.org/10.1121/1.1913064.
[26] J.M. Pickett, Annual report of the Speech Communication Technical Committee, Journal of the Acoustical Society of America 46 (1969) 687–688.
[27] R.A. Bolt, F.S. Cooper, E.E. David Jr., P.B. Denes, J.M. Pickett, K.N. Stevens, Speaker identification by speech spectrograms: a scientists' view of its reliability for legal purposes, Journal of the Acoustical Society of America 47 (1970) 597–612, http://dx.doi.org/10.1121/1.1911935.
[28] G.S. Morrison, Forensic voice comparison and the paradigm shift, Science & Justice 49 (2009) 298–308, http://dx.doi.org/10.1016/j.scijus.2009.09.002.
[29] National Research Council, On the Theory and Practice of Voice Identification, National Academies Press, Washington, DC, 1979, http://books.google.ca/books?id=FjMrAAAAYAAJ.
[30] Expert Working Group on Human Factors in Latent Print Analysis, Latent Print Examination and Human Factors: Improving the Practice through a Systems Approach, US Department of Commerce, National Institute of Standards and Technology, Gaithersburg, MD, 2012, http://www.nist.gov/manuscript-publication-search.cfm?pub_id=910745.
[31] P.R. Bevington, D.K. Robinson, Data Reduction and Error Analysis for the Physical Sciences, 3rd edition, McGraw Hill, Boston, MA, 2003.
[32] R.A. Bolt, F.S. Cooper, E.E. David Jr., P.B. Denes, J.M. Pickett, K.N. Stevens, Speaker identification by speech spectrograms: some further observations, Journal of the Acoustical Society of America 54 (1973) 531–534, http://dx.doi.org/10.1121/1.1913613.
[33] H.F. Greene, Voiceprint identification: the case in favor of admissibility, American Criminal Law Review 13 (1975) 171–200.
[34] P. Ladefoged, An opinion on voiceprints, UCLA Working Papers in Phonetics 19 (1971) 84–87, http://escholarship.org/uc/item/4k81b31v.
[35] L.M. Solan, P.M. Tiersma, Hearing voices: speaker identification in court, Hastings Law Journal 54 (2003) 373–435.
[36] B.E. Koenig, Spectrographic voice identification: a forensic survey, Journal of the Acoustical Society of America 79 (1986) 2088–2090, http://dx.doi.org/10.1121/1.393170.
[37] C. Archer, HSNW conversation with Hirotaka Nakasone of the FBI: voice recognition capabilities at the FBI — from the 1960s to the present, Homeland Security News Wire, July 11 2012, http://www.homelandsecuritynewswire.com/bull20120711-voice-recognition-capabilities-at-the-fbi-from-the-1960s-to-the-present.
[38] International Association for Forensic Phonetics and Acoustics, Resolution on voiceprints, July 24 2007, http://www.iafpa.net/voiceprintsres.htm.
[39] F.T. Poza, D.R. Begault, Voice identification and elimination using aural–spectrographic protocols, in: Proceedings of the Audio Engineering Society 26th International Conference: Audio Forensics in the Digital Age, 2005 (paper 1-1).
[40] American Board of Recorded Evidence, Voice comparison standards, 1999, http://www.tapeexpert.com/pdf/abrevoiceid.pdf.
[41] J.P. French, P. Harrison, Position Statement concerning use of impressionistic likelihood terms in forensic speaker comparison cases, International Journal of Speech, Language and the Law 14 (2007) 137–144, http://dx.doi.org/10.1558/ijsll.v14i1.137.
[42] P. Rose, G.S. Morrison, A response to the UK position statement on forensic speaker comparison, International Journal of Speech, Language and the Law 16 (2009) 139–163, http://dx.doi.org/10.1558/ijsll.v16i1.139.
[43] J.P. French, F. Nolan, P. Foulkes, P. Harrison, K. McDougall, The UK position statement on forensic speaker comparison: a rejoinder to Rose and Morrison, International Journal of Speech, Language and the Law 17 (2010) 143–152, http://dx.doi.org/10.1558/ijsll.v17i1.143.
[44] E. Gold, J.P. French, International practices in forensic speaker comparison, International Journal of Speech, Language and the Law 18 (2011) 143–152, http://dx.doi.org/10.1558/ijsll.v18i2.293.
[45] F. Nolan, The Phonetic Bases of Speaker Recognition, Cambridge University Press, Cambridge, UK, 1983.