You are on page 1of 20

See discussions, stats, and author profiles for this publication at: https://www.researchgate.

net/publication/3176892

Cepstral analysis technique for automatic speaker verification

Article  in  IEEE Transactions on Acoustics Speech and Signal Processing · May 1981


DOI: 10.1109/TASSP.1981.1163530 · Source: IEEE Xplore

CITATIONS READS

1,265 1,890

1 author:

Sadaoki Furui
Tokyo Institute of Technology
405 PUBLICATIONS   9,956 CITATIONS   

SEE PROFILE

All content following this page was uploaded by Sadaoki Furui on 05 August 2014.

The user has requested enhancement of the downloaded file.


254 IEEE
TRANSACTIONS
ON
ACOUSTICS,
SPEECH, AND SIGNAL PROCESSING,
VOL. ASSP-29, NO. 2, APRIL 1981

synthesisof recursive digital fiiters,” IEEE Trans. Audio Elec-


troacoust., vol. AU-21, pp. 61-65, Feb. 1973.
[5] K. S. Chao and K. S. Lu, “On sequential refinement schemes for
recursive digital filter design,” IEEE Trans. Circuit Theory, vol.
CT-20, pp. 396-401, July 1973.
[6] F. Brophy and A. C. Salazar, “Recursive digital fdter synthesis in
the time domain,” IEEE Trans. Acoust., Speech, Signal Process-
ing, vol. ASSP-22, pp. 45-55, Feb. 1974.
[7] M. S. Bertrin, “Approximation of digital fiiters in one and two
dimensions,” IEEE Trans. Acoust.,Speech, Signal Processing,
VOI.ASSP-23, pp. 438-443, Oct. 1975.
[8] J. A. Cadzow, “Recursive digitalfiltersynthesis via gradient
based algorithms,” IEEE Trans. Acoust., Speech, Signal Process-
ing, vol. ASSP-24, pp. 349-355, Oct. 1976.
[9] R. Hastings-James and S. K. Mehra, “Extensions of the Pad&
approximant technique for the design of recursive digital fiiters,”
IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-25,
pp. 501-509, Dec. 1977.
[lo] R. Fletcher and M. J. D. Powell, “A rapidly convergent descent
method for minimization,” Cornput. J., vol. 6,pp.163-168,
July 1963.

Cepstral Analysis Technique for Automatic


Speaker Verification
SADAOKI F U R U I , MEMBER, IEEE

Abstract-This paper describes new techniques for automaticspeaker I. INTRODUCTION


verification using telephone speech. Theoperation of the system is
PEAKER verification is a process to accept or reject the
based on a set of functions of time obtained fromacoustic analysis of a
fixed, sentence-long utterance. Cepstrum coefficients are extracted by
means of LPC analysis successively throughout an utterance to form
S identity claim of a speaker by comparing a set of measure-
ments of the speaker’s utterances with a reference set of mea-
time functions, and frequency response distortions introduced by trans- surementsof the utterance of theperson whose identity is
mission systems are removed. Thetimefunctions are expanded by claimed.
orthogonal polynomialrepresentations and,after a feature selection
procedure, brought into time registration with stored reference func-
Research on an automatic system for speaker verification at
tions to calculate the overall distance. This is accomplished by a new Bell Laboratories has been reported in previous papers [ l ] -
time warping method using a dynamic programming technique. A de- [4] . The system is based on an acoustic analysis of a fixed,
cision is made to accept or reject an identity claim, based on the overall sentence-long utterance resulting in a function of time or con-
distance. Reference functions and decision thresholds are updated for tour for each feature analyzed. Features selected for analysis
each customer. in previous evaluations have included pitch, intensity, the first
Several sets of experimental utterances were used for the evaluation
of the system, which include male and female utterances recorded over three formants, and selected prediction coefficients. The sys-
a conventionaltelephoneconnection. Male utterances processed by tem which uses pitch and intensity contours hasbeen evaluated
ADPCM and LPC coding systems were used together with unprocessed using telephone speech over a period of five months with a test
utterances.Results of the experiment indicate that verificationerror population of over 100 male and female speakers. The evalu-
rate of one percent orless can be obtainedeven if the reference and test ation indicated an error rate of approximately ten percent for
utterances are subjected to different transmission conditions.
new customers and approximately five percent for adapted
customers [4] . It has also been shown that the performance
of this system is relatively insensitive to transmission systems
Manuscript received May 5, 1980;revised September 25,1980. in which the speech is encoded using adaptive differential
The author was with Bell Laboratories, Murray Hill, NJ 07974. He
is now with the Electrical Communication Laboratories, Nippon Tele- pulse code modulation (ADPCM) coding or linear predictive
graph and Telephone Public Corporation, Tokyo 180,Japan. coding (LPC) vocoding [5] .

0096-3518/81/0400-0254$00.75 0 1981 IEEE


FURUI: AUTOMATIC SPEAKER VERIFICATION 255

Thispaper describes newtechniques for an automatic


speaker verification system for telephone-quality speech. The
differences between the present implementation and previous
implementations of the system lie in the features selected for
w
UTTERANCE

analysis and the method of overall distance computation. In


REFERENCE
DATA CEPSTRUM COEFFICIENTS
addition, new and enlarged samples of speech, includingseveral
kinds of transmission systemshave been used for evaluation.
RETRIEVAL
I
THROUGH LPG ANALYSIS
I I
I
& &
NCRMALIZATION BY LONG-TIME
AVERAOE CEPSTRUM AMmE
11. SYSTEMOPERATION I

A block diagram indicating the principal operations of the


system is shown in Fig. 1. There are two inputs to the system,
theidentity claim andthe sampleutterance.Theidentity
claim whichmaybeprovidedbyakeyed-in identification
l c e I = l FEATURE SELECTION

number causes reference data corresponding to the claim to


be retrieved. Thesecond input is activatedbyarequest to
speak the sample utterance. The recording interval is scanned
to find the endpoints of the utterance. The utterance is then
analyzed. Linear predictor coefficients are extracted suc-
cessively and these coefficients are transformed into cepstrum
coefficients. The cepstrum coefficients are averagedover the
duration of theentireutteranceandthe averagevaluesare
subtracted from the cepstrum coefficients of every frame to Fig. 1. Block diagram indicating the principal operations of the system.
compensate for frequency-response distortions introduced by
the transmission system.
SPEECH WAVE
Thetimefunctions of thecepstrum coefficients are ex-
panded by an orthogonal polynomial representationover short
time segments. Then the utterance is represented by the time (FC = 3.0 kHz OR 2.6 kHz)
functions of coefficients of the orthogonal polynomial repre-
sentation. A part of the set of these coefficients is selected for (FS= 6.67 kHz OR 6 kHz)
speaker verification, based on the statistical analysis ofthe

w
effectiveness of each coefficient.
A crucial property of the system is automatic time registra- EMPHASIS
tion of the time functions of the sample utterance to the time
functions retrieved as the reference template of the claimed
identity. An overall distancebetween the sampleutterance
and the reference template is obtained as the result of time
registration using a dynamic programming technique. The dis- I CORRELATION
COEFFICIENTS 1 (ORDER lo)
I I
tance of each element is weighted by intraspeaker variability I
andsummed t o produce the overall distance. Finally, the LINEARPREDICTIVE
COEFFICIENTS
overall distance is compared withJathreshold distance value to 1
determine whether the identity claim should be accepted or
CEPSTRUM
rejected. COEFFICIENTS
Details concerningthe analysis procedures,referencecon-
Fig. 2. Block diagram for cepstmm extraction.
struction,time registration, anddistancecalculation willbe
presented in the following sections.
dictor coefficients are extracted from each frame by the auto-
A. Normalized Cepstrum Extraction correlation method. The linear predictor coefficients are
The speech wave is bandlimited from 100 Hz to 3.0 kHz and transformed into cepstrum coefficients, using thefollowing
sampled at a 6.67 kHz rate, or bandlimited from 100 Hz to recursive relationshps [9] :
2.6kHz and sampled at 6 kHz. The digitized speech is then
scanned forward from the beginning of the recording interval
and backward from the end to determine the beginning and
end of the actual sample utterance. The endpoint detection is
accomplished by means of anenergy calculation. Ahigh
emphasis filter (1 - 0.95Z-l) is applied to thedelimited where ci and ai are the ith-order cepstrum coefficient and
speech, and a 30 ms Hamming window is applied to the em- linear predictor coefficient, respectively. Fig. 2showsthe
phasized speech every 10 ms. First to tenth-order linear pre- block diagram of these processes.
256 IEEE
TRANSACTIONS
ON
ACOUSTICS,
SPEECH, AND SIGNAL
PROCESSING, VOL. ASSP-29, NO. 2, APRIL 1981

Atal [ 9 ] examined several different parametricrepresenta-


tions of speech derived from the linear prediction model for
their effectiveness forautomatic recognition of speakers.
Among all the parameters investigated, the cepstrum was
found to be the most effective. It was also pointed out that
REFERENCE DATA CEPSTRUM
COEFFICIENTS
cepstrum coefficients have the additional advantage that one RETRIEVAL THROUGHLPCANALYSIS
I

1 -1
I I
can derive from them a set of parameters which are invariant
to any fixed frequency-response distortion introduced by the POLYNOMIAL FUNCTION l*l
recording apparatus or the transmission system. The new pa-
rameters are obtained simply by subtracting from the cepstrum
coefficients a set ofvalues representing their time averages
over the duration of the entireutterance. This process can FEATURE
SELECTION
normalize the gross spectral distribution of the utterance, and
it is similar to the inverse filtering process which has been used . REFERENCE DYNAMIC
TEMPLATE
in a spoken word recognition system at Bell Laboratories [ 6 ] .
The normalization technique introduced by Atal is used in the 1
OVERALL DISTANCE
COMPUTATION
speaker verification system studied in this paper. I
In previous studies by the author [7], [SI it was shown that
this normalization process is also effective in reducing long-
termintraspeaker spectral variability for maintaining high
speaker verification andidentification accuracy over a long
period. Fig. 3. Block diagram indicating the principal operations of the system
(modification of Fig. 1).
B. Polynomial Coefficients
Time functions of the normalized cepstrum coefficients are of feature extraction is modified as shown inFig. 3, since
expanded byan orthogonal polynomial representation over cepstrum normalization does not affect the first- and second-
90 ms intervals every 10 ms. The 90 ms interval length seemed order polynomial coefficients.
adequate for preserving transitional information between pho- Accordingly, the utterance is represented by time functions
nemes. The first three orthogonal polynomials are used. They of the cepstrum coefficients x,(i), and the first- and second-
are [IO] order polynomial coefficients, b,(i) and c,(i), where t is the
Poi = 1 frame number and i is the index of the cepstrum coefficient
(1 < i < p). Since p is set to ten in this system, the result is a
P .=j- 5 representation by a time function of a 30-dimensional vector.
1J

55 From these 30 elements, a set of elements, which are most ef-


l O j + -. fective in separating the overall distance distribution of cus-
3
tomer and impostor sample utterances are selected for speaker
Thus, if the control function samples for an utterance within verification. The selection is made based on the inter-to-
-
the segment being measured are x i 0 = 1 , 2 , * ,9), then the intraspeaker variability ratio for each element:
first three coefficients of the orthogonal polynomial repre-
sentation are
-
dWj = E (Zjjj)
1

dijklm
( Z f m , if j = k )

where E means averaging over the index j , and dijklm is the


J
distance between the time function of ith element derived
These coefficients represent mean value, slope, and curvature from Zth utterance by speaker j and mth utterance by speaker
of the time function of each cepstrum coefficient in each seg- k after time registration.
ment, respectively.
As the original time functions of cepstrum coefficients are C Time Registration
considered to be more efficient than the 0th-order polynomial A sample utterance is brought into time registration with the
coefficients for speaker verification, the original time func- reference template to calculate the distance between them.
tions of cepstrum coefficients are used to replace the 0th-order n i s is accomplished by a new time warping method using
polynomial coefficients in this implementation. When the Oth- dynamic programming technique. As there is often some un-
order polynomial coefficients are not used, the block diagram certainty in the location of both the initial and final frames
FURUI: AUTOMATIC SPEAKER VERIFICATION 25 1

due to breath noise, etc., the unconstrainedendpointtech- N-6

nique [l 11 .is applied.


We denote two contours as R(n), 1 < n < N , and T(m),
1 < m < M. We denote R(n) and T(m) as guide contour and
slave contour, respectively. The purpose of the time warping
algorithm is to provide a mapping between the time indexesn
and m such that atime registration between the two utterances
is obtained. We denote the mappingw , between n and m as
m = w(n). (5)
n N
The function w must satisfy a set of boundary conditions at GUIDE CONTOUR
the endpoints of the utterance and some restrictions on the Fig. 4. Allowable region of the dynamic programming warpingfunction
formit assumes. In our case, thefollowingconditions are path.
applied:
w(n t 1) - w(n) = 0 , 1 , 2 (w(n)#:-w(n- 1 ) ) (64 tour. If we allow the warping function to also start from any
= 1,2 (w(n) = w(n - 1)) (6b) frame of the guide contour, the computationtime becomes ex-
cessive.If the intraspeaker variation of utterance lengths is
l<w(l)<6+1 (6c) not large, we have found that the likelihood is great that the
M -6 <w(N)<M ( 6 4 optimum warping function starts from the first frame of the
shorter of either the reference template or test utterance. The
maxw(n)=M, N- 6<n<N (be) basis for this result is described in Section VI-F. Based on this
M M assumption we adopted the procedure of usingas the guide
-n-mo<w(ri)<- ntm,,. contour the shorter of either the referencetemplateortest
N N
utterance. This means that the longer one is mapped to the
Equations (6a) and (6b) require that w(n) be monotonically slave contour axis which is the ordinate of the warping plane
increasing, with a maximumslopeoftwo,andaminimum and the shorter oneis mapped to the guide contour axis which
slope of 1 / 2 . The minimum slope constraint is a consequence is the abscissa of the warping planein Fig. 4.
of the prohibition against two consecutive steps with slope 0. A complete specification ofthe warping function results
In (6c), (6d), and (6e), 6 represents the maximum anticipated from a point-by-point measure ofsimilarity between the guide
range of mismatch (in frames) between boundary pointsof the contour R (n) and the slave contour T(m).
two utterances. In our case, a value of 6 of 15 (frames) was
used, representing a150 ms region in which the initial and final D. Distance Measure
frames could be mapped. A similarity measure or distance functionD must be defined
The warping function can reach the final boundary of the for everypairof points (n, m) within the shaded regionof
slave contour prior to thelast frame, i.e., it is possible that Fig. 4. Given the distance function D,the optimum dynamic
w ( n ) = M for . n < N (7) path w is chosen to minimize the accumulated distance D T
along the path, i.e.,
in which case it is not physically meaningful to continue the
N
path. DT = min D (R(n),T(w(n))).
Equation(6f) restricts the warpingfunctionwithinsome (w<n)) n=1
fixed region along the diagonal line which connects (1, 1) and
( N , M ) points on the (n - m ) plane. In our case mo was set to When the warping function reaches the final boundary of the
20 (frames). Fromtheseconditionsthe warping function is slave contour prior to thelast frame, the accumulated distance
constrained to follow a path inside the shaded region of Fig. 4. DT isscaled by the factor (N/Ns) where.Ns is the frame at
The vertices of the labeled points A and B are obtained as the which (7) is satisfied, so as to equalize the number ofdistances
intersections of the lines. which enter into the total distance DT. The optimum path w
canbe determined by the method of dynamic programming
easily.
Let us denote the feature vector of the nth frame of the
-
guide contour as R (n) = (rl (n), r2(n), . . ,ri(n), . . ,rK(n))
andthe mth frame of the slave contour as T(m) =(tl(m),
m= I n
2 tz(m),* * * ,ti(m), . ,tK(m)),where K is the number of the
point B
m-Mt6=2(n-N) elements of thefeature vector. In this paper,twokinds of
distance measures are used and evaluated.
As can be seen in Fig. 4, the warping function can start from K
any frame of the slave contour between the first and 6 t l t h
frame, but it must start from the first frame of the guide con-
258 IEEE
TRANSACTIONS
ON
ACOUSTICS,
SPEECH, AND SIGNAL
PROCESSING, VOL. ASSP-29, NO. 2, APRIL 1981

D2(R(n), r(m))= Ob) (1


i=1
2
I -
INTRA-SPEAKER
DISTANCE

where gi is the weighting function, which is the reciprocal of


the mean value of intraspeaker variability for the ith element,
defined as follows:
-
gi = l&Ji
- N
dm = (&jj) = E (tijk (n) - tiil (w(n))IZ (1 1) ~ D
DISTANCE
B
J J n=l
k+ I Fig. 5. Example of typical intraspeakerand interspeaker distance
distributions.
where tijk(n) is thenth frame of thekthutteranceby
speaker j . to compare the results withthoseobtained using a priori
thresholds.
E. Decision Threshold
F. Reference Construction
The overall distance accumulated over the optimum warping
function is compared with a threshold to determine whether The establishment and updating of reference information is
to accept or reject an identity claim. In many kinds of speaker another important element of the system. For each kind of
verification experiments, the threshold is set a posteriori so data set, three or five utterances were used to constructa
that the twokinds of error rate (the rate of rejecting utterances reference template for each customer. Two methods of refer-
which should be accepted and the rate of accepting utterances ence updating were observed. In the first method, the refer-
which should be rejected) are equal. But these experiments ence template was updated every seventh access by the cus-
are unrealistic, and procedures for settingthresholdsin ad- tomer using his latest utterances (method 1). In the second
vance in practical situations are not well established. method, it was updated each time the system was accessed by
In this paper, two methods for setting an a priori threshold the customer (method 2). The procedure for establishing the
are evaluated. In the first method, the threshold is set to an initial reference template is the following. The first training
experimentally decided fixed value, and the same threshold is utterance is used as a basic utterance, to which the second is
used for all customers. In the second method, the optimum brought into time registration. After registration the time
threshold is estimated based on the distribution of overall dis- functions of the feature parameters of the first two utterances
tances between each customer's reference template and a set are averaged and the third is brought into time registration
of utterances of other speakers. In the latter case, the thresh- with the averaged function and then averaged into it. When
old is updated at the same time as the reference template up- five utterances are used to construct the reference template,
dating, based on the distribution of interspeaker distances. the fourthandfifth utterances are also brought into time
The following equation, based on empirical results, is used to registration and included in the averaging.
set the threshold foreach customer: The training utterances are also used for the calculation of
the weighting function which is used in the distance measure
6( k ) = a (CDB ( k ) - GDB (k)) b (1 2) of (loa) and (lob), theinterspeaker to intraspeaker variability
ratio of (4) which is used in feature selection, and the inter-
where B(k) is the threshold for the customer k, c ~ B ( k and ) speaker distance distribution which is used to set the decision
??DB(k)are mean value and standard deviation for the distribu- threshold using (12).
tion of interspeaker distance, respectively. a and b are con-
stant parameters which are set experimentally, the same values 111. SAMPLE UTTERANCES
being used for all customers and for all data sets. Several kmds of utterance sets were used to evaluate this
Fig. 5 shows an example of typical intraspeaker and inter- system. Fig. 6 is a block diagram which shows the procedures
speaker distance distributions.Equation (1 2) indicates that used to create the utterance sets. The speech was uttered in a
as the mean value of the interspeaker distance becomes larger sound booth and recorded over conventional dialed up tele-
andthe standard deviation becomes smaller, the decision phone lines or a high-quality microphone. The signal was
threshold becomes larger. The intraspeaker distance distribu- bandlimited from 100 to 3200 Hz, which is the nominal tele-
tion is not taken into account in the calculation of the decision phone bandwidth. The telephone speech was processed by
threshold. There are two reasons for this. First,the intra- the following three transmission systems:
speaker distance distributions are fairly uniform from speaker 1) clear channel-i.e., no additional processing,
to speaker. Second, it is difficult to obtain stable estimates of 2 ) adaptive differential pulse code modulation (ADPCM)
the distribution of intraspeaker distance for small numbers of coding,
training utterances, whereas it is easy to estimate interspeaker 3) linear predictive vocoding (LPC).
distance distributions by cross comparison of the training The ADPCM coder used in this experiment was a simulation
utterances between different customers. of the coder built by Bates [12], based on the work of Cum-
' A posteriori equal error decision thresholds were also used miskey e t al. [13] . Fig. 7 shows a block diagram of the
FURUI: AUTOMATIC SPEAKER VERIFICATION 259

AID X, kHz 6.67kHz AID

LFC ANAWZER I LPG SYNTHESIZER

Fig. 8. Block diagram of the LPC system.

CONVERSION
d-
TABLE I
SETSUSEDIN EXPERIMENTS
UTTERANCE
Number of

Male or Customers & Sampling

No. Impostors
Female
Recording

FEATURE
PARAMETER (1) Male 10 + 40 Telephone
ANALYSIS

t
I
I
SPEAKER
VERIFICATION I
Fig. 6. Block diagram indicating the procedure to make the utterance ““
sets. (4)

-
CODER ADAPTIVE
OUANTIZER
Microphone
I

(6) I Female I 10 + 40 1 ~~~

Telephone

No quantization of the LPC parameters was used in this


experiment.
Thesamplingrateof the signalpassed throughthe LPC
vocoder or the clear channel was converted from 10 kHz to
6.67 kHz, orif bandlimited to 2.6 kHz, converted t o 6 kHz.
SPEECH
INPUT
I
I
The telephone speech utteranceset includes the following.
A
1) 50 recordings made by each of 10 male and 10 female
speakers over a period of two months. The first 10 recordings
L& v’(nl
A’m)
were made once a day; the remaining 40 were made twice a
day (morning and afternoon). These speakers were designated
“customers.”
Fig. 7. Block diagram of the ADPCM system.
2) One recording made by each of 40 male and 40 female
naive speakers.Thesespeakers were designated“impostors.”
ADPCM system. Since the requiredsamplingrate forthe There was no attempt tomimic the “customers.”
ADPCM coder was 6 kHz, a sampling rate conversion system Thespeechrecorded over ahigh-qualitymicrophone was
was used to convert it from 10 kHz to 6 kHz at the input to bandlimited from 100 to 3200 Hz and sampled at 6.67 kHz.
the coder [ 141 . The signal bandwidth was reduced to 2.6 kHz The high-quality speech utteranceset includes the following.
for the ADPCM coder in the sampling rate conversion system. 1) 26 recordings made by each of 21 male customers over
In the coder, a 4-bit adaptive quantizer was used to code the a period of two months. Each was recorded on a differentday.
differential signal giving an overall bit rate of 24 kbits/s for the 2) One recording made by each of 55 male impostors with
coder [ 131 . no attempt tomimic the customers.
A block diagram of the LPC vocoder is given in Fig. 8. The Two all-voiced sentences wereused in the recordings. The
implementation wasbased on the autocorrelation method of males used the sentence, “We were away a year ago” and the
linear prediction [15],[16] . Pitchdetectionandvoiced- females used the sentence, “I know when my lawyer is due.”
unvoiced decisionwere performed using the modified auto- Table I summarizes the six kinds ofutterance sets used in this
correlation pitch detector of Dubnowskiet al. [ 171 . A 12 pole experinient. All the low-pass filters of 3.2kHz and 2.6 kHz
LPC analysis ( p = 12) was performed using a pitch adaptive are digital filters, except that the 2.6 kHz low-pass filter ap-
variable frame size, at a rate of 100 frames per second [18] . plied to ADPCM speech is an analog hardware filter.
260 IEEE TRANSACTIONS ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL. ASSP-29, NO. 2 , APRIL 1981

2nd POLYNOMIAL COEFF

I ’b A‘
I

0 1 1 1 1 1 I l l I 1J
( 2 3 4 5 6 7 8 9 1 0 1 2 3 4 5 6 7 8 9 1 0
CEPSTRUM COEFF INMX CEPSTRUM COEFF INDEX

Fig. 9. Interspeaker to intraspeaker distance ratio for the first ten Fig. 10. Interspeaker to intraspeaker distance ratio for the middle ten
utterances by f i i e speakers each extracted from utterance set (1). utterances by five speakers each extracted from utterance set (1).

IV. EXPERIMENTAL RESULTS TABLE I1


AVERAGE
ERROR RATES.
UTTERANCESET:No. (1). F R FALSE REJECTION
A. Results for Utterance Set (1) F A FALSEACCEPTANCE
(FALSEALARM). (MISS RATE).
The first experiment was performed using utterance set (l),
which is a set of utterancesby ten male customers and 40
male impostors recorded over a conventional telephone con-
Threshold I 1 1 ! FR FA

nection, transferred through a clear channel and sampled at


6.67 kHz. The distance measure Dl is used in the experiments
in Sections IVrA-D. Thecepstrumnormalizationtechnique
using the averaged value of the cepstrum is not applied to the A Posteriori 0%
experiments in these sections.
1) Distance ratio: In order to evaluate the feature parame-
ters from the viewpoint of their effectiveness for speaker veri- Based on these results, 18 parameters which have a relatively
fication, the ratio of the average value of interspeaker distance large distance ratio were selected. These are all ten cepstrum
to the averagevalue of intraspeaker distance defined by (4), coefficients and all the first-order polynomial coefficients ex-
was calculated for each parameter. cept coefficient index numbers 5 and 8. None of the second-
Fig. 9 shows the results for an utterance set which uses the orderpolynomial coefficients were included in this selected
first 10 utterances by five customers each. Fig. 10 shows the parameter set. The choice of 18 for the number of selected
results foranutteranceset which comprises themiddle 10 parameters was decided arbitrarily.
utterances by five customers each. It can be seen from these 2 ) Speaker verification: Table I1 shows the mean-error rate
figures that all of these parameters have distance ratios greater of speaker verification when five utterances were used to estab-
than one, which means that all of them are useful to distin- lish an initial reference template whichwas updated every
guish speakers. The original cepstrum coefficients are gen- seventh utterance by eachcustomer.Themean interval be-
erally most effective and the higher theorderofthe poly- tween training and test utterances is nearly six days. Three
nomial coefficients becomes, the less effective they become, types of decision thresholds were applied. The error rates were
irrespective of the order of the cepstrum. averaged over ten customers and presented in this table. When
A preliminary experimentindicated that using utterances the threshold is set aposteriori the error rate is completely
from ‘ten customers in the feature selection process produced zero. When the threshold is set a priori the mean-error rate of
no improvement in speaker verification accuracy over the false acceptance and false rejection can be madeassmallas
procedure described here using five customers. 0.19 percent using theoptimum thresholdestimationtech-
FURUI: AUTOMATIC SPEAKER VERIFICATION 261

TABLE 111
AVERAGE SET:No. (6). FR FALSE
ERRORRATES. UTTERANCE REJECTION
(FALSEALARM).FA: FALSE
ACCEPTANCE (MISSRATE).
I I I I I

A Posteriori 0.06%

nique presented in Section 11-E.When the threshold is fixed


to a value which is common to all customers, the mean-error
rate increases to 0.30 percent. These results show thatthe
speaker verification techniques proposed in this paper are very
powerful for telephone speech.
3) Effects of time interval between trainingand test
utterances: Intersession variability for a given speaker is one (b) METHOD 2
TRAINING

of the most important problems in speaker verification [7], Fig. 11. Two reference updating procedures used for utterance set (5).
[8]. In order to check the effect of the time interval between In Method 1, a reference template is updated every seventh access by
the customer. In Method 2, a reference templateis updated each time
training and test utterances on speaker verification accuracy the system is accessed by the customer. In both methods, the latest
this interval was varied from six days to six weeks comparing five utterances are used to update thereference template.
test utterances with reference templates constructed at times
corresponding to the specified intervals. In this experiment 14
utterances were used as test inputs by ten customers each, and C Results for Utterance Set (5)
five utterances were used toconstruct each reference tem- In the third experiment, utterance set ( 5 ) which comprises
plate. The experimental results indicated that verification ac- 26 utterances by 21 male customers each and a single utter-
curacy is not affected by time intervals between training and ance by 5 5 impostors recorded over a high-quality microphone
input utterances of at least six weeks. was used to test the speaker verification system. In this case
the 18 selected parameters include the first nine cepstrum co-
B. Results for Utterance Set ( 6 ) efficients and the first nine first-order polynomial coefficients.
The second experiment was performed using utterance set Fig. 11(a) shows the time relation between training and test
(6) which comprises the utterances by ten female customers utterances in speaker verification experiments for the condi-
and 40 female impostors recorded over a conventional tele- tion that five utterances were used to construct a reference
phone connection. templateforeachcustomer andthatit was updated every
The ratio of interspeaker distance to intraspeaker distance seventh access by the customer. Table IV shows error rates
for each parameterwas calculated using the first ten utterances averaged over 21 customers. Results of the first, middle, and
by five customers each. The result was similar to the result for last seven input utterances are averaged separately. Falsere-
male speakers shown in Fig. 9. The original cepstrum coeffi- jection error is very large for the first seven input utterances.
cients are most effective and the second-order polynomial co- Initial variability like this was also shown in the previous ex-
efficients are less efficient than the first-order ones. This re- periment by Rosenberg [4].
sult was used for the selection of 18 parameters, which include In order to improve the results forthe first seven input
all ten cepstrum coefficients and all the first-order polynomial utterances, the second method for reference updating was in-
coefficients except coefficients index numbers 4and 9. troduced. The reference template was updated at each time of
Table I11 presents the speaker verification results under the the customer's access using the latest five utterances as'shown
same conditions observed for the male speaker set described in in Fig. 1 l(b). Table V shows the results of the verification ex-
the previous section;a reference file was constructed using periment using this method. In this case, the error rate for the
five utterances and updated every seventh access by each cus- first seven input utterances is not significantly larger than
tomer. Although the error rates for the female utterance set those for the middle and last seven utterances. Compared with
are slightly larger than those for the male utterance set, they Table IV, it can be also concluded that frequent updating of
are still very small. the reference template is quite efficient for several initial input
Results for a speaker verification experimentin which a utterances but it is not necessary to do it for the remaining
reference template was constructed using five' utterances and utterances. If we apply the second reference updating method
the time interval between training andtest utterances was to the first seven input utterances and the first reference up-
varied up to six weeks were quite similar to that of the male dating method to the remaining utterances, we canachieve
speaker set. There was no significant increase inerrorrate verification errorrate ofless than onepercent using the
when the interval is extended to six weeks. a priori estimated threshold.
2 62 IEEE
TRANSACTIONS ON ACOUSTICS,
SPEECH, AND SIGNAL PROCESSING,
VOL. ASSP-29, NO. 2 , APRIL 1981

TABLE IV TABLE VI
ERRORRATES.UTTERANCE
AVERAGE SET:No. (5). REFERENCE
UPDATING: ERRORRATES.UTTERANCE
AVERAGE SET:No. (3). FR: FALSEREJECTION
METHOD1 . FR: FALSE (FALSEALARM).
REJECTION FA: FALSE (FALSE
ALARM). FA: FALSE
ACCEPTANCE (MISSRATE).
(MISSRATE).
ACCEPTANCE II I I I

Threshold

Estimated
A
Threshold

1 1 I 1FR

FR

0.29%
FA

FA

0.83%
FR+FA
2

0.56%

Priori

Fixed 1.43% 1.46% 1.45%

Estimated
A Posteriori 0.04%

TABLE VI1
ERRORRATES BY ESTIMATED
AVERAGE a priori THRESHOLD.
Priori
I Utterance
set No. ( I ) No. (6) No.(5) No. (3)
Fixed
I Customers IO Male 10 Female 21 Male IO Male

1 Impostors 40 Male 40 Female 55 Male 40 Male

Transmission rclcphone Microphone Telephone


Telephone
A Posteriori
24 kb/s ADPCM

TABLE V
(False Alarm) 0.29% 0.29% 0.68% 0.29%
AVERAGE
ERROR SET:No. (5). REFERENCE
RATES.UTTERANCE UPDATING:

1 1 La;1
METHOD2. FR: FALSE
REJECTION
(FALSE A L A R M ) . FA: FALSE
False Acceptance
(MISS RATE).
ACCEPTANCE
(Miss Rate) 0.08% 0.43% 0.86% 0.83%

-.--I Utterances
I Average 0.19% 0.36% 0.77% 0.56%

First Middle
I Number of Trials 5,500 1 5,500 I 12,726 1 5,500
I
I I 068%
0.68%

1 Estimated
FR
0.86%
0.60% 0.60%
seventh access. Although the errorrates are slightlylarger
than thoseobtained for clean speechpresentedin Table 11,
both false rejection and false acceptance are still less than one
percent even whenthe decision threshold is set a priori by

ir
(12). Speaker verification results showing the effectofex-
A ! I tending the interval between training and test utterances up to
prior^ FR 1.36% 1 0.68% I 0.68% I six weeks indicated thatthe errorrate was slightly greater
Fixed FA 0.56% 1
0.74% 0.74% I 0 57% 1 than that obtainedfor clean speech.
Table VI1 shows the summary of the results of speaker veri-
0.96% 1 0.71% 1 0.63% 1 fication experiments for utterance sets (l), ( 3 ) , ( 9 , and (6),
0.94% 1 0.17% 1 0.348 i using the a priori threshold specified by (12). For utterance
set ( 9 , reference templates were updated following each access
for the first seven test utterances using the latest five utter-
ances. After the seventh utterance, updating was carried out
D. Results for Utterance Set (3) every seventh access. For all other utterance sets updating was
In the fourth experiment, utterance set (3) which comprises carried out only after each seventh access. For all utterance
50 utterances by ten male customers each and a single utter- sets except (5) there were 35 customer test utterances and 515
ance by 40 impostors, recorded over a conventional telephone impostor test utterances per customer for a total of 350 cus-
connection and transformed by a 24 kbit/s ADPCM system, tomerand 5150 impostortrials, respectively. Forutterance
was used to evaluate the speaker verification system. In this set (5) there were 21 customer test utterances and 585 im-
case, the 18 selected parameters include all ten cepstrum co- postor test utterancesper customer for a total of 441 customer
efficients and all the first-orderpolynomial coefficients ex- and 12 285imposter trials, respectively.
cept coefficients withindex numbers 1 and 2. Although this table indicates a higher error rate for micro-
Table VIshows the results of speaker verification experi- phone speech than for telephone speech, the difference is sta-
mentswhen an initial reference template for each customer tistically insignificant since the number of utterances which
was constructed usingfive utterances andupdated every caused the verification error is very small.
FURUI: AUTOMATIC SPEAKER VERIFICATION 263

V. EXPERIMENTS USING MIXED TRANSMISSION SYSTEMS TABLE VIII


AVERAGE ERROR RATES. TRAINING UTTERANCES: ADPCM. TEST
A . Experimental Design UTTERANCES:LPC. F R FALSE REJECTION (FALSE ALARM).FA: FALSE
ACCEPTANCE (MISS RATE).
In order to investigate the effects of several transmission sys-
tems on the speaker verification system more thoroughly, the Distance
reference and test utterances were subjected to different trans-
Measure
mission systems and evaluated using the same techniques used
in previous experiments with homogeneous transmission con-
ditions. One difference betweenthe
techniques
used in
previousexperiments and this experiment is that the trans- DI

mission characteristics normalization method, subtracting the (square)


time averages from cepstrum coefficients, described in Section A Posteriori 0.16%
11-A, wasapplied to all utterances in this experiment.
Utterance sets (2), (3), and (4), each of which comprises 50 Estimated 1.14% 1.13% 1.14%

utterances by ten male customers each and a single utterance Dz A Priori


by 40 male impostors were used in the experiment. They were :absolute) Fixed 1.71% 2.08% 1.90%
recorded over aconventionaltelephoneconnection trans-
A Posteriori 0.12%
mitted over clear, ADPCM and LPC vocoder channel, respec-
tively. All of these utterances were sampled at 6 kHz.
I

B. Result of Preliminary Experiments TABLE IX


AVERAGE ERROR RATES. F R FALSE REJECTION (FALSE ALARM). FA
1) Distance ratio: Tenutterancesby five customerseach FALSE ACCEPTANCE (MISS RATE).
were used to calculate the ratio of averaged interspeaker dis- -
Test
tance to averaged intraspeakerdistanceforeachfeaturepa-
‘raining Threshold Error
rameter.Inthe tenutterances, five utterances were trans-
mitted over clear channel andtheremaining five utterances Clear ADPCM LPC

were transmitted over the ADPCM system for each speaker. -


__
FR 0.29% 1.43% 0.86%
The results which were similar to that obtained for the utter-
ance set which comprises only ADPCM speech indicated that Clear A Priori FA 0.62 0.64 0.74

thefeatureparameters, especially normalizedcepstrum co-


efficients and the first-order polynomial coefficients, have a (Estimated) FR+FA 0.46 1.04 0.80
2
great amount of individual information which is not affected
by the difference betweenclear and ADPCM channels. -
2) comparison betweentwo distance measures: Before
starting the speaker verification experiment which uses differ- rDPCM
ent combinations of the utterance sets, a preliminary experi-
ment was carried out to compare speaker verification perfor- (Estimated) I /I I
1.05
mance using the two distance measures D l and D2 described
in Section 11-D.
-
In this experiment the reference template for each customer
was constructed by the training utterances transmitted over
LPC
the ADPCM system, and test utterances were transmittedover
the LPC vocoder system. The results are given in Table VI11
showing error rates for the two distance measures. The error
rates are quite similar although the error rates obtained using
D2 are generally somewhat smaller than those obtained using __
D l . The correlation coefficient between the two sets of dis-
tances is 0.992. The calculation time for D2 is much smaller ances are subject t o the same transmission system. But even in
than D l . Based on these results, D2 isused hereafter in the the worst case, which is the combination of ADPCM and LPC
transmission systems experiments. vocodedspeech,the average error rate bytheestimated
a priori threshold is onlyonepercent. It means thatthe
C. Result of Speaker Verification Experiments speaker verification method investigated in this paper has little
Table IX shows the summary of the results of speaker veri- degradation evenwhen the reference and test utterances are
fication experimentsforninecombinationsoftransmission subjected to different transmission systems.
systems. When the reference and test utterances are subjected Fig. 12 shows plots of false rejection and flase acceptance
to different transmissionsystems,the error rate is slightly for each transmission system combination as a function of in-
larger than the error rate which is obtained when all the utter- dividual customers.Part (a) shows false rejection rates and
264 IEEE
TRANSACTIONS
ON
ACOUSTICS,
SPEECH, AND SIGNAL PROCESSING, VOL. ASSP-29, NO. 2 , APRIL 1981

TABLE X
ERRORRATES WITHNo CEPSTRUM
AVERAGE NORMALIZATION.
TRAINING ADPCM. TESTUTTERANCES:
UTTERANCES: LPC.
c-v
T 1'" Threshold Error Rate Error
1

False Rejection

c-A 1: A Priori
(False Alarm)

False Acceptance
1.14%

v-c 1': (Estimated) (Miss Rate) I .30%

1.22%
v-v Average
1lo
A Posteriori !?qual 0.52%

channel transmissions for test and training utterances, respec-


b
:"
A-c,!
tively. The distances between reference andtest utterances
are generally much greater than those obtained when cepstrum
normalization is applied. Accordingly, the parameter b in the
A-V
T 1'" threshold estimation equation is changed to a value which is
A-A appropriate to make the two kinds of error rates almost same.
L 1'O The larger error rates obtained when cepstrum normalizationis
RC FP LN PB DM AR JW LR JH JJ omitted confirms the effectiveness of the cepstrum normaliza-
CUSTWER
tion technique.
(4 The error rates for the homogeneous conditions in Table IX

c-c u:% are slightly differentfrom the previous results described in


Sections IV-A and IV-D, since the sampling frequency of
"clear" speech is different between the previous and present
c-v T. experiments, and the cepstrum normalization technique was
not applied in the previous experiments. From these com-
c-A A: parisons, it is apparent that when the difference of the trans-

v-c u: mission characteristics between reference and test utterances


is small, cepstrum normalization slightly increases the verifica-

v-v u: tionerrorrate
information.
In thenext
by removing thelong-term speaker-related

section,the effectiveness of the cepstrum


v-A T: normalization will be investigated using an utterance set which
has very large differences between the transmission character-
A-C u : istics of reference and test utterances.

A-v : D. Experiments with Artificial Transmission Variation


The utterance set by ten male customers and 40 male im-
A-A :-
RC FP LN F 9 DM PR JW LR JH JJ
postorsrecorded over a conventionaltelephoneconnection
was used in a speaker verification experiment. All utterances
CUSTOMER werepassed through a 3 kHz low-pass filter and sampled at
(b) 6.67 kHz. Training utterances were processed withpre-
Fig. 12. False rejection (a) and false acceptance (b) as a function of the emphasis, whereas preemphasis was omittedfortestutter-
training system testing system pairand customer. C-clear channel,
V-LPC vocoder system, A-ADPCM system. ances. This results in a simple but large difference in frequency
characteristics between training and test utterances. Two ex-
periments were performed to studytheeffect of cepstrum
part (b) shows false acceptance rates. The reader should note normalization; verification using normalized cepstrum and
that the scales of the two figures are different. A high degree verification using unnormalized cepstrum. The results are
of variability in scores exists among customers for each pair of shown in Table XI. There are very large differences between
transmission systems. The variability between scores for pairs the results for normalized cepstrumandunnormalizedcep-
of transmission system is almost negligible compared with the strum. It is evident that cepstrum normalization is very power-
variability of scores within a pair of transmission systems. ful, and small error rates can be obtained even when there are
Table X shows error rates when cepstrum normalization is large frequency characteristic differences between the training
omittedforthecombinationof LPC vocoder and ADPCM and testutterances.
FURUI:AUTOMATICSPEAKERVERIFICATION 265

TABLE XI
AVERAGE ERROR RATES. TRAINING UTTERANCES:
PROCESSED
BY
PREEMPHASIS.
TESTUTTERANCES:
UNPROCESSEDBY PREEMPHASIS.
F R FALSEREJECTION
(FALSE ALARM).
F A FALSEACCEPTANCE
(MIS RATE).
Cepstrum

Normalization Threshold FR FA
2

Estimated 0.86% 0.64% 0.75%

YES A Priori

Fixed 0.86% 0.78% 0.82%

A Posierion 0.14%

Estimated 10.57% 13.15% 11.86%

A Priori

I Fixed

A Posterior
17.14% 16.82%
1644%

9.66%

I 2 3 4 5 6 7
Loo AREA RATIO INDEX
8 9 1 0

VI. DISCUSSION Fig. 13. Interspeaker to intraspeaker distance ratio for each time func-
tion of log wea ratios and polynomialcoefficients derived from
A. Comparison Between Cepstrum and Log AreaRatio them. Ten utterances by five male speakers each were used for the
analysis.
In order to check the advantage of the cepstrumCoefficients
derived through LPC analysis &PC-cepstrum),which have
been used in this paper, log area ratio parameters, which are TABLE XI1
arctanhtransformatioriof PARCOR coefficients, were ex- AVERAGEERROR RATES. UTTERANCE SET: SUBSETOF THE UTTERANCE
SET(1). THRESHOLD:A posteriori EQUALERRORTHRESHOLD.
tracted from the utterance set (1) and studied. Log area ratios TIMEREGISTRATION: CONSTRAINED ENDPOINT DYNAMIC TIME
were found to be very good parameters for speaker verification WARPING. NUMBER OF TRAINING UTTERANCES: 3.
in previous experiments by the author [ 191 . Fig. 13 shows
distance ratios for each time function of log area ratios and
polynomial coefficients derived fromthem. The results for Cepstrum 0.80%

cepstrum coefficients which were extractedfromthe same


utterance set was shown in Fig. 10. Comparing Figs. 10 and
13, it can be seen that cepstrum coefficients are more efficient
than log area ratios for speaker verification. spectral envelope sequence by cepstrum Coefficientsis more
Table XI1 shows the results of a speaker verification experi- stable than that obtained using log area ratios.
ment using log area ratios and polynomial coefficients derived Fig. 15 shows comparisons of four kinds of spectra; short
fromthemcompared tothe results using cepstrum coeffi- time speech spectrum, spectral envelope derived from log area
cients. In this experiment, 10 utterancesby 10 customers ratio, spectral envelope derived from LPC-cepstrum, and spec-
each and a single utterance by 40 impostors were used. The tral envelope derived from conventional cepstrum coefficients
first three utterances were used to construct a reference tem- which are extracted through Fourier transformation of thelog
plate for eachcustomer,and the remaining seven customer powerspectrum(FFT-cepstrum).Results for twoframes in
utterances and impostor utterances were used as test utter- the sentence are presented in the figure. It canbeseen that
ances. Aconstrainedendpointdynamictime warping tech- spectral envelopes derived fromLPC-cepstrumand FFT-
nique was used in this experiment. The error rate results show cepstrum are quite similar and are much smoother than those
thatcepstrum coefficients have anadvantage overlogarea derived directly from LPC parameters, which is very sensitive
ratios. to spectral peaks,
Fig. 14 shows examples of spectral envelopes derived from
10 cepstrum coefficients or 10 logarea ratios for a spoken B. Comparison Between LPC-Cepstrum and FFT-Cepsmm
sentence “We were away a year ago.” Log area ratios are trans- As indicated in Fig. 15, a spectral envelope derived from the
formed into linear predictor coefficients and the spectral en- LPC-cepstrum is slightly different from a spectral envelope de-
velope is computed using the correlation function of the co- rived from the FET-cepstrum. In order t o study the effect of
efficients. Time sequencesoftheenvelopeforthe first 100 this differenceonspeaker verification, several experiments
frames are shown in these figures. The frame interval is 10 ms. were performed using utterance set (6) which consists of fe-
Spectralenvelopes derived fromcepstrum coefficients are male utterances. The size of the time window was set to 256
much smoother than those derived from log area ratios along samples (38.4 ms) to extract the FFT-cepstrum,while the win-
both the frequency axis andtime axis. In other words, the dow size used to extract LPC-cepstrum was 30 ms. The FFT-
266 IEEE TRANSACTIONS ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL. ASSP-29, NO. 2 , APRIL 1981

IO
,ENVELOPE BY LOG
RATIO
AREA 1
ENVELOPE BY LPC-CEPSTRUM

2-
L SHORT
TIME SPECTRUM
ENVELOPE BY FFT-CEPSTRUM

- FREPUENCY

(a)
0
0
I I I
I
I I

2
FREOUENCY ( k H z )
I 1 I I
3

(4
10
ENVELOPE BY LOG AREARATIO
I
ENVELOPE BY LPC -CEPSTRUM

ENVELOPE dY FFT-CEPSTRUM I

0
0 1 2 3
p FREOUENCY FREOUENCY (kHZ)

(b) (b)
Fig. 14. Spectralenvelopesderived fromten cepstrumcoefficients Fig. 15. Comparison of four kinds of spectra; short time speech spec-
(a) or ten log area ratios (b) for a spoken sentence, “We were away a trum, spectral envelope derived from log area ratio, spectral envelope
year ago.” Time sequences of the envelope for the first 100 frames derived from LPC-cepstmm, and spectral envelope derived from FFT-
(1 s long) are shown. .
cepstrum. (a) For the sound /i/ in “We . . .” (b) For the sound lo/
in “. . . ago.”

cepstrum computation time which includes two Fourier trans- TABLE XI11
formations is almost twice that of the LPC-cepstrum. AVERAGEERRORRATES.UTTERANCESET:No. (6) (FFT-CEPSTRUM).
The distance ratio for each parameter derived from the FFT- FR. FALSE (FALSE
REJECTION ALARM).FA FALSE ACCEPTANCE
(MISS RATE).
cepstrum was calculated. The result was similar to that for the
LPC-cepstrum except that the speaker dependent information
Threshold FR FA
in the FFT-cepstrumtends to concentrate in the first-order 2

cepstrum. Overall average distance ratios for cepstrum coeffi-


cients, the first-order polynomial coefficients and the second- Estimated 0.29% 0.33% 0.31%

order polynomial coefficients are 2.15, 1.89, and 1.45, re- A

spectively, for LPC-cepstrum, and 2.07, 1.83, and 1.37, Priori Fixed 0.86% 0.74% 0.80%
respectively, for FFT-cepstrum. LPC-cepstrum has slightly
A Posteriori 0.02%
largervalues than FFT-cepstrum, but the difference is very L
small.
Table XI11 shows the results of a speaker verification experi- same results in speaker verification as the conventional FFT-
ment using FFT-cepstrumunder the same condition as that cepstrum, while it takes only half the time to calculate the
using LPG-cepstrum whose results were presented in Table 111. LPG-cepstrum compared with the FFT-cepstrum.
The difference in error rates between these two experiments is
very small. Speaker verification results using FFT-cepstrum C. Effectiveness of Polynomial Coefficients
when the interval between training andtest utterances was In order to study the effectiveness of the use of polynomial
long was similar to the results obtained using LPC-cepstrum. coefficients on speaker verification, an additional experiment
It is apparent that the LPC-cepstrum produces almost the was performed in which the polynomial coefficients were
FURUI: AUTOMATICSPEAKERVERIFICATION 267

TABLE XIV
ERROR
AVERAGE RATES.TRAININGUTTERANCES:PROCESSED BY
PREEMPHASIS. UNPROCESSED
TESTUTTERANCES: BY PREEMPHASIS.
FR FALSE (FALSEALARM).
REJECTION F A FALSEACCEPTANCE A PRIORI THRESHOLD
(MISSRATE). 1.5 (FR + FA112

Feature
\
Parameters Threshold

Estimated
FR

2.29%
I 1
FA

1.69%
FR + FA

1.99%
!t
W

0
I \

\
Cepstrum A

Coefficients Priori Fixed 3.14% 2.31% 1.73%

A Posteriori 0.64%

Estimated 0
0 50 100 (50 200
Cepstrum A SEGMENT
LENGTH (mS)

and Priori Fixed Fig. 16. Error rate versus the length of the speech segment for orthogo-
'
nal polynomial expansion. Training utterances were processed with
Polynomial preemphasis, whereas test utterances were not.
Coefficients

A Posteriori 0.14% TABLE XV


ERRORRATES.UTTERANCE
AVERAGE SET:No. (1). TIMEREGISTRATION:
CONSTRAINED DYNAMIC
END-POINT TIMEWARPING.FR FALSE
(FALSE
REJECTION ALARM).FA FALSEACCEPTANCE (MISSRATE).
omitted using only the timefunctionsofthe first to tenth
FR + FA
-
cepstrum coefficients. The training and test utterance record- Threshold FR FA
2
ing conditions are the same as theexperimentdescribed in
Section V-D, where preemphasis is applied to thetraining utter- Estimated 0.40% 0.43% 0.42%
ances but omitted for the test utterances. Cepstrum normal- A
ization was applied to all utterances. TableXIVshows the
Priori Fixed 0.57% 0.58% 0.58%
verification error rates including previous results whichwere
obtained when polynomial coefficients are included. It can be
seen that error rates are increased by a factorof three or more
when polynomial coefficients are omitted.
it can be concluded that 90 ms is a reasonable value for poly-
D. Optimum Length of Speech Segment for nomial expansionin this speaker verification system.
Polynomial Expansion
In the experiments described so far, the length of the speech E. Comparison Between Unconstrained Endpoint and
segment for which time functions of cepstrum coefficients are Constrained Endpoint Dynamic Time Warping Methods
expandedby an orthogonalpolynomialrepresentation has Table XV shows speaker verification results when con-
been set to 90 ms. This value was determined to be adequate strained endpoint dynamic time warping is used. Other condi-
for preserving transitional information between phonemes. In tions are the same as the experiment whose results were shown
order to check the appropriateness of this value oflength, in Table 11. The error rate usingtheconstrainedendpoint
additionalspeaker verification experiments were performed method is almost twice as that using the unconstrained end-
varying the length between 50 ms and 210 ms. Training utter- point method. This result shows the advantage of the uncon-
ances were processedwithpreemphasisbut test utterances strained endpoint dynamic time warping method, which can
were not. Cepstrumnormalization was applied to all utter- cope with the uncertainty in the location of both the initial
ances. The condition for the experiment in the previous sec- and final frames due to breath noise, etc., over the constrained
,tion corresponds to the condition of zero length segment.The endpoint method.
speaker verification error rates with a priori or a posteriori
threshold are plotted in Fig. 16 as afunction of segment F. Effectivenessof Dynamic Time Warping Guided by
length, including the results of the previous section, For the Shorter One of Either Input or Reference
a priori threshold condition, the averaged values of false ac- In Section 11-C it was stated that optimum matches are ob-
ceptance and false rejection rates are plotted. The verification tainedby usingasguide theshorterof either theinputor
error rate is a minimum for 170 ms and the error rate increases referencecontours. Thiswarping procedure was used inall
both for shorterand longerlengths. the speaker verification experiments described in this paper.
Although 90 ms is not the optimumvalue, the difference be- To show the effect of not observing this procedure an addi-
tweentheerror rates for 90 ms and the optimum value of tional experiment was performed. Utterance set (5) was used
170 ms is small. The number of computation increasesin pro- in a speaker verification experiment in which the input utter-
portion to the segment length. Based on these considerations, ance exclusively was used as the guide. This is referred to as
268 IEEE TRANSACTIONS ON
ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL. ASSP-29, NO. 2 , APRIL 1981

the UEGI (unconstrained endpoint guided by input) method,


incontrast tothe UEGS method[unconstrainedendpoint
guided by (the) shorter (of reference and input)] adopted in
all other experiments. \ - UEGS
‘, 0 FALSE REJECTION
A reference templateofeachcustomer was constructed ‘ FALSE ACCEPTANCE
using five utterances and updated at every access by the cus-
tomer for the first seven test utterances (method 2) and up-

7
dated every seventh access by the customer for the remaining \
\
testutterances(method 1).Decision thresholds were set \

a priori based on (1 2). ; 1.5


\
\

Mean error rates for this utterance set using the UEGI pro- \

cedure are plotted in Fig. 17 along with the results obtained Es \


\
\
earlier using the UEGS procedure. It canbe seen that, al-
though false acceptance error rates are comparable for the two
techniques, false rejection rates for the first and middle seven
input utterances are much greater forthe UEGI procedure.
This outcome may be attributed to the fact that until stable
references are established by updating, the lengths of reference
and input utterances are quite variable. Therefore, largedis-
crepancies can be expected between the UEGI and UEGS pro-
FIRST MlwLE LAST
cedures. However, with stable references associated with the INPUT UTTERANCES
last seven input utterances the lengths of input and reference
Fig. 17. Comparisonoferror ratesfortwo dynamic time warping
utterances are more consistent and little or no discrepancy is techniques; UEGI and UEGS. Results for utterance set (5).
expected between the two techniques.
When warping isguided bytheinpututterance,the first
frame of theinput may bewarped to thefirstthrough
(6 t 1)th frame of the reference where 6 specifies the width of -7 30- 142) 9.5% (25) 5.7%

the allowable range. Similarly, when warping is guided by the


reference, the first frame of the reference may be warped to
the first through 6 t l t h frame of the input.
An experiment was carried out using 26 utterancesfrom
each of the 21 male customers in utterance set (5). Each cus-
(8) 2.6%
tomer’s utterances were paired with the customer’s reference
and matched two ways, using the input utterance as the guide
and the reference utterance as the guide. For each such pair,
the slave frame matched to the first guide frame for the better
of the two matches (the match resulting in the lower overall
distance) was tabulated. This tabulation is presentedin the
histograms of Fig. 18.
Along the abscissais plottedthe slave frame minusone REFERENCE
(matched to guide frame numberINPUT one) with the input as slave REFERENCE
plottedalongthe positive axis andthe reference asslave Fig. 18. Histograms forthe startingframeof theoptimum warping
plotted along the negative axis. Equivalently, the positive axis function, where a positive number means that warping function starts
represents matches in which the reference is guide while the from the input axis and a negative number means that it starts from
the reference axis. N : number of frames of reference function, M:
negative axis represents matches in which the input is guide. number of frames of input function.
Each histogram point represents thenumber of optimum
matchescorresponding to the indicated slave frame.The re-
gion enclosed by the shaded vertical bars represents optimum into two categories. In Fig. 18(b) all theoptimummatches
matches which can be obtained by using either the reference are shown for which the input length i s less than or equal to
asguideor input as guide. Forexample,optimummatches the reference length, while in Fig. 18(c) are shown all the
within the shaded region to the left of the y-axis, whch are optimum matches for which the reference length is less than
actually obtained using the input as guide, are substantially the the input length. It can be seen immediately that the greatest
same when guided by the reference, matching the first refer- number of optimum matches is associated with using as guide
ence frame to the first input frame. Thus, from Fig. 18(a) all the shorter of the input and reference. That is, in Fig. 18(b)
but 9.5 percent of the optimum matches are obtained by using all but 2.6 percent of the optimum matches are obtained by
the reference as guide, while all but 5.7 percent are obtained using the input as guide, while in Fig. 18(c) all but 3.7 percent
by using the inputas guide. of the optimum matches are obtained by using the reference
Fig. 18(b) and (c)decompose thematchesof Fig. 18(a) as guide.
FURUI:AUTOMATICSPEAKERVERIFICATION 269

TABLE XVI
AVERAGE
ERRORRATES.UTTERANCE SET:No. (1). TIMEREGISTRATION
CONSTRAINED
END-POINT DYNAMIC TIMEWARPING. NUMBER OF
TRAINING 3 . F R FALSE
UTTERANCES: REJECTION (FALSE ALARM).
F A FALSEACCEPTANCE (MISSRATE).

-
.
40

.
0

&
,
,,
30-

,
,, e
.
A Posterior 0.14%

G. Effect of the Number of Training Utterances


Table XVI shows the results of a speakerverification experi-
Fig. 19. Relation between j&(k) - ?DB ( k ) and equal error threshold
ment in which three utterances were used to construct a refer- Oeq. Results of the speaker verification experiment using utterance
ence template. Other conditions are the sameas the experi- set (3).
ment whose result was shown in Table XV (Section XI-E),
which means that constrained endpoint dynamic time warping TRAINING TEST
method was used. Comparing Table XV and XVI, it canbe 0 CLEAR
seen that using three utterances to make a reference template LPC x LPC
ADFCM ADPCM
is not adequate. The error rate becomes almost twice that ob-
tained when five training utterances are used. 2.5
Next,thenumberof training utterances was increased to
ten,andaspeaker verification experiment was performed.
However, it produced no improvement compared with the re-
sults using five training samples to construct a reference tern- 2 .o
plate. It can be concluded that five utterances are necessary
and sufficient to make a reference template.

H. Threshold Estimation 8 1.5


W
In this paper, (12) is used to set an a priori decision thresh- I-
Q
old for each customer. This equation and two parametersin it K

were determinedexperimentally. Fig. 19 showsthe relation


1.0
between $ D B ( ~ )- G D (~k ) and equal error threshold e,,@) W
which produces the equal error of false acceptance and false
rejection. This is the result of the speaker verification experi-
ment using utterance set (3) which produced the error rates 0.5
shown in Table VI. The correlation coefficient between
gDB(k)- GnB(k) and e&) calculated from these valuesis
0.753. This result indicates the appropriateness of using (12)
I I I I I
to estimate the optimum decision threshold. 0
5 6 7 8 9
To determine the effect of varying the parameter b in (12)
on the error rate, all thecustomerutterancesandimpostor PARAMETER b

utterances which were tested in the speaker verification ex- Fig. 20. Error rate versus b parameter in the threshold estimation equa-
tion. The effects of parameter variation on the average of false rejec-
periment using themixedtransmissionsystem(Section V) tion andfalse acceptance errorratesare shown using the estimated
were scannedby varying theparameter b. Theparameter a decision threshold. Results for nine training systemtestingsystem
was set to 0.6, which was determinedexperimentally. The pairs are plotted.
number of errors was tabulated at each step by comparing the
actual overall distances with the estimatedthreshold using of b. The optimum value of the parameter bywhich produces
the varied parameter value b. False acceptance rate and false the minimum average error rate, is seven almost irrespective of
rejection rate wereaveraged and plotted in Fig. 20. Results the experimental condition. This is the value consistently used
for nineconditions,which are ninecombinations of three in this paper to estimate an optimum threshold for each cus-
training systems and three testing systems, are shown. As the tomer. As there is sometradeoffbetweenthe two kinds of
result of increase in false acceptance rate for large threshold error rates, if it is desirable to keep the false acceptance rate at
values and increase in false rejection rate for small threshold a much lower value, the parameter b should be set at a value
values, the averaged error rate has a concave slopeas a function smaller than seven, even though it increases the false rejection
270 IEEE TRANSACTIONS ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING,VOL. ASSP-29, NO. 2, APRIL 1981

7r TEST TRAINING
-FALSE
REJECTION --- FALSE
ACCEPTANCE
UTTERANCE SET (1) x UTTERANCESET (61
CLEAR o CLEAR
0 UTTERANCE SET(31 A UTTERANCE
SET (51
~

5-
-3 1.0-

-E W

w 4- 5
a
a
a
B 3-
a
P
05-
L.-.
6

2-

0-
I-
WITHHOLDING
THRESMLD

(a)
---
24 25 26 27 28
-FALSE ACCEPTANCE

''jl
FALSE REJECTION
THRESHOLD
UTTERANCESET (11
Fig. 21. Error rate versus decision threshold. The effects of threshold 0 UTTERANCESET (31
variation on the average of false rejection and false acceptance error X UTTERANCE SET(6)
rates are shown using the threshold. Results for nine txaining system-
testing system pairs are plotted.
0

rate. Conversely,values of b larger than seven producea


smaller false rejection rate and a larger false acceptance rate.
The dashed line in Fig. 19 indicates the relation '\
eeq = 0.6 (pDB- SDB) + 7. 3) (1
Fig. 21 shows verification error rate, which is the mean value
of false acceptance rate and false rejection rate, as a function
of decisionthreshold for the same experimentalconditions. A t t l ' l
10 15
Results for the nine conditions are plotted. The reader should
WITHHOLDINGTHRESHOLD
note that the scale of this figure is different from the previous
(b)
one. This result shows that the optimum value of the thresh-
Fig. 22. Errorrate versus withholdingthreshold.(a) False rejection
old varies considerably depending on the utterance set. Thus, and false acceptance rates on the condition that a reference template
it is very difficult to set the threshold in advance independently is updated every seventh access by each customer. Mean time in-
of the utterance set. terval between training and test utterances is six days. (b) False re-
jectionand false acceptancerates averaged over several conditions
These results indicate the effectiveness of the threshold when the timeintervalbetweenreference and test utterances is
estimating method using (12). varied from six days to six weeks.

I. Withholding Decision
Another tabulation was carried out to assess the effect on -SHORTTIMEINTERVALBETWEENTRAINING
error rate ofwithholdingdecision(sequentialdecision)on ---LONG ) AND TEST
UTTERANCESET ( 1 I
UTTERANCES

trials for which I D T - 0 I < A, where DT and 0 are the overall 0 UTTERANCESET (31
a UTTERANCE
SET (61
distanceandthreshold, respectively. When the decisionis UTTERANCESET (51

withheldona given utterance,a new distance is calculated


which is the mean of the distances of the withheld utterances
and the succeeding utterance. Utterance sets (l), (3), ( 9 , and
(6) were used in this experiment. As these utterance sets in-
clude only one utterance for each impostor, impostor utter-
ances were not used in this experiment.
Fig. 22 shows error rates as a function of the withholding
threshold A. Part (a) shows the results when reference tem-
plates were updated every seventh trial for each customer, and
part (b) showsaveraged error rates overseveral conditions
when the interval between training and test utterances were
varied from six days to six weeks. Fig. 23 shows the per- WITHHOLDING
THRESHOLD

centageofwithheld trials, which is thepercentage of addi- Fig. 23. Percentage of withheld trials versus withholding threshold.
R FURUI: AUTOMATIC 271

-- -FALSEACCEPTANCE In order t o test the independence of these two kinds of in-


-FALSEREJECTION
UTTERANCE SET ( 1 ) formation, the distribution of speaker verification error rates
0 UTTERANCE S E T ( 3 )
x UTTERANCE SET(6) by cepstrum and that by pitch and intensity contours were
A UTTERANCE SET(5) comparedwitheachotherand a correlation coefficient be-
tween them was calculated. Whenall the error rates plotted
in Fig. 12 and the error rates obtained with the same condi-
tions using pitch and intensity contours are used, the correla-
tion coefficient is 0.22 and -0.36 for false rejection and false
acceptance, respectively. It canbe concluded thatthetwo
kinds of information are fairly independent.' Based on these
results, an improvement in performance canbe expected by
combining thesetwo kinds of information.

VII. SUMMARY
WITHHOLDING
RATE I%)
A new system for automatic speaker verification has been
(a) implemented on a 16-bit .laboratory computer and evaluated.
--- FALSE ACCEPTANCE A fixed, sentence-long utterance is analyzed by cepstrum co-
-FALSE REJECTION efficients by means of LPC analysis. Frequency-response dis-
UTTERANCE SET ( 1 tortionsintroducedbytransmissionsystemsare removed
o UTTERANCESET (3)
1.0r R automatically. Time functionsofcepstrum coefficients are
expanded by orthogonal polynomial representations and com-
- 0.8 - paredwithstoredreferencefunctions.Afterdynamictime
W
% warping, a decision is made to acceptor reject anidentity
01
01
claim. Referencefunctionsanddecisionthresholds are up-
0.6-
L
dated for each customer. The total processing time is approxi-
0
W
mately 40 times real time in this computer simulation.
i 0.4 - In the first part of the experiment, three sets of utterances
s were used fortheevaluationof the system.The first and
0
z second sets eachcomprises 50 utterancesbytencustomers
o.2 t I 0
each and a single utterance by 40 impostors recorded over a
conventionaltelephoneconnection. The third set comprises
O LL- O 20
WITHHOLDINGRATE
30
I

(%)
40
I KJ50
26 utterances by 21 customers each and a single utterance by
55 impostors recorded over ahigh-qualitymicrophone.The
(b) first and third sets were uttered by male speakers, whereas the
Fig. 24. Normalized error rates versus percentage of withheld trials. second set was uttered by female speakers. The evaluation in-
(a) and (b) correspond to (a) and (b) in Fig. 22, respectively. dicated mean error rates of 0.19 percent, 0.36 percent, and
0.77 percent for each utterance set,respectively.
Second, the first utterance set was processed by an ADPCM
tional trials, as afunctionofthewithholdingthreshold A. codingsystem and an LPC codingsystem.Theseutterance
Fig. 24 shows the relation between the percentage of withheld sets were used for a speaker verification experiment together
trials and the error rate normalized by the error rate obtained with an unprocessed utterance set. Experimental results indi-
without withholding. Part (a) and part (b) correspond to part cate that the transmission system affects the verification ac-
(a) and part (b) in Fig. 22, respectively. Fig. 24 indicates that curacy only slightly even if the reference and test utterances
at least 30 percent improvement in error rate can be obtained are subjected to different transmission conditions.
with decisions withheld on five percent of the trials and an Third,thetime interval betweenreference andtestutter-
average 73 percentimprovement can beobtainedwith de- ances was changed from six days to six weeks. Results of the
cisions withheld on ten percent of the trials. experiment indicate no significant increase of verification error
with the increase of time interval.
J. Combination with Pitch and Intensity Contours These results verify the robustness of the new speaker verifi-
Speaker verification systemsbasedonpitchandintensity cationsystempresented inthis paper.Some discussions on
contours have beenevaluated in Bell Laboratories using the new techniques used in this system are also included in this
same utterance sets used in this paper [l] -[5]. paper.
As theinformationconveyedbypitchandintensitycon- Further investigations, current or projected, include a large-
tours is considered to be almost independent of that conveyed scale and long-term evaluation over telephone lines permitting
by the cepstrum, the combination of these two kinds of in- direct customer access and on-line response,and specialized
formation will improve the performance. hardware processing to improve response time.
212 IEEE TRANSACTIONS ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL. ASSP-29, NO. 2 , APRIL 1981

ACKNOWLEDGMENT [13] P. Cummiskey, N. S. Jayant,and J. L. Flanagan,“Adaptive


quantization in differential PCM coding of speech,” Bell Syst.
The author wishes to thank J. L. Flanagan and A. E. Rosen- Tech. J.,vol. 52, pp. 1105-1118, Sept. 1973.
berg for their guidance and stimulating discussions, and to ac- [ 141 R. E. Crochiere and L. R. Rabiner, “Optimum FIR digital filter
knowledge the assistance of C. A. McGonegal in providing the implementations for decimation, interpolation, and narrow-band
filtering,” IEEE Trans. Acousr.,Speech, Signal Processing, vol.
data base. ASSP-23, pp. 444-456, Oct. 1975.
[ 151 F.Itakuraand S. Saito, “Analysis synthesis telephony based
REFERENCES upon the maximum likelihood method,” Reports 6th Znr. Cong.
Acoust.,Y. Kohashi,Ed., vol. C-5-5, C17-20, Tokyo, Japan,
[ 11 G. R. Doddington, “A computer method of speaker verification,” 1968.
Ph.D. dissertation, Dep. Elec. Eng., University of Wisconsin, [ 161 J. D. Markel and A. H. Gray, Jr., Linear Prediction of Speech.
1970. New York: Springer, 1976.
[2] R. C. Lummis,“Speakerverification by computer using speech [ 171 J. J. Dubnowski, R. W. Schafer, and L. R. Rabiner,“Real-time
intensity for temporal registration,” IEEE Trans. Audio Electro- digital hardware pitch detector,” IEEE Trans. Acousr., Speech,
acousr., vol. AU-21, pp. 80-89, Apr. 1973. Signal Processing, vol. ASSP-24, pp. 2-8, Feb. 1976.
[3] A. E. Rosenberg and M.R. Sambur, “New techniques for auto- [ 181 L. R. Rabiner, “On the use of autocorrelation analysis for pitch
matic speaker verification,” IEEE Trans. Acoust., Speech, Signal detection,” IEEE Trans. Acousr., Speech, Signal Processing, vol.
Processing, vol. ASSP-23, pp. 169-176, Apr. 1975. ASSP-25, pp. 24-33, Feb. 1977.
[4] A. E. Rosenberg, “Evaluation of an automatic speaker-verification [ 191 S . Furui and F. Itakura, “Talkerrecognition by statistical fea-
system over telephone lines,” Bell Syst. Tech. J., vol. 55, pp. tures of speech sounds,” Electron. Commun., vol. 56-A, pp. 62-
723-744, July-Aug. 1976. 71, 1973.
[5] C. A. McGonegal, L. R. Rabiner, and A. E. Rosenberg, “The ef-
fectsof ADPCM coding and LPC vocoding on an automatic
speakerverification system,” J. Acoust.SOC.Amer., vol. 65,
supplement 1, YY7, Spring 1979.
[6] F. Itakura, “Minimum predictionresidualprincipleapplied to
speech recognition,” IEEE Trans. Acoust., Speech, Signal Process-
ing, vol. ASSP-23, pp. 67-72, Feb. 1975.
[7] S. Furui, “An analysis of long-term variation of feature parame- Sadaoki Furui (”79) was born in Tokyo, Ja-
ters of speech and its application to talker recognition,” Electron. pan, onSeptember 9, 1945. He received the
Commun., vol. 57-A, pp. 34-42,1974. B.S., MS., and Ph.D. degrees in mathematical
[8] -, “Effects of long-term spectral variability on speaker recog- engineering andinstrumentation physics from
nition,” J. Acousr. SOC. Amer., vol. 64, supplement 1, NNN 28, Tokyo University, Tokyo, Japan, in 1968,
Fall 1978. 1970, and 1978, respectively.
[9] B. S. Atal, “Effectiveness of linear prediction characteristics of the Afterjoining the Electrical Communication
speech wave forautomatic speakeridentification and verifica- Laboratories, Nippon Telegraph and Telephone
tion,”J. Acousr. SOC.Amer., vol. 55, pp. 1304-1312, June, 1974. Public Corporation, Tokyo, in 1970, he studied
[ 101 N. L. Johnson and F. C. Leone, Statistics and Experimental De- the analysis of speaker characterizing informa-
sign in Engineering and theAppliedSciences, 2nded., vol. l. tion in the speech wave, and its application to
New York: Wiley, 1977, pp. 471-481. speakerrecognition and interspeaker normalizationin speech recog-
[ l l ] L. R. Rabiner, A.E. Rosenberg, and S. E. Levinson, “Con- nition. He is currently a Staff Engineer at Musashino Electrical Com-
siderations in dynamic time warping algorithms for discrete word munication Laboratory. From December 1978 to December 1979 he
recognition,” IEEE Trans. Acousr.,Speech, Signal Processing, joined the staff of the Acoustics Research Department at Bell Labo-
vol. ASSP-26, pp. 575-582, Dec. 1978. ratories, Murray Hill, NJ, as an exchange visitor working on speaker
[ 121 S. L. Bates, “A hardware realization of a PCM-ADPCM code con- verification.
verter,” MS. thesis, Dep. Elec. Eng. and Comp. Sci., Massachu- Dr. Furui is a member of the Acoustical Society of Japan and the In-
setts Institute of Technology, Cambridge, MA, Jan. 1976. stitute of Electronics and CommunicationEngineers of Japan.

View publication stats

You might also like