
A Fast Shrinking Suspicious Criminals System From the Voice

Chun-Hsi Shih1, Chi-Shi Liu1, Chun-Hsiao Lee, So-Lin Yen2


Jung-Hsiang Liao2, Bor-Shenn Jeng1

1. Telecommunication Labs, Chung-Hwa Telecom Co., Ltd., Taiwan, ROC.


2. Ministry of Justice Investigation Bureau, Taiwan, ROC.

ABSTRACT

As science and technology advance, ever more versatile ways of committing crime appear. In Taiwan we often find that criminals use the telephone to blackmail people, and in such cases the voice is the only cue to who is responsible. If a large collection of pre-recorded voice files is available, those files must all be searched to find the offender; if no voice database exists at all, investigators can only guess from experience. Either way takes a great deal of time, even for an experienced worker. A computer can save that time if a good method can be developed to shrink the candidates to a smaller range of possible criminals, or to find their identities directly, so that experienced workers can make the final judgement more quickly. This paper discusses such a shrinking system and gives a preliminary result.

1. INTRODUCTION

As science, technology and communication tools advance, ever more versatile ways of committing crime appear. To prevent further crimes and to quickly determine the place, the time and the person involved in a crime, more efficient methods should be developed. In Taiwan, voice is auxiliary evidence in some criminal cases, and some crimes are committed by using the telephone to blackmail people. In such situations the voice is the only cue to who is responsible. To address this problem, two organizations, Chung-Hwa Telecommunication Labs and the Ministry of Justice Investigation Bureau, have joined together to pursue research on speaker authentication. In this paper we report the preliminary results of a fast system for shrinking the set of suspicious criminals. The system provides an environment that lets operators easily create the criminal voice database. We use the spectrum as the system feature set; it can be computed automatically and kept in the computer. To alleviate the influence of the channel and of noise, the derived feature set is normalized and equalized. In a second step, a statistical model is built for each speaker from these feature sets, and the same procedure is applied to every newly enrolled speaker.

When a criminal's voice arrives, whether recorded in person or over a remote transmission medium, the system can quickly find a set of possible criminals and leave the final judgement to experienced workers. The number of candidates to be judged can be defined by the operators: they can ask the system to output a fixed number of criminals whose voice characteristics are closest to the true crime voice, or set a confidence score and let the system pick the possible criminals automatically. The system can play back the selected speakers' voices and plot their voice characteristics so that operators can make the final check.

The rest of this paper is organized as follows. Section 2 introduces the system architecture. Section 3 describes how to use the system and gives a short demonstration. A summary is given in the final section.

2. SYSTEM ARCHITECTURE

To develop a system for shrinking the set of suspicious criminals, we planned the architecture shown in Fig. 1. It contains the following components: recording environment, high pass filter, end point detection, bias removal, feature extraction, distance computation, candidates display, and final judgement. Below we explain the function of each step and the method used to implement it.


Fig. 1 System architecture

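As a reader's aid, the sketch below strings these components together in the order of Fig. 1. It is not the authors' code: every function name (high_pass_filter, detect_endpoints, extract_lpc_cepstrum, vq_distortion) is a hypothetical placeholder for the corresponding component described in Sections 2.1 to 2.6, and the default of ten returned candidates is an arbitrary choice.

```python
import numpy as np

def identify_candidates(samples_8k, speaker_codebooks, top_n=10):
    """Hypothetical end-to-end pipeline mirroring Fig. 1 (illustration only)."""
    x = high_pass_filter(samples_8k)      # Section 2.2: remove the DC component
    x = detect_endpoints(x)               # Section 2.3: keep only the true speech
    feats = extract_lpc_cepstrum(x)       # Section 2.5: per-frame LPC-cepstrum vectors
    feats = feats - feats.mean(axis=0)    # Section 2.4: bias (cepstral mean) removal
    # Section 2.6: average VQ distortion against every enrolled speaker's codebook
    scores = {spk: vq_distortion(feats, cb) for spk, cb in speaker_codebooks.items()}
    return sorted(scores, key=scores.get)[:top_n]  # the smallest distortions come first
```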
2.1 Recording environment

The speaker's voice can be obtained from the internet, the telephone, tape, a microphone, or other storage devices. According to its type, the voice can be divided into two classes, analog and digital. An analog signal must be digitized before it is stored in the computer. If the analog voice comes from the telephone, a telephone interface must be opened to receive it; in our system we use a Dialogic telephone interface card, which samples at 8 kHz with an 8-bit PCM encoder, to receive the analog voice. Other analog sources, such as cassette tape, are recorded and digitized by a built-in Sound Blaster sound card. After digitization, all data are converted into a uniform data type: 16-bit linear PCM ".wav" format with a sampling rate of 8 kHz.
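The following sketch illustrates this normalization step. It is not the system's recording software: it assumes Python with NumPy and SciPy, treats 8-bit input as unsigned linear PCM (a real telephone card may instead deliver mu-law data that must be decoded first), and assumes the file is already mono.

```python
import wave
import numpy as np
from scipy.signal import resample_poly

def load_as_16bit_8khz(path, target_rate=8000):
    """Read a mono .wav file and return 16-bit linear PCM samples at 8 kHz."""
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        width = w.getsampwidth()
        raw = w.readframes(w.getnframes())
    if width == 2:        # already 16-bit linear PCM
        x = np.frombuffer(raw, dtype=np.int16).astype(np.float64)
    elif width == 1:      # 8-bit samples, assumed unsigned linear PCM
        x = (np.frombuffer(raw, dtype=np.uint8).astype(np.float64) - 128.0) * 256.0
    else:
        raise ValueError("unsupported sample width: %d bytes" % width)
    if rate != target_rate:
        x = resample_poly(x, target_rate, rate)   # convert the sampling rate to 8 kHz
    return np.clip(x, -32768, 32767).astype(np.int16)
```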
2.2 High pass filter

Some of the incoming digitized voice may contain a DC component. To remove it, a 60 Hz high pass filter is used.
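A minimal sketch of this step is shown below. The paper only states that a 60 Hz high pass filter is applied; the Butterworth design, its order, and the use of SciPy are assumptions made here for illustration.

```python
import numpy as np
from scipy.signal import butter, lfilter

def remove_dc(samples, fs=8000, cutoff_hz=60.0, order=2):
    """High-pass filter the digitized voice to strip the DC component and drift."""
    b, a = butter(order, cutoff_hz / (fs / 2.0), btype="highpass")
    return lfilter(b, a, samples.astype(np.float64))
```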
2.3 End point detection

Since the incoming voice contains some silence before the true speech begins, it is necessary to remove this part, both to reduce the computational load and to avoid the side effect the non-speech portion has on the true speech portion. To do this we use two features, the zero crossing rate and the normalized energy, to detect the true speech. To avoid discarding too much of the true speech, we do not impose a rigid bound.
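A sketch of such an end point detector is given below. The frame length and the two thresholds are illustrative values chosen here, not figures from the paper, and the decision rule (keep frames with high normalized energy, or with a high zero crossing rate and at least a little energy) is one simple way to realize the loose bound described above.

```python
import numpy as np

def trim_silence(x, frame_len=256, energy_thresh=0.02, zcr_thresh=0.25):
    """Keep the frames that look like speech, judged by normalized energy and
    zero crossing rate; silence before and after the true speech is dropped."""
    n_frames = len(x) // frame_len
    frames = x[: n_frames * frame_len].reshape(n_frames, frame_len).astype(np.float64)
    energy = (frames ** 2).mean(axis=1)
    energy = energy / (energy.max() + 1e-12)                            # normalized energy
    zcr = (np.abs(np.diff(np.sign(frames), axis=1)) > 0).mean(axis=1)   # zero crossing rate
    keep = (energy > energy_thresh) | ((zcr > zcr_thresh) & (energy > 0.1 * energy_thresh))
    return frames[keep].reshape(-1)
```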
2.4 Bias removal

The incoming speech may come from different recording environments. To alleviate the effect of the recording environment and reduce its influence on the final performance, a simple but efficient method is used: we take the average of the whole utterance's log-spectrum, i.e. its cepstrum, and subtract it from the cepstrum of each frame of the utterance. This simple method has proved to be efficient [1].
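In code this bias removal is a one-line operation on the matrix of per-frame cepstral vectors (the vectors produced in Section 2.5); the sketch below assumes NumPy and a frames-by-coefficients array.

```python
import numpy as np

def cepstral_mean_subtraction(cepstra):
    """Subtract the long-term average cepstrum from every frame, removing a
    fixed channel or recording-environment bias as in [1]."""
    return cepstra - cepstra.mean(axis=0, keepdims=True)
```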
2.5 Feature extraction

The key to an efficient speaker authentication system is feature extraction. It is not easy to find an efficient feature directly from the time-domain waveform. Supra-segmental features such as pitch, energy and zero crossing rate can be used to improve performance, but their value is limited, and in most speaker authentication systems they serve only as auxiliary features. For this reason the segmental information, that is, the spectrum, is taken as the feature for this shrinking system. LPC (Linear Predictive Coding) coefficients, LAR (Log Area Ratio), LSP (Line Spectrum Pair) and cepstrum are all good spectral features; in this paper we choose the LPC-cepstrum as the system feature.
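A sketch of the LPC-cepstrum computation for a single frame is given below. The analysis order, the number of cepstral coefficients, and the omission of pre-emphasis and windowing (normally applied before this step) are simplifications made here; the Levinson-Durbin recursion and the LPC-to-cepstrum recursion themselves are the standard ones.

```python
import numpy as np

def lpc_cepstrum(frame, order=12, n_ceps=12):
    """LPC analysis of one windowed frame followed by conversion to LPC-cepstrum."""
    frame = frame.astype(np.float64)
    # autocorrelation r[0..order]
    r = np.array([np.dot(frame[: len(frame) - k], frame[k:]) for k in range(order + 1)])
    r[0] += 1e-9                         # guard against an all-zero frame
    # Levinson-Durbin recursion for A(z) = 1 + a[1] z^-1 + ... + a[p] z^-p
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)
    # recursive conversion of the LPC coefficients to cepstral coefficients
    c = np.zeros(n_ceps + 1)
    for n in range(1, n_ceps + 1):
        s = sum(k * c[k] * a[n - k] for k in range(max(1, n - order), n))
        c[n] = -s / n - (a[n] if n <= order else 0.0)
    return c[1:]
```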


Fig. 2 Training algorithm
2.6 Distance computation

Before discussing the distance computation, it is necessary to introduce how the speaker models are created. The steps are shown in Fig. 2. The recording environment, high pass filtering, end point detection, bias removal and feature extraction steps are the same as those described in Sections 2.1 to 2.5. After these steps, we use the cepstrum features and a VQ training algorithm to create the speaker models. Several other methods could be used to create speaker models, such as the GMM (Gaussian Mixture Model), the HMM (Hidden Markov Model), and template models. However, the main point of this system is to shrink the set of suspicious criminals quickly and let the experts make the final judgement, so an efficient method with low computation time is preferred. We must also consider that the incoming speech can be any combination of text; the system is neither text-dependent nor restricted to a limited domain, and the modelling method has to tolerate this. Based on these considerations we discard the probabilistic models and choose the VQ (Vector Quantization) model as our system model. In the following we introduce how the speaker models are trained with the VQ training algorithm and what that algorithm is.

[VQ training algorithm]

Vector Quantization is a way to partition a vector space into a number of non-overlapping cells according to a predefined distortion measure. The vector space is then represented by the centroids of these cells. Each centroid is called a codeword, and the set of centroids is called a codebook. The codebook is designed by finding a set of codewords such that the distortion $D_{VQ}$ is minimum, where $D_{VQ}$ is defined as

$$D_{VQ} = \sum_{i=1}^{T} d_{VQ}\bigl(x(i), \hat{x}(i)\bigr),$$

where $\hat{x}(i) = \arg\min_{c(j) \in C} d_{VQ}\bigl(x(i), c(j)\bigr)$ and $d_{VQ}\bigl(x(i), \hat{x}(i)\bigr)$ is a distortion measurement between $x(i)$ and $\hat{x}(i)$. The LBG algorithm [2] can be used to derive the codebook $C$. Encoding a test vector $y(i)$ means finding the codeword $\hat{y}(i)$ in codebook $C$ such that $d\bigl(y(i), \hat{y}(i)\bigr)$ is minimum. Based on the above algorithm, each speaker obtains his own codebook.
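The sketch below shows one common form of the LBG procedure [2]: start from the single global centroid, split every codeword with a small perturbation, and refine the enlarged codebook with a few Lloyd (k-means style) iterations until the desired size is reached. The codebook size, the perturbation eps, the iteration count, and the squared-error distortion are assumptions made here for illustration.

```python
import numpy as np

def lbg_codebook(features, size=64, eps=0.01, n_iter=10):
    """Train one speaker's VQ codebook from his (frames x coefficients) feature matrix."""
    codebook = features.mean(axis=0, keepdims=True)          # start: global centroid
    while codebook.shape[0] < size:
        # split every codeword into a (1+eps) and a (1-eps) copy
        codebook = np.vstack([codebook * (1.0 + eps), codebook * (1.0 - eps)])
        for _ in range(n_iter):                               # Lloyd refinement
            d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
            nearest = d.argmin(axis=1)
            for j in range(codebook.shape[0]):
                members = features[nearest == j]
                if len(members) > 0:                          # keep empty cells unchanged
                    codebook[j] = members.mean(axis=0)
    return codebook
```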
[VQ recognition algorithm]

Given a sequence of feature vectors $X = \{x(1), x(2), \ldots, x(T)\}$, where each element of $X$ is a feature vector obtained by signal analysis, the object of the source coder is to find the reproduction codebook such that the distortion between the vectors in $X$ and the reproduction codebook is minimum. The total distortion for $X$ is obtained by the following equation:

$$D_{VQ}(X, C) = \sum_{i=1}^{T} \min_{c(j) \in C} d\bigl(x(i), c(j)\bigr).$$
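In code, scoring a test utterance against the enrolled speakers reduces to quantizing every feature vector with each speaker's codebook and accumulating the distortion; the speaker whose codebook gives the smallest total (or average) distortion is the closest match. The squared-error distortion and the averaging over frames below are illustrative choices.

```python
import numpy as np

def vq_distortion(features, codebook):
    """Average distortion of a (frames x coefficients) test matrix against one codebook."""
    d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d.min(axis=1).mean()          # each frame is scored by its nearest codeword

def rank_speakers(features, speaker_codebooks):
    """Return (speaker, distortion) pairs sorted from the closest voice to the farthest."""
    scores = {spk: vq_distortion(features, cb) for spk, cb in speaker_codebooks.items()}
    return sorted(scores.items(), key=lambda kv: kv[1])
```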


[Beam search algorithm]

To find the right person quickly, we set up a computation beam that contains only the speakers with the smaller distances. The distances between the input speech and the speakers in the computation beam are computed simultaneously, frame by frame. After several frames have been computed, only the M speakers with the smallest accumulated distances are kept in the beam. This step is repeated until the end of the input speech. At each step some speakers with larger distances drop out of the beam, so the beam width shrinks as time goes on. This procedure saves search time, but the width should not be reduced too much; otherwise the target speaker may be lost during the search.
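The sketch below is one way to realize this pruning; it is an interpretation, not the authors' code. The block size, the beam width M, the floor on the beam width, and the halving schedule used to narrow the beam over time are all assumptions made here.

```python
import numpy as np

def beam_search(frames, speaker_codebooks, block=50, keep_m=32, min_beam=5):
    """Accumulate VQ distances block by block and prune the speaker beam as we go."""
    beam = {spk: 0.0 for spk in speaker_codebooks}            # running distance per speaker
    for start in range(0, len(frames), block):
        chunk = frames[start:start + block]
        for spk in beam:
            cb = speaker_codebooks[spk]
            d = ((chunk[:, None, :] - cb[None, :, :]) ** 2).sum(axis=2)
            beam[spk] += d.min(axis=1).sum()
        width = max(keep_m, min_beam)
        if len(beam) > width:                                 # drop the farthest speakers
            survivors = sorted(beam, key=beam.get)[:width]
            beam = {spk: beam[spk] for spk in survivors}
        keep_m = max(min_beam, keep_m // 2)                   # the beam narrows with time
    return sorted(beam.items(), key=lambda kv: kv[1])
```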
[Final judgement]

After the distances are computed, the system lists the top N speakers whose distances are the smallest. The value N is chosen by the operators. The distances are also shown to the operators, so that they can bring up further personal information. Operators can also use the distance score to decide which speakers' personal records should be displayed from the speaker corpus pool.
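A small sketch of this selection step follows; it simply applies either the fixed-N rule or the confidence-threshold rule mentioned in the introduction to the ranked list produced by the recognition step. The argument names are chosen here for illustration.

```python
def select_candidates(ranked, top_n=None, score_threshold=None):
    """Pick suspicious-criminal candidates from a (speaker, distance) list that is
    already sorted from the closest voice to the farthest."""
    if top_n is not None:                  # operator asks for a fixed number of candidates
        return ranked[:top_n]
    if score_threshold is not None:        # operator sets a confidence (distance) threshold
        return [(spk, dist) for spk, dist in ranked if dist <= score_threshold]
    return ranked
```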

3. OPERATION FLOW

When operators obtain a signal from the telephone, a tape or another storage device, they call up the recording program, whose menu is shown in Fig. 3, and select the record bar to record the signal. After recording, they call the recognition program to recognize who spoke this voice, searching a speaker pool in which speaker templates have been trained and stored by the training procedure of Section 2. After recognition, the shrinking system shows a list of recognition scores telling the operators which speakers have the closest scores to the input voice. The operators then use these scores as an index to extract the related information for those speakers. Typical information is shown in Fig. 4; here we list only part of the information for a speaker, such as his age, education, address and picture. Operators can define whatever other information is useful to them. They can also pop up the speaker's voice characteristics, as shown in Fig. 5: his waveform, sonogram or FFT spectrum can be displayed on the screen so that operators can compare his voice and spectral characteristics with those of the input speaker. With this procedure operators can reduce the human search time; especially when the set of suspicious criminals is large, they no longer need to do a full search speaker by speaker, since that work is left to the system.

4. CONCLUSION

A first step toward building a shrinking system has been completed. With this system, operators can easily and quickly find a set of suspicious criminals and further analyze the voice characteristics of those criminals. This gives them more evidence on which to base a correct judgement.

In the next stage we will collect and set up a complete criminal voice database. This will be a huge amount of work, but it is important for finding criminals. Since voice characteristics change with time, the new system should be able to update and easily rebuild the existing database; that is, an adaptive system should be developed. Finally, we hope that a user friendly and reliable system will be available soon.

5. REFERENCES

[1] S. Furui, "Cepstrum analysis technique for automatic speaker verification", IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-29, pp. 254-272, Apr. 1981.

[2] Y. Linde, A. Buzo, and R. M. Gray, "An algorithm for vector quantizer design", IEEE Trans. Commun., vol. COM-28, pp. 84-95, Jan. 1980.


Fig. 3 Recording menu

Fig. 5 Analysis of voice characteristics: waveform and spectrum


