Authorized licensed use limited to: Somaiya University. Downloaded on September 29,2023 at 15:28:08 UTC from IEEE Xplore. Restrictions apply.
distance computation, candidates display and final judgement. Below we explain each step's function and the method used to implement it.

2.1 Recording environment

The speaker's voice may be obtained from the internet, the telephone, tape, a microphone or other storage devices. According to its type, the voice is either analog or digital. An analog voice must be digitized before it is stored in the computer. If the analog voice comes from the telephone, a telephone interface must be open to receive it. In our system, we use a telephone interface from Dialogic, with an 8 kHz sampling rate and an 8-bit PCM encoder, to receive the analog voice. Other analog voices, such as cassette tape, are recorded and digitized by a built-in Sound Blaster. After digitization, all data are converted into a uniform data type, the 16-bit linear PCM ".wav" format with a sampling rate of 8 kHz.

2.2 High pass filter

Some of the incoming digitized voice may contain a DC component. To remove this component, a 60 Hz high pass filter is used.

2.3 End point detection

Since the incoming voice contains some silence before the true speech begins, it is necessary to remove this part, both to reduce the computation load and to limit the side effect that the non-speech part has on the true speech part. For this purpose, we use two features, zero crossing rate and normalized energy, to detect the true speech. To avoid discarding too much of the true speech, we did not set a rigid bound.

2.4 Bias removal

The incoming speech may come from different recording environments. To reduce the effect of the recording environment on the final performance, a simple but efficient method is used. We take the average of the cepstrum (derived from the log-spectrum) over the whole incoming speech and then subtract it from the whole speech's cepstrum. This simple method has proved to be efficient [1].

2.5 Feature extraction

The key to developing an efficient speaker authentication system is feature extraction. It is not easy to find an efficient feature directly from the time-domain waveform. Suprasegmental features, such as pitch, energy and zero crossing rate, can be used to improve performance, but the improvement is limited, and in most speaker authentication systems they serve only as auxiliary features. For this reason, segmental information, that is, the spectrum, is taken as the feature for this shrinking system. LPC (Linear Predictive Coding), LAR (Log Area Ratio), LSP (Line Spectrum Pair) and cepstrum are all good spectral features. In this paper, we choose the LPC-cepstrum as our system feature.
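The preprocessing steps of Sections 2.2 to 2.4 can be illustrated with a minimal Python sketch. It is not the implementation used in the paper: the first-order DC-blocking filter, the frame length of 160 samples, the peak normalization of the energy, the threshold values and all function names are assumptions chosen only to make the example concrete.

```python
import math

def remove_dc(samples, cutoff_hz=60.0, sample_rate=8000.0):
    """First-order DC-blocking high-pass: y[n] = x[n] - x[n-1] + a * y[n-1]."""
    # Approximate pole position for the requested cutoff (assumed filter design).
    a = 1.0 - 2.0 * math.pi * cutoff_hz / sample_rate
    out, prev_x, prev_y = [], 0.0, 0.0
    for x in samples:
        prev_y = x - prev_x + a * prev_y
        prev_x = x
        out.append(prev_y)
    return out

def frame_features(samples, frame_len=160):
    """Return (normalized_energy, zero_crossing_rate) for each non-overlapping frame."""
    feats = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        crossings = sum(1 for p, q in zip(frame, frame[1:]) if (p >= 0) != (q >= 0))
        feats.append((energy, crossings / (frame_len - 1)))
    # Normalize energy by the largest frame energy (an assumed normalization).
    peak = max((e for e, _ in feats), default=0.0) or 1.0
    return [(e / peak, z) for e, z in feats]

def detect_endpoints(feats, energy_thresh=0.05, zcr_thresh=0.25):
    """Return (first, last) indices of frames judged to be speech, or None."""
    speech = [
        i for i, (e, z) in enumerate(feats)
        # High energy, or moderate energy with many zero crossings (unvoiced sounds).
        if e >= energy_thresh or (e >= 0.2 * energy_thresh and z >= zcr_thresh)
    ]
    return (speech[0], speech[-1]) if speech else None

def subtract_cepstral_mean(cepstra):
    """Subtract the utterance-level mean vector from every cepstral frame."""
    n, dim = len(cepstra), len(cepstra[0])
    mean = [sum(frame[k] for frame in cepstra) / n for k in range(dim)]
    return [[frame[k] - mean[k] for k in range(dim)] for frame in cepstra]
```

The thresholds are deliberately loose placeholders, in line with the remark that no rigid bound is set; in practice they would be tuned on recorded material.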
Fig. 2 Training algorithm
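As a rough illustration of codebook training, the following Python sketch derives a VQ codebook with an LBG-style split-and-refine procedure, the family of algorithm the paper cites as [2]. It is a sketch under stated assumptions, not the authors' training procedure: the squared Euclidean distortion, the additive splitting offset `eps`, the fixed iteration count and all function names are hypothetical choices.

```python
def sq_dist(a, b):
    """Squared Euclidean distance, used here as the distortion measure (assumed)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(vec, codebook):
    """Index of the codeword with minimum distortion to vec (the encoding step)."""
    return min(range(len(codebook)), key=lambda j: sq_dist(vec, codebook[j]))

def total_distortion(vectors, codebook):
    """Sum of the distortions between each vector and its nearest codeword."""
    return sum(sq_dist(v, codebook[nearest(v, codebook)]) for v in vectors)

def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(len(vectors[0]))]

def train_codebook(vectors, size, eps=0.01, iters=20):
    """LBG-style training; size is assumed to be a power of two."""
    codebook = [centroid(vectors)]
    while len(codebook) < size:
        # Split every codeword into two slightly perturbed copies.
        codebook = [[c + s for c in cw] for cw in codebook for s in (eps, -eps)]
        for _ in range(iters):
            # Assign each training vector to its nearest codeword.
            cells = [[] for _ in codebook]
            for v in vectors:
                cells[nearest(v, codebook)].append(v)
            # Move each codeword to the centroid of its cell; keep it if the cell is empty.
            codebook = [centroid(cell) if cell else cw
                        for cell, cw in zip(cells, codebook)]
    return codebook
```

Under these assumptions, one codebook per speaker would be obtained by calling `train_codebook` on that speaker's feature vectors, for example `codebooks = {name: train_codebook(feats, 64) for name, feats in training_data.items()}`, where `training_data` and the codebook size 64 are hypothetical.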
template model. However, the main purpose of this system is to quickly shrink the set of suspected criminals and let the experts make the final judgement. Thus an efficient method with low computation time is used for this purpose. Besides, we should also consider that the incoming speech material may be any kind of text combination. The system is not a text-dependent system or a limited-domain material system, so this must be taken into account when choosing the method used to create speaker models. Based on the above considerations, we discard the probabilistic model and choose the VQ (Vector Quantization) model as our system model. In the following, we introduce the VQ training algorithm and how to train the speaker models with it.

a distortion measurement between $x(i)$ and $\hat{x}(i)$. The LBG algorithm [2] can be used for the derivation of codebook $C$. Encoding a test vector $y(i)$ means finding a codeword $\hat{y}(i)$ from codebook $C$ such that $d(y(i), \hat{y}(i))$ is minimum. Based on the above algorithm, each speaker obtains his own codebook.

[VQ recognition algorithm]

Given a sequence of feature vectors $X = \{x(1), x(2), \ldots, x(T)\}$, each element of $X$ is a feature vector obtained by signal analysis. The objective of the source coder is to find a reproduction
codebook such that the distortion between the training vectors $X$ and the reproduction codebook is minimum. The total distortion for $X$ is obtained by the following equation.

[Beam search algorithm]

To search quickly for the right person, we set up a computation beam that admits only the speakers with smaller distances. The distances between the input speech and the speakers in the computation beam are computed simultaneously, frame by frame. After computing several frames, we keep only M speakers in the computation beam. This step is repeated until the end of the input speech. At each step, some speakers with larger distances drop out of the computation beam, so the beam width decreases over time. This procedure saves search time, but the width should not be reduced too much; otherwise, the target speaker will be lost during the search.

[Final judgement]

After the distances are computed, the system lists the top N speakers whose distances are the smallest. The value N is chosen by the operators. The distances are also shown to the operators, who can then display the speakers' personal information. Operators can also use the distance score to decide which speakers' personal data to display from the speaker corpus pool.

then use these scores as an index to extract the related information for those speakers. Typical information is shown in Fig. 4. Here we list only some of the information for a speaker, such as his age, education, address and picture. Operators can define other information that is useful to them. They can also bring up the speaker's voice characteristics, as shown in Fig. 5. His waveform, sonogram or FFT spectrum can be displayed on the screen to let operators compare his voice and spectrum characteristics with the input speaker's voice. By the above procedure, operators can reduce the human search time. In particular, when the number of suspects is large, they do not need to do a full search speaker by speaker; that is left to the system.

4. CONCLUSION

A first step toward building a shrinking system has been completed. With this system, operators can easily and quickly find the set of suspected criminals and further analyze the voice characteristics of those suspects. This gives them more evidence for making a correct judgement.

In the next stage, we will collect voices and set up a complete criminal database. This will be a huge task, but it is important for finding criminals. Since voice characteristics change with time, the new system should be able to update and easily rebuild the existing database. That is, an adaptive system should be developed. Finally, we hope that a user-friendly and reliable system will be available soon.
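The [Beam search algorithm] and [Final judgement] steps can be sketched in Python as follows. This is only an illustration under stated assumptions, not the system's actual code: the squared Euclidean distortion, the pruning interval `prune_every`, the `shrink` factor that narrows the beam, and the names `beam_search`, `min_width` (playing the role of M) and `top_n` (playing the role of N) are all hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance, used here as the distortion measure (assumed)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def frame_distortion(vec, codebook):
    """Distortion between one input frame and its nearest codeword."""
    return min(sq_dist(vec, cw) for cw in codebook)

def beam_search(frames, codebooks, min_width, prune_every=10, shrink=0.5, top_n=3):
    """Rank speakers by accumulated VQ distortion, pruning the beam as frames arrive.

    codebooks maps a speaker id to that speaker's list of codewords.
    Returns up to top_n (speaker_id, accumulated_distortion) pairs, smallest first.
    """
    scores = {spk: 0.0 for spk in codebooks}
    for t, vec in enumerate(frames, start=1):
        # Update every speaker still in the beam with the current frame.
        for spk in scores:
            scores[spk] += frame_distortion(vec, codebooks[spk])
        # Every few frames, drop the speakers with the largest distances,
        # but never narrow the beam below min_width.
        if t % prune_every == 0 and len(scores) > min_width:
            width = max(min_width, int(len(scores) * shrink))
            keep = sorted(scores, key=scores.get)[:width]
            scores = {spk: scores[spk] for spk in keep}
    ranked = sorted(scores.items(), key=lambda item: item[1])
    return ranked[:top_n]
```

Keeping `min_width` comfortably larger than `top_n` reflects the caution in the text: a beam that narrows too aggressively may discard the target speaker before enough frames have been seen.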
Fig. 3 Recording menu
Fig. 5 Analysis of voice characteristics: waveform and spectrum