
Speaker Recognition

SRT project of Signal Processing

Xinyu Zhou, Yuxin Wu, Tiezheng Li

Department of Computer Science and Technology


Tsinghua University

January 6, 2016



Overview

Speaker Recognition
Identifying the person who is speaking from characteristics of their voice (voice biometrics).

Primary goal:
Accurate and efficient short-utterance speaker recognition.

Additional goals:
Scalability to a large number of speakers.
Low-latency, real-time recognition: a working real-time recognition system.

Xinyu Zhou, Yuxin Wu, Tiezheng Li · Speaker Recognition · January 6, 2016
Content

VAD

VAD

Voice Activity Detection (VAD) is applied to all signals as a pre-filter. We tried two different approaches:

Energy-based:
• Filter out intervals with relatively low energy.
• Works perfectly for high-quality recordings.
• Sensitive to noise.

Long-Term Spectral Divergence (LTSD):
• Compare the long-term spectral envelope with the noise spectrum.
• More robust to noise; used in our GUI.
• Ramírez, Javier, et al., 2004, Efficient voice activity detection algorithms using long-term speech information.
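The energy-based approach can be sketched as follows. This is a minimal illustration, not our exact code; the frame length and threshold ratio are assumed values:

```python
import numpy as np

def energy_vad(signal, rate, frame_ms=20, threshold_ratio=0.1):
    """Keep frames whose short-time energy exceeds a fraction of the mean energy."""
    frame_len = int(rate * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames.astype(np.float64) ** 2).sum(axis=1)
    mask = energy > threshold_ratio * energy.mean()
    return frames[mask].reshape(-1)

# Example: a 1 s sine burst surrounded by silence survives the filtering.
rate = 16000
t = np.arange(rate) / rate
sig = np.concatenate([np.zeros(rate), np.sin(2 * np.pi * 440 * t), np.zeros(rate)])
voiced = energy_vad(sig, rate)
```

As the slide notes, this works well on clean recordings but fails when the noise floor itself carries energy, which is what motivates LTSD.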

VAD

LTSD

Features MFCC

MFCC

Mel-Frequency Cepstral Coefficients


A cepstral feature that closely approximates the human auditory system's
response; a commonly used feature for speech and speaker recognition.

Features MFCC

Windowing
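The windowing step splits the signal into short overlapping frames and applies a window function before the per-frame FFT. A minimal sketch; the frame and hop lengths are illustrative:

```python
import numpy as np

def frame_signal(signal, rate, frame_ms=32, hop_ms=16):
    """Split a signal into overlapping frames and apply a Hamming window,
    as is done before the per-frame FFT in MFCC extraction."""
    frame_len = int(rate * frame_ms / 1000)
    hop = int(rate * hop_ms / 1000)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Index matrix: row i selects samples [i*hop, i*hop + frame_len).
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx] * np.hamming(frame_len)

# 1 s at 16 kHz -> 512-sample (32 ms) frames with a 256-sample (16 ms) hop.
frames = frame_signal(np.random.randn(16000), rate=16000)
```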

Features MFCC

Mel-Scale

Mel(f) = 2595 log10(1 + f / 700)
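The formula above and its inverse as plain helper functions; the inverse is used when placing the mel filterbank edges back on the Hz axis:

```python
import math

def hz_to_mel(f):
    """Mel(f) = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse mapping: f = 700 * (10^(m / 2595) - 1)."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
```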

Features LPC

LPC

Linear Predictive Coding/Coefficients

Assumption
In a short period, the nth sample is a linear combination of the previous p samples:

x̂(n) = ∑_{i=1}^{p} a_i x(n − i)

Minimize the squared error E[(x̂(n) − x(n))²] using the Levinson-Durbin algorithm.
Use a_1, …, a_p as features.
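The autocorrelation-method LPC with the Levinson-Durbin recursion can be sketched as follows. This is a textbook implementation, not our exact code:

```python
import numpy as np

def levinson_durbin(r, order):
    """Solve the LPC normal equations given autocorrelations r[0..order]."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                      # reflection coefficient
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)                # remaining prediction error
    return a, err

def lpc_features(frame, order):
    """LPC features a_1..a_p for one windowed frame (autocorrelation method)."""
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    a, _ = levinson_durbin(r[:order + 1], order)
    return -a[1:]   # so that x̂(n) = sum_i a_i x(n - i)
```

For an ideal AR(1) autocorrelation r[k] = 0.5^k, the recursion recovers a_1 = 0.5 and a zero second coefficient, which is a convenient sanity check.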

Features Experiments on Features

MFCC Params

[Figures: accuracy vs. number of cepstrals (12 to 26), number of filters (20 to 55), and frame length (20 to 40 ms)]

Best parameters in our case:
Number of cepstrals: 15
Number of filters: 55
Frame length: 32 ms
Features Experiments on Features

LPC Params

[Figures: accuracy vs. frame length (20 to 40 ms) and number of coefficients (10 to 26)]

Best parameters in our case:
Number of coefficients: 23
Frame length: 32 ms

Model GMM-UBM

GMM
Gaussian Mixture Model is commonly used to model human acoustic features.

p(x) = ∑_{i=1}^{K} w_i N(x; μ_i, Σ_i)

We use K = 32.
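Per-speaker GMM identification can be sketched with scikit-learn's current `GaussianMixture` API (the `sklearn.mixture.GMM` class of that era has since been removed). The speaker names and synthetic 13-dimensional "MFCC" features below are illustrative only:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical enrollment data: rows are per-frame feature vectors.
rng = np.random.RandomState(0)
features = {
    "alice": rng.randn(1000, 13) + 2.0,
    "bob":   rng.randn(1000, 13) - 2.0,
}

# One K-component diagonal-covariance GMM per enrolled speaker.
models = {name: GaussianMixture(n_components=32, covariance_type='diag',
                                random_state=0).fit(x)
          for name, x in features.items()}

def identify(test_frames):
    """Pick the speaker whose GMM gives the highest average log-likelihood."""
    return max(models, key=lambda name: models[name].score(test_frames))

test = rng.randn(50, 13) + 2.0   # frames drawn near "alice"
```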
Model GMM-UBM

GMM (cont.)

[Figure: accuracy vs. number of Gaussian mixtures (0 to 160); we use K = 32]
Model GMM-UBM

Optimized GMM
Basic GMM training: random initialization, then parameter estimation with EM.
Improvement: initialization with a parallel k-means|| seeding.
Improvement: parallel training implementation in C++.
Compared to the GMM from scikit-learn:

Arthur, David and Vassilvitskii, Sergei, 2007, k-means++: The advantages of careful seeding.

Bahmani et al., 2012, Scalable k-means++.
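The seeding idea can be illustrated with plain k-means++, the sequential core that k-means|| parallelizes. A sketch only, not our parallel C++ implementation:

```python
import numpy as np

def kmeans_pp_seeds(x, k, rng):
    """k-means++ seeding (Arthur & Vassilvitskii, 2007): each new center is
    sampled with probability proportional to its squared distance to the
    nearest center chosen so far."""
    centers = [x[rng.randint(len(x))]]
    for _ in range(k - 1):
        # Squared distance of every point to its nearest existing center.
        d2 = np.min([((x - c) ** 2).sum(axis=1) for c in centers], axis=0)
        probs = d2 / d2.sum()
        centers.append(x[rng.choice(len(x), p=probs)])
    return np.array(centers)
```

Because far-away points are preferred, the seeds spread across the data, which gives EM a much better starting point than uniform random initialization.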
Model GMM-UBM

Optimized GMM (cont.)

[Figures: classification accuracy, our GMM vs. the GMM from scikit-learn; training time for ten EM iterations vs. number of MFCC features (0 to 16000), for our GMM at concurrency 1, 2, 4, 8, and 16 and for the scikit-learn GMM]


Model GMM-UBM

UBM

Universal Background Model: a GMM trained on a giant dataset.

A UBM can be used to:
Describe the general acoustic features of human speech.
Reject the decision of a speaker GMM (likelihood-ratio test).
Train adapted speaker GMMs (MAP adaptation).

Reynolds, Douglas, et al., 2000, Speaker verification using adapted Gaussian mixture models.
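UBM-based rejection reduces to a log-likelihood-ratio test. A sketch with synthetic data; for simplicity the speaker model here is trained directly, whereas Reynolds-style MAP adaptation from the UBM is the usual refinement:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical data: pooled background speech and one enrolled speaker.
rng = np.random.RandomState(1)
background = rng.randn(2000, 13)          # "everyone"
speaker = rng.randn(500, 13) + 3.0        # the enrolled speaker

ubm = GaussianMixture(n_components=16, covariance_type='diag',
                      random_state=0).fit(background)
spk = GaussianMixture(n_components=16, covariance_type='diag',
                      random_state=0).fit(speaker)

def accept(test_frames, threshold=0.0):
    """Accept iff the speaker model explains the frames better than the UBM."""
    llr = spk.score(test_frames) - ubm.score(test_frames)
    return llr > threshold

genuine = rng.randn(50, 13) + 3.0    # frames from the enrolled speaker
impostor = rng.randn(50, 13)         # frames from the background population
```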


Model GMM-UBM

GMM Results

[Figures: accuracy vs. number of speakers (0 to 80) for reading, whisper, and spontaneous speech, with 2 s, 3 s, 4 s, and 5 s test utterances]

Training duration: 20 s
Randomly selected test utterances: 50
Each value in the graphs is an average of 20 independent experiments.
Model CRBM

CRBM
Restricted Boltzmann Machine: a generative stochastic two-layer neural network.
Continuous RBM extends the RBM to real-valued inputs.
An RBM can reconstruct a layer similar to its input layer; the difference between
the two can be used to measure how well an input fits the model.
Therefore, an RBM can serve as a substitute for a GMM.

Chen, Hsin and Murray, Alan F., 2003, Continuous restricted Boltzmann machine with an implementable training algorithm.
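Reconstruction-error scoring can be illustrated with a minimal real-valued-visible RBM trained by mean-field CD-1. This is a sketch of the idea only; the CRBM of Chen and Murray differs in its units and training rule, and all sizes here are assumptions:

```python
import numpy as np

class TinyRBM:
    """Minimal Gaussian-visible / Bernoulli-hidden RBM, mean-field CD-1."""
    def __init__(self, n_vis, n_hid, rng):
        self.w = 0.01 * rng.randn(n_vis, n_hid)
        self.bv = np.zeros(n_vis)
        self.bh = np.zeros(n_hid)

    @staticmethod
    def _sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def fit(self, data, epochs=100, lr=0.01):
        n = len(data)
        for _ in range(epochs):
            h = self._sigmoid(data @ self.w + self.bh)    # hidden activations
            v1 = h @ self.w.T + self.bv                   # mean-field reconstruction
            h1 = self._sigmoid(v1 @ self.w + self.bh)
            self.w += lr * (data.T @ h - v1.T @ h1) / n   # CD-1 gradient estimate
            self.bv += lr * (data - v1).mean(axis=0)
            self.bh += lr * (h - h1).mean(axis=0)
        return self

    def reconstruction_error(self, x):
        h = self._sigmoid(x @ self.w + self.bh)
        v1 = h @ self.w.T + self.bv
        return ((x - v1) ** 2).sum(axis=1).mean()

# One RBM per enrolled speaker; identify by lowest reconstruction error.
rng = np.random.RandomState(0)
spk_a = rng.randn(200, 13) + 2.0   # hypothetical MFCC frames, speaker A
spk_b = rng.randn(200, 13) - 2.0   # speaker B
rbm_a = TinyRBM(13, 8, np.random.RandomState(1)).fit(spk_a)
rbm_b = TinyRBM(13, 8, np.random.RandomState(2)).fit(spk_b)
probe = rng.randn(30, 13) + 2.0    # frames from speaker A
scores = {"A": rbm_a.reconstruction_error(probe),
          "B": rbm_b.reconstruction_error(probe)}
```

The model that fits the input reconstructs it with smaller error, which is exactly the fitness measure the slide describes.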




Model CRBM

RBM Results
Results of CRBM, tested with 5 seconds of utterance.

[Figure: accuracy vs. number of speakers (0 to 50) with 30 s, 60 s, and 120 s of training data]

Model CRBM

GUI Demo

Model CRBM

Conclusion

We implemented a faster GMM, with better accuracy as well.
Accuracy holds even with short training and testing utterances.
Our system is highly accurate and can respond in near real time:
97% accuracy for 20~30 speakers, 95% for 70~80 speakers.

Model CRBM

Thanks!

