
Speaker Recognition

SRT project of Signal Processing

Xinyu Zhou, Yuxin Wu, Tiezheng Li

Department of Computer Science and Technology


Tsinghua University

January 6, 2016



Overview

Speaker Recognition
Identifying the person who is speaking from characteristics of their voice (voice biometrics).

Primary goal:
Accurate and efficient short-utterance speaker recognition.

Additional goals:
Scalability to a large number of speakers.
Low-latency, real-time recognition: a working real-time recognition system.

Xinyu Zhou, Yuxin Wu, Tiezheng Li · Speaker Recognition · January 6, 2016
Content

VAD

VAD

Voice Activity Detection (VAD) is applied to all signals as a pre-filter. We tried two different approaches:

Energy-based:
• Filter out intervals with relatively low energy.
• Works perfectly for high-quality recordings.
• Sensitive to noise.

Long-Term Spectral Divergence (LTSD):
• Compare the long-term spectral envelope with the noise spectrum.
• More robust to noise; used in our GUI.
• Ramírez, Javier, et al., 2004, Efficient voice activity detection algorithms using long-term speech information.
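The energy-based approach can be sketched as follows. This is a minimal illustration, not our exact code; the frame length and threshold ratio are assumed values:

```python
import numpy as np

def energy_vad(signal, rate, frame_ms=20, threshold_ratio=0.1):
    """Keep frames whose short-time energy exceeds a fraction of the mean energy."""
    frame_len = int(rate * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames.astype(np.float64) ** 2).sum(axis=1)
    mask = energy > threshold_ratio * energy.mean()
    return frames[mask].reshape(-1)

# Example: a 1 s sine burst surrounded by silence survives the filtering.
rate = 16000
t = np.arange(rate) / rate
sig = np.concatenate([np.zeros(rate), np.sin(2 * np.pi * 440 * t), np.zeros(rate)])
voiced = energy_vad(sig, rate)
```

As the slide notes, this works well on clean recordings but fails when the noise floor itself carries energy, which is what motivates LTSD.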

VAD

LTSD

Features MFCC

MFCC

Mel-Frequency Cepstral Coefficients


A cepstral feature that closely approximates the human auditory system's
response; a commonly used feature for speech and speaker recognition.

Features MFCC

Windowing
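The windowing step splits the signal into short overlapping frames and applies a window function before the per-frame FFT. A minimal sketch; the frame and hop lengths are illustrative:

```python
import numpy as np

def frame_signal(signal, rate, frame_ms=32, hop_ms=16):
    """Split a signal into overlapping frames and apply a Hamming window,
    as is done before the per-frame FFT in MFCC extraction."""
    frame_len = int(rate * frame_ms / 1000)
    hop = int(rate * hop_ms / 1000)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Index matrix: row i selects samples [i*hop, i*hop + frame_len).
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx] * np.hamming(frame_len)

# 1 s at 16 kHz -> 512-sample (32 ms) frames with a 256-sample (16 ms) hop.
frames = frame_signal(np.random.randn(16000), rate=16000)
```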

Features MFCC

Mel-Scale

Mel(f) = 2595 log10(1 + f / 700)
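The formula above and its inverse as plain helper functions; the inverse is used when placing the mel filterbank edges back on the Hz axis:

```python
import math

def hz_to_mel(f):
    """Mel(f) = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse mapping: f = 700 * (10^(m / 2595) - 1)."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
```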

Features LPC

LPC

Linear Predictive Coding/Coefficients

Assumption
In a short period, the nth sample is a linear combination of the previous p samples:

x̂(n) = ∑_{i=1}^{p} a_i x(n − i)

Minimize the squared error E[(x̂(n) − x(n))²] using the Levinson-Durbin algorithm.
Use a_1, …, a_p as features.
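The autocorrelation-method LPC with the Levinson-Durbin recursion can be sketched as follows. This is a textbook implementation, not our exact code:

```python
import numpy as np

def levinson_durbin(r, order):
    """Solve the LPC normal equations given autocorrelations r[0..order]."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                      # reflection coefficient
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)                # remaining prediction error
    return a, err

def lpc_features(frame, order):
    """LPC features a_1..a_p for one windowed frame (autocorrelation method)."""
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    a, _ = levinson_durbin(r[:order + 1], order)
    return -a[1:]   # so that x̂(n) = sum_i a_i x(n - i)
```

For an ideal AR(1) autocorrelation r[k] = 0.5^k, the recursion recovers a_1 = 0.5 and a zero second coefficient, which is a convenient sanity check.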

Features Experiments on Features

MFCC Params

[Figures: accuracy vs. number of cepstrals (12 to 26), number of filters (20 to 55), and frame length (20 to 40 ms)]

Best parameters in our case:
Number of cepstrals: 15
Number of filters: 55
Frame length: 32 ms
Features Experiments on Features

LPC Params

[Figures: accuracy vs. frame length (20 to 40 ms) and number of coefficients (10 to 26)]

Best parameters in our case:
Number of coefficients: 23
Frame length: 32 ms

Model GMM-UBM

GMM
Gaussian Mixture Model is commonly used to model human acoustic features.

p(x) = ∑_{i=1}^{K} w_i N(x; μ_i, Σ_i)

We use K = 32.
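Per-speaker GMM identification can be sketched with scikit-learn's current `GaussianMixture` API (the `sklearn.mixture.GMM` class of that era has since been removed). The speaker names and synthetic 13-dimensional "MFCC" features below are illustrative only:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical enrollment data: rows are per-frame feature vectors.
rng = np.random.RandomState(0)
features = {
    "alice": rng.randn(1000, 13) + 2.0,
    "bob":   rng.randn(1000, 13) - 2.0,
}

# One K-component diagonal-covariance GMM per enrolled speaker.
models = {name: GaussianMixture(n_components=32, covariance_type='diag',
                                random_state=0).fit(x)
          for name, x in features.items()}

def identify(test_frames):
    """Pick the speaker whose GMM gives the highest average log-likelihood."""
    return max(models, key=lambda name: models[name].score(test_frames))

test = rng.randn(50, 13) + 2.0   # frames drawn near "alice"
```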
Model GMM-UBM

GMM (cont.)

[Figure: accuracy vs. number of Gaussian mixtures (0 to 160); we use K = 32]
Model GMM-UBM

Optimized GMM
Basic GMM training: random initialization, then parameter estimation with EM.
Improvement: initialization with a parallel k-means|| seeding.
Improvement: parallel training implementation in C++.
Compared to the GMM from scikit-learn:

Arthur, David and Vassilvitskii, Sergei, 2007, k-means++: The advantages of careful seeding.

Bahmani et al., 2012, Scalable k-means++.
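The seeding idea can be illustrated with plain k-means++, the sequential core that k-means|| parallelizes. A sketch only, not our parallel C++ implementation:

```python
import numpy as np

def kmeans_pp_seeds(x, k, rng):
    """k-means++ seeding (Arthur & Vassilvitskii, 2007): each new center is
    sampled with probability proportional to its squared distance to the
    nearest center chosen so far."""
    centers = [x[rng.randint(len(x))]]
    for _ in range(k - 1):
        # Squared distance of every point to its nearest existing center.
        d2 = np.min([((x - c) ** 2).sum(axis=1) for c in centers], axis=0)
        probs = d2 / d2.sum()
        centers.append(x[rng.choice(len(x), p=probs)])
    return np.array(centers)
```

Because far-away points are preferred, the seeds spread across the data, which gives EM a much better starting point than uniform random initialization.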
Model GMM-UBM

Optimized GMM (cont.)

[Figures: classification accuracy, our GMM vs. the GMM from scikit-learn; training time for ten EM iterations vs. number of MFCC features (0 to 16000), for our GMM at concurrency 1, 2, 4, 8, and 16 and for the scikit-learn GMM]


Model GMM-UBM

UBM

Universal Background Model: a GMM trained on a giant dataset.

A UBM can be used to:
Describe the general acoustic features of human speech.
Reject the decision of a speaker GMM (likelihood-ratio test).
Train adapted speaker GMMs (MAP adaptation).

Reynolds, Douglas, et al., 2000, Speaker verification using adapted Gaussian mixture models.
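UBM-based rejection reduces to a log-likelihood-ratio test. A sketch with synthetic data; for simplicity the speaker model here is trained directly, whereas Reynolds-style MAP adaptation from the UBM is the usual refinement:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical data: pooled background speech and one enrolled speaker.
rng = np.random.RandomState(1)
background = rng.randn(2000, 13)          # "everyone"
speaker = rng.randn(500, 13) + 3.0        # the enrolled speaker

ubm = GaussianMixture(n_components=16, covariance_type='diag',
                      random_state=0).fit(background)
spk = GaussianMixture(n_components=16, covariance_type='diag',
                      random_state=0).fit(speaker)

def accept(test_frames, threshold=0.0):
    """Accept iff the speaker model explains the frames better than the UBM."""
    llr = spk.score(test_frames) - ubm.score(test_frames)
    return llr > threshold

genuine = rng.randn(50, 13) + 3.0    # frames from the enrolled speaker
impostor = rng.randn(50, 13)         # frames from the background population
```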


Model GMM-UBM

GMM Results

[Figures: accuracy vs. number of speakers (0 to 80) for reading, whisper, and spontaneous speech, with 2 s, 3 s, 4 s, and 5 s test utterances]

Training duration: 20 s
Randomly selected test utterances: 50
Each value in the graphs is an average of 20 independent experiments.
Model CRBM

CRBM
Restricted Boltzmann Machine: a generative stochastic two-layer neural network.
Continuous RBM extends the RBM to real-valued inputs.
An RBM can reconstruct a layer similar to its input layer; the difference between
the two can be used to measure how well an input fits the model.
Therefore, an RBM can serve as a substitute for a GMM.

Chen, Hsin and Murray, Alan F., 2003, Continuous restricted Boltzmann machine with an implementable training algorithm.
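Reconstruction-error scoring can be illustrated with a minimal real-valued-visible RBM trained by mean-field CD-1. This is a sketch of the idea only; the CRBM of Chen and Murray differs in its units and training rule, and all sizes here are assumptions:

```python
import numpy as np

class TinyRBM:
    """Minimal Gaussian-visible / Bernoulli-hidden RBM, mean-field CD-1."""
    def __init__(self, n_vis, n_hid, rng):
        self.w = 0.01 * rng.randn(n_vis, n_hid)
        self.bv = np.zeros(n_vis)
        self.bh = np.zeros(n_hid)

    @staticmethod
    def _sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def fit(self, data, epochs=100, lr=0.01):
        n = len(data)
        for _ in range(epochs):
            h = self._sigmoid(data @ self.w + self.bh)    # hidden activations
            v1 = h @ self.w.T + self.bv                   # mean-field reconstruction
            h1 = self._sigmoid(v1 @ self.w + self.bh)
            self.w += lr * (data.T @ h - v1.T @ h1) / n   # CD-1 gradient estimate
            self.bv += lr * (data - v1).mean(axis=0)
            self.bh += lr * (h - h1).mean(axis=0)
        return self

    def reconstruction_error(self, x):
        h = self._sigmoid(x @ self.w + self.bh)
        v1 = h @ self.w.T + self.bv
        return ((x - v1) ** 2).sum(axis=1).mean()

# One RBM per enrolled speaker; identify by lowest reconstruction error.
rng = np.random.RandomState(0)
spk_a = rng.randn(200, 13) + 2.0   # hypothetical MFCC frames, speaker A
spk_b = rng.randn(200, 13) - 2.0   # speaker B
rbm_a = TinyRBM(13, 8, np.random.RandomState(1)).fit(spk_a)
rbm_b = TinyRBM(13, 8, np.random.RandomState(2)).fit(spk_b)
probe = rng.randn(30, 13) + 2.0    # frames from speaker A
scores = {"A": rbm_a.reconstruction_error(probe),
          "B": rbm_b.reconstruction_error(probe)}
```

The model that fits the input reconstructs it with smaller error, which is exactly the fitness measure the slide describes.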




Model CRBM

RBM Results
Results of CRBM, tested with 5 seconds of utterance.

[Figure: accuracy vs. number of speakers (0 to 50) with 30 s, 60 s, and 120 s of training data]

Model CRBM

GUI Demo

Model CRBM

Conclusion

We implemented a faster GMM, with better accuracy as well.
Accuracy holds even with short training and testing utterances.
Our system is highly accurate and can respond in near real time:
97% accuracy for 20~30 speakers, 95% for 70~80 speakers.

Model CRBM

Thanks!

