
SPEAKER RECOGNITION
SYSTEM (SRS)
MD. ASAD
asadthomas@gmail.com
RIYA BHADRA
riyabhadra123456@gmail.com
IASNLP-2015, IIIT Hyderabad
Introduction

• Speaker Recognition: the process of automatically recognizing (identifying and verifying) who is speaking on the basis of individual information that exists in speech waves.
Objectives and aims
• To extract, characterize, and recognize information about a speaker's identity.

• To build a robust system that identifies and verifies a speaker accurately.
Automatically extract the information transmitted in the speech signal
Application of speaker recognition

• SR applications include voice dialling, telephone banking, telephone shopping, database access services, information services, voice mail, security control for confidential information areas, and remote access to computers.

• Some systems use "anti-speaker" techniques, such as cohort models.
Development of Speaker Recognition Systems

• The first type of speaker recognition machine, using spectrograms of voices, was invented in the 1960s. It was called voiceprint analysis or visible speech.

• Since the mid-1980s, the field has steadily matured: commercial applications of SR have been increasing, and many companies currently offer this technology.
Speech processing taxonomy
Principles of Speaker Recognition

Two applications:

• Speaker Identification
• Speaker Verification
There exist two types of speaker recognition:

• Text dependent (constrained)
• Text independent (unconstrained)

Text-dependent recognition has better performance for subjects that cooperate, but text-independent recognition is more flexible in that it can be used with non-cooperating individuals.

The task may also be:
• Closed set
• Open set
Speaker Recognition

• Identification or authentication using speaker recognition basically consists of four steps:

1. Voice Recording
2. Feature Extraction
3. Pattern Matching
4. Decision (accept / reject)
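The four steps above can be sketched as a minimal pipeline. This is an illustrative skeleton only: the function names, the energy-based "features", the distance-based "score", and the threshold value are all placeholder assumptions, not the system described in these slides.

```python
import numpy as np

def extract_features(signal):
    # Placeholder feature extractor: log frame energies as a stand-in for MFCCs.
    frames = signal.reshape(-1, 160)              # 10 ms frames at 16 kHz
    return np.log(np.sum(frames ** 2, axis=1) + 1e-10)

def matching_score(features, speaker_model):
    # Placeholder matcher: negative mean distance to a stored template.
    return -np.mean(np.abs(features - speaker_model))

def verify(signal, speaker_model, threshold=-1.0):
    # Step 1 (recording) is assumed done; steps 2-4 follow.
    features = extract_features(signal)           # step 2
    score = matching_score(features, speaker_model)  # step 3
    return score >= threshold                     # step 4: accept / reject

rng = np.random.default_rng(0)
recording = rng.standard_normal(16000)            # 1 s of synthetic "audio"
model = extract_features(rng.standard_normal(16000))
print(verify(recording, model))
```

In a real system the feature extractor and matcher would be MFCC and GMM scoring, as the later slides specify.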
Feature Extraction

• Feature extraction converts the speech waveform to some type of parametric representation. This sub-process is the key part of front-end processing: the extracted parameters effectively replace the raw waveform in all subsequent processing.
• Models used for feature extraction include LPCC, MFCC, etc.
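As an illustration of MFCC-style feature extraction, the sketch below computes coefficients for a single frame in plain NumPy. It is a simplified teaching sketch, not a production front end; the frame length, filter count, and coefficient count are arbitrary choices, and real systems process many overlapping frames.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters equally spaced on the mel scale (mel-frequency warping).
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fbank[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fbank[i - 1, k] = (r - k) / max(r - c, 1)
    return fbank

def mfcc(signal, sr=16000, n_fft=512, n_filters=26, n_ceps=13):
    # Short-term analysis: one Hamming-windowed frame -> power spectrum.
    frame = signal[:n_fft] * np.hamming(n_fft)
    power = np.abs(np.fft.rfft(frame)) ** 2
    # Mel filter-bank energies, then log compression.
    energies = np.log(mel_filterbank(n_filters, n_fft, sr) @ power + 1e-10)
    # DCT-II decorrelates the log energies (the "cepstral" step).
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_filters))
    return dct @ energies

rng = np.random.default_rng(1)
coeffs = mfcc(rng.standard_normal(512))
print(coeffs.shape)   # 13 coefficients for this frame
```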
Pattern Matching

• Pattern matching is the actual comparison of the extracted frames with known speaker models (or templates). This results in a matching score which quantifies the similarity between the voice recording and a known speaker model. Pattern matching is often based on Hidden Markov Models (HMMs), a statistical model which takes into account the underlying variations and temporal changes of the acoustic pattern.
• Models used for pattern matching include VQ, NN, HMM, GMM, etc.
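GMM-based matching scores a sequence of feature frames by their average log-likelihood under each speaker's mixture model, and the highest-scoring speaker wins. A minimal NumPy sketch of that scoring, assuming a diagonal-covariance GMM with toy parameters (real models are trained, e.g. via EM):

```python
import numpy as np

def gmm_log_likelihood(frames, weights, means, variances):
    """Average log-likelihood of feature frames under a diagonal-covariance GMM.

    frames: (T, D); weights: (M,); means, variances: (M, D).
    """
    T, D = frames.shape
    # Per-component Gaussian log-densities for every frame: shape (T, M).
    diff = frames[:, None, :] - means[None, :, :]                # (T, M, D)
    log_norm = -0.5 * (D * np.log(2 * np.pi) + np.sum(np.log(variances), axis=1))
    log_comp = log_norm[None, :] - 0.5 * np.sum(diff ** 2 / variances[None, :, :], axis=2)
    # log sum_m w_m N(x | mu_m, Sigma_m), computed stably per frame.
    weighted = log_comp + np.log(weights)[None, :]
    mx = weighted.max(axis=1, keepdims=True)
    log_px = mx[:, 0] + np.log(np.sum(np.exp(weighted - mx), axis=1))
    return log_px.mean()

# Toy check: frames drawn near one component's mean score far better
# than frames that match neither component.
rng = np.random.default_rng(2)
means = np.array([[0.0, 0.0], [5.0, 5.0]])
variances = np.ones((2, 2))
weights = np.array([0.5, 0.5])
near = rng.standard_normal((50, 2))           # matches component 0
far = rng.standard_normal((50, 2)) + 20.0     # matches neither component
print(gmm_log_likelihood(near, weights, means, variances) >
      gmm_log_likelihood(far, weights, means, variances))
```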
Speaker Recognition

• Database used = TIMIT

• Feature extraction = MFCC
• Pattern matching = GMM
• Tool used = MATLAB
WHY MFCCs?

Mel-frequency Cepstral Coefficients:

• To date, Mel-frequency cepstral coefficients (MFCCs) are the best known and most commonly used features, not only for speech recognition but for speaker recognition as well. The computation of MFCCs is based on short-term analysis and is similar to the computation of cepstral coefficients. The significant difference lies in the use of critical-band filters to realize mel-frequency warping; the critical bandwidths as a function of frequency are based on the human ear's perception.
• A mel is a unit of measure based on the human ear's perceived frequency.
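The standard mapping from frequency f in Hz to mels is mel(f) = 2595 · log10(1 + f/700), under which 1000 Hz corresponds to roughly 1000 mels:

```python
import math

def hz_to_mel(f_hz):
    # Standard mel-scale formula: perceptually spaced frequency axis.
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

print(round(hz_to_mel(1000.0)))   # 1000 Hz maps to about 1000 mels by design
```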
Introduction to GMM

• Gaussian: a characteristic symmetric "bell curve" shape that quickly falls off towards 0 (practically).
• Mixture model: a probabilistic model which assumes the underlying data belong to a mixture distribution.
Why GMM?

• Classification paradigms used in SRS during the past 20 years include VQ, NN, HMM, and GMM, which represent Vector Quantization, Neural Network, Hidden Markov Model, and Gaussian Mixture Model respectively. A continuous ergodic HMM method is superior to a discrete ergodic HMM method, and a continuous ergodic HMM method is as robust as a VQ-based method when enough training data is available. However, when little data is available, the VQ-based method is more robust than a continuous HMM method.
EXPERIMENTAL METHODOLOGY

Dataset Description
• TIMIT Database.
• Total Number of speakers= 98
• Female speakers= 48
• Male Speakers= 50
• Total sentences= 10
• Training Data = 8 sentences per speaker
• Testing Data = 2 sentences per speaker
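The per-speaker 8/2 split above can be expressed directly. The filenames below are hypothetical placeholders (TIMIT file naming varies by release); the point is only the split arithmetic across 98 speakers:

```python
n_speakers = 98   # 48 female + 50 male, as in the dataset description

def split_utterances(utterances, n_train=8):
    # First n_train sentences train the speaker model; the rest are held out.
    return utterances[:n_train], utterances[n_train:]

train, test = split_utterances([f"sent_{i}.wav" for i in range(10)])
print(len(train) * n_speakers, len(test) * n_speakers)  # 784 196
```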
Analysis Tool
• MATLAB
Result
References:

1. Reynolds, D. A. and Rose, R. C. 1995. "Robust Text-Independent Speaker Identification Using Gaussian Mixture Speaker Models", IEEE Trans. on Speech and Audio Processing, vol. 3, no. 1, pp. 72-83.
2. Panda, A. K. and Sahoo, A. K. 2011. Study of Speaker Recognition System. Thesis, NIT Rourkela.
3. Ling Feng, "Speaker Recognition", Kgs. Lyngby, 2004.
Questions?
