
[Figure: block diagram showing input speech passing through feature extraction (MFCC with delta and delta-delta coefficients, concatenated) into a classifier, with a testing path leading to speaker recognition]
Figure 2.2 Stages of ASR using Voice Modality



2.1.1 Pre-processing voice signal and feature extraction

The input speech signal needs to be pre-processed to enable extraction of useful speaker-related information. Pre-processing the speech signal is carried out as follows [15]:

i. Analog to digital conversion
ii. Pre-emphasis
iii. Windowing
iv. Noise removal

Voice, or speech, is an analog signal, so the first step in pre-processing an audio signal is to digitize it: converting the signal to a digital format is a prerequisite for extracting speaker-relevant features. Analog to digital conversion of the audio signal allows sampling and converting the signal into a discrete space. Features are then extracted from the audio signal represented in terms of speech parameters. The signal needs to be pre-processed and then analysed further.

An audio signal has voiced and unvoiced segments. Vowels, which are the voiced component, have more energy at lower frequencies than at higher frequencies. This spectral tilt needs to be compensated so that the information at higher frequencies is available for preparing the acoustic model. Pre-emphasis boosts the energy in the higher frequency components and is achieved by using filters.

Over the years, diverse methods to extract differentiated features from audio signals have been proposed. Feature extraction techniques are based on human auditory processing and perception [16], and the uniqueness of the extracted information is determined by the length of the observation window [17]. The shape of the human vocal tract determines human speech production; pitch, speaking rate and accent are some of the several speaker-dependent features. Voice-based features can be characterised as acoustic, linguistic and psycholinguistic features. Each carries speaker personality traits and can be extracted and processed for speaker recognition [18]. The acoustic-phonetic approach to feature measurement usually follows the spectral representation of the characteristics of the time-varying speech signal. Commonly used spectral analysis techniques are

i) the class of filter bank methods
ii) the class of linear predictive coding methods.

The perceptual and significant areas of the spectrum derived from an audio signal are the basis of MFCCs [19]-[22]. Just as the human ear responds to frequency variation, MFCC features mimic the same behaviour. The substantial phonetic characteristics of the speech signal are captured with an arrangement of filters: the filters are linearly spaced at lower frequencies, while for higher frequencies they are spaced logarithmically for significant feature extraction [23],[24]. The lower-order cepstral coefficients carry the relevant details of the shape of the spectrum, representing a source-filter transfer function. Speaker-specific characteristics can also be derived from the higher-order frequency components, as they too carry sufficient spectrum information [25]. Pre-processing of the raw audio signal is at times imperative to improve the quality of the signal by boosting the signal-to-noise ratio: ambient noise adversely affects the speech signal and makes MFCC features unreliable, making speech signal processing imperative [26],[27]. Other popular speech features can be derived from Linear Prediction
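The filter arrangement described above, near-linear spacing at low frequencies and logarithmic spacing at high frequencies, can be sketched by placing filter centres equally on the mel scale. The sketch below uses the common O'Shaughnessy mel formula; the filter count and frequency range are illustrative assumptions, not the settings used in this work.

```python
import numpy as np

def hz_to_mel(f):
    """Convert frequency in Hz to the mel scale (O'Shaughnessy formula)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_centres(n_filters=26, f_min=0.0, f_max=8000.0):
    """Centre frequencies of a mel filter bank: equally spaced on the
    mel scale, hence nearly linear below ~1 kHz and logarithmic above."""
    mels = np.linspace(hz_to_mel(f_min), hz_to_mel(f_max), n_filters + 2)
    return mel_to_hz(mels)[1:-1]  # drop the two band-edge points

centres = mel_filter_centres()
gaps = np.diff(centres)
print(gaps[0] < gaps[-1])  # True: low-frequency filters are packed more densely
```

The widening gaps between adjacent centres reflect exactly the dense low-frequency, sparse high-frequency arrangement that MFCC extraction relies on.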

Coefficients (LPC). The frequencies comprising the signal are pictorially represented in the audio waveform, and such representations are increasingly used as features in audio and image classification. These features, when supplemented with traditional MFCC or LPC features, have shown significant improvement in speech classification performance [33]. Constant-Q Cepstral Coefficients (CQCCs), extracted from each audio frame, are also often investigated; the geometrical spacing of the frequency bins ensures a constant Q, the ratio of the centre frequency of a band to its width. Another source-filter model, known as homomorphic filtering, together with cepstral analysis, is often employed to extract Complex Cepstral Coefficients (CCCs). From an audio signal the Fast Fourier Transform (FFT) is computed; taking the logarithm of the derived signal, the coefficients are finally made available by taking its inverse Fourier transform.
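The FFT, logarithm and inverse-transform chain just described can be sketched for the real cepstrum, a simplification of the complex cepstrum that uses only the log magnitude; the synthetic frame below is an assumed example, not data from this work.

```python
import numpy as np

def real_cepstrum(frame):
    """Cepstral analysis as outlined above: FFT -> log magnitude -> inverse FFT.
    Convolution of source and filter in time becomes a sum in the cepstral domain."""
    spectrum = np.fft.rfft(frame)
    log_mag = np.log(np.abs(spectrum) + 1e-12)  # small floor avoids log(0)
    return np.fft.irfft(log_mag)

# Toy frame: a decaying "filter" response excited by a tone ("source")
sr = 8000
t = np.arange(1024) / sr
frame = np.sin(2 * np.pi * 200 * t) * np.exp(-t * 20)
c = real_cepstrum(frame)
print(c.shape)  # (1024,)
```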

The FFT of the speech signal converts the time-domain signal to its frequency domain, where the source-filter convolution becomes a product. The logarithm turns this product into a sum of the source and filter components, and the IFT brings the summed components back into the time domain [35]. The CCCs characterize the slow components of speech, such as pitch, which are present in the higher cepstral region, and the fast-changing components extracted from the vocal tract filter information; the lower region of the cepstral domain holds the concentration of the fast components. These features are widely used in speaker recognition problems as they carry more speech-related information than quite a few other features [36]. Ever since deep learning emerged as an area for analysing speech and using it to recognize a speaker, as both domains have significantly large amounts of data to be analysed, much research focus has shifted to the application of Deep Neural Networks (DNN) [37],[38]. Studies show that with channel variation and environmental noise in handheld devices or microphones, MFCC and LPCC features degrade the classification performance [39]. Fusion of MFCC with time-

domain features has also been explored for speaker recognition, and Convolutional Neural Networks (CNN) [41] and wavelet-based features are increasingly applied [44]. The purpose of feature extraction from a speech signal is to represent the signal in terms of a significant and reduced number of coefficients, as irrelevant and redundant data can be ignored.

From the above discussion on voice feature extraction, we can safely conclude that various researchers have proposed extraction of low-level MFCC features. These classical features, also called handcrafted features, give unsatisfactory results in noisy environments, and hence modifications to them have been suggested and executed by researchers, either adding high-level features or supplementing them with features from other modalities. However, the fact remains that MFCC features still come closest to representing the vocal tract envelope and are the most preferred voice features extracted. CNNs, which have to date been used primarily for image and video analysis, are also increasingly explored by researchers for speech feature extraction and classification.

2.1.2 Voice feature selection

The vital step in the speaker recognition problem that needs to be addressed is the extraction of speaker-specific features, which need to be distinct and unaffected by ambient noise. These features extracted from the voice biometric, armed with speaker-related information, need to be further classified.

However, to avoid the burden of memory required to store a large feature data set and to reduce processing time, a feature ranking framework becomes imperative to make the system learn optimally from fewer features [45]. Feature selection techniques involve selecting a smaller feature dimension from the larger feature domain which can adequately denote the speaker's uniqueness.

Supervised, semi-supervised and unsupervised techniques are three distinct categories of feature selection methods. These are further classified into the filter-

based, wrapper and embedded methods. A widely used supervised technique to rank features is the Relief algorithm, which is robust to degraded and multiclass feature sets that are not complete, and is widely used for spectral pre-processing. Statistical methods are applied in this algorithm, and feature selection is carried out based on feature weights: the intra-class nearest data points, designated as nearest hits, are calculated, and so are the nearest neighbours from different classes, labelled as nearest misses.

Pearson Correlation Coefficient (PCC) [49] is yet another feature ranking algorithm, which finds the relation amongst data points by computing the coefficient of correlation r as given in equation (2.1), where r lies in the range of -1 to 1. Positive values signify a correlation, negative values signify an inverse relationship, and no relationship exists if the value is zero.

r(i) = cov(Xi, y) / √(var(Xi) · var(y))    (2.1)

where Xi is the i-th variable and y is the class label.
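The ranking rule of equation (2.1) can be sketched directly; the synthetic data and the helper name pearson_rank are illustrative assumptions.

```python
import numpy as np

def pearson_rank(X, y):
    """Rank features by |r(i)| from equation (2.1):
    r(i) = cov(X_i, y) / sqrt(var(X_i) * var(y))."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    cov = (Xc * yc[:, None]).mean(axis=0)
    r = cov / np.sqrt(Xc.var(axis=0) * yc.var() + 1e-12)
    return np.argsort(-np.abs(r)), r

rng = np.random.default_rng(0)
y = rng.standard_normal(200)
X = np.column_stack([y + 0.1 * rng.standard_normal(200),  # strongly correlated
                     rng.standard_normal(200)])           # irrelevant
order, r = pearson_rank(X, y)
print(order[0])  # → 0, the correlated feature ranks first
```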
Another feature selection technique, an unsupervised filter-based method, is the Laplacian Score. In this method the features are ranked on the basis of their locality preserving power, based on Laplacian Eigenmaps [50]. A statistical feature ranking and feature selection technique is the Chi-square (χ²) method, where the χ² test evaluates the importance of a feature with respect to its class labels.
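A minimal sketch of the χ² scoring idea on a discretized feature follows; the binning and data are assumed for illustration, and library implementations such as scikit-learn's chi2 differ in detail.

```python
import numpy as np

def chi2_score(feature_bins, labels):
    """Chi-square statistic of a discretized feature against class labels.
    Larger values indicate the feature distribution differs across classes."""
    f_vals = np.unique(feature_bins)
    c_vals = np.unique(labels)
    obs = np.array([[np.sum((feature_bins == f) & (labels == c))
                     for c in c_vals] for f in f_vals], dtype=float)
    row = obs.sum(axis=1, keepdims=True)
    col = obs.sum(axis=0, keepdims=True)
    exp = row * col / obs.sum()       # expected counts under independence
    return ((obs - exp) ** 2 / exp).sum()

# A feature perfectly aligned with the class scores higher than a shuffled one
labels = np.array([0] * 50 + [1] * 50)
informative = labels.copy()
rng = np.random.default_rng(1)
shuffled = rng.permutation(informative)
print(chi2_score(informative, labels) > chi2_score(shuffled, labels))  # True
```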

Once adequate and discriminative features are extracted and selected, the next task is to uniquely classify these features to represent a speaker's model against which the test data can be matched.

Features ranked by supervised or unsupervised feature selection algorithms are further classified to create speaker models which uniquely describe the speakers. The classification models used are Vector Quantization (VQ), Gaussian Mixture Models (GMM), Support Vector Machines (SVM), and models based on state-of-the-art neural networks and deep learning.

Vector Quantization

One of the oldest and quite successful feature classification techniques for ASR is Vector Quantization (VQ). The basic idea of VQ is to classify the extracted feature vectors based on comparison with a variable distance threshold [51]. In this approach the large number of short-term spectral feature vectors is compressed into smaller sets of code vectors. The clustering algorithms used here are the Linde-Buzo-Gray (LBG) and K-Means. An iterative approach is employed by the K-Means algorithm in which, with each successive iteration, redistribution of vectors to form new clusters takes place to minimize distortion; the algorithm estimates the means of all the created clusters. Similar to the K-Means clustering algorithm, LBG also clusters a group of input feature vectors S = {Xi ∈ Rd | i = 1, 2, ..., n} to create another feature vector set C = {Cj ∈ Rd | j = 1, 2, ..., K} according to a similarity measure, where K is chosen by the user to be less than n. The initial codebook size, the distortion and the threshold ε decide the convergence of the algorithm. K-Means and LBG have often been applied to optimize the means and covariances for the GMM classifier. VQ has for decades been successfully employed in speaker-independent isolated word recognition [52] as a front-end pre-processor on cepstral dynamic feature sets. High speaker recognition accuracy has been achieved with VQ even for short test utterances and with limited training data, as compared to the GMM approach [53].
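The codebook training and scoring just described can be sketched with a K-Means-style loop; the codebook size, the 2-D toy data and the Euclidean distortion measure are illustrative assumptions, not the configuration of the cited studies.

```python
import numpy as np

def train_codebook(features, k=4, iters=20, seed=0):
    """K-Means style codebook training as used in VQ: each iteration
    reassigns vectors to their nearest code vector, then re-estimates
    each code vector as the mean of its cluster (distortion decreases)."""
    rng = np.random.default_rng(seed)
    codebook = features[rng.choice(len(features), k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):
                codebook[j] = features[assign == j].mean(axis=0)
    return codebook

def quantize(features, codebook):
    """Average distortion of a feature set against a speaker's codebook;
    at test time the speaker with the lowest distortion is selected."""
    d = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
    return d.min(axis=1).mean()

# Toy 2-D "feature vectors" from two synthetic speakers
rng = np.random.default_rng(1)
spk_a = rng.normal(0.0, 0.5, (200, 2))
spk_b = rng.normal(3.0, 0.5, (200, 2))
cb_a, cb_b = train_codebook(spk_a), train_codebook(spk_b)
test_a = rng.normal(0.0, 0.5, (50, 2))
print(quantize(test_a, cb_a) < quantize(test_a, cb_b))  # True: matched codebook wins
```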

Gaussian Mixture Models

However, the limitation of a clustering model is that each cluster has the same diagonal covariance matrix, which results in inflexible spherical clusters in terms of the distributions they can model. Gaussian mixture models, which are also a type of clustering algorithm, overcome
this limitation. MFCC coefficients modelled with GMM were employed for speaker verification experiments on the NIST Speaker Recognition Evaluation (SRE) 2006 database [55]. Statistical analysis of multitaper MFCC gave better results for speaker verification as compared to conventional MFCC features. Fusion of MFCC and statistical features was investigated to improve the performance of speaker verification in noisy conditions in [56]; the integrated GMM was employed as the classifier on the enhanced TIMIT corpus speech signal, with significant improvement in speaker verification. For a text-independent VidTIMIT database of speakers, authors in [57] evaluated the performance of speaker recognition by applying VQ and GMM models; the GMM model showed a 15% improvement in accuracy over the VQ model. Spectral subtraction, used as pre-processing of the speech signal to enhance speech quality by noise removal, was employed in [58] for text-independent ASR on the VidTIMIT database; GMM models again gave better speaker recognition accuracy than the VQ method.
To make an ASR system counter the risks of spoofing attacks, authors in [59] proposed novel short-term spectral features to efficiently capture the discriminative information between synthetic and natural speech. GMM was employed as a classifier for the synthetic speech and was combined with a GMM-UBM model for speaker verification solutions. GMM has conventionally proven to be a reliable classifier for speaker recognition and verification problems. To improve the security and robustness of speaker verification systems against replay attacks, a study of a throat microphone, i.e. a special body-conducted sensor, was carried out in [60]. Spectral and i-vector features were modelled using the GMM/UBM technique on a dataset of 38 speakers, and a considerable reduction in the false error rate, from 69.69% to 18.75%, was observed. Authors in [61], however, emphasize that GMM does not give desirable accuracy results for short utterances.
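The likelihood-maximizing EM training behind a GMM speaker model can be sketched as follows. This diagonal-covariance toy, including the component count, synthetic data and initialisation, is an assumption for illustration and not the configuration used in the cited studies.

```python
import numpy as np

def fit_gmm(X, k=2, iters=50, seed=0):
    """Diagonal-covariance GMM trained with EM: the E-step computes
    responsibilities, the M-step re-estimates weights, means and variances,
    increasing the likelihood of the feature vectors."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    means = X[rng.choice(n, k, replace=False)].astype(float)
    var = np.full((k, d), X.var(axis=0))
    w = np.full(k, 1.0 / k)
    for _ in range(iters):
        logp = (-0.5 * (((X[:, None, :] - means) ** 2) / var
                        + np.log(2 * np.pi * var)).sum(axis=2) + np.log(w))
        logp -= logp.max(axis=1, keepdims=True)      # stabilize before exp
        resp = np.exp(logp)
        resp /= resp.sum(axis=1, keepdims=True)
        nk = resp.sum(axis=0)
        w = nk / n
        means = (resp.T @ X) / nk[:, None]
        var = (resp.T @ (X ** 2)) / nk[:, None] - means ** 2 + 1e-6
    return w, means, var

def avg_log_likelihood(X, w, means, var):
    """Average log-likelihood of frames X under a trained model (log-sum-exp)."""
    logp = (-0.5 * (((X[:, None, :] - means) ** 2) / var
                    + np.log(2 * np.pi * var)).sum(axis=2) + np.log(w))
    m = logp.max(axis=1, keepdims=True)
    return float((m[:, 0] + np.log(np.exp(logp - m).sum(axis=1))).mean())

# Score a test utterance against two synthetic speaker models
rng = np.random.default_rng(2)
spk_a = np.concatenate([rng.normal(-2, 0.4, (150, 2)), rng.normal(0, 0.4, (150, 2))])
spk_b = rng.normal(3, 0.6, (300, 2))
model_a, model_b = fit_gmm(spk_a), fit_gmm(spk_b)
test = rng.normal(-2, 0.4, (60, 2))
print(avg_log_likelihood(test, *model_a) > avg_log_likelihood(test, *model_b))  # True
```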
Neural networks fed with spectrograms of the speech signal have also been employed to further decrease the ASR error rate, to as low as 2.5%. For unsupervised experimentation on speaker recognition carried out on a domain adaptation challenge corpus, authors in [62] adapted out-of-domain GMM models to in-domain i-vectors, with an expectation maximization algorithm employed for probability estimation of the hyperparameters of the GMM. In [63], iteration parameters and repeated iterations were optimized using training data, giving significant improvements in the accuracy of speaker recognition.

Support Vector Machines

For classification of feature vectors in speaker recognition problems, another very promising supervised machine learning algorithm is Support Vector Machines (SVM). This machine learning model uses a classification algorithm to find hyperplanes that differentiate between classes by plotting each feature vector as a point in the n-dimensional space, where n is the dimensionality of the feature vectors.
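The hyperplane-fitting idea can be sketched with sub-gradient descent on the hinge loss, a simplified stand-in for a full SVM solver; the toy data, learning rate and regularisation strength are assumptions for illustration.

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200, seed=0):
    """Linear SVM via sub-gradient descent on the hinge loss.
    Learns a hyperplane w.x + b = 0 separating two classes (y in {-1, +1});
    each feature vector is a point in the n-dimensional feature space."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in rng.permutation(n):
            margin = y[i] * (X[i] @ w + b)
            if margin < 1:                    # point inside the margin: push it out
                w = (1 - lr * lam) * w + lr * y[i] * X[i]
                b += lr * y[i]
            else:                             # correctly classified: only shrink w
                w = (1 - lr * lam) * w
    return w, b

# Two linearly separable "speaker" clusters
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-1.5, 0.3, (40, 2)), rng.normal(1.5, 0.3, (40, 2))])
y = np.array([-1] * 40 + [1] * 40)
w, b = train_linear_svm(X, y)
pred = np.sign(X @ w + b)
print((pred == y).mean())  # → 1.0 on this separable toy data
```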
For speaker verification, speech signals represented as probabilities of discrete events as features were examined on the NIST database in [64]; the ASR performance achieved was marginally better than that obtained with GMM. An HMM-based text-to-speech synthesizer was evaluated for speaker verification in [65] on the Wall Street Journal speech corpus, with a match claim of 81%. An efficient discriminative speaker verification has been developed in the i-vector space with SVM as a linear discriminative classifier [66] on the NIST 2010 database. The SVM was trained to discriminate between the hypotheses determining the probability of a pair of feature vectors belonging to the same speaker or to different speakers. The parameters of a symmetric quadratic function approximating a log-likelihood ratio were estimated without the need to explicitly model the i-vector distributions, as in Probabilistic Linear Discriminant Analysis (PLDA) models; this gender-independent model gave state-of-the-art accuracy. Speaker adaptation techniques that allow speaker models to be trained using less training data than that required for maximum-likelihood training have been extensively employed for speaker recognition. Maximum a posteriori (MAP) and maximum-likelihood linear regression (MLLR) techniques were incorporated in [67] as feature vectors to be further classified using SVM; the evaluation on the NIST 2005 and 2006 SRE databases gave good results due to front-end speaker adaptation. Intermediate matching kernels (IMK) were explored as dynamic kernels (DK) for SVM classification of feature vectors extracted from varying-length speech. Class-independent GMM, SVM-based classifiers with IMK and SVM-based
classifiers with cosine distance kernels were compared for speaker verification on the NIST 2008 dataset. Experiments with cepstral and time-dynamic features on the TIMIT dataset showed that abnormal speech behaviour impacted the verification accuracy, with the equal error rate increasing by 0.7%. SVM was employed in [72], along with a feed-forward neural network (FFNN), another machine learning algorithm, to classify acoustic parameters combined with MFCC and openSMILE acoustic feature vectors derived from the voice signal, to detect specific language impairment (SLI) in children; assessments were performed on an SLI speech corpus.

With advances in neural networks in the domain of image processing, deep learning algorithms have been extensively experimented with, and in recent years researchers have begun to apply these algorithms to speaker recognition [73],[74]. DNNs have been used for front-end factor analysis models and are increasingly applied for feature extraction to be used with MFCC features. Like DNNs, other powerful deep learning algorithms are CNNs and Recurrent Neural Networks (RNN), which have been found to give better accuracy with larger data sets [75],[76].

Much of the work has traditionally been done with the application of CNNs on the speech spectrogram, which is in the frequency domain; for speech analysis solutions, DNN is outperformed by CNN [77]. Authors have frequently applied DNN-based systems successfully to resolve speaker recognition issues; however, much of the research conducted is in the field of language recognition [78]. Researchers have shown that DNNs give optimized performance in the challenging speaker recognition task in the domain of speech-based human-computer interaction. The large feature data set required for tuning and enhancing ASR performance on different supervised learning tasks is the limitation of this classifier. Authors in [79] have

carried out immense research on robust speaker verification against spoofing attacks. Work has been done at the front end to find appropriate features to distinguish between genuine and spoofed speech, with GMM and DNN being the popular classifiers employed as the back end. Authors used dynamic filter bank-based cepstral features and CQCC along with DNN classifiers, combined with a human log-likelihood scoring method, and found them to be more effective compared to GMM-based systems. Continuing research in the application of various machine learning algorithms for ASR, authors in [84] employed and experimented with four classifier models, namely SVM, K-nearest neighbours, Random Forest and Logistic regression, as well as Artificial Neural Networks (ANN), for speaker identification from voice; ANN-based systems gave the best accuracy of 96.03% for ASR [85].

Recognizing the speaker from the features extracted from the speech signals depends on how discriminating the features are. GMM, a statistical model, is built on maximizing the likelihood of the feature vectors, computed by the EM algorithm. In comparison to GMM modelling, DNN-computed models perform better, as this technique maximizes feature discrimination at each stage; as the DNN modelling gets deeper, the feature vectors become profoundly discriminative. These systems also respond better in adverse noisy conditions. The state-of-the-art CNN-based models offer the least application complexity as they provide an end-to-end solution for ASR.

• Bilabial: both lips come together (p, b, m, w)
• Labiodental: lower lip and upper teeth make contact (f, v)
• Dental: the tongue makes contact with the upper teeth (-th)
• Alveolar: the tip of the tongue makes contact with the alveolar ridge (t, d, s, z, n, l)
• Palatal: the tongue approaches the palate (j, r, -sh)
• Velar: back of the tongue contacts the velum (k, g, -ng)
• Glottal: this is really an unvoiced vowel (h)
Image from: https://notendur.hi.is

Figure 3.2 Specific sound production due to position of articulators

As the anatomy of every person is different, the voice of every individual will also be unique, as the pitch, amplitude and frequency of the sound produced depend on the physiological mechanism of speech production and the articulators, as discussed before. With this information, the rest of the chapter discusses how voice can be a significant
biometric trait to recognize a speaker. The ASR system has been implemented in four key stages:

(i) Extraction of features from the selected biometric parameter.
(ii) Model creation with the features extracted in stage one. This phase is also known as the training stage/enrolment stage.
(iii) Sample speaker data is matched with the created speaker models in this test stage to recognize the speaker.
(iv) Decision stage to determine ASR.

Figure 3.3 gives the pictorial description of the steps undertaken in the experimentation carried out for speaker recognition from text-independent speech information taken from the standard VidTIMIT database, further validated on the database locally created in the college premises, henceforth known as the MPSTME Database.

[Figure: pipeline showing Pre-processing (framing, windowing, pre-emphasis) → Feature Extraction (MFCC; MFCC + DELTA; MFCC + DELTA-DELTA; CNN) → Feature Selection (Fisher Score Algorithm) → Training → Classification/Modelling (VQ, GMM, SVM, CNN) → Testing → Decision of Speaker Recognition]

Figure 3.3 Steps for speaker recognition from text independent speech information

The next section explains each step leading to speaker recognition.

3.2 Denoising speech signal

The aim of denoising speech signals is to produce noise-free speech signals from noisy recordings, while improving the perceived quality of the speech component and increasing its intelligibility [138]. The effectiveness and quality of speaker-specific features are limited by the degraded audio signal affected by the environmental noise introduced at the time of recording. To improve the perceptual characteristics of the audio signal and make it legible, enhancement of the audio signal needs to be undertaken to minimize the alteration that occurred. The additive nonstationary ambient noise can be eliminated using one of the classical methods, spectral subtraction. This algorithm has been implemented in the work done for this research.

3.2.1 Spectral Subtraction for speech signal enhancement

Spectral subtraction is a widely used algorithm in acoustic noise reduction, mainly because of its simplicity of implementation [139]. It was introduced in the 1970s by Boll [140], then generalized and improved by Berouti [141].

An audio signal affected by noise, represented as y[n], is assumed to have a clean speech component s[n] which has been degraded by the addition of statistically additive and independent noise d[n], as shown in equation (3.1) [138].

y[n] = s[n] + d[n]    (3.1)

where y[n], s[n] and d[n] are the sampled noisy speech, clean speech, and additive noise, respectively. The additive noise is presumed to be of zero mean and uncorrelated with the clean speech. The processing of the noisy signal is to be done for each frame. The short-time Fourier transform (STFT) represents the signal as given in equation (3.2),

Y(ω,k) = S(ω,k) + D(ω,k)    (3.2)

In equation (3.2), the frame number is represented by k. Assuming the speech signal to be segmented into frames, we drop k to simplify the equations. As speech and background noise are uncorrelated, the short-term power spectrum of y[n] has no cross-terms. The short-term power spectrum of the signal, |Y(ω)|², can therefore be represented in the Fourier transform domain as in equation (3.3).

|Y(ω)|² = |S(ω)|² + |D(ω)|²    (3.3)

The clean speech power spectrum is estimated by subtracting an estimate of the noise power spectrum from that of the noisy signal, as represented by equation (3.4),

|Ŝ(ω)|² = |Y(ω)|² − |D̂(ω)|²    (3.4)

where the noise power spectrum |D̂(ω)|² is estimated by averaging over M frames of the signal assumed to contain no speech, as in equation (3.5); when longer averages are taken, the estimate converges to the true noise power spectrum.

|D̂(ω)|² = (1/M) Σ_{i=1..M} |Yi(ω)|²    (3.5)

The signal spectrum estimator estimates the magnitude spectrum, while with the application of the inverse DFT the enhanced signal is finally computed.

The spectral subtraction of the noise signal from the speech signal, represented by equation (3.4), can also be expressed as a filter, as given in equation (3.6), where the enhanced spectrum is the product of the distorted audio spectrum with a filter.

Ŝ(ω) = H(ω)Y(ω)    (3.6)

where H(ω) is the spectral subtraction filter (SSF), or the gain function. It is a zero-phase filter whose magnitude response lies in the range 0 ≤ H(ω) ≤ 1 and is computed as given in equation (3.7).

H(ω) = √(1 − |D̂(ω)|² / |Y(ω)|²)    (3.7)

Reconstruction of the resultant signal is achieved after estimating the phase of the speech signal. The noisy signal and the estimated clean signal are commonly considered to have the same phase, with the assumption that for the human ear the short-term phase is rather unimportant. The resultant audio spectrum per frame is thus arrived at as shown in equation (3.8).

Ŝ(ω) = |Ŝ(ω)| e^(j∠Y(ω)) = H(ω)Y(ω)    (3.8)
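The procedure of equations (3.1)-(3.8) can be sketched frame-wise in Python. The frame length, the absence of windowing and overlap-add, and the synthetic test signal are simplifying assumptions for illustration, not the settings of this work.

```python
import numpy as np

def spectral_subtraction(y, frame_len=256, noise_frames=10):
    """Frame-wise spectral subtraction: estimate the noise power spectrum from
    leading noise-only frames, subtract it from each frame's power spectrum
    (floored at zero), keep the noisy phase, and inverse-FFT each frame."""
    n = len(y) // frame_len
    frames = y[:n * frame_len].reshape(n, frame_len)
    spectra = np.fft.rfft(frames, axis=1)
    noise_pow = np.mean(np.abs(spectra[:noise_frames]) ** 2, axis=0)   # noise average
    clean_pow = np.maximum(np.abs(spectra) ** 2 - noise_pow, 0.0)      # subtraction
    gain = np.sqrt(clean_pow / np.maximum(np.abs(spectra) ** 2, 1e-12))  # gain H(w)
    enhanced = np.fft.irfft(gain * spectra, n=frame_len, axis=1)       # noisy phase kept
    return enhanced.reshape(-1)

# Sine tone plus white noise; the leading frames are noise only
rng = np.random.default_rng(0)
sr, f0 = 8000, 440
t = np.arange(sr) / sr
noise = 0.3 * rng.standard_normal(sr)
signal = np.sin(2 * np.pi * f0 * t)
signal[: 10 * 256] = 0.0                 # noise-only head used for estimation
noisy = signal + noise
out = spectral_subtraction(noisy)
err_before = np.mean((noisy[: len(out)] - signal[: len(out)]) ** 2)
err_after = np.mean((out - signal[: len(out)]) ** 2)
print(err_after < err_before)  # True: residual error drops after enhancement
```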

3.3 Pre-processing speech signal

Pre-processing is a fundamental signal processing task to be applied before relevant feature extraction can be done, to enhance the performance of the feature extraction algorithms. Eliminating the DC component, pre-emphasis filtering of the signal and normalizing the waveform amplitude are some necessary and commonly used pre-processing techniques.

3.3.1 Removing the DC Component

The speech signal initially has a non-zero mean, which is the fixed component; the signal acquisition instrument introduces this DC bias. On subtracting the mean value from all the samples, the DC component can be easily removed.

3.3.2 Pre-emphasis

With energy concentrated in the low frequency region, a negative spectral slope is observed in the estimated spectrum of speech. Vowels, which are the voiced information of speech, have more energy in the lower frequencies than at the higher frequencies. This is also known as the spectral tilt.

For preprocessing the speech signal, a pre-emphasis filter is introduced for raising the comparative energy in the high frequency region of the spectrum before applying any feature extraction algorithms. Information in the higher formants is made available to the acoustic model by boosting the high frequency energy. Figure 3.4 shows the speech signal before it was pre-emphasised, and the speech signal after the high frequency components were boosted is given in Figure 3.5.

Figure 3.4 Speech signal before Pre-emphasis

Figure 3.5 Speech signal after Pre-emphasis

The pre-emphasis filter compresses the dynamic range of the speech signal's power spectrum by flattening the spectral tilt. Equation (3.9) denotes the filter,

P(z) = 1 − az⁻¹   (3.9)

In equation (3.9), the typical value of a lies between 0.9 and 1.0. For processing the speech signal, the glottal signal is usually modelled using a two-pole filter for which both the poles are close to the unit circle.
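A minimal sketch of the pre-emphasis filter P(z) = 1 − az⁻¹ of equation (3.9); the default a = 0.97 is a common choice within the 0.9-1.0 range the text gives, not a value the text prescribes.

```python
import numpy as np

def pre_emphasis(x, a=0.97):
    """y[n] = x[n] - a*x[n-1], boosting the high-frequency energy."""
    y = np.empty_like(x, dtype=float)
    y[0] = x[0]                    # first sample has no predecessor
    y[1:] = x[1:] - a * x[:-1]
    return y
```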

3.3.3 Amplitude Normalization

Placement of the microphone at varying distances from the speaker and differing recording conditions result in inconsistent energy levels across the recorded speech signals. This pre-processing technique cancels the varying energy levels between recorded signals, thereby reducing the variance of the energy-related features which are extracted subsequently. A simple implementation divides the signal at every point by its maximum absolute value, thereby restricting the dynamic range of the signal to within −1.0 and 1.0. A variation of the normalization technique is achieved by dividing every sample in the waveform by the variance of the utterances.
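The DC removal and amplitude normalization steps described above can be sketched as follows (peak normalization is shown; dividing by the variance of the utterance is the variant the text mentions):

```python
import numpy as np

def normalize(x):
    """Remove the DC bias, then scale into [-1.0, 1.0] by the peak magnitude."""
    x = x - np.mean(x)             # subtract the mean: DC component removed
    peak = np.max(np.abs(x))
    return x / peak if peak > 0 else x
```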
3.3.4 Windowing

Speech is a continuously varying audio signal. Application of the window function approximates a continuously varying signal to a stationary one, to enable the capturing of enough speech or speaker related information. A window size of 25 ms width, as shown in figure 3.6, is usually considered large enough for the signal to be quasi-stationary and yet retain enough information.

Figure 3.6 Steps in pre-processing the speech signal
The windowing process slices the audio waveform into multiple sliding frames. If a simple rectangular window function is applied, the samples which are not contained in the windowed frame are made zero while the amplitude of all the samples present in the windowed frame is maintained. To the rectangular windowed signal's frame, application of the Fast Fourier Transform (FFT) for spectral analysis distorts the signal due to abrupt changes at the starting and end points, leading to the creation of noise which is more prominent in the high frequency region. A tapered window of the form given in equation (3.10) reduces this distortion.

w(n) = (1 − a) − a cos(2πn/M), 0 ≤ n ≤ M   (3.10)

Applying the window of equation (3.10), a Hanning window is obtained when a is 0.5, and when a is 0.46 it is a Hamming window. Some other commonly used windows are Bartlett, Blackman, Harris, Kaiser and Triangular. Table 3.1 lists the equations for some commonly used window functions.

Table 3.1 Window types and their equations
Hanning: w(n) = 0.5 − 0.5 cos(2πn/M)
Blackman: w(n) = 0.42 − 0.5 cos(2πn/M) + 0.08 cos(4πn/M)
Hamming: w(n) = 0.54 − 0.46 cos(2πn/M)
Bartlett: w(n) = 1 − 2|n − M/2| / M
Rectangular: w(n) = 1
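The window equations of Table 3.1 can be generated directly; a quick sketch (M is the window length and n runs from 0 to M − 1, matching the table's cosine form):

```python
import numpy as np

def make_window(kind, M):
    """Return an M-point window of the requested type from Table 3.1."""
    n = np.arange(M)
    if kind == "hanning":
        return 0.5 - 0.5 * np.cos(2 * np.pi * n / M)
    if kind == "hamming":
        return 0.54 - 0.46 * np.cos(2 * np.pi * n / M)
    if kind == "blackman":
        return 0.42 - 0.5 * np.cos(2 * np.pi * n / M) + 0.08 * np.cos(4 * np.pi * n / M)
    if kind == "bartlett":
        return 1.0 - 2.0 * np.abs(n - M / 2) / M
    return np.ones(M)              # rectangular window
```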

For a sampling frequency of 44100 Hz and a cut-off at 5000 Hz, the frequency response of a 28th order low pass filter designed with the various windows is as in figure 3.7.

Figure 3.7 Frequency response of a 28th order low pass filter for various windows
3.3.5 Framing the speech signal
The frame length and the step size are the two parameters which are conventionally used to frame the speech signal. Framing of the speech signal, with N as the window size and M as the step size, can be implemented as follows.

i. Begin the first frame from the starting of the speech signal; the N/2 th sample will be the centre of the frame.
ii. Keep moving the frame ahead by M points till the signal terminates. The ith frame will be centred at the sample (i − 1) * M + N/2.
iii. The points remaining at the end, if not large enough to build a complete frame, can be dumped; else construct the last frame with additional zero padding to maintain the step size for getting another full frame.
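The framing steps above can be sketched as follows, with window size N and step size M as in the text; zero padding of the last short frame is shown rather than dropping it:

```python
import numpy as np

def frame_signal(x, N, M):
    """Slice signal x into overlapping frames of length N with step M."""
    frames = []
    start = 0
    while start < len(x):
        frame = x[start:start + N]
        if len(frame) < N:                       # pad the last short frame with zeros
            frame = np.pad(frame, (0, N - len(frame)))
        frames.append(frame)
        start += M
    return np.stack(frames)
```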

3.3.6 MFCC feature extraction

For implementing conventional ASR systems, the Mel Frequency Cepstral Coefficients (MFCC) are one of the most popular features extracted from audio signals [142]-[144]. MFCC features remain relevant even today as they hold substantial speaker specific details. The critical bandwidth in the human ear varies with frequency, and these variations are best represented by the MFCC features. Phonetically significant characteristics of speech are captured with filters spaced linearly at low frequencies and logarithmically at high frequencies [145]. The lower order coefficients carry maximum details of the complete spectral shape of the source-filter transfer function; however, spectral transitions also play an important role in the

perception of human speech. The higher order coefficients should also be examined for their contribution to the overall speaker recognition performance, as they carry increasing levels of spectral detail [146]. The speaker information present in the cepstral log spectrum is represented by these features, which are commonly known as MFCC. The delta features, i.e. the first and second increments of the dynamic spectral distances, can be further computed and are found to contribute significantly to the feature set. As MFCC features continue to remain the most significant speech features, this work extracts them from the speech signals. The procedure for extracting the MFCC features and the derivative features is depicted in figure 3.8 and is further explained in detail.

Figure 3.8 Steps in computing Mel frequency cepstral coefficients
Steps for extracting MFCC features from the speech signal:

i. Denoise and pre-process the speech signal.
ii. Obtain short frames of the speech utterances by applying suitable window functions.
iii. Calculate the periodogram estimate of the power spectrum for each frame.
iv. Apply the triangular mel filter bank to the power spectrum; sum up the energy for each individual filter.
v. Compute the log of all the filter bank energies.
vi. Obtain the Discrete Cosine Transform of the log filter bank energies.
vii. Retain the first 13 coefficients computed in step vi, which represent the MFCC features, and discard the rest.
viii. Compute the first order and second order derivatives to estimate the Delta and Delta-Delta features.

The detailed description of the MFCC feature extraction procedure is discussed further in the following section.

The time varying speech signal is converted into a quasi-stationary, i.e. a statistically stationary, signal by applying 20-40 ms window frames. The frame size selection is significant. It should be long enough to contain sufficient data for a reliable spectral estimate and short enough for the signal to be considered stationary.

The time domain signal s(n) is framed to give si(n), where i denotes the number of frames in which the signal is divided. The complex Discrete Fourier Transform (DFT) for each frame is determined as given in equation (3.11),

Si(k) = Σ n=1..N si(n) h(n) e^(−j2πkn/N), 1 ≤ k ≤ K   (3.11)

where si(n) is the speech signal after framing, i denotes the frame number, h(n) is an N sample long analysis window and K is the length of the DFT.

Once the speech signal is divided into quasi-stationary frames, the next
step is to compu te the
power spectrum. The power spectrum mimics the function of the cochle a and
identi fies the
frequencies present in the frame. Its calculation for frame i is given as in equati
on (3 .12).
Pi(k) = (1/N) |Si(k)|²   (3.12)
Where Pi(k) is the power spectrum of frame i which is computed by taking the absolu
te value
of the complex Fourier transform. Squaring the computed result, we can determine
the
periodogram spectral estimate. Conventionally a 512 point FFT is performed and
257
coefficients are retained. The periodogram estimate of the power spectrum, however,
cannot
discern between two closely related frequencies present in the frame and this limitation
becomes
more pronounced with increase in frequencies.
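Equations (3.11) and (3.12), the windowed DFT per frame followed by the periodogram, can be sketched as below; the 512-point FFT with 257 retained coefficients follows the convention stated in the text, while the use of a Hamming analysis window here is an assumption.

```python
import numpy as np

def periodogram(frames, nfft=512):
    """Power spectrum P_i(k) = |S_i(k)|^2 / nfft for each windowed frame."""
    N = frames.shape[1]
    h = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(N) / N)   # analysis window h(n)
    S = np.fft.rfft(frames * h, n=nfft, axis=1)              # equation (3.11): 257 bins for nfft=512
    return (np.abs(S) ** 2) / nfft                           # equation (3.12)
```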

A mel filter bank of 20-40 (conventionally 26) triangular filters is applied to the periodogram power spectrum estimate, giving the filter bank in the form of 26 vectors. Each filter is mostly zero and is non-zero only for a certain section of the spectrum. Each filter bank is multiplied with the power spectrum and the coefficients are summed up, resulting in 26 numbers which are an estimate of the energy present in each filter band. Equation (3.13) gives the formula for computing the mel filter bank.

Hm(k) = 0, for k < f(m − 1)
Hm(k) = (k − f(m − 1)) / (f(m) − f(m − 1)), for f(m − 1) ≤ k ≤ f(m)
Hm(k) = (f(m + 1) − k) / (f(m + 1) − f(m)), for f(m) ≤ k ≤ f(m + 1)
Hm(k) = 0, for k > f(m + 1)   (3.13)

In equation (3.13), m denotes the total count of filters while f(m) enumerates the m + 2 mel-spaced frequency points. To make the speech features match closely with what we humans can hear, a compression operation needs to be subsequently performed. The computed log of each of the 26 energies gives us the 26 log filter bank energies.

Once the filter bank energies are available, we need to compute their logarithm. This is essentially motivated by how we humans hear loudness: loudness is not perceived by humans on a linear scale. If the sound is loud to begin with, a variation in energy might not sound different. The mel scale power spectrum, which mimics the human sense of loudness perception, is obtained as given in equation (3.14).

S[m] = ln( Σ k Pi(k) Hm(k) ), m = 1, 2, …, M   (3.14)

The DCT transforms the outputted logarithmic mel spectrum of the signal S[m] into the time domain. These are the decorrelated MFCC coefficients, and conventionally only 13 coefficients are retained. Equation (3.15) computes the MFCC coefficients from S[m].

MFCC[i] = Σ m=1..M S[m] cos( πi(m − 1/2)/M ), i = 1, 2, …, L   (3.15)
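Equations (3.13)-(3.15) can be put together in a short sketch. The mel mapping m = 2595 log10(1 + f/700) is the conventional formula, assumed here since the text does not state it; the 26 filters and 13 retained coefficients follow the text.

```python
import numpy as np

def mel_filterbank(nfilt=26, nfft=512, fs=32000):
    """Triangular filters of equation (3.13) built on nfilt+2 mel-spaced points."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = imel(np.linspace(mel(0.0), mel(fs / 2.0), nfilt + 2))   # f(0)..f(nfilt+1)
    bins = np.floor((nfft + 1) * pts / fs).astype(int)
    H = np.zeros((nfilt, nfft // 2 + 1))
    for m in range(1, nfilt + 1):
        for k in range(bins[m - 1], bins[m]):                     # rising slope
            H[m - 1, k] = (k - bins[m - 1]) / max(bins[m] - bins[m - 1], 1)
        for k in range(bins[m], bins[m + 1]):                     # falling slope
            H[m - 1, k] = (bins[m + 1] - k) / max(bins[m + 1] - bins[m], 1)
    return H

def mfcc_from_power(P, nfilt=26, ncep=13):
    """Log filter bank energies (3.14) followed by the DCT of (3.15)."""
    H = mel_filterbank(nfilt, (P.shape[1] - 1) * 2)
    S = np.log(np.maximum(P @ H.T, 1e-12))                        # equation (3.14)
    m = np.arange(1, nfilt + 1)
    basis = np.cos(np.pi * np.outer(np.arange(ncep), m - 0.5) / nfilt)
    return S @ basis.T                                            # equation (3.15)
```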

In equation (3.15), L represents the number of MFCC coefficients retained while M is the length of the filter bank. The MFCC coefficients describe only the spectral envelope of a single frame; the information in the dynamics, i.e. the trajectories of the MFCC coefficients over a period of time, also needs to be examined. The Delta and Delta-Delta coefficients of the MFCC are the differential and the acceleration coefficients respectively. When appended to the conventional features, they can enrich the speaker features [147]. The contribution of dynamic features in enhancing ASR has been examined in this work. Equation (3.16) gives the Delta coefficients.

dt = ( Σ n=1..N n (c(t+n) − c(t−n)) ) / ( 2 Σ n=1..N n² )   (3.16)
In equation (3.16), the differential coefficient is represented by dt, t is a frame number, and it depends on the static coefficients evaluated from c(t+n) through c(t−n). The size of the delta window is represented by n in the equation. The dynamic or the shifted MFCC features, also known as delta features, are further evaluated for computing the acceleration, i.e. the Delta-Delta features.

The decorrelated MFCC features, also called the handcrafted features, are then given to classifiers such as HMM and GMM to create speaker models. Discriminative learning, however, does not happen here.
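The delta computation of equation (3.16) can be sketched for a one-dimensional sequence of cepstral values; the window size N = 2 and the repetition of edge frames are common choices assumed here.

```python
import numpy as np

def delta(ceps, N=2):
    """d_t = sum_n n*(c_{t+n} - c_{t-n}) / (2 * sum_n n^2), edge frames repeated."""
    T = len(ceps)
    padded = np.concatenate([[ceps[0]] * N, ceps, [ceps[-1]] * N])
    denom = 2 * sum(n * n for n in range(1, N + 1))
    return np.array([
        sum(n * (padded[t + N + n] - padded[t + N - n]) for n in range(1, N + 1)) / denom
        for t in range(T)
    ])
```

On a linearly increasing sequence the interior deltas equal the slope, which is a quick sanity check.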

If a Deep Neural Network is deployed as the classifier for learning relevant patterns in the utterances, fewer handcrafted features are necessary for this modelling. It has been observed that when a DNN enhances the cepstral domain of the speech signal, it leads to substantial noise reduction, thus significantly enhancing the signal [148]. Extensive research in the field of speech recognition deploys deep learning, where the spectrograms of frames computed from the speech signals are traditionally inputted to a CNN for classification [149]. As speaker recognition is the focus of the problem at hand, the concatenation of the derivative dynamic coefficients to the static handcrafted MFCC coefficients forms the input for the one-dimensional CNN classifier. The contribution of this state-of-the-art technique to ASR, along with the VQ, GMM and SVM classifiers, has been evaluated in this research.

3.4 Feature Selection

Appending the derivative coefficients to the static coefficients enriches the audio feature set; however, the resulting rise in dimensionality creates feature sets which contain redundant and irrelevant features and can decrease the efficiency of the ASR systems. Under these circumstances, selecting speaker-relevant features is essential and meaningful in reducing the issues of overfitting and redundancy. The complexity of computation thus reduces, with fewer but significant features leading to an enhanced ASR system.

The methods for feature selection can be broadly categorized as:
i. Filter-based feature selection,
ii. Wrapper feature selection method, and
iii. Embedded feature selection technique.
During the preprocessing phase, the features are ranked in the filter-based technique, and the features with higher ranking are subsequently selected and employed for further processing. Features are scored during further learning in the Wrapper technique of feature selection. The embedded method is a combination of the filter-based feature selection and wrapper-based learning methods, making it a closely coupled technique which finds limited application [150].

For the experiments conducted in this research, the filter-based technique, the Fisher score, has been deployed for feature optimization. This algorithm scores over the Laplacian method and the ReliefF method [151] for the selection of features.

3.4.1 Filter-based Fisher Score technique


For supervised learning problems, the filter-based Fisher score method is one of the most favored methods for selecting the feature subset. The algorithm maximizes a criterion F by selecting 'm', a smaller set of features from the extracted bouquet of 'd' features, satisfying the condition m ≤ d, as given in equation (3.17).

r = argmax F(r), s.t. |r| = m   (3.17)

where |r| represents the set's cardinality and the size of the 'd' feature set is denoted by 's'.

A combinatorial optimization is accomplished using this equation. To arrive at a solution for global optimization, however, an investigative approach should initially compute and assign an independent score to all the features as per the standard F. Subsequently the m highest ranked features are selected.
The Fisher score algorithm discriminates between the inter-class features and intra-class features in deciding the selection of the feature set. The between-class feature variance should be large, but the variance of the within-class features should be small, to have good discrimination amongst the selected features. For the feature labelled as fi, its score is calculated as shown in equation (3.18).

F(fi) = ( Σ k=1..c nk (μfi,k − μfi)² ) / ( Σ k=1..c Σ j: yj=k (fj,i − μfi,k)² )   (3.18)
where the mean of fi is given as μfi, while the mean for feature fi belonging to class k is represented as μfi,k. The number of samples in class k is nk, and the value of feature fi in the sample xj is given as fj,i. By this method Fisher scores are computed for all the extracted features, from which the m highest ranked features are retained for further processing. In this experimentation, the role of the Fisher score in improving the ASR has been examined.
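The per-feature score of equation (3.18) can be sketched as follows; the small epsilon guard against constant (zero-variance) features is a practical assumption, not part of the equation.

```python
import numpy as np

def fisher_scores(X, y):
    """Between-class scatter over within-class scatter, per feature column."""
    classes = np.unique(y)
    mu = X.mean(axis=0)                                   # overall mean of each feature
    num = np.zeros(X.shape[1])
    den = np.zeros(X.shape[1])
    for k in classes:
        Xk = X[y == k]
        num += len(Xk) * (Xk.mean(axis=0) - mu) ** 2      # n_k (mu_k - mu)^2
        den += ((Xk - Xk.mean(axis=0)) ** 2).sum(axis=0)  # within-class scatter
    return num / np.maximum(den, 1e-12)
```

A feature whose class means differ while its within-class spread stays small receives a high score and is ranked for retention.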

Once the features are extracted and the most relevant ones are selected by optimizing the static and dynamic feature set, the next step in ASR computation is to train them for creating a speaker model. Models for all the speakers are similarly created. A test sample of data which has preferably not been enrolled, i.e. is not part of the training data, is then deployed for testing for a match with the speaker models formed from the trained data.

The conventional GMM, the SVM (the algorithm developed from statistical learning theory by Vapnik) and the 1-Dimensional CNN are the classifiers employed for this research. The next section discusses them in detail.

3.5 Classifiers

The classifiers play an important role in training the extracted features, which are duly optimized, into models of individual speakers. The models learn to categorize the different classes based on the algorithm deployed. The role of the classifier is vital in generating a correct prediction of the speaker in the testing phase of the ASR system, which will help improve its performance.

Conventional Vector Quantization, GMM modelling and SVM modelling have been used for benchmarking in this work. The deep learning technique 1-D CNN has also been used for training the extracted features. The next section discusses the various classifiers.

3.5.1 VQ - the clustering algorithm

The algorithm maps a pool of feature vectors in a region to smaller groups within the region based on certain criteria. The groups of vectors are called clusters, and a cluster's center, also known as the centroid, represents it. Collectively this is denoted as a codeword. The collection of all codewords makes a codebook. Figure 3.9 shows a conceptual diagram to explain this classification process.

Figure 3.9 Codebook creation using vector quantization
In figure 3.9, 2 speakers in the two-dimensional acoustic space are represented. Circles represent the speech features of speaker 1 and triangles represent the acoustic features of speaker 2. For each speaker in the database, a vector quantized codebook is created using the clusters of the extracted acoustic vectors. Figure 3.9 also shows the computed centroids and the codewords for the two speakers. The VQ-distortion gives the distance of a vector to the nearest codeword in a given codebook. The least cumulative distortion for a speaker to its corresponding codebook results in a match; else a decision for incorrect speaker recognition is

made. The process involves selecting K clusters initially and assigning a sample of the data to the closest cluster as given in equation (3.19).

x ∈ Ci if d(x, μi) ≤ d(x, μj) for all j ≠ i   (3.19)
This work uses the most popular and widely used Linde Buzo Gray (LBG) algorithm. Equation (3.20) is applied for the creation of a codebook.

vn⁺ = vn (1 + ε), vn⁻ = vn (1 − ε)   (3.20)

In equation (3.20), the codebook size is represented by n and the splitting parameter is given by ε.

In the LBG algorithm, the training sequences of speech signals are taken as inputs. Acoustic feature vectors extracted from the input speech signals of different speakers are the training vectors. Centroids are then found in the training sequence S = {xi ∈ Rᵈ}, where i = 1 to n. The centroids, which are the initial codebook values derived from the set of training patterns, are then split to make an appropriate set of codewords. D is the difference between the training vectors and the code vectors during the iteration process, with the codewords subsequently generated in the jth iteration, while D' denotes the (j − 1)th iteration. New centroids are then obtained and they replace the old ones. The above procedure is iterative in nature and is repeated until the desired number of codewords is generated, to arrive at a final convergence.

Thus, from the derived feature vectors, the LBG algorithm performs the complex task of building the VQ codebook by clustering the set of training vectors. It works to design the centroid of the complete set of training vectors, i.e. it initially designs the 1-vector codebook. The codebook size is doubled by splitting each codebook entry. For each individual training vector, a search for the nearest neighbour codeword in the current codebook is done. Similarity to the closest codeword is denoted as a match, and the vectors are subsequently assigned to that nearest codeword. Iteration then updates the codewords by using the newly computed centroids of the training vectors. The process of codebook creation is an iterative activity and is continued till the required codebook level is accomplished. For this work, a codebook of level 8 has been created. Figure 3.10 represents the process of the implementation of the LBG algorithm.
Figure 3.10 The LBG Algorithm
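A compact sketch of the LBG loop of figure 3.10: each centroid is split by the (1 ± ε) rule of equation (3.20), then assignments and centroids are refined; the fixed number of refinement iterations used here stands in for the distortion-threshold test of the figure.

```python
import numpy as np

def lbg_codebook(vectors, size=8, eps=0.01):
    """Grow a VQ codebook from 1 to `size` centroids by splitting and k-means refinement."""
    codebook = vectors.mean(axis=0, keepdims=True)        # the initial 1-vector codebook
    while len(codebook) < size:
        codebook = np.vstack([codebook * (1 + eps),       # equation (3.20): split each entry
                              codebook * (1 - eps)])
        for _ in range(20):                               # refine: assign, then re-centre
            d = ((vectors[:, None, :] - codebook[None]) ** 2).sum(axis=2)
            labels = d.argmin(axis=1)                     # nearest-codeword assignment
            for j in range(len(codebook)):
                if np.any(labels == j):
                    codebook[j] = vectors[labels == j].mean(axis=0)
    return codebook
```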

3.5.3 SVM Classifier

SVM is a discriminative classifier, a supervised learning algorithm which is widely used in the field of pattern recognition. The classifier separates the training data available in the form of classes by constructing hyperplanes which maximize the margin between the training classes. For a feature space in n dimensions, a hyperplane is represented as in equation (3.23).
f(x) = xᵀw + b = Σ i=1..n xi wi + b = 0   (3.23)
Further division of equation (3.23) by ‖w‖ gives equation (3.24).

f(x)/‖w‖ = (xᵀw + b)/‖w‖   (3.24)

In equation (3.24), the data x is projected on the plane and the projection is given by f(x)/‖w‖. The plane's normal direction is represented by w. Thus, the space which was defined in n dimensions is now divided into two classes. The mapping function y = sign(f(x)) ∈ {1, −1} for the data points is given by equation (3.25).

y = sign(f(x)) = +1 if f(x) > 0, x ∈ P; y = sign(f(x)) = −1 if f(x) < 0, x ∈ N   (3.25)

The hyperplane maps all data points lying on its positive side to +1, and a data point will be marked as −1 if it is on the negative side of the plane. Here x ∈ P represents a data point on the positive side, while x ∈ N represents a data point on the negative side of the plane.

3.5.4 The deep learning classifier - CNN

A very popular variant of the deep neural networks is the CNN, and it is popularly applied in approaches to speaker recognition problems [153]. Its cutting-edge architecture and optimizing capability give a very impressive near-human performance [154]. The CNN follows the feed forward architecture. These networks, which are typically deep learning models, are also known as multi-layer perceptrons. These are essentially "feed forward" models as the flow of inputted data is along the model; the network has no feedback connections. CNNs have advancements such as weight sharing, convolution filters and pooling, which help the processing of information using functions that are non-linear. An end to end phoneme detection problem was implemented successfully using a CNN by processing speech signals directly with minimal prior knowledge or assumptions [155].

CNN networks comprise manifold convolution layers. The three elementary layers are convolution, followed by pooling, and finally the nonlinearity stage. Features which are complex and have a high level of abstraction are learnt easily by these classifiers. The heart of a CNN network is the pooling layer, which helps reduce the dimension of the extracted feature set. The effectiveness of the ASR task greatly depends on the Maxout, which is the widely used nonlinearity [156]. The fundamental CNN network is given in figure 3.12.

Figure 3.12 CNN architecture

Filters or kernels moving across the incoming data in a specified order perform the convolution. Equation (3.26) computes the convolution for the lth layer.

Ak(l) = f( Σ c Wk(l),c ∗ X(l−1),c + Bk(l) )   (3.26)

In equation (3.26), the number of the kernel is denoted by k and the number of an inputted channel is denoted by c. For this channel, the convolution of kernel k is represented by Wk(l),c ∗ X(l−1),c. Bk(l) is the kth kernel's bias, which is learnt, while the activation function is given by f and ∗ ensures multiplication of the elements [157].
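Equation (3.26) for a single 1-D convolution layer can be sketched in plain code; ReLU is assumed here as the activation f (the text mentions Maxout as another choice), and valid-mode convolution is used.

```python
import numpy as np

def conv1d_layer(X, W, B):
    """A_k[t] = f( sum_c sum_j W[k,c,j] * X[c,t+j] + B[k] ), with f = ReLU."""
    C, T = X.shape                   # channels x time
    K, _, J = W.shape                # kernels x channels x kernel width
    A = np.zeros((K, T - J + 1))
    for k in range(K):
        for t in range(T - J + 1):
            A[k, t] = np.sum(W[k] * X[:, t:t + J]) + B[k]   # elementwise product, summed
    return np.maximum(A, 0.0)        # the non-linearity
```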

The pooling block, added in between several convolution layers, allows the passage of extracted features along the several layers of pooling. This aids in reducing the dimension of the feature map. For obtaining a smaller and deeper feature set, the network subsequently attaches fully connected layers once the multiple convolution layers, along with the subsequent pooling layers, are computed. The CNN network finally uses the Softmax layer to flatten it, resulting in a multi-class, completely connected layer.

Having discussed the steps involved in implementing the text independent voice based ASR system, the next section discusses the execution of the system and analyses the results obtained.
3.6 Implementation and Results for ASR

The VidTIMIT dataset [158] is an audio-video database of 43 video speakers, of which 19 are female and 24 are male. Each speaker has recited sentences of small durations. The text is taken from the corpus of the TIMIT database, which is used in various research topics related to speech and speaker recognition.

The details of the dataset need to be discussed as they influence the speaker characteristics. The recording of the speaker recitations was conducted in 3 separate sessions, keeping a gap of 7 days between sessions 1 and 2, while a delay of 6 days was kept before recording session 3. This delay in recording was introduced for the purpose of ensuring variation in the speaker's voice, which is influenced by the change in physical appearance like clothes, make-up and hairstyle. The different moods of the speakers have affected the pronunciation, thus allowing the incorporation of different attributes that can influence the speaker recognition computation. The random setting of the zoom factor of the camera used for recording additionally adds variation to the recorded images.

The database consists of 10 sentences spoken per person, where 6 sentences were recorded in the 1st session, 2 sentences in the 2nd session and the last 2 sentences in the 3rd session. It is a text independent database; however, the first two sentences for all speakers are the same while the rest are different. The average length of the recorded sentences is 4.25 seconds. With 25 fps, approximately 106 video frames have been extracted per sentence. The profile of the speaker and 3-dimensional information can also be derived from the head rotation performed by the speakers. Ambient noise is included in the recording as the office environment was noisy. Sequentially numbered JPEG images with a resolution of 384 x 512 pixels are available, and the audio recording is stored as a 16-bit, 32 kHz wav file.

The details of the VidTIMIT database at a glance can be obtained from table 3.2.

Table 3.2 Specifications of the VidTIMIT Database

List                  Details                         Description
No. of speakers       43                              24 males & 19 females
Recording sessions    10 sentences/person             Delay of 7 days between sessions 1 & 2;
                                                      delay of 6 days between sessions 2 & 3
Session 1             6 sentences
Session 2             2 sentences
Session 3             2 sentences
Mean duration of      4.25 secs.                      106 video frames @25fps
sentences
Sentences             First two sentences are the     Delay between recording sessions allowed for
                      same; the remaining eight       changes in the hairstyle, clothing and make-up,
                      are different for all speakers  which affected the mood and hence influenced the voice.
Noisy office          Speech stored as a mono,        After each recording the zoom factor of the
environment           16-bit, 32 kHz .wav file;       camera was randomly perturbed.
                      images stored as JPEG

3.6.1 Discussion and Results for VQ


The audio signal of 43 speakers, speaking 10 sentences each, was examined to compute ASR using the VQ method. The sampling of the speech signal is done at 32 kHz. To obtain a discrete speech signal spectrum, employment of triangular filter banks is necessary. The audio signal, which is constantly varying, was windowed for framing to make the signal quasi-stationary. A Hamming window was then applied to move gradually from frame to frame, as it decreases the consequences of abrupt variation; additional noise gets introduced, especially in the high frequency components, due to abrupt changes when windowing. An overlap of 25% was applied for two primary reasons: it ensures continuity between frames and it helps in retaining

speaker-specific signal features. The number of MFCC coefficients retained from every frame is 13. Vector quantization of these features created a codebook of size 8. Experiments were carried out with 26 speakers out of the 43 speakers of the VidTIMIT database. A total of 260 samples have been considered for the first set of experimentation, as there are recordings of 10 text independent sentences per speaker.
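The codebook of size 8 described above is typically built by clustering the training MFCC vectors. A minimal k-means sketch in NumPy follows; the LBG algorithm often used in the VQ literature differs in its split-based initialization, so treat this as an illustration of the clustering idea rather than the exact procedure of this work:

```python
import numpy as np

def build_codebook(features, k=8, iters=20, seed=0):
    """Cluster MFCC frame vectors into a k-entry VQ codebook."""
    rng = np.random.default_rng(seed)
    codebook = features[rng.choice(len(features), k, replace=False)]
    for _ in range(iters):
        # Assign each frame to its nearest codeword (Euclidean distance).
        d = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Move each codeword to the mean of its assigned frames.
        for j in range(k):
            if np.any(labels == j):
                codebook[j] = features[labels == j].mean(axis=0)
    return codebook

frames = np.random.default_rng(1).normal(size=(500, 13))  # stand-in for 13-D MFCC frames
cb = build_codebook(frames)
print(cb.shape)
```

At recognition time, a test utterance is quantized against each speaker's codebook and the speaker with the smallest total distortion is selected.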

Before moving on to the classification of feature vectors by employing the VQ classifier, the configuration and extraction of MFCC features need to be discussed. The following segment details the extraction of the classical, handcrafted MFCC features. As phonetically important characteristics of speech are captured with Mel filters spaced linearly at low frequencies and logarithmically at high frequencies, the filter bank applied for feature extraction was uniformly spaced before 1 kHz and logarithmically scaled after 1 kHz.
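The linear-below-1 kHz, logarithmic-above behaviour can be illustrated with the standard Hz-to-mel conversion (a plain-NumPy sketch; the 2595/700 constants are the common HTK-style formula, not values taken from this chapter):

```python
import numpy as np

def hz_to_mel(f_hz):
    """Standard HTK-style mel-scale conversion."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

# Below ~1 kHz the mapping is close to linear; above it, it grows
# logarithmically, compressing the higher frequencies.
print(hz_to_mel([100, 500, 1000, 4000, 8000]))
```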

Taking an example of a sample audio sentence of speaker 1, i.e. fadgo/audio/sa1, the parameters chosen for MFCC feature extraction are listed as follows.

Speech signal duration    4.76 secs. for this sentence
Window                    10 msec. (320 samples obtained)
Window overlap            25%
Total no. of frames       633
Number of filters         24
Order of MFCC             13
Total coefficients        633 * 13
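These parameters are mutually consistent: a 10 ms window at 32 kHz is 320 samples, and a 25% overlap means a hop of 240 samples, which yields roughly the reported 633 frames for a 4.76 s signal (the exact count depends on the edge-padding convention):

```python
sr = 32_000            # sampling rate in Hz
duration = 4.76        # seconds
win = int(0.010 * sr)  # 10 ms window -> 320 samples
hop = int(win * 0.75)  # 25% overlap  -> 240-sample hop

n_samples = int(duration * sr)           # 152320 samples
n_frames = 1 + (n_samples - win) // hop  # frames that fit fully
print(win, hop, n_frames)
```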

The original waveform for frame 10 of a sample speech signal fadgo/audio/sa1 from the dataset considered is as shown in figure 3.13. Figure 3.14 is the image of the Hamming window applied, and figure 3.15 gives the nature of the waveform observed after application of the Hamming window.
Figure 3.13 Original signal fadgo/audio/sa1

Figure 3.14 Hamming Window

Figure 3.15 Waveform of frame 10 of the sample signal after applying Hamming window

With an overlap of 25% and a Hamming window of 10 msec, MFCC features of order 13 were extracted, i.e. a total of 13 MFCC features have been retained per frame. The preprocessed waveform is given in figure 3.16. Figure 3.17 gives the nature of the waveform observed after MFCC features were extracted from it. The same speech signal fadgo/audio/sa1 is considered. The figures are of frame 10.

Figure 3.16 Pre-processed waveform

Figure 3.17 Speech waveform after processing

The Euclidean distance as in equation 3.27 was calculated, and the results of the match obtained for all the 10 sentences of each speaker with the codebook were tabulated in the matrix as shown in table 3.3.

d(p, q) = d(q, p) = √((p₁ − q₁)² + (p₂ − q₂)² + … + (pₙ − qₙ)²) = √(Σⁿᵢ₌₁ (pᵢ − qᵢ)²)    (3.27)

In equation (3.27), d(p, q) is the Euclidean distance between points p and q.
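Equation 3.27 is straightforward to compute vectorially; a minimal NumPy sketch (the function name is mine, not from the text):

```python
import numpy as np

def euclidean(p, q):
    """Euclidean distance between two feature vectors (eq. 3.27)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sqrt(np.sum((p - q) ** 2))

# In VQ matching, each frame's MFCC vector is compared against every
# codebook centroid and assigned to the nearest one.
print(euclidean([0.0, 0.0], [3.0, 4.0]))  # classic 3-4-5 triangle -> 5.0
```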

Table 3.3 True match obtained for 26 speakers (a 26 x 26 confusion matrix of actual speakers 'a' to 'z' against recognized speakers; diagonal entries count the correctly recognized sentences out of 10 per speaker)
From the matrix created in table 3.3, for the 26 speakers, with each speaker speaking 10 text independent sentences, we observe that the speakers are labelled from 'a' to 'z'.

Total samples obtained: 26 * 10 = 260
True positives calculated = 186
ASR computed = 186/260 = 71.53%

Having obtained a speaker recognition accuracy of 71.53% for 26 speakers, the influence of varying window sizes along with varying percentage overlap of the chosen Hamming window on the performance of ASR was tested. The number of speakers considered here was the same.

Table 3.4 ASR performance for varying window sizes and percentage overlap

Size of window (msec)    % Overlap    ASR accuracy (%)
10                       25           72
10                       50           68
20                       25           70
20                       50           69
The results indicate that a window size of 10 msec. and a percentage overlap of 25% gave an optimum ASR accuracy of 72%, while increasing the overlap to 50% with the window size maintained at 10 msecs. in fact reduces the accuracy to 68%. Further increasing the window size to 20 msecs. with an overlap of 25% did improve the ASR to 70% from that of 68%, but it was once again observed that increasing the percentage overlap beyond 25% does not make any significant improvement to the accuracy of speaker recognition.

Hence, the VQ method of deriving ASR, which is an iterative process of finding centroids and arriving at the correct codebook size, was a computationally simple process. However, as the resulting ASR of 72% was not very promising, further experimentation was carried out with the parametric model GMM as the classifier.

3.6.2 Discussion and Results for GMM


A Gaussian Mixture Model (GMM) is a parametric probability density function represented as a weighted sum of Gaussian component densities. GMMs commonly model the probability distribution of vocal-tract related spectral features in a voice based biometric ASR system. GMM parameters are estimated from the features which represent the training data using the iterative EM algorithm or Maximum A Posteriori (MAP) estimation from a well-trained prior model. Armed with this discussion of GMM as a classifier, we now discuss how this classifier was applied on the MFCC feature vectors extracted in this work.
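In a GMM-based ASR, one model is trained per speaker and a test utterance is assigned to the speaker whose model yields the highest log-likelihood. The sketch below collapses the mixture to a single diagonal Gaussian per speaker to keep the EM machinery out of the way; a real system would fit several components per speaker with EM (e.g. scikit-learn's GaussianMixture), so this is an illustration of the scoring rule only:

```python
import numpy as np

def fit_speaker(frames):
    """Per-speaker model: mean and (diagonal) variance of MFCC frames."""
    return frames.mean(axis=0), frames.var(axis=0) + 1e-6

def log_likelihood(frames, model):
    """Sum of per-frame diagonal-Gaussian log densities."""
    mu, var = model
    return np.sum(-0.5 * (np.log(2 * np.pi * var) + (frames - mu) ** 2 / var))

rng = np.random.default_rng(0)
models = {"spk_a": fit_speaker(rng.normal(0.0, 1.0, (200, 13))),
          "spk_b": fit_speaker(rng.normal(3.0, 1.0, (200, 13)))}

test = rng.normal(3.0, 1.0, (50, 13))  # synthetic utterance from speaker b
best = max(models, key=lambda s: log_likelihood(test, models[s]))
print(best)
```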

3.7 Spectral and Temporal speech features classified with GMM, SVM and 1-D CNN classifiers for ASR

Having implemented the voice biometric based ASR using the clustering VQ and the parametric Gaussian model, there was scope for improvement where accuracy of ASR was concerned. With advances in machine learning and neural networks it became imperative to explore and apply these techniques for addressing the text independent speaker recognition problem. Recently, for pattern recognition problems, SVM - a supervised machine learning discriminative classifier - is increasingly being applied. Complex deep learning algorithms like Long Short-Term Memory (LSTM) networks, which are a type of recurrent neural network, are applied in learning for sequence prediction problems. However, in this research CNN has been explored to improve the text independent speaker recognition problem, where who is speaking is important regardless of what is being spoken.

The performance of ASR with classifiers SVM and GMM is compared, and subsequently the performance of the system with application of 1-D CNN is evaluated. Taking an example of a speech signal with a duration of 4.76 seconds, windowed with a Hamming function of 10 msec and 25% overlap and 32 order filter banks, 320 samples per window were obtained. 13 MFCC coefficients were retained, as conventionally they are deemed to contain enough speaker related information. Figure 3.28 shows the retained MFCC coefficients.

Figure 3.28 MFCC coefficients

Regarding the size of the dataset available for the ASR evaluation: the database consists of 10 sentences uttered by 43 speakers, i.e. 430 text independent speech signals were employed for the experimentation. For 13 MFCC features per frame classified with GMM, the accuracy of speaker recognition derived was 45.34%, while when classified with SVM an accuracy of 51.16% was achieved. Besides the retained 13 MFCC coefficients, as discussed in literature, relevant speaker specific information is also available in the higher order coefficients. To include their contribution in improving ASR, further evaluations were conducted considering 20 features and 40 features per frame. The computed accuracy in ASR improved with the MFCC coefficients: 52.43% accuracy was obtained with GMM classifying 20 feature vectors, and an accuracy of 63.95% was achieved when SVM was applied for classification. Further improvement was observed with 40 coefficients, as the GMM modelled features resulted in an accuracy of 55.81% and SVM modelling gave an accuracy of 79.05%. The experimental performance of the speaker recognition for a dataset of 430 test sentences is tabulated in table 3.8.
Table 3.8 Accuracy of speaker recognition with 13, 20 and 40 MFCC coefficients

No.   Extracted    No. of coefficients    Classifier    Accuracy of
      feature      per frame                            ASR (%)
1.    MFCC         13                     GMM           45.34
                                          SVM           51.16
2.    MFCC         20                     GMM           52.43
                                          SVM           63.95
3.    MFCC         40                     GMM           55.81
                                          SVM           79.05

Results of ASR displayed in table 3.8 indicate that the GMM classification of feature vectors was outdone by SVM classification. The system performance with SVM showed an improvement of almost 24%.

Going ahead, SVM classifiers were selected over GMM. Research points out that for speech recognition problems, MFCC features carry sufficient information; however, speaker related information is also present in the dynamics of speech. Hence further experimentation was conducted to establish the role of the dynamic features in improving speaker recognition.

The static MFCC features, along with the Delta and the accelerated Delta-Delta features, were combined into a larger feature dimension for computing ASR. The nature of the Delta features is seen in figure 3.29 along with figure 3.30.

Figure 3.29 Delta MFCC coefficients

Figure 3.30 Delta2 MFCC coefficients

ASR was now computed with a larger feature set. Three sizes of feature set were chosen. The 1st set had 13 Delta features appended with 13 MFCC, the 2nd one had 20 Delta appended with 20 MFCC, and the third feature set, comprising 80 features, similarly had 40 MFCC coefficients with 40 dynamic coefficients added. The best accuracy of 86.04% was attained with

the set of 80 features, which is 26% higher than the accuracy achieved with 13 MFCC and 13 Delta features. Table 3.9 lists the results of ASR computed with different sizes of feature vectors.
Table 3.9 ASR with SVM classifier for static and dynamic coefficients

Features              Coefficients selected    ASR accuracy (%)
Static and dynamic    13+13=26                 60.64
                      20+20=40                 76.74
                      40+40=80                 86.04

From the results listed in table 3.9, the contribution of the temporal features to a significant improvement in ASR cannot be ruled out. Hence, it can be clearly stated that these coefficients also carry speaker specific characteristics. An improvement of around 10% in accuracy has been attained when an additional 40 differential coefficients have been considered. No significant difference in the accuracy of speaker recognition was, however, achieved with the accelerated Delta-Delta coefficients, and hence they were not processed further.
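The Delta coefficients are typically computed by a regression over neighbouring frames of the static MFCCs; a sketch of the standard formula d_t = Σₙ n(c_{t+n} − c_{t−n}) / (2 Σₙ n²), with the window half-width N = 2 being my assumption rather than a value stated in the text:

```python
import numpy as np

def delta(coeffs, N=2):
    """Delta features over frames (rows = frames, cols = MFCC order)."""
    # Repeat edge frames so every frame has N neighbours on each side.
    padded = np.pad(coeffs, ((N, N), (0, 0)), mode="edge")
    denom = 2 * sum(n * n for n in range(1, N + 1))
    out = np.zeros_like(coeffs, dtype=float)
    for n in range(1, N + 1):
        out += n * (padded[N + n: N + n + len(coeffs)]
                    - padded[N - n: N - n + len(coeffs)])
    return out / denom

mfcc = np.random.default_rng(0).normal(size=(633, 13))  # stand-in frames
d = delta(mfcc)   # 13 Delta features per frame
dd = delta(d)     # Delta-Delta (acceleration) features
print(d.shape, dd.shape)
```

Applying the same function to the Delta stream yields the Delta-Delta (acceleration) coefficients mentioned above.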

A larger feature set often results in an increase in computational complexity and redundancy, also known as the curse of high dimensionality. A model tries to overfit these redundant features, creating further problems. It was decided to apply a feature selection algorithm to overcome these limitations.

In this research the most relevant speaker specific information was selected by the Fisher score algorithm. Three experiments were carried out with feature vector sizes of 430*26, 430*40 and 430*80 dimensions. The best selected features were subsequently classified using the SVM classifier. The graph showing the Fisher scores of 40 coefficients is available in figure 3.31. The lower the score, the more relevant is the feature. The Fisher scores indicate that coefficients numbered 3, 4, 6, 10, 16, 28, 32, 34 and 38 carry significant speaker characteristics.
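The Fisher score of a feature measures how well it separates the speaker classes, via the ratio of between-class to within-class variance. A minimal per-feature sketch of the textbook definition (the chapter's exact normalization, and its convention that lower scores mark more relevant features, may differ from this form, where higher means more discriminative):

```python
import numpy as np

def fisher_scores(X, y):
    """Fisher score per feature column: between-class / within-class variance."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for cls in np.unique(y):
        Xc = X[y == cls]
        between += len(Xc) * (Xc.mean(axis=0) - overall_mean) ** 2
        within += len(Xc) * Xc.var(axis=0)
    return between / (within + 1e-12)

rng = np.random.default_rng(0)
y = np.repeat([0, 1], 100)
X = rng.normal(size=(200, 3))
X[:, 0] += 5 * y          # feature 0 strongly separates the two classes
scores = fisher_scores(X, y)
print(scores.argmax())
```

Ranking the columns by this score and keeping the top-k is exactly the selection step applied before the SVM in this chapter.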
Figure 3.31 Fisher scores for MFCC coefficients

For the cumulative features the ASR accuracy was computed. A comparison of the performance of the system with the best features selected against that achieved before the feature selection algorithm was applied is given in table 3.10.
Table 3.10 System performance after feature optimization

Combination of extracted    Number of useful    ASR accuracy before    ASR accuracy after
coefficients                features            Fisher score (%)       feature optimization (%)
13 MFCC + 13 Delta          12                  60.64                  56.97
20 MFCC + 20 Delta          15                  76.74                  77.90
40 MFCC + 40 Delta          27                  86.04                  94.51

An ASR accuracy of 56.97% was achieved for the best 12 scores from a total of 26 scores obtained for the combination of MFCC with Delta coefficients. For the 2nd combination of 40 features, 77.90% speaker recognition accuracy was achieved with 15 features selected by the Fisher score algorithm, and from the last combination of 80 features, 94.51% accuracy of ASR was attained with 27 Fisher score selected optimum features.
The performance of ASR for GMM and SVM modelled with the three sets of static and dynamic features is pictorially represented in figure 3.32, figure 3.33 and figure 3.34 respectively:
i) Performance of the speaker recognition system modelled with GMM/SVM classifying three different combinations of feature set.
ii) Performance of the speaker recognition system with the improved feature set of concatenated MFCC and Delta features.
iii) ASR with the best features selected from the concatenated feature set after application of the Fisher score algorithm.

Figure 3.32 ASR with GMM/SVM as classifiers for different dimensions of MFCC coefficients

Figure 3.33 ASR performance with static and dynamic features modelled with SVM
Figure 3.34 Comparison of system performance with optimized features with that before application of the Fisher score algorithm

From the different experiments carried out, it can be observed that increasing the number of MFCC features did improve the accuracy of speaker recognition. The machine learning SVM classifier, however, gave better accuracy as compared to that achieved with the parametric GMM classifier. Addition of dynamic features to the static features further improved the performance of the system. Finally, the best performance of the ASR system with voice biometrics was computed by selecting the best features from the large feature set with application of the Fisher score algorithm. An 8.47% improvement in the accuracy of person recognition was achieved with feature optimization, as the accuracy went up to 94.51% from 86.04%.

The conventional criterion for computing the system performance is the accuracy of person recognition. However, robustness of the biometric system model can also be judged using other measures [160]. These performance parameters are defined as follows:
i) Precision: True predicted positive observations per overall predicted positives.
ii) Recall: It is the ratio defining the rightly predicted positives obtained, to the entire class of observations.
iii) F1 Score: It is the measured mean of recall and precision. The incorrect positive observations and the incorrect negative observations are included for the score computation. F1 score usually is more suitable than accuracy for an uneven class distribution. For dissimilar distribution of the computed false negatives and false positives, the parameter of performance considered should be Recall or Precision. Table 3.11 gives the model performance in terms of the parameters discussed.
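These three measures follow directly from the counts of true/false positives and negatives. A self-contained sketch of the binary case (the chapter's model is multi-class, so in practice these would be computed per speaker and averaged; the example counts are illustrative, not from the text):

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from prediction counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# e.g. 80 correct detections, 10 false alarms, 20 misses
p, r, f = precision_recall_f1(tp=80, fp=10, fn=20)
print(round(p, 4), round(r, 4), round(f, 4))
```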
Table 3.11 Performance of the ASR model with different performance parameters

Number of optimized    Accuracy    Precision    Recall    F1 Score
feature vectors
13 0.6976 0.5975 0.6788 0.5989
15 0.7790 0.7410 0.7552 0.7073
20 0.8255 0.7943 0.7987 0.761 2
24 0.9069 0.8891 0.9041 0.8810
27 0.9451 0.9437 0.9604 0.9461
29 0.9534 0.9065 0.9065 0.9008
34 0.9186 0.8577 0.8719 0.8562
40 0.8023 0.7991 0.7999 0.7636
44 0.8023 0.7991 0.7999 0.7636
65 0.8023 0.7991 0.7999 0.7636
80 0.8023 0.7991 0.7999 0.7636

For the designed text independent ASR system, the range of accuracy achieved considering various optimized feature sets was from 69.76% at the least to 95.34% at the maximum. The maximum accuracy of 95.34% results from 29 optimally selected features. However, considering the precision, recall and F1 score performance parameters, the model with 27 optimized features and ASR accuracy of 94.51% proves better than the other models. For this ASR model the value of the Precision performance parameter is 0.9437, indicating 94.37% accurate prediction of the positives. The Recall parameter obtained is 0.9604; the model is supposed to perform well when the value is greater than 0.5. The computed F1 score is 0.9461.

Figure 3.35 is a plot of accuracy of the ASR model versus the computed Fisher scores for the
extracted features. 27 feature coefficients resulted in an optimized model performance.
Figure 3.35 ASR system accuracy for various feature set sizes

In the domain of computer vision, CNN finds applications primarily in analyzing images. In this work Conv 1D CNN, a specific form of DNN, worked on the 430 data signals belonging to the 43 speakers of the VidTIMIT database which was considered for all the previous experimentation. Figure 3.36 shows the parameters chosen for implementing ASR with the CNN classifier.

Figure 3.36 Implementation of the CNN classifier (two convolution layers, each followed by a ReLU activation to separate the non-linear classes, a dropout of 0.1 to reduce overfitting and a max-pooling layer of size 8; a fully connected dense layer flattens the learnt high level features; a softmax activation handles the multi-class problem; rmsprop optimizer to accelerate the learning direction; loss type: cross entropy/log loss)
A train/test ratio of 80:20 was selected, dividing the dataset of 430 sentences into 344 sentences as training data and 86 sentences as testing data. ReLU was used as the activation of the convolution layer to separate the 43 non-linearly separable classes, as the network finds the non-linear margins with this rectifier function. The introduction of a dropout of 0.1 helps decrease over-fitting issues. Subsequently a max-pooling layer was added in the network.

A second convolution layer with the same sequence as discussed was added again in the network. Finally, the fully connected layer flattened the features learnt by the network. To suit the classification of 43 classes, a final 43-unit activation function - Softmax - was added as the dense layer. To accelerate the learning direction, the rmsprop optimizer compiled the created model. The loss type selected in this speaker classification problem was cross entropy.

For the training data and testing data, the model losses have been plotted in figure 3.37, while figure 3.38 displays the computed training accuracy and the testing accuracy.
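To make the data flow through one stage of figure 3.36 concrete, the sketch below runs a single convolution → ReLU → max-pool pass in plain NumPy on one 80-dimensional feature vector (filter values and counts are illustrative; a real implementation would stack two such stages, add dropout and a softmax dense layer, and train the whole network, e.g. with Keras):

```python
import numpy as np

def conv1d(x, kernels):
    """Valid 1-D convolution of x with each kernel (one output column per kernel)."""
    k = kernels.shape[1]
    windows = np.stack([x[i: i + k] for i in range(len(x) - k + 1)])
    return windows @ kernels.T           # shape: (len(x)-k+1, n_kernels)

def relu(a):
    return np.maximum(a, 0.0)

def maxpool(a, size=8):
    """Non-overlapping max-pooling along the time axis."""
    n = (a.shape[0] // size) * size
    return a[:n].reshape(-1, size, a.shape[1]).max(axis=1)

x = np.random.default_rng(0).normal(size=80)            # one 80-D feature vector
kernels = np.random.default_rng(1).normal(size=(4, 3))  # 4 filters of width 3
out = maxpool(relu(conv1d(x, kernels)), size=8)
print(out.shape)
```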
Figure 3.37 Training loss and testing loss of the model
Figure 3.38 Training and testing validation accuracy
From figure 3.37 it can be observed that the training and testing loss curves are non-linear, but they are synchronized. Both the losses decrease with the increase in the epoch values and become constant at approximately 600 epochs, which indicates that the model does not overfit the features. The training and the validation accuracy curves in figure 3.38 are also increasing; however, the testing accuracy observed is quite less than the training accuracy.

To further reduce the difference in the train/test accuracies, which indicates a slight overfitting issue, a modified value of 0.15 for dropout was experimented with. This resulted in an improved model.

With the delta coefficients affixed to the handcrafted features, a one-dimensional feature set of 80 was available for the CNN classifier. Model accuracy for training was computed at 94.77%, while the accuracy of testing also showed an improvement of 3%, as the accuracy computed was 73.25%. Figure 3.39 indicates that the training data is not memorized, as the gap between the two curves is less here and a well-trained model has been created, with the training losses and the testing losses decreasing together. The generalization capability of the model became much better with increase in epochs; however, it stagnates after a while. The training and testing model accuracy is shown in figure 3.40.

Figure 3.39 Model losses for system trained with static plus dynamic coefficients

Figure 3.40 Training and testing validation accuracy

Even though CNN is traditionally applied for image analysis, and when used with speech signals it usually works on the spectrogram of the audio signal, this work has proved that it works well with speech vectors directly; we have observed that the SVM classified model did not perform as well as the CNN based ASR system.

3.7.1 MPSTME database creation

Creation of a database was part of the work carried out for this research. As all the experimentation was carried out on a standard database, there was a need to validate the results of ASR accuracy obtained in a real time situation. Additionally, there was no database available in the Indian scenario to validate the results. It was hence decided to create a database in the college environment, which is noisy enough for a real time situation.

Keeping the standard database as a reference, an audio video database was created with 80 speakers comprising students, faculty, and staff of MPSTME. Like the VidTIMIT database, the number of sentences per speaker was chosen as 10: 5 sentences from the VidTIMIT database were retained, 4 sentences were framed with a word in each sentence containing all the vowels, and 1 sentence was of the speaker's choice. The recording was done in a noisy but well-lit office environment over a period of 6 months. A database of 72 speakers was finally selected for validating the work done on the standard database, as the recordings of 8 speakers were not quite audible. Of the chosen 72 speakers, 39 are males and 33 are females in the age group of 18-52 years. Each speaker recorded 10 sentences of an average duration of 4 secs each. The speech is available as a .wav signal which is 16-bit and sampled at a 48 kHz frequency, while the images extracted from the videos have been stored as JPEG. Table 3.12 gives an overview of the MPSTME database.
Table 3.12 Specifications of the MPSTME Database

List                  Details                          Description
No. of speakers       72                               39 males & 33 females (age group 18-54 yrs)
Recording sessions    10 sentences/speaker             Recorded over a period of 6 months
Mean duration of      4 secs.                          110/120 frames/video @25fps
sentences
Sentence details      a) 5 from VidTIMIT database      E.g.
                      b) 4 created - each with a       a) The clumsy customer spilled some expensive perfume.
                         word containing all the       b) You are authorized to use the automobile.
                         vowels                        c) My name is .....
                      c) 1 sentence of speaker's
                         choice
Office environment    Speech stored as mono, 512       Videos recorded with Intex web cam at two different
                      kbps, 16-bit, 48 kHz .wav file   locations at MPSTME in a well-lit but noisy
                      & images stored as JPEG          environment

For 72 speakers, the database size was 720 sentences. Static and dynamic features were extracted from these sentences. A train/test ratio of 80:20% was considered for processing the data. The SVM algorithm and one-dimensional convolution neural networks were used to classify the extracted feature vectors. The Fisher score algorithm further optimized the features, thus removing redundant information. The plot of the Fisher scores obtained for 80 features is as given in figure 3.41.
Figure 3.41 Fisher scores for a total of 80 MFCC and Delta features

The optimized 44 features were computed by the algorithm as shown in figure 3.42.

Figure 3.42 Feature selection from the optimized features set

The results of ASR computed are as listed in table 3.13.

Table 3.13 Accuracy of ASR for voice features of MPSTME with SVM

Features      Optimized features    Classifier    No. of speakers    % Accuracy of ASR
MFCC (40)     44                    SVM           72                 88.5
+ Delta (40)

An accuracy of 88.5% was achieved with the 44 optimized features considered from the concatenated static and dynamic features and the SVM classifier.
1-dimensional CNN was then applied on 576 training samples and tested with 144 samples for the same set of 44 optimized features. The accuracy of ASR computed is mentioned in table 3.14. The CNN model training parameters preferred were the same as the ones considered when experimenting with the VidTIMIT database.
Table 3.14 Accuracy of ASR for voice features of MPSTME database with 1-D CNN

Features      Optimized features    Classifier    Training accuracy (%)    Testing accuracy (%)
MFCC (40)     44                    1-D CNN       98.57                    75
+ Delta (40)

The graph for training and testing accuracy for the 500 epochs considered is as shown in figure 3.43.

Figure 3.43 Training/Testing accuracy of ASR using CNN for the MPSTME database

su mm ary of AS R fro m vo ice


bio me tri cs

An ASR system forms a key requirement in today's increasingly digital world, where common utilities such as teller machines, security services, online shopping and banking have adopted automatic working systems. Protecting identities and preventing spoofing is another crucial application of ASR. However, developing an effective system to authenticate the user poses a challenge. Addressing this obstacle, the fundamental and non-invasive features of voice biometrics were identified and implemented as an appropriate choice to enhance ASR.

This investigation specifically explored two key aspects - the speech features to be considered and their classification - to improve the ASR system. Though MFCC are the most preferred low level coefficients for speaker recognition, the contribution of Delta and high level features was explored in this research, and their inclusion in the implementation proves that the resultant ASR system is more robust.
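The Delta (differential) features referred to above are obtained from the static MFCC trajectory by a regression over neighbouring frames. The following numpy sketch shows the standard formula; the window size N = 2 and edge-replication padding are assumptions for illustration, not necessarily the exact settings used in this work:

```python
import numpy as np

def delta(feats, N=2):
    """Delta (differential) features via the standard regression formula
    d_t = sum_{n=1..N} n * (c_{t+n} - c_{t-n}) / (2 * sum_{n=1..N} n^2).

    feats: (T, D) array of frame-wise coefficients (e.g. 40 MFCCs per frame).
    Returns a (T, D) array; frames are edge-padded before differencing.
    """
    T, D = feats.shape
    denom = 2 * sum(n * n for n in range(1, N + 1))
    padded = np.pad(feats, ((N, N), (0, 0)), mode="edge")
    d = np.zeros_like(feats, dtype=float)
    for t in range(T):
        acc = np.zeros(D)
        for n in range(1, N + 1):
            # padded index of frame t is t + N
            acc += n * (padded[t + N + n] - padded[t + N - n])
        d[t] = acc / denom
    return d

# A linear ramp has a constant slope, so its delta is that slope
ramp = np.outer(np.arange(10, dtype=float), np.ones(3)) * 0.5  # slope 0.5
print(delta(ramp)[4])  # interior frame -> [0.5 0.5 0.5]
```

Concatenating the 40 static MFCCs with their 40 deltas gives the 80-dimensional vectors ranked by the Fisher score above.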

The second important work done was to apply the Fisher score algorithm to optimize the large feature matrix, eliminating redundant data and reducing computational complexity. The accuracy increased to a great extent, from 86.04% to 94.51%, when 27 optimal features were selected for training from a collection of 80 features and classified using SVM.

Next, an extensive literature review revealed that CNN is conventionally applied in image-style applications; the successful application of CNN directly on the vectors of speech features, instead of the orthodox technique of using CNN on the spectrogram of frames, contributes towards the research area of ASR.

The computed results indicate that high convolution features, in addition to differential features, encompass significant data of the speaker, resulting in a high accuracy of 94.77%. Table 3.15 summarizes the results of chapter 3.
:,

Table 3.15 Summary of ASR with voice modality

Database   % ASR with MFCC + GMM    % ASR with MFCC+DELTA    % ASR with Features Optimized    % ASR with CNN
           on clean signal          + SVM                    with Fisher Score + SVM          Train     Test
           (Closed Set)             (Closed Set)
VidTIMIT   82.95                    86.04                    94.51                            94.77     73.25
MPSTME     -                        -                        88.5                             98.57     75

As this research explores the contribution of lip movement in aiding the voice biometric in improving the ASR system, the next chapter discusses the lip biometric modality and its contribution in speaker recognition.
References

[1] …, "Forensic speaker recognition," ….
[2] …, "INTERPOL survey of the use of speaker identification by law enforcement agencies," Forensic Sci. Int., Vol. 263, pp. 92-100, 2016.
[3] J. Fierrez, "Adapted Fusion Schemes for Multimodal Biometric Authentication," Ph.D. thesis, Madrid, Spain, 2006.
[4] A. K. Jain, A. Ross and S. Pankanti, "Biometrics: A tool for information security," IEEE Trans. Inf. Forensics Security, Vol. 1, No. 2, pp. 125-143, Jun. 2006.
[5] W. Meng, D. S. Wong, S. Furnell and J. Zhou, "Surveying the development of biometric user authentication on mobile phones," IEEE Commun. Surveys Tuts., Vol. 17, No. 3, pp. 1268-1293, 3rd Quart. 2015.
[6] M. O. Oloyede and G. P. Hancke, "Unimodal and Multimodal Biometric Sensing Systems: A Review," IEEE Access, Vol. 4, pp. 7532-7555, 2016, doi: 10.1109/ACCESS.2016.2614720.
[7] … and D. A. Ramli, "A Review of Multibiometric System with Fusion Strategies and Weighting Factor," Int. J. Comput. Sci. Eng. (IJCSE), Vol. 2, No. 4, pp. 158-16…, Jul. 2013.
[8] R. Ryu, S. Yeom, S.-H. Kim and D. Herbert, "Continuous Multimodal Biometric Authentication Schemes: A Systematic Review," IEEE Access, Vol. 9, pp. 34541-34557, 2021, doi: 10.1109/ACCESS.2021.3061589.
[9] C. Prathipa and L. Latha, "A survey of biometric fusion and template security techniques," Int. J. Adv. Res. Comput. Eng. Technol., Vol. 3, No. 10, pp. 3511-3516, 2014.
[10] Rutva Safi, "Biometrics - A boon for farmers," …, 10th Feb. 2020.
[11] Mohamed Cheniti et al., "Symmetric sum-based biometric score fusion," IET Biom., Vol. 7, Iss. 5, pp. 391-395, 2018.
[12] Soumi Ghosh, Ajay Rana and Vineet Kansal, "A statistical comparison for evaluating the effectiveness of linear and nonlinear manifold detection techniques for software defect prediction," International Journal of Advanced Intelligence Paradigms (IJAIP), Vol. 12, No. 3/4, 2019.
[13] Kresimir Delac and Mislav Grgic, "A Survey of Biometric Recognition Methods," 46th International Symposium Electronics in Marine, ELMAR-2004, Zadar, Croatia, pp. 182-197, 16-18 June 2004.
[14] J. Sanchez et al., "Toward a Universal Synthetic Speech Spoofing Detection Using Phase Information," IEEE Transactions on Information Forensics and Security, Vol. 10, No. 4, pp. 810-820, April 2015, doi: 10.1109/TIFS.2015.2398812.
[15] G. K. Berdibaeva et al., "Pre-processing voice signals for voice recognition systems," 18th International Conference of Young Specialists on Micro/Nanotechnologies and Electron Devices (EDM), Erlagol, pp. 242-245, 2017, doi: 10.1109/EDM.2017.7981748.
[16] I. Bisio et al., "Gender-Driven Emotion Recognition Through Speech Signals for Ambient Intelligence Applications," IEEE Transactions on Emerging Topics in Computing, Vol. 1, No. 2, pp. 244-257, Dec. 2013, doi: 10.1109/TETC.2013.2274797.
[17] L. R. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition, Vol. 14. Englewood Cliffs, NJ, USA: Prentice-Hall, 1993.
[18] … and G. Riccardi, "Fusion of acoustic … traits recognition," 2014 IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Florence, 2014.
[19] D. A. Reynolds and R. C. Rose, "Robust text-independent speaker identification using Gaussian mixture speaker models," IEEE Transactions on Speech and Audio Processing, Vol. 3, No. 1, pp. 72-83, Jan. 1995.
[20] ….
[21] …, "… speaker identification and verification …," IEEE Transactions on Audio, Speech, and Language Processing, Vol. 14, No. 1, Jan. 2006.
[22] Muda, L., Begam, M., and Elamvazuthi, I., "Voice recognition algorithms using Mel frequency cepstral coefficient (MFCC) and dynamic time warping (DTW) techniques," 2010. [Online]. Available: arxiv.org/abs/1003.4083.
[23] Young, S., et al., "The HTK Book," Version 3.0. [Online].
[24] S. S. Tirumala et al., "Speaker identification features extraction methods: A systematic review," Expert Syst. Appl., Vol. 90, pp. 250-271, Dec. 2017.
[25] … et al., "… Normalized Filter Bank … Neural-Network-Based Robust Speech Recognition," IEEE Signal Processing Letters, Vol. 24, No. 4, pp. 377-…, 2017.
[26] V. Panayotov, G. Chen, D. Povey and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), pp. 5206-5210, April 2015.
[27] X. Zhang et al., "Noise Robust Speaker Recognition Based on Adaptive Frame Weighting in GMM for i-Vector Extraction," IEEE Access, Vol. 7, pp. 27874-27882, 2019, doi: 10.1109/ACCESS.2019.2901812.
[28] A. Chowdhury and A. Ross, "Fusing MFCC and LPC Features Using 1D Triplet CNN for Speaker Recognition in Severely Degraded Audio Signals," IEEE Transactions on Information Forensics and Security, Vol. 15, pp. 1616-1629, 2020.
[29] M. Sahidullah, T. Kinnunen and C. Hanilci, "A comparison of features for synthetic speech detection," in Proc. 16th Annu. Conf. Int. Speech Commun. Assoc., Dresden, Germany, Sep. 2015.
[30] Urmila Shrawankar and V. M. Thakare, "Techniques for Feature Extraction in Speech Recognition System: A Comparative Study," International Journal of Computer Applications in Engineering, Technology and Sciences (IJCAETS).
[31] …, "… classification using spectrograms," Appl. Soft Comput., Vol. 52, pp. 28-38, Mar. 2017.
[32] K. W. E. Lin et al., "Singing voice separation using a deep convolutional neural network trained by ideal binary mask and cross entropy," Neural Comput. Appl., pp. 1-14, Dec. 2018.
[33] Q. T. Nguyen and T. D. Bui, "Speech classification using SIFT features on spectrogram images," Vietnam J. Comput. Sci., Vol. 3, No. 4, pp. 247-257, 2016.
[34] M. Todisco, H. Delgado and N. Evans, "Constant Q cepstral coefficients: A spoofing countermeasure for automatic speaker verification," Comput. Speech Lang., Vol. 45, pp. 516-535, 2017.
[35] L. R. Rabiner and R. W. Schafer, Theory and Applications of Digital Speech Processing. Upper Saddle River, NJ, USA: Pearson, 2011.

[36] E. A. S. Alzqhoul and …, "… coefficients for forensic …," … Comput. Sci. …, pp. 1-6, 201….
[37] Lei, Y., Scheffer, N., Ferrer, L. and McLaren, M., "A novel scheme for speaker recognition using a phonetically-aware deep neural network," Proc. IEEE ICASSP, 2014, pp. 1695-1699.
, I ,c1 & I- M c
Dc rrno11 . 1 1 \,f e No al)d
- <Yr-
,,, ' . J <m n:>a\o
, ,,r,J J
li'll l('-x-t-d ep en de ni !:: ) r >nn ,1n 11:1h:7 I JO 1 , l I)
fj,r ~m s ll f 0,'> lpl" I ea ke r Ve rifr ce r ne11 l""I\
I -<
' cat ion rn )() f 4 ff. f 1-· f;
~n••"" ., ,,.,_.,•,·h mul -"•IR,ml l'r nc en .
• • ,,,R 0C AS Sf>
J Fl 11,t>,.►· 11;, ,.i;,.,1 rn,,,.,,, .
"' . Ol"ence. PP 4()5 2 '\6 "" '"
S . C he ok an d M S I\ . lil an ...
,, ... ;11t lf• •
.\ Ja_ t:!'im. N . · -40
tam - \\ . ' ·'· .
.,.
. Y. I\ rob us t ~p k er tde nti tic atio
, 1 4 I<
jj-o m a mo de l of the au dit ory pe np he ry: · PL o') ea n qY,tern u, 1n11.
11~
r I . , ON E. Vol. I I No 7. Jul 2016
"- ,..--<.('(' - ep en dent Sp ea ke .
nJ! ir I:'! al. "Te,·1-l nd de nt,ficatio .
-32 2 n Th rou gh Feat ure Fu sio n and Dee
t ,,1i a Vo l. 8. pp . 32 I 87 d ·. P Neurnl
_..1 ~ ,11 /£££
lfc ce ss. 02 , 20 20 SS 2
. . B . . o1. I 0. 1 I 09 ACCE . 01 0 .297JSi.t I
[41] I. Goodfellow, Y. Bengio and A. Courville, Deep Learning. Cambridge, MA, USA: MIT Press, 2016.
[42] H. F. Nweke, Y. W. Teh, M. A. Al-garadi and U. R. Alo, "Deep learning algorithms for human activity recognition using mobile and wearable sensor networks: State of the art and research challenges," Expert Syst. Appl., Vol. 105, pp. 233-261, Sep. 2018.
[43] Karthik et al., "Implementation of neural network and feature extraction to classify ECG signals," arXiv:1802.06288, 2018. [Online]. Available: http://arxiv.org/abs/1802.06288.
[44] S. Elshamy and T. Fingscheidt, "DNN-Based Cepstral Excitation Manipulation for Speech Enhancement," IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 27, No. 11, pp. 1803-1814, Nov. 2019.
[45] H. Ahmed and A. K. Nandi, "Compressive Sampling and Feature Ranking Framework for Bearing Fault Classification With Vibration Signals," IEEE Access, Vol. 6, pp. 44731-44746, 2018, doi: 10.1109/ACCESS.2018.2865116.
[46] J. Tang, S. Alelyani and H. Liu, "Feature selection for classification: A review," in Data Classification: Algorithms and Applications, C. Aggarwal, Ed. Boca Raton, FL, USA: CRC Press, 2014.
[47] Q. Gu, Z. Li and J. Han, "Generalized Fisher score for feature selection," Proc. 27th Conference on Uncertainty in Artificial Intelligence, pp. 266-273, 2011.
[48] H. Liu and H. Motoda, Computational Methods of Feature Selection. New York, NY, USA: Chapman & Hall, 2007.
[49] J. Xu, B. Tang, H. He and H. Man, "Semisupervised Feature Selection Based on Relevance and Redundancy Criteria," IEEE Transactions on Neural Networks and Learning Systems, Vol. 28, No. 9, pp. 1974-1984, Sept. 2017, doi: 10.1109/TNNLS.2016.2562670.
[50] X. He, D. Cai and P. Niyogi, "Laplacian score for feature selection," Proc. NIPS, Vol. 186, pp. 507-514, 2005.
[51] Bang Huang and Linbo Xie, "An improved LBG algorithm for image vector quantization," 2010 3rd International Conference on Computer Science and Information Technology, Chengdu, pp. 467-471, 2010, doi: 10.1109/ICCSIT.2010.5564073.

[52] …, "A VQ-based …," IEEE J. …, Jul. 19….
[53] ….
[54] C. H. You, K. A. Lee and H. Li, "GMM-SVM Kernel With a Bhattacharyya-Based Distance for Speaker Recognition," IEEE Transactions on Audio, Speech, and Language Processing, Aug. 2010, doi: 10.1109/TASL.2009.2032950.
[55] … et al., "Low-Variance Multitaper MFCC Features: A Case Study in Robust Speaker Verification," IEEE Transactions on Audio, Speech, and Language Processing, Vol. 20, No. 7, Sept. 2012, doi: 10.1109/TASL.2012.2191960.
[56] A. Venturini, L. Zao and R. Coelho, "On speech features fusion, alpha-integration Gaussian modeling and multi-style training for noise robust speaker classification," IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 22, No. 12, pp. 1951-1964, Dec. 2014, doi: 10.1109/TASLP.2014.2355821.
·
[57] Sumita Nainan and Vaishali Kulkarni, "Performance Evaluation of Text Independent Automatic Speaker Recognition using VQ and GMM," ICTCS '16: Proceedings of the Second International Conference on Information and Communication Technology for Competitive Strategies, March 2016, Article No. 133, pp. 1-6, https://doi.org/10.1145/2905055.2905349.
[58] S. Nainan and V. Kulkarni, "A Comparison of Performance Evaluation of ASR for Noisy and Enhanced Signal using GMM," International Conference on Computing, Analytics and Security Trends (CAST), pp. 489-494, College of Engineering Pune, India, 2016.

[59] D. Paul, M. Pal and G. Saha, "Spectral Features for Synthetic Speech Detection," IEEE Journal of Selected Topics in Signal Processing, Vol. 11, No. 4, pp. 605-617, June 2017, doi: 10.1109/JSTSP.2017.2684705.
[60] M. Sahidullah et al., "Robust Voice Liveness Detection and Speaker Verification Using Throat Microphones," IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 26, No. 1, pp. 44-56, Jan. 2018, doi: 10.1109/TASLP.2017.2760243.
[61] Z. Liu, Z. Wu, T. Li, J. Li and C. Shen, "GMM and CNN Hybrid Method for Short Utterance Speaker Recognition," IEEE Transactions on Industrial Informatics, Vol. 14, No. 7, pp. 3244-3252, July 2018, doi: 10.1109/TII.2018.2799928.
[62] A. Misra and J. H. L. Hansen, "Maximum-Likelihood Linear Transformation for Unsupervised Domain Adaptation in Speaker Verification," IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 26, No. 9, pp. 1549-1558, Sept. 2018, doi: 10.1109/TASLP.2018.2831460.
[63] …, A. Gianelli and A. R. Trivedi, "Low Power Speaker Identification by Integrated Clustering and Gaussian Mixture Model Scoring," IEEE Embedded Systems Letters, Vol. 12, No. 1, pp. 9-12, March 2020, doi: 10.1109/LES.2019.2915953.



[64] D. Yu, K. Yao, H. Su, G. Li and F. Seide, "KL-divergence regularized deep neural network adaptation for improved large vocabulary speech recognition," Proc. IEEE Int. Conf. Acoust., Speech Signal Process., pp. 7893-7897, 2013.
[65] M. Hruz and Z. Zajic, "Convolutional Neural Network for speaker change detection in telephone speaker diarization system," Proc. IEEE Int. Conf. Acoust., Speech Signal Process., Mar. 2017, pp. 4945-4949.
[66] C. Li, X. Ma, B. Jiang, X. Li, X. Zhang, X. Liu, Y. Cao, A. Kannan and Z. Zhu, "Deep Speaker: an end-to-end neural speaker embedding system," 2017, arXiv:1705.02304. [Online]. Available: https://arxiv.org/abs/1705.02304.
[67] … Mohamed et al., "… speech recognition …," ….
[68] ….
[69] …, "… speaker recognition …," ….
. '' " '" , , ,,.,,ff
[70] …, "Investigations on … speech recognition," Proc. IEEE ICASSP, Shanghai, 2016, pp. 5020-5024.
[71] S. Li, Y. Akita and T. Kawahara, "Semi-supervised Acoustic Model Training by Discriminative Data Selection From Multiple ASR Systems' Hypotheses," IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 24, No. 9, pp. 1524-1534, Sept. 2016.
[72] J. Schröder, N. Moritz, J. Anemüller, S. Goetze and B. Kollmeier, "Classifier Architectures for Acoustic Scenes and Events: Implications for DNNs, TDNNs, and Perceptual Features from DCASE 2016," IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 25, No. 6, pp. 1304-1314, 2017, doi: 10.1109/TASLP.2017.2690569.
[73] H. Yu, Z.-H. Tan, Z. Ma, R. Martin and J. Guo, "Spoofing Detection in Automatic Speaker Verification Systems Using DNN Classifiers and Dynamic Acoustic Features," IEEE Transactions on Neural Networks and Learning Systems, Vol. 29, No. 10, pp. 4633-4644, Oct. 2018, doi: 10.1109/TNNLS.2017.2771947.
[74] T. J. Sefara, T. I. Modipa and M. J. Manamela, "… Machine Learning Algorithms … Speaker Recognition," IEEE AFRICON, Accra, Ghana, 2019, doi: 10.1109/AFRICON46755.2019.9133823.
[75] S. Nainan and V. Kulkarni, "Enhancement in speaker recognition for optimized speech features using GMM, SVM and 1-D CNN," International Journal of Speech Technology, 2020, doi: 10.1007/s10772-020-09771-2.
[76] Matsui, T. and Furui, S., "Concatenated phoneme models for text-variable speaker recognition," Proceedings International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1991, pp. 391-394.
[87] Girija Chetty and Michael Wagner, "Automated lip feature extraction for liveness verification in audio-video authentication," HCC Laboratory, University of Canberra, Australia, 2004.
[88] Vibhanshu Gupta and Shanila Sengupta, "Automatic speech reading by oral motion tracking for user authentication system," International Journal of Software Engineering Research & Practices, Vol. 3, No. 1, April 2013.
[89] Ming-Hsuan Yang, David J. Kriegman and Narendra Ahuja, "Detecting faces in images: A survey," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 24, No. 1, pp. 34-58, January 2002.
[90] Paul Viola and Michael J. Jones, "Robust Real-Time Face Detection," International Journal of Computer Vision, Vol. 57, pp. 137-154, 2004.
[91] …, David J. Kriegman and Narendra Ahuja, "… face detection …: A survey," IEEE Transactions on Pattern Analysis and Machine Intelligence, ….
[92] …, "Audio-Visual Speech Processing," Ph.D. Thesis, Queensland University of Technology, ….
[93] … Seid and B. Gam…, "Speaker Independent Continuous Speech Recognition …," Master Thesis, Computer Science & Information Technology, Arba Minch University, ….
[94] … Dave and Narendra M. Patel, "Phoneme and Viseme based Approach for Lip Synchronization," International Journal of Signal Processing, Image Processing and Pattern Recognition, Vol. 7, No. 3, pp. 385-394, 2014.
[95] … and Jean-Philippe Thiran, "Mutual Information Eigenlips for Audio-Visual Speech Recognition," in Proceedings of the 14th European Signal Processing Conference (EUSIPCO), September 4-8, 2006.
[97] A. B. Hassanat, "Visual Words for Automatic Lip-Reading," Ph.D. Thesis, Department of Applied Computing, University of Buckingham, United Kingdom, 2009.
[98] P. Kakumanu, S. Makrogiannis and N. Bourbakis, "A survey of skin-color modeling and detection methods," Pattern Recognition, Vol. 40, No. 3, pp. 1106-1122, Mar. 2007.
[99] Alan Wee-Chung Liew and Shilin Wang, "Visual Speech Recognition: Lip Segmentation and Mapping," Medical Information Science Reference, Hershey, New York, 2009.
[100] Syed Ali Khayam, "The Discrete Cosine Transform (DCT): Theory and Application," Michigan State University, March 10, 2003.
[101] C. Vimala and V. Radha, "A Review of Speech Recognition Challenges and Approaches," World of Computer Science and Information Technology Journal (WCSIT), Vol. 2, No. 1, pp. 1-7, 2012.
[102] M. Gordan, C. Kotropoulos and I. Pitas, "A support vector machine-based dynamic network for visual speech recognition applications," EURASIP Journal on Applied Signal Processing, Vol. 11, pp. 1248-1259, 2002.
[103] S. L. Wang, W. H. Lau, S. H. Leung and A. W. C. Liew, "Lip segmentation with the presence of beards," 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing, Montreal, Que., 2004, pp. iii-529, doi: 10.1109/ICASSP.2004.1326598.
[104] M. Ichino, H. Sakano and N. Komatsu, "Speaker recognition using kernel mutual subspace method," ICARCV 2004 8th Control, Automation, Robotics and Vision Conference, Kunming, China, 2004, pp. 397-402, Vol. 1, doi: 10.1109/ICARCV.2004.1468858.

[105] …, "… person identification/verification …," … Workshop ….
[106] ….
[107] …, "…," … Int. Conf. Digital Signal Processing (DSP), pp. 472-47…, ….
[108] …, "… Lip Movement Based Speaker Recognition …," 2016, pp. 1132-1135, doi: 10.1109/….
l O l1 091 l -11.Al 1' e\ \ln •.
e> al -- •O n d " ' " • A d, ne ed>heApp ti
~,
. ,e ,o b u st n es ,a he d
s o f au . "" '" "' "" s, ru ct ure nt l'
-p l' d,ovisual.2\'01 6.20 0
1£££- srh ,, ., ., ,. . ln fa ,,,m,k
,r,·
r,11'- . 20 a o on o l .C o n fe • OI II I - I -•Il·l
dm • IO . II,e0n ce on Bw
16. PP · 1- 8.
. rn e t, i~ Th "
" Ali- T- R - S 9 /B T A S .2 0
16 .7 79 11eo ~ " ' dc >e c> m n >o •i su I
61ry, A pp lic a .
~
• • P · aeed and \VI . fw m on dSy,
H . Al-\Vluifr • oe' ", " ' ' , nim,,-
10 5,s<"" b, se . .. ,o n. ,
d on ,N GRAlVI-WN ·
,v.,'',,;,,¢ N ," 2 0 2 0 In•J e, fP G ,\ Im " rBTAS). Nim
";og '.,e5A SEl l) u h o k t e m a lw. •a l Conpl·"em mrn
' " Ir
, . aq , 2. 0 ,
2 0 en ta ti on of v ·,sua\ S~eecn •
111 Abd' .,-:,hiJJl P P . 13 2- 137 R
"' Mest,ab. et al
., L. tp R ea d
' d m., 10
,m n c e
.1109/on C om
..
p d ,-;s;on Com _ m g w it h H ah pu te, S2".'" " on"°'"""'
_ " n Convolution dS..of rw •" n
P,a ,ng, E ls ev ie al cN s~e sl 'A ,920 . ,, ,
r, In p re ss . h al . 0 .9142095 .
-0 2 1 0 9 3 9 7
[112] A. Ross and A. K. Jain, "Multimodal Biometrics: An Overview," Proc. 12th European Signal Processing Conference (EUSIPCO), Vienna, Austria, pp. 1221-1224, September 2004.
[113] Raghavendra, R., Dorizzi, B., Rao, A., et al., "Designing efficient fusion schemes for multimodal biometric systems using face and palmprint," Pattern Recognition, 44, (5), pp. 1076-1088, 2011.
[114] Conti, V., Militello, C., Sorbello, F., et al., "A frequency-based approach for features fusion in fingerprint and iris multimodal biometric identification systems," IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., 40, (4), pp. 384-395, 2010.
[115] Huang, Z., Liu, Y., Li, X., et al., "An adaptive bimodal recognition framework using sparse coding for face and ear," Pattern Recognition Letters, 53, pp. 69-76, 2015.
[116] Lumini, A. and Nanni, L., "Overview of the combination of biometric matchers," Inf. Fusion, 33, pp. 71-85, 2017.
[117] Alsaade, F., Ariyaeeinia, A., Malegaonkar, A., et al., "Qualitative fusion of normalised scores in multimodal biometrics," Pattern Recognit. Lett., 30, (5), pp. 564-569, 2009.
[118] Fierrez-Aguilar, J., Ortega-Garcia, J., Gonzalez-Rodriguez, J., et al., "Discriminative multimodal biometric authentication based on quality measures," Pattern Recognit., 38, (5), pp. 777-779, 2005.
[119] Srinivas, N., Veeramachaneni, K. and Osadciw, L. A., "Fusing correlated data from multiple classifiers for improved biometric verification," 2009 12th International Conference on Information Fusion, Seattle, WA, USA, pp. 1504-1511, 2009.
~,.,,,~-",.,,'.,..
-~~-w
,._ ,...,t-'" .
~ r.,, ,., · ~.,-- J,Cf'1J (lr 'le '!J t--C ,& n '" ·
,c.atti"fl
·• J 7

~.~
.
t,s<;('(I

""..-•,., ., .,
<'ffi qo .c .'i. h (,( 're
1 11

~
' ·,
,••
,, • ""
,, ,,• ... ,.. .•n<' ,
• ·I fl•¥"II "'//
,n ",1• ...,~,=o"•fl"', , .""
.,m 1
, o, , . , no ' ,_"
,. • ., _ ,, . . ... .,..
' ' "· ..,~..,. ....' . ,,., .. • ...

· ""'""' ~. '"
• , ' '

..."",.""• '""'''
>·"' • "• ••
"
, r . ...' ,,,•"
,. " " ' ,n
,;~ .' 1n dlf,_J I" "T"'•m· P tt- rt ".."."(" "
,

~
.,. .
_, .,,.,..... •,n 1n ae•
~ nc On • '" ' ,.,, " "
' " " ' '- .. . , . , • ~ ., ,_ •• "' ' •• •

.1 . ,,, rJ
"" .. .
. ..-.ar - , •, ,e 1 0
11 "" ''f'C F 20
.
~ Kanhatii?. . 1" 56l\11 <s
>A '' p 'l fl 1 ''" " ' 'f &d , .
_ ,,
,....
,, ,_
'• .,,
•~ .. . ,. "" .
1 " " ••
, • .,

~.. .._ ,
., ,. . -····
1 '" • '" '
' \1 1 'R ,C S M A N ' •• • "" '" '" ''
_,.,.n. - ,,,, ,. vr A G 7h ,m g.

f l"'" ·
Q,- 102. """'
1 f\ ,s ,r ri cl >i
an d R . A dl 2 0 ,0
,F " fl '' f •
A .. ..
' ••n •• •• ••
• '
1f "'1'" ~ "" ' l' ,n ,c e).'' 201 2
"
ou d· ..
J. S «w el
· ' " ' r." • • =~•~'"" " '• R
• .. , , .. !·•·.. fn
.-

61 h / ,e ... • " ' ' " ''


,~ ,,., .;-" m a, ev e! fu ,,, ... ' · · "'
" an d 0 ;a na l C on ft siao ,,_,
1• T ei ec o m t 1

/ 1"""
I (l 'l'

l.
S fff f .20 I 2.
>' """''.
64 8 19 0 3 .
m u n ;co,;ons
l,S ET ITen ) ce
·
.Sc,e •= ~
·« i m n\ ,\m od
[125] … Singh, "… feature fusion scheme … biometrics …," International Journal of Advanced ….
[126] …, et al., "… Enhanced Feature-Level Fusion … biometrics," IEEE Access, Vol. 6, pp. 21418-21426, 2018.
[127] S. Nainan and Vaishali Kulkarni, "Synergy of Voice and Lip Movement for Automatic Person Identification," IEIE Transactions on Smart Processing and Computing, Vol. 8, No. 4, Aug. 2019, doi: 10.5573/IEIESPC.2019.8.4.279.
[128] …, "… Multimodal Biometric Fusion for Subject … Recognition at a Distance," IEEE Transactions on Biometrics, Behavior, and Identity Science, Vol. 1, No. 4, 2019, doi: 10.1109/TBIOM.2019.2943934.
[129] … Yu, Y. Guo and Y. Yu, "A Weighted Center Graph Fusion Method for Person Re-identification," IEEE Access, Vol. 7, pp. 23329-23342, 2019, doi: 10.1109/ACCESS.2019.….
[130] M. Eskandari and O. Sharifi, "Effect of face and ocular multimodal biometric systems on gender classification," IET Biometrics, Vol. 8, No. 4, pp. 243-248, 2019, doi: 10.1049/iet-bmt.2018.5133.
[131] B. Gundogdu and M. J. Bianco, "Collaborative similarity metric learning for face recognition in the wild," IET Image Processing, Vol. 14, No. 9, pp. 1759-1768, 2020, doi: 10.1049/iet-ipr.2019.0510.
[132] Gurjit Singh Walia, Shivam Rishi, Rajesh Asthana, Aarohi Kumar and Anjana Gupta, "Secure multimodal biometric system based on diffused graphs and optimal score fusion," IET Biometrics, Vol. 8, pp. 231-242, 2019.
[133] L. Wu, J. Yang, M. Zhou, Y. Chen and Q. Wang, "LVID: A Multimodal Biometrics Authentication System on Smartphones," IEEE Transactions on Information Forensics and Security, Vol. 15, 2020, doi: 10.1109/TIFS.2019.2944058.
[134] X. Wang and S. Feng, "Multi-perspective gait recognition based on classifier fusion," IET Image Processing, Vol. 13, No. 11, pp. 1885-1891, 2019, doi: 10.1049/iet-ipr.2018.6566.
[135] G. Chetty and M. Wagner, "Multi-Level Liveness Verification for Face-Voice Biometric Authentication," 2006 Biometrics Symposium: Special Session on Research at the Biometric Consortium Conference, Baltimore, MD, 2006, pp. 1-6, doi: 10.1109/BCC.2006.434161….
[136] G. Chetty and M. …, "Multimodal feature fusion for video forgery detection," 13th International Conference on Information Fusion, Edinburgh, pp. 1-7, 2010, doi: 10.1109/….2010.5711339.
[137] …, "Speaker … segmentation using Convolutional Neural Networks," arXiv:1808.05421 [eess.AS], August 2018.
[138] Navneet Upadhyay and Abhijit Karmakar, "Speech Enhancement using Spectral Subtraction-type Algorithms: A Comparison and Simulation Study," Eleventh International Multi-Conference on Information Processing-2015 (IMCIP-2015), Procedia Computer Science, Vol. 54, pp. 574-584, doi: 10.1016/j.procs.2015.06.066, 2015.
[139] S. Surendran and T. K. Kumar, "Oblique Projection and Cepstral Subtraction in Signal Subspace Speech Enhancement for Colored Noise Reduction," IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 26, No. 12, pp. 2328-2340, Dec. 2018, doi: 10.1109/TASLP.2018.2864535.
[140] … Xiao, S. Wang, M. Wan and L. Wu, "Radiated Noise Suppression for Electrolarynx Speech Based on Multiband Time-Domain Amplitude Modulation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 26, No. 9, pp. 1585-1593, Sept. 2018, doi: 10.1109/TASLP.2018.2834729.
[141] M. Berouti, R. Schwartz and J. Makhoul, "Enhancement of speech corrupted by acoustic noise," Proc. Int. Conf. on Acoustics, Speech and Signal Processing, pp. 208-211, 1979.
[142] T. Lin and Y. Zhang, "Speaker Recognition Based on Long-Term Acoustic Features With Analysis Sparse Representation," IEEE Access, Vol. 7, pp. 87439-87447, 2019, doi: 10.1109/ACCESS.2019.2925839.
[143] Z. Ma, H. Yu, Z. Tan and J. Guo, "Text-Independent Speaker Identification Using the Histogram Transform Model," IEEE Access, Vol. 4, pp. 9733-9739, 2016, doi: 10.1109/ACCESS.2016.2646458.
[144] D. Paul, M. Pal and G. Saha, "Spectral Features for Synthetic Speech Detection," IEEE Journal of Selected Topics in Signal Processing, Vol. 11, No. 4, pp. 605-617, June 2017, doi: 10.1109/JSTSP.2017.2684705.
[145] A. K. H. Al-Ali, D. Dean, B. Senadji, V. Chandran and G. R. Naik, "Enhanced Forensic Speaker Verification Using a Combination of DWT and MFCC Feature Warping in the Presence of Noise and Reverberation Conditions," IEEE Access, Vol. 5, pp. 15400-15413, 2017, doi: 10.1109/ACCESS.2017.2728801.
[146] C. Quan, K. Ren and Z. Luo, "A Deep Learning Based Method for Parkinson's Disease Detection Using Dynamic Features of Speech," IEEE Access, Vol. 9, pp. 10239-10252, 2021, doi: 10.1109/ACCESS.2021.3051432.
[147] M. Zhou, "A Hybrid Feature Selection Method based on Fisher Score and Genetic Algorithm," Journal of Mathematical Sciences: Advances and Applications, Vol. 37, pp. 51-78, 2016.
[148] T. Lin and Y. Zhang, "Speaker Recognition Based on Long-Term Acoustic Features With Analysis Sparse Representation," IEEE Access, Vol. 7, pp. 87439-87447, 2019, doi: 10.1109/ACCESS.2019.2925839.
[149] Z. Zhao, et al., "A Lighten CNN-LSTM model for speaker verification on embedded devices," Future Generation Computer Systems, Vol. 100, pp. 751-758, Nov. 2019.

[150] M. Bengherabi, A. Amrouche and F. Ha…, "Improving … Diversity and Score Level Fusion," … Signal-Image Technology & Internet-Based Systems, pp. 136-142, 2013.
[151] K. Wei, K. Kirchhoff, Y. Song and J. Bilmes, "… submodular … score spaces … speaker verification," 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, pp. 7184-7188, 2013.
[152] Cortes, C. and Vapnik, V., "Support-vector networks," Machine Learning, Vol. 20, pp. 273-297, 1995, doi: 10.1023/A:1022627411411.
[153] Y. Lukic, C. Vogt, O. Durr and T. Stadelmann, "Speaker identification and clustering using convolutional neural networks," 2016 IEEE 26th International Workshop on Machine Learning for Signal Processing (MLSP), 2016, pp. 1-6, doi: 10.1109/MLSP.2016.7738816.
[154] Krizhevsky, A., Sutskever, I. and Hinton, G. E., "ImageNet classification with deep convolutional neural networks," Advances in Neural Information Processing Systems, 2012.
[155] Palaz, D., Collobert, R. and Doss, M. M., "End-to-end phoneme sequence recognition using convolutional neural networks," arXiv preprint arXiv:1312.2137, 2013.
[156] Zhang, Y., Pezeshki, M., Brakel, P., Zhang, S., Laurent, C., Bengio, Y. and Courville, A., "Towards end-to-end speech recognition with deep convolutional neural networks," arXiv preprint arXiv:1701.02720, 2017.
[157] Xi Chen, Fotis Kopsaftopoulos, Qi Wu, He Ren and Fu-Kuo Chang, "A Self-Adaptive 1D Convolutional Neural Network for Flight-State Identification," Sensors, 19, 275, 2019.
[158] C. Sanderson and B. C. Lovell, "Multi-Region Probabilistic Histograms for Robust and Scalable Identity Inference," Lecture Notes in Computer Science (LNCS), Vol. 5558, pp. 199-208, 2009.
[159] M. Obayya, M. El-Ghandour and F. Alrowais, "Contactless Palm Vein Authentication Using Deep Learning With Bayesian Optimization," IEEE Access, Vol. 9, pp. 1940-1957, 2021, doi: 10.1109/ACCESS.2020.3045424.
[160] L. G. Zhao and C. K. Wu, "A Planes Detection Algorithm Based on Feature Distribution," 2011 Third International Conference on Intelligent Networking and Collaborative Systems, Fukuoka, Japan, 2011, pp. 172-177, doi: 10.1109/INCoS.2011.33.
[161] J. Zeng et al., "Finger Vein Verification Algorithm Based on Fully Convolutional Neural Network and Conditional Random Field," IEEE Access, Vol. 8, pp. 65402-65419, 2020, doi: 10.1109/ACCESS.2020.2984711.
[162] K. Sujatha et al., "Image analysis for diagnosis and early detection of hepatoprotective activity," in Advances in Ubiquitous Sensing Applications for Healthcare: Classification Techniques for Medical Image Analysis and Computer Aided Diagnosis, pp. 67-87, Academic Press, Science Direct, 2019.
[163] K. S. Beevi, M. S. Nair and G. R. Bindu, "Detection of mitotic nuclei in breast histopathology images using localized ACM and Random Kitchen Sink based classifier," 2016 38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Orlando, FL, USA, 2016, pp. 2435-2439, doi: 10.1109/EMBC.2016.7591222.
[164] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), San Diego, CA, USA, 2005, pp. 886-893, Vol. 1, doi: 10.1109/CVPR.2005.177.
[165] …, "Implementation of HOG edge detection algorithm on FPGAs," Procedia - Social and Behavioral Sciences, Vol. 174, pp. 1567-1575, Science Direct, 2015.
[166] H. Ren and Z. Li, "Object detection using edge histogram of oriented gradient," 2014 IEEE International Conference on Image Processing (ICIP), Paris, France, pp. 4057-4061, 2014, doi: 10.1109/ICIP.2014.7025824.
[167] B. Johnston and P. de Chazal, "A review of image-based automatic facial landmark identification techniques," EURASIP Journal on Image and Video Processing, 2018:86, 2018, doi: 10.1186/s13640-018-0324-4.
[168] P. S. Aleksic and A. K. Katsaggelos, "Audio-Visual Biometrics," Proceedings of the IEEE, Vol. 94, No. 11, pp. 2025-2044, Nov. 2006, doi: 10.1109/JPROC.2006.886017.
[169] E. Erzin, Y. Yemez and A. M. Tekalp, "Multimodal speaker identification using an adaptive classifier cascade based on modality reliability," IEEE Transactions on Multimedia, Vol. 7, No. 5, pp. 840-852, Oct. 2005, doi: 10.1109/TMM.2005.854464.
[170] N. A. Fox, R. Gross, J. F. Cohn and R. B. Reilly, "Robust Biometric Person Identification Using Automatic Classifier Fusion of Speech, Mouth, and Face Experts," IEEE Transactions on Multimedia, Vol. 9, No. 4, pp. 701-714, June 2007, doi: 10.1109/TMM.2007.893339.
[171] N. Kihal, S. Chitroub, A. Polette, I. Brunette and J. Meunier, "Efficient multimodal ocular biometric system for person authentication based on iris texture and corneal shape," IET Biometrics, Vol. 6, No. 6, pp. 379-386, 2017, doi: 10.1049/iet-bmt.2016.0067.
[138] L. Rabiner, B.-H. Juang and B. Yegnanarayana, "Fundamentals of Speech Recognition," Indian Subcontinent Adaptation, Dorling Kindersley (India) Pvt. Ltd., Pearson Education in South Asia, 2015.
[172] C. H. Chan, B. Goswami, J. Kittler and W. Christmas, "Local Ordinal Contrast Pattern Histograms for Spatiotemporal, Lip-Based Speaker Authentication," IEEE Transactions on Information Forensics and Security, Vol. 7, No. 2, pp. 602-612, April 2012, doi: 10.1109/TIFS.2011.2175920.
[173] M. O. Oloyede and G. P. Hancke, "Unimodal and Multimodal Biometric Sensing Systems: A Review," IEEE Access, Vol. 4, pp. 7532-7555, 2016, doi: 10.1109/ACCESS.2016.2614720.
[174] H. Mandalapu et al., "Audio-Visual Biometric Recognition and Presentation Attack Detection: A Comprehensive Survey," IEEE Access, Vol. 9, pp. 37431-37455, 2021, doi: 10.1109/ACCESS.2021.3063031.
[175] S. Alwahaishi and J. Zdralek, "Biometric Authentication Security: An Overview," 2020 IEEE International Conference on Cloud Computing in Emerging Markets (CCEM), 2020, doi: 10.1109/CCEM50674.2020.00027.
[176] S. Li and B. Zhang, "Joint Discriminative Sparse Coding for Robust Hand-Based Multimodal Recognition," IEEE Transactions on Information Forensics and Security, Vol. 16, pp. 3186-3198, 2021, doi: 10.1109/TIFS.2021.3074315.


Publications

Journal publications (Scopus)

1. S. Nainan and V. Kulkarni, "Enhancement in speaker recognition for optimized speech system using GMM, SVM and CNN," International Journal of Speech Technology, Oct. 2020, DOI: 10.1007/s10772-020-09771-2.

2. S. Nainan and V. Kulkarni, "Synergy of voice and lip movement for automatic person identification," IEIE Transactions on Smart Processing and Computing, 7th June 2019.

3. S. Nainan and V. Kulkarni, "Multimodal Speaker Recognition system using voice and lip movement biometrics with decision and feature level fusion" (paper in review).

Conference publications (Scopus)

1. S. Nainan and V. Kulkarni, "Lip Tracking Using Deformable Models and Geometric Approaches," Proc. 3rd International Conference on Information and Communication Technology for Intelligent Systems, 2018, Chapter 65.

2. S. Nainan, V. Kulkarni and A. Srivastava, "Computer Vision based Real Time Lip Tracking for Person Authentication," International Conference on Information and Communication Technology for Intelligent Systems 2017, Smart Innovations, Systems and Technologies, Springer, Vol. 83, pp. 108-115, 2017.

3. S. Nainan and V. Kulkarni, "A Comparison of Performance Evaluation of ASR for Noisy and Enhanced Signal using GMM," International Conference on Computing, Analytics and Security Trends (CAST), 2016, pp. 489-494, College of Engineering Pune, India, 2016.

4. S. Nainan and V. Kulkarni, "Performance Evaluation of Text Independent Automatic Speaker Recognition using VQ and GMM," Proceedings of the Second International Conference on Information and Communication Technology for Competitive Strategies, ACM, ICTCS, 2016.
Appendix A

FAR and FRR values for the Open-Set Identification (OSI) technique

OSI - Expt. 1
Threshold   FRR (%)     FAR (%)
3000        100         0
3300        98.095238   0
3600        92.380952   0
3900        87.619048   0
4200        77.142857   8.333333
4500        75.238100   8.333333
4800        66.666670   16.666670
5100        53.333330   25
5400        32.857140   31.666670
5700        32.190480   50
6000        31.428570   62.5
6300        20.952380   70.833330
6600        16.190480   87.5
6900        11.428570   91.666670
7200        7.619048    91.666670
7500        4.761905    95.833330
7800        2.857143    100
8100        1.904762    100
8400        0.952381    100
8700        0.952381    100
9000        0.952381    100
9300        0.952381    100
9600        0           100
9900        0           100

OSI - Expt. 2
Threshold   FRR (%)     FAR (%)
3000        100         0
4000        86.046512   0
5000        60.465120   0
6000        31.007750   0
7000        9.302326    0
8000        1.550388    33.333333
9000        0.775194    44.444444
10000       0           …
11000       0           51.851852
12000       0           55.555556
13000       0           66.666667
14000       0           74.074074
15000       0           77.777778
16000       0           77.777778
17000       0           88.888889
18000       0           92.592593
19000       0           96.296296
20000       0           100
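The equal error rate (EER) operating point sits where the two error curves cross. A minimal sketch, using the Expt. 1 threshold sweep above, that locates the threshold where |FAR - FRR| is smallest; the `eer_point` helper and the mean-based EER approximation are illustrative choices, not part of the appendix:

```python
# Estimate the EER operating point from the Expt. 1 threshold sweep.
thresholds = [3000, 3300, 3600, 3900, 4200, 4500, 4800, 5100, 5400, 5700,
              6000, 6300, 6600, 6900, 7200, 7500, 7800]
frr = [100, 98.095238, 92.380952, 87.619048, 77.142857, 75.238100,
       66.666670, 53.333330, 32.857140, 32.190480, 31.428570, 20.952380,
       16.190480, 11.428570, 7.619048, 4.761905, 2.857143]
far = [0, 0, 0, 0, 8.333333, 8.333333, 16.666670, 25, 31.666670, 50,
       62.5, 70.833330, 87.5, 91.666670, 91.666670, 95.833330, 100]

def eer_point(thresholds, frr, far):
    """Return (threshold, eer_estimate) at the FAR/FRR crossover."""
    i = min(range(len(thresholds)), key=lambda k: abs(far[k] - frr[k]))
    return thresholds[i], (far[i] + frr[i]) / 2

t, eer = eer_point(thresholds, frr, far)
print(t, round(eer, 2))  # crossover at threshold 5400, EER around 32%
```

On this sweep the crossover falls between the 31.7% FAR and 32.9% FRR recorded at threshold 5400, so the EER of Expt. 1 is roughly 32%.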
Appendix B: VidTIMIT Database

VidTIMIT Dataset Documentation
Conrad Sanderson
conrad.sanderson@nicta.com.au
2009

The VidTIMIT dataset is Copyright 2001-2009.

Use of this dataset is permitted under the following conditions:

1. This license is left intact and not modified in any way.
2. The dataset is provided as is. There is no warranty of any kind as to the fitness for any particular purpose.
3. The author of the dataset is not responsible for any direct or indirect losses resulting from the usage of the VidTIMIT dataset.
4. Any publication (e.g. conference paper, journal article, technical report, book chapter) reporting results obtained using the VidTIMIT dataset must cite the following paper:

C. Sanderson and B. C. Lovell, "Multi-Region Probabilistic Histograms for Robust and Scalable Identity Inference," Lecture Notes in Computer Science (LNCS), Vol. 5558, pp. 199-208, 2009.

Overview

The VidTIMIT dataset is comprised of video and corresponding audio recordings of 43 volunteers (19 females and 24 males), reciting short sentences. It can be useful for research on topics such as automatic lip reading, multi-view face recognition, multi-modal speech recognition and person identification/verification.
The dataset was recorded in 3 sessions, with a mean delay of 7 days between Session 1 and 2, and 6 days between Session 2 and 3. The delay between sessions allows for changes in the voice, hairstyle, make-up, clothing and mood (which can affect the pronunciation), thus incorporating attributes which would be present during the deployment of a verification system. Additionally, the zoom factor of the camera was randomly perturbed after each recording.
The sentences were chosen from the test section of the NTIMIT corpus. There are ten sentences per person. The first six sentences (sorted alpha-numerically by filename) are assigned to Session 1. The next two sentences are assigned to Session 2, with the remaining two to Session 3. The first two sentences for all persons are the same, with the remaining eight generally different for each person. The mean duration of each sentence is 4.25 seconds, or approximately 106 video frames (at 25 fps).
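The frame-count figure quoted above follows directly from the video rate; a quick arithmetic check (the variable names are illustrative):

```python
# A 4.25 s mean utterance at the dataset's 25 fps video rate spans
# roughly 106 frames, matching the figure quoted in the text.
fps = 25
mean_duration_s = 4.25
frames = round(mean_duration_s * fps)
print(frames)  # 106
```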
A typical example of the sentences used is shown in Table 1. The correspondence of sentence IDs between VidTIMIT and NTIMIT is retained.

In addition to the sentences, each person performed a head rotation sequence in each session. The sequence consists of the person moving their head to the left, right, back to the center, up, then down, and finally returning to the center; this allows for the extraction of profile and extended views of the head.

The recording was done in a noisy office environment using a broadcast quality digital video camera.
Session ID   Sentence ID   Sentence
Session 1    sa1           She had your dark suit in greasy wash water all year.
             sa2           Don't ask me to carry an oily rag like that.
             si1398        …
             si2028        …
             si768         …
             sx138         The clumsy customer spilled some expensive perfume.
Session 2    sx318         The viewpoint overlooked the ocean.
             sx228         …
Session 3    sx408         I'd ride the subway but I haven't enough change.
             sx48          Grandmother outgrew her upbringing in p…

Table 1: Typical example of sentences used in the VidTIMIT database
Standard overhead fluorescent tubes, like in most office environments, were used as the light source. The lights were covered with A4 size white office paper in order to diffuse the light and reduce the glare on the face and top of the head. An incandescent lamp was placed in front of the person (just below the camera); the lamp was covered with a sheet of A4 size white office paper.

The video of each person is stored as a numbered sequence of JPEG images with a resolution of 384 x 512 pixels (rows x columns). A quality setting of 90% was used during the creation of the JPEG images. The corresponding audio is stored as a mono, 16 bit, 32 kHz WAV file.
The VidTIMIT dataset is comprised of 44 zip files (including documentation), taking up about 3 GB. Each zip archive is for one person (e.g. felc0) and has the following internal structure:

subjectID/audio/sentenceID.wav
subjectID/video/sentenceID/###

where sentenceID is the sentence identifier (e.g. sx6) and ### is a three-digit frame number (e.g. 037). Each frame is stored as a JPEG image (note that there is no .jpg extension). There is no audio for the head rotation sequence.
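The directory layout above maps cleanly onto a small iterator. A sketch, assuming the archives have been unpacked under a local root directory; the root path and the `list_utterances` helper are illustrative, not part of the dataset:

```python
# Walk the VidTIMIT layout described above:
#   subjectID/audio/sentenceID.wav
#   subjectID/video/sentenceID/###   (### = numbered JPEG frames)
from pathlib import Path

def list_utterances(root):
    """Yield (subject, sentence_id, wav_path, frame_dir) per utterance."""
    root = Path(root)
    for wav in sorted(root.glob("*/audio/*.wav")):
        subject = wav.parts[-3]    # e.g. "felc0"
        sentence_id = wav.stem     # e.g. "sa1"
        frame_dir = root / subject / "video" / sentence_id
        yield subject, sentence_id, wav, frame_dir
```

Since the head rotation sequences have no audio, they appear only under `video/` and are not yielded by this audio-driven walk.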
The dataset was created by Conrad Sanderson, with Professor Kuldip K. Paliwal, Griffith University, Queensland, Australia. The audio was recorded using the camera's microphone. The sentences were chosen from the test section of the NTIMIT corpus:

C. Jankowski, A. Kalyanswamy, S. Basson and J. Spitz, "NTIMIT: a phonetically balanced, continuous speech, telephone bandwidth speech database," Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, Vol. 1, pp. 109-112, 1990.

Example: VidTIMIT Database
Signal Label: …1.wav
Plot of the time domain waveform and its frequency domain spectrogram:
Sampling period: Ts = 3.125 * 10^-5 s
Sampling frequency: fs = 32 kHz
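The sample period is simply the reciprocal of the sampling rate, which is where the value above comes from. A small check covering both sampling rates used in this work; the variable names are illustrative:

```python
# Ts = 1/fs for the two sampling rates used here:
# VidTIMIT audio at 32 kHz and the MPSTME recordings at 48 kHz.
fs_vidtimit = 32_000            # Hz
fs_mpstme = 48_000              # Hz
ts_vidtimit = 1 / fs_vidtimit   # 3.125e-05 s
ts_mpstme = 1 / fs_mpstme       # ~2.0833e-05 s
print(ts_vidtimit, ts_mpstme)
```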

Figure B.1 Plot of a time domain and frequency domain waveform for signal …1.wav

MPSTME Database Creation

Creation of the MPSTME database was part of the work carried out for this research, keeping the standard databases as a reference. An audio-visual database of 80 speakers, comprising students, faculty and staff of MPSTME, was created. Like the VidTIMIT database, the number of sentences per speaker was chosen as 10, to benchmark with the standard database. 5 sentences from the VidTIMIT database were retained.
4 sentences were framed with a word in each sentence containing all the vowels, and 1 sentence was of the speaker's choice. The recording was done in a noisy but well-lit office environment. The specifications of the database are as mentioned in Table B.1.

Table B.1 Specifications of the MPSTME Database

List                         Details
Speakers                     80 (63 males & 17 females, age group 18 to 54 yrs.)
Recording sessions           Recorded over a period of 2 months
Sentences                    10 sentences/speaker
Mean duration of sentences   4 secs. (100/120 frames/video @ 25 fps)
Sentence details             5 from the VidTIMIT database, e.g. "The clumsy customer spilled some expensive perfume."
                             4 created, each with a word containing all the vowels, e.g. "You are authorized to use the automobile."
                             1 sentence of the speaker's choice, e.g. "My name is ....."
Storage                      Speech stored as mono (512 kbps, 16 bit, 48 kHz) .wav file; images stored as JPEG
Recording setup              Videos recorded with an Intex web cam at two different locations at MPSTME in a well-lit but noisy office environment

List of Sentences in the MPSTME database:
1. The clumsy customer spilled some expensive perfume.
2. I'd ride the subway but I haven't enough change.
3. Don't ask me to carry an oily rag like that.
4. Do they make class biased decisions?
5. Multimodal biometrics for person authentication.
6. Misbehaviour in education cannot be tolerated.
7. You are authorized to use the automobile.
8. My name is .....
9. When the car broke down they had to …
10. … of your own.

Signal Label                        Sentence                                             Duration (secs.)
S1S1 (Speaker 1, sentence 1)        The clumsy customer spilled some expensive perfume   4
S15S2 (Speaker 15, sentence 2)      I'd ride the subway but I haven't enough change      3
S25S3 (Speaker 25, sentence 3)      Don't ask me to carry an oily rag like that.         4
S42S4 (Speaker 42, sentence 4)      Do they make class biased decisions?                 3
S67S10 (Speaker 67, sentence 10)    My name is Tanvi                                     3
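The signal labels in the table follow a simple speaker/sentence convention (S16S1 = speaker 16, sentence 1). A small parser for that convention; the regex helper itself is illustrative, not part of the database:

```python
# Parse MPSTME signal labels of the form S<speaker>S<sentence>, e.g. "S37S4".
import re

LABEL = re.compile(r"^S(\d+)S(\d+)$")

def parse_label(label):
    """Return (speaker, sentence) from a label like 'S16S1'."""
    m = LABEL.match(label)
    if m is None:
        raise ValueError(f"not an MPSTME signal label: {label!r}")
    return int(m.group(1)), int(m.group(2))

print(parse_label("S37S4"))  # (37, 4)
```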

Figure B.2 Sample images of the speakers from the database

A database of 72 speakers was finally selected for validating the work; the recordings of 8 speakers were dropped as the waveform quality was not reproducible.

Example: plots of the time domain and frequency domain waveforms for speaker 16 and speaker 37 from the MPSTME database.
Signal Label: a) S16S1.wav  b) S37S4.wav
Sampling period: Ts = 2.0833 * 10^-5 s
Sampling frequency: fs = 48 kHz

[Time domain waveforms, spectrograms and periodogram power spectral density estimates for the two signals]

Plot of the time domain waveform and its frequency domain spectrogram
