[Figure: MFCC and Delta feature extraction pipeline for speaker recognition — input speech, MFCC computation, and concatenation of Delta features]
Diverse methods to extract features from audio have been proposed over the years. Feature extraction techniques are differentiated based on human audio processing and perception [16]; the unique information they capture is also determined by the length of the observation window [17]. Pitch, the shape of the human vocal tract that determines human speech production, speaking rate and accent are some of the several speaker-dependent features. Speech-based features can be characterised as acoustic, linguistic and psycholinguistic features; each carries speaker personality traits and can be extracted to be processed for speaker recognition [18]. The acoustic-phonetic approach to feature measurement usually follows the spectral representation of the characteristics of the time-varying speech signal. Commonly used spectral analysis techniques are discussed below.
The perceptual and significant areas of the spectrum derived from the raw audio signal are the basis of MFCCs [19]-[22]. Just as the human ear responds to frequency variation, MFCC features mimic the same. The substantial phonetic characteristics of the speech signal are captured with an arrangement of filters: the filters are arranged linearly at the lower frequencies, while at the higher frequencies they are spaced logarithmically for significant feature extraction [23], [24]. The lower-order speech coefficients carry the relevant details of the shape of the spectrum representing a source-filter transfer function. Speaker-specific characteristics can also be derived from the higher-order frequency components, as they too carry sufficient spectrum information [25]. Preprocessing of the raw audio signal is at times imperative to improve the quality of the signal by boosting the signal-to-noise ratio. Ambient noise adversely affects the speech signal and makes MFCC features unreliable, which makes speech signal preprocessing imperative [26], [27]. Other popular speech features can be derived from Linear Prediction
analysis; such features have shown significant improvement in speech classification performance. Constant-Q Cepstral Coefficients (CQCCs), extracted from each audio frame, are also often used. The geometrical spacing of the frequency bins ensures a constant Q, the ratio of the centre frequency of a band to its width. Another source-filter model, known as homomorphic filtering, and cepstral analysis are often employed to extract Complex Cepstral Coefficients (CCCs): from an audio signal the Fast Fourier Transform (FFT) is computed, the logarithm of the derived signal is taken, and the coefficients are finally obtained by taking its inverse Fourier transform.
Along with MFCC features, deep learning models such as the convolutional neural network (CNN) [41] have been applied to speaker recognition. The purpose of feature extraction from a speech signal is to represent the signal in terms of a limited number of coefficients, as irrelevant and redundant data can be ignored for further processing.
The Relief algorithm is a supervised technique used to rank features. The algorithm works with degraded and multiclass feature sets which are not complete, and is widely used for preprocessing. Statistical methods are applied in this algorithm, and feature selection is carried out based on the feature weights. The intra-class nearest data points, designated as nearest hits, are calculated, and so are the nearest neighbours from different classes, which are labelled as nearest misses.
Features obtained from the feature selection algorithms are further classified to create speaker models which uniquely describe them. The commonly employed classifiers include vector quantization (VQ), the Gaussian mixture model (GMM), the support vector machine algorithm (SVM), and models based on state-of-the-art neural networks and deep learning.
For speaker verification, speech signals represented as probabilities of discrete events as features were examined on the NIST database in [64]. The performance of ASR achieved was marginally better than that obtained with GMM. An HMM-based text-to-speech synthesizer was evaluated for speaker verification in [65] on the Wall Street Journal speech corpus with a match claim of 81%. An efficient discriminative speaker verification system has been developed in the i-vector space with SVM as a linear discriminative classifier in [66] on the NIST 2010 database. The SVM was trained to discriminate between the hypotheses determining the probability of a pair of feature vectors belonging to one speaker or to different speakers. The parameters of a symmetric quadratic function which approximates a log-likelihood ratio were estimated without the need to explicitly model the i-vector distributions as in Probabilistic Linear Discriminant Analysis (PLDA) models. This gender-independent model gave state-of-the-art accuracy. Speaker adaptation techniques that allow speaker models to be trained using less training data than that required for maximum-likelihood training have been extensively employed for speaker recognition. Maximum a posteriori (MAP) and maximum-likelihood linear regression (MLLR) techniques were incorporated in [67] as feature vectors to be further classified using SVM. The evaluation on the NIST 2005 and 2006 SRE databases gave good results due to front-end speaker adaptation. Intermediate matching kernels (IMK) were explored as dynamic kernels (DK) for classification using SVM of the feature vectors extracted from varying-length speech. Class-independent GMM, SVM-based classifiers with IMK and SVM-based
classifiers using cosine distance were compared for a speaker verification system on the NIST 2008 database. The experimentation showed that the verification accuracy varied, with the equal error rate increasing by 0.17% [72]. SVM was employed along with a feed-forward neural network (FFNN), another machine learning algorithm, to classify glottal parameters combined with MFCC and openSMILE acoustic feature vectors extracted from the voice signal to detect specific language impairment (SLI) in children; assessments were performed on the SLI speech corpus.
to be used with MFCC features. Like DNNs, other powerful deep learning algorithms are CNNs and Recurrent Neural Networks (RNNs), which have been found to give better accuracy with larger data sets [75], [76].
Much of the work has traditionally been done with the application of CNNs on the speech spectrogram, which is in the frequency domain. For solutions to speech analysis, DNN is outperformed by CNN [77]. Authors have frequently applied DNN-based systems successfully to resolve speaker recognition issues; however, much of the research conducted is in the field of language recognition [78]. Researchers have proved that DNNs have optimized performance in the challenging speaker recognition task in the domain of speech-based human-computer interaction. The large feature data set required for tuning and enhancing the performance of ASR on different supervised learning tasks is the limitation of this classifier. Authors in [79] have
compared the method to GMM-based systems. Continuing research in the application of various machine learning algorithms for ASR, authors in [84] explored and experimented with four classifier models, namely SVM, K-nearest neighbours, Random Forest and Logistic Regression, and with Artificial Neural Networks (ANN), for speaker identification from voice. ANN-based systems gave the best accuracy of 96.03% for ASR [85].
• Bilabial: both lips come together (p, b, m, w)
• Labiodental: lower lip and upper teeth make contact (f, v)
• Dental: the tongue makes contact with the upper teeth (th)
• Alveolar: the tip of the tongue makes contact with the alveolar ridge (t, d, s, z, n, l)
• Palatal: the tongue approaches the palate (j, r, sh)
• Velar: back of the tongue contacts the velum (k, g, ng)
• Glottal: this is really an unvoiced vowel (h)
Image from: https://notendur.hi.is
As the anatomy of every person is different, the voice of every individual will also be unique, as the pitch, amplitude and frequency of the sound produced depend on the physiological mechanism of speech production and the articulators, as discussed before. With this information, the rest of the chapter discusses how voice can be a significant
biometric trait to recognize a speaker. The ASR system has been implemented in four key steps.

Figure 3.2: Specific sound production due to the position of the articulators (image from https://notendur.hi.is)
Figure 3.3 gives a pictorial description of the steps undertaken in the experimentation carried out for speaker recognition from text-independent speech information taken from the standard idTIMIT database, which was further validated on a database created locally in the college.
[Figure 3.3: Steps of the ASR system — preprocessing (framing, windowing, pre-emphasis), feature extraction (MFCC, MFCC + Delta, MFCC + Delta-Delta, CNN), feature selection (Fisher score algorithm), training, classification/modelling (VQ, GMM, SVM, CNN) and testing]
3.3.1 Denoising the speech signal
The aim of denoising speech signals is to produce noise-free speech signals from noisy recordings, while improving the perceived quality of the speech component and increasing its intelligibility. The effectiveness and quality of speaker-specific features are limited by the degraded audio signal, affected by the environmental noise introduced at the time of recording. To improve the perceptual characteristics of the audio signal and make it legible, enhancement of the audio signal needs to be undertaken to minimize the alteration that has occurred. The additive nonstationary ambient noise can be eliminated using one of the classical methods, spectral subtraction. This algorithm has been implemented in the work done for this research.
In equation (3.2), the frame number is represented by k. Assuming the speech signal to be segmented into frames, we drop k to simplify the equation. As speech and background noise are uncorrelated, the short-term power spectrum of y[n] has no cross-terms. The short-term power spectrum of the signal, Y(ω), can be represented in the Fourier transform domain and can be derived as in equation (3.3).

|Y(ω)|² = |S(ω)|² + |D(ω)|²    (3.3)
The noise power spectrum is estimated by averaging over the noise-only frames; when longer averages are taken, the result converges to an estimate of the noise power spectrum. The spectral subtraction method estimates the magnitude spectrum, while the boosted signal is finally computed with the application of the inverse DFT. The spectral subtraction of the noise signal from the speech signal can also be represented as a filter, as given in equation (3.6); spectral subtraction is then represented by multiplication, as in equation (3.7).
Reconstruction of the resultant signal is achieved after estimating the phase of the speech signal. The noisy signal and the estimated clean signal are commonly considered to have the same phase, with the assumption that for the human ear the short-term phase is rather unimportant. The resultant audio waveform per frame is thus arrived at as shown in equation (3.8).

Ŝ(ω) = |Ŝ(ω)| e^(j∠Y(ω)) = H(ω)Y(ω)    (3.8)
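As an illustrative sketch of equations (3.3)-(3.8) (not the exact implementation used in this work), magnitude spectral subtraction over non-overlapping frames can be written as follows; the frame length, the number of leading noise-only frames and the spectral floor are assumed parameters of this sketch.

```python
import numpy as np

def spectral_subtraction(noisy, noise_frames=5, frame_len=256, floor=1e-3):
    """Sketch of magnitude spectral subtraction (eq. 3.3-3.8).

    The noise power spectrum |D(w)|^2 is estimated by averaging the
    first `noise_frames` frames, assumed to contain no speech, and
    subtracted from each frame's power spectrum.  The noisy phase is
    reused for reconstruction (eq. 3.8).  Frames do not overlap.
    """
    n_frames = len(noisy) // frame_len
    noise_psd = np.mean([
        np.abs(np.fft.rfft(noisy[i*frame_len:(i+1)*frame_len]))**2
        for i in range(noise_frames)], axis=0)
    out = np.zeros(n_frames * frame_len)
    for i in range(n_frames):
        frame = noisy[i*frame_len:(i+1)*frame_len]
        Y = np.fft.rfft(frame)
        # |S(w)|^2 = |Y(w)|^2 - |D(w)|^2, floored to stay non-negative.
        clean_psd = np.maximum(np.abs(Y)**2 - noise_psd, floor)
        # Keep the noisy phase: S = |S| * e^{j angle(Y)}  (eq. 3.8).
        S = np.sqrt(clean_psd) * np.exp(1j * np.angle(Y))
        out[i*frame_len:(i+1)*frame_len] = np.fft.irfft(S, frame_len)
    return out
```

The noisy phase is reused during reconstruction, consistent with the assumption above that the short-term phase is perceptually unimportant.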
3.3 Preprocessing the speech signal
Preprocessing is a fundamental signal processing task to be applied before relevant features can be extracted; it is done to enhance the performance of the feature extraction algorithm. Eliminating the DC component, pre-emphasis filtering of the signal and normalizing the waveform amplitude are some necessary and commonly used preprocessing techniques.
3.3.2 Pre-emphasis
With energy concentrated in the low-frequency region, a negative spectral slope is observed in the estimated spectrum of speech. Vowels, which carry the voiced information of speech, have more energy at the lower frequencies than at the higher frequencies. This is also known as the spectral tilt.
[Figure 3.4: Speech signal spectrum before and after pre-emphasis, frequency axis 0-22050 Hz]
The pre-emphasis filter compresses the dynamic range of the speech signal's power spectrum by flattening the spectral tilt. Equation (3.9) denotes the filter:

P(z) = 1 − az⁻¹    (3.9)

In equation (3.9), the typical range of a lies between 0.9 and 1.0. For processing the speech signal, the glottal signal is usually modelled using a two-pole filter for which both poles are close to the unit circle.
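A minimal sketch of the filter of equation (3.9); the default a = 0.97 is a common illustrative choice, not necessarily the value used in this work.

```python
import numpy as np

def pre_emphasis(signal, a=0.97):
    """Apply the pre-emphasis filter P(z) = 1 - a*z^-1 (eq. 3.9).

    In the time domain, y[n] = x[n] - a*x[n-1]; the first sample is
    passed through unchanged.  High frequencies are boosted, which
    flattens the spectral tilt.
    """
    signal = np.asarray(signal, dtype=float)
    return np.append(signal[0], signal[1:] - a * signal[:-1])
```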
3.3.3 Amplitude normalization
The placement of the microphone relative to the speaker affects the consistency of the energy levels in the recorded speech signals, and this variation is taken care of by amplitude normalization. This preprocessing technique cancels the varying energy levels between recorded signals, thereby preserving the significance of the energy-related features which are extracted subsequently. The normalization is achieved by dividing every sample of the signal by the maximum absolute value of the signal, thereby restricting the dynamic range of the signal to within −1.0 and +1.0 and reducing the variation in the waveform across utterances.
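A minimal sketch of the peak amplitude normalization described in this subsection; the function name is this sketch's own.

```python
import numpy as np

def normalize_amplitude(signal):
    """Peak-normalize a signal into [-1.0, +1.0] (section 3.3.3).

    Every sample is divided by the maximum absolute value of the
    signal, cancelling level differences between recordings.  A
    silent (all-zero) signal is returned unchanged.
    """
    signal = np.asarray(signal, dtype=float)
    peak = np.max(np.abs(signal))
    return signal / peak if peak > 0 else signal
```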
3.3.4 Windowing
Speech is a continuously varying audio signal. Application of the window function approximates a continuously varying signal by a stationary one, enabling the capture of enough speech- or speaker-related information. A window size of 25 ms width, as shown in figure 3.6, is usually considered large enough for the signal to be quasi-stationary and yet retain enough information.
[Figure 3.6: Framing and windowing — the pre-emphasized signal P(z) = 1 − az⁻¹ is split into 25 ms frames with a 10 ms shift, and a Hamming window is applied to each frame]
w(n) = a − (1 − a) cos(2πn/(N − 1)),  0 ≤ n ≤ N − 1    (3.10)

N is the length of the window in equation (3.10). A Hanning window is applied when a is 0.5, and when a is 0.54 it is a Hamming window. Some other commonly used windows are Bartlett, Blackman, Harris, Kaiser and Triangular. Table 3.1 lists the equations for some commonly used window functions.

Table 3.1: Window types and their equations

Window      | Equation
Rectangular | w(n) = 1
[Figure 3.7: Framing a speech signal — amplitude versus sample number n]
The frame length N and the step size M are the two parameters which are conventionally used to frame a signal. With N as the window size and M as the step size, framing of the speech signal can be implemented as follows:

i. Begin the first frame from the start of the speech signal; the N/2-th sample will be the centre of the frame.
ii. Keep moving the frame ahead by M points till the signal terminates. The i-th frame will be centred at the sample (i − 1)·M + N/2.
iii. The points remaining at the end, if not large enough to build a complete frame, can be dumped; else construct the last frame by additional zero padding to maintain the step size for getting another full frame.
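The framing steps above, combined with the Hamming window of equation (3.10), can be sketched as follows; N = 400 and M = 160 (25 ms and 10 ms at a 16 kHz sampling rate) are illustrative values, not necessarily those used in this work.

```python
import numpy as np

def frame_signal(x, N=400, M=160, window=True):
    """Split a signal into frames of length N with step M (sec. 3.3.4).

    The tail is zero-padded so the last frame is complete, and the
    Hamming window w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1)) (eq. 3.10
    with a = 0.54) is applied to each frame when requested.
    """
    x = np.asarray(x, dtype=float)
    n_frames = max(1, int(np.ceil((len(x) - N) / M)) + 1)
    padded = np.append(x, np.zeros(n_frames * M + N - len(x)))
    frames = np.stack([padded[i*M:i*M + N] for i in range(n_frames)])
    if window:
        frames = frames * np.hamming(N)
    return frames
```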
The higher-order coefficients should also be examined, as they contribute to the overall speaker recognition performance with increasing levels of perception of human speech [146]. Speaker information is also present in the spectral dynamics; the log spectrum represents these features, which are commonly known as dynamic or delta features. The first and second increments of spectral distances can be further computed and are found to contribute significantly to the mixture of features. As MFCC features continue to remain the most significant speech features, this work extracts them from the speech signals. The procedure for extracting MFCC features and the derivative features is depicted in figure 3.8 and is further explained in detail.
[Figure 3.8: MFCC and derivative feature extraction — windowing, DFT, log magnitude, IDFT, and first and second derivatives yielding the MFCC and delta features]
The log of the filter bank energies is computed and their Discrete Cosine Transform is obtained. The coefficients computed in this step represent the MFCC features; a fixed number of lower-order coefficients is retained and the rest are discarded. The first-order and second-order derivatives are then computed to estimate the Delta and Delta-Delta features.

The detailed description of the MFCC feature extraction procedure is discussed in the following section.
The time-varying speech signal is converted into a quasi-stationary, i.e. a statistically stationary, signal by applying 20-40 ms window frames. The frame size selection is significant.

S_i(k) = Σ_{n=1}^{N} S_i(n) h(n) e^(−j2πkn/N),  1 ≤ k ≤ K    (3.11)

where S_i(n) is the speech signal after framing, i denotes the frame number, h(n) is an N-sample-long analysis window and K is the length of the DFT.

Once the speech signal is divided into quasi-stationary frames, the next step is to compute the power spectrum. The power spectrum mimics the function of the cochlea and identifies the frequencies present in the frame. Its calculation for frame i is given in equation (3.12).

P_i(k) = (1/N) |S_i(k)|²    (3.12)
where P_i(k) is the power spectrum of frame i, computed by taking the absolute value of the complex Fourier transform and squaring the result, which gives the periodogram spectral estimate. Conventionally a 512-point FFT is performed and 257 coefficients are retained. The periodogram estimate of the power spectrum, however, cannot discern between two closely spaced frequencies present in the frame, and this limitation becomes more pronounced as frequency increases.
To overcome this limitation, 20-40 mel-spaced filter banks are applied to the periodogram power spectrum estimate, commonly in the form of 26 triangular filters. Each filter is mostly zero except over a certain section of the spectrum. The filter bank energies are computed by multiplying each filter bank with the power spectrum and summing the coefficients; this yields 26 numbers which indicate the energy present in each filter bank. Equation (3.13) gives the formula used for computing the filter bank slopes.
H_m(k) = 0,  for k < f(m−1)
H_m(k) = (k − f(m−1)) / (f(m) − f(m−1)),  for f(m−1) ≤ k ≤ f(m)
H_m(k) = (f(m+1) − k) / (f(m+1) − f(m)),  for f(m) ≤ k ≤ f(m+1)
H_m(k) = 0,  for k > f(m+1)    (3.13)
In equation (3.13), m denotes the filter number, while f(m) enumerates the mel-spaced frequencies defined over m + 2 boundary points. To make the speech features match closely what we humans can hear, a compression operation needs to be performed subsequently: computing the log of each of the 26 energies gives 26 log filter bank energies. This is essentially motivated by how we humans hear loudness, which is not perceived on a linear scale; if the sound is loud to begin with, a variation in energy might not sound different. The mel-scale power spectrum, which mimics the human sense of loudness perception, is obtained as given in equation (3.14).

S[m] = Σ_k P(k) H_m(k),  1 ≤ m ≤ M    (3.14)

The DCT transforms the logarithm of the mel spectrum S[m] of the signal into the time domain. These are the decorrelated MFCC coefficients, and conventionally only 13 coefficients are retained. Equation (3.15) computes the MFCC coefficients from S[m].

MFCC[i] = Σ_{m=1}^{M} log(S[m]) cos(πi(m − 1/2)/M),  i = 1, 2, ..., L    (3.15)
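The chain of equations (3.13)-(3.15) can be sketched as follows. The standard 2595·log10(1 + f/700) mel mapping is an assumption of this sketch, while 26 filters, a 512-point FFT and 13 retained coefficients follow the conventions quoted above; the helper names are this sketch's own.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=26, nfft=512, sr=16000):
    """Triangular filters of eq. (3.13), built over m+2 mel-spaced edges."""
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0),
                                  n_filters + 2))
    bins = np.floor((nfft + 1) * edges / sr).astype(int)
    H = np.zeros((n_filters, nfft // 2 + 1))
    for m in range(1, n_filters + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):          # rising slope
            H[m - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):          # falling slope
            H[m - 1, k] = (r - k) / max(r - c, 1)
    return H

def mfcc_from_power(power, n_filters=26, n_ceps=13, sr=16000):
    """Log mel energies (eq. 3.14) followed by a DCT (eq. 3.15)."""
    H = mel_filterbank(n_filters, (power.shape[-1] - 1) * 2, sr)
    S = np.maximum(power @ H.T, 1e-12)   # filter bank energies, floored
    logS = np.log(S)
    m = np.arange(n_filters)
    i = np.arange(n_ceps)[:, None]
    D = np.cos(np.pi * i * (m + 0.5) / n_filters)  # DCT-II basis
    return logS @ D.T
```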
In equation (3.15), L represents the number of cepstral coefficients retained. The MFCC coefficients describe only the power spectral envelope of a single frame, whereas speech also carries information in its dynamics, i.e. the trajectories of the MFCC coefficients over a period of time, which need to be examined. The Delta and Delta-Delta coefficients of the MFCC, which are the differential and the acceleration coefficients respectively, can enrich the speaker features when appended to the conventional features [147]. The contribution of dynamic features in enhancing ASR has been examined in this work. Equation (3.16) computes the differential coefficients.

d_t = Σ_{n=1}^{N} n (c_{t+n} − c_{t−n}) / (2 Σ_{n=1}^{N} n²)    (3.16)
In equation (3.16), the differential coefficient is represented by d_t, where t is a frame number. It depends on the static coefficients evaluated from c_{t+n} through c_{t−n}, and N is the number of neighbouring frames used in the equation. The dynamic or shifted delta MFCC features are further evaluated for computing the acceleration, also known as the Delta-Delta features.
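Equation (3.16) can be sketched directly; edge frames are handled here by edge-padding, which is an assumption of this sketch. Applying the same function twice yields the Delta-Delta (acceleration) features.

```python
import numpy as np

def delta(feats, N=2):
    """Delta coefficients per eq. (3.16).

    d_t = sum_{n=1..N} n*(c_{t+n} - c_{t-n}) / (2*sum_{n=1..N} n^2),
    with the (frames x coeffs) matrix edge-padded so every frame has
    N neighbours on each side.
    """
    feats = np.asarray(feats, dtype=float)
    denom = 2 * sum(n * n for n in range(1, N + 1))
    padded = np.pad(feats, ((N, N), (0, 0)), mode='edge')
    out = np.zeros_like(feats)
    for t in range(feats.shape[0]):
        for n in range(1, N + 1):
            out[t] += n * (padded[t + N + n] - padded[t + N - n])
    return out / denom
```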
3.4 Feature selection
Appending the derivative coefficients to the static coefficients enriches the audio feature set, but gives rise to high-dimensionality feature sets containing redundant and irrelevant features which decrease the efficiency of the ASR system. Under these circumstances, selecting speaker-relevant features is essential and meaningful in reducing the issues of overfitting and redundancy. The complexity of computation thus reduces, with fewer but significant features, leading to an enhanced ASR system. The methods for feature selection can be broadly categorized; the selection objective can be written as in equation (3.17).
r = argmax F(r),  s.t. |r| = m    (3.17)

where |r| represents the set's cardinality and m is the size of the selected feature set. A combinative optimization is accomplished using this equation. To arrive at a solution for global optimization, however, an investigative method should be the approach: initially compute and
assign an independent score to all the features as per the standard F; subsequently the highest-ranked m features are selected.

The Fisher score algorithm discriminates between the inter-class and intra-class features in deciding the selection of the feature set. The between-class feature variance should be large, while the variance of the within-class features should be small, to have good discrimination among the selected features. For the feature labelled f_j, its score is calculated as shown in equation (3.18).
F(f_j) = Σ_{k=1}^{c} n_k (μ_{j,k} − μ_j)² / Σ_{k=1}^{c} n_k σ_{j,k}²    (3.18)

where the overall mean of feature f_j is given as μ_j, while the mean of feature f_j for the samples belonging to class k is μ_{j,k}; the number of samples in class k is n_k, σ_{j,k}² is the variance of feature f_j within class k, and the value of feature f_j in sample x_i is f_{j,i}. By this method Fisher scores are computed for all the extracted features, from which the highest-ranked m features are retained for further processing. In this experimentation, the role of the Fisher score in improving the ASR has been examined.
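The scoring and ranking of equations (3.17)-(3.18) can be sketched as follows; the function names are this sketch's own.

```python
import numpy as np

def fisher_scores(X, y):
    """Fisher score of each feature, as in eq. (3.18).

    Ratio of the between-class variance of the per-class means to the
    weighted sum of within-class variances; higher scores indicate
    more discriminative features.
    """
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    mu = X.mean(axis=0)
    num = np.zeros(X.shape[1])
    den = np.zeros(X.shape[1])
    for k in np.unique(y):
        Xk = X[y == k]
        nk = len(Xk)
        num += nk * (Xk.mean(axis=0) - mu) ** 2
        den += nk * Xk.var(axis=0)
    return num / np.maximum(den, 1e-12)

def select_top_m(X, y, m):
    """Return the indices of the m highest-ranked features (eq. 3.17)."""
    return np.argsort(fisher_scores(X, y))[::-1][:m]
```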
Once the features are extracted and the most relevant ones are selected by optimizing the static and dynamic feature set, the next step in ASR computation is to train them for creating a speaker model. Models for all the speakers are created similarly. A test sample of data which has preferably not been enrolled, i.e. is not part of the trained data, is then deployed to test for a match against the speaker models formed from the trained data.

Conventional GMM, SVM (the algorithm developed from statistical learning theory by Vapnik) and 1-dimensional CNN are the classifiers employed for this research. The next section discusses them in detail.
3.5 Classifiers
Based on the algorithm employed, the role of the classifier is vital in generating a speaker model which will help predict the speaker in the testing phase of an ASR system. In this work, classifier modelling of the extracted features has been carried out using Vector Quantization (VQ), GMM modelling and SVM modelling; a 1-D CNN, a deep learning technique, has also been used for classification. The next section discusses the various classifiers.
The VQ process involves selecting K clusters initially and assigning a sample data point to the closest cluster, as in equation (3.19).

Q(x) = argmin_k d(x, c_k)    (3.19)

Vector quantization uses the most popular and widely used Linde-Buzo-Gray (LBG) algorithm; equation (3.20) is applied for the creation of a codebook.

c⁺ = c(1 + ε),  c⁻ = c(1 − ε)    (3.20)

In equation (3.20), each codeword c is split in two, doubling the codebook size, and the splitting parameter is given by ε.
Thus, from the derived feature vectors, the LBG algorithm performs the complex task of building the VQ codebook by clustering the set of training vectors. It initially designs the 1-vector codebook as the centroid of the complete set of training vectors. The codebook size is then doubled by splitting each codeword. For each individual training vector, a search for the nearest-neighbour codeword in the current codebook is done; similarity to the closest codeword is denoted as a match, and the vectors are subsequently assigned to that nearest codeword. Iterations then update the codewords by using the newly computed centroids of the training vectors. The process of codebook creation is an iterative activity and is continued till the required codebook level is accomplished. For this work, a codebook of level 8 has been created. Figure 3.10 represents the process of the implementation of the LBG algorithm.
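The split-and-refine loop of equations (3.19)-(3.20) and figure 3.10 can be sketched as follows; the convergence tolerance and ε are assumed values.

```python
import numpy as np

def lbg_codebook(vectors, size=8, eps=0.01, tol=1e-4):
    """LBG codebook design (eq. 3.19-3.20, figure 3.10).

    Start from the 1-vector codebook (the global centroid), repeatedly
    split every codeword into c*(1+eps) and c*(1-eps), then refine by
    nearest-neighbour assignment and centroid updates until the
    relative change in distortion D falls below tol.
    """
    X = np.asarray(vectors, dtype=float)
    book = X.mean(axis=0, keepdims=True)
    while len(book) < size:
        book = np.vstack([book * (1 + eps), book * (1 - eps)])
        prev = np.inf
        while True:
            d = ((X[:, None, :] - book[None]) ** 2).sum(-1)
            labels = d.argmin(axis=1)                 # eq. 3.19
            dist = d[np.arange(len(X)), labels].mean()
            for k in range(len(book)):                # centroid update
                if np.any(labels == k):
                    book[k] = X[labels == k].mean(axis=0)
            if prev - dist < tol * max(dist, 1e-12):
                break
            prev = dist
    return book
```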
[Figure 3.10: The LBG algorithm — initialize the centroid, double the codebook by splitting, cluster the feature vectors, find the centroids, compute the distortion D, and repeat until (D − D′)/D < ε]
SVM Classifier

The Support Vector Machine (SVM) algorithm is widely used for solving pattern recognition problems. SVM is a discriminative classifier; it classifies the training data available into classes by constructing hyperplanes which maximize the margin between the training features. For a feature space in n dimensions, a hyperplane is represented as in equation (3.23).
The hyperplane maps all data points lying on its positive side to +1, while a data point will be marked as −1 if it lies on the negative side, as given in equation (3.25).

f(x) = +1 if x ∈ P;  f(x) = −1 if x ∈ N     (3.25)

In equation (3.25), x ∈ P represents a data point on the positive side of the hyperplane, while x ∈ N represents a data point on the negative side of the plane.
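Equation (3.25) amounts to reading off the side of the hyperplane from the sign of the decision function. A tiny illustration with an assumed (untrained) hyperplane, not a fitted SVM model:

```python
import numpy as np

def svm_label(x, w, b):
    """Label a point by the side of the hyperplane w.x + b = 0,
    as in equation (3.25): +1 on the positive side, -1 on the negative."""
    return 1 if np.dot(w, x) + b > 0 else -1

w = np.array([1.0, -1.0])   # illustrative hyperplane normal (assumption)
b = 0.0
```

For instance, the point (2, 1) falls on the positive side and (0, 3) on the negative side of this plane.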
s_k is the kth kernel's bias, which is learnt, while the activation function is given by f and * denotes element-wise multiplication [157].

The pooling block, added in between several convolution layers, allows the passage of extracted features along the several layers of pooling. This aids in reducing the dimension of the feature map. For obtaining a smaller and deeper feature set, the network subsequently attaches fully connected layers once the multiple convolution layers, along with the subsequent pooling layers, are computed. The network finally flattens the feature maps and applies the Softmax layer, resulting in a multi-class, completely connected output layer.
Having discussed the steps involved in implementing the text-independent voice-based ASR system, the next section discusses the execution of the system and analyses the results obtained.
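The convolution, ReLU, pooling and Softmax blocks described above can be sketched in plain numpy; all shapes and weights here are hypothetical, chosen only to mirror the 80-dimensional feature vector and 40-class output discussed in this chapter:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def conv1d(x, kernel, bias):
    """Valid 1-D convolution of a feature vector with one kernel
    (illustrative; a real CNN layer holds many such kernels)."""
    k = len(kernel)
    return np.array([np.dot(x[i:i + k], kernel) + bias
                     for i in range(len(x) - k + 1)])

def maxpool1d(x, size=2):
    n = len(x) // size
    return x[:n * size].reshape(n, size).max(axis=1)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(1)
features = rng.normal(size=80)          # e.g. 40 MFCC + 40 delta coefficients
fmap = maxpool1d(relu(conv1d(features, rng.normal(size=5), 0.1)))
probs = softmax(rng.normal(size=(40, len(fmap))) @ fmap)   # 40-class dense layer
```

The pooling halves the feature-map length, and the Softmax output is a probability vector over the 40 speaker classes.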
Implementation and Results for ASR

The VidTIMIT dataset is an audio-video dataset of 43 speakers, of which 19 are female and 24 are male. Each speaker recited 10 sentences of small durations. The speech is drawn from the corpus of the NTIMIT database, which covers various research topics related to speech and speaker recognition.
The details of the dataset need to be discussed, as they influence the speaker characteristics. The recording of the speaker recitations was conducted in 3 separate sessions, keeping a gap of 7 days between sessions 1 and 2, while a delay of 6 days was kept before recording session 3. This delay in recording was introduced for the purpose of ensuring variation in the speaker's voice, which is influenced by the change in physical appearance like clothes, make-up and hairstyle. The different moods of the speaker have affected the pronunciation, thus allowing variation across sessions.
Speakers: 43 (24 male, 19 female)
Recordings: 10 sentences per person (6 in session 1, 2 in session 2, 2 in session 3)
Session gaps: 7 days between sessions 1 and 2; 6 days between sessions 2 and 3
Environment: noisy office
Storage: speech as mono, 16-bit, 32 kHz .wav files; images as JPEG
Camera: the zoom factor was randomly perturbed after each recording
Taking an example of a sample audio sentence of speaker 1, i.e. fadgo/audio/sa1, the parameters chosen for MFCC feature extraction, including the number of MFCC coefficients retained from every frame, are listed as follows.
Speech signal duration: 4.76 sec for this sentence
Window: 10 msec (320 samples obtained)
Window overlap: 25%
Total no. of frames: 633
Number of filters: 24
Order of MFCC: 13
Total coefficients: 633 × 13
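The frame arithmetic implied by these parameters can be checked as below. This is a sketch: the 32 kHz sampling rate is inferred from 320 samples per 10 ms window, and counting only full frames yields 634, one more than the 633 reported, which presumably drops an edge frame.

```python
fs = 32_000                 # sampling rate implied by 320 samples per 10 ms
duration = 4.76             # seconds, for the example sentence
win = int(fs * 0.010)       # 10 ms window -> 320 samples
hop = int(win * (1 - 0.25)) # 25% overlap  -> 240-sample hop
n_samples = int(duration * fs)
n_frames = 1 + (n_samples - win) // hop   # full frames only
```

With these values the hop is 240 samples, so consecutive windows share 80 samples.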
Figure 3.13 Original signal fadgo/audio/sa1
Figure 3.14 Hamming Window
Figure 3.15 Waveform of frame 10 of the sample signal after applying the Hamming window
With an overlap of 25% and a Hamming window of 10 msec, MFCC features of order 13 were extracted, i.e. a total of 13 MFCC features have been retained per frame. The preprocessed waveform is given in figure 3.16. Figure 3.17 gives the nature of the waveform observed after MFCC features were extracted from it. The same speech signal fadgo/audio/sa1 is considered. The figures are of frame 10.
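The Hamming window of figure 3.14 is the standard raised-cosine taper; a short sketch of generating it and applying it to one 320-sample frame (the constant frame here is a stand-in for the real signal):

```python
import numpy as np

N = 320                                   # samples per 10 ms frame
n = np.arange(N)
hamming = 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))  # same as np.hamming(N)

frame = np.ones(N)                        # stand-in for frame 10 of the signal
windowed = frame * hamming                # tapered frame, as in figure 3.15
```

The taper starts and ends near 0.08 and peaks near 1 at the frame centre, which suppresses spectral leakage at the frame edges.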
Figure 3.16 Preprocessed waveform (frame 10)

Figure 3.17 Waveform after MFCC feature extraction (frame 10)
d(p, q) = sqrt( Σ_i (p_i − q_i)² )     (3.27)

In equation (3.27), d(p, q) is the Euclidean distance between points p and q.
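Equation (3.27) in code:

```python
import math

def euclidean(p, q):
    """Equation (3.27): d(p, q) = sqrt(sum_i (p_i - q_i)^2)."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))
```

For example, the distance between (0, 0) and (3, 4) is 5.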
Table 3.3 True match obtained for 26 speakers
Spectral and temporal speech features classified with GMM, SVM and 1-D CNN classifiers for ASR

Having implemented the voice-biometric-based ASR using the clustering VQ and the parametric Gaussian model, there was ample scope for improvement where the accuracy of ASR was concerned.
With advances in machine learning and neural networks, it became imperative to explore and apply these techniques for addressing the text-independent speaker recognition problem. Recently, for pattern recognition problems, SVM, a supervised machine learning discriminative classifier, is increasingly being applied. Complex deep learning algorithms like Long Short-Term Memory (LSTM) networks, which are a type of recurrent neural network, are applied in learning for sequence prediction problems. However, in this research CNN has been explored to improve the text-independent speaker recognition problem, where who is speaking is important regardless of what is being spoken.
The performance of ASR with classifiers SVM and GMM is compared, and subsequently the performance of the system with application of 1-D CNN is evaluated.

Taking an example of a speech signal with a duration of 4.76 seconds, windowed with a Hamming function of 10 msec and 25% overlap and 32-order filter banks, 320 samples were obtained per window. 13 MFCC coefficients were retained, as conventionally they are deemed to contain enough speaker-related information. Figure 3.28 shows the retained MFCC coefficients.
Figure 3.28 Retained MFCC coefficients
The GMM-modelled features resulted in an accuracy of 55.81%, while SVM modelling gave an accuracy of 79.05%. The experimental performance of speaker recognition for a dataset of 430 test sentences is tabulated in table 3.8.
Table 3.8 Accuracy of speaker recognition with 13, 20 and 40 MFCC coefficients

No. | Extracted Feature | No. of Coefficients per frame | Classifier | Accuracy of ASR (%)
1.  | MFCC              | 13                            | GMM        | 45.34
    |                   |                               | SVM        | 51.16
2.  | MFCC              | 20                            | GMM        | 52.43
    |                   |                               | SVM        | 63.95
3.  | MFCC              | 40                            | GMM        | 55.81
    |                   |                               | SVM        | 79.05
Results of ASR displayed in table 3.8 indicate that the GMM classification of feature vectors was outdone by SVM classification; the system performance with SVM showed a marked improvement. Going ahead, SVM classifiers were therefore selected over GMM. Research points out that for speech recognition problems, MFCC features carry sufficient information. However, speaker-related information is also present in the dynamics of speech. Hence further experimentation was conducted to establish the role of the dynamic features in improving speaker recognition.
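The dynamic (delta) features mentioned above capture the trajectory of the static MFCCs over time. The thesis does not print the delta formula at this point, so the sketch below uses the common regression-based definition (an assumption): each delta frame is computed from ±N neighbouring static frames.

```python
import numpy as np

def delta(coeffs, N=2):
    """Delta features over a T x D matrix of static MFCCs:
    d_t = sum_{n=1..N} n*(c_{t+n} - c_{t-n}) / (2 * sum_{n=1..N} n^2),
    with edge frames padded by repetition (a common convention, assumed here)."""
    T = len(coeffs)
    padded = np.concatenate([coeffs[:1].repeat(N, axis=0),
                             coeffs,
                             coeffs[-1:].repeat(N, axis=0)])
    denom = 2 * sum(n * n for n in range(1, N + 1))
    return np.array([sum(n * (padded[t + N + n] - padded[t + N - n])
                         for n in range(1, N + 1)) / denom
                     for t in range(T)])

static = np.arange(10, dtype=float).reshape(10, 1)   # a simple ramp
d = delta(static)
```

On a ramp of slope 1, the interior delta values come out as exactly 1, as expected for a first-order trend estimate.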
The static MFCC features and the computed delta and delta-delta features were combined into a larger feature dimension for computing ASR. The nature of the delta features is seen in figure 3.29 along with figure 3.30.
Figure 3.29 Delta MFCC coefficients

Figure 3.30 Delta-Delta MFCC coefficients
ASR was now computed with a larger feature set. Three sets of data size were chosen. The 1st set had 13 Delta features appended with 13 MFCC, the 2nd one had 20 Delta appended with 20 MFCC, and the third feature set, comprising 80 features, similarly had the addition of 40 MFCC coefficients with 40 dynamic coefficients. The best accuracy of 86.04% was attained with these 80 features, which is 26% higher than the accuracy achieved with 13 MFCC and Delta features. Table 3.9 lists the results of ASR computed with different sizes of feature sets.
Table 3.9 ASR with SVM classifier for static and dynamic coefficients

Features           | Coefficients Selected | ASR Accuracy (%)
Static and Dynamic | 13+13=26              | 60.64
                   | 20+20=40              | 76.74
                   | 40+40=80              | 86.04
A larger feature set often results in an increase in computational complexity and redundancy, also known as the curse of high dimensionality. A model tries to overfit these redundant features, creating further problems. It was therefore decided to apply a feature selection algorithm to overcome these limitations.

In this research, the most relevant speaker-specific information was selected by the Fisher score algorithm. Three experiments were carried out with feature vector sizes of 430×26, 430×40 and 430×80 dimensions. The best selected features were subsequently classified using the SVM classifier. The graph showing the Fisher scores of 40 coefficients is available in figure 3.31. The lower the score, the more relevant the feature. The Fisher scores indicate that coefficients numbered 3, 4, 6, 10, 16, 28, 32, 34 and 38 carry significant speaker characteristics.
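One standard form of the Fisher score (assumed here, since the thesis does not print its exact variant) ranks each feature by between-class scatter over within-class scatter:

```python
import numpy as np

def fisher_scores(X, y):
    """For each feature j:
    F_j = sum_k n_k (mu_kj - mu_j)^2 / sum_k n_k var_kj,
    where k runs over classes (speakers). Higher F_j means the feature
    separates the classes better under this particular definition."""
    overall = X.mean(axis=0)
    num = np.zeros(X.shape[1])
    den = np.zeros(X.shape[1])
    for label in np.unique(y):
        Xk = X[y == label]
        num += len(Xk) * (Xk.mean(axis=0) - overall) ** 2
        den += len(Xk) * Xk.var(axis=0)
    return num / den

# two toy classes: feature 0 separates them, feature 1 is noise
X = np.array([[0.0, 5.0], [0.1, 1.0], [1.0, 4.9], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])
scores = fisher_scores(X, y)
```

The discriminative feature receives a much larger score than the noisy one, which is the basis for keeping only the top-ranked coefficients.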
Figure 3.31 Fisher scores for MFCC coefficients
An ASR accuracy of 56.97% was achieved for the best 12 scores from a total of 26 scores obtained for the combination of MFCC with Delta coefficients. From the 2nd combination of 40 features, 77.90% speaker recognition accuracy was achieved with 15 features selected by the Fisher score algorithm, and from the last combination of 80 features, 94.51% accuracy of ASR was attained from 27 Fisher-score-selected optimum features.
The performance of ASR for GMM- and SVM-modelled sets of static and dynamic features is pictorially represented in figure 3.32, figure 3.33 and figure 3.34 respectively:

i) Performance of the speaker recognition system modelled with GMM/SVM classifying three different combinations of the feature set.

ii) Performance of the speaker recognition system with the improved feature set of concatenated MFCC and Delta features.

iii) ASR with the best features selected from the concatenated feature set after application of the Fisher score algorithm.
Figure 3.32 ASR with GMM/SVM as classifiers for different dimensions of MFCC coefficients
Figure 3.33 ASR performance with static and dynamic features modelled with SVM
Figure 3.34 Comparison of system performance with optimized features with that before application of the Fisher score algorithm

From the different experiments carried out, it can be observed that increasing the number of MFCC features did improve the accuracy of speaker recognition. The machine-learned SVM classifier, however, gave better accuracy as compared to that achieved with the parametric GMM classifier. Addition of dynamic features to the static features further improved the performance of the system. Finally, the best performance of the voice-biometric ASR system was computed by selecting the best features from the large feature set with application of the Fisher score algorithm. An 8.47% improvement in the accuracy of person recognition was achieved with feature optimization, as the accuracy went up to 94.51% from 86.04%.
The performance of the model was further evaluated with the following parameters:

i) Precision: the ratio of true predicted positive observations to the overall predicted positives.

ii) Recall: the ratio defining the rightly predicted positives obtained, to the entire class of actual positives.
Number of Optimized Feature Vectors | Accuracy | Precision | Recall | F1 Score
13 | 0.6976 | 0.5975 | 0.6788 | 0.5989
15 | 0.7790 | 0.7410 | 0.7552 | 0.7073
20 | 0.8255 | 0.7943 | 0.7987 | 0.7612
24 | 0.9069 | 0.8891 | 0.9041 | 0.8810
27 | 0.9451 | 0.9437 | 0.9604 | 0.9461
29 | 0.9534 | 0.9065 | 0.9065 | 0.9008
34 | 0.9186 | 0.8577 | 0.8719 | 0.8562
40 | 0.8023 | 0.7991 | 0.7999 | 0.7636
44 | 0.8023 | 0.7991 | 0.7999 | 0.7636
65 | 0.8023 | 0.7991 | 0.7999 | 0.7636
80 | 0.8023 | 0.7991 | 0.7999 | 0.7636
For the designed text-independent ASR system, the range of accuracy achieved considering the various optimized feature sets was from 69.76% at the least to 95.34% at the maximum. The maximum accuracy of 95.34% results from 29 optimally selected features. However, considering the precision, recall and F1-score performance parameters, the model with 27 optimized features and an ASR accuracy of 94.51% proves better than the other models. For this ASR model, the value of the precision parameter is 0.9437, indicating 94.37% accurate prediction of the positives. The recall obtained is 0.9604; the model is supposed to perform well when this value is greater than 0.5. The computed F1 score is 0.9461.
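The two definitions above, together with the F1 score used in the table, can be computed from the confusion counts (the counts below are illustrative, not from this experiment):

```python
def precision_recall_f1(tp, fp, fn):
    """Precision = TP/(TP+FP), Recall = TP/(TP+FN),
    F1 = harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
```

With 90 true positives, 10 false positives and 30 false negatives, precision is 0.9, recall is 0.75, and F1 is their harmonic mean, 9/11.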
Figure 3.35 is a plot of accuracy of the ASR model versus the computed Fisher scores for the
extracted features. 27 feature coefficients resulted in an optimized model performance.
Figure 3.35 ASR system accurac y for various feature set size
Figure 3.36 Implementation of CNN classifier: two convolution layers, each followed by a ReLU activation layer (to separate the 40 non-linear classes) and a dropout of 0.1 (to reduce overfitting), a flattening dense layer for the learnt high-level features, and a fully connected layer with Softmax activation for the multi-class problem, trained with the rmsprop optimizer and the entropy (log loss) criterion
A train-test ratio of 80:20 was selected, dividing the dataset of 430 sentences into 344 training samples and 86 testing samples. The ReLU rectifier function follows the convolution layer, building the non-linear margins needed to separate the 40 non-linearly-separable classes, while the introduction of a dropout of 0.1 helps decrease over-fitting issues. Subsequently, a Max-pooling layer was added in the network.
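The 80:20 split quoted above (430 sentences into 344 training and 86 testing samples) can be sketched as follows; the shuffling seed is an arbitrary assumption:

```python
import numpy as np

def train_test_split_80_20(n_items, seed=0):
    """Shuffle item indices and cut at 80%, the ratio used in this chapter."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_items)
    cut = int(0.8 * n_items)
    return idx[:cut], idx[cut:]

train_idx, test_idx = train_test_split_80_20(430)
```

The two index sets are disjoint, so no test sentence leaks into training.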
Figure 3.37 Training loss and testing loss of the model
Figure 3.38 Training and testing validation accuracy
From figure 3.37 it can be observed that the training and testing loss curves are non-linear but synchronized. Both losses decrease with the increase in the epoch values and become constant at approximately 600 epochs, which indicates that the model does not overfit the features. The training and the validation accuracy curves in figure 3.38 are also increasing; however, the testing accuracy observed is quite less than the training accuracy. To further reduce the difference in the train/test accuracies, which indicates a slight overfitting issue, a modified dropout value of 0.15 was experimented with. This resulted in an improved model.
With the delta coefficients affixed to the handcrafted features, a one-dimensional feature set of 80 was available for the CNN classifier. Model accuracy for training was computed at 94.77%, while the accuracy of testing also showed an improvement of 3%, as the accuracy computed was 73.25%. Figure 3.39 indicates that the training data is not memorized, as the gap between the two curves is small here, and a well-trained model has been created as the training and testing losses decrease steadily. The generalization capability of the model became much better with the increase in epochs; however, it stagnates after a while. The training and testing model accuracy is shown in figure 3.40.
Figure 3.39 Model losses for system trained with static plus dynamic coefficients
Figure 3.40 Training and testing validation accuracy
lied for image analysis and when used wit h spe ech
Even though CNN is being traditionally app
ctro gra m of the aud io signal, this wo rk has proved that it wo rks we ll
signals it works on the spe
we hav e obs erv ed tha t the SV M cla ssified model did not per for m as
with speech vectors and
well as the CNN based AS R system.
94
3.11 MPSTME database creation

The creation of the MPSTME database was part of the work carried out for this research. As all the experimentation was carried out on a standard database, there was a need to validate the results of ASR accuracy obtained in a real-time situation. Additionally, there was no database available in the Indian scenario to validate the results. It was hence decided to create a database in the college environment, which is noisy enough for a real-time situation.

Keeping the standard database as a reference, an audio-video database was created with 80 speakers comprising students, faculty, and staff of MPSTME. Like the VidTIMIT database, the number of sentences per speaker was chosen as 10: 5 sentences from the VidTIMIT database were retained, 4 sentences were framed with a word in each sentence containing all the vowels, and 1 sentence was of the speaker's choice. The recording was done in a noisy but well-lit office environment over a period of 6 months. The database of 72 speakers was finally selected for validating the work done on the standard database, as the recordings of 8 speakers were not quite audible. Of the chosen 72 speakers, 39 are males and 33 are females in the age group of 18-52 years. Each speaker recorded 10 sentences of an average duration of 4 secs each. The speech is available as a .wav signal which is 16-bit and sampled at a 48 kHz frequency, while the images extracted from the videos have been stored as JPEG. Table 3.12 gives an overview of the MPSTME database.
Table 3.12 Specifications of the MPSTME Database
For 72 speakers, the database size was 720 sentences. Static and dynamic features were extracted from these sentences. A train-test ratio of 80:20 was considered for processing the data. The SVM algorithm and one-dimensional convolution neural networks were used to classify the extracted feature vectors. The Fisher score algorithm further optimized the features, thus removing redundant information. The plot of the Fisher scores obtained for 80 features is given in figure 3.41.
Figure 3.41 Fisher scores for a total of 80 MFCC and Delta features
Figure 3.42 Feature selection from the optimized feature set
Features               | Optimized Coefficients | Classifier | Speakers | ASR Accuracy (%)
MFCC (40) + Delta (40) | 44                     | SVM        | 72       | 88.5
An accuracy of 88.5% was achieved with the 44 optimized features obtained from the concatenated static and dynamic features and the SVM classifier.

One-dimensional CNN was then applied on 576 training samples and tested with 144 samples for the same set of 44 optimized features. The accuracy of ASR computed is mentioned in table 3.14. The CNN model training parameters preferred were the same as the ones considered when experimenting with the VidTIMIT database.
Table 3.14 Accuracy of ASR for voice features of MPSTME database with 1-D CNN
The graph for training and testing accuracy for the 500 epochs considered is as shown in figure 3.43.

Figure 3.43 Model accuracy for training and testing
An ASR system forms a key requirement in today's increasingly digital world, where common utilities such as teller machines, security services, online shopping and banking have adopted automatic working systems. Protecting identities and preventing spoofing is another crucial application of ASR. However, developing an effective system to authenticate the user poses a challenge. Addressing this obstacle, the fundamental and non-invasive features of voice biometrics were identified and implemented as an appropriate choice to enhance ASR.
The computed results indicate that high-level convolution features, in addition to the differential features, encompass significant data of the speaker, resulting in a high accuracy of 94.77%. Table 3.15 summarizes the results of chapter 3.
Database | % ASR with MFCC + GMM on clean signal (Closed Set) | % ASR with MFCC + DELTA + SVM (Closed Set) | % ASR with features optimized with Fisher Score + SVM | % ASR with CNN (Train / Test)
[2] G. S. Morrison et al., "INTERPOL survey of the use of speaker identification by law enforcement agencies," Forensic Science International, Vol. 263, pp. 92-100, Jun. 2016.
[3] J. Fierrez-Aguilar, "Adapted Fusion Schemes for Multimodal Biometric Authentication," Ph.D. thesis, Universidad Politecnica de Madrid, Madrid, Spain, 2006.
[4] A. K. Jain, A. Ross and S. Pankanti, "Biometrics: A tool for information security," IEEE Transactions on Information Forensics and Security, Vol. 1, No. 2, pp. 125-143, Jun. 2006.
[5] W. Meng, D. S. Wong, S. Furnell and J. Zhou, "Surveying the development of biometric user authentication on mobile phones," IEEE Communications Surveys & Tutorials, Vol. 17, No. 3, pp. 1268-1293, 3rd Quart. 2015.
[6] M. O. Oloyede and G. P. Hancke, "Unimodal and Multimodal Biometric Sensing Systems: A Review," IEEE Access, Vol. 4, pp. 7532-7555, 2016, doi: 10.1109/ACCESS.2016.2614720.
[7] H. Jaafar and D. A. Ramli, "A Review of Multibiometric System with Fusion Strategies and Weighting Factor," International Journal of Computer Science Engineering (IJCSE), Vol. 2, No. 4, pp. 158-165, Jul. 2013.
[8] R. Ryu, S. Yeom, S.-H. Kim and D. Herbert, "Continuous Multimodal Biometric Authentication Schemes: A Systematic Review," IEEE Access, Vol. 9, pp. 34541-34557, 2021, doi: 10.1109/ACCESS.2021.3061589.
[9] C. Prathipa and L. Latha, "A survey of biometric fusion and template security techniques," Int. J. Adv. Res. Comput. Eng. Technol., Vol. 3, No. 10, pp. 3511-3516, 2014.

[10] Rutva Safi, "Biometrics - A boon for farmers," 10th Feb. 2020.

[11] Mohamed Cheniti et al., "Symmetric sum-based biometric score fusion," IET Biometrics, Vol. 7, Iss. 5, pp. 391-395, 2018.

[12] Soumi Ghosh, Ajay Rana and Vineet Kansal, "A statistical comparison for evaluating the effectiveness of linear and nonlinear manifold detection techniques for software defect prediction," International Journal of Advanced Intelligence Paradigms (IJAIP), Vol. 12, No. 3/4, 2019.

[13] Kresimir Delac and Mislav Grgic, "A Survey of Biometric Recognition Methods," 46th International Symposium Electronics in Marine (ELMAR-2004), Zadar, Croatia, pp. 182-197, 16-18 June 2004.
[14] J. Sanchez et al., "Toward a Universal Synthetic Speech Spoofing Detection Using Phase Information," IEEE Transactions on Information Forensics and Security, Vol. 10, No. 4, pp. 810-820, April 2015, doi: 10.1109/TIFS.2015.2398812.

[15] G. K. Berdibaeva et al., "Pre-processing voice signals for voice recognition systems," 18th International Conference of Young Specialists on Micro/Nanotechnologies and Electron Devices (EDM), Erlagol, pp. 242-245, 2017, doi: 10.1109/EDM.2017.7981748.

[16] I. Bisio et al., "Gender-Driven Emotion Recognition Through Speech Signals For Ambient Intelligence Applications," IEEE Transactions on Emerging Topics in Computing, Vol. 1, No. 2, pp. 244-257, Dec. 2013, doi: 10.1109/TETC.2013.2274797.

[17] L. R. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition. Englewood Cliffs, NJ, USA: Prentice-Hall, 1993.
[18] F. Alam and G. Riccardi, "Fusion of acoustic, linguistic and psycholinguistic features for speaker personality traits recognition," 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, 2014.

[19] D. A. Reynolds and R. C. Rose, "Robust text-independent speaker identification using Gaussian mixture speaker models," IEEE Transactions on Speech and Audio Processing, Vol. 3, No. 1, pp. 72-83, Jan. 1995.
. ,
•• •• · f . K • O>
OV & P . fN
e n a nd A. ,d io •~
. .. 10 14 """ "
'' "°'
,n" «•"' •" , ,,dm . s,,,,c1,.a.nd • · l-<lme ,PW " ""• · V " " · No• t''"• ""
m • i "R ca ' hon'"'" ,
\:,1
J (
,r F A S. Al7 qh ou l an d A •t11 1l<'Yn, l'l •
_ . · < ~"1"' I' l ~ t-c,I\~
f , a · _ for en sic , < He< - ""
fic ien ts for nf) arj "'<'\'n \.\ 'ft 1\-y
..ef
~ "'-•l'I~ a I '"C
f F COT ,.
I •"" , • PP 1--f,_ 2 0 J d ' hh ""d ," rl' 111,et\llt•n,;." "'' ' cn,11r1,..,.
.~~ ! c r>m ru r Sc t 11
''"•'It r~ ~ n -~,. ,
[22] Y. Lei, N. Scheffer, L. Ferrer and M. McLaren, "A novel scheme for speaker recognition using a phonetically-aware deep neural network," 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1695-1699, 2014.
[23] E. Variani, X. Lei, E. McDermott, I. Lopez Moreno and J. Gonzalez-Dominguez, "Deep neural networks for small footprint text-dependent speaker verification," 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, pp. 4052-4056, 2014.
[24] M. A. Islam, W. A. Jassim, N. S. Cheok and M. S. A. Zilany, "A robust speaker identification system using the responses from a model of the auditory periphery," PLoS ONE, Vol. 11, No. 7, Jul. 2016.
[25] Jahangir et al., "Text-Independent Speaker Identification Through Feature Fusion and Deep Neural Network," IEEE Access, Vol. 8, pp. 32187-32202, 2020, doi: 10.1109/ACCESS.2020.2973541.
[26] I. Goodfellow, Y. Bengio and A. Courville, Deep Learning. Cambridge, MA, USA: MIT Press, 2016.
[27] H. F. Nweke, Y. W. Teh, M. A. Al-garadi and U. R. Alo, "Deep learning algorithms for human activity recognition using mobile and wearable sensor networks: State of the art and research challenges," Expert Systems with Applications, Vol. 105, pp. 233-261, Sep. 2018.
[28] Karthik et al., "Implementation of neural network and feature extraction to classify ECG signals," arXiv:1802.06288, 2018. [Online]. Available: http://arxiv.org/abs/1802.06288.
[29] S. Elshamy and T. Fingscheidt, "DNN-Based Cepstral Excitation Manipulation for Speech Enhancement," IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 27, No. 11, pp. 1803-1814, Nov. 2019.
[30] H. Ahmed and A. K. Nandi, "Compressive Sampling and Feature Ranking Framework for Bearing Fault Classification With Vibration Signals," IEEE Access, Vol. 6, pp. 25731-25743, 2018.
"A VQ~r<>N'f r,,,e~<> ot """'-
. .. ff! fFF J. ((•"""""' ~"-"~,
4 1·1tt"" . ~,df'CC<'J gllil1<ttl. r,. i•
f<J/111" . n~.,11,,,f "' 1\- 1
,..,,,. i,- 11'¥~ n,, ,
,,r<11 ..,, l11W•q1J ?. luly f~S~. do; IO ·•~-,~ 'WI
..,,..,..,, t, .. , ..
7 r•
,~ vr J~li 11n'1 f ifctii"'1.!Jh#rJt.. -- ~ltktt ......__.. _ '"''~ • ~ .. '" ,..~ ~ ... '""""""""" t '"
"'"""'' ~('~,- •1t,• ,, r. ,, .... •1t,-
«J;,;11 "it v., 1
,., vit"~~N J 'SJ;eJtket thde~ing .. .._ I\'\\\ fv 1, ~
ljt'IJ J/1 , '" l'f.' 1- 'F r,. "\\ ti;.,
,rr . k :,.s~ -~q2._ Ju ly 200.'-. doi lo 'I O'<> l'fS;\ tWuf1'<'•p~ ~ ~...._i:\'l._ ... ,hl11 1nfh..,11,111,,
"'" a. f'1 · ~**% ,, •lt,.,1,,,,
,1 f -V-ttiJ•
[58] C. H. You, K. A. Lee and H. Li, "GMM-SVM Kernel With a Bhattacharyya-Based Distance for Speaker Recognition," IEEE Transactions on Audio, Speech, and Language Processing, Vol. 18, No. 6, pp. 1300-1312, 2010.
[59] D. Paul, M. Pal and G. Saha, "Spectral Features for Synthetic Speech Detection," IEEE Journal of Selected Topics in Signal Processing, Vol. 11, No. 4, pp. 605-617, June 2017, doi: 10.1109/JSTSP.2017.2684705.
[60] M. Sahidullah et al., "Robust Voice Liveness Detection and Speaker Verification Using Throat Microphones," IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 26, No. 1, pp. 44-56, Jan. 2018, doi: 10.1109/TASLP.2017.2760243.
[61] Z. Liu, Z. Wu, T. Li, J. Li and C. Shen, "GMM and CNN Hybrid Method for Short Utterance Speaker Recognition," IEEE Transactions on Industrial Informatics, Vol. 14, No. 7, pp. 3244-3252, July 2018, doi: 10.1109/TII.2018.2799928.
[62] A. Misra and J. H. L. Hansen, "Maximum-Likelihood Linear Transformation for Unsupervised Domain Adaptation in Speaker Verification," IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 26, No. 9, pp. 1549-1558, Sept. 2018, doi: 10.1109/TASLP.2018.2831460.
[63] …iev, A. Gianelli and A. R. Trivedi, "Low Power Speaker Identification by Integrated Clustering and Gaussian Mixture Model Scoring," IEEE Embedded Systems Letters, Vol. 12, No. 1, pp. 9-12, March 2020.
.... ~<1'1,c (I' - • -4,',-.r<p )
- . '"
. ,gat,<m-. <m '"''""
f''.,.
1 1' '
,, ;, . ,n d T
-" · PP · •S0 2
0 -5 02 4
~ ,wahru-a. " • Sh ang.ha irn a1
" . •=• ... . lt ·q '.\ )Ir ,
., < ·on f"
" " '" '" ' i,1uJnple A SR
. . .,.. Se .
m•-Su pe.n,Jsl6
2010 » ° I. Con<' er at \
,eten,ce aµ\'i\\in n
e . d oic 10. I 10 m, -t '" '" "', \ c:.n-.\
'~ • S
••
S ys te m s ' H 0 \CAS
, ,,,F',a', , p,oeess;ng . V ol yp ol hc se s d" A .
oo us tk M ,,.c1, '"''
"R~ mndsl<
. . 24 . N o . 9. l'P od •I S "1 01 61 1n ,
J ,c · \5 2 4 ,, ., . ,
_>r"'1"'· N . i, 1o nt · \5 34. . Se
"' 1£ E E IA C M l, s,,,,,,1
z. l . .A ne. m il
,:
• ,c<"" .,,.d ev en JS : Jm li
ph ca u o n s fo r
e< , s G oetz e an d
·
B K
T•• ~ o m , by
mclio~ D\,cnm· .
pt. 10 16 . do ''-. \0 . 11 '" ""'" " "" '
( A D N N s
00 " '.,""
· 6
c£
1&v •4 (# r,ansac ,or<S , T D N N s · old lmek<. "C 1" Jl o c ,~ ,, "" ''
on ud;o, Sp · an \> l ass1.f1er l\r S L
ee ch '- ' • 10 16 15 ,,1)
J""' 017. do i: 10 .l 109/ • an Lang
T A S L P .2 01 7.
, ua er ce-pt ua
\ Fe
d r C.11 \tectu -- .,
26 atmes frnm D es fo r i\c
'/11- z. TaII· Z .
M a, R . M ar ti 90 56 9 . ge P ro ce ss in g.
\/ \
ou st \
c
¢ ~. 2 n o . 15 CA .SE
'· . and J. G u "S . No. 6 2\ )\ 6 ." ,n
..,,stefll' ll Sln .
,. g oNN C\a. ss o , poofmg D · OP · \1 04 -111 4
1fiers an d D
.e"'orks (PIO y n am ic A co us ll c Feteaectut\ro ". . .
Lear,ung Sy , .
I0.l I09,rJ<l'ILS stems V ol . , N o . 10 , n m ,u to m at \c sP "' "" Verif<Ca\\oo
pn .
.20 l 7
.2 7 7 1 9 47 . 29 nsac l<oos '"
,\ es , m IE EE N ,,. ,al
Tr a .
'.!41 • 4633-4644.
i l s . ], !okgOD Ya O ct. 10 I% .
. ,. ,, , ba5 ed on ne
., Opt, .T. .J. S e fa
un1Sed "A
do.:
,. ,ara T . I. M o d ip a
ch. 'me Learning Aan ·
d M J · Manamela " " +
do i: Jo. J\ 09iA
f Rlc o N
lgor ithm •
s,., IEEEAF , ~u ,o mat 1
' 5 4 6 7 5 5 .2 0 1 9 . RJ CON " Spe ak
913 3 8 2 3 . « Recogn\fon
·s~ s. N•;nan and V. J(u " ' pp .
!kanU, "Enhance gnr1ion or optum ·1 , />,ma, Gllana. 2019
Gr&>!, sVM ment in Speaker zed Speec\\ \eatUTC
and JD CNN Rec'Jo . ec f ec no S us\no 0
, In ternational Jour
. logy 0 . , 19th .Oct . 1Cl'
nal , Spe h T h 2Cl DO\
IO. I 007 /s I 0772-0
r,t,15ut f urui T an2d0..()9771-2
S, "Concatenat re cog,n\t,on, Pr
/J1Jlf nf1/Wnal ed phoneme mod t'
oceedings
Conference on els for text-variable snealm
Acaustics, Spee .. .. , .
ch and Signal P
ro cessing,I99 l. \C
S U -3 9 \-3 94 .
li7l Girija Chett)' , M Mic1,
i ch
(Cross R.ef Link)- aCI Wagner, "A
utomated lip fe
authentication," H ature extraction
CC laboratoyY for liveness verif1
U n cation in audio-vide
[88] Vibbanshu G iversity o f Canberra , A o
upta, and sh a n ustralia, 2OM . ,'J
n il a Sengupta, ticle (CrossRel I.in\
authentication sy "Automatic spee <.).
stem ," Interna ch reading bJ or
tional Journal al motion nac~ing
o f Software En g for "' "
inee rin g Research
& p,ortices, \/o
[89] M ' in g- H su an
No.l , April , 20Y1an g ' David · " s l. J, "
3. J v.-" iegman an
d ]'lare.ndra Ah
d Mujah,'"DetInec
tetillinge
g nc
face
,ournal of IEEE e, \/s oin\. \m
24ag
, ~eso·.. \/1., PPu<
Transactions o · J4
"•1•-Si '
n patt4rn Analy
sis an ac me
January 2002.
/
ivl · Jo!les , " Ro bu st R ea l-T im e Fa
•ol~ -
, ."'
iJ" , . p P· 13 7- t5 4. 20 04 . cc Dc t
.,,. " '"•,o y, ,g . oa v<. d J . K, ; cg ec" " " ..
, ~i1 i'
n, .n . '"" "
'JI''' I " oz'
,, , .,£ ££ r,- an d N ' "" •• ••
• 1 r)'
11•
.
,o
1 '
a11sa cti o11s o n Pa t ar,cn
' "' " A no /y
. · dr a /\ " • ••.,
IS a n d Mh"
.,o,,,
»a l nf ('
'"< ' ace ••p
y · Ctc cf "'" v .. ,,..
• "• • , r:11
.,,.,,, an d Oc h ·
rn e I
, V
t-~ •• J01 1r11o l of >. ••hw a,; ance Pr o\)
,ky,-k
IE EE "" · . , · • O\
.
c·l r,,,,,i;, ·o,> S·
• "'" "• • ttce , V " l moo,,,e, ;\ "-
.111 o•'
· "a n. ,acr crti cs f
(\I l <\ ,..,
) •o n on Im o ge" P,U h «-~, urv e-y "
;oO Cl· a" ""'. P i\1 . "" I
-j111•''
, LO •
ce 2002 - " Au
di o- vi su al S p ee
c\
"
''< I F . "
.
)S.s , .
Q ' Pm
,:] '·,•<"'' e,>.<'""'~•'ll·
•
•I. I bO ck , . Th es; , S
. 20 05 .. A RR1-1 0,s . Ma
, ! ,,,.,.,,.
·,
. •• M""'" Thes», Computec S . ' nolo
Speake, gy' B'". ba ' choo \ of Pl
ne . • <c<,;c , &
,
Jt1] R"°"([(lirion,''
in pr oc ee di ng s of
• utual in> . " • Vo\. 1 , No . ,
,, 5e p« t0ber 4- 8, 20 06 14 th Eu ro p ea n S, gn
/lffef' . alorpmat1on Ei.ge '.
n\ips ,o, Audio y
rncessing Confe,e "
A· B. 11,ssanat.. 20 09 ne e (EUSIP . CO"u)a\FlSp ee d,
. , "Visual W . ords for Auto .
.p-.
!llputillg v01vers1ty matteg Liom Reading,, p
of B uc km gh am, U
ni te d Kin d en' t of"' ""-p\·\ed
A:p
j9!l CO ' h. D . Th esi.s, De
p. J(al<Ul"anu, S. M partm .
J!\etbods,'' Journal o fakfogiannis, and N . Bourbakis •
pattern Recognition , V "A survey of skin-c
. ol. 40 • No. 3• PP. ll
06 - 1122 olor modelin
Al'" wee-Chung 1ew d sh·1· g and detection
an . 1 ID Wang,
/,{e,licol 1nJorrnatio L "Visual Spe ech R ecogmt1on: U pMSe ar 2007
.
n science reference, H .. ' · ·
ershey, New y ork, 2009
[l&JI . !llflentanon and Mai,pin
syed Al.i J{.,,{ hayam h, 10 2003
"T he Discrete Cos ,,,;
ine Transform (DCT . eory and Ap
} Th p\1cat1on," M ichigan State
. .
JOl] un iversity, 1vJ.arc
c. vu nala. V. Rad ha., "A Rev
.
ie w of Speech l of World p-proacues, Journa
of Co Recognition Challe
[ mputer Science and Informat ng es an dA \.. 20,,12
[JOl] M. Gordan, C. Kot ion TechN ology (WC
ro po u\ os , and l. Pi
S/T), Vol. 2, No. \, pp. \-1 ,
tas. "A support ve .
speech recognition ap ctor machine-based
dimamic net,votl<. for
pl ic at io ns ," EURA
.SIP Journal on Applied Si v\sua\
gnal Proc essing, \Jo\., \\ , pp ., \2
4i -
[JOJ] J2
S. 59
L.,W
20an
02g,. W. H. La u, S. H. L eu ng an
dA . W. C. Liew, "Lip segm
1004 IEEE Internationa n wtth the presence ol entatio
l Conference on Acous beards:'
tics, Speech, and Sign
10 ] al Processing, Montre
al , Qt"-- lOOJ ,
PP· iii-529, doi: IO.I \0 . . • \( I mutual subs
M. lchiNo, H. Sa ka N o9/ICASSP.2004.1326598. pace method ,"
and N. K om at su , "S .. .r 20 04
[ 4 ICAR pe ak er recogn1uon uSm ., PP· 391-402 <;o\. \ ,
CV 2004 8th Control g erne
, Automation, Rob
otics and Vision Con
468858 ,erence,
Kunming, China, 2004
, doi: l 0.1109/ICAR
CV .2004 .1 ·
\51
J1111l ft••nd J fl •J;""· ft P ct'5lC>O
•, I • .,-; < ff
,, V<-TI""
, /,,..,, ' r•"' '""'°" .,W.•"'•h•p •~Ca lt<'lll h
·~· --
1
(< -,-P R w ·a
" I 0 ,, ,,,
, d O W l. >. . •• ' ~ '"""' ....,
" .. .. " " ' .. _ ,
,>•"
,,• • •'" " " " ' JF
m·e
" " I " 'F ,-
> " ' '"- , ,.••
"" , •., - ', .,
,,.
I" '1 - do
, •• '' ' " ' "
' ,, ,,
.,, .,. , '
,o d R lk
-~ ~ , p;.,.-
7 ; IO 11 OS
''P « s» n ;d 'R . 1Rno / •~ ••
IT
" '~ " =
= r " _· l Olf;n • T" '""'
'1
""' ' " ' ",~ ,.
, \
" "
1
'. ••'•
. ,, 1
zitt1 I Sif!.1 101 P
rn ce s.•;i no
•
(D n,.. P )
om
. " " · 47 2- 47 '
li n t CX b> tt
"> lff < .. ., r, ,_
"',_
"" "" "" " m••
.
"'
.
,0 ~ • "l ip•> '" 6. do
. \ I( ) ,, ,. ,, 11
,, , ·.. 2n1 "1 R "" I '< '" ""
,,
,,, , ' .
' O •, ._ " 20 l
"e m en > B as
6
ed S p = k """ . ,,, ,..,
..
~~-
the combinatio
n of biometric m
atchers," Inf Fus
[II~ AJsaade, F., ion , 33, 1 1-
Ariyaeeinia, A SS ,
20 17 . ., Ma\egaonka
t, A., et al. "Q
mu\timodal biom ualitative fusio
etrics," Pattern R n o\ ~ormalise<
[118] fierrez. Aguil e I ,cotes \n
ar, J., Ortega-G c ognit. Lett., 30,
arcia, J., Gonz (5 ), P 1
ll alez-Rodrigue P· 564-569, 2009.
authentication ba z , J., et al.,"Discrim
sed on quality 77in
7 at7iv19e m20ul05
timoll• biomelii<
measures," Pa
1Srinivas, N., Vee t tern Recognit., dd
ramachaneni, 3S,(5), PP· t fr0 m multiple c\asstf,ers \ot
K., osadciW, L ,r - • YI In·form .
[ 9 unproved bio .A. "fusing corr ation fusion ,
metric verific elate • • 0
ation," 2 009 IEEE 12th
International Con .
J erence.
Seattle, WA, USA
, pp. 1504-\511, 200
9-
~,.,,,~-",.,,'.,..
-~~-w
,._ ,...,t-'" .
~ r.,, ,., · ~.,-- J,Cf'1J (lr 'le '!J t--C ,& n '" ·
,c.atti"fl
·• J 7
~.~
.
t,s<;('(I
""..-•,., ., .,
<'ffi qo .c .'i. h (,( 're
1 11
~
' ·,
,••
,, • ""
,, ,,• ... ,.. .•n<' ,
• ·I fl•¥"II "'//
,n ",1• ...,~,=o"•fl"', , .""
.,m 1
, o, , . , no ' ,_"
,. • ., _ ,, . . ... .,..
' ' "· ..,~..,. ....' . ,,., .. • ...
•
· ""'""' ~. '"
• , ' '
..."",.""• '""'''
>·"' • "• ••
"
, r . ...' ,,,•"
,. " " ' ,n
,;~ .' 1n dlf,_J I" "T"'•m· P tt- rt ".."."(" "
,
~
.,. .
_, .,,.,..... •,n 1n ae•
~ nc On • '" ' ,.,, " "
' " " ' '- .. . , . , • ~ ., ,_ •• "' ' •• •
.1 . ,,, rJ
"" .. .
. ..-.ar - , •, ,e 1 0
11 "" ''f'C F 20
.
~ Kanhatii?. . 1" 56l\11 <s
>A '' p 'l fl 1 ''" " ' 'f &d , .
_ ,,
,....
,, ,_
'• .,,
•~ .. . ,. "" .
1 " " ••
, • .,
~.. .._ ,
., ,. . -····
1 '" • '" '
' \1 1 'R ,C S M A N ' •• • "" '" '" ''
_,.,.n. - ,,,, ,. vr A G 7h ,m g.
f l"'" ·
Q,- 102. """'
1 f\ ,s ,r ri cl >i
an d R . A dl 2 0 ,0
,F " fl '' f •
A .. ..
' ••n •• •• ••
• '
1f "'1'" ~ "" ' l' ,n ,c e).'' 201 2
"
ou d· ..
J. S «w el
· ' " ' r." • • =~•~'"" " '• R
• .. , , .. !·•·.. fn
.-
/ 1"""
I (l 'l'
l.
S fff f .20 I 2.
>' """''.
64 8 19 0 3 .
m u n ;co,;ons
l,S ET ITen ) ce
·
.Sc,e •= ~
·« i m n\ ,\m od
. s n' a '
1
' ,.., ·• '" l
°'"'""'"
'd•" ""• '" ""
and S a m "" "e PO - ,I( ,.\ SO
1 e t S in gh " F « '" "" '" r, .,' - ""'"" "•I
, C""
JI S,etiJOC I,t,.6-Jgof l·thJll ... In.t er
.
us,o ·
n S= h
.
na t w na l Jo ur . Sonssc .
., · na l JA dv,. an ce o,
c d an c w-
E ,d f. ac llm
,g m we\n Egiom etr'" ' Usm •
""
,'O ) td ' et al.. )'1u\tu '° 'c l'""d..'"
. E
S • nlmnccd
n o d al fe ''"" V ol, Vm \o n o\
" P'
,o;£ Acc.ess, Vo\. at u re -L ev el
F us,.on
,
. fod
6, PP · 2 1. 4 I S -2 lio m et
14 26 , 2 0 I g ' d o, , \ 0 .\\ 09 tics \ .
~. . , ,..S• ,nilll ]'!., .a n an. d y a, shah K . 1A C C E
dcnufioat\on -- ••
m on lo .r t P II ..
s,,,.
• ... -·
. ~ .e .,,,ition.•· Jf./£ u lkar n i, "S yner .
ln tf o= :·
Tra n sa ct io ns 'CJ m V m.ce 1
, scf/doi .or /1 o n Smart p rocess mg and LS S .201>.1%\ SS40
0 .5 573 . and C
ii E!ESPC .20 • M.o« m
.~
0
1 9 .S .4 .279. .n t fo, "" '" m ,u c ?cr<oh0
.. .
,oeJ tJ/ am p" ' m g·
B-1£ jJ .£GO
£ TOra {.<s¢.io
, r,n
V ol . &. No. <. Ang 10
Nnsixoon
n an
s;odmJ.etNri.cs usion for Subject \ •
ca, rt
B eehr,aV
"Sioorft' on
B R
ooi• o.1109 rr BI io m etri
d Ide nIcityf S .. ec ..
oM.2.019 .2943 c1 ence Vo\ \ 0 4
N ogmt\o n at a
,. . 934. D\sta,,ce,"
r•'l9I S. &1,o g, !, I. ' · · · • ••
Yu, Y. GuO · l' l2 •3° 1.
0 c,. 1 0 " .
w,,,tification." in and Y. Yu, r us io n
JEE£ Access , V "A W eigh ted Center , Gra•nh •
,
,. ,e th od (o r ?et, on Re·
::iO] M. EskaDdari ol. 7 , PP · 2332 f V.7 n. " c.SS .20
and O. Sbafifi, 9-23342 2019 \() .2?1%1'2.\) .
"E ffec doi· 10 \ \'"' /"C
c\assifjcalion." in t o f face and ocular cc
IET Biometrics multimodal bio
metric ,~stem• on
, Vol. 8, No . 4,
11; I] B. Gundogdu PP · 243-248, 7 ~enrle<
and M. J. BianCO 20 I9, doi'. IO.I049 1ie
, "Collaborativ t- bm t20 1~ .513,1 .
inIET J,nage P e simi\arit)' met
rocess ing , Vol. ric \e arni ng fo r facerecogn
!il2] Gurjit Singh
I 4 , No . 9, pp. 17 ition ID \\oe wi\ol:'
Walia, ShivaJll 59. \ 7 68 , 20 7 , 2
020 , doi: IO.I04
Rishi, Rajesh 9 1iet•i~r.20 \9.05
biometric system Asthana, Aaroh Io.
based on diffu i Kumar , ,\.njan
a G ta, "Secure mu\1i 4
~~-
sed graphs an u~ 1
d optimal score 111oils\
fu sion ," /ET Bio
m ., 1 0 19 , 'lo\. ~
[Ill] L. Wu, J. yan " · •
231-242. g , M. Zhou , y .
Ch en and Q . W
ang,. "LVI D: A
~~ -
• FMultim .·• o•adanldtiSioecmuretilriVcs i>-utn15 1511 15
on Smartphones," , 'l o\. en•tic•'ion S~•" ~
in JEEE Tran " '
sac tion s on Jnform a
[134] X tt0 n orens ics
2020, doi: JO. II 0
9trlfS.2019 .29 ·'f ~\\5\0n," in 1ET
Jrna ge
1
· Wang and S.
Feng, "M
4405 8. ··
b d on c\ass
ulti-perspecti
ve gait recog . . 0 se
\ 049 /iet-ipr .'10,er '
Processing, Vol. 1 n,uon a Is.6566 .
3, }lo. 11 , PP· 1
885-1891, 19 9
2019, d o t. \ .
id M. Wagner, "Multi- Leve) L·
c 11e11)' ar . iveness Ve .
(j. tries Sy mpos,um: Special S: . rifi cation f.
:'I 6 9;0111e ' ess ,o,, on or Face- .
;oO e MD- 2006. pp. 1-6, doi : I 0. 1 J 09/Bc Research a, ,1 Voi ce Flio1net··
11 ir11or . c .2006 1e f1 ;o . , ic i\ uthe,1 .
13~ d M 1 ipton. "Multirn oda l fr _ .43416 1 S rne,r;c c t1cation ..
hct1)' an . ~ eatute f1.1s ion ·
r, • . on.~o,,;,un c .
-
t,} (,.
C'' /i ·
0 11 In orm ar1011
p us ion. pp. 1- 7 ., . Ot Yid eo fotg n nfere ~ice
• c.d111butgh "
'
to' .
,fere11r:e
= ..
.
t
Sa fehg 111 a11.
"s· k
pea er y ·n .
, , o1o, doi : ery detect'ion ·., 20 /() 11,1-i l
10 1
n~se rfl en it ati on . . 109/1c rp nrernr11inno /
: / H ·SO~ 05421 [ee.u .AS].. August201 8 using C .2010 57 11339
\ I ~ . onvolut1on
" Neural N
~r . u ,,adhyaya and Abhijit Kannak etwork~:·
)'JB' ,neel . . ar, Speech Enh
:81 . 1 15 . A Compariso n and Simulat' an cernent .
,41gonf 111 • • . ion Study," El using Spectral
,;on Processmg -20 I 5 (IMCIP-20 1S) p evenrh Intern . Subtraction-type
111f{)rma ' rocedia Co,n at1ona[ J\,fufti-Cn
.06.066, 2015. Pllter Science, V 11/erence nn
10.10 .
16/j .procs.2015 0 1. 54, pp , 574
dran and T. K. Kumar, "Oblique Pro· . - 584. doi·
- ~~ ~~~Qd C epstral s b .
;,1J ~- . , • u traction in s·
ement for Colored Noise Reduction ,, i IEE
£nhanc , n EIACM Transact' ignal Subspace Speech
cessing, Vol. 26, No. 12, pp. 2328-2340 , Dec. 2018 . tons on Audio, Speech d
pro ' do1 : I 0. I I 09{fA , an language
. Xiao. S. Wang, M. Wan and L. Wu, "Radiated N . SLP.2018.28 64535
JO} J... . . • o1se Suppression for Elec .
,, u]o'band Tune-Dom am Amplitude Modulatio n ' " zn . IEE trolarynx Speech Ba d
I"'
EIACM T, se on
. , vo l . 26 ' no. 9, pp. 1585-1593 S t 2 ransactions A
. on udio, Spee ch. and
J,a11guage Processing
' ep. 018, doi: 10 1 i09{f
,., Berouri, R. Swhwartz and J. Makhoul, "Enhancem e t f · ASLP.2018.2834729.
JJJ J' • n o speech corrupted b . .
PP Y acou st ic noise," In Proc.
Conj on Acoustics Speech and Signal Processing
1
1n1· , · 208 •2 11, 1979.
, T Lin and Y. Zhang, "Speaker Recognitio n Based on Long-T A .
· s
[1~-1 · V l 7
enn coustic Features With Anal . ys1s parse
. " . IEEEA 10.1109/AC CESS.2019. 2925839
Representation, m ccess, o. 'pp. 87439-874 47, 2019, doi:
·
z.Ma, H. Yu, Z. Tan and J. Guo, "Text-Inde pendent Speaker Identification Usm· g th e H'1stogram Transform
{1431
Model " in IEEE Access, vol. 4, pp. 9733-9739 , 2016, doi: 10.l 109/ACCESS.20l 6.2646458 _
of
Jl44] D. Paul, M. Pal and G. Saha, "Spectral Features for Synthetic Speech Detection," in IEEE Journal
Selected Topics in Signal Processin g, vol. 11, no. 4, pp. 605-617, June 20 17, doi:
JO.l 109/JSTSP .2017.2684 705.
(145] A. K. H. Al-Ali, D. Dean, B. Senadji, V. Chandran and G. R. Naik, "Enhanced Forensic Speaker
Verification Using a Combination of DWT and MFCC Feature Warping in the Presence of Noise and
Reverberation Conditions," in IEEE Access, Vol. 5, pp. 15400-15413, 20 17, doi:
10.1 109/ACCE SS.2017.2 72880 I.
{146] C. Quan, K. Ren and z. Luo, "A Deep Leaming B~sed Method for Parkinson's Disease Detection Using
. I 9 10239-10252, 2021. doi:
Dynamic Features of Speech," in IEEE Access, vo · , PP·
,, ·
l0.1109/AC CESS.202 1.3051432 .
fJ47J . Score and Genetic Algorithm, Jo11mal oj
MI Zhou, "A Hybrid Feature Selection Method based on Fisher
M.th · · · Vl37pp5 1-78, 20l 6.
a emahcal Sciences: Advances and Appltcattons,
0
· ' · W'th Analysis Sparse
U48] T L' .. -Term Acoustic Features ,
· mand Y. Zhang, "Speaker Recogmt10n Based on Long CCESS 20 t9.2925839.
R 2019 doi: 10.1109/A ·
epresentation," in IEEE Access vol. 7, pp. 87439-87447, ' b dded devices," Future
[149] Z' . ' k verification on em e
JtJan Zhao, et.al., "A Lighten CNN-LSTM model for spea er
generation Computer Systems, Vol.100, pp. 751-758, Nov. 2019.
160
~, AS ai ' fvl. aen gh era bi, A Am
,o] ,, . bi:;nd piv ers ity . rou che
1' fr<'" ' an d Sc ore L and F. l-la . .
1 /o<!Y & /ntern et- 80 ,ed Sy
1 r,d•"o , ,r eve\ Fu S.o ·
n," In·"" • " I
. J( We i, K. Ki
rch ho ff y em s' PP I 3 "'P,ov;,
y v'· · S · 6-1 . 20 13 1 11<S0pe ,;
;tl · , score spaces . ' · ong & l 42 K nte r
,,,as In, 20 /£ £
,
£ lnt· en
' Yot0 2
Bil1ne, s ' o11.
.
°"
0o1 •
er V erif i
" ' """ •
13 . ' ub"'Odul ' ""/"" '
1 I"" 11 ng, yancou
•"; ' no o b,., t , b>
-'•e n,..,.,
1 .
v<', BC, pp .
-
co•e5-C- & yapmk, V. Machine71Le84 71. 88' 20'"' I '"" "' Co nfe , " '"' "" " .l«
r -t•"'•
1il ' 1,0kiC- C- yogt, 0 am mg 20 3. '"" "" A
-J ' . . Di lrr an d T. Sta d ' ""·",aa
' fo, bl'-" -d,me, .
ti' l ,,, ,. I nerworkS," 20 16 IE
elm ' '273. 1995
ann
EE 26 th Inie ,na tio n'
"S https·ltd .
1
,.,. s,,.,, on "" "'
d Steno!
11,SP), v;, r,i sul Ma ,e, Poake, id <nt>fic. . ati °'·'<l',/10
(I 20 I 6, pp. l -6 . al Wo ,k, ho . 023 1 A· IO
. ' do-, IO .1\ 0 p o on and <lustc, ; . 21621 ' 114
1"1 l(r; ,J,e vskY A, Sutskev er 9 n Moch/n, \\ 1
. I, Hmton GE "l ag "'' '•" '
' ma gen etc 1MLsp _20 l6 Le
I· /n--AtfwJ"'es in Ne ur all J . 6 o., ;,g Jo . a,0
.773881 ' S,goo lP, 1.,;,,.
~~]
nfo rm atf on p
JaZ roc ess ing ass .ficatio .
a D' coJJobert R, Doss MM · "End S n w,th dee · " "·"' '
r-· " .v·
' el"'orks, arft/V -
P .
, -to-end hysierns, 2012 P con,o\utional,
eura\ netw
pr p oneme ,.
ep rm t orXiv : 13 I 2213 '2013.
zt>,n' Y, pezeshkI M, 7 sequence recogm.tion usin " ' '·
ri • • •
Brake! p Zh an
g 5 g "'"'o\o tio
1
Asia, 2015
{172] C.H. Chan, B. Goswami, J. Kittler and W. Christmas, "Local Ordinal Contrast Pattern Histogram
s rm
Spatiotemporal, Lip-Based Speaker Authentication," in IEEE Transactions on Information Forensics
and
Security, vol. 7, no. 2, pp. 602-6 I 2, April 20 I 2, do i: 10.1109(f]FS.2011.217 5920
{173] M. 0 . Oloyede and G. p. Haneke, "Unimodal and Multimodal Biometric Sensing Systems: A
Re,iew ," in
4
IEEE Access, vol. 4, pp. 7 5 32•7 5 55, 201 6, doi: 10.I I 09/ACCESS .2016.26 I 12o.
. .. d p entation Attack Detection: A.
1 4] H. Mandalapu et al., "Audio- Visual Biometnc Recognit1 l on 9an PPres 3743 \-37455, ?.0?. l, doi:
17
Comprehensive Survey," in IEEE Access, vo · ' ·
I0.1109/ACCESS.2 021.306 3031. . ., ?O?O IEEE 1n1ernational
117 . • s ritY' An overiteW , - -
SJ S.Alwahaishi and J. Zdralek , "Biometric Authent1cauon ecu .
· Markets (CCEM), 1010 - - ' PP· s1m, doi:
Conference on Cloud Comput ing in Emerging
IO. I I09/CCEM5 067 4.2020.00027. . f . Robusl Hand-Based t,1ultin10dal
[\76] S . . .
Sparse Coding o1 . I86-3 I98, 20?. l.
· L, and B. Zhang, "Joint Discnm mative
. Forensics and securi·iy , vol. \6, PP· 1
Recognition," in IEEE Transac tions on Inforrnat1on
4. Ind
S.Nia.
ain20
an16an
. d V .Kulkarn
i, "Performance Evaluation of
" Text inde\lendent Automatic
d' s of the Second ]Hknwcionnl
Speaker Reco gniti on using VQ
and GMM Pr.ocee. ningTechn
. olo"gy for c0111pe11.t1\'e
.
Conference on Informat ion an
d Communicatio ·
Strate gies, ACM , ICTI S, 2016
.
Appendix I: FAR and FRR Values for the Open-Set Identification Technique

Table: FRR and FAR (%) of the open-set identification (OSI) experiment, recorded at decision thresholds ranging from 3000 to 12000.
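FAR and FRR values such as those tabulated above are obtained by sweeping a decision threshold over the verification scores. A minimal sketch of that computation, using hypothetical genuine and impostor score lists in place of the experiment's actual scores:

```python
# Sketch: FAR and FRR at a sweep of decision thresholds.
# The score lists below are hypothetical illustrations, not the
# actual scores behind the table in this appendix.

def far_frr(genuine_scores, impostor_scores, threshold):
    """FAR: % of impostor scores accepted (score >= threshold).
       FRR: % of genuine scores rejected (score < threshold)."""
    far = 100.0 * sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
    frr = 100.0 * sum(s < threshold for s in genuine_scores) / len(genuine_scores)
    return far, frr

if __name__ == "__main__":
    genuine = [5200, 5900, 6400, 7100, 8800]    # hypothetical match scores
    impostor = [3100, 3600, 4200, 4700, 5300]   # hypothetical non-match scores
    for threshold in range(3000, 9001, 1500):
        print(threshold, far_frr(genuine, impostor, threshold))
```

As the threshold rises, FAR falls while FRR rises; the threshold at which the two curves cross gives the equal error rate.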
Appendix II: VidTIMIT Database
Dataset Documentation
Sanderson, 2009
The VidTIMIT dataset is Copyright (c) 2001-2009.

Use of this dataset is permitted under the following conditions:

1. This notice is left intact and not modified.
2. The dataset is provided as is. There is no warranty of any kind as to the fitness of the dataset for any particular purpose. The distributor of the dataset is not responsible for any direct or indirect losses resulting from the usage of the dataset.

Any publication (e.g. conference paper, journal article, technical report, book chapter) resulting from the usage of the VidTIMIT dataset must cite the following paper:

C. Sanderson and B.C. Lovell, "Multi-Region Probabilistic Histograms for Robust and Scalable Identity Inference," International Conference on Biometrics, Lecture Notes in Computer Science (LNCS), Vol. 5558, pp. 199-208, 2009.
The VidTIMIT dataset is comprised of video and corresponding audio recordings of 43 volunteers (19 females and 24 males), reciting short sentences. It can be useful for research on topics such as automatic lip reading, multi-view face recognition, multi-modal speech recognition and person identification/verification.

The dataset was recorded in 3 sessions, with a mean delay of 7 days between Session 1 and 2, and 6 days between Session 2 and 3. The delay between sessions allows for changes in the voice, hairstyle, make-up, clothing and mood (which can affect the pronunciation), thus incorporating attributes which would be present during the deployment of a verification system. Additionally, the zoom factor of the camera was randomly perturbed after each recording.
The sentences were chosen from the test section of the NTIMIT corpus¹. There are ten sentences per person. The first six sentences (sorted alpha-numerically by filename) are assigned to Session 1. The next two sentences are assigned to Session 2, with the remaining two to Session 3. The first two sentences for all persons are the same, with the remaining eight generally different for each person. The mean duration of each sentence is 4.25 seconds, or approximately 106 video frames at 25 fps.
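The sentence-to-session assignment described above is deterministic, so it can be sketched directly; the filenames below are hypothetical TIMIT-style examples, not an actual subject's file list:

```python
# Split a person's ten sentence files into the three recording sessions:
# first six (after an alpha-numeric sort by filename) -> Session 1,
# next two -> Session 2, last two -> Session 3.

def split_into_sessions(filenames):
    ordered = sorted(filenames)  # alpha-numeric (lexicographic) sort by filename
    return {1: ordered[:6], 2: ordered[6:8], 3: ordered[8:10]}

# At 25 fps, a mean sentence duration of 4.25 s gives about 106 frames.
mean_frames = round(4.25 * 25)

files = ["sa1.wav", "sa2.wav", "si768.wav", "si1398.wav", "si2028.wav",
         "sx48.wav", "sx138.wav", "sx228.wav", "sx318.wav", "sx408.wav"]
sessions = split_into_sessions(files)
```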
A typical example of the sentences used is given in Table 1. In addition to the sentences, each person performed a head rotation sequence in each session, turning the head to the left, to the right, and back to the centre; this allows for the extraction of profile and near-profile views of the face. The recordings were done in a noisy office environment using a broadcast-quality digital video camera.
Session 1, sa1: "She had your dark suit in greasy wash water all year."
Session 1, sx: "The clumsy customer spilled some…"
Session 3, sx48: "Grandmother outgrew her upbringing in petticoats."

Table 1: Typical example of sentences used in the VidTIMIT database.
The lighting was provided by standard overhead fluorescent tubes, as in most office environments. The lights were covered with A4-size white office paper in order to diffuse the light and reduce the glare on the face and the top of the head. An incandescent lamp was placed in front of the person (just below the camera); the lamp was covered with a sheet of A4-size white office paper.

The video of each person is stored as a numbered sequence of JPEG images with a resolution of 384 x 512 pixels (rows x columns). A quality setting of 90% was used during the creation of the JPEG images. The corresponding audio is stored as a mono, 16-bit, 32 kHz WAV file.
3G dTIM.IT data.Set is compr
ised of 44 files (includm
gdthha
iss th
oceurfone\\onwing,n. temal su.uc,ture..
b. Each zip archive is for
one person (e.g. felcO .ta
r) an
sub·~ectlD I audio / sentence
ID. wav
sub JectIDI vi. de o/ sentence
· th
wher ID / ### 39
e senten ce ID is t e iden
·.
tifie r ( ·
e.
, sx 6) 3J\d ### is a ree
di . s th e he ad rotation or se
n enc
g. . ·
JPEG illlage (note that there ,s no .JPg
git frame number (e.g. 03
7). E ac h frame is stored
extension).
as a
re 1S. audio for the he ad r . n
f1" ., ,ag
00
e,.ents ota t1o
sequence,
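The directory layout of the dataset maps directly to file paths. A small sketch, using a hypothetical subject ID and sentence ID (the frame files carry no .jpg extension):

```python
# Sketch: building the expected VidTIMIT paths for one subject.
# Subject ID "felc0" and sentence ID "sx6" are illustrative examples.
from pathlib import Path

def audio_path(root, subject_id, sentence_id):
    # audio is stored as subjectID/audio/sentenceID.wav
    return Path(root) / subject_id / "audio" / f"{sentence_id}.wav"

def frame_path(root, subject_id, sentence_id, frame_number):
    # video frames are stored as three-digit numbered JPEG images,
    # subjectID/video/sentenceID/### (no .jpg extension)
    return Path(root) / subject_id / "video" / sentence_id / f"{frame_number:03d}"

wav = audio_path("vidtimit", "felc0", "sx6")
frame = frame_path("vidtimit", "felc0", "sx6", 37)
```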
The dataset was created by C. Sanderson at Griffith University, Queensland, Australia, with the advice of Professor Kuldip K. Paliwal. The audio was recorded using the camera's microphone.
¹ C. Jankowski, A. Kalyanswamy, S. Basson and J. Spitz, "NTIMIT: a phonetically balanced, continuous speech, telephone bandwidth speech database," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), 1990, Vol. 1, pp. 109-112.
Example: VidTIMIT Database

Signal label: r1.wav
Sampling period: T = 3.1 x 10⁻⁵ s
Sampling frequency: fs = 32 kHz

Figure B.1: Plot of the time domain waveform and its frequency domain spectrogram for the signal.
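A view like Figure B.1 pairs the time domain waveform with its frequency content. A minimal sketch of the underlying computation for a 32 kHz mono signal; the WAV file is replaced here by a synthesized 440 Hz tone, so the tone frequency and frame length are illustrative assumptions:

```python
# Sketch: time-domain samples and a single-frame magnitude spectrum.
# A spectrogram repeats this per windowed frame along the signal.
import numpy as np

fs = 32_000                               # VidTIMIT audio sample rate (32 kHz mono)
t = np.arange(0, 1.0, 1 / fs)             # one second of samples
x = 0.5 * np.sin(2 * np.pi * 440.0 * t)   # hypothetical 440 Hz test tone

frame = x[:1024] * np.hanning(1024)       # one Hann-windowed analysis frame
spectrum = np.abs(np.fft.rfft(frame))     # magnitude spectrum of the frame
peak_hz = np.argmax(spectrum) * fs / 1024 # frequency of the strongest bin
```

The peak lands within one FFT bin (fs/1024 = 31.25 Hz) of the tone frequency, which is the resolution limit of a 1024-point frame.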
S1S1: Speaker 1, sentence 1 ("The clumsy customer spilled…"), duration 4 secs
S42S4: Speaker 42, sentence 4
S67S10: Speaker 67, sentence 10 ("My name is Tanvi…")

Figure B.2: Sample images of the speakers from the database.
A database of 72 speakers was finally selected from the recordings, as plots of the waveform in the time domain and frequency domain showed that some recordings were not quite audible.
Example: speaker 16 and speaker 37 from the … database

Signal labels: a) S16S1.wav, b) S37S4.wav
Sampling period: a) 2.08 x 10⁻⁵ s, b) 2.0833 x 10⁻⁵ s
Sampling frequency: fs = 48 kHz
Plots: time domain waveform (amplitude versus seconds) and frequency domain spectrum (magnitude in dB versus frequency in kHz) for each signal.