
Chapter 2: SPEAKER IDENTIFICATION SYSTEM

2.1. General model

Depending on the approach taken to the problem, a speaker identification system may comprise different components and operating mechanisms, although some elements are common to all of them. Figure 2.1 illustrates the general operation of an open-set speaker identification system. Within the scope of this thesis, the speaker identification system is presented following the mixture-of-Gaussians hidden Markov model (MGHMM) approach.
[Figure 2.1, block diagram. Front end: Input -> Audio Sampling -> Digital Speech -> Feature Extraction (MFCC transformation) -> Feature Vectors -> Energy Detection (non-speech segments are ignored) -> Feature Normalization -> Normalized Features. Modelization: Init Models (K-means) -> Initialized Models -> Train Models (EM) -> Trained Models (Speaker Model 1, Speaker Model 2, ..., Speaker Model n). Classification: each speaker model outputs a Score -> Score Normalization (UCN) -> Decision Threshold -> Reject or Confirmed ID.]

Figure 2.1: Operation of an open-set speaker identification system.


2.2. Audio sampling

Real-world speech captured by recording devices such as microphones and mobile devices is digitized into discrete signals. The result is raw speech data. At this stage the representation does not yet expose semantic or characteristic information, and it usually contains environmental noise. Therefore, before it can be used to train models or to perform recognition, raw speech data must go through preprocessing steps that remove noise and extract the features needed for training and recognition.

2.3. Feature extraction

Feature extraction can be understood as a transformation from a high-dimensional vector to a lower-dimensional one. Formally, it can be defined as a mapping $f : \mathbb{R}^N \to \mathbb{R}^d$, where $d \ll N$. For speaker models to generalize well, the number of training vectors must usually be large. Reducing the size of each training vector through feature extraction therefore lowers the computational complexity of both training and recognition. For speaker recognition, a good feature should have the following properties:

1. Feature vectors of different speakers differ strongly.
2. Feature vectors of the same speaker differ little.
3. It is robust to noise.
4. It discriminates well against impostors.
5. It is independent of the other features.


The first two properties require the feature to be as discriminative as possible. The example in Figure 2.2 shows the separability of two different features: feature 2 is clearly better than feature 1 at distinguishing between speakers.

Figure 2.2: Example of the separability of two different features.

A good feature must also be robust to noise and to impostors (properties 3 and 4). Finally, if a system uses more than one feature, these features must be independent of one another (property 5); using mutually dependent features usually does not give good results. An ideal feature, one with all five properties above, rarely exists in practice. In speaker recognition, commonly used features include MFCC (Mel-Frequency Cepstral Coefficients) and LSP (Line Spectral Pairs). This thesis focuses only on MFCC features for speaker identification. Figure 2.3 shows the MFCC extraction steps. The raw signal goes through the following main steps: framing, Fourier transform, application of the Mel filter banks, taking the logarithm, and the discrete cosine transform.


[Figure 2.3, block diagram: Voice Signal -> Framing with Hamming Window -> Voice Frames -> FFT -> Power Spectrum -> Apply Mel Filter Banks -> Log -> DCT -> MFCC Vectors.]

Figure 2.3: MFCC feature extraction steps.

2.3.1. Framing (enframing)

Figure 2.4: Waveform before and after high-pass filtering.

Before feature extraction, the speech data is passed through a pre-emphasis step using a high-pass filter:

$$s_2(n) = s(n) - a\,s(n-1)$$

where $s(n)$ is the input signal, $s_2(n)$ is the output signal, and the constant $a \in [0.9, 1]$. Pre-emphasis boosts the high frequencies that are attenuated during signal acquisition. Figure 2.4 illustrates its result.

Speech data is usually non-stationary, so the Fourier transform is normally applied to short segments of the signal. The framing step splits the speech data into small frames of about 20 ms to 30 ms each. Adjacent frames overlap by about 10 ms to 15 ms to avoid losing information. This mechanism is illustrated in Figure 2.5.
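The pre-emphasis filter and the framing step described above can be sketched as follows. This is a minimal illustration, not the thesis implementation: the function names, the default a = 0.97, the 25 ms frame length and the 15 ms overlap are assumptions chosen within the ranges stated in the text, and NumPy is assumed to be available.

```python
import numpy as np

def pre_emphasis(signal, a=0.97):
    """Apply s2(n) = s(n) - a * s(n - 1); the first sample is left unchanged."""
    signal = np.asarray(signal, dtype=float)
    out = signal.copy()
    out[1:] -= a * signal[:-1]
    return out

def enframe(signal, sample_rate, frame_ms=25.0, overlap_ms=15.0):
    """Split a 1-D signal into overlapping frames, one frame per row.

    Trailing samples that do not fill a complete frame are dropped.
    """
    signal = np.asarray(signal, dtype=float)
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    step = frame_len - int(round(sample_rate * overlap_ms / 1000.0))
    if step <= 0:
        raise ValueError("overlap must be shorter than the frame")
    if len(signal) < frame_len:
        return np.empty((0, frame_len))
    n_frames = 1 + (len(signal) - frame_len) // step
    # Row r holds samples r * step .. r * step + frame_len - 1.
    idx = np.arange(frame_len)[None, :] + step * np.arange(n_frames)[:, None]
    return signal[idx]
```

With a 16 kHz signal these defaults give 400-sample frames that advance by 160 samples, so consecutive frames share 240 samples (15 ms).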

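Figure 2.5: Framing mechanism.

Each frame is then multiplied by a window function:

$$s'(n) = s(n)\,w(n), \quad n \in [0, N-1]$$

where $s(n)$ is the signal in the frame, $N$ is the frame size, and $w(n)$ is the window function. Commonly used window functions are:

Hamming: $w(n) = 0.53836 - 0.46164 \cos\left(\frac{2\pi n}{N-1}\right)$

Hann: $w(n) = 0.5\left(1 - \cos\left(\frac{2\pi n}{N-1}\right)\right)$

Cosine: $w(n) = \cos\left(\frac{\pi n}{N-1} - \frac{\pi}{2}\right) = \sin\left(\frac{\pi n}{N-1}\right)$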

Multiplying each frame by a window function reinforces continuity at the two frame boundaries and makes the signal within the frame behave as periodic. Figure 2.6 illustrates the result of multiplying a frame by a Hamming window.
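The three window functions listed above can be written directly from their formulas. The sketch below assumes NumPy; the function names and the `apply_window` helper are illustrative and do not come from the thesis.

```python
import numpy as np

def hamming(N):
    """w(n) = 0.53836 - 0.46164 * cos(2*pi*n / (N - 1)), n = 0 .. N-1."""
    n = np.arange(N)
    return 0.53836 - 0.46164 * np.cos(2 * np.pi * n / (N - 1))

def hann(N):
    """w(n) = 0.5 * (1 - cos(2*pi*n / (N - 1)))."""
    n = np.arange(N)
    return 0.5 * (1 - np.cos(2 * np.pi * n / (N - 1)))

def cosine(N):
    """w(n) = sin(pi*n / (N - 1))."""
    n = np.arange(N)
    return np.sin(np.pi * n / (N - 1))

def apply_window(frames, window_fn=hamming):
    """Multiply every frame (last axis) by the chosen window."""
    frames = np.asarray(frames, dtype=float)
    return frames * window_fn(frames.shape[-1])
```

All three windows peak at the middle of the frame and taper towards its edges, which is what suppresses the discontinuity at the frame boundaries.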

Figure 2.6: Signal before and after multiplication by the Hamming window.

2.3.2. Discrete Fourier Transform (DFT)

The discrete Fourier transform (DFT) converts the audio signal from the time domain to the frequency domain. A signal $x$ of length $N$ is transformed into a complex signal of length $N/2 + 1$ in the frequency domain, made of two parts: ReX (real part) and ImX (imaginary part). The DFT equations are:

$$\mathrm{Re}X[k] = \sum_{i=0}^{N-1} x[i] \cos\left(\frac{2\pi k i}{N}\right)$$

$$\mathrm{Im}X[k] = -\sum_{i=0}^{N-1} x[i] \sin\left(\frac{2\pi k i}{N}\right)$$

where $i \in [0, N-1]$ and $k \in [0, N/2]$. In the complex plane (Cartesian coordinates), ReX and ImX can also be expressed as the magnitude $r$ of the complex vector and its rotation angle $\theta$ (polar coordinates), as shown in Figure 2.7.

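Figure 2.7: Relationship between Cartesian and polar coordinates.

From the real part ReX and the imaginary part ImX, the magnitude MagX (magnitude spectrum) and the phase PhaseX are computed as:

$$\mathrm{MagX} = r = \sqrt{(\mathrm{Re}X)^2 + (\mathrm{Im}X)^2}$$

$$\mathrm{PhaseX} = \theta = \arctan\left(\frac{\mathrm{Im}X}{\mathrm{Re}X}\right)$$

The inverse conversion is:

$$\mathrm{Re}X = \mathrm{MagX} \cos(\mathrm{PhaseX})$$

$$\mathrm{Im}X = \mathrm{MagX} \sin(\mathrm{PhaseX})$$

In MFCC extraction, the output of the Fourier transform step is the power spectrum PowX (the squared magnitude of the complex vector):

$$\mathrm{PowX} = (\mathrm{MagX})^2$$

PowX shows how the energy of the audio signal is concentrated across frequency regions.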
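As an illustration of the DFT step, the sketch below evaluates the ReX and ImX sums directly for k = 0 .. N/2 and derives the magnitude, phase and power spectrum from them. It assumes NumPy and the usual sign convention in which ImX carries a minus sign, so the result matches `numpy.fft.rfft`; the function names are illustrative. In practice an FFT routine would be used instead of this O(N^2) evaluation.

```python
import numpy as np

def dft_real_imag(x):
    """Evaluate ReX[k] and ImX[k] for k = 0 .. N/2 directly from the sums."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    k = np.arange(N // 2 + 1)[:, None]
    i = np.arange(N)[None, :]
    angle = 2 * np.pi * k * i / N
    re = (x * np.cos(angle)).sum(axis=1)
    im = -(x * np.sin(angle)).sum(axis=1)
    return re, im

def to_polar(re, im):
    """Return (MagX, PhaseX); arctan2 is the quadrant-aware form of arctan(Im/Re)."""
    return np.hypot(re, im), np.arctan2(im, re)

def power_spectrum(x):
    """PowX = MagX**2 = ReX**2 + ImX**2."""
    re, im = dft_real_imag(x)
    return re ** 2 + im ** 2
```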


2.3.3. Mel filter bank

Mel is short for the word melody. Mel frequency corresponds approximately to the logarithm of ordinary (linear) frequency and reflects how the human ear perceives sound. The relationship between mel frequency and linear frequency is:

$$\mathrm{mel} = 1127.01048 \ln\left(1 + \frac{f}{700}\right) \tag{2.1}$$

$$f = 700\left(e^{\mathrm{mel}/1127.01048} - 1\right) \tag{2.2}$$
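Equations (2.1) and (2.2) translate directly into a pair of conversion functions. The names `hz_to_mel` and `mel_to_hz` below are illustrative; only the standard library is used.

```python
import math

MEL_SCALE = 1127.01048  # constant from equations (2.1) and (2.2)

def hz_to_mel(f):
    """Equation (2.1): linear frequency in Hz to mel."""
    return MEL_SCALE * math.log(1.0 + f / 700.0)

def mel_to_hz(mel):
    """Equation (2.2): mel back to linear frequency in Hz."""
    return 700.0 * (math.exp(mel / MEL_SCALE) - 1.0)
```

The two functions are inverses of each other, and with this constant 1000 Hz maps to approximately 1000 mel.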

Figure 2.8: Relationship between mel frequency and linear frequency.

Mel filter banks are triangular band-pass filters; a band-pass filter passes the frequencies within a desired range. The purpose of applying the Mel filter bank is to retain the frequencies the human ear can perceive while reducing the size of the feature vector. The filters are placed so that their center frequencies are evenly spaced on the mel scale, and therefore logarithmically spaced on the linear frequency scale, and so that the two edges of each filter coincide with the center frequencies of its two neighbors. Figure 2.9 illustrates the filters on the mel scale and on the linear frequency scale.


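Figure 2.9: Mel filter banks on the mel scale and on the linear frequency scale.

Consider the mel filter banks on the linear frequency scale in Figure 2.10, where $f_c(m)$ is the center frequency of the $m$-th filter and $F_s$ is the sampling rate of the audio signal.

Figure 2.10: Mel filter banks on the linear frequency scale.

The filters are given by:

$$H_m(k) = \begin{cases}
0 & \text{for } f(k) < f_c(m-1) \\
\dfrac{f(k) - f_c(m-1)}{f_c(m) - f_c(m-1)} & \text{for } f_c(m-1) \le f(k) < f_c(m) \\
\dfrac{f(k) - f_c(m+1)}{f_c(m) - f_c(m+1)} & \text{for } f_c(m) \le f(k) < f_c(m+1) \\
0 & \text{for } f(k) \ge f_c(m+1)
\end{cases}$$

where $H_m$ is the $m$-th filter and $f(k) = kF_s/N$. The center frequency $f_c(m)$ is computed from the corresponding center frequency on the mel scale, $\phi_c(m)$, using equation (2.2):

$$f_c(m) = 700\left(e^{\phi_c(m)/1127.01048} - 1\right)$$

Because the center frequencies $\phi_c(m)$ are evenly spaced on the mel scale, we always have:

$$\phi_c(m) = m\,\Delta\phi, \qquad \Delta\phi = \frac{\phi_{\max} - \phi_{\min}}{M + 1}$$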


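where $M$ is the number of filters, $m \in [1, M]$, and $\phi_{\max}$ and $\phi_{\min}$ are computed from $f_{\max}$ and $f_{\min}$ using equation (2.1), with $f_{\min} = 0$ and $f_{\max} = F_s/2$. Passing the power spectrum from the Fourier transform step through the Mel filter banks gives:

$$e(m) = \ln\left(\sum_{k=0}^{N-1} \mathrm{PowX}(k)\, H_m(k)\right)$$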
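The filter bank construction and the log energies e(m) can be sketched as below. This is an illustration under stated assumptions: NumPy is used, the function names are hypothetical, the sum runs over the N/2 + 1 bins that the DFT step actually produces, and a small `floor` is added before the logarithm to avoid ln(0), which the text does not mention.

```python
import numpy as np

MEL_SCALE = 1127.01048

def hz_to_mel(f):
    return MEL_SCALE * np.log(1.0 + f / 700.0)

def mel_to_hz(mel):
    return 700.0 * (np.exp(mel / MEL_SCALE) - 1.0)

def mel_filter_bank(M, N, Fs, f_min=0.0, f_max=None):
    """Return an (M, N//2 + 1) matrix whose row m-1 holds H_m(k)."""
    f_max = Fs / 2.0 if f_max is None else f_max
    # M + 2 evenly spaced mel points: the M centres plus the two outer edges.
    phi = np.linspace(hz_to_mel(f_min), hz_to_mel(f_max), M + 2)
    fc = mel_to_hz(phi)                    # fc[m] for m = 0 .. M + 1
    f = np.arange(N // 2 + 1) * Fs / N     # f(k) = k * Fs / N
    H = np.zeros((M, len(f)))
    for m in range(1, M + 1):
        rising = (f >= fc[m - 1]) & (f < fc[m])
        falling = (f >= fc[m]) & (f < fc[m + 1])
        H[m - 1, rising] = (f[rising] - fc[m - 1]) / (fc[m] - fc[m - 1])
        H[m - 1, falling] = (f[falling] - fc[m + 1]) / (fc[m] - fc[m + 1])
    return H

def log_filter_bank_energies(pow_x, H, floor=1e-12):
    """e(m) = ln(sum_k PowX(k) * H_m(k)); `floor` is an added safeguard."""
    return np.log(np.maximum(H @ np.asarray(pow_x, dtype=float), floor))
```

Each row rises linearly from the previous centre frequency, reaches 1 at its own centre, and falls back to 0 at the next centre, so neighbouring filters overlap exactly as described in the text.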
The final step is to apply the discrete cosine transform to $e$.

2.3.4. Discrete Cosine Transform (DCT)

The discrete cosine transform is given by:

$$c(l) = \sum_{m=1}^{M} e(m) \cos\left(\frac{l\pi}{M}\left(m - \frac{1}{2}\right)\right)$$

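where $c(l)$ is the $l$-th MFCC coefficient, $l \in [1, L]$, $L$ is the desired number of MFCC coefficients, and $M$ is the number of filters. Typically $M = 24$ and $L = 12$. The vector $c$ is thus the result of the whole MFCC extraction process for one frame.

From the vectors $c$, the first-order derivative (delta cepstrum) and the second-order derivative (delta-delta cepstrum) can be computed as follows, where $n$ is the frame index:

$$\Delta c_n = \frac{1}{2}\left(c_{n+1} - c_{n-1}\right)$$

$$\Delta\Delta c_n = \frac{1}{2}\left(\Delta c_{n+1} - \Delta c_{n-1}\right)$$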

These derivatives are regarded as dynamic features that capture the change between frames. Another feature that is often used is the energy of the frame:

$$\mathrm{Energy} = \sum_{t=t_1}^{t_2} x^2(t)$$
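The DCT of Section 2.3.4, the delta formulas and the frame energy can be sketched as follows. NumPy is assumed and the function names are illustrative. One detail is an assumption not stated in the text: at the first and last frame, where a neighbour is missing, the edge frame is repeated.

```python
import numpy as np

def mfcc_from_log_energies(e, L=12):
    """c(l) = sum_{m=1..M} e(m) * cos(l * pi / M * (m - 1/2)), for l = 1 .. L."""
    e = np.asarray(e, dtype=float)
    M = len(e)
    l = np.arange(1, L + 1)[:, None]
    m = np.arange(1, M + 1)[None, :]
    return (e * np.cos(l * np.pi / M * (m - 0.5))).sum(axis=1)

def delta(features):
    """d[n] = (c[n+1] - c[n-1]) / 2 along the frame axis (axis 0).

    Edge frames are repeated so the output has one row per input frame
    (an assumption; the text does not say how boundaries are handled).
    """
    features = np.asarray(features, dtype=float)
    padded = np.concatenate([features[:1], features, features[-1:]], axis=0)
    return 0.5 * (padded[2:] - padded[:-2])

def frame_energy(frame):
    """Sum of squared samples of one frame."""
    return float(np.sum(np.asarray(frame, dtype=float) ** 2))
```

The delta-delta cepstrum is obtained by applying `delta` twice, for example `delta(delta(mfcc_matrix))` for a matrix with one MFCC vector per row.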


In speech recognition systems, the feature vector typically has 39 components: 12 MFCC, 1 MFCC energy, 12 delta cepstrum, 1 delta energy, 12 delta-delta cepstrum, and 1 delta-delta energy. The speaker identification system in this thesis, however, uses a feature vector made of the 12 MFCC coefficients only.

2.4. Energy detection

The purpose of energy detection is to discard audio segments that contain no speech. This is done by comparing the energy of each feature vector with a threshold $T$: if the energy is below $T$, the feature vector is discarded. After energy detection, the feature vectors that contain speech are normalized in the next step.

2.5. Feature normalization

To reduce side effects of the recording environment, feature vectors are usually normalized using two parameters computed over all feature vectors: the mean vector $\mu$ (mean) and the variance $\sigma$ (covariance). Each feature vector is then normalized as:

$$x' = \frac{x - \mu}{\sigma}$$
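The energy thresholding of Section 2.4 and the normalization of Section 2.5 can be sketched as below. NumPy is assumed and the function names are illustrative. Two points are assumptions: the division uses the per-dimension standard deviation, and a small `eps` guards against division by zero for constant dimensions.

```python
import numpy as np

def drop_low_energy(features, energies, T):
    """Keep the feature vectors whose frame energy is at least T."""
    features = np.asarray(features, dtype=float)
    keep = np.asarray(energies, dtype=float) >= T
    return features[keep]

def normalize(features, eps=1e-12):
    """x' = (x - mu) / sigma, per dimension, using statistics of all vectors."""
    features = np.asarray(features, dtype=float)
    mu = features.mean(axis=0)
    sigma = features.std(axis=0)
    return (features - mu) / np.maximum(sigma, eps)
```

After `normalize`, every feature dimension has zero mean and, unless it was constant, unit standard deviation over the utterance.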

This is also the last step of data preprocessing. The normalized feature vectors are then used either to train the models (enrollment phase) or for recognition (test phase).

2.6. Building speaker models

The core of a speaker recognition/identification system is its set of speaker models, each representing one individual speaker. These speaker models are built by generalizing the sample data of the corresponding speaker; that is, each speaker model must be trained to fit its own sample data as closely as possible. Building the speaker models lays the foundation for the later recognition of speech samples.
[Figure 2.11, block diagram: Training Data of Speaker i -> Modelization Method -> Speaker Model i, for i = 1, 2, ..., n.]

Figure 2.11: Speaker models.

Depending on the approach, speaker models are represented by specific kinds of models. For example, in the Vector Quantization method each speaker model is represented by a codebook; in the GMM method each speaker model corresponds to a GMM; and in the Dynamic Time Warping method a speaker model is simply the set of feature vectors of the corresponding speaker, with no modeling mechanism at all. As mentioned in Chapter 1, the speaker identification system in this thesis follows the MGHMM approach: each speaker is modeled by a speaker model represented by a separate MGHMM. The definition of the MGHMM and the details of how it is built are presented in Chapter 3.

2.7. Recognition

In an open-set speaker identification system, the recognition phase consists of two main steps:


Step 1: determine the identity of the test sample, that is, which of the system's enrolled speakers uttered it. This step is called identification.

Step 2: verify whether the test sample really belongs to the speaker determined in step 1 or to some unknown speaker. This step is called verification.

2.7.1. Identification

After the speaker models (MGHMMs) have been trained, each speaker model $\lambda_i$ fits its own training data $X_i$ best ($\lambda_i$ is trained to maximize the likelihood $p(X_i \mid \lambda_i)$). When a feature vector $X$ is presented to any speaker model $\lambda$, the output is the likelihood of $X$ with respect to $\lambda$: $p(X \mid \lambda)$. According to the Bayes decision rule for minimum-error classification, however, the sample $X$ is assigned to the class $i$ with the largest posterior $p(\lambda_i \mid X)$:

$$\mathrm{Identity} = \arg\max_i\, p(\lambda_i \mid X) \tag{2.3}$$

By Bayes' formula:

$$p(\lambda_i \mid X) = \frac{p(X \mid \lambda_i)\, p(\lambda_i)}{p(X)}$$

where $p(X)$ is the probability of observing the vector $X$ over the whole data space. Since $p(X)$ is the same for every $\lambda_i$, it can be ignored. The probability $p(\lambda_i)$ is the frequency with which speaker $i$ occurs; speakers are usually assumed to occur equally often, so $p(\lambda_i) = 1/n$, where $n$ is the number of speakers in the system. Equation (2.3) therefore reduces to:

$$\mathrm{Identity} = \arg\max_i\, p(X \mid \lambda_i), \quad i \in [1, n]$$

The likelihood $p(X \mid \lambda_i)$ acts as the score of model $\lambda_i$ for the feature vector $X$, and $X$ is assigned to the class of the speaker whose model gives the highest score.
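The decision rule above can be sketched in the log domain, which is how likelihoods are usually handled numerically. The function below is illustrative: it assumes the per-model log-likelihoods log p(X | lambda_i) have already been computed by the trained models, uses NumPy, and returns a zero-based index. The optional `log_priors` argument shows how equation (2.3) reduces to a plain maximum-likelihood choice when the priors are uniform.

```python
import numpy as np

def identify(log_likelihoods, log_priors=None):
    """Return the zero-based index i maximizing log p(X|lambda_i) + log p(lambda_i).

    log p(X) is the same for every model and is therefore omitted.
    """
    scores = np.asarray(log_likelihoods, dtype=float)
    if log_priors is None:
        # Uniform prior p(lambda_i) = 1/n: a constant shift that leaves argmax unchanged.
        log_priors = np.full(len(scores), -np.log(len(scores)))
    return int(np.argmax(scores + np.asarray(log_priors, dtype=float)))
```

For example, `identify([-120.5, -98.2, -110.0])` selects the second model (index 1), the one with the highest log-likelihood.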


2.7.2. Verification

At the end of the identification step, the identity of the test sample is taken to be the speaker with the highest score. The purpose of verification is to check whether the test sample truly belongs to that speaker or to an unknown speaker (impostor).
[Figure 2.12, block diagram: Test Data is scored by Speaker Model 1, Speaker Model 2, ..., Speaker Model n, giving P1, P2, ..., Pn -> Max -> if Pk exceeds the threshold then ID = k, otherwise Impostor.]

Figure 2.12: Recognition steps.

Verification compares the score obtained in the identification step with a predefined threshold. If the score exceeds the threshold, the identity found in the previous step is accepted (confirmed); otherwise the test sample is considered to belong to an unknown speaker. Figure 2.12 illustrates the steps of the recognition phase. Choosing an appropriate threshold is very important for verification. There are two kinds of threshold:

- Global threshold: a single threshold applied to all speakers.
- Local threshold: each speaker has its own threshold.

The best threshold is usually chosen at the equal error point on the DET (Detection Error Trade-off) curve. In addition, score normalization makes threshold selection more effective and significantly reduces the error rate.
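The accept/reject decision with either kind of threshold can be sketched as below. The function name and the convention of passing a single number for a global threshold or a dictionary for local thresholds are assumptions made for illustration.

```python
def verify(speaker_id, score, threshold):
    """Confirm or reject the identity proposed by the identification step.

    threshold: one number (global threshold) or a dict mapping each
    speaker id to its own number (local thresholds).
    Returns speaker_id if the score exceeds the threshold, otherwise None
    (the sample is treated as coming from an unknown speaker).
    """
    limit = threshold[speaker_id] if isinstance(threshold, dict) else threshold
    return speaker_id if score > limit else None
```

With local thresholds `{"a": 0.2, "b": 0.9}`, a score of 0.5 confirms speaker "a" but rejects speaker "b".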


2.8. Score normalization

The main purpose of score normalization is to separate the score distribution of speakers enrolled in the system from that of unknown speakers, that is, to raise the scores of enrolled speakers and lower those of unknown speakers. There are two main groups of solutions:

- Standardization of score distributions: techniques widely used in speaker verification, such as T-norm (test normalization), Z-norm (zero normalization) and H-norm (handset normalization).
- Bayesian solution: a technique within the Bayesian framework that applies score normalization to speaker identification systems.

This thesis uses the Bayesian solution for score normalization, according to:

$$L(O) = \log p(O \mid \lambda_{ML}) - \log p(O \mid \lambda_U)$$

where $\lambda_{ML} = \lambda_i$ with $i = \arg\max_i p(O \mid \lambda_i)$ (that is, $\lambda_{ML}$ is the model with the highest score in the identification step), and $\lambda_U$ is a model representing the unknown speakers likely to be mistaken for the speaker of model $\lambda_{ML}$. In practice $\lambda_U$ cannot be determined, so a good alternative is to approximate $p(O \mid \lambda_U)$. Three methods address this: World Model Normalization (WMN), Cohort Normalization (CN) and Unconstraint Cohort Normalization (UCN).

2.8.1. World Model Normalization (WMN)

This method approximates $p(O \mid \lambda_U)$ by $p(O \mid \lambda_{WM})$, where $\lambda_{WM}$ is a model generalized from a large number of speakers. $\lambda_{WM}$ is usually called the world model or universal background model.


Training $\lambda_{WM}$ requires a large amount of data from many speakers.

2.8.2. Cohort Normalization (CN)

In this method, each speaker model is associated with a group of the speaker models closest to it in speaker space (as expressed by the parameter set $\lambda$). The term $\log p(O \mid \lambda_U)$ is then approximated by $p_{CN}(O, \lambda_{ML}, K)$:

$$p_{CN}(O, \lambda_{ML}, K) = \frac{1}{K} \sum_{k=1}^{K} \log p\left(O \mid \lambda_{f(ML,k)}\right)$$

where $f(ML, i) \neq f(ML, j)$ for $i \neq j$, and $\lambda_{f(ML,1)}, \lambda_{f(ML,2)}, \ldots, \lambda_{f(ML,K)}$ are the speaker models closest to $\lambda_{ML}$ in speaker space. These speaker models are selected from the system's set of speaker models before the test phase and are called the competitive speaker models.

2.8.3. Unconstraint Cohort Normalization (UCN)

This method is similar to Cohort Normalization, except that the competitive speaker models are selected during the test phase itself, approximating $\log p(O \mid \lambda_U)$ by $p_{UCN}(O, \lambda_{ML}, K)$:

$$p_{UCN}(O, \lambda_{ML}, K) = \frac{1}{K} \sum_{k=1}^{K} \log p\left(O \mid \lambda_{\varphi(k)}\right)$$

where $\varphi(i) \neq \varphi(j)$ for $i \neq j$, and $\lambda_{\varphi(1)}, \lambda_{\varphi(2)}, \ldots, \lambda_{\varphi(K)}$ are the speaker models with the highest scores after $\lambda_{ML}$. These speaker models can be selected directly after the identification step, without the cost of training an additional model (as in World Model Normalization) or of selecting models in speaker space (as in Cohort Normalization).
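Because UCN only needs the scores already produced during identification, it can be sketched in a few lines. The function below is illustrative: it assumes a list of log-likelihoods log p(O | lambda_i), one per enrolled model, uses NumPy, and returns a zero-based model index together with the normalized score L(O).

```python
import numpy as np

def ucn_score(log_likelihoods, K):
    """Return (ml_index, L) with L = log p(O|lambda_ML) - p_UCN(O, lambda_ML, K).

    p_UCN is the mean of the K highest log-likelihoods after the best one.
    """
    scores = np.asarray(log_likelihoods, dtype=float)
    if not 1 <= K <= len(scores) - 1:
        raise ValueError("K must be between 1 and n - 1")
    order = np.argsort(scores)[::-1]   # model indices, best score first
    ml = int(order[0])
    cohort = order[1:K + 1]            # the K runners-up
    return ml, float(scores[ml] - scores[cohort].mean())
```

For scores `[-10, -4, -6, -8]` and K = 2, the best model is index 1 and the cohort is the two runners-up (-6 and -8), giving L(O) = -4 - (-7) = 3. The value L(O) is what would then be compared with the decision threshold.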


In [1], the score normalization methods are presented, analyzed and compared experimentally. The results show that Unconstraint Cohort Normalization is the most effective for score normalization in open-set speaker identification. This thesis uses the Bayesian solution with Unconstraint Cohort Normalization.

2.9. Some speaker identification systems

Common text-independent speaker identification systems are built on Vector Quantization or on the GMM (Gaussian Mixture Model).

2.9.1. Vector Quantization system

In a speaker identification system built on Vector Quantization, each speaker is modeled by a codebook. A codebook of size $M$ consists of $M$ codevectors, also called prototype vectors.

Figure 2.13: Vector Quantization with a codebook of size M = 3.

In the vector space of the sample data, the $M$ codevectors partition the data into $M$ clusters, each codevector being the centroid (mean vector) of its cluster. Figure 2.13 illustrates a data space clustered by 3 codevectors (prototype vectors).


Several algorithms exist for determining these codevectors from the initial data set. The two most common are Lloyd's algorithm, also known as k-means and presented in Section 3.3.3.1, and unsupervised competitive learning. In the model-building step, the Vector Quantization system thus estimates a codebook for each speaker from that speaker's training data. In the recognition step, the quantization error (Euclidean distance) between the test sample and its nearest codevector in each speaker's codebook is computed, and the test sample is assigned to the speaker with the lowest quantization error. In [19], the VQ-100 system (Vector Quantization with a codebook of size M = 100) and the VQ-50 system achieve classification accuracies of 92.9% and 90.7% respectively on a telephone speech data set of 16 speakers, with 5-second test segments.

2.9.2. GMM system

In a speaker identification system built on the GMM method, each speaker is modeled by a GMM. A GMM of size $M$ consists of $M$ Gaussian densities parameterized by mean vectors $\mu$ and covariance matrices $\Sigma$. The GMM and the algorithm for training it from data are detailed in Section 3.1. The GMM system built in [19] achieves 96.8% classification accuracy on a clean speech data set (audio recorded with a good-quality microphone) of 49 speakers, with 5-second test segments, and 94.5% on a telephone speech data set of 16 speakers, with 5-second test segments. The special case of a GMM with M = 1 is called the Gaussian Model (GM). This model achieves 67.1% classification accuracy on the telephone speech data set of 16 speakers, with 5-second test segments [19].
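The recognition step of the Vector Quantization system in Section 2.9.1 can be sketched as follows. The sketch assumes the codebooks have already been estimated (for instance with k-means), uses NumPy, and averages the nearest-codevector distance over all feature vectors of the test sample; that averaging is an assumption, since the text only says that the quantization error is computed. Function names are illustrative and indices are zero-based.

```python
import numpy as np

def quantization_error(features, codebook):
    """Mean Euclidean distance from each feature vector to its nearest codevector."""
    features = np.asarray(features, dtype=float)
    codebook = np.asarray(codebook, dtype=float)
    # d[t, j] = distance between feature vector t and codevector j.
    d = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
    return float(d.min(axis=1).mean())

def identify_vq(features, codebooks):
    """Return the index of the speaker whose codebook gives the lowest error."""
    errors = [quantization_error(features, cb) for cb in codebooks]
    return int(np.argmin(errors))
```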


Table 2.1 summarizes the recognition performance of the speaker identification systems presented in Sections 2.9.1 and 2.9.2 on the same telephone speech data set of 16 speakers, with 5-second test segments [19].

Table 2.1: Comparison of speaker identification systems on the same data set.

System                        Identification accuracy
Gaussian Mixture Model        94.5 %
Vector Quantization 100       92.9 %
Vector Quantization 50        90.7 %
Gaussian Model                67.1 %

2.9.3. Other systems

Besides the two traditional methods, GMM and Vector Quantization, recent work has approached the problem in other ways, such as the Support Vector Machine (SVM) [24] and neural networks [21].

Table 2.2: Performance of several systems on different data sets.

System                          Identification accuracy    Verification accuracy
Gaussian Mixture Model          96.8 %                     92.6 %
Support Vector Machine          95.6 %                     95.9 %
Neural Network                  94.7 %                     95.1 %
Improved Vector Quantization    94.4 %                     89.2 %

Table 2.2 compares the recognition performance of several systems built with the GMM, Support Vector Machine, neural network, and improved Vector Quantization methods. These results are summarized from [9, 15, 19, 21, 22, 23, 24, 25]. The comparison is only indicative, however, because the systems were evaluated on different speech data sets.
