Article · December 2012



Using PCA Algorithm
in Voice Recognition
Salih Gorgunoglu1,*, Emrah Ozkaynak1, İlhami Muharrem Orak1
1 Department of Computer Engineering, University of Karabük, Karabük, Turkey

Received: ; accepted:
Abstract

Voice is a biometric feature that can be used to distinguish or identify human beings or species. The origin of a voice can be determined by analyzing the sound signal. Since every human voice has distinctive characteristics, it can be used for recognizing people. Various devices and machines are also controlled by voice commands obtained from the words in the sound signal of one's speech. In this study, the Principal Component Analysis (PCA) algorithm, which is mostly used in face recognition systems, is investigated for voice recognition purposes. The PCA algorithm, along with the Support Vector Machine (SVM) and K-means algorithms, is implemented for voice recognition, and their performances are compared. It is shown that the PCA algorithm can be used as an alternative to the well-known voice recognition algorithms.
Keywords: Principal component analysis (PCA), Voice processing, Voice recognition, Support
vector machine (SVM), K-means.
©Sila Science. All rights reserved.

1. Introduction

Voice is used in many applications for various purposes. In biometric systems where security is important, voice is used as a biometric identifier or signature to recognize a person, mostly when communicating via phone call. Internet browsers and devices controlled by voice, especially for visually and physically disabled people, are other areas where voice recognition is used. Voice signature technology has also begun to be used in banking and booking services carried out through phone calls. In these systems, when the user connects to the system, the input voice signal is compared with a prerecorded voice signal for recognition. In this way, the user can carry out operations in less time [1]. In general, studies on voice recognition focus on improving the performance and security of these systems and on developing more accurate speech recognition systems [2-7].

Several studies on voice and speech recognition may be found in the literature. Wooil Kim and Richard M. Stern [8] used the multi-band method to reduce the noise signal in voice. Martin Cooke et al. [9] performed a study in 2009 to remove background audio signals. In another study, Kingsbury et al. [10] improved the performance of speech recognition over a range of noisy and reverberant conditions using the modulation spectrogram.
___________
* Corresponding author. Tel: +90-370-433 20 21; Fax: +90-370-433 32 90.
E-mail address: sgorgunoglu@karabuk.edu.tr
Artificial neural network and support vector machine algorithms are used to solve various classification problems [11-14]. It has also been shown that Support Vector Machines (SVMs) give successful results in voice recognition. Tsang-Long Pao et al. [15] estimated the emotional state of a person by using the voice signal. In their study, they used Neural Network (NN) and Support Vector Machine (SVM) classifiers and a feature selection algorithm to classify five emotions: angry, happy, sad, bored and neutral. Jing Bai et al. [16] compared the results obtained from speech recognition using SVM and an RBF network and found that the SVM has higher recognition rates than the RBF network. Mohsen Bardideh et al. [17] proposed an SVM-based continuous speech recognition system in which a confidence measure is evaluated for the speech feature vectors.
Voice recognition can be performed with ordinary computer hardware and appropriate software, without the need for extra hardware. In this study, some well-known methods, namely Principal Component Analysis (PCA), Support Vector Machines (SVMs) and the K-Means algorithm, have been used to perform voice recognition and their performances have been compared.
2. Algorithms

In this study, three different algorithms have been used for voice recognition: PCA, SVM and K-Means.
2.1. PCA Algorithm
Principal Component Analysis (PCA) reduces high dimensional data into smaller dimensions by taking distinctive characteristics into consideration [18, 19]. The algorithm consists of two stages: a training stage and a recognition stage. In the training stage, the eigenmatrix required for recognition is formed from sample data.
The first step is to transform the voice data obtained from each person into a column vector of size N×1. The column vectors are denoted Γ_1, Γ_2, …, Γ_M, where M is the number of distinct voices and N is the size of the voice data; in this study N = 25,000 samples.
The average voice is computed as follows:

Ψ = (1/M) ∑_{n=1}^{M} Γ_n    (1)

The difference of each voice from the average is obtained as:

Φ_i = Γ_i − Ψ    (2)
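As a minimal numerical illustration of Eqs. (1) and (2) (this sketch assumes NumPy, and the data is a toy stand-in, not the paper's voice samples), centering amounts to subtracting the column mean:

```python
import numpy as np

# Columns of Gamma are the M training voices (toy values, not real voice data).
Gamma = np.arange(12.0).reshape(4, 3)          # N = 4 samples, M = 3 voices
Psi = Gamma.mean(axis=1, keepdims=True)        # Eq. (1): average voice
Phi = Gamma - Psi                              # Eq. (2): deviation of each voice
print(Phi.sum())                               # deviations from the mean sum to 0.0
```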
The dimension of this high dimensional data needs to be reduced by finding a set of M orthonormal vectors u_k which best describe the distribution of the data. The k-th vector u_k is selected to maximize λ_k:

λ_k = (1/M) ∑_{n=1}^{M} (u_k^T Φ_n)^2    (3)

These λ_k values and u_k vectors are the eigenvalues and eigenvectors of the covariance matrix

C = (1/M) ∑_{n=1}^{M} Φ_n Φ_n^T = A A^T    (4)

where A = [Φ_1, Φ_2, Φ_3, …, Φ_M]. The covariance matrix C has dimension N × N. Since it is difficult to calculate the eigenvalues and eigenvectors of a matrix of this size, the eigenvalues and eigenvectors of the following matrix are calculated instead:
L = A^T A    (5)

The eigenvalues of the matrix L, which has dimension M × M (M << N), are adequate for the voice recognition problem; M denotes the number of distinct voices, and L has only M eigenvalues and eigenvectors. The eigenvectors of L form a new matrix V = [v_1, v_2, v_3, …, v_M].
The eigenvectors u_l of C are obtained from the eigenvectors v_l of L as follows:

u_l = ∑_{k=1}^{M} v_{lk} Φ_k,   l = 1, …, M    (6)
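The computational shortcut in Eqs. (4)-(6) can be checked numerically. The sketch below (assuming NumPy; the sizes are toy values, not the paper's N = 25,000) computes the eigenvectors of the small M × M matrix L and maps them through A to obtain eigenvectors of the large N × N matrix C:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 1000, 8                        # toy sizes (the paper uses N = 25,000)
A = rng.standard_normal((N, M))       # columns play the role of the Phi_i vectors

L = A.T @ A                           # Eq. (5): small M x M matrix
vals, V = np.linalg.eigh(L)           # eigenpairs of L, eigenvalues ascending
U = A @ V                             # Eq. (6): u_l = sum_k v_lk * Phi_k
U /= np.linalg.norm(U, axis=0)        # normalize each candidate eigenvector

# Each column of U is an eigenvector of C = A A^T with the same eigenvalue:
C = A @ A.T
residual = np.linalg.norm(C @ U[:, -1] - vals[-1] * U[:, -1])
print(residual < 1e-6)                # prints True
```

This works because L v = λ v implies (A A^T)(A v) = λ (A v), so the M eigenpairs of the small matrix L yield the M meaningful eigenpairs of C directly.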

In the recognition stage, a feature (weight) vector Ω is formed for every voice in the database:

w_k = u_k^T (Γ − Ψ),   Ω^T = [w_1, w_2, w_3, …, w_M]    (7)

The same calculation is carried out for an unknown (new) voice introduced to the system for recognition. Finally, the Euclidean distance is calculated as the similarity measure:

d_k = ‖Ω − Ω_k‖    (8)

If the distance for the k-th person's voice is below a certain threshold, the input voice is considered to belong to that person.
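Putting Eqs. (1)-(8) together, a minimal sketch of both stages might look as follows (assuming NumPy; the function names, toy data sizes and threshold are illustrative, not the paper's implementation):

```python
import numpy as np

def pca_train(voices):
    """Training stage. voices: (N, M) array, one column per voice (Gamma_i)."""
    psi = voices.mean(axis=1, keepdims=True)       # Eq. (1): average voice
    A = voices - psi                               # Eq. (2): columns are Phi_i
    vals, V = np.linalg.eigh(A.T @ A)              # Eq. (5): eigenpairs of L
    keep = vals > 1e-6 * vals.max()                # drop the near-zero eigenvalue
    U = A @ V[:, keep]                             # Eq. (6): eigenvectors of C
    U /= np.linalg.norm(U, axis=0)
    weights = U.T @ A                              # Eq. (7): column k is Omega_k
    return psi, U, weights

def pca_recognize(voice, psi, U, weights, threshold):
    """Recognition stage: project, then nearest stored voice by Eq. (8)."""
    omega = U.T @ (voice.reshape(-1, 1) - psi)
    d = np.linalg.norm(weights - omega, axis=0)    # Eq. (8) for every stored voice
    k = int(np.argmin(d))
    return k if d[k] < threshold else None

rng = np.random.default_rng(1)
voices = rng.standard_normal((500, 5))             # 5 toy "voices" of 500 samples
psi, U, W = pca_train(voices)
query = voices[:, 2] + 0.01 * rng.standard_normal(500)   # noisy copy of voice 2
print(pca_recognize(query, psi, U, W, threshold=5.0))    # prints 2
```

Because the columns of A sum to zero after mean subtraction, one eigenvalue of A^T A is numerically zero; the `keep` mask discards it before normalization.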
2.2. Support Vector Machines (SVMs)

SVMs are among the most widely used algorithms for classifying linear and non-linear data using statistical learning. They are used in many areas such as image processing, voice recognition, signature recognition and pattern recognition. An SVM tries to find the most appropriate (optimal) hyperplane that separates the data into two classes. Separations of different classes with various hyperplanes, and the optimal separating hyperplane, are shown in Fig. 1 [20].
Fig. 1. (a) Separating hyperplanes (b) Optimal separating hyperplane with its margin and support vectors

The aim of the SVM is to find the two furthest parallel hyperplanes that separate the two classes. If the margin between these two hyperplanes is at its maximum, the optimal hyperplane between them can be identified [20]. Each data point x_i that can be linearly separated is assigned to one of two classes, y_i ∈ {−1, +1}. The two hyperplanes are given in equation 9 and the optimal hyperplane in equation 10:

W^T X + b = +1
W^T X + b = −1    (9)

W^T X + b = 0    (10)

Here, X denotes the data, W the weight vector and b a constant. Once the W and b values that achieve the maximum margin given in equation 11 are calculated, the desired classifier is obtained:

margin = 2 / ‖W‖    (11)

In this study, the voice data to be recognized is chosen as one class and the voice data in the database as the other class. Hence, the given voice is matched against the voices in the database.
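As an illustration of the maximum-margin idea in Eqs. (9)-(11) (this sketch assumes scikit-learn and synthetic two-dimensional data; it is not the paper's SVM implementation or its voice features):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two linearly separable point clouds standing in for two classes of voice features.
X = np.vstack([rng.normal(-2.0, 0.5, (20, 2)), rng.normal(2.0, 0.5, (20, 2))])
y = np.array([-1] * 20 + [+1] * 20)

clf = SVC(kernel="linear", C=1e3).fit(X, y)   # large C approximates a hard margin
w, b = clf.coef_[0], clf.intercept_[0]        # hyperplane W^T X + b = 0 of Eq. (10)
margin = 2.0 / np.linalg.norm(w)              # Eq. (11)

print(clf.predict([[-2.0, -2.0]])[0])         # prints -1
print(clf.predict([[2.0, 2.0]])[0])           # prints 1
```

The training points lying on the two hyperplanes of Eq. (9) are the support vectors, exposed by scikit-learn as `clf.support_vectors_`.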

2.3. K-Means Algorithm

The K-means algorithm is one of the simplest algorithms that solves the classification problem via clustering. It is commonly used in pattern recognition; its high recognition rate and simple working principle lead to its widespread use [21, 22]. The algorithm starts with k initial cluster centers; in this study k is set to the number of distinct voices in the database. Each input vector is then assigned to its nearest center, and each center is updated to the average of the data assigned to it. These assignment and update steps are repeated until the center points no longer change, at which point the algorithm terminates [21].
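The steps above can be sketched directly (assuming NumPy; the toy data and the simple deterministic initialization are illustrative only):

```python
import numpy as np

def kmeans(data, k, iters=100):
    """data: (n, d) array. Assign points to the nearest center, update centers
    to cluster means, repeat until the assignment no longer changes."""
    centers = data[:: len(data) // k][:k].astype(float)   # spread-out seed points
    labels = np.full(len(data), -1)
    for _ in range(iters):
        # distance of every point to every center, then nearest-center assignment
        d = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        new_labels = d.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                                 # centers have stopped changing
        labels = new_labels
        for j in range(k):
            if np.any(labels == j):
                centers[j] = data[labels == j].mean(axis=0)
    return centers, labels

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(-3.0, 0.3, (10, 2)), rng.normal(3.0, 0.3, (10, 2))])
centers, labels = kmeans(data, k=2)
print(np.sort(np.round(centers[:, 0])))           # cluster centers near -3 and +3
```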

3. Voice Recognition

Voice recognition consists of identifying the voice and recognizing speech by obtaining words from the voice signal. For this reason the voice signal goes through several pre-processing steps. First, the analog signal is digitized by taking appropriate samples. Second, the digitized voice is filtered for noise elimination and normalized in amplitude. The start and end points of the digitized voice are also determined. All of this pre-processing increases the performance of voice recognition. In this study, voice signals are sampled at 16 bits and normalized using equation 12:
x′ = (x − min) / (max − min)    (12)

where x is the voice signal, and min and max are the lowest and highest values in the voice signal, respectively.
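Eq. (12) is a standard min-max normalization; a small sketch (assuming NumPy, with toy 16-bit sample values rather than real recordings):

```python
import numpy as np

def minmax_normalize(x):
    """Eq. (12): scale a signal linearly into the range [0, 1]."""
    lo, hi = x.min(), x.max()
    return (x - lo) / (hi - lo)

# Toy digitized voice samples covering the full signed 16-bit range.
signal = np.array([-32768, -100, 0, 250, 32767], dtype=np.int64)
x_norm = minmax_normalize(signal)
print(x_norm[0], x_norm[-1])    # prints 0.0 1.0
```

The int64 dtype avoids the overflow that max − min = 65,535 would cause if the arithmetic were done in 16-bit integers.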
The PCA algorithm is applied to the preprocessed and normalized voice signal and its features are identified. The following step is to recognize a new voice and assign it to a known class. The Euclidean distance given in equation 8 is calculated between the weights from the PCA algorithm and the weights Ω^T = [w_1, w_2, w_3, …, w_M] of the voices in the database. The stages of the voice recognition system are shown in Fig. 2.

Fig. 2. The stages of the voice recognition system: voice, sampling, preprocessing, feature extraction, and a classifier (PCA, SVM, K-Means) that compares against the database to produce the classification result

In total, 200 voice samples were gathered from 25 different people, taking 8 samples from each. For each sample, eigenvalues and eigenvectors are calculated by the PCA algorithm. Voice signals of different people for the same sample word are shown in Table 1. Turkish words and city names were chosen as the voice data.

4. Experimental Results

As explained above, a database consisting of 200 voice samples is used for voice recognition in this study. When constituting the word table, various grammatical features such as vowel harmony, vowels and consonants, emphasis, stress and intonation were taken into consideration. Based on the experiments done for voice recognition, the threshold value is taken as the distance between two voices calculated from equation 8.

Table 1. Voice graphics of the same word for different people (time-domain and frequency-domain signals for Persons 1-3)

The threshold value was specified as 0.67 × 10^-2 after experimental studies. If the distance of a voice is below this value, it is considered recognized. The PCA algorithm shows a 92% success rate over all the voice data stored in the database. The SVM and K-Means algorithms show success rates of 96% and 84%, respectively. Table 2 lists the distance values and recognition status of voice data taken from 5 of the 25 people.
Table 2. Results of the PCA, SVM and K-Means algorithms for voice recognition

Voice to be    PCA                    SVM                    K-Means
recognized     Distance   Result      Distance   Result      Distance   Result
Person 1       0.1268     Fail        0.1263     Fail        0.2538     Fail
Person 2       0.0066     Success     0.0002     Success     0.2369     Fail
Person 3       0.0053     Success     0.0015     Success     0.0725     Success
Person 4       0.0951     Success     0.0254     Success     0.0142     Success
Person 5       0.3253     Fail        0.0352     Success     0.3896     Fail
Comparing the results of the three algorithms, the SVM algorithm provides the best performance. As Fig. 3 shows, the PCA algorithm achieves similar performance to the SVM algorithm. It can be deduced that the PCA algorithm can also be used in voice recognition with good performance.

Fig. 3. Calculated distance values (0 to 1.2) for the same people (Persons 1-10) with the PCA, SVM and K-Means algorithms

5. Conclusions

Because it is easy to capture and process, the voice signal is widely used. In this study, the PCA algorithm, which is mostly used in other areas, is investigated for voice recognition purposes. A voice database was built from voice samples taken from different people. Using this database, the PCA, SVM and K-Means algorithms were applied to voice recognition and their performances were compared. The results show that the PCA algorithm also performs well in voice recognition and can be used as an alternative to the other well-known algorithms. In future work, the use of the PCA algorithm in voice recognition will be investigated further. Improving the performance of the voice recognition algorithms by means of parallel programming techniques will also be investigated.

References

[1] Ocal K. Application of Automatic Speech Recognition Algorithms. M.Sc. Thesis, Univ. of Ankara, Ankara, 2005 [in Turkish].
[2] Axelrod S, Goel V, Gopinath R, Olsen P, Visweswariah K. Discriminative estimation of subspace
constrained Gaussian mixture models for speech recognition. IEEE Transactions on Speech and
Audio Processing, 2007; 15(1):172-189.
[3] He X, Deng L, Chou W. A novel learning method for hidden Markov models in speech and audio processing. IEEE Workshop on Multimedia Signal Processing (MMSP), Victoria, BC, USA, pp. 80-85, 2006.
[4] Lee SM, Fang SH, Hung JW, Lee LS. Improved MFCC Feature Extraction by PCA-Optimized
Filter Bank for Speech Recognition. Automatic Speech Recognition and Understanding, ASRU, pp.
49-52, 2001.
[5] Lima A, Zen H, Nankaku Y, Miyajima C, Tokuda K, Kitamura T. On the use of kernel PCA for feature extraction in speech recognition. IEICE Transactions on Information and Systems, 2004; E87-D(12):2802-2811.
[6] Benzeghiba M, Mori RD, Deroo O, Dupont S, Erbes T, Jouvet D, Fissore L, Laface P, Mertins A,
Ris C, Rose R, Tyagi V, Wellekens C. Automatic speech recognition and speech variability: A
review. Speech Communication, 2007; 49(10-11):763-786.
[7] Valente F, Hierarchical and parallel processing of auditory and modulation frequencies for
automatic speech recognition. Speech Communication, 2010(10); 52:790-800.
[8] Kim W, Stern RM. Mask classification for missing-feature reconstruction for robust speech
recognition in unknown background noise. Speech Communication, 2011; 53(1):1-11.
[9] Cooke M, Hershey JR, Rennie SJ. Monaural speech separation and recognition challenge. Computer Speech & Language, 2010; 24(1):1-15.
[10] Kingsbury B, Morgan N, Greenberg S. Robust speech recognition using the modulation spectrogram. Speech Communication, 1998; 25(1-3):117-132.
[11] Olgun MO, Ozdemir G, Aydemir E. Forecasting of Turkey’s natural gas demand using artifical
neural networks and support vector machines. Energy Education Science and Technology Part A:
Energy Science and Research 2012; 30(1):15-20.
[12] Ekici S., Multi-class support vector machines for classification of transmission line faults
Energy Education Science and Technology Part A: Energy Science and Research 2012; 28(2):
1015-1026.
[13] Gorgünoğlu S, Altay Ş. Motion clustering on video sequences using competitive learning network. Turkish Journal of Electrical Engineering & Computer Sciences, DOI: 10.3906/elk-1203-37, 2012.
[14] Uçar E., Şen B., Bayir Ş., Placement score estimation of secondary education transition system
(SETS) using artificial neural networks. Energy Education Science and Technology Part A:
Energy Science and Research 2013; 30(2): 749-758.
[15] Pao TL, Chen YT, Yeh JH, Li PJ. Mandarin emotional speech recognition based on SVM and
NN. Proceedings of the 18th International Conference on Pattern Recognition (ICPR’06), 2006;
1:1096-1100.
[16] Bai J, Zhang XY, Guo YL. Speech Recognition Based on A Compound Kernel Support Vector
Machine, 11th IEEE International Conference on Communication Technology (ICCT 2008); pp.
696 – 699.
[17] Bardideh M, Razzazi F, Ghassemian H, An SVM Based Confidence Measure for Continuous
Speech Recognition. IEEE International Conference on Signal Processing and Communications
(ICSPC 2007), pp. 1015 – 1018.
[18] Turk M, Pentland A. Eigenfaces for recognition. Journal of Cognitive Neuroscience, 1991; 3(1):71-86.
[19] Gorgünoğlu S, Oz K, Bayır Ş. Performance Analysis of Eigenfaces Method in Face Recognition
System. 2nd international Symposium on Computing in Science & Engineering, 01-04 June
2011, Kusadası, Aydın, Turkey.
[20] Wu JD, Liu CT. Finger-vein pattern identification using SVM and neural network technique,
Expert Systems with Applications 2011; 38(11):14284–14289.
[21] Qiu D, A comparative study of the K-means algorithm and the normal mixture model for
clustering: Bivariate homoscedastic case. Journal of Statistical Planning and Inference 2010;
140: 1701–1711.
[22] Sarma TH, Viswanath P, Reddy BE. Speeding-up the kernel k-means clustering method: A
prototype based, hybrid approach, Pattern Recognition Letters 2013; 34: 564-573.
