
Evaluation of Singer’s Voice Quality by Means of Visual Pattern Recognition

Paweł Forczmański, Szczecin, Poland

Summary: This article describes an algorithm for singing voice quality assessment that uses selected methods from the field of digital image processing and recognition. It adopts the assumption that an audio signal containing a recorded vocal exercise can be converted into a visual representation and processed further as an image. The presented approach is based on generating a sound spectrogram of a sample in the form of a rectangular matrix, objectively improving its visual quality through local changes in brightness and contrast, and scaling it to a fixed size. It then uses a two-step approach: the construction of a representative database of reference samples and the identification of test samples. The process of building the database uses two-dimensional linear discriminant analysis. The recognition operation is then carried out in a reduced feature space obtained by two-dimensional Karhunen-Loeve projection. Classification is done by a variant of the Support Vector Machines approach. As shown, the results are very encouraging and are competitive with the most powerful state-of-the-art methods.
Key Words: Singing quality–Image recognition–Image processing–Spectrogram–Short-time Fourier transform–
Linear discriminant analysis–Support vector machine.

Accepted for publication March 2, 2015.
From the Faculty of Computer Science and Information Technology, West Pomeranian University of Technology, Szczecin, 52 Zolnierska St., 71-210 Szczecin, Poland.
Address correspondence and reprint requests to Paweł Forczmański, Faculty of Computer Science and Information Technology, West Pomeranian University of Technology, Szczecin, 52 Zolnierska St., 71-210 Szczecin, Poland. E-mail: pforczmanski@wi.zut.
Journal of Voice, Vol. 30, No. 1, pp. 127.e21-127.e30
© 2016 The Voice Foundation

INTRODUCTION

Problem definition
Analysis of the human voice is one of the most interesting tasks of multimedia systems. Within this problem, we can distinguish three main research directions: speech recognition (in terms of content), speaker recognition (identification), and evaluation of speech quality. Although the identification of persons on the basis of the registered voice and speech recognition in terms of content are tasks fairly well described in the literature and implemented in practice, the evaluation of voice quality is a problem that is still not fully solved. It may involve the automatic evaluation of voice quality, for example, in the process of language learning and in assessing the degree of training of lectors and singers. It is worth noting that the sensitivity that characterizes the human sense of hearing has long been achievable for technical equipment: each of the physical quantities characterizing the speech signal can now be specified much more precisely using computerized analyzers than using the sense of hearing. At the same time, however, humans are still able to use the information acquired from the voice signal in a more effective manner. This is because the sense of hearing and the human nervous system are highly specialized and adapted through evolution to collect and analyze the speech signal; however, the underlying processes are not fully understood. This causes serious difficulties in implementing computer algorithms aimed at such problems. Voice analysis is a subject of research for specialists in many fields: phoneticians, phoniatrists, speech therapists, and specialists in telecommunications. In spite of many studies, however, the speech signal must still be considered complex and difficult to interpret completely (i.e., in a way comparable to the analysis performed by a human). The speech signal contains complex information that conveys the basic meaning of speech and also makes it possible to detect additional features, such as the interlocutor’s sex, age, health status, mood, education, and others. Analysis of the literature shows that the speech signal can be described by a number of numerical parameters. It should be noted that various techniques are known, aimed, for example, at measurements of voice in acoustic phonetics, medicine, and automatic speech recognition.1 Thus, it seems that there is no need to seek new methods of representing the speech signal; the focus should instead be on the selection and use of existing ones.
Singing and speech signals are extremely complex when it comes to formal description. The concept of singing is strongly related to the concept of quality. In the traditional approach, singing is usually assessed by an expert or a group of experts in voice-related fields, and this analysis is perceptual.2 Because this process is human-centric and highly subjective, it may be interesting to provide some measures and techniques to make it more objective.
It should be noted that automation of the singing quality evaluation process can serve multiple purposes, including supporting singers in learning singing and vocal skills, supporting the classification of singers in terms of vocal advancement and suitability for specific vocal assignments, and identifying disorders in emission and potential health problems. It is easy to imagine a computer system that automatically evaluates a singer’s advancement by analyzing a short vocal exercise recorded with a simple microphone. Such an examination can be used in choir rehearsal or as a routine check of a singer’s physical shape.
It is therefore important to find a subset of parameters measured for sound for which the interindividual variability is significantly smaller than the variation resulting from the level of vocal advancement. In this work, it is also assumed that the previously described process can be performed by means of timing and frequency-subcontours for sound.
Therefore, this work focuses on the task of singing quality evaluation in the context of automated classification of the level
of singer’s training. The proposed process uses objective features taken from the voice signal and can be applied in many practical situations, for example, when evaluating a singer’s abilities and personal vocal skills or when detecting potential anomalies in voice production.
The article is organized as follows. The rest of the introductory part presents some related works. The second section provides a description of the developed algorithm as well as the benchmark data set consisting of real vocal samples collected from choir singers. The third section presents numerical experiments and discusses their results. The last section concludes the article.

Related works
A large number of approaches to the automatic evaluation of singing quality can be found in the literature. One of the proposed measures of singing quality is the singing power ratio (SPR).3 It is calculated in the spectral domain and is defined as the ratio between the highest peak amplitude for frequencies in the range 2–4 kHz and the highest peak in the range 0–2 kHz. According to Omori et al,3 SPR can be used to distinguish between voice samples with extremely varying advancement levels. The same feature has been used together with a set of other factors in another study by Nakano et al,4 to determine the differences between training of singing students. Experiments conducted on 55 people show that it is not possible to clearly distinguish subjects. Taking into account the results obtained, it can be concluded that SPR can be used to distinguish singers only if their skills vary in a significant way.
Another popular criterion for assessing singing quality is intonation accuracy, used by Murry5 in terms of the ability to perform a pitch-matching task. It is considered one of the measures that are independent of individual characteristics and melodic features.6 Intonation accuracy is often used in combination with other features for a more reliable assessment of singing quality. Kostek and Żwan7,8 showed how to evaluate singing quality on the basis of only one sound (a vowel). The described classification of singing voice was carried out on two levels: on the basis of an assessment of the voice itself (amateur and semiprofessional) and its type (bass, baritone, tenor, alto, mezzo-soprano, and soprano). A set of parameters describing a song was used in this solution for the construction of a feature vector, in which a single vowel for each singer was evaluated by six experts. The resulting quality index calculated for 2690 recordings was used for training an artificial neural network. The obtained classification accuracy reached 84–90%, depending on the features used. Jha and Rao9 presented a similar approach, which uses an analysis of vowels to assess the quality of singing voice. In this solution, two signal characteristics were used, namely the envelope of the spectrum and the pitch. They were classified by means of Gaussian mixture models and linear regression. The accuracy reached 76–89%. Hariharan et al10 used neural networks operating on the speech signal in the time domain, with over 98% accuracy of classification in the case of normal and pathologic conditions. The classification of speech signals in a reduced feature space obtained by principal component analysis has also been presented.11–13 Another approach to classification, using linear discriminant analysis (LDA), was presented in the study by Lee et al.14 The tests were performed on samples of normal and pathologic voice and showed 83% accuracy of the proposed method.
As has been shown in the scientific literature from the field of signal processing, most of the methods use physical characteristics of sound through low-level feature vectors consisting of a set of coefficients calculated in the time domain, frequency, or cepstrum. Their recognition and classification are based primarily on a one-dimensional approach to data. The resulting effectiveness varies and depends on the initial conditions and the nature of the problem. In the case of singing quality classification, the presented methods achieve an accuracy of about 90%.
In contrast, this article focuses on a two-dimensional approach to the singing signal, presented in the form of a matrix. Because a sound signal is a one-dimensional function (e.g., amplitude), it is not easy to capture all its variability over time. When we add another dimension to this representation, we can depict sound as a matrix and store many more sound features in one compact structure. The typical representation of any two-dimensional matrix in computer science is an image; therefore, it seems natural to use algorithms from the field of digital image processing and recognition for signal processing. In summary, the proposed approach is based on the observation that the visual, two-dimensional representation of the singing voice signal carries much more information than the standard and limited vector representation. By using a selected set of image processing methods, we expect to obtain higher accuracy of singer’s voice quality estimation than with established methods.

Initial assumptions
Numerical analysis of the human voice must take into account its time-frequency structure because it depends on the phenomena encountered in the production of the acoustic signal. Details of the anatomy of the vocal tract (geometric dimensions, acoustic impedance of tissues) are different for each person, and each difference is reflected in the parameters of the produced acoustic voice. On the other hand, it can be assumed that certain unique features are responsible for the quality of the produced voice and indirectly for the level of the singer’s training. These features may also be common to larger groups of singers. Assumptions formulated in this way make it possible to build a hierarchical database in which a single class consists of voice samples grouped by a similar degree of training. The previously described features may be represented in a graphical form, which can capture their variability in a more adequate way than strictly numerical, one-dimensional parameters.

Processing outline
An image of voice spectrum (so-called spectrogram) is a well-known method of sound representation. When used as a result
of short-time Fourier transform (STFT),15 it captures many important low-level characteristics of the audio signal. However, to be useful for in-depth analysis, it should be represented using a rather large data matrix. Digital processing of such large matrices, in terms of classification or recognition, causes several practical problems, namely large processing power overhead and storage space requirements; hence, it is necessary to reduce their dimensionality.
The developed algorithm for the classification of samples containing singing voice is shown in Figure 1. It consists of three main components: data preparation (preprocessing), reduction of dimensionality (projection of the high-dimensional input spectrogram into a lower-dimensional subspace), and classification. At the stage of developing a reference database, containing image patterns (spectrograms) in a reduced form, we use a method applied, among others, in recognizing facial images, namely two-dimensional LDA (2DLDA),16 whereas at the recognition stage of test samples we use the two-dimensional Karhunen-Loeve transform (2DKLT).16

FIGURE 1. General scheme of singing voice processing and classification.

To obtain the previously mentioned reduction effect, together with clustering improvement (in terms of increasing interclass scatter while decreasing intraclass scatter), we propose to use a dimensionality reduction stage. An analysis of prior research shows that LDA may be successfully applied in this case.17 Because the one-dimensional variant of LDA requires much more memory space to store covariance matrices and also does not cope with the small-sample-size problem,18 we propose to use 2DLDA. It was successfully applied in our previous works, especially for face, stamp, and texture recognition.16,19,20
At the stage of classification, we propose to use a Support Vector Machine (SVM), as it is very popular because of its excellent generalization performance (especially in medical analysis21). SVM, introduced by Vapnik,22 uses structural risk minimization, whereby a bound on the risk is minimized by maximizing the margin between the separating hyperplane and the data point closest to the hyperplane. In general, SVM uses a hyperplane or a set of hyperplanes in a high-dimensional feature space to classify lower-dimensional data. Although many hyperplanes may classify the input data, the hyperplane we are looking for is the one that represents the largest separation (margin) between two classes (associated with the lowest possible generalization error). To make the separation easier, the original finite-dimensional space is mapped into a much higher dimensional space. We choose a maximum-margin hyperplane in which the distance to the nearest data point on each side is maximized. After Boser et al,23 we used nonlinear classifiers by applying the kernel trick to maximum-margin hyperplanes. In such a case, every dot product is replaced by a nonlinear kernel function, which allows the algorithm to fit the maximum-margin
FIGURE 2. Exemplary result of audio sample conversion (left) into spectrogram form (middle) and logarithmic spectrogram (right).
FIGURE 3. Exemplary results of spectrogram processing (original, left): spectrogram after intensity equalization in columns (middle) and after
resampling and absolute value calculation (right).

hyperplane in a transformed feature space. Thus, in this research, four different kernels have been investigated (linear, polynomial, Gaussian radial basis function [RBF], and sigmoid) to find the most adequate one. The classification uses a modified Support Vector Machine approach, namely ν-SVC (nu-Support Vector Classification). The parameter ν makes it easier to tune the classifier, as it controls the number of margin errors and the number of support vectors.24
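The four kernel families named above can be illustrated with a minimal pure-Python sketch. It assumes that each reduced spectrogram has been flattened into a plain list of numbers. The function names, the toy vectors, and the default values of `gamma`, `t`, and `d` are illustrative choices, not settings reported in this article.

```python
import math


def dot(u, v):
    """Inner product of two equally long feature vectors."""
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(u, v):
    return dot(u, v)


def polynomial_kernel(u, v, gamma=1.0, t=0.0, d=2):
    return (gamma * dot(u, v) + t) ** d


def rbf_kernel(u, v, gamma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)


def sigmoid_kernel(u, v, gamma=1.0, t=0.0):
    return math.tanh(gamma * dot(u, v) + t)


# Two toy feature vectors (hypothetical values, for illustration only).
y_r = [1.0, 2.0, 0.5]
y_q = [0.5, 1.0, 2.0]
for kernel in (linear_kernel, polynomial_kernel, rbf_kernel, sigmoid_kernel):
    print(kernel.__name__, kernel(y_r, y_q))
```

Each function stands in for the dot product inside the classifier; only the RBF kernel always returns 1 when a vector is compared with itself.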

FIGURE 4. Scheme of spectrograms analysis: dimensionality reduction and classification.
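The tuning role of the parameter nu (ν) can be made concrete with a standard property of the ν-SVM formulation, stated here for reference rather than as a result of this article. The symbol $n$ (the number of training samples in one binary problem) is introduced only for this note.

$$\frac{\#\{\text{margin errors}\}}{n} \;\le\; \nu \;\le\; \frac{\#\{\text{support vectors}\}}{n}, \qquad \nu \in (0, 1].$$

In other words, ν is an upper bound on the fraction of margin errors and a lower bound on the fraction of support vectors, so a single number in $(0, 1]$ trades off the two.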

TABLE 1.
Singers Characteristics Versus Their Skills

Person Code  Sex     Skills Level  Voice Type  Experience in Years
s01f         Female  Beginner      Soprano     5
s02m         Male    Intermediate  Bass        16
s03f         Female  Intermediate  Soprano     15
s04m         Male    Intermediate  Bass        3
s05m         Male    Beginner      Bass        1
s06f         Female  Intermediate  Soprano     5
s07m         Male    Advanced      Bass        18
s08f         Female  Intermediate  Alto        7
s09f         Female  Intermediate  Alto        3
s10m         Male    Intermediate  Bass        6
s11f         Female  Advanced      Soprano     3
s12m         Male    Advanced      Bass        3
s13f         Female  Advanced      Soprano     15
s14m         Male    Beginner      Bass        2
s15f         Female  Beginner      Alto        2
s16m         Male    Beginner      Bass        2
s17f         Female  Advanced      Soprano     4
s18m         Male    Beginner      Tenor       6
s19f         Female  Intermediate  Alto        2
s20f         Female  Beginner      Soprano     1

Data preparation
The study assumes that the singer’s training level can be assessed by an analysis of a spectrogram, which is a visual representation of the spectrum of frequencies in a sound as they vary with time. Spectrograms are sometimes called spectral waterfalls, voiceprints, or voicegrams. Spectrograms are often used to identify spoken words phonetically. They are used extensively in the fields of music, sonar, radar, speech processing, and so forth. A spectrogram can be created using STFT,15 according to the general formula15:

$$X(m, \omega) = \sum_{n=1}^{N} x[n]\, w[n - m]\, e^{-j\omega n}, \tag{1}$$

where $x$ denotes consecutive samples of the pulse code modulation input signal, $w$ is the window function, $X$ is the spectrogram matrix, $N$ is the number of samples, and $m \times \omega$ is the size of the spectrogram.
In the algorithm described in this work, we do not use windowing (so the function $w$ is a constant function) and do not calculate the power spectrum. Instead, it was decided to use the logarithm operation on the resulting spectrogram, which is presented in Figure 2. The initial size of the spectrogram is equal to 512 lines (spectral coefficients in a single window) by 2048 columns. Later, in the preprocessing (effects shown in Figure 3), we perform a leveling of the brightness in each column of the spectrogram. Finally, we scale the spectrogram matrix to a specific size (using bilinear interpolation).

Dimensionality reduction
As previously mentioned, the dimensions of the input spectrogram are high, which causes obvious problems with efficient processing. Therefore, we perform a dimensionality reduction to obtain matrices with a significantly lower number of values. A schematic diagram of such reduction is shown in Figure 4. According to the assumed methodology, all the audio samples (spectrograms) from the learning part of the database are subject to analysis (2D analysis), which leads to forming the transformation matrices. They are later used during the construction phase of the reference database to project input spectrograms (2D projection). Spectra calculated for audio samples are grouped according to $G$ classes corresponding to the level of the singer’s advancement. It is assumed that each input image spectrogram $X^{(g,l)}$ is represented in shades of gray as a matrix of dimensions $M \times N$ elements, where $g$ is the number of the class corresponding to the established training level and $l$ is the spectrogram number in each class ($l = 1, \ldots, L$).
In the first step, an average matrix of all the matrices $X$ is calculated:

$$\bar{X}_{M \times N} = \frac{1}{GL} \sum_{g=1}^{G} \sum_{l=1}^{L} X^{(g,l)}_{M \times N} \tag{2}$$

and a mean matrix in each class $g = 1, \ldots, G$:

$$\bar{X}^{(g)}_{M \times N} = \frac{1}{L} \sum_{l=1}^{L} X^{(g,l)}_{M \times N}. \tag{3}$$

Because of the large size of input images (in the experimental studies, up to $2048 \times 2048$ pixels), it is not possible to apply the one-dimensional LDA method directly; therefore, it was decided to use 2DLDA. In this method, the following covariance matrices are calculated, each one for the row and column representation of the spectrogram, respectively16:

TABLE 2.
Benchmark Database Characteristics

              Set 1 (G = 6)                               Set 2 (G = 3)
              Male                  Female                Total
Advancement   No. Pers.  No. Smpl.  No. Pers.  No. Smpl.  No. Pers.  No. Smpl.
Beginner      4          81         3          76         7          157
Intermediate  3          62         5          119        8          181
Advanced      2          46         3          75         5          121
Total         9          189        11         270        20         459
FIGURE 5. Averaged spectrograms of vocal exercise; male singers and three advancement levels (beginner, intermediate, and advanced).

$$W^{(Row)}_{M \times M} = \sum_{g=1}^{G} \sum_{l=1}^{L} \left[ X^{(g,l)}_{M \times N} - \bar{X}^{(g)}_{M \times N} \right] \left[ X^{(g,l)}_{M \times N} - \bar{X}^{(g)}_{M \times N} \right]^{T}. \tag{4}$$

$$B^{(Row)}_{M \times M} = \sum_{g=1}^{G} \left[ \bar{X}^{(g)}_{M \times N} - \bar{X}_{M \times N} \right] \left[ \bar{X}^{(g)}_{M \times N} - \bar{X}_{M \times N} \right]^{T}. \tag{5}$$

$$W^{(Col)}_{N \times N} = \sum_{g=1}^{G} \sum_{l=1}^{L} \left[ X^{(g,l)}_{M \times N} - \bar{X}^{(g)}_{M \times N} \right]^{T} \left[ X^{(g,l)}_{M \times N} - \bar{X}^{(g)}_{M \times N} \right]. \tag{6}$$

$$B^{(Col)}_{N \times N} = \sum_{g=1}^{G} \left[ \bar{X}^{(g)}_{M \times N} - \bar{X}_{M \times N} \right]^{T} \left[ \bar{X}^{(g)}_{M \times N} - \bar{X}_{M \times N} \right]. \tag{7}$$

Then, we calculate the corresponding matrices $H$, determining the total distribution of the classes in the feature space:

$$H^{(Row)}_{M \times M} = \left( W^{(Row)}_{M \times M} \right)^{-1} B^{(Row)}_{M \times M}, \tag{8}$$

$$H^{(Col)}_{N \times N} = \left( W^{(Col)}_{N \times N} \right)^{-1} B^{(Col)}_{N \times N}. \tag{9}$$

They are used to maximize the so-called Fisher criterion, the aim of which is to increase the between-class scatter in relation to the intraclass scatter.17,25 This gives an improvement of clustering and significantly increases the effectiveness of later classification.20
For the matrices $H^{(Row)}$ and $H^{(Col)}$, we solve the task of searching for the eigenvalues $\{\Lambda^{(Row)}, \Lambda^{(Col)}\}$ (diagonal matrices of size $M \times M$ and $N \times N$, respectively) and the eigenvector matrices $\{V^{(Row)}, V^{(Col)}\}$ (orthogonal matrices of dimensions $M \times M$ and $N \times N$, respectively) satisfying the condition20:

$$H^{(Row)}_{M \times M} V^{(Row)}_{M \times M} = V^{(Row)}_{M \times M} \Lambda^{(Row)}_{M \times M}, \tag{10}$$

$$H^{(Col)}_{N \times N} V^{(Col)}_{N \times N} = V^{(Col)}_{N \times N} \Lambda^{(Col)}_{N \times N}. \tag{11}$$

In the next step, from the diagonals of $\Lambda^{(Row)}$ and $\Lambda^{(Col)}$, $s$ and $p$ maximal values are selected, respectively, and their positions are recorded. From $(V^{(Row)})^{T}$, $s$ rows corresponding to the selected elements are extracted, and from $V^{(Col)}$, $p$ columns in the same way.16 Then, the two transformation matrices are constructed: $F^{(Row)}$ containing $s \times M$ elements and $F^{(Col)}$ having $N \times p$ elements. Projection of the $l$-th spectrogram from the $g$-th class $X^{(g,l)}$ into a subspace is performed by matrix multiplication:

$$Y^{(g,l)}_{s \times p} = F^{(Row)}_{s \times M} \left( X^{(g,l)}_{M \times N} - \bar{X}_{M \times N} \right) F^{(Col)}_{N \times p}. \tag{12}$$

For the optimal degree of dimensionality reduction, it was assumed that the information contained in spectrograms, arranged vertically (in columns of each image), is more important than the horizontal information (stored in rows of the image); therefore, we used the relation $s \ge p$.
For the purpose of research, we used our own implementations of STFT and 2DLDA in the MATLAB (The MathWorks, Inc.) environment.
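The data preparation steps (rectangular-window STFT, logarithm, per-column brightness leveling) can be sketched in a few lines of pure Python. This is not the author's MATLAB implementation: the frame length, the hop size, the small constant `eps`, and the use of min-max scaling as the column leveling are all assumptions made for illustration, and the final bilinear rescaling is omitted.

```python
import cmath
import math


def log_spectrogram(x, frame_len, hop, eps=1e-12):
    """Log-magnitude STFT with a rectangular (constant) window.

    Returns a matrix with frame_len // 2 rows (frequency bins)
    and one column per analysis frame.
    """
    n_bins = frame_len // 2
    columns = []
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = x[start:start + frame_len]
        column = []
        for k in range(n_bins):
            # Naive DFT of one frame at frequency bin k.
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            column.append(math.log(abs(acc) + eps))
        columns.append(column)
    # Transpose so that rows are frequencies and columns are time frames.
    return [[col[k] for col in columns] for k in range(n_bins)]


def equalize_columns(spec):
    """Min-max scale every column to [0, 1] (an assumed form of leveling)."""
    n_rows, n_cols = len(spec), len(spec[0])
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for c in range(n_cols):
        col = [spec[r][c] for r in range(n_rows)]
        low, high = min(col), max(col)
        span = (high - low) or 1.0  # avoid division by zero for flat columns
        for r in range(n_rows):
            out[r][c] = (spec[r][c] - low) / span
    return out


# Toy signal: a sinusoid that falls exactly on DFT bin 4 of a 32-sample frame.
signal = [math.sin(2 * math.pi * 4 * n / 32) for n in range(128)]
spec = equalize_columns(log_spectrogram(signal, frame_len=32, hop=32))
peak_rows = [max(range(len(spec)), key=lambda r: spec[r][c])
             for c in range(len(spec[0]))]
print(len(spec), len(spec[0]), peak_rows)
```

For the toy sinusoid the brightest row of every column is the bin that carries the tone, which is the kind of structure the later image-based stages rely on.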

FIGURE 6. Averaged spectrograms of vocal exercise; female singers and three advancement levels (beginner, intermediate, and advanced).
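The row-direction scatter matrices of Eqs. (4) and (5) and the two-sided projection of a spectrogram into the reduced subspace can be illustrated on tiny matrices. The sketch below is pure Python and deliberately skips the eigen-decomposition step, which would normally come from a linear algebra library. The helper names, the toy data, and the hand-picked selector matrices `f_row` and `f_col` are hypothetical, and equal class sizes are assumed, as in Eq. (2).

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def subtract(a, b):
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def mean_matrix(mats):
    total = mats[0]
    for m in mats[1:]:
        total = add(total, m)
    return [[v / len(mats) for v in row] for row in total]


def row_scatter(classes):
    """Within-class (W) and between-class (B) row scatter, both M x M.

    `classes` is a list of G lists, each holding the M x N matrices of one class.
    """
    global_mean = mean_matrix([x for cls in classes for x in cls])
    m = len(global_mean)
    w = [[0.0] * m for _ in range(m)]
    b = [[0.0] * m for _ in range(m)]
    for cls in classes:
        class_mean = mean_matrix(cls)
        for x in cls:
            d = subtract(x, class_mean)
            w = add(w, matmul(d, transpose(d)))
        d = subtract(class_mean, global_mean)
        b = add(b, matmul(d, transpose(d)))
    return w, b, global_mean


def project(f_row, x, global_mean, f_col):
    """Y = F_row (X - mean) F_col, giving an s x p feature matrix."""
    return matmul(matmul(f_row, subtract(x, global_mean)), f_col)


# Two toy classes with two 2 x 3 "spectrograms" each (hypothetical data).
class_a = [[[1, 0, 0], [0, 1, 0]], [[3, 0, 0], [0, 1, 0]]]
class_b = [[[0, 0, 1], [0, 2, 0]], [[0, 0, 3], [0, 2, 0]]]
w_row, b_row, x_mean = row_scatter([class_a, class_b])
f_row = [[1, 0]]                  # hypothetical s x M selector, s = 1
f_col = [[1, 0], [0, 0], [0, 1]]  # hypothetical N x p selector, p = 2
y = project(f_row, class_a[0], x_mean, f_col)
print(w_row, b_row, y)
```

The projected matrix has $s \times p$ entries regardless of the input size, which is how a large spectrogram can be reduced to a small, fixed number of features.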
TABLE 3.
Classification Performance for Different Classifiers and 2DLDA-Reduced Spectrograms

              G = 3          G = 6
Classifier    TP     FP      TP     FP
3NN           0.58   0.218   0.725  0.064
SimpleCart    0.619  0.204   0.632  0.097
RandomForest  0.715  0.197   0.728  0.072
MLP           0.915  0.044   0.902  0.027
Naive Bayes   0.928  0.04    0.932  0.02
ν-SVC         0.935  0.038   0.935  0.021
Abbreviations: TP, true positive; FP, false positive; MLP, multilayer perceptron.

Classification
As mentioned previously, vocal pattern classification (in the form of a reduced spectrogram) is based on operations in the reduced feature space. The whole process of feature space creation and further classification is presented in Figure 4. The reference database contains reduced spectrograms calculated for all learning voice samples. The class assignment of an input feature matrix (reduced spectrogram) $Y$ is done using a Support Vector Machine-based approach, namely ν-SVC. It can incorporate different basic kernels26 and makes it possible to intuitively tune the number of support vectors.27 Most approaches for multiclass SVM decompose the data set into several binary problems.24 For example, the one-against-one approach trains a binary SVM for any two classes of data and obtains a decision function. Because of that, to solve the problem of multiclass classification, we perform several two-class classifications. In each single classification case, we divide the sample set into two groups: a group of the analyzed class and a group containing all samples that belong to the remaining classes. Given training samples $Y^{(r)} \in \mathbb{R}^{sp}$, $r = 1, \ldots, GL$ in the two-class case and the corresponding class labels $z_r \in \{-1, 1\}$, the statement of classifier optimization for classification problems may be the following24,28,29:

$$\min_{w,\, b,\, \xi,\, \rho} \; \frac{1}{2} w^{T} w - \nu\rho + \frac{1}{N} \sum_{r=1}^{GL} \xi_r, \tag{13}$$

with constraints24,28,29:

$$z_r \left( w^{T} \varphi\!\left( Y^{(r)} \right) + b \right) \ge \rho - \xi_r, \quad \xi_r \ge 0, \quad r = 1, \ldots, GL, \quad \text{and} \quad \rho \ge 0. \tag{14}$$

Function $\varphi$ transforms training vectors $Y^{(r)}$ into a higher dimensional space. We denote $K(Y^{(r)}, Y^{(q)}) \equiv \varphi(Y^{(r)})^{T} \varphi(Y^{(q)})$ as a kernel.24,28,29 The choice of a particular kernel type is motivated by a specific application. It is often shown in the literature that nonlinear kernels perform better in the case of feature spaces that cannot be linearly separated.30 Hence, we investigated linear, polynomial, RBF, and sigmoid kernels.
The simplest kernel is a linear one. It works well mostly in linear classification problems. It is defined by a multiplication24:

$$K\!\left( Y^{(r)}, Y^{(q)} \right) = \left( Y^{(r)} \right)^{T} Y^{(q)}. \tag{15}$$

The polynomial kernel of $d$-th degree, for vectors that are linearly dependent on $d$ dimensions, is given as24:

$$K\!\left( Y^{(r)}, Y^{(q)} \right) = \left( \gamma \left( Y^{(r)} \right)^{T} Y^{(q)} + t \right)^{d}, \quad \gamma > 0. \tag{16}$$

In the research presented here, $d = \{2, 3\}$ was investigated as a compromise between complexity and expected classification accuracy.
The RBF kernel, given as24:

$$K\!\left( Y^{(r)}, Y^{(q)} \right) = \exp\!\left( -\gamma \left\| Y^{(r)} - Y^{(q)} \right\|^{2} \right), \quad \gamma > 0, \tag{17}$$

is best suited to data that have a class-conditional probability distribution function approaching the Gaussian distribution.
The last investigated kernel is a sigmoid kernel24:

$$K\!\left( Y^{(r)}, Y^{(q)} \right) = \tanh\!\left( \gamma \left( Y^{(r)} \right)^{T} Y^{(q)} + t \right). \tag{18}$$

The whole process of spectrogram reduction and classification is presented in Figure 4. For the purpose of this research, a library for support vector machines (LIBSVM)31 is used as an implementation of ν-SVC with the previously described kernels.

Participants
In the study, a database of singing samples recorded by 20 male and female singers from The Jan Szyrocki Academic Choir of West Pomeranian University of Technology in Szczecin (Poland) has been used.32 The recorded singers represent different levels of advancement. It should be noted that all of them have passed an initial period of singing training. Most of them represent the intermediate group with several years of singing experience. Several people have musical education, yet not in the field of singing. Only one person has formal vocal education.

TABLE 4.
Confusion Matrix for ν-SVC and G = 3

Actual        Beginner  Intermediate  Advanced  Accuracy
Beginner      145       11            1         0.924
Intermediate  6         174           1         0.961
Advanced      5         6             110       0.909
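The accuracy column of the G = 3 confusion matrix above, and the overall true positive and false positive rates, can be recomputed from the raw counts. The sketch below assumes that rows are actual classes and columns are assigned classes, and that the overall rates are class-size-weighted averages; the weighting scheme is an assumption, since the article does not state how the rates were averaged.

```python
# Confusion matrix for G = 3 (rows: actual class, columns: assigned class).
labels = ["Beginner", "Intermediate", "Advanced"]
confusion = [
    [145, 11, 1],
    [6, 174, 1],
    [5, 6, 110],
]

n_classes = len(labels)
total = sum(sum(row) for row in confusion)
row_sums = [sum(row) for row in confusion]
col_sums = [sum(confusion[r][c] for r in range(n_classes))
            for c in range(n_classes)]

# Per-class accuracy (true positive rate): diagonal entry over the row sum.
tp_rate = [confusion[i][i] / row_sums[i] for i in range(n_classes)]
# Per-class false positive rate: samples wrongly assigned to the class,
# divided by the number of samples that belong to the other classes.
fp_rate = [(col_sums[i] - confusion[i][i]) / (total - row_sums[i])
           for i in range(n_classes)]

# Class-size-weighted averages (an assumed averaging scheme).
weighted_tp = sum(r * n for r, n in zip(tp_rate, row_sums)) / total
weighted_fp = sum(r * n for r, n in zip(fp_rate, row_sums)) / total

for name, tp in zip(labels, tp_rate):
    print(f"{name}: {tp:.3f}")
print(f"weighted TP rate: {weighted_tp:.3f}, weighted FP rate: {weighted_fp:.3f}")
```

Under these assumptions the sketch reproduces the per-class accuracies of the table (0.924, 0.961, 0.909) and gives weighted rates of 0.935 and 0.038, which agree with the values reported for the SVC classifier with G = 3.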
TABLE 5.
Confusion Matrix for ν-SVC and G = 6

                      Beginner,  Beginner,  Intermediate,  Intermediate,  Advanced,  Advanced,
Actual                Male       Female     Male           Female         Male       Female     Accuracy
Beginner, male        70         0          0              10             0          1          0.864
Beginner, female      0          68         0              8              0          0          0.895
Intermediate, male    0          0          61             1              0          0          0.984
Intermediate, female  3          1          0              114            0          1          0.966
Advanced, male        0          0          0              1              45         0          0.978
Advanced, female      1          0          0              3              0          71         0.947

The recording of singing samples was performed in a studio that was acoustically modified to eliminate voice reflection and reverberation. The audio hardware included a capacitor microphone AT4050 (Audio-Technica U.S., Inc) connected to an audio interface MOTU 896mk3 (Mark of the Unicorn, Inc., Cambridge, U.S.A.) and a personal computer equipped with Sony Sound Forge 10 (Sony Creative Software, Inc.) and Samplitude Pro (MAGIX Software GmbH). All the sound material was recorded in 24-bit resolution, with a sampling frequency of 96 kHz. Because each voice production session was recorded as a whole for a particular singer, the division into elementary parts, called sequences or phrases, was done with a supervised automatic segmentation tool created in the MATLAB environment.32
The whole database contains five vocal exercises (E01–E05) selected from a set usually used during vocal training. For the research presented here, exercise E01 was used. It consists of singing the vowels “a-e-i-o-u” at one pitch. The sequence lasts 3–4 seconds. Exercise E01 was chosen because it, on the one hand, is sung at one pitch and, on the other hand, introduces an obstacle for the singer in the form of changing vowels. This situation is similar to the real conditions of singing. It is assumed that a person with higher vocal skills will keep the singing parameters more stable. The exercises were sung on successive pitches to the extent consistent with the person’s capabilities.
Table 1 presents the characteristics of the analyzed group of singers, including sex, level of advancement, voice type, and the number of years in the choir. The advancement was assessed subjectively by an expert (instructor of voice production).32

Experimental setup
Experiments were performed using a ground truth proposed by an expert, who divided voice samples into specific groups of advancement. Table 2 presents the detailed information about these groups (a total of 459 sound samples). Because of the characteristics of the data, we adopted a 10-fold cross-validation approach to validate the performance of the recognition algorithm. The 10-fold cross-validation method uses 9/10 of the samples as the training set and 1/10 as the test set in each trial. The results are the averaged values for 10 single experiments (trials). To demonstrate the complexity of the problem, Figures 5 and 6 show the averaged spectra for each of the classes (beginner, intermediate, and advanced, among males and females). It should be noted that their distinguishability in the original attribute space (spectrogram images) is very limited.
The reduction parameters at the stage of 2DLDA/2DKLT are s = p = 10; hence, we classified objects using 100 features, which is fewer than in other known methods.7
In the experiments, the influence of the kernel function and the parameters of ν-SVC on the effectiveness of the classification was investigated.
Additionally, to evaluate the performance of the proposed classification algorithm, we investigated certain other classifiers on the same data set and extracted features.

RESULTS AND DISCUSSION
The results of experiments are presented in Table 3. The provided true positive and false positive rates show that the selected classifier (ν-SVC) gives the highest accuracy among the tested classifiers on our benchmark database. The closest results were achieved using the multilayer perceptron and naive Bayes classifiers; however, the first one is much more complex in terms of learning, whereas the second one depends on the independence of variables (which cannot be guaranteed for such data as spectrograms). The most accurate results were obtained for RBF kernel of 3rd degree and values ν = 0.7 for G = 3 and ν = 0.505 for G = 6.

TABLE 6.
Comparison With Other Methods

Algorithm                    TP Rate  FP Rate  Remark
STFT + 2DLDA + ν-SVC [Ours]  0.935    0.021
MPEG7 + ANN8                 0.84     0.16     531 Samples
MPEG7 + Rough sets8          0.87     0.19     As previously mentioned
MFCC + GMM9                  0.845    n/a      Single vowels
MFCC + LR9                   0.818    n/a      As previously mentioned
STFT + SVM4                  0.833    n/a      Two classes
Abbreviations: TP, true positive; FP, false positive; GMM, Gaussian mixture model; n/a, not applicable; LR, linear regression; MPEG, Moving Picture Experts Group; ANN, Artificial Neural Network; MFCC, Mel-frequency cepstrum coefficients; SVM, Support vector machine.
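The 10-fold cross-validation protocol described in the experimental setup can be sketched as an index split over the 459 samples. The article does not say whether the folds were shuffled or stratified by class, so the random, non-stratified split and the `seed` value below are assumptions made for illustration.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


sizes = [(len(train), len(test)) for train, test in k_fold_indices(459)]
print(sizes)
```

With 459 samples, nine trials use 46 test samples and one uses 45, and every sample is tested exactly once; the reported rates would then be averages over the ten trials.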
Another observation is the generally higher accuracy for G = 6 classes, which means that vocal features represented by the STFT spectrogram depend to some extent on the sex of the vocalist.

Confusion matrices for both cases of data set labeling are provided in Tables 4 and 5. Besides the usual counts of actual versus predicted samples, the accuracy for each class is given. A closer look at these tables reveals that the intermediate advancement level is the class with the highest classification rate. This class is probably well clustered by LDA, whereas the other two classes have more outliers. Another explanation could be a nonideal assignment by the vocal expert hired by the authors.32

Finally, we compared our method with other state-of-the-art methods aimed at singer advancement evaluation. The results are provided in Table 6. As can be seen, the presented selection of features, dimensionality reduction method, and classifier results in the highest accuracy among the compared methods (TP rate of 0.935). Moreover, the superiority of the method presented in this article comes from the fact that it deals with three classes of voice advancement (compared with two classes in another method also using SVM4) and with more complex (hence more valuable) vocal samples (in comparison with single vowels as in the study by Hariharan et al10).

In the article, a novel algorithm for automatic evaluation of the quality of singing voice was presented. The original idea was to apply image-based techniques to process audio samples. The main concepts of the developed approach involve selected methods from the field of digital image processing and recognition. In this algorithm, a visual representation of an audio sample is created using STFT and then normalized using certain image processing techniques. Then, the results are projected to the reduced feature space calculated by means of 2DLDA. Classification in the resulting subspace is done using the ν-SVC method. The experiments conducted on the benchmark database showed the high accuracy of the developed method. Hence, it is possible to predict a singer's level of advancement with high confidence by only recording and processing a short vocal exercise.

In comparison with other, similar methods, the presented approach features higher recognition accuracy and is independent of the singer's sex, making this algorithm highly universal. It can be applied to various practical problems, for example, singers' education and medical use, where an automatic and objective evaluation of vocal advancement is required. It is also possible to apply the developed method to retrieve vocal samples of an expected advancement level from large multimedia databases without actual acoustic verification of each instance.

REFERENCES
1. Larrouy-Maestri P, Lévêque Y, Schön D, Giovanni A, Morsomme D. The evaluation of singing voice accuracy: a comparison between subjective and objective methods. J Voice. 2013;27:259.e1–259.e5.
2. Mendes AP, Rothman HB, Sapienza C, Brown WS Jr. Effects of vocal training on the acoustic parameters of the singing voice. J Voice. 2003;17:529–543.
3. Omori K, Kacker A, Carroll LM, Riley WD, Blaugrund SM. Singing power ratio: quantitative evaluation of singing voice quality. J Voice. 1996;10:
4. Nakano T, Goto M, Hiraga Y. An automatic singing skill evaluation method for unknown melodies using pitch interval accuracy and vibrato features. Interspeech-ICSLP. 2006;1707–1709.
5. Murry T. Pitch-matching accuracy in singers and nonsingers. J Voice. 1990;4:317–321.
6. Brown W, Rothman H, Sapienza C. Perceptual and acoustic study of professional trained versus untrained voices. J Voice. 2000;3:301–309.
7. Żwan P, Kostek B. System for automatic singing voice recognition. JAES. 2008;56:710–772.
8. Kostek B, Żwan P. Automatic classification of singing voice quality, 5th International Conference on Intelligent Systems Design and Applications. ISDA ’05. 2005: 444–449.
9. Jha MV, Rao P. Assessing vowel quality for singing evaluation, proc. of the National Conference on Communications (NCC). Kharagpur, India. 2012: 1–5.
10. Hariharan M, Paulraj MP, Sazali Y. Time-domain features and probabilistic neural network for the detection of vocal fold pathology. Malaysian J Computer Sci. 2010;23:6–67.
11. Alvarez M, Henao R, Castellanos G, Godino JI, Orozco A. Kernel Principal Component analysis through time for voice disorder classification. Conf Proc IEEE Eng Med Biol Soc. 2006;1:5511–5514.
12. Lee C-C, Katsamanis A, Black M, Georgiou P, Narayanan S. Affective state recognition in married couples’ interactions using PCA-based vocal entrainment measures with multiple instance learning. In proc. of International Conference on Affective Computing and Intelligent Interactions (ACII). 2011: 31–41.
13. Muñoz-Mulas C, Martínez-Olalla R, Gomez-Vilda P, Lang EW, Alvarez-Marquina A, Mazaira-Fernandez LM, Nieto-Lluis V. KPCA vs. PCA study for an age classification of speakers. In: Travieso-González CM, Alonso-Hernández JB, eds. Proceedings of the 5th International Conference on Advances in Nonlinear Speech Processing (NOLISP’11). Berlin-Heidelberg: Springer-Verlag; 2011:190–198.
14. Lee J-Y, Jeong S, Hahn M. Classification of pathological and normal voice based on linear discriminant analysis. Lecture Notes in Computer Science. Adaptive Nat Comput Algorithms. 2007;4432:382–390.
15. Allen JB. Short time spectral analysis, synthesis, and modification by discrete Fourier transform. IEEE Trans Acoust Speech, Signal Process ASSP-25. 1977;3:235–238.
16. Kukharev G, Forczmański P. Face Recognition by Means of Two-Dimensional Direct Linear Discriminant Analysis. Proceedings of the 8th International Conference PRIP 2005 Pattern Recognition and Information Processing. Minsk: Republic of Belarus; 2005:280–283.
17. Fukunaga K. Introduction to Statistical Pattern Recognition. 2nd ed. New York: Academic Press; 1990.
18. Raudys SJ, Jain AK. Small sample size effects in statistical pattern recognition: recommendations for practitioners. Pattern Anal Machine Intelligence, IEEE Trans. 1991;13:252–264.
19. Forczmański P, Frejlichowski D. Classification of elementary stamp shapes by means of reduced point distance histogram representation. Machine Learn Data Mining Pattern Recognition, Lecture Notes Computer Sci.
20. Okarma K, Forczmański P. 2DLDA-based texture recognition in the aspect of objective image quality assessment. Ann Universitatis Mariae Curie-Skłodowska. Sectio AI Informatica. 2008;8:99–110.
21. Cwiklinska-Jurkowska M. Performance of the support vector machines for medical classification problem. Biocybernetics Biomed Eng. 2009;29:
22. Vapnik VN. The Nature of Statistical Learning Theory. Springer-Verlag: New York, Inc; 1995.
23. Boser BE, Guyon IM, Vapnik VN. A training algorithm for optimal margin classifiers, Proceedings of the Fifth Annual Workshop on Computational Learning Theory COLT92. 1992: 144–152.
24. Schölkopf B, Smola AJ. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press: Cambridge, Massachusetts; 2002.
25. Swets DL, Weng J. Using discriminant eigenfeatures for image retrieval. IEEE Trans Pattern Anal Machine Intelligence. 1996;18:831–836.
26. Novakovic J, Veljovic A. C-support vector classification: selection of kernel and parameters in medical diagnosis, SISY 2011, IEEE 9th International Symposium on Intelligent Systems and Informatics, 2011: 465–470.
27. Schölkopf B, Smola AJ, Williamson RC, Bartlett PL. New support vector algorithms. Neural Comput. 2000;12:1207–1245.
28. Vapnik V, Cortes C. Support-vector networks. Machine Learn. 1995;20:273–297.
29. Vapnik V. Statistical Learning Theory. New York: Wiley; 1998.
30. Gao Y, Sun S. An empirical evaluation of linear and nonlinear kernels for text classification using Support Vector Machines. Fuzzy Syst Knowledge Discov (fskd) 2010 Seventh Int Conf. 2010;4:144–152. 10-12.
31. Chang C-C, Lin C-J. LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technology. 2011;2:1–27. Available at: http://www.csie. Accessed December 20, 2014.
32. Łazoryszczak M, Półrolniczak E. Audio database for the assessment of singing voice quality of choir members. Elektronika. 2013;3:92–96.

127.e30 Journal of Voice, Vol. 30, No. 1, 2016