A High-Precision Feature Extraction Network of Fatigue Speech From Air Traffic Controller Radiotelephony Based On Improved Deep Learning

Available online at www.sciencedirect.com
ScienceDirect
ICT Express 7 (2021) 403–413
www.elsevier.com/locate/icte
Zhiyuan Shen ∗, Yitao Wei
College of Civil Aviation, Nanjing University of Aeronautics and Astronautics, Nanjing 211100, China
Received 29 July 2020; received in revised form 6 January 2021; accepted 10 January 2021
Available online 22 January 2021
Abstract
Air traffic controller (ATC) fatigue has received considerable attention in recent studies because it represents a major cause of air traffic
incidents. Research has revealed that the presence of fatigue can be detected by analysing speech utterances. However, constructing a
complete labelled fatigue data set is very time-consuming. Moreover, a manually constructed speech collection will often contain too little
key information to be used effectively in fatigue recognition, while multilevel deep models based on such speech materials often have
overfitting problems due to an explosive increase of model parameters. To address these problems, a novel deep learning framework is
proposed in this study to integrate active learning (AL) into complex speech features selected from a large set of unlabelled speech data
in order to overcome the loss of information. A shallow feature set is first extracted using stacked sparse autoencoder networks, in which
fatigue state challenge features from a manually selected speaker set are exploited as the input vector. A densely connected convolutional
autoencoder (DCAE) is then proposed to learn advanced features automatically from spectrograms of the selected data to supplement the
fatigue features. The network can be effectively trained using a relatively small number of labelled samples with the help of AL sampling
strategies, and the addition of a dense block to the convolutional automatic encoder can decrease the number of parameters and make the
model easier to fit. Finally, the two above-mentioned features are combined using multiple kernel learning with a support-vector-machine
classifier. A series of comparative experiments using the Civil Aviation Administration of China radiotelephony corpus demonstrates that the
proposed method provides a significant improvement in the detection precision compared to current state-of-the-art approaches.
© 2021 The Korean Institute of Communications and Information Sciences (KICS). Publishing services by Elsevier B.V. This is an open access
article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
Keywords: Air traffic control; Fatigue; SSAE; Active learning; Dense block; Spectrogram
∗ Correspondence to: Room 604, Civil Aviation Building, Nanjing University of Aeronautics and Astronautics, Nanjing, 210016, China.
E-mail address: shenzy@nuaa.edu.cn (Z. Shen).
Peer review under responsibility of The Korean Institute of Communications and Information Sciences (KICS).
https://doi.org/10.1016/j.icte.2021.01.002
2405-9595/© 2021 The Korean Institute of Communications and Information Sciences (KICS). Publishing services by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

1. Introduction

The rapid development of civil aviation and continuing increases in air traffic have made a shortage of labour in the civil aviation industry a serious problem. The resulting high workloads can induce fatigue, thus increasing the probability of human error and the associated dangerous consequences to aviation safety [1]. Research has demonstrated that there is a close association between greater fatigue and higher risk. It is also commonly acknowledged that fatigue is a vaguely defined concept, since it is a complex multidimensional phenomenon with numerous contributing factors, including biological and psychological ones [2]. This situation has resulted in considerable attention being paid to the accurate detection of fatigue in air traffic controllers (ATCs) among researchers in the field of civil aviation.

Fatigue can be measured using a multitude of methods and tools, which can be grouped into two categories: objective and subjective methods [3]. Subjective self-rating scales and questionnaires have been the most important sources of data for assessing ATC fatigue [4,5]. Two renowned and validated subjective fatigue/sleepiness scales are the Karolinska Sleepiness Scale (KSS) [6] and NASA's Task Load Index [7]. Although subjective methods are easy to implement, they are barely able to detect a fatigued state either rapidly or validly. Therefore, objective methods have received a considerable amount of research interest. There are two categories of popular objective methods for detecting fatigue based on its different manifestations. The methods in one category detect physiological parameters, including heart rate, blood pressure,
breathing rate, electroencephalogram, and skin electricity [8–10]. The other methods directly record observable acts, including eye movement, blink times, yawning, and frequent nodding [11]. These objective methods are relatively accurate and can be used to determine a reliable physiological fatigue index. The main disadvantage of these techniques is their intrusiveness, which usually results in aversion and disturbance to the controller, resulting in poor availability of these monitoring methods.

The rapid developments in speech recognition have resulted in vocal-feature-based methods recently emerging as the preferred avenue for research [12]. Vocal features are convenient to collect and analyse, because the main job of an ATC is communicating with pilots via radiotelephony, and regulations dictate that all of the call records must be preserved for some time. There are many reports in the literature on analyses of the connection between vocal features and fatigue [13,14]. In 2006, Greeley et al. demonstrated that voice features show strong correlations with fatigue in certain standardized tests, such as the Sleep Onset Latency test [15]. Krajewski introduced a fatigue eigenvector composed of linear speech features such as the fundamental frequency, resonance peak and Mel-frequency cepstrum coefficient (MFCC) [16]. However, the reported average accuracy of the results was only 76.5%. With the development of deep-learning techniques, researchers can now use convolutional neural networks (CNNs), recurrent neural networks, autoencoders (AEs) and other methods for speech feature extraction and emotion classification [17–19]. For example, Prasomphan [20] used a spectrogram and CNN to classify the five emotions in the Berlin corpus with an accuracy of 83.28%. The speech spectrum has also been used as the input to a CNN, with the AlexNet network used to train and test the speech spectrum [21]. That approach achieved an accuracy of 84.3% for all of the speakers using a speech–emotion data set.

However, the results obtained in the above-mentioned studies indicate that using a single speech feature often ignores some useful feature information, which will impair the recognition rate. Therefore, methods that combine multiple features in the field of speech recognition are currently popular [22,23]. In particular, a spectrogram displays the relationships between adjacent frequency points in a speech signal, which not only shows the time–frequency characteristics of speech, but also reflects the language characteristics of typical ATC communications, and it has become a very useful tool to compensate for the limitations of manually selected features.

Traditional methods for speech fatigue classification have tended to extract single manually selected features such as the MFCC using methods such as principal-components analysis and dictionary learning. These models are thus unable to take full advantage of the information contained in speech. Moreover, it is very difficult to obtain a large amount of labelled data to construct a well-trained deep neural network (DNN) [24]. Active learning (AL) has been widely used in recognition applications involving small samples, and it can not only reduce the cost of labelling, but also filter valuable data to improve the recognition rate [25]. Since a sufficient database is not available for detecting the fatigue state based on speech analysis, it is of great significance to use AL when selecting data. These problems have motivated us to develop a novel hierarchical stacked sparse autoencoder (SSAE) and densely connected convolutional automatic encoder (DCAE) network incorporated with AL, which enables the learning of a generic and robust fatigue-feature representation.

Inspired by the observations described above, in this paper we propose a novel high-precision, speech-based fatigue-detection method. We first use AL to select valuable labelled samples from a large unlabelled data set. The shallow features are extracted from the manually selected feature set by an AE. A DCAE is then utilized to extract the advanced features from the spectrogram. These two categories of features are then combined with multiple kernel learning (MKL) to improve the accuracy of speech–emotion recognition.

The remainder of this paper is organized as follows: Section 2 briefly introduces the basic speech feature set with its spectrogram, Section 3 describes an AL architecture with an AE for extracting fatigue features, Section 4 presents the proposed unified deep-learning network, Section 5 reports on the series of experiments performed to test our new method, and conclusions are drawn in Section 6.

Table 1
List of terminology used in this paper.
Terminology    Referred to
ATC            Air traffic controller
ATCs           Air traffic controllers
AE             Autoencoder
AL             Active learning
CNN            Convolutional neural network
CAE            Convolutional autoencoder
DNN            Deep neural network
DCAE           Densely connected convolutional automatic encoder
FFT            Fast Fourier transform
KSS            Karolinska Sleepiness Scale
KL             Kullback–Leibler
LLDs           Low-level descriptors
MFCC           Mel-frequency cepstrum coefficient
MKL            Multiple kernel learning
MCLU           Multiclass-Level Uncertainty
PCA            Principal Component Analysis
SVM            Support Vector Machine
SSAE           Stacked sparse autoencoder

2. The basic speech feature set and spectrogram analysis

The increasing research on speech recognition has led to the development of some convenient open-source frameworks. The most representative ones are openEAR [26] and openSMILE [27], developed by the team of Eyben and Schuller. These tools are very convenient to use and greatly improve research efficiency, and they focus on the automatic extraction of distinguishing features in speech recognition.

2.1. Speaker state feature set

An acoustic feature set is a collection of audio descriptors that potentially carry information about affective cues
in the voice. The speech representation described below is a new paradigm for speech analysis. It contrasts with the standard paradigm for speech analysis (involving a sequence of observation vectors) in terms of duration, with a speech utterance represented by a large set of features, termed an audio feature set. The feature set is based on several energy-based low-level descriptors (LLDs) [28], which are well known in the speech-recognition field. Various statistical functions are computed for these LLDs: some aim at estimating the spatial variability (e.g. means, standard deviations and quartiles), while others estimate the temporal variability (e.g. peaks and linear-regression slopes). There are 59 types of LLD features, with 39 statistical functionals applied to produce 4368 features (Table 2). Additionally, to balance the magnitudes of different features, min–max normalization is used for data preprocessing, which can be described as follows:

$$x_{\mathrm{norm}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \tag{1}$$

where $x$ represents the values of one type of feature, and $x_{\max}$ and $x_{\min}$ represent the maximum and minimum values of that set, respectively.

2.2. Spectrogram

A spectrogram is a visual representation of the amplitudes of different frequency components over time that are present in a certain signal. It is presented as a two-dimensional graph in which time appears along the horizontal axis, frequency appears along the vertical axis, and the amplitudes of the frequency components at a particular time are indicated by the intensity or colour of that point on the graph. Low amplitudes are generally represented by dark-blue colours, and higher amplitudes (or louder sounds) are represented by brighter colours up through to red. The values are computed by applying the fast Fourier transform (FFT) to the speech signal to yield the time–frequency representation [21]. In order to discover the frequencies present at particular moments in a speech signal, it is divided into small samples, to which the FFT is applied. MFCC features can be extracted from the speech spectrum. The spectrograms of a fatigued-speech sample and a normal sample are shown in Fig. 1.

Fig. 1. Spectrograms of fatigued speech (a) and normal speech (b). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Several items of useful information can be obtained from a spectrogram, such as the change in the sound intensity over a certain period of time, the frequency distribution of the entire speech sample and the sound intensity [29]. Speech samples with different emotions will appear distinguishable in spectrograms, which allows different emotions to be classified. The aim of this study was to use spectrograms to extract more-complete speech features to compensate for the lack of manually identified features.

3. AE for extracting fatigue features

We hoped to be able to extract useful speech features from spectrograms based on the rich feature information that they contain. In the field of deep learning, a CNN is effective at extracting multidimensional features, although this is dependent on the availability of a sufficient amount of labelled data [30]. Therefore, a convolutional autoencoder (CAE) was considered for extracting the spectrogram features, which employs a type of unsupervised learning and has the same deep-feature extraction ability as a CNN. Meanwhile, inspired by a previous study [31], we added a dense block to the CAE to alleviate gradient explosion and improve the accuracy. Moreover, we also utilized an SSAE to reduce the dimensions of the manual features. The principles and structure of the relevant automatic encoders are described below.

3.1. Autoencoder

An autoencoder (AE) is a type of artificial neural network used to learn efficient data codings in an unsupervised manner [32]. An AE can find a new feature representation of the data by making the input and output as equal as possible to minimize the reconstruction error. Dimensionality reduction can be achieved by making the number of hidden-layer nodes smaller than the number of input and output nodes [33]. The important role of the automatic encoder is to reduce the feature dimension in order to compress the data and to eliminate redundant information in the features. The basic AE architecture is shown in Fig. 2.

In Fig. 2, $x_m$ represents the training samples, $r_m$ is the $m$th output reconstructed by the autoencoder network, and the training method is generally to use the gradient descent algorithm to minimize $L_{AE}(\theta)$ [34]. As in a multilayer perceptron, each node of the autoencoder applies a nonlinear activation function, and the loss takes the following form:

$$L_{AE}(\theta) = \sum_{m=1}^{M} \lVert x_m - r_m \rVert^2 \tag{2}$$

Fig. 2. The basic architecture of AE.
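To make Eq. (2) concrete, the following is a minimal pure-Python sketch rather than the authors' implementation (the paper reports using TensorFlow). It assumes a single hidden layer, a weight matrix $W$ shared between encoder and decoder, sigmoid activations as described in this section, and randomly generated toy data. The function names, layer sizes and initial values are illustrative assumptions.

```python
import math
import random


def sigmoid(v):
    """Logistic activation used for both the encoder and the decoder."""
    return 1.0 / (1.0 + math.exp(-v))


def encode(W, b1, x):
    """h = s_f(W x + b1); W is a list of rows with shape (hidden, inputs)."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(W, b1)]


def decode(W, b2, h):
    """r = s_g(W^T h + b2), reusing the transposed encoder weights."""
    n_inputs = len(W[0])
    return [sigmoid(sum(W[j][i] * h[j] for j in range(len(h))) + b2[i])
            for i in range(n_inputs)]


def reconstruction_loss(W, b1, b2, samples):
    """L_AE = sum over samples of the squared error ||x_m - r_m||^2."""
    total = 0.0
    for x in samples:
        r = decode(W, b2, encode(W, b1, x))
        total += sum((xi - ri) ** 2 for xi, ri in zip(x, r))
    return total


random.seed(0)
n_inputs, n_hidden = 6, 3  # assumed sizes; hidden layer narrower than the input
W = [[random.uniform(-0.5, 0.5) for _ in range(n_inputs)] for _ in range(n_hidden)]
b1 = [0.0] * n_hidden
b2 = [0.0] * n_inputs
samples = [[random.random() for _ in range(n_inputs)] for _ in range(4)]
loss = reconstruction_loss(W, b1, b2, samples)
```

Training would then adjust `W`, `b1` and `b2` by gradient descent to reduce `loss`; that step is omitted here to keep the sketch short.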
Table 2
Constituents of a speaker state feature set.

INTERSPEECH 2011 feature set
LLDs         59
Functions    39
Features     4368

LLDs
RMS energy                                                Spectral kurtosis
Sum of the auditory spectrum (loudness)                   Spectral slope
Sum of the RASTA-style filtered auditory spectrum         MFCC 1–10
Zero-crossing rate                                        MFCC 11 or 12
Energy in bands from 250 to 650 Hz and from 1 to 4 kHz    RASTA-style auditory spectrum bands 1–26
Spectral roll-off points of 25%, 50%, 75% and 90%         F0 (SHS based followed by Viterbi smoothing)
Spectral flux                                             Probability of voicing
Spectral entropy                                          Jitter
Spectral variance                                         Variation in jitter
Spectral skewness                                         Shimmer

where

$$r_m = g_\theta(f_\theta(x_m)) \tag{3}$$

$$f_\theta(x) = s_f(W x + b_1) \tag{4}$$

$$g_\theta(h) = s_g(W^T h + b_2) \tag{5}$$

where $s_f$ and $s_g$ are both nonlinear activation functions. The parameter set is $\theta = \{W, b_1, W^T, b_2\}$, where $W$ and $W^T$ represent the weight matrices of the encoder and decoder, respectively, and $b_1$ and $b_2$ represent the offset vectors of the encoder and decoder, respectively.

The nonlinear activation function is generally set to be a sigmoid function, which can be formulated as follows:

$$s_f, s_g = \frac{1}{1 + e^{-x}} \tag{6}$$

3.2. Stacked sparse AE

A sparse autoencoder (SAE) is a special self-encoding network that makes the hidden units activate sparsely by adding sparse penalty terms [35]. It not only allows a larger number of hidden units, but is also robust to variations in the signal-to-noise ratio and other effects.

In the design of the SAE structure, the number of hidden-layer units is usually smaller than the number of units in the input layer or the previous hidden layer. This makes it possible to compress the data dimension so that the output is a low-dimensional representation. The effect is similar to that of principal component analysis (PCA) [36], which is of great benefit to the design of subsequent classifiers.

The loss function of the SAE can be defined as [24]:

$$J_{\mathrm{sparse}}(W, b) = J(W, b) + \beta \sum_{j=1}^{s_2} KL(\rho \parallel \hat{\rho}_j) \tag{7}$$

$$J(W, b) = \sum_{i=1}^{m} \lVert x_i - r_i \rVert^2 + \lambda \left( \lVert W \rVert^2 + \lVert W^T \rVert^2 \right) \tag{8}$$

where $J(W, b)$ denotes the loss function that measures the difference between $x_i$ and $r_i$. The first term of $J(W, b)$ is the reconstruction loss using the $l_2$ norm, while the second term is a regularization term to avoid over-fitting. The parameter $\beta$ is the weight of the sparse penalty, and $KL(\rho \parallel \hat{\rho}_j)$ is the KL divergence or relative entropy, which can be written as follows:

$$KL(\rho \parallel \hat{\rho}_j) = \rho \log \frac{\rho}{\hat{\rho}_j} + (1 - \rho) \log \frac{1 - \rho}{1 - \hat{\rho}_j} \tag{9}$$

$$\hat{\rho}_j = \frac{1}{N} \sum_{n=1}^{N} z_j^{(n)} \tag{10}$$

Here, $z_j^{(n)}$ denotes the activation of the $j$th hidden neuron for the $n$th sample, so $\hat{\rho}_j$ represents the average activation of the $j$th neuron in the hidden layer. $\rho$ is a network variable that is usually set to a small value. The KL value decreases as $\hat{\rho}_j$ approaches $\rho$, and it grows as the deviation between $\hat{\rho}_j$ and $\rho$ increases.

Fig. 3 shows the construction of an SSAE with two hidden layers. 'Input' is the original input layer, 'Feature1' is the first hidden layer of the SAE, and 'Feature2' is the second hidden layer of the SAE, where Feature1 is regarded as the input to the second SAE. In short, an SSAE is a deep architecture of SAEs that stacks several hidden layers of basic SAEs together, meaning that the output of each layer is regarded as the input to the subsequent layer in the SSAE [24].

3.3. Architecture of the proposed DCAE

The convolutional autoencoder (CAE) [37] simply changes the full connection into a convolution operation between the encoding layer and the decoding layer, which is better for

Fig. 3. Construction of an SSAE with two hidden layers.
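The sparsity penalty of Eqs. (7), (9) and (10) can be illustrated with a short pure-Python sketch. It is only a sketch under stated assumptions: the target sparsity $\rho = 0.05$ and the toy activation values are invented for illustration, $\beta = 0.05$ follows the sparsity penalty weight reported in Section 5.1.2, and the function names are hypothetical rather than taken from the authors' code.

```python
import math


def kl_divergence(rho, rho_hat):
    """KL(rho || rho_hat) between two Bernoulli distributions (Eq. 9)."""
    return (rho * math.log(rho / rho_hat)
            + (1.0 - rho) * math.log((1.0 - rho) / (1.0 - rho_hat)))


def mean_activations(hidden):
    """rho_hat_j = (1/N) * sum_n z_j^(n) (Eq. 10).

    `hidden` holds N rows, one per sample, each with one activation
    in (0, 1) per hidden unit (for example, sigmoid outputs).
    """
    n = len(hidden)
    return [sum(row[j] for row in hidden) / n for j in range(len(hidden[0]))]


def sparse_penalty(hidden, rho, beta):
    """beta * sum_j KL(rho || rho_hat_j), the extra term in Eq. (7)."""
    return beta * sum(kl_divergence(rho, rh) for rh in mean_activations(hidden))


# Two samples, two hidden units: unit 0 is already sparse, unit 1 is not.
hidden = [[0.05, 0.40],
          [0.05, 0.60]]
rho_hat = mean_activations(hidden)
penalty = sparse_penalty(hidden, rho=0.05, beta=0.05)
```

Only the second hidden unit contributes to `penalty` here, because its average activation of about 0.5 is far from the target, whereas the first unit matches the target and incurs no cost.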
extracting the hierarchical features. The architecture of our proposed DCAE is shown in Fig. 4. Considering that spectrograms usually have abundant features, we specifically added the dense blocks to analyse and extract spectrogram features. Experiments performed on densely connected convolutional layers with dense blocks have shown that the proposed DCAE exhibits better convergence and a higher detection rate. Meanwhile, the redundancy of feature maps can also be greatly reduced. Moreover, the structure of the dense connection requires fewer parameters and eases the gradient vanishing problem.

Fig. 4. Architecture of the proposed DCAE.

The construction of a dense convolutional network (DenseNet) consisting of a dense block and a transition layer is shown in Fig. 5, while a stereotypical dense block is shown in Fig. 6. In Fig. 6, the notation 3×3@4 refers to a 3×3 convolutional kernel with 4 channels. In a dense block, each convolutional layer is connected not only to the next layer but also to all of the remaining layers in the block. The transition layer is cascaded after the dense block, with the aim of reducing the number of channels generated by the dense block. Consequently, the $n$th layer receives the feature maps of all preceding layers $x_0, x_1, \ldots, x_{n-1}$ as input [38]:

$$x_n = H_n([x_0, x_1, \ldots, x_{n-1}]) \tag{11}$$

where $[x_0, x_1, \ldots, x_{n-1}]$ indicates the concatenation of the feature maps produced in layers $0, \ldots, n-1$, and $H_n(\cdot)$ can be a composite function of operations such as batch normalization, pooling or convolution.

Fig. 5. Construction of a DenseNet.

A dense block with a growth rate of $k = 32$ was designed in the proposed AE. The designed dense block contained three consecutive operations: batch normalization, a 3×3 convolution and a rectified linear unit activation function. In the encoder, the first dense block consisted of four layers, while the second block had eight layers. Since the input images of the second dense block have been twofold downsampled (compared with the previous block), we doubled the number of layers in the second block to balance the complexity of each dense block. The decoder has the same architecture.

The remaining parts of the proposed AE are now described. For the normalization, consider an input image consisting of RGB values in the range of [0, 255]. Since CNNs perform better for data ranging from 0 to 1, the proposed encoder normalizes each channel of the input images as

$$\hat{I}(x, y) = \frac{I(x, y)}{255} \tag{12}$$

where $I(x, y)$ is the original pixel value at position $(x, y)$, and $\hat{I}(x, y)$ is the normalized value. Correspondingly, the encoder and decoder have normalization and denormalization layers, respectively. The transition layer is a convolutional layer with 3×3 convolutional kernels, with 32 channels in the encoder and 128 channels in the decoder. The convolutional layers use 3×3 convolutional kernels of stride 1 to analyse the images. Meanwhile, 1-padding is adopted to ensure that the scale of the images does not change after convolution. Maxpooling layers are utilized in the encoder to twofold downsample the images, with 2×2 filters of stride 2 and 32 channels. Finally, the upsample layer is the opposite of the maxpooling layer, upsampling the images to the required size without changing the number of channels.

4. Unified AE fatigue-feature-extraction model

4.1. Architecture of the AE fatigue-feature-extraction method

Fig. 7 shows the architecture of the proposed method. Some effective labelled data are first screened using AL to reduce the cost of manual marking. The SSAE is utilized to extract the shallow features $F_{\mathrm{shallow}}$, and the advanced features $F_{\mathrm{advance}}$ are then extracted using the proposed DCAE. The two obtained features are combined and then input to the support vector machine (SVM) classifier. This method achieves a joint deep–shallow–advanced feature that is more suitable for our speech fatigue detection task.

From a manual marking perspective, an SSAE network is designed to extract the shallow feature $F_{\mathrm{shallow}}$ for the given manually selected speaker state challenge feature set. Unlike the artificial feature, spectrogram information is represented as a three-dimensional structure, which contains a considerable

Fig. 6. A stereotypical dense block.
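The dense connectivity of Eq. (11) in Section 3.3 is mostly a matter of bookkeeping: every layer sees the concatenation of all earlier feature maps and contributes $k$ new channels. The pure-Python sketch below illustrates only that bookkeeping. It is not the authors' network: `toy_layer` is a hypothetical stand-in for the real composite function $H_n$ (batch normalization, 3×3 convolution and ReLU), and the two-channel, four-pixel input is an assumed toy size.

```python
def toy_layer(channels, growth_rate, n):
    """Hypothetical stand-in for H_n: ReLU of a scaled channel-wise mean.

    Takes the concatenated input channels and returns `growth_rate`
    new channels, each the same length as the inputs.
    """
    size = len(channels[0])
    mean = [sum(ch[p] for ch in channels) / len(channels) for p in range(size)]
    return [[max(0.0, m * (j + 1) / growth_rate) for m in mean]
            for j in range(growth_rate)]


def dense_block(x0, num_layers, growth_rate, layer_fn):
    """Apply x_n = H_n([x_0, ..., x_{n-1}]) for n = 1..num_layers.

    Returns the concatenation of the block input and every layer output,
    which is what a following transition layer would receive.
    """
    features = [x0]
    for n in range(1, num_layers + 1):
        concatenated = [ch for fmap in features for ch in fmap]
        features.append(layer_fn(concatenated, growth_rate, n))
    return [ch for fmap in features for ch in fmap]


# Two input channels of four "pixels" each (flattened toy feature maps).
x0 = [[1.0, -2.0, 3.0, 0.5],
      [0.0, 4.0, -1.0, 2.0]]
out = dense_block(x0, num_layers=4, growth_rate=32, layer_fn=toy_layer)
n_channels = len(out)
```

With four layers and a growth rate of 32, the channel count grows from 2 to 2 + 4 × 32 = 130, which shows why a transition layer is used afterwards to reduce the number of channels.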
amount of useful extra information. If we directly exploit a CAE to extract the advanced features, the deep convolution network will usually lead to the loss of the former information. Moreover, such a deep-learning model generally depends on a large quantity of training parameters, which makes optimization more difficult. Inspired by previous work [39], the dense block is utilized in our process of extracting the advanced feature $F_{\mathrm{advance}}$ to obtain rich information, as detailed in Section 3.3.

Fig. 7. Architecture of the proposed unified deep-learning model.

4.2. Multiple kernel learning strategy

We have now obtained two deep features, $F_{\mathrm{shallow}}$ and $F_{\mathrm{advance}}$, in our feature-extraction subnetworks. Simply concatenating these two fatigue features would not make full use of them, whereas research shows that multiple kernel learning (MKL) [40] can combine the features in a more flexible manner.

The general SVM usually uses a single kernel function to map the sample features to the Hilbert space, which turns the linearly inseparable problem of the original feature space into a linearly separable one by maximizing the interval between positive and negative samples.

MKL is in fact an optimization strategy for the SVM, which selects different kernels for different features and then trains the weight of each kernel to synthesize a multi-kernel matrix that fuses the types of features better. The complete structure of MKL is shown in Fig. 8.

In this paper, we utilize the multiscale Gaussian kernel function as the basic kernel:

$$k(x, z) = \exp\left(-\frac{\lVert x - z \rVert^2}{2\sigma^2}\right) \tag{13}$$

When $\sigma$ is small, the Gaussian kernel function can fit sharply changing samples well. Otherwise, it can fit gently changing samples. Each extracted feature then corresponds to a multiscale kernel, which is obtained by weighting several different scales of Gaussian kernel functions:

$$K(x, z) = \sum_{i=1}^{M} \beta_i k_i(x, z) \quad \text{such that } \beta_i \ge 0,\ \sum_{i=1}^{M} \beta_i = 1 \tag{14}$$

where $\beta_i$ is the weight, $k_i$ is the basic kernel, and $M$ is the total number of basic kernels. The large-scale kernel is better for fitting areas with gentle changes, and there will be errors in other areas with sharp changes, but these can be addressed using the small-scale kernel, thereby ensuring that the function exhibits a better fitting performance overall. Finally, $F_{\mathrm{shallow}}$ and $F_{\mathrm{advance}}$ yield two synthetic kernels $K_1$ and $K_2$. For the test data $\{x_i, y_i\}_{i=1}^{N}$, $x_i$ are the sample data and $y_i$ is the corresponding label. The decision function of the final data is as follows:

$$f(x) = a_1(x) K_1(x_1, x) + a_2(x) K_2(x_2, x) + b \tag{15}$$

4.3. Active sampling strategy

Training the DNN to achieve impressive performance requires the use of a large amount of labelled training samples in supervised learning. However, labelling enough speech samples is time-consuming in practice, and small training sets are prone to overfitting. With the help of an AL strategy, the proposed model can be trained more effectively by using a relatively small number of labelled samples.

In this paper, we choose the MCLU (Multiclass Level Uncertainty) [41] technique as the query criterion. This method applies a difference function $c_{\mathrm{diff}}(x)$ to record the uncertainty of unlabelled samples. $c_{\mathrm{diff}}(x)$ on logistic regression considers the difference between the largest and second-largest class-conditional probability densities using the following objective function [24]:

$$c_{\mathrm{diff}}(x) = p^{(i)}(x \mid \omega_{\max 1}) - p^{(i)}(x \mid \omega_{\max 2}) \tag{16}$$

$$\omega_{\max 1} = \arg\max_{\omega_n \in \Omega} \left\{ p^{(i)}(x \mid \omega_n) \right\} \tag{17}$$

$$\omega_{\max 2} = \arg\max_{\omega_n \in \Omega \setminus \{\omega_{\max 1}\}} \left\{ p^{(i)}(x \mid \omega_n) \right\} \tag{18}$$

When $c_{\mathrm{diff}}(x)$ is large, $x$ will be assigned to the predicted class $\omega_{\max 1}$. Otherwise, it will be treated as an uncertain sample that should be classified manually. In other words, the MCLU strategy selects the data corresponding to the minimum value of $c_{\mathrm{diff}}(x)$ from the candidate unlabelled samples. The detailed description is shown in Algorithm 1.

Fig. 9 shows the detailed construction of the AL sampling strategy for the SSAE. We first train the SSAE with a few labelled training samples, and the extracted features are then used to train a softmax classifier with supervised fine-tuning. The initial labelled samples are enough to train the SSAE to extract a robust feature representation. Subsequently, a subset of unlabelled data regarded as the candidate set is then classified using

Fig. 8. The structure of MKL.
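The multiscale Gaussian kernel of Eqs. (13) and (14) in Section 4.2 can be sketched in a few lines of pure Python. This is an illustration only: the three scales in `sigmas` and the weights in `betas` are assumed values chosen by hand, whereas in MKL the weights are learned, and the function names are hypothetical. The sketch stops at building the kernel (Gram) matrix and does not train the SVM of Eq. (15).

```python
import math


def gaussian_kernel(x, z, sigma):
    """k(x, z) = exp(-||x - z||^2 / (2 * sigma^2)) (Eq. 13)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def multiscale_kernel(x, z, sigmas, betas):
    """K(x, z) = sum_i beta_i * k_i(x, z) with beta_i >= 0 and sum beta_i = 1 (Eq. 14)."""
    if any(b < 0 for b in betas) or abs(sum(betas) - 1.0) > 1e-9:
        raise ValueError("betas must be non-negative and sum to 1")
    return sum(b * gaussian_kernel(x, z, s) for b, s in zip(betas, sigmas))


def gram_matrix(samples, sigmas, betas):
    """Pairwise kernel values, as would be passed to a kernel SVM."""
    return [[multiscale_kernel(a, b, sigmas, betas) for b in samples]
            for a in samples]


sigmas = [0.5, 2.0, 8.0]   # assumed small, medium and large scales
betas = [0.2, 0.3, 0.5]    # assumed weights; MKL would learn these
samples = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
K = gram_matrix(samples, sigmas, betas)
```

The small-scale term decays quickly with distance while the large-scale term decays slowly, so the weighted sum responds to both sharp and gentle changes, which is the motivation given in the text for combining several scales.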
softmax regression [42]. Finally, AL iteratively selects the most-uncertain unlabelled samples, adds them to the training set with true labels and simultaneously removes them from the candidate set [43].

Fig. 9. AL sampling strategy for SSAE.

5. Experiments

Three experiments were used to verify the performance of the proposed method. The first experiment filtered out some unlabelled data from the candidate set using the MCLU query method, and used the selected data to pretrain the SSAE. The second experiment verified the performance of the proposed DCAE. The third experiment compared the combined feature classification method with current state-of-the-art methods. The results are presented in detail in the following subsections.

All of the experimental results were obtained on a Windows 10 personal computer equipped with a 64-bit Intel Core i5-9300H CPU running at 2.4 GHz and with 8 GB of RAM. All of the proposed methods were implemented using Python (version 3.7) and TensorFlow (version 1.14.0) software.

5.1. AL and SSAE pretraining

5.1.1. Data sets and setting

In the experiments, the fatigue data set consisted of two parts: labelled and unlabelled. The labelled data set [44] is reported in Table 3. We also collected 3000 unlabelled data samples. The radiotelephony communications were obtained from the Air Traffic Management Shandong Bureau of China.

Table 3
The fatigue data set utilized in this study.

Data set    Unlabelled speech data    Labelled speech data    Total
            (N = 3000)                (N = 1606)              (N = 4606)

Labelled speech data numbering
Number       Expression          Explanation
1            Control category    R, area control; A, approach control; T, tower control
2            ATC rank            5, level 5; 4, level 4; 3, level 3; 2, level 2; 1, level 1; 0, trainee
3–10         Time (UTC)          3–6, time of starting work; 7–10, time of ending work
11           Sex                 F, female; M, male
12 and 13    Age                 Arabic numeral (age in years)
14 and 15    Order               Nn, N is a digital indicator and n is an Arabic numeral indicating the nth instruction issued by the ATC while working
16 and 17    Status              14th, '-'; 15th, voice command; 1, error; 2, ambiguity; 3, hesitation or pause; 4, fatigue

5.1.2. SSAE depth effect

Due to the automatic recognition of features, the number of hidden layers in the SSAE significantly affects the classification performance. In the experiments we fixed the other parameters and only changed the number of hidden layers in order to assess the effect of this change on the performance. We then selected the complete labelled samples to train the network. We tested several SSAEs with depths varying from one to three layers.

After several experiments, we found that the optimal classification performance was achieved when the numbers of units in the hidden layers of the SSAE model were set to 512, 256 and 128. During training, the Adam algorithm was used for backpropagation. The number of epochs was set as 30, the

Algorithm 1: Active Learning With MCLU
Required:
  $L_0$: initial labelled dataset
  $U_0$: candidate unlabelled samples
  $U_s$: selected samples to be labelled by the ATC instructor
  $I_x$: the index of candidate unlabelled samples
1. Pretrain the SSAE with the initial labelled dataset $L_0$;
2. Extract the shallow features to train a softmax classifier with supervised fine-tuning;
3. Learn the shallow features of the candidate unlabelled samples in $U_0$;
4. Repeat:
5.   Classify the candidate data with the softmax classifier and calculate the value of $c_{\mathrm{diff}}(x)$ for each candidate with Eq. (16), storing the values in $C_d(i)$;
6.   Sort $C_d(i)$ in ascending order;
7.   Get the sample index $I_x$ with the minimum value of $c_{\mathrm{diff}}(x)$ and remove that sample from $U_0$;
8.   Update the label of the selected data and add the labelled data to $U_s$;
9. Until the iteration is over.
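The query step of Algorithm 1 can be sketched in pure Python as follows. This is a simplified illustration under stated assumptions, not the authors' code: softmax class probabilities stand in for the class-conditional densities of Eq. (16), the probabilities are fixed toy values, and the retraining of the SSAE and softmax classifier between queries is left out (marked by a comment). The names `c_diff` and `mclu_select` are hypothetical.

```python
def c_diff(probs):
    """Difference between the largest and second-largest class probability."""
    top, second = sorted(probs, reverse=True)[:2]
    return top - second


def mclu_select(candidate_probs, n_queries):
    """Pick the `n_queries` most-uncertain candidates, one per iteration.

    `candidate_probs` maps a sample index to its list of class probabilities.
    Returns the queried indices (to be labelled by an expert) and the
    candidates that remain unlabelled.
    """
    remaining = dict(candidate_probs)
    selected = []
    for _ in range(min(n_queries, len(remaining))):
        # A full loop would retrain the classifier here and refresh
        # the probabilities of the remaining candidates.
        index = min(remaining, key=lambda i: c_diff(remaining[i]))
        selected.append(index)
        del remaining[index]
    return selected, remaining


# Toy two-class probabilities for four unlabelled samples (assumed values).
probs = {
    0: [0.95, 0.05],
    1: [0.55, 0.45],
    2: [0.70, 0.30],
    3: [0.51, 0.49],
}
selected, remaining = mclu_select(probs, n_queries=2)
```

Samples 3 and 1 have the smallest margins between their two class probabilities, so they are queried first, while the confidently classified samples 0 and 2 stay in the candidate set.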
Table 4
Classification results for different depths.
Number of hidden layers   Classification result
1                         0.8243
2                         0.8525
3                         0.8473

Table 5
Spectrogram set parameters in the transformation.
Parameter               Value
Number of FFT samples   512
Sampling frequency      16 kHz
Window length           512
Frame overlap           256
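
As an illustration of the settings in Table 5, the following NumPy sketch computes a log-magnitude spectrogram with a 512-point FFT, a 512-sample window and a 256-sample overlap at a sampling frequency of 16 kHz. The Hann window, the log compression and the function name `log_spectrogram` are assumptions made for this sketch, since the source does not specify them, and the subsequent resizing of the spectrograms to 256×256 is not shown.

```python
import numpy as np

FS = 16_000        # sampling frequency in Hz (Table 5)
N_FFT = 512        # number of FFT samples (Table 5)
WIN_LENGTH = 512   # window length in samples (Table 5)
OVERLAP = 256      # frame overlap in samples (Table 5)


def log_spectrogram(signal, n_fft=N_FFT, win_length=WIN_LENGTH, overlap=OVERLAP):
    """Return a log-magnitude spectrogram of shape (n_fft // 2 + 1, n_frames)."""
    signal = np.asarray(signal, dtype=np.float64)
    hop = win_length - overlap
    if signal.size < win_length:
        raise ValueError("signal is shorter than one analysis window")
    n_frames = 1 + (signal.size - win_length) // hop
    # Assumption: the paper does not name the window function; Hann is used here.
    window = np.hanning(win_length)
    # Cut the signal into overlapping frames and apply the window to each frame.
    frames = np.stack(
        [signal[i * hop:i * hop + win_length] * window for i in range(n_frames)]
    )
    magnitude = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))
    # log1p compresses the dynamic range (also an assumption of this sketch).
    return np.log1p(magnitude).T


# Example: one second of a 1 kHz tone sampled at 16 kHz.
t = np.arange(FS) / FS
spec = log_spectrogram(np.sin(2 * np.pi * 1000.0 * t))
print(spec.shape)
```

With these settings each frequency bin spans 16000 / 512 = 31.25 Hz, so the 1 kHz test tone falls in bin 32.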
Fig. 10. Classification results for different depths.
Fig. 12. Information loss for the proposed DCAE and CAE.
Fig. 13. Accuracy for proposed DCAE method and competing methods.
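
The query step of Algorithm 1 can be illustrated with a small sketch. It assumes the common MCLU definition of cdiff(x) as the gap between the two largest class scores, because Eq. (13) is not repeated in this section. The callables `predict_proba` and `oracle` are hypothetical stand-ins for the softmax layer on top of the SSAE and for the ATC instructor, and the fine-tuning with the newly labelled samples is omitted, so this is not the authors' implementation.

```python
import numpy as np


def mclu_confidence(probs):
    """c_diff(x): gap between the two largest class scores of each sample."""
    probs = np.asarray(probs, dtype=np.float64)
    top_two = np.sort(probs, axis=1)[:, -2:]  # two largest scores per row, ascending
    return top_two[:, 1] - top_two[:, 0]


def select_most_uncertain(probs, n_select):
    """Positions of the n_select samples with the smallest c_diff (steps 5-7)."""
    c_diff = mclu_confidence(probs)
    return np.argsort(c_diff, kind="stable")[:n_select]


def active_learning(predict_proba, oracle, n_candidates, n_iterations):
    """Steps 4-9 of Algorithm 1 with one queried sample per iteration.

    predict_proba(indices) returns class scores for the listed candidates and
    oracle(index) returns the label given by the annotator; both are
    hypothetical callables introduced only for this sketch.
    """
    remaining = list(range(n_candidates))  # U0
    selected = []                          # Us, stored as (index, label) pairs
    for _ in range(n_iterations):
        if not remaining:
            break
        probs = predict_proba(remaining)
        position = int(select_most_uncertain(probs, 1)[0])
        index = remaining.pop(position)          # remove the sample from U0
        selected.append((index, oracle(index)))  # add the labelled sample to Us
    return selected


# Toy scores for four candidates and two classes (illustrative values only).
toy_probs = np.array([[0.90, 0.10], [0.55, 0.45], [0.30, 0.70], [0.51, 0.49]])
picked = active_learning(
    predict_proba=lambda idx: toy_probs[idx],
    oracle=lambda i: int(toy_probs[i].argmax()),  # stand-in for a human label
    n_candidates=4,
    n_iterations=2,
)
print(picked)
```

In the toy example, candidates 3 and 1 have the smallest score gaps and are therefore queried first.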
learning rate was 0.001 and the sparsity penalty weight was set to 0.05. The effect of the depth on the classification results is presented in Table 4 and Fig. 10. On the basis of these results, we finally chose an SSAE network with two hidden layers.

5.1.3. Active learning

For the traditional SSAE, we used 75% of the labelled data as the training set and the remaining 25% as the test set. For the proposed AL-based SSAE, we divided the data set into three parts: 100 labelled samples of each class were randomly selected for training the SSAE, the 3000 unlabelled data samples were used as the candidate set, and the same 25% of the labelled data were used to test the classification performance.

Following Fig. 10, the training set was used to pretrain the parameters of the SSAE with two hidden layers, and the number of iterations was set to 300. The 300 most-uncertain samples were therefore selected by the MCLU query method combined with the softmax layer. These samples were added to the training set, with true labels as determined by the ATC instructor, to fine-tune the network. We then compared the classification results of the AL-based SSAE with those of the traditional SSAE, as shown in Fig. 11. The findings indicate that adding the extra labelled data using AL alleviated the labelling cost and clearly improved the classification accuracy.

Fig. 11. Accuracy trends between the SSAE and AL-based SSAE.

5.2. Proposed DCAE training method

We converted each speech sample into a spectrogram to perform convolution, using the parameters listed in Table 5. For convenience, all of the spectrograms were transformed to
a size of 256×256. The data-set allocation was the same as in the preceding experiment.

In the proposed DCAE network, the DenseNet growth rate was first set to 36, followed by 48. During training, the Adam optimizer and a binary cross-entropy loss function were used for parameter updating, which exhibited excellent performance in practice. Other hyperparameter settings of the training network are listed in Table 6. The layout of the DCAE network has been reported in Section 3.3.

Table 6
Hyperparameter settings of the training network.
Hyperparameter              Value
Batch size                  20
Epochs                      30
Learning rate               0.001
Overall training time (h)   3.2
Feature-map size            16 × 16

As shown in Fig. 12, the loss function decreased faster for the DCAE method. The loss function did not decrease any further after around epoch 16. The DCAE converged to 0.61 and the CAE converged to 0.38, with the improved method also halving the required training time. We utilized a softmax classifier to test the accuracy, to verify the superiority of the proposed model for classification and to further illustrate that the data obtained from AL had better discrimination. All of the results are presented in Fig. 13 and Table 7, which indicate that the proposed method performs better at extracting deep features from a spectrogram.

Table 7
Classification accuracy of the different models.
Model           Classification result
CAE             84.89%
AL-based CAE    88.56%
DCAE            91.01%
AL-based DCAE   92.68%

Fig. 14. Results for different fatigue-detection methods based on the SVM.

5.3. Results and analysis

We finally extracted 128-dimensional feature vectors from the SSAE and DCAE networks, and then used the multiple
Table 8
Results for different fatigue-detection methods (824 test samples).
Features                        Accuracy rate   Improvement
pH                              85.63%          Baseline
SWFF                            92.82%          8.39%
Shallow features from SSAE      87.86%          2.60%
Advanced features from DCAE     93.47%          9.15%
Combined features without MKL   95.58%          11.6%
Proposed combined features      98.35%          12.72%
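
The Improvement row of Table 8 can be checked against the accuracy row. The sketch below is only a reader-side check, not the authors' code: it states Eq. (19) as a function and compares each accuracy with the pH baseline, both as a relative gain and as an absolute difference in percentage points. Under that reading, the first four improvement entries agree with the relative gain to within rounding, whereas the 12.72% entry for the proposed features coincides with the absolute difference (98.35 − 85.63); the relative gain computed in the same way would be about 14.85%.

```python
def detection_accuracy(n_correct, n_test):
    """Eq. (19): fraction of test samples that were detected correctly."""
    if n_test <= 0:
        raise ValueError("n_test must be positive")
    return n_correct / n_test


def relative_gain(accuracy, baseline):
    """Gain over the baseline, in percent of the baseline accuracy."""
    return 100.0 * (accuracy - baseline) / baseline


def absolute_gain(accuracy, baseline):
    """Gain over the baseline, in percentage points."""
    return accuracy - baseline


# Accuracy rates (%) copied from Table 8; pH is the baseline.
TABLE_8 = {
    "pH": 85.63,
    "SWFF": 92.82,
    "Shallow features from SSAE": 87.86,
    "Advanced features from DCAE": 93.47,
    "Combined features without MKL": 95.58,
    "Proposed combined features": 98.35,
}

baseline = TABLE_8["pH"]
gains = {
    name: (
        round(relative_gain(acc, baseline), 2),
        round(absolute_gain(acc, baseline), 2),
    )
    for name, acc in TABLE_8.items()
}

for name, (rel, abs_points) in gains.items():
    print(f"{name:32s} relative {rel:6.2f}%   absolute {abs_points:6.2f} points")
```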
kernel SVM (support vector machine) as a classifier, with the Gaussian kernel function selected. In the experiment, we used the 300 samples obtained by AL as the training set and the fatigue data set as the test set. Because the number of fatigued speech samples in the fatigue data set was far smaller than the number of normal speech samples, in order to ensure the accuracy of the experimental results we selected all 412 fatigued speech samples together with 412 randomly selected normal speech samples. Then, based on the proposed model, we compared the recognition performance for shallow, advanced and combined features, and also with the state-of-the-art pH [45] and SWFF [44] fatigue features. Both pH and SWFF are nonlinear speech features that have been demonstrated to achieve effective recognition in the field of fatigue detection.

The experimental results are presented in Fig. 14 and Table 8. The detection accuracy was calculated as

$$A_{acc} = \frac{N_{cor}}{N_{test}} \tag{19}$$

where $N_{cor}$ is the number of correct detections and $N_{test}$ is the number of samples in the test set.

The red points in Fig. 14(a)–(d) indicate the true categories, while the green crosses indicate the predicted ones; more overlap between the two markers implies higher accuracy. The proposed method shows the fewest mismatches. The experimental results show that the data set obtained using AL was more clearly differentiated. The accuracy obtained with the advanced features extracted by adding a dense block to the CAE was 5.61 percentage points higher than that obtained with the shallow features from the ordinary SSAE (93.47% versus 87.86% in Table 8), and the final accuracy for the combined fatigue features was 98.35%. These findings demonstrate that the multilevel combined fatigue features proposed in this study provide performance that is superior to those of other advanced fatigue-detection technologies.

6. Conclusion

This paper has presented a novel unified deep-learning network in which two subnetworks are applied to extract the shallow and advanced features, and MKL is used to combine the features in a more generic and robust manner. The AL sampling strategy is exploited to select a subset of the most informative unlabelled samples for labelling and to use them to train the SSAE, which can improve the performance of the proposed network when relatively few labelled samples are available. Meanwhile, adding the dense block to the CAE network improves the ability to extract deep features from spectrograms.

The experimental results demonstrate that the proposed method exhibits promising performance compared with many current state-of-the-art approaches. The present research findings can provide theoretical guidance for air traffic management authorities attempting to detect ATC fatigue, and they might also be useful as a reference in fatigue assessments performed in other professional fields of civil aviation.

CRediT authorship contribution statement

Zhiyuan Shen: Conceptualization, Methodology, Software, Validation, Supervision, Project administration, Writing - review & editing. Yitao Wei: Conceptualization, Methodology, Software, Visualization, Formal analysis, Data curation, Writing - original draft.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

The authors thank the Shandong Air Traffic Management Sub-Bureau of Civil Aviation Administration of China for supplying raw speech data of air traffic controllers, and Professor Chin-hui Lee at Georgia Institute of Technology for helpful guidance and suggestions.

References

[1] Yu-Hern Chang, Hui-Hua Yang, Wan-Jou Hsu, Effects of work shifts on fatigue levels of air traffic controllers, J. Air Transp. Manag. (ISSN: 0969-6997) 76 (2019) 1–9.
[2] J. Shen, J. Barbera, C.M. Shapiro, Distinguishing sleepiness and fatigue: focus on definition and measurement, Sleep Med. Rev. 10 (1) (2006) 63–76.
[3] S. Lee, J.K. Kim, Factors contributing to the risk of airline pilot fatigue, J. Air Transp. Manag. 67 (2018) 197–207.
[4] X. Wang, C. Xu, Driver drowsiness detection based on non-intrusive metrics considering individual specifics, Accid. Anal. Prev. 95 (2016) 350–357.
[5] L.L. Di Stasi, R. Renner, A. Catena, J.J. Cañas, B.M. Velichkovsky, S. Pannasch, Towards a driver fatigue test based on the saccadic main sequence: A partial validation by subjective report data, Transp. Res. C 21 (1) (2012) 122–133.
[6] T. Chalder, G. Berelowitz, T. Pawlikowska, L. Watts, E.P. Wallace, Development of a fatigue scale, J. Psychosom. Res. 37 (2) (1993) 147–153.
[7] V. Riethmeister, U. Bültmann, M. Gordijn, S. Brouwer, M.D. Boer, Investigating daily fatigue scores during two-week offshore day shifts, Applied Ergon. 71 (2018).
[8] S. Arnau, T. Möckel, G. Rinkenauer, E. Wascher, The interconnection of mental fatigue and aging: an EEG study, Int. J. Psychophysiol. 117 (2017) 17–25.
[9] Shitong Huang, Jia Li, Pengzhu Zhang, et al., Detection of mental fatigue state with wearable ECG devices, Int. J. Med. Inform. (2018).
[10] H. Mansikka, P. Simola, K. Virtanen, D. Harris, L. Oksama, Fighter pilots’ heart rate, heart rate variation and performance during instrument approaches, Ergonomics (2016) 1–9.
[11] M.L. Chen, S.Y. Lu, I.F. Mao, Subjective symptoms and physiological measures of fatigue in air traffic controllers, Int. J. Ind. Ergon. 70 (2019) 1–8.
[12] Baisheng Nie, Xin Huang, Yang Chen, Anjin Li, Ruming Zhang, Jinxin Huang, Experimental study on visual detection for fatigue of fixed-position staff, Applied Ergon. (ISSN: 0003-6870) 65 (2017) 1–11.
[13] J. Whitmore, S. Fisher, Speech during sustained operations, Speech Commun. 20 (1996) 55–70.
[14] M. Vollrath, Automatic measurement of aspects of speech reflecting motor coordination, Behav. Res. Methods Instrum Comput. 26 (1) (1989) 35–40.
[15] H.P. Greeley, E. Friets, J.P. Wilson, S. Raghavan, J. Picone, J. Berg, Detecting fatigue from voice using speech recognition, in: 2006 IEEE International Symposium on Signal Processing and Information Technology, Vancouver, BC, 2006, pp. 567–571, http://dx.doi.org/10.1109/ISSPIT.2006.270865.
[16] E.M. Albornoz, M. Sánchez-Gutiérrez, F. Martinez-Licona, H.L. Rufiner, J. Goddard, Spoken emotion recognition using deep learning, in: Proc. Iberoamer. Congr. Pattern Recognit, Springer, Cham, Switzerland, 2014, pp. 104–111.
[17] Salaheddine Bendak, Hamad S.J. Rashid, Fatigue in aviation: A systematic review of the literature, Int. J. Ind. Ergon. (2020).
[18] D. Yu, M.L. Seltzer, J. Li, J.-T. Huang, F. Seide, Feature learning in deep neural networks - Studies on speech recognition tasks, 2013, arXiv:1301.3605.
[19] N. Kalchbrenner, E. Grefenstette, P. Blunsom, A convolutional neural network for modelling sentences, 2014, arXiv:1404.2188 [Online]. Available: https://arxiv.org/abs/1404.2188.
[20] S. Prasomphan, Improvement of speech emotion recognition with neural network classifier by using speech spectrogram, in: International Conference on Systems, Signals and Image Processing, IEEE, 2015, pp. 73–76.
[21] Abdul Malik Badshah, Jamil Ahmad, Nasir Rahim, Sung Wook Baik, Speech emotion recognition from spectrograms with deep convolutional neural network, in: 2017 International Conference on Platform Technology and Service (PlatCon), 2017.
[22] J. Zhu, Z. Liu, Analysis of hybrid feature research based on extraction LPCC and MFCC, in: Tenth International Conference on Computational Intelligence and Security, IEEE, 2014, pp. 732–735.
[23] B. Schuller, F. Burkhardt, Learning with synthesized speech for automatic emotion recognition, in: Proc. of the 2010 IEEE Int'l Conf. on Acoustics Speech and Signal Processing (ICASSP), IEEE, 2010, pp. 5150–5153.
[24] C. Deng, Y. Xue, X. Liu, C. Li, D. Tao, Active transfer learning network: A unified deep joint spectral–spatial feature learning model for hyperspectral image classification, IEEE Trans. Geosci. Remote Sens. 57 (3) (2019) 1741–1754, http://dx.doi.org/10.1109/TGRS.2018.2868851.
[25] A.I. Schein, L.H. Ungar, Active learning for logistic regression: An evaluation, Mach. Learn. 68 (3) (2007) 235–265.
[26] F. Eyben, M. Wöllmer, B. Schuller, OpenEAR - introducing the Munich open-source emotion and affect recognition toolkit, in: Affective Computing and Intelligent Interaction and Workshops, 2009. ACII 2009. 3rd International Conference on, IEEE, 2009.
[27] F. Eyben, M. Wöllmer, B. Schuller, openSMILE - the Munich versatile and fast open-source audio feature extractor, in: Proc. ACM Multimedia (MM), ACM, Florence, Italy, 2010, pp. 1459–1462.
[28] Marie-José Caraty, Claude Montacié, Vocal fatigue induced by prolonged oral reading: Analysis and detection, Comput. Speech Lang. (2014).
[29] L. Deng, M.L. Seltzer, D. Yu, et al., Binary coding of speech spectrograms using a deep auto-encoder, in: Interspeech, Conference of the International Speech Communication Association, Makuhari, Chiba, Japan, September, DBLP, 2011.
[30] L.O. Chua, T. Roska, et al., The CNN paradigm, IEEE Trans. Circuits Syst. I Fundam. Theory Appl. (1993).
[31] G. Huang, Z. Liu, K.Q. Weinberger, L. van der Maaten, Densely connected convolutional networks, in: Proc. IEEE Conf. Comput. Vision & Pattern Recognition, vol. 1, 2017, p. 3.
[32] M.A. Kramer, Nonlinear principal component analysis using autoassociative neural networks, AIChE J. 37 (2) (1991) 233–243, http://dx.doi.org/10.1002/aic.690370209.
[33] J. Deng, Z. Zhang, E. Marchi, et al., Sparse autoencoder-based feature transfer learning for speech emotion recognition, in: Affective Computing & Intelligent Interaction, IEEE, 2013.
[34] J. Li, Active learning for hyperspectral image classification with a stacked autoencoders based neural network, in: Proc. IEEE Int. Conf. Image Process. (ICIP), Phoenix, AZ, USA, Sep. 2016, pp. 1062–1065.
[35] Pascal Vincent, Hugo Larochelle, Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion, J. Mach. Learn. Res. 11 (2010) 3371–3408.
[36] I.M. Mohammed, M.Z.N. Al-Dabagh, M.I. Ahmad, et al., Face Recognition Using PCA Implemented on Raspberry Pi, in: Proceedings of the 11th National Technical Seminar on Unmanned System Technology, 2019, p. 2021.
[37] J. Masci, U. Meier, D. Ciresan, et al., Stacked convolutional auto-encoders for hierarchical feature extraction, in: International Conference on Artificial Neural Networks, Springer-Verlag, 2011.
[38] Schuller Appendix, Computational paralinguistics emotion affect and personality in speech and language processing, 2013.
[39] Shengwei Wang, Hongkui Wang, Sen Xiang, Li Yu, Densely connected convolutional network block based autoencoder for panorama map compression, Signal Process., Image Commun. (ISSN: 0923-5965) 80 (2020) 115678.
[40] M. Gönen, E. Alpaydın, Multiple kernel learning algorithms, J. Mach. Learn. Res. 12 (2011) 2211–2268.
[41] C. Persello, L. Bruzzone, Active learning for domain adaptation in the supervised classification of remote sensing images, IEEE Trans. Geosci. Remote Sens. 50 (11) (2012) 4468–4483.
[42] A.I. Schein, L.H. Ungar, Active learning for logistic regression: An evaluation, Mach. Learn. 68 (3) (2007) 235–265.
[43] C. Deng, X. Liu, C. Li, D. Tao, Active multi-kernel domain adaptation for hyperspectral image classification, Pattern Recognit. 77 (2018) 306–315.
[44] Shen Zhiyuan, Pan Guozhuang, A High-Precision Fatigue Detecting Method for Air Traffic Controllers Based on Revised Fractal Dimension Feature, Hindawi, 2020, 1024-123X.
[45] V.V. Nishawala, M. Ostoja-Starzewski, Acceleration waves on random fields with fractal and Hurst effects, Wave Motion 74 (2017).
