
2019 IEEE Fifth International Conference on Big Data Computing Service and Applications (BigDataService)

Artificial Intelligence (AI) Clinical Edge for Voice Disorder Detection

Jaya Shankar Vuppalapati, Santosh Kedari
Sanjeevani Electronic Health Records
Hanumayamma Innovations and Technologies, Pvt Limited
HIG II, Flat 7, Baghlingumpally, Hyderabad, India, 500004
{jaya.vuppalapati, skedari}@sanjeevani-ehr.com

Sharat Kedari
IoT and Data Analytics
Hanumayamma Innovations and Technologies, Inc.
628 Crescent Terrace, Fremont, CA 94536
sharath@hanuinnotech.com

Anitha Ilapakurti, Chandrasekar Vuppalapati
Products and Programs
Hanumayamma Innovations and Technologies, Inc.
628 Crescent Terrace, Fremont, CA 94536
{ailapakurti,cvuppalapati}@hanuinnotech.com

Abstract - Computerized detection of voice disorders has attracted considerable academic and clinical interest in the hope of providing an effective screening method for voice diseases before endoscopic confirmation. The goal of this paper is to apply neural networks and machine learning techniques to detect pathological voice and to classify three disorder categories from acoustic waveforms collected on a mobile phone. From a health science perspective, a pathological status of the human voice can substantially reduce quality of life and occupational performance, which results in considerable costs for both the patient and society.

The paper summarizes the various techniques and feature engineering processes that we have applied to the voice data collected for classification of voice disorders. We have used Mel-scaled spectrograms and MFCC components as audio features to train various neural network architectures. We have trained a 5-layer plain network, a 5-layer CNN, and an RNN. We discuss the challenges faced and the solutions used to improve model performance, model parameter tuning, and model evaluation.

Keywords—FEMH, Neural Networks, TensorFlow, Voice Data, Mel Frequency Spectrum, MFCC, CNN, RNN

I. INTRODUCTION

Voice disorders are a widespread and significant health problem. Estimates of prevalence range from 3 percent to 7 percent of the general U.S. population. Untreated, voice disorders cost billions of dollars in lost productivity. For an occupation like teaching, where voice use is heavy, the cost is almost $3 billion annually¹.

Voice disorders can happen at any point. A study done by the National Institute of Deafness and Other Communication Disorders, US, says that 5% of the population in India has a voice disorder at some point in their lifetime².

A lack of sufficient skilled doctors or clinicians to address the issue is a huge problem. The problem gets even harder if the patient is located in a remote setting where the availability and spread of doctors is very limited. Having a simple mobile diagnostic tool to detect the voice condition before endoscopic confirmation makes the treatment very patient friendly. The goal of this paper is to develop a mobile diagnostic voice disorder app.

In order to perform audio classification, we need to extract the most appropriate and informative acoustic parameters. Traditionally, pitch, jitter, and shimmer have been used for this purpose. In recent years, the Mel Frequency Cepstral Coefficient (MFCC) has gained popularity as a successful parameter for audio classification. We are using MFCC [1] to classify between healthy and pathological patient audio.

Fig. 1. Mel Spectrogram of an audio file

We initially started with machine learning models such as a Random Forest classifier and an XGBoost classifier. The model performance was moderate, with accuracy around 50% to 60%.
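As an illustration of this kind of baseline, the sketch below trains a Random Forest on per-file summary features. It is a minimal example and assumes MFCC means as the input representation and a per-class folder layout, since the exact feature vector and data organization used for the baseline runs are not specified in the paper.

# Baseline sketch (assumption: MFCC means per file as features; the folder
# names below are hypothetical and do not come from the paper).
import glob
import numpy as np
import librosa
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def mfcc_mean_features(path, n_mfcc=40):
    """Load a .wav file and return the mean MFCC vector as a fixed-size feature."""
    signal, sr = librosa.load(path)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)

X, y = [], []
for label, sub_dir in enumerate(["normal", "neoplasm", "phonotrauma", "vocal_palsy"]):
    for fn in glob.glob(f"data/{sub_dir}/*.wav"):
        X.append(mfcc_mean_features(fn))
        y.append(label)

X_train, X_test, y_train, y_test = train_test_split(np.array(X), np.array(y), test_size=0.2)
clf = RandomForestClassifier(n_estimators=200).fit(X_train, y_train)
print("baseline accuracy:", accuracy_score(y_test, clf.predict(X_test)))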
The first architecture is a plain fully connected neural network. This architecture is equivalent to stacking multiple perceptrons (neurons) and averaging their predictions.

The second architecture is a Convolutional Neural Network (CNN). CNNs initially gained popularity for their state-of-the-art performance on image datasets. In recent years, research has shown that CNNs perform well on text and audio classification as well.

The third architecture is a Recurrent Neural Network (RNN). An RNN is a neural network with memory that leverages sequential information and has been widely used in Natural Language Processing (NLP). It assumes the inputs are dependent on each other; hence RNNs have produced state-of-the-art results for NLP tasks such as predicting the next word in a sentence. Audio data, unlike image data, is time dependent, and research shows that RNNs are relevant for audio classification too.

¹ The Problem of Limited Health Care Coverage: Voice Disorder Treatment [Transcript], https://blog.asha.org/the-problem-of-limited-health-care-coverage-voice-disorder-treatment-transcript/
² Chennai doctors help patients find their voice, https://timesofindia.indiatimes.com/city/chennai/Chennai-doctors-help-patients-find-their-voice/articleshow/16886456.cms

978-1-7281-0059-3/19/$31.00 ©2019 IEEE
DOI 10.1109/BigDataService.2019.00060
II. DATA & FEATURE EXTRACTION

A. Data Samples

The voice samples were obtained from a voice clinic in a tertiary teaching hospital (Far Eastern Memorial Hospital, FEMH) and included 50 normal voice samples and 150 samples of common voice disorders, including vocal nodules, polyps, and cysts (collectively referred to as phonotrauma); glottis neoplasm; and unilateral vocal paralysis. Voice samples of a 3-second sustained vowel sound /a:/ were recorded at a comfortable level of loudness, with a microphone-to-mouth distance of approximately 15–20 cm, using a high-quality microphone (Model: SM58, SHURE, IL) with a digital amplifier (Model: X2u, SHURE) under a background noise level between 40 and 45 dBA. The sampling rate was 44,100 Hz with a 16-bit resolution, and data were saved in an uncompressed .wav format.

B. Voice Feature Extraction Techniques

Our aim is to identify the best features to perform sound recognition. As indicated in [2], there are several families of sound features. Time domain features include the Zero Crossing Rate (ZCR) and Short Time Energy (STE). Spectral features include Linear Predictive Coding (LPC) coefficients, Relative Spectral Perceptual Linear Prediction (RASTA-PLP), pitch, sone, Spectral Flux (SF), and coefficients from basic time-to-frequency transforms (FFT, DFT, DWT, CWT, and the Constant Q-Transform). Cepstral domain features include the Mel Frequency Cepstral Coefficient (MFCC) and the Bark Frequency Cepstral Coefficient (BFCC) [2].
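As a hedged illustration of the feature families listed above (not code from the paper), the following sketch computes one time-domain and one cepstral feature with Librosa; the file name is a placeholder.

# Illustrative sketch only: one time-domain and one cepstral feature from the
# families listed above, computed with Librosa. '002.wav' is a placeholder name.
import librosa

sig, sample_rate = librosa.load('002.wav')

# Time-domain feature: Zero Crossing Rate (ZCR), one value per analysis frame.
zcr = librosa.feature.zero_crossing_rate(sig)

# Cepstral feature: Mel Frequency Cepstral Coefficients (MFCC), 40 per frame.
mfcc = librosa.feature.mfcc(y=sig, sr=sample_rate, n_mfcc=40)

print(zcr.shape, mfcc.shape)   # e.g. (1, n_frames) and (40, n_frames)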
TABLE I. MEL SPECTROGRAM CODE

sig, sample_rate = librosa.load('002.wav')
mel = librosa.feature.melspectrogram(y=sig, sr=sample_rate)

The above code (Table I) loads the "002.wav" file and generates the Mel spectrogram of Fig. 1 by calling the Librosa melspectrogram function.

1) MFCC

The mel-frequency cepstrum (MFC) [3] is a representation of the short-term power spectrum of a sound, derived from a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency. Note that the mel scale is a scale of pitches whose reference point is a 1000 Hz tone, 40 dB above the listener's threshold, assigned a pitch of 1000 mels (Fig. 2) [4].

Fig. 2. Mel Scale

The MFCC is derived in the following order [5]:

Signal → Pre-emphasis → Hamming window → Fast Fourier Transform → Mel filterbank → Log → Discrete Cosine Transform → Mel-frequency cepstral coefficients (MFCC)
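To make the chain above concrete, the following sketch walks through the same steps with Librosa and SciPy. It is illustrative only: the paper extracts MFCCs with Librosa's built-in function (shown as the last line), and the pre-emphasis coefficient and frame sizes here are assumed values.

import numpy as np
import scipy.fftpack
import librosa

sig, sr = librosa.load('002.wav')

# 1. Pre-emphasis (assumed coefficient 0.97).
emphasized = np.append(sig[0], sig[1:] - 0.97 * sig[:-1])

# 2-3. Hamming-windowed framing + FFT via the short-time Fourier transform.
power_spec = np.abs(librosa.stft(emphasized, n_fft=2048, hop_length=512,
                                 window='hamming')) ** 2

# 4-5. Mel filterbank, then log compression.
mel_spec = librosa.feature.melspectrogram(S=power_spec, sr=sr, n_mels=60)
log_mel = librosa.power_to_db(mel_spec)

# 6. Discrete Cosine Transform over the mel bands; keep the top 40 coefficients.
mfcc_manual = scipy.fftpack.dct(log_mel, axis=0, norm='ortho')[:40]

# Equivalent built-in call used in the rest of the paper's code:
mfcc_builtin = librosa.feature.mfcc(y=sig, sr=sr, n_mfcc=40)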
We have considered the Mel-scale spectrogram and MFCC as the audio features for our neural networks. The Mel scale's goal is to represent the non-linear perception of sound by humans: it is more discriminative at lower frequencies and less discriminative at higher frequencies. The resulting spectrogram is highly correlated. When the Discrete Cosine Transform (DCT) is applied to the Mel-scaled features, we get a compressed representation of the spectrogram that retains the most prominent features (the top 40 in our case) and decorrelates them [6]. The Mel-scale spectrogram gives the entire picture of the audio composition, while MFCC highlights the top features to focus on.

import glob, os
import numpy as np
import librosa

def extract_features(parent_dir, sub_dirs, file_ext="*.wav", bands=60, frames=41):
    '''Wrangle audio data into 60x41x2 frames.
    bands  = 60 = number of mel bands / MFCCs
    frames = 41 = number of windows from the audio signal
    Number of channels (like the R,G,B channels in images) = 2 =
    mel-spectrograms and their corresponding deltas.'''
    window_size = 512 * (frames - 1)
    log_specgrams = []
    labels = []
    for l, sub_dir in enumerate(sub_dirs):
        for fn in glob.glob(os.path.join(parent_dir, sub_dir, file_ext)):
            sound_clip, s = librosa.load(fn)
            label = l
            for (start, end) in windows(sound_clip, window_size):
                feature = []
                if len(sound_clip[start:end]) == window_size:
                    signal = sound_clip[start:end]  # window the original audio

                    # Log-scaled mel spectrogram of the power spectrum (60 bands)
                    D = np.abs(librosa.stft(signal))**2
                    melspec = librosa.feature.melspectrogram(S=D, n_mels=bands)
                    logspec = librosa.amplitude_to_db(melspec)
                    logspec = logspec.T.flatten()[:, np.newaxis].T
                    feature.extend(logspec)

                    # Mel spectrogram computed directly from the windowed signal
                    melspec = librosa.feature.melspectrogram(y=signal, n_mels=bands)
                    logspec = librosa.amplitude_to_db(melspec)
                    logspec = logspec.T.flatten()[:, np.newaxis].T
                    feature.extend(logspec)

                    # MFCC (60 coefficients)
                    melspec = librosa.feature.mfcc(y=signal, sr=s, n_mfcc=bands)
                    logspec = librosa.amplitude_to_db(melspec)
                    logspec = logspec.T.flatten()[:, np.newaxis].T
                    feature.extend(logspec)

                    # use perceptual linear prediction cepstral features (PLPs)
                    log_specgrams.append(feature)
                    labels.append(label)

The above code extracts the MFCC features and the delta (change) features of the MFCC.
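The extract_features routine above relies on a windows() helper that is not shown in the paper. A plausible minimal version, assuming half-overlapping windows (a common choice in [1], but an assumption here), is:

def windows(data, window_size):
    # Assumed helper, not shown in the paper: yields (start, end) index pairs
    # of half-overlapping windows over the audio clip.
    start = 0
    while start < len(data):
        yield int(start), int(start + window_size)
        start += window_size / 2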
C. Data Wrangling

We wrangle our audio data into a 60x41x2 array. Here 60 is the number of MFCC bands, 41 is the number of windows over the audio signal, and 2 is the number of channels. In order to have a fixed-size input for our varying-size audio files, we perform windowing over the audio.

In our case, we can feed a different feature into each channel. In the first channel we feed the MFCCs, and in the second channel we feed the delta features (a local estimate of the derivative of the MFCC) [1]; a minimal sketch of this two-channel packing follows this section. The CNN can then learn not only from the MFCCs but also from their deltas. Librosa has been used to extract the MFCC and delta features, and Keras has been used to implement the 5-layer CNN [7][8]. Apart from feeding only MFCC, we have also trained the neural network with both the Mel-scale spectrogram and the MFCC features. Another method employed to improve the accuracy is to use the Mel-scale spectrogram and MFCC features as inputs to separate neural networks and to bag the multiple neural networks, taking a majority vote based on the class probabilities.
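The two-channel packing can be sketched with Librosa's delta function as follows; the shapes are illustrative and this is not the paper's exact wrangling code.

# Illustrative two-channel packing: MFCCs in channel 0, their deltas in channel 1.
import numpy as np
import librosa

signal, sr = librosa.load('002.wav')
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=60)   # shape (60, n_frames)
delta = librosa.feature.delta(mfcc)                       # shape (60, n_frames)

# Crop/pad to 41 frames and stack as channels -> (60, 41, 2), the CNN input shape.
mfcc = librosa.util.fix_length(mfcc, size=41, axis=1)
delta = librosa.util.fix_length(delta, size=41, axis=1)
features = np.stack([mfcc, delta], axis=-1)
print(features.shape)   # (60, 41, 2)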
D. Metrics

We have used the accuracy score and the ROC (Receiver Operating Characteristic) metric to evaluate our classifiers. In healthcare applications, sensitivity is given higher importance: it is the percentage of unhealthy people that the model correctly predicts as unhealthy. Hence, we have also used sensitivity and specificity as metrics, as sketched below.
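A minimal sketch of how these metrics can be computed with scikit-learn for a binary healthy/pathological split follows; the label and score arrays are placeholders.

import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score, confusion_matrix

# Placeholder data: ground truth (0 = healthy, 1 = pathological) and model scores.
y_true = np.array([0, 1, 1, 0, 1, 1, 0, 1])
y_score = np.array([0.2, 0.8, 0.6, 0.4, 0.9, 0.3, 0.1, 0.7])
y_pred = (y_score >= 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("accuracy   :", accuracy_score(y_true, y_pred))
print("ROC AUC    :", roc_auc_score(y_true, y_score))
print("sensitivity:", tp / (tp + fn))   # true positive rate on unhealthy cases
print("specificity:", tn / (tn + fp))   # true negative rate on healthy cases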
E. Mobile Neural Networks – TensorFlow

The trained model is deployed to the mobile application through TensorFlow Lite; the deployment steps are described under Architecture Deployment in Section III.

III. ARCHITECTURE

We studied and applied various deep learning architectures: a 5-layer plain network, a 5-layer CNN, and an RNN.

A. Neural Network with a 5-layer architecture

We started with a plain neural network with a 5-layer architecture [9]; a minimal sketch of such a network appears after Fig. 3. Tuning the hyperparameters of the model improved the accuracy by 10%. Each parameter was tuned over the following values:

• regularization_rate = 0.1, 0.01, 0.001
• activation = 'tanh', 'relu'
• Number of hidden nodes = 64, 40, 100

The values that gave stable and good performance were:

• regularization_rate = 0.001
• activation = 'tanh'
• Number of hidden nodes = 64

Table II and Fig. 3 display the test accuracies for various epoch values.

TABLE II. TEST ACCURACY DATA

Epochs      Accuracy
10000       47%
50000       45%
100000      50%
200000      54%

Fig. 3. Accuracy vs Epochs
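The sketch below illustrates the kind of plain fully connected network described above, configured with the tuned values (tanh activations, 64 hidden nodes, L2 regularization of 0.001). The exact layer widths and the output size are not given in the paper, so they are assumptions here.

from keras.models import Sequential
from keras.layers import Flatten, Dense
from keras.regularizers import l2

bands, frames, num_channels = 60, 41, 2
num_labels = 4   # assumption: normal + three disorder categories
reg = l2(0.001)  # tuned regularization_rate

# Four tanh hidden layers plus a softmax output, approximating the "5-layer" plain network.
model = Sequential()
model.add(Flatten(input_shape=(bands, frames, num_channels)))
model.add(Dense(64, activation='tanh', kernel_regularizer=reg))
model.add(Dense(64, activation='tanh', kernel_regularizer=reg))
model.add(Dense(64, activation='tanh', kernel_regularizer=reg))
model.add(Dense(64, activation='tanh', kernel_regularizer=reg))
model.add(Dense(num_labels, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam',
              metrics=['accuracy'])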
B. 5-layer Convolutional Neural Network (CNN)

We moved to a CNN for better accuracy. We have implemented a 5-layer CNN [1][5][10] to classify our audio dataset. This architecture has three convolution + max-pooling layers and one fully connected dense layer. We have used 2D convolutions and 2D max pooling. The CNN was trained for 1000 epochs and achieved an accuracy of 93% on the test data and an ROC of 0.99.

1) Code

from keras.models import Sequential
from keras.layers import Convolution2D, MaxPooling2D, Activation, LeakyReLU

model = Sequential()
filter_size = 3   # tuned 3x3 convolution filters (see "2) Parameters" below)

# Layer 1 - Convolution + Max pooling
model.add(Convolution2D(64, (filter_size, filter_size), border_mode='valid',
                        input_shape=(bands, frames, num_channels), strides=(1, 1)))
model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))
# model.add(BatchNormalization(axis=1))
model.add(Activation('tanh'))

# Layer 2 - Convolution + Max pooling
model.add(Convolution2D(64, (filter_size, filter_size)))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(LeakyReLU(alpha=0.01))

# Layer 3 - Convolution
model.add(Convolution2D(64, (filter_size, filter_size), border_mode='valid'))
model.add(LeakyReLU(alpha=0.01))
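The fragment above covers only the convolutional layers. A hedged completion of the described architecture (flatten, the single fully connected dense layer, softmax output, and the Adam optimizer selected below) could look like the following; the dense layer width and the number of output labels are assumptions, since the paper does not show this part of the code.

from keras.layers import Flatten, Dense
from keras.optimizers import Adam

num_labels = 4   # assumption: normal + three disorder categories

# Flatten the convolutional features, add the single dense layer described in
# the text, and compile with the tuned optimizer and learning rate.
model.add(Flatten())
model.add(Dense(128, activation='tanh'))   # assumed width of the dense layer
model.add(Dense(num_labels, activation='softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer=Adam(lr=0.001),
              metrics=['accuracy'])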
2) Parameters

Below are the parameter values experimented with:

• regularization_rate = 0.1, 0.01, 0.001
• learning rate = 0.1, 0.01, 0.001
• activation = 'tanh', 'relu', LeakyReLU, ELU, PReLU
• Number_nodes = 64, 40, 100
• Convolution filter size = 1, 2, 3
• Max pooling filter size = 2x2, 4x2
• Optimizers: Adam, SGD
• SGD_momentum: between 0.5 and 0.9

We tuned one parameter at a time with the rest of them fixed (a sketch of this procedure follows the list below). The optimized parameter values achieved are:

• Convolution_filter_size = 3x3
• pooling_filter_size = 2x2
• Convolution stride = 1x1
• pooling stride = 2x2
• learning rate = 0.001
• regularization rate = 0.001
• optimizer = Adam
• activation = LeakyReLU
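The one-parameter-at-a-time sweep described above can be sketched as a simple loop. This is an illustration of the procedure rather than the paper's tuning harness; build_cnn and the data variables are hypothetical placeholders.

# build_cnn, X_train, y_train, X_val, y_val are hypothetical placeholders.
results = {}
for filter_size in (1, 2, 3):                     # sweep one parameter ...
    model = build_cnn(filter_size=filter_size,    # ... with the others fixed
                      learning_rate=0.001,
                      regularization_rate=0.001,
                      optimizer='adam')
    model.fit(X_train, y_train, epochs=1000, verbose=0)
    _, acc = model.evaluate(X_val, y_val, verbose=0)
    results[filter_size] = acc
print(sorted(results.items(), key=lambda kv: kv[1], reverse=True))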
The CNN (see Fig. 4) achieved 94% accuracy when trained on only one feature (MFCC) and 93% accuracy when trained on the two features. It achieved a sensitivity of 97% and a specificity of 94% on the test partition of the training dataset, and a sensitivity of 96% and a specificity of 18% on the testing dataset. Below are the parameters tuned to achieve this accuracy (see Table III and Fig. 5).

Fig. 4. 5-layer CNN Architecture

TABLE III. PARAMETERS – ACCURACY – ROC

Parameters                                           Accuracy   ROC
Convolution_filter_size 1x1                          48%        0.741
Convolution_filter_size 2x2                          76%        0.939
Convolution_filter_size 3x3                          82%        0.959
Max_Pooling_filter_size 2x2                          84%        0.976
Learning rate = 0.01                                 69%        0.918
SGD_momentum = 0.9                                   90%        0.986
Adam optimizer                                       93%        0.994
Tanh for all layers                                  94%        0.994
Tanh for 1st layer & ReLU for rest of layers         94%        0.996
Tanh for 1st layer & LeakyReLU for rest of layers    93%        0.99

3) Parameter intuition

A larger convolution filter size gives a larger receptive window over the audio pattern. A smaller pooling filter size helps us extract the right local features; a larger pooling filter size might miss the local features retrieved in the convolution step. Reducing the learning rate from 0.001 affects the model's learning negatively.

Optimizers that aid CNN learning are Adam and SGD; for our data, Adam pushes the accuracy higher. It is a general practice to use ReLU as the activation function in the hidden layers and tanh for the input layer. Leaky ReLU solves the problem of ReLU where, if a ReLU's output is consistently zero, the gradient is zero and the back-propagated error is multiplied by zero, meaning the ReLU has died. As we can see, this again impacts the ROC positively. Our final test accuracy achieved is 93%.

Fig. 5. Parameters vs Accuracy & ROC

C. Recurrent Neural Network (RNN)

RNNs are used in applications where the data has sequential information with respect to time. An RNN's nodes not only process the current inputs but also have a function to store the weights of past inputs. LSTM (long short-term memory) networks are popular for their performance on natural language processing tasks. An LSTM is similar to an RNN in architecture but uses a different function to record the previous input states. As a result, an LSTM can capture longer-term dependencies in a sequence than a plain RNN.
The architecture used for the LSTM is a 4-layer network. We used the parameters below to tune our RNN, which gave an accuracy of 90%:

• recurrent_dropout = 0.35
• dropout = 0.5
• optimizer = Adam
• learning rate = 0.001

We achieved a sensitivity of 8% and a specificity of 97.7% on the RNN prediction evaluation. The CNN performed better than the RNN with respect to these metrics.

Code:

from keras.models import Sequential
from keras.layers import LSTM, Dropout, Flatten, Dense
from keras.optimizers import Adam

model = Sequential()
model.add(LSTM(units=128, return_sequences=True,
               recurrent_dropout=0.35, input_shape=(bands, frames)))
model.add(Dropout(0.5))
model.add(LSTM(units=128, return_sequences=True, recurrent_dropout=0.35))
model.add(Dropout(0.5))
model.add(LSTM(units=128, return_sequences=True, recurrent_dropout=0.35))
model.add(Dropout(0.5))
model.add(Flatten())
model.add(Dense(num_labels, activation='softmax'))
# sgd = SGD(lr=0.001, momentum=0.9, decay=0.0, nesterov=False)
adam = Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0)
model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer=adam)

D. Loss function

A loss function is an important part of an artificial neural network; it is used to measure the inconsistency between the predicted value and the actual label. It is a non-negative value, and the robustness of the model increases as the value of the loss function decreases. The loss function is the core of the empirical risk function as well as a significant component of the structural risk function.

1) Leaky ReLU

Leaky ReLU is used to solve the dying ReLU problem. The Leaky ReLU function is an improved version of the ReLU function; the leak helps to increase the range of the ReLU function. Usually the leak value is around 0.01; when it is not fixed at 0.01, the function is called a Randomized ReLU. The range of the Leaky ReLU is therefore (-infinity, infinity). Leaky ReLU functions are monotonic in nature, and their derivatives are also monotonic.

For the ReLU function the gradient is 0 for x < 0, which makes the neurons die for activations in that region. Leaky ReLU is defined to address this problem: instead of defining the ReLU function as 0 for x less than 0, we define it as a small linear component of x. The main advantage of this is to remove the zero gradient, so the gradient for negative values is non-zero and we no longer encounter dead neurons in that region. This also speeds up the training process of the neural network.

However, in the case of a parameterized ReLU function, y = ax when x < 0, and the network learns the parameter value of 'a' for faster and more optimal convergence. The parametrized ReLU function is used when the Leaky ReLU function still fails to solve the problem of dead neurons and the relevant information is not successfully passed to the next layer.
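For reference, here is a plain-NumPy sketch of the three activations discussed in this subsection; it is illustrative only (the model itself uses the Keras LeakyReLU layer), and the slope values are the usual defaults.

import numpy as np

def relu(x):
    return np.maximum(0.0, x)              # gradient is 0 for x < 0 ("dying ReLU")

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)   # small fixed slope for x < 0

def prelu(x, a):
    return np.where(x > 0, x, a * x)       # slope 'a' is learned by the network

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x), leaky_relu(x), prelu(x, a=0.25))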
TABLE IV. MODEL EVALUATION RESULTS

NETWORK ARCHITECTURE            SENSITIVITY   SPECIFICITY
5-layer plain neural network    12%           96%
5-layer CNN                     96%           18%
4-layer RNN                     8%            97.7%

E. Architecture Deployment

The 5-layer CNN is deployed into the mobile application by freezing the model, converting it into TFLite, and deploying it.

TensorFlow implementation file (Android Studio Gradle module descriptor):

<?xml version="1.0" encoding="UTF-8"?>
<module external.linked.project.id="AppAssist_TensorFlow_Android"
        external.linked.project.path="$MODULE_DIR$"
        external.root.project.path="$MODULE_DIR$"
        external.system.id="GRADLE" type="JAVA_MODULE" version="4">
  <component name="FacetManager">
    <facet type="java-gradle" name="Java-Gradle">
      <configuration>
        <option name="BUILD_FOLDER_PATH" value="$MODULE_DIR$/build" />
        <option name="BUILDABLE" value="false" />
      </configuration>
    </facet>
  </component>
  <component name="NewModuleRootManager"
             LANGUAGE_LEVEL="JDK_1_8" inherit-compiler-output="true">
    <exclude-output />
    <content url="file://$MODULE_DIR$">
      <excludeFolder url="file://$MODULE_DIR$/.gradle" />
    </content>
    <orderEntry type="inheritedJdk" />
    <orderEntry type="sourceFolder" forTests="false" />
  </component>
</module>
Please note: the deployment requires JDK 1.8.

We have loaded the model by copying the CNN output model into the assets folder and loading it. A hedged sketch of the freeze-and-convert step that produces this model file is shown below.
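The exact conversion commands are not given in the paper; one plausible path from the trained Keras CNN to a .tflite file placed in the Android assets folder is sketched below. The file names are placeholders, and the TFLiteConverter entry point differs between TensorFlow 1.x and 2.x.

import tensorflow as tf

# Assume the trained 5-layer CNN was saved from Keras as an HDF5 file.
keras_model = tf.keras.models.load_model('femh_cnn.h5')

# Convert the model to TensorFlow Lite.
converter = tf.lite.TFLiteConverter.from_keras_model(keras_model)
tflite_model = converter.convert()

# Write the flatbuffer; copy it into the Android project's assets folder,
# e.g. app/src/main/assets/femh_cnn.tflite, and load it on the device.
with open('femh_cnn.tflite', 'wb') as f:
    f.write(tflite_model)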
IV. CONCLUSION

We have observed from our work that neural networks perform better than simpler machine learning algorithms for audio classification (see Table IV). The CNN consistently performs better than the RNN when provided with the necessary sequential information. Activation functions like LeakyReLU and PReLU have resulted in a noticeable improvement in the sensitivity and specificity of the model, which are more important than the accuracy of the model. The best model was a 5-layer CNN trained with MFCC and Mel-spectrogram features; it had a sensitivity of 96% and a specificity of 18% on the test data. Deployment of our model into a mobile application enables clinicians to detect voice disorders early, especially in rural settings, and to proactively address healthcare issues.

V. FUTURE WORK

Our model has achieved good accuracy on the out-of-sample dataset. However, the balance between sensitivity and specificity, and the UAR (Unweighted Average Recall), has been average. In order to improve these metrics, we could increase the number of filters (128, 256) used in the CNN to perform better feature extraction and increase the number of epochs to improve the training accuracy. We could also experiment with alternate features or with different shapes for the Mel-spectrogram and the MFCC component count.

On the mobile application side, we have condensed our model; in the next versions, we would like to expand its performance characteristics and the model itself.

ACKNOWLEDGMENT

We sincerely thank the team at Far Eastern Memorial Hospital (FEMH) for providing the valuable voice data, without which the development of the neural networks would have been impossible. We acknowledge and sincerely credit the support of FEMH. Additionally, we thank the management of Hanumayamma Innovations and Technologies, Inc., for the active support and resources they provided for this challenge.

REFERENCES

[1] K. J. Piczak, "Environmental sound classification with convolutional neural networks," 2015 IEEE International Workshop on Machine Learning for Signal Processing (MLSP), Boston, USA, Sept. 17-20, 2015.
[2] D. Mitrovic, M. Zeppelzauer and C. Breiteneder, "Discrimination and retrieval of animal sounds," 2006 12th International Multi-Media Modelling Conference, Beijing, 2006, doi: 10.1109/MMMC.2006.1651344.
[3] Z. Le-Qing, "Insect Sound Recognition Based on MFCC and PNN," 2011 International Conference on Multimedia and Signal Processing, Guilin, China, 2011, pp. 42-46, doi: 10.1109/CMSP.2011.100.
[4] The mel scale as a function of frequency, from Appleton and Perera, eds., The Development and Practice of Electronic Music, Prentice-Hall, 1975, p. 56; after Stevens and Davis, Hearing; used by permission.
[5] J. Salamon and J. P. Bello, "Deep Convolutional Neural Networks and Data Augmentation for Environmental Sound Classification," IEEE Signal Processing Letters, accepted November 2016.
[6] H. Fayek, "Speech Processing for Machine Learning: Filter banks, Mel-Frequency Cepstral Coefficients (MFCCs) and What's In-Between," Apr. 21, 2016, https://haythamfayek.com/2016/04/21/speech-processing-for-machine-learning.html, accessed on: 11/11/2018.
[7] Mel Frequency Cepstral Coefficient (MFCC) tutorial, http://practicalcryptography.com/miscellaneous/machine-learning/guide-mel-frequency-cepstral-coefficients-mfccs/, accessed on: 11/11/2018.
[8] J. Pons and X. Serra, "Randomly weighted CNNs for (music) audio classification," arXiv:1805.00237 [cs.SD].
[9] A. Mesaros, T. Heittola, and T. Virtanen, "A multi-device dataset for urban acoustic scene classification," arXiv:1807.09840 [eess.AS].
[10] J. Lee, T. Kim, J. Park, and J. Nam, "Raw Waveform-based Audio Classification Using Sample-level CNN Architectures," arXiv:1712.00866 [cs.SD].
