
2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)

Multimodal Lung Disease Classification using Deep Convolutional Neural Network

Zeenat Tariq, Sayed Khushal Shah, Yugyung Lee

School of Computing and Engineering, University of Missouri-Kansas City, USA
zt2gc@mail.umkc.edu, ssqn7@mail.umkc.edu, leeyu@umkc.edu

Abstract—Lung disease is the most common cause of severe illness and death in the world. Early diagnosis and treatment of the disease are of great importance in the medical field, and computer-assisted systems for lung disease recognition are effective methods to help physicians diagnose diseases. Therefore, this paper studies the multimodal recognition of lung sounds using spectrograms. Based on the classification of lung diseases by deep convolutional neural networks, an integrated network, the Multimodal Lung Disease Classification (MLDC) model, was used with advanced pre-processing techniques to reach classification accuracy acceptable in the medical field. The research has three main contributions. First, we performed data pre-processing using two techniques, data normalization and data augmentation: the data were normalized by removing unwanted noise and adjusting the peak values of the sound signal, and because the publicly available data was insufficient for training, we applied advanced data augmentation techniques to generate additional data without affecting the categories. Second, we extracted spectrograms from the lung sounds and used them both as features and as images for signal and image processing. Finally, we created an integrated model for the high-performance classification of lung diseases. We compared the audio- and spectrogram-image-based results and found the image-based approach to be cost-effective, efficient, and reliable.

Index Terms—Spectrogram, Deep Learning, Image Classification, Audio Classification, Data Normalization, Data Augmentation, Convolutional Neural Network

I. INTRODUCTION

The early diagnosis and timely treatment of lung disease are key to controlling the rising illness and death rate [1], [2]. Lung sounds convey relevant information for determining the pulmonary condition of a patient. Physicians use the auscultation technique to acquire lung sounds and extract information for identifying the disease that affects the organ [3]; the disorders may be classified as pneumonia, asthma, and bronchiectasis. Auscultation is a non-invasive procedure that helps reduce the time needed to diagnose the disease, which increases treatment efficiency [4]. However, due to the complexity of sound characteristics, there is a high risk of missed information, leading to wrong diagnoses.

Lung sounds are naturally non-stationary signals, and the auscultation method does not always give an efficient response in the acquisition of sound. The stethoscope instrument used for auscultation is a sound channel between the body surface and the ear [5]. Identifying a sound as normal or abnormal sometimes becomes difficult due to the presence of internal or external noise and the limited sensitivity of the ear, which may affect the diagnosis results. Thus, to reduce the subjectivity of lung sound assessment, an automated algorithm for diagnosing diseases through lung sounds provides an efficient method for clinical diagnosis.

Machine learning plays an essential role in classifying different types of sounds through multiple algorithms [6]. In particular, deep learning (DL) is a branch of machine learning that supports enhanced detection of respiratory diseases from sounds acquired through auscultation [7]. These learning techniques are among the fastest-growing fields today in machine translation, image and signal recognition, and beyond. With the continuous development and innovation of image-processing algorithms in the medical field, deep learning has become an important research direction. Among the various techniques used for lung sound analysis, the spectrogram is one applied to study lung disease. Health care expenses in the United States are growing at almost twice the rate seen in most developing countries around the globe [8]; lung disease classification using spectrograms would help reduce the cost of treatment relative to diagnosis through CT scans.

In this paper, we aim to extend our previous work [9] on lung disease classification through lung sounds. Previously, we proposed a uniquely designed model built on a popular deep learning network, the Convolutional Neural Network (CNN). More specifically, we introduced advanced pre-processing techniques, i.e., normalization and augmentation, for adequate classification. Since deep learning relies on a large amount of data, augmentation is an efficient method to overcome the limited publicly available data.

Our main contribution in this paper is a multimodal lung disease classification method that uses the spectrogram for both audio and image classification. The spectrogram is regarded as an image that contains the texture representation of the time, phase, and frequency of lung sounds [10]. Image-processing techniques are applied to the normalized spectrograms extracted from the lung sounds.

We have evaluated the spectrograms with various approaches to obtain the best performance acceptable in the medical field. The proposed method improves the visualization of different types of lung diseases and passes the image to our designed network model. The process is performed in three folds. First, we completed the data pre-processing on the lung sound data using advanced data normalization and data augmentation techniques: the lung sound data was normalized by removing unwanted noise and adjusting the signal's peak values, and since deep learning relies on a large amount of data for efficient training and the publicly available data was not sufficient, we used advanced augmentation techniques to generate data without affecting the category of diseases. Second, we generated the spectrograms from the lung sounds and used them for both audio and image classification. Finally, we integrated the Multimodal Lung Disease Classification (MLDC) model with advanced data pre-processing techniques for high performance.

The rest of the paper is organized as follows. Section II reviews related work on data normalization, data augmentation, and deep learning techniques for sound and image classification. Section III details the methodology: data normalization, data augmentation, the generation of the spectrogram, and the Convolutional Neural Network model. Section IV examines the experimental setup, results, and evaluation. Conclusions and future work are discussed in Section V.

II. RELATED WORK

Lung sounds and the signals they produce play an important role in identifying different diseases. The work in [11] classifies lung sounds using a publicly available dataset with three categories: wheezes, crackles, and normal sounds. The authors proposed a detection method using the optimized S-transform (OST) and deep residual networks, also known as ResNets. They used a visual-based approach, while we recommend an audio-based approach to reduce the cost of heavy machines such as CT scanners.

The review in [12] covers several feature extraction and classification techniques for obstructive pulmonary diseases such as COPD and asthma. The surveyed work involves traditional and deep learning classifiers such as K-Nearest Neighbor (KNN), Artificial Neural Network (ANN), Deep Neural Network (DNN), and Convolutional Neural Network (CNN), as well as signal-based feature extraction such as the Fast Fourier Transform (FFT), Short-Time Fourier Transform (STFT), spectrograms, and wavelet transforms.

Several data augmentation techniques have previously been applied to classify different types of sounds [13]–[15]. The research in [16] proposed data augmentation for environmental sounds through random time delays and pitch shifting; Mel spectrograms were extracted from all audio samples, resampled, and normalized with different window sizes, and the authors applied a 1D convolutional neural network with the data augmentation. In contrast, we apply data augmentation and data normalization to spectrogram images with a 2D convolutional neural network, which outperforms the accuracy results claimed in their paper. Another augmentation technique for environmental sound [17] used a deep convolutional neural network, deforming the audio through time stretching, pitch shifting, dynamic range compression, and added background noise. Our research emphasizes data normalization and augmentation applied to spectrogram images, and we outperformed the reported results for lung disease classification.

An automatic solution for lung and heart disease classification in [18] used audio samples and can be considered a baseline for this research, although the dataset was too limited to draw firm conclusions. The authors trained a convolutional neural network to detect heart and lung disease automatically and showed promising results for lung sound classification. We made the dataset larger than the original one and achieved better results for both audio and image spectrogram classification of lung diseases.

The main limitation of the research described in [19] is insufficient training data due to unavailability; we address the problem of scarce and uncleaned data for effective treatment. A limitation of the work in [20] is the imbalanced dataset used for heart signal classification, where the authors assumed only one type of disease per segment. Our approach focuses on deep learning classification based on an audio-visual approach and also applies advanced normalization and augmentation techniques for efficient training, because deep learning needs a large amount of data for appropriate training to reach the high accuracy acceptable in the medical field.

III. METHOD

The overall view of our proposed method is shown in Figure 1. The proposed method consists of a dynamic structure that extracts the features and parameters, eliminates redundant and inexpressive data, overcomes the issue of the limited data available for deep learning, and applies signal and image processing to identify the diseases. Our model is divided into three stages: 1) pre-processing of the lung sounds, 2) generation of the spectrogram for multi-modality, and 3) the classification model. The details of each stage are given below.

A. Pre-processing

For advanced classification, pre-processing is performed in two steps, i.e., data normalization [21] and data augmentation [22].

1) Data Normalization: Previously, after several evaluations, we chose the three normalization techniques that outperformed the others on the lung sound training dataset [9]: 1) Root Mean Square, 2) Peak, and 3) European Broadcasting Union Standard R128 (EBU).

Fig. 1. An overview of the proposed Multi-modal Lung Disease Classification.

• Root Mean Square (RMS): The RMS level is useful for finding the signal strength based on the amplitude, regardless of the signal's positive or negative values. The RMS level averages the squared signal amplitude, so it does not behave as the arithmetic mean of the received signal. For a given signal x = x_1, x_2, ..., x_n, the RMS value x_rms is

    x_{rms} = \sqrt{\overline{x^2}} = \sqrt{\frac{1}{n}\left(x_1^2 + x_2^2 + \dots + x_n^2\right)}    (1)

  Signal amplitude normalization is only possible if we can determine the scaling factor that performs the linear gain change; note that it is possible to scale a signal to an amplitude higher than 1, or below 0 decibels (dB). We can rearrange the RMS level formula to apply the linear gain change, as shown in Equation 2, where R is the target RMS level on a linear scale:

    R = \sqrt{\frac{1}{n}\left[(ax_1)^2 + (ax_2)^2 + \dots + (ax_n)^2\right]}
    R^2 = \frac{1}{n}\left[(ax_1)^2 + (ax_2)^2 + \dots + (ax_n)^2\right]
    nR^2 = (ax_1)^2 + (ax_2)^2 + \dots + (ax_n)^2
    a = \sqrt{\frac{nR^2}{x_1^2 + x_2^2 + \dots + x_n^2}}    (2)

• Peak: In peak normalization, the peak signal level is analyzed in decibels relative to full scale (dBFS). Normalization amplifies the volume of the signal so that the output reaches a maximum of 0 dB; due to this, the signal has high volume peaks. Even after peak normalization, some of the signals remain quiet, and the quality of those signals cannot be improved further. The process scales the amplitude of all input audio signals so that the highest amplitude of each signal has a value of 1. The output signal under this scaling can be calculated as

    out = \frac{1}{\max(|in|)} \cdot in    (3)

• European Broadcasting Union R128 (EBU): EBU Standard R128 normalization focuses on measuring the average loudness of a program and normalizing the audio signals accordingly. It commonly brings the loudness toward a unit scale and is, in theory, considered better than RMS and peak normalization for the audio domain [23].
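As a concrete illustration, the three techniques above can be sketched in a few lines of Python. This is a minimal sketch rather than our production code: the target RMS level and target loudness are placeholder values, and the EBU R128 step assumes the third-party pyloudnorm package.

```python
import numpy as np
import pyloudnorm as pyln  # assumed third-party package for EBU R128

def rms_normalize(x, target_rms=0.1):
    # Eq. (2) reduces to a linear gain a = R / rms(x) for target level R.
    rms = np.sqrt(np.mean(np.square(x)))
    return x * (target_rms / rms) if rms > 0 else x

def peak_normalize(x):
    # Eq. (3): out = in / max(|in|), so the highest amplitude becomes 1 (0 dBFS).
    peak = np.max(np.abs(x))
    return x / peak if peak > 0 else x

def ebu_normalize(x, sr, target_lufs=-23.0):
    # Measure integrated loudness per EBU R128 and normalize to the target.
    meter = pyln.Meter(sr)
    loudness = meter.integrated_loudness(x)
    return pyln.normalize.loudness(x, loudness, target_lufs)
```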
2) Data Augmentation: We applied data augmentation to generate additional lung sound data and experimented with the three techniques best suited to it: time stretching, pitch shifting, and dynamic range compression (a sketch of these deformations follows the list below).

• Pitch Shifting: In this data augmentation technique, the pitch of the audio samples is either decreased or increased by four values (semitones) [24]. With a pitch shifting factor a_shift, the artificial training data generated is N_aug times larger than the original lung sound data. The duration of the audio samples is kept constant, matching the original audio samples, i.e., 10-90 seconds. For our experimentation, the change in semitones lay in the interval [-a_shift, a_shift] for each signal, with the values (-2, -1, 1, 2).

• Time Stretching: Like pitch shifting, the lung sound signals are stretched horizontally along the time axis by a scaling factor a_stre > 0. We sampled values in the interval [1, a_stre] if a_stre >= 1, or [a_stre, 1] otherwise, keeping the pitch and other factors the same. The four speed factors were (0.5, 0.7, 1.2, 1.5), used alongside the original files.

• Dynamic Range Compression: This method compresses the audio sample according to its dynamic range, either by amplifying the quiet parts of the sample or by reducing the volume of the loud sounds. Among the four input parameters, three are taken from the Dolby E standard and one from the Icecast live-streaming radio server.
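A sketch of the pitch-shifting and time-stretching deformations, using the semitone and speed values listed above, might look as follows; the input file name is hypothetical, and dynamic range compression is left as a comment since its Dolby E / Icecast parameter values are tool-specific.

```python
import librosa

y, sr = librosa.load("lung_sound.wav", sr=None)  # hypothetical input file

# Pitch shifting: four semitone offsets, duration kept unchanged.
pitch_variants = [librosa.effects.pitch_shift(y, sr=sr, n_steps=s)
                  for s in (-2, -1, 1, 2)]

# Time stretching: four speed factors, pitch kept unchanged.
speed_variants = [librosa.effects.time_stretch(y, rate=r)
                  for r in (0.5, 0.7, 1.2, 1.5)]

# Dynamic range compression could be added with, e.g.,
# pydub.effects.compress_dynamic_range, configured with the Dolby E and
# Icecast parameters described above.
```

With four pitch offsets and four speed factors, each recording already yields eight variants; together with the compression settings, this appears consistent with the roughly thirteen-fold growth shown in Table I.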
B. Spectrogram Generation

The spectrogram is a visual representation of a signal in the time-frequency domain. It is generated by

applying the short-time Fourier transform (STFT) [25]. According to the theorem, the spectral variation of a non-stationary signal may not be seen through a single Fourier analysis. To overcome this issue, the spectrogram treats the signal as stationary by computing the Fourier transform of the signal segmented into slices. Hence the spectrogram is also called the STFT, which can be calculated as

    STFT_x(t, f) = \int_{-\infty}^{\infty} x(t)\, w(t - \tau)\, e^{-j 2 \pi f t}\, dt    (4)

where x(t) is the time-domain signal, \tau is the time localization of the STFT, and w(t - \tau) is a window function that cuts and filters the signal. The length of the window function must be selected and adjusted according to the signal's length because it affects the time and frequency resolution [26]. We transformed the spectrogram into a grayscale image and used image processing methods to extract the information.

• Scaling Process: The scaling process is applied to the spectrogram to expand its values into the range 0-255, because the range of a spectrogram is usually wide. The scaling is done in a linear manner, which can be expressed as

    S(m, n) = \frac{|Spec(m, n)|}{\max|Spec|} \times 255    (5)

where Spec(m, n) is a value of the spectrogram and S(m, n) is the expanded value from the spectrogram.
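The generation and scaling steps map directly onto a short routine; this is a sketch, with the FFT size and hop length chosen as assumptions that approximate the 23 ms windows used later.

```python
import numpy as np
import librosa

def spectrogram_image(y, n_fft=512, hop_length=512):
    # Magnitude STFT (Eq. 4): slice the signal with a window and transform.
    spec = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop_length))
    # Linear scaling to 0-255 (Eq. 5), then cast to an 8-bit grayscale image.
    gray = spec / (np.max(spec) + 1e-12) * 255.0
    return gray.astype(np.uint8)
```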
C. Classification Model

The convolutional neural network has become a significant trend in deep learning and is growing very fast in audio and image recognition, computer vision, and many other fields. A CNN is composed of convolutional, pooling, and fully connected layers. In this paper, we explore the power of the CNN for lung disease classification and implement a 2D convolutional neural network with Keras. A convolutional neural network has two primary components: a feature extractor and a classifier. The feature extractor extracts the spectrogram features from the audio signal and passes them to the classifier, which assigns the signals to their appropriate categories. The classifier consists of several convolutional and pooling layers, each followed by an activation, plus fully connected layers with hidden units. The mathematical form of the convolutional layers is given in Equations 6 and 7:

    x_{i,j,k}^{l} = \sum_{a}\sum_{b}\sum_{c} w_{a,b,c}^{(l-1,f)}\, y_{i+a,\, j+b,\, k+c}^{(l-1)} + bias^{f}    (6)

    y_{i,j,k}^{l} = \sigma\left(x_{i,j,k}^{(l)}\right)    (7)

The output layer is represented by y_{i,j,k}^{l}, where i, j, k index the 3-dimensional input tensor. The weights for the filters are denoted by w_{i,j,k}^{(l)}, and \sigma(x_{i,j,k}^{(l)}) describes the sigmoid function used for the activation. The fully connected layer is represented by Equations 8 and 9:

    x_{j}^{l} = \sum_{i} x_{i}^{l-1}\, w_{i,j}^{l-1} + bias_{j}^{l-1}    (8)

    y_{i,j,k}^{l} = \sigma\left(x_{i,j,k}^{l}\right)    (9)

The 2D CNN architecture, shown in Figure 2, is composed of 5 layers: three convolutional layers enclosed by max-pooling layers, followed by two fully connected layers. During the extraction of spectrogram features, we used a window size and hop size of 23 ms; since the sound clips vary between 10 and 90 seconds, we kept the extraction at 3 seconds so that every bit of each sound clip is usable. The input from the sound clips is reshaped, and the shape X \in R^{128 \times 128} is provided to the classifier.

The first layer takes the reshaped features as input in the form of spectrograms with 24 filters, giving the shape [24x1x5x5]. The stride in this layer is [4x2] with ReLU as the activation function. The second layer has 48 filters of shape [48x24x5x5] with a [4x2]-stride max-pooling layer and ReLU activation. The third layer also takes 48 filters with a [5x5] receptive field, resulting in the shape [48x48x5x5], with ReLU activation and no pooling. Finally, the fourth layer has 64 hidden units, resulting in the shape [2000x64] with ReLU activation, and the output layer is [64x10] with softmax activation. We considered a small [5x5] receptive field in the top layers because of the localized patterns in the spectrograms.
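Read as Keras code, the architecture above could be sketched as follows. The padding, optimizer, and loss are assumptions the paper leaves unspecified, so the exact fully connected shape ([2000x64] in the text, [2400x64] in Table II) depends on those choices.

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(128, 128, 1)),                             # reshaped spectrogram
    layers.Conv2D(24, (5, 5), strides=(4, 2), activation="relu"),  # first layer
    layers.MaxPooling2D(pool_size=(4, 2)),
    layers.Conv2D(48, (5, 5), padding="same", activation="relu"),  # second layer
    layers.MaxPooling2D(pool_size=(4, 2)),
    layers.Conv2D(48, (5, 5), padding="same", activation="relu"),  # third layer, no pooling
    layers.Flatten(),
    layers.Dense(64, activation="relu"),                           # fourth layer
    layers.Dense(10, activation="softmax"),                        # fifth layer
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```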
IV. EXPERIMENTAL WORKS

A. Dataset

The major drawback of lung disease classification is the limited amount of publicly available data for training. For classification, we used the only publicly available dataset, known as the Respiratory Sound dataset [27]. The sound recordings employed in this study come from 126 patients in Portugal and Greece and consist of lung audio for different diseases, i.e., Healthy, Asthma, chronic obstructive pulmonary disease (COPD), Bronchiectasis, URTI, LRTI, and Pneumonia. Each audio file carries the recording index, patient number, location on the chest, and the device used to record it. Further, to generate an image dataset, spectrograms were generated and arranged in the same categories as the original recordings; the images were created to compare an audio and an image dataset on the same task and sounds. The audio files include both clean and noisy lung recordings.
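Since this metadata is encoded in the file names, a small parser can recover it. The sketch below assumes the naming convention of the public dataset (e.g., "101_1b1_Al_sc_Meditron.wav" for patient, recording index, chest location, acquisition mode, and device); the layout is an assumption, not something the paper specifies.

```python
from pathlib import Path

def parse_recording(path):
    # Assumed layout: patient_recordingIndex_chestLocation_mode_device.wav
    patient, rec_index, location, mode, device = Path(path).stem.split("_")
    return {"patient": patient, "recording_index": rec_index,
            "chest_location": location, "mode": mode, "device": device}
```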
B. Experimental Setup

To perform the data classification, we designed our model with advanced techniques. We used an NVIDIA GPU server with four 11 GB GTX 1080 Ti cards and 16 GB of RAM, and we evaluated our model's efficiency on the publicly available respiratory sound dataset. The experimentation was performed with two different strategies, i.e., an audio feature-based approach and an audio-to-image spectrogram-based approach.

Fig. 2. Convolutional Neural Network Model

TABLE I
ORIGINAL AND AUGMENTED DATA SIZE

| ID | Name of Disease | Data Size | Augmented Data Size |
| 1  | Asthma          | 1         | 13                  |
| 2  | Bronchiectasis  | 29        | 377                 |
| 3  | COPD            | 785       | 10,205              |
| 4  | Healthy         | 35        | 455                 |
| 5  | LRTI            | 2         | 26                  |
| 6  | Pneumonia       | 37        | 481                 |
| 7  | URTI            | 31        | 403                 |
|    | Total           | 920       | 11,960              |

The overall working of the system is discussed in detail in Section III. For the classification results, we considered the two scenarios discussed above using the same model, compared the results for both cases, and acquired the classification accuracy. We also compared our training and testing accuracy. To avoid underfitting and overfitting, we used early stopping in our code to get accurate results for the designed network, as sketched below.
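The early stopping we refer to is the standard Keras callback; a minimal sketch follows, where the monitored metric, patience, and batch size are assumptions rather than the exact values used, and x_train, y_train, x_val, y_val are assumed to hold the prepared spectrogram features and labels.

```python
from tensorflow.keras.callbacks import EarlyStopping

# Halt training once validation loss stops improving, keeping the best weights.
early_stop = EarlyStopping(monitor="val_loss", patience=10,
                           restore_best_weights=True)
history = model.fit(x_train, y_train,
                    validation_data=(x_val, y_val),
                    epochs=100, batch_size=32,   # up to 100 epochs, as in our runs
                    callbacks=[early_stop])
```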
C. Feature Extraction

We used the librosa library [28] to extract the spectrograms from the audio files arranged in different folders, as shown in Fig 3. For the spectrograms, the STFT is used to cut the continuous signal into parts. Each audio file was in WAV format with a length of 10-90 seconds; we cut the files down to 3 seconds for our model using librosa's input-duration functionality. We extracted the features from the spectrogram and stored them in a NumPy array for audio classification. In contrast, for spectrogram image classification, we converted the images to grayscale using the CV2 library [29] for better prediction of the results. The images were then fed directly into the CNN model after resizing them to the shape (128,128,1) required by the model.
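The two extraction paths can be summarized in a short sketch. The sample rate and mel parameters are assumptions chosen so that a 3-second clip with roughly 23 ms windows comes out near the (128,128,1) input shape; the file paths are hypothetical.

```python
import numpy as np
import librosa
import cv2

def extract_audio_features(wav_path, sr=22050):
    # Audio path: 3-second clip, mel spectrogram stored as a NumPy array.
    y, _ = librosa.load(wav_path, sr=sr, duration=3.0)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=512,
                                         hop_length=512, n_mels=128)
    db = librosa.power_to_db(mel, ref=np.max)[:, :128]   # trim to 128 frames
    return db.reshape(128, 128, 1)

def load_spectrogram_image(img_path):
    # Image path: grayscale the saved spectrogram and resize for the CNN.
    img = cv2.imread(img_path, cv2.IMREAD_GRAYSCALE)
    return cv2.resize(img, (128, 128)).reshape(128, 128, 1)
```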

Fig. 3. STFT obtained from original WAV files: a) spectrogram obtained from the original lung data, b) spectrogram obtained from the normalized sound data, c) spectrogram generated from the augmented lung sound data.

D. Results

In this section, we discuss the details of our experimentation and the results we achieved. The comparison is between audio- and image-based techniques using our convolutional neural network model for multi-modality in the health domain. The audio- and image-based approaches can be an efficient, low-cost way to detect disease where MRI and CT scans are very costly. Based on our model, both techniques performed well and can be used in real time.

1) Audio Classification: Based on the previous results, we extracted audio features from the labeled lung disease audio. The dataset consists of patient records, the recording index, and the equipment used to record the audio. We took the WAV files and extracted the spectrogram features into a NumPy array using the librosa spectrogram functions, then passed the extracted features to our network resized to vectors of shape (128,128,1). Our network model performed exceptionally well and reported an accuracy of 83% on the data in its original form. Based on these initial results, we then normalized and augmented our data to remove any outliers present in the audio files. We used different types of augmentation techniques to increase the data, because a deep learning model performs better when the network is trained on a large dataset; real patient data is not publicly available, so we synthesized additional files to increase their number. After applying normalization, the maximum accuracy our model reported was 88%. This increase encouraged us to apply the augmentation techniques, and with augmentation we reached a maximum testing accuracy of 97%. The accuracy is quite reliable and suitable for use in the health domain.

2) Image Classification: For this paper, we extended our research to check the multi-modality of our existing network model by applying it in a different domain, i.e., image-based classification. During audio extraction, we noticed that generated spectrograms can be saved either to a NumPy array or directly as spectrogram images; this time we selected images for lung sound classification. The overall workflow of our image classification system is shown in Fig 1. Images are the more visual form of a signal and can be interpreted easily by physicians, which is not the case for audio classification. To do this, we extracted image features from the audio files at a dimension of 72x72, a ratio we kept to obtain the best image quality for classification. The images were then resized to 128x128x1 to fit our network model and assess its multi-modal accuracy. The testing accuracies we obtained were quite reliable: the maximum testing accuracy on the original image dataset was approximately 84%. To tune the network's performance further, we applied data normalization, and the results show that the accuracy dropped by a small amount due to the nature of the images: the spectrograms had already been converted to grayscale during generation, and normalization reduced their quality further, giving slightly lower accuracy. Finally, following the audio classification process, we performed augmentation, which increased the accuracy by a good margin after the small drop under normalization caused by the reduced image quality; augmentation gave the network model the ability to train on the images in their different aspects. The final and best accuracy obtained for image classification is 95%. As Fig 4 shows, the model trained exceptionally well, but due to the loss in image quality the accuracy dropped by a minor difference; the accuracy also dropped in the initial stages due to the model overfitting.

3) Over-fitting for image classification: During the experimentation phase of training our model for multi-modality, we encountered situations where the accuracy did not improve or even dropped by a significant amount. On further investigation, we concluded that the model was overfitting. We kept 100 training epochs for the audio model, while for image-based classification the model performed better but began overfitting before reaching 100 epochs. To overcome this issue and test the model's efficiency, we adopted early stopping for the network model. From the training curves, it can be seen that the image model stops between 50 and 60 epochs while still achieving good accuracy. Note that, due to the size of the heavy audio features, training on audio takes more time and iterations, while the image model can perform the same task in significantly less

Fig. 4. a) Model training accuracy for 100 epochs, b) model training loss for 100 epochs, c) model training accuracy with EarlyStopping, d) model loss with EarlyStopping.

time and with a lower number of iterations. Even though the data was limited and contained a lot of variation and environmental interference in the recordings (e.g., heartbeat, a running fan), our technique achieved very good accuracy in comparison with state-of-the-art feature-based research.
4) Comparison of audio and image techniques: Overall, comparing audio and image, we can say that audio reports good accuracy while taking more memory and reporting time. The image approach can do the same task in significantly less time, consumes about 80% less memory than audio, and delivers its accuracy promptly. Images also give a visual representation that is considered easy for health professionals to use. The overall comparison of the results between audio and images is shown in Table III.

TABLE II
2D CNN ARCHITECTURE DETAILS

| Layers                        | Dimensions         |
| 1. First convolutional layer  | 5 x 5 (24 filters) |
| 2. Max-pooling layer          | 4 x 2              |
| 3. Second convolutional layer | 5 x 5 (48 filters) |
| 4. Max-pooling layer          | 4 x 2              |
| 5. Third convolutional layer  | 5 x 5 (48 filters) |
| 6. Fourth layer               | 2400 x 64          |
| 7. Fifth layer                | 64 x 10            |

TABLE III
EVALUATION RESULTS OF MULTI-MODAL LUNG DISEASE CLASSIFICATION

| Model   | Technique               | Audio | Image |
| Model 1 | Original Data           | 83%   | 84%   |
| Model 2 | EBU Normalized Data     | 88%   | 79%   |
| Model 3 | RMS Normalized Data     | 87%   | 81%   |
| Model 4 | Peak Normalized Data    | 86%   | 85%   |
| Model 5 | Original Augmented Data | 93%   | 88%   |
| Model 6 | EBU Augmented Data      | 97%   | 89%   |
| Model 7 | RMS Augmented Data      | 94%   | 91%   |
| Model 8 | Peak Augmented Data     | 92%   | 95%   |
V. CONCLUSION

In this paper, we developed the Multimodal Lung Disease Classification (MLDC) system, combining advanced data normalization and data augmentation techniques for high-performance classification in lung disease diagnosis. The model is tested for both audio and image classification. The paper is an extension of our previous work, where the audio approach outperformed all state-of-the-art research. We experimented with our model for multi-modality, and it performed

very efficiently. The audio-based classification gave us the highest accuracy of 97% after applying the data pre-processing techniques, while the highest accuracy reported for image classification was 95% after the same pre-processing methods. Audio classification was observed to be an expensive form of classification, i.e., it achieved its best accuracy only at 100 epochs; in contrast, image classification achieved its results in half the time consumed by audio, securing its highest accuracy in only 50-60 epochs. We used early stopping to avoid model over-fitting. We obtained better accuracy than the state of the art, which confirms that the proposed model could be used for the diagnosis of lung diseases from lung sounds in health care.

The model can be extended further: during training, there might be some biased cases remaining even after K-fold validation, and the use of an ensemble model can deal with such issues. Spectrogram images in the healthcare domain are considered less informative compared with CT scans and MRI; in our case, we converted the images to grayscale to work with the skewed images, but dealing with colored images would help improve the efficiency and reliability of our model. We expect future researchers to consider our model and the techniques discussed in this paper and to develop software and applications for end users, i.e., home users and health specialists.

REFERENCES

[1] Frank van Haren, Tài Pham, Laurent Brochard, Giacomo Bellani, John Laffey, Martin Dres, Eddy Fan, Ewan C. Goligher, Leo Heunks, Joan Lynch, et al., "Spontaneous breathing in early acute respiratory distress syndrome: insights from the large observational study to understand the global impact of severe acute respiratory failure study," Critical Care Medicine, vol. 47, no. 2, p. 229, 2019.
[2] Innes Asher, Karen Bissell, Chen-Yuan Chiang, Asma El Sony, Philippa Ellwood, Luis García-Marcos, Guy B. Marks, Kevin Mortimer, Neil Pearce, and David Strachan, "Calling time on asthma deaths in tropical regions—how much longer must people wait for essential medicines?," The Lancet Respiratory Medicine, vol. 7, no. 1, pp. 13–15, 2019.
[3] Abraham Bohadana, Gabriel Izbicki, and Steve S. Kraman, "Fundamentals of lung auscultation," New England Journal of Medicine, vol. 370, no. 8, pp. 744–751, 2014.
[4] Zahra Moussavi, "Fundamentals of respiratory sounds and analysis," Synthesis Lectures on Biomedical Engineering, vol. 1, no. 1, pp. 1–68, 2006.
[5] Hans Pasterkamp, Steve S. Kraman, and George R. Wodicka, "Respiratory sounds: advances beyond the stethoscope," American Journal of Respiratory and Critical Care Medicine, vol. 156, no. 3, pp. 974–987, 1997.
[6] Rajkumar Palaniappan, Kenneth Sundaraj, and Nizam Uddin Ahamed, "Machine learning in lung sound analysis: a systematic review," Biocybernetics and Biomedical Engineering, vol. 33, no. 3, pp. 129–135, 2013.
[7] Basil M. Harris, Constantine F. Harris, George C. Harris, and Edward L. Hepler, "Non-invasive system and method for breath sound analysis," US Patent App. 16/465,353, Dec. 26, 2019.
[8] Steven B. Cohen, "The concentration of health care expenditures and related expenses for costly medical conditions, 2012," 2017.
[9] Zeenat Tariq, Sayed Khushal Shah, and Yugyung Lee, "Lung disease classification using deep convolutional neural network," in 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE, 2019, pp. 732–735.
[10] Gábor Manhertz, Dániel Modok, and Ákos Bereczky, "Evaluation of short-time fourier-transformation spectrograms derived from the vibration measurement of internal-combustion engines," in 2016 IEEE International Power Electronics and Motion Control Conference (PEMC). IEEE, 2016, pp. 812–817.
[11] Hai Chen, Xiaochen Yuan, Zhiyuan Pei, Mianjie Li, and Jianqing Li, "Triple-classification of respiratory sounds using optimized s-transform and deep residual networks," IEEE Access, vol. 7, pp. 32845–32852, 2019.
[12] Rupesh Dubey and Rajesh M. Bodade, "A review of classification techniques based on neural networks for pulmonary obstructive diseases," Proceedings of Recent Advances in Interdisciplinary Trends in Engineering & Applications (RAITEA), 2019.
[13] Zeenat Tariq, Sayed Khushal Shah, and Yugyung Lee, "Speech emotion detection using iot based deep learning for health care," in 2019 IEEE International Conference on Big Data (Big Data). IEEE, 2019, pp. 4191–4196.
[14] Agnieszka Mikołajczyk and Michał Grochowski, "Data augmentation for improving deep learning in image classification problem," in 2018 International Interdisciplinary PhD Workshop (IIPhDW). IEEE, 2018, pp. 117–122.
[15] Sayed Khushal Shah, Zeenat Tariq, and Yugyung Lee, "Iot based urban noise monitoring in deep learning using historical reports," in 2019 IEEE International Conference on Big Data (Big Data). IEEE, 2019, pp. 4179–4184.
[16] Karol J. Piczak, "Environmental sound classification with convolutional neural networks," in 2015 IEEE 25th International Workshop on Machine Learning for Signal Processing (MLSP). IEEE, 2015, pp. 1–6.
[17] Justin Salamon, Christopher Jacoby, and Juan Pablo Bello, "A dataset and taxonomy for urban sound research," in Proceedings of the 22nd ACM International Conference on Multimedia, 2014, pp. 1041–1044.
[18] Qiyu Chen, Weibin Zhang, Xiang Tian, Xiaoxue Zhang, Shaoqiong Chen, and Wenkang Lei, "Automatic heart and lung sounds classification using convolutional neural networks," in 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA). IEEE, 2016, pp. 1–4.
[19] U. Rajendra Acharya, Hamido Fujita, Shu Lih Oh, Yuki Hagiwara, Jen Hong Tan, and Muhammad Adam, "Application of deep convolutional neural network for automated detection of myocardial infarction using ecg signals," Information Sciences, vol. 415, pp. 190–198, 2017.
[20] Shu Lih Oh, Eddie Y. K. Ng, Ru San Tan, and U. Rajendra Acharya, "Automated diagnosis of arrhythmia using combination of cnn and lstm techniques with variable length heart beats," Computers in Biology and Medicine, vol. 102, pp. 278–287, 2018.
[21] Sergey Ioffe and Christian Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv preprint arXiv:1502.03167, 2015.
[22] Ilyes Rebai, Yessine BenAyed, Walid Mahdi, and Jean-Pierre Lorré, "Improving speech recognition using data augmentation and acoustic model fusion," Procedia Computer Science, vol. 112, pp. 316–322, 2017.
[23] EBU Recommendation R128, "Loudness normalisation and permitted maximum level of audio signals," 2011.
[24] Shengyun Wei, Kele Xu, Dezhi Wang, Feifan Liao, Huaimin Wang, and Qiuqiang Kong, "Sample mixed-based data augmentation for domestic audio tagging," arXiv preprint arXiv:1808.03883, 2018.
[25] Leon Cohen, Time-Frequency Analysis, vol. 778, Prentice Hall, 1995.
[26] John L. Semmlow and Benjamin Griffel, Biosignal and Medical Image Processing, CRC Press, 2014.
[27] B. M. Rocha, D. Filos, L. Mendes, I. Vogiatzis, E. Perantoni, E. Kaimakamis, P. Natsiavas, A. Oliveira, C. Jácome, A. Marques, et al., "A respiratory sound database for the development of automated classification," in International Conference on Biomedical and Health Informatics. Springer, 2017, pp. 33–37.
[28] Brian McFee, Colin Raffel, Dawen Liang, Daniel P. W. Ellis, Matt McVicar, Eric Battenberg, and Oriol Nieto, "librosa: Audio and music signal analysis in python," in Proceedings of the 14th Python in Science Conference, 2015, vol. 8, pp. 18–25.
[29] K. Yamini, K. Sai Swetha, P. Lakshmi Prasanna, M. Rupa Venkata Swathi, and Venkata Rao Maddumala, "Image colorization with deep convolutional open cv," Journal of Engineering Science, vol. 11, no. 4, pp. 533–543, 2020.

