
Clinical Text Classification of Medical Transcriptions Based on Different Diseases

Yadukrishna Sreekumar and P. K. Nizar Banu

Abstract Clinical text classification is the process of extracting information from clinical narratives. Clinical narratives are voice files, dictated notes, or other spoken material given by physicians. Because of the rapid rise in data in the healthcare sector, text mining and information extraction (IE) have acquired several applications in the past few years. This research attempts to use machine learning algorithms to diagnose diseases from given medical transcriptions. The proposed clinical text classification models could decrease the human effort needed for labeled training data creation and feature engineering when applying machine learning models to clinical text classification, by leveraging weak supervision. The main aim of this paper is to compare the multiclass logistic regression model and the support vector classifier model, both implemented to perform clinical text classification on medical transcriptions.

Keywords Clinical text mining · Transcriptions · Natural language processing · TF-IDF vectorization · scispaCy · Multiclass logistic regression · Support vector classifier

1 Introduction

In the medical world, a large number of digital text documents from several specialties are generated, such as patient health records or documentation of clinical studies. Clinical text contains valuable information about symptoms, diagnoses, treatments, drug use, and adverse (drug) events that can be utilized to improve healthcare for other patients. The physician also writes her or his reasoning for the diagnosis of the patient in the patient record [1].
Since enormous amounts of data are being generated in the healthcare sector, it has become difficult for people to find the required information. This is where text mining and classification of biomedical and clinical data become relevant: we have to make use of several text mining algorithms to fetch information from these heaps of data [2].

Y. Sreekumar · P. K. Nizar Banu (B)
Department of Computer Science, CHRIST (Deemed to be University), Bangalore, India
e-mail: nizar.banu@christuniversity.in


The main aim of this paper is to compare two major algorithms that classify and summarize the potential factors, signs, or symptoms from unstructured textual descriptions of patients. To extract the potential information from the transcriptions, we use various natural language processing techniques. After extracting the potential information, we try to diagnose the diseases based on the extracted symptoms using various machine learning models. Here, we compare the multiclass logistic regression model and the support vector classifier model, which are used to classify the transcriptions. An enormous volume of biomedical as well as clinical data is generated, so there is an increasing demand for accurate biomedical text mining tools for extracting information from the literature [3].
Healthcare systems, and specifically health record systems, contain both structured and unstructured information as text [1]. More specifically, it is estimated that over 40% of the data in healthcare record systems is text, so-called clinical text, sometimes also called electronic patient record text. Clinical text and the biomedical literature can be seen as a large unstructured data repository, which is where text mining comes into play. The next section presents a background study of work in medical text classification before we move on to the proposed methodology [4]. Section 3 presents the methodology and details of the dataset. Section 4 discusses the results obtained, and Section 5 concludes the paper.

2 Background Study

In a dictionary-based approach, the authors took data from PubMed and Medline, used part-of-speech (POS) tagging and phrase-block formulation, and designed the VWIA algorithm to identify entities for matching biomedical concepts, building a literature database from the collected documents [5]. Another work used the conditional random fields (CRF) model, which combines the best of both HMM and MEMM [6]. Dynamic biomedical information, namely associations between biomedical entities, is often extracted based on entity co-occurrence analysis with statistical theory [7]; for that purpose, an algorithm called mining multiclass entity association (MMEA) was used [8].
Another set of researchers collected information from Medline and ScienceDirect [9] and used NLP methods based on prior knowledge of how language is structured and on specific knowledge of how biological information is mentioned in the literature [10]. The analysis results show that pre-training BERT on biomedical corpora helps it to understand complex biomedical texts [11] (Table 1).

Table 1 Literature review table

| Authors | Name of the paper | Technology/algorithm | Accuracy (%) |
|---|---|---|---|
| Lee | BioBERT: a pre-trained biomedical language representation model for biomedical text mining | BERT | 95 |
| Pereira | Text mining applied to electronic medical records | Variable-step window identification algorithm (VWIA) | 75 |
| Gong | Application of biomedical text mining | Conditional random fields (CRF) model | 74.50 |
| T. Fadi | MCAR: multiclass classification based on association rule | Mining multiclass entity association | – |
| M. Simmons | Text mining for precision medicine: bringing structure to EHRs and biomedical literature to understand genes and health | General idea about medical documents | – |
| G. Tancev | Mining and classifying medical documents | General idea about biomedical text mining | – |

3 Methodology

In this paper, we have implemented models to correctly classify the medical diagnosis based on given medical transcriptions. Our basic aim is to correctly classify the medical specialties based on the transcription text. A flowchart of the process we followed in our methodology is given in Fig. 1.

3.1 Dataset

This dataset contains sample medical transcriptions for various medical specialties. Medical data is extremely hard to find due to HIPAA privacy regulations [12]; this dataset offers a solution by providing medical transcription samples. The data was scraped from mtsamples.com, which is designed to give access to a large collection of transcribed medical reports. These samples can be used by learning, as well as working, medical transcriptionists for their daily transcription needs [13].

Fig. 1 Block diagram of the proposed model framework

Fig. 2 Sample data description of medical transcription dataset

The dataset contains six columns, including 'description,' 'medical_specialty,' 'sample_name,' 'transcription,' and 'keywords,' as shown in Fig. 2.
In total, there are 140,214 sentences in the transcription column and around 35,822 unique words, which form the vocabulary. There are also around 40 categories of medical specialties in the dataset, such as allergy/immunology, autopsy, cardiovascular/pulmonary, gastroenterology, and endocrinology. As part of pre-processing, we kept only the categories that have more than 50 samples, which reduced the number of categories from 40 to 21. Looking at the number of transcriptions per medical specialty, surgery has 1088 transcription records, while a few categories have far fewer records, for example, emergency room reports, pain management, and psychiatry/psychology. We also plotted the classes of the filtered data, giving around 21 categories of medical specialties. Figure 3 portrays the various medical specialty categories and the number of records in each category.

Fig. 3 Medical speciality category details in dataset

Looking at the plot, we can clearly see that this is a data imbalance problem: a huge number of records belong to the class surgery, almost three times as many as some of the other classes in the dataset. Since we are trying to classify the medical specialties based on the medical transcriptions, we only need the 'transcription' and 'medical_specialty' columns of the dataset.
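As a rough sketch of how this filtering can be done, assuming pandas and the Kaggle CSV (the file name is an assumption):

```python
import pandas as pd

# Load the mtsamples data; the file name is an assumption
df = pd.read_csv("mtsamples.csv")
df = df[["transcription", "medical_specialty"]]

# Keep only the specialties with more than 50 samples,
# reducing the 40 raw categories to 21
counts = df["medical_specialty"].value_counts()
df = df[df["medical_specialty"].isin(counts[counts > 50].index)]
```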

3.2 Data Pre-processing

As part of data pre-processing, rows with empty or null transcription values were removed, leaving a total of 4597 transcriptions in the dataset. To clean the data further, we removed the punctuation, digits, and extra white space in the transcriptions and converted the text to lowercase. Then, we performed lemmatization on the text. Lemmatization [14] is a text normalization technique that properly reduces each inflected word while ensuring that the root word belongs to the language. Words with similar meanings are mapped to the same lemma, which reduces both the number of distinct words in the transcription column and the overall variability of the vocabulary.
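A minimal sketch of this cleaning and lemmatization step, continuing the snippet above; the paper does not name a lemmatizer, so NLTK's WordNet lemmatizer is an assumption:

```python
import re
import nltk
from nltk.stem import WordNetLemmatizer

# One-time downloads of tokenizer and lemmatizer resources
nltk.download("punkt")
nltk.download("wordnet")

lemmatizer = WordNetLemmatizer()

def clean_text(text: str) -> str:
    # Lowercase, then strip punctuation, digits, and repeated whitespace
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    text = re.sub(r"\s+", " ", text).strip()
    # Reduce each token to its lemma to shrink the vocabulary
    return " ".join(lemmatizer.lemmatize(tok) for tok in nltk.word_tokenize(text))

df = df.dropna(subset=["transcription"])  # 4597 transcriptions remain
df["transcription"] = df["transcription"].apply(clean_text)
```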

3.3 Feature Extraction

In order to extract features from the dataset, we used the TF-IDF vectorizer. In information retrieval, TF-IDF (term frequency-inverse document frequency) is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus, and it is often used as a weighting factor [15]. We configured the vectorizer so that any word or bigram appearing in more than 75% of the documents is not considered a feature, and we capped the vocabulary at a maximum of 1000 features. We then fit the TF-IDF vectorizer on the transcription column and visualized the resulting features with a t-SNE plot, projecting the 1000 extracted features into a two-dimensional space. In the t-SNE plot, most of the classes lie quite close to one another, and the majority class, surgery, overlaps the other classes.
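A sketch of the vectorization and visualization described above, assuming scikit-learn's TfidfVectorizer and TSNE (the bigram setting reflects the "word or bigram" wording):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE

# Unigrams and bigrams; drop terms appearing in more than 75% of the
# documents; cap the vocabulary at the 1000 highest-scoring features
vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_df=0.75, max_features=1000)
X_tfidf = vectorizer.fit_transform(df["transcription"])

# Project the 1000-dimensional TF-IDF features to 2-D for the t-SNE plot
X_2d = TSNE(n_components=2, random_state=42).fit_transform(X_tfidf.toarray())
```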

3.4 Dimensionality Reduction

We performed PCA to reduce the dimensionality of the features for further processing. PCA [16] is commonly used for dimensionality reduction: each data point is projected onto only the first few principal components, yielding lower-dimensional data while preserving as much of the data's variation as possible. We applied PCA to the TF-IDF matrix and retained the components that together explain 95% of the variance in the data, which reduced the number of features from 1000 to 614.
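In scikit-learn this can be expressed by passing the desired variance ratio directly, a sketch under the same assumptions as the earlier snippets:

```python
from sklearn.decomposition import PCA

# A float n_components keeps the smallest number of components whose
# cumulative explained variance reaches 95% (1000 -> 614 features here)
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_tfidf.toarray())
print(X_pca.shape)  # (n_documents, 614)
```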

3.5 Classification Using Support Vector Classifier and Multiclass Logistic Regression

We used the train-test split utility in scikit-learn to split the data into test (25%) and train (75%) sets, with stratified sampling because some of the classes are minority classes. We then applied a support vector classifier to the training data to learn a classifier and predict on the test data. Support vector methods tend to work best on low-dimensional data, so when the data is high-dimensional, a pre-processing step such as principal component analysis is usually applied first [17].
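A minimal sketch of the split-and-train step, assuming scikit-learn and the PCA features from the previous step (the kernel is not stated in the paper, so the default RBF kernel is an assumption):

```python
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report

y = df["medical_specialty"]

# Stratified 75/25 split so minority specialties appear in both partitions
X_train, X_test, y_train, y_test = train_test_split(
    X_pca, y, test_size=0.25, stratify=y, random_state=42)

svc = SVC()  # default RBF kernel (assumption)
svc.fit(X_train, y_train)
print(classification_report(y_test, svc.predict(X_test)))
```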
We obtained an accuracy of 39%, which is comparatively low, so additional methods are needed to increase the accuracy. A transcription labeled surgery could belong to any of the other categories: heart surgery, for example, could belong to cardiology, but it is still filed under surgery; similarly, something related to bone could belong to orthopedic. The surgery class can therefore be seen as a superset of all the other classes. Classes like discharge summary, office notes, and emergency room reports are likewise supersets of the other classes, so we removed these classes from the dataset. We also mapped neurosurgery to neurology and nephrology to urology, since they fall under those specialties, respectively.
After performing all the mappings and removals, we are left with around 12 categories of medical specialities, all of which are now separate and unique. The total number of transcriptions in the dataset is now 2324.
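A sketch of this relabeling, where the exact category strings are illustrative approximations of the dataset's labels:

```python
# Remove the "superset" classes (label spellings are assumptions)
superset_classes = ["Surgery", "Discharge Summary", "Office Notes",
                    "Emergency Room Reports"]
df = df[~df["medical_specialty"].isin(superset_classes)]

# Fold sub-specialties into their parent specialties
df["medical_specialty"] = df["medical_specialty"].replace(
    {"Neurosurgery": "Neurology", "Nephrology": "Urology"})

print(df["medical_specialty"].nunique(), len(df))  # ~12 categories, 2324 rows
```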
The next step is to apply scispaCy models to detect medical entities in the text. ScispaCy is a Python package containing spaCy models for processing biomedical, scientific, or clinical text. We used spaCy to load a scispaCy model that detects medical terms and processed all 12 categories of data with it to extract the medical entities. Once the medical entities were detected, we repeated the lemmatization and text cleaning performed earlier, then applied the TF-IDF vectorizer again to extract the maximum features from the data. Figure 4 shows the t-SNE plot for the various categories after applying the scispaCy package.

Fig. 4 t-SNE plot for data points after applying scispaCy package
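A sketch of the entity-extraction step described above; the paper does not name the specific scispaCy model, so the small biomedical model en_core_sci_sm is an assumption (it is installed from the scispaCy release files, not the default spaCy registry):

```python
import spacy

nlp = spacy.load("en_core_sci_sm")  # scispaCy small biomedical model

def extract_entities(text: str) -> str:
    # Keep only the spans the model recognizes as biomedical entities
    return " ".join(ent.text for ent in nlp(text).ents)

df["transcription"] = df["transcription"].apply(extract_entities)
```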
In the t-SNE plot of the updated dataset, a few more clusters have formed, but there is still considerable overlap between classes. One way to address an imbalanced dataset is to oversample the minority class. The simplest approach involves duplicating examples in the minority class, although these duplicates add no new information to the model. Instead, new examples can be synthesized from the existing ones. This is a type of data augmentation for the minority class and is referred to as the synthetic minority oversampling technique, or SMOTE for short [18].
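A sketch of applying SMOTE with the imbalanced-learn package (an assumption, as the paper only names the technique):

```python
from imblearn.over_sampling import SMOTE

# SMOTE synthesizes new minority-class points by interpolating between
# each sample and its k nearest minority neighbors (k=5 by default).
# It should be applied to the training split only, never the test split.
smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)
```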

Table 2 Classification results for different algorithms

| Models | Precision | Recall | F1-score | Accuracy (%) |
|---|---|---|---|---|
| Support vector classifier | 0.35 | 0.39 | 0.32 | 38.80 |
| Multiclass logistic regression | 0.63 | 0.64 | 0.63 | 64 |
| MLR with SMOTE | 0.65 | 0.67 | 0.65 | 67 |

4 Results and Discussion

4.1 Support Vector Classifier

The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-dimensional space into classes, so that new data points can be placed in the correct category in the future. This best decision boundary is called a hyperplane. SVM chooses the extreme points/vectors that help in creating the hyperplane; these extreme cases are called support vectors, and hence the algorithm is termed the support vector machine [19].

We applied the support vector classifier to the training data to learn a classifier and predict on the test data. The results were analyzed using the confusion matrix, and the classification results are depicted in Table 2. The confusion matrix shows that most of the test samples are classified into the surgery class. The classification report gives an overall accuracy of 39%, which is very low. The results for the larger classes, like surgery, are reasonably good, but for the minority classes, like neurosurgery and hematology-oncology, the results are very poor.

4.2 Multiclass Logistic Regression

Logistic regression, by default, is limited to two-class classification problems. Extensions such as one-versus-rest allow logistic regression to be used for multiclass classification, although they require that the classification problem first be transformed into multiple binary classification problems [20]. Since the accuracy of the support vector classifier was very low, we applied some domain knowledge to improve the results. After processing the data with the scispaCy model, we applied another machine learning algorithm, logistic regression, to classify the medical transcriptions. Since there are multiple output classes, we used multiclass logistic regression for the classification. After all the pre-processing, the multiclass logistic regression model was trained on the training data, and the classification results are presented in Table 2.
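A sketch of the multinomial model with scikit-learn, under the same assumptions as the earlier snippets:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# multi_class="multinomial" fits a single softmax model over all classes
# instead of decomposing the task into one-vs-rest binary problems
mlr = LogisticRegression(multi_class="multinomial", max_iter=1000)
mlr.fit(X_train, y_train)
print(classification_report(y_test, mlr.predict(X_test)))
```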

Examining the confusion matrix after prediction, certain classes are now classified very well, while others are still classified poorly. The overall accuracy, however, improved from 39 to 64%.

4.3 Multiclass Logistic Regression with SMOTE

Since some classes are in the minority, we can use the synthetic minority oversampling technique (SMOTE) to generate more samples from the minority classes and solve the data imbalance problem. In machine learning and data science, we often come across the term imbalanced data distribution, which generally describes situations where the observations in one class are much more numerous than those in the other classes. Because machine learning algorithms tend to increase accuracy by reducing error, they do not consider the class distribution. SMOTE is one of the most commonly used oversampling methods to solve the imbalance problem; it aims to balance the class distribution by increasing the number of minority class examples.

SMOTE generates new samples from the existing minority classes of data: it creates virtual training records by linear interpolation within the minority class. These synthetic training records are generated by randomly selecting one or more of the k-nearest neighbors of each example in the minority class. After the oversampling process, the data is reconstructed, and several classification models can be applied to the processed data [21].
Initially, we used the support vector classifier to classify the transcriptions based on the diseases and obtained an accuracy of 39%, which is on the lower side. We then used multiclass logistic regression along with the scispaCy package to attain higher accuracy, obtaining 64%. Fundamentally, we cannot attain much more accuracy because of the data imbalance problem in the dataset; the synthetic minority oversampling technique in Python is one solution to this problem. After using SMOTE, we obtained an accuracy of 67%, the highest among the three configurations.

5 Conclusion

We have used a support vector classifier on the medical transcription dataset and have tried to classify the medical specialties (diagnoses) based on the available medical transcriptions. As presented in the results section, even with advanced techniques, expecting a large increase in accuracy is very challenging because the dataset is class-imbalanced. On the other hand, we found that the data itself is noisy, and with a customized feature extraction technique, better results can be expected. Future work will focus on more suitable machine learning techniques to generate more samples from the minority classes to address the class imbalance problem, and on an ensemble approach for better classification.

References

1. Dalianis H (2018) Clinical text mining: secondary use of electronic patient records. Springer, Stockholm
2. Tancev G (2019) Mining and classifying medical documents, 25 Oct 2019. [Online]. Available: https://towardsdatascience.com/mining-and-classifying-medical-text-documents-1876462f73bc. Accessed on 02 Apr 2021
3. Lee J et al (2020) BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36(4):1234–1240
4. Singhal A, Simmons M, Lu Z (2016) Text mining for precision medicine: bringing structure to EHRs and biomedical literature to understand genes and health. Adv Exp Med Biol 939:139–166
5. Rijo R, Martinho R, Pereira L, Silva C (2015) Text mining applied to electronic medical records. Int J E-Health Med Commun 6(3):1–18
6. Analytics Vidhya (2018) Complete tutorial on text classification using conditional random fields model (in Python), 13 Aug 2018. [Online]. Available: https://www.analyticsvidhya.com/blog/2018/08/nlp-guide-conditional-random-fields-text-classification/. Accessed on 05 Apr 2021
7. Gong L (2018) Application of biomedical text mining. IntechOpen I:427–428
8. Thabtah F, Cowling P, Peng Y (2005) MCAR: multi-class classification based on association rule. ResearchGate
9. ScienceDirect, 05 Apr 2005. [Online]. Available: https://www.sciencedirect.com/. Accessed on 05 Apr 2021
10. Fleuren WW, Alkema W (2015) Application of text mining in the biomedical domain. Methods 74(1):97–106
11. Lee J, Yoon W, Kim S, Kim D, Kim S, So CH, Kang J (2019) BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Cornell University, pp 10–11
12. Boyle T (2018) Medical transcriptions. Kaggle, Apr 2018. [Online]. Available: https://www.kaggle.com/tboyle10/medicaltranscriptions. Accessed on 25 Oct 2020
13. MTSamples.com (2021) mtsamples, 1 Apr 2021. [Online]. Available: https://www.mtsamples.com/index.asp. Accessed on 05 Apr 2021
14. Srinidhi S (2020) Lemmatization in natural language processing (NLP) and machine learning. Towards Data Science, 26 Feb 2020. [Online]. Available: https://towardsdatascience.com/lemmatization-in-natural-language-processing-nlp-and-machine-learning-a4416f69a7b6. Accessed on 05 Apr 2021
15. Wikipedia (2021) Tf-idf. Wikipedia, the free encyclopedia, 08 Mar 2021. [Online]. Available: https://en.wikipedia.org/wiki/Tf%E2%80%93idf. Accessed on 05 Apr 2021
16. Jolliffe IT, Cadima J (2016) Principal component analysis: a review and recent developments. Philos Trans R Soc A 374(2065)
17. Ben-Hur A (2008) Support vector clustering. Scholarpedia. [Online]. Available: http://www.scholarpedia.org/article/Support_vector_clustering. Accessed on 05 Apr 2021
18. Brownlee J (2020) SMOTE for imbalanced classification with Python. Machine Learning Mastery, 17 Jan 2020. [Online]. Available: https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/. Accessed on 03 Apr 2021
19. Javatpoint, Support vector machine algorithm. [Online]. Available: https://www.javatpoint.com/machine-learning-support-vector-machine-algorithm
20. Brownlee J (2021) Multinomial logistic regression with Python. Machine Learning Mastery, 1 Jan 2021. [Online]. Available: https://machinelearningmastery.com/multinomial-logistic-regression-with-python/. Accessed on 02 Apr 2021
21. GeeksforGeeks (2019) ML | Handling imbalanced data with SMOTE and near miss algorithm in Python, 30 June 2019. [Online]. Available: https://www.geeksforgeeks.org/ml-handling-imbalanced-data-with-smote-and-near-miss-algorithm-in-python/. Accessed on 20 June 2021
