Abstract Clinical text classification is the process of extracting information from
clinical narratives. Clinical narratives are the voice files, notes taken during a lecture,
or other spoken material given by physicians. Because of the rapid rise in data in the
healthcare sector, text mining and information extraction (IE) have found a number of
applications in the past few years. This research attempts to use machine learning
algorithms to diagnose diseases from given medical transcriptions. By leveraging weak
supervision, the proposed clinical text classification models could reduce the human
effort spent on creating labeled training data, on feature engineering, and on designing
and applying machine learning models to clinical text classification. The main aim
of this paper is to compare a multiclass logistic regression model and a support vector
classifier model, both implemented to perform clinical text classification on
medical transcriptions.
1 Introduction
In the medical world, many digital text documents are generated across several
specialties, such as patient health records and documentation of clinical studies. Clinical
text contains valuable information about symptoms, diagnoses, treatments, drug use,
and adverse (drug) events for the patient that can be utilized to improve healthcare
for other patients. The physician also records her or his reasoning behind the
diagnosis in the patient record [1].
Because an enormous amount of data is generated in the healthcare sector, it is
difficult for people to find the information they need. This is where text mining and
classification of biomedical and clinical data become relevant. We have to make use of
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 613
C. Satyanarayana et al. (eds.), High Performance Computing and Networking,
Lecture Notes in Electrical Engineering 853,
https://doi.org/10.1007/978-981-16-9885-9_50
614 Y. Sreekumar and P. K. Nizar Banu
several text mining algorithms to retrieve information from this heap of data [2]. The
main aim of this paper is to compare two major algorithms that classify and summarize
the potential factors, signs, or symptoms from unstructured textual descriptions
of patients. To extract this information from the transcriptions, we use various
natural language processing techniques. After extraction, we try to diagnose diseases
from the extracted symptoms using machine learning models. Specifically, we compare
the multiclass logistic regression model and the support vector classifier model, both
of which are used to classify the transcriptions. The enormous volume of biomedical
and clinical data being generated creates a growing demand for accurate biomedical
text mining tools that extract information from the literature [3].
Healthcare systems, and specifically health record systems, contain both structured
and unstructured information as text [1]. More specifically, it is estimated that over
40% of the data in healthcare record systems contains text, so-called clinical text,
sometimes also called electronic patient record text. Clinical text and biomedical
literature can be seen as a large unstructured data repository, which is where text
mining comes into play. Section 2 presents a background study of the current state of
medical text classification before we move on to the proposed methodology [4].
Section 3 presents the methodology and details about the dataset. Section 4 discusses
the results obtained. Section 5 concludes the paper.
2 Background Study
In a dictionary-based approach, data was taken from PubMed and Medline, and
part-of-speech (POS) tagging, phrase block formulation, and a purpose-designed VWIA
algorithm were used to identify entities for matching biomedical concepts. With the data
collected from PubMed and Medline, the authors created a literature database and took
one entry of each literature from it [5]. A conditional random fields (CRF) model was
used, which combines the best of both hidden Markov models (HMM) and maximum-entropy
Markov models (MEMM) [6]. Dynamic biomedical information, namely associations
between biomedical entities, is often extracted through entity co-occurrence
analysis based on statistical theory [7]. For that purpose, an algorithm
called mining multiclass entity association (MMEA) was used [8].
Another set of researchers collected information from Medline and ScienceDirect [9]
and used NLP methods that are based on prior knowledge of how language is
structured and on specific knowledge of how biological information is mentioned in
the literature [10]. The analysis results show that pre-training BERT on biomedical
corpora helps it to understand complex biomedical texts [11] (Table 1).
Clinical Text Classification of Medical Transcriptions Based … 615
3 Methodology
In this paper, we have implemented a model to correctly classify the medical diagnosis
from the given medical transcriptions. Our basic aim is to correctly classify the
medical specialties based on the transcription text. A flowchart of the process
followed in our methodology is given in Fig. 1.
3.1 Dataset
This dataset contains sample medical transcriptions for various medical specialties.
Medical data is extremely hard to find due to HIPAA privacy regulations [12]. This
dataset offers a solution by providing medical transcription samples. The data
was scraped from mtsamples.com. MTSamples.com is designed to give us access
As part of data pre-processing, rows whose transcription column is empty or null are
removed. After removing them, there are a total of 4597 transcriptions in the whole
dataset. To clean the data, we removed punctuation, digits, and extra white space
from the transcriptions and converted the text to lowercase. Then, we
performed lemmatization on the text. Lemmatization [14] is a text normalization
technique that reduces an inflected word to its root form while ensuring that the
root word belongs to the language. It maps the different inflected forms of a word to
the same root, which reduces both the number of distinct words in the transcription
column and the variability among them.
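The cleaning steps above can be illustrated with a minimal sketch. The function name `clean_text` and the sample sentence are hypothetical, and lemmatization is deliberately left out because it needs an external language resource; this is only one plausible way to carry out the described cleaning, not the authors' code.

```python
import re
import string


def clean_text(text: str) -> str:
    """Lowercase the text and strip punctuation, digits, and extra whitespace."""
    text = text.lower()
    # Replace punctuation and digits with spaces so adjacent words do not merge.
    text = re.sub(f"[{re.escape(string.punctuation)}0-9]", " ", text)
    # Collapse runs of whitespace into a single space and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


sample = "Patient is a 45-year-old male,  with chest pain (2 days)."
print(clean_text(sample))
```

A lemmatizer would then be applied to the cleaned string, token by token.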
To extract features from the dataset, we used a TF-IDF vectorizer. In
information retrieval, TF-IDF or TFIDF, short for term frequency–inverse document
frequency, is a numerical statistic that is intended to reflect how important a word
is to a document in a collection or corpus. It is often used as a weighting factor [15].
We configured the vectorizer so that any word or bigram appearing in more than 75%
of the documents is not considered a feature, and we set the maximum feature count
to 1000. Then, we fit the TF-IDF vectorizer on our transcription column and
visualized the TF-IDF features using a t-SNE plot. The feature extraction process
yielded close to 1000 features.
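A minimal sketch of this vectorizer configuration, assuming scikit-learn's `TfidfVectorizer` is the implementation in use (the text names a TF-IDF vectorizer and scikit-learn, but not the exact class). The four toy transcriptions are hypothetical stand-ins for the cleaned transcription column.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-ins for cleaned transcriptions.
transcriptions = [
    "patient complains of chest pain and shortness of breath",
    "patient reports knee pain after a fall",
    "patient mri of the brain shows no acute abnormality",
    "patient has chronic kidney disease and hypertension",
]

vectorizer = TfidfVectorizer(
    ngram_range=(1, 2),  # single words and bigrams
    max_df=0.75,         # ignore terms found in more than 75% of documents
    max_features=1000,   # keep at most 1000 features
)
tfidf_matrix = vectorizer.fit_transform(transcriptions)

print(tfidf_matrix.shape)
print("patient" in vectorizer.vocabulary_)
```

Here the word "patient" occurs in every toy document, so the `max_df=0.75` threshold removes it from the vocabulary, while rarer terms and bigrams are kept.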
We performed PCA to reduce the dimensionality of the features for further
processing. PCA [16] is commonly used for dimensionality reduction by projecting
each data point onto only the first few principal components to obtain
lower-dimensional data while preserving as much of the data’s variation as possible. We
applied PCA to the TF-IDF matrix, which reduced the number of features
from 1000 to 614. We retained enough principal components to explain 95% of the
variance in the data.
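The variance-based component selection can be sketched as follows, assuming scikit-learn's `PCA`, where a fractional `n_components` keeps the smallest number of components whose cumulative explained variance reaches that fraction. The random feature matrix is a hypothetical stand-in for the dense TF-IDF matrix (a sparse matrix would first be converted, for example with `.toarray()`).

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical dense feature matrix: 200 samples, 50 correlated features
# built from 5 latent factors plus a little noise.
latent = rng.normal(size=(200, 5))
features = latent @ rng.normal(size=(5, 50)) + 0.05 * rng.normal(size=(200, 50))

# Keep the fewest components that explain at least 95% of the variance.
pca = PCA(n_components=0.95, svd_solver="full")
reduced = pca.fit_transform(features)

print(features.shape, "->", reduced.shape)
print(pca.explained_variance_ratio_.sum())
```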
Then, we used the train–test split in scikit-learn to split the data into training (75%)
and test (25%) sets. We used stratified sampling because some of the classes are
minority classes. We applied a support vector classifier (SVC), training it on the
training data and predicting on the test data. Support vector clustering, which shares
the abbreviation SVC but is a different method from the classifier used here, is a
nonparametric clustering algorithm that does not make any assumption on the number or
shape of the clusters in the data. It reportedly works best for low-dimensional data,
so if the data is high-dimensional, a pre-processing step, e.g., using principal
component analysis, is usually required. Several enhancements of the original
algorithm were proposed that provide specialized algorithms for computing the clusters
by only computing a subset of the edges in the adjacency matrix [17]. We obtained an
accuracy of 39%, which is comparatively low, so further steps were needed to increase
the accuracy.
Transcriptions for surgery could belong to any of the other categories. For example,
a heart surgery transcription could belong to cardiology, but it is filed under
surgery; similarly, a bone-related one could belong to orthopedic. The surgery class
is therefore effectively a superset of all other classes. Classes such as discharge
summary, office notes, and emergency room report are likewise supersets of the other
classes. So, we removed these classes from the dataset. We also mapped neurosurgery
to neurology and nephrology to urology, since they come under those specialties,
respectively. After all the mapping and removal, around 12 categories of medical
specialities remain, all separate and unique, and the dataset contains 2324
transcriptions in total.
Fig. 4 T-sne plot for data points after applying scispaCy package
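The class removal, the class mapping, and the stratified 75/25 split described in this section can be sketched together. The label strings, the helper `consolidate`, and the toy records are assumptions for illustration; the exact category names in the dataset may differ.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Assumed label names; the exact strings in the dataset may differ.
EXCLUDED = {"Surgery", "Discharge Summary", "Office Notes", "Emergency Room Reports"}
MERGED = {"Neurosurgery": "Neurology", "Nephrology": "Urology"}


def consolidate(labels, texts):
    """Drop superset classes and merge sub-specialties into their parent class."""
    kept = [
        (MERGED.get(label, label), text)
        for label, text in zip(labels, texts)
        if label not in EXCLUDED
    ]
    return [label for label, _ in kept], [text for _, text in kept]


# Hypothetical toy records standing in for the transcription dataset.
labels = (
    ["Surgery"] * 4 + ["Neurosurgery"] * 2 + ["Neurology"] * 2
    + ["Nephrology"] * 2 + ["Urology"] * 2 + ["Cardiology"] * 4
    + ["Office Notes"] * 2
)
texts = [f"transcription {i}" for i in range(len(labels))]

new_labels, new_texts = consolidate(labels, texts)

# Stratified split keeps the class proportions equal in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    new_texts, new_labels, test_size=0.25, stratify=new_labels, random_state=0
)
print(Counter(new_labels))
print(Counter(y_test))
```

Stratification matters here because, without it, a small class could end up entirely in the training or the test partition.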
The next step is to apply scispaCy models to detect medical entities in our text.
ScispaCy is a Python package containing spaCy models for processing biomedical,
scientific, or clinical text. We use spaCy to run a scispaCy model that detects the
medical terms, processing the data of all 12 categories. Once the
medical entities are detected, we repeat the lemmatization and text cleaning
described earlier. We then apply the TF-IDF vectorizer again and extract the
maximum features from the data. Figure 4 shows the t-SNE plot
for the various categories after applying the scispaCy package.
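A rough sketch of the entity detection step follows. The text does not say which scispaCy model was used, so the model name `en_core_sci_sm` is an assumption, as are the helper names; keeping only the detected entity mentions is also one reading of the description rather than a confirmed detail.

```python
def load_scispacy_model(name: str = "en_core_sci_sm"):
    """Load a scispaCy pipeline (needs spacy, scispacy, and the model package)."""
    import spacy  # imported lazily so the helper below has no hard dependency

    return spacy.load(name)


def keep_medical_entities(text: str, nlp) -> str:
    """Return the entity mentions that the pipeline detects in the text."""
    doc = nlp(text)
    # doc.ents holds the entity spans found by the pipeline's NER component.
    return " ".join(ent.text for ent in doc.ents)


# Example usage, assuming the model is installed:
# nlp = load_scispacy_model()
# print(keep_medical_entities("Patient presents with chest pain and dyspnea.", nlp))
```

The resulting entity-only strings would then go through the same cleaning, lemmatization, and TF-IDF steps as before.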
In the t-SNE plot of the updated dataset, a few more clusters appear,
but there is still a lot of overlap. One way to address imbalanced datasets is to
oversample the minority class. The simplest approach duplicates examples in the
minority class, although these examples do not add any new information to the model.
Instead, new examples can be synthesized from the existing examples. This is a type of
data augmentation for the minority class and is referred to as the synthetic minority
oversampling technique, or SMOTE for short [18].
The goal of the SVM algorithm is to create the best line or decision boundary that
can segregate n-dimensional space into classes, so that new data points can easily
be put in the correct category in the future. This best decision boundary is called
a hyperplane. SVM chooses the extreme points/vectors that help in creating the
hyperplane. These extreme cases are called support vectors, and hence, the algorithm
is termed support vector machine [19].
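The role of support vectors can be seen in a small sketch using scikit-learn's `SVC`. The linear kernel and the six toy points are assumptions chosen for illustration; the text does not state which kernel was used.

```python
import numpy as np
from sklearn.svm import SVC

# Two well-separated toy classes in a 2-dimensional feature space.
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0],
              [3.0, 3.0], [4.0, 4.0], [3.0, 4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

# Only the extreme points closest to the boundary define the hyperplane.
print(clf.support_vectors_)
print(clf.predict([[0.5, 0.5], [3.5, 3.5]]))
```

Only a subset of the training points ends up in `support_vectors_`; the remaining points could be removed without moving the decision boundary.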
We applied a support vector classifier, training it on the training data and
predicting on the test data. The results are analyzed using the confusion
matrix, and the classification results are depicted in Table 2. In the confusion
matrix, the classification works, but most of the dataset is classified
into the surgery class. The classification report shows an overall
accuracy of 39%, which is very low. The results are considerably better for the
larger classes like surgery, but very poor for minority
classes like neurosurgery and hematology-oncology.
In the confusion matrix after the subsequent prediction, certain
classes are classified very well, while others are still not
classified properly. The overall accuracy improved from 39 to 64%.
Since some classes are in the minority, we can use the synthetic minority oversampling
technique (SMOTE) to generate more samples from the minority classes and address the
data imbalance problem. In machine learning and data science, imbalanced data
distribution refers to the situation in which one class has far more or far fewer
observations than the other classes. Because machine learning algorithms tend to
increase accuracy by reducing the overall error, they do not
consider the class distribution. SMOTE is one of the most commonly used oversampling
methods for the imbalance problem. It aims to balance the class distribution by
increasing the number of minority class examples with synthetic samples derived from
the existing ones.
SMOTE generates new samples from the existing minority-class data. It creates
virtual training records for the minority class by linear interpolation: for each
example in the minority class, one or more of its k-nearest neighbors are randomly
selected, and synthetic records are generated between the example and those
neighbors. After the oversampling process, the data is reconstructed, and several
classification models can be applied to the processed data [21].
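The interpolation step can be sketched in a few lines of NumPy. This is a simplified illustration of the idea rather than a full SMOTE implementation; the function name `smote_sample` and its parameters are hypothetical. In practice, a maintained implementation such as `SMOTE` from the imbalanced-learn package would normally be used.

```python
import numpy as np


def smote_sample(minority: np.ndarray, n_new: int, k: int = 5, seed: int = 0) -> np.ndarray:
    """Create n_new synthetic points by interpolating towards nearest neighbors."""
    rng = np.random.default_rng(seed)
    n = len(minority)
    k = min(k, n - 1)  # cannot have more neighbors than other points
    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(n)                       # random minority example
        neighbor = neighbors[base, rng.integers(k)]  # one of its k nearest neighbors
        gap = rng.random()                           # position along the segment
        synthetic[i] = minority[base] + gap * (minority[neighbor] - minority[base])
    return synthetic


minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_points = smote_sample(minority, n_new=10, k=2)
print(new_points.shape)
```

Every synthetic point lies on a line segment between two existing minority examples, so it adds variation without leaving the region the minority class already occupies.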
Initially, we used a support vector classifier to classify the transcriptions
based on the diseases. With SVC, we obtained an accuracy of 39%, which is
low. So, we used multiclass logistic regression along with the scispaCy package to
improve the accuracy, and this time we obtained an
accuracy of 64%. Higher accuracy is hard to attain because the dataset suffers from
data imbalance. The synthetic minority oversampling
technique, available in Python, is one solution to the data imbalance problem. After
applying SMOTE, we obtained an accuracy of 67%, which is higher than that of
the other two approaches.
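The comparison itself can be outlined with a short sketch that trains both classifiers on the same stratified split and reports their accuracy. The synthetic, imbalanced dataset from `make_classification` is a hypothetical stand-in for the TF-IDF features, and default hyperparameters are assumed, so the printed numbers bear no relation to the accuracies reported in this paper.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Hypothetical imbalanced 4-class problem standing in for the transcription features.
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=10, n_classes=4,
    weights=[0.55, 0.25, 0.15, 0.05], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "multiclass logistic regression": LogisticRegression(max_iter=1000),
    "support vector classifier": SVC(),
}
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {scores[name]:.2f}")
```

Because accuracy can hide poor performance on minority classes, a per-class report such as `sklearn.metrics.classification_report` is a useful complement on imbalanced data.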
5 Conclusion
We have used a support vector classifier on the medical transcription dataset, and
we have tried to classify the medical specialties (diagnosis) based on the available
medical transcriptions. As presented in the results section, even with
advanced techniques, achieving higher accuracy is very challenging
because the dataset is class-imbalanced. On the other hand, we found that the data
itself is noisy, and a customized feature extraction technique
could be expected to yield better results. Future work will be focusing on using more suitable
References
1. Dalianis H (2018) Clinical text mining: secondary use of electronic patient records. Springer,
Stockholm
2. Tancev G (2019) Mining and classifying medical documents, 25 Oct 2019. [Online]. Available: https://towardsdatascience.com/mining-and-classifying-medical-text-documents-1876462f73bc. Accessed on 02 Apr 2021
3. Lee J, Yoon W, Kim S, Kim D, Kim S, So CH, Kang J (2020) BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36(4):1234–1240
4. Singhal A, Simmons M, Lu Z (2016) Text mining for precision medicine: bringing structure to
EHRs and biomedical literature to understand genes and health. Adv Exp Med Biol 939:139–
166
5. Rijo R, Martinho R, Pereira L, Silva C (2015) Text mining applied to electronic medical records.
Int J E-Health Med Commun 6(3):1–18
6. Blog G (2018) Complete tutorial on text classification using conditional random fields model (in Python). Analytics Vidhya, 13 Aug 2018. [Online]. Available: https://www.analyticsvidhya.com/blog/2018/08/nlp-guide-conditional-random-fields-text-classification/. Accessed on 05 Apr 2021
7. Gong L (2018) Application of biomedical text mining. IntechOpen I:427–428
8. Thabtah F, Cowling P, Peng Y (2005) MCAR: multi-class classification based on association
rule. ResearchGate
9. ScienceDirect, 05 Apr 2005. [Online]. Available: https://www.sciencedirect.com/. Accessed
on 05 Apr 2021
10. Fleuren WW, Alkema W (2015) Application of text mining in the biomedical domain. Sci
Direct 74(1):97–106
11. Lee J, Yoon W, Kim S, Kim D, Kim S, So CH, Kang J (2019) BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Cornell University, pp 10–11
12. Boyle T (2020) Medical transcriptions. Kaggle, Apr 2018. [Online]. Available: https://www.kaggle.com/tboyle10/medicaltranscriptions. Accessed on 25 Oct 2020
13. MTSAMPLES.COM (2021) mtsamples, 1 Apr 2021. [Online]. Available: https://www.mtsamples.com/index.asp. Accessed on 05 Apr 2021
14. Srinidhi S (2020) Lemmatization in natural language processing (NLP) and machine learning. Towards Data Science, 26 Feb 2020. [Online]. Available: https://towardsdatascience.com/lemmatization-in-natural-language-processing-nlp-and-machine-learning-a4416f69a7b6. Accessed on 05 Apr 2021
15. Wikipedia (2021) Wikipedia, the free encyclopedia, 08 Mar 2021. [Online]. Available: https://en.wikipedia.org/wiki/Tf%E2%80%93idf. Accessed on 05 Apr 2021
16. Jolliffe IT, Cadima J (2016) Principal component analysis: a review and recent developments.
Royal Soc Publishing 374(2065)
17. Ben-Hur A (2008) Scholarpedia. [Online]. Available: http://www.scholarpedia.org/article/Support_vector_clustering. Accessed on 05 Apr 2021
18. Brownlee J (2020) Machinelearningmastery: SMOTE for imbalanced classification with Python, 17 Jan 2020. [Online]. Available: https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/. Accessed on 03 Apr 2021
19. Javatpoint, [Online]. Available: https://www.javatpoint.com/machine-learning-support-vector-machine-algorithm