CASCADENET: An LSTM based Deep
Learning Model for Automated ICD-10 Coding

Sheikh Shams Azam(1), Manoj Raju(1), Venkatesh Pagidimarri(1), and Vamsi Chandra Kasivajjala(2)

(1) Foundation Inc., Marina Del Rey CA 90292, USA
    s.shams.official@gmail.com, rmanuuu@gmail.com, venki.460@gmail.com
(2) Healthcare Information and Management Systems Society, India
    vamsichandra@gmail.com

Abstract. In this paper, a cascading hierarchical architecture using LSTM is proposed for automatic mapping of ICD-10 codes from clinical documents. The fact that it becomes increasingly difficult to train a robust classifier as the number of classes (over 93k ICD-10 codes) grows, coupled with other challenges such as the variance in length, structure and context of the text data, and the lack of training data, puts this task among the hardest tasks of Machine Learning (ML) and Natural Language Processing (NLP). This work evaluates the performance of various methods on this task, including basic techniques such as TF-IDF and inverted indexing using concept aggregation based on the exhaustive Unified Medical Language System (UMLS) knowledge sources, as well as advanced methods such as an SVM trained on a bag-of-words model, and CNN and LSTM models trained on distributed word embeddings. The effect of breaking down the problem into a hierarchy is also explored. The data used is an aggregate of ICD-10 long descriptions along with anonymised annotated training data provided by a few private hospitals in India. A study of the above-mentioned techniques leads to the observation that a hierarchical LSTM network outperforms the other methods in terms of accuracy as well as micro- and macro-averaged precision and recall scores on the held-out (test) data.

Keywords: deep learning, natural language processing, medical coding

1 Introduction
Medical Coding is the process of classifying clinical documents and health records according to standard coding conventions. Some of the popular coding conventions in wide use are the International Classification of Diseases (ICD) [1], Current Procedural Terminology (CPT) [2], and Logical Observation Identifiers Names and Codes (LOINC) [3]. ICD-10, the 10th revision of ICD codes, is a coding standard released by the World Health Organization (WHO) for diseases,
signs and symptoms, abnormal findings, complaints, social circumstances, and

external causes of injury or disease. There are a total of over 93k codes in this
system as of the time of this work.
Annotation of unstructured clinical documents using ICD codes also forms the foundation for various subsequent analyses, such as the identification of global statistical health trends, patient similarity analysis that can help in the prediction of disease onset, and the development of clinical decision support systems (CDSS). It is also traditionally used as the first step in the insurance claim filing process.
Since the annotation of ICD codes from clinical notes requires a significant amount of time and expert knowledge in the field of medicine, the task is prone to errors, both in terms of incorrectly attributed codes and missed diagnoses, due to the high number of documents that have to be processed manually by human coders.
In order to tackle these issues of accuracy and time consumption, this study evaluates various techniques for automating the process of ICD coding. The work leads to the conclusion that a cascaded hierarchical architecture of Long Short-Term Memory (LSTM) [4] networks, where inference from each level cascades to the subsequent levels, performs particularly well on this task. State-of-the-art performance on this classification task is achieved by the proposed architecture.
The rest of the paper is organized as follows. Section 2 discusses the related work in the field. Section 3 explains background concepts such as the organization of ICD-10 data and UMLS. Section 4 reviews the optimization objectives for the classifiers. Section 5 explains the architecture of the system and the preprocessing and training strategies used to achieve the best results. Section 6 reports the experimental results and discussions based on the evaluation metrics. Finally, Section 7 concludes the work and discusses its limitations and future scope.

2 Related Work

There is a large body of research on the automatic classification of clinical text. However, a major portion of these publications is devoted either to the classification of a carefully selected subset of codes or to classification as per ICD-9 conventions.
Farkas et al. [5] presented automated construction of ICD-9-CM codes using a hand-crafted rule-based system. They use preprocessing techniques such as lemmatization, punctuation removal, and removal of negated diagnoses (using manually collected indicator words such as can, may, etc.) to normalize the data. They experimented with multi-label classification using binary relevance [6] as well as label powerset [7] methods. The best model achieved an 88.93% F-measure on the test set. The work also points to the issue of data sparseness when the problem is formulated as multi-label classification. This issue only compounds as the number of classes increases, which is the case with the jump from ICD-9 to ICD-10. Hence, direct application of multi-label classification was ruled out.
Pereira et al. [8] propose a semi-automated ICD-10 coding system using a mapping between MeSH [9] and ICD-10 extracted from the UMLS Metathesaurus [10], and the usage of drug prescriptions by exploiting a mapping between prescription drugs and relevant ICD-10 codes. The aim of the system is to assist human coders by narrowing down the code sets to choose from using the contextual information. While this system achieves a recall as high as 68% under some settings, it is dependent on the quality of the contextual information recorded.
Lita et al. [11] presented data-mining-based approaches using Support Vector Machines (SVM) [12] and Bayesian ridge regression [13] for ICD-9 classification, but these cannot be extended to ICD-10 because of the high number of classes. These methods do not take the semantic aspect of the words into consideration during text classification. There are various other efforts which achieve varying degrees of success on specific use cases, but these works were not general efforts applicable to the entire set of ICD codes.
One of the notable works, presented by Subotin et al. [14], considers clinical text and builds a 2-level hierarchical classifier to predict ICD-10-PCS codes using regularized logistic regression. The work uses a bag of tokens available in their training data as the feature set for ICD-10-PCS code prediction. They build a code-to-concept mapping model and a concept-to-code mapping model to rank the codes.
Since the organization of the codes in ICD-10 differs substantially (in terms of hierarchical structure, number of codes and granularity of the descriptions) when compared to ICD-9, it is only natural that the work done on the latter cannot be effectively extended to the former.
Our work differs from the ones presented because it not only takes into account the semantic aspect of words in a text but also breaks the classification down to basic levels to achieve maximum accuracy and good decision boundaries among classes. Also, ICD codes are annotated using the description of diagnoses only and do not depend on supplementary information such as prescriptions. Our original contribution includes the formulation of the task as a hierarchy, limiting the number of classes at each level by transforming the classification task to the character level, the architecture design of the system in which the inference from each hierarchy cascades as a feature to the succeeding stages, the use of distributed word embeddings (Word2Vec [15]) to take the semantics of the words into account, and considering the words as a sequence so that the loss of context due to changes in word order is kept at a minimum. The final architecture can be loosely seen as a combination of multi-class classification and multi-label classifier chains [16].

3 Background

3.1 ICD-10 Coding System

ICD-10 is the latest in-use version of ICD. ICD-10 is split into two systems, namely ICD-10 Clinical Modification (ICD-10-CM) and ICD-10 Procedure Coding System (ICD-10-PCS). ICD-10-CM is the diagnostic coding used by healthcare providers, while ICD-10-PCS is used for inpatient procedure reporting by hospitals.

Fig. 1. ICD-10 code structure (example: T33.42XS), showing the category, subcategory, specificity, and extender portions of a seven-character code

The ICD-10 coding standard is a seven-character coding convention and follows a curated hierarchical structure, unlike the previous version of ICD, i.e. ICD-9.
Figure 1 gives a summary of the function of each of the seven characters in an ICD-10 code. A minimum of the first three characters, together called the category, must be present to form a valid ICD-10 code. Among the three, the first character is a letter (any letter except U) that indicates the type of diagnosis, e.g. injury, poisoning, infection, etc. The next two characters are numeric and together they point out the specific ailment or diagnosis of the given type. Occasionally, a dot separator is used after the 3rd character, but it is not mandatory. The characters at positions four through seven may or may not follow the category. These are filled on the basis of the precision of the information present in the description of the diagnosis, such as severity, etiology (the cause, set of causes, or manner of causation of a disease or condition), laterality, etc. The last character is an extender or extension, which is used to specify the type of encounter, namely initial encounter (patient is receiving active treatment), subsequent encounter (encounter after an active phase of treatment), or sequela (complication or condition as a direct result of an injury). "X" is used as a placeholder if a position is to be skipped before specifying later characters.
Revisions of the ICD-10 coding system are released periodically, which append, modify, and remove previous errors and discrepancies in the codes, spellings, etc. According to the statistics presented in [17], ICD-10 is cited in more than 20,000 scientific articles and is adopted across more than 100 countries. With the growing adoption of ICD coding, there is a very high demand for automatic coding systems, but there are not many such solutions available for use by hospitals and other consumers of ICD information.

3.2 Clinical Documents vs Diagnosis Phrases

Clinical documents can be defined as digital or analog records detailing medical treatment, medical trials, or clinical tests. These are generally a narrative of a physician's evaluation and follow-up notes for a given patient, and can be presented in either tabular or free-text formats. Clinical documents may be in the form of discharge summaries, surgical reports, progress notes, etc.
It is important to note the difference between clinical documents and diagnosis phrases. A clinical document is a detailed report of the patient history of findings, medications, and procedures.

Table 1. Q-Map Output

Document: Ligament tear observed in Radiograph. Patient is a 53 year old female and has been recommended for unilateral Total Knee Replacement. Patient was given Heparin as a prophylactic treatment to prevent DVT.

Output:
  Diagnosis: Ligament Tear
  Procedures: Radiograph, Total Knee Replacement, Prophylactic Treatment
  Medicines: Heparin

Diagnosis phrases, in contrast, are chunks or phrases extracted from these clinical documents that contain only the relevant information about the patient's diagnosis. For example, a clinical document may contain the following line in the patient diagnosis: "Patient is a 48-year-old male and presented himself to the Emergency Department with chest pain. AFib and massive cardiac arrest was observed.", but only the phrases "chest pain", "AFib" (short form of atrial fibrillation) and "cardiac arrest" are the relevant diagnosis phrases.
The predictive models mentioned in this work, such as LSTM, Convolutional Neural Networks (CNN) [18], and term frequency-inverse document frequency (TF-IDF), are trained on the diagnosis phrases and not on the entire unprocessed clinical document for the purpose of predicting the ICD codes; i.e., training here refers only to a formulation of the task that leads to successful ICD coding of diagnosis phrases. Apart from this, we also explore and present the preprocessing pipeline used to retrieve these phrases of interest from real-world discharge summaries. These supporting systems are briefly discussed in sections 3.3 and 3.4.

3.3 Q-Map
Q-Map [19] is a simple yet powerful system that can sift through large datasets to retrieve structured information aggressively and efficiently. It is backed by an effective mining algorithm based on curated knowledge sources that is both fast and configurable.
It works in two phases, namely training (indexing) and testing. The preprocessing options are available through a configuration that can switch on modules as and when required. The heart of the system is the Aho-Corasick [20] algorithm, which builds a finite-state machine (very similar to a tree structure) with an indexed failure state for each node during the training phase.
By using the configuration options for semantic types, one can filter out only the diagnosis terms. Because UMLS is used as the knowledge source, the filtered terms and phrases are elementary, as that is an intrinsic property of UMLS concepts. The elementary nature of the concepts retrieved ensures that each output concept has one and only one ICD-10 code associated with it. Table 1 gives a sample output of the Q-Map system.
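To make the indexing and matching phases concrete, the following is a minimal sketch (not the authors' Q-Map code) of dictionary-based concept matching with the Aho-Corasick algorithm via the pyahocorasick package; the tiny concept dictionary, semantic types and codes shown are hypothetical.

```python
# Illustrative sketch (not the authors' Q-Map code): dictionary-based concept
# matching with the Aho-Corasick algorithm via the pyahocorasick package.
# The concept dictionary, semantic types and codes below are hypothetical.
import ahocorasick

concepts = {
    "ligament tear": ("Diagnosis", "S83.5"),
    "total knee replacement": ("Procedure", None),
    "heparin": ("Medicine", None),
}

automaton = ahocorasick.Automaton()
for phrase, meta in concepts.items():
    automaton.add_word(phrase, (phrase, meta))   # index each surface form
automaton.make_automaton()                       # build failure links (the "training"/indexing phase)

text = ("ligament tear observed in radiograph. patient has been recommended for "
        "unilateral total knee replacement and was given heparin.")

# Single pass over the text; every indexed phrase found is reported with its metadata.
# (A production system would also enforce word boundaries and normalize case.)
for end_idx, (phrase, (sem_type, code)) in automaton.iter(text):
    print(f"{sem_type}: {phrase}" + (f" (ICD-10: {code})" if code else ""))
```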

3.4 Negation Detection


Farkas et al. [5] lay strong emphasis on the detection of negation among diagnoses, as it plays an important role in determining the final performance of such coding systems. For this reason, all the diagnosis phrases retrieved from free clinical narrative texts via Q-Map during real-world application are also passed to the NegEx algorithm [21] (a rule-based negation detection algorithm that can resolve complex sentence structures such as negations of negations) to weed out the negated phrases.
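The snippet below is a much-simplified illustration of the idea behind rule-based negation detection, not the published NegEx algorithm; the trigger list and token window are illustrative assumptions.

```python
# Much-simplified illustration of rule-based negation detection in the spirit of
# NegEx; the real algorithm also handles scopes, pseudo-negations and negations
# of negations. Trigger list and window size here are illustrative assumptions.
import re

NEG_TRIGGERS = ["no", "denies", "without", "negative for", "ruled out"]
WINDOW = 5  # number of tokens after a trigger considered negated

def is_negated(sentence: str, phrase: str) -> bool:
    tokens = re.findall(r"[a-z]+", sentence.lower())
    phrase_tokens = phrase.lower().split()
    for i, tok in enumerate(tokens):
        # check single-word and two-word triggers
        if tok in NEG_TRIGGERS or " ".join(tokens[i:i + 2]) in NEG_TRIGGERS:
            window = tokens[i + 1:i + 1 + WINDOW]
            if all(p in window for p in phrase_tokens):
                return True
    return False

print(is_negated("Patient denies chest pain or shortness of breath.", "chest pain"))  # True
print(is_negated("Patient presented with chest pain.", "chest pain"))                 # False
```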

4 Optimization Objectives
The study begins by defining the ICD classification task and the corresponding optimization objectives. Given an input sequence of arbitrary length $M$, the goal is to produce an ICD code of appropriate length $N$, which can vary between 3 and 7 based on the precision of the description present in the word sequence. Let the input sequence consist of $M$ words $x_1, x_2, \cdots, x_M$ coming from a fixed vocabulary $\nu$ of size $|\nu| = V$. Each word is represented using an indicator vector $x_i$ for $i \in \{1, 2, \cdots, M\}$, where the indicator vector can be either a one-hot encoding over a bag-of-words model or a distributed representation of the word based on Word2Vec [22] training. Furthermore, the notation $x_{[i \cdots j]}$ is used to indicate the sub-sequence from index $i$ to index $j$.

4.1 One-Step Classification


In a one-step classification method, there is a fixed set of possible output classes $Y = \{y_1, y_2, \cdots, y_l\}$, where $l = 93830$ (the number of classes). A one-step classifier takes $x$ as input and outputs $y \in Y$ to maximize the probability $p(y \mid x; \theta)$, where $\theta$ denotes the parameters of the classifier learnt through training.
The classifier tries to find the optimal $\theta$ such that

$$\arg\max_{\theta} s(x, y) \qquad (1)$$

under a scoring function $s : X \times Y \to \mathbb{R}$. If a window size $W$ is used and the conditional log probability is chosen as the scoring function, then $s(x, y)$ from (1) can be approximated using the $W$-th order Markov [23] assumption as follows:

$$s(x, y) = p(y \mid x; \theta) \approx p(y \mid x_{[1 \cdots W]}; \theta) \qquad (2)$$

4.2 Hierarchical Classification


In hierarchical classification, the single-step classification is split into several levels. The breakdown of classes into levels is explained in table 2 and figure 2. Looking at the hierarchies, a natural question arises as to why the second and third characters are not broken down into individual levels; the reason is that, under the ICD-10 coding scheme, they are treated as a single unit instead of being given individual functions.

Table 2. Classification Hierarchies

Hierarchy Classes Characters


Level 1 (L1) 25 First
Level 2 (L2) 106 Second and Third
Level 3 (L3) 23 Fourth
Level 4 (L4) 14 Fifth
Level 5 (L5) 19 Sixth
Level 6 (L6) 24 Seventh

Fig. 2. Hierarchical breakdown of the ICD-10 code (example: T33.42XS) into classification levels L1-L6 (L1: first character, L2: second and third characters, L3-L6: fourth through seventh characters)

Following this intuition, the levels have been broken down based on how ICD-10 codes are designed, as explained below in section 5.1.
Following the new structure, the classifier at hierarchy $i$ takes as input $x$ along with the outputs from the preceding classifiers (if any), $y^{(i-1)}, \cdots, y^{(1)}$, and outputs $y^{(i)} \in Y^{(i)}$, where $Y^{(i)} = \{y^{(i)}_1, y^{(i)}_2, \cdots, y^{(i)}_{l_i}\}$ and $l_i$ is the number of classes at hierarchy $i$. The notation $y^{(i)}$ is used to denote the variable at level $i$. Similarly, the notation $y^{(i \cdots j)}$ is used to denote the corresponding variables through levels $i$ to $j$.
In hierarchical classification, a classifier at level $i$ tries to find the optimal $\theta$ such that

$$\arg\max_{\theta} s(x, y^{(1 \cdots i)}) \qquad (3)$$

For a window size $W$ and conditional log probability as the scoring function, $s(x, y^{(1 \cdots i)})$ is defined as

$$s(x, y^{(1 \cdots i)}) = p(y^{(i)} \mid y^{(1 \cdots (i-1))}, x; \theta) \approx p(y^{(i)} \mid y^{(1 \cdots (i-1))}, x_{[1 \cdots W]}; \theta) \qquad (4)$$
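Chaining (4) over all six levels makes explicit how the score of a complete (padded, seven-character) code factorizes into per-level conditionals. The restatement below is an elaboration added for clarity, with per-level parameters $\theta^{(i)}$ and per-level windows $W_i$ introduced as notation.

```latex
p\bigl(y^{(1\cdots 6)} \mid x\bigr)
  = \prod_{i=1}^{6} p\bigl(y^{(i)} \mid y^{(1\cdots(i-1))}, x; \theta^{(i)}\bigr)
  \approx \prod_{i=1}^{6} p\bigl(y^{(i)} \mid y^{(1\cdots(i-1))}, x_{[1\cdots W_i]}; \theta^{(i)}\bigr)
```

In practice, each level is decoded greedily: the arg max at level $i$ is fixed and cascaded forward as an input feature to level $i+1$ (see section 5.4).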

5 Architecture and Procedure

5.1 Training Data

The training data is obtained from the official release of the ICD-10 data, 2017 version, released by the Centers for Disease Control and Prevention (CDC) [24], and annotated data (containing diagnosis phrases from clinical documents and manually annotated ICD codes) provided by some private hospitals in India.
The data received from the hospitals is in compliance with the Information Technology Act, 2000 and the Information Technology (Amendment) Act, 2008 under the Indian Penal Code. The data is collected by the hospitals with signed patient consent and is anonymized to mask all sensitive personal data and information (SPDI) at the source itself.
The official data released by the CDC consists of ICD-10 codes along with their prescribed long and short descriptions. The data has 93830 unique codes, where each code is accompanied by corresponding long and short descriptions. The long descriptions are the actual descriptions of the ICD codes, while the short descriptions use abbreviations that might not be universally accepted. For example, consider the short description "Intraop and postproc comp and disord of eye and adnexa, NEC" for the corresponding long description "Intraoperative and postprocedural complications and disorders of eye and adnexa, not elsewhere classified". The short description uses the abbreviation "NEC" for "Not Elsewhere Classified", but there are several other expansions in use, such as "Necrotizing Enterocolitis". Similarly, the short descriptions contain many acronyms and shortened word forms (such as enctr. for encounter, sql. for sequela) which increase the number of Out of Vocabulary (OOV) words when using the pre-trained word embeddings (explained in section 5.2).
The data contributed by the private hospitals comes in the form of phrases extracted from discharge summaries and clinical notes, consisting of only the relevant information about a diagnosis, along with ICD-10 codes manually annotated by doctors from various specialities. The codes are re-annotated by the consulting clinician and inter-rater reliability (IRR) is calculated using the kappa statistic. The scores are as high as 0.94 for the first character and gradually fall to 0.72 by the seventh character.
Since the institutions contributing data are multi-speciality hospitals with around 600 beds, the data contains codes from all major classes of ICD-10. A total of 134733 data rows are contributed, comprising 69777 distinct descriptions that point to 14692 distinct ICD-10 codes.
The metrics regarding the distribution of codes and the lengths of descriptions in the data provided by the hospitals, compared with those in the ICD-10 data released by the CDC, are presented in table 3 and table 4 respectively.
Together, the number of data records available is around 163k, with a total vocabulary size of 13634, i.e. |ν| = 13634. A standard stratified train-test split of 70:30 is performed on the data to ensure that none of the codes is missed during the training phase. Due to the lack of data, however, stratification does not ensure the same set of classes in the train and test splits.

Table 3. Distribution of Distinct ICD-10 Codes

Granularity CDC Data Hospital Data


7-Character Codes 93830 14692
3-Character Codes 1910 1620
Level 1 (L1) 25 25
Level 2 (L2) 106 105
Level 3 (L3) 23 23
Level 4 (L4) 14 14
Level 5 (L5) 19 16
Level 6 (L6) 23 16

Table 4. Code Description Length Metrics

Metrics CDC Data Hospital Data


Mean 10.05 5.06
Median 9 4
Mode 7 4
Max 32 30
Min 1 1

The metrics about the class coverage are presented later in the paper while discussing the results.
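As an illustration of this step, the following sketch shows a 70:30 stratified split with scikit-learn; the toy data is hypothetical, and codes that occur only once would need separate handling, which is exactly why stratification cannot guarantee identical class sets in both splits.

```python
# Illustrative sketch (not the authors' exact pipeline) of a 70:30 stratified
# train-test split with scikit-learn. The toy data below is hypothetical.
from sklearn.model_selection import train_test_split

descriptions = [
    "fever of unknown origin", "patient has fever", "high grade fever",
    "type 2 diabetes mellitus with foot ulcer", "diabetic foot", "diabetic foot ulcer",
]
icd_codes = ["R509XXX", "R509XXX", "R509XXX", "E11621X", "E11621X", "E11621X"]

# Stratifying on the label keeps the class ratios roughly equal in both splits;
# codes that occur only once would have to be handled separately, which is why
# stratification cannot guarantee the same set of classes in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    descriptions, icd_codes, test_size=0.30, stratify=icd_codes, random_state=42
)
print(len(X_train), len(X_test))  # 4 2
```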

5.2 Word Embeddings


For the purpose of training, input sequences are fed as distributed vector representations learned by Word2Vec instead of one-hot encoded vectors over a bag-of-words model.
Some of the advantages of this technique are the removal of data sparsity, as Word2Vec is a dense model; dimensionality reduction, as the vector dimension in Word2Vec is several orders of magnitude smaller than in the bag-of-words model; and the inclusion of a sense of semantic distance, which is an intrinsic property of Word2Vec representations.
The pre-trained open-source Word2Vec model released by the Biomedical Natural Language Processing lab (BioNLP) [25] was used for training. The vector representations were trained on text from PubMed [26], PMC [27] and English Wikipedia [28]. The pre-trained model from BioNLP has a vocabulary size of 5,443,656, with each word having a 200-dimensional vector representation. This model is apt for our use case, as the documents used for training it include medical documents as well as general English documents, which makes the vector space representations inclusive of the medical context of words like seizure, collapse, procedure, etc.

Table 5. One-Step Classification Performance

Model Used Accuracy


TF-IDF 0.2932
Inverted Index 0.3727
SVM 0.1407
CNN 0.5454
LSTM 0.5798

During our training, we encountered a total of 760 OOV cases. Most of these were irregular words such as lemli, xyy, etc., which might be the result of spelling mistakes or the usage of acronyms in the data.
Removing records containing OOV words from the training data did not have a positive effect on the evaluation metrics, pointing to the fact that the advantages of the contextual information provided by the word embeddings far outweigh the disadvantages of OOV words.
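A minimal sketch of how such pre-trained vectors can be loaded and checked for OOV coverage with a recent version of gensim; the file name follows the distribution at bio.nlplab.org and is an assumption here.

```python
# Illustrative sketch: loading the pre-trained BioNLP Word2Vec vectors with
# gensim (>=4) and checking coverage of the corpus vocabulary. The file name
# is assumed; adjust the path to the downloaded model as needed.
import numpy as np
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format("wikipedia-pubmed-and-PMC-w2v.bin", binary=True)
print(kv.vector_size)  # 200-dimensional vectors

corpus_vocab = ["pyrexia", "fever", "lemli", "xyy"]            # toy vocabulary
oov = [w for w in corpus_vocab if w not in kv.key_to_index]
print("OOV words:", oov)

# Embedding matrix for the LSTM input layer: one 200-d row per corpus word,
# zero vectors for OOV words.
embedding_matrix = np.zeros((len(corpus_vocab), kv.vector_size))
for i, w in enumerate(corpus_vocab):
    if w in kv.key_to_index:
        embedding_matrix[i] = kv[w]
```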

5.3 Preprocessing
Basic text preprocessing [29] is performed on the training data in the form of handling punctuation and extra spaces in the text. No other preprocessing, such as stop-word removal, stemming or lemmatization, is performed. The LSTM algorithm is inherently capable of understanding the context and importance of a word in a sentence and the significance it carries in the classification process.
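A minimal sketch of this light normalization (lowercasing is an additional assumption on our part, not stated above):

```python
# Minimal sketch of the light preprocessing described above: normalize
# punctuation and collapse extra whitespace, leaving stop words and word
# forms untouched for the LSTM to exploit. Lowercasing is an assumption.
import re

def preprocess(text: str) -> str:
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)   # replace punctuation with spaces
    text = re.sub(r"\s+", " ", text)       # collapse repeated whitespace
    return text.strip()

print(preprocess("Intraoperative & post-procedural complications,   of eye."))
# -> "intraoperative post procedural complications of eye"
```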

5.4 Experiments
The work begins by employing various techniques such as TF-IDF, inverted indexing, SVM, CNN and LSTM for the single-step classification task and observing their performance, which is summarized in table 5. The top performers on this task are re-applied to the hierarchical classification to achieve a performance boost. The accuracy on the held-out data (the test dataset mentioned in section 5.1) is taken as the final metric of performance for the evaluation of the models.
For the TF-IDF method, the standard smoothed IDF function was used. The accuracy observed was as low as 29.3%, a strong reflection of the fact that word importance and the lengths of the descriptions representing the concepts vary widely in the medical domain; e.g., "pyrexia" and "fever" are semantically very similar, but such resemblance is not captured by the TF-IDF method. Similarly, "diabetic foot" and "diabetes mellitus with foot ulcer" represent the same diagnosis, but the difference in sentence length leads to low TF-IDF scores for such examples.
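One plausible realization of such a TF-IDF baseline (the exact setup is not spelled out above) is to index the code descriptions with a smoothed-IDF vectorizer and assign a query the code of its most similar description:

```python
# One plausible realization of a TF-IDF baseline (an assumption, not the
# paper's exact setup): index the code descriptions with a smoothed-IDF
# vectorizer and assign a query the code of its most similar description.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

codes        = ["R509", "E11621"]
descriptions = ["Fever, unspecified", "Type 2 diabetes mellitus with foot ulcer"]

vectorizer = TfidfVectorizer(smooth_idf=True)     # smoothed IDF, as mentioned above
doc_matrix = vectorizer.fit_transform(descriptions)

query = vectorizer.transform(["diabetic foot"])
best  = cosine_similarity(query, doc_matrix).argmax()
print(codes[best])  # E11621 here, but a query like "pyrexia" would fail to match "fever"
```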
To overcome the drawbacks of semantic loss and sequence length explained above for the TF-IDF method, an attempt was made to bag the synonyms of a medical

Table 6. Evaluation Metrics Results of Individual Classifiers in the Hierarchical Setup

Hierarchy  Accuracy  Micro Precision  Micro Recall  Micro F1  Macro Precision  Macro Recall  Macro F1  Classes
L1 Test 0.9586 0.9586 0.9586 0.9586 0.9334 0.9268 0.9301 25
L1 Train 0.9823 0.9823 0.9823 0.9823 0.9724 0.9691 0.9707 25
L2 Test 0.8656 0.8656 0.8656 0.8656 0.8701 0.8645 0.8673 106
L2 Train 0.9205 0.9205 0.9205 0.9205 0.9300 0.9255 0.9277 106
L3 Test 0.8067 0.8067 0.8067 0.8067 0.8199 0.7624 0.7901 22
L3 Train 0.8652 0.8652 0.8652 0.8652 0.9301 0.8418 0.8838 23
L4 Test 0.8935 0.8935 0.8935 0.8935 0.9067 0.8754 0.8908 13
L4 Train 0.9410 0.9410 0.9410 0.9410 0.9498 0.9488 0.9493 14
L5 Test 0.9627 0.9627 0.9627 0.9627 0.9123 0.9018 0.9070 19
L5 Train 0.9851 0.9851 0.9851 0.9851 0.9529 0.9299 0.9413 19
L6 Test 0.9928 0.9928 0.9928 0.9928 0.9923 0.9825 0.9874 24
L6 Train 0.9976 0.9976 0.9976 0.9976 0.9984 0.9943 0.9963 24

Table 7. Cumulative Evaluation Metrics Results of the Hierarchical Setup

Hierarchy  Accuracy  Micro Precision  Micro Recall  Micro F1  Macro Precision  Macro Recall  Macro F1  Classes
L2 Test 0.8995 0.8995 0.8995 0.8995 0.8610 0.8145 0.8371 1634
L2 Train 0.8532 0.8532 0.8532 0.8532 0.8636 0.8421 0.8527 1879
L3 Test 0.7833 0.7833 0.7833 0.7833 0.7319 0.7366 0.7342 5888
L3 Train 0.7885 0.7885 0.7885 0.7885 0.7290 0.7369 0.7329 8935
L4 Test 0.7741 0.7741 0.7741 0.7741 0.7391 0.7502 0.7446 11560
L4 Train 0.7919 0.7919 0.7919 0.7919 0.7890 0.7919 0.7905 21912
L5 Test 0.7743 0.7743 0.7743 0.7743 0.8014 0.8133 0.8073 17683
L5 Train 0.7959 0.7959 0.7959 0.7959 0.8796 0.8813 0.8805 36418
L6 Test 0.7205 0.7205 0.7205 0.7205 0.8776 0.9939 0.8857 23429
L6 Train 0.7149 0.7149 0.7149 0.7149 0.8079 0.8092 0.8086 71361

concept under a single bucket using Q-Map (an optimized version of MetaMap [30]), which is explained in section 3.3. After concept retrieval, an inverted index is built, wherein each bucket points to the set of ICD-10 codes it occurs in. The accuracy achieved using this method was 37.27%, a considerable jump from the previous method. A drawback observed with this method was the excessive dependency on word order and the organisation of sentences in the concept retrieval step. For example, while "cancer of brain", "brain cancer", etc. are part of the UMLS database and are readily identified in text, a string like "cancerous tissue mass in temporal lobe" is not identified as a single concept, because even though UMLS is a very comprehensive aggregation of medical concepts, it is not exhaustive.
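A minimal sketch of such a concept-based inverted index, with the concept extraction step mocked by a fixed mapping (the concepts and codes shown are illustrative):

```python
# Minimal sketch of the concept-based inverted index described above: each
# normalized concept (as a Q-Map-style extractor would return it) points to
# the set of ICD-10 codes whose descriptions contain it; candidate codes are
# ranked by how many query concepts they cover. The mapping is illustrative.
from collections import defaultdict

inverted_index = defaultdict(set)
inverted_index["fever"].update({"R509"})
inverted_index["diabetes mellitus"].update({"E11621", "E119"})
inverted_index["foot ulcer"].update({"E11621", "L97"})

def rank_codes(query_concepts):
    scores = defaultdict(int)
    for concept in query_concepts:
        for code in inverted_index.get(concept, set()):
            scores[code] += 1
    # sort by coverage, then by code for a deterministic order
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

print(rank_codes(["diabetes mellitus", "foot ulcer"]))
# [('E11621', 2), ('E119', 1), ('L97', 1)]
```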
Following the above two approaches, an SVM model, a popular machine learning technique, was also evaluated on the single-step classification task using a bag-of-words model. This model gave an accuracy of 14.07%. It is posited that the major reason for such low accuracy is the high cardinality of the feature set, i.e. the vocabulary size of 13634. Similarly low performance was observed when applying the SVM to features extracted using Q-Map.
Observing the drawbacks of the previous methods, the work pivoted to deep learning techniques, namely CNN and LSTM, because of the advantages they offer in overcoming the issues of word order, sentence structure, word importance, word semantics, etc. When setting up these models for single-step classification, the input layer is weighted using the distributed word embeddings from Word2Vec (explained in section 5.2) and the model parameters are learnt, providing a softmax score which indicates the likelihood of belonging to a class. Both CNN and LSTM outperformed the previous techniques. The CNN converged to an accuracy of 54.54%, while the accuracy of the LSTM was observed to be 57.98%. The authors posit that the LSTM performs better than the CNN because it takes into consideration the word order in a document, which plays a significant role; e.g., word order and semantics are crucial in differentiating descriptions like "Hypertensive chronic kidney disease" and "Chronic kidney disease", where the former is coded as "I12" while the latter is coded as "N18".
Among the methods evaluated on single-step classification, LSTM and CNN give the most promising results. Hence, the hierarchical classification is implemented using these methods, and the experimental results maintained the order of performance observed in one-step classification, i.e. LSTM outperformed CNN. Following this, the hierarchical LSTM was further optimized by hyper-parameter tuning and changes to the network architecture, as explained below in section 5.5. The results are explained in detail in the sections that follow.
A few adjustments are made to the coding scheme for implementing the hierarchical classification. One major hindrance in the process is the uncertainty in the length of the expected ICD code (which ranges between 3 and 7). This is tackled by padding the codes with X's to obtain a uniform code length of 7. This padding is harmless because X's are originally utilized as placeholders, as explained in section 5.1.
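A small sketch of this padding step and of the per-level targets it yields (the dot-stripping detail is an assumption):

```python
# Sketch of the padding step (an assumed implementation): strip the optional
# dot, then right-pad every code with "X" to a uniform length of seven
# characters before splitting it into the per-level targets of Table 2.
def pad_code(code: str, length: int = 7) -> str:
    return code.replace(".", "").ljust(length, "X")

print(pad_code("T33.42"))   # 'T3342XX'
print(pad_code("I12"))      # 'I12XXXX'

c = pad_code("T33.42XS")
levels = [c[0], c[1:3], c[3], c[4], c[5], c[6]]   # L1 ... L6
print(levels)  # ['T', '33', '4', '2', 'X', 'S']
```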

Fig. 3. Cascaded hierarchically stacked architecture: the input sequence is fed to the Level 1 classifier (1st character), whose inference cascades along with the input to Level 2 (2nd and 3rd characters), then Level 3 (4th character), Level 4 (5th character), Level 5 (6th character), and Level 6 (7th character)

In the hierarchical classification, there are 6 classifiers set up, one for each level listed in table 2. The inference from each classifier is fed to the next level, as represented mathematically in (4), so the system consists of stacked LSTM classifiers. The architecture can be better understood from figure 3.
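Schematically, inference over the stacked classifiers can be sketched as follows (illustrative pseudocode; `level_models` and the `predict` interface are assumptions, not the authors' implementation):

```python
# Schematic of cascaded inference over the six stacked classifiers (a sketch,
# not the authors' code): each level receives the word sequence plus the
# predictions of all preceding levels and contributes its characters to the
# final code. `level_models` is assumed to hold the six trained classifiers.
def predict_icd10(word_sequence, level_models):
    previous_outputs = []          # inferences cascaded from earlier levels
    code = ""
    for model in level_models:     # L1 ... L6, as listed in Table 2
        y = model.predict(word_sequence, previous_outputs)   # characters for this level
        previous_outputs.append(y)
        code += y
    return code                    # e.g. "T3342XS"
```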
The hierarchical LSTM model outperformed the corresponding model under single-step classification. The evaluation metrics of the model are listed in table 6. It includes the accuracy, macro/micro precision, recall and F1-scores, along with class coverage (number of labels), for both the training and test datasets for the individual classifiers.
Table 7 lists similar evaluation metrics for the cumulative coding at a given level. Cumulative coding means the aggregation of predicted codes at any level given the codes predicted at previous levels. Since the minimum number of characters in a valid ICD-10 code is three, these metrics are presented from level 2 onwards only.

5.5 LSTM Architecture and Training

The experiments started out using a vanilla LSTM, but later a few design modifications were introduced which further improved the performance of the network. As a first step, the entire word sequence along with the inputs from the previous levels was handled as one long sequence fed to the LSTM layer.
As a next step, a different architecture was introduced wherein the word sequence and the inferences from the previous levels are handled as separate entities: only the word sequence is fed into the LSTM layers, while the other inputs are fed into a dense layer, and the two are concatenated before normalizing and adding dropout [31]. It is observed that dropout and batch normalization [32] help in generalizing the classifier by a considerable margin, as the ratio of training accuracy to test accuracy is much closer to unity in models with normalization and dropout.
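A minimal Keras sketch of one such per-level hybrid classifier, following the structure in figure 4; the layer sizes, dropout rate and optimizer are assumptions rather than the exact published configurations.

```python
# Minimal Keras sketch of a per-level hybrid classifier (Figure 4); layer
# sizes, dropout rate and optimizer are assumptions, not the published models.
from tensorflow.keras import layers, models

W           = 10     # window size for this level (Table 8)
EMB_DIM     = 200    # BioNLP Word2Vec dimensionality
PREV_DIM    = 25     # size of the previous level's one-hot output (e.g. L1 has 25 classes)
NUM_CLASSES = 106    # classes at this level (e.g. L2)

# Branch 1: the embedded word sequence goes through an LSTM layer.
seq_in  = layers.Input(shape=(W, EMB_DIM), name="word_sequence")
seq_out = layers.LSTM(128)(seq_in)

# Branch 2: inferences cascaded from previous levels go through a dense layer.
prev_in  = layers.Input(shape=(PREV_DIM,), name="previous_level_outputs")
prev_out = layers.Dense(32, activation="relu")(prev_in)

# Merge, then dense -> batch normalization -> dropout -> softmax, as in Figure 4.
merged = layers.concatenate([seq_out, prev_out])
x = layers.Dense(128, activation="relu")(merged)
x = layers.BatchNormalization()(x)
x = layers.Dropout(0.5)(x)
out = layers.Dense(NUM_CLASSES, activation="softmax")(x)

model = models.Model(inputs=[seq_in, prev_in], outputs=out)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```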

Table 8. Window Size for Classifier Levels

Hierarchy Window Size


Level 1 (L1) 10
Level 2 (L2) 10
Level 3 (L3) 15
Level 4 (L4) 20
Level 5 (L5) 20
Level 6 (L6) 25

Fig. 4. Hybrid network structure at each level using LSTM: the input text sequence passes through an LSTM layer and the inputs from the previous levels pass through a dense layer; the two branches are merged and followed by a dense layer, batch normalization, a dropout layer, and a softmax layer producing the output of the level

Figure 4 gives a brief generalized summary of the individual-level LSTM classifier architecture. Exact architectural details along with model parameters for the individual level classifiers can be seen in figures 5, 6, 7, 8, 9 and 10.
Also, the input word sequence lengths were varied, keeping the rest of the parameters static, in order to study the effect of the window size W on the accuracy. Table 8 lists the final window size parameters that were found optimal by observing the increase in accuracy for an increase in window size. The results are in line with the fact that the later characters in ICD-10 coding relate to the precision of the diagnosis, which is often captured in longer sentences. The text preprocessing (mentioned in section 5.3) for the LSTM is kept minimal because the LSTM is capable of drawing more informed context from a well-formed sentence.
Once the model is defined, one can estimate the model parameters by minimizing the multi-class log loss cost function given in (5) below.

Fig. 5. Level 1 Classifier

Fig. 6. Level 2 Classifier

Fig. 7. Level 3 Classifier

Fig. 8. Level 4 Classifier

Fig. 9. Level 5 Classifier

Fig. 10. Level 6 Classifier



Table 9. Model Output

Input                                   Code      Predicted ICD-10 Description
Patient has fever                       R509XXX   Fever, unspecified
Brain Cancer                            C719XXX   Malignant neoplasm of brain, unspecified
Right eye has retinal vein occlusion    H348112   Central retinal vein occlusion, right eye, stable
  and is stable
Diabetic foot                           E11621X   Type 2 diabetes mellitus with foot ulcer

$$L_{\log}(Y, P) = -\log \Pr(Y \mid P) = -\frac{1}{N} \sum_{i=0}^{N-1} \sum_{k=0}^{K-1} y_{i,k} \log p_{i,k} \qquad (5)$$

where the true labels are a set of samples encoded as a 1-of-$K$ binary indicator matrix $Y$, i.e. $y_{i,k} = 1$ if sample $i$ has label $k$ taken from a set of $K$ labels, and $P$ is a matrix of probability estimates, with $p_{i,k} = \Pr(y_{i,k} = 1)$. The networks are trained as long as there is an improvement in test accuracy, using a learning rate that decreases by an order of magnitude as the training epochs proceed.
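A sketch of this training regime using standard Keras callbacks (the exact schedule, patience and initial learning rate are assumptions):

```python
# Sketch of the training regime described above (assumed Keras callbacks, not
# the authors' exact schedule): train while the held-out accuracy improves and
# drop the learning rate by a factor of 10 as epochs proceed.
from tensorflow.keras.callbacks import EarlyStopping, LearningRateScheduler

def lr_schedule(epoch, lr, initial_lr=1e-3, drop_every=10):
    # divide the learning rate by 10 every `drop_every` epochs
    return initial_lr / (10 ** (epoch // drop_every))

callbacks = [
    LearningRateScheduler(lr_schedule),
    EarlyStopping(monitor="val_accuracy", patience=3, restore_best_weights=True),
]

# Usage with the per-level model from the previous sketch (data omitted here):
# model.fit([X_seq, X_prev], y_onehot, validation_data=..., epochs=50, callbacks=callbacks)
```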

6 Results and Discussion


Our results from the LSTM are presented in table 6 and table 7. Table 9 lists a few examples of automatic coding by the system. It was first noted that the conventional NLP methods such as TF-IDF and inverted indexing show relatively poor performance, indicating that neither term frequency and its importance in the corpus nor bagging techniques alone are sufficiently discriminative for decision making.
Both deep learning techniques, i.e. LSTM and CNN, are better suited to drawing inferences from the available training data, but the advantages of LSTM over CNN in the natural language domain are evident from the performance metrics on single-step classification. It is also observed that breaking the task down into hierarchical classification is advantageous, and better convergence of the networks is achieved compared to single-step classification among the massive number of classes that the data presents.

Table 10. Invalid Codes Predicted

Hierarchy Data Instances Invalid Codes Ratio


L2 Test 32722 0.0019
L2 Train 130885 0.0012
L3 Test 32722 0.0172
L3 Train 130885 0.0156
L4 Test 32722 0.0202
L4 Train 130885 0.0299
L5 Test 32722 0.0280
L5 Train 130885 0.0322
L6 Test 32722 0.0287
L6 Train 130885 0.0379

In particular, the LSTM model that cascades the inference from one level to another helps achieve much better classification metrics. While it gives an accuracy of 72.05% on the seven-character coding scheme, it is worth noting that the accuracy for predicting the equally important three-character code (cumulative accuracy at level 2) is considerably higher at 89.95%. The lower values of the macro metrics in table 7 can be attributed to the high class imbalance, which can be observed in the class coverage column of the same table. The high accuracy of 99.28% for level 6 in table 6 might be due to the fact that the extenders of the ICD code, which are represented by the 7th character, are generally associated with a combination of words that do not present a lot of variation over the texts irrespective of the source, as explained in section 5.1.
It can be noted that while the actual number of classes in ICD-10 is limited to 93830, the sample space of classes in the hierarchical model is very large (25 × 106 × 23 × 14 × 19 × 24 = 389,104,800) if all possible combinations of the classifiers at each level are considered. The number of invalid codes predicted at each level presented in table 7 is therefore studied, and the findings are summarized in table 10.
It can be seen that the total number of invalid codes predicted is very low, with a maximum of 3.79% at the seven-character code prediction level. The same figure is as low as 0.19% for the three-character coding. The low percentage of invalid code predictions suggests that the levels of the LSTM network in the hierarchical setup are statistically strengthened to prevent the prediction of invalid subclasses for a given superclass by means of passing information from one classifier level to the following levels. In a real-world scenario and in production deployments, invalid codes are additionally truncated at the preceding level based on a lookup table to avoid such predictions.
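A small sketch of such a validity filter (an assumed implementation): a prefix lookup table built from the official code list restricts each level to characters that can still lead to a valid code.

```python
# Sketch of the deployment-time validity filter (assumed implementation):
# predictions at each level are restricted to characters that extend the
# current prefix into a valid prefix of some official (padded) code.
valid_codes = {"T3342XS", "R509XXX", "E11621X"}            # toy subset of padded official codes
valid_prefixes = {code[:i] for code in valid_codes for i in range(1, 8)}

def next_chars_allowed(prefix: str, candidate: str) -> bool:
    return (prefix + candidate) in valid_prefixes

print(next_chars_allowed("T33", "4"))   # True
print(next_chars_allowed("T33", "9"))   # False (would be truncated / re-ranked)
```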

It must be noted that the system is only trained to predict the correct code given that the input is a valid diagnosis string. An input string from any other domain, an invalid diagnosis string, or a concatenation of strings belonging to varied classes of ICD-10 codes would lead to haphazard results. This is because, inherently, the neural network is trained to output a single class for a given input sequence; when sequences containing concepts belonging to more than one class are passed in concatenated form, the network will not work with maximum efficiency.
We prevent the occurrence of such cases by devising a set of steps which comprise the best practices for achieving the most efficient coding when applied to clinical free text. A system for splitting diagnosis documents into suitable valid diagnosis phrases that can be annotated with individual codes plays an important role in maintaining the performance of the system in predicting correct codes when applied to a real-world clinical or discharge note. These obstacles are tackled using the proprietary clinical concept extraction system Q-Map (explained in section 3.3), and detection of negation of the retrieved concepts using the NegEx algorithm (explained in section 3.4).

7 Conclusion

The authors designed an efficient LSTM-based hierarchical network for the automatic classification of ICD-10 codes using a limited amount of data by exploiting cascaded hierarchical classification. As a next step, this architecture will be extended to various other hierarchical coding schemes, such as CPT and LOINC, and other architecture options, such as an LSTM with an attention mechanism and a parallel LSTM-CNN model, will be explored. There is also scope to improve the accuracy by deriving attention-based summaries from the discharge notes. Both future directions pose additional challenges in terms of the availability of training data.
The model is already proving to be very useful. Following its development, systems have been deployed on-premise at various hospitals to assist with suggestive ICD annotation of clinical documents. It is also helpful in creating registries of patients for doctor reference. Similarly, using the data from MIMIC-III [33], systems have been developed for next-disease prediction and disease-onset prediction using methods such as collaborative filtering [34]. Apart from this, it is helping in other scenarios such as insurance claims review and document search based on diseases and diagnosis-related groups (DRGs).

References

1. World Health Organization. International statistical classification of diseases and related health problems. Vol. 1. World Health Organization, 2004.
2. Beebe, Michael, et al. Current Procedural Terminology: CPT. American Medical
Association, 2007.

3. McDonald, Clem, et al. Logical observation identifiers names and codes (LOINC)
users’ guide. Indianapolis: Regenstrief Institute (2004).
4. Hochreiter, Sepp, and Jürgen Schmidhuber. Long short-term memory. Neural computation 9.8 (1997): 1735-1780.
5. Farkas, Richárd, and György Szarvas. Automatic construction of rule-based ICD-9-CM coding systems. BMC bioinformatics. Vol. 9. No. 3. BioMed Central, 2008.
6. Boutell, Matthew R., et al. Learning multi-label scene classification. Pattern recog-
nition 37.9 (2004): 1757-1771.
7. Tsoumakas, Grigorios, Ioannis Katakis, and Ioannis Vlahavas. Random k-labelsets
for multilabel classification. IEEE Transactions on Knowledge and Data Engineering
23.7 (2011): 1079-1089.
8. Pereira, Suzanne, et al. Construction of a semi-automated ICD-10 coding help sys-
tem to optimize medical and economic coding. MIE. 2006.
9. Lipscomb, Carolyn E. Medical subject headings (MeSH). Bulletin of the Medical
Library Association 88.3 (2000): 265.
10. Schuyler, Peri L., et al. The UMLS Metathesaurus: representing different views of
biomedical concepts. Bulletin of the Medical Library Association 81.2 (1993): 217.
11. Lita, Lucian Vlad, et al. Large scale diagnostic code classification for medical pa-
tient records. Proceedings of the Third International Joint Conference on Natural
Language Processing: Volume-II. 2008.
12. Hearst, Marti A., et al. Support vector machines. IEEE Intelligent Systems and
their applications 13.4 (1998): 18-28.
13. Hoerl, Arthur E., and Robert W. Kennard. Ridge regression: Biased estimation for
nonorthogonal problems. Technometrics 12.1 (1970): 55-67.
14. Lita, Lucian Vlad, et al. Large scale diagnostic code classification for medical pa-
tient records. Proceedings of the Third International Joint Conference on Natural
Language Processing: Volume-II. 2008.
15. Mikolov, Tomas, et al. Efficient estimation of word representations in vector space.
arXiv preprint arXiv:1301.3781 (2013).
16. Read, Jesse, et al. Classifier chains for multi-label classification. Machine learning
85.3 (2011): 333.
17. International Classification Of Diseases, 10th Revision (ICD-10). World Health
Organization. N.p., 2018. Web. 9 July 2018.
18. Kim, Yoon. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882 (2014).
19. Azam, Sheikh Shams, et al. Q-Map: clinical concept mining with phrase sense
disambiguation. arXiv preprint arXiv:1804.11149 (2018).
20. Aho, Alfred V., and Margaret J. Corasick. Efficient string matching: an aid to
bibliographic search. Communications of the ACM 18.6 (1975): 333-340.
21. Chapman, Wendy W., et al. A simple algorithm for identifying negated findings
and diseases in discharge summaries. Journal of biomedical informatics 34.5 (2001):
301-310.
22. Mikolov, Tomas, et al. Distributed representations of words and phrases and their
compositionality. Advances in neural information processing systems. 2013.
23. Gagniuc, Paul A. Markov Chains: From Theory to Implementation and Experi-
mentation. John Wiley & Sons, 2017.
24. National Center for Health Statistics. Centers for Disease Control and
Prevention, Centers for Disease Control and Prevention, 11 June 2018,
www.cdc.gov/nchs/icd/icd10cm.htm.
25. Biomedical Natural Language Processing. Bio.nlplab.org, bio.nlplab.org/.

26. Home - PubMed - NCBI. U.S. National Library of Medicine, www.ncbi.nlm.nih.gov/pubmed.
27. Home - PMC - NCBI. U.S. National Library of Medicine, www.ncbi.nlm.nih.gov/pmc/.
28. Main Page. Wikipedia, Wikimedia Foundation, 8 July 2018, en.wikipedia.org/wiki/Main_Page.
29. shams-sam. Shams-Sam/Logic-Lab. GitHub, github.com/shams-sam/logic-lab/blob/master/TextPreprocessing/text_preprocessing.py. Accessed 9 July 2018.
30. Aronson, Alan R. Effective mapping of biomedical text to the UMLS Metathesaurus:
the MetaMap program. Proceedings of the AMIA Symposium. American Medical
Informatics Association, 2001.
31. Srivastava, Nitish, et al. Dropout: a simple way to prevent neural networks from
overfitting. The Journal of Machine Learning Research 15.1 (2014): 1929-1958.
32. Ioffe, Sergey, and Christian Szegedy. Batch normalization: Accelerating deep net-
work training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167
(2015).
33. Johnson, Alistair EW, et al. MIMIC-III, a freely accessible critical care database.
Scientific data 3 (2016): 160035.
34. Schafer, J. Ben, et al. Collaborative filtering recommender systems. The adaptive
web. Springer, Berlin, Heidelberg, 2007. 291-324.
