NEEDED: Introducing Hierarchical Transformer to Eye Diseases Diagnosis
Xu Ye1,2,§  Meng Xiao1,2,§  Zhiyuan Ning1,2  Weiwei Dai3  Wenjuan Cui1  Yi Du1,∗  Yuanchun Zhou1
1 Computer Network Information Center, CAS. Emails: yexu@cnic.cn, shaow@cnic.cn, ningzhiyuan@cnic.cn, wenjuancui@cnic.cn, duyi@cnic.cn, zyc@cnic.cn
2 University of Chinese Academy of Sciences
3 Changsha Aier Eye Hospital. Email: daiweiwei@aierchina.com
§ These authors have contributed equally to this work.
∗ Corresponding author. The code can be found at https://github.com/coco11563/NEEDED

Abstract
With the development of natural language processing (NLP) techniques, automatic diagnosis of eye diseases using ophthalmology electronic medical records (OEMR) has become possible. It aims to evaluate the condition of each of a patient's eyes respectively, and we formulate it as a particular multi-label classification task in this paper. Although there are a few related studies on other diseases, automatic diagnosis of eye diseases exhibits unique characteristics. First, descriptions of both eyes are mixed up in OEMR documents, with both free text and templated asymptomatic descriptions, resulting in sparsity and clutter of information. Second, OEMR documents contain multiple parts of descriptions and have long document lengths. Third, it is critical to provide explainability for the disease diagnosis model. To overcome those challenges, we present an effective automatic eye disease diagnosis framework, NEEDED. In this framework, a preprocessing module is integrated to improve the density and quality of information. Then, we design a hierarchical transformer structure for learning contextualized representations of each sentence in the OEMR document. For the diagnosis part, we propose an attention-based predictor that enables traceable diagnosis by obtaining disease-specific information. Experiments on a real dataset and comparisons with several baseline models show the advantage and explainability of our framework.

1 Introduction
With the rising incidence of eye diseases and the shortage of ophthalmic medical resources, eye health issues are becoming increasingly prominent [1]. To improve the efficiency of ophthalmologists and lower the rate of misdiagnosis, an automated eye disease diagnostic system that provides advice to ophthalmologists has become an urgent demand. There have been many attempts to automatically diagnose eye diseases based on image data (e.g., ocular fundus photographs) [2, 3, 4]. However, a given kind of image usually contains information about only one part of the eye, limiting the scope of automatic disease diagnosis. Meanwhile, the widespread use of ophthalmology electronic medical records (OEMR) and the development of NLP techniques make it possible to diagnose eye diseases based on OEMR automatically.

As shown in the left part of Figure 1, an OEMR document consists of several parts, including Chief Complaint (CC), History of Present Illness (HPI), Examination Results (ER), etc. Each part can be considered a particular type of text. The goal of automatic diagnosis of eye diseases is to diagnose the diseases suffered by each of the patient's eyes respectively, based on these texts. For this task, a typical solution that has been tried on similar tasks is to concatenate these texts and use neural networks such as RNNs [5, 6, 7] and CNNs [8, 9] to obtain an embedding representation of the OEMR document. The document-level representation is then used as input to classifiers to obtain the results.

However, after analyzing the characteristics of OEMR and our task, we found three issues that make this solution unsuitable. Issue 1: sparsity and clutter of information in the OEMR. Descriptions of the different eyes are mixed up in OEMR documents, causing clutter in the information relevant to each eye. Besides, many of these descriptions are templated asymptomatic descriptions such as "No congestion or edema in the conjunctiva of the left eye", resulting in sparsity of information in the document. This sparsity and clutter of information may hinder the diagnosis model from giving accurate diagnosis results for the two eyes respectively. Issue 2: long document length and multiple text types. OEMR documents are long and contain multiple text types. This solution is inappropriate when the input is a lengthy document because it is weak at learning long-range dependencies, making it challenging to extract information from long documents comprehensively [10, 11]. Besides, it ignores the differences between text types and does not preserve the type information of the text. Issue 3: explainability of automatic disease diagnosis. It is critical to provide explainability for the proposed model in medical scenarios, especially for automatic disease diagnosis [12]. However, this typical approach obtains only …
[Figure 1 appears here; its three components, (1) Preprocessing Component, (2) Hierarchical Transformer Encoder, and (3) Attention-based Predictor, are described in the caption below.]
Figure 1: NEEDED takes a given OEMR O as input and outputs two disease label sets corresponding to the different eyes of the patient. (1) Preprocessing Component. The useless parts of the descriptions (marked by underline) are first filtered out, and then the descriptions relevant to different eyes are distinguished to form two documents, Ol and Or. They will be used separately to get the diagnostic results for the two eyes. (2) Hierarchical Transformer Encoder. Taking Or for example, the sentence set {S1, S2, ..., SN} of Or is encoded into the document representation Cr. (3) Attention-based Predictor. The predictor captures disease-specific information representations from Cr and outputs each disease's presence probability.
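For concreteness, the following is a minimal sketch of such a label-wise attention predictor as suggested by Figure 1(3): learnable label embeddings act as queries over the sentence representations in Cr, and a per-disease feed-forward head with a sigmoid outputs each presence probability. All names (AttentionPredictor, num_diseases, d_model) are our own, and details such as the score scaling may differ from the authors' implementation.

```python
import torch
import torch.nn as nn

class AttentionPredictor(nn.Module):
    """Label-wise dot-product attention over sentence representations
    (a sketch of Figure 1(3), not the authors' exact implementation)."""

    def __init__(self, num_diseases: int, d_model: int):
        super().__init__()
        # One learnable query (label embedding) per disease.
        self.label_emb = nn.Parameter(torch.randn(num_diseases, d_model))
        # One small feed-forward head per disease (FNN_1 ... FNN_L in Figure 1).
        self.heads = nn.ModuleList(nn.Linear(d_model, 1) for _ in range(num_diseases))

    def forward(self, C: torch.Tensor) -> torch.Tensor:
        # C: (num_sentences, d_model) document representation from the encoder.
        scores = self.label_emb @ C.T / C.size(-1) ** 0.5    # (L, N) attention scores
        attn = torch.softmax(scores, dim=-1)                 # per-disease weights over sentences
        v = attn @ C                                         # (L, d) disease-specific vectors
        logits = torch.cat([head(v[l]) for l, head in enumerate(self.heads)])
        return torch.sigmoid(logits)                         # (L,) presence probabilities
```

The attention weights over sentences are what would make the diagnosis traceable: for each predicted disease, they indicate which sentences of the document contributed most to the decision.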
3.1 Preprocessing Component  As shown in Figure 1(1), the preprocessing component performs two steps: filtering useless descriptions and distinguishing content relevant to different eyes. Step-1: Inspired by TF-IDF, we consider that, for templated asymptomatic descriptions, the more often they appear in the document set, the more likely they are common condition descriptions and the less useful they are for disease diagnosis. Thus, we calculate the frequency of occurrence of each description in all OEMR documents and filter out asymptomatic descriptions that occur more frequently than a threshold B. Step-2: Given an OEMR document O, the descriptions relevant to different eyes are distinguished in O to form two documents, Ol and Or. They have the same structure as the OEMR, except that the content is focused on a particular eye. Specifically, for the examination results section, each sentence describes an examination result for a specific eye, so we can easily distinguish them. For the other parts, we developed rules based on the characteristics of the descriptions to distinguish them; sentences that are difficult to distinguish or are relevant to both eyes are retained in both documents. A sketch of both steps is given below.
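A minimal sketch of the two preprocessing steps, under assumptions: documents are represented as lists of sentence strings, the threshold B comes from the text above, and the eye-matching patterns are simplified stand-ins for the authors' rules.

```python
import re
from collections import Counter

def filter_templated_descriptions(documents, threshold_b):
    """Step-1: drop descriptions whose document frequency exceeds B.
    (The paper filters only templated *asymptomatic* descriptions;
    detecting which descriptions are asymptomatic is omitted here.)"""
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))          # count each description once per document
    return [[s for s in doc if doc_freq[s] <= threshold_b] for doc in documents]

def split_by_eye(document):
    """Step-2: route each sentence to the left-eye and/or right-eye document;
    ambiguous sentences and those about both eyes are retained in both."""
    left, right = [], []
    for sent in document:
        has_left = re.search(r"left eye", sent, re.IGNORECASE) is not None
        has_right = re.search(r"right eye", sent, re.IGNORECASE) is not None
        if has_left and not has_right:
            left.append(sent)
        elif has_right and not has_left:
            right.append(sent)
        else:                              # ambiguous or relevant to both eyes
            left.append(sent)
            right.append(sent)
    return left, right
```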
3.2 Hierarchical Transformer Encoder  After the above stage, we obtain two documents, Ol and Or. They will be used as input to the downstream model separately to get the diagnostic results for the two eyes. For simplicity of expression, they are denoted as O∗ in the following. To extract the information in a given OEMR document O∗, we design a hierarchical Transformer structure as an encoder. As shown in Figure 1(2), the encoder consists of a word-level Transformer and a sentence-level Transformer. The word-level Transformer takes the OEMR document as input and encodes each sentence in the document to obtain type-specific sentence embedding representations. Then, the sentence-level Transformer takes the set of sentence embedding representations as input, learns the dependencies between sentences, and generates the representation matrix of the input OEMR document.

3.2.1 Word-level Transformer  Given an OEMR document O∗, in this stage we aim to use a Transformer to embed each sentence S_i in O∗ into a sentence representation $a_i \in \mathbb{R}^d$. To retain information about the text type of each sentence and thus help the Transformer model its semantics, we first add a type-token t_0 at the beginning of each sentence. Formally, given a sentence S_i with token sequence $[t_0^{(i)}, t_1^{(i)}, t_2^{(i)}, \ldots, t_{|S_i|}^{(i)}]$, the sentence representation a_i can be obtained by:

(3.2)  $a_i = \mathrm{Pooling}\big([h_0^{(i)}, h_1^{(i)}, h_2^{(i)}, \ldots, h_{|S_i|}^{(i)}]\big) = \mathrm{Pooling}\big(\mathrm{Transformer}_w([t_0^{(i)}, t_1^{(i)}, t_2^{(i)}, \ldots, t_{|S_i|}^{(i)}])\big),$

where $\mathrm{Transformer}_w$ is an N_w-layer Transformer. For each token $t_j^{(i)}$ in sentence S_i, its input embedding is obtained by summing its token embedding and positional encoding, and the token embedding is randomly initialized.
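Under assumptions, a PyTorch sketch of Eq. (3.2): the paper does not specify the pooling function or the positional-encoding variant, so we use mean pooling and learned position embeddings as stand-ins, with the layer/head/dimension sizes taken from Section 4.3.

```python
import torch
import torch.nn as nn

class WordLevelEncoder(nn.Module):
    """Sketch of Eq. (3.2): encode a type-token-prefixed sentence with a
    Transformer and pool the hidden states into one vector a_i."""

    def __init__(self, vocab_size: int, d_model: int = 256, n_heads: int = 8,
                 n_layers: int = 5, max_len: int = 512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)   # randomly initialized
        self.pos_emb = nn.Embedding(max_len, d_model)      # positional encoding
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.transformer_w = nn.TransformerEncoder(layer, n_layers)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len); index 0 of each row is the type-token t_0.
        pos = torch.arange(token_ids.size(1), device=token_ids.device)
        h = self.transformer_w(self.tok_emb(token_ids) + self.pos_emb(pos))
        return h.mean(dim=1)   # Pooling over [h_0, ..., h_|Si|] -> a_i

# Example: a batch of 4 sentences of 20 tokens -> 4 sentence vectors in R^256.
a = WordLevelEncoder(vocab_size=30000)(torch.randint(0, 30000, (4, 20)))
```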
4.2 Baseline Methods  To comprehensively evaluate the performance of our proposed framework, we selected three baselines and five ablation variants of NEEDED. The baselines are listed as follows: 1) CAML [20]: CAML is a CNN-based model that integrates label-wise attention to obtain label-specific representations. 2) LSTM-Att [7]: LSTM-Att combines an LSTM and an attention mechanism to capture important semantic information in the input for prediction. 3) BERT [21]: BERT is representative of pre-trained models; we use BERT and a pooling layer to obtain the representation of the input. We also propose five variants of NEEDED, listed as follows: 1) w/o p: ablates the preprocessing steps. 2) w/o c: ablates the hierarchical transformer and replaces it with a vanilla transformer. 3) w/o s: ablates the sentence-level transformer. 4) w/o l: ablates the attention mechanism and label embedding in the predictor. 5) w/o w: ablates both the hierarchical transformer and the attention-based predictor.
4.3 Experiment Settings  In the experiments, we set the transformer layer number N_w to 5, the multi-head number to 8, and the dimension size d_model to 256. For NEEDED training, we use the AdamW [22] optimizer with a learning rate of 1 × 10−4, and set the batch size to 32 and the AdamW weight decay to 1 × 10−2. The dropout rate is set to 0.1 to prevent overfitting. For all models, we train and test them multiple times with different random seeds under their optimal hyperparameters and report their performance and standard deviation. In the following experiments, …
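For reproducibility, the setup above can be written down as follows (a minimal sketch; the stand-in encoder is our own placeholder for the full NEEDED network):

```python
import torch
import torch.nn as nn

# Hyperparameters from Section 4.3.
config = dict(
    n_transformer_layers=5,   # N_w
    n_attention_heads=8,
    d_model=256,
    batch_size=32,
    learning_rate=1e-4,       # AdamW learning rate
    weight_decay=1e-2,        # AdamW weight decay
    dropout=0.1,
)

# Stand-in module; in practice this would be the NEEDED model.
model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(
        d_model=config["d_model"],
        nhead=config["n_attention_heads"],
        dropout=config["dropout"],
    ),
    num_layers=config["n_transformer_layers"],
)
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=config["learning_rate"],
    weight_decay=config["weight_decay"],
)
```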
Table 1 (fragment; only the NEEDED row survived extraction, with columns assumed to follow the metric order given in Section 4.4.1):
| Model  | Macro-F1   | Micro-F1   | Macro-AUC  | Micro-AUC  |
| NEEDED | 90.25±0.48 | 92.61±0.27 | 97.92±0.15 | 98.59±0.10 |

4.4 Experiment Results  These experiments aim to answer the following questions: Q1: How does the performance of NEEDED compare with the other baseline methods? Q2: How does each component of NEEDED impact the model's performance? Q3: What is the impact of distinguishing the descriptions of different eyes in preprocessing? Q4: What is the impact of filtering the templated asymptomatic descriptions? Q5: How explainable is our proposed model? Besides these, we also conduct a hyperparameter sensitivity study.

4.4.1 RQ1: Overall Comparison  The goal of the first experiment is to compare the performance of NEEDED and the baseline models on the ADED task. In this comparison, we used Macro-F1, Micro-F1, Macro-AUC, and Micro-AUC as evaluation metrics, which are widely used in multi-label classification problems [11, 20, 23, 24]. The results are shown in Table 1. From the overall results, we observed the following. First, our framework achieves the best performance on the overall evaluation metrics, showing it is better at extracting disease-related information from the OEMR document. Second, our model performs better than w/o c by learning both word-level and sentence-level dependencies and preserving text type information. Third, BERT, as a pre-trained model, does not perform as well as the other baseline models on most metrics, probably due to the large gap between the text in its pre-training corpus and the OEMR documents.
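As a reference, the four metrics can be computed with scikit-learn; below is a toy sketch (the arrays are made-up placeholders, not results from the paper):

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

# Toy multi-label ground truth and predicted probabilities (3 samples, 4 labels).
y_true = np.array([[1, 0, 0, 1], [0, 1, 1, 0], [1, 1, 0, 1]])
y_prob = np.array([[0.9, 0.2, 0.1, 0.8], [0.3, 0.7, 0.8, 0.2], [0.8, 0.6, 0.4, 0.9]])
y_pred = (y_prob >= 0.5).astype(int)  # threshold probabilities into hard labels

print("Macro-F1 :", f1_score(y_true, y_pred, average="macro"))
print("Micro-F1 :", f1_score(y_true, y_pred, average="micro"))
print("Macro-AUC:", roc_auc_score(y_true, y_prob, average="macro"))
print("Micro-AUC:", roc_auc_score(y_true, y_prob, average="micro"))
```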
[Figures: performance comparison panels for LSTM-Att, BERT, CAML, and NEEDED; (a) hyperparameter study on transformer layer number; (b) hyperparameter study on attention head number; (c) model convergence study (metric vs. epoch).]