
NEEDED: Introducing Hierarchical Transformer to Eye Diseases Diagnosis

Xu Ye1,2,§  Meng Xiao1,2,§  Zhiyuan Ning1,2  Weiwei Dai3  Wenjuan Cui1  Yi Du1,∗  Yuanchun Zhou1

arXiv:2212.13408v3 [cs.CL] 11 Jan 2023

Abstract

With the development of natural language processing (NLP) techniques, automatic diagnosis of eye diseases using ophthalmology electronic medical records (OEMR) has become possible. It aims to evaluate the condition of both eyes of a patient respectively, and we formulate it as a particular multi-label classification task in this paper. Although there are a few related studies in other diseases, automatic diagnosis of eye diseases exhibits unique characteristics. First, descriptions of both eyes are mixed up in OEMR documents, with both free text and templated asymptomatic descriptions, resulting in sparsity and clutter of information. Second, OEMR documents contain multiple parts of descriptions and have long document lengths. Third, it is critical to provide explainability to the disease diagnosis model. To overcome those challenges, we present an effective automatic eye disease diagnosis framework, NEEDED. In this framework, a preprocessing module is integrated to improve the density and quality of information. Then, we design a hierarchical transformer structure for learning the contextualized representations of each sentence in the OEMR document. For the diagnosis part, we propose an attention-based predictor that enables traceable diagnosis by obtaining disease-specific information. Experiments on a real dataset and comparisons with several baseline models show the advantage and explainability of our framework.

1 Introduction

With the rising incidence of eye diseases and the shortage of ophthalmic medical resources, eye health issues are becoming increasingly prominent [1]. To improve the efficiency of ophthalmologists and lower the rate of misdiagnosis, an automated eye disease diagnostic system that provides advice to ophthalmologists has become an urgent demand. There have been many attempts to automatically diagnose eye diseases based on image data (e.g., ocular fundus photographs) [2, 3, 4]. However, a single kind of image usually contains information about only one part of the eye, limiting the scope of automatic disease diagnosis. Meanwhile, the widespread adoption of ophthalmology electronic medical records (OEMR) and the development of NLP techniques make it possible to diagnose eye diseases based on OEMR automatically.

As shown in the left part of Figure 1, the OEMR document consists of several parts, including Chief Complaint (CC), History of Present Illness (HPI), Examination Results (ER), etc. Each part can be considered a particular type of text. The goal of automatic diagnosis of eye diseases is to diagnose the diseases suffered by each of the patient's eyes respectively based on these texts. For this task, a typical solution that has been tried on similar tasks is to concatenate these texts and use neural networks like RNNs [5, 6, 7] and CNNs [8, 9] to obtain an embedding representation of the OEMR document. The document-level representation is then used as input to classifiers to obtain results.

However, after analyzing the characteristics of OEMR and our task, we found three issues that make this solution unsuitable. Issue 1: sparsity and clutter of information in the OEMR. Descriptions of different eyes are mixed up in OEMR documents, causing clutter in the information relevant to different eyes. Besides, many of these descriptions are templated asymptomatic descriptions, such as "No congestion or edema in the conjunctiva of the left eye", resulting in a sparsity of information in the document. This sparsity and clutter of information may hinder the diagnosis model from giving accurate diagnosis results for the two eyes respectively. Issue 2: long document length and multiple text types. OEMR documents are long and have multiple text types. This solution is inappropriate when the input is a lengthy document because it is weak at learning long-range dependencies, making it challenging to extract information from long documents comprehensively [10, 11]. Besides, it ignores the differences between text types and does not preserve the type information of the text. Issue 3: explainability of automatic disease diagnosis. It is critical to provide explainability to the proposed model in medical scenarios, especially for automatic disease diagnosis [12]. However, this typical approach obtains only

1 Computer Network Information Center, CAS. Emails: yexu@cnic.cn, shaow@cnic.cn, ningzhiyuan@cnic.cn, wenjuancui@cnic.cn, duyi@cnic.cn, zyc@cnic.cn
2 University of Chinese Academy of Sciences
3 Changsha Aier Eye Hospital. Email: daiweiwei@aierchina.com
§ These authors have contributed equally to this work.
∗ Corresponding author. The code can be found at https://github.com/coco11563/NEEDED

Copyright © 2023 by SIAM
Unauthorized reproduction of this article is prohibited
a coarse-grained document-level embedding representation to diagnose all diseases, making it difficult to trace the diagnosis and explain the results.

Another method used in related tasks is extracting structured information from clinical texts by entity recognition and other methods, then using it to enhance the downstream model or to get the diagnosis results directly [13, 14]. This method requires an annotated dataset to train the information extraction model. However, due to the complexity and length of OEMR documents, doctors would need to spend too long annotating comprehensively and accurately, making this method difficult to implement. Besides, it suffers from the problem of error propagation.

To overcome these issues, we formulate automatic eye disease diagnosis as a particular multi-label classification task. Specifically, the input to this task is an OEMR document, and the output is two sets of labels corresponding to the diagnosis results for both eyes, respectively. Then, we iNtroduce hierarchical transformer to EyE DisEases Diagnosis and propose an automatic diagnosis framework named NEEDED. The main contributions of this paper are summarized as:

• An efficient preprocessing method for improving information density and quality. We first filter out useless asymptomatic descriptions from the OEMR document and then distinguish the contents relevant to different eyes to form two documents, which will be used separately to get the diagnostic results. While retaining vital information, we improve the information density and quality of the document.

• A hierarchical encoder to extract abundant semantic information from OEMR. Inspired by long-text modeling methods [15, 16, 17], we design a hierarchical encoder architecture to embed both the texts and their particular types within an OEMR document into a matrix. Specifically, we utilize a hierarchical transformer [18, 19] to obtain the contextualized representation of each sentence in the OEMR while preserving the type information of each sentence by adding a type-token [11]. Then, we use the matrix formed by these representations as the document representation.

• An attention-based predictor that enables traceable diagnosis. We utilize a dot-product attention layer as an extractor to capture disease-specific information from these sentence representations and perform the diagnosis. This enables traceable diagnosis by observing the distribution of attention weights and providing explanations for diagnostic results.

We conduct experiments on a real dataset and comparisons with several baseline models to validate the advantage of NEEDED, and provide a case study to show the explainability.

2 Preliminaries

In this section, we first give the definition of the ophthalmology electronic medical record (OEMR), then provide a formal definition of the task of Automatic Diagnosis of Eye Disease (ADED).

Definition 1 (Ophthalmology Electronic Medical Record): As shown in the left part of Figure 1, an OEMR has several parts. Each part can be considered as a set of sentences. Therefore, the OEMR can be viewed as a document consisting of multiple sentences. Formally, we define O = {S_1, S_2, ..., S_N} as an OEMR document with N sentences. The i-th sentence in the document is defined as S_i = [t^(i)_1, t^(i)_2, ..., t^(i)_{|S_i|}], where t^(i)_k denotes the k-th token in the i-th sentence.

Definition 2 (Automatic Diagnosis of Eye Disease): Given an OEMR O, the task of automatic diagnosis of eye diseases aims to diagnose the diseases suffered by each of the patient's eyes respectively based on it. By treating each disease as a label, we can view this problem as deriving two sets of labels for each OEMR corresponding to the different eyes, which can be seen as a particular multi-label classification task. Formally, let L = {l_1, l_2, ..., l_D} denote the label set for all eye diseases, where l_i ∈ {0, 1} represents the presence or absence of the i-th disease. As mentioned before, we consider it essential to preprocess the OEMR document; let z denote the preprocessing method. Finally, the ADED task can be formulated as:

(2.1)  Ω(z(O), Θ) → L_a, L_b

where Θ is the parameters of the disease diagnosis model Ω, and L_a and L_b denote the sets of disease labels for the two eyes respectively.

3 Proposed Framework

3.1 Preprocessing Component As shown in the left part of Figure 1, the OEMR document is composed of several sections, including Chief Complaint (CC), History of Present Illness (HPI), Examination Results (ER), etc. Each part can be considered a particular type of text that consists of some sentences. The task of automatic eye disease diagnosis aims to evaluate the status of both eyes respectively based on an OEMR document. However, descriptions of different eyes are mixed up in OEMR documents, resulting in a clutter of information. Besides, many descriptions are templated asymptomatic descriptions that contain little information for disease diagnosis; their abundance results in a sparsity of information in the document.

To improve information density and quality, we propose a preprocessing method consisting of two steps:

Figure 1: NEEDED takes a given OEMR O as input and outputs two disease label sets corresponding to the different eyes of the patient. (1) Preprocessing Component. The useless parts of the descriptions marked by underline are first filtered out, and then the descriptions relevant to different eyes are distinguished to form two documents, Ol and Or. They will be used separately to get the diagnostic results for the two eyes. (2) Hierarchical Transformer Encoder. Taking Or for example, the sentence set {S_1, S_2, ..., S_N} of Or is encoded into the document representation Cr. (3) Attention-based Predictor. The predictor captures disease-specific information representations from Cr and outputs each disease's presence probability.
filtering useless descriptions and distinguishing content relevant to different eyes. Step 1: Inspired by TF-IDF, we consider that for templated asymptomatic descriptions, the more often they appear across the document set, the more likely they are common condition descriptions and the less useful they are for disease diagnosis. Thus, we calculate the frequency of occurrence of each description in all OEMR documents and filter out asymptomatic descriptions that occur more frequently than a threshold B. Step 2: Given an OEMR document O, the descriptions relevant to different eyes are distinguished from O to form two documents, Ol and Or. They have the same structure as the OEMR, except that the content is focused on a particular eye. Specifically, for the examination results section, each sentence describes an examination result for a specific eye, so we can easily distinguish them. For the other parts, we developed rules based on the characteristics of the descriptions to distinguish them; those sentences that are difficult to distinguish or relevant to both eyes are retained by both documents.

3.2 Hierarchical Transformer Encoder After the above stage, we obtain two documents, Ol and Or. They will be used as input to the downstream model separately to get the diagnostic results for the two eyes. For simplicity of expression, they will be denoted as O∗ in the following. To extract the information in a given OEMR document O∗, we design a hierarchical Transformer structure as an encoder. As shown in Figure 1(2), the encoder consists of a word-level Transformer and a sentence-level Transformer. The word-level transformer takes the OEMR document as input and encodes each sentence in the document to obtain type-specific sentence embedding representations. Then, the sentence-level transformer takes the set of sentence embedding representations as input, learns the dependencies between sentences, and generates the representation matrix of the input OEMR document.

3.2.1 Word-level Transformer Given an OEMR document O∗, in this stage we aim to use a Transformer to embed each sentence S_i in O∗ into a sentence representation a_i ∈ R^d. To retain information about the text type of each sentence and thus help the Transformer model its semantics, we first add a type-token t_0 at the beginning of each sentence. Formally, given a sentence S_i with token sequence [t^(i)_0, t^(i)_1, t^(i)_2, ..., t^(i)_{|S_i|}], the sentence representation a_i can be obtained by:

(3.2)  a_i = Pooling([h^(i)_0, h^(i)_1, h^(i)_2, ..., h^(i)_{|S_i|}])
           = Pooling(Transformer_w([t^(i)_0, t^(i)_1, t^(i)_2, ..., t^(i)_{|S_i|}]))

where Transformer_w is an N_w-layer Transformer. For each token t^(i)_j in the sentence S_i, its input embedding is obtained by summing its token embedding and positional encoding, and the token embedding is randomly

initialized. [h^(i)_0, h^(i)_1, h^(i)_2, ..., h^(i)_{|S_i|}] is the list of output embeddings of the tokens in the sentence S_i, where h^(i)_j is a d-dimensional vector corresponding to t^(i)_j. Pooling(·) denotes average-pooling over the output embedding representations to obtain the sentence embedding representation a_i.

3.2.2 Sentence-level Transformer In the above stage, we obtain the representation of each sentence in the OEMR document. Most of the sentences are descriptions of the patient's symptoms or medical history. It is likely that there are many associations between them, which means that the meaning represented by one sentence may vary depending on its context. Thus, the embedding representation of each sentence should not only depend on its content but also consider its context.

At this stage, we aim to use the Transformer to obtain contextualized sentence representations by learning the dependencies between sentences. Given an OEMR document O∗, we can obtain the sentence representation set {a_1, a_2, ..., a_N} from the above word-level Transformer. The sentence-level transformer takes the sentence representation set as input and outputs a matrix C, which can be formulated as:

(3.3)  C = [c_1, c_2, ..., c_N] = Transformer_s([a_1, a_2, ..., a_N])

where Transformer_s is a one-layer Transformer. C = [c_1, c_2, ..., c_N] is a set of contextualized sentence representations. With the multi-head self-attention mechanism, the Transformer can model the dependencies between the sentence representations in the input from multiple perspectives, enabling each sentence representation to collect global information from its context. In a sense, the matrix C can be seen as a multi-perspective representation of the input OEMR document.

The hierarchical transformer encoder has the following main advantages. First, the Transformer has strengths in learning long-range dependencies [10, 11]. The hierarchical Transformer encoder can effectively and efficiently model the semantic information of the OEMR by learning both word-level and sentence-level dependencies. Second, embedding documents into multiple sentence-level representations allows us to implement traceable disease prediction based on these representations in combination with our attention-based predictor proposed below. Third, by adding the type-token at the beginning of sentences, the differences between text types are taken into account, helping the encoder model sentence semantics more accurately.

3.3 Attention-based Predictor After the previous step, the OEMR document is embedded into a matrix C ∈ R^(d×N), where N denotes the number of sentences in the OEMR document. A typical subsequent operation is to perform a pooling operation on the matrix C to obtain a d-dimensional vector and use this vector as the input to classifiers to obtain the diagnostic result. However, this approach obtains only a coarse-grained document-level representation to diagnose all diseases, making it difficult to trace the diagnosis and provide explanations for the results.

To provide explainability for the automatic diagnosis, inspired by label-wise attention [20], we adopt dot-product attention to capture disease-specific information from C for traceable diagnosing. Specifically, given a disease l to be predicted, we first obtain the attention weight of each component in C for disease l. This process can be formulated as:

(3.4)  α_l = softmax(Cᵀ e_l / √d)

where e_l ∈ R^d is the label embedding of disease l, which is randomly initialized, and α_l ∈ R^N is the attention weight vector for disease l. Then, we use the attention weight vector α_l and the matrix C to compute the specific representation v_l for disease l:

(3.5)  v_l = C α_l = Σ_{n=1}^{N} α_{l,n} c_n

Note that each component of C is the embedding representation of a sentence in the input. Thus, by observing the attention weight vector α_l, we can learn the importance of each sentence in the input for diagnosing disease l, thus enabling a traceable diagnosis.

Suppose we have a total of L diseases to diagnose. After the above stage, we get a matrix V = [v_1, v_2, ..., v_L], where v_l is the disease-specific information representation for disease l. Then, we use a set of Feedforward Neural Networks (FNNs) and the sigmoid function to calculate the probability of the presence of each disease:

(3.6)  ŷ_l = sigmoid(FNN_l(v_l))

3.4 Training The training procedure minimizes the binary cross-entropy loss, which can be formulated as:

(3.7)  L = − Σ_{l=1}^{L} [ y_l log(ŷ_l) + (1 − y_l) log(1 − ŷ_l) ]

where y_l ∈ {0, 1} indicates whether disease l exists in the diagnosis record of the input document, and ŷ_l is the probability of the existence of disease l output by the prediction model.
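To make Eqs. (3.4)–(3.7) concrete, here is a minimal NumPy sketch of the attention-based predictor and the training loss. It is an illustration under stated assumptions, not the paper's PyTorch implementation: the variable names are ours, and a single linear layer stands in for each FNN_l.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def predict_diseases(C, E, W, b):
    """Label-wise attention predictor (Eqs. 3.4-3.6), hypothetical names.

    C: (d, N) contextualized sentence representations.
    E: (L, d) label embeddings, one row e_l per disease.
    W: (L, d) per-disease weights (a one-layer stand-in for FNN_l).
    b: (L,)   per-disease biases.
    Returns (probs, alphas): presence probabilities (L,) and the
    attention weights (L, N) that make the diagnosis traceable.
    """
    d, N = C.shape
    scores = E @ C / np.sqrt(d)            # (L, N): e_l^T C / sqrt(d)
    alphas = softmax(scores)               # Eq. 3.4: each row sums to 1
    V = alphas @ C.T                       # (L, d): v_l = C alpha_l (Eq. 3.5)
    logits = (W * V).sum(axis=1) + b       # stand-in for FNN_l(v_l)
    probs = 1.0 / (1.0 + np.exp(-logits))  # Eq. 3.6: sigmoid
    return probs, alphas

def bce_loss(probs, y):
    # Eq. 3.7: binary cross-entropy summed over the L diseases.
    eps = 1e-12
    return -np.sum(y * np.log(probs + eps) + (1 - y) * np.log(1 - probs + eps))
```

Inspecting a row of `alphas` shows which sentences the predictor attended to for that disease, which is exactly the traceability argument made above.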

4 Experiments

This section presents the dataset and the evaluation of the experimental results with some analysis. We demonstrate the advantage of our proposed framework by comparing it to the baseline models. We also explore the effectiveness of each component of our framework and provide a case study to illustrate the explainability.

4.1 Dataset Description We conducted the experiments on a real-world OEMR dataset with 5134 records collected by Aier Eye Hospital. After discussion with ophthalmologists, we selected six common eye diseases to test the diagnostic ability of our model: cataract (CAT), glaucoma, diabetic retinopathy (DR), dry eye syndrome (DES), pterygium, and meibomian gland dysfunction (MGD). The dataset is divided into training, validation, and test sets in proportions of 70%, 15%, and 15%, respectively.

4.2 Baseline Methods To comprehensively evaluate the performance of our proposed framework, we selected three baselines and five ablation variants of NEEDED. The baselines are as follows: 1) CAML [20]: CAML is a CNN-based model which integrates label-wise attention to obtain label-specific representations. 2) LSTM-Att [7]: LSTM-Att combines an LSTM and an attention mechanism to capture important semantic information in the input for prediction. 3) BERT [21]: BERT is representative of the pre-trained models; we use BERT with a pooling layer to obtain the representation of the input. The five variants of NEEDED are as follows: 1) w/o p: ablates the preprocessing steps. 2) w/o c: ablates the hierarchical transformer and replaces it with a vanilla transformer. 3) w/o s: ablates the sentence-level transformer. 4) w/o l: ablates the attention mechanism and label embedding in the predictor. 5) w/o w: ablates both the hierarchical transformer and the attention-based predictor.

4.3 Experiment Settings In the experiments, we set the transformer layer number N_w to 5, the number of attention heads to 8, and the dimension size d_model to 256. For NEEDED training, we use the AdamW [22] optimizer with a learning rate of 1 × 10^−4, a batch size of 32, and an AdamW weight decay of 1 × 10^−2. The dropout rate is set to 0.1 to prevent overfitting. For all models, we train and test multiple times with different random seeds under their optimal hyperparameters and report mean performance and standard deviation. In the following experiments, all methods are implemented in PyTorch, and all experiments are conducted on a Linux server with an AMD EPYC 7742 CPU and one NVIDIA A100 GPU.

Table 1: The overall comparison experiments. For a fair comparison, we equipped the three baselines with the preprocessing and marked them with (·)^ρ. The best performance is marked in bold, and the runner-up is underlined.

Models      Macro-F1      Micro-F1      Macro-AUC     Micro-AUC
BERT        81.21±0.40    84.28±0.78    92.55±0.43    95.61±0.22
CAML        83.71±0.57    81.22±0.80    94.32±0.61    96.09±0.20
LSTM-Att    81.55±0.74    80.29±0.70    93.50±0.39    95.68±0.23
BERT^ρ      87.39±0.72    89.73±0.34    95.73±0.33    97.33±0.19
CAML^ρ      88.49±0.42    90.53±0.33    96.10±0.18    97.59±0.14
LSTM-Att^ρ  88.19±0.61    91.11±0.44    97.16±0.23    98.14±0.12
w/o p       82.59±0.76    82.37±0.78    94.85±0.10    96.46±0.09
w/o c       87.96±0.73    89.80±0.44    97.14±0.26    98.00±0.19
w/o s       88.19±0.50    91.31±0.34    96.85±0.15    97.89±0.08
w/o l       89.29±0.65    91.88±0.49    96.40±0.19    97.67±0.20
w/o w       87.29±0.59    90.58±0.36    96.13±0.22    97.51±0.17
NEEDED      90.25±0.48    92.61±0.27    97.92±0.15    98.59±0.10

4.4 Experiment Results These experiments aim to answer the following questions. Q1: How does NEEDED perform compared with the baseline methods? Q2: How does each component of NEEDED impact the model performance? Q3: What is the impact of distinguishing the descriptions of different eyes in preprocessing? Q4: What is the impact of filtering the templated asymptomatic descriptions? Q5: How explainable is our proposed model? Besides these, we also conduct a hyperparameter sensitivity study.

4.4.1 RQ1: Overall Comparison The goal of the first experiment is to compare the performance of NEEDED and the baseline models on the ADED task. In this comparison, we used Macro-F1, Micro-F1, Macro-AUC, and Micro-AUC as evaluation metrics, which are widely used in multi-label classification problems [11, 20, 23, 24]. The results are shown in Table 1. From the overall results, we observe that: First, our framework achieves the best performance on all evaluation metrics, showing it is better at extracting disease-related information from the OEMR document. Second, our model performs better than w/o c by learning both word-level and sentence-level dependencies and preserving text type information. Third, BERT as a pre-trained model does not perform as well as the other baseline models on most metrics, probably due to the large gap between the text in its pre-training corpus and the OEMR documents.
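As a reference for how the two F1 averaging schemes in Table 1 differ, here is a small self-contained sketch (an illustration, not the paper's evaluation code): Macro-F1 averages per-label F1 scores, while Micro-F1 pools the true/false positive counts across all labels before computing a single F1.

```python
def f1(tp, fp, fn):
    # F1 from counts; define the 0/0 case as 0.
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_micro_f1(y_true, y_pred):
    """y_true, y_pred: lists of samples, each a list of 0/1 over L labels.

    Returns (macro_f1, micro_f1). Macro averages per-label F1;
    micro pools TP/FP/FN counts over all labels.
    """
    L = len(y_true[0])
    per_label = []
    TP = FP = FN = 0
    for l in range(L):
        tp = sum(t[l] and p[l] for t, p in zip(y_true, y_pred))
        fp = sum((not t[l]) and p[l] for t, p in zip(y_true, y_pred))
        fn = sum(t[l] and (not p[l]) for t, p in zip(y_true, y_pred))
        per_label.append(f1(tp, fp, fn))
        TP += tp; FP += fp; FN += fn
    return sum(per_label) / L, f1(TP, FP, FN)
```

Because macro averaging weights every disease equally, it is more sensitive to performance on rare labels than micro averaging, which is why both are reported.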

Figure 2: The effect of distinguishing descriptions relevant to different eyes in the preprocessing step. Each panel compares all models with and without the distinction: (a) Micro-F1, (b) Macro-F1, (c) Micro-AUC, (d) Macro-AUC.
Figure 3: NEEDED performance under different templated asymptomatic description filtering thresholds, where '0' represents the ablation of the filter. (a) Micro-F1 and Macro-F1; (b) Micro-AUC and Macro-AUC.

4.4.2 RQ2: Ablation Study of NEEDED To explore the effectiveness of each part of our framework, we conduct ablation experiments. We observed that our framework outperforms all its variants, illustrating that each component in our framework is indispensable. In addition, there are the following points: First, compared to our model, both w/o c and w/o w show a significant performance drop due to the fact that they both remove the hierarchical Transformer. Meanwhile, w/o c is better than w/o w in most metrics because it retains the attention-based predictor. Second, w/o s outperforms w/o w in all metrics and w/o c is better than w/o w in most metrics, indicating the effectiveness of the attention-based predictor, which can capture disease-specific information. Third, although the performance of w/o l is degraded compared to our framework, it performs better than w/o w in all metrics. This is because it removes the attention-based predictor while retaining the hierarchical Transformer, indicating the importance of learning the dependencies between sentences. Fourth, w/o p is the worst and shows a very significant degradation compared to the other models because it removes the preprocessing component, resulting in clutter and sparsity of information in the model input.

4.4.3 RQ3: The Impact of Distinguishing Descriptions Relevant to Different Eyes To explore the effect of distinguishing descriptions of different eyes, we conduct the following experiment. We first filter part of the asymptomatic descriptions in the OEMR document according to the previous section. Then, instead of distinguishing the remaining descriptions to form two documents, we use all filtered texts as model inputs to obtain the diagnosis results for both eyes. We perform this experiment on our framework and all baseline methods, then compare the results with the case where the distinction was made. The results are shown in Figure 2. We observe that the performance of all models degrades significantly if the descriptions are not distinguished. The reason is the information clutter caused by descriptions of different eyes being mixed up. Besides, we observed that our model is the most affected, probably because it utilizes the hierarchical Transformer to model the document's information at the sentence level, which exacerbates the information clutter. In addition, we observed that BERT is the least affected, probably because it has a huge number of model parameters and is pre-trained on a large corpus, making its fitting ability more stable despite the information clutter.

4.4.4 RQ4: The Impact of Filtering Templated Asymptomatic Descriptions In our framework, we filter out templated asymptomatic descriptions that occur more frequently than a certain threshold before the OEMR document is fed into the downstream model. In this experiment, we explore the impact of this design by setting different filtering thresholds and by not filtering at all. Specifically, we set the filtering threshold B to 0.05, 0.1, 0.2, and 0.3, as well as no filtering, and observe the changes in the performance of our framework. The results are shown in Figure 3, where '0' represents the ablation of the filter. We find that once the threshold B is larger than 0.1, the model's performance tends to decrease as B increases. The performance is worst in the absence of filtering, which indicates the validity of filtering the templated asymptomatic descriptions. We also observe that the performance is worse when B is 0.05 than when B is 0.1, indicating that a too-low filtering threshold is harmful. This may be due to some valuable asymptomatic descriptions being filtered out.
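The frequency-based filtering studied in this section (Step 1 of the preprocessing in Section 3.1) can be sketched as below. This is a simplified illustration with hypothetical helper names: it drops any description whose document frequency exceeds the threshold B, whereas the paper applies the filter only to templated asymptomatic descriptions.

```python
from collections import Counter

def filter_templated(documents, threshold):
    """Drop descriptions whose document frequency exceeds `threshold`.

    documents: list of documents, each a list of description strings.
    threshold: maximum allowed fraction of documents a description may
               appear in (the threshold B of Section 3.1, illustrative).
    Returns the filtered documents.
    """
    n_docs = len(documents)
    # Document frequency of each exact description string.
    df = Counter()
    for doc in documents:
        df.update(set(doc))
    keep = lambda s: df[s] / n_docs <= threshold
    return [[s for s in doc if keep(s)] for doc in documents]
```

Counting each description once per document (via `set`) makes the statistic a document frequency in the TF-IDF sense, which matches the intuition above that corpus-wide ubiquity signals a templated, uninformative description.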

Figure 4: Hyperparameter study and model convergence study. (a) Hyperparameter study on transformer layer number. (b) Hyperparameter study on attention head number. (c) Model convergence study.


4.4.5 RQ5: Exploring the Explainability of NEEDED Our predictor utilizes dot-product attention to extract disease-specific information, which improves the model's effectiveness and provides some explainability for the diagnostic results. Specifically, assuming our model diagnoses disease l based on the given input, we can obtain the attention weight vector α^l from the predictor. Its i-th component α^l_i denotes the attention weight of the i-th sentence's representation for disease l, which can be viewed as the importance of the i-th sentence for diagnosing disease l. Thus, for each disease, we can identify the important sentences associated with it in the input based on the attention weights. We show a case to demonstrate the effect of providing explainability in this way. The left part of Figure 5 shows a portion of an OEMR document describing the patient's right eye. Specifically, the patient's right eye was diagnosed with cataract (CAT), meibomian gland dysfunction (MGD), dry eye syndrome (DES), and pterygium by the doctor. Our model yields the same diagnosis. The right part of Figure 5 shows a heatmap in which each column is the distribution of attention weights for one disease. We can observe that: First, our model focuses on different information and makes the correct diagnosis when diagnosing various diseases, demonstrating that it can capture disease-specific information from the document through the attention mechanism. Second, based on the distribution of attention weights, we can track the sentences essential for diagnosing each disease, which provides explainability for our model. Third, we can learn each sentence's role in diagnosing different diseases from the distribution of attention weights, which can inform clinical research on these conditions.

[Figure 5 shows, on the left, excerpts of the OEMR document for the patient's right eye: (S1) Thickening of the right eyelid margin with dilated capillaries … (S2) Abnormal conjunctival hyperplasia on the nasal side of the right eye… (S3) Pterygium growing horizontally into the cornea of the right eye up to 3mm… (S4) Mild clouding of the lens in the right eye, without tremor… (S5) Moderate vitreous turbidity in the right eye… On the right is the attention heatmap.]

Figure 5: Explainability Case Study. The left part of this figure shows a portion of an OEMR document describing the patient's right eye, and the right part shows a heatmap in which each column is the distribution of attention weights for one disease.

4.4.6 Hyperparameter Sensitivity and Model Convergence Study The number of Transformer layers in the word-level Transformer and the number of self-attention heads in the hierarchical Transformer are two critical hyperparameters in our model. To assess their impact, we investigate their sensitivity and show the results in Figure 4. We observe that the performance of our model first increases and then decreases as the number of Transformer layers varies from one to eight. The performance shows the same trend as the number of self-attention heads increases. This may be because the fitting ability of the model improves as more parameters are introduced, but with too many parameters the model overfits. Besides, we collected the loss and accuracy of each epoch during training and visualized them to study the convergence of our model. We observe that the model's accuracy increases rapidly in the first ten training epochs and then tends to converge; the loss curve shows that the training process is stable.

5 RELATED WORK
5.1 Clinical Text Classification With the development of NLP techniques, various neural models have been proposed to address the problem of text-based automatic disease diagnosis and other clinical text classification tasks.
Many studies treat medical text as free text and propose methods based on neural networks [6, 7, 8, 9, 5, 25] to solve the medical text classification problem.
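As a concrete illustration of the sentence-level attention described in Section 4.4.5, the following sketch ranks sentences by their attention weight for each disease. All shapes and values are invented stand-ins: in NEEDED, the sentence representations would come from the hierarchical transformer encoder and the per-disease queries would be learned; this is not the paper's implementation.

```python
import math

# Invented toy tensors: 3 sentence representations (d = 2) and 2 disease queries.
H = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # sentence representations h_i
Q = [[2.0, 0.0], [0.0, 2.0]]               # per-disease query vectors q_l

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

d = len(H[0])
# alpha[l][i]: attention weight of sentence i when diagnosing disease l
alpha = [softmax([dot(q, h) / math.sqrt(d) for h in H]) for q in Q]

# Disease-specific document representation fed to the classifier ...
doc_repr = [[sum(a * h[k] for a, h in zip(alpha_l, H)) for k in range(d)]
            for alpha_l in alpha]

# ... and, for explainability, the sentences ranked by importance per disease
for l, alpha_l in enumerate(alpha):
    order = sorted(range(len(H)), key=lambda i: -alpha_l[i])
    print(f"disease {l}: most important sentence = S{order[0] + 1}")
```

Each row of alpha is a probability distribution over sentences, so reading it column-wise across diseases recovers exactly the kind of heatmap shown in Figure 5.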

For instance, Yang et al. [8] train a multi-layer CNN to predict several basic diseases based on electronic medical records. Girardi et al. [9] propose an attention-based CNN model to assess patient risk and detect warning symptoms. Yuwono et al. [25] combine CNN and RNN to diagnose appendicitis based on clinical notes. Hashir and Sawhney [5] establish a hierarchical neural network to predict the mortality of inpatients using clinical records as input. Some work extracts structured knowledge from clinical text and then uses it to improve model performance or obtain results directly [13, 14]. For example, Chen et al. [13] propose a disease diagnosis framework based on CNNs and Bayesian networks, which extracts symptom entities from medical records and then uses them to enhance model performance. Yuan et al. [14] first extract medical entities from medical records and construct a graph based on the relationships between them, and later extract structured information from the graph for disease diagnosis.
Nowadays, transformers [10] are widely adopted in various scenarios [11, 26, 27, 28]. For clinical text classification, Feng et al. [27] build a hierarchical CNN-transformer model for sepsis and mortality prediction. Si and Roberts [28] propose a clinical document classification model based only on transformers.

5.2 Automatic Diagnosis of Eye Diseases Many scholars have explored how to automatically diagnose eye diseases based on artificial intelligence. Most of these studies are based on image data [2, 3, 4, 29]. For example, Christopher et al. [2] train several neural networks with ocular fundus photographs as input to diagnose glaucomatous optic nerve damage. Keel et al. [3] build two CNN models based on fundus images to determine the presence of diabetic retinopathy and glaucoma, respectively. Besides, there are a few studies based on text data [30, 31]. Apostolova et al. [30] build an open globe injury identification model based on TF-IDF and SVM. Saleh et al. [31] extract numerical features from the electronic health records of diabetic patients and use machine learning models to predict the risk of diabetic retinopathy.

6 Conclusion
This paper presents a framework for the automatic diagnosis of eye diseases based on OEMR, which consists of three modules: a preprocessing component, a hierarchical transformer encoder, and an attention-based predictor. In the preprocessing step, we filter out the useless descriptions and then distinguish the content relevant to each eye in the OEMR document to form two documents. Then, we obtain the representation matrix for the input document based on the hierarchical transformer encoder. Finally, we utilize dot-product attention to capture the disease-specific information and make the diagnosis. Experiments on a real-world dataset show the superiority and explainability of our framework and the effectiveness of each of its components.

7 Acknowledgements
This work was supported by the National Key Research and Development Plan of China under Grant No. 2022YFF0712200, the Natural Science Foundation of China under Grant No. 61836013, the Science and Technology Service Network Initiative, Chinese Academy of Sciences under Grant No. KFJ-STS-QYZD-2021-11-001, the Beijing Natural Science Foundation under Grant No. 4212030, and the Youth Innovation Promotion Association CAS.

References

[1] World Health Organization et al. World report on vision. 2019.
[2] Mark Christopher, Akram Belghith, Christopher Bowd, James A Proudfoot, Michael H Goldbaum, Robert N Weinreb, Christopher A Girkin, Jeffrey M Liebmann, and Linda M Zangwill. Performance of deep learning architectures and transfer learning for detecting glaucomatous optic neuropathy in fundus photographs. Scientific Reports, 8(1):1–13, 2018.
[3] Stuart Keel, Jinrong Wu, Pei Ying Lee, Jane Scheetz, and Mingguang He. Visualizing deep learning models for the detection of referable diabetic retinopathy and glaucoma. JAMA Ophthalmology, 137(3):288–292, 2019.
[4] Dan Milea, Raymond P Najjar, Zhubo Jiang, Daniel Ting, Caroline Vasseneix, Xinxing Xu, Masoud Aghsaei Fard, Pedro Fonseca, Kavin Vanikieti, Wolf A Lagrèze, et al. Artificial intelligence to detect papilledema from ocular fundus photographs. New England Journal of Medicine, 382(18):1687–1695, 2020.
[5] Mohammad Hashir and Rapinder Sawhney. Towards unstructured mortality prediction with free-text clinical notes. Journal of Biomedical Informatics, 108:103489, 2020.
[6] Ying Sha and May D Wang. Interpretable predictions of clinical outcomes with an attention-based recurrent neural network. In Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, pages 233–240, 2017.
[7] Annika M Schoene, George Lacey, Alexander P Turner, and Nina Dethlefs. Dilated LSTM with attention for classification of suicide notes. In Proceedings of the Tenth International Workshop on Health Text Mining and Information Analysis (LOUHI 2019), pages 136–145, 2019.

[8] Zhongliang Yang, Yongfeng Huang, Yiran Jiang, Yuxi Sun, Yu-Jin Zhang, and Pengcheng Luo. Clinical assistant diagnosis for electronic medical record based on convolutional neural network. Scientific Reports, 8(1):1–9, 2018.
[9] Ivan Girardi, Pengfei Ji, An-phi Nguyen, Nora Hollenstein, Adam Ivankay, Lorenz Kuhn, Chiara Marchiori, and Ce Zhang. Patient risk assessment and warning symptom detection using deep attention-based neural networks. arXiv preprint arXiv:1809.10804, 2018.
[10] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
[11] Meng Xiao, Ziyue Qiao, Yanjie Fu, Yi Du, Pengyang Wang, and Yuanchun Zhou. Expert knowledge-guided length-variant hierarchical label generation for proposal classification. In 2021 IEEE International Conference on Data Mining (ICDM), pages 757–766. IEEE, 2021.
[12] Yiming Zhang, Ying Weng, and Jonathan Lund. Applications of explainable artificial intelligence in diagnosis and surgery. Diagnostics, 12(2):237, 2022.
[13] Jun Chen, Xiaoya Dai, Quan Yuan, Chao Lu, and Haifeng Huang. Towards interpretable clinical diagnosis with Bayesian network ensembles stacked on entity-aware CNNs. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3143–3153, 2020.
[14] Quan Yuan, Jun Chen, Chao Lu, and Haifeng Huang. The graph-based mutual attentive network for automatic diagnosis. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, pages 3393–3399, 2021.
[15] Raghavendra Pappagari, Piotr Zelasko, Jesús Villalba, Yishay Carmiel, and Najim Dehak. Hierarchical transformers for long document classification. In 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 838–844. IEEE, 2019.
[16] Xingxing Zhang, Furu Wei, and Ming Zhou. HIBERT: Document level pre-training of hierarchical bidirectional transformers for document summarization. arXiv preprint arXiv:1905.06566, 2019.
[17] Yang Liu and Mirella Lapata. Hierarchical transformers for multi-document summarization. arXiv preprint arXiv:1905.13164, 2019.
[18] Meng Xiao, Ziyue Qiao, Yanjie Fu, Hao Dong, Yi Du, Pengyang Wang, Dong Li, and Yuanchun Zhou. Who should review your proposal? Interdisciplinary topic path detection for research proposals. arXiv preprint arXiv:2203.10922, 2022.
[19] Meng Xiao, Ziyue Qiao, Yanjie Fu, Hao Dong, Yi Du, Pengyang Wang, Hui Xiong, and Yuanchun Zhou. Hierarchical interdisciplinary topic detection model for research proposal classification. arXiv preprint arXiv:2209.13519, 2022.
[20] James Mullenbach, Sarah Wiegreffe, Jon Duke, Jimeng Sun, and Jacob Eisenstein. Explainable prediction of medical codes from clinical text. arXiv preprint arXiv:1802.05695, 2018.
[21] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[22] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
[23] Tong Zhou, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao, Kun Niu, Weifeng Chong, and Shengping Liu. Automatic ICD coding via interactive shared representation networks with self-distillation mechanism. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 5948–5957, 2021.
[24] Jun Chen, Quan Yuan, Chao Lu, and Haifeng Huang. A novel sequence-to-subgraph framework for diagnosis classification. In IJCAI, pages 3606–3612, 2021.
[25] Steven Kester Yuwono, Hwee Tou Ng, and Kee Yuan Ngiam. Learning from the experience of doctors: Automated diagnosis of appendicitis based on clinical notes. In Proceedings of the 18th BioNLP Workshop and Shared Task, pages 11–19, Florence, Italy, August 2019. Association for Computational Linguistics.
[26] Meng Xiao, Min Wu, Ziyue Qiao, Zhiyuan Ning, Yi Du, Yanjie Fu, and Yuanchun Zhou. Hierarchical mixup multi-label classification with imbalanced interdisciplinary research proposals. arXiv preprint arXiv:2209.13912, 2022.
[27] Jinyue Feng, Chantal Shaib, and Frank Rudzicz. Explainable clinical decision support from text. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1478–1489, 2020.
[28] Yuqi Si and Kirk Roberts. Hierarchical transformer networks for longitudinal clinical document classification. arXiv preprint arXiv:2104.08444, 2021.
[29] Cecilia S Lee, Doug M Baughman, and Aaron Y Lee. Deep learning is effective for classifying normal versus age-related macular degeneration OCT images. Ophthalmology Retina, 1(4):322–327, 2017.
[30] Emilia Apostolova, Helen A White, Patty A Morris, David A Eliason, and Tom Velez. Open globe injury patient identification in warfare clinical notes. In AMIA Annual Symposium Proceedings, volume 2017, page 403. American Medical Informatics Association, 2017.
[31] Emran Saleh, Jerzy Blaszczyński, Antonio Moreno, Aida Valls, Pedro Romero-Aroca, Sofia de la Riva-Fernandez, and Roman Slowiński. Learning ensemble classifiers for diabetic retinopathy assessment. Artificial Intelligence in Medicine, 85:50–63, 2018.
