1706 File Paper

A Biomedical Entity Extraction Pipeline for
Oncology Health Records in Portuguese

Anonymous Author(s)
ABSTRACT as electronic health records (EHRs) – compile all available patient- 9
Textual health records of cancer patients are usually protracted and specific information, such as previous conditions, diagnoses, treat- 10
highly unstructured, making it very time-consuming for health pro- ments, prescriptions, laboratory test results, radiological images, 11
fessionals to get a complete overview of the patient’s therapeutic and phenotypic and genotypic features [2, 15]. As a result, much 12
course. As such limitations can lead to suboptimal and/or inefficient potential lurks within EHRs [39]. First, from an integrated health- 13
treatment procedures, healthcare providers would greatly benefit care standpoint, by providing clinicians with instant and portable 14
from a system that effectively summarizes the information of those access to all up-to-date patient information, EHRs streamline local 15
records. With the advent of deep neural models, this objective has and remote care delivery, potentially leading to better-informed 16
been partially attained for English clinical texts, however, the re- decision-making [2, 11, 15]. Second, the possible benefits of mining 17
search community still lacks an effective solution for languages these records for research purposes are endless. Current research 18
with limited resources. In this paper, we present the approach we of AI methods in the clinical context includes: variant reporting 19
developed to extract procedures, drugs, and diseases from oncol- [2]; public health and environmental impact surveillance [4, 21]; 20
ogy health records written in European Portuguese. This project and precision (or personalized) Medicine [9]. 21
was conducted in collaboration with the Portuguese Institute for Although most successful applications of AI in healthcare have 22
Oncology which, besides holding over 10 years of duly protected been for image-based diagnosis and detection [25], most of EHRs 23
medical records, also provided oncologist expertise throughout the are primarily composed of free-text clinical narratives describing 24
development of the project. Since there is no annotated corpus for the patient’s history, which opens up a wide range of possibilities 25
biomedical entity extraction in Portuguese, we also present the for natural language processing (NLP) applications. Nevertheless, 26
strategy we followed in annotating the corpus for the development due to their unstructured nature, harnessing the information within 27
of the models. The final models, which combined a neural archi- these records is not straightforward. When compared with a more 28
tecture with entity linking, achieved 𝐹 1 scores of 88.6, 95.0, and structured representation, the free-text format has the advantage of 29
55.8 per cent in the mention extraction of procedures, drugs, and allowing the medical staff to communicate more naturally, provid- 30
diseases, respectively. ing adequate freedom for clear human-to-human communication. 31
On the other hand, since the amount of text grows with the patient’s 32
CCS CONCEPTS interactions with healthcare professionals, the documents quickly 33
reach dozens and hundreds of pages. This problem is aggravated in 34

• Computing methodologies → Information extraction; • In-
patients with persistent diseases, such as cancer, where the number 35
formation systems → Data mining; Decision support systems;
of pages can run into the thousands. Such lengthy documents are in- 36
Document structure;
tractable for the clinical staff to handle, especially in an emergency 37
setting where time is scarce to take action. In such a scenario, a 38

KEYWORDS tool that would automatically and reliably summarize the patient’s 39
Biomedical Entity Recognition, Data Mining, Oncology Electronic therapeutic path would be a great asset for improving healthcare 40
Health Records services. 41
ACM Reference Format: An important step to achieve such summarization is the identifi- 42
Anonymous Author(s). 2023. A Biomedical Entity Extraction Pipeline for, cation/extraction of the biomedical entities in the text – such as 43
Oncology Health Records in Portuguese. In Proceedings of ACM SAC Confer- diseases, medical procedures, or drugs. Although some systems 44
ence (SAC’23). ACM, New York, NY, USA, 7 pages. https://doi.org/10.1145/ were developed for general-purpose clinical texts written in Eng- 45
nnnnnnn.nnnnnnn lish – including Apache cTAKES1 , CLAMP2 , and more notably IBM 46
Watson Health3 – the same can not be said for languages with 47
1 1 INTRODUCTION relatively low-resources, where efficient extraction of medical en- 48
2 The effectiveness achieved by artificial intelligence (AI) systems in tities remains a challenge [1]. The lack of effectiveness is mainly 49
3 recent years has led to an increase in practical applications in health- attributed to the scarcity or nonexistence of publicly available data, 50
4 care support. To give an example, the number of FDA-approved which limits the development of tools for clinical texts (such as the 51
5 AI medical algorithms has grown rapidly in recent years [8]. Such pre-training of large language models for the clinical setting). Apart 52
6 growth would not be possible without the digitization of healthcare, from that, the heterogeneity of clinical narratives is also a major 53
7 which has significantly increased the number of clinical records problem. For instance, while the clinical note that results from a 54
8 stored in digital collections. These longitudinal records – known scheduled appointment is descriptive and syntactically correct, the 55
1 https://ctakes.apache.org/
SAC’23, March 27 –April 2, 2023, Tallinn, Estonia
2 https://clamp.uth.edu/
2023. ACM ISBN 978-x-xxxx-xxxx-x/YY/MM. . . $15.00
3 https://www.ibm.com/watson-health
https://doi.org/10.1145/nnnnnnn.nnnnnnn
SAC’23, March 27 –April 2, 2023, Tallinn, Estonia Anon.
56 same is not true for notes from the emergency service where item- NLP, in particular with large language models such as BERT [37], 111
57 ized text is prevalent and the use of abbreviations and acronyms BioBERT [20], or PubMedBERT [12], have mitigated this problem. 112
58 more frequent. Therefore, a model that is trained in scheduled ap- Moreover, training deep learning architectures such as Long Short- 113
59 pointment documents would (most likely) present a much lower Term Memory (LSTM) and Conditional Random Field (CRF) on top 114
60 effectiveness in emergency service documents. In practice, this of the large language models word representations, had greatly 115
61 means that to achieve an effective system, one must annotate a improved the effectiveness of BioNER for English texts over the 116
62 corpus within the domain/institution where the model will be de- last few years [10, 13, 38, 40]. 117
63 ployed. Since the current state-of-the-art systems (which are based In that vein, a few deep-learning-based NLP approaches have been 118
64 on deep neural models) require large amounts of annotated data developed to support clinical decision-making in cancer-related 119
65 to attain high effectiveness, this is a major bottleneck in system contexts using unstructured clinical texts. These have been tested 120
66 development. in several medical fields and applications, such as but not limited 121
67 Faced with the aforementioned issues while developing a system to: senology, to identify and pinpoint local and distant recurrence 122
68 for the extraction of biomedical entities for the Portuguese Institute for breast cancer [16]; endocrinology, to improve the diagnosis of 123
69 for Oncology (IPO) in Porto, Portugal’s largest oncology center, we thyroid cancer [41]; urology, to stratify the risk of developing uri- 124
70 created an efficient approach for the annotation of a medical corpus nary incontinence following partial or full prostate removal [5]; and 125
71 that can be extrapolated for any medical domain and institution clinical trial screenings, to select patients based on prognosis and 126
72 to quickly achieve a corpus that can be used to train a deep neural propensity to changes in therapeutic courses [17]. Additionally, a 127
73 model. In this paper, we describe how the Unified Medical Language BERT-based model – MetBERT – was also created for the prediction 128
74 System (UMLS) metathesaurus was employed to accelerate and of metastatic cancer from clinical notes [24]. 129
75 minimize the required resources for the annotation process of the Nevertheless, developing effective BioNER models for low-resource 130
76 corpus. Furthermore, in this paper, we also show that by combining languages remains a challenge [29]. This is primarily because har- 131
77 three named entity recognition models – for the identification of nessing the power of deep learning models requires a substantial 132
78 procedures, drugs, and diseases – with entity linking on UMLS we amount of data. However, because clinical texts contain highly 133
79 were able to achieve relevant effectiveness for European Portuguese. sensitive information, ethical and legal constraints make it hard 134
80 As this approach is only limited by the availability of a pre-trained to make corpora available. Yet another issue is the specificity of 135
81 large language model and the availability of the language on the the data, which limits its applicability. For example, the only clin- 136
82 UMLS knowledge base, this approach can also be extrapolated to ical text corpus released for European Portuguese [27] does not 137
83 other languages and domains. resemble the clinical notes we found at IPO, as the latter were more 138
84 In summary, the main contributions of this work are the following: disaggregated/itemized and did not follow the traditional seman- 139
85 (i) we present an efficient strategy for the annotation of clinical tic structure of a narrative. Consequently, application projects are 140
86 narratives with biomedical entities that accelerates the annota- forced to develop a corpus for the data with which the application 141
87 tion process; will go into production. In this regard, some efforts have been made 142
88 (ii) we present a methodology to achieve efficient biomedical en- to develop methodologies to simplify the annotation of the clinical 143
89 tity extraction for Portuguese that can be adapted to other corpus. For instance, in a similar challenge to ours, Silvestri et al. 144
90 languages with equivalent resources; [33] used active learning and distant supervision to annotate clini- 145
91 (iii) finally, we present a real-life case study with results obtained cal texts written in Italian. In this work, we adopted a more general 146
92 by applying such a strategy for the extraction of biomedical strategy that requires less computational power. 147
93 entities on free-text oncology narratives written in European

94 Portuguese. 3 CORPUS DEVELOPMENT 148
95 The remainder of the paper is structured as follows: in Section 2 IPO Porto has been keeping electronic health records of its patients 149
96 we frame the presented work in the context of the current research. for more than 10 years. We were able to work with a sufficient 150
97 Section 3 describes the developed methodology for the annotation sample, under confidentiality, and on-site. Among the many files 151
98 of clinical texts along with the corpus that resulted from applying that describe each patient’s journey through the hospital, there 152
99 it in IPO Porto documents. In section 4 we present the prediction is one of particular interest for this project, the clinical diary (see 153
100 pipeline that was implemented on IPO Porto and present the ex- Figure 1). 154
101 perimental procedure that led us to that architecture in Section 5. The clinical diary contains a textual description of every action 155
102 Finally, Section 6 and 7 concludes the paper by discussing possibili- or interaction the patient had in the hospital. This includes any 156
103 ties for future research, respectively. treatment, procedure, appointment, plan of action, and other in- 157
formation deemed relevant by the health professionals. It is a very 158
104 2 RELATED WORK descriptive file and, thereby, a rich source of information. Adversely, 159
it can become very long, repetitive, and hard to be efficiently read 160
105 The aim of biomedical NER (BioNER) is to identify the spans of
by humans in the loop. 161
106 biomedical entities in a clinical text. Not long ago, the best-performing
107 systems relied on hand-crafted rules to capture different character-
108 istics of entity classes [6, 22]. However, these features frequently 3.1 Sampling Process 162
109 result in highly specialized tools that can only be used in the do- For the development of the corpus, we selected 80 fairly short clinic 163
110 mains for which they were designed. Recently, the developments in diaries. In practical terms, this meant sampling documents that 164
Biomedical Entity Extraction Pipeline for Oncology SAC’23, March 27 –April 2, 2023, Tallinn, Estonia
165 were between the first and second quartile in terms of the number Table 1: How the UMLS semantic types were grouped for our
166 of pages, which materialized in documents ranging from 52 to 97 three entity types. TUI stands for “Type Unique Identifier”.
167 A4 equivalent pages long. This is significant because we wanted to
168 reflect a rich variety of clinical note-writing styles as possible in the TUI Name
169 corpus (which vary with the person that writes them). Yet another
170 benefit of sampling from a variety of relatively small documents is T058 Health Care Activity
that it increases the exposure to different patients. Consequently, T059 Laboratory Procedure
171 Procedures
172 this increases the number of biomedical entities that will be present T060 Diagnostic Procedure
173 in the final corpus. T061 Therapeutic or Preventive Procedure
T121 Pharmacologic Substance
174 3.2 Preprocessing T122 Biomedical or Dental Material
Drugs
175 The clinical diary is composed of clinical notes that are delimited T195 Antibiotic
176 by headers, as illustrated by the blue and yellow regions in the T200 Clinical Drug
177 sample page in Figure 1. Important metadata, including the date T019 Congenital Abnormality
178 and the context in which the note was written, is contained in the T020 Acquired Abnormality
179 headers. Since the status of the patient does not change in every T033 Finding
180 interaction that he/she has with the medical staff, it is common T037 Injury or Poisoning
181 that the clinical notes overlap in content. Furthermore, there is T046 Pathologic Function
182 information (such as comorbidities and pathological antecedents) T047 Disease or Syndrome
that is frequently repeated throughout the clinical diary. To avoid Diseases
183
T048 Mental or Behavioral Dysfunction
184 multiple annotations of the same content, we segmented the clinical T049 Cell or Molecular Dysfunction
185 notes into sentences with the segtok sentence segmenter4 . From T050 Experimental Model of Disease
186 the 80 clinic diaries, we collected a total of 36, 512 unique sentences. T184 Sign or Symptom
T190 Anatomical Abnormality
T191 Neoplastic Process
However, it can be manually extended to include new acronyms 194
and abbreviations that may appear in future documents. 195
3.3 Annotation with UMLS 196
To greatly reduce manual annotation effort we start by automati- 197
cally annotating the sentences. For this purpose, the Unified Medical 198
Language System (UMLS) metathesaurus [3] was used as a knowl- 199
edge base for biomedical entity identification. UMLS contains an 200
ontology that maps each biomedical entity (referred to as “concept” 201
in UMLS jargon) to a semantic group, which is a high-level catego- 202
rization of the entities. Examples of semantic groups are “Clinical 203
Drug”, “Pathologic Function”, and “Diagnostic Procedure”. To de- 204
termine the semantic groups of interest in our project, we followed 205
a two-step process. First, we asked the oncology team to pre-select 206
the semantic groups to be extracted based on the description of the 207
semantic group5 . We then annotated 300 sentences with the entities 208
within the initial set of semantic groups and presented the result 209
of the annotation to the oncology team. In conjunction, we parsed 210

Figure 1: Redacted page of the clinic diary. all sentences and iteratively updated the set of semantic groups 211
to add concepts that were not annotated (that is, false negatives). 212
187 Yet another important specificity of clinical notes is the frequent The final set of semantic groups was then grouped into three types: 213
188 use of acronyms and abbreviations which are commonly employed procedures, drugs, and diseases. Table 1 presents the mapping be- 214
189 by clinical staff. Although there has been some work towards using tween semantic groups and the defined entity types6 . Our initial 215
190 word embeddings/deep neural models to handle the acronyms and automatic labeling step aims to catch as many true entities as possi- 216
191 abbreviations [23, 26] we found it sufficiently simple and robust to ble even if at the cost of including many false entities. By following 217
192 build a glossary of the acronyms and abbreviations found in our 5 The semantic groups and their respective definition can be found in https://lhncbc.
193 corpus. In the current version, the glossary contains 183 entries. nlm.nih.gov/ii/tools/MetaMap/Docs/SemGroups_2018.txt
6 For the complete semantic type mappings see https://lhncbc.nlm.nih.gov/ii/tools/
4 https://github.com/fnl/segtok MetaMap/documentation/SemanticTypesAndGroups.html
Table 2: Distribution of the corpus statistics over train, validation, and test sets within the three entity types (procedures, drugs,
and diseases). Sent, B, I, and O stand for the number of sentences, begin tokens, inside tokens, and outside tokens, respectively.
Procedures Drugs Diseases Aggregated

Sent B I O Sent B I O Sent B I O Sent B I O
Train 2,398 2,091 1,225 39,333 1,236 1,128 6 16,529 1,870 1,651 512 26,616 4,219 4,870 1,743 62,328
Validation 599 610 311 8,717 309 285 2 4,163 467 395 141 5,738 1,055 1,290 454 13,503
Test 701 821 452 13,992 377 445 1 6,737 608 734 210 11,743 1,318 2,000 663 25,755
Total 3,698 3,522 1,988 62,042 1,922 1,858 9 27,429 2,945 2,780 863 44,097 6,592 8,160 2,860 101,586
218 this procedure, we ultimately produced a high-recall classifier, and 4 PREDICTION PIPELINE 258
219 therefore, reduce the annotation effort of the specialists to eliminate To model the biomedical entity identification task we designed a 259
220 the false entities. pipeline that conceptually can be segmented into three components: 260
221 To extract the concepts that were within the final set of semantic a pre-trained language model to provide embeddings for the input 261
222 groups, we used the QuickUMLS toolkit7 [34], which implements text; the NER model(s) to identify the entities of interest; and en- 262
223 fast string matching with the UMLS concepts. From the set of 36, 512 tity linking with UMLS to ensure that entities that were found by 263
224 sentences, QuickUMLS identified concepts in 9, 912 of them. the NER models refer to biomedical entities. Figure 2 presents a 264
graphical illustration of the architecture we used on our specific 265
225 3.4 Specialized Curation application for IPO Porto. 266
226 Since the set of semantic groups to be extracted by UMLS was For the pre-trained language model, we employed the BERTimbau8 267
227 selected to maximize recall, the annotation produced by UMLS [35] language model, which was built by training the BERT [7] 268
228 included a lot of vague terms (such as "blood" or "pain") that are not architecture in a mixture of Brazilian and European Portuguese 269
229 useful for oncologists. To curate the annotations, we presented the texts9 . For the identification of the entities, we trained three mod- 270
230 annotations to the oncology team and instructed them to validate els – one for each entity type – using the architecture proposed 271
231 the annotations, i.e., remove false positives. This is a significant by Ma and Hovy [28]. This model architecture was tailored for 272
232 boost in the manual annotation effort as it does not require the sequence labeling, relying on a convolutional neural network [19] 273
233 specialist to read the entire sentence, but only to look at the entities for character encoding, a bi-directional long-short term memory 274
234 that were annotated by UMLS. However, the annotators were also [14] to aggregate character and word representations, and a sequen- 275
235 allowed to add missing entities, in case they drew their attention tial conditional random field [18] to jointly classify all words of a 276
236 while parsing the UMLS annotations. sentence. Lastly, all the entities identified by the NER models are 277
237 Note that, since the annotation produced by UMLS covers a wide looked for in the UMLS (a process commonly referred to as entity 278
238 range of concepts, the revision by the specialist can be interpreted linking) with the list of semantic types we deemed relevant in the 279
239 as the selection of the entities he/she wants the pipeline to identify. automatic annotation phase (see Table 1). If the entity is found in 280
240 Therefore, the resulting dataset will be tailored to the necessities the UMLS and its semantic group is part of the set that is relevant, 281
241 and requirements of the institution. the entity is kept, otherwise, it is discarded. 282
242 After the curation of the specialists and the removal of all the An important remark is that although we use a NER model for pro- 283
243 sentences that did not have any annotation, we were left with 6, 592 cedures, drugs, and diseases, the final class of the entity is defined 284
244 sentences, in which, 3, 689 contained procedures mentions, 1, 922 by the metadata in the UMLS. More precisely, by its semantic group, 285
245 contained drugs mentions, and 2, 945 contained diseases mentions. which is then mapped to the class by the map in Table 1. Conceptu- 286
246 As the goal was to train a NER model, the annotation was stored in ally, the three NER models can be interpreted as just one NER model 287
247 the Inside-Outside-Beginning format [31]. The last row of Table 2 that classifies the tokens as BEGIN, INSIDE, and OUTSIDE rather than 288
248 shows how the sentences and the begin/inside/outside tokens were classifying the tokens as B-PROCEDURE, B-DRUG, et cetera. 289
249 distributed over the datasets that only contained one entity type
250 (“Procedures”, “Drugs”, and “Diseases” columns), and the dataset 5 EXPERIMENTS 290
251 with all the entity types (“Aggregated” column). The remaining rows To evaluate the importance of the components in our prediction 291
252 of the table present the distribution over the train, validation, and pipeline we run experiments with three modeling settings: just 292
253 test set. This split was performed randomly at the sentence level of using the UMLS predictions; just using the NER predictions; and 293
254 the “Aggregated” dataset. First, 20% of the sentences were set aside using the full pipeline prediction. In this section, we describe how 294
255 for test. Then, the remaining 80% of sentences were partitioned so the experiments were executed. 295
256 that 80% of the documents were used to train the models and 20%
257 to validate them.
8 https://huggingface.co/neuralmind/bert-base-portuguese-cased
9 Thereis currently no large-language model trained exclusively on European Por-
7 https://github.com/Georgetown-IR-Lab/QuickUMLS tuguese texts.
Figure 2: Graphical representation of the proposed pipeline.
296 5.1 Evaluation Spark NLP to train and implement the NER models. All the experi- 333
297 For the evaluation and comparison of the models, we employed ments were run on a server with an AMD EPYC 7351P processor 334
298 the strict and relaxed Precision, Recall, and 𝐹 1 metrics. Whereas and NVIDIA GeForce GTX 1080 Ti GPU. 335
299 the strict metrics only count exact matches, the relaxed metrics
300 consider the overlaps between the predictions and the annotations 5.4 Training 336
301 as being correct. Therefore, by presenting both we are able to attain As stated in Section 4, three NER models were trained for the 337
302 a more comprehensive understanding of the model’s effectiveness. identification of procedures, drugs, and diseases. Before training, 338
303 Naturally, for the dataset, we used the corpus that is described in we randomly sampled a set of 20 hyperparameter configurations to 339
304 Section 3. To provide a detailed overview of the model’s effective- train each of the models. The hyperparameters that were tuned are: 340
305 ness, besides presenting the result on the Aggregated dataset (the the dropout rate, which was sampled from {0.0, 0.1, 0.2, 0.3, 0.4, 341
306 dataset that contains the annotation of all the entity types) we also 0.5}; the batch size, which was sampled from {4, 8, 16, 32, 64, 128}; 342
307 analyze the results of the model on the Procedures, Drugs, and Dis- and the learning rate, which was sampled from {0.001, 0.003, 0.005, 343
308 eases datasets. The train, validation, and test partitions are present 0.01, 0.03, 0.05, 0.1}. For each hyperparameter configuration, we 344
309 in Table 2. trained the model for 30 epochs and stored the weights of the epoch 345
in which the model had the best micro 𝐹 1 score in the validation 346
310 5.2 Models set. After training the 20 models for each entity type, we kept the 347
311 The three models we use in our experiments are the following: model with the best validation 𝐹 1 as our final model. 348
312 UMLS. For the baseline model we used UMLS as introduced in Sec-
313 tion 3. That is, we used the QuickUMLS tool to search for concepts
5.5 Results 349
314 belonging to the semantic groups that were deemed relevant for The results of the final models on the test set of the datasets are 350
315 each entity type (see Table 1). This is a simple baseline that can be presented in Table 3. In a general overview of the effectiveness of 351
316 easily reproduced and therefore, provide a reference metrics that the three models, one can observe that the BERT & UMLS model 352
317 can serve as a base for comparison in future research. presents the best strict and relaxed 𝐹 1 in all the datasets except 353
for the Diseases, where the UMLS model presents the best results. 354
318 NER. The joint NER task is approached by first aggregating the However note that while the strict 𝐹 1 of the NER & UMLS model 355
319 predictions of three separate, but structurally identical, NER models. for the identification of Diseases is 23.9 percentage points worst 356
320 One for identifying mentions to procedures, another for diseases than the UMLS model, when we compare the relaxed 𝐹 1 score the 357
321 and a third for drugs. This serves to evaluate the effectiveness of the difference is only 3.7 percentage points. Upon close inspection, we 358
322 NER models without doing entity linking with UMLS. As we use found that UMLS has a tendency to annotate the entities along with 359
323 three different models, one token may be classified with two entity its attributes, for example “benign prostatic hyperplasia”, while 360
324 types. That is, a token can be classified as B-DRUG and B-DISEASE. the NER & UMLS models focus more on the key words, in the 361
325 In that case, we check the raw values outputted by the NER models previous example, it only identified “hyperplasia” as a disease10 . 362
326 and keep the entity with the highest value. Another aspect that might be hurting the NER performance is 363
327 NER & UMLS. Lastly, the NER & UMLS is the model described in the wide variety of concepts fitted under the “Diseases” category 364
328 Section 4 that does entity linking of the entities predicted by the (see Table 1). Further segmenting this category might yield better 365
329 NER model. results, however further studies would be required to validate this 366
hypothesis. Apart from that, we can see that the NER & UMLS 367
330 5.3 Implementation Details

331 The pipeline was implemented in Python. In particular, we used 10 Please
note that we present the examples in English for the reader’s convenience.
332 the QuickUMLS toolkit to implement the UMLS entity linking, and However, these excerpts have been translated from European Portuguese.
Table 3: Results of evaluating the three models on test partition of the datasets. In bold and underlined are highlighted the
best strict and relaxed 𝐹 1 , respectively.
UMLS NER NER & UMLS

Strict Relaxed Strict Relaxed Strict Relaxed
P R F1 P R F1 P R F1 P R F1 P R F1 P R F1
Procedures 65.4 96.7 78.0 68.5 99.9 81.3 75.2 80.8 77.9 81.0 87.0 83.9 95.0 83.1 88.6 98.2 85.6 91.4
Drugs 75.8 92.5 83.3 76.2 93.0 83.7 74.5 95.2 83.6 74.5 95.2 83.6 98.6 91.6 95.0 98.6 91.6 95.0
Diseases 67.6 97.2 79.7 70.1 99.8 82.4 57.4 46.0 51.1 87.1 69.7 77.4 67.5 47.5 55.8 95.7 66.8 78.7
Aggregated 44.5 97.0 61.0 47.3 99.7 64.2 62.3 68.0 65.0 75.7 81.8 78.6 79.8 67.3 73.0 92.3 77.4 84.2
368 model presents an 𝐹 1 score of 88.6% and 95.5% on the identification (much broader than those that developed the approach) which will 408
369 of Procedures and Drugs, respectively. not only provide a qualitative evaluation of the pipeline but also 409
370 Another interesting remark that can be made upon inspection of more fine-grained feedback that will allow a quantitative study. 410
371 Table 3 is the fact that the NER models greatly benefit from the A path we are keen to explore is the effectiveness impact of swap- 411
372 entity linking with UMLS. This is mostly due to the removal of false ping the general purpose BERTimbau [35] model by the BioBERTpt 412
373 entities that were identified by the NER models that do not find [32] model, or even train/fine-tune BERT on European Portuguese 413
374 any match on UMLS, which results in a major boost in precision. texts has it has been shown to yield better results [30, 36]. This 414
375 However, by comparing the NER and the NER & UMLS models on seems a natural route to further improve the performance of the 415
376 the Procedures and Diseases datasets, one can see that the linking system. 416
377 with UMLS also increased the recall of the NER models. This effect
378 is due to the fact that in the NER & UMLS model the entity type is
379 defined by the semantic group of the entity and not by the model REFERENCES 417
380 that identified the entity. Therefore, linking with the UMLS is also [1] Mohamed AlShuweihi, Said A. Salloum, and Khaled F. Shaalan. 2021. Biomedical 418
Corpora and Natural Language Processing on Clinical Text in Languages Other 419
381 useful to classify the entity type. The same effect was not observed Than English: A Systematic Review. In Recent Advances in Intelligent Systems and 420
382 in the "Drugs" dataset, where the recall got worst after linking Smart Applications. 421
383 with UMLS. However, this dataset is also the smallest so further [2] Jean Emmanuel Bibault, Philippe Giraud, and Anita Burgun. 2016. Big Data and 422
machine learning in radiation oncology: State of the art and future prospects. 423
384 studies would be necessary to have a better understanding of this Cancer Letters 382, 1 (2016), 110–117. https://doi.org/10.1016/j.canlet.2016.05.033 424
385 phenomenon. [3] Olivier Bodenreider. 2004. The unified medical language system (UMLS): in- 425
tegrating biomedical terminology. Nucleic acids research 32, suppl_1 (2004), 426
D267–D270. 427
386 6 CONCLUSION [4] Mary Regina Boland, Lena M. Davidson, Silvia P. Canelón, Jessica Meeker, Trevor 428
Penning, John H. Holmes, and Jason H. Moore. 2021. Harnessing electronic
387 This work describes the methodology employed to achieve effective health records to study emerging environmental disasters: a proof of concept
429
430
388 results for biomedical entity identification in oncology texts writ- with perfluoroalkyl substances (PFAS). npj Digital Medicine 4, 1 (aug 2021), 1–10. 431
389 ten in European Portuguese when no initial annotated corpus was https://doi.org/10.1038/s41746-021-00494-5 432
[5] Selen Bozkurt, Rohan Paul, Jean Coquet, Ran Sun, Imon Banerjee, James D. 433
390 available. That includes the annotation of a clinical corpus with the Brooks, and Tina Hernandez-Boussard. 2020. Phenotyping severity of patient- 434
391 UMLS metathesaurus, the data pipeline created, and the training of centered outcomes using clinical notes: A prostate cancer use case. Learning 435
392 three NER models for the identification of procedures, drugs, and Health Systems 4, 4 (oct 2020). https://doi.org/10.1002/lrh2.10237 436
[6] David Campos, Sérgio Matos, and José Luís Oliveira. 2012. Biomedical Named 437
393 diseases. Even though the methodology was employed for Euro- Entity Recognition: A Survey of Machine-Learning Tools. 438
394 pean Portuguese texts, it can be adapted to other languages with [7] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: 439
Pre-training of Deep Bidirectional Transformers for Language Understanding.
395 comparable resources. More precisely, the span and applicability ArXiv abs/1810.04805 (2019).
440
441
396 of this methodology are dependent on three main aspects, namely: [8] Shadi Ebrahimian, Mannudeep K. Kalra, Sheela Agarwal, Bernardo Canedo Bizzo, 442
397 the availability of the UMLS metathesaurus for the language;11 Mona Elkholy, Christoph Wald, Bibb Allen, and Keith J. Dreyer. 2021. FDA- 443
regulated AI Algorithms: Trends, Strengths, and Gaps of Validation Studies. 444
398 the availability and effectiveness of a large language model for Academic radiology (2021). 445
399 the language; and the accessibility to clinical texts. Given these [9] Guy Fagherazzi. 2020. Deep Digital Phenotyping and Digital Twins for Precision 446
400 three components, the reader should be able to extrapolate this Health: Time to Dig Deeper. Journal of medical Internet research 22, 3 (2020), 447
e16770. https://doi.org/10.2196/16770 448
401 methodology to his/her project. [10] John Giorgi and Gary D Bader. 2018. Transfer learning for biomedical named 449
entity recognition with neural networks. Bioinformatics 34 (2018), 4087 – 4094. 450
402 7 FUTURE RESEARCH [11] Mark L. Graber, Colene Byrne, and Doug Johnston. 2017. The impact of elec- 451
tronic health records on diagnosis. , 211–223 pages. https://doi.org/10.1515/ 452
403 The presented models are part of a bigger pipeline that ends up dx-2017-0012 453
[12] Yu Gu, Robert Tinn, Hao Cheng, Michael Lucas, Naoto Usuyama, Xiaodong
404 feeding a dashboard that summarizes the patient therapeutic path. Liu, Tristan Naumann, Jianfeng Gao, and Hoifung Poon. 2021. Domain-Specific
454
455
405 Arrangements are being made so that a production test is conducted Language Model Pretraining for Biomedical Natural Language Processing. ACM 456
406 for this dashboard on the IPO emergency service. This test will Transactions on Computing for Healthcare (HEALTH) 3, 1 (oct 2021). https: 457
//doi.org/10.1145/3458754 arXiv:2007.15779 458
407 expose the pipeline predictions to a panel of oncology specialists [13] Maryam Habibi, Leon Weber, Mariana L. Neves, D. Wiegandt, and Ulf Leser. 459
11 See 2017. Deep learning with word embeddings improves biomedical named entity 460
https://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/ recognition. Bioinformatics 33 (2017), i37 – i48. 461
release/abbreviations.html#LAT for the complete list of languages supported by UMLS.
462 [14] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. [36] Inigo Jauregi Unanue, Ehsan Zare Borzeshi, and Massimo Piccardi. 2017. Re- 539
463 Neural Computation 9 (1997), 1735–1780. current neural networks with specialized word embeddings for health-domain 540
464 [15] Peter B. Jensen, Lars J. Jensen, and Soøren Brunak. 2012. Mining electronic health named-entity recognition. Journal of biomedical informatics 76 (2017), 102–109. 541
465 records: Towards better research applications and clinical care. , 395–405 pages. [37] Ashish Vaswani, Noam M. Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, 542
466 https://doi.org/10.1038/nrg3208 Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you 543
467 [16] Yasmin H. Karimi, Douglas W. Blayney, Allison W. Kurian, Jeanne Shen, Rikiya Need. ArXiv abs/1706.03762 (2017). 544
468 Yamashita, Daniel Rubin, and Imon Banerjee. 2021. Development and Use of [38] Xuan Wang, Yu Zhang, Xiang Ren, Yuhao Zhang, Marinka Zitnik, Jingbo Shang, C. 545
469 Natural Language Processing for Identification of Distant Cancer Recurrence Langlotz, and Jiawei Han. 2019. Cross-type Biomedical Named Entity Recognition 546
470 and Sites of Distant Recurrence Using Unstructured Electronic Health Record with Deep Multi-Task Learning. Bioinformatics 35 10 (2019), 1745–1752. 547
471 Data. JCO Clinical Cancer Informatics 5, 5 (dec 2021), 469–478. https://doi.org/ [39] Yanshan Wang, Liwei Wang, Majid Rastegar-Mojarad, Sungrim Moon, Feichen 548
472 10.1200/cci.20.00165 Shen, Naveed Afzal, Sijia Liu, Yuqun Zeng, Saeed Mehrabi, Sunghwan Sohn, and 549
473 [17] Kenneth L. Kehl, Stefan Groha, Eva M. Lepisto, Haitham Elmarakeby, James Hongfang Liu. 2018. Clinical information extraction applications: A literature 550
474 Lindsay, Alexander Gusev, Eliezer M. Van Allen, Michael J. Hassett, and Deborah review. Journal of biomedical informatics 77 (2018), 34–49. 551
475 Schrag. 2021. Clinical Inflection Point Detection on the Basis of EHR Data [40] Wonjin Yoon, Chan Ho So, Jinhyuk Lee, and Jaewoo Kang. 2019. CollaboNet: 552
476 to Identify Clinical Trial–Ready Patients With Cancer. JCO Clinical Cancer collaboration of deep neural networks for biomedical named entity recognition. 553
477 Informatics 5, 5 (dec 2021), 622–630. https://doi.org/10.1200/cci.20.00184 BMC Bioinformatics 20 (2019). 554
478 [18] John D. Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional [41] Qiang Zhang, Sheng Zhang, Jianxin Li, Yi Pan, Jing Zhao, Yixing Feng, Yanhui 555
479 Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. Zhao, Xiaoqing Wang, Zhiming Zheng, Xiangming Yang, Lixia Liu, Chunxin 556
480 In ICML. Qin, Ke Zhao, Xiaonan Liu, Caixia Li, Liuyang Zhang, Chunrui Yang, Na Zhuo, 557
481 [19] Yann LeCun, Bernhard E. Boser, John S. Denker, Donnie Henderson, Richard E. Hong Zhang, Jie Liu, Jinglei Gao, Xiaoling Di, Fanbo Meng, Wei Ji, Meng Yang, 558
482 Howard, Wayne E. Hubbard, and Lawrence D. Jackel. 1989. Backpropagation Xiaojie Xin, Xi Wei, Rui Jin, Lun Zhang, Xudong Wang, Fengju Song, Xiangqian 559
483 Applied to Handwritten Zip Code Recognition. Neural Computation 1 (1989), Zheng, Ming Gao, Kexin Chen, and Xiangchun Li. 2022. Improved diagnosis of 560
484 541–551. thyroid cancer aided with deep learning applied to sonographic text reports: a 561
485 [20] Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, retrospective, multi-cohort, diagnostic study. Cancer Biology and Medicine 19, 5 562
486 Chan Ho So, and Jaewoo Kang. 2020. BioBERT: a pre-trained biomedical language (may 2022), 733–741. https://doi.org/10.20892/j.issn.2095-3941.2020.0509 563
487 representation model for biomedical text mining. Bioinformatics 36 (2020), 1234 –
488 1240.
489 [21] Scott H. Lee. 2018. Natural language generation for electronic health records. npj
490 Digital Medicine 1, 1 (nov 2018), 1–7. https://doi.org/10.1038/s41746-018-0070-0
491 arXiv:1806.01353
492 [22] Ulf Leser and Jörg Hakenberg. 2005. What makes a gene name? Named entity
493 recognition in the biomedical literature. Briefings in bioinformatics 6 4 (2005),
494 357–69.
495 [23] Irene Z Li, Michihiro Yasunaga, Muhammed Yavuz Nuzumlali, César Caraballo,
496 Shiwani Mahajan, Harlan M. Krumholz, and Dragomir R. Radev. 2019. A Neural
497 Topic-Attention Model for Medical Term Abbreviation Disambiguation. ArXiv
498 abs/1910.14076 (2019).
499 [24] Ke Liu, Omkar Kulkarni, Martin Witteveen-Lane, Bin Chen, and Dave
500 Chesla. 2022. MetBERT: a generalizable and pre-trained deep learn-
501 ing model for the prediction of metastatic cancer from clinical notes.
502 AMIA ... Annual Symposium proceedings. AMIA Symposium 2022 (2022),
503 331–338. /pmc/articles/PMC9285138//pmc/articles/PMC9285138/?report=
504 abstracthttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC9285138/
505 [25] Xiaoxuan Liu, Livia Faes, Aditya Uday Kale, Siegfried Karl Wagner, Dun Jack
506 Fu, Alice Bruynseels, Thushika Mahendiran, Gabriella Moraes, Mohith Shamdas,
507 Christoph Kern, Joseph R. Ledsam, Martin K. Schmid, Konstantinos Balaskas,
508 Eric J. Topol, Lucas M. Bachmann, Pearse A. Keane, and Alastair K. O. Denniston.
509 2019. A comparison of deep learning performance against health-care profes-
510 sionals in detecting diseases from medical imaging: a systematic review and
511 meta-analysis. The Lancet. Digital health 1 6 (2019), e271–e297.
512 [26] Yue Liu, Tao Ge, Kusum S. Mathews, Heng Ji, and Deborah L. McGuinness. 2015.
513 Exploiting Task-Oriented Resources to Learn Word Embeddings for Clinical
514 Abbreviation Expansion. ArXiv abs/1804.04225 (2015).
515 [27] Fábio Lopes, César Alexandre Teixeira, and Hugo Gonçalo Oliveira. 2019. Contri-
516 butions to Clinical Named Entity Recognition in Portuguese. In BioNLP@ACL.
517 [28] Xuezhe Ma and Eduard H. Hovy. 2016. End-to-end Sequence Labeling via Bi-
518 directional LSTM-CNNs-CRF. ArXiv abs/1603.01354 (2016).
519 [29] Aurélie Névéol, Hercules Dalianis, Guergana K. Savova, and Pierre Zweigenbaum.
520 2018. Clinical Natural Language Processing in languages other than English:
521 opportunities and challenges. Journal of Biomedical Semantics 9 (2018).
522 [30] Denis Newman-Griffis and Ayah Zirikly. 2018. Embedding Transfer for Low-
523 Resource Medical Named Entity Recognition: A Case Study on Patient Mobility.
524 In BioNLP.
525 [31] Lance A. Ramshaw and Mitchell P. Marcus. 1995. Text Chunking using
526 Transformation-Based Learning. ArXiv cmp-lg/9505040 (1995).
527 [32] Elisa Terumi Rubel Schneider, João Vitor Andrioli de Souza, Julien Knafou, Lu-
528 cas E. S. Oliveira, Jenny Copara, Yohan Bonescki Gumiel, Lucas Ferro Antunes
529 de Oliveira, Emerson Cabrera Paraiso, Douglas Teodoro, and Claudia Maria
530 Cabral Moro Barra. 2020. BioBERTpt - A Portuguese Neural Language Model for
531 Clinical Named Entity Recognition. In CLINICALNLP.
532 [33] Stefano Silvestri, Francesco Gargiulo, and Mario Ciampi. 2022. Iterative Annota-
533 tion of Biomedical NER Corpora with Deep Neural Networks and Knowledge
534 Bases. Applied Sciences (2022).
535 [34] Luca Soldaini. 2016. QuickUMLS: a Fast, Unsupervised Approach for Medical
536 Concept Extraction.
537 [35] Fábio Souza, Rodrigo Nogueira, and Roberto de Alencar Lotufo. 2020. BERTimbau:
538 Pretrained BERT Models for Brazilian Portuguese. In BRACIS.

1706 File Paper

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

1706 File Paper

Uploaded by

Copyright:

Available Formats

A Biomedical Entity Extraction Pipeline for

Oncology Health Records in Portuguese

diseases, respectively. ing adequate freedom for clear human-to-human communication. 31

CCS CONCEPTS interactions with healthcare professionals, the documents quickly 33

reach dozens and hundreds of pages. This problem is aggravated in 34

setting where time is scarce to take action. In such a scenario, a 38

1 1 INTRODUCTION relatively low-resources, where efficient extraction of medical en- 48

93 entities on free-text oncology narratives written in European

formation deemed relevant by the health professionals. It is a very 158

However, it can be manually extended to include new acronyms 194

and abbreviations that may appear in future documents. 195

3.3 Annotation with UMLS 196

To greatly reduce manual annotation effort we start by automati- 197

Language System (UMLS) metathesaurus [3] was used as a knowl- 199

edge base for biomedical entity identification. UMLS contains an 200

ontology that maps each biomedical entity (referred to as “concept” 201

in UMLS jargon) to a semantic group, which is a high-level catego- 202

rization of the entities. Examples of semantic groups are “Clinical 203

Drug”, “Pathologic Function”, and “Diagnostic Procedure”. To de- 204

termine the semantic groups of interest in our project, we followed 205

a two-step process. First, we asked the oncology team to pre-select 206

the semantic groups to be extracted based on the description of the 207

of the annotation to the oncology team. In conjunction, we parsed 210

Procedures Drugs Diseases Aggregated

graphical illustration of the architecture we used on our specific 265

225 3.4 Specialized Curation application for IPO Porto. 266

Figure 2: Graphical representation of the proposed pipeline.

330 5.3 Implementation Details

UMLS NER NER & UMLS

You might also like