
A Biomedical Entity Extraction Pipeline for

Oncology Health Records in Portuguese


Anonymous Author(s)
ABSTRACT

Textual health records of cancer patients are usually lengthy and highly unstructured, making it very time-consuming for health professionals to get a complete overview of the patient’s therapeutic course. As such limitations can lead to suboptimal and/or inefficient treatment procedures, healthcare providers would greatly benefit from a system that effectively summarizes the information in those records. With the advent of deep neural models, this objective has been partially attained for English clinical texts; however, the research community still lacks an effective solution for languages with limited resources. In this paper, we present the approach we developed to extract procedures, drugs, and diseases from oncology health records written in European Portuguese. This project was conducted in collaboration with the Portuguese Institute for Oncology which, besides holding over 10 years of duly protected medical records, also provided oncologist expertise throughout the development of the project. Since there is no annotated corpus for biomedical entity extraction in Portuguese, we also present the strategy we followed in annotating the corpus for the development of the models. The final models, which combined a neural architecture with entity linking, achieved F1 scores of 88.6, 95.0, and 55.8 per cent in the mention extraction of procedures, drugs, and diseases, respectively.

CCS CONCEPTS

• Computing methodologies → Information extraction; • Information systems → Data mining; Decision support systems; Document structure;

KEYWORDS

Biomedical Entity Recognition, Data Mining, Oncology Electronic Health Records

ACM Reference Format:
Anonymous Author(s). 2023. A Biomedical Entity Extraction Pipeline for Oncology Health Records in Portuguese. In Proceedings of ACM SAC Conference (SAC’23). ACM, New York, NY, USA, 7 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

1 INTRODUCTION

The effectiveness achieved by artificial intelligence (AI) systems in recent years has led to an increase in practical applications in healthcare support. To give an example, the number of FDA-approved AI medical algorithms has grown rapidly in recent years [8]. Such growth would not be possible without the digitization of healthcare, which has significantly increased the number of clinical records stored in digital collections. These longitudinal records – known as electronic health records (EHRs) – compile all available patient-specific information, such as previous conditions, diagnoses, treatments, prescriptions, laboratory test results, radiological images, and phenotypic and genotypic features [2, 15]. As a result, much potential lurks within EHRs [39]. First, from an integrated healthcare standpoint, by providing clinicians with instant and portable access to all up-to-date patient information, EHRs streamline local and remote care delivery, potentially leading to better-informed decision-making [2, 11, 15]. Second, the possible benefits of mining these records for research purposes are endless. Current research on AI methods in the clinical context includes: variant reporting [2]; public health and environmental impact surveillance [4, 21]; and precision (or personalized) medicine [9].

Although most successful applications of AI in healthcare have been for image-based diagnosis and detection [25], most EHRs are primarily composed of free-text clinical narratives describing the patient’s history, which opens up a wide range of possibilities for natural language processing (NLP) applications. Nevertheless, due to their unstructured nature, harnessing the information within these records is not straightforward. When compared with a more structured representation, the free-text format has the advantage of allowing the medical staff to communicate more naturally, providing adequate freedom for clear human-to-human communication. On the other hand, since the amount of text grows with the patient’s interactions with healthcare professionals, the documents quickly reach dozens or hundreds of pages. This problem is aggravated in patients with persistent diseases, such as cancer, where the number of pages can run into the thousands. Such lengthy documents are intractable for the clinical staff to handle, especially in an emergency setting where time to take action is scarce. In such a scenario, a tool that would automatically and reliably summarize the patient’s therapeutic path would be a great asset for improving healthcare services.

An important step to achieve such summarization is the identification/extraction of the biomedical entities in the text – such as diseases, medical procedures, or drugs. Although some systems were developed for general-purpose clinical texts written in English – including Apache cTAKES¹, CLAMP², and more notably IBM Watson Health³ – the same cannot be said for languages with relatively low resources, where efficient extraction of medical entities remains a challenge [1]. The lack of effectiveness is mainly attributed to the scarcity or nonexistence of publicly available data, which limits the development of tools for clinical texts (such as the pre-training of large language models for the clinical setting). Apart from that, the heterogeneity of clinical narratives is also a major problem. For instance, while the clinical note that results from a scheduled appointment is descriptive and syntactically correct, the

¹ https://ctakes.apache.org/
² https://clamp.uth.edu/
³ https://www.ibm.com/watson-health

SAC’23, March 27 – April 2, 2023, Tallinn, Estonia
2023. ACM ISBN 978-x-xxxx-xxxx-x/YY/MM. . . $15.00
https://doi.org/10.1145/nnnnnnn.nnnnnnn

same is not true for notes from the emergency service, where itemized text is prevalent and the use of abbreviations and acronyms is more frequent. Therefore, a model that is trained on scheduled appointment documents would (most likely) present a much lower effectiveness on emergency service documents. In practice, this means that to achieve an effective system, one must annotate a corpus within the domain/institution where the model will be deployed. Since the current state-of-the-art systems (which are based on deep neural models) require large amounts of annotated data to attain high effectiveness, this is a major bottleneck in system development.

Faced with the aforementioned issues while developing a system for the extraction of biomedical entities for the Portuguese Institute for Oncology (IPO) in Porto, Portugal’s largest oncology center, we created an efficient approach for the annotation of a medical corpus that can be extrapolated to any medical domain and institution to quickly obtain a corpus that can be used to train a deep neural model. In this paper, we describe how the Unified Medical Language System (UMLS) metathesaurus was employed to accelerate the annotation process of the corpus and minimize the resources it requires. Furthermore, we show that by combining three named entity recognition models – for the identification of procedures, drugs, and diseases – with entity linking on UMLS we were able to achieve relevant effectiveness for European Portuguese. As this approach is only limited by the availability of a pre-trained large language model and the availability of the language in the UMLS knowledge base, it can also be extrapolated to other languages and domains.

In summary, the main contributions of this work are the following:
(i) we present an efficient strategy for the annotation of clinical narratives with biomedical entities that accelerates the annotation process;
(ii) we present a methodology to achieve efficient biomedical entity extraction for Portuguese that can be adapted to other languages with equivalent resources;
(iii) finally, we present a real-life case study with results obtained by applying such a strategy for the extraction of biomedical entities on free-text oncology narratives written in European Portuguese.

The remainder of the paper is structured as follows: in Section 2 we frame the presented work in the context of the current research. Section 3 describes the developed methodology for the annotation of clinical texts along with the corpus that resulted from applying it to IPO Porto documents. In Section 4 we present the prediction pipeline that was implemented at IPO Porto, and in Section 5 the experimental procedure that led us to that architecture. Finally, Sections 6 and 7 conclude the paper and discuss possibilities for future research, respectively.

2 RELATED WORK

The aim of biomedical NER (BioNER) is to identify the spans of biomedical entities in a clinical text. Not long ago, the best-performing systems relied on hand-crafted rules to capture different characteristics of entity classes [6, 22]. However, these features frequently result in highly specialized tools that can only be used in the domains for which they were designed. Recently, the developments in NLP, in particular with large language models such as BERT [37], BioBERT [20], or PubMedBERT [12], have mitigated this problem. Moreover, training deep learning architectures such as Long Short-Term Memory (LSTM) and Conditional Random Field (CRF) on top of the word representations of the large language models has greatly improved the effectiveness of BioNER for English texts over the last few years [10, 13, 38, 40].

In that vein, a few deep-learning-based NLP approaches have been developed to support clinical decision-making in cancer-related contexts using unstructured clinical texts. These have been tested in several medical fields and applications, such as but not limited to: senology, to identify and pinpoint local and distant recurrence for breast cancer [16]; endocrinology, to improve the diagnosis of thyroid cancer [41]; urology, to stratify the risk of developing urinary incontinence following partial or full prostate removal [5]; and clinical trial screenings, to select patients based on prognosis and propensity to changes in therapeutic courses [17]. Additionally, a BERT-based model – MetBERT – was also created for the prediction of metastatic cancer from clinical notes [24].

Nevertheless, developing effective BioNER models for low-resource languages remains a challenge [29]. This is primarily because harnessing the power of deep learning models requires a substantial amount of data. However, because clinical texts contain highly sensitive information, ethical and legal constraints make it hard to make corpora available. Yet another issue is the specificity of the data, which limits its applicability. For example, the only clinical text corpus released for European Portuguese [27] does not resemble the clinical notes we found at IPO, as the latter were more disaggregated/itemized and did not follow the traditional semantic structure of a narrative. Consequently, application projects are forced to develop a corpus for the data with which the application will go into production. In this regard, some efforts have been made to develop methodologies to simplify the annotation of the clinical corpus. For instance, in a similar challenge to ours, Silvestri et al. [33] used active learning and distant supervision to annotate clinical texts written in Italian. In this work, we adopted a more general strategy that requires less computational power.

3 CORPUS DEVELOPMENT

IPO Porto has been keeping electronic health records of its patients for more than 10 years. We were able to work with a sufficient sample, under confidentiality, and on-site. Among the many files that describe each patient’s journey through the hospital, there is one of particular interest for this project, the clinical diary (see Figure 1).

The clinical diary contains a textual description of every action or interaction the patient had in the hospital. This includes any treatment, procedure, appointment, plan of action, and other information deemed relevant by the health professionals. It is a very descriptive file and, thereby, a rich source of information. On the downside, it can become very long, repetitive, and hard for humans in the loop to read efficiently.

3.1 Sampling Process

For the development of the corpus, we selected 80 fairly short clinical diaries. In practical terms, this meant sampling documents that

were between the first and second quartile in terms of the number of pages, which materialized in documents ranging from 52 to 97 A4 equivalent pages long. This is significant because we wanted the corpus to reflect as rich a variety of clinical note-writing styles as possible (which vary with the person that writes them). Yet another benefit of sampling from a variety of relatively small documents is that it increases the exposure to different patients. Consequently, this increases the number of biomedical entities that will be present in the final corpus.

3.2 Preprocessing

The clinical diary is composed of clinical notes that are delimited by headers, as illustrated by the blue and yellow regions in the sample page in Figure 1. Important metadata, including the date and the context in which the note was written, is contained in the headers. Since the status of the patient does not change in every interaction that he/she has with the medical staff, it is common for the clinical notes to overlap in content. Furthermore, there is information (such as comorbidities and pathological antecedents) that is frequently repeated throughout the clinical diary. To avoid multiple annotations of the same content, we segmented the clinical notes into sentences with the segtok sentence segmenter⁴. From the 80 clinical diaries, we collected a total of 36,512 unique sentences.

Figure 1: Redacted page of the clinical diary.

Yet another important specificity of clinical notes is the frequent use of acronyms and abbreviations, which are commonly employed by clinical staff. Although there has been some work towards using word embeddings/deep neural models to handle the acronyms and abbreviations [23, 26], we found it sufficiently simple and robust to build a glossary of the acronyms and abbreviations found in our corpus. In the current version, the glossary contains 183 entries. However, it can be manually extended to include new acronyms and abbreviations that may appear in future documents.

Table 1: How the UMLS semantic types were grouped for our three entity types. TUI stands for “Type Unique Identifier”.

Entity type   TUI    Name
Procedures    T058   Health Care Activity
Procedures    T059   Laboratory Procedure
Procedures    T060   Diagnostic Procedure
Procedures    T061   Therapeutic or Preventive Procedure
Drugs         T121   Pharmacologic Substance
Drugs         T122   Biomedical or Dental Material
Drugs         T195   Antibiotic
Drugs         T200   Clinical Drug
Diseases      T019   Congenital Abnormality
Diseases      T020   Acquired Abnormality
Diseases      T033   Finding
Diseases      T037   Injury or Poisoning
Diseases      T046   Pathologic Function
Diseases      T047   Disease or Syndrome
Diseases      T048   Mental or Behavioral Dysfunction
Diseases      T049   Cell or Molecular Dysfunction
Diseases      T050   Experimental Model of Disease
Diseases      T184   Sign or Symptom
Diseases      T190   Anatomical Abnormality
Diseases      T191   Neoplastic Process

3.3 Annotation with UMLS

To greatly reduce manual annotation effort we start by automatically annotating the sentences. For this purpose, the Unified Medical Language System (UMLS) metathesaurus [3] was used as a knowledge base for biomedical entity identification. UMLS contains an ontology that maps each biomedical entity (referred to as “concept” in UMLS jargon) to a semantic type, which is a high-level categorization of the entities. Examples of semantic types are “Clinical Drug”, “Pathologic Function”, and “Diagnostic Procedure”. To determine the semantic types of interest in our project, we followed a two-step process. First, we asked the oncology team to pre-select the semantic types to be extracted based on their descriptions⁵. We then annotated 300 sentences with the entities within the initial set of semantic types and presented the result of the annotation to the oncology team. In conjunction, we parsed all sentences and iteratively updated the set of semantic types to add concepts that were not annotated (that is, false negatives). The final set of semantic types was then grouped into three entity types: procedures, drugs, and diseases. Table 1 presents the mapping between semantic types and the defined entity types⁶. Our initial automatic labeling step aims to catch as many true entities as possible, even at the cost of including many false entities. By following

⁴ https://github.com/fnl/segtok
⁵ The semantic groups and their respective definition can be found in https://lhncbc.nlm.nih.gov/ii/tools/MetaMap/Docs/SemGroups_2018.txt
⁶ For the complete semantic type mappings see https://lhncbc.nlm.nih.gov/ii/tools/MetaMap/documentation/SemanticTypesAndGroups.html

this procedure, we ultimately produce a high-recall classifier and thereby reduce the annotation effort of the specialists to the elimination of false entities.

To extract the concepts that were within the final set of semantic types, we used the QuickUMLS toolkit⁷ [34], which implements fast string matching with the UMLS concepts. From the set of 36,512 sentences, QuickUMLS identified concepts in 9,912 of them.

3.4 Specialized Curation

Since the set of semantic types to be extracted by UMLS was selected to maximize recall, the annotation produced by UMLS included a lot of vague terms (such as "blood" or "pain") that are not useful for oncologists. To curate the annotations, we presented them to the oncology team and instructed the team to validate them, i.e., remove false positives. This significantly speeds up manual annotation, as it does not require the specialist to read the entire sentence, but only to look at the entities that were annotated by UMLS. However, the annotators were also allowed to add missing entities, in case they drew their attention while parsing the UMLS annotations.

Note that, since the annotation produced by UMLS covers a wide range of concepts, the revision by the specialist can be interpreted as the selection of the entities he/she wants the pipeline to identify. Therefore, the resulting dataset will be tailored to the necessities and requirements of the institution.

After the curation by the specialists and the removal of all the sentences that did not have any annotation, we were left with 6,592 sentences, of which 3,698 contained procedure mentions, 1,922 contained drug mentions, and 2,945 contained disease mentions. As the goal was to train a NER model, the annotation was stored in the Inside-Outside-Beginning format [31]. The last row of Table 2 shows how the sentences and the begin/inside/outside tokens were distributed over the datasets that only contained one entity type (“Procedures”, “Drugs”, and “Diseases” columns), and the dataset with all the entity types (“Aggregated” column). The remaining rows of the table present the distribution over the train, validation, and test sets. This split was performed randomly at the sentence level of the “Aggregated” dataset. First, 20% of the sentences were set aside for testing. Then, the remaining 80% of the sentences were partitioned so that 80% of them were used to train the models and 20% to validate them.

Table 2: Distribution of the corpus statistics over train, validation, and test sets within the three entity types (procedures, drugs, and diseases). Sent, B, I, and O stand for the number of sentences, begin tokens, inside tokens, and outside tokens, respectively.

             Procedures                  Drugs                       Diseases                    Aggregated
              Sent     B     I       O    Sent     B     I       O    Sent     B     I       O    Sent     B     I       O
Train        2,398 2,091 1,225  39,333   1,236 1,128     6  16,529   1,870 1,651   512  26,616   4,219 4,870 1,743  62,328
Validation     599   610   311   8,717     309   285     2   4,163     467   395   141   5,738   1,055 1,290   454  13,503
Test           701   821   452  13,992     377   445     1   6,737     608   734   210  11,743   1,318 2,000   663  25,755
Total        3,698 3,522 1,988  62,042   1,922 1,858     9  27,429   2,945 2,780   863  44,097   6,592 8,160 2,860 101,586

4 PREDICTION PIPELINE

To model the biomedical entity identification task we designed a pipeline that conceptually can be segmented into three components: a pre-trained language model to provide embeddings for the input text; the NER model(s) to identify the entities of interest; and entity linking with UMLS to ensure that entities that were found by the NER models refer to biomedical entities. Figure 2 presents a graphical illustration of the architecture we used in our specific application for IPO Porto.

For the pre-trained language model, we employed the BERTimbau⁸ [35] language model, which was built by training the BERT [7] architecture on a mixture of Brazilian and European Portuguese texts⁹. For the identification of the entities, we trained three models – one for each entity type – using the architecture proposed by Ma and Hovy [28]. This model architecture was tailored for sequence labeling, relying on a convolutional neural network [19] for character encoding, a bi-directional long short-term memory [14] to aggregate character and word representations, and a sequential conditional random field [18] to jointly classify all words of a sentence. Lastly, all the entities identified by the NER models are looked up in the UMLS (a process commonly referred to as entity linking) with the list of semantic types we deemed relevant in the automatic annotation phase (see Table 1). If the entity is found in the UMLS and its semantic type is part of the relevant set, the entity is kept; otherwise, it is discarded.

An important remark is that although we use a NER model for each of procedures, drugs, and diseases, the final class of the entity is defined by the metadata in the UMLS, more precisely by its semantic type, which is then mapped to the class by the map in Table 1. Conceptually, the three NER models can be interpreted as just one NER model that classifies the tokens as BEGIN, INSIDE, and OUTSIDE rather than classifying the tokens as B-PROCEDURE, B-DRUG, et cetera.

5 EXPERIMENTS

To evaluate the importance of the components in our prediction pipeline we ran experiments with three modeling settings: just using the UMLS predictions; just using the NER predictions; and using the full pipeline prediction. In this section, we describe how the experiments were executed.

⁷ https://github.com/Georgetown-IR-Lab/QuickUMLS
⁸ https://huggingface.co/neuralmind/bert-base-portuguese-cased
⁹ There is currently no large language model trained exclusively on European Portuguese texts.
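As a rough illustration of the linking step described in Section 4, the sketch below keeps a NER span only when a lookup returns a semantic type listed in Table 1, and takes the final class from that type rather than from the NER model that proposed the span. The `TOY_UMLS` dictionary, the `link_entities` function, and the example spans are hypothetical stand-ins introduced here; the TUIs attached to the toy terms are illustrative assignments, not verified UMLS entries. In the pipeline itself the lookup is done with QuickUMLS against the UMLS metathesaurus.

```python
from typing import Dict, List, Tuple

# Semantic type (TUI) -> entity class, following Table 1.
TUI_TO_CLASS: Dict[str, str] = {
    **{t: "PROCEDURE" for t in ("T058", "T059", "T060", "T061")},
    **{t: "DRUG" for t in ("T121", "T122", "T195", "T200")},
    **{t: "DISEASE" for t in ("T019", "T020", "T033", "T037", "T046", "T047",
                               "T048", "T049", "T050", "T184", "T190", "T191")},
}

# Hypothetical stand-in for a UMLS lookup: lower-cased surface form -> TUI.
# These assignments are for illustration only.
TOY_UMLS: Dict[str, str] = {
    "paracetamol": "T121",
    "biopsy": "T060",
    "hospital": "T073",  # a TUI that is not part of Table 1
}


def link_entities(spans: List[Tuple[str, str]],
                  lookup: Dict[str, str]) -> List[Tuple[str, str]]:
    """Filter NER spans through the lookup and relabel them by semantic type.

    Each span is (text, class proposed by the NER model). The proposed class
    is ignored: the returned class is derived from the TUI found in the lookup.
    """
    kept: List[Tuple[str, str]] = []
    for text, _ner_class in spans:
        tui = lookup.get(text.lower())
        if tui is None:
            continue  # no match in the knowledge base: discard the span
        entity_class = TUI_TO_CLASS.get(tui)
        if entity_class is None:
            continue  # matched, but the semantic type is not relevant: discard
        kept.append((text, entity_class))
    return kept


if __name__ == "__main__":
    predicted = [("Paracetamol", "DISEASE"), ("biopsy", "PROCEDURE"),
                 ("hospital", "PROCEDURE"), ("xyz", "DRUG")]
    print(link_entities(predicted, TOY_UMLS))
```

In this toy run the span "Paracetamol", proposed by the disease model, is relabelled as a drug, while "hospital" and "xyz" are dropped. This shows how a single linking step can both remove false entities and correct the class of a true one.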
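The partition sizes of the "Aggregated" dataset in Table 2 are consistent with the two-stage random split described in Section 3.4. A minimal sketch of such a split follows, assuming a plain shuffle, a fixed seed, and rounding to the nearest integer; none of these details are stated in the text, and `split_sentences` is a hypothetical helper.

```python
import random
from typing import List, Sequence, Tuple


def split_sentences(sentences: Sequence[str],
                    test_frac: float = 0.2,
                    val_frac: float = 0.2,
                    seed: int = 0) -> Tuple[List[str], List[str], List[str]]:
    """Split sentences into train/validation/test in two stages.

    Stage 1 sets aside `test_frac` of all sentences for testing.
    Stage 2 takes `val_frac` of the remainder for validation.
    """
    items = list(sentences)
    random.Random(seed).shuffle(items)  # assumed: simple seeded shuffle

    n_test = round(len(items) * test_frac)
    test, rest = items[:n_test], items[n_test:]

    n_val = round(len(rest) * val_frac)
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test


if __name__ == "__main__":
    corpus = [f"sentence {i}" for i in range(6592)]  # placeholder sentences
    train, val, test = split_sentences(corpus)
    print(len(train), len(val), len(test))
```

With 6,592 sentences this yields 4,219 training, 1,055 validation, and 1,318 test sentences, matching the "Sent" column of the Aggregated dataset in Table 2.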

Figure 2: Graphical representation of the proposed pipeline.

5.1 Evaluation

For the evaluation and comparison of the models, we employed the strict and relaxed Precision, Recall, and F1 metrics. Whereas the strict metrics only count exact matches, the relaxed metrics also count overlaps between the predictions and the annotations as correct. Therefore, by presenting both we are able to attain a more comprehensive understanding of the model’s effectiveness. For the dataset, we used the corpus described in Section 3. To provide a detailed overview of the model’s effectiveness, besides presenting the result on the Aggregated dataset (the dataset that contains the annotation of all the entity types) we also analyze the results of the model on the Procedures, Drugs, and Diseases datasets. The train, validation, and test partitions are presented in Table 2.

5.2 Models

The three models we use in our experiments are the following:

UMLS. For the baseline model we used UMLS as introduced in Section 3. That is, we used the QuickUMLS tool to search for concepts belonging to the semantic types that were deemed relevant for each entity type (see Table 1). This is a simple baseline that can be easily reproduced and therefore provides reference metrics that can serve as a base for comparison in future research.

NER. The joint NER task is approached by aggregating the predictions of three separate, but structurally identical, NER models: one for identifying mentions of procedures, another for diseases, and a third for drugs. This serves to evaluate the effectiveness of the NER models without doing entity linking with UMLS. As we use three different models, one token may be classified with two entity types. That is, a token can be classified as B-DRUG and B-DISEASE. In that case, we check the raw values output by the NER models and keep the entity with the highest value.

NER & UMLS. Lastly, NER & UMLS is the model described in Section 4, which performs entity linking on the entities predicted by the NER models.

5.3 Implementation Details

The pipeline was implemented in Python. In particular, we used the QuickUMLS toolkit to implement the UMLS entity linking, and Spark NLP to train and implement the NER models. All the experiments were run on a server with an AMD EPYC 7351P processor and an NVIDIA GeForce GTX 1080 Ti GPU.

5.4 Training

As stated in Section 4, three NER models were trained for the identification of procedures, drugs, and diseases. Before training, we randomly sampled a set of 20 hyperparameter configurations to train each of the models. The hyperparameters that were tuned are: the dropout rate, which was sampled from {0.0, 0.1, 0.2, 0.3, 0.4, 0.5}; the batch size, which was sampled from {4, 8, 16, 32, 64, 128}; and the learning rate, which was sampled from {0.001, 0.003, 0.005, 0.01, 0.03, 0.05, 0.1}. For each hyperparameter configuration, we trained the model for 30 epochs and stored the weights of the epoch in which the model had the best micro F1 score on the validation set. After training the 20 models for each entity type, we kept the model with the best validation F1 as our final model.

5.5 Results

The results of the final models on the test set of the datasets are presented in Table 3. In a general overview of the effectiveness of the three models, one can observe that the NER & UMLS model presents the best strict and relaxed F1 in all the datasets except for Diseases, where the UMLS model presents the best results. However, note that while the strict F1 of the NER & UMLS model for the identification of Diseases is 23.9 percentage points worse than that of the UMLS model, when we compare the relaxed F1 score the difference is only 3.7 percentage points. Upon close inspection, we found that UMLS has a tendency to annotate the entities along with their attributes, for example “benign prostatic hyperplasia”, while the NER & UMLS models focus more on the key words; in the previous example, they only identified “hyperplasia” as a disease¹⁰. Another aspect that might be hurting the NER performance is the wide variety of concepts fitted under the “Diseases” category (see Table 1). Further segmenting this category might yield better results; however, further studies would be required to validate this hypothesis. Apart from that, we can see that the NER & UMLS

¹⁰ Please note that we present the examples in English for the reader’s convenience. However, these excerpts have been translated from European Portuguese.

Table 3: Results of evaluating the three models on the test partition of the datasets. The best strict and relaxed F1 in each row are marked with * and †, respectively.

            UMLS                                      NER                                       NER & UMLS
            Strict               Relaxed              Strict               Relaxed              Strict               Relaxed
            P      R      F1     P      R      F1     P      R      F1     P      R      F1     P      R      F1     P      R      F1
Procedures  65.4   96.7   78.0   68.5   99.9   81.3   75.2   80.8   77.9   81.0   87.0   83.9   95.0   83.1   88.6*  98.2   85.6   91.4†
Drugs       75.8   92.5   83.3   76.2   93.0   83.7   74.5   95.2   83.6   74.5   95.2   83.6   98.6   91.6   95.0*  98.6   91.6   95.0†
Diseases    67.6   97.2   79.7*  70.1   99.8   82.4†  57.4   46.0   51.1   87.1   69.7   77.4   67.5   47.5   55.8   95.7   66.8   78.7
Aggregated  44.5   97.0   61.0   47.3   99.7   64.2   62.3   68.0   65.0   75.7   81.8   78.6   79.8   67.3   73.0*  92.3   77.4   84.2†

model presents a strict F1 score of 88.6% and 95.0% on the identification of Procedures and Drugs, respectively.

Another interesting remark that can be made upon inspection of Table 3 is the fact that the NER models greatly benefit from the entity linking with UMLS. This is mostly due to the removal of false entities that were identified by the NER models but do not find any match in UMLS, which results in a major boost in precision. However, by comparing the NER and the NER & UMLS models on the Procedures and Diseases datasets, one can see that the linking with UMLS also increased the strict recall of the NER models. This effect is due to the fact that in the NER & UMLS model the entity type is defined by the semantic type of the entity and not by the model that identified the entity. Therefore, linking with the UMLS is also useful to classify the entity type. The same effect was not observed in the "Drugs" dataset, where the recall got worse after linking with UMLS. However, this dataset is also the smallest, so further studies would be necessary to have a better understanding of this phenomenon.

6 CONCLUSION

This work describes the methodology employed to achieve effective results for biomedical entity identification in oncology texts written in European Portuguese when no initial annotated corpus was available. That includes the annotation of a clinical corpus with the UMLS metathesaurus, the data pipeline created, and the training of three NER models for the identification of procedures, drugs, and diseases. Even though the methodology was employed for European Portuguese texts, it can be adapted to other languages with comparable resources. More precisely, the span and applicability of this methodology are dependent on three main aspects, namely:

(much broader than those that developed the approach) which will not only provide a qualitative evaluation of the pipeline but also more fine-grained feedback that will allow a quantitative study.

A path we are keen to explore is the impact on effectiveness of swapping the general-purpose BERTimbau [35] model for the BioBERTpt [32] model, or even training/fine-tuning BERT on European Portuguese texts, as it has been shown to yield better results [30, 36]. This seems a natural route to further improve the performance of the system.

REFERENCES

[1] Mohamed AlShuweihi, Said A. Salloum, and Khaled F. Shaalan. 2021. Biomedical Corpora and Natural Language Processing on Clinical Text in Languages Other Than English: A Systematic Review. In Recent Advances in Intelligent Systems and Smart Applications.
[2] Jean Emmanuel Bibault, Philippe Giraud, and Anita Burgun. 2016. Big Data and machine learning in radiation oncology: State of the art and future prospects. Cancer Letters 382, 1 (2016), 110–117. https://doi.org/10.1016/j.canlet.2016.05.033
[3] Olivier Bodenreider. 2004. The unified medical language system (UMLS): integrating biomedical terminology. Nucleic acids research 32, suppl_1 (2004), D267–D270.
[4] Mary Regina Boland, Lena M. Davidson, Silvia P. Canelón, Jessica Meeker, Trevor Penning, John H. Holmes, and Jason H. Moore. 2021. Harnessing electronic health records to study emerging environmental disasters: a proof of concept with perfluoroalkyl substances (PFAS). npj Digital Medicine 4, 1 (aug 2021), 1–10. https://doi.org/10.1038/s41746-021-00494-5
[5] Selen Bozkurt, Rohan Paul, Jean Coquet, Ran Sun, Imon Banerjee, James D. Brooks, and Tina Hernandez-Boussard. 2020. Phenotyping severity of patient-centered outcomes using clinical notes: A prostate cancer use case. Learning Health Systems 4, 4 (oct 2020). https://doi.org/10.1002/lrh2.10237
[6] David Campos, Sérgio Matos, and José Luís Oliveira. 2012. Biomedical Named Entity Recognition: A Survey of Machine-Learning Tools.
[7] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. ArXiv abs/1810.04805 (2019).
[8] Shadi Ebrahimian, Mannudeep K. Kalra, Sheela Agarwal, Bernardo Canedo Bizzo,

397 the availability of the UMLS metathesaurus for the language;11 Mona Elkholy, Christoph Wald, Bibb Allen, and Keith J. Dreyer. 2021. FDA- 443
regulated AI Algorithms: Trends, Strengths, and Gaps of Validation Studies. 444
398 the availability and effectiveness of a large language model for Academic radiology (2021). 445
399 the language; and the accessibility to clinical texts. Given these [9] Guy Fagherazzi. 2020. Deep Digital Phenotyping and Digital Twins for Precision 446

400 three components, the reader should be able to extrapolate this Health: Time to Dig Deeper. Journal of medical Internet research 22, 3 (2020), 447
e16770. https://doi.org/10.2196/16770 448
401 methodology to his/her project. [10] John Giorgi and Gary D Bader. 2018. Transfer learning for biomedical named 449
entity recognition with neural networks. Bioinformatics 34 (2018), 4087 – 4094. 450

402 7 FUTURE RESEARCH [11] Mark L. Graber, Colene Byrne, and Doug Johnston. 2017. The impact of elec- 451
tronic health records on diagnosis. , 211–223 pages. https://doi.org/10.1515/ 452
403 The presented models are part of a bigger pipeline that ends up dx-2017-0012 453
[12] Yu Gu, Robert Tinn, Hao Cheng, Michael Lucas, Naoto Usuyama, Xiaodong
404 feeding a dashboard that summarizes the patient therapeutic path. Liu, Tristan Naumann, Jianfeng Gao, and Hoifung Poon. 2021. Domain-Specific
454
455
405 Arrangements are being made so that a production test is conducted Language Model Pretraining for Biomedical Natural Language Processing. ACM 456

406 for this dashboard on the IPO emergency service. This test will Transactions on Computing for Healthcare (HEALTH) 3, 1 (oct 2021). https: 457
//doi.org/10.1145/3458754 arXiv:2007.15779 458
407 expose the pipeline predictions to a panel of oncology specialists [13] Maryam Habibi, Leon Weber, Mariana L. Neves, D. Wiegandt, and Ulf Leser. 459

11 See 2017. Deep learning with word embeddings improves biomedical named entity 460
https://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/ recognition. Bioinformatics 33 (2017), i37 – i48. 461
release/abbreviations.html#LAT for the complete list of languages supported by UMLS.
Biomedical Entity Extraction Pipeline for Oncology SAC’23, March 27 –April 2, 2023, Tallinn, Estonia

[14] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9 (1997), 1735–1780.
[15] Peter B. Jensen, Lars J. Jensen, and Søren Brunak. 2012. Mining electronic health records: Towards better research applications and clinical care. 395–405 pages. https://doi.org/10.1038/nrg3208
[16] Yasmin H. Karimi, Douglas W. Blayney, Allison W. Kurian, Jeanne Shen, Rikiya Yamashita, Daniel Rubin, and Imon Banerjee. 2021. Development and Use of Natural Language Processing for Identification of Distant Cancer Recurrence and Sites of Distant Recurrence Using Unstructured Electronic Health Record Data. JCO Clinical Cancer Informatics 5, 5 (dec 2021), 469–478. https://doi.org/10.1200/cci.20.00165
[17] Kenneth L. Kehl, Stefan Groha, Eva M. Lepisto, Haitham Elmarakeby, James Lindsay, Alexander Gusev, Eliezer M. Van Allen, Michael J. Hassett, and Deborah Schrag. 2021. Clinical Inflection Point Detection on the Basis of EHR Data to Identify Clinical Trial–Ready Patients With Cancer. JCO Clinical Cancer Informatics 5, 5 (dec 2021), 622–630. https://doi.org/10.1200/cci.20.00184
[18] John D. Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In ICML.
[19] Yann LeCun, Bernhard E. Boser, John S. Denker, Donnie Henderson, Richard E. Howard, Wayne E. Hubbard, and Lawrence D. Jackel. 1989. Backpropagation Applied to Handwritten Zip Code Recognition. Neural Computation 1 (1989), 541–551.
[20] Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2020. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36 (2020), 1234–1240.
[21] Scott H. Lee. 2018. Natural language generation for electronic health records. npj Digital Medicine 1, 1 (nov 2018), 1–7. https://doi.org/10.1038/s41746-018-0070-0 arXiv:1806.01353
[22] Ulf Leser and Jörg Hakenberg. 2005. What makes a gene name? Named entity recognition in the biomedical literature. Briefings in bioinformatics 6, 4 (2005), 357–69.
[23] Irene Z Li, Michihiro Yasunaga, Muhammed Yavuz Nuzumlali, César Caraballo, Shiwani Mahajan, Harlan M. Krumholz, and Dragomir R. Radev. 2019. A Neural Topic-Attention Model for Medical Term Abbreviation Disambiguation. ArXiv abs/1910.14076 (2019).
[24] Ke Liu, Omkar Kulkarni, Martin Witteveen-Lane, Bin Chen, and Dave Chesla. 2022. MetBERT: a generalizable and pre-trained deep learning model for the prediction of metastatic cancer from clinical notes. AMIA ... Annual Symposium proceedings. AMIA Symposium 2022 (2022), 331–338. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9285138/
[25] Xiaoxuan Liu, Livia Faes, Aditya Uday Kale, Siegfried Karl Wagner, Dun Jack Fu, Alice Bruynseels, Thushika Mahendiran, Gabriella Moraes, Mohith Shamdas, Christoph Kern, Joseph R. Ledsam, Martin K. Schmid, Konstantinos Balaskas, Eric J. Topol, Lucas M. Bachmann, Pearse A. Keane, and Alastair K. O. Denniston. 2019. A comparison of deep learning performance against health-care professionals in detecting diseases from medical imaging: a systematic review and meta-analysis. The Lancet. Digital health 1, 6 (2019), e271–e297.
[26] Yue Liu, Tao Ge, Kusum S. Mathews, Heng Ji, and Deborah L. McGuinness. 2015. Exploiting Task-Oriented Resources to Learn Word Embeddings for Clinical Abbreviation Expansion. ArXiv abs/1804.04225 (2015).
[27] Fábio Lopes, César Alexandre Teixeira, and Hugo Gonçalo Oliveira. 2019. Contributions to Clinical Named Entity Recognition in Portuguese. In BioNLP@ACL.
[28] Xuezhe Ma and Eduard H. Hovy. 2016. End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF. ArXiv abs/1603.01354 (2016).
[29] Aurélie Névéol, Hercules Dalianis, Guergana K. Savova, and Pierre Zweigenbaum. 2018. Clinical Natural Language Processing in languages other than English: opportunities and challenges. Journal of Biomedical Semantics 9 (2018).
[30] Denis Newman-Griffis and Ayah Zirikly. 2018. Embedding Transfer for Low-Resource Medical Named Entity Recognition: A Case Study on Patient Mobility. In BioNLP.
[31] Lance A. Ramshaw and Mitchell P. Marcus. 1995. Text Chunking using Transformation-Based Learning. ArXiv cmp-lg/9505040 (1995).
[32] Elisa Terumi Rubel Schneider, João Vitor Andrioli de Souza, Julien Knafou, Lucas E. S. Oliveira, Jenny Copara, Yohan Bonescki Gumiel, Lucas Ferro Antunes de Oliveira, Emerson Cabrera Paraiso, Douglas Teodoro, and Claudia Maria Cabral Moro Barra. 2020. BioBERTpt - A Portuguese Neural Language Model for Clinical Named Entity Recognition. In CLINICALNLP.
[33] Stefano Silvestri, Francesco Gargiulo, and Mario Ciampi. 2022. Iterative Annotation of Biomedical NER Corpora with Deep Neural Networks and Knowledge Bases. Applied Sciences (2022).
[34] Luca Soldaini. 2016. QuickUMLS: a Fast, Unsupervised Approach for Medical Concept Extraction.
[35] Fábio Souza, Rodrigo Nogueira, and Roberto de Alencar Lotufo. 2020. BERTimbau: Pretrained BERT Models for Brazilian Portuguese. In BRACIS.
[36] Inigo Jauregi Unanue, Ehsan Zare Borzeshi, and Massimo Piccardi. 2017. Recurrent neural networks with specialized word embeddings for health-domain named-entity recognition. Journal of biomedical informatics 76 (2017), 102–109.
[37] Ashish Vaswani, Noam M. Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. ArXiv abs/1706.03762 (2017).
[38] Xuan Wang, Yu Zhang, Xiang Ren, Yuhao Zhang, Marinka Zitnik, Jingbo Shang, C. Langlotz, and Jiawei Han. 2019. Cross-type Biomedical Named Entity Recognition with Deep Multi-Task Learning. Bioinformatics 35, 10 (2019), 1745–1752.
[39] Yanshan Wang, Liwei Wang, Majid Rastegar-Mojarad, Sungrim Moon, Feichen Shen, Naveed Afzal, Sijia Liu, Yuqun Zeng, Saeed Mehrabi, Sunghwan Sohn, and Hongfang Liu. 2018. Clinical information extraction applications: A literature review. Journal of biomedical informatics 77 (2018), 34–49.
[40] Wonjin Yoon, Chan Ho So, Jinhyuk Lee, and Jaewoo Kang. 2019. CollaboNet: collaboration of deep neural networks for biomedical named entity recognition. BMC Bioinformatics 20 (2019).
[41] Qiang Zhang, Sheng Zhang, Jianxin Li, Yi Pan, Jing Zhao, Yixing Feng, Yanhui Zhao, Xiaoqing Wang, Zhiming Zheng, Xiangming Yang, Lixia Liu, Chunxin Qin, Ke Zhao, Xiaonan Liu, Caixia Li, Liuyang Zhang, Chunrui Yang, Na Zhuo, Hong Zhang, Jie Liu, Jinglei Gao, Xiaoling Di, Fanbo Meng, Wei Ji, Meng Yang, Xiaojie Xin, Xi Wei, Rui Jin, Lun Zhang, Xudong Wang, Fengju Song, Xiangqian Zheng, Ming Gao, Kexin Chen, and Xiangchun Li. 2022. Improved diagnosis of thyroid cancer aided with deep learning applied to sonographic text reports: a retrospective, multi-cohort, diagnostic study. Cancer Biology and Medicine 19, 5 (may 2022), 733–741. https://doi.org/10.20892/j.issn.2095-3941.2020.0509
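
As a rough illustration of the effect of UMLS linking discussed for Table 3, the following minimal Python sketch shows how discarding unmatched mentions can raise precision, and how taking the entity type from the semantic group rather than from the detecting model can raise recall. It is not the pipeline's implementation: the names `TOY_LEXICON`, `link`, and `precision_recall`, the example mentions, and the exact-match dictionary lookup (standing in for a real UMLS matcher) are all assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

Mention = Tuple[str, str]  # (surface text, entity type)

# Hypothetical stand-in for a UMLS lookup: term -> semantic group.
TOY_LEXICON: Dict[str, str] = {
    "quimioterapia": "Procedures",
    "biópsia": "Procedures",
    "carcinoma": "Diseases",
}

# Hypothetical gold annotations and combined output of the NER models.
GOLD: Set[Mention] = {
    ("quimioterapia", "Procedures"),
    ("biópsia", "Procedures"),
    ("carcinoma", "Diseases"),
}
NER_OUTPUT: List[Mention] = [
    ("quimioterapia", "Procedures"),
    ("doente", "Procedures"),   # false entity: has no match in the lexicon
    ("biópsia", "Diseases"),    # real entity, but typed by the wrong model
    ("carcinoma", "Diseases"),
]


def link(mentions: List[Mention], lexicon: Dict[str, str]) -> List[Mention]:
    """Keep only mentions found in the lexicon and retype them by semantic group."""
    linked = []
    for text, _model_type in mentions:
        group = lexicon.get(text.lower())
        if group is not None:  # unmatched mentions are dropped
            linked.append((text, group))
    return linked


def precision_recall(pred: List[Mention], gold: Set[Mention],
                     entity_type: str) -> Tuple[float, float]:
    """Exact-match precision and recall restricted to one entity type."""
    pred_t = {m for m in pred if m[1] == entity_type}
    gold_t = {m for m in gold if m[1] == entity_type}
    tp = len(pred_t & gold_t)
    precision = tp / len(pred_t) if pred_t else 0.0
    recall = tp / len(gold_t) if gold_t else 0.0
    return precision, recall


before = precision_recall(NER_OUTPUT, GOLD, "Procedures")
after = precision_recall(link(NER_OUTPUT, TOY_LEXICON), GOLD, "Procedures")
print("NER only:  ", before)
print("NER & link:", after)
```

In this toy setting both measures improve for "Procedures": the unmatched mention is removed, and the mention detected under the wrong type is reassigned by its semantic group. A real linker would rely on approximate matching against UMLS, so the size of the effect would depend on the data, as the "Drugs" results suggest.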
