Narayana Darapaneni, Abhishek S R¹
¹ Corresponding author
Contents

1 Introduction
1.1 Background
1.2 Introduction and Literature Survey
1.3 Related Work
1.3.1 Natural Image Captioning

2 Foundations
2.1 Foundations
2.1.1 Convolutional Neural Network (CNN)
2.1.2 Exploratory Data Analysis
2.1.3 Reports
2.1.4 Sample Images along with Captions
2.1.5 Word-cloud of all the impression values
2.1.6 Top 20 most frequently occurring values for the impression column
2.1.7 Performance metrics and business constraints

9 Conclusion
10 Future Scope
List of Figures

2.1
2.2 Neural network architecture with highlighted sizes and the units of each layer.
2.3 Three sample images (a), (b), (c) from the dataset: chest X-rays taken in frontal and lateral views.
2.4 A sample report, stored in XML format; the comparison, indication, findings, and impression parts are extracted using regex.
2.5
2.6
2.7
2.8
2.9 Findings are the observations from the X-ray, while the impression is the inference obtained; this case study predicts the impression part of the medical report given the two images.
2.10
2.11
3.1
3.2
3.3 Attention-guided chained context aggregation for image segmentation.
3.4 Global Flow
4.1 Global Flow
4.2
4.3
Chapter 1
Introduction
1.1 Background
A typical report written by radiologists consists of the patient history, the reason for the examination, and a summary of the findings. This project is beneficial for quick analysis of the medical condition and for providing timely treatment. The main challenge in this project is achieving clinical accuracy. Such a system plays a vital role in addressing medical needs in resource-limited countries that have an insufficient number of radiologists and radiology training programs.
Automated generation of radiological reports for different imaging modalities is needed to smooth the clinical workflow and alleviate radiologists' workload. It involves combining image processing techniques for medical image interpretation with language generation techniques for report writing. Kaur and Mittal[3] present a system that generates clinically accurate reports from chest X-ray images.
An increasing number of chest X-ray (CXR) examinations in radiodiagnosis departments burdens radiologists and makes the timely generation of accurate radiological reports highly challenging. An automatic radiological report generation (ARRG) system is envisaged to generate reports with minimal human intervention, easing radiologists' workload and smoothing the clinical workflow. The success of an ARRG system depends on two critical factors: i) the quality of the features the system extracts from the CXR images, and ii) the quality of the linguistic expression it generates to describe the normalities and abnormalities indicated by those features. Most existing ARRG systems fail on the latter factor and do not generate clinically acceptable reports because they ignore the contextual importance of medical terms[4].
Also, CNNs incorporate pooling layers, which aggregate pixel values using a permutation-invariant function (a max or mean operation). This reduces the number of parameters in the network by inducing translation invariance. At the end of the convolutional stream of the network, fully-connected layers (i.e., regular neural network layers), where weights are no longer shared, are usually added.
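To make the pattern concrete, here is a minimal Keras sketch of such a network; the layer sizes are arbitrary and chosen only for illustration:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Minimal CNN: convolutional stream with shared weights, permutation-invariant
# pooling, then fully-connected layers where weights are no longer shared.
model = models.Sequential([
    tf.keras.Input(shape=(224, 224, 3)),    # W x H x C input
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(2),                  # max: permutation-invariant aggregation
    layers.Conv2D(64, 3, activation="relu"),
    layers.GlobalAveragePooling2D(),         # mean over all spatial positions
    layers.Dense(128, activation="relu"),    # fully-connected head
    layers.Dense(14, activation="sigmoid"),  # e.g. 14 disease labels, as in CheXNet
])
model.summary()
```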
Varoquaux and Cheplygina[6] explore in their article avenues to improve the clinical impact of machine learning in medical imaging. The authors explain that medical datasets are typically small, on the order of hundreds or thousands of samples. Moreover, in medical imaging the dataset size refers to the number of subjects, and a subject may have multiple images, for example taken at different points in time.
Feature maps in a CNN have shape W×H×C, where W and H denote the spatial dimensions (width and height) and C denotes the channel dimension (depth, or the number of feature maps).
The architectures used for the visual and language components are:
CNN architectures for chest X-rays: DenseNet; ResNet; VGG; InceptionV3; GoogLeNet.
Language component architectures: GRU; LSTM; LSTM with attention; hierarchical LSTM with attention; hierarchical sentence LSTM + dual word LSTM (normal/abnormal); recurrent BiLSTM-attention-LSTM; partial report encoding + FC layer (next word); Transformer; hybrid template retrieval + generation/editing.
According to the literature survey, DenseNet, ResNet, and VGG are the most used visual backbones, and LSTM is the most used language model. The LSTM receives an encoding vector from the visual component at the beginning, and the full report is decoded from it. This encoding vector is typically a vector of global features output by the CNN. However, a plain LSTM is suitable only for short reports; for unstructured multi-sentence reports, a hierarchical LSTM with attention can be used.
Evaluation metrics for the reports fall into three categories: text quality, medical correctness/accuracy, and explainability.
Text quality: BLEU; ROUGE-L; METEOR; CIDEr.
Medical correctness/accuracy: MIRQI; MeSH; keyword accuracy; ROC-AUC.
The authors conclude that MIRQI seems the most promising approach for assessing medical accuracy, as it captures more data from the reports, while the text-quality metrics can be used for fluency, grammar, etc.
The authors also describe challenges that need attention:
(1) Expert evaluation: the model and the report generation system should be tested by board-certified medical experts, whose feedback carries immense value for improving the model.
(2) Explainability.
Yu et al.[10] quantitatively examined the correlation between automated metrics and radiologists' scoring of reports. They analyzed the failure modes of the metrics, namely the types of information the metrics do not capture, to understand when to choose particular metrics and how to interpret metric scores. Their new automatic metric, RadGraph F1, computes the overlap in clinical entities and relations between a machine-generated report and a radiologist-generated report. Their work also proposes a composite metric called RadCliQ, which ranks the quality of reports similarly to radiologists.
A quantitative investigation of the alignment between automated metrics and radiologists was conducted: scores were computed for automated metrics on radiology reports from the MIMIC-CXR dataset, and candidate reports were selected from the test reports by a metric oracle. Metric-oracle reports were constructed for BLEU, BERTScore, CheXbert vector similarity (s_emb), and the novel metric RadGraph F1. BLEU computes n-gram overlap and is representative of the family of text-overlap-based natural language generation metrics such as CIDEr, METEOR, and ROUGE. BERTScore has been proposed to capture contextual similarity beyond exact textual matches. CheXbert vector similarity and RadGraph F1 are metrics designed to measure the correctness of clinical information.
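As a rough illustration of the idea behind an entity/relation-overlap F1 (the real RadGraph F1 extracts entities and relations with a trained information-extraction model; here they are assumed to be given), the score can be computed as:

```python
def overlap_f1(pred_items: set, ref_items: set) -> float:
    """F1 of the overlap between two sets of clinical entities/relations.

    Simplified sketch: assumes entity/relation extraction has already been done.
    """
    if not pred_items and not ref_items:
        return 1.0
    tp = len(pred_items & ref_items)
    precision = tp / len(pred_items) if pred_items else 0.0
    recall = tp / len(ref_items) if ref_items else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical entities from a generated and a reference report:
pred = {("effusion", "absent"), ("heart size", "normal")}
ref = {("effusion", "absent"), ("pneumothorax", "absent")}
print(overlap_f1(pred, ref))  # 0.5
```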
The radiologist evaluation study was conducted with the support of six board-certified radiologists, who scored the number of errors that the various metric-oracle reports make compared to the test report. Radiologists categorized errors as significant or insignificant, and subtyped every error into the following six categories:
(1) false prediction of a finding (i.e., false positive);
(2) omission of a finding (i.e., false negative);
(3) incorrect location/position of a finding;
(4) incorrect severity of a finding;
(5) mention of a comparison that is not present in the reference impression;
(6) omission of a comparison describing a change from a previous study.
Fifty random studies were sampled from the MIMIC-CXR test set.
The Kendall rank correlation coefficient (tau-b) was used to relate metric scores and radiologist-reported errors. It was found that BERTScore and RadGraph F1 were the metrics with the two highest alignments with radiologists. Specifically, BERTScore has a tau value of 0.500 [95% CI 0.497, 0.503] for the total number of errors and 0.496 [95% CI 0.493, 0.498] for significant errors. RadGraph has a tau value of 0.463 [95% CI 0.460, 0.465] for the total number of errors and 0.459 [95% CI 0.456, 0.461] for significant errors. BLEU is the third-best metric under this evaluation, with a tau value of 0.459 [95% CI 0.456, 0.462] for the total number of errors and 0.445 [95% CI 0.442, 0.448] for significant errors. Lastly, CheXbert vector similarity had the worst alignment, with a tau value of 0.457 [95% CI 0.454, 0.459] for total errors and 0.418 [95% CI 0.416, 0.421] for significant errors. From these results, BERTScore, RadGraph, and BLEU are the metrics with the closest alignment to radiologists; CheXbert aligns with radiologists but is less concordant than the previously mentioned metrics.
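For reference, a toy sketch of computing such an alignment with SciPy's Kendall tau-b (the numbers below are illustrative, not the study's data):

```python
from scipy.stats import kendalltau

# Toy data: one metric score per report and the radiologist-reported error count.
# Errors are negated so that a positive tau means "higher score, fewer errors".
metric_scores = [0.91, 0.85, 0.78, 0.66, 0.54, 0.43]
error_counts = [0, 1, 1, 2, 3, 5]

tau, p_value = kendalltau(metric_scores, [-e for e in error_counts], variant="b")
print(f"tau-b = {tau:.3f}, p = {p_value:.3f}")
```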
Also, BLEU exhibited a prominent failure mode in identifying false predictions of findings in reports. Metric-oracle reports with respect to BLEU produced more false predictions of findings than BERTScore and RadGraph in terms of both the total number of errors (0.807 average errors per report versus 0.477 and 0.427 for BERTScore and RadGraph) and the number of significant errors.
Amjoud et al.[14] have combined the transfer learning technique and the transformer approach to generate reports. Specifically, a convolutional neural network pre-trained on the ImageNet database is used for feature extraction, and a modified transformer encoder-decoder is used for sequence generation. Additionally, the accuracy of these models is evaluated using traditional evaluation metrics such as accuracy, precision, and recall.
We know that medical imaging (e.g., chest X-rays) is widely used in medicine for diagnostic purposes. However, current clinical practice requires a radiologist with specialized training to manually evaluate X-rays and note their findings in a radiology report. This manual evaluation is time-consuming, and an automated solution for this task would help streamline the clinical workflow and improve the quality of care (Lovelace et al.[15]).
Rajpurkar et al.[2] explain that image captioning is the process of generating text (captions) from an image; it uses both natural language processing (NLP) and computer vision (CV) to generate the text output. X-rays are a form of electromagnetic radiation used for medical imaging; they can be used to spot fractures, bone injuries, or tumors. Analysis of X-ray reports is a very important task for radiologists and pathologists in recommending the correct diagnosis to patients.
Medical reports (Nguyen et al.[16]) are the primary medium through which physicians communicate findings and diagnoses from patients' medical scans. The process is usually laborious: typing out a medical report takes five to ten minutes on average, and it can be error-prone. This has led to a surging need for the automated generation of medical reports, to assist radiologists and physicians in making rapid and meaningful diagnoses. The potential efficiency gains and benefits could be enormous, especially during critical situations such as the COVID-19 pandemic or a similar one. Clearly, a successful medical report generation process is expected to possess two key properties:
1) clinical accuracy, to properly and correctly describe the disease and related symptoms;
2) language fluency, to produce realistic and human-readable text.
Writing medical reports manually from medical images is a time-consuming task for radiologists (Nishino et al.[17]). To write reports, radiologists first recognize which findings are present in medical images such as computed tomography (CT) and X-ray images. Then they compose reports that describe the recognized findings correctly and without omission. Doctors prefer radiology reports written in natural language.
Medical images, e.g., radiology and pathology images, and their corresponding reports, which describe in detail the observations in both normal and abnormal regions, are widely used for diagnosis and treatment (Ma et al.).
…evidence-based medicine. This makes deep learning and big data algorithms especially useful in this field, as they can identify some radiological signs that medical staff cannot detect. Although this manuscript focuses on classification problems, there are papers where these algorithms are used for regression problems, such as estimating the dose of a drug or generating medical reports from clinical tests, or for image processing tasks such as image segmentation and image reconstruction.
The COVID-19 pandemic has had a strong impact on research into the application of machine learning and deep learning in medical image analysis. As expected, many of the classification systems investigated have focused on detecting signs of bilateral COVID-19-associated pneumonia. The explainability of deep learning models is a fundamental factor to be taken into account in their application: these models are black-box algorithms and need explainable AI techniques to make them more trustworthy.
Although most medical datasets have two classes (samples of a particular pathology and healthy samples), in chest X-rays it is common to find signs of more than one pathology. For this reason, in the last five years different authors have published multilabel radiological datasets. These datasets are closer to real situations than binary ones, with the additional challenge of class imbalance: the size of each class in a realistic dataset should depend on the incidence of the pathology in society, i.e., some classes are more represented than others. The characteristics of these datasets are interesting and need to be analyzed in detail in order to address the problem adequately.
Conventional evaluation metrics for text generated from images include BLEU, ROUGE, and METEOR (Babar et al.[22]). To date, these are the most widely used tools in machine translation and summarization for evaluating how good a generated text is compared to the ground truth. Such metrics measure goodness in a generic context, independent of the application domain: they score a generated report only in terms of how well it matches the corresponding ground-truth report. As such, they can miss important clinical information contained in the generated report. For example, “Heart is normal” and “Heart is not normal” are syntactically close but semantically very different sentences. Therefore, we propose to assess the clinical quality of radiology reports using an external source of information (shown in the figure below). We assume an external source of ground-truth knowledge in the form of a set of diagnostic tags for each report. These diagnostic tags are external knowledge because they are not used for training the report generator model. By considering the tags as class labels associated with a report, we build a probabilistic model (on the training data) that predicts the class labels of a report. The probabilistic model is applied to the generated reports of the test set, and its performance is used as a quantitative estimate of the diagnostic quality of the generated reports (ref. fig 1.3). A sketch of such a tag classifier follows the figure.
Figure 1.3
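A minimal sketch of such a tag-prediction model, assuming reports are plain strings and tags form a binary label matrix; the classifier choice here is illustrative, not necessarily the authors' exact setup:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline

# Toy training reports and their (external) diagnostic tags.
train_reports = [
    "heart is normal", "heart is not normal",
    "no effusion seen", "large pleural effusion",
]
train_tags = np.array([[0, 0], [1, 0], [0, 0], [0, 1]])  # columns: cardiomegaly, effusion

# One binary classifier per tag over TF-IDF features of the report text.
clf = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LogisticRegression()))
clf.fit(train_reports, train_tags)

# Applied to generated test reports; tag-prediction performance on these
# serves as a quantitative estimate of their diagnostic quality.
generated = ["heart is not normal", "no pleural effusion"]
print(clf.predict(generated))
```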
A related line of work aimed to develop a method that takes a radiology report as input and generates a set of X-ray images that manifest the clinical findings narrated in the report. For simplicity, two views of images are assumed: a frontal view and a lateral view, each with one image; it is straightforward to extend the method to more views (each with more than one image).
Content-based image retrieval (CBIR) is a technique for knowledge discovery in massive databases and offers the possibility to identify similar case histories, understand rare disorders, and, ultimately, improve patient care (Litjens et al.[5]). The major challenge in the development of CBIR methods is extracting effective feature representations from pixel-level information and associating them with meaningful concepts. The ability of deep CNN models to learn rich features at multiple levels of abstraction has elicited interest from the CBIR community.
In image-to-text radiology report generation (Miura et al.[24], Wang et al.[25]), Jing et al.[20] first proposed multi-task learning models that jointly generate a report and classify disease labels from a chest X-ray image. Their models were extended to use multiple images, to adopt a hybrid retrieval-generation model, or to consider structure information. They conclude that more recent work has focused on generating reports that are clinically consistent and accurate; Liu et al.[19] presented a system that generates accurate reports through clinically-oriented fine-tuning.
The most popular task related to ours (Chen et al.[26]) is image captioning, a cross-modal task involving natural language processing and computer vision that aims to describe images in sentences. Among these studies, the most related, from Cornia et al., also proposed leveraging memory matrices to learn prior knowledge for visual features using memory networks, but such an operation is performed only during the encoding process. Unlike that work, the memory in our model is designed to align the visual and textual features, and the memory operations (i.e., querying and responding) are performed in both the encoding and decoding processes.
(Zhang et al.[27]) Most of the work on image-based captioning is based on the classical CNN + LSTM structure, which aims to generate realistic sentences or topic-relevant paragraphs that summarize the visual content of an image. With the development of computer vision and natural language processing technology, many works combine radiology images and free text to automatically generate radiology reports and help clinical radiologists make a quick and meaningful diagnosis. Radiology report generation takes X-ray images as input and generates descriptive reports to support the inference of better diagnostic conclusions beyond disease labels. Many radiology report generation methods follow the practice of image captioning models: for example, adopting an encoder-decoder architecture with a hierarchical generator and attention mechanism to generate long reports, or fusing the visual features with the semantic features of the previous sentence through an attention mechanism, using the fused features to generate the next sentence, and then generating the whole report in a loop.
(Srinivasan et al.[28]) The success in medical image captioning has been possible due to the latest advances in deep learning. DenseNet, being a densely connected convolutional network, enables learning high-order dependencies by using a large number of layers with a minimal number of parameters, allowing architectures to understand complex images like X-rays without overfitting. Xception proposed the depthwise separable convolution operation, which in turn extracts efficient image features with a decreased number of parameters in the model.
Chapter 2
Foundations
2.1 Foundations
Chempolil[30] in his article framed the problem as an image captioning task. For this dataset, we are given a couple of images per patient and a report in XML format; the image names and the patient information are contained within the XML file. We extracted this data from the XML files using regex and transformed it into a data frame. Then we can use pre-trained models to extract information from the images, and using this information we can generate captions with LSTMs or GRUs. BLEU stands for Bilingual Evaluation Understudy; here we will use the BLEU score as the metric. The BLEU score compares each word in the predicted sentence to the reference sentence (this is also done with n-grams) and returns a score based on how many predicted words appear in the original sentence. BLEU is not an ideal metric for comparing translation performance, since similar words with the same meaning are penalized. So we will not only use the n-gram BLEU score but also take a sample of the predicted captions and compare them to the original reference captions manually. BLEU scores range from 0 to 1; for example, the predicted text “The dog is jumping” exactly matches the reference text “The dog is jumping”.
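A minimal sketch of computing a sentence-level BLEU score with NLTK (the smoothing choice here is an assumption, not specified above):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "The dog is jumping".lower().split()
predicted = "The dog is jumping".lower().split()

# Up-to-4-gram BLEU with uniform weights; smoothing avoids zero scores on
# short sentences that lack higher-order n-gram matches.
smooth = SmoothingFunction().method1
score = sentence_bleu(
    [reference], predicted,
    weights=(0.25, 0.25, 0.25, 0.25),
    smoothing_function=smooth,
)
print(score)  # 1.0 for an exact match
```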
The dataset was extracted from the Indiana University Chest X-Rays collection (Rajpurkar et al.[2]). In this particular case study, we are given a set of images (chest X-rays) and the impressions radiologists inferred from those images. We have to develop and train a model that generates impressions from the chest X-rays provided in image format, together with the findings and some other data provided in XML format. This type of model can save radiologists a lot of time in analyzing the images and writing impressions. It can also be used as a base model to validate some decisions, since the cost of mistakes is very high in the medical domain, unlike other domains.
Here we have used the same publicly available dataset from Indiana University, which consists of chest X-ray images and reports (in XML format) containing information regarding the comparison, indication, findings, and impression of the X-ray. The goal of this case study is to predict the impression of the medical report attached to the images.
Figure 2.1
The model above (ref. fig 2.1) outputs, for each candidate word, the probability of it being the next word in the sentence. Here the greedy search method is used (pick the word with the highest probability of coming next). The model was trained for 30 epochs with an initial learning rate of 0.001 and 3 pictures per batch (the batch size); after 20 epochs, the learning rate was reduced to 0.0001. The loss used is categorical cross-entropy, and the optimizer is Adam.
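A sketch of this training setup in Keras, assuming `model` and a batched `train_dataset` pipeline (3 pictures per batch) are defined elsewhere:

```python
import tensorflow as tf

def lr_schedule(epoch, lr):
    # Initial learning rate of 0.001, dropped to 0.0001 after 20 epochs.
    return 1e-3 if epoch < 20 else 1e-4

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    loss="categorical_crossentropy",
)
model.fit(
    train_dataset,  # assumed tf.data pipeline yielding (inputs, targets), batch size 3
    epochs=30,
    callbacks=[tf.keras.callbacks.LearningRateScheduler(lr_schedule)],
)
```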
Figure 2.2: Neural network architecture with highlighted sizes and the units of each layer.
Figure 2.3: Three sample images, (a), (b), and (c), from the dataset. These are chest X-rays taken in frontal and lateral views.
2.1.3 Reports
Figure 2.4: A sample report, stored in XML format. We extract the comparison, indication, findings, and impression parts of the report; this was done using regex.
Figure 2.5
These are the two image files associated with this report. Next, we will check the maximum and minimum numbers of images that can be associated with a report.
Figure 2.6
We can see that the maximum number of images associated with a report is 5, while the minimum is 0.
We will extract all the information parts, i.e., the comparison, indication, findings, and impression of the report, and the corresponding two images (we take two images as input, since two is the most frequent number of images associated with a report), into a data frame keyed by the XML report file name, using regex and string manipulation. For reports with more than two images, we create new data points with the additional images and the same report information. A sketch of this extraction step is shown below.
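A hedged sketch of this extraction step; the tag names follow the Indiana University XML format as described, and file paths are illustrative:

```python
import re
import pandas as pd

def parse_report(xml_text: str) -> dict:
    """Pull comparison/indication/findings/impression out of one IU XML report.

    Sketch using regex, as in the text; the AbstractText/Label tag names follow
    the Indiana University report format and may need adjusting.
    """
    fields = {}
    for label in ["COMPARISON", "INDICATION", "FINDINGS", "IMPRESSION"]:
        m = re.search(
            rf'<AbstractText Label="{label}">(.*?)</AbstractText>', xml_text, re.S
        )
        fields[label.lower()] = m.group(1).strip() if m else None
    # Image ids are listed in <parentImage id="..."> elements.
    fields["images"] = re.findall(r'<parentImage id="(.*?)"', xml_text)
    return fields

# Hypothetical usage over a folder of reports:
# rows = [parse_report(open(p).read()) | {"xml_file": p} for p in xml_paths]
# df = pd.DataFrame(rows)
```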
Figure 2.7
It was found that there are missing values in the data frame. All data points with a null image 1 or impression value were removed from the data frame. Then all the missing values in image 2 were filled with the same file path as image 1.
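In pandas, this cleaning step might look like the following (the column names are assumptions):

```python
import pandas as pd

# Assumed columns: image_1, image_2, impression.
df = df.dropna(subset=["image_1", "impression"])     # drop unusable rows
df["image_2"] = df["image_2"].fillna(df["image_1"])  # reuse image 1 where image 2 is missing
```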
Figure 2.8
We can see from the above diagram that 420 is the most common height for image 1, while 624 is the most common height for image 2. The width of both images takes only one unique value across all data points: 512. Since pre-trained models are designed for square images, we choose 224×224 as the target size and resize all images to 224×224.
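For example, with TensorFlow (the decode step assumes PNG inputs, as in the IU collection):

```python
import tensorflow as tf

def load_image(path: str) -> tf.Tensor:
    """Read a chest X-ray and resize it to the 224x224 input expected by the encoder."""
    img = tf.io.read_file(path)
    img = tf.image.decode_png(img, channels=3)  # 3 channels for pre-trained CNNs
    img = tf.image.resize(img, (224, 224))
    return img / 255.0                          # scale pixel values to [0, 1]
```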
Figure 2.9: Findings are the observations from the X-ray, while the impression is the inference obtained from them. In this case study, I will try to predict the impression part of the medical report given the two images.
Figure 2.10
We can see from the word cloud above that several phrases are high in frequency, e.g., “acute cardiopulmonary” and “cardiopulmonary abnormality”. We will investigate this further by observing the value counts of the impression feature.
Figure 2.11
From the above value counts, we can see that the top 20 most frequently occurring impressions carry the same meaning, and thus the same information, suggesting that one type of data dominates the dataset. We will apply a combination of upsampling and downsampling so that the model does not output the same value for the entire dataset, i.e., to reduce over-fitting.
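A sketch of such rebalancing with pandas (the thresholds are illustrative, not the exact values used):

```python
import pandas as pd

# Cap the dominant impressions and upsample rare ones.
max_per_caption, min_per_caption = 100, 5
parts = []
for _, group in df.groupby("impression"):
    if len(group) > max_per_caption:
        group = group.sample(max_per_caption, random_state=42)  # downsample
    elif len(group) < min_per_caption:
        group = group.sample(min_per_caption, replace=True, random_state=42)  # upsample
    parts.append(group)
df_balanced = pd.concat(parts).sample(frac=1, random_state=42)  # shuffle
```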
Chapter 3
Model Architecture and Training

3.1.2 Encoder
The encoder takes the two images and converts them into backbone features for the decoder. For the encoder, I will use the CheXNet model. CheXNet is a DenseNet121-based model trained on a large chest X-ray dataset for the classification of 14 diseases.
We can load the weights of that model and pass the images through it, ignoring the top (classification) layer (ref. fig 3.1).
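A sketch of building such an encoder in Keras (the CheXNet weights file path is hypothetical):

```python
import tensorflow as tf

# DenseNet121 backbone without the classification head; global average pooling
# yields a single 1024-dimensional feature vector per image.
base = tf.keras.applications.DenseNet121(
    include_top=False, weights=None, input_shape=(224, 224, 3), pooling="avg"
)
# Hypothetical path to pre-trained CheXNet weights converted for Keras:
# base.load_weights("chexnet_densenet121_weights.h5", by_name=True, skip_mismatch=True)

def encode(img_batch):
    return base(img_batch)  # shape: (batch, 1024) backbone features
```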
Figure 3.1
Figure 3.2
Here, the two images are passed through the image encoder layer, the two outputs are concatenated, and the result is passed through a dense layer. The padded, tokenized captions are passed through an embedding layer initialized with pre-trained GloVe vectors (300 dimensions); this layer is set as trainable. The embeddings are then passed through an LSTM whose initial state is taken from the output of the image dense layer. The LSTM outputs and the image features are then added and passed through the output dense layer, whose number of units equals the vocabulary size, with softmax activation applied on top.
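A sketch of this architecture with the Keras functional API; dimensions such as `vocab_size`, `max_len`, and `units`, and the GloVe matrix, are assumptions standing in for values prepared earlier:

```python
import numpy as np
from tensorflow.keras import layers, Model, initializers

vocab_size, max_len, units = 5000, 40, 256          # assumed sizes
glove_matrix = np.random.normal(size=(vocab_size, 300))  # stand-in; load real GloVe vectors

img1 = layers.Input(shape=(1024,), name="image1_features")   # CheXNet features
img2 = layers.Input(shape=(1024,), name="image2_features")
caption_in = layers.Input(shape=(max_len,), name="caption_tokens")

x = layers.Concatenate()([img1, img2])
img_dense = layers.Dense(units, activation="relu")(x)         # image dense layer

emb = layers.Embedding(
    vocab_size, 300,
    embeddings_initializer=initializers.Constant(glove_matrix),
    trainable=True,  # GloVe init, fine-tuned during training
)(caption_in)
seq = layers.LSTM(units, return_sequences=True)(
    emb, initial_state=[img_dense, img_dense]  # image features seed the LSTM state
)
# Add image information to every time step, then predict the next word.
added = layers.Add()([seq, layers.RepeatVector(max_len)(img_dense)])
out = layers.TimeDistributed(layers.Dense(vocab_size, activation="softmax"))(added)

model = Model([img1, img2, caption_in], out)
```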
Figure 3.3: Attention-guided chained context aggregation for image segmentation.
The time-distributed dense layer is applied at the end because the output is sequential, and the layer must be applied to every temporal slice of the output.
Beam search was used to predict the sentence.
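A compact sketch of beam search decoding; the `predict_step` helper is an assumption returning next-token probabilities for a partial sequence, and `k` is the beam width:

```python
import numpy as np

def beam_search(predict_step, start_id, end_id, max_len=40, k=3):
    """Keep the k highest log-probability partial captions at each step."""
    beams = [([start_id], 0.0)]  # (token sequence, cumulative log probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end_id:            # finished captions pass through
                candidates.append((seq, score))
                continue
            probs = predict_step(seq)        # assumed: 1-D next-token distribution
            for tok in np.argsort(probs)[-k:]:
                candidates.append(
                    (seq + [int(tok)], score + np.log(probs[tok] + 1e-12))
                )
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
        if all(seq[-1] == end_id for seq, _ in beams):
            break
    return beams[0][0]
```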
A few predictions on test data made with the attention model + beam search:
Prediction 1. Original sentence: “Heart size normal. Stable cardiomediastinal silhouette. No pneumothorax, pleural effusion, or focal airspace disease. Bony structures are in alignment without fracture.” Predicted sentence: “The heart is normal in size and mediastinal contours are within normal limits. There is no pleural effusion or pneumothorax. There is no focal air space opacity to suggest a pneumonia. There is no pleural effusion or pneumothorax.”
Figure 4.2
Figure 4.3
Chapter 5
Summary and Results
The BLEU scores of greedy search and beam search (top k = 3) were found to be similar, so the best configuration for the simple encoder-decoder model is greedy search, since it is the fastest. Computing the BLEU score for top k = 5 was found to be very slow, so it was discarded.
We can see from the above that the model predicted the same caption for all images; this was confirmed by taking the value counts of the predictions, as seen below.
From the above results, we can see that for every data point the model predicts “no acute cardiopulmonary abnormality”, suggesting that the model has overfitted. Since this is a baseline model, we can check the other models' predictions and compare their performance against it.
Chapter 6
Final Custom Model
This model also produced BLEU scores similar to the other two. Looking at the value counts of the predictions, we can see a little more variability than in the baseline model, though not as much as in the second model. Here too, as with the attention model, beam search (top k = 3) was found to be slow in predicting a single caption, so the approach was discarded.
Figure 6.3
We can observe from the above that the model correctly predicts only those images that have no disease, but not the others. We can conclude that the attention model is the best-performing model of the three.
Chapter 9
Conclusion
Based on the BLEU score, the custom final model (greedy search) is found to be the best model. It performed better than the other models: it was able to output the names of diseases and some observations more correctly. The simple encoder-decoder model was only able to output one caption for the entire dataset, suggesting overfitting. More data with much more variability, specifically X-rays with diseases, would help the models learn better and reduce the bias towards the “no disease” category.
Chapter 10
Future Scope
Bibliography
[2] Pranav Rajpurkar, Jeremy Irvin, Kaylie Zhu, Brandon Yang, Hershel
Mehta, Tony Duan, Daisy Ding, Aarti Bagul, Curtis Langlotz, Katie
Shpanskaya, et al. Chexnet: Radiologist-level pneumonia detection on
chest x-rays with deep learning. arXiv preprint arXiv:1711.05225, 2017.
[3] Navdeep Kaur and Ajay Mittal. Cadxreport: Chest x-ray report gener-
ation using co-attention mechanism and reinforcement learning. Com-
puters in Biology and Medicine, 145:105498, 2022.
[4] Navdeep Kaur and Ajay Mittal. Radiobert: A deep learning-based sys-
tem for medical report generation from chest x-ray images using contex-
tual embeddings. Journal of Biomedical Informatics, 135:104220, 2022.
[6] Gael Varoquaux and Veronika Cheplygina. Machine learning for medical
imaging: methodological failures and recommendations for the future.
npj Digital Medicine, 5:48, 04 2022.
[7] Xuewei Ma, Fenglin Liu, Changchang Yin, Xian Wu, Shen Ge, Yuexian
Zou, Ping Zhang, and Xu Sun. Contrastive attention for automatic chest
x-ray report generation, 2021.
[8] Pablo Messina, Pablo Pino, Denis Parra, Alvaro Soto, Cecilia Besa, Sergio Uribe, Marcelo Andía, Cristian Tejos, Claudia Prieto, and Daniel Capurro. A survey on deep learning and explainability for automatic report generation from medical images.
[10] Feiyang Yu, Mark Endo, Rayan Krishnan, Ian Pan, Andy Tsai, Ed-
uardo Pontes Reis, Eduardo Kaiser Ururahy Nunes Fonseca, Hen-
rique Min Ho Lee, Zahra Shakeri Hossein Abad, Andrew Y. Ng, Curtis P.
Langlotz, Vasantha Kumar Venugopal, and Pranav Rajpurkar. Eval-
uating progress in automatic chest x-ray radiology report generation.
medRxiv, 2022.
[11] Xiaozheng Xie, Jianwei Niu, Xuefeng Liu, Zhengsu Chen, Shaojie Tang,
and Shui Yu. A survey on incorporating domain knowledge into deep
learning for medical image analysis. Medical Image Analysis, 69:101985,
2021.
[12] Xuewei Ma, Fenglin Liu, Shen Ge, and Xian Wu. Competence-based
multimodal curriculum learning for medical report generation, 2022.
[17] Toru Nishino, Ryota Ozaki, Yohei Momoki, Tomoki Taniguchi, Ryuji
Kano, Norihisa Nakano, Yuki Tagawa, Motoki Taniguchi, Tomoko
[20] Baoyu Jing, Pengtao Xie, and Eric Xing. On the automatic generation
of medical imaging reports. In Proceedings of the 56th Annual Meeting of
the Association for Computational Linguistics (Volume 1: Long Papers).
Association for Computational Linguistics, 2018.
[22] Zaheer Babar, Twan van Laarhoven, Fabio Massimo Zanzotto, and
Elena Marchiori. Evaluating diagnostic content of ai-generated radiology
reports of chest x-rays. Artificial Intelligence in Medicine, 116:102075,
2021.
[23] An Yan, Zexue He, Xing Lu, Jiang Du, Eric Chang, Amilcare Gentili,
Julian McAuley, and Chun-Nan Hsu. Weakly supervised contrastive
learning for chest x-ray report generation, 2021.
[24] Yasuhide Miura, Yuhao Zhang, Emily Bao Tsai, Curtis P. Langlotz,
and Dan Jurafsky. Improving factual completeness and consistency of
image-to-text radiology report generation, 2020.
[25] Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, and Ronald M. Sum-
mers. Tienet: Text-image embedding network for common thorax dis-
ease classification and reporting in chest x-rays, 2018.
[26] Zhihong Chen, Yaling Shen, Yan Song, and Xiang Wan. Cross-modal
memory networks for radiology report generation, 2022.
[29] Daibing Hou, Zijian Zhao, Yuying Liu, Faliang Chang, and Sanyuan
Hu. Automatic report generation for chest x-ray images via adversarial
reinforcement learning. IEEE Access, 9:21236–21250, 2021.