Narayana Darapaneni, Abhishek S R¹
¹ Corresponding author
Contents

1 Introduction
1.1 Background
1.2 Introduction and Literature Survey
1.3 Related Work
1.3.1 Natural Image Captioning

2 Foundations
2.1 Foundations
2.1.1 Convolutional Neural Network (CNN)
2.1.2 Exploratory Data Analysis
2.1.3 Reports
2.1.4 Sample Images along with Captions
2.1.5 Word-cloud of all the impression values
2.1.6 Top 20 most frequently occurring values for the impression column
2.1.7 Performance metrics and business constraints

9 Conclusion
10 Future Scope
List of Figures

2.1
2.2 Neural network architecture with highlighted sizes and the units of each layer.
2.3 Three sample images (a), (b), (c) from the dataset: chest X-rays taken in frontal and lateral views.
2.4 A sample report, stored in XML format; the comparison, indication, findings, and impression parts are extracted using regex.
2.5
2.6
2.7
2.8
2.9 Findings are the observations from the X-ray, while the impression is the inference obtained; this case study predicts the impression part of the medical report given the two images.
2.10
2.11
3.1
3.2
3.3 Attention-guided chained context aggregation for image segmentation.
3.4 Global Flow
4.1 Global Flow
4.2
4.3
Chapter 1
Introduction
1.1 Background
A typical report written by radiologists consists of the patient history, the reason for the examination, and a summary of the findings. This project is beneficial for quick analysis of the medical condition and for providing timely treatment. The main challenge in this project is achieving clinical accuracy. Such a system plays a vital role in addressing medical needs in resource-limited countries that have an insufficient number of radiologists and radiology training programs.
Automated generation of radiological reports for different imaging modalities is needed to smooth the clinical workflow and alleviate radiologists' workload. It involves combining image processing techniques for medical image interpretation with language generation techniques for report writing. Kaur and Mittal[3] present a system that generates clinically accurate reports from chest X-ray images.
An increasing number of chest X-ray (CXR) examinations in radiodiagnosis departments burdens radiologists and makes the timely generation of accurate radiological reports highly challenging. An automatic radiological report generation (ARRG) system is envisaged to generate reports with minimal human intervention, easing radiologists' workload and smoothing the clinical workflow. The success of an ARRG system depends on two critical factors: i) the quality of the features the system extracts from the CXR images, and ii) the quality of the linguistic expression it generates to describe the normalities and abnormalities indicated by those features. Most existing ARRG systems fail on the latter factor and do not generate clinically acceptable reports because they ignore the contextual importance of medical terms[4].
Also, CNNs incorporate pooling layers, which aggregate pixel values using a permutation-invariant function (a max or mean operation). This reduces the number of parameters in the network by inducing translation invariance. At the end of the convolutional stream of the network, fully-connected layers (i.e., regular neural network layers), where weights are no longer shared, are usually added.
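To make the pattern concrete, here is a minimal Keras sketch of such a network; the layer sizes are arbitrary and chosen only for illustration:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Minimal CNN: convolutional stream with shared weights, permutation-invariant
# pooling, then fully-connected layers where weights are no longer shared.
model = models.Sequential([
    tf.keras.Input(shape=(224, 224, 3)),    # W x H x C input
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(2),                  # max: permutation-invariant aggregation
    layers.Conv2D(64, 3, activation="relu"),
    layers.GlobalAveragePooling2D(),         # mean over all spatial positions
    layers.Dense(128, activation="relu"),    # fully-connected head
    layers.Dense(14, activation="sigmoid"),  # e.g. 14 disease labels, as in CheXNet
])
model.summary()
```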
Varoquaux and Cheplygina[6] explore in their article avenues to improve the clinical impact of machine learning in medical imaging. The authors explain that medical datasets are typically small, on the order of hundreds or thousands of samples. Moreover, in medical imaging the dataset size refers to the number of subjects, and a subject may have multiple images, for example taken at different points in time.
Feature maps in a CNN have shape W×H×C, where W and H denote the spatial dimensions (width and height) and C denotes the channel dimension (depth, or the number of feature maps).
The architectures used for the visual and language components are:
CNN architectures for chest X-rays: DenseNet; ResNet; VGG; InceptionV3; GoogLeNet.
Language component architectures: GRU; LSTM; LSTM with attention; hierarchical LSTM with attention; hierarchical sentence LSTM + dual word LSTM (normal/abnormal); recurrent BiLSTM-attention-LSTM; partial report encoding + FC layer (next word); Transformer; hybrid template retrieval + generation/editing.
According to the literature survey, DenseNet, ResNet, and VGG are the most used visual backbones, and LSTM is the most used language model. The LSTM receives an encoding vector from the visual component at the beginning, and the full report is decoded from it. This encoding vector is typically a vector of global features output by the CNN. However, a plain LSTM is suitable only for short reports; for unstructured multi-sentence reports, a hierarchical LSTM with attention can be used.
Evaluation metrics for the reports fall into three categories: text quality, medical correctness/accuracy, and explainability.
Text quality: BLEU; ROUGE-L; METEOR; CIDEr.
Medical correctness/accuracy: MIRQI; MeSH; keyword accuracy; ROC-AUC.
The authors conclude that MIRQI seems the most promising approach for assessing medical accuracy, as it captures more data from the reports, while the text-quality metrics can be used for fluency, grammar, etc.
The authors also describe challenges that need attention:
(1) Expert evaluation: the model and the report generation system should be tested by board-certified medical experts, whose feedback carries immense value for improving the model.
(2) Explainability.
Yu et al.[10] quantitatively examined the correlation between automated metrics and radiologists' scoring of reports. They analyzed the failure modes of the metrics, namely the types of information the metrics do not capture, to understand when to choose particular metrics and how to interpret metric scores. Their new automatic metric, RadGraph F1, computes the overlap in clinical entities and relations between a machine-generated report and a radiologist-generated report. Their work also proposes a composite metric called RadCliQ, which ranks the quality of reports similarly to radiologists.
A quantitative investigation of the alignment between automated metrics and radiologists was conducted: scores were computed for automated metrics on radiology reports from the MIMIC-CXR dataset, and candidate reports were selected from the test reports by a metric oracle. Metric-oracle reports were constructed for BLEU, BERTScore, CheXbert vector similarity (s_emb), and the novel metric RadGraph F1. BLEU computes n-gram overlap and is representative of the family of text-overlap-based natural language generation metrics such as CIDEr, METEOR, and ROUGE. BERTScore has been proposed to capture contextual similarity beyond exact textual matches. CheXbert vector similarity and RadGraph F1 are metrics designed to measure the correctness of clinical information.
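As a rough illustration of the idea behind an entity/relation-overlap F1 (the real RadGraph F1 extracts entities and relations with a trained information-extraction model; here they are assumed to be given), the score can be computed as:

```python
def overlap_f1(pred_items: set, ref_items: set) -> float:
    """F1 of the overlap between two sets of clinical entities/relations.

    Simplified sketch: assumes entity/relation extraction has already been done.
    """
    if not pred_items and not ref_items:
        return 1.0
    tp = len(pred_items & ref_items)
    precision = tp / len(pred_items) if pred_items else 0.0
    recall = tp / len(ref_items) if ref_items else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical entities from a generated and a reference report:
pred = {("effusion", "absent"), ("heart size", "normal")}
ref = {("effusion", "absent"), ("pneumothorax", "absent")}
print(overlap_f1(pred, ref))  # 0.5
```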
The radiologist evaluation study was conducted with the support of six board-certified radiologists, who scored the number of errors that the various metric-oracle reports make compared to the test report. Radiologists categorized errors as significant or insignificant, and subtyped every error into the following six categories:
(1) false prediction of a finding (i.e., false positive);
(2) omission of a finding (i.e., false negative);
(3) incorrect location/position of a finding;
(4) incorrect severity of a finding;
(5) mention of a comparison that is not present in the reference impression;
(6) omission of a comparison describing a change from a previous study.
Fifty random studies were sampled from the MIMIC-CXR test set.
The Kendall rank correlation coefficient (tau-b) was used to relate metric scores and radiologist-reported errors. It was found that BERTScore and RadGraph F1 were the metrics with the two highest alignments with radiologists. Specifically, BERTScore has a tau value of 0.500 [95% CI 0.497, 0.503] for the total number of errors and 0.496 [95% CI 0.493, 0.498] for significant errors. RadGraph has a tau value of 0.463 [95% CI 0.460, 0.465] for the total number of errors and 0.459 [95% CI 0.456, 0.461] for significant errors. BLEU is the third-best metric under this evaluation, with a tau value of 0.459 [95% CI 0.456, 0.462] for the total number of errors and 0.445 [95% CI 0.442, 0.448] for significant errors. Lastly, CheXbert vector similarity had the worst alignment, with a tau value of 0.457 [95% CI 0.454, 0.459] for total errors and 0.418 [95% CI 0.416, 0.421] for significant errors. From these results, BERTScore, RadGraph, and BLEU are the metrics with the closest alignment to radiologists; CheXbert aligns with radiologists but is less concordant than the previously mentioned metrics.
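For reference, a toy sketch of computing such an alignment with SciPy's Kendall tau-b (the numbers below are illustrative, not the study's data):

```python
from scipy.stats import kendalltau

# Toy data: one metric score per report and the radiologist-reported error count.
# Errors are negated so that a positive tau means "higher score, fewer errors".
metric_scores = [0.91, 0.85, 0.78, 0.66, 0.54, 0.43]
error_counts = [0, 1, 1, 2, 3, 5]

tau, p_value = kendalltau(metric_scores, [-e for e in error_counts], variant="b")
print(f"tau-b = {tau:.3f}, p = {p_value:.3f}")
```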
Also, BLEU exhibited a prominent failure mode in identifying false predictions of findings in reports. Metric-oracle reports with respect to BLEU produced more false predictions of findings than BERTScore and RadGraph in terms of both the total number of errors (0.807 average errors per report versus 0.477 and 0.427 for BERTScore and RadGraph) and the number of significant errors.
Amjoud et al.[14] have combined the transfer learning technique and the transformer approach to generate reports. Specifically, a convolutional neural network pre-trained on the ImageNet database is used for feature extraction, and a modified transformer encoder-decoder is used for sequence generation. Additionally, the accuracy of these models is evaluated using traditional evaluation metrics such as accuracy, precision, and recall.
We know that medical imaging (e.g., chest X-rays) is widely used in medicine for diagnostic purposes. However, current clinical practice requires a radiologist with specialized training to manually evaluate X-rays and note their findings in a radiology report. This manual evaluation is time-consuming, and an automated solution for this task would help streamline the clinical workflow and improve the quality of care (Lovelace et al.[15]).
Rajpurkar et al.[2] explain that image captioning is the process of generating text (captions) from an image; it uses both natural language processing (NLP) and computer vision (CV) to generate the text output. X-rays are a form of electromagnetic radiation used for medical imaging; they can be used to spot fractures, bone injuries, or tumors. Analysis of X-ray reports is a very important task for radiologists and pathologists in recommending the correct diagnosis to patients.
Medical reports (Nguyen et al.[16]) are the primary medium through which physicians communicate findings and diagnoses from patients' medical scans. The process is usually laborious: typing out a medical report takes five to ten minutes on average, and it can be error-prone. This has led to a surging need for the automated generation of medical reports, to assist radiologists and physicians in making rapid and meaningful diagnoses. The potential efficiency gains and benefits could be enormous, especially during critical situations such as the COVID-19 pandemic or a similar one. Clearly, a successful medical report generation process is expected to possess two key properties:
1) clinical accuracy, to properly and correctly describe the disease and related symptoms;
2) language fluency, to produce realistic and human-readable text.
Writing medical reports manually from medical images is a time-consuming task for radiologists (Nishino et al.[17]). To write reports, radiologists first recognize which findings are present in medical images such as computed tomography (CT) and X-ray images. Then they compose reports that describe the recognized findings correctly and without omission. Doctors prefer radiology reports written in natural language.
Medical images, e.g., radiology and pathology images, and their corresponding reports, which describe in detail the observations in both normal and abnormal regions, are widely used for diagnosis and treatment (Ma et al.).
…evidence-based medicine. This makes deep learning and big data algorithms especially useful in this field, as they can identify some radiological signs that medical staff cannot detect. Although this manuscript focuses on classification problems, there are papers where these algorithms are used for regression problems, such as estimating the dose of a drug or generating medical reports from clinical tests, or for image processing tasks such as image segmentation and image reconstruction.
The COVID-19 pandemic has had a strong impact on research into the application of machine learning and deep learning in medical image analysis. As expected, many of the classification systems investigated have focused on detecting signs of bilateral COVID-19-associated pneumonia. The explainability of deep learning models is a fundamental factor to be taken into account in their application: these models are black-box algorithms and need explainable AI techniques to make them more trustworthy.
Although most medical datasets have two classes (samples of a particular pathology and healthy samples), in chest X-rays it is common to find signs of more than one pathology. For this reason, in the last five years different authors have published multilabel radiological datasets. These datasets are closer to real situations than binary ones, with the additional challenge of class imbalance: the size of each class in a realistic dataset should depend on the incidence of the pathology in society, i.e., some classes are more represented than others. The characteristics of these datasets are interesting and need to be analyzed in detail in order to address the problem adequately.
Conventional evaluation metrics for text generated from images include BLEU, ROUGE, and METEOR (Babar et al.[22]). To date, these are the most widely used tools in machine translation and summarization for evaluating how good a generated text is compared to the ground truth. Such metrics measure goodness in a generic context, independent of the application domain: they score a generated report only in terms of how well it matches the corresponding ground-truth report. As such, they can miss important clinical information contained in the generated report. For example, “Heart is normal” and “Heart is not normal” are syntactically close but semantically very different sentences. Therefore, we propose to assess the clinical quality of radiology reports using an external source of information (shown in the figure below). We assume an external source of ground-truth knowledge in the form of a set of diagnostic tags for each report. These diagnostic tags are external knowledge because they are not used for training the report generator model. By considering the tags as class labels associated with a report, we build a probabilistic model (on the training data) that predicts the class labels of a report. The probabilistic model is applied to the generated reports of the test set, and its performance is used as a quantitative estimate of the diagnostic quality of the generated reports (ref. fig 1.3). A sketch of such a tag classifier follows the figure.
Figure 1.3
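A minimal sketch of such a tag-prediction model, assuming reports are plain strings and tags form a binary label matrix; the classifier choice here is illustrative, not necessarily the authors' exact setup:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline

# Toy training reports and their (external) diagnostic tags.
train_reports = [
    "heart is normal", "heart is not normal",
    "no effusion seen", "large pleural effusion",
]
train_tags = np.array([[0, 0], [1, 0], [0, 0], [0, 1]])  # columns: cardiomegaly, effusion

# One binary classifier per tag over TF-IDF features of the report text.
clf = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LogisticRegression()))
clf.fit(train_reports, train_tags)

# Applied to generated test reports; tag-prediction performance on these
# serves as a quantitative estimate of their diagnostic quality.
generated = ["heart is not normal", "no pleural effusion"]
print(clf.predict(generated))
```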
A related line of work aimed to develop a method that takes a radiology report as input and generates a set of X-ray images that manifest the clinical findings narrated in the report. For simplicity, two views of images are assumed: a frontal view and a lateral view, each with one image; it is straightforward to extend the method to more views (each with more than one image).
Content-based image retrieval (CBIR) is a technique for knowledge discovery in massive databases and offers the possibility to identify similar case histories, understand rare disorders, and, ultimately, improve patient care (Litjens et al.[5]). The major challenge in the development of CBIR methods is extracting effective feature representations from pixel-level information and associating them with meaningful concepts. The ability of deep CNN models to learn rich features at multiple levels of abstraction has elicited interest from the CBIR community.
In image-to-text radiology report generation (Miura et al.[24], Wang et al.[25]), Jing et al.[20] first proposed multi-task learning models that jointly generate a report and classify disease labels from a chest X-ray image. Their models were extended to use multiple images, to adopt a hybrid retrieval-generation model, or to consider structure information. They conclude that more recent work has focused on generating reports that are clinically consistent and accurate; Liu et al.[19] presented a system that generates accurate reports through clinically-oriented fine-tuning.
The most popular task related to ours (Chen et al.[26]) is image captioning, a cross-modal task involving natural language processing and computer vision that aims to describe images in sentences. Among these studies, the most related, from Cornia et al., also proposed leveraging memory matrices to learn prior knowledge for visual features using memory networks, but such an operation is performed only during the encoding process. Unlike that work, the memory in our model is designed to align the visual and textual features, and the memory operations (i.e., querying and responding) are performed in both the encoding and decoding processes.
(Zhang et al.[27]) Most of the work on image-based captioning is based on the classical CNN + LSTM structure, which aims to generate realistic sentences or topic-relevant paragraphs that summarize the visual content of an image. With the development of computer vision and natural language processing technology, many works combine radiology images and free text to automatically generate radiology reports and help clinical radiologists make a quick and meaningful diagnosis. Radiology report generation takes X-ray images as input and generates descriptive reports to support the inference of better diagnostic conclusions beyond disease labels. Many radiology report generation methods follow the practice of image captioning models: for example, adopting an encoder-decoder architecture with a hierarchical generator and attention mechanism to generate long reports, or fusing the visual features with the semantic features of the previous sentence through an attention mechanism, using the fused features to generate the next sentence, and then generating the whole report in a loop.
(Srinivasan et al.[28]) The success in medical image captioning has been possible due to the latest advances in deep learning. DenseNet, being a densely connected convolutional network, enables learning high-order dependencies by using a large number of layers with a minimal number of parameters, allowing architectures to understand complex images like X-rays without overfitting. Xception proposed the depthwise separable convolution operation, which in turn extracts efficient image features with a decreased number of parameters in the model.
Chapter 2
Foundations
2.1 Foundations
Chempolil[30] in his article framed the problem as an image captioning task. For this dataset, we are given a couple of images per patient and a report in XML format; the image names and the patient information are contained within the XML file. We extracted this data from the XML files using regex and transformed it into a data frame. Then we can use pre-trained models to extract information from the images, and using this information we can generate captions with LSTMs or GRUs. BLEU stands for Bilingual Evaluation Understudy; here we will use the BLEU score as the metric. The BLEU score compares each word in the predicted sentence to the reference sentence (this is also done with n-grams) and returns a score based on how many predicted words appear in the original sentence. BLEU is not an ideal metric for comparing translation performance, since similar words with the same meaning are penalized. So we will not only use the n-gram BLEU score but also take a sample of the predicted captions and compare them to the original reference captions manually. BLEU scores range from 0 to 1; for example, the predicted text “The dog is jumping” exactly matches the reference text “The dog is jumping”.
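A minimal sketch of computing a sentence-level BLEU score with NLTK (the smoothing choice here is an assumption, not specified above):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "The dog is jumping".lower().split()
predicted = "The dog is jumping".lower().split()

# Up-to-4-gram BLEU with uniform weights; smoothing avoids zero scores on
# short sentences that lack higher-order n-gram matches.
smooth = SmoothingFunction().method1
score = sentence_bleu(
    [reference], predicted,
    weights=(0.25, 0.25, 0.25, 0.25),
    smoothing_function=smooth,
)
print(score)  # 1.0 for an exact match
```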
The dataset was extracted from the Indiana University Chest X-Rays collection (Rajpurkar et al.[2]). In this particular case study, we are given a set of images (chest X-rays) and the impressions radiologists inferred from those images. We have to develop and train a model that generates impressions from the chest X-rays provided in image format, together with the findings and some other data provided in XML format. This type of model can save radiologists a lot of time in analyzing the images and writing impressions. It can also be used as a base model to validate some decisions, since the cost of mistakes is very high in the medical domain, unlike other domains.
Here we have used the same publicly available dataset from Indiana University, which consists of chest X-ray images and reports (in XML format) containing information regarding the comparison, indication, findings, and impression of the X-ray. The goal of this case study is to predict the impression of the medical report attached to the images.
Figure 2.1
The model above (ref. fig 2.1) outputs, for each candidate word, the probability of it being the next word in the sentence. Here the greedy search method is used (pick the word with the highest probability of coming next). The model was trained for 30 epochs with an initial learning rate of 0.001 and 3 pictures per batch (the batch size); after 20 epochs, the learning rate was reduced to 0.0001. The loss used is categorical cross-entropy, and the optimizer is Adam.
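A sketch of this training setup in Keras, assuming `model` and a batched `train_dataset` pipeline (3 pictures per batch) are defined elsewhere:

```python
import tensorflow as tf

def lr_schedule(epoch, lr):
    # Initial learning rate of 0.001, dropped to 0.0001 after 20 epochs.
    return 1e-3 if epoch < 20 else 1e-4

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    loss="categorical_crossentropy",
)
model.fit(
    train_dataset,  # assumed tf.data pipeline yielding (inputs, targets), batch size 3
    epochs=30,
    callbacks=[tf.keras.callbacks.LearningRateScheduler(lr_schedule)],
)
```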
Figure 2.2: Neural network architecture with highlighted sizes and the units of each layer.
Figure 2.3: Three sample images, (a), (b), and (c), from the dataset. These are chest X-rays taken in frontal and lateral views.
2.1.3 Reports
Figure 2.4: A sample report, stored in XML format. We extract the comparison, indication, findings, and impression parts of the report; this was done using regex.
Figure 2.5
These are the two image files associated with this report. Next, we will check the maximum and minimum numbers of images that can be associated with a report.
Figure 2.6
We can see that the maximum number of images associated with a report is 5, while the minimum is 0.
We will extract all the information parts, i.e., the comparison, indication, findings, and impression of the report, and the corresponding two images (we take two images as input, since two is the most frequent number of images associated with a report), into a data frame keyed by the XML report file name, using regex and string manipulation. For reports with more than two images, we create new data points with the additional images and the same report information. A sketch of this extraction step is shown below.
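A hedged sketch of this extraction step; the tag names follow the Indiana University XML format as described, and file paths are illustrative:

```python
import re
import pandas as pd

def parse_report(xml_text: str) -> dict:
    """Pull comparison/indication/findings/impression out of one IU XML report.

    Sketch using regex, as in the text; the AbstractText/Label tag names follow
    the Indiana University report format and may need adjusting.
    """
    fields = {}
    for label in ["COMPARISON", "INDICATION", "FINDINGS", "IMPRESSION"]:
        m = re.search(
            rf'<AbstractText Label="{label}">(.*?)</AbstractText>', xml_text, re.S
        )
        fields[label.lower()] = m.group(1).strip() if m else None
    # Image ids are listed in <parentImage id="..."> elements.
    fields["images"] = re.findall(r'<parentImage id="(.*?)"', xml_text)
    return fields

# Hypothetical usage over a folder of reports:
# rows = [parse_report(open(p).read()) | {"xml_file": p} for p in xml_paths]
# df = pd.DataFrame(rows)
```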
Figure 2.7
It was found that there are missing values in the data frame. All data points with a null image 1 or impression value were removed from the data frame. Then all the missing values in image 2 were filled with the same file path as image 1.
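In pandas, this cleaning step might look like the following (the column names are assumptions):

```python
import pandas as pd

# Assumed columns: image_1, image_2, impression.
df = df.dropna(subset=["image_1", "impression"])     # drop unusable rows
df["image_2"] = df["image_2"].fillna(df["image_1"])  # reuse image 1 where image 2 is missing
```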
Figure 2.8
We can see from the above diagram that 420 is the most common height for image 1, while 624 is the most common height for image 2. The width of both images takes only one unique value across all data points: 512. Since pre-trained models are designed for square images, we choose 224×224 as the target size and resize all images to 224×224.
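For example, with TensorFlow (the decode step assumes PNG inputs, as in the IU collection):

```python
import tensorflow as tf

def load_image(path: str) -> tf.Tensor:
    """Read a chest X-ray and resize it to the 224x224 input expected by the encoder."""
    img = tf.io.read_file(path)
    img = tf.image.decode_png(img, channels=3)  # 3 channels for pre-trained CNNs
    img = tf.image.resize(img, (224, 224))
    return img / 255.0                          # scale pixel values to [0, 1]
```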
Figure 2.9: Findings are the observations from the X-ray, while the impression is the inference obtained from them. In this case study, I will try to predict the impression part of the medical report given the two images.
Figure 2.10
We can see from the word cloud above that several phrases are high in frequency, e.g., “acute cardiopulmonary” and “cardiopulmonary abnormality”. We will investigate this further by observing the value counts of the impression feature.
Figure 2.11
From the above value counts, we can see that the top 20 most frequently occurring impressions carry the same meaning, and thus the same information, suggesting that one type of data dominates the dataset. We will apply a combination of upsampling and downsampling so that the model does not output the same value for the entire dataset, i.e., to reduce over-fitting.
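A sketch of such rebalancing with pandas (the thresholds are illustrative, not the exact values used):

```python
import pandas as pd

# Cap the dominant impressions and upsample rare ones.
max_per_caption, min_per_caption = 100, 5
parts = []
for _, group in df.groupby("impression"):
    if len(group) > max_per_caption:
        group = group.sample(max_per_caption, random_state=42)  # downsample
    elif len(group) < min_per_caption:
        group = group.sample(min_per_caption, replace=True, random_state=42)  # upsample
    parts.append(group)
df_balanced = pd.concat(parts).sample(frac=1, random_state=42)  # shuffle
```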
Chapter 3
Model Architecture and Training

3.1.2 Encoder
The encoder takes the two images and converts them into backbone features for the decoder. For the encoder, I will use the CheXNet model. CheXNet is a DenseNet121-based model trained on a large chest X-ray dataset for the classification of 14 diseases.
We can load the weights of that model and pass the images through it, ignoring the top (classification) layer (ref. fig 3.1).
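A sketch of building such an encoder in Keras (the CheXNet weights file path is hypothetical):

```python
import tensorflow as tf

# DenseNet121 backbone without the classification head; global average pooling
# yields a single 1024-dimensional feature vector per image.
base = tf.keras.applications.DenseNet121(
    include_top=False, weights=None, input_shape=(224, 224, 3), pooling="avg"
)
# Hypothetical path to pre-trained CheXNet weights converted for Keras:
# base.load_weights("chexnet_densenet121_weights.h5", by_name=True, skip_mismatch=True)

def encode(img_batch):
    return base(img_batch)  # shape: (batch, 1024) backbone features
```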
Figure 3.1
Figure 3.2
Here, the two images are passed through the image encoder layer, the two outputs are concatenated, and the result is passed through a dense layer. The padded, tokenized captions are passed through an embedding layer initialized with pre-trained GloVe vectors (300 dimensions); this layer is set as trainable. The embeddings are then passed through an LSTM whose initial state is taken from the output of the image dense layer. The LSTM outputs and the image features are then added and passed through the output dense layer, whose number of units equals the vocabulary size, with softmax activation applied on top.
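A sketch of this architecture with the Keras functional API; dimensions such as `vocab_size`, `max_len`, and `units`, and the GloVe matrix, are assumptions standing in for values prepared earlier:

```python
import numpy as np
from tensorflow.keras import layers, Model, initializers

vocab_size, max_len, units = 5000, 40, 256          # assumed sizes
glove_matrix = np.random.normal(size=(vocab_size, 300))  # stand-in; load real GloVe vectors

img1 = layers.Input(shape=(1024,), name="image1_features")   # CheXNet features
img2 = layers.Input(shape=(1024,), name="image2_features")
caption_in = layers.Input(shape=(max_len,), name="caption_tokens")

x = layers.Concatenate()([img1, img2])
img_dense = layers.Dense(units, activation="relu")(x)         # image dense layer

emb = layers.Embedding(
    vocab_size, 300,
    embeddings_initializer=initializers.Constant(glove_matrix),
    trainable=True,  # GloVe init, fine-tuned during training
)(caption_in)
seq = layers.LSTM(units, return_sequences=True)(
    emb, initial_state=[img_dense, img_dense]  # image features seed the LSTM state
)
# Add image information to every time step, then predict the next word.
added = layers.Add()([seq, layers.RepeatVector(max_len)(img_dense)])
out = layers.TimeDistributed(layers.Dense(vocab_size, activation="softmax"))(added)

model = Model([img1, img2, caption_in], out)
```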
Figure 3.3: Attention-guided chained context aggregation for image segmentation.
The time-distributed dense layer is applied at the end because the output is sequential, and the layer must be applied to every temporal slice of the output.
Beam search was used to predict the sentence.
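A compact sketch of beam search decoding; the `predict_step` helper is an assumption returning next-token probabilities for a partial sequence, and `k` is the beam width:

```python
import numpy as np

def beam_search(predict_step, start_id, end_id, max_len=40, k=3):
    """Keep the k highest log-probability partial captions at each step."""
    beams = [([start_id], 0.0)]  # (token sequence, cumulative log probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end_id:            # finished captions pass through
                candidates.append((seq, score))
                continue
            probs = predict_step(seq)        # assumed: 1-D next-token distribution
            for tok in np.argsort(probs)[-k:]:
                candidates.append(
                    (seq + [int(tok)], score + np.log(probs[tok] + 1e-12))
                )
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
        if all(seq[-1] == end_id for seq, _ in beams):
            break
    return beams[0][0]
```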
A few predictions on test data made with the attention model + beam search:
Prediction 1. Original sentence: “Heart size normal. Stable cardiomediastinal silhouette. No pneumothorax, pleural effusion, or focal airspace disease. Bony structures are in alignment without fracture.” Predicted sentence: “The heart is normal in size and mediastinal contours are within normal limits. There is no pleural effusion or pneumothorax. There is no focal air space opacity to suggest a pneumonia. There is no pleural effusion or pneumothorax.”
Figure 4.2
Figure 4.3
Chapter 5
Summary and Results
The BLEU scores of greedy search and beam search (top k = 3) were found to be similar, so the best configuration for the simple encoder-decoder model is greedy search, since it is the fastest. Computing the BLEU score for top k = 5 was found to be very slow, so it was discarded.
We can see from the above that the model predicted the same caption for all images; this was confirmed by taking the value counts of the predictions, as seen below.
From the above results, we can see that for every data point the model predicts “no acute cardiopulmonary abnormality”, suggesting that the model has overfitted. Since this is a baseline model, we can check the other models' predictions and compare their performance against it.
Chapter 6
Final Custom Model
This model also produced BLEU scores similar to the other two. Looking at the value counts of the predictions, we can see a little more variability than in the baseline model, though not as much as in the second model. Here too, as with the attention model, beam search (top k = 3) was found to be slow in predicting a single caption, so the approach was discarded.
Figure 6.3
We can observe from the above that the model correctly predicts only those images that have no disease, but not the others. We can conclude that the attention model is the best-performing model of the three.
Chapter 9
Conclusion
Based on the BLEU score, the custom final model (greedy search) is found to be the best model. It performed better than the other models: it was able to output the names of diseases and some observations more correctly. The simple encoder-decoder model was only able to output one caption for the entire dataset, suggesting overfitting. More data with much more variability, specifically X-rays with diseases, would help the models learn better and reduce the bias towards the “no disease” category.
Chapter 10
Future Scope
Bibliography
[2] Pranav Rajpurkar, Jeremy Irvin, Kaylie Zhu, Brandon Yang, Hershel
Mehta, Tony Duan, Daisy Ding, Aarti Bagul, Curtis Langlotz, Katie
Shpanskaya, et al. Chexnet: Radiologist-level pneumonia detection on
chest x-rays with deep learning. arXiv preprint arXiv:1711.05225, 2017.
[3] Navdeep Kaur and Ajay Mittal. Cadxreport: Chest x-ray report gener-
ation using co-attention mechanism and reinforcement learning. Com-
puters in Biology and Medicine, 145:105498, 2022.
[4] Navdeep Kaur and Ajay Mittal. Radiobert: A deep learning-based sys-
tem for medical report generation from chest x-ray images using contex-
tual embeddings. Journal of Biomedical Informatics, 135:104220, 2022.
[6] Gael Varoquaux and Veronika Cheplygina. Machine learning for medical
imaging: methodological failures and recommendations for the future.
npj Digital Medicine, 5:48, 04 2022.
[7] Xuewei Ma, Fenglin Liu, Changchang Yin, Xian Wu, Shen Ge, Yuexian
Zou, Ping Zhang, and Xu Sun. Contrastive attention for automatic chest
x-ray report generation, 2021.
[8] Pablo Messina, Pablo Pino, Denis Parra, Alvaro Soto, Cecilia Besa, Sergio Uribe, Marcelo Andía, Cristian Tejos, Claudia Prieto, and Daniel Capurro. A survey on deep learning and explainability for automatic report generation from medical images.
[10] Feiyang Yu, Mark Endo, Rayan Krishnan, Ian Pan, Andy Tsai, Ed-
uardo Pontes Reis, Eduardo Kaiser Ururahy Nunes Fonseca, Hen-
rique Min Ho Lee, Zahra Shakeri Hossein Abad, Andrew Y. Ng, Curtis P.
Langlotz, Vasantha Kumar Venugopal, and Pranav Rajpurkar. Eval-
uating progress in automatic chest x-ray radiology report generation.
medRxiv, 2022.
[11] Xiaozheng Xie, Jianwei Niu, Xuefeng Liu, Zhengsu Chen, Shaojie Tang,
and Shui Yu. A survey on incorporating domain knowledge into deep
learning for medical image analysis. Medical Image Analysis, 69:101985,
2021.
[12] Xuewei Ma, Fenglin Liu, Shen Ge, and Xian Wu. Competence-based
multimodal curriculum learning for medical report generation, 2022.
[17] Toru Nishino, Ryota Ozaki, Yohei Momoki, Tomoki Taniguchi, Ryuji
Kano, Norihisa Nakano, Yuki Tagawa, Motoki Taniguchi, Tomoko
[20] Baoyu Jing, Pengtao Xie, and Eric Xing. On the automatic generation
of medical imaging reports. In Proceedings of the 56th Annual Meeting of
the Association for Computational Linguistics (Volume 1: Long Papers).
Association for Computational Linguistics, 2018.
[22] Zaheer Babar, Twan van Laarhoven, Fabio Massimo Zanzotto, and
Elena Marchiori. Evaluating diagnostic content of ai-generated radiology
reports of chest x-rays. Artificial Intelligence in Medicine, 116:102075,
2021.
[23] An Yan, Zexue He, Xing Lu, Jiang Du, Eric Chang, Amilcare Gentili,
Julian McAuley, and Chun-Nan Hsu. Weakly supervised contrastive
learning for chest x-ray report generation, 2021.
[24] Yasuhide Miura, Yuhao Zhang, Emily Bao Tsai, Curtis P. Langlotz,
and Dan Jurafsky. Improving factual completeness and consistency of
image-to-text radiology report generation, 2020.
[25] Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, and Ronald M. Sum-
mers. Tienet: Text-image embedding network for common thorax dis-
ease classification and reporting in chest x-rays, 2018.
[26] Zhihong Chen, Yaling Shen, Yan Song, and Xiang Wan. Cross-modal
memory networks for radiology report generation, 2022.
[29] Daibing Hou, Zijian Zhao, Yuying Liu, Faliang Chang, and Sanyuan
Hu. Automatic report generation for chest x-ray images via adversarial
reinforcement learning. IEEE Access, 9:21236–21250, 2021.