
The medical image caption generation model takes a medical image as input and generates a
textual description as the output based on the visual content. Image captioning is a challenging
task since it requires the understanding of the main objects, their attributes, and their
relationships in an image and also includes the generation of syntactically and semantically
meaningful descriptions of the images in natural language. However, medical reports must be not
only readable and grammatically correct but also clinically accurate []. Nowadays, deep
convolutional neural networks have demonstrated state-of-the-art performance for medical image
captioning on various imaging modalities and tasks [7] [2]. Despite this early success, captioning
networks may still generate anatomically aberrant captions. They also require large amounts of
labeled training data to avoid overfitting and to generalize well, and such data is not easily
accessible in the medical field. Data across languages, domains, or tasks is not always
available, especially in the medical domain, where patient information is kept confidential and
is not shared for privacy reasons. The scarcity of large, publicly available datasets therefore
makes medical image captioning challenging. Captioning is also harder when the main objects in
an image are far apart. To mitigate these limitations, recent research has focused on
incorporating prior knowledge. This helps with the curse of dimensionality and can enable
models to learn faster, use less training data, and still be robust and accurate [7] [3].

Literature Review
This section reviews prior work on image caption generation and attention. Several approaches
for generating image descriptions have recently been presented. Automatic image caption
generation has emerged as a promising research area in recent years, owing to advances in deep
neural network models for Computer Vision (CV) and Natural Language Processing (NLP). In
general, there are three types of image captioning techniques: neural-based approaches [10] [11]
[12], attention-based strategies [13] [14] [15] [16], and reinforcement learning (RL)-based
methods [17] [18]. Attention-based approaches have recently gained popularity and have been
shown to be more successful than neural-based methods. When predicting each word in the
caption, attention-based techniques tend to focus on certain locations in the image.
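
The idea of focusing on certain image locations for each word can be illustrated with a minimal soft-attention sketch. It is not taken from any of the cited papers: the dot-product scoring, the function names, and the toy region features are all assumptions chosen for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(region_features, query):
    """Return (weights, context) for one decoding step.

    region_features: one feature vector per image location (hypothetical values).
    query: decoder state vector, assumed to have the same length as each feature.
    """
    # Dot-product relevance score between the decoder state and each region.
    scores = [sum(f * q for f, q in zip(feat, query)) for feat in region_features]
    weights = softmax(scores)
    # Context vector: attention-weighted average of the region features.
    dim = len(region_features[0])
    context = [
        sum(w * feat[i] for w, feat in zip(weights, region_features))
        for i in range(dim)
    ]
    return weights, context

# Three toy image regions with 2-dimensional features.
regions = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
weights, context = attend(regions, query=[2.0, 0.0])
```

The weights sum to one, and the region most aligned with the query receives the largest weight, so the decoder conditions mostly on that location when predicting the next word.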
Recently, many researchers have used deep learning and deep transfer learning models for
prediction and diagnosis in various kinds of patients [28–31]. Therefore, many medical image
captioning models based on deep learning and deep transfer learning have been proposed
in the literature.

Deep neural networks


Deep neural networks (DNNs) were initially proposed for caption generation in [19], which
suggested extracting features from images using convolutional neural networks (CNNs) to
produce captions. A popular captioning technique combines a CNN with a recurrent neural
network (RNN): the CNN extracts image features and the RNN generates the language model.
Reference [20], for example, presented an end-to-end network made up of a CNN and an RNN.
Given the CNN feature of the training image at the starting time step, the model is trained to
maximize the likelihood of the target sentence. In the m-RNN model [21], by contrast, the
image's CNN feature is fed into the multimodal layer after the recurrent layer rather than at the
beginning. Comparable work [22] [23] also uses a CNN and an RNN to generate descriptions.
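
The training objective described above (maximizing the likelihood of the target sentence given a CNN feature supplied at the first time step) can be sketched as follows. This is a toy illustration, not the architecture of [20] or [21]: the plain tanh recurrence, the parameter shapes, the helper names, and the use of token id 0 as a start marker are all assumptions, and the image feature is a hypothetical precomputed vector. The sketch only evaluates the objective; it does not perform gradient updates.

```python
import math
import random

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def log_softmax(z):
    """Numerically stable log-softmax."""
    m = max(z)
    lse = m + math.log(sum(math.exp(x - m) for x in z))
    return [x - lse for x in z]

def init_params(feat_dim, hidden, vocab, scale=0.1, seed=0):
    """Random toy parameters (hypothetical shapes, for illustration only)."""
    rng = random.Random(seed)
    def mat(rows, cols):
        return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]
    # W_img: image -> hidden, W_hh: hidden -> hidden,
    # W_emb: one embedding row per token, W_out: hidden -> vocabulary logits.
    return mat(hidden, feat_dim), mat(hidden, hidden), mat(vocab, hidden), mat(vocab, hidden)

def caption_nll(image_feature, target_ids, params):
    """Negative log-likelihood of a target caption given an image feature."""
    W_img, W_hh, W_emb, W_out = params
    # The image feature enters only once, as the initial hidden state.
    h = [math.tanh(a) for a in matvec(W_img, image_feature)]
    nll = 0.0
    prev = 0  # token id 0 is assumed to be the start-of-sentence marker
    for t in target_ids:
        # Recurrent update from the previous state and previous word embedding.
        pre = [a + b for a, b in zip(matvec(W_hh, h), W_emb[prev])]
        h = [math.tanh(a) for a in pre]
        log_probs = log_softmax(matvec(W_out, h))
        nll -= log_probs[t]  # accumulate -log p(word_t | image, earlier words)
        prev = t
    return nll

params = init_params(feat_dim=4, hidden=3, vocab=5)
loss = caption_nll([0.2, -0.1, 0.4, 0.0], target_ids=[1, 3, 2], params=params)
```

Minimizing this negative log-likelihood over a training set is equivalent to maximizing the likelihood of the target sentences.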

Xiao et al. implemented a deep hierarchical encoder-decoder network (DHN) for medical image
captioning. DHN separates the functionalities of the encoder and decoder. It can evaluate the
potential information by integrating the high-level semantics of language and vision to obtain
medical captions [27].

Xiao et al. observed that encoder-decoder approaches are extensively used in medical image
captioning, and that the majority of them are implemented using a single long short-term memory
(LSTM) network [27].

From the reviewed work, we can say that developing an efficient image captioning model
is still a challenging issue. Additionally, not much work has been done to tune the initial
parameters of medical image captioning models [37–41]. Therefore, using meta-heuristic techniques

Medical Imaging Modality


Introduction to Medical Imaging
Medical Imaging Techniques (MITs) are non-invasive methods for looking inside the body
without opening it surgically. There are many medical imaging techniques: X-ray
radiography, X-ray Computed Tomography (CT), Magnetic Resonance Imaging (MRI),
ultrasonography, elastography, optical imaging, radionuclide imaging (including scintigraphy,
Positron Emission Tomography (PET), and Single Photon Emission Computed Tomography
(SPECT)), thermography, and terahertz imaging.

X-Ray Radiography
Radiography is a diagnostic technique that uses ionizing electromagnetic radiation, such as
X-rays, to view objects. X-rays are high-energy electromagnetic radiation that can penetrate solids
and ionize gas; they have a wavelength between 0.01 and 10 nanometers.
Radiography relies on electromagnetic radiation that penetrates the skin and is absorbed by the
internal tissues at different rates. A 2D representation of the internal structure is provided by
monitoring the variance in absorption [@#].
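
The effect of tissues absorbing X-rays at different rates can be sketched with the Beer-Lambert law, I = I0 * exp(-sum(mu_i * d_i)), a standard model of attenuation that the text above does not state explicitly. The attenuation coefficients below are illustrative placeholders, not measured values, and the function name is hypothetical.

```python
import math

def transmitted_fraction(layers):
    """Fraction of incident X-ray intensity remaining after a stack of layers.

    layers: list of (mu_per_cm, thickness_cm) pairs, where mu_per_cm is the
    linear attenuation coefficient of the material (assumed constant).
    """
    total_attenuation = sum(mu * d for mu, d in layers)
    return math.exp(-total_attenuation)

# Illustrative placeholder coefficients (per cm), NOT measured values.
MU_SOFT_TISSUE = 0.2
MU_BONE = 0.5

# Two 10 cm paths through the body: one only soft tissue, one crossing 2 cm of bone.
soft_only = transmitted_fraction([(MU_SOFT_TISSUE, 10.0)])
with_bone = transmitted_fraction([(MU_SOFT_TISSUE, 8.0), (MU_BONE, 2.0)])
```

The path that crosses bone transmits less intensity, so the detector receives a weaker signal along it; this difference between neighboring paths is what produces contrast in the 2D image.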

Magnetic Resonance Imaging (MRI)


MRI is a diagnostic technology that uses magnetic and radio frequency fields to image the body
tissues and monitor body chemistry.
MRI provides a powerful technique that enables multi-planar, three-dimensional views of body
organs. The human body is largely composed of water molecules. When a magnetic field is
applied, the hydrogen nuclei of the water molecules are excited, and their relaxation is
exploited.
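
The link between the magnetic field and the radio frequency field is not spelled out above. A standard relation is the Larmor equation, f = (gamma / 2*pi) * B, which gives the frequency at which hydrogen nuclei precess and can be excited. The sketch below assumes the commonly quoted value of about 42.577 MHz per tesla for hydrogen; the function name and the example field strengths are illustrative.

```python
# Gyromagnetic ratio of the hydrogen nucleus divided by 2*pi, in MHz per tesla
# (commonly quoted approximate value; treated here as an assumption).
GAMMA_BAR_MHZ_PER_T = 42.577

def larmor_frequency_mhz(field_tesla):
    """Resonance (Larmor) frequency of hydrogen nuclei in a static field."""
    return GAMMA_BAR_MHZ_PER_T * field_tesla

# Example field strengths chosen for illustration.
f_low = larmor_frequency_mhz(1.5)
f_high = larmor_frequency_mhz(3.0)
```

Because the frequency scales linearly with the field, doubling the field strength doubles the radio frequency needed to excite the nuclei.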

Ultrasonography
Ultrasonography is a diagnostic technology that uses high frequency broadband sound waves in
the megahertz range that are reflected by tissue to varying degrees to produce medical images.
The ultrasound transducer is placed against the skin of the patient near the region of interest. The
transducer produces a stream of high frequency sound waves that penetrate into the body and
reflect from the organs inside. The transducer detects sound waves as they echo back from the
internal structures of the organs. Different tissues reflect these sound waves differently, resulting in
a signature that can be measured and transformed into an image. These waves are received by the
ultrasound machine and turned into live pictures. The real time moving image obtained can be
used to guide drainage and biopsy procedures. Doppler capabilities of the recent scanners allow
the blood flow in arteries and veins to be assessed.
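
The echo principle described above can be sketched as a depth calculation: a pulse travels to a reflector and back, so depth = speed * time / 2. The speed of sound of 1540 m/s is a conventional average for soft tissue and is an assumption here, as are the function name and the example echo time.

```python
# Conventional average speed of sound in soft tissue, in m/s (an assumption).
SPEED_OF_SOUND_M_PER_S = 1540.0

def echo_depth_cm(round_trip_seconds, speed=SPEED_OF_SOUND_M_PER_S):
    """Depth of a reflecting structure from the round-trip echo time."""
    # The pulse covers the distance twice (out and back), hence the division by 2.
    depth_m = speed * round_trip_seconds / 2.0
    return depth_m * 100.0

# An echo arriving 65 microseconds after the pulse was sent (illustrative value).
depth = echo_depth_cm(65e-6)
```

An ultrasound machine applies this conversion to each returning echo, so later echoes are drawn deeper in the image.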

Mammography

Fluoroscopy

(1) A Comparative Study of Medical Imaging Techniques. Available from:
https://www.researchgate.net/publication/274634575_A_Comparative_Study_of_Medical_Imaging_Techniques
[accessed Dec 03 2022].
