
Medical Image Synthesis Using Generative Models
Submitted by
Anubhav Misra
Enrolment No. – 2022CSM015
M.Tech 2nd Semester

Under the guidance of


Prof. Jaya Sil

A Term Paper
Submitted in partial fulfillment of the requirements for the degree of
Master of Technology
(Computer Science and Engineering)

Department of Computer Science and Engineering


Indian Institute of Engineering Science and Technology, Shibpur
Howrah-711103
May, 2023

Indian Institute of Engineering Science and Technology, Shibpur
Howrah-711103

CERTIFICATE

I hereby forward the term paper entitled “Medical Image Synthesis using
Generative Models” submitted by Anubhav Misra (Enrolment No. –
2022CSM015) as a bona-fide record of the project work carried out by her under
my guidance and supervision, in partial fulfillment of the requirements for the award
of the degree of Master of Technology in Computer Science and Engineering from
this Institute.

___________________________________

Dr. Jaya Sil
Professor
Department of CST
IIEST, Shibpur

Counter-signed by:

___________________________________

Dr. Asit Kumar Das
Professor and HOD
Department of CST
IIEST, Shibpur

CONTENTS

1. INTRODUCTION

2. LITERATURE SURVEY

2.1 Deep Unsupervised Learning using Nonequilibrium Thermodynamics (2015)
2.2 Denoising Diffusion Probabilistic Models (2020)
2.3 High-Resolution Image Synthesis with Latent Diffusion Models (2021)
2.4 Attention Is All You Need (2017)
2.5 Efficient Attention: Attention with Linear Complexities (2020)
2.6 U-Net: Convolutional Networks for Biomedical Image Segmentation (2015)
2.7 Micro-Batch Training with Batch-Channel Normalization and Weight
Standardization (2019)
2.8 Group Normalization (2018)
2.9 Deep Residual Learning for Image Recognition (2015)
2.10 DIFFUSION MODELS FOR MEDICAL IMAGE ANALYSIS: A
COMPREHENSIVE SURVEY (2022)

3. SCOPE OF WORK

4. WORK DONE

5. REFERENCES

1. Introduction

Artificial intelligence (AI) has the potential to transform healthcare, ‘by
helping clinicians to make more accurate diagnoses, providing decision
support, and automating tasks. AI has already been used to develop
predictive models for conditions such as heart disease, cancer, and
diabetes. These models can help clinicians to identify patients who are at
risk of developing a particular condition and to recommend treatments.’

This introductory quote was written entirely by a generative AI system
from the prompt ‘Artificial intelligence (AI) has the potential to transform
healthcare’ ([21]). The system in question is the GPT-3 model, developed
by OpenAI (San Francisco, USA).

Similar generative AI systems have been able to produce extremely
realistic pictures of human faces and, in a medical context, synthetic chest
X-rays that are indistinguishable from real ones. Broadly, there are two
main types of AI: deductive and generative. Deductive algorithms are
increasingly capable of analysing data to find patterns that would be
unfeasible for humans to program; they may be used in data analysis and
even diagnosis. Generative AI is a type of AI that focuses on generating
new content or creating synthetic data in the form of text, images, or
other media.

In the context of healthcare, generative AI can be used to:

• Create new medical images, such as X-rays or MRIs.
• Generate personalized treatment plans based on a patient’s medical
history and other factors.

These two use cases can generate substantial benefits in healthcare ([22]).

2. Literature Survey

2.1 Deep Unsupervised Learning using Nonequilibrium
Thermodynamics (2015)
Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan,
Surya Ganguli

Historically, probabilistic models suffer from a tradeoff between two conflicting
objectives: tractability and flexibility. Models that are tractable can be
analytically evaluated and easily fit to data (e.g. a Gaussian or Laplace).
However, these models are unable to aptly describe structure in rich datasets. On
the other hand, models that are flexible can be molded to fit structure in arbitrary
data. For example, we can define models in terms of any non-negative function
φ(x), yielding the flexible distribution p(x) = φ(x)/Z, where Z is a normalization
constant. However, computing this normalization constant is generally
intractable. Evaluating, training, or drawing samples from such flexible models
typically requires a very expensive Monte Carlo process. A variety of analytic
approximations exist which ameliorate, but do not remove, this tradeoff.

This paper provides a novel approach that simultaneously achieves both
flexibility and tractability. The essential idea, inspired by non-equilibrium
statistical physics, is to systematically and slowly destroy structure in a data
distribution through an iterative forward diffusion process. A reverse diffusion
process is then learned that restores structure in data, yielding a highly
flexible and tractable generative model of the data. This approach makes it
possible to rapidly learn, sample from, and evaluate probabilities in deep
generative models with thousands of layers or time steps, as well as to compute
conditional and posterior probabilities under the learned model.

The method uses a Markov chain to gradually convert one distribution into
another, an idea used in non-equilibrium statistical physics and sequential Monte
Carlo. A generative Markov diffusion chain is built which converts a simple
known distribution (e.g. a Gaussian) into a target (data) distribution using a
diffusion process. Learning in this framework involves estimating small
perturbations to a diffusion process. Estimating small perturbations is more
tractable than explicitly describing the full
distribution with a single, non-analytically-normalizable, potential function.
Furthermore, since a diffusion process exists for any smooth target distribution,
this method can capture data distributions of arbitrary form.
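
The forward half of this idea can be illustrated numerically. The sketch below is
only an illustration under stated assumptions, not the authors' code: the
one-dimensional bimodal toy data, the Gaussian kernel, the constant noise rate
`beta` and the function name `forward_diffusion` are all chosen here for
demonstration. It shows a Markov chain gradually converting a structured
distribution into an approximately standard Gaussian.

```python
import numpy as np

def forward_diffusion(x0, beta, num_steps, rng):
    """Apply x_t = sqrt(1 - beta) * x_{t-1} + sqrt(beta) * noise for num_steps steps."""
    x = x0.copy()
    for _ in range(num_steps):
        noise = rng.standard_normal(x.shape)
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * noise
    return x

rng = np.random.default_rng(0)
# Toy "data": two tight clusters around -2 and +2 (an assumption for illustration).
x0 = np.concatenate([rng.normal(-2.0, 0.1, 5000), rng.normal(2.0, 0.1, 5000)])
xT = forward_diffusion(x0, beta=0.05, num_steps=200, rng=rng)

print("data:     mean %.2f, std %.2f" % (x0.mean(), x0.std()))
print("diffused: mean %.2f, std %.2f" % (xT.mean(), xT.std()))
```

The generative model is the learned reverse of this chain; the reverse
transitions are what the paper trains, and they are not shown here.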

Fig 1:- (a) A bark image. (b) The same image with the central 100×100 pixel region
replaced with isotropic Gaussian noise. (c) The central 100×100 region has been
inpainted using a diffusion probabilistic model trained on images of bark, by
sampling from the posterior distribution over the missing
region conditioned on the rest of the image.

2.2 Denoising Diffusion Probabilistic Models (2020)
Jonathan Ho, Ajay Jain, Pieter Abbeel

This paper presents progress in diffusion probabilistic models. A diffusion
probabilistic model is a parameterized Markov chain trained using variational
inference to produce samples matching the data after finite time. Transitions of
this chain are learned to reverse a diffusion process, which is a Markov chain that
gradually adds noise to the data in the opposite direction of sampling until signal
is destroyed. When the diffusion consists of small amounts of Gaussian noise, it
is sufficient to set the sampling chain transitions to conditional Gaussians too,
allowing for a particularly simple neural network parameterization.

Fig 2:- The directed graphical model of diffusion model

Given a data point sampled from the real data distribution, x_0 ∼ q(x), a forward
diffusion process is defined in which a small amount of Gaussian noise is added to
the sample over T steps, producing a sequence of noisy samples x_1, …, x_T. The
step sizes are controlled by a variance schedule {β_t ∈ (0,1)}, t = 1, …, T. The
reverse conditional q(x_{t−1} ∣ x_t) is then approximated with a parameterized
model p_θ (a neural network) in order to get back the original image.
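
The forward process described above can be sketched in a few lines. This is a
minimal sketch, not the authors' implementation. It relies on two things the
text above does not state, both taken from the DDPM paper: the closed form
x_t = sqrt(ᾱ_t) x_0 + sqrt(1 − ᾱ_t) ε with ᾱ_t the cumulative product of
(1 − β_s), and a linear schedule from 1e-4 to 0.02 over T = 1000 steps. The
function names and the random array standing in for an image batch are
hypothetical.

```python
import numpy as np

def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear variance schedule and cumulative products alpha_bar_t."""
    betas = np.linspace(beta_start, beta_end, T)
    alpha_bars = np.cumprod(1.0 - betas)
    return betas, alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t ~ q(x_t | x_0) in one shot (t is 0-indexed); returns (x_t, eps)."""
    eps = rng.standard_normal(x0.shape)
    a_bar = alpha_bars[t]
    xt = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps
    return xt, eps

def simple_loss(eps_pred, eps):
    """Mean squared error between predicted and true noise."""
    return np.mean((eps_pred - eps) ** 2)

rng = np.random.default_rng(0)
betas, alpha_bars = make_schedule()
x0 = rng.uniform(-1.0, 1.0, size=(4, 8, 8))   # stand-in for a batch of images
xt, eps = q_sample(x0, t=999, alpha_bars=alpha_bars, rng=rng)
print("alpha_bar at the last step:", alpha_bars[-1])
```

In training, a network would receive `xt` and `t` and its output would be passed
to `simple_loss` as `eps_pred`; the network itself is omitted here.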

Diffusion models are straightforward to define and efficient to train, but this
paper demonstrates for the first time that they are capable of generating
high-quality samples, better than many other types of generative models. It
further shows that a certain parameterization of diffusion models reveals an
equivalence with denoising score matching over multiple noise levels during
training and with annealed Langevin dynamics during sampling. The paper finds
that the majority of the models’ lossless codelengths are consumed in describing
imperceptible image details, presents a more refined analysis in the language of
lossy compression, and shows that the sampling procedure of diffusion models is
a type of progressive decoding that resembles autoregressive decoding along a
bit ordering that vastly generalizes what is normally possible with
autoregressive models.

Fig 3:- Generated samples on CelebA-HQ 256 x 256

2.3 High-Resolution Image Synthesis with Latent Diffusion
Models (2021)
Robin Rombach, Andreas Blattmann, Dominik Lorenz,
Patrick Esser, Bjorn Ommer

By decomposing the image formation process into a sequential application of
denoising autoencoders, diffusion models (DMs) achieve state-of-the-art
synthesis results on image data and beyond. However, since these models
typically operate directly in pixel space, optimization of powerful DMs often
consumes hundreds of GPU days and inference is expensive due to sequential
evaluations.

To enable DM training on limited computational resources while retaining their
quality and flexibility, this paper presents Latent Diffusion Models (LDMs), in
which the diffusion model is applied in the latent space of powerful pretrained
autoencoders. Training diffusion models on such a representation makes it
possible, for the first time, to reach a near-optimal point between complexity
reduction and detail preservation, greatly boosting visual fidelity. LDMs also
introduce cross-attention layers into the model architecture, which turn
diffusion models into powerful and flexible generators for general conditioning
inputs such as text or bounding boxes, and high-resolution synthesis becomes
possible in a convolutional manner.

Learning can be roughly divided into two stages:


• First is a perceptual compression stage which removes high-frequency
details but still learns little semantic variation.
• In the second stage, the actual generative model learns the semantic and
conceptual composition of the data (semantic compression).

The aim is therefore to first find a perceptually equivalent, but computationally
more suitable space, in which the diffusion models will be trained for
high-resolution image synthesis. The training is separated into two distinct
phases:

• First, an autoencoder is trained which provides a lower-dimensional (and
thereby efficient) representational space which is perceptually equivalent
to the data space.
• Second, DMs are trained in this latent space, which exhibits better scaling
properties with respect to the spatial dimensionality. The reduced complexity
also provides efficient image generation from the latent space with a single
network pass.

Fig 4

With the trained perceptual compression models consisting of an encoder E and a
decoder D, an efficient, low-dimensional latent space is available in which
high-frequency, imperceptible details are abstracted away. Compared to the
high-dimensional pixel space, this space is more suitable for likelihood-based
generative models, as they can now (i) focus on the important, semantic bits of
the data and (ii) train in a lower dimensional, computationally much more
efficient space.
The neural backbone of the model is realized as a time-conditional UNet. Since
the forward process is fixed, z_t can be efficiently obtained from E during
training, and samples from p(z) can be decoded to image space with
a single pass through D.
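
To give a feel for the saving, the following sketch compares the number of values
a diffusion model has to process in pixel space and in latent space. It is only
arithmetic under assumptions: the spatial downsampling factor `f`, the latent
channel count `c`, the three-channel input and both function names are
illustrative choices made here, not values fixed by the text above.

```python
def latent_shape(height, width, f, c):
    """Shape of z = E(x) for an H x W image, assuming the encoder shrinks each
    spatial side by a factor f and outputs c channels."""
    if height % f or width % f:
        raise ValueError("image sides must be divisible by f")
    return (height // f, width // f, c)

def compression_ratio(height, width, f, c, image_channels=3):
    """How many times fewer values the latent holds than the input image."""
    h, w, c_lat = latent_shape(height, width, f, c)
    return (height * width * image_channels) / (h * w * c_lat)

shape = latent_shape(256, 256, f=8, c=4)
ratio = compression_ratio(256, 256, f=8, c=4)
print(shape, ratio)   # (32, 32, 4) 48.0
```

Under these assumed settings every denoising step operates on 48 times fewer
values than it would in pixel space.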

2.4 Attention Is All You Need (2017)
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit,
Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin

Recurrent neural networks, long short-term memory and gated recurrent neural
networks in particular, have been firmly established as state-of-the-art
approaches in sequence modeling and transduction problems such as language
modeling and machine translation. Recurrent models typically factor computation
along the symbol positions of the input and output sequences. Aligning the
positions to steps in computation time, they generate a sequence of hidden
states h_t, as a function of the previous hidden state h_{t−1} and the input for
position t. This inherently sequential nature precludes parallelization within
training examples, which becomes critical at longer sequence lengths, as memory
constraints limit batching across examples.

This paper proposes the Transformer, a model architecture eschewing
recurrence and instead relying entirely on an attention mechanism to draw global
dependencies between input and output.
The Transformer allows for significantly more parallelization and can reach a
new state of the art in translation quality after being trained with limited
resources and time.

Self-attention, sometimes called intra-attention, is an attention mechanism
relating different positions of a single sequence in order to compute a
representation of the sequence. Self-attention has been used successfully in a
variety of tasks including reading comprehension, abstractive summarization,
textual entailment and learning task-independent sentence representations.

Fig 5:- The Transformer – model architecture

Encoder: The encoder is composed of a stack of N = 6 identical layers. Each
layer has two sub-layers. The first is a multi-head self-attention mechanism, and
the second is a simple, position-wise fully connected feed-forward network. A
residual connection is employed around each of the two sub-layers, followed by
layer normalization. That is, the output of each sub-layer is LayerNorm(x +
Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer
itself. To facilitate these residual connections, all sub-layers in the model, as
well as the embedding layers, produce outputs of dimension d_model = 512.

Decoder: The decoder is also composed of a stack of N = 6 identical layers. In
addition to the two sub-layers in each encoder layer, the decoder inserts a third
sub-layer, which performs multi-head attention over the output of the encoder
stack. Similar to the encoder, residual connections are employed
around each of the sub-layers, followed by layer normalization. The
self-attention sub-layer in the decoder stack is also modified to prevent
positions from attending to subsequent positions. This masking, combined with the
fact that the output embeddings are offset by one position, ensures that the
predictions for position i can depend only on the known outputs at positions
less than i.

Attention:

Fig 6
An attention function can be described as mapping a query and a set of key-value
pairs to an output, where the query, keys, values, and output are all vectors. The
output is computed as a weighted sum of the values, where the weight assigned
to each value is computed by a compatibility function of the query with the
corresponding key.

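In the scaled dot-product form used by the Transformer, the output is computed as

Attention(Q, K, V) = softmax(QK^T / √d_k) V

where Q, K and V are matrices packing the queries, keys and values respectively,
and d_k is the dimensionality of the keys.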
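
The weighted-sum description of attention can be sketched in NumPy as follows.
This is a minimal single-head sketch, assuming the scaled dot-product
compatibility function with a softmax from the Transformer paper; the function
names, the toy sizes and the optional `causal` flag (which mimics the decoder
masking) are illustrative choices made here.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)   # subtract the max for stability
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, causal=False):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v). Returns (output, weights)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # compatibility of each query with each key
    if causal:
        # Mask the strict upper triangle so position i cannot attend to later positions.
        n_q, n_k = scores.shape
        mask = np.triu(np.ones((n_q, n_k), dtype=bool), k=1)
        scores = np.where(mask, -np.inf, scores)
    weights = softmax(scores, axis=-1)            # each row sums to 1
    return weights @ V, weights                   # weighted sum of the values

rng = np.random.default_rng(0)
n, d_k, d_v = 5, 8, 4
Q = rng.standard_normal((n, d_k))
K = rng.standard_normal((n, d_k))
V = rng.standard_normal((n, d_v))
out, w = scaled_dot_product_attention(Q, K, V)
out_c, w_c = scaled_dot_product_attention(Q, K, V, causal=True)
```

With `causal=True` the first output row equals the first value vector, since the
first position can attend only to itself.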

2.5 Efficient Attention: Attention with Linear Complexities
(2020)
Zhuoran Shen, Mingyuan Zhang, Haiyu Zhao, Shuai Yi,
Hongsheng Li

Dot-product attention has wide applications in computer vision and natural
language processing. However, its memory and computational costs grow
quadratically with the input size. Such growth prohibits its application on
high-resolution inputs. This paper presents a novel efficient attention mechanism
equivalent to dot-product attention but with substantially lower memory and
computational costs. Its resource efficiency allows more widespread and flexible
integration of attention modules into a network, which leads to better accuracies.

Fig 7:- Illustration of the architecture of dot-product and efficient attention.
Each box represents an input, output, or intermediate
matrix. Above it is the name of the corresponding matrix, and inside are the
variable name and the size of the matrix. ρ, ρ_q, ρ_k are
the normalizers on S, Q, K, respectively. n, d, d_k, d_v are the input size and the
dimensionalities of the input, the keys, and the values,
respectively. The multiplication symbol in the figure denotes matrix
multiplication. When ρ, ρ_q, ρ_k implement scaling
normalization, the efficient attention mechanism is mathematically equivalent to
dot-product attention. When they implement softmax normalization, the two
mechanisms are approximately equivalent.

The need for global dependency modeling on large inputs motivates the
exploration for a resource-efficient attention mechanism. Putting aside the
normalization, dot-product attention involves two consecutive matrix
multiplications. The first one (S = QK^T) computes pairwise similarities between
pixels and forms per-pixel attention maps. The second (D = SV) aggregates the
values V by the per-pixel attention maps to produce the output. Since matrix
multiplication is associative, switching the order from (QK^T)V to Q(K^T V) has
no impact on the effect but changes the complexities from O(n^2) to O(d_k d_v),
for n the input size and d_k, d_v the dimensionalities of the keys and the values,
respectively. This change removes the O(n^2) terms in the complexities of the
module, making it linear in complexities. Further, d_k d_v is significantly less
than n^2 in practical cases, hence this new term will not become a new bottleneck.
Therefore, switching the order of multiplication to Q(K^T V) results in a
substantially more efficient mechanism, which this paper names efficient
attention.
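
The associativity argument can be checked numerically. The sketch below is an
illustration only: it assumes a simple scaling normalization (division by n) so
that the two orderings are exactly equal, and the function names and toy sizes
are chosen here. It also reports the size of the intermediate matrix each
ordering has to store.

```python
import numpy as np

def dot_product_attention(Q, K, V):
    """(Q K^T) V with scaling normalization; intermediate matrix is n x n."""
    n = Q.shape[0]
    S = (Q @ K.T) / n
    return S @ V, S.shape

def efficient_attention(Q, K, V):
    """Q (K^T V) with the same scaling; intermediate matrix is d_k x d_v."""
    n = Q.shape[0]
    G = (K.T @ V) / n          # global context matrix
    return Q @ G, G.shape

rng = np.random.default_rng(0)
n, d_k, d_v = 64, 8, 16
Q = rng.standard_normal((n, d_k))
K = rng.standard_normal((n, d_k))
V = rng.standard_normal((n, d_v))

out_dp, shape_dp = dot_product_attention(Q, K, V)
out_eff, shape_eff = efficient_attention(Q, K, V)
print("intermediate sizes:", shape_dp, "vs", shape_eff)
print("outputs match:", np.allclose(out_dp, out_eff))
```

With softmax normalization the two orderings are only approximately equivalent,
as the figure caption notes, so this exact check applies to the scaling case
alone.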

The efficient attention brings a new interpretation to the attention mechanism.
Assuming the keys are of dimensionality d_k and the input size is n, one can
interpret the d_k × n key matrix as d_k template attention maps, each
corresponding to a semantic aspect of the input. Then, the query at each pixel
is d_k coefficients for each of the d_k template attention maps, respectively.
Under this interpretation, efficient and dot-product attention differ in that
dot-product attention first synthesizes the pixel-wise attention maps from the
coefficients and lets each pixel aggregate the values with its own attention
map, while efficient attention first aggregates the values by the template
attention maps to form template outputs (i.e. global context vectors) and lets
each pixel aggregate the template outputs.

Efficient attention brought substantial performance boosts to tasks such as
object detection and instance segmentation (on MS-COCO 2017).

2.6 U-Net: Convolutional Networks for Biomedical Image
Segmentation (2015)
Olaf Ronneberger, Philipp Fischer, and Thomas Brox

There is large consent that successful training of deep networks requires many
thousand annotated training samples. The typical use of convolutional networks
is on classification tasks, where the output to an image is a single class label.
However, in many visual tasks, especially in biomedical image processing, the
desired output should include localization, i.e., a class label is supposed to be
assigned to each pixel. Moreover, thousands of training images are usually
beyond reach in biomedical tasks.

Hence, Ciresan et al. trained a network in a sliding-window setup to predict the
class label of each pixel by providing a local region (patch) around that pixel
as input. First, this network can localize. Secondly, the training data in terms
of patches is much larger than the number of training images. However, this
strategy has two drawbacks:
• First, it is quite slow because the network must be run separately for each
patch, and there is a lot of redundancy due to overlapping patches.
• Secondly, there is a trade-off between localization accuracy and the use of
context. Larger patches require more max-pooling layers that reduce the
localization accuracy, while small patches allow the network to see only
little context.

This paper presents a network and training strategy that relies on the strong use
of data augmentation to use the available annotated samples more efficiently. It
is shown that the proposed network can be trained end-to-end from very few
images and outperforms the prior best method (a sliding-window convolutional
network) on the ISBI challenge for segmentation of neuronal structures in
electron microscopic stacks.

The main idea is to supplement a usual contracting network by successive
layers, where pooling operators are replaced by upsampling operators. Hence,
these layers increase the resolution of the output. In order to localize, high
resolution features from the contracting path are combined with the upsampled
output. A successive convolution layer can then learn to assemble a more precise
output based on this information. The upsampling part also has a large
number of feature channels, which allow the network to propagate context
information to higher resolution layers. As a consequence, the expansive path is
more or less symmetric to the contracting path, and yields a u-shaped architecture.
The resulting network is applicable to various biomedical segmentation
problems.
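
The combination of contracting-path features with the upsampled output can be
sketched with plain array operations. This is a shape-level illustration under
simplifying assumptions, not the U-Net implementation: the convolutions are
omitted, nearest-neighbour upsampling stands in for the learned up-convolution,
no cropping is needed because nothing shrinks the feature maps, and all
function names and sizes are chosen here.

```python
import numpy as np

def max_pool_2x2(x):
    """x: (C, H, W) with even H and W -> (C, H/2, W/2)."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

def upsample_2x(x):
    """Nearest-neighbour upsampling: (C, H, W) -> (C, 2H, 2W)."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

rng = np.random.default_rng(0)
enc = rng.standard_normal((64, 32, 32))       # high-resolution contracting-path features
bottleneck = max_pool_2x2(enc)                # (64, 16, 16): more context, less resolution
up = upsample_2x(bottleneck)                  # (64, 32, 32): back to the skip resolution
merged = np.concatenate([enc, up], axis=0)    # (128, 32, 32): skip connection by concatenation
print(bottleneck.shape, up.shape, merged.shape)
```

A convolution applied to `merged` would then see both the precise locations
from `enc` and the wider context carried by `up`.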

Fig 8:- U-net architecture (example for 32x32 pixels in the lowest
resolution). Each blue box corresponds to a multi-channel feature map. The
number of channels is denoted on top of the box. The x-y-size is provided at
the lower left edge of the box. White boxes represent copied feature maps.
The arrows denote the different operations.

2.7 Micro-Batch Training with Batch-Channel Normalization
and Weight Standardization (2019)
Siyuan Qiao, Huiyu Wang, Chenxi Liu, Wei Shen, and Alan Yuille

Batch Normalization (BN) has become an out-of-box technique to improve deep
network training. However, its effectiveness is limited for micro-batch training,
i.e., each GPU typically has only 1-2 images for training, which is inevitable for
many computer vision tasks, e.g., object detection and semantic segmentation,
constrained by memory consumption. It is found that channel-based
normalization methods, such as Layer Normalization (LN) and Group
Normalization (GN), are unable to keep the network far from elimination
singularities because they lack batch knowledge.

Fig 9:- A 3D Visualization of various in-layer normalization techniques

To address this issue, Weight Standardization (WS) and Batch-Channel
Normalization (BCN) are proposed to bring two success factors of BN into
micro-batch training:

• The smoothing effect on the loss landscape: BN makes the landscape of the
corresponding optimization problem significantly smoother, and is thus able to
stabilize the training process and accelerate the convergence of training.
• The ability to avoid harmful elimination singularities along the training
trajectory: elimination singularities are the points along the training
trajectory where neurons in the network get eliminated. (The original definition
of elimination singularities is based on weights: if w_c denotes the
weights that take the channel c as input, then an elimination singularity is
encountered when w_c = 0.) Eliminable neurons waste computation and decrease
the effective model complexity. Getting closer to them harms the training
speed and the final performance. By forcing each neuron to have zero mean and
unit variance, BN keeps the network far from elimination
singularities caused by non-linear activation functions.

WS standardizes the weights in convolutional layers and BCN leverages
estimated batch statistics of the activations in convolutional layers.
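
Standardizing convolutional weights can be sketched as follows. This is a
minimal sketch rather than the authors' code: it assumes weights stored as
(C_out, C_in, kH, kW) and statistics computed separately for each output
channel, and the function name and the `eps` constant are chosen here. BCN is
not shown.

```python
import numpy as np

def standardize_weights(w, eps=1e-5):
    """Give each output channel's filter zero mean and (approximately) unit variance.

    w: array of shape (C_out, C_in, kH, kW).
    """
    mean = w.mean(axis=(1, 2, 3), keepdims=True)
    var = w.var(axis=(1, 2, 3), keepdims=True)
    return (w - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 3, 3, 3)) * 0.5 + 3.0   # deliberately shifted and scaled
w_hat = standardize_weights(w)
print(w_hat.mean(axis=(1, 2, 3)).round(6))
print(w_hat.std(axis=(1, 2, 3)).round(3))
```

Because the operation acts on weights rather than activations, it does not
depend on how many images are in the batch.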

2.8 Group Normalization (2018)
Yuxin Wu , Kaiming He

Batch Normalization (BN) is a milestone technique in the development of deep
learning, enabling various networks to train. However, normalizing along the
batch dimension introduces problems — BN’s error increases rapidly when the
batch size becomes smaller, caused by inaccurate batch statistics estimation. This
limits BN’s usage for training larger models and transferring features to computer
vision tasks including detection, segmentation, and video, which require small
batches constrained by memory consumption.

In this paper, a new normalization method called Group Normalization (GN) is
presented as a simple alternative to batch normalization. GN divides
the channels into groups and computes within each group the mean and variance
for normalization. GN’s computation is independent of batch sizes, and its
accuracy is stable in a wide range of batch sizes.
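
The grouping described above can be sketched in NumPy. This is a minimal
sketch under assumptions: the input layout (N, C, H, W), the function name and
the `eps` constant are chosen here, and the learnable per-channel scale and
shift that normally follow the normalization are left out.

```python
import numpy as np

def group_norm(x, num_groups, eps=1e-5):
    """Normalize x of shape (N, C, H, W) within groups of channels, per sample."""
    n, c, h, w = x.shape
    if c % num_groups:
        raise ValueError("C must be divisible by num_groups")
    g = x.reshape(n, num_groups, c // num_groups, h, w)
    mean = g.mean(axis=(2, 3, 4), keepdims=True)   # statistics never cross the batch axis
    var = g.var(axis=(2, 3, 4), keepdims=True)
    g = (g - mean) / np.sqrt(var + eps)
    return g.reshape(n, c, h, w)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 5, 5)) * 3.0 + 2.0
y = group_norm(x, num_groups=4)
# The result for one sample does not change when the rest of the batch is removed.
print(np.allclose(group_norm(x[:1], num_groups=4), y[:1]))
```

The final check illustrates the batch-size independence claimed in the text:
each sample is normalized using only its own values.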

Fig 10:- ImageNet classification error vs. batch sizes. This is
a ResNet-50 model trained in the ImageNet training set using 8
workers (GPUs), evaluated in the validation set.

Fig 11:- Normalization methods:- Each subplot shows a feature map tensor,
with N as the batch axis, C as the channel axis, and (H, W)
as the spatial axes. The pixels in blue are normalized by the same mean and
variance, computed by aggregating the values of these pixels.

Fig 12:- Comparison using a batch size of 32 images per GPU in ImageNet.
Validation error VS the numbers of training epochs is shown. The model is
ResNet-50.

2.9 Deep Residual Learning for Image Recognition (2015)
Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun

Deep convolutional neural networks have led to a series of breakthroughs for
image classification. Network depth is of crucial importance, and the leading
results on the challenging ImageNet dataset all exploit “very deep” models, with
a depth of sixteen to thirty.

Fig 13:- Training error (left) and test error (right) on CIFAR-10
with 20-layer and 56-layer “plain” networks. The deeper network
has higher training error, and thus test error.

When deeper networks are able to start converging, a degradation problem is
exposed: with the network depth increasing, accuracy gets saturated (which
might be unsurprising) and then degrades rapidly. This paper addresses the
degradation problem by introducing a deep residual learning framework.

Fig 14:- Residual learning: a building block.

Formally, denoting the desired underlying mapping as H(x), the stacked nonlinear
layers are made to fit another mapping, F(x) := H(x) − x. The original mapping is
recast into F(x) + x. It is hypothesized that it is easier to optimize the
residual mapping than to optimize the original, unreferenced mapping. To the
extreme, if an identity mapping were optimal, it would be easier to push the
residual to zero than to fit an identity mapping by a stack of nonlinear layers.

The formulation of F(x) + x can be realized by feedforward neural networks with
“shortcut connections”. Shortcut connections are those skipping one or more
layers. In this case, the shortcut connections simply perform identity mapping,
and their outputs are added to the outputs of the stacked layers. Identity shortcut
connections add neither extra parameters nor computational complexity. The
entire network can still be trained end-to-end by SGD with backpropagation.
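
The recast mapping F(x) + x can be sketched as follows. This is an illustrative
sketch, not the paper's block: it assumes a residual function made of two fully
connected layers instead of convolutions, omits the activation usually applied
after the addition, and uses names and sizes chosen here. Setting the last
layer's weights to zero shows how pushing the residual to zero yields the
identity mapping.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_function(x, w1, w2):
    """F(x): two stacked layers with a nonlinearity in between."""
    return relu(x @ w1) @ w2

def residual_block(x, w1, w2):
    """y = F(x) + x; the identity shortcut adds no parameters."""
    return residual_function(x, w1, w2) + x

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16))
w1 = rng.standard_normal((16, 16)) * 0.1
w2_zero = np.zeros((16, 16))                   # residual pushed to zero
w2 = rng.standard_normal((16, 16)) * 0.1

y_identity = residual_block(x, w1, w2_zero)
y = residual_block(x, w1, w2)
print("identity recovered:", np.allclose(y_identity, x))
```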

Fig 15:- Training on ImageNet. Thin curves denote training error, and bold
curves denote validation error of the center crops. Left: plain
networks of 18 and 34 layers. Right: ResNets of 18 and 34 layers. In this plot,
the residual networks have no extra parameter compared to
their plain counterparts.

2.10 DIFFUSION MODELS FOR MEDICAL IMAGE
ANALYSIS: A COMPREHENSIVE SURVEY (2022)
Amirhossein Kazerouni, Ehsan Khodapanah Aghdam, Moein
Heidari, Reza Azad, Mohsen Fayyaz, Ilker Hacihaliloglu, Dorit
Merhof

This paper provides a comprehensive survey on diffusion models with focus on
applications in medical image analysis. A systematic taxonomy of diffusion
models in the medical domain is provided and a multi-perspective categorization
based on their application, imaging modality, organ of interest, and algorithms is
proposed.

Generative models must typically meet key requirements to be adopted in
real-world problems. These requirements include (i) high-quality sampling,
(ii) mode coverage and sample diversity, and (iii) fast execution time and
computationally inexpensive sampling. GANs are capable of generating
high-quality samples rapidly, but they have poor mode coverage and tend to lack
sample diversity. A common concern with GANs is their training dynamics, which
have been recognized as being unstable, resulting in deficiencies such as mode
collapse, vanishing gradients, and failure to converge.

Recently, diffusion models have emerged as powerful generative models.
Fundamentally, diffusion models work by destroying training data through the
successive addition of Gaussian noise and then learning to recover the data by
reversing this noising process. A diffusion probabilistic model defines a forward
diffusion stage where the input data is gradually perturbed over several steps by
adding Gaussian noise and then learns to reverse the diffusion process to retrieve
the desired noise-free data from noisy data samples.

Generative models have significantly impacted the field of medical imaging,
where there is a strong need for tools to improve the routines of clinicians and
patients. Concretely, the complexity of data collection procedures, the lack of
experts, privacy concerns, and the compulsory requirement of authorization from
patients create a major bottleneck in the annotation
process in medical imaging. This is where generative models become
advantageous. With their ability to produce a limitless source of unique instances
of different medical imaging modalities, diffusion models can satisfy educational
demands by constructing distinct synthetic samples for teaching and practice.
Additionally, these artificial images can mitigate data
security concerns associated with using patient data in public settings. Hence,
using diffusion models to generate synthetic samples can alleviate the problem of
medical data scarcity to a great extent.

Fig 16:- Iteratively applying diffusion models using an unconditional model
encodes the input image into a latent space. Then, reversing the diffusion
process from the latent space decodes a healthy state image. The decoding
process is guided by conditioning it on the healthy state. The anomaly heatmap
is generated by subtracting the input image from the generated counterfactual.

The paper divides the applications of diffusion models into seven categories:

• Anomaly Detection
• Denoising
• Reconstruction
• Segmentation
• Image-to-Image Translation
• Image Generation
• Other Applications and Multi-tasks

Some of the primary concerns and limitations of diffusion models are their slow
speed and required computational cost. Several methods have been developed to
address these drawbacks. With their remarkable results, diffusion models have
proven that they can be a powerful competitor against other generative models.

Fig 17:- Simple diffusion-based ecosystem in the medical domain.

3. Scope of Work

Image synthesis tasks are generally performed by deep generative models such as
GANs, VAEs, and autoregressive models. Generative adversarial networks
(GANs) have been a research area of much focus in the last few years due to the
quality of output they produce. Another area of research that has attracted
interest is diffusion models. Both have found wide usage in the field of
image, video and voice generation. A GAN is an algorithmic architecture that uses
two neural networks set one against the other to generate newly
synthesized instances of data that can pass for real data. Diffusion models have
become increasingly popular as they provide training stability as well as quality
results on image and audio generation.

Though GANs form the framework for image synthesis in a vast section of
models, they do come with some disadvantages that researchers are actively
working on.

• Vanishing gradients: If the discriminator is too good, generator
training can fail due to vanishing gradients.
• Mode collapse: If a generator produces an especially plausible output, it
can learn to produce only that output. If this happens, the discriminator’s
best strategy is to learn to always reject that output. Google adds, “But if
the next generation of discriminator gets stuck in a local minimum and
doesn’t find the best strategy, then it’s too easy for the next generator
iteration to find the most plausible output for the current discriminator.”
• Failure to converge: GAN training also frequently fails to converge.
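
The vanishing-gradient issue can be illustrated with a small numerical sketch. It assumes the discriminator outputs D = sigmoid(a) for a logit a on a generated sample, and compares the gradient of the original (saturating) generator loss log(1 - D) with that of the commonly used non-saturating alternative -log(D). The function names are hypothetical and chosen only for this illustration.

```python
import math


def sigmoid(a):
    """Logistic function; D = sigmoid(a) is the discriminator's 'real' probability."""
    return 1.0 / (1.0 + math.exp(-a))


def saturating_grad(a):
    """d/da of log(1 - sigmoid(a)), the original minimax generator loss."""
    return -sigmoid(a)


def non_saturating_grad(a):
    """d/da of -log(sigmoid(a)), the non-saturating generator loss."""
    return -(1.0 - sigmoid(a))


# A very negative logit means the discriminator confidently rejects the sample.
for logit in (0.0, -5.0, -10.0):
    print(logit, saturating_grad(logit), non_saturating_grad(logit))
```

As the discriminator grows more confident (more negative logit), the saturating gradient shrinks towards zero while the non-saturating one stays close to -1, which is why the latter is often preferred in practice.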

Physics-inspired diffusion models have ascended to state-of-the-art performance
in several domains, powering models like Stable Diffusion, DALL-E 2, and
Imagen. A paper titled ‘Diffusion Models Beat GANs on Image Synthesis’
([11]) by OpenAI researchers has shown that diffusion models can achieve image
sample quality superior to that of the then state-of-the-art GANs.

Researchers from MIT have recently unveiled a new physics-inspired generative
model, this time drawing inspiration from the field of electrostatics. This new
type of model, the Poisson Flow Generative Model (PFGM) ([12],[20]), treats
the data points as charged particles. By following the electric field generated by
the data points, PFGMs can create entirely novel data. Below we see images of
faces generated with a PFGM:

Fig 18:- CelebA images generated with a PFGM

PFGMs constitute an exciting foundation for new avenues of research, especially
given that they are 10-20 times faster than diffusion models on image
generation tasks, with comparable performance.

4. Work Done

Implementation of a Diffusion Model (DDPM) from scratch, trained on the
Fashion MNIST dataset: DDPM Implementation

A gentle description of the implementation ([14]):-

First, a number of building blocks are defined: position embeddings ([6]),
ResNet blocks ([10]), attention ([6],[7],[18]) and group normalization ([9]).
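
As an illustration of the first building block, the following is a minimal NumPy sketch of sinusoidal position embeddings in the style of ([6]), used here to encode the noise level (time step). The function name, the embedding width of 32 and the example time steps are assumptions made for this sketch, not details taken from the implementation described here.

```python
import math

import numpy as np


def sinusoidal_position_embeddings(timesteps, dim):
    """Map integer time steps to vectors of length `dim` (dim even, >= 4).

    The first half of each vector holds sines, the second half cosines,
    with geometrically spaced frequencies as in the Transformer encoding.
    """
    half = dim // 2
    freqs = np.exp(-math.log(10000.0) * np.arange(half) / (half - 1))
    # Accept shape (batch,) or (batch, 1) by flattening to a column.
    args = np.asarray(timesteps, dtype=np.float64).reshape(-1, 1) * freqs.reshape(1, -1)
    return np.concatenate([np.sin(args), np.cos(args)], axis=-1)


# Hypothetical usage: three noise levels embedded into 32 dimensions.
emb = sinusoidal_position_embeddings(np.array([0, 10, 999]), dim=32)
print(emb.shape)
```

Each time step receives a distinct, smoothly varying vector, which lets a single network share its parameters across all noise levels.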

Then the neural network is defined. The job of the network ϵ_θ(x_t, t) is to take
in a batch of noisy images and their respective noise levels, and output the
noise added to the input. More formally:

The network takes a batch of noisy images of shape (batch_size,
num_channels, height, width) and a batch of noise levels of shape
(batch_size, 1) as input, and returns a tensor of shape (batch_size,
num_channels, height, width).

The DDPM authors ([2]) used a U-Net model ([5]) for this purpose.

The network is built up as follows:

• first, a convolutional layer is applied to the batch of noisy images,
and position embeddings are computed for the noise levels
• next, a sequence of downsampling stages is applied. Each
downsampling stage consists of 2 ResNet blocks + groupnorm +
attention + residual connection + a downsample operation
• in the middle of the network, ResNet blocks are applied again,
interleaved with attention
• next, a sequence of upsampling stages is applied. Each upsampling
stage consists of 2 ResNet blocks + groupnorm + attention + residual
connection + an upsample operation
• finally, a ResNet block followed by a convolutional layer is applied.

Defining the Forward Diffusion process:- The forward diffusion process
gradually adds noise to an image from the real distribution, in a number of
time steps T. This happens according to a variance schedule. The original
DDPM authors employed a linear schedule. However, it was shown in
(Nichol et al., 2021 ([19])) that better results can be achieved when
employing a cosine schedule.
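
The two variance schedules and the closed-form forward step can be sketched in NumPy as follows. This is a minimal illustration, not the code of the implementation described here: the function names are hypothetical, the linear range (1e-4 to 0.02 over T = 1000 steps) follows the values reported in ([2]), and the offset s = 0.008 and the clipping at 0.999 follow ([19]).

```python
import numpy as np


def linear_beta_schedule(timesteps, beta_start=1e-4, beta_end=0.02):
    """Linearly increasing variances beta_1 .. beta_T."""
    return np.linspace(beta_start, beta_end, timesteps)


def cosine_beta_schedule(timesteps, s=0.008):
    """Betas derived from a squared-cosine curve for the cumulative alphas."""
    steps = np.arange(timesteps + 1) / timesteps
    alpha_bar = np.cos((steps + s) / (1 + s) * np.pi / 2) ** 2
    alpha_bar = alpha_bar / alpha_bar[0]
    betas = 1.0 - alpha_bar[1:] / alpha_bar[:-1]
    return np.clip(betas, 0.0, 0.999)


def q_sample(x0, t, alphas_cumprod, noise):
    """Draw x_t from q(x_t | x_0) in one step:
    x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * noise."""
    signal = np.sqrt(alphas_cumprod[t]).reshape(-1, 1, 1, 1)
    sigma = np.sqrt(1.0 - alphas_cumprod[t]).reshape(-1, 1, 1, 1)
    return signal * x0 + sigma * noise


T = 1000
betas = linear_beta_schedule(T)
cos_betas = cosine_beta_schedule(T)
alphas_cumprod = np.cumprod(1.0 - betas)

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(4, 1, 28, 28))  # stand-in batch, not real data
t = np.array([0, 250, 500, 999])                   # one time step per image
xt = q_sample(x0, t, alphas_cumprod, rng.standard_normal(x0.shape))
print(xt.shape, alphas_cumprod[-1])
```

Because the cumulative product of (1 - beta) is close to zero at the final step, x_T is almost pure Gaussian noise regardless of the starting image.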

Defining the Backward Diffusion process:- The backward diffusion process
then learns to recover the original image from the noise. The backward diffusion
process employs the U-Net model defined above.
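
For reference, the simplified training objective proposed in the DDPM paper ([2]) asks the network to predict the noise that was mixed into the image. The notation below (x_0 for the clean image, ϵ for the sampled noise, β_s for the variance schedule) follows ([2]) and is assumed here rather than taken from the implementation described above.

```latex
L_{\text{simple}}(\theta)
  = \mathbb{E}_{t,\, x_0,\, \epsilon}
    \left[ \left\lVert \epsilon
      - \epsilon_\theta\!\left( \sqrt{\bar{\alpha}_t}\, x_0
      + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,\; t \right) \right\rVert^2 \right],
\qquad
\bar{\alpha}_t = \prod_{s=1}^{t} (1 - \beta_s)
```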

Sampling:- Generating new images from a diffusion model happens by
reversing the diffusion process: we start at time step T, where we sample pure
noise from a Gaussian distribution, and then use our neural network to
gradually denoise it (using the conditional probability it has learned), until
we end up at time step t=0.
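
The sampling loop can be sketched as below, following the ancestral sampling procedure of ([2]) with the variance choice σ_t² = β_t. Since no trained U-Net is available in a short example, the sketch uses a hypothetical stand-in, `toy_noise_predictor`, which is the exact noise predictor for a toy dataset containing only an all-zero image; all function names are assumptions made for this illustration.

```python
import numpy as np


def toy_noise_predictor(x_t, t, alphas_cumprod):
    """Hypothetical stand-in for the trained U-Net.

    If every training image is all zeros, then x_t = sqrt(1 - abar_t) * noise,
    so the noise can be recovered exactly by this division.
    """
    return x_t / np.sqrt(1.0 - alphas_cumprod[t])


def p_sample_loop(shape, betas, predict_noise, rng):
    """Start from Gaussian noise and denoise step by step down to index 0."""
    alphas = 1.0 - betas
    alphas_cumprod = np.cumprod(alphas)
    x = rng.standard_normal(shape)  # pure noise at the final time step
    for t in reversed(range(len(betas))):
        eps = predict_noise(x, t, alphas_cumprod)
        # Posterior mean: remove the predicted noise, then rescale.
        mean = (x - betas[t] / np.sqrt(1.0 - alphas_cumprod[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            x = mean + np.sqrt(betas[t]) * rng.standard_normal(shape)
        else:
            x = mean  # no noise is added at the last step
    return x


betas = np.linspace(1e-4, 0.02, 1000)
samples = p_sample_loop((2, 1, 28, 28), betas, toy_noise_predictor,
                        np.random.default_rng(0))
print(np.abs(samples).max())
```

With this toy predictor the loop returns images that are numerically zero, that is, it recovers the only image in the toy dataset. With a trained network in its place, the same loop would produce new samples from the learned distribution.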

Some Output Samples from my implementation:-

References

[1] Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, Surya
Ganguli, “Deep Unsupervised Learning using Nonequilibrium
Thermodynamics”, Proceedings of Machine Learning Research (PMLR).

[2] Jonathan Ho, Ajay Jain, Pieter Abbeel, “Denoising Diffusion Probabilistic
Models”, Neural Information Processing Systems (NeurIPS).

[3] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser,
Björn Ommer, “High-Resolution Image Synthesis with Latent Diffusion
Models”, arXiv:2112.10752.

[4] Amirhossein Kazerouni, Ehsan Khodapanah Aghdam, Moein Heidari,
Reza Azad, Mohsen Fayyaz, Ilker Hacihaliloglu, Dorit Merhof, “Diffusion
Models for Medical Image Analysis: A Comprehensive Survey”,
arXiv:2211.07804.

[5] Olaf Ronneberger, Philipp Fischer, Thomas Brox, “U-Net: Convolutional
Networks for Biomedical Image Segmentation”, International Conference
on Medical Image Computing and Computer-Assisted Intervention.

[6] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion
Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, “Attention Is All
You Need”, Neural Information Processing Systems (NeurIPS).

[7] Zhuoran Shen, Mingyuan Zhang, Haiyu Zhao, Shuai Yi, Hongsheng Li,
“Efficient Attention: Attention with Linear Complexities”,
arXiv:1812.01243.

[8] Siyuan Qiao, Huiyu Wang, Chenxi Liu, Wei Shen, Alan Yuille,
“Micro-Batch Training with Batch-Channel Normalization and Weight
Standardization”, arXiv:1903.10520.

[9] Yuxin Wu, Kaiming He, “Group Normalization”, arXiv:1803.08494.

[10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun, “Deep Residual
Learning for Image Recognition”, CVPR.

[11] Prafulla Dhariwal, Alex Nichol, “Diffusion Models Beat GANs on Image
Synthesis”, Neural Information Processing Systems (NeurIPS).

[12] Yilun Xu, Ziming Liu, Max Tegmark, Tommi Jaakkola, “Poisson Flow
Generative Models”, Neural Information Processing Systems (NeurIPS).

[13] Dan Ciresan, Alessandro Giusti, Luca Gambardella, Jürgen Schmidhuber,
“Deep Neural Networks Segment Neuronal Membranes in Electron
Microscopy Images”, Advances in Neural Information Processing Systems
25 (NIPS 2012).

[14] Niels Rogge, Kashif Rasul, “The Annotated Diffusion Model”.

[15] Lilian Weng, “What are Diffusion Models?”.

[16] Jay Alammar, “The Illustrated Stable Diffusion”.

[17] Sergios Karagiannakos, Nikolas Adaloglou, “How diffusion models work:
the math from scratch”.

[18] Jay Alammar, “The Illustrated Transformer”.

[19] Alexander Quinn Nichol, Prafulla Dhariwal, “Improved Denoising
Diffusion Probabilistic Models”, ICLR 2021 Conference.

[20] Ryan O'Connor, “An Introduction to Poisson Flow Generative Models”,
AssemblyAI.

[21] Ananya Arora, Anmol Arora, “Generative adversarial networks and
synthetic patient data: current challenges and future perspectives”, Future
Healthcare Journal.

[22] Cem Dilmegani, “Generative AI Healthcare Industry: Benefits,
Challenges, Potentials”, AIMultiple.
