Medical Image Synthesis using Generative Models
Submitted by
Anubhav Misra
Enrolment No. – 2022CSM015
M.Tech 2nd Semester
A Term Paper
Submitted in partial fulfillment of the requirements for the degree of
Master of Technology
(Computer Science and Engineering)
Indian Institute of Engineering Science and Technology, Shibpur
Howrah-711103
CERTIFICATE
I hereby forward the term paper entitled “Medical Image Synthesis using
Generative Models” submitted by Anubhav Misra (Enrolment No. –
2022CSM015) as a bona fide record of the project work carried out by the candidate under
my guidance and supervision, in partial fulfillment of the requirements for the award
of the degree of Master of Technology in Computer Science and Engineering from
this Institute.
___________________________________
Counter-signed by:
___________________________________
CONTENTS
1. INTRODUCTION
2. LITERATURE SURVEY
2.1 Deep Unsupervised Learning using Nonequilibrium Thermodynamics (2015)
2.2 Denoising Diffusion Probabilistic Models (2020)
2.3 High-Resolution Image Synthesis with Latent Diffusion Models (2021)
2.4 Attention Is All You Need (2017)
2.5 Efficient Attention: Attention with Linear Complexities (2020)
2.6 U-Net: Convolutional Networks for Biomedical Image Segmentation (2015)
2.7 Micro-Batch Training with Batch-Channel Normalization and Weight Standardization (2019)
2.8 Group Normalization (2018)
2.9 Deep Residual Learning for Image Recognition (2015)
2.10 Diffusion Models for Medical Image Analysis: A Comprehensive Survey (2022)
3. SCOPE OF WORK
4. WORK DONE
5. REFERENCES
1. Introduction
2. Literature Survey
2.1 Deep Unsupervised Learning using Nonequilibrium Thermodynamics (2015)
Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, Surya Ganguli
The method uses a Markov chain to gradually convert one distribution into
another, an idea drawn from non-equilibrium statistical physics and sequential
Monte Carlo. A generative Markov diffusion chain is built that converts a simple,
known distribution (e.g., a Gaussian) into the target data distribution through a
diffusion process. Learning in this framework amounts to estimating small
perturbations to a diffusion process, which is more tractable than explicitly
describing the full distribution with a single, non-analytically-normalizable
potential function. Furthermore, since a diffusion process exists for any smooth
target distribution, the method can capture data distributions of arbitrary form.
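The chain can be illustrated with a toy one-dimensional example (a minimal NumPy sketch; the bimodal "data" distribution, step count, and variance schedule are illustrative choices, not taken from the paper): repeated small Gaussian perturbations carry samples from an arbitrary data distribution towards a standard Gaussian, and the generative model is trained to traverse this chain in reverse.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "data" distribution: a bimodal mixture of two Gaussians.
x = np.concatenate([rng.normal(-2.0, 0.3, 5000), rng.normal(2.0, 0.3, 5000)])

T = 1000                            # number of diffusion steps (illustrative)
betas = np.linspace(1e-4, 0.02, T)  # illustrative variance schedule

for beta in betas:
    eps = rng.normal(size=x.shape)
    # One small Gaussian perturbation of the Markov chain:
    # x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * eps
    x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * eps

# After T steps the samples are approximately standard normal.
print(f"mean ~ {x.mean():.3f}, std ~ {x.std():.3f}")
```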
Fig 1:- (a) A bark image. (b) The same image with the central 100×100 pixel region
replaced with isotropic Gaussian noise. (c) The central 100×100 region inpainted
using a diffusion probabilistic model trained on images of bark, by sampling from
the posterior distribution over the missing region conditioned on the rest of the image.
2.2 Denoising Diffusion Probabilistic Models (2020)
Jonathan Ho, Ajay Jain, Pieter Abbeel
Given a data point sampled from the real data distribution, x_0 ∼ q(x), a forward
diffusion process is defined in which a small amount of Gaussian noise is added to
the sample over T steps, producing a sequence of noisy samples x_1, …, x_T. The step
sizes are controlled by a variance schedule {β_t ∈ (0, 1)} for t = 1, …, T. A
parameterized model p_θ (a neural network) is then trained to approximate the reverse
conditionals q(x_{t−1} | x_t), so that the original image can be recovered from noise.
Diffusion models are straightforward to define and efficient to train, but this paper
demonstrates for the first time that they can generate samples of higher quality than
many other types of generative models. It further shows that a certain parameterization
of diffusion models reveals an equivalence with denoising score matching over multiple
noise levels during training, and with annealed Langevin dynamics during sampling. The
authors found that the majority of the models' lossless codelengths are consumed
describing imperceptible image details, presented a more refined analysis in the
language of lossy compression, and showed that the sampling procedure of diffusion
models is a type of progressive decoding that resembles autoregressive decoding along
a bit ordering vastly generalizing what is normally possible with autoregressive models.
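Because the noise added at every step is Gaussian with a fixed schedule, x_t can be sampled directly from x_0 in closed form, x_t = √(ᾱ_t)·x_0 + √(1 − ᾱ_t)·ε with ᾱ_t = ∏_{s≤t}(1 − β_s), which is what makes training efficient. A minimal PyTorch sketch of this forward sampling (the linear schedule constants follow the paper; the tensor shapes are illustrative):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)                # variance schedule {beta_t}
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)   # alpha_bar_t

def q_sample(x0, t, eps):
    """Sample x_t ~ q(x_t | x_0) in closed form:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps."""
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)      # broadcast over (B, C, H, W)
    return torch.sqrt(a_bar) * x0 + torch.sqrt(1.0 - a_bar) * eps

x0 = torch.randn(8, 3, 32, 32)        # a batch standing in for real images
t = torch.randint(0, T, (8,))         # one random timestep per image
eps = torch.randn_like(x0)
xt = q_sample(x0, t, eps)             # the noisy input handed to eps_theta(x_t, t)
```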
Fig 3:- Generated samples on CelebA-HQ 256 x 256
2.3 High-Resolution Image Synthesis with Latent Diffusion
Models (2021)
Robin Rombach, Andreas Blattmann, Dominik Lorenz,
Patrick Esser, Bjorn Ommer
Fig 4
2.4 Attention Is All You Need (2017)
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit,
Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
Recurrent neural networks, long short-term memory and gated recurrent neural
networks in particular, have been firmly established as state-of-the-art approaches
to sequence modeling and transduction problems such as language modeling and
machine translation. Recurrent models typically factor computation along the
symbol positions of the input and output sequences. Aligning the positions to steps
in computation time, they generate a sequence of hidden states h_t as a function of
the previous hidden state h_{t−1} and the input for position t. This inherently
sequential nature precludes parallelization within training examples, which becomes
critical at longer sequence lengths, as memory constraints limit batching across examples.
Fig 5:- The Transformer – model architecture
In addition to the two sub-layers in each encoder layer, the decoder inserts a third
sub-layer, which performs multi-head attention over the output of the encoder stack.
Similar to the encoder, residual connections are employed around each of the
sub-layers, followed by layer normalization. The self-attention sub-layer in the
decoder stack is also modified to prevent positions from attending to subsequent
positions. This masking, combined with the fact that the output embeddings are offset
by one position, ensures that the predictions for position i can depend only on the
known outputs at positions less than i.
Attention:
Fig 6
An attention function can be described as mapping a query and a set of key-value
pairs to an output, where the query, keys, values, and output are all vectors. The
output is computed as a weighted sum of the values, where the weight assigned
to each value is computed by a compatibility function of the query with the
corresponding key.
The scaled dot-product attention is computed as Attention(Q, K, V) = softmax(QKᵀ / √d_k) V, where Q, K, and V are matrices packing the queries, keys, and values respectively.
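Written out, this is the scaled dot-product attention of the paper. A minimal single-head PyTorch sketch (tensor sizes are illustrative):

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # query-key compatibilities
    weights = F.softmax(scores, dim=-1)                # each row sums to 1
    return weights @ V                                 # weighted sum of the values

Q = torch.randn(2, 10, 64)   # (batch, positions, d_k) -- illustrative sizes
K = torch.randn(2, 10, 64)
V = torch.randn(2, 10, 64)
out = scaled_dot_product_attention(Q, K, V)   # shape (2, 10, 64)
```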
2.5 Efficient Attention: Attention with Linear Complexities
(2020)
Zhuoran Shen, Mingyuan Zhang, Haiyu Zhao, Shuai Yi,
Hongsheng Li
The need for global dependency modeling on large inputs motivates the search for a
resource-efficient attention mechanism. Putting aside the normalization, dot-product
attention involves two consecutive matrix multiplications. The first, S = QKᵀ,
computes pairwise similarities between pixels and forms per-pixel attention maps.
The second, D = SV, aggregates the values V using the per-pixel attention maps to
produce the output. Since matrix multiplication is associative, switching the order
from (QKᵀ)V to Q(KᵀV) does not change the result but changes the complexity from
O(n²) to O(d_k d_v), for n the input size and d_k, d_v the dimensionalities of the
keys and the values, respectively. This change removes the O(n²) terms from the
complexity of the module, making it linear in the input size. Further, d_k d_v is
significantly smaller than n² in practical cases, so the new term does not become a
bottleneck. Therefore, switching the order of multiplication to Q(KᵀV) results in a
substantially more efficient mechanism, which this paper names efficient attention.
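A short PyTorch sketch of the reordering (normalization is left out here, exactly as in the passage above; the sizes n, d_k, d_v are illustrative):

```python
import torch

n, d_k, d_v = 4096, 64, 64   # n pixels, key/value dimensionalities (illustrative)
Q = torch.randn(n, d_k, dtype=torch.float64)
K = torch.randn(n, d_k, dtype=torch.float64)
V = torch.randn(n, d_v, dtype=torch.float64)

# Dot-product attention: S = Q K^T is an n x n matrix, quadratic in the input size.
D_standard = (Q @ K.T) @ V

# Efficient attention: compute K^T V first, a small d_k x d_v matrix,
# so the n x n attention map is never materialized.
D_efficient = Q @ (K.T @ V)

# Associativity guarantees both orderings give the same output.
print(torch.allclose(D_standard, D_efficient))
```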
2.6 U-Net: Convolutional Networks for Biomedical Image
Segmentation (2015)
Olaf Ronneberger, Philipp Fischer, and Thomas Brox
It is widely agreed that successful training of deep networks requires many thousands
of annotated training samples. The typical use of convolutional networks is for
classification tasks, where the output for an image is a single class label. However,
in many visual tasks, especially in biomedical image processing, the desired output
should include localization, i.e., a class label is supposed to be assigned to each
pixel. Moreover, thousands of training images are usually beyond reach in biomedical
tasks.
This paper presented a network and training strategy that relies on the strong use of
data augmentation to use the available annotated samples more efficiently. It is shown
that the proposed network can be trained end-to-end from very few images and
outperforms the prior best method (a sliding-window convolutional network) on the ISBI
challenge for segmentation of neuronal structures in electron microscopic stacks.
The resulting network is applicable to various biomedical segmentation
problems.
Fig 8:- U-net architecture (example for 32x32 pixels in the lowest
resolution). Each blue box corresponds to a multi-channel feature map. The
number of channels is denoted on top of the box. The x-y-size is provided at
the lower left edge of the box. White boxes represent copied feature maps.
The arrows denote the different operations.
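A compact PyTorch sketch of the contracting/expansive architecture with copy-and-concatenate skip connections. It is an illustration rather than a reproduction: it has only two resolution levels and uses padded 3×3 convolutions, whereas the original network uses unpadded convolutions and crops the copied feature maps.

```python
import torch
import torch.nn as nn

def double_conv(in_ch, out_ch):
    """Two 3x3 convolutions with ReLU, the basic building block at every U-Net level."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    def __init__(self, in_ch=1, num_classes=2):
        super().__init__()
        self.enc1 = double_conv(in_ch, 32)
        self.enc2 = double_conv(32, 64)
        self.pool = nn.MaxPool2d(2)                            # contracting path: downsample
        self.bottleneck = double_conv(64, 128)
        self.up2 = nn.ConvTranspose2d(128, 64, 2, stride=2)    # expansive path: upsample
        self.dec2 = double_conv(128, 64)                       # 128 = 64 (skip) + 64 (up)
        self.up1 = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec1 = double_conv(64, 32)
        self.head = nn.Conv2d(32, num_classes, kernel_size=1)  # per-pixel class scores

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        b = self.bottleneck(self.pool(e2))
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))    # skip: copy and concatenate
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))
        return self.head(d1)

logits = TinyUNet()(torch.randn(1, 1, 128, 128))   # -> (1, 2, 128, 128) per-pixel logits
print(logits.shape)
```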
2.7 Micro-Batch Training with Batch-Channel Normalization
and Weight Standardization (2019)
Siyuan Qiao, Huiyu Wang, Chenxi Liu, Wei Shen, and Alan Yuille
• The smoothing effect on the loss landscape:- batch normalization makes the landscape
of the corresponding optimization problem significantly smoother, which stabilizes the
training process and accelerates convergence.
• The ability to avoid harmful elimination singularities along the training
trajectory:- elimination singularities are points along the training trajectory where
neurons in the network are effectively eliminated (the original definition is based on
weights: if w_c denotes the weights that take channel c as input, an elimination
singularity is encountered when w_c = 0). Eliminable neurons waste computation and
decrease the effective model complexity, and getting close to them harms the training
speed and the final performance. By forcing each neuron to have zero mean and unit
variance, BN keeps the network at a far distance from the elimination singularities
caused by non-linear activation functions.
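The paper's remedy is to pair Batch-Channel Normalization with Weight Standardization, which standardizes the convolution weights themselves so that these benefits survive micro-batch training. A minimal PyTorch sketch of the weight-standardization idea alone (the epsilon and layer sizes are illustrative; this is not the authors' reference code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WSConv2d(nn.Conv2d):
    """Conv2d whose weights are standardized (zero mean, unit variance over each
    output channel's fan-in) before every forward pass -- a sketch of the idea."""
    def forward(self, x):
        w = self.weight
        mean = w.mean(dim=(1, 2, 3), keepdim=True)        # statistics per output channel
        std = w.std(dim=(1, 2, 3), keepdim=True) + 1e-5    # avoid division by zero
        w_hat = (w - mean) / std
        return F.conv2d(x, w_hat, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)

y = WSConv2d(3, 16, kernel_size=3, padding=1)(torch.randn(2, 3, 32, 32))
print(y.shape)   # torch.Size([2, 16, 32, 32])
```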
2.8 Group Normalization (2018)
Yuxin Wu , Kaiming He
Fig 11:- Normalization methods:- Each subplot shows a feature map tensor,
with N as the batch axis, C as the channel axis, and (H, W)
as the spatial axes. The pixels in blue are normalized by the same mean and
variance, computed by aggregating the values of these pixels.
Fig 12:- Comparison of models trained with a batch size of 32 images per GPU on
ImageNet. Validation error vs. the number of training epochs is shown. The model is
ResNet-50.
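A short PyTorch sketch making the grouping of Fig 11 concrete (group count and tensor sizes are illustrative); the hand-computed statistics match nn.GroupNorm and, unlike batch normalization, do not depend on the batch size N:

```python
import torch
import torch.nn as nn

x = torch.randn(8, 32, 16, 16)     # (N, C, H, W); sizes are illustrative

# Group Normalization: split the C channels into groups and normalize each group
# over (channels in the group, H, W) -- independent of the batch size N.
gn = nn.GroupNorm(num_groups=4, num_channels=32)
y = gn(x)

# The same statistics computed by hand, to make the grouping explicit.
g = x.view(8, 4, 8, 16, 16)        # (N, groups, C // groups, H, W)
mean = g.mean(dim=(2, 3, 4), keepdim=True)
var = g.var(dim=(2, 3, 4), keepdim=True, unbiased=False)
y_manual = ((g - mean) / torch.sqrt(var + gn.eps)).view(8, 32, 16, 16)

print(torch.allclose(y, y_manual, atol=1e-5))   # affine parameters are identity at init
```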
2.9 Deep Residual Learning for Image Recognition (2015)
Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun
Fig 13:- Training error (left) and test error (right) on CIFAR-10
with 20-layer and 56-layer “plain” networks. The deeper network
has higher training error, and thus test error.
When deeper networks are able to start converging, a degradation problem has
been exposed: with the network depth increasing, accuracy gets saturated (which
might be unsurprising) and then degrades rapidly. The above paper addressed the
degradation problem by introducing a deep residual learning framework.
Formally, denoting the desired underlying mapping as H(x), the stacked nonlinear
layers are made to fit another mapping F(x) := H(x) − x, so the original mapping is
recast as F(x) + x. It is hypothesized that the residual mapping is easier to optimize
than the original, unreferenced mapping. In the extreme case, if an identity mapping
were optimal, it would be easier to push the residual to zero than to fit an identity
mapping with a stack of nonlinear layers.
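A minimal PyTorch sketch of a basic residual block under these definitions (channel counts are illustrative; the identity shortcut adds no extra parameters):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A basic residual block: the stacked layers learn F(x) = H(x) - x,
    and the block outputs F(x) + x via the identity shortcut."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = self.relu(self.bn1(self.conv1(x)))
        residual = self.bn2(self.conv2(residual))      # F(x)
        return self.relu(residual + x)                 # F(x) + x

out = ResidualBlock(64)(torch.randn(1, 64, 56, 56))    # same shape as the input
print(out.shape)
```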
Fig 15:- Training on ImageNet. Thin curves denote training error, and bold
curves denote validation error of the center crops. Left: plain
networks of 18 and 34 layers. Right: ResNets of 18 and 34 layers. In this plot,
the residual networks have no extra parameter compared to
their plain counterparts.
2.10 Diffusion Models for Medical Image Analysis: A
Comprehensive Survey (2022)
Amirhossein Kazerouni, Ehsan Khodapanah Aghdam, Moein
Heidari, Reza Azad, Mohsen Fayyaz, Ilker Hacihaliloglu, Dorit
Merhof
Large, well-annotated medical datasets are difficult to collect because of the cost of
expert annotation and the privacy and security concerns associated with using patient
data in public settings. Hence, using diffusion models to generate synthetic samples
can alleviate the problem of medical data scarcity to a great extent.
The survey organizes the applications of diffusion models in medical imaging into the
following categories:
• Anomaly Detection
• Denoising
• Reconstruction
• Segmentation
• Image-to-Image Translation
• Image Generation
• Other Applications and Multi-tasks
Some of the primary concerns and limitations of diffusion models are their slow
sampling speed and high computational cost, and several methods have been developed to
address these drawbacks. With their remarkable results, diffusion models have proven
that they can be a powerful competitor to other generative models.
Fig 17:- Simple diffusion-based ecosystem in the medical domain.
3. Scope of Work
Image synthesis tasks are performed generally by deep generative models like
GANs, VAEs, and autoregressive models. Generative adversarial networks
(GANs) have been a research area of much focus in the last few years due to the
quality of output they produce. Another interesting area of research that has found
a place are diffusion models. Both of them have found wide usage in the field of
image, video and voice generation. GAN is an algorithmic architecture that uses
two neural networks that are set one against the other to generate newly
synthesized instances of data that can pass for real data. Diffusion models have
become increasingly popular as they provide training stability as well as quality
results on image and audio generation.
Though GANs form the framework for image synthesis in a vast section of
models, they do come with some disadvantages that researchers are actively
working on.
Fig 18:- CelebA images generated with a Poisson Flow Generative Model (PFGM)
4. Work Done
Next, the neural network is defined. The job of the network ε_θ(x_t, t) is to take in
a batch of noisy images and their respective noise levels (timesteps t), and output
the noise that was added to the input. More formally, the network is trained to
minimize the simplified objective L_simple = E_{t, x_0, ε} [ ‖ε − ε_θ(x_t, t)‖² ],
where x_t is produced from x_0 by the forward diffusion process.
The DDPM authors ([2]) used a U-Net model ([5]) for this purpose.
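A minimal sketch of one optimization step of this objective, assuming PyTorch. It is not the actual implementation whose samples are shown below: a single convolution stands in for the U-Net ε_θ and the timestep conditioning is omitted, purely to keep the example self-contained (the schedule constants follow [2]).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

# Stand-in for the U-Net eps_theta(x_t, t); timestep conditioning omitted for brevity.
model = nn.Conv2d(3, 3, kernel_size=3, padding=1)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)

def training_step(x0):
    """One step of the simplified objective L_simple = E ||eps - eps_theta(x_t, t)||^2."""
    t = torch.randint(0, T, (x0.size(0),))
    eps = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    xt = torch.sqrt(a_bar) * x0 + torch.sqrt(1.0 - a_bar) * eps   # q(x_t | x_0)
    loss = F.mse_loss(model(xt), eps)        # predict the noise that was added
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

print(training_step(torch.randn(8, 3, 32, 32)))
```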
Defining the Forward Diffusion process:- The forward diffusion process
gradually adds noise to an image from the real distribution, in a number of
time steps T. This happens according to a variance schedule. The original
DDPM authors employed a linear schedule. However, it was shown in
(Nichol et al., 2021 ([19])) that better results can be achieved when
employing a cosine schedule.
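A short PyTorch sketch of the two schedules (the linear range 1e-4 to 0.02 follows [2]; the cosine offset s = 0.008 and the 0.999 clipping follow [19]):

```python
import math
import torch

T = 1000

# Linear schedule used by the original DDPM authors.
betas_linear = torch.linspace(1e-4, 0.02, T)

def cosine_betas(T, s=0.008):
    """Cosine schedule of Nichol et al.: f(t) = cos^2(((t/T + s)/(1 + s)) * pi/2),
    alpha_bar(t) = f(t)/f(0), and beta_t = 1 - alpha_bar(t)/alpha_bar(t-1).
    The f(0) factor cancels in the ratio."""
    steps = torch.arange(T + 1, dtype=torch.float64)
    f = torch.cos(((steps / T) + s) / (1 + s) * math.pi / 2) ** 2
    betas = 1.0 - f[1:] / f[:-1]
    return betas.clamp(max=0.999).float()

betas_cosine = cosine_betas(T)
print(betas_linear[:3], betas_cosine[:3])
```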
Some Output Samples from my implementation:-
5. References
[2] Jonathan Ho, Ajay Jain, Pieter Abbeel, “Denoising Diffusion Probabilistic
Models”, Neural Information Processing Systems (NeurIPS).
[6] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion
Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin. “Attention Is All
You Need”, Neural Information Processing Systems (NeurIPS).
[7] Zhuoran Shen, Mingyuan Zhang, Haiyu Zhao, Shuai Yi, Hongsheng Li,
“Efficient Attention: Attention with Linear Complexities”,
arXiv:1812.01243.
[8] Siyuan Qiao, Huiyu Wang, Chenxi Liu, Wei Shen, Alan Yuille, “Micro-Batch
Training with Batch-Channel Normalization and Weight Standardization”,
arXiv:1903.10520.
[10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun, “Deep Residual
Learning for Image Recognition”, CVPR.
[11] Prafulla Dhariwal, Alex Nichol, “Diffusion Models Beat GANs on Image
Synthesis”, Neural Information Processing Systems (NeurIPS).
[12] Yilun Xu, Ziming Liu, Max Tegmark, Tommi Jaakkola, “Poisson Flow
Generative Models”, Neural Information Processing Systems (NeurIPS).