An Illustrative Guide to Masked Image Modelling


In machine learning today, methods and models from one field often carry out tasks from another. For instance, some computer vision tasks can be completed by models that were primarily designed for natural language processing. In this article, we will discuss one such technique carried over from NLP to computer vision: when we apply it to computer vision through image masking, we call it Masked Image Modelling. We will try to understand how this technique works and some of its key applications. The main points covered in this article are listed below.

Table of Contents

What is Masked Image Modelling?

The framework of Masked Image Modelling


Works Related to Masked Image Modelling

Applications of Masked Image Modelling

Let’s begin the discussion by understanding what masked image modelling is.

What is Masked Image Modelling?

Masked signal modelling is a kind of machine learning in which the model uses the masked component of its input to learn about, and forecast, the masked signal. This kind of learning has well-established use cases in self-supervised learning for NLP, and various studies apply masked signal modelling to learn from vast amounts of unannotated data. For computer vision, the approach can also produce results comparable to methods such as contrastive learning. Masked image modelling is the process of employing masked images to perform computer vision tasks.

Applying masked image modelling involves the following difficulties:

Pixels next to one another are strongly correlated. In contrast to the signals (tokens) in NLP data, the signals in images are raw and low-level, and whereas text signals are discrete, visual signals are continuous.

So, to keep the task from being solved by simply copying correlated neighbouring pixels, the masking must be carried out very carefully when this method is used with image data. The method must also adapt to continuous signals, while still letting high-level visual tasks benefit from predictions made on low-level data.
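One standard way to deal with the strong correlation between neighbouring pixels is to mask whole blocks of image patches rather than scattered single pixels, so the model cannot recover the hidden content by copying its immediate neighbours. Below is a minimal sketch of such block-wise masking; the grid size, block size, and mask ratio are illustrative assumptions rather than values from any particular paper.

```python
import torch

def random_block_mask(grid_size: int = 14, mask_ratio: float = 0.4,
                      block: int = 2) -> torch.Tensor:
    """Boolean mask over a grid_size x grid_size grid of image patches.

    Whole blocks of neighbouring patches are hidden together, which weakens
    the short-range pixel correlation the model could otherwise exploit.
    """
    grid = torch.zeros(grid_size, grid_size, dtype=torch.bool)
    target = int(mask_ratio * grid.numel())
    while grid.sum() < target:
        # Pick a random top-left corner and mask the whole block under it.
        r = torch.randint(0, grid_size - block + 1, (1,)).item()
        c = torch.randint(0, grid_size - block + 1, (1,)).item()
        grid[r:r + block, c:c + block] = True
    return grid.flatten()

mask = random_block_mask()
print(int(mask.sum()), "of", mask.numel(), "patches masked")
```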

Numerous works have addressed and generalised these issues for image data modelling, including:

Pre-Trained Image Processing Transformer: this work demonstrates the use of the continuous signal from images for classification problems, while also using color clustering methods.

Swin Transformer V2: this work demonstrates a method for scaling a Swin Transformer up to 3 billion parameters, enabling it to learn from images at resolutions of up to 1536 × 1536 and execute computer vision tasks. The authors apply adaptation approaches to the continuous signals derived from the images.

BEiT: BERT Pre-Training of Image Transformers: this work can be compared to the BERT model in computer vision. It employs a similar tokenization strategy, using an extra network to turn image patches into discrete visual tokens, and block-wise image masking to break the short-range correlation between pixels (a minimal sketch of this tokenize-then-predict idea follows below).
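To make the tokenize-then-predict idea concrete, here is a minimal sketch of BEiT-style pre-training. Everything here is a simplifying assumption: the "tokenizer" is a frozen random projection standing in for the separately trained discrete VAE, and the encoder is a tiny transformer rather than a full-size one.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

num_patches, patch_dim, vocab_size, dim = 196, 768, 8192, 256

tokenizer = nn.Linear(patch_dim, vocab_size)   # stand-in for a learned dVAE tokenizer
embed = nn.Linear(patch_dim, dim)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True), num_layers=2)
head = nn.Linear(dim, vocab_size)
mask_token = nn.Parameter(torch.zeros(dim))

patches = torch.randn(1, num_patches, patch_dim)      # flattened image patches
with torch.no_grad():
    targets = tokenizer(patches).argmax(-1)           # discrete "visual tokens"

masked = torch.rand(1, num_patches) < 0.4             # block-wise in practice
tokens = embed(patches)
tokens[masked] = mask_token                           # hide the masked patches
logits = head(encoder(tokens))

# Classification loss on masked positions only, as in masked language modelling.
loss = F.cross_entropy(logits[masked], targets[masked])
loss.backward()
```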

The works above give a few illustrations of methods that can be applied to address these problems, and a sense of the degree of complexity needed to create a model or framework that can manage these challenges and carry out the necessary task. The following picture illustrates the core concept of a model or framework for masked image modelling:

[Figure: the basic masked image modelling framework (image source)]

In the image above, we can see the input image patches along with a linear layer that regresses the raw pixel values of the masked area, with the loss computed on that masked region. A simple model of this kind comes down to three design choices:

masking applied to the input images

regression on raw pixel values

a lightweight prediction head

We can keep the process simple for a transformer by performing basic masking on the image patches. The continuous nature of visual information is well matched to a regression task, and a light prediction head can significantly speed up pre-training. Although heavier heads can produce stronger generation quality, they may suffer in the downstream fine-tuning operations.
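As an illustration of these three design choices, here is a minimal sketch of the pre-training step: patches are masked, a small transformer encodes them, and a single linear layer regresses the raw pixel values, with an L1 loss computed only on the masked patches. All sizes and module choices below are assumptions for the sake of the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

num_patches, patch_dim, dim = 196, 768, 256             # 14x14 patches of 16x16x3 pixels

embed = nn.Linear(patch_dim, dim)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True), num_layers=2)
head = nn.Linear(dim, patch_dim)                         # the "light" prediction head
mask_token = nn.Parameter(torch.zeros(dim))

patches = torch.randn(2, num_patches, patch_dim)         # flattened image patches
masked = torch.rand(2, num_patches) < 0.5                # hide half of the patches

tokens = embed(patches)
tokens[masked] = mask_token                              # masked patches become a learned token
pred = head(encoder(tokens))                             # regress raw pixel values

# L1 loss only where the input was hidden: the model must reconstruct what it never saw.
loss = F.l1_loss(pred[masked], patches[masked])
loss.backward()
```

Compared with the tokenizer-based sketch earlier, the only change is the target: raw pixels and an L1 loss instead of discrete token ids and cross-entropy, which is what keeps the prediction head so light.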

The Framework of Masked Image Modelling

Masked Image Modelling (MIM) is a framework for generating high-quality images from
incomplete or corrupted inputs. The framework works by iteratively refining an initial estimate
of the complete image using a series of conditional models.

The MIM framework can be broken down into the following steps:

Input Masking: The input image is masked, meaning that a portion of it is intentionally removed or obscured. The goal is to generate a complete image that matches the original input as closely as possible, even in areas where information was removed.

Initialization: An initial estimate of the complete image is generated, based on the available
information in the unmasked portion of the input image.

Iterative Refinement: The initial estimate is iteratively refined using a series of conditional
models. Each model is trained to predict a specific portion of the complete image, given the
available information at that point in the refinement process.
Output Generation: Once the refinement process is complete, the final estimate of the complete
image is generated by combining the results of each of the conditional models.
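A minimal sketch of these four steps is given below. The refiner network here is a stand-in conditional model (an assumption for illustration, not a specific published architecture); the loop keeps the known pixels fixed and only re-estimates the masked region.

```python
import torch
import torch.nn as nn

# Stand-in conditional model: takes the current estimate plus the mask, returns an image.
refiner = nn.Sequential(
    nn.Conv2d(4, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 3, 3, padding=1),
)

image = torch.rand(1, 3, 64, 64)                     # the original, complete image
mask = (torch.rand(1, 1, 64, 64) < 0.25).float()     # 1 = pixel removed or obscured

# Steps 1-2: input masking and a simple initial estimate (mean-fill the hole).
known = image * (1 - mask)
estimate = known + mask * known.mean()

# Step 3: iterative refinement; the known pixels are never overwritten.
for _ in range(5):
    proposal = refiner(torch.cat([estimate, mask], dim=1))
    estimate = known + mask * proposal

# Step 4: `estimate` is the final completed image.
print(estimate.shape)
```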

There are several variations of the MIM framework, each with its own set of conditional models
and training strategies. Some common types of conditional models used in MIM include:


Autoregressive models, which generate each pixel in the image sequentially, based on the values of previously generated pixels (a minimal sketch of this variant appears after this list).

Variational Autoencoder (VAE) models, which generate the complete image by sampling from a learned probability distribution.

Generative Adversarial Network (GAN) models, which generate the complete image by training a generator network to produce realistic images that can fool a discriminator network.
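As a concrete example of the autoregressive variant, here is a minimal sketch of next-pixel prediction in the spirit of iGPT: the image is flattened into a sequence of quantised pixel values and the model is trained to predict each value from the ones before it. The vocabulary size, sequence length, and model sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, seq_len, dim = 256, 16 * 16, 128               # 256 grey levels, a 16x16 image

embed = nn.Embedding(vocab, dim)
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True), num_layers=2)
head = nn.Linear(dim, vocab)

pixels = torch.randint(0, vocab, (1, seq_len))        # flattened, quantised image
# Causal mask so position t only attends to positions 0..t.
causal = torch.full((seq_len - 1, seq_len - 1), float("-inf")).triu(1)

hidden = backbone(embed(pixels[:, :-1]), mask=causal)  # teacher forcing
logits = head(hidden)

# Predict pixel t+1 from pixels 0..t.
loss = F.cross_entropy(logits.reshape(-1, vocab), pixels[:, 1:].reshape(-1))
loss.backward()
```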

MIM has a wide range of applications, including image inpainting, image super-resolution, and
image denoising. It has proven to be an effective method for generating high-quality images
from incomplete or corrupted inputs.

In the section above, we have seen a basic architecture of a framework for masked image modelling and the components with which we can make the framework perform computer vision tasks. Let’s now see some of the works in which we can witness masked image modelling.


Works Related to Masked Image Modelling

There have been many works related to Masked Image Modelling (MIM) in recent years. Here
are a few notable examples:
"Generative Image Inpainting with Contextual Attention" by Yu et al. (2018) - This paper
introduced a MIM approach for image inpainting that uses contextual attention to guide the
refinement process. The method achieves state-of-the-art results on several benchmarks.

"Deep Image Prior" by Ulyanov et al. (2018) - This paper proposes a MIM approach based on
the assumption that convolutional neural networks can learn the structure of natural images
without any training data. The method achieves impressive results on a range of image
restoration tasks.

"Generative Pretraining from Pixels" by Chen et al. (2020) - This OpenAI work (often called iGPT) trains GPT-style transformers directly on sequences of quantised pixels, using both an autoregressive next-pixel objective and a BERT-like masked objective. The representations it learns transfer well to image classification, showing that generative pre-training on raw pixels is a viable form of masked image modelling.

"Plug-and-Play Generative Networks: Conditional Iterative Generation of Images in Latent


Space" by Nguyen et al. (2019) - This paper introduces a MIM approach that generates images in
a learned latent space. The method is capable of generating high-quality images and can be
applied to a wide range of image generation tasks.

"MaskGAN: Better Text Generation via Filling in the ______" by Fedus et al. (2018) - This
paper applies MIM to the task of text generation. The method uses a GAN-based architecture to
generate text that fills in missing words in a given context.
"Conditional Variational Autoencoder with Soft-Attention for Multi-Modal Image Inpainting" by
Li et al. (2021) - This paper introduces a MIM approach that uses a conditional variational
autoencoder with soft attention for multi-modal image inpainting. The method achieves state-of-
the-art results on several benchmarks. These are just a few examples of the many works related
to MIM. The field is rapidly evolving, and new approaches are being developed and refined all
the time.

GPT (Generative Pre-Training) is also the name of a family of language models developed by OpenAI that are trained with unsupervised learning on large amounts of text data. The models are based on deep neural networks and can generate text that is coherent and often indistinguishable from text written by humans; their self-supervised training recipe is the same one that masked and autoregressive image modelling borrows.

The first model in the GPT family, GPT-1, was introduced in 2018 and had 117 million
parameters. It was trained on a diverse corpus of web text, including books, articles, and
websites, and was capable of generating coherent text in a variety of styles and genres.


Subsequent versions of the model, including GPT-2 and GPT-3, have greatly increased the
number of parameters, with GPT-3 having 175 billion parameters, making it one of the largest
language models ever developed. The larger models have been shown to be capable of
generating even more impressive text, with the ability to mimic different writing styles,
summarize text, and even translate between languages.
The key innovation of GPT is the use of unsupervised learning to train the model. Unlike
traditional supervised learning, which requires labeled data to train the model, GPT is trained on
raw text data using a self-supervised learning approach. The model is trained to predict the next
word in a sequence of text, given the previous words in the sequence. By doing this, the model
learns to generate text that is coherent and follows the rules of language.
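The training objective itself can be written in a couple of lines. Below is a tiny, hedged sketch of the next-word loss: the model is replaced here by random logits as a stand-in, and the vocabulary size is an arbitrary assumption; the model is scored on predicting token t+1 from tokens 0..t.

```python
import torch
import torch.nn.functional as F

vocab = 50_000
token_ids = torch.randint(0, vocab, (1, 128))             # a tokenised text sequence
logits = torch.randn(1, 127, vocab, requires_grad=True)   # stand-in for the model's outputs

# Score the prediction of token t+1 from tokens 0..t (shift targets by one).
loss = F.cross_entropy(logits.reshape(-1, vocab), token_ids[:, 1:].reshape(-1))
loss.backward()
```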

GPT has many practical applications, including in chatbots, automated content generation, and
text completion. However, the technology also raises ethical concerns, including the potential for
misuse in generating fake news, propaganda, and disinformation. As a result, the development of
large-scale language models like GPT has prompted discussions about responsible AI and the
need for transparency and accountability in AI research.


Applications of Masked Image Modelling

Masked image modeling is a technique used in computer vision and machine learning to train
neural networks to fill in missing parts of images. Here are some applications of masked image
modeling:
Image inpainting: Masked image modeling can be used to fill in missing parts of images caused
by scratches, stains, or other types of damage. For example, in the field of art restoration, it can
be used to repair old or damaged paintings.

Object removal: Masked image modeling can be used to remove objects from images. This is
useful in applications such as photo editing or video processing, where it is necessary to remove
unwanted objects from an image or video.

Image generation: Masked image modeling can be used to generate new images by filling in
missing parts of existing images. This can be useful in applications such as creating realistic
images of people or objects that do not exist in real life.

Image completion: Masked image modeling can be used to complete partially visible images. For
example, in medical imaging, it can be used to complete scans where only part of the body is
visible.

Image segmentation: Masked image modeling can be used to segment images into different
regions based on their characteristics. For example, it can be used to identify and separate
different types of tissue in medical images.

Overall, masked image modeling is a versatile technique that has many applications in various
fields, including art restoration, photo editing, video processing, medical imaging, and more.

Final Words
In this article, we have discussed masked image modelling, which was developed by taking inspiration from masked language modelling. Here, learning is done using the masked information, which makes models more capable of learning from the visual signals in images. Along with this, we have seen how this technique can be performed and what works are related to it. In the end, we also discussed some key applications of masked image modelling.

