
NEURAL STYLE TRANSFER
ARIHAR ASUDHAN
NEURAL STYLE TRANSFER
Neural Style Transfer (NST) is a
deep learning technique in the
field of computer vision and
artificial intelligence that allows
us to combine the content of one
image with the artistic style of
another image.

[Figure: an example of a neural style transferred image.]
NST allows artists and designers
to create unique and visually
appealing artwork by combining
the content of one image with
the artistic style of another. This
is particularly popular in creating
novel and expressive art forms.
NST can be used to enhance the
aesthetics of images and videos.
For instance, it can be applied to
enhance the visual appeal of
photos, videos, or even websites
by adding artistic styles. Graphic
designers can use NST to quickly
apply artistic styles to graphics
and illustrations.
THE INTUITION
The intuition behind Neural Style
Transfer (NST) is to combine the
content of one image with the
artistic style of another image to
create a visually appealing and
unique image.

[Figure: a content image (left) and a style image (right).]

We need to somehow combine the style and the content to make a style-transferred image.
NST begins with two input
images: a "content image" and a
"style image." The content image
contains objects, scenes, and
elements that we want to
preserve in the final image. The
style image represents the
artistic characteristics, such as
textures, colors, and patterns,
that we want to apply to the
content.
@ INPUTS : Content & Style
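As a concrete starting point, here is a minimal sketch, assuming PyTorch and torchvision (the file names content.jpg and style.jpg are hypothetical), of how the two input images might be loaded and prepared. For brevity it skips the ImageNet mean/std normalization that VGG models usually expect.

import torch
from PIL import Image
from torchvision import transforms

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Resize both images to a common size and turn them into 1 x 3 x H x W tensors.
loader = transforms.Compose([
    transforms.Resize((512, 512)),
    transforms.ToTensor(),
])

def load_image(path):
    image = Image.open(path).convert("RGB")
    return loader(image).unsqueeze(0).to(device)  # add a batch dimension

content_img = load_image("content.jpg")  # hypothetical file names
style_img = load_image("style.jpg")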
Both the content image and the
style image are passed through a
pre-trained convolutional neural
network (CNN), typically a VGG
network.
The VGG (Visual Geometry
Group) network is a class of
convolutional neural networks
(CNNs) commonly used for
various computer vision tasks,
including image classification
and feature extraction. The VGG
architecture is known for its
simplicity and effectiveness.
The network extracts features from different layers, capturing information about the content and style of the images. The deeper layers tend to capture high-level content information, such as objects and their arrangement, while the shallower layers capture low-level details such as edges, textures, and colors; style is typically measured across several layers at once.
@ EXTRACTING FEATURES of
Content & Style using VGG Net
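A minimal sketch of this feature extraction, assuming torchvision's VGG-19 (the layer indices below are common illustrative choices, not the only valid ones):

from torchvision.models import vgg19

# Load the pre-trained feature extractor and freeze it.
# (Newer torchvision versions use the weights= argument instead of pretrained=.)
vgg = vgg19(pretrained=True).features.eval().to(device)
for p in vgg.parameters():
    p.requires_grad_(False)

STYLE_LAYERS = {0, 5, 10, 19, 28}  # conv1_1 ... conv5_1 (style, shallow to deep)
CONTENT_LAYERS = {21}              # conv4_2 (content, a deeper layer)

def extract_features(image):
    style_feats, content_feats = [], []
    x = image
    for i, layer in enumerate(vgg):
        x = layer(x)
        if i in STYLE_LAYERS:
            style_feats.append(x)
        if i in CONTENT_LAYERS:
            content_feats.append(x)
    return style_feats, content_feats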
NST defines two types of loss
functions: content loss and style
loss. Content loss measures the
difference between the feature
representations of the content in
the content image and the
generated image. It encourages
the generated image to have
similar content to the content
image. Style loss measures the difference between feature statistics (e.g., mean, variance, or the feature correlations captured by the Gram matrix, discussed below) of the style image and the generated image across various layers. It encourages the
generated image to adopt the
artistic style of the style image.
We’ll take a deeper look at how style is extracted a few pages from now.
@ LOSSES for Content & Style
NST uses an optimization
algorithm (typically gradient
descent) to minimize the
combined content and style loss.
The algorithm adjusts the pixel
values of a third image, known as
the "generated image," in each
iteration to make it both
resemble the content of the
content image and the style of
the style image. The process
continues iteratively until the
loss functions are minimized,
resulting in a final generated
image that combines the content
of the content image with the
artistic style of the style image.
@ OPTIMIZER Matters!
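A sketch of this loop, assuming the extract_features helper above; target_content and target_grams are the content image's features and the style image's Gram matrices, precomputed once, and compute_content_loss and compute_style_loss are hypothetical helper names sketched in the loss sections below. Adam is used here for simplicity; the original NST paper used L-BFGS.

# Start the generated image from a copy of the content image
# (starting from random noise is another common choice).
generated = content_img.clone().requires_grad_(True)
optimizer = torch.optim.Adam([generated], lr=0.01)

for step in range(300):
    optimizer.zero_grad()
    style_feats, content_feats = extract_features(generated)
    loss = (compute_content_loss(content_feats, target_content)
            + compute_style_loss(style_feats, target_grams))
    loss.backward()   # gradients w.r.t. the pixels of the generated image
    optimizer.step()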
The intuition is that NST
leverages the power of deep
learning and feature
representations to separate
content and style aspects from
images and then recombine
them to produce a new image
that possesses the content and
artistic style as desired. This
technique allows for creative and
artistic transformations of
images, enabling the generation
of visually striking and
personalized artwork.
GRAM MATRIX
The Gram matrix, also known as
the Gramian matrix or the
autocorrelation matrix, is a
mathematical concept commonly
used in linear algebra and
machine learning. It is often used
in the context of understanding
patterns and relationships within
data, especially in the field of
deep learning and neural
networks. To understand the
Gram matrix, we first need to
know about vectors and the dot
product. In a vector space, we
have vectors, which are
essentially lists of numbers.
The dot product of two vectors is
a way to measure the similarity
or alignment between them. The
Gram matrix is constructed from
a set of vectors.
Let's say we have a set of
vectors, usually represented as
columns in a matrix. Each column
is a vector. To create the Gram
matrix, we take all possible pairs
of vectors and compute the dot
product between them. The results of these dot products are stored in the Gram matrix. (If you are familiar with the attention concept in Transformers, you may have heard of the attention map, which is calculated in a slightly similar manner.) Let's discuss how we can construct the Gram matrix for a given matrix of vectors.
Let's say we have three vectors:
v1 = [1, 2, 3]
v2 = [4, 5, 6]
v3 = [7, 8, 9]
First, we need to arrange these
vectors as columns in a matrix.
Let's call this matrix A:
    | 1 4 7 |
A = | 2 5 8 |
    | 3 6 9 |

Next, we'll compute the Gram matrix G. To do this, we take the
dot product of each pair of
vectors and fill in the elements
of the Gram matrix accordingly.
G[1,1] is the dot product of v1
with itself: v1 · v1 = 1*1 + 2*2 +
3*3 = 14
Similarly G[2,2] = 77, G[3,3] = 194
Now, we compute the other
elements of the Gram matrix:
G[1,2] is the dot product of v1
and v2: G[1,2] = 32
Similarly,
G[1,3] = 50;
G[2,3] = 122
Since the Gram matrix is
symmetric, we can fill in the
elements below the diagonal
with the same values:
G[2,1] = G[1,2] = 32
G[3,1] = G[1,3] = 50
G[3,2] = G[2,3] = 122
So, the Gram matrix G looks like
this:
    |  14  32  50 |
G = |  32  77 122 |
    |  50 122 194 |
This is the Gram matrix for the
given set of vectors. It encodes
information about the
relationships between these
vectors based on their dot
products. The Gram matrix helps
us understand the relationships
between vectors.
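This arithmetic is easy to verify in code. A quick check with NumPy, arranging the vectors as columns of A so that G equals A-transpose times A:

import numpy as np

A = np.array([[1, 4, 7],
              [2, 5, 8],
              [3, 6, 9]])

G = A.T @ A   # G[i, j] is the dot product of column i with column j
print(G)
# [[ 14  32  50]
#  [ 32  77 122]
#  [ 50 122 194]]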
It encodes information about
how much the vectors are similar
or different from each other.
The diagonal elements of the
Gram matrix (the values on the
main diagonal from the top left
to the bottom right) represent
the dot products of each vector
with itself. These values provide
information about the
magnitude or length of each
vector. In this case, the diagonal
elements are 14, 77, and 194,
which represent the squares of
the magnitudes of v1, v2, and v3,
respectively. Vectors that are
orthogonal (perpendicular) to
each other will have a dot
product of 0. The Gram matrix lets us check whether any of the vectors are orthogonal. In this matrix, there are no zeros off the main diagonal, indicating that none of the vectors are perfectly orthogonal. The off-diagonal elements represent the relationships between the vectors. For example, the value 32 measures the alignment between v1 and v2, the value 122 the alignment between v2 and v3, and 50 the alignment between v1 and v3. (Keep in mind that raw dot products also grow with vector length, so a larger off-diagonal value partly reflects larger magnitudes, not only stronger alignment.)
The Gram matrix is used in neural
style transfer to capture the
style of an image by quantifying
the correlations between
features in the neural network's
layers. This style information is
then used to guide the process
of generating an image that
combines the content of one
image with the style of another.
@ Style Information?
Style Information refers to the
visual characteristics that define
the artistic or aesthetic style of
an image. These characteristics include patterns, textures, and color palettes.
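In NST, the "vectors" are the channels of a convolutional feature map. A minimal sketch, assuming PyTorch, of how a Gram matrix is commonly computed from a 1 x C x H x W feature map:

def gram_matrix(feat):
    # feat: a feature map of shape (1, C, H, W) from some VGG layer.
    _, c, h, w = feat.size()
    f = feat.view(c, h * w)   # one row per channel, flattened spatially
    g = f @ f.t()             # (C, C): dot products between channel pairs
    return g / (c * h * w)    # normalize so layers of different sizes compare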
LOSS FUNCTION
Let’s say we have a content image and an image of pure random noise. Now, suppose we use an MSE loss to pull the noise image toward the content image. What would happen?
# Compare the noise image directly to the content image, pixel by pixel.
loss = criterion(noisy_image, content_img)   # criterion = nn.MSELoss()
The content will start to appear
in the noise image gradually.
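A self-contained sketch of this toy experiment, assuming PyTorch and the content_img tensor loaded earlier:

import torch.nn as nn

criterion = nn.MSELoss()

# Start from random noise with the same shape as the content image.
noisy_image = torch.randn_like(content_img).requires_grad_(True)
optimizer = torch.optim.Adam([noisy_image], lr=0.05)

for step in range(200):
    optimizer.zero_grad()
    loss = criterion(noisy_image, content_img)
    loss.backward()    # gradient of the MSE w.r.t. the noisy pixels
    optimizer.step()   # each step nudges the noise toward the content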

With this, let’s dive deeper into the two losses used in Neural Style Transfer.
CONTENT LOSS
In the content loss, the features
of the generated image are
compared to the content image.
It is to make sure that the
generated image has the same
content while the algorithm
changes the style. This way, the
authenticity of the content
image isn’t lost and from the
style image, the style elements
get added. First, two copies of
the same pre-trained image
classification CNNs are used as
loss networks. These networks
will be fed a reference image
(content) and a test image
(generated) respectively. The outputs from these two networks are used as inputs to the loss function. The calculation is a (Euclidean) distance between the two content representations produced by the loss CNNs: one from the generated image and one from the content image. To calculate the content loss, then, we first need the content features of both the generated image and the content image, extracted using the pre-trained loss networks.
Next, the Mean Squared Error (an averaged squared L2 distance) is calculated.
To perform this calculation, we
follow the name. First, calculate
the error using element-wise
subtraction. Subtract the
generated image features from
the content image features.
Next, square these errors
element-wise to get the squared
errors. Add all the values and
divide by the number of features
to calculate the average, and you
will have the Mean Squared
Error.
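A minimal sketch of the content loss, assuming PyTorch and the feature lists produced by the extract_features helper sketched earlier (target_feats holds the content image's features, computed once up front):

import torch.nn.functional as F

def compute_content_loss(gen_feats, target_feats):
    # One MSE term per chosen content layer; here we simply sum them.
    return sum(F.mse_loss(g, t) for g, t in zip(gen_feats, target_feats))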
STYLE LOSS
Another important loss is the
style loss which is represented
by the gram matrix. It is
important to learn the difference
between the two losses to
understand them better. So, what is style loss? In simple words, style loss measures how different the lower-level features of the generated image are from those of the style image; features like color and texture. Style loss is gathered from many layers across the network, whereas content loss is taken from the higher layers.
It reaches into the deepest layers to make sure the generated image takes on the style image's character at every scale, while the content loss keeps the original image from losing its value and real meaning. The concept of content loss is far simpler than style loss: a straightforward mathematical formula gives its value. The style loss is meant to penalize the output image when its style deviates from the supplied style image.
Now, for content loss, we can
simply add up and divide for the
Mean Squared Error value. For
style loss, there is another step.
First, a loss network is used like
with the content loss. Both the
test (generated) image features
and style image features are fed
to the loss networks. This produces their activations at each chosen layer. Next, each layer's activations are used to compute a Gram matrix: the channels of the feature map are flattened and dot products are taken between every pair of channels (usually normalized by the size of the feature map). This matrix is a measurement of the style at each layer. Because it measures the (uncentered) covariance between channels, it captures information about which features of the image tend to activate together. A useful property of the Gram matrix is that it discards spatial arrangement, so style is matched regardless of where particular features appear in the image.
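A matching sketch of the style loss, reusing the gram_matrix helper from earlier (target_grams holds the style image's Gram matrices, computed once up front):

def compute_style_loss(gen_feats, target_grams):
    # Compare Gram matrices layer by layer; each term is an MSE over a C x C matrix.
    loss = 0.0
    for g, target in zip(gen_feats, target_grams):
        loss = loss + F.mse_loss(gram_matrix(g), target)
    return loss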
TOTAL LOSS
The total loss is the sum of the
content loss and the style loss,
often weighted by
hyperparameters that control
the relative importance of
content and style in the final
generated image. Additionally, a
third term, the "total variation
loss," is sometimes added to
encourage spatial smoothness in
the generated image. The
objective in neural style transfer
is to minimize the total loss by
making adjustments to the
generated image.
This is typically done using
optimization techniques like
gradient descent. By minimizing
the total loss, we can create an
image that combines the content
of the content image with the
artistic style of the style image.
The optimal values of the
generated image that minimize
the total loss will result in an
image that is a stylized version of
the content image.
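Putting the pieces together, a sketch of the total loss; the weights ALPHA, BETA, and TV_WEIGHT are hypothetical hyperparameters you would tune:

ALPHA, BETA, TV_WEIGHT = 1.0, 1e6, 1e-6   # illustrative values only

def total_variation(img):
    # Penalizes differences between neighboring pixels to encourage smoothness.
    return ((img[:, :, 1:, :] - img[:, :, :-1, :]).abs().mean()
            + (img[:, :, :, 1:] - img[:, :, :, :-1]).abs().mean())

def compute_total_loss(generated, gen_content, gen_style,
                       target_content, target_grams):
    c_loss = compute_content_loss(gen_content, target_content)
    s_loss = compute_style_loss(gen_style, target_grams)
    return ALPHA * c_loss + BETA * s_loss + TV_WEIGHT * total_variation(generated)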
IMPLEMENTATION
1. Initialization and Importing
2. Preprocessing and Loading
3. Visualize
4. Losses
5. VGG Model & Normalization
6. Input Image Initialization and Optimization
7. Style Transfer & Optimization
8. The Execution
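As a rough end-to-end sketch tying the earlier snippets together (all names as defined in the sketches above; a toy outline under those assumptions, not a polished implementation):

# 1-2. Load and preprocess the two input images.
content_img = load_image("content.jpg")   # hypothetical file names
style_img = load_image("style.jpg")

# 4-5. Precompute the targets once with the frozen VGG network.
_, target_content = extract_features(content_img)
target_style, _ = extract_features(style_img)
target_grams = [gram_matrix(f) for f in target_style]

# 6. Initialize the generated image from the content image.
generated = content_img.clone().requires_grad_(True)
optimizer = torch.optim.Adam([generated], lr=0.01)

# 7-8. Optimize the pixels of the generated image.
for step in range(500):
    optimizer.zero_grad()
    gen_style, gen_content = extract_features(generated)
    loss = compute_total_loss(generated, gen_content, gen_style,
                              target_content, target_grams)
    loss.backward()
    optimizer.step()
    with torch.no_grad():
        generated.clamp_(0, 1)   # keep pixel values in a valid range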
THANK YOU
