
CARTOONIZE AN IMAGE USING NEURAL STYLE TRANSFER ALGORITHM

ABSTRACT
Our aim is to create high-resolution animated (cartoon-style) images from real photographs, a task that is considered both challenging and valuable in today's computer vision and computer graphics research. The traditional technique is time consuming, requires an artistic mindset, and still does not always produce satisfactory results. We therefore analyse the characteristics of cartoon paintings and propose three separately identified representations:
(1) the surface representation, (2) the structure representation, and (3) the texture representation.
Our model first decomposes the real image into these representations and then learns from the decomposed image. For this we use the Generative Adversarial Network (GAN) framework: a cartoon GAN learns from real-world images and translates them into cartoonized images.
INTRODUCTION

Cartoons are one of the most popular and entertaining art forms in the world. A cartoon is a form of 2D or 3D illustrated visual art. While the specific definition has changed over time, modern usage refers to a typically non-realistic or semi-realistic drawing or painting intended for satire, caricature or humour, or to the artistic style of such works. From the 20th century onwards the term has also referred to comic strips and animated films.

Cartoons represent humorous or satirical forms of painting that have a great influence on our lives. Classic cartoon characters, such as Mickey Mouse and Donald Duck, have become an irreplaceable part of the post-1990s generation, and Disney and Marvel comics are still loved by millions of fans around the world. Animation creates another world that many people yearn for. Reading comics can relax the mind, stimulate the imagination, and help readers find their own ideas and a sense of spiritual sustenance.

Since computer technology has changed people's daily life, it has significantly impacted comic creation as well. Research on computer graphics has always been one of the hot fields of computer science; its major content includes the representation of graphics in the computer, the computation of graphics, and the principles and algorithms of display. In 1962, Ivan E. Sutherland from the MIT Lincoln Laboratory used the term "Computer Graphics" for the first time in his paper, proving that interactive computer graphics is a feasible and useful research field. Thus computer graphics was confirmed as an independent branch of science. At the beginning of the field, pioneers were mainly committed to using computer technology to better express the natural world and to faithfully present in the computer system the picture seen by the naked eye.

At first, scholars used computers to imitate the output of early cameras and gradually began to simulate realistic images; computer graphics began to pursue realistic pictures. With the advancement of technology, the term "photorealistic rendering" has been used to describe a range of techniques that mimic the output of cameras [3]. In pursuit of realism, early work studied the process and characteristics of light reflected off surfaces and then explored the equilibrium solution of light reaching various surfaces when reflected within the environment. Overall, realistic rendering is usually based on the scene and a physical model of the objects in the scene. It simulates the colors, textures, and other attributes of the objects in order to keep their behaviour and characteristics consistent with the colors, light, and shadows of nature as far as possible [4, 5]. The computer system therefore presents scenes that comply with "real" pictures as humans perceive them. This pursuit of realism has greatly promoted the development of computer graphics. Realistic rendering technology has been widely used in industrial fields such as CAD and post-processing of film and television, and it has directly given rise to the industrial field of "CG rendering" [6].

Rendering technology is widely considered to be inseparable from 3D computer models [7, 8], but in fact it also includes the processing of two-dimensional images. If the object is a 2D image, the corresponding technique is called image-space nonphotorealistic rendering, whereas if the object is a 3D model, the technique is called 3D-space nonphotorealistic rendering [9]. According to Gooch's classification of nonphotorealistic rendering technology, it can be divided into three categories: simulation of artistic media, interactive generation of user images, and automatic generation of system images.

First and foremost, with the development of social media networks and the growing gaming market, there is an increasing demand for personalized, nonrealistic images. For example, more young people tend to post photos or artwork with personal features on short-video platforms, and in online games with social functions many people also want avatars with their own characteristics, which requires nonphotorealistic rendering. Furthermore, image processing based on deep learning has been investigated by more and more scholars, with work on super-resolution reconstruction, inpainting of missing image regions, colorization of black-and-white images, and the recently popular AI face swapping. This kind of deep learning will therefore strongly influence the development of comics.

In the early stage, nonphotorealistic rendering was primarily used as a supplement to photorealistic rendering [10]. As a result, nonphotorealistic rendering for 3D models shares many implementation methods with photorealistic rendering for 3D models [11], while nonphotorealistic rendering techniques for cartoon-style 3D object space are still rare. Nonphotorealistic rendering is generally defined as a technique that uses computers to simulate the rendering styles of various visual arts [12]. There are significant differences between nonphotorealistic and photorealistic rendering in definition, purpose, and common methods. In terms of the rendering method, photorealistic rendering mainly relies on physical simulation, while nonphotorealistic rendering usually applies a stylization process to the material. In terms of characteristics, realistic rendering represents the objective world, while nonrealistic rendering contains subjective expression. In terms of the rendered result, realistic rendering effectively reflects the real physical process, while nonrealistic works emphasize the audience's emotional cognition rather than visual correctness and usually apply artistic processing so that the result is visually closer to works of art.

With the rapid development of computing technology, more and more researchers have devoted themselves to the study of nonphotorealistic cartoon-rendering algorithms. Lander et al. [13] focused on the relationship between textures and cartoon styles in images and proposed a method that realizes cartoon coloring through texture mapping. Claes et al. [14] introduced a fast method for cartoon-style rendering and smoothed the boundaries in the image during rendering. Cartoon-rendering methods based on traditional image processing are generally nonparametric. Rosa [15] proposed an edge-fusion method that first converts images from RGB color space to Lab color space and performs gradient filtering on the L channel alone to obtain edge-gradient maps; at the same time, the L channel is smoothed by grayscale quantization in Lab space. Finally, the edge-gradient image is fused with the smoothed image in RGB space, and the resulting image has smooth color blocks and prominent edges, with some characteristics of cartoon style.

Raskar et al. [16] offered a method using image depth and edge information in 2004, which uses multiple flashes while shooting to capture geometric information in the images; this approach separates textures from the image to assist nonphotorealistic rendering. Winnemöller et al. [17] in 2006 extracted image contours by combining a bilateral filter with a difference-of-Gaussians filter in order to render the contours. Yan [18] designed a 2D cartoon-style rendering algorithm based on scan lines in 2009 that can automatically complete cartoon-style rendering in real time. Because they rely on image filtering and edge detection, these traditional image processing methods can only handle images with simple texture content and uncomplicated backgrounds; their effect is strongly affected by image content and their generalization ability is poor.

Gatys et al. proposed a texture model based on the feature space of convolutional neural networks [19]. In this model, textures are represented according to the relationships between the feature maps in each layer of the network. As the object information becomes clearer, texture extraction also captures more of the content features of natural images [20]. Images are input to the neural network, and a texture description is obtained through feature extraction at different convolutional layers and Gram matrix computation. White noise is then used as the starting image, and the texture losses at different layers are minimized to synthesize the texture. The same texture-synthesis idea is used to transfer the style of oil paintings: the middle layers of the network are used to reconstruct the content while the texture (style) statistics are matched, yielding a final image that combines the content with different oil-painting styles [21].
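The Gram matrix mentioned above is simply the set of inner products between the flattened feature maps of a layer. A minimal sketch in Python/TensorFlow, assuming a feature tensor of shape (batch, height, width, channels), is shown below:

```python
import tensorflow as tf

def gram_matrix(features):
    """Compute the Gram matrix of a feature tensor.

    features: tensor of shape (batch, height, width, channels).
    Returns a (batch, channels, channels) tensor whose entries are the inner
    products between pairs of flattened feature maps, normalized by the
    number of spatial positions.
    """
    # G[b, c, d] = sum over spatial positions of F[b, i, j, c] * F[b, i, j, d]
    gram = tf.einsum('bijc,bijd->bcd', features, features)
    num_positions = tf.cast(tf.shape(features)[1] * tf.shape(features)[2], tf.float32)
    return gram / num_positions
```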

Judging from the continuous stream of image processing work in recent years, researchers have produced many new works and ideas in image processing, and new breakthroughs have been made in image stylization, image recognition, video stylization, and other aspects [22, 23]. Researchers have introduced deep-learning-based image generation and image editing into cartoon stylization, using the learning ability of deep neural networks to study the mapping between natural images and cartoon-style images. Image transfer networks based on deep learning can be used for 2D cartoon-style rendering tasks, and other image-to-image conversion methods can also be used for cartoon-style rendering. The former regards cartoon style as a kind of abstract artistic style, while the latter models cartoon-style rendering as a domain-to-domain translation task between the domain of natural images and the domain of cartoon-style images [24].

Existing nonrealistic rendering methods for cartoon style can draw cartoon-style pictures in specific scenes, but they cannot handle the wide variety of practical application scenes. An excellent cartoon-style nonrealistic rendering algorithm should be able to render a natural image of any resolution, containing any object, into a cartoon-style image while preserving the semantic content of the image as far as possible. Nowadays people's living standards are improving and their pursuit of art is also growing, so image stylization algorithms based on deep learning have become increasingly worth exploring. In the future, deep learning and related algorithms are very likely to achieve excellent image-stylization results. We combine a convolutional neural network with an attention mechanism to obtain a preliminary improved network and establish a matching and migration algorithm to extract the features of the rendered image in many aspects.

Figure: Representation of the surface, structure, and texture extraction process.


2. LITERATURE REVIEW

2.1 RESEARCH PAPER-1


Author: Leon A. Gatys, Alexander S. Ecker, Matthias Bethge
Title: A Neural Algorithm of Artistic Style
Year: 2015
Overview:

An artificial system based on a deep neural network that creates artistic images of high perceptual quality. A pre-trained neural network is taken and a three-component loss function is defined that enables the end goal of style transfer; this loss function is then optimized.
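As an illustration, the sketch below shows a three-component loss of the kind described above. The content and style terms follow the usual neural style transfer formulation; the third term is assumed here to be a total-variation regularizer, and the weights are illustrative defaults, not values taken from the paper.

```python
import tensorflow as tf

def style_transfer_loss(content_feats, style_grams, gen_content_feats,
                        gen_grams, generated_image,
                        content_weight=1.0, style_weight=1e-2, tv_weight=1e-6):
    """Sketch of a three-component style-transfer loss.

    content_feats / gen_content_feats: feature maps of the content and generated
    images at the chosen content layer.
    style_grams / gen_grams: lists of Gram matrices of the style and generated
    images at the chosen style layers.
    """
    content_loss = tf.reduce_mean(tf.square(gen_content_feats - content_feats))
    style_loss = tf.add_n([
        tf.reduce_mean(tf.square(g_gen - g_style))
        for g_gen, g_style in zip(gen_grams, style_grams)
    ])
    # Total-variation term encourages spatial smoothness of the generated image.
    tv_loss = tf.reduce_sum(tf.image.total_variation(generated_image))
    return content_weight * content_loss + style_weight * style_loss + tv_weight * tv_loss
```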

Proposed Model:

The class of deep neural networks most powerful in image processing tasks are Convolutional Neural Networks. Convolutional Neural Networks consist of layers of small computational units that process visual information hierarchically in a feed-forward manner (Fig. 1). Each layer of units can be understood as a collection of image filters, each of which extracts a certain feature from the input image. Thus, the output of a given layer consists of so-called feature maps: differently filtered versions of the input image.

Figure 1: Convolutional Neural Network (CNN). A given input image is represented as a set of filtered images at each processing stage in the CNN. While the number of different filters increases along the processing hierarchy, the size of the filtered images is reduced by a downsampling mechanism (e.g. max-pooling), leading to a decrease in the total number of units per layer of the network. Content reconstructions: the information at different processing stages in the CNN can be visualised by reconstructing the input image from only the network's responses in a particular layer. The input image is reconstructed from layers 'conv1_1' (a), 'conv2_1' (b), 'conv3_1' (c), 'conv4_1' (d) and 'conv5_1' (e) of the original VGG network. Reconstruction from lower layers is almost perfect (a, b, c); in higher layers of the network, detailed pixel information is lost while the high-level content of the image is preserved (d, e). Style reconstructions: on top of the original CNN representations, a new feature space is built that captures the style of an input image. The style representation computes correlations between the different features in different layers of the CNN. The style of the input image is reconstructed from style representations built on increasing subsets of CNN layers ('conv1_1' (a); 'conv1_1' and 'conv2_1' (b); 'conv1_1' through 'conv3_1' (c); 'conv1_1' through 'conv4_1' (d); 'conv1_1' through 'conv5_1' (e)). This creates images that match the style of a given image at an increasing scale while discarding information about the global arrangement of the scene.

LIMITATION: While the Gatys et al. method produced beautiful neural style transfer
results, the problem was that it was quite slow.
2.2 RESEARCH PAPER-2

Author: Justin Johnson, Alexandre Alahi, Fei-Fei Li


Title: Perceptual Losses for Real-Time Style Transfer and Super-Resolution
Year: 2016
Overview:
Proposes the use of perceptual loss functions for training feed-forward networks for image transformation tasks. A feed-forward network is trained to solve, in real time, the optimization problem proposed by Gatys et al.

One approach to image transformation tasks is to train a feed-forward convolutional neural network in a supervised manner, using a per-pixel loss function to measure the difference between output and ground-truth images. This approach has been used, for example, by Dong et al. for super-resolution, by Cheng et al. for colorization, by Long et al. for segmentation [4], and by Eigen et al. for depth and surface normal prediction [5, 6]. Such approaches are efficient at test time, requiring only a forward pass through the trained network.

Proposed Model:

Feed-forward image transformation. In recent years, a wide variety of image transformation tasks have been trained with per-pixel loss functions. Semantic segmentation methods [4, 6, 14–17] produce dense scene labels by running networks in a fully convolutional manner over input images, training with a per-pixel classification loss. Recent methods for depth [6, 5, 18] and surface normal estimation [6, 19] are similar, transforming color input images into geometrically meaningful output images using a feed-forward convolutional network trained with per-pixel regression [5, 6] or classification [19] losses. Some methods move beyond per-pixel losses by penalizing image gradients [6], framing CRF inference as a recurrent layer trained jointly with the rest of the network [17], or using a CRF loss layer [18] to enforce local consistency in the output. The architecture of the transformation networks is inspired by [4] and [16], which use in-network downsampling to reduce the spatial extent of feature maps, followed by in-network upsampling to produce the final output image.
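As an illustration of the downsample-then-upsample transformation network described above, the following is a minimal sketch in TensorFlow/Keras; the exact layer counts, filter sizes and number of residual blocks used by Johnson et al. are simplified here and should be treated as assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_transform_net(input_shape=(256, 256, 3)):
    """Simplified image transformation network: in-network downsampling,
    a few residual blocks, then in-network upsampling back to image size."""
    inputs = tf.keras.Input(shape=input_shape)

    # Downsampling path reduces the spatial extent of the feature maps.
    x = layers.Conv2D(32, 9, strides=1, padding='same', activation='relu')(inputs)
    x = layers.Conv2D(64, 3, strides=2, padding='same', activation='relu')(x)
    x = layers.Conv2D(128, 3, strides=2, padding='same', activation='relu')(x)

    # Residual blocks operate at the reduced resolution.
    for _ in range(3):
        shortcut = x
        x = layers.Conv2D(128, 3, padding='same', activation='relu')(x)
        x = layers.Conv2D(128, 3, padding='same')(x)
        x = layers.add([x, shortcut])

    # Upsampling path restores the original resolution.
    x = layers.Conv2DTranspose(64, 3, strides=2, padding='same', activation='relu')(x)
    x = layers.Conv2DTranspose(32, 3, strides=2, padding='same', activation='relu')(x)
    outputs = layers.Conv2D(3, 9, padding='same', activation='sigmoid')(x)

    return tf.keras.Model(inputs, outputs)
```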
LIMITATION: While the Johnson et al. method is certainly fast, the biggest downside is
that you cannot arbitrarily select your style images like you could in the Gatys et al. method.
2.3 RESEARCH PAPER-3

Author: Dmitry Ulyanov, Andrea Vedaldi, Victor Lempitsky
Title: Instance Normalization: The Missing Ingredient for Fast Stylization
Year: 2017

Overview:

The change is limited to swapping batch normalization for instance normalization, and to applying the latter at both training and testing time. The resulting method can be used to train high-performance architectures for real-time image generation.
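A minimal sketch of instance normalization is shown below, implemented directly with tf.nn.moments rather than any particular library layer; the epsilon value and the omission of learnable scale/offset parameters are simplifying assumptions.

```python
import tensorflow as tf

def instance_norm(x, epsilon=1e-5):
    """Instance normalization for a tensor of shape (batch, height, width, channels).

    Unlike batch normalization, the mean and variance are computed per sample and
    per channel (over the spatial axes only), and the same computation is used at
    both training and test time.
    """
    mean, variance = tf.nn.moments(x, axes=[1, 2], keepdims=True)
    return (x - mean) / tf.sqrt(variance + epsilon)
```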

The recent work of Gatys et al. (2016) introduced a method for transferring the style of one image onto another, as demonstrated in Fig. 1. The stylized image simultaneously matches selected statistics of the style image and of the content image. Both style and content statistics are obtained from a deep convolutional network pre-trained for image classification. The style statistics are extracted from shallower layers and averaged across spatial locations, whereas the content statistics are extracted from deeper layers and preserve spatial information. In this manner, the style statistics capture the "texture" of the style image whereas the content statistics capture the "structure" of the content image. Although the method of Gatys et al. produces remarkably good results, it is computationally inefficient: the stylized image is obtained by iterative optimization until it matches the desired statistics, which in practice takes several minutes for an image of size 512 × 512. Two recent works, Ulyanov et al. (2016) and Johnson et al. (2016), sought to address this problem by learning equivalent feed-forward generator networks that can generate the stylized image in a single pass. These two methods differ mainly in the details of the generator architecture and produce results of comparable quality; however, neither achieved results as good as the slower optimization-based method of Gatys et al.

The work of Ulyanov et al. (2016) showed that it is possible to learn a generator network g(x, z) that applies to a given input image x the style of another image x0, reproducing to some extent the results of the optimization method of Gatys et al. Here, the style image x0 is fixed and the generator g is learned to apply the style to any input image x. The variable z is a random seed that can be used to obtain sample stylization results.
LIMITATION: Although fast, the biggest downside is that you cannot arbitrarily select your style images.
EXISTING SYSTEM

The existing system is the feed-forward approach of Johnson et al. ("Perceptual Losses for Real-Time Style Transfer and Super-Resolution", 2016) already reviewed in Section 2.2. An image transformation network is trained with perceptual loss functions so that, at test time, stylization requires only a single forward pass through the trained network and therefore runs in real time.

LIMITATION: While the Johnson et al. method is certainly fast, the biggest downside is that you cannot arbitrarily select your style images as you could in the Gatys et al. method: a separate network must be trained for each style.
PROPOSED SYSTEM

We propose to use neural style transfer, a machine learning algorithm that involves two images: the input image from the user and a style image that is used to apply a style to the input image. Examples of images generated using neural style transfer are shown below.

We propose to create a website with an image-upload feature: the user uploads an image, the server processes it using the neural style transfer algorithm, and the resulting image is presented to the user on the website, where it can be downloaded and shared. Fast neural style transfer is used by apps such as https://deepart.io, Prisma, Artisto, etc. We chose this approach over traditional image filters (e.g. posterizing an image with median and bilateral filters) because fast neural style transfer is a comparatively new and challenging technique that uses machine learning and image processing to produce a variety of stylized images from different input and style images. The algorithm can be implemented in Python, JavaScript, or Lua; we will use Python to implement the backend, and the front end of the website will be built with HTML, CSS and JS. In neural style transfer we have two images, style and content; we need to copy the style, meaning the patterns, the brushstrokes, and so on, from the style image and apply it to the content image. We will provide a set of style images that a user can use to apply different kinds of cartoon-like effects to their image.
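As a rough sketch of the website workflow (upload, stylize on the server, return the result), the following assumes a Flask backend and a hypothetical stylize_image() helper that wraps the trained style-transfer network; these names are illustrative assumptions, not part of the project's actual code.

```python
# Minimal sketch of the upload-and-stylize endpoint, assuming Flask.
# stylize_image() is a hypothetical helper that runs the trained
# style-transfer network on the uploaded photo.
import io
from flask import Flask, request, send_file
from PIL import Image

app = Flask(__name__)

@app.route('/cartoonize', methods=['POST'])
def cartoonize():
    # The user's photo arrives as multipart form data; the style is a form field.
    photo = Image.open(request.files['image']).convert('RGB')
    style_name = request.form.get('style', 'default')

    styled = stylize_image(photo, style_name)  # hypothetical model call

    # Return the stylized image so the website can display it and offer a download.
    buffer = io.BytesIO()
    styled.save(buffer, format='JPEG')
    buffer.seek(0)
    return send_file(buffer, mimetype='image/jpeg')
```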
PROBLEM DEFINITION

Training networks for different style images is time consuming and requires significant computational hardware (GPUs). Different content images may produce slightly differently styled images, and the precision of the cartoon-like effect depends entirely on the type of content image provided.
OBJECTIVE

1. To detect blurred and bold edges of the original RGB image.

2. To smooth and quantize the image and convert the RGB image to a grayscale image.

3. To identify the surface representation (a sketch illustrating objectives 1 and 2 with standard image filters follows below).
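As a rough illustration of objectives 1 and 2, the following sketch uses standard OpenCV operations (median blur and adaptive thresholding for edges, bilateral filtering and uniform color quantization for smoothing); the specific parameter values are illustrative assumptions rather than values fixed by this project.

```python
import cv2
import numpy as np

def detect_edges(img_bgr):
    """Objective 1: find bold edges of the RGB image."""
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
    gray = cv2.medianBlur(gray, 5)  # suppress noise before edge extraction
    # Adaptive thresholding produces thick, cartoon-like edge lines.
    return cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                                 cv2.THRESH_BINARY, 9, 2)

def smooth_and_quantize(img_bgr, levels=8):
    """Objective 2: smooth the image and quantize its colors."""
    smoothed = cv2.bilateralFilter(img_bgr, 9, 75, 75)
    step = 256 // levels
    quantized = (smoothed // step) * step + step // 2  # uniform color quantization
    return quantized.astype(np.uint8)

def cartoonize(img_bgr):
    """Combine the smoothed, quantized colors with the detected edges."""
    edges = detect_edges(img_bgr)
    colors = smooth_and_quantize(img_bgr)
    return cv2.bitwise_and(colors, colors, mask=edges)
```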


METHODOLOGY

Our implementation uses TensorFlow to train a fast style transfer network. We use roughly the same transformation network as described in Johnson et al., except that batch normalization is replaced with Ulyanov's instance normalization. We use a loss function close to the one described by Gatys et al., using VGG19 instead of VGG16 and typically using "shallower" layers than in Johnson's implementation (e.g. we use relu1_1 rather than relu1_2). Empirically, this results in larger-scale style features in the transformations.
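A minimal sketch of extracting features from chosen VGG19 layers with tf.keras is shown below. In Keras's naming, relu1_1 corresponds to the output of 'block1_conv1'; the particular set of style and content layers listed here is an illustrative assumption, not the exact configuration used in this project.

```python
import tensorflow as tf

# Illustrative layer choices; 'block1_conv1' is Keras's name for conv1_1 (relu1_1 output).
STYLE_LAYERS = ['block1_conv1', 'block2_conv1', 'block3_conv1',
                'block4_conv1', 'block5_conv1']
CONTENT_LAYERS = ['block4_conv2']

def build_vgg_extractor(layer_names):
    """Return a model mapping a preprocessed image batch to the listed VGG19 activations."""
    vgg = tf.keras.applications.VGG19(include_top=False, weights='imagenet')
    vgg.trainable = False
    outputs = [vgg.get_layer(name).output for name in layer_names]
    return tf.keras.Model(inputs=vgg.input, outputs=outputs)

# Example: build an extractor for both style and content layers.
extractor = build_vgg_extractor(STYLE_LAYERS + CONTENT_LAYERS)
# features = extractor(tf.keras.applications.vgg19.preprocess_input(image_batch))
```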

Step 1. Train a feed-forward neural network for each style.

Step 2. Fetch the uploaded content image (i.e. the user's image) and the trained feed-forward network for the selected style image (i.e. the style selected by the user).

Step 3. Calculate the cost/loss function of the generated image G using

f(G) = f_content(C, G) + f_style(S, G)

where C is the content image, S is the style image, f_content(C, G) is the content loss function, and f_style(S, G) is the style loss function (a code sketch of this loss appears after the steps below).

Step 4. After minimizing the loss function f(G) during training, stylize the image with a single forward pass through the image transformation network.

Step 5. Display the styled image to the user.
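Putting the pieces together, a rough sketch of one training step for a single style is shown below; it reuses the transformation network, VGG extractor and gram_matrix helpers sketched earlier, and the loss weights and layer ordering (style layers first, content layer last) are illustrative assumptions.

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)

@tf.function
def train_step(transform_net, extractor, content_batch, style_grams,
               content_weight=1.0, style_weight=10.0):
    """One optimization step of the feed-forward style network.

    content_batch: batch of content images.
    style_grams: precomputed Gram matrices of the fixed style image
    at the style layers.
    """
    with tf.GradientTape() as tape:
        generated = transform_net(content_batch)

        gen_feats = extractor(generated)
        content_feats = extractor(content_batch)

        # Content loss compares deep features of the generated and content images.
        content_loss = tf.reduce_mean(tf.square(gen_feats[-1] - content_feats[-1]))

        # Style loss compares Gram matrices at each style layer.
        style_loss = tf.add_n([
            tf.reduce_mean(tf.square(gram_matrix(g) - s))
            for g, s in zip(gen_feats[:-1], style_grams)
        ])

        total_loss = content_weight * content_loss + style_weight * style_loss

    grads = tape.gradient(total_loss, transform_net.trainable_variables)
    optimizer.apply_gradients(zip(grads, transform_net.trainable_variables))
    return total_loss
```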

Our approach is to decompose the image into three individual modules, i.e. the surface representation, the structure representation and the texture representation, which are used to extract the corresponding information. Surface discrimination extracts surface details from the image, texture discrimination extracts texture details from the image, and the structure representation extracts the global content and high-level features. The weights are adjusted through the loss function; using these weights we can control the style of the output image.
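A rough sketch of how the three representations could be extracted from an image is shown below. The white-box cartoonization formulation that inspires this decomposition uses guided filtering and superpixel segmentation; the bilateral filter, downscale/upscale and grayscale operations here are simplified stand-ins, not the exact method.

```python
import cv2

def surface_representation(img_bgr):
    """Edge-preserving smoothing that keeps flat color surfaces (stand-in for guided filtering)."""
    return cv2.bilateralFilter(img_bgr, 9, 75, 75)

def structure_representation(img_bgr, scale=8):
    """Coarse global structure: aggressively downscale and upscale to keep only large regions."""
    h, w = img_bgr.shape[:2]
    small = cv2.resize(img_bgr, (w // scale, h // scale), interpolation=cv2.INTER_AREA)
    return cv2.resize(small, (w, h), interpolation=cv2.INTER_NEAREST)

def texture_representation(img_bgr):
    """Single-channel texture information with color removed."""
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
    return cv2.cvtColor(gray, cv2.COLOR_GRAY2BGR)
```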
Block Diagram.
REFERENCES

1. Leon A. Gatys, Alexander S. Ecker, Matthias Bethge, "A Neural Algorithm of Artistic Style," 2015.
2. Justin Johnson, Alexandre Alahi, Li Fei-Fei, "Perceptual Losses for Real-Time Style Transfer and Super-Resolution," 2016.
3. Dmitry Ulyanov, Andrea Vedaldi, Victor Lempitsky, "Instance Normalization: The Missing Ingredient for Fast Stylization," 2017.
4. Leon A. Gatys, Alexander S. Ecker, Matthias Bethge, "Image Style Transfer Using Convolutional Neural Networks," 2016.
5. Chuan Li, Michael Wand, "Precomputed Real-Time Texture Synthesis with Markovian Generative Adversarial Networks," 2016.
6. D. Ulyanov, V. Lebedev, A. Vedaldi, V. Lempitsky, "Texture Networks: Feed-Forward Synthesis of Textures and Stylized Images," 2016.
7. Yanghao Li, Naiyan Wang, Jiaying Liu, Xiaodi Hou, "Demystifying Neural Style Transfer," 2017.
8. Vincent Dumoulin, Jonathon Shlens, Manjunath Kudlur, "A Learned Representation for Artistic Style," 2017.
9. Fujun Luan, Sylvain Paris, Eli Shechtman, Kavita Bala, "Deep Photo Style Transfer," 2017.
10. Yongcheng Jing, Yezhou Yang, Zunlei Feng, Jingwen Ye, Yizhou Yu, Mingli Song, "Neural Style Transfer: A Review," 2018.
11. Dongdong Chen, Lu Yuan, Jing Liao, Nenghai Yu, Gang Hua, "StyleBank: An Explicit Representation for Neural Image Style Transfer," 2017.
12. Keiji Yanai, Ryosuke Tanno, "Conditional Fast Style Transfer Network," 2017.
13. Agrim Gupta, Justin Johnson, Alexandre Alahi, Li Fei-Fei, "Characterizing and Improving Stability in Neural Style Transfer," 2017.
