
Video-to-Video Translation Models Using Generative Adversarial Networks


Yonas Brhanu SGSR/0134/12
Department of Computer Engineering, Wollo University

KIoT, Kombolcha
yonas.berhanu@wu.edu.et

Abstract: Generative modelling has been a very extensive area of research, since it finds immense use cases across multiple domains. Various models have been proposed in the recent past, including Fully Visible Belief Nets, NADE, MADE, PixelRNN, Variational Auto-Encoders, Markov Chain models, and Generative Adversarial Networks. Amongst all these models, Generative Adversarial Networks have consistently shown huge potential and progress in the areas of art, music, semi-supervised learning, handling missing data, drug discovery, and unsupervised learning. Generative Adversarial Networks (GANs) are one of the promising models that synthesize data samples similar to real data samples, and this emerging technology has reshaped the research landscape in the field of generative modeling. Research on Generative Adversarial Networks (GANs) was introduced by Ian J. Goodfellow et al. in 2014 [1]. Since its inception, various models have been proposed over the years and are now considered state-of-the-art in generative modeling. In this paper, I review the original GAN model and its modified versions for video-to-video translation. First, I summarize the different architectures proposed, along with the objective functions and loss functions used. Second, I cover the evolution of GANs, followed by a comparative analysis of various GANs for video-to-video translation. Then, I review the various approaches to video-to-video translation presented by other authors.

Keywords
GAN, Cycle-GAN, Adversarial Networks, Pix2Pix, MoCycle-GAN, unsupervised video translation
I. Introduction

Video-to-video transfer is a domain transfer problem that aims to transfer sequential content information from one domain to another while preserving the style of the target domain. Current approaches to domain transfer fall broadly into three classes. Early techniques use classical computer vision mechanisms specifically designed for particular body parts such as the human face [1]; they lack generalization and do not work well under occlusion. The second approach uses paired image-to-image translation such as Pix2Pix, which converts an image pixel by pixel: Isola et al. [2] use a conditional GAN [3] to learn a mapping from a paired input image to the output image. The third category is unsupervised, unpaired domain transfer such as Cycle-GAN [4], which enforces cycle consistency for unpaired images.

Image-to-image translation aims to learn a mapping function between an input image and an output image in different domains. It basically involves the precise modification of an image while preserving its content information, and it requires large datasets of paired images that are complex to prepare, meaning the dataset should contain images in one-to-one correspondence. Zhu et al. claim that the difficulty with image-to-image translation is the need for a paired dataset for training, which in reality is very expensive to collect and not scalable, although some work achieves good results; the authors note that Pix2Pix by Isola et al. is one of the papers that achieves good results with a conditional generative adversarial model.

The recent state-of-the-art work ReCycle-GAN by Bansal et al. [5], motivated by [4], proposes video retargeting via a spatiotemporal constraint, directly synthesizing future frames via a temporal predictor to preserve temporal continuity. Bansal et al. claim that video-to-video translation is still under-constrained, since their results show very noticeable flickering in the transferred video. This work proposes to extend Bansal et al.'s work to improve temporal continuity between adjacent consecutive frames by introducing additional temporal cycle consistency constraints, and also proposes spatiotemporal video-to-video translation for more realistic results.
II. Detail review

Pix2Pix [1] is a generative model by Isola et al. trained in a supervised manner using a paired dataset; it fits into supervised image-to-image translation. Pix2Pix, as the name indicates, learns to map the pixels of the first image to those of the second. Because paired datasets are, in reality, very rare and expensive, Zhu et al. [2] came up with the cycle consistency constraint, invented to learn a bidirectional mapping in the absence of paired training data via a cycle consistency loss. The cycle consistency loss is used to learn the transformation between two domains in a forward and backward fashion.
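Concretely, with generators G: X → Y and F: Y → X and discriminators D_X, D_Y, the adversarial and cycle consistency terms of Zhu et al. [2] can be written (up to notation) as

\[
\mathcal{L}_{GAN}(G, D_Y, X, Y) = \mathbb{E}_{y \sim p_{data}(y)}[\log D_Y(y)] + \mathbb{E}_{x \sim p_{data}(x)}[\log(1 - D_Y(G(x)))],
\]
\[
\mathcal{L}_{cyc}(G, F) = \mathbb{E}_{x \sim p_{data}(x)}\big[\lVert F(G(x)) - x \rVert_1\big] + \mathbb{E}_{y \sim p_{data}(y)}\big[\lVert G(F(y)) - y \rVert_1\big],
\]
\[
\mathcal{L}(G, F, D_X, D_Y) = \mathcal{L}_{GAN}(G, D_Y, X, Y) + \mathcal{L}_{GAN}(F, D_X, Y, X) + \lambda\,\mathcal{L}_{cyc}(G, F),
\]

where the weight \(\lambda\) balances the cycle term against the adversarial terms (\(\lambda = 10\) in [2]). The video-to-video methods reviewed below all start from this per-frame objective and add temporal terms on top of it.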
Since video-to-video translation is a natural extension of image-to-image translation, the author extends Zhu et al.'s [2] work by modifying the generator network of the Cycle-GAN model, keeping the same network setup as [2]. The resulting translations, however, exhibit noticeable flickering.
To overcome the flickering effect, Chen et al. [3] consider
temporal information along with spatial information. Specifically,
they exploit the optical flow of the previous frame to warp the current frame and thereby impose a temporal constraint. This paradigm, however, is prone to failure under occlusion and fast illumination changes, since optical flow does not account for pixels newly introduced into the scene.
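A minimal sketch of such a flow-based temporal term, assuming a (hypothetical) warping operator W, an estimated flow w_t from frame t-1 to frame t, and an occlusion mask M_t, is

\[
\mathcal{L}_{temp} = \sum_{t} \big\lVert M_t \odot \big(\hat{y}_t - W(\hat{y}_{t-1}, w_t)\big) \big\rVert_1,
\]

where \(\hat{y}_t\) is the translated frame at time t. The mask M_t discounts occluded or newly revealed pixels, which is exactly where this constraint becomes unreliable, as noted above.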
The Unsupervised Video-to-Video Translation paper by Bashkirova et al. [8] models temporal information using 3D convolutional layers embedded in Cycle-GAN. Their results show better color preservation and fewer artifacts compared to Cycle-GAN, but the model's black-box nature makes it hard to train. Moreover, the results show a lack of robustness to different video lengths.
Video-to-Video Translation with Global Temporal Consistency [5] by Wei et al. further extends optical-flow frame warping: the authors present a mechanism focused on video-level consistency, using a residual error and a two-channel discriminator to minimize the total mean absolute (L1) distance between the optical flow maps of consecutive frames. Ultimately, this approach fails on longer videos and on fast-motion videos, since the network is constrained to minimize the temporal difference along the whole video.
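One schematic reading of that flow-matching objective (my notation, not necessarily the authors' exact formulation), with \(\Phi\) denoting an optical flow estimator, \(x_t\) the source frames and \(\hat{y}_t\) the translated frames, is

\[
\mathcal{L}_{flow} = \sum_{t} \big\lVert \Phi(\hat{y}_t, \hat{y}_{t+1}) - \Phi(x_t, x_{t+1}) \big\rVert_1,
\]

i.e. the motion field of the generated video is pushed toward the motion field of the input video over the entire sequence, which explains why long videos and fast motion are hard to satisfy.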

Another fine work, MoCycle-GAN by Chen et al. [4], introduces a motion-guided Cycle-GAN to transfer estimated motion between domains. The work is explicitly designed as a two-way network that treats spatial and temporal information separately: the spatial network is a Cycle-GAN, and the temporal network is a motion translation network. The work also enforces a temporal cycle constraint (motion cycle consistency) for motion reconstruction. Even though an explicit motion translation network is a blessing, the number of model parameters increases enormously. Another pitfall of this work is that the network still relies on the cycle constraint for content translation.

The state-of-the-art work ReCycle-GAN [6] further extends the cycle consistency constraint by incorporating it with a temporal predictor network, turning it into a spatiotemporal (recycle) constraint. However, directly synthesizing future frames via a temporal predictor to preserve temporal continuity still leaves the problem under-constrained, and this under-constraint in ReCycle-GAN results in undesirable motion in the generated video.
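For reference, the recycle objective of [6] (written here up to notation) combines the mappings G_Y: X → Y and G_X: Y → X with a temporal predictor P_Y that extrapolates the next frame in domain Y:

\[
\mathcal{L}_{r}(G_X, G_Y, P_Y) = \sum_{t} \big\lVert x_{t+1} - G_X\big(P_Y(G_Y(x_{1:t}))\big) \big\rVert^2,
\]

where \(G_Y(x_{1:t})\) denotes applying \(G_Y\) frame-wise to the past frames. A source sequence is translated to the target domain, advanced in time by the predictor, and mapped back, and the result is compared against the true next source frame; an analogous term is used in the other direction.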
Another recent quality work by Park et al. [7] proposes warping the ground truth with optical flow and applying a content loss on the frames to guarantee consistency, in order to overcome temporal flickering and motion inconsistency between frames. Temporal flow consistency is another contribution of this work; it works very well if the two domains are similar in nature, but has little impact when the motions of the two videos differ slightly.

III. RELATED WORK

As the study of GANs is accelerating rapidly, there are constantly new GAN frameworks that were not covered in existing review papers. The timeline of survey papers found by searching Google Scholar for the keywords "overview of generative adversarial networks", "survey of generative adversarial networks", and "review of generative adversarial networks", as well as the papers cited in the retrieved papers, is illustrated in Figure 1. The timeline also includes the publication dates of the review papers discussed in this section, besides the publication dates of the notable GAN frameworks discussed. While some of the review papers provide overviews of state-of-the-art GANs, others focus on GANs for a specific domain (e.g. image generation). Reviews of GANs for general visual image datasets outnumber other specialized GAN reviews, including those in cybersecurity, anomaly detection and medical imaging (see below). As shown in Figure 1, the red diamonds that represent general reviews of GANs dominate the other categories.

Fig 1: Timeline of the review papers, along with recent GAN advances (light blue).

The following table shows a detailed comparison of the reviewed papers across different parameters.

Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks [2]
Dataset: Cityscapes, Horse to Zebra, Apple to Orange, Summer to Winter, Yosemite.
Architecture: Cycle consistency constraint.
Temporal information modeling: No temporal information is considered.
Temporal constraint applied on: -
Evaluation metrics used: FCN score, IS.
Limitation: Frame-wise image-to-image translation.

Unsupervised Video-to-Video Translation [8]
Dataset: Volumetric MNIST, GTA segmentation-to-video, and MRI-to-CT.
Architecture: 3D Cycle-GAN (3D-Conv-net).
Temporal information modeling: The network implicitly learns from the input video.
Temporal constraint applied on: -
Evaluation metrics used: Human evaluation, pixel accuracy, and L2 error between the original and retranslated image.
Limitation: The 3D tensor fails to learn temporal consistency between frames; fixed-length videos only.

Video-to-Video Translation with Global Temporal Consistency [5]
Dataset: DAVIS 2017.
Architecture: RNN-based Cycle-GAN and RNN-based discriminator.
Temporal information modeling: Optical flow; temporal residual error minimizer for global temporal consistency.
Temporal constraint applied on: Generator and discriminator networks.
Evaluation metrics used: Peak Signal-to-Noise Ratio, Region Similarity, and Contour Accuracy.
Limitation: Complex architecture that is hard to train; inappropriate for videos containing fast object motion; does not work for long videos.

MoCycle-GAN: Unpaired Video-to-Video Translation [4]
Dataset: Flower video and Viper datasets.
Architecture: Cycle-GAN with a motion translator network.
Temporal information modeling: Optical flow with motion-translator-based motion cycle consistency.
Temporal constraint applied on: Generator network.
Evaluation metrics used: Human evaluation, IoU, pixel accuracy, average class accuracy.
Limitation: Explicit motion translator, and no content translation.

Recycle-GAN: Unsupervised Video Retargeting [6]
Dataset: Viper, face, and flower datasets (more than 10,000 images).
Architecture: Cycle-GAN with a recurrent temporal predictor.
Temporal information modeling: Recurrent temporal predictor (Pix2Pix).
Temporal constraint applied on: Generator network.
Evaluation metrics used: Human evaluation, IoU, pixel accuracy, average class accuracy, IS.
Limitation: The temporal predictor fails to predict correctly, and no content translation.

Preserving Semantic and Temporal Consistency for Unpaired Video-to-Video Translation [7]
Dataset: Viper dataset.
Architecture: Cycle-GAN with a flow estimator network and a spatial warping network.
Temporal information modeling: Optical-flow-based temporal consistency, fused to further reduce the temporal warping error.
Temporal constraint applied on: Generator network; uses [43] for improving the occlusion problem.
Evaluation metrics used: mIoU, fwIoU, and pixel accuracy.
Limitation: The input domain videos must have very similar content.
IV. Conclusion

This paper aims to provide a comparative analysis across the video translation domain, to discuss how to improve the stability of GAN networks, and to discuss how to improve the network cost function so as to enhance the generated video. Generative models such as GANs provide promising results in multiple domains, including images, videos, audio and text. Video synthesis, however, is still in its early stages compared to other domains such as images.

V. References

[1] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, "Image-to-Image Translation with Conditional Adversarial Networks," CoRR, vol. abs/1611.07004, 2016. [Online]. Available: http://arxiv.org/abs/1611.07004.

[2] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks," CoRR, vol. abs/1703.10593, 2017. [Online]. Available: http://arxiv.org/abs/1703.10593.

[3] D. Chen, J. Liao, L. Yuan, N. Yu, and G. Hua, "Coherent Online Video Style Transfer," in Proc. IEEE Int. Conf. on Computer Vision (ICCV), Oct. 2017, pp. 1114–1123, doi: 10.1109/ICCV.2017.126.

[4] Y. Chen, Y. Pan, T. Yao, X. Tian, and T. Mei, "Mocycle-GAN: Unpaired Video-to-Video Translation," in Proc. 27th ACM Int. Conf. on Multimedia (MM 2019), Aug. 2019, pp. 647–655, doi: 10.1145/3343031.3350937.

[5] X. Wei, S. Feng, J. Zhu, and H. Su, "Video-to-Video Translation with Global Temporal Consistency," in Proc. 2018 ACM Multimedia Conf. (MM 2018), 2018, pp. 18–25, doi: 10.1145/3240508.3240708.

[6] A. Bansal, S. Ma, D. Ramanan, and Y. Sheikh, "Recycle-GAN: Unsupervised Video Retargeting," CoRR, vol. abs/1808.05174, 2018. [Online]. Available: http://arxiv.org/abs/1808.05174.

[7] K. Park, S. Woo, D. Kim, D. Cho, and I. S. Kweon, "Preserving Semantic and Temporal Consistency for Unpaired Video-to-Video Translation," in Proc. 27th ACM Int. Conf. on Multimedia (MM 2019), Aug. 2019, pp. 1248–1257, doi: 10.1145/3343031.3350864.

[8] D. Bashkirova, B. Usman, and K. Saenko, "Unsupervised Video-to-Video Translation," 2018. [Online]. Available: http://arxiv.org/abs/1806.03698.

[9] M. El-Kaddoury, A. Mahmoudi, and M. M. Himmi, "Deep Generative Models for Image Generation: A Practical Comparison Between Variational Autoencoders and Generative Adversarial Networks," in Int. Conf. on Mobile, Secure, and Programmable Networking, Springer, 2019, pp. 1–8.

[10] Y. Hong, U. Hwang, J. Yoo, and S. Yoon, "How Generative Adversarial Networks and Their Variants Work: An Overview," ACM Computing Surveys (CSUR), vol. 52, no. 1, p. 10, 2019.
