
Universidade Federal do Rio Grande do Norte

Centro de Ciências Exatas e da Terra


Departamento de Informática e Matemática Aplicada
Bachelor in Computer Science

Procedural Terrain Generation through Image Completion using GANs

Lucas Torres de Souza

Natal-RN
November 2019
Lucas Torres de Souza

Procedural Terrain Generation through Image Completion using GANs

Bachelor's dissertation presented to the Departamento de Informática e Matemática Aplicada at the Centro de Ciências Exatas e da Terra at the Universidade Federal do Rio Grande do Norte, in partial fulfillment of the requirements for the degree of Bachelor in Computer Science.

Advisor
Bruno Motta de Carvalho, PhD

Universidade Federal do Rio Grande do Norte – UFRN


Departamento de Informática e Matemática Aplicada – DIMAp

Natal-RN
November 2019
Bachelor dissertation under the title Procedural Terrain Generation through Image Com-
pletion using GANs presented by Lucas Torres de Souza and accepted by the Departa-
mento de Informática e Matemática Aplicada of the Centro de Ciências Exatas e da Terra
of the Universidade Federal do Rio Grande do Norte, being approved by all the members
of the thesis committee specified below:

Bruno Motta de Carvalho, PhD


Advisor
Departamento de Informática e Matemática Aplicada
Universidade Federal do Rio Grande do Norte

André Maurício Cunha Campos, PhD


Departamento de Informática e Matemática Aplicada
Universidade Federal do Rio Grande do Norte

Selan Rodrigues dos Santos, PhD


Departamento de Informática e Matemática Aplicada
Universidade Federal do Rio Grande do Norte

Natal-RN, November 2019.


One reality won’t be enough for her now.

Christopher Nolan, Inception


Procedural Terrain Generation through Image
Completion using GANs

Author: Lucas Torres de Souza


Advisor: Bruno Motta de Carvalho, PhD

Abstract

Procedural terrain generation is the creation of virtual landscapes through algorithmic means. There are various well-tested methods for terrain generation, but most require manual parameter tuning to obtain the expected results. In this work, we propose a system that generates terrain height maps and color textures based on real-world examples. This generator system is constructed using Generative Adversarial Networks, a deep learning architecture that has shown great results in image synthesis tasks in recent years. We model the terrain generation problem as a texture completion task. This results in a system that can not only generate new terrain, but also expand and connect existing terrains. While the described system has limitations, it provides a useful framework for more complete systems as geospatial data becomes more readily available.

Keywords: Procedural Terrain Generation, Generative Adversarial Networks, Image Completion.
Geração Procedural de Terrenos por Compleção de
Imagem utilizando Redes Adversárias Generativas

Autor: Lucas Torres de Souza


Orientador: Dr. Bruno Motta de Carvalho

Resumo

Geração procedural de terrenos é a criação de paisagens virtuais através de métodos al-


gorítmicos. Existem vários métodos bem testados para a geração de terrenos, mas a sua
maioria exige a configuração manual de parâmetros. Neste trabalho, nós propomos um
sistema que gera mapas de altura e texturas de cor para terrenos, baseado em exemplos
do mundo real. Este sistema gerador é construído utilizando Redes Adversárias Genera-
tivas, uma arquitetura de aprendizado profundo que, nos últimos anos, mostrou ótimos
resultados em tarefas de geração de imagens. Nós modelamos o problema de geração de
terreno como uma tarefa de compleção de textura. Isso resulta num sistema que não só
é capaz de gerar novos terrenos, mas também expandir e conectar terrenos já existentes.
Enquanto o sistema descrito possui limitações, ele provê um framework útil para sistemas
mais completos, conforme dados geoespaciais se tornam mais disponíveis.

Palavras-chave: Geração Procedural de Terrenos, Redes Adversárias Generativas, Com-


pleção de Imagens.
List of Figures

2.1 The image completion process. The input 2.1a is an incomplete image,
whose unknown region is marked here in black. The process creates a
filled output image 2.1b. . . . . . . . . . . . . . . . . . . . . . . . . . . p. 14

2.2 Good and bad similarity after completion . . . . . . . . . . . . . . . . . p. 15

2.3 Good and bad continuity after completion . . . . . . . . . . . . . . . . p. 16

2.4 Good and bad feature continuation after completion . . . . . . . . . . . p. 16

2.5 Good and bad isotropy after completion . . . . . . . . . . . . . . . . . p. 17

3.6 GAN architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 22

3.7 Terrain generated with the midpoint method, after 8 subdivisions . . . p. 24

3.8 Terrain generated with a Perlin noise height map . . . . . . . . . . . . p. 24

3.9 Steep cliffs with talus slopes at their feet . . . . . . . . . . . . . . . . . p. 25

3.10 Terrain generated by (BECKHAM; PAL, 2017) . . . . . . . . . . . . . . . p. 26

3.11 Terrain generated by (SPICK; COWLING; WALKER, 2019) . . . . . . . . p. 27

4.12 Our architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 28

4.13 Generator architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 31

4.14 Discriminator architecture . . . . . . . . . . . . . . . . . . . . . . . . . p. 32

4.15 Terrain mesh topology . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 34

4.16 Visualization of real terrain sample from the Alps dataset . . . . . . . . p. 35

5.17 Crops of size 256 × 256 of each test region (color map on the left, height
map on the right) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 37

5.18 Completion of 64 × 64 square inside crops of size 128 × 128, in the Alps
dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 38

5.19 Visualization of inpainting result depicted in Figure 5.18a . . . . . . . . p. 38


5.20 Completion of 64 × 64 square inside crops of size 128 × 128, in various
datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 39

5.21 Expansion by 32 pixels of crops of size 128 × 128, in the Alps dataset . p. 41

5.22 Visualization of expansion result depicted in Figure 5.21a . . . . . . . . p. 42

5.23 Expansion by 32 pixels of crops of size 128 × 128, in various datasets . p. 43

5.24 Crops of size 256 × 256 of each test region, with color map on the left
and height map on the right . . . . . . . . . . . . . . . . . . . . . . . . p. 44

5.25 Visualization of generation result depicted in Figure 5.24a . . . . . . . p. 44

5.26 Original crops of size 128 × 128 of each test region . . . . . . . . . . . . p. 45

5.27 Result of network trained on the Alps dataset . . . . . . . . . . . . . . p. 46

5.28 Result of network trained on the Canyon dataset . . . . . . . . . . . . p. 46

5.29 Result of network trained on the Ethiopia dataset . . . . . . . . . . . . p. 47

5.30 Result of network trained on the Maze dataset . . . . . . . . . . . . . p. 47

5.31 Visualization of style transfer . . . . . . . . . . . . . . . . . . . . . . . p. 48


List of Abbreviations

PCG – Procedural Content Generation

GAN – Generative Adversarial Network

RGB – Red Green Blue

ANN – Artificial Neural Network

DCGAN – Deep Convolutional Generative Adversarial Network

SGAN – Spatial GAN

GPU – Graphics Processing Unit

DEM – Digital Elevation Model

HSV – Hue Saturation Value


Contents

1 Introduction p. 11

1.1 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 12

1.2 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 12

2 Problem Definition p. 13

2.1 Terrain representation . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 13

2.2 Image completion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 14

2.3 Generated terrain quality metrics . . . . . . . . . . . . . . . . . . . . . p. 15

2.3.1 Similarity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 15

2.3.2 Continuity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 15

2.3.3 Feature continuation . . . . . . . . . . . . . . . . . . . . . . . . p. 16

2.3.4 Isotropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 16

3 Techniques and Related Works p. 18

3.1 Generative Adversarial Networks . . . . . . . . . . . . . . . . . . . . . p. 18

3.1.1 Artificial neural networks . . . . . . . . . . . . . . . . . . . . . . p. 18

3.1.1.1 Training . . . . . . . . . . . . . . . . . . . . . . . . . . p. 19

3.1.2 Convolutional neural networks . . . . . . . . . . . . . . . . . . . p. 20

3.2 Generative adversarial networks . . . . . . . . . . . . . . . . . . . . . . p. 21

3.3 Procedural terrain generation . . . . . . . . . . . . . . . . . . . . . . . p. 23

3.3.1 Traditional methods . . . . . . . . . . . . . . . . . . . . . . . . p. 23

3.3.2 Methods based on data . . . . . . . . . . . . . . . . . . . . . . . p. 25


3.3.3 Methods based on GANs . . . . . . . . . . . . . . . . . . . . . . p. 25

4 Methodology p. 28

4.1 Data Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 29

4.2 Generator network architecture . . . . . . . . . . . . . . . . . . . . . . p. 30

4.3 Discriminator network architecture . . . . . . . . . . . . . . . . . . . . p. 31

4.4 Loss function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 32

4.5 Training procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 33

4.6 Reconstruction and visualization . . . . . . . . . . . . . . . . . . . . . . p. 34

5 Results p. 36

5.1 Inpainting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 37

5.2 Expansion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 40

5.3 Full Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 43

5.4 Style Transfer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 44

5.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 48

6 Final Remarks p. 50

6.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . p. 50

6.1.1 Expanded architectures . . . . . . . . . . . . . . . . . . . . . . . p. 51

References p. 53

1 Introduction

Procedural content generation (PCG) is the creation of digital content through algo-
rithmic means. It is a toolset frequently used by software developers for dynamic creation
of various virtual items. By using PCG, it is possible to create many variants of a kind of object without the need to design each form by hand, decreasing production costs. It
also allows the creation of content after deployment, dynamically, as needed.

Most developments in the PCG area were motivated by game development. In games, procedural content generation increases replayability, as new content can be created for each session, and adaptability, as content can be adjusted to different play styles (SMITH et al., 2011).

A major topic within procedural modeling is the automatic generation of terrains;


studies on terrain elevation and vegetation have been developed since the 1980s. These studies cover both purely procedural and data-based approaches (SMELIK et al., 2009).

Some more recent works attempt to use deep neural networks for generation (BECK-
HAM; PAL, 2017). However, many aspects of terrain generation using neural networks
are still under study. There is a great range of generative tasks that modern neural net-
works are capable of executing; however, current works limit themselves only to full image
generation and colorization. Other functionalities, like harmonization, completion, super-
resolution, style transfer and texture expansion have not been explored.

In this work, we propose a procedural terrain generation system based on completion:


the task of filling in incomplete images. The advantage of the completion method is its
flexibility; it allows both the harmonization with fixed, manually prepared terrain regions
and infinite world generation through expansion.

Specifically, we train a Generative Adversarial Network (GAN) with real terrain


samples and use it for terrain synthesis. GANs have been used with good results for a
great variety of image synthesis tasks, such as super-resolution, translation, face synthesis

and texture synthesis. Our work is a specific example of the latter.

The desired characteristics of the generated terrain are application-dependent. Our system synthesizes terrains for 3D visualization purposes. Hence, we focus on two terrain aspects: height and color. Those features are encoded in a rectangular texture image, which is afterwards converted into a 3D mesh.

1.1 Objectives

Our objective in this work is to explore the capabilities of a completion-oriented terrain


generation system. We expect that such a system would be able to

• generate new terrains (height and color information) that are similar to a reference
terrain;

• expand and combine patches of given terrains, filling missing regions with generated
terrain;

• harmonize terrains so that they are similar to the reference terrain.

Within the scope of this work, similarity is a visual metric. Therefore, we would
achieve our goals as long as the results in the described tasks look like the reference
terrain; statistical equivalence and physical coherence are not required.

1.2 Outline

The terrain generation problem is detailed and formalized in Chapter 2. In Chapter


3 we give a brief introduction to Generative Adversarial Networks and comment on the
related works on procedural terrain generation and texture synthesis. The full description
of our system, including architecture and training method, is given in Chapter 4. The results are presented and analyzed in Chapter 5. Finally, we reflect on the achieved and missing features of our generator in Chapter 6.

2 Problem Definition

The objective of this work is to create a procedural terrain generation system that,
given an incomplete description of a terrain region, fills the missing parts with terrain of
a similar type, in a way such that the end result appears to be a realistic terrain.

2.1 Terrain representation

Not every terrain feature is relevant for visualization purposes. Two features are important for our purposes: the topography and the surface color.

We assume that the topography can be modeled as the graph of a height function
h : A → R, that maps each point from an interest area A ⊂ R2 into its height above
a reference level. That representation is unable to express every kind of terrain feature;
for instance, caves cannot be represented. Topographic maps like these have existed for
centuries, and became a natural representation method for virtual terrains.

The perceived color of a surface depends on many factors, like the position of the
observer, illumination and material properties. In our simplified model we will factor out
illumination and assume relative observer position is irrelevant; the visual color will be
represented by a function c : A → R3 , that maps each point from an interest area A ⊂ R2
into its material color, represented in RGB. The use of two dimensional information to
describe surface characteristics of three dimensional objects is one of the oldest ideas in
computer graphics (CATMULL, 1974).

We can combine both height and color information in a single function f : A → R4,

f(x) = (h(x), c(x)).

Computationally, we discretize the domain of f to allow representation. We call the


discrete representation a texture. For visualization purposes, we separate the terrain
texture into a height map and a color map, encoding topography and perceived color
respectively.
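As a concrete illustration, a minimal sketch of this discrete representation follows; the sampling functions below are placeholders standing in for h and c, not part of the system itself.

import numpy as np

# A discrete terrain texture: one height channel plus three RGB color channels,
# sampled on a regular grid over the interest area A.
resolution = 128
ys, xs = np.mgrid[0:resolution, 0:resolution].astype(np.float32)

height = 0.5 * (np.sin(0.05 * xs) + np.cos(0.05 * ys))                  # stands in for h(x)
color = np.broadcast_to([0.3, 0.6, 0.2], (resolution, resolution, 3))   # stands in for c(x)

texture = np.concatenate([height[..., None], color], axis=2)   # shape (128, 128, 4)
height_map = texture[..., 0]     # topography
color_map = texture[..., 1:]     # perceived color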

2.2 Image completion

The image completion problem is, as the name suggests, the task of filling an incom-
plete image. It has also been called image inpainting, region filling, void filling, texture
synthesis and interpolation of missing data. This task was originally motivated by the
removal of artifacts and defects from photography (KOKARAM et al., 1995; BERTALMIO et
al., 2000). Those first methods used spatial and temporal interpolation, which limited the
size of the filled regions. Larger gaps motivated texture synthesis algorithms (ASHIKHMIN,
2001). Later, more flexible methods combining both approaches were developed (CRIMIN-
ISI; PÉREZ; TOYAMA, 2004). The advent of deep learning allowed image completion based
on the semantics of the surrounding context (YEH et al., 2017).

Formally, let I be the set of all images with domain A, p a probability distribution over I, K ⊂ A and j : K → R. We call K the set of known pixels, and j the known part of the image. Let Ij ⊂ I be the set of images that coincide with j in the known pixels, that is,

Ij = {i ∈ I | ∀k ∈ K, i(k) = j(k)} .

The image completion problem is to find the image i ∈ Ij that maximizes p(i). In other
words, we want to find the most likely image that agrees with what we already know
about it. The probability distribution p defines what class of images we are looking for.
For example, p might model the likelihood that an image is the texture of a valley.

An example of image completion can be seen in Figure 2.1.


Figure 2.1: The image completion process. The input 2.1a is an incomplete image, whose
unknown region is marked here in black. The process creates a filled output image 2.1b.

Many terrain generation tasks can be posed as instances of the image completion problem. To generate a terrain texture from nothing, it is enough to set K = ∅. To extend an existing texture, we set j to that texture, K to its domain, and enlarge A. We can also link two given textures by treating the gap between them as unknown.
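A minimal sketch of how these cases reduce to choices of the known-pixel set K, encoded here as boolean masks over the texture domain (sizes and array names are illustrative, not taken from the original implementation):

import numpy as np

size = 128

# Full generation: nothing is known (K is empty).
generation_mask = np.zeros((size, size), dtype=bool)

# Inpainting: everything is known except a central square.
inpainting_mask = np.ones((size, size), dtype=bool)
inpainting_mask[32:96, 32:96] = False

# Expansion: the original texture is known, the surrounding
# 32-pixel border of the enlarged domain is not.
expansion_mask = np.zeros((size + 64, size + 64), dtype=bool)
expansion_mask[32:-32, 32:-32] = True

# Linking: two known patches with an unknown gap between them.
linking_mask = np.zeros((size, 2 * size + 64), dtype=bool)
linking_mask[:, :size] = True
linking_mask[:, -size:] = True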

2.3 Generated terrain quality metrics

The quality of the generated terrain is subjective. An extensive study of what features
of real terrain are relevant to perceptual realism was conducted by (RAJASEKARAN et
al., 2019). In this work, we skip such detailed analysis; instead, we propose four general,
visually distinctive metrics to evaluate our results. These metrics were chosen after obser-
vation of our various results; therefore, they are closely associated with our methodology
and not all are adequate for analysis of other works. The chosen metrics are similarity,
continuity, feature continuation and isotropy.

2.3.1 Similarity

The completed output texture must match the terrain features in the incomplete
input; deserts should not be completed with forests, plains should not be completed with
mountains. The detail level must also not change; rough terrains should not be smoothed
by the completion process, and small features in the incomplete image should also appear
in the filled areas. We call this characteristic similarity (Figure 2.2).

(a) Incomplete image (b) Good similarity (c) Bad similarity

Figure 2.2: Good and bad similarity after completion

2.3.2 Continuity

In the output, the border between the known and filled regions should not be de-
tectable; in this work, we call this property continuity (Figure 2.3). Any border artifacts

would induce unwanted, visible geometry, especially if the texture completion method is used repeatedly. For example, creating a larger texture tile by tile through repeated completion would generate a grid pattern on the terrain.

(a) Incomplete image (b) Good continuity (c) Bad continuity

Figure 2.3: Good and bad continuity after completion

2.3.3 Feature continuation

There are many large terrain features that are expected not to end abruptly. Rivers,
roads and mountain ranges are examples of such features. Ideally, features in the known
region are continued in the generated one, and no feature that is not continued is created.
We call this property feature continuation (Figure 2.4).

(a) Incomplete image (b) Good feature continuation (c) Bad feature continuation

Figure 2.4: Good and bad feature continuation after completion

2.3.4 Isotropy

Even if the border of the filled region itself is not detectable, the system might still
use the border information for feature construction. That induces the creation of features
that follow the border contours. That effect is not so critical if the completion process
is used only once, but repeated use will again induce unwanted geometry. We call the
independence between border direction and feature direction isotropy (Figure 2.5).

(a) Incomplete image (b) Good isotropy (c) Bad isotropy

Figure 2.5: Good and bad isotropy after completion



3 Techniques and Related Works

3.1 Generative Adversarial Networks

In this chapter we introduce the fundamental concepts of Generative Adversarial


Networks, a machine learning technique that plays a central role in our system.

3.1.1 Artificial neural networks

Artificial neural networks (ANNs) are computing models based on studies of biologi-
cal neural networks. The basic unit of an ANN is the neuron, a component that receives
multiple inputs and generates an output. The fundamental computational model of the
biological neuron was the MP model, proposed by (MCCULLOCH; PITTS, 1943). The percep-
tron is an extension of the MP model, created specifically for an artificial neural network
(ROSENBLATT, 1958). The function of a perceptron neuron is

f(x) = 1 if w · x > φ, and f(x) = 0 otherwise,

where x = (x1 , . . . , xn )T is the input vector, w = (w1 , . . . , wn )T the weight vector and φ
the threshold. The modern neuron model is a generalization of the perceptron, capable of generating non-binary outputs. Its functional form is

f (x) = σ(w · x)

where σ is called the activation function. Typical activation functions include the Heaviside step function

H(z) = 1 if z ≥ 0, and 0 otherwise,

which yields a perceptron; the inverse tangent function; the logistic function

L(z) = 1 / (1 + e^(−z));

the rectified linear unit (ReLU) function (HAHNLOSER et al., 2000)

ReLU(z) = max(z, 0);

and the leaky ReLU function (MAAS; HANNUN; NG, 2013)

LeakyReLU(z) = z if z > 0, and εz otherwise,

where ε is a small value.

To create a multidimensional output, we can combine the outputs of multiple neurons;


for m inputs and n outputs, the equivalent function f : Rm → Rn is

f (x) = σ (W x)

where W ∈ Rn×m is the weight matrix. This creates a single-layer network. We can
sequentially compose many such layers in a feed-forward, k-layer, fully-connected network

F (x) = (f1 ◦ · · · ◦ fk−1 )(x),

where fi (x) = σ (Wi x).

There are other network models that may not connect every pair of neurons in subsequent layers, that have cycles, or that use different types of neurons. The selection and arrangement of the neurons in an artificial neural network is called its architecture.
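As an illustration only (the layer sizes below are arbitrary and not related to our system), a small fully-connected feed-forward network of this form can be written in PyTorch as:

import torch
import torch.nn as nn

# Each nn.Linear holds one weight matrix W_i; the activation sigma is ReLU.
feed_forward = nn.Sequential(
    nn.Linear(16, 64), nn.ReLU(),    # first layer
    nn.Linear(64, 64), nn.ReLU(),    # second layer
    nn.Linear(64, 1),  nn.Sigmoid()  # output layer, squashed to (0, 1)
)

x = torch.randn(8, 16)               # a batch of 8 input vectors
y = feed_forward(x)                  # output of shape (8, 1)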

3.1.1.1 Training

The universal approximation theorem (CSÁJI, 2001) guarantees that any continuous
function that satisfies a few reasonable conditions can be approximated by a feed-forward network with three layers of finite size. However, finding the specific weights that define the wanted function is hard, especially since we generally have no explicit description of the wanted function beforehand. Therefore, we need two things to find a neural network for a certain task: a quantitative way to evaluate the fitness of a function to that task, and an algorithm to find a function with maximal fitness.

The fitness of a network to a task can be determined using a loss function. A loss function indicates how badly the network performs a certain task. For instance, let us
consider a binary classification problem. Let A be a set and B ⊂ A. Define b : A → {0, 1},

b(x) = 1 if x ∈ B, and b(x) = 0 otherwise.

Suppose we want a network that calculates b. We might use the loss function

ℓ(F) = Σ_{x∈A} |F(x) − b(x)|.

Minimizing ℓ corresponds to finding the F that best emulates b; more specifically, it corresponds to finding the weight parameters whose corresponding F best emulates b.

In the machine learning approach, the evaluation of the wanted function is known
only over a subset of the domain. We call that subset the training set, and use it to define
the loss function.

Except for some very simple architectures and loss functions, there is no known algorithm that is guaranteed to find the global minimum of the loss function. In practice, the weights are adjusted iteratively with gradient-based optimizers, such as stochastic gradient descent, which usually reach good (if not optimal) minima.
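A hedged sketch of this training procedure for the binary classification example above, using stochastic gradient descent in PyTorch (the network, data and set B below are toy placeholders):

import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
optimizer = torch.optim.SGD(net.parameters(), lr=0.1)

# Toy training set: points inside the unit circle belong to B (label 1).
x = torch.randn(256, 2)
b = (x.norm(dim=1) < 1.0).float().unsqueeze(1)

for step in range(1000):
    optimizer.zero_grad()
    loss = (net(x) - b).abs().sum()   # the loss l(F) defined in the text
    loss.backward()                   # gradients of the loss w.r.t. the weights
    optimizer.step()                  # adjust the weights to reduce the loss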

3.1.2 Convolutional neural networks

Convolutional neural networks are the fundamental tool of modern deep learning systems. Their success on image classification tasks (KRIZHEVSKY; SUTSKEVER; HINTON, 2012) set in motion a new era of machine learning research.

The concept of a convolutional layer was introduced by (FUKUSHIMA, 1980). In this kind of layer, each neuron acts as a convolutional filter over a signal; we will focus on 2-dimensional signals, i.e. images. However, convolutional layers are also commonly used with one- and three-dimensional sources.

A convolution filter convolves a kernel over a signal and generates a new signal. A kernel K of size (2k + 1) is a (2k + 1) × (2k + 1) matrix of values. In a neural convolutional layer, the kernel is the parameter to be learned. The output J of the convolution filter over an image I is

J(i, j) = Σ_{ℓ=−k}^{k} Σ_{m=−k}^{k} K_{ℓm} I(i − ℓ, j − m).

By the given definition, some values of I outside the input image are needed. The way

used to extend the input image is called the padding. Typical padding methods include
the use of a constant value (usually 0), repeating the image and reflecting the image.
Alternatively we may forgo the use of padding, but that results in an output image
smaller than the input one.

The kernel does not need to trace the input image pixel by pixel to generate the
output. The step in the input image between each kernel application is called the stride.
A stride of s results in an image with 1/s the dimensions of the input.
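A brief sketch illustrating kernel size, padding and stride with PyTorch convolutional layers (the channel counts are arbitrary):

import torch
import torch.nn as nn

image = torch.randn(1, 3, 64, 64)            # batch, channels, height, width

# 7x7 kernel (k = 3); zero padding of k pixels keeps the spatial size.
same_size = nn.Conv2d(3, 16, kernel_size=7, padding=3)
print(same_size(image).shape)                # torch.Size([1, 16, 64, 64])

# No padding: the output shrinks by 2k pixels in each dimension.
no_padding = nn.Conv2d(3, 16, kernel_size=7, padding=0)
print(no_padding(image).shape)               # torch.Size([1, 16, 58, 58])

# Stride 2 halves the spatial resolution.
strided = nn.Conv2d(3, 16, kernel_size=7, stride=2, padding=3)
print(strided(image).shape)                  # torch.Size([1, 16, 32, 32])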

Another kind of layer used to reduce the image size is the pooling layer, also introduced
by (FUKUSHIMA, 1980). The pooling layer uses a statistic, like mean or maximum, to
summarize the values in a fixed-sized window. For generative tasks, pooling layers have
fallen out of favor; convolutional layers with non-unitary strides have similar capabilities,
while also being able to learn the summary function.

Neither convolutional nor pooling layers are capable of increasing the dimensions of
the image. This upsampling task is performed by the transposed convolution operation
(LONG; SHELHAMER; DARRELL, 2015), also known as deconvolution, backwards convolution
or fractionally-strided convolution. As suggested by the alternative nomenclature, it has a
role contrary to that of the convolution, using a similar kernel but applying it in reverse.
Mathematically, it is equivalent to a convolution with fractional (less than one) stride.
Transposed convolution layers are successfully used by networks for generation, auto-
encoding and segmentation tasks.
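A small sketch of the upsampling behavior, using PyTorch's ConvTranspose2d (the parameters are chosen only for illustration):

import torch
import torch.nn as nn

feature_map = torch.randn(1, 256, 32, 32)

# A stride-2 transposed convolution doubles the spatial resolution,
# mirroring the effect of a stride-2 convolution.
upsample = nn.ConvTranspose2d(256, 128, kernel_size=4, stride=2, padding=1)
print(upsample(feature_map).shape)   # torch.Size([1, 128, 64, 64])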

A useful characteristic of convolution, pooling and transposed convolution layers is


that they can be applied to images of arbitrary size, and are translation-invariant. We
call networks composed only of those kinds of layers fully-convolutional networks. That kind of network is ideal for texture-related tasks, as it limits its effect to a fixed-size neighborhood and cannot consider global aspects of the image.

3.2 Generative adversarial networks

As noted in Section 3.1.1.1, training a neural network requires the definition of a loss
function. For more straightforward tasks, like classification, regression and segmentation,
the distance between expected output and obtained output is easy to measure. On the
other hand, image generation tasks provide no such simplicity. There is no mathematically
sensible way to indicate how well an image represents, for example, a dog.

Figure 3.6: GAN architecture

The solution found was to simultaneously train two networks, a generator and a discriminator. The generator network is the one that performs the desired image generation task. During training, the objective of the discriminator network is to distinguish between real images and generated images. The generator network fights against the discriminator, attempting to pass off the generated images as real. These two networks are trained together, adapting to each other. This framework is called a Generative Adversarial Network (GAN), and was first proposed by (GOODFELLOW et al., 2014). A diagram of this architecture can be found in Figure 3.6.

A Deep Convolutional Generative Adversarial Network (DCGAN) is the combination of the GAN architecture with deep convolutional generator and discriminator neural networks (RADFORD; METZ; CHINTALA, 2015). Fully-convolutional DCGANs have achieved great results on style transfer and super-resolution tasks (JOHNSON; ALAHI; FEI-FEI, 2016). Fully-convolutional DCGANs that use a noise image instead of a noise vector are called spatial GANs (SGANs) (JETCHEV; BERGMANN; VOLLGRAF, 2016); they were developed specifically for texture synthesis.

GANs have been successfully used for a wide variety of tasks, like

• texture synthesis, the generation of a new texture with second order characteristics
similar to a reference one (GATYS; ECKER; BETHGE, 2015a);

• texture expansion (ZHOU et al., 2018);

• super-resolution, upsampling an image into a higher resolution (LEDIG et al., 2017);

• image completion, filling holes in an image (IIZUKA; SIMO-SERRA; ISHIKAWA, 2017);

• face image synthesis (ZHANG; SONG; QI, 2017);

• human image synthesis (ZHAO et al., 2018);

• image to image translation, transforming images from one given class to another
(ZHU et al., 2017);

• image harmonization, modification of an image pasted over another to harmonize it


with its surroundings (XIAODONG; CHI-MAN, 2019);

• style transfer, the transfer of one artwork style to another (GATYS; ECKER;
BETHGE, 2015b);

• colorization, the addition of color to black and white images (CAO et al., 2017).

3.3 Procedural terrain generation

The generation of virtual landscapes is one of the oldest interest areas in procedural
content generation research. Most of the work has focused on the generation of height
maps, but hydrography (KELLEY; MALIN; NIELSON, 1988), vegetation (DEUSSEN et al.,
1998), roads (SUN et al., 2002) and urban environments (WATSON et al., 2008) have also
been explored.

In this review, we consider only works that focus on the generation of either the
height map or the color map of the terrain.

3.3.1 Traditional methods

The oldest height map generation methods are based on subdivision. The objective
of those initial papers was not the generation of a terrain from scratch, but the addi-
tion of details to already existing, low resolution terrain meshes (FOURNIER; FUSSELL;
CARPENTER, 1982). The point-displacement method, described by (MILLER, 1986), is a recursive subdivision method that subdivides existing edges at their midpoints and displaces each new point by a random height; the range of the random height decreases with the recursion depth. While the midpoint method can work on arbitrary mesh topologies, it was popularized in conjunction with the diamond-square subdivision. The terrains output by the midpoint method generally look like hilly areas or mountain sides; furthermore, they are fractal-like and do not have different features at different scales.
terrain output of the midpoint method.

Most popular methods are based on fractal noise generators (VOSS, 1985). Perlin noise (PERLIN, 1985) is a smooth multidimensional function obtained by interpolation over a grid of vectors. The sum of such noise functions at various scales, related by powers of two, creates a fractal function; each of those functions is called an octave. Perlin noise has an advantage over subdivision methods: it can be sampled at any point in constant

Figure 3.7: Terrain generated with the midpoint method, after 8 subdivisions

Figure 3.8: Terrain generated with a Perlin noise height map

time, with only local information, as long as the vectors on the grid can also be calculated
in constant time. This allows the creation of infinite terrains. A terrain with Perlin noise
height map is displayed in Figure 3.8.

Height maps generated with a pure Perlin noise function are often too smooth, while
simultaneously too steep in many regions. Many refinement methods have been proposed.
Erosion, both thermal and fluvial, is treated in (MUSGRAVE, 1993). Fluvial erosion moves
matter around according to local topography, creating valleys and drainage networks.
Thermal erosion affects steep slopes, softening them and creating talus slopes at their
feet. An example of a talus slope can be seen in Figure 3.9.

Other refinement methods simulate whole columns of material. Contrary to the



Figure 3.9: Steep cliffs with talus slopes at their feet


source: commons.wikimedia.org/wiki/File:TalusConesIsfjorden.jpg

previously cited approaches, these methods are capable of generating caves and overhangs.
In (BENES; FORSBACH, 2001), many layers are used to represent the terrain, each with
distinct material properties. This allows more realistic results, at a higher computational
cost. Other methods, like (SANTAMARÍA-IBIRIKA et al., 2014), manipulate voxels.

In general, erosion computation is resource intensive, especially when working with


voxels. Hence, more recent works implement erosion on the GPU. In (WEIß, 2016), a particle-based hydraulic erosion system is simulated on the GPU.

3.3.2 Methods based on data

Among the procedural terrain generation methods that do not rely on machine learn-
ing, few use real data. Real world topography data is represented in Digital Elevation
Models (DEMs). The Terrainosaurus system (SAUNDERS, 2007) synthesizes terrain by
genetically selecting DEMs that fit the given elevation profile. In (PARBERRY, 2014), ele-
vation statistics are collected from DEMs and used to configure a variant of Perlin noise,
called value noise; that allows the creation of terrains with an elevation profile similar to a
reference one.

3.3.3 Methods based on GANs

Since generative adversarial networks are still a relatively young method, only a few
works have considered their application for procedural terrain generation.

The system proposed by (BECKHAM; PAL, 2017) uses two separate generator networks,

Figure 3.10: Terrain generated by (BECKHAM; PAL, 2017)


source: (BECKHAM; PAL, 2017)

one for height map generation and one for color. The height map is generated first with a
DCGAN. The color map is generated from the height map, using a conditional image-to-
image translation GAN known as pix2pix (ISOLA et al., 2017). The networks were trained
with a cross-entropy loss function, using 512×512 sized images of the Earth, at a resolution
of about 0.58 kilometers per pixel. The results of this work suggest that the height map is insufficient for generating color information, as snow was added to a desert terrain. Its results also display the glitchy artifacts commonly found in other GAN-based texture generation systems. A sample of the generated terrain can be seen in Figure 3.10.

Height maps with visually better quality were obtained by (SPICK; COWLING; WALKER, 2019) (Figure 3.11). Their methodology is somewhat similar to ours, as they train an SGAN with small crops from a large region to generate similar patches. They do not, however, generate color maps.

Figure 3.11: Terrain generated by (SPICK; COWLING; WALKER, 2019)


source: (SPICK; COWLING; WALKER, 2019)

4 Methodology

The main component of the terrain generation system is a generative neural network
G. It receives an incomplete terrain texture, with partial color and height map, and
outputs a complete texture.

To train G, we utilize a Generative Adversarial Network architecture, as defined in


Section 3.1. As such, we have an associated discriminator network D, which classifies textures as either real or fabricated. The G network is trained with complete/incomplete image pairs, with two objectives: to fill the incomplete image so that it resembles the complete one, and to make D classify its output as real. The network D, on the other hand, has a competing objective: to classify G's output as fabricated and real images as real. Both G and D are trained concurrently.

An overview of our GAN architecture can be found in Figure 4.12. Our neural networks
are modified SGANs. Therefore, they are fully-convolutional and can work with arbitrarily
sized data.

In this chapter we describe the details of our proposed architecture; Section 4.1 ex-
plains how texture data is represented and prepared for training; the architectures of G

Figure 4.12: Our architecture



and D are described in Section 4.2 and Section 4.3, respectively; the training targets are listed in Section 4.4; the training procedure itself is detailed in Section 4.5; finally, we briefly discuss our approach to terrain mesh reconstruction and visualization in Section 4.6.

4.1 Data Representation

Our texture image has four channels, where three represent color and one represents
the height. The color is encoded in RGB; other works with GANs often use an HSV color
space encoding (SMITH, 1978) , but that option would be incompatible with the style
loss function. For similar reasons, no gamma correction is performed on the color texture.
Height is represented linearly by

ĥ = (h − hmin) / (hmax − hmin)

where ĥ is the encoded height, h is the real height, hmin the minimal height and hmax the
maximal height. The values of hmin and hmax must be chosen manually for each terrain
class so that 0 ≤ ĥ ≤ 1.

In our training data and result visualization software, each channel can assume at
most 256 distinct values. That limitation is not the result of any network property, but
a constraint imposed by the available training data. The system can be adapted to input or output higher color and height resolutions at no additional cost.

The input to an image completion problem is an image segmented into two parts, a known region and an unknown region. Such segmentation can be encoded into the input
using various approaches:

1. Set the unknown pixels to a constant color/height;

2. Set each unknown pixel to a random color/height;

3. Set the unknown pixels to a constant color/height, and mark them in a new channel;

4. Set each unknown pixel to a random color/height, and mark it in a new channel;

The use of constant color/height information has an unavoidable side-effect in our architecture: it creates a small repeating pattern far from the known region. Not using a mask makes the problem ambiguous between style transfer and completion. Therefore, we chose to use both noise and an unknown-region mask channel (approach 4).

In training, each training texture is obtained through random 128 × 128 cropping of a single large reference image. To diversify the inputs, those crops may be flipped horizontally or vertically, with independent probability 0.5 for each. To feed the network G, we take those textures and remove a region, encoding it as described above. The removed region is chosen at random from a fixed selection of masks. As a final step, pixel values are remapped to the [−1, 1] floating-point interval.
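A minimal sketch of this preparation pipeline, under our reading of the steps above (the helper name and mask handling are illustrative and need not match the original implementation):

import numpy as np

def prepare_sample(reference, masks, crop=128):
    """reference: H x W x 4 float array in [0, 1]; masks: list of crop x crop boolean
    arrays marking the unknown region."""
    # Random 128 x 128 crop of the large reference texture.
    y = np.random.randint(0, reference.shape[0] - crop)
    x = np.random.randint(0, reference.shape[1] - crop)
    sample = reference[y:y + crop, x:x + crop].copy()

    # Independent horizontal and vertical flips, each with probability 0.5.
    if np.random.rand() < 0.5:
        sample = sample[:, ::-1]
    if np.random.rand() < 0.5:
        sample = sample[::-1, :]

    # Pick an unknown-region mask, fill it with noise and append a mask channel.
    unknown = masks[np.random.randint(len(masks))]
    incomplete = sample.copy()
    incomplete[unknown] = np.random.rand(unknown.sum(), 4)
    mask_channel = unknown.astype(np.float32)[..., None]
    incomplete = np.concatenate([incomplete, mask_channel], axis=2)

    # Remap pixel values to the [-1, 1] interval expected by the network.
    return 2.0 * sample - 1.0, 2.0 * incomplete - 1.0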

4.2 Generator network architecture

The layers of G are organized in three sequential zones:

• The reduction zone, where the spatial resolution of the image decreases and the
number of channels increases;

• The residual zone, where both spatial resolution and number of channels are constant;

• The expansion zone, where the spatial resolution increases and the number of chan-
nels decreases.

The reduction zone has three convolution layers. The first layer does not reduce size,
but increases the number of channels to 64. Each one of the following two layers decreases
the spatial resolution by half and doubles the number of channels. Each convolution layer
is followed by a 2D batch normalization and a ReLU activation function. If the input of
the reduction zone is a k × k × d image, the output is a k/4 × k/4 × 256 one.

The residual zone is formed by six sequential residual blocks. A residual block is so
called because its result is added to the input (and therefore what is calculated is the
residue). Each residual block has two convolution layers, followed by batch normalization
and separated by a ReLU activation and, in training, a dropout layer.

The expansion zone is a reflection of the reduction zone, with the dimension reducing
convolution layers replaced by dimension increasing transposed convolutions. If the input
image of this zone is of size k × k × 256, the output is 4k × 4k × d.

The full generator architecture can be observed in Figure 4.13.
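The sketch below conveys this structure in PyTorch. Kernel sizes, the final activation and the exact placement of normalization layers are our assumptions for illustration and may differ from the original implementation; the five input channels correspond to the four texture channels plus the unknown-region mask.

import torch.nn as nn

def conv_block(cin, cout, stride):
    return nn.Sequential(
        nn.Conv2d(cin, cout, kernel_size=3, stride=stride, padding=1),
        nn.BatchNorm2d(cout), nn.ReLU())

class ResidualBlock(nn.Module):
    def __init__(self, channels=256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels),
            nn.ReLU(), nn.Dropout2d(0.5),
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels))

    def forward(self, x):
        return x + self.body(x)       # the block learns the residue

class Generator(nn.Module):
    def __init__(self, in_channels=5, out_channels=4):
        super().__init__()
        self.reduction = nn.Sequential(
            conv_block(in_channels, 64, stride=1),
            conv_block(64, 128, stride=2),
            conv_block(128, 256, stride=2))
        self.residual = nn.Sequential(*[ResidualBlock(256) for _ in range(6)])
        self.expansion = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1),
            nn.BatchNorm2d(128), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(),
            nn.Conv2d(64, out_channels, 3, padding=1), nn.Tanh())

    def forward(self, x):
        return self.expansion(self.residual(self.reduction(x)))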

Due to the complex nature of neural networks, it is difficult to discern exactly what
function each of these zones has in the solution of the problem. However, it is possible
to compare this architecture with those of similar networks. Without the residual zone, we

Figure 4.13: Generator architecture

obtain a traditional auto-encoder setup. Auto-encoders reconstruct images after reducing


their spatial dimensions; for that purpose, they need to learn the local patterns of the training images. Therefore, we might assume that the function of the reduction and expansion zones is similar. Hence, we assume the purpose of the residual zone is the actual completion of the input texture.

The generator is a fully-convolutional network, i.e. it is formed only by convolutional


and deconvolutional layers, with no fully connected layers. That property implies that the
same G can process images with arbitrary sizes, with a fixed size ratio between input and
output. For our network, the input and output size is always the same.

Another effect of having a fully-convolutional network is that, for a large enough input image, no single pixel value change will affect the entire output image. There is a maximum
pixel influence radius determined by the network architecture. The influence radius for
our network is 91 pixels. Therefore, while there is no limit to the input image size, the
dimensions of the unknown region are limited; any unknown area too far away from the
known region will be filled with content unrelated to its known neighborhood. If necessary
for an application, we can increase the radius by adding either size reduction layers or
residual blocks.

4.3 Discriminator network architecture

The discriminator network D is formed by a sequence of six size-reducing convolution layers, the first five followed by a LeakyReLU activation function and the last followed by a Sigmoid one. Its full architecture can be observed in Figure 4.14.

Figure 4.14: Discriminator architecture
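A corresponding sketch of D under the same assumptions as the generator sketch (the channel progression is illustrative only):

import torch.nn as nn

def d_block(cin, cout):
    return nn.Sequential(
        nn.Conv2d(cin, cout, kernel_size=4, stride=2, padding=1),
        nn.LeakyReLU(0.2))

# Six size-reducing convolutions; the final sigmoid yields a map of
# per-region probabilities that the input texture is real.
discriminator = nn.Sequential(
    d_block(4, 64), d_block(64, 128), d_block(128, 256),
    d_block(256, 512), d_block(512, 512),
    nn.Conv2d(512, 1, kernel_size=4, stride=2, padding=1), nn.Sigmoid())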

4.4 Loss function

The total generator loss function LG is composed of three terms

LG = Ladv + λ1 LL1 + λ2 Lstyle

where Ladv is the adversarial loss, LL1 is the L1 loss and Lstyle is the style loss. The
generator network is trained to minimize the loss function.

The adversarial loss Ladv is the traditional metric described by (GOODFELLOW et al.,
2014), in the seminal paper on GANs. It is given by

Ez [log(1 − D(G(z)))]

where

• G(z) is the generator output given incomplete texture z;

• D(G(z)) is the discriminator estimate of the probability the texture generated from
z is real;

• Ez is the expected value over all incomplete textures.

By minimizing Ladv, the generator becomes better at fooling the discriminator.

The L1 loss is the mean absolute error between each pixel of the real input and the
generated one. Minimizing LL1 corresponds to outputting an image similar to the input,
in a pixel by pixel sense. This loss has a different meaning for known and unknown pixels;

for known pixels, it expresses our intent of not modifying their value; for unknown pixels,
it stabilizes their generated values to something not too different from the original image.

The style loss Lstyle represents a more subtle concept. It is low when the source and
generated images have a similar visual texture. The style loss was originally described by
(GATYS; ECKER; BETHGE, 2015a) in a paper about texture generation. First, we extract
features of different sizes from the source image. To do so, we run it through the initial
layers of an already trained high-performing deep neural network. Each activated layer ℓ in a trained network corresponds to a set of N_ℓ non-linear feature filters, each with M_ℓ outputs; we can represent the features of a layer with a feature matrix F^ℓ ∈ R^(N_ℓ × M_ℓ). The Gram matrix G^ℓ depicts the correlation between the features of the layer. It is given by

G^ℓ_ij = Σ_k F^ℓ_ik · F^ℓ_jk.

The Gram matrices related to a fixed layer set L, {G^ℓ}_{ℓ∈L}, encode the texture of the corresponding image. Denoting by G^ℓ and Ĝ^ℓ the Gram matrices of the source and generated images, respectively, the style loss is

Lstyle = Σ_{ℓ∈L} Σ_{i,j} [w_ℓ / (4 N_ℓ² M_ℓ²)] (G^ℓ_ij − Ĝ^ℓ_ij)²

for fixed weights {w_ℓ}_{ℓ∈L}.

Our choice for feature network was the VGG-19 model trained for ImageNet, as used
by (GATYS; ECKER; BETHGE, 2015a) and (ZHOU et al., 2018). However, unlike those works,
our output image has four channels. As VGG-19 receives images with three channels, we
create two sets of Gram matrices; the first with only the RGB channels, and another with
the triplicated height channel. These two sets are combined for loss evaluation purposes.
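A sketch of the Gram-matrix computation and the resulting style term, assuming feature maps have already been extracted from the pretrained network and flattened per layer (the variable names are ours; the weighting follows the formula above):

import torch

def gram_matrix(features):
    """features: tensor of shape (N_l, M_l) with one row per feature filter."""
    return features @ features.t()          # G_ij = sum_k F_ik * F_jk

def style_loss(source_feats, generated_feats, weights):
    """Both feature arguments are lists of (N_l, M_l) tensors, one per layer l."""
    loss = 0.0
    for f_src, f_gen, w in zip(source_feats, generated_feats, weights):
        n, m = f_src.shape
        g_src, g_gen = gram_matrix(f_src), gram_matrix(f_gen)
        loss = loss + w * ((g_src - g_gen) ** 2).sum() / (4 * n ** 2 * m ** 2)
    return loss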

4.5 Training procedure

Following the approach for texture generation in works like (ZHOU et al., 2018), we do
not attempt to create a network that can generate every kind of terrain. Instead, we train
a new network with a single, continuous piece of terrain that contains a repeated type of
characterizing feature, like a mountain range, an archipelago or a river delta.

For each training class, the network is trained for 10⁵ iterations. Optimization is done with the Adam (KINGMA; BA, 2014) stochastic method; the momentum is set to 0.5. The learning rate is set to 0.0002 from the start until iteration 5 · 10⁴, and then linearly decays to zero until the final iteration.
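A sketch of this optimization schedule in PyTorch; G is assumed to be the generator module described in Section 4.2, and the lambda-based scheduler is only one way to express the linear decay:

import torch

iterations = 100_000   # 10^5 iterations per training class
optimizer = torch.optim.Adam(G.parameters(), lr=0.0002, betas=(0.5, 0.999))

# Constant learning rate for the first 5 * 10^4 iterations, then linear decay to zero.
def lr_factor(it):
    half = iterations // 2
    return 1.0 if it < half else max(0.0, 1.0 - (it - half) / (iterations - half))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_factor)

for it in range(iterations):
    optimizer.zero_grad()
    # ... compute the total loss L_G on a training batch and call loss.backward() ...
    optimizer.step()
    scheduler.step()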

The software used for training and testing was adapted from the one used by (ZHOU et
al., 2018). It is written in the Python programming language using the PyTorch (PASZKE
et al., 2017) deep learning library.

4.6 Reconstruction and visualization

For real-time visualization of a three dimensional surface on current hardware, we


need to represent it as a triangular mesh. Our generated mesh has the simple repeated
structure shown in Figure 4.15.

Figure 4.15: Terrain mesh topology

Each vertex of our mesh corresponds to a pixel in the generated texture; the position
of the vertex is influenced only by the height map value at that pixel. Generally, the topography of the terrain is smoothed out before display; we do not smooth our results, so as not to hide any artifacts generated by the previous steps.
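A minimal sketch of this vertex construction (grid spacing and height scale are arbitrary; the triangulation follows the regular pattern of Figure 4.15):

import numpy as np

def heightmap_to_mesh(height_map, cell_size=1.0, height_scale=50.0):
    """height_map: H x W array in [0, 1]; returns vertices and triangle indices."""
    h, w = height_map.shape
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    vertices = np.stack([xs * cell_size,
                         ys * cell_size,
                         height_map * height_scale], axis=-1).reshape(-1, 3)

    faces = []
    for i in range(h - 1):
        for j in range(w - 1):
            a, b = i * w + j, i * w + j + 1
            c, d = (i + 1) * w + j, (i + 1) * w + j + 1
            faces.append((a, b, c))      # two triangles per grid cell
            faces.append((b, d, c))
    return vertices, np.array(faces)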

The rendered terrains in this work were created with the Blender 3D software. A sample of a rendered result can be seen in Figure 4.16.

Figure 4.16: Visualization of real terrain sample from the Alps dataset

5 Results

We trained a separate network on each of four distinct terrain datasets, and ran each through four experiments with distinct amounts of unknown region. This allows us to infer the efficacy of the system in various terrain generation schemes.

The Alps dataset (Figure 5.17a) contains a region of the Swiss Alps. It is characterized
by a combination of grassy, fractal valleys and tall, snow-covered mountains. We consider that this dataset contains all features necessary for proper training: a single, repetitive type of feature, with a high correlation between height and color information. The Canyon dataset (Figure 5.17b) consists mostly of flat, rocky desert terrain, with a single canyon structure crossing the desert. It is based on the neighborhood of the Grand Canyon, US.
The Ethiopia dataset (Figure 5.17c) is formed by irregular, hilly terrain covered by a
semi-arid biome, taken from southeast Ethiopia. About half of it is covered with rough
terrain sculpted by rivers, while the other half is mostly plain. The Maze dataset (Figure
5.17d) is an artificial terrain with very simple, straight features, and is used to allow us
to evaluate our method in a more controlled way.

The training was executed on a GeForce 940MX GPU with 4 GB of RAM. The
network was trained in batches of 8 samples at a time. A batch size of 8 is low relative to
the related works (BECKHAM; PAL, 2017; SPICK; COWLING; WALKER, 2019), but no higher
power of two could be used due to limited memory. Training each network took about 30
hours, a rate of about 1.08 s per iteration. The generation of a 128 × 128 crop takes about
4.3 ms, assuming that both the network and the input have already been loaded into the
GPU.

(a) Alps (b) Canyon

(c) Ethiopia (d) Maze

Figure 5.17: Crops of size 256 × 256 of each test region (color map on the left, height map
on the right)

5.1 Inpainting

The inpainting test analyzes the generation of content when surrounded by known
terrain. We expect this to be the easiest of the four tasks for the generator system, as
most of the training data contains inpainting-style unknown regions, and a lot of information
about the surrounding terrain is available.

For the Alps dataset (Figure 5.18), while the color image quality is degraded, all
our other objective metrics, continuity, isotropy and feature continuation, are successfully
obtained. For instance, consider the inpainted terrain in Figure 5.18a; there is no visible
border between the known and filled region, so continuity is achieved; feature continuation
is clear, as valleys in the northwest, west, south and east of the unknown region are
connected; finally, no anisotropic features can be observed. However, there are visible
flaws in the results, such as the rough elevation artifact in the west of the result, more
clearly visible in Figure 5.19.

(a) (b)

(c) (d)

Figure 5.18: Completion of 64 × 64 square inside crops of size 128 × 128, in the Alps
dataset

Figure 5.19: Visualization of inpainting result depicted in Figure 5.18a



The results for the Canyon dataset (Figure 5.20a) are of low quality. The generator
fails to continue the canyon feature, and there is a visible border around the generated
region. We believe this low performance is a result of the fact that most of the Canyon
dataset is filled with plains.

The generator for the Ethiopia dataset (Figure 5.20b) exhibits the same faults to a lesser degree. In the valley diagonally crossing the image, the generated terrain has softer
features than the surrounding terrain.

The Maze dataset (Figure 5.20c) has acceptable results, with proper similarity, con-
tinuity and feature continuation. However, an artifact can be observed on the northwest
corner.

(a) Canyon (b) Ethiopia

(c) Maze

Figure 5.20: Completion of 64×64 square inside crops of size 128×128, in various datasets

5.2 Expansion

The expansion test tries to generate terrain around a known area. More specifically,
we generated a border of 32 pixels around a 128×128 terrain patch. The terrain expansion
task has medium difficulty; there is some context neighborhood, but not as much as in
the inpainting experiment.

For the Alps network (Figure 5.21), the generated terrain is properly similar, continu-
ous and feature-continuous. However, the results are not isotropic. Observe the generated
valleys at the south edge in Figure 5.21a, east and southwest in Figure 5.21b, north and
east in Figure 5.21c and south in Figure 5.21d; they follow the direction of the edge of
expansion and create unnatural features. The same elevation artifacts previously observed
appear again in this sample, although in different locations (Figure 5.22).

(a) (b)

(c) (d)

Figure 5.21: Expansion by 32 pixels of crops of size 128 × 128, in the Alps dataset

Figure 5.22: Visualization of expansion result depicted in Figure 5.21a

The other generator networks (Figure 5.23) also produce non-ideal results. Feature continuation is mostly absent with the Canyon and Ethiopia networks, a more serious problem in the former. The Maze network does achieve good feature continuation, but isotropy is poor on the height map, with unjustified high elevation at the four expansion edges.

(a) Canyon (b) Ethiopia

(c) Maze

Figure 5.23: Expansion by 32 pixels of crops of size 128 × 128, in various datasets

5.3 Full Generation

If fed with only noise, the generator network will still attempt to generate terrain.
The results of this task are the kind of terrain expected when generating terrain beyond
the influence radius of any known region. They also indicate the variability expected
when distinct noise is used with the same known region. As no terrain information is present, and very few of the training samples in each network request full generation, this is an especially hard task for the networks.

The results for all test datasets (Figure 5.24) show a lack of large-scale features, and contain only an amorphous, incoherent texture that somewhat resembles the reference terrain. A rendered version can be seen in Figure 5.25.

(a) Alps (b) Canyon

(c) Ethiopia (d) Maze

Figure 5.24: Crops of size 256 × 256 of each test region, with color map on the left and
height map on the right

Figure 5.25: Visualization of generation result depicted in Figure 5.24a

5.4 Style Transfer

In the style transfer task, the input contains no unknown regions. This case is not directly trained for, but is a side effect of how the networks are trained. However, that does not mean that the task itself is useless. When it receives as input a terrain unlike the one it was trained with, the generator network translates it into the features of the trained terrain.

It is interesting to note that the transfer occurs not only within the color map, but also in the height map. When translating height maps with the Canyon-based network, which is mostly plain, only blurred features are passed along (Figure 5.28d); desert hills are converted into green mountains by the Alps network (Figure 5.31); the swirling valleys of an Ethiopia sample are built from the vertical and horizontal features of the Maze (Figure 5.30c). This characteristic allows another method for new terrain generation: to create a base terrain with a traditional method and pass it through the generator network.

(a) Alps (b) Canyon

(c) Ethiopia (d) Maze

Figure 5.26: Original crops of size 128 × 128 of each test region

(a) (b)

(c) (d)

Figure 5.27: Result of network trained on the Alps dataset

(a) (b)

(c) (d)

Figure 5.28: Result of network trained on the Canyon dataset



(a) (b)

(c) (d)

Figure 5.29: Result of network trained on the Ethiopia dataset

(a) (b)

(c) (d)

Figure 5.30: Result of network trained on the Maze dataset



(a) Original terrain depicted in Figure 5.26c

(b) Style transfer result depicted in Figure 5.27c

Figure 5.31: Visualization of style transfer

5.5 Discussion

Across all four experiments, it is clear that two of the networks had considerably better performance: the ones based on the Alps and Maze datasets. They differ from the Canyon and Ethiopia datasets in a crucial aspect; while Canyon and Ethiopia have a diverse set of terrain features, Alps and Maze are homogeneous and contain a single kind of feature. This suggests that a careful selection of the source dataset is necessary, and that using the network to generate multiple kinds of terrain would require a more powerful network architecture.

When compared to inpainting, expansion and style transfer, the full generation task obtained very low quality results. Hence, it is easy to generate new terrain when some neighboring terrain is already available, but hard to create brand new terrain. We suggest another method to generate new terrain using our network, sketched after the list below:

1. generate a height and color map using a traditional method (Section 3.3.1);

2. apply style transfer;

3. expand the style transfer result;

4. crop the region generated by the expansion.
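A pseudocode sketch of this pipeline; every helper function here (traditional_terrain, with_mask, pad, crop_border) is a hypothetical placeholder for the steps described in this chapter, not part of the actual implementation.

def generate_new_terrain(generator, size=128, border=32):
    base = traditional_terrain(size)                      # step 1: e.g. Perlin noise
    styled = generator(with_mask(base, unknown=None))     # step 2: style transfer
    padded = with_mask(pad(styled, border), unknown="border")
    expanded = generator(padded)                          # step 3: expansion
    return crop_border(expanded, border)                  # step 4: keep the new border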

Our system could be used for both real-time and offline terrain generation. The limiting factor is not execution time, but available GPU memory. Our training hardware could handle the generation of 512 × 512 textures in 21 ms; inputs sized 1024 × 1024 lead to “out of memory” errors. As the influence radius is small and the network output is translation-invariant, size limitations can be overcome by splitting the input apart; in this case, the performance bottleneck is the transfer of textures between main memory and the GPU.

We obtained results of higher quality than the ones presented by (BECKHAM; PAL,
2017). However, since we were unable to generate terrains from a desert biome, a more de-
tailed comparison is impossible. The paper also does not provide information on hardware,
training hyper-parameters or performance.

The height maps generated by our method are not clearly better or worse than the results of (SPICK; COWLING; WALKER, 2019), and we cannot compare perceptual realism for human observers because we conducted no such study. Considering only the outputs presented by the authors in the paper, we believe that our work achieved, with the Alps network, better feature continuation but worse similarity. Our network is leaner: ours has depth 2 while theirs has depths between 4 and 6. They also do not generate color information.

6 Final Remarks

This work provided a series of new contributions to the procedural terrain generation
problem. We:

• showed that texture completion is a valid way of structuring the procedural terrain
generation task;

• created a system that can generate color and height information with a single net-
work;

• described a working architecture that, unlike similar works in terrain generation
with GANs, explicitly considers texture properties of the data;

• demonstrated the necessity of a homogeneous training dataset;

• highlighted the benefits of adding noise to the unknown region of the input of image
completion networks;

• provided an initial analysis of the effects of style transfer on height maps, as style
transfer has been studied mainly in the context of color images.

While our results have clear flaws in some aspects, similar artifacts have appeared in other
GAN-based systems and were corrected by follow-up works. Therefore, we believe that, with
further research, a ready-to-use version of the system could be developed.

6.1 Future Work

There are some possible improvements to our system that, due to hardware limita-
tions, were not explored.

The analysis in related works shows that the ideal batch size for training is around 64,
whereas our batch size was 8. It has been shown that the mini-batch size influences training
speed and stability.

Another parameter worth increasing is the generator network depth. Similar networks obtained
better results with depths between 5 and 6. An increased network depth also increases the
influence radius, which may in turn induce better feature continuation.
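
For reference, the growth of the receptive field (and hence of the influence radius) with depth follows the standard recurrence for stacked convolutions; this is a textbook result, not something specific to our architecture:

r_L = 1 + \sum_{l=1}^{L} (k_l - 1) \prod_{i=1}^{l-1} s_i ,

where k_l and s_l are the kernel size and stride of layer l. With unit strides and 3 × 3 kernels, each extra layer widens the receptive field by only 2 pixels, so substantial gains in influence radius require strided, dilated, or considerably deeper stacks.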

Since our system generates unwanted artifacts near the image borders, alternative padding
settings should also be considered. As the effects of padding grow with the influence radius,
it might also be interesting to adapt the system to use no padding at all, or at least no
padding outside the residual blocks. A network with no padding inevitably outputs an image
smaller than its input; proper preprocessing should be able to cancel out this effect.
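
A minimal sketch of such preprocessing, assuming a hypothetical generator built only from "valid" (unpadded) convolutions whose total per-side shrink is known; reflection padding is one plausible choice because it extends the terrain with mirrored real data instead of zeros:

```python
import torch.nn.functional as F

def run_with_valid_convolutions(G, x, shrink):
    """Pre-pad the input so a no-padding generator returns the original size.

    G: generator whose output is `shrink` pixels smaller than its input on each side.
    x: (1, C, H, W) input tensor.
    """
    padded = F.pad(x, (shrink, shrink, shrink, shrink), mode="reflect")
    y = G(padded)
    assert y.shape[-2:] == x.shape[-2:], "shrink does not match the architecture"
    return y
```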

An alternative architecture for the generator is the U-Net (RONNEBERGER; FISCHER;
BROX, 2015). Like our architecture, U-Net is a fully-convolutional network, capable of
processing images of arbitrary size. Unlike our architecture, U-Net connects convolutional
and deconvolutional layers of the same size; we believe this characteristic would allow the
known region to be reconstructed with higher fidelity.
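
The sketch below illustrates the idea of such skip connections in PyTorch; it is a toy two-level network with arbitrary channel counts (e.g., RGB + height + mask in, RGB + height out), not a proposal for the exact architecture.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Two-level U-Net-style sketch: encoder features are concatenated with
    decoder features of the same spatial size before the final convolution."""

    def __init__(self, in_ch=5, out_ch=4):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        self.mid = nn.Sequential(nn.Conv2d(64, 64, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1)
        self.dec = nn.Conv2d(32 + 32, out_ch, 3, padding=1)  # sees the skip connection

    def forward(self, x):
        e1 = self.enc1(x)                   # full resolution
        e2 = self.enc2(e1)                  # half resolution
        d = self.up(self.mid(e2))           # back to full resolution
        return self.dec(torch.cat([d, e1], dim=1))
```

The skip connection gives the decoder direct access to the unaltered known region, which is why we expect higher reconstruction fidelity there.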

Finally, it is worth noting that the terrain description our system produces (height and
color) is not enough to render a fully realistic environment in modern graphics pipelines.
Necessary additional information includes ground reflectivity properties, water location and
depth, and vegetation density. Since all of these can be encoded as textures, we believe our
architecture should be able to handle the additional features with little modification.
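
As a hypothetical illustration (the names and ordering below are ours, not part of this work), the change would mostly amount to widening the texture tensors:

```python
# Hypothetical per-texel channel layout for an extended terrain description.
CHANNELS = [
    "color_r", "color_g", "color_b",   # color map (as in this work)
    "height",                          # height map (as in this work)
    "reflectivity",                    # ground reflectance properties
    "water_depth",                     # zero where there is no water
    "vegetation_density",              # coverage in [0, 1]
]
# Only the first and last convolutions of the generator (and the dataset
# loader) would need to change their channel counts to len(CHANNELS)
# (+1 for the known/unknown mask on the input side).
```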

6.1.1 Expanded architectures

The need to train a distinct generator network for each kind of terrain is a major limitation
of our system. Ideally, the end user should be able to pick from a palette of terrain types
and combine them as needed, in an interactive system such as (ZHU et al., 2016). We believe
that a network architecture similar to ours would work for such a task; however, training
such a network would require an extensive, carefully labeled dataset, whose construction
falls beyond the scope of this work.

Ideally, a procedural terrain generation network should simultaneously execute two tasks:
texture completion and multi-resolution synthesis. The purpose of texture completion has
already been explored in this work; the main advantage of generating terrains through
completion is the ability to expand already existing terrains. Multi-resolution synthesis
allows multiple levels of detail to be generated as needed; this is important because
storing the full terrain texture at a high level of detail is extremely memory intensive.
StyleGANs are able to tackle each of these tasks individually, but we found no studies that
consider them combined.

Our approach is incapable of generating caves and overhangs, and poor at generating cliffs
and other high-slope structures. In traditional methods, such structures have been created
over a voxel representation. Convolutional Neural Networks are capable of working with
voxels, but it is not clear how our architecture could be adapted to three-dimensional data.

References

ASHIKHMIN, M. Synthesizing natural textures. SI3D, Citeseer, v. 1, p. 217–226, 2001.

BECKHAM, C.; PAL, C. A step towards procedural terrain generation with gans. arXiv
preprint arXiv:1707.03383, 2017.

BENES, B.; FORSBACH, R. Layered data representation for visual simulation of terrain
erosion. In: IEEE. Proceedings Spring Conference on Computer Graphics. [S.l.], 2001. p.
80–86.

BERTALMIO, M. et al. Image inpainting. In: ACM PRESS/ADDISON-WESLEY
PUBLISHING CO. Proceedings of the 27th annual conference on Computer graphics and
interactive techniques. [S.l.], 2000. p. 417–424.

CAO, Y. et al. Unsupervised diverse colorization via generative adversarial networks. In:
SPRINGER. Joint European Conference on Machine Learning and Knowledge Discovery
in Databases. [S.l.], 2017. p. 151–166.

CATMULL, E. A subdivision algorithm for computer display of curved surfaces. [S.l.], 1974.

CRIMINISI, A.; PÉREZ, P.; TOYAMA, K. Region filling and object removal by
exemplar-based image inpainting. IEEE Transactions on image processing, v. 13, n. 9, p.
1200–1212, 2004.

CSÁJI, B. C. Approximation with artificial neural networks. Faculty of Sciences, Eötvös
Loránd University, Hungary, Citeseer, v. 24, p. 48, 2001.

DEUSSEN, O. et al. Realistic modeling and rendering of plant ecosystems. In: ACM.
Proceedings of the 25th annual conference on Computer graphics and interactive
techniques. [S.l.], 1998. p. 275–286.

FOURNIER, A.; FUSSELL, D.; CARPENTER, L. Computer rendering of stochastic
models. Communications of the ACM, ACM, v. 25, n. 6, p. 371–384, 1982.

FUKUSHIMA, K. Neocognitron: A self-organizing neural network model for a mechanism
of pattern recognition unaffected by shift in position. Biological cybernetics, Springer,
v. 36, n. 4, p. 193–202, 1980.

GATYS, L.; ECKER, A. S.; BETHGE, M. Texture synthesis using convolutional neural
networks. In: Advances in neural information processing systems. [S.l.: s.n.], 2015. p.
262–270.

GATYS, L. A.; ECKER, A. S.; BETHGE, M. A neural algorithm of artistic style. arXiv
preprint arXiv:1508.06576, 2015.

GOODFELLOW, I. et al. Generative adversarial nets. In: Advances in neural information
processing systems. [S.l.: s.n.], 2014. p. 2672–2680.

HAHNLOSER, R. H. et al. Digital selection and analogue amplification coexist in a
cortex-inspired silicon circuit. Nature, Nature Publishing Group, v. 405, n. 6789, p. 947,
2000.

IIZUKA, S.; SIMO-SERRA, E.; ISHIKAWA, H. Globally and locally consistent image
completion. ACM Transactions on Graphics (ToG), ACM, v. 36, n. 4, p. 107, 2017.

ISOLA, P. et al. Image-to-image translation with conditional adversarial networks. In:
Proceedings of the IEEE conference on computer vision and pattern recognition. [S.l.:
s.n.], 2017. p. 1125–1134.

JETCHEV, N.; BERGMANN, U.; VOLLGRAF, R. Texture synthesis with spatial
generative adversarial networks. arXiv preprint arXiv:1611.08207, 2016.

JOHNSON, J.; ALAHI, A.; FEI-FEI, L. Perceptual losses for real-time style transfer and
super-resolution. In: SPRINGER. European conference on computer vision. [S.l.], 2016.
p. 694–711.

KELLEY, A. D.; MALIN, M. C.; NIELSON, G. M. Terrain simulation using a model of
stream erosion. [S.l.]: ACM, 1988.

KINGMA, D. P.; BA, J. Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980, 2014.

KOKARAM, A. C. et al. Interpolation of missing data in image sequences. IEEE
Transactions on Image Processing, IEEE, v. 4, n. 11, p. 1509–1519, 1995.

KRIZHEVSKY, A.; SUTSKEVER, I.; HINTON, G. E. Imagenet classification with deep
convolutional neural networks. In: Advances in neural information processing systems.
[S.l.: s.n.], 2012. p. 1097–1105.

LEDIG, C. et al. Photo-realistic single image super-resolution using a generative
adversarial network. In: Proceedings of the IEEE conference on computer vision and
pattern recognition. [S.l.: s.n.], 2017. p. 4681–4690.

LONG, J.; SHELHAMER, E.; DARRELL, T. Fully convolutional networks for semantic
segmentation. In: Proceedings of the IEEE conference on computer vision and pattern
recognition. [S.l.: s.n.], 2015. p. 3431–3440.

MAAS, A. L.; HANNUN, A. Y.; NG, A. Y. Rectifier nonlinearities improve neural
network acoustic models. In: Proc. icml. [S.l.: s.n.], 2013. v. 30, n. 1, p. 3.

MCCULLOCH, W. S.; PITTS, W. A logical calculus of the ideas immanent in nervous
activity. The bulletin of mathematical biophysics, Springer, v. 5, n. 4, p. 115–133, 1943.

MILLER, G. S. The definition and rendering of terrain maps. In: ACM. ACM
SIGGRAPH Computer Graphics. [S.l.], 1986. v. 20, n. 4, p. 39–48.

MUSGRAVE, F. K. Methods for realistic landscape imaging. Yale University, New
Haven, CT, p. 21, 1993.

PARBERRY, I. Designer worlds: Procedural generation of infinite terrain from real-world
elevation data. Journal of Computer Graphics Techniques, v. 3, n. 1, 2014.

PASZKE, A. et al. Automatic differentiation in pytorch. 2017.

PERLIN, K. An image synthesizer. ACM Siggraph Computer Graphics, v. 19, n. 3, p.
287–296, 1985.

RADFORD, A.; METZ, L.; CHINTALA, S. Unsupervised representation learning with
deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434,
2015.

RAJASEKARAN, S. D. et al. Ptrm: Perceived terrain realism metrics. arXiv preprint
arXiv:1909.04610, 2019.

RONNEBERGER, O.; FISCHER, P.; BROX, T. U-net: Convolutional networks for
biomedical image segmentation. In: SPRINGER. International Conference on Medical
image computing and computer-assisted intervention. [S.l.], 2015. p. 234–241.

ROSENBLATT, F. The perceptron: a probabilistic model for information storage and
organization in the brain. Psychological review, American Psychological Association,
v. 65, n. 6, p. 386, 1958.

SANTAMARÍA-IBIRIKA, A. et al. Procedural approach to volumetric terrain
generation. The Visual Computer, Springer, v. 30, n. 9, p. 997–1007, 2014.

SAUNDERS, R. L. Terrainosaurus: realistic terrain synthesis using genetic algorithms.
Tese (Doutorado) — Texas A&M University, 2007.

SMELIK, R. M. et al. A survey of procedural methods for terrain modelling. In:
Proceedings of the CASA Workshop on 3D Advanced Media In Gaming And Simulation
(3AMIGAS). [S.l.: s.n.], 2009. p. 25–34.

SMITH, A. R. Color gamut transform pairs. ACM Siggraph Computer Graphics, ACM,
v. 12, n. 3, p. 12–19, 1978.

SMITH, G. et al. Pcg-based game design: enabling new play experiences through
procedural content generation. In: ACM. Proceedings of the 2nd international workshop
on procedural content generation in games. [S.l.], 2011. p. 7.

SPICK, R. J.; COWLING, P.; WALKER, J. A. Procedural generation using spatial gans
for region-specific learning of elevation data. 2019 IEEE Conference on Games (CoG),
p. 1–8, 2019.

SUN, J. et al. Template-based generation of road networks for virtual city modeling. In:
ACM. Proceedings of the ACM symposium on Virtual reality software and technology.
[S.l.], 2002. p. 33–40.

VOSS, R. F. Random fractal forgeries. In: Fundamental algorithms for computer
graphics. [S.l.]: Springer, 1985. p. 805–835.

WATSON, B. et al. Procedural urban modeling in practice. IEEE Computer Graphics
and Applications, IEEE, v. 28, n. 3, p. 18–26, 2008.

WEIß, S. Fast Voxel-Based Hydraulic Erosion. Dissertação (Bachelorarbeit) —
Technische Universität München, 2016.

XIAODONG, C.; CHI-MAN, P. Improving the harmony of the composite image by
spatial-separated attention module. arXiv preprint arXiv:1907.06406, 2019.

YEH, R. A. et al. Semantic image inpainting with deep generative models. In: Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition. [S.l.: s.n.], 2017.
p. 5485–5493.

ZHANG, Z.; SONG, Y.; QI, H. Age progression/regression by conditional adversarial
autoencoder. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition. [S.l.: s.n.], 2017. p. 5810–5818.

ZHAO, B. et al. Multi-view image generation from a single-view. In: ACM. 2018 ACM
Multimedia Conference on Multimedia Conference. [S.l.], 2018. p. 383–391.

ZHOU, Y. et al. Non-stationary texture synthesis by adversarial expansion. ACM
Transactions on Graphics (TOG), ACM, v. 37, n. 4, p. 49, 2018.

ZHU, J.-Y. et al. Generative visual manipulation on the natural image manifold. In:
SPRINGER. European Conference on Computer Vision. [S.l.], 2016. p. 597–613.

ZHU, J.-Y. et al. Unpaired image-to-image translation using cycle-consistent adversarial
networks. In: Proceedings of the IEEE international conference on computer vision. [S.l.:
s.n.], 2017. p. 2223–2232.
