
Font2Fonts: A modified Image-to-Image translation framework for font generation

Debbie Honghee Ko
Soongsil University
South Korea
Debbie.pust@gmail.com

Ammar Ul Hassan
Soongsil University
South Korea
ammar.instantsoft@gmail.com

Saima Majeed
Soongsil University
South Korea
saimamajeed089@gmail.com

Jaeyoung Choi
Soongsil University
South Korea
choi@ssu.ac.kr

ABSTRACT

Generating a font from scratch requires font domain knowledge; it is also a labor-intensive and time-consuming task. With the remarkable success of deep learning methods for image synthesis, many researchers are focusing on generating fonts with these methods. To use such deep learning methods for font generation, language-specific font image datasets must be prepared manually, which is cumbersome and time-consuming. Additionally, existing supervised image-to-image translation methods such as pix2pix can only perform one-to-one domain translation, so they cannot be applied directly to font generation, which is a multi-domain task. In this paper, we propose Font2Fonts, a conditional generative adversarial network (GAN) for font synthesis in a supervised setting. Unlike pix2pix, which can only translate from one font domain to another, Font2Fonts is a multi-domain translation model. The proposed method can synthesize high-quality, diverse font images using a single end-to-end network. We verify the effectiveness of the proposed model through qualitative and quantitative experiments. Moreover, we also propose a Unicode-based module for automatically generating font image datasets. The proposed Unicode-based method can easily be applied to prepare font datasets for the characters of various languages.

CCS CONCEPTS

• Computer vision → Appearance and Texture representation; Machine Learning

KEYWORDS

Image-to-Image translation, font generation, generative adversarial network, style transfer

1 INTRODUCTION

Font designing is a time-consuming and labor-intensive task that requires domain expertise. A font designer takes several weeks to design a new font style for the English alphabet, and this time can grow to months or years for languages with large numbers of characters, such as Chinese (more than 50,000 glyphs) or Korean (11,172 glyphs). Moreover, for each font style such as normal, bold, italic, or bold-italic, the designer must design every character and symbol of the various styles and sizes individually.

With the recent advances in deep learning, many computer vision tasks such as image classification, object detection, and image synthesis can be performed with high accuracy and high quality. Many researchers are therefore adopting deep learning methods for generating font glyphs. Recently, Isola et al. proposed an image-to-image (i2i) translation framework, pix2pix [1], based on the generative adversarial network (GAN) [2], for generating high-quality images from one domain to another. This method can be applied to various domain translation tasks such as season transfer, style transfer, and labels-to-street-scenes. Pix2pix works in a supervised setting, i.e., it requires a paired training dataset.

Font generation can be regarded as an i2i task in which a new font style is generated from another font style. Therefore, pix2pix can be used directly for the font generation task. However, pix2pix is a one-to-one domain translation framework, which means that only one font style can be learned with a single pix2pix model. If we would like to learn n different font styles, we need n pix2pix models, which is unfeasible considering the memory and training time constraints.

This paper focuses on generating high-quality fonts using a single end-to-end network. We propose a multi-style font generation model, Font2Fonts, that uses a single generator to produce diverse fonts in a supervised setting. We build our network on top of the pix2pix model and apply several architectural modifications to achieve high-quality image synthesis in a multi-domain (multi-style) fashion (in this paper, we use the terms multi-domain and multi-style interchangeably).

Additionally, we propose a Unicode-based font dataset generator module that can easily generate cross-language font datasets. Existing methods for generating font image data require character labels for languages such as English, Chinese, Japanese, and Korean in order to build font datasets for training deep learning models. With the proposed Unicode module, the user can generate font image datasets for any language (Chinese, Korean, Japanese, etc.) without gathering character labels of that language, which can be a hard task for languages with large character sets.
The rest of this paper is organized as follows. In Section 2, we discuss the existing font dataset preparation technique and related deep learning based font generation models. In Section 3, we present the proposed Unicode-based font dataset generator module and the Font2Fonts model for high-quality font synthesis. In Section 4, we validate the proposed Font2Fonts model with various qualitative and quantitative experiments. Section 5 presents concluding remarks.

2 RELATED WORKS
In this section, we first discuss the existing technique for preparing font character datasets for deep learning models. We then describe some existing font generation models that utilize deep learning.
2.1 Font dataset preparation
Deep learning models require a decent amount of training data in order to perform tasks such as image recognition or image synthesis. However, a sufficiently large dataset of font characters is hard to find and cumbersome to create. Recently, IBM released an open-source project for recognizing Korean handwriting, named Tensorflow Hangul Recognition [3]. The project is an Android application that uses a TensorFlow model trained to recognize Korean syllables. The model used in this project is a Convolutional Neural Network (CNN) [4] trained on a large Hangul image dataset. The project includes an image dataset preparation module, the Hangul Image Generator, which takes TTF font files and a label file specifying the Hangul characters to be generated as images.

The module receives the font file and the label file as input when it is executed. The Font Loader receives the font file and creates a font file list, while the Character Label Loader reads the label file and creates a list of character labels. Finally, the Character Image Dataset Generator creates an image by applying the corresponding font style to each character of the label list, and each image is saved as a JPEG file. The path and label information of each stored JPEG file are also recorded in a label map document so that the images can be used as a dataset when a deep learning model is trained later. The structure and process of the Hangul Image Generator module are shown in Fig. 1.

Figure 1: Structure of the Hangul Image Generator module in IBM's Tensorflow Hangul Recognition.

The module generates Korean Hangul image datasets as required; however, there are some inconveniences in preparing the dataset. The module only supports a limited set of Hangul characters, which are manually stored in pre-prepared label files. More precisely, the module supports 2,350 Hangul characters, whereas the Korean language has around 11,172 Hangul characters. If the user wants to create other Hangul characters not listed in the label file, a separate label file must be prepared, which is a hard task. Another weak point is that the module only supports Hangul characters and no other language's characters.

Therefore, we propose a Unicode-based font image dataset generation module that overcomes the issues of the current module. With the proposed module, the user is not limited to the character images the module provides by default (the IBM module only supports specific Korean characters); instead, the user can generate any number of characters that the particular font file supports. Additionally, the proposed module is not specific to any language; it can be used to generate font character datasets for any language. The details of the proposed module are described in Section 3.

2.2 Deep Learning based font generation

Image-to-Image translation aims at learning an image generation function that can map a conditional input image of one domain (source class) to an image in another domain (target class). This problem can be considered either as supervised or as unsupervised i2i. In the supervised i2i setting, the conditional input image has a corresponding target domain image with structural similarity, also referred to as a paired dataset. In the unsupervised i2i setting, the input and target domain images have no such pair supervision.

Pix2pix [1] was the first method for supervised i2i and produced high-quality results in many i2i tasks. As most real-world datasets cannot be paired, cycleGAN [5] was the first method that introduced a network architecture and a cycle consistency loss function for unpaired i2i translation. Based on these two state-of-the-art (SOTA) methods, many researchers have produced various networks for multi-modal i2i in supervised and unsupervised settings. BicycleGAN [6] performs multimodal i2i in a supervised setting, while MUNIT [7] and StarGAN-v2 [8] perform multimodal i2i in an unsupervised fashion. FUNIT [9] performs i2i translation in a few-shot setting where the target domain images are very few.

Many researchers have utilized these i2i methods for generating fonts. The DCFont method [10] uses a font feature reconstruction approach for better-quality synthesis of Chinese handwriting: a pre-trained VGG network extracts the font style features, which are combined with the character embedding from the encoder. Similarly, Bo Chang et al. [11] proposed a method to generate Chinese characters in a personalized handwritten style using a DenseNet CycleGAN; the only difference from CycleGAN is that CycleGAN by default uses ResNet [12] blocks after the encoder, whereas they used DenseNet [13] blocks instead. Hanfei Sun et al. [14] focused on Chinese typography synthesis and added a pretrained CNN model that extracts the content features of the input character image.

All these related works mainly utilize pix2pix or cycleGAN for generating fonts. Since we can learn the font domain translation mapping from one style to another, we can treat this problem as a supervised setting and use the pix2pix model. However, pix2pix is not a multi-domain model, i.e., with pix2pix we can only learn one style at a time. In order to learn n styles in one training process, we would need n pix2pix generators because of its one-to-one mapping nature. In the next section, we describe our proposed Font2Fonts method in detail and demonstrate how we solve this single-domain i2i translation problem of pix2pix to generate diverse font styles.

3 PROPOSED NETWORK

In this section, we first describe the proposed font dataset module and then present the Font2Fonts i2i model for font generation.

3.1 Proposed Unicode-based font image generator

The architecture and process of the proposed Font Image Generator are shown in Fig. 2. Instead of providing a font file and a character label file that covers a limited set of Hangul characters only, the user now only specifies the name of the language and the corresponding font file. First, the Language Range Loader creates a list of Unicode range information indicating where the characters of the language are located. This Unicode range information is defined in advance in the Language Unicode Range document, and the Language Range Loader reads this information. Secondly, the Font Loader creates a list of font files from the TTF font files in the same way as the existing module. Thirdly, the Character Selector creates a list of character labels to be generated, using the list of Unicode ranges and the font files created by the two loaders: by comparing the Unicode code points supported by each font file with the Unicode ranges of the language specified by the user, each character to be generated as an image is extracted as a code point and added to a character label list. Finally, the Character Image Dataset Generator generates a character image from the list of character labels and a font file in the same way as the existing module, and stores each character image file path and label in a label map document.

Figure 2: Proposed Unicode-based Font Image Generator.
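To make the pipeline above concrete, the following is a minimal sketch of how such a generator could be implemented in Python, assuming the fontTools and Pillow libraries are available. The file name language_unicode_ranges.json, the function names, and the rendering parameters are illustrative assumptions, not part of the module described in this paper.

```python
# Minimal sketch of the Unicode-based font image generator (illustrative only).
# Assumes fontTools and Pillow; file and function names are hypothetical.
import json, os, csv
from fontTools.ttLib import TTFont
from PIL import Image, ImageDraw, ImageFont

def load_language_ranges(path, language):
    """Language Range Loader: read the Unicode ranges stored under a language key."""
    with open(path, encoding="utf-8") as f:
        ranges = json.load(f)[language]          # e.g. [["AC00", "D7AF"], ...]
    return [(int(start, 16), int(end, 16)) for start, end in ranges]

def select_characters(ttf_path, code_ranges):
    """Character Selector: keep only code points the font actually supports."""
    cmap = TTFont(ttf_path)["cmap"].getBestCmap()
    return [chr(cp) for start, end in code_ranges
            for cp in range(start, end + 1) if cp in cmap]

def generate_font_dataset(ttf_path, language, ranges_doc, out_dir, size=256):
    """Character Image Dataset Generator: render each character, log a label map."""
    os.makedirs(out_dir, exist_ok=True)
    chars = select_characters(ttf_path, load_language_ranges(ranges_doc, language))
    font = ImageFont.truetype(ttf_path, int(size * 0.75))
    with open(os.path.join(out_dir, "label_map.csv"), "w", newline="",
              encoding="utf-8") as label_map:
        writer = csv.writer(label_map)
        for ch in chars:
            img = Image.new("RGB", (size, size), "white")
            draw = ImageDraw.Draw(img)
            draw.text((size // 2, size // 2), ch, font=font,
                      fill="black", anchor="mm")
            path = os.path.join(out_dir, f"U{ord(ch):04X}.jpg")
            img.save(path, "JPEG")
            writer.writerow([path, ch])

# Example: render every Hangul syllable supported by one (hypothetical) font file.
# generate_font_dataset("NanumGothic.ttf", "kr", "language_unicode_ranges.json", "out/kr")
```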

3.1.1 Language Unicode Range Document Structure. As shown in Fig. 3, the Language Unicode Range document is predefined in JSON format so that the Font Image Generator can work smoothly. The Unicode range information specified in the document conforms to the Unicode standard [15]. The language name and the Unicode ranges for each language are stored as key and value, respectively. One range is a list holding a pair of starting and ending code points. However, because the characters of some languages are distributed over several areas, each language stores a two-dimensional list of Unicode ranges.

Figure 3: Structure of Language Unicode Range Document.

In addition, all code points in the document are stored as strings. JSON documents can store 4-digit Unicode code points directly; however, since characters are gradually being added to Unicode and some characters have code points of 5 or more digits, the code points are defined as strings. When the user specifies a language, the Language Range Loader in the module reads the list of ranges from the document using the language name as the key and then converts each Unicode range defined as a string into a hexadecimal integer.

3.2 Proposed Font2Fonts for Font Generation

The pix2pix model consists of an encoder-decoder architecture with a patchGAN discriminator. The encoder takes a reference/source domain image, and the decoder generates the corresponding target domain image. This kind of architecture can only learn one font style in a single training run, because the decoder does not know which specific font style to generate.

In order to fix this problem of pix2pix, we add a guiding style vector in the latent space of the encoder. As in pix2pix, our encoder takes a reference font image as input and downsamples it to obtain the character latent information; we then concatenate a style vector, a one-hot representation of the target font style. This guides the decoder to generate a specific target font style. Instead of generating a single font style from a reference font style, our generator is able to produce a variety of font styles (multi-domain), hence the name Font2Fonts.

To train Font2Fonts, we use a reference font style for generating a specific content. We use a paired dataset as input, in which each reference font character has a corresponding target-style character. During testing, we can control which character is generated through the reference font image fed to the encoder, and we can control the generated font style by changing the style vector in the latent space.

Our model consists of a conditional generator G and a patchGAN discriminator D, similar to the original pix2pix network. As shown in Fig. 4, G maps an input reference font style image x to a target font style output image x*, such that x* has the style of the target font class cy while x* and x share the same character content. In the next subsections, we elaborate on the network architecture and loss functions.

Figure 4: Network architecture of Font2Fonts.

3.2.1 Font2Fonts Architecture. The proposed Font2Fonts generator G consists of an encoder Ex and a decoder Fx. The encoder Ex is made up of several 2D convolutional layers. Instance normalization [16] follows each convolutional layer, unlike pix2pix, which uses batch normalization [17]. We use the ReLU activation function to add non-linearity. The encoder maps the reference font style image x to a character content code zc. Here we concatenate the guiding style vector zs, a one-hot vector, to guide the decoder to generate the output in a particular font style. The decoder Fx takes the vector z, the combination of zc and zs, as input and generates the target font style image. The decoder consists of several upscaling convolutional layers, each followed by an instance normalization layer and ReLU non-linearity, as in the encoder. More details are given in Section 4.2.

Our discriminator D is borrowed from the original patchGAN discriminator with some modifications. Instead of only telling whether the input is real or fake, our discriminator also tells which style the input image is in. To achieve this, we add a fully connected layer at the end of the discriminator whose output length depends on the total number of font styles the network is trained on.
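The following is a minimal PyTorch-style sketch of the architecture described in this subsection. It is an illustration only, not the authors' implementation: the layer counts, channel widths, and kernel sizes are simplified assumptions (Section 4.2 gives the settings actually used). Its purpose is to show how the one-hot style code is concatenated to the content code in the latent space and how a style-classification head is attached to the patchGAN discriminator.

```python
# Sketch of a Font2Fonts-style generator and discriminator (assumed shapes/sizes).
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, num_styles, base=64):
        super().__init__()
        # Encoder Ex: stride-2 convolutions, each followed by InstanceNorm + ReLU.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, base, 4, 2, 1), nn.InstanceNorm2d(base), nn.ReLU(True),
            nn.Conv2d(base, base * 2, 4, 2, 1), nn.InstanceNorm2d(base * 2), nn.ReLU(True),
            nn.Conv2d(base * 2, base * 4, 4, 2, 1), nn.InstanceNorm2d(base * 4), nn.ReLU(True),
        )
        # Decoder Fx: upscaling convolutions; input channels are widened by
        # num_styles because the one-hot style code zs is concatenated to zc.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(base * 4 + num_styles, base * 2, 4, 2, 1),
            nn.InstanceNorm2d(base * 2), nn.ReLU(True),
            nn.ConvTranspose2d(base * 2, base, 4, 2, 1),
            nn.InstanceNorm2d(base), nn.ReLU(True),
            nn.ConvTranspose2d(base, 3, 4, 2, 1), nn.Tanh(),
        )

    def forward(self, x, style_onehot):
        zc = self.encoder(x)                                   # character content code
        zs = style_onehot[:, :, None, None].expand(-1, -1, *zc.shape[2:])
        return self.decoder(torch.cat([zc, zs], dim=1))        # z = [zc, zs]

class Discriminator(nn.Module):
    """PatchGAN discriminator with an extra style-classification head."""
    def __init__(self, num_styles, base=64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, base, 4, 2, 1), nn.LeakyReLU(0.2, True),
            nn.Conv2d(base, base * 2, 4, 2, 1), nn.InstanceNorm2d(base * 2), nn.LeakyReLU(0.2, True),
            nn.Conv2d(base * 2, base * 4, 4, 2, 1), nn.InstanceNorm2d(base * 4), nn.LeakyReLU(0.2, True),
        )
        self.patch = nn.Conv2d(base * 4, 1, 4, 1, 1)           # real/fake patch map
        self.style = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                   nn.Linear(base * 4, num_styles))  # which font style

    def forward(self, img):
        h = self.features(img)
        return self.patch(h), self.style(h)
```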


3.2.2 Loss Functions. The proposed Font2Fonts is trained to solve the following minimax optimization problem:

G* = arg min_G max_D L_GAN(G, D) + L_SCL(D) + L_L1(G),   (1)

where L_GAN, L_SCL, and L_L1 are the GAN loss, the style classification loss, and the L1 loss, respectively. The GAN loss is a conditional GAN loss given by

L_GAN(D, G) = E_y[log D(y)] + E_x[log(1 - D(G(x)))].   (2)

The style classification loss L_SCL guides the generator to synthesize the output image in a specific font style. The fully connected layer added at the end of D produces the output for this loss; the discriminator D predicts which font style the generated output image is in. This loss is given by

L_SCL(D) = E_y[log D(y)] + E_x[log(1 - D(G(x)))].   (3)

The L1 loss L_L1 guides the generator to produce clean, noise-free images by minimizing the pixel-wise difference between the generated image and the target font style image. It is formulated as

L_L1(G) = E_{x,y}[ ||y - G(x)||_1 ],   (4)

where y is the target font image and G(x) is the image generated from the reference font image x.
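A sketch of one training step for the objective in Eqs. (1)-(4) is shown below, continuing the PyTorch sketch above. The specific loss primitives (binary cross-entropy for the adversarial term, cross-entropy for the style term) and the unit loss weights are assumptions; the paper does not state these implementation details.

```python
# One assumed training step for the Font2Fonts objective (illustrative only).
import torch
import torch.nn.functional as F

def training_step(G, D, g_opt, d_opt, x, y, style_idx, num_styles):
    style_onehot = F.one_hot(style_idx, num_styles).float()

    # Discriminator update: real/fake patch loss plus style classification on real y.
    d_opt.zero_grad()
    fake = G(x, style_onehot).detach()
    real_patch, real_style = D(y)
    fake_patch, _ = D(fake)
    d_gan = F.binary_cross_entropy_with_logits(real_patch, torch.ones_like(real_patch)) + \
            F.binary_cross_entropy_with_logits(fake_patch, torch.zeros_like(fake_patch))
    d_scl = F.cross_entropy(real_style, style_idx)          # style classification term
    (d_gan + d_scl).backward()
    d_opt.step()

    # Generator update: fool D, land in the target style class, stay close to y.
    g_opt.zero_grad()
    fake = G(x, style_onehot)
    fake_patch, fake_style = D(fake)
    g_gan = F.binary_cross_entropy_with_logits(fake_patch, torch.ones_like(fake_patch))
    g_scl = F.cross_entropy(fake_style, style_idx)          # guide the output style
    g_l1 = F.l1_loss(fake, y)                               # L_L1 of Eq. (4)
    (g_gan + g_scl + g_l1).backward()
    g_opt.step()
```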

4 EXPERIMENTS

4.1 Training Dataset

We trained our network on Korean Hangul characters, using the 2,350 most commonly used Hangul characters from 20 font files. Our experiments showed that this number of fonts is sufficient for the model to learn diverse styles. We used 75% (15 fonts) of these randomly chosen fonts for training and the remaining 25% of unseen font styles for testing. We used our Unicode-based font image generator to create the paired font image dataset. Both the input reference font image and the target font image are RGB images of size 256×256×3 (3 channels). In order to make the model robust, during the pre-training process we learn a one-to-one mapping, i.e., the 2,350 skeletons of each font style are mapped to the corresponding target-domain font images.

4.2 Network Details and Parameter Settings

The encoder and decoder used in our network consist of seven down-sampling and seven up-sampling layers, respectively. Each layer in the encoder comprises a convolution operation followed by instance normalization [16] and a ReLU activation function [18]. Every layer in the decoder consists of a deconvolution operation followed by instance normalization and a ReLU activation function; a Tanh activation is used in the last layer. We optimized our network with Adam [19]. We used a 2×2 stride in all layers except the last layer, which has a stride of 1. The batch size was set to 16, the learning rate to 0.0002, and we trained the network for 40 epochs.

4.3 Qualitative and Quantitative results

We compare our model with the pix2pix model for the qualitative evaluation by visualizing the font images generated by the proposed Font2Fonts and by pix2pix. For a fair comparison, both models are trained on the same training data and evaluated on the same font styles.

Figure 5: Font2Fonts comparison vs pix2pix.

As shown in Fig. 5, the proposed method produces visually better synthesized font images than pix2pix. When the synthesized images are zoomed in, poor smoothness, broken strokes, and blurriness are often found in the other method. We also show additional results of Font2Fonts on various target font styles in Fig. 6.

Figure 6: Font2Fonts results on various styles.

To evaluate the content accuracy of the synthesized characters, we performed a Hangul character recognition experiment using a pre-trained CNN classifier trained on Hangul character images. This CNN model achieves 99% test accuracy on real Hangul characters. At inference time, we computed the classifier's accuracy when predicting the character labels of images generated by our model and by pix2pix. The quantitative comparison is shown in Table 1; our model clearly outperforms pix2pix.

Table 1: Quantitative comparison.
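For reference, the evaluation protocol described above could be implemented roughly as follows; the generator, classifier, and data-loader interfaces are assumptions carried over from the earlier sketches, not the authors' code.

```python
# Sketch of the content-accuracy evaluation: a pre-trained Hangul classifier
# predicts the character label of each generated glyph, and we count matches.
import torch

@torch.no_grad()
def content_accuracy(generator, classifier, loader, num_styles, device="cuda"):
    generator.eval(); classifier.eval()
    correct, total = 0, 0
    for x, char_label, style_idx in loader:       # reference image, char id, target style
        x, char_label = x.to(device), char_label.to(device)
        onehot = torch.nn.functional.one_hot(style_idx, num_styles).float().to(device)
        fake = generator(x, onehot)               # synthesized target-style glyph
        pred = classifier(fake).argmax(dim=1)     # predicted character class
        correct += (pred == char_label).sum().item()
        total += char_label.numel()
    return correct / total
```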


4.4 Experimental results of Unicode-based font dataset generator

The outputs of the Hangul Image Generator module and of the proposed Unicode-based font image generator module for Korean Hangul characters are shown in Fig. 7. (a) shows character images generated by the Hangul Image Generator from the 2,350-character Hangul label file, and (b) shows Korean Hangul character images generated by the proposed module using only the language code 'kr' (short for Korean characters). For easy comparison, the output range is limited to the Hangul Syllables block (U+AC00-U+D7AF). Fig. 7 shows that the Hangul images generated by the proposed Unicode-based font image generator module are identical in quality to those of the original module, but are produced with far less effort.

Figure 7: Unicode-based image generator vs existing method.

Fig. 8 shows the results produced when the proposed Unicode Font Image Generator module is given languages other than Korean Hangul. When the module is run, the characters of the specified language are generated as images.

Figure 8: Cross language results.
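As an illustration, this cross-language behaviour amounts to changing the language key passed to the hypothetical generate_font_dataset sketch from Section 3.1; the key names and the font file below are assumptions, not values used in the paper.

```python
# Generating datasets for several languages only changes the language key
# (key names and font path are illustrative).
for lang in ["kr", "jp", "cn", "en"]:
    generate_font_dataset("SomeUnicodeFont.ttf", lang,
                          "language_unicode_ranges.json", f"out/{lang}")
```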

5 CONCLUSIONS

In this paper, we introduced a multi-style i2i translation model for font synthesis. We showed how our model handles the font generation task using a conditional GAN in a multi-domain setting, unlike the pix2pix model, which is a single-domain model. Qualitative and quantitative results demonstrate the effectiveness of our method on the font generation task.

We also proposed a Unicode-based font dataset generator, a module that generates character image datasets from only a font file and language information. The proposed module is easy to use because it automatically determines the characters to be created when the user specifies the language they need. In addition, the font image datasets created with the module are expected to be useful in fields other than font production, such as character recognition.

One limitation of our Font2Fonts model is that it learns a multi-domain translation between fixed font styles in a supervised setting; if an unsupervised setting is considered, where the target domain font style is not available for training, our model cannot be adopted. We will consider this unsupervised multi-domain setting in our future work.

ACKNOWLEDGMENTS
This work was supported by Institute for Information &
communications Technology Promotion (IITP) grant funded by
the Korean government (MSIP) (No. R0117-17-0001, Technology
Development Project for Information, Communication, and
Broadcast).

REFERENCES
[1] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, "Image-to-image translation with conditional adversarial networks", CVPR, 2017.
[2] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative Adversarial Nets", NIPS, 2014.
[3] P. Van Eck, "Handwritten Korean Character Recognition with TensorFlow and Android", 2017. Retrieved Apr 30, 2019 from https://github.com/IBM/tensorflow-hangul-recognition.
[4] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition", Proceedings of the IEEE, November 1998.
[5] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks", ICCV, 2017.
[6] J.-Y. Zhu et al., "Toward multimodal image-to-image translation", NIPS, 2017.
[7] X. Huang, M.-Y. Liu, S. Belongie, and J. Kautz, "Multimodal unsupervised image-to-image translation", ECCV, 2018.
[8] Y. Choi, Y. Uh, J. Yoo, and J.-W. Ha, "StarGAN v2: Diverse image synthesis for multiple domains", CVPR, 2020.
[9] M.-Y. Liu, X. Huang, A. Mallya, T. Karras, T. Aila, J. Lehtinen, and J. Kautz, "Few-shot unsupervised image-to-image translation", ICCV, 2019.
[10] Y. Jiang, Z. Lian, Y. Tang, and J. Xiao, "DCFont: an end-to-end deep Chinese font generation system", SIGGRAPH Asia Technical Briefs, p. 22, 2017.
[11] B. Chang, Q. Zhang, S. Pan, and L. Meng, "Generating handwritten Chinese characters using CycleGAN", CoRR, abs/1801.08624, 2018.
[12] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition", CVPR, 2016.
[13] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten, "Densely connected convolutional networks", CVPR, 2017.
[14] H. Sun, Y. Luo, and Z. Lu, "Unsupervised Typography Transfer", CVPR, 2018.
[15] The Unicode Consortium, The Unicode Standard, https://www.unicode.org/standard/standard.html
[16] D. Ulyanov, A. Vedaldi, and V. Lempitsky, "Improved texture networks: Maximizing quality and diversity in feed-forward stylization and texture synthesis", CVPR, 2017.
[17] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift", arXiv preprint arXiv:1502.03167, 2015.
[18] B. Xu, N. Wang, T. Chen, and M. Li, "Empirical evaluation of rectified activations in convolutional network", ICML Workshop, 2015.
[19] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization", arXiv preprint arXiv:1412.6980, 2014.
