
EigenGAN: Layer-Wise Eigen-Learning for GANs

Zhenliang He 1,2, Meina Kan 1,2, Shiguang Shan 1,2,3

1 Key Laboratory of Intelligent Information Processing, ICT, CAS
2 University of Chinese Academy of Sciences, Beijing, China
3 Peng Cheng Laboratory, Shenzhen, China

zhenliang.he@vipl.ict.ac.cn, {kanmeina,sgshan}@ict.ac.cn

arXiv:2104.12476v1 [cs.CV] 26 Apr 2021

Abstract

Recent studies on Generative Adversarial Networks (GANs) reveal that different layers of a generative CNN hold different semantics of the synthesized images. However, few GAN models have explicit dimensions to control the semantic attributes represented in a specific layer. This paper proposes EigenGAN, which is able to unsupervisedly mine interpretable and controllable dimensions from different generator layers. Specifically, EigenGAN embeds one linear subspace with an orthogonal basis into each generator layer. Via the adversarial training to learn a target distribution, these layer-wise subspaces automatically discover a set of "eigen-dimensions" at each layer corresponding to a set of semantic attributes or interpretable variations. By traversing the coefficient of a specific eigen-dimension, the generator can produce samples with continuous changes corresponding to a specific semantic attribute. Taking the human face for example, EigenGAN can discover controllable dimensions for high-level concepts such as pose and gender in the subspaces of deep layers, as well as low-level concepts such as hue and color in the subspaces of shallow layers. Moreover, under the linear circumstance, we theoretically prove that our algorithm derives the principal components as PCA does. Code can be found at https://github.com/LynnHo/EigenGAN-Tensorflow.

[Figure 1 panels: Gender (Layer 3); Hair Color (Layer 5); Hue (Layer 6); Painting Style (Layer 2); Pose (Layer 3); Hue (Layer 6)]

Figure 1. Example of interpretable dimensions learned by EigenGAN. The smaller the index, the deeper the layer.
1. Introduction

Generative adversarial network (GAN) [10] and its variants [25, 11, 5, 18] achieve great success in high-fidelity image synthesis. Strong evidence [39, 41, 2] shows that different layers of a discriminative CNN capture different semantic concepts in terms of abstraction level, e.g., shallower layers detect color and texture while deeper layers focus more on objects and parts. Accordingly, we can expect that a generative CNN has a similar property, and recent GAN studies confirm this fact [18, 38, 3]. StyleGAN [18] shows that deeper generator layers control higher-level attributes such as pose and glasses while shallower layers control lower-level features such as color and edge. Yang et al. [38] found a similar phenomenon in scene synthesis, showing that deep layers tend to determine the spatial layout while shallow layers determine the color scheme. A similar conclusion is also reached by Bau et al. [3] in their dissection analysis of GAN features at different layers. All this evidence reveals a property that different generator layers hold different semantics of the synthesized images in terms of abstraction level.

According to this property, one can identify semantic attributes from different layers of a well-trained generator by applying special algorithms [3, 12, 36, 38], and then manipulate these attributes on the synthesized images. For example, Bau et al. [3] identify the causal units for a specific concept (such as "tree") by dissection and intervention on each generator layer. Turning the causal units on or off causes the concept to appear or disappear on the synthesized image. However, these methods are all post-processing algorithms for well-trained GAN generators. As for the generator itself, it operates as a black box and lacks explicit dimensions to directly control the semantic attributes represented in different layers. In other words, we do not know what attributes are represented in different generator layers or how to manipulate these attributes, unless we deeply inspect each layer with post-processing algorithms [3, 12, 36, 38].

Following the above discussion, this paper starts with a question: can a generator itself automatically/unsupervisedly learn explicit control of the semantic attributes represented in different layers? To this end, we propose EigenGAN, which equips a generator with interpretable dimensions for different layers in a completely unsupervised manner. Specifically, EigenGAN embeds a linear subspace model with orthogonal basis into each generator layer. On the one hand, since each subspace model is directly embedded into a specific layer, a direct link is established between the subspace and the semantic variations of the corresponding layer. On the other hand, driven by the adversarial learning, the generator tries to capture the principal variations of the data distribution, and these principal variations are separately represented in different layers in terms of their abstraction level. Then, with the help of the subspace model, the principal variations of a specific layer are further orthogonally separated into different basis vectors. Finally, each basis vector discovers an "eigen-dimension" that controls an attribute or interpretable variation corresponding to the semantics of its layer. For example, as shown at the top of Fig. 1, an eigen-dimension of the subspace embedded in a deep layer controls gender, while another of the subspace embedded in the shallowest layer controls the hue of the image. Furthermore, under the linear circumstance, i.e., a one-layer model, we theoretically prove that our EigenGAN is able to discover the principal components as PCA [15] does, which gives us a strong insight and reason to embed the subspace models into different generator layers. Besides, we also provide a manifold perspective showing that our EigenGAN decomposes the data generation modeling into layer-wise dimension expanding steps.

2. Related Works

2.1. Interpretability Learning for GANs

The first attempt to learn interpretable representations for GAN generators is InfoGAN [6], which employs mutual information maximization (MIM) between the latent variable and the synthesized samples. Including InfoGAN, MIM-based methods [6, 16, 17, 14, 20, 21, 22] can automatically discover interpretable dimensions which respectively control different semantic attributes such as pose, glasses, and emotion of the human face. However, the learning of these interpretable dimensions is mainly driven by the MIM objective, and there is no direct link from these dimensions to the semantics of any specific generator layer. Ramesh et al. [33] found that the principal right-singular subspace of the generator Jacobian shows a local disentanglement property; they then apply a spectral regularization to align the singular vectors with straight coordinates, and finally obtain globally interpretable representations. However, this work also does not investigate the correspondence between these interpretable representations and the semantics of different generator layers. Different from these methods, the interpretability of our EigenGAN comes from the special design of layer-wise subspace embedding, rather than from imposing any objective or regularization. Moreover, our EigenGAN establishes an explicit connection between the interpretable dimensions and the semantics of a specific layer by directly embedding a subspace model into that layer.

The above methods try to learn a GAN generator with explicit interpretable representations; in contrast, another class of methods tries to reveal the interpretable factors from a well-trained GAN generator [9, 3, 35, 38, 32, 12, 36]. [9, 3, 35, 38] adopt pre-trained semantic predictors to identify the corresponding semantic factors in the GAN latent space, e.g., Yang et al. [38] use a layout estimator, a scene category recognizer, and an attribute classifier to find out the decision boundaries for these concepts in the latent space. Without introducing external supervision, several methods search for interpretable factors in self-supervised [32] or unsupervised [12, 36] manners. Plumerault et al. [32] utilize simple image transforms (e.g., translation and zoom) and search for the axes of these transforms in the latent space. Harkonen et al. [12] apply PCA on the feature space of the early layers, and the resulting principal components represent interpretable variations. Shen and Zhou [36] show that the weight matrix of the very first fully-connected layer of a generator determines a set of critical latent directions which dominate the image synthesis, and moving along these directions controls a set of semantic attributes. Among these methods, [3, 35, 38, 12, 36] carefully investigate the semantics represented in different generator layers. However, this class of methods can only operate on well-trained GANs; on the contrary, our EigenGAN aims to discover the interpretable dimensions for each generator layer along with the GAN training in an end-to-end manner.

2.2. Generative Adversarial Networks

Generative adversarial network (GAN) [10] is a sort of generative model which can synthesize data samples from noises. The learning process of GAN is the competition between a generator and a discriminator. Specifically, the discriminator tries to distinguish the synthesized samples from the real ones, while the generator tries to make the synthesized samples as realistic as possible in order to fool the discriminator. When the competition reaches Nash equilibrium, the synthesized data distribution is identical to the real data distribution.

Figure 2. Overview of the proposed EigenGAN. The main stream of the model is a chain of 2-stride transposed convolutional blocks which gradually enlarges the resolution of the feature maps and finally outputs a synthesized sample, which the discriminator judges as real or fake. In the ith layer, we embed a linear subspace with orthonormal basis U_i = [u_i1, ..., u_iq], and each basis vector u_ij is intended to unsupervisedly discover an "eigen-dimension" which holds an interpretable variation of the synthesized samples.

GANs show promising performance and properties on data synthesis. Therefore, plenty of research on GANs has appeared, including loss functions [30, 25, 1], regularizations [34, 26, 28], conditional generation [27, 31, 29], representation learning [24, 6, 8], architecture design [7, 5, 18], applications [13, 42, 40], etc. Our EigenGAN can be categorized into representation learning as well as architecture design for GANs.

3. EigenGAN

In this section, we first introduce the EigenGAN generator design with layer-wise subspace models in Sec. 3.1. Then in Sec. 3.2, we make a discussion from the linear case to the general case of EigenGAN and finally provide a manifold perspective.

3.1. Generator with Layer-Wise Subspaces

Fig. 2 shows our generator architecture. Our target is to learn a t-layer generator mapping from a set of latent variables {z_i ∈ R^q | z_i ∼ N_q(0, I), i = 1, ..., t} to the synthesized image x = G(z_1, ..., z_t), where z_i is directly injected into the ith generator layer.

In the ith layer, we embed a linear subspace model S_i = (U_i, L_i, µ_i), where

• U_i = [u_i1, ..., u_iq] is the orthonormal basis of the subspace, and each basis vector u_ij ∈ R^{H_i × W_i × C_i} is intended to unsupervisedly discover an "eigen-dimension" which holds an interpretable variation of the synthesized samples.

• L_i = diag(l_i1, ..., l_iq) is a diagonal matrix with l_ij deciding the "importance" of the basis vector u_ij. To be specific, a high absolute value of l_ij means that u_ij controls a major variation of the ith layer while a low absolute value denotes a minor variation, which can also be viewed as a kind of dimension selection.

• µ_i denotes the origin of the subspace.

Then, we use the ith latent variable z_i = [z_i1, ..., z_iq]^T as the coordinates (linear combination coefficients) to sample a point from the subspace S_i:

    φ_i = U_i L_i z_i + µ_i                      (1)
        = Σ_{j=1}^q z_ij l_ij u_ij + µ_i.        (2)

This sample point φ_i will be added to the network feature of the ith layer as stated next.

Let h_i ∈ R^{H_i × W_i × C_i} denote the feature maps of the ith layer and x = h_{t+1} denote the final synthesized image; the forward relation between the adjacent layers is

    h_{i+1} = Conv2x(h_i + f(φ_i)), i = 1, ..., t,    (3)

where "Conv2x" denotes transposed convolutions that double the resolution of the feature maps, and f can be the identity function or a simple transform (a 1x1 convolution in practice). As can be seen from Eq. (3), the sample point φ_i from the subspace S_i directly interacts with the network feature of the ith layer. Therefore, the subspace S_i directly determines the variations of the ith layer; more concretely, the q coordinates z_i = [z_i1, ..., z_iq]^T respectively control q different variations.

Besides, we also inject a noise input ε ∼ N(0, I) into the bottom of the generator, intended to capture the remaining variations missed by the subspaces, as follows,

    h_1 = FC(ε),    (4)

where "FC" denotes the fully-connected layer.
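To make the layer-wise construction concrete, below is a minimal NumPy sketch of one layer's subspace sampling (Eq. (1)) and feature injection (Eq. (3)), together with the orthogonality regularizer used in Sec. 3.2. The sizes, the identity choice for f, and the omitted transposed convolution are illustrative assumptions, not the released TensorFlow implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for one generator layer i (illustrative, not the paper's):
H, W, C = 8, 8, 64   # feature-map resolution and channel count of layer i
q = 6                # subspace dimension (6 basis vectors per layer, as in Sec. 4)
d = H * W * C        # flattened feature dimension

# Subspace model S_i = (U_i, L_i, mu_i): orthonormal basis, importances, origin.
U, _ = np.linalg.qr(rng.standard_normal((d, q)))  # d x q with orthonormal columns
L = np.diag(rng.standard_normal(q))               # diagonal "importance" matrix L_i
mu = rng.standard_normal(d)                       # subspace origin mu_i

def sample_subspace_point(z):
    """Eq. (1): phi_i = U_i L_i z_i + mu_i, with z_i the q subspace coordinates."""
    return U @ (L @ z) + mu

def orthogonality_penalty(U):
    """Regularizer ||U^T U - I||_F^2 that keeps the basis (near) orthonormal."""
    k = U.shape[1]
    return float(np.sum((U.T @ U - np.eye(k)) ** 2))

z = rng.standard_normal(q)           # z_i ~ N_q(0, I)
phi = sample_subspace_point(z)       # a point on the subspace S_i
h = rng.standard_normal((H, W, C))   # stand-in for the layer-i feature maps h_i

# Eq. (3): h_{i+1} = Conv2x(h_i + f(phi_i)); here f is the identity and the
# 2-stride transposed convolution is left abstract, so we only form its input.
h_injected = h + phi.reshape(H, W, C)
```

In the full model this injected feature would pass through Conv2x to the next layer, and the bottom feature h_1 would come from FC(ε) as in Eq. (4).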
Figure 3. Manifold perspective of EigenGAN. At each layer, a linear subspace is added to the feature manifold, expanding the manifold with "straight" directions along which the variations of some semantic attributes (e.g., pose, gaze, hue) are linear. At the end of each layer, nonlinear mappings "bend" these straight directions, yet another subspace at the next layer will continue to add new straight directions. Here, we only show one semantic direction of each subspace just for simplicity; generally, each subspace contains multiple orthogonal directions.

The bases {U_i}_{i=1}^t, the importance matrices {L_i}_{i=1}^t, the origins {µ_i}_{i=1}^t, and the convolution kernels are all learnable parameters, and the learning can be driven by various adversarial losses [10, 25, 1, 28]. In this paper, hinge loss [28] is used for the adversarial training. Besides, the orthogonality of U_i is achieved by the regularization ||U_i^T U_i − I||_F^2. After training, each latent dimension z_ij can explicitly control an interpretable variation corresponding to the semantics of its layer.

3.2. Discussion

Linear Case To better understand how our model works, we first discuss the linear case of our EigenGAN. Adapting from Eq. (1), the linear model is formulated as below,

    x = ULz + µ + σε.    (5)

This equation relates a d-dimension observation vector x to corresponding q-dimension (q < d) latent variables z ∼ N_q(0, I) by an affine transform UL and a translation µ. Besides, a noise vector ε ∼ N_d(0, I) is introduced to compensate for the missing energy. We also constrain U to have orthonormal columns and L to be a diagonal matrix, like the general case in Sec. 3.1. This formulation can also be regarded as a constrained case of Probabilistic PCA [37].

To estimate U, L, µ, and σ in Eq. (5) with n observations {x_i}_{i=1}^n, an analytical solution is maximum likelihood estimation (MLE). Please refer to the appendix for a detailed derivation of the MLE results. One important result is that the columns of U^ML = [u_1^ML, ..., u_q^ML] are the eigenvectors of the data covariance corresponding to the q largest eigenvalues, which is exactly the same as the result of PCA [15]. That is to say, the linear EigenGAN is able to discover the principal dimensions, which gives us a strong insight and motivation to embed such a linear model (Eq. (5)) hierarchically into different generator layers as stated in Sec. 3.1.

EigenGAN (General Case) With the insight of the linear case, we suppose that the linear subspace model embedded in a specific layer can capture the principal semantic variations of that layer, and these principal variations are orthogonally separated into the basis vectors. In consequence, each basis vector discovers an "eigen-dimension" that controls an attribute or interpretable variation corresponding to the semantics of its layer.

Manifold Perspective Fig. 3 shows a manifold perspective of EigenGAN. From this aspect, the subspace of each layer expands the feature manifold with "straight" directions along which the variations of some semantic attributes are linear. At the end of each layer, nonlinear mappings "bend" these straight directions, yet another subspace at the next layer will continue to add new straight directions. In a word, EigenGAN decomposes the data generation modeling into hierarchical dimension expanding steps, i.e., expanding the feature manifold with linear semantic dimensions layer-by-layer.

4. Experiments

Dataset We test our method on CelebA [23], FFHQ [18], and Danbooru2019 Portraits [4]. CelebA contains 202,599 celebrity face images with annotations of 40 binary attributes. FFHQ contains 70,000 high-quality face images, and Danbooru2019 Portraits contains 302,652 anime face images. We use the CelebA attributes for the quantitative evaluations and use FFHQ and Danbooru2019 Portraits for more visual results.

Implementation Details We use hinge loss [28] and R1 penalty [26] for the adversarial training. We adopt the Adam solver [19] for all networks and parameter moving average for the generator. The generator is designed for 256 × 256 images and contains 6 upsampling convolutional blocks. A whole block with one upsampling is defined as a "layer", and one linear subspace with 6 basis vectors is embedded into the generator at each layer. Please refer to the appendix for detailed network architectures.
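The MLE result of the linear case in Sec. 3.2 (the columns of U^ML are the top-q eigenvectors of the data covariance, as in PCA) can be sanity-checked numerically without any GAN training: generate data from a known linear model x = ULz + µ + σε and verify that the PCA basis aligns with the ground-truth columns of U. The sizes, noise level, and importance values below are illustrative assumptions; distinct importances are chosen so that the basis is identifiable.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative sizes for the linear model of Eq. (5): x = U L z + mu + sigma * eps.
d, q, n = 20, 3, 100_000
sigma = 0.05

U, _ = np.linalg.qr(rng.standard_normal((d, q)))  # ground-truth orthonormal basis
L = np.diag([3.0, 2.0, 1.0])                      # distinct importances (assumed)
mu = rng.standard_normal(d)

z = rng.standard_normal((n, q))
eps = rng.standard_normal((n, d))
X = z @ (U @ L).T + mu + sigma * eps              # n observations of the model

# PCA: eigenvectors of the sample covariance, sorted by decreasing eigenvalue.
eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
pca_basis = eigvecs[:, ::-1][:, :q]

# Up to sign, the j-th principal direction should match the j-th column of U,
# because Cov(x) = (UL)(UL)^T + sigma^2 I has eigenpairs (l_j^2 + sigma^2, u_j).
sims = [abs(pca_basis[:, j] @ U[:, j]) for j in range(q)]
print(sims)  # each value close to 1.0
```

The linear case study of Sec. 4.2 performs the analogous comparison with an adversarially trained linear EigenGAN in place of PCA's closed-form solution.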
[Figure 4 panels: L1 D5 (Layer 1, Dimension 5): Facial Hair → Hat (Hat 45%, Sideburns 33%); L2 D2: Hair Side & Background Texture Orientation; L3 D1: Age → Gender (Gender 89%, Lipstick 87%, Makeup 80%, Attractive 60%, Age 57%); L3 D4: Bangs (Bangs 68%); L3 D6: Body Side; L4 D1: Pose; L4 D5: Smile (Smile 81%, High Cheekbones 67%, Mouth Open 55%, Narrow Eyes 43%); L4 D6: Face Shape; L5 D2: Hair Color (Black Hair 59%, Blond Hair 44%, Gray Hair 33%); L5 D4: Lighting; L5 D6: Lipstick Color; L6 D1: Background Hue; L6 D4: Foreground Hue (Pale Skin 39%)]

Figure 4. Discovered semantic attributes at different layers for the CelebA dataset [23]. Traversing the coordinate value in [−4.5σ, 4.5σ], each dimension controls an attribute, colored in blue. The attributes colored in green are the most correlated CelebA attributes, and the bracketed value is the entropy coefficient: what fraction of the information of the CelebA attribute is contained in the corresponding dimension. "Li Dj" means the jth dimension of the ith layer. We only show the most meaningful dimensions; please refer to the appendix for all dimensions.
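The traversal protocol of Figure 4, varying one coordinate of z_i over [−4.5σ, 4.5σ] (i.e., ±4.5 prior standard deviations, since z_i ∼ N(0, I)) while holding the others fixed, can be sketched with a toy subspace. The sizes are illustrative, and the sketch inspects the subspace point φ_i directly rather than synthesizing images.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy subspace for one layer: d flattened features, q eigen-dimensions
# (illustrative sizes, not the paper's).
d, q = 16, 6
U, _ = np.linalg.qr(rng.standard_normal((d, q)))
L = np.diag(rng.standard_normal(q))
mu = rng.standard_normal(d)

def traverse(j, n_steps=7, limit=4.5):
    """Sweep coordinate j of z over [-limit, limit] with the others held at 0,
    returning the subspace points phi = U L z + mu (Eq. (1)) along the sweep."""
    points = []
    for v in np.linspace(-limit, limit, n_steps):
        z = np.zeros(q)
        z[j] = v
        points.append(U @ (L @ z) + mu)
    return np.stack(points)

row = traverse(j=0)  # one "row" of a traversal figure, as raw feature points
print(row.shape)     # (7, 16)
```

In the full model each point would be injected into the layer via Eq. (3) and decoded into an image; note that the midpoint of the sweep reduces to the subspace origin µ_i.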

[Figure 5 panels: L2 D5: Painting Style; L4 D2: Mouth Shape; L6 D1: Hue]

Figure 5. Interpretable dimensions of the FFHQ dataset [18] and the anime dataset [4].

4.1. Discovered Semantic Attributes

Visual Analysis Fig. 4 shows the semantic attributes learned by the subspace of different layers, where "Li Dj" means the jth dimension of the ith layer and a smaller layer index means a deeper layer. As shown, moving along an eigen-dimension (i.e., a basis vector of a subspace), the synthesized images consistently change in an interpretable meaning. Shallower layers tend to learn lower-level attributes, e.g., L6 and L5 learn color-related attributes such as "Hue" in L6 and "Hair Color" in L5. As the layer goes deeper, the generator discovers attributes with higher-level or more complicated concepts. For example, L4 and L3 learn geometric or structural attributes such as "Face Shape" in L4 and "Body Side" in L3. Deep layers tend to learn multiple attributes in one dimension, e.g., L1 D5 learns "Facial Hair" on the left axis but "Hat" on the right axis. Besides, entanglement of attributes is likely to happen in deep-layer dimensions, e.g., L2 D2 learns to simultaneously change "Hair Side" and "Background Texture Orientation", because a complex attribute composition might mislead the network into believing the whole is one high-level attribute. In summary, shallow layers learn low-level or simple attributes while deep layers learn high-level or complicated attributes. Entanglement might happen in some deep-layer dimensions, and this is one of our limitations. Nonetheless, the entanglement is interpretable, i.e., we can identify what attributes are entangled in a dimension. Moreover, our method can still discover well-disentangled dimensions that are highly consistent with visual concepts of humans. Fig. 5 shows additional results on the FFHQ dataset [18] and the Danbooru2019 Portraits dataset [4]. Please refer to the appendix for more results and more interpretable dimensions.

Identifying Well-Defined Attributes In the previous part, we visually identified semantic attributes for each dimension. In this part, we identify the attributes in a statistical manner, utilizing the 40 well-defined binary attributes in the CelebA dataset [23]. Specifically, we investigate the correlation between a dimension Z and a CelebA attribute Y in terms of the entropy coefficient (normalized mutual information), which represents what fraction of the information of Y is contained in Z:

    U(Y|Z) = I(Y;Z) / H(Y) = [H(Y) − H(Y|Z)] / H(Y) ∈ [0, 1],    (6)

where

    H(Y|Z) = ∫_Z p_Z(z) [ −p_Y|Z(y=1|z) ln p_Y|Z(y=1|z) − (1 − p_Y|Z(y=1|z)) ln(1 − p_Y|Z(y=1|z)) ] dz,    (7)

    H(Y) = −p_Y(y=1) ln p_Y(y=1) − (1 − p_Y(y=1)) ln(1 − p_Y(y=1)).    (8)

p_Y|Z(y=1|z) and p_Y(y=1) can be calculated by¹

    p_Y|Z(y=1|z) = ∫_X p_Y|X(y=1|x) p_G(x|z) dx,    (9)

    p_Y(y=1) = ∫_Z p_Y|Z(y=1|z) p_Z(z) dz,    (10)

where p_G(x|z) is the generator distribution, and p_Y|X(y=1|x) is the posterior distribution, which is approximated by a pre-trained attribute classifier on the CelebA dataset. We set

¹ y and z are conditionally independent given x, i.e., p_Y|X,Z(y=1|x,z) = p_Y|X(y=1|x).
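Once Z is discretized into bins with a per-bin estimate of p(y=1|z), Eqs. (6)-(8) reduce to weighted sums over the bins. Below is a small NumPy sketch of this computation; the per-bin values would in practice come from averaging classifier predictions over generated samples as in Eq. (9), and the example inputs here are illustrative.

```python
import numpy as np

def binary_entropy(p):
    """Entropy (nats) of a Bernoulli(p), with the 0 * ln 0 = 0 convention."""
    p = np.clip(p, 1e-12, 1.0 - 1e-12)
    return -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))

def entropy_coefficient(p_y_given_z, weights=None):
    """U(Y|Z) of Eq. (6) for a binary attribute Y and a discretized dimension Z.

    p_y_given_z[k] estimates p(y=1 | z in bin k), e.g. the average prediction of
    an attribute classifier over samples generated in bin k (Eq. (9)).
    weights[k] is the p_Z probability mass of bin k (uniform by default).
    """
    p_y_given_z = np.asarray(p_y_given_z, dtype=float)
    k = len(p_y_given_z)
    w = np.full(k, 1.0 / k) if weights is None else np.asarray(weights, dtype=float)
    h_y_given_z = float(np.sum(w * binary_entropy(p_y_given_z)))  # Eq. (7)
    p_y = float(np.sum(w * p_y_given_z))                          # Eq. (10)
    h_y = float(binary_entropy(p_y))                              # Eq. (8)
    return (h_y - h_y_given_z) / h_y                              # Eq. (6)

# A dimension that deterministically flips the attribute halfway through its
# range carries all of Y's information; a constant p(y=1|z) carries none.
print(entropy_coefficient([0.0] * 50 + [1.0] * 50))  # close to 1.0
print(entropy_coefficient([0.3] * 100))              # close to 0.0
```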
Figure 6. Qualitative comparison between (a) SeFa [36] from StyleGAN and (b) EigenGAN, on the dimensions for Pose, Smiling, Hair Color, and Hue. Both are trained on FFHQ-256 [18].

Table 1. Correlation between the discovered attributes and the CelebA attributes in terms of entropy coefficient. Each row denotes an attribute discovered by (a) SeFa and (b) EigenGAN, and each column denotes a CelebA attribute.

(a) SeFa from StyleGAN trained on FFHQ-256 [18]:

             Gender  Eyeglasses  Smiling  Black Hair
Gender         49%       14%        2%        4%
Eyeglasses      5%       49%        2%        0%
Smiling         1%        1%       52%        8%
Black Hair      1%        0%        1%       18%

(b) EigenGAN trained on FFHQ-256 [18]:

             Gender  Eyeglasses  Smiling  Black Hair
Gender         57%       14%       12%        2%
Eyeglasses      2%       33%        0%        1%
Smiling         1%        0%       55%        2%
Black Hair      0%        0%        0%       38%

p_Z(z) as U[−4.5, 4.5] and discretize it into 100 equal bins to approximate the integral ∫_Z · p_Z(z) dz in Eq. (7) and Eq. (10); and we sample 1000 x from the generator p_G(x|z) in each bin of z, then approximate the integral ∫_X · p_G(x|z) dx in Eq. (9) by averaging over the samples.

For each dimension in Fig. 4, the five most correlated CelebA attributes with entropy coefficient larger than 30% are shown (green text). As shown, the CelebA attributes identified via the entropy coefficient are highly consistent with our visual perception. Several dimensions have no correlated CelebA attributes just because the attributes represented by these dimensions are not included in CelebA, but these dimensions are still interpretable, e.g., L4 D1 learns "Pose", which is not a CelebA attribute. Several dimensions correlate to multiple CelebA attributes mainly because these CelebA attributes are themselves highly correlated, e.g., L4 D5 learns "Smile", and therefore it has a high entropy coefficient for "Smile"-correlated attributes: "High Cheekbones", "Mouth Open", and "Narrow Eyes". In conclusion, this experiment statistically verifies that EigenGAN can indeed discover interpretable dimensions controlling attributes which are highly consistent with human-defined ones (e.g., the CelebA attributes).

Comparison In this part, we compare our method to the state-of-the-art method SeFa [36], which identifies interpretable dimensions for well-trained GANs. Fig. 6 shows the qualitative comparison. As can be seen, both methods can achieve smooth and consistent change of the identified attributes. However, entanglement happens to some extent in both methods, e.g., the "Smiling" dimension also changes bangs in SeFa, and the "Hair Color" dimension also changes skin color in EigenGAN. This is because both of them are unsupervised methods, and it is difficult to precisely decouple all the attributes without any supervision. Table 1 shows the quantitative comparison of the correlation between the discovered attributes and the CelebA attributes, in terms of the entropy coefficient introduced in the previous part. As can be seen, the attributes discovered by both SeFa and our EigenGAN have high correlation to the corresponding CelebA attributes, demonstrating that both methods can indeed discover meaningful semantic attributes. Besides, our EigenGAN achieves comparable performance to the state-of-the-art SeFa.

4.2. Model Analysis

Effect of the Latent Variables EigenGAN contains two kinds of latent variables: 1) the layer-wise latent variables {z_i}_{i=1}^t, which are used as the subspace coordinates; and 2) the bottom noise ε, which compensates for the missing variations. In Fig. 7a, we respectively fix one of them and randomly sample the other to generate images. As can be seen, the layer-wise latent variables {z_i}_{i=1}^t dominate the major variations while the bottom noise ε captures subtle changes. That is to say, EigenGAN tends to put major variations into the layer-wise latent variables rather than the bottom noise used in typical GANs, but the bottom noise can still capture some subtle variations missed by the subspace models.

Effect of the Subspace Model We remove all the layer-wise subspace models to investigate their effect; instead, we directly add the layer-wise latent variables to the network features. As shown in Fig. 7b, without the subspace models, the layer-wise latent variables can only capture minor

Figure 7. Effect of the layer-wise latent variables (top) and the bottom noise (bottom). (a) With the subspace models (EigenGAN), major variations are captured by the layer-wise latent variables. (b) Without the subspace models (typical GANs), major variations are captured by the bottom noise.

Table 2. Basis similarity with PCA, P = N_d(0, I). Columns: data rank → subspace rank.

GAN Loss          5→1   5→3   10→1  10→3  10→5  20→1  20→5  20→10
KL-f-GAN [30]     1.00  0.98  0.99  0.90  0.93  0.97  0.78  0.79
Vanilla GAN [10]  1.00  0.99  1.00  0.90  0.94  0.98  0.77  0.81
WGAN [11]         0.99  0.98  1.00  0.89  0.92  0.99  0.76  0.83
LSGAN [25]        0.99  0.99  1.00  0.89  0.92  0.99  0.76  0.80
HingeGAN [28]     0.99  0.99  1.00  0.92  0.93  0.96  0.77  0.81

Table 3. Basis similarity with PCA, P = U_d(0, 1). Columns: data rank → subspace rank.

GAN Loss          5→1   5→3   10→1  10→3  10→5  20→1  20→5  20→10
KL-f-GAN [30]     0.96  0.98  0.97  0.89  0.93  0.89  0.72  0.82
Vanilla GAN [10]  0.97  0.97  0.97  0.92  0.92  0.92  0.76  0.84
WGAN [11]         0.98  0.97  0.98  0.93  0.94  0.98  0.77  0.84
LSGAN [25]        0.97  0.97  0.96  0.89  0.95  0.91  0.74  0.82
HingeGAN [28]     0.97  0.98  0.97  0.87  0.94  0.92  0.75  0.82

variations, which is completely opposite to the original setting in Fig. 7a. In conclusion, the subspace model is the key point that enables the generator to put major variations into the layer-wise variables, and therefore further lets the layer-wise variables capture different semantics at different layers.

Linear Case Study Sec. 3.2 theoretically proves that the linear case of EigenGAN can discover the principal components under maximum likelihood estimation (MLE). In this part, we validate this statement by applying adversarial training to the linear EigenGAN (we do not directly use MLE since we train the general EigenGAN with an adversarial loss rather than an MLE objective, and we keep this consistency between the linear and the general case). Specifically, we use the linear EigenGAN to learn a low-rank subspace model for toy datasets, then compare the basis vectors learned by our model and by PCA in terms of cosine similarity. The toy datasets are generated as follows,

    D_{A,b,P} = { y_i = A x_i + b | x_i ∼ P },    (11)

where A is a random transform matrix, b is a random translation vector, and P is a distribution selected from N_d(0, I) or U_d(0, 1). We test typical adversarial losses including Vanilla GAN [10], LSGAN [25], WGAN [11], HingeGAN [28], and f-GAN [30] with KL divergence (KL-f-GAN). Note that the objective of KL-f-GAN is theoretically equivalent to MLE; thus we are actually also testing MLE in the adversarial training manner.

Table 2 and Table 3 report the average similarity between the EigenGAN basis vectors and the PCA basis vectors, where each result is the average over 100 random toy datasets. As can be seen, when the data rank is no more than 10, the EigenGAN basis is highly similar to the PCA basis, with cosine similarity of about 0.9-1.0. When the data rank increases to 20, there are two situations: 1) if we only search for the single most principal basis vector (20→1), the vectors found by the linear EigenGAN and by PCA are still very close; 2) but if we want to find 5 or more basis vectors, the average similarity decreases to 0.7-0.8. We suppose the reason is that higher-dimension data leads to the curse of dimensionality and further results in learning instability. Besides, the various GAN losses have very consistent results, which shows the potential generalizability of our theoretical results in Sec. 3.2 from KL divergence (MLE) to more general statistical distances such as JS divergence and Wasserstein distance. In conclusion, we experimentally verify the theoretical statement that the linear EigenGAN can indeed discover principal components.

5. Limitations and Future Works

The discovered semantic attributes are not always the same at different training times, in two cases: 1) e.g., sometimes gender and pose are learned as separate dimensions but sometimes are entangled in one dimension at a deeper layer. This is because, without supervision, some complex attribute compositions might mislead the model into believing the whole is one higher-level attribute. 2) Sometimes the model can discover a specific attribute and sometimes it cannot, such as eyeglasses, mainly because these attributes appear less frequently in the dataset. Future works will study the layer-wise eigen-learning with better disentanglement techniques and more powerful GAN architectures.

8
References

[1] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. In Int. Conf. Mach. Learn., 2017. 3, 4
[2] David Bau, Bolei Zhou, Aditya Khosla, Aude Oliva, and Antonio Torralba. Network dissection: Quantifying interpretability of deep visual representations. In IEEE Conf. Comput. Vis. Pattern Recog., 2017. 1
[3] David Bau, Jun-Yan Zhu, Hendrik Strobelt, Bolei Zhou, Joshua B. Tenenbaum, William T. Freeman, and Antonio Torralba. GAN dissection: Visualizing and understanding generative adversarial networks. In Int. Conf. Learn. Represent., 2019. 1, 2
[4] Gwern Branwen, Anonymous, and Danbooru Community. Danbooru2019 portraits: A large-scale anime head illustration dataset, 2019. 4, 6
[5] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. In Int. Conf. Learn. Represent., 2018. 1, 3
[6] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Adv. Neural Inform. Process. Syst., 2016. 2, 3
[7] Emily Denton, Soumith Chintala, Arthur Szlam, and Rob Fergus. Deep generative image models using a Laplacian pyramid of adversarial networks. In Adv. Neural Inform. Process. Syst., 2015. 3
[8] Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. In Int. Conf. Learn. Represent., 2017. 3
[9] Lore Goetschalckx, Alex Andonian, Aude Oliva, and Phillip Isola. GANalyze: Toward visual definitions of cognitive image properties. In Int. Conf. Comput. Vis., 2019. 2
[10] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. In Adv. Neural Inform. Process. Syst., 2014. 1, 2, 4, 8
[11] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. Improved training of Wasserstein GANs. In Adv. Neural Inform. Process. Syst., 2017. 1, 8
[12] Erik Härkönen, Aaron Hertzmann, Jaakko Lehtinen, and Sylvain Paris. GANSpace: Discovering interpretable GAN controls. In Adv. Neural Inform. Process. Syst., 2020. 1, 2
[13] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In IEEE Conf. Comput. Vis. Pattern Recog., 2017. 3
[14] Insu Jeon, Wonkwang Lee, Myeongjang Pyeon, and Gunhee Kim. IB-GAN: Disentangled representation learning with information bottleneck GAN. In AAAI, 2021. 2
[15] Ian T Jolliffe. Principal component analysis. 1986. 2, 4
[16] Takuhiro Kaneko, Kaoru Hiramatsu, and Kunio Kashino. Generative attribute controller with conditional filtered generative adversarial networks. In IEEE Conf. Comput. Vis. Pattern Recog., 2017. 2
[17] Takuhiro Kaneko, Kaoru Hiramatsu, and Kunio Kashino. Generative adversarial image synthesis with decision tree latent controller. In IEEE Conf. Comput. Vis. Pattern Recog., 2018. 2
[18] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In IEEE Conf. Comput. Vis. Pattern Recog., 2019. 1, 3, 4, 6, 7
[19] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Int. Conf. Learn. Represent., 2015. 4
[20] Wonkwang Lee, Donggyun Kim, Seunghoon Hong, and Honglak Lee. High-fidelity synthesis with disentangled representation. In Eur. Conf. Comput. Vis., 2020. 2
[21] Zinan Lin, Kiran Thekumparampil, Giulia Fanti, and Sewoong Oh. InfoGAN-CR and ModelCentrality: Self-supervised model training and selection for disentangling GANs. In Int. Conf. Mach. Learn., 2020. 2
[22] Bingchen Liu, Yizhe Zhu, Zuohui Fu, Gerard de Melo, and Ahmed Elgammal. OOGAN: Disentangling GAN with one-hot sampling and orthogonal regularization. In AAAI, 2020. 2
[23] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Int. Conf. Comput. Vis., 2015. 4, 5, 6
[24] Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. Adversarial autoencoders. In Int. Conf. Learn. Represent., 2016. 3
[25] Xudong Mao, Qing Li, Haoran Xie, Raymond YK Lau, Zhen Wang, and Stephen Paul Smolley. Least squares generative adversarial networks. In Int. Conf. Comput. Vis., 2017. 1, 3, 4, 8
[26] Lars Mescheder, Andreas Geiger, and Sebastian Nowozin. Which training methods for GANs do actually converge? In Int. Conf. Mach. Learn., 2018. 3, 4
[27] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv:1411.1784, 2014. 3
[28] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In Int. Conf. Learn. Represent., 2018. 3, 4, 8
[29] Takeru Miyato and Masanori Koyama. cGANs with projection discriminator. In Int. Conf. Learn. Represent., 2018. 3
[30] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In Adv. Neural Inform. Process. Syst., 2016. 3, 8
[31] Augustus Odena, Christopher Olah, and Jonathon Shlens. Conditional image synthesis with auxiliary classifier GANs. In Int. Conf. Mach. Learn., 2017. 3
[32] Antoine Plumerault, Hervé Le Borgne, and Céline Hudelot. Controlling generative models with continuous factors of variations. In Int. Conf. Learn. Represent., 2019. 2
[33] Aditya Ramesh, Youngduck Choi, and Yann LeCun. A spectral regularizer for unsupervised disentanglement. arXiv:1812.01161, 2018. 2
[34] Kevin Roth, Aurelien Lucchi, Sebastian Nowozin, and Thomas Hofmann. Stabilizing training of generative adversarial networks through regularization. In Adv. Neural Inform. Process. Syst., 2017. 3
[35] Yujun Shen, Ceyuan Yang, Xiaoou Tang, and Bolei Zhou. InterFaceGAN: Interpreting the disentangled face representation learned by GANs. IEEE Trans. Pattern Anal. Mach. Intell., 2020. 2
[36] Yujun Shen and Bolei Zhou. Closed-form factorization of latent semantics in GANs. In IEEE Conf. Comput. Vis. Pattern Recog., 2021. 1, 2, 7
[37] Michael E Tipping and Christopher M Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3):611–622, 1999. 4
[38] Ceyuan Yang, Yujun Shen, and Bolei Zhou. Semantic hierarchy emerges in deep generative representations for scene synthesis. Int. J. Comput. Vis., 2021. 1, 2
[39] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In Eur. Conf. Comput. Vis., 2014. 1
[40] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N Metaxas. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In Int. Conf. Comput. Vis., 2017. 3
[41] Bolei Zhou, Aditya Khosla, Àgata Lapedriza, Aude Oliva, and Antonio Torralba. Object detectors emerge in deep scene CNNs. In Int. Conf. Learn. Represent., 2015. 1
[42] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Int. Conf. Comput. Vis., 2017. 3

Appendix A for EigenGAN

In Sec. 1-7, we derive the analytical maximum likelihood estimation (MLE) result of the linear case of the proposed EigenGAN. Sec. 8 summarizes the MLE result and discusses the relation among the linear EigenGAN, Principal Component Analysis (PCA) [1], and Probabilistic PCA [2].

1. The Likelihood

The linear EigenGAN relates a d-dimension observation vector x to a corresponding q-dimension (q ≤ d) latent variable z by an affine transform UL and a translation µ, which is formulated as

x = ULz + µ + σε, (1)

with constraints:

z ∼ Nq(0, I), (2)
ε ∼ Nd(0, I), independent of z, (3)
UᵀU = I, U is of size d × q, (4)
L is a q × q diagonal matrix. (5)
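The generative model of Eq. (1)-(5) can be simulated directly. The sketch below draws samples from the linear model and checks that the sample covariance approaches the model covariance C = UL²Uᵀ + σ²I of Eq. (7); all dimensions and parameter values are illustrative assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
d, q, sigma = 8, 3, 0.1

# Random orthonormal basis U (d x q) via QR decomposition, so U^T U = I (Eq. (4)).
U, _ = np.linalg.qr(rng.standard_normal((d, q)))
L = np.diag([3.0, 2.0, 1.0])           # q x q diagonal "importance" matrix (Eq. (5))
mu = rng.standard_normal(d)            # translation

def sample(n):
    z = rng.standard_normal((n, q))    # z ~ N(0, I_q), Eq. (2)
    eps = rng.standard_normal((n, d))  # eps ~ N(0, I_d), independent of z, Eq. (3)
    return z @ (U @ L).T + mu + sigma * eps  # x = U L z + mu + sigma * eps, Eq. (1)

X = sample(100_000)
# The model covariance is C = U L^2 U^T + sigma^2 I (Eq. (7)); a large sample's
# covariance should be close to it.
C = U @ L**2 @ U.T + sigma**2 * np.eye(d)
print(np.abs(np.cov(X, rowvar=False) - C).max())  # small
```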
The noise vector ε in Eq. (1) is introduced to compensate for the missing energy (missing rank), since the rank of the latent variables is no more than the rank of the observation (q ≤ d). According to Eq. (1)-(3), the probability density function of x is

px(x) = (1 / √((2π)^d |C|)) exp( −(1/2)(x − µ)ᵀ C⁻¹ (x − µ) ), (6)
where

C = UL²Uᵀ + σ²I, (7)
C⁻¹ = UMUᵀ + σ⁻²I, (8)
M = (L² + σ²I)⁻¹ − σ⁻²I. (9)
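The inverse-covariance identity of Eq. (8)-(9) (a Woodbury-style formula that holds because U has orthonormal columns) can be verified numerically; the sizes below are illustrative.

```python
import numpy as np

# Check: (U L^2 U^T + sigma^2 I)^(-1) = U M U^T + sigma^(-2) I,
# with M = (L^2 + sigma^2 I)^(-1) - sigma^(-2) I.
rng = np.random.default_rng(1)
d, q, sigma = 6, 2, 0.5
U, _ = np.linalg.qr(rng.standard_normal((d, q)))  # orthonormal columns
L2 = np.diag([4.0, 1.0])                          # this is L^2

C = U @ L2 @ U.T + sigma**2 * np.eye(d)                        # Eq. (7)
M = np.linalg.inv(L2 + sigma**2 * np.eye(q)) - np.eye(q) / sigma**2  # Eq. (9)
C_inv = U @ M @ U.T + np.eye(d) / sigma**2                     # Eq. (8)

print(np.abs(C @ C_inv - np.eye(d)).max())  # ~0 up to floating point
```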

1
Then for n observations {xi}_{i=1}^{n}, the log-likelihood is

L1 = −(n/2){ d log 2π + log|C| + (1/n) ∑_{i=1}^{n} (xi − µ)ᵀ C⁻¹ (xi − µ) }. (10)

According to Eq. (7), only the square of L affects the probability density function; therefore we can assume the elements of L to be non-negative. Further, for convenience of the following analysis, without loss of generality, we organize L by grouping and sorting its diagonal elements:

L = diag(l1 Id1, l2 Id2, · · · , lp Idp), (11)

where l1 > l2 > · · · > lp ≥ 0; Idj denotes a dj × dj identity matrix, dj ≠ 0, and d1 + d2 + · · · + dp = q. According to Eq. (9) and (11), M also has a grouped form:

M = diag( ((l1² + σ²)⁻¹ − σ⁻²) Id1, · · · , ((lp² + σ²)⁻¹ − σ⁻²) Idp ). (12)

And we can also define a block form of U accordingly:

U = [U1, U2, · · · , Up], (13)

where Ui is of size d × di.

2. Determination of µ

The partial derivative of the log-likelihood L1 (10) with respect to µ is

∂L1/∂µ = ∑_{i=1}^{n} C⁻¹(xi − µ). (14)

Then the stationary point is

µ = (1/n) ∑_{i=1}^{n} xi = x̄. (15)

Since L1 is a concave function of µ, the above stationary point is also the global maximum point.
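The claim that the sample mean maximizes the likelihood can be sanity-checked numerically: for any fixed positive-definite C, L1 evaluated at x̄ beats L1 at perturbed locations. The data and covariance below are arbitrary toy choices.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 4, 200
X = rng.standard_normal((n, d)) @ rng.standard_normal((d, d)) + 5.0
A = rng.standard_normal((d, d))
C = A @ A.T + np.eye(d)  # an arbitrary positive-definite covariance

def L1(mu):
    # Gaussian log-likelihood of Eq. (10) for a candidate mean mu.
    diff = X - mu
    quad = np.einsum('ij,jk,ik->', diff, np.linalg.inv(C), diff) / n
    return -n / 2 * (d * np.log(2 * np.pi) + np.log(np.linalg.det(C)) + quad)

x_bar = X.mean(axis=0)  # Eq. (15)
perturbed = [x_bar + 0.1 * rng.standard_normal(d) for _ in range(20)]
print(all(L1(x_bar) > L1(m) for m in perturbed))  # True
```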

2
3. Determination of U: Part (1)

Substituting Eq. (15) into the log-likelihood L1 (10), we obtain a new objective:

L2 = −(n/2){ d log 2π + log|C| + tr(SC⁻¹) }, (16)

where

S = (1/n) ∑_{i=1}^{n} (xi − x̄)(xi − x̄)ᵀ, (17)

i.e., the covariance matrix of the data. According to Eq. (4), the maximization of L2 (16) with respect to U is the constrained optimization

max_U L2   subject to   UᵀU = I.
Introducing the Lagrange multiplier H, the Lagrangian function is

L_U = L2 + tr(Hᵀ(UᵀU − I))
    = −(n/2){ d log 2π + log|C| + tr(SC⁻¹) } + tr(Hᵀ(UᵀU − I)). (18)

Then the partial derivative of L_U with respect to U is

∂L_U/∂U = −n{ U[ (L² + σ²I)⁻¹L² + (H + Hᵀ)/2 ] + SUM }. (19)
At the stationary point,

SUM = −U[ (L² + σ²I)⁻¹L² + (H + Hᵀ)/2 ]. (20)

Left multiplying the above equation by Uᵀ and using UᵀU = I, we obtain

UᵀSUM = −(L² + σ²I)⁻¹L² − (H + Hᵀ)/2. (21)
The right-hand side of the above equation is a symmetric matrix; therefore the left-hand side UᵀSUM is also symmetric. Furthermore, since both UᵀSU and M are also symmetric, to satisfy the symmetry of UᵀSUM, according to the form of M in Eq. (12), UᵀSU must have a similar block diagonal form:

UᵀSU = diag(A1, A2, · · · , Ap) (22)
     = diag(Q1ᵀΛ1Q1, Q2ᵀΛ2Q2, · · · , QpᵀΛpQp), (23)

where Aj is a dj × dj symmetric matrix, and QjᵀΛjQj is the eigendecomposition of Aj. Using Eq. (20), (21), and (23), we can derive

SUM = U · diag(Q1ᵀΛ1Q1, Q2ᵀΛ2Q2, · · · , QpᵀΛpQp) · M. (24)
Substituting Eq. (12) and (13) into Eq. (24), we obtain

((lj² + σ²)⁻¹ − σ⁻²) SUjQjᵀ = ((lj² + σ²)⁻¹ − σ⁻²) UjQjᵀΛj (25)
⟹ SUjQjᵀ = UjQjᵀΛj,  j = 1, 2, · · · , p′, (26)

where

p′ = p if lp > 0, and p′ = p − 1 if lp = 0. (27)

Eq. (26) tells us that the columns of UjQjᵀ are eigenvectors of S, and the diagonal elements of Λj are the corresponding eigenvalues. Further, since (UjQjᵀ)ᵀ(UjQjᵀ) = I, these eigenvectors are orthonormal. Let Vj = UjQjᵀ; we obtain the stationary point:

Uj = VjQj,  j = 1, 2, · · · , p′, (28)

where the columns of Vj are orthonormal eigenvectors of S with corresponding eigenvalues Λj = diag(λj1, λj2, · · · , λjdj), and Qj is an arbitrary orthogonal matrix. Note that if p′ = p − 1, i.e., lp = 0, then Up is an arbitrary matrix.
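The structure of Eq. (26)-(28) can be verified numerically: if the columns of V are orthonormal eigenvectors of S, then U = VQ still has orthonormal columns and satisfies SUQᵀ = UQᵀΛ for any orthogonal Q. The matrix sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
d, dj = 7, 3
A = rng.standard_normal((d, d))
S = A @ A.T                                      # a symmetric PSD "covariance"
eigval, eigvec = np.linalg.eigh(S)               # ascending eigenvalues
V, Lam = eigvec[:, -dj:], np.diag(eigval[-dj:])  # top-dj eigenpairs

Q, _ = np.linalg.qr(rng.standard_normal((dj, dj)))  # arbitrary orthogonal Q
U = V @ Q                                           # Eq. (28)

print(np.abs(U.T @ U - np.eye(dj)).max())        # ~0: orthonormal columns
print(np.abs(S @ U @ Q.T - U @ Q.T @ Lam).max()) # ~0: Eq. (26) holds
```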

4. Determination of L = diag(l1 Id1, l2 Id2, · · · , lp Idp)

Substituting Eq. (26), Eq. (7)-(8), and Eq. (11)-(12) into L2 (16) and after some manipulation, a new objective is derived:

L3 = −(n/2){ d log 2π + ∑_{j=1}^{p′} [ dj log(lj² + σ²) + (lj² + σ²)⁻¹ tr(Λj) ] + (d − q′) log σ² + σ⁻² ( tr(S) − ∑_{j=1}^{p′} tr(Λj) ) }, (29)

where

q′ = ∑_{j=1}^{p′} dj. (30)

Then the partial derivative of L3 with respect to lj is

∂L3/∂lj = −n{ dj lj / (lj² + σ²) − lj tr(Λj) / (lj² + σ²)² },  j = 1, 2, · · · , p′, (31)

and the stationary point is

lj² = tr(Λj)/dj − σ²,  j = 1, 2, · · · , p′. (32)
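The stationary point of Eq. (32) can be checked on a grid: the lj-dependent part of the braces in Eq. (29) is f(lj) = dj log(lj² + σ²) + tr(Λj)/(lj² + σ²), and because of the leading −n/2, maximizing L3 means minimizing f. The numbers below are arbitrary toy values.

```python
import numpy as np

dj, tr_lam, sigma = 3, 6.0, 0.5

def f(l):
    # lj-dependent terms of Eq. (29); minimizing f maximizes L3.
    return dj * np.log(l**2 + sigma**2) + tr_lam / (l**2 + sigma**2)

l_star = np.sqrt(tr_lam / dj - sigma**2)  # Eq. (32)
grid = np.linspace(0.01, 5.0, 2000)
print(f(l_star) <= f(grid).min() + 1e-9)  # True: f is minimized at l_star
```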

5. Determination of σ

Substituting Eq. (32) into L3 (29), we obtain a new objective:

L4 = −(n/2){ d log 2π + ∑_{j=1}^{p′} [ dj log(tr(Λj)/dj) + dj ] + (d − q′) log σ² + σ⁻² ( tr(S) − ∑_{j=1}^{p′} tr(Λj) ) }. (33)

Then the partial derivative of L4 with respect to σ is

∂L4/∂σ = −n{ (d − q′)/σ − (1/σ³) ( tr(S) − ∑_{j=1}^{p′} tr(Λj) ) }, (34)

and the stationary point is

σ² = ( tr(S) − ∑_{j=1}^{p′} tr(Λj) ) / (d − q′). (35)

6. Determination of Λj

Substituting Eq. (35) into L4 (33), we obtain a new objective:

L5 = −(n/2){ d log 2π + ∑_{j=1}^{p′} dj log(tr(Λj)/dj) + (d − q′) log[ ( tr(S) − ∑_{j=1}^{p′} tr(Λj) ) / (d − q′) ] + d }. (36)

 
According to Sec. 3, the diagonal elements of Λj = diag(λj1, λj2, · · · , λjdj), j = 1, · · · , p′, are eigenvalues of S; therefore the problem here is to select the suitable eigenvalues from S and separate them into different Λj's to maximize L5 (36). Using Jensen's inequality:

log( tr(Λj)/dj ) = log( (λj1 + · · · + λjdj)/dj ) ≥ ( log λj1 + · · · + log λjdj )/dj, (37)

and the equality holds if and only if λj1 = · · · = λjdj. That means, no matter how we select the eigenvalues, only grouping them by the same values can maximize L5 (36). Therefore the optimal grouping is

Λj = diag(λj1, λj2, · · · , λjdj) = diag(λj, λj, · · · , λj) = λj Idj, (38)

where λj is an eigenvalue of S whose algebraic multiplicity is at least dj, and without loss of generality, we can assume λ1 > λ2 > · · · > λp′.
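The Jensen step of Eq. (37) is easy to confirm numerically: the log of a mean is at least the mean of the logs, with equality exactly when all values coincide.

```python
import numpy as np

rng = np.random.default_rng(4)
vals = rng.uniform(0.5, 5.0, size=6)  # distinct positive "eigenvalues"
lhs = np.log(vals.mean())             # log of the mean
rhs = np.log(vals).mean()             # mean of the logs
print(lhs > rhs)                      # True for unequal values

equal = np.full(6, 2.0)               # equal values: equality holds
print(np.isclose(np.log(equal.mean()), np.log(equal).mean()))  # True
```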

Now, the remaining problem is to select the eigenvalues λj, j = 1, · · · , p′. Substituting Eq. (38) into L5 (36), and using the fact that tr(log S) equals the sum of the logs of all eigenvalues of S, we obtain

L6 = −(n/2){ d log 2π + ∑_{j=1}^{p′} dj log λj + (d − q′) log[ ( tr(S) − ∑_{j=1}^{p′} tr(Λj) ) / (d − q′) ] + d }
   = −(n/2){ d log 2π + tr(log S) − ∑_{i=q′+1}^{d} log γi + (d − q′) log[ ( ∑_{i=q′+1}^{d} γi ) / (d − q′) ] + d }
   = −(n/2){ d log 2π + tr(log S) − (d − q′)[ ( ∑_{i=q′+1}^{d} log γi ) / (d − q′) − log( ( ∑_{i=q′+1}^{d} γi ) / (d − q′) ) ] + d }, (39)

where q′ = d1 + d2 + · · · + dp′ and γi, i = q′ + 1, · · · , d, are the remaining eigenvalues that have not been selected. Maximizing L6 (39) requires maximizing

F = ( ∑_{i=q′+1}^{d} log γi ) / (d − q′) − log( ( ∑_{i=q′+1}^{d} γi ) / (d − q′) ), (40)

which only requires γi, i = q′ + 1, · · · , d, to be adjacent in the ordered eigenvalues. However, according to Eq. (32), we need λj > σ², j = 1, · · · , p′, and then from Eq. (35), the only choice to maximize F is to let γi, i = q′ + 1, · · · , d, be the d − q′ smallest eigenvalues. Meanwhile, a larger q′ leads to a larger F; therefore,

p′ = p, (41)
q′ = q = ∑_{j=1}^{p} dj. (42)

7. Determination of U: Part (2)

According to Eq. (28) and Eq. (38), the columns of Vj are orthonormal eigenvectors of S corresponding to the same eigenvalue λj. Since Qj is an arbitrary orthogonal matrix, the columns of Uj = VjQj are still orthonormal eigenvectors corresponding to the eigenvalue λj.

8. Summary and Discussion

Summarizing the above analysis (Eq. (15), (32), (35), and Sec. 7), the global maximum of the likelihood with respect to the model parameters is

µ = (1/n) ∑_{i=1}^{n} xi, (43)
σ² = ( tr(S) − tr(Λ) ) / (d − q), (44)
L² = Λ − σ²I, (45)
U = [u1, · · · , uq], (46)

where the elements of the diagonal matrix Λ are the q largest eigenvalues of the data covariance S, and u1, · · · , uq are the principal q eigenvectors corresponding to Λ. As can be seen, under maximum likelihood estimation, the basis vectors U of our linear model are exactly the same as those learned by PCA [1]. Moreover, the diagonal elements of L represent the "importance" or "energy" of the corresponding basis vectors, and from Eq. (45), when σ → 0, the elements of L² approach the q largest eigenvalues. Besides, as shown in Eq. (44), the energy σ² of the noise is the average of the discarded eigenvalues, which exactly compensates for the energy missed by the subspace model.
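The closed-form solution of Eq. (43)-(46) can be built directly from an eigendecomposition of the sample covariance and verified against it; the sizes below are illustrative toy choices.

```python
import numpy as np

rng = np.random.default_rng(5)
d, q, n = 10, 4, 5000
X = rng.standard_normal((n, d)) @ rng.standard_normal((d, d))

mu = X.mean(axis=0)                              # Eq. (43)
S = (X - mu).T @ (X - mu) / n                    # data covariance, Eq. (17)
eigval, eigvec = np.linalg.eigh(S)               # ascending order
eigval, eigvec = eigval[::-1], eigvec[:, ::-1]   # reorder to descending

sigma2 = eigval[q:].sum() / (d - q)              # Eq. (44): mean of discarded
L2 = np.diag(eigval[:q]) - sigma2 * np.eye(q)    # Eq. (45): this is L^2
U = eigvec[:, :q]                                # Eq. (46): principal eigvecs (PCA)

# The model covariance U L^2 U^T + sigma^2 I matches S on the principal subspace:
C = U @ L2 @ U.T + sigma2 * np.eye(d)
print(np.abs(U.T @ C @ U - np.diag(eigval[:q])).max())  # ~0
```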

Our model can be viewed as a constrained case of Probabilistic PCA (PPCA) [2]:

x = Wz + µ + σε, (47)

whose maximum likelihood estimation is

W = V (Λ − σ²I)^{1/2} Q, (48)

where the columns of V are the principal eigenvectors of the data covariance, Λ is a diagonal matrix whose elements are the corresponding eigenvalues, and Q is an arbitrary orthogonal matrix. Therefore, the MLE result of PPCA is nondeterministic due to the arbitrary Q. Although W contains the information of the principal eigenvectors, the columns of W itself do not exhibit explicit orthogonality. Our model (1) restricts W of PPCA (47) to the special form UL, where U has orthonormal columns and L is a diagonal matrix. In consequence, the MLE result of our model is deterministic (Eq. (43)-(46)). Moreover, our model can build a linear subspace with the principal eigenvectors explicitly as the basis vectors, which is very suitable for extension to the nonlinear case to learn layer-wise interpretable dimensions, as introduced in the main text.
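The rotational ambiguity of the PPCA solution (Eq. (48)) can be illustrated numerically: different orthogonal Q give different W but an identical model covariance WWᵀ + σ²I, which is exactly the nondeterminism the UL parameterization removes. The parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)
d, q, sigma2 = 8, 3, 0.2
V, _ = np.linalg.qr(rng.standard_normal((d, q)))  # "principal eigenvectors"
Lam = np.diag([5.0, 3.0, 1.0])                    # "principal eigenvalues"

def W_of(Q):
    # PPCA MLE of Eq. (48): W = V (Lambda - sigma^2 I)^(1/2) Q.
    return V @ np.sqrt(Lam - sigma2 * np.eye(q)) @ Q

Q1 = np.eye(q)
Q2, _ = np.linalg.qr(rng.standard_normal((q, q)))  # another orthogonal Q

C1 = W_of(Q1) @ W_of(Q1).T + sigma2 * np.eye(d)
C2 = W_of(Q2) @ W_of(Q2).T + sigma2 * np.eye(d)
print(np.abs(C1 - C2).max())                       # ~0: same model covariance
print(np.abs(W_of(Q1) - W_of(Q2)).max() > 1e-3)    # True: different W
```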

References
[1] Ian T Jolliffe. Principal component analysis. 1986. 1, 8
[2] Michael E Tipping and Christopher M Bishop. Probabilistic principal component analysis. Journal
of the Royal Statistical Society: Series B (Statistical Methodology), 61(3):611–622, 1999. 1, 8

Appendix B for EigenGAN

1. All Interpretable Dimensions of Each Layer

Fig. 1 to Fig. 5 show all interpretable dimensions of each generator layer learnt by the proposed EigenGAN on the CelebA dataset [3], Fig. 6 to Fig. 9 show all dimensions learnt on the Anime dataset [1], and Fig. 10 to Fig. 14 show all dimensions learnt on the FFHQ dataset [2]. We traverse each dimension from −4.5σ to 4.5σ and omit the dimensions with almost no change. "None" in these figures means the corresponding part of the dimension is uninterpretable or difficult to assign an attribute name. The smaller the index, the deeper the layer.

2. Effect of the Importance Matrix Li

In Sec. 3.1 of the main text, we introduce the importance matrix Li = diag(li1, . . . , liq), with lij deciding the importance or energy of the basis vector uij. In Fig. 15, we compare dimensions with different importance values. As can be seen, dimensions with a large importance value control large variations, while dimensions with a small importance value control only slight variations. We cannot judge whether the larger of two sufficiently large importance values controls a larger variation, because it is hard to quantify the semantic variations. However, it is certain that dimensions with a small importance value can only control slight changes or even no change. Therefore, to some extent, the importance matrix can be used to select the learned semantic dimensions by discarding dimensions with small importance values. Note that the values of Li across different layers (i = 1, 2, · · · , 6) cannot be directly compared since they belong to different subspaces.

3. Network Architectures

The architectures of the generator and the discriminator of EigenGAN are shown in Fig. 16.

References

[1] Gwern Branwen, Anonymous, and Danbooru Community. Danbooru2019 portraits: A large-scale anime head illustration dataset, 2019. 1, 7, 8, 9, 10
[2] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In IEEE Conf. Comput. Vis. Pattern Recog., 2019. 1, 11, 12, 13, 14, 15
[3] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Int. Conf. Comput. Vis., 2015. 1, 2, 3, 4, 5, 6
[4] Andrew L Maas, Awni Y Hannun, and Andrew Y Ng. Rectifier nonlinearities improve neural network acoustic models. In Int. Conf. Mach. Learn., 2013. 17
Figure 1. Interpretable dimensions of layer 6 (the shallowest) for CelebA dataset [3], each traversed from −4.5σ to 4.5σ: D1 Hue (Blue-Pink); D2 Hue (Blue-Orange); D3 Hue (Yellow-Purple); D4 Hue (Orange-White); D5 Hue (Green-Red).
Figure 2. Interpretable dimensions of layer 5 (left) and layer 4 (right) for CelebA dataset [3]. Layer 5: D1 Background (Dark-Bright); D2 Hair Color; D3 Hair Color; D4 Lighting; D5 Gaze; D6 Lipstick Color. Layer 4: D1 Pose (Yaw); D2 Background (Dark-Bright); D3 Pose (Pitch); D4 Lighting; D5 Smiling; D6 Face Shape.
Figure 3. Interpretable dimensions of layer 3 for CelebA dataset [3]: D1 Age / Gender & Hair Color; D2 Artifact on Neck / Age; D3 Race / Bangs; D4 Bangs; D5 Hair Style; D6 Body Side.
Figure 4. Interpretable dimensions of layer 2 for CelebA dataset [3]: D1 Hair Length / Distortion; D2 Hair Side & Background / Texture Orientation; D3 Top Bar Artifact / Top Edge-padding Artifact; D4 Top Edge-padding Artifact / None; D5 Body Pose. "None" means uninterpretable.
Figure 5. Interpretable dimensions of layer 1 (the deepest) for CelebA dataset [3]: D1 Hair Style (Wavy-Straight) & Female (←) / Blurry; D2 Distortion / None / Male (→) & Facial Hair & Hair or Hat; D3 Distortion / None / Blurry; D4 Hair Style (Wavy-Straight) / Distortion; D5 Facial Hair / Hat. "None" means uninterpretable.
Figure 6. Interpretable dimensions of layer 6 (left, the shallowest) and layer 5 (right) for Anime dataset [1]. Layer 6: D1 Hue (Orange-Blue); D2 Hue (Green-Purple); D3 Hue (Green-Purple). Layer 5: D1 Hair Color (Light-Dark); D2 Hair Color (Dark-Light); D3 Hair Color (Dark-Light).
Figure 7. Interpretable dimensions of layer 4 (left) and layer 3 (right) for Anime dataset [1]. Layer 4: D1 Flush & Eye Color (Red-Blue); D2 Mouth Shape; D3 Eye Color (Light-Dark); D4 Skin Color; D5 Eye Color (Blue-Red). Layer 3: D1 Pose (Roll); D2 Pose (Yaw); D3 Zoom & Rotate; D4 Pose (Roll); D5 Pose (Pitch).
Figure 8. Interpretable dimensions of layer 2 for Anime dataset [1]: D1 None / Headband or Hat; D2 Painting Style & Background (Blue, →); D3 Painting Style & Background (Bright-Dark); D4 Matureness; D5 Painting Style; D6 Hair Length. "None" means uninterpretable.
Figure 9. Interpretable dimensions of layer 1 (the deepest) for Anime dataset [1]: D1 Head Size (Main) & Others; D2 Head Size & Painting Style; D3 Hair Style & Hair Color (Light-Dark); D4 Hair Style (Wavy-Straight) / Distortion; D5 Head Accessories / None / Distortion. "None" means uninterpretable.
Figure 10. Interpretable dimensions of layer 6 (the shallowest) for FFHQ dataset [2]: D1 Hue (Blue-Yellow); D2 Hue (Red-White); D3 Hue (Red-Green).
Figure 11. Interpretable dimensions of layer 5 (left) and layer 4 (right) for FFHQ dataset [2]. Layer 5: D1 Lighting (Brightness); D2 Gaze; D3 Gaze & Eyebrows; D4 Squinting. Layer 4: D1 Lighting (Upper Side); D2 Background (Dark-Bright); D3 Lighting (Left-Right); D4 Race; D5 Race; D6 Background (Symmetric).
Figure 12. Interpretable dimensions of layer 3 for FFHQ dataset [2]: D1 Pose (Yaw); D2 Gender & Hair Color; D3 Background; D4 Smiling & Face Size; D5 Smiling; D6 Smiling & Face Shape.
Figure 13. Interpretable dimensions of layer 2 for FFHQ dataset [2]: D1 Age & Male (→); D2 Pose (Pitch) & Female (←); D3 Age; D4 Hair Style (Wavy-Straight) & Face Shape; D5 Hair Volume; D6 Body Side.
Figure 14. Interpretable dimensions of layer 1 (the deepest) for FFHQ dataset [2]: D1 Gender; D2 Race / Bangs & Female (→); D3 Female (←) / Eyeglasses & Age; D4 Artifact on Head / None; D5 Eyeglasses & Female (←) / None; D6 Eyeglasses. "None" means uninterpretable.
Figure 15. Effect of the importance values Li = diag(li1, . . . , liq). Dimensions with a large importance value control large variations, while dimensions with a small importance value control small variations.
Figure 16. Network architectures of EigenGAN: (a) the generator and (b) the discriminator. Conv(d, k, s) and DeConv(d, k, s) denote a convolutional layer and a transposed convolutional layer with d output dimensions, kernel size k, and stride s. FC(d) denotes a fully connected layer with d output dimensions. LReLU denotes Leaky ReLU [4].
