
A Meta-Transfer Objective for Learning to Disentangle

Causal Mechanisms

Krishna Prasad Neupane

kpn3569@rit.edu

July 11, 2020



Observational and Causal Concepts

Key point:
An observational or associational concept is any relationship that can be
defined in terms of a joint distribution of observed variables, and a causal
concept is any relationship that cannot be defined from the distribution
alone. [a]

[a] https://ftp.cs.ucla.edu/pub/stat_ser/r350.pdf

Observational p(y|x): What is the distribution of Y given that we observe
that variable X takes value x? It is a conditional distribution, which can
be computed as p(y|x) = p(x, y) / p(x).

Interventional p(y|do(x)): What is the distribution of Y if we set the value
of X to x? This describes the distribution of Y we would observe if we
intervened in the data-generating process by artificially forcing the
variable X to take value x, but otherwise simulating the rest of the
variables according to the original process that generated the data
(a toy simulation contrasting the two follows below). [1]

[1] https://www.inference.vc
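As a toy illustration (not from the slides), the following sketch contrasts the two quantities in a made-up linear-Gaussian SCM in which Y causes X, echoing one of the scripts from the blog example: conditioning on X = 3 selects samples and shifts our belief about Y, while do(X = 3) re-runs the generative process with X forced to 3 and leaves Y untouched.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

def sample(do_x=None):
    """Toy SCM in which Y causes X:  Y := N(0, 1),  X := Y + N(0, 1)."""
    y = rng.normal(0.0, 1.0, N)
    x = (y + rng.normal(0.0, 1.0, N)) if do_x is None else np.full(N, do_x)
    return x, y

# Observational p(y | x ~ 3): select samples whose observed X is near 3.
x, y = sample()
y_cond = y[np.abs(x - 3.0) < 0.1]

# Interventional p(y | do(x=3)): force X = 3, leave the mechanism for Y untouched.
_, y_do = sample(do_x=3.0)

print(y_cond.mean())  # ~ 1.5: observing X = 3 is evidence about its cause Y
print(y_do.mean())    # ~ 0.0: forcing X has no effect on its cause Y
```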
Example

Figure: Three different scripts and their corresponding joint distributions. [2]

[2] https://www.inference.vc
Example

Figure: Joint distribution plot after intervention and after conditioning at x = 3. [3]

Note: The conditional distribution p(y|x) and the interventional (causal)
distribution p(y|do(x)) are distinct.

[3] https://www.inference.vc
Summary of the Paper

To meta-learn causal structures based on how fast a learner adapts to new
distributions arising from sparse distributional changes. (Meta-structure)
Based on the assumption of a small change in the right knowledge
representation space, the paper defines a meta-learning objective that
measures the speed of adaptation. (Out-of-distribution)
By optimizing for fast transfer and adaptation, the paper recovers a good
approximation of the true causal decomposition into independent
mechanisms. (Disentangle)

Main Idea
If we have the right knowledge representation, then we should get fast
adaptation to the transfer distribution when starting from a model that is
well trained on the training distribution.



Which is Cause and Which is Effect?
The problem of determining whether variable A causes variable B or vice
versa.
Compare the two hypotheses (A → B vs. B → A) in terms of how fast the
corresponding models adapt to a transfer distribution (a sketch of this
comparison follows the figure below).

Figure: We see that the correct causal model adapts faster (smaller regret),
and that the most informative part of the trajectory (where the two models
generalize the most differently) is in the first 10-20 examples.
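A minimal sketch (not the authors' code) of this comparison, assuming discrete variables with N categories and the joint modelled either as p(A)p(B|A) or as p(B)p(A|B). Pretraining on the training distribution is omitted and the transfer batches below are random placeholders; only the few-step adaptation and online-likelihood scoring is shown.

```python
import torch
import torch.nn.functional as F

N = 10                     # number of categories per variable (assumed for the sketch)
torch.manual_seed(0)

class Hypothesis(torch.nn.Module):
    """One causal hypothesis: categorical tables for p(cause) and p(effect | cause)."""
    def __init__(self):
        super().__init__()
        self.marg = torch.nn.Parameter(torch.zeros(N))     # logits of p(cause)
        self.cond = torch.nn.Parameter(torch.zeros(N, N))  # logits of p(effect | cause)

    def log_prob(self, cause, effect):
        lp_c = F.log_softmax(self.marg, dim=0)[cause]
        lp_e = F.log_softmax(self.cond, dim=1)[cause, effect]
        return (lp_c + lp_e).sum()

def adapt_and_score(model, batches, cause_first, lr=0.1):
    """Adapt with a few gradient steps on transfer data; return the accumulated
    online log-likelihood (higher = faster adaptation)."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    online_ll = 0.0
    for a, b in batches:
        cause, effect = (a, b) if cause_first else (b, a)
        ll = model.log_prob(cause, effect)
        online_ll += ll.item()
        opt.zero_grad()
        (-ll).backward()
        opt.step()
    return online_ll

# Placeholder transfer batches of (A, B) pairs; in the paper they come from an
# intervened SCM, and both models are first pretrained on the training distribution.
batches = [(torch.randint(N, (64,)), torch.randint(N, (64,))) for _ in range(20)]
ll_ab = adapt_and_score(Hypothesis(), batches, cause_first=True)    # hypothesis A -> B
ll_ba = adapt_and_score(Hypothesis(), batches, cause_first=False)   # hypothesis B -> A
print("prefer A -> B" if ll_ab > ll_ba else "prefer B -> A")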
Experiments on Adaptation to the Transfer Distribution
In this section, the paper uses only a few gradient updates on a small set
of data coming from a different but related distribution.
Experimental comparison of the learning curves of the correct vs. incorrect
causal model.
Adaptation with only a few gradient steps on data coming from a different,
but related, transfer distribution is critical for obtaining a signal that
the meta-learning algorithm can leverage.

Figure: Train (red) and transfer (green and blue) samples from an SCM for
the joint distribution of A and B.
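A sketch of how such train and transfer samples could be generated, assuming (as in the paper's bivariate setup) that the transfer distribution arises from an intervention that changes the marginal of the cause A while the mechanism p(B|A) stays fixed. The categorical distributions below are illustrative placeholders, not the paper's actual settings.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10   # number of categories per variable (illustrative)

# Fixed causal mechanism p(B | A): one categorical distribution per value of A.
p_b_given_a = rng.dirichlet(np.ones(N), size=N)

def sample_pairs(p_a, n):
    """Draw (A, B) pairs: A ~ p_a, then B ~ p(B | A); the mechanism never changes."""
    a = rng.choice(N, size=n, p=p_a)
    b = np.array([rng.choice(N, p=p_b_given_a[ai]) for ai in a])
    return a, b

p_a_train = rng.dirichlet(np.ones(N))      # training marginal of the cause A
p_a_transfer = rng.dirichlet(np.ones(N))   # intervention on A: new marginal, same mechanism

a_train, b_train = sample_pairs(p_a_train, 10_000)        # analogous to the red samples
a_transfer, b_transfer = sample_pairs(p_a_transfer, 500)  # analogous to the green/blue samples
```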



Parameter Counting Argument

This argument helps to explain what we observe in Figure 1.


Proposition 1: For modules that were correctly learned on the training
distribution and whose ground-truth conditional distribution did not change
under the transfer distribution, the parameters are already at a maximum of
the log-likelihood on the transfer distribution.
Proposition 2: The gradient of the meta-objective with respect to the
structural parameter is driven by the difference between the log-likelihoods
of the two hypotheses on the transfer data (see the derivation sketched
below).
Proposition 3: Stochastic gradient descent (with an appropriately decreasing
learning rate) on the expected meta-objective over transfer data makes the
sigmoid of the structural parameter converge to 1 for the correct causal
hypothesis and to 0 for the incorrect one.
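A sketch of the calculation behind Propositions 2 and 3, assuming (as in the paper) a smooth parameterization of the structure by a scalar γ, where σ(γ) is the probability assigned to the hypothesis A → B, and taking the meta-transfer objective to be the regret on transfer data D:

```latex
% Regret: negative log of the sigma(gamma)-weighted mixture of the two hypotheses'
% likelihoods L_{A->B}(D) and L_{B->A}(D) on the transfer data D.
\mathcal{R}(\gamma) = -\log\!\Big[\sigma(\gamma)\, L_{A\to B}(D)
                                  + \big(1-\sigma(\gamma)\big)\, L_{B\to A}(D)\Big]

% Differentiating with respect to the structural parameter gamma:
\frac{\partial \mathcal{R}}{\partial \gamma}
  = \sigma(\gamma) - P\big(A\to B \mid D\big)
  = \sigma(\gamma) - \sigma\!\big(\gamma + \log L_{A\to B}(D) - \log L_{B\to A}(D)\big)
```

So the gradient is governed by the difference between the two hypotheses' log-likelihoods on the transfer data, and stochastic gradient descent pushes σ(γ) towards 1 when A → B explains the transfer data better and towards 0 otherwise, which is the convergence behaviour stated in Proposition 3.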



Experimental Results

Convergence result from Proposition 3: learning the structural parameter in
a bivariate model.
MLPs parametrize the conditional distributions used to decide whether one
variable is a direct causal parent of another (a minimal sketch of the
structural-parameter update follows this list).
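A minimal sketch (not the authors' implementation) of the resulting structural-parameter update, reusing the regret from the derivation above. The two log-likelihood values are random placeholders standing in for the online log-likelihoods of the two adapted models.

```python
import torch
import torch.nn.functional as F

gamma = torch.zeros((), requires_grad=True)   # structural parameter; sigmoid(gamma) = P(A -> B)
opt = torch.optim.SGD([gamma], lr=0.5)

for episode in range(100):
    # Random placeholders for the online log-likelihoods of the two hypotheses on one
    # transfer episode; in the paper these come from adapting the two models for a few steps.
    log_l_ab = -100.0 + 5.0 * torch.rand(())   # hypothesis A -> B (fits better on average here)
    log_l_ba = -105.0 + 5.0 * torch.rand(())   # hypothesis B -> A

    # Regret: negative log of the sigmoid(gamma)-weighted mixture of the two likelihoods,
    # computed in log-space for numerical stability.
    regret = -torch.logsumexp(torch.stack([F.logsigmoid(gamma) + log_l_ab,
                                           F.logsigmoid(-gamma) + log_l_ba]), dim=0)
    opt.zero_grad()
    regret.backward()
    opt.step()

print(torch.sigmoid(gamma).item())   # drifts towards 1, i.e. towards the A -> B structure
```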

Figure: Learning the structural parameter, and the cross-entropy between the
ground-truth SCM structure and the learned SCM structure.



Representation Learning

In many realistic scenarios, a learning agent does not observe the true
causal variables directly but only sensory-level data, like pixels and
sounds; the assumption is that, in the right representation space, the
correct causal graph is sparsely connected.
To tackle this, the paper follows the deep-learning objective of
disentangling the underlying causal variables, learning a representation in
which these properties hold.
The learner must map its raw observations to a hidden representation space
H via an encoder E. The encoder is trained such that the hidden space H
helps to optimize the meta-transfer objective (a minimal sketch follows).
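A minimal sketch, not the authors' implementation: a hypothetical linear encoder E maps raw observations to a two-dimensional latent (A, B), simple Gaussian conditional models stand in for the two hypotheses, and a single regret step of the same form as on the earlier slides sends gradients through H into the encoder. In the paper the likelihoods are accumulated over adaptation episodes rather than a single batch, and the encoder for the bivariate case is a simple rotation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Hypothetical encoder E: maps raw observations to a 2-D latent space H = (A, B)."""
    def __init__(self, obs_dim=2):
        super().__init__()
        self.net = nn.Linear(obs_dim, 2, bias=False)

    def forward(self, x):
        return self.net(x)

def gaussian_ll(params, target):
    """Log-likelihood of `target` under a Gaussian whose mean/log-std come from `params`."""
    mean, log_std = params.chunk(2, dim=-1)
    return torch.distributions.Normal(mean, log_std.exp()).log_prob(target).sum()

encoder = Encoder()
model_ab = nn.Sequential(nn.Linear(1, 16), nn.Tanh(), nn.Linear(16, 2))  # models p(B | A)
model_ba = nn.Sequential(nn.Linear(1, 16), nn.Tanh(), nn.Linear(16, 2))  # models p(A | B)
gamma = torch.zeros((), requires_grad=True)                              # structural parameter

params = (list(encoder.parameters()) + list(model_ab.parameters())
          + list(model_ba.parameters()) + [gamma])
opt = torch.optim.Adam(params, lr=1e-3)

x = torch.randn(64, 2)                    # placeholder raw observations from a transfer batch
h = encoder(x)                            # hidden representation H
a, b = h[:, :1], h[:, 1:]
log_l_ab = gaussian_ll(model_ab(a), b)    # log-likelihood under hypothesis A -> B
log_l_ba = gaussian_ll(model_ba(b), a)    # log-likelihood under hypothesis B -> A

# Regret as before; its gradient flows through H, so the encoder is trained
# to make fast adaptation (and hence structure identification) possible.
regret = -torch.logsumexp(torch.stack([F.logsigmoid(gamma) + log_l_ab,
                                       F.logsigmoid(-gamma) + log_l_ba]), dim=0)
opt.zero_grad()
regret.backward()
opt.step()
```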





The End

