
Causal decision-making meets

Gaussian Processes

Virginia Aglietti

University of Warwick, The Alan Turing Institute


Causal decision-making

Integrate causal considerations into a choice process and take decisions


based on causal knowledge [Hagmayer and Fernbach (2017)].

Why is this important?

Systems/processes decompose into sets of interconnected nodes

[Figure: DAG with nodes Soil fumigants, Eelworm (t−1), Eelworm (t), Eelworm (t+1), and Crop yield.]

Figure 1: Causal Graph for Crop Yield. Nodes denote variables, arrows
represent causal effects, and dashed edges indicate unobserved confounders.

Systems/processes decompose into sets of interconnected nodes

[Figure: DAG with nodes Chlα, Salin., ΩA, TA, pHsw, DIC, Nutr., PCO2, Light, Tem., and NEC.]

Figure 2: Causal Graph for Net Ecosystem Calcification (NEC). Dotted nodes
represent non-manipulative variables.

Systems/processes decompose into sets of interconnected nodes

[Figure: DAG with nodes BMI, Age, Aspirin, Statin, Cancer, and PSA.]

Figure 3: Causal Graph for Prostate Specific Antigen (PSA) level. Dotted
nodes represent non-manipulative variables.

Common elements in these examples

• A causal graph (Directed Acyclic Graph - DAG).


• Observational data from all (non-hidden) nodes.
• Ability to run experiments (in reality or in simulation).
• Cost of experiments depends on the number and type of nodes on
  which we intervene.

Research goal

Efficiently find the system configuration that optimises the target node.

Research Goal

Research goal

Efficiently find the system configuration that optimises the target node.

• System configuration → manipulative variables to be intervened on and
  their intervention levels, e.g. Soil fumigants, CO2, Statin.
• Target node → variable in the causal graph that we wish to optimize
  considering its causal relationships, e.g. Crop yield, NEC, PSA.
• Efficiently → exploit all information, observational and interventional,
  that we collect when exploring the system.

Questions to answer

• How can I learn the expected crop yield given different interventions?
• How can I learn the optimal intervention, that is, the intervention
  maximizing the crop yield?
• I have observed the crop yield over seasons and have previously
  intervened on soil fumigants. How can I integrate this information to
  infer the crop yield I would get by intervening on soil fumigants and
  eelworm t?

Research Goal

Research goal

Efficiently find the system configuration that optimises the target node.

1. Perform experiments
2. Integrate interventional and observational data
3. Transfer interventional information.

Causal models and do-calculus
Causal models and do-calculus (1 of 3)

Causal model: DAG G + four-tuple ⟨U, V, F, P(U)⟩

• U: independent exogenous background variables.
• P(U): distribution of U.
• V: endogenous variables (non-manipulative, manipulative, target).
• F = {f_1, ..., f_|V|}: functions v_i = f_i(pa_i, u_i), where pa_i are the parents of V_i.

Example (graph C → X, C → Y, X → Y):

C = f_C(U_C),        U_C ~ N(0, σ_C²)
X = f_X(C, U_X),     U_X ~ N(0, σ_X²)
Y = f_Y(X, C, U_Y),  U_Y ~ N(0, σ_Y²)

Causal models and do-calculus (2 of 3)

Intervention: Setting a manipulative variable X to a value x, do(X = x).

Observed universe                      Post-intervention universe

C → X → Y,  C → Y                      C → (X = x) → Y,  C → Y

C = f_C(U_C)                           C = f_C(U_C)
X = f_X(C, U_X)                        X = x
Y = f_Y(X, C, U_Y)                     Y = f_Y(x, C, U_Y)

P(X, C, Y)                             P^{do(X = x)}(C, Y)

P(Y | do(X = x)) := P^{do(X = x)}(Y | X = x)

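To make the two universes concrete, here is a minimal numpy sketch that samples from the three-variable SCM above. The linear structural functions and noise scales are illustrative assumptions, since the slide only specifies their general form.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_scm(n, do_x=None):
    """Sample (C, X, Y) from the SCM C -> X, C -> Y, X -> Y.

    The linear forms below are illustrative stand-ins for f_C, f_X, f_Y;
    do(X = x) replaces the structural equation of X with the constant x.
    """
    C = rng.normal(0.0, 1.0, size=n)                    # C = f_C(U_C)
    if do_x is None:
        X = 2.0 * C + rng.normal(0.0, 1.0, size=n)      # X = f_X(C, U_X)
    else:
        X = np.full(n, float(do_x))                     # post-intervention: X = x
    Y = X - 0.5 * C + rng.normal(0.0, 1.0, size=n)      # Y = f_Y(X, C, U_Y)
    return C, X, Y

_, _, y_obs = sample_scm(100_000)               # observed universe, P(X, C, Y)
_, _, y_int = sample_scm(100_000, do_x=1.0)     # post-intervention universe
print(y_obs.mean(), y_int.mean())               # E[Y] vs. E[Y | do(X = 1)]
```
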
Causal models and do-calculus (3 of 3)

Key question: How to do inference in the post-intervention universe?

• Intervene → Interventional data → P(Y | do(X = x))
• Observe → Observational data → do-calculus → P̂(Y | do(X = x))

do-calculus: algebra to emulate the post-intervention universe in terms of
conditionals P(Y | X = x) in the observed universe.

Back-door adjustment: P(Y | do(X = x)) = ∫ P(Y | C = c, X = x) P(c) dc.

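As a sanity check, the back-door formula can be emulated from purely observational samples: fit E[Y | X, C] and average over the observed confounder values. A minimal sketch, reusing the same illustrative linear SCM as above (an assumption, not part of the slides).

```python
import numpy as np

rng = np.random.default_rng(1)

# Observational data only (same illustrative linear SCM as before)
C = rng.normal(size=5_000)
X = 2.0 * C + rng.normal(size=5_000)
Y = X - 0.5 * C + rng.normal(size=5_000)

# Fit E[Y | X, C] by ordinary least squares on the observational data
A = np.column_stack([np.ones_like(X), X, C])
beta, *_ = np.linalg.lstsq(A, Y, rcond=None)

def backdoor_effect(x):
    """Back-door adjustment: E[Y | do(X = x)] ≈ mean over observed c of E[Y | X = x, C = c]."""
    return np.mean(beta[0] + beta[1] * x + beta[2] * C)

print(backdoor_effect(1.0))   # ≈ 1.0, matching the interventional simulation
```
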
Causal Optimization

X*_s, x*_s = argmin_{X_s ∈ P(X), x_s ∈ D(X_s)} E[Y | do(X_s = x_s)]        (1)

• P(X): gives all possible intervention sets.
• D(X_s): fixed interventional domain for each X_s.
• X_s, x_s: one possible intervention set and value.
• X*_s, x*_s: optimal intervention set and value.

Global optimization vs. Causal optimization

Causal optimization:
X*_s, x*_s = argmin_{X_s ∈ P(X), x_s ∈ D(X_s)} E[Y | do(X_s = x_s)]
• Explore P(X)
• Find both the intervention set and the intervention level

Global optimization:
x* = argmin_{x ∈ D(X)} E[Y | do(X = x)]
• Set the intervention set to X
• Find the intervention level

[Figure: two example DAGs over variables A, B, C, D, E with target Y.]

Solving Global Optimization

• Target function f is explicitly unknown and multimodal.


• Evaluations of f are perturbed by noise.
• Evaluations of f are expensive.

Bayesian Optimization

Bayesian optimization

• Goal: Collect data x_1, ..., x_n to find the optimum as fast as possible.
• Model: Gaussian process f(x) ~ GP(µ(x), k_θ(x, x')).
• Acquisition: α_EI(x; θ, D) = ∫ max(0, y_best − y) p(y | x; θ, D) dy

Each point x_{n+1} is collected as x_{n+1} = argmax_x α_EI(x; θ, D_n).

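Below is a compact numpy/scipy sketch of a single BO step: a zero-mean GP with an RBF kernel and the closed-form version of the expected-improvement integral above (written for minimization). The toy target, lengthscale, and noise level are illustrative choices, not from the slides.

```python
import numpy as np
from scipy.stats import norm

def rbf(a, b, lengthscale=1.0):
    """Squared-exponential kernel k(a, b) = exp(-||a - b||^2 / (2 l^2)) for 1-D inputs."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-d2 / (2 * lengthscale ** 2))

def gp_posterior(x_star, x, y, noise=1e-3):
    """Zero-mean GP posterior mean and standard deviation at query points x_star."""
    K = rbf(x, x) + noise * np.eye(len(x))
    Ks = rbf(x_star, x)
    mu = Ks @ np.linalg.solve(K, y)
    var = 1.0 - np.einsum("ij,ji->i", Ks, np.linalg.solve(K, Ks.T))
    return mu, np.sqrt(np.clip(var, 1e-12, None))

def expected_improvement(x_star, x, y):
    """Closed form of the EI integral above, for minimizing the target."""
    mu, sigma = gp_posterior(x_star, x, y)
    z = (y.min() - mu) / sigma
    return (y.min() - mu) * norm.cdf(z) + sigma * norm.pdf(z)

# One BO step on a toy 1-D problem (illustrative target)
f = lambda x: np.cos(3 * x) + 0.1 * x
x = np.array([-1.0, 0.2, 1.5]); y = f(x)
grid = np.linspace(-2, 2, 200)
x_next = grid[np.argmax(expected_improvement(grid, x, y))]
print(x_next)
```
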
Solving Causal Optimization

• Target function f is explicitly unknown and multimodal.


• Evaluations of f are perturbed by noise.
• Evaluations of f are expensive.

+
• Causal graph

Causal Bayesian Optimization (CBO)

Idea: Run interventions (X_{s_1}, x_{s_1}), ..., (X_{s_n}, x_{s_n}) to find the
optimum as fast as possible.

Do we need to explore all 2^|X| sets in P(X)? NO!

X*_s, x*_s = argmin_{X_s ∈ P(X), x_s ∈ D(X_s)} E[Y | do(X_s = x_s)]

Minimal Intervention Set (MIS, M^C_{G,Y})

Given ⟨G, Y, X, C⟩, a set X_s ∈ P(X) is said to be a MIS if there is no
X'_s ⊂ X_s such that E[Y | do(X_s = x_s), C] = E[Y | do(X'_s = x'_s), C].

Possibly-Optimal Minimal Intervention Set (POMIS, P^C_{G,Y})

Let X_s ∈ M^C_{G,Y}. X_s is a POMIS if there exists an SEM conforming to G
such that E[Y | do(X_s = x*_s), C] > E[Y | do(W = w*), C] for all
W ∈ M^C_{G,Y} \ X_s, where x*_s and w* denote the optimal intervention values.

MIS and POMIS are sets of variables 'worth' intervening on.

Causal Bayesian Optimization
Toy example

X → Z → Y

X = ε_X
Z = exp(−X) + ε_Z
Y = cos(Z) − exp(−Z/20) + ε_Y

M_{G,Y} = {∅, {X}, {Z}}
P_{G,Y} = {{Z}}
B_{G,Y} = {{X, Z}}

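The intervention functions of this toy example can be checked by direct simulation; a short sketch follows, assuming standard-normal noise terms ε (the slide does not state their distribution).

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_scm(n, do_x=None, do_z=None):
    """Toy SCM X -> Z -> Y from the slide; eps_X, eps_Z, eps_Y ~ N(0, 1) by assumption."""
    eps_x, eps_z, eps_y = rng.normal(size=(3, n))
    X = eps_x if do_x is None else np.full(n, float(do_x))
    Z = np.exp(-X) + eps_z if do_z is None else np.full(n, float(do_z))
    Y = np.cos(Z) - np.exp(-Z / 20.0) + eps_y
    return Y

# Monte Carlo estimates of two points on the intervention functions
print(toy_scm(200_000, do_x=1.0).mean())   # E[Y | do(X = 1)]
print(toy_scm(200_000, do_z=2.0).mean())   # E[Y | do(Z = 2)]
```
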
Modelling E[Y | do(X_s = x_s)] for each X_s with a Causal GP prior

f(x_s) ~ GP(m(x_s), k(x_s, x'_s))

m(x_s) = Ê[Y | do(X_s = x_s)]
k(x_s, x'_s) = k_RBF(x_s, x'_s) + σ(x_s) σ(x'_s)

where

• k_RBF(x_s, x'_s) := exp(−||x_s − x'_s||² / (2l²))
• σ(x_s) = V̂(Y | do(X_s = x_s))^{1/2}, with V̂ the variance of the causal
  effects estimated from observational data.

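A minimal sketch of how this prior could be assembled on a grid of intervention values. Here `mean_hat` and `std_hat` stand in for the do-calculus estimates Ê[Y | do(X_s = x)] and V̂[Y | do(X_s = x)]^{1/2}, which are assumed to be available from observational data (they are toy lambdas in the example).

```python
import numpy as np

def causal_gp_prior(x_grid, mean_hat, std_hat, lengthscale=1.0):
    """Prior mean vector and covariance matrix of the causal GP on x_grid.

    m(x)     = E_hat[Y | do(X_s = x)]          (via mean_hat)
    k(x, x') = k_RBF(x, x') + sigma(x) sigma(x')  (sigma via std_hat)
    """
    d2 = (x_grid[:, None] - x_grid[None, :]) ** 2
    k_rbf = np.exp(-d2 / (2.0 * lengthscale ** 2))
    sigma = std_hat(x_grid)
    return mean_hat(x_grid), k_rbf + np.outer(sigma, sigma)

# Toy do-calculus estimates (illustrative assumptions), then one draw from the prior
x = np.linspace(-3.0, 3.0, 50)
m, K = causal_gp_prior(x, mean_hat=np.cos, std_hat=lambda v: 0.1 + 0.05 * np.abs(v))
sample = np.random.default_rng(0).multivariate_normal(m, K + 1e-8 * np.eye(len(x)))
```
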
Modelling E[Y | do(X_s = x_s)] for each X_s with a Causal GP prior

Standard GP prior: f(x_s) ~ GP(m(x_s), k(x_s, x'_s)) with m(x_s) = 0 and
k(x_s, x'_s) = k_RBF(x_s, x'_s) := exp(−||x_s − x'_s||² / (2l²)).

Solving the exploration-exploitation trade-off

Causal Expected Improvement

• y_s = E[Y | do(X_s = x_s)]
• y* = max_{X_s ∈ es, x ∈ D(X_s)} E[Y | do(X_s = x_s)]

EI_s(x) = E_{p(y_s)}[max(y_s − y*, 0)] / Co(X_s, x_s)

where Co(X_s, x_s) is the cost of intervening on X_s at level x_s.

New intervention set and value

• α_1, ..., α_|es|: values obtained by optimizing EI_s(x) for each set in es.
• α* := max{α_1, ..., α_|es|} and s* = argmax_{s ∈ {1, ..., |es|}} α_s.

Solving the intervention-observation trade-off

ε-greedy criterion

ε = Vol(C(D^O)) / Vol(×_{X ∈ X} D(X)) × N / N_max,

• Vol(C(D^O)): volume of the convex hull of the observational data.
• Vol(×_{X ∈ X} D(X)): volume of the interventional domain.

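A sketch of how ε could be computed and used; it relies on scipy's ConvexHull for the volume of the observational data and assumes a box-shaped interventional domain given as per-variable bounds (the data and bounds below are illustrative).

```python
import numpy as np
from scipy.spatial import ConvexHull

def epsilon(obs_x, domain, n_max):
    """eps = Vol(C(D^O)) / Vol(x_X D(X)) * N / N_max.

    obs_x:  (N, d) array of manipulative-variable values observed so far.
    domain: list of (low, high) interventional bounds, one pair per variable.
    """
    vol_hull = ConvexHull(obs_x).volume
    vol_domain = np.prod([hi - lo for lo, hi in domain])
    return (vol_hull / vol_domain) * (len(obs_x) / n_max)

# Observe with probability eps, otherwise intervene (the trade-off above)
rng = np.random.default_rng(0)
obs_x = rng.uniform(-1.0, 1.0, size=(30, 2))
eps = epsilon(obs_x, domain=[(-3.0, 3.0), (-3.0, 3.0)], n_max=100)
action = "observe" if rng.uniform() < eps else "intervene"
print(eps, action)
```
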
Causal Bayesian Optimization - CBO

Algorithm: Causal Bayesian Optimization

Data: D^O, D^I, G, es, number of steps T
Result: X*_s, x*_s, Ê[Y* | do(X*_s = x*_s)]
Initialise: Set D^I_0 = D^I and D^O_0 = D^O
for t = 1, ..., T do
    Compute ε and sample u ~ U(0, 1)
    if ε > u then (Observe)
        1. Collect new observations (x_t, c_t, y_t).
        2. Augment D^O = D^O ∪ {(x_t, c_t, y_t)}.
        3. Update the prior of the causal GP.
    else (Intervene)
        1. Compute EI_s(x) for each element s ∈ es.
        2. Obtain the optimal interventional set-value pair (s*, α*).
        3. Intervene on the system.
        4. Update the posterior of the interventional GP.
    end
end

Simulation analysis

• Results are consistent with what is expected.
• Better results than BO: the causal prior propagates the effect of
  interventions beyond the default interventional domain.

Take home messages:

1. Many real systems decompose into interconnected nodes.
2. Optimization requires 'intervening' on the manipulative nodes
   and solving a Causal Optimization problem.
3. Standard Bayesian Optimization ignores causal assumptions.
4. CBO solves Causal Optimization problems and improves on BO
   when causal information is available.
5. CBO efficiently explores 'worthy' interventions.
6. The causal GP prior merges observational and interventional data.

Limitations of CBO

• The number of GPs we require is determined by |P(X)|, which is
  potentially huge.
• We don't transfer interventional information across GPs, e.g. we don't
  account for the fact that intervening on X might give us some
  information about an intervention on X and Z.

Limitation of CBO

X → Z → Y

M_{G,Y} = {∅, {X}, {Z}}

• f_X(x)
• f_Z(z)
• f_{X,Z}(x, z)

Research goal

Efficiently find the system configuration that optimises the target node.

Methodology to transfer interventional information.

Research goal

We aim to learn the set of intervention functions for Y in G:

T = {t_s(x)}_{s=1}^{|P(X)|},   t_s(x) = f_s(x) = E_{p(Y | do(X_s = x))}[Y] = E[Y | do(X_s = x)],

given D^O = {x_n, y_n}_{n=1}^{N} and D^I = (X^I, Y^I) with
X^I = ∪_s {x^I_{si}}_{i=1}^{N^I_s} and Y^I = ∪_s {y^I_{si}}_{i=1}^{N^I_s}.

Goal
Define p(T) and compute p(T | D^I) so as to make probabilistic
predictions for T at unobserved intervention sets and levels.

Steps in the methodology

1. Study the correlation among the functions in T = {f_s(x)}_{s=1}^{|P(X)|},
   which varies with the topology of G.
2. Define a joint prior distribution p(T).
3. Develop a multi-task model based on p(T) so as to compute the
   posterior p(T | D^I).

1. Study the correlation among intervention functions

Any function in T can be written as an integral transformation of some
base function f, via a causal operator L_s, such that
t_s(x) = L_s(f)(x) for all X_s ∈ P(X) (Theorem 3.1):

L_s(f)(x) = ∫ ··· ∫ π_s(x, (v_s^N, c)) f(v, c) dv_s^N dc,

where

• f(v, c) = E[Y | do(I = v), C^N = c]
• C^N is the set of variables in G that are directly confounded with Y
  and are not colliders.
• I is the set Pa(Y).
• π_s(x, (v_s^N, c)) = p(c_s^I | c_s^N) p(v_s^N, c_s^N | do(X_s = x)) is the
  integrating measure for the set X_s.

1. Study the correlation among intervention functions

• (a) I = {Z}, C^N = ∅
• (b) I = {E, D}, C^N = {A, B}

2. Define a joint prior p(T)

t_s(x) = L_s(f)(x),   ∀ X_s ∈ P(X)

We can define a prior for T by:

• Step (1): placing a causal prior on f
• Step (2): propagating this prior through L_s(·) for all t_s(x)

2. Define a joint prior p(T)

• Step (1): Placing a causal prior on f(v, c) = f(b)

f(b) ~ GP(m(b), K(b, b'))

with m(b) = f̂(b) and K(b, b') = k_RBF(b, b') + σ̂(b) σ̂(b'), where

f̂(b) = f̂(v, c) = Ê[Y | do(I = v), c]
σ̂(b) = σ̂(v, c) = V̂[Y | do(I = v), c]^{1/2}

and V̂ and Ê represent the variance and expectation of the causal effects
estimated from D^O using do-calculus.

2. Define a joint prior p(T)

• Step (2): propagating this prior through L_s(·) for all t_s(x)

For each X_s ∈ P(X), we have t_s(x) ~ GP(m_s(x), k_s(x, x')) with:

m_s(x) = ∫ ··· ∫ m(b) π_s(x, b_s) db_s
k_s(x, x') = ∫ ··· ∫ K(b, b') π_s(x, b_s) π_s(x', b'_s) db_s db'_s,

where b_s = (v_s^N, c) is the subset of b including only the v values
corresponding to the set I_s^N.

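In practice these integrals are typically intractable and can be approximated by Monte Carlo, sampling b_s from the integrating measure π_s(x, ·). The sketch below only illustrates the mean propagation; `sample_pi_s` is a hypothetical helper standing in for a simulator of the do-calculus estimate of π_s, and the toy stand-ins at the bottom are assumptions.

```python
import numpy as np

def propagate_mean(x, m, sample_pi_s, n_mc=5_000):
    """Monte Carlo estimate of m_s(x) = integral of m(b) pi_s(x, b_s) db_s."""
    b = sample_pi_s(x, n_mc)          # draws b_s ~ pi_s(x, .)
    return np.mean(m(b))

# Toy stand-ins: m(b) = cos(b) and pi_s(x, .) a Gaussian centred at exp(-x)
rng = np.random.default_rng(0)
toy_m = np.cos
toy_sampler = lambda x, n: np.exp(-x) + rng.normal(size=n)
print(propagate_mean(1.0, toy_m, toy_sampler))
```
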
The DAG-GP model

Joint prior distribution: T_D ~ N(m_T(D), K_T(D, D))

Likelihood function: p(Y^I | T^I, σ²) = N(T^I, σ² I).

Joint posterior distribution: T_D | D^I ~ N(m_{T|D^I}(D), K_{T|D^I}(D, D)) with

m_{T|D^I}(D) = m_T(D) + K_T(D, X^I) [K_T(X^I, X^I) + σ² I]^{−1} (T^I − m_T(X^I))
K_{T|D^I}(D, D) = K_T(D, D) − K_T(D, X^I) [K_T(X^I, X^I) + σ² I]^{−1} K_T(X^I, D).

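This is standard Gaussian conditioning once m_T and K_T have been built from the propagated causal prior. A small numpy sketch, assuming the prior has already been evaluated on a stacked grid D of intervention inputs and that interventional outcomes are available at some of those grid points (the toy grid and observations are illustrative).

```python
import numpy as np

def dag_gp_posterior(m_prior, K_prior, idx_obs, y_obs, noise=0.01):
    """Joint posterior of all intervention functions under the shared DAG-GP prior.

    m_prior, K_prior: prior mean/covariance of T on a stacked grid D
                      (outputs of the propagated causal prior above).
    idx_obs:          indices of D where interventional outcomes y_obs were observed.
    The conditioning below is the formula on the slide with the matrix inverse explicit.
    """
    K_oo = K_prior[np.ix_(idx_obs, idx_obs)] + noise * np.eye(len(idx_obs))
    K_do = K_prior[:, idx_obs]
    alpha = np.linalg.solve(K_oo, y_obs - m_prior[idx_obs])
    mean_post = m_prior + K_do @ alpha
    cov_post = K_prior - K_do @ np.linalg.solve(K_oo, K_do.T)
    return mean_post, cov_post

# Toy usage: a 5-point grid with observations at two of the points (illustrative)
m = np.zeros(5)
K = np.exp(-0.5 * (np.arange(5)[:, None] - np.arange(5)[None, :]) ** 2)
mu, cov = dag_gp_posterior(m, K, idx_obs=[1, 3], y_obs=np.array([0.2, -0.4]))
print(mu)
```
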
A helicopter view

                              Interventional data: No         Interventional data: Yes
                              (single-task)                    (multi-task)

Observational data: No        gp                               dag-gp
(mechanistic model)           p(T) = ∏_s p(t_s(x))             p(T) = ∏_s p(t_s(x) | f)
                              t_s(x) ~ GP(0, K_RBF(x, x'))     t_s(x) = ∫ f(b) π_s(x, b_s) db_s
                                                               f(b) ~ GP(0, K_RBF(b, b'))

Observational data: Yes       gp+                              dag-gp+
(do-calculus)                 p(T) = ∏_s p(t_s(x))             p(T) = ∏_s p(t_s(x) | f)
                              t_s(x) ~ GP(m+(x), K+(x, x'))    t_s(x) = ∫ f(b) π_s(x, b_s) db_s
                                                               f(b) ~ GP(m+(b), K+(b, b'))

Figure 6: Models for learning the intervention functions T defined on a DAG.

Learning intervention functions from data

Figure 7: Posterior mean and variance for t_X(x) = f_X(x).

DAG-GP as surrogate model in CBO

Figure 8: Convergence of the CBO algorithm to the global optimum
(E[Y* | do(X_s = x)]) when a single-task or a multi-task GP model is used as
the surrogate model.

DAG-GP as surrogate model in CBO

Figure 9: Convergence of the CBO algorithm to the global optimum for PSA
(E[PSA* | do(X_s = x)]) when a single-task or a multi-task GP model is used as
the surrogate model.

Take home messages:

1. The DAG-GP model makes it possible to efficiently learn the causal
   effects in a graph and to identify the optimal intervention to perform.
2. It captures the non-trivial correlation structure across different
   experimental outputs.
3. It enables proper uncertainty quantification and can be used within
   decision-making algorithms to choose which experiments to perform.

Thanks for your attention!
