Finding important variables and interactions in black boxes
------------------------------------------------------------

Art B. Owen
Stanford University
owen@stat.stanford.edu

Tao Jiang
Stanford University
jiang@stat.stanford.edu

Theme
-----

As dimension increases, many numerical problems become more statistical. Because:

1. the sample is inevitably sparse,
2. the error depends on the unsampled part of the space,
3. worst case error bounds are inapplicable.
Example: integration
--------------------

$$I = \int_{(0,1)^d} f(x)\,dx$$

Sampling methods:

1. Monte Carlo: error $O(n^{-1/2})$.
2. Quasi-Monte Carlo: $O(n^{-1}(\log n)^{d-1})$, but no practical error estimate.
3. Randomized quasi-Monte Carlo: replication based error estimates, and $O(n^{-3/2}(\log n)^{(d-1)/2})$.

Rates are asymptotic, under mild conditions on $f$.

Also statistical: approximation.

Mortgage backed securities integrand
------------------------------------

Paskov & Traub; Caflisch, Morokoff & Owen

$Y$ = present value of 30 years of monthly cash flows.

Prepayment:

1. puts lumps into the payment stream,
2. is more common when interest rates are low.

MBS model (from Goldman-Sachs):

$$Y = f(X), \qquad X \sim U[0,1]^{360} \;\to\; Z = \Phi^{-1}(X)$$

Interest rates $r_1, \dots, r_{360}$ follow a geometric Brownian motion driven by $Z$.

Prepayment fraction: $A + B\arctan(C + D\,r_t)$.
QMC super on MBS
----------------

- But $Y = f(X)$ is 99.96% additive: Latin hypercube sampling variance is about 0.04% of MC.
- Also $Y = f(X)$ is 99.98% odd (antisymmetric): antithetic sampling variance is about 0.02% of MC.
- Additive and odd: upon further investigation, $f$ was virtually linear in $Z$.
- The curse of dimensionality is not broken by QMC; we just had an easy integrand.
- QMC requires low "effective dimension" to trounce MC.
- QMC points $x_i$ are very uniform in low dimensional projections: great for functions dominated by $f_u$ with small $|u|$.

ANOVA of $L^2[0,1]^d$
---------------------

Hoeffding; Efron & Stein; Sobol'

Main effects and $k$-factor interactions, generalizing familiar discrete ANOVA:

$$f(x) = \sum_{u \subseteq \{1,2,\dots,d\}} f_u(x)$$

- $f_u$ depends only on the components $x_j$ with $j \in u$
- $f_\emptyset = \int f(x)\,dx$, the "grand mean"
- $\sigma^2(f) = \sum_{u \ne \emptyset} \int f_u(x)^2\,dx$
- $\int f_u(x)\,f_v(x)\,dx = 0$ for $u \ne v$

$$\frac{1}{n}\sum_{i=1}^n f(x_i) = \sum_u \frac{1}{n}\sum_{i=1}^n f_u(x_i)$$
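As a concrete illustration (not from the slides), the decomposition for $d = 2$ and $f(x_1, x_2) = x_1 x_2$:

$$f_\emptyset = \int_0^1\!\!\int_0^1 x_1 x_2\,dx_1\,dx_2 = \tfrac14$$

$$f_{\{1\}}(x_1) = \int_0^1 x_1 x_2\,dx_2 - f_\emptyset = \tfrac{x_1}{2} - \tfrac14, \qquad f_{\{2\}}(x_2) = \tfrac{x_2}{2} - \tfrac14$$

$$f_{\{1,2\}}(x) = x_1 x_2 - f_{\{1\}}(x_1) - f_{\{2\}}(x_2) - f_\emptyset = \bigl(x_1 - \tfrac12\bigr)\bigl(x_2 - \tfrac12\bigr)$$

Each component integrates to zero over each of its own arguments, the components are orthogonal, and $\sigma^2(f) = \tfrac1{48} + \tfrac1{48} + \tfrac1{144} = \tfrac{7}{144}$.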
Isotropic integrand
-------------------

Capstick & Keister; Papageorgiou & Traub; Owen

$$f(x) = \cos\!\left( \frac{\|\Phi^{-1}(x)\|}{\sqrt 2} \right), \qquad x \sim U(0,1)^d$$

$$\int f(x)\,dx = E\cos\!\left( \sqrt{\chi^2_{(d)}/2} \right)$$

- Closed form (Mathematica) aids comparison of methods.
- Varies equally in all directions.
- QMC does well.
- For $d = 25$, over 99% of variance comes from 1, 2, and 3 dimensional ANOVA effects ... after numerical investigation exploiting symmetry and Gaussianity.
- Diaconis: closed form $\ne$ understanding.

The borehole function
---------------------

Morris, Mitchell, Ylvisaker

Flow from upper to lower aquifer:

$$\frac{2\pi T_u\,[H_u - H_l]}{\log\!\bigl(\tfrac{r}{r_w}\bigr)\Bigl[\,1 + \dfrac{2 L T_u}{\log(r/r_w)\, r_w^2 K_w} + \dfrac{T_u}{T_l}\Bigr]}$$

- $r_w$, $r$: radii of borehole and basin
- $T_u$, $T_l$: transmissivities of upper and lower aquifers
- $H_u$, $H_l$: potentiometric heads, upper and lower
- $L$, $K_w$: length and conductivity of the borehole

Which variables are important? Which interact?
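A minimal sketch of the borehole function as given above; the example input values are illustrative, not from the slides:

```python
import math

def borehole(rw, r, Tu, Hu, Tl, Hl, L, Kw):
    """Flow from the upper to the lower aquifer (formula from the slide)."""
    log_ratio = math.log(r / rw)
    numerator = 2.0 * math.pi * Tu * (Hu - Hl)
    denominator = log_ratio * (1.0 + 2.0 * L * Tu / (log_ratio * rw**2 * Kw)
                               + Tu / Tl)
    return numerator / denominator

# Illustrative inputs only (the slide does not give ranges or units here).
print(borehole(rw=0.1, r=500.0, Tu=80000.0, Hu=1050.0,
               Tl=80.0, Hl=760.0, L=1400.0, Kw=10000.0))
```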
Black box functions
-------------------

$Y = f(X)$, without "$+\,\varepsilon$".

Examples ($X \to Y$):

- Semiconductors: device design → speed, heat
- Aerospace: wing shape → lift, drag
- Automotive: auto frame → strength, weight
- Statistics: predictors → responses

Used to design products. Cheaper than physical experiments. Costs range from milliseconds to hours. Dimension ranges from 3 to 300. Accuracy varies too.

Kriging is widely used: Journel & Huijbreghts; Sacks, Ylvisaker, Welch, Wynn, Mitchell.

A small neural net
------------------

Venables & Ripley

Predict $\log_{10}(\mathrm{perf})$ from the others:

- perf: published performance of the computer
- syct: cycle time in nanoseconds
- mmin: minimum main memory in kilobytes
- mmax: maximum main memory in kilobytes
- cach: cache size in kilobytes
- chmin: minimum number of channels
- chmax: maximum number of channels

209 examples. The function below was found by training on these data.
The n-net function
------------------

$$\begin{aligned}
f(x) ={}& 0.46 - 1.21x_1 + 1.36x_2 + 1.42x_3 - 1.01x_4 - 0.33x_5 + 0.30x_6\\
&- 2.82\,S(1.12 + 0.45x_1 + 2.24x_2 + 2.51x_3 - 1.63x_4 - 0.56x_5 + 0.43x_6)\\
&+ 3.17\,S(1.09 + 2.28x_1 - 0.10x_2 + 1.44x_3 + 2.70x_4 + 1.24x_5 + 0.25x_6)\\
&+ 0.39\,S(0.04 - 0.11x_1 + 0.11x_2 + 0.12x_3 - 0.10x_4 - 0.04x_5 + 0.02x_6)
\end{aligned}$$

where $S$ is a sigmoidal function,

$$S(z) = [1 + \exp(-z)]^{-1}.$$

Given $f(x)$ on $[0,1]^d$
-------------------------

How can we tell if $f$:

1. is nearly linear?
2. is nearly additive?
3. is nearly quadratic?
4. has mostly 3 factor interactions or less?
5. Which variables matter most?
6. Which interactions matter most?

We would like:

1. a systematic approach,
2. that also predicts $f$.
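For reference, a direct transcription of the fitted n-net function above into Python; the sign of the first sigmoid term was lost in extraction and is reconstructed, so treat it as an assumption:

```python
import math

def S(z):
    """Logistic sigmoid, S(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def f_nnet(x):
    """The fitted n-net prediction of log10(perf); sign of -2.82 assumed."""
    x1, x2, x3, x4, x5, x6 = x
    return (0.46 - 1.21*x1 + 1.36*x2 + 1.42*x3 - 1.01*x4 - 0.33*x5 + 0.30*x6
            - 2.82*S(1.12 + 0.45*x1 + 2.24*x2 + 2.51*x3 - 1.63*x4 - 0.56*x5 + 0.43*x6)
            + 3.17*S(1.09 + 2.28*x1 - 0.10*x2 + 1.44*x3 + 2.70*x4 + 1.24*x5 + 0.25*x6)
            + 0.39*S(0.04 - 0.11*x1 + 0.11*x2 + 0.12*x3 - 0.10*x4 - 0.04*x5 + 0.02*x6))

print(f_nnet([0.5] * 6))  # prediction at the center of [0,1]^6
```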
MC approximation
----------------

Let

$$y = f(x) = \sum_{r \in U} \beta_r \psi_r(x) = \sum_{r \in R} \beta_r \psi_r(x) + \varepsilon(x)$$

- $\psi_r$: an orthonormal basis
- $\varepsilon(x)$: a deterministic truncation error

Estimate $\beta_r$ from $f(x_i)$ values, where $x_i \sim U(0,1)^d$, getting

$$\tilde f(x) = \sum_{r \in R} \tilde\beta_r \psi_r(x).$$

Apply graphical and numerical interpretation to $\tilde f$.

Start with univariate basis functions
-------------------------------------

$\phi_0, \phi_1, \phi_2, \dots$

The first is constant, and all are orthonormal:

$$\phi_0(x) = 1, \quad 0 \le x \le 1$$

$$\int_0^1 \phi_j(x)\,dx = 0, \quad j \ge 1$$

$$\int_0^1 \phi_j(x)\,\phi_k(x)\,dx = 1_{j=k}$$

E.g. orthogonal polynomials, sinusoids, wavelets, Hermite (composed with $\Phi^{-1}(\cdot)$), Chebychev (via qbeta()).
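A minimal sketch of one such family: shifted Legendre polynomials made orthonormal on $[0,1]$ (the n-net example later in the talk uses Legendre polynomials; the checking code is ours):

```python
import numpy as np

def phi(j, x):
    """Orthonormal shifted Legendre polynomial on [0,1]:
    phi_j(x) = sqrt(2j+1) * P_j(2x - 1)."""
    t = 2.0 * np.asarray(x) - 1.0
    if j == 0:
        return np.ones_like(t)
    p_prev, p = np.ones_like(t), t
    for k in range(1, j):  # three-term recurrence: (k+1)P_{k+1} = (2k+1)tP_k - kP_{k-1}
        p_prev, p = p, ((2*k + 1) * t * p - k * p_prev) / (k + 1)
    return np.sqrt(2*j + 1) * p

# Check orthonormality by Monte Carlo: E[phi_j(x) phi_k(x)] ~ 1{j=k}.
x = np.random.rand(200000)
print(np.mean(phi(1, x) * phi(1, x)))  # ~ 1
print(np.mean(phi(1, x) * phi(2, x)))  # ~ 0
```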
Tensor product basis
--------------------

$$x = (x_1, x_2, \dots, x_d) \in [0,1]^d$$

$$r = (r(1), r(2), \dots, r(d)) \in \{0, 1, 2, \dots\}^d$$

$$\psi_r(x) = \prod_{j=1}^d \phi_{r(j)}(x_j)$$

Finite subset of basis:

- $\mathrm{Rank}(r) \equiv \|r\|_0 = \sum_{j=1}^d 1_{r(j) > 0} \le B_0$
- $\mathrm{Degree}(r) \equiv \|r\|_1 = \sum_{j=1}^d r(j) \le B_1$
- $\mathrm{Order}(r) \equiv \|r\|_\infty = \max_{1 \le j \le d} r(j) \le B_\infty$

Polynomials $\psi_r$, with Rank $\|r\|_0 \le 3$, Degree $\|r\|_1 \le 4$, Order $\|r\|_\infty \le 3$
---------------------------------------------------------------------------------------------------

```
Rank  Deg  Order  psi_r(x)          #
0     0    0      Const             1
1     1    1      Linear            d
1     2    2      Quad              d
1     3    3      Cubic             d
2     2    1      Lin x Lin         C(d,2)
2     3    2      Lin x Quad        d(d-1)
2     4    3      Lin x Cubic       d(d-1)
2     4    2      Quad x Quad       C(d,2)
3     3    1      Lin x Lin x Lin   C(d,3)
3     4    2      Lin x Lin x Quad  3 C(d,3)
```

$$p = 1 + 3d + 3d(d-1) + \tfrac{2}{3}\,d(d-1)(d-2)$$
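A minimal sketch that enumerates these multi-indices and checks the count $p$ (bounds as on this slide):

```python
from itertools import combinations, product

def multi_indices(d, B0=3, B1=4, Binf=3):
    """All r with Rank ||r||_0 <= B0, Degree ||r||_1 <= B1, Order ||r||_inf <= Binf."""
    out = [(0,) * d]                        # the constant basis function
    for k in range(1, B0 + 1):              # k = number of active coordinates
        for coords in combinations(range(d), k):
            for degs in product(range(1, Binf + 1), repeat=k):
                if sum(degs) <= B1:
                    r = [0] * d
                    for j, dg in zip(coords, degs):
                        r[j] = dg
                    out.append(tuple(r))
    return out

d = 6
p = len(multi_indices(d))
# Compare with the closed form from the slide (both print 189 for d = 6).
print(p, 1 + 3*d + 3*d*(d - 1) + (2 * d * (d - 1) * (d - 2)) // 3)
```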
Interpretation
--------------

- Variance of $f$ is $\sum_{r \ne 0} \beta_r^2 + \int \varepsilon(x)^2\,dx$
- Importance of a set $S$ is $\sum_{r \in S} \beta_r^2$
- Estimate it by $\sum_{r \in S} \bigl( \tilde\beta_r^2 - \widehat{\mathrm{Var}}(\tilde\beta_r) \bigr)$

Subsets of interest include:

- $\{r \mid r(1) > 0\}$: involves $x_1$
- $\{r \mid r(1) = 0\}$: does not involve $x_1$
- $\{r \mid \|r\|_0 = 1\}$: additive part
- $\{r \mid 0 < \|r\|_0 \le k\}$: interactions up to order $k$
- $\{r \mid 0 < \|r\|_1 \le k\}$: of degree at most $k$
- also, $\{r \mid r(j) = 0,\ j > 3\}$: uses only the first 3 inputs

Approximation through integration
---------------------------------

Define: $Z(x) = (\psi_0(x), \dots, \psi_{p-1}(x))^T$.

The optimal $\beta$ is

$$\beta = \arg\min_\beta \int \bigl( f(x) - Z(x)^T\beta \bigr)^2\,dx = \Bigl( \int Z(x)Z(x)^T\,dx \Bigr)^{-1} \int Z(x)f(x)\,dx$$

$$\mathrm{ISE} = \int \bigl( f(x) - Z(x)^T\beta \bigr)^2\,dx$$
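A minimal sketch of the subset bookkeeping from the Interpretation slide; the coefficient values here are made up for illustration:

```python
def importance(beta, predicate):
    """Sum of beta_r^2 over multi-indices r satisfying the predicate."""
    return sum(b * b for r, b in beta.items() if predicate(r))

# Toy estimated coefficients keyed by multi-index r (d = 2, illustrative).
beta = {(0, 0): 2.0, (1, 0): 0.7, (0, 1): 0.3, (2, 0): 0.2, (1, 1): 0.1}

involves_x1 = lambda r: r[0] > 0                            # r(1) > 0
additive    = lambda r: sum(1 for rj in r if rj > 0) == 1   # ||r||_0 = 1
print(importance(beta, involves_x1))  # variance share touching x1: 0.54
print(importance(beta, additive))     # additive part (constant excluded): 0.62
```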
Regression and quasi-regression
-------------------------------

$$\beta = \Bigl( \int Z(x)Z(x)^T\,dx \Bigr)^{-1} \int Z(x)f(x)\,dx = \int Z(x)f(x)\,dx$$

by orthogonality.

Observations: $x_i \sim U[0,1]^d$, $1 \le i \le n$, IID.

Regression:

$$\hat\beta = (Z^T Z)^{-1} Z^T Y, \qquad Z \in \mathbb{R}^{n \times p},\ Y \in \mathbb{R}^{n \times 1}$$

Quasi-regression:

$$\tilde\beta = \frac{1}{n} Z^T Y$$

Precursors of quasi-regression
------------------------------

- Quasi-interpolation (Chui & Diamond; Wang): "ignore the denominator" $(Z^T Z)$ to get fast approximate interpolation.
- Computer experiments: Koehler and Owen (1996) advocate quasi-regression for computer experiments.
- Efromovich (1992) applies quasi-regression to sinusoids on $[0,1]$.
- Owen (1992) describes quasi-regression for Latin hypercube sampling.
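A minimal sketch contrasting the two estimators on a toy orthonormal basis (the basis and test function are ours, not the talk's):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10000

x = rng.random(n)
# Orthonormal basis on [0,1]: 1, sqrt(3)(2x-1), sqrt(5)(6x^2-6x+1).
Z = np.column_stack([np.ones(n),
                     np.sqrt(3.0) * (2.0 * x - 1.0),
                     np.sqrt(5.0) * (6.0 * x**2 - 6.0 * x + 1.0)])
y = np.exp(x)  # the "black box" f

beta_hat = np.linalg.lstsq(Z, y, rcond=None)[0]  # regression: solve Z beta = y
beta_tilde = Z.T @ y / n                         # quasi-regression: Z^T Y / n
print(beta_hat)    # the two agree up to O(n^{-1/2}) Monte Carlo noise
print(beta_tilde)
```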
Accuracy in Monte Carlo sampling
--------------------------------

Define:

$$\varepsilon_i = Y_i - Z_i\beta$$

$$\delta_{p \times 1} = \frac{1}{n} \sum_{i=1}^n Z_i (Y_i - Z_i\beta)$$

$$A_{p \times p} = \frac{1}{n} \sum_{i=1}^n Z_i Z_i^T - I$$

Now

$$\tilde\beta = \frac{1}{n} Z^T Y = \frac{1}{n} Z^T (Z\beta + \varepsilon) = \beta + \delta + A\beta$$

$$\hat\beta = (Z^T Z)^{-1} Z^T (Z\beta + \varepsilon) = \beta + (Z^T Z)^{-1} Z^T \varepsilon = \beta + (I + A)^{-1}\delta$$

$$= \beta + (I - A + A^2 - A^3 + \cdots)\delta \doteq \beta + \delta - A\delta$$

Fast stable updates
-------------------

Define:

$$\tilde\beta_r^{(n)} \equiv \frac{1}{n} \sum_{i=1}^n \psi_r(x_i) f(x_i)$$

$$S_r^{(n)} \equiv \frac{1}{n} \sum_{i=1}^n \bigl( \psi_r(x_i) f(x_i) - \tilde\beta_r^{(n)} \bigr)^2$$

Then:

$$\tilde\beta_r^{(n)} = \tilde\beta_r^{(n-1)} + \frac{1}{n}\Bigl[ \psi_r(x_n) f(x_n) - \tilde\beta_r^{(n-1)} \Bigr]$$

$$S_r^{(n)} = \frac{n-1}{n} S_r^{(n-1)} + \frac{n-1}{n^2}\Bigl[ \psi_r(x_n) f(x_n) - \tilde\beta_r^{(n-1)} \Bigr]^2$$

Chan, Golub, LeVeque, who use $n S_r^{(n)}$.

$$E\Bigl( \frac{1}{n-1} S_r^{(n)} \Bigr) = \mathrm{Var}\bigl( \tilde\beta_r^{(n)} \bigr)$$
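A minimal sketch (our naming) of these recursions in Python, tracking one coefficient:

```python
import random

class RunningCoefficient:
    """Track beta_r^{(n)} and S_r^{(n)} by the stable update rules above."""
    def __init__(self):
        self.n, self.beta, self.S = 0, 0.0, 0.0

    def update(self, t):
        # t = psi_r(x_n) * f(x_n) for the newest point x_n
        self.n += 1
        n = self.n
        delta = t - self.beta  # uses beta^{(n-1)}
        self.beta += delta / n
        self.S = (n - 1) / n * self.S + (n - 1) / n**2 * delta**2

    def var_beta(self):
        # unbiased estimate of Var(beta^{(n)}); needs n >= 2
        return self.S / (self.n - 1)

rc = RunningCoefficient()
for _ in range(100000):
    x = random.random()
    rc.update((3.0 ** 0.5) * (2.0 * x - 1.0) * x ** 2)  # psi(x) f(x), f(x) = x^2
print(rc.beta, rc.var_beta())  # beta -> sqrt(3)/6 ~ 0.2887
```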
Updatable accuracy estimates
----------------------------

Predict $f(x_n)$ by $\tilde f_{n-1}(x_n)$; $x_n$ is independent of $\tilde f_{n-1}$.

Average recent squared errors:

$$\widehat{\mathrm{ISE}}^{(n_m)} = \frac{1}{n_m - n_{m-1}} \sum_{i=n_{m-1}+1}^{n_m} \bigl( f(x_i) - \tilde f_{i-1}(x_i) \bigr)^2$$

on the subsequence $n_m = m(m+1)/2$; this estimates the average ISE over the most recent $\approx \sqrt{2n}$ values.

Diagnostic: large LOF and small $\sum_r \widehat{\mathrm{Var}}(\tilde\beta_r)$ $\Rightarrow$ need a bigger basis.

Presented as lack-of-fit: $1 - R^2$
-----------------------------------

$$\mathrm{LOF} = \frac{\widehat{\mathrm{ISE}}}{\widehat{\mathrm{Var}}} = \frac{\mathrm{AVG}(f - \tilde f)^2}{\mathrm{AVG}(f - \tilde\beta_0)^2}$$

```
log10(LOF)    R^2
   -4        99.99%
   -3        99.9%
   -2        99%
   -1        90%
    0        0%
    1        -900%
```
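A minimal sketch (our naming) of the block-averaged error estimate on the subsequence $n_m = m(m+1)/2$, assuming the squared prediction errors are recorded as they stream in:

```python
def ise_trace(sq_errors):
    """Average (f(x_i) - ftilde_{i-1}(x_i))^2 over blocks (n_{m-1}, n_m]
    with n_m = m(m+1)/2, so each estimate covers ~sqrt(2n) recent points."""
    trace, m = [], 1
    while m * (m + 1) // 2 <= len(sq_errors):
        lo, hi = (m - 1) * m // 2, m * (m + 1) // 2
        trace.append(sum(sq_errors[lo:hi]) / (hi - lo))
        m += 1
    return trace

# Dividing each entry by the sample variance of f gives LOF = 1 - R^2.
print(ise_trace([0.9, 0.5, 0.4, 0.2, 0.2, 0.1]))  # [0.9, 0.45, 0.1666...]
```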
Costs of algebra
----------------

```
               Kriging        Regression   Quasi-regression
Time           O(n^3 + p^3)   O(n p^2)     O(n p)
Space          O(n^2 + p^2)   O(p^2)       O(p)
Footprint      O(n^5 + p^5)   O(n p^4)     O(n p^2)
Cost of f      High           High         Low
Dimension      Low            High         High
```

- Updating: easy for quasi-regression; for kriging, [good luck].
- Quasi-regression allows larger $n$ or much larger $p$.
- $p = 1{,}000{,}000$ is doable by quasi-regression, not by regression.

Incorporating shrinkage
-----------------------

Owen, Ann. Stat. 2000; Hoerl & Kennard; Efromovich; Donoho & Johnstone; Beran; ...

$$\tilde f_{\gamma,n}(x) = \sum_r \gamma_{r,n}\, \tilde\beta_{r,n}\, \psi_r(x), \qquad \gamma_{r,n} \in [0,1]$$

Optimally,

$$\gamma_{r,n} = \frac{\beta_r^2}{\beta_r^2 + \mathrm{Var}(\tilde\beta_{r,n})}.$$

Shrinkage can reduce prediction variance. We use the data to estimate $\gamma_{r,n}$, e.g.

$$\hat\gamma_{r,n} = \frac{\tilde\beta_r^{(n-1)2}}{\tilde\beta_r^{(n-1)2} + S_r^{(n-1)}}.$$
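A minimal sketch of the plug-in weight from this slide:

```python
def gamma_hat(beta_tilde, S):
    """The slide's example plug-in shrinkage weight: coefficients well above
    their sampling noise keep gamma near 1; noise-level ones shrink to 0."""
    return beta_tilde**2 / (beta_tilde**2 + S)

print(gamma_hat(0.50, 0.02))  # strong coefficient: ~0.926
print(gamma_hat(0.01, 0.02))  # noise-level coefficient: ~0.005
```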
Exploiting residuals
--------------------

For $r \ne 0$: $\beta_r(f) = \beta_r(f - c)$ for any $c \in \mathbb{R}$, but

$$\mathrm{Var}\Bigl( \frac{1}{n} \sum_{i=1}^n \psi_r(x_i)\,(f(x_i) - c) \Bigr) \text{ depends on } c.$$

Try $c = \tilde\beta_0$. More generally,

$$\tilde\beta_r^{(n)} = \frac{1}{n} \sum_{i=1}^n \psi_r(x_i) \Bigl( f(x_i) - \sum_{s \ne r} \gamma_{s,i-1}\, \tilde\beta_s^{(i-1)}\, \psi_s(x_i) \Bigr)$$

- Original quasi-regression: $\gamma_{r,i} = 0$, or $\gamma_{r,i} = 1_{r=0}$, or $\gamma_{r,i} = 1_{r \in R}$.
- Self-consistent quasi-regression: $\gamma_{r,i} = \hat\gamma_{r,i} \in [0,1]$.
- Bounding $\int \tilde f^2$ by the sample variance eliminates explosive feedback.
- $\tilde\beta_r$ and $S_r$ are still updatable.
- NB: $n(\tilde\beta_r^{(n)} - \beta_r)$ is a martingale in $n$.

N-net example
-------------

- $f(x)$ is the net's prediction of $\log_{10}(\mathrm{perf})$.
- $d = 6$; the $\phi_r$ are Legendre polynomials, the $\psi_r$ tensor products.
- $\|r\|_0 \le 3$, $\|r\|_1 \le 8$, $\|r\|_\infty \le 4 \Rightarrow p = 1145$.
- The net is fast, so $n = 500{,}000$ (about 3 min on an 800MHz PC in Java).
Neural net accuracy
-------------------

[Figure: LOF versus sample size on log-log axes (LOF from 10^-3 to 10^3; sample size from 100 to 100,000). Shrinkage applied after n = 600 gives the lower curve.]

Number of bases is 1145.

Neural net results
------------------

```
###### Anova at Iteration 500000 ######
1-RSquare (LOF) is 0.0011707 at iteration 499500
Beta[0] (constant factor) is 2.0717
Sample mean is 2.0719, sample variance is 0.14359
Unbiased estimates of dimension variances
0.11441 0.026592 0.0027723 0.0 0.0 0.0
Dimension Probabilities
(Ratios of dimension variances to sample variance)
0.79676 0.18518 0.019307 0.0 0.0 0.0
```

Neural net results, ctd
-----------------------

Variances on one and two variables / sample variance (lower triangle; rows and columns syct, mmin, mmax, cach, chmin, chdel):

```
syct   0.5177106
mmin   9.292114E-4  0.01069175
mmax   0.008898125  0.02590950   0.08782891
cach   0.05507833   0.006469443  0.05429608   0.1301971
chmin  0.01091619   6.212815E-4  0.008541468  0.01008703   0.03679156
chdel  2.480628E-4  4.889575E-4  2.725553E-4  0.001473632  2.348261E-4
```

- Biggest main effect: syct, at 52%.
- Biggest interaction: syct × cach, at 5.5%.

Results
-------

The fitted n-net function $f(x)$ (see "The n-net function" above).

Additive component of $\tilde f$:

```
Var    syct   mmin   mmax   cach   chmin  delch  total
%      0.520  0.011  0.088  0.131  0.037  0.009  0.797
```
Effect of cycle time
--------------------

[Figure: estimated additive effect of cycle time (syct) over [0,1]; vertical scale roughly -0.4 to 0.4; the effect is close to linear and decreasing.]

```
Degree       1       2       3        4
Coef        -0.272  -0.030   0.00242  0.0000777
% of ftilde 51.38    0.630   0.00041  0.000004
```

Caveats
-------

- Important variables in $E(Y \mid X = x)$ are not necessarily causal.
- The same holds for $f(x) = \hat E(Y \mid X = x)$ and for $\tilde f$.
- The training $x$ are not from a product measure (nor are the test $x$).

Non-product measure issues:

- False positives: $f$ and $\tilde f$ might have large structure in a region with no data.
- False negatives: $\int (\tilde f - f)^2\,dx$ might be dominated by $x$ away from the data. A small error and a simple model might mask a poor fit in the training region. (It is easy to compare $f$ and $\tilde f$ on the training data.)
- The functions $\psi_r$, and hence the estimated ANOVA components, are correlated under the empirical distribution.
- Using the product of the empirical margins mitigates the problem (only slightly).
CPU inputs
----------

[Figure: scatterplot matrix of the training inputs syct, mmin, mmax, cach, chmin, chdel, each scaled to [0,1].]

Biggest interaction
-------------------

Cycle time × cache size: about 5.5% of $\tilde f$.
Biggest interaction
-------------------

[Figure: perspective plot of the cycle time × cache size interaction; syct and cach axes on [0,1], surface values roughly -0.06 to 0.04.]

$\|r\|_0 \le 3$, $\|r\|_1 \le 8$, $\|r\|_\infty \le 4$

2nd biggest interaction
-----------------------

Cycle time × max main memory: about 5.4% of $\tilde f$.
2nd biggest interaction
-----------------------

[Figure: perspective plot of the cycle time × max main memory interaction; syct and mmax axes on [0,1].]

$\|r\|_0 \le 3$, $\|r\|_1 \le 8$, $\|r\|_\infty \le 4$

N-net conclusions
-----------------

1. $\tilde f$ is a fairly simple function with respect to $U[0,1]^6$.
2. $x_1$ is most important, and nearly linear.
3. At least one interaction is not supported by the data.
4. Non-random cross-validation (leave out clusters) might help.
Next directions
---------------

1. MARS-like dynamic choice of basis
2. Comparisons of $f$ and $\tilde f$ on training data
3. Decompositions of $\tilde f$ under empirical measures
4. Distinguishing $f$ structure from $\tilde f$ artifacts
5. More types of statistical/ML black boxes
6. Missing data (arise in function mining too)
7. Stopping rules
8. More basis function choices
9. Block diagonal or banded $E(Z^T Z)$ (e.g. $B$-splines)
10. Examples with noise ($\approx$ unusable basis functions)

Robot arm function
------------------

The robot arm has 4 joints, with lengths $L_j$ and angles $\theta_j$. The shoulder is at $(0,0)$ and the hand is at $(u, v)$:

$$u = \sum_{j=1}^4 L_j \cos\Bigl( \sum_{k=1}^j \theta_k \Bigr), \qquad v = \sum_{j=1}^4 L_j \sin\Bigl( \sum_{k=1}^j \theta_k \Bigr)$$

$f$ is the shoulder to hand distance:

$$f\bigl( L_1, L_2, L_3, L_4, \theta_1, \theta_2, \theta_3, \theta_4 \bigr) = \sqrt{u^2 + v^2}$$

with $0 \le L_j \le 1$ and $0 \le \theta_j \le 2\pi$.
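A minimal sketch of the robot arm function in Python:

```python
import math

def robot_arm(L, theta):
    """Shoulder-to-hand distance; L and theta are the four lengths in [0,1]
    and four angles in [0, 2*pi], with angles accumulated along the arm."""
    u = sum(Lj * math.cos(sum(theta[:j + 1])) for j, Lj in enumerate(L))
    v = sum(Lj * math.sin(sum(theta[:j + 1])) for j, Lj in enumerate(L))
    return math.hypot(u, v)

print(robot_arm([1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0]))  # fully extended: 4.0
```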