
# Learning Bounds for Importance Weighting

April 15, 2015
## Introduction

- Often, the training distribution does not match the testing distribution
- We want to utilize information about the test distribution
- Goal: correct the bias, or discrepancy, between the training and testing distributions
## Importance Weighting

- Labeled training data drawn from a source distribution Q
- Unlabeled test data drawn from a target distribution P
- Idea: weight the cost of errors on training instances
- Common definition of the weight for a point x: $w(x) = P(x)/Q(x)$
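To make the weighting concrete, here is a minimal Python sketch of the importance-weighted empirical risk $\frac{1}{m}\sum_i w(x_i) L_h(x_i)$. The discrete domain, the distributions, and the loss vector are illustrative choices of mine, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup (my choice, not from the slides): a 3-point domain
# where both the source Q and the target P are known exactly.
P = np.array([0.1, 0.2, 0.7])      # target (test) distribution
Q = np.array([0.5, 0.3, 0.2])      # source (training) distribution
loss = np.array([0.0, 1.0, 1.0])   # loss L_h(x) of some fixed hypothesis h

w = P / Q                          # importance weights w(x) = P(x)/Q(x)

m = 200_000
x = rng.choice(3, size=m, p=Q)     # training sample drawn from Q

# Importance-weighted empirical risk: (1/m) * sum_i w(x_i) L_h(x_i)
R_hat_w = np.mean(w[x] * loss[x])
R_true = np.sum(P * loss)          # target risk R(h) = E_P[L_h]
print(R_hat_w, R_true)
```

Reweighting makes the estimate unbiased for the target risk even though every sample was drawn from Q.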

## Importance Weighting

- A reasonable method, but it sometimes does not work
- Can we give generalization bounds for this method?
- When does domain adaptation work? When does it not?
- How should we weight the costs?
## Overview

- Preliminaries
- Learning guarantee when the loss is bounded
- Learning guarantee when the loss is unbounded but its second moment is bounded
- Algorithm
## Preliminaries: Rényi divergence

For $\alpha \ge 0$, the Rényi divergence $D_\alpha(P \| Q)$ between distributions P and Q is

$$
D_\alpha(P \| Q) = \frac{1}{\alpha - 1} \log_2 \sum_x P(x) \left[ \frac{P(x)}{Q(x)} \right]^{\alpha - 1}
$$

$$
d_\alpha(P \| Q) = 2^{D_\alpha(P \| Q)} = \left[ \sum_x \frac{P^\alpha(x)}{Q^{\alpha - 1}(x)} \right]^{\frac{1}{\alpha - 1}}
$$

- A measure of the information lost when Q is used to approximate P
- $D_\alpha(P \| Q) = 0$ iff $P = Q$
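A direct transcription of $d_\alpha$ for discrete distributions (the formula above is for $\alpha \ne 1$; the $\alpha \to 1$ limit recovers KL divergence). The two distributions below are illustrative choices of mine:

```python
import numpy as np

def d_alpha(P, Q, alpha):
    """Exponentiated Renyi divergence d_alpha(P||Q) = 2**D_alpha(P||Q)."""
    return np.sum(P**alpha / Q**(alpha - 1.0)) ** (1.0 / (alpha - 1.0))

P = np.array([0.1, 0.2, 0.7])
Q = np.array([0.5, 0.3, 0.2])

d2 = d_alpha(P, Q, 2)          # order 2: sum_x P(x)^2 / Q(x)
D2 = np.log2(d2)               # Renyi divergence D_2(P||Q) in bits
print(d2, D2)

# D_alpha(P||P) = 0, i.e. d_alpha(P||P) = 1, for any order
print(d_alpha(P, P, 2))
```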

## Preliminaries: Importance weights

Lemma 1:

$$
\mathbb{E}_Q[w] = 1, \qquad \mathbb{E}_Q[w^2] = d_2(P \| Q), \qquad \sigma^2(w) = d_2(P \| Q) - 1
$$

Proof:

$$
\mathbb{E}_Q[w^2] = \sum_{x \in X} w^2(x) Q(x) = \sum_{x \in X} \left( \frac{P(x)}{Q(x)} \right)^2 Q(x) = d_2(P \| Q)
$$

Lemma 2: For all $\alpha > 0$ and any hypothesis h with loss bounded by 1,

$$
\mathbb{E}_Q[w^2(x) L_h^2(x)] \le d_{\alpha + 1}(P \| Q)\, R(h)^{1 - \frac{1}{\alpha}}
$$
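The identities of Lemma 1, and the Lemma 2 inequality in the $\alpha = 1$ case (where the right-hand side reduces to $d_2(P\|Q)$), can be checked exactly on a small discrete example; the distributions and loss values below are my illustrative choices:

```python
import numpy as np

P = np.array([0.1, 0.2, 0.7])
Q = np.array([0.5, 0.3, 0.2])
loss = np.array([0.3, 0.5, 1.0])   # some loss bounded by B = 1

w = P / Q
E_w = np.sum(Q * w)                # E_Q[w]   (Lemma 1: equals 1)
E_w2 = np.sum(Q * w**2)            # E_Q[w^2] (Lemma 1: equals d_2(P||Q))
d2 = np.sum(P**2 / Q)              # d_2(P||Q) computed directly

# Lemma 2 with alpha = 1: E_Q[w^2 L_h^2] <= d_2(P||Q) * R(h)^0 = d_2(P||Q)
lhs = np.sum(Q * w**2 * loss**2)
print(E_w, E_w2, d2, lhs)
```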

## Preliminaries: Importance weights

Hölder's inequality (Jin, Wilson, and Nobel, 2014): let $\frac{1}{p} + \frac{1}{q} = 1$; then

$$
\sum_x |a_x b_x| \le \left( \sum_x |a_x|^p \right)^{\frac{1}{p}} \left( \sum_x |b_x|^q \right)^{\frac{1}{q}}
$$
## Preliminaries: Importance weights

Proof of Lemma 2: let the loss be bounded by B = 1. Then

$$
\mathbb{E}_{x \sim Q}[w^2(x) L_h^2(x)] = \sum_x Q(x) \left[ \frac{P(x)}{Q(x)} \right]^2 L_h^2(x) = \sum_x P(x) \frac{P(x)}{Q(x)} L_h^2(x)
$$

$$
\le \left[ \sum_x P(x) \left[ \frac{P(x)}{Q(x)} \right]^\alpha \right]^{\frac{1}{\alpha}} \left[ \sum_x P(x) L_h^{\frac{2\alpha}{\alpha - 1}}(x) \right]^{1 - \frac{1}{\alpha}}
$$

$$
= d_{\alpha + 1}(P \| Q) \left[ \sum_x P(x) L_h(x) L_h^{\frac{\alpha + 1}{\alpha - 1}}(x) \right]^{1 - \frac{1}{\alpha}}
$$

$$
\le d_{\alpha + 1}(P \| Q)\, R(h)^{1 - \frac{1}{\alpha}} B^{1 + \frac{1}{\alpha}} = d_{\alpha + 1}(P \| Q)\, R(h)^{1 - \frac{1}{\alpha}}
$$
## Learning Guarantees: Bounded case

Let $\sup_x w(x) = \sup_x \frac{P(x)}{Q(x)} = d_\infty(P \| Q) = M$ and assume $d_\infty(P \| Q) < +\infty$. Fix $h \in H$. Then, for any $\delta > 0$, with probability at least $1 - \delta$,

$$
|R(h) - \widehat{R}_w(h)| \le M \sqrt{\frac{\log \frac{2}{\delta}}{2m}}
$$

M can be very large, so we naturally want a more favorable bound...
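A quick Monte Carlo check of this Hoeffding-style bound, using my illustrative discrete setup (not from the slides): over repeated samples from Q, the deviation $|R(h) - \widehat{R}_w(h)|$ should exceed $M\sqrt{\log(2/\delta)/(2m)}$ in at most a $\delta$ fraction of trials.

```python
import numpy as np

rng = np.random.default_rng(1)

P = np.array([0.1, 0.2, 0.7])
Q = np.array([0.5, 0.3, 0.2])
loss = np.array([0.0, 1.0, 1.0])

w = P / Q
M = w.max()                        # d_inf(P||Q) = sup_x P(x)/Q(x)
R = np.sum(P * loss)               # true target risk R(h)

m, delta, trials = 2000, 0.05, 500
eps = M * np.sqrt(np.log(2 / delta) / (2 * m))

hits = 0
for _ in range(trials):
    x = rng.choice(3, size=m, p=Q)
    R_hat_w = np.mean(w[x] * loss[x])
    hits += abs(R - R_hat_w) <= eps
print(hits / trials)               # empirical coverage; should be >= 1 - delta
```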

## Overview

- Preliminaries
- Learning guarantee when the loss is bounded
- Learning guarantee when the loss is unbounded but its second moment is bounded
- Algorithm
## Learning Guarantees: Bounded case

Theorem 1: Fix $h \in H$. For any $\alpha \ge 1$ and any $\delta > 0$, with probability at least $1 - \delta$, the following bound holds:

$$
R(h) \le \widehat{R}_w(h) + \frac{2M \log \frac{1}{\delta}}{3m} + \sqrt{\frac{2 \left[ d_{\alpha + 1}(P \| Q)\, R(h)^{1 - \frac{1}{\alpha}} - R(h)^2 \right] \log \frac{1}{\delta}}{m}}
$$
## Learning Guarantees: Bounded case

Bernstein's inequality (Bernstein, 1946): if the $x_i$ are independent, zero-mean, and satisfy $|x_i| \le M$, then

$$
\Pr\left[ \frac{1}{n} \sum_{i=1}^{n} x_i \ge \epsilon \right] \le \exp\left( -\frac{n \epsilon^2}{2\sigma^2 + 2M\epsilon/3} \right)
$$
## Learning Guarantees: Bounded case

Proof of Theorem 1: let Z be the random variable $w(x) L_h(x) - R(h)$, so that $|Z| \le M$. By Lemma 2, the variance of Z can be bounded in terms of $d_{\alpha + 1}(P \| Q)$:

$$
\sigma^2(Z) = \mathbb{E}_Q[w^2(x) L_h^2(x)] - R(h)^2 \le d_{\alpha + 1}(P \| Q)\, R(h)^{1 - \frac{1}{\alpha}} - R(h)^2
$$

Then, by Bernstein's inequality,

$$
\Pr[R(h) - \widehat{R}_w(h) > \epsilon] \le \exp\left( -\frac{m \epsilon^2 / 2}{\sigma^2(Z) + \epsilon M / 3} \right)
$$
## Learning Guarantees: Bounded case

Setting $\delta$ to match this upper bound, with probability at least $1 - \delta$,

$$
R(h) \le \widehat{R}_w(h) + \frac{M \log \frac{1}{\delta}}{3m} + \sqrt{\frac{M^2 \log^2 \frac{1}{\delta}}{9m^2} + \frac{2 \sigma^2(Z) \log \frac{1}{\delta}}{m}}
$$

$$
\le \widehat{R}_w(h) + \frac{2M \log \frac{1}{\delta}}{3m} + \sqrt{\frac{2 \sigma^2(Z) \log \frac{1}{\delta}}{m}}
$$

using $\sqrt{a + b} \le \sqrt{a} + \sqrt{b}$.
Learning Guarantees: Bounded case

Theorem 2: Let H be a nite hypothesis set. Then for any > 0, with
probability at least 1 , the following holds for the importance
weighting method:

w (h) + 2M(log|H| + log ) + 2d2 (P||Q)(log |H| = log
1 1
R(h) R
3m m

15
## Learning Guarantees: Bounded case

Theorem 2 corresponds to the case $\alpha = 1$. Note that Theorem 1 simplifies when $\alpha = 1$:

$$
R(h) \le \widehat{R}_w(h) + \frac{2M \log \frac{1}{\delta}}{3m} + \sqrt{\frac{2\, d_2(P \| Q) \log \frac{1}{\delta}}{m}}
$$

Theorem 2 then follows by a union bound over the finite hypothesis set H, which introduces the $\log |H|$ terms.
## Learning Guarantees: Bounded case

Proposition 2 (lower bound): Assume $M < \infty$ and $\sigma^2(w)/M^2 \ge 1/m$, and assume there exists $h_0 \in H$ such that $L_{h_0}(x) = 1$ for all x. Then there exists an absolute constant c, $c = 2/41^2$, such that

$$
\Pr\left[ \sup_{h \in H} |R(h) - \widehat{R}_w(h)| \ge \sqrt{\frac{d_2(P \| Q) - 1}{4m}} \right] \ge c > 0
$$

The proof follows from Theorem 9 of Cortes, Mansour, and Mohri (2010).
## Overview

- Preliminaries
- Learning guarantee when the loss is bounded
- Learning guarantee when the loss is unbounded but its second moment is bounded
- Algorithm
## Learning Guarantees: Unbounded case

$d_\infty(P \| Q) < \infty$ does not always hold. Assume P and Q are Gaussian with standard deviations $\sigma_P$ and $\sigma_Q$ and means $\mu_P$ and $\mu_Q$:

$$
\frac{P(x)}{Q(x)} = \frac{\sigma_Q}{\sigma_P} \exp\left[ \frac{\sigma_P^2 (x - \mu_Q)^2 - \sigma_Q^2 (x - \mu_P)^2}{2 \sigma_P^2 \sigma_Q^2} \right]
$$

Thus, even if $\mu_P = \mu_Q$, whenever $\sigma_P > \sigma_Q$ we get $d_\infty(P \| Q) = \sup_x \frac{P(x)}{Q(x)} = \infty$, and Theorem 1 is not informative.

Learning Guarantees: Unbounded case

## However, the variance of the importance weights is bounded.

+ [ ]
Q 2 2 (x )2 P2 (x )2 2
dw (P||Q) = 2 exp Q Q dx
P 2 2P2
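The Gaussian example from the slides can be reproduced numerically. Below, the means are equal and $\sigma_Q < \sigma_P < \sqrt{2}\,\sigma_Q$, so $\sup_x w(x) = \infty$ while $\mathbb{E}_Q[w^2] = d_2(P\|Q)$ stays finite; the specific values $\sigma_P = 1.2$, $\sigma_Q = 1$ are my illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(2)

mu, sigma_P, sigma_Q = 0.0, 1.2, 1.0   # equal means, sigma_Q < sigma_P < sqrt(2)*sigma_Q

def w(x):
    """Importance weight P(x)/Q(x) for the two Gaussians (equal means)."""
    return (sigma_Q / sigma_P) * np.exp(
        (sigma_P**2 - sigma_Q**2) * (x - mu) ** 2 / (2 * sigma_P**2 * sigma_Q**2)
    )

x = rng.normal(mu, sigma_Q, size=200_000)   # training sample drawn from Q

print(w(np.array([10.0]))[0])   # w grows without bound far from the mean
print(w(x).max())               # a few extreme sample points get large weights
print(np.mean(w(x)))            # E_Q[w] = 1
print(np.mean(w(x)**2))         # Monte Carlo estimate of d_2(P||Q): finite
```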

## Learning Guarantees: Unbounded case

- Q provides some useful information about P
- But a sample from Q has only a few points far from the mean
- Those few extreme sample points receive very large weights
- Likewise, if $\mu_P = \mu_Q$ but $\sigma_Q \gg \sigma_P$, most sample points would receive negligible weight
Learning Guarantees: Unbounded case

## Theorem 3: Let H be a hypothesis set such that

Pdim({Lh (x) : H H}) = p < . Assume that d2 (P||Q) < + and
w(x) = 0 for all x. Then for > 0, with probability at least 1 , the
following holds:

5/4
3 p log
2me
p + log
4
R(h) Rw (h) + 2
8
d2 (P||Q)
m

22
## Learning Guarantees: Unbounded case

Proof outline (full proof in Cortes, Mansour, and Mohri, 2010):

$$
\Pr\left[ \sup_{h \in H} \frac{\mathbb{E}[L_h] - \widehat{\mathbb{E}}[L_h]}{\sqrt{\mathbb{E}[L_h^2]}} > \epsilon \left( 2 + \sqrt{\log \tfrac{1}{\epsilon}} \right) \right] \le \Pr\left[ \sup_{h \in H,\, t} \frac{\Pr[L_h > t] - \widehat{\Pr}[L_h > t]}{\sqrt{\Pr[L_h > t]}} > \epsilon \right]
$$

A Vapnik-style relative deviation bound controls the right-hand side:

$$
\Pr\left[ \sup_{h \in H} \frac{R(h) - \widehat{R}(h)}{\sqrt{R(h)}} > \epsilon \right] \le 4\, \Pi_H(2m) \exp\left( -\frac{m \epsilon^2}{4} \right)
$$

Combining the two, and bounding the growth function in terms of the pseudo-dimension p,

$$
\Pr\left[ \sup_{h \in H} \frac{\mathbb{E}[L_h(x)] - \widehat{\mathbb{E}}[L_h(x)]}{\sqrt{\mathbb{E}[L_h^2(x)]}} > \epsilon \left( 2 + \sqrt{\log \tfrac{1}{\epsilon}} \right) \right] \le 4 \exp\left( p \log \frac{2em}{p} - \frac{m \epsilon^2}{4} \right)
$$

$$
\Pr\left[ \sup_{h \in H} \frac{\mathbb{E}[L_h(x)] - \widehat{\mathbb{E}}[L_h(x)]}{\sqrt{\mathbb{E}[L_h^2(x)]}} > \epsilon \right] \le 4 \exp\left( p \log \frac{2em}{p} - \frac{m \epsilon^{8/3}}{4^{5/3}} \right)
$$

which yields, with probability at least $1 - \delta$,

$$
|\mathbb{E}[L_h(x)] - \widehat{\mathbb{E}}[L_h(x)]| \le 2^{5/4} \max\left\{ \sqrt{\mathbb{E}[L_h^2(x)]},\, \sqrt{\widehat{\mathbb{E}}[L_h^2(x)]} \right\} \left[ \frac{p \log \frac{2me}{p} + \log \frac{8}{\delta}}{m} \right]^{\frac{3}{8}}
$$
## Learning Guarantees: Unbounded case

Thus, we can show the following:

$$
\Pr\left[ \sup_{h \in H} \frac{R(h) - \widehat{R}_w(h)}{\sqrt{d_2(P \| Q)}} > \epsilon \right] \le 4 \exp\left( p \log \frac{2em}{p} - \frac{m \epsilon^{8/3}}{4^{5/3}} \right)
$$

where $p = \mathrm{Pdim}(\{L_h(x) : h \in H\})$ is also the pseudo-dimension of $H' = \{w(x) L_h(x) : h \in H\}$. Indeed, since $w(x) \ne 0$, a set is shattered by $H'$ iff it is shattered by $\{L_h : h \in H\}$: if $H'$ shatters a set A with witnesses $r_i$, then $\{L_h : h \in H\}$ shatters A with witnesses $s_i = r_i / w(x_i)$, and conversely.
## Overview

- Preliminaries
- Learning guarantee when the loss is bounded
- Learning guarantee when the loss is unbounded but its second moment is bounded
- Algorithm
## Alternative algorithms

We can generalize this analysis to an arbitrary function $u : X \to \mathbb{R}$, $u > 0$. Let $\widehat{R}_u(h) = \frac{1}{m} \sum_{i=1}^{m} u(x_i) L_h(x_i)$ and let $\widehat{Q}$ be the empirical distribution.

Theorem 4: Let H be a hypothesis set such that $\mathrm{Pdim}(\{L_h(x) : h \in H\}) = p < \infty$. Assume that $0 < \mathbb{E}_Q[u^2(x)] < +\infty$ and $u(x) \ne 0$ for all x. Then, for any $\delta > 0$, with probability at least $1 - \delta$,

$$
|R(h) - \widehat{R}_u(h)| \le \left| \mathbb{E}_Q\big[ (w(x) - u(x)) L_h(x) \big] \right| + 2^{5/4} \max\left\{ \sqrt{\mathbb{E}_Q[u^2(x) L_h^2(x)]},\, \sqrt{\mathbb{E}_{\widehat{Q}}[u^2(x) L_h^2(x)]} \right\} \left[ \frac{p \log \frac{2me}{p} + \log \frac{4}{\delta}}{m} \right]^{\frac{3}{8}}
$$
## Alternative algorithms

Functions u other than w can be used to reweight the cost of errors. The idea is to minimize the upper bound of Theorem 4. The variance term satisfies

$$
\max\left\{ \sqrt{\mathbb{E}_Q[u^2 L_h^2]},\, \sqrt{\mathbb{E}_{\widehat{Q}}[u^2 L_h^2]} \right\} \le \sqrt{\mathbb{E}_Q[u^2]} \left( 1 + O(1/\sqrt{m}) \right)
$$

which suggests the optimization problem

$$
\min_{u \in U} \; \mathbb{E}\big[ |w(x) - u(x)| \big] + \sqrt{\mathbb{E}_Q[u^2]}
$$

Trade-off between bias and variance minimization.
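One concrete illustration of this trade-off (my construction, not the algorithm proposed in the paper): restrict u to clipped weights $u_c(x) = \min(w(x), c)$. As the threshold c grows, the bias term $\mathbb{E}_Q[|w - u_c|]$ shrinks while the variance term $\sqrt{\mathbb{E}_Q[u_c^2]}$ grows.

```python
import numpy as np

# Illustrative discrete source/target pair (my choice, not from the slides)
P = np.array([0.1, 0.2, 0.7])
Q = np.array([0.5, 0.3, 0.2])
w = P / Q                                   # weights [0.2, 0.666..., 3.5]

for c in [0.5, 1.0, 2.0, 3.5]:
    u = np.minimum(w, c)                    # clipped weight u_c = min(w, c)
    bias = np.sum(Q * np.abs(w - u))        # bias term E_Q[|w - u_c|]
    var = np.sqrt(np.sum(Q * u**2))         # variance term sqrt(E_Q[u_c^2])
    print(f"c={c:4.1f}  bias={bias:.3f}  variance={var:.3f}  total={bias + var:.3f}")
```

Clipping (truncating) the largest weights is a standard practical instance of this bias-variance trade-off.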
