
Tutorial Part I:
Information theory meets machine learning

Emmanuel Abbe (Princeton University)
Martin Wainwright (UC Berkeley)

June 2015

Introduction

Era of massive data sets: fascinating problems at the interfaces between information theory and statistical machine learning.

Fundamental issues
  Concentration of measure: high-dimensional problems are remarkably predictable
  Curse of dimensionality: without structure, many problems are hopeless
  Low-dimensional structure is essential

Machine learning brings in algorithmic components
  Computational constraints are central
  Memory constraints: need for distributed and decentralized procedures
  Increasing importance of privacy

Historical perspective: info. theory and statistics

Claude Shannon

Andrey Kolmogorov

A rich intersection between information theory and statistics:
1. Hypothesis testing, large deviations
2. Fisher information, Kullback-Leibler divergence
3. Metric entropy and Fano's inequality

Statistical estimation as channel coding

  Codebook: indexed family of probability distributions $\{Q_\theta \mid \theta \in \Omega\}$
  Codeword: nature chooses some $\theta^* \in \Omega$
  Channel: user observes $n$ i.i.d. draws $X_i \sim Q_{\theta^*}$
  Decoding: estimator $X_1^n \mapsto \widehat{\theta}$ such that $\widehat{\theta} \approx \theta^*$

Perspective dating back to Kolmogorov (1950s) with many variations:
  codebooks/codewords: graphs, vectors, matrices, functions, densities, ...
  channels: random graphs, regression models, elementwise probes of vectors/matrices, random projections
  closeness $\widehat{\theta} \approx \theta^*$: exact/partial graph recovery in Hamming distance, $\ell_p$-distances, $L^2(Q)$-distances, sup-norm, etc.
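To make the decoding analogy concrete, here is a minimal sketch (not from the slides) in which the "codebook" is a small finite family of Bernoulli distributions, the "channel" produces n i.i.d. draws, and the "decoder" is a maximum-likelihood rule over the codebook; the particular family and parameter values are illustrative assumptions.

```python
# Minimal sketch of estimation as channel decoding (illustrative assumptions):
# codebook = a finite family of Bernoulli parameters, channel = n i.i.d. draws,
# decoder = maximum-likelihood selection of the index.
import numpy as np

rng = np.random.default_rng(0)
codebook = np.array([0.2, 0.4, 0.6, 0.8])   # indexed family {Q_theta}
theta_star = codebook[2]                    # nature's codeword
n = 200
x = rng.binomial(1, theta_star, size=n)     # n channel uses

# log-likelihood of the sample under each codeword
loglik = x.sum() * np.log(codebook) + (n - x.sum()) * np.log(1 - codebook)
theta_hat = codebook[np.argmax(loglik)]     # decoded codeword
print(theta_hat)
```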

Machine learning: algorithmic issues to forefront!

1. Efficient algorithms are essential
   only (low-order) polynomial-time methods can ever be implemented
   trade-offs between computational complexity and performance?
   fundamental barriers due to computational complexity?

2. Distributed procedures are often needed
   many modern data sets: too large to be stored on a single machine
   need methods that operate separately on pieces of data
   trade-offs between decentralization and performance?
   fundamental barriers due to communication complexity?

3. Privacy and access issues
   conflicts between individual privacy and benefits of aggregation?
   principled information-theoretic formulations of such trade-offs?


Part I of tutorial: Three vignettes


1. Graphical model selection
2. Sparse principal component analysis
3. Structured non-parametric regression and minimax theory


Vignette A: Graphical model selection


Simple motivating example: Epidemic modeling

Disease status of person $s$:
$$x_s = \begin{cases} +1 & \text{if individual } s \text{ is infected} \\ -1 & \text{if individual } s \text{ is healthy} \end{cases}$$

(1) Independent infection:
$$Q(x_1, \ldots, x_5) \;\propto\; \prod_{s=1}^{5} \exp(\theta_s x_s)$$

(2) Cycle-based infection:
$$Q(x_1, \ldots, x_5) \;\propto\; \prod_{s=1}^{5} \exp(\theta_s x_s) \prod_{(s,t) \in C} \exp(\theta_{st} x_s x_t)$$

(3) Full clique infection:
$$Q(x_1, \ldots, x_5) \;\propto\; \prod_{s=1}^{5} \exp(\theta_s x_s) \prod_{s \neq t} \exp(\theta_{st} x_s x_t)$$
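As a concrete illustration (not part of the slides), the following sketch evaluates the cycle-based infection model exactly by enumerating all $2^5$ configurations; the parameter values theta and theta_edge are arbitrary assumptions.

```python
# Exact computation of the cycle-based infection model on 5 nodes
# (illustrative sketch; the parameter values are arbitrary assumptions).
import itertools
import numpy as np

p = 5
theta = 0.1 * np.ones(p)                     # node parameters theta_s
theta_edge = 0.8                             # common edge weight theta_st
cycle_edges = [(s, (s + 1) % p) for s in range(p)]

def unnormalized(x):
    # x is a tuple in {-1, +1}^5
    energy = np.dot(theta, x)
    energy += theta_edge * sum(x[s] * x[t] for s, t in cycle_edges)
    return np.exp(energy)

states = list(itertools.product([-1, +1], repeat=p))
weights = np.array([unnormalized(x) for x in states])
Q = weights / weights.sum()                  # normalized distribution Q(x_1,...,x_5)

# e.g., probability that everyone is infected:
print(Q[states.index((1, 1, 1, 1, 1))])
```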

Possible epidemic patterns (on a square)

From epidemic patterns to graphs

Underlying graphs

Model selection for graphs

Given $n$ i.i.d. samples drawn from
$$Q(x_1, \ldots, x_p; \theta) \;\propto\; \exp\Big( \sum_{s \in V} \theta_s x_s + \sum_{(s,t) \in E} \theta_{st} x_s x_t \Big),$$
the graph $G$ and the matrix $[\theta]_{st} = \theta_{st}$ of edge weights are unknown.

  data matrix $X_1^n \in \{-1, +1\}^{n \times p}$
  want an estimator $X_1^n \mapsto \widehat{G}$ that minimizes the error probability
$$\underbrace{Q^n\big[\widehat{G}(X_1^n) \neq G\big]}_{\text{prob. that estimated graph differs from truth}}$$

Channel decoding:
Think of graphs as codewords, and the graph family as a codebook.

Past/on-going work on graph selection

  exact polynomial-time solution on trees (Chow & Liu, 1967)
  testing for local conditional independence relationships (e.g., Spirtes et al., 2000; Kalisch & Buhlmann, 2008)
  pseudolikelihood and BIC criterion (Csiszar & Talata, 2006)
  pseudolikelihood and $\ell_1$-regularized neighborhood regression:
    Gaussian case (Meinshausen & Buhlmann, 2006)
    Binary case (Ravikumar, W. & Lafferty, 2006)
  various greedy and related methods (Bresler et al., 2008; Bresler, 2015; Netrapalli et al., 2010; Anandkumar et al., 2013)
  lower bounds and inachievability results:
    information-theoretic bounds (Santhanam & W., 2012)
    computational lower bounds (Dagum & Luby, 1993; Bresler et al., 2014)
    phase transitions and performance of neighborhood regression (Bento & Montanari, 2009)

US Senate network (2004-2006 voting)

Experiments for sequences of star graphs

[Figure: star graphs with p = 9, d = 3 and p = 18, d = 6.]

Empirical behavior: Unrescaled plots

[Figure: star graph, linear fraction of neighbors; curves for p = 64, 100, 225.]
Plots of success probability versus raw sample size n.

Empirical behavior: Appropriately rescaled

[Figure: star graph, linear fraction of neighbors; curves for p = 64, 100, 225.]
Plots of success probability versus rescaled sample size $\dfrac{n}{d^2 \log p}$.

Some theory: Scaling law for graph selection

  graphs $\mathcal{G}_{p,d}$ with $p$ nodes and maximum degree $d$
  minimum absolute weight $\theta_{\min}$ on edges
  how many samples $n$ are needed to recover the unknown graph?

Theorem (Ravikumar, W. & Lafferty, 2010; Santhanam & W., 2012)

Achievable result: Under regularity conditions, for the graph estimate $\widehat{G}$ produced by $\ell_1$-regularized logistic regression (see the sketch below):
$$\underbrace{n > c_u \big(d^2 + 1/\theta_{\min}^2\big) \log p}_{\text{lower bound on sample size}} \quad\Longrightarrow\quad \underbrace{Q[\widehat{G} \neq G] \to 0}_{\text{vanishing probability of error}}$$

Necessary condition: For the graph estimate $\widetilde{G}$ produced by any algorithm:
$$\underbrace{n < c_\ell \big(d^2 + 1/\theta_{\min}^2\big) \log p}_{\text{upper bound on sample size}} \quad\Longrightarrow\quad \underbrace{Q[\widetilde{G} \neq G] \geq 1/2}_{\text{constant probability of error}}$$
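Below is a minimal sketch, not the authors' implementation, of neighborhood-based graph selection via $\ell_1$-regularized logistic regression in the spirit of the achievability result above; the regularization level lam and the OR-rule for combining the per-node neighborhoods are illustrative assumptions.

```python
# Neighborhood selection for an Ising model via l1-regularized logistic
# regression (sketch only; lam and the OR symmetrization rule are assumptions).
import numpy as np
from sklearn.linear_model import LogisticRegression

def select_graph(X, lam=0.05):
    """X: n-by-p array with entries in {-1, +1}; returns a boolean adjacency matrix."""
    n, p = X.shape
    adj = np.zeros((p, p), dtype=bool)
    for s in range(p):
        y = (X[:, s] == 1).astype(int)           # response: node s
        if y.min() == y.max():                   # node constant in the sample: skip
            continue
        Z = np.delete(X, s, axis=1)              # covariates: all other nodes
        clf = LogisticRegression(penalty="l1", solver="liblinear",
                                 C=1.0 / (lam * n))
        clf.fit(Z, y)
        w = clf.coef_.ravel()
        others = np.delete(np.arange(p), s)
        adj[s, others[np.abs(w) > 1e-6]] = True  # estimated neighborhood of s
    return adj | adj.T                           # OR-rule to symmetrize
```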

Information-theoretic analysis of graph selection

Question: How to prove lower bounds on graph selection methods?
Answer: Graph selection is an unorthodox channel coding problem.

  codewords/codebook: graph $G$ in some graph class $\mathcal{G}$
  channel use: draw sample $X_i = (X_{i1}, \ldots, X_{ip}) \in \{-1, +1\}^p$ from the graph-structured distribution $Q_G$
  decoding problem: use $n$ samples $\{X_1, \ldots, X_n\}$ to correctly distinguish the codeword

[Channel diagram: $G \;\to\; Q(X \mid G) \;\to\; X_1, X_2, \ldots, X_n \;\to\; \widehat{G}$]

Proof sketch: Main ideas for necessary conditions

  based on assessing the difficulty of graph selection over various sub-ensembles $\widetilde{\mathcal{G}} \subseteq \mathcal{G}_{p,d}$
  choose $G \in \widetilde{\mathcal{G}}$ uniformly at random, and consider the multi-way hypothesis testing problem based on the data $X_1^n = \{X_1, \ldots, X_n\}$
  for any graph estimator $\psi: \mathcal{X}^n \to \mathcal{G}$, Fano's inequality implies that
$$Q[\psi(X_1^n) \neq G] \;\geq\; 1 - \frac{I(X_1^n; G) + o(1)}{\log |\widetilde{\mathcal{G}}|},$$
  where $I(X_1^n; G)$ is the mutual information between the observations $X_1^n$ and the randomly chosen graph $G$

  remaining steps:
    1. Construct difficult sub-ensembles $\widetilde{\mathcal{G}} \subseteq \mathcal{G}_{p,d}$.
    2. Compute or lower bound the log cardinality $\log |\widetilde{\mathcal{G}}|$.
    3. Upper bound the mutual information $I(X_1^n; G)$.
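For completeness, here is the standard Fano calculation behind the displayed bound under the uniform choice of $G$; the slides' $o(1)$ term corresponds to the additive $\log 2$ correction, which is negligible once $|\widetilde{\mathcal{G}}|$ is large.

\begin{align*}
H(G \mid X_1^n) &\;\leq\; \log 2 \;+\; \Pr[\psi(X_1^n) \neq G]\,\log|\widetilde{\mathcal{G}}|
\qquad \text{(Fano's inequality)}\\
\Pr[\psi(X_1^n) \neq G] &\;\geq\; \frac{H(G \mid X_1^n) - \log 2}{\log|\widetilde{\mathcal{G}}|}
\;=\; 1 - \frac{I(X_1^n; G) + \log 2}{\log|\widetilde{\mathcal{G}}|},
\end{align*}

using $H(G \mid X_1^n) = H(G) - I(X_1^n; G) = \log|\widetilde{\mathcal{G}}| - I(X_1^n; G)$, since $G$ is uniform over $\widetilde{\mathcal{G}}$.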

A hard d-clique ensemble

[Figure: base graph G, and graphs G_{uv}, G_{st} each obtained by deleting one edge.]

  Divide the vertex set $V$ into $\frac{p}{d+1}$ groups of size $d+1$, and form the base graph $G$ by making a $(d+1)$-clique $C$ within each group.
  Form graph $G_{uv}$ by deleting edge $(u, v)$ from $G$.

Consider the testing problem over the family of graph-structured distributions
$$\{Q(\,\cdot\,; G_{st}), \; (s,t) \in C\}.$$

Why is this ensemble hard?
The Kullback-Leibler divergence between pairs decays exponentially in $d$, unless the minimum edge weight decays as $1/d$.

Vignette B: Sparse principal components analysis

Principal component analysis:
  widely-used method for dimensionality reduction, data compression, etc.
  extracts top eigenvectors of the sample covariance matrix $\widehat{\Sigma}$
  classical PCA in $p$ dimensions: inconsistent unless $p/n \to 0$
  low-dimensional structure: many applications lead to sparse eigenvectors

Population model: Rank-one spiked covariance matrix
$$\Sigma \;=\; \underbrace{\nu}_{\text{SNR parameter}}\; \underbrace{\theta^* (\theta^*)^T}_{\text{rank-one spike}} \;+\; I_p$$

Sampling model: Draw $n$ i.i.d. zero-mean vectors with $\mathrm{cov}(X_i) = \Sigma$, and form the sample covariance matrix
$$\widehat{\Sigma} \;:=\; \underbrace{\frac{1}{n} \sum_{i=1}^n X_i X_i^T}_{p\text{-dim. matrix, rank } \min\{n,p\}}$$

Example: PCA for face compression/recognition

First 25 standard principal components (estimated from data)

Example: PCA for face compression/recognition

First 25 sparse principal components (estimated from data)

Perhaps the simplest estimator: diagonal thresholding (Johnstone & Lu, 2008)


[Figure: diagonal value of the sample covariance versus index (1 to 50).]

Given $n$ i.i.d. samples $X_i$ with zero mean, and with spiked covariance $\Sigma = \nu\,\theta^*(\theta^*)^T + I$:
1. Compute the diagonal entries of the sample covariance: $\widehat{\Sigma}_{jj} = \frac{1}{n}\sum_{i=1}^n X_{ij}^2$.
2. Apply a threshold to the vector $\{\widehat{\Sigma}_{jj},\; j = 1, \ldots, p\}$ to estimate the support of $\theta^*$.
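A minimal sketch of these two steps on simulated spiked-covariance data (illustrative only; the values of nu and k, and the rule of keeping the k largest diagonal entries rather than a fixed cutoff, are assumptions):

```python
# Diagonal thresholding sketch on simulated spiked-covariance data.
# Assumptions for illustration: nu (SNR), k (sparsity), and a top-k rule
# in place of a fixed threshold.
import numpy as np

rng = np.random.default_rng(1)
n, p, k, nu = 500, 200, 10, 2.0

theta_star = np.zeros(p)
theta_star[:k] = 1.0 / np.sqrt(k)                 # k-sparse unit-norm spike
Sigma = nu * np.outer(theta_star, theta_star) + np.eye(p)

X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

diag_hat = (X ** 2).mean(axis=0)                  # step 1: diagonal of sample covariance
support_hat = np.argsort(diag_hat)[-k:]           # step 2: threshold (top-k rule)
print(sorted(support_hat))                        # ideally {0, ..., k-1}
```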

Diagonal thresholding: unrescaled plots

[Figure: probability of success versus number of observations n, for k = O(log p) and p = 100, 200, 300, 600, 1200.]

Diagonal thresholding: rescaled plots

[Figure: probability of success versus the control parameter, for k = O(log p) and p = 100, 200, 300, 600, 1200.]

Scaling is quadratic in sparsity: control parameter $\dfrac{n}{k^2 \log p}$.

Diagonal thresholding and fundamental limit

Consider the spiked covariance matrix
$$\Sigma \;=\; \underbrace{\nu}_{\text{signal-to-noise}}\; \underbrace{\theta^*(\theta^*)^T}_{\text{rank-one spike}} \;+\; I_{p \times p},
\qquad \text{where } \theta^* \in \underbrace{B_0(k) \cap B_2(1)}_{k\text{-sparse and unit norm}}.$$

Theorem (Amini & W., 2009)

(a) There are thresholds $\ell_{\mathrm{DT}} \leq u_{\mathrm{DT}}$ such that
$$\underbrace{\frac{n}{k^2 \log p} \leq \ell_{\mathrm{DT}}}_{\text{DT fails w.h.p.}}
\qquad\qquad
\underbrace{\frac{n}{k^2 \log p} \geq u_{\mathrm{DT}}}_{\text{DT succeeds w.h.p.}}$$

(b) For the optimal method (exhaustive search), there is a threshold $\theta_{\mathrm{ES}}$ such that
$$\underbrace{\frac{n}{k \log p} < \theta_{\mathrm{ES}}}_{\text{fail w.h.p.}}
\qquad\qquad
\underbrace{\frac{n}{k \log p} > \theta_{\mathrm{ES}}}_{\text{succeed w.h.p.}}$$

One polynomial-time SDP relaxation

Recall the Courant-Fischer variational formulation:
$$\theta^* \;=\; \arg\max_{\|z\|_2 = 1} \; z^T \underbrace{\big( \nu\,\theta^*(\theta^*)^T + I_{p\times p} \big)}_{\text{population covariance}} z.$$

Equivalently, lifting to matrix variables $Z = zz^T$:
$$\theta^*(\theta^*)^T \;=\; \arg\max_{\substack{Z \in \mathbb{R}^{p\times p} \\ Z = Z^T,\; Z \succeq 0}} \mathrm{trace}(Z\Sigma)
\quad \text{s.t. } \mathrm{trace}(Z) = 1 \text{ and } \mathrm{rank}(Z) = 1.$$

Dropping the rank constraint yields a standard SDP relaxation:
$$\widehat{Z} \;=\; \arg\max_{\substack{Z \in \mathbb{R}^{p\times p} \\ Z = Z^T,\; Z \succeq 0}} \mathrm{trace}(Z\Sigma)
\quad \text{s.t. } \mathrm{trace}(Z) = 1.$$

In practice (d'Aspremont et al., 2008):
  apply this relaxation using the sample covariance matrix $\widehat{\Sigma}$
  add the $\ell_1$-constraint $\sum_{i,j=1}^p |Z_{ij}| \leq k$.
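A minimal sketch of this relaxation, assuming cvxpy as the modeling tool (an assumption about tooling, not the implementation referenced on the slide); the sample covariance Sigma_hat and sparsity level k are supplied by the user:

```python
# SDP relaxation for sparse PCA, following the lifted formulation above.
# Sketch only: cvxpy is an assumed tool, and k must be chosen by the user.
import cvxpy as cp
import numpy as np

def sparse_pca_sdp(Sigma_hat, k):
    p = Sigma_hat.shape[0]
    Z = cp.Variable((p, p), PSD=True)                 # symmetric, Z >= 0
    objective = cp.Maximize(cp.trace(Sigma_hat @ Z))
    constraints = [cp.trace(Z) == 1,                  # unit trace (lifted unit norm)
                   cp.sum(cp.abs(Z)) <= k]            # l1 constraint promoting sparsity
    cp.Problem(objective, constraints).solve()
    # leading eigenvector of the solution as the sparse PC estimate
    _, V = np.linalg.eigh(Z.value)
    return V[:, -1]
```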

Phase transition for SDP: logarithmic sparsity

[Figure: probability of success versus the control parameter for the SDP relaxation, k = O(log p), p = 100, 200, 300.]

Scaling is linear in sparsity: control parameter $\dfrac{n}{k \log p}$.

A natural question
Can the logarithmic sparsity or rank-one conditions be removed?

Computational lower bound for sparse PCA

Consider the spiked covariance matrix
$$\Sigma \;=\; \underbrace{\nu}_{\text{signal-to-noise}}\; \underbrace{(\theta^*)(\theta^*)^T}_{\text{rank-one spike}} \;+\; I_{p\times p},
\qquad \text{where } \theta^* \in \underbrace{B_0(k) \cap B_2(1)}_{k\text{-sparse and unit norm}}.$$

Sparse PCA detection problem:
$$H_0 \text{ (no signal)}: \quad X_i \sim \mathcal{D}(0, I_{p\times p})
\qquad\qquad
H_1 \text{ (spiked signal)}: \quad X_i \sim \mathcal{D}(0, \Sigma),$$
for a distribution $\mathcal{D}$ with sub-exponential tail behavior.

Theorem (Berthet & Rigollet, 2013)
Under average-case hardness of planted clique, the polynomial-minimax level of detection $\nu_{\mathrm{POLY}}$ is given by
$$\nu_{\mathrm{POLY}} \;\asymp\; \sqrt{\frac{k^2 \log p}{n}}
\qquad \text{for all sparsity } k \gtrsim \log p,$$
versus the classical minimax level $\nu_{\mathrm{OPT}} \asymp \sqrt{\dfrac{k \log p}{n}}$.

Planted k-clique problem

[Figure: an Erdos-Renyi graph versus a graph with a planted k-clique.]

Binary hypothesis test based on observing a random graph G on p vertices:
  H_0: Erdos-Renyi, each edge present independently with probability 1/2
  H_1: planted k-clique, with the remaining edges random

Planted k-clique problem (matrix view)

[Figure: a matrix of random entries versus one with a planted k x k sub-matrix.]

Binary hypothesis test based on observing a random binary matrix:
  H_0: random {0, 1} matrix with Ber(1/2) entries off the diagonal
  H_1: the same, with a planted k x k all-ones sub-matrix
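A small simulation sketch of the two hypotheses in the matrix view (the parameter values p and k are illustrative assumptions):

```python
# Simulate the planted-clique testing problem in its matrix form.
# Sketch: symmetric {0,1} matrix with Ber(1/2) off-diagonal entries, and
# under H1 an all-ones k x k block on a hidden random vertex subset.
import numpy as np

def sample_instance(p, k, planted, rng):
    upper = rng.integers(0, 2, size=(p, p))
    A = np.triu(upper, 1)
    A = A + A.T                                   # symmetric {0,1}, zero diagonal
    if planted:
        S = rng.choice(p, size=k, replace=False)  # hidden clique vertices
        A[np.ix_(S, S)] = 1
        np.fill_diagonal(A, 0)
    return A

rng = np.random.default_rng(2)
A0 = sample_instance(p=200, k=20, planted=False, rng=rng)   # H0
A1 = sample_instance(p=200, k=20, planted=True, rng=rng)    # H1
```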

Vignette C: (Structured) non-parametric regression


Goal: How to predict an output from covariates?
  given covariates $(x_1, x_2, x_3, \ldots, x_p)$
  output variable $y$
  want to predict $y$ based on $(x_1, \ldots, x_p)$
Examples: medical imaging, geostatistics, astronomy, computational biology, ...

[Figure: (a) second-order polynomial kernel fit and (b) first-order Sobolev (Lipschitz) function fit; function value versus design value x.]

High dimensions and sample complexity

Possible models:
  ordinary linear regression: $y = \sum_{j=1}^p \theta_j x_j + w = \langle \theta, x \rangle + w$
  general non-parametric model: $y = f(x_1, x_2, \ldots, x_p) + w$.

Estimation accuracy: How well can $f$ be estimated using $n$ samples?
  linear models:
    without any structure: error$^2 \asymp \underbrace{p/n}_{\text{linear in } p}$
    with sparsity $k \ll p$: error$^2 \asymp \underbrace{k \log\big(\tfrac{ep}{k}\big)/n}_{\text{logarithmic in } p}$
  non-parametric models: $p$-dimensional, smoothness $\alpha$

Curse of dimensionality: error$^2 \asymp \underbrace{(1/n)^{\frac{2\alpha}{2\alpha + p}}}_{\text{exponential slow-down}}$

Minimax risk and sample size

Consider a function class $\mathcal{F}$, and $n$ i.i.d. samples from the model
$$y_i = f(x_i) + w_i, \qquad \text{where } f \text{ is some member of } \mathcal{F}.$$

For a given estimator $\{(x_i, y_i)\}_{i=1}^n \mapsto \widehat{f} \in \mathcal{F}$, the worst-case risk in a metric $\rho$ is
$$\mathcal{R}^n_{\mathrm{worst}}(\widehat{f}; \mathcal{F}) \;=\; \sup_{f \in \mathcal{F}} \mathbb{E}_n\big[\rho^2(\widehat{f}, f)\big].$$

Minimax risk
For a given sample size $n$, the minimax risk is
$$\inf_{\widehat{f}} \mathcal{R}^n_{\mathrm{worst}}(\widehat{f}; \mathcal{F}) \;=\; \inf_{\widehat{f}} \sup_{f \in \mathcal{F}} \mathbb{E}_n\big[\rho^2(\widehat{f}, f)\big],$$
where the infimum is taken over all estimators.


How to measure the size of function classes?

A $2\delta$-packing of $\mathcal{F}$ in the metric $\rho$ is a collection $\{f^1, \ldots, f^M\} \subset \mathcal{F}$ such that
$$\rho\big(f^j, f^k\big) > 2\delta \qquad \text{for all } j \neq k.$$
The packing number $M(2\delta)$ is the cardinality of the largest such set.

Packing/covering entropy: emerged from the Russian school in the 1940s/1950s (Kolmogorov and collaborators).
Central object in proving minimax lower bounds for nonparametric problems (e.g., Hasminskii & Ibragimov, 1978; Birge, 1983; Yu, 1997; Yang & Barron, 1999).

Packing and covering numbers

[Figure: packing elements $f^1, f^2, \ldots, f^M$ separated by $2\delta$-balls in the metric $\rho$.]

A $2\delta$-packing of $\mathcal{F}$ in the metric $\rho$ is a collection $\{f^1, \ldots, f^M\} \subset \mathcal{F}$ such that $\rho(f^j, f^k) > 2\delta$ for all $j \neq k$; the packing number $M(2\delta)$ is the cardinality of the largest such set.

Example: Sup-norm packing for Lipschitz functions

[Figure: piecewise-linear functions with slopes $\pm L$ used to build a packing.]

A $\delta$-packing set is a collection of functions $\{f^1, f^2, \ldots, f^M\}$ such that $\|f^j - f^k\|_\infty > \delta$ for all $j \neq k$.
For $L$-Lipschitz functions in one dimension: $M(\delta) \approx 2^{(L/\delta)}$.
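For intuition, here is a standard construction behind this count (a sketch with constants suppressed, added here for completeness rather than taken from the slides): split $[0,1]$ into $m$ intervals of width $1/m$ and, on each interval, place a "tent" of slope $\pm L$ and height $L/(2m)$ with an independent sign choice. Any two distinct sign patterns differ by $2 \cdot L/(2m) = L/m$ in sup-norm, so choosing $m = L/\delta$ yields

\[
M(\delta) \;\geq\; 2^{\,L/\delta},
\qquad\text{hence}\qquad
\log M(\delta) \;\asymp\; \frac{L}{\delta}.
\]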

Standard reduction: from estimation to testing

  goal: characterize the minimax risk for $\rho$-estimation over $\mathcal{F}$
  construct a $2\delta$-packing of $\mathcal{F}$: a collection $\{f^1, \ldots, f^M\}$ such that $\rho(f^j, f^k) > 2\delta$

[Figure: packing elements $f^1, \ldots, f^M$ separated by $2\delta$.]

  now form an $M$-component mixture distribution as follows:
    draw a packing index $V \in \{1, \ldots, M\}$ uniformly at random
    conditioned on $V = j$, draw $n$ i.i.d. samples $(X_i, Y_i) \sim Q_{f^j}$

  Claim: Any estimator $\widehat{f}$ such that $\rho(\widehat{f}, f^V) \leq \delta$ w.h.p. can be used to solve the $M$-ary testing problem.
  Use standard techniques (Assouad, Le Cam, Fano, ...) to lower bound the probability of error in the testing problem.

Minimax rate via metric entropy matching

  observe $(x_i, y_i)$ pairs from the model $y_i = f(x_i) + w_i$
  two possible norms:
$$\|\widehat{f} - f\|_n^2 := \frac{1}{n}\sum_{i=1}^n \big(\widehat{f}(x_i) - f(x_i)\big)^2,
\qquad \text{or} \qquad
\|\widehat{f} - f\|_2^2 = \mathbb{E}\big[(\widehat{f}(\widetilde{X}) - f(\widetilde{X}))^2\big].$$

Metric entropy master equation
For many regression problems, the minimax rate $\delta_n > 0$ is determined by solving the master equation
$$\log M\big(2\delta;\, \mathcal{F}, \|\cdot\|\big) \;\asymp\; n \delta^2.$$
  basic idea (with the Hellinger metric) dates back to Le Cam (1973)
  elegant general version due to Yang and Barron (1999)


Example 1: Sparse linear regression

Observations $y_i = \langle x_i, \theta^* \rangle + w_i$, where
$$\theta^* \in \Theta(k, p) := \big\{ \theta \in \mathbb{R}^p \;\big|\; \|\theta\|_0 \leq k, \ \|\theta\|_2 \leq 1 \big\}.$$

Gilbert-Varshamov: one can construct a $2\delta$-separated set with
$$\log M(2\delta) \;\asymp\; k \log\Big(\frac{ep}{k}\Big) \quad \text{elements.}$$

Master equation and minimax rate
$$\log M(2\delta) \;\asymp\; n\delta^2 \qquad\Longrightarrow\qquad \delta^2 \;\asymp\; \frac{k \log\big(\tfrac{ep}{k}\big)}{n}.$$

Polynomial-time achievability: by $\ell_1$-relaxations under restricted eigenvalue (RE) conditions (Candes & Tao, 2007; Bickel et al., 2009; Buhlmann & van de Geer, 2011)
  achieve minimax-optimal rates for $\ell_2$-error (Raskutti, W., & Yu, 2011)
  $\ell_1$-methods can be sub-optimal for the prediction error $\|X(\widehat{\theta} - \theta^*)\|_2^2 / n$ (Zhang, W. & Jordan, 2014)
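As a quick illustration of the $\ell_1$-relaxation route, here is a sketch on simulated data; the regularization level follows the usual $\sqrt{\log p / n}$ scaling, but the constant in front and the noise level are arbitrary assumptions.

```python
# Sparse linear regression via the Lasso on simulated data (illustrative sketch).
# The regularization level uses the standard sqrt(log p / n) scaling; the
# constant in front is an arbitrary assumption, not a tuned value.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
n, p, k = 200, 500, 5

theta_star = np.zeros(p)
theta_star[:k] = 1.0 / np.sqrt(k)                  # k-sparse, unit-norm truth
X = rng.standard_normal((n, p))
y = X @ theta_star + 0.5 * rng.standard_normal(n)  # noisy linear observations

lam = 2 * 0.5 * np.sqrt(np.log(p) / n)             # ~ sigma * sqrt(log p / n)
theta_hat = Lasso(alpha=lam).fit(X, y).coef_

print(np.sum(np.abs(theta_hat) > 1e-6))            # estimated support size
print(np.linalg.norm(theta_hat - theta_star))      # l2 estimation error
```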

Example 2: $\alpha$-smooth non-parametric regression

Observations $y_i = f(x_i) + w_i$, where
$$f \in \mathcal{F}(\alpha) = \Big\{ f: [0,1] \to \mathbb{R} \;\Big|\; f \text{ is } \alpha\text{-times differentiable, with } \sum_{j=0}^{\alpha} \|f^{(j)}\|_\infty \leq C \Big\}.$$

Classical results in approximation theory (e.g., Kolmogorov & Tikhomirov, 1959):
$$\log M(2\delta; \mathcal{F}) \;\asymp\; (1/\delta)^{1/\alpha}.$$

Master equation and minimax rate
$$\log M(2\delta) \;\asymp\; n\delta^2 \qquad\Longrightarrow\qquad \delta^2 \;\asymp\; (1/n)^{\frac{2\alpha}{2\alpha + 1}}.$$
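Spelling out the one-line calculation behind this implication (constants suppressed):

\[
(1/\delta)^{1/\alpha} \;\asymp\; n\delta^2
\;\;\Longleftrightarrow\;\;
\delta^{-(1/\alpha) - 2} \;\asymp\; n
\;\;\Longleftrightarrow\;\;
\delta^2 \;\asymp\; n^{-\frac{2}{2 + 1/\alpha}} \;=\; (1/n)^{\frac{2\alpha}{2\alpha+1}}.
\]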

Example 3: Sparse additive and $\alpha$-smooth regression

Observations $y_i = f(x_i) + w_i$, where $f$ satisfies the constraints:
  Additively decomposable: $f(x_1, \ldots, x_p) = \sum_{j=1}^p g_j(x_j)$
  Sparse: at most $k$ of $(g_1, \ldots, g_p)$ are non-zero
  Smooth: each $g_j$ belongs to the $\alpha$-smooth family $\mathcal{F}(\alpha)$

Combining the previous results yields a $2\delta$-packing with
$$\log M(2\delta; \mathcal{F}) \;\asymp\; k\,(1/\delta)^{1/\alpha} + k \log\Big(\frac{ep}{k}\Big).$$

Master equation and minimax rate
Solving $\log M(2\delta) \asymp n\delta^2$ yields
$$\delta^2 \;\asymp\; \underbrace{k\,(1/n)^{\frac{2\alpha}{2\alpha+1}}}_{k\text{-component estimation}} \;+\; \underbrace{\frac{k \log\big(\tfrac{ep}{k}\big)}{n}}_{\text{search complexity}}.$$

Summary
To be provided during tutorial....
