
Tutorial Part I:
Information theory meets machine learning

Emmanuel Abbe (Princeton University)
Martin Wainwright (UC Berkeley)

June 2015

Introduction

Era of massive data sets: fascinating problems at the interfaces between information theory and statistical machine learning.

Fundamental issues
  Concentration of measure: high-dimensional problems are remarkably predictable
  Curse of dimensionality: without structure, many problems are hopeless
  Low-dimensional structure is essential

Machine learning brings in algorithmic components
  Computational constraints are central
  Memory constraints: need for distributed and decentralized procedures
  Increasing importance of privacy

Historical perspective: info. theory and statistics

Claude Shannon

Andrey Kolmogorov

A rich intersection between information theory and statistics:
1. Hypothesis testing, large deviations
2. Fisher information, Kullback-Leibler divergence
3. Metric entropy and Fano's inequality

Statistical estimation as channel coding

  Codebook: indexed family of probability distributions $\{Q_\theta \mid \theta \in \Omega\}$
  Codeword: nature chooses some $\theta^* \in \Omega$
  Channel: user observes $n$ i.i.d. draws $X_i \sim Q_{\theta^*}$
  Decoding: estimator $X_1^n \mapsto \widehat{\theta}$ such that $\widehat{\theta} \approx \theta^*$

Perspective dating back to Kolmogorov (1950s) with many variations:
  codebooks/codewords: graphs, vectors, matrices, functions, densities, ...
  channels: random graphs, regression models, elementwise probes of vectors/matrices, random projections
  closeness $\widehat{\theta} \approx \theta^*$: exact/partial graph recovery in Hamming distance, $\ell_p$-distances, $L^2(Q)$-distances, sup-norm, etc.
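To make the decoding analogy concrete, here is a minimal sketch (not from the slides) in which the "codebook" is a small finite family of Bernoulli distributions, the "channel" produces n i.i.d. draws, and the "decoder" is a maximum-likelihood rule over the codebook; the particular family and parameter values are illustrative assumptions.

```python
# Minimal sketch of estimation as channel decoding (illustrative assumptions):
# codebook = a finite family of Bernoulli parameters, channel = n i.i.d. draws,
# decoder = maximum-likelihood selection of the index.
import numpy as np

rng = np.random.default_rng(0)
codebook = np.array([0.2, 0.4, 0.6, 0.8])   # indexed family {Q_theta}
theta_star = codebook[2]                    # nature's codeword
n = 200
x = rng.binomial(1, theta_star, size=n)     # n channel uses

# log-likelihood of the sample under each codeword
loglik = x.sum() * np.log(codebook) + (n - x.sum()) * np.log(1 - codebook)
theta_hat = codebook[np.argmax(loglik)]     # decoded codeword
print(theta_hat)
```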

Machine learning: algorithmic issues to forefront!

1. Efficient algorithms are essential
   only (low-order) polynomial-time methods can ever be implemented
   trade-offs between computational complexity and performance?
   fundamental barriers due to computational complexity?

2. Distributed procedures are often needed
   many modern data sets: too large to be stored on a single machine
   need methods that operate separately on pieces of data
   trade-offs between decentralization and performance?
   fundamental barriers due to communication complexity?

3. Privacy and access issues
   conflicts between individual privacy and benefits of aggregation?
   principled information-theoretic formulations of such trade-offs?


Part I of tutorial: Three vignettes


1. Graphical model selection
2. Sparse principal component analysis
3. Structured non-parametric regression and minimax theory


Vignette A: Graphical model selection


Simple motivating example: Epidemic modeling

Disease status of person $s$:
$$x_s = \begin{cases} +1 & \text{if individual } s \text{ is infected} \\ -1 & \text{if individual } s \text{ is healthy} \end{cases}$$

(1) Independent infection:
$$Q(x_1, \ldots, x_5) \;\propto\; \prod_{s=1}^{5} \exp(\theta_s x_s)$$

(2) Cycle-based infection:
$$Q(x_1, \ldots, x_5) \;\propto\; \prod_{s=1}^{5} \exp(\theta_s x_s) \prod_{(s,t) \in C} \exp(\theta_{st} x_s x_t)$$

(3) Full clique infection:
$$Q(x_1, \ldots, x_5) \;\propto\; \prod_{s=1}^{5} \exp(\theta_s x_s) \prod_{s \neq t} \exp(\theta_{st} x_s x_t)$$
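As a concrete illustration (not part of the slides), the following sketch evaluates the cycle-based infection model exactly by enumerating all $2^5$ configurations; the parameter values theta and theta_edge are arbitrary assumptions.

```python
# Exact computation of the cycle-based infection model on 5 nodes
# (illustrative sketch; the parameter values are arbitrary assumptions).
import itertools
import numpy as np

p = 5
theta = 0.1 * np.ones(p)                     # node parameters theta_s
theta_edge = 0.8                             # common edge weight theta_st
cycle_edges = [(s, (s + 1) % p) for s in range(p)]

def unnormalized(x):
    # x is a tuple in {-1, +1}^5
    energy = np.dot(theta, x)
    energy += theta_edge * sum(x[s] * x[t] for s, t in cycle_edges)
    return np.exp(energy)

states = list(itertools.product([-1, +1], repeat=p))
weights = np.array([unnormalized(x) for x in states])
Q = weights / weights.sum()                  # normalized distribution Q(x_1,...,x_5)

# e.g., probability that everyone is infected:
print(Q[states.index((1, 1, 1, 1, 1))])
```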

Possible epidemic patterns (on a square)

From epidemic patterns to graphs

Underlying graphs

Model selection for graphs

Given $n$ i.i.d. samples drawn from
$$Q(x_1, \ldots, x_p; \theta) \;\propto\; \exp\Big( \sum_{s \in V} \theta_s x_s + \sum_{(s,t) \in E} \theta_{st} x_s x_t \Big),$$
the graph $G$ and the matrix $[\theta]_{st} = \theta_{st}$ of edge weights are unknown.

  data matrix $X_1^n \in \{-1, +1\}^{n \times p}$
  want an estimator $X_1^n \mapsto \widehat{G}$ that minimizes the error probability
$$\underbrace{Q^n\big[\widehat{G}(X_1^n) \neq G\big]}_{\text{prob. that estimated graph differs from truth}}$$

Channel decoding:
Think of graphs as codewords, and the graph family as a codebook.

Past/on-going work on graph selection

  exact polynomial-time solution on trees (Chow & Liu, 1967)
  testing for local conditional independence relationships (e.g., Spirtes et al., 2000; Kalisch & Buhlmann, 2008)
  pseudolikelihood and BIC criterion (Csiszar & Talata, 2006)
  pseudolikelihood and $\ell_1$-regularized neighborhood regression:
    Gaussian case (Meinshausen & Buhlmann, 2006)
    Binary case (Ravikumar, W. & Lafferty, 2006)
  various greedy and related methods (Bresler et al., 2008; Bresler, 2015; Netrapalli et al., 2010; Anandkumar et al., 2013)
  lower bounds and inachievability results:
    information-theoretic bounds (Santhanam & W., 2012)
    computational lower bounds (Dagum & Luby, 1993; Bresler et al., 2014)
    phase transitions and performance of neighborhood regression (Bento & Montanari, 2009)

US Senate network (2004-2006 voting)

Experiments for sequences of star graphs

[Figure: star graphs with p = 9, d = 3 and p = 18, d = 6.]

Empirical behavior: Unrescaled plots

[Figure: star graph, linear fraction of neighbors; curves for p = 64, 100, 225.]
Plots of success probability versus raw sample size n.

Empirical behavior: Appropriately rescaled

[Figure: star graph, linear fraction of neighbors; curves for p = 64, 100, 225.]
Plots of success probability versus rescaled sample size $\dfrac{n}{d^2 \log p}$.

Some theory: Scaling law for graph selection

  graphs $\mathcal{G}_{p,d}$ with $p$ nodes and maximum degree $d$
  minimum absolute weight $\theta_{\min}$ on edges
  how many samples $n$ are needed to recover the unknown graph?

Theorem (Ravikumar, W. & Lafferty, 2010; Santhanam & W., 2012)

Achievable result: Under regularity conditions, for the graph estimate $\widehat{G}$ produced by $\ell_1$-regularized logistic regression (see the sketch below):
$$\underbrace{n > c_u \big(d^2 + 1/\theta_{\min}^2\big) \log p}_{\text{lower bound on sample size}} \quad\Longrightarrow\quad \underbrace{Q[\widehat{G} \neq G] \to 0}_{\text{vanishing probability of error}}$$

Necessary condition: For the graph estimate $\widetilde{G}$ produced by any algorithm:
$$\underbrace{n < c_\ell \big(d^2 + 1/\theta_{\min}^2\big) \log p}_{\text{upper bound on sample size}} \quad\Longrightarrow\quad \underbrace{Q[\widetilde{G} \neq G] \geq 1/2}_{\text{constant probability of error}}$$
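Below is a minimal sketch, not the authors' implementation, of neighborhood-based graph selection via $\ell_1$-regularized logistic regression in the spirit of the achievability result above; the regularization level lam and the OR-rule for combining the per-node neighborhoods are illustrative assumptions.

```python
# Neighborhood selection for an Ising model via l1-regularized logistic
# regression (sketch only; lam and the OR symmetrization rule are assumptions).
import numpy as np
from sklearn.linear_model import LogisticRegression

def select_graph(X, lam=0.05):
    """X: n-by-p array with entries in {-1, +1}; returns a boolean adjacency matrix."""
    n, p = X.shape
    adj = np.zeros((p, p), dtype=bool)
    for s in range(p):
        y = (X[:, s] == 1).astype(int)           # response: node s
        if y.min() == y.max():                   # node constant in the sample: skip
            continue
        Z = np.delete(X, s, axis=1)              # covariates: all other nodes
        clf = LogisticRegression(penalty="l1", solver="liblinear",
                                 C=1.0 / (lam * n))
        clf.fit(Z, y)
        w = clf.coef_.ravel()
        others = np.delete(np.arange(p), s)
        adj[s, others[np.abs(w) > 1e-6]] = True  # estimated neighborhood of s
    return adj | adj.T                           # OR-rule to symmetrize
```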

Information-theoretic analysis of graph selection

Question: How to prove lower bounds on graph selection methods?
Answer: Graph selection is an unorthodox channel coding problem.

  codewords/codebook: graph $G$ in some graph class $\mathcal{G}$
  channel use: draw sample $X_i = (X_{i1}, \ldots, X_{ip}) \in \{-1, +1\}^p$ from the graph-structured distribution $Q_G$
  decoding problem: use $n$ samples $\{X_1, \ldots, X_n\}$ to correctly distinguish the codeword

[Channel diagram: $G \;\to\; Q(X \mid G) \;\to\; X_1, X_2, \ldots, X_n \;\to\; \widehat{G}$]

Proof sketch: Main ideas for necessary conditions

  based on assessing the difficulty of graph selection over various sub-ensembles $\widetilde{\mathcal{G}} \subseteq \mathcal{G}_{p,d}$
  choose $G \in \widetilde{\mathcal{G}}$ uniformly at random, and consider the multi-way hypothesis testing problem based on the data $X_1^n = \{X_1, \ldots, X_n\}$
  for any graph estimator $\psi: \mathcal{X}^n \to \mathcal{G}$, Fano's inequality implies that
$$Q[\psi(X_1^n) \neq G] \;\geq\; 1 - \frac{I(X_1^n; G) + o(1)}{\log |\widetilde{\mathcal{G}}|},$$
  where $I(X_1^n; G)$ is the mutual information between the observations $X_1^n$ and the randomly chosen graph $G$

  remaining steps:
    1. Construct difficult sub-ensembles $\widetilde{\mathcal{G}} \subseteq \mathcal{G}_{p,d}$.
    2. Compute or lower bound the log cardinality $\log |\widetilde{\mathcal{G}}|$.
    3. Upper bound the mutual information $I(X_1^n; G)$.
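For completeness, here is the standard Fano calculation behind the displayed bound under the uniform choice of $G$; the slides' $o(1)$ term corresponds to the additive $\log 2$ correction, which is negligible once $|\widetilde{\mathcal{G}}|$ is large.

\begin{align*}
H(G \mid X_1^n) &\;\leq\; \log 2 \;+\; \Pr[\psi(X_1^n) \neq G]\,\log|\widetilde{\mathcal{G}}|
\qquad \text{(Fano's inequality)}\\
\Pr[\psi(X_1^n) \neq G] &\;\geq\; \frac{H(G \mid X_1^n) - \log 2}{\log|\widetilde{\mathcal{G}}|}
\;=\; 1 - \frac{I(X_1^n; G) + \log 2}{\log|\widetilde{\mathcal{G}}|},
\end{align*}

using $H(G \mid X_1^n) = H(G) - I(X_1^n; G) = \log|\widetilde{\mathcal{G}}| - I(X_1^n; G)$, since $G$ is uniform over $\widetilde{\mathcal{G}}$.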

A hard d-clique ensemble

[Figure: base graph G, and graphs G_{uv}, G_{st} each obtained by deleting one edge.]

  Divide the vertex set $V$ into $\frac{p}{d+1}$ groups of size $d+1$, and form the base graph $G$ by making a $(d+1)$-clique $C$ within each group.
  Form graph $G_{uv}$ by deleting edge $(u, v)$ from $G$.

Consider the testing problem over the family of graph-structured distributions
$$\{Q(\,\cdot\,; G_{st}), \; (s,t) \in C\}.$$

Why is this ensemble hard?
The Kullback-Leibler divergence between pairs decays exponentially in $d$, unless the minimum edge weight decays as $1/d$.

Vignette B: Sparse principal components analysis

Principal component analysis:
  widely-used method for dimensionality reduction, data compression, etc.
  extracts top eigenvectors of the sample covariance matrix $\widehat{\Sigma}$
  classical PCA in $p$ dimensions: inconsistent unless $p/n \to 0$
  low-dimensional structure: many applications lead to sparse eigenvectors

Population model: Rank-one spiked covariance matrix
$$\Sigma \;=\; \underbrace{\nu}_{\text{SNR parameter}}\; \underbrace{\theta^* (\theta^*)^T}_{\text{rank-one spike}} \;+\; I_p$$

Sampling model: Draw $n$ i.i.d. zero-mean vectors with $\mathrm{cov}(X_i) = \Sigma$, and form the sample covariance matrix
$$\widehat{\Sigma} \;:=\; \underbrace{\frac{1}{n} \sum_{i=1}^n X_i X_i^T}_{p\text{-dim. matrix, rank } \min\{n,p\}}$$

Example: PCA for face compression/recognition

First 25 standard principal components (estimated from data)

Example: PCA for face compression/recognition

First 25 sparse principal components (estimated from data)

Perhaps the simplest estimator: diagonal thresholding (Johnstone & Lu, 2008)


[Figure: diagonal value of the sample covariance versus index (1 to 50).]

Given $n$ i.i.d. samples $X_i$ with zero mean, and with spiked covariance $\Sigma = \nu\,\theta^*(\theta^*)^T + I$:
1. Compute the diagonal entries of the sample covariance: $\widehat{\Sigma}_{jj} = \frac{1}{n}\sum_{i=1}^n X_{ij}^2$.
2. Apply a threshold to the vector $\{\widehat{\Sigma}_{jj},\; j = 1, \ldots, p\}$ to estimate the support of $\theta^*$.
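A minimal sketch of these two steps on simulated spiked-covariance data (illustrative only; the values of nu and k, and the rule of keeping the k largest diagonal entries rather than a fixed cutoff, are assumptions):

```python
# Diagonal thresholding sketch on simulated spiked-covariance data.
# Assumptions for illustration: nu (SNR), k (sparsity), and a top-k rule
# in place of a fixed threshold.
import numpy as np

rng = np.random.default_rng(1)
n, p, k, nu = 500, 200, 10, 2.0

theta_star = np.zeros(p)
theta_star[:k] = 1.0 / np.sqrt(k)                 # k-sparse unit-norm spike
Sigma = nu * np.outer(theta_star, theta_star) + np.eye(p)

X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

diag_hat = (X ** 2).mean(axis=0)                  # step 1: diagonal of sample covariance
support_hat = np.argsort(diag_hat)[-k:]           # step 2: threshold (top-k rule)
print(sorted(support_hat))                        # ideally {0, ..., k-1}
```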

Diagonal thresholding: unrescaled plots

[Figure: probability of success versus number of observations n, for k = O(log p) and p = 100, 200, 300, 600, 1200.]

Diagonal thresholding: rescaled plots

[Figure: probability of success versus the control parameter, for k = O(log p) and p = 100, 200, 300, 600, 1200.]

Scaling is quadratic in sparsity: control parameter $\dfrac{n}{k^2 \log p}$.

Diagonal thresholding and fundamental limit

Consider the spiked covariance matrix
$$\Sigma \;=\; \underbrace{\nu}_{\text{signal-to-noise}}\; \underbrace{\theta^*(\theta^*)^T}_{\text{rank-one spike}} \;+\; I_{p \times p},
\qquad \text{where } \theta^* \in \underbrace{B_0(k) \cap B_2(1)}_{k\text{-sparse and unit norm}}.$$

Theorem (Amini & W., 2009)

(a) There are thresholds $\ell_{\mathrm{DT}} \leq u_{\mathrm{DT}}$ such that
$$\underbrace{\frac{n}{k^2 \log p} \leq \ell_{\mathrm{DT}}}_{\text{DT fails w.h.p.}}
\qquad\qquad
\underbrace{\frac{n}{k^2 \log p} \geq u_{\mathrm{DT}}}_{\text{DT succeeds w.h.p.}}$$

(b) For the optimal method (exhaustive search), there is a threshold $\theta_{\mathrm{ES}}$ such that
$$\underbrace{\frac{n}{k \log p} < \theta_{\mathrm{ES}}}_{\text{fail w.h.p.}}
\qquad\qquad
\underbrace{\frac{n}{k \log p} > \theta_{\mathrm{ES}}}_{\text{succeed w.h.p.}}$$

One polynomial-time SDP relaxation

Recall the Courant-Fischer variational formulation:
$$\theta^* \;=\; \arg\max_{\|z\|_2 = 1} \; z^T \underbrace{\big( \nu\,\theta^*(\theta^*)^T + I_{p\times p} \big)}_{\text{population covariance}} z.$$

Equivalently, lifting to matrix variables $Z = zz^T$:
$$\theta^*(\theta^*)^T \;=\; \arg\max_{\substack{Z \in \mathbb{R}^{p\times p} \\ Z = Z^T,\; Z \succeq 0}} \mathrm{trace}(Z\Sigma)
\quad \text{s.t. } \mathrm{trace}(Z) = 1 \text{ and } \mathrm{rank}(Z) = 1.$$

Dropping the rank constraint yields a standard SDP relaxation:
$$\widehat{Z} \;=\; \arg\max_{\substack{Z \in \mathbb{R}^{p\times p} \\ Z = Z^T,\; Z \succeq 0}} \mathrm{trace}(Z\Sigma)
\quad \text{s.t. } \mathrm{trace}(Z) = 1.$$

In practice (d'Aspremont et al., 2008):
  apply this relaxation using the sample covariance matrix $\widehat{\Sigma}$
  add the $\ell_1$-constraint $\sum_{i,j=1}^p |Z_{ij}| \leq k$.
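A minimal sketch of this relaxation, assuming cvxpy as the modeling tool (an assumption about tooling, not the implementation referenced on the slide); the sample covariance Sigma_hat and sparsity level k are supplied by the user:

```python
# SDP relaxation for sparse PCA, following the lifted formulation above.
# Sketch only: cvxpy is an assumed tool, and k must be chosen by the user.
import cvxpy as cp
import numpy as np

def sparse_pca_sdp(Sigma_hat, k):
    p = Sigma_hat.shape[0]
    Z = cp.Variable((p, p), PSD=True)                 # symmetric, Z >= 0
    objective = cp.Maximize(cp.trace(Sigma_hat @ Z))
    constraints = [cp.trace(Z) == 1,                  # unit trace (lifted unit norm)
                   cp.sum(cp.abs(Z)) <= k]            # l1 constraint promoting sparsity
    cp.Problem(objective, constraints).solve()
    # leading eigenvector of the solution as the sparse PC estimate
    _, V = np.linalg.eigh(Z.value)
    return V[:, -1]
```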

Phase transition for SDP: logarithmic sparsity

[Figure: probability of success versus the control parameter for the SDP relaxation, k = O(log p), p = 100, 200, 300.]

Scaling is linear in sparsity: control parameter $\dfrac{n}{k \log p}$.

A natural question
Can the logarithmic sparsity or rank-one conditions be removed?

Computational lower bound for sparse PCA

Consider the spiked covariance matrix
$$\Sigma \;=\; \underbrace{\nu}_{\text{signal-to-noise}}\; \underbrace{(\theta^*)(\theta^*)^T}_{\text{rank-one spike}} \;+\; I_{p\times p},
\qquad \text{where } \theta^* \in \underbrace{B_0(k) \cap B_2(1)}_{k\text{-sparse and unit norm}}.$$

Sparse PCA detection problem:
$$H_0 \text{ (no signal)}: \quad X_i \sim \mathcal{D}(0, I_{p\times p})
\qquad\qquad
H_1 \text{ (spiked signal)}: \quad X_i \sim \mathcal{D}(0, \Sigma),$$
for a distribution $\mathcal{D}$ with sub-exponential tail behavior.

Theorem (Berthet & Rigollet, 2013)
Under average-case hardness of planted clique, the polynomial-minimax level of detection $\nu_{\mathrm{POLY}}$ is given by
$$\nu_{\mathrm{POLY}} \;\asymp\; \sqrt{\frac{k^2 \log p}{n}}
\qquad \text{for all sparsity } k \gtrsim \log p,$$
versus the classical minimax level $\nu_{\mathrm{OPT}} \asymp \sqrt{\dfrac{k \log p}{n}}$.

Planted k-clique problem

[Figure: an Erdos-Renyi graph versus a graph with a planted k-clique.]

Binary hypothesis test based on observing a random graph G on p vertices:
  H_0: Erdos-Renyi, each edge present independently with probability 1/2
  H_1: planted k-clique, with the remaining edges random

Planted k-clique problem (matrix view)

[Figure: a matrix of random entries versus one with a planted k x k sub-matrix.]

Binary hypothesis test based on observing a random binary matrix:
  H_0: random {0, 1} matrix with Ber(1/2) entries off the diagonal
  H_1: the same, with a planted k x k all-ones sub-matrix
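A small simulation sketch of the two hypotheses in the matrix view (the parameter values p and k are illustrative assumptions):

```python
# Simulate the planted-clique testing problem in its matrix form.
# Sketch: symmetric {0,1} matrix with Ber(1/2) off-diagonal entries, and
# under H1 an all-ones k x k block on a hidden random vertex subset.
import numpy as np

def sample_instance(p, k, planted, rng):
    upper = rng.integers(0, 2, size=(p, p))
    A = np.triu(upper, 1)
    A = A + A.T                                   # symmetric {0,1}, zero diagonal
    if planted:
        S = rng.choice(p, size=k, replace=False)  # hidden clique vertices
        A[np.ix_(S, S)] = 1
        np.fill_diagonal(A, 0)
    return A

rng = np.random.default_rng(2)
A0 = sample_instance(p=200, k=20, planted=False, rng=rng)   # H0
A1 = sample_instance(p=200, k=20, planted=True, rng=rng)    # H1
```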

Vignette C: (Structured) non-parametric regression


Goal: How to predict an output from covariates?
  given covariates $(x_1, x_2, x_3, \ldots, x_p)$
  output variable $y$
  want to predict $y$ based on $(x_1, \ldots, x_p)$
Examples: medical imaging, geostatistics, astronomy, computational biology, ...

[Figure: (a) second-order polynomial kernel fit and (b) first-order Sobolev (Lipschitz) function fit; function value versus design value x.]

High dimensions and sample complexity

Possible models:
  ordinary linear regression: $y = \sum_{j=1}^p \theta_j x_j + w = \langle \theta, x \rangle + w$
  general non-parametric model: $y = f(x_1, x_2, \ldots, x_p) + w$.

Estimation accuracy: How well can $f$ be estimated using $n$ samples?
  linear models:
    without any structure: error$^2 \asymp \underbrace{p/n}_{\text{linear in } p}$
    with sparsity $k \ll p$: error$^2 \asymp \underbrace{k \log\big(\tfrac{ep}{k}\big)/n}_{\text{logarithmic in } p}$
  non-parametric models: $p$-dimensional, smoothness $\alpha$

Curse of dimensionality: error$^2 \asymp \underbrace{(1/n)^{\frac{2\alpha}{2\alpha + p}}}_{\text{exponential slow-down}}$

Minimax risk and sample size

Consider a function class $\mathcal{F}$, and $n$ i.i.d. samples from the model
$$y_i = f(x_i) + w_i, \qquad \text{where } f \text{ is some member of } \mathcal{F}.$$

For a given estimator $\{(x_i, y_i)\}_{i=1}^n \mapsto \widehat{f} \in \mathcal{F}$, the worst-case risk in a metric $\rho$ is
$$\mathcal{R}^n_{\mathrm{worst}}(\widehat{f}; \mathcal{F}) \;=\; \sup_{f \in \mathcal{F}} \mathbb{E}_n\big[\rho^2(\widehat{f}, f)\big].$$

Minimax risk
For a given sample size $n$, the minimax risk is
$$\inf_{\widehat{f}} \mathcal{R}^n_{\mathrm{worst}}(\widehat{f}; \mathcal{F}) \;=\; \inf_{\widehat{f}} \sup_{f \in \mathcal{F}} \mathbb{E}_n\big[\rho^2(\widehat{f}, f)\big],$$
where the infimum is taken over all estimators.


How to measure the size of function classes?

A $2\delta$-packing of $\mathcal{F}$ in the metric $\rho$ is a collection $\{f^1, \ldots, f^M\} \subset \mathcal{F}$ such that
$$\rho\big(f^j, f^k\big) > 2\delta \qquad \text{for all } j \neq k.$$
The packing number $M(2\delta)$ is the cardinality of the largest such set.

Packing/covering entropy: emerged from the Russian school in the 1940s/1950s (Kolmogorov and collaborators).
Central object in proving minimax lower bounds for nonparametric problems (e.g., Hasminskii & Ibragimov, 1978; Birge, 1983; Yu, 1997; Yang & Barron, 1999).

Packing and covering numbers

[Figure: packing elements $f^1, f^2, \ldots, f^M$ separated by $2\delta$-balls in the metric $\rho$.]

A $2\delta$-packing of $\mathcal{F}$ in the metric $\rho$ is a collection $\{f^1, \ldots, f^M\} \subset \mathcal{F}$ such that $\rho(f^j, f^k) > 2\delta$ for all $j \neq k$; the packing number $M(2\delta)$ is the cardinality of the largest such set.

Example: Sup-norm packing for Lipschitz functions

[Figure: piecewise-linear functions with slopes $\pm L$ used to build a packing.]

A $\delta$-packing set is a collection of functions $\{f^1, f^2, \ldots, f^M\}$ such that $\|f^j - f^k\|_\infty > \delta$ for all $j \neq k$.
For $L$-Lipschitz functions in one dimension: $M(\delta) \approx 2^{(L/\delta)}$.
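For intuition, here is a standard construction behind this count (a sketch with constants suppressed, added here for completeness rather than taken from the slides): split $[0,1]$ into $m$ intervals of width $1/m$ and, on each interval, place a "tent" of slope $\pm L$ and height $L/(2m)$ with an independent sign choice. Any two distinct sign patterns differ by $2 \cdot L/(2m) = L/m$ in sup-norm, so choosing $m = L/\delta$ yields

\[
M(\delta) \;\geq\; 2^{\,L/\delta},
\qquad\text{hence}\qquad
\log M(\delta) \;\asymp\; \frac{L}{\delta}.
\]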

Standard reduction: from estimation to testing

  goal: characterize the minimax risk for $\rho$-estimation over $\mathcal{F}$
  construct a $2\delta$-packing of $\mathcal{F}$: a collection $\{f^1, \ldots, f^M\}$ such that $\rho(f^j, f^k) > 2\delta$

[Figure: packing elements $f^1, \ldots, f^M$ separated by $2\delta$.]

  now form an $M$-component mixture distribution as follows:
    draw a packing index $V \in \{1, \ldots, M\}$ uniformly at random
    conditioned on $V = j$, draw $n$ i.i.d. samples $(X_i, Y_i) \sim Q_{f^j}$

  Claim: Any estimator $\widehat{f}$ such that $\rho(\widehat{f}, f^V) \leq \delta$ w.h.p. can be used to solve the $M$-ary testing problem.
  Use standard techniques (Assouad, Le Cam, Fano, ...) to lower bound the probability of error in the testing problem.

Minimax rate via metric entropy matching

  observe $(x_i, y_i)$ pairs from the model $y_i = f(x_i) + w_i$
  two possible norms:
$$\|\widehat{f} - f\|_n^2 := \frac{1}{n}\sum_{i=1}^n \big(\widehat{f}(x_i) - f(x_i)\big)^2,
\qquad \text{or} \qquad
\|\widehat{f} - f\|_2^2 = \mathbb{E}\big[(\widehat{f}(\widetilde{X}) - f(\widetilde{X}))^2\big].$$

Metric entropy master equation
For many regression problems, the minimax rate $\delta_n > 0$ is determined by solving the master equation
$$\log M\big(2\delta;\, \mathcal{F}, \|\cdot\|\big) \;\asymp\; n \delta^2.$$
  basic idea (with the Hellinger metric) dates back to Le Cam (1973)
  elegant general version due to Yang and Barron (1999)


Example 1: Sparse linear regression

Observations $y_i = \langle x_i, \theta^* \rangle + w_i$, where
$$\theta^* \in \Theta(k, p) := \big\{ \theta \in \mathbb{R}^p \;\big|\; \|\theta\|_0 \leq k, \ \|\theta\|_2 \leq 1 \big\}.$$

Gilbert-Varshamov: one can construct a $2\delta$-separated set with
$$\log M(2\delta) \;\asymp\; k \log\Big(\frac{ep}{k}\Big) \quad \text{elements.}$$

Master equation and minimax rate
$$\log M(2\delta) \;\asymp\; n\delta^2 \qquad\Longrightarrow\qquad \delta^2 \;\asymp\; \frac{k \log\big(\tfrac{ep}{k}\big)}{n}.$$

Polynomial-time achievability: by $\ell_1$-relaxations under restricted eigenvalue (RE) conditions (Candes & Tao, 2007; Bickel et al., 2009; Buhlmann & van de Geer, 2011)
  achieve minimax-optimal rates for $\ell_2$-error (Raskutti, W., & Yu, 2011)
  $\ell_1$-methods can be sub-optimal for the prediction error $\|X(\widehat{\theta} - \theta^*)\|_2^2 / n$ (Zhang, W. & Jordan, 2014)
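As a quick illustration of the $\ell_1$-relaxation route, here is a sketch on simulated data; the regularization level follows the usual $\sqrt{\log p / n}$ scaling, but the constant in front and the noise level are arbitrary assumptions.

```python
# Sparse linear regression via the Lasso on simulated data (illustrative sketch).
# The regularization level uses the standard sqrt(log p / n) scaling; the
# constant in front is an arbitrary assumption, not a tuned value.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
n, p, k = 200, 500, 5

theta_star = np.zeros(p)
theta_star[:k] = 1.0 / np.sqrt(k)                  # k-sparse, unit-norm truth
X = rng.standard_normal((n, p))
y = X @ theta_star + 0.5 * rng.standard_normal(n)  # noisy linear observations

lam = 2 * 0.5 * np.sqrt(np.log(p) / n)             # ~ sigma * sqrt(log p / n)
theta_hat = Lasso(alpha=lam).fit(X, y).coef_

print(np.sum(np.abs(theta_hat) > 1e-6))            # estimated support size
print(np.linalg.norm(theta_hat - theta_star))      # l2 estimation error
```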

Example 2: $\alpha$-smooth non-parametric regression

Observations $y_i = f(x_i) + w_i$, where
$$f \in \mathcal{F}(\alpha) = \Big\{ f: [0,1] \to \mathbb{R} \;\Big|\; f \text{ is } \alpha\text{-times differentiable, with } \sum_{j=0}^{\alpha} \|f^{(j)}\|_\infty \leq C \Big\}.$$

Classical results in approximation theory (e.g., Kolmogorov & Tikhomirov, 1959):
$$\log M(2\delta; \mathcal{F}) \;\asymp\; (1/\delta)^{1/\alpha}.$$

Master equation and minimax rate
$$\log M(2\delta) \;\asymp\; n\delta^2 \qquad\Longrightarrow\qquad \delta^2 \;\asymp\; (1/n)^{\frac{2\alpha}{2\alpha + 1}}.$$
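Spelling out the one-line calculation behind this implication (constants suppressed):

\[
(1/\delta)^{1/\alpha} \;\asymp\; n\delta^2
\;\;\Longleftrightarrow\;\;
\delta^{-(1/\alpha) - 2} \;\asymp\; n
\;\;\Longleftrightarrow\;\;
\delta^2 \;\asymp\; n^{-\frac{2}{2 + 1/\alpha}} \;=\; (1/n)^{\frac{2\alpha}{2\alpha+1}}.
\]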

Example 3: Sparse additive and $\alpha$-smooth regression

Observations $y_i = f(x_i) + w_i$, where $f$ satisfies the constraints:
  Additively decomposable: $f(x_1, \ldots, x_p) = \sum_{j=1}^p g_j(x_j)$
  Sparse: at most $k$ of $(g_1, \ldots, g_p)$ are non-zero
  Smooth: each $g_j$ belongs to the $\alpha$-smooth family $\mathcal{F}(\alpha)$

Combining the previous results yields a $2\delta$-packing with
$$\log M(2\delta; \mathcal{F}) \;\asymp\; k\,(1/\delta)^{1/\alpha} + k \log\Big(\frac{ep}{k}\Big).$$

Master equation and minimax rate
Solving $\log M(2\delta) \asymp n\delta^2$ yields
$$\delta^2 \;\asymp\; \underbrace{k\,(1/n)^{\frac{2\alpha}{2\alpha+1}}}_{k\text{-component estimation}} \;+\; \underbrace{\frac{k \log\big(\tfrac{ep}{k}\big)}{n}}_{\text{search complexity}}.$$

Summary
To be provided during tutorial....
