You are on page 1of 46

Diffusion Wavelets on Graphs and Manifolds

R.R. Coifman, MM, J.C. Bremer Jr., A.D. Szlam

Other collaborators on related projects: P.W. Jones, S. Lafon, R. Schul

Outline of the talk

1) Relate learning and approximation under smoothness constraints, motivate need for good 'building blocks for approximation 2) Some very classical building blocks in Euclidean spaces: Fourier and wavelets (a comparison) 3) New multiscale building blocks for manifolds: for this we need to talk about about diffusion geometries: 3a) Global diffusion geometries, eigenfunctions of the Laplacian 3b) Multiscale diffusion wavelets 4) Stylized applications 5) Future work

Learning & Function Approximation on graphs and manifolds

In many cases learning algorithms (or learning itself?) can be viewed as an approximation problem, under smoothness constraints, on a set of points/state space represented as a manifold or graph. [Think of SMVs as an example] Approximation Smoothness constraint Manifold/graph

Fitting the function of interest

Want generalization power, complexity control, avoid overfitting

Belief: the structure of the state space has to do with the learning task.

Develop tools for multiscale analysis of functions on manifolds, varifolds, graphs, datasets.

Analyze large amount of data, and functions on this data, intrinsically rather low-dimensional, embedded in high dimensions. Paradigm: we have a large number of documents (e.g.: web pages, gene array data, (hyper)spectral data, molecular dynamics data etc...) and a way of measuring similarity between pairs. Model: a graph (G,E,W) In important practical cases: vertices are points in high-dimensional Euclidean space, weights may be a function of Euclidean distance.

Data sets in high-dimensions are complicated, as are classes of interesting functions on them. We want to do approximation (learning) of such functions. Parametrize low dimensional subspaces of smooth functions on sets embedded in highdimension. Fast algorithms

High dimensional data: examples

Documents, web searching Customer databases Satellite imagery Transaction logs Social networks Gene arrays, proteomics data Art transactions data Traffic (automobilistic, network) statistics




















-20 50 100 150 200 250

Function with a discontinuity

6 4














Rate of decay of coefficients onto Fourier (red) and wavelets (blue)

Laplacian, diffusion geometries

[RR Coifman, S. Lafon] Related previous and current work by M. Belkin, P. Niyogi on eigenmaps and applications to semi-supervised learning on manifolds and graphs. Part of the material in the next few slides is courtesy of Stephane Lafon. From local to global: diffusion distances Motto: Diffusion distance measures and averages connections of all lengths. It is more stable and finer than geodesic distance, uses a preponderance of evidence

A local similarity operator on the set

X = { x1 , x2 ,..., xN } data set.

k ( x , y ) kernel defined on the data and being symmetric k ( x, y) = k ( x, y) positivity-preserving: k ( x , y ) 0 positive semi-definite: ( x) ( y)k ( x, y) 0
x X y X

The kernel k describes the geometry of X by defining the relationship between the data points.

Examples of similarity kernels

Examples: If X lies in n-dimensional Euclidean space, k(x,y) could be: - exponentially weighted distance: exp(-(||x-y||/a)^2 - angle: <x,y>/(||x|| ||y||) - harmonic potential: 1/(a+||x-y||), or powers of it - feature distances: any of the above applied in the range f(X), f nonlinear, possibly mapping X to higher dimension - if a model for the data is available (probabilistic, or from a (stochastic) dynamical system), k may be chosen to be consistent with that model If X is an abstract graph: we need to be given some weights on edges, measuring similarity. This could wildly vary: from the graph obtained by discretizing a PDE, to ways of measuring similarity between two web pages, or two protein chains...

Diffusion Distances


Diffusion embedding mapping X with diffusion distance into Euclidean space with Euclidean distance:




Diffusion embedding mapping X with diffusion distance into Euclidean space with Euclidean distance:

Link with other kernel methods

Recent kernel methods: LLE (Roweis, Saul 2000), Laplacian Eigenmaps (Belkin, Niyogi 2002) , Hessian Eigenmaps (Donoho, Grimes 2003), LTSA (Zhang, Zha 2002) ... all based on the following paradigm: minimize Q( f ) where Q( f ) = Qx ( f )
x X

Qx ( f ) : quadratic form measuring local variation of f i a neighborhood of x n Solution: compute eigenfunctions {l }of Q and map data points via x (0 ( x), 1 ( x),..., p ( x)) T
x X

In our case we minimize k ( x, y)( f ( x) f ( y)) 2


- Classifiers in the semi-supervised learning context (M. Belkin, P. Nyogi) - fMRI data (F. Meyer, X. Shen) - Art data (W Goetzmann, PW Jones, MM, J Walden) - Hyperspectral Imaging in Pathology (MM, GL Davis, F Warner, F. Geshwind, A Coppi, R. DeVerse, RR Coifman) - Molecular dynamics simulations (RR. Coifman, G.Hummer, I. Kevrekidis, MM) - Text documents classification (RR. Coifman, MM)

Art data
[W Goetzmann, PW Jones, MM, J Walden] Given the sales of art works (paintings and drawings) of 58 artists in for 9 years. View them as 58 points in 9 dimensions, try to find clusters that indicate how the works of groups of artists are 'traded' together. Does this correlate to style?
3 2 1 0 -1 -2 -3 -4 5 0 -5 -10 -4 -2 0 2 4 6 8

Projection of the data on the top 3 Principal Components

Image under the top 3 eigenfunction of Laplacian + K-means

Hyper-spectral Pathology Data

[MM, GL Davis, F Warner, F. Geshwind, A Coppi, R. DeVerse, RR Coifman]

So far...
..we have seen that: - it seems useful to consider a framework in which, for a given data set, local similarities are given only between very similar points - it is possible to organize these local information by diffusion into global parametrizations, - these parametrizations can be found by looking at the eigenvectors of a diffusion operator, - these eigenvectors in turn yield a nonlinear embedding into low-dimensional Euclidean space, - the eigenvectors can be used for global Fourier analysis on the set/manifold PROBLEM: Either very local information or very global information: in many problems the intermediate scales are very interesting (think of web pages)! Would like MULTISCALE information! Solution 1: proceed bottom up: repeatedly cluster together in a multi-scale fashion, in a way that is faithful to the operator: diffusion wavelets. Solution 2: proceed top bottom: cut greedily according to global information, and repeat procedure on the pieces: recursive partitioning, local cosines... Solution 3: do both!

From global Diffusion Geometries...

We are given a graph X with weights W. There is a natural random walk P on X induced by these weights. P maps probability distributions on X to probability distributions on X. The reversibility condition on P implies that P is conjugate to a selfadjoint operator T. We assume T is renormalized to have 2-norm 1. The spectra of T and its powers look like:
1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1
8 (T ) 4 (T ) 2 (T )



0 5 10 V 15 20




... to Multiresolution Diffusion

[Coifman,MM] Multiscale compression scheme: - Random walk for 1=2^0 step - Collect together random walkers into representatives - Write a random walk on the representatives - Let the representatives random walk 2=2^1 steps...
The decay in the spectrum of T says powers of T are low-rank, hence compressible.
1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1
16 (T ) 8 (T ) 4 (T ) 2 (T )

0 5 10 V 15 20




Multiscale Random Walkers


Dilations, translations, downsampling

We have frequencies: the eigenvalues of the diffusion T. What about dilations , translations , downsampling? We may have minimal information about the geometry, and only locally. Let's think in terms of functions on the set X. Dilations: Use the diffusion operator T and its dyadic powers as dilations. Translations and downsampling: Idea: diffusing a basis of scaling functions at a certain scale by a power of T should yield a redundant set of coarser scaling functions at the next coarser scale: reduce this set to a Riesz (i.e. well-conditioned) or even orthonormal basis. This is downsampling in the function space, and corresponds to finding a well-conditioned subset of translates.

Potential Theory, Green's function

Diffusion Wavelets on the sphere

Diffusion Wavelets on a dumbbell

Wavelets: - Lifting (W. Sweldens, I. Daubechies,...) - Continuous wavelets from group actions (Grossman, Morlet, Ali, Antoine, Gazeau, many physicists...) Classical Harmonic Analysis: - Littlewood-Paley on semigroups (Stein), Markov diffusion semigroups (large literature) - Martingales associated to the above, Brownian motion - Heat kernel estimates, generalized Heisenberg principles (A. Nahmod) - Harmonic Analysis of eigenfunctions of the Laplacian on domains, manifolds and graphs - Atomic decompositions (Coifman, Weiss, Rochberg...) Numerics: - Algebraic Multigrid (Brandt, large literature since the 80's...) - Kind of inverse FMM (Rohklin, Beylkin-Coifman-Rohklin, ...) - Multiscale matrix compression techniques (Gu-Eisenstat, Chaen-GimbutasMartinsson-Rohklin,...) - Randomizable - FFTs?

Diffusion Wavelet Packets

[JC Bremer, RR Coifman, MM, AD Szlam]

We can split the wavelet subspaces further, in a hierarchical dyadic fashion, very much like in the classical case. The splittings are generated by numerical kernel and numerical range operations.

Compression example I

[a la Coifman-Wickerhauser & Donoho-Johnstone]

Analysis of a document corpora

Given 1,000 documents, each of which is associated with 1,000 words, with a value indicating the relevance of each word in that document. View this as a set of 1,000 in 1,000 dimensions, construct a graph with 1,000 vertices, each of which with few selected edges to very close-by documents.

0.15 0.2 0.1 0 0 -0.1 0 -0.2 -0.05 -0.05 0 0.05 0.1 -0.1 -0.2 0.1 0.05 0.05 0 0 -0.05 -0.1 -0.05 0.1 -0.05 0.05 -0.1 -0.15 0.1 0.15 0.1 0.05





-0.2 0.1 0.05 0 -0.05 -0.1 -0.05 0

0.1 0.05


-0.2 0.1 0.05 0 0.1 -0.05 -0.1 -0.05 0.05 0


Paleontology Climate


Astronomy, planets

-0.2 0.1 0.05 0 -0.05 0.05 0 -0.1 -0.05 0 0.1 0.2 0.15 0.1




-0.2 0.1 0.05 0 -0.05 0.05 0 0.1 0.15

-0.2 0.1

Comets, asteroids

0.05 0 -0.05 -0.1 -0.05 0 0.1 0.05

Examples of multiscale scaling functions on documents

- By Jessica Gorman--"Pan scrapings!" announced G. Kenneth Sams. "When you come down to it, that is -what has brought us all here tonight-pan scrapings."-- It was probably not the most appetizing introduction for a $150per-plate -museum fundraiser. Some 150 guests, many in black tie, had just sat down to -enjoy their first course at lavish banquet tables in the Upper Egyptian Gallery -of the University of Pennsylvania Museum of Archaeology and Anthropology in -Philadelphia. Did they r Can chemical analysis confirm a wine's authenticity?-Damaris Christensen-- There's a romance to wine unmatched by most agricultural products. Making -wine, with all its complex flavors, remains as much an art as a science.-- Hints of citrus, peach, raspberry, pear, oak, grass, or flowers may show up in -the taste or smell of wines. The catalog of factors that determine the flavor -and bouquet is almost as long as the list of adjectives that connoisseurs use -to describe them.--

Words 'wine' 'region'




Examples of multiscale scaling functions on documents

Documents Words 'nitrogen' 'plant' 'ecologist' 'carbon' 'global' by J. Raloff-- Acid rain and agricultural pollution 'specy' 'native' 'change' 'pollution' both spew nitrogen into the air. Though -plants need 'university' 'growth' 'soil' 'human' 'activity' nitrogen to grow, a new study finds that even small 'nutrient' 'dioxide' 'fuel' 'acid' 'cycle' additions of -this fertilizing pollutant can perturb the 'natural' landscape.-- In plots of Minnesota prairie to which ecologists applied nitrogen for 12 -years, native grasses showed a dramatically impaired ability to compete against -weeds that had immigrated from Europe centuries ago.-- The nitrogen treatment triggered the terrestrial equ ans = by C. Mlot-- For most of agricultural history, nitrogen has been a precious commodity. Only -specialized bacteria and lightning could convert atmospheric nitrogen into -biologically usable forms. Today, however, fertilizers and fossil fuels have -made nitrogen so freely available that it has become too much of a good thing.-- In a review of nitrogen's effects across the environmental spectrum, a team of -ecologists headed by Peter M. Vitousek of Stanford University has concluded in -no

Examples of multiscale scaling functions on documents

Documents Words 'earthquake' 'wave' 'fault' 'quake' 'tsunami' New technologies convert the motion of waves into 'california' 'southern' 'rock' 'seismologist' watts-- Peter Weiss-'ocean' 'lake' 'coast' 'tsunamis' 'north' R. Monastersky-- The underwater volcano growing 'region' 'island' 'seismic' 'plate' 'crust' off Hawaii's south coastline collapsed 'power' - The Mush Zone A slurpy layer lurks deep inside the planet-by I. Peterson-- In late 1942, carrying 15,000 U.S. soldiers bound for England, the Queen Mary -hit a storm about 700 miles off the coast of Scotland. Without warning amid the -tumult, a single, mountainous wave struck the ocean liner, rolling it over and -washing water across its upper decks. Luckily, the ship managed to right itself -and continue on its voyage.-- Gargantuan waves, which appear unexpectedly even under calm conditions in the -open ocean, have damaged and sunk numerous s - Racing the Waves Seismologists try to catch quake tremors quickly enough to -save lives-- geologists have been snapping a very different picture of -the lake lately. Far beneath Lake Tahoe's gentle surface, they say, several -hidden earthquake faults snake across the lake's flat bottom. How a middling quake made a giant tsunami Waves of Death-- Why the New Guinea tsunami

Comments, Applications, etc...

This is a wavelet analysis on manifolds, graphs, markov chains, certain fractals, while Laplacian eigenfunctions do the corresponding Fourier Analysis, adapted to a given diffusion operator. We are compressing: powers of the operator, functions of the operators, subspaces of the function subspaces on which its powers act. This yields sampling formulas, quadrature formulas. Does not require the diffusion to be self-adjoint, nor needs the eigenvectors. We are constructing a biorthogonal version of the transform (better adapted to studying Markov chains): this will allow efficient denoising, compression, discrimination on all the spaces mentioned above. Preliminary: n log(n) with good constants for symmetric problems. The multiscale spaces are a natural scale of complexity spaces for learning empirical functions on the data set. Diffusion scaling functions extend outside the set, in a natural multiscale fashion. Exploring ties with measure-geometric considerations used for embedding metric spaces in Euclidean spaces with small distortion. Study and compression of dynamical systems.

Current & Future Work

- Use for regularization and learning on graphs and manifolds: semisupervised and plain supervised. - Robustness to noise, perturbations, extension of the functions to test points. - Biorthogonal construction (better localization, better constants) - Multiscale embeddings for graphs, measure-geometric implications - Martingale aspects and fast Brownian motion simulation in nonhomogeneous media - Compression of data sets - Going nonlinear This talk, papers, Matlab code available at: (or Google for Mauro Maggioni) Thank you!