
Probability and Statistics

Sam Roweis

October 2, 2002

Random Variables and Densities

• Random variables $X$ represent outcomes or states of the world. Instantiations of variables are usually written in lower case: $x$. We will write $p(x)$ to mean probability($X = x$).
• Sample space: the space of all possible outcomes/states. (May be discrete or continuous or mixed.)
• Probability mass (density) function $p(x) \ge 0$: assigns a non-negative number to each point in the sample space. It sums (integrates) to unity: $\sum_x p(x) = 1$ or $\int_x p(x)\,dx = 1$. Intuitively: how often does $x$ occur, how much do we believe in $x$.
• Ensemble: random variable + sample space + probability function.


Probability

• We use probabilities $p(x)$ to represent our beliefs $B(x)$ about the states $x$ of the world.
• There is a formal calculus for manipulating uncertainties represented by probabilities.
• Any consistent set of beliefs obeying the Cox Axioms can be mapped into probabilities.
  1. Rationally ordered degrees of belief: if $B(x) > B(y)$ and $B(y) > B(z)$ then $B(x) > B(z)$.
  2. Belief in $x$ and its negation $\bar{x}$ are related: $B(x) = f[B(\bar{x})]$.
  3. Belief in a conjunction depends only on conditionals: $B(x \text{ and } y) = g[B(x), B(y|x)] = g[B(y), B(x|y)]$.

Expectations, Moments

• Expectation of a function $a(x)$ is written $E[a]$ or $\langle a \rangle$:
  $$E[a] = \langle a \rangle = \sum_x p(x)\,a(x)$$
  e.g. mean $= \sum_x x\,p(x)$, variance $= \sum_x (x - E[x])^2\,p(x)$.
• Moments are expectations of higher order powers. (Mean is first moment. Autocorrelation is second moment.)
• Centralized moments have lower moments subtracted away (e.g. variance, skew, kurtosis).
• Deep fact: knowledge of all orders of moments completely defines the entire distribution (strictly, this needs a regularity condition, such as a moment generating function that exists near zero).
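The expectation formulas above can be sketched numerically. This is a minimal illustration, assuming a discrete distribution stored as a dictionary; the helper name `expectation` and the fair-die example are hypothetical, not part of the notes.

```python
def expectation(p, a):
    """E[a] = sum_x p(x) a(x) for a discrete distribution given as {x: p(x)}."""
    return sum(px * a(x) for x, px in p.items())

# Hypothetical example: a fair six-sided die.
die = {x: 1 / 6 for x in range(1, 7)}

mean = expectation(die, lambda x: x)                      # first moment
second_moment = expectation(die, lambda x: x ** 2)        # second moment
variance = expectation(die, lambda x: (x - mean) ** 2)    # second centralized moment
```

The variance is the second moment with the squared first moment subtracted away, which is what "centralized" means here.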

Joint Probability

• Key concept: two or more random variables may interact. Thus, the probability of one taking on a certain value depends on which value(s) the others are taking.
• We call this a joint ensemble and write $p(x, y) = \text{prob}(X = x \text{ and } Y = y)$.

[Figure: the joint table $p(x, y, z)$ with axes $x$, $y$, $z$.]

Conditional Probability

• If we know that some event has occurred, it changes our belief about the probability of other events.
• This is like taking a "slice" through the joint table:
  $$p(x|y) = p(x, y) / p(y)$$

[Figure: the slice $p(x, y|z)$ of the joint table, with axes $x$, $y$, $z$.]

Marginal Probabilities

• We can "sum out" part of a joint distribution to get the marginal distribution of a subset of variables:
  $$p(x) = \sum_y p(x, y)$$
• This is like adding slices of the table together.
• Another equivalent definition: $p(x) = \sum_y p(x|y)\,p(y)$.

[Figure: summing the joint table over $z$ to obtain $p(x, y)$.]

Bayes' Rule

• Manipulating the basic definition of conditional probability gives one of the most important formulas in probability theory:
  $$p(x|y) = \frac{p(y|x)\,p(x)}{p(y)} = \frac{p(y|x)\,p(x)}{\sum_{x'} p(y|x')\,p(x')}$$
• This gives us a way of "reversing" conditional probabilities.
• Thus, all joint probabilities can be factored by selecting an ordering for the random variables and using the "chain rule":
  $$p(x, y, z, \ldots) = p(x)\,p(y|x)\,p(z|x, y)\,p(\ldots|x, y, z)$$
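The table operations above (marginalizing, slicing, and reversing a conditional with Bayes' rule) can be sketched on a small joint table. The table values and all function names below are made up for illustration.

```python
# Hypothetical joint table p(x, y); the values are invented and sum to 1.
joint = {
    ("a", 0): 0.10, ("a", 1): 0.20, ("a", 2): 0.10,
    ("b", 0): 0.30, ("b", 1): 0.10, ("b", 2): 0.20,
}

def marginal(joint, axis):
    """Sum out the other variable: axis=0 gives p(x), axis=1 gives p(y)."""
    out = {}
    for key, p in joint.items():
        out[key[axis]] = out.get(key[axis], 0.0) + p
    return out

def conditional_x_given_y(joint, y):
    """Take the slice at Y = y and renormalize: p(x|y) = p(x, y) / p(y)."""
    p_y = marginal(joint, 1)[y]
    return {x: p / p_y for (x, yy), p in joint.items() if yy == y}

def conditional_y_given_x(joint, x):
    """Same slice-and-renormalize step in the other direction: p(y|x)."""
    p_x = marginal(joint, 0)[x]
    return {y: p / p_x for (xx, y), p in joint.items() if xx == x}

def bayes_x_given_y(joint, y):
    """Reverse p(y|x) via Bayes' rule: p(x|y) = p(y|x) p(x) / sum_x' p(y|x') p(x')."""
    p_x = marginal(joint, 0)
    numer = {x: conditional_y_given_x(joint, x)[y] * p_x[x] for x in p_x}
    evidence = sum(numer.values())
    return {x: v / evidence for x, v in numer.items()}

direct = conditional_x_given_y(joint, 1)
via_bayes = bayes_x_given_y(joint, 1)
```

Both routes give the same conditional, since Bayes' rule is only a rearrangement of the definition of conditional probability.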

Independence & Conditional Independence

• Two variables are independent iff their joint factors:
  $$p(x, y) = p(x)\,p(y)$$
• Two variables are conditionally independent given a third one if for all values of the conditioning variable, the resulting slice factors:
  $$p(x, y|z) = p(x|z)\,p(y|z) \quad \forall z$$

[Figure: the joint table $p(x, y)$ shown as the product of $p(x)$ and $p(y)$.]

Entropy

• Measures the amount of ambiguity or uncertainty in a distribution:
  $$H(p) = -\sum_x p(x) \log p(x)$$
• Expected value of $-\log p(x)$ (a function which depends on $p(x)$!).
• $H(p) > 0$ unless there is only one possible outcome, in which case $H(p) = 0$.
• Maximal value when $p$ is uniform.
• Tells you the expected "cost" if each event costs $-\log p(\text{event})$.

Be Careful!

• Watch the context: e.g. Simpson's paradox.
• Define random variables and sample spaces carefully: e.g. Prisoner's paradox.

Cross Entropy (KL Divergence)

• An asymmetric measure of the distance between two distributions:
  $$KL[p\|q] = \sum_x p(x)\,[\log p(x) - \log q(x)]$$
• $KL > 0$ unless $p = q$, in which case $KL = 0$.
• Tells you the extra cost if events were generated by $p(x)$ but instead of charging under $p(x)$ you charged under $q(x)$.
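The entropy and KL formulas above are short enough to check numerically. The sketch below assumes discrete distributions stored as lists of probabilities and assumes $q(x) > 0$ wherever $p(x) > 0$; the function names and example distributions are illustrative only.

```python
import math

def entropy(p):
    """H(p) = -sum_x p(x) log p(x), in nats; terms with p(x) = 0 contribute 0."""
    return -sum(px * math.log(px) for px in p if px > 0)

def kl_divergence(p, q):
    """KL[p||q] = sum_x p(x) [log p(x) - log q(x)].

    Assumes q(x) > 0 wherever p(x) > 0; otherwise the divergence is infinite.
    """
    return sum(px * (math.log(px) - math.log(qx)) for px, qx in zip(p, q) if px > 0)

# Hypothetical example distributions over three outcomes.
p = [0.7, 0.2, 0.1]
uniform = [1 / 3, 1 / 3, 1 / 3]

h_p = entropy(p)
h_uniform = entropy(uniform)          # the maximum for three outcomes
kl_pq = kl_divergence(p, uniform)
kl_qp = kl_divergence(uniform, p)     # generally different from kl_pq
```

When $q$ is uniform over $K$ outcomes, $KL[p\|q] = \log K - H(p)$, which ties the "extra cost" reading of KL back to the entropy.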

Jensen's Inequality

• For any concave function $f()$ and any distribution on $x$,
  $$E[f(x)] \le f(E[x])$$
  (For a convex $f$ the inequality is reversed.)
• e.g. $\log()$ and $\sqrt{\cdot}$ are concave.

[Figure: a concave curve on which $f(E[x])$ lies above $E[f(x)]$.]

Statistics

• Probability: inferring probabilistic quantities for data given fixed models (e.g. prob. of events, marginals, conditionals, etc).
• Statistics: inferring a model given fixed data observations (e.g. clustering, classification, regression).
• Many approaches to statistics: frequentist, Bayesian, decision theory, ...

Some (Conditional) Probability Functions

• Probability density functions $p(x)$ (for continuous variables) or probability mass functions $p(x = k)$ (for discrete variables) tell us how likely it is to get a particular value for a random variable (possibly conditioned on the values of some other variables).
• We can consider various types of variables: binary/discrete (categorical), continuous, interval, and integer counts.
• For each type we'll see some basic probability models which are parametrized families of distributions.

(Conditional) Probability Tables

• For discrete (categorical) quantities, the most basic parametrization is the probability table which lists $p(x_i = k\text{th value})$.
• Since PTs must be nonnegative and sum to 1, for $k$-ary variables there are $k - 1$ free parameters.
• If a discrete variable is conditioned on the values of some other discrete variables we make one table for each possible setting of the parents: these are called conditional probability tables or CPTs.

[Figure: the joint table $p(x, y, z)$ and the conditional table $p(x, y|z)$, with axes $x$, $y$, $z$.]
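A conditional probability table as described above is just one probability table per parent setting. The sketch below uses invented variable names and values, and counts the free parameters as $k - 1$ per table.

```python
# Hypothetical CPT p(transport | weather): one table per setting of the parent.
cpt = {
    "sunny": {"walk": 0.7, "bus": 0.2, "car": 0.1},
    "rainy": {"walk": 0.1, "bus": 0.5, "car": 0.4},
}

def is_valid_cpt(cpt, tol=1e-9):
    """Every table must be non-negative and sum to 1."""
    return all(
        all(p >= 0 for p in table.values()) and abs(sum(table.values()) - 1.0) < tol
        for table in cpt.values()
    )

def free_parameters(cpt):
    """Each k-ary table has k - 1 free parameters, since its last entry is determined."""
    return sum(len(table) - 1 for table in cpt.values())

valid = is_valid_cpt(cpt)
n_free = free_parameters(cpt)   # two tables with three outcomes each
```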

Exponential Family

• For a (continuous or discrete) random variable $x$,
  $$p(x|\eta) = h(x)\exp\{\eta^\top T(x) - A(\eta)\} = \frac{1}{Z(\eta)}\,h(x)\exp\{\eta^\top T(x)\}$$
  is an exponential family distribution with natural parameter $\eta$.
• Function $T(x)$ is a sufficient statistic.
• Function $A(\eta) = \log Z(\eta)$ is the log normalizer.
• Key idea: all you need to know about the data is captured in the summarizing function $T(x)$.

Poisson

• For an integer count variable with rate $\lambda$:
  $$p(x|\lambda) = \frac{\lambda^x e^{-\lambda}}{x!} = \frac{1}{x!}\exp\{x\log\lambda - \lambda\}$$
• Exponential family with:
  $\eta = \log\lambda$
  $T(x) = x$
  $A(\eta) = \lambda = e^{\eta}$
  $h(x) = 1/x!$
• e.g. number of photons $x$ that arrive at a pixel during a fixed interval given mean intensity $\lambda$.
• Other count densities: binomial, exponential.

Bernoulli

• For a binary random variable with p(heads) $= \pi$:
  $$p(x|\pi) = \pi^x (1-\pi)^{1-x} = \exp\left\{\log\left(\frac{\pi}{1-\pi}\right)x + \log(1-\pi)\right\}$$
• Exponential family with:
  $\eta = \log\frac{\pi}{1-\pi}$
  $T(x) = x$
  $A(\eta) = -\log(1-\pi) = \log(1 + e^{\eta})$
  $h(x) = 1$
• The logistic function relates the natural parameter and the chance of heads:
  $$\pi = \frac{1}{1 + e^{-\eta}}$$

Multinomial

• For a set of integer counts on $k$ trials:
  $$p(x|\pi) = \frac{k!}{x_1!\,x_2!\cdots x_n!}\,\pi_1^{x_1}\pi_2^{x_2}\cdots\pi_n^{x_n} = h(x)\exp\left\{\sum_i x_i \log\pi_i\right\}$$
• But the parameters are constrained: $\sum_i \pi_i = 1$. So we define the last one $\pi_n = 1 - \sum_{i=1}^{n-1}\pi_i$:
  $$p(x|\pi) = h(x)\exp\left\{\sum_{i=1}^{n-1}\log\left(\frac{\pi_i}{\pi_n}\right)x_i + k\log\pi_n\right\}$$
• Exponential family with:
  $\eta_i = \log\pi_i - \log\pi_n$
  $T(x_i) = x_i$
  $A(\eta) = -k\log\pi_n = k\log\sum_i e^{\eta_i}$
  $h(x) = k!/(x_1!\,x_2!\cdots x_n!)$
• The softmax function relates the basic and natural parameters:
  $$\pi_i = \frac{e^{\eta_i}}{\sum_j e^{\eta_j}}$$

Gaussian (normal)

• For a continuous univariate random variable:
  $$p(x|\mu,\sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left\{-\frac{1}{2\sigma^2}(x-\mu)^2\right\} = \frac{1}{\sqrt{2\pi}}\exp\left\{\frac{\mu x}{\sigma^2} - \frac{x^2}{2\sigma^2} - \frac{\mu^2}{2\sigma^2} - \log\sigma\right\}$$
• Exponential family with:
  $\eta = [\mu/\sigma^2\,;\; -1/2\sigma^2]$
  $T(x) = [x\,;\; x^2]$
  $A(\eta) = \log\sigma + \mu^2/2\sigma^2$
  $h(x) = 1/\sqrt{2\pi}$
• Note: a univariate Gaussian is a two-parameter distribution with a two-component vector of sufficient statistics.

Multivariate Gaussian Distribution

• For a continuous vector random variable:
  $$p(x|\mu,\Sigma) = |2\pi\Sigma|^{-1/2}\exp\left\{-\tfrac{1}{2}(x-\mu)^\top\Sigma^{-1}(x-\mu)\right\}$$
• Exponential family with:
  $\eta = [\Sigma^{-1}\mu\,;\; -\tfrac{1}{2}\Sigma^{-1}]$
  $T(x) = [x\,;\; xx^\top]$
  $A(\eta) = \log|\Sigma|/2 + \mu^\top\Sigma^{-1}\mu/2$
  $h(x) = (2\pi)^{-n/2}$
• Sufficient statistics: mean vector and correlation matrix.
• Other densities: Student-t, Laplacian.
• For non-negative values use exponential, Gamma, log-normal.

Important Gaussian Facts

• All marginals of a Gaussian are again Gaussian. Any conditional of a Gaussian is again Gaussian.

[Figure: a joint Gaussian $p(x, y)$ with covariance $\Sigma$, its marginal $p(x)$, and the conditional $p(y|x = x_0)$ at the point $x_0$.]
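The exponential family forms for the Bernoulli, Poisson and multinomial above can be checked against the standard formulas. This is a small sketch under the parametrizations listed in the slides; all function names are illustrative.

```python
import math

def logistic(eta):
    """pi = 1 / (1 + exp(-eta)): natural parameter -> chance of heads."""
    return 1.0 / (1.0 + math.exp(-eta))

def bernoulli_pmf(x, pi):
    return pi ** x * (1 - pi) ** (1 - x)

def bernoulli_expfam(x, eta):
    """h(x) = 1, T(x) = x, A(eta) = log(1 + exp(eta))."""
    return math.exp(eta * x - math.log1p(math.exp(eta)))

def poisson_pmf(x, lam):
    return lam ** x * math.exp(-lam) / math.factorial(x)

def poisson_expfam(x, eta):
    """h(x) = 1/x!, T(x) = x, A(eta) = exp(eta)."""
    return math.exp(eta * x - math.exp(eta)) / math.factorial(x)

def softmax(etas):
    """pi_i = exp(eta_i) / sum_j exp(eta_j), shifted by the max for stability."""
    m = max(etas)
    exps = [math.exp(e - m) for e in etas]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical parameter values used only for the check.
pi = 0.3
eta = math.log(pi / (1 - pi))
probs = [0.2, 0.3, 0.5]
etas = [math.log(p / probs[-1]) for p in probs]   # eta_i = log pi_i - log pi_n
recovered = softmax(etas)
```

Fixing the last natural parameter to zero, as the slides do by defining $\pi_n$ from the others, is what makes the softmax map invertible.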

Gaussian Marginals/Conditionals

• To find these parameters is mostly linear algebra. Let $z = [x^\top\; y^\top]^\top$ be normally distributed according to:
  $$z = \begin{bmatrix} x \\ y \end{bmatrix} \sim N\left(\begin{bmatrix} a \\ b \end{bmatrix}, \begin{bmatrix} A & C \\ C^\top & B \end{bmatrix}\right)$$
  where $C$ is the (non-symmetric) cross-covariance matrix between $x$ and $y$, which has as many rows as the size of $x$ and as many columns as the size of $y$.
• The marginal distributions are:
  $$x \sim N(a, A) \qquad y \sim N(b, B)$$
• and the conditional distributions are:
  $$x|y \sim N(a + CB^{-1}(y - b),\; A - CB^{-1}C^\top)$$
  $$y|x \sim N(b + C^\top A^{-1}(x - a),\; B - C^\top A^{-1}C)$$

Moments

• For continuous variables, moment calculations are important.
• We can easily compute moments of any exponential family distribution by taking the derivatives of the log normalizer $A(\eta)$.
• The $q$th derivative gives the $q$th centred moment (strictly, the $q$th cumulant, which coincides with the centred moment for $q = 2$ and $q = 3$):
  $$\frac{dA(\eta)}{d\eta} = \text{mean} \qquad \frac{d^2A(\eta)}{d\eta^2} = \text{variance} \qquad \ldots$$
• When the sufficient statistic is a vector, partial derivatives need to be considered.

Parameterizing Conditionals

• When the variable(s) being conditioned on (parents) are discrete, we just have one density for each possible setting of the parents, e.g. a table of natural parameters in exponential models or a table of tables for discrete models.
• When the conditioned variable is continuous, its value sets some of the parameters for the other variables.
• A very common instance of this for regression is the "linear-Gaussian": $p(y|x) = \text{gauss}(\theta^\top x; \Sigma)$.
• For discrete children and continuous parents, we often use a Bernoulli/multinomial whose parameters are some function $f(\theta^\top x)$.

Generalized Linear Models (GLMs)

• Generalized Linear Models: $p(y|x)$ is exponential family with conditional mean $\mu = f(\theta^\top x)$.
• The function $f$ is called the response function.
• If we choose $f$ to be the inverse of the mapping between the conditional mean and the natural parameters then it is called the canonical response function:
  $$\eta = \psi(\mu) \qquad f(\cdot) = \psi^{-1}(\cdot)$$
  so that $\eta = \theta^\top x$.
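The claim in the Moments slide, that derivatives of the log normalizer give the mean and variance, can be checked numerically with finite differences. The sketch below uses the Bernoulli and Poisson log normalizers; the helper names and the step size are assumptions chosen for illustration.

```python
import math

def derivative(f, x, h=1e-4):
    """Central finite-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

def second_derivative(f, x, h=1e-4):
    """Central finite-difference estimate of f''(x)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / (h * h)

def A_bernoulli(eta):
    """Bernoulli log normalizer A(eta) = log(1 + exp(eta))."""
    return math.log1p(math.exp(eta))

def A_poisson(eta):
    """Poisson log normalizer A(eta) = exp(eta)."""
    return math.exp(eta)

# Hypothetical parameter values.
pi = 0.3
eta_b = math.log(pi / (1 - pi))
bern_mean = derivative(A_bernoulli, eta_b)          # should be close to pi
bern_var = second_derivative(A_bernoulli, eta_b)    # should be close to pi (1 - pi)

lam = 2.5
eta_p = math.log(lam)
pois_mean = derivative(A_poisson, eta_p)            # should be close to lambda
pois_var = second_derivative(A_poisson, eta_p)      # should be close to lambda
```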

Potential Functions

• We can be even more general and define distributions by arbitrary energy functions proportional to the (negative) log probability:
  $$p(\mathbf{x}) \propto \exp\left\{-\sum_k H_k(\mathbf{x})\right\}$$
• A common choice is to use pairwise terms in the energy:
  $$H(\mathbf{x}) = \sum_i a_i x_i + \sum_{\text{pairs } ij} w_{ij}\,x_i x_j$$

Basic Statistical Problems

• Let's remind ourselves of the basic problems we discussed on the first day: density estimation, clustering, classification and regression.
• Density estimation is hardest. If we can do joint density estimation then we can always condition to get what we want:
  Regression: $p(y|x) = p(y, x)/p(x)$
  Classification: $p(c|x) = p(c, x)/p(x)$
  Clustering: $p(c|x) = p(c, x)/p(x)$, with $c$ unobserved

Special variables

• If certain variables are always observed we may not want to model their density. For example inputs in regression or classification. This leads to conditional density estimation.
• If certain variables are always unobserved, they are called hidden or latent variables. They can always be marginalized out, but can make the density modeling of the observed variables easier. (We'll see more on this later.)

Fundamental Operations with Distributions

• Generate data: draw samples from the distribution. This often involves generating a uniformly distributed variable in the range [0,1] and transforming it. For more complex distributions it may involve an iterative procedure that takes a long time to produce a single sample (e.g. Gibbs sampling, MCMC).
• Compute log probabilities. When all variables are either observed or marginalized the result is a single number which is the log prob of the configuration.
• Inference: compute expectations of some variables given others which are observed or marginalized.
• Learning: set the parameters of the density functions given some (partially) observed data to maximize likelihood or penalized likelihood.
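The "generate data" operation above, drawing a uniform variable in [0,1] and transforming it, can be sketched with inverse-CDF sampling. The two samplers below (an exponential density and a small discrete table) are illustrative choices, not taken from the notes, and the seed and sample counts are arbitrary.

```python
import math
import random

def sample_exponential(rate, rng):
    """Inverse-CDF sampling: if u ~ Uniform[0, 1), then -log(1 - u) / rate ~ Exp(rate)."""
    u = rng.random()
    return -math.log(1.0 - u) / rate

def sample_discrete(probs, rng):
    """Walk the cumulative table until it exceeds a uniform draw."""
    u = rng.random()
    cumulative = 0.0
    for k, p in enumerate(probs):
        cumulative += p
        if u < cumulative:
            return k
    return len(probs) - 1   # guard against floating-point round-off

rng = random.Random(0)      # fixed seed so the sketch is reproducible
exp_samples = [sample_exponential(2.0, rng) for _ in range(20000)]
cat_samples = [sample_discrete([0.2, 0.3, 0.5], rng) for _ in range(20000)]

exp_mean = sum(exp_samples) / len(exp_samples)               # near 1 / rate
freq_last = cat_samples.count(2) / len(cat_samples)          # near 0.5
```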

Learning

• In AI the bottleneck is often knowledge acquisition.
• Human experts are rare, expensive, unreliable, slow.
• But we have lots of data.
• Want to build systems automatically based on data and a small amount of prior information (from experts).

Known Models

• Many systems we build will be essentially probability models.
• Assume the prior information we have specifies the type & structure of the model, as well as the form of the (conditional) distributions.
• In this case learning $\equiv$ setting parameters.
• Also possible to do "structure learning" to learn the model.

Multiple Observations, Complete Data, IID Sampling

• A single observation of the data $X$ is rarely useful on its own.
• Generally we have data including many observations, which creates a set of random variables: $D = \{x^1, x^2, \ldots, x^M\}$
• Two very common assumptions:
  1. Observations are independently and identically distributed according to the joint distribution of the graphical model: IID samples.
  2. We observe all random variables in the domain on each observation: complete data.

Likelihood Function

• So far we have focused on the (log) probability function $p(x|\theta)$ which assigns a probability (density) to any joint configuration of variables $x$ given fixed parameters $\theta$.
• But in learning we turn this on its head: we have some fixed data and we want to find parameters.
• Think of $p(x|\theta)$ as a function of $\theta$ for fixed $x$:
  $$L(\theta; x) = p(x|\theta) \qquad \ell(\theta; x) = \log p(x|\theta)$$
  This function is called the (log) "likelihood".
• Choose $\theta$ to maximize some cost function $c(\theta)$ which includes $\ell(\theta)$:
  $c(\theta) = \ell(\theta; D)$: maximum likelihood (ML)
  $c(\theta) = \ell(\theta; D) + r(\theta)$: maximum a posteriori (MAP) / penalized ML
  (also cross-validation, Bayesian estimators, BIC, AIC, ...)

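Maximum Likelihood

• For IID data:
  $$p(D|\theta) = \prod_m p(x^m|\theta) \qquad \ell(\theta; D) = \sum_m \log p(x^m|\theta)$$
• Idea of maximum likelihood estimation (MLE): pick the setting of parameters most likely to have generated the data we saw:
  $$\theta^*_{ML} = \arg\max_\theta\, \ell(\theta; D)$$
• Very commonly used in statistics. Often leads to "intuitive", "appealing", or "natural" estimators.

Example: Bernoulli Trials

• We observe $M$ iid coin flips: D = H, H, T, H, ...
• Model: $p(H) = \theta$, $p(T) = (1 - \theta)$
• Likelihood:
  $$\ell(\theta; D) = \log p(D|\theta) = \log\prod_m \theta^{x^m}(1-\theta)^{1-x^m} = \log\theta\sum_m x^m + \log(1-\theta)\sum_m (1 - x^m) = \log\theta\, N_H + \log(1-\theta)\, N_T$$
• Take derivatives and set to zero:
  $$\frac{\partial\ell}{\partial\theta} = \frac{N_H}{\theta} - \frac{N_T}{1-\theta} \;\Rightarrow\; \theta^*_{ML} = \frac{N_H}{N_H + N_T}$$

Example: Multinomial

• We observe $M$ iid die rolls ($K$-sided): D = 3, 1, K, 2, ...
• Model: $p(k) = \theta_k$, $\sum_k \theta_k = 1$
• Likelihood (for binary indicators $[x^m = k]$):
  $$\ell(\theta; D) = \log p(D|\theta) = \log\prod_m \theta_1^{[x^m=1]}\cdots\theta_K^{[x^m=K]} = \sum_k \log\theta_k \sum_m [x^m = k] = \sum_k N_k \log\theta_k$$
• Take derivatives and set to zero (enforcing $\sum_k \theta_k = 1$; the $-M$ term is the Lagrange multiplier for this constraint):
  $$\frac{\partial\ell}{\partial\theta_k} = \frac{N_k}{\theta_k} - M \;\Rightarrow\; \theta^*_k = \frac{N_k}{M}$$

Example: Univariate Normal

• We observe $M$ iid real samples: D = 1.18, -.25, .78, ...
• Model: $p(x) = (2\pi\sigma^2)^{-1/2}\exp\{-(x-\mu)^2/2\sigma^2\}$
• Likelihood (using probability density):
  $$\ell(\theta; D) = \log p(D|\theta) = -\frac{M}{2}\log(2\pi\sigma^2) - \frac{1}{2}\sum_m \frac{(x^m - \mu)^2}{\sigma^2}$$
• Take derivatives and set to zero:
  $$\frac{\partial\ell}{\partial\mu} = \frac{1}{\sigma^2}\sum_m (x^m - \mu) \qquad \frac{\partial\ell}{\partial\sigma^2} = -\frac{M}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_m (x^m - \mu)^2$$
  $$\Rightarrow\; \mu_{ML} = \frac{1}{M}\sum_m x^m \qquad \sigma^2_{ML} = \frac{1}{M}\sum_m (x^m)^2 - \mu^2_{ML}$$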
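The three closed-form estimators derived above reduce to counting and averaging. A minimal sketch, with hypothetical function names and toy data in the style of the slides:

```python
from collections import Counter

def bernoulli_mle(flips):
    """theta_ML = N_H / (N_H + N_T) for a sequence of 'H'/'T' outcomes."""
    n_heads = sum(1 for f in flips if f == "H")
    return n_heads / len(flips)

def multinomial_mle(rolls, num_sides):
    """theta_k = N_k / M for die faces numbered 1..num_sides."""
    counts = Counter(rolls)
    return [counts[k] / len(rolls) for k in range(1, num_sides + 1)]

def gaussian_mle(samples):
    """mu_ML is the sample mean; sigma^2_ML is the mean square minus mu_ML^2."""
    m = len(samples)
    mu = sum(samples) / m
    sigma2 = sum(x * x for x in samples) / m - mu * mu
    return mu, sigma2

theta_coin = bernoulli_mle("HHTH")
theta_die = multinomial_mle([3, 1, 3, 2], num_sides=3)
mu_ml, sigma2_ml = gaussian_mle([1.18, -0.25, 0.78])
```

Note that $\sigma^2_{ML}$ divides by $M$, not $M - 1$, so it is the maximum likelihood estimate rather than the unbiased sample variance.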

[Figure: Example: Univariate Normal.]

[Figure: Example: Linear Regression, with axes X and Y.]

Example: Linear Regression

• In linear regression, some inputs (covariates, parents) and all outputs (responses, children) are continuous valued variables.
• For each child and setting of discrete parents we use the model:
  $$p(y|x, \theta) = \text{gauss}(y\,|\,\theta^\top x, \sigma^2)$$
• The likelihood is the familiar "squared error" cost:
  $$\ell(\theta; D) = -\frac{1}{2\sigma^2}\sum_m (y^m - \theta^\top x^m)^2 + \text{const}$$
• The ML parameters can be solved for using linear least-squares:
  $$\frac{\partial\ell}{\partial\theta} = \frac{1}{\sigma^2}\sum_m (y^m - \theta^\top x^m)\,x^m \;\Rightarrow\; \theta^*_{ML} = (X^\top X)^{-1}X^\top Y$$

Sufficient Statistics

• A statistic is a function of a random variable.
• $T(X)$ is a "sufficient statistic" for $X$ if
  $$T(x_1) = T(x_2) \;\Rightarrow\; L(\theta; x_1) = L(\theta; x_2) \quad \forall\theta$$
• Equivalently (by the Neyman factorization theorem) we can write:
  $$p(x|\theta) = h(x, T(x))\; g(T(x), \theta)$$
• Example: exponential family models:
  $$p(x|\theta) = h(x)\exp\{\theta^\top T(x) - A(\theta)\}$$
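The least-squares solution $\theta^*_{ML} = (X^\top X)^{-1}X^\top Y$ from the linear regression example can be sketched without any libraries by solving the normal equations directly. The function name and data below are hypothetical, and the sketch assumes $X^\top X$ is invertible.

```python
def fit_least_squares(X, y):
    """Solve the normal equations (X^T X) theta = X^T y by Gaussian elimination.

    X is a list of input rows, y a list of targets. Assumes X^T X is invertible.
    """
    d = len(X[0])
    # Augmented matrix [X^T X | X^T y]; both are sums over the data.
    A = [[sum(row[i] * row[j] for row in X) for j in range(d)]
         + [sum(row[i] * t for row, t in zip(X, y))]
         for i in range(d)]
    for col in range(d):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, d), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, d):
            factor = A[r][col] / A[col][col]
            for c in range(col, d + 1):
                A[r][c] -= factor * A[col][c]
    # Back substitution on the upper-triangular system.
    theta = [0.0] * d
    for i in reversed(range(d)):
        s = sum(A[i][j] * theta[j] for j in range(i + 1, d))
        theta[i] = (A[i][d] - s) / A[i][i]
    return theta

# Made-up noiseless data: y = 2 + 3 x, with a constant feature for the intercept.
X = [[1.0, float(x)] for x in range(5)]
y = [2.0 + 3.0 * x for x in range(5)]
theta = fit_least_squares(X, y)
```

The entries of $X^\top X$ and $X^\top Y$ are sums over the data, so they play the role of sufficient statistics for this model.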

Sufficient Statistics are Sums

• In the examples above, the sufficient statistics were merely sums (counts) of the data:
  Bernoulli: # of heads, tails
  Multinomial: # of each type
  Gaussian: mean, mean-square
  Regression: correlations
• As we will see, this is true for all exponential family models: sufficient statistics are average natural parameters.
• Only exponential family models have simple sufficient statistics.

MLE for Exponential Family Models

• Recall the probability function for exponential models:
  $$p(x|\eta) = h(x)\exp\{\eta^\top T(x) - A(\eta)\}$$
• For iid data, the sufficient statistic is $\sum_m T(x^m)$:
  $$\ell(\eta; D) = \log p(D|\eta) = \left(\sum_m \log h(x^m)\right) - M A(\eta) + \left(\eta^\top \sum_m T(x^m)\right)$$
• Take derivatives and set to zero:
  $$\frac{\partial\ell}{\partial\eta} = \sum_m T(x^m) - M\frac{\partial A(\eta)}{\partial\eta} \;\Rightarrow\; \left.\frac{\partial A(\eta)}{\partial\eta}\right|_{\eta_{ML}} = \frac{1}{M}\sum_m T(x^m)$$
  recalling that the natural moments of an exponential distribution are the derivatives of the log normalizer. So at $\eta_{ML}$ the expected sufficient statistic equals its average over the data.
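The moment-matching condition above can be illustrated with the Poisson model, where $T(x) = x$ and $A(\eta) = e^{\eta}$, so setting the derivative of $A$ equal to the average count gives $\eta_{ML} = \log(\text{mean count})$. The function names and the count data are hypothetical, and the sketch assumes the average count is positive.

```python
import math

def poisson_log_likelihood(eta, counts):
    """l(eta; D) = sum_m log h(x^m) - M A(eta) + eta * sum_m T(x^m).

    Uses h(x) = 1/x!, T(x) = x and A(eta) = exp(eta).
    """
    m = len(counts)
    log_h = -sum(math.lgamma(x + 1) for x in counts)   # log(1/x!) = -lgamma(x + 1)
    return log_h - m * math.exp(eta) + eta * sum(counts)

def poisson_mle_eta(counts):
    """Solve dA/deta = exp(eta) = (1/M) sum_m x^m; assumes the average is positive."""
    return math.log(sum(counts) / len(counts))

counts = [2, 3, 1, 4, 0]            # made-up photon counts
eta_ml = poisson_mle_eta(counts)
rate_ml = math.exp(eta_ml)          # the fitted rate equals the sample mean
ll_at_ml = poisson_log_likelihood(eta_ml, counts)
```

Only `sum(counts)` and `len(counts)` enter the estimate, which is the sense in which the sum of $T(x^m)$ is sufficient.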