
Review:
Probability and Statistics

Sam Roweis

October 2, 2002

Random variables X represent outcomes or states of the world.
Instantiations of variables are usually written in lowercase: x.
We will write p(x) to mean probability(X = x).

Sample space: the space of all possible outcomes/states.
(May be discrete, continuous, or mixed.)

Probability mass (density) function p(x) ≥ 0:
assigns a non-negative number to each point in the sample space;
sums (integrates) to unity: Σ_x p(x) = 1 or ∫_x p(x) dx = 1.
Intuitively: how often does x occur; how much do we believe in x.

Ensemble: random variable + sample space + probability function.
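As a concrete sketch of these definitions, a discrete ensemble can be represented as a table of outcomes and probabilities (the weather distribution below is invented for illustration):

```python
# A discrete ensemble: a sample space, and a pmf assigning a
# non-negative number to each outcome. (Toy distribution for illustration.)
p = {"sun": 0.6, "rain": 0.3, "snow": 0.1}

# Every value is non-negative.
assert all(v >= 0 for v in p.values())

# The pmf sums to unity over the sample space.
assert abs(sum(p.values()) - 1.0) < 1e-12

# p(x) for an instantiation x of the random variable X.
x = "rain"
print(p[x])
```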

Probability

We use probabilities p(x) to represent our beliefs B(x) about the states x of the world.
There is a formal calculus for manipulating uncertainties represented by probabilities.
Any consistent set of beliefs obeying the Cox axioms can be mapped into probabilities:

1. Rationally ordered degrees of belief:
   if B(x) > B(y) and B(y) > B(z) then B(x) > B(z).
2. Belief in x and its negation x̄ are related: B(x̄) = f[B(x)].
3. Belief in a conjunction depends only on conditionals:
   B(x and y) = g[B(x), B(y|x)] = g[B(y), B(x|y)].

Expectations, Moments

Expectation of a function a(x) is written E[a] or ⟨a⟩:
E[a] = ⟨a⟩ = Σ_x p(x) a(x)
e.g. mean = Σ_x x p(x), variance = Σ_x (x − E[x])² p(x).

Moments are expectations of higher-order powers.
(Mean is the first moment; autocorrelation is the second moment.)
Centralized moments have lower moments subtracted away (e.g. variance, skew, kurtosis).
Deep fact: knowledge of the moments of all orders completely defines the entire distribution.
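The expectation sum can be computed directly from a probability table; a minimal sketch with a made-up pmf:

```python
# E[a] = sum_x p(x) a(x), computed over a toy discrete distribution.
p = {0: 0.2, 1: 0.5, 2: 0.3}  # invented pmf for illustration

def expectation(p, a):
    """Expectation of a function a(x) under pmf p."""
    return sum(p[x] * a(x) for x in p)

mean = expectation(p, lambda x: x)                # first moment
second = expectation(p, lambda x: x ** 2)         # second moment
var = expectation(p, lambda x: (x - mean) ** 2)   # centralized second moment

# The centralized second moment equals E[x^2] - E[x]^2.
assert abs(var - (second - mean ** 2)) < 1e-12
```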

Joint Probability

Key concept: two or more random variables may interact; the probability of one taking on a certain value depends on which value(s) the others are taking.
We call this a joint ensemble and write p(x, y) = prob(X = x and Y = y).

Conditional Probability

If we know that some event has occurred, it changes our belief about the probability of other events.
This is like taking a "slice" through the joint table:
p(x|y) = p(x, y) / p(y)

Marginal Probabilities

We can "sum out" part of a joint distribution to get the marginal distribution of a subset of variables:
p(x) = Σ_y p(x, y)
This is like adding slices of the table together.
Another equivalent definition: p(x) = Σ_y p(x|y) p(y).

Bayes' Rule

Manipulating the basic definition of conditional probability gives one of the most important formulas in probability theory:
p(x|y) = p(y|x) p(x) / p(y) = p(y|x) p(x) / Σ_x′ p(y|x′) p(x′)
This gives us a way of "reversing" conditional probabilities.
Thus, all joint probabilities can be factored by selecting an ordering for the random variables and using the "chain rule":
p(x, y, z, …) = p(x) p(y|x) p(z|x, y) p(…|x, y, z)
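Slicing, summing out, and Bayes' rule can all be checked mechanically on a small joint table (the numbers below are invented):

```python
# Toy joint distribution p(x, y) over x in {0, 1}, y in {0, 1}.
joint = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.4}

def marginal_x(x):
    # "Summing out" y: p(x) = sum_y p(x, y)
    return sum(p for (xi, _), p in joint.items() if xi == x)

def marginal_y(y):
    return sum(p for (_, yi), p in joint.items() if yi == y)

def cond_y_given_x(y, x):
    # Slicing the joint table: p(y|x) = p(x, y) / p(x)
    return joint[(x, y)] / marginal_x(x)

def bayes_x_given_y(x, y):
    # Bayes' rule: p(x|y) = p(y|x) p(x) / sum_x' p(y|x') p(x')
    num = cond_y_given_x(y, x) * marginal_x(x)
    den = sum(cond_y_given_x(y, xp) * marginal_x(xp) for xp in (0, 1))
    return num / den

# "Reversing" the conditional agrees with slicing the joint directly:
# p(x|y) = p(x, y) / p(y).
assert abs(bayes_x_given_y(1, 0) - joint[(1, 0)] / marginal_y(0)) < 1e-12
```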

Independence & Conditional Independence

Two variables are independent iff their joint factors:
p(x, y) = p(x) p(y)
Two variables are conditionally independent given a third one if, for all values of the conditioning variable, the resulting slice factors:
p(x, y|z) = p(x|z) p(y|z)  ∀z

Be Careful!

Watch the context: e.g. Simpson's paradox.
Define random variables and sample spaces carefully: e.g. the Prisoner's paradox.

Entropy

Measures the amount of ambiguity or uncertainty in a distribution:
H(p) = −Σ_x p(x) log p(x)
Expected value of −log p(x) (a function which depends on p(x)!).
H(p) > 0 unless only one outcome is possible, in which case H(p) = 0.
Maximal value when p is uniform.
Tells you the expected "cost" if each event costs −log p(event).

Cross Entropy (KL Divergence)

An asymmetric measure of the distance between two distributions:
KL[p‖q] = Σ_x p(x) [log p(x) − log q(x)]
KL > 0 unless p = q, in which case KL = 0.
Tells you the extra cost incurred if events were generated by p(x) but you charged under q(x) instead of p(x).
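Entropy and KL divergence are short sums; a sketch over two invented four-outcome distributions, checking the properties above:

```python
from math import log

def entropy(p):
    """H(p) = -sum_x p(x) log p(x), in nats."""
    return -sum(px * log(px) for px in p if px > 0)

def kl(p, q):
    """KL[p||q] = sum_x p(x) (log p(x) - log q(x))."""
    return sum(px * (log(px) - log(qx)) for px, qx in zip(p, q) if px > 0)

uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.7, 0.1, 0.1, 0.1]

# Entropy is maximal for the uniform distribution.
assert entropy(uniform) > entropy(peaked)

# A deterministic outcome has zero entropy.
assert entropy([1.0, 0.0, 0.0, 0.0]) == 0.0

# KL >= 0, zero iff p == q, and asymmetric in its arguments.
assert kl(uniform, uniform) == 0.0
assert kl(peaked, uniform) > 0
assert abs(kl(peaked, uniform) - kl(uniform, peaked)) > 1e-9
```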

Jensen's Inequality

For any convex function f(·) and any distribution on x:
E[f(x)] ≥ f(E[x])
e.g. −log(x) is convex.

Some (Conditional) Probability Functions

Probability density functions p(x) (for continuous variables) or probability mass functions p(x = k) (for discrete variables) tell us how likely it is to get a particular value for a random variable (possibly conditioned on the values of some other variables).
We can consider various types of variables: binary/discrete (categorical), continuous, interval, and integer counts.
For each type we'll see some basic probability models, which are parametrized families of distributions.

Statistics

Probability: inferring probabilistic quantities for data given fixed models (e.g. prob. of events, marginals, conditionals, etc.).
Statistics: inferring a model given fixed data observations (e.g. clustering, classification, regression).
Many approaches to statistics: frequentist, Bayesian, decision theory, ...

(Conditional) Probability Tables

For discrete (categorical) quantities, the most basic parametrization is the probability table, which lists p(x_i = kth value).
Since PTs must be non-negative and sum to 1, for k-ary variables there are k − 1 free parameters.
If a discrete variable is conditioned on the values of some other discrete variables, we make one table for each possible setting of the parents: these are called conditional probability tables, or CPTs.
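A CPT is just one probability table per parent setting; a minimal sketch (the rain/wet numbers are invented):

```python
# A conditional probability table (CPT): one row, i.e. one probability
# table over the child, for each setting of the discrete parent.
# All numbers are invented for illustration.
cpt = {
    # p(ground | rain)
    "rain":    {"wet": 0.9, "dry": 0.1},
    "no_rain": {"wet": 0.2, "dry": 0.8},
}

# Each row must be a valid distribution: non-negative and summing to 1 ...
for parent, row in cpt.items():
    assert all(v >= 0 for v in row.values())
    assert abs(sum(row.values()) - 1.0) < 1e-12

# ... so a k-ary child has k - 1 free parameters per parent setting.
k = 2
free_params_per_row = k - 1
```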

Exponential Family

For a (continuous or discrete) random variable x,
p(x|η) = h(x) exp{η⊤T(x) − A(η)} = (1/Z(η)) h(x) exp{η⊤T(x)}
is an exponential family distribution with natural parameter η.
Function T(x) is a sufficient statistic.
Function A(η) = log Z(η) is the log normalizer.
Key idea: all you need to know about the data is captured in the summarizing function T(x).

Poisson

For an integer count variable with rate λ:
p(x|λ) = λ^x e^{−λ} / x! = (1/x!) exp{x log λ − λ}
Exponential family with:
η = log λ
T(x) = x
A(η) = λ = e^η
h(x) = 1/x!
e.g. the number of photons x that arrive at a pixel during a fixed interval, given mean intensity λ.
Other count densities: binomial.

Bernoulli

For a binary random variable with p(heads) = π:
p(x|π) = π^x (1 − π)^{1−x} = exp{x log(π/(1 − π)) + log(1 − π)}
Exponential family with:
η = log(π/(1 − π))
T(x) = x
A(η) = −log(1 − π) = log(1 + e^η)
h(x) = 1
The logistic function relates the natural parameter and the chance of heads:
π = 1/(1 + e^{−η})

Multinomial

For a set of integer counts x_1, …, x_n on k trials:
p(x|θ) = (k!/(x_1! x_2! ⋯ x_n!)) θ_1^{x_1} θ_2^{x_2} ⋯ θ_n^{x_n} = h(x) exp{Σ_i x_i log θ_i}
But the parameters are constrained: Σ_i θ_i = 1.
So we define the last one, θ_n = 1 − Σ_{i=1}^{n−1} θ_i, and write:
p(x|θ) = h(x) exp{Σ_{i=1}^{n−1} x_i log(θ_i/θ_n) + k log θ_n}
Exponential family with:
η_i = log θ_i − log θ_n
T(x_i) = x_i
A(η) = −k log θ_n = k log Σ_i e^{η_i}
h(x) = k!/(x_1! x_2! ⋯ x_n!)
The softmax function relates the basic and natural parameters:
θ_i = e^{η_i} / Σ_j e^{η_j}

Gaussian (Normal)

For a continuous univariate random variable:
p(x|μ, σ²) = (1/√(2πσ²)) exp{−(x − μ)²/(2σ²)}
           = (1/√(2π)) exp{μx/σ² − x²/(2σ²) − μ²/(2σ²) − log σ}
Exponential family with:
η = [μ/σ²; −1/(2σ²)]
T(x) = [x; x²]
A(η) = log σ + μ²/(2σ²)
h(x) = 1/√(2π)
Note: a univariate Gaussian is a two-parameter distribution with a two-component vector of sufficient statistics.
Other densities: Student-t, Laplacian. For non-negative values use exponential, Gamma, log-normal.

Multivariate Gaussian Distribution

For a continuous vector random variable:
p(x|μ, Σ) = |2πΣ|^{−1/2} exp{−(1/2)(x − μ)⊤Σ^{−1}(x − μ)}
Exponential family with:
η = [Σ^{−1}μ; −(1/2)Σ^{−1}]
T(x) = [x; xx⊤]
A(η) = log|Σ|/2 + μ⊤Σ^{−1}μ/2
h(x) = (2π)^{−n/2}
Sufficient statistics: mean vector and correlation matrix.

Important Gaussian Facts

All marginals of a Gaussian are again Gaussian.
Any conditional of a Gaussian is again Gaussian.
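The exponential-family rewriting of the Bernoulli can be verified numerically: evaluate the standard form and the natural-parameter form at the same π (value chosen arbitrarily) and check they agree, along with the logistic link.

```python
from math import exp, log

# Bernoulli in exponential-family form:
#   p(x|pi) = pi^x (1-pi)^(1-x) = exp{x*eta - A(eta)},
# with eta = log(pi/(1-pi)), T(x) = x, A(eta) = log(1 + e^eta), h(x) = 1.
pi = 0.3                      # arbitrary chance of heads
eta = log(pi / (1 - pi))      # natural parameter
A = log(1 + exp(eta))         # log normalizer

for x in (0, 1):
    standard = pi ** x * (1 - pi) ** (1 - x)
    exp_family = exp(x * eta - A)
    assert abs(standard - exp_family) < 1e-12

# The logistic function recovers the chance of heads from eta.
assert abs(1 / (1 + exp(-eta)) - pi) < 1e-12
```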

Gaussian Marginals/Conditionals

To find these parameters is mostly linear algebra. Let z = [x⊤ y⊤]⊤ be normally distributed according to:
z = [x; y] ~ N([a; b], [A C; C⊤ B])
where C is the (non-symmetric) cross-covariance matrix between x and y, which has as many rows as the size of x and as many columns as the size of y.
The marginal distributions are:
x ~ N(a, A)
y ~ N(b, B)
and the conditional distributions are:
x|y ~ N(a + C B^{−1}(y − b), A − C B^{−1} C⊤)
y|x ~ N(b + C⊤ A^{−1}(x − a), B − C⊤ A^{−1} C)

Moments

For continuous variables, moment calculations are important.
We can easily compute moments of any exponential family distribution by taking derivatives of the log normalizer A(η):
dA(η)/dη = mean
d²A(η)/dη² = variance
The qth derivative gives the qth centred moment.
When the sufficient statistic is a vector, partial derivatives need to be considered.

Parameterizing Conditionals

When the variable(s) being conditioned on (the parents) are discrete, we just have one density for each possible setting of the parents, e.g. a table of natural parameters in exponential models, or a table of tables for discrete models.
When the conditioned variable is continuous, its value sets some of the parameters for the other variables.
A very common instance of this for regression is the "linear-Gaussian": p(y|x) = gauss(θ⊤x; Σ).
For discrete children and continuous parents, we often use a Bernoulli/multinomial whose parameters are some function f(θ⊤x).

Generalized Linear Models (GLMs)

Generalized linear models: p(y|x) is exponential family with conditional mean μ = f(θ⊤x).
The function f is called the response function.
If we choose f to be the inverse of the mapping between conditional mean and natural parameters, η = ψ(μ), then f(·) = ψ^{−1}(·) is called the canonical response function.
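In the scalar case the conditioning formulas reduce to simple arithmetic; a sketch with invented block values, checking two sanity properties (conditioning never increases variance; zero cross-covariance leaves the marginal unchanged):

```python
# Scalar sketch of the Gaussian conditioning formulas above:
# z = (x, y) ~ N([a, b], [[A, C], [C, B]]) with scalar blocks.
# All numbers are invented for illustration.
a, b = 1.0, -2.0
A, B, C = 2.0, 1.5, 0.6

def conditional_x_given_y(y):
    """Mean and variance of x|y, per x|y ~ N(a + C B^-1 (y - b), A - C B^-1 C)."""
    mean = a + C / B * (y - b)
    var = A - C * C / B
    return mean, var

mean, var = conditional_x_given_y(y=0.0)

# Conditioning on a correlated variable reduces the variance ...
assert var < A

# ... and with zero cross-covariance the conditional equals the marginal.
C = 0.0
mean0, var0 = conditional_x_given_y(y=0.0)
assert (mean0, var0) == (a, A)
```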

Potential Functions

We can be even more general and define distributions by arbitrary energy functions proportional to the negative log probability:
p(x) ∝ exp{−Σ_k H_k(x)}
A common choice is to use pairwise terms in the energy:
H(x) = Σ_i a_i x_i + Σ_{pairs ij} w_ij x_i x_j

Basic Statistical Problems

Let's remind ourselves of the basic problems we discussed on the first day: density estimation, clustering, classification, and regression.
Density estimation is hardest. If we can do joint density estimation then we can always condition to get what we want:
Regression: p(y|x) = p(y, x)/p(x)
Classification: p(c|x) = p(c, x)/p(x)
Clustering: p(c|x) = p(c, x)/p(x), with c unobserved

Special Variables

If certain variables are always observed we may not want to model their density, for example inputs in regression or classification. This leads to conditional density estimation.
If certain variables are always unobserved, they are called hidden or latent variables. They can always be marginalized out, but they can make the density modeling of the observed variables easier. (We'll see more on this later.)

Fundamental Operations with Distributions

Generate data: draw samples from the distribution. This often involves generating a uniformly distributed variable in the range [0,1] and transforming it. For more complex distributions it may involve an iterative procedure that takes a long time to produce a single sample (e.g. Gibbs sampling, MCMC).
Compute log probabilities. When all variables are either observed or marginalized, the result is a single number which is the log prob of the configuration.
Inference: compute expectations of some variables given others which are observed or marginalized.
Learning: set the parameters of the density functions given some (partially) observed data to maximize likelihood or penalized likelihood.
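The "generate a uniform variable and transform it" recipe can be sketched with inverse-transform sampling; the exponential distribution and its rate below are chosen arbitrarily for illustration:

```python
import random
from math import log

# Inverse-transform sampling: draw u ~ Uniform[0, 1), then push it through
# the inverse CDF. For an exponential distribution with rate lam, the
# inverse CDF is -log(1 - u) / lam.
random.seed(0)
lam = 2.0
samples = [-log(1 - random.random()) / lam for _ in range(100_000)]

# The sample mean should be close to the true mean 1/lam.
mean = sum(samples) / len(samples)
assert abs(mean - 1 / lam) < 0.01

# Exponential samples are non-negative by construction.
assert min(samples) >= 0.0
```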

Learning

In AI the bottleneck is often knowledge acquisition. Human experts are rare, expensive, unreliable, and slow. But we have lots of data.
Want to build systems automatically based on data and a small amount of prior information (from experts).

Known Models

Many systems we build will be essentially probability models.
Assume the prior information we have specifies the type & structure of the model, as well as the form of the (conditional) distributions or potentials.
In this case learning ≡ setting parameters. (It is also possible to do "structure learning" to learn the model itself.)

Multiple Observations, Complete Data, IID Sampling

A single observation of the data X is rarely useful on its own.
Generally we have data including many observations, which creates a set of random variables: D = {x¹, x², …, x^M}
Two very common assumptions:
1. Observations are independently and identically distributed according to the joint distribution of the graphical model: IID samples.
2. We observe all random variables in the domain on each observation: complete data.

Likelihood Function

So far we have focused on the (log) probability function p(x|θ), which assigns a probability (density) to any joint configuration of variables x given fixed parameters θ.
But in learning we turn this on its head: we have some fixed data and we want to find parameters.
Think of p(x|θ) as a function of θ for fixed x:
L(θ; x) = p(x|θ)
ℓ(θ; x) = log p(x|θ)
This function is called the (log) "likelihood".
Choose θ to maximize some cost function c(θ) which includes ℓ(θ):
c(θ) = ℓ(θ; D)          maximum likelihood (ML)
c(θ) = ℓ(θ; D) + r(θ)   maximum a posteriori (MAP) / penalized ML
(also cross-validation, Bayesian estimators, BIC, AIC, …)
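Viewing p(x|θ) as a function of θ for fixed data can be sketched by scanning the Bernoulli log likelihood over a grid of parameter values (the coin-flip data below is invented: 7 heads, 3 tails):

```python
from math import log

# Log likelihood of iid Bernoulli data as a function of theta, data fixed.
data = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]  # invented flips: 7 heads, 3 tails

def loglik(theta):
    return sum(x * log(theta) + (1 - x) * log(1 - theta) for x in data)

# Scan a grid of parameter values; the log likelihood is concave and
# peaks at the empirical frequency N_H / M = 0.7.
grid = [i / 1000 for i in range(1, 1000)]
best = max(grid, key=loglik)
assert abs(best - 0.7) < 0.0015
```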

Maximum Likelihood

For IID data:
p(D|θ) = ∏_m p(x^m|θ)
ℓ(θ; D) = Σ_m log p(x^m|θ)
Idea of maximum likelihood estimation (MLE): pick the setting of parameters most likely to have generated the data we saw:
θ*_ML = argmax_θ ℓ(θ; D)
Very commonly used in statistics. Often leads to "intuitive", "appealing", or "natural" estimators.

Example: Bernoulli Trials

We observe M iid coin flips: D = H, H, T, H, …
Model: p(H) = θ, p(T) = (1 − θ)
Likelihood:
ℓ(θ; D) = log p(D|θ) = log ∏_m θ^{x^m} (1 − θ)^{1−x^m}
        = log θ Σ_m x^m + log(1 − θ) Σ_m (1 − x^m)
        = N_H log θ + N_T log(1 − θ)
Take derivatives and set to zero:
∂ℓ/∂θ = N_H/θ − N_T/(1 − θ) = 0  ⇒  θ*_ML = N_H / (N_H + N_T)

Example: Multinomial

We observe M iid die rolls (K-sided): D = 3, 1, K, 2, …
Model: p(k) = θ_k, Σ_k θ_k = 1
Likelihood (for binary indicators [x^m = k]):
ℓ(θ; D) = log p(D|θ) = log ∏_m θ_1^{[x^m=1]} ⋯ θ_K^{[x^m=K]} = Σ_k N_k log θ_k
Take derivatives and set to zero (enforcing Σ_k θ_k = 1):
∂ℓ/∂θ_k = N_k/θ_k − M  ⇒  θ*_k = N_k / M

Example: Univariate Normal

We observe M iid real samples: D = 1.18, −.25, .78, …
Model: p(x) = (2πσ²)^{−1/2} exp{−(x − μ)²/(2σ²)}
Likelihood (using probability density):
ℓ(θ; D) = log p(D|θ) = −(M/2) log(2πσ²) − (1/2) Σ_m (x^m − μ)²/σ²
Take derivatives and set to zero:
∂ℓ/∂μ = (1/σ²) Σ_m (x^m − μ)
∂ℓ/∂σ² = −M/(2σ²) + (1/(2σ⁴)) Σ_m (x^m − μ)²
⇒ μ_ML = (1/M) Σ_m x^m
⇒ σ²_ML = (1/M) Σ_m (x^m)² − μ²_ML

Example: Linear Regression

In linear regression, some inputs (covariates, parents) and all outputs (responses, children) are continuous valued variables.
For each child and setting of discrete parents we use the model:
p(y|x, θ) = gauss(y | θ⊤x, σ²)
The likelihood is the familiar "squared error" cost:
ℓ(θ; D) = −(1/2σ²) Σ_m (y^m − θ⊤x^m)²
The ML parameters can be solved for using linear least-squares:
∂ℓ/∂θ = −Σ_m (y^m − θ⊤x^m) x^m  ⇒  θ*_ML = (X⊤X)^{−1} X⊤Y

Sufficient Statistics

A statistic is a function of a random variable.
T(X) is a "sufficient statistic" for X if
T(x¹) = T(x²) ⇒ L(θ; x¹) = L(θ; x²) ∀θ
Equivalently (by the Neyman factorization theorem) we can write:
p(x|θ) = h(x, T(x)) g(T(x), θ)
Example: exponential family models: p(x|θ) = h(x) exp{θ⊤T(x) − A(θ)}

Sufficient Statistics are Sums

In the examples above, the sufficient statistics were merely sums (counts) of the data:
Bernoulli: # of heads, tails
Multinomial: # of each type
Gaussian: mean, mean-square
Regression: correlations
As we will see, this is true for all exponential family models.
Only exponential family models have simple sufficient statistics.

MLE for Exponential Family Models

Recall the probability function for exponential models:
p(x|θ) = h(x) exp{θ⊤T(x) − A(θ)}
For iid data, the sufficient statistic is Σ_m T(x^m):
ℓ(θ; D) = log p(D|θ) = Σ_m log h(x^m) − M A(θ) + θ⊤ Σ_m T(x^m)
Take derivatives and set to zero:
∂ℓ/∂θ = Σ_m T(x^m) − M ∂A(θ)/∂θ = 0
⇒ ∂A(θ)/∂θ = (1/M) Σ_m T(x^m)
recalling that the natural moments of an exponential family distribution are the derivatives of the log normalizer.
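The closed-form normal estimates can be checked directly; the data below extends the slide's sample values with invented extras:

```python
# Closed-form ML estimates for the univariate normal.
# First three values from the slide's example; the rest are invented.
data = [1.18, -0.25, 0.78, 0.33, 1.02, -0.11]

M = len(data)
mu_ml = sum(data) / M                                  # (1/M) sum_m x^m
var_ml = sum(x * x for x in data) / M - mu_ml ** 2     # (1/M) sum_m (x^m)^2 - mu^2

# Same as the average squared deviation from the ML mean.
direct = sum((x - mu_ml) ** 2 for x in data) / M
assert abs(var_ml - direct) < 1e-12
assert var_ml > 0
```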
