
Useful Statistical concepts for Engineers

Deepak Agarwal
ML Class
12/10/2009
Yahoo!

-1-
Scope of the lecture
• Basic probability distributions to model randomness in data

• Fitting distributions to data

• Common parametric distributions
– Discrete distributions, continuous distributions

• Generalized linear models

• Multi-level hierarchical models
– Generalized linear mixed-effects models

-2-
Role of Probability distributions
• Probability distributions
– Mathematical models to describe intrinsic variation in data
– Help quantify uncertainty and, eventually, support decision making

• How do we construct such distributions to compute probabilities for any subset of the data domain?
– Domain
• Finite set of points (total clicks in 100 displays of ad X on Pub Y)
• Countably infinite set of points (total visits to webpage Y)
• Real numbers (time spent on webpage Y)
– Is it necessary to specify probabilities for all subsets?
• No. So what do we need to specify?

-3-
Cumulative distribution function (CDF)
• X : random variable
• CDF F : Ω → [0,1] such that
– F(x) = Pr( X ≤ x)
• F is non-decreasing and right-continuous
• The CDF uniquely characterizes a probability distribution
• Given the CDF, we can compute the probability of any subset

• E.g. P(a < X ≤ b) = F(b) − F(a); P(X > b) = 1 − F(b)
– What about more complicated sets?
– In high dimensions?
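As a concrete illustration of the identities above (a minimal sketch; the standard normal CDF, computed via the error function, is my choice of example and is not from the slides):

```python
import math

def norm_cdf(x, mu=0.0, sigma=1.0):
    """CDF of a Normal(mu, sigma^2) random variable via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

# P(a < X <= b) = F(b) - F(a); for N(0,1) and (a,b)=(-1,1) this is ~0.683
p_interval = norm_cdf(1.0) - norm_cdf(-1.0)

# P(X > b) = 1 - F(b); for b = 1.96 this is ~0.025
p_tail = 1.0 - norm_cdf(1.96)
```

Any interval probability falls out of the two CDF evaluations, which is exactly why specifying F is enough.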

-4-
Probability density function (PDF)
• A unique function p : Ω → [0,∞) such that
– P(A) = ∫A p(x) dF(x) [aggregate density with weights from F]

• Meaning of the notation ∫A p(x) dF(x)
– Real numbers: P(A) = ∫A p(x) dx (continuous distributions)
– Discrete support: P(A) = ∑A p(x) (discrete distributions)

• The PDF is often easier to work with when fitting (modeling) distributions to data

-5-
Empirical CDF
• Empirical CDF Fm for i.i.d. data X = (x1, x2, …, xm)
• A probability distribution with mass 1/m on each xi

• Example 1: m = 10; -1.21 0.28 1.08 -2.35 0.43 0.51 -0.57 -0.55 -0.56 -0.89
• Example 2: m = 10; 0 0 0 0 0 1 0 0 1 1
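The first example dataset can be turned into an empirical CDF directly; a minimal sketch (the helper `ecdf` is illustrative, not from the lecture):

```python
def ecdf(data):
    """Return the empirical CDF F_m: mass 1/m at each observation."""
    xs = sorted(data)
    m = len(xs)
    def F(x):
        # F_m(x) = fraction of observations <= x
        return sum(1 for v in xs if v <= x) / m
    return F

x = [-1.21, 0.28, 1.08, -2.35, 0.43, 0.51, -0.57, -0.55, -0.56, -0.89]
F10 = ecdf(x)
F10(0.0)   # 6 of the 10 points are <= 0, so F_m(0) = 0.6
```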

-6-
“Plug-in” principle
• θ(F) : some characteristic of the theoretical distribution
– E.g. the mean

• θ(Fm) : the corresponding quantity for the empirical CDF
– Exercise: convince yourself this is true

• Plug-in principle: θ(Fm) is a good estimator of θ(F) for all characteristics θ(F)

-7-
Why does plug-in work? The Glivenko-Cantelli Lemma
• Intuitively, Fm is a good estimator of F
– The estimator should get better with increasing sample size m

• Glivenko-Cantelli: for the i.i.d. scenario
– Pm(A) → P(A) uniformly over sets A, as m → ∞
– We can infer the distribution from a large sample
– The error does not grow as we increase the number of quantities estimated from the same sample
– Justifies why the "plug-in" principle works

-8-
Quantifying uncertainty in estimates
• We won't have infinite m in practice (data is costly)
– Quantify the uncertainty in estimates for a given m

• Sample mean: x̄ = (1/m) Σi xi

• Estimate of uncertainty (standard error): s.e.(x̄) = s/√m, where s is the sample standard deviation

-9-
Standard error calculations continued
• Consider the median

• Its standard error is difficult to compute exactly

• Asymptotic approximation: involves the unknown density at the true median
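The approximation itself did not survive extraction; the standard asymptotic result for the sample median (assuming F has a density f that is positive at the true median θ0.5) is:

```latex
\operatorname{s.e.}\!\left(\hat{\theta}_{0.5}\right) \;\approx\; \frac{1}{2\, f(\theta_{0.5})\, \sqrt{m}}
```

It depends on the unknown density f at the median, which is exactly what makes it hard to use directly.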

• Is there a better way?

- 10 -
Re-sampling from empirical CDF: Bootstrap

• Bootstrap: random sample (with replacement) from Fm

• The samples help compute the s.e.

• For the median example:
– Take a random sample of size m with replacement from the empirical CDF Fm and compute the median
– For B such samples, compute the standard deviation (s.e.) of the median estimates; this quantifies the uncertainty
• This works well if Fm is a good approximation to F
• The bootstrap only finds an approximation of s.e.F(θ(Fm)) under Fm
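The procedure above can be sketched in a few lines (a hedged illustration; the helper name `bootstrap_se` and the choices B = 200 and seed 0 are mine, not the lecture's):

```python
import random
import statistics

def bootstrap_se(data, stat, B=200, seed=0):
    """Standard error of `stat` by resampling with replacement from F_m."""
    rng = random.Random(seed)
    m = len(data)
    # each replicate: size-m sample with replacement, then the statistic
    reps = [stat([rng.choice(data) for _ in range(m)]) for _ in range(B)]
    return statistics.stdev(reps)

# the m=10 dataset from the empirical-CDF slide
x = [-1.21, 0.28, 1.08, -2.35, 0.43, 0.51, -0.57, -0.55, -0.56, -0.89]
se_median = bootstrap_se(x, statistics.median)
```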

- 11 -
Why bootstrap works?
• Except for mean, difficult to compute standard deviation of
other sample statistics

• Bootstrap sampling provides an approximation to Fm, easy


black box to compute variance estimates

• How many bootstrap samples B ?


– For estimating s.e., 20-100 are good enough
• Depends on m, tails of the underlying distribution

• Exercise: How many distinct bootstrap samples are there for a given m ?

- 12 -
Bootstrap: Variations
• Does it always work?
– No, especially in cases where Fm is not a good approximation of F
– E.g. sample m data points from Uniform(0, θ)
• max xi is the ML estimator of θ
• The bootstrap as defined so far won't work well here

• Parametric bootstrap
– What if we know the parametric form of F (e.g. Gaussian)?
– Sample from Fm,par instead of Fm
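A minimal sketch of the parametric variant, under the Gaussian assumption the slide mentions (helper name and tuning constants are mine):

```python
import random
import statistics

def parametric_bootstrap_se(data, stat, B=200, seed=0):
    """Fit a Gaussian to the data, then resample from the *fitted* model
    F_{m,par} instead of the empirical CDF F_m."""
    rng = random.Random(seed)
    mu = statistics.mean(data)
    sigma = statistics.stdev(data)
    m = len(data)
    reps = [stat([rng.gauss(mu, sigma) for _ in range(m)]) for _ in range(B)]
    return statistics.stdev(reps)

x = [-1.21, 0.28, 1.08, -2.35, 0.43, 0.51, -0.57, -0.55, -0.56, -0.89]
se_par = parametric_bootstrap_se(x, statistics.median)
```

The only change from the plain bootstrap is the sampling step: draws come from the fitted parametric family rather than from the observed points.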

- 13 -
Example
• m=100; data drawn iid from N(0,1); distribution of median

- 14 -
How can we use it at Y!?
• Variance estimates can help with online learning (explore/exploit): John's lecture

• Bootstrapping can help us better understand the variance properties of models
– Running too many experiments on test data is not a good idea (Kilian's lecture)

• Easy to Map-Reduce

- 15 -
Before we move on, another look at the bias-variance tradeoff

- 16 -
Bias-Variance Tradeoff
• Important in all scenarios (regression, density estimation,….)
• Recall Rob’s example from lecture 1

[Figure: three fits illustrating high bias, the optimal trade-off, and high variance]

- 17 -
Bias-Variance continued
• F : true distribution generating the data (not known)

• Δ = { Fδ } : model class chosen by the analyst to approximate F
– Influenced by things like domain knowledge, previous studies, software availability, my favorite algo, ad-server latency, …
– E.g. linear models, neural networks, logistic regression, …

• X : available data

• Loss L(F, Fδ[X]) : metric that measures model performance
– E.g. MSE, misclassification error, total click lift, total revenue, …

- 18 -
Bias-Variance continued
• Loss is influenced by two aspects
– How flexible is Δ in approximating reality? (Bias)
• The more flexible it is, the more complex it gets (reduces bias)
– How stable is the best fit from Δ to the data? (Variance)
• Does the fit change a lot with perturbations to the data?

• The more flexible the class we choose from, the more data we need to control the variance
– With too much flexibility and little data, we tend to learn patterns that are not real
• ("chasing the data", "too many parameters", "generalization error", "fitting noise", "too many degrees of freedom")

- 19 -
Example: Recall Regression from Rob’s lecture

Exercise:

1. Identify Δ

2. If dim(x) = 20, m = 10M; what is the more serious problem here (bias or variance)?

3. Based on 2., what other tools would you try on this problem?

- 20 -
A useful exercise
• Might have heard things like
– “All models are wrong, some are more useful than others”
– “Google uses simple models but trains them on lots of data”
– “SVM works well on my data”
– “Naïve Bayes is hard to beat on text data”
– “Boosting is the best off-the-shelf classifier”
– “It’s all about feature engineering; Maxent, GBDT doesn’t matter”
– “ Y! data is too noisy, better to work with simple models”
– “ We have terabytes of data but it is too little”

• Exercise: Interpret these in terms of bias-variance tradeoff

- 21 -
Other remarks on bias-variance

• There is no universal solution to the bias-variance tradeoff
– Several classes Δ are available, each with pros and cons

– Understanding the properties of the Δs and experimenting with data is important

– Inventing new Δs, motivated by failures of existing ones on real applications, is important for the advancement of the field

- 22 -
How to measure performance ?
• Depends on the loss function

• For classification and regression, test errors are used in ML

• Several other measures in Statistics (do not use test data):

MODEL FIT − MODEL COMPLEXITY
• AIC, BIC, DIC, Mallows' Cp, Bayes factors, …
• E.g. AIC = −2 × log-likelihood(training) + 2 × (# parameters)
– Based on assumptions that may not hold in all scenarios

- 23 -
Parametric Distributions: A useful class Δ to
work with data

- 24 -
Parametric models
• The non-parametric approach is attractive: few assumptions needed
• Bootstrap and asymptotics often provide answers, BUT
– Hard to incorporate additional knowledge about the system
– Computationally intensive
– Higher uncertainty in estimates is the price we pay for generality
– The theory gets harder for dependent random variables
• Social network data, spatial data, time series
• Parametric models that assume a functional form are an alternate way to model the world
– Faster computation, better estimates if the model is a good approximation to reality
– Easier to model dependent random variables

- 25 -
Common discrete parametric distributions
• Bernoulli

• Poisson

• Geometric

• Negative Binomial
• Multinomial
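The PMFs on this slide were images and did not survive extraction; the standard forms for these families are:

```latex
\begin{aligned}
\text{Bernoulli}(\theta):\quad & P(X{=}x) = \theta^{x}(1-\theta)^{1-x}, \quad x \in \{0,1\}\\
\text{Poisson}(\lambda):\quad & P(X{=}x) = e^{-\lambda}\lambda^{x}/x!, \quad x = 0,1,2,\dots\\
\text{Geometric}(\theta):\quad & P(X{=}x) = (1-\theta)^{x-1}\theta \quad \text{(trials to first success)}\\
\text{NegBin}(r,\theta):\quad & P(X{=}x) = \binom{x+r-1}{x}\,\theta^{r}(1-\theta)^{x}\\
\text{Multinomial}(n,\boldsymbol{\theta}):\quad & P(x_1,\dots,x_k) = \frac{n!}{x_1!\cdots x_k!}\,\theta_1^{x_1}\cdots\theta_k^{x_k}
\end{aligned}
```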

- 26 -
Common continuous parametric distributions
• Normal (Gaussian)

• Log-normal: Normal on log scale

• Gamma : Tails thinner than log-normal

• Beta: flexible class on [0,1]

• Multivariate Normal : Multivariate Gaussian data

- 27 -
Exponential family: A general class of parametric
distributions

• Distribution with PDF given by

• g(θ): convex (log-partition function)

• Example: Bernoulli distribution
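The PDF on this slide was an image; the usual one-parameter exponential-family form, with the Bernoulli example worked out, is:

```latex
p(x \mid \theta) = h(x)\,\exp\{\langle \phi(x), \theta \rangle - g(\theta)\}
```

For the Bernoulli: x ∈ {0,1}, φ(x) = x, natural parameter θ = log(p/(1−p)), log-partition g(θ) = log(1 + e^θ), and h(x) = 1; convexity of g is easy to verify here.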

- 28 -
Estimation
• Maximum likelihood estimation (MLE)

• For the i.i.d. case,

• MLE is asymptotically unbiased, consistent, and achieves the lowest variance asymptotically under mild conditions
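The defining formula was an image; the standard MLE criterion for the i.i.d. case is:

```latex
\hat{\theta}_{\mathrm{MLE}} \;=\; \arg\max_{\theta} \; \prod_{i=1}^{m} p(x_i \mid \theta)
\;=\; \arg\max_{\theta} \; \sum_{i=1}^{m} \log p(x_i \mid \theta)
```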

- 29 -
Desirable properties of estimators
• Unbiased

• Consistent

• Low variance, efficient [Attains Cramer-Rao lower bound]

- 30 -
MLE efficient
• Under mild regularity conditions, MLE is asymptotically
unbiased, consistent and efficient
• Other estimators
– MVUE: Minimum variance unbiased estimators
• Search for lowest variance estimator among unbiased ones
• Requires only moment assumptions on the distributions

– Method-of-moments (MOM) estimators


• Equate empirical moments with theoretical ones
• May lose efficiency but easier to estimate in some cases
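A concrete instance of the moment-matching idea (a sketch; the Gamma family and the helper `gamma_mom` are my choices, not from the slides):

```python
import statistics

def gamma_mom(data):
    """Method-of-moments fit of a Gamma(shape k, scale s):
    equate sample mean = k*s and sample variance = k*s**2, then solve."""
    mean = statistics.mean(data)
    var = statistics.variance(data)
    scale = var / mean          # s = var / mean
    shape = mean / scale        # k = mean**2 / var
    return shape, scale

# toy data with sample mean 2.0 and sample variance 0.5
shape, scale = gamma_mom([1.0, 2.0, 3.0, 2.0, 1.5, 2.5])
```

No likelihood maximization is needed; two empirical moments pin down the two parameters, which is why MOM is often easier (if less efficient) than MLE.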

- 31 -
Non-i.i.d. data [adding more flexibility]
• Statistically independent but different means

• Too flexible; sharing parameters is a good compromise

• E.g. 100 displays of an ad on a website (Bernoulli)
– Click probabilities are not all the same; how do we model this?
– θi a function of features? Males and females have different probabilities
• Regression problem (logistic regression, …)

θi := θ(zi, β) = ziT β ; dim(β) = n << m

- 32 -
Generalized Linear Models: Flexible class for regressions

• Data: (x1,z1), (x2,z2), …, (xm,zm)

• Assumption: the zi are measured without error (important)

• 1-parameter exponential family:

<φ(xi), θi> = φ(xi) · θi

• Do linear regression on the transformed scale:

θi := θ(zi, β) = ziT β

- 33 -
GLM continued
• Example: logistic regression (covered in Rob's lecture)

• Gaussian regression and Poisson regression are special cases

• Referred to as the Generalized Linear Model (GLM)
– McCullagh and Nelder (book)
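A sketch of how a GLM fit works mechanically for the logistic case: plain gradient ascent on the Bernoulli log-likelihood with the logit link (real packages use IRLS; the toy data and helper name below are invented for illustration):

```python
import math

def fit_logistic(X, y, lr=0.1, iters=2000):
    """Logistic regression (Bernoulli GLM, logit link) via gradient ascent.
    X: list of feature vectors (first entry 1.0 for the intercept); y: 0/1."""
    n = len(X[0])
    beta = [0.0] * n
    for _ in range(iters):
        grad = [0.0] * n
        for xi, yi in zip(X, y):
            eta = sum(b * v for b, v in zip(beta, xi))   # theta_i = z_i^T beta
            p = 1.0 / (1.0 + math.exp(-eta))             # mean via inverse link
            for j in range(n):
                grad[j] += (yi - p) * xi[j]              # score contribution
        beta = [b + lr * g / len(X) for b, g in zip(beta, grad)]
    return beta

# toy data: intercept + one feature; larger feature values mean y = 1
X = [[1.0, -2.0], [1.0, -1.0], [1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
y = [0, 0, 0, 1, 1]
beta = fit_logistic(X, y)
p_at_2 = 1.0 / (1.0 + math.exp(-(beta[0] + 2.0 * beta[1])))
```

Swapping the link and the likelihood (identity/Gaussian, log/Poisson) gives the other special cases on the slide with the same fitting loop.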

- 34 -
Other option: Shrinkage estimators (Stein)
• Stein's setup: estimate several means simultaneously (e.g. Xi ~ N(θi, σ²), i = 1, …, p)

• Result
– The Stein estimator has smaller MSE than the MLE
– Remarkable: incurring some bias by pooling data reduces variance significantly
– Shrinkage: estimates are pulled towards the mean
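The estimator itself was an image; the classical James-Stein form for X ~ N_p(θ, σ²I) with p ≥ 3 is:

```latex
\hat{\theta}_{\mathrm{JS}} \;=\; \left(1 - \frac{(p-2)\,\sigma^{2}}{\lVert X \rVert^{2}}\right) X
```

which dominates the MLE θ̂ = X in total squared-error risk whenever p ≥ 3.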

- 35 -
Bayesian statistics
• Data and parameters are all random variables that we model
• All inferences about parameters are conditional on the data
• Bayes' Theorem:

[θ|X] = [X|θ] [θ] / [X]

Posterior ∝ Likelihood × Prior
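One concrete instance of Bayes' theorem (a Beta prior on a Bernoulli click probability; the conjugate-update helper below is illustrative, not from the slides):

```python
def beta_bernoulli_posterior(alpha, beta, data):
    """Conjugate Bayes update: Beta(alpha, beta) prior on a Bernoulli
    success probability; posterior is Beta(alpha + #ones, beta + #zeros)."""
    s = sum(data)
    return alpha + s, beta + len(data) - s

# 10 ad displays, 3 clicks, uniform Beta(1,1) prior
a, b = beta_bernoulli_posterior(1, 1, [1, 0, 0, 1, 0, 0, 0, 1, 0, 0])
post_mean = a / (a + b)   # posterior mean = 4 / 12
```

Because the family is conjugate, the posterior is available in closed form; the harder, non-conjugate models are what push us toward simulation later.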

- 36 -
Bayesian continued
• Does not depend on asymptotics; works for finite m

• Rich class of models (generally over-parametrized) but avoids over-fitting through constraints on parameters

• Model specification often requires care

• Computationally intensive
– (but approximations work well for large data)

- 37 -
Bayesian interpretation of Stein

• Exercise

- 38 -
Analysis of variance (ANOVA)
• Replications within each group
– E.g. log NGD prices in different DMAs

• How do we estimate the unknown hyper-parameters (σ², τ²)?

- 39 -
Estimating hyper-parameters: Empirical Bayes
• Empirical Bayes (EB): maximize the marginal likelihood

• ANOVA example (the integral is available in closed form)

• EB works well for large data; in small samples it may overfit
– "Double dipping"

- 40 -
Example
• Time spent on landing page after a story click on Today
Module on Y! Front Page

- 41 -
Distribution across different properties

- 42 -
ANOVA
• Observations within a property are replications: log(time spent) data

• 0.04651195, 0.11435909 , 2.52275583

- 43 -
Shrinkage

- 44 -
Estimating hyper-parameters: Full Bayes
• Assume a mild prior on the hyper-parameters

• In the ANOVA example

• Computation gets difficult, often requiring simulation

• Main idea
– Simulate samples from the posterior distribution and base all conclusions on these (recall the parametric bootstrap)
– Several techniques: Markov Chain Monte Carlo (MCMC)
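A minimal random-walk Metropolis sketch of the simulate-from-the-posterior idea (the standard-normal target is a stand-in for a real posterior; names and tuning constants are mine):

```python
import math
import random

def metropolis(log_post, x0, steps=5000, scale=1.0, seed=0):
    """Random-walk Metropolis: propose x' = x + N(0, scale^2) noise and
    accept with probability min(1, post(x') / post(x))."""
    rng = random.Random(seed)
    x, samples = x0, []
    for _ in range(steps):
        prop = x + rng.gauss(0.0, scale)
        # accept/reject on the log scale (min avoids overflow in exp)
        if rng.random() < math.exp(min(0.0, log_post(prop) - log_post(x))):
            x = prop
        samples.append(x)
    return samples

# stand-in target: standard normal log-density (up to a constant)
draws = metropolis(lambda t: -0.5 * t * t, x0=0.0)
post_mean = sum(draws) / len(draws)
```

All posterior summaries (means, quantiles, standard errors) are then computed from the draws, mirroring how bootstrap replicates were used earlier.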

- 45 -
Modeling correlations through priors
• Time series: autoregressive prior

• Conditional independence, marginal dependence
– An attractive way to model correlations

• Spatial correlation
- 46 -
Generalized linear mixed model (GLMM)
• Fit different regressions to different groups but share parameters
• Example: random intercept models
– Parallel regression lines across groups

• Front Page example: log(ts) = a + b*Gender + prop_id

(Intercept)  gender0  genderf  genderm  sigma^2  tau^2
0.025        0.114    0.051    0.049    1.32     0.121

- 47 -
GLMM continued
• Crossed random effects
– Group-specific slopes and intercepts

• FP example
– log(ts) = a + b*gend + Propid*Gender

– Exercise: fit this model using lme4 in R
– Hint: formula log(ts) ~ gender + (Propid|gender)

- 48 -
GLMMs
• From an ML perspective
– Linear models with different cross-product features
– Fancy regularization (different priors for different features)
– No cross-validation; all parameters estimated automatically
– Priors motivated by the problem; highly flexible class
– Model specification has to be done carefully by the analyst

• Extends to the exponential family
– Conceptually easy; more computation required
– Software (lme4 in R; PROC GLIMMIX in SAS)

- 49 -
Summary
• We covered
– Bootstrap for the i.i.d. case
– Parametric distributions
– Shrinkage estimators
– Generalized linear models
– Grouped regressions (mixed-effects models)

• For non-i.i.d. data, flexible parametric models provide a powerful, expressive language for modeling data
– Needs some practice to master these models
• Next lecture: Olivier Chapelle (Optimization techniques)

- 51 -
