
Useful Statistical concepts for Engineers

Deepak Agarwal
ML Class
12/10/2009
Yahoo!

-1-
Scope of the lecture
• Basic probability distributions to model randomness in data

• Fitting distributions to data

• Common parametric distributions
– Discrete distributions, continuous distributions

• Generalized linear models

• Multi-level hierarchical models
– Generalized linear mixed-effects models

-2-
Role of Probability distributions
• Probability distributions
– Mathematical models to describe intrinsic variation in data
– Help quantify uncertainty and, eventually, support decision making

• How do we construct such distributions to compute probabilities for any subset of the data domain?
– Domain
• Finite set of points (total clicks in 100 displays of ad X on Pub Y)
• Countably infinite set of points (total visits to webpage Y)
• Real numbers (time spent on webpage Y)
– Is it necessary to specify probabilities for all subsets?
• No. So what do we need to specify?

-3-
Cumulative distribution function (CDF)
• X : random variable
• CDF F : Ω → [0,1] such that
– F(x) = Pr( X ≤ x)
• F is non-decreasing and right-continuous
• The CDF uniquely characterizes a probability distribution
• Given the CDF, we can compute the probability of any subset

• E.g. P(a < X ≤ b) = F(b) − F(a); P(X > b) = 1 − F(b)
– What about more complicated sets?
– In high dimensions?
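As a concrete illustration of the identities above (a minimal sketch; the standard normal CDF, computed via the error function, is my choice of example and is not from the slides):

```python
import math

def norm_cdf(x, mu=0.0, sigma=1.0):
    """CDF of a Normal(mu, sigma^2) random variable via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

# P(a < X <= b) = F(b) - F(a); for N(0,1) and (a,b)=(-1,1) this is ~0.683
p_interval = norm_cdf(1.0) - norm_cdf(-1.0)

# P(X > b) = 1 - F(b); for b = 1.96 this is ~0.025
p_tail = 1.0 - norm_cdf(1.96)
```

Any interval probability falls out of the two CDF evaluations, which is exactly why specifying F is enough.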

-4-
Probability density function (PDF)
• A unique function p : Ω → [0,∞) such that
– P(A) = ∫A p(x) dF(x) [aggregate density with weights from F]

• Meaning of the notation ∫A p(x) dF(x)
– Real numbers: P(A) = ∫A p(x) dx (continuous distributions)
– Discrete support: P(A) = ∑A p(x) (discrete distributions)

• The PDF is often easier to work with when fitting (modeling) distributions to data

-5-
Empirical CDF
• Empirical CDF Fm for i.i.d. data X = (x1, x2, …, xm)
• A probability distribution with mass 1/m on each xi

• Example 1: m = 10; -1.21 0.28 1.08 -2.35 0.43 0.51 -0.57 -0.55 -0.56 -0.89
• Example 2: m = 10; 0 0 0 0 0 1 0 0 1 1
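The first example dataset can be turned into an empirical CDF directly; a minimal sketch (the helper `ecdf` is illustrative, not from the lecture):

```python
def ecdf(data):
    """Return the empirical CDF F_m: mass 1/m at each observation."""
    xs = sorted(data)
    m = len(xs)
    def F(x):
        # F_m(x) = fraction of observations <= x
        return sum(1 for v in xs if v <= x) / m
    return F

x = [-1.21, 0.28, 1.08, -2.35, 0.43, 0.51, -0.57, -0.55, -0.56, -0.89]
F10 = ecdf(x)
F10(0.0)   # 6 of the 10 points are <= 0, so F_m(0) = 0.6
```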

-6-
“Plug-in” principle
• θ(F) : some characteristic of the theoretical distribution
– E.g. the mean

• θ(Fm) : the corresponding quantity for the empirical CDF
– Exercise: convince yourself this is true

• Plug-in principle: θ(Fm) is a good estimator of θ(F) for all characteristics θ(F)

-7-
Why does plug-in work? The Glivenko-Cantelli Lemma
• Intuitively, Fm is a good estimator of F
– The estimator should get better with increasing sample size m

• Glivenko-Cantelli: for the i.i.d. scenario
– Pm(A) → P(A) uniformly over sets A, as m → ∞
– We can infer the distribution from a large sample
– The error does not grow as we increase the number of quantities estimated from the same sample
– Justifies why the "plug-in" principle works

-8-
Quantifying uncertainty in estimates
• We won't have infinite m in practice (data is costly)
– Quantify the uncertainty in estimates for a given m

• Sample mean: x̄ = (1/m) Σi xi

• Estimate of uncertainty (standard error): s.e.(x̄) = s/√m, where s is the sample standard deviation

-9-
Standard error calculations continued
• Consider the median

• Its standard error is difficult to compute exactly

• Asymptotic approximation: involves the unknown density at the true median
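The approximation itself did not survive extraction; the standard asymptotic result for the sample median (assuming F has a density f that is positive at the true median θ0.5) is:

```latex
\operatorname{s.e.}\!\left(\hat{\theta}_{0.5}\right) \;\approx\; \frac{1}{2\, f(\theta_{0.5})\, \sqrt{m}}
```

It depends on the unknown density f at the median, which is exactly what makes it hard to use directly.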

• Is there a better way?

- 10 -
Re-sampling from empirical CDF: Bootstrap

• Bootstrap: random sample (with replacement) from Fm

• The samples help compute the s.e.

• For the median example:
– Take a random sample of size m with replacement from the empirical CDF Fm and compute the median
– For B such samples, compute the standard deviation (s.e.) of the median estimates; this quantifies the uncertainty
• This works well if Fm is a good approximation to F
• The bootstrap only finds an approximation of s.e.F(θ(Fm)) under Fm
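The procedure above can be sketched in a few lines (a hedged illustration; the helper name `bootstrap_se` and the choices B = 200 and seed 0 are mine, not the lecture's):

```python
import random
import statistics

def bootstrap_se(data, stat, B=200, seed=0):
    """Standard error of `stat` by resampling with replacement from F_m."""
    rng = random.Random(seed)
    m = len(data)
    # each replicate: size-m sample with replacement, then the statistic
    reps = [stat([rng.choice(data) for _ in range(m)]) for _ in range(B)]
    return statistics.stdev(reps)

# the m=10 dataset from the empirical-CDF slide
x = [-1.21, 0.28, 1.08, -2.35, 0.43, 0.51, -0.57, -0.55, -0.56, -0.89]
se_median = bootstrap_se(x, statistics.median)
```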

- 11 -
Why bootstrap works?
• Except for mean, difficult to compute standard deviation of
other sample statistics

• Bootstrap sampling provides an approximation to Fm, easy


black box to compute variance estimates

• How many bootstrap samples B ?


– For estimating s.e., 20-100 are good enough
• Depends on m, tails of the underlying distribution

• Exercise: How many distinct bootstrap samples are there for a given m ?

- 12 -
Bootstrap: Variations
• Does it always work?
– No, especially in cases where Fm is not a good approximation of F
– E.g. sample m data points from Uniform(0, θ)
• max xi is the ML estimator of θ
• The bootstrap as defined so far won't work well here

• Parametric bootstrap
– What if we know the parametric form of F (e.g. Gaussian)?
– Sample from Fm,par instead of Fm
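A minimal sketch of the parametric variant, under the Gaussian assumption the slide mentions (helper name and tuning constants are mine):

```python
import random
import statistics

def parametric_bootstrap_se(data, stat, B=200, seed=0):
    """Fit a Gaussian to the data, then resample from the *fitted* model
    F_{m,par} instead of the empirical CDF F_m."""
    rng = random.Random(seed)
    mu = statistics.mean(data)
    sigma = statistics.stdev(data)
    m = len(data)
    reps = [stat([rng.gauss(mu, sigma) for _ in range(m)]) for _ in range(B)]
    return statistics.stdev(reps)

x = [-1.21, 0.28, 1.08, -2.35, 0.43, 0.51, -0.57, -0.55, -0.56, -0.89]
se_par = parametric_bootstrap_se(x, statistics.median)
```

The only change from the plain bootstrap is the sampling step: draws come from the fitted parametric family rather than from the observed points.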

- 13 -
Example
• m=100; data drawn iid from N(0,1); distribution of median

- 14 -
How can we use it at Y!?
• Variance estimates can help with online learning (explore/exploit): John's lecture

• Bootstrapping can help us better understand the variance properties of models
– Running too many experiments on test data is not a good idea (Kilian's lecture)

• Easy to Map-Reduce

- 15 -
Before we move on, another look at the bias-variance tradeoff

- 16 -
Bias-Variance Tradeoff
• Important in all scenarios (regression, density estimation,….)
• Recall Rob’s example from lecture 1

[Figure: three fits illustrating high bias, the optimal trade-off, and high variance]

- 17 -
Bias-Variance continued
• F : true distribution generating the data (not known)

• Δ = { Fδ } : model class chosen by the analyst to approximate F
– Influenced by things like domain knowledge, previous studies, software availability, my favorite algo, ad-server latency, …
– E.g. linear models, neural networks, logistic regression, …

• X : available data

• Loss L(F, Fδ[X]) : metric that measures model performance
– E.g. MSE, misclassification error, total click lift, total revenue, …

- 18 -
Bias-Variance continued
• Loss is influenced by two aspects
– How flexible is Δ in approximating reality? (Bias)
• The more flexible it is, the more complex it gets (reduces bias)
– How stable is the best fit from Δ to the data? (Variance)
• Does the fit change a lot with perturbations to the data?

• The more flexible the class we choose from, the more data we need to control the variance
– With too much flexibility and little data, we tend to learn patterns that are not real
• ("chasing the data", "too many parameters", "generalization error", "fitting noise", "too many degrees of freedom")

- 19 -
Example: Recall Regression from Rob’s lecture

Exercise:

1. Identify Δ

2. If dim(x) = 20, m = 10M; what is the more serious problem here (bias or variance)?

3. Based on 2., what other tools would you try on this problem?

- 20 -
A useful exercise
• Might have heard things like
– “All models are wrong, some are more useful than others”
– “Google uses simple models but trains them on lots of data”
– “SVM works well on my data”
– “Naïve Bayes is hard to beat on text data”
– “Boosting is the best off-the-shelf classifier”
– “It’s all about feature engineering; Maxent, GBDT doesn’t matter”
– “ Y! data is too noisy, better to work with simple models”
– “ We have terabytes of data but it is too little”

• Exercise: Interpret these in terms of bias-variance tradeoff

- 21 -
Other remarks on bias-variance

• There is no universal solution to the bias-variance tradeoff
– Several classes Δ are available, each with pros and cons

– Understanding the properties of the Δs and experimenting with data is important

– Inventing new Δs, motivated by failures of existing ones on real applications, is important for the advancement of the field

- 22 -
How to measure performance ?
• Depends on the loss function

• For classification and regression, test errors are used in ML

• Several other measures in Statistics (do not use test data):

MODEL FIT − MODEL COMPLEXITY
• AIC, BIC, DIC, Mallows' Cp, Bayes factors, …
• E.g. AIC = −2 × log-likelihood(training) + 2 × (# parameters)
– Based on assumptions that may not hold in all scenarios

- 23 -
Parametric Distributions: A useful class Δ to
work with data

- 24 -
Parametric models
• The non-parametric approach is attractive: few assumptions needed
• Bootstrap and asymptotics often provide answers, BUT
– Hard to incorporate additional knowledge about the system
– Computationally intensive
– Higher uncertainty in estimates is the price we pay for generality
– The theory gets harder for dependent random variables
• Social network data, spatial data, time series
• Parametric models that assume a functional form are an alternate way to model the world
– Faster computation, better estimates if the model is a good approximation to reality
– Easier to model dependent random variables

- 25 -
Common discrete parametric distributions
• Bernoulli

• Poisson

• Geometric

• Negative Binomial
• Multinomial
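The PMFs on this slide were images and did not survive extraction; the standard forms for these families are:

```latex
\begin{aligned}
\text{Bernoulli}(\theta):\quad & P(X{=}x) = \theta^{x}(1-\theta)^{1-x}, \quad x \in \{0,1\}\\
\text{Poisson}(\lambda):\quad & P(X{=}x) = e^{-\lambda}\lambda^{x}/x!, \quad x = 0,1,2,\dots\\
\text{Geometric}(\theta):\quad & P(X{=}x) = (1-\theta)^{x-1}\theta \quad \text{(trials to first success)}\\
\text{NegBin}(r,\theta):\quad & P(X{=}x) = \binom{x+r-1}{x}\,\theta^{r}(1-\theta)^{x}\\
\text{Multinomial}(n,\boldsymbol{\theta}):\quad & P(x_1,\dots,x_k) = \frac{n!}{x_1!\cdots x_k!}\,\theta_1^{x_1}\cdots\theta_k^{x_k}
\end{aligned}
```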

- 26 -
Common continuous parametric distributions
• Normal (Gaussian)

• Log-normal: Normal on log scale

• Gamma : Tails thinner than log-normal

• Beta: flexible class on [0,1]

• Multivariate Normal : Multivariate Gaussian data

- 27 -
Exponential family: A general class of parametric
distributions

• Distribution with PDF given by

• g(θ): convex (log-partition function)

• Example: Bernoulli distribution
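The PDF on this slide was an image; the usual one-parameter exponential-family form, with the Bernoulli example worked out, is:

```latex
p(x \mid \theta) = h(x)\,\exp\{\langle \phi(x), \theta \rangle - g(\theta)\}
```

For the Bernoulli: x ∈ {0,1}, φ(x) = x, natural parameter θ = log(p/(1−p)), log-partition g(θ) = log(1 + e^θ), and h(x) = 1; convexity of g is easy to verify here.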

- 28 -
Estimation
• Maximum likelihood estimation (MLE)

• For the i.i.d. case,

• MLE is asymptotically unbiased, consistent, and achieves the lowest variance asymptotically under mild conditions
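The defining formula was an image; the standard MLE criterion for the i.i.d. case is:

```latex
\hat{\theta}_{\mathrm{MLE}} \;=\; \arg\max_{\theta} \; \prod_{i=1}^{m} p(x_i \mid \theta)
\;=\; \arg\max_{\theta} \; \sum_{i=1}^{m} \log p(x_i \mid \theta)
```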

- 29 -
Desirable properties of estimators
• Unbiased

• Consistent

• Low variance, efficient [Attains Cramer-Rao lower bound]

- 30 -
MLE efficient
• Under mild regularity conditions, MLE is asymptotically
unbiased, consistent and efficient
• Other estimators
– MVUE: Minimum variance unbiased estimators
• Search for lowest variance estimator among unbiased ones
• Requires only moment assumptions on the distributions

– Method-of-moments (MOM) estimators


• Equate empirical moments with theoretical ones
• May lose efficiency but easier to estimate in some cases
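A concrete instance of the moment-matching idea (a sketch; the Gamma family and the helper `gamma_mom` are my choices, not from the slides):

```python
import statistics

def gamma_mom(data):
    """Method-of-moments fit of a Gamma(shape k, scale s):
    equate sample mean = k*s and sample variance = k*s**2, then solve."""
    mean = statistics.mean(data)
    var = statistics.variance(data)
    scale = var / mean          # s = var / mean
    shape = mean / scale        # k = mean**2 / var
    return shape, scale

# toy data with sample mean 2.0 and sample variance 0.5
shape, scale = gamma_mom([1.0, 2.0, 3.0, 2.0, 1.5, 2.5])
```

No likelihood maximization is needed; two empirical moments pin down the two parameters, which is why MOM is often easier (if less efficient) than MLE.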

- 31 -
Non-i.i.d. data [adding more flexibility]
• Statistically independent but different means

• Too flexible; sharing parameters is a good compromise

• E.g. 100 displays of an ad on a website (Bernoulli)
– Click probabilities are not all the same; how do we model this?
– θi a function of features? Males and females have different probabilities
• Regression problem (logistic regression, …)

θi := θ(zi, β) = ziT β ; dim(β) = n << m

- 32 -
Generalized Linear Models: Flexible class for regressions

• Data: (x1,z1), (x2,z2), …, (xm,zm)

• Assumption: the zi are measured without error (important)

• 1-parameter exponential family:

<φ(xi), θi> = φ(xi) · θi

• Do linear regression on the transformed scale:

θi := θ(zi, β) = ziT β

- 33 -
GLM continued
• Example: logistic regression (covered in Rob's lecture)

• Gaussian regression and Poisson regression are special cases

• Referred to as the Generalized Linear Model (GLM)
– McCullagh and Nelder (book)
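A sketch of how a GLM fit works mechanically for the logistic case: plain gradient ascent on the Bernoulli log-likelihood with the logit link (real packages use IRLS; the toy data and helper name below are invented for illustration):

```python
import math

def fit_logistic(X, y, lr=0.1, iters=2000):
    """Logistic regression (Bernoulli GLM, logit link) via gradient ascent.
    X: list of feature vectors (first entry 1.0 for the intercept); y: 0/1."""
    n = len(X[0])
    beta = [0.0] * n
    for _ in range(iters):
        grad = [0.0] * n
        for xi, yi in zip(X, y):
            eta = sum(b * v for b, v in zip(beta, xi))   # theta_i = z_i^T beta
            p = 1.0 / (1.0 + math.exp(-eta))             # mean via inverse link
            for j in range(n):
                grad[j] += (yi - p) * xi[j]              # score contribution
        beta = [b + lr * g / len(X) for b, g in zip(beta, grad)]
    return beta

# toy data: intercept + one feature; larger feature values mean y = 1
X = [[1.0, -2.0], [1.0, -1.0], [1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
y = [0, 0, 0, 1, 1]
beta = fit_logistic(X, y)
p_at_2 = 1.0 / (1.0 + math.exp(-(beta[0] + 2.0 * beta[1])))
```

Swapping the link and the likelihood (identity/Gaussian, log/Poisson) gives the other special cases on the slide with the same fitting loop.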

- 34 -
Other option: Shrinkage estimators (Stein)
• Stein's setup: estimate several means simultaneously (e.g. Xi ~ N(θi, σ²), i = 1, …, p)

• Result
– The Stein estimator has smaller MSE than the MLE
– Remarkable: incurring some bias by pooling data reduces variance significantly
– Shrinkage: estimates are pulled towards the mean
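The estimator itself was an image; the classical James-Stein form for X ~ N_p(θ, σ²I) with p ≥ 3 is:

```latex
\hat{\theta}_{\mathrm{JS}} \;=\; \left(1 - \frac{(p-2)\,\sigma^{2}}{\lVert X \rVert^{2}}\right) X
```

which dominates the MLE θ̂ = X in total squared-error risk whenever p ≥ 3.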

- 35 -
Bayesian statistics
• Data and parameters are all random variables that we model
• All inferences about parameters are conditional on the data
• Bayes' Theorem:

[θ|X] = [X|θ] [θ] / [X]

Posterior ∝ Likelihood × Prior
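One concrete instance of Bayes' theorem (a Beta prior on a Bernoulli click probability; the conjugate-update helper below is illustrative, not from the slides):

```python
def beta_bernoulli_posterior(alpha, beta, data):
    """Conjugate Bayes update: Beta(alpha, beta) prior on a Bernoulli
    success probability; posterior is Beta(alpha + #ones, beta + #zeros)."""
    s = sum(data)
    return alpha + s, beta + len(data) - s

# 10 ad displays, 3 clicks, uniform Beta(1,1) prior
a, b = beta_bernoulli_posterior(1, 1, [1, 0, 0, 1, 0, 0, 0, 1, 0, 0])
post_mean = a / (a + b)   # posterior mean = 4 / 12
```

Because the family is conjugate, the posterior is available in closed form; the harder, non-conjugate models are what push us toward simulation later.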

- 36 -
Bayesian continued
• Does not depend on asymptotics; works for finite m

• Rich class of models (generally over-parametrized) but avoids over-fitting through constraints on parameters

• Model specification often requires care

• Computationally intensive
– (but approximations work well for large data)

- 37 -
Bayesian interpretation of Stein

• Exercise

- 38 -
Analysis of variance (ANOVA)
• Replications within each group
– E.g. log NGD prices in different DMAs

• How do we estimate the unknown hyper-parameters (σ², τ²)?

- 39 -
Estimating hyper-parameters: Empirical Bayes
• Empirical Bayes (EB): maximize the marginal likelihood

• ANOVA example (the integral is available in closed form)

• EB works well for large data; in small samples it may overfit
– "Double dipping"

- 40 -
Example
• Time spent on landing page after a story click on Today
Module on Y! Front Page

- 41 -
Distribution across different properties

- 42 -
ANOVA
• Observations within a property are replications: log(time spent) data

• 0.04651195, 0.11435909 , 2.52275583

- 43 -
Shrinkage

- 44 -
Estimating hyper-parameters: Full Bayes
• Assume a mild prior on the hyper-parameters

• In the ANOVA example

• Computation gets difficult, often requiring simulation

• Main idea
– Simulate samples from the posterior distribution and base all conclusions on these (recall the parametric bootstrap)
– Several techniques: Markov Chain Monte Carlo (MCMC)
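A minimal random-walk Metropolis sketch of the simulate-from-the-posterior idea (the standard-normal target is a stand-in for a real posterior; names and tuning constants are mine):

```python
import math
import random

def metropolis(log_post, x0, steps=5000, scale=1.0, seed=0):
    """Random-walk Metropolis: propose x' = x + N(0, scale^2) noise and
    accept with probability min(1, post(x') / post(x))."""
    rng = random.Random(seed)
    x, samples = x0, []
    for _ in range(steps):
        prop = x + rng.gauss(0.0, scale)
        # accept/reject on the log scale (min avoids overflow in exp)
        if rng.random() < math.exp(min(0.0, log_post(prop) - log_post(x))):
            x = prop
        samples.append(x)
    return samples

# stand-in target: standard normal log-density (up to a constant)
draws = metropolis(lambda t: -0.5 * t * t, x0=0.0)
post_mean = sum(draws) / len(draws)
```

All posterior summaries (means, quantiles, standard errors) are then computed from the draws, mirroring how bootstrap replicates were used earlier.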

- 45 -
Modeling correlations through priors
• Time series: autoregressive prior

• Conditional independence, marginal dependence
– An attractive way to model correlations

• Spatial correlation
- 46 -
Generalized linear mixed model (GLMM)
• Fit different regressions to different groups but share parameters
• Example: random intercept models
– Parallel regression lines across groups

• Front Page example: log(ts) = a + b*Gender + prop_id

(Intercept)  gender0  genderf  genderm  sigma^2  tau^2
0.025        0.114    0.051    0.049    1.32     0.121

- 47 -
GLMM continued
• Crossed random effects
– Group-specific slopes and intercepts

• FP example
– log(ts) = a + b*gend + Propid*Gender

– Exercise: fit this model using lme4 in R
– Hint: formula log(ts) ~ gender + (Propid|gender)

- 48 -
GLMMs
• From an ML perspective
– Linear models with different cross-product features
– Fancy regularization (different priors for different features)
– No cross-validation; all parameters estimated automatically
– Priors motivated by the problem; highly flexible class
– Model specification has to be done carefully by the analyst

• Extends to the exponential family
– Conceptually easy; more computation required
– Software (lme4 in R; PROC GLIMMIX in SAS)

- 49 -
Summary
• We covered
– Bootstrap for the i.i.d. case
– Parametric distributions
– Shrinkage estimators
– Generalized linear models
– Grouped regressions (mixed-effects models)

• For non-i.i.d. data, flexible parametric models provide a powerful, expressive language for modeling data
– Needs some practice to master these models
• Next lecture: Olivier Chapelle (Optimization techniques)

- 51 -
