Deepak Agarwal
ML Class
12/10/2009
Yahoo!
Scope of the lecture
• Basic probability distributions to model randomness in data
Role of probability distributions
• Probability distributions
– Mathematical models that describe intrinsic variation in data
– Help quantify uncertainty and, eventually, support decision making
Cumulative distribution function (CDF)
• X : real-valued random variable
• CDF F : ℝ → [0,1] such that
– F(x) = Pr( X ≤ x )
• F is non-decreasing and right continuous
• The CDF uniquely characterizes a probability distribution
• Given the CDF, we can compute the probability of any (measurable) subset
Probability density function (PDF)
• For a continuous X, a function p : ℝ → [0,∞) such that
– P( A ) = ∫A p(x) dx, i.e., p is the derivative of the CDF: p(x) = dF(x)/dx
Empirical CDF
• Empirical CDF Fm for data X = (x1, x2, …, xm) (i.i.d.)
– Fm(x) = (1/m) · #{ i : xi ≤ x }
• Probability distribution with mass 1/m on each xi
• Example 1: m = 10; -1.21 0.28 1.08 -2.35 0.43 0.51 -0.57 -0.55 -0.56 -0.89
• Example 2: m = 10; 0 0 0 0 0 1 0 0 1 1
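A minimal Python sketch of the empirical CDF (NumPy assumed; not from the original slides), evaluated on example 1:

import numpy as np

# Example 1 from this slide (m = 10)
x = np.array([-1.21, 0.28, 1.08, -2.35, 0.43,
              0.51, -0.57, -0.55, -0.56, -0.89])

def ecdf(data, t):
    # F_m(t) = (1/m) * #{ i : x_i <= t }
    return np.mean(data <= t)

print(ecdf(x, 0.0))  # 6 of 10 points are <= 0, so F_m(0) = 0.6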
“Plug-in” principle
• θ(F) : some characteristic (functional) of the theoretical distribution
– E.g. the mean
• Plug-in estimate: θ̂ = θ(Fm), i.e., evaluate the same functional on the empirical CDF
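A small Python illustration of the plug-in principle (NumPy assumed; the data reuse example 1 above):

import numpy as np

x = np.array([-1.21, 0.28, 1.08, -2.35, 0.43,
              0.51, -0.57, -0.55, -0.56, -0.89])

mean_hat   = np.mean(x)                     # theta(F) = E[X]
median_hat = np.median(x)                   # theta(F) = F^{-1}(1/2)
var_hat    = np.mean((x - mean_hat) ** 2)   # plug-in variance divides by m, not m - 1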
Why does plug-in work? Glivenko-Cantelli Lemma
• supx | Fm(x) − F(x) | → 0 almost surely as m → ∞
• Intuitively, Fm is a good estimator of F
– The estimator gets better with increasing sample size m
Quantifying uncertainty in estimates
• We won’t have infinite m in practice (data are costly)
– Quantify uncertainty in estimates for a given m
• Sample mean: x̄ = (1/m) Σi xi
• Estimate of uncertainty: se(x̄) = s / √m, where s is the sample standard deviation
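As a sketch (NumPy assumed; the helper name is illustrative):

import numpy as np

def mean_and_se(x):
    # Sample mean and its estimated standard error s / sqrt(m)
    m = len(x)
    s = np.std(x, ddof=1)   # sample standard deviation
    return np.mean(x), s / np.sqrt(m)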
Standard error calculations continued
• Consider the median
• A closed-form standard error is difficult to compute
• Asymptotic approximation: se(median) ≈ 1 / (2 √m p(θ)), where p(θ) is the density at the true median θ
– Requires estimating the density, which motivates a more automatic method: the bootstrap
Re-sampling from empirical CDF: Bootstrap
• Draw B re-samples of size m with replacement from the data (i.e., i.i.d. from Fm)
• Compute the statistic of interest on each re-sample
• Use the spread of the B replicates (e.g., their standard deviation) as the uncertainty estimate; a sketch follows below
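A minimal non-parametric bootstrap in Python (NumPy assumed; the function name is illustrative, not from the lecture):

import numpy as np

def bootstrap_se(x, stat, B=2000, seed=0):
    # Standard error of stat(x): resample with replacement, i.e., draw from F_m
    rng = np.random.default_rng(seed)
    m = len(x)
    reps = np.array([stat(rng.choice(x, size=m, replace=True))
                     for _ in range(B)])
    return reps.std(ddof=1)

E.g. bootstrap_se(x, np.median) estimates the standard error of the median with no density estimation at all.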
Why does the bootstrap work?
• Except for the mean, the standard deviation of most sample statistics is difficult to compute analytically
• Sampling from Fm mimics sampling from F, and Fm approximates F well (Glivenko-Cantelli)
• Exercise: How many distinct bootstrap samples are there for a given m ?
Bootstrap: Variations
• Does it always work?
– No, especially in cases where Fm is not a good approximation of F for the statistic of interest
– E.g. sample m data points from Uniform(0, θ)
• max xi is the ML estimator of θ
• The bootstrap as defined so far won’t work well here
• Parametric bootstrap (sketch below)
– What if we know the parametric form of F (e.g. Gaussian)?
– Sample from Fm,par (the fitted parametric model) instead of Fm
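A sketch of the parametric bootstrap on the Uniform(0, θ) example (NumPy assumed; the constants are illustrative):

import numpy as np

rng = np.random.default_rng(0)
theta, m, B = 5.0, 50, 2000
x = rng.uniform(0, theta, size=m)
theta_hat = x.max()                 # MLE of theta

# Resample from the fitted Uniform(0, theta_hat), not from F_m.
# (The plain bootstrap fails here: a re-sample's max equals the observed
# max with probability 1 - (1 - 1/m)^m, about 0.63, so the bootstrap
# distribution of the max is badly discrete.)
reps = rng.uniform(0, theta_hat, size=(B, m)).max(axis=1)
se_hat = reps.std(ddof=1)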
Example
• m = 100; data drawn i.i.d. from N(0,1); distribution of the median
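One way to reproduce this kind of experiment (a sketch, NumPy assumed):

import numpy as np

rng = np.random.default_rng(0)
m = 100
x = rng.standard_normal(m)          # data drawn i.i.d. from N(0,1)

B = 5000
meds = np.array([np.median(rng.choice(x, size=m, replace=True))
                 for _ in range(B)])
# Compare with the asymptotic approximation 1 / (2 * sqrt(m) * p(0)) ~ 0.125
print(meds.std(ddof=1))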
How can we use it at Y! ?
• Variance estimates can help with online learning (explore/exploit): John’s lecture
• Easy to MapReduce: each bootstrap replicate is an independent task
Before we move on, another look at the bias-variance tradeoff
Bias-Variance Tradeoff
• Important in all scenarios (regression, density estimation, …)
• Recall Rob’s example from lecture 1
Bias-Variance continued
• F : true distribution generating the data (not known)
• X : available data
• Δ : class of models we fit to the data
Bias-Variance continued
• Loss is influenced by two aspects
– How flexible is Δ in approximating reality ? (Bias)
• The more flexible it is, the more complex it gets (reducing bias)
– How stable is the best fit from Δ to the data ? (Variance)
• Does the fit change a lot with perturbations to the data ? (See the decomposition below.)
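For squared-error loss this tradeoff is the classical decomposition (a standard result, stated here since the slide's formula did not survive extraction):

\mathbb{E}\big[(\hat{\theta} - \theta)^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{\theta}] - \theta\big)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{\theta} - \mathbb{E}[\hat{\theta}])^2\big]}_{\text{Variance}}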
Example: Recall Regression from Rob’s lecture
Exercise:
1. Identify Δ
2. If dim(x) = 20 and m = 10M, what is the more serious problem here: bias or variance ?
A useful exercise
• You might have heard things like
– “All models are wrong, some are more useful than others”
– “Google uses simple models but trains them on lots of data”
– “SVM works well on my data”
– “Naïve Bayes is hard to beat on text data”
– “Boosting is the best off-the-shelf classifier”
– “It’s all about feature engineering; Maxent, GBDT doesn’t matter”
– “Y! data is too noisy, better to work with simple models”
– “We have terabytes of data but it is too little”
• Exercise: interpret each claim through the bias-variance tradeoff
Other remarks on bias-variance
How to measure performance ?
• Depends on the loss function
– E.g. squared error for regression; 0/1 loss or log-likelihood for classification
Parametric Distributions: A useful class Δ to work with data
Parametric models
• The non-parametric approach is attractive: minimal assumptions needed
• Bootstrap and asymptotics often provide answers, BUT
– Hard to incorporate additional knowledge about the system
– Computationally intensive
– Higher uncertainty in estimates is the price we pay for generality
– Theory gets harder for dependent random variables
• Social network data, spatial data, time series
• Parametric models, which assume a functional form, are an alternate way to model the world
– Faster computation; better estimates if the model is a good approximation to reality
– Easier to model dependent random variables
Common discrete parametric distributions
• Bernoulli: a single binary outcome
• Poisson: counts of events
• Geometric: number of trials until the first success
• Negative Binomial: counts with over-dispersion
• Multinomial: one of k categories
Common continuous parametric distributions
• Normal (Gaussian): p(x) = (1/√(2πσ²)) exp( −(x − μ)² / (2σ²) )
Exponential family: A general class of parametric distributions
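The common form (a standard definition, stated here in LaTeX since the slide's formula did not survive extraction):

p(x \mid \theta) = h(x)\, \exp\!\big( \eta(\theta)^{\top} T(x) - A(\theta) \big)

Bernoulli, Poisson, multinomial, and Gaussian all fit this form.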
Estimation
• Maximum likelihood estimation (MLE): θ̂ = argmaxθ p(X; θ)
• For the i.i.d. case, the log-likelihood is ℓ(θ) = Σi log p(xi; θ)
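A numerical MLE sketch in Python (SciPy assumed; the count data are a toy example, not from the lecture), for i.i.d. Poisson observations:

import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import poisson

x = np.array([2, 0, 3, 1, 1, 4, 2, 2])   # toy counts (illustrative)

def nll(lam):
    # Negative log-likelihood: -sum_i log p(x_i; lam)
    return -poisson.logpmf(x, lam).sum()

res = minimize_scalar(nll, bounds=(1e-6, 20), method="bounded")
print(res.x, x.mean())   # for the Poisson, the MLE equals the sample mean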
Desirable properties of estimators
• Unbiased: E[θ̂] = θ
• Consistent: θ̂ → θ in probability as m → ∞
MLE is efficient
• Under mild regularity conditions, the MLE is asymptotically unbiased, consistent and efficient
• Other estimators
– MVUE: Minimum variance unbiased estimators
• Search for the lowest-variance estimator among unbiased ones
• Requires only moment assumptions on the distributions
Non i.i.d. data [adding more flexibility]
• Statistically independent but with different means (e.g., the mean depends on covariates xi)
Generalized Linear Models: Flexible class for regressions
• yi follows an exponential-family distribution with mean μi
• A link function g ties the mean to a linear predictor: g(μi) = xiᵀβ
GLM continued
• Example: logistic regression (covered in Rob’s lecture)
– Binary yi with logit link: log( μi / (1 − μi) ) = xiᵀβ
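A sketch with statsmodels (data simulated; the coefficients are illustrative): logistic regression fit as a GLM with the Binomial family and its default logit link.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = sm.add_constant(rng.standard_normal((500, 2)))      # intercept + 2 features
beta = np.array([-0.5, 1.0, 2.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta)))    # logit model

fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()
print(fit.params)   # should be close to beta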
Other option: Shrinkage estimators (Stein)
• Stein’s setup: observe Xi ~ N(θi, 1) independently, i = 1, …, p (p ≥ 3); estimate all the θi
• Result
– The Stein estimator has smaller MSE than the MLE for every θ
– Remarkable : Incurring some bias by pooling data reduces variance significantly
– Shrinkage: Estimates pulled towards the mean
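A simulation sketch of the (positive-part) James-Stein estimator (NumPy assumed; the true means are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
p, trials = 20, 2000
theta = rng.standard_normal(p)              # arbitrary true means

mse_mle = mse_js = 0.0
for _ in range(trials):
    x = theta + rng.standard_normal(p)      # X_i ~ N(theta_i, 1)
    shrink = max(0.0, 1.0 - (p - 2) / np.sum(x ** 2))
    mse_mle += np.sum((x - theta) ** 2)
    mse_js  += np.sum((shrink * x - theta) ** 2)

print(mse_mle / trials, mse_js / trials)    # JS risk is smaller for p >= 3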
Bayesian statistics
• Data, parameters are all random variables that we model
• All inferences about parameters are conditional on data
• Bayes Theorem: p(θ | X) = p(X | θ) p(θ) / p(X)
– posterior ∝ likelihood × prior
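A minimal conjugate example (Beta-Bernoulli; an illustration reusing the binary data from the Empirical CDF slide, not from the original deck):

import numpy as np

x = np.array([0, 0, 0, 0, 0, 1, 0, 0, 1, 1])   # binary example data
a, b = 1.0, 1.0                                # Beta(1,1) uniform prior
a_post = a + x.sum()                           # prior + observed successes
b_post = b + len(x) - x.sum()                  # prior + observed failures
print(a_post / (a_post + b_post))              # posterior mean of Pr(X = 1) = 1/3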
Bayesian continued
• Does not depend on asymptotics, works for finite m
• Computationally intensive
– (but approximations work well for large data)
Bayesian interpretation of Stein
• Exercise
Analysis of variance (ANOVA)
• Replications within each group
– E.g. log NGD prices in different DMAs
• One-way model: yij = μ + αi + εij, with group effects αi ~ N(0, τ²) and noise εij ~ N(0, σ²)
Estimating hyper-parameters: Empirical Bayes
• Empirical Bayes (EB): Maximize the marginal likelihood (parameters integrated out) over the hyper-parameters; a sketch follows below
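An EB sketch for the normal means model (SciPy assumed; the data are simulated): y_i | θ_i ~ N(θ_i, 1) with θ_i ~ N(μ, τ²), so marginally y_i ~ N(μ, 1 + τ²); maximize this marginal likelihood over (μ, τ).

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
y = rng.normal(2.0, np.sqrt(1 + 0.5 ** 2), size=200)   # simulated marginals

def neg_marginal_ll(params):
    mu, log_tau = params
    sd = np.sqrt(1.0 + np.exp(2 * log_tau))   # marginal sd of y_i
    return -norm.logpdf(y, mu, sd).sum()

res = minimize(neg_marginal_ll, x0=np.array([0.0, 0.0]))
mu_hat, tau_hat = res.x[0], np.exp(res.x[1])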
Example
• Time spent on landing page after a story click on the Today Module on the Y! Front Page
Distribution across different properties
ANOVA
• Observations for a property are the replications: log(time spent) data
Shrinkage
• Each property’s estimate is a weighted average of its own mean and the overall mean: θ̂i = (1 − Bi) ȳi + Bi μ̂, with Bi = (σ²/ni) / (σ²/ni + τ²), so properties with little data shrink more
Estimating hyper-parameters: Full Bayes
• Assume a mild prior on the hyper-parameters and integrate them out, rather than plugging in point estimates
• In the ANOVA example: priors on (μ, τ², σ²)
Modeling correlations through priors
• Time series: Autoregressive prior
– E.g. AR(1): θt = ρ θt−1 + εt, with εt ~ N(0, σ²)
Generalized linear mixed model (GLMM)
• Fit different regressions to different groups but share parameters
• Example: Random intercept models
– Parallel regression lines across groups (shared slope, group-specific intercepts); a sketch follows below
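A random-intercept sketch with statsmodels (data simulated; names are illustrative, not from the lecture):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
g = np.repeat(np.arange(20), 30)                  # 20 groups, 30 obs each
a = rng.normal(0, 1, size=20)[g]                  # group-specific intercepts
x = rng.standard_normal(len(g))
y = a + 0.8 * x + rng.normal(0, 0.5, size=len(g))
df = pd.DataFrame({"y": y, "x": x, "g": g})

# groups= gives the random intercept; the formula gives the shared (fixed) slope
fit = smf.mixedlm("y ~ x", df, groups=df["g"]).fit()
print(fit.params)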
GLMM continued
• Crossed random effects
– Group-specific slopes and intercepts
• FP example
• log(ts) = a + b*gender + PropId*Gender (property-by-gender interaction effects)
GLMMs
• From an ML perspective
– Linear models with different cross-product features
– Fancy regularization (different priors for different features)
– No cross-validation; all parameters estimated automatically
– Priors motivated by the problem; a highly flexible class
– Model specification has to be done carefully by the analyst
Summary
• We covered
– Bootstrap for the i.i.d. case
– Parametric distributions
– Shrinkage estimators
– Generalized linear models
– Grouped regressions (mixed-effects models)