
Chapter 8: When Models Meet Data

Mathematics for Machine Learning

SNU 2025 Spring Introduction to Data Science


Chapter Overview

▶ This chapter bridges mathematical foundations with practical


machine learning
▶ Introduces the four pillars of machine learning:
▶ Regression (Chapter 9)
▶ Dimensionality reduction (Chapter 10)
▶ Density estimation (Chapter 11)
▶ Classification (Chapter 12)
▶ Covers essential concepts:
▶ Data representation
▶ Models (as functions and probability distributions)
▶ Learning methods (parameter estimation)
▶ Model selection and evaluation
8.1 Data, Models, and Learning

▶ Core Question: "What do we mean by good models?"


▶ Good models must perform well on unseen data
▶ Requires:
▶ Defining performance metrics
▶ Developing methods to optimize these metrics
▶ Three major components of a machine learning system:
▶ Data: Represented numerically (examples, features)
▶ Models: Functions or probability distributions
▶ Learning: Process of parameter estimation
8.1.1 Data as Vectors

▶ Data represented in tabular form:


▶ Rows: Examples/instances (n = 1, . . . , N)
▶ Columns: Features/attributes/covariates (d = 1, . . . , D)
▶ Each example xn is a D-dimensional vector of real numbers
▶ For supervised learning: Example-label pairs
{(x1 , y1 ), . . . , (xN , yN )}
▶ Matrix representation: X ∈ RN×D
▶ Proper data preparation is crucial:
▶ Converting categorical variables
▶ Normalizing numerical features (scaling, shifting)
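As a minimal sketch of the normalization step (with a made-up 3×2 data matrix; the values are purely illustrative):

```python
import numpy as np

# Hypothetical data matrix X in R^{N x D}: rows are examples, columns are features.
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

# Shift and scale each feature column to zero mean and unit variance.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
```

After this step every feature lives on a comparable scale, which matters for the distance-based comparisons on the next slide.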
8.1.1 Data as Vectors (cont.)

▶ Linear algebra allows comparison of examples via:


▶ Similarity measures (similar features ⇒ similar labels)
▶ Distance metrics (geometric interpretation)
▶ Feature representations:
▶ Low-dimensional approximations: PCA (Chapter 10)
▶ High-dimensional representations: Feature maps ϕ(x)
▶ Feature maps create non-linear combinations of original
features
▶ Feature maps can lead to kernels (Chapter 12)
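A feature map ϕ(x) can be sketched as follows; `poly_features` is a hypothetical helper producing polynomial features of a scalar input (not a function from the text):

```python
import numpy as np

def poly_features(x, degree=3):
    """Hypothetical feature map phi(x) = (1, x, x^2, ..., x^degree) for scalar x."""
    return np.array([x ** d for d in range(degree + 1)])

# A predictor that is linear in phi(x) is a non-linear (cubic) function of x.
phi = poly_features(2.0)
```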
8.1.2 Models as Functions

▶ A predictor maps inputs to outputs (prediction function)


▶ For scalar outputs:
f : RD → R (1)
▶ Input vector x is D-dimensional, output f (x) is a real number
▶ We focus on linear functions:

f (x) = θ⊤ x + θ0 (2)

▶ Linear functions balance expressivity with mathematical


simplicity
8.1.3 Models as Probability Distributions

▶ Probabilistic perspective accounts for uncertainty


▶ Sources of uncertainty:
▶ Noisy observations (data uncertainty)
▶ Model uncertainty (confidence in predictions)
▶ Instead of a single function, use distributions over possible
functions
▶ Focus on distributions with finite-dimensional parameters
▶ Uncertainty in predictions visualized as distributions (e.g.,
Gaussian)
8.1.4 Learning is Finding Parameters

▶ Learning goal: Find model parameters that perform well on


unseen data
▶ Three algorithmic phases:
1. Prediction/Inference: Using trained model on new data
2. Training/Parameter Estimation: Adjusting parameters using
training data
3. Hyperparameter Tuning/Model Selection: Choosing model
structure
▶ Two main parameter estimation strategies:
▶ Finding point estimates (empirical risk minimization)
▶ Bayesian inference (distribution over parameters)
8.1.4 Learning is Finding Parameters (cont.)

▶ Parameter estimation principles:


▶ Empirical risk minimization: Direct optimization problem
(Section 8.2)
▶ Maximum likelihood: Statistical perspective (Section 8.3)
▶ Generalization challenge:
▶ Need to balance fitting training data vs. finding "simple"
explanations
▶ Tools: Regularization, priors, cross-validation
▶ Remark: Parameters vs. hyperparameters
▶ Parameters: Directly optimized during training
▶ Hyperparameters: Control model structure or training process
8.2 Empirical Risk Minimization

▶ Core idea: Learn by minimizing average loss on training data


▶ Framework components:
1. Hypothesis class: Set of possible prediction functions
2. Loss function: Measure of prediction error
3. Regularization: Controls model complexity
4. Evaluation procedure: Assessing generalization
▶ Originally popularized through Support Vector Machines
▶ Provides a "probability-free" approach to learning
8.2.1 Hypothesis Class of Functions

▶ Given N examples xn ∈ RD with labels yn ∈ R


▶ Goal: Estimate predictor f (·, θ) : RD → R
▶ Seek parameters θ∗ such that f (xn , θ∗ ) ≈ yn for all n
▶ Common hypothesis class: Affine functions

f (xn , θ) = θ0 + ∑_{d=1}^{D} θd xn^(d) = θ⊤ xn (3)

▶ More complex hypothesis classes include non-linear functions,


neural networks
8.2.2 Loss Function for Training

▶ Loss function ℓ(yn , ŷn ) measures prediction error


▶ Input: True label yn and prediction ŷn = f (xn , θ)
▶ Output: Non-negative real number (higher = worse)
▶ Assume training examples are independent and identically
distributed (i.i.d.)
▶ Empirical risk: Average loss over training data

Remp (f , X, y) = (1/N) ∑_{n=1}^{N} ℓ(yn , ŷn ) (4)
8.2.2 Loss Function for Training (Example: Least-Squares)

▶ Squared loss: ℓ(yn , ŷn ) = (yn − ŷn )2


▶ Empirical risk with squared loss:

min_{θ∈R^D} (1/N) ∑_{n=1}^{N} (yn − f (xn , θ))² (5)

▶ With linear predictor f (xn , θ) = θ⊤ xn :

min_{θ∈R^D} (1/N) ∑_{n=1}^{N} (yn − θ⊤ xn )² = min_{θ∈R^D} (1/N) ||y − Xθ||² (6)

▶ This is the well-known least-squares problem
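A minimal numerical sketch of this problem, on synthetic data with an assumed true parameter vector (chosen only for illustration):

```python
import numpy as np

# Synthetic data: labels generated from a known linear model.
rng = np.random.default_rng(0)
N, D = 50, 3
X = rng.normal(size=(N, D))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true  # noise-free labels, for illustration only

# lstsq returns the minimizer of ||y - X theta||^2
# (the 1/N factor does not change the argmin).
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
```

With noise-free labels the recovered parameters match the generating ones exactly.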


8.2.2 Loss Function for Training (cont.)

▶ We ultimately care about expected risk on unseen data:

Rtrue (f ) = Ex,y [ℓ(y , f (x))] (7)


▶ Two practical challenges:
1. How to modify training procedure to generalize well?
2. How to estimate expected risk from finite data?
▶ Remark: In practice, loss functions might not directly match
the performance metric of interest
8.2.3 Regularization to Reduce Overfitting

▶ Overfitting: Model fits training data too closely, fails to


generalize
▶ Symptoms: Very small training error but large test error
▶ Causes:
▶ Complex hypothesis class with many parameters
▶ Limited training data
▶ Solution: Regularization - adds penalty term to discourage
complex solutions
8.2.3 Regularization to Reduce Overfitting (Example)

▶ Standard least-squares:

min_θ (1/N) ||y − Xθ||² (8)
▶ Regularized (Tikhonov) version:

min_θ (1/N) ||y − Xθ||² + λ||θ||² (9)
▶ Regularization parameter λ:
▶ Controls trade-off between data fit and parameter magnitude
▶ Higher λ = stronger preference for smaller parameters
▶ Equivalent to assuming prior distribution on parameters in
probabilistic view
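A sketch of the regularized problem in closed form, assuming the gradient condition (X⊤X + NλI)θ = X⊤y obtained by differentiating eq. (9); the data here are random placeholders:

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form minimizer of (1/N)||y - X theta||^2 + lam * ||theta||^2.

    Setting the gradient to zero gives (X^T X + N*lam*I) theta = X^T y.
    """
    N, D = X.shape
    return np.linalg.solve(X.T @ X + N * lam * np.eye(D), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))
y = rng.normal(size=20)

theta_0 = ridge(X, y, 0.0)    # ordinary least squares
theta_r = ridge(X, y, 10.0)   # stronger preference for small parameters
```

Increasing λ shrinks the parameter vector toward zero, as the trade-off bullet above describes.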
8.2.4 Cross-Validation to Assess Generalization

▶ Need to estimate generalization performance from finite data


▶ Challenge: Want both large training set and reliable
performance estimate
▶ K -fold cross-validation:
▶ Divide data into K equal parts
▶ Train on K − 1 parts, validate on remaining part
▶ Repeat for all K possible validation sets
▶ Average performance across all trials
8.2.4 Cross-Validation to Assess Generalization (cont.)

▶ For each partition k:


▶ Training data R^(k) produces predictor f^(k)
▶ Validation set V^(k) used to compute empirical risk R(f^(k) , V^(k) )
▶ Expected generalization error approximated as:

EV [R(f , V )] ≈ (1/K) ∑_{k=1}^{K} R(f^(k) , V^(k) ) (10)

▶ Cross-validation is "embarrassingly parallel" (the K folds can be computed in parallel)
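The procedure above can be sketched from scratch; the linear-regression data and the 5-fold split are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 40, 2, 5
X = rng.normal(size=(N, D))
y = X @ np.array([1.0, -1.0]) + 0.1 * rng.normal(size=N)

folds = np.array_split(np.arange(N), K)  # K roughly equal parts
risks = []
for k in range(K):
    val = folds[k]
    trn = np.concatenate([folds[j] for j in range(K) if j != k])
    theta_k, *_ = np.linalg.lstsq(X[trn], y[trn], rcond=None)  # predictor f^(k)
    risks.append(np.mean((y[val] - X[val] @ theta_k) ** 2))    # R(f^(k), V^(k))

cv_estimate = np.mean(risks)  # the average in eq. (10)
```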
8.2.5 Further Reading

▶ Statistical learning theory (Vapnik)


▶ Regularization:
▶ Tikhonov vs. Ivanov regularization
▶ Connection to bias-variance trade-off
▶ Relationship to feature selection
▶ Alternative validation approaches:
▶ Bootstrap
▶ Jackknife
▶ Connection to probability: Agnostic to choice of underlying
distribution
8.3 Parameter Estimation

▶ Statistical approach using probability distributions


▶ Explicitly models:
▶ Uncertainty in data (observation process)
▶ Uncertainty in model parameters
▶ Key concepts:
▶ Likelihood: Analogous to loss function (Section 8.2.2)
▶ Prior: Analogous to regularization (Section 8.2.3)
8.3.1 Maximum Likelihood Estimation

▶ Find parameters that maximize the probability of observing


the data
▶ For data x and probability density p(x|θ) parametrized by θ
▶ Negative log-likelihood:

Lx (θ) = − log p(x|θ) (11)

▶ Minimizing negative log-likelihood = maximizing likelihood


▶ Likelihood models uncertainty in the data given fixed
parameters
8.3.1 Maximum Likelihood Estimation (cont.)

▶ For supervised learning with example-label pairs {(xn , yn )}


▶ Specify conditional probability p(yn |xn , θ)
▶ Assuming i.i.d. data, likelihood factorizes:

p(Y|X, θ) = ∏_{n=1}^{N} p(yn |xn , θ) (12)

▶ Negative log-likelihood:

L(θ) = − log p(Y|X, θ) = − ∑_{n=1}^{N} log p(yn |xn , θ) (13)
8.3.1 Maximum Likelihood Estimation (Example: Gaussian Likelihood)
▶ Assume Gaussian likelihood with linear model:

p(yn |xn , θ) = N (yn | θ⊤ xn , σ²) (14)

▶ Negative log-likelihood:

L(θ) = − ∑_{n=1}^{N} log N (yn | θ⊤ xn , σ²) (15)
= (1/(2σ²)) ∑_{n=1}^{N} (yn − θ⊤ xn )² + constant (16)

▶ Minimizing negative log-likelihood = minimizing sum of


squared errors
▶ Shows equivalence to least-squares regression
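A quick numerical check of this equivalence, on synthetic data with an assumed noise level σ = 0.5: the Gaussian negative log-likelihood and the sum of squared errors differ only by a θ-independent constant, so they rank any two parameter vectors identically.

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, sigma = 40, 2, 0.5
X = rng.normal(size=(N, D))
y = X @ np.array([0.3, -1.2]) + sigma * rng.normal(size=N)

def nll(theta):
    """Negative log-likelihood under y_n ~ N(theta^T x_n, sigma^2), eq. (15)."""
    r = y - X @ theta
    return np.sum(r ** 2) / (2 * sigma ** 2) + 0.5 * N * np.log(2 * np.pi * sigma ** 2)

def sse(theta):
    """Sum of squared errors, the data-dependent part of eq. (16)."""
    return np.sum((y - X @ theta) ** 2)

# Differences in NLL equal differences in SSE / (2 sigma^2): same minimizer.
t1, t2 = np.array([0.0, 0.0]), np.array([0.3, -1.2])
```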
8.3.2 Maximum A Posteriori Estimation

▶ Incorporates prior knowledge about parameters


▶ Prior probability distribution on parameters p(θ)
▶ Use Bayes’ theorem to compute posterior distribution:

p(θ|x) = p(x|θ) p(θ) / p(x) (17)
▶ For optimization, can ignore denominator:

p(θ|x) ∝ p(x|θ)p(θ) (18)


8.3.2 Maximum A Posteriori Estimation (cont.)

▶ Maximum a posteriori (MAP) estimation:

θMAP = arg max_θ p(θ|x) = arg max_θ p(x|θ) p(θ) (19)

▶ Negative log-posterior:

− log p(θ|x) = − log p(x|θ) − log p(θ) + constant (20)


▶ Example: Gaussian prior with zero mean p(θ) = N (0, Σ)
▶ Corresponds to ℓ2 regularization in empirical risk minimization
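A sketch of this correspondence for a Gaussian likelihood with assumed variances σ² and prior Σ = τ²I (both values made up): minimizing the negative log-posterior ||y − Xθ||²/(2σ²) + ||θ||²/(2τ²) gives the same point as ridge regression with λ = σ²/(Nτ²).

```python
import numpy as np

rng = np.random.default_rng(2)
N, D = 30, 3
X = rng.normal(size=(N, D))
y = X @ np.array([0.5, -1.0, 2.0]) + 0.2 * rng.normal(size=N)

sigma2, tau2 = 0.04, 1.0  # assumed noise variance and prior variance

# MAP with likelihood N(y | X theta, sigma2 I) and prior N(0, tau2 I):
theta_map = np.linalg.solve(X.T @ X + (sigma2 / tau2) * np.eye(D), X.T @ y)

# Ridge solution of eq. (9) with lam = sigma2 / (N * tau2) gives the same point.
lam = sigma2 / (N * tau2)
theta_ridge = np.linalg.solve(X.T @ X + N * lam * np.eye(D), X.T @ y)
```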
8.3.3 Model Fitting

▶ Model fitting = optimizing parameters to minimize loss


function
▶ For parametrized model class Mθ , we seek parameters that
approximate true (unknown) model M∗
▶ Three scenarios:
1. Overfitting: Model class too rich for dataset
▶ Too many parameters
▶ Fits noise in training data
▶ Poor generalization
2. Underfitting: Model class not rich enough
▶ Too few parameters
▶ Cannot capture structure in data
3. Fitting well: Model class appropriately expressive
8.3.4 Further Reading

▶ Generalized linear models


▶ Linear dependence between parameters and data
▶ Potentially non-linear transformation (link function)
▶ Examples: logistic regression, Poisson regression
▶ Statistical frameworks:
▶ Bayesian vs. frequentist perspectives
▶ Alternative estimation methods (method of moments,
M-estimation)
8.4 Probabilistic Modeling and Inference

▶ Builds models that describe the generative process of observed


data
▶ Full probabilistic treatment of both:
▶ Observed variables (data)
▶ Hidden variables (parameters)
▶ Example (coin flip):
1. Define parameter µ (probability of ”heads”)
2. Sample outcome x ∈ {head, tail} from p(x|µ) = Ber(µ)
▶ Challenge: Learn about unobservable parameter µ from
observed outcomes
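For this coin-flip model, a conjugate Beta prior makes the posterior available in closed form: with a Beta(a, b) prior on µ, observing h heads and t tails gives the posterior Beta(a + h, b + t). The prior pseudo-counts and the flip sequence below are assumptions for illustration.

```python
# Conjugate Beta-Bernoulli update for the coin-flip example.
a, b = 2.0, 2.0            # assumed prior pseudo-counts, Beta(a, b) prior on mu
flips = [1, 0, 1, 1, 0]    # hypothetical observed outcomes, 1 = heads
h = sum(flips)
t = len(flips) - h

post_a, post_b = a + h, b + t           # posterior is Beta(a + h, b + t)
post_mean = post_a / (post_a + post_b)  # posterior mean estimate of mu
```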
8.4.1 Probabilistic Models

▶ A probabilistic model is specified by the joint distribution of


all random variables
▶ Central importance of joint distribution p(x, θ):
▶ Encapsulates prior and likelihood (product rule)
▶ Allows computation of marginal likelihood p(x) for model
selection
▶ Enables derivation of posterior via Bayes’ theorem
▶ Benefits: Unified framework for modeling, inference,
prediction, and model selection
8.4.2 Bayesian Inference

▶ Goal: Find posterior distribution over hidden variables


▶ Definition (Bayesian Inference): Process of finding
posterior distribution over parameters

p(θ|X) = p(X|θ) p(θ) / p(X) = p(X|θ) p(θ) / ∫ p(X|θ) p(θ) dθ (21)
▶ Enables uncertainty propagation to predictions:
p(x) = ∫ p(x|θ) p(θ) dθ = Eθ [p(x|θ)] (22)
8.4.2 Bayesian Inference (cont.)

▶ Comparison with parameter estimation:


▶ Parameter estimation: Point estimates via optimization
▶ Bayesian inference: Full distributions via integration
▶ Advantages of Bayesian approach:
▶ Principled incorporation of prior knowledge
▶ Uncertainty quantification
▶ Useful for decision-making under uncertainty
▶ Practical challenges: Integration is often intractable
▶ Requires approximation methods (MCMC, variational
inference, etc.)
8.4.3 Latent-Variable Models

▶ Include additional latent variables z besides model parameters


θ
▶ Latent variables:
▶ Describe data generation process
▶ Simplify model structure
▶ Enhance interpretability
▶ Often reduce number of parameters
▶ Generative process: p(x|z, θ) allows data generation given
latents and parameters
8.4.3 Latent-Variable Models (cont.)

▶ Likelihood requires marginalizing out latent variables:


p(x|θ) = ∫ p(x|z, θ) p(z) dz (23)

▶ Parameter posterior (integrating out latents):

p(θ|X) = p(X|θ) p(θ) / p(X) (24)
▶ Latent variable posterior:

p(z|X, θ) = p(X|z, θ) p(z) / p(X|θ) (25)
▶ Examples: PCA, Gaussian mixture models, hidden Markov
models
8.4.4 Further Reading

▶ Computational methods for Bayesian inference:


▶ Markov Chain Monte Carlo (MCMC)
▶ Variational inference
▶ Expectation propagation
▶ Applications of Bayesian inference:
▶ Topic modeling
▶ Click-through rate prediction
▶ Reinforcement learning
▶ Recommendation systems
▶ Probabilistic programming languages
8.5 Directed Graphical Models

▶ Visual language for specifying probabilistic models


▶ Definition (Directed Graphical Model/Bayesian
Network): Graph where:
▶ Nodes represent random variables
▶ Directed edges represent conditional dependencies
▶ Benefits:
▶ Visualizes structure of probabilistic model
▶ Provides insights into conditional independence properties
▶ Simplifies complex computations for inference and learning
8.5.1 Graph Semantics

▶ Directed links indicate conditional probabilities


▶ From factorized joint distribution to graphical model:
1. Create node for each random variable
2. Add directed link from conditioning variables to conditioned
variable
▶ From graphical model to joint distribution:

p(x) = p(x1 , . . . , xK ) = ∏_{k=1}^{K} p(xk |Pak ) (26)

where Pak denotes parent nodes of xk
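The factorization in eq. (26) can be sketched on a tiny chain a → b → c of binary variables, where the conditional probability tables are made-up numbers:

```python
# Tiny directed graphical model a -> b -> c (assumed CPTs).
# Eq. (26) gives p(a, b, c) = p(a) * p(b | a) * p(c | b).
p_a = {0: 0.6, 1: 0.4}
p_b_given_a = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}
p_c_given_b = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}

def joint(a, b, c):
    """Joint probability assembled from the parent factorization."""
    return p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]

# A valid factorization sums to 1 over all configurations.
total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
```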


8.5.1 Graph Semantics (cont.)

▶ Notation conventions:
▶ Observed variables: Shaded nodes
▶ Latent variables: Unshaded nodes
▶ Deterministic parameters: No circle
▶ Plate notation: Box indicating repeated structure
▶ Used for i.i.d. observations
▶ Makes diagrams more compact
8.5.2 Conditional Independence and d-Separation

▶ Graphical models reveal conditional independence properties


▶ Definition (d-separation): For sets of nodes A, B, C, a path
from A to B is blocked if:
▶ Arrows meet head-to-tail or tail-to-tail at node in C, OR
▶ Arrows meet head-to-head at node where neither the node nor
its descendants are in C
▶ If all paths from A to B are blocked by C, then A is
d-separated from B given C
▶ d-separation implies conditional independence: A ⊥⊥ B | C
8.5.3 Further Reading

▶ Types of probabilistic graphical models:


▶ Directed graphical models (Bayesian networks)
▶ Undirected graphical models (Markov random fields)
▶ Factor graphs
▶ Applications:
▶ Computer vision (image segmentation, denoising)
▶ Ranking in online games
▶ Coding theory
▶ Signal processing
▶ Related areas:
▶ Structured prediction
▶ Causal inference
8.6 Model Selection

▶ Challenge: High-level modeling decisions significantly affect


performance
▶ Examples of choices:
▶ Polynomial degree in regression
▶ Number of components in mixture model
▶ Network architecture in neural networks
▶ Kernel type in SVM
▶ Trade-off: Model complexity vs. generalization
▶ More complex models:
▶ More expressive/flexible
▶ Higher risk of overfitting
8.6.1 Nested Cross-Validation

▶ Extension of standard cross-validation (Section 8.2.4)


▶ Two levels of K-fold cross-validation:
▶ Inner loop: Evaluates different model choices on validation set
▶ Outer loop: Estimates generalization performance of best
model from inner loop
▶ Inner loop approximates:

EV [R(V |M)] ≈ (1/K) ∑_{k=1}^{K} R(V (k) |M) (27)

▶ Provides both expected error and error uncertainty (standard


error)
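The inner loop can be sketched as follows: polynomial degree plays the role of the model choice M, and the data-generating function and candidate degrees are assumptions for illustration. An outer cross-validation loop would wrap this whole selection step to estimate the chosen model's generalization error.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 60
x = rng.uniform(-1, 1, N)
y = np.sin(3 * x) + 0.1 * rng.normal(size=N)

def kfold_risk(x, y, deg, K=5):
    """Inner loop: cross-validated squared-error risk of a degree-`deg` polynomial."""
    folds = np.array_split(np.arange(len(x)), K)
    risks = []
    for k in range(K):
        val = folds[k]
        trn = np.concatenate([folds[j] for j in range(K) if j != k])
        coef = np.polyfit(x[trn], y[trn], deg)
        risks.append(np.mean((np.polyval(coef, x[val]) - y[val]) ** 2))
    return np.mean(risks)

# Model selection: pick the degree with the smallest cross-validated risk.
degrees = [1, 3, 5]
best_deg = min(degrees, key=lambda d: kfold_risk(x, y, d))
```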
8.6.2 Bayesian Model Selection

▶ Based on Occam’s razor: Prefer simplest model that explains


data well
▶ Bayesian probability automatically embodies this principle
▶ Treat model selection as hierarchical inference problem:
1. Place prior p(M) over model set M = {M1 , . . . , MK }
2. For each model, define parameter distribution p(θ|M)
3. Generate data according to p(D|θ)
▶ Posterior distribution over models:

p(Mk |D) ∝ p(Mk )p(D|Mk ) (28)


8.6.2 Bayesian Model Selection (cont.)

▶ Model evidence/marginal likelihood:


p(D|Mk ) = ∫ p(D|θk ) p(θk |Mk ) dθk (29)

▶ MAP estimate of model (with uniform prior):

M∗ = arg max_{Mk} p(Mk |D) = arg max_{Mk} p(D|Mk ) (30)

▶ Key difference from maximum likelihood:


▶ Parameters integrated out (no overfitting)
▶ Automatically balances complexity vs. data fit
8.6.3 Bayes Factors for Model Comparison

▶ Compare two models M1 and M2 given data D


▶ Posterior odds:

p(M1 |D) / p(M2 |D) = [p(M1 ) / p(M2 )] · [p(D|M1 ) / p(D|M2 )] (31)

▶ Prior odds: p(M1 ) / p(M2 ) (prior beliefs favoring M1 over M2 )
▶ Bayes factor: p(D|M1 ) / p(D|M2 ) (how well the data are predicted by M1 vs. M2 )
▶ With uniform prior: Posterior odds = Bayes factor
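A hypothetical Bayes-factor computation for coin-flip data: M1 fixes µ = 0.5 (fair coin), M2 places a uniform Beta(1, 1) prior on µ, so its evidence is the Beta integral ∫ µ^h (1 − µ)^t dµ = h! t! / (h + t + 1)!. The observed counts are made up for illustration.

```python
from math import comb

h, t = 8, 2  # hypothetical observed heads and tails

# Model evidences p(D | M):
ev_m1 = 0.5 ** (h + t)                        # mu fixed at 0.5
ev_m2 = 1.0 / ((h + t + 1) * comb(h + t, h))  # integral of mu^h (1 - mu)^t over [0, 1]

bayes_factor = ev_m2 / ev_m1  # > 1 means the data favor M2
```

Here the heads-heavy data favor the flexible model M2 over the fair-coin model.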
8.6.4 Further Reading

▶ Alternative model selection criteria:


▶ Akaike Information Criterion (AIC):

log p(x|θ) − M (32)

▶ Bayesian Information Criterion (BIC):

log p(x|θ) − (1/2) M log N (33)
▶ Bayesian non-parametric models
▶ Automatic Occam’s razor in Bayesian inference
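A sketch of the two criteria in the penalized log-likelihood form of eqs. (32) and (33), where larger scores are better; the log-likelihood values below are made-up placeholders:

```python
import numpy as np

def aic(log_lik, M):
    """AIC as in eq. (32): log-likelihood penalized by the parameter count M."""
    return log_lik - M

def bic(log_lik, M, N):
    """BIC as in eq. (33): penalty grows with the dataset size N."""
    return log_lik - 0.5 * M * np.log(N)

# With equal fit, the model with fewer parameters scores higher on both criteria.
score_small = bic(-100.0, M=2, N=1000)
score_big = bic(-100.0, M=10, N=1000)
```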
Key Questions and Answers
▶ Q1: What are the three main components of a machine learning system?
A1: Data (represented as vectors), Models (as functions or probability
distributions), and Learning (process of parameter estimation).
▶ Q2: How does empirical risk minimization differ from maximum
likelihood estimation?
A2: Empirical risk minimization minimizes an arbitrary loss function on
training data, while maximum likelihood estimation maximizes the
probability of observing the data under a probabilistic model.
▶ Q3: What is the difference between MAP estimation and Bayesian
inference?
A3: MAP estimation finds a single best parameter value that maximizes
the posterior probability, while Bayesian inference computes the full
posterior distribution over parameters.
▶ Q4: What is the purpose of regularization in machine learning?
A4: Regularization discourages complex solutions by adding a penalty
term to the objective function, helping prevent overfitting and improve
generalization to unseen data.
▶ Q5: How does Bayesian model selection embody Occam’s razor?
A5: The marginal likelihood automatically balances model complexity
and data fit, favoring simpler models that explain the data well over
unnecessarily complex ones.
