Chapter 8: When Models Meet Data
Mathematics for Machine Learning
SNU 2025 Spring Introduction to Data Science
Chapter Overview
▶ This chapter bridges mathematical foundations with practical
machine learning
▶ Introduces the four pillars of machine learning:
▶ Regression (Chapter 9)
▶ Dimensionality reduction (Chapter 10)
▶ Density estimation (Chapter 11)
▶ Classification (Chapter 12)
▶ Covers essential concepts:
▶ Data representation
▶ Models (as functions and probability distributions)
▶ Learning methods (parameter estimation)
▶ Model selection and evaluation
8.1 Data, Models, and Learning
▶ Core Question: "What do we mean by good models?"
▶ Good models must perform well on unseen data
▶ Requires:
▶ Defining performance metrics
▶ Developing methods to optimize these metrics
▶ Three major components of a machine learning system:
▶ Data: Represented numerically (examples, features)
▶ Models: Functions or probability distributions
▶ Learning: Process of parameter estimation
8.1.1 Data as Vectors
▶ Data represented in tabular form:
▶ Rows: Examples/instances (n = 1, . . . , N)
▶ Columns: Features/attributes/covariates (d = 1, . . . , D)
▶ Each example x_n is a D-dimensional vector of real numbers
▶ For supervised learning: example-label pairs
{(x_1, y_1), . . . , (x_N, y_N)}
▶ Matrix representation: X ∈ R^{N×D}
▶ Proper data preparation is crucial (see the sketch below):
▶ Converting categorical variables to numerical representations
▶ Normalizing numerical features (scaling, shifting)
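▶ A minimal sketch of both preparation steps (assuming NumPy; the
data values are hypothetical):

    import numpy as np

    # Hypothetical toy data: one categorical and one numeric feature.
    color = np.array(["red", "green", "red", "blue"])   # categorical
    height = np.array([1.7, 1.6, 1.8, 1.5])             # numeric

    # Categorical -> one-hot vectors (one column per category).
    categories = np.unique(color)
    one_hot = (color[:, None] == categories[None, :]).astype(float)

    # Numeric -> zero mean, unit variance (shifting and scaling).
    height_std = (height - height.mean()) / height.std()

    # Design matrix X in R^{N x D}: rows are examples, columns are features.
    X = np.column_stack([one_hot, height_std])
    print(X.shape)   # (4, 4)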
8.1.1 Data as Vectors (cont.)
▶ Linear algebra allows comparison of examples via:
▶ Similarity measures (similar features → similar labels)
▶ Distance metrics (geometric interpretation)
▶ Feature representations:
▶ Low-dimensional approximations: PCA (Chapter 10)
▶ High-dimensional representations: Feature maps ϕ(x)
▶ Feature maps create non-linear combinations of the original
features (see the sketch below)
▶ Feature maps can lead to kernels (Chapter 12)
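▶ A sketch of a polynomial feature map ϕ(x) for scalar inputs
(illustrative only; NumPy assumed):

    import numpy as np

    def phi(x, degree=3):
        # Maps scalar x to (1, x, x^2, ..., x^degree); a model that is
        # linear in phi(x) is a non-linear function of x.
        return np.array([x**d for d in range(degree + 1)])

    print(phi(2.0))   # [1. 2. 4. 8.]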
8.1.2 Models as Functions
▶ A predictor maps inputs to outputs (prediction function)
▶ For scalar outputs:
f : R^D → R    (1)
▶ The input vector x is D-dimensional; the output f(x) is a real number
▶ We focus on linear functions:
f(x) = θ^⊤ x + θ_0    (2)
▶ Linear functions balance expressivity with mathematical
simplicity
8.1.3 Models as Probability Distributions
▶ Probabilistic perspective accounts for uncertainty
▶ Sources of uncertainty:
▶ Noisy observations (data uncertainty)
▶ Model uncertainty (confidence in predictions)
▶ Instead of a single function, use distributions over possible
functions
▶ Focus on distributions with finite-dimensional parameters
▶ Uncertainty in predictions visualized as distributions (e.g.,
Gaussian)
8.1.4 Learning is Finding Parameters
▶ Learning goal: Find model parameters that perform well on
unseen data
▶ Three algorithmic phases:
1. Prediction/Inference: Using trained model on new data
2. Training/Parameter Estimation: Adjusting parameters using
training data
3. Hyperparameter Tuning/Model Selection: Choosing model
structure
▶ Two main parameter estimation strategies:
▶ Finding point estimates (empirical risk minimization)
▶ Bayesian inference (distribution over parameters)
8.1.4 Learning is Finding Parameters (cont.)
▶ Parameter estimation principles:
▶ Empirical risk minimization: Direct optimization problem
(Section 8.2)
▶ Maximum likelihood: Statistical perspective (Section 8.3)
▶ Generalization challenge:
▶ Need to balance fitting the training data vs. finding "simple"
explanations
▶ Tools: Regularization, priors, cross-validation
▶ Remark: Parameters vs. hyperparameters
▶ Parameters: Directly optimized during training
▶ Hyperparameters: Control model structure or training process
8.2 Empirical Risk Minimization
▶ Core idea: Learn by minimizing average loss on training data
▶ Framework components:
1. Hypothesis class: Set of possible prediction functions
2. Loss function: Measure of prediction error
3. Regularization: Controls model complexity
4. Evaluation procedure: Assessing generalization
▶ Originally popularized through Support Vector Machines
▶ Provides ”probability-free” approach to learning
8.2.1 Hypothesis Class of Functions
▶ Given N examples x_n ∈ R^D with labels y_n ∈ R
▶ Goal: Estimate a predictor f(·, θ) : R^D → R
▶ Seek parameters θ* such that f(x_n, θ*) ≈ y_n for all n
▶ Common hypothesis class: Affine functions
f(x_n, θ) = θ_0 + Σ_{d=1}^{D} θ_d x_n^{(d)} = θ^⊤ x_n    (3)
(the compact form θ^⊤ x_n absorbs θ_0 into θ by prepending a constant
feature 1 to x_n)
▶ More complex hypothesis classes include non-linear functions,
neural networks
8.2.2 Loss Function for Training
▶ The loss function ℓ(y_n, ŷ_n) measures prediction error
▶ Input: true label y_n and prediction ŷ_n = f(x_n, θ)
▶ Output: a non-negative real number (higher = worse)
▶ Assume training examples are independent and identically
distributed (i.i.d.)
▶ Empirical risk: Average loss over the training data
R_emp(f, X, y) = (1/N) Σ_{n=1}^{N} ℓ(y_n, ŷ_n)    (4)
8.2.2 Loss Function for Training (Example: Least-Squares)
▶ Squared loss: ℓ(y_n, ŷ_n) = (y_n − ŷ_n)²
▶ Empirical risk with squared loss:
min_{θ∈R^D} (1/N) Σ_{n=1}^{N} (y_n − f(x_n, θ))²    (5)
▶ With the linear predictor f(x_n, θ) = θ^⊤ x_n:
min_{θ∈R^D} (1/N) Σ_{n=1}^{N} (y_n − θ^⊤ x_n)² = min_{θ∈R^D} (1/N) ‖y − Xθ‖²    (6)
▶ This is the well-known least-squares problem
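▶ A sketch of solving (6) numerically on synthetic data (NumPy
assumed; names are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    N, D = 100, 3
    X = rng.normal(size=(N, D))                      # design matrix
    theta_true = np.array([1.0, -2.0, 0.5])
    y = X @ theta_true + 0.1 * rng.normal(size=N)    # noisy labels

    # lstsq minimizes ||y - X theta||^2, i.e. the objective in (6).
    theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

    # Empirical risk (4) with squared loss at the estimate.
    print(theta_hat, np.mean((y - X @ theta_hat) ** 2))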
8.2.2 Loss Function for Training (cont.)
▶ We ultimately care about the expected risk on unseen data:
R_true(f) = E_{x,y}[ℓ(y, f(x))]    (7)
▶ Two practical challenges:
1. How to modify training procedure to generalize well?
2. How to estimate expected risk from finite data?
▶ Remark: In practice, loss functions might not directly match
the performance metric of interest
8.2.3 Regularization to Reduce Overfitting
▶ Overfitting: Model fits training data too closely, fails to
generalize
▶ Symptoms: Very small training error but large test error
▶ Causes:
▶ Complex hypothesis class with many parameters
▶ Limited training data
▶ Solution: Regularization, which adds a penalty term to discourage
overly complex solutions
8.2.3 Regularization to Reduce Overfitting (Example)
▶ Standard least-squares:
min_θ (1/N) ‖y − Xθ‖²    (8)
▶ Regularized (Tikhonov) version:
min_θ (1/N) ‖y − Xθ‖² + λ‖θ‖²    (9)
▶ Regularization parameter λ:
▶ Controls trade-off between data fit and parameter magnitude
▶ Higher λ = stronger preference for smaller parameters
▶ Equivalent to placing a prior distribution on the parameters in the
probabilistic view (see the sketch below)
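▶ Setting the gradient of (9) to zero gives the closed form
θ = (X^⊤X + NλI)^{−1} X^⊤ y; a sketch under the same synthetic setup
as before (NumPy assumed):

    import numpy as np

    rng = np.random.default_rng(0)
    N, D = 100, 3
    X = rng.normal(size=(N, D))
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=N)

    lam = 0.1   # regularization parameter lambda
    # Normal equations of (9): (X^T X + N*lam*I) theta = X^T y.
    theta_ridge = np.linalg.solve(X.T @ X + N * lam * np.eye(D), X.T @ y)
    print(theta_ridge)   # shrunk towards zero relative to least squares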
8.2.4 Cross-Validation to Assess Generalization
▶ Need to estimate generalization performance from finite data
▶ Challenge: Want both large training set and reliable
performance estimate
▶ K -fold cross-validation:
▶ Divide data into K equal parts
▶ Train on K − 1 parts, validate on remaining part
▶ Repeat for all K possible validation sets
▶ Average performance across all trials
8.2.4 Cross-Validation to Assess Generalization (cont.)
▶ For each partition k:
▶ The training data R^(k) produce the predictor f^(k)
▶ The validation set V^(k) is used to compute the empirical risk
R(f^(k), V^(k))
▶ The expected generalization error is approximated as:
E_V[R(f, V)] ≈ (1/K) Σ_{k=1}^{K} R(f^(k), V^(k))    (10)
▶ Cross-validation is "embarrassingly parallel" (the K folds can be
computed independently in parallel); a sketch follows
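▶ A minimal K-fold cross-validation sketch (NumPy assumed; the
least-squares fit stands in for any learner):

    import numpy as np

    def fit(X, y):
        theta, *_ = np.linalg.lstsq(X, y, rcond=None)
        return theta

    def kfold_risk(X, y, K=5, seed=0):
        # Approximates E_V[R(f, V)] as in (10).
        idx = np.random.default_rng(seed).permutation(len(y))
        risks = []
        for fold in np.array_split(idx, K):
            train = np.setdiff1d(idx, fold)        # training part R^(k)
            theta = fit(X[train], y[train])        # predictor f^(k)
            risks.append(np.mean((y[fold] - X[fold] @ theta) ** 2))
        return np.mean(risks)                      # average of R(f^(k), V^(k))

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
    print(kfold_risk(X, y))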
8.2.5 Further Reading
▶ Statistical learning theory (Vapnik)
▶ Regularization:
▶ Tikhonov vs. Ivanov regularization
▶ Connection to bias-variance trade-off
▶ Relationship to feature selection
▶ Alternative validation approaches:
▶ Bootstrap
▶ Jackknife
▶ Connection to probability: Agnostic to choice of underlying
distribution
8.3 Parameter Estimation
▶ Statistical approach using probability distributions
▶ Explicitly models:
▶ Uncertainty in data (observation process)
▶ Uncertainty in model parameters
▶ Key concepts:
▶ Likelihood: Analogous to loss function (Section 8.2.2)
▶ Prior: Analogous to regularization (Section 8.2.3)
8.3.1 Maximum Likelihood Estimation
▶ Find parameters that maximize the probability of observing
the data
▶ For data x and probability density p(x|θ) parametrized by θ
▶ Negative log-likelihood:
L_x(θ) = − log p(x|θ)    (11)
▶ Minimizing negative log-likelihood = maximizing likelihood
▶ Likelihood models uncertainty in the data given fixed
parameters
8.3.1 Maximum Likelihood Estimation (cont.)
▶ For supervised learning with example-label pairs {(x_n, y_n)}
▶ Specify the conditional probability p(y_n|x_n, θ)
▶ Assuming i.i.d. data, the likelihood factorizes:
p(Y|X, θ) = Π_{n=1}^{N} p(y_n|x_n, θ)    (12)
▶ Negative log-likelihood:
L(θ) = − log p(Y|X, θ) = − Σ_{n=1}^{N} log p(y_n|x_n, θ)    (13)
8.3.1 Maximum Likelihood Estimation (Example: Gaussian Likelihood)
▶ Assume a Gaussian likelihood with a linear model:
p(y_n|x_n, θ) = N(y_n | x_n^⊤ θ, σ²)    (14)
▶ Negative log-likelihood:
L(θ) = − Σ_{n=1}^{N} log N(y_n | x_n^⊤ θ, σ²)    (15)
     = (1/(2σ²)) Σ_{n=1}^{N} (y_n − x_n^⊤ θ)² + constant    (16)
▶ Minimizing negative log-likelihood = minimizing sum of
squared errors
▶ Shows equivalence to least-squares regression
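▶ A numerical check of this equivalence (NumPy assumed; data are
synthetic): perturbing the least-squares solution can only increase
the Gaussian negative log-likelihood.

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(50, 2))
    y = X @ np.array([2.0, -1.0]) + 0.3 * rng.normal(size=50)
    sigma2 = 0.3**2

    def nll(theta):
        # (15)-(16) up to the additive constant.
        return np.sum((y - X @ theta) ** 2) / (2 * sigma2)

    theta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(nll(theta_ls) <= nll(theta_ls + 0.01))   # True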
8.3.2 Maximum A Posteriori Estimation
▶ Incorporates prior knowledge about parameters
▶ Prior probability distribution on parameters p(θ)
▶ Use Bayes' theorem to compute the posterior distribution:
p(θ|x) = p(x|θ) p(θ) / p(x)    (17)
▶ For optimization, can ignore denominator:
p(θ|x) ∝ p(x|θ)p(θ) (18)
8.3.2 Maximum A Posteriori Estimation (cont.)
▶ Maximum a posteriori (MAP) estimation:
θ_MAP = arg max_θ p(θ|x) = arg max_θ p(x|θ) p(θ)    (19)
▶ Negative log-posterior:
− log p(θ|x) = − log p(x|θ) − log p(θ) + constant (20)
▶ Example: a Gaussian prior with zero mean, p(θ) = N(0, Σ)
▶ Corresponds to ℓ2 regularization in empirical risk minimization (see
the sketch below)
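▶ A sketch of this correspondence (NumPy assumed): with likelihood
N(y_n|x_n^⊤θ, σ²) and prior N(0, b²I), the MAP estimate equals the
ridge solution of (9) with λ = σ²/(N b²).

    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.normal(size=(80, 3))
    y = X @ np.array([1.0, 0.0, -1.0]) + 0.2 * rng.normal(size=80)
    sigma2, b2 = 0.2**2, 1.0    # noise variance, prior variance

    # Minimizing -log p(theta|X, y) =
    #   ||y - X theta||^2 / (2 sigma2) + ||theta||^2 / (2 b2) + const
    # gives (X^T X + (sigma2/b2) I) theta = X^T y.
    theta_map = np.linalg.solve(X.T @ X + (sigma2 / b2) * np.eye(3), X.T @ y)
    print(theta_map)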
8.3.3 Model Fitting
▶ Model fitting = optimizing parameters to minimize loss
function
▶ For a parametrized model class M_θ, we seek parameters that
approximate the true (unknown) model M*
▶ Three scenarios:
1. Overfitting: Model class too rich for dataset
▶ Too many parameters
▶ Fits noise in training data
▶ Poor generalization
2. Underfitting: Model class not rich enough
▶ Too few parameters
▶ Cannot capture structure in data
3. Fitting well: Model class appropriately expressive
8.3.4 Further Reading
▶ Generalized linear models
▶ Linear dependence between parameters and data
▶ Potentially non-linear transformation (link function)
▶ Examples: logistic regression, Poisson regression
▶ Statistical frameworks:
▶ Bayesian vs. frequentist perspectives
▶ Alternative estimation methods (method of moments,
M-estimation)
8.4 Probabilistic Modeling and Inference
▶ Builds models that describe the generative process of observed
data
▶ Full probabilistic treatment of both:
▶ Observed variables (data)
▶ Hidden variables (parameters)
▶ Example (coin flip):
1. Define parameter µ (probability of "heads")
2. Sample outcome x ∈ {head, tail} from p(x|µ) = Ber(µ)
▶ Challenge: Learn about unobservable parameter µ from
observed outcomes
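▶ For the coin example, a Beta prior on µ is conjugate to the
Bernoulli likelihood, so the posterior has a closed form (sketch;
NumPy assumed, outcomes hypothetical):

    import numpy as np

    a, b = 2.0, 2.0                              # Beta(a, b) prior on mu
    flips = np.array([1, 0, 1, 1, 0, 1, 1, 1])   # 1 = heads

    # Conjugacy: posterior is Beta(a + #heads, b + #tails).
    a_post = a + flips.sum()
    b_post = b + len(flips) - flips.sum()
    print(a_post / (a_post + b_post))            # posterior mean of mu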
8.4.1 Probabilistic Models
▶ A probabilistic model is specified by the joint distribution of
all random variables
▶ Central importance of joint distribution p(x, θ):
▶ Encapsulates prior and likelihood (product rule)
▶ Allows computation of marginal likelihood p(x) for model
selection
▶ Enables derivation of posterior via Bayes’ theorem
▶ Benefits: Unified framework for modeling, inference,
prediction, and model selection
8.4.2 Bayesian Inference
▶ Goal: Find posterior distribution over hidden variables
▶ Definition (Bayesian Inference): Process of finding
posterior distribution over parameters
p(θ|X) = p(X|θ) p(θ) / p(X) = p(X|θ) p(θ) / ∫ p(X|θ) p(θ) dθ    (21)
▶ Enables uncertainty propagation to predictions:
p(x) = ∫ p(x|θ) p(θ) dθ = E_θ[p(x|θ)]    (22)
8.4.2 Bayesian Inference (cont.)
▶ Comparison with parameter estimation:
▶ Parameter estimation: Point estimates via optimization
▶ Bayesian inference: Full distributions via integration
▶ Advantages of Bayesian approach:
▶ Principled incorporation of prior knowledge
▶ Uncertainty quantification
▶ Useful for decision-making under uncertainty
▶ Practical challenges: Integration is often intractable
▶ Requires approximation methods (MCMC, variational
inference, etc.)
8.4.3 Latent-Variable Models
▶ Include additional latent variables z besides model parameters
θ
▶ Latent variables:
▶ Describe data generation process
▶ Simplify model structure
▶ Enhance interpretability
▶ Often reduce number of parameters
▶ Generative process: p(x|z, θ) allows data generation given
latents and parameters
8.4.3 Latent-Variable Models (cont.)
▶ The likelihood requires marginalizing out the latent variables:
p(x|θ) = ∫ p(x|z, θ) p(z) dz    (23)
▶ Parameter posterior (latents integrated out):
p(θ|X) = p(X|θ) p(θ) / p(X)    (24)
▶ Latent-variable posterior:
p(z|X, θ) = p(X|z, θ) p(z) / p(X|θ)    (25)
▶ Examples: PCA, Gaussian mixture models, hidden Markov
models
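▶ A sketch of (23) when z is discrete (a two-component Gaussian
mixture; NumPy assumed, parameter values illustrative): the integral
becomes a sum over components.

    import numpy as np

    def gauss(x, mu, var):
        return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

    pi = np.array([0.3, 0.7])     # p(z): mixture weights
    mu = np.array([-2.0, 1.0])    # component means
    var = np.array([0.5, 1.0])    # component variances

    def likelihood(x):
        # p(x|theta) = sum_z p(x|z, theta) p(z): z marginalized out.
        return np.sum(pi * gauss(x, mu, var))

    print(likelihood(0.5))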
8.4.4 Further Reading
▶ Computational methods for Bayesian inference:
▶ Markov Chain Monte Carlo (MCMC)
▶ Variational inference
▶ Expectation propagation
▶ Applications of Bayesian inference:
▶ Topic modeling
▶ Click-through rate prediction
▶ Reinforcement learning
▶ Recommendation systems
▶ Probabilistic programming languages
8.5 Directed Graphical Models
▶ Visual language for specifying probabilistic models
▶ Definition (Directed Graphical Model/Bayesian
Network): Graph where:
▶ Nodes represent random variables
▶ Directed edges represent conditional dependencies
▶ Benefits:
▶ Visualizes structure of probabilistic model
▶ Provides insights into conditional independence properties
▶ Simplifies complex computations for inference and learning
8.5.1 Graph Semantics
▶ Directed links indicate conditional probabilities
▶ From factorized joint distribution to graphical model:
1. Create node for each random variable
2. Add directed link from conditioning variables to conditioned
variable
▶ From a graphical model to the joint distribution:
p(x) = p(x_1, . . . , x_K) = Π_{k=1}^{K} p(x_k | Pa_k)    (26)
where Pa_k denotes the parent nodes of x_k
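▶ A sketch of (26) for three binary variables with the chain
structure a → b → c (hypothetical probability tables):

    # Joint p(a, b, c) = p(a) p(b|a) p(c|b), read off the graph.
    p_a = {0: 0.6, 1: 0.4}
    p_b_given_a = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}
    p_c_given_b = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}

    def joint(a, b, c):
        # One conditional factor p(x_k | Pa_k) per node.
        return p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]

    # Sanity check: the factors define a valid distribution.
    print(sum(joint(a, b, c)
              for a in (0, 1) for b in (0, 1) for c in (0, 1)))   # 1.0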
8.5.1 Graph Semantics (cont.)
▶ Notation conventions:
▶ Observed variables: Shaded nodes
▶ Latent variables: Unshaded nodes
▶ Deterministic parameters: No circle
▶ Plate notation: Box indicating repeated structure
▶ Used for i.i.d. observations
▶ Makes diagrams more compact
8.5.2 Conditional Independence and d-Separation
▶ Graphical models reveal conditional independence properties
▶ Definition (d-separation): For sets of nodes A, B, C, a path
from A to B is blocked if:
▶ Arrows meet head-to-tail or tail-to-tail at node in C, OR
▶ Arrows meet head-to-head at node where neither the node nor
its descendants are in C
▶ If all paths from A to B are blocked by C, then A is
d-separated from B given C
▶ d-separation implies conditional independence: A ⊥⊥ B | C
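▶ In the chain a → b → c from the previous sketch, node b blocks the
only path head-to-tail, so a ⊥⊥ c | b; a brute-force numeric check
(tables repeated for self-containment):

    p_a = {0: 0.6, 1: 0.4}
    p_b_given_a = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}
    p_c_given_b = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}
    joint = lambda a, b, c: p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]

    def prob(pred):
        # Sum the joint over all assignments satisfying the predicate.
        return sum(joint(a, b, c) for a in (0, 1) for b in (0, 1)
                   for c in (0, 1) if pred(a, b, c))

    pb = prob(lambda a, b, c: b == 1)
    lhs = prob(lambda a, b, c: (a, b, c) == (1, 1, 1)) / pb
    rhs = (prob(lambda a, b, c: (a, b) == (1, 1)) / pb) \
        * (prob(lambda a, b, c: (b, c) == (1, 1)) / pb)
    print(abs(lhs - rhs) < 1e-12)   # True: p(a, c|b) = p(a|b) p(c|b)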
8.5.3 Further Reading
▶ Types of probabilistic graphical models:
▶ Directed graphical models (Bayesian networks)
▶ Undirected graphical models (Markov random fields)
▶ Factor graphs
▶ Applications:
▶ Computer vision (image segmentation, denoising)
▶ Ranking in online games
▶ Coding theory
▶ Signal processing
▶ Related areas:
▶ Structured prediction
▶ Causal inference
8.6 Model Selection
▶ Challenge: High-level modeling decisions significantly affect
performance
▶ Examples of choices:
▶ Polynomial degree in regression
▶ Number of components in mixture model
▶ Network architecture in neural networks
▶ Kernel type in SVM
▶ Trade-off: Model complexity vs. generalization
▶ More complex models:
▶ More expressive/flexible
▶ Higher risk of overfitting
8.6.1 Nested Cross-Validation
▶ Extension of standard cross-validation (Section 8.2.4)
▶ Two levels of K-fold cross-validation:
▶ Inner loop: Evaluates different model choices on validation set
▶ Outer loop: Estimates generalization performance of best
model from inner loop
▶ The inner loop approximates:
E_V[R(V|M)] ≈ (1/K) Σ_{k=1}^{K} R(V^(k)|M)    (27)
▶ Provides both expected error and error uncertainty (standard
error)
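▶ A compact nested cross-validation sketch (NumPy assumed; the ridge
strength λ plays the role of the model choice M):

    import numpy as np

    def ridge_fit(X, y, lam):
        return np.linalg.solve(X.T @ X + len(y) * lam * np.eye(X.shape[1]),
                               X.T @ y)

    def mse(theta, X, y):
        return np.mean((y - X @ theta) ** 2)

    def nested_cv(X, y, lams=(0.001, 0.01, 0.1, 1.0), K=5, seed=0):
        idx = np.random.default_rng(seed).permutation(len(y))
        outer = []
        for test in np.array_split(idx, K):              # outer loop
            rest = np.setdiff1d(idx, test)
            inner = {lam: [] for lam in lams}
            for val in np.array_split(rest, K):          # inner loop, eq. (27)
                train = np.setdiff1d(rest, val)
                for lam in lams:
                    inner[lam].append(
                        mse(ridge_fit(X[train], y[train], lam), X[val], y[val]))
            best = min(lams, key=lambda lam: np.mean(inner[lam]))
            outer.append(mse(ridge_fit(X[rest], y[rest], best),
                             X[test], y[test]))
        return np.mean(outer), np.std(outer)   # expected error and its spread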
8.6.2 Bayesian Model Selection
▶ Based on Occam’s razor: Prefer simplest model that explains
data well
▶ Bayesian probability automatically embodies this principle
▶ Treat model selection as hierarchical inference problem:
1. Place a prior p(M) over the model set M = {M_1, . . . , M_K}
2. For each model, define parameter distribution p(θ|M)
3. Generate data according to p(D|θ)
▶ Posterior distribution over models:
p(M_k|D) ∝ p(M_k) p(D|M_k)    (28)
8.6.2 Bayesian Model Selection (cont.)
▶ Model evidence/marginal likelihood:
p(D|M_k) = ∫ p(D|θ_k) p(θ_k|M_k) dθ_k    (29)
▶ MAP estimate of the model (with a uniform prior):
M* = arg max_{M_k} p(M_k|D) = arg max_{M_k} p(D|M_k)    (30)
▶ Key difference from maximum likelihood:
▶ Parameters integrated out (no overfitting)
▶ Automatically balances complexity vs. data fit
8.6.3 Bayes Factors for Model Comparison
▶ Compare two models M_1 and M_2 given data D
▶ Posterior odds:
p(M_1|D) / p(M_2|D) = [p(M_1) / p(M_2)] · [p(D|M_1) / p(D|M_2)]    (31)
▶ Prior odds p(M_1)/p(M_2): prior beliefs favoring M_1 over M_2
▶ Bayes factor p(D|M_1)/p(D|M_2): how well the data are predicted by
M_1 vs. M_2
▶ With a uniform prior: posterior odds = Bayes factor
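▶ A sketch comparing two coin models by Bayes factor (SciPy assumed
for the log-Beta function; data hypothetical): M_1 = fair coin
(µ = 0.5), M_2 = unknown bias with a uniform Beta(1, 1) prior.

    import numpy as np
    from scipy.special import betaln

    flips = np.array([1, 1, 1, 0, 1, 1, 1, 1, 0, 1])   # 8 heads, 2 tails
    h, t = flips.sum(), len(flips) - flips.sum()

    log_ev_1 = (h + t) * np.log(0.5)                   # M_1: no free parameters
    # M_2: marginal likelihood B(1 + h, 1 + t) / B(1, 1), parameters
    # integrated out as in (29).
    log_ev_2 = betaln(1 + h, 1 + t) - betaln(1, 1)

    print(np.exp(log_ev_2 - log_ev_1))   # Bayes factor for M_2 over M_1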
8.6.4 Further Reading
▶ Alternative model selection criteria (see the sketch after this list):
▶ Akaike Information Criterion (AIC), with M the number of parameters:
log p(x|θ) − M    (32)
▶ Bayesian Information Criterion (BIC), with N the number of data points:
log p(x|θ) − (1/2) M log N    (33)
▶ Bayesian non-parametric models
▶ Automatic Occam’s razor in Bayesian inference
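▶ A sketch of both criteria in the "higher is better" form of
(32)-(33) (NumPy assumed; the numbers are hypothetical):

    import numpy as np

    def aic_bic(log_lik, M, N):
        # M = number of parameters, N = number of data points.
        return log_lik - M, log_lik - 0.5 * M * np.log(N)

    # Hypothetical fit: maximized log-likelihood -120.0, M = 4, N = 100.
    print(aic_bic(-120.0, M=4, N=100))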
Key Questions and Answers
▶ Q1: What are the three main components of a machine learning system?
A1: Data (represented as vectors), Models (as functions or probability
distributions), and Learning (process of parameter estimation).
▶ Q2: How does empirical risk minimization differ from maximum
likelihood estimation?
A2: Empirical risk minimization minimizes an arbitrary loss function on
training data, while maximum likelihood estimation maximizes the
probability of observing the data under a probabilistic model.
▶ Q3: What is the difference between MAP estimation and Bayesian
inference?
A3: MAP estimation finds a single best parameter value that maximizes
the posterior probability, while Bayesian inference computes the full
posterior distribution over parameters.
▶ Q4: What is the purpose of regularization in machine learning?
A4: Regularization discourages complex solutions by adding a penalty
term to the objective function, helping prevent overfitting and improve
generalization to unseen data.
▶ Q5: How does Bayesian model selection embody Occam’s razor?
A5: The marginal likelihood automatically balances model complexity
and data fit, favoring simpler models that explain the data well over
unnecessarily complex ones.