Chapter 8: When Models Meet Data
Mathematics for Machine Learning
SNU 2025 Spring Introduction to Data Science
Chapter Overview
▶ This chapter bridges mathematical foundations with practical
machine learning
▶ Introduces the four pillars of machine learning:
▶ Regression (Chapter 9)
▶ Dimensionality reduction (Chapter 10)
▶ Density estimation (Chapter 11)
▶ Classification (Chapter 12)
▶ Covers essential concepts:
▶ Data representation
▶ Models (as functions and probability distributions)
▶ Learning methods (parameter estimation)
▶ Model selection and evaluation
8.1 Data, Models, and Learning
▶ Core Question: "What do we mean by good models?"
▶ Good models must perform well on unseen data
▶ Requires:
▶ Defining performance metrics
▶ Developing methods to optimize these metrics
▶ Three major components of a machine learning system:
▶ Data: Represented numerically (examples, features)
▶ Models: Functions or probability distributions
▶ Learning: Process of parameter estimation
8.1.1 Data as Vectors
▶ Data represented in tabular form:
▶ Rows: Examples/instances (n = 1, . . . , N)
▶ Columns: Features/attributes/covariates (d = 1, . . . , D)
▶ Each example x_n is a D-dimensional vector of real numbers
▶ For supervised learning: example-label pairs
{(x_1, y_1), . . . , (x_N, y_N)}
▶ Matrix representation: X ∈ R^{N×D}
▶ Proper data preparation is crucial (see the sketch below):
▶ Converting categorical variables to numerical representations
▶ Normalizing numerical features (scaling, shifting)
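▶ A minimal sketch of both preparation steps (assuming NumPy; the
data values are hypothetical):

    import numpy as np

    # Hypothetical toy data: one categorical and one numeric feature.
    color = np.array(["red", "green", "red", "blue"])   # categorical
    height = np.array([1.7, 1.6, 1.8, 1.5])             # numeric

    # Categorical -> one-hot vectors (one column per category).
    categories = np.unique(color)
    one_hot = (color[:, None] == categories[None, :]).astype(float)

    # Numeric -> zero mean, unit variance (shifting and scaling).
    height_std = (height - height.mean()) / height.std()

    # Design matrix X in R^{N x D}: rows are examples, columns are features.
    X = np.column_stack([one_hot, height_std])
    print(X.shape)   # (4, 4)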
8.1.1 Data as Vectors (cont.)
▶ Linear algebra allows comparison of examples via:
▶ Similarity measures (similar features → similar labels)
▶ Distance metrics (geometric interpretation)
▶ Feature representations:
▶ Low-dimensional approximations: PCA (Chapter 10)
▶ High-dimensional representations: Feature maps ϕ(x)
▶ Feature maps create non-linear combinations of the original
features (see the sketch below)
▶ Feature maps can lead to kernels (Chapter 12)
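▶ A sketch of a polynomial feature map ϕ(x) for scalar inputs
(illustrative only; NumPy assumed):

    import numpy as np

    def phi(x, degree=3):
        # Maps scalar x to (1, x, x^2, ..., x^degree); a model that is
        # linear in phi(x) is a non-linear function of x.
        return np.array([x**d for d in range(degree + 1)])

    print(phi(2.0))   # [1. 2. 4. 8.]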
8.1.2 Models as Functions
▶ A predictor maps inputs to outputs (prediction function)
▶ For scalar outputs:
f : R^D → R    (1)
▶ The input vector x is D-dimensional; the output f(x) is a real number
▶ We focus on linear functions:
f(x) = θ^⊤ x + θ_0    (2)
▶ Linear functions balance expressivity with mathematical
simplicity
8.1.3 Models as Probability Distributions
▶ Probabilistic perspective accounts for uncertainty
▶ Sources of uncertainty:
▶ Noisy observations (data uncertainty)
▶ Model uncertainty (confidence in predictions)
▶ Instead of a single function, use distributions over possible
functions
▶ Focus on distributions with finite-dimensional parameters
▶ Uncertainty in predictions visualized as distributions (e.g.,
Gaussian)
8.1.4 Learning is Finding Parameters
▶ Learning goal: Find model parameters that perform well on
unseen data
▶ Three algorithmic phases:
1. Prediction/Inference: Using trained model on new data
2. Training/Parameter Estimation: Adjusting parameters using
training data
3. Hyperparameter Tuning/Model Selection: Choosing model
structure
▶ Two main parameter estimation strategies:
▶ Finding point estimates (empirical risk minimization)
▶ Bayesian inference (distribution over parameters)
8.1.4 Learning is Finding Parameters (cont.)
▶ Parameter estimation principles:
▶ Empirical risk minimization: Direct optimization problem
(Section 8.2)
▶ Maximum likelihood: Statistical perspective (Section 8.3)
▶ Generalization challenge:
▶ Need to balance fitting the training data vs. finding "simple"
explanations
▶ Tools: Regularization, priors, cross-validation
▶ Remark: Parameters vs. hyperparameters
▶ Parameters: Directly optimized during training
▶ Hyperparameters: Control model structure or training process
8.2 Empirical Risk Minimization
▶ Core idea: Learn by minimizing average loss on training data
▶ Framework components:
1. Hypothesis class: Set of possible prediction functions
2. Loss function: Measure of prediction error
3. Regularization: Controls model complexity
4. Evaluation procedure: Assessing generalization
▶ Originally popularized through Support Vector Machines
▶ Provides ”probability-free” approach to learning
8.2.1 Hypothesis Class of Functions
▶ Given N examples x_n ∈ R^D with labels y_n ∈ R
▶ Goal: Estimate a predictor f(·, θ) : R^D → R
▶ Seek parameters θ* such that f(x_n, θ*) ≈ y_n for all n
▶ Common hypothesis class: Affine functions
f(x_n, θ) = θ_0 + Σ_{d=1}^{D} θ_d x_n^{(d)} = θ^⊤ x_n    (3)
(the compact form θ^⊤ x_n absorbs θ_0 into θ by prepending a constant
feature 1 to x_n)
▶ More complex hypothesis classes include non-linear functions,
neural networks
8.2.2 Loss Function for Training
▶ The loss function ℓ(y_n, ŷ_n) measures prediction error
▶ Input: true label y_n and prediction ŷ_n = f(x_n, θ)
▶ Output: a non-negative real number (higher = worse)
▶ Assume training examples are independent and identically
distributed (i.i.d.)
▶ Empirical risk: Average loss over the training data
R_emp(f, X, y) = (1/N) Σ_{n=1}^{N} ℓ(y_n, ŷ_n)    (4)
8.2.2 Loss Function for Training (Example: Least-Squares)
▶ Squared loss: ℓ(y_n, ŷ_n) = (y_n − ŷ_n)²
▶ Empirical risk with squared loss:
min_{θ∈R^D} (1/N) Σ_{n=1}^{N} (y_n − f(x_n, θ))²    (5)
▶ With the linear predictor f(x_n, θ) = θ^⊤ x_n:
min_{θ∈R^D} (1/N) Σ_{n=1}^{N} (y_n − θ^⊤ x_n)² = min_{θ∈R^D} (1/N) ‖y − Xθ‖²    (6)
▶ This is the well-known least-squares problem
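▶ A sketch of solving (6) numerically on synthetic data (NumPy
assumed; names are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    N, D = 100, 3
    X = rng.normal(size=(N, D))                      # design matrix
    theta_true = np.array([1.0, -2.0, 0.5])
    y = X @ theta_true + 0.1 * rng.normal(size=N)    # noisy labels

    # lstsq minimizes ||y - X theta||^2, i.e. the objective in (6).
    theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

    # Empirical risk (4) with squared loss at the estimate.
    print(theta_hat, np.mean((y - X @ theta_hat) ** 2))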
8.2.2 Loss Function for Training (cont.)
▶ We ultimately care about the expected risk on unseen data:
R_true(f) = E_{x,y}[ℓ(y, f(x))]    (7)
▶ Two practical challenges:
1. How to modify training procedure to generalize well?
2. How to estimate expected risk from finite data?
▶ Remark: In practice, loss functions might not directly match
the performance metric of interest
8.2.3 Regularization to Reduce Overfitting
▶ Overfitting: Model fits training data too closely, fails to
generalize
▶ Symptoms: Very small training error but large test error
▶ Causes:
▶ Complex hypothesis class with many parameters
▶ Limited training data
▶ Solution: Regularization, which adds a penalty term to discourage
overly complex solutions
8.2.3 Regularization to Reduce Overfitting (Example)
▶ Standard least-squares:
min_θ (1/N) ‖y − Xθ‖²    (8)
▶ Regularized (Tikhonov) version:
min_θ (1/N) ‖y − Xθ‖² + λ‖θ‖²    (9)
▶ Regularization parameter λ:
▶ Controls trade-off between data fit and parameter magnitude
▶ Higher λ = stronger preference for smaller parameters
▶ Equivalent to placing a prior distribution on the parameters in the
probabilistic view (see the sketch below)
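▶ Setting the gradient of (9) to zero gives the closed form
θ = (X^⊤X + NλI)^{−1} X^⊤ y; a sketch under the same synthetic setup
as before (NumPy assumed):

    import numpy as np

    rng = np.random.default_rng(0)
    N, D = 100, 3
    X = rng.normal(size=(N, D))
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=N)

    lam = 0.1   # regularization parameter lambda
    # Normal equations of (9): (X^T X + N*lam*I) theta = X^T y.
    theta_ridge = np.linalg.solve(X.T @ X + N * lam * np.eye(D), X.T @ y)
    print(theta_ridge)   # shrunk towards zero relative to least squares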
8.2.4 Cross-Validation to Assess Generalization
▶ Need to estimate generalization performance from finite data
▶ Challenge: Want both large training set and reliable
performance estimate
▶ K -fold cross-validation:
▶ Divide data into K equal parts
▶ Train on K − 1 parts, validate on remaining part
▶ Repeat for all K possible validation sets
▶ Average performance across all trials
8.2.4 Cross-Validation to Assess Generalization (cont.)
▶ For each partition k:
▶ The training data R^(k) produce the predictor f^(k)
▶ The validation set V^(k) is used to compute the empirical risk
R(f^(k), V^(k))
▶ The expected generalization error is approximated as:
E_V[R(f, V)] ≈ (1/K) Σ_{k=1}^{K} R(f^(k), V^(k))    (10)
▶ Cross-validation is "embarrassingly parallel" (the K folds can be
computed independently in parallel); a sketch follows
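▶ A minimal K-fold cross-validation sketch (NumPy assumed; the
least-squares fit stands in for any learner):

    import numpy as np

    def fit(X, y):
        theta, *_ = np.linalg.lstsq(X, y, rcond=None)
        return theta

    def kfold_risk(X, y, K=5, seed=0):
        # Approximates E_V[R(f, V)] as in (10).
        idx = np.random.default_rng(seed).permutation(len(y))
        risks = []
        for fold in np.array_split(idx, K):
            train = np.setdiff1d(idx, fold)        # training part R^(k)
            theta = fit(X[train], y[train])        # predictor f^(k)
            risks.append(np.mean((y[fold] - X[fold] @ theta) ** 2))
        return np.mean(risks)                      # average of R(f^(k), V^(k))

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
    print(kfold_risk(X, y))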
8.2.5 Further Reading
▶ Statistical learning theory (Vapnik)
▶ Regularization:
▶ Tikhonov vs. Ivanov regularization
▶ Connection to bias-variance trade-off
▶ Relationship to feature selection
▶ Alternative validation approaches:
▶ Bootstrap
▶ Jackknife
▶ Connection to probability: Agnostic to choice of underlying
distribution
8.3 Parameter Estimation
▶ Statistical approach using probability distributions
▶ Explicitly models:
▶ Uncertainty in data (observation process)
▶ Uncertainty in model parameters
▶ Key concepts:
▶ Likelihood: Analogous to loss function (Section 8.2.2)
▶ Prior: Analogous to regularization (Section 8.2.3)
8.3.1 Maximum Likelihood Estimation
▶ Find parameters that maximize the probability of observing
the data
▶ For data x and probability density p(x|θ) parametrized by θ
▶ Negative log-likelihood:
L_x(θ) = − log p(x|θ)    (11)
▶ Minimizing negative log-likelihood = maximizing likelihood
▶ Likelihood models uncertainty in the data given fixed
parameters
8.3.1 Maximum Likelihood Estimation (cont.)
▶ For supervised learning with example-label pairs {(x_n, y_n)}
▶ Specify the conditional probability p(y_n|x_n, θ)
▶ Assuming i.i.d. data, the likelihood factorizes:
p(Y|X, θ) = Π_{n=1}^{N} p(y_n|x_n, θ)    (12)
▶ Negative log-likelihood:
L(θ) = − log p(Y|X, θ) = − Σ_{n=1}^{N} log p(y_n|x_n, θ)    (13)
8.3.1 Maximum Likelihood Estimation (Example: Gaussian Likelihood)
▶ Assume a Gaussian likelihood with a linear model:
p(y_n|x_n, θ) = N(y_n | x_n^⊤ θ, σ²)    (14)
▶ Negative log-likelihood:
L(θ) = − Σ_{n=1}^{N} log N(y_n | x_n^⊤ θ, σ²)    (15)
     = (1/(2σ²)) Σ_{n=1}^{N} (y_n − x_n^⊤ θ)² + constant    (16)
▶ Minimizing negative log-likelihood = minimizing sum of
squared errors
▶ Shows equivalence to least-squares regression
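▶ A numerical check of this equivalence (NumPy assumed; data are
synthetic): perturbing the least-squares solution can only increase
the Gaussian negative log-likelihood.

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(50, 2))
    y = X @ np.array([2.0, -1.0]) + 0.3 * rng.normal(size=50)
    sigma2 = 0.3**2

    def nll(theta):
        # (15)-(16) up to the additive constant.
        return np.sum((y - X @ theta) ** 2) / (2 * sigma2)

    theta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(nll(theta_ls) <= nll(theta_ls + 0.01))   # True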
8.3.2 Maximum A Posteriori Estimation
▶ Incorporates prior knowledge about parameters
▶ Prior probability distribution on parameters p(θ)
▶ Use Bayes' theorem to compute the posterior distribution:
p(θ|x) = p(x|θ) p(θ) / p(x)    (17)
▶ For optimization, can ignore denominator:
p(θ|x) ∝ p(x|θ)p(θ) (18)
8.3.2 Maximum A Posteriori Estimation (cont.)
▶ Maximum a posteriori (MAP) estimation:
θ_MAP = arg max_θ p(θ|x) = arg max_θ p(x|θ) p(θ)    (19)
▶ Negative log-posterior:
− log p(θ|x) = − log p(x|θ) − log p(θ) + constant (20)
▶ Example: a Gaussian prior with zero mean, p(θ) = N(0, Σ)
▶ Corresponds to ℓ2 regularization in empirical risk minimization (see
the sketch below)
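▶ A sketch of this correspondence (NumPy assumed): with likelihood
N(y_n|x_n^⊤θ, σ²) and prior N(0, b²I), the MAP estimate equals the
ridge solution of (9) with λ = σ²/(N b²).

    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.normal(size=(80, 3))
    y = X @ np.array([1.0, 0.0, -1.0]) + 0.2 * rng.normal(size=80)
    sigma2, b2 = 0.2**2, 1.0    # noise variance, prior variance

    # Minimizing -log p(theta|X, y) =
    #   ||y - X theta||^2 / (2 sigma2) + ||theta||^2 / (2 b2) + const
    # gives (X^T X + (sigma2/b2) I) theta = X^T y.
    theta_map = np.linalg.solve(X.T @ X + (sigma2 / b2) * np.eye(3), X.T @ y)
    print(theta_map)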
8.3.3 Model Fitting
▶ Model fitting = optimizing parameters to minimize loss
function
▶ For a parametrized model class M_θ, we seek parameters that
approximate the true (unknown) model M*
▶ Three scenarios:
1. Overfitting: Model class too rich for dataset
▶ Too many parameters
▶ Fits noise in training data
▶ Poor generalization
2. Underfitting: Model class not rich enough
▶ Too few parameters
▶ Cannot capture structure in data
3. Fitting well: Model class appropriately expressive
8.3.4 Further Reading
▶ Generalized linear models
▶ Linear dependence between parameters and data
▶ Potentially non-linear transformation (link function)
▶ Examples: logistic regression, Poisson regression
▶ Statistical frameworks:
▶ Bayesian vs. frequentist perspectives
▶ Alternative estimation methods (method of moments,
M-estimation)
8.4 Probabilistic Modeling and Inference
▶ Builds models that describe the generative process of observed
data
▶ Full probabilistic treatment of both:
▶ Observed variables (data)
▶ Hidden variables (parameters)
▶ Example (coin flip):
1. Define parameter µ (probability of "heads")
2. Sample outcome x ∈ {head, tail} from p(x|µ) = Ber(µ)
▶ Challenge: Learn about unobservable parameter µ from
observed outcomes
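▶ For the coin example, a Beta prior on µ is conjugate to the
Bernoulli likelihood, so the posterior has a closed form (sketch;
NumPy assumed, outcomes hypothetical):

    import numpy as np

    a, b = 2.0, 2.0                              # Beta(a, b) prior on mu
    flips = np.array([1, 0, 1, 1, 0, 1, 1, 1])   # 1 = heads

    # Conjugacy: posterior is Beta(a + #heads, b + #tails).
    a_post = a + flips.sum()
    b_post = b + len(flips) - flips.sum()
    print(a_post / (a_post + b_post))            # posterior mean of mu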
8.4.1 Probabilistic Models
▶ A probabilistic model is specified by the joint distribution of
all random variables
▶ Central importance of joint distribution p(x, θ):
▶ Encapsulates prior and likelihood (product rule)
▶ Allows computation of marginal likelihood p(x) for model
selection
▶ Enables derivation of posterior via Bayes’ theorem
▶ Benefits: Unified framework for modeling, inference,
prediction, and model selection
8.4.2 Bayesian Inference
▶ Goal: Find posterior distribution over hidden variables
▶ Definition (Bayesian Inference): Process of finding
posterior distribution over parameters
p(θ|X) = p(X|θ) p(θ) / p(X) = p(X|θ) p(θ) / ∫ p(X|θ) p(θ) dθ    (21)
▶ Enables uncertainty propagation to predictions:
p(x) = ∫ p(x|θ) p(θ) dθ = E_θ[p(x|θ)]    (22)
8.4.2 Bayesian Inference (cont.)
▶ Comparison with parameter estimation:
▶ Parameter estimation: Point estimates via optimization
▶ Bayesian inference: Full distributions via integration
▶ Advantages of Bayesian approach:
▶ Principled incorporation of prior knowledge
▶ Uncertainty quantification
▶ Useful for decision-making under uncertainty
▶ Practical challenges: Integration is often intractable
▶ Requires approximation methods (MCMC, variational
inference, etc.)
8.4.3 Latent-Variable Models
▶ Include additional latent variables z besides model parameters
θ
▶ Latent variables:
▶ Describe data generation process
▶ Simplify model structure
▶ Enhance interpretability
▶ Often reduce number of parameters
▶ Generative process: p(x|z, θ) allows data generation given
latents and parameters
8.4.3 Latent-Variable Models (cont.)
▶ The likelihood requires marginalizing out the latent variables:
p(x|θ) = ∫ p(x|z, θ) p(z) dz    (23)
▶ Parameter posterior (latents integrated out):
p(θ|X) = p(X|θ) p(θ) / p(X)    (24)
▶ Latent-variable posterior:
p(z|X, θ) = p(X|z, θ) p(z) / p(X|θ)    (25)
▶ Examples: PCA, Gaussian mixture models, hidden Markov
models
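▶ A sketch of (23) when z is discrete (a two-component Gaussian
mixture; NumPy assumed, parameter values illustrative): the integral
becomes a sum over components.

    import numpy as np

    def gauss(x, mu, var):
        return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

    pi = np.array([0.3, 0.7])     # p(z): mixture weights
    mu = np.array([-2.0, 1.0])    # component means
    var = np.array([0.5, 1.0])    # component variances

    def likelihood(x):
        # p(x|theta) = sum_z p(x|z, theta) p(z): z marginalized out.
        return np.sum(pi * gauss(x, mu, var))

    print(likelihood(0.5))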
8.4.4 Further Reading
▶ Computational methods for Bayesian inference:
▶ Markov Chain Monte Carlo (MCMC)
▶ Variational inference
▶ Expectation propagation
▶ Applications of Bayesian inference:
▶ Topic modeling
▶ Click-through rate prediction
▶ Reinforcement learning
▶ Recommendation systems
▶ Probabilistic programming languages
8.5 Directed Graphical Models
▶ Visual language for specifying probabilistic models
▶ Definition (Directed Graphical Model/Bayesian
Network): Graph where:
▶ Nodes represent random variables
▶ Directed edges represent conditional dependencies
▶ Benefits:
▶ Visualizes structure of probabilistic model
▶ Provides insights into conditional independence properties
▶ Simplifies complex computations for inference and learning
8.5.1 Graph Semantics
▶ Directed links indicate conditional probabilities
▶ From factorized joint distribution to graphical model:
1. Create node for each random variable
2. Add directed link from conditioning variables to conditioned
variable
▶ From a graphical model to the joint distribution:
p(x) = p(x_1, . . . , x_K) = Π_{k=1}^{K} p(x_k | Pa_k)    (26)
where Pa_k denotes the parent nodes of x_k
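▶ A sketch of (26) for three binary variables with the chain
structure a → b → c (hypothetical probability tables):

    # Joint p(a, b, c) = p(a) p(b|a) p(c|b), read off the graph.
    p_a = {0: 0.6, 1: 0.4}
    p_b_given_a = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}
    p_c_given_b = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}

    def joint(a, b, c):
        # One conditional factor p(x_k | Pa_k) per node.
        return p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]

    # Sanity check: the factors define a valid distribution.
    print(sum(joint(a, b, c)
              for a in (0, 1) for b in (0, 1) for c in (0, 1)))   # 1.0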
8.5.1 Graph Semantics (cont.)
▶ Notation conventions:
▶ Observed variables: Shaded nodes
▶ Latent variables: Unshaded nodes
▶ Deterministic parameters: No circle
▶ Plate notation: Box indicating repeated structure
▶ Used for i.i.d. observations
▶ Makes diagrams more compact
8.5.2 Conditional Independence and d-Separation
▶ Graphical models reveal conditional independence properties
▶ Definition (d-separation): For sets of nodes A, B, C, a path
from A to B is blocked if:
▶ Arrows meet head-to-tail or tail-to-tail at node in C, OR
▶ Arrows meet head-to-head at node where neither the node nor
its descendants are in C
▶ If all paths from A to B are blocked by C, then A is
d-separated from B given C
▶ d-separation implies conditional independence: A ⊥⊥ B | C
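▶ In the chain a → b → c from the previous sketch, node b blocks the
only path head-to-tail, so a ⊥⊥ c | b; a brute-force numeric check
(tables repeated for self-containment):

    p_a = {0: 0.6, 1: 0.4}
    p_b_given_a = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}
    p_c_given_b = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}
    joint = lambda a, b, c: p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]

    def prob(pred):
        # Sum the joint over all assignments satisfying the predicate.
        return sum(joint(a, b, c) for a in (0, 1) for b in (0, 1)
                   for c in (0, 1) if pred(a, b, c))

    pb = prob(lambda a, b, c: b == 1)
    lhs = prob(lambda a, b, c: (a, b, c) == (1, 1, 1)) / pb
    rhs = (prob(lambda a, b, c: (a, b) == (1, 1)) / pb) \
        * (prob(lambda a, b, c: (b, c) == (1, 1)) / pb)
    print(abs(lhs - rhs) < 1e-12)   # True: p(a, c|b) = p(a|b) p(c|b)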
8.5.3 Further Reading
▶ Types of probabilistic graphical models:
▶ Directed graphical models (Bayesian networks)
▶ Undirected graphical models (Markov random fields)
▶ Factor graphs
▶ Applications:
▶ Computer vision (image segmentation, denoising)
▶ Ranking in online games
▶ Coding theory
▶ Signal processing
▶ Related areas:
▶ Structured prediction
▶ Causal inference
8.6 Model Selection
▶ Challenge: High-level modeling decisions significantly affect
performance
▶ Examples of choices:
▶ Polynomial degree in regression
▶ Number of components in mixture model
▶ Network architecture in neural networks
▶ Kernel type in SVM
▶ Trade-off: Model complexity vs. generalization
▶ More complex models:
▶ More expressive/flexible
▶ Higher risk of overfitting
8.6.1 Nested Cross-Validation
▶ Extension of standard cross-validation (Section 8.2.4)
▶ Two levels of K-fold cross-validation:
▶ Inner loop: Evaluates different model choices on validation set
▶ Outer loop: Estimates generalization performance of best
model from inner loop
▶ The inner loop approximates:
E_V[R(V|M)] ≈ (1/K) Σ_{k=1}^{K} R(V^(k)|M)    (27)
▶ Provides both expected error and error uncertainty (standard
error)
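▶ A compact nested cross-validation sketch (NumPy assumed; the ridge
strength λ plays the role of the model choice M):

    import numpy as np

    def ridge_fit(X, y, lam):
        return np.linalg.solve(X.T @ X + len(y) * lam * np.eye(X.shape[1]),
                               X.T @ y)

    def mse(theta, X, y):
        return np.mean((y - X @ theta) ** 2)

    def nested_cv(X, y, lams=(0.001, 0.01, 0.1, 1.0), K=5, seed=0):
        idx = np.random.default_rng(seed).permutation(len(y))
        outer = []
        for test in np.array_split(idx, K):              # outer loop
            rest = np.setdiff1d(idx, test)
            inner = {lam: [] for lam in lams}
            for val in np.array_split(rest, K):          # inner loop, eq. (27)
                train = np.setdiff1d(rest, val)
                for lam in lams:
                    inner[lam].append(
                        mse(ridge_fit(X[train], y[train], lam), X[val], y[val]))
            best = min(lams, key=lambda lam: np.mean(inner[lam]))
            outer.append(mse(ridge_fit(X[rest], y[rest], best),
                             X[test], y[test]))
        return np.mean(outer), np.std(outer)   # expected error and its spread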
8.6.2 Bayesian Model Selection
▶ Based on Occam’s razor: Prefer simplest model that explains
data well
▶ Bayesian probability automatically embodies this principle
▶ Treat model selection as hierarchical inference problem:
1. Place a prior p(M) over the model set M = {M_1, . . . , M_K}
2. For each model, define parameter distribution p(θ|M)
3. Generate data according to p(D|θ)
▶ Posterior distribution over models:
p(M_k|D) ∝ p(M_k) p(D|M_k)    (28)
8.6.2 Bayesian Model Selection (cont.)
▶ Model evidence/marginal likelihood:
p(D|M_k) = ∫ p(D|θ_k) p(θ_k|M_k) dθ_k    (29)
▶ MAP estimate of the model (with a uniform prior):
M* = arg max_{M_k} p(M_k|D) = arg max_{M_k} p(D|M_k)    (30)
▶ Key difference from maximum likelihood:
▶ Parameters integrated out (no overfitting)
▶ Automatically balances complexity vs. data fit
8.6.3 Bayes Factors for Model Comparison
▶ Compare two models M_1 and M_2 given data D
▶ Posterior odds:
p(M_1|D) / p(M_2|D) = [p(M_1) / p(M_2)] · [p(D|M_1) / p(D|M_2)]    (31)
▶ Prior odds p(M_1)/p(M_2): prior beliefs favoring M_1 over M_2
▶ Bayes factor p(D|M_1)/p(D|M_2): how well the data are predicted by
M_1 vs. M_2
▶ With a uniform prior: posterior odds = Bayes factor
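▶ A sketch comparing two coin models by Bayes factor (SciPy assumed
for the log-Beta function; data hypothetical): M_1 = fair coin
(µ = 0.5), M_2 = unknown bias with a uniform Beta(1, 1) prior.

    import numpy as np
    from scipy.special import betaln

    flips = np.array([1, 1, 1, 0, 1, 1, 1, 1, 0, 1])   # 8 heads, 2 tails
    h, t = flips.sum(), len(flips) - flips.sum()

    log_ev_1 = (h + t) * np.log(0.5)                   # M_1: no free parameters
    # M_2: marginal likelihood B(1 + h, 1 + t) / B(1, 1), parameters
    # integrated out as in (29).
    log_ev_2 = betaln(1 + h, 1 + t) - betaln(1, 1)

    print(np.exp(log_ev_2 - log_ev_1))   # Bayes factor for M_2 over M_1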
8.6.4 Further Reading
▶ Alternative model selection criteria (see the sketch after this list):
▶ Akaike Information Criterion (AIC), with M the number of parameters:
log p(x|θ) − M    (32)
▶ Bayesian Information Criterion (BIC), with N the number of data points:
log p(x|θ) − (1/2) M log N    (33)
▶ Bayesian non-parametric models
▶ Automatic Occam’s razor in Bayesian inference
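▶ A sketch of both criteria in the "higher is better" form of
(32)-(33) (NumPy assumed; the numbers are hypothetical):

    import numpy as np

    def aic_bic(log_lik, M, N):
        # M = number of parameters, N = number of data points.
        return log_lik - M, log_lik - 0.5 * M * np.log(N)

    # Hypothetical fit: maximized log-likelihood -120.0, M = 4, N = 100.
    print(aic_bic(-120.0, M=4, N=100))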
Key Questions and Answers
▶ Q1: What are the three main components of a machine learning system?
A1: Data (represented as vectors), Models (as functions or probability
distributions), and Learning (process of parameter estimation).
▶ Q2: How does empirical risk minimization differ from maximum
likelihood estimation?
A2: Empirical risk minimization minimizes an arbitrary loss function on
training data, while maximum likelihood estimation maximizes the
probability of observing the data under a probabilistic model.
▶ Q3: What is the difference between MAP estimation and Bayesian
inference?
A3: MAP estimation finds a single best parameter value that maximizes
the posterior probability, while Bayesian inference computes the full
posterior distribution over parameters.
▶ Q4: What is the purpose of regularization in machine learning?
A4: Regularization discourages complex solutions by adding a penalty
term to the objective function, helping prevent overfitting and improve
generalization to unseen data.
▶ Q5: How does Bayesian model selection embody Occam’s razor?
A5: The marginal likelihood automatically balances model complexity
and data fit, favoring simpler models that explain the data well over
unnecessarily complex ones.