CS229 Machine Learning: Supervised Learning
Lecture Notes
Andrew Ng (CS229, Autumn 2018)
May 30, 2025
Contents
1 Recap of Linear Regression
  1.1 Overview
  1.2 Key Components
  1.3 Limitations
  1.4 Example
  1.5 Insight
2 Locally Weighted Regression (LWR)
  2.1 Purpose
  2.2 Mechanism
  2.3 Bandwidth Parameter (𝜏)
  2.4 Parametric vs. Non-Parametric
  2.5 Challenges
  2.6 Applications
  2.7 Extensions
  2.8 Practice
  2.9 Example
  2.10 Insight
3 Probabilistic Interpretation of Linear Regression
  3.1 Objective
  3.2 Model Assumptions
  3.3 Likelihood
  3.4 Key Insights
  3.5 Conclusion
  3.6 Insight
4 Logistic Regression
  4.1 Overview
  4.2 Why Not Linear Regression?
  4.3 Hypothesis
  4.4 Probabilistic Model
  4.5 Likelihood
  4.6 Optimization
  4.7 Why Sigmoid?
  4.8 Practical Notes
  4.9 Insight
5 Newton’s Method
  5.1 Purpose
  5.2 Mechanism
  5.3 Quadratic Convergence
  5.4 Trade-offs
  5.5 Guidelines
  5.6 Context
  5.7 Insight
6 Future Topics
7 Practical Insights
1 Recap of Linear Regression
1.1 Overview
Linear regression predicts continuous outputs using a linear combination of features and is foundational to supervised learning. Previously covered: problem setup, gradient descent, and the normal equations.
1.2 Key Components
Notation:
• x^{(i)}: feature vector for the i-th example ((n+1)-dimensional, with x_0^{(i)} = 1).
• y^{(i)}: continuous output (a real number).
• m: number of training examples.
• n: number of features.
Hypothesis:
h_\theta(x) = \theta^T x = \theta_0 + \theta_1 x_1 + \cdots + \theta_n x_n
Cost Function:
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2
Measures average squared error; the factor 1/2 simplifies the derivative.
Optimization:
Gradient Descent:
\theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
Iterative, suitable for large datasets.
Normal Equations:
\theta = (X^T X)^{-1} X^T y
Closed-form, but computationally expensive (O(n^3)) for large n.
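As a concrete illustration, here is a minimal sketch (not from the lecture) that fits the same model with batch gradient descent and with the normal equations on synthetic data; the dataset, learning rate, and iteration count are arbitrary choices for the example.

```python
# A minimal sketch comparing batch gradient descent with the normal equations
# on synthetic data; alpha and the iteration count are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 2
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])  # prepend x_0 = 1
true_theta = np.array([4.0, 3.0, -2.0])
y = X @ true_theta + rng.normal(scale=0.5, size=m)

# Batch gradient descent on J(theta) = (1/2m) * sum_i (h(x^{(i)}) - y^{(i)})^2
theta = np.zeros(n + 1)
alpha = 0.1
for _ in range(2000):
    grad = (X.T @ (X @ theta - y)) / m   # gradient of J(theta)
    theta -= alpha * grad

# Normal equations: theta = (X^T X)^{-1} X^T y (lstsq is the numerically stable route)
theta_closed, *_ = np.linalg.lstsq(X, y, rcond=None)

print(theta, theta_closed)  # both should be close to true_theta
```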
1.3 Limitations
• Assumes linear relationships; struggles with non-linear data.
• Requires feature engineering (e.g., adding an x^2 term) for non-linearity.
1.4 Example
Housing price prediction: Features (size, bedrooms) fit a line, but non-linear trends (e.g., diminishing returns for large houses) need advanced methods.
1.5 Insight
Linear regression is a baseline for regression tasks, widely used in economics and forecasting.
Visualization Placeholder: A scatter plot of house sizes vs. prices with a fitted line would
illustrate the linear fit and highlight non-linear deviations.
2 Locally Weighted Regression (LWR)
2.1 Purpose
Addresses non-linearity by fitting linear models locally around each prediction point, avoiding manual feature engineering (e.g., adding x^2 or \sqrt{x} terms).
2.2 Mechanism
For a prediction at point 𝑥:
• Assign weights to training examples:
w_i = \exp\left( -\frac{(x_i - x)^2}{2\tau^2} \right)
• w_i ≈ 1 if x_i is close to x.
• w_i ≈ 0 if x_i is far from x.
• Minimize the weighted cost:
J(\theta) = \sum_{i=1}^{m} w_i \left( h_\theta(x_i) - y_i \right)^2
• Solve using weighted least squares:
\theta = (X^T W X)^{-1} X^T W y
W: diagonal matrix with W_{ii} = w_i.
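A minimal sketch of a single LWR prediction under the Gaussian weighting above; the toy dataset, the choice τ = 0.5, and the helper name lwr_predict are illustrative, not from the lecture.

```python
# A minimal LWR sketch: Gaussian weights around the query point, then one
# weighted least-squares fit per prediction. Data and tau are illustrative.
import numpy as np

def lwr_predict(x_query, X, y, tau=0.5):
    """Fit a weighted least-squares line around x_query and return its prediction.

    X is m x (n+1) with a leading column of ones; x_query is one (n+1)-vector.
    """
    # Gaussian weights: w_i = exp(-||x_i - x_query||^2 / (2 tau^2))
    diffs = X - x_query
    w = np.exp(-np.sum(diffs**2, axis=1) / (2 * tau**2))
    W = np.diag(w)
    # theta = (X^T W X)^{-1} X^T W y, solved without forming an explicit inverse
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return x_query @ theta

# Toy non-linear data: y = sin(x) + noise
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 6, size=200))
X = np.column_stack([np.ones_like(x), x])
y = np.sin(x) + rng.normal(scale=0.1, size=x.shape)

print(lwr_predict(np.array([1.0, 3.0]), X, y, tau=0.5))  # prediction near sin(3.0)
```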
2.3 Bandwidth Parameter (𝜏)
Controls the “neighborhood” size:
• Small 𝜏: Narrow focus, risks overfitting (jagged fit).
• Large 𝜏: Broad focus, risks underfitting (smooth, linear-like fit).
Tuning: Use cross-validation to select optimal 𝜏.
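To illustrate the tuning step, the following sketch continues the lwr_predict example from Section 2.2 (it reuses that helper and its X, y arrays) and scores candidate τ values by leave-one-out cross-validation; the candidate grid is arbitrary.

```python
# Continues the Section 2.2 sketch: assumes lwr_predict, X, and y are defined there.
import numpy as np

def loocv_error(X, y, tau, predict):
    """Mean squared leave-one-out error of an LWR predictor for a given tau."""
    errors = []
    for i in range(len(y)):
        mask = np.arange(len(y)) != i                 # hold out example i
        pred = predict(X[i], X[mask], y[mask], tau=tau)
        errors.append((pred - y[i]) ** 2)
    return np.mean(errors)

# Candidate bandwidths (illustrative); pick the one with the lowest held-out error.
taus = [0.1, 0.3, 1.0, 3.0]
best_tau = min(taus, key=lambda t: loocv_error(X, y, t, lwr_predict))
print(best_tau)
```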
2.4 Parametric vs. Non-Parametric
• Parametric: Fixed parameters (e.g., linear regression’s 𝜃).
• Non-Parametric: Parameters scale with data (LWR stores all data, memory grows with
𝑚).
2.5 Challenges
• Computationally intensive: Requires solving a new linear system per prediction.
• Poor extrapolation outside the training data range.
2.6 Applications
• Low-dimensional data (𝑛 ≤ 3) with sufficient examples.
• Time-series forecasting, robotics (e.g., path planning).
2.7 Extensions
• Alternative kernels: Triangular, Epanechnikov for different weighting schemes.
• Scalability: KD-trees to efficiently find nearby points.
2.8 Practice
CS229 Problem Set 1: Implement LWR, experiment with 𝜏.
2.9 Example
Non-linear housing prices: LWR captures curves (e.g., price plateaus for large houses) without
explicit non-linear features.
2.10 Insight
LWR is intuitive but less common in high-dimensional settings due to computational cost.
Visualization Placeholder: A plot showing a non-linear dataset with LWR’s local linear fits at
different points would clarify the method.
3 Probabilistic Interpretation of Linear Regression
3.1 Objective
Justify the squared error cost using a probabilistic framework, connecting to maximum likelihood estimation (MLE).
3.2 Model Assumptions
True output:
y_i = \theta^T x_i + \epsilon_i
\epsilon_i: error (unmodeled effects + noise).
Error distribution:
\epsilon_i \sim \mathcal{N}(0, \sigma^2)
Gaussian, mean 0, variance \sigma^2, independent and identically distributed (IID).
Density:
p(\epsilon_i) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{\epsilon_i^2}{2\sigma^2} \right)
Implies:
y_i \mid x_i; \theta \sim \mathcal{N}(\theta^T x_i, \sigma^2)
Density:
p(y_i \mid x_i; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(y_i - \theta^T x_i)^2}{2\sigma^2} \right)
3.3 Likelihood
Likelihood:
L(\theta) = \prod_{i=1}^{m} p(y_i \mid x_i; \theta)
Log-likelihood:
\ell(\theta) = -\frac{m}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{m} (y_i - \theta^T x_i)^2
Maximizing \ell(\theta) is equivalent to minimizing
\sum_{i=1}^{m} (y_i - \theta^T x_i)^2
which matches linear regression’s least squares objective.
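As a quick numerical check of this equivalence (not from the notes), the sketch below maximizes the Gaussian log-likelihood in θ directly and compares the result with the least-squares solution; the synthetic data, the assumed σ, and the use of scipy.optimize.minimize are illustrative choices.

```python
# Numerical check: the MLE under Gaussian IID errors matches the least-squares fit.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
m = 200
X = np.column_stack([np.ones(m), rng.normal(size=m)])
theta_true = np.array([1.0, 2.0])
y = X @ theta_true + rng.normal(scale=0.3, size=m)

sigma = 0.3  # treated as known; it only rescales/offsets l(theta), not its argmax

def neg_log_likelihood(theta):
    residuals = y - X @ theta
    return 0.5 * m * np.log(2 * np.pi * sigma**2) + np.sum(residuals**2) / (2 * sigma**2)

theta_mle = minimize(neg_log_likelihood, x0=np.zeros(2)).x
theta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta_mle, theta_ls)  # the two estimates agree up to optimizer tolerance
```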
3.4 Key Insights
• Gaussian Justification: Central Limit Theorem suggests errors from many small sources
are approximately Gaussian.
• IID Assumption: Simplifies math but may not hold (e.g., correlated housing prices).
• Likelihood vs. Probability: Likelihood treats data as fixed, 𝜃 as variable; probability
fixes 𝜃.
• Notation: Semicolon (;) denotes 𝜃 as a parameter (frequentist convention).
3.5 Conclusion
• Least squares is the MLE under Gaussian, IID errors.
• Non-Gaussian errors (e.g., Poisson) require generalized linear models (GLMs).
3.6 Insight
This probabilistic view unifies regression with other models, setting the stage for logistic regression and GLMs.
Visualization Placeholder: A plot of a Gaussian error distribution around a linear fit would
illustrate the error model.
4 Logistic Regression
4.1 Overview
Used for binary classification (𝑦 ∈ {0, 1}), e.g., tumor malignancy (1 = malignant, 0 = benign).
4.2 Why Not Linear Regression?
• Outputs unbounded values, not probabilities in [0, 1].
• Sensitive to outliers, distorting decision boundaries.
• Non-binary outputs are unnatural for classification.
4.3 Hypothesis
Sigmoid function:
g(z) = \frac{1}{1 + e^{-z}}
Maps z \in \mathbb{R} to (0, 1).
Hypothesis:
h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}
Represents p(y = 1 \mid x; \theta).
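A minimal implementation sketch of the sigmoid; the piecewise form is a standard practical detail (not covered in the lecture) that avoids overflow in exp for large negative inputs.

```python
# A minimal sigmoid sketch; the piecewise form keeps exp() arguments non-positive.
import numpy as np

def sigmoid(z):
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    expz = np.exp(z[~pos])            # safe: z < 0 here, so exp(z) <= 1
    out[~pos] = expz / (1.0 + expz)
    return out

print(sigmoid(np.array([-10.0, 0.0, 10.0])))  # ~[4.5e-05, 0.5, 0.99995]
```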
4.4 Probabilistic Model
Assumptions:
p(y = 1 \mid x; \theta) = h_\theta(x)
p(y = 0 \mid x; \theta) = 1 - h_\theta(x)
Combined:
p(y \mid x; \theta) = \left( h_\theta(x) \right)^{y} \left( 1 - h_\theta(x) \right)^{1-y}
4.5 Likelihood
Likelihood:
L(\theta) = \prod_{i=1}^{m} \left( h_\theta(x^{(i)}) \right)^{y^{(i)}} \left( 1 - h_\theta(x^{(i)}) \right)^{1 - y^{(i)}}
Log-likelihood:
\ell(\theta) = \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left( 1 - h_\theta(x^{(i)}) \right) \right]
4.6 Optimization
Batch Gradient Ascent:
Maximize \ell(\theta):
\theta_j := \theta_j + \alpha \sum_{i=1}^{m} \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)}
Similar to linear regression’s gradient descent, but uses the sigmoid-based h_\theta.
Properties:
• Concave log-likelihood ensures a global maximum.
• No closed-form solution, requiring iterative methods.
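A minimal batch gradient ascent sketch for this update rule on synthetic data; the learning rate, iteration count, and data-generating parameters are illustrative choices, not from the lecture.

```python
# Batch gradient ascent on the logistic-regression log-likelihood (synthetic data).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(3)
m = 200
X = np.column_stack([np.ones(m), rng.normal(size=(m, 2))])   # x_0 = 1 intercept term
theta_true = np.array([-0.5, 2.0, -1.5])
y = (rng.uniform(size=m) < sigmoid(X @ theta_true)).astype(float)

theta = np.zeros(X.shape[1])
alpha = 0.01
for _ in range(5000):
    h = sigmoid(X @ theta)
    theta += alpha * (X.T @ (y - h))    # ascent step on the concave log-likelihood

print(theta)  # roughly recovers theta_true, up to sampling noise
```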
4.7 Why Sigmoid?
• Ensures probabilistic outputs in [0, 1].
• Derived from GLMs, guaranteeing concavity.
4.8 Practical Notes
• Decision Boundary: 𝜃 𝑇 𝑥 = 0, where ℎ𝜃 (𝑥) = 0.5.
• Applications: Medical diagnosis, spam detection, credit scoring.
• Extensions: L1/L2 regularization for overfitting, softmax for multiclass classification.
4.9 Insight
Logistic regression is robust and interpretable, serving as a baseline for classification tasks.
Visualization Placeholder: A plot of the sigmoid function and a 2D decision boundary separating two classes would clarify the model.
5 Newton’s Method
5.1 Purpose
Optimizes 𝜃 faster than gradient ascent using second-order information (Hessian).
5.2 Mechanism
Goal: Find 𝜃 where ∇𝑙 (𝜃) = 0.
Scalar Case:
\theta_{t+1} = \theta_t - \frac{\ell'(\theta_t)}{\ell''(\theta_t)}
Vector Case:
\theta_{t+1} = \theta_t - H^{-1} \nabla_\theta \ell(\theta_t)
H: Hessian matrix ((n+1) \times (n+1)) of second derivatives of \ell(\theta).
Process: Uses tangent approximation to find the zero of the derivative.
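A minimal sketch of the vector-case update applied to the logistic-regression log-likelihood; the synthetic data and the ten-iteration budget are illustrative, and the Hessian used is the standard −Xᵀ diag(h(1−h)) X for that model.

```python
# Newton's method for logistic regression: theta := theta - H^{-1} grad l(theta).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_logistic(X, y, num_iters=10):
    """Maximize the logistic log-likelihood with Newton updates."""
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        h = sigmoid(X @ theta)
        grad = X.T @ (y - h)                  # gradient of the log-likelihood
        H = -(X.T * (h * (1 - h))) @ X        # Hessian: -X^T diag(h(1-h)) X
        theta -= np.linalg.solve(H, grad)     # solve H d = grad instead of inverting H
    return theta

# Example usage on synthetic data (illustrative):
rng = np.random.default_rng(4)
m = 200
X = np.column_stack([np.ones(m), rng.normal(size=(m, 2))])
y = (rng.uniform(size=m) < sigmoid(X @ np.array([-0.5, 2.0, -1.5]))).astype(float)
print(newton_logistic(X, y))   # typically converges in about ten iterations
```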
5.3 Quadratic Convergence
• Error reduces quadratically (e.g., from 0.01 to 0.0001).
• Requires fewer iterations (e.g., 10 vs. 100–1000 for gradient ascent).
5.4 Trade-offs
• Advantages: Rapid convergence for low-dimensional 𝜃 (𝑛 ≤ 50).
• Disadvantages: Hessian inversion is costly (O(n^3)) for high-dimensional 𝜃 (e.g., n = 10,000).
5.5 Guidelines
• Use Newton’s method for 𝑛 ≤ 50.
• Use gradient ascent or L-BFGS for large 𝑛.
5.6 Context
• Also known as Newton-Raphson.
• Applications: Small-scale optimization in finance, control systems.
5.7 Insight
Newton’s method is powerful for small problems but impractical for modern ML’s high-dimensional
data.
Visualization Placeholder: A plot comparing convergence paths of gradient ascent vs. New-
ton’s method would highlight quadratic convergence.
6 Future Topics
• Problem Set 1: Implement LWR, experiment with 𝜏.
• Generalized Linear Models (GLMs): Unify linear and logistic regression under a common framework.
• Feature Selection: Automate feature choice to improve model performance.
• Overfitting/Underfitting: Address via 𝜏 tuning, regularization techniques.
Insight: These topics build toward robust, scalable ML systems.
7 Practical Insights
LWR:
• Ideal for non-linear, low-dimensional data.
• Used in time-series, robotics (e.g., trajectory prediction).
• Requires careful 𝜏 tuning to balance fit.
Logistic Regression:
• Robust for binary classification.
• Applications: Medical diagnostics, spam filters, credit risk.
• Use as a baseline before complex models like neural networks.
Newton’s Method:
• Efficient for small 𝑛, but modern ML prefers stochastic gradient descent for scalability.
Learning Tips:
• Visualize LWR fits and logistic decision boundaries to build intuition.
• Study GLMs for a unified perspective on regression and classification.
• Use Python libraries (e.g., scikit-learn) for implementation.
Insight: These methods are foundational, bridging theory and practice in ML engineering.
Visualization Placeholder: A flowchart of supervised learning algorithms (linear regression
→ LWR → logistic regression → GLMs) would clarify their relationships.