CS229 Machine Learning: Supervised Learning

Lecture Notes

Andrew Ng (CS229, Autumn 2018)

May 30, 2025

Contents

1 Recap of Linear Regression
1.1 Overview
1.2 Key Components
1.3 Limitations
1.4 Example
1.5 Insight

2 Locally Weighted Regression (LWR)
2.1 Purpose
2.2 Mechanism
2.3 Bandwidth Parameter (τ)
2.4 Parametric vs. Non-Parametric
2.5 Challenges
2.6 Applications
2.7 Extensions
2.8 Practice
2.9 Example
2.10 Insight

3 Probabilistic Interpretation of Linear Regression
3.1 Objective
3.2 Model Assumptions
3.3 Likelihood
3.4 Key Insights
3.5 Conclusion
3.6 Insight

4 Logistic Regression
4.1 Overview
4.2 Why Not Linear Regression?
4.3 Hypothesis
4.4 Probabilistic Model
4.5 Likelihood
4.6 Optimization
4.7 Why Sigmoid?
4.8 Practical Notes
4.9 Insight

5 Newton's Method
5.1 Purpose
5.2 Mechanism
5.3 Quadratic Convergence
5.4 Trade-offs
5.5 Guidelines
5.6 Context
5.7 Insight

6 Future Topics

7 Practical Insights
1 Recap of Linear Regression
1.1 Overview
Linear regression predicts continuous outputs using a linear combination of features and is foundational to supervised learning. Previously covered: problem setup, gradient descent, and the normal equations.

1.2 Key Components


Notation:
• x^{(i)}: Feature vector for the i-th example ((n+1)-dimensional, with x_0^{(i)} = 1).
• y^{(i)}: Continuous output (real number).
• m: Number of training examples.
• n: Number of features.
Hypothesis:
h_\theta(x) = \theta^T x = \theta_0 + \theta_1 x_1 + \cdots + \theta_n x_n

Cost Function:
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2
Measures the average squared error; the factor of 1/2 simplifies the derivative.
Optimization:
Gradient Descent:
\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
Iterative, suitable for large datasets.
Normal Equations:
\theta = (X^T X)^{-1} X^T y
Closed-form, but computationally expensive (O(n^3)) for large n.
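To make the two optimization routes concrete, here is a minimal NumPy sketch; the function names and the synthetic housing-style data are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.1, iters=2000):
    """Minimize J(theta) with batch gradient descent; X includes the x_0 = 1 column."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        grad = (X @ theta - y) @ X / m          # (1/m) * sum_i (h_theta(x_i) - y_i) * x_i
        theta -= alpha * grad
    return theta

def normal_equation(X, y):
    """Closed-form theta = (X^T X)^{-1} X^T y, via a numerically stable solver."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Tiny usage example on synthetic data (assumed for illustration)
rng = np.random.default_rng(0)
size = rng.uniform(0, 2, 100)                   # e.g., house size in 1000s of sq ft
X = np.c_[np.ones(100), size]                   # prepend x_0 = 1
y = 2.0 + 3.0 * size + rng.normal(0, 0.1, 100)  # price with noise
print(batch_gradient_descent(X, y))             # both results should be close to [2.0, 3.0]
print(normal_equation(X, y))
```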

1.3 Limitations
• Assumes linear relationships and struggles with non-linear data.
• Requires manual feature engineering (e.g., adding x^2) to capture non-linearity.

1.4 Example
Housing price prediction: features (size, bedrooms) fit a line, but non-linear trends (e.g., diminishing returns for large houses) need more advanced methods.

1.5 Insight
Linear regression is a baseline for regression tasks, widely used in economics and forecasting.
Visualization Placeholder: A scatter plot of house sizes vs. prices with a fitted line would
illustrate the linear fit and highlight non-linear deviations.

2 Locally Weighted Regression (LWR)


2.1 Purpose
Addresses non-linearity by fitting linear models locally around each prediction point, avoiding manual feature engineering (e.g., adding x^2 or \sqrt{x}).

2.2 Mechanism
For a prediction at point x:
• Assign weights to training examples:
w_i = \exp\left( -\frac{(x_i - x)^2}{2\tau^2} \right)
• w_i ≈ 1 if x_i is close to x.
• w_i ≈ 0 if x_i is far from x.
• Minimize the weighted cost:
J(\theta) = \sum_{i=1}^{m} w_i \left( h_\theta(x_i) - y_i \right)^2
• Solve using weighted least squares:
\theta = (X^T W X)^{-1} X^T W y
where W is the diagonal matrix with W_{ii} = w_i.
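A minimal sketch of a single LWR prediction in NumPy, assuming one feature plus an intercept column; the name lwr_predict, the choice of tau, and the noisy-sine data are illustrative assumptions.

```python
import numpy as np

def lwr_predict(x_query, X, y, tau=0.5):
    """Locally weighted linear prediction at x_query.
    X: (m, 2) design matrix with a leading column of ones; y: (m,) targets."""
    w = np.exp(-((X[:, 1] - x_query) ** 2) / (2 * tau ** 2))  # Gaussian weights
    W = np.diag(w)
    # Weighted normal equations: theta = (X^T W X)^{-1} X^T W y
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return np.array([1.0, x_query]) @ theta

# Usage on a noisy non-linear curve (synthetic data, assumed for illustration)
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 200))
y = np.sin(x) + rng.normal(0, 0.1, 200)
X = np.c_[np.ones_like(x), x]
print(lwr_predict(5.0, X, y, tau=0.5))  # roughly sin(5.0) ≈ -0.96
```

Note that a fresh θ is solved for every query point, which is why LWR becomes expensive on large datasets.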

2.3 Bandwidth Parameter (𝜏)


Controls the “neighborhood” size:
• Small 𝜏: Narrow focus, risks overfitting (jagged fit).
• Large 𝜏: Broad focus, risks underfitting (smooth, linear-like fit).
Tuning: Use cross-validation to select optimal 𝜏.

2.4 Parametric vs. Non-Parametric


• Parametric: Fixed parameters (e.g., linear regression’s 𝜃).
• Non-Parametric: Parameters scale with data (LWR stores all data, memory grows with
𝑚).

2.5 Challenges
• Computationally intensive: Requires solving a new linear system per prediction.
• Poor extrapolation outside the training data range.

2.6 Applications
• Low-dimensional data (𝑛 ≤ 3) with sufficient examples.
• Time-series forecasting, robotics (e.g., path planning).

2.7 Extensions
• Alternative kernels: Triangular, Epanechnikov for different weighting schemes.
• Scalability: KD-trees to efficiently find nearby points.

2.8 Practice
CS229 Problem Set 1: Implement LWR, experiment with 𝜏.

2.9 Example
Non-linear housing prices: LWR captures curves (e.g., price plateaus for large houses) without
explicit non-linear features.

2.10 Insight
LWR is intuitive but less common in high-dimensional settings due to computational cost.
Visualization Placeholder: A plot showing a non-linear dataset with LWR’s local linear fits at
different points would clarify the method.

3 Probabilistic Interpretation of Linear Regression


3.1 Objective
Justify the squared error cost using a probabilistic framework, connecting to maximum likelihood estimation (MLE).

3.2 Model Assumptions


True output:
y_i = \theta^T x_i + \epsilon_i
\epsilon_i: error term (unmodeled effects + noise).
Error distribution:
\epsilon_i \sim \mathcal{N}(0, \sigma^2)
Gaussian, mean 0, variance \sigma^2, independent and identically distributed (IID).

Density:
p(\epsilon_i) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{\epsilon_i^2}{2\sigma^2} \right)

Implies:
y_i \mid x_i; \theta \sim \mathcal{N}(\theta^T x_i, \sigma^2)

Density:
p(y_i \mid x_i; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(y_i - \theta^T x_i)^2}{2\sigma^2} \right)

3.3 Likelihood
Likelihood:
L(\theta) = \prod_{i=1}^{m} p(y_i \mid x_i; \theta)

Log-likelihood:
\ell(\theta) = -\frac{m}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{m} \left( y_i - \theta^T x_i \right)^2

Maximizing \ell(\theta) is equivalent to minimizing:
\sum_{i=1}^{m} \left( y_i - \theta^T x_i \right)^2
which matches linear regression's least squares objective.
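As a quick numerical sanity check of this equivalence (the synthetic data and variable names are illustrative assumptions): the least-squares solution also maximizes the Gaussian log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)
m, sigma = 100, 1.5
x = rng.uniform(0, 5, m)
X = np.c_[np.ones(m), x]
y = X @ np.array([1.0, 2.0]) + rng.normal(0, sigma, m)

def log_likelihood(theta):
    resid = y - X @ theta
    return -m / 2 * np.log(2 * np.pi * sigma**2) - resid @ resid / (2 * sigma**2)

theta_ls = np.linalg.lstsq(X, y, rcond=None)[0]   # minimizes the sum of squared errors
for delta in [np.zeros(2), np.array([0.1, 0.0]), np.array([0.0, -0.1])]:
    print(log_likelihood(theta_ls + delta))       # largest value at delta = 0
```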

3.4 Key Insights


• Gaussian Justification: Central Limit Theorem suggests errors from many small sources
are approximately Gaussian.
• IID Assumption: Simplifies math but may not hold (e.g., correlated housing prices).
• Likelihood vs. Probability: Likelihood treats data as fixed, 𝜃 as variable; probability
fixes 𝜃.
• Notation: Semicolon (;) denotes 𝜃 as a parameter (frequentist convention).

3.5 Conclusion
• Least squares is the MLE under Gaussian, IID errors.
• Non-Gaussian errors (e.g., Poisson) require generalized linear models (GLMs).

3.6 Insight
This probabilistic view unifies regression with other models, setting the stage for logistic regression and GLMs.
Visualization Placeholder: A plot of a Gaussian error distribution around a linear fit would
illustrate the error model.

4 Logistic Regression
4.1 Overview
Used for binary classification (𝑦 ∈ {0, 1}), e.g., tumor malignancy (1 = malignant, 0 = benign).

4.2 Why Not Linear Regression?


• Outputs unbounded values, not probabilities in [0, 1].
• Sensitive to outliers, distorting decision boundaries.
• Non-binary outputs are unnatural for classification.

4.3 Hypothesis
Sigmoid function:
g(z) = \frac{1}{1 + e^{-z}}
Maps z \in \mathbb{R} to (0, 1).
Hypothesis:
h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}
Represents p(y = 1 \mid x; \theta).

4.4 Probabilistic Model


Assumptions:
p(y = 1 \mid x; \theta) = h_\theta(x)
p(y = 0 \mid x; \theta) = 1 - h_\theta(x)

Combined:
p(y \mid x; \theta) = (h_\theta(x))^{y} (1 - h_\theta(x))^{1-y}

4.5 Likelihood
Likelihood:
L(\theta) = \prod_{i=1}^{m} \left( h_\theta(x^{(i)}) \right)^{y^{(i)}} \left( 1 - h_\theta(x^{(i)}) \right)^{1 - y^{(i)}}

Log-likelihood:
\ell(\theta) = \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left( 1 - h_\theta(x^{(i)}) \right) \right]

4.6 Optimization
Batch Gradient Ascent:
Maximize \ell(\theta):
\theta_j := \theta_j + \alpha \sum_{i=1}^{m} \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)}
Similar to linear regression's gradient descent, but uses the sigmoid-based h_\theta.
Properties:
• Concave log-likelihood ensures a global maximum.
• No closed-form solution, requiring iterative methods.
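A minimal NumPy sketch of this batch gradient ascent rule; the function names, learning rate, and synthetic data below are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gradient_ascent(X, y, alpha=0.01, iters=2000):
    """Batch gradient ascent on the logistic log-likelihood.
    X: (m, n+1) with a leading column of ones; y: (m,) labels in {0, 1}."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        h = sigmoid(X @ theta)
        theta += alpha * (X.T @ (y - h))   # theta_j += alpha * sum_i (y_i - h_i) * x_ij
    return theta

# Usage on synthetic 2-D data (assumed for illustration)
rng = np.random.default_rng(0)
m = 200
features = rng.normal(0, 1, (m, 2))
y = (features[:, 0] + features[:, 1] + rng.normal(0, 0.5, m) > 0).astype(float)
X = np.c_[np.ones(m), features]
theta = logistic_gradient_ascent(X, y)
accuracy = ((sigmoid(X @ theta) > 0.5) == y).mean()
print("theta:", theta, "training accuracy:", accuracy)
```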

4.7 Why Sigmoid?


• Ensures probabilistic outputs in [0, 1].
• Derived from GLMs, guaranteeing concavity.

4.8 Practical Notes


• Decision Boundary: \theta^T x = 0, where h_\theta(x) = 0.5.
• Applications: Medical diagnosis, spam detection, credit scoring.
• Extensions: L1/L2 regularization for overfitting, softmax for multiclass classification.

4.9 Insight
Logistic regression is robust and interpretable, serving as a baseline for classification tasks.
Visualization Placeholder: A plot of the sigmoid function and a 2D decision boundary separating two classes would clarify the model.

5 Newton’s Method
5.1 Purpose
Optimizes 𝜃 faster than gradient ascent using second-order information (Hessian).

5.2 Mechanism
Goal: Find \theta where \nabla \ell(\theta) = 0.
Scalar Case:
\theta_{t+1} = \theta_t - \frac{\ell'(\theta_t)}{\ell''(\theta_t)}

Vector Case:
\theta_{t+1} = \theta_t - H^{-1} \nabla \ell(\theta_t)
H: Hessian matrix ((n+1) \times (n+1)) of second derivatives of \ell(\theta).
Process: Uses a tangent approximation to find the zero of the derivative.
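A minimal sketch of the vector-case update applied to logistic regression, assuming the standard gradient \nabla\ell = X^T(y - h) and Hessian H = -X^T S X with S_{ii} = h_i(1 - h_i); the function name and synthetic data are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_logistic(X, y, iters=10):
    """Newton's method for the logistic log-likelihood.
    X: (m, n+1) with a leading column of ones; y: (m,) labels in {0, 1}."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        h = sigmoid(X @ theta)
        grad = X.T @ (y - h)                       # gradient of l(theta)
        s = h * (1 - h)
        H = -(X.T * s) @ X                         # Hessian of l(theta): -X^T diag(s) X
        theta = theta - np.linalg.solve(H, grad)   # theta_{t+1} = theta_t - H^{-1} grad
    return theta

# Usage: typically converges in well under 10 iterations (synthetic data assumed)
rng = np.random.default_rng(0)
m = 200
features = rng.normal(0, 1, (m, 2))
y = (features[:, 0] - features[:, 1] + rng.normal(0, 0.5, m) > 0).astype(float)
X = np.c_[np.ones(m), features]
print(newton_logistic(X, y))
```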

5.3 Quadratic Convergence


• Error reduces quadratically (e.g., from 0.01 to 0.0001).
• Requires fewer iterations (e.g., 10 vs. 100–1000 for gradient ascent).

5.4 Trade-offs
• Advantages: Rapid convergence for low-dimensional 𝜃 (𝑛 ≤ 50).
• Disadvantages: Hessian inversion is costly (O(n^3)) for high-dimensional \theta (e.g., n = 10,000).

5.5 Guidelines
• Use Newton’s method for 𝑛 ≤ 50.
• Use gradient ascent or L-BFGS for large 𝑛.

5.6 Context
• Also known as Newton-Raphson.
• Applications: Small-scale optimization in finance, control systems.

5.7 Insight
Newton’s method is powerful for small problems but impractical for modern ML’s high-dimensional
data.
Visualization Placeholder: A plot comparing the convergence paths of gradient ascent vs. Newton's method would highlight quadratic convergence.

6 Future Topics
• Problem Set 1: Implement LWR, experiment with 𝜏.
• Generalized Linear Models (GLMs): Unify linear and logistic regression under a common framework.

• Feature Selection: Automate feature choice to improve model performance.
• Overfitting/Underfitting: Address via 𝜏 tuning, regularization techniques.
Insight: These topics build toward robust, scalable ML systems.

7 Practical Insights
LWR:
• Ideal for non-linear, low-dimensional data.
• Used in time-series, robotics (e.g., trajectory prediction).
• Requires careful 𝜏 tuning to balance fit.
Logistic Regression:
• Robust for binary classification.
• Applications: Medical diagnostics, spam filters, credit risk.
• Use as a baseline before complex models like neural networks.
Newton’s Method:
• Efficient for small 𝑛, but modern ML prefers stochastic gradient descent for scalability.
Learning Tips:
• Visualize LWR fits and logistic decision boundaries to build intuition.
• Study GLMs for a unified perspective on regression and classification.
• Use Python libraries (e.g., scikit-learn) for implementation.
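For example, a minimal scikit-learn sketch for logistic regression; the synthetic dataset and variable names are illustrative assumptions, not from the notes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (assumed for illustration)
rng = np.random.default_rng(0)
X = rng.normal(0, 1, (300, 2))
y = (X[:, 0] + X[:, 1] + rng.normal(0, 0.5, 300) > 0).astype(int)

clf = LogisticRegression()   # uses L2 regularization by default
clf.fit(X, y)
print("coefficients:", clf.coef_, "intercept:", clf.intercept_)
print("P(y = 1) for a new point:", clf.predict_proba([[0.5, -0.2]])[0, 1])
```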
Insight: These methods are foundational, bridging theory and practice in ML engineering.
Visualization Placeholder: A flowchart of supervised learning algorithms (linear regression
→ LWR → logistic regression → GLMs) would clarify their relationships.
