● Machine learning algorithms are procedures that are implemented in code and run on data.
● Machine learning models are the output of algorithms and consist of model data and a
prediction algorithm.
● Machine learning algorithms provide a form of automatic programming in which the machine
learning model represents the program.
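As a minimal illustration of the "algorithm produces a model; the model is the program" idea, here is a sketch in Python. The function names `fit` and `predict` are illustrative choices, not taken from any particular library:

```python
# Sketch: a learning algorithm outputs a model (data + prediction rule).

def fit(xs, ys):
    """A tiny learning algorithm: least-squares slope through the origin."""
    slope = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    return {"slope": slope}  # the learned "model data"

def predict(model, x):
    """The model's prediction algorithm: apply the learned data to new input."""
    return model["slope"] * x

model = fit([1, 2, 3], [2, 4, 6])   # the algorithm runs on data and outputs a model
print(predict(model, 4))            # the model now acts as the learned "program": 8.0
```

The programmer never writes the rule "multiply by 2"; the algorithm learns it from the data, which is what "automatic programming" means here.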
Overview
This tutorial is divided into six parts; they are:
● Linear Regression
● Logistic Regression
● Decision Tree
● Artificial Neural Network
● k-Nearest Neighbors
● k-Means
June 2013
A Risk Comparison of Ordinary Least Squares vs
Ridge Regression
Paramveer S. Dhillon
University of Pennsylvania
Dean P. Foster
University of Pennsylvania
Sham M. Kakade
University of Pennsylvania
Lyle Ungar
University of Pennsylvania
Abstract
We compare the risk of ridge regression to a simple variant of ordinary least squares, in which
one simply projects the data onto a finite dimensional subspace (as specified by a principal
component analysis) and then performs an ordinary (unregularized) least squares regression
in this subspace. This note shows that the risk of this ordinary least squares method (PCA-OLS)
is within a constant factor (namely 4) of the risk of ridge regression (RR).
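For readers who want the two estimators in symbols, they can be written as follows. These are the standard definitions inferred from the abstract, not formulas quoted from the paper; $U_k$ denotes the matrix of top-$k$ principal component directions of $X$:

```latex
% Ridge regression with penalty \lambda:
\hat{\beta}_{\mathrm{RR}} = (X^\top X + \lambda I)^{-1} X^\top Y

% PCA-OLS: project onto the top-k principal components U_k, then run
% ordinary least squares in that subspace:
\hat{\beta}_{\mathrm{PCA\text{-}OLS}} = U_k \, (U_k^\top X^\top X \, U_k)^{-1} U_k^\top X^\top Y
```

The claim of the note is then that, with an appropriate choice of $k$ relative to $\lambda$, the prediction risk of $\hat{\beta}_{\mathrm{PCA\text{-}OLS}}$ is at most four times that of $\hat{\beta}_{\mathrm{RR}}$.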
Keywords
risk inflation, ridge regression, PCA
Disciplines
Computer Sciences
Recommended Citation
Kumar, Dalip, "Ridge Regression and Lasso Estimators for Data Analysis" (2019). MSU Graduate
Theses. 3380.
https://bearworks.missouristate.edu/theses/3380
ABSTRACT
An important problem in data science and statistical learning is to predict an outcome based on
data collected on several predictor variables. This is generally known as a regression problem.
In the field of big data studies, the regression model often depends on a large number of
predictor variables. The data scientist is often dealing with the difficult task of determining the
most appropriate set of predictor variables to be employed in the regression model. In this
thesis we adopt a technique that constrains the coefficient estimates, which in effect shrinks
the coefficient estimates towards zero. Ridge regression and lasso are two well-known methods
for shrinking the coefficients towards zero. These two methods are investigated in this thesis.
Ridge regression and lasso techniques are compared by analyzing a real data set for a
regression model with a large collection of predictor variables.
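The shrinkage contrast between the two methods is easiest to see in the one-predictor case, where both estimators have closed forms. The following is a simplified illustration of that special case, not the thesis's own analysis:

```python
def ridge_1d(xs, ys, lam):
    """Minimise 0.5*sum((y - b*x)^2) + 0.5*lam*b^2.
    Ridge shrinks b smoothly toward zero but never exactly to zero."""
    rho = sum(x * y for x, y in zip(xs, ys))
    return rho / (sum(x * x for x in xs) + lam)

def lasso_1d(xs, ys, lam):
    """Minimise 0.5*sum((y - b*x)^2) + lam*|b|.
    The lasso soft-thresholds, so a large enough penalty sets b exactly to zero."""
    rho = sum(x * y for x, y in zip(xs, ys))
    shrunk = max(abs(rho) - lam, 0.0)
    return (1 if rho >= 0 else -1) * shrunk / sum(x * x for x in xs)

xs, ys = [1, 2, 3], [2, 4, 6]      # data lies exactly on y = 2x, so OLS gives b = 2
print(ridge_1d(xs, ys, 0))         # 2.0  (no penalty: plain least squares)
print(ridge_1d(xs, ys, 14))        # 1.0  (shrunk, but still nonzero)
print(lasso_1d(xs, ys, 100))       # 0.0  (large penalty zeroes the coefficient)
```

This is why the lasso is often preferred for variable selection: it can drop predictors entirely, while ridge keeps every predictor with a smaller coefficient.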
K nearest neighbours, or KNN, is a simple algorithm that can be used to classify objects, such as
people, or to predict outcomes. To illustrate, the following table outlines some research
questions that KNN can address.
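A bare-bones version of the algorithm is short enough to write down in full. This is a sketch with made-up data, not code from this document: find the k training points closest to the query and take a majority vote on their labels.

```python
from collections import Counter
from math import dist

def knn_classify(train, query, k=3):
    """Label `query` by majority vote among its k nearest training points.

    `train` is a list of (point, label) pairs; points are coordinate tuples.
    """
    neighbours = sorted(train, key=lambda pair: dist(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical data: two clusters of observations labelled 'A' and 'B'.
train = [((0, 0), "A"), ((1, 0), "A"), ((5, 5), "B"), ((6, 5), "B")]
print(knn_classify(train, (0.5, 0), k=3))   # 'A': two of the three nearest are 'A'
```

Note the only "training" step is storing the data; all the work happens at prediction time, which is why KNN is sometimes called a lazy learner.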
Machine learning
KNN is a very simple example of machine learning, in particular a kind called supervised
learning. Consequently, this document not only outlines KNN but also introduces the
fundamentals of machine learning. If you have already developed basic knowledge about
machine learning, you can disregard the sections about machine learning.
KNN is only one of many machine learning techniques; other common models include:
● neural networks
● decision trees and random forests
● support vector machines
● Bayesian networks
● deep learning, and
● genetic algorithms.
But how can researchers decide which models to utilise? How can researchers decide, for
example, whether to use KNN, decision trees, or another approach? One answer to this
question revolves around cross-validation, a technique that can be used to
decide which model classifies or predicts outcomes most effectively.
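The idea can be sketched as a k-fold procedure: split the data into k folds, train on all folds but one, test on the held-out fold, and average the score. The fold count, data, and the simple 1-nearest-neighbour rule below are illustrative choices, not taken from this document:

```python
def kfold_indices(n, k):
    """Split indices 0..n-1 into k disjoint test folds (train = the rest)."""
    fold_size = n // k
    folds = []
    for i in range(k):
        test = list(range(i * fold_size, (i + 1) * fold_size))
        train = [j for j in range(n) if j not in test]
        folds.append((train, test))
    return folds

def cross_validate(xs, ys, predict_fn, k=3):
    """Average accuracy of `predict_fn(train_xs, train_ys, x)` over k folds."""
    scores = []
    for train, test in kfold_indices(len(xs), k):
        correct = sum(
            predict_fn([xs[j] for j in train], [ys[j] for j in train], xs[i]) == ys[i]
            for i in test
        )
        scores.append(correct / len(test))
    return sum(scores) / len(scores)

# A trivial 1-nearest-neighbour rule as the model under evaluation:
def one_nn(train_xs, train_ys, x):
    nearest = min(range(len(train_xs)), key=lambda j: abs(train_xs[j] - x))
    return train_ys[nearest]

xs = [1, 10, 20, 2, 11, 21]
ys = ["low", "mid", "high", "low", "mid", "high"]
print(cross_validate(xs, ys, one_nn, k=3))   # 1.0: every held-out point is classified correctly
```

To choose between models, a researcher would run `cross_validate` once per candidate model on the same folds and pick the model with the best average score.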
Example
Imagine a researcher who wants to develop an algorithm or app that can predict which
research candidates are likely to complete their thesis on time. Specifically, the researcher
collates information on 1000 candidates who had enrolled at least 8 years ago and thus should
have completed their thesis. An extract of data appears in the following table. Each row
corresponds to one individual. The columns represent