
Final Project Report

On

Least Square Regression

BY

HEMANT GANDHI

2014B4A4PS763H

Under the supervision of

PROF. ADDEPALLI RAMU

Submitted in partial fulfillment of the requirements of

MATH F266: Study Oriented Project

BITS PILANI (RAJASTHAN), HYDERABAD CAMPUS


ACKNOWLEDGEMENT

Any work, irrespective of its magnitude or complexity, is always a group effort and is never fully
complete unless due gratitude is bestowed upon all who contributed to its success. I would like to
take the opportunity to thank Prof. ADDEPALLI RAMU, Associate Professor, Birla Institute of
Technology and Science Pilani, Hyderabad Campus, for having given me this wonderful chance to
work under his guidance.

I am also grateful to the administration of BITS Pilani, Hyderabad Campus, for providing
opportunities to the students for development of their academic skills and logical thinking through
open-ended, study-oriented activities.

CERTIFICATE

This is to certify that the project report entitled Least Square Regression submitted by Mr.
Hemant Gandhi (2014B4A4763H), in partial fulfilment of the requirements of the course MATH
F266 (Study Oriented Project), embodies the work done by him under my supervision and guidance.

Date: (Prof. ADDEPALLI RAMU)

BITS- Pilani, Hyderabad Campus


Curve Fitting

Curve fitting is the process of constructing a curve, or mathematical function that has the best fit to
a series of data points, possibly subject to constraints.

It is frequently used in engineering. For example, the empirical relations that we use in heat
transfer and fluid mechanics are functions fitted to experimental data.

Regression: Mainly used with experimental data, which might have a significant amount of error
(noise). The fitted function need not pass through every data point.
Examples: linear regression, polynomial regression.

Interpolation: Used if the data is known to be very precise. Find a function (or a
series of functions) that passes through all discrete points.
Examples: polynomial interpolation, spline interpolation.

Least Square Regression
The method of least squares is a standard approach in regression analysis to the approximate
solution of overdetermined systems, i.e., sets of equations in which there are more equations than
unknowns. "Least squares" means that the overall solution minimizes the sum of the squares of the
residuals made in the results of every single equation.

The most important application is in data fitting. The best fit in the least-squares sense
minimizes the sum of squared residuals, a residual being the difference between an observed value
and the fitted value provided by a model.

Least squares problems fall into two categories: linear (or ordinary) least squares and non-linear least
squares, depending on whether or not the residuals are linear in all unknowns. The linear
least-squares problem occurs in statistical regression analysis; it has a closed-form solution. The
non-linear problem is usually solved by iterative refinement; at each iteration the system is approximated
by a linear one, and thus the core calculation is similar in both cases. Polynomial least
squares describes the variance in a prediction of the dependent variable as a function of the
independent variable and the deviations from the fitted curve.

Linear Regression
Linear least squares regression is by far the most widely used modeling method. It is what most
people mean when they say they have used "regression", "linear regression" or "least squares" to fit
a model to their data. Not only is linear least squares regression the most widely used modeling
method, but it has been adapted to a broad range of situations that are outside its direct scope.
Mathematically, linear least squares is the problem of approximately solving an overdetermined
system of linear equations, where the best approximation is defined as that which minimizes the
sum of squared differences between the data values and their corresponding modeled values. The
approach is called linear least squares since the assumed function is linear in the parameters to be
estimated.

Several possibilities to minimize the error (deviation) to get a best-fit line (to find a0 and a1) are:

Minimize the sum of individual errors.

Minimize the sum of absolute values of individual errors.

Minimize the maximum error.

Minimize the sum of squares of individual errors. This is the preferred strategy.
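Minimizing the sum of squares of individual errors
For a straight-line model $y = a_0 + a_1 x$ fitted to $n$ data points $(x_i, y_i)$, the sum of squares of the residuals is:

$$S_r = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - a_0 - a_1 x_i)^2$$

Determine the unknowns a0 and a1 by minimizing Sr.

To do this, set the derivatives of Sr with respect to a0 and a1 to zero:

$$\frac{\partial S_r}{\partial a_0} = -2 \sum_{i=1}^{n} (y_i - a_0 - a_1 x_i) = 0$$

$$\frac{\partial S_r}{\partial a_1} = -2 \sum_{i=1}^{n} (y_i - a_0 - a_1 x_i)\, x_i = 0$$

Or,

$$n a_0 + \left(\sum x_i\right) a_1 = \sum y_i$$

$$\left(\sum x_i\right) a_0 + \left(\sum x_i^2\right) a_1 = \sum x_i y_i$$

These are called the normal equations.

Solve these for a0 and a1. The results are:

$$a_1 = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left(\sum x_i\right)^2}, \qquad a_0 = \bar{y} - a_1 \bar{x}$$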
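The closed-form solution of the normal equations can be sketched in a few lines of Python. This is only an illustration: the function name `fit_line` and the sample data are assumptions made for this example and do not come from the report.

```python
def fit_line(xs, ys):
    """Return (a0, a1) minimizing sum((y - a0 - a1*x)**2) over the data."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    sx = sum(xs)                               # sum of x_i
    sy = sum(ys)                               # sum of y_i
    sxx = sum(x * x for x in xs)               # sum of x_i^2
    sxy = sum(x * y for x, y in zip(xs, ys))   # sum of x_i * y_i
    denom = n * sxx - sx * sx
    if denom == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    a1 = (n * sxy - sx * sy) / denom           # slope
    a0 = sy / n - a1 * sx / n                  # intercept: mean(y) - a1 * mean(x)
    return a0, a1


# Hypothetical data lying exactly on y = 1 + 2x
a0, a1 = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(a0, a1)
```

Because the sample points lie exactly on a straight line, the fit recovers the intercept 1 and the slope 2.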

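Error of Linear Regression

The spread of the data around the mean of y, and around the regression line, is measured respectively by:

$$S_t = \sum_{i=1}^{n} (y_i - \bar{y})^2, \qquad S_r = \sum_{i=1}^{n} (y_i - a_0 - a_1 x_i)^2$$

The improvement obtained by using a regression line instead of the mean gives a measure of how
good the regression fit is:

$$r^2 = \frac{S_t - S_r}{S_t}$$

Here r^2 is the coefficient of determination and r is the correlation coefficient.

How to interpret the correlation coefficient?

Two extreme cases are
Sr = 0 -> r = 1 describes a perfect fit (straight line passing through all points).
Sr = St -> r = 0 describes a case with no improvement.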
Usually an r value close to 1 represents a good fit. But be careful and always plot the data points
and the regression line together to see what is going on.
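As a small numerical check of this measure, the sketch below computes St, Sr and r^2 for a given line. The function name `goodness_of_fit` and the five data points are assumptions made for illustration; the coefficients 2.2 and 0.6 are the least-squares line for that particular data set.

```python
def goodness_of_fit(xs, ys, a0, a1):
    """Return (St, Sr, r2) for the line y = a0 + a1*x on the given data."""
    mean_y = sum(ys) / len(ys)
    st = sum((y - mean_y) ** 2 for y in ys)                    # spread around the mean
    sr = sum((y - a0 - a1 * x) ** 2 for x, y in zip(xs, ys))   # spread around the line
    return st, sr, (st - sr) / st


# Hypothetical data; (a0, a1) = (2.2, 0.6) is its least-squares line
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
st, sr, r2 = goodness_of_fit(xs, ys, 2.2, 0.6)
print(st, round(sr, 6), round(r2, 6))

# Using the mean itself as the "model" gives Sr = St and therefore r2 = 0
_, _, r2_mean_only = goodness_of_fit(xs, ys, 4.0, 0.0)
print(r2_mean_only)
```

For this data the line explains 60% of the spread around the mean, while the mean-only model reproduces the "no improvement" case.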

Linearization of Nonlinear Behavior


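Linear regression is useful to represent a linear relationship.
If the relation is nonlinear, either another technique can be used or the data can be transformed so
that linear regression can still be used. The latter technique is frequently used to fit the following
nonlinear equations (written here with parameters a and b) to a set of data.

1. Exponential Equation: $y = a e^{b x}$, linearized as $\ln y = \ln a + b x$
2. Power Equation: $y = a x^{b}$, linearized as $\log y = \log a + b \log x$

3. Saturation-growth rate equation: $y = \dfrac{a x}{b + x}$, linearized as $\dfrac{1}{y} = \dfrac{1}{a} + \dfrac{b}{a} \cdot \dfrac{1}{x}$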
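The transformation approach can be illustrated for the exponential case. The sketch below assumes the model y = a * exp(b * x) with positive y values, takes logarithms so that ln y is linear in x, and then applies the usual straight-line formulas. The function name `fit_exponential` and the synthetic data are assumptions made for this example.

```python
import math


def fit_exponential(xs, ys):
    """Fit y = a * exp(b * x) by a straight-line fit of ln(y) against x."""
    if any(y <= 0 for y in ys):
        raise ValueError("the log transform requires strictly positive y values")
    zs = [math.log(y) for y in ys]             # transformed response: z = ln y
    n = len(xs)
    sx, sz = sum(xs), sum(zs)
    sxx = sum(x * x for x in xs)
    sxz = sum(x * z for x, z in zip(xs, zs))
    b = (n * sxz - sx * sz) / (n * sxx - sx * sx)   # slope of the line in (x, z)
    ln_a = sz / n - b * sx / n                      # intercept of the line is ln a
    return math.exp(ln_a), b


# Synthetic, noise-free data generated from a = 2, b = 0.5
xs = [0.0, 1.0, 2.0, 3.0]
ys = [2.0 * math.exp(0.5 * x) for x in xs]
a, b = fit_exponential(xs, ys)
print(round(a, 6), round(b, 6))
```

Note that this minimizes the squared error of ln y rather than of y itself, so with noisy data the result generally differs slightly from a direct nonlinear least-squares fit.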

Application of Linear Regression with 3 Independent Variables


Polynomial regression
Polynomial regression is a form of linear regression in which the relationship between the
independent variable x and the dependent variable y is modeled as an nth degree polynomial in x.
Polynomial regression fits a nonlinear relationship between the value of x and the
corresponding conditional mean of y, denoted E(y |x), and has been used to describe nonlinear
phenomena such as the growth rate of tissues, the distribution of carbon isotopes in lake
sediments, and the progression of disease epidemics. Although polynomial regression fits a
nonlinear model to the data, as a statistical estimation problem it is linear, in the sense that the
regression function E(y | x) is linear in the unknown parameters that are estimated from the data. For
this reason, polynomial regression is considered to be a special case of multiple linear regression.

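We can model the expected value of y as an nth degree polynomial, yielding the general polynomial
regression model

$$y = a_0 + a_1 x + a_2 x^2 + \cdots + a_n x^n + \varepsilon$$

where $\varepsilon$ is a random error term.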
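Because the polynomial model is linear in its coefficients, the coefficients can be found by solving a linear system of normal equations. The sketch below is one possible way to do this using only the Python standard library; the function name `fit_polynomial`, the elimination routine and the sample data are assumptions made for illustration.

```python
def fit_polynomial(xs, ys, degree):
    """Return [a0, a1, ..., a_degree] of the least-squares polynomial."""
    m = degree + 1
    if len(xs) != len(ys) or len(xs) < m:
        raise ValueError("need at least degree + 1 data points")
    # Normal equations: sum_k a_k * sum_i x_i^(j+k) = sum_i y_i * x_i^j
    A = [[sum(x ** (j + k) for x in xs) for k in range(m)] for j in range(m)]
    rhs = [sum(y * x ** j for x, y in zip(xs, ys)) for j in range(m)]
    # Gaussian elimination with partial pivoting
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("normal equations are singular")
        A[col], A[pivot] = A[pivot], A[col]
        rhs[col], rhs[pivot] = rhs[pivot], rhs[col]
        for row in range(col + 1, m):
            factor = A[row][col] / A[col][col]
            for k in range(col, m):
                A[row][k] -= factor * A[col][k]
            rhs[row] -= factor * rhs[col]
    # Back substitution
    coeffs = [0.0] * m
    for row in range(m - 1, -1, -1):
        tail = sum(A[row][k] * coeffs[k] for k in range(row + 1, m))
        coeffs[row] = (rhs[row] - tail) / A[row][row]
    return coeffs


# Hypothetical data lying exactly on y = 1 + 2x + 3x^2
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [1 + 2 * x + 3 * x * x for x in xs]
coeffs = fit_polynomial(xs, ys, 2)
print([round(c, 6) for c in coeffs])
```

The normal equations become poorly conditioned as the degree grows, so for high-degree fits a more stable solver (for example one based on QR factorization) is usually preferred over this direct approach.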

Application of Polynomial Regression


Table 2. Polynomial regression results for direction a

                     polynomial model
                 linear quadratic     cubic
RMSE              7.876     3.011     1.295
MAPE            14.9473    4.8526    1.5763
R2               0.9233    0.9902    0.9984
Adjusted R2      0.9137    0.9874    0.9977

Table 3. Polynomial regression results for direction b

                     polynomial model
                 linear quadratic     cubic
RMSE              5.357     2.542     2.732
MAPE            13.5912    3.0394    2.6997
R2               0.9638    0.9929    0.9929
Adjusted R2      0.9593    0.9908    0.9894

Table 4. Polynomial regression results for direction c

                          polynomial model
                 linear quadratic     cubic   quartic
RMSE              1.501     1.516     1.319     0.656
MAPE            26.2045   24.0227   19.7495    8.1552
R2               0.9467    0.9524    0.9691    0.9936
Adjusted R2      0.94      0.9388    0.9537    0.9885

Direction a: The cubic polynomial regression model outperforms the other two models, with the lowest
error statistics and the highest coefficient of determination. Least squares parameter estimates for
this model are (9.2000, 56.9503, 12.3007, 1.0521)T.

Direction b: We find that the quadratic polynomial regression model appears to fit the data best.
Least squares parameter estimates for this model are (5.8667, 30.2242, 2.3636)T.

Direction c: The quartic polynomial regression model is here the best. Least squares parameter
estimates for this model are (0.5000, 20.9751, 17.0268, 4.2906, 0.3590)T.
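The report does not state how the error statistics used to compare the models were computed. The sketch below uses common textbook definitions, which are assumptions: RMSE as the square root of the mean squared error, MAPE as the mean absolute percentage error, R2 as one minus the ratio of residual to total sum of squares, and the adjusted R2 penalized by the number of predictors (the polynomial degree). The function name `fit_metrics` and the sample values are hypothetical.

```python
import math


def fit_metrics(ys, preds, num_predictors):
    """Return RMSE, MAPE (in percent), R2 and adjusted R2 for fitted values."""
    n = len(ys)
    if n - num_predictors - 1 <= 0:
        raise ValueError("too few data points for the adjusted R2")
    mean_y = sum(ys) / n
    sse = sum((y - p) ** 2 for y, p in zip(ys, preds))   # residual sum of squares
    sst = sum((y - mean_y) ** 2 for y in ys)             # total sum of squares
    r2 = 1.0 - sse / sst
    return {
        "rmse": math.sqrt(sse / n),
        # MAPE is undefined if any observed value is zero
        "mape": 100.0 / n * sum(abs((y - p) / y) for y, p in zip(ys, preds)),
        "r2": r2,
        "adj_r2": 1.0 - (1.0 - r2) * (n - 1) / (n - num_predictors - 1),
    }


# Hypothetical observations and the fitted values of a straight-line model
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
preds = [2.8, 3.4, 4.0, 4.6, 5.2]
metrics = fit_metrics(ys, preds, num_predictors=1)
print({name: round(value, 4) for name, value in metrics.items()})
```

Unlike R2, the adjusted R2 can decrease when a higher-order term is added without a matching reduction in the residuals, which is why it is useful when comparing polynomial models of different degree.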

There are several possible uses of a regression model. One is to understand the relationship between
two or more variables. A more common use of a regression analysis is prediction, providing
estimates of values of the dependent variable (variables) by using the prediction equation. Point
predictions are not perfect and are subject to error. The error is due to the uncertainty in estimation as
well as the natural variation of points about the regression line.

We can compute, for example, a 95% prediction interval for the strains in the directions marked
a, b and c. Figures 1(b), 2(b) and 3(b) show the 95% prediction intervals for the strains in these
directions, obtained by using the best polynomial regression model.
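For the simpler case of a straight-line fit, a prediction interval at a new point x0 can be sketched as follows. This is an illustration under stated assumptions, not the computation used for the figures: it uses the standard formula y0 ± t * s * sqrt(1 + 1/n + (x0 - mean(x))^2 / Sxx) with s^2 = Sr / (n - 2), and the critical value of Student's t distribution is supplied by the caller (for example from a t-table). The function name `prediction_interval` and the data are hypothetical.

```python
import math


def prediction_interval(xs, ys, x0, t_crit):
    """Return (lower, y0, upper) for a new observation at x0 under a line fit.

    t_crit is the two-sided critical value of Student's t distribution with
    n - 2 degrees of freedom; it must be looked up by the caller.
    """
    n = len(xs)
    if n < 3:
        raise ValueError("need at least three points to estimate the error variance")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a1 = sxy / sxx
    a0 = mean_y - a1 * mean_x
    sr = sum((y - a0 - a1 * x) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(sr / (n - 2))                 # standard error of the estimate
    y0 = a0 + a1 * x0                           # point prediction
    half_width = t_crit * s * math.sqrt(1.0 + 1.0 / n + (x0 - mean_x) ** 2 / sxx)
    return y0 - half_width, y0, y0 + half_width


# Hypothetical data; 3.182 is the tabulated 97.5% t quantile for 3 degrees of freedom
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
lower, y0, upper = prediction_interval(xs, ys, 3.0, 3.182)
print(round(lower, 3), round(y0, 3), round(upper, 3))
```

The interval is narrowest at the mean of the x values and widens as x0 moves away from it, reflecting both the uncertainty in the estimated line and the natural scatter of points about it.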
R-squared
R-squared is a statistical measure of how close the data are to the fitted regression line. It is also
known as the coefficient of determination, or the coefficient of multiple determination for multiple
regression.

The definition of R-squared is fairly straightforward; it is the percentage of the response variable
variation that is explained by a linear model. Or:

R-squared = Explained variation / Total variation

R-squared is always between 0 and 100%:

0% indicates that the model explains none of the variability of the response data around its mean.

100% indicates that the model explains all the variability of the response data around its mean.

Graphical Representation of R-squared

Plotting fitted values by observed values graphically illustrates different R-squared values for
regression models.

The regression model on the left accounts for 38.0% of the variance while the one on the right
accounts for 87.4%. The more variance that is accounted for by the regression model the closer the
data points will fall to the fitted regression line. Theoretically, if a model could explain 100% of the
variance, the fitted values would always equal the observed values and, therefore, all the data points
would fall on the fitted regression line.

Key Limitations of R-squared

R-squared cannot determine whether the coefficient estimates and predictions are biased, which is
why you must assess the residual plots.

R-squared does not indicate whether a regression model is adequate. You can have a low R-squared
value for a good model, or a high R-squared value for a model that does not fit the data!
Are Low R-squared Values Inherently Bad?

No! There are two major reasons why it can be just fine to have low R-squared values.

In some fields, it is entirely expected that your R-squared values will be low. For example, any field
that attempts to predict human behavior, such as psychology, typically has R-squared values lower
than 50%. Humans are simply harder to predict than, say, physical processes.

Furthermore, if your R-squared value is low but you have statistically significant predictors, you can
still draw important conclusions about how changes in the predictor values are associated with
changes in the response value. Regardless of the R-squared, the significant coefficients still represent
the mean change in the response for one unit of change in the predictor while holding other predictors
in the model constant. Obviously, this type of information can be extremely valuable.

A low R-squared is most problematic when you want to produce predictions that are reasonably
precise (have a small enough prediction interval). How high should the R-squared be for prediction?
Well, that depends on your requirements for the width of a prediction interval and how much
variability is present in your data. While a high R-squared is required for precise predictions, it is not
sufficient by itself, as we shall see.

Are High R-squared Values Inherently Good?

No! A high R-squared does not necessarily indicate that the model has a good fit. That might be a
surprise, but look at the fitted line plot and residual plot below. The fitted line plot displays the
relationship between semiconductor electron mobility and the natural log of the density for real
experimental data.
The fitted line plot shows that these data follow a nice tight function and the R-squared is 98.5%,
which sounds great. However, look closer to see how the regression line systematically over- and
under-predicts the data (bias) at different points along the curve. You can also see patterns in the
Residuals versus Fits plot, rather than the randomness that you want to see. This indicates a bad fit,
and serves as a reminder as to why you should always check the residual plots.
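The same effect can be reproduced with a small synthetic example, which is an assumption made for illustration and is not the semiconductor data discussed above. A straight line is fitted to points that actually follow y = x^2; the R-squared is high, yet the signs of the residuals form a clear pattern instead of being randomly scattered.

```python
def fit_line(xs, ys):
    """Least-squares straight line; returns (a0, a1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a1 = sxy / sxx
    return mean_y - a1 * mean_x, a1


# Synthetic curved data: y = x^2 for x = 1..10
xs = list(range(1, 11))
ys = [x * x for x in xs]

a0, a1 = fit_line(xs, ys)
residuals = [y - (a0 + a1 * x) for x, y in zip(xs, ys)]

mean_y = sum(ys) / len(ys)
st = sum((y - mean_y) ** 2 for y in ys)
sr = sum(e * e for e in residuals)
r2 = 1.0 - sr / st

# Sign of each residual, in order of increasing x
signs = "".join("+" if e > 0 else "-" for e in residuals)
print(round(r2, 4))   # about 0.95
print(signs)          # ++------++
```

The line under-predicts at both ends and over-predicts in the middle, which is exactly the kind of systematic bias that a residual plot reveals and that R-squared alone hides.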

Residuals

The difference between the observed value of the dependent variable (y) and the predicted value ($\hat{y}$) is
called the residual (e). Each data point has one residual.

Residual = Observed value - Predicted value

$$e = y - \hat{y}$$

For a least-squares fit that includes an intercept term, both the sum and the mean of the residuals
are equal to zero. That is, $\sum e = 0$ and $\bar{e} = 0$.

Residual Plots

A residual plot is a graph that shows the residuals on the vertical axis and the independent variable on
the horizontal axis. If the points in a residual plot are randomly dispersed around the horizontal axis, a
linear regression model is appropriate for the data; otherwise, a non-linear model is more appropriate.
Non-Linear Regression
While simple and multiple linear regression functions are adequate for modeling a wide variety of
relationships between response variables and predictor variables, many situations require nonlinear
functions. Nonlinear regression is a form of regression analysis in which observational data are
modeled by a function which is a nonlinear combination of the model parameters and depends on one
or more independent variables. The data are fitted by a method of successive approximations.
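One common scheme of successive approximation is the Gauss-Newton iteration, which replaces the nonlinear model by its linearization around the current parameter estimates and solves a small linear least-squares problem at each step. The report does not say which scheme it has in mind, so the choice of Gauss-Newton, the model y = a * exp(b * x), the function name, the starting guess and the synthetic data below are all assumptions made for illustration.

```python
import math


def gauss_newton_exponential(xs, ys, a, b, max_iter=50, tol=1e-12):
    """Fit y = a * exp(b * x) by Gauss-Newton, starting from the guess (a, b)."""
    for _ in range(max_iter):
        # Accumulate J^T J (2x2) and J^T r for the current parameters
        jaa = jab = jbb = 0.0
        ra = rb = 0.0
        for x, y in zip(xs, ys):
            e = math.exp(b * x)
            r = y - a * e          # residual at this point
            da = e                 # d(model)/da
            db = a * x * e         # d(model)/db
            jaa += da * da
            jab += da * db
            jbb += db * db
            ra += da * r
            rb += db * r
        det = jaa * jbb - jab * jab
        if det == 0.0:
            raise ValueError("singular linearized system")
        # Solve the 2x2 system (J^T J) * step = J^T r by Cramer's rule
        step_a = (jbb * ra - jab * rb) / det
        step_b = (jaa * rb - jab * ra) / det
        a += step_a
        b += step_b
        if abs(step_a) + abs(step_b) < tol:
            break
    return a, b


# Synthetic, noise-free data generated from a = 2, b = 0.5
xs = [0.0, 1.0, 2.0, 3.0]
ys = [2.0 * math.exp(0.5 * x) for x in xs]
a_hat, b_hat = gauss_newton_exponential(xs, ys, a=1.5, b=0.4)
print(round(a_hat, 6), round(b_hat, 6))
```

Gauss-Newton needs a reasonable starting guess and can fail to converge from a poor one; a fit of the linearized (log-transformed) model is a common way to obtain such a guess.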
Conclusion
Regression analysis is a statistical tool for the investigation of relationships between variables.
Multiple regression analysis is a useful method for generating mathematical models where
several (more than two) variables are involved. A polynomial regression model consists of successive
power terms. Each model includes the highest order term plus all lower order terms (significant or
not). We can view polynomial regression as a particular case of multiple linear regression. Polynomial
models are an effective and flexible curve fitting technique. The most widely used method of
regression analysis is ordinary least squares analysis. This method fits a best-fit line
to the available data points, with the parameter estimates chosen to minimize the error sum of
squares. Fitting a regression model requires several assumptions. Estimation of the model parameters
requires the assumption that the errors are uncorrelated random variables with mean zero and constant
variance. Tests of hypotheses and interval estimation require that the errors are normally distributed.
There are a number of advanced statistical tests that can be used to examine whether or not these
assumptions are true for any given regression equation.
Bibliography
http://users.metu.edu.tr/csert/me310/me310_5_regression.pdf

https://en.wikipedia.org/wiki/Least_squares

http://www.sciencedirect.com/science/article/pii/S1877705812046085

https://en.wikipedia.org/wiki/Linear_regression

https://en.wikipedia.org/wiki/Polynomial_regression

http://www.itl.nist.gov/div898/handbook/pmd/section1/pmd141.htm

http://blog.minitab.com/blog/adventures-in-statistics-2/regression-analysis-how-do-i-interpret-r-squared-and-assess-the-goodness-of-fit