
https://people.revoledu.com/kardi/tutorial/Regression/OLS.html
Ordinary least squares (OLS) regression is a statistical method of analysis that
estimates the relationship between one or more independent variables and a
dependent variable; the method estimates the relationship by fitting a straight line
that minimizes the sum of the squared differences between the observed and
predicted values of the dependent variable. In this entry, OLS regression will
be discussed in the context of a bivariate model, that is, a model in which there is
only one independent variable ( X ) predicting a dependent variable ( Y ). However,
the logic of OLS regression is easily extended to the multivariate model in which
there are two or more independent variables.

Social scientists are often concerned with questions about the relationship between
two variables. These include the following: Among women, is there a relationship
between education and fertility? Do more-educated women have fewer children,
and less-educated women have more children? Among countries, is there a
relationship between gross national product (GNP) and life expectancy? Do
countries with higher levels of GNP have higher levels of life expectancy, and
countries with lower levels of GNP, lower levels of life expectancy? Among
countries, is there a positive relationship between employment opportunities and
net migration? Among people, is there a relationship between age and values of
baseline systolic blood pressure? (Lewis-Beck 1980; Vittinghoff et al. 2005).

As Michael Lewis-Beck notes, these examples are specific instances of the common
query, “What is the relationship between variable X and variable Y ?” (1980, p. 9).
If the relationship is assumed to be linear, bivariate
regression may be used to address this issue by fitting a straight line to a scatterplot
of observations on variable X and variable Y. The simplest statement of such a
relationship between an independent variable, labeled X, and a dependent variable,
labeled Y, may be expressed as a straight line in this formula:

Y = a + bX + e     (1)

where a is the intercept and indicates where the straight line intersects the Y-axis
(the vertical axis); b is the slope and indicates the degree of steepness of the
straight line; and e represents the error.

The error term indicates that the relationship predicted in the equation is not
perfect. That is, the straight line does not perfectly predict Y. This lack of a perfect
prediction is common in the social sciences. For instance, in terms of the education
and fertility relationship mentioned above, we would not expect all women with
exactly sixteen years of education to have exactly one child, and women with
exactly four years of education to have exactly eight children. But we would
expect that a woman with a lot of education would have fewer children than a
woman with a little education. Stated in another way, the number of children born
to a woman is likely to be a linear function of her education, plus some error.
Actually, in low-fertility societies, Poisson and negative binomial regression
methods are preferred over ordinary least squares regression methods for the
prediction of fertility (Poston 2002; Poston and McKibben 2003).
We first introduce a note about the notation used in this entry. In the social
sciences we almost always undertake research with samples drawn from larger
populations, say, a 1 percent random sample of the U.S. population. Greek letters
like α and β are used to denote the parameters (i.e., the intercept and slope values)
representing the relationship between X and Y in the larger population, whereas
lowercase Roman letters like a and b will be used to denote the parameters in the
sample.

When postulating relationships in the social sciences, linearity is often assumed,
but this may not always be the case. Indeed, a lot of relationships are not linear.
When one hypothesizes the form of a relationship between two variables, one
needs to be guided both by the theory being used, as well as by an inspection of the
data.

But given that we wish to use a straight line for relating variable Y, the dependent
variable, with variable X, the independent variable, there is a question about which
line to use. In any scatterplot of observations of X and Y values (see Figure 1),
there would be an infinite number of straight lines that might be used to represent
the relationship. Which line is the best line?

The chosen straight line needs to be the one that minimizes the amount of error
between the predicted values of Y and the actual values of Y. Specifically, for each
of the i th observations in the sample, if one were to square the difference between
the observed and predicted values of Y, and then sum these squared differences, the
best line would have the lowest sum of squared errors (SSE), represented as
follows:

SSE = Σ (Yi − Ŷi)²     (2)

where Ŷi is the predicted value of Y for the i th observation.

Ordinary least squares regression is a statistical method that produces the one
straight line that minimizes the total squared error.

Using the calculus, it may be shown that SSE is the lowest or the “least” amount
when the coefficients a and b are calculated with these formulas (Hamilton 1992,
p. 33):

b = Σ (Xi − X̄)(Yi − Ȳ) / Σ (Xi − X̄)²     (3)

a = Ȳ − bX̄     (4)

These values of a and b are known as least squares coefficients, or sometimes as
ordinary least squares coefficients or OLS coefficients.
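
As an illustration of how these formulas translate into computation, here is a minimal Python sketch of equations (3) and (4); the data below are made up for illustration and are not from any example in this entry:

```python
import numpy as np

def ols_coefficients(x, y):
    """Least squares intercept a and slope b for Y = a + bX + e (equations 3 and 4)."""
    x_mean, y_mean = x.mean(), y.mean()
    # Slope: sum of cross-products of deviations over sum of squared deviations in X
    b = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # Intercept: the fitted line passes through the point of means (X-bar, Y-bar)
    a = y_mean - b * x_mean
    return a, b

# Illustrative data only; not the China county data analyzed in this entry
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])
a, b = ols_coefficients(x, y)
print(a, b)
```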

We will now apply the least squares principles. We are interested in the extent to
which there is a relationship among the counties of China between the fertility rate
(the dependent variable) and the level of illiteracy (the independent variable).
China had 2,372 counties in 1982. We hypothesize that counties with populations
that are heavily illiterate will have higher fertility rates than those with populations
with low levels of illiteracy.

The dependent variable, Y, is the general fertility rate, GFR, that is, the number of
children born in 1982 per 1,000 women in the age group fifteen to forty-nine. The
independent variable, X, is the percentage of the population in the county in 1981
aged twelve or more who are illiterate.

The relationship may be graphed in the scatterplot in Figure 1. The association
between the GFR and the illiteracy rate appears to be linear and positive. Each dot
refers to a county of China; there are 2,372 dots on the scatterplot.

Equation (1) may be estimated using the least squares formulas for a and b in
equations (3) and (4). This produces the following:

Ŷ = 57.56 + 1.19 X     (5)

The OLS results in equation (5) indicate that the intercept value is 57.56, and the
slope value is 1.19. The intercept, or a, indicates the point where the regression
line “intercepts” the Y -axis. It tells the average value of Y when X = 0. Thus, in this
China dataset, the value of a indicates that a county with no illiterate person in the
population would have an expected fertility rate of 57.6 children per 1,000 women
aged fifteen to forty-nine.

The slope coefficient, or b, indicates the average change in Y associated with a
one-unit change in X. In the China example, b = 1.19, meaning that a 1 percent
increase in a county’s illiteracy rate is associated with an average GFR increase, or
gain, of 1.19 children per 1,000 women aged fifteen to forty-nine.

We would probably want to interpret this b coefficient in the other direction; that
is, it makes more sense to say that if we reduce the county’s illiteracy rate by 1
percent, this would result in an average reduction of 1.2 children per 1,000 women
aged fifteen to forty-nine. This kind of interpretation would be consistent with a
policy intervention that a government might wish to use; that is, a lower illiteracy
rate would tend to result in a lower fertility rate.
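
As a quick illustration of how the fitted line in equation (5) is used for prediction, here is a tiny Python sketch; the 30 percent illiteracy rate below is hypothetical, chosen only for the example:

```python
# Fitted line from equation (5): predicted GFR = 57.56 + 1.19 * (illiteracy rate)
a, b = 57.56, 1.19
illiteracy_rate = 30.0                  # hypothetical county illiteracy rate (percent)
print(a + b * illiteracy_rate)          # 93.26 children per 1,000 women aged 15-49
```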

The regression line may be plotted in the above scatterplot, as shown in Figure 2.

It is noted that while in general the relationship between illiteracy and fertility is
linear, there is a lot of error in the prediction of county fertility with a knowledge
of county illiteracy. Whereas some counties lie right on or close to the regression
line, and therefore, their illiteracy rates perfectly or near perfectly predict their
fertility rates, the predictions for other counties are not as good.

One way to appraise the overall predictive efficiency of the OLS model is
to “eyeball” the relationship as we have done above. How well does the above OLS
equation correspond with variation in the fertility rates of the counties? As we
noted above, the relationship appears to be positive and linear. A more accurate
statistical approach to address the question of how well the data points fit the
regression line is with the coefficient of determination ( R 2).
We start by considering the problem of predicting Y, the fertility rate, when we
have no other knowledge about the observations (the counties). That is, if we only
know the values of Y for the observations, then the best prediction of Y, the fertility
rate, is the mean of Y. It is believed that Carl Friedrich Gauss (1777–1855) was the
first to demonstrate that lacking any other information about a variable’s value for
any one subject, the arithmetic mean is the most probable value (Gauss [1809]
2004, p. 244).

But if we guess the mean of Y for every case, we will have lots of poor predictions
and lots of error. When we have information about the values of X, predictive
efficiency may be improved, as long as X has a relationship with Y. “The question
then is, how much does this knowledge of X improve our prediction
of Y ?” (Lewis-Beck 1980, p. 20).

First, consider the sum of the squared differences of each observation’s value
on Y from the mean of Y. This is the total sum of squares (TSS) and represents the
total amount of statistical variation in Y, the dependent variable.

Values on X are then introduced for all the observations (the Chinese counties),
and the OLS regression equation is estimated. The regression line is plotted (as in
the scatterplot in Figure 2), and the actual values of Y for all the observations are
compared to their predicted values of Y. The sum of the squared differences
between the predicted values of Y and the mean of Y is the explained sum of
squares (ESS), sometimes referred to as the model sum of squares. This represents
the amount of the total variation in Y that is accounted for by X. The difference
between TSS and ESS is the amount of the variation in Y that is not explained
by X, known as the residual sum of squares (RSS).

The coefficient of determination (R2) is:

R2 = ESS / TSS = 1 − (RSS / TSS)

The coefficient of determination, when multiplied by 100, represents the
percentage of the variation in Y (the fertility rates of the Chinese counties) that is
accounted for by X (the illiteracy rates of the counties). R2 values range from 0 to
+1. If R2 = 1.0, the X variable perfectly accounts for the variation in Y. Alternately,
when R2 = 0 (in this case the slope of the line, b, would also equal 0), the X variable
does not account for any of the variation in Y (Vittinghoff et al. 2005, p. 44; Lewis-
Beck 1980, pp. 21–22).
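
A short Python sketch of this decomposition, assuming y holds the observed values and y_hat the values predicted by the fitted line (the variable names are illustrative, not from the original entry):

```python
import numpy as np

def coefficient_of_determination(y, y_hat):
    """R-squared: the share of the variation in Y accounted for by X."""
    y_mean = y.mean()
    tss = np.sum((y - y_mean) ** 2)      # total sum of squares
    ess = np.sum((y_hat - y_mean) ** 2)  # explained (model) sum of squares
    rss = np.sum((y - y_hat) ** 2)       # residual sum of squares; TSS = ESS + RSS for an OLS fit
    return ess / tss                     # equivalently, 1 - rss / tss for an OLS fit with intercept

# Usage: given observed y and predictions y_hat = a + b * x from the fitted line
# r2 = coefficient_of_determination(y, y_hat)
```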

SEE ALSO Cliometrics; Least Squares, Three-Stage; Least Squares, Two-Stage;
Linear Regression; Logistic Regression; Methods, Quantitative; Probabilistic
Regression; Regression; Regression Analysis; Social Science; Statistics in the
Social Sciences; Tobit
BIBLIOGRAPHY

Gauss, Carl Friedrich. [1809] 2004. Theory of Motion of the Heavenly Bodies Moving
About the Sun in Conic Sections: A Translation of Theoria Motus. Mineola, NY: Dover.

Hamilton, Lawrence C. 1992. Regression with Graphics: A Second Course in Applied
Statistics. Pacific Grove, CA: Brooks/Cole.

Lewis-Beck, Michael S. 1980. Applied Regression: An Introduction. Beverly Hills, CA: Sage.

Poston, Dudley L., Jr. 2002. The Statistical Modeling of the Fertility of Chinese Women.
Journal of Modern Applied Statistical Methods 1 (2): 387–396.

Poston, Dudley L., Jr., and Sherry L. McKibben. 2003. Zero-inflated Count Regression
Models to Estimate the Fertility of U.S. Women. Journal of Modern Applied Statistical
Methods 2 (2): 371–379.

Vittinghoff, Eric, David V. Glidden, Stephen C. Shiboski, and Charles E. McCulloch. 2005.
Regression Methods in Biostatistics: Linear, Logistic, Survival, and Repeated Measures
Models. New York: Springer.

Least Squares Regression

Line of Best Fit

Imagine you have some points, and want to have a line that best fits
them like this:

We can place the line "by eye": try to have the line as close as possible to
all points, and a similar number of points above and below the line.

But for better accuracy let's see how to calculate the line using Least
Squares Regression.

The Line

Our aim is to calculate the values m (slope) and b (y-intercept) in the equation of
a line:

y = mx + b

Where:

 y = how far up
 x = how far along
 m = Slope or Gradient (how steep the line is)
 b = the Y Intercept (where the line crosses the Y axis)

Steps

To find the line of best fit for N points:

Step 1: For each (x,y) point calculate x² and xy

Step 2: Sum all x, y, x² and xy, which gives us Σx, Σy, Σx² and Σxy ( Σ
means "sum up" )

Step 3: Calculate Slope m:

m = (N Σ(xy) − Σx Σy) / (N Σ(x²) − (Σx)²)

(N is the number of points.)

Step 4: Calculate Intercept b:

b = (Σy − m Σx) / N

Step 5: Assemble the equation of a line

y = mx + b

Done!
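
The steps above translate directly into a short Python sketch; it is checked here against Sam's data from the example that follows:

```python
def least_squares_line(points):
    """Return (m, b) for y = mx + b using the summation formulas above."""
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_x2 = sum(x * x for x, _ in points)
    sum_xy = sum(x * y for x, y in points)
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)  # Step 3
    b = (sum_y - m * sum_x) / n                                   # Step 4
    return m, b

# Sam's sunshine vs ice cream data from the example below
m, b = least_squares_line([(2, 4), (3, 5), (5, 7), (7, 10), (9, 15)])
print(m, b)  # about 1.5183 and 0.3049
```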

Example

Let's have an example to see how to do it!

Example: Sam found how many hours of sunshine vs how many ice
creams were sold at the shop from Monday to Friday:

"x" "y"
Hours of Ice Creams
Sunshine Sold

2 4

3 5

5 7

7 10

9 15

Let us find the best m (slope) and b (y-intercept) that suits that data
y = mx + b

Step 1: For each (x,y) calculate x² and xy:

x y x² xy

2 4 4 8

3 5 9 15

5 7 25 35

7 10 49 70

9 15 81 135

Step 2: Sum x, y, x² and xy (gives us Σx, Σy, Σx² and Σxy):

x y x² xy

2 4 4 8

3 5 9 15

5 7 25 35

7 10 49 70

9 15 81 135

Σx: 26   Σy: 41   Σx²: 168   Σxy: 263

Also N (number of data values) = 5

Step 3: Calculate Slope m:

m = (N Σ(xy) − Σx Σy) / (N Σ(x²) − (Σx)²)

= (5 × 263 − 26 × 41) / (5 × 168 − 26²)

= (1315 − 1066) / (840 − 676)

= 249 / 164 = 1.5183...

Step 4: Calculate Intercept b:

b = (Σy − m Σx) / N

= (41 − 1.5183 × 26) / 5

= 0.3049...

Step 5: Assemble the equation of a line:

y = mx + b

y = 1.518x + 0.305

Let's see how it works out:

x y y = 1.518x + 0.305 error

2 4 3.34 −0.66

3 5 4.86 −0.14

5 7 7.89 0.89

7 10 10.93 0.93

9 15 13.97 −1.03

Here are the (x,y) points and the line y = 1.518x + 0.305 on a graph:
Nice fit!

Sam hears the weather forecast which says "we expect 8 hours of sun
tomorrow", so he uses the above equation to estimate that he will sell

y = 1.518 x 8 + 0.305 = 12.45 Ice Creams

Sam makes fresh waffle cone mixture for 14 ice creams just in case. Yum.

How does it work?

It works by making the total of the square of the errors as small as possible (that
is why it is called "least squares"):

The straight line minimizes the sum of squared errors

So, when we square each of those errors and add them all up, the total is
as small as possible.

You can imagine (but not accurately) each data point connected to a
straight bar by springs:

Boing!
Ordinary least squares regression (OLSR) is a generalized linear modeling
technique. It is used to estimate the unknown parameters in a linear regression
model, with the goal of minimizing the sum of the squared differences between the
observed values of the dependent variable and the values predicted from the
explanatory variables.
Ordinary least squares regression is also known as ordinary least squares or
least squared errors regression.

Techopedia explains Ordinary Least Squares Regression (OLSR)


Invented in 1795 by Carl Friedrich Gauss, it is considered one of the earliest
known general prediction methods. OLSR describes the relationship between a
dependent variable (what is aimed to be explained or predicted) and its one or
more independent variables (explanatory variable). OLSR application can be
found in myriad fields such as psychology, social sciences, medicine, economics
and finance.
There are two kinds of relationship that may occur: linear and curvilinear. A
linear relationship is a straight line drawn through the central tendency of the
points, whereas a curvilinear relationship is a curved line. Associations between
the variables are depicted using a scatterplot. The relationship can be positive or
negative, and it also varies in strength.
At a basic level, OLSR can be easily understood even by non-mathematicians, and
its solutions can be easily interpreted. Its appeal also stems from how readily it can
be computed with the linear algebra routines built into modern computers, so it can
be applied efficiently to problems with hundreds of independent variables and tens
of thousands of data points.
OLSR is often used in econometrics, as it provides the best linear unbiased
estimator (BLUE) under the Gauss-Markov assumptions. Econometrics is the
branch of economics in which statistical methods are applied to economic data; it
aims to extract simple relationships from large amounts of data. The same
statistical algorithm is also used in machine learning and predictive analytics to
predict outcomes based on dynamically changing variables.
EXCEL 2007: Multiple Regression

A. Colin Cameron, Dept. of Economics, Univ. of Calif. - Davis

This January 2009 help sheet gives information on

 Multiple regression using the Data Analysis Add-in.
 Interpreting the regression statistics table.
 Interpreting the ANOVA table (often this is skipped).
 Interpreting the regression coefficients table.
 Confidence intervals for the slope parameters.
 Testing for statistical significance of coefficients.
 Testing a hypothesis on a slope parameter.
 Testing overall significance of the regressors.
 Predicting y given values of the regressors.
 Excel limitations.

There is little extra to know beyond regression with one explanatory variable.
The main addition is the F-test for overall fit.

MULTIPLE REGRESSION USING THE DATA ANALYSIS ADD-IN

This requires the Data Analysis Add-in: see Excel 2007: Access and Activating
the Data Analysis Add-in
The data used are in carsdata.xls

We then create a new variable in cells C2:C6, cubed household size, as a
regressor.
Then in cell C1 give the heading CUBED HH SIZE.
(It turns out that for these data squared HH SIZE has a coefficient of exactly 0.0,
so the cube is used instead.)

The spreadsheet cells A1:C6 should look like:

We have regression with an intercept and the regressors HH SIZE and CUBED
HH SIZE
The population regression model is: y = β1 + β2 x2 + β3 x3 + u
It is assumed that the error u is independent with constant variance
(homoskedastic) - see EXCEL LIMITATIONS at the bottom.

We wish to estimate the regression line: y = b1 + b2 x2 + b3 x3

We do this using the Data analysis Add-in and Regression.

The only change over one-variable regression is to include more than one column
in the Input X Range.
Note, however, that the regressors need to be in contiguous columns (here
columns B and C).
If this is not the case in the original data, then columns need to be copied to get
the regressors in contiguous columns.

Hitting OK we obtain

The regression output has three components:

 Regression statistics table
 ANOVA table
 Regression coefficients table.

INTERPRET REGRESSION STATISTICS TABLE

This is the following output. Of greatest interest is R Square.

                               Explanation
Multiple R          0.895828   R = square root of R2
R Square            0.802508   R2
Adjusted R Square   0.605016   Adjusted R2, used if more than one x variable
Standard Error      0.444401   Sample estimate of the standard deviation of the error u
Observations        5          Number of observations used in the regression (n)

The above gives the overall goodness-of-fit measures:


R2 = 0.8025
Correlation between y and y-hat is 0.8958 (when squared gives 0.8025).
Adjusted R2 = R2 - (1-R2 )*(k-1)/(n-k) = .8025 - .1975*2/2 = 0.6050.

The standard error here refers to the estimated standard deviation of the error
term u.
It is sometimes called the standard error of the regression. It equals sqrt(SSE/(n-
k)).
It is not to be confused with the standard error of y itself (from descriptive
statistics) or with the standard errors of the regression coefficients given below.

R2 = 0.8025 means that 80.25% of the variation of yi around ybar (its mean) is
explained by the regressors x2i and x3i.
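
As a small check, the adjusted R2 formula quoted above can be reproduced in a few lines of Python (a sketch using the n, k, and R2 values from this output):

```python
def adjusted_r2(r2, n, k):
    """Adjusted R2 using the formula quoted above; k counts the intercept as a regressor."""
    return r2 - (1 - r2) * (k - 1) / (n - k)

print(adjusted_r2(0.802508, n=5, k=3))   # about 0.6050, matching the Excel output
```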

INTERPRET ANOVA TABLE

An ANOVA table is given. This is often skipped.


df SS MS F Significance F

Regression 2 1.6050 0.8025 4.0635 0.1975

Residual 2 0.3950 0.1975

Total 4 2.0

The ANOVA (analysis of variance) table splits the sum of squares into its
components.

Total sum of squares
= Residual (or error) sum of squares + Regression (or explained) sum of squares.

Thus Σi (yi − ybar)² = Σi (yi − yhati)² + Σi (yhati − ybar)²


where yhati is the value of yi predicted from the regression line
and ybar is the sample mean of y.

For example:
R2 = 1 - Residual SS / Total SS (general formula for R2)
= 1 - 0.3950 / 2.0 (from the Residual and Total SS in the ANOVA table)
= 0.8025 (which equals the R2 given in the regression statistics table).

The column labeled F gives the overall F-test of H0: β2 = 0 and β3 = 0 versus Ha:
at least one of β2 and β3 does not equal zero.
Aside: Excel computes F as:
F = [Regression SS/(k-1)] / [Residual SS/(n-k)] = [1.6050/2] / [.39498/2] = 4.0635.

The column labeled Significance F has the associated P-value.
Since 0.1975 > 0.05, we do not reject H0 at significance level 0.05.

Note: Significance F in general = FDIST(F, k-1, n-k) where k is the number of
regressors including the intercept.
Here FDIST(4.0635,2,2) = 0.1975.
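
Outside Excel, the same Significance F can be reproduced from the F distribution; a minimal sketch using SciPy, with the F-statistic and degrees of freedom taken from the ANOVA table above:

```python
from scipy import stats

f_stat, k, n = 4.0635, 3, 5
# Right-tail probability of an F(k-1, n-k) distribution at the observed F-statistic
p_value = stats.f.sf(f_stat, k - 1, n - k)
print(p_value)   # about 0.1975, the Significance F reported by Excel
```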

INTERPRET REGRESSION COEFFICIENTS TABLE

The regression output of most interest is the following table of coefficients and
associated output:

              Coefficient  St. error  t Stat  P-value  Lower 95%  Upper 95%

Intercept         0.89655    0.76440  1.1729   0.3616    -2.3924     4.1855

HH SIZE           0.33647    0.42270  0.7960   0.5095    -1.4823     2.1552

CUBED HH SIZE     0.00209    0.01311  0.1594   0.8880    -0.0543     0.0585

Let βj denote the population coefficient of the jth regressor (intercept, HH SIZE
and CUBED HH SIZE).

Then

 Column "Coefficient" gives the least squares estimates bj of βj.
 Column "Standard error" gives the standard errors (i.e. the estimated
standard deviations) of the least squares estimates bj of βj.
 Column "t Stat" gives the computed t-statistic for H0: βj = 0 against Ha:
βj ≠ 0.

This is the coefficient divided by the standard error. It is compared to a t
distribution with (n-k) degrees of freedom, where here n = 5 and k = 3.

 Column "P-value" gives the p-value for the test of H0: βj = 0 against Ha:
βj ≠ 0.

This equals Pr{|t| > t-Stat}, where t is a t-distributed random variable with
n-k degrees of freedom and t-Stat is the computed value of the t-statistic
given in the previous column.
Note that this p-value is for a two-sided test. For a one-sided test, divide
this p-value by 2 (also checking the sign of the t-Stat).

 Columns "Lower 95%" and "Upper 95%" values define a 95% confidence
interval for βj.

A simple summary of the above output is that the fitted line is

y = 0.8966 + 0.3365*x + 0.0021*z

where x is HH SIZE and z is CUBED HH SIZE.
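
The t Stat and P-value columns can be reproduced from the coefficients and standard errors above; a minimal Python/SciPy sketch (the numeric values are copied from the Excel output):

```python
from scipy import stats

n, k = 5, 3
# Coefficients and standard errors from the Excel output above
rows = {"Intercept": (0.89655, 0.76440),
        "HH SIZE": (0.33647, 0.42270),
        "CUBED HH SIZE": (0.00209, 0.01311)}
for name, (coef, se) in rows.items():
    t_stat = coef / se                               # "t Stat" column
    p_value = 2 * stats.t.sf(abs(t_stat), n - k)     # two-sided "P-value" column
    print(name, round(t_stat, 4), round(p_value, 4))
```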

CONFIDENCE INTERVALS FOR SLOPE COEFFICIENTS

The 95% confidence interval for the slope coefficient β2 from the Excel output is
(-1.4823, 2.1552).

Excel computes this as
b2 ± t_.025(2) × se(b2)
= 0.33647 ± TINV(0.05, 2) × 0.42270
= 0.33647 ± 4.303 × 0.42270
= 0.33647 ± 1.8189
= (-1.4823, 2.1552).
Other confidence intervals can be obtained.
For example, to find 99% confidence intervals: in the Regression dialog box (in
the Data Analysis Add-in),
check the Confidence Level box and set the level to 99%.
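
The same 95% interval can be checked outside Excel; a short SciPy sketch, using b2 and its standard error from the coefficients table above:

```python
from scipy import stats

b2, se_b2, n, k = 0.33647, 0.42270, 5, 3
t_crit = stats.t.ppf(0.975, n - k)                 # two-sided 95% critical value, about 4.303
print(b2 - t_crit * se_b2, b2 + t_crit * se_b2)    # about (-1.4823, 2.1552)
```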

TEST HYPOTHESIS OF ZERO SLOPE COEFFICIENT ("TEST OF
STATISTICAL SIGNIFICANCE")

The coefficient of HH SIZE has an estimated standard error of 0.4227, a t-statistic
of 0.7960 and a p-value of 0.5095.
It is therefore statistically insignificant at significance level α = .05 as p > 0.05.

The coefficient of CUBED HH SIZE has an estimated standard error of 0.0131, a
t-statistic of 0.1594 and a p-value of 0.8880.
It is therefore statistically insignificant at significance level α = .05 as p > 0.05.

There are 5 observations and 3 regressors (the intercept, HH SIZE and CUBED
HH SIZE), so we use t(5-3) = t(2).
For example, for HH SIZE, p = TDIST(0.796, 2, 2) = 0.5095.

TEST HYPOTHESIS ON A REGRESSION PARAMETER

Here we test whether HH SIZE has coefficient β2 = 1.0.

Example: H0: β2 = 1.0 against Ha: β2 ≠ 1.0 at significance level α = .05.

Then
t = (b2 - H0 value of β2) / (standard error of b2 )
= (0.33647 - 1.0) / 0.42270
= -1.569.
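
The same computation can be scripted; a short Python/SciPy sketch using the values from the output above (it reproduces the t-statistic here and the two-sided p-value discussed next):

```python
from scipy import stats

b2, se_b2, beta2_null, n, k = 0.33647, 0.42270, 1.0, 5, 3
t_stat = (b2 - beta2_null) / se_b2               # about -1.569
p_value = 2 * stats.t.sf(abs(t_stat), n - k)     # about 0.257 (two-sided)
print(t_stat, p_value)
```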

Using the p-value approach

 p-value = TDIST(1.569, 2, 2) = 0.257. [Here n=5 and k=3 so n-k=2.]
 Do not reject the null hypothesis at level .05 since the p-value is > 0.05.

Using the critical value approach

 We computed t = -1.569
 The critical value is t_.025(2) = TINV(0.05,2) = 4.303. [Here n=5 and k=3
so n-k=2].
 So do not reject null hypothesis at level .05 since t = |-1.569| < 4.303.
OVERALL TEST OF SIGNIFICANCE OF THE REGRESSION
PARAMETERS

We test H0: β2 = 0 and β3 = 0 versus Ha: at least one of β2 and β3 does not equal
zero.

From the ANOVA table the F-test statistic is 4.0635 with p-value of 0.1975.
Since the p-value is not less than 0.05 we do not reject the null hypothesis that
the regression parameters are zero at significance level 0.05.
Conclude that the parameters are jointly statistically insignificant at
significance level 0.05.

Note: Significance F in general = FDIST(F, k-1, n-k) where k is the number of
regressors including the intercept.
Here FDIST(4.0635,2,2) = 0.1975.

PREDICTED VALUE OF Y GIVEN REGRESSORS

Consider the case where x = 4, in which case CUBED HH SIZE = x^3 = 4^3 = 64.

yhat = b1 + b2 x2 + b3 x3 = 0.8966 + 0.3365×4 + 0.0021×64 = 2.377
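
A quick check of this prediction in Python, using the rounded coefficients from the fitted line above:

```python
b1, b2, b3 = 0.8966, 0.3365, 0.0021   # rounded coefficients from the fitted line above
x = 4
y_hat = b1 + b2 * x + b3 * x ** 3     # CUBED HH SIZE enters as x cubed (64)
print(y_hat)                          # about 2.377
```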

EXCEL LIMITATIONS

Excel restricts the number of regressors (only up to 16 regressors ??).

Excel requires that all the regressor variables be in adjoining columns.
You may need to move columns to ensure this.
e.g. If the regressors are in columns B and D you need to copy at least one of
columns B and D so that they are adjacent to each other.

Excel standard errors and t-statistics and p-values are based on the assumption
that the error is independent with constant variance (homoskedastic).
Excel does not provide alternatives, such as heteroskedastic-robust or
autocorrelation-robust standard errors and t-statistics and p-values.
More specialized software such as STATA, EVIEWS, SAS, LIMDEP, PC-TSP, ...
is needed.

For further information on how to use Excel go to
http://cameron.econ.ucdavis.edu/excel/excel.html

Linear regression is a statistical analysis for predicting the value of a quantitative
variable. Based on a set of independent variables, we try to estimate the magnitude
of a dependent variable, which is the outcome variable.

Suppose you want to predict the amount of ice cream sales you would make based
on the temperature of the day. You can then plot a regression line that passes
through the data points.

This regression line is the best fit line, the one that predicts ice cream sales with
the best possible accuracy. One of the methods used to draw this line is the least
squares method.

Least squares is one of the methods used to find the best fit line for a dataset in
linear regression. The most common application is to create a straight line that
minimizes the sum of the squares of the errors arising from the differences between
the observed values and the values anticipated from the model. Least-squares
problems fall into two categories, linear and nonlinear least squares, depending on
whether or not the residuals are linear in all unknowns.

Follow the steps below to find the best fit line for a set of ordered pairs (x1,y1),
(x2,y2), ...

Step 1: Calculate the mean of the x values (x̄) and the mean of the y values (ȳ).

Step 2: Calculate the slope of the best fit line using the following formula:

m = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)²

Step 3: Calculate the y-intercept of the line using the formula below:

b = ȳ − m x̄

The best fit line is called the regression line, and it is the line that minimizes the
sum of the squared vertical distances from the data points to the line.
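
For comparison, NumPy's polyfit returns the same least squares line without the hand computation; a minimal sketch with made-up temperature and sales data (the numbers are hypothetical):

```python
import numpy as np

# Hypothetical temperature (x) and ice cream sales (y) data, for illustration only
x = np.array([20.0, 22.0, 25.0, 28.0, 30.0])
y = np.array([120.0, 135.0, 162.0, 181.0, 200.0])

slope, intercept = np.polyfit(x, y, 1)   # degree-1 polynomial fit = least squares line
print(slope, intercept)

# The same values follow from the step-by-step formulas above:
# slope = sum((x - x.mean()) * (y - y.mean())) / sum((x - x.mean()) ** 2)
# intercept = y.mean() - slope * x.mean()
```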
