• Regression: finding the best equation that reproduces a data set

• Linear Regression: Regression method applied with a linear model (straight line)

• Uses

– Prediction of new X,Y values

– Understanding data behavior

• Verification of hypotheses/physical laws

Regression

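• The Linear Model

Y = mX + b

Y = Dependent variable

X = Independent variable

m = slope = ΔY/ΔX

b = y-intercept (point where the line crosses the Y-axis)

[Figure: straight line through the points X1=1, Y1=2.4 and X2=20, Y2=10, with ΔX and ΔY marked; X-axis from 0 to 25, Y-axis from 0 to 12.]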
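As a small worked example, the slope and y-intercept of the line through the two points labelled on the plot (X1=1, Y1=2.4 and X2=20, Y2=10) can be computed directly. This is only a sketch; the function name line_through is made up for illustration.

```python
def line_through(p1, p2):
    """Return (m, b) for the straight line Y = m*X + b through two points."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        raise ValueError("vertical line: the slope is undefined")
    m = (y2 - y1) / (x2 - x1)  # slope = ΔY / ΔX
    b = y1 - m * x1            # solve Y1 = m*X1 + b for b
    return m, b


# The two points labelled on the slide's plot
m, b = line_through((1, 2.4), (20, 10))
print(f"m = {m:.2f}, b = {b:.2f}")  # m = 0.40, b = 2.00
```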

Regression

• Fitting the data: finding the equation for the straight

line that does the best job of reproducing the data.

[Figure: Average Income versus % with a College Degree (by State). Scatter plot; x-axis: College Degree or Higher, 10 to 35; y-axis: Average Income Level ($ per year), 15,000 to 40,000.]

Regression

• Residual: Difference between measured and

calculated Y-values

[Figure: Average Income Level ($ per year) versus College Degree or Higher; x-axis from 15 to 20, y-axis from 22,000 to 26,000.]

Regression Analysis

• Use the least squares method to “best fit” a straight line through the data points.

• A straight line is described by its slope and “y”-intercept in an x-y plot.

• Need to determine the numerical values of the slope and the “y”-intercept from the data.

• This is equivalent to adding a trendline to your scatter plot in Excel.

Regression Analysis

• The least squares method consists of defining a difference, called the residual, between a data point and the regression line at the same measured “x” value.

• Then add up the squared residuals for all data points.

• Adjust the slope and the “y”-intercept of the regression line so that the sum of squared residuals, called the regression error, has the smallest value.
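The regression error can be sketched in a few lines of Python. The data set below is hypothetical (it does not come from the slides), and the values m = 1.96 and b = 0.22 are the least-squares slope and intercept for that particular data. Any other choice of slope or intercept should give a larger sum of squared residuals.

```python
def sum_squared_residuals(xs, ys, m, b):
    """Regression error: sum of (measured y - calculated y)^2 over all points."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data set, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2.2, 4.1, 6.2, 7.9, 10.1]

# Least-squares line for this data: m = 1.96, b = 0.22
best = sum_squared_residuals(xs, ys, 1.96, 0.22)

# Moving the slope or the intercept away from those values increases the error
steeper = sum_squared_residuals(xs, ys, 2.10, 0.22)
shifted = sum_squared_residuals(xs, ys, 1.96, 0.50)

print(best, steeper, shifted)
```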

Regression Analysis

• The covariance appears in the calculation

of the correlation coefficient between the

measurements of two variables.

• Let us denote the two variables as “x” and

“y”.

• Their measurements are the “x” data set

and the “y” data set.
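The relation between covariance and the correlation coefficient can be sketched as follows. This assumes the sample covariance (dividing by n - 1) and the Pearson correlation coefficient r = cov(x, y) / (s_x s_y); the slides do not say which convention they use, and the data is hypothetical.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def covariance(xs, ys):
    """Sample covariance of two equal-length data sets (divides by n - 1)."""
    x_bar, y_bar = mean(xs), mean(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (len(xs) - 1)


def correlation(xs, ys):
    """Pearson correlation coefficient: covariance scaled by both standard deviations."""
    # covariance(v, v) is the variance of v
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))


# Hypothetical "x" and "y" data sets, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2.2, 4.1, 6.2, 7.9, 10.1]

cov_xy = covariance(xs, ys)
r = correlation(xs, ys)
print(cov_xy, r)
```

A value of r close to +1 or -1 indicates that the points lie close to a straight line.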

Regression Analysis

• The slope of the regression line is given by the ratio of the covariance between the “x” and “y” data sets to the variance of the “x” data set.

• You then use the equation of the line to determine the y-intercept. You MUST use the mean of x and the mean of y in this equation: the regression line always passes through the point (mean of x, mean of y), whereas your individual data points are likely not on the regression line.
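The two steps above can be sketched in Python: slope = covariance / variance, then the intercept from the means. The function name fit_line and the data set are made up for illustration.

```python
def fit_line(xs, ys):
    """Least-squares slope and y-intercept for the line y = m*x + b."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sums of products of deviations; the common 1/(n - 1) factor of the
    # covariance and the variance cancels in the ratio, so it is omitted.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    if s_xx == 0:
        raise ValueError("all x values are identical: the slope is undefined")
    m = s_xy / s_xx        # slope = covariance(x, y) / variance(x)
    b = y_bar - m * x_bar  # the line passes through (mean of x, mean of y)
    return m, b


# Hypothetical data set, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2.2, 4.1, 6.2, 7.9, 10.1]

m, b = fit_line(xs, ys)
print(f"m = {m:.2f}, b = {b:.2f}")  # m = 1.96, b = 0.22
```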

Regression Analysis

• Once we have determined the slope and the “y”-intercept of the regression line, we have a mathematical relation that ties the “x” variable to the “y” variable.

• We can use this relation to predict the value of “y” for an “x” value that is not in the data set.

Regression Analysis

• Interpolation: the process by which we use the regression line to predict a value of the “y” variable for a value of the “x” variable that is not one of the data points but is within the range of the data set.

• The predicted (“x”, “y”) point lies on the regression line.

Regression Analysis

• Extrapolation: the process by which we use the regression line to predict a value of the “y” variable for a value of the “x” variable that is outside of the range of the data set.

• The predicted (“x”, “y”) point also lies on the regression line, but outside of the range of the data set.
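The distinction between interpolation and extrapolation can be illustrated with a short sketch. The function name predict, the data, and the line y = 1.96x + 0.22 are hypothetical; the only point is that the same equation is used in both cases, and what differs is whether the new “x” lies inside the measured range.

```python
def predict(x_new, xs, m, b):
    """Return (predicted y, kind of prediction) for the line y = m*x + b."""
    y_new = m * x_new + b
    # Inside the measured x range: interpolation; outside it: extrapolation.
    inside = min(xs) <= x_new <= max(xs)
    kind = "interpolation" if inside else "extrapolation"
    return y_new, kind


# Hypothetical measured x values and regression line, for illustration only
xs = [1, 2, 3, 4, 5]
m, b = 1.96, 0.22

y_in, kind_in = predict(2.5, xs, m, b)    # 2.5 is between 1 and 5
y_out, kind_out = predict(8.0, xs, m, b)  # 8.0 is beyond the largest x

print(kind_in, y_in)
print(kind_out, y_out)
```

Extrapolated values deserve more caution, since nothing in the data confirms that the straight-line relation continues to hold outside the measured range.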

Tricks of the Trade

• A curve can be partitioned into sections, with a different curve “best” fitted in each section.

• Use scaling as a means to increase the accuracy of the “fitted” curve.

Multivariate Analysis

Regression

• Prediction: Once the best-fit line has been determined, the equation can be used to predict new values of Y for any given X and vice versa. (Interpolation/Extrapolation)

y = 772.03x + 10810

If 20% of a state's population has a college degree, then they can expect an average income level of

y = 772.03(20) + 10810 ≈ $26,250

If a state has an average income level of $30,000, what percentage of its population has a college degree?

x = (30,000 – 10810)/772.03 ≈ 24.9%
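The two calculations above can be reproduced with a short sketch that uses the fitted equation y = 772.03x + 10810 in both directions. The function names are made up for illustration.

```python
SLOPE = 772.03       # $ per year for each percentage point with a college degree
INTERCEPT = 10810.0  # $ per year


def income_from_percent(percent):
    """Predict average income (y) from the % with a college degree (x)."""
    return SLOPE * percent + INTERCEPT


def percent_from_income(income):
    """Invert the line: the % with a college degree (x) for a given income (y)."""
    return (income - INTERCEPT) / SLOPE


income = income_from_percent(20)
percent = percent_from_income(30_000)

print(f"${income:,.2f}")  # $26,250.60
print(f"{percent:.1f}%")  # 24.9%
```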

Multivariate Analysis

– SLOPE() - Returns the slope when passed X, Y data.

– INTERCEPT() - Returns the intercept when passed X, Y data.

– LINEST() - Returns the slope and intercept when passed X, Y data.

– TREND() - Returns predicted values along a linear trend when passed X, Y data.

– Trendline (from the Chart menu) - Returns the trendline, its equation, and the R-squared value (the square of the correlation coefficient) for a set of X,Y data.

