
# Regression

• Regression: Mathematical method for determining the
best equation that reproduces a data set
• Linear Regression: Regression method applied with
a linear model (straight line)
• Uses
– Prediction of new X,Y values
– Understanding data behavior
• Verification of hypotheses/physical laws
Regression
• The Linear Model

Y = mX + b

Y = Dependent variable
X = Independent variable
m = slope = ΔY/ΔX
b = y-intercept (point where line crosses y-axis at x=0)

[Figure: straight line plotted as Y versus X, with ΔX and ΔY marked and the points X1=1, Y1=2.4 and X2=20, Y2=10 labelled]
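As a small worked example of the linear model, the sketch below computes the slope and intercept from the two points labelled on the slide. It assumes both points lie exactly on the plotted line; the helper name `line` is illustrative only.

```python
# Two points labelled on the slide (assumed to lie on the line).
x1, y1 = 1.0, 2.4
x2, y2 = 20.0, 10.0

# Slope: change in Y divided by change in X.
m = (y2 - y1) / (x2 - x1)

# Intercept: solve Y = mX + b for b using either point.
b = y1 - m * x1


def line(x):
    """Evaluate the straight line Y = mX + b."""
    return m * x + b


print(f"m = {m:.3f}, b = {b:.3f}")  # prints: m = 0.400, b = 2.000
```

Under that assumption the line is Y = 0.4X + 2, which crosses the y-axis at Y = 2.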

Regression
• Fitting the data: finding the equation for the straight
line that does the best job of reproducing the data.
[Figure: Average Income versus % with a College Degree (by State). X-axis: Percentage of Population with College Degree or Higher; Y-axis: Average Income Level (\$ per year)]
Regression
• Residual: Difference between measured and
calculated Y-values

[Figure: Average Income versus % with a College Degree (by State), axes zoomed to 15 to 20% and \$22,000 to \$26,000. X-axis: Percentage of Population with College Degree or Higher; Y-axis: Average Income Level (\$ per year)]
Regression Analysis
• Use the least-squares method to “best fit” a
straight line through the data points.
• A straight line is described by its slope and
“y”-intercept in an x-y plot.
• Need to determine the numerical values of the
slope and the “y”-intercept from the data.
• This is equivalent to adding a trendline to your
scatter plot in Excel.
Regression Analysis
• The least-squares method consists of defining a
difference, called the residual, between the
regression line and a data point at each
measured “x” value.
• Then add up the squared residuals for all data
points.
• Adjust the slope and the “y”-intercept of the
regression line so that the sum of squared
residuals, called the regression error, has the
smallest value.
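The quantity being minimized can be sketched in a few lines of Python. The data values and candidate lines below are hypothetical, chosen only to show that different slope/intercept pairs give different regression errors; the function name is illustrative.

```python
def sum_squared_residuals(xs, ys, m, b):
    """Sum of squared residuals for the line y = m*x + b."""
    total = 0.0
    for x, y in zip(xs, ys):
        residual = y - (m * x + b)  # measured minus calculated y
        total += residual ** 2
    return total


# Hypothetical data that roughly follows y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

# Compare a few candidate lines; the least-squares line is the one
# with the smallest total over all possible (m, b).
for m, b in [(1.5, 2.0), (2.0, 1.0), (2.5, 0.0)]:
    print(m, b, round(sum_squared_residuals(xs, ys, m, b), 3))
```

Of the three candidates, the line with slope 2 and intercept 1 has by far the smallest regression error, which matches how the data were constructed.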
Regression Analysis
• The covariance appears in the calculation
of the correlation coefficient between the
measurements of two variables.
• Let us denote the two variables as “x” and
“y”.
• Their measurements are the “x” data set
and the “y” data set.
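For reference, the standard sample definitions can be written as follows. The slides do not say which normalization they use; the 1/(n-1) convention is assumed here, and the symbols (x̄, ȳ, s_x, s_y, r) are introduced only for this illustration.

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i ,
\qquad
\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i

\operatorname{cov}(x, y) = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})

s_x^2 = \operatorname{var}(x) = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2 ,
\qquad
s_y^2 = \operatorname{var}(y) = \frac{1}{n-1}\sum_{i=1}^{n} (y_i - \bar{y})^2

r = \frac{\operatorname{cov}(x, y)}{s_x \, s_y}
```

The correlation coefficient r is the covariance rescaled by the two standard deviations, so it always lies between -1 and 1.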
Regression Analysis
• The slope of the regression line is given by the
ratio of the covariance between the “x” and “y”
data sets to the variance of the “x” data set.
• You then use the equation of the line to
determine the y-intercept. You MUST use the
mean of x and the mean of y for this equation:
the regression line always passes through the
point of means, whereas your individual data
points are likely not on the regression line.
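The two steps above (slope from covariance over variance, then intercept from the means) can be sketched as follows. The data set is hypothetical and the function name `fit_line` is illustrative.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept via covariance / variance."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # The normalization factors of covariance and variance cancel in
    # the ratio, so plain sums of products are enough here.
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    m = s_xy / s_xx
    # The regression line passes through (x_mean, y_mean).
    b = y_mean - m * x_mean
    return m, b


# Hypothetical data that roughly follows y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]
m, b = fit_line(xs, ys)
print(f"slope = {m:.3f}, intercept = {b:.3f}")
```

For this data the fit gives a slope of about 1.97 and an intercept of about 1.09, close to the line the data were built around.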
Regression Analysis
• Once we have determined the slope and the
“y”-intercept of the regression line, we have a
mathematical relation that ties the “x”
variable to the “y” variable.
• We can use this relation to predict
values of “y” for “x” values that are
not in the data sets.
Regression Analysis
• Interpolation – the process by which we
use the regression line to predict a value
of the “y” variable for a value of the “x”
variable that is not one of the data points
but is within the range of the data set.
• The predicted (“x”, “y”) point lies on the
regression line.
Regression Analysis
• Extrapolation – the process by which we
use the regression line to predict a value
of the “y” variable for a value of the “x”
variable that is outside of the range of the
data set.
• The predicted (“x”, “y”) point also lies on the
regression line, but outside of the range of
the data set.
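The distinction between interpolation and extrapolation comes down to whether the requested x falls inside the range of the measured x values. A minimal sketch, using a hypothetical fitted line and hypothetical data (the function name `predict` is illustrative):

```python
def predict(x, m, b, x_data):
    """Predict y on the line and report whether x is inside the data range."""
    y = m * x + b
    if min(x_data) <= x <= max(x_data):
        kind = "interpolation"
    else:
        kind = "extrapolation"
    return y, kind


# Hypothetical measured x values and a hypothetical fitted line.
x_data = [1.0, 2.0, 3.0, 4.0, 5.0]
m, b = 2.0, 1.0

print(predict(2.5, m, b, x_data))  # (6.0, 'interpolation')
print(predict(8.0, m, b, x_data))  # (17.0, 'extrapolation')
```

Both predictions lie on the same line; the label only records how far the prediction is supported by measured data.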
Tricks of the Trade
• A curve can be partitioned into sections,
with a different curve “best” fitted in each
section.
• Use scaling as a means to increase the
accuracy of the “fitted” curve.
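One way to read the first trick is a piecewise fit: split the data at a chosen x value and fit a separate straight line to each section. The sketch below assumes a single, manually chosen breakpoint and straight-line sections; the data and the names `fit_line` and `fit_piecewise` are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for one section of data."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    m = s_xy / s_xx
    return m, y_mean - m * x_mean


def fit_piecewise(xs, ys, breakpoint):
    """Fit one line to points with x <= breakpoint, another to the rest."""
    left = [(x, y) for x, y in zip(xs, ys) if x <= breakpoint]
    right = [(x, y) for x, y in zip(xs, ys) if x > breakpoint]
    left_fit = fit_line([x for x, _ in left], [y for _, y in left])
    right_fit = fit_line([x for x, _ in right], [y for _, y in right])
    return left_fit, right_fit


# Hypothetical data: slope 1 up to x = 3, then slope 3.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0.0, 1.0, 2.0, 3.0, 6.0, 9.0, 12.0]

left_fit, right_fit = fit_piecewise(xs, ys, breakpoint=3.0)
print("left section  (m, b):", left_fit)
print("right section (m, b):", right_fit)
```

A single line through all seven points would miss the change in slope, while the two section fits recover each slope separately.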
Multivariate Analysis
Regression
• Prediction: Once the best fit line has been determined,
the equation can be used to predict new values of Y for any
given X and vice versa. (Interpolation/Extrapolation)
y = 772.03x + 10810

If a state's % of the population with a college degree is 20%,
then it can expect an average income level of
y = 772.03(20) + 10810 ≈ \$26,250

If a state's average income level is \$30,000, then what % of
its population has a college degree?
x = (30,000 – 10810)/772.03 ≈ 24.9%
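The two calculations above can be reproduced with the fitted equation from the slide. The function names are illustrative only; the slope and intercept are the values given in the slide.

```python
m = 772.03   # slope from the slide (income per percentage point)
b = 10810.0  # intercept from the slide (income at 0%)


def income_from_degree_pct(x):
    """Predict average income (y) from % with a college degree (x)."""
    return m * x + b


def degree_pct_from_income(y):
    """Invert the line: solve y = m*x + b for x."""
    return (y - b) / m


print(round(income_from_degree_pct(20), 2))
print(round(degree_pct_from_income(30000), 1))
```

The first value comes out at roughly 26,250 per year and the second at roughly 24.9%, in line with the rounded figures on the slide.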
Multivariate Analysis

• Excel Functions and Tools
– SLOPE() - Returns the slope when passed X, Y data.
– INTERCEPT() - Returns the intercept when passed X, Y data.
– LINEST() - Returns the slope and intercept when passed X, Y
data.
– TREND() - Returns predicted values in a linear trend when
passed X, Y data.
– Trendline (from the Chart menu) - Returns the trendline,
equation, and R-squared value (the square of the correlation
coefficient) for a set of X,Y data.