
Regression

• Regression: Mathematical method for determining the
best equation that reproduces a data set
• Linear Regression: Regression method applied with
a linear model (straight line)
• Uses
– Prediction of new X,Y values
– Understanding data behavior
– Verification of hypotheses/physical laws
Regression
• The Linear Model
Y = mX + b
Y = Dependent variable
X = Independent variable
m = slope = ΔY/ΔX
b = y-intercept (point where line crosses y-axis at x=0)
[Figure: plot of Y versus X showing a straight line through the points (X1=1, Y1=2.4) and (X2=20, Y2=10), with ΔX and ΔY marked]
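As a small illustration of the linear model, the slope and y-intercept can be computed from the two points given on this slide. This is only a sketch; the function name `line_through` is made up for the example.

```python
# Two points taken from the slide's example plot.
x1, y1 = 1.0, 2.4
x2, y2 = 20.0, 10.0

def line_through(x1, y1, x2, y2):
    """Return (m, b) for the straight line Y = mX + b through two points."""
    m = (y2 - y1) / (x2 - x1)  # slope = change in Y / change in X
    b = y1 - m * x1            # y-intercept: value of Y at X = 0
    return m, b

m, b = line_through(x1, y1, x2, y2)
print(f"m = {m:.3f}, b = {b:.3f}")
```

With exactly two points the line passes through both of them; regression handles the usual case where there are more points than a single line can hit.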


Regression
• Fitting the data: finding the equation for the straight
line that does the best job of reproducing the data.
[Figure: scatter plot, Average Income versus % with a College Degree (by State); x-axis: Percentage of Population with College Degree or Higher; y-axis: Average Income Level ($ per year)]
Regression
• Residual: Difference between measured and
calculated Y-values

[Figure: Average Income versus % with a College Degree (by State), zoomed to 15-20% on the x-axis; x-axis: Percentage of Population with College Degree or Higher; y-axis: Average Income Level ($ per year)]
Regression Analysis
• Use the least squares method to “best fit” a
straight line through the data points.
• A straight line is described by its slope and
“y”-intercept in an x-y plot.
• We need to determine the numerical values of the
slope and the “y”-intercept from the data.
• This is equivalent to adding a trendline to your
scatter plot in Excel.
Regression Analysis
• The least squares method consists of defining a
difference, called the residual, between the
regression line and a data point at each
measured “x” value.
• Then add up the squared residuals for all data
points.
• Finally, adjust the slope and the “y”-intercept of the
regression line so that the sum of squared
residuals, called the regression error, has the
smallest value.
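The quantity being minimized can be sketched in a few lines. The data below are hypothetical (invented for illustration, roughly following y = 2x + 1), and the candidate slope/intercept pairs are arbitrary choices, not values from the slides.

```python
def sum_squared_residuals(xs, ys, m, b):
    """Sum of squared differences between each measured y and the line's y at the same x."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))

# Hypothetical measurements (not from the slides).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

# Try a few candidate lines; the smaller the sum, the better the fit.
for m, b in [(2.0, 1.0), (1.5, 2.0), (2.5, 0.0)]:
    sse = sum_squared_residuals(xs, ys, m, b)
    print(f"m = {m}, b = {b}: regression error = {sse:.3f}")
```

The least squares method picks the one (m, b) pair for which this sum is as small as possible.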
Regression Analysis
• The covariance appears in the calculation
of the correlation coefficient between the
measurements of two variables.
• Let us denote the two variables as “x” and
“y”.
• Their measurements are the “x” data set
and the “y” data set.
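For reference, the usual definitions can be written as follows. The slides do not say which normalization they use; the sample (n - 1) convention is assumed here, with x̄ and ȳ the means and s_x, s_y the sample standard deviations of the two data sets.

```latex
\operatorname{cov}(x, y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}),
\qquad
r = \frac{\operatorname{cov}(x, y)}{s_x \, s_y}
```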
Regression Analysis
• The slope of the regression line is given by the
ratio of the covariance between the “x” and “y”
data sets to the variance of the “x” data set.
• You then use the equation of the line to
determine the y-intercept. You MUST use the
mean of x and the mean of y in this equation:
the regression line always passes through the
point of means, whereas your individual data
points are likely not on the regression line.
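A minimal sketch of this recipe, using hypothetical data (not from the slides) and a made-up helper name `fit_line`. The sample (n - 1) normalization is assumed for both covariance and variance; the factor cancels in the ratio, so the slope is the same under either convention.

```python
def fit_line(xs, ys):
    """Least squares slope and intercept from covariance and variance."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    cov_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / (n - 1)
    var_x = sum((x - x_mean) ** 2 for x in xs) / (n - 1)
    m = cov_xy / var_x        # slope = cov(x, y) / var(x)
    b = y_mean - m * x_mean   # line passes through (mean of x, mean of y)
    return m, b

# Hypothetical measurements, roughly following y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

m, b = fit_line(xs, ys)
print(f"slope = {m:.2f}, intercept = {b:.2f}")
```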
Regression Analysis
• Once we have determined the slope and the
“y”-intercept of the regression line, we have a
mathematical relation that ties the “x”
variable to the “y” variable.
• We can use this relation to predict
values of “y” for an “x” value that is
not in the data set.
Regression Analysis
• Interpolation – the process by which we
use the regression line to predict a value
of the “y” variable for a value of the “x”
variable that is not one of the data points
but is within the range of the data set.
• The predicted (“x”, “y”) point lies on the
regression line.
Regression Analysis
• Extrapolation – the process by which we
use the regression line to predict a value
of the “y” variable for a value of the “x”
variable that is outside the range of the
data set.
• The predicted (“x”, “y”) point also lies on the
regression line, but outside the range of
the data set.
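The difference between interpolation and extrapolation comes down to whether the requested “x” falls inside the range of the measured “x” values. A rough sketch, with a hypothetical fitted line (m = 2, b = 1) and a hypothetical data range of 1 to 5:

```python
def predict(x, m, b, x_min, x_max):
    """Predict y on the regression line and say how the prediction was obtained."""
    y = m * x + b
    kind = "interpolation" if x_min <= x <= x_max else "extrapolation"
    return y, kind

# Hypothetical fit and data range (not from the slides).
m, b = 2.0, 1.0
x_min, x_max = 1.0, 5.0

for x in (3.5, 8.0):
    y, kind = predict(x, m, b, x_min, x_max)
    print(f"x = {x}: y = {y} ({kind})")
```

Both predictions lie on the same line; only the extrapolated one relies on the trend continuing beyond the measured data.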
Tricks of the Trade
• A curve can be partitioned into sections,
with a different curve “best” fitted to each
section.
• Use scaling as a means to increase the
accuracy of the “fitted” curve.
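To illustrate the partitioning trick, the sketch below fits one straight line to a V-shaped data set and then fits each half separately. The data, the breakpoint at x = 0, and the helper `fit_line` are all assumptions made for the example.

```python
def fit_line(xs, ys):
    """Least squares slope and intercept for one section of the data."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    m = s_xy / s_xx
    return m, y_mean - m * x_mean

# Hypothetical V-shaped data: y = |x|.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [3, 2, 1, 0, 1, 2, 3]

whole = fit_line(xs, ys)           # one line for everything
left = fit_line(xs[:4], ys[:4])    # section x <= 0
right = fit_line(xs[3:], ys[3:])   # section x >= 0

print("whole:", whole)
print("left: ", left)
print("right:", right)
```

A single line misses the shape entirely, while the two section fits follow each arm of the V.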
Multivariate Analysis
Regression
• Prediction: Once the best fit line has been determined,
the equation can be used to predict new values of Y for any
given X and vice versa. (Interpolation/Extrapolation)
y = 772.03x + 10810

If a state's % of the population with a college degree is 20%,
then it can expect an average income level of
y = 772.03(20) + 10810 ≈ $26,251

If a state's average income level is $30,000, then what % of
its population has a college degree?
x = (30,000 – 10810)/772.03 ≈ 24.9%
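The two calculations on this slide can be reproduced directly from the fitted equation y = 772.03x + 10810. The function names below are made up for the example.

```python
SLOPE = 772.03       # dollars per percentage point, from the slide's fit
INTERCEPT = 10810.0  # dollars, from the slide's fit

def income_from_degree_pct(pct):
    """Predict average income ($ per year) from % with a college degree."""
    return SLOPE * pct + INTERCEPT

def degree_pct_from_income(income):
    """Invert the line: % with a college degree for a given average income."""
    return (income - INTERCEPT) / SLOPE

print(f"20% with a degree -> ${income_from_degree_pct(20):,.2f}")
print(f"$30,000 income   -> {degree_pct_from_income(30000):.1f}%")
```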
Multivariate Analysis

• Excel Functions and Tools
– SLOPE() - Returns the slope when passed X, Y data.
– INTERCEPT() - Returns the intercept when passed X, Y data.
– LINEST() - Returns the slope and intercept when passed X, Y
data.
– TREND() - Returns predicted values along a linear trend when
passed X, Y data.
– Trendline (from the Chart menu) - Returns the trendline,
equation, and R-squared value for a set of X, Y data.