
Regression Analysis

Least-Squares Linear Regression

Enables fit of linear or exponential function to data. The goal in regression analysis is the development of a statistical model that can be used to predict the values of a dependent or response variable from the values of the independent variable(s).

Linear fits are the most common. For power-law or exponential functions, the data must first be transformed to a linear form.

Method of Least Squares

If we have N pairs of data (x_i, y_i), we seek to fit a straight line through the data of the form given below.

Determine the constants, a_0 and a_1, such that the distance between the actual y data and the fitted/predicted line is minimized.

Each x_i is assumed to be error free. All the error is assumed to be in the y values.

$$y = a_0 + a_1 x$$

$$a_0 = \frac{\sum x_i \sum x_i y_i - \sum x_i^2 \sum y_i}{\left(\sum x_i\right)^2 - N\sum x_i^2}$$

$$a_1 = \frac{\sum x_i \sum y_i - N\sum x_i y_i}{\left(\sum x_i\right)^2 - N\sum x_i^2}$$

Manual Calculation Method

Raw Data:

  y_i    x_i    x_i*y_i   x_i^2
  1.2    1.0     1.20      1.00
  2.0    1.6     3.20      2.56
  2.4    3.4     8.16     11.56
  3.5    4.0    14.00     16.00
  3.5    5.2    18.20     27.04
  -----------------------------
  Sum:  12.6   15.2   44.76   58.16

Seeking an equation with the form y = a_0 + a_1 x:

y = 0.879 + 0.540x

$$a_0 = \frac{(15.2)(44.76) - (58.16)(12.6)}{(15.2)^2 - (5)(58.16)} = 0.879$$

$$a_1 = \frac{(15.2)(12.6) - (5)(44.76)}{(15.2)^2 - (5)(58.16)} = 0.540$$
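As a check on the arithmetic, the fit can be reproduced with a few lines of plain Python (a sketch, not part of the original notes), using the column sums from the table above:

```python
# Least-squares fit y = a0 + a1*x using the summation formulas above.
x = [1, 1.6, 3.4, 4, 5.2]
y = [1.2, 2, 2.4, 3.5, 3.5]
N = len(x)

Sx  = sum(x)                                  # 15.2
Sy  = sum(y)                                  # 12.6
Sxy = sum(xi * yi for xi, yi in zip(x, y))    # 44.76
Sxx = sum(xi ** 2 for xi in x)                # 58.16

den = Sx ** 2 - N * Sxx
a0 = (Sx * Sxy - Sxx * Sy) / den              # intercept, about 0.88
a1 = (Sx * Sy - N * Sxy) / den                # slope, about 0.54
```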

How good is the fit?

The Coefficient of Determination (R^2) measures the goodness of fit: the proportion of the variation in the y values that is associated with the variation in the x variable in the regression. It is the ratio of the explained variation to the total variation.

R^2 = 1: perfect fit (good prediction)
R^2 = 0: no correlation between x and y
For engineering data, R^2 will normally be quite high (0.8-0.9 or higher).
A low value might indicate that some important variable was not considered but is affecting the results.

$$R^2 = 1 - \frac{\sum (a_0 + a_1 x_i - y_i)^2}{\sum (y_i - \bar{y})^2}$$

= Excel function RSQ(y_i's, x_i's)

where ȳ = average of the y_i's

Standard Error of Estimate SEE

The standard error of estimate (SEE or S_yx) is a statistical measure of how well the best-fit line represents the data. It is, effectively, the standard deviation of the differences between the data points and the best-fit line.

It provides an estimate of the scatter (random error) in the data about the fitted line. It is analogous to the standard deviation of sample data and has the same units as y. Two degrees of freedom are lost because the coefficients a_0 and a_1 are calculated from the data.

$$s_{ey} = SEE = S_{yx} = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{N-2}}$$

= Excel function STEYX(y_i's, x_i's)

where y_i = actual value of y for a given x_i, and ŷ_i = predicted value of y for a given x_i
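Both quantities can be computed directly for the worked example. This plain-Python sketch (not part of the slides) uses the rounded coefficients a0 = 0.879 and a1 = 0.540 from the manual calculation:

```python
import math

# R^2 and SEE for the worked example (fit y = a0 + a1*x).
x = [1, 1.6, 3.4, 4, 5.2]
y = [1.2, 2, 2.4, 3.5, 3.5]
N = len(x)
a0, a1 = 0.879, 0.540                         # coefficients from the slides

y_hat = [a0 + a1 * xi for xi in x]            # predicted values
y_bar = sum(y) / N                            # mean of the y data

ss_res = sum((yh - yi) ** 2 for yh, yi in zip(y_hat, y))
ss_tot = sum((yi - y_bar) ** 2 for yi in y)

r2  = 1 - ss_res / ss_tot                     # analogous to Excel RSQ()
see = math.sqrt(ss_res / (N - 2))             # analogous to Excel STEYX()
```

Note that Excel's RSQ and STEYX use the exact (unrounded) least-squares coefficients, so their results differ very slightly from this check.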

Linear Regression Assumptions

Variation in the data is assumed to be normally distributed and due to random causes. Random variation is assumed to exist in the y values, while the x values are assumed to be error free. Because the error has been minimized in the y direction, an erroneous conclusion may be reached if x is estimated based on a value of y. For power-law or exponential relationships, the data must be transformed before carrying out a linear regression analysis. (As we will discuss later, the method of least squares can also be applied to nonlinear functional relationships.)

Linear Regression Example

Use Excel Chart >> Add Trendline to obtain the coefficients, and the functions RSQ() and STEYX() to determine R^2 and SEE.

[Figure: Output (Volts) vs. Length (cm), with linear trendline y = 0.9977x + 0.0295, R^2 = 0.9993]

Regression Analysis using Excel Analysis Tools

Linear regression is a standard feature of statistical programs and most spreadsheet programs. It is only necessary to input the x and y data. The remaining calculations are performed immediately.

Excel “Regression Analysis” macro

Performs linear regression only; non-linear relationships must be transformed.
Calculates the slope, intercept, SEE, and the upper and lower confidence intervals for the slope and intercept.
Does not produce any graphical output on the user's plot, and does not update automatically.
The user must interpret the results.

Linear Regression in Excel 2008

Model: Y = m1*X + b

Torque, N-m (Y)   RPM (X)   Y Predicted   Residual     Residual/SEE = Residual/sey
4.89              100       4.998433       0.108433     0.175585
4.77              201       4.559896      -0.210104    -0.340219
3.79              298       4.138727       0.348727     0.564689
3.76              402       3.687164      -0.072836    -0.117943
2.84              500       3.261652       0.421652     0.682777
4.12              601       2.823115      -1.296885    -2.100030   <-- outlier
2.05              699       2.397604       0.347604     0.562871
1.61              799       1.963409       0.353409     0.572271

=LINEST(A2:A9,B2:B9,TRUE,TRUE) output:

  m1  = -0.004342      b   =  5.432628
  se1 =  0.000954      seb =  0.481645
  r^2 =  0.775391      sey =  0.617555
  F   = 20.713116      df  =  6

Linear Regression Example: Omit Outlier

Torque, N-m (Y)   RPM (X)   Y Predicted   Residual     Residual/SEE = Residual/sey
4.89              100       5.000219       0.110219     0.504560
4.77              201       4.504158      -0.265842    -1.216969
3.79              298       4.027743       0.237743     1.088335
3.76              402       3.516947      -0.243053    -1.112646
2.84              500       3.035620       0.195620     0.895506
2.05              699       2.058232       0.008232     0.037683
1.61              799       1.567082      -0.042918    -0.196470

LINEST output:

  m1     = -0.0049115     b        = 5.49137
  se1    =  0.000348477   seb      = 0.170607
  r^2    =  0.975448      sey      = 0.218446
  F      =  198.646       df       = 5
  ss_reg =  9.47915       ss_resid = 0.238594
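A quick plain-Python check (a sketch; the slides use LINEST) reproduces the slope, intercept, and SEE quoted above for the outlier-free data:

```python
import math

# Refit the torque data with the outlier (601 RPM, 4.12 N-m) removed.
rpm    = [100, 201, 298, 402, 500, 699, 799]           # X
torque = [4.89, 4.77, 3.79, 3.76, 2.84, 2.05, 1.61]    # Y
n = len(rpm)

Sx  = sum(rpm)
Sy  = sum(torque)
Sxy = sum(x * y for x, y in zip(rpm, torque))
Sxx = sum(x * x for x in rpm)

m1 = (n * Sxy - Sx * Sy) / (n * Sxx - Sx ** 2)         # slope
b  = (Sy - m1 * Sx) / n                                # intercept

resid = [y - (m1 * x + b) for x, y in zip(rpm, torque)]
sey = math.sqrt(sum(r * r for r in resid) / (n - 2))   # SEE
```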

Uncertainties on Regression

Confidence Interval for Regression Line (SEE = sey = 0.218446)
t = TINV(α=0.05, ν=5) = 2.570582
95% C.I. = TINV(α=0.05, ν=5)*SEE/SQRT(7) = 0.212240

Prediction Band for Regression Line
95% P.I. = TINV(α=0.05, ν=5)*SEE = 0.561534

Uncertainty in Slope: Δm1 = TINV(0.05, 5)*se1 = 0.000895789

Uncertainty in Intercept: Δb = TINV(0.05, 5)*seb = 0.438559
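The four uncertainty numbers above follow from simple arithmetic on the LINEST output. In this sketch (not from the slides) the t-value is copied from the slide, since plain Python has no TINV equivalent:

```python
import math

# Uncertainty numbers for the outlier-free torque fit (n = 7, nu = 5).
t   = 2.570581835    # TINV(alpha=0.05, nu=5), two-sided, from the slide
sey = 0.218446       # SEE
se1 = 0.000348477    # standard error of the slope
seb = 0.170607       # standard error of the intercept
n   = 7              # number of data points

ci_95   = t * sey / math.sqrt(n)   # 95% C.I. on the regression line
pb_95   = t * sey                  # 95% prediction band
d_slope = t * se1                  # uncertainty in slope m1
d_intcp = t * seb                  # uncertainty in intercept b
```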

Regression Line Confidence Intervals & Prediction Band

Not only do you want to obtain a curve-fit relationship, but you also want to establish a confidence interval for the equation, a measure of the random uncertainty in the curve fit. Use ν = N - 2 in determining the t-value; two degrees of freedom are lost because m1 and b are determined from the data.

[Figure: Torque (N-m) vs. RPM showing the data, the least-squares fit, the ±95% confidence interval curves, and the ±95% prediction band.]

$$CI = \bar{y} \pm t_{\alpha,\nu}\,\frac{SEE}{\sqrt{N}} = \pm t_{\alpha,\nu}\,\frac{S_{yx}}{\sqrt{N}} = \pm t_{\alpha,\nu}\,\frac{s_{ey}}{\sqrt{N}}$$

where t_{α,ν} = TINV(α, ν) (two-sided t-table) and α = 1 - P

$$PB = \pm t_{\alpha,\nu}\,SEE = \pm t_{\alpha,\nu}\,S_{yx} = \pm t_{\alpha,\nu}\,s_{ey}$$

Regression Line Confidence Interval & Prediction Band

More accurate:

$$CI\text{ in curve fit} = \pm t_{\alpha/2,\,n-2}\; sey \sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}}$$

$$\text{Prediction band} = \pm t_{\alpha/2,\,n-2}\; sey \sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}}$$

Approximate:

$$CI \approx \pm t_{\alpha/2,\,n-2}\;\frac{sey}{\sqrt{n}} \qquad\qquad PB \approx \pm t_{\alpha/2,\,n-2}\; sey$$

The bands are minimum at the mean x and flare out at the low and high extremes.

Summations Used in Statistics & Regression

Sample standard deviation:

$$S_x = \left[\frac{\sum (x_i - \bar{x})^2}{N-1}\right]^{1/2}$$

Expressions used in regression analysis:

Sum of squares for evaluating CI & PI:

$$S_{xx} = \sum (x_i - \bar{x})^2$$

Standard error of estimate:

$$sey = SEE = S_{yx} = \left[\frac{\sum (y_i - y_{predicted\ at\ x=x_i})^2}{N-2}\right]^{1/2}$$

CI in slope and intercept

Slope, m:  CI in slope = ±t_(α/2, ν) * se1

Intercept, b:  CI in intercept = ±t_(α/2, ν) * seb

Note 1: ν = n - 2.
Note 2: m and b are not independent variables. Therefore, do not apply RSS to y = mx + b to determine Δy. Instead, use the CI for the curve fit.

Outliers in x-y Data Sets

The method involves computing the ratio of the residuals (predicted - actual) to the standard error of estimate (sey = SEE):

1. Residuals = y_predicted - y_actual at each x_i.
2. Plot the ratio residual/SEE for each x_i. These are the "standardized residuals".
3. Standardized residuals exceeding ±2 may be considered outliers. Assuming the residuals are normally distributed, you can expect 95% of the residuals to fall within ±2 (that is, within two standard deviations of the best-fit line).
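The three steps can be sketched in plain Python (not part of the slides); applied to the full torque data set from the earlier example, it flags the 601-RPM point:

```python
import math

# Standardized-residual outlier check on the full torque data set.
rpm    = [100, 201, 298, 402, 500, 601, 699, 799]
torque = [4.89, 4.77, 3.79, 3.76, 2.84, 4.12, 2.05, 1.61]
n = len(rpm)

# Least-squares fit y = m1*x + b
Sx, Sy = sum(rpm), sum(torque)
Sxy = sum(x * y for x, y in zip(rpm, torque))
Sxx = sum(x * x for x in rpm)
m1 = (n * Sxy - Sx * Sy) / (n * Sxx - Sx ** 2)
b  = (Sy - m1 * Sx) / n

# Step 1: residuals (predicted - actual), and SEE
resid = [(m1 * x + b) - y for x, y in zip(rpm, torque)]
sey = math.sqrt(sum(r * r for r in resid) / (n - 2))

# Steps 2-3: standardized residuals; |value| > 2 flags a possible outlier
outliers = [(x, y) for x, y, r in zip(rpm, torque, resid) if abs(r / sey) > 2]
print(outliers)   # -> [(601, 4.12)]
```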

Linear Regression with Data Transformation

Data Transformation

Commonly, test data do not show an approximately linear relationship between the dependent (Y) and independent (X) variables, and a direct linear regression is not useful.

The form of the relationship expected between the dependent and independent variables is often known. The data need to be transformed prior to performing a linear regression. Transformations can often be accomplished by taking the logarithm (base 10 or natural) of one or both sides of the equation.

Common Transformations

Relationship    Plot Method                    Transformed Equation             Intercept, b   Slope, m1
y = α*x^γ       log y vs. log x (log-log)      Log(y) = Log(α) + γ*Log(x)       Log(α)         γ
y = α*x^γ       ln y vs. ln x (log-log)        Ln(y) = Ln(α) + γ*Ln(x)          Ln(α)          γ
y = α*e^(γx)    log y vs. x (semi-log)         Log(y) = Log(α) + γ*Log(e)*x     Log(α)         γ*Log(e)
y = α*e^(γx)    ln y vs. x (semi-log)          Ln(y) = Ln(α) + γ*x              Ln(α)          γ

Regression with Transformation

Example

A velocity probe provides a voltage output E that is related to the velocity U by the form E = δ + ε*U^ρ, where δ, ε, and ρ are constants.

U (ft/s)   Ei (V)
 0         3.19
10         3.99
20         4.30
30         4.48
40         4.65

[Figures: Output Voltage (VDC) vs. Velocity (ft/s) on linear axes, and the same data on log-log axes]

Data Relationship Transformation

E = δ + ε*U^ρ, with E = δ = 3.19 at U = 0, so:

Log(E - 3.19) = Log(ε*U^ρ) = Log(ε) + Log(U^ρ) = Log(ε) + ρ*Log(U)

With Y = Log(E - 3.19) and X = Log(U), the fit Y = b + m1*X gives b = Log(ε) and m1 = ρ. Transform the data:

U (ft/s)   Ei (V)   X = Log(U)   Y = Log(Ei - 3.19)
 0         3.19     --           --
10         3.99     1.00         -0.097
20         4.30     1.30          0.045
30         4.48     1.48          0.111
40         4.65     1.60          0.164

Perform the regression on the transformed data.

Solution (Excel 2004 Output)

SUMMARY OUTPUT

Regression Statistics
  Multiple R             0.998724
  R Square               0.997449
  Adjusted R Square      0.996174
  Standard Error (SEE)   0.0070
  Observations           4

t_(α,ν) = TINV(0.05, 2) = 4.3026;  t*SEE = 4.3026 x 0.0070 ≈ 0.030

ANOVA
              df    SS           MS         F        Significance F
  Regression   1    0.038118     0.038118   782.11   0.001276
  Residual     2    9.748E-05    4.87E-05
  Total        3    0.038216

                Coefficients   Std Error   t Stat    P-value    Lower 95%   Upper 95%
  Intercept       -0.525       0.021056    -24.927   0.001605   -0.61548    -0.43428
  X Variable 1     0.432       0.015438     27.966   0.001276    0.36532     0.49817

Y=-0.525+0.432X

Regression with Transformation & Uncertainty

U (ft/s)   Y predicted   Y+        Y-        E      E+     E-
 0         --            --        --        3.19   3.19   3.19
10         -0.0931       -0.0781   -0.1082   4.00   4.03   3.97
20          0.0368        0.0519    0.0218   4.28   4.32   4.24
30          0.1129        0.1279    0.0978   4.49   4.53   4.44
40          0.1668        0.1818    0.1518   4.66   4.71   4.61

Transform back: E = 3.19 + 10^Y.

[Figure (Example 4.10): E (V) vs. U (ft/s), 0 to 50 ft/s, showing the data, the fitted curve, and its uncertainty bands]


Intercept: b = Log(ε), so -0.525 = Log(ε) and ε = 10^(-0.525) = 0.298

Slope: m1 = ρ = 0.432

E = 3.19 + 0.298*U^0.432
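The whole transform-and-fit procedure for this example can be sketched in plain Python (assuming δ = 3.19, the reading at U = 0, as above; this sketch is not part of the slides):

```python
import math

# Transform-and-fit for E = delta + eps * U**rho, with delta = 3.19.
U = [10, 20, 30, 40]
E = [3.99, 4.30, 4.48, 4.65]

# Transform: Y = log10(E - 3.19) = log10(eps) + rho*log10(U) = b + m1*X
X = [math.log10(u) for u in U]
Y = [math.log10(e - 3.19) for e in E]
n = len(X)

# Least-squares fit of the transformed data
Sx, Sy = sum(X), sum(Y)
Sxy = sum(x * y for x, y in zip(X, Y))
Sxx = sum(x * x for x in X)
m1 = (n * Sxy - Sx * Sy) / (n * Sxx - Sx ** 2)   # rho
b  = (Sy - m1 * Sx) / n                          # log10(eps)

rho = m1
eps = 10 ** b
# Close to the slide's result E = 3.19 + 0.298*U**0.432
```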


Multiple and Polynomial Regression

Regression analysis can also be performed in situations where there is more than one independent variable (multiple regression) or for polynomials of an independent variable (polynomial regression).

Polynomial regression seeks the form

Y = b + m1*x + m2*x^2 + ... + mk*x^k

Multiple regression seeks a function of the form

Y = b + m1*x̂1 + m2*x̂2 + m3*x̂3 + ... + mk*x̂k

where x̂ may represent several independent variables. For example:

x̂1 = x1,  x̂2 = x2,  x̂3 = x1*x2
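As an illustration of polynomial regression, a quadratic can be fitted by solving the 3x3 normal equations directly. This is a plain-Python sketch with made-up data (y = 1 + 2x + 3x^2, hypothetical, not from the slides):

```python
def polyfit2(xs, ys):
    """Least-squares quadratic fit Y = b + m1*x + m2*x^2; returns (b, m1, m2)."""
    # Normal-equation matrix built from sums of powers of x
    S = [sum(x ** k for x in xs) for k in range(5)]             # S[k] = sum x^k
    T = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    A = [[S[0], S[1], S[2], T[0]],
         [S[1], S[2], S[3], T[1]],
         [S[2], S[3], S[4], T[2]]]
    # Gauss-Jordan elimination (no pivoting; fine for this small SPD system)
    for i in range(3):
        p = A[i][i]
        A[i] = [v / p for v in A[i]]
        for j in range(3):
            if j != i:
                f = A[j][i]
                A[j] = [vj - f * vi for vj, vi in zip(A[j], A[i])]
    return A[0][3], A[1][3], A[2][3]

# Hypothetical data generated from y = 1 + 2x + 3x^2
xs = [0, 1, 2, 3, 4]
ys = [1 + 2 * x + 3 * x ** 2 for x in xs]
b, m1, m2 = polyfit2(xs, ys)   # recovers (1, 2, 3)
```

Multiple regression works the same way, with the powers of x replaced by the columns x̂1, x̂2, ... in the normal equations.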

Linear Regression in Excel 2004

Input the result values

Input the independent variable

Input desired confidence level

Excel 2004 Linear Regression Output

SUMMARY OUTPUT

Regression Statistics
  Multiple R             0.999643
  R Square (R^2)         0.999286
  Adjusted R Square      0.999108
  Standard Error (sey)   0.027886
  Observations (N)       6

ANOVA
              df   SS         MS         F          Significance F
  Regression   1   4.355023   4.355023   5600.458   1.9107E-07
  Residual     4   0.003110   0.000778
  Total        5   4.358133

                      Coefficients   Std Error   t Stat      P-value      Lower 95%   Upper 95%
  Intercept (b)         0.029524     0.020182     1.462858   0.217334     -0.026511   0.085559
  X Variable 1 (m1)     0.997714     0.013332    74.836208   1.9107E-07    0.960699   1.034730
The Lower 95% and Upper 95% columns give the bounds for each coefficient. To obtain the ± bound, subtract the lower bound from the upper bound and divide by two.

