
Regression Analysis

Least-Squares Linear Regression

Regression enables fitting a linear or exponential function to data. The goal of regression analysis is to develop a statistical model that can be used to predict the values of a dependent (response) variable from the values of the independent variable(s).

Linear fits are the most common; for exponential functions, the data must be transformed first.

Method of Least Squares

If we have N pairs of data (x_i, y_i), we seek to fit a straight line through the data of the form

$$y = a_0 + a_1 x$$

Determine the constants a_0 and a_1 such that the distance between the actual y data and the fitted/predicted line is minimized. Each x_i is assumed to be error free; all of the error is assumed to be in the y values.

$$a_0 = \frac{\sum x_i \sum x_i y_i \;-\; \sum x_i^2 \sum y_i}{\left(\sum x_i\right)^2 \;-\; N\sum x_i^2}
\qquad
a_1 = \frac{\sum x_i \sum y_i \;-\; N\sum x_i y_i}{\left(\sum x_i\right)^2 \;-\; N\sum x_i^2}$$

Manual Calculation Method

   

Raw Data

      y_i     x_i     x_i*y_i    x_i^2
      1.2     1.0       1.20      1.00
      2.0     1.6       3.20      2.56
      2.4     3.4       8.16     11.56
      3.5     4.0      14.00     16.00
      3.5     5.2      18.20     27.04
Sum  12.6    15.2      44.76     58.16

Seeking an equation of the form y = a_0 + a_1 x:

y = 0.879 + 0.540 x

$$a_0 = \frac{(15.2)(44.76) - (58.16)(12.6)}{(15.2)^2 - (5)(58.16)} = 0.879$$

$$a_1 = \frac{(15.2)(12.6) - (5)(44.76)}{(15.2)^2 - (5)(58.16)} = 0.540$$
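The column sums above map directly into these formulas, so the manual result is easy to cross-check programmatically. A minimal Python sketch (the variable names are illustrative, not from the slides):

```python
# Cross-check of the manual least-squares calculation above.
x = [1.0, 1.6, 3.4, 4.0, 5.2]
y = [1.2, 2.0, 2.4, 3.5, 3.5]
N = len(x)

Sx  = sum(x)                                  # 15.2
Sy  = sum(y)                                  # 12.6
Sxy = sum(xi * yi for xi, yi in zip(x, y))    # 44.76
Sxx = sum(xi ** 2 for xi in x)                # 58.16

denom = Sx ** 2 - N * Sxx                     # (Σx)² - NΣx²
a0 = (Sx * Sxy - Sxx * Sy) / denom            # intercept ≈ 0.88
a1 = (Sx * Sy - N * Sxy) / denom              # slope ≈ 0.54
print(f"y = {a0:.3f} + {a1:.3f} x")
```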

How good is the fit?

The coefficient of determination (R²) measures the goodness of fit: the proportion of the variation in the y values that is associated with the variation in the x variable in the regression, i.e., the ratio of the explained variation to the total variation.

- R² = 1: perfect fit (good prediction)
- R² = 0: no correlation between x and y
- For engineering data, R² will normally be quite high (0.80-0.90 or higher).
- A low value might indicate that some important variable was not considered but is affecting the results.

$$R^2 = 1 - \frac{\sum (a x_i + b - y_i)^2}{\sum (y_i - \bar{y})^2}$$ = Excel function RSQ(y_i's, x_i's)

where ȳ = average of the y_i's

Standard Error of Estimate SEE

The standard error of estimate (SEE or S_yx) is a statistical measure of how well the best-fit line represents the data. It is, effectively, the standard deviation of the differences between the data points and the best-fit line.

It provides an estimate of the scatter/random error in the data about the fitted line. This is analogous to the standard deviation for sample data, and it has the same units as y. Two degrees of freedom are lost to calculate the coefficients a_0 and a_1.

$$s_{ey} = SEE = S_{yx} = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{N - 2}}$$ = Excel function STEYX(y_i's, x_i's)

where y_i = actual value of y for a given x_i, and ŷ_i = predicted value of y for a given x_i.
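Both goodness-of-fit measures come straight from the residuals. A minimal Python sketch using the earlier worked example (with the exact least-squares coefficients these reproduce Excel's RSQ and STEYX; the rounded a0, a1 from the example give nearly the same values):

```python
import math

x = [1.0, 1.6, 3.4, 4.0, 5.2]
y = [1.2, 2.0, 2.4, 3.5, 3.5]
a0, a1 = 0.879, 0.540                         # fitted coefficients from the example
N = len(y)

y_hat = [a0 + a1 * xi for xi in x]            # predicted values
y_bar = sum(y) / N                            # average of the y_i's

ss_res = sum((yh - yi) ** 2 for yh, yi in zip(y_hat, y))   # Σ(a1*xi + a0 - yi)²
ss_tot = sum((yi - y_bar) ** 2 for yi in y)                # Σ(yi - ȳ)²

r2  = 1 - ss_res / ss_tot                     # coefficient of determination
see = math.sqrt(ss_res / (N - 2))             # standard error of estimate, ν = N - 2
print(f"R^2 = {r2:.3f}, SEE = {see:.3f}")
```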

Linear Regression Assumptions

Variation in the data is assumed to be normally distributed and due to random causes. Random variation is assumed to exist in the y values, while the x values are error free. Since the error has been minimized in the y direction, an erroneous conclusion may be reached if x is estimated based on a value of y. For power-law or exponential relationships, the data must be transformed before carrying out linear regression analysis. (As we will discuss later, the method of least squares can also be applied to nonlinear functional relationships.)

Linear Regression Example

Use Excel Chart >> Add Trendline to obtain the coefficients, and the functions RSQ() and STEYX() to determine R² and SEE.

[Chart: Output (Volts) vs. Length (cm) with trendline y = 0.9977x + 0.0295, R² = 0.9993]

Regression Analysis using Excel Analysis Tools

Linear regression is a standard feature of statistical programs and most spreadsheet programs. It is only necessary to input the x and y data. The remaining calculations are performed immediately.

Excel “Regression Analysis” macro

- Performs linear regression only; non-linear relationships must be transformed.
- Calculates the slope, intercept, SEE, and the upper and lower confidence intervals for the slope and intercept.
- Does not produce any graphical output on the user's plot and does not update automatically.
- The user must interpret the results.

Linear Regression in Excel 2008

Torque, N-m (Y)   RPM (X)   Y Predicted     Residual        Residual/SEE = Residual/sey
4.89              100       4.998433207      0.108433207     0.17558474
4.77              201       4.559896053     -0.210103947    -0.340219088
3.79              298       4.138726707      0.348726707     0.564689451
3.76              402       3.687163697     -0.072836303    -0.117943051
2.84              500       3.261652399      0.421652399     0.682777249
4.12              601       2.823115245     -1.296884755    -2.100031702    <-- outlier
2.05              699       2.397603947      0.347603947     0.562871377
1.61              799       1.963408745      0.353408745     0.572271025

=LINEST(A2:A9,B2:B9,TRUE,TRUE) returns:

m1  = -0.004341952    b   = 5.432628409
se1 =  0.000954031    seb = 0.481645161
r^2 =  0.775391233    sey = 0.617554846
F   = 20.71311576     df  = 6

The point at 601 RPM (standardized residual = -2.10) is flagged as an outlier.
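Outside Excel, the same eight statistics can be reproduced from the standard simple-regression formulas. A NumPy sketch (this is my own illustration of what LINEST reports, not the slide's Excel workflow; the data are the torque-RPM values tabulated above):

```python
import numpy as np

torque = np.array([4.89, 4.77, 3.79, 3.76, 2.84, 4.12, 2.05, 1.61])  # Y
rpm    = np.array([100, 201, 298, 402, 500, 601, 699, 799], float)   # X
n = len(rpm)

m1, b = np.polyfit(rpm, torque, 1)            # least-squares fit Y = m1*X + b

resid  = (m1 * rpm + b) - torque              # predicted - actual, as in the table
df     = n - 2
ss_res = np.sum(resid ** 2)
sey    = np.sqrt(ss_res / df)                 # standard error of estimate (SEE)

Sxx = np.sum((rpm - rpm.mean()) ** 2)
se1 = sey / np.sqrt(Sxx)                      # standard error of the slope
seb = sey * np.sqrt(np.sum(rpm ** 2) / (n * Sxx))   # standard error of the intercept

ss_tot = np.sum((torque - torque.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
F  = (ss_tot - ss_res) / (ss_res / df)

print(m1, b, se1, seb, r2, sey, F, df)        # compare with the LINEST block above
```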

Linear Regression Example: Omit Outlier

Torque, N-m (Y)   RPM (X)   Y Predicted     Residual        Residual/SEE = Residual/sey
4.89              100       5.000219168      0.110219168     0.504559919
4.77              201       4.504157858     -0.265842142    -1.21696881
3.79              298       4.02774254       0.23774254      1.088334807
3.76              402       3.516946736     -0.243053264    -1.112646171
2.84              500       3.03561992       0.19561992      0.895506407
2.05              699       2.058231795      0.008231795     0.037683406
1.61              799       1.567081983     -0.042918017    -0.196469559

LINEST output with the outlier omitted:

m1  = -0.004911498    b   = 5.49136898
se1 =  0.000348477    seb = 0.170606738
r^2 =  0.975447633    sey = 0.218446143
F   = 198.6463557     df  = 5
SS(regression) = 9.479149271    SS(residual) = 0.238593586

Uncertainties on Regression

Confidence Interval for the Regression Line (SEE = sey = 0.218446143, from the fit with the outlier omitted)

TINV(α = 0.05, ν = 5) = 2.570581835
95% C.I. = TINV(α = 0.05, ν = 5) * SEE / SQRT(7) = 0.212239784

Prediction Band for the Regression Line
95% P.I. = TINV(α = 0.05, ν = 5) * SEE = 0.561533687

Uncertainty in slope:     Δm1 = TINV(0.05, 5) * se1 = 0.000895789
Uncertainty in intercept: Δb  = TINV(0.05, 5) * seb = 0.438558582

Regression Line Confidence Intervals & Prediction Band

Not only do you want to obtain a curve-fit relationship, but you also want to establish a confidence interval for the equation, i.e., a measure of the random uncertainty in the curve fit. ν = N − 2 in the determination of the t-value; two degrees of freedom are lost because m1 and b are determined.

[Chart: Torque (N-m) vs. RPM showing the data, the least-squares fit, the ±95% confidence interval, and the ±95% prediction band]
$$CI = \Delta y \approx \pm\, t_{\alpha,\nu}\,\frac{SEE}{\sqrt{N}} = \pm\, t_{\alpha,\nu}\,\frac{S_{yx}}{\sqrt{N}} = \pm\, t_{\alpha,\nu}\,\frac{S_{ey}}{\sqrt{N}}$$

$$PB \approx \pm\, t_{\alpha,\nu}\, SEE = \pm\, t_{\alpha,\nu}\, S_{yx} = \pm\, t_{\alpha,\nu}\, S_{ey}$$

where t_{α,ν} = TINV(α, ν) (two-sided t-table) and α = 1 − P.

Regression Line Confidence Interval & Prediction Band

More accurate:

$$\text{CI in curve fit} = \pm\, t_{\alpha/2,\,n-2}\; s_{ey}\,\sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}}$$

$$\text{Prediction band} = \Delta y = \pm\, t_{\alpha/2,\,n-2}\; s_{ey}\,\sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}}$$

Approximate:

$$\text{CI} \approx \pm\, t_{\alpha/2,\,n-2}\;\frac{s_{ey}}{\sqrt{n}} \qquad\qquad \text{PB} \approx \pm\, t_{\alpha/2,\,n-2}\; s_{ey}$$

- Minimum at the mean of x
- Flares out at the low and high extremes
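These bands are straightforward to evaluate numerically once the fit statistics are known. A sketch using NumPy and SciPy (my own illustration; scipy.stats.t.ppf plays the role of TINV, and the data are the outlier-omitted torque-RPM values):

```python
import numpy as np
from scipy import stats

x = np.array([100, 201, 298, 402, 500, 699, 799], float)   # RPM, outlier omitted
y = np.array([4.89, 4.77, 3.79, 3.76, 2.84, 2.05, 1.61])   # Torque, N-m
n = len(x)

m1, b  = np.polyfit(x, y, 1)
resid  = y - (m1 * x + b)
sey    = np.sqrt(np.sum(resid ** 2) / (n - 2))              # SEE
Sxx    = np.sum((x - x.mean()) ** 2)
t_val  = stats.t.ppf(1 - 0.05 / 2, n - 2)                   # two-sided 95%, like TINV(0.05, n-2)

x_star  = np.linspace(x.min(), x.max(), 50)                 # points along the fitted line
half_ci = t_val * sey * np.sqrt(1 / n + (x_star - x.mean()) ** 2 / Sxx)       # CI half-width
half_pb = t_val * sey * np.sqrt(1 + 1 / n + (x_star - x.mean()) ** 2 / Sxx)   # PB half-width

y_fit = m1 * x_star + b
ci_lo, ci_hi = y_fit - half_ci, y_fit + half_ci             # narrowest at x = mean(x)
pb_lo, pb_hi = y_fit - half_pb, y_fit + half_pb             # flares out at the extremes
```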

Summations Used in Statistics & Regression

Variable / Expression

Sample standard deviation:

$$S_x = \left[\frac{\sum (x_i - \bar{x})^2}{N - 1}\right]^{1/2}$$

Expressions used in regression analysis:

Sum of squares for evaluating CI & PI:

$$S_{xx} = \sum (x_i - \bar{x})^2$$

Standard error of estimate:

$$s_{ey} = SEE = S_{yx} = \left[\frac{\sum \left(y_i - y_{\text{predicted at } x = x_i}\right)^2}{N - 2}\right]^{1/2}$$

CI in slope and intercept

Slope, m:      CI in slope = ± t_{α/2, ν} · se1
Intercept, b:  CI in intercept = ± t_{α/2, ν} · seb

Note 1: ν = n − 2.
Note 2: m and b are not independent variables. Therefore, do not apply RSS to y = mx + b to determine Δy; instead, use the CI for the curve fit.
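For the outlier-omitted torque fit, these two intervals can be evaluated directly from the tabulated se1 and seb. A short sketch (scipy.stats.t.ppf stands in for TINV; the hard-coded numbers are copied from the LINEST block above):

```python
from scipy import stats

m1, se1 = -0.004911498, 0.000348477        # slope and its standard error
b,  seb =  5.49136898,  0.170606738        # intercept and its standard error
nu = 5                                     # ν = n - 2

t_val = stats.t.ppf(1 - 0.05 / 2, nu)      # ≈ 2.5706, same as TINV(0.05, 5)
print(f"slope     = {m1:.6f} ± {t_val * se1:.6f}")   # ± ≈ 0.000896
print(f"intercept = {b:.4f} ± {t_val * seb:.4f}")    # ± ≈ 0.4386
```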

Outliers in x-y Data Sets

The method involves computing the ratio of the residuals (predicted − actual) to the standard error of estimate (sey = SEE):

1. Residuals = y_predicted − y_actual at each x_i.
2. Plot the ratio of residual/SEE for each x_i. These are the "standardized residuals".
3. Standardized residuals exceeding ±2 may be considered outliers. Assuming the residuals are normally distributed, you can expect 95% of the residuals to lie in the range ±2 (that is, within 2 standard deviations of the best-fit line).
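A Python sketch of the same three steps for the torque-RPM data (NumPy's polyfit stands in for the Excel fit; with these data the ±2 threshold flags the 601-RPM point, as in the table above):

```python
import numpy as np

torque = np.array([4.89, 4.77, 3.79, 3.76, 2.84, 4.12, 2.05, 1.61])
rpm    = np.array([100, 201, 298, 402, 500, 601, 699, 799], float)

m1, b = np.polyfit(rpm, torque, 1)
resid = (m1 * rpm + b) - torque                        # 1. predicted - actual
see   = np.sqrt(np.sum(resid ** 2) / (len(rpm) - 2))
std_resid = resid / see                                # 2. standardized residuals

for x_i, r in zip(rpm, std_resid):                     # 3. flag |residual/SEE| > 2
    flag = "  <-- possible outlier" if abs(r) > 2 else ""
    print(f"RPM {x_i:4.0f}: {r:+.2f}{flag}")
```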

Linear Regression with Data Transformation

Data Transformation

Commonly, test data do not show an approximate linear relationship between the dependent (Y) and independent (X) variables and a direct linear regression is not useful.

The form of the relationship expected between the dependent and independent variables is often known. The data need to be transformed prior to performing a linear regression. Transformations can often be accomplished by taking the logarithm (base-10 or natural) of one or both sides of the equation.

Common Transformations

Relationship     Plot Method                    Transformed Equation             Transformed Intercept, b   Transformed Slope, m1

y = α x^γ        Log y vs. Log x (log plot)     Log(y) = Log(α) + γ Log(x)       Log(α)                     γ
                 Ln y vs. Ln x (log-log)        Ln(y) = Ln(α) + γ Ln(x)          Ln(α)                      γ

y = α e^(γx)     Log y vs. x (semi-log plot)    Log(y) = Log(α) + γ Log(e) x     Log(α)                     γ Log(e)
                 Ln y vs. x (semi-log plot)     Ln(y) = Ln(α) + γ x              Ln(α)                      γ
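As an illustration of the last row, the exponential case y = α e^(γx) becomes linear after taking Ln of both sides. A minimal Python sketch with made-up, noise-free demo data (α = 2 and γ = 0.5 are assumptions for the demo only, not values from the slides):

```python
import numpy as np

# Demo data generated from y = 2*exp(0.5*x); any exponential data set would do.
x = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 2.5])
y = 2.0 * np.exp(0.5 * x)

# Transform: Ln(y) = Ln(alpha) + gamma*x, then ordinary linear regression.
m1, b = np.polyfit(x, np.log(y), 1)

gamma = m1                 # transformed slope = γ
alpha = np.exp(b)          # transformed intercept = Ln(α), so α = e^b
print(f"y ≈ {alpha:.2f} * exp({gamma:.2f} x)")
```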

Regression with Transformation

Example

A velocity probe provides a voltage output, E, that is related to velocity, U, by the form E = δ + εU^ρ, where δ, ε, and ρ are constants.

U (ft/s)   0      10     20     30     40
Ei (V)     3.19   3.99   4.30   4.48   4.65

[Charts: Output Voltage (VDC) vs. Velocity (ft/s) on linear axes, and the same data on log-log axes]

Data Relationship Transformation

E = δ + εU^ρ, with E = δ = 3.19 at U = 0, so:

Log(E − 3.19) = Log(εU^ρ) = Log(ε) + Log(U^ρ) = Log(ε) + ρ Log(U)

i.e., Y = b + m1·X with Y = Log(E − 3.19), b = Log(ε), m1 = ρ, and X = Log(U).

Let's transform the data:

U (ft/s)   Ei (V)   X = Log(U)   Y = Log(Ei − 3.19)
 0         3.19        —              —
10         3.99       1.00          -0.097
20         4.30       1.30           0.045
30         4.48       1.48           0.111
40         4.65       1.60           0.164

Perform Regression on the transformed Data

Solution (Excel 2004 Output)

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.998723855
R Square             0.997449339
Adjusted R Square    0.996174009
Standard Error       0.0070          <-- SEE
Observations         4

t-value: t_{α,ν} = TINV(0.05, 2) = 4.3026 (used with SEE to form the Y+ and Y− bounds in the uncertainty table below)

 

ANOVA
              df    SS             MS          F           Significance F
Regression     1    0.038118269    0.038118    782.1106    0.00127614
Residual       2    9.74754E-05    4.87E-05
Total          3    0.038215745

              Coefficients   Standard Error   t Stat     P-value    Lower 95%     Upper 95%
Intercept        -0.525      0.021056315      -24.9274   0.001605   -0.61547736   -0.4342812
X Variable 1      0.432      0.015438034       27.96624  0.001276    0.36531922    0.49816831

Y = -0.525 + 0.432X

Regression with Transformation & Uncertainty

Y predicted    Y+         Y-        |  Transform it back again:   E      E+     E-
    —             —          —      |                             3.19   3.19   3.19
 -0.0931       -0.0781    -0.1082   |                             4.00   4.03   3.97
  0.0368        0.0519     0.0218   |                             4.28   4.32   4.24
  0.1129        0.1279     0.0978   |                             4.49   4.53   4.44
  0.1668        0.1818     0.1518   |                             4.66   4.71   4.61

(E, E+, and E- are obtained by transforming Y, Y+, and Y- back: E = 3.19 + 10^Y.)

Example 4.10

[Chart: E (V) vs. U (ft/s), 0-50 ft/s, showing the data and the fitted curve with uncertainty bounds]


Back-transform the intercept: b = Log(ε), so −0.525 = Log(ε) and ε = 10^(−0.525) = 0.298. The slope gives ρ = 0.432. Therefore:

E = 3.19 + 0.298 U^0.432
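The whole transform-fit-back-transform sequence for this example can also be scripted. A sketch that mirrors the steps above (subtract δ = 3.19, regress Log(E − δ) on Log(U), then back-transform the intercept; NumPy's polyfit stands in for the Excel regression):

```python
import numpy as np

U = np.array([0, 10, 20, 30, 40], float)       # velocity, ft/s
E = np.array([3.19, 3.99, 4.30, 4.48, 4.65])   # probe output, V

delta = E[0]                                   # δ = E at U = 0  ->  3.19
X = np.log10(U[1:])                            # drop U = 0 (log undefined there)
Y = np.log10(E[1:] - delta)

rho, log_eps = np.polyfit(X, Y, 1)             # slope = ρ, intercept = Log(ε)
eps = 10.0 ** log_eps                          # back-transform: ε = 10^intercept

print(f"E = {delta:.2f} + {eps:.3f} * U^{rho:.3f}")   # ≈ 3.19 + 0.298 U^0.432
```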


Multiple and Polynomial Regression

Regression analysis can also be performed in situations where there is more than one independent variable (multiple regression) or for polynomials of an independent variable (polynomial regression).

Polynomial regression seeks the form:

Y = b + m1·x + m2·x² + ... + mk·x^k

Multiple regression seeks a function of the form:

Y = b + m1·x̂1 + m2·x̂2 + m3·x̂3 + ... + mk·x̂k

where the x̂ may represent several independent variables. For example:

x̂1 = x1,  x̂2 = x2,  x̂3 = x1·x2
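Outside Excel, both forms reduce to the same linear least-squares machinery: a polynomial fit is a special case of multiple regression on the powers of x. A NumPy sketch (the data arrays are illustrative placeholders, not values from the slides):

```python
import numpy as np

# Polynomial regression: Y = b + m1*x + m2*x^2 (illustrative data)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 2.0, 4.9, 10.2, 17.1, 26.0])
m2, m1, b = np.polyfit(x, y, 2)                 # coefficients, highest power first

# Multiple regression: Y = b + m1*x1 + m2*x2 + m3*(x1*x2) (illustrative data)
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([0.5, 1.5, 1.0, 2.5, 2.0, 3.0])
Y  = np.array([2.1, 5.2, 5.9, 11.8, 10.7, 16.2])

A = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2])   # design matrix [1, x̂1, x̂2, x̂3]
coeffs, *_ = np.linalg.lstsq(A, Y, rcond=None)             # [b, m1, m2, m3]
print(coeffs)
```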

Linear Regression in Excel 2004

- Input the result values (Y)
- Input the independent variable (X)
- Input the desired confidence level

Excel 2004 Linear Regression Output

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.99964308
R Square             0.99928628    <-- R²
Adjusted R Square    0.99910785
Standard Error       0.02788582    <-- SEE = sey
Observations         6             <-- N

ANOVA
              df    SS            MS            F             Significance F
Regression     1    4.35502286    4.35502286    5600.45805    1.9107E-07
Residual       4    0.00311048    0.00077762
Total          5    4.35813333

              Coefficients   Standard Error   t Stat        P-value       Lower 95%      Upper 95%
Intercept       0.02952381   0.02018228       1.46285828    0.21733392    -0.02651117    0.08555879     <-- intercept "b"
X Variable 1    0.99771429   0.01333197       74.8362082    1.9107E-07     0.9606988     1.03472978     <-- slope "m1"

Lower 95% and Upper 95% are the lower and upper bounds for the coefficients. To obtain the ± bound, subtract the lower from the upper and divide by two (e.g., for the slope, ±(1.03472978 − 0.9606988)/2 ≈ ±0.037).
