
Regression

Correlation and Regression

The test you choose depends on level of measurement:

Independent                    Dependent              Test
Dichotomous                    Interval-Ratio         Independent Samples t-test
Nominal, Dichotomous           Interval-Ratio         ANOVA
Nominal, Dichotomous           Nominal, Dichotomous   Cross Tabs
Interval-Ratio, Dichotomous    Interval-Ratio         Bivariate Regression/Correlation

Correlation and Regression

Bivariate regression is a technique that fits a straight line as close as possible to all the coordinates of two continuous variables plotted on a two-dimensional graph, in order to summarize the relationship between the variables.

Correlation is a statistic that assesses the strength and direction of association of two continuous variables. It is created through the technique called regression.

Bivariate Regression

For example, a criminologist may be interested in the relationship between family income and number of children, or between self-esteem and criminal behavior.

Independent variables: Family Income; Self-esteem
Dependent variables: Number of Children; Criminal Behavior

Bivariate Regression

For example, the research hypotheses:

As family income increases, the number of children in families declines (negative relationship).
As self-esteem increases, reports of criminal behavior increase (positive relationship).

Independent variables: Family Income; Self-esteem
Dependent variables: Number of Children; Criminal Behavior

Bivariate Regression

For example, the null hypotheses:

There is no relationship between family income and the number of children in families; the relationship statistic b = 0.
There is no relationship between self-esteem and criminal behavior; the relationship statistic b = 0.

Independent variables: Family Income; Self-esteem
Dependent variables: Number of Children; Criminal Behavior

Bivariate Regression

Let's look at the relationship between self-esteem and criminal behavior.

Regression starts with plots of the coordinates of the variables in a hypothesis (although you will hardly ever plot your data by hand in reality).

The data: each respondent has filled out a self-esteem assessment and reported the number of crimes committed.

Bivariate Regression

[Scatterplot: X = self-esteem (10 to 40), Y = crimes (0 to 10).]

What do you think the relationship is?

Bivariate Regression

[The same scatterplot: X = self-esteem (10 to 40), Y = crimes (0 to 10).]

Is it positive? Negative? No change?

Bivariate Regression

[The same scatterplot, now with a fitted line.]

Regression is a procedure that fits a line to the data. The slope of that line acts as a model for the relationship between the plotted variables.

Bivariate Regression

[The same scatterplot, with three illustrative lines of different slopes.]

The slope of a line is the change in the corresponding Y value for each unit increase in X (rise over run).

Slope = 0: no relationship!
Slope = 0.2: positive relationship!
Slope = -0.2: negative relationship!

Bivariate Regression

The mathematical equation for a line:

Y = mX + b

Where: Y = the line's position on the vertical axis at any point
X = the line's position on the horizontal axis at any point
m = the slope of the line
b = the intercept with the Y axis, where X equals zero

Bivariate Regression

The statistical equation for a line:

Ŷ = a + bX

Where: Ŷ = the line's position on the vertical axis at any point (predicted value of the dependent variable)
X = the line's position on the horizontal axis at any point (value of the independent variable)
b = the slope of the line (called the coefficient)
a = the intercept with the Y axis, where X equals zero

Bivariate Regression

The next question: how do we draw the line?

Our goal for the line: fit the line as close as possible to all the data points, for all values of X.

Bivariate Regression

[Scatterplot: X = self-esteem (10 to 40), Y = crimes (0 to 10).]

How do we minimize the distance between a line and all the data points?

Bivariate Regression

How do we minimize the distance between a line and all the data points?

You already know of a statistic that minimizes the distance between itself and all data values for a variable: the mean!

The mean minimizes the sum of squared deviations, Σ(Y - Ȳ)². It is where deviations sum to zero and where the squared deviations are at their lowest value.

Bivariate Regression

The mean minimizes the sum of squared deviations: it is where deviations sum to zero and where the squared deviations are at their lowest value.

Take this principle and fit the line to the place where the squared deviations (on Y) from the line are at their lowest value across all X's:

Σ(Y - Ŷ)², where Ŷ = the line

Bivariate Regression

There are several lines that you could draw where the deviations would sum to zero. Minimizing the sum of squared errors gives you the unique, best-fitting line for all the data points; it is the line that is closest to all points.

Ŷ (Y-hat) = the Y value of the line at any X
Y = the case's value on variable Y
Y - Ŷ = the residual
Σ(Y - Ŷ) = 0; therefore, we use Σ(Y - Ŷ)² and minimize that!

Bivariate Regression

[Scatterplot with fitted line illustrating Y - Ŷ: X = self-esteem (10 to 40), Y = crimes (0 to 10).]

Illustration of Y - Ŷ, where Y_i = the actual Y value corresponding with an actual X, and Ŷ_i = the level of the line on Y corresponding with that X:

Y = 10, Ŷ = 5: residual = 5
Y = 0, Ŷ = 4: residual = -4

Bivariate Regression

[The same scatterplot, now squaring each deviation from the line.]

Illustration of (Y - Ŷ)², where Y_i = the actual Y value corresponding with an actual X, Ŷ_i = the level of the line on Y corresponding with that X, and (Y_i - Ŷ_i)² = the squared deviation:

Y = 10, Ŷ = 5: residual = 5, squared = 25
Y = 0, Ŷ = 4: residual = -4, squared = 16

Bivariate Regression

[Scatterplot with a candidate line through the points.]

Illustration of Σ(Y - Ŷ)², where Y_i = the actual Y value corresponding with an actual X, and Ŷ_i = the level of the line on Y corresponding with that X.

The goal: find the line that minimizes the sum of squared deviations. The best line will have the lowest value of the sum of squared deviations (adding the squared deviations for each case in the sample).

Bivariate Regression

[Scatterplot with the fitted line: X = self-esteem (10 to 40), Y = crimes (0 to 10).]

The fitted line for our example has the equation Ŷ = 6 - .2X. Each data point is described by Y = a + bX + e, where e = the distance from the line to the data point, or error.

If you were to draw any other line, it would not minimize Σ(Y - Ŷ)².

Bivariate Regression

We use Σ(Y - Ŷ)² and minimize that!

There is a simple, elegant formula for discovering the line that minimizes the sum of squared errors:

b = Σ( (X - X̄)(Y - Ȳ) ) / Σ(X - X̄)²

a = Ȳ - bX̄          Ŷ = a + bX

This is the method of least squares; it gives our least squares estimate, and it indicates why we call this technique ordinary least squares, or OLS, regression.
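The two least-squares formulas can be computed directly. Below is a minimal Python sketch; the self-esteem (X) and crime (Y) scores are hypothetical values invented for illustration, and they produce a fitted line in the same spirit as the slides' Ŷ = 6 - .2X.

```python
# OLS slope and intercept from the least-squares formulas:
#   b = sum((X - Xbar)(Y - Ybar)) / sum((X - Xbar)^2)
#   a = Ybar - b * Xbar
# X (self-esteem) and Y (crimes) below are hypothetical illustration data.

def ols_fit(x, y):
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
    a = y_bar - b * x_bar
    return a, b

x = [10, 15, 20, 25, 30, 35, 40]   # self-esteem scores
y = [5, 4, 3, 1, 0, 0, 0]          # crimes committed
a, b = ols_fit(x, y)
print(f"Y-hat = {a:.2f} + ({b:.3f})X")
```

Any statistical package (SPSS, Stata, GRETL) reports the same estimates; the point is that a and b fall straight out of the two formulas above.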

Bivariate Regression

[Plot: X = 0 or 1, Y from 1 to 10.]

Considering that a regression line minimizes Σ(Y - Ŷ)², where would the regression line cross for an interval-ratio variable regressed on a dichotomous independent variable?

For example:
0 = Men: Mean = 6
1 = Women: Mean = 4

Bivariate Regression

[Plot: X = 0 or 1, Y from 1 to 10; the fitted line passes through the two group means.]

The difference of means will be the slope. This is the same number that is tested for significance in an independent samples t-test.

0 = Men: Mean = 6
1 = Women: Mean = 4
Slope = -2; Ŷ = 6 - 2X
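This claim is easy to verify numerically. The sketch below uses small made-up groups whose means match the slide's 6 and 4; OLS with the 0/1 predictor recovers the difference of means as the slope.

```python
# With a 0/1 (dichotomous) predictor, the OLS slope is the difference of
# group means and the intercept is the mean of the 0-coded group.
# Hypothetical scores chosen so group means match the slides (6 and 4).

x = [0, 0, 0, 1, 1, 1]             # 0 = men, 1 = women
y = [5, 6, 7, 3, 4, 5]             # men's mean = 6, women's mean = 4

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
     / sum((xi - x_bar) ** 2 for xi in x))
a = y_bar - b * x_bar
print(a, b)   # 6.0 -2.0, i.e. Y-hat = 6 - 2X: the difference of means is the slope
```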

Correlation

This lecture has covered how to model the relationship between two variables with regression. Another concept is strength of association; correlation provides that.

Correlation

[Scatterplot with fitted line: X = self-esteem (10 to 40), Y = crimes (0 to 10).]

So our equation is: Ŷ = 6 - .2X

The slope tells us the direction of association. How strong is that?

Correlation

[Scatterplot, Y vs. X: example of a low negative correlation.]

When there is a lot of difference on the dependent variable across subjects at particular values of X, there is not as much association (weaker).

Correlation

[Scatterplot, Y vs. X: example of a high negative correlation.]

When there is little difference on the dependent variable across subjects at particular values of X, there is more association (stronger).

Correlation

To find the strength of the relationship between two variables, we need correlation. The correlation is the standardized slope: it refers to the standard deviation change in Y when you go up a standard deviation in X.

Correlation

The correlation is the standardized slope: it refers to the standard deviation change in Y when you go up a standard deviation in X.

Recall that the standard deviation of X is Sx = √( Σ(X - X̄)² / (n - 1) ), and the standard deviation of Y is Sy = √( Σ(Y - Ȳ)² / (n - 1) ).

Pearson correlation: r = (Sx / Sy) · b
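A quick check of r = b(Sx/Sy) in Python, again with the hypothetical self-esteem and crime data used for illustration:

```python
import statistics

# Pearson's r as the standardized slope: r = b * (Sx / Sy),
# using the sample standard deviations of X and Y. Hypothetical data.

x = [10, 15, 20, 25, 30, 35, 40]
y = [5, 4, 3, 1, 0, 0, 0]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
     / sum((xi - x_bar) ** 2 for xi in x))
r = b * statistics.stdev(x) / statistics.stdev(y)
print(round(r, 3))   # a strong negative correlation for these data
```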

Correlation

The Pearson correlation, r:

- tells the direction and strength of the relationship between continuous variables
- ranges from -1 to +1
- is + when the relationship is positive and - when the relationship is negative
- the higher the absolute value of r, the stronger the association
- a standard deviation change in X corresponds with an r standard deviation change in Y

Correlation

The Pearson correlation, r, is also an inferential statistic:

t(n-2) = (r - 0) / √( (1 - r²) / (n - 2) )

where the null hypothesis is r = 0. When t is significant, there is a relationship in the population that is not equal to zero!
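The t statistic for r can be computed directly; the r = 0.5 and n = 27 values below are made up for illustration.

```python
import math

# Significance test for Pearson's r against the null hypothesis r = 0:
#   t = (r - 0) / sqrt((1 - r^2) / (n - 2)),  df = n - 2

def t_for_r(r, n):
    return r / math.sqrt((1 - r ** 2) / (n - 2))

# Hypothetical values: r = .5 from a sample of n = 27
print(round(t_for_r(0.5, 27), 3))   # exceeds the two-tailed critical t of 2.060 at df = 25
```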

Error Analysis

Ŷ = a + bX. This equation gives the conditional mean of Y at any given value of X: in reality, our line gives us the expected mean of Y given each value of X. The line's equation tells you how the mean of your dependent variable changes as your independent variable goes up.

[Plot: Y vs. X with the line running through the conditional means of Y.]

Error Analysis

As you know, every mean has a distribution around it, so there is a standard deviation. This is true for conditional means as well: you also have a conditional standard deviation.

The conditional standard deviation, or Root Mean Square Error, equals the approximate average deviation from the line:

√( SSE / (n - 2) ) = √( Σ(Y - Ŷ)² / (n - 2) )

[Plot: Y vs. X showing scatter around the fitted line.]
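The RMSE can be computed from the residuals directly. A sketch with the same hypothetical data:

```python
import math

# Conditional standard deviation (Root Mean Square Error):
#   RMSE = sqrt( sum((Y - Yhat)^2) / (n - 2) )
# The line is fitted first, then residuals are squared. Hypothetical data.

x = [10, 15, 20, 25, 30, 35, 40]
y = [5, 4, 3, 1, 0, 0, 0]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
     / sum((xi - x_bar) ** 2 for xi in x))
a = y_bar - b * x_bar
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
rmse = math.sqrt(sse / (n - 2))
print(round(rmse, 3))   # the approximate average deviation from the line
```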

Error Analysis

The assumption of homoskedasticity: the variation around the line is the same no matter the X; the conditional standard deviation holds for any given value of X.

If there is a relationship between X and Y, the conditional standard deviation is going to be less than the standard deviation of Y; if this is so, you have improved prediction of the mean value of Y by looking at each level of X.

If there were no relationship, the conditional standard deviation would be the same as the original standard deviation, and the regression line would be flat at the mean of Y.

[Plot: Y vs. X comparing the conditional standard deviation around the line with the original standard deviation of Y.]

Error Analysis

So guess what? We have a way to determine how much our understanding of Y is improved when taking X into account. It is based on the fact that the conditional standard deviations should be smaller than Y's original standard deviation.

Error Analysis

Proportional Reduction in Error:

Let's call the variation around the mean of Y "Error 1," and the variation around the line when X is considered "Error 2." But rather than going all the way to standard deviations to determine error, let's just stop at the basic measure, the sum of squared deviations:

Error 1 (E1) = Σ(Y - Ȳ)², also called the Total Sum of Squares (TSS)
Error 2 (E2) = Σ(Y - Ŷ)², also called the Sum of Squared Errors (SSE)

[Plot: Y vs. X illustrating Error 1 (around the mean) and Error 2 (around the line).]

R-Squared

Proportional Reduction in Error:

To determine how much taking X into consideration reduces the variation in Y (at each level of X), we can use a simple formula:

(E1 - E2) / E1

which tells us the proportion or percentage of the original error that is explained by X.

Error 1 (E1) = Σ(Y - Ȳ)²
Error 2 (E2) = Σ(Y - Ŷ)²

[Plot: Y vs. X illustrating Error 1 and Error 2.]

R-squared

r² = (E1 - E2) / E1 = (TSS - SSE) / TSS = ( Σ(Y - Ȳ)² - Σ(Y - Ŷ)² ) / Σ(Y - Ȳ)²

r² is called the coefficient of determination. It is also the square of the Pearson correlation.

[Plot: Y vs. X illustrating Error 1 and Error 2.]
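The proportional-reduction-in-error formula can be computed directly; the data below are the same hypothetical illustration values.

```python
# Coefficient of determination: r^2 = (TSS - SSE) / TSS,
# the proportion of the original error (around Ybar) explained by X.
# Hypothetical data.

x = [10, 15, 20, 25, 30, 35, 40]
y = [5, 4, 3, 1, 0, 0, 0]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
     / sum((xi - x_bar) ** 2 for xi in x))
a = y_bar - b * x_bar
tss = sum((yi - y_bar) ** 2 for yi in y)                      # Error 1
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))   # Error 2
r2 = (tss - sse) / tss
print(round(r2, 3))
```

Note that r2 here equals the square of the Pearson correlation computed earlier, as the slides state.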

R-Squared

R²:

- is the improvement obtained by using X (and drawing a line through the conditional means) in getting as near as possible to everybody's value for Y, over just using the mean of Y alone
- falls between 0 and 1
- an R² of 1 means an exact fit (there is no variation of scores around the regression line)
- an R² of 0 means no relationship (as much scatter as in the original Y variable, and a flat regression line through the mean of Y)
- would be the same for X regressed on Y as for Y regressed on X
- can be interpreted as the percentage of variability in Y that is explained by X

Some people get hung up on maximizing R², but this is too bad, because any effect is still a finding; a small R² only indicates that you haven't told the whole (or much of the) story with your variable.

Error Analysis, SPSS

Some SPSS output (Anti-Gay Marriage regressed on Age):

r² = ( Σ(Y - Ȳ)² - Σ(Y - Ŷ)² ) / Σ(Y - Ȳ)² = 196.886 / 2853.286 = .069

The numerator, 196.886, is the sum of squares from the line to the mean; 2853.286 is the original sum of squares for Anti-Gay Marriage (data points to the mean); the remainder is the data points to the line.

Error Analysis

The same SPSS output (Anti-Gay Marriage regressed on Age), shown graphically:

[Plot: Age (18 to 89) on the X axis; Anti-Gay Marriage sentiment on the Y axis, scored 1 = Strong Support, 2 = Support, 3 = Neutral, 4 = Oppose, 5 = Strong Oppose; mean = 2.98.]

The colored lines are examples of:

- the distance from each person's data point to the line or model: new, still unexplained error
- the distance from the line or model to the mean, for each person: the reduction in error
- the distance from each person's data point to the mean: the original variable's error

ANOVA Table

[Plot: Y vs. X with the regression line passing through the conditional (group) means, and BSS, WSS, and TSS marked.]

Q: Why do I see an ANOVA table?
A: We bust up variance to get R².

Each case has a value for the distance from the line (the conditional mean of Y) to the grand mean of Y, and a value for the distance from its own Y value to the line (the conditional mean).

The squared distance from the line to the mean (Regression SS) is equivalent to BSS, with df = 1; in ANOVA, everyone in a group shares the group mean of Y. The squared distance from the line to the data values on Y (Residual SS) is equivalent to WSS, with df = n - 2.

The ratio of Regression to Residual sums of squares, each divided by its degrees of freedom, forms an F distribution in repeated sampling. If F is significant, X explains some variation in Y.
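The ANOVA decomposition can be reproduced by hand. The sketch below (hypothetical data) splits the total sum of squares into Regression SS (line to mean) and Residual SS (points to line) and forms F from the mean squares.

```python
# Regression ANOVA: TSS splits into Regression SS (line to mean, df = 1)
# and Residual SS (points to line, df = n - 2);
# F = MS_regression / MS_residual. Hypothetical data.

x = [10, 15, 20, 25, 30, 35, 40]
y = [5, 4, 3, 1, 0, 0, 0]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
     / sum((xi - x_bar) ** 2 for xi in x))
a = y_bar - b * x_bar
yhat = [a + b * xi for xi in x]
reg_ss = sum((yh - y_bar) ** 2 for yh in yhat)            # plays the role of BSS
res_ss = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))   # plays the role of WSS
f = (reg_ss / 1) / (res_ss / (n - 2))
print(round(f, 2))
```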

Dichotomous Variables

[Plot: X = 0 or 1, Y from 1 to 10; the fitted line passes through the group means, with BSS, WSS, and TSS marked.]

Using a dichotomous independent variable, the ANOVA table in bivariate regression will have the same numbers and ANOVA results as a one-way ANOVA table would (and compare this with an independent samples t-test).

0 = Men: Mean = 6
1 = Women: Mean = 4
Grand Mean = 5
Slope = -2; Ŷ = 6 - 2X

Regression, Inferential Statistics

Recall that statistics are divided between descriptive and inferential statistics.

Descriptive: the equation for your line is a descriptive statistic. It tells you the real, best-fitted line that minimizes squared errors.

Inferential: but what about the population? What can we say about the relationship between your variables in the population? The inferential statistics are estimates based on the best-fitted line.

Regression, Inferential Statistics

The significance of F you already understand: the ratio of the Regression (line to the mean of Y) to the Residual (line to data points) sums of squares, each divided by its degrees of freedom, forms an F ratio in repeated sampling.

Null: r² = 0 in the population. If F exceeds the critical F, then your variables have a relationship in the population (X explains some of the variation in Y).

[Sketch of the F distribution with the most extreme 5% of F's shaded.]

F = (Regression SS / 1) / (Residual SS / (n - 2))

Regression, Inferential Statistics

What about the slope, or coefficient? From sample to sample, different slopes would be obtained. The slope has a sampling distribution that is normally distributed, so we can do a significance test.

[Sketch of the standard normal sampling distribution, z from -3 to 3.]

Regression, Inferential Statistics

Conducting a test of significance for the slope of the regression line: by laying the sampling distribution of the slope over a hypothesized population slope, H0, one determines whether the sample could have been drawn from a population where the slope equals H0.

1. Two-tailed significance test for α-level = .05
2. Critical t = ±1.96
3. To find whether there is a significant slope in the population:
   H0: β = 0
   Ha: β ≠ 0
4. Collect data
5. Calculate t: t = (b - β0) / s.e., where s.e. = √( Σ(Y - Ŷ)² / (n - 2) ) / √( Σ(X - X̄)² )
6. Make a decision about the null hypothesis
7. Find the p-value
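Steps 4 and 5 can be sketched as follows, with the same hypothetical data; note that t² here equals the F statistic from the ANOVA table, as it must in bivariate regression.

```python
import math

# t test for the slope against H0: beta = 0:
#   s.e.(b) = sqrt( SSE / (n - 2) ) / sqrt( sum((X - Xbar)^2) )
#   t = (b - 0) / s.e.(b),  df = n - 2
# Hypothetical data.

x = [10, 15, 20, 25, 30, 35, 40]
y = [5, 4, 3, 1, 0, 0, 0]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
a = y_bar - b * x_bar
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
se_b = math.sqrt(sse / (n - 2)) / math.sqrt(sxx)
t = b / se_b
print(round(t, 2))   # |t| well beyond 1.96: reject H0
```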

Correlation and Regression

Back to the SPSS output: the standard error, t, and the p-value all appear on the SPSS output.

Correlation and Regression

Back to the SPSS output:

Ŷ = 1.88 + .023X

So in the GSS example, the slope is significant: there is evidence of a positive relationship in the population between age and anti-gay-marriage sentiment. 6.9% of the variation in marriage attitude is explained by age. The older Americans get, the more likely they are to oppose gay marriage.

A one-year increase in age elevates anti attitudes by .023 scale units. There is a weak positive correlation: since r² = .069, a standard deviation increase in age produces about a .26 standard deviation increase on the anti scale.
