Uploaded by turnernp

regression

**Simple Linear Regression**

1. The following data were collected by a bank wishing to examine the relationship (if any) between individual income and savings per year (in units of $1,000).

| Income | 60 | 40 | 50 | 30 | 70 | 80 | 74 | 54 |
|---|---|---|---|---|---|---|---|---|
| Savings | 6 | 3 | 3 | 2 | 8 | 12 | 11 | 7 |

(a) Which of the two variables would you choose to be the response variable in a

simple linear regression analysis?

Solution:

The bank would be most interested in predicting an individual's Savings, given their individual Income, and in assessing the effect of a change in Income on Savings. Since Savings is the variable that we are most interested in predicting, Savings is taken as the response variable. Income is the explanatory variable.

(b) Without using Excel, sketch an approximate scatterplot of the data.

Solution:

Your scatterplot should look similar to this:

A simple linear regression analysis was performed using Excel, yielding the following

output:

142 CHAPTER 10. SIMPLE LINEAR REGRESSION

(c) Use the Excel output to write down an estimate \(b_1\) for the regression slope parameter \(\beta_1\). Interpret the meaning of \(b_1\) in terms of family income and savings.

Solution:

From the output, the estimate of the slope parameter \(\beta_1\) is \(b_1 = 0.205\) (3 d.p.). The interpretation is that if an individual's Income increases by $1,000, their expected Savings increases by $205.

(d) Test the hypothesis \(H_0: \beta_1 = 0\) against \(H_1: \beta_1 \neq 0\) at the 1% significance level.

Solution:

In simple linear regression, the test of \(H_0: \beta_1 = 0\) (there is no relationship between Income and Savings) against \(H_1: \beta_1 \neq 0\) (there is a significant linear relationship between Income and Savings) can be carried out in either of two (equivalent) ways. The first approach is based on the test statistic \(F = MS_{\text{Regression}} / MS_{\text{Residual}}\), which has an \(F_{1,n-2}\) distribution under \(H_0\). From the Excel output, the observed value is \(F_{\text{obs}} = 48.42\), and the p-value associated with this observed value is \(0.000437 < \alpha = 0.01\).

The second approach is based on the test statistic

\[ T = \frac{B_1 - \beta_1}{\sqrt{MS_{\text{Residual}} / SS_x}} \sim t_{n-2} = t_6 \quad \text{under } H_0, \]

where

\[ B_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \]

is the least squares estimator of the slope \(\beta_1\). From the Excel output, the observed value of the test statistic is

\[ t_{\text{obs}} = 6.9585, \]

and the critical region is (two-tail test, \(\alpha = 0.01\))

\[ CR = \{ |T| > t_{\text{crit}} = t^{\alpha/2}_{n-2} = t^{0.005}_{6} = 3.7074 \}. \]

Thus, \(t_{\text{obs}} \in CR\). (Alternatively, it can be seen from the Excel output that the p-value associated with \(t_{\text{obs}} = 6.9585\) is \(0.000437 < \alpha = 0.01\).)

Thus, either approach results in rejection of \(H_0: \beta_1 = 0\) in favour of \(H_1: \beta_1 \neq 0\) at the 1% significance level. Hence there is sufficient evidence at the 1% significance level to conclude that there is a significant linear relationship between Income and Savings.

The advantage of the second approach is that it can be used to test \(H_0: \beta_1 = 0\) against either of the one-tail alternatives \(H_1: \beta_1 > 0\) (significant positive relationship between Income and Savings) or \(H_1: \beta_1 < 0\) (significant negative relationship between Income and Savings), by choosing the appropriate form of the critical region.
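The Excel figures quoted in parts (c) and (d) can be checked by hand from the eight data points. The following is a minimal sketch in plain Python, not the Excel procedure itself; the function name `simple_linear_regression` and the dictionary keys are illustrative choices, and the critical value 3.7074 is copied from the solution above rather than computed.

```python
import math

# Data from Question 1 (both variables in units of $1,000).
income = [60, 40, 50, 30, 70, 80, 74, 54]
savings = [6, 3, 3, 2, 8, 12, 11, 7]


def simple_linear_regression(x, y):
    """Least squares fit of y on x, with the F and t statistics for H0: beta_1 = 0."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    ss_y = sum((yi - y_bar) ** 2 for yi in y)
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    b1 = ss_xy / ss_x              # slope estimate
    b0 = y_bar - b1 * x_bar        # intercept estimate

    ss_reg = b1 * ss_xy            # regression sum of squares (1 d.f.)
    ss_res = ss_y - ss_reg         # residual sum of squares (n - 2 d.f.)
    ms_res = ss_res / (n - 2)

    f_obs = ss_reg / ms_res                  # F = MS_Regression / MS_Residual
    t_obs = b1 / math.sqrt(ms_res / ss_x)    # T with beta_1 = 0 under H0
    return {"b0": b0, "b1": b1, "f_obs": f_obs, "t_obs": t_obs}


fit = simple_linear_regression(income, savings)
t_crit = 3.7074  # two-tail critical value for t_6 at alpha = 0.01, quoted in the solution
reject_h0 = abs(fit["t_obs"]) > t_crit

print({name: round(value, 4) for name, value in fit.items()}, reject_h0)
```

Because there is only one explanatory variable, the observed F statistic is the square of the observed t statistic, which is why the two approaches always agree for the two-sided test.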

(e) Briefly explain what, in practice, is the purpose of examining a plot of Residuals against the explanatory variable (Income).

Solution:

From diagnostic plots, one can check whether any of the assumptions of simple

linear regression appear to be violated. From residual plots, one can assess

the appropriateness of the linear model, and can recognise if the errors are not

independent or do not have constant variance. The remaining assumption is

that of Normality of the errors, which can be checked by examining a Normality

plot of the residuals.

(f) From this regression, what is the predicted Savings for an individual with an Income of $20,000 per annum? Comment on the usefulness of this prediction.

Solution:

Predicted Savings \(= b_0 + b_1 \times \text{Income} = -5.246 + 0.205 \times 20 = -1.142\), where the final figure is evaluated with the unrounded coefficients (the rounded values shown give \(-1.146\)). One might question how a negative value of Savings should be interpreted. Note that an Income of $20,000 is outside the range of the data upon which the model was constructed (the observed Incomes run from $30,000 to $80,000), hence this prediction is not reliable and should be taken with a grain of salt. Predictions are only reliable if the values of any explanatory variables are within the range of the data.
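The prediction and the extrapolation warning can both be checked directly. This is a small sketch under the assumption that the coefficients are recomputed from the Question 1 data rather than read from Excel; the helper names `fit_line` and `predict` are hypothetical.

```python
# Data from Question 1 (both variables in units of $1,000).
income = [60, 40, 50, 30, 70, 80, 74, 54]
savings = [6, 3, 3, 2, 8, 12, 11, 7]


def fit_line(x, y):
    """Return the least squares intercept and slope (b0, b1)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = ss_xy / ss_x
    return y_bar - b1 * x_bar, b1


def predict(b0, b1, x_new):
    """Point prediction from the fitted line."""
    return b0 + b1 * x_new


b0, b1 = fit_line(income, savings)
x_new = 20  # an Income of $20,000
predicted_savings = predict(b0, b1, x_new)

# A prediction is an extrapolation when x_new lies outside the observed range of x.
is_extrapolation = not (min(income) <= x_new <= max(income))

print(round(predicted_savings, 3), is_extrapolation)
```

The range check is deliberately crude: it only flags values outside the observed minimum and maximum, which is the criterion the solution uses.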


2. The following output comes from a linear regression, modelling the number of electronic components assembled (within a certain time) by employees of an electronics company with differing amounts of experience (in years).

(a) Specify the regression model and explain each term in the model.

(b) State the estimated regression equation between Production and Experience.

(c) Is there a significant linear relationship between Production and Experience? Justify your answer.

(d) Do the residual plots suggest any problems with model assumptions?

(e) Estimate the effect, on average, of

i. a one year increase in experience,

ii. a two year increase in experience.

(f) State a 95% confidence interval for the slope parameter \(\beta_1\).

(g) What is the coefficient of determination for this model? What is its meaning?

Solution:

(a) \(\text{Production} = \beta_0 + \beta_1 \times \text{Experience} + \varepsilon\), where Production is the number of components produced, Experience is the number of years of experience the employee has, \(\beta_0\) is the intercept, \(\beta_1\) is the slope, and \(\varepsilon\) is the random variation term or error.

(b) The estimated regression line is

\[ \widehat{\text{Production}} = 2.914 + 1.967 \times \text{Experience}. \]

(c) We are testing the hypotheses \(H_0: \beta_1 = 0\) against \(H_1: \beta_1 \neq 0\). The p-value for this test is \(7.33 \times 10^{-31} < 0.05\), so the data provide overwhelming evidence against the null hypothesis. We conclude that there is a significant linear relationship between Production and Experience.


(d) 1. A linear model is appropriate. There is no evidence of a trend in the

residual plot.

2. The errors are normally distributed. The points in the normal prob-

ability plot lie approximately on a straight line, indicating the assumption

of normality is okay.

3. The errors have constant variance. The spread of residuals about the

horizontal axis does not vary as Experience increases, so this looks okay.

4. The errors are independent (or uncorrelated). The residual plot

doesn’t show any clear violation of independence.

No evidence of outliers or points of high leverage. Thus there is no reason to

doubt the adequacy of our linear regression model.

(e) i. An extra one year of experience will increase Production on average by 1.967 components.

ii. An extra two years of experience will increase Production on average by \(1.967 \times 2 = 3.934\) components.

(f) We can read the 95% confidence interval for \(\beta_1\) from the Excel output as (1.912, 2.021).

(g) \(r^2 = 0.9955\). This means that the variation in Experience explains 99.55% of the variation in Production.

3. House Data: Regression of Price against Age

Open the House.xlsx file. We will perform a regression analysis of Price against the Age of the houses sold. The data was collected in 2010.

(a) Create a column called AgeHouse (which is simply 2010 - YrBuilt). To do this, type AgeHouse in Cell J1, type = 2010 - E2 in Cell J2, and fill down.

(b) Produce a scatterplot of Price against AgeHouse, and describe any general

trend. Aside from this, is there anything else of note?

Solution:

A scatterplot of Price against AgeHouse is shown below:


There does seem to be a trend for Price to decrease with increasing age, but this is due almost entirely to 5 points (possible outliers?) which correspond to very new houses.

(c) Go to Data ⇒ Data Analysis ⇒ Regression. The Input Y Range is Price,

the Input X Range is AgeHouse. You should include Labels in these ranges;

check the corresponding box. Select an Output Range and click OK.

Solution:

These steps yield the following output:

(d) Write down the equation of regression.

Solution:

From the Table of Coefficients, the regression equation is

\[ \widehat{\text{Price}} = 478.77 - 3.4485 \times \text{AgeHouse}. \]


(e) Test appropriate hypotheses to determine if there is a significant linear relationship between Price and AgeHouse. State your conclusion.

Solution:

The hypotheses of interest are

\[ H_0: \beta_1 = 0 \qquad H_1: \beta_1 \neq 0 \]

where \(\beta_1\) is the true regression gradient linking House Prices and the age of the house.

The p-value for the test is \(3.394 \times 10^{-9} \ll 0.05\), so there is overwhelming evidence against the null hypothesis. We conclude that there is a significant linear relationship between Price and the age of the house sold.

(f) Examine the residual plots and Normality plot associated with the regression.

Do the residuals appear Normally distributed?

Solution:

No, there does seem to be some non-Normality in the residuals – the Normality

plot is not entirely linear, due to perhaps 3-5 extreme points. This throws some

doubt on our conclusion above. One option would be to remove one or two of

them and see whether the Normality of the residuals improves, and whether

our conclusions stay the same.

4. Calculate the estimated coefficients \(b_0\) and \(b_1\) in the estimated least squares regression equation \(\hat{y} = b_0 + b_1 x\) in each of the cases (a) and (b), using the formulae given in lecture slides, for a set of data \((x_i, y_i)\), \(i = 1, 2, \ldots, 10\), such that

(a) \(\sum_{i=1}^{10} x_i = 15\), \(\sum_{i=1}^{10} y_i = 714\), \(\sum_{i=1}^{10} x_i y_i = 1278\), \(\sum_{i=1}^{10} x_i^2 = 25.8\),

(b) \(\bar{x} = 0\), \(\bar{y} = 12.7\), \(SS_{xy} = 246.56\), \(s_x^2 = 36.67\).

Solution:

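(a) From formulae given in Lecture slides,

\[ b_1 = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n(\bar{x})^2} = \frac{1278 - 10\left(\frac{15}{10}\right)\left(\frac{714}{10}\right)}{25.8 - 10\left(\frac{15}{10}\right)^2} = 62.727, \]

\[ b_0 = \bar{y} - b_1\bar{x} = \frac{714}{10} - 62.727\left(\frac{15}{10}\right) = -22.691, \]

so the equation of the regression line is \(\hat{y} = -22.691 + 62.727x\).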


(b) First we need to compute \(SS_x\). Recognising that \(SS_x = \sum_{i=1}^{n} (x_i - \bar{x})^2 = (n-1)s_x^2\), we find that \(SS_x = (n-1)s_x^2 = 9 \times 36.67\). Thus

\[ b_1 = \frac{SS_{xy}}{SS_x} = \frac{246.56}{36.67 \times 9} = 0.7471, \]

\[ b_0 = \bar{y} - b_1\bar{x} = 12.7 - 0.7471(0) = 12.7, \]

so the equation of the regression line is \(\hat{y} = 12.7 + 0.7471x\).
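Both cases of Question 4 reduce to a few lines of arithmetic on summary statistics. The sketch below mirrors the two formulae used in the solution; the function names `coefficients_from_sums` and `coefficients_from_ss` are illustrative and are not taken from the lecture slides.

```python
def coefficients_from_sums(n, sum_x, sum_y, sum_xy, sum_x2):
    """Case (a): b1 = (sum xy - n*xbar*ybar) / (sum x^2 - n*xbar^2), b0 = ybar - b1*xbar."""
    x_bar = sum_x / n
    y_bar = sum_y / n
    b1 = (sum_xy - n * x_bar * y_bar) / (sum_x2 - n * x_bar ** 2)
    b0 = y_bar - b1 * x_bar
    return b0, b1


def coefficients_from_ss(n, x_bar, y_bar, ss_xy, var_x):
    """Case (b): SS_x = (n - 1) * s_x^2, then b1 = SS_xy / SS_x."""
    ss_x = (n - 1) * var_x
    b1 = ss_xy / ss_x
    b0 = y_bar - b1 * x_bar
    return b0, b1


b0_a, b1_a = coefficients_from_sums(n=10, sum_x=15, sum_y=714, sum_xy=1278, sum_x2=25.8)
b0_b, b1_b = coefficients_from_ss(n=10, x_bar=0, y_bar=12.7, ss_xy=246.56, var_x=36.67)

print(round(b0_a, 3), round(b1_a, 3))  # case (a)
print(round(b0_b, 3), round(b1_b, 4))  # case (b)
```

In case (b) the slope has the same sign as \(SS_{xy}\), which is positive here, so the fitted line slopes upward.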

5. Open the Excel file House.xlsx. National Realty wants you to investigate the relationship between the selling price of a house (in $1,000) and the area of the block of land on which it is situated (in m\(^2\)). You decide to perform a simple linear regression between Price and Area.

(a) First, decide which of the two variables should be chosen as the response variable. Then specify the regression model, and explain each term in the model.

(b) What are the assumptions that must be satisfied to ensure that a simple linear regression is appropriate?

(c) Using Excel, produce an appropriate Summary Output for the simple linear regression described by (a). This should include an appropriate set of diagnostic plots that can be used to assess whether or not the assumptions of the regression model in (b) are justified.

(d) From your output in (c), write down the estimated regression equation between Price and Area.

(e) Give an interpretation for the estimate of the slope parameter in the estimated regression equation in (d).

(f) Do the diagnostic plots suggest any violation of the assumptions in (b)?

Solution:

(a) Price is the appropriate choice for the response variable. The regression model is \(\text{Price} = \beta_0 + \beta_1\,\text{Area} + \varepsilon\), where Price is the selling price of the house, Area is the area of the block on which the house stands, \(\beta_0\) is the intercept, \(\beta_1\) is the slope, and \(\varepsilon\) is the random variation term or error.

(b) The assumptions of the simple linear regression are

i. A linear model is appropriate: \(\text{Price} = \beta_0 + \beta_1\,\text{Area} + \varepsilon\), where \(E[\varepsilon] = 0\);

ii. The error variables are Normally distributed;

iii. The error variables have constant variance;

iv. The error variables are independent (or at least uncorrelated).

(c) An Excel output is shown:


(d) \(\widehat{\text{Price}} = 219.23 + 0.2901\,\text{Area}\).

(e) If the area increases by 1 m\(^2\) then the selling price will, on average, increase by $290.

(f) 1. A linear model is appropriate. The scatter plot is slightly suggestive

of a curved relationship, particularly if the one extremely negative point

is seen as an outlier.

2. The Error variables are normally distributed. The normality plot

is approximately linear, except at the tails. The Normality assumption is

called into question by 4-6 extreme points. See below.

3. The Error variables have constant variance. This really depends

on how we see the one negative outlier. Without this point, the constant

variance assumption looks okay. Leaving this point in, constant variance

is more open to question.

4. The Error variables are independent (or uncorrelated). The residual plot doesn't show any clear violation of independence.

The data contains some very extreme points in the Area variable and all of these would have high leverage. One of these points is an extreme negative outlier. All of this casts some doubt on the results above. We would be well advised to see what effect these points have on our regression model, by refitting the model with one or more of these points removed.


6. (a) Based on your answers to Question 7e, predict the selling price for a house with area equal to (i) 900 m\(^2\); (ii) 1900 m\(^2\). Comment on the reliability of these predictions.

(b) Is there a significant (linear) relationship between Price and Area? State the hypotheses to be tested, and read off the appropriate p-value for this test from your output in Question 7(e)iii.

(c) Without any calculation, state a 95% confidence interval for the slope parameter \(\beta_1\).

(d) Calculate a 98% confidence interval for \(\beta_1\).

Solution:

(a) (i) \(\widehat{\text{Price}} = 219.23 + 0.2901(900) = 480.354\), that is, $480,354. (ii) \(\widehat{\text{Price}} = 219.23 + 0.2901(1900) = 770.493\), that is, $770,493. The first prediction is reliable (subject to the comments above about residuals), since 900 is in the range of Area on which we built the model. The second prediction is unreliable, as 1900 is well outside the data range of Area upon which we built the model.

(b) We are testing the hypotheses \(H_0: \beta_1 = 0\) versus \(H_1: \beta_1 \neq 0\). The p-value for this test is \(4.1 \times 10^{-29} < 0.05\), so the data provide overwhelming evidence against the null hypothesis. We conclude that there is a significant linear relationship between Price and Area.

(c) We can read the 95% confidence interval from the Excel output as (0.2467, 0.3336).

(d) A 98% CI for \(\beta_1\) is \(b_1 \pm t_{0.01,206}\, SE(b_1)\). So the 98% CI for \(\beta_1\) is \(0.2901 \pm 2.3451 \times 0.022044 = 0.2901 \pm 0.0517\), that is, (0.2384, 0.3418).
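The interval in (d) is the usual estimate plus or minus a critical value times a standard error. As a rough check of the arithmetic, the sketch below takes the three inputs exactly as quoted in the solution; the critical value 2.3451 is copied from the text rather than computed, and the function name `slope_confidence_interval` is hypothetical.

```python
def slope_confidence_interval(b1, se_b1, t_crit):
    """Return (lower, upper) for b1 +/- t_crit * SE(b1)."""
    margin = t_crit * se_b1
    return b1 - margin, b1 + margin


# Values quoted in the solution: b1, SE(b1), and t_{0.01, 206}.
lower, upper = slope_confidence_interval(b1=0.2901, se_b1=0.022044, t_crit=2.3451)

print(round(lower, 4), round(upper, 4))
```

The 98% interval is wider than the 95% interval read from the Excel output, as expected for a higher confidence level.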

Chapter 11

Multiple Linear Regression

1. Absenteeism is a major problem for employers in most countries, reducing potential output by an estimated 10%. Economists M. Chaudhary and I. Ng (Canadian Journal of Economics, August 1992) conducted a research project to better understand the causes of this problem. They randomly selected 100 organisations to participate in a year long study. For each organisation, the average number of days absent per employee was recorded, along with several other variables described below:

• Wage : the average employee wage

• Pct PT: percentage of part time employees

• Pct U: the percentage of unionised employees

• Av Shift: availability of shift work (1 = yes, 0 = no)

• U/M Rel: union-management relationship (1 = good, 0 = not good)

A linear regression analysis was conducted with Absent (average number of days

absent per employee) as response, and some of the output is given on the following

page.

(a) Specify the multiple linear regression model between Absent and the explanatory variables, and explain each term in the model.

(b) Is there sufficient evidence to conclude that the availability of shift work is related to absenteeism? Justify your answer.

(c) Can we infer that in organisations where union and management relations are poor, absenteeism is high? Justify your answer.

(d) Write down the fitted regression model between Absent and the explanatory variables, using only the significant terms.

(e) State and verify the assumptions of the linear regression model using the output.

(f) Which variable, Av Shift or U/M Rel, has the greatest effect on absenteeism in the workplace according to this data?

(g) Compute a 95% confidence interval for the coefficient of the percentage of unionised employees.

(h) How can this model be improved? Justify your answer.


Solution:

(a) The regression model is

\[ \text{Absent} = \beta_0 + \beta_1\,\text{Wage} + \beta_2\,\text{Pct PT} + \beta_3\,\text{Pct U} + \beta_4\,\text{Av Shift} + \beta_5\,\text{U/M Rel} + \varepsilon, \]

where \(\beta_0\) is the intercept, \(\beta_1\) is the coefficient of Wage, \(\beta_2\) is the coefficient of Pct PT, \(\beta_3\) is the coefficient of Pct U, \(\beta_4\) is the coefficient of Av Shift, \(\beta_5\) is the coefficient of U/M Rel, and \(\varepsilon\) is the random variation term.

(b) The p-value of the coefficient of Av Shift is 0.0025 < 0.05, so there is sufficient evidence to conclude that the availability of shift work is related to absenteeism. In fact, since the coefficient of Av Shift is positive, the availability of shift work increases the mean number of days absent per employee (by 1.56 days per year).

(c) The p-value of the coefficient of U/M Rel is \(5.99 \times 10^{-7} \ll 0.05\), so there is sufficient evidence to conclude that the status of union-management relations is related to absenteeism. Since the coefficient of U/M Rel is negative, it indicates that if union-management relations are good, then the mean number of days absent per employee decreases (by 2.64 days per year). Equivalently, bad union-management relations imply that the mean number of days per year absent per employee will increase by 2.64.

(d) To write down the fitted regression model, we just need to read off the estimated coefficients from the output:

\[ \widehat{\text{Absent}} = 10.2648 - 0.0002 \times \text{Wage} - 0.1069 \times \text{Pct PT} + 0.0599 \times \text{Pct U} + 1.5619 \times \text{Av Shift} - 2.6366 \times \text{U/M Rel}. \]

Note that all the estimated coefficients have p-value < 0.05. Thus every one of the five explanatory variables contributes significantly to Absenteeism (and should be included in the model).

(e) i. The linear regression model is appropriate. The scatter plot of Residuals against Fitted Values is slightly suggestive of some degree of non-linearity (i.e. a curved relationship). Assume for now that the linear model is appropriate.

ii. The errors are normal. Non-normality of the residuals is not very evident from the histogram given. However, the Normal Probability Plot shows a distinct curvature. Thus the residuals are probably not normal, and this assumption is not justified.

iii. The errors have constant variance. The scatter plot of Residuals against Fitted Values shows no clear pattern, so there is no reason to doubt the equal variance assumption.

iv. The errors are uncorrelated. As observed in i., the scatter plot of residuals against fitted values shows a slight pattern, but there is possibly not enough reason to doubt the claim that the residuals are uncorrelated.

(f) In the case of the two factor variables, union/management relations and availability of shift work, it is clear that union/management relations have a greater effect than availability of shift work, because the absolute value of the estimated coefficient is larger.

(g) A 95% CI for the coefficient of Pct U is

\[ b_3 \pm t_{0.025,94}\, SE(b_3) = 0.0599 \pm t_{0.025,94} \times 0.0124 = 0.0599 \pm 1.9855 \times 0.0124 = 0.0599 \pm 0.0246 = (0.0353, 0.0845). \]

One can also read this straight from the Excel output.

Informally, this tells us that if percentage union membership increases by 1%, then we would expect that mean absenteeism will increase by between 0.0353 and 0.0845 days per year.


(h) The data (Absent) should be transformed and the model re-fitted to see if there is any improvement in the behaviour of the residuals with respect to the normality assumption.


2. As a further analysis, the following loglinear model was fitted to the data:

\[ \ln \text{Absent} = \beta_0 + \beta_1\,\text{Wage} + \beta_2\,\text{Pct PT} + \beta_3\,\text{Pct U} + \beta_4\,\text{Av Shift} + \beta_5\,\text{U/M Rel} + \varepsilon \]

Some of the output from the analysis is given on the following page.

(a) Using the analysis of the previous question, justify fitting the above model to the data.

(b) Write down the fitted regression model between ln(Absent) and the explanatory variables.

(c) Is there sufficient evidence to conclude that the availability of shift work is related to absenteeism? Justify your answer.

(d) Can we infer that in organisations where union and management relations are poor, absenteeism is high? Justify your answer.

(e) State and verify the assumptions of the regression model using the output.

(f) Compare the log model to the linear model fitted in the previous question. Which is better? Justify your answer.

(g) Between U/M Rel and Av Shift, which variable has the greatest effect on absenteeism in this model? How does this compare with the model in Question 1?

(h) Compute a 95% confidence interval for the coefficient of the percent of unionised employees, and compare your answer to that in Question 7(a)vii.

(i) Write a statement reporting the results of the analysis, referring to the factors that affect worker absenteeism.


Solution:

(a) The Normality assumption employed in the previous analysis was perhaps not justified. Now the response variable is being transformed, in an attempt to find a more appropriate model. After fitting the new model to the data, we can see if there is any change in the behaviour of the residuals. Since the histogram of residuals was right-skewed, an appropriate transformation might be to take the square root or natural logarithm of the response variable. The log transformation has been used here.

(b) Reading the estimated coefficients from the output, the fitted model is

\[ \widehat{\ln \text{Absent}} = 2.25 - 3.38\,\text{Wage} - 0.019\,\text{Pct PT} + 0.011\,\text{Pct U} + 0.283\,\text{Av Shift} - 0.371\,\text{U/M Rel}. \]

(c) The p-value of the coefficient of Av Shift is 0.0012 < 0.05, so there is sufficient evidence to conclude that the availability of shift work is related to absenteeism. In fact, since the coefficient of Av Shift is positive, the availability of shift work increases the mean number of days absent per employee. From the fitted model derived in (b), we obtain

\[ \widehat{\text{Absent}} = \exp(2.25 - 3.38\,\text{Wage} - 0.019\,\text{Pct PT} + 0.011\,\text{Pct U} + 0.283\,\text{Av Shift} - 0.371\,\text{U/M Rel}), \]

so absenteeism increases by a factor of \(e^{0.283} = 1.327\) for companies that have shift work available.

(d) The p-value of the coefficient of U/M Rel is \(2.11 \times 10^{-5} \ll 0.05\), so there is sufficient evidence to conclude that the status of union-management relations is related to absenteeism. Since the coefficient of U/M Rel is negative, it indicates that if union-management relations are good then the mean number of days absent per employee decreases. In fact, absenteeism decreases by a factor of \(e^{-0.371} = 0.690\) if management-union relations are good.
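In a model for ln(Absent), switching a 0/1 indicator from 0 to 1 adds its coefficient on the log scale, which multiplies the fitted number of days absent by the exponential of that coefficient. The short sketch below reproduces the two factors quoted in (c) and (d); the dictionary name `indicator_coefficients` is illustrative, and the coefficient values are those read from the fitted model in (b).

```python
import math

# Coefficients of the two indicator variables in the fitted log model.
indicator_coefficients = {"Av Shift": 0.283, "U/M Rel": -0.371}

# exp(b) is the multiplicative change in fitted Absent when the indicator goes from 0 to 1,
# holding the other explanatory variables fixed.
factors = {name: math.exp(b) for name, b in indicator_coefficients.items()}

for name, factor in factors.items():
    percent_change = (factor - 1) * 100
    print(f"{name}: factor {factor:.3f} ({percent_change:+.1f}%)")
```

Read this way, shift work availability goes with roughly a one-third increase in fitted absenteeism, and good union-management relations with roughly a one-third decrease.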

(e) i. The fitted regression model is appropriate. The scatter plot of Residuals against Fitted Values shows no clear pattern, so we conclude that the model is appropriate.

ii. The errors are normal. The normal probability plot is fairly straight, and the histogram is similar to that expected from a Normal distribution. We conclude that there is no reason to question the assumption of Normality of the residuals.

iii. The errors have constant variance. The scatter plot of residuals against fitted values shows possibly that the variances decrease slightly as the fitted values increase, but perhaps not enough to doubt the equal variance assumption.

iv. The errors are uncorrelated. The scatter plot of residuals against fitted values shows no clear pattern, so we conclude that the residuals are uncorrelated.

(f) The log model is a great improvement on the linear model. The correlation coefficient has not changed much (0.7252 compared to 0.7296). The standard error for the log model is much smaller than for the linear model. This is partly due to the fact that the data values have decreased due to the log transformation, but even taking this into account, the reduction is large (in fact, as a rough guide, the log of the standard error for the linear model is ln 2.3559 = 0.8569, and this is more than twice the standard error for the log model). Furthermore, the diagnostic plots suggest that we may safely assert that the assumptions of the regression are satisfied by the log model, whereas some doubt must be cast on the validity of the assumptions of the linear model, in particular Normality of the residuals.

(g) Again the coefficient of Union/Management Relations has the larger absolute value, so again, of the two variables, this variable has the greatest effect.

(h) A 95% CI for the coefficient of Pct U is

\[ 0.0111 \pm t_{0.025,94} \times 0.0021 = 0.0111 \pm 1.9855 \times 0.0021 = 0.0111 \pm 0.0042, \]

so a 95% CI is (0.0069, 0.0153). This is quite different to the 95% CI for the same coefficient under the previous model. The difference is due to the choice of model. Note that this is a CI for the increase in ln(Absent) corresponding to a 1% increase in union membership.

(i) Worker absenteeism decreases as the average employee wages and percent of

part time employees increase. Further, absenteeism is lower for those companies

for which management has a good relationship with the union and higher for

those companies that have shift work available. Finally, absenteeism increases

as the percentage of unionised employees increases.

3. In this exercise, we will examine how to use Excel to generate output that can be used to conduct a multiple linear regression analysis on a given data set. The standard Excel regression output does not include all of the diagnostic plots that one would usually be interested in; separately, we can obtain plots of residuals against fitted values, and histograms of the residuals. The Absenteeism data of the previous two questions is contained in the Excel file Absent.xlsx.

Follow the steps outlined below:

(a) Generate output relevant to a multiple linear regression of Absent on the five explanatory variables. To do this, select Data -> Data Analysis -> Regression. Since Absent is the response variable, set Y Range to be all the data in the column Absent. X Range should be set to be all the data in the remaining columns. Labels should be included. Select also Residuals and Normal Probability Plot.

(b) Under Residual Output, you will see two columns headed Predicted Absent

and Residuals. Copy the two columns to a separate worksheet, and use the

Scatterplot command to generate a scatterplot of Residuals against Fitted

Values (Predicted Absent). Check that the plot is the same as the one given

in the output in Question 7a.

(c) Generate a histogram of the Residuals.

Solution:

See Question 7a for an example output.


4. In Question 7b, we considered a multiple linear regression of the natural logarithm of Absent on the five explanatory variables. Here, we will reproduce the relevant output in Excel. First, we must transform the Absent data.

(a) Create a new column, to the right of the Absent column, headed ln(Absent), calculate the natural logarithm of the first data point as shown below, and then fill down the column.

(b) Now generate the standard Excel output for a multiple linear regression of ln(Absent) on the five explanatory variables (ignoring the original Absent data!).

(c) Once again, create a plot of Residuals against Fitted Values. Check that it is the same as the plot given in the Excel output in Question 7b. Comment on the differences in the diagnostic plots for the two different models, and what these plots tell us.

Solution:

See Question 7b for an example output. The diagnostics for the two models were examined respectively in Questions 7(a)v and 7(b)v. We can assert that the assumptions of the regression are satisfied by the log model. There is no evident pattern in the residual plot to suggest that this model is not appropriate or that the errors are not independent. Furthermore, the Normal Probability Plot resembles a line, indicating that the Normality assumption is OK. However, for the linear model, the Normal Probability Plot has a definite curve. For this model, some doubt must be cast on the validity of the assumption of Normality of the errors, so the linear model is inappropriate. Inference based on bad models will usually result in wrong conclusions.


Chapter 12

Chi-Squared Tests for Categorical Data

12.1–12.2: The Chi-Squared Test for Goodness of Fit

1. A company which manufactures tractors takes daily samples of 4 tractors for careful

inspection as a check on the quality of their product. Over 200 days, the numbers

of tractors needing adjustment on each day were recorded, resulting in the following

frequency table. Test whether a Binomial model with p = 0.1 is appropriate for the

number of tractors needing adjustment on a given day.

[Fill in the rest of the table before your lab, remembering that you might need to

group the categories.]

Number needing adj. per day ($x_i$)        0      1      2      3      4    Total
Number of days ($o_i$)                   102     78     19      1      0      200
$P(X = x_i)$ if $X \sim$ Bin(4, 0.1)                                            1
Expected frequency ($e_i$)                                                    200
$(o_i - e_i)^2 / e_i$

Solution:

The hypotheses to be tested are

$H_0$: data consistent with a Bin(4, 0.1) distribution

$H_1$: data not consistent with a Bin(4, 0.1) distribution

Let X denote the number of tractors needing adjustment on a given day. Assume that the numbers of tractors requiring adjustment on each day are iid. Then the number of days, out of 200, on which $x_i$ tractors need adjustment (for $x_i = 0, 1, \ldots, 4$) is a Bin(200, $p_i$) random variable, where $p_i = P(X = x_i)$. So the expected number of days on which $x_i$ tractors need adjustment can be written as $200 p_i$. Under the Bin(4, 0.1) model,

$$p_i = \binom{4}{x_i} \, 0.1^{x_i} \, (1 - 0.1)^{4 - x_i}.$$

The expected frequencies under this model can now be calculated, and the results are given in the following table:


Number needing adj. ($x_i$)                   0        1        2        3        4    Total
Number of days ($o_i$)                      102       78       19        1        0      200
$P(X = x_i)$ if $X \sim$ Bin(4, 0.1)     0.6561   0.2916   0.0486   0.0036   0.0001        1
Expected freq. ($e_i = 200 p_i$)         131.22    58.32     9.72     0.72     0.02      200

The chi-squared test requires that all expected frequencies be at least 5. To achieve this, we group the last three categories. The revised table is shown below:

Number needing adj. ($x_i$)                   0        1   2, 3 or 4    Total
Number of days ($o_i$)                      102       78          20      200
$P(X = x_i)$ if $X \sim$ Bin(4, 0.1)     0.6561   0.2916      0.0523        1
Expected freq. ($e_i = 200 p_i$)         131.22    58.32       10.46      200
$(o_i - e_i)^2 / e_i$                      6.51     6.64        8.70    21.85

The test statistic is

$$X^2 = \sum_i \frac{(o_i - e_i)^2}{e_i},$$

where the sum is over all (remaining) categories of the variable. Under $H_0$, $X^2$ follows a $\chi^2$ distribution, with

df = Number of categories − 1 − Number of parameters estimated
   = 3 − 1 − 0
   = 2,

i.e. $X^2 \sim \chi^2_2$ under $H_0$. So the $\alpha = 0.05$ critical value is $\chi^2_{\text{crit}} = \chi^2_{2,0.05} = 5.99$.

The observed value of the test statistic is $\chi^2_{\text{obs}} = 21.85$.

Since $\chi^2_{\text{obs}} = 21.85 > 5.99$ lies in the critical region (CR), the data provides sufficient evidence to reject $H_0$ at the 5% level of significance. We conclude that the number of tractors needing adjustment is not distributed as Bin(4, 0.1).
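The calculation above can be reproduced with a short script. The following is a minimal sketch in Python using only the standard library; the function and variable names are illustrative and not part of the original exercise, and the critical value 5.99 is taken from the solution above rather than computed.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Bin(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Observed number of days on which 0, 1, 2, 3, 4 tractors needed adjustment.
observed = [102, 78, 19, 1, 0]
n_days = sum(observed)  # 200

# Probabilities under H0: X ~ Bin(4, 0.1).
probs = [binomial_pmf(k, 4, 0.1) for k in range(5)]

# Group the categories 2, 3 and 4 so that no expected frequency is below 5.
observed_grouped = observed[:2] + [sum(observed[2:])]
probs_grouped = probs[:2] + [sum(probs[2:])]
expected_grouped = [n_days * p for p in probs_grouped]

# Chi-squared statistic: sum of (o_i - e_i)^2 / e_i over the grouped categories.
chi2_obs = sum((o - e) ** 2 / e for o, e in zip(observed_grouped, expected_grouped))

df = len(observed_grouped) - 1 - 0  # no parameters were estimated from the data
chi2_crit = 5.99                    # 5% critical value for 2 df, read from tables
reject_h0 = chi2_obs > chi2_crit

print([round(e, 2) for e in expected_grouped])
print(round(chi2_obs, 2), df, reject_h0)
```

Grouping is done on the probabilities before multiplying by 200, which matches the revised table; grouping the expected frequencies directly would give the same result.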

2. Political ideology of government has a great impact on business perception and

planning. A market researcher is investigating the support for the various political

parties in Australia at the Federal level. The support at the 2001 Federal election

was Liberal 37%, Labor 38%, National 6%, Democrats 5%, Others 14%

(source: http://www.aec.gov.au/ content/when/past/2001/results/index.html).

Six months after the 2001 election, a survey of 1050 voters was conducted to determine whether the level of support for each party had changed. The results are summarised in the table below.

Party (i)                  Lib     Lab     Nat     Dem     Oth    Total
No. of voters ($o_i$)      350     456      50      44     150     1050
Probability ($p_i$)       0.37    0.38    0.06    0.05                1


Determine at signiﬁcance level 0.05 whether the level of support for the parties

changed in the six months following the 2001 election. Comment on where the

major discrepancy appears to lie.

Solution:

Party (i)                          Lib       Lab       Nat       Dem       Oth      Total
No. of voters ($o_i$)              350       456        50        44       150       1050
Probability ($p_i$)               0.37      0.38      0.06      0.05      0.14          1
Expected frequency ($e_i$)       388.5       399        63      52.5       147       1050
$o_i - e_i$                      -38.5        57       -13      -8.5         3          0
$(o_i - e_i)^2$                1482.25      3249       169     72.25         9   (not reqd)
$(o_i - e_i)^2 / e_i$           3.8153    8.1429    2.6825    1.3762    0.0612    16.0781

(a) df = number of categories (after grouping to eliminate any expected frequencies

less than 5) − 1 − number of parameters estimated from the data.

(b) [See table above].

df = 5 − 1 − 0 = 4; $\chi^2_{4,0.05} = 9.49$.

$\chi^2_{\text{obs}} = 16.0781 > 9.49$, therefore the data provides sufficient evidence to reject $H_0$ at the 5% level of significance. We conclude that the level of support for the political parties has changed since the last election.

The largest contribution to the Chi-squared statistic is from the Labor column.

Thus the major discrepancy is that the support for Labor increased in the six

months following the 2001 election.
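The table and the comment on the largest discrepancy can be checked with a few lines of code. This is an illustrative sketch in plain Python, not part of the original solution; the variable names are assumptions, and the critical value 9.49 is taken from the working above.

```python
# Survey counts six months after the election, and the 2001 election shares (H0).
parties = ["Lib", "Lab", "Nat", "Dem", "Oth"]
observed = [350, 456, 50, 44, 150]
h0_probs = [0.37, 0.38, 0.06, 0.05, 0.14]

n = sum(observed)  # 1050 voters
expected = [n * p for p in h0_probs]

# Per-party contribution (o_i - e_i)^2 / e_i to the chi-squared statistic.
contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
chi2_obs = sum(contributions)

chi2_crit = 9.49  # 5% critical value for 4 df, read from tables
largest = max(zip(contributions, parties))[1]

for party, o, e, c in zip(parties, observed, expected, contributions):
    print(f"{party}: o={o}, e={e:.1f}, o-e={o - e:+.1f}, contribution={c:.4f}")
print(f"X^2 = {chi2_obs:.4f}, reject H0: {chi2_obs > chi2_crit}, largest: {largest}")
```

Printing the signed difference alongside each contribution shows the direction of a discrepancy as well as its size, which is what the comment about Labor relies on.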

3. Black et al. 12.5.

Solution:

$H_0$: The way that men define their personal success does not differ from how women define theirs.

$H_1$: $H_0$ is false.

The test statistic is given by

$$\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e},$$

where $f_o$ is the observed frequency and $f_e$ is the expected frequency.

The significance level is given as $\alpha = 0.05$.

There are four categories in this question (happiness, sales, helping others, achievements), so k = 4. The degrees of freedom are k − 1 = 3. For $\alpha = 0.05$ and df = 3, the critical chi-square value is

$$\chi^2_{0.05,3} = 7.8147.$$

The expected frequencies are computed by multiplying the expected proportions (from the women's data) by the total sample size of the men's data. The total sample size for the men's data is 227 (the sum of all the observed frequencies). The expected frequency for the happiness category is then 227(0.39) = 88.53; similarly, the expected frequency for the sales category is 227(0.12) = 27.24, and so on.


Definition        $f_o$     $f_e$    $(f_o - f_e)^2 / f_e$
Happiness           42     88.53      24.46
Sales               95     27.24     168.55
Helping             27     40.86       4.70
Achievements        63     70.34       0.77
Total                                198.48

Since the observed chi-squared value (198.48) is greater than the critical value (7.8147), we reject the null hypothesis.

Thus, the data gathered in the sample suggests that the way men deﬁne their

personal success diﬀers signiﬁcantly from how women deﬁne theirs.

12.3: Contingency Analysis: The Chi-Squared Test for Independence

4. In a random sample of 100 people, each person was classiﬁed by buying response to

a particular product and also by degree of exposure to marketing pressure (recorded

in four categories I, II, III, IV), with the following results:

[Fill in (say) three of the expected frequencies (in parentheses) before your lab.]

                              Marketing Pressure
                    I          II         III        IV        Totals
Definitely buy    12 ( )     12 ( )      6 ( )     17 ( )        47
Undecided          5 ( )      8 ( )     10 ( )      5 ( )        28
Will not buy       3 ( )     10 ( )      7 ( )      5 ( )        25
Total             20         30         23         27           100

(a) State the hypotheses you would use in testing the advertising agency’s claim

that buying response is inﬂuenced by the degree of marketing pressure.

(b) Explain why you would calculate the expected frequencies using the rule

$$\text{expected frequency for a cell} = \frac{\text{row total} \times \text{column total}}{\text{grand total}}.$$

(c) Test the advertising agency’s claim at the 5% signiﬁcance level.

Solution:

(a) The hypotheses to be tested are

$H_0$: Marketing pressure and buying response are independent

$H_1$: Marketing pressure and buying response are not independent (i.e. buying response is influenced by marketing pressure)


(b) Consider the upper-left cell. The probability that a particular customer will be classified in this cell is P(“Marketing Pressure I” and “Definitely buy”). If $H_0$ is true, then

$$P(\text{“Marketing Pressure I” and “Definitely buy”}) = P(\text{“Marketing Pressure I”}) \times P(\text{“Definitely buy”}).$$

The two probabilities on the right-hand side can be estimated naturally in terms of the respective row and column totals:

$$P(\text{“Marketing Pressure I”}) \approx \frac{\text{Column 1 Total}}{\text{Grand Total}}, \qquad P(\text{“Definitely buy”}) \approx \frac{\text{Row 1 Total}}{\text{Grand Total}}.$$

So, under $H_0$,

$$P(\text{“Marketing Pressure I” and “Definitely buy”}) \approx \frac{\text{Row 1 Total} \times \text{Column 1 Total}}{(\text{Grand Total})^2}.$$

The expected frequency for the upper-left cell is

$$\text{Grand Total} \times P(\text{“Marketing Pressure I” and “Definitely buy”}) \approx \text{Grand Total} \times \frac{\text{Row 1 Total} \times \text{Column 1 Total}}{(\text{Grand Total})^2} = \frac{\text{Row 1 Total} \times \text{Column 1 Total}}{\text{Grand Total}}.$$

This argument applies in the same way for all other cells in the table.
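To see the rule applied to every cell at once, the following sketch computes the expected frequencies for the marketing pressure table. It is written in plain Python for illustration only; the function name `expected_frequencies` is hypothetical and does not come from the exercise.

```python
def expected_frequencies(table):
    """Expected counts under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

# Observed counts: rows are buying responses, columns are marketing pressure I to IV.
observed = [
    [12, 12, 6, 17],   # Definitely buy
    [5, 8, 10, 5],     # Undecided
    [3, 10, 7, 5],     # Will not buy
]

expected = expected_frequencies(observed)
for row in expected:
    print([round(e, 2) for e in row])
```

Because each expected frequency is a product of a row total and a column total divided by the grand total, the expected frequencies in any row or column add up to the corresponding observed total.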

(c) Expected frequencies (under the model of independence) are given in brackets:

                                  Marketing Pressure
                    I            II           III           IV          Totals
Definitely buy    12 (9.4)     12 (14.1)     6 (10.81)    17 (12.69)      47
Undecided          5 (5.6)      8 (8.4)     10 (6.44)      5 (7.56)       28
Will not buy       3 (5)       10 (7.5)      7 (5.75)      5 (6.75)       25
Total             20           30           23            27             100

The test statistic is

$$X^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - e_{ij})^2}{e_{ij}},$$

where r and c are the numbers of rows and columns (not including totals), $O_{ij}$ is the observed count in the cell in row i and column j, and $e_{ij}$ is the expected count in the same cell (assuming $H_0$, that is, no relationship between the two variables). This “double sum” can be thought of simply as a single sum over all cells in the table.

Under $H_0$, the test statistic follows a $\chi^2$ distribution with $(r - 1) \times (c - 1) = (3 - 1) \times (4 - 1) = 6$ degrees of freedom, i.e. $X^2 \sim \chi^2_6$ under $H_0$. Thus, the $\alpha = 0.05$ critical value of the test is $\chi^2_{\text{crit}} = \chi^2_{6,0.05} = 12.59$.


The observed value of the test statistic is

$$\chi^2_{\text{obs}} = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}$$

= 0.719 + 0.312 + 2.140 + 1.464 + 0.064 + 0.019 + 1.968 + 0.867 + 0.800 + 0.833 + 0.271 + 0.454

= 9.91.

Since $\chi^2_{\text{obs}} = 9.91 < 12.59$, the data does not provide sufficient evidence to reject $H_0$ in favour of $H_1$ at the 5% level of significance. We conclude that there is insufficient evidence that buying response is influenced by marketing pressure.
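The whole test can be sketched in code as a check on the hand calculation. The snippet below is an illustration in plain Python under the same assumptions as the solution; the helper name `chi2_independence` is hypothetical, and the critical value 12.59 is the table value quoted above rather than something the code computes.

```python
def chi2_independence(table):
    """Return (statistic, degrees of freedom) for a chi-squared test of independence."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, o_ij in enumerate(row):
            # Expected count for cell (i, j) under independence.
            e_ij = row_totals[i] * col_totals[j] / grand_total
            statistic += (o_ij - e_ij) ** 2 / e_ij
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df

observed = [
    [12, 12, 6, 17],   # Definitely buy
    [5, 8, 10, 5],     # Undecided
    [3, 10, 7, 5],     # Will not buy
]

chi2_obs, df = chi2_independence(observed)
chi2_crit = 12.59  # 5% critical value for 6 df, read from tables
reject_h0 = chi2_obs > chi2_crit
print(round(chi2_obs, 2), df, reject_h0)
```

The double loop mirrors the double sum in the test statistic: one pass over every cell of the table, excluding the totals.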

5. Four hotels took part in a survey on hotel guest satisfaction. A follow up question

was asked of all respondents who were dissatisﬁed with the service. These guests

were asked to indicate the main reason for their dissatisfaction. You are asked

to investigate whether the choice of hotel has any bearing on the main reason for

dissatisfaction.

Do not use Excel in this question! Write your answers on paper, showing

full working.

(a) State appropriate hypotheses that could be tested to answer the question: “Do

the results of the survey provide evidence that the nature of dissatisfaction and

the choice of hotel are related?”

Solution:

$H_0$: Choice of hotel and reason for dissatisfaction are independent.

$H_1$: $H_0$ is false (i.e. choice of hotel and reason for dissatisfaction are related).

(b) A contingency table, summarising the results of the survey, is given below. The table shows the observed frequencies for each cell, as well as some of the expected frequencies under $H_0$ (in parentheses). Copy down this table, and without using Excel, calculate the remaining expected frequencies under $H_0$. Show working!

                                            Hotel
                  Fijian          Tradeswest      Sheraton        Coral Reef      Totals
Politeness        23 ( )           7 ( )          37 (33.7410)    67 (62.0192)      134
Knowledge         25 ( )          13 ( )          25 (30.9712)    60 (56.9281)      123
Responsiveness    13 (11.0024)     5 (6.6906)     13 (15.6115)    31 (28.6954)       62
Other             13 (17.3909)    20 (10.5755)    30 (24.6763)    35 (45.3573)       98
Totals            74              45             105             193                417

Solution:


The expected frequency for Fijian and Politeness is

$$\frac{\text{Row Total} \times \text{Column Total}}{\text{Grand Total}} = \frac{134 \times 74}{417} = 23.7794.$$

One can either work out the remaining three expected frequencies as above, or

by using the fact that the expected frequencies in each row/column are required

to sum to the (observed) row/column total. The complete table is below:

                                            Hotel
                  Fijian          Tradeswest      Sheraton        Coral Reef      Totals
Politeness        23 (23.7794)     7 (14.4604)    37 (33.7410)    67 (62.0192)      134
Knowledge         25 (21.8273)    13 (13.2734)    25 (30.9712)    60 (56.9281)      123
Responsiveness    13 (11.0024)     5 (6.6906)     13 (15.6115)    31 (28.6954)       62
Other             13 (17.3909)    20 (10.5755)    30 (24.6763)    35 (45.3573)       98
Totals            74              45             105             193                417
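The two ways of filling in the missing cells can be compared numerically. The sketch below is illustrative Python with assumed variable names: it finds one cell from the formula, the other three by subtracting the given expected frequencies from the row or column totals, and then cross-checks one of them against the formula.

```python
row_totals = {"Politeness": 134, "Knowledge": 123, "Responsiveness": 62, "Other": 98}
col_totals = {"Fijian": 74, "Tradeswest": 45, "Sheraton": 105, "Coral Reef": 193}
grand_total = 417

# Method 1: row total * column total / grand total.
e_fij_pol = row_totals["Politeness"] * col_totals["Fijian"] / grand_total

# Method 2: subtract the known expected frequencies from a row or column total.
e_trad_pol = row_totals["Politeness"] - e_fij_pol - 33.7410 - 62.0192   # Politeness row
e_fij_kno = col_totals["Fijian"] - e_fij_pol - 11.0024 - 17.3909        # Fijian column
e_trad_kno = col_totals["Tradeswest"] - e_trad_pol - 6.6906 - 10.5755   # Tradeswest column

# Cross-check method 2 against the direct formula for one cell.
direct_trad_kno = row_totals["Knowledge"] * col_totals["Tradeswest"] / grand_total

print(round(e_fij_pol, 4), round(e_trad_pol, 4), round(e_fij_kno, 4), round(e_trad_kno, 4))
print(round(direct_trad_kno, 4))
```

The subtraction method inherits the rounding of the four-decimal values given in the table, so its results can differ from the direct formula in the last decimal place.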

(c) Write down an expression for the relevant test statistic, and state its distribution under $H_0$ (together with any associated parameters!).

Solution:

The test statistic is

$$X^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - e_{ij})^2}{e_{ij}},$$

where r and c are the numbers of rows and columns (not including totals), $O_{ij}$ is the observed count in the cell in row i and column j, and $e_{ij}$ is the expected count in the same cell (assuming $H_0$, that is, no relationship between the two variables).

Under $H_0$, the test statistic follows a $\chi^2$ distribution with $(r - 1) \times (c - 1) = (4 - 1) \times (4 - 1) = 9$ degrees of freedom, i.e. $X^2 \sim \chi^2_9$ under $H_0$.

(d) Without using Excel, calculate the contribution from the upper-left cell to the

observed value of the test statistic.

Solution:

The contribution to the observed value from the upper-left cell is

$$\frac{(o_{ij} - e_{ij})^2}{e_{ij}} = \frac{(23 - 23.7794)^2}{23.7794} = 0.0256.$$

(e) Given that the observed value of the test statistic is $\chi^2_{\text{obs}} = 20.8059$, carry out the test (without using Excel) at the 5% significance level, and state your conclusion. Is there sufficient evidence to conclude that there is a relationship between the choice of hotel and the nature of dissatisfaction?

Solution:

The critical value for this test is

$$\chi^2_{\text{crit}} = \chi^2_{9,\alpha} = \chi^2_{9,0.05} = 16.92,$$

so the critical region is $\{X^2 > \chi^2_{\text{crit}} = 16.92\}$. Since $\chi^2_{\text{obs}} = 20.8059$ is within the critical region, we reject $H_0$ in favour of $H_1$. There is sufficient evidence, at the 5% significance level, to conclude that the nature of dissatisfaction is related to the choice of hotel.

6. To undertake contingency analysis in Excel, first enter the data, then go to KaddSTAT -> Hypothesis Testing -> Chi-Square Test. Select the data as Input Range, tick the Header Row and Column Included box, and choose where you want Excel to print the output.

Enter the data from Question 7e as shown below:

(a) Use Excel to generate an appropriate output for a test for independence of the

two variables of interest, carry out the test, and check that your conclusions

are the same as in Question 7e.

Solution:

Excel returns the following output:


(b) If there is evidence that the nature of dissatisfaction is related to the choice

of hotel, where do the discrepancies lie? Which hotel(s) could be advised to

improve their service, and in which area(s)? Do any of the hotels appear to

provide signiﬁcantly better service than the others in a particular area?

Solution:

Having established that there is indeed a relationship between the choice of hotel and the nature of dissatisfaction, one can examine the output to determine which hotels have a greater (or lesser) proportion of complaints of each type.

It can be seen from the output of chi-square calculations that there are three cells that have much larger contributions to the observed value of the test statistic than the others. These cells are Tradeswest and Politeness (3.8490), Tradeswest and Other (8.3987) and Coral Reef and Other (2.3651). Comparing the observed frequencies with the expected frequencies for these cells, we see that of the dissatisfied hotel guests, those who stayed at Tradeswest are less often dissatisfied with Politeness, and more often their dissatisfaction is classified as Other. It might also be that those dissatisfied guests who stay at Coral Reef less often state that their dissatisfaction is due to Other, although there is probably not enough evidence to confirm this (the chi-square contribution is not that large).

We conclude that Tradeswest should take steps to improve their service in the

area of Other. Some further analysis might be required to provide more useful

advice. To ﬁnd out which particular aspects of Tradeswest’s service guests are

dissatisﬁed with, one might choose to replace Other by a collection of more

meaningful categories (e.g. Cleanliness, Food, etc.). One can ensure that the

expected frequencies are all greater than 5 by combining any categories that

have small expected frequencies, or by simply gathering enough data.
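The search for the largest discrepancies can also be done programmatically. The following sketch, in plain Python with illustrative names, computes every cell's contribution from the observed hotel table, ranks the cells, and reports whether each of the top three has more or fewer complaints than expected under independence.

```python
reasons = ["Politeness", "Knowledge", "Responsiveness", "Other"]
hotels = ["Fijian", "Tradeswest", "Sheraton", "Coral Reef"]
observed = [
    [23, 7, 37, 67],
    [25, 13, 25, 60],
    [13, 5, 13, 31],
    [13, 20, 30, 35],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# One record per cell: (contribution, observed - expected, reason, hotel).
cells = []
for i, reason in enumerate(reasons):
    for j, hotel in enumerate(hotels):
        e_ij = row_totals[i] * col_totals[j] / grand_total
        o_ij = observed[i][j]
        cells.append(((o_ij - e_ij) ** 2 / e_ij, o_ij - e_ij, reason, hotel))

chi2_obs = sum(cell[0] for cell in cells)
top_three = sorted(cells, reverse=True)[:3]

print(f"X^2 = {chi2_obs:.4f}")
for contribution, diff, reason, hotel in top_three:
    direction = "more" if diff > 0 else "fewer"
    print(f"{hotel} / {reason}: {contribution:.4f} ({direction} complaints than expected)")
```

A large contribution only flags a cell; the sign of the difference between observed and expected is what tells you whether that hotel receives more or fewer complaints of that type.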

7. A market-researcher wished to investigate whether a buyer’s age had any bearing

on choice of car colour. A random sample of 200 car buyers resulted in the following

table which shows the observed frequencies and some of the expected frequencies

(in parentheses).

                Chose Red    Chose White    Chose Grey
Age 17 – 24     20 ( 16 )    15 ( 16 )       5 (    )
Age 25 – 40     30 ( 24 )    20 ( 24 )      10 (    )
Age over 40     30 (    )    45 (    )      25 (    )

(a) State the hypotheses that the researcher is comparing in this investigation.

(b) Copy the body of the table and complete the entries for expected frequencies.

(c) Give the number of degrees of freedom for a $\chi^2$-test of the hypotheses in part (a).


(d) Explain fully how the expected frequency of 16 is obtained for the 17 – 24 age

group with a preference for Red. (Do not merely quote a formula or show one

line of arithmetic.)

(e) Using a 5% level of signiﬁcance, determine if the buyer’s age has any bearing

on the choice of car colour.

Solution:

(a) $H_0$: choice of colour independent of age; $H_1$: choice of colour dependent on age.

(b)
                Chose Red    Chose White    Chose Grey    Total
Age 17 – 24     20 ( 16 )    15 ( 16 )       5 (  8 )       40
Age 25 – 40     30 ( 24 )    20 ( 24 )      10 ( 12 )       60
Age over 40     30 ( 40 )    45 ( 40 )      25 ( 20 )      100
Total           80           80             40             200

(c) No grouping required, so the number of degrees of freedom for the $\chi^2$-distribution is (3 − 1)(3 − 1) = 4.

(d) Under $H_0$, P(Age 17–24 and Red) = P(Age 17–24) × P(Red). Of the 200 buyers sampled, 40 were aged 17–24 and 80 chose red. Estimating P(Age 17–24) by $\frac{40}{200}$ and P(Red) by $\frac{80}{200}$ gives

expected frequency for (1, 1)-cell $= \frac{40}{200} \times \frac{80}{200} \times 200 = 16$.

(e) From tables, $\chi^2_{4,0.05} = 9.49$, and the observed value of the test statistic is 9.06, which is not in the critical region (9.06 < 9.49). So the data does not provide sufficient evidence to reject $H_0$ at the 5% level of significance. We conclude that there is insufficient evidence that choice of colour depends on age.
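The observed value 9.06 is quoted without working, so it may help to see where it comes from. This short sketch, in plain Python with illustrative variable names, sums the cell contributions using the observed and expected frequencies from the completed table in part (b); the critical value 9.49 is the table value quoted above.

```python
# Observed and expected counts from the completed table in part (b).
observed = [
    [20, 15, 5],    # Age 17-24: red, white, grey
    [30, 20, 10],   # Age 25-40
    [30, 45, 25],   # Age over 40
]
expected = [
    [16, 16, 8],
    [24, 24, 12],
    [40, 40, 20],
]

# Sum (o - e)^2 / e over all nine cells.
chi2_obs = sum(
    (o - e) ** 2 / e
    for o_row, e_row in zip(observed, expected)
    for o, e in zip(o_row, e_row)
)

chi2_crit = 9.49  # 5% critical value for 4 df, read from tables
print(round(chi2_obs, 4), chi2_obs > chi2_crit)
```

The statistic falls just short of the critical value, so the decision here is sensitive to the chosen significance level.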

8. Black et al. Exercises 12.27 and 12.29.

Solution:

(a) Black et al. 12.27


The hypotheses of interest here are

$H_0$: Proportion of households with internet access is not dependent on whether they have children under the age of 15 for the period 1989 to 2003.

$H_1$: Proportion of households with internet access is dependent on whether they have children under the age of 15 for the period 1989 to 2003.

Note that all the expected frequencies are greater than 5. df = (2 − 1)(6 − 1) = 5. The p-value of the test is $P(\chi^2_5 > 0.13) = 0.9997$ (from Excel), so there is insufficient evidence against the null hypothesis. We conclude that there is no evidence that the proportion of households with internet access differs between those with children under 15 and those without children under 15 in the period 1989 to 2003.

(b) Black et al. 12.29

$H_0$: Gender and colour preference for cars are independent

$H_1$: Gender and colour preference for cars are not independent

To test these hypotheses, we use the chi-squared test of independence. The observed value of the test statistic is 5.366. The p-value is 0.252 > 0.05, so there is insufficient evidence to reject the null hypothesis. (Equivalently, the critical value at the 5% level of significance with (5 − 1)(2 − 1) = 4 degrees of freedom is 9.4877. Since the observed value does not lie in the critical region, we do not reject the null hypothesis.) Therefore, there is not enough evidence provided by the data to suggest that colour preference is dependent on gender. On this evidence, marketing agencies need not treat colour as a factor when trying to sell cars to either gender, and manufacturers can determine car colour quotas on another basis, instead of gender preference.
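The p-value 0.252 and the critical value 9.4877 can be checked without Excel, because the upper tail of a chi-squared distribution has a simple closed form when the degrees of freedom are even. The sketch below is illustrative Python; the function name is hypothetical, and the formula does not cover odd degrees of freedom, so it cannot reproduce the 5-df p-value from Exercise 12.27.

```python
from math import exp

def chi2_upper_tail_even_df(x, df):
    """P(chi-squared(df) > x), valid only for even df.

    Uses the closed form exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only applies to positive even df")
    half = x / 2
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k      # term is now (x/2)^k / k!
        total += term
    return exp(-half) * total

# p-value for the observed statistic 5.366 with 4 degrees of freedom.
p_value = chi2_upper_tail_even_df(5.366, 4)

# Tail probability beyond the quoted critical value; should be close to 0.05.
tail_at_crit = chi2_upper_tail_even_df(9.4877, 4)

print(round(p_value, 3), round(tail_at_crit, 4))
```

The p-value approach and the critical-value approach always give the same decision: the p-value exceeds 0.05 exactly when the observed statistic is below the 5% critical value.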
