
STAT 371 Course Notes

SPRING 2011

JOCK MACKAY

rjmackay@uwaterloo.ca

Statistics 371 © R.J. MacKay, University of Waterloo 2009

Index

Chapter 1: The Need for Statistics in Business
Chapter 2: Models linking explanatory and response variates
Chapter 3: Making Inferences from Regression Models
Chapter 4: The Analysis of Variance
Chapter 5: Assessing Model Fit
Chapter 6: Model Building
Chapter 7: Sample Survey Issues
Chapter 8: Probability Sampling
Chapter 9: Ratio and Regression Estimation with SRS
Chapter 10: Stratified Random Sampling
Appendix 1: R
Appendix 2: Properties of vectors and matrices of random variables
Appendix 3: Gaussian Quantile-Quantile Plots
Statistical Tables
Solutions to Exercises
Old Midterms and Exams

Please email me with any errors or points of clarification. These notes are a work in progress.

Data Sets: You can download all data sets in the notes and exercises from the file stat371.zip on the Angel course web page. You can access the individual files at the same site.


Chapter 1 The Need for Statistics in Business

"There is no substitute for knowledge" W. Edwards Deming
"The greatest obstacle to discovery is not ignorance – it is the illusion of knowledge" Daniel Boorstin

The purpose of Stat 371 and 372 is to provide a unified set of strategies and tools to apply Statistical Method in business and industry. In particular, the goal is to learn how to:
• pose clear questions
• collect the right data efficiently and effectively – a good plan
• provide useful conclusions
• communicate the conclusions and the method by which they are reached to a nontechnical audience

Statistics, or better Statistical Method, is a powerful, widely applicable process that we can use to learn about business processes and markets (populations). Statistical Method is empirical, that is, based on observational and experimental investigations. By collecting and analyzing the right data, we can increase our knowledge of the market, the products and services we produce (and plan to produce) and the processes we use in this production. We can then use this knowledge to make better decisions to improve the business.

Example 1
The maker of "frost-free" refrigerators in temperate New Zealand decided to expand their market to tropical south-east Asia. There were immediately numerous complaints about frost build-up in the fridges from the new market. The company interviewed 25 recent purchasers in each of the two markets and found that there were large differences in ambient environmental conditions (temperature and humidity) and usage (frequency of door openings, amount of food introduced at one time) in the two markets (investigation 1). They were convinced that these factors were the cause of the frost build-up in the tropical market. To solve the problem, they decided to try to redesign the fridge to make it more robust to ambient environmental conditions and usage factors. In an experimental investigation, they built 8 prototype fridges in which four design inputs were changed simultaneously. They then tested each prototype under two conditions defined by the extremes of the environmental and usage factors. The response variate was the temperature of the cooling plate in the fridge after 30 minutes of operation; low constant values mean that there will be no frost build-up. The experimental plan and data are:


Treatment   D1        D2        D3        D4        Normal   Extreme
1           new       new       new       new         0.7      2.1
2           new       new       original  original    2.9      4.8
3           new       original  new       original    2.4      9.6
4           new       original  original  new         3.8      5.9
5           original  new       new       original    1.9      4.0
6           original  new       original  new        -0.2      0.1
7           original  original  new       new        -0.1      3.5
8           original  original  original  original    0.2      7.2

The last two columns give the cooling plate temperature under the Normal and Extreme environmental conditions.
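Treatment 6's advantage can be seen with a quick summary of each prototype's average temperature and its shift between the two conditions. The following R sketch is our illustration (the variable names are ours, not part of the course files), using the values in the table above:

# cooling plate temperatures transcribed from the table above
normal  <- c(0.7, 2.9, 2.4, 3.8, 1.9, -0.2, -0.1, 0.2)
extreme <- c(2.1, 4.8, 9.6, 5.9, 4.0, 0.1, 3.5, 7.2)
# a robust design has a low, nearly constant temperature under both conditions
cbind(treatment = 1:8, avg = (normal + extreme)/2, shift = extreme - normal)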

Looking at the data, we can see that there are several promising designs (e.g. treatment 6). After further analysis and a review of the costs, the company adopted the combination in treatment 6 as the new design. The complaints about frost build-up disappeared.

Example 2
Municipal taxes in Ontario are based on the market value of the property. Where possible, the market or assessed value is determined by predicting the market value of the property using the prices from recent sales of comparable properties. A property owner may choose to appeal the assessed value. A large company felt that the assessed value of its very large property (an automobile assembly plant) was too high. To argue their case, they collected data on 38 large plants that had been sold in the last 10 years throughout Canada and the USA. The first few records are:

size (sq ft/10^6)  age (years)  percent office  build/land ratio  location  value ($/sq ft)
0.848              35            5.8             26.6             usa        4.32
1.813              37            3.2             17.3             usa        6.74
1.297              50           19.0             45.1             usa        6.36
1.747              23           10.2             13.3             usa        5.95

The idea was to predict the value of the building in question using a model constructed from the data and the known values of the explanatory variates size, age etc. Here the prediction was a failure, as there were many problems with the data and how it was collected.

We use PPDAC (Problem, Plan, Data, Analysis, Conclusion) to describe Statistical Method, the process we use to learn empirically. The purpose of each stage is:

Problem: Develop clear questions about attributes of the population/process of interest
Plan: Develop a plan to answer the questions posed
Data: Execute the Plan to collect the required data
Analysis: Analyze the data based on the Plan and a model to address the question


Conclusion: Answer the questions and report uncertainties and limitations

The following should remind you of the language of PPDAC and how we apply the process.

[PPDAC diagram: Target Population → Study Population → Sample → Measured variate values → (Model-based) analysis → Conclusions]

PPDAC is a process that we use to plan and execute empirical investigations so that we get reliable conclusions at a reasonable cost. There must be a good reason to undertake the investigation in the first place, and a resolve to take action and make decisions based on the Conclusions. Governments are famous for avoiding decisions by saying that another study is required. The two courses are organized by the nature of the Plan and the models used in the analysis. In Stat 371 we concentrate on applications of regression models and sample surveys. In Stat 372, we look at issues of data collected over time (time series, control charting) and the use of experimental plans.

Exercises
1. (A true story, believe it or not) To improve the shifting of the transmission, an automobile manufacturer organizes a clinic in which about 100 people evaluate the "feel" of 6 transmissions on different models from low to (very) high cost. Each person is asked to rate each transmission on several dimensions. The idea is to use the data to help design a new transmission that will have good "feel", to improve the perceived quality of the vehicle and hence improve sales/market share. To save money in organizing the clinic, the company uses the engineers at its development


center, of which 90% are males under the age of 35. What changes to this plan would you recommend? Why?

2. Write a brief description of the 6 Sigma program. Where does Statistical Method fit in 6 Sigma? What advantages and disadvantages can you see in an organization adopting such a program?

3. What is a software usability trial? What are two key issues in the design of such a trial? How does Statistical Method fit into a usability trial?

4. Give two examples of how you might use Statistical Method in market research.


Chapter 2 Models linking explanatory and response variates

In this chapter, we look at regression models and how to fit them to a set of data. Suppose we have a set of n units selected from a population and, for each unit i, we have the values of the response variate yi and p explanatory variates xi1, xi2, …, xip. The statistical problem is to fit a regression model of the form

yi = β0 + β1xi1 + … + βpxip + ri

where the parameters β0, β1, …, βp and the residuals ri, i = 1, …, n are unknown. There are many applications of such models. We give three here.

Example 1
The CAPM model is used to measure the risk of a single asset relative to that of a portfolio. The theoretical CAPM model describes the excess return (actual return – risk-free return) for a share over a period of time as a constant β times the excess return of the portfolio. If we model the excess IBM return as a random variable Y and the portfolio excess return as a random variable X, then we have Y = βX and β = stdev(Y)/stdev(X); that is, the parameter β measures the relative volatility of the IBM excess return. The common interpretation is that β > 1 corresponds to an asset riskier than the portfolio. In many empirical applications, percentage returns are collected for the asset yi and the portfolio xi over a number of periods (e.g. days, months) and a linear model of the form yi = β0 + β1xi + ri is fit to the data. The risk-free return is not included in the model; since this return is small, the fit will not change markedly. Note we can fit a model that includes the risk-free return if the data were available.

For example, suppose we want to assess the risk of an IBM share relative to the S&P 500 index, measured on a monthly basis. The purpose of this modeling is to estimate an attribute of the population of monthly returns. There are many issues about the time period (months) and the sampling period. The month over month returns from Jan 2001 to March 2003 for IBM and the S&P 500 are given in the file IBM.txt. The variate names are sp.ret and ibm.ret. We see a scatterplot of the data on the next page, created with the R code

plot(sp.ret, ibm.ret, xlab='SP500 return', ylab='IBM return', main='IBM vs. S&P 500 Monthly Returns')

From the plot, the model should provide a reasonable fit to the data.

Example 2
In Chapter 1, we introduced the problem of determining the market value of a property that is not sold, using known explanatory variates and the market values of similar

properties that were actual sales. To do so, we first fit a regression model using the data from the actual sales. We then use the model to predict the market value of the property that was not sold. In Ontario, there is a private organization that makes extensive use of regression modeling to provide market values to municipalities for all properties; these values provide the basis for property taxes. There are issues about which properties to include in the data set and which explanatory variates to include in the model. The data are in the file assessment.txt. There are 38 units (large sales) with 5 explanatory variates (size, age, office, ratio and location) and the response variate value ($ per square ft). The values of the explanatory variates for the unsold building are size = 13.825, age = 21, percent office = 3.8, building/land ratio = 53, location = 0 (Canada). There are many applications of regression where the object is to predict the unknown response variate for a given set of values of the explanatory variates.

Example 3
A service organization has 24 offices. In the planning of an audit, the accountant looks at the stated overhead from the current and past year for each office. He also has access to the office size and age, the number of employees and clients, and the relative cost of living in the city where the office is located. The auditor plans to fit a model relating overhead to the explanatory variates in order to look for outliers, offices for which the relationship between the explanatory variates and the response is very different. He will devote more audit resources to any such office. This is an example of an analytic method in auditing. Another similar application is to look at salaries of employees relative to the work they do. The data are in the file analytic.txt.

Fitting the Model – Least Squares
We represent the data model in terms of vectors and matrices. Let y be an n×1 column vector containing the response variate values, xj a column vector containing the values of the jth (j = 1, …, p) explanatory variate and r a column vector of the unknown residuals. Also let 1 = (1, …, 1)ᵀ be a column vector of n 1's and X = (1, x1, …, xp) an n×(1+p) matrix with columns corresponding to the explanatory variates. Finally, let β = (β0, β1, …, βp)ᵀ be a (1+p)×1 column vector of the unknown coefficients. By "fitting the model", we mean that we estimate the unknown model parameters using the data. Then we can write the model in terms of these vectors as

y = β0 1 + β1 x1 + … + βp xp + r = (1, x1, …, xp)β + r

or, more compactly, as

y = Xβ + r

We have written y as the sum of two vectors: β0 1 + β1 x1 + … + βp xp, which lies in span(1, x1, …, xp), and the residual vector r. Here span(1, x1, …, xp) is the subspace of Rⁿ spanned by the columns of X. We assume that this subspace has dimension p+1 or, equivalently, that the columns of X are linearly independent.

[Figure: y pictured in Rⁿ as the sum of β0 1 + β1 x1 + … + βp xp in span(1, x1, …, xp) and the residual vector r]

To fit the model, we use least squares. That is, we find the value of β that minimizes the function

W(β) = Σi (yi − β0 − β1xi1 − … − βpxip)² = ||y − Xβ||² = ||r||²

To minimize the squared length of r, we project y orthogonally onto span(1, x1, …, xp) as shown below.

[Figure: the orthogonal projection of y onto span(1, x1, …, xp)]

The estimated residual vector r̂ = y − Xβ̂ is orthogonal to span(1, x1, …, xp) or, equivalently, to every column of X. That is, we have

1ᵀr̂ = 0, x1ᵀr̂ = 0, …, xpᵀr̂ = 0

We can write these equations more compactly as Xᵀr̂ = 0. Substituting for r̂, we get Xᵀ(y − Xβ̂) = 0, and after rearrangement,

β̂ = (XᵀX)⁻¹Xᵀy

Note that XᵀX has an inverse because we assume that X has full rank (i.e. p+1 linearly independent columns). We label the projection (called the vector of fitted values) as μ̂, so

μ̂ = β̂0 1 + β̂1 x1 + … + β̂p xp = Xβ̂ = X(XᵀX)⁻¹Xᵀy = Hy

and the estimated residual vector is

r̂ = y − Xβ̂ = (I − H)y

where the matrix H = X(XᵀX)⁻¹Xᵀ is called the hat-matrix and is the projection onto the subspace span(1, x1, …, xp). H has several interesting properties – see the exercises. Note that we have decomposed the vector y = Hy + (I − H)y into two orthogonal components.

Example
We use R to fit the empirical CAPM model to the IBM returns vs S&P 500 returns. The following code produces the given output.

a<-read.table("IBM.txt", header=TRUE)
attach(a)
b<-lm(ibm.ret~sp.ret)
summary(b)
fitted(b)
plot(sp.ret, ibm.ret, main="IBM monthly return vs S&P 500 monthly return")
abline(b)

The output is:

Call:
lm(formula = ibm.ret ~ sp.ret)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    1.066      1.431   0.745    0.463
sp.ret         1.742      0.255   6.834 2.01e-07 ***
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 7.261 on 28 degrees of freedom
Multiple R-Squared: 0.625, Adjusted R-squared: 0.6117
F-statistic: 46.68 on 1 and 28 DF, p-value: 2.013e-07

[The residual five-number summary and the 30 fitted values printed by fitted(b) are omitted here.]
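The matrix formulas behind this output can be verified directly. The following sketch is our illustration (not part of the course code), reusing the attached variates and the fit b:

X <- cbind(1, sp.ret)                                 # the model matrix (1, x1)
beta.hat <- solve(t(X) %*% X) %*% t(X) %*% ibm.ret    # (X'X)^(-1) X'y
H <- X %*% solve(t(X) %*% X) %*% t(X)                 # the hat matrix
max(abs(H %*% ibm.ret - fitted(b)))                   # mu-hat = Hy matches fitted(b)
max(abs((diag(30) - H) %*% ibm.ret - resid(b)))       # r-hat = (I - H)y matches resid(b)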

Notes on the R code:
1. The function lm(y~x1+x2+…+xp) fits the regression model that includes a constant term. This term can be omitted using the code lm(y~-1+x1+x2+…+xp). The output of the function is assigned to the model object b <- lm(y~x1+x2+…+xp).
2. We can look at the contents of the model object b with the commands:
• summary(b): the table of estimated coefficients and statistics
• fitted(b): the vector of the fitted values in the same order as the original data
• resid(b): the estimated residuals
• anova(b): an Analysis of Variance table
• coefficients(b): the estimated coefficients
3. abline(b) adds the fitted line to the scatter plot when there is a single explanatory variate.

To interpret the output, we note that β̂0 = 1.066 and β̂1 = 1.742. Since the estimated slope is greater than 1, we know that the IBM share is more volatile than the market as defined by the S&P 500. We will interpret most of the other statistics when we look at formal inference procedures for the corresponding response (probability) model. R-squared (usually written R²) is defined as

R² = 1 − (residual sum of squares from the fitted model)/(residual sum of squares from the model with only a constant term)
   = 1 − ||r̂||² / ||y − ȳ1||²

where ȳ is the sample average of the response variate. R² is always between 0 and 1 and is often quoted as a percentage. R² is 1 when the length of the residual vector r̂ is 0, i.e. y lies in span(1, x1, …, xp); in other words, if we can write y as a linear combination of 1, x1, …, xp, then R² is 1. R² is 0 when r̂ = y − ȳ1, i.e. the fitted model does not involve x1, …, xp: β̂0 = ȳ, β̂1 = 0, …, β̂p = 0. Another interpretation is that 100R² is the percent of the variation in the response variate explained by the explanatory variates. In the example, we can say that the S&P 500 returns "explain 62.5% of the variation in the IBM returns". If we quote R² this way, we need to be sure to explain what this means, since neither the numerator nor the denominator is the usual measure of variation in the response variate. In some sense, R² measures how well the model fits the data, but you need to be very careful in this interpretation. See Exercise 7.
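The definition can be checked numerically. A short sketch (ours), for the CAPM fit b above:

rss <- sum(resid(b)^2)                   # ||r-hat||^2
tss <- sum((ibm.ret - mean(ibm.ret))^2)  # ||y - ybar 1||^2
1 - rss/tss                              # matches Multiple R-Squared: 0.625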

Exercises
1. We use this artificial example to help you review the basic concepts of fitting a model using least squares. The data are shown below.

[Data table with the values of the explanatory variates x1, x2 and the response y for the 10 units]

Using R to fit the model yi = β0 + β1xi1 + β2xi2 + ri, we get the summary in the following text box.

Call:
lm(formula = y ~ x1 + x2)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.56017    0.51506   6.912 0.000229 ***
x1           2.01086    0.09959  20.192 1.83e-07 ***
x2          -1.26481    0.08845 -14.300 1.95e-06 ***
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 0.7581 on 7 degrees of freedom
Multiple R-Squared: 0.9879, Adjusted R-squared: 0.9844
F-statistic: 285.3 on 2 and 7 DF, p-value: 1.958e-07

a) What are the estimates β̂?
b) Calculate μ̂1 and r̂1.

2. Use R to fit the model yi = β0 + β1xi1 + … + βpxip + ri to the assessment data (assessment.txt) with
a) all 5 explanatory variates
b) only age and size
c) Do the estimated coefficients change? Why?

3. Suppose we have the returns on an asset yi, the return on the market xi1 and the risk-free return xi2 for n periods. Consider three regression models:
Model 1: (yi − xi2) = β(xi1 − xi2) + ri
Model 2: yi = β0 + β1xi1 + ri
Model 3: yi = γ0 + γ1xi1 + γ2xi2 + ri
When we fit each model, will the coefficient of x1, the measure of volatility, change? Explain.

4. Suppose we have a response variate yi and a single explanatory variate xi1 for each of n units sampled from a population. Consider the two models
Model 1: yi = β0 + β1xi1 + ri
Model 2: yi = γ0 + γ1(xi1 − x̄1) + ri
where x̄1 is the sample average of the explanatory variate.
a) Show that the vectors x1 − x̄1·1 and 1 are orthogonal.

b) Why is span(1, x1) = span(1, x1 − x̄1·1)?
c) In fitting each model, we project onto a subspace. How are those projections different?
d) What is the relationship between the estimated coefficients in fitting the two models?
e) How does the result in a) simplify the calculation of γ̂ when fitting model 2?

5. We defined the hat matrix H = X(XᵀX)⁻¹Xᵀ. Show that
a) Hᵀ = H
b) H² = H, H(I − H) = 0. Interpret this result geometrically.
c) (I − H)² = (I − H)
d) 0 ≤ hii ≤ 1, where hii is the ith diagonal element of H.

6. The data in the file anscombe.txt were produced by F.J. Anscombe (American Statistician 27, 17-21) to demonstrate the difficulty of using R² as a measure of fit and the importance of plotting the data. The file contains 4 sets of (x, y) vectors, labeled x1-x4 and y1-y4.
a) For each pair, fit a straight line model and report the estimated parameters and the coefficient of determination R².
b) For each pair, construct a scatterplot of y versus x and add the fitted line.
c) Comment.

7. Some questions about R²:
a) In question 1, which model gave a larger value for R²?
b) Show that R² cannot decrease if we add extra terms to a model.


Chapter 3 Making Inferences from Regression Models

In this chapter, we look at formal inference procedures such as hypothesis tests, confidence intervals and prediction intervals for regression models. We use these procedures to help answer questions of interest such as:
• Is there evidence that IBM returns are more volatile than the S&P 500 index?
• What is a range of plausible values for an unsold property based on the values of its explanatory variates?

To start, we consider a statistical model to describe the repeated application of the Plan. Suppose we have a set of n units selected from a population and, for each unit i, we have the values of the response variate yi and p explanatory variates xi1, xi2, …, xip. A statistical regression model is

Yi = β0 + β1xi1 + … + βpxip + Ri,  Ri ~ G(0, σ), i = 1, …, n independent

This model uses random variables to replace the response variate values and residuals in the data model. We use the model to describe how the estimates would behave if we were to repeat the Plan over and over. Note that, in the model:
• E(Yi) = β0 + β1xi1 + … + βpxip, so we can interpret βj as the change in E(Yi) when the jth explanatory variate changes by 1 unit with all other explanatory variates held fixed. If xj is continuous, we can interpret βj in terms of the partial derivative ∂E(Yi)/∂xj = βj, the rate of change of E(Yi) as xj changes, again with all other explanatory variates held fixed.
• stdev(Yi) = stdev(Ri) = σ is constant.
• We treat the explanatory variates as constants (not random variables) in the model.

We can combine the n independent gaussian random variables R1, …, Rn into a vector R ~ N(0, σ²I). The column vector 0 gives the component means. The variance-covariance matrix σ²I gives the variance σ² of Ri in the ith diagonal position and the covariance Cov(Ri, Rj) = 0 in the ijth position. See Appendix 2. We write the model more compactly as

Y = Xβ + R,  R ~ N(0, σ²I),  so Y ~ N(Xβ, σ²I)

The estimator β̃ = (XᵀX)⁻¹XᵀY (a (p+1)×1 vector of random variables) describes the behaviour of β̂ under repetitions of the Plan.
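A small simulation makes "repeating the Plan over and over" concrete. This sketch is ours, with arbitrary true parameter values chosen only for illustration:

# assumed true values, for illustration only
beta <- c(1, 2); sigma <- 3
x <- runif(25)
X <- cbind(1, x)
slope.hat <- replicate(1000, {
  y <- X %*% beta + rnorm(25, 0, sigma)   # one repetition of the Plan
  solve(t(X) %*% X, t(X) %*% y)[2]        # least-squares estimate of the slope
})
# the mean should be near beta[2] and the sd near sigma*d2, as derived below
c(mean(slope.hat), sd(slope.hat), sigma*sqrt(diag(solve(t(X) %*% X)))[2])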

Using the properties of expectation and variance of linear combinations of random variables (Appendix 1), we have

E(β̃) = E((XᵀX)⁻¹XᵀY) = (XᵀX)⁻¹XᵀE(Y) = (XᵀX)⁻¹XᵀXβ = β
Var(β̃) = Var((XᵀX)⁻¹XᵀY) = [(XᵀX)⁻¹Xᵀ] σ²I [(XᵀX)⁻¹Xᵀ]ᵀ = σ²(XᵀX)⁻¹

and, using the properties of the multivariate normal distribution, β̃ ~ N(β, σ²(XᵀX)⁻¹). That is, each component of β̃ is gaussian, i.e. β̃j ~ G(βj, σdj), where dj is the square root of the jth diagonal element of (XᵀX)⁻¹. Note that the components of β̃ are not independent unless (XᵀX)⁻¹ is diagonal or, in other words, the columns of X are orthogonal.

To estimate σ, we use the sum of squares of the estimated residuals divided by the degrees of freedom:

σ̂ = √( Σi r̂i² / (n − (p+1)) ) = ||r̂|| / √(n − (p+1)) = ||(I − H)y|| / √(n − (p+1))

The corresponding estimator is

σ̃ = ||r̃|| / √(n − (p+1)) = ||(I − H)Y|| / √(n − (p+1)) = ||(I − H)R|| / √(n − (p+1))

since (I − H)Xβ = 0. Note that E(||(I − H)R||²) = [n − (p+1)]σ² – see the exercises – which partially justifies the denominator.
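These quantities can be computed directly. The following sketch (our illustration) reproduces, for the CAPM fit b of Chapter 2, the residual standard error and the coefficient standard errors that summary(b) reports:

X <- model.matrix(b)                                # the matrix X = (1, x1)
n <- nrow(X); p <- ncol(X) - 1
sigma.hat <- sqrt(sum(resid(b)^2)/(n - (p + 1)))    # residual standard error
d <- sqrt(diag(solve(t(X) %*% X)))                  # the constants d_j
sigma.hat * d                                       # standard errors of the coefficients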

We also have the unproven result that

σ̃/σ ~ K with n − (p+1) degrees of freedom

We can also easily show that Cov(β̃, r̃) = 0, so that β̃ and r̃ are statistically independent – see the exercises. Since σ̃ is a function of r̃, it then follows that

(β̃j − βj)/(σ̃dj) ~ t with n − (p+1) degrees of freedom

We use this t-distribution to test hypotheses and find confidence intervals for the individual parameters βj. Tables for the t-distribution are given in Appendix 3. We use the same 5-step procedure as in the beloved Stat 231.

Example 1
We fit an empirical CAPM model to a series of monthly returns from an IBM share versus the corresponding returns of the S&P 500 index in the file IBM.txt. The summary R output is given in Chapter 2. One question of interest is to see if β1 is different from 1. In words, is the volatility of an IBM share different than that of the index? We consider a test of the hypothesis β1 = 1.

Step 1: (Formulate) Suppose β1 = 1.

Step 2: (Estimate) We have β̂1 = 1.742 from the R output. The standard error is the estimated standard deviation of the corresponding estimator β̃1 and is given in the R output for each estimated coefficient; here σ̂d1 = 0.255. Note that the estimate of σ is called the residual standard error in the summary, and the degrees of freedom correspond to the denominator in the calculation of σ̂.

Step 3: (Calculate the discrepancy measure) The "distance" from the estimated parameter to the hypothesized value is

d = |β̂1 − 1| / (σ̂d1) = |1.742 − 1| / 0.255 = 2.91

Note that the denominator is called the standard error of β̂1.

Step 4: (Calculate the p-value) To assess whether this distance is large or small, we calculate the p-value Pr(|t28| ≥ 2.91) = 0.007. We can calculate this probability in R as 2*(1 - pt(2.91, 28)). The p-value is the chance that we get such a large discrepancy between the estimated and hypothesized value if the hypothesis is true.

Step 5: (Interpret) Since the p-value is so small, we say that there is strong evidence that β1 is different from 1. More generally, we interpret a p-value according to the table:

Range of p-value        Interpretation
greater than 0.10       no evidence against the hypothesis
between 0.05 and 0.10   weak evidence against the hypothesis
between 0.01 and 0.05   some evidence against the hypothesis
less than 0.01          strong evidence against the hypothesis

The conclusion in the example is that there is strong evidence that the volatility of the IBM share is different from that of the index.

We can also summarize our knowledge of β1 using a confidence interval. Recall that the general form of a confidence interval (based on a t-distribution) is

estimate ± c × standard error(estimate)

where the constant c is chosen from the t-tables so that Pr(−c ≤ t ≤ c) is the confidence level for the appropriate degrees of freedom. In the example, for a 95% confidence interval, we have, from the tables, c = 2.05 and the interval is 1.742 ± 2.05 × 0.255, or 1.742 ± 0.523. Note that, as expected, 1 is outside of the confidence interval and is not a plausible value for β1 based on the data. The confidence interval shows us how precisely we have estimated the parameter.
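The test and the interval can be reproduced from the summary quantities alone. A minimal sketch (ours), using the numbers quoted above:

d <- abs(1.742 - 1)/0.255        # discrepancy measure, 2.91
2*(1 - pt(d, 28))                # p-value Pr(|t28| >= 2.91), about 0.007
cval <- qt(0.975, 28)            # the constant c = 2.05 for a 95% interval
1.742 + c(-1, 1)*cval*0.255      # the interval 1.742 +/- 0.523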

Example 2
A marketing firm wants to test two sales promotions. In a pilot project, 30 stores are selected and divided at random into three groups of 10. One group is given promotion 1, one group is given promotion 2 and the third group acts as a control. For each store, the firm measures the total sales over a two-week period before and after the promotion is in place and calculates the percent change. They also measure the sales of competing products during the promotion period. The data are stored in the file trial.txt, with columns comp.sales and percent.change to indicate the measured variates. The promotion for each store is specified by two indicator variates x1 and x2:

promotion     x1  x2
control        0   0
promotion 1    1   0
promotion 2    0   1

The statistical model is

Yi = β0 + β1xi1 + β2xi2 + β3 comp.salesi + Ri,  Ri ~ G(0, σ), i = 1, …, 30 independent

Note the interpretation of the parameters β1 and β2. Holding x2 = 0 and comp.sales fixed, β1 represents the increase in the mean response (percent change in sales) if we change from the control to promotion 1. That is, β1 measures the effect of promotion 1. We have a similar interpretation for β2. We plot the data by promotion using the R code

a <- read.table('trial.txt', header=T)
attach(a)
p <- c(rep("c",10), rep("1",10), rep("2",10))
plot(comp.sales, percent.change, xlab='competing sales', ylab='percent change', main='Percent Change in Sales vs Competing Sales by Promotion', type='n')
text(comp.sales, percent.change, p)

The 30×1 vector p is a string of characters corresponding to the promotion. The type='n' in the plot command suppresses the plotting of any points but sets up the axes and labels. The text command adds the points using the characters in the vector p as the plotting symbols, so the plotting symbol corresponds to the promotion.

[Figure: scatterplot of percent change in sales versus competing sales, plotted by promotion]

Fitting the model, we get the summary output (abridged):

Call:
lm(formula = percent.change ~ x1 + x2 + comp.sales)

Coefficients:
              Estimate Std. Error
x1              8.02       2.40
x2              2.42       2.39
comp.sales  -0.0006446  0.0003564

Residual standard error: 5.334 on 26 degrees of freedom
Multiple R-Squared: 0.3316, Adjusted R-squared: 0.2545
F-statistic: 4.3 on 3 and 26 DF, p-value: 0.01367

To examine the effect of promotion 1, we consider the hypothesis β1 = 0, corresponding to no effect. By default, the R output gives the results of the corresponding t test, in this case with p-value 0.00159. There is strong evidence that promotion 1 has a positive effect, with β̂1 = 8.02 (standard error 2.40), if all other explanatory variates in the model are fixed. There is no evidence that promotion 2 has an effect, with β̂2 = 2.42 (standard error 2.39).

Although it is clear here that the effects of the two promotions differ, suppose we were interested in the parameter θ = β1 − β2 that measures the difference in the effects of the two promotions. We have θ̂ = β̂1 − β̂2 = 5.60. How can we get the standard error of this estimate? We use vectors to represent θ = β1 − β2. If we let a = (0, 1, −1, 0)ᵀ, then θ = aᵀβ, θ̂ = aᵀβ̂ and θ̃ = aᵀβ̃. Since β̃ ~ N(β, σ²(XᵀX)⁻¹), we have θ̃ = aᵀβ̃ ~ N(aᵀβ, σ²aᵀ(XᵀX)⁻¹a), and hence the standard error of θ̂ is σ̂√(aᵀ(XᵀX)⁻¹a). We can calculate the standard error in R using the following statements; note the comments after the #.

b <- lm(percent.change~x1+x2+comp.sales)  # fit the model
X <- model.matrix(b)                      # extract the X matrix
W <- solve(t(X)%*%X)                      # find (X'X)^(-1); note the transpose function t(), matrix multiplication %*% and inverse function solve()
a <- c(0,1,-1,0)                          # define the vector a

theta.hat <- t(a)%*%coef(b)          # calculate theta-hat
st.err <- 5.334*sqrt(t(a)%*%W%*%a)   # the standard error, using the estimate of sigma = 5.334 from summary(b)
st.err                               # display SE(theta-hat)

We get the standard error SE(θ̂) = 1.10. Using the fact that Pr(−2.06 ≤ t26 ≤ 2.06) = 0.95, the 95% confidence interval for the difference in the effects of the two promotions is 5.60 ± 2.27. We can be confident that, compared to promotion 2, promotion 1 produces a percent change in average sales between 3.33% and 7.87%. Promotion 1 looks promising in terms of its effect on sales. We can draw conclusions about any linear combination of the coefficients using the same methodology.

Notes
1. In Example 2, you might wonder why we did not create an explanatory variate x3 for the control promotion. In this case, we would have

promotion     x1  x2  x3
control        0   0   1
promotion 1    1   0   0
promotion 2    0   1   0

This creates problems in the fitting, since x1 + x2 + x3 = 1 and so the columns of the matrix X are linearly dependent. We proceed by deleting x3 from the model, as in the example, or by deleting the intercept with the R command

b <- lm(percent.change~-1+x1+x2+x3+comp.sales)

The -1 in the model specification suppresses the intercept. Since we are projecting onto the same space in each model, the estimates of the parameters corresponding to comparisons, e.g. promotion 2 vs promotion 1, are identical, with the same standard error.

2. Also in Example 2, it is tempting to simplify the modeling by using a single vector x with elements 0, 1 or 2 corresponding to the promotions control, one or two. That is, we could fit the model

percent.change = β0 1 + β1 x + β2 comp.sales + r

When you try to interpret β1, you can see the problem: there is no reason to suspect that changing from the control to promotion 1 has the same effect, as implied by this model, as changing from promotion 1 to promotion 2. You need to be careful to recognize categorical explanatory variates that are coded as integers for convenience. In this case you need to set up indicator variates to represent the categories, as described in Note 1.

Prediction Intervals
Suppose we want an interval of plausible values of the response variate for a unit with known values of the explanatory variates. This is the problem we need to solve in the market-value assessment example discussed in Chapter 1. In general, let uᵀ = (1, x1, …, xp) be the values of the explanatory variates for the unit whose response variate has not been measured. From the model, we can describe the behaviour of the response variate by the random variable

Y = β0 + β1x1 + … + βpxp + R = uᵀβ + R,  R ~ G(0, σ)

so that Y ~ G(uᵀβ, σ). We also know that uᵀβ̃ ~ G(uᵀβ, σ√(uᵀ(XᵀX)⁻¹u)), so

Y − uᵀβ̃ ~ G(0, σ√(1 + uᵀ(XᵀX)⁻¹u))

and

(Y − uᵀβ̃) / (σ̃√(1 + uᵀ(XᵀX)⁻¹u)) ~ t with n − (p+1) degrees of freedom

We use this t distribution to produce prediction intervals for Y. If Pr(−c ≤ t ≤ c) = 0.95, then, rearranging the inequality, we have

Pr(uᵀβ̃ − cσ̃√(1 + uᵀ(XᵀX)⁻¹u) ≤ Y ≤ uᵀβ̃ + cσ̃√(1 + uᵀ(XᵀX)⁻¹u)) = 0.95

We get the prediction interval by replacing the estimators with the corresponding estimates:

uᵀβ̂ − cσ̂√(1 + uᵀ(XᵀX)⁻¹u) ≤ y ≤ uᵀβ̂ + cσ̂√(1 + uᵀ(XᵀX)⁻¹u)

To illustrate how we can get this interval using R, suppose in Example 2 we want to predict the percent change in sales for a store that uses promotion 1 with competitor sales $3000.

b <- lm(percent.change~x1+x2+comp.sales)
new <- data.frame(x1=1, x2=0, comp.sales=3000)
p <- predict(b, newdata=new, interval="p", level=0.95)
p

The second line creates a new data set (a data.frame in R-speak) with the values of the explanatory variates for which we want to make the prediction. The third line calculates the interval; the option interval="p" produces a prediction interval at the given values of the explanatory variates. The last line prints the fitted value uᵀβ̂ and the prediction interval. In the example, we get

          fit       lwr      upr
[1,] 7.257904  -4.26998 18.78579

That is, we predict the percent change in sales to be between −4.3% and 18.8%. This interval is wide because of the high variation within stores (σ̂ = 5.334).
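The same interval can be assembled from the formula directly. A sketch (ours), reusing the fit b:

X <- model.matrix(b)
u <- c(1, 1, 0, 3000)               # (1, x1, x2, comp.sales) for the new store
sigma.hat <- summary(b)$sigma       # 5.334
fit <- sum(u*coef(b))               # u' beta-hat
half <- qt(0.975, 26)*sigma.hat*sqrt(1 + t(u)%*%solve(t(X)%*%X)%*%u)
c(fit - half, fit + half)           # matches the predict() interval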

Exercises
1. In a small study, a company that manufactures candle wax examined 20 candles made from batches of wax with different amounts of fragrance oil added. The company was interested in understanding the relationship between the hardness of the candles (a technical measurement) and the amount of fragrance oil added. The data are stored in the file hardness.txt. The variates are named hardness and frag.oil. Consider the simple model, where we assume separate batches are independent,

hardness = β0 + β1 frag.oil + R,  R ~ G(0, σ)

a) Construct a scatterplot of hardness vs the amount of fragrance oil.
b) Interpret the parameter β0.
c) Find a 95% confidence interval for β0.
d) Find a 95% confidence interval for the average hardness of candles made with 2% fragrance oil.
e) Add a quadratic term to the model (in R, f2 <- frag.oil*frag.oil creates a vector with components the square of those in frag.oil). Is there any evidence of curvature in the relationship?

2. Using the data in the promotion trial described in this chapter:
a) Find a 95% prediction interval using promotion 2 for a large store where the competing sales are $30000. Can you see any difficulty with this prediction?
b) Construct a prediction interval for the change in sales if promotion 1 is used rather than promotion 2 for the same store (i.e. competing sales are fixed). [You will need to go back to first principles.]

3. Some ideas about confidence intervals:
a) Using the R output given for the sales promotion example, find a 99% confidence interval for the effect of competing sales on the percent change in sales. What can you conclude?
b) How does the confidence interval change as we increase the confidence level?
c) Suppose we have θ̃, the estimator for a parameter θ, with θ̃ ~ G(θ, dσ), and the statistically independent σ̃ with n − (p+1) degrees of freedom. Derive the confidence interval for θ.
d) Show that θ0 is in the 95% confidence interval for θ if and only if the p-value for the test of the hypothesis θ = θ0 exceeds 5%.

4. Prove that the components of β̃ are independent if and only if the columns of X are orthogonal.

5. Prove Cov(β̃, r̃) = 0.

6. One (poor) justification for the denominator n − (p+1) in the estimate σ̂² = r̂ᵀr̂/(n − (p+1)) is that with this choice we have E(σ̃²) = σ². Here we verify this result. Let r̃ = (I − H)R. We want to show that E(r̃ᵀr̃) = (n − p − 1)σ². Recall that for any square matrix A, the trace of the matrix is tr(A) = Σ aii.
a) Show that if C is n×k and D is k×n, then tr(CD) = tr(DC).
b) Show that E(r̃ᵀr̃) = tr(E(RRᵀ)(I − H)) = σ²tr(I − H).
c) Use the result from a) and the fact that H = X(XᵀX)⁻¹Xᵀ to evaluate E(r̃ᵀr̃).

Chapter 4 The Analysis of Variance

In Chapter 3, we saw how to make formal inference statements about any component βj or any linear combination θ = aᵀβ of the coefficient vector β. The confidence intervals and hypothesis tests for these one-dimensional parameters are based on a t distribution of the corresponding estimator. That is,

(θ̃ − θ)/(σ̃d) ~ t with n − (p+1) degrees of freedom

where the constant d is determined by finding stdev(θ̃) = σd = σ√(aᵀ(XᵀX)⁻¹a). In this chapter, we look at hypothesis tests that involve several of the parameters simultaneously.

Example 1
In the problem of predicting the market value of a large plant (see Chapter 2), we fit a model with 5 explanatory variates (size, age, percentage of office space, ratio of building size to the land area, and the country of sale, 1=USA, 0=Canada) to the adjusted value (current $ per square foot) for 38 sales. The data are in the file assessment.txt. In fitting the full model, the summary output from R is given below (abridged):

Call:
lm(formula = value ~ size + age + office + ratio + location)

Coefficients:
       Estimate Std. Error t value Pr(>|t|)
age    -0.52300    0.10820  -4.833 3.22e-05 ***

(none of the other estimated coefficients has a p-value below 0.08)

Residual standard error: 5.993 on 32 degrees of freedom
Multiple R-Squared: 0.4537, Adjusted R-squared: 0.3683
F-statistic: 5.315 on 5 and 32 DF, p-value: 0.001150

The output includes the results of a t test of the hypothesis that each coefficient is 0, given that all the other explanatory variates are in the model. Is there any evidence that only age is an important explanatory variate? That is, is there any evidence that any one of β1, β3, β4, β5 differs from 0? The corresponding hypothesis is

β1 = 0, β3 = 0, β4 = 0, β5 = 0

The defining relationships in the hypothesis hold simultaneously for all the parameters listed. We can also ask questions such as: is there any evidence that all of the explanatory variates together explain a significant portion of the variation in the response variate? The corresponding hypothesis is β1 = 0, β2 = 0, …, β5 = 0. In each case, the hypothesis is multi-dimensional; it cannot be framed in terms of a single parameter.

Example 2
To compare 5 different versions of a product to the current version, an R&D department conducted a clinic in which each of the 6 versions was assessed by 8 different subjects. After trying the product, the subjects completed a questionnaire to determine
• a score to measure past experience with similar products
• a score to measure satisfaction with the proposed version
The first question of interest was to see if any of the new versions were different from the original, adjusting for constant background experience. The data are stored in the file product.txt. To model the data, let, for i = 1, …, 48,

yi: satisfaction score for subject i
pst.scorei: past experience score for subject i
xij = 1 if subject i used version j, xij = 0 otherwise

where j = 1, …, 6 and j = 1 corresponds to the current version. Then we write the data model as

y = β1x1 + … + β6x6 + β7 pst.score + r

Note that there is no intercept term in the model. There is no difference in the versions if β1 = β2 = … = β6.

To test multi-dimensional hypotheses such as those described in Examples 1 and 2, the basic idea is to construct a discrepancy measure that is the ratio of two estimates of the residual variance σ². The first estimate is valid whether or not the hypothesis is true. The second is a valid estimate of σ² only if the hypothesis is true. The ratio will tend to be

different from 1 if the hypothesis is not true. The basic steps are:

Step 1: Fit the full model to get the usual estimate of the variance σ̂² = r̂ᵀr̂/(n − p − 1), with n − p − 1 degrees of freedom. Note that the sum of squares of the estimated residuals r̂ᵀr̂ = Σ r̂i² is an estimate of (n − p − 1)σ².

Step 2: Fit the reduced model, assuming the hypothesis is true, to get the residual sum of squares r̂Hᵀr̂H. If the reduced model has q+1 parameters, then r̂Hᵀr̂H is an estimate of (n − q − 1)σ² with n − q − 1 degrees of freedom.

Step 3: The second estimate of σ² is based on the so-called additional sum of squares r̂Hᵀr̂H − r̂ᵀr̂, which estimates (p − q)σ² if the hypothesis is true. Hence, if this is the case, (r̂Hᵀr̂H − r̂ᵀr̂)/(p − q) estimates σ² with p − q degrees of freedom.

Step 4: The discrepancy measure is

f = [(r̂Hᵀr̂H − r̂ᵀr̂)/(p − q)] / σ̂²

Step 5: To calculate the p-value, we find Pr(F ≥ f), where F has an F distribution with p − q [numerator] and n − p − 1 [denominator] degrees of freedom.

We call this procedure the Analysis of Variance (ANOVA).

Example 1
We can illustrate the steps with the assessment data, where we have, in the full model, n = 38 and p = 5. The hypothesis is β1 = 0, β3 = 0, β4 = 0, β5 = 0.

Step 1: Fit the full model. From the R output, we have σ̂ = 5.993 and the residual sum of squares 32(5.993)² = 1149.314 with 32 degrees of freedom.

Step 2: Fit the reduced model value = β0 1 + β2 age + r:

Call:
lm(formula = value ~ age)

Coefficients (abridged):
     Estimate Std. Error t value Pr(>|t|)
age  -0.45498    0.09971  -4.563 5.66e-05 ***

Residual standard error: 6.084 on 36 degrees of freedom
Multiple R-Squared: 0.3664, Adjusted R-squared: 0.3488
F-statistic: 20.82 on 1 and 36 DF, p-value: 5.665e-05

The residual sum of squares is 36(6.084)² = 1332.542 with 36 degrees of freedom.

Step 3: The second estimate of σ², assuming the hypothesis is true, is

(1332.542 − 1149.314)/(5 − 1) = 183.23/4 = 45.81

Step 4: The discrepancy measure is f = 45.81/35.92 = 1.275, where σ̂² = 5.993² = 35.92.

Step 5: Using the R function 1-pf(1.275, 4, 32) or the Tables in the Appendix, we see that Pr(F4,32 ≥ 1.275) = 0.30. Since the p-value is so large, there is no evidence against the hypothesis. In other words, once age is included in the model, there is no evidence that the model is improved by adding all of the other explanatory variates.

Notes
1. F distribution and F tables. Mathematically, an F random variable with num and den degrees of freedom is defined as

F = (χ²num/num) / (χ²den/den) = (K²num/num) / (K²den/den)

where the numerator and denominator are independent. An F random variable is always positive and has mean close to 1. There are tables (in the same format as the t tables) in the Appendix. For each tail probability, there is one page of tables, with a column for the numerator degrees of freedom and a row for the denominator degrees of freedom.

2. You may have wondered at Step 2 why we could not have used the estimate of σ² produced from fitting the reduced model directly, rather than the estimate in Step 3 based on the change in the residual sum of squares. The reason is that by subtraction we get independent estimators, which are required for the F distribution.

Calculations with R
We can use R to perform all of the calculations. We fit both the full model and the reduced model and then apply the anova() function. For Example 1, the code is

b<-lm(value~size+office+age+ratio+location)
c<-lm(value~age)
anova(c,b)

and the corresponding output is

Model 1: value ~ age
Model 2: value ~ size + office + age + ratio + location
  Res.Df     RSS Df Sum of Sq      F Pr(>F)
1     36 1332.76
2     32 1149.20  4    183.56 1.2778 0.2992
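The hand calculation in Steps 1-5 can be mirrored directly. A sketch (ours), extracting the residual sums of squares from the two fits b and c above:

rss.full <- sum(resid(b)^2)     # about 1149.3, i.e. 32 * 5.993^2
rss.red  <- sum(resid(c)^2)     # about 1332.5, i.e. 36 * 6.084^2
f <- ((rss.red - rss.full)/4)/(rss.full/32)   # the discrepancy measure
1 - pf(f, 4, 32)                # the p-value, about 0.30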

Note that in the function anova( , ) we put the reduced model first.

3. In the summary output from fitting with lm(), the last line is an F test for the hypothesis β1 = β2 = … = βp = 0. In words, if this hypothesis is true, none of the explanatory variates is important in explaining variation in the response variate. In Example 1, for the model with the single explanatory variate age, the output gives the F ratio F = 20.82 with a p-value that is very small, so there is very strong evidence that one or more of the coefficients differ from 0.

Example 2
There are six indicator variables to index the product version and one other explanatory variate, pst.score. The sum of the vectors corresponding to the six indicators is x1 + … + x6 = 1, so if we include an intercept term, the columns of X are linearly dependent; the full model therefore has no intercept term. We fit the full model with the code

b<-lm(sat.score~-1+x1+x2+x3+x4+x5+x6+pst.score)

Note that the -1 tells R not to include a constant term (i.e. to fit a model without β0). We are interested in the hypothesis β1 = β2 = … = β6, which corresponds to all versions being the same. Next we fit the reduced model with β1 = β2 = … = β6 = β. Since x1 + … + x6 = 1, if this hypothesis is true, the model becomes one with a constant term and the single explanatory variate pst.score:

c<-lm(sat.score~pst.score)

To test the hypothesis, we calculate the additional sum of squares with

anova(c,b)

The output compares

Model 1: sat.score ~ pst.score
Model 2: sat.score ~ -1 + x1 + x2 + x3 + x4 + x5 + x6 + pst.score

and gives the additional sum of squares with 5 and 41 degrees of freedom. Since the F ratio is so large (p-value 0.001), there is strong evidence of differences among the 6 versions when pst.score is held fixed.

Exercises
1. Suppose we have a discrepancy measure with an F distribution with 3 and 30 degrees of freedom.
a. Find Pr(F ≥ 3)
b. Find a constant c so that Pr(F ≥ c) = 0.05
c. What is the distribution of 1/F?

2. In an industrial example, the manufacturer collects 60 observations to build a model to relate a product property y to two quantitative explanatory variates x1 and x2. Theory suggests that a linear model of the form y = β0 + β1x1 + β2x2 + r should describe the data. However, the analyst worries that additional second order terms of the form x1², x2², x1x2 should be included in the model. The data are stored in the file ch4exercise2.txt. Does the addition of the extra terms contribute significantly to the fit of the model? [Note: In R you can create new variables such as x22 <- x2*x2 to represent the quadratic terms.]

3. In the product testing example (Example 2 in Chapter 4), use an F test to address the following questions:
a. Is there any evidence of differences among the new versions 2 to 6?
b. Versions 4, 5 and 6 share a common feature. Is there any evidence that these versions have significantly different average satisfaction scores?

4. Consider again the product testing example described in Exercise 3, and consider the hypothesis that the coefficient β7 of the explanatory variate pst.score is 0. If we have a single parameter θ, we can test a hypothesis θ = 0 in two ways.
a. Explain how we can test the hypothesis using a t-test.
b. Explain how we can test the hypothesis using an F test.
c. Test the hypothesis in the two ways and show that the p-value is identical. [This is always true, although a nuisance to prove.]

5. Some theory
a. If t ~ tk, show that t² has an F distribution. What are the degrees of freedom?
b. In the construction of the F test, explain why the additional sum of squares is always non-negative.

6. Consider the model y = β0 1 + β1 x1 + … + βp xp + r.
a. Show that if we replace each vector xj by the vector x*j = xj − x̄j 1, the coefficients of the explanatory variates do not change. That is, the model becomes y = α0 1 + β1 x*1 + … + βp x*p + r.
b. In the revised model, show that x*j ⊥ 1 for all j.
c. Explain why testing the hypothesis β1 = β2 = … = βp = 0 will yield identical results for either formulation of the model.

d. In testing the hypothesis, show that the additional sum of squares is β̂*ᵀ(X*ᵀX*)β̂*, where β̂* = (β̂1, …, β̂p)ᵀ and X* = (x*1, …, x*p). This quantity is often called the regression sum of squares.

Chapter 5 Assessing Model Fit

To this point, we have built, fit and used a model for a given set of data without questioning any of the underlying assumptions. In this chapter, we examine the problem of model fit. Are the assumptions reasonably well met and, if not, what do we do about it? In fitting the model y = β0 1 + β1 x1 + … + βp xp + r and using the corresponding estimators to construct formal statistical procedures, we are making a number of assumptions about the underlying probability model

Y = β0 1 + β1 x1 + … + βp xp + R,  R ~ N(0, σ²I)

For example, we are assuming that:
• the mean vector E(Y) is the specified linear function of the explanatory variates
• the residuals are gaussian, independent, with constant standard deviation for each unit in the sample

We can assess these assumptions in several ways. In the exercises, we consider two situations that look at the first bullet. If we have units in the sample in which the explanatory variates are identical, we can use ANOVA to assess the fit. Also, we can add extra terms (squares, cross products etc.) to the proposed model and test if the additional terms have significant effects. If not, then we have greater confidence in the form of the mean function in the original model.

Looking at the Estimated Residuals
We also assess fit by looking for patterns that would be unusual if the model were "true". The estimated residuals are derived from the given model:

r̂ = y − μ̂ = y − Xβ̂

The corresponding estimator r̃ = Y − Xβ̃ = (I − H)R is a linear combination of the components of R and hence, according to the model, r̃ ~ N(0, σ²(I − H)). Recall that H = X(XᵀX)⁻¹Xᵀ depends only on X. We also know that r̃ and μ̃ are orthogonal and, according to the model, independent – see the exercises. If we plot the individual components, the estimated residual r̂i versus the fitted value μ̂i for i = 1, …, n, we should see a plot with no obvious patterns. If we find such patterns, we are suspicious about the assumptions underlying the model. This approach to assessing fit is informal and subjective – we need to be careful not to over-interpret the plots looking for patterns.

Example 1
Consider again the assessment data discussed in previous chapters and found in the file assessment.txt. If we fit a model with the 5 explanatory variates size, age, office, ratio and location to the measured value, we can create a plot of the estimated residuals versus the fitted values with the R code

b<-lm(value~size+age+ratio+office+location)
plot(fitted(b), resid(b))

Does this plot raise any suspicions about the proposed model? The answer is yes, since it would be surprising (assuming that the model is correct) if the two largest estimated residuals correspond to the two largest fitted values, as seen in the plot. The remedy here is to repeat the fitting and analysis with these cases removed to see if the conclusions are substantially affected. If they are influential, then we need to decide (not on a statistical basis) how to proceed. Otherwise, we can ignore the poor fit. In the example, we can delete cases 18 and 27 by editing the data frame a; change all the variate values for cases 18 and 27 to NA:

aa<-edit(a)
detach(a)
attach(aa)

Then refit the same model using the 36 units in aa. The plot of the estimated residuals versus the fitted values looks much better.

Note that age is still the only significant explanatory variate, but the estimated coefficient changes from −0.52 to −0.17, so the two cases are very influential if we want to predict the value of an unsold building. We make the decision to omit or include the two cases on non-statistical grounds. Here the decision was to proceed without these two sales, since they corresponded to buildings that were very different from the building in question.

We may also see a funnel shape on the plot of the estimated residuals versus the fitted values. This indicates that the standard deviation is not constant but is a function of the mean μ(x).

Example 2
Here is an artificial example to demonstrate the problem and the remedy. The 50 observations, stored in the file ch5example2.txt, were created from the model

Y = (2 + 3x1 − 2x2)(3 + R)

where x1 and x2 are uniform on the interval (0,1) and R ~ G(0,1), independently. In this model, the standard deviation of Y depends on the mean: E(Y) = 3(2 + 3x1 − 2x2) and stdev(Y) = |2 + 3x1 − 2x2|. When we fit the linear model lm(y~x1+x2), the plot of the estimated residuals versus the fitted values (see the left panel on the next page) shows the classic funnel shape, indicating that the standard deviation is not constant.
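The construction of this artificial data set can be reproduced. A sketch (ours), following the recipe above; the seed is arbitrary since the original one is unknown:

set.seed(1)
x1 <- runif(50); x2 <- runif(50)
y <- (2 + 3*x1 - 2*x2)*(3 + rnorm(50))  # if any y <= 0 by chance, redraw
b1 <- lm(y ~ x1 + x2)
plot(fitted(b1), resid(b1))             # classic funnel shape
b2 <- lm(log(y) ~ x1 + x2)
plot(fitted(b2), resid(b2))             # funnel removed on the log scale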

The remedy is to transform the response variate. That is, we fit a model of the form

f(y) = β0 + β1x1 + β2x2 + r

Standard choices for the transformation f include the functions log(y), √y and 1/y. The right-hand panel shows the plot of the estimated residuals versus the fitted values after applying the log transformation. After taking logarithms, the funnel effect is removed, in spite of the fact that we know that the original model is not linear in x1 and x2 on this scale.

The plot of the estimated residuals versus the fitted values is the single most useful diagnostic tool. We can also plot the estimated residuals versus the explanatory variates and, if the form of the model is adequate, we expect to see no patterns on these plots. There are many other such diagnostic plots.

We can examine the gaussian assumption with a quantile-quantile (usually abbreviated qq) plot. Under the model assumption we have r̃ ~ N(0, σ²(I − H)), or r̃i ~ G(0, σ√(1 − hii)), where hii is the ith diagonal element of the hat matrix H = X(XᵗX)⁻¹Xᵗ. To make the standard deviation constant, we standardize and define the standardized residuals as

zi = r̂i / √(1 − hii)

If the gaussian assumption in the model is more or less correct, then we will see a straight line on the qq plot of the standardized residuals. Large deviations from the line indicate that the gaussian assumption is likely false. Again the usual remedy is to transform the response variate, or to delete cases with large (positive or negative) estimated residuals.
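Continuing the sketch of the artificial example, the refit on the log scale:

bl <- lm(log(y) ~ x1 + x2)
plot(fitted(bl), resid(bl))   # the funnel effect is removed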

For the assessment data with all 38 cases (and the edited file with cases 18 and 27 removed), the qq plots of the standardized residuals are shown in the left and right panels respectively. The plot on the left picks up to some degree the two exceptional cases with large estimated residuals; these residuals are larger than can be expected if the estimated residuals are a sample from a gaussian distribution. After removing these two cases, the plot on the right provides no such evidence against the gaussian assumption. Using R, with the results of lm() stored in b, we can calculate the standardized residuals and create the qq plot with the code

s <- resid(b)/sqrt(1-hatvalues(b))
qqnorm(s)

See Appendix 2 for the concepts underlying the qq plot.

Sensitivity (Case) Analysis
We use case analysis to determine if any of the units correspond to an "outlier". We call a case an outlier if the conclusion of the analysis is materially changed by omitting the unit from the data. Note that a case can have unusual values for the explanatory variates, the response variate or both. We show an extreme example of each type of outlier for a single explanatory variate on the next page.

We look for outliers in the space of the explanatory variates using the following argument. Recall that r̃i ~ G(0, σ√(1 − hii)), where hii is the ith diagonal element of the hat matrix H = X(XᵗX)⁻¹Xᵗ. We call hii the leverage of the ith case.

If hii is close to 1, the standard deviation of r̃i is close to 0 and hence we know that r̂i will be close to 0. In other words, the fitted value μ̂i will be close to yi. Note that hii depends only on the explanatory variates, so a case has high leverage (hii close to 1) regardless of the observed response variate. Deletion of a case with high leverage relative to the others may change the fit of the model substantially.

We can extract the hii from any model fit b<-lm(y~…) using the R function hatvalues(b). For the assessment data (with the two cases deleted), the plot of the leverage versus the index (case number) is shown below. There are no exceptionally large values.
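A minimal sketch of the leverage plot; the reference line at twice the average leverage is a common rough cutoff, an assumption rather than a rule from the notes.

h <- hatvalues(b)
plot(h, xlab = "case number", ylab = "leverage")
abline(h = 2*mean(h), lty = 2)   # flag cases well above the average leverage (p+1)/n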

To look for outliers in the response variate, we compare yi to ŷ−i, the predicted value of y if we delete the ith case. We
• delete the ith row uiᵗ from the matrix X to get X−i and refit the model to get parameter estimates β̂−i and σ̂−i
• calculate the predicted value ŷ−i = uiᵗβ̂−i
• calculate a t-statistic ti = (yi − ŷ−i) / (σ̂−i √(1 + uiᵗ(X−iᵗX−i)⁻¹ui))
If any ti is large (i.e. bigger than 2.5), then we know the response variate of the corresponding case is an outlier.

The calculation of ti looks formidable. We are saved by a remarkable formula – see the exercises on the rank 1 update – which gives an explicit formula for (X−iᵗX−i)⁻¹ in terms of (XᵗX)⁻¹. Using this formula, we can rewrite ti as

ti = si √( (n − p − 2) / (n − p − 1 − si²) )

a monotone function (as si increases, ti increases) of the studentized residual

si = r̂i / (σ̂ √(1 − hii))

A large studentized residual (say greater than 2.5) corresponds to a large value of ti, which in turn corresponds to an outlier in the response variate. We can calculate the ti for the model b<-lm(…) using the R function rstudent(b). For the above assessment data, the plots of the studentized residuals [plot(rstudent(b))] against the index number show two exceptional values for the unedited data (left) and none for the edited data (right). Note the difference in vertical and horizontal scales.
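A sketch, not from the notes, that verifies the monotone relation numerically for a fitted model b:

s <- resid(b)/(summary(b)$sigma * sqrt(1 - hatvalues(b)))   # studentized residuals s_i
n <- nobs(b); p <- length(coef(b)) - 1
t_i <- s * sqrt((n - p - 2)/(n - p - 1 - s^2))
max(abs(t_i - rstudent(b)))   # should be essentially 0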

We conclude that there are no other cases that (singly) are highly influential. Note that
• there are many other ways to measure the influence of single cases
• we have looked at cases one at a time, not in groups, so there may still be highly influential small groups of cases. This issue is beyond the scope of the course.

In summary, if we find influential cases, we should repeat the analysis with these cases deleted and see if the Conclusion is materially changed. Changes in the fitted model that do not affect the Conclusion are not important. Do not forget that the Conclusion is driven by the original Problem.

Exercises
1. Consider the assessment data with the simple model value = β0 + β1 age + β2 size + residual. Use the methods in this chapter to assess the fit of the model and to suggest remedies. Is the prediction of value for a building with size 13.9 and age 30 sensitive to any particular cases?

2. In an experimental Plan, there were three explanatory variates x1, x2, x3 that were each assigned two values, here coded as -1 and +1. There are 8 combinations. As well, the investigators looked at the response variate for the so-called center point x1 = 0, x2 = 0, x3 = 0. The data are shown below and can be found in the file ch5Exercise2.txt.

 x1  x2  x3      y
 -1  -1   1  11.89
 -1   1  -1  12.21
 -1   1   1  17.97
  1  -1  -1  11.57
  1  -1   1  11.34
  1   1  -1  17.30
  1   1   1  11.40
  0   0   0  11.87
  0   0   0   9.45
  0   0   0   7.54
  0   0   0

Suppose we fit a model y = β0 + β1x1 + β2x2 + β3x3 + r. The summary output from R is

Call:
lm(formula = y ~ x1 + x2 + x3)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  11.4419     0.2372  48.218 2.87e-10 ***
x1            2.6181     0.2527  10.357 1.70e-05 ***
x2           -3.1071     0.2527 -12.296 5.68e-05 ***
x3            0.1360     0.4654   0.292 0.119
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 0.7649 on 7 degrees of freedom
Multiple R-Squared: 0.969, Adjusted R-squared: 0.9558
F-statistic: 73 on 3 and 7 DF, p-value: 1.203e-05

We drop x3 from the model. To assess the fit of this model, consider two formal approaches.
a) Add quadratic terms x1², x1x2, x2² to the model and then test the hypothesis that the additional terms are unnecessary.
b) Consider an extended model in which the mean of Y is a function μ(x1, x2) with no further specification. Show that the residual sum of squares from fitting this model is Σᵢ Σⱼ (yij − ȳi)², where i indexes the unique sets of explanatory variate values and j indexes the replicated observations within these sets. This is called a "pure residual" test of fit. If the model is correct, use the additional residual sum of squares to test the hypothesis that the extended model is necessary for the given data.

3. Consider the data described in Chapter 3 (the file is trial.txt), in which a marketing firm wanted to compare two sales promotions against a control. The response variate is the weekly sales and there are four explanatory variates, two of which index the promotion used.
a) After fitting the full model, is there any evidence of lack of fit?
b) Suppose the primary question is to compare the two promotions adjusting for past and competitor's sales. Are there any cases that have a large influence on the conclusion about this comparison?

4. If the model is correct, show that Cov(r̃, μ̃) = 0 and hence that μ̃ and r̃ are independent. What does this suggest about the plot of the estimates r̂i versus μ̂i?

5. We give the basic mathematics behind the arithmetic that we use for the calculations when deleting a single case. The key step is to find an expression for the inverse of X−1ᵗX−1, where X−1 is the matrix X with the first row u1ᵗ omitted.
a) Suppose u and v are two n×1 column vectors and A = I + vuᵗ. Find the constant a so that (I + vuᵗ)⁻¹ = I + avuᵗ. [This is known as a rank one update.]
b) If C = B + uuᵗ where B is invertible, find an expression for C⁻¹.
c) Suppose we consider dropping the first case when fitting the model y = Xβ + r. Show that XᵗX = X−1ᵗX−1 + u1u1ᵗ and hence find an expression for (X−1ᵗX−1)⁻¹ in terms of (XᵗX)⁻¹.

Chapter 6 Model Building

In many applications, we have a number of explanatory variates that we can choose to include in or delete from the model. For example, if the problem is to
• predict the value of the response variate for a given set of values for the explanatory variates
• assess the effect of a particular explanatory variate (or variates) on the response variate when controlling for a number of other explanatory variates
we may or may not include some of the explanatory variates in the model. If we include unnecessary terms, we add to the model complexity and we can also distort the conclusion. We want to use the data to decide which variates to include or delete. This decision is important if we want to get a final model that is as simple and useful as possible and fits the data well. We consider three strategies:

1. (Forward selection) Start with the simple one-variate models and select the one that best explains the variation in the response variate, i.e. the model with the highest value of R² or, equivalently, the model with the largest F-ratio. Then add a second variate to the model that maximizes the increase in R² and has a coefficient significantly different from 0. Continue until we can find no more important variates to add.

2. (Backwards elimination) Start by fitting the full model. If any coefficient is judged not significantly different from 0, based on the p-value for the test of the hypothesis that the corresponding coefficient is 0, leave out the least important variate. Keep deleting until all of the included variates have coefficients significantly different from 0.

3. (All regressions) Fit all possible models – there are 2^p − 1 if we have p explanatory variates. Select two or three of each size and then use a criterion that balances the value of R² against the addition of extra variates to pick the "best" model.

You might expect that strategies 1 and 2 would get the same answer, but this is not always the case. We create an artificial example to demonstrate this point.

Example 1
We create the data with the code:

u1<-rnorm(100); u2<-rnorm(100); u3<-rnorm(100); u4<-rnorm(100); r<-rnorm(100)
x1<-u1; x2<-u1+u2; x3<-u1+2*u2-u3; x4<-u1+u4
y<-x1+1.2*x2-0.5*x3+2*r

Note that the second line introduces relationships among x1, x2, x3 and x4; the corresponding vectors are not close to orthogonal. The data are stored in ch6example1.txt.
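As an aside, a hedged sketch (not in the notes): R's built-in step() function automates a similar search to strategies 1 and 2, although it uses AIC rather than p-value cutoffs to decide when to stop.

full <- lm(y ~ x1 + x2 + x3 + x4)
step(lm(y ~ 1), scope = formula(full), direction = "forward")   # analogue of strategy 1
step(full, direction = "backward")                              # analogue of strategy 2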

We start by fitting all of the one-variate models and picking the one with a significant coefficient (p-value less than 10%) and the highest value of R². The results are

Model      Significant variates   R²
lm(y~x1)   x1                     0.4547
lm(y~x2)   x2                     0.2491
lm(y~x3)   x3                     0.0408
lm(y~x4)   x4                     0.2439

We select x1 and proceed by fitting the 3 two-variate models that include x1.

Model         Significant variates   R²
lm(y~x1+x2)   x1                     0.4567
lm(y~x1+x3)   x1                     0.4550
lm(y~x1+x4)   x1                     0.4560

According to strategy 1, since the added variate is not significant in any of these models, we stop and the selected model includes only x1.

Now using strategy 2, we start with the four-variate model and work to eliminate variates. In the full model we have x1, x2, x3 with coefficients judged significantly different from 0 (p-value < 10%) and R² = 0.4744. We drop x4 and fit the three-variate model, in which we find all coefficients significant and R² = 0.4743. According to strategy 2, we stop and select the three-variate model that includes x1, x2 and x3.

There are many other similar strategies, often called stepwise procedures, that involve adding or eliminating variates based on testing the significance of their coefficients in a given fit. The reason that we get different answers from such procedures is largely due to the correlation among the explanatory variates.

If the vectors 1, x1, ..., xp are mutually orthogonal, then XᵗX is a diagonal matrix with entries xjᵗxj, and (XᵗX)⁻¹ is also diagonal with entries 1/(xjᵗxj). The estimates of the coefficients β̂j = xjᵗy/(xjᵗxj) and the corresponding estimators β̃j ~ G(βj, σ/√(xjᵗxj)) depend only on xj. If we fit any sub-model (i.e. leave out some of the explanatory variates), β̂j and β̃j do not change; in testing the significance of the coefficients, only the estimate of σ and the degrees of freedom change from the full model.

The extreme opposite of orthogonality occurs if the vectors 1, x1, ..., xp are linearly dependent or collinear. In this case we have trouble with least squares, since XᵗX is singular and there are many models that give the same minimum value for the sum of

squares of the estimated residuals. The closer the explanatory variates are to collinear, the more difficult is our problem to select the "best" model.

We want to select a model that explains a large portion of the variation in the response variate. Recall that R² must increase when we add additional variates to the model, because the sum of squares of the estimated residuals must get smaller. On the other hand, we want a simple model, and adding explanatory variates increases the complexity. We pick a criterion that balances these two requirements. With the invention of clever algorithms that can fit all possible models without re-inversion of large matrices, strategy 3 is the preferred approach. The only difficulty is to specify a criterion for choosing among all of the models.

Suppose that we have p + 1 terms in the full model (p explanatory variates plus a constant) and k + 1 terms in a sub-model. Then two criteria that balance the requirements are based on:

• σ̂²(sub-model) = (estimated residual sum of squares) / (n − (k + 1)). Note that both the numerator and denominator decrease as we add terms to the model. We want this quantity to be small.
• (estimated residual sum of squares) + ck, where c > 0 is chosen for calibration. Again we want this quantity to be small.

We specify the first criterion by the so-called adjusted R² for a given sub-model:

R²adj = 1 − σ̂²(sub-model) / σ̂²(null model)

where σ̂²(null model) = Σᵢ (yi − ȳ)²/(n − 1) is the estimate of σ² if none of the explanatory variates are included in the model. We want R²adj to be large (i.e. σ̂²(sub-model) to be small). Note that R² has the same form, with sums of squares of estimated residuals in the numerator and denominator.

We specify the second criterion using Mallows' "cp" statistic for a given sub-model, where

cp = (estimated residual sum of squares)/σ̂² + 2(k + 1) − n

The denominator σ̂² is the estimate of σ² from the fit of the full model. The constants are chosen so that cp ≤ k + 1 for good sub-models with k explanatory variates and a constant term.
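A minimal sketch, not from the notes, that computes both criteria by hand for a sub-model fit bsub given the full-model fit bfull:

cp_adjr2 <- function(bsub, bfull) {
  n       <- nobs(bfull)
  rss_sub <- sum(resid(bsub)^2)                     # estimated residual sum of squares
  sigma2  <- summary(bfull)$sigma^2                 # estimate of sigma^2 from the full fit
  k1      <- length(coef(bsub))                     # k + 1 terms in the sub-model
  y       <- model.response(model.frame(bfull))
  cp      <- rss_sub / sigma2 + 2 * k1 - n
  adjr2   <- 1 - (rss_sub / (n - k1)) / var(y)      # var(y) is the null-model estimate
  c(cp = cp, adjr2 = adjr2)
}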

We can use the package leaps in R to evaluate these criteria for a large number of sub-models. We start with the command

library(leaps)

This loads a set of functions that we need to fit a large number of models simultaneously. If you have not downloaded the leaps package, use the Packages menu to get it. We use the artificial example to demonstrate the calculations. The R code is

e<-regsubsets(y~x1+x2+x3+x4, data=a, nbest=2)   # finds the best 2 submodels of each size
f<-summary(e)                                   # extracts useful information from e
detach(a)
attach(f)
cbind(which,cp,adjr2)                           # gets model, cp and adjusted R-squared

The output is

  (Intercept) x1 x2 x3 x4        cp     adjr2
1           1  1  0  0  0  2.828073 0.4491371
1           1  0  1  0  0 39.566221 0.2413939
2           1  1  1  0  0  4.211471 0.4454612
2           1  1  0  0  1  4.737843 0.4447408
3           1  1  1  1  0  3.032519 0.4578213
3           1  1  1  0  1  6.045790 0.4407583
4           1  1  1  1  1  5.000000 0.4523016

Note that
• the two best models of each size are chosen based on cp
• for the full model, cp = p + 1 by definition
• in the example, the best model includes x1, x2 and x3, with cp = 3.03 ≈ 3 and R²adj = 0.458

If we fit the model with explanatory variates x1, x2 and x3, we get the summary output

Residual standard error: 2.056 on 96 degrees of freedom
Multiple R-Squared: 0.4743, Adjusted R-squared: 0.4578
F-statistic: 28.87 on 3 and 96 DF, p-value: 2.171e-13

All three coefficients are significantly different from 0. Just for fun, note that the selected model and parameter estimates match the "true" model and parameter values used to generate the data. This was likely fortunate, since there are other "good" models that we might have selected.

We end with another market value assessment example, this time using data from house sales, with the object of predicting the market value for other homes in the same region.

Example 2
The sample is 100 homes that have been sold in a given region. For each home, the data are:
size: the size in m²
baths: the number of bathrooms – bathrooms with only a basin and toilet count ½
rooms: the number of rooms at or above ground level
age: age in years
lotsize: size of the lot in m²
basement: whether or not the basement is substantially finished
garage: whether or not the house has a garage
value: the selling price in $000
The data are stored in the file mkvalue.txt.

We start the analysis by fitting a full model and examining how well this model fits. Below we give a summary of the fit and
• a plot of the estimated residuals versus the fitted values
• a qq plot of the standardized residuals
• a plot of the leverages, hii, the diagonal elements of H
• a plot of the studentized residuals

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -65.57860   39.73867  -1.650 0.098321 .
size          1.16106    0.21994   5.279 1.57e-06 ***
stories      17.47105    9.81162   1.781 0.076281 .
baths        15.65047    8.92366   1.754 0.082105 .
rooms         7.04331    3.91862   1.797 0.063970 .
age          -1.78717    0.42505  -4.205 0.000345 ***
lotsize       0.36646    0.04650   7.881 2.86e-10 ***
basement      1.50462    9.11014   0.165 0.869185
garage      -42.11920   15.04800  -2.799 0.006253 **

We continue in any case.1 ` ' 1 Residual standard error: 41.' 0. 2009 VI-6 .13 < 5 . age. we have Radj 2 = 0.2e-16 The relatively low value of R2 = 0. There is no apparent lack of fit or influential cases as demonstrated by the four plots.14 on 8 and 91 DF.05 `. lotsize and garage. We can refit this simpler model and assess the fit as above.799 0. University of Waterloo. We now search for a simpler model by looking at the best 2 of all possible models.165 0. MacKay.001 `**' 0.50462 9. Adjusted R-squared: 0.6704. For this model. Cp = 4.006253 ** --Signif. p-value: < 2.01 `*' 0.11014 0. codes: 0 `***' 0. The output of regsubsets( ) is presented on the next page There are several attractive models but the best (simplest) includes size.19 on 91 degrees of freedom Multiple R-Squared: 0.67 indicates that we are likely to have large prediction error.6414 F-statistic: 23. Stat 371 © R.basement 1.04650 -2.869185 garage -42.11920 15.J.644.

  (Intercept) size stories baths rooms age lotsize basement garage    cp adjr2
1           1    1       0     0     0   0       0        0      0 27.55 0.549
1           1    0       0     0     1   0       0        0      0 90.15 0.317
2           1    1       0     0     0   1       0        0      0 16.63 0.593
2           1    1       1     0     0   0       0        0      0 21.86 0.572
3           1    1       0     0     0   1       1        0      0  7.32 0.625
3           1    1       0     0     0   1       0        0      1  9.06 0.618
4           1    1       0     0     0   1       1        0      1  4.13 0.644
4           1    1       1     0     0   1       1        0      0  6.32 0.641
5           1    1       1     0     0   1       1        0      1  5.02 0.645
5           1    1       0     0     1   1       1        0      1  5.09 0.645
6           1    1       1     1     0   1       1        0      1  6.09 0.648
6           1    1       1     0     1   1       1        0      1  7.00 0.648
7           1    1       1     1     1   1       1        0      1  8.05 0.651
7           1    1       1     1     0   1       1        1      1  9.15 0.650
8           1    1       1     1     1   1       1        1      1  9.00 0.641

Exercises
1. Show that cp = p + 1 for the full model that includes all p explanatory variates.
2. Suppose the columns of X are orthogonal. Show that the estimate of βj, the coefficient of xj, is not dependent on which columns of X are included in the model.
3. The file ch6exercise3.txt contains a response variate y and 10 explanatory variates x1, ..., x10 for 100 cases. These data were created artificially for practice. The model used to generate the data was Y = 3x1 + 0.3x2 − 2x4 + x7 − x9 + R, R ~ G(0,2). Note that the columns of X are not orthogonal.
a) Fit a model using forward selection. At each step, use a p-value of 0.05 to decide to proceed.
b) Fit a model using backwards selection, using a p-value of 0.05 to decide to proceed at each step.
c) Use leaps to investigate all possible models. Pick a reasonable model.
d) How do the results of the three strategies compare in this case?

Chapter 7 Sample Survey Issues

In the second half of the course, we consider the planning and analysis of simple sample surveys. In this chapter, we deal with
• the language of sample surveys
• examples of sampling protocols
• classification of error (the difference between the estimate and the attribute)
• assessment of error

Sample surveys are widely used to estimate attributes of interest in a specified target population. The survey can be one-time only and informal (e.g. the daily poll on the Netscape home page http://www.netscape.com/) or highly complex and regular (e.g. the Canadian Labour Force Survey that estimates unemployment rates across Canada on a month to month basis; see http://www.statcan.ca/english/sdds/3701.htm). Surveys are used to estimate attributes of human populations, as in the above examples, and also of any other collection of objects such as financial records. For the most part, we follow the book "Sampling: Design and Analysis" by S.L. Lohr; we will cover most of the material in Chapters 1-4. There are multiple copies of the book on reserve (UWD1510) in the Davis Centre Library. See the reference list attached to the course outline.

A census is an investigation of a population where we try to examine every unit. The reasons for using a sample survey rather than a census of the target population to learn about attributes are
• cost
• timeliness
• ethical issues relating to efficient use of resources
• the improved quality of the estimates available from a carefully conducted survey rather than a sloppy census

We use some specialized language to describe survey methodology within the PPDAC framework.

Example
In the Labour Force Survey (the quoted material is from the above web site), the units are defined as: "LFS covers the civilian, non-institutionalised population 15 years of age and over. Excluded from the survey's coverage are residents of the Yukon, Northwest Territories and Nunavut, persons living on Indian Reserves, full-time members of the Canadian Armed Forces and inmates of institutions. These groups together represent an exclusion of less than 2% of the population aged 15 and over."

Note that we often select units in clusters to implement a sampling protocol. For formal surveys, we concentrate on the sampling protocol.

The sampling protocol does not choose units (people who meet the inclusion criteria) directly. Instead, a sample of households is selected and then variates are measured on every appropriate unit in the selected households. We call the households the sampling units. Formal surveys have a frame. The frame is the list of sampling units on which the sampling protocol operates, and it defines the study population. Note that informal surveys such as the Netscape poll do not use a frame, since the units are self-selecting; in this instance, the study population is only vaguely specified. Developing a good frame (one which covers the target population) is often one of the most expensive components of conducting the survey.

A probability sampling protocol uses a probability distribution to select the sample from the frame. More formally, if the frame is denoted by U = {1, 2, ..., N}, then a probability sampling protocol assigns a probability to every subset of U, and the sample is selected according to this distribution.

Example
Suppose an auditor has a file of 1220 records and plans to select a sample of 20 records to examine the quality of the file. The auditor decides to use simple random sampling (SRS), a protocol in which all samples of size 20 have the same probability of selection. Here we write the frame as U = {1, 2, ..., 1220} and, for any subset S of U, we have

Pr(S) = 1/C(1220, 20) if S has size 20, and 0 otherwise

where C(1220, 20) is the number of subsets of size 20. We will look at ways to implement such a protocol later.

Example
For the Labour Force Survey, the sampling protocol is described as: "The LFS uses a probability sample that is based on a stratified multi-stage design. Each province is divided into large geographic strata. The first stage of sampling consists of selecting smaller geographic areas, called clusters, from within each stratum. The second stage of sampling consists of selecting households from within each selected cluster." In this survey, there are separate frames for each stage of the sampling. One frame is a list of clusters within each geographic stratum; the second frame is a list of households within each selected cluster.
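A minimal sketch, not from the notes, of the auditor's protocol in R:

frame <- 1:1220
s <- sample(frame, 20)   # a simple random sample of 20 record numbers
choose(1220, 20)         # each such sample has probability 1/choose(1220, 20)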

"Since July 1995, the monthly LFS sample size has been approximately 54,000 households, resulting in the collection of labour market information for approximately 100,000 individuals."

There are many non-probability sampling protocols, usually with little control. Some examples are:
• convenience sampling – take what you can get, e.g. a survey of people in a mall by a marketing firm
• self-selection sampling – units choose themselves, e.g. many internet polls
• quota sampling – units are selected so that some attributes of the sample match known attributes in the target population, e.g. in a marketing survey, each interviewer is directed to find a sample whose attributes match the local population in terms of age, income profile and gender
• judgment sampling – units are selected so that the samplers think that the sample will be representative of the target population (i.e. match the target population with respect to the attributes of interest)

We concentrate on formal surveys that use probability sampling protocols, since we can use mathematical tools to assess the error that occurs in drawing conclusions about the target population from the sample. This is a major advantage of these sampling protocols.

Errors
In applying PPDAC to estimate population attributes, we can classify errors as:

Study error: the difference in attributes of interest between the target and study population. In the context of sample surveys, study error is called frame error. The attributes of the units listed in the frame may not match those of the target population.

Sample error: the difference in attributes of interest between the study population and the sample. For surveys of human populations, an important component of sample error is nonresponse error. Suppose the attributes of interest in the respondent and non-respondent populations are different. Then the sample attributes may not match those in the frame, because one or more units in the sample may have refused to provide data.

Measurement error: the difference in the attributes of interest due to the difference between the true and measured values of the variates on the units in the sample.

If we divide the frame into those units that would respond and those that would not, we can see the effect of non-response error.

[Diagram: the frame divided into respondents and non-respondents, showing the intended sample and the smaller actual sample]

The actual sample is different from the intended sample, and the attributes in the actual sample may not match those in the frame. The proper design of the questionnaire and a good plan for its administration can substantially reduce non-response error.

Measurement error may occur because of systematic differences in interviewers, recall error, error due to misunderstanding the question and so on. Interviewers may influence the responses by using a different protocol for asking the questions; people in the sample may lie, forget or modify their answers to please the interviewer. Measurement error may also occur if the question we pose does not match the question used to define the response variate in the target population.

We use the probability model that generated the sample to describe how the sample attributes would behave if we were to repeat the same sampling protocol over and over. We have all seen statements such as "19 times out of 20, a sample of this size is accurate to within 3 percentage points" at the bottom of the conclusions from a survey. This confidence interval captures the uncertainty due to a component of the sample and measurement error. The confidence interval does not capture uncertainty due to frame error, non-response, systematic errors in the sampling protocol and measurement system etc. We can control these latter sources of error only through good planning and execution of the survey.

Questionnaire Design
Here is a brief set of considerations in designing the instrument (the questionnaire) for the survey of a human population. This is a very complex subject and there are many books and papers written on questionnaire design. If you are involved in an important survey, hire an expert. The following list is adapted from Lohr, pages 10-15.

• Decide what you want to find out (understand the Problem)
• Keep the questions clear and simple
• Use specific instead of general questions
• Decide whether to use open-ended or closed questions
• Ask only one concept in each question
• Use forced choice rather than agree/disagree questions
• Avoid leading questions and contexts
• Relate each question to your objective – what will you do with the data?
• Keep the questionnaire short
• Explain the purpose of the survey
• Ensure confidentiality
• Pay attention to question-order effects
• Test your questions before the survey
• Plan to report the actual questions used

Chapter 8 Probability Sampling

Formal surveys use probability sampling, a protocol that selects units for the sample based on a probability model on subsets of the frame. The major advantage of probability sampling is that the sampling protocol produces a statistical model that we can use to assess sample error, i.e. to generate confidence intervals and hypothesis tests for model parameters that represent attributes of interest in the study population. In this chapter, we first examine several probability sampling protocols and then look at simple random sampling (SRS) in detail.

Denote the frame by the set U = {1, 2, ..., N}, so that there are N units in the frame. Then a probability sampling protocol specifies the probability that the sample is s, for any subset s ⊂ U. We consider protocols where the sample size n is fixed, so that the only subsets with positive probability have n units. Here are some common sampling protocols, explained in terms of an example.

Example: Suppose N = 10,000 and n = 100.

Simple random sampling: all samples of size 100 have the same probability. There are C(10000, 100) possible samples, each with the same probability.

Stratified random sampling: divide the frame into sub-frames called strata, for example U1 = {1, ..., 1000}, ..., U10 = {9001, ..., 10000}. For each stratum, select a simple random sample of size 10. There are C(1000, 10)^10 possible samples, each with the same probability.

Cluster sampling: Divide the frame into clusters, for example C1 = {1, ..., 10}, C2 = {11, ..., 20}, ..., C1000 = {9991, ..., 10000}. Select 10 clusters using simple random sampling. The sample is the 100 units in the 10 selected clusters. There are C(1000, 10) possible samples, each with the same probability.

Systematic sampling: Define clusters with n = 100 units per cluster, C1 = {1, 101, ..., 9901}, C2 = {2, 102, ..., 9902}, ..., C100 = {100, 200, ..., 10000}. Use simple random sampling to select one cluster as the sample. There are 100 possible samples, each with the same probability. We call this protocol systematic because we can select the sample by choosing the first unit from {1, 2, ..., 100} at random and then taking every subsequent 100th unit.

Two-stage sampling: Select the sample in two stages.

Stage 1: Select two strata (here called primary units) from the 10 described in stratified random sampling above, using simple random sampling. There are C(10, 2) possible samples of primary units.
Stage 2: Select 50 units from each of the two selected primary units using simple random sampling. There are C(10, 2)·C(1000, 50)² possible samples, each with the same probability.

Note that complex surveys such as the Labour Force Survey use multi-stage sampling, with stratification in the primary stage and cluster sampling (the clusters are households) in the ultimate stage. One advantage of two-stage sampling is that we only need to build a frame at the second stage for those primary units selected in the first stage. You should be able to provide definitions of these sampling protocols in general.

Also note that we use SRS within each of the above protocols, so we need to understand the properties of this most important protocol.

Simple Random Sampling
For simple random sampling, we select n units from a frame of N units so that each sample of size n has the same probability of selection. Since there are C(N, n) possible samples, each has probability 1/C(N, n). Technically, this protocol is often called simple random sampling without replacement, since we do not allow the same unit to be included in the sample more than once. We use the shortened form of the name.

The inclusion probability pi for any unit i in the frame is the probability that the unit is included in the sample. For simple random sampling,

pi = C(N−1, n−1)/C(N, n) = n/N

Note that the numerator is the number of samples that contain the particular unit. In the above example, you can show that pi = 1/100 for each of the described sampling protocols – see the exercises. We cannot define simple random sampling by saying that every unit has the same chance of being included in the sample; as we saw in the above example, there are many sampling protocols with the same inclusion probabilities as SRS.
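A minimal sketch, not from the notes, implementing three of the protocols from the example (N = 10,000, n = 100):

N <- 10000; n <- 100
srs        <- sample(1:N, n)                       # simple random sampling
start      <- sample(1:100, 1)                     # systematic: random start,
systematic <- seq(start, N, by = 100)              #   then every 100th unit
strata     <- split(1:N, rep(1:10, each = 1000))   # stratified: SRS of 10
stratified <- unlist(lapply(strata, sample, size = 10))  #   from each stratum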

Let yi be the value of the response variate for unit i. Suppose that we are interested in estimating the average response in the target population. We denote this average in the frame (study population) by μ and the sample average by μ̂ = Σ_{i∈s} yi / n, where s is the selected sample. Similarly, we denote the standard deviation in the frame by σ and the sample standard deviation by

σ̂ = √( Σ_{i∈s} (yi − μ̂)² / (n−1) ) = √( Σ_{i∈s} r̂i² / (n−1) )

where r̂i = yi − μ̂ is the estimated residual. We do not build a response model here as we did in the earlier part of the course. Instead, we use the probability mechanism that generated the sample to look at the properties of the estimates if we were to repeat the sampling over and over. That is, we define the estimators

μ̃ = Σ_{i∈S} yi / n,   σ̃ = √( Σ_{i∈S} (yi − μ̃)² / (n−1) )

where S is a random subset with Pr(S = s) = 1/C(N, n) for every subset s ⊂ U of size n. It is convenient to re-express the estimators in terms of random variables rather than a random subset. Let

Ii = 1 if unit i is in the sample, and 0 otherwise, for i = 1, 2, ..., N

Then we can write the estimators in terms of the indicator random variables I1, ..., IN:

μ̃ = Σ_{i∈U} Ii yi / n,   σ̃ = √( Σ_{i∈U} Ii (yi − μ̃)² / (n−1) )

Note that the sums are over the entire frame U. We cannot calculate the exact distribution of the estimator μ̃; however, we can find many of its properties. For simple random sampling of n units from a frame of N units, we have the following important results:

E(μ̃) = μ,   stdev(μ̃) = √(1 − n/N) · σ/√n

We call μ̃ an unbiased estimator of μ since E(μ̃) = μ. To prove this statement, note that Pr(Ii = 1) = n/N, so E(Ii) = 0·(1 − n/N) + 1·(n/N) = n/N, and hence we have

E(μ̃) = E( Σ_{i∈U} Ii yi / n ) = Σ_{i∈U} E(Ii) yi / n = Σ_{i∈U} (n/N) yi / n = μ

To prove the formula for stdev(μ̃), we need the result from Stat 230 that, for any linear combination of dependent random variables V1, ..., Vn,

Var( Σ ai Vi ) = Σ ai² Var(Vi) + Σ_{i≠j} ai aj Cov(Vi, Vj)

where Cov(Vi, Vj) = E(Vi Vj) − E(Vi)E(Vj). Since Ii is an indicator random variable, we have

Var(Ii) = E(Ii²) − E(Ii)² = Pr(Ii = 1) − Pr(Ii = 1)² = (n/N)(1 − n/N)

To find the covariance of Ii and Ij, we need to find E[Ii Ij]. Since the product is 0 unless Ii = 1 and Ij = 1, we have

E(Ii Ij) = Pr(units i and j are both in the sample) = C(N−2, n−2)/C(N, n) = n(n−1)/(N(N−1))

so the covariance is

Cov(Ii, Ij) = E(Ii Ij) − E(Ii)E(Ij) = n(n−1)/(N(N−1)) − n²/N² = −(n/N)(1 − n/N)·1/(N−1)

Combining the above results, we have

Var(μ̃) = (1/n²){ Σ_{i∈U} yi² Var(Ii) + Σ_{i≠j∈U} yi yj Cov(Ii, Ij) }
        = (1/n²){ (n/N)(1 − n/N) Σ_{i∈U} yi² − (n/N)(1 − n/N)(1/(N−1)) Σ_{i≠j∈U} yi yj }
        = (1/n)(1 − n/N)(1/N){ Σ_{i∈U} yi² − Σ_{i≠j∈U} yi yj / (N−1) }

We can simplify the expression inside the braces with a bit of algebra:

Σ_{i∈U} yi² − (1/(N−1)) Σ_{i≠j∈U} yi yj
  = (1/(N−1)){ (N−1) Σ yi² − Σ_{i≠j} yi yj }
  = (1/(N−1)){ N Σ yi² − ( Σ yi² + Σ_{i≠j} yi yj ) }
  = (1/(N−1)){ N Σ yi² − ( Σ yi )² }
  = (1/(N−1)){ N Σ yi² − N²μ² }
  = (N/(N−1)){ Σ yi² − Nμ² }
  = Nσ²

where σ² = Σ_{i∈U} (yi − μ)²/(N−1) is the variance in the frame. So, at last, we have

Var(μ̃) = (1 − n/N) σ²/n  and  stdev(μ̃) = (1 − n/N)^(1/2) σ/√n

as required. The factor 1 − n/N = 1 − f is called the finite population correction factor (fpc), where f is the sampling fraction, the proportion of the population units included in the sample. The standard deviation of the estimator corresponding to the sample average is the usual standard deviation for an average, σ/√n, multiplied by the square root of the correction factor. The factor arises because the terms in the sum that defines the estimator are dependent. For many applications, the sampling fraction is negligible and we ignore the fpc. If the sampling fraction is appreciable, then there can be a significant reduction in the standard deviation. The formula for the standard deviation of an average is the basic result in the sampling part of the course; we use it repeatedly, so it is worthwhile learning it.

We can also show that E(σ̃²) = σ² – see the exercises.

To use the above results, we need one more fact, a version of the Central Limit Theorem for the average of a sequence of dependent random variables. Here we state one of the consequences of the theorem, avoiding all technicalities. If N, n and N − n are suitably large, then

√n (μ̃ − μ) / (σ√(1 − f)) ~ G(0, 1) approximately

We use this result to build confidence intervals for various attributes of interest, using the model generated by simple random sampling.
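A minimal sketch, not from the notes, of the resulting approximate confidence interval in R, including the fpc:

srs_ci <- function(y, N, conf = 0.95) {
  n     <- length(y)
  c_val <- qnorm(1 - (1 - conf)/2)
  se    <- sqrt(1 - n/N) * sd(y) / sqrt(n)   # standard error with fpc
  mean(y) + c(-1, 1) * c_val * se
}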

Example 1
An auditor has a file of stated counts and prices for N = 1256 items stored in a warehouse of a producer of small automotive parts. The total stated value was $4,311,712. The purposes of the sampling were to assess:
• the true total value of the inventory
• the average dollar error per item
• the proportion of items with counts in error
The auditor planned to use SRS to select a sample of item numbers and then physically count the actual number of those selected items. The auditor selected a simple random sample of n = 50 items and re-counted (actually a pair of co-op students did the counts). The first 10 lines of the data are shown below and the complete data set is available in the file inventory.txt.

item.number  price  stated.number  actual.number  stated.value  actual.value
          1   0.61           1335           1335        814.65        814.65
         25   1.13           1106           1106       1249.78       1249.78
         39   4.95            847            847       4192.95       4192.95
         53   1.50           1294           1294       1941.00       1941.00
         56   1.64           1192           1192       1954.88       1954.88
        121   2.04           1269           1269       2588.76       2588.76
        207  10.27           1016           1016      10434.32      10434.32
        212   3.24           1427           1419       4623.48       4597.56
        223   2.81           1529           1529       4296.49       4296.49
        225   2.48           1446           1446       3586.48       3586.48

Table 8.1 Part of the Sampled Data

The average actual value of the sampled items was $2895.29, with corresponding sample standard deviation $1997.22. The sample average dollar error (actual value – stated value) was $2.33, and the sample standard deviation of the dollar errors was $41.93. There were 11 items with count errors in the sample.

To estimate the total actual value τ, we start by estimating the population average μ and then use the fact that τ = Nμ. If yi is the actual value of item number i, then we estimate μ using the sample average μ̂ = 2895.29. To assess the precision of this estimate, we find an approximate 95% confidence interval based on the approximation to the distribution of μ̃ described above. The form of the interval is the same as usual,

μ̂ ± c × standard error(estimate)

where the standard error is the estimate of the standard deviation of the estimator μ̃. Here the standard error is

(σ̂/√n) (1 − n/N)^(1/2) = (1997.22/√50)(1 − 50/1256)^(1/2) = 276.77

For a 95% confidence interval, we use c = 1.96 from the standard gaussian (last row of the t tables). Substituting, the confidence interval for μ is 2895.29 ± 542.47. We are interested in the total actual value τ = Nμ, and we can get a 95% confidence interval for τ by multiplying the above interval by N = 1256. The interval is 3,636,484 ± 681,342. This interval is very wide, meaning we have estimated the total value of the inventory very imprecisely. Compared to the stated value of $4,311,712, it is difficult to assess whether or not there are material errors in the inventory. One possibility is to increase the sample size, but see below.

The average actual value is poorly estimated. The average error is much more precisely estimated, and we can exploit this result and the fact that we know the total stated value to get a better estimate of the total actual value. We use the same methodology to estimate the average error. The sample average error is μ̂_error = 2.33, with corresponding sample standard deviation 41.93. A 95% confidence interval for the average error is

μ̂_error ± c × standard error, where the standard error is √(1 − 50/1256) × 41.93/√50

Hence the 95% confidence interval for the average error is $2.33 ± $11.39.

Since the error was defined as the actual value minus the stated value, we estimate the true total value as the total stated value ($4,311,712) plus the total dollar error. The estimate of the total error is τ̂_error = Nμ̂_error = 1256 × 2.33 = $2926.48, so the 95% confidence interval for τ_error is 1256 × (2.33 ± 11.39), or 2926 ± 14306. A 95% confidence interval for the total true value of the inventory is then 4311712 + 2926 ± 14306, or $4,314,638 ± $14,306. The confidence limits are about 0.3% of the estimated total, so the difference between the stated and actual value of the inventory is likely immaterial. Note that, by exploiting the known information (the total stated value) and the fact that we can estimate the average error with much smaller standard error than the average actual value, we get a much more precise estimate of the total actual value. This procedure is called difference estimation.
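A sketch of the difference-estimation calculation, assuming inventory.txt has been read into a data frame inv with columns stated.value and actual.value (the object and column names are assumptions):

N     <- 1256
err   <- inv$actual.value - inv$stated.value
n     <- length(err)
se    <- sqrt(1 - n/N) * sd(err) / sqrt(n)
total <- 4311712 + N * mean(err)            # stated total plus estimated total error
total + c(-1, 1) * 1.96 * N * se            # 95% CI for the true total value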

To estimate the proportion of items with counts in error, let yi = 1 if the ith item is in error and yi = 0 otherwise. The attribute of interest is π = Σ_{i∈U} yi / N, the population average of the binary variate. We can use the same theory as above. The sample average is π̂ = 11/50 = 0.22 and, since yi² = yi for a binary variate, the sample standard deviation is

√( Σ_{i∈s}(yi − ȳ)²/49 ) = √( (Σ yi − 50π̂²)/49 ) = √( 50π̂(1 − π̂)/49 ) = 0.418

An approximate 95% confidence interval for π is

π̂ ± 1.96 √(1 − 50/1256) (0.418/√50), or 0.22 ± 0.11

There is considerable evidence that more than 10% of the item counts are in error, though the average error is likely to be small and the total error immaterial.

Notes
1. We use the same theory to get estimates and approximate confidence intervals for averages, totals and proportions. The form of the interval is always estimate ± c × standard error(estimate).
2. For a binary response variate, the sample standard deviation is a function of the sample average proportion: the sample standard deviation is √( nπ̂(1 − π̂)/(n − 1) ) ≈ √( π̂(1 − π̂) ). We usually ignore the factor n/(n − 1) when we apply this formula.
3. In the above example, the finite population correction factor 1 − f = 1 − 50/1256 = 0.96 played a very small role (especially as it enters the calculations as √(1 − f) = 0.98), and we could have safely ignored it.
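A minimal sketch of the proportion calculation in R, using the numbers from the example:

n <- 50; N <- 1256; pihat <- 11/50
se <- sqrt(1 - n/N) * sqrt(pihat * (1 - pihat) / (n - 1))
pihat + c(-1, 1) * 1.96 * se   # reproduces 0.22 +/- 0.11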

Example 2
To assess the quality of a shipment of 2000 cartons of headlights (packed 12 to a carton), a manufacturing organization decides to select a sample of 30 cartons using SRS for inspection. A headlight is declared defective if it fails to pass any one of a large number of tests. The attribute of interest is the proportion of headlights that are defective. Here we are using cluster sampling, with the clusters defined as the cartons. Note how we adapt the results from SRS to apply to cluster sampling. The data (number of defective items per sampled carton) are shown below.

0 0 0 1 0 0 1 3 2 4 2 2 0 0 0 0 0 1 0 0 0 0 1 1 2 1 0 3 0 0

The sample average and standard deviation are 0.80 and 1.13 respectively. If μ is the average number of defectives per carton, then the proportion of defective items in the population is

π = 2000μ/(12 × 2000) = μ/12

A 95% confidence interval for μ is μ̂ ± 1.96 (1 − 30/2000)^(1/2) σ̂/√30, or 0.80 ± 0.40. Hence a 95% confidence interval for π is 0.067 ± 0.033, or 6.7% ± 3.3%. The proportion of defective headlights is poorly estimated but is significantly larger than 0.

Sample Size Determination
We can use the same theory to answer the most common question in Statistics: "How large a sample do I need?" The obvious answer is "What is your objective?" It may take some effort to elicit a specific response but, with some guidance, you can determine the target population, the attributes of interest, a possible frame and the required precision for the estimates. Since we will select only one sample, we base the sample size determination on the attribute of primary interest. We suppose that we can state the precision in terms of the length of a confidence interval for this attribute. See the Exercises for another formulation of the precision in terms of relative error.

Suppose that we are interested in estimating a population average μ and we want the confidence interval to be of length 2l (i.e. the confidence interval should be μ̂ ± l). From the above results, assuming we use SRS or a close facsimile, we have

l = c √(1 − n/N) σ/√n

or, solving for the sample size,

n = 1 / ( l²/(c²σ²) + 1/N )

To determine the required sample size, we need to specify the confidence level to find c and, with more difficulty, guess the value of σ. If the second term in the denominator is much larger than the first, we can ignore the term 1/N and then the required sample size is approximately c²σ²/l². The answer is very sensitive to the value of σ. In other words, we often do not know enough to give a good answer to the question. Sometimes we can use the results of previous surveys with similar response variates on the same population to get an idea of the value of σ. Another way to get an idea of σ is to carry out a small pilot survey. We get an estimate of σ to help determine the sample size in the main survey, and we can also use the pilot study to test the questionnaire and the rest of the proposed methodology.

Example
In the audit example, suppose that the above description was a pilot survey and the overall goal was to estimate the average error with 95% confidence within plus or minus one dollar. How many more items do we need to include in the sample? Here we have l = 1, c = 1.96, N = 1256 and σ = 41.93 from the initial survey. We have

n = 1 / ( 1/(1.96² × 41.93²) + 1/1256 ) = 1059

Here, because σ = 41.93 is so large, we are forced to examine an extra 1009 items to achieve the desired precision. Since this is most of the frame, we would likely recommend a complete census.
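A sketch of the sample-size formula as an R function (not from the notes):

srs_n <- function(l, sigma, N, c_val = 1.96) {
  round(1 / (l^2 / (c_val^2 * sigma^2) + 1/N))
}
srs_n(l = 1, sigma = 41.93, N = 1256)   # reproduces n = 1059 from the audit example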

Example
A polling firm has been hired to conduct a cross-Canada survey to solicit opinions from adults on a number of issues. The primary question has a Yes/No answer, and the sample size is selected based on estimating the proportion π of adult Canadians who would answer Yes to the question. The client asks for a confidence interval of length 5 percentage points (0.05) with 99% confidence, so we have l = 0.025 and c = 2.57. The estimated standard deviation will be √( nπ̂(1 − π̂)/(n − 1) ) ≈ √( π̂(1 − π̂) ). The required sample size is

n = 1 / ( 0.025²/(2.57² π(1 − π)) + 1/N ) ≈ 2.57² π(1 − π)/0.025² = 10568 π(1 − π)

where we ignore the term 1/N since it is so small. Here the required sample size is bounded, because the function π(1 − π) has maximum value 0.25 when π = 0.5. We know that if we choose n = 10568 × 0.25 = 2642, we will meet the requirements. If we have a better idea of π from a pilot survey or elsewhere, we may be able to reduce the sample size from this upper bound.

Note that these sample size determinations do not take frame error, non-response error and other such errors into account.

When and How to Implement SRS
Here we briefly look at when SRS should be used and how to implement the sampling protocol. SRS is the simplest probability sampling protocol.
• To implement SRS, we need a frame for the target population of interest. Because of the difficulty of completing a frame, we may use cluster or multi-stage sampling instead. With cluster sampling the frame can be the list of clusters; with multi-stage sampling we can build the frame as we go.
• If the frame consists of a list of items or people, we can assign each unit a unique number from 1 to N and then use available software to select a sample of n units using simple random sampling. In R, the command s <- sample(x, n) selects a random sample of size n from the vector x and stores the result in s.
• We must be able to examine the selected units. For example, if the units are water heaters packed in cartons stored in large stacks, we can select the sample of identifiers using SRS, but we are unlikely to find someone willing to sort through the cartons to find the selected units.
• For many populations, it is more efficient (e.g. a shorter confidence interval with a smaller sample size) to stratify the population and use stratified random sampling. See Chapter 10.

Exercises
1. Consider the sampling protocols defined in Example 1.
a) Show that the inclusion probability for each unit in the frame is 1/100 for every protocol.
b) On a final examination, a student once defined simple random sampling as follows: "simple random sampling is a method of selecting units from a population so that every unit has the same chance of selection". Is this a correct answer?
c) Show that the estimator corresponding to the sample average μ̃ = Σ_{i∈S} yi / n is unbiased for μ for each of the protocols.

2. Consider the estimate σ̂² = Σ_{i∈s}(yi − ȳ)²/(n − 1) and the corresponding estimator σ̃².
a) For SRS, show that σ̃² is an unbiased estimator for σ². [Hint: Use the fact that Σ(yi − ȳ)² = Σ yi² − nȳ².]
b) Is σ̃ unbiased for σ?

3. To estimate the total number of male song sparrows in a 10 km by 10 km square (http://www.birdsontario.org/atlas/atlasmain.html) for a breeding bird atlas, a simple random sample of 50 one-hectare plots (a hectare is 100m by 100m) is selected. Using a GPS system, your intrepid instructor visits each of the selected plots (after dawn but before 9:00 am, between May 24 and July 6) and counts the number of singing male song sparrows detected in a 10 minute period. The data are summarized below.

# of sparrows   0  1  2  3  4
# of plots     28 13  5  3  1

a) Find a 95% confidence interval for the total number of male song sparrows in the square.
b) Suppose that I wanted to estimate the total number of male song sparrows to within ±1000 with 95% confidence. How many additional plots are needed?

4. Suppose we want to estimate a population average so that the relative precision is specified. That is, we want to find the sample size required (SRS) so that the length of the confidence interval 2l divided by the sample average is pre-determined.
a) For a given confidence level and required precision p%, find a formula for the required sample size.
b) What knowledge of the population attributes do we need to make this formula usable?

5. One cheap (but poor) way to check the quality of a batch of items is called acceptance sampling. Suppose that there are N = 1000 items in a shipment and you cannot tolerate more than 1% defective (your first mistake – why should you tolerate any defective items from your supplier?). You decide to select and inspect a sample of 20 items and accept the shipment if you find 0 defectives. If you find 1 or more defective items, you inspect the complete shipment.
a) How would you select the sample?
b) Calculate the probability p(π) that you accept the shipment as a function of π, the percentage of defective items in the shipment.

c) Graph p(π) for 0 ≤ π ≤ 10%.
d) Given the results in c), you decide to increase the sample size so that there is only a 5% chance of accepting a shipment with 1% defective. What sample size do you recommend?

Chapter 9 Ratio and Regression Estimation with SRS

In Chapter 8, we looked at assessing estimates of the frame average (or total) when the sampling protocol is SRS and the estimate is the sample average. In this chapter, we consider two related problems:
• estimating a ratio, such as the proportion or average response of a subpopulation (domain) with unknown size
• improving the sample average as an estimate of the frame average by using explanatory variates

Estimating a Ratio
Consider again the inventory example from the previous chapter. Suppose we want to estimate the average size of the error in those files that are in error. We can write this attribute as

θ = Σ_{i∈U} yi zi / Σ_{i∈U} zi = ( Σ_{i∈U} yi zi / N ) / ( Σ_{i∈U} zi / N ) = μ/π

where zi = 1 if the ith file is in error and 0 otherwise. Note that Σ_{i∈U} yi zi = Σ_{i∈U} yi here, because yi is the error in the ith account (and is 0 for files not in error). The parameter μ is the average error per file and π is the proportion of files in error.

We use the estimate θ̂ = μ̂/π̂ with corresponding estimator θ̃ = μ̃/π̃. The distinguishing feature is that both the numerator and the denominator will change if we were to repeat the sampling protocol over and over. To assess the estimate and produce confidence intervals for θ, we find the (approximate) distribution of θ̃ by finding its mean and variance and then using a gaussian approximation.

To derive the approximation, we use Taylor's theorem for a function of two variables. Recall that we can expand f(x, y) about the point (x0, y0) to get a linear approximation

f(x, y) ≈ f(x0, y0) + ∂f(x0, y0)/∂x (x − x0) + ∂f(x0, y0)/∂y (y − y0)

The linear function on the right has the same value and first partial derivatives as f(x, y) at the point (x0, y0). If f(x, y) = x/y, then we have ∂f(x0, y0)/∂x = 1/y0 and ∂f(x0, y0)/∂y = −x0/y0², so

x/y ≈ x0/y0 + (x − x0)/y0 − (x0/y0²)(y − y0)

Replacing (x, y) by the random variables (μ̃, π̃) and (x0, y0) by (μ, π), we have

μ̃/π̃ ≈ μ/π + (1/π)(μ̃ − μ) − (μ/π²)(π̃ − π)

For large sample sizes, the approximation is reasonable since we expect (μ̃, π̃) to be close to (μ, π). Hence we have

E(θ̃) ≈ μ/π + (1/π)E(μ̃ − μ) − (μ/π²)E(π̃ − π) = μ/π
Var(θ̃) ≈ (1/π²) Var(μ̃ − θπ̃)

The estimator is approximately unbiased (but see Exercise 1). Notice that μ̃ − θπ̃ is the sample average of r1,…,rn, where ri = yi − θzi. Using the basic formula for the variance of an average with SRS,

Var(μ̃ − θπ̃) = (1 − f) σr²/n

We can estimate this variance by the corresponding sample variance

((1 − f)/n) · Σ_{i∈s}(ri − r̄)²/(n − 1) = ((1 − f)/n) · Σ_{i∈s}[yi − ȳ − θ(zi − z̄)]²/(n − 1)
                                     ≈ ((1 − f)/n) · Σ_{i∈s}[yi − ȳ − θ̂(zi − z̄)]²/(n − 1)
                                     = ((1 − f)/n) · Σ_{i∈s}(yi − θ̂zi)²/(n − 1)

where we replace θ by its estimate θ̂ = ȳ/z̄ in the second line and f = n/N is the sampling fraction as usual. The estimate of the variance of the estimator θ̃ is then

V̂ar(θ̃) = (1/π̂²) · ((1 − f)/n) · Σ_{i∈s}(yi − θ̂zi)²/(n − 1)

Note that the last factor is the sample variance of the estimated residuals r̂1 = y1 − θ̂z1, …, r̂n = yn − θ̂zn.

To construct a confidence interval for θ, note that for large values of n and N the estimator is approximately gaussian, so the confidence interval has the standard form: estimate ± c × standard error. In the example – see the data file inventory.txt – we have θ̂ = μ̂/π̂ = 10.57. To find the estimate of the standard deviation, first calculate the standard deviation of r̂1,…,r̂n in R by creating the vector r <- y - theta_hat*z; then multiply by the factor √(1 − f)/(π̂√n). Here the standard error is 26.48, so a 95% confidence interval for θ is 10.57 ± 51.9 – we can estimate the average error in accounts with errors only very imprecisely. Also note that this confidence interval is wider than that for μ, because we also have uncertainty about the proportion of files in error.

We can use the same approach via Taylor's theorem to estimate any other function of variate averages in which we have interest.
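The whole calculation takes a few lines of R. A sketch, assuming the variates in inventory.txt are named y (the error in each sampled file) and z (the indicator that the file is in error), and writing NN for the number of files in the frame (take its value from Chapter 8):

a <- read.table('inventory.txt', header = T)
attach(a)
n <- length(y)
NN <- 1000                             # placeholder for the frame size
theta_hat <- mean(y) / mean(z)         # = mu_hat / pi_hat
r <- y - theta_hat * z                 # estimated residuals; their average is 0
se <- sqrt(1 - n / NN) * sd(r) / (mean(z) * sqrt(n))
theta_hat + c(-1, 1) * 1.96 * se       # approximate 95% confidence interval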

Ratio Estimation of the Average

Suppose the purpose of the survey is to estimate the study population average μ(y) for some variate y. Note the change in notation to include y explicitly in the definition of the attribute. In many surveys, there are other (explanatory) variates that can be measured on each unit in the sample and for which we have complete knowledge of their attributes in the population. When we get the sample, we can determine the values of the response variate y and the explanatory variates, say gender and age for each person in the sample. In many surveys of human populations, the demographics (gender ratio, age distribution etc.) of the population are known, perhaps from a census. In the inventory survey, we know the stated value and the stated number of items for each file in the population, and hence we can calculate population attributes for these variates. The idea of the methods discussed here is to adjust the sample average μ̂(y) based on differences between the sample and the (known) population attributes of the explanatory variates. For simplicity, we consider only one explanatory variate.

Example
In the assessment of a lot of 10000 incoming molded parts, a company selects a sample of 40 parts to check the average length of a critical dimension. From previous experience, they know that the dimension is related to part weight, so they measure the weight of each part in the sample and also the weight of the entire shipment. The population average weight is μ(x) = 33.10 grams, determined as the total weight (measured all at once) divided by the number of pieces N = 10000. The sample is collected haphazardly, since it is too expensive to create a frame. We develop the estimators and their properties assuming SRS – this corresponds to assuming that the haphazard sampling protocol mirrors SRS if the protocol is repeated over and over.

The sample data are included in the file molded.txt and are plotted below.

[Figure: scatterplot of length (micron) versus weight (g) for the 40 sampled parts. The plot shows a strong correlation between the length and the weight.]

The averages and standard deviations for the two variates are:

              sample average   sample st. dev.
x (weight)        33.24            0.691
y (length)        45.56            1.005

The ratio estimate of μ(y) is

μ̂(y)_ratio = (μ̂(y)/μ̂(x)) μ(x) = θ̂ μ(x)

where μ̂(x), μ̂(y) are the sample averages for x and y and θ̂ = μ̂(y)/μ̂(x). We use the results on the estimation of a ratio θ to derive an approximation for the mean and standard deviation of μ̃(y)_ratio.

E[μ̃(y)_ratio] = E[θ̃] μ(x) ≈ θ μ(x) = μ(y)
Var[μ̃(y)_ratio] = μ(x)² Var[θ̃] ≈ μ(x)² (1/μ(x)²) Var[μ̃(y) − θμ̃(x)] = Var[μ̃(y) − θμ̃(x)]

Using the results on the estimation of a ratio, we estimate the variance of μ̃(y)_ratio by

V̂ar[μ̂(y)_ratio] = ((1 − f)/n) · Σ_{i∈s} r̂i²/(n − 1) = ((1 − f)/n) · Σ_{i∈s}(yi − θ̂xi)²/(n − 1)

where r̂i = yi − θ̂xi as before.

In the example we have θ̂ = 45.56/33.24 = 1.371 and Σ_{i∈s}(yi − θ̂xi)²/(n − 1) = 0.147, so the ratio estimate of the population average is μ̂(y)_ratio = 1.371 × 33.10 = 45.37, with corresponding standard error √(0.147/40) = 0.060 (the fpc is negligible here). An approximate 95% confidence interval for the population average length based on the ratio estimate is 45.37 ± 0.12 microns.

Here the ratio estimate is more precise than the sample average μ̂(y) = 45.56, since that estimate gives a confidence interval (ignoring the fpc) of μ̂(y) ± 1.96 σ̂(y)/√40, or 45.56 ± 0.31. This gain in precision is the major advantage of the ratio estimate.

We can compare the estimated variance of μ̃(y)_ratio with that of μ̃(y), the estimator based on the sample average. Consider

V̂ar[μ̂(y)_ratio] = ((1 − f)/n) Σ_{i∈s}(yi − θ̂xi)²/(n − 1)   versus   V̂ar[μ̂(y)] = ((1 − f)/n) Σ_{i∈s}(yi − ȳ)²/(n − 1)

The ratio estimate is more precise (i.e. gives a shorter confidence interval) than the sample average if

Σ_{i∈s}(yi − θ̂xi)² < Σ_{i∈s}(yi − ȳ)²

The expression on the left is the residual sum of squares if we "fit" a line through the origin to the sample scatterplot. The expression on the right is the total sum of squares. Qualitatively, the ratio estimate is more precise if a line through the origin explains some of the variation in the response variate.
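In R, the ratio estimate of the average and its standard error can be computed as follows. A sketch, assuming the variates in molded.txt are named weight and length (the notes do not list the column names):

a <- read.table('molded.txt', header = T)
x <- a$weight; y <- a$length          # a$ form avoids masking the function length()
n <- nrow(a)
mu_x <- 33.10                         # known population average weight
theta_hat <- mean(y) / mean(x)
mu_ratio <- theta_hat * mu_x          # ratio estimate of mu(y)
r <- y - theta_hat * x                # estimated residuals; their average is 0
se <- sqrt((1 - n / 10000) / n) * sd(r)
mu_ratio + c(-1, 1) * 1.96 * se       # approximate 95% confidence interval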

Notes
1. To apply ratio estimation effectively, we need
• to measure the explanatory variate xi for each unit i in the sample
• to know μ(x), the population average of the explanatory variate
• a relationship of the form y = βx + noise, a straight line through the origin, between x and y in the study population. The smaller the noise, the greater the benefit in using the ratio estimate.

2. If we think of ratio estimation in terms of fitting a line to the scatterplot, then the estimate is an adjustment based on the fact that the sample average x̄ is different than the population average μ(x).

[Figure: scatterplot with the fitted line y = θ̂x. The ratio estimate μ̂(y)_ratio is the height of the line at μ(x); since x̄ < μ(x) here, it lies above the sample average μ̂(y).]

In this case, the sample average x̄ is smaller than the population average μ(x), so we adjust the estimate of μ(y) upward using the relationship between y and x. The closer x̄ is to μ(x), the smaller is the adjustment. We can also see the adjustment by rewriting the ratio estimate as

μ̂(y)_ratio = μ̂(y) · μ(x)/μ̂(x)

3. If we fit a response model Yi = βxi + Ri, Ri ~ G(0, σ), to the above data, then we estimate the slope using β̂ = Σ_{i∈s} xi yi / Σ_{i∈s} xi². Since β̂ = (Σ_{i∈s} xi yi / n) / (Σ_{i∈s} xi² / n) can be written as the ratio of two averages, we can derive its variance as we did for θ̂ and hence find the variance of β̂μ(x). This suggests another estimate, β̂μ(x), for μ(y). You may wonder how the precision of this estimate compares to that of the ratio estimate. If there is constant variation about the line, we expect the estimator based on β̂ to be superior. If the variation increases as x increases, we expect the ratio estimate to be better. If we start with a response model Yi = βxi + Ri, Ri ~ G(0, σ√xi), in which the variation about the line increases with x, then dividing by √xi gives the model Yi/√xi = β√xi + Ri*, Ri* ~ G(0, σ), and you can easily verify that the least squares estimate of β in this model is θ̂ = ȳ/x̄. In either case, because we are exploiting structure in the study population, the estimates will be superior to the sample average.

4. Suppose the response variate y is binary and the goal is to estimate the population proportion π. If the explanatory variate is binary or categorical, we can use post-stratification (see Chapter 10) to improve the precision of the estimation of π.

Regression Estimation of the Average

Having described ratio estimation in terms of fitting a line through the origin to the data in the sample, you may have wondered what happens if the line does not go through the origin. Here we look at using information on the explanatory variate when the relationship between y and x is linear with constant variation about the line. In other words, we use more complex models (and the subsequent analysis) to exploit the relationship between the variates in the study population.

If we have a continuous explanatory variate x, we have the conditions necessary for fitting the response model Yi = α + β(xi − x̄) + Ri, Ri ~ G(0, σ), to the data in the sample. Note that x̄ = μ̂(x) is the sample average for the explanatory variate. To produce the regression estimate μ̂(y)_reg, we
• fit the model using least squares to estimate α and β, giving

α̂ = ȳ = μ̂(y)   and   β̂ = Σ_{i∈s}(xi − x̄)yi / Σ_{i∈s}(xi − x̄)²

• substitute the known mean μ(x) into the fitted line:

μ̂(y)_reg = μ̂(y) + β̂[μ(x) − μ̂(x)]

We can view μ̂(y)_reg as an adjustment to the sample average μ̂(y), as we did with the ratio estimate. Suppose that β̂ is positive, so that in the study population larger values of x correspond to larger values of y. If the sample average μ̂(x) of the explanatory variate is less than the known population average μ(x), we adjust the estimate of μ(y) upward. The adjustment is shown on the following plot.

[Figure: scatterplot with the fitted line y = μ̂(y) + β̂(x − μ̂(x)). The regression estimate μ̂(y)_reg is the height of the line at μ(x) and lies above μ̂(y).]

The properties of the estimator μ̃(y)_reg = μ̃(y) + β̃[μ(x) − μ̃(x)] are complicated because of the three random components. We can simplify the argument with the following handwave. Rewrite the estimator as

μ̃(y)_reg − μ(y) = [μ̃(y) − μ(y)] + β[μ(x) − μ̃(x)] + [β̃ − β][μ(x) − μ̃(x)]

In large samples, we expect each of the terms within the brackets [ ] to be small. The right-most term, a product of two small quantities, is an order smaller than the other two terms. Hence we can say that

μ̃(y)_reg − μ(y) ≈ [μ̃(y) − μ(y)] + β[μ(x) − μ̃(x)]

and we have

E[μ̃(y)_reg − μ(y)] ≈ E[μ̃(y) − μ(y)] + β E[μ(x) − μ̃(x)] = 0

That is, the regression estimate is approximately unbiased.

We can estimate Var(μ̃(y)_reg) by noting that

μ̃(y)_reg ≈ μ̃(y) − β[μ̃(x) − μ(x)]

which is the sample average of r1,…,rn, where ri = yi − β(xi − μ(x)) and r̄ = ȳ − β(x̄ − μ(x)). Using the basic result for the variance of an average with SRS, we have

Var(μ̃(y)_reg) ≈ ((1 − f)/n) · Σ_{i∈U}[ri − μ(r)]²/(N − 1)

which can be estimated by

V̂ar(μ̂(y)_reg) = ((1 − f)/n) · Σ_{i∈s}(r̂i − r̄̂)²/(n − 1) = ((1 − f)/n) · Σ_{i∈s}[yi − ȳ − β̂(xi − x̄)]²/(n − 1)

where we have replaced β by the estimate β̂. The last factor is the sample variance of the estimated residuals from the least squares fit of the line to the sample data.

Example
The volume of useable wood y in a Douglas fir is related to the basal area x, the cross-sectional area of the tree measured at breast height. Volume is expensive to measure because it requires that the tree be destroyed. To estimate the total volume in a section of forest that was to be sold, a sample of 25 trees was selected by dividing the section into small sub-sections. A SRS of 25 sub-sections was selected and then a tree was selected at random within each sub-section. We will treat this protocol as if it were SRS. The selected trees were sacrificed and the basal area and volume were measured. The data and fitted line are plotted below.

[Figure: scatterplot of volume versus basal area for the 25 sampled trees, with the fitted line.]

The equation of the fitted line is y = 6.17 + 1.51(x − 1.31) and the residual sum of squares is 2.268.

A second, much larger (and cheaper) survey was carried out to estimate the total number of trees, N = 56800, and the average basal area, μ(x) = 1.40. We assume that the errors in estimating μ(x) and N are negligible.

The regression estimate is μ̂(y)_reg = 6.17 + 1.51(1.40 − 1.31) = 6.31 and, ignoring the fpc, the standard error of the estimate is

√((1/25) × (2.268/24)) = 0.061

The approximate 95% confidence interval for μ(y) based on the regression estimator is 6.31 ± 1.96 × 0.061 = 6.31 ± 0.12. The 95% confidence interval for the total volume, τ(y) = 56800 μ(y), is then 358 408 ± 6816.

The estimate for μ(y) based on the sample average μ̂(y) gives a 95% confidence interval of 6.17 ± 0.22, since the sample standard deviation of the 25 measured volumes is 0.56. The regression estimate is more precise in this case.

Notes
1. In general, we can compare the precision of the sample average, the ratio estimate and the regression estimate by looking at the sum of squares of the estimated residuals under the three least squares fits:
• Sample average: Σ_{i∈s}(yi − ȳ)²
• Ratio estimate: Σ_{i∈s}(yi − θ̂xi)², where θ̂ = ȳ/x̄
• Regression estimate: Σ_{i∈s}(yi − ȳ − β̂(xi − x̄))², where β̂ is the estimated slope
The major reason for using ratio and regression estimates is the gain in precision.

2. A special simple case of regression estimation is to use the difference di = yi − xi as the response variate and then estimate the population average by

μ̂(y)_diff = μ̂(d) + μ(x)

This estimate is more precise than the sample average if the variation in the differences d1,…,dn is less than the variation in y1,…,yn. We used a difference estimate in the inventory example to estimate the total true value of the files.

3. To use the regression estimate effectively, we need
• a continuous response variate y and a continuous explanatory variate x
• knowledge of the study population average of the explanatory variate
• a linear relation between y and x, with smaller residual variation leading to a more precise estimate.

4. Regression estimation can be extended to multiple explanatory variates and non-linear relationships. We use least squares to estimate the relationship between the response and explanatory variates in the sample and then adjust the sample average using the fitted model and the sample averages for the explanatory variates. Note that this adjustment accounts for differences between the sample attributes of the explanatory variates and the known population attributes.
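With lm(), the regression estimate also takes only a few lines. A sketch for the fir example, assuming the data are stored in a file fir.txt (a name I have made up) with variates named basal and volume:

a <- read.table('fir.txt', header = T)       # hypothetical file and variate names
x <- a$basal; y <- a$volume
n <- length(y)
b <- lm(y ~ x)                               # same fit as y = alpha + beta(x - xbar)
mu_reg <- coef(b)[1] + coef(b)[2] * 1.40     # substitute the known mu(x) = 1.40
se <- sqrt(sum(resid(b)^2) / (n - 1) / n)    # ignoring the fpc
mu_reg + c(-1, 1) * 1.96 * se                # approximate 95% confidence interval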

Exercises
1. Find the quadratic expansion of f(x, y) = y/x about the point (μ(x), μ(y)) to estimate the bias in the estimator θ̃ = μ̃(y)/μ̃(x). Note that the general form of the expansion is

f(x, y) ≈ f(x0, y0) + ∂f(x0, y0)/∂x (x − x0) + ∂f(x0, y0)/∂y (y − y0)
          + ∂²f(x0, y0)/∂x² · (x − x0)²/2 + ∂²f(x0, y0)/∂x∂y · (x − x0)(y − y0) + ∂²f(x0, y0)/∂y² · (y − y0)²/2

This quadratic function has the same value, first and second derivatives at the point (x0, y0) as does f(x, y). You can easily check this statement by differentiating the right side of the expression.

2. In order to count the number of small items in a large container, a shipping company selects a sample of 25 items and weighs them. They then weigh the whole shipment (excluding the container). Let the weight of the ith item in the population be yi and the total known weight be τ.
a) Show that an estimate of the population size is N̂ = τ / (Σ_{i∈s} yi / 25).
b) Find the (approximate) mean and standard deviation of the corresponding estimator Ñ. Assume that there is small error in weighing and act as if SRS is used – it is not, the sampling is haphazard.
c) In the example, the sample average weight is 75.45 g, the sample standard deviation is 0.163 g and the total weight is 154.2 kg. Find a 95% confidence interval for the total number of items in the container.

3. Many bird species have specialized habitat. For example, wood thrush are a forest-dwelling bird that live in the hardwood forests of eastern North America. We can exploit this knowledge when we are trying to estimate population totals or density. Suppose we wanted to estimate the number of wood thrush pairs nesting within the region of Waterloo, an area of highly fragmented forest patches. Using aerial photography, we know that there are 1783 such patches (minimum size 3 ha) with an average size of 13.4 ha. A simple random sample of 50 woodlots is selected and the number of nesting pairs yi is counted in each woodlot by counting the number of singing males. The area xi of each sampled woodlot is also recorded. The data are available in the file thrush.txt. Find 95% confidence intervals for the total number of thrushes based on the
a) sample average ȳ
b) ratio estimate
c) regression estimate

4. Briefly describe when you would use the ratio or regression estimate instead of the sample average to estimate the population average.

5. The City of Waterloo wants to estimate the average amount of water per house, μ(y), that is used to water lawns and gardens in the month of July. A SRS of 50 houses is selected and special metering units are installed to measure the volume of water y from external taps. The total volume of water x is measured by the regular meter. From water records, it is known that the average total water consumption per house is μ(x) = 15.6 cubic metres. The data are stored in the file water.txt.
a) Prepare a scatterplot of y versus x.
b) Estimate μ(y) using the sample average, the ratio estimate and the regression estimate.
c) Find 95% confidence intervals based on each estimate.
d) Which estimation procedure is preferable here? Why?

Chapter 10 Stratified Random Sampling

In the previous chapter, we looked at ways to use an explanatory variate with known attributes to improve on the sample average as an estimate of the study population average with SRS. The basic idea was to exploit a structural relationship between the response and explanatory variates. In this chapter, we change both the sampling protocol and the estimate to get a procedure that usually produces a better estimate of the study population average.

The idea is to divide the study population into sub-populations, called strata, and sample independently using SRS from each stratum. We then combine the estimates of the stratum averages to get an estimate of the population average. Some examples of possible strata are:
• Provinces and large urban centers in national opinion surveys
• Small and large accounts in auditing a population of accounts
• Home faculties in a survey of UW students
• Sites in a survey of employees in a multi-site company

In many examples, we have questions about the strata averages as well as the overall population average. Stratified sampling gives information about these averages and often an improved estimate of the overall average.

Suppose that we divide the population U into H mutually exclusive strata U1,…,UH with sizes N1,…,NH, so that N = N1 + … + NH. For the variate of interest, we denote the stratum averages and standard deviations by μh, σh, h = 1,…,H. With this notation, we can write

μ = (N1μ1 + … + NHμH)/N = W1μ1 + … + WHμH

the weighted average of the stratum averages. We call Wh = Nh/N the stratum weight, the proportion of the total units found in that stratum.

Now suppose that for each stratum we independently select a sample of size nh from stratum h using SRS and calculate the sample average μ̂h. We can combine these estimates to get the stratified estimate of the population average

μ̂_strat = W1μ̂1 + … + WHμ̂H

The corresponding estimator is μ̃_strat = W1μ̃1 + … + WHμ̃H. Since we used SRS within each stratum, we have

E(μ̃_strat) = W1E(μ̃1) + … + WHE(μ̃H) = W1μ1 + … + WHμH = μ

and

Var(μ̃_strat) = W1²(1 − f1)σ1²/n1 + … + WH²(1 − fH)σH²/nH

where fh = nh/Nh is the sampling fraction and σh is the standard deviation of the response variate for stratum h. We can estimate the variance by

V̂ar(μ̃_strat) = W1²(1 − f1)σ̂1²/n1 + … + WH²(1 − fH)σ̂H²/nH

where σ̂h² = Σ_{j∈sh}(yhj − μ̂h)²/(nh − 1) is the sample variance within stratum h and yhj is the response variate for the jth unit in the sample from stratum h.

Example
To estimate average water quality and the proportion of wells with contamination, a survey of residential wells was carried out in the rural part of the region of Waterloo. The population of 13 345 wells was identified from assessment records. Three strata were created:

Stratum                  Size    Weight   Sample size
farms with animals       2365    0.177        150
farms without animals    1297    0.097        100
houses                   9683    0.726        250

A random sample of wells was selected from each stratum and the water was tested for a large number of characteristics. Here we look at only two:
y: sodium (Na) concentration (mg/L)
u: the water was contaminated by coliform bacteria (u = 1) or not (u = 0)

The data are summarized below.

Stratum                  Average Na   St Dev Na   % contaminated
farms with animals          237.3       41.45          17.2
farms without animals       245.6       37.62          11.4
houses                      220.1       51.2           13.2

The estimate of the population average Na concentration μ is

μ̂_strat = 0.177(237.3) + 0.097(245.6) + 0.726(220.1) = 225.6 mg/L

The estimated variance of the estimator μ̃_strat is

(.177)²(1 − 150/2365)(41.45²/150) + (.097)²(1 − 100/1297)(37.62²/100) + (.726)²(1 − 250/9683)(51.2²/250) = 5.845

so the standard error is 2.418 mg/L. An approximate 95% confidence interval for μ is 225.6 ± 4.7 mg/L.

For the binary response variate u, the estimate of the proportion of contaminated wells, π = W1π1 + W2π2 + W3π3, is

π̂_strat = 0.177(.172) + 0.097(.114) + 0.726(.132) = .137 or 13.7%

Note that you need to be careful moving from percentages to proportions. The estimated variance of the associated estimator is

(.177)²(1 − 150/2365)[.172(1 − .172)/150] + (.097)²(1 − 100/1297)[.114(1 − .114)/100] + (.726)²(1 − 250/9683)[.132(1 − .132)/250] = 0.000272

so the standard error is 0.0165. Note that we used the approximation Var(π̃h) ≈ (1 − fh)πh(1 − πh)/nh in each stratum. The 95% confidence interval for π is 0.137 ± 0.032, or 13.7% ± 3.2%. We can compare the strata means and proportions – see Exercise 1.

There are several questions of interest.
• When is stratified sampling more efficient than SRS?
• How should we allocate the total sample among the strata?
• Can we combine ratio/regression estimation with stratified sampling?

The answer to the last question is the easiest. We can estimate the strata averages in the best way possible, e.g. by ratio or regression estimation if appropriate, and then combine the estimates using the strata weights. Since each stratum is sampled independently, the estimated variance of the combined estimator is the sum of the squared strata weights times the estimated variances of the stratum estimators, where these variances are calculated using the formulae for ratio or regression estimates.
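The stratified estimate and its standard error are easy to compute from the stratum summaries. A sketch in R using the well-survey numbers above:

Nh <- c(2365, 1297, 9683)             # stratum sizes
nh <- c(150, 100, 250)                # stratum sample sizes
ybar <- c(237.3, 245.6, 220.1)        # stratum average Na
s <- c(41.45, 37.62, 51.2)            # stratum sample st dev of Na
Wh <- Nh / sum(Nh)                    # stratum weights
mu_strat <- sum(Wh * ybar)            # stratified estimate, 225.6 mg/L
v <- sum(Wh^2 * (1 - nh / Nh) * s^2 / nh)
mu_strat + c(-1, 1) * 1.96 * sqrt(v)  # approximate 95% confidence interval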

Comparison to SRS

To examine the efficiency of stratified sampling, we need to consider the sampling weights wh = nh/n, the strata weights Wh = Nh/N and the relative sizes of σ1,…,σH versus σ. We ignore the finite population corrections here. Looking at the variance of μ̃_strat,

Var(μ̃_strat) = W1²(1 − f1)σ1²/n1 + … + WH²(1 − fH)σH²/nH ≈ (1/n)[(W1²/w1)σ1² + … + (WH²/wH)σH²]

we see that if we were to give a very small sampling weight to a stratum with high stratum weight, then it is possible to have Var(μ̃_strat) > Var(μ̃), i.e. a larger variance with stratified sampling. However, this is a contrived situation and in most cases, if we construct the strata with care, stratified sampling will be much better than SRS.

To confirm this point, consider proportional allocation where, except for rounding, we have wh = Wh or, in other words, nh ∝ Nh. Substituting Wh = wh = nh/n, we have

Var(μ̃_strat) ≈ (1/n)(W1σ1² + … + WHσH²)

and the variance of the stratified estimator will be less than the variance of the sample average from SRS if

W1σ1² + … + WHσH² < σ²

The left side is the weighted average of the within-strata variances. If we form the strata so that these variances are small, i.e. there is greater consistency within strata compared to the whole population, then the weighted average will be less than the overall variance.

Another way to make the same point is to use the ANOVA partition of the total sum of squares into two components, within and between strata. Consider each stratum as a treatment and recall, for an unbalanced design, that we can partition the total sum of squares as

Between strata (treatment):  Σh Nh(μh − μ)²
Within strata:               Σh Σj (yhj − μh)² = Σh (Nh − 1)σh²
Total:                       Σh Σj (yhj − μ)² = (N − 1)σ²

where the sums are over the whole population. We have

σ² = [Σh Nh(μh − μ)² + Σh (Nh − 1)σh²]/(N − 1) ≈ Σh Wh(μh − μ)² + Σh Whσh²

so the difference in variance for the stratified (proportional allocation) versus the sample average estimator is proportional to Σh Wh(μh − μ)². In other words, we should make the strata means as different as possible in order to achieve the greatest gain over SRS.

Optimal allocation

Suppose that at the Plan stage, before selecting the sample, we decide that we can afford to select a sample of total size n. How should we divide the sampling effort among the strata if the objective is to minimize the variance of the resulting estimator? This is the allocation problem. We want to determine n1,…,nH so that n1 + … + nH = n and Var(μ̃_strat) is minimized.

We treat n1,…,nH as continuous variables and use a Lagrange multiplier. We find the critical point of the function

f(n1,…,nH, λ) = W1²(1 − f1)σ1²/n1 + … + WH²(1 − fH)σH²/nH + λ(n1 + … + nH − n)
             = W1²(1/n1 − 1/N1)σ1² + … + WH²(1/nH − 1/NH)σH² + λ(n1 + … + nH − n)

The partial derivatives are

∂f/∂nh = −Wh²σh²/nh² + λ,    ∂f/∂λ = n1 + … + nH − n

Setting these to 0, we get nh = Whσh/√λ and, summing over the strata and solving for λ,

n = (W1σ1 + … + WHσH)/√λ   or   √λ = (W1σ1 + … + WHσH)/n

or, more simply, nh ∝ Whσh. Hence, for optimal allocation,

nh = n · Whσh/(W1σ1 + … + WHσH)

We allocate more sampling effort to those strata that have higher weight or larger within-stratum standard deviation. Note that proportional allocation is optimal if σh is the same for each stratum.

If we ignore the fpc, then for the optimal allocation we get

Var(μ̃_strat) = (1/n)(W1σ1 + … + WHσH)²

In order to use the optimal allocation, we need to know (unlikely) or have an estimate of the within-stratum standard deviations, perhaps from a pilot survey. If we do so and decide to use optimal allocation, then we can use the preceding formula to select the total sample size n that achieves a confidence interval of pre-determined length. That is, for a given level of confidence, the approximate confidence interval has length

2l = (2c/√n)(W1σ1 + … + WHσH)

so we select a sample with total size

n = (c²/l²)(W1σ1 + … + WHσH)²

where c is a value from the G(0, 1) tables determined by the level of confidence.

Forming the Strata

Stratified sampling can produce large increases in precision (i.e. shorter confidence intervals) compared to SRS for the same total sample size. Put another way, for a given level of precision, we can use a smaller sample size with stratified sampling. However, stratification adds complexity and, to implement it, we need to identify the stratum for each unit in the frame before we begin. If we have complete knowledge of some explanatory variate that we believe to be related to the response variate, we can use the values of that variate to form the strata.

The first consideration is the purpose of the survey. In many cases, such as the labour force survey, we want to estimate the rate of unemployment in each province, so it is natural to stratify by province. Since we are interested in the provincial rates, we need to ensure that each province gets a large enough sample to estimate the within-stratum rate, so here the allocation problem is very different.

If we are interested only in the overall frame average or total, we form the strata so that the stratum averages are likely to be very different. In many applications in auditing, for example, accounts are stratified on the basis of stated value. The likelihood of large errors is greater in larger accounts, so every account in the stratum of the largest accounts is included in the sample. That is, a particular stratum may be so important that we take a complete census within it.

We can use estimates from a pilot study or an earlier version of the survey to optimally allocate the sample across the strata. Proportional allocation is popular because we do not need to know the within-strata standard deviations and we can be almost sure to do better than with SRS.
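A sketch of the allocation and sample-size calculations in R, using the well-survey strata as the illustration (the target half-length l = 4 mg/L is arbitrary):

Wh <- c(0.177, 0.097, 0.726)
s <- c(41.45, 37.62, 51.2)              # estimated within-stratum st devs
n <- 500
round(n * Wh * s / sum(Wh * s))         # optimal allocation (adjust rounding so the total is n)
l <- 4                                  # target half-length of a 95% confidence interval
ceiling((1.96 / l)^2 * sum(Wh * s)^2)   # total sample size required, ignoring the fpc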

Post Stratification

We now return to an issue discussed in Chapter 9. Suppose there is a discrete explanatory variate, such as gender or age class, and we know the proportion of the population that falls in each class. That is, we know the population proportions or weights W1,…,WH for the H classes. This corresponds to knowing the mean of a continuous explanatory variate that we might use in a ratio or regression estimate. We cannot use the discrete variate to form strata, since we do not know the value of the variate for every unit in the frame. Instead, we select a sample using SRS from the frame and observe ñ1 = n1,…,ñH = nH units in each class. The sample sizes are not controlled and, if we were to repeat the sampling, they would change. A natural estimate of the population average is

μ̂_post = W1μ̂1 + … + WHμ̂H

We call this the post-stratification estimate because we do not establish the stratum for each unit in the sample until after it is selected. The estimate looks like the stratified estimate – the estimators are different because the denominator of μ̃h is random for the post-stratification estimator.

To determine the properties of this procedure, we need a small aside. Suppose X and Y are two discrete random variables with probability function Pr(X = x, Y = y). Then we can write

E(X) = Σy Σx x Pr(X = x, Y = y) = Σy [Σx x Pr(X = x | Y = y)] Pr(Y = y)

The expression in [ ] is the conditional expected value of X for a given value Y = y and is written E(X | Y = y). Note that E(X | Y = y) is a function of y only, because we have added over all values of x. With this notation we have

E(X) = Σy E(X | Y = y) Pr(Y = y)

The right side is the expected value of the function E(X | Y = y), so we write

E(X) = E[E(X | Y = y)]

In words, we can calculate the expected value of X in two steps: first, find the conditional expectation for each value of y and second, find the expected value of the conditional expectation over the distribution of Y.

We use this result to find E(μ̃_post). Consider the two random variables μ̃h, ñh. Then we have

E(μ̃h) = E[E(μ̃h | ñh = nh)]

As long as nh ≠ 0, we know, using the results from SRS, that E(μ̃h | ñh = nh) = μh. If we ignore the event ñh = 0 (which happens with small probability in large samples), we have, to a good approximation,

E(μ̃h) = E[E(μ̃h | ñh = nh)] ≈ μh

Hence we have

E(μ̃_post) = E(W1μ̃1 + … + WHμ̃H) ≈ W1μ1 + … + WHμH = μ

The post-stratified estimate is unbiased (almost).

To find the variance, we need a second result.

Var(X) = E[X²] − μ² = E[E(X² | Y = y)] − μ²
       = E[E(X² | Y = y) − E(X | Y = y)²] + E[E(X | Y = y)²] − E[E(X | Y = y)]²

since μ = E[E(X | Y = y)]. The expression inside the first [ ] is Var(X | Y = y), so the first term is E[Var(X | Y = y)]. The second and third terms together are Var[E(X | Y = y)], so we have the result

Var(X) = E[Var(X | Y = y)] + Var[E(X | Y = y)]

To find Var(μ̃_post), we condition on ñ1 = n1,…,ñH = nH. Since E(μ̃_post | ñ1 = n1,…,ñH = nH) = μ for all values of n1,…,nH, we have

Var[E(μ̃_post | ñ1 = n1,…,ñH = nH)] = Var[μ] = 0

Also, from SRS, we have

Var(μ̃_post | ñ1 = n1,…,ñH = nH) = W1²(1/n1 − 1/N1)σ1² + … + WH²(1/nH − 1/NH)σH²

and so

E[Var(μ̃_post | ñ1 = n1,…,ñH = nH)] = W1²(E[1/ñ1] − 1/N1)σ1² + … + WH²(E[1/ñH] − 1/NH)σH²

Combining the two pieces, we get

Var(μ̃_post) = W1²(E[1/ñ1] − 1/N1)σ1² + … + WH²(E[1/ñH] − 1/NH)σH²

We approximate this variance by

V̂ar(μ̃_post) = W1²(1/n1 − 1/N1)σ̂1² + … + WH²(1/nH − 1/NH)σ̂H²

which is identical to the estimated variance of the stratified estimator for the observed allocation n1,…,nH.

Example
A market research organization interviews a randomly selected sample of 300 households in a community to estimate the average amount of money spent on DVD/video rental and movies in the previous week. From census data, they know the distribution of household size in the community, but this information is not available in the frame for each unit. They post-stratify the data as follows.

Household size   Population weight (census)   Sample size   Sample weight   Sample average   Sample st dev
1                        0.232                     87           0.290            13.45            3.77
2                        0.381                    109           0.363            20.67            6.56
3                        0.193                     54           0.180            28.21            5.89
4                        0.123                     27           0.090            28.22            5.47
>4                       0.071                     23           0.077            25.23            4.21

The estimate is μ̂_post = W1μ̂1 + … + W5μ̂5 = $19.91 and the estimated standard deviation of the corresponding estimator is (ignoring fpc's) 0.29. An approximate 95% confidence interval for the population average amount spent is $19.91 ± 0.57. Note that the sample average, $19.47, has been adjusted upward because of the over-representation of households of size 1 in the sample.

In the above example, we ignored non-response. The company telephoned many more than 300 households to get the required number of completions. Non-response is a major source of error when sampling human populations, and the confidence intervals that we have constructed do not take this error into account. There are many analytic methods and sampling strategies to deal with this important issue.
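In R, the post-stratified estimate is computed just like the stratified one, with the observed class counts in the denominators. A sketch, assuming a data frame a with variates spend and hsize (hypothetical names) and the census weights from the example:

W <- c(0.232, 0.381, 0.193, 0.123, 0.071)   # census weights, in the order of the classes
nh <- table(a$hsize)                        # observed class counts
ybar <- tapply(a$spend, a$hsize, mean)      # class sample averages
s <- tapply(a$spend, a$hsize, sd)           # class sample st devs
mu_post <- sum(W * ybar)
se <- sqrt(sum(W^2 * s^2 / nh))             # ignoring the fpc's
mu_post + c(-1, 1) * 1.96 * se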

Exercises
1. In the well survey, there is interest in estimating strata averages and differences in strata averages.
a. In general, for SRS, if there are H strata, write down the distributions of the estimators μ̃h and μ̃h − μ̃k.
b. Find a 95% confidence interval for the proportion of wells in farms with animals that are contaminated.
c. Find a 95% confidence interval for the average Na difference between the two types of farm wells.

2. Suppose that the purpose of the survey is to estimate a population proportion π.
a) Write down the stratified estimate of π and the variance of the corresponding estimator.
b) What is the variance of π̃_strat for proportional allocation?
c) How should the strata be formed so that the stratified sampling protocol is superior to SRS?

3. Suppose the well survey was to be re-done with the same overall sample size of 500. How would you recommend allocating the sample to the strata if
a) estimating the average Na level was the primary goal?
b) estimating the proportion of contaminated wells was the primary goal?
c) For each case, compare the predicted standard deviations of μ̃_strat and π̃_strat to what occurred in the current survey.

4. Consider the difference of the variances of μ̃_strat under proportional and optimal allocation for a sample of size n. Ignore the fpc.
a) Show that this difference can be written as (1/n) Σh (σh − σ̄)² Wh, where σ̄ = Σh σh Wh is the weighted average standard deviation over the H strata.
b) When will the gain be large with optimal allocation relative to proportional allocation?

5. In an informal sample of math students at UW, 100 people were asked their opinion (on a 5 point scale) about the core courses and their value. One particular statement was (with scores): "All mathematics students are required to take Stat 231", scored as
strongly agree – 1, agree – 2, neutral – 3, disagree – 4, strongly disagree – 5
The sample results, broken down by year, are shown below. There are about 3300 students in the faculty.

Year   Sample size   Population weight   Average score   Standard deviation
1          39              0.31               2.22             1.2
2          23              0.24               3.03             1.5
3          26              0.22               3.22             1.8
4          12              0.23               3.5              0.87

Estimate the average score for all math students and find an approximate 95% confidence interval for the population average – note that SRS was not used here, so we are making assumptions about the estimators that may be unwarranted.

Appendix 1 An Introduction to R

R is a high level language with many useful statistical functions. In this document, all R commands and objects are given in italics, and I assume that you are using the Windows version of R. You can look at or download this document from the course web page so the links will be active.

Getting Started

1. Where to find R
• A Windows version of R is available free at http://www.r-project.org/
• For help with installation and implementation see C:\Program Files\R\rw1062\doc\html\rw-FAQ.html. The online FAQ for Windows can be searched for help with almost anything.
• R is available on the faculty PCs, specifically in rooms MC 3006 and 3009.
• R is available on the math faculty unix machines – type R to start the program.

2. Starting and Quitting R
• To start R, create a shortcut on your desktop from the program. R will open and restore the previously saved workspace.
• To quit R, type q(). Note that R will let you save the workspace in the current working directory.
• It is a good idea to clear the workspace if you plan to start a new project – see the Misc menu.

3. Where to find help
• Use the Help menu for on-line assistance. The manual "An Introduction to R" can be downloaded in pdf format. Try the sample session in that manual.
• Within R, if you know the function name, use the command help(function) for assistance. If you do not know the function, use help.search('what you are looking for').
• Look at the web page http://www.stats.uwaterloo.ca/Stats_Dept/StatSoftware/R/ and try the R tutorial.

4. Reading Data into R
• All data sets used in the course notes and lectures will be posted on the course web page in a .zip file that you can download to your own machine. The files have variate names in the first row and the variate values in the following rows, one row per unit in the sample.
• To get a data set, use the command a <- read.table('file path and name', header=T). The data are stored in the data frame a for further use.
• You can also read the data files from my web page with the command a <- read.table('http://www.math.uwaterloo.ca/~rjmackay/stat371/file name.txt', header=T)
• If the file is stored on your own machine, you can avoid long path names by setting the working directory to the directory containing the file. Look under the File menu on the R gui to set the working directory.
• If you want to use R on other data, create the data set in EXCEL and then save it in tab delimited .txt format in your working directory.
• For a variate named sales in the .txt file, the R variate name is a$sales. If there is a single data frame a, you can simplify the name with the command attach(a) so the awkward a$ notation is avoided. To restore the full name, use detach(a).

Working with R
1. Commands can be typed directly into R. I prefer to type them in Word or Notepad and then paste them into the R gui. This makes editing easy and preserves a record for reuse.
2. Using the up and down arrow keys in R displays past and subsequent command lines, which can be edited and re-executed.
3. R output and plots can be copied and pasted into Word to create reports.
4. Here are some R functions and objects used repeatedly in STAT 371. Results can be stored, e.g. w <- mean(y), or immediately displayed, e.g. mean(y).

function                            purpose
mean(y)                             calculates the average of the variate y
sd(y)                               calculates the st dev of the variate y
summary(y)                          calculates a 5 number summary of the variate y
tapply(y, x, function)              calculates the function (e.g. mean) applied to y for each value of x
w <- u + v                          creates the sum of two vectors (w <- u*v gives the element-wise product)
sqrt(y)                             calculates the element-wise square root of y
A %*% B                             matrix product
t(A)                                gives the transpose of A
solve(A)                            gives the inverse of A
b <- lm(y~x1+x2+…)                  fits the linear model y = β0 + β1x1 + … + r and stores the results in the lm object b
b <- lm(y~-1+x1+x2+…)               fits the linear model y = β1x1 + … + r without intercept
summary(b)                          details of the fit
resid(b)                            the estimated residuals
fitted(b)                           the fitted values
anova(b)                            the analysis of variance from the fit
hatvalues(b)                        the diagonal elements of the hat matrix
rstudent(b)                         the studentized residuals
anova(c, b)                         analysis of variance to compare the sub-model c to b
regsubsets(y~x1+x2+…, nbest=k)      fits the k best subsets for 1, 2, …, p variate models

plot(x, y, main='title', xlab='xx', etc.)   scatterplot of y vs. x with title, x-axis label etc.
abline(b)                                   adds the fitted line from b <- lm(y~x) to the scatterplot
hist(y)                                     histogram of the values in y
qqnorm(y)                                   a gaussian qq plot of the values in y
par(mfrow=c(j,k))                           creates a graphic window with j rows and k columns for the next jk plots
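A minimal session tying these functions together (assessment.txt and its variates value, size and age are used in the exercises; the rest of the session is illustrative):

a <- read.table('assessment.txt', header = T)
attach(a)
b <- lm(value ~ size + age)     # fit a two-variate linear model
summary(b)                      # estimates, standard errors, R-squared
par(mfrow = c(1, 2))
plot(size, value)               # scatterplot of the response against one variate
qqnorm(resid(b))                # check the gaussian assumption (see Appendix 3)
detach(a)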

Appendix 2 Properties of vectors and matrices of random variables

Recall that if X and Y are two random variables and a, b are constants, then we have

E(aX + b) = aE(X) + b
E(X + Y) = E(X) + E(Y)
Var(aX + b) = a²Var(X)
Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y)

where Cov(X, Y) = E{(X − E(X))(Y − E(Y))}.

Now suppose we have two vectors of k random variables, Uᵗ = (U1,…,Uk) and Wᵗ = (W1,…,Wk), written as the transpose of row vectors to save space.

Definitions:
• The expected value of U is the vector E(U) with ith element E(Ui).
• The variance-covariance matrix of U is the matrix Var(U) with ijth element E{(Ui − E(Ui))(Uj − E(Uj))}. Note that the diagonal elements are the variances and the off-diagonal elements are the covariances of the components of U.
• The covariance matrix of U and W is the matrix Cov(U, W) with ijth element E{(Ui − E(Ui))(Wj − E(Wj))}, the covariance of Ui and Wj.

Properties: These follow from the properties of expectation.
1. If a is a vector and A is a matrix of constants, then E(a + U) = a + E(U) and E(AU) = AE(U).
2. The expected value of the sum of two vectors is the sum of the expected values. That is, E(U + W) = E(U) + E(W).
3. Var(U) = E{(U − E(U))(U − E(U))ᵗ}. We can see that this result is true by noting that the ijth element of xxᵗ is xixj for any vector x.
4. Var(a + U) = Var(U).
5. Var(AU) = A Var(U) Aᵗ. This useful result is easy to show using properties 1 and 3.
6. Cov(U, W) = E{(U − E(U))(W − E(W))ᵗ}. This follows using the same argument as in 3.
7. Cov(a + U, b + W) = Cov(U, W) and Cov(AU, BW) = A Cov(U, W) Bᵗ, from 6.

Multivariate Normal Distribution

If Zᵗ = (Z1, Z2,…,Zk) is a vector of k independent gaussian G(0, 1) random variables, we say that Z has a multivariate normal distribution with mean vector 0 and variance-covariance matrix I. We write Z ~ N(0, I).

Suppose μ is a vector and A is a matrix of constants. If U = μ + AZ, then the mean is E(U) = μ, the variance-covariance matrix is Var(U) = AAᵗ = Σ, and U has a multivariate normal distribution. We use the notation U ~ N(μ, Σ).

Properties
1. The component Ui of U is a constant μi plus a linear combination of Z1,…,Zk and hence is gaussian with mean μi and standard deviation the square root of the ith diagonal element of Var(U).
2. If U ~ N(μ, Σ), then BU ~ N(Bμ, BΣBᵗ). In words, linear combinations of the components of a multivariate normal random vector are multivariate normal with the appropriate mean and variance-covariance matrix.
3. (An important special case of 2.) If aᵗ = (a1,…,ak) is a vector of constants, then aᵗU is gaussian with mean aᵗμ and standard deviation √(aᵗΣa).
4. The components Ui and Uj are independent random variables if and only if Cov(Ui, Uj) = 0.
5. More generally, the vectors BU and CU are independent if and only if Cov(BU, CU) = B Var(U) Cᵗ = 0.
These results follow from the properties of the gaussian distribution and the matrix results above.
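These matrix results are easy to check by simulation. A small sketch in R (all values illustrative):

# simulate U = mu + A Z, with Z a vector of independent G(0,1) variates
set.seed(1)
A <- matrix(c(1, 0.5, 0, 2), 2, 2)
mu <- c(1, -1)
Z <- matrix(rnorm(2 * 10000), nrow = 2)   # 10000 realizations of Z
U <- mu + A %*% Z                         # each column is one realization of U
var(t(U))                                 # sample variance-covariance matrix of U
A %*% t(A)                                # the theoretical Var(U) = A A^t; the two should be close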

Appendix 3: Gaussian Quantile-Quantile Plots

We use a gaussian quantile-quantile (qq) plot to assess whether a set of n values looks like a random sample of size n from a gaussian distribution. To explain the plot, consider the figure below, which shows a G(0, 1) density function and 5 "bins", each with probability 0.20.

[Figure: G(0, 1) density divided into 5 equal-probability bins, with the probabilistic centers q(1),…,q(5) marked by arrows.]

Suppose we have a random sample of 5 values z1,…,z5 from this distribution, and denote the sample values in increasing order by z(1),…,z(5). With such a sample, we expect z(1) to fall in the first bin, z(2) in the second, and so on. Let the "probabilistic" centers of the bins be q(1),…,q(5). That is,

Pr(Z ≤ q(1)) = 1/10, Pr(Z ≤ q(2)) = 3/10, …, Pr(Z ≤ q(5)) = 9/10

In general, for a sample of size n, we have

Pr(Z ≤ q(i)) = (i − 1)/n + 1/(2n) = (2i − 1)/(2n)

If the sample z1,…,z5 is from a G(0, 1) distribution, then we expect z(1) ≈ q(1), z(2) ≈ q(2), …, z(5) ≈ q(5), or equivalently, if we plot the points (q(1), z(1)), (q(2), z(2)), …, (q(5), z(5)), we should see a straight line through the origin with slope 1. If the points deviate from this line substantially, then we decide that the gaussian assumption is not tenable.

Now suppose that u1, u2,…,un is a sample from a G(μ, σ) distribution. Then we have ui = μ + σzi, where z1, z2,…,zn is a sample from a G(0, 1) distribution. Since u(i) = μ + σz(i), a plot of the points (q(1), u(1)), (q(2), u(2)), …, (q(n), u(n)) will be approximately a straight line with slope σ and y-intercept μ. We call this a qq plot and use R to construct the plot with the function qqnorm(). For example, to construct the qq plot of the estimated residuals from the fit b, we use the code qqnorm(resid(b)). If the points deviate from a line substantially, then we decide that the gaussian assumption is not tenable. If the qq plot of the estimated residuals is systematically non-linear, we can try transforming the values of the response variate before fitting the linear model.

You need to be careful not to over-interpret these plots. The plots on the next page are based on 9 random samples of size 50 from a G(0, 1) distribution. Note how several of the plots appear non-linear or have apparent outliers. To see the behaviour of the plots for a non-gaussian distribution, the final three plots correspond, from left to right, to a sample of 50 values from a G(0, 1) distribution, the square of the values, and the reciprocal of the values.

The plots for the residuals in the assessment data (full data on the left, two cases omitted on the right) are shown below. The plot on the left shows that it is not reasonable to suppose that the residuals from fitting the model to the full data set are gaussian. There is no evidence against this assumption once the two cases are deleted.

[Figures: 9 qq plots of simulated G(0, 1) samples; 3 qq plots of a gaussian sample, its square and its reciprocal; 2 qq plots of the assessment-data residuals, full data and with two cases omitted.]
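The simulated plots described above can be regenerated with a few lines of R (a sketch):

par(mfrow = c(3, 3))
for (i in 1:9) qqnorm(rnorm(50))   # 9 qq plots of gaussian samples of size 50

# behaviour for non-gaussian data: a gaussian sample, its square and its reciprocal
z <- rnorm(50)
par(mfrow = c(1, 3))
qqnorm(z); qqnorm(z^2); qqnorm(1/z)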


t-table (right tail)
For each row (degrees of freedom k) and column (right tail probability α), the table entry e satisfies Pr(tk ≥ e) = α. Note that the t-distribution is symmetric about 0.

degrees of             right tail probability
freedom       0.25     0.10     0.05     0.025     0.01
1            1.000    3.078    6.314    12.706    31.821
2            0.816    1.886    2.920     4.303     6.965
3            0.765    1.638    2.353     3.182     4.541
4            0.741    1.533    2.132     2.776     3.747
5            0.727    1.476    2.015     2.571     3.365
6            0.718    1.440    1.943     2.447     3.143
7            0.711    1.415    1.895     2.365     2.998
8            0.706    1.397    1.860     2.306     2.896
9            0.703    1.383    1.833     2.262     2.821
10           0.700    1.372    1.812     2.228     2.764
11           0.697    1.363    1.796     2.201     2.718
12           0.695    1.356    1.782     2.179     2.681
13           0.694    1.350    1.771     2.160     2.650
14           0.692    1.345    1.761     2.145     2.624
15           0.691    1.341    1.753     2.131     2.602
16           0.690    1.337    1.746     2.120     2.583
17           0.689    1.333    1.740     2.110     2.567
18           0.688    1.330    1.734     2.101     2.552
19           0.688    1.328    1.729     2.093     2.539
20           0.687    1.325    1.725     2.086     2.528
21           0.686    1.323    1.721     2.080     2.518
22           0.686    1.321    1.717     2.074     2.508
23           0.685    1.319    1.714     2.069     2.500
24           0.685    1.318    1.711     2.064     2.492
25           0.684    1.316    1.708     2.060     2.485
26           0.684    1.315    1.706     2.056     2.479
27           0.684    1.314    1.703     2.052     2.473
28           0.683    1.313    1.701     2.048     2.467
29           0.683    1.311    1.699     2.045     2.462
30           0.683    1.310    1.697     2.042     2.457
35           0.682    1.306    1.690     2.030     2.438
40           0.681    1.303    1.684     2.021     2.423
45           0.680    1.301    1.679     2.014     2.412
50           0.679    1.299    1.676     2.009     2.403
gaussian     0.675    1.282    1.646     1.962     2.330

F-table (right tail), α = 0.10
For each row (denominator degrees of freedom k) and column (numerator degrees of freedom j), the table entry e satisfies Pr(F(j, k) ≥ e) = α.

denom.                          numerator degrees of freedom
df         1      2      3      4      5      6      7      8      9     10     20     30
1      39.86  49.50  53.59  55.83  57.24  58.20  58.91  59.44  59.86  60.19  61.74  62.26
2       8.53   9.00   9.16   9.24   9.29   9.33   9.35   9.37   9.38   9.39   9.44   9.46
3       5.54   5.46   5.39   5.34   5.31   5.28   5.27   5.25   5.24   5.23   5.18   5.17
4       4.54   4.32   4.19   4.11   4.05   4.01   3.98   3.95   3.94   3.92   3.84   3.82
5       4.06   3.78   3.62   3.52   3.45   3.40   3.37   3.34   3.32   3.30   3.21   3.17
6       3.78   3.46   3.29   3.18   3.11   3.05   3.01   2.98   2.96   2.94   2.84   2.80
7       3.59   3.26   3.07   2.96   2.88   2.83   2.78   2.75   2.72   2.70   2.59   2.56
8       3.46   3.11   2.92   2.81   2.73   2.67   2.62   2.59   2.56   2.54   2.42   2.38
9       3.36   3.01   2.81   2.69   2.61   2.55   2.51   2.47   2.44   2.42   2.30   2.25
10      3.29   2.92   2.73   2.61   2.52   2.46   2.41   2.38   2.35   2.32   2.20   2.16
11      3.23   2.86   2.66   2.54   2.45   2.39   2.34   2.30   2.27   2.25   2.12   2.08
12      3.18   2.81   2.61   2.48   2.39   2.33   2.28   2.24   2.21   2.19   2.06   2.01
13      3.14   2.76   2.56   2.43   2.35   2.28   2.23   2.20   2.16   2.14   2.01   1.96
14      3.10   2.73   2.52   2.39   2.31   2.24   2.19   2.15   2.12   2.10   1.96   1.91
15      3.07   2.70   2.49   2.36   2.27   2.21   2.16   2.12   2.09   2.06   1.92   1.87
16      3.05   2.67   2.46   2.33   2.24   2.18   2.13   2.09   2.06   2.03   1.89   1.84
17      3.03   2.64   2.44   2.31   2.22   2.15   2.10   2.06   2.03   2.00   1.86   1.81
18      3.01   2.62   2.42   2.29   2.20   2.13   2.08   2.04   2.00   1.98   1.84   1.78
19      2.99   2.61   2.40   2.27   2.18   2.11   2.06   2.02   1.98   1.96   1.81   1.76
20      2.97   2.59   2.38   2.25   2.16   2.09   2.04   2.00   1.96   1.94   1.79   1.74
21      2.96   2.57   2.36   2.23   2.14   2.08   2.02   1.98   1.95   1.92   1.78   1.72
22      2.95   2.56   2.35   2.22   2.13   2.06   2.01   1.97   1.93   1.90   1.76   1.70
23      2.94   2.55   2.34   2.21   2.11   2.05   1.99   1.95   1.92   1.89   1.74   1.69
24      2.93   2.54   2.33   2.19   2.10   2.04   1.98   1.94   1.91   1.88   1.73   1.67
25      2.92   2.53   2.32   2.18   2.09   2.02   1.97   1.93   1.89   1.87   1.72   1.66
30      2.88   2.49   2.28   2.14   2.05   1.98   1.93   1.88   1.85   1.82   1.67   1.61
40      2.84   2.44   2.23   2.09   2.00   1.93   1.87   1.83   1.79   1.76   1.61   1.54
50      2.81   2.41   2.20   2.06   1.97   1.90   1.84   1.80   1.76   1.73   1.57   1.50
100     2.76   2.36   2.14   2.00   1.91   1.83   1.78   1.73   1.69   1.66   1.49   1.42

F-table (right tail), α = 0.05
For each row (denominator degrees of freedom k) and column (numerator degrees of freedom j), the table entry e satisfies Pr(F(j, k) ≥ e) = α.

[Table of 0.05 critical values of F(j, k), for numerator degrees of freedom j = 1–10, 20, 30 and denominator degrees of freedom k = 1–25, 30, 40, 50, 100.]

F-table (right tail), α = 0.01
For each row (denominator degrees of freedom k) and column (numerator degrees of freedom j), the table entry e satisfies Pr(F(j, k) ≥ e) = α.

[Table of 0.01 critical values of F(j, k), for numerator degrees of freedom j = 1–10, 20, 30 and denominator degrees of freedom k = 1–25, 30, 40, 50, 100.]

Exercise Solutions

Chapter 2
1. From the R output we have the following:
a) β̂ᵗ is read directly from the Estimate column of the output.
b) μ̂1 = β̂0 + β̂1x11 + β̂2x21, and the corresponding estimated residual is r̂1 = y1 − μ̂1 = −0.26.

2. Use R to fit the model yi = β0 + β1xi1 + ... + βp xip + ri to the assessment data (assessment.txt) with
a) all 5 explanatory variates
I used the R code
a<-read.table("assessment.txt",header=T)
attach(a)
b<-lm(value~size+age+office+ratio+location)
summary(b)
In the output, the intercept (p = 0.000288) and the coefficient of size (p = 9.22e-05) are strongly significant, age is marginally significant (p = 0.081), and office, ratio and location are not significant. The residual standard error is 5.993 on 32 degrees of freedom, with multiple R-squared 0.4537, adjusted R-squared 0.3683 and F-statistic 5.315 on 5 and 32 DF (p-value 0.00115).

b) only age and size
To fit the model with only age and size, use the R code
c<-lm(value ~ size + age)
summary(c)
In this output the intercept and the coefficients of size and age are all significantly different from 0; the residual standard error is 6.042 on 35 degrees of freedom, with multiple R-squared 0.3927, adjusted R-squared 0.358 and F-statistic 11.32 on 2 and 35 DF (p-value 0.000162).
c) Do the estimated coefficients change? Why?
Yes, in general, especially the estimated coefficient for size. If we write X = (X1 X2), where X corresponds to the full model and X1 corresponds only to the intercept, size and age, then

XᵗX = [ X1ᵗX1  X1ᵗX2 ]
      [ X2ᵗX1  X2ᵗX2 ]

and the estimates found by calculating β̂ = (β̂1ᵗ, β̂2ᵗ)ᵗ = (XᵗX)⁻¹Xᵗy will differ unless X1ᵗX2 = 0, because in general the top left block of (XᵗX)⁻¹ is not equal to (X1ᵗX1)⁻¹ unless X1ᵗX2 = 0. You can interpret this last condition geometrically: the product is 0 if the columns of X1 are orthogonal to the columns of X2. In this example, we do not have this orthogonality.

3. Suppose we have the returns on an asset yi, the return on the market xi1 and the risk-free return xi2 for n periods. Consider three regression models:
Model 1: (yi − xi2) = β(xi1 − xi2) + ri
Model 2: yi = β0 + β1xi1 + ri
Model 3: yi = γ0 + γ1xi1 + γ2xi2 + ri

If we fit each model, will the coefficient of x1, the measure of volatility, change? Explain.
We can write the first two models as special cases of the third:
Model 1: yi = (0)1 + (β)xi1 + (1 − β)xi2 + ri
Model 2: yi = β0 + β1xi1 + (0)xi2 + ri
Model 3: yi = γ0 + γ1xi1 + γ2xi2 + ri
Yes, the coefficient of x1 is likely to change for each model. In fitting the models, we project onto a different subspace in each case, so the coefficient will change. Again the result depends on the orthogonality of the vectors 1, x1, x2, as in the previous question.

4. Suppose we have a response variate yi and a single explanatory variate xi1 for each of n units sampled from a population. Consider the two models
Model 1: yi = β0 + β1xi1 + ri
Model 2: yi = γ0 + γ1(xi1 − x̄1) + ri
where x̄1 is the sample average of the explanatory variate.
a) Show that the vectors x1 − x̄1·1 and 1 are orthogonal.
We have 1ᵗ(x1 − x̄1·1) = Σi(xi1 − x̄1) = 0, since x̄1 is the sample average of the explanatory variate.
b) Why is span(1, x1) = span(1, x1 − x̄1·1)?
Since x1 − x̄1·1 is a linear combination of 1 and x1, and 1, x1 − x̄1·1 are orthogonal (hence linearly independent), the two spans are the same subspace.
c) In fitting models 1 and 2, we project onto a subspace. How are those projections different?
Since the two subspaces are the same, the projections are the same vector.
d) What is the relationship between the estimated coefficients in fitting the two models?
Since the projections are the same, we must have β̂0·1 + β̂1x1 = γ̂0·1 + γ̂1(x1 − x̄1·1) = (γ̂0 − γ̂1x̄1)1 + γ̂1x1, and since 1, x1 are linearly independent, β̂1 = γ̂1 and β̂0 = γ̂0 − γ̂1x̄1.
e) How does the result in a) simplify the calculation of γ̂ when fitting model 2?

Since 1 and x1 − x̄1·1 are orthogonal, the matrix XᵗX is diagonal, and hence the inverse is found by inverting the diagonal elements.

5. We defined the hat matrix H = X(XᵗX)⁻¹Xᵗ. Show that:
a) Hᵗ = H
Hᵗ = (X(XᵗX)⁻¹Xᵗ)ᵗ = X[(XᵗX)⁻¹]ᵗXᵗ, using the result that (AB)ᵗ = BᵗAᵗ. Now consider the transpose of the matrix XᵗX. We have (XᵗX)ᵗ = XᵗX, so this matrix is symmetric. Next consider the inverse S⁻¹ of any symmetric matrix S. We have (S⁻¹S)ᵗ = Iᵗ = I and also (S⁻¹S)ᵗ = Sᵗ(S⁻¹)ᵗ = S(S⁻¹)ᵗ since S is symmetric. Combining the two equations, we have S(S⁻¹)ᵗ = I and hence (S⁻¹)ᵗ = S⁻¹, since the matrix inverse is unique. We conclude that the inverse of a symmetric matrix is symmetric, and hence [(XᵗX)⁻¹]ᵗ = (XᵗX)⁻¹ and Hᵗ = H.
b) H² = H
H² = (X(XᵗX)⁻¹Xᵗ)(X(XᵗX)⁻¹Xᵗ) = X(XᵗX)⁻¹(XᵗX)(XᵗX)⁻¹Xᵗ = X(XᵗX)⁻¹Xᵗ = H. Interpreted geometrically: H is the projection onto the column space of X, so for any vector y we know that H(Hy) = Hy, since Hy is already in the column space of X; applying H to a vector in this space has no effect.
c) (I − H)² = (I − H) and H(I − H) = 0
We have (I − H)² = (I − H)(I − H) = I − H − H + H² = I − H and H(I − H) = H − H² = 0, as required.
d) 0 ≤ hii ≤ 1, where hii is the ith diagonal element of H.
Since H = H² = HᵗH, we have hii = h²i1 + h²i2 + ... + h²i(p+1), or hii − h²ii = h²i1 + ... + h²i(p+1), where the h²ii term is removed from the right-hand side. Hence hii(1 − hii) ≥ 0, or equivalently 0 ≤ hii ≤ 1.

6. Some questions about R².
a) In question 1, which model gave a larger value for R²?
The model with more explanatory variates gave a larger value of R² (0.4537 versus 0.3927).
b) Show that R² cannot decrease if we add extra terms to a model.
R² = 1 − (residual sum of squares from fitting the model)/Σ(yi − ȳ)². As we add terms to a model, the residual sum of squares must go down (at least it cannot go up), since it is the minimum value of the function ||y − Xβ||². Hence the need to plot the data (or the estimated residuals) to understand the fit.

7. The data in the file anscombe.txt were produced by F.J. Anscombe (American Statistician 27, 17-21) to demonstrate the importance of plotting the data. The file contains 4 sets of (x, y) vectors, labeled x1-x4, y1-y4.
a) For each pair, fit a straight line model and report the estimated parameters and the coefficient of determination R².
b) For each pair, construct a scatterplot of y versus x and add the fitted line.
c) Comment.
I used the following R code to fit the four lines:
A<-read.table("anscombe.txt", header=T)
attach(A)
B1<-lm(y1~x1); summary(B1)
B2<-lm(y2~x2); summary(B2)
B3<-lm(y3~x3); summary(B3)
B4<-lm(y4~x4); summary(B4)
The estimated parameters are virtually identical across the four cases: β̂0 ≈ 3.00, β̂1 ≈ 0.50, σ̂ ≈ 1.24 and R² ≈ 0.666 in every case, so the fitted models and residual sums of squares are virtually identical. The plots on the next page, however, show that the relationships between y and x are very different, so it is a mistake to interpret R² as a measure of fit of the model to the data.

Chapter 3 Solutions
1. Some ideas about confidence intervals.
a) Using the R output given for the sales promotion example, find a 99% confidence interval for the effect of past sales on the response. What can you conclude?
From the output, we have the estimate 0.942 and Std. Error 0.026 for the coefficient of past sales. The underlying variability is estimated with 25 degrees of freedom, and for 99% confidence we have Pr(|t25| ≤ 2.79) = 0.99. The confidence interval is 0.942 ± 2.79 × 0.026, or 0.942 ± 0.073. With all the other explanatory variates in the model, the effect of past sales is close to 1, i.e. current sales are close to a constant plus a term proportional to past sales.
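As a quick check, the multiplier and the interval can be computed in base R:
qt(0.995, 25)                              # 2.79, since 99% confidence leaves 0.5% in each tail
0.942 + c(-1, 1) * qt(0.995, 25) * 0.026   # the interval 0.942 +/- 0.073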

b) How does the confidence interval change as we increase the confidence level?
The form of the confidence interval is estimate ± c × st. err., where Pr(|t25| ≤ c) is the confidence level. As the confidence level increases, the constant c increases and so the interval gets wider (the center stays the same).
c) Suppose we have θ̃ ~ G(θ, dσ), the estimator for a parameter θ, and the statistically independent σ̃ with n − (p+1) degrees of freedom. Derive the confidence interval for θ.
Since θ̃ ~ G(θ, dσ), we have (θ̃ − θ)/(dσ) ~ G(0, 1). Also we have σ̃/σ ~ K with n − (p+1) degrees of freedom. Taking the ratio, we get
(θ̃ − θ)/(dσ̃) = [(θ̃ − θ)/(dσ)] / (σ̃/σ) ~ G(0, 1)/K(n−(p+1)) = t(n−(p+1)).
For a particular level of confidence CL, we have Pr(|t(n−(p+1))| ≤ c) = CL, so Pr(|θ̃ − θ|/(dσ̃) ≤ c) = CL or, re-arranging the inequality, Pr(θ̃ − cdσ̃ ≤ θ ≤ θ̃ + cdσ̃) = CL. Finally, to find the confidence interval, we replace the estimators by the estimates to get (θ̂ − cdσ̂, θ̂ + cdσ̂).
d) Show that θ0 is in the 95% confidence interval for θ if and only if the p-value for the test of the hypothesis θ = θ0 exceeds 5%.
θ0 is in the interval (θ̂ − cdσ̂, θ̂ + cdσ̂) if and only if θ̂ − cdσ̂ ≤ θ0 ≤ θ̂ + cdσ̂, or equivalently |θ̂ − θ0|/(dσ̂) ≤ c, where Pr(|t(n−(p+1))| ≤ c) = 0.95, i.e. Pr(|t(n−(p+1))| ≥ c) = 0.05. The p-value for the hypothesis θ = θ0 is Pr(|t(n−(p+1))| ≥ |θ̂ − θ0|/(dσ̂)). This probability is greater than 0.05 if and only if |θ̂ − θ0|/(dσ̂) ≤ c, as required.

2. In a small study, a company that manufactures candle wax examined 20 candles made from batches of wax that have different amounts of fragrance oil added. The company was interested in understanding the relationship between the hardness of the candles (a technical measurement) and the amount of fragrance oil added. The data are stored in the file hardness.txt; the variates are named hardness and frag.oil. Consider the simple model hardness = β0 + β1 frag.oil + residual.
a) Interpret the parameter β0.
β0 represents the average hardness in the study population of candles if the level of fragrance oil is 0.
b) Find a 95% confidence interval for this parameter.
We can fit the simple model using the R statements

a<-read.table("hardness.txt",header=T)
attach(a)
b<-lm(hardness~frag.oil)
summary(b)
In the output, the estimate of β0 is 1.1725 with standard error 0.3203 (p = 0.00179), and the coefficient of frag.oil is strongly significant (p = 9.07e-15). The residual standard error is 0.4683 on 18 degrees of freedom, with multiple R-squared 0.9669, adjusted R-squared 0.965 and F-statistic 525.2 on 1 and 18 DF (p-value 9.07e-15). Since the underlying variability is estimated with 18 degrees of freedom, we have Pr(|t18| ≤ 2.10) = 0.95, and hence the 95% confidence interval is 1.1725 ± 2.10 × 0.3203, or 1.1725 ± 0.673.
c) Find a 95% confidence interval for the average hardness of candles made with 2% fragrance oil.
The parameter of interest is θ = β0 + β1(0.02). We can estimate θ by θ̂ = β̂0 + β̂1(0.02) = 1.3119, but we need to use R to find the corresponding standard error. There are two approaches. If we let uᵗ = (1, 0.02), then we can calculate σ̂√(uᵗ(XᵗX)⁻¹u) with the statements
u<-c(1,0.02)
X<-model.matrix(b)
sterr<-0.4683*sqrt(t(u)%*%solve(t(X)%*%X)%*%u)
sterr
to get 0.3146, and hence the 95% confidence interval is 1.3119 ± 2.10 × 0.3146, or 1.3119 ± 0.661. Alternately, we can use the statements
new<-data.frame(frag.oil=0.02)
p<-predict(b,newdata=new,interval="c",level=0.95)
p
to get the output
      fit       lwr      upr
1  1.311952 0.6510668 1.972837
Note the argument interval="c" produces a confidence interval for the mean when frag.oil = 0.02, not a prediction interval.
d) Add a quadratic term to the model (in R, f2<-frag.oil*frag.oil creates a vector with components the square of those in frag.oil). Is there any evidence of curvature in the relationship?
We fit the extended model with
f2<-frag.oil*frag.oil
c<-lm(hardness~frag.oil+f2)
summary(c)
To test the hypothesis that the coefficient of the square term is 0, we use the p-value for f2 in the output, which is 0.9301, so there is no evidence that the coefficient is different from 0. In other words, there is no evidence of curvature in the relationship between hardness and the amount of fragrance oil. (For this fit, the residual standard error is 0.4818 on 17 degrees of freedom, with F-statistic 248.1 on 2 and 17 DF and p-value 2.635e-13.)

3. Using the data in the promotion trial described in this chapter, find a 95% prediction interval using promotion 2 if the past sales are
a) $10000 and the competitor sales are $3000.
We use the R statements to fit the model and produce the prediction interval:
a<-read.table("trial.txt",header=T)
attach(a)
b<-lm(response~x1+x2+pst.sales+comp.sales)
new<-data.frame(x1=0,x2=1,pst.sales=10000,comp.sales=3000)
p<-predict(b,newdata=new,interval="p",level=0.95)
p
with fitted value about 9494 and 95% prediction interval (8996, 9992). Can you see any difficulty with this prediction? If we look at the original data set [the R statement summary(a) is helpful], we see that the largest pst.sales value is $1918, so the value of $10000 is an extreme extrapolation. We have no idea if the model fits in this unexplored region.
b) Construct a prediction interval for the change in sales if promotion 2 is used rather than promotion 1 for the same store (i.e. past and competitor sales are fixed). [You will need to go back to first principles.]
If promotion 2 is used, let the response be Y(2) = β0 + β2 + β3 pst.sales + β4 comp.sales + R(2); the corresponding response for promotion 1 is Y(1) = β0 + β1 + β3 pst.sales + β4 comp.sales + R(1). Hence the difference is
Y(2) − Y(1) = β2 − β1 + R(2) − R(1) ~ G(β2 − β1, √2 σ)
so (Y(2) − Y(1)) − (β̂2 − β̂1) ~ G(0, √2 σ), approximately. If we replace σ by σ̃ we get a t distribution with 25 degrees of freedom, so the 95% prediction interval for the change in response is β̂2 − β̂1 ± c√2 σ̂, numerically about −25 ± 60.

4. Prove that the components of β̂ are independent if and only if the columns of X are orthogonal.
We know that β̃ ~ N(β, σ²(XᵗX)⁻¹), and hence the components of β̃ are independent if and only if (XᵗX)⁻¹ is diagonal. This matrix is diagonal if and only if XᵗX is diagonal, and hence when the columns of X are orthogonal.

5. Prove Cov(β̃, r̃) = 0.

We have β̃ = (XᵗX)⁻¹XᵗY = β + (XᵗX)⁻¹XᵗR and
r̃ = Y − Xβ̃ = Y − X(XᵗX)⁻¹XᵗY = (I − H)Y = (I − H)(Xβ + R) = (I − H)R.
Hence
Cov(β̃, r̃) = Cov((XᵗX)⁻¹XᵗR, (I − H)R) = (XᵗX)⁻¹Xᵗ Cov(R, R)(I − H)ᵗ.
Now Cov(R, R) = Var(R) = σ²I and (I − H)ᵗ = I − H = I − X(XᵗX)⁻¹Xᵗ. Substituting, we get
Cov(β̃, r̃) = σ²(XᵗX)⁻¹Xᵗ(I − X(XᵗX)⁻¹Xᵗ) = σ²[(XᵗX)⁻¹Xᵗ − (XᵗX)⁻¹XᵗX(XᵗX)⁻¹Xᵗ] = 0
as required.

Chapter 4 Solutions
1. Suppose we have a discrepancy measure with an F distribution with 3 and 30 degrees of freedom.
a. Find Pr(F ≥ 2).
Using the tables, we have Pr(F(3,30) ≥ 2.92) = 0.05, so all we know is that Pr(F ≥ 2) > 0.05. From R we can use the statement 1-pf(2,3,30) to find Pr(F ≥ 2) = 0.1352.
b. Find a constant c so that Pr(F ≥ c) = 0.05.
From the tables we find c = 2.92.
c. What is the distribution of 1/F?
Note that F is the ratio of two K² distributions: F = K3²/K30², so 1/F = K30²/K3² ~ F(30,3).
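The table look-ups can be reproduced in R with the quantile function (a quick check, base R only):
qf(0.95, 3, 30)   # c = 2.92, since Pr(F >= c) = 0.05
qf(0.95, 30, 3)   # for comparison, the 95th percentile of 1/F's distribution F(30,3): 8.62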

2. In an industrial example, the manufacturer collects 60 observations to build a model to relate a product property y to two quantitative explanatory variates x1 and x2. Theory suggests that a linear model of the form y = β0 + β1x1 + β2x2 + r should describe the data. However, the analyst worries that additional second order terms of the form x1², x2², x1x2 should be included in the model. The data are stored in the file exercise2.txt. Does the addition of the extra terms contribute significantly to the fit of the model? [Note: in R you can create new variates such as x22<-x2*x2 to represent the quadratic terms.]
We use the following R statements to fit the full and reduced models and then carry out the ANOVA:
a<-read.table("exercise2.txt",header=T)
attach(a)
x11<-x1*x1; x12<-x1*x2; x22<-x2*x2
b<-lm(y~x1+x2+x11+x12+x22)
c<-lm(y~x1+x2)
anova(c,b)
with output
Analysis of Variance Table
Model 1: y ~ x1 + x2
Model 2: y ~ x1 + x2 + x11 + x12 + x22
  Res.Df     RSS  Df Sum of Sq      F  Pr(>F)
1     57 14.5765
2     54 12.1980   3    2.3785 3.5098 0.02123 *
so there is some evidence that one or more of the second order terms is necessary in the model.

3. In the product testing example (Example 2 in Chapter 4), use an F test to address the following questions.
a. Is there any evidence of differences among the new versions 2 to 6?
We fit the full model and then the model under the hypothesis β2 = β3 = ... = β6 = β. Note that in the reduced model the explanatory variate corresponding to β is x = x2 + x3 + ... + x6, and that the data set contains one other explanatory variate, pst.score. The R code is
a<-read.table("product.txt",header=T)
attach(a)
b<-lm(sat.score~-1+x1+x2+x3+x4+x5+x6+pst.score)
x<-x2+x3+x4+x5+x6
c<-lm(sat.score~-1+x1+x+pst.score)
anova(c,b)

The output is
Analysis of Variance Table
Model 1: sat.score ~ -1 + x1 + x + pst.score
Model 2: sat.score ~ -1 + x1 + x2 + x3 + x4 + x5 + x6 + pst.score
  Res.Df     RSS  Df Sum of Sq      F   Pr(>F)
1     45 2.29400
2     41 1.60391   4   0.69009 4.4101 0.004670 **
There is strong evidence of differences among the 5 new versions, after controlling for pst.score.
b. Versions 4, 5 and 6 share a common feature. Is there any evidence that these versions have significantly different average satisfaction scores?
If β4 = β5 = β6 = β, the model becomes Y = β1x1 + β2x2 + β3x3 + β(x4 + x5 + x6) + β7 pst.score + R. To test the hypothesis of no difference among versions 4, 5 and 6, we fit the reduced model and use the change in the residual sum of squares as the basis for the discrepancy measure. The following R statements produce the F test:
x<-x4+x5+x6
c<-lm(sat.score~-1+x1+x2+x3+x+pst.score)
anova(c,b)
The output is
Analysis of Variance Table
Model 1: sat.score ~ -1 + x1 + x2 + x3 + x + pst.score
Model 2: sat.score ~ -1 + x1 + x2 + x3 + x4 + x5 + x6 + pst.score
  Res.Df     RSS  Df Sum of Sq      F   Pr(>F)
1     43 2.12109
2     41 1.60391   2   0.51717 6.6101 0.003249 **
There is strong evidence of differences among versions 4, 5 and 6.

4. If we have a single parameter θ, we can test the hypothesis θ = 0 in two ways.
a. Explain how we can test the hypothesis using a t-test.

We can use the discrepancy measure |θ̂|/(dσ̂), where stdev(θ̃) = dσ, and calculate the p-value as Pr(|t(df)| ≥ |θ̂|/(dσ̂)), where df is the degrees of freedom associated with the residual sum of squares.

b. Explain how we can test the hypothesis using an F test.

We can fit the full model to find the residual sum of squares. This sum of squares divided by the associated degrees of freedom df (the same as in a.) is the denominator of the discrepancy measure. Then we can fit a reduced model which excludes the explanatory variate associated with θ (since under the hypothesis θ = 0), and again calculate the residual sum of squares. The difference in the residual sums of squares (here with 1 degree of freedom) is the numerator of the discrepancy measure. We calculate the p-value by finding Pr(F(1, df) ≥ discrepancy measure).
c. Consider again the product testing example described in Exercise 3. Consider the hypothesis that the coefficient β7 of the explanatory variate pst.score is 0. Test the hypothesis in the two ways and show that the p-value is identical. [This is always true, although a nuisance to prove.]

From the fit of the full model b and summary(b), we get the discrepancy measure 16.410 and p-value < 2e-16 for the t test. We can fit the reduced model with β7 = 0 and get the F test from the ANOVA with the R statements
c<-lm(sat.score~1+x1+x2+x3+x4+x5+x6)
anova(c,b)
The F test in the output has discrepancy measure 269.30 with p-value < 2.2e-16. Note that 269.30 = (16.410)².
d. If t ~ t(k), show that t² has an F distribution. What are the degrees of freedom?
We know that t(k) = G(0,1)/K(k) and that G(0,1)² ~ K1². Hence
t(k)² = G(0,1)²/K(k)² = K1²/K(k)² = F(1,k).
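The identity can be checked numerically in R; for example, with k = 18 the squared two-sided t critical value equals the F(1,18) critical value:
qt(0.975, 18)^2   # 4.41
qf(0.95, 1, 18)   # also 4.41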

5. Some theory.
a. In the construction of the F test, explain why the additional sum of squares is always non-negative.
The first step is to fit the full model y = β0·1 + β1x1 + ... + βp xp + r by minimizing ||r||² = ||y − (β0·1 + β1x1 + ... + βp xp)||² with respect to β0, β1, ..., βp. The hypothesis puts some restriction on β0, β1, ..., βp, so when we minimize ||r||² under this restriction we cannot get


a smaller value than when there was no constraint. Hence the difference in the two minima must be non-negative.
b. Consider the model y = β0·1 + β1x1 + ... + βp xp + r. Show that if we replace each vector xj by the vector xj* = xj − x̄j·1, the model becomes y = α0·1 + β1x1* + ... + βp xp* + r. That is, the coefficients of the explanatory variates do not change.

Letting xj* = xj − x̄j·1 and substituting xj = xj* + x̄j·1, the model becomes
y = β0·1 + β1(x1* + x̄1·1) + ... + βp(xp* + x̄p·1) + r
  = (β0 + β1x̄1 + ... + βp x̄p)1 + β1x1* + ... + βp xp* + r,
and setting α0 = β0 + β1x̄1 + ... + βp x̄p gives the required result.

c. Explain why testing the hypothesis β1 = β2 = ... = βp = 0 will yield identical results for either formulation of the model.
Since span(1, x1, ..., xp) = span(1, x1*, ..., xp*), when fitting the full model the residual sum of squares is the same, since the vector of estimated residuals is the same. Under the hypothesis, the reduced models are identical, so the residual sum of squares will be the same. Hence the two tests are identical.

d. In the revised model, show that xj* ⊥ 1 for all j.
1ᵗxj* = Σi(xij − x̄j) = 0, so xj* ⊥ 1.

e. In testing the hypothesis, show that the additional sum of squares is β̂*ᵗ(X*ᵗX*)β̂*, where β* = (β1, ..., βp)ᵗ and X* = (x1*, ..., xp*). This quantity is often called the regression sum of squares.

Let X = (1 X*), so that in the second representation of the model we have

y = α0·1 + X*β* + r = (1 X*)(α0, β*ᵗ)ᵗ + r = Xβ + r.


Since X = (1 X*), we have

XᵗX = [ 1ᵗ1   1ᵗX*   ] = [ n   0     ]
      [ X*ᵗ1  X*ᵗX*  ]   [ 0   X*ᵗX* ]

since the columns of X* are orthogonal to 1. Hence we have

(XᵗX)⁻¹ = [ 1/n   0         ]
          [ 0     (X*ᵗX*)⁻¹ ]

Also Xᵗy = (1ᵗy, X*ᵗy)ᵗ, so

β̂ = (XᵗX)⁻¹Xᵗy = ( (1/n)1ᵗy , (X*ᵗX*)⁻¹X*ᵗy )ᵗ,

so α̂0 = ȳ and β̂* = (X*ᵗX*)⁻¹X*ᵗy. Now we can write y = α̂0·1 + X*β̂* + r̂, or equivalently y − ȳ1 = X*β̂* + r̂, where X*β̂* ⊥ r̂, and so ||y − ȳ1||² = ||X*β̂*||² + ||r̂||². The left side is the minimum of the residual sum of squares under the hypothesis β* = 0. Hence the additional sum of squares is ||X*β̂*||² = β̂*ᵗX*ᵗX*β̂*.
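A minimal numerical check of this decomposition, using simulated data (the x and y below are illustrative, not from the course files):
set.seed(1)
x <- rnorm(20)
y <- 1 + 2 * x + rnorm(20)
xs <- x - mean(x)                      # the centred column x*, orthogonal to 1
b <- lm(y ~ xs)
reg <- coef(b)[2]^2 * sum(xs^2)        # regression sum of squares, beta*' X*'X* beta*
res <- sum(resid(b)^2)                 # residual sum of squares
c(reg + res, sum((y - mean(y))^2))     # the two totals agree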


Chapter 5 Solutions
1. Consider the assessment data with simple model value = β0 + β1 age + β2 size + residual. Use the methods in this chapter to assess the fit of the model and to suggest remedies. Is the prediction of value for a building with size 13.9 and age 30 sensitive to any particular cases?
We start by fitting the simple model and looking at various plots of the residuals and the qq plot of the standardized residuals to examine the fit. The one common feature of all of the plots is the two large residuals, both of which correspond to a large fitted value and relatively small age and size. The qq plot of the standardized residuals is not linear but is highly distorted by the two large standardized residuals. Note, for this fit, the 95% prediction interval for age = 30, size = 13.9 is roughly −47 to 22, an interval so wide that it is useless. If we delete these two points and repeat the fit, there is no evidence against the fit in any of the plots, and the prediction interval is roughly −17 to 26, much narrower but still not useful. The bottom line here is that it is not feasible to use these data to assess a building that is so much larger than any other in the sample.

2. In an experimental plan, there were three explanatory variates x1, x2, x3 that were each assigned two values, here coded as −1 and +1. As well, the investigators looked at the response variate for the so-called center point x1 = 0, x2 = 0, x3 = 0. The data are shown below and can be found in the file ch5Exercise2.txt.

x1: -1 -1 -1  1  1  1  1  0  0  0  0
x2: -1  1  1 -1 -1  1  1  0  0  0  0
x3:  1 -1  1 -1  1 -1  1  0  0  0  0
y:  (the 11 response values are in the data file)

Suppose we fit a model y = β0 + β1x1 + β2x2 + β3x3 + r. In the summary output the coefficients of x1 and x2 are strongly significant while x3 is not (p = 0.119); the residual standard error is 0.7649 on 7 degrees of freedom, with multiple R-squared 0.969, adjusted R-squared 0.9558 and F-statistic 73 on 3 and 7 DF (p-value 1.203e-05). We drop x3 from the model. To assess the fit of the model y = β0 + β1x1 + β2x2 + r, consider two formal approaches.
a) Add quadratic terms x1², x2², x1x2 to the model and then test the hypothesis that the additional terms are unnecessary.
Use the R statements to create the new variates. You will discover that x1² = x2², so we can only add two terms to the model:
x11<-x1*x1; x12<-x1*x2
b<-lm(y~x1+x2)
c<-lm(y~x1+x2+x11+x12)
anova(b,c)
The output is
Analysis of Variance Table
Model 1: y ~ x1 + x2
Model 2: y ~ x1 + x2 + x11 + x12
  Res.Df    RSS  Df Sum of Sq      F Pr(>F)
1      8 5.9410
2      6 3.9852   2    1.9559 1.4724 0.3018
so there is no evidence that the second order terms are necessary, which provides support for the linear model.
b) Consider an extended model in which the mean of Y is a function μ(x1, x2) with no further specification. Show that the residual sum of squares from fitting this model is Σi Σj (yij − ȳi)², where i indexes the unique sets of explanatory variate values and j indexes the replicated observations within these sets. For the given data, use the additional residual sum of squares to test the hypothesis that the extended model is necessary. This is called a "pure residual" test of fit.
If we model the mean to be different for every set of values of x1, x2, then the least squares estimate of μ(x1, x2) is the average of the response variate values at x1, x2; if there is a single value of y at x1, x2, then μ̂(x1, x2) = y and the estimated residual is 0. Hence the residual sum of squares is Σi Σj (yij − ȳi)². In our example, there are repeated measurements only at the center point (0, 0, 0), where μ̂(x1, x2) = ȳ = 11.895, and the residual sum of squares is 0.1763 with 3 degrees of freedom. From the linear model, the residual sum of squares is 5.9410 with 8 degrees of freedom. The additional residual sum of squares is therefore 5.7647 with 5 degrees of freedom, and the F statistic (5 and 3 degrees of freedom) is

(5.7647/5) / (0.1763/3) = 19.6. The corresponding p-value is 0.017 [R code 1-pf(19.6,5,3)], so there is some evidence against the fit of the linear model.

3. Consider the data described in Chapter 3 in which a marketing firm wanted to compare two sales promotions against a control. The response variate is the weekly sales and there are four explanatory variates, two of which index the promotion used.
a) After fitting the full model, is there any evidence of lack of fit?
We fit the model response = β0·1 + β1x1 + β2x2 + β3 pst.sales + β4 comp.sales + r and look at the following 6 plots.

• The plot of the fitted values versus the estimated residuals shows no unusual patterns. There is one very large fitted value corresponding to case 11.
• The qq plot is quite straight, providing no evidence against the normality assumption.
• The plots of the estimated residuals versus pst.sales and comp.sales each show a point far to the right, again corresponding to case 11. Otherwise there are no apparent patterns.
• The plot of the studentized residuals shows that cases 5, 15 and 21 have large studentized residuals: 2.31, −2.43 and −2.70 respectively.
• The plot of the leverages hii shows one point with extreme leverage, again case 11.
There are no obvious transformations or additions to the model.

b) Suppose the primary question is to compare the two promotions adjusting for past and competitor's sales. Are there any cases that have a large influence on the conclusion about this comparison?
With the original fit, we can test the hypothesis that β1 = β2 using ANOVA by refitting a model with x = x1 + x2. The output of the anova function is given below.
Analysis of Variance Table
Model 1: response ~ x + pst.sales + comp.sales
Model 2: response ~ x1 + x2 + pst.sales + comp.sales
  Res.Df     RSS  Df Sum of Sq      F  Pr(>F)
1     26 13419.1
2     25 10640.1   1    2779.1 6.5297 0.01707 *
Deleting cases 5, 11, 15 and 21 one at a time gives p-values for the test of the hypothesis β1 = β2 ranging from 0.003 to 0.026, so the deletions have little effect on the conclusion that there is a difference in the two promotions.

4. We give the basic mathematics behind the arithmetic that we use for the calculations when deleting a single case.
a) Suppose u and v are two n × 1 column vectors and A = I + vuᵗ. Find the constant a so that (I + vuᵗ)⁻¹ = I + avuᵗ. [This is known as a rank one update.]
We have
I = A⁻¹A = (I + avuᵗ)(I + vuᵗ) = I + avuᵗ + vuᵗ + avuᵗvuᵗ = I + (a + 1 + auᵗv)vuᵗ,
and so a is the solution to a + 1 + auᵗv = 0, or a = −1/(1 + uᵗv).
b) If C = B + uuᵗ where B is invertible, find an expression for C⁻¹.
We write C = B(I + B⁻¹uuᵗ) = B(I + vuᵗ) where v = B⁻¹u. Hence
C⁻¹ = (I + vuᵗ)⁻¹B⁻¹ = (I − vuᵗ/(1 + uᵗv))B⁻¹ = (I − B⁻¹uuᵗ/(1 + uᵗB⁻¹u))B⁻¹.
c) Suppose we consider dropping the first case when fitting the model y = Xβ + r. The key step is to find an expression for the inverse of X₋₁ᵗX₋₁, where X₋₁ is the matrix X with the first row u1ᵗ omitted. Show that XᵗX = X₋₁ᵗX₋₁ + u1u1ᵗ and hence find an expression for (X₋₁ᵗX₋₁)⁻¹.
We can write X = (u1ᵗ; X₋₁), stacked by rows, where u1ᵗ gives the values of the explanatory variates for the first case. Hence we have XᵗX = u1u1ᵗ + X₋₁ᵗX₋₁, so X₋₁ᵗX₋₁ = XᵗX − u1u1ᵗ and, applying the rank one update from b),
(X₋₁ᵗX₋₁)⁻¹ = (I + (XᵗX)⁻¹u1u1ᵗ/(1 − u1ᵗ(XᵗX)⁻¹u1))(XᵗX)⁻¹.
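A numerical check of the case-deletion formula, with simulated values (the X and the sample size here are illustrative):
set.seed(2)
X <- cbind(1, matrix(rnorm(20), 10, 2))
u1 <- X[1, ]                                  # explanatory variates for the first case
X1 <- X[-1, ]                                 # X with the first case deleted
A <- solve(t(X) %*% X)
lhs <- solve(t(X1) %*% X1)
rhs <- A + (A %*% u1 %*% t(u1) %*% A) / as.numeric(1 - t(u1) %*% A %*% u1)
max(abs(lhs - rhs))                           # effectively zero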

Chapter 6 Solutions
1. Suppose the columns of X are orthogonal. Show that the estimate of βj, the coefficient of xj, does not depend on which columns of X are included in the model.
Suppose we have a model that includes xj and any other columns of X. We can write the model as y = Uα + r. The columns of U are also orthogonal, so UᵗU is diagonal, and the diagonal element corresponding to βj is xjᵗxj. Hence the diagonal element of (UᵗU)⁻¹ corresponding to βj is 1/(xjᵗxj) and, since (UᵗU)⁻¹ is also diagonal, we have β̂j = xjᵗy/(xjᵗxj), independent of all the other explanatory variates.

2. Show that cp = p + 1 for the full model that includes all p explanatory variates.
By definition, if there are k explanatory variates in the model (plus a constant term),
cp = (estimated residual sum of squares)/σ̂² + 2(k + 1) − n.
If we fit the full model with p explanatory variates, the estimated residual sum of squares is (n − (p+1))σ̂², so
cp = (n − (p+1))σ̂²/σ̂² + 2(p + 1) − n = p + 1, as required.

3. The file ch6Exercise3.txt contains a response variate y and 10 explanatory variates x1, ..., x10 for 100 cases. These data were created artificially for practice; the model used to generate the data was Y = 3x1 + 0.3x2 − 2x4 + x7 − x9 + R, R ~ G(0, 2). Note that the columns of X are not orthogonal.
a) Fit a model using forward selection. At each step, use a p-value of 0.05 to decide to proceed.
We start with all one variate models and pick the one with the highest R² value, if any are significant at the 0.05 level; the highest value corresponds to x7, with R² = 0.5084. Next we build all two variate models that include x7 and select the variate giving the highest R², if its coefficient is significant; the highest value, R² = 0.5947, corresponds to adding x8. Continuing in the same way, the later steps add x1, x4 and x10, after which no remaining variate has a coefficient with p-value less than 0.05 and we stop. We end with the model that includes x1, x4, x7, x8, x10, with R² = 0.781.
b) Fit a model using backwards selection, using a p-value of 0.05 to decide to proceed at each step.
We start with the ten variate model and delete the least significant variate at each step, removing x5, x2, x6 and x1 in the order determined by the p-values. With the remaining six variates x3, x4, x7, x8, x9, x10, all coefficients have p-values less than 0.05 and we stop.
c) Use leaps to investigate all possible models. Pick a reasonable model.
The output from the R code lists, for each model size from one to eight variates, which variates are included together with cp and adjusted R². The best choice is the five variate model with x1, x4, x7, x8, x10, with many other good candidates; we have R² = 0.782.
d) How do the results of the three strategies compare in this case?
Here the forward selection and best subsets methods got us to the same model. The backwards selection got us to a six variate model which has good cp and adjusted R² values. None of the methods reproduced the model used to generate the data; this is not surprising because of the correlations among the columns of the X matrix.
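For reference, a sketch of how these searches can be automated; step() and regsubsets() (from the leaps package) are the standard tools, although their AIC-based stopping rules differ slightly from the p-value rule used above. The code assumes ch6Exercise3.txt is in the working directory with the response column named y:
a <- read.table("ch6Exercise3.txt", header = TRUE)
full <- lm(y ~ ., data = a)
step(lm(y ~ 1, data = a), scope = formula(full), direction = "forward")  # forward selection
step(full, direction = "backward")                                       # backwards selection
library(leaps)
s <- summary(regsubsets(y ~ ., data = a, nvmax = 10))                    # all subsets
cbind(s$which, cp = s$cp, adjr2 = s$adjr2)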

Chapter 8
1. Consider the sampling protocols defined in Example 1.
a) Show that the inclusion probability for each unit in the frame is 1/100 for every protocol.
For each protocol, the chance of any possible sample is equal; that is, the model is uniform. To find the inclusion probability pi, we need only count the number of samples that contain a particular unit. Once the unit is in the sample, we count the ways of selecting the remaining 99 units. Writing C(N, k) for the binomial coefficient:
SRS: there are C(9999, 99) ways to select the other units, so pi = C(9999, 99)/C(10000, 100) = 1/100.
Systematic sampling: there is only one way to select the sample containing the unit, so pi = 1/100.
Stratified sampling: there are C(999, 9)·C(1000, 10)⁹ ways to select the other units, so pi = C(999, 9)·C(1000, 10)⁹ / C(1000, 10)¹⁰ = 1/100.
Cluster sampling: there are C(999, 9) ways to choose the remaining clusters, so pi = C(999, 9)/C(1000, 10) = 1/100.
Two stage sampling: there are C(9, 1) ways to select the second primary unit, and the other secondary units can be selected in C(999, 49)·C(1000, 50) ways, so pi = C(9, 1)·C(999, 49)·C(1000, 50) / (C(10, 2)·C(1000, 50)²) = 1/100.
b) On a final examination, a student once defined simple random sampling as follows: "simple random sampling is a method of selecting units from a population so that every unit has the same chance of selection". Is this a correct answer?
No, because there are many sampling protocols that satisfy this definition, as shown in a).
c) Show that the estimator corresponding to the sample average μ̂ = Σ_{i∈s} yi / n is unbiased for μ for each of the protocols.
Let Ii = 1 if unit i is in the sample and 0 otherwise, i = 1, ..., N, so that E(Ii) = pi. Then we can write μ̃ = Σ_{i∈U} yi Ii / n and
E(μ̃) = Σ_{i∈U} yi E(Ii) / n = Σ_{i∈U} yi pi / n = μ,
since pi = n/N for all five protocols.

2. Consider the estimate σ̂² = Σ_{i∈s}(yi − ȳ)² / (n − 1) and the corresponding estimator σ̃².
a) For SRS, show that σ̃² is an unbiased estimator for σ². [Hint: use the fact that Σ(yi − ȳ)² = Σ yi² − nȳ².]
Using the hint, we can write (n − 1)σ̃² = Σ_{i∈U} yi² Ii − nμ̃², where E(Ii) = n/N and
E(μ̃²) = Var(μ̃) + E(μ̃)² = (1 − f)σ²/n + μ².
Combining the results, we have
E(σ̃²) = (1/(n−1)) ( Σ_{i∈U} yi² (n/N) − n[(1 − n/N)σ²/n + μ²] )
       = (1/(n−1)) ( (n/N)[Σ_{i∈U} yi² − Nμ²] − (1 − n/N)σ² )
       = (1/(n−1)) ( (n/N)(N − 1)σ² − (1 − n/N)σ² )
       = σ².
b) Is σ̃ unbiased for σ?
No, using the result from Assignment 2.
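The inclusion probability in 1a) and the unbiasedness in 1c) are easy to check by simulation; a quick sketch for SRS (frame and sample sizes as in Example 1, with an arbitrary simulated population):
N <- 10000; n <- 100
y <- rgamma(N, shape = 2)                 # an arbitrary population of y values
sims <- replicate(5000, {
  s <- sample(N, n)                       # a simple random sample of labels
  c(1 %in% s, mean(y[s]))
})
mean(sims[1, ])                           # proportion of samples containing unit 1, near 1/100
mean(sims[2, ])                           # average of the sample means, near mean(y)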

3. To estimate the total number of male song sparrows in a 10 km by 10 km square (http://www.birdsontario.org/atlas/atlasmain.html) for a breeding bird atlas, a simple random sample of 50 one hectare plots (a hectare is 100m by 100m) is selected. Using a GPS system, your intrepid instructor visits each of the selected plots (after dawn but before 9:00 am between May 24 and July 6) and counts the number of singing male song sparrows detected in a 10 minute period. The data are summarized below.

# of sparrows   0   1   2   3   4
# of plots     28  13   5   3   1

a) Find a 95% confidence interval for the total number of male song sparrows in the square.
The data can be written as y1, ..., y50, where 28 of the yi are 0 and so on. The sample mean and standard deviation are μ̂ = 0.72 and σ̂ = 1.011, so a 95% confidence interval for the average number of sparrows per plot is
μ̂ ± 1.96 √(1 − 50/10000) σ̂/√50 = 0.72 ± 0.28.
The 95% confidence interval for the total number of male sparrows in the square (τ = 10000μ) is 7200 ± 2800.
b) Suppose that I wanted to estimate the total number of male song sparrows to within ±1000 with 95% confidence. How many additional plots are needed?
We need the half-length of the interval for μ to be 0.1, i.e. 1.96 √(1 − n/10000) σ/√n = 0.1. Using σ̂ = 1.011, we can solve for n to find n = 378. Hence we need about 328 more plots to achieve the desired precision.
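The interval in a) can be reproduced in R from the frequency table:
counts <- rep(0:4, c(28, 13, 5, 3, 1))           # the 50 plot counts
n <- 50; N <- 10000
mu <- mean(counts); s <- sd(counts)              # 0.72 and 1.011
half <- 1.96 * sqrt(1 - n / N) * s / sqrt(n)
N * (mu + c(-1, 1) * half)                       # about 7200 +/- 2800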

4. Suppose we want to estimate a population average so that the relative precision is specified.
a) For a given confidence level and required precision p%, find a formula for the required sample size. That is, we want to find the sample size (SRS) so that the length of the confidence interval 2l divided by the sample average is pre-determined.
In general, the length of a confidence interval for μ is 2c √(1 − n/N) σ/√n. Setting
2c √(1 − n/N) σ / (√n μ) = p/100
and solving for n, we have the ugly formula
n = 1 / ( 1/N + (pμ/(200cσ))² ).
b) What knowledge of the population attributes do we need to make this formula usable?
We need an estimate of the so-called coefficient of variation σ/μ.

5. One cheap but poor way to check the quality of a batch of items is called acceptance sampling. Suppose that there are N = 1000 items in a shipment and you cannot tolerate more than 1% defective (your first mistake: why should you tolerate any defective items from your supplier?). You decide to select and inspect a sample of 20 items and accept the shipment if you find 0 defectives. If you find 1 or more defective items, you inspect the complete shipment.
a) How would you select the sample?
It would be nice to use SRS, but it is likely too expensive unless the items are already numbered and it is easy to locate an item with a specified label. Usually haphazard (for small items) or systematic sampling is used in this context.
b) Calculate the probability p(π) that you accept the shipment as a function of π, the proportion of defective items in the shipment.
Since we are sampling such a small fraction of the shipment, we can approximate the number of defective items in the sample by a binomial random variable with n = 20 and probability of a defective item π. Then we have P(accept shipment) = (1 − π)²⁰.
c) Graph p(π) for 0 ≤ π ≤ 10%.
Using R, we have the following graph. [Graph of p(π) against π for 0 ≤ π ≤ 0.10.]
d) Given the results in c), you decide to increase the sample size so that there is only a 5% chance of accepting a shipment with 1% defective. What sample size do you recommend?
Suppose the sample size is n. Assuming that we can use the binomial approximation, we have P(accept shipment) = (1 − π)ⁿ. We want to find n so that (1 − 0.01)ⁿ = 0.05, so n = 298. This is so large that the binomial approximation may break down and, on a practical basis, is completely unreasonable. Sampling inspection is not useful here; I recommend you tell your supplier to ensure that there are no defective items in the shipments.
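The graph in c) and the sample size in d) are quick computations in R:
p <- function(pi) (1 - pi)^20
curve(p, from = 0, to = 0.10, xlab = "proportion defective", ylab = "P(accept shipment)")
log(0.05) / log(1 - 0.01)                 # about 298, the n required in d)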

Chapter 9
1. Find the quadratic expansion of f(x, y) = y/x about the point (μ(x), μ(y)) and use it to estimate the bias in the estimator θ̃ = μ̃(y)/μ̃(x).
The general form of the expansion about (x0, y0) is
f(x, y) ≈ f(x0, y0) + f_x(x − x0) + f_y(y − y0) + f_xx(x − x0)²/2 + f_xy(x − x0)(y − y0) + f_yy(y − y0)²/2,
where the derivatives are evaluated at (x0, y0). This quadratic function has the same value and the same first and second derivatives at (x0, y0) as does f(x, y); you can easily check this statement by differentiating the right side of the expression. To use the expansion with f(x, y) = y/x at (μ(x), μ(y)), we have
∂f/∂x = −μ(y)/μ(x)²,  ∂f/∂y = 1/μ(x),  ∂²f/∂x² = 2μ(y)/μ(x)³,  ∂²f/∂x∂y = −1/μ(x)²,  ∂²f/∂y² = 0,
so we can write
θ̃ ≈ θ − (μ(y)/μ(x)²)[μ̃(x) − μ(x)] + (1/μ(x))[μ̃(y) − μ(y)] + (μ(y)/μ(x)³)[μ̃(x) − μ(x)]² − (1/μ(x)²)[μ̃(x) − μ(x)][μ̃(y) − μ(y)]
and
E(θ̃) ≈ θ + (μ(y)/μ(x)³)Var(μ̃(x)) − (1/μ(x)²)Cov(μ̃(x), μ̃(y)) = θ + [θ Var(μ̃(x)) − Cov(μ̃(x), μ̃(y))]/μ(x)².
We know Var(μ̃(x)) = (1 − f)σ(x)²/n and, with a bit of effort, we can show Cov(μ̃(x), μ̃(y)) = (1 − f)Cov(x, y)/n, where Cov(x, y) is the population covariance. The approximate bias is given by the second term. The key point is to notice that the bias has a factor (1 − f)/n and so will be small if the sample size is large.

2. In order to count the number of small items in a large container, a shipping company selects a sample of 25 items and weighs them. They then weigh the whole shipment (excluding the container). Assume that there is small error in weighing, and act as if SRS is used; it is not, since the sampling is haphazard. Let the weight of the ith item in the population be yi and the total known weight be τ.

a) Show that an estimate of the population size is N̂ = τ / μ̂(y).
Note that τ = Nμ(y), so we can construct an estimate of N using our knowledge of estimating μ(y). The sample average and population average should be close, so μ(y) ≈ Σ_{i∈s} yi / 25 and hence the estimate is N̂ = τ / (Σ_{i∈s} yi / 25).
b) Find the (approximate) mean and standard deviation of the corresponding estimator Ñ.
Consider expanding the function f(y) = 1/y. The linear approximation about μ(y) is
1/y ≈ 1/μ(y) − (y − μ(y))/μ(y)²,
and hence we have 1/μ̃(y) ≈ 1/μ(y) − (μ̃(y) − μ(y))/μ(y)², so E(1/μ̃(y)) ≈ 1/μ(y) and Var(1/μ̃(y)) ≈ Var(μ̃(y))/μ(y)⁴. Since Ñ = τ/μ̃(y), we have
E(Ñ) ≈ N,  Var(Ñ) ≈ (1 − f)N²σ(y)² / (n μ(y)²).
c) Find a 95% confidence interval for the total number of items in the container.
To find the confidence interval we have, approximately, Ñ ~ G(N, (1 − f)^{1/2} N σ(y)/(√n μ(y))). Note that the mean and standard deviation both depend on the unknown N, which is different from the usual situation. Instead we work with Ñ/N ~ G(1, (1 − f)^{1/2} σ(y)/(√n μ(y))), and hence we have
Pr( |Ñ/N − 1| ≤ 1.96 (1 − f)^{1/2} σ(y)/(√n μ(y)) ) = 0.95.
Re-arranging the inequality and substituting N̂ and σ̂(y), we get the confidence interval
( N̂ / (1 + 1.96(1 − f)^{1/2} σ̂(y)/(√n μ̂(y))),  N̂ / (1 − 1.96(1 − f)^{1/2} σ̂(y)/(√n μ̂(y))) ).
In the example, we have N̂ = 2044 and the 95% confidence interval is (2035, 2053).

3. Briefly describe when you would use the ratio or regression estimate instead of the sample average to estimate the population average.
We need an explanatory variate with known population average that can be measured on each unit in the sample. If the response variate is approximately proportional to the explanatory variate, then the ratio estimate is more precise than the sample average. If the response variate is approximately linear in the explanatory variate, then the regression estimate is more precise. We can exploit this knowledge when we are trying to estimate population totals or density.

4. Many bird species have specialized habitat. For example, wood thrush are a forest dwelling bird that live in the hardwood forests of eastern North America. Suppose we wanted to estimate the number of wood thrush pairs nesting within the region of Waterloo, an area of highly fragmented forest patches. Using aerial photography, we know that there are 1783 such patches (minimum size 3 ha) with an average size of 13.4 ha. A simple random sample of 50 woodlots is selected and the number of nesting pairs yi is counted in each woodlot by counting the number of singing males. The area xi of each sampled woodlot is also recorded. The data are available in the file exercise8.xls. Find 95% confidence intervals for the total number of thrushes based on the
a) sample average
b) ratio estimate
c) regression estimate
If you do not want to do the calculations, write out what summaries you need to get the three confidence intervals.
We need the following summaries of the data: μ̂(y) = 1.62, σ̂(y) = 1.34, μ̂(x) = 11.7, θ̂ = μ̂(y)/μ̂(x) = 0.138, Σ_{i∈s}(yi − θ̂xi)²/(n − 1) = 0.399, α̂ = μ̂(y) = 1.62, β̂ = 0.228, Σ_{i∈s}[yi − α̂ − β̂(xi − μ̂(x))]²/(n − 1) = 0.142.
a) The sample average is 1.62 with associated estimated standard deviation √((1 − f)σ̂(y)²/n) = 0.187, so an approximate 95% confidence interval for the average number of thrushes per woodlot is 1.62 ± 0.37, and the interval for the total number of thrushes is 1783(1.62 ± 0.37) = 2888 ± 653.
b) The ratio estimate is θ̂μ(x) = 1.85 with associated estimated standard deviation √((1 − f)/n × Σ(yi − θ̂xi)²/(n − 1)) = 0.088, so the interval for the average is 1.85 ± 0.17 and for the population total is 1783(1.85 ± 0.17) = 3299 ± 308.
c) The regression estimate is μ̂(y) + β̂(μ(x) − μ̂(x)) = 2.00 with associated estimated standard deviation √((1 − f)/n × Σ[yi − α̂ − β̂(xi − μ̂(x))]²/(n − 1)) = 0.053, so an approximate 95% confidence interval for the population average is 2.00 ± 0.10 and for the population total is 1783(2.00 ± 0.10) = 3566 ± 184.
The scatterplot of the number of thrushes versus woodlot area (next page) shows the fitted regression line (solid), the "fitted" line through the origin (dotted) and μ(x) = 13.4. [Scatterplot omitted.]
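A sketch of the three point estimates in R; the x and y below are simulated stand-ins for the exercise8.xls variates (area and count), with the known frame values N = 1783 and μ(x) = 13.4:
set.seed(3)
x <- rgamma(50, shape = 3, scale = 4)            # woodlot areas
y <- rpois(50, lambda = 0.15 * x)                # nesting pairs counted
N <- 1783; mux <- 13.4
N * mean(y)                                      # expansion of the sample average
N * mux * mean(y) / mean(x)                      # ratio estimate of the total
b <- lm(y ~ x)
N * (mean(y) + coef(b)[2] * (mux - mean(x)))     # regression estimate of the total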

Chapter 10
1. In many surveys, there is interest in estimating strata averages or differences in strata averages.
a) In general, for SRS, write down the distribution for the estimators μ̃h and μ̃h − μ̃k.
If there are H strata, then assuming relatively large sample sizes within the strata, we have approximately
μ̃h ~ G( μh, (1 − fh)^{1/2} σh/√nh )
μ̃h − μ̃k ~ G( μh − μk, √((1 − fh)σh²/nh + (1 − fk)σk²/nk) ).
b) In the well survey, find a 95% confidence interval for the average Na difference between the two types of farm wells.
For farms with animals, we have μ̂1 = 237.3 with associated estimated standard deviation √((1 − f1)σ̂1²/n1) = 3.275. For farms without animals, we have μ̂2 = 245.6 with associated estimated standard deviation 3.614. Hence we have μ̂2 − μ̂1 = 8.30 with associated estimated standard deviation √(3.275² + 3.614²) = 4.877, and a 95% confidence interval for μ2 − μ1 is 8.30 ± 9.56. There is no evidence of a difference in average Na levels between the two groups of farms.
c) In the well survey, find a 95% confidence interval for the proportion of wells in farms with animals that are contaminated.
The estimate of the proportion contaminated is π̂1 = 0.172 with associated estimated standard deviation √((1 − f1)π̂1(1 − π̂1)/n1) = 0.030, so the 95% confidence interval is 0.172 ± 0.058.

2. Suppose that the purpose of the survey is to estimate a population proportion π.
a) Write down the stratified estimate of π and the variance of the corresponding estimator.
Since π = W1π1 + ... + WHπH, we have π̂strat = W1π̂1 + ... + WHπ̂H and
Var(π̃strat) = W1²Var(π̃1) + ... + WH²Var(π̃H) = W1²(1 − f1)π1(1 − π1)/n1 + ... + WH²(1 − fH)πH(1 − πH)/nH
(ignoring the factors nh/(nh − 1)).

b) What is the variance of π̃strat for proportional allocation?
If nh = Wh n, we have
Var(π̃strat) = (1 − f)[W1π1(1 − π1) + ... + WHπH(1 − πH)]/n,
where f = n/N.
c) How should the strata be formed so that the stratified sampling protocol is superior to SRS?
We want to form the strata so that [W1π1(1 − π1) + ... + WHπH(1 − πH)] < π(1 − π). In words, we decrease the variation within the strata by making the response more consistent; we do this by making πh close to 0 or 1 for each stratum.

3. Suppose the well survey was to be re-done with the same overall sample size 500. How would you recommend allocating the sample to the strata if
a) estimating the average Na level was the primary goal?
For optimal allocation, we have nh ∝ Whσh. If we assume that the standard deviations do not change markedly, we can use the estimates from the current survey:

stratum   Weight   St Dev
1         0.097    37.45
2         0.177    41.62
3         0.726    51.23

so the optimal sample sizes are about 38, 76 and 386. More weight is given to stratum three because it is larger and has a higher estimated standard deviation.
b) estimating the proportion of contaminated wells was the primary goal?
For optimal allocation, we have nh ∝ Wh√(π̂h(1 − π̂h)), and we use the current estimates:

stratum   Weight   √(π̂h(1 − π̂h))
1         0.097    0.318
2         0.177    0.377
3         0.726    0.338

so the optimal sample sizes are about 45, 97 and 358.
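Each allocation is one line of arithmetic in R, using the stratum summaries above:
W <- c(0.097, 0.177, 0.726)
S <- c(37.45, 41.62, 51.23)
round(500 * W * S / sum(W * S))           # allocation for the mean: 38 76 386
sq <- c(0.318, 0.377, 0.338)
round(500 * W * sq / sum(W * sq))         # allocation for the proportion: about 45 97 358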

c) For each case, compare the predicted standard deviations of μ̃strat and π̃strat to what occurred in the current survey.
We have
Var(μ̃strat) = W1²(1 − f1)σ1²/n1 + ... + WH²(1 − fH)σH²/nH
Var(π̃strat) = W1²(1 − f1)π1(1 − π1)/n1 + ... + WH²(1 − fH)πH(1 − πH)/nH.
Using the current estimates and the two new allocations, we get the estimated standard deviations:

allocation   current     A       B
μ̃strat       2.42       2.11    2.13
π̃strat       0.016      0.015   0.015

The estimator of the proportion is much less sensitive to changes in the allocation.

4. Consider the difference of the variances of μ̃strat under proportional and optimal allocation for a sample of size n. Ignore the fpc.
a) Show that this difference can be written as (1/n) Σh (σh − σ̄)² Wh, where σ̄ = Σh σh Wh is the weighted average standard deviation over the H strata.
For optimal allocation, ignoring the fpc, we have Vopt = (W1σ1 + ... + WHσH)²/n and, for proportional allocation, we have Vprop = (W1σ1² + ... + WHσH²)/n. The weights can be considered a probability distribution on the integers 1, ..., H, so we have
Vprop − Vopt = (1/n)[ Σh σh² Wh − (Σh σh Wh)² ] = (1/n) Σh (σh − σ̄)² Wh,
as required.
b) When will the gain be large with optimal allocation relative to proportional allocation?
The gain will be largest when the standard deviations vary widely across the strata.

5. In an informal sample of math students at UW, 100 people were asked their opinion (on a 5 point scale) about the core courses and their value. One particular statement was (with scores): "All mathematics students are required to take Stat 231": strongly agree (1), agree (2), neutral (3), disagree (4), strongly disagree (5). The sample results, broken down by year, are shown below. Estimate the average score for all math students and find an approximate 95% confidence interval for the population average. Note that SRS was not used here, so we are making assumptions about the estimators that may be unwarranted. There are about 3300 students in the faculty.


Year   Sample size   Population weight   Average score   Standard deviation
1      39            0.31                2.8             1.22
2      23            0.24                3.5             1.09
3      26            0.23                3.2             1.03
4      12            0.22                3.1             0.87

We can estimate the average score as if we had stratified the sampling beforehand.

μ̂post = 0.31(2.8) + 0.24(3.5) + 0.23(3.2) + 0.22(3.1) = 3.126. The approximate estimated variance of μ̃post is

Σh (1 − fh)Wh²σ̂h²/nh

giving an estimated standard deviation of 0.107. The approximate 95% confidence interval is 3.13 ± 0.21.
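A short R check of this calculation, using the table above and the faculty size of 3300 (variable names are ours):

# post-stratified estimate and approximate 95% CI
W    <- c(0.31, 0.24, 0.23, 0.22)   # population weights by year
n    <- c(39, 23, 26, 12)           # sample sizes
ybar <- c(2.8, 3.5, 3.2, 3.1)       # average scores
s    <- c(1.22, 1.09, 1.03, 0.87)   # standard deviations
f    <- n / (W * 3300)              # stratum sampling fractions
mu_post <- sum(W * ybar)                        # 3.126
se      <- sqrt(sum((1 - f) * W^2 * s^2 / n))   # about 0.107
mu_post + c(-1, 1) * 1.96 * se                  # roughly 2.92 to 3.34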


Statistics 371 Midterm Solution

In Stat 371, we deal with applications and theory of the linear model Y = Xβ + R where X = (1 x1 ... xp) is an n × (p + 1) matrix with columns giving the values of the explanatory variates and R is a vector of random variables with independent components Ri ~ N(0, σ²). We represent the corresponding data model by y = Xβ + r where y is the vector of observed values of the response variate.

1. For the model described above:

a) (4 marks) Derive the least squares estimate of β, i.e. show that β̂ = (XᵗX)⁻¹Xᵗy. Be sure to explain the principles underlying your derivation.

The least squares criterion is to minimize Σi=1..n ri² = rᵗr = (y − Xβ)ᵗ(y − Xβ) with respect to β.

[Figure: the vector y, its orthogonal projection Xβ̂ onto the column space of X, and the residual vector r = y − Xβ̂.]

From the picture, the minimum value corresponds to the orthogonal projection of y onto the column space of X, so that y − Xβ̂ is perpendicular to 1, x1, ..., xp, the columns of X, or equivalently Xᵗ(y − Xβ̂) = 0. Solving, we have Xᵗy = XᵗXβ̂, so β̂ = (XᵗX)⁻¹Xᵗy as required.

b) (3 marks) Show that the estimator r̃ corresponding to the estimated residuals is MVN(0, σ²(I − H)) where H = X(XᵗX)⁻¹Xᵗ.

Since r̂ = y − Xβ̂ = y − X(XᵗX)⁻¹Xᵗy = (I − H)y, we have r̃ = (I − H)Y = (I − H)(Xβ + R) = (I − H)R (since HX = X), and hence r̃ is multivariate normal with mean vector and variance-covariance matrix

E(r̃) = E((I − H)R) = (I − H)0 = 0
Var(r̃) = Var((I − H)R) = (I − H)σ²I(I − H)ᵗ = σ²(I − H)

since (I − H)(I − H)ᵗ = (I − H) and I − H is a symmetric projection matrix.

c) (3 marks) Using the result in b), explain the notion of a unit in the sample that has high leverage.

We say that unit i has high leverage if the ith diagonal element of H, hii, is close to 1. Using the result in b), we have r̃i ~ N(0, (1 − hii)σ²), and if hii ≈ 1, we know that r̂i is close to 0. Since hii depends only on X, the fitted plane passes close to yi regardless of its value, so deleting case i may have a large effect on the fitted plane.
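These results are easy to check numerically; a minimal R sketch with simulated data (all names ours, not part of the midterm):

set.seed(371)
n <- 50
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 30 + 0.8 * x1 + rnorm(n)              # stand-in data
X  <- cbind(1, x1, x2)                      # n x (p + 1) model matrix
beta_hat <- solve(t(X) %*% X, t(X) %*% y)   # (X'X)^{-1} X'y
fit <- lm(y ~ x1 + x2)
cbind(beta_hat, coef(fit))                  # identical columns
H <- X %*% solve(t(X) %*% X) %*% t(X)       # hat matrix
range(diag(H) - hatvalues(fit))             # leverages agree with R's hatvalues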


2. Suppose we are interested in understanding the relationship between a response variate y and a specified explanatory variate x1. In an investigation, y, x1 and a second explanatory variate x2 are measured on a sample of 50 units from the study population. From R, the summary output from fitting the two models is:

Call: lm(formula = y ~ x1)
Residual standard error: 1.416 on 48 degrees of freedom
Multiple R-Squared: 0.4391, Adjusted R-squared: 0.4274
F-statistic: 37.58 on 1 and 48 DF, p-value: 1.587e-07

Call: lm(formula = y ~ x1 + x2)
Residual standard error: 1.421 on 47 degrees of freedom
Multiple R-Squared: 0.4475, Adjusted R-squared: 0.424
F-statistic: 19.03 on 2 and 47 DF, p-value: 8.807e-07

[The residual summaries and coefficient tables are too garbled in the source to reproduce; the value used below is the model 2 estimate of the x1 coefficient, 0.820.]

a) (2 marks) Consider the two models Model 1: Y = β01 + β1x1 + R and Model 2: Y = γ01 + γ1x1 + γ2x2 + R. What is the difference in interpretation between β1 and γ1?

β1 represents the change in the expected response for a unit change in x1. γ1 represents the change in the expected response for a unit change in x1 with x2 held fixed.

b) (2 marks) The value of R² is a bit larger for model 2. Does this mean that this model better fits the data?

No, since we know that R² must increase as we add terms to a model, whether or not the corresponding explanatory variate has an effect on the response. R² measures the proportion of variation in the response variate explained by the explanatory variates.

c) (3 marks) Using model 2, find a 95% confidence interval for γ1.

From the R output we have γ̂1 = 0.820 with standard error 0.135 and 47 degrees of freedom. From the t-table we have P(|t47| ≤ 2.01) ≈ 0.95, so the confidence interval is 0.820 ± 2.01(0.135) (estimate ± c standard error) or 0.820 ± 0.27.

d) (1 mark) The F statistic in the summary output for model 2 is F = 19.03. What does this signify?

Since the F ratio is so large (p-value 8.807e-07), there is strong evidence against the hypothesis that all of the γ's are 0. That is, there is strong evidence that one or more of the explanatory variates explains variation in the response variate.

e) (3 marks) The output for model 2 shows that there is no evidence that γ2 differs from 0 using a t-test. Show how we can use the Analysis of Variance to test the same hypothesis.

To use ANOVA, we need to find two estimates of σ². From fitting the full model, the estimate of σ² that does not depend on any hypothesis about γ2 is 1.421² = 2.019 with 47 degrees of freedom. When fitting the model assuming γ2 = 0, the estimate of σ² is 1.416² = 2.005 with 48 degrees of freedom. Hence the change in the residual sum of squares is 48(1.416²) − 47(1.421²) = 1.338 with 1 degree of freedom, so the F-ratio is 1.338/2.019 = 0.66 and there is no evidence that γ2 differs from 0.

f) (2 marks) A plot of the estimated residuals versus the fitted values from Model 2 is shown below. Some suggested that there was a "funnel" effect, indicating a non-constant standard deviation, and recommended transforming the response variate using the logarithm etc. Based on the plot, what action would you recommend? Why? Note the plot has been corrected in the solution.

No action is required because there are no apparent patterns or outliers on the plot.
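Both the interval in c) and the ANOVA comparison in e) are one-liners in R once the two models are fit; a sketch (the data set behind this question is not reproduced in the notes, so we use stand-in data):

# stand-in data with the same structure (n = 50)
set.seed(2)
x1 <- rnorm(50); x2 <- rnorm(50); y <- 30 + 0.8 * x1 + rnorm(50)
fit1 <- lm(y ~ x1)        # Model 1
fit2 <- lm(y ~ x1 + x2)   # Model 2
anova(fit1, fit2)         # F for gamma_2 = 0; equals the square of the t value for x2
confint(fit2, "x1")       # 95% confidence interval for gamma_1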

Statistics 371 Sample Midterm Solution

In Stat 371, we deal with applications and theory of the linear model Y = Xβ + R where X = (1 x1 ... xp) is an n × (p + 1) matrix with columns giving the values of the explanatory variates and R is a vector of random variables with independent components Ri ~ G(0, σ). We represent the corresponding data model by y = Xβ + r where y is the vector of observed values of the response variate.

1. For the model described above:

a) (1 mark) What is the criterion used to produce the least squares estimates of the parameters β?

We minimize ||r||² = ||y − Xβ||², or equivalently we choose β̂ so that r̂ = y − Xβ̂ is perpendicular to span(1, x1, ..., xp).

b) (5 marks) We know that the least squares estimate of β is β̂ = (XᵗX)⁻¹Xᵗy and the corresponding estimator is β̃ ~ N(β, σ²(XᵗX)⁻¹). Suppose we want to predict the response variate for a unit with values of the explanatory variates uᵗ = (1, u1, ..., up). Derive a 95% prediction interval. Be sure to explain the derivation.

We are predicting Y where Y ~ N(uᵗβ, σ²). Since β̃ ~ N(β, σ²(XᵗX)⁻¹), we have uᵗβ̃ ~ N(uᵗβ, σ²uᵗ(XᵗX)⁻¹u), and hence Y − uᵗβ̃ ~ N(0, σ²(1 + uᵗ(XᵗX)⁻¹u)). Standardizing and replacing σ by σ̃, we have

(Y − uᵗβ̃)/(σ̃√(1 + uᵗ(XᵗX)⁻¹u)) ~ t(n − (p + 1))

We use this random variable as the basis for our interval. Choosing c so that Pr(|t(n − (p + 1))| ≤ c) = 0.95, we have

Pr(−c ≤ (Y − uᵗβ̃)/(σ̃√(1 + uᵗ(XᵗX)⁻¹u)) ≤ c) = 0.95

Cross-multiplying and re-arranging, we get the probability statement

Pr(uᵗβ̃ − cσ̃√(1 + uᵗ(XᵗX)⁻¹u) ≤ Y ≤ uᵗβ̃ + cσ̃√(1 + uᵗ(XᵗX)⁻¹u)) = 0.95

We get the 95% prediction interval by replacing the estimators by the corresponding estimates:

(uᵗβ̂ − cσ̂√(1 + uᵗ(XᵗX)⁻¹u), uᵗβ̂ + cσ̂√(1 + uᵗ(XᵗX)⁻¹u))

Note: many marks were lost because of confusion among parameters, estimates and estimators.
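The derivation translates directly into R; a sketch with stand-in data and a hypothetical new unit u (all names ours):

set.seed(1)
x1 <- rnorm(50); x2 <- rnorm(50); y <- 30 + 0.8 * x1 + rnorm(50)
fit <- lm(y ~ x1 + x2)
X <- model.matrix(fit)
u <- c(1, 1.5, 0.2)                    # hypothetical (1, u1, u2)
s <- summary(fit)$sigma                # sigma-hat
cval <- qt(0.975, df.residual(fit))    # c with n - (p + 1) df
m <- sqrt(1 + drop(t(u) %*% solve(t(X) %*% X) %*% u))
sum(u * coef(fit)) + c(-1, 1) * cval * s * m
# built-in equivalent:
predict(fit, data.frame(x1 = 1.5, x2 = 0.2), interval = "prediction")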

2. (4 marks) Give two distinct, different uses of this model in business contexts.

• prediction: predict the market value of a building using the selling price (response variate) and various explanatory variates (size, age, ...) from sales of similar buildings
• estimating parameters: estimate the volatility of a share price relative to an index using past closing prices
• looking for outliers: identify extreme salaries (response variate) after adjusting for explanatory variates such as experience, age and educational qualifications

3. In a compensation study of the chief executive officer salaries in one state, data were collected from 91 rural school districts in a given year. The variates measured were:

experience: number of years in the current or similar job
size: number of students in the district
education level: BA only, MA or PhD
cost of living (col): relative cost of living in the district
salary: annual salary of CEO

Note that education level is captured by two explanatory variates ma = 0, 1 and phd = 0, 1, where 1 indicates the presence of the degree. If phd = 1, then the CEO has the equivalent of both degrees, so ma is set to 1. The purpose of the investigation was to determine if the salaries were relatively "equitable", or if some CEOs were highly under- or over-paid relative to the others. The R output from fitting a linear model to the data is given below.

Call: lm(formula = salary ~ experience + size + ma + phd + col)
Residual standard error: 1402 on 85 degrees of freedom
Multiple R-Squared: 0.889, Adjusted R-squared: 0.8825
F-statistic: 136.2 on 5 and 85 DF, p-value: < 2.2e-16

[The coefficient table is too garbled in the source to reproduce; the piece used below is the Pr(>|t|) value 0.484 for col.]

a) (1 mark) Carefully interpret the coefficient β4 corresponding to the explanatory variate phd.

Since E(Y) = β0 + β1 experience + β2 size + β3 ma + β4 phd + β5 col, β4 represents the average change in salary if a CEO gets a PhD, all other explanatory variates being held fixed.

b) (1 mark) Suppose we add a product term phd*experience with coefficient β14 to the model. Carefully interpret this parameter.

With the new model we have E(Y) = β0 + β1 experience + β2 size + β3 ma + β4 phd + β5 col + β14 experience*phd. If phd = 0,

E(Y) = β0 + β1 experience + β2 size + β3 ma + β5 col

and if phd = 1,

E(Y) = β0 + (β1 + β14) experience + β2 size + β3 ma + β4 + β5 col

Hence β14 represents the change in the rate at which experience affects average salary if a CEO has a PhD versus not having a PhD, all other explanatory variates held fixed. Note: this was meant to be difficult and it proved to be so. Good thing it was only one mark!

c) (2 marks) The Pr(>|t|) for the variable col is 0.484. What does this tell us?

The p-value for the hypothesis β5 = 0 is large, so there is no evidence against this hypothesis. That is, there is no evidence that col affects salary.

d) (4 marks) To check the contribution of size and experience to the model, a new model with additional terms e2 and s2, the squares of experience and size, was fit. Part of the R summary output is shown below. Is there any evidence that these quadratic terms are necessary?

Residual standard error: 1377 on 83 degrees of freedom
Multiple R-Squared: 0.8955, Adjusted R-squared: 0.8867
F-statistic: 101.6 on 7 and 83 DF, p-value: < 2.2e-16

The estimate of σ under the full model (including the squared terms) is 1377, so the residual sum of squares is 83(1377²) = 157,378,707 with 83 degrees of freedom. The estimate of σ under the restricted model (without the squared terms) is 1402, so the residual sum of squares is 85(1402²) = 167,076,340. The change in the residual sum of squares is 167,076,340 − 157,378,707 = 9,697,633 with 8 − 6 = 2 degrees of freedom, so the mean square is 9,697,633/2 = 4,848,816. The discrepancy measure is 4,848,816/1377² = 2.557 and the p-value satisfies 0.05 < Pr(F2,83 ≥ 2.557) < 0.10. There is weak evidence against the hypothesis that the coefficients of the squared terms are 0, and hence weak evidence that they need to be included in the model. Note that the full model has 83 degrees of freedom for estimating σ here.
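The hand calculation in d) can be reproduced in R from the two printed summaries (a sketch using only the numbers above):

rss_full <- 83 * 1377^2              # full model, with e2 and s2
rss_red  <- 85 * 1402^2              # model without the squared terms
Fobs <- ((rss_red - rss_full) / 2) / 1377^2
Fobs                                 # about 2.56
pf(Fobs, 2, 83, lower.tail = FALSE)  # about 0.08, between 0.05 and 0.10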

e) (2 marks) A quantile-quantile (qq) plot of the standardized residuals is shown below. Explain how to calculate the coordinates of the point in the lower left corner of the plot.

We divide the G(0,1) distribution into 91 bins, each with probability 1/91. The x-coordinate is the "center" q1 of the first bin, where Pr(Z ≤ q1) = 1/182. The y-coordinate is the smallest standardized residual in the set of 91.

f) (1 mark) What does the qq plot tell us in this case?

Since the points fall close to a straight line, we can be confident that the assumption of gaussian residuals is reasonable.

g) (2 marks) How can we detect cases with an outlier in the explanatory variates?

For each case, we look at the leverages hii, the diagonal elements of the hat matrix H = X(XᵗX)⁻¹Xᵗ. If the leverages are close to 1, or relatively large, then the corresponding values of the explanatory variates are an outlier and are possibly influential in the fit of the model.
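A sketch of e) and g) in R; the CEO data are not reproduced in the notes, so we use a stand-in fit with 91 cases:

set.seed(3)
x <- rnorm(91); y <- x + rnorm(91); fit <- lm(y ~ x)
q <- qnorm(((1:91) - 0.5) / 91)   # bin centers: Pr(Z <= q[1]) = 1/182
r <- sort(rstandard(fit))         # ordered standardized residuals
plot(q, r)                        # the lower left corner is the point (q[1], r[1])
hatvalues(fit)                    # leverages h_ii, for detecting x-outliers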

h) (2 marks) A plot of the studentized residuals versus the case number is shown below. The purpose of the investigation was to identify outliers in the response variate, the CEO salary, after accounting for the explanatory variates. Assuming that the fit of the model is adequate, use the plot to provide a conclusion to the investigation.

Looking at the plot of the studentized residuals, we see no very large values (i.e. > 2.5), so it appears that the salaries are equitable.

Final Examination Spring 2004

Part I (35 marks)

In the first part of the course, we looked at applications and theory of the linear model Y = Xβ + R where X = (1 x1 ... xp) is an n × (p + 1) matrix with columns giving the values of the explanatory variates for the n units in the sample and R is a vector of random variables with independent components Ri ~ G(0, σ). We represent the corresponding data model by y = Xβ + r where y is the vector of observed values of the response variate.

1. (5 marks) From first principles, show that the least squares estimate of β is β̂ = (XᵗX)⁻¹Xᵗy.

2. (4 marks) Show that the mean and variance-covariance matrix of the corresponding estimator β̃ are β and σ²(XᵗX)⁻¹ respectively.

3. (3 marks) The estimator corresponding to the vector of estimated residuals is r̃ = Y − Xβ̃. Find the distribution of the ith component r̃i.

You work in the marketing division of a large corporation that owns pizza franchises. Your company is investigating a special promotion with the goal of increasing sales. There are three versions of the promotion plus a control in which there is no change to current practice. The company assigns each of the versions at random to 20 franchises and measures the average weekly sales (over a four week period). The average weekly sales before the promotion is also recorded for each of the 80 franchises in the sample. For each franchise, the data are coded as follows:

Name                                                          Symbol
average weekly sales (in $1000) during the promotion period   y
promotion 1                                                   x1 = 1 if promotion 1 is used, x1 = 0 otherwise
promotion 2                                                   x2 = 1 if promotion 2 is used, x2 = 0 otherwise
promotion 3                                                   x3 = 1 if promotion 3 is used, x3 = 0 otherwise
promotion 4 (control)                                         none
past average sales (in $1000)                                 x4

Consider the model (in vector notation)

Y = β01 + β1x1 + β2x2 + β3x3 + β4x4 + R,  R ~ N(0, σ²I)    (1)

4. (2 marks) Carefully interpret the coefficient β1.

5. (2 marks) To maintain symmetry, your boss, an engineer gone wrong, suggests adding an extra term βcxc to the model (1) where xc = 1 for the franchises with the control and xc = 0 otherwise. Is this a good idea? Explain.

You fit the model (1) using R with the following summary output.

Call: lm(formula = y ~ x1 + x2 + x3 + x4)

[The coefficient table is too garbled in the source to reproduce.]

Residual standard error: 1.407 on 75 degrees of freedom
Multiple R-Squared: 0.9828, Adjusted R-squared: 0.9819
F-statistic: 1072 on 4 and 75 DF, p-value: < 2.2e-16

6. (1 mark) What does "Multiple R-Squared: 0.9828" tell you?

7. (2 marks) What does "F-statistic: 1072 on 4 and 75 DF, p-value: < 2.2e-16" tell you?

8. (3 marks) Find a 95% confidence interval for β2. What does this interval tell you?

9. (3 marks) Explain in symbols and words (no numerical calculations needed) how you could formally assess if there was a difference between promotion 2 and promotion 1?

10. (4 marks) To test the hypothesis that there is no difference among the three promotions, you fit the model

Y = β01 + β(x1 + x2 + x3) + β4x4 + R,  R ~ N(0, σ²I)

with summary output (in part):

Residual standard error: 1.549 on 77 degrees of freedom
Multiple R-Squared: 0.9786, Adjusted R-squared: 0.9781
F-statistic: 1763 on 2 and 77 DF, p-value: < 2.2e-16

Is there any evidence of a difference among the three versions?

11. (4 marks) To assess the fit of the original model (1), the following plots were prepared. Briefly describe what each plot tells you about the fit.

Plot 1 (Estimated Residuals vs Fitted Values):
Plot 2 (Normal Q-Q Plot of standardized residuals):
Plot 3 (Leverage vs Case Number):
Plot 4 (Studentized Residual vs Case Number):

12. (2 marks) If the primary purpose of the study was to look for differences in the versions (as in Question 10), how would you proceed with the information from the above plots?
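For study purposes, here is one way the test in Question 10 could be carried out in R (a sketch; the franchise data set is not reproduced here, so the data-dependent call is shown as a comment):

# from the two printed summaries:
Fobs <- ((77 * 1.549^2 - 75 * 1.407^2) / 2) / 1.407^2   # about 9.2
pf(Fobs, 2, 75, lower.tail = FALSE)                     # very small p-value
# with the data in hand, the same test would be
#   anova(lm(y ~ I(x1 + x2 + x3) + x4), lm(y ~ x1 + x2 + x3 + x4))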

Part 2 (35 marks)

In the second part of the course, we deal with the theory, applications and some extensions of simple random sampling (SRS) to learn about population averages. The basic estimate of a population average μ is the sample average μ̂ = Σi∈s yi/n, where yi is the value of the response variate for the ith unit in the sample s and n is the sample size. The corresponding estimator can be written μ̃ = Σi∈U yiIi/N, where U is the population (frame) with size N and Ii = 1 if unit i is in the sample and Ii = 0 otherwise. We can show that E(μ̃) = μ and

Var(μ̃) = (1 − n/N)σ²(y)/n, where σ²(y) = Σi∈U (yi − μ)²/(N − 1)

1. (3 marks) A key step in proving that E(μ̃) = μ for SRS was to determine Pr(Ii = 1). Find this probability and explain your reasoning.

2. (4 marks) A key step in the derivation of Var(μ̃) for SRS was to determine the covariance of Ii and Ij. Find this covariance.

3. (3 marks) Suppose that yi is a binary variate with values 0 and 1. Show that σ(y) is essentially determined by π, the population proportion of units with y = 1.

4. (5 marks) Suppose we have a population with frame U = U1 ∪ ... ∪ UH of size N, where the Uh, h = 1, ..., H are mutually exclusive strata of size Nh. We plan to sample nh units from stratum h using SRS independently for each stratum. The cost per unit of sampling from stratum h is ch and the total sampling cost must be limited to C. How should we best allocate the sample to the various strata if the goal is to estimate the population average?

In the early 1980's, the Federal Government of Canada established a grant program to help homeowners re-insulate their homes to reduce energy consumption. Many homes used UFFI, a foam insulation that could be pumped into cavities as a liquid. The foam then solidified without reducing its volume. Unfortunately, some homeowners developed allergy symptoms that were attributed to formaldehyde (CH2O), a gas that could have been given off by UFFI. There was then pressure on the Government to help homeowners remove the UFFI, a very expensive proposition. To assess the magnitude of the problem, a survey was commissioned with the basic purposes to assess

• the average level of CH2O in homes with UFFI
• the proportion of individuals in these homes with allergy symptoms

and to compare these attributes to those in homes without UFFI.

Since the Government had awarded grants, they had a frame of 124,345 homes in which UFFI had been installed and another frame of 230,981 homes that had been re-insulated without UFFI. They decided to select a simple random sample of 500 homes from each frame and then

• measure the concentration of CH2O in the air in each home in the sample.
• administer a questionnaire to the homeowner to collect information about allergy symptoms and other demographics.

A summary of part of the data collected is

Attribute                          UFFI homes   non-UFFI homes
CH2O average concentration (ppb)   57.4         47.6
CH2O standard deviation (ppb)      9.8          12.7

5. (2 marks) Explain how you could implement simple random sampling in this case.

6. (3 marks) An early press release about the allergy problems noted that "the survey will look at a random sample of people living in homes with UFFI ...". Is this statement technically correct? Explain.

7. (3 marks) Find a 95% confidence interval for the population average CH2O concentration in the UFFI homes.

8. (3 marks) Find a 95% confidence interval for the difference in population average CH2O concentration in the UFFI and non-UFFI homes.

9. (2 marks) For the UFFI home sample, the proportion of homes in which one or more persons experienced allergy symptoms was 0.08 ± 0.023, the 95% confidence interval for π. How large a sample would have been required to reduce the length of this interval by half?

10. (3 marks) A chemist involved in the planning of the survey noted that the frame of homes with UFFI was about half the size of that for homes without UFFI. He suggested that the survey would be improved if the sample sizes were proportional to the frame sizes. Is this correct? Explain.

11. (4 marks) Briefly discuss two different strategies that could have been used to increase the efficiency of the survey.

Strategy 1:

Strategy 2:
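For study purposes, the interval in Question 7 can be computed from the summary table above; a short R sketch:

# Question 7: 95% CI for the average CH2O concentration in UFFI homes
n <- 500; N <- 124345
ybar <- 57.4; s <- 9.8               # from the summary table
se <- sqrt((1 - n/N) * s^2 / n)      # the fpc is nearly 1 here
ybar + c(-1, 1) * 1.96 * se          # about 57.4 +/- 0.86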
