This action might not be possible to undo. Are you sure you want to continue?

STAT 371 Course Notes

SPRING 2011

JOCK MACKAY

rjmackay@uwaterloo.ca

Statistics 371 © R.J. MacKay, University of Waterloo 2009

Index

Chapter 1 Chapter 2 Chapter 3 Chapter 4 Chapter 5 Chapter 6 Chapter 7 Chapter 8 Chapter 9 Chapter 10 Appendix 1 Appendix 2 Appendix 3 Statistical Tables Solutions to Exercises Old Midterms and Exams

The Need for Statistics in Business Models linking explanatory and response variates Making Inferences from Regression Models The Analysis of Variance Assessing Model Fit Model Building Sample Survey Issues Probability Sampling Ratio and Regression Estimation with SRS Stratified Random Sampling R Properties of vectors and matrices of random variables Gaussian Quantile-Quantile Plots

Please email me with any errors or points of clarification. These notes are a work in progress.

Data Sets You can download all data sets in the notes and exercises from the file stat371.zip on the Angel course web page. You can access the individual files at the same site.

Statistics 371 © R.J. MacKay, University of Waterloo 2009

Chapter 1 The Need for Statistics in Business “There is no substitute for knowledge” W. Edwards Deming “The greatest obstacle to discovery is not ignorance – it is the illusion of knowledge” Daniel Boorstin The purpose of Stat 371 and 372 is to provide a unified set of strategies and tools to apply Statistical Method in business and industry. In particular, the goal is to learn how to: • • • • pose clear questions collect the right data efficiently and effectively– a good plan provide useful conclusions communicate the conclusions and the method by which they are reached to a nontechnical audience

Statistics, or better Statistical Method, is a powerful, widely applicable process that we can use to learn about business processes and markets (populations). Statistical Method is empirical, that is, based on observational and experimental investigations. By collecting and analyzing the right data, we can increase our knowledge of the market, the products and services we produce (and plan to produce) and the processes we use in this production. We may then use this knowledge to make better decisions to improve the business. Example 1 The maker of “frost-free” refrigerators in temperate New Zealand decided to expand their market to tropical south-east Asia. There were immediately numerous complaints about frost build-up in the fridges from the new market. The company interviewed 25 recent purchasers in each of the two markets and found that there were large differences in ambient environmental conditions (temperature and humidity) and usage (frequency of door openings, amount of food introduced at one time) in the two markets (investigation 1). They were convinced that these factors were the cause of the frost build-up in the tropical market. To solve the problem, they decided to try to redesign the fridge to make it more robust to ambient environmental conditions and usage factors. In a experimental investigation, they built 8 prototype fridges in which four design inputs were changed simultaneously. They then tested each prototype under two conditions defined by the extremes of the environmental and usage factors. The response variate was the temperature of the cooling plate in the fridge after 30 minutes operation – low constant values mean that there will be no frost build-up. The experimental plan and data are

Statistics 371 ©R.J.MacKay, University of Waterloo, 2009

I-1

Treatment 1 2 3 4 5 6 7 8

D1 new new new new original original original original

D2 new new original original new new original original

D3 new original new original new original new original

D4 new original original new original new new original

Environmental Conditions Normal Extreme 0.7 2.1 2.9 4.8 2.4 9.6 3.8 5.9 1.9 4.0 -0.2 0.1 -0.1 3.5 0.2 7.2

Looking at the data, we can see that there are several promising designs (e.g. treatment 6). After further analysis and a review of the costs, the company adopted the combination in treatment 6 as the new design. The complaints about frost buildup disappeared. Example 2 Municipal taxes in Ontario are based the market value of the property. Where possible, the market or assessed value is determined by predicting the market value of the property using the prices from recent sales of comparable properties. A property owner may choose to appeal the assessed value. A large company felt that the assessed value of its very large property (an automobile assembly plant) was too high. To argue their case, they collected data on 38 large plants that had been sold in the last 10 years throughout Canada and the USA. The first few records are

size (sq ft /10^6) 0.848 1.813 1.297 1.747 age(years) 35 37 50 23 percent office 5.8 3.2 19.0 10.2 build/land ratio 26.6 17.3 45.1 13.3 location usa usa usa usa value $/sq ft 4.32 6.74 6.36 5.95

The idea was to predict the value of the building in question using a model constructed from the data and the known values of the explanatory variates size, age etc. Here the prediction was a failure as there were many problems with the data and how it was collected. We use PPDAC (Problem, Plan, Data, Analysis, Conclusion) to describe Statistical Method, the process we use to learn empirically. The purpose of each stage is: Problem: Plan: Data: Analysis: Develop clear questions about attributes of the population/process of interest Develop a plan to answer the questions posed Execute the Plan to collect the required data Analyze the data based on the Plan and a model to address the question

Statistics 371 ©R.J.MacKay, University of Waterloo, 2009

I-2

Conclusion:

Answer the questions and report uncertainties and limitations

The following should remind you of the language of PPDAC and how we apply the process.

Target Population

Study Population

Sample

Measured variate values

Conclusions

(Model-based) analysis

PPDAC is a process that we use to plan and execute empirical investigations so that we get reliable conclusions at a reasonable cost. There must be a good reason to undertake the investigation in the first place and resolve to take action and make decisions based on the Conclusions. Governments are famous for avoiding decisions by saying that another study is required. The two course are organized by the nature of the Plan and the models used in the analysis. In Stat 371 we concentrate on applications of regression models and sample surveys. In Stat 372, we look at issues of data collected over time (time series, control charting) and the use of experimental plans. Exercises 1. (A true story, believe it or not) To improve the shifting of the transmission, an automobile manufacturer organizes a clinic in which about 100 people evaluate the “feel” of 6 transmissions on different models from low to (very) high cost. Each person is asked to rate each transmission on several dimensions. The idea is to use the data to help design a new transmission that will have good “feel” to improve the perceived quality of the vehicle and hence improve sales/market share. To save money in organizing the clinic, the company uses the engineers at its development

Statistics 371 ©R.J.MacKay, University of Waterloo, 2009

I-3

center of which 90% are males under the age of 35. What changes to this plan would you recommend? Why? 2. Write a brief description of the 6 Sigma program. Where does Statistical Method fit in 6 Sigma? What advantages and disadvantages can you see in an organization adopting such a program? 3. What is a software usability trial? What are two key issues in the design of such a trial? How does Statistical Method fit into a usability trial? 4. Give two examples of the how you might use of Statistical Method in market research.

Statistics 371 ©R.J.MacKay, University of Waterloo, 2009

I-4

i = 1.txt.. suppose we want to assess the relative risk of an IBM share relative to the S&P 500 index. we have the values of the response variate yi and p explanatory variates xi1 .ret.ibm..Chapter 2 Models linking explanatory and response variates In this chapter.ylab='IBM return'. we introduced the problem of determining the market value of a property that is not sold.main='IBM vs. There are many applications of such models.. days. The purpose of this modeling is to estimate an attribute of the population of monthly returns.g. we look at regression models and how to fit them to a set of data... We give three here. Example 1 The CAPM model is used to measure the risk of a single asset relative to that of a portfolio. The variate names are sp. If we model the excess IBM return as a random variable Y and the portfolio excess return as a random variable X .. The month over month returns from Jan 2001 to March 2003 for IBM and the S&P 500 are given in the file IBM..ret and ibm. the fit will not change markedly.ret. n are unknown. using known explanatory variates and the market values of similar Stat 371 © R.. The risk-free return is not included in the model.. MacKay. The theoretical CAPM model describes the excess return (actual return – risk free return) for an IBM share over a period of time as a constant β times the excess return of the portfolio. Since this return is small. The statistical problem is to fit a regression model of the form yi = β 0 + β 1 xi1 +.. In many empirical applications. S&P 500 Monthly Returns') From the plot.ret. β 1 . For example. for each unit i.. xi 2 . That is. Example 2 In Chapter 1. the model should provide a reasonable fit to the data. then we have Y = βX and stdev(Y ) β= .. percentage returns are collected for the asset yi and the portfolio xi over a number of periods (e..J. created with the R code plot(sp. β p and the residuals ri . We see a scatterplot of the data on the next page. University of Waterloo 2009 II-1 . measured on a monthly basis. Note we can fit a model that includes the risk-free return if the data were available. There are many issues about the time period (months) and the sampling period.xlab='SP500 return'. Suppose we have a set of n units selected from a population and. xip ..+ β p xip + ri where the parameters β 0 . The common interpretation is that β > 1 corresponds to an asset riskier than the portfolio. months) and a linear model of the form yi = β 0 + β 1 xi + ri is fit to the data. the parameter β measures the relative volatility of the IBM excess stdev( X ) return.

To do so. There are issues about which properties to include in the data set and which explanatory variates to include in the model . MacKay. x j a column vector containing the values of the jth ( j = 1. We then use the model to predict the market value of the property that was not sold. age=21. By "fitting the model". He also has access to the office size and age. In the planning of an audit. The data are in the file assessment. Finally. ratio and location and the response variate value ($ per square ft). He will devote more audit resources to any such office.properties that were actual sales.8. 1)t be a column vector of n 1’s and X = (1. we first fit a regression model using the data from the actual sales. Fitting the Model – Least Squares. age. percent office =3. Another similar application is to look at salaries of employees relative to the work they do.825. The values for the explanatory variates for the unsold building are size=13. There are many applications of regression where the object is to predict the unknown response variate for a given set of values of the explanatory variates. In Ontario. This is an example of an analytic method in auditing. x1 . There are 38 units (large sales)with 5 explanatory variates size. The goal is to look for exceptional cases for which the relationship between the explanatory variate and the response is very different..J.txt. Also let 1 = (1. β 1 . The data are in the file analytic.. Let y be a n ×1 column vector containing the response variate values.txt. location= 0 (Canada). the number of employees and clients. we mean that we estimate the unknown model parameters using the data. x p ) an n × (1 + p) matrix with columns corresponding to the explanatory variates. β p )t be a (1 + p) × 1 column vector of the unknown coefficients. office.. Example 3 A service organization has 24 offices. there is a private organization that makes extensive use of regression modeling to provide market values to municipalities for all properties that provide the basis for property taxes. p ) explanatory variate and r a column vector of the unknown residuals.. The auditor plans to fit a model relating overhead to the explanatory variates in order to look for outliers. Then we can write the model in terms of these vectors as Stat 371 © R..... and the relative cost of living in the city where the office is located.... offices for which the relationship is different. building/land ratio=53. let β = ( β 0 .... In this example. the accountant looks at the stated overhead from the current and past year for each office.. University of Waterloo 2009 II-2 . we represent the data model in terms of vectors and matrices..

x p ) as shown below... University of Waterloo 2009 II-3 .. x p ) is the subspace of R n spanned by the columns of X ...+ β p x p span(1....... x1 ...+ β p x p + r = (1. x1 ..... x1 .−β p xi p ) 2 i =|| y − Xβ ||2 =|| r||2 To minimize the squared length of r . we project y orthogonally onto span(1. MacKay...+ β p x p span(1. x1 .. we use least squares.J. y r β 0 1+ β 1 x1 +. x p ) Stat 371 © R. y r β 0 1 + β 1 x1 +.. x1 . x p ) The span(1.. We can picture the model in R n as shown below.. We assume that this subspace has dimension p+1 or equivalently that the columns of X are linearly independent. That is.. we find the value for β that minimizes the function W ( β ) = ∑ ( yi − β 0 − β 1 xi1 −. x p )β + r or more compactly as y = Xβ + r We have written y as the sum of two vectors...y = β 0 1 + β 1 x1 +.. To fit the model..

txt”.J.. The following code produces the given output.. Substituting for r . Example We use R to fit the empirical CAPM model to the IBM returns vs S&P 500 returns. x1 . x1t r = 0.. p+1 linearly independent columns).. x p ) or equivalently. That is we have 1t r = 0.ibm.see the exercises. Note that X t X has an inverse because we assume that X has full rank (i. a<-read.ret~sp.The estimated residual vector r = y − Xβ is orthogonal to the span(1.. to every column of X..header=TRUE) attach(a) b<-lm(ibm. x tp r = 0 We can write these equations more compactly as X t r = 0 . Note that we have decomposed the vector y = Hy + ( I − H ) y into two orthogonal components... x p ) .ret) summary(b) fitted(b) plot(sp.. and after rearrangement.+ β p x p = Xβ = X ( X t X ) −1 X t y = Hy and the estimated residual by r = y − Xβ = (I − H)y where the matrix H = X ( X t X ) −1 X t is called the hat-matrix and is the projection onto the subspace span(1. We label the projection (called the vector of fitted values) as μ . MacKay.. University of Waterloo 2009 II-4 . main= “IBM monthly return vs S&P 500 monthly return”) abline(b) Stat 371 © R. so μ = β 0 1 + β 1 x1 +..e.ret.. we get X t ( y − Xβ ) = 0 .ret..table(“IBM. β = ( X t X ) −1 X t y .. H has several interesting properties . x1 .

1007748 13 14 15 16 17 18 1. University of Waterloo 2009 II-5 . p-value: 2.8917891 3.255 6.6117 F-statistic: 46.742 0. Adjusted R-squared: 0.1267206 -18.832 2.001 -5.5573599 -0.633 Coefficients: Estimate Std.023 4.ret 1.4441644 11.1705660 25 26 27 28 29 30 -10.0381487 9.9162213 -12.876 -1.1633477 4.6340558 7.68 on 1 and 28 DF.1842320 2.013e-07 1 2 3 4 5 6 1.' 0.5523248 -1.9528059 14.2193082 -13.2892430 1.261 on 28 degrees of freedom Multiple R-Squared: 0.8955444 -3.5157827 -9.5224802 7 8 9 10 11 12 -1.799 0.The output is: Call: lm(formula = ibm. MacKay.6967085 -11.J.4666260 19 20 21 22 23 24 -2.066 1.ret) Residuals: Min 1Q Median 3Q Max -10.05 `.834 19.625.01 `*' 0.0083625 16.334 0.1 ` ' 1 Residual standard error: 7.3848525 14.ret ~ sp. codes: 0 `***' 0.4473138 -10.8102016 -3.6636134 3.1183661 Stat 371 © R.7090947 -9.6464207 2.01e-07 *** --Signif. Error t value Pr(>|t|) (Intercept) 1.431 sp.001 `**' 0.9334725 15.1026870 -0.

• resid(b) : the estimated residuals. MacKay. R-squared (usually written R 2 ) is defined as Stat 371 © R. We can look at the contents of the model object b with the commands: • summary(b) : the table of estimated coefficients and statistics.J. • anova(b) : an Analysis of Variance table • coefficients(b) : the estimated coefficients. 3. • fitted(b) : the vector of the fitted values in the same order as the original data. we know that the IBM share is more volatile than the market as defined by the S&P 500. We will interpret most of the other statistics when we look at formal inference procedures for the corresponding response (probability) model.742 . The output of the function is assigned to the object b <. This term can be omitted using the code lm(y~-1+x1+x2+…+xp) 2.lm(y~x1+x2+…+xp).066. The function lm(y~x1+x2+…+xp) fits the regression model that includes a constant term. Since the estimated slope is greater than 1. University of Waterloo 2009 II-6 . abline(b) adds the fitted line to the scatter plot when there is a single explanatory variate.Notes on the R code: 1. we note that β 0 = 1. To interpret the output. β 1 = 1.

Again.8 3.1 3.1 1. x p . R 2 is always between 0 and 1 and is often quoted as a percentage.5 -6 1..10..5 3.e.5% of the variation in the IBM returns”.6 10. y lies in span(1.J...2 2.. then R 2 is 0.7 11.9 x2 3.e..9 2.. x p ) .3 1.. Another interpretation is that 100 R2 is the percent of the variation in the response model explained by the explanatory variates..8 4.4 2. then R 2 is 1 and if the fitted model does not involve x1 .8 3.2 y 3. University of Waterloo 2009 II-7 ..6 4.. We use this artificial example to help you review the basic concepts of fitting a model using least squares. i = 1.8 5. The data are shown below. x1 . we can say that the S&P 500 returns “explain 62. β 0 = y .5 1. In some sense.8 5. x1 ... Exercises 1.7 7.R2 = 1 − = 1− residual sum of squares from the fitted model residual sum of squares from the model with only a constant term || r ||2 || y − y1||2 where y is the sample average of the response variate. MacKay..8 Using R to fit the model yi = β 0 + β 1 xi1 + β 2 xi 2 + ri .8 6. x p . x1 2. R 2 is 0 when r = y − y1 i. we need to be sure to explain what this means. See Exercise 7. β p = 0 . But note that neither the numerator nor the denominator is the usual measure of variation in the response variate.6 16.7 3 2.. R 2 measures how well the model fits the data but you need to be very careful in this interpretation. a) b) What are the estimates β ? Calculate μ 1 and r1 Stat 371 © R..1 5..9 1. we get the summary in the following text box. R 2 is 1 when the length of the residual vector r is 0 i.8 11 4.... β 1 = 0. In the example. In other words if we can write y as a linear combination of 1.

Error t value Pr(>|t|) (Intercept) 3.J..Call: lm(formula = y ~ x1 + x2) Residuals: Min 1Q Median 3Q Max -1.87919 Coefficients: Estimate Std.05 `.9879. Suppose we have the returns on an asset yi .' 0. 4.09907 -0.300 1. University of Waterloo 2009 II-8 . Suppose we have a response variate yi and a single explanatory variate xi1 for each of n units sampled from a population.95e-06 *** --Signif. the return on the market xi1 and the risk free return xi2 for n periods. the measure of volatility.01086 0.txt) with a) all 5 explanatory variates b) only age and size c) Do the estimated coefficients change? Why? 3. codes: 0 `***' 0. change? Explain.1 ` ' 1 Residual standard error: 0.26481 0.9844 F-statistic: 285. Use R to fit the model yi = β 0 + β 1 xi1 +. Consider three regression models: Model 1: ( yi − xi 2 ) = β ( xi1 − xi 2 ) + ri Model 2: yi = β 0 + β 1 xi1 + ri Model 3: yi = γ 0 + γ 1 xi1 + γ 2 xi 2 + ri When we fit each model.09959 20.001 `**' 0.7581 on 7 degrees of freedom Multiple R-Squared: 0.912 0.08845 -14. a) Show that the vectors x1 − x11 and 1 are orthogonal.56017 6.03148 0.. p-value: 1.+ β p xip + ri to the assessment data (assessment. will the coefficient of x1 .52948 0. Stat 371 © R.958e-07 2.192 1. MacKay. Adjusted R-squared: 0.87190 0.83e-07 *** x2 -1. Consider the two models Model 1: yi = β 0 + β 1 xi1 + ri Model 2: yi = γ 0 + γ 1 ( xi1 − x1 ) + ri where x1 is the sample average of the explanatory variate.3 on 2 and 7 DF.01 `*' 0.28650 0.000229 *** x1 2.

y1-y4. a) For each pair. c) ( I − H ) 2 = ( I − H )..J. 17-21 to demonstrate the difficulty of using R 2 as a measure of fit and the importance of plotting the data. Stat 371 © R. labeled x1-x4. The data in the file anscombe. x1 . Some questions about R 2 a) In question 1. 7. American Statistician 27.. We defined the hat matrix H = X ( X t X ) −1 X t . which model gave a larger value for R 2 ? b) Show that R 2 cannot decrease if we add extra terms to a model. The file contains 4 sets of ( x. fit a straight line model and report the estimated parameters and the coefficient of determination R 2 b) For each pair.J. Anscombe. x1 − x11) ? c) In fitting models a) and b). Show that a) H t = H b) H 2 = H ... 6. H ( I − H ) = 0 d) 0 ≤ hii ≤ 1 where hii is the diagonal element of H. x1 ) = span(1. University of Waterloo 2009 II-9 . we project onto a subspace.txt were produced by F. c) Comment. MacKay. Interpret this result geometrically. the projection onto span(1. y) vectors. construct a scatterplot of y versus x and add the fitted line.b) Why is span(1. How are those projections different? d) What is the relationship between the estimated coefficients in fitting the two models? e) How does the result in a) simplify the calculation of γ when fitting model 2? 5. x p ) .

Stat 371 © R. MacKay.J. University of Waterloo 2009 II-10 .

Ri ~ G(0. n independent Note that.+ β p xip + Ri . σ 2 I ) .. again with all other ∂x j explanatory variates held fixed.J. We use these procedures to help answer questions of interest such as • • Is there evidence that IBM returns are more volatile than the S&P 500 index? What is a range of plausible values for an unsold property based on the values of its explanatory variates? To start. The estimator β = ( X t X ) −1 XY (a p ×1 vector of random variables) describes the behaviour of β . The column vector 0 gives the component means. we consider a statistical model to describe the repeated application of the Plan.Chapter 3 Making Inferences from Regression Models In this chapter. We use the model to describe how the estimates would behave if we were to repeat the ~ Plan over and over. This model uses random variables to replace the response variate values and residuals in the data model. We write the model more compactly as Y = Xβ + R.. A statistical regression model is Yi = β 0 + β 1 xi1 +. σ 2 I ) .... We treat the explanatory variates as constants (not random variables) in the model.. confidence intervals and prediction intervals for regression models. for each unit i. The variance – covariance matrix σ 2 I gives the variance σ 2 of Ri in the ith diagonal position and the covariance Cov( Ri .+ β p xip so we can interpret β j as the change in E(Yi ) when the jth explanatory variate changes by 1 unit with all other explanatory variates held fixed. • • We can combine the n independent gaussian random variables R1 ... Using the properties of expectation and variance of linear combinations of random variables (Appendix 1) we have Stat 371 © R. Suppose we have a set of n units selected from a population and.. xi 2 ... xip . σ ). i = 1. If x j is continuous... we look at formal inference procedures such as hypothesis tests. we have the values of the response variate yi and p explanatory variates xi1 . R ~ N (0. in the model. we can interpret β j in terms of the partial ∂E(Yi ) derivative.. R j ) = 0 in ijth position See Appendix 2. the rate of change of E(Yi ) as x j changes. MacKay University of Waterloo 2009 III-1 . = β j .. σ 2 I ) so Y ~ N ( Xβ .. Rn into a vector R ~ N (0. • E (Yi ) = β 0 + β 1 xi1 +. The stdev(Yi ) = stdev( Ri ) = σ is constant.

MacKay University of Waterloo 2009 III-2 . Note that the components of β are not independent unless ( X t X ) −1 is diagonal. Note that E (||( I − H ) R||2 ) = [n − ( p + 1)]σ 2 . we use the sum of squares of the estimated residuals divided by the degrees of freedom σ= = = ∑r i 2 i n − ( p + 1) || r ||2 n − ( p + 1) ||( I − H ) y||2 n − ( p + 1) The corresponding estimator is ~ σ= || ~||2 r n − ( p + 1) ||( I − H )Y ||2 = n − ( p + 1) = ||( I − H ) R||2 n − ( p + 1) since ( I − H ) Xβ = 0 . the columns of X are orthogonal. To estimate σ . or. ~ ~ That is. each component of β is gaussian i. in other words.e.see the exercises – which partially justifies the denominator. We also have the unproven result that Stat 371 © R. σd j ) where d j is the square root ~ of the jth diagonal element of ( X t X ) −1 . σ 2 ( X t X ) −1 ).J. β j ~ G( β j .~ E( β ) = E(( X t X ) −1 X t Y ) = ( X t X ) −1 X t E(Y ) = ( X t X ) −1 X t Xβ =β ~ Var ( β ) = Var (( X t X ) −1 X t Y ) = [( X t X ) −1 X t ]σ 2 I[( X t X ) −1 X t ]t = σ 2 ( X t X ) −1 ~ and using the properties of the multivariate normal distribution β ~ N ( β .

Since σ r βj −βj ~ tn −( p +1) ~ σd j We use this t-distribution to test hypotheses and find confidence intervals for the individual parameters β j . In words. ~) = 0 so that β and ~ are statistically independent r r ~ is a function of ~ .255. MacKay University of Waterloo 2009 III-3 .91) = 0.742 from the R output. Stat 371 © R.255 Note that the denominator is called the standard error of β 1 . Step 3: (Calculate the discrepancy measure) The “distance” from the estimated parameter to the hypothesized value is d= | β − 1| |1.28). In this case.91 σd1 0.742 − 1| = = 2.J.91. Step 4: (Calculate the p-value) To asses if this distance is large or small.~ σ / σ ~ Kn −( p +1) ~ ~ We also can easily show that Cov( β . The standard error is the ~ estimated standard deviation of the corresponding estimator β 1 and is given in the R output for each estimated coefficient. Note that the estimate of σ is called the residual standard error in the summary. Step1: (Formulate) Suppose β 1 = 1. is the volatility of an IBM share different than that of the index? We consider a test of the hypothesis β 1 = 1.007 The degrees of freedom correspond to the denominator in the calculation of σ . Example 1 We fit an empirical CAPM model to a series of monthly returns from an IBM share versus the corresponding returns of the S&P 500 index in the file IBM. The summary R output is given in Chapter 2. it then follows that – see the exercises. Step 2: (Estimate) We have β 1 = 1.txt. We use the same 5-step procedure as in the beloved Stat 231. we calculate the p-value Pr(| t28 | ≥ 2. σ = 0. ~ One question of interest is to see if β 1 is different from 1. We can calculate this probability in R with the command pt(2. The pvalue is the chance that we get such a large discrepancy between the estimated and hypothesized value if the hypothesis is true. Tables for the t-distribution are given in Appendix 3.

promotion control promotion 1 promotion 2 x1 0 1 0 x2 0 0 1 The statistical model is Yi = β 0 + β 1 xi1 + β 2 xi 2 + β 3comp.742 ± 2. More generally. 30 stores are selected and divided at random into three groups of 10. They also measure the sales of competing products during the promotion period.255 or 1. One group is given promotion 1..sales and percent.J.01 Interpretation no evidence against the hypothesis weak evidence against the hypothesis some evidence against the hypothesis strong evidence against the hypothesis We can also summarize our knowledge of β 1 using a confidence interval. we say that there is strong evidence that β 1 is different from 1. Note. salesi + Ri .. one group is given promotion 2 and the third group acts as a control. For each store. for a 95% confidence interval.. The promotion for each store is specified by two indictor variates x1 and x2. Recall that the general form of a confidence interval (based on a t-distribution) is estimate ± c × standard error(estimate) where the constant c is chosen from the t-tables so that Pr( − c ≤ t df ≤ c ) is the confidence level.10 between 0. i = 1.change to indicate the measured variates.001. The confidence interval shows us how precisely we have estimated the parameter. Example 2 A marketing firm wants to test two sales promotions.2.742 ± 0.txt with columns comp. from the tables.523 . we interpret a p-value according to the table Range of p-value greater than 0.30 independent Stat 371 © R. c = 2. Ri ~ G(0. The conclusion in the example is that there is strong evidence that the volatility of the IBM share is different from that of the index. The data are stored in the file trial.. we have.10 between 0.05 and 0.05 × 0. In a pilot project. MacKay University of Waterloo 2009 III-4 . In the example. as expected.Step 5: (Interpret) Since the p-value is so small.05 less than 0.01 and 0.05 and the interval is 1. σ ). 1 is outside of the confidence interval and is not a plausible value for β 1 based on the data. less than 0. the firm measures the total sales over a two week period before and after the promotion is in place and calculates the percent change.

10).sales.txt'. Fitting the model. main='Percent Change in Sales vs Competing Sales by Promotion'.sales. xlab='competing sales'. Note that the plotting symbol corresponds to the promotion.header=T) attach(a) p <.c(rep("c".rep("1".Note the interpretation of the parameters β 1 and β 2 . β 1 measures the effect of promotion 1. We plot the data by promotion using the R code a <.change. type='n') text(comp. The text command adds the points using the text characters in the vector p as the plotting symbols.J.read. β 1 represents the increase in the mean response (percent change in sales) if we change from the control to promotion 1.p) The 30 × 1 vector p is a string of characters corresponding to the promotion.rep("2". we get the summary output: Stat 371 © R. We have a similar interpretation for β 2 . percent.change.10). That is.sales fixed. Holding x2=0 and comp. MacKay University of Waterloo 2009 III-5 .10)) plot(comp. percent. The type='n' in the plot command suppresses the plotting of any points but sets up the axes and labels. ylab='percent change'.table('trial.

47944 9.40).Call: lm(formula = percent. codes: 0 `***' 0. then θ = a t β .3884539 3.c(0. note the transpose function t(). Adjusted R-squared: 0.96702 x1 8.00159. We have θ = β 1 − β 2 = 5.' 0.solve(t(X)%*%X) # fit the model # extract the X matrix # find ( X t X ) −1 .3996781 1. If we let a = ( 0.2545 F-statistic: 4. Note the comments after the #. There is strong evidence that promotion 1 has a positive effect with β 1 = 8.60 .1969230 -0.525 0.1.03009 3.8245616 2.-1.01 `*' 0.47980 Coefficients: Estimate Std. By default. θ = a t β ~ ~ ~ ~ and θ = a t β .4187456 2.matrix(b) W <.sales -0.−1. We can calculate the standard error in R using the following statements. suppose we were interested in the parameter θ = β 1 − β 2 that measures the difference in effects of the two promotions.model.042 0.553 0.3316.0003564 0.lm(percent. Although it is clear here that the effects of the two promotions differ.42 (standard error 2. σ 2 a t ( X t X )a) and hence the standard error of θ is σ a t ( X t X ) −1 a . the R output gives the results for the corresponding t test. p-value: 0. we have θ ~ N (a t β .39) if all other explanatory variates in the model are fixed.change~x1+x2+ comp.sales) X <.177 0.24984 comp. Since β ~ N ( β .0) Stat 371 © R.05 `.82 (standard error 2.00159 ** x2 2.58509 --Signif. How can we get the standard error of this estimate? We use vectors to represent θ = β 1 − β 2 . There is no evidence that promotion 2 has an effect with β 2 = 2.3 on 3 and 26 DF.66432 -3. in this case with p-value 0.334 on 26 degrees of freedom Multiple R-Squared: 0.06860 -0.0 ) t .0917034 2. MacKay University of Waterloo 2009 III-6 .change ~ x1 + x2 + comp.01367 To examine the effects of promotion 1.001 `**' 0.0006446 -0.1. b <. σ 2 ( X t X ) −1 ) . matrix multiplication %*% and inverse function solve() # define the vector a a <.1 ` ' 1 Residual standard error: 5. we consider the hypothesis β 1 = 0 corresponding to no effect.J.sales) Residuals: Min 1Q Median 3Q Max -12. Error t value Pr(>|t|) (Intercept) -0.

the 95% confidence interval for the difference in the effects of the two promotions is 5. We can draw conclusions about any linear combination of the coefficients using the same methodology. We can be confident that promotion 1 produces a percent change in sales between 3. Since we are projecting onto the space in each model.27 . Promotion 1 looks promising in terms of its effect on sales.1.334*sqrt(t(a)%*%W%*%a) st.sales) The –1 in the model specification suppresses the intercept. That is. Using the fact that Pr( −2.hat <.err <. promotion 2 vs promotion 1 are identical with the same standard error.lm(percent.33% and 7. the estimates of the parameters corresponding to comparisons e.sales + r . That is. one or two. You need to be careful to recognize categorical explanatory variates that are coded as integers for convenience. Stat 371 © R. compared to promotion 2.theta.t(a)%*%coef(b) st. you might wonder why we did not create an explanatory variate x3 for the control promotion. 2.334 from summary(b) # display SE(θ ) .5.g. we can fit the model percent.change~-1+x1+x2+x3+comp.06) = 0.2 corresponding to the promotion control.87% per week in average sales. There is no reason to suspect that changing from the control to promotion 1 has the same effect as changing from promotion 1 to promotion 2. MacKay University of Waterloo 2009 III-7 . you can see the problem. When you try to interpret β1 . Also in example 2. as implied by this model. we have promotion control promotion 1 promotion 2 x1 0 1 0 x2 0 0 1 x3 1 0 0 This will create problems in the fitting since x1 + x 2 + x3 = 1 and so the columns of the matrix X are linearly independent.06 ≤ t26 ≤ 2.60 ± 2. it is tempting to simplify the modeling by using a single vector x with elements 0.err # calculate θ # calculate the standard error SE(θ ) we get the estimate of σ = 5. we get the standard error SE(θ ) = 110 .95 .J. We proceed by deleting x3 from the model as in the Example or by deleting the intercept with the R command b <. In this case.change = β 01 + β1 x + β 2 comp. In this case you need to set up indicator variates to represent the categories as described in Note 1. Notes 1. In example 2.

95 We get the prediction interval by replacing the estimators with the corresponding estimates.95 ..x2=0. We also know that u t β ~ G(u t β . σ u t ( X t X ) −1 u ) so ~ Y − u t β ~ G( 0. In the example.Prediction Intervals Suppose we want an interval of plausible values of the response variate for a unit with known values of the explanatory variates. b <. rearranging the inequality ~ ~ ~ ~ Pr(u t β − cσ 1 + u t ( X t X ) −1 u ≤ Y ≤ u t β + cσ 1 + u t ( X t X ) −1 u ) = 0. R ~ G(0.+ β p x p + R = u t β + R.J. suppose in Example 2 we want to predict the percent change in sales for a store that uses promotion 1 with competitor sales $3000.newdata=new..lm(response~x1+x2+ comp. x p ) be the values of the explanatory variates for the unit with response variate that has not been measured.sales) new <. In general. we can describe the behaviour of the response variate by the random variable Y = β 0 + β 1 x1 +.frame in R-speak) with the values of the explanatory variates for which we want to make the prediction. u t β − c σ 1 + u t ( X t X ) − 1 u ≤ y ≤ u t β + cσ 1 + u t ( X t X ) − 1 u ) To illustrate how we can get this interval using R.data. σ ) . level=0. let u t = (1. From the model. then we have. σ 1 + u t ( X t X ) −1 u ) and ~ Y − ut β ~ σ 1 + u t ( X t X ) −1 u ~ tn −( p +1) We use this t distribution to produce prediction intervals for Y.. This is the problem we need to solve in marketvalue assessment example discussed in Chapter 1. σ ) ~ so that Y ~ G(u t β .interval=”p”. x1 .frame(x1=1. If Pr( − c ≤ tn − ( p +1) ≤ c ) = 0... interval.sales=3000) p<-predict(b. The last line prints the fitted value u t β and the prediction interval. MacKay University of Waterloo 2009 III-8 . the option=”p” produces a prediction interval at the given values of the explanatory variates. The third line calculates the interval.95) p The second line creates a new data set (a data.. comp. we get Stat 371 © R.

Interpret the parameter β 0 . [You will need to go back to first principles.oil creates a vector with components the square of those in frag. What can you conclude? b) How does the confidence interval change as we increase the confidence level? ~ c) Suppose we have θ ~ G(θ . Construct a scatterplot of hardness vs the amount of fragrance oil. Using the data in the promotion trial described in this chapter. This interval is wide because of the high variation within stores (σ = 5.txt. Is there any evidence of curvature in the relationship? a) b) c) d) e) 3. Exercises 1. Add a quadratic term to the model (in R f2 <.J. MacKay University of Waterloo 2009 III-9 . Consider the simple model where we assume separate batches are independent. The variates are named hardness and frag. find a 99% confidence interval for the effect of competing sales on the percent change in sales. Find a 95% confidence interval for the average hardness of candles made with 2% fragrance oil.8%.334 ) using the same promotion in the investigation.26998 upr 18. Prove that the components of β are independent if and only if the columns of X are orthogonal. The data are stored in the file hardness.78579 That is.oil + R.] 7.oil. Stat 371 © R. a company that manufactures candle wax examined 20 candles made from batches of wax that have different amounts of fragrance oil added.3% and 18. hardness = β 0 + β 1frag.oil*frag. Derive the confidence interval forθ d) Show that θ 0 is in the 95% confidence interval for θ if and only if the p-value for the test of the hypothesis θ = θ 0 exceeds 5%. In a small study. 2. Some ideas about confidence intervals: a) Using the R-output given for the sales promotion example.frag. we predict the percent change in sales to be between –4.fit lwr [1. a) Find a 95% prediction interval using promotion 2 for a large store where the competing sales are $30000. R ~ G(0.e. Find a 95% confidence interval for β 0 . σ ). The company was interested in understanding the relationship between the hardness of the candles (a technical measurement) and the amount of fragrance oil added. the estimator for a parameter θ and the statistically ~ independent σ with n − ( p + 1) degrees of freedom. Can you see any difficulty with this prediction? b) Construct a prediction interval for the change in sales if promotion 1 is used rather than promotion 2 for the same store (i.oil).] ~ 4. dσ ) .257904 -4. competing sales are fixed).

MacKay University of Waterloo 2009 III-10 . Here. Prove Cov( β . Show that if C is n × k and D is k × n . The estimate of σ is σ = r tr . then tr(CD) = tr( DC) . Let ~ = ( I − H ) R . the trace of the matrix is tr( A) = ∑ aii .J. r 6. One (poor) justification for the denominator n − ( p + 1) ~ is based on the fact that with this choice we have E (σ 2 ) = σ 2 . r r Recall that for any square matrix A . we verify this r result. r r a) b) c) Stat 371 © R. ~) = 0.~ 5. We want to show that E ( ~ t ~ ) = ( n − p − 1)σ 2 . Show that E ( ~ t ~ ) = tr ( E ( RR t )( I − H )) = σ 2 tr ( I − H ) r r Use the result from a) and the fact that H = X ( X t X ) −1 X t to evaluate E ( ~ t ~ ) .

codes: 0 `***' 0.069 0. ratio of building size to the land area. size.432 0.41526 1.01 `*' 0.256 0.72137 0. In fitting the full model. 0=Canada) to the adjusted value (current $ per square foot) for 38 sales. we fit a model with 5 explanatory variates age. age -0.52300 0.3683 F-statistic: 5.49361 3.Chapter 4 The Analysis of Variance In Chapter 3. That is.03786 0.001150 Stat 371 © R.J. we saw how to make formal inference statements about any component β j or any linear combination θ = a t β of the coefficient vector β .6911 -3.001 `**' 0.11653 0.801 0.txt.41588 4. the summary output from R is given below.354867 --Signif. The data are in the file assessment. we look at hypothesis tests that involve several of the parameters simultaneously. Error t value Pr(>|t|) (Intercept) 19.1 ` ' 1 Residual standard error: 5.05 `.77137 4. That is ~ θ −θ ~ ~ tn −( p +1) σd ~ where the constant d is determined by finding stdev(θ ) = σd = σ a t ( X t X ) −1 a .7848 14.08139 1.22e-05 *** office 0.10820 -4.993 on 32 degrees of freedom Multiple R-Squared: 0.8670 -0. p-value: 0. Example 1 In the problem of predicting the market value of a large plant (see Chapter 2).315 on 5 and 32 DF.34128 -1.161926 location 3. MacKay University of Waterloo 2009 IV-1 .' 0. The confidence intervals and hypothesis tests for these one-dimensional parameters are based on a t distribution of the corresponding estimator.939 0.833 3. Call: lm(formula = value ~ size + age + office + ratio + location) Residuals: Min 1Q Median 3Q Max -10.000288 *** size -2.14776 0.8164 2. Adjusted R-squared: 0.5435 Coefficients: Estimate Std. the hypothesis is not one-dimensional. In this chapter.081181 .799388 ratio 0. percentage of office space and the country of sale (1=USA. It cannot be framed in terms of a single parameter.4537.

adjusting for constant background experience.. β 5 = 0 The defining relationships in the hypothesis holds simultaneously for all the parameters listed. The first estimate is valid whether or not the hypothesis is true. Example 2 To compare 5 different versions of a product to the current version. β 3 .The output includes the results of t test of the hypothesis that each coefficient is 0. There is no difference in the versions if β 1 = β 2 =. We can also ask questions such as: Is there any evidence that all of the explanatory variates explain a significant portion of the variation in the response variate. Is there any evidence that only age is an important explanatory variate. the basic idea is to construct a discrepancy measure that is the ratio of two estimates of the residual variance σ 2 . an R&D department conducted a clinic in which each of the 6 versions was assessed by 8 different subjects..6 and j = 1 corresponds to the current version. MacKay University of Waterloo 2009 IV-2 . The second is a valid estimate of σ 2 only if the hypothesis is true. we write the data model as y = β 1 x1 +. scorei : past experience score for subject i Then. given that all the other explanatory variates are in the model. β 4 = 0. or.. β 3 = 0. in terms of the parameters. The data are stored in the file product. The ratio will tend to be Stat 371 © R. score + r Note that there is no intercept term in the model. β 2 = 0... is there any evidence that any one of β 1 . β 5 = 0 .J..... β 5 differ from 0? The corresponding hypothesis is β 1 = 0. = β 6 . To model the data.txt.. β 4 ...+ β 6 x6 + β 7 pst. in terms of the corresponding vectors. is there any evidence that any one of β 1 . After trying the product..48 xij = 1 if subject i used version j. the subjects completed a questionnaire to determine • a score to measure past experience with similar products • a score to measure satisfaction with the proposed version The first question of interest was to see if any of the new versions were different from the original... To test multi-dimensional hypotheses such as those described in Examples 1 and 2. i = 1. We want to examine this multi-dimensional hypothesis.... or.. in terms of the parameters. let yi : satisfaction score for subject i. β 5 differ from 0? The corresponding hypothesis is β 1 = 0.. xij = 0 otherwise where j = 1. pst.

Step 1: Fit the full model.00833 6. Step 4: The discrepancy measure is f = t (rH rH − r t r ) / ( p − q ) σ2 Step 5: To calculate the p-value.34243 3.09971 -4. Example 1 We can illustrate the steps with the assessment data where we have. Note that the sum of squares of the estimated residuals r t r = ∑ ri 2 is an estimate of ( n − p − 1)σ 2 .3488 F-statistic: 20. assuming the hypothesis is true. β 5 = 0 .084) = 1332 542 with Multiple R-Squared: 0.n − p −1 has an F distribution with p − q [numerator] and n − p −1 [denominator] degrees of freedom. codes: 0 `***' 0. we find Pr( F ≥ f ) where F ~ Fp − q .3664. Residual standard squares is 36(6.66e-05 *** --Signif.75e-08 *** age -0.823 -3.05 `.' 0.freedom 36 degrees of freedom. Hence. t ( rH rH − r t r ) / ( p − q ) estimates σ 2 with ( p − q) degrees of freedom.968 2.916 and the residual sum of squares 32(5. β 4 = 0.01 `*' 0.665e-05 Stat 371 © R. to get the residual sum of t t squares rH rH . β 4 = 0.001 `**' 0.097 19. The basic steps are: Step 1: Fit the full model to get the usual estimate of the variance σ 2 = r t r / ( n − p − 1) with n − p −1 degrees of freedom. Call: lm(formula = value ~ age) Residuals: Min 1Q Median 3Q Max -7. n = 38.different from 1 if the hypothesis is not true.432 Coefficients: Estimate Std.314 with 32 degrees of freedom. then rH rH is an estimate of ( n − q − 1)σ 2 with n − q −1 degrees of freedom.084 on 36 degrees of .45498 0. MacKay University of Waterloo 2009 IV-3 . Error t value Pr(>|t|) (Intercept) 20. We can this procedure the Analysis of Variance (ANOVA).1 ` ' 1 2 The residual sum of error: 6. if this is the case.762 6. β 5 = 0 .823 -0. From the R output. we have σ 2 = 5. Adjusted R-squared: 0. Step 3: The second estimate of σ 2 is based on the so-called additional sum of squares t rH rH − r t r that estimates ( p − q )σ 2 if the hypothesis is true. β 3 = 0. Step 2: Fit the reduced model value = β 0 1 + β 2 age + r with β 1 = 0.563 5. If the reduced model has q +1 parameters. in the full model. p-value: 5. β 3 = 0.J. Step 2: Fit the reduced model.993) 2 = 1149. The hypothesis is β 1 = 0.9932 = 35.82 on 1 and 36 DF. p = 5 .

F distribution and F tables Mathematically.2778 0.275 35.Step 3: The second estimate of σ 2 . we get independent estimators which are required for the F distribution. The reason is that by subtraction. MacKay University of Waterloo 2009 IV-4 .30 . There are tables (in the same format as the t tables) in the Appendix.81 5 −1 45. For Example 1. Notes 1. rather than the estimate in Step 3 based on the change in the residual sum of squares. an Fnum. You may have wondered at Step 2 why we could not have used the estimate of σ 2 produced from fitting the reduced model directly. there is no evidence against the hypothesis. We fit both the full model and the reduced model and then apply the anova() function. 2.b) and the corresponding output is Model 1: value ~ age Model 2: value ~ size + office + age + ratio + location Res. assuming the hypothesis is true.32 ≥ 1.92 Step 5: Using the R function 1-pf(1. Calculations with R We can use R to perform all of the calculations. Since the p-value is so large.Df RSS Df Sum of Sq F Pr(>F) 1 36 1332.314 = 45. we see that Pr( F4.275) = 0.542 − 1149. For each tail probability.20 4 183.2992 Stat 371 © R.81 Step 4: The discrepancy measure is f = = 1. is 1332. there is no evidence that the model is improved by adding all of the other explanatory variates.J. there is one page of tables with a column for the numerator degrees of freedom and a row for the denominator degrees of freedom. the code is b<-lm(value~size+office+age+ratio+location) c<-lm(value~age) anova(c.56 1.76 2 32 1149. 4. den random variable with num and den degrees of freedom is defined as K2 χ 2 / num F = num ( = num ) 2 K den χ 2 / den den where the numerator and denominator are independent. An F random variable is always positive and has mean close to 1. In other words.275.32) or the Tables in the Appendix. once age is included in the model.

score ~ pst.. = β 6 = β .60391 5 0. In the summary output from fitting with lm().score Res.e..... = β p = 0 .. with the full model. to fit a model without β 0 ) Next we fit the reduced model with β 1 = β 2 =.001146 ** Since the F ratio is so large (p-value 0.+ x6 = 1 .82 with a pvalue that is very small so there is very strong evidence that one or more of the coefficients differ from 0.. this corresponds to the model with a constant term and the single explanatory variate pst. Stat 371 © R. ) we put the reduced model first. c<-lm(sat.score) Note that the –1 tells R to not include a constant term (i. the last line is an F test for the hypothesis that β 1 = β 2 =. there is strong evidence of differences among the 6 versions if the pst. Note that the full model does not have an intercept term. we calculate the additional sum of squares (residual) anova(c.001). In words. In Example 1.score is held fixed. The sum of the vectors corresponding to the six indicators is x1 +.score ~ -1 + x1 + x2 + x3 + x4 + x5 + x6 + pst.score Model 2: sat. if this hypothesis is true. = β 6 which corresponds to all versions being the same.Df RSS Df Sum of Sq F Pr(>F) 1 46 2.58217 2 41 1. If we include an intercept term.J. Since x1 +. Example 2 There are six indicator variables to index the product version and one other explanatory variate..score. then the columns of X are linearly dependent.. MacKay University of Waterloo 2009 IV-5 ..score~-1+x1+x2+x3+x4+x5+x6+pst. none of the explanatory variates is important in explaining variation in the response variate.b) with output Model 1: sat. the output gives the F ratio F = 20. 3. We fit the full model with the code b<-lm(sat.Note in the function anova( . We are interested in the hypothesis β 1 = β 2 =.+ x6 = 1 .97826 5.score~pst.score) To test the hypothesis.0014 0.

If t ~ tk . What are the degrees of freedom? 5.+ β p x p + r .J. explain why the additional sum of squares is always non-negative. Test the hypothesis in the two ways and show that the p-value is identical. d. In the revised model show that x * ⊥1 for all j j Stat 371 © R. Is there any evidence that these versions have significantly different average satisfaction scores? 4. In an industrial example.. show that t 2 has an F distribution. find a constant c so that Pr( F ≥ c) = 0. That is. What is the distribution of 1 / F 2. [This is always true although a nuisance to prove] d.score is 0. the coefficients of the explanatory variates do not change. The data are stored in the file ch4exercise2. Versions 4.] 3. use an F test to address the following questions? a. Explain how we can test the hypothesis using an F test c.05 c. the manufacturer collects 60 observations to build a model to relate a product property y to two quantitative explanatory variates x1 and x2 . c. a. Show that if we replace the vector x j by the vector x * = x j − x j 1.5 and 6 share a common feature. x1 x 2 should be included in the model. Explain why testing the hypothesis β 1 = β 2 =. Find Pr( F ≥ 3) b. Consider again the product testing example described in Exercise 3. b. Consider the hypothesis that the coefficient β 7 of the explanatory variate pst. Theory suggests that a linear model of the form y = β 0 + β 1 x1 + β 2 x2 + r should describe the data. the analyst worries 2 that additional second order terms of the form x12 . the model becomes j * * y = α 0 1 + β 1 x1 +. Some theory a. we can test a hypothesis θ = 0 in two ways... MacKay University of Waterloo 2009 IV-6 .+ β p x p + r . Explain how we can test the hypothesis using a t-test b. If we have a single parameterθ .txt. x 2 . Is there any evidence of differences among the new versions 2 to 6? b. Does the addition of the extra terms contribute significantly to the fit of the model? [Note: In R you can create new variables such as x 22 < − x 2 * x 2 to represent the quadratic terms. However. In the product testing example (Example 2 in Chapter 4). Suppose we have a discrepancy measure with an F distribution with 3 and 30 degrees of freedom. In the construction of the F test. = β p = 0 will yield identical results for either formulation of the model. a. Consider the model y = β 0 1 + β 1 x1 +.Exercises 1....

. x * ) .e...J. β p )t and X* = ( x1* . This quantity is p often called the regression sum of squares.... MacKay University of Waterloo 2009 IV-7 ... show that the additional sum of squares is t β * ( X*t X* )β * where β * = (β 1 . Stat 371 © R. In testing the hypothesis.

.. what do we do about it? In fitting the model y = β 0 1 + β 1 x1 +. Looking at the Estimated Residuals We also assess fit by looking for patterns that would be unusual if the model is “true”. n . In this chapter. ~ according to the model. we consider two situations to look at the first bullet.J.. we are assuming that: • • the mean vector E(Y ) is the specified linear function of the explanatory variates the residuals are gaussian. If not. we have built. Recall that r t −1 t H = X ( X X ) X depends only on X. fit and used a model for a given set of data without questioning any of the underlying assumptions.. independent with constant standard deviation for each unit in the sample We can assess these assumptions in several ways. In the exercises.. according to the model. the estimated residual ri versus the fitted value μ i for i = 1. σ 2 I ) For example. If we find such patterns. We also know that r and μ are orthogonal and. R ~ N (0. The estimated residuals. 2009 V-1 . If we have units in the sample in which the explanatory variates are identical. we are making a number of assumptions about the underlying probability model Y = β 0 1 + β 1 x1 +. we can use ANOVA to assess the fit. we examine the problem of model fit. MacKay University of Waterloo. Stat 371 © R..) to the proposed model and test if the additional terms have significant effects. are derived from the given model.. the components of the vector r .. we are suspicious about the assumptions underlying the model.+ β p x p + r and using the corresponding estimators to construct formal statistical procedures.Chapter 5 Assessing Model Fit To this point. This approach to assessing fit is informal and subjective – we need to be careful not to over-interpret the plots looking for patterns.. r If we plot the individual components. if not. we should see a plot with no obvious patterns. ~ and μ are independent – see the exercises. then we have greater confidence in the form of the mean function in the original model. Also we can add extra terms (squares. Are the assumptions reasonably well met and. σ 2 ( I − H )) . cross products etc. ~ ~ N (0. r = y − μ = y − Xβ ~ r The corresponding estimator ~ = Y − Xβ = ( I − H ) R is a linear combination of the components of R and hence.+ β p x p + R..

txt. Change all the variate values for cases 18 and 27 to NA. The remedy here is to repeat the fitting and analysis with these cases removed to see if the conclusions are substantially effected. office. we can create a plot of the estimated residuals versus the fitted values with the R code b<-lm(value~size+age+ratio+office+location) plot(fitted(b). then we need to decide (not on statistical basis) how to proceed. 2009 V-2 . If we fit a model with 5 explanatory variates size. If they are influential. ratio and location to the measured value. we can delete cases 18 and 27 by editing the data frame a. age.Example 1 Consider again the assessment data discussed in previous chapters and found in the file assessment. we can ignore the poor fit. The plot of the estimated residuals versus the fitted values looks much better. In the example. MacKay University of Waterloo.resid(b)) Does this plot raise any suspicions about the proposed model? The answer is yes since it would surprising (assuming that the model is correct) if the two largest estimated residuals correspond to the two largest fitted values as seen in the plot. Note that the age is still the only significant Stat 371 © R.J. aa<-edit(a) detach(a) attach(aa) Then refit the same model using the 36 units in aa. Otherwise.

The data are stored in the file ch5example2. We make the decision to omit or include the two cases on non-statistical grounds. Example 2 Here is an artificial example to demonstrate the problem and the remedy. we fit a model of the form Stat 371 © R.explanatory variate but that the estimated coefficient changes from –0. MacKay University of Waterloo.txt. the plot of the estimated residuals versus the fitted values (see the left panel on the next page) shows the classic funnel shape indicating that the standard deviation is not constant. When we fit the linear model lm(y~x1+x2). 2009 V-3 . Here the decision was to proceed without these two sales since they corresponded to buildings that were very different from the building in question.1) and Ri ~ G(0. stdev(Y ) =|2 + 3x1 − 2 x 2| .J.17 so the two cases are very influential if we want to predict the value of an unsold building. We may see a funnel shape on the plot the estimated residuals versus the fitted values. In this model.52 to –0. The 50 observations were created from the model Y = (2 + 3x1 − 2 x 2)*(3 + R) where x1 and x2 are uniform on the interval (0.1) independently. That is. This indicates that the standard deviation is not constant but is a function of the mean μ( x ) . the standard deviation of Yi depends on the mean E(Y ) = 3 *(2 + 3x1 − 2 x 2). The remedy is to transform the response variate.

if the form of the model is adequate. we expect to see no patterns on these plots. we standardize and define ri ii the standardized residuals as ri zi = 1 − hii If the gaussian assumption in the model is more or less correct. Large deviations from the line indicate that the gaussian assumption is likely false. We can also plot the estimated residuals versus the explanatory variates and.f ( y) = β 0 + β 1 x1 + β 2 x 2 + r Standard choices for the transformation include the functions log( y).J. y and 1 / y . After taking logarithms. There are many other such diagnostic plots. We can examine the gaussian assumption with a quantile-quantile (usually abbreviated qq) plot. the funnel effect is removed. Again the usual remedy is to transform the Stat 371 © R. 2009 V-4 . To make the standard deviation constant. σ 2 ( I − H )) or r ~ ~ G(0. in spite of the fact that the we know that the original model is not linear in x1 and x2 on this scale. Under the model assumption we have ~ ~ N (0. MacKay University of Waterloo. then we will see a straight line on the qq plot of the standardized residuals. σ 1 − h ) . The plot of the estimated residuals versus the fitted values is the single most useful diagnostic tool.The right hand panel shows the plot of the fitted versus residual values after applying the log transformation.

See the Appendix 2 for the concepts underlying the qq plot. Using R. the plot on the right provides no such evidence against the gaussian assumption. Recall that ~ ~ G(0. We look for outliers in the space of the explanatory variates using the following ri argument. σ 1 − hii ) where hii is the ith diagonal element of the hat t −1 matrix H = X ( X X ) X t .resid(b)/sqrt(1-hatvalues(b)) qqnorm(s) Sensitivity (Case) Analysis We use case analysis to determine in any of the units correspond to an “outlier”. MacKay University of Waterloo. the qq plots of the standardized residuals are shown in the left and right panels respectively. We call a case an outlier if the conclusion of the analysis is materially changed by omitting the unit from the data. These residuals are larger than can be expected if the estimated residuals are a sample from a gaussian distribution. The plot on the left picks up to some degree the two exceptional cases with large estimated residuals. if hii is close to 1.J. We call hii the leverage of the ith case since. After removing these two cases. We show an extreme example of each type of outlier for a single explanatory variate on the next page. For the assessment data with all 38 cases (and the edited file with cases 18 and 27 removed). Note that a case can have unusual values for the explanatory variates.response variate or to delete cases with large (positive or negative) estimated residuals. 2009 V-5 . we can calculate the studentized residuals and create the qq plot with the code s <. Stat 371 © R. with the results of lm() stored in b. the response variate or both.

the fitted value μ i will be close to yi . MacKay University of Waterloo. Stat 371 © R.J. In ri other words. We can extract the hii from any model fit b<-lm(y~…) using the R function hatvalues(b). Note that hii depends only on the explanatory variates and so a case has high leverage (hii close to 1) regardless of the observed response variate. For the assessment data (with the two cases deleted).the standard deviation of ~ is close to 0 and hence we know that ri will be close to 0. There are no exceptional large values. the plot of the leverage versus the index (case number) is shown below. 2009 V-6 . Deletion of a case with high leverage relative to the others may change the fit of the model substantially.

e. ti increases) of the studentized residual ri si = . Stat 371 © R. The calculation of ti looks formidable. MacKay University of Waterloo. the plots of the studentized residuals [plot(rstudent(b)] against the index number shows two exceptional values for the unedited data (left) and none for the edited data (right). for each case i.which gives an explicit t formula for ( X− i X− i ) −1 in terms of ( X t X ) −1 . we compare yi to y − i .5) corresponds to a large value of ti which in turn corresponds to an outlier in the response variate. bigger than 2. For the above assessment data.ˆ To look for outliers in the response variate.J. Using this formula. 2009 V-7 . We calculate the studentized residuals for the model b<-lm(…) using the R function rstudent(b). We are saved by a remarkable formula – see the Exercises on rank 1 update . the predicted value of y if we delete the ith case. σ 1 − hii A large studentized residual (say greater than 2. We • • • delete the ith row uit from the matrix X to get X − i and refit the model to get parameter estimates β − i . σ − i ˆ ˆ calculate the predicted value y−i = uit β −i ˆ yi − y − i calculate a t-statistic ti = t t ˆ σ − i 1 + ui ( X − i X − i ) −1 ui If any ti is large (i. Note the difference in vertical and horizontal scales.5) then we know the response variate of the corresponding case is an outlier. we can rewrite ti as n− p−2 ti = si n − p − si2 a monotone function (as si increases.

Stat 371 © R. Do not forget that the Conclusion is driven by the original Problem. if we find influential cases. In summary.We conclude that there are no other cases that (singly) are highly influential. This issue is beyond the scope of the course. Note that • there are many other ways to measure the influence of single cases • we have looked at cases one-at-a-time. MacKay University of Waterloo. we should repeat the analysis with theses cases deleted and see if the Conclusion is materially changed.J. 2009 V-8 . Changes in the fitted model that do not effect the Conclusion are not important. not in groups so there may still be highly influential small groups of cases.

As well.70e-05 *** x3 0. Is the prediction of value for a building with size 13.4654 0. x2 = 0.1 ` ' 1 Residual standard error: 0.15 Suppose we fit a model y = β 0 + β 1 x1 + β 2 x2 + β 3 x3 + r . Consider the assessment data with simple model value = β 0 + β 1age + β 2 size + residual .119 --Signif.2372 48. There are 8 combinations.969.9558 F-statistic: 73 on 3 and 7 DF.7615 Coefficients: Estimate Std. x2 .1465 0.3000 8. The data are shown below and can be found in the file ch5Exercise2.68e-05 *** x2 -3.Exercises 1.776 0.001 `**' 0.9 and age 30 sensitive to any particular cases? 2.218 7.89 12.357 1. Error t value Pr(>|t|) (Intercept) 11. the investigators looked at the response variate for the so-called center point x1 = 0.34 17.87 9.2527 -0.54 5.21 17.' 0. The summary output from R is Call: lm(formula = y ~ x1 + x2 + x3) Residuals: Min 1Q Median 3Q Max -1.1360 0.01 `*' 0.txt.5329 0. p-value: 1.40 11.05 `. 2009 V-9 . x3 that each were assigned two values.97 11.87e-10 *** x1 2.203e-05 Stat 371 © R. Adjusted R-squared: 0.3000 1.3000 -10.45 7. x3 = 0.7649 on 7 degrees of freedom Multiple R-Squared: 0. here coded as -1 and +1.57 11.J. In an experimental Plan. codes: 0 `***' 0.6181 0. MacKay University of Waterloo. Use the methods in this chapter to assess the fit of the model and to suggest remedies.30 11. there were three explanatory variates x1 . x1 -1 -1 -1 1 1 1 1 0 0 0 0 x2 -1 1 1 -1 -1 1 1 0 0 0 0 x3 1 -1 1 -1 1 -1 1 0 0 0 0 y 11.985 3.1071 0.4419 0.

3. MacKay University of Waterloo. Find the constant a so that ( I + vu t ) −1 = I + avu t [This is known as a rank one update] b) If C = B + uu t where B is invertible. two of which index the promotion used. use the additional residual sum of squares to test the hypothesis that the extended model is necessary. consider two formal approaches. Stat 371 © R. Show that the residual sum of squares from fitting this model is ∑ ∑ (yij − yi ) 2 where i indexes the unique sets of explanatory variate values and j i j indexes the replicated observations within these sets. Show t t that X t X = X −1 X −1 + u1u1t and hence find an expression for ( X −1 X −1 ) −1 . For the given data. This is called a “pure residual” test of fit. x1 x 2 . 2 a) Add quadratic terms x12 .txt) in which a marketing firm wanted to compare two sales promotions against a control. a) Suppose u and v are two n ×1 column vectors and A = I + vu t .J. x 2 to the model and then test the hypothesis that the additional terms are unnecessary. b) Consider an extended model in which the mean of Y is a function μ( x1 . a) After fitting the full model. find an expression for C −1 . The key step is to find an expression for the inverse of t X −1 X −1 where X−1 is the matrix X with the first row u1t omitted. x2 ) with no further specification.We drop x 3 from the model. is there any evidence of lack of fit? b) Suppose the primary question is to compare the two promotions adjusting for past and competitor’s sales. If the model is correct. Are there any cases that have a large influence on the conclusion about this comparison? ~ 4. We give the basic mathematics behind the arithmetic that we use for the calculations when deleting a single case. r ~ r What does this suggest about the plot of the estimates ri versus μ 5. 2009 V-10 . c) Suppose we consider dropping the first case when fitting the model y = Xβ + r . To assess the fit of the model. The response variate is the weekly sales and there are four explanatory variates. μ ) = 0 and hence μ and ~ are independent. Consider the data described in Chapter 3 (the file is trial. show that Cov( ~.

leave out the least important variate. Keep deleting until all of the included variates have coefficients significantly different from 0. we add to the model complexity and we can also distort the conclusion. The data are stored in ch6example1. the model with the largest F-ratio.u3<-rnorm(100). Then add a second variate to the model that maximizes the increase in R 2 and has a coefficient significantly different from 0. We create an artificial example to demonstrate this point. This decision is important if we want to get a final model that is as simple and useful as possible and fits the data well. Continue until we can find no more important variates to add. If we include unnecessary terms.x3<-u1+2*u2-u3.Chapter 6 Model Building In many applications.x2<-u1+u2.u4<-rnorm(100). We consider three strategies: 1. if the problem is to • • predict the value of the response variate for a given set of values for the explanatory variates assess the effect of a particular explanatory variate (or variates) on the response variate when controlling for a number of other explanatory variates we may or may not include some of the explanatory variates in the model. You might expect that strategies 1 and 2 would get the same answer but this is not always the case. Select two or three of each size and then use a criterion that balances the value of R 2 against the addition of extra variates to pick the “best” model. For example. (Backwards elimination) Start by fitting the full model. based on the p-value for the test of the hypothesis that the corresponding coefficient is 0. (All regressions) Fit all possible models – there are 2 p − 1 if we have p explanatory variates. 2.e. we have a number of explanatory variates that we can chose to include or delete from the model. select the one with the highest value of R 2 or equivalently. We want to use the data to decide which variates to include or delete. 2009 VI-1 . If any coefficient is judged not significantly different from 0.x4<-u1+u4 y<-x1+1.J. Example 1 We create the data with the code: u1<-rnorm(100). University of Waterloo.u2<-rnorm(100).2*x2-0. MacKay. (Forward selection) Start with a simple one-variate models and select the one that best explains the variation in the response variate i.r<-rnorm(100) x1<-u1.txt. 3.5*x3+2*r Stat 371 © R.

. leave out some of the explanatory variates). x1 . x3 and x4. we stop and select the three-variate model that includes x1. We start by fitting all of the one-variate models and picking the one with a significant coefficient (p-value less than 10%) and the highest value of R 2 . only the estimate of σ and the degrees of freedom change from the full model.2491 0. The reason that we get different answers from such procedures is largely due to the correlation among the explanatory variates.4567 0. we stop and the selected model includes only x1. We drop x4 and fit the three-variate model in which we find all coefficients significant and R2 = 0.4744 .. University of Waterloo. In the full model we have x1.4550 0.2439 We select x1 and proceed by fitting the 3 two-variate models that include x1. The extreme opposite of orthogonality occurs if the vectors 1. Model lm(y~x1+x2) lm(y~x1+x3) lm(y~x1+x4) Significant variates x1 x1 x1 R2 0. If the vectors 1. σ / x tj x j ) depend only on x j . MacKay.x2. In this case we have trouble with least squares since X t X is singular and there are many models that give the same minimum value for the sum of Stat 371 © R. β j and β j do not change. x p are linearly dependent or collinear..0408 0. then X t X is a diagonal matrix with entries x tj x j . We then try strategy 1 where we add significant variates. 2009 VI-2 . that involve adding or eliminating variates based on testing the significance of their coefficients in a given fit.J.Note that the second line introduces relationships among x1. If we fit any sub~ model (i. and ( X t X ) −1 is also diagonal with entries 1 / x tj x j . The results are Model lm(y~x1) lm(y~x2) lm(y~x3) lm(y~x4) Significant variates x1 x2 x3 x4 R2 0. x2.x3 with coefficients judged significantly different from 0 (p-value<10%) and R2 = 0.4560 Using strategy 1. Now using strategy 2.e. According to strategy 2.. There are many other similar strategies..x3.. x1 .. In testing the significance of the coefficients.x2. often called stepwise procedures. we start with the four-variate model and work to eliminate variates.4743 . The estimates of the coefficients β j = x tj y / x tj x j and the ~ corresponding estimators β j ~ G( β j ..4547 0.The corresponding vectors are not close to orthogonal. x p are mutually orthogonal.

We want Radj to be large (i.e. The constants are chosen so that c p ≤ k + 1 for good sub-models with k explanatory variates and a constant term.model) 2 adj where σ 2 ( null . The only difficulty is to specify a criterion for choosing among all of the models. • σ 2 (sub − model) = • estimated residual sum of squares + c k where c > 0 is chosen for calibration. Note that R 2 has the same form with sum 2 of squares of estimated residuals in the numerator and denominator. Suppose that we have p +1 terms in the full model (p explanatory variates plus a constant) and k + 1 terms in a sub-model. On the other hand.J. MacKay. σ 2 (sub . We pick a criterion that balances these two requirements. University of Waterloo.model) = ∑ (y i − y ) 2 / (n − 1) is the estimate of σ 2 if none of the i explanatory variates are included in the model. We specify the first criterion by the so-called adjusted R 2 for a given sub-model. We want to select a model that explains a large portion of the variation in the response variate.squares of the estimated residuals. Recall that R 2 must increase when we add additional variates to the model because the sum of squares of the estimated residuals must get smaller. we want a simple model and adding explanatory variates increases the complexity.model) R = 1− 2 σ ( null . Then two criteria that balance the requirements are based on estimated residual sum of squares n − ( k + 1) Note that both the numerator and denominator decrease as we add terms to the model. strategy 3 is the preferred approach. the more difficult is our problem to select the “best” model. σ 2 (sub − model ) to be small). Again we want this quantity to be small. The closer the explanatory variates are to collinear. Stat 371 © R. We specify the second criterion using Mallow’s “ c p ” statistic for a given sub-model where estimated residual sum of squares cp = + 2( k + 1) − n σ2 The denominator is the estimate of σ 2 from the fit of the full model. With the invention of clever algorithms that can fit all possible models without reinversion of large matrices. 2009 VI-3 . We want this quantity to be small.

4289 -0.2413939 0. MacKay. x2 and x3 with the C p = 3.4578213 0.4523016 1 1 2 2 3 3 4 Note that • • • the two best models of each size are chosen based on cp for the full model.cp.1651 1. We use the artificial example to demonstrate the calculations.032519 6.adjr2) # gets model.339054 3. The R code is library(leaps) e<-regsubsets(y~x1+x2+x3+x4.2827 Coefficients: Estimate Std.737843 4. c p = p + 1 by definition in the example.484 0.000746 *** Stat 371 © R.04579 0.566221 39. 2009 VI-4 .2520 6.4454612 0. data=a.211471 4. We start with the command library(leaps) This loads a set of functions that we need to fit a large number of models simultaneously.000000 adjr2 0. Cp and adjusted R-squared The output is (Intercept) x1 x2 x3 x4 1 1 0 0 0 1 0 1 0 0 1 1 1 0 0 1 1 0 0 1 1 1 1 1 0 1 1 1 0 1 1 1 1 1 1 cp 2.4407583 0.21029 -0.218 0.828073 x1 1.4447408 0.J.023301 5. University of Waterloo. x2 and x3.2513 -1. If you have not downloaded the leaps. nbest=2) # finds the best submodels of size 2 f<-summary(e) # extracts useful information from e detach(a) attach(f) cbind(which. we get the summary output Residuals: Min 1Q Median 3Q Max -4.03 ≈ 3 and R = 0. the best model includes x1.38243 3. use the Package menu to get it.4491371 0. Error t value Pr(>|t|) (Intercept) -0.We can use the package leaps in R to evaluate these criteria for a large number of submodels.33241 0.458 2 adj If we fit the model with explanatory variates x1.

J.001 `**' 0. Below we give a summary of the fit and • a plot of the estimated residuals versus the fitted values • a qq plot of the standardized residuals • a plot of the leverages.31162 -3.260 0.86e-10 *** 0.78717 -4.056 on 96 degrees of freedom Multiple R-Squared: 0.57860 17.65047 -1.47105 -1. --Signif.73867 0.098321 .063970 . hii .082105 . University of Waterloo. 0.171e-13 Just for fun.792 0. Adjusted R-squared: 0.57e-06 *** 0.81162 0.670 0.000345 *** (Intercept) size stories baths rooms age lotsize Stat 371 © R.92366 1. this time using data from house sales with the object of predicting the market value for other homes in the same region. We start the analysis by fitting a full model and examining how well this model fits. 1.03983 -0.178 9.758 16. p-value: 2.750 0.04331 3.49019 1. 2009 VI-5 .076281 . x3 -0.01 `*' 0. note that the selected model and parameter estimates match the “true” model and parameter values used to generate the data. Example 2 The sample is 100 homes that have been sold in a given region.txt. the data are: size: the size in m2 baths: the number of bathrooms – bathrooms with only a basin and toilet count ½ rooms: the number of rooms at or above ground level age: age in years lotsize: size of the lot in m2 basement: whether or not the basement is substantially finished garage: whether or not the house has a garage value: the selling price in $000 The data are stored in the file mkvalue.874 0. We end with another market value assessment example. For each home. MacKay.87 on 3 and 96 DF.795258 7.19540 7.05 `.78106 1.x2 0. the diagonal elements of H • a plot of the studentized residuals Estimate -65.650142 0.21994 7.1 ` ' 1 Residual standard error: 2. codes: 0 `***' 0.91862 0.719 Pr(>|t|) 0.4578 F-statistic: 28.455 5.16106 Std.20450 -1.4743. Error t value 39.' 0.36646 0. This was likely fortunate since there are other “good” models that we might have selected.

50462 9.01 `*' 0.11014 0. codes: 0 `***' 0. lotsize and garage. The output of regsubsets( ) is presented on the next page There are several attractive models but the best (simplest) includes size.6414 F-statistic: 23.644.05 `.' 0.basement 1. 2009 VI-6 .006253 ** --Signif.869185 garage -42.165 0. Adjusted R-squared: 0. We can refit this simpler model and assess the fit as above. University of Waterloo.13 < 5 .6704. we have Radj 2 = 0. Cp = 4.2e-16 The relatively low value of R2 = 0.1 ` ' 1 Residual standard error: 41. age. We continue in any case. We now search for a simpler model by looking at the best 2 of all possible models. There is no apparent lack of fit or influential cases as demonstrated by the four plots. p-value: < 2.11920 15. Stat 371 © R.001 `**' 0.799 0. For this model.19 on 91 degrees of freedom Multiple R-Squared: 0.J. MacKay.67 indicates that we are likely to have large prediction error.04650 -2.14 on 8 and 91 DF.

Show that the estimate of β j . d) How do the results of the three strategies compare in this case. Note that the columns of X are not orthogonal.05 to decide to proceed.549 0. b) Fit a model using backwards selection using a p-value of 0. MacKay. Pick a reasonable model..00 10.317 0.21 7. c) Use leaps to investigate all possible models.625 0.32 6.09 5.00 1. Stat 371 © R. a) Fit a model using forward selection. The model used to generate the data was Y = 3 x1 + 0.09 5. is not dependent on which are columns of X are included in the model.15 90.63 21.13 9. At each step.txt contains a response variate y and 10 explanatory variates x1 .645 0.572 0. the coefficient of x j . R ~ G(0. use a p-value of 0. Suppose the columns of X are orthogonal.06 9.560 0.645 0. Show that c p = p + 1 for the full model that includes all p explanatory variates.14 16.651 0. The file ch6exercise3.02 7. 2009 VI-7 .2).3 x2 − 2 x 4 + x 7 − x9 + R. University of Waterloo.32 3..641 0. 3.648 0.55 25.593 0. 2.05 to decide to proceed at each step..86 4.(Intercept) size stories baths rooms age lotsize basement garage 1 1 1 0 0 0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 2 1 1 0 0 0 1 0 0 0 2 1 1 1 0 0 0 0 0 0 3 1 1 0 0 0 1 1 0 0 3 1 1 0 0 0 1 0 0 1 4 1 1 0 0 0 1 1 0 1 4 1 1 1 0 0 1 1 0 0 5 1 1 1 0 0 1 1 0 1 5 1 1 0 0 1 1 1 0 1 6 1 1 1 1 0 1 1 0 1 6 1 1 1 0 1 1 1 0 1 7 1 1 1 1 1 1 1 0 1 7 1 1 1 1 0 1 1 1 1 8 1 1 1 1 1 1 1 1 1 Exercises adjr2 0. These data were created artificially for practice..618 0.J.648 0.641 cp 27. x10 for 100 cases.644 0.

persons living on Indian Reserves. the daily poll on the Netscape home page http://www.ca/english/sdds/3701.htm).com/ ) or highly complex and regular (e. In this chapter. The reasons for using a sample survey rather than a census of the target population to learn about attributes are • cost • timeliness • ethical issues relating to efficient use of resources • the improved quality of the estimates available from a carefully conducted survey rather than a sloppy census.statcan.J. For formal surveys.Chapter 7 Sample Survey Issues In the second half of the course. The survey can be one-time only and informal (e. A census is an investigation of a population where we try to examine every unit. Excluded from the survey's coverage are residents of the Yukon. There are multiple copies of the book on reserve (UWD1510) in the Davis Centre Library. For the most part. we follow the book “Sampling: Design and Analysis” by S. We use some specialized language to describe survey methodology within the PPDAC framework. University of Waterloo. the units are defined as: “LFS covers the civilian.g.” Stat 371 © R. non-institutionalised population 15 years of age and over. see http://www. we concentrate on the sampling protocol. For details.netscape. we deal with • • • • the language of sample surveys examples of sampling protocols classification of error (the difference between the estimate and the attribute) assessment of error Sample surveys are widely used to estimate attributes of interest in a specified target population.L.g. See the reference list attached to the course outline. MacKay. We will cover most of the material in Chapters 1-4. Example In the Labour Force Survey (the quoted material is from the above web site). Lohr. full-time members of the Canadian Armed Forces and inmates of institutions. These groups together represent an exclusion of less than 2% of the population aged 15 and over. 2009 VII-1 . the Canadian Labour Force survey that estimates unemployment rates across Canada on a month to month basis. Surveys are used to estimate attributes of human populations as in the above examples and also any other collection of objects such as financial records. Note that we often select units in clusters to implement a sampling protocol. we consider the planning and analysis of simple sample surveys. Northwest Territories and Nunavut.

A probability sampling protocol uses a probability distribution to select the sample from the frame. University of Waterloo. Formal surveys have a frame. In the Labour force Survey. The first stage of sampling consists of Stat 371 © R. The second frame is a list of households within each selected cluster. N } . from within each stratum.. MacKay. the study population is only vaguely specified. the sampling protocol is described as: “The LFS uses a probability sample that is based on a stratified multi-stage design.. The frame defines the study population.The sampling protocol does not choose units (people who meet the inclusion criteria) directly. Instead. We will look at ways to implement such a protocol later.1220} and. One frame is a list of clusters within each geographic stratum. Here we write the frame as U = {1. Sampling Protocols There are many sampling protocols that can be used to select the sample from the study population.. Developing a good frame (one which covers the target population) is often one of the most expensive components of conducting the survey. Example: Suppose an auditor has a file of 1220 records and plans to select a sample of 20 records to examine the quality of the file. 2. for any subset S of U . The second stage of sampling consists of selecting households from within each selected cluster.” We call the households the sampling units. a sample of households is selected and then variates are measured on every appropriate unit in the selected households. In this instance. 2.. Each province is divided into large geographic strata. there are separate frames for each stage of the sampling.J... 2009 VII-2 . a protocol in which all samples of size 20 have the same probability of selection. “The LFS uses a probability sample that is based on a stratified multi-stage design. if the frame is denoted by U = {1. More formally. Note that informal surveys such as the Netscape poll do not use a frame since the units are selfselecting. Each province is divided into large geographic stratum. The first stage of sampling consists of selecting smaller geographic areas. we have ⎧ 1 ⎪ ⎛1220 ⎞ ⎪ Pr( S ) = ⎨ ⎜ ⎟ ⎪ ⎝ 20 ⎠ ⎪ 0 ⎩ if S has size 20 otherwise Example: For the labour force survey.. called clusters. The auditor decides to use simple random sampling (SRS). The frame is the list of sampling units on which the sampling protocol operates.. then a probability sampling protocol assigns a probability to every subset of U and the sample is selected according to this distribution.

Suppose the attributes of interest in the respondent and non-respondent populations are different Then the sample attributes may not match those in the frame because one or more units in the sample may have refused to provide data. income profile and gender. The second stage of sampling consists of selecting dwellings from within each selected cluster. This is a major advantage of these sampling protocols. the error that occurs in drawing conclusions about the target population from the sample. For surveys of human populations.e. match the target population with respect to the attributes of interest). many internet polls quota sampling – units are selected so that some attributes of the sample match known attributes in the target population e.g. study error is called frame error. Measurement error: the difference in the attributes of interest due to the difference between the true and measured values of the variates on the units in the sample.g.000 individuals. from within each stratum. Errors In applying PPDAC to estimate population attributes.” There are many non-probability sampling protocols. • We concentrate on formal surveys that use probability sampling protocols since we can use mathematical tools to assess. Stat 371 © R. a survey of people in a mall by a marketing firm self-selection sampling – units choose themselves.g. usually with little control e.J. in a marketing survey. the monthly LFS sample size has been approximately 54.” “Since July 1995. The attributes of the units listed in the frame may not match those of the target population. judgment sampling – units are selected so that the samplers think that the sample will be representative of the target population (i. Some examples are: • • • convenience sampling – take what you can get e. In the context of sample surveys. University of Waterloo. we can classify errors as: Study error: the difference in attributes of interest between the target and study population. called clusters. Sample error: the difference in attributes of interest between the study population and the sample. 2009 VII-3 . each interviewer is directed to find a sample whose attributes match the local population in terms of age.selecting smaller geographic areas.000 households. resulting in the collection of labour market information for approximately 100. at least partially. MacKay. an important component of sample error is nonresponse error.

The following list is adapted from Lohr pages 10-15. recall error. The proper design of the questionnaire and a good plan for its administration can substantially reduce non-response error. 2009 VII-4 . There are many books and papers written on questionnaire design. Measurement error may occur because of systematic differences in interviewers. We have all seen statements such as “19 times out of 20. Non-respondents Respondents Intended sample Actual sample The actual sample is different from the intended sample and the attributes in the actual sample may not match those in the frame. MacKay.J. Interviewers may influence the responses by using a different protocol for asking the questions.If we divide the frame into those units that would respond and those that would not. This is a very complex subject. Questionnaire Design Here is a brief set of considerations in designing the instrument (the questionnaire) for the survey of a human population. Stat 371 © R. people in the sample may lie. we can see the effect of non-response error. The confidence interval does not capture uncertainty due to frame error. hire an expert. We can control these latter sources of error only through good planning and execution of the survey. forget or modify their answers to please the interviewer. If you are involved in an important survey. a sample of this size is accurate to within 3 percentage points” at the bottom of the conclusions from a survey. error due misunderstanding the question and so on. This confidence interval captures the uncertainty due to a component of the sample and measurement error. University of Waterloo. systematic errors in the sampling protocol and measurement system etc. Measurement error may also occur if the question we pose does not match the question used to define the response variate in the target population. non-response. We use the probability model that generated the sample to describe how the sample attributes would behave if we were to repeat the same sampling protocol over and over.

2009 VII-5 . MacKay.• • • • • • • • • • • • • • Decide what you want to find out (understand the Problem) Keep the questions clear and simple Use specific instead of general questions Decide whether to use open-ended or closed questions Ask only one concept in each question Use forced choice rather than agree/disagree questions Avoid leading questions and contexts Relate each question to your objective – what will you do with the data? Keep the questionnaire short. Plan to report the actual questions used Stat 371 © R. Explain the purpose of the survey Ensure confidentiality Pay attention to question-order effects Test your questions before the survey.J. University of Waterloo.

.102. to generate confidence intervals and hypothesis tests for model parameters that represent attributes of interest in the study population.10000}.J.. Stat 371 © R. then we know that the model is appropriate...1000}. If we execute the protocol as planned.. ... we first examine several probability sampling protocols and then look at simple random sampling (SRS) in detail. Two-stage sampling: Select the sample in two stages.. The major advantage of probability sampling is that the sampling protocol produces a statistical model that we can use to assess sample error i.9901}...000 and n = 100 . Select 10 clusters using simple random sampling.. each with the same probability. University of Waterloo. 10 FG H IJ K Systematic sampling: Define clusters with n = 100 units per cluster C1 = {1101.. For example. a protocol that selects units for the sample based on a probability model on subsets of the frame... 100 FG H IJ K Stratified random sampling: divide the frame into sub-frames called strata.. select a simple random sample of size 10...... C100 = {100.20}.. for example C1 = {1.10}..... C1000 = {9991.... There are 100 possible samples. We call this protocol systematic because we can select the sample by choosing the first unit from {1.e.100} at random and then taking every subsequent 100th unit. each with the same probability. U 10 = {9 001. We consider protocols where the sample size n is fixed so that the only subsets with positive probability have n units..10 000} .. Denote the frame by the set U = {1... In this chapter..2... N } so that there are N units in the frame. For example.Chapter 8 Probability Sampling Formal surveys use probability sampling.. There are G H 10 K samples. There are possible samples... 10000 Simple random sampling: all samples of size 100 have the same probability.... C2 = {11..200..... C2 = {2. The sample is the 100 units in the 10 1000 selected clusters. Here are some common sampling protocols explained in terms of an example.10000}. each with the same probability. 10 possible Cluster sampling: Divide the frame into clusters.. Use simple random sampling to select one cluster as the sample.9902}... MacKay. F1000IJ For each stratum. U 1 = {1....... Example: Suppose N = 10. 2009 VIII-1 . Then a probability sampling protocol specifies the probability that the sample is s for any subset of s ⊂ U ..

Stat 371 © R. In the above example. Also note that we use SRS within each of the above protocols so we need to understand the properties of this most important protocol.J. Simple Random Sampling For simple random sampling. above 10 possible samples of primary units. FG IJ H K FG IJ H K FG H FG IJ H K IJ K As we saw in the above example. We cannot define simple random sampling by saying that every unit has the same chance of being included in the sample.Stage 1: Select two strata (here called primary units) from the 10 described in 2. One advantage of two stage sampling is that we only need to build a frame at the second stage for those primary units selected in the first stage. The inclusion probability for unit i in the frame is n N −1 n −1 n pi = = N N n Note that the numerator is the number of samples that contain the particular unit. FG IJ H K F10I F1000IJ There are G J G H 2 K H 50 K 2 possible samples. There are 2 Stage 2: Select 50 units from each of the two selected primary units using simple random sampling. University of Waterloo. 2009 VIII-2 . We use the shortened form of the name. there are many sampling protocols with the same inclusion probabilities as SRS. you can show that pi = 100 for each of the described sampling protocols – see the exercises. we select n units from a frame of N units so that each N sample of size n has the same probability of selection. using simple random sampling. Since there are possible n N samples. each has probability 1 / . each with the same probability. MacKay. Technically. You should be able to provide definitions of these sampling protocols in general. this protocol is often called simple random sampling without replacement since we do not allow the same unit to be included in the sample more than once. Note that complex surveys such as the labour force survey use multi-stage sampling with stratification in the primary stage and cluster sampling (the clusters are households) in the ultimate stage. The inclusion probability pi for any unit i in the frame is the probability that the unit is 1 for each unit i included in the sample.

J. 2009 VIII-3 .... we use the probability mechanism that generated the sample to look at the properties of the estimates if we were to repeat the sampling over and over.... Instead. N 0 otherwise FG IJ H K R S T Then we can write n n −1 in terms of the indicator random variables I1 . we denote the standard deviation in the frame by σ and the sample standard deviation by n −1 ˆ ˆ where ri = yi − μ is the estimated residual. stdev( μ ) = (1 − n σ ) N n Stat 371 © R. We denote this average in the ∑ yi iε s frame (study population) by μ and the sample average by μ = where s is the n selected sample. However we can find many of its properties. we define the estimators ~ ∑ yi ~ ∑ ( yi − μ ) 2 ~ μ = iεS . We have the following important results: μ= ∑I y ε i i U i . Let 1 if unit i is in the sample Ii = i = 1. Note that the sums are over the entire frame U.. σ= ∑ I ( y − μ) ε i i i U 2 For simple random sampling of n units from a frame of N units. σ = iεS n n −1 1 where S is a random subset with Pr( S = s) = for every subset s ⊂ U of size n. It is N n convenient to re-express the estimators in terms of random variables rather than a random subset. MacKay.Let yi be the value of the response variate for unit i . we have E ( μ ) = μ . That is. ~ We cannot calculate the exact distribution of the estimator μ . University of Waterloo. I N . Similarly.. ˆ σ= ∑( y ε i s i ˆ − μ )2 = ∑ rˆ ε i i s 2 n −1 We do not build a response model here as we did in the earlier part of the course. Suppose that we are interested in estimating the average response in the target population.

i .V j ) = E (VV j ) − E (Vi ) E (V j ) .. 2009 VIII-4 .J. note n n n n so E ( I i ) = 0(1 − ) + 1 = .Vn . we have FG N . for any linear combination of dependent random variables V1 .~ ~ We call μ an unbiased estimator of μ since E ( μ ) = μ .. n iε U i ≠ j . we have Var( μ ) = 2 {∑ yi2 Var ( Ii ) + ∑ yi y j Cov( Ii . I j )}. we need to find E[ I i I j ] . jε U ∑ yi y j 1 n 1 i ≠ j . we have n n Var ( I i ) = E ( I i2 ) − E ( I i ) 2 = Pr( I i = 1) − Pr( I i = 1) 2 = (1 − ) N N To find the covariance of I i and I j ..2K = n(n − 1) E( I I ) = Pr( units i and j are both in the sample) = FG NIJ N ( N − 1) H nK i j so the covariance is Cov( I i . and hence we have that Pr( I i = 1) = N N N N iε U ~ E (μ) = E ( ∑I y i i = = ) n ∑ E ( Ii ) yi iε U n ∑ (n / N ) yi iε U =μ ~ To prove the formula for stdev( μ ) . i 1 ~ Applying this result. jε U 2 } = (1 − ) {∑ yi − n N N iε U N −1 Stat 371 © R. we need the result from Stat 230 that.V j ) i i i≠ j where Cov (Vi .. jε U Since I i is an indicator random variable. n Var ( ∑ aiVi ) = ∑ ai2Var (Vi ) + ∑ ai a j Cov (Vi . i . N ( N − 1) N N N N −1 1 n 1 n n n ~ Var( μ ) = 2 { (1 − )∑ yi2 − (1 − ) ∑ yi y j} n N N iε U N N N − 1 i ≠ j . I j ) = E ( I i I j ) − E ( I i ) E ( I j ) = Combining the above results we get n(n − 1) n 2 n n 1 − 2 = − (1 − ) . To prove this statement. MacKay. i .2IJ H n . Since the product is 0 unless I i = 1 and I j = 1 . University of Waterloo.

then We use this result to build confidence intervals for various attributes of interest. ∑ yi y j 1 i ≠ j . at last. University of Waterloo. the proportion of the population units included in the sample. U iε U iε U = = = = 1 {N ∑ yi2 − (∑ yi2 + ∑ yi y j )} N − 1 iε U iε U i ≠ j .We can simplify the expression inside the braces with a bit of algebra. The factor 1 − n = 1 − f is called the finite population correction factor (fpc) where f is N the sampling fraction. i . then there can be a significant reduction in the standard deviation. ijε U 2 ∑ yi − N − 1 = N − 1 {( N − 1)∑ yi2 − i≠ j∑jεyi y j )} . Here we state one of the consequences of the theorem avoiding all technicalities. 1) approximately. 1− f σ average σ If N. we need one more fact. the sampling fraction is negligible and we ignore the fpc. we have Var ( μ ) = (1 − ) N n N n The formula for the standard deviation of an average is the basic result in the sampling part of the course. MacKay. We use the result repeatedly so it is worthwhile learning it. jε U 1 {N ∑ yi2 − (∑ yi ) 2} N − 1 iε U iε U 1 {N ∑ yi2 − N 2 μ 2} N − 1 iε U N {∑ yi2 − Nμ 2} N − 1 iε U n σ n σ2 ~ ~ and stdev( μ ) = (1 − )1/ 2 as required. If the sampling fraction is appreciable. ~ We can also show that E(σ 2 ) = σ 2 . To use the above results. The standard deviation of the the estimator corresponding to the sample average.i.J. For many applications. Stat 371 © R. n and N-n are suitably large. The factor arises n because the terms in the sum that defines the estimator are dependent.see the exercises. is the usual standard deviation for an = Nσ 2 multiplied by the square root of the correction factor. using the model generated by simple random sampling. 2009 VIII-5 . a version of the Central Limit Theorem for the average of a sequence of dependent random variables. ~ n (μ − μ) ~ ~ G(0. so.

48 3586.value 1335 0. The total stated value was $4.08 1446 3586.22 = (1 − = 276.78 847 4. To assess the precision of this estimate.27 10434. MacKay.00 1294 1941.08 1106 1.29 with corresponding sample standard deviation $1997.price stated.65 847 4192. There were 11 items with count errors in the sample.78 1106 1249.22. Here the standard error is ˆ σ n 50 1/ 2 1997.00 1269 2.712.65 1016 10.value actual.29 .48 1419 4597.76 1427 3.61 814.32 1016 10434. The purposes of the sampling were to assess: • the true total value of the inventory • the average dollar error per item • the proportion of items with counts in error The auditor selected a simple random sample of n = 50 items and re-counted (actually a pair of co-op students did the counts).Example 1 An auditor has a file of stated counts and prices for N = 1256 56 items stored in a warehouse of a producer of small automotive parts.35 1335 814. ˆ μ ± c × standard error(estimate) where the standard error is the estimate of the standard deviation of the estimator μ .95 4192.24 4623.49 1446 2.number item.81 4296.50 1941.88 1192 1954.04 2588. If yi is the actual value of item number i.J.13 1249. 2009 VIII-6 . we start by estimating the population average μ and then use the fact that τ = N μ . item 1 25 39 53 56 121 207 212 223 225 stated. The sample average dollar error (actual value – stated value) was $2. then we ˆ estimate μ using the sample average μ = 2895. The first 10 lines of the data are shown below and the complete data set is available in the file inventory.number actual.txt The average actual value of the sampled items was $2895. we find an approximate 95% confidence interval based on the approximation to ~ the distribution of μ described above.311.32 Table 8.35 1192 1.33 and the sample standard deviation of the dollar errors was $41.49 1529 4296.77 (1 − )1/ 2 ) N 1256 n 50 Stat 371 © R.1 Part of the Sampled Data To estimate the total actual value τ .93.88 1294 1. The auditor planned to use SRS to select a sample of item numbers and then physically count the actual number of those selected items. The form of the interval is the same as usual.64 1954.76 1269 2588. University of Waterloo.56 1529 2.

47 . 342 This interval is very wide meaning we have estimated the total value of the inventory very imprecisely. University of Waterloo.712. One possibility is to increase the sample size but see below. we use c = 1. the confidence interval for μ is 2895.81.93 = 5. Hence the 95% confidence interval 1256 50 for the average error is $2. This procedure is called difference estimation.J.33 ± $11. 484 ± 681. it is difficult to assess whether or not there are material errors in the inventory. The sample average error is ˆ μerror = 2. We are interested in the total actual value τ = N μ and we can get a 95% confidence interval for τ by multiplying the above interval by N = 1256 . The interval is 3. 636. we estimate the true total value as the total stated value ($4. A 95% confidence interval for the total true value of the inventory is then 4311712 + 2926 ± 14306 or $4. Substituting.39 . where the standard error is 1− The average error more precisely estimated than is the average actual value. Stat 371 © R.3% of the estimated total so the difference between the stated and actual value of the inventory is likely immaterial. The estimate of the total error is ˆ τˆerror = N μerror = 1256 × 2. Compared to the stated value of $4.For a 95% confidence interval. Note by exploiting the known information (the total stated value) and the fact that we can estimate the average error with much smaller standard error than the average actual value.306 The confidence limits are about 0.33 ± 11.311.638 ± $14. The average actual value is poorly estimated.39) or 2926 ± 14306 . A 95% confidence interval for the average error is ˆ μerror ± c × standard error 50 41.93.712) plus the total dollar error.33 with corresponding sample standard deviation 41. 2009 VIII-7 . MacKay.314. We can exploit this result and the fact that we know the total stated value to get a better estimate of the total actual value. We use the same methodology to estimate the average error.48 so the 95% confidence interval for τ error is 1256 × (2.96 from the standard gaussian (last row of the t tables).29 ± 542.33 = $2926. we can get a much more precise estimate of the total actual value.311. Since the error was defined as the actual value minus the stated value.

totals and proportions.To estimate the proportion of items with counts in error. let yi = 1 if the ith item is in ∑ yi error and yi = 0 otherwise.418 Note for a binary response variate.96 1 − 50 0. We use the same theory to get estimates and approximate confidence intervals for averages. the finite population correction factor 1 − f = 1 − 50 = 0. 3. MacKay.98 ) and we could have safely ignored it.J. The attribute of interest is π = iεU .418 or 0.22 ± 0. The form of the interval is always estimate ± c × standard error(estimate) 2. University of Waterloo.11 1256 50 There is considerable evidence that more than 10% of the item counts are in error. An approximate 95% confidence interval for π is π ± 1. 2009 VIII-8 . the population average N of the binary variate. In the above example. Notes 1. In the case of a binary response. Stat 371 © R. the sample standard deviation is nπ (1 − π ) ≈ π (1 − π ) n −1 We usually ignore the factor n / n − 1 when we apply this formula. though the average error is likely to be small and the total error immaterial. the sample standard deviation is a function of the sample average proportion. The sample average is 11 and the sample standard deviation is π= 50 ∑ (y − y) ε i i s 2 49 = ∑y ε i s i s 2 i − 50 y 2 49 = = ∑ y − 50π ε i 2 49 50π (1 − π ) 49 = 0.96 1256 played a very small role (especially as it enters the calculations as 1 − f = 0. We can use the same theory as above.

Here we are using cluster sampling with the clusters defined as the cartons.96(1 − or 0. then the proportion of defective items in the population is 2000μ μ = π= 12(2000) 12 ˆ 30 1/ 2 σ ˆ A 95% confidence interval for μ is μ ± 1. “How large a sample do I need?” The obvious answer is “What is your objective?” It may take some effort to elicit a specific response but with some guidance. University of Waterloo. Since we will select only one sample. we base the sample size determination on the attribute of primary interest.Example 2 To assess the quality of a shipment of 2000 cartons of headlights (packed 12 to a carton). The proportion of defective headlights is poorly estimated but is significantly larger than 0. A headlight is declared defective if it fails to pass any one of a large number of tests.40 .7% ± 3.80 and 1.80 ± 0. the confidence interval should be μ ± l ). See the Exercises for another formulation of the precision in terms of relative error. 0 0 0 1 0 0 1 3 2 4 2 2 0 0 0 0 0 1 0 0 0 0 1 1 2 1 0 3 0 0 The sample average and standard deviation are 0. Sample Size Determination We can use the same theory to answer the most common question in Statistics.J.067 ± 0. MacKay.3% . Suppose that we are interested in estimating a population average μ and we want the confidence interval to be of length 2l (i. Note how we adapt the results from SRS to apply to cluster sampling. 2009 VIII-9 .13 respectively. The data (number of defective items per sampled carton) are shown below. solving for the sample size Stat 371 © R. Hence a ) 2000 30 95% confidence interval for π is 0.33 or 6. we have n σ l = c 1− N n or.e. The attribute of interest is the proportion of headlights that are defective. a manufacturing organization decides to select a sample of 30 cartons using SRS for inspection. assuming we use SRS or a close facsimile. We suppose that we can state the precision in terms of the length of a confidence interval for this attribute. From the above results. a possible frame and the required precision for the estimates. you can determine the target population. the attributes of interest. If μ is the average number of defectives per carton.

suppose that the above description was a pilot survey and the overall goal was to estimate the average error with 95% confidence within plus or minus one dollar. 2009 VIII-10 .96 2 × 41.57 . The primary question has a Yes/No answer and the sample size is selected based on estimating the proportion of adult Canadians π who would answer Yes to the question. we have 1 n= = 1059 1 1 + 1256 1. The client asks for a confidence interval of length 5 percentage points (0. we are forced to examine an extra 1009 items to achieve the desired precision. How many more items do we need to include in the sample? Here we have l = 1. One way to get an idea of σ is to carry out a small pilot survey. we would likely recommend a complete census. We get an estimate of σ to help determine the sample size in the main survey and we also can use the pilot study to test the questionnaire and the rest of the proposed methodology.025.96. c = 1.93 from the initial survey. with more difficulty. In other words.n= 1 l2 1 + 2 2 N cσ To determine the required sample size. University of Waterloo. The nπ (1 − π ) estimated standard deviation will be ≈ π (1 − π ) . To achieve the required precision. we can ignore the term 1 / N and then the required sample size is approximately c 2σ 2 / l 2 .93 is so large.05) with 99% confidence so we have l = 0.J. n −1 The required sample size is Stat 371 © R. we need to specify the confidence level to find c and. N = 1256 and σ = 41. MacKay. we often do not know enough to give a good answer to the question. because σ = 41. Since this is most of the frame.932 Here. Example A polling firm has been hired to conduct a cross-Canada survey to solicit opinions from adults on a number of issues. The answer is very sensitive to the value of σ . Sometimes we can use the results of previous surveys with similar response variates on the same population to get an idea of the value of σ . c = 2. If the second term in the denominator is much larger than the first. guess the value of σ . Example In the audit example.

Because of the difficulty of completing a frame. MacKay. we will meet the requirements. With cluster sampling the frame can be the list of clusters. For many populations.25 = 2642 . For example. we may use cluster or multi-stage sampling instead. We know that if we choose n = 10568 × 0. b) On a final examination.sample(x. Note that these sample size determinations do not take frame error. the command s <.25 when π = 0.572 π × (1 − π ) 0.g. if the units are water heaters packed in cartons stored in large stacks. Here the required sample size is bounded because the function π (1− π ) has maximum value of 0.n) selects a random sample of size n from the vector x and stores the result in s. 2009 VIII-11 . If we have a better idea of π from a pilot survey or elsewhere. we may be able to reduce the sample size from this upper bound. shorter confidence interval with a smaller sample size) to stratify the population and use stratified random sampling. a) Show that the inclusion probability for each unit in the frame is 1/100 for every protocol. • To implement SRS.response error and other such errors into account. We must be able to examine the selected units. When and How to Implement SRS Here we briefly look at when SRS should be used and how to implement the sampling protocol. it is more efficient (e. See Chapter 10. non.J. SRS is the simplest probability sampling protocol. If the frame consists of a list of items or people.0252 + N 2.572 × π × (1 − π ) 2. University of Waterloo. we can assign each unit a unique number from 1 to N and then used available software to select a sample of n units using simple random sampling. we need a frame for the target population of interest. In R. Consider the sampling protocols defined in Example 1. With multi-stage sampling we can build the frame as we go. a student once defined simple random sampling as follows: “simple random sampling is a method of selecting units from a Stat 371 © R. we can select the sample of identifiers using SRS but we are unlikely to find someone willing to sort through the cartons to find the selected units. • • • Exercises 1.n= 1 1 0.5 .0252 = 10568π × (1 − π ) ≈ where we ignore the term 1 / N since it is so small.

To estimate the total number of male song sparrows in a 10 km by 10 km square (http://www. 2009 VIII-12 . The data are summarized below.org/atlas/atlasmain.html ) for a breeding bird atlas. find a formula for the required sample size. If you find 1 or more defective items. b) What knowledge of the population attributes do we need to make this formula usable? 5. b) Suppose that I wanted to estimate the total number of male song sparrows to within ±1000 with 95% confidence. Suppose that there are N = 1000 items in a shipment and you cannot tolerate more than 1% defective (your first mistake – why should you tolerate any defective items from your supplier). Suppose we want to estimate a population average so that the relative precision is specified. MacKay. That is. you inspect the complete shipment. [Hint: Use the fact a) For SRS.population so that every unit has the same chance of selection”. a) For a given confidence level and required precision p%. show that σ that ∑ ( yi − y ) 2 = ∑ yi2 − ny 2 ]. 2. 3.J. # of sparrows # of plots 0 28 1 13 2 5 3 3 4 1 a) Find a 95% confidence interval for the total number of male song sparrows in the square. we want to find the sample size required (SRS) so that the length of the confidence interval 2l divided by the sample average is pre-determined. a simple random sample of 50 one hectare plots (a hectare is 100m by 100m) is selected. Is this a correct answer? ∑ yi iε s c) Show that the estimator corresponding to the sample average μ = is n unbiased for μ for each of the protocols. Consider the estimate σ = ∑(y ε i s i − y)2 n −1 ~ 2 is an unbiased estimator for σ 2 . the percentage of defective items in the shipment.birdsontario. ~ b) Is σ unbiased for σ ? iε s iε s ~ and the corresponding estimator σ . your intrepid instructor visits each of the selected plots (after dawn but before 9:00 am between May 24 and July 6) and counts the number of singing male song sparrows detected in a 10 minute period. One cheap but (poor) way to check the quality of a batch of items is called acceptance sampling. How many additional plots are needed? 4. You decide to select and inspect a sample of 20 items and accept the shipment if you find 0 defectives. Using a GPS system. a) How would you select the sample? b) Calculate the probability p(π ) that you accept the shipment as a function of π . Stat 371 © R. University of Waterloo.

What sample size do you recommend? Stat 371 © R. MacKay. you decide to increase the sample size so that there is only a 5% chance of accepting a shipment with 1% defective.J.c) Graph p(π ) for 0 ≤ π ≤ 10% d) Given the results in c). University of Waterloo. 2009 VIII-13 .

y0 ) . y0 ) ∂f ( x0 . The distinguishing feature is that both the numerator and denominator will change if we were to repeat the sampling protocol over and over.J.Chapter 9 Ratio and Regression Estimation with SRS In this chapter. Here we consider estimating a ratio. y ) = x / y . y0 ) x If f ( x. MacKay. y0 ) + ∂f ( x0 . The parameter μ is the average error per file and π is the proportion of files in error. = − 0 and 2 ∂x ∂y y0 y0 x x0 1 x ≈ + ( x − x0 ) − 0 ( y − y0 ) 2 y y0 y0 y0 Stat 371 © R. we consider two related problems: • estimating a ratio such as the proportion or average response of a subpopulation (domain) with unknown size • improving the sample average as an estimate of the frame average by using explanatory variates Estimating a Ratio In Chapter 8. We can write this attribute as ∑ yi zi ∑ yi zi / N μ θ = iεU = iεU = zi ∑ ∑ zi / N π iεU iεU where zi = 1 if the ith file is in error and 0 otherwise. y ) ≈ f ( x0 . ˆ μ μ We use the estimate θˆ = with corresponding estimator θ = . Recall that we can expand f ( x. then we have = . Consider again the inventory example from the previous chapter. Suppose we want to estimate the average size of the error in those files that are in error. To derive the approximation. To assess the estimate ˆ π π and produce confidence intervals for θ . y0 ) ( x − x0 ) + ( y − y0 ) ∂x ∂y The linear function on the right has the same value and first partial derivatives as f ( x. y0 ) 1 ∂f ( x0 . Note that iε U ∑ y z = ∑ y here ε i i i i U because yi is the error in the ith account. y0 ) to get a linear approximation f ( x. we use Taylor’s theorem for a function of two variables. y ) about the point ( x0 . we find the (approximate) distribution of θ by finding its mean and variance and then using a gaussian approximation. y ) at the point ( x0 . University of Waterloo. ∂f ( x0 . 2009 IX-1 . we looked at assessing the estimates of the frame average (or total) when the sampling protocol is SRS and the estimate is the sample average.

π ) and ( x0 . π ) . the approximation is reasonable since we expect ( μ . Using the basic formula for the variance of an average with SRS. we have μ μ 1 μ ≈ + ( μ − μ ) − 2 (π − π ) π π π π For large sample sizes.. Notice that the estimate corresponding to ∑ ( yi − θ zi ) ˆ ˆ which is the sample average of r1 .. Var ( μ − θπ ) = (1 − f ) σ r2 n We can estimate this variance by the corresponding sample variance (1 − f ) n ∑ (r − r ) ε i i s 2 n −1 (1 − f ) = n ≈ (1 − f ) n (1 − f ) n ∑ [ y − y − θ ( z − z )] ε i i i s iε s 2 n −1 ∑ [ yi − y − θˆ( zi − z )]2 n −1 ∑ ( yi − θˆ zi )2 iε s = n −1 where we replace θ by its estimate θ = y / z in the second line and f = n / N is the sampling fraction as usual. MacKay.. π ) to be close to ( μ . The estimate of the variance of the estimator θ is then Var (θ ) = ^ 1 ˆ π2 ∑ ( y − θˆz ) (1 − f ) ε i i i s 2 n n −1 Stat 371 © R. π ) . University of Waterloo. y ) by the random variables ( μ .J. 2009 IX-2 . Hence we have E (θ ) ≈ μ 1 μ μ + E ( μ − μ ) − 2 E (π − π ) = π π π π 1 Var (θ ) ≈ π2 Var ( μ − θπ ) The estimate is approximately unbiased (but see Exercise 1). y0 ) by ( μ .. rn where μ − θπ is μ − θπ = i ε s n ri = yi − θzi . We can write the variance in several forms.Replacing ( x .

33 In the example – see the data file inventory. 2009 IX-3 . To construct a confidence interval for θ . we can determine the values of the response variate y and the explanatory variates. Ratio Estimation of the Average Suppose the purpose of the survey is to estimate the study population average μ ( y ) for some variate y. Note the change in notation to explicitly include y in the definition of the attribute... In the example. Then 1 multiply by the factor 1 − f . rn = yn − θˆzn in R by creating the vector r < − y − theta _ hat * z . age distribution etc.. To find ˆ π 0. we have θˆ = = = 10. University of Waterloo.. For simplicity.57 . ^ Also note that this confidence interval is wider than that for μ . MacKay. For example. From previous experience.57 ± 51. for large values of n and N.. yn − θ zn .Note that the last factor is the sample variance of the estimated residuals ˆ y1 − θˆz1 ...J. When we get the sample. perhaps from a census. a company selects a sample of 40 parts to check the average length of a critical dimension.26. first calculate the standard deviation of ˆ ˆ r1 = y1 − θˆz1 . because we also have uncertainty about the proportion of files in error. say gender and age for each person in the sample. we know the stated value and the stated number of items for each file in the population and hence we can calculate population attributes for these variates. ˆ π n A 95% confidence interval for θ is 10. we have the standard error is 26. in the inventory survey.. In many surveys of human populations. Example In the assessment of a lot of 10000 incoming molded parts.txt.22 the estimate of the standard deviation. The idea of the methods discussed here is to adjust the sample average μ ( y) based on differences between the sample and (known) population attributes of the explanatory variates. there are other (explanatory) variates that can be measured on each unit in the sample and for which we have complete knowledge of their attributes in the population. In many surveys. the demographics (gender ratio. the estimator is approximately gaussian so the confidence interval has the standard form estimate ± c stdev(θ ) ˆ μ 2. we consider only one explanatory variate.) of the population are known. We can use the same approach via Taylor’s theorem to estimate any other function of variate averages in which we have interest. Stat 371 © R. the average error.48 so we can estimate the average error in accounts with errors very imprecisely.

5 46 45.J.56 1.691 y 45.they know that the dimension is related to part weight so they measure the weight of each part in the sample and also the weight of the entire shipment. ˆ μ( y) ˆ ˆ ˆ μ ( x ) = θμ ( x ) where μ ( x ). The population average weight is μ ( x ) = 33. Length versus weight 48 47. MacKay.5 44 43.5 weight (g) 34 34.5 45 44.5 31. We develop the estimators and their properties assuming SRS – this corresponds to assuming that the haphazard sampling protocol mirrors SRS if the protocol is repeated over and over.5 32 32. The plot shows a strong correlation between the length and weight.5 35 35.5 33 33. University of Waterloo.24 0.10 grams determined as the total weight (measured all at once) divided by the number of pieces N = 10. Stat 371 © R. The sample data are included in the file molded.005 sample average sample st.5 47 46.txt and are plotted below. 2009 length (micron) IX-4 .5 The average and standard deviations for the two variates are: x 33. ˆ The ratio estimate of μ ( y ) is μ ( y ) ratio = The sample is collected haphazardly since it is too expensive to create a frame. We use the results on the estimation of a ratio θ to derive an approximation for the mean and standard deviation of μ ( y ) ratio . dev. μ ( y ) are the ˆ μ ( x) sample averages for x and y and θ = μ ( y ) / μ ( x ) .000 .

37 ± 0. the ratio estimate is more precise if a line through the origin explains some of the variation in the response variate. Qualitatively. ˆ Here the ratio estimate is more precise than the sample average μ ( y ) = 45.56 since this ˆ σ ( y) ˆ or estimate gives a confidence interval (ignoring the fpc) μ ( y ) ± 1.147 so the ratio estimate of the n −1 45. University of Waterloo.10 = 45. gives a shorter confidence interval) than the sample average if ∑( y ε i s i − θˆ xi ) 2 < ∑ ( yi − y ) 2 iε s The expression on the left is the residual sum of squares if we “fit” a line through the origin to the sample scatterplot.J. An approximate 95% confidence interval for the population average length based on the ratio estimate is 45. The expression on the right is the total sum of squares. We can compare the estimated variance of μ ( y ) ratio versus that of μ ( y ) . Stat 371 © R. in general. 2009 IX-5 .e. the estimator based on the sample average.371 and ˆ population average is μ ( y ) ratio = ∑( y ε i s i ˆ − θ xi )2 = 0.37 with corresponding standard error 33.060.96 40 45.E[ μ ( y ) ratio ] = E[θ ]μ ( x ) ≈ θμ ( x ) = μ ( y ) Var[ μ ( y ) ratio ] = μ ( x ) 2Var[θ ] 1 Var[ μ ( y ) − θμ ( x )] μ ( x )2 = Var[ μ ( y ) − θμ ( x )] Using the results on the estimation of a ratio.56 × 33. ˆ In the example we have θ = 1.24 0. Consider Var[ μ ( y ) ratio ] = ^ 1 (1 − f ) i n ∑( y ε s i ˆ − θ xi )2 n −1 versus Var[ μ ( y )] = ^ 1 (1 − f ) i n ∑( y ε s i − y )2 n −1 The ratio estimate is more precise (i.56 ± 0.12 microns. This gain in precision is the major advantage of the ratio estimate. MacKay. we estimate the variance of μ ( y ) ratio by ∑ ( yi − θˆxi )2 1 ∑ rˆi2 1 Var[ μ ( y ) ratio ] = (1 − f ) i ε s = (1 − f ) i ε s n −1 n n −1 n = μ ( x )2 ^ where ri = yi − θ xi as before.31 .

We can also see the adjustment by rewriting the ratio estimate as μ( x ) μ ( y) μ ( y)ratio = μ( x ) 3. the greater the benefit in using the ratio estimate. The closer x is to μ ( x ) . To apply ratio estimate effectively. i s can be written as the ratio of two averages. Ri ~ G (0. a straight line through the origin. If we fit a response model to the above data (e. If we think of ratio estimation in terms of fitting a line to the scatterplot. σ ) ) then we ∑x y /n Since β = ε ∑x /n ε i i 2 i ∑x y ˆ estimate the slope using β = ε ∑x ε i i s 2 i i s i ˆ . University of Waterloo. between x and y in the study population. the sample average x is smaller than the population average μ ( x ) so we adjust the estimate of μ ( y ) upward using the relationship between y and x.g. we need • to measure the explanatory variate xi for each unit i in the sample • • to know μ ( x ) .J. the smaller is the adjustment. Yi = β xi + Ri . The smaller the noise. This suggests another estimate βμ ( x ) for μ ( y ) . 2009 IX-6 . MacKay. we can derive its i s Stat 371 © R. 2. the population average of the explanatory variate a relationship of the form y = β x + noise .Notes 1. x fitted line y = θˆ x x ˆ μ ( y ) ratio x x x ˆ( μ y) x x x x x x x μ ( x) In this case. then the estimate is an adjustment based on the fact that the sample average x is different than the population average μ ( x ) .

σ xi ) where the variation around the line increases as x increases. 4. MacKay. ˆ To produce the regression estimate μ ( y ) reg . Ri* ~ G(0. x ˆ If there is constant variation about the line. 2009 IX-7 . In either case. σ ) ˆ to the data in the sample. because we are exploiting structure in the study population. You can easily y verify that the least squares estimate of β in this model is θˆ = . Note that x = μ ( x ) is the sample average for the explanatory variate.ˆ variance as we did for θˆ and hence find the variance of βμ ( x ) . Ri ~ G (0. If we have a continuous explanatory variate x.J. In other words. Ri ~ G (0. we have the conditions necessary for fitting the response model Yi = α + β ( xi − x ) + Ri . Suppose the response variate y is binary and the goal is to estimate the population proportion π .σ ) with constant standard deviation. y With ratio estimation as described above. Here we look at using information on the explanatory variate if the relationship between y and x is linear with constant variation about the line. you will have considered what happens if the line does not go through the origin. we • fit a the model using least squares to estimate α and β to get ˆ ˆ α = y = μ ( y ). If the variation increases as x increases. Regression Estimation of the Average Once we discussed ratio estimation in terms of fitting a line through the origin to the data in the sample. If we x start with a response model Yi = β xi + Ri . the estimates will be superior to the sample average. we need more complex models (and subsequent analysis) to exploit the relationship between the variates in the study population. You may wonder how the precision of this estimate compares to that of the ratio estimate. If the explanatory variate is binary or categorical we can use post-stratification (see the Chapter 9) to improve the precision of the estimation of π . then we expect the ratio estimate to be better. ∑(x − x) y ˆ β= ε ∑(x − x) ε i i s 2 i i s i • substitute the known mean μ ( x ) into the fitted line Stat 371 © R. University of Waterloo. we expect the estimator based on β to be superior. then we can divide by xi to get the model Yi / xi = β xi + Ri* . we estimate the slope using θˆ = .

Hence we can say that μ ( y ) reg − μ ( y ) ≈ [ μ ( y ) − μ ( y )] + β [ μ ( x ) − μ ( x )] and we have E [ μ ( y ) reg − μ ( y )] ≈ E [ μ ( y ) − μ ( y )] + β E [ μ ( x ) − μ ( x )] = 0 . The right-most term. we expect each of the terms within the brackets [ ] to be small.ˆ ˆ ˆ ˆ μ ( y ) reg = μ ( y ) + β [ μ ( x ) − μ ( x )] ˆ ˆ We can view μ ( y ) reg as an adjustment to the sample average μ ( y ) as we did with the ratio estimate. Rewrite the estimator as μ ( y ) reg − μ ( y ) = [ μ ( y ) − μ ( y )] + β [ μ ( x ) − μ ( x )] + [ β − β ][ μ ( x ) − μ ( x )] In large samples. a product of two small quantities. Suppose that β is positive so that in the study population larger values of ˆ x correspond to larger values of y. The adjustment is shown on the following plot. we adjust the estimate of μ ( y ) upward. University of Waterloo. 2009 IX-8 . is an order smaller than the other two terms. MacKay. If the sample average μ ( x ) of the explanatory variate is less than the known population average μ ( x ) . Stat 371 © R. We can simplify the argument with the following handwave. That is. fitted line y = μ ( y) + β ( x − μ ( x )) x ˆ( μ y ) reg x x x x ˆ( μ y) x x x x x x x μ ( x) The properties of the estimator μ ( y ) reg = μ ( y ) + β [ μ ( x ) − μ ( x )] are complicated because of the three random components. the regression estimate is approximately unbiased.J.

90 basal area The equation of the fitted line is y = 6. The data and fitted line are plotted below. The last factor is the sample variance of the estimated residuals from the least squares fit of the line to the sample data. A SRS of 25 sub-sections was selected and then a tree was selected at random within the sub-section.70 0.J. we have [ r − μ ( r )]2 1∑ i Var ( μ ( y ) reg ) = (1 − f ) iεU which can be estimated by n N −1 1 ^ Var ( μ ( y ) reg ) = (1 − f ) n ˆ( We can estimate Var ( μ ( y ) reg ) by noting that μ y ) reg = ∑[ y ε i s i − β ( xi − μ ( x ))] ∑ [r − r ] ε i i s 2 n −1 1∑ = (1 − f ) iε s n ˆ [ yi − y − β ( xi − x )]2 n −1 ˆ where we have replaced β by the estimate β . 2009 IX-9 .50 3. x . Using the basic result for the variance of an average with SRS. Volume is expensive to measure because it requires that the tree be destroyed.00 6. Volume versus Basal Area 7.51( x − 1.50 4.50 7.50 volume 6. Example The volume of useable wood y in a Douglas fir is related to the basal area.10 1. University of Waterloo. the crosssectional area of the tree measured at breast height.50 0.is the n sample average of r1 . To estimate the total volume in a section of forest that was to be sold.90 1..17 + 1.31) and the residual sum of squares is 2..268 Stat 371 © R..50 1.30 1.00 5. MacKay. The selected trees were sacrificed and the basal area and volume were measured.00 0..70 1.00 4.50 5.00 3. a sample of 25 trees was selected by dividing the section into small subsections. We will treat this protocol as if it were SRS. rn where ri = yi − β ( xi − μ ( x )) and r = y − β ( x − μ ( x )) .

We used a difference estimate in the inventory example to estimate the total true value of the files.268 = 0.40 − 1. 3.. we need a continuous response variate y and a continuous explanatory variate x knowledge of the study population average of the explanatory variate a linear relation between y and x with smaller residual variation leading to a more precise estimate. We assume that the errors in estimating μ ( x ) and N are negligible.40 . is then 358 408 ± 6816 . the ratio estimate and the regression estimate by looking at the sum of squares of the estimated residuals under the three least squares fits • Sample average: ∑ ( yi − y )2 iε s • • Ratio estimate: Regression estimate: ∑( y ε i s i − θˆ xi ) 2 where θˆ = y / x ∑( y ε i s i ˆ ˆ − y − β ( xi − x )) 2 where β is the estimated slope The major reason for using ratio and regression estimates is the gain in precision.A second much larger (and cheaper) survey was carried out to estimate the total number of trees N = 56800 and the average basal area μ ( x ) = 1.12 .31 ± 1. yn . A special simple case of regression estimation is to use the difference d i = yi − xi as the response variate and then estimate the population average by ˆ ˆ μ ( y ) diff = μ ( d ) + μ ( x ) This estimate is more precise than the sample average if the variation in the differences d1 .96 × 0.111.31 ± 0. ˆ The regression estimate is μ ( y ) reg = 6.J.51(1.. The 95% confidence interval for the total volume.22 since the sample standard deviation of the 25 measured volumes is 0.17 + 1. We use least squares to estimate the relationship between the Stat 371 © R. University of Waterloo. Regression estimation can be extended to multiple explanatory variates and nonlinear relationships... ˆ The estimate for μ ( y ) based on the sample average μ ( y ) gives a 95% confidence interval 6. 2009 IX-10 . • • • To use the regression estimate effectively.17 ± 0.061 25 24 The approximate 95% confidence interval for μ ( y ) based on the regression estimator is 6. we can compare the precision of the sample average. τ ( y ) = 56800 μ ( y ) . d n is less than the variation in y1 .. 2.. MacKay.31) = 6. Notes 1..061 = 6. ^ stdev ( μ ( y ) reg ) ≈ In general.. The regression estimate is more precise in this case.31 and the standard error of the estimate is 1 2.

Note this adjustment accounts for differences in the sample attributes of the explanatory variates and the known population attributes.response and explanatory variates in the sample and then adjust the sample average using the fitted model and the sample averages for the explanatory variates.J. MacKay. University of Waterloo. 2009 IX-11 . Stat 371 © R.

Let the weight of the ith item in the population be yi and the total known weight be τ a) Show that an estimate of the population size is N = ∑y ε i s τ i / 25 b) Find the (approximate) mean and standard deviation of the corresponding estimator ~ N. 3.it is not. μ ( y)) to ~ ~ ~ estimate the bias in the estimator θ = μ ( y) / μ ( x ) . c) In the example. Find a 95% confidence interval for the total number of items in the container.163 g and the total weight is 154. Suppose we wanted to estimate the number of wood thrush pairs nesting within the region of Waterloo. In order to count the number of small items in a large container. y ) ≈ f ( x 0 . A simple random sample of 50 woodlots is selected and the number of nesting pairs yi is counted in each woodlot by counting the number of singing males. Find a 95% confidence intervals for the total number of thrushes based on the a) sample average y b) ratio estimate c) regression estimate 4. y0 ) + This quadratic function has the same value. For example. y0 ) as does f ( x.4 ha. Many bird species have specialized habitat. Note that the general form of the expansion is ∂ f ( x 0 . an area of highly fragmented forest patches.45 g. Using aerial photography. Stat 371 © R.J. we know that there are 1783 such patches (minimum size 3 ha) with an average size 13. You can easily check this statement by differentiating the right side of the expression. We can exploit this knowledge when we are trying to estimate population totals or density. They then weigh the whole shipment (excluding the container). the sample standard deviation is 0. The area xi of each sampled woodlot is also recorded. y0 ) ∂ 2 f ( x 0 . University of Waterloo. 2009 IX-12 . the sample average weight is 75. a shipping company selects a sample of 25 items and weighs them. y0 ) ( y − y0 ) 2 ( x − x0 ) + ( y − y0 ) + + ( x − x 0 )( y − y0 ) + 2 ∂x ∂y ∂x 2 ∂x∂y ∂y 2 2 f ( x . The data are available in the file thrush.2 kg. first and second derivatives at the point ( x0 . 2. Find the quadratic expansion of f ( x. wood thrush are a forest dwelling bird that live in the hardwood forests of eastern North America. the sampling is haphazard. y0 ) ( x − x 0 ) 2 ∂ 2 f ( x 0 .txt. y) . Assume that there is small error in weighing and act as if SRS is used . Briefly describe when you would use the ratio or regression estimate instead of the sample average to estimate the population average. y0 ) ∂f ( x 0 .Exercises 1. y0 ) ∂ 2 f ( x 0 . MacKay. y) = y / x about the point ( μ ( x ).

The data are stored in the file water. University of Waterloo. MacKay. the ratio estimate and the regression estimate. The City of Waterloo wants to estimate the average amount of water per house μ ( y) that is used to water lawns and gardens in the month of July. it is known that the average total water consumption per house is μ ( x ) = 15. The total volume of water x is measured by the regular meter. 2009 IX-13 . c) Find 95% confidence intervals based on each estimate.J.5. From water records.txt.6 cubic metres. A SRS of 50 houses is selected and special metering units are installed to measure the volume of water y from external taps. d) Which estimation procedure is preferable here? Why? Stat 371 © R. a) Prepare a scatterplot of y versus x b) Estimate μ ( y) using the sample average.

Since we used SRS within each stratum..+ N H μ H = W1 μ 1 +.. With this notation. The idea is to divide the study population in sub-populations. 2007 X-1 .. MacKay.J..+ WH μ H . Stratified sampling gives information about these averages and often an improved estimate of the overall average. we independently select a sample of size nh from stratum h using SRS and calculate the sample average μ h . the proportion of the total units found in that stratum. we denote the stratum averages and standard deviations by μ h . we write N1 μ 1 +. we change both the sampling protocol and the estimate to get a procedure that usually produces a better estimate of the study population average. The basic idea was to exploit a structural relationship between the response and explanatory variates with known attributes in the population. we have Stat 371 © R. H ..+ N H . Suppose that we divide the population U into H mutually exclusive strata U1 . U H with sizes N1 ... and sample independently using SRS from each stratum. σ h . University of Waterloo. we have questions about the strata averages as well as the overall population average..Chapter 10 Stratified Random Sampling In the previous chapter.. Then combine the estimates of each stratum average to get an estimate of the population average.. In this chapter.. μ= Now suppose for each stratum.. For the variate of interest. Some examples of possible strata are: • Provinces and large urban centers in national opinion surveys • Small and large accounts in auditing a population of accounts • Home faculties in a survey of UW students • Sites in a survey of employees in a multi-site company In many examples... We call Wh = Nh / N the stratum weight.. h = 1... N H so that N = N1 +.+ WH μ H N the weighted average of the stratum averages.. We can combine these estimates to get the stratified estimate of the population average μ strat = W1 μ 1 +..+ WH μ H ~ ~ ~ The corresponding estimator is μ strat = W1 μ 1 +.. called strata.. we looked at ways to use an explanatory variate with known attributes to improve on the sample average as an estimate of the study population average with SRS..

Here we look at only two: ˆ2 where σ h = ∑ (y ε j sh hj ˆ − μh )2 y: u: sodium (Na) concentration (mg/L) the water was contaminated by coliform bacteria (u = 1) or not (u = 0 ) The data are summarized below.. Three strata were created Stratum Size Weight Sample Size farms with animals 2365 0..+ WH (1 − fH ) H n1 nH 2 2 is the sample standard deviation within stratum h and y jh is nh − 1 the response variate for the jth unit in the sample from stratum h .~ ~ ~ E( μ strat ) = W1 E( μ 1 )+. a survey of residential wells was carried out in the rural part of the region of Waterloo.+ WH (1 − f H ) H n1 nH 2 2 where fh = nh / Nh is the sampling fraction and σ h is the standard deviation of the response variate for stratum h.2 11. MacKay.23 % contaminated 17..097 100 houses 9683 0.. We can estimate the variance by σ σ 2 ^ ~ Var( μ strat ) = W12 (1 − f1 ) 1 +. University of Waterloo.45 37..4 13.J. Stratum farms with animals farms without animals houses Average Na 237.+ WH μ H =μ and σ σ ~ 2 Var ( μ strat ) = W12 (1 − f1 ) 1 +. 2007 X-2 .62 51.+WH E( μ H ) = W1 μ 1 +..2 The estimate of the population average Na concentration μ is Stat 371 © R.726 250 A random sample of wells was selected from each stratum and the water was tested for a large number of characteristics.177 150 farms without animals 1297 0. Example To estimate average water quality and the proportion of wells with contamination.1 St Dev Na 41.. The population of 13 345 wells was identified from assessment records.3 245.6 220..

the estimated variance is the sum of the squared strata weights times the variance of the stratum estimates. Note that we used the approximation 1 − fh ~ Var (π h ) ≈ π h (1 − π h ) nh in each stratum.132(1−.132) =.62 2 250 51.6 mg/L ~ The estimated variance of the estimator μ strat is 150 41.137 ± 0.0165. and then combine the estimates using the strata weights.μ strat = 0.172)] 100 [.114) + 0. Stat 371 © R. 2007 X-3 .177)2 (1 − ) + (.726(220.177(237. The estimated variance of the associated estimator is 150 [. We can estimate the strata averages in the best way possible e.114)] 250 [. Since each stratum is sampled independently. Note that these variances are calculated using the formulae for ratio or regression estimates.726) 2 (1 − ) = 5. For the binary response variate u. • When is stratified sampling more efficient than SRS? • How should we allocate the total sample among the strata? • Can we combine ratio/regression estimation with stratified sampling? The answer to the last question is the easiest. The formulae are expressed in terms of proportions.2% Note that you need to be careful moving from percentages to proportions.097(245.g.726)2 (1 − ) = 0.3) + 0.418 mg/L.032 or 13.7 mg/L. MacKay.452 100 37.232 ^ ~ Var( μ strat ) = (.097(.097)2 (1 − ) + (.177)2 (1 − ) + (.097)2 (1 − ) + (. We can compare the strata means and proportions – see Exercise 1.1) = 225.177(.172 ) + 0. the estimate of the proportion of contaminated wells π = W1π 1 + W2π 2 + W3π 3 is π strat = 0. An approximate 95% confidence interval for μ is 225.6 ± 4.137 or 13.J.000272 2365 150 1297 100 9683 250 and so the standard error is 0.7% ± 3.7%. There are several questions of interest.172(1−.6) + 0. ratio or regression estimation if appropriate.132)] ^ ~ Var (π strat ) = (.114(1−. University of Waterloo. The 95% confidence interval for π is 0.845 2365 150 1297 100 9683 250 and so the standard error is 2.726(.

Substituting Wh = wh = nh / n . then the weighted average will be less than the overall variance. except for rounding.. However. we have wh = Wh or... this is a contrived situation and in most cases. Another way to make the same point is to use the ANOVA partition of the total sum of squares into two components. Suppose that y jh is the response variate for the jth unit in stratum h. we can partition the total sum of squares as. the strata weights Wi = Ni / N and the relative sizes of σ 1 . nh α N h . If we form the strata so that these variances are small i. stratified sampling will be much better than SRS. σ H ~ versus σ . Looking at the variance of μ strat σ σ ~ 2 Var ( μ strat ) = W12 (1 − f1 ) 1 +.. h it is possible to have larger variance with stratified sampling. We also ignore the finite population corrections... In other words.+ σ H] n w1 wH we see that if we were to give a very small sample weight to a stratum with high stratum ~ ~ weight and σ 2 > σ 2 .Comparison to SRS To examine the efficiency of stratified sampling. To confirm this point. 2007 X-4 ..+ WH (nH / n) H n1 nH 2 2 1 2 (W1σ 1 +.e. then Var ( μ strat ) > Var ( μ ) . we have σ σ ~ Var ( μ strat ) = W1 (n1 / n) 1 +.. Consider each stratum as a treatment and recall for an unbalanced design.+ WH (1 − f H ) H n1 nH 2 2 2 WH 2 1 W12 2 ≈ [ σ 1 +.. in other words.... consider proportional allocation where.J.. within and between strata. MacKay.. Between strata (treatment): Within strata (treatment): Total: ∑ N (μ ∑ ∑ (y h h h − μ )2 hj − μ h ) 2 = ∑ ( Nh − 1)σ 2 h h ∑ ∑ ( yhj − μ )2 = ( N − 1)σ 2 h j h j Stat 371 © R. form the strata so that there is greater consistency within strata compared to the whole population. University of Waterloo. there is no uniform result. if we construct the strata with care.+ WH σ 2 < σ 2 H The left side is the weighted average of the within strata variances.+ WHσ 2 ) H n and the variance of the stratified estimator will be less than the variance of the sample average from SRS if = 2 W1σ 1 +. we need to consider the sampling weights wi = ni / n .

. nH .. We want to ~ determine n1 .+ WH σ H ) 2 n Note that proportional allocation is optimal if σ h is the same for each stratum.+ nH − n) 1 1 2 1 2 1 − )σ 1 +. H ... nH as continuous variables and use a Lagrange multiplier. If we do so and decide to use optimal allocation. then we can use the preceding Stat 371 © R.. nH so that n1 +. How should we divide the sampling effort among the strata if the objective is to minimize the variance of the resulting estimator? This is the allocation problem. we find the critical point of the function f (n1 ... Hence we have..... Setting ∂nh nh ∂λ and solving for λ ....where the sums are over the whole population....+ nH − n) H n1 N1 nH N H The partial derivatives are these to 0. we decide that we can afford to select a sample of size n.. we should h make the strata means as different as possible in order to achieve the greatest gain over SRS..+ WHσ H λ λ = n or λ = or more simply nh ∝ Whσ h .. λ ) = W12 (1 − f1 ) = W12 ( 2 σ1 n1 2 +.. For proportional allocation (and for any other allocation).J.. then for the optimal allocation..... Optimal allocation Suppose that at the Plan stage.+ WH ( − )σ 2 + λ (n1 +. for optimal allocation n nh = Whσ h n W1σ 1 +.+ nH − n ..+ nH = n and Var ( μ strat ) is minimized...+ WH (1 − fH ) σ2 H nH + λ (n1 +.. h = 1. perhaps from a pilot survey.. we get nh = Whσ h ∂f ∂f W 2σ 2 = − h 2 h + λ . In order to use the optimal allocation.. before selecting the sample.. = n1 +... We allocate more sampling effort to those strata that have higher weight or larger within stratum standard deviation. We have σ2 = ∑ N (μ h h h − μ )2 + ∑ (N h h − 1)σ 2 h N −1 N −1 ≈ ∑ Wh ( μ h − μ ) 2 + ∑ Whσ 2 h h h the difference in variance for stratified versus sample average estimator is proportional to ∑ Wh ( μ h − μ )2 . University of Waterloo. we need to know (unlikely) or have an estimate of the within-stratum standard deviations. W1σ 1 +. That is.+ WHσ H W1σ 1 λ +. we get 1 ~ Var ( μ strat ) = (W1σ 1 +. MacKay..+ WHσ H . If we ignore the fpc. We treat n1 . 2007 X-5 .

In many cases such as the labour force survey.. for example. we want to estimate the rate of unemployment in each province so it is natural to stratify by province.+ WH σ H )2 where c is a value from l the G(0. for a given level of confidence... a particular stratum may be so important that we do a complete census. stratification adds complexity. for a given level of precision. If we are interested only in the overall frame average or total.1) tables determined by the level of confidence. We can use estimates from a pilot study or an earlier version of the survey to optimally allocate the sample across the strata. accounts are stratified on the basis of stated value. The first consideration is the purpose of the survey. in many applications in auditing. The likelihood of large errors is greater in larger accounts so every account in the stratum of the largest accounts is included in the sample. we can use the values of the explanatory variate to form the strata. we need to identify the stratum for each unit in the frame before we begin.e. 2007 X-6 .. Proportional allocation is popular because we do not need to know the within strata standard deviations and we can be almost sure to do better than with SRS. University of Waterloo. shorter confidence intervals) compared to SRS for the same total sample size. We cannot use the discrete variate to form strata since we do not know the value of the variate for every unit in the frame. In some cases. the approximate confidence interval has length 2c 2l = (W1σ 1 +.. WH for the H classes. we need to ensure that each province gets a large enough sample to estimate the within-stratum rate so here the allocation problem is very different. we can use a smaller sample size with stratified sampling.J. Stat 371 © R. However.+WHσ H ) n c2 so we select a sample with total size n = 2 (W1σ 1 +. If we have complete knowledge of some explanatory variate that we believe to be related to the response variate. This corresponds to knowing the mean for a continuous explanatory variate that we might use in a ratio or regression estimate.formula to select the total sample size n to achieve a confidence limit with a predetermined length. For example.. Since we are interested in the provincial rates. Post Stratification We now return to an issue discussed in Chapter 9.. Instead. That is. we form the strata so that the averages are likely to be very different. MacKay.. we know the population proportions or weights W1 . Put in another way. Suppose there is a discrete explanatory variate such as gender or age class and we know the proportion of the population that falls in each class. Forming the Strata Stratified sampling can produce large increases in precision (i.

We select a sample using SRS from the frame and observe n1 .+WH μ H = μ The post stratified estimate is unbiased (almost). First. To determine the properties of this procedure. The estimate looks like the stratified ~ estimate – the estimators are different because the denominator of μ h is random for the post-stratification estimator.+WH μ H ) = W1μ 1 +. The sample sizes are not controlled and if we were to repeat the sampling they would change. A natural estimate of the population average is μ post = W1μ 1 +. ~ ~ ~ We use this result to find E( μ post ) .. we can calculate the expected value of X in two steps. With this notation we have E( X ) = ∑ E( X | Y = y) Pr(Y = y) y The right side is the expected value of the function E( X| Y = y) so we write E( X ) = E[ E( X| Y = y)] In words. If we ignore the event nh = 0 (which happens with small probability in large samples) we have to a good approximation ~ ~ ~ E ( μ h ) = E[ E ( μ h | nh = nh )] ≈ E ( μ h ) = μ h Hence we have ~ ~ ~ E( μ post ) = E(W1μ 1 +. Stat 371 © R. MacKay. 2007 X-7 . find the conditional expectation for each value of y and second.. Then we can write E( X ) = ∑ ∑ x Pr( X = x. Consider the two random variables μ h . Then we have ~ ~ ~ E ( μ h ) = E[ E ( μ h | nh = nh )] ~ ~ As long as nh ≠ 0 .. nH units in each class. University of Waterloo.+WH μ H We call this the post..J. we know using the results from SRS that E ( μ h | nh = nh ) = μ h . nh ...stratification estimate because we do not establish the stratum for each unit in the sample until after it is selected. Note that E( X| Y = y) is a function of y only because we have added over all values of x. Suppose X and Y are two discrete random variables with probability function Pr( X = x.... we have a small aside. Y =y) = ∑ [∑ x Pr( X = x| Y =y)]Pr(Y = y) y x y x The expression in [ ] is the conditional expected value of X for a given value Y = y and is written E( X| Y = y) ... find the expected value of the conditional expectation over the distribution of Y.. Y = y) .

.. nH ..+ WH ( E[ ~ ] − )σ 2 H n1 N1 nH NH We approximate this variance by 1 1 2 1 2 1 ^ ~ Var ( μ post ) = W12 ( − )σ 1 +. Also.. nH = nH )] = Var[ μ ] = 0 ...+ WH ( − )σ 2 H n1 N1 nH N H and so 1 1 2 1 1 2 ~ ~ ~ E[Var ( μ post | n1 = n1 .. ~~ ~ Since E( μ| n1 = n1 . 2007 X-8 .... University of Waterloo...+ WH ( E[ ~ ] − )σ 2 H n1 N1 nH NH Combining the two pieces we get 1 1 2 1 1 ~ 2 Var ( μ post ) = W12 ( E[ ~ ] − )σ 1 +..... nH = nH ) = μ for all values of n1 ... from SRS.. Var ( X ) = E[ X 2 ] − μ 2 = E[ E( X 2 | Y = y)] − μ 2 = E[ E( X 2 | Y = y) − E( X| Y = y)2 ] + E[ E( X | Y = y) 2 ] − E[ E( X | Y = y)]2 since μ = E[ E( X| Y = y)] . we condition on n1 = n1 .. MacKay. they know the distribution of household Stat 371 © R.. nH = nH )] = W12 ( E[ ~ ] − )σ 1 +.J.. nH = nH .... The second and third terms are Var[ E( X| Y = y)] so we have the result Var( X ) = E[Var( X| Y = y)] + Var[ E( X| Y = y)] ~ ~ ~ To find Var( μ post ) . nH = nH ) = W12 ( − )σ 1 +..+ WH ( − )σ 2 H n1 N1 nH N H that is identical to the variance of the stratified estimator for the observed allocation n1 . We now interpret the two pieces... nH . Example A market research organization interviews a randomly selected sample of 300 households in a community to estimate the average amount of money spent on DVD/video rental and movies in the previous week.. From census data... we have ~ ~ ~ Var[ E( μ post | n1 = n1 . The expression inside the first [ ] is Var( X| Y = y) so the first term is E[Var( X| Y = y)]...To find the variance. we need a second result. we have 1 1 2 1 2 1 ~ ~ ~ Var ( μ post | n1 = n1 ....

91 ± 0.57 .292.10 .67 28.363 0. find a 95% confidence interval for the proportion of wells in farms with animals that are contaminated c. ~ b) What is the variance of π strat for proportional allocation? c) How should the strata be formed so that the stratified sampling protocol is superior to SRS? 3.232 0. we ignored non-response.+0.232 × 13. 2007 X-9 .123 0.193 0.071 × 2810 = $19.381 0. In the well survey. In the above example.23 6.071 Sample size Sample weight Sample average Sample standard deviation 3. In general. MacKay.290 0. University of Waterloo.22 25.45 20. If there are H strata. Exercises 1. b. In many surveys..21 28. write down the distribution for the estimators μ h and ~ ~ μh − μk . In the well survey.J. Suppose the well survey was to be re-done with the same overall sample size 500. They post stratify the data as follows.090 0. 2.47 has been adjusted upward because of the over-representation of households of size 1 in the sample. Stat 371 © R.77 1 2 3 4 >4 87 109 54 27 23 0.67 4. Household size Population weight (census) 0. for SRS.180 0. How would you recommend allocating the sample to the strata if a) Estimating the average Na level was the primary goal. there is interest in estimating strata averages or differences in strata averages.. Non-response is a major source of error when sampling human populations. The estimate is μ post = 0.89 5. There are many analytic methods and sampling strategies to deal with this important issue. Note that the sample average is $19.077 13. Suppose that the purpose of the survey is to estimate a population proportion π . find a 95% confidence interval for the average Na difference between the two types of farm wells.56 5.91 and the estimated standard deviation of the corresponding estimator is (ignoring fpc’s) 0.45+. The company telephoned many more than 300 households to get the required number of completions. An approximate 95% confidence interval for the population average amount spent is $19. ~ a. a) Write down the stratified estimate of π and the variance of the corresponding estimator. The confidence intervals that we have constructed do not take this error into account.size in the community but this information is not available in the frame for each unit.

compare the predicted standard deviations of μ strat and π strat to what occurred in the current survey.22 3. 1 a) Show that this difference can be written as ∑ (σ h − σ )2 Wh where n h σ = ∑ σ h Wh is the weighted average standard deviation over the H strata.09 3 26 0. h b) When will the gain be large with optimal allocation relative to proportional allocation? 5. There are about 3300 students in the faculty.2 1.24 3.5 1. broken down by year are shown below. Consider the difference of the variances of μ strat under proportional and optimal allocation for a sample of size n. MacKay.03 4 12 0. 100 people were asked their opinion (on a 5 point scale) about the core courses and their value. One particular statement was (with scores): “All mathematics students are required to take Stat 231?” strongly agree – 1 agree – 2 neutral – 3 disagree. University of Waterloo.87 Stat 371 © R.22 2 23 0. ~ ~ c) For each case. ~ 4. In an informal sample of math students at UW.31 2.4 strongly disagree . 2007 X-10 . Estimate the average score for all math students and find an approximate 95% confidence interval for the population average – note that SRS was not used here so were are making assumptions about the estimators that may be unwarranted.8 1. Ignore the fpc.b) Estimating the proportion of contaminated wells was the primary goal.5 The sample results.1 0.J.23 3. Year Sample size Population Average score Standard weight deviation 1 39 0.

The online FAQ for Windows can be searched for help with almost anything. • Look at the web page http://www. try help. • To get a data set. The manual “An Introduction to R” can be downloaded in pdf format.uwaterloo. one row per unit in the sample. You can look at or download this document from the course web page so the links will be active. specifically in rooms MC 3006 and 3009.Appendix 1 An Introduction to R R is a high level language with many useful statistical functions. use the command a <. Reading Data into R • All data sets used in the course notes and lectures will be posted on the course web page in a . • R is available on the math faculty unix machines – type R to start the program 2.html . type q().search(‘ what you are looking for’) 3. R will open and restore the previously saved workspace. Getting Started 1.stats. Starting and Quitting R • To start R. • I assume that you are using the Windows version of R.read. If you do not know the function.r-project. • It is a good idea to clear the workspace if you plan to start a new project – see the Misc menu. Appendix I -1 . Where to find R • A Windows version of R is available free at http://www. create a shortcut on your desktop from the program. Note that R will let you save the workspace in the current working directory.zip file that you can download to your own machine.table(‘file path and namel’. The files have variate names in the first row and the variate values in rows. Where to find help • Use the Help menu for on-line assistance.org/ • For help with installation and implementation see C:\Program Files\R\rw1062\doc\html\rw-FAQ. • Try the sample session in the “An Introduction to R” manual 4. • To quit R. In this document • all R commands and objects are given in italics. header=T) The data are stored in the data frame a for further use. use the command help(function) for assistance. • Within R. if you know the function name. See 4 below.Introduction • R is available on the faculty PCs.ca/Stats_Dept/StatSoftware/R/ Try the R tutorial.

function) x <. 4.+ r without intercept details of the fit the estimated residuals the fitted values the analysis of variance from the fit the diagonal elements of the hat matrix the studentized residuals analysis of variance to compare the sub-model c to b fits the k best subsets for 1..txt'.txt file.y.uwaterloo. use detach(a). you can avoid long path names by setting the working directory to the directory containing the file.g.header=T) For a variate named sales in the .mean(y) or immediately displayed e. This makes editing easy and preserves a record for reuse.regsubsets(y~x1+x2+…. You can also read the data files from my web page with the command a < -read. Here are some R functions and objects used repeatedly in STAT 371.….math. Using the up and down arrow keys in R displays past and subsequent command lines which can be edited and re-executed 3..• • • • If the file is stored on your own machine. 2.lm(y~x1+x2+…) b <.u*v) sqrt(y) A%*%B t(A) solve(A) b <.. Commands can be typed directly into R. the R variate name is a$sales.table('http://www. mean.+ r and stores the results in the lm object b fits the linear model y = β 1 x1 +. create the data set in EXCEL and then save it in tab delimited . Working with R 1.g. R output and plots can be copied and pasted into Word to create reports. 2..g w <.ca/~rjmackay/stat371/file name.txt format in your working directory. mean(y) function mean(y) sd(y) summary(y) tapply(x. nbest =k)) purpose calculates the average of the variate y calculates the st dev of the variate y calculates a 5 number summary of the variate y calculates the function e.u+v (x <.b) b <. applied to y for each value of x creates the sum (element-wise product) of two vectors calculates the element-wise square root of y matrix product gives the transpose of A gives the inverse of A fits the linear model y = β 0 + β 1 x1 +.lm(y~-1+x1+x2+…) summary(b) resid(b) fitted(b) anova(b) hatvalues(b) rstudent(b) anova(c.. These can be stored e. Look under the file menu on the R gui to set the working directory. If you want to use R on other data. you can simplify the name with the command attach(a) so the awkward a$ notation is avoided. To restore the full name. I prefer to type them in Word or Notepad and then paste into the R gui. p variate models Appendix I -2 . If there is a single data frame a.

main='title'. adds the fitted line from b <.k)) scatterplot of y vs. x with title .plot(x. x-axis label etc. etc) abline(b) hist(y) qqnorm(y) par(mfrow=c(j.y.lm(y~x) to the scatterplot histogram of the values in y a gaussian qq plot of the values in y creates a graphic window with j rows and k columns for the next jk plots Appendix I -3 .xlab='xx'.

Wk )t written as the transpose of row vectors to save space. We can see that this result is true by noting that the ijth element of xx t is xi x j for any vector x.. Note that the diagonal elements are the variances and the off-diagonal elements are the covariances of the components of U. Y ) = E{( X − E( X ))(Y − E(Y ))} Now suppose we have two vectors of k random variables U t = (U1 . then E( a + U ) = a + E(U ) and E( AU ) = AE(U ) . 2. Cov(U ..MacKay University of Waterloo 2005 Appendix 2 1 . Var(a + U ) = Var(U ) 5. Uk )t W t = (W1 . 1. The expected value of the sum of two vectors is the sum of the expected values That is E(U + W ) = E(U ) + E(W ) . Statistics 371 © R. W ) with ijth element E{(Ui − E (Ui )(Wj − E (Wj )} ... If a is a vector and A is a matrix of constants. Definitions: The expected value of U is the vector E(U ) with ith element E(Ui ) The variance-covariance matrix of U is the matrix Var(U ) with ijth element E{(Ui − E (Ui )(U j − E (U j )} .Appendix 2 Properties of vectors and matrices of random variables Recall that if X and Y are two random variables and a.. b are constants then we have E( aX + b) = aE ( X ) + b E ( X + Y ) = E ( X ) + E (Y ) Var ( aX + b) = a 2 Var ( X ) Var ( X + Y ) = Var ( X ) + Var (Y ) + Cov( X . 6. the covariance of Ui and Wj . Properties: These follow from the properties of expectation. Var (U ) = E{(U − E (U ))(U − E (U )) t }. This follows using the same argument as in 3.J. Y ) where Cov( X. The covariance matrix of U and W is the matrix Cov(U.. W ) = E{(U − E (U ))(W − E (W ))t } . This useful result is easy to show using properties 1 and 3.. Var ( AU ) = AVar (U ) A t . 4.. 3.

. BW ) = Cov(U . In words. if U ~ N ( μ. linear combinations of the components of a multivariate normal random vector are multivariate normal with the appropriate mean and variance-covariance matrix.... Zk )t is a vector of k independent gaussian G(0.. the vectors BU and CU are independent iff Cov( BU . We write Z ~ N (0. If U = μ + AZ . and 1. Zk and hence is gaussian with mean μ i and standard deviation the square root of the ith diagonal element of Var(U ) .. W ) = ACov(U . 2. Σ ) . U j ) = 0 . ak ) is a vector of constants.. then BU ~ N ( Bμ .1) random variables. More generally. Cov( AU . Cov(a + U. 4... CU ) = BVar (U )C t = 0 . The component Ui of U is a constant μ i plus a linear combination of Z1 . we say that Z has a multivariate normal distribution with mean vector 0 and variancecovariance matrix I . Suppose μ is a vector and A is a matrix of constants. W ) 8. W ) B t from 6. BΣB t ) .. then a tU is gaussian with mean a tU and standard deviation a t Σa . 6. The components Ui and U j are independent random variables if and only if Cov(Ui . W ) and Cov(U . 3. Multivariate Normal Distribution If Z t = ( Z1 .. 5. Σ ) . We use the notation U ~ N ( μ. I ) Properties 1.J. More generally.MacKay University of Waterloo 2005 Appendix 2 2 .7. b + W ) = Cov(U. These results follow Statistics 371 © R.. Z2 . then the mean is E(U ) = μ and the variance covariance matrix is Var (U ) = AA t = Σ and U has a multivariate normal distribution. (An important special case of 3) If a t = (a1 .

the expected number of values in each bin is 1. z( 2 ) ). z5 is from a G(0.. These centers are shown with arrows on the plot. If the points deviate from this line substantially..... Then we would expect z(1) to fall in the first bin.. zn is a sample from a G(0.20. z( 2 ) to fall in the second and so on.. u(1) ).. we should see a straight line through the origin with slope 1.( q( 2 ) .. With such a sample. u( 2 ) ). q( 2 ) ..( q( 2 ) .. if we plot the points ( q(1) .( q( 5) ..1) density function and 5 “bins” each with probability 0. a plot of the points ( q(1) ..... z2 . + (1 / 2) = n n 2n If the sample z1 ... Pr( Z ≤ q( 2 ) ) = 3 / 10. Suppose we have a random sample of 5 values z1 . we have Pr( Z ≤ q( i ) ) = i −1 1 2i − 1 ...... consider the figure below which shows a G(0. q( 5) . To explain the plot. z(1) )..( q( n ) . Pr( Z ≤ q( 5 ) ) = 9 / 10 .J.. z( 5) . z( 5 ) ≈ q( 5) or equivalently. then we expect that z(1) ≈ q(1) . z2 .. z( 2 ) ≈ q( 2 ) ... Since u( i ) = μ + σz( i ) .1) distribution.. That is Pr( Z ≤ q(1) ) = 1 / 10.. Denote the sample values in increasing order by z(1) . MacKay University of Waterloo 2009 Appendix 3 1 . z( 5) ) .. z2 .... Then we have ui = μ + σ zi where z1 . Note in general for a sample of size n.Appendix 3: Gaussian Quantile Quantile Plots We use a gaussian quantile-quantile (qq) plot to assess if a set of n values looks like a random sample of size n from a gaussian distribution. un is a sample form a G( μ. Now suppose that u1 . u( n ) ) will be approximately Stat 371 © R..... z5 from this distribution. then we decide that the gaussian assumption is not tenable.1) distribution. u2 ... Let the “probabilistic” centers of the bins be q(1) . σ ) distribution. z( 2 ) .

J. We call this a qq plot and use R to construct the plot with the function qqnorm().a straight line with slope σ and y-intercept μ . If the points deviate from a line substantially. For example. then we can try transforming the values of the response variate before fitting the linear model. to construct the qq plot of the estimated residuals from the fit b. we use the code qqnorm(residual(b)). the final three plots correspond from left to right to a sample of 50 values from a G(0. two cases omitted on the right) are shown below.1) distribution. MacKay University of Waterloo 2009 Appendix 3 2 . You need to be careful not to over-interpret these plots. The plots for the residuals in the assessment data (full data on the left. Note how several of the plots appear non-linear or have apparent outliers.1) distribution. then we decide that the gaussian assumption is not tenable. If the qq plot of the estimated residuals is systematically non-linear. There is no evidence against this assumption once the two cases are deleted. Stat 371 © R. The plots on the next page are based on 9 random samples of size 50 from a G(0. The plot on the left shows that it is not reasonable to suppose that the residuals from fitting the model to the full data set are gaussian. To see the behaviour of the plots for a non-gaussian distribution. the square of the values and the reciprocal of the values.

Stat 371 © R.J. MacKay University of Waterloo 2009

Appendix 3

3

t-table (right tail) For each row (degrees of freedom k ) and column (right tail probability α ), the table entry e satisfies Pr( t k ≥ e) = α . Note that the t-distribution is symmetric about 0.

degrees of freedom 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 35 40 45 50 gaussian right tail probability 0.10 0.05 0.025 3.078 6.314 12.706 1.886 2.920 4.303 1.638 2.353 3.182 1.533 2.132 2.776 1.476 2.015 2.571 1.440 1.943 2.447 1.415 1.895 2.365 1.397 1.860 2.306 1.383 1.833 2.262 1.372 1.812 2.228 1.363 1.796 2.201 1.356 1.782 2.179 1.350 1.771 2.160 1.345 1.761 2.145 1.341 1.753 2.131 1.337 1.746 2.120 1.333 1.740 2.110 1.330 1.734 2.101 1.328 1.729 2.093 1.325 1.725 2.086 1.323 1.721 2.080 1.321 1.717 2.074 1.319 1.714 2.069 1.318 1.711 2.064 1.316 1.708 2.060 1.315 1.706 2.056 1.314 1.703 2.052 1.313 1.701 2.048 1.311 1.699 2.045 1.310 1.697 2.042 1.306 1.690 2.030 1.303 1.684 2.021 1.301 1.679 2.014 1.299 1.676 2.009 1.282 1.646 1.962

0.25 1.000 0.816 0.765 0.741 0.727 0.718 0.711 0.706 0.703 0.700 0.697 0.695 0.694 0.692 0.691 0.690 0.689 0.688 0.688 0.687 0.686 0.686 0.685 0.685 0.684 0.684 0.684 0.683 0.683 0.683 0.682 0.681 0.680 0.679 0.675

0.01 31.821 6.965 4.541 3.747 3.365 3.143 2.998 2.896 2.821 2.764 2.718 2.681 2.650 2.624 2.602 2.583 2.567 2.552 2.539 2.528 2.518 2.508 2.500 2.492 2.485 2.479 2.473 2.467 2.462 2.457 2.438 2.423 2.412 2.403 2.330

Stat 371 © R.J. MacKay, University of Waterloo, 2008

table -1

F-table (right tail) α = 0.10 For each row (denominator degrees of freedom k) and column (numerator degrees of freedom j), the table entry e satisfies P ( F ( j , k ) ≥ e) = α .

numerator degrees of freedom 1 2 3 4 5 6 7 8 9 10 20 30 1 39.86 49.50 53.59 55.83 57.24 58.20 58.91 59.44 59.86 60.19 61.74 62.26 2 8.53 9.00 9.16 9.24 9.29 9.33 9.35 9.37 9.38 9.39 9.44 9.46 3 5.54 5.46 5.39 5.34 5.31 5.28 5.27 5.25 5.24 5.23 5.18 5.17 4 4.54 4.32 4.19 4.11 4.05 4.01 3.98 3.95 3.94 3.92 3.84 3.82 5 4.06 3.78 3.62 3.52 3.45 3.40 3.37 3.34 3.32 3.30 3.21 3.17 6 3.78 3.46 3.29 3.18 3.11 3.05 3.01 2.98 2.96 2.94 2.84 2.80 7 3.59 3.26 3.07 2.96 2.88 2.83 2.78 2.75 2.72 2.70 2.59 2.56 8 3.46 3.11 2.92 2.81 2.73 2.67 2.62 2.59 2.56 2.54 2.42 2.38 9 3.36 3.01 2.81 2.69 2.61 2.55 2.51 2.47 2.44 2.42 2.30 2.25 10 3.29 2.92 2.73 2.61 2.52 2.46 2.41 2.38 2.35 2.32 2.20 2.16 11 3.23 2.86 2.66 2.54 2.45 2.39 2.34 2.30 2.27 2.25 2.12 2.08 12 3.18 2.81 2.61 2.48 2.39 2.33 2.28 2.24 2.21 2.19 2.06 2.01 13 3.14 2.76 2.56 2.43 2.35 2.28 2.23 2.20 2.16 2.14 2.01 1.96 14 3.10 2.73 2.52 2.39 2.31 2.24 2.19 2.15 2.12 2.10 1.96 1.91 15 3.07 2.70 2.49 2.36 2.27 2.21 2.16 2.12 2.09 2.06 1.92 1.87 16 3.05 2.67 2.46 2.33 2.24 2.18 2.13 2.09 2.06 2.03 1.89 1.84 17 3.03 2.64 2.44 2.31 2.22 2.15 2.10 2.06 2.03 2.00 1.86 1.81 18 3.01 2.62 2.42 2.29 2.20 2.13 2.08 2.04 2.00 1.98 1.84 1.78 19 2.99 2.61 2.40 2.27 2.18 2.11 2.06 2.02 1.98 1.96 1.81 1.76 20 2.97 2.59 2.38 2.25 2.16 2.09 2.04 2.00 1.96 1.94 1.79 1.74 21 2.96 2.57 2.36 2.23 2.14 2.08 2.02 1.98 1.95 1.92 1.78 1.72 22 2.95 2.56 2.35 2.22 2.13 2.06 2.01 1.97 1.93 1.90 1.76 1.70 23 2.94 2.55 2.34 2.21 2.11 2.05 1.99 1.95 1.92 1.89 1.74 1.69 24 2.93 2.54 2.33 2.19 2.10 2.04 1.98 1.94 1.91 1.88 1.73 1.67 25 2.92 2.53 2.32 2.18 2.09 2.02 1.97 1.93 1.89 1.87 1.72 1.66 30 2.88 2.49 2.28 2.14 2.05 1.98 1.93 1.88 1.85 1.82 1.67 1.61 40 2.84 2.44 2.23 2.09 2.00 1.93 1.87 1.83 1.79 1.76 1.61 1.54 50 2.81 2.41 2.20 2.06 1.97 1.90 1.84 1.80 1.76 1.73 1.57 1.50 100 2.76 2.36 2.14 2.00 1.91 1.83 1.78 1.73 1.69 1.66 1.49 1.42

Stat 371 © R.J. MacKay, University of Waterloo, 2008

denominator degrees of freedom

table -2

05 3.41 2.96 4.59 3.38 3.77 2.96 2.18 3.99 2.57 2.21 4.90 2.77 238.16 9.98 2.54 2.46 2.04 6.94 8.51 3 10.01 8.54 19.39 3.98 1.80 3.63 3.26 4.12 2.49 2.95 4.60 2.66 2.33 3.25 2.24 2.01 2.82 4.16 2.30 2.11 2.30 19.91 2.59 8 5.22 3.16 3.96 2.68 2.15 2.67 2.45 2.51 2.87 3.32 3.26 25 4.66 5.28 6.58 2.44 3.12 2.05 4.82 2.07 3.08 2.88 19.93 20 248.96 11 4.37 2.65 3.53 2.16 6.38 2.39 6.28 2.56 2.85 2.60 15 4.13 3.54 16 4.78 2.55 3.35 19.70 2.25 2.38 2.45 2.71 2.84 3.32 2.76 2.10 2.23 3.38 9.47 3.75 2.16 2.42 2.02 19.71 3.09 3.81 2.10 19.37 3.02 3.01 2.69 2.85 2.00 9.19 5.48 2.74 1.41 19 4.12 10 4.66 2.86 2.68 3.36 3.88 4.79 5.29 3.07 3.50 19.51 2.06 3.20 2.37 19. 1 1 161.32 22 4.80 4.00 5.71 2.31 2.13 2.25 19.68 30 250.89 8.39 2.01 2.68 3.53 2.81 6.35 2.61 2.86 3.62 5.32 2.06 2.18 2.54 2.70 2.57 Stat 371 © R.38 20 4.09 3 215.40 3.67 14 4.12 3.34 3.J.53 4.07 2.33 19.49 3.F-table (right tail) α = 0.10 3.88 240.45 2 18.99 7 5.39 3.34 2.73 3.42 2.40 2.74 4.11 2.03 3.96 1.25 2.45 18 4.35 3.14 2.35 21 4.63 3.49 2.59 3.48 3.49 17 4. 2008 denominator degrees of freedom table -3 .85 2.89 3.55 2.09 6.80 2.49 3.94 5.84 12 4.92 1.74 4.58 3.14 3.54 2.19 2.42 2.48 3.19 2.20 3.11 3.81 3.01 1.03 100 3.54 2.44 2.79 2.07 2.59 3.29 2.58 230.26 6.99 236.46 2.93 2.15 2.42 3.94 2.64 2.45 2.01 1.08 2.03 2.85 2.23 2.50 3.27 2.84 1.59 5.40 8.34 2.71 5 6.63 2.76 4.15 4.29 3.44 3.16 233.55 2.12 9.13 4 7.18 3.27 2.03 1.79 5. the table entry e satisfies P ( F ( j .03 1.97 3.71 19.97 10 241. MacKay.04 2.34 2.05 2.81 3.92 2.79 2.87 2.64 2.20 3.47 2.98 3.92 2.94 2 199.10 3.08 50 4.66 2.32 9 5.79 3.70 2.75 13 4.83 2.70 numerator degrees of freedom 4 5 6 7 8 9 224.95 2.41 3.56 3.90 2.19 2.31 2.69 1.28 2.69 3.00 2.93 1.49 2.44 3.49 2.51 2.84 2.60 2.52 3.84 1.50 3.78 1.61 2.21 2.18 3.77 2.57 2.45 8.64 3.03 2.39 2.74 2.46 8.17 40 4. k ) ≥ e) = α .40 2.75 4.46 2.74 3.30 23 4.24 3.07 2.23 3.28 4.77 2.90 3.85 8.74 2.37 2.30 2.46 2.76 2.36 2.62 2.87 3.10 4.60 2.59 2.35 4.39 4.10 2.41 4.46 4.61 6 5.42 2.33 2.84 2.71 3.14 4. University of Waterloo.33 2.40 2.94 1.05 For each row (denominator degrees of freedom k) and column (numerator degrees of freedom j).28 24 4.77 4.55 6.24 30 4.26 3.65 2.07 3.

10 3.18 5.55 8.37 3.28 4. the table entry e satisfies P ( F ( j .50 14.39 5.46 7.06 4.02 3.39 5.30 3. numerator degrees of freedom 4 5 6 7 8 9 5624 5764 5859 5928 5981 6022 99.20 4.95 5.98 2.94 3.J.70 2.72 3.01 4.78 3.70 2.22 5.26 4.99 2.89 2.08 2.89 3.72 5.71 28.69 4.94 3.51 6.10 8.01 5.02 7.66 5.68 4.19 5.64 5.10 3.36 3.26 10.84 6.70 3.84 10.41 3.65 9.91 6.01 For each row (denominator degrees of freedom k) and column (numerator degrees of freedom j).31 3.56 10.25 4.97 10.64 3.69 12.50 4.70 4.51 3.59 3.01 3.02 7. University of Waterloo.45 7.26 3.56 7.44 4.24 27.44 4.75 12.99 6.10 7.21 6.93 6.06 4.88 7.12 2.19 5.16 29.42 5.80 2.39 2.47 27.51 3.10 3.70 3.32 4.25 11.60 4.17 2.87 7.86 3.31 4.31 2.46 3.47 5.35 3.51 3.00 3.06 4.78 3.94 3.38 7.77 3.71 3.84 3.32 3.53 8.82 4.41 5.55 14.63 3.40 8.26 3.29 3.26 2.35 5.85 3.39 5.46 10.63 6.39 28.58 3.00 3.94 5.89 4.98 15.83 3.50 3.81 4.22 3.12 21.54 4.37 6.74 2.10 2.55 6.00 2.30 4.51 3.98 10 20 30 6056 6209 6260 99.80 14.88 2.55 2.68 4.72 3.20 16.02 2.81 3.90 3.36 99.67 4.39 10. 2008 denominator degrees of freedom table -4 .49 27.10 3.67 3.20 5.43 4.23 6.63 5.72 4.89 Stat 371 © R.89 4.50 2.26 13.86 4.13 2.40 4.98 14.29 8.14 4.56 4.75 8.85 4.51 4.40 7.32 5.82 2.26 4.82 18.59 3.07 8.84 3.62 3.93 5.78 8.52 15.20 2.29 10.20 3.78 5.99 3.76 3.06 5.21 4.78 4.16 9.37 2.33 9.67 27.17 3.31 7.04 9.65 4.30 3.10 3.33 99.26 8.95 7.66 3.03 4.92 3.69 26.35 4.59 6.16 5.50 4.47 3.40 99.56 5.07 3.56 3.30 99.36 5.36 6.38 99.45 99.04 3.30 4.80 3.78 2.74 5.67 3.05 9.55 9.25 99.79 3. MacKay.43 3.04 4.58 4.94 4.21 2.09 5.64 4.21 14.99 2.87 3.89 3.76 4.90 2 4999 99.20 5.59 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 30 40 50 100 1 4052 98.07 4.62 6.54 3.50 34.57 5.21 2.42 6.77 4.86 8.63 3.69 2.22 4.56 7.00 13. k ) ≥ e) = α .41 4.34 4.37 4.17 3.F-table (right tail) α = 0.47 8.06 9.99 5.69 3.82 3 5404 99.61 5.82 7.11 6.87 4.25 4.01 6.03 3.72 7.94 4.74 4.18 6.71 3.92 9.93 3.62 4.23 6.98 7.61 5.70 6.54 2.18 3.34 15.85 7.17 6.00 30.68 8.77 7.37 2.23 26.66 11.67 5.82 4.15 8.89 4.18 8.65 8.27 10.46 4.80 5.46 4.46 16.03 5.83 2.91 27.41 3.21 3.51 3.67 10.81 5.19 3.14 4.99 6.16 3.85 5.07 1.29 5.94 2.45 3.52 4.02 13.19 6.27 2.99 5.51 3.

Use R to fit the model yi = β 0 + β 1 xi1 +.939 0.4537. Error t value Pr(>|t|) (Intercept) 19..26) ˆ ˆ ˆ ˆ ˆ ˆ b) μ1 = β 0 + β1 x11 + β 2 x21 = 4.001150 b) only age and size To fit the model with only age and size.22e-05 *** office 0. Adjusted R-squared: 0.08139 1.801 0.6911 -3. −1. 2.52300 0.01 `*' 0.354867 --Signif.799388 ratio 0.72137 0.table("assessment.069 0.header=T) attach(a) b<-lm(value~size+age+office+ratio+location) summary(b) to produce the output Call: lm(formula = value ~ size + age + office + ratio + location) Residuals: Min 1Q Median 3Q Max -10. From the R output we have the following ˆ a) β t = (3.87.315 on 5 and 32 DF.10820 -4.1 ` ' 1 Residual standard error: 5.5435 Coefficients: Estimate Std.34128 -1.8164 2.txt) with a) all 5 explanatory variates I used the R code a<-read.+ β p xip + ri to the assessment data (assessment.7848 14..432 0.txt".05 `.11653 0.01.001 `**' 0.256 0.J.41588 4. MacKay.49361 3.14776 0. p-value: 0. r1 = y1 − μ1 = −0.993 on 32 degrees of freedom Multiple R-Squared: 0.03786 0.081181 . University of Waterloo 2009 Exercise Solution -1 .3683 F-statistic: 5.' 0. codes: 0 `***' 0. age -0.833 3.77137 4.8670 -0.Exercise Solutions Chapter 2 1.000288 *** size -2.67.97 2.161926 location 3. use the R code Stat 371 © R.41526 1.

3927.05 `. University of Waterloo 2009 Exercise Solution -2 .629 4. p-value: 0. in general. Consider three regression models: Model 1: ( yi − xi 2 ) = β ( xi1 − xi 2 ) + ri Model 2: yi = β 0 + β 1 xi1 + ri Model 3: yi = γ 0 + γ 1 xi1 + γ 2 xi 2 + ri Stat 371 © R.01 `*' 0. the return on the market xi1 and the risk free return xi2 for n periods.38e-07 *** size -1. codes: 0 `***' 0.629 -1.09905 -4. 3.' 0. size and age.27249 -1.c<-lm(value ~ size + age) summary(c) with output Call: lm(formula = value ~ size + age) Residuals: Min 1Q Median 3Q Max -9.56577 1. Hence the estimates found by F I GH JK Fβ I F X X X X I calculating β = G J = G H X X X X JK c X X hy will differ unless X X Hβ K 1 t 1 1 t 1 −1 2 2 t 2 1 t 2 t 1 t 2 t 1 2 = 0 .227 age -0. then X1t X1 X1t X2 t we have X X = but note that .65633 6.1 ` ' 1 Residual standard error: 6. we do not have this orthogonality.171 -3.89e-05 *** --Signif.001 `**' 0. especially the estimated coefficient for size. Adjusted R-squared: 0. If we write X = ( X1 X2 ) where X corresponds to the full model and X1 corresponds only to the intercept.93698 3.535 2.042 on 35 degrees of freedom Multiple R-Squared: 0.273 3. Note you 2 can interpret this last condition geometrically The product is 0 if the columns of X1 are orthogonal to the columns of X2 .J.230 0. MacKay.358 F-statistic: 11.456 Coefficients: Estimate Std. In this example. Suppose we have the returns on an asset yi .682 18. Error t value Pr(>|t|) (Intercept) 22.32 on 2 and 35 DF.45850 0.0001620 c) Do the estimated coefficients change? Why? Yes. the top left corner of ( X t X ) −1 is t t X2 X1 X2 X2 t not equal to ( X1 X1 ) −1 unless X1t X2 = 0 .

we are projecting onto different subspaces in each case the coefficient will change. c) In fitting models a) and b). We have 1t ( x1 − x11) = ∑ ( xi1 −x1 ) = 0 since x1 is the sample average of the explanatory i variate. β 1 = γ 1 . MacKay. we have β 0 = γ 0 − γ 1 x1 . x1 ) = span(1. we must have β 0 1 + β 1 x1 = γ 0 1 + γ 1 ( x1 − x11) = (γ 0 − γ 1 x1 )1 + γ 1 x1 and since 1. Again the result depends on the orthogonality of the vectors 1 x1 x2 as in Question 1. x1 − x11) ? Since x1 − x11 is a linear combination of 1. e) How does the result in a) simplify the calculation of γ when fitting model 2? Stat 371 © R. we project onto a subspace. We can write the first two models as special cases of the third. Suppose we have a response variate yi and a single explanatory variate xi1 for each of n units sampled from a population. the measure of volatility. 4. How are those projections different? Since the two subspaces are the same. University of Waterloo 2009 Exercise Solution -3 . a) Show that the vectors x1 − x11 and 1 are orthogonal. change? Explain. x1 and 1. b) Why is span(1. d) What is the relationship between the estimated coefficients in fitting the two models? Since the projections are the same.If we fit each model will the coefficient of x1 . the projections are the same vector. Yes the coefficient of x1 is likely to change for each model. x1 − x11 are orthogonal (hence linearly independent. Consider the two models Model 1: yi = β 0 + β 1 xi1 + ri Model 2: yi = γ 0 + γ 1 ( xi1 − x1 ) + ri where x1 is the sample average of the explanatory variate.J. the two spans are the same subspace. Model 1: yi = (0)1 + ( β ) xi1 + (1 − β ) xi 2 + ri Model 2: yi = β 0 + β 1 xi1 + (0) x2 i + ri Model 3: yi = γ 0 + γ 1 xi1 + γ 2 xi 2 + ri In fitting the models. x1 are linearly independent.

Interpret this result geometrically. We defined the hat matrix H = X ( X t X ) −1 X t . Show that a) H t = H H t = ( X ( X t X ) −1 X t )t = X[( X t X ) −1 ]t X using the result that ( AB)t = B t A t . c) ( I − H ) 2 = ( I − H ). We have ( X t X )t = X t X so this matrix is symmetric.. d) 0 ≤ hii ≤ 1 where hii is the diagonal element of H. Combining the two equations. We conclude that the inverse of a symmetric matrix is symmetric and hence [( X t X ) −1 ]t = ( X t X ) −1 and Ht = H . which model gave a larger value for R 2 ? The model with more explanatory variates gave a larger value of R 2 (0.3927) b) Show that R 2 cannot decrease if we add extra terms to a model? Stat 371 © R.. we have hii = hi2 + hi22 +. Some questions about R 2 a) In question 1. we know that H ( Hy) = Hy since Hy is in the column space of X . the matrix X t X is diagonal and hence the inverse is found by inverting the diagonal elements.. Now consider the inverse S −1 of any symmetric matrix S ... the projection onto span(1. x p ) .4537 versus 0. 5. H ( I − H ) = 0 We have ( I − H )2 = ( I − H )( I − H ) = I − H − H + H2 =I−H and H ( I − H ) = H − H 2 = 0 as required.+ hi2( p +1) 1 1 2 where the hii term is removed from the right hand side. x1 .+ hi2( p +1) or hii − hii = hi2 + hi22 +.. we have S( S −1 ) t = I and hence ( S −1 )t = S −1 since the matrix inverse is unique. 6. University of Waterloo 2009 Exercise Solution -4 .. We have ( S −1S ) t = I t = I and also ( S −1S )t = S t ( S −1 ) t = S( S −1 )t since S is symmetric. Hence we have hii (1 − hii ) ≥ 0 or equivalently 0 ≤ hii ≤ 1 . 2 Since H = H 2 = H t H .Since 1. b) H 2 = H . x1 − x11 are orthogonal. MacKay. applying H to a vector in this space has no effect. Hence for any vector y ..J. H 2 = ( X ( X t X ) −1 X t )( X ( X t X ) −1 X t ) = X ( X t X ) −1 ( X t X )( X t X ) −1 X t = X ( X t X ) −1 X t = H Since H is the projection onto the column space of X . Now consider the transpose of the matrix X t X .

237 1. 7.summary(B2) B3<-lm(y3~x3).500 0.236 R2 0.summary(B3) B4<-lm(y4~x4). American Statistician 27. c) Comment.6665 0.6663 0.4997 0. The data in the file anscombe. 17-21 to demonstrate the importance of plotting the data.J.J. construct a scatterplot of y versus x and add the fitted line.6662 0. Anscombe.txt”.summary(B4) The estimated parameters and R 2 for each case are Case 1 2 3 4 β0 3. Stat 371 © R. y1-y4 a) For each pair.237 1.5001 0.summary(B1) B2<-lm(y2~x2).0001 3. y) vectors. I used the following R code to fit the four lines A<-read.4999 σ 1. The file contains 4 sets of ( x. Hence the need to plot the data (or the estimated residuals) to understand the fit. fit a straight line model and report the estimated parameters and the coefficient of determination R 2 b) For each pair. header=T) attach(A) B1<-lm(y1~x1).txt were produced by F.6667 The fitted models and residual sum of squares are virtually identical. MacKay.0025 3.0017 β1 0.R2 = 1 − residual sum of squares from fitting the model ∑ (y i − y )2 As we add terms to a model. the residual sum of squares must go down (at least it cannot go up) since it is the minimum value of the function || y − Xβ ||2 . labeled x1-x4. University of Waterloo 2009 Exercise Solution -5 .001 3.236 1.table(“anscombe. The plots on the next page show that the data relationships between y and x are very different so it is a mistake to interpret R 2 as a measure of fit of the model to the data.

J.e.942 ± 2. we have the estimate 0. University of Waterloo 2009 Exercise Solution -6 . The confidence interval is 0.79 × 0. current sales are close to a constant plus a term proportional to past sales. Some ideas about confidence intervals a) Using the R-output given for the sales promotion example.026 or 0. find a 99% confidence interval for the effect of past sales on the response. the effect of past sales is close to 1 i.79) = 0. Error 0.942 ± 0. With all the other explanatory variates in the model.026 for the coefficient of past sales.942 and Std.Chapter 3 Solutions 1. What can you conclude? From the output.99 .073 . b) How does the confidence interval change as we increase the confidence level? Stat 371 © R. MacKay. The underlying variability is estimated with 25 degrees of freedom and for 99% confidence we have Pr(| t25 | ≤ 2.

we get ~dσ = ~ ~ dσ Kn −( p +1) σ /σ we have Pr(| tn − ( p +1) | ≤ c ) = CL . The data are stored in the file hardness. θ + cdσ ) if and only ifθ − cdσ ≤ θ 0 ≤ θ + cdσ θ −θ 0 | | ≤ c where Pr(| tn − ( p +1) | ≤ c ) = 0. dσ Finally. we replace the estimators by the estimates to get (θ − cdσ . θ + cdσ ) . a company that manufactures candle wax examined 20 candles made from batches of wax that have different amounts of fragrance oil added.05 . a) Interpret the parameter β 0 . ratio. The dσ θ −θ 0 |) . For a particular level of confidence CL. equivalently. b) Find a 95% confidence interval for this parameter. To find the confidence interval. Derive the confidence interval forθ ~ ~ θ −θ ~ ~ G(0.txt.oil + residual . c) ~ Suppose we have θ ~ G(θ .where Pr(| t25 | ≤ c) is the confidence level. err. MacKay.1 ) = tn −( p +1) . β 0 represents the average hardness in the study population of candles if the level of fragrance oil is 0.1 ) . dσ ) . In a small study. Show that θ 0 is in the 95% confidence interval for θ if and only if the pvalue for the test of the hypothesis θ = θ 0 exceeds 5%.The form of the confidence interval is estimate ± c × st.95 or. We can fit the simple model using the R statements Stat 371 © R. University of Waterloo 2009 Exercise Solution -7 . As the confidence level increases. The company was interested in understanding the relationship between the hardness of the candles (a technical measurement) and the amount of fragrance oil added. Taking the Since θ ~ G(θ . we have dσ ~ θ −θ ~ θ − θ G(0. This probability is greater p-value for the hypothesis θ = θ 0 is Pr(| tn −( p +1) | ≥| dσ θ −θ 0 | ≤ c as required. Also we have σ / σ ~ Kn −( p +1) . than 0. dσ ) . the constant c increases and so the interval gets wider (the center stays the same).J.oil.05 if and only if | dσ 2. Pr(| tn − ( p +1) | ≥ c ) = 0. we have ~ θ −θ ~ ~ ~ ~ Pr(| ~ | ≤ c) = CL or re-arranging the inequality Pr(θ − cdσ ≤ θ ≤ θ + cdσ ) = CL . d) θ 0 is in the interval (θ − cdσ . The variates are named hardness and frag. Consider the simple model hardness = β 0 + β 1frag. the estimator for a parameter θ and the ~ statistically independent σ with n − ( p + 1) degrees of freedom.

matrix(b) sterr<-0.00179 ** frag. 07e-15 *** --Signif.05 `. p-value: 9. Adjusted R-squared: 0.3203 or 1.3203. codes: 0 `***' 0.4683 on 18 degrees of freedom Multiple R-Squared: 0.1725 ± 0.001 `**' 0.1725 ± 2.02) X<-model.1725 with associated standard error 0.34417 0. MacKay.oil 6. Error t value Pr(>|t|) (Intercept) 1.a<-read.0. we can use the statements Stat 371 © R.3146 or 1.02 ) .1 ` ' 1 Residual standard error: 0. then we can calculate σ u t ( X t X ) −1 u with the statements u<-c(1.header=T) attach(a) b<-lm(hardness~frag. we have Pr(| t18 | ≤ 2. Alternately.3119 but need to use R to find the corresponding standard error.3042 22.95 and hence the 95% confidence interval is 1.672 Find a 95% confidence interval for the average hardness of candles made with 2% fragrance oil.3203 3.01758 0.661 .0.10 × 0.0. University of Waterloo 2009 Exercise Solution -8 .95868 -0.9724 0.3119 ± 0.965 F-statistic: 525.01 `*' 0.oil) Residuals: Min 1Q Median 3Q Max -0.10) = 0.02).oil) summary(b) with output Call: lm(formula = hardness ~ frag.067e-15 From the output.66 0.2 on 1 and 18 DF.92 9. There are two approaches.4683*sqrt(t(u)%*%solve(t(X)%*%X)%*%u) sterr to get 0. We can estimate θ by c) θ = β 0 + β 1 (0.1725 0. Since the underlying variability is estimated with 18 degrees of freedom.' 0.table("hardness.3119 ± 2.txt". The parameter of interest is θ = β 0 + β 1 (0.J.37530 0.10 × 0.02) = 1.94132 Coefficients: Estimate Std.9669.3146 and hence the 95% confidence interval is 1. the estimate of β 0 is 1. If we let u t = (1.

357 3.95) p to get the output fit lwr [1.0073 ** f2 -0.3313 frag.02.oil creates a vector with components the square of those in frag. Call: lm(formula = hardness ~ frag. interval=”c”.6510668 upr 1.oil 7.05 `.36635 0.oil=0.oil*frag. University of Waterloo 2009 Exercise Solution -9 .level=0.oil + f2) Residuals: Min 1Q Median 3Q Max -0.new<-data.35467 0.9669. To test the hypothesis that the coefficient of the square term is 0.311952 0.9301 --Signif. Adjusted R-squared: 0.001 `**' 0. codes: 0 `***' 0.] 1.180 2.1 on 2 and 17 DF.081 1.089 0.972837 Note the argument interval=“c” produces a confidence interval for the mean when frag.1 ` ' 1 Residual standard error: 0. d) Add a quadratic term to the model (in R.4818 on 17 degrees of freedom Multiple R-Squared: 0. MacKay.9301 so there is no evidence that the coefficient is different from 0. p-value: 2.046 0.oil=0. the p-value is 0. not a prediction interval.963 F-statistic: 248.' 0.02) p<-predict(b.newdata=new.104 1.frame(frag.J. Error t value Pr(>|t|) (Intercept) 1.168 -0.oil). f2<-frag. Is there any evidence of curvature in the relationship? The output is shown on the next page.081 1. In other words.01 `*' 0. there is no evidence of curvature in the relationship between hardness and the amount of fragrance oil.95496 Coefficients: Estimate Std.02659 0.94504 -0.000 0.635e-13 Stat 371 © R.

Prove that the components of β are independent if and only if the columns of X are orthogonal. ~ r 5. we see that the largest pst. ~ ~ We know that β ~ N (β . If we look at the original data set [ the R statement summary(a) is helpful]. Prove Cov( β . newdata=new.x2=1. ~) = 0.header=T) attach(a) b<-lm(response~x1+x2+pst.sales+comp. level=0.152 8996.J. σ 2 ( X t X ) −1 ) and hence the components of β are independent if ( X t X ) −1 is diagonal. Can you see any difficulty with this prediction? We use the R statements to fit the model and produce the prediction interval a<-read.sales=10000. The 95% prediction interval for the change of response is β 2 − β 1 ± cσ 2 or −25.] 9494.table(“trial. b) Construct a prediction interval for the change in sales if promotion 2 is used rather than promotion 1 for the same store (i.027 9992.e. Find a 95% prediction interval using promotion 2 if the past sales are a) $10000 and the competitor sales are $3000. sales + R(1) and hence the difference is Y (2 ) − Y (1) = β 2 − β 1 + R(2) − R(1) ~ G( β 2 − β 1 .pst.interval=”p”. σ 2 ) so (Y (2) − Y (1)) − ( β 2 − β 1 ) ~ ~ G(0.sales value is $1918 so the value of $10000 is an extreme extrapolation.sales=3000) p<-predict(b. This matrix is diagonal if and only if X t X is diagonal and hence when the columns of X are orthogonal.] If promotion 2 is used let the response be Y (2) = β 0 + β 2 + β 3 pst.txt”.09 ~ 4. sales + β 4 comp. past and competitor sales are fixed). sales + R(2) and the corresponding response for promotion 1 isY (1) = β 0 + β 1 + β 3 pst.sales) new<-data. sales + β 4 comp. [You will need to go back to first principles.frame(x1=0.31 ± 60. Using the data in the promotion trial described in this chapter. Stat 371 © R. If we replace σ byσ we get a t distribution with 25 σ 2 degrees of freedom. University of Waterloo 2009 Exercise Solution -10 .3. We have no idea if the model fits in this unexplored region.comp.95) p with output fit lwr upr [1. MacKay.1) .9992).277 so the 95% prediction interval is (8996.

txt. we get ~ Cov( β .05.3 2 F K32 K30 2. ~) = σ 2 ( X t X ) −1 X t ( I − X ( X t X ) −1 X t ) r = σ 2 [( X t X ) −1 X t − ( X t X ) −1 X t X ( X t X ) −1 X t ] =0 as required. find a constant c so that Pr( F ≥ c) = 0. Suppose we have a discrepancy measure with an F distribution with 3 and 30 degrees of freedom.J. Find Pr( F ≥ 2) Using the tables. F = Stat 371 © R. MacKay. the analyst worries 2 that additional second order terms of the form x12 . the manufacturer collects 60 observations to build a model to relate a product property y to two quantitative explanatory variates x1 and x2 . we have Pr( F3. x1 x 2 should be included in the Note that F is the ratio of two K distributions. Theory suggests that a linear model of the form y = β 0 + β 1 x1 + β 2 x2 + r should describe the data. R)( I − H )t Now Cov( R. R) = Var ( R) = σ 2 I and ( I − H )t = I − H = I − X ( X t X ) −1 X t .1352 b.05 From the tables we find c = 2.3. ~) = Cov(( X t X ) −1 X t R. University of Waterloo 2009 Exercise Solution -11 . a. What is the distribution of 1 / F 1 K2 K32 so = 30 ~ F30.92 c. Substituting. In an industrial example.30) to find Pr( F ≥ 2) = 0.05 so all we know is that Pr( F ≥ 2) > 0. x 2 .( I − H ) R) r = ( X t X ) −1 X t Cov( R.92) = 0. From R we can use the statement 1-pf(2. Chapter 4 Solutions 1.32 ≥ 2. The data are stored in the file exercise2. However.~ We have β = ( X t X ) −1 X t Y = β + ( X t X ) −1 X t R and ~ ~ = Y − Xβ r = Y − X ( X t X ) −1 X t Y = ( I − H )Y = ( I − H )( Xβ + R) = (I − H)R Hence ~ Cov(β .

Note that the data set contains one other explanatory variates pst..b) with output Analysis of Variance Table Model 1: y ~ x1 + x2 Model 2: y ~ x1 + x2 + x11 + x12 + x22 Res. Note that in the reduced model the explanatory variate corresponding to β is x = x2 + x3 +.02123 * --so there is some evidence that one or more of the second order terms is necessary in the Signif. model.1980 3 2..x12<-x1*x2.x22<-x2*x2 b<-lm(y~x1+x2+x11+x12+x22) c<.score) x<-x2+x3+x4+x5+x6 c<-lm(sat. MacKay.3785 3 .table(“exercise2. use an F test to address the following questions? a. Does the addition of the extra terms contribute significantly to the fit of the model? [Note: In R you can create new variables such as x 22 < − x 2 * x 2 to represent the quadratic terms.header=T) attach(a) x11<-x1*x1.lm(y~x1+x2) anova(c.txt”.' 0.] We use the following R statements to fit the full and reduced model and then carry out the ANOVA a<-read..score) anova(c. = β 6 = β .05 `.b) Stat 371 © R.score~-1+x1+x2+x3+x4+x5+x6+pst. Is there any evidence of differences among the new versions 2 to 6? We fit the full model and then the model under the hypothesis β 2 = β 3 =.J.5098 0.score The R code is a<-read..+ x6 .5765 2 54 12. In the product testing example (Example 2 in Chapter 4). codes: 0 `***' 0.model.header=T) attach(a) b<-lm(sat.txt”.001 `**' 0.table(“product. University of Waterloo 2009 Exercise Solution -12 .score~-1+x1+x+pst.Df RSS Df Sum of Sq F Pr(>F) 1 57 14.1 ` ' 1 3.01 `*' 0.

5 and 6. The following R statements produce the F test x<-x4+x5+x6 c<-lm(sat. 4.score ~ -1 + x1 + x2 + x3 + x4 + x5 + x6 + pst. b. after controlling for pst.05 `.60391 4 0.score) anova(c.5 and 6.05 `.score Res. a.The output is Analysis of Variance Table Model 1: sat. the model becomes Y = β 1 x1 + β 2 x2 + β 3 x3 + β ( x 4 + x5 + x6 ) + β 7 pst.score Model 2: sat.29400 2 41 1.score ~ -1 + x1 + x2 + x3 + x + pst.score ~ -1 + x1 + x + pst.' 0.score ~ -1 + x1 + x2 + x3 + x4 + x5 + x6 + pst.4101 0.6101 0.score Model 2: sat.01 `*' 0. score + R To test the hypothesis of no difference among versions 4. If we have a single parameterθ .1 ` ' 1 There is strong evidence of differences among the 5 new versions. we fit the reduced model and use the change in the residual sum of squares as the basis for the discrepancy measure.00467 ** --Signif. we can test a hypothesis θ = 0 in two ways. University of Waterloo 2009 Exercise Solution -13 .69009 4.score.score Res.5 and 6 share a common feature.Df RSS Df Sum of Sq F Pr(>F) 1 43 2. codes: 0 `***' 0.score~-1+x1+x2+x3+x+pst.1 ` ' 1 There is strong evidence of differences among versions 4.Df RSS Df Sum of Sq F Pr(>F) 1 45 2.b) The output is Analysis of Variance Table Model 1: sat.' 0.01 `*' 0.51717 6.001 `**' 0.12109 2 41 1. MacKay.60391 2 0. codes: 0 `***' 0. Versions 4.001 `**' 0.003249 ** --Signif. Is there any evidence that these versions have significantly different average satisfaction scores? If β 4 = β 5 = β 6 = β . Explain how we can test the hypothesis using a t-test Stat 371 © R.J.

|θ || ~ where stdev(θ ) = σd and calculate the p-value σd as Pr(| tdf | ≥ d ) where df are the degrees of freedom associated with the residual sum of squares.

We can use the discrepancy measure b. Explain how we can test the hypothesis using an F test

We can fit the full model to find the residual sum of squares. This sum of squares divided by the associated degrees of freedom df (same as in a.) is the denominator of the discrepancy measure. Then we can fit a reduced model in which excludes the explanatory variate associated with θ (since under the hypothesis θ = 0 ) and again calculate the residual sum of squares. The difference in the residual sum of squares (here with 1 degree of freedom) is the numerator of the discrepancy measure. We calculate the p-value by finding Pr( F1,df ≥ discrepancy measure) . c. Consider again the product testing example described in Exercise 3. Consider the hypothesis that the coefficient β 7 of the explanatory variate pst.score is 0. Test the hypothesis in the two ways and show that the p-value is identical. [This is always true although a nuisance to prove]

From the fit of the full model b and summary(b), we get the discrepancy measure 16.410 and p-value < 2e-16 for the t test. We can fit the reduced model with β 7 = 0 and get the F test from the ANOVA with the R statements c<-lm(sat.score~1+x1+x2+x3+x4+x5+x6) anova(c,b) The F-test in the output has discrepancy measure 269.30 with p-value < 2.2e-16 ***. Note that 269.30 = (16.410) 2 . d. If t ~ tk , show that t 2 has an F distribution. What are the degrees of freedom? G(0,1)2 K12 G(0,1) = 2 = F1,k and also that G(0,1) 2 ~ K1 . Hence tk2 = K k2 Kk Kk

We know that tk =

5. Some theory a. In the construction of the F test, explain why the additional sum of squares is always non-negative. The first step is to fit the full model y = β 0 1 + β 1 x1 +...+ β p x p + r by minimizing || r||2 =|| y − β 0 1 + β 1 x1 +...+ β p x p ||2 with respect to β 0 , β 1 ,..., β p . The hypothesis puts some restriction on β 0 , β 1 ,..., β p so when we minimize|| r||2 under this restriction we cannot get

Stat 371 © R.J. MacKay, University of Waterloo 2009

Exercise Solution -14

a smaller value than when there was no constraint. Hence the difference in the two minima must be non-negative. b. Consider the model y = β 0 1 + β 1 x1 +...+ β p x p + r . Show that if we replace the * vector x j by the vector x * = x j − x j 1, the model becomes y = α 0 1 + β 1 x1 +...+ β p x * + r j p . That is, the coefficients of the explanatory variates do not change.

**Letting x * = x j − x j 1 and substituting x j = x * + x j 1 the model becomes j j * * y = β 0 1 + β 1 ( x1 + x11)+...+ β p ( x p + x p 1) + r
**

* = ( β 0 + β 1 x1 +...+ β p x p )1 + β 1 x1 +...+ β p x * + r p and setting α 0 = β 0 + β 1 x1 +...+ β p x p gives the required result.

c.

Explain why testing the hypothesis β 1 = β 2 =... = β p = 0 will yield identical results for either formulation of the model.

* Since span(1, x1 ,..., x p ) = span(1, x1 ,..., x * ) , when fitting the full model, the residual sum p of squares is the same since the vector of estimated residuals is the same. Under the hypothesis, the reduced models are identical so the residual sum of squares will be the same. Hence the two tests are identical.

d.

In the revised model show that x * ⊥1 for all j j

1t x j = ∑ ( xij − x j ) = 0 so x * ⊥1. j

i

e.

In testing the hypothesis, show that the additional sum of squares is β ( X X* )β * where β * = (β 1 ,..., β p )t and X* = ( x1* ,..., x * ) . This quantity is often called p the regression sum of squares.

t * t *

**Let X = (1 X* ) so that in the second representation of the model we have
**

y = α 1 + X* β * + r = 1 X*

b gFGHα IJK + r β

*

= Xβ + r

Stat 371 © R.J. MacKay, University of Waterloo 2009

Exercise Solution -15

F1 1 Since X = b1 X g we have X X = G HX 1

t t

1t X*

t *

*

t *

orthogonal to 1. Hence we have ( X t X ) −1

**I = F1 1 0 I since the columns of X are J J G X X K H0 X X K F1 / n 0 IJ . Also X y = 1 y + X y so =G H0 (X X ) K
**

t * t *

*

*

t

t

t *

−1

t *

*

β = (X X) X y

t −1 t

=

FG1 / n 0 IJ (1 y + X y) H0 (X X ) K F y IJ =G H ( X X ) X yK

t t * −1 t * * t * −1 * t *

so α = y , β * = ( X*t X* ) −1 X*t y . Now can write y = α1 + X* β * + r or equivalently y − y1 = X* β * + r where X* β * ⊥ r and so || y − y1||2 =|| X* β * ||2 +|| r ||2 . The left side is the minimum of the residual sum of squares under the hypothesis β * = 0 . Hence the t additional sum of squares is | | X* β * ||2 = β * X*t X* β * .

Stat 371 © R.J. MacKay, University of Waterloo 2009

Exercise Solution -16

There is no evidence against the fit in any of these plots. the 95% prediction interval for age=30.7 to 21. for this fit. an interval so wide that it is useless.6. Stat 371 © R. With the two points deleted.9 is -46.Chapter 5 Solutions 1. The one common feature of all of the plots is the two large residuals both of which correspond to a large fitted value and relatively small age and size. Note. Is the prediction of value for a building with size 13. much narrower but still not useful. University of Waterloo 2009 Exercise Solution -17 .J. MacKay.9 and age 30 sensitive to any particular cases? We start by fitting the simple model and looking at various plots of the residuals and the qq plot of the standardized residuals to examine the fit.0 to 25. the prediction interval is -17. If we delete these two points and repeat the fit we get the following plots. size=13. Consider the assessment data with simple model value = β 0 + β 1age + β 2 size + residual .6. The bottom line here is that it is not feasible to use these data to assess a building that is so much larger than any other in the sample. The qq plot of the standardized residuals is not linear but is highly distorted by the two large standardized residuals. Use the methods in this chapter to assess the fit of the model and to suggest remedies.

2372 48.1 ` ' 1 Residual standard error: 0.985 3.5329 0.2. Error t value Pr(>|t|) (Intercept) 11. There are 8 combinations. x2 .45 7.6181 0. codes: 0 `***' 0.3000 8.68e-05 *** x2 -3. The data are shown below and can be found in the file ch5Exercise2.776 0.119 --Signif.30 11.34 17. x3 that each were assigned two values.203e-05 We drop x3 from the model.4419 0.969.3000 1.357 1. the investigators looked at the response variate for the so-called center point x1 = 0.' 0.7649 on 7 degrees of freedom Multiple R-Squared: 0.87e-10 *** x1 2. consider two formal approaches.01 `*' 0. In an experimental Plan.2527 -0.1465 0. there were three explanatory variates x1 . Stat 371 © R.05 `.87 9.001 `**' 0. Adjusted R-squared: 0.J.21 17. The summary output from R is Call: lm(formula = y ~ x1 + x2 + x3) Residuals: Min 1Q Median 3Q Max -1. MacKay. x1 -1 -1 -1 1 1 1 1 0 0 0 0 x2 -1 1 1 -1 -1 1 1 0 0 0 0 x3 1 -1 1 -1 1 -1 1 0 0 0 0 y 11. University of Waterloo 2009 Exercise Solution -18 .54 5.15 Suppose we fit a model y = β 0 + β 1 x1 + β 2 x2 + β 3 x3 + r .40 11.57 11.3000 -10. x2 = 0.97 11.70e-05 *** x3 0.1071 0.9558 F-statistic: 73 on 3 and 7 DF.4654 0. As well.1360 0.txt. To assess the fit of the model. p-value: 1.218 7. x3 = 0.89 12. here coded as -1 and +1.7615 Coefficients: Estimate Std.

x2 .9559 1. x 2 to the model and then test the hypothesis that the 2 additional terms are unnecessary. then the least squares estimate of μ( x1 . Hence the residual sum of squares is ∑ ∑ (yij − yi ) 2 .4724 0.x12<-x1*x2. This is called a “pure residual” test of fit.1763 with 3 degrees of freedom. The additional residual sum of squares is 5. x11<-x1*x1. b) Consider an extended model in which the mean of Y is a function μ( x1 . Use the R statements to create the new variates. University of Waterloo 2009 Exercise Solution -19 . for the model y = β 0 + β 1 x1 + β 2 x2 + r .9410 with 8 degrees of freedom.2 a) Add quadratic terms x12 .9410 2 6 3. From the output of the above ANOVA. then μ ( x1 . MacKay. x2 ) is the average of the response variate values at x1 . the residual sum of squares is 5. x2 ) = y = 11. If we model the mean to be different for every set of values of x1 . if there is a single value of i j y at x1 . x2 .3018 so there is no evidence that the second order terms are necessary which provides support for the linear model. b<-lm(y~x1+x2) c<-lm(y~x1+x2+x11+x12) anova(b. Show that the residual sum of squares from fitting this model is ∑ ∑ (yij − yi ) 2 where i indexes the unique sets of explanatory variate values and j i j indexes the replicated observations within these sets. For the given data.c) The output is Analysis of Variance Table Model 1: y ~ x1 + x2 Model 2: y ~ x1 + x2 + x11 + x12 Res. x2 . fit the extended model and then test the hypothesis that the coefficients of the second order terms are 0.Df RSS Df Sum of Sq F Pr(>F) 1 8 5. x1 x 2 .895 and the residual sum of squares is 0.9852 2 1. There are repeated measurements only at the center point (0. use the additional residual sum of squares to test the hypothesis that the extended model is necessary. In our example.0) where μ ( x1 . You will discover that x12 = x2 so we can only add two terms to the model.J. x2 ) with no further specification. x2 ) = y and the estimated residual is 0.7647 with 5 degrees of freedom and the F statistic (5 and 3 degrees of freedom) is Stat 371 © R.

Stat 371 © R. a) After fitting the full model. 3. is there any evidence of lack of fit? We fit the model response = β 0 1 + β 1 x1 + β 2 x 2 + β 3 pst.6 .3)] so there is 0.J.6. MacKay. two of which index the promotion used.The corresponding p-value is 0. Consider the data described in Chapter 3 in which a marketing firm wanted to compare two sales promotions against a control. University of Waterloo 2009 Exercise Solution -20 .5. The response variate is the weekly sales and there are four explanatory variates. sales + r and look at the following 6 plots. sales + β 4 comp.017 [R code pf(19.1763 / 3 some evidence against the fit of the linear model.5.7647 / 5 = 19.

There is one very large fitted value corresponding to case 11 The qq plot is quite straight providing no evidence against the normality assumption The plots of the estimated residuals vs pst.• • • • • the plot of the fitted values versus the estimated residuals shows no unusual patterns.31.70 respectively. The plot of the studentized residuals shows that cases 5.15 and 21 have large studentized residuals 2.43 and –2. The plot of the leverages hii shows one point with extreme leverage again case #11. MacKay. -2.sales each show a point far to the right again corresponding to case 11.sales and comp. There are no obvious transformations or additions to the model.J. University of Waterloo 2009 Exercise Solution -21 . Stat 371 © R. Otherwise there are no apparent patterns.

05 `. Case Deleted 5 11 15 21 p-value 0.1 ` ' 1 Deleting the cases 5.J. University of Waterloo 2009 Exercise Solution -22 . find an expression for C −1 . The key step is to find an expression for the inverse of t X −1 X −1 where X−1 is the matrix X with the first row u1t omitted.5297 0.' 0. The output of the anova function is given below.Df RSS Df Sum of Sq F Pr(>F) 1 26 13419. a) Suppose u and v are two n ×1 column vectors and A = I + vu t . we can test the hypothesis that β 1 = β 2 using ANOVA by refitting a model with x = x1 + x2 .11.and 21 in turn gives the p-vale for the test of the hypothesis β 1 = β 2 as shown in the following table. Stat 371 © R.025 0.1 6.014 0.026 Deleting the cases one-at-a-time has little effect on the p-value and hence on the conclusion that there is a difference in the two promotions. MacKay. Find the constant a so that ( I + vu t ) −1 = I + avu t [This is known as a rank one update] We have I = A −1 A = ( I + avu t )( I + vu t ) = I + avu t + vu t + avu t vu t = I + ( a + 1 + au t v)vu t and so a is the solution to a + 1 + aut v = 0 or a = −1 1 + ut v b) If C = B + uu t where B is invertible.sales + comp.001 `**' 0.b) Suppose the primary question is to compare the two promotions adjusting for past and competitor’s sales.01707 * --Signif.01 `*' 0. Are there any cases that have a large influence on the conclusion about this comparison? With the original fit.1 2 25 10640.15. codes: 0 `***' 0.sales Model 2: response ~ x1 + x2 + pst. 4.sales Res.sales + comp.1 1 2779.003 0. We give the basic mathematics behind the arithmetic that we use for the calculations when deleting a single case. Analysis of Variance Table Model 1: response ~ x + pst.

University of Waterloo 2009 Exercise Solution -23 . Hence we have X t X = (u1 X−1 ) Fu I = u u + X GH X JK t 1 1 t 1 t −1 X−1 . MacKay. Show t t that X t X = X −1 X −1 + u1u1t and hence find an expression for ( X −1 X −1 ) −1 .We write C = B( I + B −1uu t ) = B( I + vu t ) where v = B−1u . We can write X = F u I where u GH X JK t 1 t 1 gives the values of the explanatory variates for the first −1 t case. Hence we have C −1 = ( I + vu t ) −1 B −1 vu t ) B −1 t 1+ u v B −1uu t = (I − ) B −1 1 + u t B −1u = (I − c) Suppose we consider dropping the first case when fitting the model y = Xβ + r . Hence we have −1 t t X −1 X −1 = X t X − u1 u1t and ( X−1 X−1 ) −1 = ( I + ( X t X ) −1 u1u1t )( X t X ) −1 1 − u1t ( X t X ) −1 u1 Stat 371 © R.J.

2). Now we build all three variate models that include x7 . we get (n . x8 . The model used to generate the data was Y = 3 x1 + 0. x1 . Note that the columns of U are also orthogonal so that U tU is diagonal and the diagonal element corresponding to β j is x tj x j . we x tj y have β j = t independent of all the other explanatory variates xjxj 2. Suppose the columns of X are orthogonal. Show that c p = p + 1 for the full model that includes all p explanatory variates. the coefficient of x j . Only models with x1 . σ2 cp = estimated residual sum of squares + 2( k + 1) − n .. If we fit the full model with p 3. Note that the columns of X are not orthogonal.J.5947 corresponds to x8 . a) Fit a model using forward selection.1)σ 2 cp = + 2( p + 1) − n = p + 1 as required. The highest value corresponds to x7 with R2 = 0. At each step. These data were created artificially for practice. x2 . x10 for 100 cases.6446 . then σ2 explanatory variates. use a p-value of 0. By definition if there are k explanatory variates in the model (plus a constant term). is not dependent on which are columns of X are included in the model.p . x3 . x9 have significant coefficients and x1 has the highest R2 = 0.5084 Now we build all two variate models that include x7 and select the next variate to have the highest R 2 value if significant. The file ch6Exercise3. Hence the diagonal element of (U tU ) −1 corresponding to β j is 1 / x tj x j and since (U tU ) −1 is also diagonal. Suppose we have a model that includes x j and any other columns of X . University of Waterloo 2009 Exercise Solution -24 . Show that the estimate of β j ..05 to decide to proceed. x10 have coefficients significantly different from 0 and the highest R2 = 0. R ~ G(0.05 level.Chapter 6 Solutions 1.. Stat 371 © R. x8 and pick the one with the highest R 2 as long as the coefficient is significantly different from 0. MacKay.txt contains a response variate y and 10 explanatory variates x1 . We start with all one variate models and pick the one with the highest R 2 value if any are significant at the 0. We can write the model as y = Uα + r .3 x2 − 2 x4 + x7 − x9 + R. x6 ..

x 4 .768 . x9 and have significant coefficients and x4 has the highest R2 = 0. b) Fit a model using backwards selection using a p-value of 0. x7 .3931790 30. x5 .05 to decide to proceed at each step.7652356 4.7691349 5.7600636 3.105201 0. x10 .247352 0.7691758 5. So we end with the model that includes x1 .474527 0.7696742 5. x8 . We delete x 5 . x 4 . x 4 .05 and we stop. no other variate has coefficient that is significantly different from 0 with p-value less than 0.585931 0.7603550 6.515179 0. We have R2 = 0. x 4 . x6 . x10 with R2 = 0. University of Waterloo 2009 Exercise Solution -25 . all but two are not significant.415232 0. only x10 has a significant coefficient with R2 = 0.For four variate models including x1 .781. The output from the R code is (Intercept) x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 1 1 0 0 0 0 0 0 1 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 2 1 1 0 0 1 0 0 0 0 0 0 2 1 0 0 0 0 0 0 1 1 0 0 3 1 1 0 0 1 0 0 1 0 0 0 3 1 1 0 1 1 0 0 0 0 0 0 4 1 1 0 0 1 0 0 1 0 1 0 4 1 1 0 0 1 0 1 1 0 0 0 5 1 1 0 0 1 0 0 1 1 0 1 5 1 1 0 0 1 0 0 1 0 1 1 6 1 1 0 0 1 0 1 1 1 0 1 6 1 0 0 1 1 0 0 1 1 1 1 7 1 1 0 1 1 0 0 1 1 1 1 7 1 1 1 0 1 0 1 1 1 0 1 8 1 1 0 1 1 0 1 1 1 1 1 8 1 1 1 0 1 0 1 1 1 1 1 cp adjr2 109. we delete x6 With the remaining seven variates. x8 . x8 .782 c) Use leaps to investigate all possible models. all variates have coefficients with pvalue less than 0.5863704 11. x9 . x7 .781 For six variate models including x1 .642803 0.7444682 21.7683448 7.582227 0.990974 0. we delete x1 With the remaining six variates x3 .036584 0. x 7 . MacKay. For five variate models including x1 .633685 0. x7 . Pick a reasonable model.7679391 7. With the remaining nine variates.7681730 5.024557 0.6971923 75. x8 .168717 0. we delete x2 With the remaining eight variates.181391 0.302265 0. x7 .7675915 Stat 371 © R.7197896 6. x10 . With ten variates. For backwards selection we start with a ten variate model and delete the one that is least significant if we can find one.130472 0. x 4 .05. x8 .J.5033519 155.

we need only count the number of samples that contain a particular unit. Chapter 8 1. d) How do the results of the three strategies compare in this case. University of Waterloo 2009 Exercise Solution -26 . x8 . a) Show that the inclusion probability for each unit in the frame is 1/100 for every protocol. we count the ways of selecting the remaining 99 units. For each protocol. MacKay. That is. the chance of any possible sample is equal. the model is uniform. 9999 9999 99 1 = SRS: there are ways to select the other units so pi = 10000 99 100 FG H IJ K Systematic sampling: there is only one way to select the sample so pi = 1 / 100 F 999IJ FG1000IJ ways to select the other units so Stratified sampling: there are G H 9 K H 10 K FG 999IJ FG1000IJ H 9 K H 10 K = 1 p = FG1000IJ 100 H 10 K F 999IJ ways to choose the remaining clusters so Cluster sampling: there are G H9K FG 999IJ H 9K = 1 p = FG1000IJ 100 H 10 K 9 FG IJ H K FG IJ H 100 K 9 i 10 i Stat 371 © R. x 4 . Once the unit is in the sample. x10 and with many other good candidates. The backwards selection got us to a six variate model which has good cp and adjusted R 2 values. To find the inclusion probability. Here the forward selection and best subsets methods got us to the same model. Consider the sampling protocols defined in Example 1..J.The best choice is the five variate model with x1 . None of the methods reproduced the model used to generate the data – this is not surprising because of the correlations among the columns of the X matrix. x7 .

. MacKay. Is this a correct answer? No because there are many sampling protocols that satisfy this definition as shown in a) ∑ yi iε s c) Show that the estimator corresponding to the sample average μ = is n unbiased for μ for each of the protocols. [Hint: Use the fact that ∑ ( yi − y ) 2 = ∑ yi2 − ny 2 ]... n −1 ~ For SRS. Then we can write S0 otherwise T i i i i n ~ and E( μ ) = ∑ y E( I ) ∑ y p ε ε i i i i U i n = i U n i = μ since pi = n / N for all five protocols. i s iε s iε s iε U ∑(y ε − y)2 ~ ~ Using the hint.J. show that σ 2 is an unbiased estimator for σ 2 . we can write (n − 1)σ 2 = ∑ yi2 Ii − nμ 2 where E( Ii ) = n / N and ~ ~ ~ E( μ 2 ) = Var( μ ) + E( μ )2 = (1 − f ) σ2 n 2 ~ 2 ) = 1 ( y 2 n − n[(1 − n ) σ − μ 2 ]) E(σ ∑ i n − 1 iε U N N n = = 1 n n ( [∑ yi2 − Nμ 2 ] − (1 − )σ 2 ) n − 1 N iε U N 1 n n ( ( N − 1)σ 2 − (1 − )σ 2 ) n −1 N N ~ Is σ unbiased for σ ? + μ 2 . Let Ii = ~ μ= ∑y I ε i U R1 if unit i is in the sample i = 1. Combining the results we have =σ2 b) Stat 371 © R. Consider the estimate σ = a) ~ and the corresponding estimator σ . N so that E( I ) = p . University of Waterloo 2009 Exercise Solution -27 . Then the other H1 K FG 9IJ FG 999IJ FG1000IJ F 999IJ FG1000IJ ways so p = H1 K H 49 K H 50 K = 1 99 secondary units can be selected in G H 49 K H 50 K FG10IJ FG1000IJ 100 H 2 K H 50 K i 2 b) On a final examination.. a student once defined simple random sampling as follows: “simple random sampling is a method of selecting units from a population so that every unit has the same chance of selection”.Two stage sampling: there are FG 9IJ ways to select the second primary unit. 2.

Using a GPS system. # of sparrows # of plots a) 0 28 1 13 2 5 3 3 4 1 Find a 95% confidence interval for the total number of male song sparrows in the square. find a formula for the required sample size.No using the result assignment 2.011 and interval length 0. To estimate the total number of male song sparrows in a 10 km by 10 km square (http://www. a) For a given confidence level and required precision p%.. 4.72. your intrepid instructor visits each of the selected plots (after dawn but before 9:00 am between May 24 and July 6) and counts the number of singing male song sparrows detected in a 10 minute period.. we can solve for n to find n = 378.011 or 0. 3.org/atlas/atlasmain. σ = 1.birdsontario.28 . MacKay.J. 1 1 − σ . the length of the confidence interval for μ is 2 × 1. The data can be written as y1 .html ) for a breeding bird atlas. y50 where 28 of the yi are 0 and so on. a simple random sample of 50 one hectare plots (a hectare is 100m by 100m) is selected.2. How many additional plots are needed? n 1/ 2 σ . That is. We want to find the In general. the length of a confidence interval for μ is 2c n N 1 1 − σ / μ = p / 100 . The data are summarized below. Suppose we want to estimate a population average so that the relative precision is specified..011 and a 95% confidence interval for 50 1.96 1 − 10000 50 95% confidence interval for the total number of male sparrows in the square (τ = 10000 μ ) is 7200 ± 2800 b) Suppose that I wanted to estimate the total number of male song sparrows to within ±1000 with 95% confidence. Hence we need about 328 more plots to achieve the desired precision. we want to find the sample size required (SRS) so that the length of the confidence interval 2l divided by the sample average is pre-determined. University of Waterloo 2009 ..72 ± 1. Using In general. Solving for n we have the ugly formula sample size n so that 2c n N 1 n= 1 pμ 2 +( ) N 200cσ b) What knowledge of the population attributes do we need to make this formula usable? Exercise Solution -28 Stat 371 © R. Hence the sample mean and standard deviation are μ = 0.72 ± 0.96(1 − ) n 10000 σ = 1. A the average number of sparrows per plot is 0.

we have the following graph. d) Given the results in c). b) Calculate the probability p(π ) that you accept the shipment as a function of π . you decide to increase the sample size so that there is only a 5% chance of accepting a shipment with 1% defective. Usually haphazard (for small items) or systematic sampling is used in this context. you inspect the complete shipment. What sample size do you recommend? Stat 371 © R.π ) 20 c) Graph p(π ) for 0 ≤ π ≤ 10% Using R. MacKay. If you find 1 or more defective items. a) How would you select the sample? It would be nice to use SRS but it is likely too expensive unless the items are already numbered and it is easy to locate an idea with a specified label. the percentage of defective items in the shipment.We need an estimate of the so-called coefficient of variation σ / μ .J. Then we have P(accept shipment) = (1. we can approximate the number of defective items in the sample by a binomial random variable with n = 20 and the probability of a defective item π . Since we are sampling such a small fraction of the shipment. You decide to select and inspect a sample of 20 items and accept the shipment if you find 0 defectives. University of Waterloo 2009 Exercise Solution -29 . 5. Suppose that there are N = 1000 items in a shipment and you cannot tolerate more than 1% defective (your first mistake – why should you tolerate any defective items from your supplier). One cheap but (poor) way to check the quality of a batch of items is called acceptance sampling.

y0 ) ( y − y0 ) 2 ( x − x0 ) + ( y − y0 ) + + ( x − x 0 )( y − y0 ) + 2 ∂x ∂y ∂x 2 ∂x∂y ∂y 2 2 This quadratic function has the same value. a shipping company selects a sample of 25 items and weighs them. first and second derivatives at the point ( x0 .J. . is completely unreasonable. y) ~ and with a bit of effort. In order to count the number of small items in a large container. University of Waterloo 2009 Exercise Solution -30 . we can show Cov( μ n 1 is the population covariance. μ ( y)] 2 μ( x) 2 ~ ( x )) = (1 − f ) σ ( x ) The approximate bias is given by the second term. y0 ) ∂ 2 f ( x 0 . We want to find n so that (1 − 0. y) where Cov( x. You can easily check this statement by differentiating the right side of the expression. y0 ) + ∂ f ( x 0 . y0 ) ( x − x 0 ) 2 ∂ 2 f ( x 0 . y) . MacKay.Suppose the sample size is n . Assume that there is small error in weighing and act as if SRS is used . Sampling inspection is not useful here. y0 ) ∂ 2 f ( x 0 . Chapter 9 Find the quadratic expansion of f ( x. 2. They then weigh the whole shipment (excluding the container). To use the expansion. Note that the general form of the expansion is f ( x . I recommend you tell your supplier to ensure that there are no defective items in the shipments. y) = y / x about the point ( μ ( x ). μ ( y)) = (1 − f ) Cov( x. we have P(accept shipment) = (1.01) n = 0. y0 ) as does f ( x. 2 = = . ~ ~ ~ estimate the bias in the estimator θ = μ ( y) / μ ( x ) . Assuming that we can use the binomial approximation. This is so large that the binomial approximation may breakdown and. .π ) n . We know Var( μ n ~ ( x ). y0 ) ∂f ( x 0 .it is not. y ) ≈ f ( x 0 . Let the weight of the ith item in the population be yi and the total known weight be τ Stat 371 © R.05 so n = 298 . 2 =0 μ ( x )2 ∂y μ ( x ) ∂x μ ( x )3 ∂x∂y μ ( x )2 ∂y ∂x so we can write ~ ~ μ ( y) ~ 1 ~ 2 μ ( y) [ μ ( x ) − μ ( x )]2 1 ~ ~ θ ≈θ − − [ μ ( x ) − μ ( x )] + [ μ ( y) − μ ( y)] + [ μ ( x ) − μ ( x )][ μ ( y) − μ ( y)] 2 3 2 μ( x ) μ( x) μ( x) μ( x) 2 and μ ( y) 1 ~ ~ ~ ~ E (θ ) ≈ θ + Var ( μ ( x )) − Cov( μ ( x ). on a practical basis. we have ∂f μ ( y) ∂f 1 ∂ 2 f 2 μ ( y) ∂ 2 f 1 ∂2 f =− =− . μ ( y )) 3 μ( x) μ ( x )2 =θ + 1 ~ ~ ~ [θ Var ( μ ( x )) − Cov( μ ( x ). μ ( y)) to 1. the sampling is haphazard. The key point is to notice that the bias has a factor and n will be small if the sample size is large.

The linear approximation about μ ( y) is 1 1 1 1 −1 ~ −1 ( μ ( y) − μ ( y)) and ≈ + ( y − μ ( y)) and hence we have ~ ≈ + 2 y μ ( y) μ ( y) μ ( y) μ ( y) μ ( y ) 2 1 1 1 1 ~ ~ ~ E( ~ ) ≈ . Consider expanding the function f ( y) = 1 / y . N ) . Var ( N ) ≈ . (1− f ) σ ( y) ~ To find the confidence interval we have N ~ G( N .2 kg. we get the confidence interval N N ( . n μ ( y) 2 μ ( y)2 In the example. we have 4 μ ( y) μ ( y) μ ( y) μ ( y) (1 − f ) N 2σ ( y)2 N2 ~ ~ ~ Var ( μ ( y)) = E( N ) ≈ N . N 25 ∑ yi / 25 iε s ∑y ε i s τ i / 25 . μ( y). σ ( y) . Note n μ ( y) that the mean and standard deviation both depend on the unknown N which is different ~ N 1 − f σ ( y) from the usual situation. 2053). MacKay.96(1 − f )1/ 2 σ ( y) 1+ 1− n n μ ( y) μ ( y) In the example.J.45 g with sample standard deviation 0. we have N = 2044 and the 95% confidence interval is (2035.a) Show that an estimate of the population size is N = Note that τ = Nμ ( y) so we can construct an estimate of N using our knowledge of estimating μ ( y) . Var ( ~ ) ≈ Var ( μ ( y)) . The sample average and population average should be close.96) = 0. b) Find the (approximate) mean and standard deviation of the corresponding ~ estimator N . the sample average weight is 75.163 g and the total weight is 154.96(1 − f ) σ ( y) 1. University of Waterloo 2009 Exercise Solution -31 . Find a 95% confidence interval for the total number of items in the container. μ ( y). approximately. Instead we work with ~ G(1. ) and hence we N n μ ( y) have ~ N / N −1 | ≤ 1. so we have ∑ yi τ iε s τ ≈ and hence the estimate N = . Since N = τ / μ ( y) . c) Stat 371 © R. σ ( y) for N . ) 1/ 2 1.95 Pr(| (1 − f )1/ 2 σ ( y) n μ ( y) ~ Re-arranging the inequality and substituting N .

Suppose we wanted to estimate the number of wood thrush pairs nesting within the region of Waterloo.62 with associated estimated standard deviation 1− f σ ( y)2 = 0.4 ha.34. If the response variate is approximately proportional to the explanatory variate.62 ± 0.62. μ ( y) = 0.85 with associated estimated standard deviation 1− f i s = 0. α = μ ( y) = 1.62.62 ± 0.088 so an approximate 95% confidence interval for the n n −1 population average is 1. we know that there are 1783 such patches (minimum size 3 ha) with an average size 13. We need the following summaries of the data: 4.3. MacKay.J. i i ∑ (y − θ x ) ε 2 Stat 371 © R. If the response variate is approximately linear in the explanatory variate. A simple random sample of 50 woodlots is selected and the number of nesting pairs yi is counted in each woodlot by counting the number of singing males.xls. We can exploit this knowledge when we are trying to estimate population totals or density.142 n −1 a) The sample average is 1.72. ∑[y − α − β( x ε i i s i − μ ( x ))]2 = 0. θ = μ ( x ) = 11. We need an explanatory variate with known population average that can be measured on each unit in the sample.72 σ ( y) = 1. μ ( x ) = 11.17) = 3299 ± 308 .17 and for the population total is 1783(185 ± 0.399.228 μ( x) ∑ (y − θ x ) ε i i i s 2 n −1 = 0. then the regression estimate is more precise. Many bird species have specialized habitat. Briefly describe when you would use the ratio or regression estimate instead of the sample average to estimate the population average. β = 0. Using aerial photography. wood thrush are a forest dwelling bird that live in the hardwood forests of eastern North America. The area xi of each sampled woodlot is also recorded. μ ( y) = 1. Find 95% confidence intervals for the total number of thrushes based on the a) sample average y ratio estimate b) c) regression estimate If you do not want to do the calculations. University of Waterloo 2009 Exercise Solution -32 .138.85 ± 0.37) = 2888 ± 653 b) The ratio estimate is θμ ( x ) = 1. The data are available in the file exercise8. an area of highly fragmented forest patches.187 so an approximate 95% confidence interval for the average n number of thrushes per woodlot is 1. write out what summaries you need to get the three confidence intervals. then the ratio estimate is more precise than the sample average.37 and the interval for the total number of thrushes is 1783(1. For example.

c) The regression estimate is μ ( y) + β ( μ ( x ) − μ ( x )) = 2.00 ± 0. University of Waterloo 2009 Exercise Solution -33 . there is interest in estimating strata averages or differences in strata averages.00 with associated estimated 1− f i s = 0. MacKay.4 .J.00 ± 0. In many surveys.4 Chapter 10 1. the “fitted” line through the origin (dotted) and μ ( x ) = 13.10) = 3566 ± 184 i i standard deviation ∑[y − α − β( x ε − μ ( x ))]2 The scatterplot on the next page shows the fitted regression line (solid). Stat 371 © R. Number of thrushes versus woodlot area 6 5 4 number 3 2 1 0 0 -1 5 10 15 area 20 25 30 35 μ ( x ) = 13.10 and for the population total is 1783(2.053 so an approximate 95% n n −1 confidence interval for the population average is 2.

30 ± 9. a) Write down the stratified estimate of π and the variance of the corresponding estimator..3 with associated estimated standard deviation 2 (1 − f1 )σ 1 = 3. For farms with animals. Since π = W1π 1 +.614 2 = 4.030 so the 95% confidence interval is standard deviation n 0. There is no evidence of a difference in average Na levels between the two groups of farms.877 . we have Stat 371 © R. MacKay...(1 − fh )1/ 2 σh nh ) ~ ~ μ h − μ k ~ G( μ h − μ k .30 with associated n2 estimated standard deviation 3.+ WH π H and 2 ~ ~ ~ Var (π strat ) = W12 Var (π 1 )+.+WH π H we have π strat = W1π 1 +. find a 95% confidence interval for the proportion of wells in farms with animals that are contaminated The estimate of the proportion contaminated is π 1 =.614 and hence we have μ 2 − μ 1 = 8. we have μ 1 = 237. Assuming relatively large sample sizes within the strata we have approximately ~ μ h ~ G( μ h . find a 95% confidence interval for the average Na difference between the two types of farm wells.275 .6 with associated estimated standard deviation 2 (1 − f2 )σ 2 = 3...172 with associated estimated 1− f π 1 (1 − π 1 ) = 0. for SRS.+ WH Var (π H ) (1 − f1 ) 2 (1 − f H ) π 1 (1 − π 1 )+.058 c) In the well survey.~ a) In general. we have μ 2 = 245. Hence a 95% confidence interval for μ 2 − μ1 is 8. Suppose that the purpose of the survey is to estimate a population proportion π . University of Waterloo 2009 Exercise Solution -34 .+ WH π H (1 − π H ) n1 nH (ignoring the factors nh / (nh −1) ) ~ b) What is the variance of π strat for proportional allocation? = W12 If nh = Wh n .. (1 − fh )σ 2 (1 − fk )σ 2 h k ) + nh nk b) In the well survey.. n1 For farms without animals.J.. write down the distribution for the estimators μ h and ~ ~ μh − μk . 2.172 ± 0.60 .2752 + 3. If there are H strata.

726 0.. ~ c) For each case. We do this by making π h close to 0 or 1 for each stratum.+ WH π H (1 − π H ) W1n WH n 1− f [W1π 1 (1 − π 1 )+. For optimal allocation. Suppose the well survey was to be re-done with the same overall sample size 500. If we assume that the standard deviations do not change markedly.23 so the optimal sample sizes are A: 76..62 0.317811 3 0. University of Waterloo 2009 Exercise Solution -35 .37738 2 0. In words. How would you recommend allocating the sample to the strata if a) Estimating the average Na level was the primary goal. We have stratum 1 2 3 Weight St Dev 0.338491 So the optimal sample sizes are B:97. we have nh ∝ Whσ h .177 0. 3.J. For optimal allocation.+ WH π H (1 − π H )] n where f = 1 − n / N . more weight is given to stratum three because it is larger and has higher estimated standard deviation. 45 and 358.. we have nh ∝ Wh π h (1 − π h ) and we use the current estimates to get startum Weight sd 1 0.(1 − n1 / N1 ) 2 (1 − nH / N H ) ~ Var (π strat ) = W12 π 1 (1 − π 1 )+.726 51.. We have Stat 371 © R. we use the estimates from the current survey to allocate the sample.097 37. 38 and 386.177 41. = c) How should the strata be formed so that the stratified sampling protocol is superior to SRS? We want to form the strata so that [W1π 1 (1 − π 1 )+. we decrease the variation in the strata by making the response more consistent. compare the predicted standard deviations of μ strat and ~ π strat to what occurred in the current survey..45 0.097 0..+WH π H (1 − π H )] < π (1 − π ) . MacKay. b) Estimating the proportion of contaminated wells was the primary goal.

(1 − f1 ) 2 ~ 2 (1 − f H ) Var ( μ strat ) = W12 σ 1 +...+ WH σ2 H n1 nH (1 − f1 ) 2 (1 − f H ) ~ Var (π strat ) = W12 π 1 (1 − π 1 )+...+ WH π H (1 − π H ) n1 nH Using the current estimates and the two new allocations, we get the estimated standard deviations

allocation current

~ μ strat π strat

A 2.11 0.015

B 2.13 0.015

2.42 0.016

The estimator of the proportion is much less sensitive to changes in the allocation. . ~ 4. Consider the difference of the variances of μ strat under proportional and optimal allocation for a sample of size n. Ignore the fpc. 1 a) Show that this difference can be written as ∑ (σ h − σ )2 Wh where n h σ = ∑ σ h Wh is the weighted average standard deviation over the H

h

strata. For optimal allocation, ignoring the fpc, we have 1 Vopt = (W1σ 1 +...+ WHσ H ) 2 and, for proportional allocation, we have n 1 2 Vprop = (W1σ 1 +...+ WH σ 2 ) . The weights can considered a probability distribution on the H n 1 1 integers 1,…,H so we have ∑ (σ h − σ )2 Wh = [∑ σ 2 Wh − (∑ σ h Wh ) 2 ] as required. h n h n h h b) When will the gain be large with optimal allocation relative to proportional allocation? The gain will be largest when the standard deviations vary widely. 5. In an informal sample of math students at UW, 100 people were asked their opinion (on a 5 point scale) about the core courses and their value. One particular statement was (with scores): “All mathematics students are required to take Stat 231?” strongly agree – 1 agree – 2 neutral – 3 disagree- 4 strongly disagree - 5 The sample results, broken down by year are shown below. Estimate the average score for all math students and find an approximate 95% confidence interval for the population average – note that SRS was not used here so were are making assumptions about the estimators that may be unwarranted. There are about 3300 students in the faculty.

Stat 371 © R.J. MacKay, University of Waterloo 2009

Exercise Solution -36

Year

Sample size

1 2 3 4

39 23 26 12

Population weight 0.31 0.24 0.23 0.22

Average score

2.8 3.5 3.2 3.1

Standard deviation 1.22 1.09 1.03 0.87

We can estimate the average score as if we had stratified the sampling beforehand.

**. . μ post = 0.31(2.8) + 0.24(3.5) + 0.23(3.2) + 0.22(31) = 3126 . The approximate estimated ~ variance of μ post is
**

1 − fh 2 2 Wh σ h = 0.107 nh h The approximate 95% confidence interval is 313 ± 0.21. .

∑

Stat 371 © R.J. MacKay, University of Waterloo 2009

Exercise Solution -37

In Stat 371, we deal with applications and theory of the linear model Y = Xβ + R where X = (1 x1 ... x p ) is a n × ( p + 1) matrix with columns giving the values of the explanatory variates and R is a vector of random variables with independent components Ri ~ N (0, σ 2 ) . We represent the corresponding data model by y = Xβ + r where y is the vector of observed values of the response variate. 1. For the model described above: $ a) (4 marks) Derive the least squares estimate of β , i.e. show that β = ( X t X ) −1 X t y . Be sure to explain the principles underlying your derivation.

n

Statistics 371 Midterm Solution

The least squares criterion is to minimize

∑ ri

i =1

2

=r t r = ( y − Xβ )t ( y − Xβ ) with respect to

β.

y r

Xβ

From the picture, the minimum value corresponds to the orthogonal projection of y onto $ the column space of X so that y − Xβ is perpendicular to 1, x1 ..., x p , the columns of X or $ $ $ equivalently X t ( y − Xβ ) = 0 . Solving we have X t y = X t Xβ so β = ( X t X ) −1 X t y as required. b)

% (3 marks) Show that the estimator r corresponding to the estimated residuals is 2 MVN (0, σ ( I − H )) where H = X ( X t X ) −1 X t .

$ $ Since r = y − Xβ = y − X ( X t X ) −1 Xy = ( I − H ) y , we have ~ = ( I − H )Y = ( I − H )( Xβ + R) = ( I − H ) R and hence ~ is multivariate normal with mean r r vector and variance covariance matrix E(~) = E(( I − H ) R) = ( I − H )0 = 0 r Var ( ~) = Var (( I − H ) R) = ( I − H )σ 2 I (( I − H )t = σ 2 ( I − H ) r

since ( I − H )( I − H )t = ( I − H ) and I − H is a symmetric projection matrix.

c)

(3 marks) Using the result in b), explain the notion of a unit in the sample that has high leverage. We say that unit i has high leverage if the ith diagonal element of H, hii , is close to 1. ~ $ Using the result in b), we have ri ~ N (0,(1 − hii )σ 2 ) and ifhii ≈ 1, we know that ri is close to 0 . Since hii depends only on X, the fitted plane passes close to yi regardless of its value so deleting case i may have a large effect on the fitted plane.

1

2. p-value: 1.7907 3.416 on 48 degrees of freedom Multiple R-Squared: 0.81995 0.001 `**' 0. γ 1 represents the change in the expected response for a unit change in x1 with x2 held fixed.58 on 1 and 48 DF.' 0.81201 -1.' 0.53223 0.1256 6.05 `.402 2 .01 < 2e-16 *** x1 0.05 `. 59e-07 *** --Signif.7724 -0.85583 3.44051 69.91e-07 *** -0.01 `*' 0. the summary output from fitting the two models is: Call: lm(formula = y ~ x1) Residuals: Min 1Q Median 3Q Max -2. Error t value Pr(>|t|) 30. What is the difference in interpretation between β 1 and γ 1 ? β 1 represents the change in the expected response for a unit change in x1 .2069 Coefficients: (Intercept) x1 x2 --Signif.04878 -0. a) (2 marks) Consider the two models Model 1: Y = β 0 1 + β 1 x1 + R and Model 2: Y = γ 0 1 + γ 1 x1 + γ 2 x2 + R .02064 0.587e-07 Call: lm(formula = y ~ x1 + x2) Residuals: Min 1Q Median 3Q Max -2. Adjusted R-squared: 0.02882 -0. Suppose we are interested in understanding the relationship between a response variate y and a specified explanatory variate x1 In an investigation.4391.7700 0.3653 83. From R. Error t value Pr(>|t|) (Intercept) 30. x1 and a second explanatory variate x2 are measured on a sample of 50 units from the study population.1072 0.13915 5.04125 0.892 3.846 0.1 ` ' 1 Residual standard error: 1. codes: 0 `***' 0.3255 0.311 < 2e-16 *** 0.1 ` ' 1 Estimate Std.01 `*' 0.4274 F-statistic: 37.9820 -0. codes: 0 `***' 0.26556 Coefficients: Estimate Std.13 1.001 `**' 0. y.

019 with 47 degrees of freedom.005 with 48 degrees of freedom.820 ± 0.135 and 47 degrees of freedom. Show how we ca use the Analysis of variance to test the same hypothesis.4475. we need to find two estimates of σ 2 .273 d) (1 mark) The F statistic in the summary output for model 2 is F=19. Adjusted R-squared: 0. Hence the change in the residual sum of squares is 47 × 2. the estimate of σ 2 is 1.Residual standard error: 1.4212 = 2.03 on 2 and 47 DF.019 − 48 × 2.01) ≈ 0.807e-07) there is strong evidence against the hypothesis that all of the γ ’s are 0.019 f) (2 marks) A plot of the estimated residuals versus the fitted values from Model 2 is shown below. 3 .807e-07 b) (2 marks) The value of R 2 is a bit larger for model 2.662 so there no evidence that γ 2 differs from 0. 2.03. Some suggested that there was a “funnel” effect that suggests a non-constant standard deviation and transforming the response variate using the logarithm etc.416 2 = 2.388 / 1 = 0.338 with 1 degree of freedom so the F-ratio is 1.820 with standard error of 0.135 (estimate ±c standard error) or 0. From fitting the full model the estimate of σ 2 that does not depend on any hypothesis about γ 2 is 1.95 so the confidence interval is 0.005 = 1. Does this mean that this model better fits the data? No since we know that R 2 must increase as we add terms to a model whether or not the corresponding explanatory variate has an effect on the response. c) (3 marks) Using the model 2. R 2 measures the proportion of variation in the response variate explained by the explanatory variates. find a 95% confidence interval for γ 1 $ From the R output we have γ 1 = 0.820 ± 2. From the t-table we have P(| t47 | ≤ 2. No action is required because there are no apparent patterns or outliers on the plot. When fitting the model assuming γ 2 = 0 . What does this signify? Since the F ratio is so large (p-value 8. what action would you recommend? Why? Note the plot has been corrected in the solution. there is strong evidence that one or more of the explanatory variates explains variation in the response variate.. p-value: 8.421 on 47 degrees of freedom Multiple R-Squared: 0.424 F-statistic: 19. To use ANOVA. That is. Based on the plot. e) (3 marks) The output for model 2 shows that there is no evidence that γ 2 differs from 0 using a t-test.01 × 0.

u p ) .. we have ~ Y − ut β ~ tn −( p +1) ~ σ (1 + u t ( X t X ) −1 u) We use this random variable as the basis for our interval.95 σ (1 + u t ( X t X ) −1 u) Cross-multiplying and re-arranging we get the probability statement 1 . σ 2 ( X t X ) −1 ) .Statistics 371 Sample Midterm Solution In Stat 371. age. σ 2 ) . 1. Derive a 95% prediction interval.95. educational qualifications. age.. Suppose we want to predict the response variate for a unit with values of the explanatory variate u t = (1. ~ Standardizing and replacing σ by σ . u1 . σ ) . We represent the corresponding data model by y = Xβ + r where y is the vector of observed values of the response variate.. different uses of this model in business contexts. σ 2 u t ( X t X ) −1 u) .. (4 marks) Give two distinct. For the model described above: a) (1 mark) What is the criterion used to produce the least squares estimates of the parameters β ? We minimize || r||2 =|| y − Xβ ||2 or we chose β so that r = y − Xβ is perpendicular to span(1. x p ) is a n × ( p + 1) matrix with columns giving the values of the explanatory variates and R is a vector of random variables with independent components Ri ~ G(0.. σ 2 ( X t X ) −1 ) and hence u t β ~ N (u t β . …) from sales of similar buildings • estimate parameters: estimate the volatility of a share price relative to an index using past closing prices • look for outliers: identify extreme salaries (response variate) after adjusting for explanatory variates such as experience .. we have ~ Y − ut β Pr( − c ≤ ~ ≤ c) = 0. we deal with applications and theory of the linear model Y = Xβ + R where X = (1 x1 . x p ) b) (5 marks) We know that the least squares estimate of β is β = ( X t X ) −1 X t y and the ~ corresponding estimator is β ~ N ( β .... . • prediction: predict the market value of a building using the selling price (response variate) and various explanatory variates (size. We are ~ predicting Y where Y ~ N (u t β . Hence we have Y − u t β ~ N (0. Be sure to explain the derivation. 2. σ 2 (1 + u t ( X t X ) −1 u))... x1 . ~ ~ We know that β ~ N ( β . Choosing c so that Pr(| tn −( p +1) | ≤ c) = 0.

data were collected from 91 rural school districts in a given year. p-value: < 2. estimate and estimators. Adjusted R-squared: 0.889.634 < 2e-16 *** phd 4910.164 < 2e-16 *** experience 171.2 on 5 and 85 DF.701 9.0956 4152.4 1059.3909 10.4598 6.1682 36.or over-paid.1024 0.8825 F-statistic: 136. or if some CEOs were highly under.7820 334. 1 where 1 indicates the presence of the degree.4497 15. Multiple R-Squared: 0.703 0.05 `.6700 4093.747 1. 1 and phd=0.2 -135. The purpose of the investigation was to determine if the salaries were relatively “equitable”. Error t value Pr(>|t|) (Intercept) 90717.' 0. Call: lm(formula = salary ~ experience + size + ma + phd + col) Residuals: Min 1Q Median 3Q Max -2969.484 --Signif. codes: 0 `***' 0. If phd=1.7543 474.2e-16 2 . u t β + cσ (1 + u t ( X t X ) −1 u)) Note: many marks were lost because of confusion among parameters. The R output from fitting a linear model to the data is given in the box below.95 We get the 95% confidence interval by replacing the estimators by the corresponding estimates. then the CEO has the equivalent to both degrees so ma is set to 1.6 Coefficients: Estimate Std.74e-09 *** ma 5228.7 -944.01 `*' 0. relative to the others.352 < 2e-16 *** col -2917.4134 4. (u t β − cσ (1 + u t ( X t X ) −1 u). In a compensation study of the chief executive officer salaries in one state.001 `**' 0. 3. MA or PhD cost of living(col): relative cost of living in the district salary: annual salary of CEO Note that education level is captured by two explanatory variates ma=0.91e-06 *** size 3. after adjusting for qualifications.~ ~ ~ ~ Pr(u t β − cσ (1 + u t ( X t X ) −1 u) ≤ Y ≤ u t β + cσ (1 + u t ( X t X ) −1 u)) = 0. The variates measured were: experience: number of years in the current or similar job size: number of students in the district education level: BA only.6 3299.1441 -0.1 ` ' 1 a) (1 mark) Carefully interpret the on 85 degrees4 of freedom Residual standard error: 1402 coefficient β corresponding to the explanatory variate phd.0329 22.

707 degrees of freedom so the mean square is = 4. (4 marks) To check the contribution of size and experience to the model. 2 To test the hypothesis that the coefficients of the squared terms are simultaneously 0. Is there any evidence that these quadratic terms are necessary? d) Residual standard error: 1377 on 83 degrees of freedom Multiple R-Squared: 0.707 with 8-6= 2 167.8955. all other explanatory variates held fixed.Since E(Y ) = β 0 + β 1experience + β 2 size + β 3 ma + β 4 phd + β 5 col . With the new model we have E(Y ) = β 0 + β 1experience + β 2size + β 3 ma + β 4 phd + β 5 col + β 14 experience * phd If phd=0.340 The change in the residual sum of squares is 167.378. the 4.707 The estimate of σ under the restricted model (without the squared terms) is 1402 so the residual sum off squares is 85 * (1402) 2 = 167.557 and the p-value is (1377)2 3 .076.340 − 157.076. b) (1 mark) Suppose we add a product term phd*experience with coefficient β 14 to the model.848. p-value: < 2.2e-16 The estimate of σ under the full model (including the squared terms) is 1377 so the residual sum off squares is 83 * (1377) 2 = 157. a new model with terms e2 and s2. there is no evidence that col effects salary.484. all other explanatory variates being held ficxed. E(Y ) = β 0 + β 1experience + β 2size + β 3 ma + β 4 + β 5 col + β 14 experience = β 0 + ( β 1 + β 14 )experience + β 2size + β 3 ma + β 4 + β 5 col Hence β 14 represents the change in the rate that changing experience effects average salary if a CEO has a PhD versus not having a PhD .816 .816 discrepancy measure is = 2. Carefully interpret this parameter.848. Good thing it was only one mark! c) (2 marks) The Pr(>|t|) for the variable col is 0.340 − 157. β 4 represents the average change in salary if a CEO gets a PhD. What does this tell us? The p-value for the hypothesis β 5 = 0 is large. the squares of size and experience was fit. Part of the R summary output is shown below.378.6 on 7 and 83 DF. Note : This was meant to be difficult and it proved to be so. That is.8867 F-statistic: 101.076. so there is no evidence against this hypothesis. E(Y ) = β 0 + β 1experience + β 2 size + β 3 ma + +β 5 col and if phd=1. Adjusted R-squared: 0.378.

10 . we can be confident that the assumption of gaussian residuals is reasonable. We divide the G(0. There is weak evidence against the hypothesis that the coefficients of the squared terms are 0 and hence weak evidence that they need to be included in the model. 4 . The y-coordinate is the smallest “center” of the first bin q1where Pr( Z ≤ q1 ) = 182 standardized residual in the set of 91. g) (2 marks) How can we detect cases with an outlier in the explanatory variates? For each case. Note that the full model has 83 degrees of freedom for estimating σ here. we look at the leverages hii . If the leverages are close to 1 or relatively large then the corresponding values of the explanatory variates are an outlier and are possibly influential in the fit of the model.05 < Pr( F2.1) into 91 bins each with probability 1/91.83 ≥ 2. The x-coordinate is the 1 . f) (1 mark) What does the qq plot tell us in this case? Since the points fall close to a straight line.557) < 0. Explain how to calculate the coordinates of the point in the lower left corner of the plot. the diagonal elements of the hat matrix H = X ( X t X ) −1 X t .0. e) (2 marks) A quantile-quantile (qq) plot of the standardized residuals is shown below.

Looking at the plot of the studentized residuals we see no very large values (i. the CEO salary after accounting for the explanatory variates. 5 .>2.h) (2 marks) A plot of the studentized residuals versus the case number is shown below. use the plot to provide a conclusion to the investigation.5) so it appears that the salaries are equitable. Assuming that the fit of the model is adequate.e. The purpose of the investigation was to identify outliers in the response variate.

we looked at applications and theory of the linear model Y = Xβ + R where X = (1 x1 . R ~ N (0. x p ) is a n × ( p + 1) matrix with columns giving the values of the explanatory variates for the n units in the sample and R is a vector of random variables with independent components Ri ~ G(0. the data are coded as follows: Name average weekly sales (in $1000) during the promotion period promotion 1 promotion 2 promotion 3 promotion 4 (control) past average sales (in $1000) Symbol y x1 = 1 if promotion 1 is used. your boss. Is this a good idea? Explain. 1. 2. Your company is investigating a special promotion with the goal of increasing sales. (4 marks) Show that the mean and variance-covariance matrix of the corresponding estimator β are β and σ 2 ( X t X ) −1 respectively. We represent the corresponding data model by y = Xβ + r where y is the vector of observed values of the response variate. σ 2 I ) (1) 4. 3.. x3 = 0 otherwise none x4 Consider the model (in vector notation) Y = β 01 + β1 x1 + β 2 x2 + β 3 x3 + β 4 x4 + R. x2 = 0 otherwise x3 = 1 if promotion 3 is used. show that the least squares estimate of β is β = ( X t X ) −1 X t y . (2 marks) Carefully interpret the coefficient β1 5. (3 marks) The estimator corresponding to the vector of estimated residuals is r = Y − X β . Find the distribution of the ith component ri You work in the marketing division of a large corporation that owns pizza franchises. an engineer gone wrong. 1 . x1 = 0 otherwise x2 = 1 if promotion 2 is used.Final Examination Spring 2004 Part I (35 marks) In the first part of the course. σ ) . The company assigns each of the versions at random to 20 franchises and measures the average weekly sales (over a four week period).. (5 marks) From first principles. suggests adding an extra term β c xc to the model (1) where xc = 1 for the franchises with the control and xc = 0 otherwise. (2 marks) To maintain symmetry. There are three versions of the promotion plus a control in which there is no change to current practice. For each franchise. The average weekly sales before the promotion is also recorded for each of the 80 franchises in the sample.

39377 0.9819 F-statistic: 1072 on 4 and 75 DF.2e-16 Is there any evidence of a difference among the three versions? 11.2e-16 6.9828.You fit the model (1) using R with the following summary output. Adjusted R-squared: 0.410 4. Adjusted R-squared: 0.68274 1.30198 0.44505 0. lm(formula = y ~ x1 + x2 + x3 + x4) Coefficients: (Intercept) x1 x2 x3 x4 Estimate -1. 9. p-value: < 2.925 64. 7.18265 2.9786. (4 marks) To test that the hypothesis that there is no difference among the three promotions.00347 0.709 Pr(>|t|) 0.31e-05 0.90509 Std. (1 mark) What does “Multiple R-Squared: 0.44554 0.18868 0. Briefly describe what each plot tells you about the fit.9828” tell you? (2 marks) What does “F-statistic: 1072 on 4 and 75 DF. What does this interval tell you? (3 marks) Explain in symbols and words (no numerical calculations needed ) how you could formally assess if there was a difference between promotion 2 and promotion 1? 10.549 on 77 degrees of freedom Multiple R-Squared: 0. p-value: < 2.2e-16” tell you? (3 marks) Find a 95% confidence interval for β 2 . R ~ N (0.44514 0.9781 F-statistic: 1763 on 2 and 77 DF. p-value: < 2. σ 2 I ) with summary output (in part): Residual standard error: 1.01399 t value -3.666 2.019 0. 2 . you fit the model Y = β 01 + β ( x1 + x2 + x3 ) + β 4 x4 + R.407 on 75 degrees of freedom Multiple R-Squared: 0. Error 0.07901 1. the following plots were prepared. (4 marks) To assess the fit of the original model (1).00455 < 2e-16 Residual standard error: 1. 8.

Plot 1(Estimated Residuals vs Fitted Values): Plot 2 (Normal Q_Q Plot of standardized residuals): Plot 3 (Leverage vs Case Number): Plot 4 (Studentized Residual vs Case Number): 12. how would you proceed with the information from the above plots? 3 . (2 marks) If the primary purpose of the study was to look for differences in the versions (as in Question 10).

We can n σ 2 ( y) where σ 2 ( y ) = show that E ( μ ) = μ . (3 marks) A key step in proving that E ( μ ) = μ for SRS was to determine Pr( I i = 1) . ∑ yi I i The corresponding estimator can be written μ = i∈U where U is the population N (frame) with size N and I i = 1 if unit i is in the sample and I i = 0 otherwise. Unfortunately. Var ( μ ) = (1 − ) N n ∑( y i∈U i − μ )2 N −1 . There was then pressure on the Government to help homeowners remove the UFFI.. we deal with the theory. 4.. We plan to sample nh units from stratum h using SRS independently for each stratum.. Find this probability and explain your reasoning. 3. (5 marks) Suppose we have a population with frame U = U1 ∪ . h = 1. Many homes used UFFI. 1. H are mutually exclusive strata of size N h .. the Federal Government of Canada established a grant program to help homeowners re-insulate their homes to reduce energy consumption.. a foam insulation that could be pumped into cavities as a liquid. some homeowners developed allergy symptoms that were attributed to formaldehyde (CH2O). the population proportion of units with y = 1 . ∪ U H of size N where the U h . Show that σ ( y ) is essentially determined by π . The foam then solidified without reducing its volume. applications and some extensions of simple random sampling (SRS) to learn about population averages. 2. a gas that could have been given off by UFFI. (4 marks) A key step in the derivation of Var ( μ ) for SRS was to determine the covariance of I i and I j . The ∑ yi ˆ = i∈s basic estimate of a population average μ is the sample average μ where yi is n the value of the response variate for the ith unit in the sample s and n is the sample size. How should we best allocate the sample to the various strata if the goal is to estimate the population average? In the early 1980’s. a survey was commissioned with the basic purposes to assess • • the average level of CH2O in homes with UFFI the proportion of individuals in these homes with allergy symptoms 4 . a very expensive proposition. The cost per unit of sampling from stratum h is ch and the total sampling cost must be limited to C.. (3 marks) Suppose that yi is a binary variate with values 0 and 1. Find this covariance.Part 2 (35 marks) In the second part of the course. To assess the magnitude of the problem.

981 homes that had been re-insulated without UFFI. the proportion of homes in which one or more persons experienced allergy symptoms was 0. (3 marks) An early press release about the allergy problems noted that “ the survey will look at a random sample of people living in homes with UFFI …” Is this statement technically correct? Explain. Strategy 1: Strategy 2: 5 . they had a frame of 124. How large a sample would have been required to reduce the length of this interval by half? 11. They decided to select a simple random sample of 500 homes from each frame and then • measure the concentration of CH2O in the air in each home in the sample. He suggested that the survey would be improved if the sample sizes was proportional to the frame sizes.and to compare these attributes to those in homes without UFFI. 10. the 95% confidence interval for π . A summary of part of the data collected is Attribute CH2O average concentration (ppb) CH2O standard deviation (ppb) UFFI homes 57.6 9.345 homes in which UFFI had been installed and another frame of 230. (2 marks) For the UFFI home sample.08 ± 0. 7. (3 marks) Find a 95% confidence interval for the difference in population average CH2O concentration in the UFFI and non-UFFI homes.023 .8 12. 5. (4 marks) Briefly discuss two different strategies that could have been used to increase the efficiency of the survey. (3 marks) A chemist involved in the planning of the survey noted that the frame of homes with UFFI was about half the size as that for homes without UFFI. (3 marks) Find a 95% confidence interval for the population average CH2O concentration in the UFFI homes. (2 marks) Explain how you could implement simple random sampling in this case. Is this correct? Explain. 9. 6.7 8. • administer a questionnaire to the homeowner to collect information about allergy symptoms and other demographics. Since the Government had awarded grants.4 non-UFFI homes 47.

Are you sure?

This action might not be possible to undo. Are you sure you want to continue?

We've moved you to where you read on your other device.

Get the full title to continue

Get the full title to continue reading from where you left off, or restart the preview.

scribd