
Notes on Some Aspects of Regression Analysis
Author(s): D. R. Cox
Source: Journal of the Royal Statistical Society. Series A (General), Vol. 131, No. 3 (1968), pp. 265-279
Published by: Blackwell Publishing for the Royal Statistical Society
Stable URL: http://www.jstor.org/stable/2343523

Regression Methods
Notes on Some Aspects of Regression Analysis
By D. R. Cox
Imperial College

[Read before the ROYAL STATISTICAL SOCIETY on Wednesday, March 20th, 1968, the President, Dr F. YATES, C.B.E., F.R.S., in the Chair]

SUMMARY
Miscellaneous comments are made on regression analysis under four broad headings: regression of a dependent variable on a single regressor variable; regression on many regressor variables; analysis of bivariate and multivariate populations; models with components of variation.

1. INTRODUCTION

THIS is an expository paper consisting not of new results but of miscellaneous and isolated comments on the theory of regression. The subject is a very broad one and the paper is in no sense comprehensive. In particular, the ideas of regression are the basis of much work in time series analysis and in multivariate analysis and these specialized subjects are barely mentioned; nor are experimental design and sampling theory problems associated with regression considered. Another serious limitation to the paper is the omission of relevant parts of the econometric literature.

Two general situations are distinguished and in their simplest forms are:
(i) a dependent variable Y has a distribution depending on a regressor variable x and it is required to assess this dependence;
(ii) there is a bivariate population of pairs (X, Y) and the joint distribution is to be analysed.
Capital letters are used for observations represented by random variables and lower-case letters for other observations.

Most theoretical discussion of regression starts from a quite tightly specified model in which some observations are regarded as corresponding to random variables with probability distributions depending in a given way on unknown parameters. Many of the difficulties of regression analysis, however, concern such questions as which observations should be treated as random variables, what are suitable families of models, and what is the practical interpretation of the conclusions.

The special computational and other problems associated with fitting non-linear models will not be considered explicitly, although much of the discussion applies as much to such problems as to the simpler linear models with which the paper is overtly concerned. Particular applications will not be discussed in detail but one or two typical but hypothetical situations will be outlined later for illustration. Throughout the discussion the measurement of uncertainty by significance tests and confidence limits is important but not paramount.

Inevitably the paper tends to emphasize difficulties likely to be encountered; of course awareness of potential difficulties is a good thing, but only as one facet of constructive scepticism.


The methodology of regression is described by Williams (1959) and by Draper and Smith (1966) and the more theoretical aspects by Plackett (1960), Kendall and Stuart (1967, Chapters 26-29) and Rao (1965).
2. REGRESSION ON A SINGLE REGRESSOR VARIABLE
Suppose that there are n pairs of observations $(x_1, Y_1), \ldots, (x_n, Y_n)$ and that for each value of the regressor variable x there is a population of values of the dependent variable from which the observed $Y_i$'s are randomly chosen. This is often taken for granted as the starting point for a theoretical discussion of regression; in (i) and (ii) which follow, some of its implications are discussed and then in (iii)-(vi) comments are made on some more advanced matters.

(i) Choice of dependent and regressor variables. In some situations the x values may be chosen deliberately by the experimenter and Y is a response dependent on x. The more difficult situation is when both types of observation can be regarded as random. Then we take as dependent variable:
(a) the "effect", the regressor variable being the explanatory variable; for a given value of the explanatory variable, we ask what is the distribution of possible responses;
(b) the variable to be predicted, the regressor variable being the variable on which the prediction is to be based. A full solution to the prediction problem is to give the conditional distribution of the variable to be predicted given all available information on the individual concerned.

Suppose that it is reasonable to consider an existing or hypothetical population of Y values for each x. Now whether or not the x's can be regarded as random may well affect the interpretation and application of the conclusions. For analysis of the regression coefficient in the model, however, we argue conditionally on the x values actually observed (Fisher, 1956, p. 156), provided that the x values by themselves would have given no information about the parameter of interest.

Example 1. A random sample of fibre segments of fixed lengths is taken from a homogeneous source and for each segment the mean diameter is measured accurately and the breaking load determined. Both observations can be regarded as random variables but in view of (a) it is reasonable to take breaking load, or rather its log, as dependent variable and log diameter as regressor variable. The same model would be appropriate if, for example, a number of fibres is selected randomly from each of a number of fixed diameter groups. Whether the regression is a fruitful thing to consider depends on its stability, i.e. on whether there is a reproducible relationship involved; see (iii).

Example 2. A more difficult case is illustrated by a calibration experiment in which for n individuals (a) a "slow" measurement and (b) a "quick" measurement are made of some property. Often (a) is a definitive determination, for example a measurement of fibre diameter, and (b) is the result of some indirect optical and much easier method. In future, only the "quick" measurement will be obtained and it is required to predict from this the corresponding "slow" measurement. If the n individuals are initially a random sample from the same population as the individuals for which future predictions are to be made, we take the "slow" measurement to be the dependent variable, since this is the one to be predicted. Suppose, however, that the n individuals were chosen systematically, for example to have "slow" values approximately evenly distributed over the range of interest.


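Usually it would be reasonable to regard the "quick" value as having a random component and, provided that a physically stable random system is involved and the relationship linear, to write
$$\text{"quick"} = \alpha \times \text{"slow"} + \beta + \text{"error"},$$
where the "error" does not depend on the "slow" value. Given a new "quick" measurement, we have to estimate a non-random variable by inverse estimation (Williams, 1959, pp. 91, 95). If both variables can in fact be regarded as random, the second approach is inefficient because it ignores information about the marginal distribution of the "slow" measurement.

(ii) The omitted variables. Suppose that z is a regressor variable that might have been included in the regression analysis but in fact is not, for example because no observations of it are available. What assumptions about z are made when we consider the regression of Y on x alone? Box (1966) has given an illuminating discussion of the dangers of omitting a relevant variable. The relationship ignoring z will be meaningful if:
(a) changes in z have no effect on Y; or
(b) in a randomized experiment in which x corresponds to a treatment, there may be unit-treatment additivity. Then the usual analysis will give an estimate of the effect of changing x, and an estimate of the standard error. The estimate refers to the difference between the response on one unit with certain values for x and for the omitted variable z and what would have been observed on that same unit with a different value of x and the same z; or
(c) z is a random variable, Z say, and the distributions of Z, given x, and of Y, given Z = z and x, are well defined. If x is a random variable X, this amounts to the requirement that (X, Y, Z) have a well-defined three-dimensional distribution. Then the regression of Y on x is well defined and includes a contribution associated with changes in z.

Example 3. Consider observational data for individuals with an ultimately fatal disease, Y being the log time to death and x some aspect of the treatment applied, called the dose, and that the regression of Y on x is analysed. The missing variable z is the initial severity of the disease. If, as might well be the case, z largely determines x, the regression of Y on x, although well defined under the circumstances of (c), would be of very limited usefulness. In particular it would not give for a particular individual an estimate of the effect on his Y of changing dose. This is an extreme example of a difficulty that applies to many regression studies based on observational data.

(iii) Stability of regression. While a fitted regression equation may often be useful simply as a concise summary of data, it is obviously desirable that the relation should be stable and reproducible. This is stressed by Ehrenberg (1968) and Nelder (1968); see also Tukey (1954). Stability might mean that when the experiment is repeated under different conditions:
(a) the same regression equation holds, even though other aspects of the data change; or
(b) parallel regression equations are obtained; or
(c) satisfactory regression lines are always obtained but with different positions and slopes.
In cases (b) and (c) the fitting of regression lines will be an important first step in the analysis. The second step will be to try to account for the variation in the parameters whose estimates do vary appreciably, possibly by a further regression analysis on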
Example Consider observational forindividuals an ultimately data with fatal the disease,Y being log time deathand x someaspect thetreatment to of applied, calledthedose,andthat regression Y on x is analysed. the of Themissing variable z is theinitial of severity thedisease.If,as might be thecase,z largely well determines of x, theregression Y on x, although defined well under circumstances (c), the of wouldbe ofvery limited usefulness. particular wouldnotgivefora particular In it individual estimate theeffect his Y of changing an of on dose. Thisis an extreme of example a difficulty applies many that to regression studies basedon observational data. (iii) Stability regression. Whilea fitted of regression equationmay often be useful as simply a concise of summary data,itis obviously desirable therelation that should stable reproducible. is stressed Ehrenberg be and This by (1968)and Nelder (1968); see also Tukey(1954). Stability might meanthatwhentheexperiment is repeated under different conditions: (a) thesameregression even equation other holds, of though aspects thedatachange; or (b) parallel are regression equations obtained; or (c) satisfactory lines regression are always obtained with but different positions and slopes. In cases(b) and(c) thefitting regression willbe an important step the of lines first in The analysis. second willbe to try account thevariation theparameters to for step in do whoseestimates varyappreciably, on possibly a further by regression analysis


regressor variables characterizing the different groups of observations and taking the initial regression coefficients as dependent variables. In testing the significance of the differences between the regression coefficients in different groups it will often be important to allow for correlations between the groups of data (Yates, 1939).

(iv) Choice of relation to be fitted. This choice will depend on preliminary plotting and inspection of the data and possibly on the outcome of earlier unsuccessful analyses. In addition, the model may take account of:
(a) conclusions from previous sets of data;
(b) theoretical analysis, including dimensional analysis, of the system;
(c) limiting behaviour.
Further, any given model can be parametrized in various ways and, in choosing a parametrization, the following considerations may be relevant:
(a)' individual parameters should have a physical interpretation, say in terms of components in a theoretical model or in terms of combinations of regressor variables of physical meaning;
(b)' individual parameters and estimates should have a descriptive interpretation, for example in terms of the average slope and curvature of response over the range considered;
(c)' interpretations, such as those of (a)' and (b)', should be insensitive to secondary departures from the model;
(d)' any instability between groups should be confined to as few parameters as possible;
(e)' the sampling errors of estimates of different parameters should not be highly correlated.
These requirements may be to some extent mutually conflicting.

There is not space here to discuss and exemplify all these points. As just one example, preliminary analysis of data of Example 1 might, if the x's have relatively little variation, suggest that linear regressions of breaking load on diameter and of log breaking load on log diameter would fit about equally well.
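The example just given can be checked numerically. What follows is a minimal sketch on simulated data, not taken from the paper: the sample size, the narrow diameter range, the constant 3.0, the error scale and the assumption that breaking load is proportional to the square of diameter are all chosen purely for illustration.

```python
import numpy as np

def fit_line(x, y):
    """Least-squares straight line of y on x; returns (R^2, slope)."""
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (intercept + slope * x)
    return 1.0 - resid.var() / y.var(), slope

rng = np.random.default_rng(0)
n = 2000
# Hypothetical fibres: diameters with relatively little variation.
diameter = rng.uniform(0.95, 1.05, size=n)
# Assumed model: load proportional to diameter**2 with multiplicative error.
load = 3.0 * diameter**2 * np.exp(rng.normal(0.0, 0.05, size=n))

r2_linear, slope_linear = fit_line(diameter, load)
r2_loglog, slope_loglog = fit_line(np.log(diameter), np.log(load))

print(f"load on diameter:         R^2 = {r2_linear:.3f}, slope = {slope_linear:.2f}")
print(f"log load on log diameter: R^2 = {r2_loglog:.3f}, slope = {slope_loglog:.2f}")
```

Under these assumptions the two squared correlations should be nearly equal, while only the log-log slope is a dimensionless power that can be compared directly with a theoretical exponent.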
The second of these, the regression of log breaking load on log diameter, would in general be preferable because, with respect to the above conditions:
(b) it permits easier comparison with the theoretical model breaking load $\propto (\text{diameter})^2$;
(c) it ensures that breaking load vanishes with diameter;
(a, b)' the regression coefficient, being a dimensionless power, is easier to think about than a coefficient having the dimensions of load/diameter.

(v) Goodness of fit. This can be examined in a number of ways:
(a) by a non-probabilistic graphical or tabular analysis, for example of residuals;
(b) by a significance test, using as a test statistic some aspect of the data thought to be a reasonable measure of departure from the model. Thus, the standardized third moment of the residuals could be used, if possible skewness is of interest;
(c) by the fitting of an extended model reducing to the given model for particular parameter values. The most familiar example is the inclusion of a new regressor variable, possibly a power of the first variable, or a product of variables, when multiple regression is being considered;
(d) by the fitting of a quite different model, seeing whether it fits better than the initial one.
Such examination of the adequacy of the model is important if models are to be refined and improved. Often, but not always, the primary aspect of the model will be the form of the regression equation and the adequacy of what may be secondary


assumptions about constancy of variance, normality of distribution, etc. will be of rather less importance.

Formal significance tests are valuable but, of course, need correct interpretation. A very significant lack of fit means that there is decisive evidence of systematic departures from the model; nevertheless the model may account for enough of the variation to be very valuable (Nelder, 1968). A non-significant test result means that in the respect tested the model is reasonably consistent with the data; nevertheless there may be other reasons for regarding the model as inadequate.

Of (a)-(d) the least standard is (d). It is closely related to the problem of choosing between alternative regression equations (Williams, 1959, Chapter 5), for example between the regression of Y on $x_1$ alone and that of Y on $x_2$ alone. The more usual procedure in such a case will, however, be to fit both variables to cover the possibility that the joint regression is appreciably better than either separate one. As a rather different example, suppose that normal theory linear regressions of ($\alpha$) Y on x, ($\beta$) Y on log x, ($\gamma$) log Y on log x are considered. The goodness of fit of ($\alpha$) and ($\beta$) can be compared descriptively by the residual sums of squares, but to compare say ($\alpha$) with ($\gamma$) the residual sum of squares cannot be used directly. The most usual procedure is probably then to compare squared correlation coefficients, but for comparison of the full models it is probably better to compare the maximized log likelihoods of $Y_1, \ldots, Y_n$ under the two models. Cox (1961, 1962) has discussed the construction of significance tests in such situations.

An alternative and in many ways preferable approach is to consider a comprehensive model containing ($\alpha$), ($\beta$), ($\gamma$) as special cases. For example the normal theory linear regression of
$$\frac{Y^{\lambda_2} - 1}{\lambda_2} \quad \text{on} \quad \frac{x^{\lambda_1} - 1}{\lambda_1}$$
could be taken and all parameters, including $(\lambda_1, \lambda_2)$, estimated and tested by maximum likelihood (Box and Tidwell, 1962; Box and Cox, 1964). This is computationally formidable if there are several regressor variables.

(vi) More complex dependence on the regressor variable. In most regression analyses it is assumed that the dependence on the regressor variable x is confined to changes in the conditional mean of Y. Transformation of Y may be necessary to achieve this; if different transformations are required to linearize the regression of the mean and to stabilize the variance the first will usually have preference, simply because primary interest will usually lie in the mean. To study, for example, changes in the conditional variance of Y we can:
(a) plot residuals;
(b) group on x, calculate variances within groups, if necessary applying an adjustment for changes of mean within groups, and then consider the regression of log variance on x;
(c) fit, for example by maximum likelihood, a model in which parameters are added to account for changes in variance. For example, the variance might be taken to be $\sigma^2 \exp\{\gamma(x - \bar{x})\}$.
It would nearly always be right to precede any such fitting by (a) or (b) in order to get some idea of an appropriate model and of whether the more complex fitting is likely to be fruitful. Similar remarks apply to the study of changes of distributional shape.

If the regression of the mean is linear but there are substantial changes in variance a weighted analysis will often be required, although the changes in variance have to be quite substantial before there is appreciable gain in precision in estimating the


regression coefficient. Of course changes in variance may be of intrinsic interest, or need separate study in order to specify how the precision of prediction depends on x.

3. REGRESSION ON SEVERAL REGRESSOR VARIABLES
Suppose now that for each individual several regressor variables are available, i.e. that for the ith individual we observe $(Y_i, x_{i1}, \ldots, x_{ip})$. We consider mainly the case where $x_{i1}, \ldots, x_{ip}$ are physically distinct measurements rather than, for example, powers of a single x. Virtually all the discussion of Section 2 is relevant, but there are new points mostly connected with the choice of the regressor variables and with the interpretation in situations in which there is appreciable non-orthogonality among the regressor variables.

There are two extreme situations to consider. In the first the number of regressor variables is quite small, say not more than three or four. It is then perfectly feasible both to fit the $2^p$ possible regression equations and to examine them individually. Those regressor variables the nature of whose effects is clearly established can be isolated and ambiguities arising from non-orthogonality of other variables listed and, as far as possible, interpreted. Also further regression equations involving, say, squares and cross-products of some of the original regressor variables can be fitted, if required. In the second case the number p of regressor variables is larger. It may still be computationally feasible to fit all $2^p$ equations, but unless all pairs of regressor variables are nearly orthogonal, the interpretation is likely to be difficult and, at the least, some further techniques are required for handling the information from the fits. In many applications of this type there is a reasonable hope that only a fairly small number of regressor variables have important effects over the region studied. The broad distinction between these two cases should be borne in mind in the following discussion.

(i) Interpretation and objectives. Lindley (1968) has emphasized that the choice between alternative equations depends on the purpose of the analysis and has discussed two cases in detail from a decision-theoretic viewpoint, one a prediction problem and one a control problem. His results show very explicitly the consequences of strong assumptions about the problem and are likely to be useful guidance in other cases too. The following remarks refer to cases where less explicit assumptions are possible about the nature of the problem and of the objectives of the analysis.

Suppose first that the objective is to predict Y for future individuals in the region of x-space covered by the data. In particular, the x's may be random variables and the new individuals be drawn from the same population. Then any regression equation that fits the data adequately will be about equally effective on the average over a series of x-values. If, however, it is thought that not all regressor variables contribute, there is likely to be a gain from excluding regressor variables with an insignificant effect. Note that a Bayesian analysis of this situation suggests reducing the contribution of, rather than eliminating, such variables and this is sensible also from a sampling theory viewpoint. So long as p is small compared with n, it is not likely to make a major difference which of these various possibilities is taken. The algorithm of Beale et al. (1967) for selecting the "best" equation with a specified number of regressor variables and the various automatic stepwise procedures described by Draper and Smith (1966, Chapter 6) will be relevant.

Suppose next that the prediction is to be made for an individual in a new region of x-space. Things are now different. For example, suppose that $x_1$ and $x_2$ are almost


linearly related in the initial sample of observations and that the partial regression coefficients are insignificant, the combined regression being very highly significant. It is thus known that at least one of $x_1$ and $x_2$ has an important contribution, but there will be many regression equations fitting the data about equally well. Under the circumstances of the previous paragraph this is immaterial, but if prediction of Y is attempted for $(x_1, x_2)$ far from the original linear relation, extremely different results will be obtained from the different fits. In such cases the possibilities are:
(a) to postpone setting up a prediction equation until better data are available for estimation;
(b) to use external information to decide which is the "really" appropriate equation;
(c) to use the formal variance of prediction from the full equation as a means of detecting individuals for which prediction from any regression equation is hazardous.

The next, and in many ways the most important, case is where we hope that there is a unique dependence of Y on some, or all, of the regressor variables that will remain stable over a range of conditions and we wish to estimate this relation and in particular to identify the regressor variables that occur in it. In a randomized experiment it may ideally be possible to estimate the contrasts of primary interest separately and efficiently, and to show that they do not interact with external factors and account for most of the variability. Even here there are difficulties, particularly if the response surface is relatively complicated. For observational data, there are two major difficulties:
(a) the possibility of important omitted variables (see Section 2, point ii);
(b) ambiguities arising from appreciable non-orthogonality of regressor variables.
There is below some discussion of devices that can be used to try to overcome (b).

In the situation contemplated in the previous paragraph, the objective is essentially the same as that in a randomized experiment. A more limited objective is to analyse preliminary data in order to suggest which factors would be worth including in a subsequent experiment and to suggest appropriate spacing for the levels. It would be interesting to examine the performance of some simple strategies, even though there will always be further information to be taken into account.

(ii) Aids to interpretation. In some cases the main interest may lie in the regression on $x_1$, the variable $x_2$ being included say as characterizing different groups of observations, or some potentially important aspect of secondary interest in the particular investigation. If $x_2$ can conveniently be grouped, it will often be good to fit separate regressions on $x_1$ within each $x_2$ group and then to relate the estimated parameters to $x_2$. This leads to an analysis of the stability of the regression equation and possibly to the construction of models containing interaction. More generally $x_1$ and $x_2$ may be sets of regressor variables.

If there is a property x that is thought not to have an effect on Y, it will often be good to include x as a regressor variable. Significant regression on x would then be a warning, for example of an important omitted variable.

The next set of remarks refer to ambiguities arising from non-orthogonality and all depend upon introducing further information in some form.
(a) It may be thought that the regression coefficient on say $x_1$ should be non-negative. In some cases this may resolve an apparent ambiguity. For instance, suppose that $x_1$ and $x_2$ are closely positively related, that the combined regression is large, but that the partial regressions are insignificant, that on $x_1$ being negative. Incidentally the attitude to such special assumptions as that about the sign of a regression coefficient needs comment. That taken here is that any such assumption should, so far as possible,


be tested on the data and, if consistent with the data, its consequences should be analysed and compared with the conclusions without the assumption. It might be argued from a Bayesian viewpoint that a prior probability should be attached to the assumption and a single conclusion obtained, but, even apart from the difficulty of doing this quantitatively in a meaningful way, it seems likely that the more cautious approach will be more informative.
(b) There may be sets of regressor variables which are to a large extent physically equivalent. For example, in a textile experiment yarn strength can be measured by several different methods. Quite often the measurements may be expected to be highly correlated and equivalent as regressor variables, although the data may show this expectation to be false. In applications like this it will be natural to try to use throughout one regressor variable, possibly a simple combination of the separate variables, provided that this does not give an appreciably worse fit than full fitting.
(c) Kendall (1957, p. 75) suggested applying principal component analysis to the regressor variables and then taking new regressor variables specified by the first few principal components. Jeffers (1967) and Spurrell (1963) have given interesting applications. A difficulty seems to be that there is no logical reason why the dependent variable should not be closely tied to the least important principal component. The following modification is worth considering. The principal components may suggest simple combinations of regressor variables with physical meaning. These simple combinations, not the principal components, can be used as regressor variables and if a good fit is obtained a constructive, although not necessarily unique, simplification has emerged. If the regressor variables can be divided into meaningful sets, e.g. into physical measurements and chemical measurements, separate principal component analyses could be considered for the two sets.
(d) In some situations, especially in the physical sciences, the method of dimensional analysis may lead to a reduction in the effective number of regressor variables.
(e) Another general way of clarifying the regression relation when some of the regressor variables are random variables is to examine plausible special models for the interrelationships between all the variables. There are two rather different cases. If the additional assumptions cannot be tested from the data, then parameters not previously estimable may become so, and those previously estimable may have the precision of estimation increased. On the other hand, if the additional assumptions can be tested, then the gain is confined to improved precision.
Sewall Wright's method of path coefficients is essentially a device for handling complex systems of interrelations. For discussion of path coefficients in general not specifically genetic terms, see Tukey (1954), Turner and Stevens (1959), Turner et al. (1961) and, particularly for the connection with multiple regression, Kempthorne (1957, Chapter 14). The most familiar example of the second type of situation is the use of a concomitant variable to increase the precision of treatment contrasts in controlled experiments. When the concomitant variable is measured before the treatments are applied, the special model is justified by the randomization of treatments. Another simple example is the use of an intermediate variable (Cox, 1960). Here the regression of Y on $X_1$ is of interest and the supposition is that $X_2$ is a further variable such that, given $X_2 = x_2$, Y is independent of $X_1$. Then, under some circumstances, observation of $X_2$ can lead to appreciable increase in the precision of the estimated regression of Y on $X_1$. In other applications, analysis of covariance is used to see whether the data are in accord with the hypothesis that Y is affected by $X_1$ only via $X_2$.
(f) A very special case is when the regressor variables can be arranged in order of priority. The main cases are the fitting of polynomials and of Fourier series.
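The difficulty raised under (c), that the dependent variable may be closely tied to the least important principal component, can be illustrated numerically. The following is a sketch on simulated data only: the two regressors, their near-collinearity, the coefficient 5.0 and the error scale are assumptions made for illustration and are not from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Two hypothetical, highly correlated regressors: a common factor plus a
# small independent difference.
common = rng.normal(0.0, 1.0, size=n)
diff = rng.normal(0.0, 0.1, size=n)
X = np.column_stack([common + diff, common - diff])

# Assumption for illustration: Y depends only on x1 - x2, i.e. on the
# direction of least variation among the regressors.
y = 5.0 * (X[:, 0] - X[:, 1]) + rng.normal(0.0, 0.2, size=n)

# Principal components of the centred regressors via the SVD.
Xc = X - X.mean(axis=0)
_, singular_values, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T                      # column k holds the k-th component
explained = singular_values**2 / np.sum(singular_values**2)

def r_squared(z, y):
    """R^2 from the least-squares regression of y on a single variable z."""
    return np.corrcoef(z, y)[0, 1] ** 2

r2_first = r_squared(scores[:, 0], y)   # regression on the leading component
r2_last = r_squared(scores[:, 1], y)    # regression on the least important one

print(f"share of regressor variance, first component: {explained[0]:.3f}")
print(f"R^2 on first component: {r2_first:.3f}; on last component: {r2_last:.3f}")
```

In this constructed case the leading component carries almost all the variation of the regressors yet explains almost none of Y, so retaining only the first few components would discard the relevant combination. The simple combination $x_1 - x_2$ suggested by the last component is the kind of physically interpretable regressor that the modification in (c) would use instead.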

(iii) Analysis of a set of fitted regressions. For problems in which many alternative equations, for example all $2^p$ linear regressions, are fitted to the same data, the problem of handling the resulting information needs comment. In a prediction problem in which the predictions are to be made over a set of x values distributed in much the same way as the data, an average variance of prediction, or better the corresponding standard deviation, will often be a reasonable measure of adequacy; of course, in some applications there may be particular points in x-space at which prediction is required. Those equations significantly worse than the overall fit can be identified in some way. Note that an equation significantly in conflict with the data may be used, for example because it involves substantial economy in the number of variables to be measured. This would be reasonable if the standard deviation of prediction is thought satisfactory, but the use of this equation in a new region of x-space is likely to be especially hazardous.

Where we are looking for a (hopefully) unique relation, the first step will often be to list all equations of a particular type that are not significantly contradicted by the data, as a preliminary to trying to narrow down the choice by some of the arguments sketched in (ii). Automatic devices for selecting equations would be used with great caution if at all. Gorman and Toman (1966) and Hocking and Leslie (1967) have discussed some of the methods and in particular have outlined some further recent unpublished work of Dr C. L. Mallows. A different approach is taken by Newton and Spurrell (1967a, b) who introduce quantities called elements to summarize the set of all $2^p$ regression sums of squares.

Particular caution is necessary in examining the effect of regressor variables which vary much less in the data than would be expected in future applications. The standard errors of the regression coefficients will be high and there is an obvious danger in judging the potential importance of such variables solely from the statistical significance of their regression coefficients.

4. BIVARIATE POPULATIONS
Consider now situations in which the observations are pairs $(X_1, Y_1), \ldots, (X_n, Y_n)$ drawn from a bivariate population and in which there is no particular reason for studying the dependence of Y on X rather than that of X on Y. The example discussed by Ehrenberg (1968) concerning the heights and weights of schoolchildren is an instance. Either or both regressions could legitimately be considered, but the question is whether it is fruitful to do so.

With one homogeneous set of data the concise description of the joint distribution is all that can be attempted, in the absence of a more specific objective. This may be done by a frequency table or by an estimate of the joint cumulative distribution function, or some parametric bivariate distribution can be fitted. While there has been discussion of special families of bivariate distributions other than the bivariate normal (Plackett, 1965; Moran, 1967) the bivariate normal distribution is nevertheless the one most likely to arise. Preliminary transformation may be desirable and one possibility is to consider transformations from (x, y) to


and to estimate $(\lambda_1, \lambda_2)$ by maximum likelihood (Box and Cox, 1964), assuming that on the transformed scale a bivariate normal distribution does apply. In some applications it may be reasonable to take $\lambda_1 = \lambda_2$.

If a bivariate normal distribution is fitted, estimates of five parameters are required and these might, for example, be the means $(\mu_x, \mu_y)$, the variances $(\sigma_x^2, \sigma_y^2)$ and the correlation coefficient $\rho$; see, however, Section 2, point (iv) for remarks on parametrization.

When there are k populations the problem will be to describe the set of populations in a concise way. There are many possibilities. Often separate descriptions will be attempted of (a) the means $(\mu_{xi}, \mu_{yi})$ $(i = 1, \ldots, k)$ and of (b) the parameters determining the covariance matrices. For (a) such questions will arise as whether the means lie on or around a line or curve and whether their position can be linked with some other variable characterizing the populations. Ehrenberg's (1963) criticisms of regression applied to bivariate populations are partly directed at confusions of comparisons between populations with those within populations.

If the covariance matrices are not constant, it will be natural to look for aspects that are constant and these might include one or other regression coefficient, the ratio of the standard deviations, the correlation coefficient, etc. Any changes in covariance matrix may be linked with changes in mean.

Of course once a potentially reasonable representation is obtained, standard techniques, especially maximum likelihood, are available for fitting and for constructing significance tests. In many cases, however, the most challenging problem will be to discover the most fruitful concise representation among the many possibilities.

All the remarks of this section apply in principle to p variate problems.
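The maximum likelihood estimation of $(\lambda_1, \lambda_2)$ described in this section can be sketched numerically. What follows is only an illustration under stated assumptions, not the procedure of any cited paper: the data are assumed strictly positive, a coarse grid search stands in for a proper optimizer, and the function names (`box_cox`, `profile_loglik`, `fit_bivariate_box_cox`) and the simulated data are hypothetical. The profile log-likelihood is obtained by maximizing over the means and covariance matrix of the bivariate normal distribution on the transformed scale and adding the log Jacobian of the transformation.

```python
import numpy as np

def box_cox(v, lam):
    """Power transform (v**lam - 1)/lam, with the log limit near lam = 0."""
    v = np.asarray(v, dtype=float)
    if abs(lam) < 1e-8:
        return np.log(v)
    return np.expm1(lam * np.log(v)) / lam

def profile_loglik(x, y, lam1, lam2):
    """Profile log-likelihood of (lam1, lam2), up to an additive constant,
    assuming bivariate normality on the transformed scale."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    z = np.column_stack([box_cox(x, lam1), box_cox(y, lam2)])
    s = np.cov(z, rowvar=False, bias=True)        # ML covariance estimate
    _, logdet = np.linalg.slogdet(s)
    log_jacobian = (lam1 - 1.0) * np.log(x).sum() + (lam2 - 1.0) * np.log(y).sum()
    return -0.5 * n * logdet + log_jacobian

def fit_bivariate_box_cox(x, y, grid):
    """Grid search for the pair (lam1, lam2) with the largest profile likelihood."""
    best = max((profile_loglik(x, y, a, b), a, b) for a in grid for b in grid)
    return best[1], best[2]

# Hypothetical data: bivariate log-normal, so the "true" powers are (0, 0).
rng = np.random.default_rng(0)
logs = rng.multivariate_normal([1.0, 2.0], [[0.64, 0.4], [0.4, 0.64]], size=400)
x_obs, y_obs = np.exp(logs[:, 0]), np.exp(logs[:, 1])
lam1_hat, lam2_hat = fit_bivariate_box_cox(x_obs, y_obs, np.linspace(-1, 1, 21))
print(f"estimated powers: lambda1 = {lam1_hat:.1f}, lambda2 = {lam2_hat:.1f}")
```

Restricting the search to the diagonal of the grid would give the constrained fit with $\lambda_1 = \lambda_2$, and comparing the two maximized values is one informal way of judging whether that restriction is reasonable.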
5. MODELS WITH COMPONENTS OF VARIATION

In the mathematical theory of regression most of the awkward problems are probably those in which the observations are split into components not directly observable, and the relationships between these components are to be explored. There is a very extensive theoretical literature on such situations; see, in particular, Lindley (1947), Tukey (1951), Madansky (1959), Sprent (1966), Fisk (1967), Kendall and Stuart (1967) and Nelder (1968). In this section a few comments will be made on such systems, particularly on points which connect with the previous discussion.

The simplest situation is where only the dependent variable is split into components, a hypothetical true value and a measurement or sampling error. The main question, then easily answered, is to assess how much of the observed dispersion of Y about its regression on x is accounted for by the measurement or sampling error. For example Y might be the square root of a Poisson distributed variable, when the sampling error has variance nearly 1/4. One would, in particular, want to know whether this accounted for all the random variation present. More difficult problems would arise if it were required to estimate the distributional form of the "hidden" component of random variation.

The more interesting cases are where both dependent and regressor variables can be split into components:

$$X_i = \Phi_i + \varepsilon_i, \qquad Y_i = \Psi_i + \eta_i, \qquad \Psi_i = \alpha + \beta\Phi_i + \zeta_i.$$

Here $\varepsilon_i$, $\eta_i$ are measurement or sampling errors of zero mean and $\zeta_i$ is a deviation from the regression line, again of zero mean. The simplest case is where $\Phi_i$, the "true" value of the regressor variable, is a random variable. Random variables for different i are assumed independent and the triple $(\varepsilon_i, \eta_i, \zeta_i)$ is assumed independent of $\Phi_i$. Various cases may arise for the covariance matrix of the triple, the simplest being that the three components are mutually independent. Fisk (1967) and Nelder (1968) have considered models in which the regression coefficient $\beta$ is a random variable.

Sometimes it is convenient to write $\beta_{\Psi\Phi}$ instead of $\beta$, to distinguish it from $\beta_{YX}$, the population least squares regression of Y on X. In fact

$$\beta_{\Psi\Phi} = \beta_{YX}\,\frac{\operatorname{var}(X)}{\operatorname{var}(X) - \operatorname{var}(\varepsilon)}.$$

If prediction of Y directly from X is the objective, $\beta_{YX}$, not $\beta_{\Psi\Phi}$, is required, so long as X is a random variable; if, however, the future X's at which prediction is to be attempted are not random, or come from a different distribution, the presence of the components $\varepsilon$ does need consideration.

Much published discussion concentrates on the estimation of $\beta$ and in particular on the circumstances under which it is consistently estimable; for some purposes it is enough to note that $\beta$ is between $\beta_{YX}$ and $1/\beta_{XY}$ (Moran, 1956). The simplest case is when $\operatorname{var}(\varepsilon)$ can be estimated from separate data, for instance from within replicate variation, or theoretically. Quite often the correction factor $\beta_{\Psi\Phi}/\beta_{YX}$ is very near one.

Some further problems arise naturally and some, but not all, can be answered in a fairly direct way. In most cases separate estimates of at least part of the covariance matrix of $(\varepsilon, \eta)$ are required. If more than the minimum amount of information is available a more searching test of the model is possible. The following illustrate the further problems:

(i) Estimate the three components of variance of Y, namely $\beta^2 \operatorname{var}(\Phi)$, $\operatorname{var}(\zeta)$ and $\operatorname{var}(\eta)$.

(ii) In particular, are the data consistent with all the $\zeta$'s being zero?

(iii) Is a discrepancy between the estimated regression coefficient of Y on X and a theoretical value explicable in terms of "errors" in the regressor variable?

(iv) Are apparent differences between groups in the regression coefficients of Y on X explicable in terms of "errors" in the regressor variable?

(v) In the context of Section 4, $(X_i, Y_i)$ may refer to the sample means of the ith group, $(\Phi_i, \Psi_i)$ being the corresponding population means. The covariance matrix of $(\varepsilon, \eta)$ can be estimated: what can be said about the relation between $\Psi_i$ and $\Phi_i$?

(vi) How much more effectively could Y be predicted from X if X were measured more precisely, for example by additional replication?

(vii) Is non-linearity in the regression of Y on X explicable by errors in the regressor variable? (In non-normal cases, if $\Psi$ has linear regression on $\Phi$, Y will not in general have linear regression on X.)

When there is more than one regressor variable similar problems arise. The important techniques based on instrumental variables will not be considered here; see, however, Section 3, point (ii).
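The relation between $\beta_{\Psi\Phi}$ and $\beta_{YX}$ in this section can be checked on simulated data. The sketch below is illustrative only: it assumes the simplest case, in which the components are mutually independent and $\operatorname{var}(\varepsilon)$ is known from separate data, and the function name `corrected_slope`, the parameter `var_eps` and all numerical values are hypothetical choices made for the example.

```python
import numpy as np

def corrected_slope(x, y, var_eps):
    """Return (beta_yx, beta_corrected).

    beta_yx is the least squares slope of y on x; beta_corrected multiplies it
    by var(X) / (var(X) - var(eps)). var_eps is the variance of the error in x,
    assumed known from separate data (for instance within-replicate variation).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    var_x = x.var()                                  # divisor n, to match bias=True
    if var_x <= var_eps:
        raise ValueError("var(X) must exceed var(eps) for the correction to apply")
    beta_yx = np.cov(x, y, bias=True)[0, 1] / var_x
    return beta_yx, beta_yx * var_x / (var_x - var_eps)

# Hypothetical simulation: alpha = 1, beta = 2, var(Phi) = 4, var(eps) = 1.
rng = np.random.default_rng(0)
n = 5000
phi = rng.normal(0.0, 2.0, n)                 # "true" regressor values
x_obs = phi + rng.normal(0.0, 1.0, n)         # X = Phi + eps
psi = 1.0 + 2.0 * phi + rng.normal(0.0, 0.5, n)   # Psi = alpha + beta*Phi + zeta
y_obs = psi + rng.normal(0.0, 0.5, n)         # Y = Psi + eta
naive, corrected = corrected_slope(x_obs, y_obs, var_eps=1.0)
print(f"least squares slope {naive:.3f}, corrected slope {corrected:.3f}")
```

In this configuration the population value of $\beta_{YX}$ is $2 \times 4/5 = 1.6$, so the correction factor is 1.25; with a more precisely measured regressor the factor would be much nearer one, as the text notes is often the case.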
6. MISCELLANEOUS POINTS

This final section deals with a number of miscellaneous topics not discussed earlier.


(i) Graphical methods. These are very important both for the direct plotting of scatter diagrams of pairs of variables, possibly distinguishing other variables by a coarse grouping, but also for the systematic plotting of residuals; see, for example, Anscombe (1961). Particularly with extensive data, the systematic plotting of residuals is likely to be the most searching way of testing and improving models. It is possible that developments in computer display devices will lead to valuable ways of inspecting relationships involving more than two variables.

(ii) Outliers and robust estimation. The screening of data for suspect observations will often be required. With limited data it will be usual to look at suspect values individually in order to decide whether to include them in any subsequent analysis; often analyses with and without suspect values will be needed. With p observations for each individual the best way of looking for outliers will depend on the type of effect expected. Thus

(a) if any extreme deviation is thought to be in a particular known variable, usually the dependent variable, its residuals from regression on the other variables should be examined. For further discussion, see Mickey et al. (1967);

(b) suppose that any extreme deviation is thought to be confined to one variable, but not necessarily the same variable for different individuals. This might be the case, for example, with occasional gross recording errors. One procedure is then to calculate p residuals for each individual, one for each variable regressed on all the others;

(c) if any individual may be subject to extreme deviations in one or more variables simultaneously, and the joint distribution is approximately p variate normal, it may be reasonable to calculate for the ith individual, with vector observation $Y_i$, the standardized squared distance from the mean $\bar{Y}$, given by $D_i = (Y_i - \bar{Y})' S^{-1} (Y_i - \bar{Y})$, where S is the estimated covariance matrix. Then the ordered $D_i$'s can be plotted against the expected order statistics for samples from the chi-squared distribution with p degrees of freedom. Iteration of the procedure may be desirable. Wilk and Gnanadesikan (1964) have given a general discussion of graphical methods for multiresponse experiments.

With extensive data, however, it may be necessary to use methods of analysis that are insensitive to outliers, so-called methods for robust estimation; see, for example, Huber (1964).

(iii) Missing values. Afifi and Elashoff (1966, 1967) have reviewed the literature on missing values in multivariate data and have considered in some detail point estimation in simple linear regression. Univariate missing value theory concentrates on the computational aspects of exploiting the near-balance of a balanced design spoiled by a missing observation, but no information is contributed by the missing observations. In a multivariate case, however, information may be contributed by individuals for which some component observations are missing. In a multiple regression problem, there is usually no information from individuals in which a regressor variable is missing, unless that variable can be regarded as random. An exception is when there is an individual with, say, $x_1$ missing and analysis of the other individuals suggests the omission of $x_1$ from the regression equation. Suppose, however, that a regressor variable is random, and that the individuals with a variable missing can be regarded as selected randomly, a quite severe assumption, which should be tested where possible. Then more can be done. In some applications nearly all individuals may have at least one missing component and then use of some missing value theory is essential. Roughly speaking, the covariance between any two random variables can be estimated from those individuals on which both variables are available; there seems scope for further work to settle just when it is wise to do this and when something more elaborate such as full maximum likelihood estimation is desirable.

(iv) Non-normal variation. The present paper is largely concerned with problems to which least squares methods are reasonably applicable, possibly after transformation. In regression-like problems in which particular non-normal distributions can be specified, we usually have to apply maximum likelihood methods. These are locally equivalent to least squares techniques and therefore a great deal of the above discussion, for example that on the choice of regressor variables, is immediately relevant. Anscombe (1967) considered in some detail the analysis of a linear model with non-normal distribution of error; Cox and Hinkley (1968) found the asymptotic efficiency of least squares estimates in such situations.

The justification of maximum likelihood methods is asymptotic but sometimes analogues of at least a few of the "exact" properties of normal-theory linear models can be obtained. The simplest case is when the ith observation on the dependent variable has a distribution in the exponential family (Lehmann, 1959, p. 50) $\exp\{A(y)B(\theta_i) + C(y) + D(\theta_i)\}$, where $\theta_i$ is a single parameter and there is a linear model
$$B(\theta_i) = \sum_r x_{ir}\beta_r,$$

where the $\beta$'s are unknown parameters and the x's known constants. Special cases are the binomial, Poisson and gamma distributions when the "linear" model applies to the logit transform, to the log of the Poisson mean and to the reciprocal of the mean of the gamma distribution. Sufficient statistics are obtained and in very fortunate cases useful "exact" significance tests for single regression coefficients emerge.

(v) Experimental and observational data. Many of the issues discussed in the paper apply less acutely to the analysis of controlled experiments than to the analysis of observational data and that is why the paper may seem overweighted towards the latter type of problem. In fact, in terms of the discussion in this paper, there are three rather different reasons why fewer difficulties arise in the analysis of experimental data, quite apart from the smaller random error to which such data are likely to be subject. These reasons are:

(1) the spacing of regressor variables is likely to be more suitable;

(2) substantial non-orthogonalities of estimation will be avoided;

(3) factors omitted from the treatments will be randomized and hence the worst difficulties associated with omitted variables (Section 2, point (ii)) will be avoided.
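Returning to point (iv), the remark that maximum likelihood is locally equivalent to least squares can be made concrete for the Poisson special case, where the linear model applies to the log of the mean. The sketch below is an illustration under assumptions rather than a method taken from the paper: it fits the model by iteratively reweighted least squares, a standard way of solving the likelihood equations, and the function name `fit_poisson_loglinear`, the starting value and the simulated data are all hypothetical. At the solution the fitted means reproduce the sufficient statistics $\sum_i x_{ir} y_i$.

```python
import numpy as np

def fit_poisson_loglinear(x, y, n_iter=25, tol=1e-10):
    """ML fit of log E(Y_i) = sum_r x_ir * beta_r for Poisson counts,
    computed as a sequence of weighted least squares fits."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Rough starting value: ordinary least squares on log(y + 0.5).
    beta = np.linalg.lstsq(x, np.log(y + 0.5), rcond=None)[0]
    for _ in range(n_iter):
        eta = x @ beta                     # current linear predictor
        mu = np.exp(eta)                   # current fitted means
        z = eta + (y - mu) / mu            # working dependent variable
        xtw = x.T * mu                     # weights equal to the fitted means
        new_beta = np.linalg.solve(xtw @ x, xtw @ z)
        converged = np.max(np.abs(new_beta - beta)) < tol
        beta = new_beta
        if converged:
            break
    return beta

# Hypothetical data: intercept 0.5 and slope 0.3 on the log scale.
rng = np.random.default_rng(1)
n = 2000
design = np.column_stack([np.ones(n), rng.normal(size=n)])
counts = rng.poisson(np.exp(design @ np.array([0.5, 0.3])))
beta_hat = fit_poisson_loglinear(design, counts)
# Likelihood equations: X'y equals X'mu_hat at the maximum.
score = design.T @ (counts - np.exp(design @ beta_hat))
print("estimates:", np.round(beta_hat, 3), "largest score component:", np.abs(score).max())
```

Each iteration is an ordinary weighted regression of the working variable on the x's, which is why the earlier discussion of least squares, for example on the choice of regressor variables, carries over.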
ACKNOWLEDGEMENT

I am grateful to Mrs E. J. Snell and to the referees for constructive comments.


REFERENCES

AFIFI, A. A. and ELASHOFF, R. M. (1966). Missing observations in multivariate statistics. I. Review of the literature. J. Am. Statist. Ass., 61, 595-604.
- (1967). Missing observations in multivariate statistics. II. Point estimation in simple linear regression. J. Am. Statist. Ass., 62, 10-29.
ANSCOMBE, F. J. (1961). Examination of residuals. Proc. 4th Berkeley Symp., 1, 1-36.
- (1967). Topics in the investigation of linear relations fitted by the method of least squares. J. R. Statist. Soc. B, 29, 1-52.
BEALE, E. M. L., KENDALL, M. G. and MANN, D. W. (1967). The discarding of variables in multivariate analysis. Biometrika, 54, 357-366.
BOX, G. E. P. (1966). Use and abuse of regression. Technometrics, 8, 625-630.
BOX, G. E. P. and COX, D. R. (1964). An analysis of transformations. J. R. Statist. Soc. B, 26, 211-252.
BOX, G. E. P. and TIDWELL, P. W. (1962). Transformation of the independent variables. Technometrics, 4, 531-550.
COX, D. R. (1960). Regression analysis when there is prior information about supplementary variables. J. R. Statist. Soc. B, 22, 172-176.
- (1961). Tests of separate families of hypotheses. Proc. 4th Berkeley Symp., 1, 105-123.
- (1962). Further results on tests of separate families of hypotheses. J. R. Statist. Soc. B, 24, 406-424.
COX, D. R. and HINKLEY, D. V. (1968). A note on the efficiency of least squares estimates. J. R. Statist. Soc. B, 30, 284-289.
DRAPER, N. R. and SMITH, H. (1966). Applied Regression Analysis. New York: Wiley.
EHRENBERG, A. S. C. (1963). Bivariate regression is useless. Appl. Statistics, 12, 161-179.
- (1968). The elements of law-like relationships. J. R. Statist. Soc. A, 131, 280-302.
FISHER, R. A. (1956). Statistical Methods and Scientific Inference. Edinburgh: Oliver and Boyd.
FISK, P. (1967). Models of the second kind in regression analysis. J. R. Statist. Soc. B, 29, 266-281.
GORMAN, J. W. and TOMAN, R. J. (1966). Selection of variables for fitting equations to data. Technometrics, 8, 27-51.
HOCKING, R. R. and LESLIE, R. N. (1967). Selection of the best subset in regression analysis. Technometrics, 9, 531-540.
HUBER, P. J. (1964). Robust estimation of location. Ann. Math. Statist., 35, 73-101.
JEFFERS, J. N. R. (1967). Two case studies in the application of principal component analysis. Applied Statistics, 16, 225-236.
KEMPTHORNE, O. (1957). An Introduction to Genetic Statistics. New York: Wiley.
KENDALL, M. G. (1957). A Course in Multivariate Analysis. London: Griffin.
KENDALL, M. G. and STUART, A. (1967). Advanced Theory of Statistics (2nd ed.), Vol. 2. London: Griffin.
LEHMANN, E. L. (1959). Testing Statistical Hypotheses. New York: Wiley.
LINDLEY, D. V. (1947). Regression lines and linear functional relationships. J. R. Statist. Soc. B, 9, 218-244.
- (1968). The choice of variables in multiple regression. J. R. Statist. Soc. B, 30, 31-66.
MADANSKY, A. (1959). The fitting of straight lines when both variables are subject to error. J. Am. Statist. Ass., 54, 173-205.
MICKEY, M. R., DUNN, O. J. and CLARK, V. (1967). Note on the use of stepwise regression in detecting outliers. Comp. and Biomed. Res., 1, 105-111.
MORAN, P. A. P. (1956). A test of significance for an unidentified relation. J. R. Statist. Soc. B, 18, 61-64.
- (1967). Testing for correlation between non-negative variates. Biometrika, 54, 385-394.
NELDER, J. A. (1968). Regression, model-building and invariance. J. R. Statist. Soc. A, 131, 303-315.
NEWTON, R. G. and SPURRELL, D. J. (1967a). A development of multiple regression for the analysis of routine data. Applied Statistics, 16, 51-64.
- (1967b). Examples of the use of elements for clarifying regression analysis. Applied Statistics, 16, 165-172.
PLACKETT, R. L. (1960). Regression Analysis. Oxford: Clarendon Press.
- (1965). A class of bivariate distributions. J. Am. Statist. Ass., 60, 516-522.
RAO, C. R. (1965). Linear Statistical Inference and its Applications. New York: Wiley.
SPRENT, P. (1966). A generalized least-squares approach to linear functional relationships. J. R. Statist. Soc. B, 28, 278-297.
SPURRELL, D. J. (1963). Some metallurgical applications of principal components. Applied Statistics, 12, 180-188.
TUKEY, J. W. (1951). Components in regression. Biometrics, 7, 33-69.
- (1954). Causation, regression and path analysis. In Statistics and Mathematics in Biology (ed. O. Kempthorne). Ames, Iowa.
TURNER, M. E., MONROE, R. J. and LUCAS, H. L. (1961). Generalized asymptotic regression and non-linear path analysis. Biometrics, 17, 120-143.
TURNER, M. E. and STEVENS, C. D. (1959). The regression analysis of causal paths. Biometrics, 15, 236-258.
WILK, M. B. and GNANADESIKAN, R. (1964). Graphical methods for internal comparisons in multiresponse experiments. Ann. Math. Statist., 35, 613-631.
WILLIAMS, E. J. (1959). Regression Analysis. New York: Wiley.
YATES, F. (1939). Tests of significance of the differences between regression coefficients derived from two sets of correlated variates. Proc. R. Soc. Edinb., 59, 184-194.