The Mathematics of A-level Statistics

By Owen Toller

The Mathematical Association

Introduction

THE OBJECT OF THIS EXERCISE is to present, in as simple a form as possible, the mathematics behind the various results used in the statistics units of A-level Mathematics and Further Mathematics. It is believed that many teachers as well as advanced students will welcome the chance to see just what is behind the various theorems and formulae; many will not have done much formal mathematical statistics as part of their degree courses and will have learnt the subject "on the hoof". I hasten to add that there is absolutely nothing wrong with that; undoubtedly standards of teaching and learning in statistics have improved hugely in the thirty-five years or so since I was at school.

Alongside the welcome growth in the popularity of statistics as an option within mathematics A-level (as well as an A-level and an AS in their own right) has come a less welcome, though understandable, tendency to treat the subject as a series of mere recipes, formulae into which you substitute numbers, and nothing more. One national examination group, in particular, is completely unapologetic about this; its endorsed textbooks repeatedly contain lines such as "you do not need to know this for the examination". However, anyone with any intellectual probity, teacher or taught alike, must rebel at being presented merely with a series of techniques and formulae with no understanding of where they have come from. Hence the importance of mathematical statistics.

It is unfortunately the case that some (but by no means all) of what follows will be beyond the reach of all but a very few A-level students; it is a fallacy that A-level students who do statistics as part of a mathematics course can do it all "properly", as I am sure anyone who ploughs through these pages will agree. Nevertheless, I strongly believe that each individual student should be encouraged to do as much of the "proper" mathematics as is within his or her capacities, and where possible I have given presentations that I have myself used in the classroom. Comprehensibility has been the goal, at the expense of full generality or rigour.

Harder sections are marked with a *. Names in [square brackets], such as [Burkill], refer to the Bibliography.

Acknowledgements. Some of these ideas have been developed over several years, and I am grateful to various sets of pupils who have acted as enforced guinea-pigs. I am grateful to Dr Ian Cook and Dr Gerry Leversha for supplying ideas and for useful discussions. Dr Leversha also read a first draft and made many valuable mathematical and stylistic suggestions, many of which have been included verbatim. Dr Gerald Goodall and Norman Morris have been generous with their time and expertise in discussing some of these issues, and I am also grateful to Peter Balaam and to an anonymous colleague for numerous comments and suggestions. Above all, Peter Thomas read the work on behalf of the Mathematical Association and made so many valuable suggestions that it is essential to place on record my gratitude to him.
Needless to say, I am responsible for the inadequacies and errors that remain.

Owen Toller
St Paul's School
November 2009

Contents

1 Preliminaries
  1.1 Notation and basic definitions
  1.2 Abbreviations
  1.3 Independence
  1.4 Bivariate data
  1.5 Manipulations involving Σ
2 Discrete Probability Distributions
  2.1 The binomial distribution
  2.2 The Poisson distribution
  2.3 The geometric distribution
3 The Normal Distribution
  3.1 The value of k
  3.2 Mean and variance
  3.3 Continuity correction
4 The Expectation and Variance Operators
  4.1 Expectation
  4.2 Variance
5 Estimators
  5.1 Unbiased estimators
  5.2 Degrees of freedom
6 Correlation and Regression
  6.1 The product moment correlation coefficient
  6.2 The population parameter
  6.3 Linear regression
  6.4 Inverse regression
  6.5 Bivariate normal distributions
  6.6 Geometrical representation of the PMCC
  6.7 Spearman's rank correlation coefficient
7 The Probability Generating Function
  7.1 Definition
  7.2 Properties
  7.3 Other useful PGFs
8 The Moment Generating Function
  8.1 Definition
  8.2 Properties
  8.3 Other useful MGFs
  8.4 The central limit theorem
9 The Gamma and Chi-Squared Distributions
  9.1 The gamma function and the gamma distribution
  9.2 The chi-squared distribution
  9.3 Goodness-of-fit
10 The t-distribution
  10.1 Quotients
  10.2 The PDF of the t-distribution
  10.3 The product moment correlation coefficient
  10.4 The two-sample t-distribution
11 The F-distribution and its applications
  11.1 The F-distribution
  11.2 Applications
12 Distribution-free tests
  12.1 The sign test
  12.2 The Wilcoxon signed-rank test
  12.3 The Wilcoxon rank-sum test
Appendix: Multiple Integration
13 References and Sources
Index

1.3 Independence. Two events A and B are independent if and only if P(A ∩ B) = P(A) × P(B). Two discrete random variables X and Y are independent if and only if P((X = A) ∩ (Y = B)) = P(X = A) × P(Y = B) for all possible combinations of outcomes A of X and B of Y. The definition of independence for continuous random variables is given below.

1.4 Bivariate data. This topic covers a large part of A-level statistics, not just the obvious topics of regression and correlation but also such things as Var(X + Y). The term "bivariate" is sometimes used simply whenever there are two different measurable variables of interest, but others restrict it to a stricter usage. Rather than attempt to distinguish between bivariate data and bivariate distributions, the usage followed here is that a bivariate distribution is one where, for any particular element of the population, there are two measurable variables of interest, and both vary randomly. Thus we need to distinguish two situations, best illustrated by examples.

A. Suppose we study the effect on the yield of a crop of adding different quantities of fertiliser. There are two variables: x, the quantity of fertiliser added, and Y, the crop yield, but x is not a random variable; it is a controlled variable (even if its actual values may be subject to slight variations in measuring it out). Thus this is not a bivariate situation; there is only one random variable Y, parameterised by the value of x. The technical term for this situation is univariate. Note that we do not use a capital letter for x as it is not a random variable; some authors use a Greek letter such as ξ to convey the fact that it can be thought of as a parameter.

B. Suppose X and Y are an exam candidate's marks on a mathematics paper and a physics paper.
Suppose one candidate took only one of the two papers, and we wish to estimate the missing mark for the other paper. This is a genuine bivariate situation. Both X and Y are random variables and are denoted by capital letters. The usual indicator of a bivariate situation is that the two variables are both functions of something else, in this case the identity of each candidate.

A bivariate distribution has a joint probability distribution, which will be a function of both variables. The distribution of either variable considered on its own is called its marginal distribution; for any particular fixed value of one of the variables, the corresponding distribution of the other variable is its conditional distribution. Two continuous random variables are independent if and only if their joint PDF, f(x, y), can be written as the product of the separate PDFs, f(x) × f(y), for all x, y.

1.5 Manipulations using Σ. Students who are not very familiar with Σ notation will find it useful, before getting on to more complicated applications, to have three specific details explained:

1) Constants can be taken out of sums: Σ kxᵢ = k Σ xᵢ and, less obviously but of specific importance below, Σ (xᵢ x̄) = x̄ (Σ xᵢ).

2) Sum of constants: Σ k = nk when k is constant and the sum has n terms.

3) Σ (xᵢ + yᵢ) = Σ xᵢ + Σ yᵢ. This result "looks" plausible, but students too easily get the idea that all they are doing is the same thing as saying 3(x + y) = 3x + 3y. I find it useful to write out a grid:

    x₁  x₂  ...  xₙ
    y₁  y₂  ...  yₙ

Then Σ (xᵢ + yᵢ) means: add up each column first, then add up all the column totals. Σ xᵢ + Σ yᵢ means: add up each row first, and then add up the row totals. Clearly if the series are finite you will get the same answer either way (there are 2n terms and it does not matter in which order you add them up). But if the series are infinite, it is not at all obvious that you will get the same answer either way. Both series have to converge (otherwise it is meaningless to talk about the sum of either series), and in fact they both have to converge in a "well-behaved" manner. The details are part of mathematical analysis and can be found in most books on that subject, such as [Burkill] or [Spivak]. Fortunately, in elementary mathematical statistics most series are generally very well-behaved.

We use this idea in many contexts in mathematical statistics. Here is one. There are two standard formulae for sample variance: (1/n) Σ (xᵢ − x̄)² and (1/n) Σ xᵢ² − x̄². To show that they are equivalent:

    (1/n) Σ (xᵢ − x̄)² = (1/n) Σ (xᵢ² − 2xᵢx̄ + x̄²)          using 3)
                       = (1/n) Σ xᵢ² − (2x̄/n) Σ xᵢ + x̄²      using 1) and 2)

But Σ xᵢ = n x̄, so we get

    (1/n) Σ (xᵢ − x̄)² = (1/n) Σ xᵢ² − 2x̄² + x̄² = (1/n) Σ xᵢ² − x̄²,

which is the result we wanted. I find that this is a particularly useful example to work through with students; it introduces the main standard manipulative techniques fairly painlessly.

Notes. The second version of the formula tends to be more useful, and not just because it is generally easier to use for manual calculations. It is also much more economical of memory. If you are entering the data values one at a time, the calculations cannot be carried out in the first version until the value of x̄ is known, but x̄ cannot be known until all the data have been entered. The second version allows the software to update the two sums Σ x and Σ x² every time a new data value is entered. Interesting issues arise concerning loss of accuracy when the differences between data values are small relative to the size of the values; this is a context in which coding of data has a practical use. For the continuous case, with sums replaced by integrals, the algebra is identical, which is also a point worth making.

2 Discrete Probability Distributions

2.1 The binomial distribution.
A Bernouilli trial is an experiment in which the outcomes can be classified as either “success” or “failure”, Thus throwing a die and determining whether the score on the die is divisible by 3 is an example of a Bernouilli trial. Non-Bemouilli trials (where, for instance, outcomes can be classified in more than two ways) do exist, but many texts simply refer to “trials” in this context, ‘A repeated series of Bernouilli trials, when the variable of interest is the total number of successes, is a candidate for being modelled by the binomial distribution. Suppose there are n trials, on each of which the probability of a success is p. Then, under certain conditions (discussed in the next paragraph), the probability of a total of exactly r successes (where ris an integer satisfying 0 < r 2; if n 1 clearly we get 0. 5 Yrr-y. ae ie" (asthe frst two terms are 220) a Geo! Gel NB! J =ntn—p$;—0=2"_grtgee wr (r= 2) (n= ryt Put m=n-2,s=r—2, so thatn—r=m-~s: n(n Dp"(p + 9)" = n(n Lp? ‘Therefore Var(R) = n(n L)p? + np —(np)” snp? — np +np—n'p? =-—np’+np =np(1-p) =npq_as required. We will see that it is quite often preferable, as here, to replace Var(R) = E(R’) — [E(R)]° by Var(R) = E[R(R - 1)] + E(R) - [E(R)I ‘These processes can be made to look less forbidding by writing the terms out explicitly: E(R) = 0xP(0) + 1xP(1) + 2xP2) + 3xPG) +... + mxP(n) 1 . n et mt ane Byrd =0xq" +1x. ee feo 2 sreae Rami? tS? t ad pu, nt zgr2 nt! a! “ “omy ia-a!” 7 2n-3” DO!” Discrete Probability Distributions Now replace n! throughout by nx(n— 1)! and take out a factor of np: = (a=! par, GD! ea, (ADE gs | Dy "Plowm—pi Syma? * dyna? DIO? =nplp+a]"" 7 Likewise E[R(R ~ 1)] = 0x-1xP(0) + 1xOXP(1) + 2x1XP(2) + 3x2XP(B) +... mln = PCa) ! 1 1 20x HOT tag art 4 tnx(n=Dx: se soe nl a 298 tt onn-a? *in-al? 4 Dro” Now replace n! throughout by nx(n—1)x(n 2)! and take out a factor of n(n ~ 1)p": a{ (n-2)! (0-2! ns (n-2! =n(n—pp'| 22" en=2)t CCS GIO!” =n(n-)p*[p+q]™ =n(n- 1p? and use Var(R) = E[R(R — 1)] + E(R) ~ [E(R)}? as before: Var(R) = n(n = 1p + np — (np)* sniot= np enp—niy? =o np +np = np(1—p) =npq. 2.2 The Poisson distribution. The probability formula is usually derived as a limiting case of the binomial distribution. We will need the following result: eae ca ‘This is a standard result, a Maclaurin expansion, as found in most Further Maths courses eee for example, [Neill and Quadling]). Suppose the mean (expected) number of occurrences of an event in a given time or space interval is 2. Divide the total interval up into a large number, n, of equal sub-intervals, and consider the distribution B(n, p), where, as the mean must be 4, the value of p must be Ain. We will assume that no more than one event can take place in any sub-interval. We must also assume that for any sub-interval the probability of an event occurring in that sub-interval is the same, and that whether or not an event occurs in any one sub-interval is independent of whether or not one occurs in any other sub-interval; these two assumptions allow us to model the total number of events occurring in the total interval by B(n, A/n). Now we consider what happens as n — ow. a “renl4) ( 4) ad (1-4) 7 nn) et =l+ [2.A] a Discrete Probability Distributions We now consider the limits as n —> ce of various parts of this expression. al nn-l n-2 oe ne a) Ba" x2 xxx, . and with r remaining fixed, the value of (a-nin non on n each fraction tends to 1, so the whole expression tends to 1 ay’ ’) (-4) 1. 
This is fairly obvious as you are taking a number to a fixed power, n and that number is tending to 1. Thi is far from obvious if you have not seen it before. However, it counts as a “standard result” and is sometimes actually used as a . definition of e. We prove that tin(1+2) n Using the binomial expansion, we have (+4) =142(2) eb ye + A@rDeneD (; Jo Vey t\n 2! ln 3! n n1e[2 fa eeadle eed Din-]x ‘| 2! njl ln on tet nol —1) (a—2) nepenins [2O=00=0 fort) As with the identical expression above, the limit of each expression in [brackets] is 1 as (A) (ay C4) 14 3! Maclaurin expansion of e, Hence as n —> 9, P(r) ae ; which is the n—> 0, Hence the limit of (-4) as n> ois 1+: n Modelling assumptions. Clearly the above derivation requires that events occur at constant average rate (so that p = W/n throughout) and independently of one another. It also requires that the limiting probability that two events take place in the same time interval is 0. Some texts therefore add the condition that “events must occur singly”. However, this condition is essentially trivial. Any example in which the occurrence of events singly seems to be an issue can usually be put in terms of independence of events. For example, it would probably not be sensible to use the Poisson distribution to model the number of Roman coins found in a particular area of an archaeological site, because they might well occur in hoards ~ but this can be covered perfectly adequately by the “independence” condition. On the other hand, there seems to be no reason why the number of letters you receive each day should not be modelled by a Poisson distribution just because they all arrive together when the postman brings them. Likewise, in modelling retail sales of, say, television sets in a given time period, the fact that several might be bought together (for example, for an old people's home) does not necessarily negate the validity of the Poisson distribution, even though in this case they might be considered not independent (though if multiple sales are a significant proportion of the total, the distribution would in practice be modified). The sort of scenario in which more than one event occurs at a time — for instance, points scored in rugby matches — is so obviously not a candidate for the Poisson distribution as to put the “singly” condition on the same level, at best, as “outcomes can be classified as success or failure” for the binomial distribution. To summarise, “singly” is best considered not a “modelling assumption” but a necessary, and usually tacitly, “given” of the scenario. 8 | Discrete Probability Distributions Expectation and variance. In view of our definition of 2 it would be worrying if E(R) did not equal 2. However, the value of Var(R) can not be obtained by inspection. We derive here the appropriate results “by steam”, though again it is quicker to use the probability generating function (page 31). We have P(R = r) = 2F torr=0, tate 7 The constant €” comes from the Maclaurin series for eas in [2.A], so that SPR=n=e* z = ort For the mean, F(R) =F PR=N)= Sr Le Sy md NB! =f cancelling r and putting s = S APF apigh =Ae* adore =a os! For the variance, E[R(R-1)] = Lr(r-1)P(R =P). rer as the first two terms are zero, ryt a NB! aa ey. at 1 | se A(r-2)! Putting s=r—2 gives e222 me*d xe! a? =o st Hence Var(R) = E(R(R- 1)]+E(R)-[E(®)] (again!) P+ A- Once again a more explicit form gives a less formidable-looking version. This works more easily than for the binomial distribution and is recommended for students. 
E(R) = OxP(0) + 1xP(1) + 2xP(2) + 3xP(3) +... =e [0x0 «(4 sa )o[a) 7 E[R(R=1)]_ = Ox(—1)xP(0) + 1xOxP(1) + 2x1xP(2) + 3x2xP(3) + 4x3xPG) ... A) z z# x [ox-ren+(tx0%4)-(2xre)-(sax4)[nxZ)..| ty Discrete Probability Distributions Hence Var(R) = E[R(R - 1)] + E(R) - [E(R)P =R+A-2 2.3. The geometric distribution. We have P(R =r) =q/"'p, for r= 1,2, 3, Modelling assumptions. These are the same as for the binomial distribution. (The scenario, of repeated trials until a success is achieved, is not the same, but this is not a modelling assumption but a “given” of the question.) Mean and variance. In order to derive formulae for the mean and variance, some sleight of hand is needed, if only because we have to deal with infinite series. We offer two approaches. ‘The first, calculus-based, required more in the way of assumptions but involves some flexible ideas that can be used in Pure Mathematics contexts. First, some preliminary results in Pure Mathematics. ¥ix! = for bx < 1, as this is a geometric progression. a a =x Sntety 4 x j= ae Qe 4 ae ~alijx) ali-z J a-» . : ) Sre-px =x Alc =a 3) 7 ae &* “arla-n?) =x) Note that in (3) the lower limit of the sum remains 1 as the (r = 1) term is zero, provided x40. (fx =0, the probability result that follows is trivial.) In each case a rigorous approach would derive the formulae for the sum to m terms and then consider what happens as m tends to infinity; delicate issues of uniform convergence are needed to permit the interchange of sum and differentiation. Therefore 10 Discrete Probability Distributions Proficiency in dealing with this type of series manipulation is good practice for aspiring university mathematicians, Again, for teaching, one can write out terms explicitly: E(R) = 1xP(1) + 2xP(2) + 3xP3) + 4xP(4). = Ixp + 2pq + 3pq" + 4pg? +... 1 and assert that, firs, l+xtxt+x4... = I-x" 1 and so on as before. Differentiation gives. 1+ 2x 43x +... dea’ =» A different approach avoids differentiation. We start from f=lexever ee =x and proceed as follows, fa)sl+x¢rP er tries... x@)= xt¥ tr eaters. Fia= PtP artery. and adding, fl tx4P e+ elt Qt ar ears. sothat 1+ 2r4+3P 440+... =(tetr tre? =—__.”, which is result (2). d=» Now let g(x) = 1+ 20437 44e 4... aga= xt2P4304.. g(x) = SHH... and adding, +324 6x7 + 100+. = "lEr(r— Da). The individual coefficients are 1, 1+2, 14243, 1424344, etc, which are the triangle numbers and so have general formula ¥4 rir 1). 1 4 so that Er(r—1)x"2= —2 a» “i-9 a ‘These are the same identities as derived above, and the corresponding results for the mean and variance follow. The only assumptions needed here are that infinite series can be added and factorised in the ways shown, and these assumptions are perhaps easier to make convincing, The nice behaviour of uniformly convergent series such as these can be found in a course on analysis, such as [Apostol] or (Spivak]. gal ext t Hence % [Er(r- 1x7] = u 3 The Normal Distribution The basic shape of the probability density function of the normal distribution is that of the graph _y =e" . To avoid the type becoming too small we will often write exp(-42°). However, the area under the graph needs to be 1, and hence we have £(x) =kexp(-t4*) for some constant k, Unfortunately this function has no elementary integral. (The area under this curve between any two vertical lines is well defined, and can be found, but only by using numerical methods, which is how tables of cumulative normal probabilities are obtained.) 3.1 *The value of k. 
The standard way of demonstrating the exact value of the constant k is to consider the joint PDF of two independent random variables Z and W, both with the distribution N(O, 1). This involves the concept of multiple integration. In order to avoid losing the thread of the statistical argument at this stage, a brief account of multiple integration is supplied as an appendix (see page 52). The PDFs of two independent random variables Z and W will be written as, f(z) = kexp[ 42" ] and fudw) = kexp| [-+w?] respectively. Then J exo[-42)]ec J” exo w? }dw=1 = FL feel A further trick involves transforming to plane polar coordinates (r, @), with z° +1 Now, instead of splitting up the plane into a rectangular grid where the area of a basic element is 8xx6y, we split the plane into concentric circles and radii. The area of a basic element here is Srx(r86). zw) de dow a ‘Therefore, instead of calculating [~ [7 £(z,w)dedw, we calculate f°" "g(r, 0)rdraa. wf [rexp[-47 ]arde=1 = BE L-exp(-4 aK = ep ae=1 => an It is clear that proper discussion of this method is difficult. An alternative derivation that doesn’t involve double integrals is given in two articles in The Mathematical Gazette, [Gauthier], and [Desbrow]. Here are the bare bones of the former. [letra f = [from putting u =e V-2inu ‘The Normal Distribution [from considering the derivative of u” du [“plausible”] [from putting w = sin g] == [from a standard Further Mathematics reduction formula). Clearly each step requires a good deal of justification, Now we change from N(0, 1) to N(0, &). Clearly we have stretched the graph in the y-direction by a factor of o; using ordinary function notation, that means moving from f(x) to f(x/0). But to keep the total area 1, we must also stretch the graph in the y-direction by a factor of 1/0, which is the further transformation f(x) — f(x)/¢. Hence our new PDF is f(x/0)/, which is ovia «-4(2)] And finally we change from N(0, &) to Ng, 6°). This is a translation w in the positive x- direction, so it corresponds, again using ordinary function notation, to the transformation a(x) g(x - 4). Replacing x by x ~ 44 therefore, we get _ 1 _Ifa-ny toon eceol-3(S4) | as the PDF of the distribution N(wz, &). fs Note that we can return from & is onl -4(4) |= tokf ovo|- cee 2 o substituting 2—4 — - o It is worth emphasising, for future reference, that z [Lesp-de*)dz=1 BB) a 3.2 Expectation and variance. As the graph of y=e™" is symmetric about x = 0, it is obvious that the mean of the distribution N(0, 1) is 0 (our notation merely implies this). However, it is possible to show that this is correct, using A-level techniques. ie E(X) =—=[" xexpv- (O) =J5e [Lxexp-+ Nex + 2)" =| —Zexp(-42*: by inspection oe ee ait Itis preferable nor to get involved in the formal substitution u = x* as that requires considerable care with the limits, Likewise we can show that the variance is 1: EOC) [le expt s)de= [latxep-te ide 13 The Normal Distribution and use integration by parts: = a7" : = : =| -xexp(-$ 2° += |_ expt Tag neo] +p eset ‘The term in [square brackets] is 0 as_exp(/4x*) grows faster than x as x tends to either +0 or ce, The integral is identical to the integral of the PDF for the distribution N(Q, 1), so its value must be 1. Hence Py dr. Var(X) = E(X”) - [EQQ]? = 1-0 Confidence in the sort of manipulation in these examples will be of benefit in the more complicated theory of bivariate distributions. Obtaining the appropriate values of mean and variance from the PDF £(x) 1 oNin abe exp| Ey] is a useful classroom exercise. 3.3. 
Continuity corrections. Experience shows that this issue causes a good deal of 4ifficulty. The principle behind the need for a continuity correction in certain situations is simple. Suppose X is a discrete random variable and ¥ is a continuous random variable whose distribution approximates to that of X. For example, we might have X ~ B(40, 0.3) and ¥~ N(12, 8.4). The problem arises because although, for example, P(X = 16) is plainly non- zero, P(Y = 16) is zero, or at any rate vanishingly small. Non-zero probabilities for ¥ do not occur for a point but only for a region, such as a < y < b [and here there is no distinction between < and <, as P(Y = a) = 0}. The region corresponding to X = 16 is 15.5 O as s* [E(9)? = 0? ~ Var(s?) < 0 is a random variable, not a constant. However, E( Hence E(s) < @. Very annoying. ‘A circumflex denotes an estimate of a parameter. We denote an estimator of o” by 6°. (Strictly, this placing of the circumflex means the square of the estimate of the standard deviation, rather than the estimate of the variance, but typesetting restrictions all but compel this distinction to be ignored.) 18 | The Expectation and Variance Operators There is notorious disagreement in the literature about the notations 5, » and the definition of sample standard deviation, Perhaps the most common convention is to define 1 z n 5 Dx, - x) and a but some argue that there is no purpose to the sample variance other than as an estimate of the population variance and hence define sample variance with the (n ~ 1) divisor. The MEI A- level specification gives the name “root mean square deviation” to what is here denoted by S. Many calculators use the notations xo'n and xo'n-1 for $ and s, which confuses the issue further, sample variance 5’ 5.2 Degrees of freedom. This concept will be used in subsequent work on the # and chi- squared distributions. It is notoriously slippery, as an article in Teaching Statistics (September 2008) makes clear. In simplistic terms, in the context of a situation involving several random variables, it is the number of those random variables that can be considered independent of one another. “Can be considered”, of course, begs the question; it usually means the number of variables that can be changed, independently of one another, without altering the features of the data that are being used. Hence if you observe 10 random variables and find that the sum of the 10 values is 832.6 (or, which is the same thing, the ‘mean of the 10 values is 83.26), then there are 9 degrees of freedom, because you can invent arbitrary values for 9 of the observations, provided the 10" is then chosen to make the total correct. The snag is that in other contexts the meaning is slightly different, and extracting the precise common feature from the different contexts in which the expression is used is hard. But essentially the key thing is that when there are n degrees of freedom, the analysis deals with m independent random variables. Itis useful to think of °(X,-w)' = ¢(X,-X)' +n(X-)’ when considering degrees of freedom. The left-hand term has n degrees of freedom; the last term has 1, so the first term on the right must have (n— 1) degrees of freedom. Ya. -*) n of squares by the number of degrees of freedom, and when we come to consider Analysis of Variance (ANOVA) we will use that nomenclature. In the present context, the use of 6? = is often interpreted as dividing the sum 6 Correlation and Regression 6.1 The product moment correlation coefficient. 
‘The product moment correlation coefficient r is defined as r= Sq =D —DO-F These definitions follow those in the formula book for the OCR specification, but some authorities use alternative, equivalent, versions with the » divisor in different places throughout. We note that Var(x) = Sadn and Var(y) = Sy/n. The quantity S/n is called the (sample) covariance of x and y, abbreviated to Cov(x, y).. If there is no intention of carrying out any significance tests, there is no need to make any distributional assumptions about the variables, or even to bother about whether they are univariate or bivariate. The value of r can merely be treated as an informal measure of the degree of linear relationship between the two variables, noting that when the points lie close to a straight line with positive [negative] gradient the value of r is close to +1 [-1]. The fact that -1 << 1, with equality if and only if the points line exactly on a line, amounts to this: [nCov(x, y)P =[EZE(x- 3)(y — P? $ Le = ¥)? ECy — §)?] =n? Var(x) Vary) This inequality is in fact is an application of the Cauchy-Schwarz inequality, which states that, for any two sets of real numbers (4), 2, a3, ...} and {b1, ba, bs, «.-} a,b,” < Ea? Eb?) and likewise with integrals for the continuous case. A nice proof of Cauchy-Schwarz is obtained by writing LG x-b)7 20 forms => x Ya?-2x) ab, +H 20 and now put x = Ea,b, / Za? and rearrange to obtain the result. An algebraic derivation is accessible to students and also saves time with the derivation of regression equations. However, it can be made to look very heavy. The following approach is, again, via a sort of parallel-axes theorem; it reduces the heavy algebra to produce an equation from which two important results can be deduced. We use Se? =)°(y,-a-by,)’ and insert extra terms involving b¥ and y. Ye? =D ({F-a—Hx} +{y, -F}-b{x, -¥}). Expanding: n(F—a—bz) + (y, FY +0°D (5 -¥) ~26Y (4, -¥)(,-F)+2(F 4-2) F (94, - =n(¥-a-bx)' +(S,, +b°S,, -2bS,,) as the last two summations are both zero. Completing the square for the second term gives } On: )-2b(F¥-a-bE) > +Sy @S,,- 26S, +5,,) 20 | | | | | | Correlation and Regression Hence We can now make two deductions, The first concerns the equation of the line of best fit, in the form y = a + by. We assume that “best fit” corresponds to minimising Ee, and the coefficients a and b are then given by the following equations 1) As this identity is true for all values of a and 6, it must be true if @ and b are chosen so S, that a= ¥ —b¥ and b= In that case we have Le? = Sy(1- 1°). As Le? > 0, Syy>0, so that | — 7° 2 0 and hence srs} Equality is obtained if and only if Le;* = 0, in other words if all the points lie on the line y =a + bx. This deduction is not dependent upon any assumptions about the nature of the variables x (or X) and ¥ but is purely algebraic; for this deduction there is no need to interpret the minimising of Le; as giving a line of best fit. 2) Le? is minimised, for choices of a and b, by taking [a= y —bx| and This has killed two birds with one stone. However, we are getting ahead of ourselves. ‘There are several plausible reasons for choosing the minimising of Ze? as our criterion for best-fit line. A common-sense reason is that one wants big discrepancies to be ruled out as far as possible. I suspect that more significant in the choice is that its use makes the mathematical analysis easier than other possibilities. For instance, use of le;| makes for mathematical difficulties, and the use of Ze? 
allows the parameters a and b to be calculated using reasonably easy formulae that produce unbiased estimates of the true values. A proof of the fact that they are unbiased estimates is deferred; see page 25. a1 Correlation and Regression. EX = Hy Y=) 62 The population parameter. We define p= as the (population) product moment correlation coefficient for a bivariate population. We can Suv also consider R as a random variable. There is an obvious analogy between p and R that parallels the analogy between and X, though issues such as whether R is an unbiased estimator of p will take us too far afield. It is useful to be able to generate a set of bivariate data from a population with a given (theoretical) product moment correlation coefficient pp, as follows. Lemma. If X and ¥ are independent random variables with the same variance, putting U=pX+ Vl-piY produces a random variable such that the theoretical PMCC between X and U is po. Proof. First we note that Var(X) = Var(Y) = 0? say, and as X and Y are independent, Var(U) = poo” + (1- poo? = 0? We need two properties of the covariance: @ — Cov(X, ¥) is a linear operator (page 16) with respect to both X and ¥: Cov(X, ¥ + Z) = Cov(X, ¥) + Cov(X,Z) and similarly for Cov(X + Z, Y) (ii) Cov(kX, ¥) = Cov(X, kY) = k Cov(X, ¥). Then Cov(X,U)=p,Cov(X,X)+J1-p;Cov(X,¥) = poCov(X,X) +0 as X, Yare independent = poo” as Cov(X, X) = Var(X) by definition. Hence oe oe) xe [War(X)VarU) -PE—=p, as required, oo* This can be used for a simple spreadsheet demonstration of different values of Po, as follows: A B Cee 1 = 2 |=RANDO+...7RANDO FRANDO+..+RANDO | =A2*BS1+B2*SIGN(BS1)* SQRT(1-BS1*2) Adding 6 uniform random variables gives a fair approximation to a normal variable, though 12 is better. The value of pis entered manually in cell BI. Row 2 can now be filled down for, say, 50 rows and a scatter diagram plotted Here is a typical chart for p = 0.8. It will be seen that the result is roughly elliptical. The scatter diagrams produced by standard spreadsheet packages (such as Excel here) unfortunately do not usually exemplify good statistical practice. 2 Correlation and Regression 6.3 Linear regression ~ fitting a straight line to data. If we are to fit a straight line to data for purposes of estimating values, it is essential to make a careful distinction between the univariate and bivariate cases as mentioned above. Case 1: Univariate data, ic.,.x is a controlled variable and Y is a random variable parameterised by it. We model ¥ by the equation Yea+ fr+k where £, is a random variable, the subscript indicating that the distribution of E may depend on x, For the moment the only assumption we will make is that E, has zero expectation for all values of x. We find approximations @ and b, based on sample data, to the values of @and (Strictly a and b are random variables but nobody ever seems to write A and B in this context, and it is more likely to confuse than to help.) Taking a sample of size n, we have Waat bx +e; for 1 ‘We return to this geometrical representation at a later stage, to consider the distribution of r. 29 Correlation and Regression 6.7 Spearman’s rank correlation coefficient. This is simply the PMCC applied to rankings. The defining formula is 62d? n(n? =1) where the ds are the differences in the rankings. We show that this is equivalent to the formula for the PMCC, provided there are no tied ranks. Let the rankings be (x yi). Then Y\d? = D(x, -y,)° =x? - 2D ay; +Dy}. 
Now the {x;} and the {y;} are both {1, 2, ...,} in some order, so we know that ds =D», mn 2 n(nt+1(2n+1) Dea T yf MEDD, 6 ua 2 n(n+1)(2n+1) (2 Var(x) = Var(y) = ee a Substituting into r, Substituting the known results for Ex; and Zy?: {rc ~p-of 2esneeth 249; sues} “ho =D 123 xy, nat =r as required. However, this identity does not apply for tied rankings as the formulae for the sums of squares do not apply (although the formula for Ex; and Ly still hold). 30 7 The Probability Generating Function 7. Definition. We now introduce a tool that may seem pointless at first. The Probability Generating Function is a function that “packages up” all the different probabilities of a random variable into a single function. Its advantage is that in its packaged form it allows us to calculate many useful results very easily. We use the abbreviation p(r) for P(R = r). This is the definition. G@= Sep = p(0) + p(1) + Pp) + Pp) +... This is a function of an arbitrary variable 1. The coefficient of each power of f is the corresponding probability. It can be applied to any discrete random variable, but it is usually applied only to random variables that take values 0, 1, 2, ... only, so that the powers of f are positive integers. 7.2 Properties. 0) The series needs to converge in a neighbourhood of ¢= 1 QD G=k Ep 2 G)sw GW =Ert pr) so G1) = Ep) = 3) F =") +6) -[G'F: GO =Exr- Dp; G"(1) = Zr(r= 1)p(r) = ELR(R = 1], so © =E[RR-1)]+E(R)-[ER)!? {note again this useful version] =6"(1) +6) - (Gyr. 4) pr) =G%0) +r! 5) GO=EC%), as EC) = Depa. a 6) Gxt) = Gx(#) x Gx(0) where X and Y are independent random variables. Proof: Multiply out Gx(t) x Gy(0) and compare coefficients, thus Let Gx(t) = a9 + ait + ant? +... and Gy(2) = bo + bit + bof +... Then Gx(t) x Gy(t) = aobo + (aob: + aibo)t + (agb2 + arbi + abo) +. Now P(X + ¥=1)=P(X=0, ¥= 1) +P(X=1, ¥=0) = gh; + aibo, which is the coefficient of tin Gx(?) x G40). P(X + ¥=2)=P(X=0, Y=2)+P(X=1, ¥=1)+P(X=2, ¥=0) = dgh2 + a,b, + aabp which is the coefficient of fin Gx() x GX), and so on. If we wish to emphasise that a PGF is that of a particular random variable, say'X, we write Gx(f). 31 aceemennensmemasmananiics The Probability Generating Function Example. R ~ B(n, p), 80 that P(R =r) -(') p'q’. Then r oo=Er("\pe = E[")enre =(pt+q)" i= 1, we have: Hence, using repeatedly the fact that (p + q)* Db Gd) =@+g"=I"=1. 2) G’() = np(pt + gy"! so G1) = nplp + g)"* = np. 3) GY =nln=Lppr +g"? so G(1) = n(n Dp’. Hence 6° = n(n — 1)p* + np — (np) = —np? + np = np(1 =P) 4) GO) =n(n= Lp?g’? = 2P(R=2). 5) (pit + qu)" x (pat + ga)" # (pat + 43)" *" unless pi = p2 = P3- So although the sum of two binomial random variables with the same value of p is still a binomial random variable, the sum of two binomial distributions with different ps is not binomial. In symbols, if X ~ B(m, p) and ¥ ~ Bm, p), then X + ¥ ~ B(m + m2, p). This is of course obvious, But if X ~ B(, p1) and ¥ ~ B(nz, p2), with pr # po, then X + ¥ does not have a binomial distribution at all, for any value of p. = mp4. In part 5) here we have used an important property of the PGF, its uniqueness. If two random variables have the same PGFs, they have the same distribution. The proof is an easy example of a very common type of uniqueness proof. Suppose X and Y are two random variables for which Gy(t) = Gy(t). Then Gy(t) - Gr(*) = 0, so all the coefficients of Gx(t) ~ Gy(r) are 0 and therefore all the coefficients of Gx(#) and Gy(0) are equal. Hence for all r, P(X = r) = P(Y=1), in other words X and Y have the same distribution. 
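These properties are easy to experiment with numerically. The following Python sketch is my own illustration and not part of the original text (the function names are invented for the example): it stores a PGF as the list of its coefficients p(r), reads off the mean and variance from G′(1) and G″(1) as in properties 2) and 3), and multiplies two PGFs by convolving their coefficient lists as in property 6), reproducing the observation above that a sum of independent binomial random variables is binomial only when the two values of p are equal.

```python
# A numerical check of the PGF properties above, written as a minimal sketch
# (the function names are invented for this illustration, not taken from the text).
from math import comb, isclose

def binomial_pmf(n, p):
    """Return the coefficient list [P(R = 0), ..., P(R = n)] for R ~ B(n, p)."""
    return [comb(n, r) * p**r * (1 - p)**(n - r) for r in range(n + 1)]

def G(coeffs, t):
    """Evaluate the PGF G(t) = sum of p(r) * t^r."""
    return sum(pr * t**r for r, pr in enumerate(coeffs))

def mean_and_variance(coeffs):
    """Use G'(1) = E(R) and G''(1) = E[R(R - 1)], properties 2) and 3)."""
    g1 = sum(r * pr for r, pr in enumerate(coeffs))             # G'(1)
    g2 = sum(r * (r - 1) * pr for r, pr in enumerate(coeffs))   # G''(1)
    return g1, g2 + g1 - g1**2

def convolve(a, b):
    """Coefficients of Ga(t) * Gb(t); by property 6) this is the PGF of X + Y."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

n, p = 10, 0.3
X = binomial_pmf(n, p)
assert isclose(G(X, 1.0), 1.0)                                   # property 1)
mu, var = mean_and_variance(X)
assert isclose(mu, n * p) and isclose(var, n * p * (1 - p))      # np and npq

# Equal ps: the product of the PGFs is the PGF of B(6 + 4, 0.3) ...
same = convolve(binomial_pmf(6, 0.3), binomial_pmf(4, 0.3))
assert all(isclose(a, b) for a, b in zip(same, binomial_pmf(10, 0.3)))

# ... but with unequal ps the sum is not binomial for any n and p:
diff = convolve(binomial_pmf(6, 0.3), binomial_pmf(4, 0.7))
print(mean_and_variance(diff))   # about (4.6, 2.1); no B(n, p) with whole-number n gives both
```

Storing a PGF as its list of coefficients makes property 6) literally a multiplication of polynomials, which is why the product of two PGFs appears here as a convolution of the two lists.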
7.3 Other useful PGFs. Poisson If R ~ Po(A) then G(r) = eM". Geometric If R ~ Geo(p) then G a =a It is easy (and a standard examination question) to derive these and hence to obtain the corresponding means and variances, as mentioned in the sections on the corresponding. distributions, 8 The Moment Generating Function 8.1 Definition. A problem with PGFs is that they do not apply to continuous random variables. We more often use another function, the Moment Generating Function or MGF. Mo = 27 XD = 1+ es SBOP) Again this is an expression involving an arbitrary variable The coefficients this time are the “moments” E(X"), divided by r!. The mean E(X) is the first moment; E(X*) is the second moment, and so on. However, there is no restriction here on the possible values of the random variable X, provided all the moments are defined. ‘The term “moment” may need some justification. The mean E(X) = ExP(X = x) is exactly analogous to the centre of mass Dxymt; in mechanics, a result obtained in the latter context by “taking moments”; the second moment Zx*P(X = x) is exactly analogous to the moms Ex? mechanics. Indeed, some writers on probability talk of “weights” or “masses” instead of probabi in some contexts. 8.2 Properties. 0) The series must converge in a neighbourhood of t = 0. This is one reason why the definition involves factorials; a series such as this is much more likely to converge if the denominators are factorials. (Compare the fact that xox efal+xt24 4% 4... converges for any value of x with the fact that 23! al L4+x+27 +27 +... converges only for—1 0 as. 0. Since EQ) = 0, M’(0) = 0, and since Var(X) = » M0) Asn ©, Fi(n?) > 0, and Mi le? sn) e,. (an) (¢ m) ** where &,/(P/no*) > 0 as n>, We thus have M,, w-[rEee : zE ae ? as n—> ee, Hence the result. We know that tin( 142) so M, (0) +e" 35 ‘The Moment Generating Function Notes. This is a rather simple version of the theorem; there are more general ones, requiring weaker assumptions, such as the existence of only E(X) and E(X *). Details can be found in [Feller]. In practice, we want to know for a particular distribution how big n must be for a normal distribution to be a valid approximation. This naturally depends on the distribution under discussion, For example, if X ~ U{0, 1], the sum of 12 observations produces a good normal distribution (certainly good enough to use on a spreadsheet simulation), but the sum of n independent exponential random variables doesn’t become very normal until at least n= 30. Intuitively, one assumes that the infinite right-hand tail of the exponential distribution is what is causing the problem here. A uniform distribution, although it doesn’t look at all normal to start with, is finite and symmetric; many students will know that the sum of two discrete uniform distributions (such as the total score on two dice) has the shape of a triangle and it is not very difficult to do the corresponding calculation for three dice, where the result is already recognisably close to normal. ‘An advantage of using the sum of 12 observations from U0, 1] is that the variance of the resulting random variable is 1. This is because, if X ~ UT0, 1], Var(X) = 1/12. The Central Limit Theorem is a widely misunderstood result. It is a statement about the shape of the distribution of the sample mean, not about the values of any parameters, 36 9 The Gamma and Chi-Squared Distributions We derive the PDF of the ° distribution from its MGF, via the Gamma distribution. 9.1 *The gamma function and the gamma distribution. 
‘The gamma function is defined by (a) = |" x*e"*dx ‘This improper integral is convergent provided a> 0. Itis a standard exercise in Further Mathematics reduction formulae to show that [(or+ 1) = @T(a): Tat) = fj evar [exten] 5+ frame ‘de by parts = 04a fx ede = af (a) as required. Thus if n is a positive integer, Tin +1) =a0() = n(n — DT (n- 1) =... =n!) =a! “ (2 ane, *) which by the reduction formula is easily seen to be OM!T%), We are going to need rf But T(4)= Nee “dx . Putting x = 27/2 links (4) to the normal distribution: TA) = V2 [re Pac. (compare [3.B]) We know that — dz= 1, so the value of this integral is Vr. v2a°~" oe (222) .20 ig cy a Jamal By a similar change of variable itis easy to show that [ y"“"e"* dy and the integrand is positive, so a probability density function with wo parameters and A can be obtained by writing fos Ayre fory>0. Ta) A distribution with this PDF is called a gamma distribution. We write ¥ ~ T(@, 4), The moment generating function for a gamma distribution is, my A Ae tao (= BE) = [re i@ The integral on the right can be evaluated by replacing A by 2 M@= 2x. T@) 1 T@) =)" (=a Note: If a is an integer, the distribution T(@ +1, 4) has the probability density function 2°)" 4 which is the PDF of the waiting time for the (a+ 1)" event in a Poisson process. a! This is an extension of the familiar result that if X is the number of events observed in unit time interval and X ~ Po(A), then the probability that no events occur in time t is given by etn other words, the waiting time for the first event in a Poisson process has an exponential distribution. 37 The Gamma and Chi-Squared Distributions 9.2 — *The chi-squared distribution. If Z ~ N(0, 1) then the distribution of Z* is called the chi-squared distribution with 1 degree of freedom, denoted by 7°(1). ‘The MGF of Z is : 1 1 1 anen, MQ) = Ee*)=—— gg = Lf etneigy : Noa L Vn i Substituting (1 - 24)" z = w and dz = du/N(1 - 21) gives 1 du Loe MO= on O- Toa Oe which is the MGF of the distribution 74, ¥2). If ¥2(n) = Z? + Zy +... + Z;” is the sum of n independent random variables each with the distribution [N(0, 1), then the MGF of 4°(n) is [the MGF of 7°(1)] to the power n, that is, MQ) =((1-207" = - 20", This is the MGF of I(4,4) , so the PDF of the distribution °(n) is 1 forx>0. [9.4] x 2'T@) The value of T(+) is found from (*) or (*) depending on whether n is even or odd.” ‘The mean and variance of the °(n) distribution are found from the MGF to be n and 2n respectively. These results can also be obtained directly from the normal PDF, as follows. 1p te 1EX~N(, 1) we have PX <2) = [eax (0, 1) X<2)= Fe fle We will use the fact that flePar= = (Compare (3.B].) Writing Y= X*, we have P(Y 0 and the mean is given by xe Made 1 use ie Putting u? =x, difdu = 2u,we get = ["xte ig et af ax et du In fe . Bs fe" [+E eau} [x The term in [brackets] is 0; the integral is />- as above. Hence the value of the integral is 2 la —2_x,|7 =1. Thus the expected value of ¥ is 1 and the expected value of the sum of Jin V2 ee observations of ¥ is n. Likewise, E(Y*) is given by lip ee = | ve a [ute du Tas Tah * From [Rice]. The Gamma and Chi-Squared Distributions aul ay ~ As before, the variance of the sum of n independent observations of ¥ is then 2n, as required. cis worth pointing out to classes that the critical values for 2°(1) are the squares of those for NG, 1), for example 1.96? = 3.84(16). 93 Goodness-of-fit. 
It is easy and worthwhile to show that, for a goodness-of-fit test with one degree of freedom, the quantity [> has a normal distribution for large E, enough n, using the normal approximation (o the binomial distribution, We assume that there are n observations of a quantity, each of which is to be placed in cell 1 with probability p, or cell 2, with probability q=1~p. Then the total number of observations placed in cell 1, denoted by X;, has the distribution B(n, p), and Xz =n ~Xy, so that O) = Xi, O2 = Xo, Ei = np and Ey = ng. We have (,-E)* (Xj=np) (n= X,)=ng)* tad nq p= X sn —ng=n(l—q)=np This is the square of the quantity ~—"?. which for large m and p close to % has the npg distribution N(np, npq). Unfortunately it is hard to generalise this result to m cells, A rough rEy | 50, 5 induction argument might suggest that if ° {2 Wi, then cell k, say, could be subdivided into two cells and the above argument applied, but as the total number of observations Og in cell k is not fixed but is a random variable the relevant quantity X,-0,) . fi i : 5 "is amuch more complicated random variable (the denominator is complicated). 4 *In order to go further, it is necessary to consider the natural generalisation of the binomial distribution, which was obtained from the expansion of (p+9)". The multinomial distribution is obtained from the expansion of (p: + p2 + ... + py)". This gives the probability i that the ith cell contains r; members as P= pip? ..pi —~— f : Boyngtth One then uses Stirling’s approximation, r!= 2m e"n™” rey 39 (see, for example, [Burkill]), tocobisin Pe (2) ( ny ‘The Gamma and Chi-Squared Distributions Now, putting j= np; and X, But, from the previous line, 7, = 44, +X,./#, . 80 In P—In(constant) = >) (4, +X, /u, +4)In Tacos -% Now provided is large, X; will be small compared with 4, (as X; has been standardised), and ‘we can expand the logarithm using the standard Maclaurin series (see [Neill & Quadling]) to obtain (u,+X,u, +4 Hf pcm =F 0, feos fet El. ) re X,4@,) plus terms of order 7"? and smaller. But we also have Le eI= De“ en-n=0 so that, to order w1,-", we have In P —In(constant) = ~4.) X? EXD. In other words, the distribution of P is, approximately, that at the sum of squares of normal random variables, which is what we were seeking to prove.’ or Px exp( The theory for the chi-squared distribution applies both to contingency tables and to goodness-of-fit tests, but from now on we shall using it in the context of the /-distribution. ‘The most important fact we shall need about the 2° distribution is this: GD 2 ~ Bin-D, Proof. We have seen before that L(X— i) =E(X-X )? +n(X - yw)’. (See [5.A].) Hence ate) al) Cea The term on the left-hand side is the sum of n independent [N(0, 1)? variables and therefore has the distribution z°(n). The second term on the right is one such variable and therefore has the distribution 7°(1). The first term on the right therefore has the distribution %7(n ~ 1). > From [Kendall], section 12. 40 10 The ¢-distribution ‘The algebra here becomes very messy once more, We start by emphasising that the random sare ere 5 xX- variable ~— has the distribution N(O, 1). In practice, we are more interested in“ —# olNn sivn X-u olNn” the population mean, but the expression with s is the quotient of two: random variables and wwe will therefore need some more theory. than as we are unlikely to know the population standard deviation if we don’t know 10.1 Quotients. 
Suppose we know the PDFs of two random variables, X and Y, say, and we want to know the PDF of some function of X and ¥, for instance X + ¥, or (as we will need here) Y/X. The process involves three steps, which are the same as those often used to find the PDF of a function of a single variable. 1) Find the CDFs, by integration, so that we know formulae for P(X < x) and P(Y $y). 2) Use these to find the CDF of the combined variable. 3) Differentiate to find the PDF of the combined variable. 10.2 *The PDF of the f-distribution, We now put this plan into practice. Suppose Z = YIX, Then Fz) = P(Z Sz) = P(YIX $2). We will be considering only random variables for which the denominator X is restricted to positive values, so we can write this as P(Y < Xz). Therefore FA2)= f° [ts y) dar. To remove the dependence of the inner integral on x, we substitute y = xv, dy/dv =x, to get Fde)= ff [/xfG.0) dvdr = ff [xt dvd That has completed step 2). For step 3), differentiate with respect to z to obtain f2)= [_xf(axz) dr. If X and ¥ are independent, this becomes fae [xf cor, ¢ In fact we will need T= + where U~ 72(k) and usually k= n= 1 te To simplify the working we put V = E . From the general result [10.A] we have [10.4] f= [vf (Of, (may, noting that as V2 0, the lower limit is 0 and not ~c», Substituting z= vr gives, simply, [teontyorer= Cv] seerese(-'F) fone - (10.8) 41 The t-distribution We need now to explain how to find f(v) when we know fy(u). As usual we use the CDF. I spell out the details. iw. U=kY. Then Suppose k P(VSv) = PRY 0, based on a sample of size 16 for which r = 0.6. We have v= 16-2 = 14, We compute tu= Viax—2S_ = 281 6 The 99% and 99.5% values for 4 are 2.624 and 2.977, and as 2.81 is between these numbers, for a 1-tailed test this value of r is significant at 1%. If, on the other hand, we want to find the upper 95% critical value for a sample of size 16, then we solve 1.746 = Vi4x—— vi-r which gives r= 0.423 (or -0.423 for the lower 95% critical value). I conclude by emphasising that this analysis is valid only if r is calculated from variables that are bivariate normal. The tdistribution 10.4 The two-sample /-distribution. The purpose of this section is to show why, when a two-sample test (or corresponding confidence interval) is being carried out, it is necessary to assume that the two samples come from normal populations with a common variance. Suppose ) {X, Xo, ..., Xm} is a set of mm independent observations from the distribution N(x, @), and {Y¥j, Yo, ..., Yq) is a set of n independent observations from the distribution N(dty, 0°). Suppose first that k Then the usual statistic is 7'= =P) ee =e) ‘This can be written as —== Cee ee tei le w pace xy dw -¥y where V=m+n—2. Now the numerator is the difference of independent random variables with the distributions NO, 4) and N(O, 4 <;» which has the distribution 1. AN(,1). The factor ef mon cancels The remaining denominator is V[y7(m at a 24 — 1)], which is Vin +n -2)] Therefore T/V vis the quotient of random variables with the distributions N(0, 1) and V(2()}, so TN vhas the distribution 1(). This also explains why the pooled variance estimate needs a divisor of m+n ~2. However, if k 1, the obvious statistic would become and although the corresponding factor 1 +—— would again cancel, the value of the m n denominator, and hence the whole statistic, cannot be worked out unless k is known. If we replace ko by an estimate, this would introduce yet another random variable and the statistic would no longer have a t-distribution. 
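The positive claim here, that with normal populations and a common variance the pooled statistic follows t(m + n − 2), can also be checked by simulation. The sketch below is my own illustration rather than anything from the text; it assumes numpy and scipy are available, and the sample sizes and parameter values are arbitrary choices.

```python
# A simulation sketch (my own illustration, not from the text) of the claim that,
# for normal populations with a common variance, the pooled two-sample statistic
# follows the t-distribution with m + n - 2 degrees of freedom.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
m, n, mu, sigma = 8, 12, 50.0, 4.0
reps = 100_000

x = rng.normal(mu, sigma, size=(reps, m))   # both samples drawn under the null
y = rng.normal(mu, sigma, size=(reps, n))   # hypothesis, with a common variance

xbar, ybar = x.mean(axis=1), y.mean(axis=1)
# Pooled variance estimate with the m + n - 2 divisor discussed above.
ss = ((x - xbar[:, None])**2).sum(axis=1) + ((y - ybar[:, None])**2).sum(axis=1)
s2 = ss / (m + n - 2)
t_stat = (xbar - ybar) / np.sqrt(s2 * (1/m + 1/n))

crit = stats.t.ppf(0.975, df=m + n - 2)     # two-tailed 5% point of t(18)
print("critical value:", round(crit, 3))
print("proportion of |T| beyond it:", (np.abs(t_stat) > crit).mean())  # close to 0.05
```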
In the classroom I have “hand-waved” this by saying that is the quotient of two (X-¥)-0 is also random variables, and has a distribution, and ——————— Sas the quotient of two random variables and so also has a tdistribution, but the quantity is one random variable divided by (the square root of) the sum of no different random variables and therefore does not have a t-distribution. 4s 11 The F-distribution and its applications 11.1 *The F-distribution, We will need to be able to compare two sample variances to see whether they can be taken to come from populations with the same population variance. We therefore consider the distribution of the quotient of two independent chi-squared distributions. We know from [10.A] that if W = U/V is the quotient of two independent random variables, with v > 0, then fww) = i: vf y (vw)fy (v) dv. ‘Throughout what follows, we take w > 0. The working is complicated, and once again it may be preferable to ignore all constant factors and use throughout. Substituting the appropriate chi-squared PDFs into this formula, with m and n degrees of freedom for U and V respectively, we obtain a 2 fw) = ———_——. [“v(vw)"’*"! exp(—L vw) lv"? exp Tere Lowy p(—d wy] Cp v)}dv 2-4 Put y= (+ Dy: = Forma p Now the integrand equals (a — -| 7 lee where the term in square brackets is a constant. In fact this is not quite what we need. The random variable which will be the ratio of two Ulm "Wy The effect Vin m on the PDF of multiplying a random variable by a constant k is easily obtained. Suppose the PDF and CDF of X are f(x) and F(x) respectively, with f = F’, and that ¥ = KX. Then the CDF of Yis unbiased estimates of population variances will be not W = U/V but Fy) = P(Y < y) = P(KX < y) = P(X < yik) = F(v/h). Hence fy) = dFI/dy = fOv/ky/k So if we now replace W by X = (nm)W, we obtain fx(x) = fidnwim)x(nin) which eventually becomes, for x 0, ‘This is the PDF of the distribution F(m, n).* As usual, tables of the corresponding cumulative distribution function are compiled numerically. * From (Berry & Lindgren]. The F Distribution and iss Applications It is usual for the tables to give only the upper percentiles (90%, 95%, 99%, etc). This is because the lower percentiles can also be obtained from them, as follows. It is obvious that if X has the distribution F(m, n) then 1/X has the distribution F(n, m). Suppose we want the 5% percentile of X; call this percentile @ Then 0.05 = P(X < @) = P(UX> L/a) = 1 - P(X < I/a = 1 - 0.95 so the 5% percentile from F(m, n) is the reciprocal of the 95" percentile from F(n, m). 11.2 Applications. We have already seen that the two-sample t-test requires the samples to be drawn from populations with the same variance. Clearly a significance test based on the F distribution (the F-test) can be used to test whether this condition can be taken to hold. However, the most important use of the F-test in statistics that might be met at school level is in Analysis of Variance, or ANOVA. This powerful and widely-applicable technique is based on testing the ratio of two sums of square deviations, and therefore uses the F-test. We return to relatively straightforward mathematics to show how this works. In the simplest type of ANOVA, several different “treatments” are each applied to a number of different samples, and an appropriate random variable X is measured. The issue is whether different treatments produce different effects in terms of X, so that the null hypothesis will be that for all treatments, “= E(X) will be the same. 
(The context may not be “treatments” but it is convenient to have a specific name for the variable in which we are interested.) The variation in the values of X can be analysed into two different sources: ordinary variation due to random differences of the sort that always occur, and variation due to the differences between treatments. The intention is to separate these variances and compare them using an F-test. Denote the data values by Xj where i = 1, 2, ..., m are the different treatments and J =1, 2, ... mm are the observations within each treatment. Often the samples will be the same size for each treatment, so that m =m ‘Nm but this is not essential to the calculation. (It may help the experimental design, but that issue is beyond the scope of this book). The null hypothesis is Ho: = = Hm and the alternative hypothesis is that not all the 4s are equal. The total number of observations is n and we call X =+ DID,X,, the grand mean, We are interested in the total nit sum of squares: SStoe 1S, -X). Using the same trick that we used in [4.A], we write X,-X =(X, -X,)+(X,-X), where ¥,=LS°x, is the mean of the observations for the i treatment, n Squaring both sides, we obtain X)? =(X,—X,)? +2(X, -X, (X, —X) + (X, -X)) ‘Summing these over j, within the :* sample, gives YK, -¥)? = DK, -¥,)* +28, -M L(x, -¥)) +, (X,- XY? But ))(X, —X,) is zero by the definition of X,. Hence De, -X) =X, -X,)? +0 (X,-X). 4 ‘The F-Distribution and its Applications ‘Thus when we sum over i, we obtain SSta= YX, -X) =F, -¥ +30 (%,-¥". fat fat “This has split up the total sum of squares (that is, the squared differences) into two parts. The first term is the sum of squares within each sample, and the second is the sum of squares between the different samples. We call these terms the error sum of squares and the treatment sum of squares, respectively, and denote them by SSp and SSy.. Then Sta = SSe + SSte. If the null hypothesis is correct, SSrz will be small compared with SSp, The number of degrees of freedom for SSr. is fairly obviously m — 1, but there are n ~ 1 degrees of freedom altogether, so the number of degrees of freedom for SSp is the difference, which is n - m. We calculate SS, Mm-2 SS, Kn—m) and compare with the appropriate critical value for the distribution F(mm ~ 1, n—m).. More complicated types of ANOVA exist, enabling more than two sources of variation to be compared, but the principle is the same throughout. An elementary introduction to ANOVA can be found in [Moroney] 48 12 _ Distribution-free tests \ Tests that do not rely on assumptions concerning a particular distribution are sometimes called “non-parametric tests”, but this title seems incorrect as they often involve testing a parameter such as the median. The altemative name “distribution-free tests” should be preferred. 12.1 The sign test. The use of the distribution B(n, 0.5) to calculate the probabilities that the median is exceeded at least R times out of n is obvious enough not to need further comment apart from the need for the usual modelling assumptions for the binomial distribution to apply. The application to whether paired data are drawn from the same distribution is equally obvious. 12.2 The Wilcoxon signed-rank test. The calculation of the critical values for the sums of the positive and negative ranks, P and Q respectively, is simple but tedious. It is best to illustrate the method with a concrete example, here n = 4. 
12 Distribution-free tests

Tests that do not rely on assumptions concerning a particular distribution are sometimes called "non-parametric tests", but this title seems incorrect, as they often involve testing a parameter such as the median. The alternative name "distribution-free tests" should be preferred.

12.1 The sign test. The use of the distribution B(n, 0.5) to calculate the probability that the median is exceeded at least R times out of n is obvious enough not to need further comment, apart from the need for the usual modelling assumptions for the binomial distribution to apply. The application to whether paired data are drawn from the same distribution is equally obvious.

12.2 The Wilcoxon signed-rank test. The calculation of the critical values for the sums of the positive and negative ranks, P and Q respectively, is simple but tedious. It is best to illustrate the method with a concrete example, here n = 4. There are 16 possible sets of signs for the ranks 1, 2, 3 and 4, and these are displayed, with the corresponding values of P, Q and T = min(P, Q), in the following table.

Signed ranks        P    Q    T
+1  +2  +3  +4     10    0    0
-1  +2  +3  +4      9    1    1
+1  -2  +3  +4      8    2    2
+1  +2  -3  +4      7    3    3
+1  +2  +3  -4      6    4    4
-1  -2  +3  +4      7    3    3
-1  +2  -3  +4      6    4    4
-1  +2  +3  -4      5    5    5
+1  -2  -3  +4      5    5    5
+1  -2  +3  -4      4    6    4
+1  +2  -3  -4      3    7    3
-1  -2  -3  +4      4    6    4
-1  -2  +3  -4      3    7    3
-1  +2  -3  -4      2    8    2
+1  -2  -3  -4      1    9    1
-1  -2  -3  -4      0   10    0

On the null hypothesis, all 16 sign combinations are equally likely, and we have the following distribution of T.

t           0        1        2        3       4       5
P(T = t)    0.125    0.125    0.125    0.25    0.25    0.125

Clearly here if the significance level is, say, 15%, we would reject the null hypothesis if T = 0 and not otherwise. If the significance level is less than 12.5%, no outcomes can ever result in rejection of the null hypothesis (when n = 4).

For large values of n it is possible to use an appropriate approximation to the exact distribution of P. We define a set of random variables \(I_k\) by

\[ I_k = \begin{cases} 1 & \text{if the } k\text{th largest difference is positive,} \\ 0 & \text{otherwise,} \end{cases} \]

so that \(I_k\) has the distribution B(1, ½). Thus \(E(I_k) = \tfrac12\) and \(\mathrm{Var}(I_k) = \tfrac14\). Also, \(P = \sum_{k=1}^{n} k I_k\), so that

\[ E(P) = \sum_{k=1}^{n}\frac{k}{2} = \frac{n(n+1)}{4} \quad\text{and}\quad \mathrm{Var}(P) = \sum_{k=1}^{n}\frac{k^2}{4} = \frac{n(n+1)(2n+1)}{24}.\dagger \]

[The first of these results is obvious by symmetry.] Provided we remember that we are now carrying out a two-tailed test, the calculations proceed as for any hypothesis test based on the normal distribution.*

* From [Rice], p. 414.
† My pupil Andrew Simmonds pointed out the apparent paradox that although E(P) is the median of the interval of the positive ranks, the actual test statistic T used can never take a value bigger than this median! The paradox is only apparent, as we are considering the distribution of either P or Q and merely using the smaller for convenience in tabulation.

In ranking the data in order of distance from the hypothesised median, it is clearly necessary that the probability of being a given distance below the median equals the probability of being the same distance above; in other words, the distribution must be symmetrical. It is easy to find examples where a non-symmetric distribution negates the calculations. For example, let

\[ f(x) = \begin{cases} x & 0 \le x \le \sqrt{2} \\ 0 & \text{otherwise} \end{cases} \]

so that \(F(x) = x^2/2\) and the median is 1. Then F(0.9) = 0.405 and 1 - F(1.1) = 0.395. Thus a difference of less than -0.1 is more likely than a difference of greater than +0.1, so if the k-th largest difference is 0.1 in magnitude, the statement "the k-th largest difference is equally likely to be positive or negative" is false.

12.3 The Wilcoxon rank-sum test (also known as the Mann-Whitney U-test). This is the distribution-free alternative to the two-sample t-test. The null hypothesis being tested is that the two relevant random variables X and Y have the same distribution, so that P(X < Y) = ½. The only condition needed is that P(X = Y) = 0, which is automatically true for a continuous random variable.

As with the signed-rank test, the calculations are simple but tedious, and again a specific example will be used. Suppose there are 2 observations of X and 3 of Y, and these are then ranked. If the null hypothesis is correct, all \({}^5C_2 = 10\) possible rankings are equally likely. Here are the possible rankings, together with W, the sum of the ranks of the two X observations.

Ranking (positions 1 to 5)    Ranks of the X observations    W
X X Y Y Y                     1, 2                           3
X Y X Y Y                     1, 3                           4
X Y Y X Y                     1, 4                           5
X Y Y Y X                     1, 5                           6
Y X X Y Y                     2, 3                           5
Y X Y X Y                     2, 4                           6
Y X Y Y X                     2, 5                           7
Y Y X X Y                     3, 4                           7
Y Y X Y X                     3, 5                           8
Y Y Y X X                     4, 5                           9

Hence the distribution of W (for m = 2, n = 3) is

w           3      4      5      6      7      8      9
P(W = w)    0.1    0.1    0.2    0.2    0.2    0.1    0.1

and the smallest possible significance level here would be 20%.
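Tabulations like the two above are easy to generate by brute force. The following sketch is not part of the original text; it is plain Python, and it enumerates the null distributions of T for the signed-rank test with n = 4 and of W for the rank-sum test with m = 2, n = 3.

```python
# Brute-force enumeration of the null distributions tabulated above (illustrative).
from itertools import combinations, product
from collections import Counter
from fractions import Fraction

# Wilcoxon signed-rank test, n = 4: T = min(P, Q) over all 2^n sign patterns.
n = 4
t_counts = Counter()
for signs in product([+1, -1], repeat=n):
    p = sum(rank for rank, s in zip(range(1, n + 1), signs) if s > 0)
    q = n * (n + 1) // 2 - p
    t_counts[min(p, q)] += 1
t_dist = {t: Fraction(c, 2 ** n) for t, c in sorted(t_counts.items())}
print(t_dist)   # {0: 1/8, 1: 1/8, 2: 1/8, 3: 1/4, 4: 1/4, 5: 1/8}

# Wilcoxon rank-sum test, m = 2, n = 3: W = sum of the ranks of the X sample.
m, n2 = 2, 3
w_counts = Counter(sum(ranks) for ranks in combinations(range(1, m + n2 + 1), m))
total = sum(w_counts.values())
w_dist = {w: Fraction(c, total) for w, c in sorted(w_counts.items())}
print(w_dist)   # {3: 1/10, 4: 1/10, 5: 1/5, 6: 1/5, 7: 1/5, 8: 1/10, 9: 1/10}
```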
Unfortunately the calculations for the corresponding normal approximation to the distribution of W for large m, n are much more complicated. We cannot use the addition-of-random-variables trick so simply, as the corresponding indicator variables are not independent, and for the same reason we cannot simply invoke the Central Limit Theorem to describe the distribution. The interested reader can consult references such as [Rice] (p. 402ff) for the justification of the statement that, for m and n both greater than 10, W has the approximate distribution

\[ \mathrm{N}\!\left(\tfrac12 m(m+n+1),\ \tfrac{1}{12}mn(m+n+1)\right), \]

with the use of a continuity correction as appropriate.

Appendix: Multiple integration

In brief, and very unrigorously, the fundamental idea of definite integration of a function f(x) of one variable is that the area between the curve y = f(x) and the x-axis is split into a (large) number of (thin) strips of width δx. The height of each strip is the y-coordinate for some appropriate point in the interval [x, x + δx] [Figure 1], so the total area under the curve is approximately Σ y δx, and in the limit this becomes exactly ∫ y dx.

Suppose now that z = f(x, y) is a function of two variables, with the x- and y-axes considered as horizontal and the z-axis vertical. Divide the (x, y) plane by a rectangular grid, with spacings δx and δy in the x- and y-directions respectively [Figure 2]. A cuboid with base sides δx and δy and height z will have volume z δx δy [Figure 3], and the volume between the surface z = f(x, y) and the horizontal plane z = 0 is given approximately by Σ z δx δy. Here the Σ means that all possible base rectangles of sides δx and δy have to be added up. Take the limit as δx tends to 0 and you get Σ (∫ z dx) δy, which is the sum of the volumes of thin vertical slices perpendicular to the y-axis. Now take the limit as δy tends to 0 and you get ∫(∫ z dx) dy.

Obvious problems arise, quite apart from the issue of whether the approximations become exact in the limit. Do you get the same answer no matter which order the δx bits are added, or no matter which order the δy bits are added? Do you get the same answer if you integrate with respect to y first as if you integrate with respect to x first? Do you get a finite answer at all? These questions are beyond the present work; anyone wishing to follow them up should consult a work on mathematical analysis (see the References and Sources).

[Figures 1 to 3 show, respectively, the strips of width δx under the curve y = f(x); the rectangular grid of spacings δx and δy in the (x, y) plane; and a typical cuboid of height z standing on one grid rectangle.]

For our purposes, all that is needed is to know that, provided the function is sufficiently well behaved, you get the same answer regardless of the order in which the integrals are done. This makes life considerably easier for us. Among other benefits, it means that we can write ∬ f(x, y) dx dy without worrying whether this means ∫(∫ f(x, y) dx) dy or ∫(∫ f(x, y) dy) dx.
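As a quick numerical illustration (not from the original text), the sketch below approximates a double integral by the Σ z δx δy sums just described, adding up the grid rectangles in both orders; the integrand and region are chosen arbitrarily, and for such a well-behaved function the two answers agree.

```python
# Riemann-sum approximation of a double integral, summed in both orders
# (illustrative sketch; integrand and region are arbitrary choices).
def f(x, y):
    return x * y + 1.0               # a well-behaved function of two variables

N = 400                              # grid subdivisions in each direction
dx = dy = 1.0 / N                    # integrate over the unit square [0,1] x [0,1]

# Sum over x first, then y:
sum_xy = sum(sum(f(i * dx, j * dy) * dx for i in range(N)) * dy for j in range(N))
# Sum over y first, then x:
sum_yx = sum(sum(f(i * dx, j * dy) * dy for j in range(N)) * dx for i in range(N))

print(sum_xy, sum_yx)                # both are close to the exact value 1.25
```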
References and Sources

Some of the foregoing is "common knowledge". The works referenced are as follows.

[Apostol] T. Apostol, Mathematical Analysis: A Modern Approach to Advanced Calculus, Addison-Wesley, Reading, MA, second edition 1974.
[Berry & Lindgren] Donald A. Berry and Bernard W. Lindgren, Statistics: Theory and Methods, Duxbury Press, Belmont, CA, 1996.
[Boas] Mary L. Boas, Mathematical Methods in the Physical Sciences, John Wiley & Sons, third edition 2005.
[Burkill] J. C. Burkill, A First Course in Mathematical Analysis, Cambridge University Press, 1962.
[Desmond] Darrell Desmond, "Evaluating the Probability Integral", The Mathematical Gazette, Volume 74, No. 468, June 1990, pp. 169-170.
[Feller] W. Feller, An Introduction to Probability Theory and Its Applications, Volume 2, Wiley, second edition 1971. (Feller's Volume 1 is one of the most highly regarded and often-cited undergraduate probability texts, but unfortunately the section required here is in the more advanced Volume 2.)
[French] A. P. French, Vibrations and Waves, Van Nostrand Reinhold, 1971.
[Gauthier] N. Gauthier, "Evaluating the Probability Integral", The Mathematical Gazette, Volume 72, No. 460, June 1988, pp. 124-125.
[Kendall] M. G. Kendall, The Advanced Theory of Statistics, Griffin, London, 1948.
[Mood & Graybill] Alexander Mood and Franklin Graybill, Introduction to the Theory of Statistics, McGraw-Hill, New York, 1963.
[Moroney] M. J. Moroney, Facts from Figures, Penguin Books, Harmondsworth, 1951.
[Neill & Quadling] H. Neill and D. Quadling, Further Pure 2 & 3 (Cambridge Advanced Mathematics), Cambridge University Press, 2005.
[Quadling] Douglas Quadling, Statistics and Probability (SMP Further Mathematics series), Cambridge University Press, 1987.
[Rice] John A. Rice, Mathematical Statistics and Data Analysis, Duxbury Press, Belmont, CA, 1995.
[Spivak] M. Spivak, Calculus, Cambridge University Press, third edition 2006.
[Udny Yule & Kendall] G. Udny Yule and M. G. Kendall, An Introduction to the Theory of Statistics, 12th edition, UBS, 2000.

Index

Analysis of Variance (ANOVA), 19, 47f
Bayesian probability, 5
Bernoulli trial, 4
Best-fit line, 21
Biased estimator, 18, 19
Binomial distribution, 1, 4f, 32, 34, 49, 50; mean and variance, 5ff; modelling assumptions, 4
Bivariate data, 2, 20, 26
Bivariate normal distribution, 26ff, 43f; surface, 26, 28
Calculator notation, 19
Calibration problem, 23, 24f
Cauchy-Schwarz inequality, 20
Central limit theorem, 15, 34, 35f, 49, 51; proof, 35
Chi-squared distribution, 38ff, 41, 46; mean, variance, 38f
Coding, 3
Coefficients of regression, 2f, 25f
Common variance, 45
Conditional probability, 4
Contingency table, 14, 39
Continuity correction, 14, 51
Contour integration, 35
Controlled variable, 2
Convergence, 10, 11, 33
Covariance, population, 16, 22; operator, 22; sample, 20
Cumulative distribution function, 1, 41f
Curve of equal probability, 28
Degrees of freedom, 19, 43f
Discrete and continuous random variables, 14f
Distribution-free tests, 49f
Distributive property, 3, 17
Dot product, 29, 43
Double integration, 12, 52
Ellipse, 28
Elliptical scatter, 22, 28
Error sum of squares, 48
Estimator, 18; pooled variance, 45; regression, 23; sample variance, 20
Event, 1
Expectation, of binomial distribution, 5ff; of error variable, 23; of geometric distribution, 10f; of normal distribution, 13; of Poisson distribution, 9f; operator, 16
Exponential distribution, 34, 36, 37
Exponential function e^x, 7
F-distribution, 46f
Fourier transform, 35
Frequency spectrum, 35
Frequentist interpretation, 5
Gamma distribution, 37, 38, 42
Gamma function, 37
Geometric distribution, 10, 34; mean and variance, 10; modelling assumptions, 10
Geometrical representation, of PMCC, 29; of t-distribution, 42f
Goodness of fit, 39f, 41
Gradients of regression lines, 29
Grand mean, 47
Hyperplane, 29, 43
Independence, 2, 4, 8
Independent events, 2; random variables, 2
Integration, 12
Inverse regression, 23, 24f
Joint distribution, 2, 12, 26
Laplace transform, 35
Likelihood, 24, 25
Line of best fit, 21
Linear operator, 16
Linear regression, 20f, 23ff
Maclaurin expansion, 7
Mann-Whitney U-test, 50
Marginal distribution, 2, 26
Maximum likelihood estimator, 24
Mean, see Expectation
Mechanics analogies, 18, 20, 33
Median, 50
MGF, see Moment generating function
Moment (mechanics), 33
Moment generating function, 1, 33, 38
Moment of inertia, 18
Multinomial distribution, 39
Multiple integration, 12, 52
Music technology, 35
Mutually exclusive events, 1
n-dimensional space, 29
Non-parametric tests, 49
Normal distribution, 12, 26ff, 37, 38; approximation, 14f, 49f, 51; bivariate, 26ff; mean and variance, 13; PDF, 12; swarm, 42
Normal equations, 23
Notation, 1, 19
Null hypothesis, 47, 49, 50
Operator, 16, 22
Outcome, 1
Parallel-axes theorem, 18, 20
Partial differentiation, 23, 25
PDF, see Probability density function
Percentiles of F, 47
PGF, see Probability generating function
PMCC, see Product moment correlation coefficient
Poisson distribution, 7ff, 32, 34, 37; mean and variance, 9f; modelling assumptions, 8; waiting time, 37
Pooled variance estimate, 45
Population parameter, 1; PMCC, 22; variance, 1, 18f, 20
Principal axis, 42
Prior probability, 4
Probability density function, 1, 41; gamma distribution, 37; normal, 12; t-distribution, 41; F-distribution, 46
Probability generating function, 1, 5, 31f, 38
Probability of event, 1
Product moment correlation coefficient: sample, 1, 20, 43f; geometric interpretation, 29f; population parameter, 22; and Spearman's RCC, 30
Quotients, 41, 46f
Random variable, 1
Random, 2
Rank correlation coefficient, 30
Regression, 20f, 23ff; coefficients, 25f; lines, two, 28f
Root mean square deviation, 19
Sample covariance, 20
Sample mean, variance of, 17
Sample variance, 1, 3, 18f, 20
Sampling with/without replacement, 2, 4
Scalar product, 29, 43
Scatter diagram, 22
Series, 3, 10f
Sigma notation, 3
Sign test, 49
Significance level, 49, 50
Singly, modelling condition, 8
Spearman's rank correlation coefficient, 30
Spreadsheet, 22f
Standard deviation, 19
Success/failure, 4
Sum of random variables, 16f
Sum of squares, 20, 47
Surface, in 3D, 26, 28
Tangent to ellipse, 28f
t-distribution, 39, 41f; PDF, 41; two-sample, 45
Tied rankings, 30
Treatment sum of squares, 48
Tree diagram, 4
Triangle numbers, 11
Trivariate data, 29
Two-sample t-distribution, 45
Unbiased estimator, of variance, 18; of regression coefficients, 25f
Uniform convergence, 10, 11
Uniform distribution, 12, 36
Uniqueness, 32, 34
Univariate data, 2, 20, 23
Variance, common, 45; and covariance, 16; of chi-squared distribution, 38f; of binomial distribution, 5ff; of geometric distribution, 10f; of normal distribution, 13f; notation, 1, 19; of Poisson distribution, 9f; of sample mean, 17; operator, 16
Waiting time, 37
Wilcoxon tests, 49, 50
Yates's correction, 14
