
Thomas Gamsjäger

Meitner Monographs

Pearson's Test

The Nonmathematician Series

Abstract

Whereas inferential testing of interval data is widely employed, data on other scales of measurement occasionally appear to fall by the wayside. Even though classical significance testing is hampered by its own limitations, p-value generating methods are nevertheless also available for categorical/nominal or ordinal data. In particular, the question whether empirically observed univariate categorical frequencies conform to or deviate, with statistical significance, from expectation can be suitably addressed with Pearson's chi-squared goodness-of-fit test. Due to the inherent simplicity of its algorithm, neither outstanding statistical knowledge nor dedicated statistical software is needed to subject data to this kind of statistical inference.

Introduction

The hypothesis that many researchers are not overly confident when it comes to dragging their hard-won data through the forbidding machinery of statistical software might not appear far-fetched. The evidence was critically appraised as long as 20 years ago.1 Still, most branches of science are simply unthinkable without the proper use of statistical tests to decide whether the results confirm the initially stated hypothesis or not. In particular, the widely applied testing for significance by comparing the mean values of different groups along with their corresponding distributions requires interval data.6, 17 But this type of data, which lends itself to common algebraic operations like the calculation of the mean itself (e.g. blood pressure), is not always available. The other types include, in particular, nominal and ordinal scales, for which the appropriate statistical tests appear considerably less well known. Even though the qualitative character of nominal data can by definition carry only a limited amount of inherent information, certain parameters are most properly expressed in just such terms. In medicine, a common example is the distinction 'dead' vs. 'alive', or the condition of a patient upon discharge may be described as 'improved/unchanged/worse'. After the categorisation of individual cases into groups using a nominal parameter, the number of cases in each group can be counted and displayed in a histogram. This kind of incidence data can then be subjected to rigorous analytical procedures just like its brethren from the interval camp.

The hallmark of statistical analysis still appears to be testing for significance using a cut-off value of typically p < 0.05, the core of which has been repeatedly criticised, even more than 60 years ago.8 But until the final verdict is available, is it possible to test nominal incidence data for classical statistical significance? The answer is unequivocally yes: enter the chi-squared goodness-of-fit test, or for short, Pearson's test. Though the concept could appear genuinely complicated, it might be surprising that no fancy statistical software package is needed to calculate it.

Pearson, who?

Karl Pearson (1857-1936) was an eminent English mathematician. After studying mathematics in Cambridge, he pursued his wide-ranging interests including, among others, Darwinism, German literature and even Roman law, which induced him to travel widely and to stay for extended periods especially in Berlin and Heidelberg.9, 11 He has been aptly described as a 'thoroughly restless intellectual'.15

From 1881 onwards, Pearson renewed his focus on his primary subject with a very strong biometric inclination and took successive positions at King's College London and University College London. Together with Walter Frank Raphael Weldon, he founded the still-appearing journal Biometrika. During his scientific career, Pearson made major foundational contributions to the field of mathematical statistics and is even regarded as the founder of the discipline in its modern form.11, 14 Pearson's paper on the chi-squared test appeared in 1900.13

Chi-squared, what?

'Chi' (pronounced /ˈkaɪ/) stands for the Greek letter χ, which was chosen to lend its name to a distinctive kind of distribution. It was first described in 1875-76 by Friedrich Robert Helmert as the distribution of the sum of squares of k independent standard normal random variables, representing a special case of the gamma distribution. Its single parameter k specifies the degrees of freedom and determines the shape of the distribution. As k is increased towards infinity, the chi-squared distribution very much approaches a normal distribution.3, 14 A single such sum is calculated using the following formula:

Q = Z₁² + Z₂² + ... + Zₖ² = Σ Zᵢ² (summing over i = 1, ..., k)

On repeatedly calculating this sum Q for many samples, the tabulated frequencies of the results give the chi-squared distribution. But the good thing is that these mildly bewildering mathematical aspects serve as background only and are fortunately not of immediate relevance when using this approach for inferential testing.
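For the curious, this definition can be illustrated empirically: repeatedly summing k squared draws from a standard normal distribution reproduces the chi-squared distribution. The script below is our own minimal sketch (not part of the original text), using only Python's standard library:

```python
import random

def chi_squared_sample(k, rng):
    """One draw of Q: the sum of k squared standard normal variables."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(k))

rng = random.Random(42)  # fixed seed for reproducibility
k = 3
samples = [chi_squared_sample(k, rng) for _ in range(100_000)]

# The chi-squared distribution with k degrees of freedom has mean k;
# the empirical mean of many simulated sums should come close to it.
mean = sum(samples) / len(samples)
print(f"empirical mean: {mean:.2f} (theoretical: {k})")
```

Being a sum of squares, every sample is non-negative, and for small k the distribution is strongly right-skewed, exactly as the textbook plots show.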

Goodness-of-fit test

How, then, is the test applied in practice? First, let us state the aim again: we want to determine whether the values of nominal incidence data differ significantly from what is expected. This is best illustrated with an example.

In the course of one month, a general ward in a hospital admitted 300 patients, of whom 162 were female and 138 male. As the expected values would be 150 patients in each group, do these two empirically determined values differ significantly from the expected ones at a p-value of 0.05? Or in more scientific parlance: the null hypothesis H0 assumes no significant difference, whereas the alternative hypothesis H1 does just the opposite.

Before solving this example, it is time to define the conditions of the test:16, 19

- A single categorical or nominal variable. (The case of a single variable is also characterised by the term 'univariate'.) Within this single variable two or more groups can be tested, which are represented by the number of cases in each group. In the example above the categorical/nominal variable is 'gender' with the two groups 'female' and 'male'.
- Mutual exclusion of the observations. One observation can only be found in one group and not in another.
- Independence of observations.
- Use of actual numbers, not of percentages.
- Total probability = 1.
- No group has an expected number of cases less than 1, and no more than 20 % of the groups have expected numbers of cases less than 5. (The use of Yates's correction for continuity in such circumstances remains controversial.5)
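The last of these conditions, the rule of thumb on expected counts, lends itself to a simple automated check. The helper below is a hypothetical sketch of our own (the function name is not from the original text):

```python
def expected_counts_ok(expected):
    """Rule of thumb for expected counts: no group below 1,
    and at most 20 % of groups below 5."""
    if any(e < 1 for e in expected):
        return False
    small = sum(1 for e in expected if e < 5)
    return small <= 0.2 * len(expected)

print(expected_counts_ok([150, 150]))    # both groups well above 5 -> True
print(expected_counts_ok([0.5, 299.5]))  # one group below 1 -> False
```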

How the redoubtable Karl Pearson arrived at his solution can be happily relegated to the true statisticians. What we need now is rather a convenient way to calculate the results. A statistical software package certainly does the trick in ideal manner. Amazingly, such a device is not of ultimate necessity. A very conventional spreadsheet or even a pocket calculator (or nowadays a cell phone, for that matter) is all that is needed. We just have to calculate the chi-squared test statistic χ²:2, 4, 18

χ² = Σ (O − E)² / E

O  Number of observed cases
E  Number of expected cases

In other words: we have to take the squared difference between the number of observed and expected cases in each group and divide it by the number of expected cases. This we repeat for each group and take the sum of the respective results. To do just that, a table containing our data comes in handy:

Table 1. Tabulation of observed and expected cases

Gender   Number of observed cases O   Expected cases (%)   Number of expected cases E
Female   162                          50                   150
Male     138                          50                   150
Sum      300                          100                  300

χ² = (162 − 150)²/150 + (138 − 150)²/150 = 144/150 + 144/150 = 1.92

Now we have to determine the degrees of freedom k, which is even easier, as this is the number of groups minus 1:

k = number of groups − 1 = 2 − 1 = 1

All that is left to do is to look up these two results, χ² = 1.92 and k = 1, in a table of corresponding χ² and p-values (readily available on the internet20). Here is a small section of such a table:

Degrees of          χ² value
freedom k   p = 0.30   0.20   0.10   0.05   0.01
1           1.07       1.64   2.71   3.84   6.64
2           2.41       3.22   4.60   5.99   9.21
3           3.66       4.64   6.25   7.82   11.34

As our result for k is 1, only the first row is relevant. And in this row our value of χ² of 1.92 lies between 1.64 and 2.71, with a corresponding p-value range between 0.20 and 0.10. Therefore, the level of significance at p = 0.05 is not reached.

The final verdict in our example: even though the numbers of observed cases in both gender groups deviate considerably from the expectation of a 50:50 split, these empirical data do not reach statistical significance (at the p = 0.05 level).
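For readers who prefer to check the arithmetic programmatically rather than with a pocket calculator, the whole worked example fits in a few lines of Python. This is our own sketch (the function name is not from the original text):

```python
def chi_squared_stat(observed, expected):
    """Pearson's goodness-of-fit statistic: the sum of (O - E)^2 / E over all groups."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Ward example: 162 female and 138 male admissions, 150 expected in each group.
stat = chi_squared_stat([162, 138], [150, 150])
k = 2 - 1  # degrees of freedom: number of groups minus 1
print(f"chi-squared = {stat:.2f} with k = {k}")  # chi-squared = 1.92 with k = 1
```

The result matches the hand calculation above, and the same function works unchanged for any number of groups.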

In statistics, more data points usually lead to more robust results, and the level of statistical significance is more easily reached. The chi-squared goodness-of-fit test conforms to this rule just as well. To illustrate this, we expand our imaginary hospital ward example by increasing the number of observations by a factor of 10.

Table 3. Tabulation of observed and expected cases

Gender   Number of observed cases O   Expected cases (%)   Number of expected cases E
Female   1,620                        50                   1,500
Male     1,380                        50                   1,500
Sum      3,000                        100                  3,000

χ² = (1,620 − 1,500)²/1,500 + (1,380 − 1,500)²/1,500 = 9.6 + 9.6 = 19.2

Degrees of freedom: k = 1

In the first row of the look-up table the value of 19.2 lies to the right of 6.64. Therefore, the statistical significance level is even 'better' than p = 0.01.

Even though the two examples share identical ratios, statistical testing yields markedly different results. The actual numbers are of paramount importance, which definitely precludes the use of percentages as input in the calculation.
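This sensitivity to absolute counts is easy to demonstrate: with the group proportions held fixed, multiplying every count by a constant factor multiplies the statistic by that same factor. A short illustration of our own (not from the original text):

```python
def chi_squared_stat(observed, expected):
    """Pearson's goodness-of-fit statistic: the sum of (O - E)^2 / E over all groups."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

small = chi_squared_stat([162, 138], [150, 150])      # 300 patients
large = chi_squared_stat([1620, 1380], [1500, 1500])  # 3,000 patients

# Same 54:46 split, ten times the data: the statistic also grows tenfold,
# moving the result from 'not significant' to well past the p = 0.01 cut-off.
print(f"{small:.2f} -> {large:.2f}")  # 1.92 -> 19.20
```

This is precisely why percentages must never be fed into the formula: they erase the sample size, which carries much of the evidential weight.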

A univariate variable is not limited to having only two groups. Accordingly, the chi-squared goodness-of-fit test can handle any number of them. A worked example highlights the case:

The following table contains (hypothetical) long-term data for the conditions of patients discharged from a hospital ward:

Table 4. Example data (expected)

Status         Percentage
Improved       91.5 %
Unchanged      5 %
Deteriorated   2 %
Dead           1.5 %

The observed data of the ward under investigation are the following:

Table 5. Example data (observed)

Status         Number of cases
Improved       308
Unchanged      16
Deteriorated   10
Dead           11

Do these observed numbers conform to the long-term average or is there any significant deviation (p = 0.05)?

Table 6. Tabulation of observed and expected cases

Status         Number of observed cases O   Expected cases (%)   Number of expected cases E
Improved       308                          91.5                 315.675
Unchanged      16                           5                    17.25
Deteriorated   10                           2                    6.9
Dead           11                           1.5                  5.175
Sum            345                          100                  345

χ² = (308 − 315.675)²/315.675 + (16 − 17.25)²/17.25 + (10 − 6.9)²/6.9 + (11 − 5.175)²/5.175 = 8.227

Now there are four groups to heed in the determination of the degrees of freedom:

k = 4 − 1 = 3

In the look-up table the test statistic of 8.227 can be found in row 3 between the p-value levels of 0.05 and 0.01. Therefore, the intended level of significance is reached, which leads to the conclusion that the observed numbers of cases indeed show a significant deviation from the long-term averages.
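The same recipe extends mechanically to any number of groups. The sketch below (our own illustration, with hypothetical names) derives the expected counts from the long-term percentages and reproduces the statistic calculated above:

```python
def chi_squared_stat(observed, expected):
    """Pearson's goodness-of-fit statistic: the sum of (O - E)^2 / E over all groups."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [308, 16, 10, 11]         # improved, unchanged, deteriorated, dead
percentages = [91.5, 5.0, 2.0, 1.5]  # long-term expectation in per cent
total = sum(observed)                # 345 discharged patients

# Convert the percentage expectation into expected numbers of cases.
expected = [total * p / 100 for p in percentages]

stat = chi_squared_stat(observed, expected)
k = len(observed) - 1
print(f"chi-squared = {stat:.3f} with k = {k}")  # chi-squared = 8.227 with k = 3
```

Note that the conversion to expected counts happens only as an intermediate step; the test itself is still fed actual numbers, never percentages.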

Conclusion

Scientific investigation is a demanding occupation. Toil (and, occasionally, tears) are necessary prerequisites. Against this background, it is all the more intriguing that all the laboriously gathered data get boiled down to typically only one parameter, the p-value. A narrow line, usually, albeit arbitrarily, drawn at 0.05, decides whether the whole undertaking was worth the effort or not, an ongoing and, unfortunately, undecided discussion.10, 12 The contentious issue of publication bias only compounds this sometimes dire situation.7

These academic debates notwithstanding, the p-value can still defend its beleaguered position. Whereas interval data optimally comply with the demands of classical p-value producing inferential testing, their categorical kin appear to fall through the occasional crack in the common scientist's statistical toolbox. But that need not be the case. In fact, Pearson's chi-squared goodness-of-fit test is a very good choice for subjecting univariate categorical frequency data to statistical scrutiny. It compares the number of observed cases with the number of expected cases, quantitatively weighing how 'good' the empirically found data 'fit' a given reference. As an additional bonus, all this can be accomplished without the heavy lifting usually associated with statistical software packages. And, most importantly, Pearson's test provides us with the familiar and trusted (as we have seen, not always rightly so) p-value signifying the cut-off between statistical relevance and scientific oblivion.


References

1) Altman DG. The scandal of poor medical research. BMJ 1994; 308: 283-284
2) Bithell JF. Statistical inference. In: Ahrens W, Pigeot I (eds.). Handbook of epidemiology. Springer 2014, p. 953
3) Boslaugh S. Statistics in a nutshell. O'Reilly Media 2012, p. 125
4) Ibid., p. 129
5) Ibid., p. 131
6) Carlberg C. Statistical analysis. Pearson Education 2011, p. 12
7) Dwan K, Gamble C, Williamson PR et al. Systematic review of the empirical evidence of study publication bias and outcome reporting bias: an updated review. PLoS ONE 2013; 8(7): e66844
8) Greenwood M. The statistician and medical research. BMJ 1948; 2: 467-468
9) Hardy A, Magnello ME. Statistical methods in epidemiology: Karl Pearson, Ronald Ross, Major Greenwood and Austin Bradford Hill, 1900-1945. Soz Präventivmed 2002; 47: 80-89
10) Lew MJ. To P or not to P: on the evidential nature of P-values and their place in scientific inference. arXiv:1311.0081v1
11) Norton BJ. Karl Pearson and statistics: The social origins of scientific innovation. Social Studies of Science 1978; 8: 3-34
12) Nuzzo R. Statistical errors. Nature 2014; 506: 150-152
13) Pearson K. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philosophical Magazine 1900; 50: 157-175
14) Plackett RL. Karl Pearson and the chi-squared test. Int Stat Rev 1983; 51: 59-72
15) Porter TM. Karl Pearson: The scientific life in a statistical age. Princeton University Press 2004, p. 1
16) Sheskin DJ. Handbook of parametric and nonparametric statistical procedures. CRC Press 2003, p. 219
17) Stevens SS. On the theory of scales of measurement. Science 1946; 103: 677-680
18) Van den Broeck J, Brestoff JR. Epidemiology: Principles and practical guidelines. Springer 2013, p. 449
19) Verma JP. Data analysis in management with SPSS software. Springer 2013, p. 73
20) A good table can be found at: http://www.medcalc.org/manual/chi-square-table.php

Figures

Figure 1. Public domain image.

Author

Dr. Thomas Gamsjäger, University Hospital St. Pölten-Lilienfeld, Propst-Führer-Straße 4, 3100 St. Pölten, Austria

Date of publication

1 January 2015

Citation

Gamsjäger T. Pearson's test. Meitner Monographs 2015
