18 Categorical data

FIGURE 18.1  Midway through writing the second edition of this book, things had gone a little strange

18.1. What will this chapter tell me?

We discovered in the previous chapter that I wrote a book. This book. There are a lot of good things about writing books. The main benefit is that your parents are impressed. Well, they're not that impressed actually, because they think that a good book sells as many copies as Harry Potter and that people should queue outside bookshops for the latest enthralling instalment of Discovering Statistics .... My parents are, consequently, quite baffled about how this book is seen as successful, yet I don't get invited to dinner by the Queen. Nevertheless, given that my family don't really understand what I do, books are tangible proof that I do something. The size of this book and the fact it has equations in it is an added bonus because it makes me look cleverer than I actually am. However, there is a price to pay, which is immeasurable mental anguish. In England we don't talk about our emotions, because we fear that if they get out into the open, civilization will collapse, so I definitely will not mention that the writing process for the second edition was so stressful that I came within one of Fuzzy's whiskers of a total meltdown. It took me two years to recover, just in time to start thinking about this third edition. Still, it was worth it because the feedback suggests that some people found the book vaguely useful. Of course, the publishers don't care about helping people: they care only about raking in as much cash as possible to feed their cocaine habits and champagne addictions. Therefore, they are obsessed with sales figures and comparisons with other books. They have databases that hold sales figures of this book and its competitors in different 'markets' (you are not a person, you are a 'consumer', and you don't live in a country, you live in a 'market') and they gibber and twitch at their consoles creating frequency distributions (with 3-D effects) of these values. The data they get are frequency data (the number of books sold in a certain timeframe). Therefore, if they wanted to compare sales of this book to its competitors in different countries, they would need to read this chapter, because it's all about analysing data for which we know only the frequency with which events occur. Of course, they won't read this chapter, but they should ...

18.2. Analysing categorical data

Sometimes we are interested not in test scores, or continuous measures, but in categorical variables. These are not variables involving cats (although the examples in this chapter might convince you otherwise), but are what we have mainly used as grouping variables. They are variables that describe categories of entities (see section 1.5.1.2). We've come across these types of variables in virtually every chapter of this book. There are different types of categorical variable (see section 6.5.5), but in theory a person, or case, should fall into only one category. Good examples of categorical variables are gender (with few exceptions people can be only biologically male or biologically female),* pregnancy (a woman can be only pregnant or not pregnant) and voting in an election (as a general rule you are allowed to vote for only one candidate).

* Before anyone rips my arms from their sockets and beats me around the head with them, I am aware that numerous chromosomal and hormonal conditions exist that complicate the matter. Also, people can have a different gender identity to their biological gender.
In all cases (except logistic regression) so far, we've used such categorical variables to predict some kind of continuous outcome, but there are times when we want to look at relationships between lots of categorical variables. This chapter looks at two techniques for doing this. We begin with the simple case of two categorical variables and discover the chi-square statistic (which we're not really discovering because we've unwittingly come across it countless times before). We then extend this model to look at relationships between several categorical variables.

18.3. Theory of analysing categorical data

We will begin by looking at the simplest situation that you could encounter; that is, analysing two categorical variables. If we want to look at the relationship between two categorical variables then we can't use the mean or any similar statistic because we don't have any variables that have been measured continuously. Trying to calculate the mean of a categorical variable is completely meaningless because the numeric values you attach to different categories are arbitrary, and the mean of those numeric values will depend on how many members each category has. Therefore, when we've measured only categorical variables, we analyse frequencies. That is, we analyse the number of things that fall into each combination of categories. If we take an example, a researcher was interested in whether animals could be trained to line dance. He took 200 cats and tried to train them to line dance by giving them either food or affection as a reward for dance-like behaviour. At the end of the week they counted how many animals could line dance and how many could not. There are two categorical variables here: training (the animal was trained using either food or affection, not both) and dance (the animal either learnt to line dance or it did not). By combining categories, we end up with four different categories. All we then need to do is to count how many cats fall into each category. We can tabulate these frequencies as in Table 18.1 (which shows the data for this example) and this is known as a contingency table.

TABLE 18.1  Contingency table showing how many cats will line dance after being trained with different rewards

Could They Dance?    Food as Reward    Affection as Reward    Total
Yes                        28                  48                76
No                         10                 114               124
Total                      38                 162               200
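To make the structure of a contingency table concrete, here is a minimal sketch of how Table 18.1 could be assembled programmatically. The book itself works in SPSS; this example uses Python with pandas purely as an illustrative assumption, and the only values taken from the text are the four cell counts.

```python
import pandas as pd

# The four observed cell counts from Table 18.1
table = pd.DataFrame(
    {"Food as Reward": [28, 10], "Affection as Reward": [48, 114]},
    index=["Danced: Yes", "Danced: No"],
)

# Add the row and column totals so the layout matches the contingency table above
table.loc["Total"] = table.sum()
table["Total"] = table.sum(axis=1)
print(table)
```

In practice you would usually start from one row of raw data per cat (which reward it got and whether it danced) and cross-tabulate those two columns to produce a table like this.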
18.3.1. Pearson's chi-square test

If we want to see whether there's a relationship between two categorical variables (i.e. does the number of cats that line dance relate to the type of training used?) we can use Pearson's chi-square test (Fisher, 1922; Pearson, 1900). This is an extremely elegant statistic based on the simple idea of comparing the frequencies you observe in certain categories to the frequencies you might expect to get in those categories by chance. All the way back in Chapters 2, 7 and 10 we saw that if we fit a model to any set of data we can evaluate that model using a very simple equation (or some variant of it):

\text{deviation} = \sum (\text{observed} - \text{model})^2

This equation was the basis of our sums of squares in regression and ANOVA. Now, when we have categorical data we can use the same equation. There is a slight variation in that we divide by the model scores as well, which is actually much the same process as dividing the sum of squares by the degrees of freedom in ANOVA. So, basically, what we're doing is standardizing the deviation for each observation. If we add all of these standardized deviations together the resulting statistic is Pearson's chi-square (\chi^2), given by:

\chi^2 = \sum_{i,j} \frac{(\text{observed}_{ij} - \text{model}_{ij})^2}{\text{model}_{ij}}    (18.1)

in which i represents the rows in the contingency table and j represents the columns. The observed data are, obviously, the frequencies in Table 18.1, but we need to work out what the model is. In ANOVA the model we use is group means, but as I've mentioned we can't work with means when we have only categorical variables, so we work with frequencies instead. Therefore, we use 'expected frequencies'. One way to estimate the expected frequencies would be to say 'well, we've got 200 cats in total, and four categories, so the expected value is simply 200/4 = 50'. This would be fine if, for example, we had the same number of cats that had affection as a reward and food as a reward; however, we didn't: 38 got food and 162 got affection as a reward. Likewise, there are not equal numbers that could and couldn't dance. To take account of this, we calculate expected frequencies for each of the cells in the table (in this case there are four cells) and we use the column and row totals for a particular cell to calculate the expected value:

\text{model}_{ij} = E_{ij} = \frac{\text{row total}_i \times \text{column total}_j}{n}

in which n is simply the total number of observations (in this case 200). We can calculate these expected frequencies for the four cells within our table (row total and column total are abbreviated to RT and CT respectively):

\text{model}_{\text{Food, Yes}} = \frac{RT_{\text{Yes}} \times CT_{\text{Food}}}{n} = \frac{76 \times 38}{200} = 14.44
\text{model}_{\text{Food, No}} = \frac{RT_{\text{No}} \times CT_{\text{Food}}}{n} = \frac{124 \times 38}{200} = 23.56
\text{model}_{\text{Affection, Yes}} = \frac{RT_{\text{Yes}} \times CT_{\text{Affection}}}{n} = \frac{76 \times 162}{200} = 61.56
\text{model}_{\text{Affection, No}} = \frac{RT_{\text{No}} \times CT_{\text{Affection}}}{n} = \frac{124 \times 162}{200} = 100.44

Given that we now have these model values, all we need to do is take each value in each cell of our data table, subtract from it the corresponding model value, square the result, and then divide by the corresponding model value. Once we've done this for each cell in the table, we just add them up:

\chi^2 = \frac{(28 - 14.44)^2}{14.44} + \frac{(10 - 23.56)^2}{23.56} + \frac{(48 - 61.56)^2}{61.56} + \frac{(114 - 100.44)^2}{100.44}
       = \frac{(13.56)^2}{14.44} + \frac{(-13.56)^2}{23.56} + \frac{(-13.56)^2}{61.56} + \frac{(13.56)^2}{100.44}
       = 12.73 + 7.80 + 2.99 + 1.83
       = 25.35

This statistic can then be checked against a distribution with known properties. All we need to know is the degrees of freedom, and these are calculated as df = (r - 1)(c - 1), in which r is the number of rows and c is the number of columns. Another way to think of it is the number of levels of each variable minus one, multiplied together. In this case we get df = (2 - 1)(2 - 1) = 1. If you were doing the test by hand, you would find a critical value for the chi-square distribution with df = 1, and if the observed value was bigger than this critical value you would say that there was a significant relationship between the two variables. These critical values are produced in Appendix A.4, and for df = 1 the critical values are 3.84 (p = .05) and 6.63 (p = .01), and so because the observed chi-square is bigger than these values it is significant at p < .01. However, if you use SPSS, it will simply produce an estimate of the precise probability of obtaining a chi-square statistic at least as big as (in this case) 25.35 if there were no association in the population between the variables.
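As a check on the arithmetic above, here is a minimal sketch that computes the expected frequencies and Pearson's chi-square for the cat data. It uses Python with NumPy and SciPy as an illustrative assumption (the book's own route is SPSS); the small difference from 25.35 arises because the hand calculation rounds each term to two decimal places.

```python
import numpy as np
from scipy import stats

# Observed frequencies from Table 18.1 (rows: danced yes/no; columns: food/affection)
observed = np.array([[28, 48],
                     [10, 114]])

# Expected frequencies under no association: E_ij = row total_i * column total_j / n
n = observed.sum()
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
print(expected)  # [[14.44, 61.56], [23.56, 100.44]]

# Pearson's chi-square (equation 18.1) and its p-value on df = (2-1)(2-1) = 1
chi_square = ((observed - expected) ** 2 / expected).sum()
df = (observed.shape[0] - 1) * (observed.shape[1] - 1)
print(chi_square, stats.chi2.sf(chi_square, df))  # ~25.36, p < .001

# The same test in one call; correction=False switches off Yates's correction (section 18.3.4)
chi2, p, dof, exp = stats.chi2_contingency(observed, correction=False)
print(chi2, p, dof)
```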
18.3.2. Fisher's exact test

There is one problem with the chi-square test, which is that the sampling distribution of the test statistic has an approximate chi-square distribution. The larger the sample is, the better this approximation becomes, and in large samples the approximation is good enough for us not to worry about the fact that it is an approximation. However, in small samples the approximation is not good enough, making significance tests of the chi-square distribution inaccurate. This is why you often read that to use the chi-square test the expected frequencies in each cell must be greater than 5 (see section 18.4). When the expected frequencies are greater than 5, the sampling distribution is probably close enough to a perfect chi-square distribution for us not to worry. However, when the expected frequencies are too low, it probably means that the sample size is too small and that the sampling distribution of the test statistic is too deviant from a chi-square distribution to be of any use.

Fisher came up with a method for computing the exact probability of the chi-square statistic that is accurate when sample sizes are small. This method is called Fisher's exact test (Fisher, 1922) even though it's not so much of a test as a way of computing the exact probability of the chi-square statistic. This procedure is normally used on 2 × 2 contingency tables (i.e. two variables each with two options) and with small samples. However, it can be used on larger contingency tables and with large samples, but on larger contingency tables it becomes computationally intensive and you might find SPSS taking a long time to give you an answer. In large samples there is really no point because it was designed to overcome the problem of small samples, so you don't need to use it when samples are large.

18.3.3. The likelihood ratio

An alternative to Pearson's chi-square is the likelihood ratio statistic, which is based on maximum-likelihood theory. The general idea behind this theory is that you collect some data and create a model for which the probability of obtaining the observed set of data is maximized, then you compare this model to the probability of obtaining those data under the null hypothesis. The resulting statistic is, therefore, based on comparing observed frequencies with those predicted by the model:

L\chi^2 = 2 \sum_{i,j} \text{observed}_{ij} \ln\!\left(\frac{\text{observed}_{ij}}{\text{model}_{ij}}\right)

in which i and j are the rows and columns of the contingency table and ln is the natural logarithm (this is the standard mathematical function that we came across in Chapter 8 and you can find it on your calculator, usually labelled as ln or log). Using the same model and observed values as in the previous section, this would give us:

L\chi^2 = 2\left[28 \ln\!\left(\frac{28}{14.44}\right) + 10 \ln\!\left(\frac{10}{23.56}\right) + 48 \ln\!\left(\frac{48}{61.56}\right) + 114 \ln\!\left(\frac{114}{100.44}\right)\right]
        = 2[(28 \times 0.662) + (10 \times -0.857) + (48 \times -0.249) + (114 \times 0.127)]
        = 2[18.54 - 8.57 - 11.94 + 14.44]
        = 24.94

As with Pearson's chi-square, this statistic has a chi-square distribution with the same degrees of freedom (in this case 1). As such, it is tested in the same way: we could look up the critical value of chi-square for the number of degrees of freedom that we have. As before, the value we have here will be significant because it is bigger than the critical values of 3.84 (p = .05) and 6.63 (p = .01). For large samples this statistic will be roughly the same as Pearson's chi-square, but it is preferred when samples are small.
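Continuing the same hypothetical Python/SciPy sketch (again, the book's own workflow is SPSS), the likelihood ratio statistic falls out of the formula above in a couple of lines, and SciPy's fisher_exact gives the exact p-value discussed in section 18.3.2 for the same 2 × 2 table.

```python
import numpy as np
from scipy import stats

observed = np.array([[28, 48],
                     [10, 114]])
n = observed.sum()
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n

# Likelihood ratio statistic: 2 * sum of observed * ln(observed / expected)
likelihood_ratio = 2 * (observed * np.log(observed / expected)).sum()
print(likelihood_ratio, stats.chi2.sf(likelihood_ratio, df=1))  # ~24.94, p < .001

# Fisher's exact test on the same table; with a sample this large it adds little,
# but with small expected frequencies it is the safer option
odds_ratio, exact_p = stats.fisher_exact(observed)
print(odds_ratio, exact_p)
```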
18.3.4. Yates's correction

When you have a 2 × 2 contingency table (i.e. two categorical variables each with two categories) then Pearson's chi-square tends to produce significance values that are too small (in other words, it tends to make a Type I error). Therefore, Yates suggested a correction to the Pearson formula (usually referred to as Yates's continuity correction). The basic idea is that when you calculate the deviation from the model (the observed_ij − model_ij in equation (18.1)) you subtract 0.5 from the absolute value of this deviation before you square it. In plain English, this means you calculate the deviation, ignore whether it is positive or negative, subtract 0.5 from the value and then square it. Pearson's equation then becomes:

\chi^2 = \sum_{i,j} \frac{(|\text{observed}_{ij} - \text{model}_{ij}| - 0.5)^2}{\text{model}_{ij}}

For the data in our example this just translates into:

\chi^2 = \frac{(13.56 - 0.5)^2}{14.44} + \frac{(13.56 - 0.5)^2}{23.56} + \frac{(13.56 - 0.5)^2}{61.56} + \frac{(13.56 - 0.5)^2}{100.44}
       = 11.81 + 7.24 + 2.77 + 1.70
       = 23.52

The key thing to note is that it lowers the value of the chi-square statistic and, therefore, makes it less significant. Although this seems like a nice solution to the problem, there is a fair bit of evidence that this overcorrects and produces chi-square values that are too small! Howell (2006) provides an excellent discussion of the problem with Yates's correction for continuity, if you're interested; all I will say is that, although it's worth knowing about, it's probably best ignored!

18.4. Assumptions of the chi-square test

It should be obvious that the chi-square test does not rely on assumptions such as having continuous, normally distributed data like most of the other tests in this book (categorical data cannot be normally distributed because they aren't continuous). However, the chi-square test still has two important assumptions:

1 Pretty much all of the tests we have encountered in this book have made an assumption about the independence of data, and the chi-square test is no exception. For the chi-square test to be meaningful it is imperative that each person, item or entity contributes to only one cell of the contingency table.
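Before moving on, here is the same hypothetical Python/SciPy sketch applied to Yates's correction from section 18.3.4: the corrected statistic is computed by hand and then via scipy.stats.chi2_contingency, which applies the correction by default whenever the table has one degree of freedom.

```python
import numpy as np
from scipy import stats

observed = np.array([[28, 48],
                     [10, 114]])
n = observed.sum()
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n

# Yates's continuity correction: subtract 0.5 from each absolute deviation before squaring
corrected_chi_square = ((np.abs(observed - expected) - 0.5) ** 2 / expected).sum()
print(corrected_chi_square)  # ~23.52, matching the hand calculation above

# chi2_contingency applies the same correction by default when df = 1
with_correction = stats.chi2_contingency(observed)              # correction=True by default
without_correction = stats.chi2_contingency(observed, correction=False)
print(with_correction[0], without_correction[0])                # ~23.52 vs ~25.36
```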
