
ASSOCIATIONS IN CATEGORICAL DATA
(Statistics Course Notes by Alan Pickering)

Conventions Used in My Statistics Handouts

Text which appears in unshaded boxes can be treated as an aside. It may define a concept being used or provide a mathematical justification for something. Some of these relate to very important statistical ideas with which you SHOULD be familiar. The names of SPSS menus will be written in shadow bold font (e.g., Analyze). The options on the menu will be called procedures and their names will be written in bold small capitals font (e.g., DESCRIPTIVE STATISTICS). Procedures sometimes offer a family of related options; the name of the one selected will appear in italic small capitals font (e.g., CROSSTABS). The resulting window will contain boxes to be filled or checked, and will offer buttons to access subwindows. Subwindow names will appear in italic font (e.g., Statistics). Subwindows also have boxes to be filled or checked. Thus the full path to the CROSSTABS procedure will be written:

Analyze > DESCRIPTIVE STATISTICS >> CROSSTABS

Questions to be completed will appear in shaded boxes at various points.

PART I - BASICS AND BACKGROUND

What Are Categorical Data?

Categorical or nominal data are those in which the classes within each variable do not have any meaningful numerical value. Common examples are: gender; presence or absence (of some sign, symptom, disease, or behaviour); ethnic group; etc. It is sometimes useful to recode a numerical variable into a small number of categories, and sometimes this classification will retain ordinal information (e.g., carrying out a median split on a personality scale to create high and low trait subgroups, or grouping subjects into broad age bands). The data can then be analysed using the techniques described below, but one has to be aware that these analyses usually have considerably reduced power, relative to using the full range of values on the variable.

Contingency Tables

One can carry out analyses on a single categorical variable to check whether the frequencies occurring for each level of that category are as predicted (check out the SPSS procedure: Analyze > NONPARAMETRIC TESTS >> CHI-SQUARE). However, much more commonly, the basic data structure for categorical variables is the contingency (or classification) table, or crosstabulation, formed from the variables concerned. These are described as n-way contingency tables, where n refers to the number of variables involved. The contingency table documents the frequencies (i.e., counts) of data for each combination of the variables concerned. Hence contingency tables are also referred to as frequency tables. The small-scale example below is a 2-way table formed from the variables PDstatus (i.e., Parkinson's Disease status: does the participant have Parkinson's Disease?) and SmokeHis (i.e., smoking history: has the participant smoked regularly at some point during their life?). Each variable in the table has two levels (yes vs. no).

                                  PDstatus
                           yes (=1)      no (=2)      Row Totals
SmokeHis   yes (=1)       3 (5.73)     11 (8.27)          14
           no  (=2)       6 (3.27)      2 (4.73)           8
Column Totals                 9             13       Grand Total = 22

Table 1. Observed frequency counts of current Parkinson's disease status by smoking history. Expected frequencies under an independence model are in parentheses.

In SPSS, some of the analysis procedures reviewed below require that the categorical variables have numerical values, so it is recommended to always code categorical variables in this way. Remember that it is completely arbitrary how these values are assigned to the categories¹. To aid memory, and to produce clearer analysis printouts, one should always use the variable labelling options for such variables and include verbal value labels for each numerically-coded categorical variable.

The Names of the Techniques

Introductory stats courses familiarise students with a specific technique for analysing 2-way contingency tables: Pearson's χ² test. More general methods are required for analysing higher-order contingency tables (involving 3 or more variables), and some of these analytic methods are therefore described (e.g., in Tabachnick and Fidell) as multiway frequency table analyses (or MFA). However, there are a large number of alternative names for a family of closely-related procedures. Here are just some of the other names which one may come across (e.g., in the procedure names available in SPSS): multinomial (or binary) logistic regression; logit regression; logistic analysis; analysis of multinomial variance; and (hierarchical) loglinear modelling. As these names imply, the techniques are often comparable to the procedures used for handling numerical data, in particular multiple linear regression and analysis of variance (ANOVA). A reasonable simplification is that the categorical data analyses below can be thought of as extending the Pearson χ² test to multiway tables. The techniques also share, with the χ² test, the fact that the calculated statistics are tested for statistical significance against the χ² distribution (just as multiple linear regression and ANOVA involve testing their statistics against the F distribution).

Association in Contingency Tables

There are many techniques in statistics for detecting an association between variables. A significant association simply means that the values of one variable vary systematically (i.e., at a level greater than chance) with values of the other variable.
¹ As we shall see later, changing the way the numerical values for categories are assigned can sometimes alter the value of a particular statistic (e.g., from a positive number to a negative number), but this does not alter the statistical significance of the statistic. It is also sometimes helpful (for interpreting analyses) to choose one particular assignment of numbers rather than others.

The most well-known measures of association are probably the (various types of) correlation coefficient between two variables. Correlation coefficients can reveal the extent to which the score (or rank) of one variable is linearly related to the score (or rank) of another. In contingency tables, the values within each category have no intrinsic numerical value, but associations can still be detected. An association means that the distribution of frequencies across the levels of one category differs depending upon the particular level of another category. When there is no association between variables, they are described as being independent. Thus, independence in a 2-way table means that there is no association between the row and column variables.

It is a much-noted statistical fact that finding significant associations between variables, in itself, tells you nothing about the causal relationships at work. In association analyses one may therefore have no logical reason to treat variables as either dependent or independent. Sometimes research is entirely exploratory and, when significant associations are found, the search for a causal connection between the variables begins. In such research with categorical variables one might typically take a single sample of subjects and record the values on the variables of interest. For example, one might explore whether the political orientation reported by a subject (left; centre; right) was associated with the newspaper he or she reads. Neither variable here is obviously the dependent variable (DV), as the causal relationship could go in either direction. Very often, however, one has a causal model in mind. For example, one might be interested in whether a subject's gender is associated with political orientation. Here, political orientation is the DV and a subject's gender is the independent variable (IV), as it is not possible for political orientation to affect gender. The research here will usually adopt a different sampling scheme, by controlling the sampling of the subjects in terms of the IVs. In this example, two samples of subjects (males and females) would be tested, recording the DV (political orientation) in each sample. It would be typical to arrange for equal-sized samples of males and females in such research. In the example data of Table 1, we are interested in finding variables that predict whether a subject will develop Parkinson's Disease (PD). Thus PD status is the DV and the IV (or predictor) is smoking history. For contingency table data, the distinction between (exploratory) analyses, where all variables have a similar status, and analyses involving both IVs and DVs is important. We will see below that it affects the name of the analysis and the statistical procedure that one uses. In this handout, where the categorical analyses have both IVs and DVs, we will adopt the convention that the DV will be shown as the column variable.

It follows from the above that a pair of alternative hypotheses (H1 = independence; H2 = association) may be applied to a contingency table. In order to decide between these hypotheses, one can calculate a statistic that reflects the discrepancy between the actual frequencies obtained and the frequencies that would be expected under the independence model described above. If the discrepancies are within the limits of chance (i.e., the statistic is nonsignificant), then one cannot reject the hypothesis of independence. If the discrepancies are not within chance limits (i.e., the statistic is significant) then one can safely reject the independence hypothesis, which implies association between the variables in the table.

Estimating Expected Frequencies Under the Independence Model

We will use the data from Table 1 as an example. If the variables PDstatus and SmokeHis are independent, then the proportion of "SmokeHis=yes" subjects with PD should be equal to the proportion of "SmokeHis=no" subjects with PD, and both should be equal to the proportion of subjects who have PD overall. The overall proportion with PD is equal to 0.41 (i.e., 9/22). Therefore, the expected frequency with PD in the SmokeHis=yes group should be 0.41 times the total number of subjects in the SmokeHis=yes group; i.e., 0.41 × 14 (= 5.73). This gives the expected frequency without PD in the SmokeHis=yes group by subtraction (= 14 - 5.73 = 8.27). Similar calculations give the expected frequencies for PD (0.41 × 8 = 3.27) and no PD (= 8 - 3.27 = 4.73) in the SmokeHis=no group. Another way to get the expected frequency for a cell in row R and column C is to multiply the row total for row R by the column total for column C and divide the result by the grand total for the whole table (e.g., [9 × 14]/22 = 5.73 for row 1 and column 1). This approach is easy to use with tables that have more than 2 variables (where the rows represent one variable, columns another, and separate subtables are used for other variables).
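This arithmetic is easy to script outside SPSS. Below is a minimal Python sketch (assuming numpy is available; the variable names are my own, not part of the dataset) that reproduces the expected frequencies for Table 1 via the row-total × column-total / grand-total rule:

```python
import numpy as np

# Observed counts from Table 1: rows = SmokeHis (yes, no),
# columns = PDstatus (yes, no).
observed = np.array([[3, 11],
                     [6,  2]])

row_totals = observed.sum(axis=1, keepdims=True)   # [[14], [8]]
col_totals = observed.sum(axis=0, keepdims=True)   # [[9, 13]]
grand_total = observed.sum()                       # 22

# Expected frequency for each cell under independence:
# (row total * column total) / grand total
expected = row_totals * col_totals / grand_total
print(np.round(expected, 2))
# [[5.73 8.27]
#  [3.27 4.73]]
```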

Testing Associations in 2-Way Contingency Tables

There are several statistics that one can compute to test for association vs. independence in a 2-way contingency table (such statistics are thus sometimes referred to as "indices of association"). Three such statistics (described below) are worthy of attention, and two (LR and OR) are of particular significance for logistic regression analyses. In this section we shall consider a 2-way table with R rows and C columns; this is therefore referred to as an RxC table. There are m cells in the table, where m = R × C. The actual frequency in cell number i of the table is denoted by the symbol fi and the expected frequency (under the independence model) is denoted by ei.

(i) Pearson's χ² statistic.

How to Compute Using SPSS: Select the following procedure:

Analyze > DESCRIPTIVE STATISTICS >> CROSSTABS

Click on the Statistics button to access the Statistics subwindow and then check the Chi-square box. (The Cells subwindow is also useful, as it lets you display things other than just the actual frequencies in the contingency table.)

Key SPSS Output: Conducting a Pearson χ² analysis on the data in Table 1 (which is available on the J drive as the dataset small parks data) produces the following SPSS output:

Is/Was subject a smoker? * Has got Parkinson's disease? Crosstabulation

Count
                                 Has got Parkinson's disease?
                                     yes         no       Total
Is/Was subject a smoker?   yes         3         11          14
                           no          6          2           8
Total                                  9         13          22

Chi-Square Tests

                                Value     df   Asymp. Sig.   Exact Sig.   Exact Sig.
                                                (2-sided)     (2-sided)    (1-sided)
Pearson Chi-Square             6.044(b)    1      .014
Continuity Correction(a)       4.031       1      .045
Likelihood Ratio               6.222       1      .013
Fisher's Exact Test                                              .026         .022
Linear-by-Linear Association   5.769       1      .016
N of Valid Cases                  22

a. Computed only for a 2x2 table
b. 2 cells (50.0%) have expected count less than 5. The minimum expected count is 3.27.

What Do These Results Mean?: The results of this analysis show that the χ² test statistic was significantly greater than the tabulated value (which is approximately the expected value for the statistic, assuming independence between the variables). Therefore, the independence model can be rejected and one concludes that the variables PDstatus and SmokeHis are associated. If the p-value associated with the χ² test statistic were nonsignificant, then one would not be able to reject the hypothesis that PDstatus and SmokeHis are independent.

Formula²: χ² = Σ (i=1 to m) [(fi - ei)²/ei]

Degrees of Freedom (df): df = (R - 1) × (C - 1)

Testing Significance: Under the independence model, the χ² test statistic has a distribution which follows the χ² distribution with the degrees of freedom as given above. (Pearson could have been more helpful and given his statistic a name that differed from that of the distribution against which it is tested.) It has been shown that the χ² distribution can be used to test the Pearson χ² statistic as long as none of the expected frequencies is lower than 3.
² The expression on the next line is used in many statistical formulae:

Σ (i=1 to m) (Xi)

It means: calculate the sum of a set of numbers X1, X2, ... up to Xm.

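Continuing the numpy sketch from earlier (an illustration only, not part of the handout's SPSS workflow; it reuses the observed and expected arrays defined above), Pearson's χ² and its p-value can be computed directly from the formula:

```python
from scipy.stats import chi2

# Pearson chi-square: sum over all cells of (f_i - e_i)^2 / e_i
chi_sq = ((observed - expected) ** 2 / expected).sum()

# df = (R - 1) x (C - 1); for a 2x2 table this is 1
df = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# Upper-tail probability from the chi-square distribution
p_value = chi2.sf(chi_sq, df)
print(round(chi_sq, 3), df, round(p_value, 3))   # 6.044 1 0.014
```

These values match the "Pearson Chi-Square" row of the SPSS output above.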

(ii) Likelihood ratio statistic (usually abbreviated LR or G²).

"How to Compute Using SPSS", "Key SPSS Output", "What Do These Results Mean?", df, and "Testing Significance" are the same as for the Pearson χ². Because of the mathematical relationship between the two formulae, the values of the two statistics are approximately equal under many circumstances.

Formula³: LR = 2 × Σ (i=1 to m) (fi × log[fi/ei])
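A matching sketch for the LR statistic (again reusing observed, expected and df from the earlier snippets; the log is the natural log, as footnote 3 explains):

```python
import numpy as np
from scipy.stats import chi2

# Likelihood ratio statistic: 2 * sum over cells of f_i * ln(f_i / e_i)
lr = 2 * (observed * np.log(observed / expected)).sum()
p_value = chi2.sf(lr, df)
print(round(lr, 3), round(p_value, 3))   # 6.222 0.013
```

This matches the "Likelihood Ratio" row of the SPSS output.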

(iii) Odds ratio (OR).

(Note: this applies only to a 2x2 table, or to a 2x2 comparison within a larger table.)

How to Compute Using SPSS: This is also available via SPSS CROSSTABS. Follow the procedure for χ² and LR but now check the "Cochran's and Mantel-Haenszel Statistics" box in the Statistics subwindow.

Key SPSS Output: Conducting an OR analysis for the data in Table 1 using SPSS CROSSTABS gives the following additional output:
Mantel-Haenszel Common Odds Ratio Estimate

Estimate                                                     .091
ln(Estimate)                                               -2.398
Std. Error of ln(Estimate)                                  1.044
Asymp. Sig. (2-sided)                                        .022
Asymp. 95% Confidence   Common Odds Ratio      Lower Bound   .012
Interval                                       Upper Bound   .704
                        ln(Common Odds Ratio)  Lower Bound -4.445
                                               Upper Bound  -.351

The Mantel-Haenszel common odds ratio estimate is asymptotically normally distributed under the common odds ratio of 1.000 assumption. So is the natural log of the estimate.

What Do These Results Mean?: The OR, as for χ² and LR, tests whether the independence hypothesis can be rejected. The significance test above reveals a p-value of 0.022, and so the row and column variables (smoking history and Parkinson's Disease status) are not independent. An OR value of 1 corresponds to perfect independence, and the OR can range from 0 to plus infinity. The value calculated for the Table 1 data lies well below one (= 0.091; the p-value shows this to be significantly different from 1). This means that the odds of having Parkinson's Disease (PD) if you are a smoker (row 1) are 0.091 times the odds of having PD if you are a nonsmoker. Smoking in these data is significantly protective with respect to PD. If one had predicted the direction of this relationship (based on the existing literature demonstrating a PD-protective effect for smoking), then one could justifiably make a one-tailed test: the p-value for the OR of 0.091 in this case would be 0.022/2 (= 0.011).
³ The log in the formula refers to natural logarithms, also written as loge or ln.

In the SPSS output, one sees that the natural logarithm of the odds ratio is also reported. For various reasons, this statistic is more useful within contingency table analyses than the raw OR. This use of log-transformed statistics explains why contingency table analyses are often described as logistic analyses, logistic regressions or loglinear modelling. The output also reports confidence intervals for the OR and log(OR) statistics. The next two boxes give a few basic facts about logarithms and confidence intervals.

Formula: Assuming the cells of a 2x2 table contain the frequencies a, b, c, d as follows:

a   b
c   d

the formula is: OR = (a × d) / (c × b)

If any of the frequencies (a to d) are zero then one usually replaces the zero with 0.5.

Testing Significance: As the calculated ln(OR) is normally distributed under the independence hypothesis, the estimate can be converted to a z-value and tested against the value in a normal probability table: z(log[OR]) = log(OR)/SE(log[OR]). For the data in Table 1, the value is -2.30. In a standard normal (i.e., z) distribution, the two-tailed probability of finding a value that is as far (or further) above or below 0 as ±2.3 is 0.022.
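The OR and its z-test are simple enough to check by hand; here is a hedged Python sketch (scipy's norm supplies the normal tail probability; the SE formula, the square root of the summed reciprocals of the four frequencies, is explained in the confidence-interval box below):

```python
import math
from scipy.stats import norm

# Table 1 frequencies, laid out as  a b / c d
a, b = 3, 11   # smokers:    PD yes, PD no
c, d = 6, 2    # nonsmokers: PD yes, PD no
# (if any cell were zero, it would be replaced by 0.5, as noted above)

odds_ratio = (a * d) / (c * b)               # 0.091
ln_or = math.log(odds_ratio)                 # -2.398
se_ln_or = math.sqrt(1/a + 1/b + 1/c + 1/d)  # 1.044 (see the CI box below)
z = ln_or / se_ln_or                         # -2.30
p_two_tailed = 2 * norm.sf(abs(z))           # 0.022
print(round(odds_ratio, 3), round(z, 2), round(p_two_tailed, 3))
# 0.091 -2.3 0.022
```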

Key Facts About Logarithms

Taking the logarithm of a number is the inverse mathematical operation to raising a number to a power (or exponentiating). If we write the following equation: x^y = z, then we can define the "log to the base x of z" thus: logx(z) = y. Try it with x=10, y=2 and z=100. The log (to base 10) of 100 is 2; i.e., the log of a number is the power to which you have to raise the base to obtain the original number. Natural logarithms employ the number e as their base (e is approximately 2.718). In statistical theory, natural logs are always used.

The convenient thing about logs is that they turn multiplicative (or reciprocal) relationships into additive (or subtractive) ones. Because e^a × e^b = e^(a+b) and e^a / e^b = e^(a-b), the following are true (and indeed it is true for logs to any base):

loge(a × b) = loge(a) + loge(b)
loge(a / b) = loge(a) - loge(b)

In order for x^y = 0 to be generally true, y must be minus infinity (-∞). Thus, loge(0) could be -∞, but convention dictates that loge(0) is undefined. Any number when raised to the power zero is 1, and so loge(1) = 0. Thus we have the following ranges:

loge(x) > 0 if 1 < x < +∞
loge(x) = 0 if x = 1
loge(x) < 0 if 0 < x < 1

"robabilities #or li&elihoods% have values between E and 2. >s the statistics in this handout involve ta&ing the logs of probabilities$ it follows from the above that the resulting values are negative.

What Are Confidence Intervals?

When we calculate the value of a statistic for a sample of data, we are attempting to measure the true value of that statistic for the population under study. Even if we have removed (most of) the sources of systematic bias from our experiment, the sample used to calculate the statistic will be subject to several sources of random error or noise. Therefore, although the value we calculate for the statistic is our best estimate of its true value in the population, we might prefer to give a range of values within which we feel the true value is likely to fall. A confidence interval (CI) for the calculated value of a statistic is just such a range (error bars on graphs serve a similar purpose). If we estimate a statistic to have a value of 10 with a 95% confidence interval of plus or minus 6, then we are saying that we are 95% certain that the true value of the statistic lies somewhere between 4 and 16 ([10 - 6] to [10 + 6]). If the statistic would be expected to have a value of 0 according to some hypothesis, then the value and associated CI in the above example imply that such a hypothesis should be rejected. Journals increasingly demand that CIs are calculated.

In the SPSS output for the OR analysis of the data in Table 1, the log(OR) had a value of -2.398, with a standard error (SE) for the log(OR) of 1.044. The SE is given by the square root of the sum of the reciprocals of the frequencies used in the OR calculation (i.e., SE = √(1/3 + 1/11 + 1/6 + 1/2) = 1.044). With sufficient subjects in the whole table, the distribution of possible values for log(OR) is distributed normally around the value actually obtained. 95% of the distribution will lie within 1.96 × SE either side of the estimated value. Thus, the 95% CI for the calculated value of ln(OR) lies between (-2.398 - 1.96 × 1.044) and (-2.398 + 1.96 × 1.044); i.e., between -4.44 and -0.35. To calculate the corresponding CI for the OR, we simply compute e^-4.44 (= 0.01) and e^-0.35 (= 0.70). For the data in Table 1, our best guess of the true OR is 0.09, and we are 95% certain that the true value lies between 0.01 and 0.70. As this 95% CI does not include 1.0 (the value expected if there were independence between PDstatus and SmokeHis), the independence hypothesis can be rejected. (This arithmetic is reproduced in the sketch below.)

Question 1

What happens if you calculate the χ², LR, and OR statistics for the contingency table formed from the variables SmokeHis and PDstatr in the small parks dataset? (PDstatr is the same variable as PDstatus, only it has been recoded so that 1 = no and 2 = yes.)
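To close, a sketch reproducing the CI arithmetic from the box above, plus a convenience call that may be handy when trying Question 1 (chi2_contingency is standard scipy; correction=False turns off the Yates continuity correction so the plain Pearson statistic is reported, and lambda_="log-likelihood" requests the LR statistic instead):

```python
import math
import numpy as np
from scipy.stats import chi2_contingency

# 95% CI for ln(OR): estimate +/- 1.96 * SE, then exponentiate
ln_or, se = -2.398, 1.044   # values from the SPSS output above
lo, hi = ln_or - 1.96 * se, ln_or + 1.96 * se
print(round(lo, 2), round(hi, 2))                      # -4.44 -0.35
print(round(math.exp(lo), 2), round(math.exp(hi), 2))  # 0.01 0.7

# One-call versions of the Pearson and LR tests for any 2-way table:
observed = np.array([[3, 11],
                     [6,  2]])
chi_sq, p, df, expected = chi2_contingency(observed, correction=False)
lr, p_lr, _, _ = chi2_contingency(observed, correction=False,
                                  lambda_="log-likelihood")
print(round(chi_sq, 3), round(lr, 3))   # 6.044 6.222
```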
