Data Analysis Basics

Types of Variables
Types of variables indicate which estimates you can calculate and which statistical tests you should use
Continuous variables:
  Always numeric
  Generally calculate measures such as the mean, median and standard deviation
Categorical variables:
  Information that can be sorted into categories
  Field investigation often interested in dichotomous or binary (2-level) categorical variables
  Cannot calculate mean or median but can calculate risk
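As a rough illustration of the distinction above, the sketch below summarizes a continuous variable with the mean, median and standard deviation, and a dichotomous variable with a risk (proportion). The ages and illness flags are made-up values for illustration, not data from any study described here.

```python
import statistics

# Hypothetical data for illustration only (not from any study cited here).
ages = [23, 35, 41, 29, 52, 38, 47, 31]                      # continuous variable
ill = [True, False, True, True, False, False, True, False]   # dichotomous variable

# Continuous: mean, median and standard deviation are meaningful.
mean_age = statistics.mean(ages)
median_age = statistics.median(ages)
sd_age = statistics.stdev(ages)  # sample standard deviation

# Dichotomous: summarize as a risk (proportion of people who are ill).
risk = sum(ill) / len(ill)

print(f"mean={mean_age:.1f} median={median_age} sd={sd_age:.1f} risk={risk:.2f}")
```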
Measures of Association
Strength of the association between two variables, such as an exposure and a disease
Two measures of association used most often are the relative risk, or risk ratio (RR), and the odds ratio (OR)
The decision to calculate an RR or an OR depends on the study design
Interpretation of RR and OR:
  RR or OR = 1: exposure has no association with disease
  RR or OR > 1: exposure may be positively associated with disease
  RR or OR < 1: exposure may be negatively associated with disease
Risk Ratio or Odds Ratio?
Risk ratio
  Used when comparing outcomes of those who were exposed to something to those who were not exposed
  Calculated in cohort studies
  Cannot be calculated in case-control studies because the entire population at risk is not included in the study
Odds ratio
  Used in case-control studies
  Odds of exposure among cases divided by odds of exposure among controls
  Provides a rough estimate of the risk ratio
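A minimal sketch of both measures, assuming the conventional 2x2 cell labels a, b, c, d (exposed and ill, exposed and well, unexposed and ill, unexposed and well). The counts are hypothetical and chosen so that the disease is rare, which is the situation in which the odds ratio comes close to the risk ratio.

```python
def risk_ratio(a, b, c, d):
    """Risk ratio from a 2x2 table.

    a = exposed and ill, b = exposed and well,
    c = unexposed and ill, d = unexposed and well.
    """
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    return risk_exposed / risk_unexposed


def odds_ratio(a, b, c, d):
    """Odds ratio (cross-product ratio ad/bc) from the same table."""
    return (a * d) / (b * c)


# Hypothetical cohort in which the disease is rare in both groups.
a, b, c, d = 20, 980, 10, 990
rr = risk_ratio(a, b, c, d)
odds = odds_ratio(a, b, c, d)
print(f"RR = {rr:.2f}, OR = {odds:.2f}")
```

With these made-up counts the two measures differ only in the second decimal place; with a common disease the gap would be larger.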

Analysis Tool: 2x2 Table

Commonly used with dichotomous variables to compare groups of people
Table puts one dichotomous variable across the rows and another dichotomous variable along the columns
Useful in determining the association between a dichotomous exposure and a dichotomous outcome
Calculating an Odds Ratio
Sample 2x2 table for Hepatitis A at Restaurant A

                     Hepatitis A   No Hepatitis A   Total
Ate salsa                218             45          263
Did not eat salsa         21             85          106
Total                    239            130          369

Table displays data from a case-control study conducted in Pennsylvania in 2003 (2)
Can calculate the odds ratio:
  OR = ad / bc = (218)(85) / (45)(21) = 19.6
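The calculation above can be checked with a few lines of code. This sketch assumes the usual cell labels (a and b in the "ate salsa" row, c and d in the "did not eat" row) and computes the odds ratio from its definition, the odds of exposure among cases divided by the odds of exposure among controls.

```python
# Cell counts from the Hepatitis A table.
a = 218  # ate salsa, Hepatitis A
b = 45   # ate salsa, no Hepatitis A
c = 21   # did not eat salsa, Hepatitis A
d = 85   # did not eat salsa, no Hepatitis A

# Odds of exposure among cases divided by odds of exposure among controls.
odds_cases = a / c
odds_controls = b / d
odds_ratio = odds_cases / odds_controls  # algebraically equal to (a*d)/(b*c)

print(round(odds_ratio, 1))
```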
Confidence Intervals
Point estimate: a calculated estimate (like risk or odds) or measure of association (risk ratio or odds ratio)
The confidence interval (CI) of a point estimate describes the precision of the estimate
  The CI represents a range of values on either side of the estimate
  The narrower the CI, the more precise the point estimate (3)
Confidence Intervals - Example
Example: large bag of 500 red, green and blue marbles:
  You want to know the percentage of green marbles but don't want to count every marble
  Shake up the bag and select 50 marbles to give an estimate of the percentage of green marbles
  Sample of 50 marbles:
    15 green marbles, 10 red marbles, 25 blue marbles
Based on sample we conclude 30% (15 out of 50) marbles are green
  30% = point estimate
  How confident are we in this estimate?
Actual percentage of green marbles could be higher or lower, i.e. sample of 50 may not reflect distribution in entire bag of marbles
Can calculate a confidence interval to determine the degree of uncertainty in this estimate
Calculating Confidence Intervals
How do you calculate a confidence interval?
Can do so by hand or use a statistical program
  Epi Info, SAS, STATA, SPSS and Episheet are common statistical programs
  Default is usually 95% confidence interval but this can be adjusted to 90%, 99% or any other level
Confidence Intervals
Most commonly used confidence interval is the 95% interval
  95% CI indicates that our estimated range has a 95% chance of containing the true population value
Assume that the 95% CI for our bag of marbles example is 17-43%
We estimated that 30% of the marbles are green:
  CI tells us that the true percentage of green marbles is most likely between 17 and 43%
  There is a 5% chance that this range (17-43%) does not contain the true percentage of green marbles
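The slides do not say which method produced the 17-43% interval. As a hand-calculation sketch, the common normal-approximation (Wald) formula for a proportion gives the same range after rounding, assuming a sample proportion of 0.30, a sample size of 50 and the usual 95% multiplier of 1.96 (it ignores the fact that the bag holds only 500 marbles):

```latex
\hat{p} \pm z_{0.975}\sqrt{\frac{\hat{p}\,(1-\hat{p})}{n}}
  = 0.30 \pm 1.96\sqrt{\frac{0.30 \times 0.70}{50}}
  \approx 0.30 \pm 0.13
  \;\Rightarrow\; (0.17,\ 0.43)
```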
Confidence Intervals
If we want less chance of error we could calculate a 99% confidence interval
  A 99% CI will have only a 1% chance of error but will have a wider range
  99% CI for green marbles is 13-47%
If a higher chance of error is acceptable we could calculate a 90% confidence interval
  90% CI for green marbles is 19-41%
Very narrow confidence intervals indicate a very precise estimate
Can get a more precise estimate by taking a larger sample
  100 marble sample with 30 green marbles
    Point estimate stays the same (30%)
    95% confidence interval is 21-39% (rather than 17-43% for original sample of 50 marbles)
  200 marble sample with 60 green marbles
    Point estimate is 30%
    95% confidence interval is 24-36%
CI becomes narrower as the sample size increases
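The effect of confidence level and sample size on interval width can be sketched in code. The function below uses the normal-approximation (Wald) interval, which is an assumption: the slides do not name their method, although this one matches the quoted ranges after rounding to whole percentages. The name `wald_ci_percent` is illustrative only.

```python
from statistics import NormalDist


def wald_ci_percent(successes, n, level):
    """Wald interval for a proportion, returned as rounded percentages."""
    p = successes / n
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)  # about 1.96 for 95%
    margin = z * (p * (1 - p) / n) ** 0.5
    return round(100 * (p - margin)), round(100 * (p + margin))


# Same sample (15 green of 50) at three confidence levels.
by_level = {level: wald_ci_percent(15, 50, level) for level in (0.90, 0.95, 0.99)}

# Same 30% point estimate and 95% level with growing sample sizes.
samples = [(15, 50), (30, 100), (60, 200)]
by_size = {n: wald_ci_percent(green, n, 0.95) for green, n in samples}

for level, (low, high) in by_level.items():
    print(f"{level:.0%} CI, n=50: {low}-{high}%")
for n, (low, high) in by_size.items():
    print(f"95% CI, n={n}: {low}-{high}%")
```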

Back to example of Hepatitis A in a Pennsylvania restaurant:
  Odds ratio = 19.6
  95% confidence interval of 11.0-34.9 (95% chance that the range 11.0-34.9 contained the true OR)
  Lower bound of CI in this example is 11.0 (i.e., >1)
  Odds ratio of 1 means there is no difference between the two groups, OR > 1 indicates a greater risk among the exposed
Conclusion: people who ate salsa were truly more likely to become ill than those who did not eat salsa
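The interval for the odds ratio can be reproduced with the log-based (Woolf) method. That choice is an assumption, since the slides do not say how 11.0-34.9 was obtained, but this method agrees with the quoted bounds to one decimal place. The function name `odds_ratio_ci` is illustrative.

```python
import math
from statistics import NormalDist


def odds_ratio_ci(a, b, c, d, level=0.95):
    """Odds ratio with a Woolf (log-based) confidence interval."""
    estimate = (a * d) / (b * c)
    # Standard error of log(OR) is the root of the summed reciprocal cell counts.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    low = math.exp(math.log(estimate) - z * se_log_or)
    high = math.exp(math.log(estimate) + z * se_log_or)
    return estimate, low, high


# Hepatitis A table: ate salsa (218 ill, 45 well), did not eat (21 ill, 85 well).
estimate, low, high = odds_ratio_ci(218, 45, 21, 85)
print(f"OR = {estimate:.1f}, 95% CI {low:.1f} to {high:.1f}")
```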
Confidence Intervals
Must include CIs with your point estimates to give a sense of the precision of your estimates
Outbreak of gastrointestinal illness at 2 primary schools in Italy
  Children who ate corn/tuna salad had 6.19 times the risk of becoming ill as children who did not eat salad
  95% confidence interval: 4.81-7.98
Pertussis outbreak in Oregon (5)
  Case-patients had 6.4 times the odds of living with a 6-10 year-old child than controls
  95% confidence interval: 1.8-23.4
Conclusion: true association between exposure and disease in both examples
Analysis of Categorical Data
Measure of association (risk ratio or odds ratio)
Confidence interval
Chi-square test
  A formal statistical test to determine whether results are statistically significant
Chi-Square Statistics
A common analysis is whether Disease X occurs as much among people in Group A as it does among people in Group B
  People are often sorted into groups based on their exposure to some disease risk factor
  We then perform a test of the association between exposure and disease in the two groups
Chi-Square Test: Example
Hypothetical outbreak of Salmonella on a cruise ship
Retrospective cohort study conducted
  All 300 people on cruise ship interviewed, 60 had symptoms consistent with Salmonella
  Questionnaires indicate many of the case-patients ate tomatoes from the salad bar

Table. Cohort study: Exposure to tomatoes and Salmonella infection
               Salmonella: Yes   Salmonella: No   Total
Tomatoes              41               89          130
No Tomatoes           19              151          170
Total                 60              240          300
To see if there is a statistical difference in the amount of illness between those who ate tomatoes (41/130) and those who did not (19/170) we could conduct a chi-square test

To conduct a chi-square the following conditions must be met:
  There must be at least a total of 30 observations (people) in the table
  Each cell must contain a count of 5 or more
To conduct a chi-square test we compare the observed data (from study results) with the data we would expect to see

Table. Row and column totals for tomatoes and Salmonella infection
               Salmonella: Yes   Salmonella: No   Total
Tomatoes                                           130
No Tomatoes                                        170
Total                 60              240          300
Gives an overall distribution of people who ate tomatoes and became sick
Based on these distributions we can fill in the empty cells with the expected values
Chi-Square Test: Example (cont.)
Expected Value = (Row Total x Column Total) / Grand Total
For the first cell, people who ate tomatoes and became ill:
  Expected value = (130 x 60) / 300 = 26
Same formula can be used to calculate the expected values for each of the cells
Chi-Square Test: Example (cont.)
Table. Expected values for exposure to tomatoes
               Salmonella: Yes          Salmonella: No            Total
Tomatoes       130 x 60 / 300 = 26      130 x 240 / 300 = 104     130
No Tomatoes    170 x 60 / 300 = 34      170 x 240 / 300 = 136     170
Total          60                       240                       300
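The expected-value formula applies to every cell in the same way, so it is easy to compute for the whole table at once. A minimal sketch using the observed cruise ship counts (rows are tomatoes and no tomatoes, columns are ill and not ill):

```python
# Observed counts from the cruise ship cohort table.
observed = [[41, 89],    # ate tomatoes: ill, not ill
            [19, 151]]   # no tomatoes:  ill, not ill

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected value = row total x column total / grand total, for every cell.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
print(expected)
```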
To calculate the chi-square statistic you use the observed values from Table 2a (the cohort study table) and the expected values from Table 2c (the expected values table)
Formula is [(Observed - Expected)² / Expected] for each cell of the table
Table. Chi-square calculation for exposure to tomatoes
               Salmonella: Yes           Salmonella: No
Tomatoes       (41-26)² / 26 = 8.7       (89-104)² / 104 = 2.2
No Tomatoes    (19-34)² / 34 = 6.6       (151-136)² / 136 = 1.7
Total          60                        240                       300
The chi-square (χ²) for this example is 19.2
  8.7 + 2.2 + 6.6 + 1.7 = 19.2
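The same arithmetic can be sketched in code. Summing the unrounded cell contributions gives about 19.1. The 19.2 quoted above comes from adding the contributions after each has been rounded to one decimal place, so both figures describe the same calculation.

```python
observed = [[41, 89], [19, 151]]
expected = [[26, 104], [34, 136]]

# (Observed - Expected)^2 / Expected for each cell.
contributions = [
    [(o - e) ** 2 / e for o, e in zip(obs_row, exp_row)]
    for obs_row, exp_row in zip(observed, expected)
]

chi_square = sum(sum(row) for row in contributions)
rounded_cells = [[round(x, 1) for x in row] for row in contributions]
sum_of_rounded = sum(sum(row) for row in rounded_cells)

print(rounded_cells, round(chi_square, 1), round(sum_of_rounded, 1))
```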
Chi-Square Test
What does the chi-square tell you?
  In general, the higher the chi-square value, the greater the likelihood there is a statistically significant difference between the two groups you are comparing
To know for sure, you need to look up the p-value in a chi-square table
We will discuss p-values after a discussion of different types of chi-square tests
Types of Chi-Square Tests
Many computer programs give different types of chi-square tests
Each test is best suited to certain situations
Most commonly calculated chi-square test is Pearson's chi-square
  Use Pearson's chi-square for a fairly large sample (>100)
Types of Statistical Tests
Figure: Parade of Statistics Guys

The right test...               To use when...
Pearson chi-square              Sample size >100; expected cell counts >10
Yates chi-square (corrected)    Sample size >30; expected cell counts ≥5
Mantel-Haenszel chi-square      Sample size >30; variables are ordinal
Fisher's exact test             Sample size <30 and/or expected cell counts <5
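The table can be read as a small set of rules. The sketch below is one possible encoding, not an authoritative procedure: the function name `suggest_test`, the order in which the rules are checked, and the handling of a sample size of exactly 30 (which the table does not cover) are all assumptions made for illustration.

```python
def suggest_test(sample_size, min_expected_count, ordinal=False):
    """Suggest a test from the rule-of-thumb table (illustrative sketch only)."""
    # Small samples or sparse cells: Fisher's exact test.
    if sample_size < 30 or min_expected_count < 5:
        return "Fisher's exact test"
    # Ordinal variables: Mantel-Haenszel chi-square.
    if ordinal:
        return "Mantel-Haenszel chi-square"
    # Large sample with well-filled cells: Pearson chi-square.
    if sample_size > 100 and min_expected_count > 10:
        return "Pearson chi-square"
    # Everything in between (assumed to include a sample size of exactly 30).
    return "Yates chi-square (corrected)"


# Cruise ship example: 300 people, smallest expected cell count is 26.
print(suggest_test(300, 26))
# A hypothetical small study with a sparse cell.
print(suggest_test(15, 2))
```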
Using Statistical Tests: Examples from Actual Studies
In each study, investigators chose the type of test that best applied to the situation. Note: while the chi-square value is used to determine the corresponding p-value, often only the p-value is reported.
Pearson (Uncorrected) Chi-Square: A North Carolina study investigated 955 individuals because they were identified as partners of someone who tested positive for HIV. The study found that the proportion of partners who got tested for HIV differed significantly by race/ethnicity (p-value <0.001). The study also found that HIV-positive rates did not differ by race/ethnicity among the 610 who were tested (p = 0.4). (6)

Additional examples:
Yates (Corrected) Chi-Square: In an outbreak of Salmonella gastroenteritis associated with eating at a restaurant, 14 of 15 ill patrons studied had eaten the Caesar salad, while 0 of 11 well patrons had eaten the salad (p-value <0.01). The dressing on the salad was made from raw eggs that were probably contaminated with Salmonella. (7)
Fisher's Exact Test: A study of Group A Streptococcus (GAS) among children attending daycare found that 7 of 11 children who spent 30 or more hours per week in daycare had laboratory-confirmed GAS, while 0 of 4 children spending less than 30 hours per week in daycare had GAS (p-value <0.01). (8)
Using our hypothetical cruise ship Salmonella outbreak:
  32% of people who ate tomatoes got Salmonella as compared with 11% of people who did not eat tomatoes
How do we know whether the difference between 32% and 11% is a "real" difference?
  In other words, how do we know that our chi-square value (calculated as 19.2) indicates a statistically significant difference?
  The p-value is our indicator
Many statistical tests give both a numeric result (e.g. a chi-square value) and a p-value
The p-value ranges between 0 and 1
What does the p-value tell you?
  The p-value is the probability of getting the result you got, assuming that the two groups you are comparing are actually the same
Start by assuming there is no difference in outcomes between the groups
Look at the test statistic and p-value to see if they indicate otherwise
  A low p-value means that (assuming the groups are the same) the probability of observing these results by chance is very small
    Difference between the two groups is statistically significant
  A high p-value means that the two groups were not that different
  A p-value of 1 means that there was no difference between the two groups
Generally, if the p-value is less than 0.05, the difference observed is considered statistically significant, i.e. the difference is unlikely to have happened by chance
You may use a number of statistical tests to obtain the p-value
  Test used depends on type of data you have
Chi-Squares and P-Values
If the chi-square statistic is small, the observed and expected data were not very different and the p-value will be large
If the chi-square statistic is large, this generally means the p-value is small, and the difference could be statistically significant
Example: Outbreak of E. coli O157:H7 associated with swimming in a lake (1)
  Case-patients much more likely than controls to have taken lake water in their mouth (p-value =0.002) and swallowed lake water (p-value =0.002)
  Because p-values were each less than 0.05, both exposures were considered statistically significant risk factors
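For a 2x2 table the chi-square statistic has 1 degree of freedom, and in that special case the p-value can be computed without a lookup table. The sketch below relies on the fact that a chi-square variable with 1 degree of freedom is a squared standard normal. It applies only to 1 degree of freedom, and the function name is illustrative.

```python
import math


def chi_square_p_value_1df(chi_square):
    """Upper-tail p-value for a chi-square statistic with 1 degree of freedom."""
    # With 1 df the statistic is a squared standard normal, so
    # P(X > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi_square / 2))


# Cruise ship example: chi-square of 19.2 from the 2x2 tomato table.
p = chi_square_p_value_1df(19.2)
print(p < 0.05)

# A small statistic gives a large p-value, a large statistic a small one.
print(chi_square_p_value_1df(0.5) > chi_square_p_value_1df(10.0))
```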
Note: Assumptions
Statistical tests such as the chi-square assume that the observations are independent
  Independence: value of one observation does not influence value of another
If this assumption is not true, you may not use the chi-square test
Do not use chi-square tests with:
  Repeat observations of the same group of people (e.g. pre- and post-tests)
  Matched pair designs in which cases and controls are matched on variables such as sex and age
Analysis of Continuous Data
Data do not always fit into discrete categories
Continuous numeric data may be of interest in a field investigation such as:
  Clinical symptoms between groups of patients
  Average age of patients compared to average age of non-patients
  Respiratory rate of those exposed to a chemical vs. respiratory rate of those who were not exposed
May compare continuous data through the Analysis Of Variance (ANOVA) test
Most statistical software programs will calculate ANOVA
  Output varies slightly in different programs
  For example, using Epi Info software:
    Generates 3 pieces of information: ANOVA results, Bartlett's test and Kruskal-Wallis test
When comparing continuous variables between groups of study subjects:
  Use a t-test for comparing 2 groups
  Use an f-test for comparing 3 or more groups
  Both tests result in a p-value
ANOVA uses either the t-test or the f-test
Example: testing age differences between 2 groups
  If groups have similar average ages and a similar distribution of age values, t-statistic will be small and the p-value will not be significant
  If average ages of 2 groups are different, t-statistic will be larger and p-value will be smaller (p-value <0.05 indicates two groups have significantly different ages)
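To make the age example concrete, the sketch below computes a two-sample t-statistic under the equal-variance assumption (the pooled form). The ages are hypothetical. Converting the statistic into a p-value needs the t distribution with n1 + n2 - 2 degrees of freedom, which statistical software supplies, so that step is left out here.

```python
import math
import statistics


def pooled_t_statistic(group1, group2):
    """Two-sample t-statistic assuming both groups share a common variance."""
    n1, n2 = len(group1), len(group2)
    mean1, mean2 = statistics.mean(group1), statistics.mean(group2)
    var1, var2 = statistics.variance(group1), statistics.variance(group2)
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
    standard_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / standard_error


# Hypothetical ages, for illustration only.
group_a = [30, 34, 38, 42, 46]
similar_b = [31, 35, 39, 43, 47]     # nearly the same average age as group_a
different_b = [50, 54, 58, 62, 66]   # clearly older than group_a

t_similar = pooled_t_statistic(group_a, similar_b)
t_different = pooled_t_statistic(group_a, different_b)
print(f"similar groups: t = {t_similar:.2f}; different groups: t = {t_different:.2f}")
```

The sign only reflects which group is listed first. It is the size of the t-statistic that grows as the group averages move apart.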
ANOVA and Bartlett's Test
Critical assumption with t-tests and f-tests: groups have similar variances (e.g., "spread" of age values)
As part of the ANOVA analysis, software conducts a separate test to compare variances: Bartlett's test for equality of variance
Bartlett's test:
  Produces a p-value
  If Bartlett's p-value >0.05, (not significant) OK to use ANOVA
  Bartlett's p-value <0.05, variances in the groups are NOT the same and you cannot use the ANOVA results
Kruskal-Wallis Test
Kruskal-Wallis test: generated by Epi Info software
  Used only if Bartlett's test reveals variances dissimilar enough so that you can't use ANOVA
  Does not make assumptions about variance, examines the distribution of values within each group
  Generates a p-value
    If p-value >0.05 there is not a significant difference between groups
    If p-value <0.05 there is a significant difference between groups
Analysis of Continuous Data
Decision tree for analysis of continuous data.
Bartlett's test for equality of variance: p-value >0.05?
  YES: use ANOVA test
    p<0.05: difference between groups is statistically significant
    p>0.05: difference between groups is NOT statistically significant
  NO: use Kruskal-Wallis test
    p<0.05: difference between groups is statistically significant
    p>0.05: difference between groups is NOT statistically significant
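The decision tree can be written as a short function. This is a sketch under stated assumptions: the three p-values are taken as already produced by software such as Epi Info, the function name is illustrative, and a p-value of exactly 0.05 (which the tree does not address) is treated as not significant.

```python
def interpret_continuous_comparison(bartlett_p, anova_p, kruskal_wallis_p, alpha=0.05):
    """Pick the test by Bartlett's p-value, then read that test's p-value."""
    if bartlett_p > alpha:
        # Variances are similar enough: the ANOVA result can be used.
        test, p_value = "ANOVA", anova_p
    else:
        # Variances differ: fall back to the Kruskal-Wallis result.
        test, p_value = "Kruskal-Wallis", kruskal_wallis_p
    significant = p_value < alpha
    return test, significant


# Hypothetical p-values as they might be read off software output.
first = interpret_continuous_comparison(bartlett_p=0.40, anova_p=0.01, kruskal_wallis_p=0.02)
second = interpret_continuous_comparison(bartlett_p=0.01, anova_p=0.03, kruskal_wallis_p=0.20)
print(first, second)
```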
In field epidemiology a few calculations and tests make up the core of analytic methods
Learning these methods will provide a good set of field epidemiology skills.
  Confidence intervals, p-values, chi-square tests, ANOVA and their interpretation
Further data analysis may require methods to control for confounding including matching and logistic regression