Uploaded by AzharulHaqueAl-Amin

Data Analysis Basics

Types of Variables

Types of variables indicate which estimates you can calculate and which statistical tests you should use

Continuous variables:

Always numeric

Generally calculate measures such as the mean, median and standard deviation

Categorical variables:

Information that can be sorted into categories

Field investigations are often interested in dichotomous, or binary (2-level), categorical variables

Cannot calculate mean or median but can calculate risk

Measures of Association

Strength of the association between two variables, such as an exposure and a disease

The two measures of association used most often are the relative risk, or risk ratio (RR), and the odds ratio (OR)

The decision to calculate an RR or an OR depends on the study design

Interpretation of RR and OR:

RR or OR = 1: exposure has no association with disease

RR or OR > 1: exposure may be positively associated with disease

RR or OR < 1: exposure may be negatively associated with disease

Risk Ratio or Odds Ratio?

Risk ratio

Used when comparing outcomes of those who were exposed to something to those who were not exposed

Calculated in cohort studies

Cannot be calculated in case-control studies because the entire population at risk is not included in the study

Odds ratio

Used in case-control studies

Odds of exposure among cases divided by odds of exposure among controls

Provides a rough estimate of the risk ratio

A 2x2 table is commonly used with dichotomous variables to compare groups of people

The table puts one dichotomous variable across the rows and another dichotomous variable along the columns

Useful in determining the association between a dichotomous exposure and a dichotomous outcome

Calculating an Odds Ratio

Sample 2x2 table for Hepatitis A at Restaurant A

                     Outcome
Exposure             Hepatitis A    No Hepatitis A    Total
Ate salsa            218            45                263
Did not eat salsa    21             85                106
Total                239            130               369

Table displays data from a case-control study conducted in Pennsylvania in 2003 (2)

Can calculate the odds ratio, where a, b, c and d are the four inner cells read left to right, top to bottom:

$$\mathrm{OR} = \frac{ad}{bc} = \frac{(218)(85)}{(45)(21)} = 19.6$$
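The cross-product calculation above can be sketched in Python. This is a minimal illustration only: the function name `odds_ratio` and the parameter names are assumptions chosen to match the a, b, c, d cell labels, not something defined in the slides.

```python
def odds_ratio(a: int, b: int, c: int, d: int) -> float:
    """Cross-product odds ratio for a 2x2 table.

    a = exposed with outcome,    b = exposed without outcome
    c = unexposed with outcome,  d = unexposed without outcome
    """
    if b == 0 or c == 0:
        raise ValueError("odds ratio is undefined when b or c is zero")
    return (a * d) / (b * c)


# Counts from the Hepatitis A / salsa table
salsa_or = odds_ratio(a=218, b=45, c=21, d=85)
print(round(salsa_or, 1))
```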

Confidence Intervals

Point estimate: a calculated estimate (like risk or odds) or measure of association (risk ratio or odds ratio)

The confidence interval (CI) of a point estimate describes the precision of the estimate

The CI represents a range of values on either side of the estimate

The narrower the CI, the more precise the point estimate (3)

Confidence Intervals - Example

Example: large bag of 500 red, green and blue marbles:

You want to know the percentage of green marbles but don't want to count every marble

Shake up the bag and select 50 marbles to give an estimate of the percentage of green marbles

Sample of 50 marbles:

15 green marbles, 10 red marbles, 25 blue marbles

Based on the sample we conclude 30% (15 out of 50) of the marbles are green

30% = point estimate

How confident are we in this estimate?

Actual percentage of green marbles could be higher or lower, i.e., the sample of 50 may not reflect the distribution in the entire bag of marbles

Can calculate a confidence interval to determine the degree of uncertainty

Calculating Confidence Intervals

How do you calculate a confidence interval?

Can do so by hand or use a statistical program

Epi Info, SAS, STATA, SPSS and Episheet are common statistical programs

Default is usually a 95% confidence interval but this can be adjusted to 90%, 99% or any other level

Confidence Intervals

Most commonly used confidence interval is the 95% interval

A 95% CI indicates that our estimated range has a 95% chance of containing the true population value

Assume that the 95% CI for our bag of marbles example is 17-43%

We estimated that 30% of the marbles are green:

CI tells us that the true percentage of green marbles is most likely between 17% and 43%

There is a 5% chance that this range (17-43%) does not contain the true percentage of green marbles

Confidence Intervals

If we want less chance of error we could calculate a 99% confidence interval

A 99% CI will have only a 1% chance of error but will have a wider range

99% CI for green marbles is 13-47%

If a higher chance of error is acceptable we could calculate a 90% confidence interval

90% CI for green marbles is 19-41%

Very narrow confidence intervals indicate a very precise estimate

Can get a more precise estimate by taking a larger sample

100 marble sample with 30 green marbles

Point estimate stays the same (30%)

95% confidence interval is 21-39% (rather than 17-43% for original sample of 50 marbles)

200 marble sample with 60 green marbles

Point estimate is 30%

95% confidence interval is 24-36%

CI becomes narrower as the sample size increases
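The slides do not say which method produced the marble intervals. As a hedged sketch, the normal-approximation (Wald) interval for a proportion gives the same ranges after rounding, so it is used here as an assumption. The function name `proportion_ci` is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def proportion_ci(successes: int, n: int, level: float = 0.95) -> tuple[float, float]:
    """Normal-approximation (Wald) confidence interval for a proportion."""
    p = successes / n
    # Two-sided critical value, e.g. about 1.96 for a 95% interval
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    margin = z * sqrt(p * (1 - p) / n)
    return p - margin, p + margin


# (green marbles, sample size, confidence level) for each case in the example
cases = [(15, 50, 0.95), (15, 50, 0.99), (15, 50, 0.90),
         (30, 100, 0.95), (60, 200, 0.95)]
for green, n, level in cases:
    low, high = proportion_ci(green, n, level)
    print(f"n={n}, {level:.0%} CI: {low:.0%} to {high:.0%}")
```

The margin shrinks with the square root of the sample size, which is why quadrupling the sample from 50 to 200 marbles roughly halves the width of the interval.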

Back to the example of Hepatitis A in a Pennsylvania restaurant:

Odds ratio = 19.6

95% confidence interval of 11.0-34.9 (95% chance that the range 11.0-34.9 contained the true OR)

Lower bound of the CI in this example is 11.0 (i.e., >1)

An odds ratio of 1 means there is no difference between the two groups; OR > 1 indicates a greater risk among the exposed

Conclusion: people who ate salsa were truly more likely to become ill than those who did not eat salsa
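The slides quote the interval for the odds ratio without showing how it was obtained. One common approach, assumed here, is the log-based (Woolf) method, which gives the same bounds for the salsa table after rounding. The function name `odds_ratio_ci` is illustrative.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def odds_ratio_ci(a: int, b: int, c: int, d: int,
                  level: float = 0.95) -> tuple[float, float]:
    """Log-based (Woolf) confidence interval for a 2x2 odds ratio."""
    log_or = log((a * d) / (b * c))
    # Standard error of ln(OR): square root of the summed reciprocal cell counts
    se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return exp(log_or - z * se), exp(log_or + z * se)


# Hepatitis A / salsa table: a=218, b=45, c=21, d=85
low, high = odds_ratio_ci(218, 45, 21, 85)
print(f"95% CI: {low:.1f} to {high:.1f}")
```

Because the whole interval lies above 1, the data are not consistent with "no association" at the 95% level.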

Confidence Intervals

Must include CIs with your point estimates to give a sense of the precision of your estimates

Examples:

Outbreak of gastrointestinal illness at 2 primary schools in Italy (4)

Children who ate corn/tuna salad had 6.19 times the risk of becoming ill as children who did not eat salad

95% confidence interval: 4.81-7.98

Pertussis outbreak in Oregon (5)

Case-patients had 6.4 times the odds of living with a 6-10 year-old child than controls

95% confidence interval: 1.8-23.4

Conclusion: true association between exposure and disease in both examples

Analysis of Categorical Data

Measure of association (risk ratio or odds ratio)

Confidence interval

Chi-square test

A formal statistical test to determine whether results are statistically significant

Chi-Square Statistics

A common analysis is whether Disease X occurs as much among people in Group A as it does among people in Group B

People are often sorted into groups based on their exposure to some disease risk factor

We then perform a test of the association between exposure and disease in the two groups

Chi-Square Test: Example

Hypothetical outbreak of Salmonella on a cruise ship

Retrospective cohort study conducted

All 300 people on cruise ship interviewed, 60 had symptoms consistent with Salmonella

Questionnaires indicate many of the case-patients ate tomatoes from the salad bar

Table 2a. Cohort study: Exposure to tomatoes and Salmonella infection

               Salmonella?
               Yes    No     Total
Tomatoes       41     89     130
No Tomatoes    19     151    170
Total          60     240    300

To see if there is a statistical difference in the amount of illness between those who ate tomatoes (41/130) and those who did not (19/170) we could conduct a chi-square test
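The two proportions being compared are the attack rates in each exposure group. The slides do not report a risk ratio for the cruise ship, so the following is an illustrative sketch only: because this is a cohort study, the risk ratio described earlier can be computed from the same counts. The helper `risk` is a hypothetical name.

```python
def risk(cases: int, total: int) -> float:
    """Proportion of a group that became ill (attack rate)."""
    return cases / total


risk_exposed = risk(41, 130)     # ate tomatoes
risk_unexposed = risk(19, 170)   # did not eat tomatoes
risk_ratio = risk_exposed / risk_unexposed

print(f"{risk_exposed:.0%} vs {risk_unexposed:.0%}, RR = {risk_ratio:.2f}")
```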

To conduct a chi-square test the following conditions must be met:

There must be at least a total of 30 observations (people) in the table

Each cell must contain a count of 5 or more

To conduct a chi-square test we compare the observed data (from study results) with the data we would expect to see

Table 2b. Row and column totals for tomatoes and Salmonella infection

               Salmonella?
               Yes    No     Total
Tomatoes                     130
No Tomatoes                  170
Total          60     240    300

Gives the overall distribution of people who ate tomatoes and of people who became sick

Based on these distributions we can fill in the empty cells with the expected values

Chi-Square Test: Example (cont.)

$$\text{Expected value} = \frac{\text{Row total} \times \text{Column total}}{\text{Grand total}}$$

For the first cell, people who ate tomatoes and became ill:

$$\text{Expected value} = \frac{130 \times 60}{300} = 26$$

Same formula can be used to calculate the expected values for each of the cells

Chi-Square Test: Example (cont.)

Table 2c. Expected values for exposure to tomatoes

               Salmonella?
               Yes                      No                         Total
Tomatoes       (130 x 60) / 300 = 26    (130 x 240) / 300 = 104    130
No Tomatoes    (170 x 60) / 300 = 34    (170 x 240) / 300 = 136    170
Total          60                       240                        300
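The expected-value formula applies the same way to every cell, so it can be written once and run over the whole table. This is a minimal sketch; the function name `expected_counts` and the nested-list layout of the table are assumptions made for illustration.

```python
def expected_counts(observed: list[list[float]]) -> list[list[float]]:
    """Expected cell counts: row total x column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


# Rows: tomatoes, no tomatoes; columns: Salmonella yes, Salmonella no
observed = [[41, 89], [19, 151]]
print(expected_counts(observed))
```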

To calculate the chi-square statistic you use the observed values from Table 2a and the expected values from Table 2c

Formula is $\frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$ for each cell of the table

Table 2d. Chi-square contribution of each cell for exposure to tomatoes

               Salmonella?
               Yes                       Total
                                         No                          
Tomatoes       (41 - 26)^2 / 26 = 8.7    (89 - 104)^2 / 104 = 2.2    130
No Tomatoes    (19 - 34)^2 / 34 = 6.6    (151 - 136)^2 / 136 = 1.7   170
Total          60                        240                         300

The chi-square ($\chi^2$) for this example is 19.2

8.7 + 2.2 + 6.6 + 1.7 = 19.2
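The per-cell calculation and the final sum can be sketched as one function. This is an illustrative implementation of the uncorrected (Pearson) statistic; the name `chi_square` is an assumption. Note that summing the unrounded terms gives about 19.1, while the slides' 19.2 comes from adding terms already rounded to one decimal place.

```python
def chi_square(observed: list[list[float]]) -> float:
    """Pearson (uncorrected) chi-square statistic for a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            # Contribution of this cell: (observed - expected)^2 / expected
            statistic += (obs - expected) ** 2 / expected
    return statistic


print(round(chi_square([[41, 89], [19, 151]]), 1))
```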

Chi-Square Test

What does the chi-square tell you?

In general, the higher the chi-square value, the greater the likelihood there is a statistically significant difference between the two groups you are comparing

To know for sure, you need to look up the p-value in a chi-square table

We will discuss p-values after a discussion of different types of chi-square tests

Types of Chi-Square Tests

Many computer programs give different types of chi-square tests

Each test is best suited to certain situations

Most commonly calculated chi-square test is Pearson's chi-square

Use Pearson's chi-square for a fairly large sample (>100)

Types of Statistical Tests

Figure: Parade of Statistics Guys

The right test...                   To use when...
Pearson chi-square (uncorrected)    Sample size >100; expected cell counts >10
Yates chi-square (corrected)        Sample size >30; expected cell counts ≥5
Mantel-Haenszel chi-square          Sample size >30; variables are ordinal
Fisher's exact test                 Sample size <30 and/or expected cell counts <5

Using Statistical Tests: Examples from Actual Studies

In each study, investigators chose the type of test that best applied to the situation (Note: while the chi-square value is used to determine the corresponding p-value, often only the p-value is reported.)

Pearson (Uncorrected) Chi-Square: A North Carolina study investigated 955 individuals because they were identified as partners of someone who tested positive for HIV. The study found that the proportion of partners who got tested for HIV differed significantly by race/ethnicity (p-value <0.001). The study also found that HIV-positive rates did not differ by race/ethnicity among the 610 who were tested (p = 0.4). (6)

Additional examples:

Yates (Corrected) Chi-Square: In an outbreak of Salmonella gastroenteritis associated with eating at a restaurant, 14 of 15 ill patrons studied had eaten the Caesar salad, while 0 of 11 well patrons had eaten the salad (p-value <0.01). The dressing on the salad was made from raw eggs that were probably contaminated with Salmonella. (7)

Fisher's Exact Test: A study of Group A Streptococcus (GAS) among children attending daycare found that 7 of 11 children who spent 30 or more hours per week in daycare had laboratory-confirmed GAS, while 0 of 4 children spending less than 30 hours per week in daycare had GAS (p-value <0.01). (8)
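For the Caesar salad counts, a Yates-corrected statistic can be sketched as follows. The study's own computation is not shown in the slides, so this uses the standard continuity-correction formula as an assumption. The function name `yates_chi_square` is illustrative, and 6.635 is the usual chi-square cut-off for p = 0.01 with 1 degree of freedom.

```python
def yates_chi_square(a: int, b: int, c: int, d: int) -> float:
    """Yates continuity-corrected chi-square for a 2x2 table."""
    n = a + b + c + d
    # Subtract n/2 from |ad - bc|, never letting the corrected value go below zero
    corrected = max(abs(a * d - b * c) - n / 2, 0)
    return n * corrected ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


CRITICAL_VALUE_P01 = 6.635  # chi-square, 1 degree of freedom, p = 0.01

# Ill patrons: 14 ate the salad, 1 did not; well patrons: 0 ate it, 11 did not
stat = yates_chi_square(14, 1, 0, 11)
print(round(stat, 2), stat > CRITICAL_VALUE_P01)
```

A statistic above the cut-off corresponds to a p-value below 0.01, in line with the reported result.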

P-Values

Using our hypothetical cruise ship Salmonella outbreak:

32% of people who ate tomatoes got Salmonella as compared with 11% of people who did not eat tomatoes

How do we know whether the difference between 32% and 11% is a "real" difference?

In other words, how do we know that our chi-square value (calculated as 19.2) indicates a statistically significant difference?

The p-value is our indicator

P-Values

Many statistical tests give both a numeric result (e.g. a chi-square value) and a p-value

The p-value ranges between 0 and 1

What does the p-value tell you?

The p-value is the probability of getting a result at least as extreme as the one you got, assuming that the two groups you are comparing are actually the same

Start by assuming there is no difference in outcomes between the groups

Look at the test statistic and p-value to see if they indicate otherwise

A low p-value means that (assuming the groups are the same) the probability of observing these results by chance is very small

Difference between the two groups is statistically significant

A high p-value means that the two groups were not that different

A p-value of 1 means that there was no difference between the two groups

Generally, if the p-value is less than 0.05, the difference observed is considered statistically significant, i.e., the difference is unlikely to have happened by chance

You may use a number of statistical tests to obtain the p-value

Test used depends on type of data you have
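As an alternative to a printed chi-square table, the p-value for a 2x2 table can be computed directly. A 2x2 table has 1 degree of freedom, and in that case the tail probability has a closed form using the complementary error function. This is a minimal sketch for the 1-degree-of-freedom case only; the function name `chi_square_p_value_1df` is an assumption.

```python
from math import erfc, sqrt


def chi_square_p_value_1df(statistic: float) -> float:
    """Upper-tail p-value for a chi-square statistic with 1 degree of freedom.

    With 1 degree of freedom the statistic is a squared standard normal
    value, so its tail probability equals erfc(sqrt(statistic / 2)).
    """
    return erfc(sqrt(statistic / 2))


# Chi-square value from the cruise ship example
p = chi_square_p_value_1df(19.2)
print(f"p = {p:.6f}; significant at 0.05: {p < 0.05}")
```

For larger tables the degrees of freedom change and this shortcut no longer applies; a statistical package would be the usual choice there.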

Chi-Squares and P-Values

If the chi-square statistic is small, the observed and expected data were not very different and the p-value will be large

If the chi-square statistic is large, this generally means the p-value is small, and the difference could be statistically significant

Example: Outbreak of E. coli O157:H7 associated with swimming in a lake (1)

Case-patients much more likely than controls to have taken lake water in their mouth (p-value = 0.002) and swallowed lake water (p-value = 0.002)

Because p-values were each less than 0.05, both exposures were considered statistically significant risk factors

Note: Assumptions

Statistical tests such as the chi-square assume that the observations are independent

Independence: value of one observation does not influence value of another

If this assumption is not true, you may not use the chi-square test

Do not use chi-square tests with:

Repeat observations of the same group of people (e.g. pre- and post-tests)

Matched pair designs in which cases and controls are matched on variables such as sex and age

Analysis of Continuous Data

Data do not always fit into discrete categories

Continuous numeric data may be of interest in a field investigation such as:

Clinical symptoms between groups of patients

Average age of patients compared to average age of non-patients

Respiratory rate of those exposed to a chemical vs. respiratory rate of those who were not exposed

ANOVA

May compare continuous data through the Analysis Of Variance (ANOVA) test

Most statistical software programs will calculate ANOVA

Output varies slightly in different programs

For example, using Epi Info software:

Generates 3 pieces of information: ANOVA results, Bartlett's test and Kruskal-Wallis test

When comparing continuous variables between groups of study subjects:

Use a t-test for comparing 2 groups

Use an f-test for comparing 3 or more groups

Both tests result in a p-value

ANOVA uses either the t-test or the f-test

Example: testing age differences between 2 groups

If groups have similar average ages and a similar distribution of age values, t-statistic will be small and the p-value will not be significant

If average ages of 2 groups are different, t-statistic will be larger and p-value will be smaller (p-value <0.05 indicates two groups have significantly different ages)
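To make the link between the t-test and the f-test concrete, the sketch below computes both statistics by hand. The ages are hypothetical values invented for illustration, and the function names `one_way_f` and `pooled_t` are assumptions. Only the test statistics are computed; converting them to p-values needs the t and F distributions, which a statistical package would supply.

```python
from statistics import mean


def one_way_f(groups: list[list[float]]) -> float:
    """F statistic for one-way ANOVA: between-group over within-group mean square."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


def pooled_t(g1: list[float], g2: list[float]) -> float:
    """Two-sample t statistic assuming equal variances in both groups."""
    n1, n2 = len(g1), len(g2)
    ss1 = sum((x - mean(g1)) ** 2 for x in g1)
    ss2 = sum((x - mean(g2)) ** 2 for x in g2)
    pooled_var = (ss1 + ss2) / (n1 + n2 - 2)
    return (mean(g1) - mean(g2)) / (pooled_var * (1 / n1 + 1 / n2)) ** 0.5


# Hypothetical ages, invented purely for illustration
patients = [34, 41, 29, 52, 47, 38]
non_patients = [25, 31, 28, 36, 30, 27]
t = pooled_t(patients, non_patients)
f = one_way_f([patients, non_patients])
print(round(t, 2), round(f, 2))
```

With exactly 2 groups the F statistic equals the square of the pooled t statistic, which is one way to see why the two tests can be treated as members of the same family.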

ANOVA and Bartlett's Test

Critical assumption with t-tests and f-tests: groups have similar variances (e.g., "spread" of age values)

As part of the ANOVA analysis, software conducts a separate test to compare variances: Bartlett's test for equality of variance

Bartlett's test:

Produces a p-value

If Bartlett's p-value >0.05 (not significant), OK to use ANOVA results

If Bartlett's p-value <0.05, variances in the groups are NOT the same and you cannot use the ANOVA results

Kruskal-Wallis Test

Kruskal-Wallis test: generated by Epi Info software

Used only if Bartlett's test reveals variances dissimilar enough so that you can't use ANOVA

Does not make assumptions about variance, examines the distribution of values within each group

Generates a p-value

If p-value >0.05 there is not a significant difference between groups

If p-value <0.05 there is a significant difference between groups

Analysis of Continuous Data

Decision tree for analysis of continuous data.

Bartlett's test for equality of variance: p-value >0.05?

YES: Use ANOVA test

    p <0.05: Difference between groups is statistically significant

    p >0.05: Difference between groups is NOT statistically significant

NO: Use Kruskal-Wallis test

    p <0.05: Difference between groups is statistically significant

    p >0.05: Difference between groups is NOT statistically significant
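The decision tree can be sketched as a small function that takes the three p-values a package such as Epi Info reports. The function name `continuous_data_conclusion` and the example p-values are hypothetical. The slides do not say how to treat a p-value of exactly 0.05, so this sketch assumes that value counts as not significant for the ANOVA and Kruskal-Wallis results, and sends a Bartlett's p-value of exactly 0.05 to Kruskal-Wallis.

```python
def continuous_data_conclusion(bartlett_p: float, anova_p: float,
                               kruskal_wallis_p: float, alpha: float = 0.05) -> str:
    """Follow the decision tree: Bartlett's test chooses which p-value to read."""
    if bartlett_p > alpha:
        # Variances look similar, so the ANOVA result can be used
        test, p = "ANOVA", anova_p
    else:
        # Variances differ, so fall back to the Kruskal-Wallis result
        test, p = "Kruskal-Wallis", kruskal_wallis_p
    verdict = "statistically significant" if p < alpha else "NOT statistically significant"
    return f"{test}: difference between groups is {verdict}"


# Hypothetical p-values as they might appear in software output
print(continuous_data_conclusion(bartlett_p=0.40, anova_p=0.01, kruskal_wallis_p=0.03))
```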

Conclusion

In field epidemiology a few calculations and tests make up the core of analytic methods

Learning these methods will provide a good set of field epidemiology skills:

Confidence intervals, p-values, chi-square tests, ANOVA and their interpretations

Further data analysis may require methods to control for confounding including matching and logistic regression
