Statistics Basis To Advance

Inferential Statistics
Dr Mahalakshmy T
Special thanks:
Dr Ajay Kumar, The Union
Dr. Palanivel C
What is intended?
• APPROACH to analysis
• When to use which statistic?

What is descriptive statistics?
What is inferential statistics?
Descriptive statistics
Description of the sample

Population
Research
question
Sample
Inferential statistics
Generalisation to the population
Conclusions based
on the sample
Population
Research
question
Sample
Examples of descriptive and
inferential statistics
TYPES OF VARIABLES
VARIABLES
QUALITATIVE (Categorical): NOMINAL, ORDINAL
QUANTITATIVE (Numeric): DISCRETE, CONTINUOUS

How to summarize a quantitative
variable?
• Measures of central tendency!!
• Measures of dispersion
How to summarize a quantitative
variable?
• Mean (SD) or Median (IQR/Range)
• Mean difference
• Correlation coefficient
• Regression coefficient (beta)

How to summarize a qualitative/categorical
variable?
• Percentage /proportion
• Rate
• Ratio (odds ratio, relative risk, risk ratio)
• Risk difference (AR, AR%)

• Beta coefficient (Logistic regression)
Inferential statistics
– Estimation (Confidence Interval)
– Hypothesis testing (Statistical Tests of Significance)
What is SD and SE?
Mean and Standard Deviation
Mean ± 1 SD: 68.3% of the values
Mean ± 2 SD: 95.5% of the values
- 3SD - 2SD - 1SD Mean + 1SD + 2SD + 3SD
95% of values lie within mean ± 1.96 SD
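The percentages on this slide can be checked with a short sketch, assuming the values follow a normal distribution. The function name `coverage` is only illustrative.

```python
from statistics import NormalDist


def coverage(k):
    """Proportion of a normal distribution lying within mean ± k SD."""
    z = NormalDist()  # standard normal: mean 0, SD 1
    return z.cdf(k) - z.cdf(-k)


for k in (1, 1.96, 2, 3):
    print(f"mean ± {k} SD covers {coverage(k):.1%} of the values")
```

The 1.96 multiplier is simply the point at which the coverage reaches 95%, which is why it reappears in the 95% confidence interval.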

Sampling
From a population of 4 (A, B, C, D; variable: weight), list all the possible samples of 2:
AB, AC, AD, BC, BD, CD
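The exercise above can be sketched in a few lines. The weights are hypothetical (the slide gives none); the point is only that each possible sample gives a different mean, and the spread of those means is what the standard error describes.

```python
from itertools import combinations
from statistics import mean, pstdev

# Hypothetical weights (kg) for the four members of the population.
population = {"A": 50, "B": 60, "C": 70, "D": 80}

# Every possible sample of size 2, with its sample mean.
sample_means = {
    "".join(pair): mean([population[p] for p in pair])
    for pair in combinations(population, 2)
}

for name, m in sample_means.items():
    print(name, m)

# The sample means scatter around the population mean;
# their spread is the sample-to-sample variability.
population_mean = mean(list(population.values()))
spread_of_means = pstdev(list(sample_means.values()))
print("population mean:", population_mean)
print("SD of the sample means:", round(spread_of_means, 2))
```

In a real study only one of these samples is drawn, so the spread has to be estimated from that single sample rather than listed like this.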
Standard Deviation                     Standard Error
Descriptive statistics                 Inferential statistics
Variability of the one sample taken    Measure of sample to sample variability
                                       Measure of uncertainty
What is 95% Confidence
Interval?
What is the prevalence of
anemia among doctors?
Generalisation to the population
Conclusions based
on the sample
Population
Research
question
60%
Sample
20
Statistics – Confidence interval
(95% CI)
▪ A Confidence Interval is a range of values within
which the “true” population parameter is believed to
be found with a given level of confidence.
▪ A 95% CI means: We can be 95% confident that this
interval contains the true population value
Confidence interval (95%)
= mean ± 1.96 × Standard Error
Statistics – Confidence interval
Example
Study of HIV prevalence in TB patients at Site X
▪ Sample = 244 TB patients
▪ HIV prevalence = 70% (95% CI 65-75)
▪ Implies that we can be 95% confident that the
true HIV prevalence among the TB population
would lie anywhere between 65% and 75%.
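As a rough check of the example above, here is a sketch of a 95% CI for a proportion using the normal approximation (Wald method). The choice of method is an assumption; the slide does not say how its interval was computed.

```python
from math import sqrt


def proportion_ci(p, n, z=1.96):
    """Normal-approximation (Wald) confidence interval for a proportion."""
    se = sqrt(p * (1 - p) / n)  # standard error of the proportion
    return p - z * se, p + z * se


low, high = proportion_ci(p=0.70, n=244)
print(f"95% CI: {low:.1%} to {high:.1%}")
```

With 244 patients this comes out at roughly 64% to 76%, close to the 65-75 quoted on the slide. A larger sample shrinks the standard error and narrows the interval.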
The mean hemoglobin of the study
population is 12gm%,
SD = 1gm%,
95% CI = 11.5 – 12.5 gm%
Interpret
The immunization coverage of
Pondicherry is 70%,
95% CI = 60-80%
Interpret
The immunization coverage of
Pondicherry is 70%,
95% CI = 68-72%
What is the difference between this
and the earlier example?
A medical representative told the
doctor that the new drug has a cure
rate of 50%
95% CI = 5 – 95%
Hence the drug can be as effective as
achieving 95% cure rate
Comment
In a study to find the association
between smoking and CHD, the relative
risk (RR) was 2.1
95% CI being 0.5 to 6
Relative risk of 2.1 (95%CI: 0.5-6)
Interpret
What is alpha error and beta
error?
Errors in the study….
▪ Sampling and non sampling errors
▪ We are dealing always with samples and
generalizing to population, hence errors
can result
▪ There are two kinds of errors – alpha and
beta error.
                    TRUTH
STUDY               Association                   No association
Association         Power (1-beta)                Type I error (alpha error)
                                                  FP error
                                                  ‘P’ value
No association      Type II error (beta error)    Confidence (1-alpha)
                    FN error
The Judge's Dilemma
                                  TRUTH
JUDGE                             Guilty (associated with crime)    Innocent
Guilty (associated with crime)    Power (1-beta)                    Type I error (alpha error)
                                                                    FP error
Innocent                          Type II error (beta error)        Confidence (1-alpha)
                                  FN error
Which is a more serious error?
▪ Conventionally
• Accepted alpha error is 0.05
• Accepted Beta error is 0.2
▪ Alpha error is “error of commission” – it
changes practice
▪ Beta error is “error of omission”
What is p value?
‘p’ value < 0.05 – what does it really
mean?
▪ “p” is the probability of getting a result at least as
extreme as the one observed just by chance, if in
reality there is no association
▪ It is compared with the accepted alpha error
▪ Alpha error is… when we conclude that there is an
association but in reality there is no
association.
▪ Lower the p value, stronger the evidence
that the result is not just by chance…
Statistics – P-value
▪ Statistical significance:
• P value < 0.05 - Significant
• P value < 0.01 - Very significant
• P value < 0.001 - Highly significant
▪ Express exact P-values rather than
“significant” or “Non-significant”.
P value vs confidence interval
Confidence interval provides more
information than p value
▪ RR 4.3 ( 95% CI: 4.1-5.4) and P value=0.001

▪ Magnitude of the effect (strength of association)
▪ Direction of the effect (RR > or < 1)
▪ Precision of the point estimate of the effect
(variability)
The p value cannot provide these!
Statistical significance vs
Clinical significance
Comparison of weight before and after intervention
Scenario    Difference             95% CI                        P value
            (What was observed)    (What might be the truth)
Case 1      2                      (1, 3)                        <0.05
Case 2      30                     (20, 40)                      <0.05
Case 3      30                     (2, 58)                       <0.05
Case 4      1                      (-1, 3)                       >0.05
Case 5      2                      (-58, 62)                     >0.05
Case 6      30                     (-2, 62)                      >0.05
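The link between the 95% CI and the P value columns in the table above can be sketched as follows: a difference is significant at the 5% level when its 95% CI excludes 0, and the width of the interval shows the precision. The helper name `excludes_zero` is illustrative only.

```python
# (scenario, observed difference, lower 95% limit, upper 95% limit), from the table
cases = [
    ("Case 1", 2, 1, 3),
    ("Case 2", 30, 20, 40),
    ("Case 3", 30, 2, 58),
    ("Case 4", 1, -1, 3),
    ("Case 5", 2, -58, 62),
    ("Case 6", 30, -2, 62),
]


def excludes_zero(lower, upper):
    """True when the interval lies entirely above or entirely below 0."""
    return lower > 0 or upper < 0


for name, diff, lower, upper in cases:
    verdict = "p < 0.05" if excludes_zero(lower, upper) else "p > 0.05"
    print(f"{name}: difference {diff}, CI width {upper - lower}, {verdict}")
```

Whether a statistically significant difference such as Case 1 matters clinically still has to be judged from the size of the difference, not from this check.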
Clinical Significance v/s Statistical Significance
▪ Statistical significance does not prove causation

▪ Big ‘n’ can sometimes produce a small P even
though magnitude of effect is not clinically
significant
▪ Conversely, small ‘n’ can produce a non-significant
effect even though difference is real.
▪ Evaluate in the context of study
▪ Clinical significance is more important than
statistical significance
The analysis found the
association is not statistically
significant (p>0.05)
How do we proceed further?
Study
Association positive: What is the “p” value?
Association not found: What is the power of the study?
Example
Yoga
Vs
Standard treatment
Risk of DM in next 5 years

Example
▪ 5% in yoga group compared to 10% in the standard arm
▪ Alpha error of 5% (95% confidence level),

▪ power of 80%,
▪ expected risk/incidence of DM in the standard group as
10% and yoga group as 5% (50% reduction in risk),
▪ we need 466 individuals in each group for the study
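The inputs listed above can be plugged into a standard sample size formula for comparing two proportions. This is a sketch using the simple unpooled formula with no continuity correction; the slide does not state which formula or software produced 466, so the figures need not match exactly.

```python
from math import ceil
from statistics import NormalDist


def sample_size_two_proportions(p1, p2, alpha=0.05, power=0.80):
    """Per-group n for comparing two proportions (simple unpooled formula)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided alpha error
    z_beta = NormalDist().inv_cdf(power)           # power = 1 - beta
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)


n = sample_size_two_proportions(p1=0.10, p2=0.05)
print(n)
```

This version gives a somewhat smaller number than the 466 quoted above; a continuity correction or an allowance for non-response (both assumptions here) would push the figure upward. Note how quickly the requirement grows when the expected difference shrinks.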
Yoga
7%
Standard
10%
Example..contd
▪ The study got over; as decided I have
included 466 individuals in each group
▪ In analysis- P value is NOT SIGNIFICANT
(P=0.25)
▪ What does that mean? What are the
possibilities?
Example
▪ In standard group- 10% (10% assumed in
sample size calculation)
▪ In Yoga group-7% (5% assumed during
sample size calculation)
Power calculation
How much is required?
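A sketch of the power calculation for the yoga example, using a normal approximation for a two-sided comparison of two proportions. The formula is an assumption; the slides do not show which method was used.

```python
from math import sqrt
from statistics import NormalDist


def power_two_proportions(p1, p2, n_per_group, alpha=0.05):
    """Approximate power of a two-sided test comparing two proportions."""
    se = sqrt(p1 * (1 - p1) / n_per_group + p2 * (1 - p2) / n_per_group)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(abs(p1 - p2) / se - z_alpha)


planned = power_two_proportions(0.10, 0.05, 466)   # difference assumed at design
observed = power_two_proportions(0.10, 0.07, 466)  # difference actually seen
print(f"power for 10% vs 5%: {planned:.0%}")
print(f"power for 10% vs 7%: {observed:.0%}")
```

With 466 per group the study had the planned power of about 80% for a drop from 10% to 5%, but well under half of that for the smaller drop from 10% to 7%. A non-significant result in this situation may therefore reflect low power rather than absence of effect.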
Summary
Specific questions addressed
▪ What are the types of variables?

▪ How to summarize a variable?
▪ What is the difference between standard
deviation and standard error?
▪ What do you understand by 95% Confidence
Interval?
▪ ‘p’ value < 0.05 – what does it really mean?
Specific questions addressed
▪ What do you understand by ‘power’ of the
study?
▪ Statistical versus clinical significance?
Thank You
Dr K C Premarajan
Professor of P&SM
JIPMER
▪ One of the most important problems in observational
epidemiological studies
▪ For example: In a study of coffee and cancer of the
pancreas, smoking can be a confounder if
▪ a) Smoking is a known risk factor for pancreatic cancer
▪ b) Smoking is associated with coffee drinking, but is not a
result of coffee drinking
[Figure: two explanations for the OBSERVED ASSOCIATION between COFFEE DRINKING and PANCREATIC CANCER]
A) CAUSAL: COFFEE DRINKING, PANCREATIC CANCER
B) DUE TO CONFOUNDING: COFFEE DRINKING, SMOKING, PANCREATIC CANCER
▪ Function of the complex interrelationship
between exposure and disease
▪ Factor should differ between the study
groups
▪ Mixing of the effect of the exposure under
study on the disease with that of a third
factor.
▪ Factor must be associated with the exposure
▪ Be a risk factor for the disease, independent
of that exposure
▪ Association need not be causal; the factor may be a
correlate of another causal factor
▪ Sound knowledge of the disease under study
▪ Review of literature
▪ Identify the potential confounders in the
design stage.
▪ Collect adequate data on such confounders
▪ Statistical significance of association cannot
be the sole criterion
▪ Identify the direction of effect of
confounding
▪ Positive Confounding
Over-estimate risk
Eg: Smoking in a study on coffee drinking and
MI
▪ Negative Confounding
Underestimate risk/protection
Eg: Gender in a study on Physical Activity and
MI
Hypothetical example of confounding in an unmatched
case control study: Number of exposed and nonexposed
cases and controls
EXPOSED    CASES    CONTROLS
YES        30       18
NO         70       82
TOTAL      100      100
ODDS RATIO = (30 × 82) / (70 × 18) = 1.95
Figure 15.2 Schematic representation of the issue of potential
confounding
A. Causal: Exposure, Disease (observed association)
B. Due to Confounding: Exposure, Older Age, Disease (observed association)
Hypothetical Example of Confounding
in an Unmatched Case-Control Study:
Relationship of Exposure to Age
Age(yr)    Total    Exposed    Not Exposed    % Exposed
<40        130      13         117            10
≥40        70       35         35             50
Hypothetical Example of Confounding
in an Unmatched Case-Control Study:
Distribution of Cases and Controls by
Age
Age(yr) Cases Controls
<40 50 80
≥40 50 20
Total 100 100
Hypothetical example of confounding in an unmatched
case control study: calculation of odds ratios after
stratifying by age
AGE     EXPOSED    CASES    CONTROLS    ODDS RATIO
< 40    YES        5        8
        NO         45       72          OR = (5 × 72) / (45 × 8) = 1
        TOTAL      50       80
≥ 40    YES        25       10
        NO         25       10          OR = (25 × 10) / (25 × 10) = 1
        TOTAL      50       20
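The crude and age-stratified odds ratios from the hypothetical example can be reproduced with a small sketch. The function and argument names are illustrative.

```python
def odds_ratio(exposed_cases, exposed_controls, unexposed_cases, unexposed_controls):
    """Odds ratio for a 2x2 case control table: (a × d) / (b × c)."""
    return (exposed_cases * unexposed_controls) / (unexposed_cases * exposed_controls)


crude = odds_ratio(30, 18, 70, 82)     # all ages combined
under_40 = odds_ratio(5, 8, 45, 72)    # stratum: age < 40
over_40 = odds_ratio(25, 10, 25, 10)   # stratum: age ≥ 40

print("crude OR:", round(crude, 2))
print("OR, age < 40:", under_40)
print("OR, age ≥ 40:", over_40)
```

The crude odds ratio is about 1.95, yet it is 1 within each age stratum. The apparent association disappears once age is held constant, which is the signature of confounding by age in this example.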
▪ At the design stage
▪ At the analysis stage
▪ Both
▪ Techniques for control of confounding are
based on the characteristics of a confounder.
▪ Randomisation (For intervention studies)
▪ Restriction
▪ Matching
▪ Stratification
▪ Multivariate techniques: Examine the potential
effect of one variable while simultaneously
controlling the effect of many other factors
[Figure: AGE ADJUSTED LUNG CANCER DEATH RATES PER 100,000 MAN-YEARS, for men who NEVER SMOKED and men who REGULARLY SMOKED CIGARETTES, by place of residence: CITY OF 50,000+, CITY OF 10,000-50,000, SUBURB OR TOWN, RURAL]
▪ Confounding describes the reality of the
interrelationship between certain factors and
a certain outcome
▪ It characterizes every situation in which
etiology is addressed, because most causal
questions involve the relationship of multiple
exposures and multiple etiological factors
▪ In all analytical observational studies, bias
and confounding must always be considered
as alternative explanations for study
findings.
▪ A number of methods are available for
controlling confounders
▪ No single method can be considered optimal
in every situation
▪ A combination of strategies is preferred
▪ Epidemiology by Leon Gordis (6th Edition)
▪ Epidemiology in Medicine by Charles H
Hennekens
Confounding
Interaction
Web of Causation
What is interaction?
1+1=2
• No interaction
1+1=4 or 1+1=-2
• Interaction
Stratified analysis
Is the association equally strong in all strata?
Equal: Interaction not present
Not equal/variable: Interaction is present
Definition by MacMahon
• When the incidence of disease in the presence of two or more risk
factors differs from the incidence rate expected to result from their
individual effects.
• The effect can be greater than we would expect (Positive
interaction: Synergism)
• Or less than what we expect (Negative interaction: Antagonism)
Incidence of lung cancer (Leon Gordis)
                  Smoking -    Smoking +
Urbanisation -    3            9
Urbanisation +    15           9 + 15 - 3 = 21
Additive model

Incidence of lung cancer
                  Smoking -    Smoking +
Urbanisation -    3            9
Urbanisation +    15           > 21 (interaction present: synergism)
Deaths from lung cancer (per 1,00,000)
             Asbestos exposure -    Asbestos exposure +
Smoking -    11                     58
Smoking +    123                    602

Relative Risk of oral cancer
             Smoking -    Smoking +
Alcohol -    1            1.5
Alcohol +    1.2          5.7

Relative Risks of Liver Cancer for persons exposed
to Aflatoxin or Chronic Hepatitis B Infection
                  Aflatoxin Negative    Aflatoxin Positive
HBsAg Negative    1.0                   3
HBsAg Positive    7                     59
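The additive model from the urbanisation example (9 + 15 - 3 = 21) can be applied to the three tables above. A minimal sketch, with illustrative names; each tuple holds (neither factor, first factor only, second factor only, both factors) as read from the tables.

```python
def expected_additive(baseline, first_only, second_only):
    """Value expected with both factors if their effects simply add up."""
    return first_only + second_only - baseline


tables = {
    "Lung cancer deaths (asbestos, smoking)": (11, 58, 123, 602),
    "Oral cancer RR (smoking, alcohol)": (1, 1.5, 1.2, 5.7),
    "Liver cancer RR (aflatoxin, HBsAg)": (1.0, 3, 7, 59),
}

for name, (baseline, first_only, second_only, both) in tables.items():
    expected = expected_additive(baseline, first_only, second_only)
    label = "interaction present" if both > expected else "no excess over additive"
    print(f"{name}: expected {expected:.1f}, observed {both} ({label})")
```

In all three tables the observed value with both factors is far above the additive expectation, which is the pattern described as positive interaction (synergism).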
Summary
Look for incidence and relative risk in each stratum
If the effect of both risk factors together is higher than that
contributed by each individually
(more than that expected by the additive model): interaction present
If an engineer is able to sort this out
Why not researchers?
Blog: https://significantlystatistical.wordpress.com/2014/12/12/confounders-mediators-moderators-and-covariates/
CHOOSING A
STATISTICAL
TEST
Dr Mahalakshmy T
Special Thanks to Dr Subita

Descriptive Statistics
Inferential Statistics
• Estimation (Confidence Interval)
• Hypothesis testing (Statistical Tests of Significance)

• Association between two variables
• Usually independent and dependent variable

List some exposure (independent)
and outcome (dependent) variables
Example:
1. Smoking vs cancer
2. Physical activity intervention & decrease in BMI

EXPOSURE VARIABLES
ARE MOSTLY
CATEGORICAL
CHOOSING STATISTICAL TESTS
1. Type of variables
• Continuous or categorical
2. No. of groups
• Two or more than 2 groups
3. Type of distribution
• Normal or non-normal
4. Type of data
• Paired or Unpaired design/ repeated
CHOICE OF STATISTICAL TESTS
Scale of dependent variable      COMPARE        2 groups:                     2 groups:              3 or more groups:
                                                Independent samples           Repeated measures      Independent samples
Interval or ratio/continuous     Means          Independent samples t test    Paired samples t test  One way ANOVA
(Parametric)
Ordinal scale                    Mean ranks     Mann Whitney test             Wilcoxon test          Kruskal Wallis test
(Non parametric)
Nominal or Dichotomous           Counts or %    Chi square test               McNemar test           Chi square test
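The table above is effectively a lookup on two questions: the scale of the dependent variable and the design. A sketch of that lookup follows; the key names (`"continuous"`, `"two_independent"` and so on) are illustrative labels, not standard terms.

```python
# Rows: scale of the dependent variable. Columns: number of groups and design.
TESTS = {
    ("continuous", "two_independent"): "Independent samples t test",
    ("continuous", "two_repeated"): "Paired samples t test",
    ("continuous", "three_plus_independent"): "One way ANOVA",
    ("ordinal", "two_independent"): "Mann Whitney test",
    ("ordinal", "two_repeated"): "Wilcoxon test",
    ("ordinal", "three_plus_independent"): "Kruskal Wallis test",
    ("nominal", "two_independent"): "Chi square test",
    ("nominal", "two_repeated"): "McNemar test",
    ("nominal", "three_plus_independent"): "Chi square test",
}


def choose_test(scale, design):
    """Return the test named in the table for this scale and design."""
    return TESTS[(scale, design)]


print(choose_test("continuous", "two_independent"))
print(choose_test("ordinal", "two_repeated"))
```

The "continuous" row assumes normally distributed data; for non-normal continuous data the ordinal (non parametric) row applies.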
idno  AGE  GENDER  EDUCATION  OCCUPATION  HTN status  SBP1  SBP2  DBP1  fam hist  smoker  alcohol  HEIGHT  WEIGHT  WC  GHQ1  GHQ2
1     34   1       12         housewife   0           124   120   86    y         n       n        1.55    65      98  24    32
2     50   1       6          housewife   0           100   118   82    n         n       n        1.72    47      73  48    24
3     34   2       12         fisherman   0           114   120   88    y         n       n        1.68    61      85  24    24
4     44   2       8          labourer    1           120   122   90    n         n       n        1.64    64      88  24    24
5     34   2       7          fisherman   0           120   140   80    n         n       n        1.62    67      92  24    24
6     39   2       10         labourer    1           120   154   90    n         n       n        1.78    67      91  36    26

EXAMPLES
1. To compare mean SBP between
females & males
2. To compare HTN status with GHQ
scores
3. To find association between
Gender and Hypertension status
4. To compare mean SBP before and after
intervention
5. To compare GHQ score before and after
intervention
6. To compare Hypertension status before
and after intervention
7. To compare mean SBP between
occupation groups
8. To compare GHQ scores between
occupation groups
9. To find association between occupation groups
and hypertension status
What to do when both the variables are on a
continuous scale?
Eg. Relationship between BMI and SBP

Inferential statistics for strength of relationships
Scale of dependent variable                    COMPARE    Independent samples
Parametric (Interval or ratio/ continuous)     Values     Pearson correlation
Non parametric (ordinal scale)                 Ranks      Spearman correlation
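A minimal sketch of the two coefficients in the table above, written out by hand. The BMI and SBP values are hypothetical, and the rank helper assumes there are no tied values.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation: computed on the values themselves."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def ranks(values):
    """Rank from 1 (smallest). Assumes no tied values."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical data for five participants.
bmi = [20, 22, 25, 27, 30]
sbp = [110, 118, 121, 130, 142]

print("Pearson r:", round(pearson(bmi, sbp), 3))
print("Spearman rho:", round(spearman(bmi, sbp), 3))
```

Because SBP rises with every increase in BMI in this made-up data, the rank-based coefficient is exactly 1, while Pearson is slightly lower since the rise is not perfectly linear.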
Predicting single outcome variable from several
independent variables
One dependent/ outcome variable    Several independent or predictor variables
Continuous                         Multiple regression
Dichotomous                        Logistic regression
CHOOSE A STATISTICAL
TEST TO BE USED IN THE
FOLLOWING SITUATIONS
FRAMEWORK FOR CHOICE
OF STATISTICAL TESTS
Aim
Analysis type
Parameter to be analysed
No. & Name of the groups / data sets to be analysed
Distribution of data (Normal or Non-normal)
Design (Paired or Unpaired)
1. A new drug is proposed to lower total
cholesterol. A randomized controlled trial
is conducted with 30 randomly assigned
participants to receive either the new drug
or a placebo. Each participant is asked to
take the assigned treatment for 6 months.
At the end of 6 months, each patient's
total cholesterol level is measured
SOLUTION
Aim: To see whether the new drug alters blood cholesterol levels
Analysis type: Comparison of means
Parameter to be analysed: Cholesterol levels
No. & Name of the groups / data sets to be analysed: Two – New drug and Placebo
Distribution of data (Normal or Non-normal): Normal
Design (Paired or Unpaired): Unpaired
Unpaired t test
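For illustration, the unpaired t statistic for this design can be computed by hand. This is a sketch using the pooled-variance (Student) form, which assumes similar variances in the two groups; the cholesterol values are hypothetical and use only five participants per arm to keep the arithmetic visible.

```python
from math import sqrt
from statistics import mean, variance


def unpaired_t(group_a, group_b):
    """Pooled-variance (Student) t statistic and its degrees of freedom."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df
    se = sqrt(pooled_var * (1 / n_a + 1 / n_b))  # standard error of the mean difference
    return (mean(group_a) - mean(group_b)) / se, df


# Hypothetical total cholesterol (mg/dl) at 6 months.
new_drug = [180, 190, 175, 185, 170]
placebo = [200, 210, 195, 205, 190]

t, df = unpaired_t(new_drug, placebo)
print(f"t = {t:.2f} with {df} degrees of freedom")
```

The t statistic is then compared with the t distribution for the given degrees of freedom to obtain the p value; statistical software normally does both steps together.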
2. Health education to adolescents on healthy
eating. Change in knowledge scores (marks
obtained out of 50 MCQs) with the health
intervention
Paired t test
3. Three different formulations of iron to
improve Hb level of the women at the end
of 6 months
One way ANOVA

4. The wound-healing effect of a
traditional drug was tested in ulcer
patients.
Two groups of patients (6 each) were
administered either saline or test
drug and the effect was measured in
scores(0-5; 0 -No healing; 5-
Complete healing)
Mann Whitney test
5. Does the blood pressure vary with
bodyweight?
To find out this, mean BP and weight of
100 participants were measured.
Correlation
6. To evaluate an adverse effect, two
groups 1000 subjects were administered
either a drug or placebo.
The proportion of subjects affected were
78.54% and 21.63%.
Chi square test

7. The effect of physostigmine on salivary
secretion was tested among
preoperative patients.
Before and after giving the drug, the
secretion was quantified as no
secretion(0), small(1), medium(2) and
large(3) and compared.
Wilcoxon test
8. A new drug was tested to see how
its concentration in the body alters
with time. 10mg of the drug was
given iv and plasma concentration
was measured at 4, 8, 12, 24, 48 & 72
hr.
Repeated measures ANOVA
Summary
Scale of dependent variable      COMPARE        2 groups:                     2 groups:              3 or more groups:
                                                Independent samples           Repeated measures      Independent samples
Interval or ratio/continuous     Means          Independent samples t test    Paired samples t test  One way ANOVA
(Parametric)
Ordinal scale                    Mean ranks     Mann Whitney test             Wilcoxon test          Kruskal Wallis test
(Non parametric)
Nominal or Dichotomous           Counts or %    Chi square test               McNemar test           Chi square test
THANK YOU
