Statistics Made Easy
Hani Tamim, MPH, PhD
Assistant Professor
Epidemiology and Biostatistics
Research Center / College of Medicine
King Saud bin Abdulaziz University for Health Sciences
Riyadh, Saudi Arabia
Objective of medical research
Is treatment A better than treatment B for patients with
hypertension?
What is the survival rate among ICU patients?
What is the incidence of Down syndrome among a certain
group of people?
Is the use of Oral Contraceptives associated with an increased risk
of breast cancer?
Research Process?
Planning
Design
Data collection
Analysis
Data entry
Data cleaning
Data management
Data analysis
Reporting
Statistics is used in every step of this process.
What is statistics?
Scientific methods for:
Collecting
Organizing
Summarizing
Presenting
Interpreting
data
Definition of some basic terms
Population: The largest collection of entities for which we have
interest at a particular time
Sample: A part of a population
Simple random sample: a sample of size n drawn from a
population of size N in such a way that every possible sample of
size n has the same chance of being selected
Definition of some basic terms
Variable: A characteristic of the subjects under observation that
takes on different values for different cases, for example: age,
gender, diastolic blood pressure
Quantitative variables: Are variables that can convey information
regarding amount
Qualitative variables: Are variables in which measurements
consist of categorization
Types of variables
Categorical variables
Continuous variables
Categorical variables
Nominal: unordered data
Death
Gender
Country of birth
Ordinal: Predetermined order among response classification
Education
Satisfaction
Continuous variables
Continuous: Not restricted to integers
Age
Weight
Cholesterol
Blood pressure
Steps involved (data)
Data collection
Database structure
Data entry
Data cleaning
Data management
Data analyses
Data collection
Data collection:
Collection of information that will be used to answer the research
question
Could be done through questionnaires, interviews, data abstraction,
etc.
Data collection
Database structure
Database structure:
Structure the database (using SPSS) into which the data will be
entered
Data entry
Data entry:
Entering the information (data) into the computer
Usually done manually
Single data entry
Double data entry
Data cleaning
Data cleaning:
Identify any data entry mistakes
Correct such mistakes
Data management
Data management:
Create new variables based on different criteria
Such as:
BMI
Recoding
Categorizing age (less than 50 years, and 50 years and above)
Etc.
Data analyses
Data analyses:
Descriptive statistics: are the techniques used to describe the main
features of a sample
Inferential statistics: is the process of using the sample statistic to
make informed guesses about the value of a population parameter
Data analyses
Data analyses:
Univariate analyses
Bivariate analyses
Multivariate analyses
Bottom line
There are different statistical methods
for different types of variables
Descriptive statistics: categorical variables
Frequency distribution
Graphical representation
Descriptive statistics: categorical variables
Frequency distribution
A frequency distribution lists, for each value (or small range of
values) of a variable, the number or proportion of times that
observation occurs in the study population
Descriptive statistics: categorical variables
Frequency distribution:
How to describe a categorical variable (marital status)?
Descriptive statistics: categorical variables
Construct a frequency distribution
Title
Values
Frequency
Relative frequency (percent)
Valid relative frequency (valid percent)
Cumulative relative frequency (cumulative percent)
Descriptive statistics: categorical variables
Marital status of the 291 patients admitted to the Emergency Department

                     Frequency   Percent   Valid Percent   Cumulative Percent
Valid     Married       266        91.4         94.7               94.7
          Single         13         4.5          4.6               99.3
          Widow           2          .7           .7              100.0
          Total         281        96.6        100.0
Missing   System         10         3.4
Total                   291       100.0
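The percent, valid percent, and cumulative percent columns above can be reproduced with a few lines of Python. This is an illustrative sketch using plain dictionaries (the deck itself uses SPSS); the counts are taken from the table above.

```python
# Frequency distribution for marital status (counts from the table above)
counts = {"Married": 266, "Single": 13, "Widow": 2}
missing = 10
total = sum(counts.values()) + missing   # 291 patients in all
valid_total = sum(counts.values())       # 281 with a recorded status

percent = {k: round(100 * v / total, 1) for k, v in counts.items()}
valid_percent = {k: round(100 * v / valid_total, 1) for k, v in counts.items()}

# Cumulative percent accumulates over the valid categories in table order
cumulative, running = {}, 0.0
for k, v in counts.items():
    running += 100 * v / valid_total
    cumulative[k] = round(running, 1)

print(percent)        # {'Married': 91.4, 'Single': 4.5, 'Widow': 0.7}
print(valid_percent)  # {'Married': 94.7, 'Single': 4.6, 'Widow': 0.7}
print(cumulative)     # {'Married': 94.7, 'Single': 99.3, 'Widow': 100.0}
```

Note the distinction the table makes: percent divides by all 291 patients, valid percent only by the 281 with a recorded value.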
Example
Example: summarizing data
Descriptive statistics: categorical variables
Graphical representation
A graph lists, for each value (or small range of values) of a variable,
the number or proportion of times that observation occurs in the
study population
Descriptive statistics: categorical variables
Graphical representation:
Two types
Bar chart
Pie chart
Descriptive statistics: categorical variables
Construct a bar or pie chart
Title
Values
Frequency or relative frequency
Properly labelled axes
Descriptive statistics: categorical variables
Descriptive statistics: categorical variables
Descriptive statistics: continuous variables
Central tendency
Dispersion
Graphical representation
Descriptive statistics: continuous variables
How to describe a continuous variable (Systolic blood pressure)?
Central tendency:
Mean
Median
Mode
Descriptive statistics: continuous variables
Mean:
Add up data, then divide by sample size (n)
The sample size n is the number of observations (pieces of
data)
Example
n = 5 systolic blood pressures (mmHg)
X1 = 120, X2 = 80, X3 = 90, X4 = 110, X5 = 95

X̄ = (120 + 80 + 90 + 110 + 95) / 5 = 99 mmHg
Descriptive statistics: continuous variables
Formula

X̄ = (Σᵢ₌₁ⁿ Xᵢ) / n

The summation sign (Σ) is just mathematical shorthand for adding
up all of the observations:

Σᵢ₌₁ⁿ Xᵢ = X1 + X2 + X3 + ... + Xn
Descriptive statistics: continuous variables
Also called sample average or arithmetic mean X
Sensitive to extreme values
One data point could make a great change in sample mean
Uniqueness
Simplicity
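The mean calculation, and its sensitivity to extreme values, can be checked directly. A minimal Python sketch using the five blood pressures from the example:

```python
# Sample mean of the five systolic blood pressures from the example
pressures = [120, 80, 90, 110, 95]
n = len(pressures)
mean = sum(pressures) / n
print(mean)  # 99.0 mmHg

# One extreme value shifts the mean noticeably (sensitivity to outliers):
# replacing 120 by 200 moves the mean from 99 to 115
shifted = [200 if x == 120 else x for x in pressures]
print(sum(shifted) / n)  # 115.0
```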
Descriptive statistics: continuous variables
Median: is the middle number, or the number that cuts the data in
half
Sorted data: 80  90  95  110  120 → Median = 95
The sample median is not sensitive to extreme values
For example: If 120 became 200, the median would remain the
same, but the mean would change to 115.
Descriptive statistics: continuous variables
If the sample size is an even number:
80  90  95  110  120  125
Median = (95 + 110) / 2 = 102.5 mmHg
Descriptive statistics: continuous variables
Median: Formula
n odd: Median = the value in position (n + 1)/2
n even: Median = the mean of the two middle values (positions n/2 and n/2 + 1)
Properties:
Uniqueness
Simplicity
Not affected by extreme values
Descriptive statistics: continuous variables
Mode: Most frequently occurring number
Data: 80  90  95  95  120  125 → Mode = 95
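Python's standard library computes both measures directly; a short sketch using the example data from these slides:

```python
import statistics

# n odd: the median is the single middle value
pressures = [80, 90, 95, 110, 120]
print(statistics.median(pressures))  # 95

# n even: the median is the mean of the two middle values
even = [80, 90, 95, 110, 120, 125]
print(statistics.median(even))       # 102.5 = (95 + 110) / 2

# The mode is the most frequently occurring value
data = [80, 90, 95, 95, 120, 125]
print(statistics.mode(data))         # 95

# The median is robust: replacing 120 by 200 leaves it unchanged
print(statistics.median([80, 90, 95, 110, 200]))  # 95
```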
Descriptive statistics: continuous variables
Example:
Statistics — Systolic blood pressure
N: Valid = 286, Missing = 5
Mean = 144.13, Median = 144.50, Mode = 155
Descriptive statistics: continuous variables
Central tendency measures do not tell the whole story
Example:
Group A: 21  22  23  23  23  24  24  25  28
Mean = 213/9 = 23.6, Median = 23
Group B: 15  18  21  21  23  25  25  32  33
Mean = 213/9 = 23.6, Median = 23
Descriptive statistics: continuous variables
How to describe a continuous variable (Systolic blood pressure)
in addition to central tendency?
Measures of dispersion:
Range
Variance
Standard Deviation
Descriptive statistics: continuous variables
Range
Range = Maximum − Minimum
Example: X1 = 120, X2 = 80, X3 = 90, X4 = 110, X5 = 95
Range = 120 − 80 = 40
Descriptive statistics: continuous variables
Sample variance (s² or var)
The sample variance is the average of the squared
deviations about the sample mean:

s² = [Σᵢ₌₁ⁿ (Xᵢ − X̄)²] / (n − 1)

Sample standard deviation (s or SD)
It is the square root of the variance:

s = √( [Σᵢ₌₁ⁿ (Xᵢ − X̄)²] / (n − 1) )
Descriptive statistics: continuous variables
Example: n = 5 systolic blood pressures (mmHg)
Recall, from earlier: mean = 99 mmHg
X1 = 120, X2 = 80, X3 = 90, X4 = 110, X5 = 95

Σᵢ₌₁ⁿ (Xᵢ − X̄)² = (120 − 99)² + (80 − 99)² + (90 − 99)²
                 + (110 − 99)² + (95 − 99)² = 1020
Descriptive statistics: continuous variables
Sample variance:

s² = 1020 / (5 − 1) = 255

Sample standard deviation (SD):

s = √s² = √255 = 15.97 mmHg
Descriptive statistics: continuous variables
The bigger s, the more variability
s measures the spread about the mean
s can equal 0 only if there is no spread
All n observations have the same value
The units of s are the same as the units of the data (for example,
mmHg)
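The variance and SD calculation above takes only a few lines of Python; this sketch reproduces the worked example step by step:

```python
import math

pressures = [120, 80, 90, 110, 95]
n = len(pressures)
mean = sum(pressures) / n                     # 99 mmHg

# Sum of squared deviations about the sample mean
ss = sum((x - mean) ** 2 for x in pressures)  # 1020

variance = ss / (n - 1)                       # s^2 = 1020 / 4 = 255
sd = math.sqrt(variance)                      # s ~ 15.97 mmHg
print(variance, round(sd, 2))  # 255.0 15.97
```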
Descriptive statistics: continuous variables
Example:
Statistics — Systolic blood pressure
N: Valid = 286, Missing = 5
Mean = 144.13, Median = 144.50, Mode = 155
Std. Deviation = 35.312, Variance = 1246.916
Range = 202, Minimum = 55, Maximum = 257
Example: summarizing data
Descriptive statistics: continuous variables
Graphical representation:
Different types
Histogram
Descriptive statistics: continuous variables
Construct a chart
Title
Values
Frequency or relative frequency
Properly labelled axes
Descriptive statistics: continuous variables
Shapes of the Distribution
Three common shapes of frequency distributions:
Symmetrical and bell-shaped
Positively skewed, or skewed to the right
Negatively skewed, or skewed to the left
Shapes of Distributions
Symmetric (right and left sides are mirror images)
Left tail looks like right tail
Mean = Median = Mode
Shapes of Distributions
Left skewed (negatively skewed)
Long left tail
Mean < Median
(order along the axis: Mean, Median, Mode)
Shapes of Distributions
Right skewed (positively skewed)
Long right tail
Mean > Median
(order along the axis: Mode, Median, Mean)
Shapes of the Distribution
Three less common shapes of frequency distributions:
A: Bimodal
B: Reverse J-shaped
C: Uniform
Probability
Probability
Definition:
The likelihood that a given event will occur
It ranges between 0 and 1:
0 means the event cannot occur
1 means the event is certain to occur
How do we calculate it?
Frequentist Approach:
Probability is the long-run relative frequency
Thus, it is an idealization based on imagining what would
happen to the relative frequencies in an indefinitely long
series of trials
Application in medicine
How does probability apply in medicine?
Probability is the most important theory behind biostatistics
It is used at different levels
Descriptive
Example: 4% chance of a patient dying after admission to
emergency department (from the previous example)
What do we mean?
Out of each 100 patients admitted to the emergency department, 4
will die, whereas 96 will be discharged alive
Example: 1 in 1000 babies are born with a certain abnormality!
Incidence and prevalence
Associations
Example: the association between cigarette smoking and death
after admission to the emergency department with an MI
Current Cigarette Smoking in association with death at discharge (counts)

                           Death   Discharged   Total
Current smoking    No         5        123        128
                   Yes        5        154        159
Total                        10        277        287

Probability of being a smoker = 159 / 287 = 55.4%
Probability of dying if a smoker = 5 / 159 = 3.1%
Probability of dying if a non-smoker = 5 / 128 = 3.9%
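These (conditional) probabilities can be checked directly from the table counts; a minimal sketch:

```python
# Counts from the smoking-by-death table above
deaths = {"No": 5, "Yes": 5}         # deaths among non-smokers / smokers
row_totals = {"No": 128, "Yes": 159} # row totals by smoking status
grand_total = 287

p_smoker = row_totals["Yes"] / grand_total       # marginal probability
p_death_smoker = deaths["Yes"] / row_totals["Yes"]       # conditional
p_death_nonsmoker = deaths["No"] / row_totals["No"]      # conditional

print(round(100 * p_smoker, 1))          # 55.4 (% of patients who smoke)
print(round(100 * p_death_smoker, 1))    # 3.1
print(round(100 * p_death_nonsmoker, 1)) # 3.9
```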
Associations
Same is applied to:
Relative risk
Risk difference
Attributable risk
Odds ratio
Etc..
Bottom line
Probability is applied at all levels of statistical analyses
Probability distributions
Probability distributions list or describe probabilities for all possible
occurrences of a random variable
There are two types of probability distributions:
Categorical distributions
Continuous distributions
Probability distributions: categorical variables
Categorical variables
Frequency distribution
Other distributions, such as binomial
Probability distributions: continuous variables
Continuous variables
Continuous distribution
Such as Z and t distributions
Normal Distribution
Properties of a Normal Distribution
Also called Gaussian distribution
A continuous, bell-shaped, symmetrical distribution; both
tails extend to infinity
The mean, median, and mode are identical
The shape is completely determined by the mean and
standard deviation
Normal Distribution
A normal distribution can have any mean (μ) and any standard deviation (σ):
e.g.: Age: μ = 40, σ = 10
The area under the curve represents 100% of all the observations
The mean, median, and mode coincide at the center
Normal Distribution
Age distribution for a specific population (Mean = 40, SD = 10)
[Figure: 50% of the area lies below the mean and 50% above]
[Figure: shaded area below Age = 25]
Normal distribution
The formula used to calculate the area below a certain point in a
normal distribution is the probability density function of the
normal distribution with mean μ and variance σ²:

f(x) = (1 / (σ√(2π))) exp( −(x − μ)² / (2σ²) )
Normal distribution
Thus, for any normal distribution, once we have the mean and sd,
we can calculate the percentage of subjects:
Above a certain level
Below a certain level
Between different levels
But the problem is:
Calculation is very complicated and time consuming, so:
Standardized Normal Distribution
We standardize to a normal distribution
What does this mean?
For a specific distribution, we calculate all possible probabilities,
and record them in a table
A normal distribution with μ = 0, σ = 1 is called a Standardized
Normal Distribution
Standardized Normal Distribution
Mean=0
SD=1
Area under the Normal Curve from 0 to Z
(abridged; the full table runs z = 0.00 to 4.09 in steps of 0.01)

z       0.00      0.01      0.05      0.09
0.0   0.00000   0.00399   0.01994   0.03586
0.5   0.19146   0.19497   0.20884   0.22240
1.0   0.34134   0.34375   0.35314   0.36214
1.5   0.43319   0.43448   0.43943   0.44408
2.0   0.47725   0.47778   0.47982   0.48169
2.5   0.49379   0.49396   0.49461   0.49520
3.0   0.49865   0.49869   0.49886   0.49900
4.0   0.49997   0.49997   0.49997   0.49998
Standardized Normal Distribution
Normal Distribution (Mean = μ, SD = σ)
TRANSFORM: Z = (x − μ) / σ
Standardized Normal Distribution (Z) (Mean = 0, SD = 1)
Standardized Normal Distribution
Normal Distribution (Mean = 40, SD = 10)
TRANSFORM: Z(40) = (x − μ) / σ = (40 − 40) / 10 = 0
Standardized Normal Distribution (Z) (Mean = 0, SD = 1)
Standardized Normal Distribution
Normal Distribution (Mean = 40, SD = 10), x = 30
TRANSFORM: Z(30) = (x − μ) / σ = (30 − 40) / 10 = −1
Standardized Normal Distribution (Z) (Mean = 0, SD = 1)
Standardized Normal Distribution: summary
For any normal distribution, we can
Transform the values to the standardized normal distribution (Z)
Use the Z table to get the following areas
Above a certain level
Below a certain level
Between different levels
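The transform and the Z-table lookup can be checked numerically. This is an illustrative sketch (the slides use the printed table); the area from 0 to z can be computed from the error function in Python's standard library:

```python
import math

def z_score(x, mu, sigma):
    """Transform a raw value to the standardized normal scale."""
    return (x - mu) / sigma

def area_0_to_z(z):
    """Area under the standard normal curve between 0 and z,
    as tabulated in the Z table (computed via the error function)."""
    return 0.5 * math.erf(z / math.sqrt(2))

# Age example from the slides: mean = 40, SD = 10
print(z_score(30, 40, 10))               # -1.0
print(round(area_0_to_z(1.0), 5))        # 0.34134, matching the table
print(round(0.5 - area_0_to_z(1.0), 4))  # P(Z > 1) ~ 0.1587
```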
Normal Distribution
Age distribution for a specific population (Mean = 40, SD = 10):
Mean ± 1 SD = (30, 50) contains 68% of the observations
Mean ± 2 SD = (20, 60) contains 95% of the observations
Mean ± 3 SD = (10, 70) contains 99.7% of the observations
Practical example
The 68-95-99.7 Rule for the Normal Distribution
68% of the observations fall within one standard deviation of the
mean
95% of the observations fall within two standard deviations of the
mean
99.7% of the observations fall within three standard deviations of
the mean
When applied to real data, these estimates are considered
approximate!
Distributions of Blood Pressure
[Figure: the 68-95-99.7 rule applied to the distribution of systolic
blood pressure in men; Mean = 125 mmHg, s = 14 mmHg — 68% within
111–139, 95% within 97–153, 99.7% within 83–167]
Data analyses
Data analyses:
Descriptive statistics: are the techniques used to describe the main
features of a sample
Inferential statistics: is the process of using the sample statistic to
make informed guesses about the value of a population parameter
Why do we carry out research?
population
sample
Inference: Drawing
conclusions on certain
questions about a
population from sample data
Inferential statistics
Since we are not taking the whole population, we have to draw
conclusions on the population based on results we get from the
sample
Simple example: Say we want to estimate the average systolic
blood pressure for patients admitted to the emergency department
after having an MI
Other more complicated measures might be quality of life,
satisfaction with care, risk of outcome, etc.
Inferential statistics
What do we do?
Take a sample (n=291) of patients admitted to emergency
department in a certain hospital
Calculate the mean and SD (descriptive statistics) of systolic blood
pressure
Statistics — Systolic blood pressure
N: Valid = 286, Missing = 5
Mean = 144.13, Std. Deviation = 35.312
Inferential statistics
The next step is to make a link between the estimates we observed
from the sample and those of the underlying population (inferential
statistics)
What can we say about these estimates as compared to the
unknown true ones???
In other words, we are trying to estimate the average systolic blood
pressure for ALL patients admitted to the emergency department
after an MI
Inferential statistics
Sample data
N=291
Mean=144
SD=35
Inference
In statistical inference we usually encounter TWO issues
Estimate value of the population parameter. This is done through
point estimate and interval estimate (Confidence Interval)
Evaluate a hypothesis about a population parameter rather than
simply estimating it. This is done through tests of significance
known as hypothesis testing (P-value)
1- Confidence Interval
Confidence Intervals
A point estimate:
A single numerical value used to estimate a population parameter.
Interval estimate:
Consists of 2 numerical values defining a range of values that with
a specified degree of confidence includes the parameter being
estimated.
(Usually interval estimate with a degree of 95% confidence is
used)
Example
What is the average systolic blood pressure for patients admitted
to emergency departments after an MI?
Select a sample
Point estimate = mean = 144
Interval estimate = 95% CI = (140, 148)
95% Confidence Interval: x̄ ± z₁₋α/₂ × SE
- Upper limit = x̄ + z₁₋α/₂ × SE
- Lower limit = x̄ − z₁₋α/₂ × SE
= 144 ± 1.96 × 35 / √291
Sampling distribution of the mean (N = 291)
[Figure: 95% of sample means fall within μ ± 2 SE]
Standard error
Standard error = SD / √n
As the sample size increases, the standard error decreases
The estimation, as measured by the confidence interval, will be
better, i.e., a narrower confidence interval
Interpretation
95% Confidence Interval
There is 95% probability that the true parameter is within the
calculated interval
Thus, if we repeat the sampling procedure 100 times, the above
statement will be:
correct 95 times (the true parameter is within the interval)
wrong 5 times (the true parameter is outside the interval) (also
called α, the error rate)
Notes on Confidence Intervals
Interpretation
It provides the level of confidence of the value for the population
average systolic blood pressure
Are all CIs 95%?
No
It is the most commonly used
A 99% CI is wider
A 90% CI is narrower
Notes on Confidence Intervals
To be more confident you need a bigger interval
For a 99% CI, you need 2.6 SEM
For a 95% CI, you need 2 SEM
For a 90% CI, you need 1.65 SEM
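The confidence-interval arithmetic can be sketched in a few lines of Python. Note the slides quote n = 291 in one place and the valid N = 286 elsewhere; this illustrative sketch uses the valid N = 286, which reproduces the SPSS standard error of 2.088:

```python
import math

# 95% CI for the mean systolic blood pressure (values from the example)
mean, sd, n = 144.13, 35.312, 286
se = sd / math.sqrt(n)              # standard error of the mean

z = 1.96                            # z multiplier for 95% confidence
lower, upper = mean - z * se, mean + z * se
print(round(se, 3))                      # ~ 2.088
print(round(lower, 1), round(upper, 1))  # ~ 140.0 148.2

# A 99% CI uses z ~ 2.58 (wider); a 90% CI uses z ~ 1.65 (narrower)
```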
2- P-value
Inference
P-value
Is related to another type of inference
Hypothesis testing
Evaluate a hypothesis about a population parameter rather than
simply estimating it
Hypothesis testing
Back to our previous example
We want to make inference about the average systolic blood
pressure of patients admitted to emergency department after MI
Assume that the normal systolic blood pressure is 120
The question is whether the average systolic blood pressure for
patients admitted to emergency departments is different than the
normal, which is 120
Hypothesis testing
Two types of hypotheses:
Null hypothesis: is a statement consistent with no difference
Alternative hypothesis: is a statement that disagrees with the null
hypothesis, and is consistent with presence of difference
The logic of hypothesis testing
To decide which of the hypotheses is true
Take a sample from the population
If the data are consistent with the null hypothesis, then we do not
reject the null hypothesis (conclusion = no difference)
If the sample data are not consistent with the null hypothesis, then
we reject the null (conclusion = difference)
Hypothesis testing
Example: is the systolic blood pressure for patients admitted to
emergency department after an MI normal (ie =120)?
Ho: μ = 120
Ha: μ ≠ 120
How do we answer this question?
We take a sample and find that the mean is 144 mmHg
Can we consider that 144 is consistent with the normal value
(120 mmHg)?
Hypothesis testing
N = 291; sample mean = 144; Ho: μ = 120
[Figure: when the sampling distribution under Ho is wide, a mean of
144 looks consistent with the null hypothesis — but is it still
consistent when the distribution is narrow?]
Hypothesis testing
N = 291; Ho: μ = 120
[Figure: under Ho, 95% of sample means fall within μ ± 2 SE, with
2.5% in each tail]
Test statistic
It is the statistic used for deciding whether the null hypothesis
should be rejected or not
Used to calculate the probability of getting the observed results if
the null hypothesis is true.
This probability is called the p-value.
How to decide
We calculate the probability of obtaining a sample with mean of
144 if the true mean is 120 due to chance alone (p-value)
Based on p-value we make our decision:
If the p-value is low then this is taken as evidence that it is unlikely
that the null hypothesis is true, then we reject the null hypothesis (we
accept alternative one)
If the p-value is high, it indicates that most probably the null
hypothesis is true, and thus we do not reject the Ho
Problem!
We could be making the wrong decisions
Decision
Do not reject Ho
Reject Ho
Ho True
Ho False
Correct decision
Type II error
Type I error
Correct decision
Type I error: is rejecting the null hypothesis when it is true
Type II error: is not rejecting the null hypothesis when it is false
Error
Type I error:
Referred to as α
Probability of rejecting a true null hypothesis
Type II error:
Referred to as β
Probability of accepting a false null hypothesis
Power:
Represented by 1 − β
Probability of correctly rejecting a false null hypothesis
Significance level
The significance level, α, of a hypothesis test is defined as the
probability of making a type I error, that is, the probability of
rejecting a true null hypothesis
It could be set to any value, as:
0.05
0.01
0.1
Statistical significance
If the p-value is less than some pre-determined cutoff (e.g. 0.05),
the result is called statistically significant
This cutoff is the α-level
The α-level is the probability of a type I error
It is the probability of falsely rejecting H0
Back to the example
To test whether the average systolic blood pressure for patients
admitted to the emergency department after an MI is different
than 120 (which is the normal blood pressure)
We carry out a test called the one-sample t-test, which provides a p-value based on which we accept or reject the null hypothesis.
Back to the example
One-Sample Statistics
Systolic blood pressure: N = 286, Mean = 144.13,
Std. Deviation = 35.312, Std. Error Mean = 2.088

One-Sample Test (Test Value = 120)
Systolic blood pressure: t = 11.558, df = 285, Sig. (2-tailed) = .000,
Mean Difference = 24.133, 95% CI of the Difference = (20.02, 28.24)
Since p-value is less than 0.05, then the conclusion will be that the
systolic blood pressure for patients admitted to emergency
department after an MI is significantly higher than the normal
value which is 120
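The one-sample t statistic in the output can be reproduced by hand from the summary numbers. An illustrative sketch (the tiny gap from the printed 11.558 comes from rounding of the inputs):

```python
import math

# One-sample t-test by hand (values from the SPSS output above)
mean, sd, n = 144.13, 35.312, 286
mu0 = 120                  # hypothesized value under H0

se = sd / math.sqrt(n)     # standard error of the mean, ~ 2.088
t = (mean - mu0) / se
print(round(t, 2))         # ~ 11.56, close to the printed 11.558

# |t| is far beyond the ~1.97 cutoff for df = 285 at alpha = 0.05,
# so the p-value is < 0.05 and we reject H0
```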
p-values
p-values are probabilities (numbers between 0 and 1)
Small p-values mean that the sample results are unlikely when the
null is true
The p-value is the probability of obtaining a result as extreme or
more extreme than the one observed, by chance alone, assuming the
null hypothesis H0 is true
t-distribution
The t-distribution looks like a standard normal curve
A t-distribution is determined by its degrees of freedom (n-1), the
lower the degrees of freedom, the flatter and fatter it is
[Figure: standard normal curve overlaid with t-distributions for 35
and 15 degrees of freedom]
Percentiles of the t-distribution
(abridged; the full table covers df = 1–20, 100, 120, and ∞,
for percentiles 75% to 99.95%)

df       90%      95%     97.5%    99.5%
1       3.078    6.314    12.71    63.66
5       1.476    2.015    2.571    4.032
10      1.372    1.812    2.228    3.169
20      1.325    1.725    2.086    2.845
120     1.289    1.658    1.980    2.617
∞       1.282    1.645    1.960    2.576
Hypothesis Testing
Different types of hypotheses:
Mean (a) = Mean (b)
Proportion (a) = Proportion (b)
Variance (a) = Variance (b)
OR = 1
RR = 1
RD = 0
Test of homogeneity
Etc..
Example
Comparing two means: paired testing
In the previous example, is the heart rate at admission different
than the heart rate at discharge among the patients admitted to the
emergency department after an MI?
Statistics
                           N Valid   Missing    Mean    Std. Deviation
Heart Rate at admission      286        5       82.64       22.598
Heart Rate at discharge       77      214       76.99       17.900
Is this decrease in heart rate statistically significant?
Thus, we have to make inference.
Comparing two means: paired testing
What type of test to be used?
Since the measurements of the heart rate at admission and at
discharge are dependent on each other (not independent), another
type of test is used
Paired t-test
Comparing two means: paired testing
Paired Samples Statistics (Pair 1)
Heart Rate at admission: Mean = 81.16, N = 75,
Std. Deviation = 23.546, Std. Error Mean = 2.719
Heart Rate at discharge: Mean = 76.72, N = 75,
Std. Deviation = 17.973, Std. Error Mean = 2.075

Paired Samples Test
(Pair 1: Heart Rate at admission − Heart Rate at discharge)
Mean = 4.440, Std. Deviation = 25.302, Std. Error Mean = 2.922,
95% CI of the Difference = (−1.381, 10.261),
t = 1.520, df = 74, Sig. (2-tailed) = .133
95% CI ≈ 4.4 ± 1.96 × 2.9
H0: μ(admission) − μ(discharge) = 0
HA: μ(admission) − μ(discharge) ≠ 0
P-value = 0.133, thus no significant difference
How Are p-values Calculated?
t = (sample mean − μ0) / SEM

t = 4.4 / 2.9 = 1.52
The value t = 1.52 is called the test statistic
Then we can compare the t-value in the table and get the
p-value, or get it from the computer (0.13)
Interpreting the p-value
The p-value in the example is 0.133
Interpretation: If there is no difference in heart rate between
admission and discharge, then the chance of finding a mean difference
as extreme or more extreme than 4.4 in a sample of 75 pairs is 0.133
Thus, this probability is big (bigger than 0.05), which leads to saying
that the difference of 4.4 is due to chance
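The paired test statistic follows directly from the summary values in the output; a minimal sketch:

```python
# Paired t-test statistic by hand (values from the paired output above)
mean_diff = 4.440   # mean of the admission - discharge differences
sd_diff = 25.302    # SD of those differences
n = 75              # number of complete pairs

se = sd_diff / n ** 0.5     # standard error of the mean difference
t = mean_diff / se
print(round(se, 3))  # ~ 2.922
print(round(t, 2))   # ~ 1.52 -> p = 0.133, not significant at 0.05
```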
Notes
How to decide on significance from the 95% CI?
3 scenarios
[Figure: three 95% confidence intervals plotted on a scale from −15
to 15 — an interval that excludes 0 indicates a significant
difference; one that includes 0 does not]
Comparing two means: Independent sample testing
In the previous example, is the systolic blood pressure different
between males and females among the patients admitted to the
emergency department after an MI?
Group Statistics — Systolic blood pressure
Male: N = 240, Mean = 145.05, Std. Deviation = 35.162, Std. Error Mean = 2.270
Female: N = 44, Mean = 138.64, Std. Deviation = 35.753, Std. Error Mean = 5.390
Is this difference in systolic blood pressure statistically significant?
Thus, we have to make inference.
Comparing two means: Independent sample testing
Null hypothesis:
Ho: Mean SBP(Males) = Mean SBP (Females)
Ho: Mean SBP (Males) - Mean SBP (Females) = 0
Alternative hypothesis:
Ha: Mean SBP(Males) ≠ Mean SBP(Females)
Ha: Mean SBP(Males) − Mean SBP(Females) ≠ 0
Comparing two means: Independent sample testing
Thus, we carry out a test called: independent samples t-test
Formula to use: t = (mean₁ − mean₂) / SE(mean₁ − mean₂), where the standard error is computed with either a pooled (equal variances) or separate (unequal variances) variance estimate
Comparing two means: Independent sample testing
What we need to know is that we can calculate a p-value out of the
t-test (based on the t-distribution)
Based on this p-value, make the decision:
P-value > 0.05: do not reject the null (the two means are equal)
P-value < 0.05: reject the null (the two means are different)
Comparing two means: Independent sample testing

Group Statistics — Systolic blood pressure
Male: N = 240, Mean = 145.05, Std. Deviation = 35.162, Std. Error Mean = 2.270
Female: N = 44, Mean = 138.64, Std. Deviation = 35.753, Std. Error Mean = 5.390

Independent Samples Test — Systolic blood pressure
Levene's Test for Equality of Variances: F = .044, Sig. = .835
t-test for Equality of Means:
Equal variances assumed: t = 1.109, df = 282, Sig. (2-tailed) = .269,
Mean Difference = 6.409, Std. Error Difference = 5.781,
95% CI of the Difference = (−4.970, 17.789)
Equal variances not assumed: t = 1.096, df = 59.267, Sig. (2-tailed) = .278,
Mean Difference = 6.409, Std. Error Difference = 5.848,
95% CI of the Difference = (−5.292, 18.111)

Two formulas for calculation of the t-test:
1- when variances are equal
2- when variances are not equal
To know which one to use, test the variances (Levene's test):
Ho: variance(males) = variance(females)
Ha: variance(males) ≠ variance(females)
1- If p-value > 0.05, then variances are equal
2- If p-value < 0.05, then variances are not equal
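The unequal-variances (Welch) version of the statistic can be reproduced from the group summaries; an illustrative sketch:

```python
import math

# Welch's (unequal-variance) t statistic from the group summaries above
m1, s1, n1 = 145.05, 35.162, 240   # males
m2, s2, n2 = 138.64, 35.753, 44    # females

# Standard error of the difference with separate variance estimates
se = math.sqrt(s1**2 / n1 + s2**2 / n2)
t = (m1 - m2) / se
print(round(se, 3))  # ~ 5.848, the "equal variances not assumed" row
print(round(t, 3))   # ~ 1.096 -> p = 0.278, not significant
```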
Example
T-test
Ho: Mean1 = Mean2
Ha: Mean1 ≠ Mean2
T-test: P-value = 0.89
No significant difference
Chi square
Example
In the MI example, we would like to check if hypertension is
associated with gender.
In other words, are males at higher or lower risk of having
hypertension?
Sex * Hypertension Crosstabulation (Count)
         Hypertension
Sex      No    Yes   Total
Male     191   52    243
Female   24    20    44
Total    215   72    287
Example
Sex * Hypertension Crosstabulation
                        Hypertension
Sex                     No      Yes     Total
Male    Count           191     52      243
        % within Sex    78.6%   21.4%   100.0%
Female  Count           24      20      44
        % within Sex    54.5%   45.5%   100.0%
Total   Count           215     72      287
        % within Sex    74.9%   25.1%   100.0%
Example
To answer the question, we do a hypothesis test:
H0: P1 = P2    (P1 - P2 = 0)
Ha: P1 ≠ P2    (P1 - P2 ≠ 0)
(Pearson's) Chi-Square Test (χ²)
Calculation is easy (can be done by hand)
Works well for big sample sizes
Can be extended to compare proportions between more than two
independent groups in one test
The Chi-Square Approximate Method
χ² = Σ (O - E)² / E, summed over the 4 cells
Looks at discrepancies between observed and expected cell counts
Expected refers to the values for the cell counts that would be
expected if the null hypothesis is true
O = observed
E = expected = (row total × column total) / grand total
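The O-versus-E bookkeeping above is easy to sketch in code. A minimal Python illustration (the function name is ours, not a library API); feeding it the Sex × Hypertension counts reproduces the Pearson chi-square reported in the SPSS output:

```python
def chi_square(table):
    """Pearson chi-square statistic for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # E = row total x column total / grand total
            expected = row_totals[i] * col_totals[j] / grand
            stat += (observed - expected) ** 2 / expected
    return stat

# Sex x Hypertension counts from the crosstabulation above
observed = [[191, 52],   # males: no hypertension, hypertension
            [24, 20]]    # females
print(round(chi_square(observed), 3))  # ≈ 11.471
```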
The Chi-Square Approximate Method
The distribution of this statistic when the null is true is a chi-square
distribution with one degree of freedom
We can use this to determine how likely it was to get such a big
discrepancy between the observed and expected by chance alone
Distribution: Chi-Square with One Degree of Freedom
[Plot of the chi-square density; the critical value χ² = 3.84 corresponds to p = 0.05]
Example of Calculations of Chi-Square: 2x2 Contingency Table
Test statistic: χ² = Σ (O - E)² / E, summed over the 4 cells
Chi-Square Tests
                              Value     df   Asymp. Sig. (2-sided)  Exact Sig. (2-sided)  Exact Sig. (1-sided)
Pearson Chi-Square            11.471b   1    .001
Continuity Correction(a)      10.227    1    .001
Likelihood Ratio              10.366    1    .001
Fisher's Exact Test                                                 .001                  .001
Linear-by-Linear Association  11.431    1    .001
N of Valid Cases              287
a. Computed only for a 2x2 table
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 11.04.
χ² = 11.471
[Plot: sampling distribution, chi-square with one degree of freedom, with the observed value 11.471 far in the upper tail]
Example
The critical value that corresponds to 95% confidence (5% error) with 1 degree of freedom is 3.841.
Thus we reject Ho, since 11.471 > 3.841.
We conclude that Ho is false and that there is a relationship
between gender and diagnosis with hypertension
The p-value is = 0.001
Chi-square
Ho: Proportion1 = Proportion2
Ha: Proportion1 ≠ Proportion2
ChiSquare: P-value = 0.96
No significant difference
Relative Risk (RR):
Study the association between Vioxx use and Myocardial Infarction

         MI
Drug     Yes   No
Vioxx    71    52
Placebo  29    48

Ho: RR = 1
Ha: RR ≠ 1
RR = 1.5, 95% CI = (1.1 - 1.9), p-value = 0.01
Significant association
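The RR itself is just a ratio of the two risks. A minimal Python sketch (the log-scale confidence interval is the standard approximation, an assumption on our part; the slide quotes its own rounded interval):

```python
import math

def relative_risk(a, b, c, d):
    """RR for a 2x2 table: exposed row (a events, b non-events),
    unexposed row (c events, d non-events)."""
    rr = (a / (a + b)) / (c / (c + d))
    # Approximate 95% CI on the log scale (standard method)
    se_log = math.sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))
    lo = math.exp(math.log(rr) - 1.96 * se_log)
    hi = math.exp(math.log(rr) + 1.96 * se_log)
    return rr, (lo, hi)

rr, ci = relative_risk(71, 52, 29, 48)  # Vioxx vs. placebo MI counts
print(round(rr, 2))  # ≈ 1.53: about 1.5 times the risk, as on the slide
```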
Notes
How to decide on significance from the 95% CI?
There are 3 scenarios: the interval lies entirely above the null value, entirely below it, or includes it (only the last is non-significant).
[Diagram: three confidence intervals positioned relative to the null value]
Example
We would like to check if there is an association between gender
and both Hypertension and diabetes combined.
Sex * Hypertension and Diabetes combined Crosstabulation
                        None    Either HT or DM  Both HT and DM  Total
Male    Count           145     67               28              240
        % within Sex    60.4%   27.9%            11.7%           100.0%
Female  Count           13      12               19              44
        % within Sex    29.5%   27.3%            43.2%           100.0%
Total   Count           158     79               47              284
        % within Sex    55.6%   27.8%            16.5%           100.0%
Ho: gender and combined HT/DM status are independent.
Ha: the two variables are not independent.
Ho: P1 = P2 = P3
Ha: not all the proportions are equal
Example
Conclusion
Sex * Hypertension and Diabetes combined Crosstabulation
                        None    Either HT or DM  Both HT and DM  Total
Male    Count           145     67               28              240
        % within Sex    60.4%   27.9%            11.7%           100.0%
Female  Count           13      12               19              44
        % within Sex    29.5%   27.3%            43.2%           100.0%
Total   Count           158     79               47              284
        % within Sex    55.6%   27.8%            16.5%           100.0%
Chi-Square Tests
                              Value     df   Asymp. Sig. (2-sided)
Pearson Chi-Square            28.691a   2    .000
Likelihood Ratio              24.336    2    .000
Linear-by-Linear Association  25.341    1    .000
N of Valid Cases              284
a. 0 cells (.0%) have expected count less than 5. The
minimum expected count is 7.28.
Example
The critical value that corresponds to 95% confidence (5% error) with 2 degrees of freedom is 5.991.
Thus we reject Ho, since 28.691 > 5.991.
We conclude that Ho is false and that there is a relationship
between gender and diagnosis with hypertension and/or diabetes
The p-value is < 0.0001
ANOVA
The problem
We have samples from a number of independent groups.
We have a single numerical or ordinal variable and are interested
in whether the values of the variable vary between the groups.
Example: Does systolic blood pressure vary among men of
different smoking status?
The problem
One-way ANOVA can answer the question by comparing the
group means.
So the null and alternative hypotheses are:
H0: all group means in the population are equal
HA : at least two of the means are not equal
ANOVA is an extension of the comparison of 2 independent groups.
But the 2-group technique (repeated t-tests) cannot be used.
The problem
- If 5 groups are available, then 10 two-group t-tests would have to be performed.
- The high Type I error rate, resulting from the large number of
comparisons, means that we may draw incorrect conclusions.
Assumptions
Analysis of variance requires the following assumptions:
Independent random samples have been taken from each
population.
The populations are normal.
The population variances are all equal.
The ANOVA Table
The ANOVA table summarizes the calculations needed to test the main
hypothesis.

Sources   df     SS          MS
Factor    k - 1  SS(factor)  MS(factor) = SS(factor) / (k - 1)
Error     n - k  SS(error)   MS(error) = SS(error) / (n - k)
Total     n - 1  SS(total)

F = MS(factor) / MS(error)
Rationale
One-way ANOVA separates the total variability (SS(total)) in the
data into:
Differences between the individuals from the different groups
(between-group variation): SS(factor)
The random variation between the individuals within each group
(within-group variation): SS(error), also called unexplained variation
Rationale
These components of variation are measured using variances,
hence the name analysis of variance (ANOVA).
Under the null hypothesis that the group means are the same,
MS(factor) will be similar to MS(error).
The test is based on the ratio of these two variances.
If there are differences between-groups, then between-groups
variance will be larger than within-group variance.
Example
A new variable is created which combines diagnosis with
Hypertension and Diabetes together as follows:

Hypertension and Diabetes combined
                         Frequency  Percent  Valid Percent  Cumulative Percent
Valid   None             159        54.6     55.6           55.6
        Either HT or DM  80         27.5     28.0           83.6
        Both HT and DM   47         16.2     16.4           100.0
        Total            286        98.3     100.0
Missing System           5          1.7
Total                    291        100.0
Example
We would like to check whether the systolic blood pressure is the
same for the three groups defined by their HT and DM status.
Ho: Mean1 = Mean2 = Mean3
Ha: at least two of the means are not equal
Example
Descriptives: Systolic blood pressure
                 N    Mean    Std. Deviation  Std. Error  95% CI for Mean    Minimum  Maximum
None             155  144.52  32.789          2.634       (139.32, 149.73)   78       248
Either HT or DM  79   142.97  39.634          4.459       (134.10, 151.85)   56       257
Both HT and DM   47   146.55  36.360          5.304       (135.88, 157.23)   55       235
Total            281  144.43  35.319          2.107       (140.28, 148.57)   55       257
ANOVA: Systolic blood pressure
                Sum of Squares  df   Mean Square  F     Sig.
Between Groups  380.517         2    190.259      .152  .859
Within Groups   348908.2        278  1255.066
Total           349288.8        280
Conclusion
Since p = 0.859 > 0.05, we do not reject Ho: the average systolic
blood pressures for the three groups are not significantly different.
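The F statistic in the ANOVA output follows directly from the sums of squares and degrees of freedom; a few lines of Python reproduce it:

```python
# Sums of squares and degrees of freedom from the ANOVA output above
ss_factor, df_factor = 380.517, 2      # between groups, k - 1
ss_error, df_error = 348908.2, 278     # within groups, n - k

ms_factor = ss_factor / df_factor      # MS(factor)
ms_error = ss_error / df_error         # MS(error)
f_ratio = ms_factor / ms_error         # F = MS(factor) / MS(error)
print(round(f_ratio, 3))  # ≈ 0.152, matching the SPSS F value
```

An F ratio near 0 (well below the critical value) is exactly why the p-value is large here.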
Bivariate analyses
                        INDEPENDENT (exposure)
DEPENDENT (outcome)     2 levels          > 2 levels        Continuous
2 levels                χ² (chi-square)   χ² (chi-square)   t-test
> 2 levels              χ² (chi-square)   χ² (chi-square)   ANOVA
Continuous              t-test            ANOVA             Correlation / Linear regression
New scenario
If the dependent and independent variables are both continuous, then
we can't use the t-test, and we cannot use the chi-square.
Regression and Correlation
Describing association between two continuous variables
Scatterplot
Correlation coefficient
Simple linear regression
Correlation
Correlation
It is a measure of linear association
Called the Pearson correlation coefficient (r)
Ranges between:
+1.0 (perfect positive correlation)
-1.0 (perfect negative correlation)
Scatter plot and correlation
The Correlation Coefficient (r)
Measures the direction and strength of the linear association
between x and y
The correlation coefficient is between -1 and +1
r > 0: Positive association
r < 0: Negative association
r = 0: No association
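The coefficient itself is the covariance of x and y divided by the product of their spreads. A minimal pure-Python sketch (illustrative function name, hypothetical data):

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = math.sqrt(sum((xi - mx) ** 2 for xi in x))
    sy = math.sqrt(sum((yi - my) ** 2 for yi in y))
    return cov / (sx * sy)

# A perfect positive line gives r = +1, a perfect negative line r = -1
print(round(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]), 3))   # 1.0
print(round(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]), 3))   # -1.0
```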
[Example scatter plots with r = 0.01, r = 0.68, r = 0.98, and r = -0.9]
Correlation in the Plasma Example
[Scatter plot: plasma volume (liters) vs. body weight (kg); r = 0.76]
Correlation
Study the association between Heart Rate and Systolic Blood
Pressure
Ho: Correlation = 0
Ha: Correlation ≠ 0
[Scatter plot of the association between heart rate at admission (x-axis, roughly 40-160) and systolic blood pressure (y-axis, roughly 50-250)]
Correlation: r = 0.190, P-value = 0.001

Correlations
                                       Systolic blood pressure  Heart Rate at admission
Systolic blood   Pearson Correlation   1                        .190**
pressure         Sig. (2-tailed)                                .001
                 N                     286                      285
Heart Rate at    Pearson Correlation   .190**                   1
admission        Sig. (2-tailed)       .001
                 N                     285                      286
**. Correlation is significant at the 0.01 level (2-tailed).
Significant correlation
Problem
Important to note that correlation measures the strength of linear
association.
There could be a strong non-linear relationship between y and x,
and r may not catch it.
[Plot: a strong curved relationship with r = 0]
Correlation Coefficient
Outliers can really affect the correlation coefficient.
One extreme point can change r sizably.
[Plot: a single outlier inflating the correlation to r = .7]
Simple linear regression
Simple linear regression
Used to quantify the association between two variables
It is "simple" in that it has only 1 independent variable
The association is assumed to be linear in nature
Formula: Dependent = β0 + β1 (Independent)
The Equation of a Line
β0 and β1 are called regression coefficients
These two quantities are estimated by the least squares method
The intercept β0 is the estimated expected value of y when x is 0
The slope β1 is the estimated expected change in y corresponding
to a unit increase in x
The Slope
The slope β1 is the expected change in y corresponding to a unit
increase in x
β1 = 0: No association between y and x
β1 > 0: Positive association (as x increases, y tends to increase)
β1 < 0: Negative association (as x increases, y tends to decrease)
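The least squares estimates have simple closed forms; a minimal Python sketch (illustrative helper name, hypothetical data lying exactly on a line):

```python
def least_squares(x, y):
    """Estimate intercept b0 and slope b1 by ordinary least squares."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
          / sum((xi - mx) ** 2 for xi in x))
    b0 = my - b1 * mx  # the fitted line passes through (mean x, mean y)
    return b0, b1

# Hypothetical data lying exactly on y = 1 + 2x
b0, b1 = least_squares([0, 1, 2, 3], [1, 3, 5, 7])
print(b0, b1)  # 1.0 2.0
```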
The Equation of a Line
[Plot: the line y = b0 + b1x, with intercept b0 and slope b1 marked]
The Slope
[Plot: lines illustrating β1 > 0, β1 = 0, and β1 < 0]
Simple linear regression
Systolic blood pressure and age
Model Summary
Model  R      R Square  Adjusted R Square  Std. Error of the Estimate
1      .054a  .003      -.001              35.387
a. Predictors: (Constant), Age

Correlation: R = 0.054
Simple linear regression
Simple linear regression
Coefficients(a)
            Unstandardized B  Std. Error  Standardized Beta  t       Sig.
(Constant)  136.400           8.812                          15.479  .000
Age         .148              .162        .054               .910    .364
a. Dependent Variable: Systolic blood pressure
Simple linear regression:
SBP = 136.400 + 0.148 (Age)
If age = 0, then SBP = 136.400 + 0 = 136.400
As age increases by 1 year, SBP increases by 0.148 units
Simple Linear Regression
How do we decide if there is significant association between age
and SBP?
Hypothesis test
Ho: β1 = 0
Ha: β1 ≠ 0
SBP = β0 + β1 (Age)
If we reject Ho, then as age changes, SBP changes significantly
If Ho is not rejected, then as age changes, there is no effect on SBP
Multiple Linear Regression
The important aspect of linear regression is that we can include
more than 1 independent variable
This is to control for the effect of another variable
Study the association between Age and SBP while controlling for
gender
SBP = β0 + β1 (Age) + β2 (Gender)
Multiple Linear Regression
Coefficients(a)
            Unstandardized B  Std. Error  Standardized Beta  t       Sig.
(Constant)  143.090           9.742                          14.688  .000
Age         .216              .171        .080               1.261   .208
Sex         -8.992            6.123       -.093              -1.469  .143
a. Dependent Variable: Systolic blood pressure
Multiple linear regression:
SBP = 143.090 + 0.216 (Age) - 8.992 (Gender)
As age increases by 1 year, SBP increases by 0.216 units,
after adjusting for gender
The difference in SBP between males and females is 8.992 units,
after adjusting for age
Choosing the right statistical test
Choosing a statistical test
Choosing the right statistical test depends on:
Nature of the data
Sample characteristics
Inferences to be made
Choosing a statistical test
A consideration of the nature of data includes:
Number of variables
not for entire study, but for the specific question at hand
Type of data
numerical, continuous
dichotomous, categorical information
Choosing a statistical test
A consideration of the sample characteristics includes:
Number of groups
Sample type
normal distribution (parametric) or not (non-parametric)
independent or dependent
Choosing a statistical test
A consideration of the inferences to be made includes:
Data represent the population
The group means are different
There is a relationship between variables
Choosing a statistical test
Before choosing a statistical test, ask:
How many variables?
How many groups?
Is the distribution of data normal?
Are the samples (groups) independent?
What is your hypothesis or research question?
Is the data continuous, ordinal, or categorical?
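The questions above feed a small decision grid (the one tabulated in the following slides). A sketch of that grid as a lookup table; the labels and function name are illustrative, not a standard API:

```python
# The decision grid, keyed by (dependent type, independent type)
TEST_GRID = {
    ("2 levels", "2 levels"): "chi-square",
    ("2 levels", "> 2 levels"): "chi-square",
    ("2 levels", "continuous"): "t-test",
    ("> 2 levels", "2 levels"): "chi-square",
    ("> 2 levels", "> 2 levels"): "chi-square",
    ("> 2 levels", "continuous"): "ANOVA",
    ("continuous", "2 levels"): "t-test",
    ("continuous", "> 2 levels"): "ANOVA",
    ("continuous", "continuous"): "correlation / linear regression",
}

def choose_test(dependent, independent):
    """Suggest a bivariate test given the two variable types."""
    return TEST_GRID[(dependent, independent)]

print(choose_test("continuous", "2 levels"))  # t-test
```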
Descriptive analyses
Type of variable          Measure
Categorical               Proportion (%)
Continuous (Normal)       Mean (SD)
Continuous (Not Normal)   Median, inter-quartile range
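This table maps directly onto Python's standard statistics module; a minimal sketch with hypothetical blood pressure readings (the `describe` helper and its `normal` flag are ours):

```python
import statistics

def describe(values, normal):
    """Mean (SD) for roughly normal data, else median and IQR."""
    if normal:
        return {"mean": statistics.mean(values),
                "sd": statistics.stdev(values)}
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    return {"median": statistics.median(values), "iqr": q3 - q1}

sbp = [120, 130, 125, 140, 135]  # hypothetical readings
print(describe(sbp, normal=True))   # mean 130, SD ≈ 7.91
print(describe(sbp, normal=False))
```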
Different types of statistics
Parametric vs non-parametric analyses
Parametric:
Assume the data follow a specific probability distribution
More powerful
Non-parametric:
Also called distribution-free
No distributional assumptions are required for the data
Less powerful, but more robust
Univariate analyses
Type of variable          Test
Categorical               Z (proportions)
Continuous (Normal)       t-test
Continuous (Not Normal)   n > 30: t-test; n < 30: Kolmogorov-Smirnov test
Bivariate analyses
Type of variable  2 levels     > 2 levels   Continuous
2 levels          Chi-square   Chi-square   t-test
> 2 levels        Chi-square   Chi-square   ANOVA
Continuous        t-test       ANOVA        Correlation / Linear regression
Bivariate analyses (non-parametric)
Type of variable  2 levels                         > 2 levels                       Continuous
2 levels          Fisher's test / McNemar's test   Fisher's test                    Mann-Whitney / Wilcoxon test
> 2 levels        Fisher's test                    Fisher's test                    Kruskal-Wallis / Friedman test
Continuous        Mann-Whitney / Wilcoxon test     Kruskal-Wallis / Friedman test   Correlation / Regression
Multivariate analyses
Type of outcome variable   Method
Categorical (2 levels)     Logistic regression
Categorical (> 2 levels)   Multinomial regression
Continuous                 Linear regression
Overview
Goal                          Measurement (Gaussian)        Ordinal or Measurement (Non-Gaussian)  Binomial                       Survival Time
Describe one group            Mean, SD                      Median, inter-quartile range           Proportion                     Kaplan-Meier survival curve
Compare two unpaired groups   Unpaired t test               Mann-Whitney test                      Fisher's test / Chi-square     Log-rank test or Mantel-Haenszel*
Compare two paired groups     Paired t test                 Wilcoxon test                          McNemar's test                 Conditional proportional hazards regression*
Compare three or more         One-way ANOVA                 Kruskal-Wallis test                    Chi-square test                Cox regression
unmatched groups
Compare three or more         Repeated-measures ANOVA       Friedman test                          Cochrane Q**                   Conditional proportional hazards regression*
matched groups
Quantify association          Pearson correlation           Spearman correlation                   Contingency coefficients**     -
between two variables
Predict value from another    Simple linear regression      Nonparametric regression**             Simple logistic regression*    Cox regression
measured variable
Predict value from several    Multiple linear regression*   -                                      Multiple logistic regression*  Cox regression
measured or binomial variables
Sample size calculation
Sample size and power calculation
Important step in designing a study
If it is not done, the sample size might be too small or too large:
If it is too small: the study lacks the precision to provide reliable answers
If it is too large: resources are wasted for minimal gain
Sample size and power calculation
This step addresses two questions:
How precise will my parameter estimates tend to be if I select a
particular sample size?
How big a sample do I need to attain a desirable level of precision?
Sample size and power calculation: example
A cross-sectional survey of the prevalence of diabetes (diagnosed
or undiagnosed) among native Americans would require a sample
size of 1421 to allow estimation of the prevalence within a
precision of 0.02 with 90% confidence, assuming a true
prevalence no larger than 30%.
Sample size and power calculation
Should be done at the DESIGN stage, i.e. before any data are collected
Drives the whole study
To determine the sample size:
Objectives should be clearly defined
Main exposure and outcome should be specified
The analysis plan should be clarified
Sample size and power calculation
Different equations are used:
Depends on:
Study design
Objectives (prevalence, risk, etc.)
Types of variables
Following is an example of sample size calculation for comparing
the means in two groups
Sample size and power calculation: example
A randomized clinical trial of a new drug treatment vs. placebo
for decreasing blood pressure would require 126 patients for a
two-sided test at α = 0.05 to provide 80% power to detect a 5%
difference in blood pressure.
Sample size calculation: comparing two means
N = 2 × SD² × (z(α/2) + z(β))² / Difference²
N = the number of subjects in each group
α = level of significance (type I error)
1 - β = power
Difference = Minimal significant difference
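Plugging standard normal quantiles into the formula above gives N per group. A minimal Python sketch; SD = 10 and a difference of 5 are hypothetical inputs, and the z values 1.96 and 0.84 correspond to the usual two-sided α = 0.05 and 80% power:

```python
import math

def n_per_group(sd, difference, z_alpha=1.96, z_beta=0.84):
    """N per group for comparing two means:
    N = 2 * SD^2 * (z_alpha + z_beta)^2 / Difference^2.
    Defaults: two-sided alpha = 0.05 (z = 1.96), 80% power (z = 0.84)."""
    n = 2 * sd ** 2 * (z_alpha + z_beta) ** 2 / difference ** 2
    return math.ceil(n)  # round up to whole subjects

# Hypothetical inputs: SD = 10 mmHg, minimal important difference = 5 mmHg
print(n_per_group(sd=10, difference=5))  # 63 per group
```

Doubling the detectable difference to 10 mmHg cuts the required N sharply, which is the trade-off the next slides walk through.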
Sample size calculation: comparing two means
N = the number of subjects in each group
↑ N → more power, or a smaller detectable difference
↓ N → less power, or a larger detectable difference
Sample size calculation: comparing two means
α = level of significance (type I error)
↑ α → more power, or smaller N
↓ α → less power, or larger N
Sample size calculation: comparing two means
1 - β = power
↑ (1 - β) → less type II error, or larger N
↓ (1 - β) → more type II error, or smaller N
Sample size calculation: comparing two means
Difference = Minimal significant difference
↑ Difference → larger power, or smaller N
↓ Difference → smaller power, or larger N
Sample size calculation: comparing two means
N = to be found
α = level of significance (type I error) = 0.05 or 5%
1 - β = power = 0.80 or 80%
Difference = Minimal significant difference
Thank you