Statistics Made Easy
Hani Tamim, MPH, PhD
Assistant Professor
Epidemiology and Biostatistics
Research Center / College of Medicine
King Saud bin Abdulaziz University for Health Sciences
Riyadh, Saudi Arabia
Objective of medical research
Is treatment A better than treatment B for patients with
hypertension?
What is the survival rate among ICU patients?
What is the incidence of Down syndrome among a certain
group of people?
Is the use of Oral Contraceptives associated with an increased risk
of breast cancer?
Research Process?
Planning
Design
Data collection
Analysis
Data entry
Data cleaning
Data management
Data analysis
Reporting
Statistics is used in every step of this process.
What is statistics?
Scientific methods for:
Collecting
Organizing
Summarizing
Presenting
Interpreting
data
Definition of some basic terms
Population: The largest collection of entities for which we have
interest at a particular time
Sample: A part of a population
Simple random sample: a sample of size n drawn from a
population of size N in such a way that every possible sample of
size n has the same chance of being selected
Definition of some basic terms
Variable: A characteristic of the subjects under observation that
takes on different values for different cases, for example: age,
gender, diastolic blood pressure
Quantitative variables: Are variables that can convey information
regarding amount
Qualitative variables: Are variables in which measurements
consist of categorization
Types of variables
Categorical variables
Continuous variables
Categorical variables
Nominal: unordered data
Death
Gender
Country of birth
Ordinal: Predetermined order among response classification
Education
Satisfaction
Continuous variables
Continuous: Not restricted to integers
Age
Weight
Cholesterol
Blood pressure
Steps involved (data)
Data collection
Database structure
Data entry
Data cleaning
Data management
Data analyses
Data collection
Data collection:
Collection of information that will be used to answer the research
question
Could be done through questionnaires, interviews, data abstraction,
etc.
Data collection
Database structure
Database structure:
Structure the database (using SPSS) into which the data will be
entered
Data entry
Data entry:
Entering the information (data) into the computer
Usually done manually
Single data entry
Double data entry
Data cleaning
Data cleaning:
Identify any data entry mistakes
Correct such mistakes
Data management
Data management:
Create new variables based on different criteria
Such as:
BMI
Recoding
Categorizing age (less than 50 years, and 50 years and above)
Etc.
Data analyses
Data analyses:
Descriptive statistics: are the techniques used to describe the main
features of a sample
Inferential statistics: is the process of using the sample statistic to
make informed guesses about the value of a population parameter
Data analyses
Data analyses:
Univariate analyses
Bivariate analyses
Multivariate analyses
Bottom line
There are different statistical methods
for different types of variables
Descriptive statistics: categorical variables
Frequency distribution
Graphical representation
Descriptive statistics: categorical variables
Frequency distribution
A frequency distribution lists, for each value (or small range of
values) of a variable, the number or proportion of times that
observation occurs in the study population
Descriptive statistics: categorical variables
Frequency distribution:
How to describe a categorical variable (marital status)?
Descriptive statistics: categorical variables
Construct a frequency distribution
Title
Values
Frequency
Relative frequency (percent)
Valid relative frequency (valid percent)
Cumulative relative frequency (cumulative percent)
Descriptive statistics: categorical variables
Marital status of the 291 patients admitted to the Emergency Department

                     Frequency   Percent   Valid Percent   Cumulative Percent
Valid     Married       266        91.4         94.7               94.7
          Single         13         4.5          4.6               99.3
          Widow           2          .7           .7              100.0
          Total         281        96.6        100.0
Missing   System         10         3.4
Total                   291       100.0
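The percent, valid percent, and cumulative percent columns above can be reproduced with a few lines of Python. This is an illustrative sketch using plain dictionaries (the deck itself uses SPSS); the counts are taken from the table above.

```python
# Frequency distribution for marital status (counts from the table above)
counts = {"Married": 266, "Single": 13, "Widow": 2}
missing = 10
total = sum(counts.values()) + missing   # 291 patients in all
valid_total = sum(counts.values())       # 281 with a recorded status

percent = {k: round(100 * v / total, 1) for k, v in counts.items()}
valid_percent = {k: round(100 * v / valid_total, 1) for k, v in counts.items()}

# Cumulative percent accumulates over the valid categories in table order
cumulative, running = {}, 0.0
for k, v in counts.items():
    running += 100 * v / valid_total
    cumulative[k] = round(running, 1)

print(percent)        # {'Married': 91.4, 'Single': 4.5, 'Widow': 0.7}
print(valid_percent)  # {'Married': 94.7, 'Single': 4.6, 'Widow': 0.7}
print(cumulative)     # {'Married': 94.7, 'Single': 99.3, 'Widow': 100.0}
```

Note the distinction the table makes: percent divides by all 291 patients, valid percent only by the 281 with a recorded value.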
Example
Example: summarizing data
Descriptive statistics: categorical variables
Graphical representation
A graph lists, for each value (or small range of values) of a variable,
the number or proportion of times that observation occurs in the
study population
Descriptive statistics: categorical variables
Graphical representation:
Two types
Bar chart
Pie chart
Descriptive statistics: categorical variables
Construct a bar or pie chart
Title
Values
Frequency or relative frequency
Properly labelled axes
Descriptive statistics: categorical variables
Descriptive statistics: categorical variables
Descriptive statistics: continuous variables
Central tendency
Dispersion
Graphical representation
Descriptive statistics: continuous variables
How to describe a continuous variable (Systolic blood pressure)?
Central tendency:
Mean
Median
Mode
Descriptive statistics: continuous variables
Mean:
Add up data, then divide by sample size (n)
The sample size n is the number of observations (pieces of
data)
Example
n = 5 systolic blood pressures (mmHg)
X1 = 120, X2 = 80, X3 = 90, X4 = 110, X5 = 95

X̄ = (120 + 80 + 90 + 110 + 95) / 5 = 99 mmHg
Descriptive statistics: continuous variables
Formula

X̄ = (Σᵢ₌₁ⁿ Xᵢ) / n

The summation sign (Σ) is just mathematical shorthand for adding
up all of the observations:

Σᵢ₌₁ⁿ Xᵢ = X1 + X2 + X3 + ... + Xn
Descriptive statistics: continuous variables
Also called sample average or arithmetic mean X
Sensitive to extreme values
One data point could make a great change in sample mean
Uniqueness
Simplicity
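The mean calculation, and its sensitivity to extreme values, can be checked directly. A minimal Python sketch using the five blood pressures from the example:

```python
# Sample mean of the five systolic blood pressures from the example
pressures = [120, 80, 90, 110, 95]
n = len(pressures)
mean = sum(pressures) / n
print(mean)  # 99.0 mmHg

# One extreme value shifts the mean noticeably (sensitivity to outliers):
# replacing 120 by 200 moves the mean from 99 to 115
shifted = [200 if x == 120 else x for x in pressures]
print(sum(shifted) / n)  # 115.0
```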
Descriptive statistics: continuous variables
Median: is the middle number, or the number that cuts the data in
half
Sorted data: 80  90  95  110  120 → Median = 95
The sample median is not sensitive to extreme values
For example: If 120 became 200, the median would remain the
same, but the mean would change to 115.
Descriptive statistics: continuous variables
If the sample size is an even number:
80  90  95  110  120  125
Median = (95 + 110) / 2 = 102.5 mmHg
Descriptive statistics: continuous variables
Median: Formula
n odd: Median = the value in position (n + 1)/2
n even: Median = the mean of the two middle values (positions n/2 and n/2 + 1)
Properties:
Uniqueness
Simplicity
Not affected by extreme values
Descriptive statistics: continuous variables
Mode: Most frequently occurring number
Data: 80  90  95  95  120  125 → Mode = 95
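Python's standard library computes both measures directly; a short sketch using the example data from these slides:

```python
import statistics

# n odd: the median is the single middle value
pressures = [80, 90, 95, 110, 120]
print(statistics.median(pressures))  # 95

# n even: the median is the mean of the two middle values
even = [80, 90, 95, 110, 120, 125]
print(statistics.median(even))       # 102.5 = (95 + 110) / 2

# The mode is the most frequently occurring value
data = [80, 90, 95, 95, 120, 125]
print(statistics.mode(data))         # 95

# The median is robust: replacing 120 by 200 leaves it unchanged
print(statistics.median([80, 90, 95, 110, 200]))  # 95
```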
Descriptive statistics: continuous variables
Example:
Statistics — Systolic blood pressure
N: Valid = 286, Missing = 5
Mean = 144.13, Median = 144.50, Mode = 155
Descriptive statistics: continuous variables
Central tendency measures do not tell the whole story
Example:
Group A: 21  22  23  23  23  24  24  25  28
Mean = 213/9 = 23.6, Median = 23
Group B: 15  18  21  21  23  25  25  32  33
Mean = 213/9 = 23.6, Median = 23
Descriptive statistics: continuous variables
How to describe a continuous variable (Systolic blood pressure)
in addition to central tendency?
Measures of dispersion:
Range
Variance
Standard Deviation
Descriptive statistics: continuous variables
Range
Range = Maximum − Minimum
Example: X1 = 120, X2 = 80, X3 = 90, X4 = 110, X5 = 95
Range = 120 − 80 = 40
Descriptive statistics: continuous variables
Sample variance (s² or var)
The sample variance is the average of the squared
deviations about the sample mean:

s² = [Σᵢ₌₁ⁿ (Xᵢ − X̄)²] / (n − 1)

Sample standard deviation (s or SD)
It is the square root of the variance:

s = √( [Σᵢ₌₁ⁿ (Xᵢ − X̄)²] / (n − 1) )
Descriptive statistics: continuous variables
Example: n = 5 systolic blood pressures (mmHg)
Recall, from earlier: mean = 99 mmHg
X1 = 120, X2 = 80, X3 = 90, X4 = 110, X5 = 95

Σᵢ₌₁ⁿ (Xᵢ − X̄)² = (120 − 99)² + (80 − 99)² + (90 − 99)²
                 + (110 − 99)² + (95 − 99)² = 1020
Descriptive statistics: continuous variables
Sample variance:

s² = 1020 / (5 − 1) = 255

Sample standard deviation (SD):

s = √s² = √255 = 15.97 mmHg
Descriptive statistics: continuous variables
The bigger s, the more variability
s measures the spread about the mean
s can equal 0 only if there is no spread
All n observations have the same value
The units of s are the same as the units of the data (for example,
mmHg)
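The variance and SD calculation above takes only a few lines of Python; this sketch reproduces the worked example step by step:

```python
import math

pressures = [120, 80, 90, 110, 95]
n = len(pressures)
mean = sum(pressures) / n                     # 99 mmHg

# Sum of squared deviations about the sample mean
ss = sum((x - mean) ** 2 for x in pressures)  # 1020

variance = ss / (n - 1)                       # s^2 = 1020 / 4 = 255
sd = math.sqrt(variance)                      # s ~ 15.97 mmHg
print(variance, round(sd, 2))  # 255.0 15.97
```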
Descriptive statistics: continuous variables
Example:
Statistics — Systolic blood pressure
N: Valid = 286, Missing = 5
Mean = 144.13, Median = 144.50, Mode = 155
Std. Deviation = 35.312, Variance = 1246.916
Range = 202, Minimum = 55, Maximum = 257
Example: summarizing data
Descriptive statistics: continuous variables
Graphical representation:
Different types
Histogram
Descriptive statistics: continuous variables
Construct a chart
Title
Values
Frequency or relative frequency
Properly labelled axes
Descriptive statistics: continuous variables
Shapes of the Distribution
Three common shapes of frequency distributions:
Symmetrical and bell-shaped
Positively skewed, or skewed to the right
Negatively skewed, or skewed to the left
Shapes of Distributions
Symmetric (right and left sides are mirror images)
Left tail looks like right tail
Mean = Median = Mode
Shapes of Distributions
Left skewed (negatively skewed)
Long left tail
Mean < Median
(order along the axis: Mean, Median, Mode)
Shapes of Distributions
Right skewed (positively skewed)
Long right tail
Mean > Median
(order along the axis: Mode, Median, Mean)
Shapes of the Distribution
Three less common shapes of frequency distributions:
A: Bimodal
B: Reverse J-shaped
C: Uniform
Probability
Probability
Definition:
The likelihood that a given event will occur
It ranges between 0 and 1:
0 means the event cannot occur
1 means the event is certain to occur
How do we calculate it?
Frequentist Approach:
Probability is the long-run relative frequency
Thus, it is an idealization based on imagining what would
happen to the relative frequencies in an indefinitely long
series of trials
Application in medicine
How does probability apply in medicine?
Probability is the most important theory behind biostatistics
It is used at different levels
Descriptive
Example: 4% chance of a patient dying after admission to
emergency department (from the previous example)
What do we mean?
Out of each 100 patients admitted to the emergency department, 4
will die, whereas 96 will be discharged alive
Example: 1 in 1000 babies are born with a certain abnormality!
Incidence and prevalence
Associations
Example: the association between cigarette smoking and death
after admission to the emergency department with an MI
Current Cigarette Smoking in association with death at discharge (counts)

                           Death   Discharged   Total
Current smoking    No         5        123        128
                   Yes        5        154        159
Total                        10        277        287

Probability of being a smoker = 159 / 287 = 55.4%
Probability of dying if a smoker = 5 / 159 = 3.1%
Probability of dying if a non-smoker = 5 / 128 = 3.9%
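These (conditional) probabilities can be checked directly from the table counts; a minimal sketch:

```python
# Counts from the smoking-by-death table above
deaths = {"No": 5, "Yes": 5}         # deaths among non-smokers / smokers
row_totals = {"No": 128, "Yes": 159} # row totals by smoking status
grand_total = 287

p_smoker = row_totals["Yes"] / grand_total       # marginal probability
p_death_smoker = deaths["Yes"] / row_totals["Yes"]       # conditional
p_death_nonsmoker = deaths["No"] / row_totals["No"]      # conditional

print(round(100 * p_smoker, 1))          # 55.4 (% of patients who smoke)
print(round(100 * p_death_smoker, 1))    # 3.1
print(round(100 * p_death_nonsmoker, 1)) # 3.9
```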
Associations
Same is applied to:
Relative risk
Risk difference
Attributable risk
Odds ratio
Etc..
Bottom line
Probability is applied at all levels of statistical analyses
Probability distributions
Probability distributions list or describe probabilities for all possible
occurrences of a random variable
There are two types of probability distributions:
Categorical distributions
Continuous distributions
Probability distributions: categorical variables
Categorical variables
Frequency distribution
Other distributions, such as binomial
Probability distributions: continuous variables
Continuous variables
Continuous distribution
Such as Z and t distributions
Normal Distribution
Properties of a Normal Distribution
Also called Gaussian distribution
A continuous, bell-shaped, symmetrical distribution; both
tails extend to infinity
The mean, median, and mode are identical
The shape is completely determined by the mean and
standard deviation
Normal Distribution
A normal distribution can have any mean (μ) and any standard deviation (σ):
e.g.: Age: μ = 40, σ = 10
The area under the curve represents 100% of all the observations
The mean, median, and mode coincide at the center
Normal Distribution
Age distribution for a specific population (Mean = 40, SD = 10)
[Figure: 50% of the area lies below the mean and 50% above]
[Figure: shaded area below Age = 25]
Normal distribution
The formula used to calculate the area below a certain point in a
normal distribution is the probability density function of the
normal distribution with mean μ and variance σ²:

f(x) = (1 / (σ√(2π))) exp( −(x − μ)² / (2σ²) )
Normal distribution
Thus, for any normal distribution, once we have the mean and sd,
we can calculate the percentage of subjects:
Above a certain level
Below a certain level
Between different levels
But the problem is:
Calculation is very complicated and time consuming, so:
Standardized Normal Distribution
We standardize to a normal distribution
What does this mean?
For a specific distribution, we calculate all possible probabilities,
and record them in a table
A normal distribution with μ = 0, σ = 1 is called a Standardized
Normal Distribution
Standardized Normal Distribution
Mean=0
SD=1
Area under the Normal Curve from 0 to Z
(abridged; the full table runs z = 0.00 to 4.09 in steps of 0.01)

z       0.00      0.01      0.05      0.09
0.0   0.00000   0.00399   0.01994   0.03586
0.5   0.19146   0.19497   0.20884   0.22240
1.0   0.34134   0.34375   0.35314   0.36214
1.5   0.43319   0.43448   0.43943   0.44408
2.0   0.47725   0.47778   0.47982   0.48169
2.5   0.49379   0.49396   0.49461   0.49520
3.0   0.49865   0.49869   0.49886   0.49900
4.0   0.49997   0.49997   0.49997   0.49998
Standardized Normal Distribution
Normal Distribution (Mean = μ, SD = σ)
TRANSFORM: Z = (x − μ) / σ
Standardized Normal Distribution (Z) (Mean = 0, SD = 1)
Standardized Normal Distribution
Normal Distribution (Mean = 40, SD = 10)
TRANSFORM: Z(40) = (x − μ) / σ = (40 − 40) / 10 = 0
Standardized Normal Distribution (Z) (Mean = 0, SD = 1)
Standardized Normal Distribution
Normal Distribution (Mean = 40, SD = 10), x = 30
TRANSFORM: Z(30) = (x − μ) / σ = (30 − 40) / 10 = −1
Standardized Normal Distribution (Z) (Mean = 0, SD = 1)
Standardized Normal Distribution: summary
For any normal distribution, we can
Transform the values to the standardized normal distribution (Z)
Use the Z table to get the following areas
Above a certain level
Below a certain level
Between different levels
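The transform and the Z-table lookup can be checked numerically. This is an illustrative sketch (the slides use the printed table); the area from 0 to z can be computed from the error function in Python's standard library:

```python
import math

def z_score(x, mu, sigma):
    """Transform a raw value to the standardized normal scale."""
    return (x - mu) / sigma

def area_0_to_z(z):
    """Area under the standard normal curve between 0 and z,
    as tabulated in the Z table (computed via the error function)."""
    return 0.5 * math.erf(z / math.sqrt(2))

# Age example from the slides: mean = 40, SD = 10
print(z_score(30, 40, 10))               # -1.0
print(round(area_0_to_z(1.0), 5))        # 0.34134, matching the table
print(round(0.5 - area_0_to_z(1.0), 4))  # P(Z > 1) ~ 0.1587
```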
Normal Distribution
Age distribution for a specific population (Mean = 40, SD = 10):
Mean ± 1 SD = (30, 50) contains 68% of the observations
Mean ± 2 SD = (20, 60) contains 95% of the observations
Mean ± 3 SD = (10, 70) contains 99.7% of the observations
Practical example
The 68-95-99.7 Rule for the Normal Distribution
68% of the observations fall within one standard deviation of the
mean
95% of the observations fall within two standard deviations of the
mean
99.7% of the observations fall within three standard deviations of
the mean
When applied to real data, these estimates are considered
approximate!
Distributions of Blood Pressure
[Figure: the 68-95-99.7 rule applied to the distribution of systolic
blood pressure in men; Mean = 125 mmHg, s = 14 mmHg — 68% within
111–139, 95% within 97–153, 99.7% within 83–167]
Data analyses
Data analyses:
Descriptive statistics: are the techniques used to describe the main
features of a sample
Inferential statistics: is the process of using the sample statistic to
make informed guesses about the value of a population parameter
Why do we carry out research?
population
sample
Inference: Drawing
conclusions on certain
questions about a
population from sample data
Inferential statistics
Since we are not taking the whole population, we have to draw
conclusions on the population based on results we get from the
sample
Simple example: Say we want to estimate the average systolic
blood pressure for patients admitted to the emergency department
after having an MI
Other more complicated measures might be quality of life,
satisfaction with care, risk of outcome, etc.
Inferential statistics
What do we do?
Take a sample (n=291) of patients admitted to emergency
department in a certain hospital
Calculate the mean and SD (descriptive statistics) of systolic blood
pressure
Statistics — Systolic blood pressure
N: Valid = 286, Missing = 5
Mean = 144.13, Std. Deviation = 35.312
Inferential statistics
The next step is to make a link between the estimates we observed
from the sample and those of the underlying population (inferential
statistics)
What can we say about these estimates as compared to the
unknown true ones???
In other words, we are trying to estimate the average systolic blood
pressure for ALL patients admitted to the emergency department
after an MI
Inferential statistics
Sample data
N=291
Mean=144
SD=35
Inference
In statistical inference we usually encounter TWO issues
Estimate value of the population parameter. This is done through
point estimate and interval estimate (Confidence Interval)
Evaluate a hypothesis about a population parameter rather than
simply estimating it. This is done through tests of significance
known as hypothesis testing (P-value)
1- Confidence Interval
Confidence Intervals
A point estimate:
A single numerical value used to estimate a population parameter.
Interval estimate:
Consists of 2 numerical values defining a range of values that with
a specified degree of confidence includes the parameter being
estimated.
(Usually interval estimate with a degree of 95% confidence is
used)
Example
What is the average systolic blood pressure for patients admitted
to emergency departments after an MI?
Select a sample
Point estimate = mean = 144
Interval estimate = 95% CI = (140, 148)
95% Confidence Interval: x̄ ± z₁₋α/₂ × SE
- Upper limit = x̄ + z₁₋α/₂ × SE
- Lower limit = x̄ − z₁₋α/₂ × SE
= 144 ± 1.96 × 35 / √291
Sampling distribution of the mean (N = 291)
[Figure: 95% of sample means fall within μ ± 2 SE]
Standard error
Standard error = SD / √n
As the sample size increases, the standard error decreases
The estimation, as measured by the confidence interval, will be
better, i.e., a narrower confidence interval
Interpretation
95% Confidence Interval
There is 95% probability that the true parameter is within the
calculated interval
Thus, if we repeat the sampling procedure 100 times, the above
statement will be:
correct 95 times (the true parameter is within the interval)
wrong 5 times (the true parameter is outside the interval) (also
called α, the error rate)
Notes on Confidence Intervals
Interpretation
It provides the level of confidence of the value for the population
average systolic blood pressure
Are all CIs 95%?
No
It is the most commonly used
A 99% CI is wider
A 90% CI is narrower
Notes on Confidence Intervals
To be more confident you need a bigger interval
For a 99% CI, you need 2.6 SEM
For a 95% CI, you need 2 SEM
For a 90% CI, you need 1.65 SEM
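The confidence-interval arithmetic can be sketched in a few lines of Python. Note the slides quote n = 291 in one place and the valid N = 286 elsewhere; this illustrative sketch uses the valid N = 286, which reproduces the SPSS standard error of 2.088:

```python
import math

# 95% CI for the mean systolic blood pressure (values from the example)
mean, sd, n = 144.13, 35.312, 286
se = sd / math.sqrt(n)              # standard error of the mean

z = 1.96                            # z multiplier for 95% confidence
lower, upper = mean - z * se, mean + z * se
print(round(se, 3))                      # ~ 2.088
print(round(lower, 1), round(upper, 1))  # ~ 140.0 148.2

# A 99% CI uses z ~ 2.58 (wider); a 90% CI uses z ~ 1.65 (narrower)
```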
2- P-value
Inference
P-value
Is related to another type of inference
Hypothesis testing
Evaluate a hypothesis about a population parameter rather than
simply estimating it
Hypothesis testing
Back to our previous example
We want to make inference about the average systolic blood
pressure of patients admitted to emergency department after MI
Assume that the normal systolic blood pressure is 120
The question is whether the average systolic blood pressure for
patients admitted to emergency departments is different than the
normal, which is 120
Hypothesis testing
Two types of hypotheses:
Null hypothesis: is a statement consistent with no difference
Alternative hypothesis: is a statement that disagrees with the null
hypothesis, and is consistent with presence of difference
The logic of hypothesis testing
To decide which of the hypotheses is true
Take a sample from the population
If the data are consistent with the null hypothesis, then we do not
reject the null hypothesis (conclusion = no difference)
If the sample data are not consistent with the null hypothesis, then
we reject the null (conclusion = difference)
Hypothesis testing
Example: is the systolic blood pressure for patients admitted to
emergency department after an MI normal (ie =120)?
Ho: μ = 120
Ha: μ ≠ 120
How do we answer this question?
We take a sample and find that the mean is 144 mmHg
Can we consider that 144 is consistent with the normal value
(120 mmHg)?
Hypothesis testing
N = 291; sample mean = 144; Ho: μ = 120
[Figure: when the sampling distribution under Ho is wide, a mean of
144 looks consistent with the null hypothesis — but is it still
consistent when the distribution is narrow?]
Hypothesis testing
N = 291; Ho: μ = 120
[Figure: under Ho, 95% of sample means fall within μ ± 2 SE, with
2.5% in each tail]
Test statistic
It is the statistic used for deciding whether the null hypothesis
should be rejected or not
Used to calculate the probability of getting the observed results if
the null hypothesis is true.
This probability is called the p-value.
How to decide
We calculate the probability of obtaining a sample with mean of
144 if the true mean is 120 due to chance alone (p-value)
Based on p-value we make our decision:
If the p-value is low then this is taken as evidence that it is unlikely
that the null hypothesis is true, then we reject the null hypothesis (we
accept alternative one)
If the p-value is high, it indicates that most probably the null
hypothesis is true, and thus we do not reject the Ho
Problem!
We could be making the wrong decisions
Decision
Do not reject Ho
Reject Ho
Ho True
Ho False
Correct decision
Type II error
Type I error
Correct decision
Type I error: is rejecting the null hypothesis when it is true
Type II error: is not rejecting the null hypothesis when it is false
Error
Type I error:
Referred to as α
Probability of rejecting a true null hypothesis
Type II error:
Referred to as β
Probability of accepting a false null hypothesis
Power:
Represented by 1 − β
Probability of correctly rejecting a false null hypothesis
Significance level
The significance level, α, of a hypothesis test is defined as the
probability of making a type I error, that is, the probability of
rejecting a true null hypothesis
It could be set to any value, as:
0.05
0.01
0.1
Statistical significance
If the p-value is less than some pre-determined cutoff (e.g. 0.05),
the result is called statistically significant
This cutoff is the α-level
The α-level is the probability of a type I error
It is the probability of falsely rejecting H0
Back to the example
To test whether the average systolic blood pressure for patients
admitted to the emergency department after an MI is different
than 120 (which is the normal blood pressure)
We carry out a test called the one-sample t-test, which provides a p-value based on which we accept or reject the null hypothesis.
Back to the example
One-Sample Statistics
Systolic blood pressure: N = 286, Mean = 144.13,
Std. Deviation = 35.312, Std. Error Mean = 2.088

One-Sample Test (Test Value = 120)
Systolic blood pressure: t = 11.558, df = 285, Sig. (2-tailed) = .000,
Mean Difference = 24.133, 95% CI of the Difference = (20.02, 28.24)
Since p-value is less than 0.05, then the conclusion will be that the
systolic blood pressure for patients admitted to emergency
department after an MI is significantly higher than the normal
value which is 120
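The one-sample t statistic in the output can be reproduced by hand from the summary numbers. An illustrative sketch (the tiny gap from the printed 11.558 comes from rounding of the inputs):

```python
import math

# One-sample t-test by hand (values from the SPSS output above)
mean, sd, n = 144.13, 35.312, 286
mu0 = 120                  # hypothesized value under H0

se = sd / math.sqrt(n)     # standard error of the mean, ~ 2.088
t = (mean - mu0) / se
print(round(t, 2))         # ~ 11.56, close to the printed 11.558

# |t| is far beyond the ~1.97 cutoff for df = 285 at alpha = 0.05,
# so the p-value is < 0.05 and we reject H0
```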
p-values
p-values are probabilities (numbers between 0 and 1)
Small p-values mean that the sample results are unlikely when the
null is true
The p-value is the probability of obtaining a result as extreme or
more extreme than the one observed, by chance alone, assuming the
null hypothesis H0 is true
t-distribution
The t-distribution looks like a standard normal curve
A t-distribution is determined by its degrees of freedom (n-1), the
lower the degrees of freedom, the flatter and fatter it is
[Figure: standard normal curve overlaid with t-distributions for 35
and 15 degrees of freedom]
Percentiles of the t-distribution
(abridged; the full table covers df = 1–20, 100, 120, and ∞,
for percentiles 75% to 99.95%)

df       90%      95%     97.5%    99.5%
1       3.078    6.314    12.71    63.66
5       1.476    2.015    2.571    4.032
10      1.372    1.812    2.228    3.169
20      1.325    1.725    2.086    2.845
120     1.289    1.658    1.980    2.617
∞       1.282    1.645    1.960    2.576
Hypothesis Testing
Different types of hypotheses:
Mean (a) = Mean (b)
Proportion (a) = Proportion (b)
Variance (a) = Variance (b)
OR = 1
RR = 1
RD = 0
Test of homogeneity
Etc..
Example
Comparing two means: paired testing
In the previous example, is the heart rate at admission different
than the heart rate at discharge among the patients admitted to the
emergency department after an MI?
Statistics
                           N Valid   Missing    Mean    Std. Deviation
Heart Rate at admission      286        5       82.64       22.598
Heart Rate at discharge       77      214       76.99       17.900
Is this decrease in heart rate statistically significant?
Thus, we have to make inference.
Comparing two means: paired testing
What type of test to be used?
Since the measurements of the heart rate at admission and at
discharge are dependent on each other (not independent), another
type of test is used
Paired t-test
Comparing two means: paired testing
Paired Samples Statistics (Pair 1)
Heart Rate at admission: Mean = 81.16, N = 75,
Std. Deviation = 23.546, Std. Error Mean = 2.719
Heart Rate at discharge: Mean = 76.72, N = 75,
Std. Deviation = 17.973, Std. Error Mean = 2.075

Paired Samples Test
(Pair 1: Heart Rate at admission − Heart Rate at discharge)
Mean = 4.440, Std. Deviation = 25.302, Std. Error Mean = 2.922,
95% CI of the Difference = (−1.381, 10.261),
t = 1.520, df = 74, Sig. (2-tailed) = .133
95% CI ≈ 4.4 ± 1.96 × 2.9
H0: μ(admission) − μ(discharge) = 0
HA: μ(admission) − μ(discharge) ≠ 0
P-value = 0.133, thus no significant difference
How Are p-values Calculated?
t = (sample mean − μ0) / SEM

t = 4.4 / 2.9 = 1.52
The value t = 1.52 is called the test statistic
Then we can compare the t-value in the table and get the
p-value, or get it from the computer (0.13)
Interpreting the p-value
The p-value in the example is 0.133
Interpretation: If there is no difference in heart rate between
admission and discharge, then the chance of finding a mean difference
as extreme or more extreme than 4.4 in a sample of 75 pairs is 0.133
Thus, this probability is big (bigger than 0.05), which leads to saying
that the difference of 4.4 is due to chance
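The paired test statistic follows directly from the summary values in the output; a minimal sketch:

```python
# Paired t-test statistic by hand (values from the paired output above)
mean_diff = 4.440   # mean of the admission - discharge differences
sd_diff = 25.302    # SD of those differences
n = 75              # number of complete pairs

se = sd_diff / n ** 0.5     # standard error of the mean difference
t = mean_diff / se
print(round(se, 3))  # ~ 2.922
print(round(t, 2))   # ~ 1.52 -> p = 0.133, not significant at 0.05
```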
Notes
How to decide on significance from the 95% CI?
3 scenarios
[Figure: three 95% confidence intervals plotted on a scale from −15
to 15 — an interval that excludes 0 indicates a significant
difference; one that includes 0 does not]
Comparing two means: Independent sample testing
In the previous example, is the systolic blood pressure different
between males and females among the patients admitted to the
emergency department after an MI?
Group Statistics — Systolic blood pressure
Male: N = 240, Mean = 145.05, Std. Deviation = 35.162, Std. Error Mean = 2.270
Female: N = 44, Mean = 138.64, Std. Deviation = 35.753, Std. Error Mean = 5.390
Is this difference in systolic blood pressure statistically significant?
Thus, we have to make inference.
Comparing two means: Independent sample testing
Null hypothesis:
Ho: Mean SBP(Males) = Mean SBP (Females)
Ho: Mean SBP (Males) - Mean SBP (Females) = 0
Alternative hypothesis:
Ha: Mean SBP(Males) ≠ Mean SBP(Females)
Ha: Mean SBP(Males) − Mean SBP(Females) ≠ 0
Comparing two means: Independent sample testing
Thus, we carry out a test called: independent samples t-test
Formula to use: t = (mean₁ − mean₂) / SE(mean₁ − mean₂), where the standard error is computed with either a pooled (equal variances) or separate (unequal variances) variance estimate
Comparing two means: Independent sample testing
What we need to know is that we can calculate a p-value out of the
t-test (based on the t-distribution)
Based on this p-value, make the decision:
P-value > 0.05: do not reject the null (the two means are equal)
P-value < 0.05: reject the null (the two means are different)
Comparing two means: Independent sample testing

Group Statistics — Systolic blood pressure
Male: N = 240, Mean = 145.05, Std. Deviation = 35.162, Std. Error Mean = 2.270
Female: N = 44, Mean = 138.64, Std. Deviation = 35.753, Std. Error Mean = 5.390

Independent Samples Test — Systolic blood pressure
Levene's Test for Equality of Variances: F = .044, Sig. = .835
t-test for Equality of Means:
Equal variances assumed: t = 1.109, df = 282, Sig. (2-tailed) = .269,
Mean Difference = 6.409, Std. Error Difference = 5.781,
95% CI of the Difference = (−4.970, 17.789)
Equal variances not assumed: t = 1.096, df = 59.267, Sig. (2-tailed) = .278,
Mean Difference = 6.409, Std. Error Difference = 5.848,
95% CI of the Difference = (−5.292, 18.111)

Two formulas for calculation of the t-test:
1- when variances are equal
2- when variances are not equal
To know which one to use, test the variances (Levene's test):
Ho: variance(males) = variance(females)
Ha: variance(males) ≠ variance(females)
1- If p-value > 0.05, then variances are equal
2- If p-value < 0.05, then variances are not equal
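The unequal-variances (Welch) version of the statistic can be reproduced from the group summaries; an illustrative sketch:

```python
import math

# Welch's (unequal-variance) t statistic from the group summaries above
m1, s1, n1 = 145.05, 35.162, 240   # males
m2, s2, n2 = 138.64, 35.753, 44    # females

# Standard error of the difference with separate variance estimates
se = math.sqrt(s1**2 / n1 + s2**2 / n2)
t = (m1 - m2) / se
print(round(se, 3))  # ~ 5.848, the "equal variances not assumed" row
print(round(t, 3))   # ~ 1.096 -> p = 0.278, not significant
```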
Example
T-test
Ho: Mean1 = Mean2
Ha: Mean1 ≠ Mean2
T-test: P-value = 0.89
No significant difference
Chi square
Example
In the MI example, we would like to check if hypertension is
associated with gender.
In other words, are males at higher or lower risk of having
hypertension?
Sex * Hypertension Crosstabulation (Count)
         Hypertension
Sex      No    Yes   Total
Male     191   52    243
Female   24    20    44
Total    215   72    287
Example
Sex * Hypertension Crosstabulation
                        Hypertension
Sex                     No      Yes     Total
Male    Count           191     52      243
        % within Sex    78.6%   21.4%   100.0%
Female  Count           24      20      44
        % within Sex    54.5%   45.5%   100.0%
Total   Count           215     72      287
        % within Sex    74.9%   25.1%   100.0%
Example
To answer the question, we do a hypothesis test:
H0: P1 = P2    (P1 - P2 = 0)
Ha: P1 ≠ P2    (P1 - P2 ≠ 0)
(Pearson's) Chi-Square Test (χ²)
Calculation is easy (can be done by hand)
Works well for big sample sizes
Can be extended to compare proportions between more than two
independent groups in one test
The Chi-Square Approximate Method
χ² = Σ (O - E)² / E, summed over the 4 cells
Looks at discrepancies between observed and expected cell counts
Expected refers to the values for the cell counts that would be
expected if the null hypothesis is true
O = observed
E = expected = (row total × column total) / grand total
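The O-versus-E bookkeeping above is easy to sketch in code. A minimal Python illustration (the function name is ours, not a library API); feeding it the Sex × Hypertension counts reproduces the Pearson chi-square reported in the SPSS output:

```python
def chi_square(table):
    """Pearson chi-square statistic for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # E = row total x column total / grand total
            expected = row_totals[i] * col_totals[j] / grand
            stat += (observed - expected) ** 2 / expected
    return stat

# Sex x Hypertension counts from the crosstabulation above
observed = [[191, 52],   # males: no hypertension, hypertension
            [24, 20]]    # females
print(round(chi_square(observed), 3))  # ≈ 11.471
```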
The Chi-Square Approximate Method
The distribution of this statistic when the null is true is a chi-square
distribution with one degree of freedom
We can use this to determine how likely it was to get such a big
discrepancy between the observed and expected by chance alone
Distribution: Chi-Square with One Degree of Freedom
[Plot of the chi-square density; the critical value χ² = 3.84 corresponds to p = 0.05]
Example of Calculations of Chi-Square: 2x2 Contingency Table
Test statistic: χ² = Σ (O - E)² / E, summed over the 4 cells
Chi-Square Tests
                              Value     df   Asymp. Sig. (2-sided)  Exact Sig. (2-sided)  Exact Sig. (1-sided)
Pearson Chi-Square            11.471b   1    .001
Continuity Correction(a)      10.227    1    .001
Likelihood Ratio              10.366    1    .001
Fisher's Exact Test                                                 .001                  .001
Linear-by-Linear Association  11.431    1    .001
N of Valid Cases              287
a. Computed only for a 2x2 table
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 11.04.
χ² = 11.471
[Plot: sampling distribution, chi-square with one degree of freedom, with the observed value 11.471 far in the upper tail]
Example
The critical value that corresponds to 95% confidence (5% error) with 1 degree of freedom is 3.841.
Thus we reject Ho, since 11.471 > 3.841.
We conclude that Ho is false and that there is a relationship
between gender and diagnosis with hypertension
The p-value is = 0.001
Chi-square
Ho: Proportion1 = Proportion2
Ha: Proportion1 ≠ Proportion2
ChiSquare: P-value = 0.96
No significant difference
Relative Risk (RR):
Study the association between Vioxx use and Myocardial Infarction

         MI
Drug     Yes   No
Vioxx    71    52
Placebo  29    48

Ho: RR = 1
Ha: RR ≠ 1
RR = 1.5, 95% CI = (1.1 - 1.9), p-value = 0.01
Significant association
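The RR itself is just a ratio of the two risks. A minimal Python sketch (the log-scale confidence interval is the standard approximation, an assumption on our part; the slide quotes its own rounded interval):

```python
import math

def relative_risk(a, b, c, d):
    """RR for a 2x2 table: exposed row (a events, b non-events),
    unexposed row (c events, d non-events)."""
    rr = (a / (a + b)) / (c / (c + d))
    # Approximate 95% CI on the log scale (standard method)
    se_log = math.sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))
    lo = math.exp(math.log(rr) - 1.96 * se_log)
    hi = math.exp(math.log(rr) + 1.96 * se_log)
    return rr, (lo, hi)

rr, ci = relative_risk(71, 52, 29, 48)  # Vioxx vs. placebo MI counts
print(round(rr, 2))  # ≈ 1.53: about 1.5 times the risk, as on the slide
```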
Notes
How to decide on significance from the 95% CI?
There are 3 scenarios: the interval lies entirely above the null value, entirely below it, or includes it (only the last is non-significant).
[Diagram: three confidence intervals positioned relative to the null value]
Example
We would like to check if there is an association between gender
and both Hypertension and diabetes combined.
Sex * Hypertension and Diabetes combined Crosstabulation
                        None    Either HT or DM  Both HT and DM  Total
Male    Count           145     67               28              240
        % within Sex    60.4%   27.9%            11.7%           100.0%
Female  Count           13      12               19              44
        % within Sex    29.5%   27.3%            43.2%           100.0%
Total   Count           158     79               47              284
        % within Sex    55.6%   27.8%            16.5%           100.0%
Ho: gender and combined HT/DM status are independent.
Ha: the two variables are not independent.
Ho: P1 = P2 = P3
Ha: not all the proportions are equal
Example
Conclusion
Sex * Hypertension and Diabetes combined Crosstabulation
                        None    Either HT or DM  Both HT and DM  Total
Male    Count           145     67               28              240
        % within Sex    60.4%   27.9%            11.7%           100.0%
Female  Count           13      12               19              44
        % within Sex    29.5%   27.3%            43.2%           100.0%
Total   Count           158     79               47              284
        % within Sex    55.6%   27.8%            16.5%           100.0%
Chi-Square Tests
                              Value     df   Asymp. Sig. (2-sided)
Pearson Chi-Square            28.691a   2    .000
Likelihood Ratio              24.336    2    .000
Linear-by-Linear Association  25.341    1    .000
N of Valid Cases              284
a. 0 cells (.0%) have expected count less than 5. The
minimum expected count is 7.28.
Example
The critical value that corresponds to 95% confidence (5% error) with 2 degrees of freedom is 5.991.
Thus we reject Ho, since 28.691 > 5.991.
We conclude that Ho is false and that there is a relationship
between gender and diagnosis with hypertension and/or diabetes
The p-value is < 0.0001
ANOVA
The problem
We have samples from a number of independent groups.
We have a single numerical or ordinal variable and are interested
in whether the values of the variable vary between the groups.
Example: Does systolic blood pressure vary among men of
different smoking status?
The problem
One-way ANOVA can answer the question by comparing the
group means.
So the null and alternative hypotheses are:
H0: all group means in the population are equal
HA : at least two of the means are not equal
ANOVA is an extension of the comparison of 2 independent groups.
But the 2-group technique (repeated t-tests) cannot be used.
The problem
- If 5 groups are available, then 10 two-group t-tests would have to be performed.
- The high Type I error rate, resulting from the large number of
comparisons, means that we may draw incorrect conclusions.
Assumptions
Analysis of variance requires the following assumptions:
Independent random samples have been taken from each
population.
The populations are normal.
The population variances are all equal.
The ANOVA Table
The ANOVA table summarizes the calculations needed to test the main
hypothesis.

Sources   df     SS          MS
Factor    k - 1  SS(factor)  MS(factor) = SS(factor) / (k - 1)
Error     n - k  SS(error)   MS(error) = SS(error) / (n - k)
Total     n - 1  SS(total)

F = MS(factor) / MS(error)
Rationale
One-way ANOVA separates the total variability (SS(total)) in the
data into:
Differences between the individuals from the different groups
(between-group variation): SS(factor)
The random variation between the individuals within each group
(within-group variation): SS(error), also called unexplained variation
Rationale
These components of variation are measured using variances,
hence the name analysis of variance (ANOVA).
Under the null hypothesis that the group means are the same,
MS(factor) will be similar to MS(error).
The test is based on the ratio of these two variances.
If there are differences between-groups, then between-groups
variance will be larger than within-group variance.
Example
A new variable is created which combines diagnosis with
Hypertension and Diabetes together as follows:

Hypertension and Diabetes combined
                         Frequency  Percent  Valid Percent  Cumulative Percent
Valid   None             159        54.6     55.6           55.6
        Either HT or DM  80         27.5     28.0           83.6
        Both HT and DM   47         16.2     16.4           100.0
        Total            286        98.3     100.0
Missing System           5          1.7
Total                    291        100.0
Example
We would like to check whether the systolic blood pressure is the
same for the three groups defined by their HT and DM status.
Ho: Mean1 = Mean2 = Mean3
Ha: at least two of the means are not equal
Example
Descriptives: Systolic blood pressure
                 N    Mean    Std. Deviation  Std. Error  95% CI for Mean    Minimum  Maximum
None             155  144.52  32.789          2.634       (139.32, 149.73)   78       248
Either HT or DM  79   142.97  39.634          4.459       (134.10, 151.85)   56       257
Both HT and DM   47   146.55  36.360          5.304       (135.88, 157.23)   55       235
Total            281  144.43  35.319          2.107       (140.28, 148.57)   55       257
ANOVA: Systolic blood pressure
                Sum of Squares  df   Mean Square  F     Sig.
Between Groups  380.517         2    190.259      .152  .859
Within Groups   348908.2        278  1255.066
Total           349288.8        280
Conclusion
Since p = 0.859 > 0.05, we do not reject Ho: the average systolic
blood pressures for the three groups are not significantly different.
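The F statistic in the ANOVA output follows directly from the sums of squares and degrees of freedom; a few lines of Python reproduce it:

```python
# Sums of squares and degrees of freedom from the ANOVA output above
ss_factor, df_factor = 380.517, 2      # between groups, k - 1
ss_error, df_error = 348908.2, 278     # within groups, n - k

ms_factor = ss_factor / df_factor      # MS(factor)
ms_error = ss_error / df_error         # MS(error)
f_ratio = ms_factor / ms_error         # F = MS(factor) / MS(error)
print(round(f_ratio, 3))  # ≈ 0.152, matching the SPSS F value
```

An F ratio near 0 (well below the critical value) is exactly why the p-value is large here.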
Bivariate analyses
                        INDEPENDENT (exposure)
DEPENDENT (outcome)     2 levels          > 2 levels        Continuous
2 levels                χ² (chi-square)   χ² (chi-square)   t-test
> 2 levels              χ² (chi-square)   χ² (chi-square)   ANOVA
Continuous              t-test            ANOVA             Correlation / Linear regression
New scenario
If the dependent and independent variables are both continuous, then
we can't use the t-test, and we cannot use the chi-square.
Regression and Correlation
Describing association between two continuous variables
Scatterplot
Correlation coefficient
Simple linear regression
Correlation
Correlation
It is a measure of linear association
Called the Pearson correlation coefficient (r)
Ranges between:
+1.0 (perfect positive correlation)
-1.0 (perfect negative correlation)
Scatter plot and correlation
The Correlation Coefficient (r)
Measures the direction and strength of the linear association
between x and y
The correlation coefficient is between -1 and +1
r > 0: Positive association
r < 0: Negative association
r = 0: No association
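The coefficient itself is the covariance of x and y divided by the product of their spreads. A minimal pure-Python sketch (illustrative function name, hypothetical data):

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = math.sqrt(sum((xi - mx) ** 2 for xi in x))
    sy = math.sqrt(sum((yi - my) ** 2 for yi in y))
    return cov / (sx * sy)

# A perfect positive line gives r = +1, a perfect negative line r = -1
print(round(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]), 3))   # 1.0
print(round(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]), 3))   # -1.0
```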
[Example scatter plots with r = 0.01, r = 0.68, r = 0.98, and r = -0.9]
Correlation in the Plasma Example
[Scatter plot: plasma volume (liters) vs. body weight (kg); r = 0.76]
Correlation
Study the association between Heart Rate and Systolic Blood
Pressure
Ho: Correlation = 0
Ha: Correlation ≠ 0
[Scatter plot of the association between heart rate at admission (x-axis, roughly 40-160) and systolic blood pressure (y-axis, roughly 50-250)]
Correlation: r = 0.190, P-value = 0.001

Correlations
                                       Systolic blood pressure  Heart Rate at admission
Systolic blood   Pearson Correlation   1                        .190**
pressure         Sig. (2-tailed)                                .001
                 N                     286                      285
Heart Rate at    Pearson Correlation   .190**                   1
admission        Sig. (2-tailed)       .001
                 N                     285                      286
**. Correlation is significant at the 0.01 level (2-tailed).
Significant correlation
Problem
Important to note that correlation measures the strength of linear
association.
There could be a strong non-linear relationship between y and x,
and r may not catch it.
[Plot: a strong curved relationship with r = 0]
Correlation Coefficient
Outliers can really affect the correlation coefficient.
One extreme point can change r sizably.
[Plot: a single outlier inflating the correlation to r = .7]
Simple linear regression
Simple linear regression
Used to quantify the association between two variables
It is "simple" in that it has only 1 independent variable
The association is assumed to be linear in nature
Formula: Dependent = β0 + β1 (Independent)
The Equation of a Line
β0 and β1 are called regression coefficients
These two quantities are estimated by the least squares method
The intercept β0 is the estimated expected value of y when x is 0
The slope β1 is the estimated expected change in y corresponding
to a unit increase in x
The Slope
The slope β1 is the expected change in y corresponding to a unit
increase in x
β1 = 0: No association between y and x
β1 > 0: Positive association (as x increases, y tends to increase)
β1 < 0: Negative association (as x increases, y tends to decrease)
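The least squares estimates have simple closed forms; a minimal Python sketch (illustrative helper name, hypothetical data lying exactly on a line):

```python
def least_squares(x, y):
    """Estimate intercept b0 and slope b1 by ordinary least squares."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
          / sum((xi - mx) ** 2 for xi in x))
    b0 = my - b1 * mx  # the fitted line passes through (mean x, mean y)
    return b0, b1

# Hypothetical data lying exactly on y = 1 + 2x
b0, b1 = least_squares([0, 1, 2, 3], [1, 3, 5, 7])
print(b0, b1)  # 1.0 2.0
```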
The Equation of a Line
[Plot: the line y = b0 + b1x, with intercept b0 and slope b1 marked]
The Slope
[Plot: lines illustrating β1 > 0, β1 = 0, and β1 < 0]
Simple linear regression
Systolic blood pressure and age
Model Summary
Model  R      R Square  Adjusted R Square  Std. Error of the Estimate
1      .054a  .003      -.001              35.387
a. Predictors: (Constant), Age

Correlation: R = 0.054
Simple linear regression
Simple linear regression
Coefficients(a)
            Unstandardized B  Std. Error  Standardized Beta  t       Sig.
(Constant)  136.400           8.812                          15.479  .000
Age         .148              .162        .054               .910    .364
a. Dependent Variable: Systolic blood pressure
Simple linear regression:
SBP = 136.400 + 0.148 (Age)
If age = 0, then SBP = 136.400 + 0 = 136.400
As age increases by 1 year, SBP increases by 0.148 units
Simple Linear Regression
How do we decide if there is significant association between age
and SBP?
Hypothesis test
Ho: β1 = 0
Ha: β1 ≠ 0
SBP = β0 + β1 (Age)
If we reject Ho, then as age changes, SBP changes significantly
If Ho is not rejected, then as age changes, there is no effect on SBP
Multiple Linear Regression
The important aspect of linear regression is that we can include
more than 1 independent variable
This is to control for the effect of another variable
Study the association between Age and SBP while controlling for
gender
SBP = β0 + β1 (Age) + β2 (Gender)
Multiple Linear Regression
Coefficients(a)
            Unstandardized B  Std. Error  Standardized Beta  t       Sig.
(Constant)  143.090           9.742                          14.688  .000
Age         .216              .171        .080               1.261   .208
Sex         -8.992            6.123       -.093              -1.469  .143
a. Dependent Variable: Systolic blood pressure
Multiple linear regression:
SBP = 143.090 + 0.216 (Age) - 8.992 (Gender)
As age increases by 1 year, SBP increases by 0.216 units,
after adjusting for gender
The difference in SBP between males and females is 8.992 units,
after adjusting for age
Choosing the right statistical test
Choosing a statistical test
Choosing the right statistical test depends on:
Nature of the data
Sample characteristics
Inferences to be made
Choosing a statistical test
A consideration of the nature of data includes:
Number of variables
not for entire study, but for the specific question at hand
Type of data
numerical, continuous
dichotomous, categorical information
Choosing a statistical test
A consideration of the sample characteristics includes:
Number of groups
Sample type
normal distribution (parametric) or not (non-parametric)
independent or dependent
Choosing a statistical test
A consideration of the inferences to be made includes:
Data represent the population
The group means are different
There is a relationship between variables
Choosing a statistical test
Before choosing a statistical test, ask:
How many variables?
How many groups?
Is the distribution of data normal?
Are the samples (groups) independent?
What is your hypothesis or research question?
Is the data continuous, ordinal, or categorical?
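The questions above feed a small decision grid (the one tabulated in the following slides). A sketch of that grid as a lookup table; the labels and function name are illustrative, not a standard API:

```python
# The decision grid, keyed by (dependent type, independent type)
TEST_GRID = {
    ("2 levels", "2 levels"): "chi-square",
    ("2 levels", "> 2 levels"): "chi-square",
    ("2 levels", "continuous"): "t-test",
    ("> 2 levels", "2 levels"): "chi-square",
    ("> 2 levels", "> 2 levels"): "chi-square",
    ("> 2 levels", "continuous"): "ANOVA",
    ("continuous", "2 levels"): "t-test",
    ("continuous", "> 2 levels"): "ANOVA",
    ("continuous", "continuous"): "correlation / linear regression",
}

def choose_test(dependent, independent):
    """Suggest a bivariate test given the two variable types."""
    return TEST_GRID[(dependent, independent)]

print(choose_test("continuous", "2 levels"))  # t-test
```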
Descriptive analyses
Type of variable          Measure
Categorical               Proportion (%)
Continuous (Normal)       Mean (SD)
Continuous (Not Normal)   Median, inter-quartile range
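This table maps directly onto Python's standard statistics module; a minimal sketch with hypothetical blood pressure readings (the `describe` helper and its `normal` flag are ours):

```python
import statistics

def describe(values, normal):
    """Mean (SD) for roughly normal data, else median and IQR."""
    if normal:
        return {"mean": statistics.mean(values),
                "sd": statistics.stdev(values)}
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    return {"median": statistics.median(values), "iqr": q3 - q1}

sbp = [120, 130, 125, 140, 135]  # hypothetical readings
print(describe(sbp, normal=True))   # mean 130, SD ≈ 7.91
print(describe(sbp, normal=False))
```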
Different types of statistics
Parametric vs non-parametric analyses
Parametric:
Assume the data follow a specific probability distribution
More powerful
Non-parametric:
Also called distribution-free
No distributional assumptions are required for the data
Less powerful, but more robust
Univariate analyses
Type of variable          Test
Categorical               Z (proportions)
Continuous (Normal)       t-test
Continuous (Not Normal)   n > 30: t-test; n < 30: Kolmogorov-Smirnov test
Bivariate analyses
Type of variable  2 levels     > 2 levels   Continuous
2 levels          Chi-square   Chi-square   t-test
> 2 levels        Chi-square   Chi-square   ANOVA
Continuous        t-test       ANOVA        Correlation / Linear regression
Bivariate analyses (non-parametric)
Type of variable  2 levels                         > 2 levels                       Continuous
2 levels          Fisher's test / McNemar's test   Fisher's test                    Mann-Whitney / Wilcoxon test
> 2 levels        Fisher's test                    Fisher's test                    Kruskal-Wallis / Friedman test
Continuous        Mann-Whitney / Wilcoxon test     Kruskal-Wallis / Friedman test   Correlation / Regression
Multivariate analyses
Type of outcome variable   Method
Categorical (2 levels)     Logistic regression
Categorical (> 2 levels)   Multinomial regression
Continuous                 Linear regression
Overview
Goal                          Measurement (Gaussian)        Ordinal or Measurement (Non-Gaussian)  Binomial                       Survival Time
Describe one group            Mean, SD                      Median, inter-quartile range           Proportion                     Kaplan-Meier survival curve
Compare two unpaired groups   Unpaired t test               Mann-Whitney test                      Fisher's test / Chi-square     Log-rank test or Mantel-Haenszel*
Compare two paired groups     Paired t test                 Wilcoxon test                          McNemar's test                 Conditional proportional hazards regression*
Compare three or more         One-way ANOVA                 Kruskal-Wallis test                    Chi-square test                Cox regression
unmatched groups
Compare three or more         Repeated-measures ANOVA       Friedman test                          Cochrane Q**                   Conditional proportional hazards regression*
matched groups
Quantify association          Pearson correlation           Spearman correlation                   Contingency coefficients**     -
between two variables
Predict value from another    Simple linear regression      Nonparametric regression**             Simple logistic regression*    Cox regression
measured variable
Predict value from several    Multiple linear regression*   -                                      Multiple logistic regression*  Cox regression
measured or binomial variables
Sample size calculation
Sample size and power calculation
Important step in designing a study
If it is not done, the sample size might be too small or too large:
If it is too small: the study lacks the precision to provide reliable answers
If it is too large: resources are wasted for minimal gain
Sample size and power calculation
This step addresses two questions:
How precise will my parameter estimates tend to be if I select a
particular sample size?
How big a sample do I need to attain a desirable level of precision?
Sample size and power calculation: example
A cross-sectional survey of the prevalence of diabetes (diagnosed
or undiagnosed) among native Americans would require a sample
size of 1421 to allow estimation of the prevalence within a
precision of 0.02 with 90% confidence, assuming a true
prevalence no larger than 30%.
Sample size and power calculation
Should be done at the DESIGN stage, i.e. before any data are collected
Drives the whole study
To determine the sample size:
Objectives should be clearly defined
Main exposure and outcome should be specified
The analysis plan should be clarified
Sample size and power calculation
Different equations are used:
Depends on:
Study design
Objectives (prevalence, risk, etc.)
Types of variables
Following is an example of sample size calculation for comparing
the means in two groups
Sample size and power calculation: example
A randomized clinical trial of a new drug treatment vs. placebo
for decreasing blood pressure would require 126 patients for a
two-sided test at α = 0.05 to provide 80% power to detect a 5%
difference in blood pressure.
Sample size calculation: comparing two means
N = 2 × SD² × (z(α/2) + z(β))² / Difference²
N = the number of subjects in each group
α = level of significance (type I error)
1 - β = power
Difference = Minimal significant difference
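Plugging standard normal quantiles into the formula above gives N per group. A minimal Python sketch; SD = 10 and a difference of 5 are hypothetical inputs, and the z values 1.96 and 0.84 correspond to the usual two-sided α = 0.05 and 80% power:

```python
import math

def n_per_group(sd, difference, z_alpha=1.96, z_beta=0.84):
    """N per group for comparing two means:
    N = 2 * SD^2 * (z_alpha + z_beta)^2 / Difference^2.
    Defaults: two-sided alpha = 0.05 (z = 1.96), 80% power (z = 0.84)."""
    n = 2 * sd ** 2 * (z_alpha + z_beta) ** 2 / difference ** 2
    return math.ceil(n)  # round up to whole subjects

# Hypothetical inputs: SD = 10 mmHg, minimal important difference = 5 mmHg
print(n_per_group(sd=10, difference=5))  # 63 per group
```

Doubling the detectable difference to 10 mmHg cuts the required N sharply, which is the trade-off the next slides walk through.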
Sample size calculation: comparing two means
N = the number of subjects in each group
↑ N → more power, or a smaller detectable difference
↓ N → less power, or a larger detectable difference
Sample size calculation: comparing two means
α = level of significance (type I error)
↑ α → more power, or smaller N
↓ α → less power, or larger N
Sample size calculation: comparing two means
1 - β = power
↑ (1 - β) → less type II error, or larger N
↓ (1 - β) → more type II error, or smaller N
Sample size calculation: comparing two means
Difference = Minimal significant difference
↑ Difference → larger power, or smaller N
↓ Difference → smaller power, or larger N
Sample size calculation: comparing two means
N = to be found
α = level of significance (type I error) = 0.05 or 5%
1 - β = power = 0.80 or 80%
Difference = Minimal significant difference
Thank you