1
Expectations
2
Scope of presentation
• Approach in statistical analysis
• Descriptive & Inferential analysis
• Sample Size for Common Statistical Tests
• Pearson’s correlation test
• Cronbach’s alpha test
• Revision: What affects sample size?
• Alpha, Power of study and effect sizes
• Sample size estimation for multivariate analysis
• MLR and Logistic regression
• Tips for determination of sample size
3
Introduction: Why do we need to calculate or
estimate sample size?
• Requirement for protocol submission
• Plan for budget
• To know when to start and when to stop
• To get significant result
4
Introduction: Approach in the statistical
analysis
[Diagram labels: study objectives, population, data, descriptive analysis, statistical analysis]
5
Approach in statistical analysis
6
Introduction: Relation between population &
sample
[Figure: % of adults in the population and % of adults in the sample, linked by inference]
7
Introduction: Summary
• How well can statistics derived from a sample be inferred to the
targeted population?
• 2 factors drive the accuracy of the statistics
a. Sampling technique – to eliminate bias in selection
b. Sample size – to get a sufficient sample for inference
• Therefore, researchers calculate or estimate sample size because
a. They study or analyze sample data instead of population data
b. They want to get a significant result (p<0.05) to justify
inference
8
Introduction: Summary
Statistical test, e.g. independent samples t-test, Pearson chi-square, correlation, etc.
Formula!
9
Sample Size for Common
Statistical Tests
Correlation test & Cronbach’s alpha
10
Introduction: What affects sample size?
• Alpha (α)
• Power of the study (1 – β)
• Effect size
[Figure label: Fail to reject null]
• Cohen J. A power primer. Psychological Bulletin. 1992;112(1):155–159. doi: 10.1037/0033-2909.112.1.155.
12
Note:
A power of 80.0% corresponds to a type II error (β) of 20.0%
Power of study
= 1 – β
= 1 – 0.2
= 0.8
Note:
An alpha (type I error) of 0.05 (5%) corresponds to a 95% confidence level
Confidence level
= 1 – α
= 1 – 0.05
= 0.95
13
Relationship between effect size & sample
size
• If the real effect size (ES) is large, a large sample size (n) is not
needed to show that the large ES exists. In general, the larger the ES,
the smaller the n required.
14
Relation between effect size & sample size
• Set low effect size
  • You may need a large sample size to prove the result is statistically significant.
• Set high effect size
  • Although you are required to recruit only a small sample, you may not be able to achieve the desired effect size from the sample.
• Thus, researchers will normally set the smallest effect size that they can tolerate to convince themselves and also the audience that the finding is clinically or scientifically significant (i.e. there is a difference, there is an association, there is a correlation, etc.)
15
Correlation test – Pearson’s Correlation Test
• A measure of the direction and strength of the linear relationship
between two variables in numerical form.
16
Inferential analysis:
18
Inferential analysis: Scenario (sample size)
• This study aims to determine the correlation between age and knowledge score. Both variables are observed in numerical form. What is the appropriate sample size for this study to determine a low or a moderate correlation between these two variables?
• Let alpha be fixed at 0.05 (5%), i.e. a 95% confidence level
• Let power of study be fixed at 80.0%
• Effect size? Say we estimate that the minimum value of the correlation coefficient is 0.30.
19
What information do we need for sample size
calculation?
• Type I error = alpha = 0.05
• Power of study = 1 – β = 80.0%
• Effect size for Pearson’s correlation test = r = correlation coefficient
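As a rough cross-check on these three inputs, the sketch below uses the Fisher z approximation for a two-sided test against a null correlation of zero: n ≈ ((z for alpha + z for power) / arctanh(r))² + 3. This is a textbook approximation chosen here for illustration, not the procedure implemented in PASS, and the function name is hypothetical.

```python
import math
from statistics import NormalDist


def pearson_sample_size(r, alpha=0.05, power=0.80):
    """Approximate n to detect correlation r against rho0 = 0 (two-sided).

    Uses the Fisher z approximation; this is an illustrative assumption,
    not the exact method implemented in PASS.
    """
    if not 0 < abs(r) < 1:
        raise ValueError("r must be non-zero and strictly between -1 and 1")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_beta = NormalDist().inv_cdf(power)           # quantile for power 1 - beta
    fisher_z = math.atanh(abs(r))                  # Fisher transform of r
    return math.ceil(((z_alpha + z_beta) / fisher_z) ** 2 + 3)


if __name__ == "__main__":
    for r in (0.1, 0.3, 0.5):
        print(r, pearson_sample_size(r))
```

For r = 0.30 the approximation returns 85, one more than the software-based figure of 84 quoted in these slides, so it is best read as a sanity check rather than a replacement for dedicated software.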
20
Sample size calculation using PASS software:
Pearson’s correlation test
21
Sample size calculation: Pearson’s correlation
test
ρ0    ρ1    n
0.0   0.1   782
0.0   0.2   193
0.0   0.3   84
0.0   0.4   46
0.0   0.5   29
0.0   0.6   19
0.0   0.7   13
0.0   0.8   9
0.0   0.9   6
Note:
• ρ0 is the value of the population correlation under the null hypothesis.
• ρ1 is the value of the population correlation under the alternative hypothesis.
• Sample size was calculated based on the formula by Guenther (1977), based on alpha less than 0.05 and minimum power of 80.0%
• Bujang MA, Nurakmal B. Sample size guideline for correlation analysis. World Journal of Social Science Research. 2016;3(1):37–46. doi: 10.22158/wjssr.v3n1p37.
• Citations 137 since 2016 until 8th Feb 2022
23
Sample size statement: Pearson’s correlation test
• This study aims to determine the magnitude of correlation between age
and knowledge score. The sample size calculation is based on the formula
for Pearson’s correlation test. The minimum correlation coefficient to be
detected in the study is 0.30, under the assumption that the magnitude of
correlation is zero in the null hypothesis and at least 0.30 in the
alternative hypothesis. Hence, the minimum required sample size is 84,
based on an alpha of 0.05 and a minimum power of 80%. After allowing for a
20.0% dropout rate, this study needs to recruit 105 participants.
84 / 0.8 = 105
24
Statistical test: Cronbach’s alpha
• Aim: To determine the strength/magnitude of internal consistency or
stability of a domain (a latent variable measured by a group of variables).
Stress (Items 1, 6, 8, 11, 12, 14 and 18)
Anxiety (Items 2, 4, 7, 9, 15, 19 and 20)
26
Example: Job Satisfaction Questionnaire (JS-Q)
• TW1, TW2, TW3, TW4 and TW5 report excellent internal consistency, with
a Cronbach’s alpha coefficient of 0.924. This group of items is suitable
to represent a domain (in this case, Teamwork, TW).
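To make the coefficient concrete, here is a minimal sketch of the standard Cronbach's alpha formula, alpha = k/(k − 1) × (1 − sum of item variances / variance of total scores), where k is the number of items. The function name and the demo responses are made up for illustration and are not taken from the JS-Q data.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha for a list of item-score lists (one list per item).

    alpha = k / (k - 1) * (1 - sum of item variances / variance of totals)
    """
    k = len(items)
    if k < 2:
        raise ValueError("need at least two items")
    n = len(items[0])
    if any(len(scores) != n for scores in items):
        raise ValueError("every item needs one score per respondent")
    totals = [sum(scores) for scores in zip(*items)]  # total score per respondent
    item_var_sum = sum(variance(scores) for scores in items)
    return k / (k - 1) * (1 - item_var_sum / variance(totals))


if __name__ == "__main__":
    # Hypothetical responses from 5 respondents to 3 items (made-up data).
    demo = [
        [4, 3, 5, 2, 4],
        [4, 2, 5, 3, 4],
        [5, 3, 4, 2, 5],
    ]
    print(round(cronbach_alpha(demo), 3))
```

Items that move together push the variance of the totals well above the sum of the item variances, which is what drives alpha towards 1.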
28
Inferential analysis: Scenario
• This study aims to determine the internal consistency of the four main
domains of Questionnaire Z (Step 1: Understand the objective). All
domains have 5 items each and thus a Cronbach’s alpha test was
conducted (Step 2 & Step 3: Determine the appropriate statistical test
to answer the objective & conduct statistical analysis). The result shows
that all four domains report a Cronbach’s alpha of more than 0.5
(Step 4: Interpretation). Therefore, the internal consistency of the
Questionnaire Z domains is acceptable (Step 5: Make a conclusion).
29
What information do we need for sample size
calculation?
• Type I error = alpha = 0.05
• Power of study = 1 – β = 80.0%
• Effect size for Cronbach’s alpha is determined by the difference of
Cronbach’s alpha values in the hypothesis testing and the number of
items (or raters)
31
Sample size calculation using PASS software:
Cronbach’s alpha test
CA0   CA1   n
0.0   0.3   152
0.0   0.4   74
0.0   0.5   41
0.0   0.6   24
0.0   0.7   14
0.0   0.8   9
0.0   0.9   5
Note:
• CA0 is the value of the estimated Cronbach’s alpha in the null hypothesis.
• CA1 is the value of the estimated Cronbach’s alpha in the alternative hypothesis.
• Sample size was calculated based on the formula by Bonett (2002), based on alpha less than 0.05 and minimum power of 80.0%
• Bujang MA, Omar ED, Baharum NA. A review on sample size determination for Cronbach’s alpha test: a simple guide for researchers. Malays J Med Sci. 2018;25(6):85–99. doi: 10.21315/mjms2018.25.6.9.
• Citations 159 since 2018 until 8th Feb 2022
33
Sample size statement: Cronbach’s alpha test
• This study aims to determine the internal consistency of the domains of
Questionnaire Z, which has 5 questions in each domain. The sample size
calculation is based on the formula for the Cronbach’s alpha test. The
minimum Cronbach’s alpha coefficient to be detected in the four domains
is 0.50, under the assumption that the Cronbach’s alpha coefficient is
zero in the null hypothesis and at least 0.5 in the alternative
hypothesis. Hence, the minimum required sample size is 41, based on an
alpha of 0.05 and a minimum power of 80%. After allowing for a 20.0%
dropout rate (41 / 0.8 = 51.25), this study needs to recruit 52
participants.
34
Sample Size for Multivariate
Analysis
Multiple Linear Regression (MLR) & Analysis of Covariance (ANCOVA)
Logistic Regression
35
Estimate sample size
• We usually estimate sample size for multivariate analysis.
Example objective: To determine to what extent the socio-demographic profile and perception among UPLB scientists are associated with the use of social media in research.
36
Multiple Linear Regression & General Linear
Model (ANCOVA)
• is a statistical technique that uses several explanatory variables (independent variables) to:
  • predict the outcome of a response variable (in numerical form).
  • study how the explanatory variables associate with the response variable (in numerical form).
“Based on a cross-sectional study, a group of researchers aims to determine to what extent age, gender, ethnicity, education level, BMI, exercise and diet are associated with systolic blood pressure.”
How many participants should they recruit?
37
Multiple Linear Regression & General Linear
Model (ANCOVA)
Before sample size calculation:
• Understand the scenario
• Determine the appropriate sample size technique to answer the
objective
• MLR or General Linear Model ANCOVA!
38
Rule of thumb for Multiple Linear Regression
(1) Tabachnick, B.G. & Fidell, L.S. (2013). Using Multivariate Statistics (6th edition). Boston: Pearson Education.
“N > 50 + 8m”
The sample size (N) should exceed 50 + 8 (no. of predictors or risk factors).
Sample size statement:
“The aim of this study is to determine to what extent age, gender, ethnicity, education level, BMI, exercise and diet are associated with systolic blood pressure. According to Tabachnick and Fidell (2013), who provide a guideline of sample size for Multiple Linear Regression, the sample size (N) should exceed 50 + 8 (no. of independent variables). Since this study has 7 independent variables, it will need a minimum sample size of 106 = 50 + 8(7) (Tabachnick & Fidell, 2013).”
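The arithmetic behind this rule of thumb can be sketched in a few lines. The function name is hypothetical, and the sketch simply returns the threshold 50 + 8m quoted from Tabachnick & Fidell (2013); it does not replace a formal power calculation.

```python
def mlr_rule_of_thumb(n_predictors):
    """Threshold 50 + 8m quoted from Tabachnick & Fidell (2013).

    n_predictors is m, the number of independent variables in the model.
    """
    if n_predictors < 1:
        raise ValueError("need at least one predictor")
    return 50 + 8 * n_predictors


if __name__ == "__main__":
    # Seven predictors, as in the systolic blood pressure example.
    print(mlr_rule_of_thumb(7))
```

With the seven predictors of the blood pressure example this gives 106, the figure used in the sample size statement.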
39
Rule of thumb for MLR / ANCOVA
(2) Bujang MA, Sa’at N, Tg Abu Bakar Sidik TMI. Determination of minimum sample size requirement for multiple linear regression and analysis of covariance based on experimental and non-experimental studies. Epidemiology Biostatistics and Public Health. 2017;14(3):e12117–1. doi: 10.2427/1211.
• Based on validation (using various sample sizes & statistical analyses) between sample statistics and parameters, the ideal sample size is 300 subjects.
40
Rule of thumb: Multiple Linear Regression (MLR) and General Linear
Model (ANCOVA) for observational study by Bujang et al., (2017)
[Figure: The relation between sample size and the difference in effect size (partial eta-squared) between parameters and statistics; annotation: at sample size of 300]
41
Rule of thumb for MLR & ANCOVA for observational study by Bujang et al.
(2017)
42
Logistic Regression
• is a statistical technique that uses several explanatory variables (independent variables) to:
  • predict the outcome of a response variable (in categorical form).
  • study how the explanatory variables associate with the response variable (in categorical form).
“Based on a cross-sectional study, a group of researchers aims to determine to what extent age, gender, ethnicity, education level, BMI, exercise and diet are associated with status of systolic blood pressure (i.e. controlled & not controlled).”
How many participants should they recruit?
43
Rule of thumb for Logistic Regression based on Peduzzi et al.,
(1996)
• The EPV10 rule of thumb depends on:
1. Prevalence of the outcome of interest (e.g. 30% with poor outcome)
2. Number of participants to be recruited (e.g. 300 participants).
• Based on (1) & (2), researchers can determine the number of
independent variables to be tested in the final regression model.
44
Rule of thumb for Logistic Regression based
on Peduzzi et al., (1996)
Columns:
a. Estimated prevalence of the least category from a binary outcome
b. Total sample size
c. Total estimated sample size for the least category from a binary outcome
d. EPV of 10 (c / 10)
e. Number of factors (IV) in the final logistic regression model
f. Sample size sufficient?

a     b     c     d     e     f
20%   100   20    2     4     No
20%   300   60    6     5     Yes
30%   100   30    3     4     No
30%   300   90    9     5     Yes
50%   100   50    5     4     Yes
50%   300   150   15    5     Yes
45
Rule of thumb for Logistic Regression based
on Peduzzi et al., (1996)
Columns:
a. Number of factors planned in the logistic regression model
b. Minimum sample size required in the least category from a binary outcome under EPV 10 (a × 10)
c. Total sample size
d. Estimated prevalence of the least category from a binary outcome
e. Total estimated sample size for the least category from a binary outcome (c × d)
f. Sample size sufficient? (Yes: e > b)

a     b     c     d     e     f
4     40    100   20%   20    No
5     50    300   20%   60    Yes
4     40    100   30%   30    No
5     50    300   30%   90    Yes
4     40    100   50%   50    Yes
5     50    300   50%   150   Yes
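The check tabulated above can be expressed as a short sketch. The function name and the integer-percentage argument are illustrative assumptions; the sketch also treats an expected count exactly equal to the required count as sufficient, which is an assumption that does not change any row of the table.

```python
def epv_check(prevalence_pct, total_n, n_predictors, epv=10):
    """Return (expected events, max predictors supported, sufficient?).

    prevalence_pct is the prevalence of the least frequent outcome
    category, given as a whole-number percentage (e.g. 20 for 20%).
    """
    events = total_n * prevalence_pct // 100   # expected size of the least category
    max_predictors = events // epv             # predictors supported at this EPV
    required_events = n_predictors * epv       # events needed for the planned model
    # Assumption: meeting the requirement exactly counts as sufficient.
    return events, max_predictors, events >= required_events


if __name__ == "__main__":
    for pct, n, k in [(20, 100, 4), (20, 300, 5), (30, 100, 4),
                      (30, 300, 5), (50, 100, 4), (50, 300, 5)]:
        print(pct, n, k, epv_check(pct, n, k))
```

Passing a different `epv` value (for example 20 or 50) applies the same check under a stricter rule.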
46
Rule of thumb for Logistic Regression based on Peduzzi et al.,
(1996)
Sample size statement:
“The aim of this study is to determine to what extent age, gender,
ethnicity, education level, BMI, exercise and diet are associated with
poor control of HbA1c. The guideline of sample size for logistic
regression by Peduzzi et al. (1996) suggests a minimum of 10 events per
variable for the least frequent category of the outcome variable. Since
this study is interested in 7 risk factors, it will need a minimum of 70
patients in the poor outcome category. This study plans to recruit at
least 300 patients, which exceeds the minimum sample size since the
prevalence of poor outcome is estimated at 50%, i.e. about 150 patients
with poor outcome (Peduzzi et al., 1996).”
47
Cox regression
• The concept of EPV 10 was introduced for both logistic regression and Cox
regression.
• Peduzzi, Concato and colleagues proposed that a similar EPV10 rule of thumb
can also be used for Cox regression.
Reference:
• Peduzzi, P., Concato, J., Feinstein, A. R. and Holford, T. R. 1995. Importance of
events per independent variable in proportional hazards regression analysis: II.
Accuracy and precision of regression estimates. Journal of Clinical Epidemiology,
48: 1503–1510.
48
Criticism of EPV10 rule of thumb
The EPV10 rule has received some criticism [van Smeden et al., 2016], and Austin
and Steyerberg (2017) recommended an EPV of 20 instead of 10.
References:
• Maarten van Smeden, Joris A. H. de Groot, Karel G. M. Moons, Gary S. Collins, Douglas
G. Altman, Marinus J. C. Eijkemans and Johannes B. Reitsma. No rationale for 1 variable
per 10 events criterion for binary logistic regression analysis. BMC Medical Research
Methodology (2016) 16:163
• Austin PC, Steyerberg EW. Events per variable (EPV) and the relative performance of
different strategies for estimating the out-of-sample validity of logistic regression models.
Statistical Methods in Medical Research 2017, Vol. 26(2) 796–808
49
Rule of thumb for Logistic Regression based
on Bujang et al., (2018)
Based on:
• Rule of thumb of EPV50
• Using a simple formula
n = 100 + 50i, where i refers to the
number of independent variables
in the final model.
50
Rule of thumb for Logistic Regression based on Bujang et al.,
(2018)
[Figure panels: logistic regression based on enter method; logistic regression based on stepwise method]
51
Rule of thumb - If large data is available (>500)
52
Summary: Why do we need to calculate or estimate
sample size?
• Requirement for protocol submission (requires a sample size statement)
• Plan for budget
• To know when to start and when to stop
• To get significant result
54
Tips for sample size
calculation or estimation
55
To understand the objective of a study
Understand:
1. the subject matter
2. the scenario
3. the significance of the study objective
56
To select the appropriate statistical analysis
• Need to be familiar with various
statistical tests
• If a similar study has been
conducted by others, read the
methods section of the paper to
identify the statistical test that
was used
57
To calculate or estimate the minimum sample
size required by the study
• Calculation
• Manual
• Software
58
Sample
size papers
59
To provide an additional allowance during subject
recruitment to cater for a certain proportion of
non-response
Causes
• Missing participants’ response
• Spoilt or broken sample
• Missing values
How much?
• usually by 20% to 30%.
• If the researcher is expecting a high non-response rate in a self-administered survey, then he/she should provide an allowance for it by adding more than 30%, such as 40% to 50%.
Purpose
• To ensure the minimum sample size is achieved
Calculation
• Minimum sample size required: 150
• Add a non-response rate of 20%
• 150 / 0.8 = 187.5, rounded up to 188
• Say add a non-response rate of 30%
• 150 / 0.7 = 214.3, rounded up to 215
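The allowance calculation on this slide divides the minimum sample size by the expected response proportion and rounds up. A minimal sketch follows; the function name is hypothetical, and the rate is passed as a whole-number percentage so that the rounding is done on integers.

```python
def inflate_for_nonresponse(min_n, nonresponse_pct):
    """Number to recruit so that about min_n remain after non-response.

    nonresponse_pct is a whole-number percentage (e.g. 20 for 20%).
    Equivalent to ceil(min_n / (1 - nonresponse_pct / 100)).
    """
    if not 0 <= nonresponse_pct < 100:
        raise ValueError("nonresponse_pct must be in [0, 100)")
    retained = 100 - nonresponse_pct
    return (min_n * 100 + retained - 1) // retained  # integer ceiling division


if __name__ == "__main__":
    print(inflate_for_nonresponse(150, 20))
    print(inflate_for_nonresponse(150, 30))
```

The same helper reproduces the earlier figures in this deck: 84 becomes 105 and 41 becomes 52 with a 20% allowance.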
60
To write a sample size statement
All the elements from Step 1 until Step 4
• determine the study objective,
• determine the appropriate statistical analysis,
• sample size estimation/calculation
• add non-response rate
should be fully stated in the sample size statement.
• Say a study aims to determine the association of factors with optimal HbA1c level as determined by its cut-off point of < 6.5% among patients with type 2 diabetes mellitus (T2DM). A previous study had identified several significant factors, which were then included as three to four variables in the final model consisting of parameters selected from the demographic profile of patients and clinical parameters (cite the appropriate reference). How many T2DM patients should the study recruit in order to answer the study objective?
61
To write a sample size statement
• Step 1: To Understand the Objective of Study
  • The study aims to determine a set of independent variables that show a significant association with optimal HbA1c level (as determined by its cut-off point of < 6.5%) among T2DM patients.
• Step 2: To Decide the Appropriate Statistical Analysis
  • In this example, the outcome variable is in categorical and binary form, such as HbA1c level of < 6.5% versus ≥ 6.5%. On the other hand, there are about 3 to 4 independent variables, which can be expressed in both categorical and numerical form. Therefore, an appropriate statistical analysis shall be logistic regression.
62
To write a sample size statement
• Step 3: To Estimate or Calculate the Sample Size Required
• Since this study will require a multivariate regression analysis, it
is recommended to estimate the sample size based on a general rule of
thumb. There are several general rules of thumb available for
estimating the sample size for multivariate logistic regression. Two
approaches are introduced here, namely: i) sample size estimation
based on the concept of events per variable (EPV) and ii) sample size
estimation based on a simple formula.
63
To write a sample size statement
i) Sample size estimation based on the concept of EPV 50
• For EPV 50, the researcher will need to know the prevalence of the ‘good’ outcome category and the number of subjects in the ‘good’ outcome category to fit the rule of EPV 50.
• Say, the prevalence of the ‘good’ outcome category is reported at 70% (cite the appropriate reference).
• Then, with a total of four independent variables, the minimum sample size required in the ‘good’ outcome category will be at least 200 subjects in order to fulfil the condition for EPV 50 (i.e. 200/4 = 50).
• On the other hand, by estimating the prevalence of ‘good’ outcome at 70.0%, this study will therefore need to recruit at least 290 subjects in order to ensure that a minimum of 200 subjects will be obtained in the ‘good’ outcome category (70/100 x 290 = 203, and 203 > 200).
ii) Sample size estimation based on the formula n = 100 + 50i (where i represents the number of independent variables in the final model)
• When using this formula, the researcher will first need to set the total number of independent variables in the final model. As stated in the example, the total number of independent variables was estimated to be about three to four (cite the appropriate reference). Then, with a total of four independent variables, the minimum required sample size will be 300 patients (i.e. 100 + 50 (4) = 300).
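The two approaches can be compared side by side with a short sketch. The function names are hypothetical; the prevalence is passed as a whole-number percentage for the outcome category in which subjects are counted, and the EPV helper returns the exact minimum rather than a rounded planning figure.

```python
def ceil_div(a, b):
    """Ceiling of a / b for positive integers."""
    return -(-a // b)


def epv_total_sample(n_vars, prevalence_pct, epv=50):
    """Return (subjects needed in the counted category, total n to recruit).

    The total is the smallest n for which prevalence_pct percent of n
    reaches epv * n_vars subjects.
    """
    required_in_category = epv * n_vars
    total = ceil_div(required_in_category * 100, prevalence_pct)
    return required_in_category, total


def formula_total_sample(n_vars):
    """n = 100 + 50i, the simple formula quoted in these slides."""
    return 100 + 50 * n_vars


if __name__ == "__main__":
    print(epv_total_sample(4, 70))   # four variables, 70% prevalence
    print(formula_total_sample(4))   # four variables
```

For four variables and a 70% prevalence, the EPV 50 helper returns 200 subjects in the counted category and a total of 286, which the worked example rounds up to 290; the simple formula gives 300.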
64
To write a sample size statement
• Step 4: To Provide Additional Allowance for a Certain Proportion of
Non-Response Rate
• In order to make up for a rough estimate of 20.0% of non-response
rate, the minimum sample size requirement is calculated to be 363
patients (i.e. 290/0.8 = 362.5, rounded up) by estimating the sample size
based on the EPV 50, and is calculated to be 375 patients (i.e. 300/0.8)
by estimating the sample size based on the formula n = 100 + 50i.
65
To write a sample size statement
• Step 5: To Write a Sample Size Statement
• Two approaches to estimating sample size for logistic regression were introduced
earlier. Say the researcher chooses to apply the formula n = 100 +
50i. The sample size statement will then be written as follows:
• “The main objective of this study is to determine the association of factors with optimal
HbA1c level as determined by its cut-off point of < 6.5% among patients with type 2
diabetes mellitus (T2DM). The sample size estimation is derived from the general rule of
thumb for logistic regression proposed by Bujang et al. (2018), which had established a
simple guideline of sample size determination for logistic regression. Bujang
et al. (2018) suggested calculating the sample size based on the formula n = 100 +
50i. The estimated total number of independent variables was about three to four (cite
the appropriate reference). Thus, with a total of four independent variables, the
minimum required sample size will be 300 patients (i.e. 100 + 50 (4) = 300). By providing
an additional allowance to cater for a possible dropout rate of 20%, this study will
therefore need at least a sample size of 300/0.8 = 375 patients.”
66
A checklist to ease sample size calculation or estimation
67
End
Thank you
69