
MODULE 06: RESEARCH & EPIDEMIOLOGY

Introduction to Statistics and Statistical Inference


VEINCENT CHRISTIAN F. PEPITO, MSc
01/13/2021
RESEARCH, QUANTITATIVE STUDIES

TABLE OF CONTENTS

I. INTRODUCTION TO STATISTICS
   A. APPLICATION OF STATISTICS IN EPIDEMIOLOGY AND MEDICINE
   B. DESCRIPTIVE AND INFERENTIAL STATISTICS
   C. DEFINITION OF TERMS
   D. RELATIONSHIPS BETWEEN VARIABLES
II. PROBABILITY
   A. DEFINITION
   B. RULES OF PROBABILITY
   C. APPLICATIONS OF PROBABILITY IN EPIDEMIOLOGY AND MEDICINE
III. NORMAL DISTRIBUTION
   A. INTRODUCTION
   B. WHY THE NORMAL DISTRIBUTION?
   C. STANDARD NORMAL DISTRIBUTION
   D. SKEW
IV. PRINCIPLES OF STATISTICAL INFERENCE
   A. INTRODUCTION
   B. TARGET VS. STUDY POPULATION
   C. SAMPLING DISTRIBUTION
   D. CONFIDENCE INTERVALS
V. HYPOTHESIS TESTING
   A. INTRODUCTION
   B. P-VALUES
   C. TYPE I AND II ERRORS
QUICK REVIEW
   SUMMARY OF CONCEPTS
   SUMMARY OF NEED-TO-KNOWS (NDTK)
   SUMMARY OF PROCESSES
   SUMMARY OF MEMORY AIDS
   SUMMARY OF EQUATIONS
   REVIEW QUESTIONS
REFERENCES
   REQUIRED
   SUPPLEMENTARY
FREEDOM SPACE
APPENDIX

LEARNING OBJECTIVES
1) To discuss the basic principles of statistics and probability
2) To enumerate the measures of central tendency and explain the
importance of normal distribution in statistics
3) To discuss the basic principles of statistical inference

I. INTRODUCTION TO STATISTICS
• Statistics can mean two things:
o Data
§ The numbers we get when we measure or count things
o Methods
§ A collection of procedures that allows us to analyze data (statistical tests)
• Why do we need to study all of this?
o To conclude that our sample estimates represent population data, or to establish a causal association, we need to rule out three things:
§ Chance
Þ Addressed by statistical analysis
Þ Main thing to rule out
§ Bias
Þ Addressed by study design
§ Confounding
Þ Addressed by both statistical analysis and study design

A. APPLICATION OF STATISTICS IN EPIDEMIOLOGY AND MEDICINE
• Determining probability of survival from a disease or medical procedure
o E.g. Comparing old and new medical procedures
• Effectiveness of drugs or medical procedures
o E.g. Clinical trials
• Association between exposures and outcomes
o Usually studied in observational studies

B. DESCRIPTIVE AND INFERENTIAL STATISTICS

Descriptive Statistics
• Describes a set of data
• Used to organize, summarize, and present individual data values
• Data can be summarized with:
o Categorical data
§ Percentages or frequencies for data we can categorize
o Quantitative data
§ Average values and the spread of the values for data which we count or measure

Inferential Statistics
• Uses methods of probability to make inferences about a population using data from a sample
• Interplay between population, sample, and statistics is the theoretical basis for inferential statistics
o Start with a target population
§ Population where you aim to generalize the findings of your study
§ Difficult to study everyone in your target population
o Pick a study population
§ An area of the population you want to focus your study on
§ Study population may still be too big
o Select a sample out of your study population
o Collect data, test hypotheses, and generate test statistics from the sample
o Make conclusions and generalize them to the target population

Figure 1. Theoretical basis of inferential statistics

C. DEFINITION OF TERMS
• Data
o Values we collect from respondents or records
o E.g. From patient records in hospitals
• Variables
o Characteristic of a study subject that may vary from one respondent to another
o E.g. Age, sex, disease status, etc.

Qualitative Variables
• Aka categorical variables
• Characterizes a certain quality of a subject
• 3 types:
o Binary variables (Dichotomous)
§ Categorical variables that only have two values
§ E.g. Biological sex, disease status (sick or not sick), HIV+ or HIV-
o Nominal variables
§ Variables whose categories can be listed in any order
§ Have no inherent ordering
§ E.g. Religion, gender

Transcribed by TG 11: Ballelos, Cortez, Gamo, Lacerna, Lim, Monge, Pagayatan, Sanchez
YL6: 06.01
Checked by TG 23: Catacutan, Daco, Go, Luna, Mamaril, Mariano, Rita, Sing
o Ordinal variables
§ Variables whose categories have a natural or inherent ordering
§ E.g. Data from Likert scales (i.e., from strongly disagree to strongly agree)
Þ Note: Using quantitative statistical tests for data from Likert scales is highly discouraged
- Data from Likert scales are qualitative; thus they are inappropriate for quantitative statistical tests such as the t-test

Quantitative Variables
• Represents a counted or measured quantity
• 2 types:
o Interval variable
§ No true zero
§ Inherent ordering
§ Exact differences between values
§ E.g. Temperature in Celsius or Fahrenheit
o Ratio variable
§ Has a true zero
§ Inherent ordering
§ Exact differences between values
§ E.g. Temperature in Kelvin

Exposure Variable
• Aka Modified, Independent, or Predictor variable
• Variable of interest which you think could have an effect on the outcome variable
o Modified to assess its effect on an outcome
o E.g. In clinical trials, a patient given a drug is exposed, while a patient who does not receive the drug is unexposed
• Usually placed on the X-axis on graphs

EXPOSURE VARIABLE: MIX
• Modified variable
• Independent variable
• X-axis

Outcome Variable
• Aka Dependent or Response variable
• Variable of interest which you think is affected by the exposure variables/predictors
o In epidemiology or medicine, this is usually your disease status
§ Can also be other things such as HIV testing
o In an experiment, this is the variable that you are observing for a change as you vary your level of exposure
• Usually placed on the Y-axis

OUTCOME VARIABLE: DRY
• Dependent variable
• Response variable
• Y-axis

Intermediate Variable
• Variable that lies in the causal pathway between the exposure and outcome

D. RELATIONSHIPS BETWEEN VARIABLES

Confounder
• A variable that muddles or confounds the relationship between an exposure and an outcome
• Must satisfy all of the following criteria:
o Associated with the exposure
o Associated with the outcome
o Not in the causal pathway between the exposure and outcome
§ The variable should not be an intermediate variable
• It is important to do a thorough literature review to determine other variables that may confound the relationship between exposure and outcome
o Researcher must collect data on these variables to be able to control for them later in the analysis
• While the first two criteria can be tested statistically, note that there is no statistical test for confounding

Controlling Confounding Variables
• The researcher may control the confounding variables in the following stages using the enumerated methods:
o Design stage
§ Restriction
Þ E.g. limiting subjects to studying females only or males only
Þ May be inefficient
§ Matching (case-control studies)
Þ Select a control with similar characteristics to a case
§ Randomization
Þ Often done in trials
o Analysis stage
§ Regression Analysis
§ Stratified Analysis

[EXAMPLE] CONFOUNDING VARIABLE
A study found that coffee drinkers (exposure) are 4x more likely to have lung cancer (outcome) compared to those who don't drink coffee. Consider if this relationship is actually true or if there is a confounding variable.

Figure 2. Confounding relationship of smoking, drinking coffee, and lung cancer

• Confounding Variable: Smoking
o People who smoke are more likely to drink coffee, and smoking can also cause lung cancer
§ Satisfies the first two criteria of a confounding variable
o Smoking is not an intermediate variable because it is not between the two other variables
§ Satisfies the third criterion
• Intermediate Variable: Caffeine levels
o Drinking coffee causes caffeine levels in the body to increase and, in most cases, it is also associated with lung cancer (spurious relationship)
o Should not be controlled for in the analysis
• In this study, it's important to control for smoking (confounding variable) but not for caffeine levels (intermediate variable); a simulation sketch after the NDTK box below illustrates this with a stratified analysis

SOURCE: Veincent Christian F. Pepito, MSc – Introduction to Statistics and Probability (2021)

NDTK: Confounding vs Intermediate Variables
• Confounding variable: should be controlled for in the analysis
• Intermediate variable: should NOT be controlled for in the analysis
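
[CODE SKETCH] CONTROLLING A CONFOUNDER BY STRATIFICATION
To make the idea of controlling for a confounder concrete, the following minimal Python sketch (not from the lecture; all function names and numbers are hypothetical) simulates the coffee–smoking–lung cancer scenario and compares the crude risk ratio with the smoking-stratified risk ratios. Cancer risk is driven only by smoking, so any crude association between coffee and cancer is purely confounded.

import random

random.seed(1)

def simulate(n=200_000):
    """Simulate subjects; coffee drinking is linked to smoking, cancer risk only to smoking."""
    rows = []
    for _ in range(n):
        smoker = random.random() < 0.40                          # hypothetical smoking prevalence
        coffee = random.random() < (0.70 if smoker else 0.30)    # smokers drink more coffee
        cancer = random.random() < (0.20 if smoker else 0.02)    # risk depends on smoking only
        rows.append((smoker, coffee, cancer))
    return rows

def risk_ratio(rows):
    """Risk of cancer in coffee drinkers divided by the risk in non-drinkers."""
    drinkers = [cancer for (_, coffee, cancer) in rows if coffee]
    non_drinkers = [cancer for (_, coffee, cancer) in rows if not coffee]
    return (sum(drinkers) / len(drinkers)) / (sum(non_drinkers) / len(non_drinkers))

rows = simulate()
print("Crude RR, smoking ignored :", round(risk_ratio(rows), 2))                           # well above 1
print("RR among smokers          :", round(risk_ratio([r for r in rows if r[0]]), 2))      # close to 1
print("RR among non-smokers      :", round(risk_ratio([r for r in rows if not r[0]]), 2))  # close to 1

The crude risk ratio comes out around 2, but it melts away within each smoking stratum, which is exactly what a stratified analysis (or a regression adjusted for smoking) is meant to reveal.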
Effect Measure Modifier
• Variable that alters the effect of the exposure on the outcome
• E.g. Given a hypothetical drug that cures cancer among female patients but has no effect on male patients:
o Sex is an effect measure modifier in the association between taking the drug and curing cancer
• Can be assessed statistically, unlike confounding variables

II. PROBABILITY
A. DEFINITION
• The proportion of times that we would observe an outcome if we repeated the experiment a large number of times
o E.g. What is the probability of:
§ Drawing a queen of spades in a standard deck of playing cards
Þ 1/52
§ Throwing a '4' in a six-faced fair die
Þ 1/6
• Values are always 0 ≤ x ≤ 1 or [0, 1]

B. RULES OF PROBABILITY

Additive Law

P(A or B) = P(A) + P(B)
Equation 1. Additive Law

• For mutually exclusive events, the probability that either event occurs is the sum of the probabilities for each event
• Mutually Exclusive Events
o When one outcome happens, the other outcome can no longer occur
o E.g. Tossing a coin will result in either heads or tails, never both

[EXAMPLE] ADDITIVE LAW
What is the probability of getting 1 or 6 in a fair six-faced die?

• Solution:
o P(1 or 6) = P(1) + P(6)
o P(1 or 6) = 1/6 + 1/6 = 0.33

SOURCE: Veincent Christian F. Pepito, MSc – Introduction to Statistics and Probability (2021)

Multiplicative Law

P(A and B) = P(A) x P(B)
Equation 2. Multiplicative Law

• For independent events, the probability that both events occur is given by the product of their individual probabilities
• Independent Events
o When one outcome happens, it doesn't affect the probability of another event happening

[EXAMPLE] MULTIPLICATIVE LAW
What is the probability of getting two consecutive heads when flipping a fair coin twice?

• Solution:
o P(H, H) = P(H) x P(H)
o P(H, H) = 1/2 x 1/2 = 0.25

SOURCE: Veincent Christian F. Pepito, MSc – Introduction to Statistics and Probability (2021)
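
[CODE SKETCH] CHECKING THE ADDITIVE AND MULTIPLICATIVE LAWS
The two worked examples above can be verified in a few lines of Python. This is an illustrative sketch only (not part of the trans); it computes the exact answers with fractions and then re-checks them with a quick simulation.

import random
from fractions import Fraction

# Exact values from the two laws
p_1_or_6 = Fraction(1, 6) + Fraction(1, 6)        # additive law: mutually exclusive die faces
p_two_heads = Fraction(1, 2) * Fraction(1, 2)     # multiplicative law: independent coin flips
print("P(1 or 6)  =", p_1_or_6, "≈", round(float(p_1_or_6), 2))   # 1/3 ≈ 0.33
print("P(H and H) =", p_two_heads, "=", float(p_two_heads))       # 1/4 = 0.25

# Quick simulation check of both results
random.seed(0)
n = 100_000
rolls = sum(random.randint(1, 6) in (1, 6) for _ in range(n))
flips = sum(random.random() < 0.5 and random.random() < 0.5 for _ in range(n))
print("Simulated P(1 or 6)  ≈", round(rolls / n, 3))
print("Simulated P(H and H) ≈", round(flips / n, 3))

The exact fractions (1/3 and 1/4) match the 0.33 and 0.25 in the boxes above, and the simulated frequencies land very close to them.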
C. APPLICATIONS OF PROBABILITY IN EPIDEMIOLOGY AND MEDICINE
• It is possible to use probability to predict the likelihood of certain events

[EXAMPLE] APPLICATION OF PROBABILITY
Given the distribution of blood types in Table 1, how many donors will be blood group AB among the next 100 who will arrive?

Table 1. Sample Distribution of Blood Types

BLOOD TYPE    COUNTS    PROBABILITY
O             92        0.46
A             86        0.43
B             16        0.08
AB            6         0.03

• Solution:
o P(AB) = 6/200 = 0.03
§ Where:
Þ 6 = Count for AB
Þ 200 = Total counts for all blood types
• Answer:
o 3 people in the next 100 will have blood type AB

SOURCE: Veincent Christian F. Pepito, MSc – Introduction to Statistics and Probability (2021)

• In the example above, 3 donors in the first 100 are expected to be group AB
• However, it cannot be said for certain that there will be 3 group AB donors in the first 100 due to:
o Random variation
§ Affects what is observed especially when the number of experiments is not sufficiently large (e.g. 100)
o Small number of observations
§ Makes the expected outcome of 3% imprecise

The Law of Large Numbers
• An experiment repeated many times will result in an observed value that is equal to the expected value
o E.g. In the previous example of blood types, repeating the experiment sufficiently and getting a total count of 10,000 or 1,000,000 will result in the expected value of 3% being equal to the observed value

Figure 3. Average of Dice Rolls by Number of Rolls
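
[CODE SKETCH] RANDOM VARIATION AND THE LAW OF LARGE NUMBERS
The blood-donor example can be replayed in code. In this illustrative sketch (not from the lecture), each arriving donor is group AB with probability 0.03, taken from Table 1; the counts and proportions are then compared across sample sizes.

import random

random.seed(42)
P_AB = 0.03  # probability that a donor is group AB (Table 1)

def observed_ab(n_donors):
    """Count how many of n_donors simulated arrivals are group AB."""
    return sum(random.random() < P_AB for _ in range(n_donors))

# With only 100 donors, the count bounces around the expected value of 3
print([observed_ab(100) for _ in range(10)])

# With more donors, the observed proportion settles near 3%
for n in (100, 10_000, 1_000_000):
    print(n, "donors ->", round(observed_ab(n) / n, 4))

The first line shows the run-to-run scatter with 100 donors (random variation with a small number of observations); the second shows the observed proportion approaching 0.03 as the number of donors grows (the law of large numbers).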
III. NORMAL DISTRIBUTION
A. INTRODUCTION

Measures of Central Tendency
• Mean
o Aka arithmetic average
o Sum of the observations divided by the number of observations
• Median
o The value that divides the data in half
o For an odd number of observations, the median is the middle observation when the observations are tallied in ascending order
o For an even number of observations, the median is the average of the two middle observations when the observations are tallied in ascending order
• Mode
o Most common value appearing in the data
o Not commonly used

Normal Distribution
• It is the most important distribution in statistics
o Bell-shaped
o Defined by its mean and standard deviation

Figure 4. Example of a Normal Distribution Curve
Mean
• Tells the location (or center) of the distribution

Figure 5. Examples of Mean Values

Standard Deviation
• Measure of spread or dispersion of a set of data
o Average deviation of the observations from the mean
• Calculated as the square root of the variance
• The more widely spread out the values, the larger the standard deviation

Figure 6. Example of a Standard Deviation
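
[CODE SKETCH] MEAN, MEDIAN, MODE, AND STANDARD DEVIATION
Python's built-in statistics module computes all of the summary measures discussed above. The data set below is made up purely for illustration.

import statistics as st

values = [2, 3, 3, 4, 5, 6, 20]   # hypothetical data with one high outlier

print("mean    :", round(st.mean(values), 2))      # 6.14, pulled upward by the outlier
print("median  :", st.median(values))              # 4
print("mode    :", st.mode(values))                # 3
print("variance:", round(st.variance(values), 2))  # sample variance
print("SD      :", round(st.stdev(values), 2))     # square root of the variance

Note that statistics.stdev and statistics.variance use the sample (n − 1) denominator; statistics.pstdev and statistics.pvariance are the population versions.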

B. WHY THE NORMAL DISTRIBUTION?
• It is used to represent the distribution of values that would be observed if we could examine everybody in the population (i.e. sufficiently many times)
o Y-axis in a normal distribution represents probability
• Used to represent the sampling distribution
• It is defined by a complicated mathematical formula
o No single formula or table covers every possible combination of mean and standard deviation
o Hence the need to convert normal distributions to the standard normal distribution

C. STANDARD NORMAL DISTRIBUTION
• The standard normal distribution is used to determine areas under the curve
o E.g. determining what proportion of the population has scores lower or higher than yours
o In a standard normal distribution:
§ Mean = 0
§ Standard Deviation (SD) = 1
o Z-scores
§ Standard normal scores
§ Tell how many standard deviations away a certain value is from the mean
§ They are converted to areas under the curve using the z-table

z = (Value − Mean) / SD
Equation 3. Z-score Formula

Z-table
• Can be used to compute the area between -1 and 1 SD, -2 and 2 SD, and so on
o Figure 8 depicts a standard normal distribution where the area between -1 and 1 SD is shown
• Can also be used to compute the areas outside a given range by taking the complement of the values within the range
o A table of the areas outside a given range is called a two-tailed z-table, as shown in Figure 9

Table 2. Range of SD and corresponding areas within it

RANGE (SD)    AREA WITHIN THE RANGE
(-1, 1)       68.3%
(-2, 2)       95.4%
(-3, 3)       99.7%

[EXAMPLE] COMPARING NMAT SCORES
Assuming that you got a score of 650 in the NMAT and the mean score is 500 with an SD of 100, what percent of the population scored lower than you?

• Given:
o NMAT score = 650
o Mean = 500
o SD = 100
• Solution:
o z = (Value − Mean) / SD
o z = (650 − 500) / 100 = 1.50
• Answer:
o Using Equation 3, a z-score of 1.50 can be derived
o Based on Figure 7, a z-score of 1.50 corresponds to a P-lower value of 0.9332 and a P-upper value of 0.0668
§ This means that an NMAT score of 650 is higher than 93.32% of all test takers, but lower than 6.68% of test takers

Figure 7. Example of a Z-table

SOURCE: Veincent Christian F. Pepito, MSc – Measures of Central Tendency and Normal Distribution (2021)

Figure 8. The Area between -1 and 1 SD in a Standard Normal Distribution. This represents 68.3% of the population.
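
[CODE SKETCH] Z-SCORES AND AREAS UNDER THE CURVE
Instead of reading a printed z-table, standard normal areas can be computed from the error function in Python's math module. The sketch below is illustrative only; it reproduces the NMAT example and the ranges in Table 2.

from math import erf, sqrt

def phi(z):
    """Area under the standard normal curve to the left of z (the 'P-lower' value)."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# NMAT example: score 650, mean 500, SD 100
z = (650 - 500) / 100
print("z =", z, " P-lower =", round(phi(z), 4), " P-upper =", round(1 - phi(z), 4))
# z = 1.5, P-lower = 0.9332, P-upper = 0.0668

# Areas within ±1, ±2, ±3 SD (compare with Table 2)
for k in (1, 2, 3):
    print(f"within ±{k} SD: {100 * (phi(k) - phi(-k)):.1f}%")
# 68.3%, 95.4%, 99.7%

The same phi function also handles negative z-scores directly, so for the cholesterol example below, phi(-0.65) ≈ 0.2578, i.e. about 25.78% of the population falls under 166 mg/dL.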

Figure 9. Example of a Two-Tailed Z-Table

[EXAMPLE] NORMAL DISTRIBUTION
Assuming that the mean cholesterol level of the population is 196 mg/dL and the standard deviation is 46 mg/dL, what is the proportion of the population with a cholesterol level less than 166 mg/dL?

• Given:
o Mean = 196 mg/dL
o SD = 46 mg/dL
• Solution:
o Convert 166 mg/dL to its standardized equivalent (i.e., determine the z-score)
§ z = (Value − Mean) / SD
§ z = (166 − 196) / 46 = −0.65
o Convert the z-score to the area under the curve using the z-table (Figure 10. Z-table [Refer to Appendix A])
§ Look at the leftmost column first (tenths place), and then the topmost row (hundredths place) of the z-table
§ The intersection between the two represents the area under the curve for z = 0.65
Þ Area under the curve = 0.7422
Þ Take note that the z-score is -0.65, not 0.65
o Plot the distribution with the given values (see Figure 11)

Figure 11. Normal Distribution Curve

§ 0.7422 represents the area under the curve up to z = 0.65 (yellow lines)
§ However, since the z-score is -0.65, we need to calculate the area from z = -0.65 to the left (blue lines)
o Recall: The total area under the normal distribution curve is 1
§ 1 − 0.7422 = 0.2578 = 25.78%
• Answer:
o 25.78% of the population have cholesterol levels under 166 mg/dL

SOURCE: Veincent Christian F. Pepito, MSc – Measures of Central Tendency and Normal Distribution (2021)

TIP: When answering questions like these, it always pays to draw a normal distribution curve and plot the values accordingly, as it helps you arrive at a sensible interpretation of your data.

D. SKEW

Figure 12. Skewed Distributions

• Skewed distributions
o Some variables are not normally distributed despite a large number of respondents; the distribution becomes skewed due to outliers
§ E.g. serum bilirubin
o Standard deviation is numerically greater than the mean
• Data can be skewed 2 ways:
o Positive Skew
§ The curve is skewed to the right
§ Mean > Median > Mode
o Negative Skew
§ The curve is skewed to the left
§ Mean < Median < Mode
o Recall: For a normal distribution, mean = mode = median
• The mean is not a good measure of central tendency for skewed distributions
o Median is the better measurement
• Variables with a skewed distribution may be ineligible for some statistical tests that assume normality (e.g. linear regression, ANOVA, t-test)
o Should such statistical tests be used for data sets with a skewed distribution, the data should first be transformed (e.g. log transformation) or non-parametric equivalents of the tests should be used instead (e.g. rank sum test in place of the t-test); see the sketch after this list
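
[CODE SKETCH] SKEWED DATA AND THE LOG TRANSFORMATION
As a hedged illustration of the transformation point above (not from the lecture), the sketch below draws right-skewed values from a log-normal distribution, shows that the mean sits above the median, and then checks that a log transformation pulls them back together.

import math
import random
import statistics as st

random.seed(7)

# Hypothetical right-skewed variable (e.g. a serum marker), log-normal by construction
raw = [random.lognormvariate(0, 1) for _ in range(10_000)]
print("raw   : mean =", round(st.mean(raw), 2), " median =", round(st.median(raw), 2))
# mean clearly larger than the median -> positive skew

# Log-transform the same values
logged = [math.log(x) for x in raw]
print("logged: mean =", round(st.mean(logged), 2), " median =", round(st.median(logged), 2))
# after the log transform, mean and median are both close to 0 -> roughly symmetric

With the raw values a t-test's normality assumption is doubtful; after the log transformation (or with a rank-based test) the comparison rests on safer ground.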

ACTIVE RECALL
1. Normal distribution is defined by its:
a) Mean
b) Standard deviation
c) Both A & B
d) Neither A nor B
2. T/F. In a normal distribution, the mean is equal to 1.
3. T/F. In a skewed distribution, tests assuming normality cannot be used to analyze data that has not been transformed first.

ANSWERS: 1C, 2F, 3T

IV. PRINCIPLES OF STATISTICAL INFERENCE
A. INTRODUCTION
• Sampling is done because it is impractical to study everyone
• A sample is chosen in order to obtain information about a particular feature of the target population
o However, there are no means to directly measure this feature in everyone, and studying everybody is expensive
o Thus, information is collected from a random sample and is used as the best estimate of the population value
• As much as possible, sampling should be done by random sampling
o Every member of the population has an equal chance to be selected regardless of whether other members have already been picked
§ Ensures that the sample is representative of the study population
o There may be a selection bias if the respondents are not chosen at random
§ The sample is not representative of the target population
§ Conclusions from the study may be erroneous and not generalizable to the target population
• Population values are usually denoted by Greek letters (π, μ)
• Sample estimates are usually denoted by Roman letters (p, x̄)
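
[CODE SKETCH] SIMPLE RANDOM SAMPLING
A simple random sample is easy to mimic in code. This brief sketch uses hypothetical numbers only: it draws a random sample from a made-up study population and compares the sample estimate with the population value it stands in for.

import random
import statistics as st

random.seed(3)

# Hypothetical study population: ages of 10,000 people
population = [random.gauss(40, 12) for _ in range(10_000)]
mu = st.mean(population)                 # population mean (normally unknown to the researcher)

# Simple random sample: every member has the same chance of selection
sample = random.sample(population, k=200)
x_bar = st.mean(sample)                  # sample estimate used in place of mu

print("population mean ≈", round(mu, 2))
print("sample mean     ≈", round(x_bar, 2))

In a real study only the sample line is available; the point of statistical inference is to say how far x̄ can plausibly be from the unobserved population mean.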

B. TARGET VS. STUDY POPULATION
• Recall:
o Target Population
§ Population about which we aim to generalize the findings of the
study
o Study Population
§ Population about which we can obtain information from

[EXAMPLE] POPULATION AND SAMPLE PREVALENCE


In the population, the prevalence (π) of smoking among school children is 30%. Suppose we collect a random sample of 800 school children:
1) Will our sample prevalence (p) be the same as our population prevalence (π)?
o Sometimes it will be, sometimes it won't. This variation is due to chance (random variation)
§ The idea of statistical testing is to make sure that the differences in the results of a study are real and are not merely due to chance
2) Assuming that our sample showed that the p of smoking is 30%, would another sample obtain a p of 30%?
o It depends on the sample; it can be the same, lower, or higher.

SOURCE: Veincent Christian F. Pepito, MSc – Introduction to Statistical Inference (2021)

ACTIVE RECALL
4. T/F. In a positively skewed distribution, the curve is skewed to the left and the median is less than the mean.
5. T/F. Random sampling assures that the sample is representative of the study population.

ANSWERS: 4F, 5T

C. SAMPLING DISTRIBUTION
• The distribution of the proportion of smokers obtained from 1,000 different samples (see Figure 13)
o Most of the samples are close to the true prevalence π = 30%, and p ranges from 24-36%, which can happen due to chance
o Distribution is nearly symmetrical

Figure 13. Histogram showing the Sampling Distribution

• In theory, the population can be sampled many times and different sample estimates can be obtained
o Not done because it is too expensive, and sampling distributions are actually never observed
o In practice, only one sample is taken from the population of interest
§ Used to relate the sample estimates to the true population value

Characteristics of Sampling Distributions
• The mean of the sampling distribution of the estimates obtained from different samples of identical size is the same as the population value, regardless of the size of the samples (see Figure 14)
o Note that the mean of both sampling distributions with different sample sizes is the same (0.8)

Figure 14. Sampling Distributions of Different Samples of Identical Size

• The larger the sample size, the narrower the sampling distribution of the estimates obtained from the samples
• The shape of a sampling distribution becomes closer to that of a normal distribution as the sample size increases
• The standard error decreases as the sample size increases
o Standard Error (SE)
§ Standard deviation of the sampling distribution
§ Should be kept as low as possible by increasing the sample size for a more precise sampling distribution
§ Depends inversely on the square root of the sample size used
§ Note: Equation 4 was not discussed extensively; hence some of the information regarding the variables was gathered from Kirkwood and Sterne's Essential Medical Statistics

SE(p) = √(π(1 − π) / n)          SE(x̄) = σ / √n
Equation 4. Standard Error

• Where:
o SE = standard error
o p = sample value
o π = population value
o n = number of observations
o x̄ = mean
o σ = population standard deviation

NDTK: Standard Deviation vs Standard Error
• Standard Deviation: variability in individual data or the sample
• Standard Error: standard deviation of a sampling distribution

SOURCE: 06.05, 2023
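
[CODE SKETCH] A SAMPLING DISTRIBUTION AND ITS STANDARD ERROR
The following sketch is illustrative only: it repeats the smoking-prevalence example in code by drawing many random samples of 800 children from a population with π = 0.30, then compares the spread of the sample prevalences with the standard error from Equation 4.

import math
import random
import statistics as st

random.seed(2024)
PI, N_CHILDREN, N_SAMPLES = 0.30, 800, 1_000

def sample_prevalence(n):
    """Prevalence of smoking in one random sample of n children."""
    return sum(random.random() < PI for _ in range(n)) / n

p_hats = [sample_prevalence(N_CHILDREN) for _ in range(N_SAMPLES)]

print("mean of sample prevalences:", round(st.mean(p_hats), 3))   # close to 0.30
print("SD of sample prevalences  :", round(st.stdev(p_hats), 4))  # empirical standard error
print("SE from Equation 4        :", round(math.sqrt(PI * (1 - PI) / N_CHILDREN), 4))  # ≈ 0.0162

Most of the 1,000 sample prevalences land within a few percentage points of 30%, matching the nearly symmetrical histogram described for Figure 13, and because SE depends on 1/√n, quadrupling the sample size would halve it.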

Central Limit Theorem
• States that when the sample size is large enough, the sampling distribution of the estimates is always normal
o Happens even if the distribution of the original data is not normal
o Explains why the normal distribution is the most important distribution in statistics
• Since the sampling distribution is normal, 95% of the sample means (or proportions) fall within 1.96 times the standard error
o 1.96 is the multiplier in the 95% confidence interval
o 95% of sample means (or proportions) are in the range as calculated in Equation 5

x̄ ± 1.96 × SE
Equation 5. Estimated Mean with 95% Confidence Interval

• Very few of the estimated sampling distributions will be very far from the true sampling distribution because:
o All the estimated sampling distributions are centered around their sample means
o Most of the sample means are close to the true mean

Figure 15. Estimated Sampling Distributions. Probability of obtaining extreme samples (green and blue) is low.

D. CONFIDENCE INTERVALS
• The intervals around the estimated mean which we can be confident contain the true population mean
o "We are 95% confident that the true mean (or proportion) is between the lower limit and the upper limit"
• Calculated and presented because the values in the data are only estimates and not the true values of the population
• Different confidence intervals are used in statistics, e.g., 90%, 95%, 99%; 95% is the most commonly used
o An increasing confidence level means that the confidence interval also becomes wider
o Multipliers of commonly used CIs are:
§ 90% CI: 1.645
§ 95% CI: 1.96
§ 99% CI: 2.58
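
[CODE SKETCH] BUILDING A 95% CONFIDENCE INTERVAL
Putting Equation 4 and the 1.96 multiplier together, this sketch (assumed numbers, not from the lecture) builds confidence intervals around a sample prevalence, continuing the smoking example of p = 0.30 from 800 children.

import math

def ci_for_proportion(p, n, multiplier=1.96):
    """Confidence interval p ± multiplier × SE, with SE taken from Equation 4."""
    se = math.sqrt(p * (1 - p) / n)   # the sample p stands in for the unknown π
    return p - multiplier * se, p + multiplier * se

low, high = ci_for_proportion(0.30, 800)
print(f"95% CI: {low:.3f} to {high:.3f}")   # roughly 0.268 to 0.332

# Wider interval for higher confidence, narrower for lower confidence
print("90% CI:", tuple(round(x, 3) for x in ci_for_proportion(0.30, 800, 1.645)))
print("99% CI:", tuple(round(x, 3) for x in ci_for_proportion(0.30, 800, 2.58)))

The interpretation is the one quoted in the bullet above: we are 95% confident that the true prevalence lies between about 26.8% and 33.2%.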
V. HYPOTHESIS TESTING
A. INTRODUCTION
• Testing an assumption regarding a population parameter
o Null hypothesis (Ho)
§ Always a statement of equivalence
§ E.g. There is no significant difference between the heights of males and females
o Alternative hypothesis (Ha)
§ Always a statement of disagreement
§ E.g. There is a significant difference between the heights of males and females
• Tests if the sample is different from the population value
• Involves calculation of the probability of obtaining the observed data if the null hypothesis were true

Steps in Hypothesis Testing
1) Clarify and state your null and alternative hypotheses
2) Collect data
3) Compute for the p-value using the appropriate statistical test
4) Make your conclusions

B. P-VALUES

Figure 16. P-value

• The probability of obtaining the observed or a more extreme sample estimate if the null hypothesis is true
• Not a measurement of how true the null hypothesis is
• Quantifies the strength of the evidence

Large P-Value
• Greater than the level of significance α
• The evidence against the null hypothesis is weak
• The chance of observing a value as extreme as the sampled value would be high if the null hypothesis is true
• Sampling variation alone can be the reason for the difference between the estimate and the parameter (or the null value)
o This means that the findings could just be due to chance
o E.g. If the p-value = 0.5, there is a 50% chance (a high chance) of observing a value as extreme as the one sampled if the null hypothesis is true
o Since the result might be due to chance, there might be no significant effect if you sample others
• Sample conclusion made from a large p-value:
o "There is little evidence that the heights of males and females significantly differ."

Small P-Value
• Less than the level of significance α
• The evidence against the null hypothesis is strong
• The chance of observing a value as extreme as the sampled value would be low if the null hypothesis is true
• Sampling variation alone is unlikely to be the reason for the difference between the estimate and the parameter (or null value)
o This means that the finding is most likely not due to chance
• Sample conclusion made from a small p-value:
o "There is strong evidence that the heights of males and females are significantly different."
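
[CODE SKETCH] COMPUTING A P-VALUE WITH A Z-TEST
A p-value needs nothing more than the standard normal area function used earlier. The sketch below is hypothetical (not from the lecture): it runs a two-sided z-test of whether an observed prevalence of 27% in 800 children is compatible with the null hypothesis that the true prevalence is 30%.

from math import erf, sqrt

def phi(z):
    """Standard normal area to the left of z."""
    return 0.5 * (1 + erf(z / sqrt(2)))

p_hat, pi_0, n = 0.27, 0.30, 800       # observed prevalence, null value, sample size
se = sqrt(pi_0 * (1 - pi_0) / n)       # standard error under the null hypothesis
z = (p_hat - pi_0) / se
p_value = 2 * (1 - phi(abs(z)))        # two-sided: deviations in either direction count

print("z =", round(z, 2), " p-value =", round(p_value, 3))
# z ≈ -1.85, p-value ≈ 0.064: larger than α = 0.05, so Ho is not rejected

Had the observed prevalence been 25%, the same code would give z ≈ -3.09 and a p-value of about 0.002, which the wording above would describe as strong evidence against the null hypothesis.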

C. TYPE I AND II ERRORS
• A significance test can never prove that a null hypothesis is either true or false
o It only gives an indication of the strength of the evidence against the null hypothesis

Types of Errors in Doing Hypothesis Tests

Type I Error
• Rejecting a null hypothesis when it is true
• A significant effect is stated even when there is none
• False positive

Type II Error
• Failing to reject a null hypothesis when it is false
• No significant effect is stated when there should be a significant effect
• False negative

Contingency Table in Hypothesis Testing

Figure 17. Type I and II Errors

• True Positive
o There is a significant effect in reality and in the findings of the study
o The probability of getting a true positive result is equal to your study power
• True Negative
o There is no significant effect in reality and in the findings of the study
• False Positive (Type I Error)
o There is no significant effect in reality, but the findings of the study state that there is a significant effect
• False Negative (Type II Error)
o There is a significant effect in reality, but the findings of the study state that there is no significant effect

α: Level of Significance
• Probability of committing a type I error
• The threshold below which the p-value will be considered significant
• Usually set at 0.05
o Implies that 1 out of 20 tests of a true null hypothesis will still come out significant purely by chance

β
• Probability of committing a type II error
• β is minimized by increasing study/statistical power
o Study/statistical power
§ Probability of rejecting a null hypothesis when a true effect actually exists (true positive)
§ Power can be increased if the sample size is increased
Þ When you increase your sample size, you are less likely to commit a type II error (see the sketch below)
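
[CODE SKETCH] α AND POWER BY SIMULATION
The meanings of α and statistical power can be checked by brute force. This illustrative sketch (hypothetical numbers, not from the lecture) reuses the two-sided z-test from section B and counts how often it rejects Ho across many simulated studies, first when Ho is true and then when a real effect exists, at two sample sizes.

import random
from math import erf, sqrt

random.seed(5)

def phi(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

def rejects(true_pi, n, pi_0=0.30, alpha=0.05):
    """Simulate one study of size n and test Ho: π = pi_0 with a two-sided z-test."""
    p_hat = sum(random.random() < true_pi for _ in range(n)) / n
    se = sqrt(pi_0 * (1 - pi_0) / n)
    p_value = 2 * (1 - phi(abs((p_hat - pi_0) / se)))
    return p_value < alpha

def rejection_rate(true_pi, n, trials=2_000):
    return sum(rejects(true_pi, n) for _ in range(trials)) / trials

print("Type I error rate (Ho true, π = 0.30, n = 800):", rejection_rate(0.30, 800))  # about 0.05 = α
print("Power (true π = 0.25, n = 200):", rejection_rate(0.25, 200))                  # modest
print("Power (true π = 0.25, n = 800):", rejection_rate(0.25, 800))                  # much higher

The first rate stays near α = 0.05 by construction, while the jump between the last two lines is the point made above: a larger sample size raises power and therefore lowers the chance of a type II error.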
HOW'S MY TRANSING?
Feedback Form: https://tinyurl.com/2024YL6gHMT
Errata Tracker: https://tinyurl.com/2024YL6ET06

QUICK REVIEW
SUMMARY OF CONCEPTS
• Statistics can mean two things:
o Data: the numbers we get when we measure or count things
o Methods: a collection of procedures that allows us to analyze data (statistical tests)
• Three things that must be ruled out when concluding that the sample represents population data:
o Chance: addressed by statistical analysis; main thing to rule out
o Bias: addressed by study design
o Confounding: addressed by both analysis and study design
• Descriptive statistics: describes a set of data and is used to organize, summarize, and present individual data values
o Categorical data: percentages or frequencies for data we can categorize
o Quantitative data: average values and the spread of the values for data which we count or measure
• Inferential statistics: uses methods of probability to make inferences about a population using data from a sample
• Data: values we collect from respondents or records
• Variables: characteristic of a study subject that may vary from one respondent to another
o Qualitative variables: categorical variables; characterize a certain quality of a subject
§ Binary variables: dichotomous; variables that only have two values
§ Nominal variables: variables whose categories can be listed in any order
§ Ordinal variables: variables whose categories have a natural or inherent ordering
o Quantitative variables: represent a counted or measured quantity, with inherent ordering and exact differences between values
§ Interval variable: no true zero
§ Ratio variable: has a true zero
o Exposure variable: variable of interest which you think could have an effect on the outcome variable; modified to assess its effect on the outcome
o Outcome variable: variable of interest which you think is affected by the exposure variables/predictors
o Intermediate variable: variable that lies in the causal pathway between the exposure and outcome
o Confounder: variable that muddles or confounds the relationship between an exposure and an outcome
§ Associated with the exposure
§ Associated with the outcome
§ Not in the causal pathway between the exposure and outcome
§ Controlled in the design stage (restriction, matching, and randomization) and analysis stage (regression and stratified analyses)
o Effect Measure Modifier: variable that alters the effect of the exposure on the outcome

PROBABILITY
• The proportion of times that we would observe an outcome if we repeated the experiment a large number of times
• Additive Law: for mutually exclusive events, the probability that either event occurs is the sum of the probabilities for each event
o Mutually Exclusive Events: when one outcome happens, the other outcome can no longer occur
• Multiplicative Law: for independent events, the probability that both events occur is given by the product of their individual probabilities
o Independent Events: when one outcome happens, it doesn't affect the probability of another event happening
• Random variation: affects what is observed especially when the number of experiments is not sufficiently large
• Law of Large Numbers: an experiment repeated many times will result in an observed value that is equal to the expected value

NORMAL DISTRIBUTION
MEASURES OF CENTRAL TENDENCY
• Mean: arithmetic average; sum of the observations divided by the number of observations
• Median: the value that divides the data in half
o Odd number of observations: the middle observation when the observations are tallied in ascending order
o Even number of observations: the average of the two middle observations when the observations are tallied in ascending order
• Mode: most common value appearing in the data
• Normal distribution: represents the distribution of values that would be observed if we could examine everybody in the population; bell-shaped
o Defined by:
§ Mean: location or center of the distribution
§ Standard deviation: measure of spread or dispersion of a set of data
• Standard Normal Distribution: used to determine areas under the curve
o In a standard normal distribution, mean = 0 and SD = 1
o Z-score: standard normal score; tells how many standard deviations away a certain value is from the mean
o Z-table: used to compute the area between -1 and 1 SD, -2 and 2 SD, and so on; also used to compute the areas outside a given range by taking the complement of the values within the range
• Skewed distributions: some variables are not normally distributed despite a large number of respondents; the distribution becomes skewed due to outliers
o Positive skew: skewed to the right, where mean > median > mode
o Negative skew: skewed to the left, where mean < median < mode
o The median is the measurement used for skewed distributions

PRINCIPLES OF STATISTICAL INFERENCE
• Random sampling: every member of the population has an equal chance to be selected regardless of whether other members have already been picked
• Selection bias: can occur when respondents are not chosen at random; the sample is not representative of the target population and conclusions from the study may be erroneous and not generalizable to the target population
• Target population: population about which we aim to generalize the findings of the study
• Study population: population from which we can obtain information
• Sampling distribution: the distribution of the proportion of smokers obtained from 1,000 different samples; distribution is nearly symmetrical
o Standard error: standard deviation of the sampling distribution
o Characteristics of sampling distributions include:
§ The mean of the sampling distribution of the estimates obtained from different samples of identical size is the same as the population value, regardless of the size of the samples
§ The larger the sample size, the narrower the sampling distribution of the estimates obtained from the samples
§ The shape of a sampling distribution becomes closer to that of a normal distribution as the sample size increases
§ The standard error decreases as the sample size increases
o Central Limit Theorem: states that when the sample size is large enough, the sampling distribution of the estimates is always normal
• Confidence Intervals: the intervals around the estimated mean which we can be confident contain the true population mean

HYPOTHESIS TESTING
• Testing an assumption regarding a population parameter
• Null hypothesis (Ho): always a statement of equivalence
• Alternative hypothesis (Ha): always a statement of disagreement
• P-values: the probability of obtaining the observed or a more extreme sample estimate if the null hypothesis is true; quantifies the strength of the evidence
o Large p-value: greater than the level of significance α
§ The evidence against the null hypothesis is weak
§ The chance of observing a value as extreme as the sampled value would be high if the null hypothesis is true
o Small p-value: less than the level of significance α
§ The evidence against the null hypothesis is strong
§ The chance of observing a value as extreme as the sampled value would be low if the null hypothesis is true
• True Positive: there is a significant effect in reality and in the findings of the study
o The probability of getting a true positive result is equal to your study power
• True Negative: there is no significant effect in reality and in the findings of the study
• Type I Error/False Positive: rejecting a null hypothesis when it is true; a significant effect is stated even when there is none
• Type II Error/False Negative: failing to reject a null hypothesis when it is false; no significant effect is stated when there should be a significant effect
• α/Level of Significance: probability of committing a type I error
o The threshold below which the p-value will be considered significant
o Usually set at 0.05
• β: probability of committing a type II error
• Study/statistical power: probability of rejecting a null hypothesis when a true effect actually exists (true positive)

SUMMARY OF NEED-TO-KNOWS (NDTK)
• Confounding vs Intermediate Variables
o Confounding variable: should be controlled for in the analysis
o Intermediate variable: should NOT be controlled for in the analysis
• Standard Deviation vs Standard Error
o Standard Deviation: variability in individual data or the sample
o Standard Error: standard deviation of a sampling distribution

SUMMARY OF PROCESSES

STEPS IN HYPOTHESIS TESTING
1) Clarify and state your null and alternative hypotheses
2) Collect data
3) Compute for the p-value using the appropriate statistical test
4) Make your conclusions

SUMMARY OF MEMORY AIDS
• EXPOSURE VARIABLE: MIX
o Modified variable
o Independent variable
o X-axis
• OUTCOME VARIABLE: DRY
o Dependent variable
o Response variable
o Y-axis

SUMMARY OF EQUATIONS

Equation 1. Additive Law

P(A or B) = P(A) + P(B)

• Where:
o A, B = mutually exclusive events

Equation 2. Multiplicative Law

P(A and B) = P(A) x P(B)

• Where:
o A, B = independent events

Equation 3. Z-score Formula

z = (Value − Mean) / SD

• Where:
o z = z-score
o SD = standard deviation

Equation 4. Standard Error

SE(p) = √(π(1 − π) / n)          SE(x̄) = σ / √n

• Where:
o SE = standard error
o p = sample value
o π = population value
o n = number of observations
o x̄ = mean
o σ = population standard deviation

REVIEW QUESTIONS

1. James wanted to see if the amount of exercise a person gets in a week has an effect on mental status. In this experiment, what kind of variable is amount of exercise?
a) Response variable
b) Ordinal variable
c) Modified variable
d) Dependent variable

2. Which of the following is part of the theoretical basis of inferential statistics?
a) Population
b) Sample
c) Statistics
d) All of the above
e) None of the above

3. Which law refers to the probability of two events given by the product of their individual probabilities?
a) Additive
b) Multiplicative
c) Law of Large Numbers
d) None of the above

4. Which of the following is false?
a) Probability can have values that are 0 ≤ x ≤ 1
b) Effect measure modifier can be measured statistically
c) Additive law of probability considers independent events
d) None of the above

5. Which of the following is true about standard deviation?
a) It is the average deviation of the observations from the median value
b) Calculated by the square root of the mean
c) The more widely spread out the values, the smaller the standard deviation
d) NOTA

6. The mean score of batch 2024's 2nd Pharmacology comprehensive exam is 58 with a standard deviation of 14. What is the proportion of the batch who scored above 60?
a) 55.57%
b) 14.28%
c) 44.43%
d) 85.72%
e) 43.44%

7. Which of the following statements is true?
a) Statistical tests that assume normality may be used on data with a skewed distribution as is.
b) The study population is bigger than the target population.
c) The mode is the best measure of central tendency for skewed distributions.
d) Random sampling ensures that the sample is representative of the study population.
8. Which of the following may result from not selecting the sample randomly?
a) Type 1 error
b) Type 2 error
c) Type 3 error
d) Selection bias
e) Failure of study

9. This theorem states that when the sample size is large enough, the
sampling distribution of the estimates is always normal.
a) Sampling Distribution
b) Central Limit
c) Confidence Interval
d) Pythagorean

10. What is the multiplier in the 95% confidence interval?


a) 1.66
b) 1.69
c) 1.96
d) 1.99

11. T/F: The standard error is the variability in individual data or the sample.

12. T/F: A type I error occurs when the study states the presence of a significant
effect even though in reality, there is none.

13. Which of the following is false when talking about p-values?


a) A large p-value indicates that the findings could just be due to chance
b) A small p-value indicates that the evidence against the null hypothesis
is strong
c) P-values measure how true the null hypothesis is
d) P-values quantify the strength of the evidence

14. Statement I: A high statistical power can minimize type I errors.


Statement II: Statistical power can be increased by increasing the sample
size of the study.
a) Only statement I is true
b) Only statement II is true
c) Both statements are true
d) Both statements are false

ANSWERS:
1C, 2D, 3B, 4C, 5D, 6C, 7D, 8D, 9B, 10C, 11F, 12T, 13C, 14B

EXPLANATIONS:
6. C – Calculate the z-score (z = 0.1428). Find the area under the curve (0.5557).
This represents the area to the left of z=0.1428. Since we are looking for the
proportion of the batch that scored higher than 60, we need to get the area
of the curve to the right of z=0.1428. To do that: 1 − 0.5557 = 0.4443 = 44.43%.

REFERENCES
REQUIRED
(1) Veincent Christian F. Pepito, MSc. Introduction to Statistics and Probability
[Lecture slides].
(2) Veincent Christian F. Pepito, MSc. Measures of Central Tendency and
Normal Distribution [Lecture slides].
(3) Veincent Christian F. Pepito, MSc. Introduction to Statistical Inference
[Lecture slides].
(4) ASMPH 2023. 06.05: Inferential Statistics by Veincent Christian F. Pepito, MSc.

SUPPLEMENTARY
(1) Kirkwood, Betty, and Jonathan A.C. Sterne. Essential Medical Statistics. Massachusetts: Blackwell Science Ltd, 2003.

FREEDOM SPACE

APPENDIX

APPENDIX A: Figure 10. Z-table

