
Population of interest: group about which researchers want to draw conclusions
Population parameter: numerical fact about the population
Sample: proportion of the population selected
• used when a census is not readily available
• preferred over a census because a sample is less costly administratively, and data collection + processing is faster
Census: an attempt to reach out to the entire population of interest
Estimate: inference about the population's parameter, based on information obtained from a sample
Sampling frame: list from which the sample was obtained
→ may not cover the population of interest, or may contain units that are not in the population of interest

Probability Sampling: randomized mechanism
Non-probability Sampling: human discretion

Selection Bias: an imperfect sampling frame excludes units from being selected. Can also be caused by non-probability sampling.
Non-response Bias: participants' non-disclosure or exclusion of information. May occur regardless of whether the sampling method is probabilistic or non-probabilistic.

Convenience Sampling:
→ selects those most easily available to participate in research (e.g. mall surveys)
→ selection bias: certain members of the population are left out
→ non-response bias: people opt out

Volunteer Sampling:
→ actively seeks volunteers to participate (e.g. online polls)
→ non-response bias: people unwilling to participate
→ selection bias: picks those more likely to respond in a desirable way

Conditions for generalizability:
1. Sampling frame ≥ population of interest (equals or covers it)
2. Probability-based sampling, to minimise selection bias
3. Large sample size
4. Minimum non-response rate
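Probability sampling relies on a randomized mechanism rather than human discretion. The sketch below shows two such mechanisms, simple random sampling and systematic sampling, on a made-up sampling frame. The function names and the `unit_000`-style labels are hypothetical and used only for illustration.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n distinct units; every subset of size n is equally likely."""
    rng = random.Random(seed)
    return rng.sample(frame, n)  # sampling without replacement

def systematic_sample(frame, n, seed=None):
    """Pick a random start in the first interval, then take every k-th unit."""
    if not 0 < n <= len(frame):
        raise ValueError("n must be between 1 and the size of the frame")
    rng = random.Random(seed)
    k = len(frame) // n       # sampling interval
    start = rng.randrange(k)  # random starting index in [0, k)
    return [frame[start + i * k] for i in range(n)]

# Hypothetical sampling frame of 100 units.
frame = [f"unit_{i:03d}" for i in range(100)]
srs = simple_random_sample(frame, 10, seed=1)
systematic = systematic_sample(frame, 10, seed=1)
print(srs)
print(systematic)
```

Both functions only reach units that are listed in `frame`, which is why an imperfect sampling frame leads to selection bias no matter how good the random mechanism is.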
Observational study: observes individuals and measures variables of interest
→ used when direct investigation is difficult or unethical
→ exposure & non-exposure groups used to denote Treatment (T) & Control (C) groups

Experimental study: manipulates the independent variable (IV) to cause an effect on the dependent variable (DV)
→ can establish a cause-and-effect relationship
→ subjects placed in 'Treatment' & 'Control' (comparison) groups
→ random assignment used: random draw without replacement. Chosen = T, not chosen = C
→ T & C can be of 2 different sizes as long as both groups are quite large
→ blinding prevents bias
→ placebo prevents subjects' beliefs from affecting results
→ IV – DV example: drink coffee – pass exam

Sampling Plan   Advantages                                 Disadvantages
SRS             Good representation of population          Time-consuming; accessibility of information
Systematic      Simpler selection process than SRS         Potentially under-represents the population
Stratified      Good representation of sample by stratum   Requires sampling frame and criteria for classifying the population into strata
Cluster         Less time-consuming and less costly        Requires larger sample size to achieve low margin of error
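Random assignment by a draw without replacement can be sketched as follows. The subject IDs and the function name `random_assignment` are hypothetical, and the group sizes are chosen only for illustration.

```python
import random

def random_assignment(subjects, n_treatment, seed=None):
    """Randomly draw n_treatment subjects without replacement.

    Chosen subjects form the Treatment group; the rest form the Control group.
    Assumes subject IDs are unique and hashable.
    """
    rng = random.Random(seed)
    chosen = set(rng.sample(subjects, n_treatment))
    treatment = [s for s in subjects if s in chosen]
    control = [s for s in subjects if s not in chosen]
    return treatment, control

subjects = list(range(1, 41))  # hypothetical subject IDs 1..40
treatment, control = random_assignment(subjects, 15, seed=7)
print(len(treatment), len(control))
```

The two groups need not be the same size (15 and 25 here); what matters is that chance alone decides who lands in which group.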
Confounders:
→ third variable associated with both the IV and the DV, which can distort the relationship observed in a study
→ Simpson's paradox implies presence of confounders
but the converse is NOT TRUE
→ Slicing is a method to control confounders
(randomisation not always possible in studies)
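Slicing means comparing rates within each level of the suspected confounder instead of only in the combined data. The sketch below uses hypothetical counts (not from any real study), with illness severity as an assumed confounder, to show a Simpson's paradox reversal.

```python
# Hypothetical counts: (successes, total) per treatment within each slice.
counts = {
    "mild":   {"X": (90, 100),   "Y": (800, 1000)},
    "severe": {"X": (300, 1000), "Y": (20, 100)},
}

def rate(successes, total):
    return successes / total

def overall_rate(treatment):
    """Success rate for a treatment after combining all slices."""
    successes = sum(counts[s][treatment][0] for s in counts)
    total = sum(counts[s][treatment][1] for s in counts)
    return rate(successes, total)

# Within each slice, X has the higher success rate...
for severity, by_treatment in counts.items():
    rx = rate(*by_treatment["X"])
    ry = rate(*by_treatment["Y"])
    print(f"{severity}: rate(success|X) = {rx:.2f}, rate(success|Y) = {ry:.2f}")

# ...but in the combined data, Y has the higher success rate.
print(f"overall: X = {overall_rate('X'):.3f}, Y = {overall_rate('Y'):.3f}")
```

The reversal arises because X is given mostly to severe cases and Y mostly to mild cases, so severity is tied to both the treatment and the outcome.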

Measures of central tendency

Mean
Symbol: x̄   Add/subtract k: x̄ ± k   Multiply by k: k × x̄
• Does not tell us about the distribution of the data points
• Does not tell us about the frequency of occurrence of the numerical variable's values
• Serves as a good metric when comparing groups of unequal sizes → take the weighted average (A, B = group sizes, Total = A + B):
  $\frac{A}{\text{Total}} \times \text{Mean}(A) + \frac{B}{\text{Total}} \times \text{Mean}(B)$

Median
Add/subtract & multiply: same as mean
• Does not tell us about the total value, frequency of occurrence or the distribution of data points of the numerical data
• Knowing the medians of subgroups does not tell us anything about the overall median, apart from the fact that it must lie between the medians of the subgroups
• Preferably used over the mean when the distribution of points is asymmetrical

Mode
• Most frequent value
• Can be taken on both numerical and categorical values
• Not very useful when values are unique

Measures of dispersion

Standard deviation
*always non-negative, with the same units as the numerical variable
Symbol: Population → σ𝑥   Sample → s𝑥
Add/subtract k: σ𝑥 (unchanged)   Multiply by k: |𝑘| × σ𝑥
When quantifying the degree of spread relative to the mean, use the coefficient of variation: $CV = \frac{s_x}{\bar{x}}$

IQR = Q3 – Q1
Outlier: any value above 𝑄3 + 1.5 × 𝐼𝑄𝑅 or below 𝑄1 − 1.5 × 𝐼𝑄𝑅
Add/subtract does not change the IQR. Multiplying all data points by c results in the IQR being multiplied by |𝑐|
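The transformation rules, the coefficient of variation, the weighted average and the 1.5 × IQR outlier rule can be checked numerically. This is a minimal sketch on a made-up dataset; the helper names are hypothetical. Note that `statistics.quantiles` uses one particular quartile convention, so Q1 and Q3 may differ slightly from values computed by hand.

```python
import statistics

data = [2, 4, 4, 5, 7, 9, 11]  # hypothetical numerical variable
k = 3

mean = statistics.mean(data)
sd = statistics.stdev(data)  # sample standard deviation

shifted = [x + k for x in data]  # add k to every value
scaled = [-k * x for x in data]  # multiply every value by -k

print(statistics.mean(shifted), statistics.stdev(shifted))  # mean + k, sd unchanged
print(statistics.mean(scaled), statistics.stdev(scaled))    # -k * mean, |k| * sd

# Coefficient of variation: spread relative to the mean.
cv = sd / mean

def weighted_mean(mean_a, n_a, mean_b, n_b):
    """Overall mean of two groups of unequal sizes."""
    total = n_a + n_b
    return (n_a / total) * mean_a + (n_b / total) * mean_b

def iqr_outliers(values):
    """Values beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lower or v > upper]

print(cv)
print(weighted_mean(10, 30, 20, 10))
print(iqr_outliers(data + [100]))
```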

Table for marginal rates/proportions
Outcome/Treatment   Success   Failure   Row Total
X                   542       158       700
Y                   289       61        350
Column Total        831       219       1050

Marginal Rate: Rate(Y) = 350/1050 = 33.33%; Rate(success) = 831/1050 = 79.1%
Conditional Rate: Rate(success|X) = 542/700 = 77.43% (written P(A|B) in probability notation)
Joint Rate (not a conditional rate!): Rate(Y and failure) = 61/1050 = 5.81%
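The three kinds of rate differ only in which count goes in the numerator and which total goes in the denominator. A minimal sketch using the counts from the table (function names are hypothetical):

```python
# Counts from the treatment/outcome table.
table = {
    "X": {"success": 542, "failure": 158},
    "Y": {"success": 289, "failure": 61},
}
grand_total = sum(sum(row.values()) for row in table.values())

def marginal_rate_treatment(treatment):
    """Row total divided by the grand total."""
    return sum(table[treatment].values()) / grand_total

def marginal_rate_outcome(outcome):
    """Column total divided by the grand total."""
    return sum(row[outcome] for row in table.values()) / grand_total

def conditional_rate(outcome, given_treatment):
    """Cell count divided by the row total of the given treatment."""
    return table[given_treatment][outcome] / sum(table[given_treatment].values())

def joint_rate(treatment, outcome):
    """Cell count divided by the grand total."""
    return table[treatment][outcome] / grand_total

print(f"Rate(Y)             = {marginal_rate_treatment('Y'):.2%}")
print(f"Rate(success)       = {marginal_rate_outcome('success'):.2%}")
print(f"Rate(success|X)     = {conditional_rate('success', 'X'):.2%}")
print(f"Rate(Y and failure) = {joint_rate('Y', 'failure'):.2%}")
```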

Association absent
• Rate(A|B) = Rate(A|NB)
• Rate of A is not affected by the presence or absence of B, hence A and B are not associated

Association present
1. Rate(A|B) > Rate(A|NB)
   o A is more likely when B is present than when B is absent, hence there is a positive association between A and B
2. Rate(A|B) < Rate(A|NB)
   o A is less likely when B is present than when B is absent, hence there is a negative association between A and B
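Deciding whether an association is present comes down to comparing two conditional rates. A minimal sketch, using the counts from the treatment/outcome table with A = success and B = treatment X (so NB = treatment Y); the function name and the tolerance are assumptions made for illustration:

```python
def association(rate_a_given_b, rate_a_given_nb, tol=1e-12):
    """Classify the association between A and B from two conditional rates."""
    diff = rate_a_given_b - rate_a_given_nb
    if abs(diff) <= tol:
        return "none"
    return "positive" if diff > 0 else "negative"

rate_success_given_x = 542 / 700  # rate(success | X)
rate_success_given_y = 289 / 350  # rate(success | not X), i.e. treatment Y

result = association(rate_success_given_x, rate_success_given_y)
print(result)
```

With these counts, success is less common under X than under Y, so treatment X and success are negatively associated.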
Rules of rate
Symmetry rule:
Rate(A|B) > rate(A|NB) ⇔ rate(B|A) > rate(B|NA)
Rate(A|B) < rate(A|NB) ⇔ rate(B|A) < rate(B|NA)
Rate(A|B) = rate(A|NB) ⇔ rate(B|A) = rate(B|NA)

Basic rule of rate: the overall rate(A) will always lie between rate(A|B) and rate(A|NB)

Consequences of the basic rule of rates:
1. The closer rate(B) is to 100%, the closer rate(A) is to rate(A|B)
2. If rate(B) = 50%, then $\text{rate}(A) = \frac{\text{rate}(A|B) + \text{rate}(A|NB)}{2}$
3. If rate(A|B) = rate(A|NB), then rate(A) = rate(A|B) = rate(A|NB)

Simpson's Paradox
A trend appears in more than half of the groups of data but disappears or reverses when the groups are combined.
→ Disappears: in the combined data the two variables are no longer associated, rate(A|B) = rate(A|NB)
→ Reverses: the majority of individual subgroup rates are opposite to the overall association
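The basic rule holds because the overall rate is a weighted average of the two conditional rates, with rate(B) as the weight. The sketch below checks the basic rule, its consequences and the symmetry rule using the counts from the treatment/outcome table (A = success, B = treatment X). The function name is hypothetical.

```python
def overall_rate(rate_b, rate_a_given_b, rate_a_given_nb):
    """rate(A) as a weighted average of rate(A|B) and rate(A|NB)."""
    return rate_b * rate_a_given_b + (1 - rate_b) * rate_a_given_nb

# Counts from the table: X = 542 success / 158 failure, Y = 289 / 61.
rate_x = 700 / 1050
rate_s_given_x = 542 / 700
rate_s_given_y = 289 / 350

rate_s = overall_rate(rate_x, rate_s_given_x, rate_s_given_y)
print(f"rate(success) = {rate_s:.4f}")  # matches 831/1050

# Basic rule: the overall rate lies between the two conditional rates.
low, high = sorted([rate_s_given_x, rate_s_given_y])
print(low <= rate_s <= high)

# Consequence 2: with rate(B) = 50%, rate(A) is the simple average.
print(overall_rate(0.5, rate_s_given_x, rate_s_given_y))

# Symmetry rule: the inequality points the same way after swapping A and B.
rate_x_given_s = 542 / 831  # rate(X | success)
rate_x_given_f = 158 / 219  # rate(X | failure)
same_direction = (rate_s_given_x < rate_s_given_y) == (rate_x_given_s < rate_x_given_f)
print(same_direction)
```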
Histograms/Boxplot (univariate EDA)

Describing overall pattern of distribution
1. Shape: peaks and skewness
2. Centre: mean, median, mode
   Left skewed: mean < median < mode
   Right skewed: mean > median > mode
3. Spread of distribution: range & standard deviation
   Higher variability → wider range/spread (e.g. s = 1.69 low variability, s = 4.30 high variability)
4. Outliers
   → Useful in identifying strong skew
   → Identify possible data collection or data entry errors
   → Provide interesting insight into the data
   → shouldn't be removed unnecessarily

Boxplot
Shape: compare variability from max to median against median to min. Skewed right if the lower half has less variability than the upper half, & vice versa
Center: can deduce median value
Spread: IQR gives an idea of the spread of the middle 50% of the dataset

[Figure: boxplots of P1, P2 and P3]
1. All are right skewed, variability: P1 > P2 > P3
2. IQR lowest in P1, followed by P2 then P3
3. More outliers in P1 and P2 compared to P3

Histogram vs Boxplot
Histogram: provides a better sense of distribution shape when there are great differences among the frequencies of datapoints
Boxplot: more useful for comparing distributions of different datasets, and can identify outliers clearly. Doesn't give any information on how many datapoints it has, compared to a histogram – 4 different datasets can have the same boxplot

Bivariate EDA
1. Direction
2. Form
3. Strength

Correlation coefficient (r)
→ Measure of linear association
→ range is between -1 and 1
→ summarizes direction & strength of linear association
→ r-value is not affected by interchanging the 2 variables, adding a number to all values of a variable, or multiplying all values of a variable by a positive number
→ r-value only measures linear association
ASSOCIATION ≠ CAUSATION

Outliers
[Figure: scatter plot with an outlier] Presence of outlier decreases the strength of correlation
[Figure: scatter plot with an outlier] r = -0.75 (before), r = 0.01 (after): presence of outlier increases strength of correlation compared to when it is removed

Linear Regression
Slope of the regression line (m) & correlation coefficient (r) are related by $m = r \cdot \frac{s_y}{s_x}$
If r is +ve, gradient of the regression line is also +ve, & vice versa
r is not necessarily equal to the gradient of the regression line
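The properties of r and its link to the regression slope can be verified on a small made-up dataset. In this sketch the data and helper names are hypothetical; r is computed from its definition using sample standard deviations, and the slope from m = r × s_y / s_x is compared with the ordinary least-squares slope.

```python
import statistics

def correlation(xs, ys):
    """Sample correlation coefficient r."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    n = len(xs)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / ((n - 1) * sx * sy)

def slope_from_r(xs, ys):
    """Regression slope via m = r * s_y / s_x."""
    return correlation(xs, ys) * statistics.stdev(ys) / statistics.stdev(xs)

def least_squares_slope(xs, ys):
    """Regression slope of y on x from the least-squares formula."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

# Hypothetical data with a strong positive linear trend.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]

r = correlation(xs, ys)
print(r)
print(correlation(ys, xs))                                # interchanging variables
print(correlation([x + 5 for x in xs], ys))               # adding a constant
print(correlation([2 * x for x in xs], ys))               # multiplying by a positive constant
print(correlation([-2 * x for x in xs], ys))              # negative constant flips the sign
print(slope_from_r(xs, ys), least_squares_slope(xs, ys))  # the two slopes agree
```

Here r is close to 1 while the slope is close to 2, which illustrates that r and the gradient share a sign but are generally not equal.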
Conditional probability & Independence
A and B are independent events whenever A and B are not associated with each other

Conditional Probabilities as Rates
Random variables

Confidence Intervals
CI = sample estimate (e.g. sample proportion) ± margin of error, where the margin of error accounts for random error
Only done when we have sample data!
Larger sample size → smaller random error → narrower CI

Hypothesis Testing
Null hypothesis (H0): no effect/difference
Alternative hypothesis (H1): what we claim
H0 & H1 must be mutually exclusive.
Reject H0 if p-value < significance level (e.g. < 0.05)
Do not reject H0 if p-value > significance level. Test is inconclusive; it doesn't mean that H0 is true.
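The effect of sample size on the width of a confidence interval can be sketched as follows. The notes do not give a formula, so this assumes the common normal-approximation interval for a proportion (sample proportion ± z × standard error); the function names and counts are hypothetical. The decision rule for a hypothesis test is included as a one-liner.

```python
import math
from statistics import NormalDist

def proportion_ci(successes, n, confidence=0.95):
    """Normal-approximation CI for a proportion (an assumed, common choice)."""
    p_hat = successes / n
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

def decide(p_value, significance=0.05):
    """Reject H0 only when the p-value is below the significance level."""
    return "reject H0" if p_value < significance else "do not reject H0"

# Same sample proportion (40%), two hypothetical sample sizes.
small = proportion_ci(40, 100)
large = proportion_ci(400, 1000)
print(small, small[1] - small[0])
print(large, large[1] - large[0])  # narrower interval with the larger sample

print(decide(0.03))
print(decide(0.20))  # inconclusive; does not show that H0 is true
```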
