GEA1000 Finals Cheatsheet

Population of interest: group that researchers are interested in drawing conclusions on
Population parameter: numerical fact about the population
Sample: a portion (subset) of the population selected
• used when a census is not readily available
• preferred over a census as a sample is less costly administratively, and collection + data processing is faster for a sample
Census: an attempt to reach out to the entire population of interest
Estimate: inference about the population’s parameter, based on information obtained from a sample
Sampling frame: list from which the sample was obtained
→ may not cover the population of interest, or may contain units that are not in the population of interest

Conditions for generalizability:
1. Sampling frame (SF) ≥ population of interest (the frame covers the whole population)
2. Probability-based sampling, to minimise selection bias
3. Large sample size
4. Minimum non-response rate

Selection Bias: an imperfect sampling frame excludes units from being selected. Can also be caused by non-probability sampling.
Non-response Bias: participants’ non-disclosure or exclusion of information. May occur regardless of whether the sampling method is probabilistic or non-probabilistic in nature.

Probability Sampling: randomized mechanism
Non-probability Sampling: human discretion

Convenience Sampling:
→ picks those most easily available to participate in research (e.g. mall surveys)
→ selection bias: certain members of the population are left out
→ non-response bias: people opt out

Volunteer Sampling:
→ actively seeks volunteers to participate (e.g. online polls)
→ non-response bias: people unwilling to participate
→ selection bias: picks those more likely to respond in a desirable way
Observational study: observes individuals and measures variables of interest
→ used when direct investigation is difficult & unethical
→ exposure & non-exposure grps used to denote T & C grps

Experimental study: manipulate IV to cause an effect on DV
→ Cause-and-effect relationship
→ Subjects placed in ‘Treatment’ & ‘Control’ (comparison) group
→ random assignment used
→ random draw w/o replacement. Chosen = T, not chosen = C
→ T & C can take 2 diff sizes as long as size of grps are quite large
IV – DV example: Drink coffee – pass exam

→ blinding prevents bias
→ placebo prevents subject’s beliefs from affecting results

Sampling Plan    Advantages                                    Disadvantages
SRS              Good representation of population             Time-consuming; accessibility of information
Systematic       Simpler selection process as opposed to SRS   Potentially under-representing the population
Stratified       Good representation of sample by stratum      Require sampling frame and criteria for classification of population into stratum
Cluster          Less time-consuming and less costly           Require larger sample size to achieve low margin of error
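
The four sampling plans can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sampling frame of 100 numbered units, the strata names (`year1`, `year2`) and the cluster names (`hall0` to `hall9`) are all hypothetical, not from the course material.

```python
import random

def simple_random_sample(frame, n, rng):
    """SRS: draw n units without replacement; every unit is equally likely."""
    return rng.sample(frame, n)

def systematic_sample(frame, k, rng):
    """Systematic: random start among the first k units, then every k-th unit."""
    start = rng.randrange(k)
    return frame[start::k]

def stratified_sample(strata, n_per_stratum, rng):
    """Stratified: run an SRS inside each stratum, then pool the results."""
    chosen = []
    for units in strata.values():
        chosen.extend(rng.sample(units, n_per_stratum))
    return chosen

def cluster_sample(clusters, n_clusters, rng):
    """Cluster: choose whole clusters by SRS and keep every unit inside them."""
    picked = rng.sample(sorted(clusters), n_clusters)
    return [unit for name in picked for unit in clusters[name]]

rng = random.Random(0)                      # fixed seed so the run is repeatable
frame = list(range(1, 101))                 # hypothetical sampling frame: units 1..100
strata = {"year1": frame[:60], "year2": frame[60:]}                      # hypothetical strata
clusters = {f"hall{i}": frame[i * 10:(i + 1) * 10] for i in range(10)}   # hypothetical clusters

srs = simple_random_sample(frame, 10, rng)
systematic = systematic_sample(frame, 10, rng)
stratified = stratified_sample(strata, 5, rng)
cluster = cluster_sample(clusters, 2, rng)
print(len(srs), len(systematic), len(stratified), len(cluster))
```

Note that the stratified and cluster versions both need the frame to be partitioned first, which matches the "require sampling frame and criteria for classification" disadvantage in the table.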
Confounders:
→ third variable associated with both the IV and the DV, which can distort the observed relationship in a study
→ Simpson’s paradox implies the presence of confounders, but the converse is NOT TRUE
→ Slicing is a method to control for confounders (randomisation is not always possible in studies)
Measures of central tendency

Mean
Symbol: x̄    Add/subtract: x̄ ± k    Multiply: k × x̄
• Does not tell us about the distribution over total data points
• Does not tell us about frequency of occurrence of the numerical variable values
• Serves as good metric when comparing groups of unequal sizes → take the weighted average
  (A / Total) × Mean(A) + (B / Total) × Mean(B)

Median
Add/subtract & multiply: same as mean
• Does not tell us about the total value, frequency of occurrence or the distribution of data points of the numerical data
• Knowing median of subgroups does not tell us anything about the overall median apart from the fact that it must be between the medians of the subgroups
• Preferably used over mean when distribution of points is asymmetrical

Mode
• Most frequent
• Can be taken on both numerical and categorical values
• Not very useful when values are unique

Measures of dispersion

Standard deviation
*always non-negative, with the same units as the numerical variable
Symbol: Population → σ_x, Sample → s_x
Add/subtract: σ_x (unchanged)    Multiply: |k| × σ_x
When quantifying the degree of spread relative to mean, we should utilize the coefficient of variation = s_x / x̄

IQR = Q3 – Q1
Outlier: above Q3 + 1.5 × IQR or below Q1 − 1.5 × IQR
Add/subtract does not change the IQR. Multiplying all data points by c results in the IQR being multiplied by |c|
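
The transformation rules and the outlier rule above can be checked numerically. The sketch below uses a small made-up dataset and Python's `statistics` module; the `inclusive` quartile method is an assumption (quartile conventions differ between textbooks and software, so Q1 and Q3 may not match a hand calculation exactly).

```python
import math
import statistics

def quartiles(values):
    """Return (Q1, Q3) using the 'inclusive' method, one of several conventions."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q1, q3

def iqr(values):
    """IQR = Q3 - Q1."""
    q1, q3 = quartiles(values)
    return q3 - q1

def outlier_fences(values):
    """Values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR count as outliers."""
    q1, q3 = quartiles(values)
    spread = q3 - q1
    return q1 - 1.5 * spread, q3 + 1.5 * spread

def weighted_mean(mean_a, size_a, mean_b, size_b):
    """Overall mean of two groups: (A/Total)*Mean(A) + (B/Total)*Mean(B)."""
    total = size_a + size_b
    return size_a / total * mean_a + size_b / total * mean_b

data = [2, 4, 4, 5, 7, 9, 30]               # hypothetical data points
k, c = -3, 10                               # multiply by k, then add c
transformed = [k * x + c for x in data]

low, high = outlier_fences(data)
outliers = [x for x in data if x < low or x > high]
coefficient_of_variation = statistics.stdev(data) / statistics.mean(data)

group_a, group_b = data[:3], data[3:]       # two groups of unequal size
combined_mean = weighted_mean(statistics.mean(group_a), len(group_a),
                              statistics.mean(group_b), len(group_b))

print(statistics.mean(transformed), statistics.stdev(transformed), iqr(transformed))
print(outliers, coefficient_of_variation, combined_mean)
```

Adding `c` shifts the mean but leaves the standard deviation and IQR alone, while multiplying by `k` scales the mean by `k` and both spread measures by `|k|`.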
Table for marginal rates/proportions

Treatment / Outcome   Success   Failure   Row Total
X                     542       158       700
Y                     289       61        350
Column Total          831       219       1050

Marginal Rate: Rate(Y) = 350/1050 = 33.33%; Rate(success) = 831/1050 = 79.1%
Conditional Rate, P(A|B): Rate(success|X) = 542/700 = 77.43%
Joint Rate (not a conditional rate!): Rate(Y and failure) = 61/1050 = 5.81%
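
The three kinds of rate differ only in what goes in the numerator and denominator. A minimal Python sketch using the counts from the table above (the function names are illustrative, not course terminology):

```python
# Counts from the table: (treatment, outcome) -> number of units
counts = {
    ("X", "success"): 542, ("X", "failure"): 158,
    ("Y", "success"): 289, ("Y", "failure"): 61,
}
total = sum(counts.values())  # grand total of all four cells

def marginal_rate_treatment(treatment):
    """Row total divided by grand total, e.g. Rate(Y)."""
    return sum(n for (t, _), n in counts.items() if t == treatment) / total

def marginal_rate_outcome(outcome):
    """Column total divided by grand total, e.g. Rate(success)."""
    return sum(n for (_, o), n in counts.items() if o == outcome) / total

def conditional_rate(outcome, treatment):
    """Cell divided by its row total, e.g. Rate(success | X)."""
    row_total = sum(n for (t, _), n in counts.items() if t == treatment)
    return counts[(treatment, outcome)] / row_total

def joint_rate(treatment, outcome):
    """Cell divided by grand total, e.g. Rate(Y and failure)."""
    return counts[(treatment, outcome)] / total

print(f"Rate(Y)             = {marginal_rate_treatment('Y'):.2%}")
print(f"Rate(success)       = {marginal_rate_outcome('success'):.2%}")
print(f"Rate(success | X)   = {conditional_rate('success', 'X'):.2%}")
print(f"Rate(Y and failure) = {joint_rate('Y', 'failure'):.2%}")
```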
Association absent
• Rate(A|B) = Rate(A|NB)
• Rate of A is not affected by the presence or absence of B, hence A and B are not associated

Association present
1. Rate(A|B) > Rate(A|NB)
   o Presence of A when B is present is stronger than when B is absent, hence there is a positive association between A and B
2. Rate(A|B) < Rate(A|NB)
   o Presence of A when B is present is weaker than when B is absent, hence there is a negative association between A and B
Rules of rate
Symmetry rule:
Rate(A|B) > rate(A|NB) ⇔ rate(B|A) > rate(B|NA)
Rate(A|B) < rate(A|NB) ⇔ rate(B|A) < rate(B|NA)
Rate(A|B) = rate(A|NB) ⇔ rate(B|A) = rate(B|NA)

Basic rule of rate: the overall rate(A) will always lie between rate(A|B) and rate(A|NB)

Consequences of the basic rule of rates:
1. The closer rate(B) is to 100%, the closer rate(A) is to rate(A|B)
2. If rate(B) = 50%, then rate(A) = [rate(A|B) + rate(A|NB)] / 2
3. If rate(A|B) = rate(A|NB), then rate(A) = rate(A|B) = rate(A|NB)

Simpson’s Paradox
A trend appears in more than half of the groups’ data but disappears or reverses when the groups are combined.
• Disappears: the two variables are no longer associated, rate(A|B) = rate(A|NB)
• Reverses: the majority of individual subgroup rates are opposite from the overall association
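
The basic rule follows from writing rate(A) as a weighted average of the two conditional rates, with rate(B) as the weight. The sketch below checks this in Python using the counts from the treatment table (A = success, B = treatment X, NB = treatment Y); the helper name `overall_rate` and the 20% / 60% rates used for the 50% case are made up for illustration.

```python
import math

def overall_rate(rate_b, rate_a_given_b, rate_a_given_nb):
    """rate(A) = rate(B) * rate(A|B) + (1 - rate(B)) * rate(A|NB)."""
    return rate_b * rate_a_given_b + (1 - rate_b) * rate_a_given_nb

# Counts from the treatment table: A = success, B = treatment X, NB = treatment Y
rate_x = 700 / 1050
rate_success_given_x = 542 / 700
rate_success_given_y = 289 / 350

rate_success = overall_rate(rate_x, rate_success_given_x, rate_success_given_y)

# Basic rule: the overall rate lies between the two conditional rates
lies_between = (min(rate_success_given_x, rate_success_given_y)
                <= rate_success
                <= max(rate_success_given_x, rate_success_given_y))

# Consequence 2: with rate(B) = 50% the overall rate is the simple average
halfway = overall_rate(0.5, 0.20, 0.60)     # hypothetical conditional rates

print(rate_success, lies_between, halfway)
```

Because the weights `rate(B)` and `1 - rate(B)` are non-negative and sum to 1, the result can never fall outside the two conditional rates, and it moves towards rate(A|B) as rate(B) approaches 100%.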
Histograms/Boxplot (univariate EDA)
Describing overall pattern of distribution:
1. Shape: peaks and skewness
   Left skewed: Mean < median < mode
   Right skewed: Mean > median > mode
2. Centre: mean, median, mode
3. Spread of distribution: range & standard deviation
   Higher variability → wider range spread (figure examples: s = 1.69 low variability, s = 4.30 high variability)
4. Outliers
   → Useful in identifying strong skew
   → Identify possible data collection or data entry errors
   → Provide interesting insight into the data
   → shouldn’t be removed unnecessarily

Reading a boxplot
Shape: compare variability from max-median to median-min. Skewed right if lower half has less variability than upper half & vice versa
Center: can deduce median value
Spread: IQR gives an idea of the spread for middle 50% of dataset
Example (figure: boxplots of P1, P2, P3):
1. All are right skewed, variability: P1 > P2 > P3
2. IQR lowest in P1, followed by P2 then P3
3. More outliers in P1 and P2 compared to P3

Histogram vs Boxplot
Histogram: provides a better sense of distribution shape, when there are great differences among the frequencies of datapoints
Boxplot: more useful to compare distributions of different datasets, and can identify outliers clearly. Doesn’t give any information as to how many datapoints it has compared to a histogram; 4 diff datasets can have the same boxplots

Bivariate EDA
1. Direction
2. Form
3. Strength
Outliers: the presence of an outlier can decrease the strength of correlation, or increase it compared to when it is removed (figure example: r = -0.75 before, r = 0.01 after)

Correlation coefficient
→ Measure of linear association
→ range is between -1 and 1
→ summarizes direction & strength of linear association
r-value is not affected by interchanging the 2 variables, adding a number to all values of a variable, or multiplying all values of a variable by a positive number.
r-value only measures linear association
ASSOCIATION ≠ CAUSATION

Linear Regression
Slope of regression line & correlation coefficient are related by m = r × (s_y / s_x)
If CC r is +ve, gradient of RL is also +ve, and vice versa
CC is not necessarily equal to gradient of RL
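
The properties of r and its link to the regression slope, m = r × (s_y / s_x), can be verified on a small dataset. The sketch below computes r from its definition rather than from a library call; the five (x, y) points are hypothetical.

```python
import math
import statistics

def correlation(xs, ys):
    """Correlation coefficient r, computed from deviations about the means."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def least_squares_slope(xs, ys):
    """Slope of the least-squares regression line of y on x."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

xs = [1, 2, 3, 4, 5]                        # hypothetical data
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

r = correlation(xs, ys)
slope_from_r = r * statistics.stdev(ys) / statistics.stdev(xs)

r_swapped = correlation(ys, xs)                                     # interchange variables
r_shifted = correlation([x + 10 for x in xs], ys)                   # add a number
r_scaled = correlation([3 * x for x in xs], ys)                     # multiply by positive number
r_flipped = correlation([-2 * x for x in xs], ys)                   # negative multiplier flips sign

print(r, slope_from_r, least_squares_slope(xs, ys))
print(r_swapped, r_shifted, r_scaled, r_flipped)
```

The last case shows why the multiplier has to be positive for r to stay the same: a negative multiplier keeps the strength but reverses the direction.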
Conditional probability & Independence
Conditional Probabilities as Rates
A and B are independent events whenever A and B are not associated with each other
Random variables
Confidence Intervals
CI = sample proportion ± random error (margin of error); it gives a range of plausible values for the population proportion
Only done when we have sample data!
Larger sample size, smaller random error, narrower CI

Hypothesis Testing
Null hypothesis (H0): no effect/difference
Alternative hypothesis (H1): what we claim
Both H0 & H1 must be mutually exclusive.
Reject H0 if p-value < significance level (e.g. < 0.05)
Do not reject H0 if p-value > significance level. Test is inconclusive. Doesn’t mean that H0 is true.
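
The effect of sample size on the width of a confidence interval can be seen with a short Python sketch. It assumes the common normal-approximation interval for a proportion with z = 1.96 for 95% confidence; that formula is not stated in this cheatsheet, and the sample counts are made up. The `reject_null` helper simply encodes the decision rule for hypothesis testing.

```python
import math

def proportion_ci(successes, n, z=1.96):
    """Assumed formula: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    p_hat = successes / n
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

def reject_null(p_value, significance=0.05):
    """Reject H0 only when the p-value is below the significance level."""
    return p_value < significance

small_low, small_high = proportion_ci(40, 100)      # hypothetical sample, n = 100
large_low, large_high = proportion_ci(400, 1000)    # same proportion, n = 1000

small_width = small_high - small_low
large_width = large_high - large_low

print(f"n = 100:  ({small_low:.3f}, {small_high:.3f})")
print(f"n = 1000: ({large_low:.3f}, {large_high:.3f})")
print(reject_null(0.03), reject_null(0.20))
```

Both intervals are centred on the same sample proportion, but the larger sample gives the narrower interval. A `False` from `reject_null` means the test is inconclusive, not that H0 is true.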
