
Data Definition: Data: a collection of information; a set of measurements taken on a set of individual units.

Data Storage & Presentation: Data is usually stored and presented in a dataset comprising variables measured on cases.

Variable Definition: Variable: characteristic of an observed statistical unit; can take different values.
- Values: categorical vs. numeric

Categorical vs. numeric:
- Categorical
  o Nominal: e.g. gender
  o Ordinal: e.g. educational level
- Numeric
  o Discrete: e.g. number of people in a household
  o Continuous: e.g. income
- Can depend on measurement, e.g. a categorical variable can sometimes be coded as numeric.

Shape:
- Skewness: measures lack of symmetry (positive skew / symmetrical distribution / negative skew)
- Kurtosis: heavy-tailed/light-tailed relative to the normal distribution

Variability (statistic vs. parameter):
- Variance: statistic $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$; parameter $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$
- Standard deviation: $s = \sqrt{s^2}$, $\sigma = \sqrt{\sigma^2}$
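A minimal R sketch of these measures on a made-up sample; note that var() and sd() use the sample ($n-1$) denominator, matching $s^2$ above:

    x <- c(2, 4, 4, 7, 9, 12)             # made-up sample
    var(x)                                # sample variance s^2, 1/(n-1) denominator
    sd(x)                                 # sample standard deviation s = sqrt(s^2)
    sum((x - mean(x))^2) / length(x)      # population version sigma^2, 1/N denominator
    # Skewness/kurtosis are not in base R; e.g. e1071::skewness(x) if e1071 is installed.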
Population & Sample Population: Total observations possible, Five Number Median: p=50, LQ: p=25, UP p=75,
Definition Sample: A set of observations from sample summary: Min, Interquartile range: 𝐼𝑄𝑅 = 𝑈𝑄 – 𝐿𝑄,
Statistical Inference Using the sample data to make inferences (learn) about the population 25%,50%,75%,max 𝑅𝑎𝑛𝑔𝑒 = 𝑀𝑎𝑥 − 𝑀𝑖𝑛
Parameter vs - parameter: summary of population Z-Score 𝑥 −𝜇
𝑧𝑖 = 𝑖 :
𝜎
Statistic - statistic: summary of sample data
deviation of observation from average divided by 𝜎
Descriptive vs. - descriptive:
Statistical - Covariance: measures the relationship between to variables & their tendency to
Inferential Statistics o describing one variable
Dependence, for vary together
o describing the relation ship between two variables
more than 1 Variable 𝐶𝑂𝑉(𝑋, 𝑌) = 𝐸[(𝑥 − 𝜇𝑥 )(𝑌 − 𝜇𝑦 )]
- inferential statistics: sampling distribution
- Correlation: association not driven by arbitrary changes
Sampling Bias - If the sample that is selected does not accurately represent the population 𝐶𝑂𝑉(𝑋,𝑌)
because it was selected in a way such that it is biased 𝜌(𝑋, 𝑌) = 𝜎𝑋 𝜎𝑌
- RANDOM sampling; Properties |𝜌(𝑋, 𝑌) ≤ 1| |𝜌(𝑋, 𝑌) ≤ 1| 𝑋⊥𝑌→ 𝜌=0
How to avoid o however even with random sample (without sampling bias) data can be Linear relation Not vice versa
biased 𝜌 provides a measure of the extent to which X and Y are linearly related
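A short R sketch with made-up vectors x and y illustrating the five-number summary, IQR, z-scores, covariance, and correlation via base functions:

    x <- c(1, 3, 5, 7, 9, 11)             # made-up data
    y <- c(2, 5, 4, 8, 10, 13)
    summary(x)                            # min, LQ, median, UQ, max (plus mean)
    IQR(x)                                # UQ - LQ
    diff(range(x))                        # range = max - min
    (x - mean(x)) / sd(x)                 # z-scores (using sample mean and sd)
    cov(x, y)                             # sample covariance
    cor(x, y)                             # correlation rho, always within [-1, 1]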
Explanatory and response variables:
- Explanatory variable (X): used to understand or predict the values of the response variable
- Response variable (Y): values predicted by the explanatory variable
- Remember by f(x) = y

Association vs. Causation:
- Association: the values of two variables are related
- Causal association: changing the value of the explanatory variable influences the value of the response variable
- An association may not be causal.

Confounding variable:
- A third variable that is associated with both the explanatory and the response variable
- Is an explanatory variable for both X and Y
- A causal association cannot be determined if there is a confounding variable.

Probability Function: can be used to present the distribution of a random variable.
- Probability function → discrete random variable
- Probability density function → continuous random variable

Probability function (pf): Discrete random variable: X can take a finite number k of different values $x_1, \dots, x_k$:
$f(x) = \Pr(X = x)$

Probability density function (pdf): Continuous random variable: X can assume every value in an interval:
$\Pr(a < X \le b) = \int_a^b f(x)\,dx$
(1) $f(x) \ge 0$ for all x; (2) $\int_{-\infty}^{\infty} f(x)\,dx = 1$
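A small base-R sketch (distributions chosen arbitrarily) contrasting a pf (binomial) with a pdf (normal):

    dbinom(3, size = 10, prob = 0.3)           # pf of a discrete X ~ Bin(10, 0.3): Pr(X = 3)
    sum(dbinom(0:10, size = 10, prob = 0.3))   # pf sums to 1 over all k values
    integrate(dnorm, lower = -1, upper = 1)    # pdf: Pr(-1 < X <= 1) for X ~ N(0, 1)
    pnorm(1) - pnorm(-1)                       # same area via the CDF, ~0.6827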
Causation:
- X causes Y
- Common cause: X and Y are both caused by a third variable Z
- Common effect: X and Y both predict a third variable Z

Observational study vs. experiment:
- Observational study: solely observation of values as they exist
- Experiment: controlling one or more explanatory variables

Relative Frequency: $relative\ freq. = \frac{n_i}{N}$

Expectation of pf & pdf:
- Discrete X: $E(X) = \sum_{all\ x} x \cdot f(x)$
- Continuous X: $E(X) = \int_{-\infty}^{\infty} x f(x)\,dx$
- E(X): expected value / mean of X

Variance of X: $Var(X) = E[(X - \mu)^2]$
- Measure of the spread of the distribution around its mean μ; large variance: wide spread around μ.
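A minimal sketch, assuming a made-up three-valued discrete X, computing E(X) and Var(X) directly from the definitions above:

    x <- c(0, 1, 2)                       # values of a made-up discrete X
    f <- c(0.25, 0.50, 0.25)              # f(x) = Pr(X = x), sums to 1
    EX <- sum(x * f)                      # E(X) = sum over all x of x * f(x) = 1
    sum((x - EX)^2 * f)                   # Var(X) = E[(X - mu)^2] = 0.5 here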
Summary statistics: Summarizing observations to communicate as much information as possible, as simply as possible.

Central Tendency:
- Mean: most important descriptive statistic for the center
  Statistic: $\bar{x} = \frac{y_1 + y_2 + \dots + y_n}{n} = \frac{\sum_{i=1}^{n} y_i}{n}$; parameter estimate: $\hat{\mu} = \frac{\sum_{i=1}^{n} y_i}{n}$
- Median: middle measurement of the ordered sample
- Mode: most common observation

Point Estimate:
- Point estimate: a single value used as an estimate of a population parameter
- Interval estimate: an interval defined by two numbers, between which a parameter is expected to lie

Normal Distribution: $f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
- 95% rule: 95% of values fall within 2 standard deviations of the mean. If X is normally distributed:
  o $\Pr(\mu - \sigma \le X \le \mu + \sigma) \approx 68.27\%$
  o $\Pr(\mu - 2\sigma \le X \le \mu + 2\sigma) \approx 95.45\%$
  o $\Pr(\mu - 3\sigma \le X \le \mu + 3\sigma) \approx 99.73\%$
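The 68-95-99.7 rule can be checked with pnorm(); μ and σ below are arbitrary:

    mu <- 10; sigma <- 2                  # made-up parameters
    pnorm(mu + 1*sigma, mu, sigma) - pnorm(mu - 1*sigma, mu, sigma)   # ~ 68.27%
    pnorm(mu + 2*sigma, mu, sigma) - pnorm(mu - 2*sigma, mu, sigma)   # ~ 95.45%
    pnorm(mu + 3*sigma, mu, sigma) - pnorm(mu - 3*sigma, mu, sigma)   # ~ 99.73%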
Sampling Distribution:
- Sample statistic: a random variable; varies across samples
- Sampling distribution: probability distribution of a given random-sample-based statistic
- → Statistical inference: based on the sampling distribution, inferences about the population can be made

Sampling Distribution: Characteristics & Variability of a Sample Statistic (SE):
- Center: the sampling distribution will be centered around the population parameter
- Shape: if the sample size is large enough, the sampling distribution will be symmetric and bell-shaped, like the normal/t-distribution
- Standard error of a statistic (SE): standard deviation of the sample statistic; measures how much the statistic varies from sample to sample

Sample size n & Standard Error SE: increasing n decreases the SE.

Type I and Type II Error:
- Type I error: rejecting the null hypothesis although it is true; the significance level α is the tolerated Type I error rate
- Type II error: not rejecting the null hypothesis although it is false

P-Value (reject $H_0$ if p-value < α):
- Test statistic: a random variable calculated from sample data and used in a hypothesis test; measures the degree of difference between the sample data and $H_0$
- The p-value is the probability that the test statistic equals the observed value or a value even more extreme in the direction predicted by $H_a$

Standardizing Random Variables:
- X (raw score): random variable with mean μ and standard deviation σ > 0
- Standardizing X: $Z = \frac{X - \mu}{\sigma}$
- A standardized normal variable is standard normally distributed: N(0, 1)

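A simulation sketch (population and sample sizes made up): the SE of the sample mean shrinks as n grows, and a standardized variable has mean 0 and SD 1:

    set.seed(1)
    pop <- rexp(1e5, rate = 1)            # made-up skewed population
    se_of_mean <- function(n, reps = 2000) {
      sd(replicate(reps, mean(sample(pop, n))))   # SD of the statistic = empirical SE
    }
    se_of_mean(10); se_of_mean(100)       # increasing n -> smaller SE
    z <- (pop - mean(pop)) / sd(pop)      # standardizing: Z = (X - mu)/sigma
    c(mean(z), sd(z))                     # ~ 0 and 1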
Central Limit Theorem: Let $X_1, X_2, \dots, X_n$ be a random sample of size n drawn from a population distribution with mean μ and variance σ². Suppose we are interested in $\bar{X}$, the sample mean. Then, as n approaches +∞:
$\frac{\sqrt{n}(\bar{X} - \mu)}{\sigma} \xrightarrow{d} N(0, 1)$, or equivalently, $\bar{X} \xrightarrow{d} N\!\left(\mu, \frac{\sigma^2}{n}\right)$

Implications of the Central Limit Theorem:
- $\mu \approx \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
- $\sigma^2 \approx s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
- If n approaches +∞, the SE of the sample mean will be $\frac{s}{\sqrt{n}}$

Probability of tail area:
- Every normal distribution $N(\mu, \sigma^2)$ can be standardized; therefore we only need one Z-table, which shows probabilities under the standard normal distribution, e.g. $N(4, 4) \rightarrow Z = \frac{X - 4}{\sqrt{4}}$
- Therefore, to find the probability of a tail area, e.g. $\Pr(X \le a)$: compute $\Pr\!\left(Z \le \frac{a - \mu}{\sigma}\right)$ → check the Z-table to find the value
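A base-R sketch of the N(4, 4) example above, plus a quick CLT illustration (sample size and distribution chosen arbitrarily):

    a <- 5; mu <- 4; sigma <- sqrt(4)     # the N(4, 4) example
    pnorm((a - mu) / sigma)               # Pr(Z <= (a - mu)/sigma), the Z-table value
    pnorm(a, mean = mu, sd = sigma)       # same probability without standardizing
    set.seed(1)
    hist(replicate(2000, mean(rexp(50)))) # CLT: sample means look bell-shaped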
Characteristics of a Good Estimator:
- Unbiasedness: expected value identical to the population parameter
- Efficiency: low variance
- Consistency: as the sample size n increases, the estimator approaches the parameter

Approximation using the normal / t-distribution:
- n ≤ 30: t-distribution, df = n − 1
- n > 30: normal distribution
- Degrees of freedom (df): number of values in the final calculation of a statistic that are free to vary; most of the time n − 1

Parameter vs. Statistic:
- Mean: parameter μ, statistic $\bar{x}$
- Proportion: parameter p, statistic $\hat{p}$
- Standard deviation: parameter σ, statistic s
- Correlation: parameter ρ, statistic r

Types of Distributions:
- Population distribution: probability distribution of the population
- Sample distribution: probability distribution of a sample from the population
- Sampling distribution: probability distribution of a sample statistic

Interval Estimate:
- Gives a range of plausible values for a population parameter
- Common interval estimate: statistic ± margin of error
- Margin of error: precision of the sample statistic as a point estimate for the parameter; determined using the variability of the sampling distribution

Steps of a Significance Test (one-sided t-test):
1. Assumptions: (1) random sample, (2) quantitative variable, (3) population distribution is normal
2. Stating $H_0$ and $H_a$:
   o Two-sided test: $H_0: \mu = \mu_0$ vs. $H_a: \mu \ne \mu_0$
   o One-sided test: $H_0: \mu = \mu_0$ vs. $H_a: \mu < \mu_0$ or $\mu > \mu_0$
3. Calculate the test statistic: $t = \frac{\bar{x} - \mu_0}{se}$, $se = \frac{s}{\sqrt{n}}$
4. Find the p-value based on the test statistic: two-sided: p-value = 2 × tail probability; one-sided: p-value = tail probability
5. Compare the p-value with the pre-chosen α & 6. Make conclusions: if p ≤ α, we reject the null hypothesis

Comparing Means between two groups:
- Comparison of means between two groups
- In the two independent samples test, the steps are the same, but there are differences in step 1 (assumptions) and step 3 (test statistic):
  1. Assumptions: additional assumption: the population variances are equal, $\sigma_1^2 = \sigma_2^2 = \sigma^2$
  2. Test statistic: $t = \frac{\bar{y}_1 - \bar{y}_2}{se}$, $se = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$, $df = (n_1 - 1) + (n_2 - 1)$
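A sketch of these steps in R on made-up data; t.test() is the built-in equivalent of the manual calculation:

    set.seed(1)
    x <- rnorm(25, mean = 10.5, sd = 2)   # made-up sample, n = 25 (<= 30 -> t-distr.)
    se <- sd(x) / sqrt(length(x))
    t_stat <- (mean(x) - 10) / se         # step 3: test statistic for H0: mu = 10
    pt(t_stat, df = length(x) - 1, lower.tail = FALSE)   # step 4: one-sided p-value
    t.test(x, mu = 10, alternative = "greater")          # same test in one call
    y <- rnorm(25, mean = 11, sd = 2)
    t.test(x, y, var.equal = TRUE)        # two-group comparison, df = (n1-1)+(n2-1)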


95%-CI: A 95% confidence interval will contain the true parameter for 95% of samples.

Constructing a CI (see the R sketch after the R-Programming list):
- Sample statistic ($\hat{\theta}$), e.g. mean/proportion
- SE of the sample mean: $\hat{\sigma} = \frac{s}{\sqrt{n}}$; SE of the sample proportion: $\hat{\sigma} = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$
- CI: $[\hat{\theta} - z_{\alpha/2}\,\hat{\sigma},\ \hat{\theta} + z_{\alpha/2}\,\hat{\sigma}]$

Hypothesis Testing: Goal of statistical inference: draw conclusions about the population by (dis)proving hypotheses.

Statistical Test Hypotheses:
- Null hypothesis $H_0$: claim that there is no effect/difference; we want to disprove it
- Alternative hypothesis $H_a$: claim that we want to find evidence for

Statistical Test, One- vs. two-sided:
- Two-sided test: $H_0: \mu = \mu_0$ vs. $H_a: \mu \ne \mu_0$
- One-sided test: $H_0: \mu = \mu_0$ vs. $H_a: \mu < \mu_0$ or $\mu > \mu_0$

Statistical Significance:
- Results are so extreme that they are unlikely to occur by random chance alone (assuming $H_0$ is true); we then call them statistically significant
- If the sample statistic is statistically significant, we have convincing evidence against $H_0$ and thus in favor of $H_a$

R-Programming:
- Defining something: <-
- Package installing and loading: install.packages(), e.g. for haven; library(), to load packages
- Working directory: setwd(), to set the working directory to a specified location
- Data import: read_dta(), read Stata files; View(), view the contents of a data frame; class(), check the data type (class); summary(), summary of variables
- Describing variables (coding and distribution): factor(), creates a new factor variable with custom labels
- Creating tables: table(), creates a frequency table for a variable; CrossTable(), creates a cross-tabulation table for two variables
- Data visualisation: ggplot(), create visualizations; geom_bar(), create bar charts for categorical variables; geom_point(), create scatter plots for numeric variables; hist(), create histograms; labs(), add labels to the axes
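Putting the pieces together, a minimal sketch (made-up data) of the CI construction above:

    x  <- rnorm(100, mean = 5, sd = 2)    # made-up sample
    se <- sd(x) / sqrt(length(x))         # SE of the sample mean
    z  <- qnorm(0.975)                    # z_{alpha/2} for a 95% CI
    mean(x) + c(-1, 1) * z * se           # statistic +/- margin of error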
