
Data Definition: Data: a collection of information; a set of measurements taken on a set of individual units.

Data Storage & Presentation: Data is usually stored and presented in a dataset comprising variables measured on cases.

Variable Definition: Variable: characteristic of an observed statistical unit; can take different values.
- Values: categorical vs. numeric

Categorical vs. numeric:
- Categorical
  o Nominal: e.g. gender
  o Ordinal: e.g. educational level
- Numeric
  o Discrete: e.g. number of people in a household
  o Continuous: e.g. income
- Can depend on measurement, e.g. a categorical variable can sometimes be coded as numeric.

Shape:
- Skewness: measures lack of symmetry (positive skew / symmetrical distribution / negative skew)
- Kurtosis: heavy-tailed/light-tailed relative to the normal distribution

Variability (statistic vs. parameter):
- Variance: statistic $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$; parameter $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$
- Standard deviation: $s = \sqrt{s^2}$, $\sigma = \sqrt{\sigma^2}$
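A minimal R sketch of these measures on a made-up sample; note that var() and sd() use the sample ($n-1$) denominator, matching $s^2$ above:

    x <- c(2, 4, 4, 7, 9, 12)             # made-up sample
    var(x)                                # sample variance s^2, 1/(n-1) denominator
    sd(x)                                 # sample standard deviation s = sqrt(s^2)
    sum((x - mean(x))^2) / length(x)      # population version sigma^2, 1/N denominator
    # Skewness/kurtosis are not in base R; e.g. e1071::skewness(x) if e1071 is installed.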
Population & Sample Population: Total observations possible, Five Number Median: p=50, LQ: p=25, UP p=75,
Definition Sample: A set of observations from sample summary: Min, Interquartile range: 𝐼𝑄𝑅 = 𝑈𝑄 – 𝐿𝑄,
Statistical Inference Using the sample data to make inferences (learn) about the population 25%,50%,75%,max 𝑅𝑎𝑛𝑔𝑒 = 𝑀𝑎𝑥 − 𝑀𝑖𝑛
Parameter vs - parameter: summary of population Z-Score 𝑥 −𝜇
𝑧𝑖 = 𝑖 :
𝜎
Statistic - statistic: summary of sample data
deviation of observation from average divided by 𝜎
Descriptive vs. - descriptive:
Statistical - Covariance: measures the relationship between to variables & their tendency to
Inferential Statistics o describing one variable
Dependence, for vary together
o describing the relation ship between two variables
more than 1 Variable 𝐶𝑂𝑉(𝑋, 𝑌) = 𝐸[(𝑥 − 𝜇𝑥 )(𝑌 − 𝜇𝑦 )]
- inferential statistics: sampling distribution
- Correlation: association not driven by arbitrary changes
Sampling Bias - If the sample that is selected does not accurately represent the population 𝐶𝑂𝑉(𝑋,𝑌)
because it was selected in a way such that it is biased 𝜌(𝑋, 𝑌) = 𝜎𝑋 𝜎𝑌
- RANDOM sampling; Properties |𝜌(𝑋, 𝑌) ≤ 1| |𝜌(𝑋, 𝑌) ≤ 1| 𝑋⊥𝑌→ 𝜌=0
How to avoid o however even with random sample (without sampling bias) data can be Linear relation Not vice versa
biased 𝜌 provides a measure of the extent to which X and Y are linearly related
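A short R sketch with made-up vectors x and y illustrating the five-number summary, IQR, z-scores, covariance, and correlation via base functions:

    x <- c(1, 3, 5, 7, 9, 11)             # made-up data
    y <- c(2, 5, 4, 8, 10, 13)
    summary(x)                            # min, LQ, median, UQ, max (plus mean)
    IQR(x)                                # UQ - LQ
    diff(range(x))                        # range = max - min
    (x - mean(x)) / sd(x)                 # z-scores (using sample mean and sd)
    cov(x, y)                             # sample covariance
    cor(x, y)                             # correlation rho, always within [-1, 1]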
Explanatory and response variables:
- Explanatory variable (X): used to understand or predict the values of the response variable
- Response variable (Y): values predicted by the explanatory variable
- Remember by f(x) = y

Association vs. Causation:
- Association: the values of two variables are related
- Causal association: changing the value of the explanatory variable influences the value of the response variable
- An association may not be causal.

Confounding variable:
- A third variable that is associated with both the explanatory and the response variable
- Is an explanatory variable for both X and Y
- A causal association cannot be determined if there is a confounding variable.

Probability Function: can be used to present the distribution of a random variable.
- Probability function → discrete random variable
- Probability density function → continuous random variable

Probability function (pf): Discrete random variable: X can take a finite number k of different values $x_1, \dots, x_k$:
$f(x) = \Pr(X = x)$

Probability density function (pdf): Continuous random variable: X can assume every value in an interval:
$\Pr(a < X \le b) = \int_a^b f(x)\,dx$
(1) $f(x) \ge 0$ for all x; (2) $\int_{-\infty}^{\infty} f(x)\,dx = 1$
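A small base-R sketch (distributions chosen arbitrarily) contrasting a pf (binomial) with a pdf (normal):

    dbinom(3, size = 10, prob = 0.3)           # pf of a discrete X ~ Bin(10, 0.3): Pr(X = 3)
    sum(dbinom(0:10, size = 10, prob = 0.3))   # pf sums to 1 over all k values
    integrate(dnorm, lower = -1, upper = 1)    # pdf: Pr(-1 < X <= 1) for X ~ N(0, 1)
    pnorm(1) - pnorm(-1)                       # same area via the CDF, ~0.6827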
Causation:
- X causes Y
- Common cause: X and Y are both caused by a third variable Z
- Common effect: X and Y both predict a third variable Z

Observational study vs. experiment:
- Observational study: solely observation of values as they exist
- Experiment: controlling one or more explanatory variables

Relative Frequency: $relative\ freq. = \frac{n_i}{N}$

Expectation of pf & pdf:
- Discrete X: $E(X) = \sum_{all\ x} x \cdot f(x)$
- Continuous X: $E(X) = \int_{-\infty}^{\infty} x f(x)\,dx$
- E(X): expected value / mean of X

Variance of X: $Var(X) = E[(X - \mu)^2]$
- Measure of the spread of the distribution around its mean μ; large variance: wide spread around μ.
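A minimal sketch, assuming a made-up three-valued discrete X, computing E(X) and Var(X) directly from the definitions above:

    x <- c(0, 1, 2)                       # values of a made-up discrete X
    f <- c(0.25, 0.50, 0.25)              # f(x) = Pr(X = x), sums to 1
    EX <- sum(x * f)                      # E(X) = sum over all x of x * f(x) = 1
    sum((x - EX)^2 * f)                   # Var(X) = E[(X - mu)^2] = 0.5 here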
Summary statistics: Summarizing observations to communicate as much information as possible, as simply as possible.

Central Tendency:
- Mean: most important descriptive statistic for the center
  Statistic: $\bar{x} = \frac{y_1 + y_2 + \dots + y_n}{n} = \frac{\sum_{i=1}^{n} y_i}{n}$; parameter estimate: $\hat{\mu} = \frac{\sum_{i=1}^{n} y_i}{n}$
- Median: middle measurement of the ordered sample
- Mode: most common observation

Point Estimate:
- Point estimate: a single value used as an estimate of a population parameter
- Interval estimate: an interval defined by two numbers, between which a parameter is expected to lie

Normal Distribution: $f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
- 95% rule: 95% of values fall within 2 standard deviations of the mean. If X is normally distributed:
  o $\Pr(\mu - \sigma \le X \le \mu + \sigma) \approx 68.27\%$
  o $\Pr(\mu - 2\sigma \le X \le \mu + 2\sigma) \approx 95.45\%$
  o $\Pr(\mu - 3\sigma \le X \le \mu + 3\sigma) \approx 99.73\%$
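The 68-95-99.7 rule can be checked with pnorm(); μ and σ below are arbitrary:

    mu <- 10; sigma <- 2                  # made-up parameters
    pnorm(mu + 1*sigma, mu, sigma) - pnorm(mu - 1*sigma, mu, sigma)   # ~ 68.27%
    pnorm(mu + 2*sigma, mu, sigma) - pnorm(mu - 2*sigma, mu, sigma)   # ~ 95.45%
    pnorm(mu + 3*sigma, mu, sigma) - pnorm(mu - 3*sigma, mu, sigma)   # ~ 99.73%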
Sampling Distribution:
- Sample statistic: a random variable; varies across samples
- Sampling distribution: probability distribution of a given random-sample-based statistic
- → Statistical inference: based on the sampling distribution, inferences about the population can be made

Sampling Distribution: Characteristics & Variability of a Sample Statistic (SE):
- Center: the sampling distribution will be centered around the population parameter
- Shape: if the sample size is large enough, the sampling distribution will be symmetric and bell-shaped, like the normal/t-distribution
- Standard error of a statistic (SE): standard deviation of the sample statistic; measures how much the statistic varies from sample to sample

Sample size n & Standard Error SE: increasing n decreases the SE.

Type I and Type II Error:
- Type I error: rejecting the null hypothesis although it is true; the significance level α is the tolerated Type I error rate
- Type II error: not rejecting the null hypothesis although it is false

P-Value (reject $H_0$ if p-value < α):
- Test statistic: a random variable calculated from sample data and used in a hypothesis test; measures the degree of difference between the sample data and $H_0$
- The p-value is the probability that the test statistic equals the observed value or a value even more extreme in the direction predicted by $H_a$

Standardizing Random Variables:
- X (raw score): random variable with mean μ and standard deviation σ > 0
- Standardizing X: $Z = \frac{X - \mu}{\sigma}$
- A standardized normal variable is standard normally distributed: N(0, 1)

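A simulation sketch (population and sample sizes made up): the SE of the sample mean shrinks as n grows, and a standardized variable has mean 0 and SD 1:

    set.seed(1)
    pop <- rexp(1e5, rate = 1)            # made-up skewed population
    se_of_mean <- function(n, reps = 2000) {
      sd(replicate(reps, mean(sample(pop, n))))   # SD of the statistic = empirical SE
    }
    se_of_mean(10); se_of_mean(100)       # increasing n -> smaller SE
    z <- (pop - mean(pop)) / sd(pop)      # standardizing: Z = (X - mu)/sigma
    c(mean(z), sd(z))                     # ~ 0 and 1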
Central Limit Theorem: Let $X_1, X_2, \dots, X_n$ be a random sample of size n drawn from a population distribution with mean μ and variance σ². Suppose we are interested in $\bar{X}$, the sample mean. Then, as n approaches +∞:
$\frac{\sqrt{n}(\bar{X} - \mu)}{\sigma} \xrightarrow{d} N(0, 1)$, or equivalently, $\bar{X} \xrightarrow{d} N\!\left(\mu, \frac{\sigma^2}{n}\right)$

Implications of the Central Limit Theorem:
- $\mu \approx \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
- $\sigma^2 \approx s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
- If n approaches +∞, the SE of the sample mean will be $\frac{s}{\sqrt{n}}$

Probability of tail area:
- Every normal distribution $N(\mu, \sigma^2)$ can be standardized; therefore we only need one Z-table, which shows probabilities under the standard normal distribution, e.g. $N(4, 4) \rightarrow Z = \frac{X - 4}{\sqrt{4}}$
- Therefore, to find the probability of a tail area, e.g. $\Pr(X \le a)$: compute $\Pr\!\left(Z \le \frac{a - \mu}{\sigma}\right)$ → check the Z-table to find the value
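A base-R sketch of the N(4, 4) example above, plus a quick CLT illustration (sample size and distribution chosen arbitrarily):

    a <- 5; mu <- 4; sigma <- sqrt(4)     # the N(4, 4) example
    pnorm((a - mu) / sigma)               # Pr(Z <= (a - mu)/sigma), the Z-table value
    pnorm(a, mean = mu, sd = sigma)       # same probability without standardizing
    set.seed(1)
    hist(replicate(2000, mean(rexp(50)))) # CLT: sample means look bell-shaped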
Characteristics of a Good Estimator:
- Unbiasedness: expected value identical to the population parameter
- Efficiency: low variance
- Consistency: as the sample size n increases, the estimator approaches the parameter

Approximation using the normal / t-distribution:
- n ≤ 30: t-distribution, df = n − 1
- n > 30: normal distribution
- Degrees of freedom (df): number of values in the final calculation of a statistic that are free to vary; most of the time n − 1

Parameter vs. Statistic:
- Mean: parameter μ, statistic $\bar{x}$
- Proportion: parameter p, statistic $\hat{p}$
- Standard deviation: parameter σ, statistic s
- Correlation: parameter ρ, statistic r

Types of Distributions:
- Population distribution: probability distribution of the population
- Sample distribution: probability distribution of a sample from the population
- Sampling distribution: probability distribution of a sample statistic

Interval Estimate:
- Gives a range of plausible values for a population parameter
- Common interval estimate: statistic ± margin of error
- Margin of error: precision of the sample statistic as a point estimate for the parameter; determined using the variability of the sampling distribution

Steps of a Significance Test (one-sided t-test):
1. Assumptions: (1) random sample, (2) quantitative variable, (3) population distribution is normal
2. Stating $H_0$ and $H_a$:
   o Two-sided test: $H_0: \mu = \mu_0$ vs. $H_a: \mu \ne \mu_0$
   o One-sided test: $H_0: \mu = \mu_0$ vs. $H_a: \mu < \mu_0$ or $\mu > \mu_0$
3. Calculate the test statistic: $t = \frac{\bar{x} - \mu_0}{se}$, $se = \frac{s}{\sqrt{n}}$
4. Find the p-value based on the test statistic: two-sided: p-value = 2 × tail probability; one-sided: p-value = tail probability
5. Compare the p-value with the pre-chosen α & 6. Make conclusions: if p ≤ α, we reject the null hypothesis

Comparing Means between two groups:
- Comparison of means between two groups
- In the two independent samples test, the steps are the same, but there are differences in step 1 (assumptions) and step 3 (test statistic):
  1. Assumptions: additional assumption: the population variances are equal, $\sigma_1^2 = \sigma_2^2 = \sigma^2$
  2. Test statistic: $t = \frac{\bar{y}_1 - \bar{y}_2}{se}$, $se = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$, $df = (n_1 - 1) + (n_2 - 1)$
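A sketch of these steps in R on made-up data; t.test() is the built-in equivalent of the manual calculation:

    set.seed(1)
    x <- rnorm(25, mean = 10.5, sd = 2)   # made-up sample, n = 25 (<= 30 -> t-distr.)
    se <- sd(x) / sqrt(length(x))
    t_stat <- (mean(x) - 10) / se         # step 3: test statistic for H0: mu = 10
    pt(t_stat, df = length(x) - 1, lower.tail = FALSE)   # step 4: one-sided p-value
    t.test(x, mu = 10, alternative = "greater")          # same test in one call
    y <- rnorm(25, mean = 11, sd = 2)
    t.test(x, y, var.equal = TRUE)        # two-group comparison, df = (n1-1)+(n2-1)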


95%-CI: A 95% confidence interval will contain the true parameter for 95% of samples.

Constructing a CI (see the R sketch after the R-Programming list):
- Sample statistic ($\hat{\theta}$), e.g. mean/proportion
- SE of the sample mean: $\hat{\sigma} = \frac{s}{\sqrt{n}}$; SE of the sample proportion: $\hat{\sigma} = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$
- CI: $[\hat{\theta} - z_{\alpha/2}\,\hat{\sigma},\ \hat{\theta} + z_{\alpha/2}\,\hat{\sigma}]$

Hypothesis Testing: Goal of statistical inference: draw conclusions about the population by (dis)proving hypotheses.

Statistical Test Hypotheses:
- Null hypothesis $H_0$: claim that there is no effect/difference; we want to disprove it
- Alternative hypothesis $H_a$: claim that we want to find evidence for

Statistical Test, One- vs. two-sided:
- Two-sided test: $H_0: \mu = \mu_0$ vs. $H_a: \mu \ne \mu_0$
- One-sided test: $H_0: \mu = \mu_0$ vs. $H_a: \mu < \mu_0$ or $\mu > \mu_0$

Statistical Significance:
- Results are so extreme that they are unlikely to occur by random chance alone (assuming $H_0$ is true); we then call them statistically significant
- If the sample statistic is statistically significant, we have convincing evidence against $H_0$ and thus in favor of $H_a$

R-Programming:
- Defining something: <-
- Package installing and loading: install.packages(), e.g. for haven; library(), to load packages
- Working directory: setwd(), to set the working directory to a specified location
- Data import: read_dta(), read Stata files; View(), view the contents of a data frame; class(), check the data type (class); summary(), summary of variables
- Describing variables (coding and distribution): factor(), creates a new factor variable with custom labels
- Creating tables: table(), creates a frequency table for a variable; CrossTable(), creates a cross-tabulation table for two variables
- Data visualisation: ggplot(), create visualizations; geom_bar(), create bar charts for categorical variables; geom_point(), create scatter plots for numeric variables; hist(), create histograms; labs(), add labels to the axes
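Putting the pieces together, a minimal sketch (made-up data) of the CI construction above:

    x  <- rnorm(100, mean = 5, sd = 2)    # made-up sample
    se <- sd(x) / sqrt(length(x))         # SE of the sample mean
    z  <- qnorm(0.975)                    # z_{alpha/2} for a 95% CI
    mean(x) + c(-1, 1) * z * se           # statistic +/- margin of error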
