Statistics Cheatsheet PDF
Observational Study
➔ there can always be lurking variables affecting results
➔ i.e., strong positive association between shoe size and intelligence for boys
➔ **should never show causation

Experimental Study
➔ lurking variables can be controlled; can give good evidence for causation

➔ Quartiles split the ranked data into 4 equal groups
◆ Box and Whisker Plot

➔ Sample standard deviation: s = √( Σᵢ₌₁ⁿ (xᵢ − x̄)² / (n − 1) )
◆ highly affected by outliers
◆ has same units as original data
◆ finance = horrible measure of risk (trampoline example)
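A minimal sketch of the sample standard deviation formula above, using made-up data (not from the source):

```python
# Illustrative data; sample standard deviation has the same units as the data.
data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)
mean = sum(data) / n

# s = sqrt( sum((xi - xbar)^2) / (n - 1) )
s = (sum((x - mean) ** 2 for x in data) / (n - 1)) ** 0.5

print(round(mean, 3), round(s, 3))
```

Note the n − 1 denominator: this is the sample (not population) standard deviation.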
Descriptive Statistics Part I

➔ Summary Measures
➔ Range = X maximum − X minimum
◆ Disadvantages: ignores the way in which data are distributed; sensitive to outliers
➔ Interquartile Range (IQR) = 3rd quartile − 1st quartile
◆ Not used that much
◆ Not affected by outliers

Descriptive Statistics Part II

Linear Transformations
➔ Linear transformations change the center and spread of data
➔ Average(a + bX) = a + b·Average(X)
➔ Var(a + bX) = b²·Var(X)
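The two linear-transformation rules above can be checked numerically; a small sketch with illustrative numbers (not from the source):

```python
# Y = a + bX shifts the center by a and scales the spread by |b|.
xs = [1.0, 2.0, 3.0, 4.0]
a, b = 10.0, -2.0
ys = [a + b * x for x in xs]

def mean(v):
    return sum(v) / len(v)

def var(v):  # sample variance, n - 1 in the denominator
    m = mean(v)
    return sum((x - m) ** 2 for x in v) / (len(v) - 1)

# Average(a + bX) = a + b * Average(X)
assert abs(mean(ys) - (a + b * mean(xs))) < 1e-9
# Var(a + bX) = b^2 * Var(X)  (the sign of b drops out of the variance)
assert abs(var(ys) - b ** 2 * var(xs)) < 1e-9
```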
➔ Effects of Linear Transformations:
◆ mean_new = a + b·mean
◆ median_new = a + b·median
◆ stdev_new = |b|·stdev
◆ IQR_new = |b|·IQR
➔ Z-score: z = (X − X̄)/S_X; the new data set will have mean 0 and variance 1

Skewness
➔ measures the degree of asymmetry exhibited by data
◆ negative values = skewed left
◆ positive values = skewed right
◆ if |skewness| < 0.8, don't need to transform data

Combining Data Sets
➔ Mean: Z = aX + bY, so Z̄ = aX̄ + bȲ
➔ Var(Z) = s_Z² = a²·Var(X) + b²·Var(Y) + 2ab·Cov(X, Y)

Measurements of Association
◆ Correlation doesn't imply causation
◆ The correlation of a variable with itself is one
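The variance rule for combining data sets above can be verified numerically; a sketch with made-up paired data (values are illustrative):

```python
# Check: for Z = aX + bY, Var(Z) = a^2 Var(X) + b^2 Var(Y) + 2ab Cov(X, Y).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 1.0, 4.0, 3.0]
a, b = 2.0, 3.0
n = len(xs)

mx = sum(xs) / n
my = sum(ys) / n
var_x = sum((x - mx) ** 2 for x in xs) / (n - 1)
var_y = sum((y - my) ** 2 for y in ys) / (n - 1)
cov_xy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

zs = [a * x + b * y for x, y in zip(xs, ys)]
mz = sum(zs) / n
var_z = sum((z - mz) ** 2 for z in zs) / (n - 1)

# Direct variance of Z matches the combination formula.
assert abs(var_z - (a**2 * var_x + b**2 * var_y + 2 * a * b * cov_xy)) < 1e-9
```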
➔ Covariance
◆ Covariance > 0: larger x goes with larger y
◆ Covariance < 0: larger x goes with smaller y
◆ s_xy = (1/(n − 1)) Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ)
◆ Units = units of x × units of y
◆ Covariance only tells you +, −, or 0 (its magnitude can be any number)
➔ Correlation measures strength of a linear relationship between two variables
◆ r_xy = covariance_xy / [(std. dev. x)(std. dev. y)]
◆ correlation is between −1 and 1
◆ Sign: direction of relationship
◆ Absolute value: strength of relationship (−0.6 is a stronger relationship than +0.4)

Empirical Rule
➔ Only for mound-shaped data
➔ Approx. 95% of data is in the interval: (x̄ − 2s_x, x̄ + 2s_x) = x̄ ± 2s_x
➔ only use if you just have mean and std. dev.

Chebyshev's Rule
➔ Use for any set of data and for any number k greater than 1 (1.2, 1.3, etc.)
➔ At least 1 − 1/k² of the data falls within k standard deviations of the mean
➔ (Ex) for k = 2, at least 75% of data falls within 2 standard deviations

Portfolios
➔ Return on a portfolio: R_p = w_A·R_A + w_B·R_B
◆ weights add up to 1
◆ return = mean
◆ risk = std. deviation
➔ Variance of return of portfolio: s_p² = w_A²·s_A² + w_B²·s_B² + 2·w_A·w_B·s_AB
◆ Risk (variance) is reduced when stocks are negatively correlated (when there's a negative covariance)

Detecting Outliers
➔ Classic Outlier Detection
◆ doesn't always work
◆ flag X if |z| = |(X − X̄)/S_X| ≥ 2
➔ The Boxplot Rule
◆ Value X is an outlier if:
X < Q1 − 1.5(Q3 − Q1) or X > Q3 + 1.5(Q3 − Q1)
Probability
➔ measure of uncertainty
➔ all outcomes have to be exhaustive (all options possible) and mutually exclusive (no 2 outcomes can occur at the same time)
Probability Rules
1. Probabilities range from 0 ≤ Prob(A) ≤ 1
2. The probabilities of all outcomes must add up to 1
3. The complement rule: A happens or A doesn't happen
P(not A) = 1 − P(A)
P(A) + P(not A) = 1
4. Addition Rule: P(A or B) = P(A) + P(B) − P(A and B)

➔ Another way to find joint probability:
P(A and B) = P(A|B)·P(B)
P(A and B) = P(B|A)·P(A)

2 x 2 Table

Decision Tree Analysis
➔ square = your choice
➔ circle = uncertain events
➔ Expected Value Solution: EMV = X₁(P₁) + X₂(P₂) + ... + Xₙ(Pₙ)
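The Expected Value (EMV) formula above is a weighted sum of payoffs; a minimal sketch with made-up payoffs and probabilities:

```python
# EMV = X1*P1 + X2*P2 + ... + Xn*Pn  (payoffs and probabilities are illustrative)
payoffs = [100.0, 50.0, -20.0]
probs = [0.2, 0.5, 0.3]  # must sum to 1

emv = sum(x * p for x, p in zip(payoffs, probs))
print(emv)
```

In a decision tree, you would compute an EMV like this at each circle (uncertain event) and pick the branch with the best EMV at each square (your choice).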
Contingency/Joint Table
➔ To go from contingency to joint table, divide by total # of counts
➔ everything inside table adds up to 1

Conditional Probability
➔ P(A|B) = P(A and B) / P(B)
➔ Given event B has happened, what is the probability event A will happen?
➔ Look out for: "given", "if"

Independence
➔ Independent if: P(A|B) = P(A) or P(B|A) = P(B)
➔ If probabilities change, then A and B are dependent
➔ **hard to prove independence, need to check every value

Multiplication Rules
➔ If A and B are INDEPENDENT: P(A and B) = P(A)·P(B)

Decision Analysis
➔ Maximax solution = optimistic approach: always think the best is going to happen
➔ Maximin solution = pessimistic approach

Discrete Random Variables
➔ P_X(x) = P(X = x)

Expectation
➔ μ_x = E(X) = Σ xᵢ·P(X = xᵢ)
➔ Example: (2)(0.1) + (3)(0.5) = 1.7

Variance
➔ σ² = E(X²) − μ_x²
➔ Example: (2)²(0.1) + (3)²(0.5) − (1.7)² = 2.01

Rules for Expectation and Variance (S = a + bX)
➔ μ_S = E(S) = a + b·μ_x
➔ Var(S) = b²·σ²

Jointly Distributed Discrete Random Variables
➔ Independent if: P_XY(X = x and Y = y) = P_X(x)·P_Y(y)
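The Expectation and Variance examples above are partial sums; a sketch of the full formulas, assuming a complete distribution whose probabilities sum to 1 (the added P(X = 0) = 0.4 is an assumption that makes the cheatsheet's partial sums work out):

```python
# mu = sum(x * P(X = x)),  sigma^2 = E(X^2) - mu^2
# Assumed complete pmf; the 2->0.1 and 3->0.5 entries come from the example above.
pmf = {0: 0.4, 2: 0.1, 3: 0.5}

mu = sum(x * p for x, p in pmf.items())          # expectation
ex2 = sum(x ** 2 * p for x, p in pmf.items())    # E(X^2)
var = ex2 - mu ** 2                              # shortcut variance formula
print(mu, var)
```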
➔ Combining Random Variables
◆ If X and Y are independent:
E(X + Y) = E(X) + E(Y)
Var(X + Y) = Var(X) + Var(Y)
◆ If X and Y are dependent:
E(X + Y) = E(X) + E(Y)
Var(X + Y) = Var(X) + Var(Y) + 2·Cov(X, Y)
➔ Covariance: Cov(X, Y) = E(XY) − E(X)·E(Y)
➔ If X and Y are independent, Cov(X, Y) = 0

2.) All successes: P(all successes) = pⁿ
3.) At least one success: P(at least 1 success) = 1 − (1 − p)ⁿ
4.) At least one failure: P(at least 1 failure) = 1 − pⁿ
5.) Binomial Distribution Formula for x = exact value

Continuous Probability Distributions
➔ the probability that a continuous random variable X will assume any particular value is 0
➔ Density Curves
◆ Area under the curve is the probability that any range of values will occur.
◆ Total area = 1

Uniform Distribution
➔ Mean for uniform distribution: E(X) = (a + b)/2
➔ Variance for unif. distribution: Var(X) = (b − a)²/12

Normal Distribution
➔ governed by 2 parameters: μ (the mean) and σ (the standard deviation)
➔ X ~ N(μ, σ²)
➔ Standardize Normal Distribution: Z = (X − μ)/σ
➔ Z-score is the number of standard deviations the related X is from its mean
➔ **Z < some value: the probability is read directly from the table
➔ **Z > some value: the probability is (1 − probability found on table)

Normal Distribution Example

Sums of Normals
➔ Cov(X, Y) = 0 b/c they're independent

Sums of Normals Example

Central Limit Theorem
➔ as n increases, x̄ should get closer to μ (population mean)
➔ mean(x̄) = μ
➔ variance(x̄) = σ²/n
➔ x̄ ~ N(μ, σ²/n)
◆ if population is normally distributed, n can be any value
◆ for any other population, n needs to be ≥ 30

Confidence Intervals
➔ tells us how good our estimate is
➔ **Want high confidence, narrow interval
➔ **As confidence increases, interval width also increases

A. One Sample Proportion
➔ p̂ = (number of successes in sample) / n, where n = sample size
➔ We are thus 95% confident that the true population proportion is in the interval…
➔ We are assuming that n is large (n·p̂ > 5) and our sample size is less than 10% of the population size.
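A sketch of the one-sample proportion interval described above, p̂ ± 1.96·√(p̂(1 − p̂)/n); the counts are illustrative, and 1.96 is the 95% normal critical value (valid when n is large):

```python
import math

# Illustrative counts: 120 successes out of 400.
successes, n = 120, 400
p_hat = successes / n                       # sample proportion, 0.30
se = math.sqrt(p_hat * (1 - p_hat) / n)     # standard error of p_hat
lo, hi = p_hat - 1.96 * se, p_hat + 1.96 * se
print(round(lo, 4), round(hi, 4))
```

We would then say: we are 95% confident the true population proportion lies in (lo, hi).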
Standard Error and Margin of Error

B. One Sample Mean
➔ For samples n > 30, Confidence Interval: x̄ ± 1.96·(σ/√n)
➔ If n > 30, we can substitute s for σ so that we get: x̄ ± 1.96·(s/√n)
➔ For samples n < 30, use the t-distribution
* Stata always uses the t-distribution when computing confidence intervals

T Distribution used when:
➔ σ is not known, n < 30, and data is normally distributed

Example of Sample Proportion Problem

Determining Sample Size
➔ n = (1.96)²·p̂(1 − p̂) / e²
➔ If given a confidence interval, p̂ is the middle number of the interval
➔ No confidence interval: use the worst-case scenario
◆ p̂ = 0.5

Hypothesis Testing
➔ Null Hypothesis: H0, a statement of no change, is assumed true until evidence indicates otherwise.
➔ Alternative Hypothesis: Ha is a statement that we are trying to find evidence to support.
➔ Type I error: reject the null hypothesis when the null hypothesis is true. (considered the worst error)
➔ Type II error: do not reject the null hypothesis when the alternative hypothesis is true.

Example of Type I and Type II errors

Methods of Hypothesis Testing
1. Confidence Intervals **
2. Test statistic
3. P-values **
➔ C.I. and P-values are always safe to do because you don't need to worry about the size of n (can be bigger or smaller than 30)

One Sample Hypothesis Tests
1. Confidence Interval (can be used only for two-sided tests)
2. Test Statistic Approach (Population Mean)
3. Test Statistic Approach (Population Proportion)
4. P-Values
➔ a number between 0 and 1
➔ the larger the p-value, the more consistent the data is with the null
➔ the smaller the p-value, the more consistent the data is with the alternative
➔ **If P is low (less than 0.05), H0 must go: reject the null hypothesis
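A sketch of the test-statistic approach for a population mean (n > 30, so s substitutes for σ); the summary numbers are illustrative, and 1.96 is the two-sided 5% cutoff:

```python
import math

# z = (xbar - mu0) / (s / sqrt(n)); reject H0 at the 5% level if |z| > 1.96.
xbar, mu0, s, n = 5.3, 5.0, 1.2, 64   # illustrative summary statistics
z = (xbar - mu0) / (s / math.sqrt(n))
reject = abs(z) > 1.96
print(round(z, 2), reject)
```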
Two Sample Hypothesis Tests
1. Comparing Two Proportions (Independent Groups)
➔ Calculate Confidence Interval
➔ Test Statistic for Two Proportions
2. Comparing Two Means (large independent samples, n > 30)
➔ Calculating Confidence Interval
➔ Test Statistic for Two Means

Matched Pairs
➔ Two samples are DEPENDENT
Example:

Simple Linear Regression
➔ used to predict the value of one variable (dependent variable) on the basis of other variables (independent variables)
➔ Interpretation of slope: for each additional x value (e.g., mile on the odometer), the y value decreases/increases by an average of b1
➔ Interpretation of y-intercept: plug in 0 for x, and the value you get for ŷ is the y-intercept (e.g., ŷ = 3.25 − 0.0614·SkippedClass: a student who skips no classes has a GPA of 3.25)
➔ **danger of extrapolation: if an x value is outside of our data set, we can't confidently predict the fitted y value

Properties of the Residuals and Fitted Values
➔ corr(Ŷ, e) = 0

A Measure of Fit: R²
➔ Good fit: SSR is big, SSE is small
➔ SST = SSR → perfect fit
➔ R²: coefficient of determination
➔ R² = SSR/SST = 1 − SSE/SST
➔ R² is between 0 and 1; the closer R² is to 1, the better the fit
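A minimal least-squares sketch of the simple regression and R² = 1 − SSE/SST described above, using made-up data:

```python
# Fit y = b0 + b1*x by least squares, then compute R^2 = 1 - SSE/SST.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 1.9, 3.2, 3.8, 5.1]   # illustrative data
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n

# Least-squares slope and intercept
b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
b0 = my - b1 * mx

fitted = [b0 + b1 * x for x in xs]
sst = sum((y - my) ** 2 for y in ys)                       # total variation
sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))        # unexplained variation
r2 = 1 - sse / sst
print(round(b1, 3), round(b0, 3), round(r2, 4))
```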
Assumptions of Simple Linear Regression
1. We model the AVERAGE of something rather than the thing itself
2.

Example of Prediction Intervals:
➔ 95% of the Y values should lie within the interval b0 + b1·X ± 1.96·Se

Estimating Se
➔ Se² = SSE/(n − 2)
➔ Se² is our estimate of σ²
➔ Se = √(Se²) is our estimate of σ

Standard Errors for b1 and b0
➔ sb0 = amount of uncertainty in our estimate of β0 (small s good, large s bad)
➔ sb1 = amount of uncertainty in our estimate of β1
◆ As ε (noise) gets bigger, it's harder to find the line
➔ n small → bad
➔ se big → bad
➔ sx² small → bad (want the x's spread out for a better guess)

Confidence Intervals for b1 and b0

Regression Hypothesis Testing
*always a two-sided test
➔ want to test whether the slope (β1) is needed in our model
➔ H0: β1 = 0 (don't need x)
Ha: β1 ≠ 0 (need x)
➔ Need X in the model if:
a. 0 isn't in the confidence interval
b. |t| > 1.96
c. P-value < 0.05

Test Statistic for Slope/Y-intercept
➔ can only be used if n > 30
➔ if n < 30, use p-values
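A sketch of the slope test above: t = b1/sb1, where sb1 = Se/√(Σ(x − x̄)²), flagging that X is needed when |t| > 1.96. Data are illustrative, and the 1.96 cutoff assumes a large sample:

```python
import math

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 2.0, 5.0, 4.0]   # illustrative data
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n

b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
b0 = my - b1 * mx

sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
se = math.sqrt(sse / (n - 2))                         # Se estimates sigma
sb1 = se / math.sqrt(sum((x - mx) ** 2 for x in xs))  # standard error of b1
t = b1 / sb1                                          # test statistic for H0: beta1 = 0
print(round(b1, 3), round(t, 3))
```

Note how sb1 shrinks when Se is small or the x's are spread out, matching the "n small / se big / sx² small → bad" bullets above.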
Multiple Regression
➔ Variable Importance:
◆ higher t-value, lower p-value = variable is more important
◆ lower t-value, higher p-value = variable is less important (or not needed)

Interaction Terms
➔ allow the slopes to change
➔ interaction between 2 or more x variables that will affect the Y variable

Adjusted R-squared
➔ k = # of X's
➔ Adj. R-squared will go down as you add junk x variables
➔ Adj. R-squared will only go up if the x you add in is very useful
➔ **want Adj. R-squared to go up and Se low for a better model

Modeling Regression
Backward Stepwise Regression
1. Start with all variables in the model
2. At each step, delete the least important x variable, based on the largest p-value above 0.05
3. Stop when you can't delete any more
➔ Will see Adj. R-squared and Se change at each step

The Overall F Test
➔ Always want to reject the F test (reject the null hypothesis)
➔ Look at the p-value (if < 0.05, reject the null)
➔ H0: β1 = β2 = β3 = ... = βk = 0 (don't need any X's)
Ha: at least one βj ≠ 0 (need at least 1 X)
➔ If no x variables are needed, then SSR = 0 and SST = SSE

Dummy Variables
➔ An indicator variable that takes on a value of 0 or 1; allows intercepts to change

How to Create Dummy Variables (Nominal Variables)
➔ If C is the number of categories, create (C − 1) dummy variables for describing the variable
➔ One category is always the "baseline", which is included in the intercept

Recoding Dummy Variables
Example: How many hockey sticks sold in the summer (original equation):
hockey = 100 + 10·Wtr − 20·Spr + 30·Fall
Write the equation for how many hockey sticks sold in the winter:
hockey = 110 + 20·Fall − 30·Spr − 10·Summer
➔ **always need to get the same exact values from the original equation so that we can compare models. Can't compare models if you take the log of Y.
◆ Transformations cheatsheet

Regression Diagnostics

Standardized Residuals

◆ Homoskedastic: a band around the values
◆ Heteroskedastic: as x goes up, the noise goes up (no more band; fan-shaped)
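A crude numeric sketch of the fan-shape idea above: compare the residual spread at low vs. high fitted values. The residuals and the split-and-compare rule are illustrative assumptions (this is not Stata's hettest):

```python
# Under homoskedasticity the residuals form a constant band around zero;
# spread that grows with the fitted values suggests heteroskedasticity.
fitted = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
resid = [0.1, -0.2, 0.4, -0.8, 1.5, -2.0]   # illustrative: spread grows with fitted

half = len(fitted) // 2
spread_low = sum(abs(r) for r in resid[:half]) / half    # average |residual|, low fitted
spread_high = sum(abs(r) for r in resid[half:]) / half   # average |residual|, high fitted
fan_shaped = spread_high > 2 * spread_low                # ad hoc cutoff for the sketch
print(fan_shaped)
```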
Check Model Assumptions
➔ Plot residuals versus Ŷ (fitted values)
◆ If heteroskedastic, fix it by logging the Y variable
◆ If heteroskedastic, fix it by making the standard errors robust

➔ Outliers
◆ Regression likes to move towards outliers (shows up as R² being really high)
◆ want to remove outliers that are extreme in both x and y

➔ Nonlinearity (ovtest)
◆ Plotting residuals vs. fitted values will show a relationship if the data is nonlinear (R² also high)
◆ ovtest: a significant test statistic indicates that polynomial terms should be added
◆ H0: data = no transformation
Ha: data ≠ no transformation
◆ Log transformation accommodates nonlinearity, reduces right skewness in the Y, eliminates heteroskedasticity
◆ **Only take the log of the X variable

➔ Normality (sktest)
◆ H0: data = normality
Ha: data ≠ normality
◆ don't want to reject the null hypothesis; the p-value should be big

➔ Homoskedasticity (hettest)
◆ H0: data = homoskedasticity
◆ Ha: data ≠ homoskedasticity

Summary of Regression Output

➔ Multicollinearity
◆ when x variables are highly correlated with each other
◆ R² > 0.9
◆ pairwise correlation > 0.9
◆ correlate all x variables, include the y variable, drop the x variable that is less correlated to y
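The pairwise-correlation screen for multicollinearity above can be sketched directly; the data and the 0.9 cutoff follow the bullets, but the values themselves are made up:

```python
# Flag pairs of x variables whose correlation exceeds 0.9 as collinear candidates.
def corr(u, v):
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (su * sv)

x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.1, 4.0, 6.2, 7.9, 10.1]   # roughly 2 * x1: highly collinear with x1
x3 = [5.0, 1.0, 4.0, 2.0, 3.0]    # not strongly related to x1

flagged = corr(x1, x2) > 0.9       # collinear pair: drop the one less correlated with y
ok = abs(corr(x1, x3)) > 0.9       # this pair is fine
print(flagged, ok)
```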