Statistics Cheatsheet PDF
Observational Study
➔ there can always be lurking variables affecting results
➔ i.e., strong positive association between shoe size and intelligence for boys
➔ **should never show causation

Experimental Study
➔ lurking variables can be controlled; can give good evidence for causation

➔ Quartiles split the ranked data into 4 equal groups
◆ Box and Whisker Plot

➔ Sample standard deviation: s = √( Σᵢ₌₁ⁿ (xᵢ − x̄)² / (n − 1) )
◆ highly affected by outliers
◆ has same units as original data
◆ finance = horrible measure of risk (trampoline example)
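A minimal sketch of the sample standard deviation formula above, using made-up data (not from the source):

```python
# Illustrative data; sample standard deviation has the same units as the data.
data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)
mean = sum(data) / n

# s = sqrt( sum((xi - xbar)^2) / (n - 1) )
s = (sum((x - mean) ** 2 for x in data) / (n - 1)) ** 0.5

print(round(mean, 3), round(s, 3))
```

Note the n − 1 denominator: this is the sample (not population) standard deviation.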
Descriptive Statistics Part I

➔ Summary Measures
➔ Range = X maximum − X minimum
◆ Disadvantages: ignores the way in which data are distributed; sensitive to outliers
➔ Interquartile Range (IQR) = 3rd quartile − 1st quartile
◆ Not used that much
◆ Not affected by outliers

Descriptive Statistics Part II

Linear Transformations
➔ Linear transformations change the center and spread of data
➔ Average(a + bX) = a + b·Average(X)
➔ Var(a + bX) = b²·Var(X)
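The two linear-transformation rules above can be checked numerically; a small sketch with illustrative numbers (not from the source):

```python
# Y = a + bX shifts the center by a and scales the spread by |b|.
xs = [1.0, 2.0, 3.0, 4.0]
a, b = 10.0, -2.0
ys = [a + b * x for x in xs]

def mean(v):
    return sum(v) / len(v)

def var(v):  # sample variance, n - 1 in the denominator
    m = mean(v)
    return sum((x - m) ** 2 for x in v) / (len(v) - 1)

# Average(a + bX) = a + b * Average(X)
assert abs(mean(ys) - (a + b * mean(xs))) < 1e-9
# Var(a + bX) = b^2 * Var(X)  (the sign of b drops out of the variance)
assert abs(var(ys) - b ** 2 * var(xs)) < 1e-9
```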
➔ Effects of Linear Transformations:
◆ mean_new = a + b·mean
◆ median_new = a + b·median
◆ stdev_new = |b|·stdev
◆ IQR_new = |b|·IQR
➔ Z-score: z = (X − X̄)/S_X; the new data set will have mean 0 and variance 1

Skewness
➔ measures the degree of asymmetry exhibited by data
◆ negative values = skewed left
◆ positive values = skewed right
◆ if |skewness| < 0.8, don't need to transform data

Combining Data Sets
➔ Mean: Z = aX + bY, so Z̄ = aX̄ + bȲ
➔ Var(Z) = s_Z² = a²·Var(X) + b²·Var(Y) + 2ab·Cov(X, Y)

Measurements of Association
◆ Correlation doesn't imply causation
◆ The correlation of a variable with itself is one
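The variance rule for combining data sets above can be verified numerically; a sketch with made-up paired data (values are illustrative):

```python
# Check: for Z = aX + bY, Var(Z) = a^2 Var(X) + b^2 Var(Y) + 2ab Cov(X, Y).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 1.0, 4.0, 3.0]
a, b = 2.0, 3.0
n = len(xs)

mx = sum(xs) / n
my = sum(ys) / n
var_x = sum((x - mx) ** 2 for x in xs) / (n - 1)
var_y = sum((y - my) ** 2 for y in ys) / (n - 1)
cov_xy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

zs = [a * x + b * y for x, y in zip(xs, ys)]
mz = sum(zs) / n
var_z = sum((z - mz) ** 2 for z in zs) / (n - 1)

# Direct variance of Z matches the combination formula.
assert abs(var_z - (a**2 * var_x + b**2 * var_y + 2 * a * b * cov_xy)) < 1e-9
```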
➔ Covariance
◆ Covariance > 0: larger x goes with larger y
◆ Covariance < 0: larger x goes with smaller y
◆ s_xy = (1/(n − 1)) Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ)
◆ Units = units of x × units of y
◆ Covariance only tells you +, −, or 0 (its magnitude can be any number)
➔ Correlation measures strength of a linear relationship between two variables
◆ r_xy = covariance_xy / [(std. dev. x)(std. dev. y)]
◆ correlation is between −1 and 1
◆ Sign: direction of relationship
◆ Absolute value: strength of relationship (−0.6 is a stronger relationship than +0.4)

Empirical Rule
➔ Only for mound-shaped data
➔ Approx. 95% of data is in the interval: (x̄ − 2s_x, x̄ + 2s_x) = x̄ ± 2s_x
➔ only use if you just have mean and std. dev.

Chebyshev's Rule
➔ Use for any set of data and for any number k greater than 1 (1.2, 1.3, etc.)
➔ At least 1 − 1/k² of the data falls within k standard deviations of the mean
➔ (Ex) for k = 2, at least 75% of data falls within 2 standard deviations

Portfolios
➔ Return on a portfolio: R_p = w_A·R_A + w_B·R_B
◆ weights add up to 1
◆ return = mean
◆ risk = std. deviation
➔ Variance of return of portfolio: s_p² = w_A²·s_A² + w_B²·s_B² + 2·w_A·w_B·s_AB
◆ Risk (variance) is reduced when stocks are negatively correlated (when there's a negative covariance)

Detecting Outliers
➔ Classic Outlier Detection
◆ doesn't always work
◆ flag X if |z| = |(X − X̄)/S_X| ≥ 2
➔ The Boxplot Rule
◆ Value X is an outlier if:
X < Q1 − 1.5(Q3 − Q1) or X > Q3 + 1.5(Q3 − Q1)
Probability
➔ measure of uncertainty
➔ all outcomes have to be exhaustive (all options possible) and mutually exclusive (no 2 outcomes can occur at the same time)
Probability Rules
1. Probabilities range from 0 ≤ Prob(A) ≤ 1
2. The probabilities of all outcomes must add up to 1
3. The complement rule: A happens or A doesn't happen
P(not A) = 1 − P(A)
P(A) + P(not A) = 1
4. Addition Rule: P(A or B) = P(A) + P(B) − P(A and B)

➔ Another way to find joint probability:
P(A and B) = P(A|B)·P(B)
P(A and B) = P(B|A)·P(A)

2 x 2 Table

Decision Tree Analysis
➔ square = your choice
➔ circle = uncertain events
➔ Expected Value Solution: EMV = X₁(P₁) + X₂(P₂) + ... + Xₙ(Pₙ)
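The Expected Value (EMV) formula above is a weighted sum of payoffs; a minimal sketch with made-up payoffs and probabilities:

```python
# EMV = X1*P1 + X2*P2 + ... + Xn*Pn  (payoffs and probabilities are illustrative)
payoffs = [100.0, 50.0, -20.0]
probs = [0.2, 0.5, 0.3]  # must sum to 1

emv = sum(x * p for x, p in zip(payoffs, probs))
print(emv)
```

In a decision tree, you would compute an EMV like this at each circle (uncertain event) and pick the branch with the best EMV at each square (your choice).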
Contingency/Joint Table
➔ To go from contingency to joint table, divide by total # of counts
➔ everything inside table adds up to 1

Conditional Probability
➔ P(A|B) = P(A and B) / P(B)
➔ Given event B has happened, what is the probability event A will happen?
➔ Look out for: "given", "if"

Independence
➔ Independent if: P(A|B) = P(A) or P(B|A) = P(B)
➔ If probabilities change, then A and B are dependent
➔ **hard to prove independence, need to check every value

Multiplication Rules
➔ If A and B are INDEPENDENT: P(A and B) = P(A)·P(B)

Decision Analysis
➔ Maximax solution = optimistic approach: always think the best is going to happen
➔ Maximin solution = pessimistic approach

Discrete Random Variables
➔ P_X(x) = P(X = x)

Expectation
➔ μ_x = E(X) = Σ xᵢ·P(X = xᵢ)
➔ Example: (2)(0.1) + (3)(0.5) = 1.7

Variance
➔ σ² = E(X²) − μ_x²
➔ Example: (2)²(0.1) + (3)²(0.5) − (1.7)² = 2.01

Rules for Expectation and Variance (S = a + bX)
➔ μ_S = E(S) = a + b·μ_x
➔ Var(S) = b²·σ²

Jointly Distributed Discrete Random Variables
➔ Independent if: P_XY(X = x and Y = y) = P_X(x)·P_Y(y)
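The Expectation and Variance examples above are partial sums; a sketch of the full formulas, assuming a complete distribution whose probabilities sum to 1 (the added P(X = 0) = 0.4 is an assumption that makes the cheatsheet's partial sums work out):

```python
# mu = sum(x * P(X = x)),  sigma^2 = E(X^2) - mu^2
# Assumed complete pmf; the 2->0.1 and 3->0.5 entries come from the example above.
pmf = {0: 0.4, 2: 0.1, 3: 0.5}

mu = sum(x * p for x, p in pmf.items())          # expectation
ex2 = sum(x ** 2 * p for x, p in pmf.items())    # E(X^2)
var = ex2 - mu ** 2                              # shortcut variance formula
print(mu, var)
```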
➔ Combining Random Variables
◆ If X and Y are independent:
E(X + Y) = E(X) + E(Y)
Var(X + Y) = Var(X) + Var(Y)
◆ If X and Y are dependent:
E(X + Y) = E(X) + E(Y)
Var(X + Y) = Var(X) + Var(Y) + 2·Cov(X, Y)
➔ Covariance: Cov(X, Y) = E(XY) − E(X)·E(Y)
➔ If X and Y are independent, Cov(X, Y) = 0

2.) All successes: P(all successes) = pⁿ
3.) At least one success: P(at least 1 success) = 1 − (1 − p)ⁿ
4.) At least one failure: P(at least 1 failure) = 1 − pⁿ
5.) Binomial Distribution Formula for x = exact value

Continuous Probability Distributions
➔ the probability that a continuous random variable X will assume any particular value is 0
➔ Density Curves
◆ Area under the curve is the probability that any range of values will occur.
◆ Total area = 1

Uniform Distribution
➔ Mean for uniform distribution: E(X) = (a + b)/2
➔ Variance for unif. distribution: Var(X) = (b − a)²/12

Normal Distribution
➔ governed by 2 parameters: μ (the mean) and σ (the standard deviation)
➔ X ~ N(μ, σ²)
➔ Standardize Normal Distribution: Z = (X − μ)/σ
➔ Z-score is the number of standard deviations the related X is from its mean
➔ **Z < some value: the probability is read directly from the table
➔ **Z > some value: the probability is (1 − probability found on table)

Normal Distribution Example

Sums of Normals
➔ Cov(X, Y) = 0 b/c they're independent

Sums of Normals Example

Central Limit Theorem
➔ as n increases, x̄ should get closer to μ (population mean)
➔ mean(x̄) = μ
➔ variance(x̄) = σ²/n
➔ x̄ ~ N(μ, σ²/n)
◆ if population is normally distributed, n can be any value
◆ for any other population, n needs to be ≥ 30

Confidence Intervals
➔ tells us how good our estimate is
➔ **Want high confidence, narrow interval
➔ **As confidence increases, interval width also increases

A. One Sample Proportion
➔ p̂ = (number of successes in sample) / n, where n = sample size
➔ We are thus 95% confident that the true population proportion is in the interval…
➔ We are assuming that n is large (n·p̂ > 5) and our sample size is less than 10% of the population size.
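A sketch of the one-sample proportion interval described above, p̂ ± 1.96·√(p̂(1 − p̂)/n); the counts are illustrative, and 1.96 is the 95% normal critical value (valid when n is large):

```python
import math

# Illustrative counts: 120 successes out of 400.
successes, n = 120, 400
p_hat = successes / n                       # sample proportion, 0.30
se = math.sqrt(p_hat * (1 - p_hat) / n)     # standard error of p_hat
lo, hi = p_hat - 1.96 * se, p_hat + 1.96 * se
print(round(lo, 4), round(hi, 4))
```

We would then say: we are 95% confident the true population proportion lies in (lo, hi).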
Standard Error and Margin of Error

B. One Sample Mean
➔ For samples n > 30, Confidence Interval: x̄ ± 1.96·(σ/√n)
➔ If n > 30, we can substitute s for σ so that we get: x̄ ± 1.96·(s/√n)
➔ For samples n < 30, use the t-distribution
* Stata always uses the t-distribution when computing confidence intervals

T Distribution used when:
➔ σ is not known, n < 30, and data is normally distributed

Example of Sample Proportion Problem

Determining Sample Size
➔ n = (1.96)²·p̂(1 − p̂) / e²
➔ If given a confidence interval, p̂ is the middle number of the interval
➔ No confidence interval: use the worst-case scenario
◆ p̂ = 0.5

Hypothesis Testing
➔ Null Hypothesis: H0, a statement of no change, is assumed true until evidence indicates otherwise.
➔ Alternative Hypothesis: Ha is a statement that we are trying to find evidence to support.
➔ Type I error: reject the null hypothesis when the null hypothesis is true. (considered the worst error)
➔ Type II error: do not reject the null hypothesis when the alternative hypothesis is true.

Example of Type I and Type II errors

Methods of Hypothesis Testing
1. Confidence Intervals **
2. Test statistic
3. P-values **
➔ C.I. and P-values are always safe to do because you don't need to worry about the size of n (can be bigger or smaller than 30)

One Sample Hypothesis Tests
1. Confidence Interval (can be used only for two-sided tests)
2. Test Statistic Approach (Population Mean)
3. Test Statistic Approach (Population Proportion)
4. P-Values
➔ a number between 0 and 1
➔ the larger the p-value, the more consistent the data is with the null
➔ the smaller the p-value, the more consistent the data is with the alternative
➔ **If P is low (less than 0.05), H0 must go: reject the null hypothesis
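A sketch of the test-statistic approach for a population mean (n > 30, so s substitutes for σ); the summary numbers are illustrative, and 1.96 is the two-sided 5% cutoff:

```python
import math

# z = (xbar - mu0) / (s / sqrt(n)); reject H0 at the 5% level if |z| > 1.96.
xbar, mu0, s, n = 5.3, 5.0, 1.2, 64   # illustrative summary statistics
z = (xbar - mu0) / (s / math.sqrt(n))
reject = abs(z) > 1.96
print(round(z, 2), reject)
```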
Two Sample Hypothesis Tests
1. Comparing Two Proportions (Independent Groups)
➔ Calculate Confidence Interval
➔ Test Statistic for Two Proportions
2. Comparing Two Means (large independent samples, n > 30)
➔ Calculating Confidence Interval
➔ Test Statistic for Two Means

Matched Pairs
➔ Two samples are DEPENDENT
Example:

Simple Linear Regression
➔ used to predict the value of one variable (dependent variable) on the basis of other variables (independent variables)
➔ Interpretation of slope: for each additional x value (e.g., mile on the odometer), the y value decreases/increases by an average of b1
➔ Interpretation of y-intercept: plug in 0 for x, and the value you get for ŷ is the y-intercept (e.g., ŷ = 3.25 − 0.0614·SkippedClass: a student who skips no classes has a GPA of 3.25)
➔ **danger of extrapolation: if an x value is outside of our data set, we can't confidently predict the fitted y value

Properties of the Residuals and Fitted Values
➔ corr(Ŷ, e) = 0

A Measure of Fit: R²
➔ Good fit: SSR is big, SSE is small
➔ SST = SSR → perfect fit
➔ R²: coefficient of determination
➔ R² = SSR/SST = 1 − SSE/SST
➔ R² is between 0 and 1; the closer R² is to 1, the better the fit
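A minimal least-squares sketch of the simple regression and R² = 1 − SSE/SST described above, using made-up data:

```python
# Fit y = b0 + b1*x by least squares, then compute R^2 = 1 - SSE/SST.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 1.9, 3.2, 3.8, 5.1]   # illustrative data
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n

# Least-squares slope and intercept
b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
b0 = my - b1 * mx

fitted = [b0 + b1 * x for x in xs]
sst = sum((y - my) ** 2 for y in ys)                       # total variation
sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))        # unexplained variation
r2 = 1 - sse / sst
print(round(b1, 3), round(b0, 3), round(r2, 4))
```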
Assumptions of Simple Linear Regression
1. We model the AVERAGE of something rather than the thing itself
2.

Example of Prediction Intervals:
➔ 95% of the Y values should lie within the interval b0 + b1·X ± 1.96·Se

Estimating Se
➔ Se² = SSE/(n − 2)
➔ Se² is our estimate of σ²
➔ Se = √(Se²) is our estimate of σ

Standard Errors for b1 and b0
➔ sb0 = amount of uncertainty in our estimate of β0 (small s good, large s bad)
➔ sb1 = amount of uncertainty in our estimate of β1
◆ As ε (noise) gets bigger, it's harder to find the line
➔ n small → bad
➔ se big → bad
➔ sx² small → bad (want the x's spread out for a better guess)

Confidence Intervals for b1 and b0

Regression Hypothesis Testing
*always a two-sided test
➔ want to test whether the slope (β1) is needed in our model
➔ H0: β1 = 0 (don't need x)
Ha: β1 ≠ 0 (need x)
➔ Need X in the model if:
a. 0 isn't in the confidence interval
b. |t| > 1.96
c. P-value < 0.05

Test Statistic for Slope/Y-intercept
➔ can only be used if n > 30
➔ if n < 30, use p-values
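A sketch of the slope test above: t = b1/sb1, where sb1 = Se/√(Σ(x − x̄)²), flagging that X is needed when |t| > 1.96. Data are illustrative, and the 1.96 cutoff assumes a large sample:

```python
import math

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 2.0, 5.0, 4.0]   # illustrative data
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n

b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
b0 = my - b1 * mx

sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
se = math.sqrt(sse / (n - 2))                         # Se estimates sigma
sb1 = se / math.sqrt(sum((x - mx) ** 2 for x in xs))  # standard error of b1
t = b1 / sb1                                          # test statistic for H0: beta1 = 0
print(round(b1, 3), round(t, 3))
```

Note how sb1 shrinks when Se is small or the x's are spread out, matching the "n small / se big / sx² small → bad" bullets above.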
Multiple Regression
➔ Variable Importance:
◆ higher t-value, lower p-value = variable is more important
◆ lower t-value, higher p-value = variable is less important (or not needed)

Interaction Terms
➔ allow the slopes to change
➔ interaction between 2 or more x variables that will affect the Y variable

Adjusted R-squared
➔ k = # of X's
➔ Adj. R-squared will go down as you add junk x variables
➔ Adj. R-squared will only go up if the x you add in is very useful
➔ **want Adj. R-squared to go up and Se low for a better model

Modeling Regression
Backward Stepwise Regression
1. Start with all variables in the model
2. At each step, delete the least important x variable, based on the largest p-value above 0.05
3. Stop when you can't delete any more
➔ Will see Adj. R-squared and Se change at each step

The Overall F Test
➔ Always want to reject the F test (reject the null hypothesis)
➔ Look at the p-value (if < 0.05, reject the null)
➔ H0: β1 = β2 = β3 = ... = βk = 0 (don't need any X's)
Ha: at least one βj ≠ 0 (need at least 1 X)
➔ If no x variables are needed, then SSR = 0 and SST = SSE

Dummy Variables
➔ An indicator variable that takes on a value of 0 or 1; allows intercepts to change

How to Create Dummy Variables (Nominal Variables)
➔ If C is the number of categories, create (C − 1) dummy variables for describing the variable
➔ One category is always the "baseline", which is included in the intercept

Recoding Dummy Variables
Example: How many hockey sticks sold in the summer (original equation):
hockey = 100 + 10·Wtr − 20·Spr + 30·Fall
Write the equation for how many hockey sticks sold in the winter:
hockey = 110 + 20·Fall − 30·Spr − 10·Summer
➔ **always need to get the same exact values from the original equation so that we can compare models. Can't compare models if you take the log of Y.
◆ Transformations cheatsheet

Regression Diagnostics

Standardized Residuals

◆ Homoskedastic: a band around the values
◆ Heteroskedastic: as x goes up, the noise goes up (no more band; fan-shaped)
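A crude numeric sketch of the fan-shape idea above: compare the residual spread at low vs. high fitted values. The residuals and the split-and-compare rule are illustrative assumptions (this is not Stata's hettest):

```python
# Under homoskedasticity the residuals form a constant band around zero;
# spread that grows with the fitted values suggests heteroskedasticity.
fitted = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
resid = [0.1, -0.2, 0.4, -0.8, 1.5, -2.0]   # illustrative: spread grows with fitted

half = len(fitted) // 2
spread_low = sum(abs(r) for r in resid[:half]) / half    # average |residual|, low fitted
spread_high = sum(abs(r) for r in resid[half:]) / half   # average |residual|, high fitted
fan_shaped = spread_high > 2 * spread_low                # ad hoc cutoff for the sketch
print(fan_shaped)
```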
Check Model Assumptions
➔ Plot residuals versus Ŷ (fitted values)
◆ If heteroskedastic, fix it by logging the Y variable
◆ If heteroskedastic, fix it by making the standard errors robust

➔ Outliers
◆ Regression likes to move towards outliers (shows up as R² being really high)
◆ want to remove outliers that are extreme in both x and y

➔ Nonlinearity (ovtest)
◆ Plotting residuals vs. fitted values will show a relationship if the data is nonlinear (R² also high)
◆ ovtest: a significant test statistic indicates that polynomial terms should be added
◆ H0: data = no transformation
Ha: data ≠ no transformation
◆ Log transformation accommodates nonlinearity, reduces right skewness in the Y, eliminates heteroskedasticity
◆ **Only take the log of the X variable

➔ Normality (sktest)
◆ H0: data = normality
Ha: data ≠ normality
◆ don't want to reject the null hypothesis; the p-value should be big

➔ Homoskedasticity (hettest)
◆ H0: data = homoskedasticity
◆ Ha: data ≠ homoskedasticity

Summary of Regression Output

➔ Multicollinearity
◆ when x variables are highly correlated with each other
◆ R² > 0.9
◆ pairwise correlation > 0.9
◆ correlate all x variables, include the y variable, drop the x variable that is less correlated to y
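The pairwise-correlation screen for multicollinearity above can be sketched directly; the data and the 0.9 cutoff follow the bullets, but the values themselves are made up:

```python
# Flag pairs of x variables whose correlation exceeds 0.9 as collinear candidates.
def corr(u, v):
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (su * sv)

x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.1, 4.0, 6.2, 7.9, 10.1]   # roughly 2 * x1: highly collinear with x1
x3 = [5.0, 1.0, 4.0, 2.0, 3.0]    # not strongly related to x1

flagged = corr(x1, x2) > 0.9       # collinear pair: drop the one less correlated with y
ok = abs(corr(x1, x3)) > 0.9       # this pair is fine
print(flagged, ok)
```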