Stats 104E Final Exam Cheat Sheet
Descriptive Statistics

Sample statistic (guess) vs. population parameter (true, but unknown):
Mean: $\bar{x}$ vs. $\mu$
Variance: $s^2$ vs. $\sigma^2$
Correlation: $r$ vs. $\rho$
Noise: $e_i$ vs. $\epsilon_i$
The mean, variance, and std dev are all sensitive to outliers.
$X \sim N(\mu, \sigma^2)$
$Z = \frac{X - \mu}{\sigma} \sim N(0, 1)$
$X = \sigma Z + \mu$
$\sigma^2 = \mathrm{Var}(X) = E[X^2] - \mu_x^2$
$s^2 = \frac{\sum_{\text{all } i} (x_i - \bar{x})^2}{n - 1}$
$\sigma = \mathrm{StdDev}(X) = \sqrt{\sigma^2}$
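The standardization identities above can be sketched in Python (the function names are mine, not from the course):

```python
def z_score(x, mu, sigma):
    """Standardize: Z = (X - mu) / sigma, so Z ~ N(0, 1) when X ~ N(mu, sigma^2)."""
    return (x - mu) / sigma

def unstandardize(z, mu, sigma):
    """Invert the standardization: X = sigma * Z + mu."""
    return sigma * z + mu

def sample_variance(xs):
    """s^2 = sum over all i of (x_i - xbar)^2 / (n - 1): the unbiased sample variance."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)
```

Usage: `z_score(110, 100, 10)` gives 1.0, and `unstandardize` maps it back to 110.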
Random Variables
$P(X \le x)$ is the CDF.
$P(a \le X \le b) = P\left[\frac{a - \mu}{\sigma} \le \frac{X - \mu}{\sigma} \le \frac{b - \mu}{\sigma}\right] = P\left[\frac{a - \mu}{\sigma} \le Z \le \frac{b - \mu}{\sigma}\right]$
$E(cX) = c\,E(X)$
$E(X + c) = E(X) + c$
$\mu_x = E(X) = \sum_{\text{all } x_i} x_i\, P(X = x_i)$
$\sigma_x^2 = \mathrm{Var}(X) = \sum_{\text{all } x_i} (x_i - \mu_x)^2\, P(X = x_i)$
$\mathrm{Cor} = r_{xy} = \frac{s_{xy}}{s_x s_y} = \frac{\mathrm{Cov}}{\sigma_x \sigma_y}$

Chebyshev bounds (any distribution):
$P(\mu - \sigma < x < \mu + \sigma) \ge 0$
$P(\mu - 2\sigma < x < \mu + 2\sigma) \ge 75\%$
$P(\mu - 3\sigma < x < \mu + 3\sigma) \ge 89\%$
Linear transforms (e.g. return on a portfolio):
$W = a + bX$: $E(W) = a + b\,E(X) = a + b\mu_x$, and $\mathrm{Var}(W) = \mathrm{Var}(a + bX) = b^2 \sigma_x^2$
$Z = aX + bY$: $\mathrm{Var}(Z) = s_z^2 = a^2\sigma_x^2 + b^2\sigma_y^2 + 2ab\,\mathrm{Cov}(X, Y)$
Binomial Distribution
n independent trials; binary result; same probability of success.

Probabilities
$0 \le$ all probabilities $\le 1$; outcomes are mutually exclusive and exhaustive.
$P(\bar{A}) = 1 - P(A)$
Quartiles: 25% | 25% | 25% | 25%, split at $Q_1$, $Q_2$, $Q_3$.
Chebyshev's Rule: $1 - \frac{1}{k^2}$ yields the % of data that falls within k std deviations.
Empirical rule (Normal): 68% within $1\sigma$, 95% within $2\sigma$, $\approx$100% within $3\sigma$.
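A quick simulation (a sketch, not part of the course material) confirms that normal data respects the Chebyshev lower bounds and tracks the empirical rule:

```python
import random

random.seed(0)
data = [random.gauss(0, 1) for _ in range(100_000)]  # mu = 0, sigma = 1

def frac_within(xs, k):
    """Fraction of values within k standard deviations of the mean."""
    return sum(-k < x < k for x in xs) / len(xs)

for k in (1, 2, 3):
    chebyshev_bound = 1 - 1 / k**2        # holds for ANY distribution
    observed = frac_within(data, k)       # normal data: ~68%, ~95%, ~99.7%
    assert observed >= chebyshev_bound
```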
$\mathrm{Var}(X) = E[(X - \mu)^2] = E(X^2) - E(X)^2\ \left(= E[(X - E(X))^2]\right)$
$\mathrm{Var}(X) = \sum_{\text{all } x_i} (x_i - \mu_x)^2\, P(X = x_i) = E[X^2] - \mu_x^2$
$\mathrm{Var}(X + c) = \mathrm{Var}(X)$
$\mathrm{Var}(cX) = c^2\, \mathrm{Var}(X)$
$E(X) = \sum_{\text{all } x_i} x_i\, P(X = x_i)$
$\mathrm{Cov}(X, Y) = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})$
$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
$P(A \mid B) = \frac{P(A \text{ and } B)}{P(B)}$
$\mu_x = E(X) = n p$
$\sigma^2 = \mathrm{Var}(X) = n p q = n p (1 - p)$
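The binomial mean and variance formulas can be checked directly from the PMF (a small sketch with made-up n and p):

```python
from math import comb

n, p = 10, 0.3
q = 1 - p
# Binomial PMF: P(X = k) = C(n, k) * p^k * q^(n-k)
pmf = [comb(n, k) * p**k * q**(n - k) for k in range(n + 1)]

mean = sum(k * pmf[k] for k in range(n + 1))               # should equal n*p = 3.0
var = sum((k - mean) ** 2 * pmf[k] for k in range(n + 1))  # should equal n*p*q = 2.1
```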
Joint vs. Marginal probabilities

Uniform Distribution
$E(X) = \frac{a + b}{2}$
$\mathrm{Var}(X) = \frac{(b - a)^2}{12}$
The y-axis should be a fraction (density) so that the total area = 1.

Decision Analysis

Joint Distribution
Independent if for all values:
$P_{x,y}(X = x \text{ and } Y = y) = P_X(x)\, P_Y(y)$

Conditional Distribution:
$P_{X|Y}(X = x \mid Y = y) = \frac{P_{X,Y}(x, y)}{P_Y(y)}$

Conditional Expectation:
$E(X \mid Y = y) = \sum_{\text{all } x} x\, P(X = x \mid Y = y)$

Covariance:
$P_{x,y} = P(X = x \text{ and } Y = y)$
$\sigma_{XY} = \sum_{\text{all } x, y} (x - \mu_x)(y - \mu_y)\, P_{x,y}$

Correlation:
$\rho = \frac{\sigma_{X,Y}}{\sigma_X \sigma_Y}$

From CLT: if $X \sim N(\mu, \sigma^2)$, then the Z-score of the sample mean:
$P(\bar{X} < A) = P\left(Z < \frac{A - \mu_x}{\sigma_x / \sqrt{n}}\right)$, or with the sample std dev, $P(\bar{X} < \text{avg}) = P\left(Z < \frac{\text{avg} - \mu}{s / \sqrt{n}}\right)$
$\mu_{\hat{p}} = p$, $\sigma_{\hat{p}} = \sqrt{\frac{p(1 - p)}{n}}$
Bias of an Estimator
In practice, n has to be relatively much larger, e.g. > 100.
Guesses should be unbiased and have minimum variance: MVUE (Minimum Variance Unbiased Estimator).
$\mathrm{bias}(\hat{\theta}) = E(\hat{\theta}) - \theta$
Unbiased if bias = 0 (the expected value equals the true value, not any particular value of x).
For samples, we divide by n − 1 instead of n to make an unbiased estimator; the guess would otherwise be too low.
$E(\bar{x}) = \sum_i x_i\, p_i = \mu$
$E(s^2) = \sigma^2$
$E[X + c] = E[X] + c$
$E[X + Y] = E[X] + E[Y]$
$E[aX] = a\,E[X]$
$E[aX + bY + c] = a\,E[X] + b\,E[Y] + c$
Example: Roulette has a $1 bet with a $35 payoff at 1-in-38 odds.
$E[\text{gain from a \$1 bet}] = -\$1 \cdot \frac{37}{38} + \$35 \cdot \frac{1}{38} = -\$0.0526$
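The roulette expectation is a one-liner:

```python
# Lose $1 with probability 37/38; win $35 with probability 1/38.
ev = -1 * (37 / 38) + 35 * (1 / 38)  # expected gain per $1 bet, about -$0.0526
```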
$\bar{x} \pm 1.96 \sqrt{\frac{s^2}{n}} = \bar{x} \pm 1.96 \frac{s}{\sqrt{n}}$, or recover the interval from its endpoints:
$\hat{p} = (U + L)/2$
$\mathrm{MoE} = (U - L)/2$
A confidence interval states, with the given confidence percentage, that the true population mean lies within the interval. It does not say what percentage of the population falls outside the interval. The larger the confidence percentage, the wider the interval and the smaller the risk of being incorrect.
Factors affecting the margin of error:
data variation $\sigma$: direct relation
sample size n: inverse relation
level of confidence $1 - \alpha$: direct relation
$n = \left(\frac{1.96}{0.05}\right)^2 p(1 - p) = 1536.64\, p(1 - p)$
Worst case, use p = 0.50. E.g. 95% confident, 2% accuracy, find n:
$n = \left(\frac{1.96}{0.02}\right)^2 (0.5)^2$
For small n, use Agresti with a different $\hat{p}$:
$\hat{p} = \frac{x + 2}{n + 4}$
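The sample-size arithmetic above as a small helper (the function name is mine):

```python
import math

def sample_size(z, moe, p=0.5):
    """n = (z / MoE)^2 * p * (1 - p); worst case uses p = 0.5. Rounds up."""
    return math.ceil((z / moe) ** 2 * p * (1 - p))
```

Usage: `sample_size(1.96, 0.05)` gives 385 for 95% confidence and 5% accuracy.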
4. The two sample sizes are both large (i.e. > 30) or both populations have normal distributions.
$Z = \frac{\hat{p} - p_0}{\sqrt{p_0(1 - p_0)/n}}$
Types of Errors
Type I: the null hypothesis is rejected when it is true.
Type II: the null hypothesis is not rejected when it is false.

$H_0: p_1 = p_2$, $H_a: p_1 < p_2$ ($H_a$: diff < 0): if $T < -1.64$, reject $H_0$.
$H_0: p_1 = p_2$, $H_a: p_1 > p_2$ ($H_a$: diff > 0): if $T > 1.64$, reject $H_0$.

If the interval is all positive, then $p_1 > p_2$. If the interval is all negative, then $p_1 < p_2$. If the interval spans 0, then one is not significantly bigger than the other (or it cannot be determined).
As long as n > 30, it doesn't matter if the sample size differs between the random variables.
If $A \sim N(\mu_A, \sigma_A^2)$ and $B \sim N(\mu_B, \sigma_B^2)$, find $P(X = A - B - c > 0)$ from the distribution of the difference, then the Z-score.

$\mathrm{Var}(\hat{p}_1 - \hat{p}_2) = \frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}$

The 95% confidence interval for $p_1 - p_2$ is:
$(\hat{p}_1 - \hat{p}_2) \pm 1.96 \sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}$

$T = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\bar{p}(1 - \bar{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$, where $\bar{p} = \frac{n_1 \hat{p}_1 + n_2 \hat{p}_2}{n_1 + n_2}$

Requirements:
One mean: $t_{\mathrm{stat}} = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$
Two means: $t_{\mathrm{stat}} = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$
$H_0: \mu = \mu_0$, $H_a: \mu > \mu_0$: if $t_{\mathrm{stat}} > 1.64$, reject $H_0$.
We use 1.96 because it is 2.5% on either side. We use 1.64 because it is 5% on a single side.
CI for the difference of means: $(\bar{x}_1 - \bar{x}_2) \pm 1.96 \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$
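The pooled two-proportion test statistic, as a sketch (the function name is mine):

```python
import math

def two_prop_t(x1, n1, x2, n2):
    """T for H0: p1 = p2, using the pooled proportion p_bar."""
    p1, p2 = x1 / n1, x2 / n2
    p_bar = (x1 + x2) / (n1 + n2)  # = (n1*p1 + n2*p2) / (n1 + n2)
    se = math.sqrt(p_bar * (1 - p_bar) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se
```

If $T > 1.64$ (one-sided) or $|T| > 1.96$ (two-sided), reject $H_0$.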
Matched Pairs
This is when there are two samples that are not independent, e.g. Weight Watchers before/after, or matched/shared characteristics.
Is the data matched or independent? If we don't take the match into account, the results are wrong.
To account for this, take the per-pair differences $X_{1i} - X_{2i}$ and then do a hypothesis test on the difference:
$H_0: \mu_D = 0$
$H_a: \mu_D > 0$
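A minimal paired-difference t statistic (a sketch; the data in the test is made up):

```python
import math
from statistics import mean, stdev

def paired_t(x1, x2):
    """t for H0: mean(D) = 0, where D_i = x1_i - x2_i (matched pairs)."""
    diffs = [a - b for a, b in zip(x1, x2)]
    n = len(diffs)
    # Standard one-sample t on the differences.
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))
```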
Goodness of fit (n = number of trials):
$H_0: p_1 = a_1, p_2 = a_2, \ldots, p_k = a_k$
$\chi^2 = \sum_{\text{all } i} \frac{(o_i - e_i)^2}{e_i}$

Regression - Errors
$e_i = \text{Observed} - \text{Fitted}$
Mean absolute deviation: $\frac{1}{n} \sum_{\text{all } i} |y_i - \hat{y}_i|$
$\mathrm{SSE} = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} e_i^2$
$s_e^2$ is our estimate of $\sigma^2$. $s_e = \sqrt{s_e^2}$ is our estimate of $\sigma$. The standard deviation of the noise, aka $s_e$ or Root MSE, is important as it is a good, unbiased estimate. It is more important than $R^2$ as a measure of the quality of a regression.
Use $s_e$ to form bands around the regression line:

Percent of y values: Band:
68%: $b_0 + b_1 X \pm 1\, s_e$
95%: $b_0 + b_1 X \pm 1.96\, s_e$
99%: $b_0 + b_1 X \pm 3\, s_e$
$s_e^2 = \frac{\mathrm{SSE}}{n - 2} = \frac{1}{n - 2} \sum_{i=1}^{n} e_i^2$
Assuming a 95% confidence interval, use:
$\hat{y} = b_0 + b_1 x \pm 1.96\, s_e$
For 68%, use 1 instead of 1.96. For 99%, use 3 instead of 1.96.
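Fitting the line and forming the ±1.96·se band, as a sketch:

```python
import math

def fit_line(xs, ys):
    """Least-squares b0, b1 and the Root MSE se (sqrt of SSE / (n - 2))."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    b1 = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
          / sum((x - xbar) ** 2 for x in xs))
    b0 = ybar - b1 * xbar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    se = math.sqrt(sse / (n - 2))
    return b0, b1, se

def band_95(b0, b1, se, x):
    """Approximate 95% band around the fitted line at x."""
    yhat = b0 + b1 * x
    return (yhat - 1.96 * se, yhat + 1.96 * se)
```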
A confidence interval for $\beta_1$ is:
$b_1 \pm 1.96\, s_{b_1}$
where:
$\mathrm{Var}(b_1) = s_{b_1}^2 = \frac{s_e^2}{(n - 1) s_x^2}$
and, for the intercept:
$s_{b_0}^2 = s_e^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{(n - 1) s_x^2} \right)$
$y_i = \beta_0 + \beta_1 x_i + \epsilon_i$, with $\epsilon_i \sim N(0, \sigma^2)$
These tests work identically for $\beta_1$ and $\beta_0$. We want to test whether $\beta_1$ equals a proposed value $\beta_1^*$. We always use a two-sided test.
$H_0: \beta_1 = \beta_1^*$
$H_a: \beta_1 \ne \beta_1^*$
To know if x affects y, test whether $\beta_1 = 0$. Use the following test statistic:
$T = \frac{b_1 - \beta_1^*}{s_{b_1}}$
Test Statistic: Decision Rule:
$H_0: \beta_1 = \beta_1^*$, $H_a: \beta_1 \ne \beta_1^*$: if $|T| > 1.96$, reject $H_0$.
$H_0: \beta_1 \ge \beta_1^*$, $H_a: \beta_1 < \beta_1^*$: if $T < -1.64$, reject $H_0$.
$H_0: \beta_1 \le \beta_1^*$, $H_a: \beta_1 > \beta_1^*$: if $T > 1.64$, reject $H_0$.
As before, if n < 30, use the t distribution.
Multiple Regression
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_k x_k + \epsilon$
$f = \frac{\mathrm{SSR}/k}{\mathrm{SSE}/(n - k - 1)} \sim F_{k,\, n-k-1}$
The decision rule is to reject $H_0$ if $f \ge f_{k,\, n-k-1,\, \alpha}$.
Variable Selection
Forward Stepwise Regression starts with no variables and
then adds one at a time.
Backward Stepwise Regression starts with all variables and then deletes the least important one, based on the largest p-value > 0.05. Refit and repeat. Stop when all variables are significant.
Use the p-value or the Confidence Interval rather than the t
stat for small (n < 30) datasets. The p-value is automatically
adjusted by Stata, whereas the correct t stat must be looked
up in a table.
The smallest p-value determines the most important variable
(do not use the coefficients!).
Predictions on the full model might still be better than the
simplified model. Backward regression may not result in the
same results as forward regression.
Dummy Variables
$t = \frac{b_j - \beta_j}{s_{b_j}}$; significant when $|t| > 1.96$.
$R^2 = 1 - \frac{\sum_{i=1}^{n} e_i^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = \frac{\mathrm{SSR}}{\mathrm{SST}} = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}}$
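R² from residuals, as a sketch:

```python
def r_squared(ys, fitted):
    """R^2 = 1 - SSE/SST; SSE = sum(e_i^2), SST = sum((y_i - ybar)^2)."""
    ybar = sum(ys) / len(ys)
    sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    sst = sum((y - ybar) ** 2 for y in ys)
    return 1 - sse / sst
```

A perfect fit gives 1; predicting the mean everywhere gives 0.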
Normality
It is a huge assumption that the errors in our regression model are normally distributed. This enables constructing confidence intervals and doing hypothesis tests. This assumption must be validated.
There are three different normality tests in Stata. $H_0$: errors are normally distributed. $H_a$: errors are not normally distributed.
predict res, r
swilk res
sfrancia res
sktest res
Heteroskedasticity
Non-constant variance. We assume that the variance is consistent across all values of all x's. This may not be the case.
Homoskedastic noise:
$\epsilon_i \sim N(0, \sigma^2)$
A plot would have a straight band around the data, i.e. $\mathrm{Var}(\epsilon_i) = \sigma^2$. Irrespective of $x_i$, the variance is the same.
If we assume homoskedasticity, we have to verify our assumption as always.
Heteroskedastic noise:
$\epsilon_i \sim N(0, \sigma_i^2)$
Note the subscript on $\sigma_i^2$, which means the variance changes for a given $x_i$. This implies that there are a lot of different variances to estimate, complicating the model. A plot of either y vs. x or residuals vs. fitted values would show a spread of data points in a fan or tunnel shape. A frequently occurring situation is that $\mathrm{Var}(\epsilon_i) = x_i^2 \sigma^2$, meaning the variance increases with larger values of x.
With heteroskedasticity, the estimates are OK, but the standard errors are incorrect, which is a problem in the model. The coefficients are useful, but the p-values are wrong.
An easy way (at this level) to resolve this is by taking log(y). This makes comparison of models more difficult. x transforms can be compared against prior models because the units of y have not changed.
Test for homoskedasticity in Stata: hettest. $H_0$ is constant variance, and $H_a$ is heteroskedasticity. Once again, we want a large p-value, if possible.
Multicollinearity
Some or all x's are related to each other, when they should only be related to y. Two highly related x's confuse the regression algorithm. The standard deviation of the regression coefficients ($s_{b_i}$) will be disproportionately large, resulting in t ratios that are too small because, recall:
$t_{\mathrm{stat}} = \frac{b_i}{s_{b_i}}$
When the t stat is too small, it will seem that some variables are not needed when in fact they are needed.
Step-wise regression
Variance Inflation Factors (VIF) are a good automated way of discovering this.
Take a regression of all permutations of x, i.e. regress $x_1$ on $x_2, x_3$; regress $x_2$ on $x_1, x_3$; regress $x_3$ on $x_1, x_2$.
Note $R^2$ on each regression, and then calculate:
$\mathrm{VIF}_j = \frac{1}{1 - R_j^2}$
Nonlinearities
By default, regresses y on y2 , y3 , y4 . A significant value
means that polynomial terms should be added. H0 : no
transformations of x needed, Ha : polynomial or other
transformations are needed. A small p-value means one of
the xs (but not which) needs to be transformed.
In Stata, use ovtest. Re-run again. Might need a x2 and x3
and maybe other terms. (Adding a cubic term might need to
drop the linear term). As usual, be wary of overfitting
especially with polynomials.
Finding Outliers
Recall: influential observations are the worst (extreme in x
and y).
Z scores and standard deviations are a simple and quick way
to find outliers.
Cook's distance is (approximately) $|e_i| \cdot |x_i - \bar{x}|$ (high residual and far from mean(x)). Look for extreme Cook values to know which observations to drop.
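A quick z-score outlier screen (a sketch; the threshold is a judgment call):

```python
from statistics import mean, stdev

def z_outliers(xs, threshold=3.0):
    """Flag values more than `threshold` sample std devs from the mean."""
    m, s = mean(xs), stdev(xs)
    return [x for x in xs if abs(x - m) / s > threshold]
```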