Crash Course On Basic Statistics: University of New York at Stony Brook
Contents

1 Basic Probability
  1.1 Basic Definitions
  1.2 Probability of Events
  1.3 Bayes' Theorem
2 Basic Definitions
  2.1 Types of Data
  2.2 Errors
  2.3 Reliability
  2.4 Validity
  2.5 Probability Distributions
  2.6 Population and Samples
  2.7 Bias
  2.8 Questions on Samples
  2.9 Central Tendency
3 The Normal Distribution
4 The Binomial Distribution
5 Confidence Intervals
6 Hypothesis Testing
7 The t-Test
8 Regression
9 Logistic Regression
10 Other Topics
11 Some Related Questions
Chapter 1
Basic Probability
For mutually exclusive events E and F:

P(E ∪ F) = P(E) + P(F).

In general,

P(E ∪ F) = P(E) + P(F) − P(E ∩ F),

where

P(E ∩ F) = P(E) × P(F|E).

Bayes' theorem:

P(A|B) = P(A ∩ B)/P(B) = P(B|A)P(A) / [P(B|A)P(A) + P(B|∼A)P(∼A)]
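As a quick numeric illustration of Bayes' theorem, here is a minimal Python sketch; the probabilities used (detection rate 0.9, false-positive rate 0.1, prevalence 0.05) are made-up example values, not from the notes:

```python
# Bayes' theorem: P(A|B) = P(B|A)P(A) / [P(B|A)P(A) + P(B|~A)P(~A)]
def bayes(p_b_given_a, p_b_given_not_a, p_a):
    numerator = p_b_given_a * p_a
    return numerator / (numerator + p_b_given_not_a * (1 - p_a))

# Hypothetical example: a test with 90% detection, 10% false positives,
# applied to a condition with 5% prevalence.
posterior = bayes(0.9, 0.1, 0.05)  # P(A|B) ≈ 0.321
```

Even with a fairly accurate test, the low prior P(A) keeps the posterior well below the detection rate.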
• Frequentist: interprets an event's probability as its long-run relative frequency over repeated trials.
Chapter 2
Basic Definitions
• Systematic errors: have an observable pattern and are not due to chance, so their causes can often be identified.

• Predictive validity: the ability to draw inferences about some event in the future.
2.6 Population and Samples

• We rarely have access to the entire population of users. Instead we rely on a subset of the population to use as a proxy for the population.

• Sample statistics estimate unknown population parameters.

• Ideally you should select your sample randomly from the parent population, but in practice this can be very difficult due to:
– issues establishing a truly random selection scheme,
– problems getting the selected users to participate.

• Every member of the population has a known probability of being selected for the sample.

• Systematic sample:
– Calculate the sampling interval by dividing the size of the population by the number of subjects you want in the sample.
– Useful when the population accrues over time and there is no predetermined list of population members.
– One caution: making sure the data is not cyclic.

• Stratified sample: the population of interest is divided into non-overlapping groups or strata based on common characteristics.

• Cluster sample: the population is sampled by using pre-existing groups. It can be combined with the technique of sampling proportional to size.

• Bias can occur in the way information is collected about the subjects.

• Was it truly randomly selected?

• Were there any biases in the selection process?

• For odd sample sizes, the median is the central value, the ((n + 1)/2)th. For even sample sizes, it is the average of the two central values, the (n/2)th and (n/2 + 1)th.
Mode

• The most frequently occurring value.

• It is useful for describing ordinal or categorical data.
Dispersion

• Range: the simplest measure of dispersion, the difference between the highest and lowest values.

• Interquartile range: less influenced by extreme values.

• Variance:
– The most common way to measure dispersion for continuous data.
– Provides an estimate of the average difference of each value from the mean.
– For a population:

σ² = (1/n) Σ_{i=1}^{n} (x_i − µ)²

– For a sample:

s² = (1/(n − 1)) Σ_{i=1}^{n} (x_i − x̄)²

• Standard deviation:

– For a population: σ = √σ²

– For a sample: s = √s²
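A minimal Python sketch of the two estimators above (divide by n for a population, by n − 1 for a sample); the data values are made up for the example:

```python
import math

def variance(values, sample=True):
    """Sample variance uses the n - 1 (Bessel) correction; population uses n."""
    n = len(values)
    mean = sum(values) / n
    ss = sum((x - mean) ** 2 for x in values)
    return ss / (n - 1) if sample else ss / n

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
pop_var = variance(data, sample=False)   # σ² = 4.0 for this data
samp_var = variance(data, sample=True)   # s² = 32/7 ≈ 4.571
pop_sd = math.sqrt(pop_var)              # σ = 2.0
```

The sample variance is always slightly larger, compensating for the fact that deviations are measured from the sample mean rather than the unknown population mean.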
Chapter 3

The Normal Distribution

• All normal distributions are symmetric, unimodal (a single most common value), and have a continuous range from negative infinity to positive infinity.

• The total area under the curve adds up to 1, or 100%.

• Empirical rule: for a population that follows a normal distribution, almost all the values (99.7%) will fall within three standard deviations above and below the mean. Two standard deviations cover about 95%; one standard deviation covers about 68%.

The Central Limit Theorem

• As the sample size approaches infinity, the distribution of sample means will follow a normal distribution, N(µ, σ²/n).

• If we are interested in p, a proportion or the probability of an event with 2 outcomes, we use the estimator p̂, the proportion of times we see the event in the data. The sampling distribution of p̂ (an unbiased estimator of p) has expected value p and standard deviation √(p(1 − p)/n). By the central limit theorem, for large n the sampling distribution is p̂ ∼ N(p, p(1 − p)/n).

The Z-Statistic

• A z-score is the distance of a data point from the mean, expressed in units of standard deviation. The z-score for a value from a population is:

Z = (x − µ)/σ

• Conversion to z-scores places distinct populations on the same metric.
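The z-score formula above can be sketched in a couple of lines of Python; the population parameters (µ = 100, σ = 15) are made-up example values:

```python
# z-score: distance of a value from the mean, in standard-deviation units.
def z_score(x, mu, sigma):
    return (x - mu) / sigma

z = z_score(130, 100, 15)   # 130 is exactly two standard deviations above the mean
```

By the empirical rule, a value with |z| > 2 sits outside the central ~95% of a normal population.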
Chapter 4

The Binomial Distribution

P(X = k) = nCk · p^k · (1 − p)^(n−k),

where

nCk = n! / (k!(n − k)!)
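A minimal sketch of the binomial probability above, using Python's built-in binomial coefficient; the coin-flip example is illustrative:

```python
# Binomial probability P(X = k) for n trials with success probability p.
from math import comb

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Example: probability of exactly 2 heads in 4 fair coin flips.
prob = binom_pmf(2, 4, 0.5)  # → 0.375
```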
Chapter 5
Confidence Intervals
• The first method to estimate binary success rates is given by the Wald interval:

p̂ ± z_(1−α/2) √(p̂(1 − p̂)/n),

where p̂ is the sample proportion, n is the sample size, and z_(1−α/2) is the critical value from the normal distribution for the level of confidence.

• Very inaccurate for small sample sizes (less than 100) or for proportions close to 0 or 1.

• For 95% (z = 1.96) confidence intervals, we can add two successes and two failures to the observed number of successes and failures:

p̂_adj = (x + z²/2) / (n + z²) = (x + 1.96²/2) / (n + 1.96²) ≈ (x + 2) / (n + 4),   (5.0.1)

where x is the number who successfully completed the task and n the number who attempted the task (sample size).

• We also want our sample mean to be unbiased, so that any sample mean is just as likely to overestimate as to underestimate the population mean. The median does not share this property: at small samples, the sample median of completion times tends to consistently overestimate the population median.

• For small-sample task-time data the geometric mean estimates the population median better than the sample median. As sample sizes get larger (above 25), the median tends to be the best estimate of the middle value. We can use the binomial distribution to estimate the confidence intervals: the following formula constructs a confidence interval around any percentile; the median (p = 0.5) is the most common:

np ± z_(1−α/2) √(np(1 − p)),

where n is the sample size, p is the percentile expressed as a proportion, and √(np(1 − p)) is the standard error. The confidence interval around the median is given by the values at the integer ranks produced by this formula (in the ordered data set).
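The adjusted-Wald interval of equation (5.0.1) can be sketched directly; the completion counts (9 of 11 users) are a made-up example:

```python
import math

def adjusted_wald(x, n, z=1.96):
    """95% adjusted-Wald CI: add z^2/2 successes and z^2/2 failures, then Wald."""
    p_adj = (x + z**2 / 2) / (n + z**2)
    n_adj = n + z**2
    half = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    return p_adj - half, p_adj + half

# Example: 9 of 11 users completed the task.
low, high = adjusted_wald(9, 11)
```

Note how the adjustment pulls the point estimate toward 0.5, which is what makes the interval behave well for small n and extreme proportions.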
Chapter 6

Hypothesis Testing
• H0 is called the null hypothesis and H1 the alternative hypothesis. They are mutually exclusive and exhaustive. A null hypothesis can never be proven to be true; it can only be shown to be plausible.

• The alternative hypothesis can be single-tailed, meaning the statistic must achieve some value to reject the null hypothesis,

H0: µ1 ≤ µ2,  H1: µ1 > µ2,

or two-tailed, meaning it must differ from a certain value to reject the null hypothesis,

H0: µ1 = µ2,  H1: µ1 ≠ µ2.

• Statistically significant means the observed result is unlikely to be due to chance.

• If we fail to reject the null hypothesis (do not find significance), this does not mean that the null hypothesis is true, only that our study did not find sufficient evidence to reject it.

• Tests are not reliable if the statement of the hypothesis is suggested by the data: data snooping.

p-value

• We choose the probability level or p-value that defines when sample results will be considered strong enough to support rejection of the null hypothesis.

• Expresses the probability that extreme results obtained in an analysis of sample data are due to chance.

• A low p-value (for example, less than 0.05) means that the null hypothesis is unlikely to be true.

• With null hypothesis testing, all it takes is sufficient evidence (instead of definitive proof) that there is at least some difference. The size of the difference is given by the confidence interval around the difference.

• A small p-value might occur:
– by chance,
– because of problems related to data collection,
– because of violations of the conditions necessary for the testing procedure,
– because H0 is false.

• If multiple tests are carried out, some are likely to be significant by chance alone! For α = 0.05, we expect significant results about 5% of the time.

• Be suspicious when you see a few significant results after many tests have been carried out, or significant results on a few subgroups of the data.
• Power of a test: the probability that the test rejects the null hypothesis when the null hypothesis is false (1 − β, where β is the probability of a Type II error).
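A quick simulation sketch of the multiple-testing caution above, under the standard assumption that p-values are uniform on [0, 1] when H0 is true; the trial count and seed are arbitrary:

```python
import random

random.seed(0)
alpha = 0.05
trials = 10_000
# Under H0 a p-value is uniformly distributed, so each test is
# "significant" with probability alpha purely by chance.
false_positives = sum(random.random() < alpha for _ in range(trials))
rate = false_positives / trials   # close to 0.05
```

Running many tests therefore guarantees some spurious significant results, which is exactly why isolated hits among many comparisons deserve suspicion.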
Chapter 7

The t-Test
t-Distribution

• The t-distribution adjusts for how good our estimate is by making the intervals wider as the sample sizes get smaller. It converges to the normal confidence intervals as the sample size increases (more than 30).

• Two main reasons for using the t-distribution to test differences in means: when working with small samples from a population that is approximately normal, and when we do not know the standard deviation of the population and need to use the standard deviation of the sample as a substitute.

• t-distributions are continuous and symmetrical, and a bit fatter in the tails.

• Unlike the normal distribution, the shape of the t-distribution depends on the degrees of freedom for a sample.

• At smaller sample sizes, sample means fluctuate more around the population mean. For example, instead of 95% of values falling within 1.96 standard deviations of the mean, at a sample size of 15 they fall within 2.14 standard deviations.

The t confidence interval is:

x̄ ± t_(1−α/2) · s/√n,

where x̄ is the sample mean, n is the sample size, s is the sample standard deviation, and t_(1−α/2) is the critical value from the t-distribution for n − 1 degrees of freedom and the specified level of confidence. To compute it we need:

1. the mean and standard deviation of the sample,

2. the standard error,

3. the sample size,

4. the critical value from the t-distribution for the desired confidence level.
The t statistic is simply:

t = (x̄ − µ) / (s/√n),

where the null hypothesis is that there is no significant difference between the mean of the population from which your sample was drawn and the mean of the known population.

The standard deviation of the sampling distribution of the difference between two sample means is:

s √(n1⁻¹ + n2⁻¹)
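The one-sample t statistic above can be sketched in plain Python; the sample values and the reference mean µ = 10 are made up for the example:

```python
import math

def t_statistic(sample, mu):
    """t = (x̄ − µ) / (s / √n), with s the sample standard deviation (n − 1)."""
    n = len(sample)
    mean = sum(sample) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
    return (mean - mu) / (s / math.sqrt(n))

times = [12.0, 14.0, 11.0, 15.0, 13.0]   # hypothetical measurements
t = t_statistic(times, mu=10.0)          # ≈ 4.243
```

To decide significance, this statistic would be compared against the t-distribution critical value for n − 1 = 4 degrees of freedom.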
Chapter 8

Regression
The multiple regression equation with n predictors is:

Y = β0 + β1X1 + β2X2 + ... + βnXn + ε,
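For the simple one-predictor case, the least-squares coefficients have a closed form; here is a minimal sketch, with made-up data chosen to lie exactly on y = 1 + 2x:

```python
# Closed-form least-squares fit for simple linear regression y = b0 + b1*x.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
         sum((x - mx) ** 2 for x in xs)
    b0 = my - b1 * mx
    return b0, b1

b0, b1 = fit_line([0, 1, 2, 3], [1, 3, 5, 7])  # → (1.0, 2.0)
```

These are the values that make the average residual 0 and minimize the residual variance, which is the criterion discussed later in the notes.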
Chapter 9

Logistic Regression
• We use logistic regression when the dependent variable is dichotomous rather than continuous.

• Outcome variables are conventionally coded as 0-1, with 0 representing the absence of a characteristic and 1 its presence.

• The outcome variable in logistic regression is a logit, which is a transformation of the probability of a case having the characteristic in question.

• The logit is also called the log odds. If p is the probability of a case having some characteristic, the logit is:

logit(p) = log(p/(1 − p)) = log(p) − log(1 − p).

• The logistic regression equation with n predictors is:

logit(p) = β0 + β1X1 + β2X2 + ... + βnXn + ε.

• The differences from multiple linear regression with a categorical outcome are:
– The assumption of homoscedasticity (common variance) is not met with categorical variables.
– Multiple linear regression can return values outside the permissible range of 0-1.
– Logistic regression assumes independence of cases, linearity (a linear relationship between the logit of the outcome variable and any continuous predictor), no multicollinearity, and no complete separation (the value of one variable cannot be perfectly predicted by the values of another variable or set of variables).

• As with linear regression, we have measures of model fit for the entire equation (evaluating it against the null model with no predictor variables) and tests for each coefficient (evaluating each against the null hypothesis that the coefficient is not significantly different from 0).

• The interpretation of the coefficients is different: instead of interpreting them in terms of linear changes in the outcome, we interpret them in terms of odds ratios.

Ratio, Proportion, and Rate

Three related metrics:

• A ratio expresses the magnitude of one quantity in relation to the magnitude of another, without making further assumptions about the two numbers or requiring them to share the same units.

• A proportion is a particular type of ratio in which all cases in the numerator are included in the denominator. Proportions are often expressed as percents.
Odds Ratio

• The odds ratio was developed for use in case-control studies.

• The odds ratio is the ratio of the odds of exposure for the case group to the odds of exposure for the control group.

OR = odds1/odds2 = [p1/(1 − p1)] / [p2/(1 − p2)]
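A one-line sketch of the odds ratio formula above; the exposure probabilities (0.4 in cases, 0.2 in controls) are made-up example values:

```python
# Odds ratio between two groups: OR = [p1/(1-p1)] / [p2/(1-p2)].
def odds_ratio(p1, p2):
    return (p1 / (1 - p1)) / (p2 / (1 - p2))

orr = odds_ratio(0.4, 0.2)  # (0.4/0.6) / (0.2/0.8) = 8/3 ≈ 2.67
```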
Chapter 10
Other Topics
Nonparametric Statistics

• Distribution-free statistics: make few or no assumptions about the underlying distribution of the data.

The z statistic with continuity correction (±0.5) is:

Z = ((X ± 0.5) − np) / √(np(1 − p)),
Chapter 11

Some Related Questions

Design an experiment to see whether a new feature is better?

• Questions:
– How often does Y occur?
– Do X and Y co-vary? (requires measuring X and Y)
– Does X cause Y? (requires manipulating X, measuring Y, and accounting for other independent variables)

• We have:
– goals for the visitor (sign up, buy),
– metrics to improve (visit time, revenue, return).

• We are manipulating independent variables (independent of what the user does) to measure dependent variables (what the user does).

• We can test 2 versions with an A/B test, or many full versions of a single page.

• To compare how different web pages perform, use a random sample of visitors: define what percentage of the visitors are included in the experiment, and choose which object to test.

Mean standard deviation of two blended datasets for which you know their means and standard deviations:

The difference between the means of two samples, A and B, both randomly drawn from the same normally distributed source population, belongs to a normally distributed sampling distribution whose overall mean is equal to zero and whose standard deviation ("standard error") is equal to

√(sd²/n_a + sd²/n_b),

where sd² is the variance of the source population.

If each of the two coefficient estimates in a regression model is statistically significant, do you expect the test of both together is still significant?

• The primary result of a regression analysis is a set of estimates of the regression coefficients.

• These estimates are made by finding values for the coefficients that make the average residual 0 and the standard deviation of the residual term as small as possible.

• In order to see if there is evidence supporting the inclusion of a variable in the model, we start by hypothesizing that it does not belong, i.e., that its true regression coefficient is 0.

• Dividing the estimated coefficient by the standard error of the coefficient yields the t-ratio of the variable, which shows how many standard-deviations-worth of sampling error would have to have occurred in order to yield an estimated coefficient so different from the hypothesized true value of 0.
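The standard error of the difference between two sample means, √(sd²/n_a + sd²/n_b), can be sketched directly; the values (sd = 10, n_a = n_b = 25) are made up for the example:

```python
import math

def se_diff_means(sd, na, nb):
    """Standard error of (mean_A - mean_B) for samples from one population."""
    return math.sqrt(sd**2 / na + sd**2 / nb)

se = se_diff_means(sd=10.0, na=25, nb=25)  # √(100/25 + 100/25) = √8 ≈ 2.83
```

Observed differences much larger than a couple of these standard errors would be surprising if both samples really came from the same population.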
• What is the relative importance of variation in the explanatory variables in explaining observed variation in the dependent variable? The beta weights of the explanatory variables can be compared.

• Beta weights are the regression coefficients for standardized data. Beta is the average amount by which the dependent variable increases when the independent variable increases by one standard deviation and the other independent variables are held constant. The ratio of the beta weights is the ratio of the predictive importance of the independent variables.

How do you test a data sample to see whether it is normally distributed?

• Any linear combination of independent normal deviates is a normal deviate!

How to analyse non-normal distributions?

• For example: 6-sigma events don't happen in normal distributions!
– …when you need to be sure that a normal distribution can be assumed.
– Data follows a different distribution: Poisson distribution (rare events), exponential distribution (bacterial growth), lognormal distribution (for length data), binomial distribution (proportion data such as percent defectives).
– Equivalent tools for non-normally distributed data:
∗ t-test, ANOVA: median test, Kruskal-Wallis test.
∗ paired t-test: one-sample sign test.

SDy = SDm √12

Given a dataset, how do you determine its sample distribution? Please provide at least two methods.

• Plot a histogram: it should give you an idea as to which parametric family (exponential, normal, log-normal, ...) might provide a reasonable fit. If there is one, you can proceed with estimating the parameters.

• If you use a large enough sample size, you can apply the Central Limit Theorem to a sample proportion for categorical data to find its sampling distribution.

• You may have 0/1 responses, so the distribution is binomial, maybe conditional on some other covariates; that is a logistic regression.

• You may have counts, so the distribution is Poisson.

• Fit by regression, and use a chi-square test for the goodness of the fit.

• Also, a small sample (say under 100) will be compatible with many possible distributions; there is simply no way to distinguish between them on the basis of the data alone. On the other hand, large samples (say more than 10000) won't be compatible with anything.

What's the difference between the t-stat and R² in a regression? What does each measure? When you get a very large value in one but a very small value in the other, what does that tell you about the regression?

• Correlation is a measure of association between two variables; the variables are not designated as dependent or independent. The significance (probability) of the correlation coefficient is determined from the t-statistic. The test indicates whether the observed correlation coefficient occurred by chance if the true correlation is zero.

• Regression is used to examine the relationship between one dependent variable and one independent variable. The regression statistics can be used to predict the dependent variable when the independent variable is known. Regression goes beyond correlation by adding prediction capabilities.

• The significance of the slope of the regression line is determined from the t-statistic. It is the probability that the observed correlation coefficient occurred by chance if the true correlation is zero. Some researchers prefer to report the F-ratio instead of the t-statistic; the F-ratio is equal to the t-statistic squared.

• The t-statistic for the significance of the slope is essentially a test to determine if the regression