
ACM31 II.

1 – Univariate Statistics: Principles and Key Terms


Three fundamental questions
Estimation: Which parameter value in the population is most probable for a given outcome of sample observations?
Tests: Given a certain parameter value, are the observations from a sample probable or improbable?
Confidence intervals: For which intervals of parameter values are the observations from a sample probable (plausible), and for which values are they improbable (implausible)?
Terms
Term for the population (sample space) U          Term for the sample level S
μ      Arithmetic mean (metric case), E(X1) = μ   x̄      Arithmetic mean
σ²     Variance, Var(X1) = σ²                      s²     Empirical variance
σ      Standard deviation                          s      Empirical standard deviation
π      Arithmetic mean (proportion) of a nominal attribute   p      Arithmetic mean (proportion) of a nominal attribute
β      Slope in linear regression                  b      Slope in linear regression
ρx,y   Correlation coefficient between two metric attributes   rx,y   Correlation coefficient
N      Size of the population                      n      Sample size
Probability model level: observed values are realizations and are written with small letters; the variables themselves are written with capital letters (X).
Rules for expectancy and variance
Rule 1: E(X ± Y) = E(X) ± E(Y)
Rule 2: E(a·X) = a·E(X)
Rule 3: Var(X) = E(X²) − (E(X))²
Rule 4: Var(a·X) = a²·Var(X)
Rule 5: Var(X + Y) = Var(X) + Var(Y) (for independent X and Y)
Definition formulas for expectancy and variance
Discrete case:    E(X) = Σᵢ₌₁ⁿ xᵢ·pᵢ              Var(X) = Σᵢ₌₁ⁿ (xᵢ − E(X))²·pᵢ
Continuous case:  E(X) = ∫₋∞⁺∞ x·f(x) dx          Var(X) = ∫₋∞⁺∞ (x − E(X))²·f(x) dx
1.2.1 Central limit theorem
If X1, X2, …, Xn are independent and identically distributed random variables with E(Xi) = μ and Var(Xi) = σ²,
then the random variables S = X1 + X2 + … + Xn and X̄ = S/n have the following expectancies and variances:
E(S) = n·μ    Var(S) = n·σ²    E(X̄) = μ    Var(X̄) = σ²/n
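These properties can be checked by simulation. The sketch below (plain Python, standard library only; Python is not part of the course toolkit of SPSS, Excel, and a TI calculator) draws many sample means from a uniform distribution and compares their mean and variance with μ and σ²/n:

```python
import random
import statistics

random.seed(42)

# Underlying distribution: uniform on [0, 1], so mu = 0.5 and sigma^2 = 1/12
mu, var = 0.5, 1 / 12
n = 30          # sample size
reps = 20_000   # number of repeated samples

# Draw many sample means X-bar
means = [statistics.fmean(random.random() for _ in range(n)) for _ in range(reps)]

print(statistics.fmean(means))      # close to mu = 0.5
print(statistics.variance(means))   # close to sigma^2 / n = 1/360 ≈ 0.00278
```

The simulated distribution of X̄ also looks bell-shaped for n = 30, which is exactly what the theorem predicts.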
1.3.1 Empirical variance and empirical standard deviation
s² = 1/(n−1) · Σᵢ₌₁ⁿ (xᵢ − x̄)²        S² = 1/(n−1) · Σᵢ₌₁ⁿ (Xᵢ − X̄)²
1.3.2 Confidence Intervals for Arithmetic Means; z-Intervals
For any error risk α, sample mean x̄, sample size n, and standard deviation σ, the interval
x̄ − z(1−α/2)·σ/√n  to  x̄ + z(1−α/2)·σ/√n
contains the arithmetic mean μ of the sample space with a probability of 1 − α.
Remarks
- z(1−α/2) can be calculated with invNorm(1 − α/2, μ = 0, σ = 1)
- Verbalizing: "the interval x̄ ± 1.96·σ/√n contains μ with a probability of 95%"
- The smaller the error risk, the broader the confidence interval
- The larger the sample, the smaller the confidence interval
- The bigger the standard deviation, the broader the confidence interval
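As a sketch of the formula above (the numbers are made up; `NormalDist().inv_cdf` plays the role of invNorm):

```python
from statistics import NormalDist

def z_interval(x_bar, sigma, n, alpha=0.05):
    """z-confidence interval for mu when sigma is known."""
    z = NormalDist().inv_cdf(1 - alpha / 2)   # same as invNorm(1 - alpha/2, 0, 1)
    half = z * sigma / n ** 0.5               # half-length of the interval
    return x_bar - half, x_bar + half

# Example (made-up numbers): x-bar = 23, sigma = 1.5, n = 36, alpha = 5%
lo, hi = z_interval(23, 1.5, 36)
print(round(lo, 3), round(hi, 3))   # 23 ± 1.96 * 1.5/6 ≈ 23 ± 0.49
```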
Finite population correction (hardly ever applied and not meaningful above n = 300)
In case of a finite sample space, the standard error σ/√n of the arithmetic mean must be replaced by
(σ/√n) · √((N − n)/(N − 1))
1.3.3 t-Intervals
Conditions for applying t-intervals:
1. Assume that the underlying distribution of the random variable in the sample space is a normal distribution (indispensable for small samples).
When the conditions are not met: eliminate extreme outliers (and record it), bootstrap, or use the median instead of the mean. (A dichotomous attribute follows a binomial distribution.)
The interval x̄ − t(α/2, v)·s/√n to x̄ + t(α/2, v)·s/√n contains the arithmetic mean μ of the sample space with a probability of 1 − α.
Calc: invT(1 − α/2, df), with v = df = n − 1.
When using invT on the calculator, always use the one-sided value for the area, e.g. for a 99% confidence level use invT(0.995, df).
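A minimal sketch of the t-interval (made-up sample values). Since the Python standard library has no t-quantile function, the critical value is taken from a t-table, exactly as invT(0.975, 9) would return it:

```python
# t-interval sketch: the quantile t_{1-alpha/2, v} comes from a t-table
# (the calculator's invT(0.975, 9) gives the same number).
def t_interval(x_bar, s, n, t_crit):
    half = t_crit * s / n ** 0.5
    return x_bar - half, x_bar + half

# Made-up sample: x-bar = 23, s = 1.5, n = 10 => df = 9, t_{0.975, 9} ≈ 2.262
lo, hi = t_interval(23, 1.5, 10, t_crit=2.262)
print(round(lo, 2), round(hi, 2))
```

Note how the t-interval for n = 10 is noticeably wider than the corresponding z-interval would be, which reflects the extra uncertainty from estimating σ by s.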
1.3.7 Confidence Intervals for Proportions, Nominal Attributes
T = (X̄ − μ)/(S/√n) (the t-distributed statistic from 1.3.3);  p = x/n
E(X) = n·π and Var(X) = n·π·(1 − π)
Rule of thumb: For n·π·(1 − π) > 9, a binomial distribution can be approximated by a normal distribution.
P = X/n,  E(P) = E(X)/n = n·π/n = π,  and standard error √Var(P) = √(n·π·(1 − π)/n²) = √(π·(1 − π)/n)
Absolute error ≈ 2·standard error (for α = 5%, since z(0.975) ≈ 1.96 ≈ 2); F = half of the length of the confidence interval.
z-confidence interval for proportions:
Provided n·p·(1 − p) > 9 holds, the probability that the proportion π of the sample space is between
p − z(1−α/2)·√(p·(1 − p)/n)  and  p + z(1−α/2)·√(p·(1 − p)/n)
is 1 − α.
The length of the confidence interval is largest at p = 0.5, as the data is split evenly.
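The proportion interval above can be sketched in a few lines (made-up counts; the rule of thumb is checked before the normal approximation is used):

```python
from statistics import NormalDist

def proportion_interval(p, n, alpha=0.05):
    """z-confidence interval for a proportion; requires n*p*(1-p) > 9."""
    assert n * p * (1 - p) > 9, "normal approximation not justified"
    z = NormalDist().inv_cdf(1 - alpha / 2)
    se = (p * (1 - p) / n) ** 0.5     # standard error of P
    return p - z * se, p + z * se

# Example: 120 successes in n = 400 => p = 0.3
lo, hi = proportion_interval(0.3, 400)
print(round(lo, 4), round(hi, 4))
```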
1.3.8 Sample Size
n ≥ z²(α/2)·σ²/F²  (metric case)        n ≥ z²(α/2)·π·(1 − π)/F²  (proportions)
Assumes that π and σ are given; if they are not given, a pilot study can be conducted.
If that is not possible: use 0.25 for σ² or π·(1 − π).
If the length of the confidence interval is given: L = 2·F.
Note: To cut the absolute error F in half, the sample size must be quadrupled (n grows with 1/F²).
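A short sketch of the proportion formula, using the worst case p = 0.5 (so p·(1 − p) = 0.25) and an illustrative margin F:

```python
from math import ceil
from statistics import NormalDist

def sample_size_proportion(F, alpha=0.05, p=0.5):
    """Smallest n so the half-length F of the CI is achieved.
    p = 0.5 is the worst case (p*(1-p) = 0.25)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return ceil(z ** 2 * p * (1 - p) / F ** 2)

print(sample_size_proportion(0.03))   # F = 3 percentage points
print(sample_size_proportion(0.015))  # halving F roughly quadruples n (ceiling aside)
```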
2.1.1 Hypothesis Testing
Step 1: formulate the hypotheses.
A hypothesis test must be based on two mutually exclusive hypotheses on the sample space.
Null hypothesis H0: The scrap rate is at most 2%.  H0: π ≤ 0.02
Alternative hypothesis HA: The scrap rate exceeds 2%.  HA: π > 0.02

Step 2: select the error risk α.
In our example, we select an α of 5%; this is a threshold often used in testing.

Step 3: plan the experiment, fix the sample size, define the test statistic.
Example: In a sample of n = 400, chips are tested and imperfect chips are counted. Sample size n = 400. Test statistic X = number of imperfect chips in the sample.

Step 4: determine the region of rejection of the null hypothesis and set up a decision rule.
Since our sample size is n = 400, our reasoning is as follows: provided H0: π ≤ 0.02 is true, we expect at most 400 · 0.02 = 8 imperfect chips on average.
Set up the decision rule and determine the region of rejection of the null hypothesis (in Excel: the first value whose upper-tail probability drops below the error risk is c, so R = {c, c+1, c+2, …}).
Decision rule: reject H0 if and only if at least c imperfect chips are observed (replace c with its value).
We need to determine this critical value c, which implies a region of rejection (also called critical region) based on probability theory.
To find the critical value c, follow this train of thought:
c is the smallest solution x of the inequation P(X ≥ x) < α (equivalently, P(X < x) > 1 − α). In words: the probability that the RV X takes a value at least as large as the critical value c is smaller than the error risk fixed in advance. If the null hypothesis holds, in our example X has a binomial distribution ("success" being an imperfect chip) with n = 400 and probability of success π = 0.02, in short: X ~ B(400, 0.02).

Step 5: assess the data, determine the test statistic, and make a decision (this is where Type 1 and Type 2 errors can occur).
Excel method and binomial example (dice):
a) Alternative hypothesis HA: the die is loaded; the probability of "six dots" is bigger than 1/6. Shorthand: HA: p > 1/6
b) Null hypothesis H0: the probability of "six dots" is at most 1/6. Shorthand: H0: p ≤ 1/6
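The search for the critical value c described in Step 4 can be sketched directly from the definition, using the chip example X ~ B(400, 0.02). This is a plain-Python stand-in for the Excel column of upper-tail probabilities:

```python
from math import comb

def binom_sf(x, n, p):
    """P(X >= x) for X ~ B(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x, n + 1))

def critical_value(n, p0, alpha):
    """Smallest c with P(X >= c | H0) < alpha; rejection region R = {c, c+1, ...}."""
    c = 0
    while binom_sf(c, n, p0) >= alpha:
        c += 1
    return c

c = critical_value(400, 0.02, 0.05)
print(c, binom_sf(c, 400, 0.02))   # c and its tail probability, which is below 5%
```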

2.1.3 Type I and Type II Errors
Type 1 error (α error): We reject the null hypothesis H0 although it is true. (x ∈ R) and H0 is true.
Can be limited by making α very small (e.g., 1% or 0.1%).
Type 2 error (β error): We retain the null hypothesis H0 although it is wrong. (x ∉ R) and H0 is wrong.
Calc: β = binomcdf(n, π_true, c − 1), i.e. the probability of landing below the rejection region. Excel: β = BINOM.DIST(c − 1; n; π_true; TRUE).
Interpretation: The probability that our binomial test with n = 400 and α = 5% confirms the null hypothesis although the true proportion of imperfect chips is 4% is 27%.
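The 27% in the interpretation can be reproduced from the definition of β. Assuming the rejection region from the chip example starts at c = 14 (i.e. R = {14, 15, …} for n = 400 and α = 5%):

```python
from math import comb

def binom_cdf(x, n, p):
    """P(X <= x) for X ~ B(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(0, x + 1))

# Chip example: assumed rejection region R = {14, 15, ...}.
# beta = P(retain H0 | true pi = 0.04) = P(X <= 13) for X ~ B(400, 0.04)
beta = binom_cdf(13, 400, 0.04)
print(round(beta, 2))   # about 0.27, matching the interpretation above
```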
p-value: Given that H0 is true, what is the probability of getting a result at least as extreme as the one we got for our sample?
Decision-making based on the p-value:
p-value ≤ α → reject H0; the test statistic is significant
p-value > α → retain H0; the test statistic is not significant
p-values in SPSS
Note: SPSS always conducts a one-sided test! For a two-sided test, multiply the p-value by two.
Analyze > Nonparametric Tests > Legacy Dialogs > Binomial
Depending on the type of test, the value will be given with different labels:
- Exact Sig. (2-tailed)
- Exact Sig. (1-tailed)
- Asymptotic significance or similar

In SPSS, every data point must be specified, meaning 1 or 0.

If the data is aggregated, it first has to be weighted before SPSS can process it. Here, Excel is better.

2.2 t-Tests for the Arithmetic Mean
Condition for the t-test: assumes that the variable follows a normal distribution.
- For samples larger than 30, this can be assumed.
- For samples smaller than 30, make a histogram and make sure that there are no outliers and that it is not skewed too much.
Make a histogram and box plot to check this (in SPSS: always use Explore under Descriptive Statistics, and Compare Means for the tests).

One-sample t-test, σ unknown
Question: Is an expectancy bigger or smaller than a given reference value?
Example: a claimed (known) reference value is 20 cm on average; a random sample from the (unknown) population showed an average length of 23 cm with an sd of 1.5 cm.
Hypotheses: H0: μ = μ0; HA: μ ≠ μ0 (bilateral test / standard case), or μ > μ0 ∨ μ < μ0 (unilateral test, under specified circumstances)
Test distribution: T = (X̄ − μ0)/(S/√n) has a t-distribution with df = v = n − 1 degrees of freedom
Empirical test statistic: t = (x̄ − μ0)/(s/√n) (the value of T in the sample), df = v = n − 1
- then use the Excel formula to determine the p-value: T.DIST(t, df, TRUE), or =T.DIST.2T(t, df) for the bilateral value
- or use tcdf on the calculator (multiply by 2 to get the two-sided value)
p-value of the sample: area below the t-bell curve to the left of −|t| and to the right of +|t| (bilateral test); in case of a unilateral test, cut the p-value in half

2-sample t-test for paired samples
Question: Does a certain measure have an impact on a certain attribute of a population? Does it change its expectancy? (the exact same unit is measured twice)
Example: the same attribute measured in winter and measured again in summer.
Hypotheses: H0: μA = μB ∨ μdiff = μA − μB = 0; HA: μA ≠ μB ∨ μdiff ≠ 0 (bilateral test)
Procedure: take the difference of the two measurements for each unit and then run a one-sample t-test on those differences.

2-sample t-test for independent samples
Question: Does a metric attribute have different expectancies in two groups of the population (e.g. men, women)? Are the differences we measure between them significant?
Example: comparing the same attribute between two separate groups; is there a big difference?
Hypotheses: H0: μ1 = μ2 ∨ μ1 − μ2 = 0; HA: μ1 ≠ μ2 ∨ μ1 − μ2 ≠ 0 (standard case, bilateral), or μ1 > μ2 ∨ μ1 < μ2 (unilateral test)
Test distribution: T = (X̄1 − X̄2)/S(X̄1−X̄2) follows a t-distribution; S(X̄1−X̄2) is the standard error of the difference X̄1 − X̄2
Empirical test statistic: t = (x̄1 − x̄2)/s(x̄1−x̄2)
Estimating the standard error s(x̄1−x̄2):
- Variant 1: homoscedasticity, that is we assume the two variances are the same (σ1² = σ2²)
- Variant 2: heteroscedasticity, that is we assume the two variances are different (σ1² ≠ σ2²)

Level of significance: α (select in advance, standard value 5%)
Decision making: if p-value > α, retain H0; if p-value ≤ α, reject H0 and instead accept HA
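The one-sample empirical test statistic can be sketched directly from the formula (the sample values are invented; the resulting t is what would be fed into T.DIST.2T(t, df) or tcdf):

```python
import statistics

def one_sample_t(data, mu0):
    """Empirical test statistic t = (x-bar - mu0) / (s / sqrt(n)), df = n - 1."""
    n = len(data)
    x_bar = statistics.fmean(data)
    s = statistics.stdev(data)          # empirical sd with n - 1 in the denominator
    t = (x_bar - mu0) / (s / n ** 0.5)
    return t, n - 1

# Made-up sample tested against the reference value mu0 = 20
sample = [21.8, 23.1, 22.4, 24.0, 23.5, 22.9, 23.8, 22.2, 23.3, 23.0]
t, df = one_sample_t(sample, 20)
print(round(t, 2), df)   # compare |t| with t_{1-alpha/2, df}, or use T.DIST.2T(t, df)
```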

3.2.1 Chi-square test
χ²-distribution: If X1, X2, …, Xv are independent RVs with a standard normal distribution (μ = 0, σ = 1), the RV χ² = X1² + X2² + … + Xv² follows a χ²-distribution with v degrees of freedom, where E(χ²) = v and Var(χ²) = 2v.
Conditions:
1. Must be based on at least 50 samples (n ≥ 50).
2. All expected frequencies must be at least 5 (if not the case, rows or columns can be combined, which decreases the degrees of freedom; in SPSS, run Fisher's exact method instead).
3. The degrees of freedom should be greater than 1 (if df = 1, use the Yates correction/continuity correction, or use the two-sample test for proportions as an alternative).
Chi-square and SPSS: use the p-value; degrees of freedom = n − 1 (n = number of categories, for a goodness-of-fit test).
Use p-value
Steps of the chi-square test of independence:
1. Formulate the hypotheses
H0: the two attributes are independent (the null hypothesis is always independence)
HA: the two attributes are dependent
2. Select the error risk α (the significance level is 1 − α)
3. Calculate the test statistic χ²b:
Calculate the relative frequencies outside the table, multiply the data from outside the table into the table, and so obtain the theoretical (expected) absolute numbers inside the table.
χ²b = Σᵢ₌₁ᵏ Σⱼ₌₁ᵐ (fᵢⱼ − f'ᵢⱼ)²/f'ᵢⱼ, or more logically: χ²b = Σ (observed − expected)²/expected
fᵢⱼ: observed frequencies; f'ᵢⱼ: expected frequencies (provided H0 is true)
The χ²-distribution has v = (r − 1)·(c − 1) degrees of freedom, where r = number of rows and c = number of columns.
4. Compare the test statistic χ²b to the critical value χ²α
Find the critical value in a table or use Excel: =CHISQ.INV(1−α; degrees of freedom)
Example: α = 5%, v = 2: =CHISQ.INV(0.95;2) = 5.99
5. Find a decision
If χ²b > χ²(5%), reject the null hypothesis => the differences between the distributions are significant => thus, reason to believe that a relationship exists.
If χ²b < χ²(5%), retain the null hypothesis => the differences between the distributions are not significant => thus, no reason to believe that a relationship exists.

Chi-square and SPSS (if the data is shown as absolute frequencies):
1. Data > Weight Cases (select the absolute numbers).
2. For a 2x2 table, the value to look at in the output file is the p-value shown in the "Continuity Correction" row.
3. If p-value > α, retain H0 and reject HA => the difference between the observed and expected distributions must not be considered significant.
4. If p-value < α, reject H0 and retain HA => the difference between the observed and expected distributions must be considered significant.
Remark: If only a crosstab is given, use two categories in row and column and then read off the continuity correction value.
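Step 3 above can be sketched as follows (the contingency table is invented; the expected frequencies come from the row and column sums exactly as described):

```python
def chi_square_independence(table):
    """Chi-square test statistic for a k x m contingency table of observed
    absolute frequencies; df = (rows - 1) * (cols - 1)."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    n = sum(row_sums)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, f_obs in enumerate(row):
            f_exp = row_sums[i] * col_sums[j] / n   # theoretical absolute number
            chi2 += (f_obs - f_exp) ** 2 / f_exp
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df

# Made-up 2x3 table of observed frequencies
chi2, df = chi_square_independence([[30, 20, 10], [20, 30, 40]])
print(round(chi2, 2), df)   # compare with the critical value CHISQ.INV(1 - alpha, df)
```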
Based on the complete dataset:
1. If two ordinal attributes are to be examined, a test of Spearman's correlation coefficient is recommended (higher power): click on Correlations in the Analyze > Descriptive Statistics > Crosstabs > Statistics menu and use the number shown as Spearman's correlation and the p-value related to it in the output file. You also find this type of correlation under Analyze > Correlate > Bivariate.
2. To show frequencies in the output, select Observed and Expected in the Crosstabs menu.
3. If the condition "expected frequencies ≥ 5" is violated, select Method: Exact (this means "run Fisher's exact test"). Then use this test's p-value shown in the output file. Or merge columns.
4. If both attributes have only two different values, the crosstab will have four cells. This reduces the df to 1, and the Yates correction will be carried out. The p-value shown in the row "Continuity Correction" will then be the one to use.
Chi-square test for two different sample spaces
H0: π1 = π2, meaning the proportions are the same
HA: π1 ≠ π2, meaning the proportions are different
Samples: two samples with sample sizes n1, n2 and observed proportions p1, p2
            Success      No success
Sample 1    n1·p1        n1·(1 − p1)
Sample 2    n2·p2        n2·(1 − p2)
Error risk: α
Decision rule: p < α: reject H0;  p > α: retain H0
SPSS: the χ²-test with continuity correction yields the bilateral p-value
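A sketch of this two-proportion test with the Yates continuity correction (the one SPSS reports in its "Continuity Correction" row; the sample sizes and proportions below are invented):

```python
def two_proportion_chi2(n1, p1, n2, p2):
    """2x2 chi-square statistic for H0: pi1 = pi2, with Yates continuity
    correction; df = 1."""
    table = [[n1 * p1, n1 * (1 - p1)],
             [n2 * p2, n2 * (1 - p2)]]
    row = [sum(r) for r in table]
    col = [sum(c) for c in zip(*table)]
    n = sum(row)
    # (|observed - expected| - 0.5)^2 / expected, summed over the four cells
    chi2 = sum((abs(table[i][j] - row[i] * col[j] / n) - 0.5) ** 2
               / (row[i] * col[j] / n)
               for i in range(2) for j in range(2))
    return chi2   # compare with CHISQ.INV(1 - alpha, 1), e.g. 3.84 for alpha = 5%

print(round(two_proportion_chi2(200, 0.30, 250, 0.20), 2))
```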
4. Linear Regression
Probability model for a simple linear regression:
Y_i = β0 + β1·x_i + E_i, with i = 1 … n (i.i.d. = independent and identically distributed)
β1 and β0: unknown regression parameters to be estimated
E_i: random variable (RV) for the error, with the assumption that it follows a normal distribution with E(E_i) = 0 and unknown variance σ_E²
Conditions: E_i ~ N(0, σ_E²)
1. E_i must follow a normal distribution with expectation 0 and variance σ_E².
2. The E_i's have the same variance across all observations i = 1 through n.
3. The E_i's don't influence each other, that is, they are independent RVs.

Summary for variables and terms

Parameters (unknown):
β1   Unknown slope, parameter to be determined
β0   Unknown intercept, parameter to be determined
σ_E² Unknown variance of the errors, disturbance parameter
(xi, yi)  Data point, i = 1 through n

Point estimates (when data points are given):
b1   Point estimate for β1, based on the dataset and the method of least squares
b0   Point estimate for β0, based on the dataset and the method of least squares
s_E² Point estimate for σ_E², based on the residuals

Random variables:
Yi   RV for attribute Y at observation i, i = 1 through n
Ei   RV for the error at observation i, i = 1 through n
B1   RV for the slope, calculation based on the Yi's
B0   RV for the intercept, calculation based on the Yi's

Residuals: ei = observed error, i = 1 through n

In case the parameter value stated in the null hypothesis lies in the confidence interval, the null hypothesis will be retained (equivalent to p-value ≥ α). Otherwise, the null hypothesis will be rejected (p < α).
Or in other words: the confidence interval contains all parameter values for which the null hypothesis would not be rejected.
Testing if linear regression can be applied:
1. No more than 5% of the standardized residuals outside ±2 on the histogram (an absolute value of more than 3 is an outlier).
2. No pattern in the scatterplot (the random errors are independent and identically distributed (i.i.d.)).
SPSS help
Point estimation: see a), marginal cost = 3.8 mu/qu
Interval estimation: Analyze > Regression > Linear > Statistics > Model Fit, Confidence Intervals
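The method of least squares that SPSS runs behind this menu can be sketched by hand (the data points are invented and roughly follow y = 4x + 2):

```python
import statistics

def least_squares(xs, ys):
    """Point estimates b1 (slope) and b0 (intercept) by the method of least squares."""
    x_bar, y_bar = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b1, b0

# Made-up data roughly following y = 4x + 2
xs = [1, 2, 3, 4, 5]
ys = [6.1, 9.8, 14.2, 18.0, 21.9]
b1, b0 = least_squares(xs, ys)
print(round(b1, 2), round(b0, 2))   # estimates close to the true slope 4 and intercept 2
```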
4.3 Probability Model of Multiple Linear Regression
Y_i = β0 + β1·x_i⁽¹⁾ + β2·x_i⁽²⁾ + … + βm·x_i⁽ᵐ⁾ + E_i, with i = 1 … n, where the unknown regression parameters β0, β1, β2, …, βm are to be estimated. Assume the error terms E_i follow a normal distribution with expectancy E(E_i) = 0 and unknown variance σ_E², and that the error variables are independent.
Point estimate for the variance: s_E² = 1/(n − m − 1) · Σᵢ₌₁ⁿ eᵢ²
The adjusted R² states what proportion of the variance of the dependent variable is accounted for by the model.
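Both quantities can be sketched for the simple-regression case (m = 1), using the same invented data as before; the residuals eᵢ are the observed errors of the fitted line:

```python
import statistics

xs = [1, 2, 3, 4, 5]
ys = [6.1, 9.8, 14.2, 18.0, 21.9]
n, m = len(xs), 1                      # n observations, m = 1 regressor

# Least-squares fit (b1, b0), then residuals e_i = y_i - (b0 + b1*x_i)
x_bar, y_bar = statistics.fmean(xs), statistics.fmean(ys)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
     sum((x - x_bar) ** 2 for x in xs)
b0 = y_bar - b1 * x_bar
e = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

s2_E = sum(r ** 2 for r in e) / (n - m - 1)       # point estimate for sigma_E^2
ss_res = sum(r ** 2 for r in e)
ss_tot = sum((y - y_bar) ** 2 for y in ys)
r2 = 1 - ss_res / ss_tot
r2_adj = 1 - (1 - r2) * (n - 1) / (n - m - 1)     # adjusted R^2
print(round(s2_E, 3), round(r2_adj, 4))
```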
Backward stepwise regression
1. Run the regression with the complete model.
2. If a p-value is bigger than α, the regressor with the highest p-value must be dropped (only one at a time).
3. Run the regression again without the regressor from Step 2.
4. Repeat Steps 2 & 3 until all regressors have a p-value smaller than α.
5. Select the model whose regressors all have a significant impact on the dependent variable.

4.4 Transformations

Exponential model: ỹ = b̃·β̃^x
We linearize by logarithmising: ln(y) = ln(b̃) + x·ln(β̃), i.e. z = b + β·x with z = ln(y), b = ln(b̃), β = ln(β̃).
SPSS:
1. Logarithmise the data for yi => obtain zi.
2. Run a regression with zi as the dependent and xi as the independent variable.
3. b̃ and β̃ can be calculated by retransforming the estimates b and β obtained from the linear regression: b̃ = e^b and β̃ = e^β.

Power model: ỹ = b̃·x^β
We linearize by logarithmising: ln(y) = ln(b̃) + β·ln(x), i.e. z = b + β·w with z = ln(y), w = ln(x), b = ln(b̃).
SPSS:
1. Logarithmise the data for xi => obtain wi; logarithmise the data for yi => obtain zi.
2. Run a regression with zi as the dependent and wi as the independent variable.
3. b̃ can be obtained by retransforming the estimate b of the linear regression: b̃ = e^b. THIS is the constant (not beta; β needs no retransformation).
Interpretation: if x goes up by 1%, y changes by about β%.
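The power-model steps can be sketched end to end (the data below follows y = 3·x^0.5 exactly, so the retransformed estimates recover the true constants):

```python
import math
import statistics

def linear_fit(ws, zs):
    """Least-squares fit z = b + beta*w."""
    w_bar, z_bar = statistics.fmean(ws), statistics.fmean(zs)
    beta = sum((w - w_bar) * (z - z_bar) for w, z in zip(ws, zs)) / \
           sum((w - w_bar) ** 2 for w in ws)
    b = z_bar - beta * w_bar
    return b, beta

# Made-up data that follows the power model y = 3 * x^0.5 exactly
xs = [1, 4, 9, 16, 25]
ys = [3 * math.sqrt(x) for x in xs]

# Step 1: logarithmise both sides; Step 2: run the linear regression z = b + beta*w
ws, zs = [math.log(x) for x in xs], [math.log(y) for y in ys]
b, beta = linear_fit(ws, zs)

# Step 3: retransform the constant: b-tilde = e^b (beta needs no retransformation)
print(round(math.exp(b), 4), round(beta, 4))   # recovers 3 and 0.5
```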
SPSS commands
- Binomial tests: Analyze > Nonparametric Tests > Legacy Dialogs > Binomial
- p-value (binomial): Analyze > Nonparametric Tests > Legacy Dialogs > Binomial
  Note: when using "Get from data", Test Proportion = α and all data must first be sorted descending; when using "Cut point", Test Proportion = 1 − α and the data must not be sorted.
  When running a bilateral binomial test: use α_modified = α/2, then compare this p-value.
- Chi-square test: Analyze > Descriptive Statistics > Crosstabs > Statistics > Chi-Square
- Boxplot: Analyze > Descriptive Statistics > Explore
- t-test (one-sample test, σ unknown): Analyze > Compare Means > One-Sample T-Test
  Note: Test Value is the reference value that we are comparing our data to.
- t-test (two-sample test, paired): Analyze > Compare Means > Paired-Samples T-Test
- t-test (two independent samples): Analyze > Compare Means > Summary Independent-Samples T Test
- Linear regression: Analyze > Regression > Linear; select the confidence interval under Statistics.
  Under Save: Mean when looking for the expectancy of the market share at a given price x0; Individual when looking for an extra realization at price x0.
- Testing for linear regression: Analyze > Regression > Linear > Plots, Y = *ZRESID, X = *ZPRED
- Multiple linear regression: Analyze > Regression > Linear, then select Backward
- Transforming two variables into one: Transform > Recode Into Different Variables

Excel commands
- Bilateral p-value: =T.DIST.2T(t; n−1)  (German: =T.VERT.2S(t; n−1))
- Chi-square critical value: =CHISQ.INV(1−α; degrees of freedom)  (German: =CHIQU.INV)

II.1.2 Basic principles – scales

If a study investigates variables with different levels of measurement (hierarchy of scales), the lowest level of measurement determines the methods to be applied. Using a method for a lower level of measurement is possible, but not vice versa (e.g. a method for ordinally scaled variables can be used for interval-scaled ones, but not for nominally scaled ones).

A metric variable can be measured with a well-defined unit of measurement. It can be discrete or continuous.
- Discrete variable: it can only take a countable number of values on a scale (e.g. a family can have 1 or 2 children, but not 1.3).
- Continuous variable: every value on an interval can be adopted (e.g. weight, height, time).
Box plot
Top of box = 75th percentile
Bottom of box = 25th percentile
Cross = mean
Line in the box = median
Points = outliers
