CAMPUS BRUSSEL
Master of Business Engineering
Statistical Modelling
Basic concepts (revision)
Types of data/variables
Is it meaningful to perform mathematical operations on the variable
(e.g., compute the average)?
o No: the variable is qualitative (categorical).
o Yes: the variable is quantitative (numerical).
Parameters and statistics
Population parameters are represented by Greek letters. E.g.
o Parameters are numbers.
o Parameters measure a characteristic of the population.
Statistics or estimators are represented by upper-case Roman letters.
E.g.
o Statistics measure a characteristic of a specific random sample drawn from
the population.
o Statistics are random variables because their value depends on the values
included in a specific random sample.
o Statistics are estimators of population parameters. E.g. the sample mean
is an estimator of the population mean .
The observed value of a statistic for a specific sample is indicated using
lower-case Roman letters:
Properties of estimators
Statistics (or estimators) are random variables: they have a sampling
distribution with an expected value and a variance.
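A minimal simulation sketch of this idea (the population values mu = 50, sigma = 10 and the sample size n = 25 are illustrative assumptions, not from the slides): drawing many random samples shows that the sample mean X̄ has a sampling distribution with E[X̄] = μ and Var(X̄) = σ²/n.

```python
import numpy as np

# Hypothetical Normal population with mu = 50 and sigma = 10 (illustrative).
rng = np.random.default_rng(42)
mu, sigma, n, n_samples = 50, 10, 25, 100_000

# Draw many SRSs of size n and compute the sample mean of each one.
sample_means = rng.normal(mu, sigma, size=(n_samples, n)).mean(axis=1)

# Sampling distribution of X-bar: E[X-bar] = mu, Var(X-bar) = sigma^2 / n.
print(sample_means.mean())   # close to mu = 50
print(sample_means.var())    # close to sigma^2 / n = 4
```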
Central Limit Theorem
Suppose we draw a SRS of size n and the population mean
and standard deviation of X equal μ and σ, resp. Regardless of the
distribution of X, if n is large (a common rule of thumb is n ≥ 30) then
X̄ ∼ N(μ, σ/√n) approximately.
The CLT even applies if X is very skewed. As an example, consider the
distribution of X̄ when X has an exponential distribution in the population.
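The exponential example can be sketched by simulation (scale = 1 and n = 50 are illustrative assumptions): even though X itself is strongly right-skewed (skewness 2), the distribution of X̄ is approximately Normal.

```python
import numpy as np

# X is exponential with mean 1 (heavily right-skewed), yet the distribution
# of X-bar over many samples is approximately N(mu, sigma/sqrt(n)).
rng = np.random.default_rng(0)
n, n_samples = 50, 100_000    # n = 50 exceeds the usual n >= 30 rule of thumb

sample_means = rng.exponential(scale=1.0, size=(n_samples, n)).mean(axis=1)

# CLT prediction with mu = sigma = 1 here:
print(sample_means.mean())    # close to mu = 1
print(sample_means.std())     # close to 1/sqrt(50) ~ 0.141
skew = ((sample_means - sample_means.mean()) ** 3).mean() / sample_means.std() ** 3
print(skew)                   # far below 2: the skewness has mostly vanished
```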
Hypothesis testing
The null hypothesis H0 is the hypothesis that we aim to reject
o E.g., H0: μ = μ0
The alternative hypothesis H1 is the hypothesis we aim to confirm
o E.g., H1: μ ≠ μ0 (two-sided test)
The significance level or type-I error α is the probability to reject a true
null hypothesis. Usually we assume α = .05.
The type-II error β represents the probability that a false null hypothesis
is not rejected. 1 − β, also called the statistical power of a test, is the
probability to reject a false null hypothesis. The power of a test depends on
o Sample size: the power increases if the sample size increases.
o Effect size: the power increases if the deviation from μ0 becomes larger.
o Significance level: the power increases if α increases.
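The effect of sample size on power can be sketched by simulation (the true mean 0.3, σ = 1 and the sample sizes below are illustrative assumptions): the same false H0 is rejected far more often with a larger n.

```python
import numpy as np

# Power of the two-sided z-test of H0: mu = 0 (sigma = 1 known, alpha = .05)
# when the true mean is 0.3, estimated by simulation.
rng = np.random.default_rng(1)
z_crit = 1.96                        # two-sided critical value at alpha = .05
true_mu, sigma, n_sims = 0.3, 1.0, 20_000

def power(n):
    # Power = fraction of simulated samples in which the false H0 is rejected.
    xbar = rng.normal(true_mu, sigma, size=(n_sims, n)).mean(axis=1)
    z = xbar / (sigma / np.sqrt(n))
    return np.mean(np.abs(z) > z_crit)

print(power(25))     # moderate power
print(power(100))    # larger n -> markedly higher power
```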
Hypothesis testing
The p-value is the probability to obtain a value of the statistic under H0
(i.e., assuming H0 is true) that is more extreme than the value of the
statistic observed in the sample.
o If p ≤ α we reject H0 because it is very unlikely to observe such an extreme
value of the statistic under H0.
o Avoid the conclusion “p > α and hence we accept H0”. Instead, we say
“p > α so we cannot reject H0”.
o Note that practical significance differs from statistical significance: in
large samples a very small deviation from μ0 often leads to a very small p-value.
E.g., a lawsuit: H0: the defendant is innocent, H1: the defendant is guilty
o Is there enough evidence to reject H0? The p-value indicates the likelihood
of the evidence against the accused assuming he/she is innocent.
o Type-I error: sending an innocent defendant to jail.
o Type-II error: freeing a guilty defendant.
Test of μ (with σ known)
If X ∼ N(μ, σ) then X̄ ∼ N(μ, σ/√n) and the transformed variable
Z = (X̄ − μ)/(σ/√n) has a standard
normal distribution, i.e., Z ∼ N(0, 1). If X is not normally distributed and
n is large then the CLT implies X̄ ∼ N(μ, σ/√n) approximately.
The z-test: we test
o H0: μ = μ0 against H1: μ ≠ μ0 (two-sided test)
o H0: μ = μ0 against H1: μ > μ0 (one-sided test)
o H0: μ = μ0 against H1: μ < μ0 (one-sided test)
If H0 is true and if X is normally distributed with standard deviation σ
then Z = (X̄ − μ0)/(σ/√n) ∼ N(0, 1).
Test of μ (with σ known)
Exercise: Suppose X indicates the age of persons in a population. We
know that X is Normally distributed with standard deviation σ = 3. Do we
reject H0: μ = 65 (using α = .05) if we have observed in a SRS of size
n = 10 that the average age equals 67 (i.e., x̄ = 67)?
Solution
o We test H0: μ = 65 against H1: μ ≠ 65
o The z-statistic equals z = (67 − 65)/(3/√10) = 2.11
o The p-value is the probability to observe
under H0 a z-value that is in absolute
value more extreme than in the observed
sample (i.e., z ≤ −2.11 or z ≥ 2.11):
p = 2P(Z ≥ 2.11) = .035
As p (= .035) < α (= .05) we reject H0
and conclude with 95% confidence that
the population mean of age differs
significantly from 65.
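The z-computation for this exercise can be checked numerically; as a sketch, x̄ = 67 and μ0 = 65 are from the slide, while σ = 3 and n = 10 are assumptions here (chosen to match the t-test exercise that follows).

```python
import numpy as np
from scipy import stats

# xbar and mu0 from the slide; sigma = 3 and n = 10 are assumed values.
mu0, sigma, n, xbar = 65, 3, 10, 67

z = (xbar - mu0) / (sigma / np.sqrt(n))
p_two_sided = 2 * stats.norm.sf(abs(z))   # P(|Z| >= |z|) under H0

print(round(z, 2))             # 2.11
print(round(p_two_sided, 3))   # 0.035 < alpha = .05 -> reject H0
```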
t-distribution
t-distributions are similar to Normal distributions in that they are
symmetric and bell-shaped.
A t-distribution is characterized by its degrees of freedom.
t-distributions differ from Normal distributions as they have fatter tails.
This larger variability is due to replacing σ by S (because σ is unknown) in
t-tests.
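The fatter tails are visible in the critical values: the two-sided 5% critical value of a t-distribution exceeds the Normal 1.96 and shrinks toward it as the degrees of freedom grow.

```python
from scipy import stats

# Two-sided 5% critical values: t exceeds the Normal 1.96 (fatter tails)
# and approaches it as the degrees of freedom increase.
z_crit = stats.norm.ppf(0.975)
print(round(z_crit, 3))                        # 1.96
for df in (5, 10, 30, 100):
    print(df, round(stats.t.ppf(0.975, df), 3))
```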
Test of μ (with σ unknown)
The t-test: we test
o H0: μ = μ0 against H1: μ ≠ μ0 (two-sided test)
o H0: μ = μ0 against H1: μ > μ0 (one-sided test)
o H0: μ = μ0 against H1: μ < μ0 (one-sided test)
If H0 is true and if X is normally distributed then the t-statistic is exactly
t-distributed with n − 1 degrees of freedom:
T = (X̄ − μ0)/(S/√n) ∼ t(n − 1)
Test of μ (with σ unknown)
Exercise: Suppose X indicates the age of persons in a population. We
know that X is Normally distributed. Do we reject H0: μ = 65 against
H1: μ ≠ 65 (using α = .05) if we have observed in a SRS of size n = 10
that the average age equals 67 (i.e., x̄ = 67) and that the sample standard
deviation equals 3 (i.e., s = 3)?
Solution
o We test H0: μ = 65 against H1: μ ≠ 65
o The t-statistic equals t = (67 − 65)/(3/√10) = 2.11
o df = n − 1 = 9
The p-value is the probability to observe under H0 a
t-value that is in absolute value more extreme than
in the observed sample (i.e., t ≤ −2.11 or
t ≥ 2.11): p = 2P(T9 ≥ 2.11) = .064
As p (= .064) > α (= .05) we do not reject H0 and
conclude with 95% confidence that the population
mean of age is not significantly different from 65.
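This computation can be reproduced numerically; x̄ = 67, s = 3 and μ0 = 65 are from the slide, while the sample size n = 10 is an assumption here (it reproduces the quoted p-value of .064).

```python
import numpy as np
from scipy import stats

# xbar, s and mu0 from the slide; n = 10 is assumed (matches p = .064).
mu0, s, n, xbar = 65, 3, 10, 67

t_stat = (xbar - mu0) / (s / np.sqrt(n))
p_two_sided = 2 * stats.t.sf(abs(t_stat), df=n - 1)

print(round(t_stat, 2))        # 2.11
print(round(p_two_sided, 3))   # 0.064 > alpha = .05 -> H0 is not rejected
```

Note that the same data reject H0 under the z-test but not under the t-test: the fatter tails of the t-distribution make the two-sided p-value larger.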
Parametric versus non-parametric tests
Parametric tests
Examples: z-test, t-test, ANOVA, …
Parametric tests assume that variables are Normally distributed. This may
seem a strong assumption but if the sample size is large enough, the
validity of parametric tests is guaranteed by the CLT.
Non-parametric (or distribution-free) tests
Examples: sign test, Wilcoxon signed-rank test, …
Non-parametric tests make very weak or no assumptions about the
distribution of variables. As a result, they are widely applicable (e.g., also
for qualitative variables).
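As a sketch of a non-parametric alternative to the t-test, the sign test only counts how many observations fall above the hypothesized median; the age values below are made up for illustration, not from the slides.

```python
from scipy import stats

# Sign test of H0: the population median equals 65, with illustrative data.
ages = [61, 63, 64, 66, 67, 68, 70, 72, 75, 81]

n_above = sum(a > 65 for a in ages)                  # 7 of 10 observations exceed 65
res = stats.binomtest(n_above, n=len(ages), p=0.5)   # under H0, P(X > median) = 0.5
print(res.pvalue)   # 0.34375 -> no evidence against H0
```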
Pros and cons of parametric and non-parametric tests
If the assumptions of parametric tests are violated (e.g., the distribution of a
variable is very skewed) and we cannot rely on the CLT, we
should use a non-parametric test.
If the assumptions of parametric tests are not violated
o Both parametric and non-parametric tests can be used.
o However, as parametric tests usually have more statistical power than
non-parametric tests, parametric tests should be used.
Overview parametric versus non-parametric tests