
FACULTY OF ECONOMICS AND BUSINESS

CAMPUS BRUSSEL
Master of Business Engineering

Statistical Modelling
Basic concepts (revision)

Types of data/variables
Is it meaningful to perform mathematical operations on the variable (e.g., compute the average)?
 No: Qualitative variable. Can the values of the variable be ordered in a meaningful way?
o No: Nominal variable (e.g., hair color)
o Yes: Ordinal variable (e.g., the categories "do not agree at all", "agree more or less", "completely agree")
 Yes: Quantitative variable. Is there an absolute 0?
o No: Interval variable (e.g., years in the Christian era, temperature in °C)
o Yes: Ratio variable (e.g., length, age of a person)
Normal distribution
 In this course we consider 4 types of probability distributions: the Normal distribution, the $t$-distribution, the $F$-distribution and the $\chi^2$-distribution.
 The Normal distribution is symmetric and bell-shaped. It is defined by two parameters: the mean $\mu$ and the standard deviation $\sigma$.
 Notation: $X \sim N(\mu, \sigma^2)$ means that the random variable $X$ is Normally distributed with mean $\mu$ and variance $\sigma^2$. E.g., $N(10, 2^2)$ is a Normal distribution with mean 10 and standard deviation 2.
 $\sigma$ indicates the distance between $\mu$ and the point of inflection of the Normal density function.

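As a quick check of these properties, here is a minimal sketch in Python (using scipy, which is an assumption of this illustration, not something the slides prescribe):

```python
from scipy.stats import norm

# N(10, 2^2): Normal distribution with mean 10 and standard deviation 2.
# Note: scipy's scale parameter is the standard deviation, not the variance.
X = norm(loc=10, scale=2)

print(X.mean(), X.std())  # 10.0 2.0

# About 68% of the probability mass lies within one sigma of the mean,
# i.e., between the two points of inflection of the density.
print(X.cdf(12) - X.cdf(8))  # ~0.683
```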
Parameters and statistics
 Population parameters are represented by Greek letters, e.g., $\mu$ and $\sigma$.
o Parameters are numbers.
o Parameters measure a characteristic of the population.
 Statistics or estimators are represented by upper-case Roman letters, e.g., $\bar{X}$ and $S$.
o Statistics measure a characteristic of a specific random sample drawn from the population.
o Statistics are random variables because their value depends on the values included in a specific random sample.
o Statistics are estimators of population parameters. E.g., the sample mean $\bar{X}$ is an estimator of the population mean $\mu$.
 The observed value of a statistic for a specific sample is indicated using lower-case Roman letters, e.g., $\bar{x}$ and $s$.

Properties of estimators
 Statistics (or estimators) are random variables with a sampling distribution that has an expected value and a variance.
 The bias of an estimator $\hat{\theta}$, $\operatorname{Bias}(\hat{\theta}) = E(\hat{\theta}) - \theta$, indicates the difference between the expected value of the estimator and the population parameter $\theta$ it aims to estimate.
 Drawing a simple random sample (SRS) from the population is a necessary condition to avoid bias.
 The variability of an estimator is described by the variance of the sampling distribution.
 The variability of an estimator can be reduced by increasing the sample size.

Properties of estimators
 Suppose the variable $X$ has population mean $\mu$ and population standard deviation $\sigma$, and we draw a SRS of size $n$. Then:
o $E(\bar{X}) = \mu$. That is, the sample mean is an unbiased estimator of the population mean.
o $\operatorname{Var}(\bar{X}) = \sigma^2/n$. That is, the variability in the estimator can be reduced by increasing the sample size (see the simulation sketch below).
o Central limit theorem (CLT): if the sample size is large, the sampling distribution of $\bar{X}$ is approximately Normally distributed.

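Both properties are easy to verify by simulation; the sketch below (with illustrative values $\mu = 65$, $\sigma = 3$ and $n = 10$, which are assumptions of the example) draws many samples and inspects the resulting sample means:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 65, 3, 10  # illustrative population values and sample size

# Draw 100,000 SRSs of size n and compute the sample mean of each.
means = rng.normal(mu, sigma, size=(100_000, n)).mean(axis=1)

print(means.mean())  # ~65.0: E(X_bar) = mu, so X_bar is unbiased
print(means.var())   # ~0.9:  Var(X_bar) = sigma**2 / n = 9 / 10
```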
Central Limit Theorem
 Suppose we draw a SRS of size $n$ and the population mean and standard deviation of $X$ equal $\mu$ and $\sigma$, resp. Regardless of the distribution of $X$, if $n$ is large (i.e., $n \geq 30$) then $\bar{X}$ is approximately $N(\mu, \sigma^2/n)$.
 The CLT even applies if $X$ is very skewed. As an example, consider the distribution of $\bar{X}$ when $X$ has an exponential distribution in the population (see the simulation sketch below).

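A small simulation sketch of this example (the sample size $n = 100$ and the exponential mean are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100  # illustrative sample size

# X is exponential with mean 1 (a very skewed distribution), yet the
# sampling distribution of X_bar is approximately N(1, 1/n) by the CLT.
xbar = rng.exponential(scale=1.0, size=(50_000, n)).mean(axis=1)

print(xbar.mean())  # ~1.0
print(xbar.std())   # ~0.1 = 1/sqrt(n)
# A histogram of xbar is close to bell-shaped despite the skewness of X.
```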
Hypothesis testing
 The null hypothesis $H_0$ is the hypothesis that we aim to reject.
o E.g., $H_0: \mu = \mu_0$
 The alternative hypothesis $H_a$ is the hypothesis we aim to confirm.
o E.g., $H_a: \mu \neq \mu_0$ (two-sided test)
 The significance level or type-I error $\alpha$ is the probability to reject a true null hypothesis. Usually we assume $\alpha = .05$.
 The type-II error $\beta$ represents the probability that a false null hypothesis is not rejected. $1 - \beta$, also called the statistical power of a test, is the probability to reject a false null hypothesis. The power of a test depends on (see the sketch below):
o Sample size: the power increases if the sample size increases.
o Effect size: the power increases if the deviation from $H_0$ becomes larger.
o Significance level: the power increases if $\alpha$ increases.

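These three dependencies can be made concrete for the two-sided one-sample $z$-test; the helper function and all numeric values below are illustrative assumptions, not from the slides:

```python
from math import sqrt
from scipy.stats import norm

def ztest_power(mu0, mu_true, sigma, n, alpha=0.05):
    """Power of the two-sided one-sample z-test (illustrative sketch)."""
    z_crit = norm.ppf(1 - alpha / 2)
    shift = (mu_true - mu0) / (sigma / sqrt(n))  # standardized effect size
    # P(reject H0 | true mean is mu_true): probability that |Z| > z_crit
    return norm.cdf(-z_crit + shift) + norm.cdf(-z_crit - shift)

print(ztest_power(65, 67, 3, 10))              # baseline power
print(ztest_power(65, 67, 3, 40))              # larger n -> higher power
print(ztest_power(65, 68, 3, 10))              # larger effect -> higher power
print(ztest_power(65, 67, 3, 10, alpha=0.10))  # larger alpha -> higher power
```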
Hypothesis testing
 The p-value is the probability to obtain a value of the statistic under $H_0$ (i.e., assuming $H_0$ is true) that is more extreme than the value of the statistic observed in the sample.
o If $p \leq \alpha$ we reject $H_0$ because it is very unlikely to observe such an extreme value of the statistic under $H_0$.
o Avoid the conclusion “$p > \alpha$ and hence we accept $H_0$”. Instead, we say “$p > \alpha$ so we cannot reject $H_0$”.
o Note that practical significance differs from statistical significance: in large samples a very small deviation from $H_0$ often leads to a very small $p$-value.
 E.g., a lawsuit: $H_0$: the defendant is innocent, $H_a$: the defendant is guilty.
o Is there enough evidence to reject $H_0$? The p-value indicates the likelihood of the evidence against the accused assuming he/she is innocent.
o Type-I error: sending an innocent defendant to jail.
o Type-II error: freeing a guilty defendant.

Test of $\mu$ (with $\sigma$ known)
 If $X \sim N(\mu, \sigma^2)$ then the transformed variable $Z = \frac{X - \mu}{\sigma}$ has a standard normal distribution, i.e., $Z \sim N(0, 1)$. If $X$ is not normally distributed and $n \geq 30$ then the CLT implies $\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \approx N(0, 1)$.
 The z-test: we test
o $H_0: \mu = \mu_0$ against $H_a: \mu \neq \mu_0$ (two-sided test)
o $H_0: \mu = \mu_0$ against $H_a: \mu > \mu_0$ (one-sided test)
o $H_0: \mu = \mu_0$ against $H_a: \mu < \mu_0$ (one-sided test)
 If $H_0$ is true and if $X$ is normally distributed with standard deviation $\sigma$ then $Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \sim N(0, 1)$.
 If $H_0$ is true and if $X$ is not normally distributed and $n \geq 30$ then the CLT implies $Z \approx N(0, 1)$ (see the sketch below).

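A minimal sketch of the two-sided $z$-test computed from summary statistics (the helper function is hypothetical, and the values $\sigma = 3$ and $n = 10$ used in the call are illustrative assumptions):

```python
from math import sqrt
from scipy.stats import norm

def ztest_1samp(xbar, mu0, sigma, n):
    """Two-sided one-sample z-test from summary statistics (sketch)."""
    z = (xbar - mu0) / (sigma / sqrt(n))
    p = 2 * norm.sf(abs(z))  # P(|Z| >= |z|) under H0
    return z, p

# Illustrative values: xbar = 67 vs. mu0 = 65 with assumed sigma = 3, n = 10.
z, p = ztest_1samp(xbar=67, mu0=65, sigma=3, n=10)
print(z, p)  # z ~ 2.11, p ~ 0.035 -> reject H0 at alpha = .05
```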
Test of $\mu$ (with $\sigma$ known)
 Exercise: Suppose $X$ indicates the age of persons in a population. We know that $X$ is Normally distributed with known standard deviation $\sigma$. Do we reject $H_0: \mu = 65$ (using $\alpha = .05$) if we have observed in a SRS of size $n$ that the average age equals 67 (i.e., $\bar{x} = 67$)?
 Solution
o We test $H_0: \mu = 65$ against $H_a: \mu \neq 65$.
o The $z$-statistic equals $z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} = \frac{67 - 65}{\sigma/\sqrt{n}}$.
o The $p$-value is the probability to observe under $H_0$ a $z$-value that is in absolute value more extreme than the one in the observed sample (i.e., $Z \leq -|z|$ or $Z \geq |z|$). As $p < \alpha$ we reject $H_0$ and conclude with 95% confidence that the population mean of age differs significantly from 65.
t-distribution
 $t$-distributions are similar to Normal distributions in that they are symmetric and bell-shaped.
 A $t$-distribution is characterized by its degrees of freedom.
 $t$-distributions differ from Normal distributions as they have fatter tails. This larger variability is due to replacing $\sigma$ by $S$ (because $\sigma$ is unknown) in $t$-tests.

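The fatter tails are easy to see numerically; a quick sketch comparing upper-tail probabilities (the use of scipy is an illustrative assumption):

```python
from scipy.stats import norm, t

# P(T > 2) is larger under a t-distribution than under N(0, 1), and it
# shrinks toward the Normal value as the degrees of freedom increase.
print(norm.sf(2))           # ~0.023
for df in (3, 9, 30, 100):
    print(df, t.sf(2, df))  # ~0.070, ~0.038, ~0.027, ~0.024
```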
Test of $\mu$ (with $\sigma$ unknown)
 The t-test: we test
o $H_0: \mu = \mu_0$ against $H_a: \mu \neq \mu_0$ (two-sided test)
o $H_0: \mu = \mu_0$ against $H_a: \mu > \mu_0$ (one-sided test)
o $H_0: \mu = \mu_0$ against $H_a: \mu < \mu_0$ (one-sided test)
 If $H_0$ is true and if $X$ is normally distributed then the $t$-statistic is exactly $t$-distributed with $n - 1$ degrees of freedom: $T = \frac{\bar{X} - \mu_0}{S/\sqrt{n}} \sim t_{n-1}$
 If $H_0$ is true and if $X$ is not normally distributed and $n \geq 30$ then the CLT implies that $T$ is approximately $t$-distributed with $n - 1$ degrees of freedom.

Test of $\mu$ (with $\sigma$ unknown)
 Exercise: Suppose $X$ indicates the age of persons in a population. We know that $X$ is Normally distributed. Do we reject $H_0: \mu = 65$ against $H_a: \mu \neq 65$ (using $\alpha = .05$) if we have observed in a SRS of size $n = 10$ that the average age equals 67 (i.e., $\bar{x} = 67$) and that the sample standard deviation equals 3 (i.e., $s = 3$)?
 Solution
o We test $H_0: \mu = 65$ against $H_a: \mu \neq 65$.
o The $t$-statistic equals $t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{67 - 65}{3/\sqrt{10}} \approx 2.11$, with $n - 1 = 9$ degrees of freedom.
o The $p$-value is the probability to observe under $H_0$ a $t$-value that is in absolute value more extreme than the one in the observed sample (i.e., $T \leq -2.11$ or $T \geq 2.11$). As $p\,(= .064) > \alpha\,(= .05)$ we do not reject $H_0$ and conclude with 95% confidence that the population mean of age is not significantly different from 65.
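A sketch that reproduces this computation from the summary statistics (the use of scipy is an assumption of the illustration):

```python
from math import sqrt
from scipy.stats import t

# One-sample t-test from the summary statistics of the exercise.
xbar, mu0, s, n = 67, 65, 3, 10

t_stat = (xbar - mu0) / (s / sqrt(n))  # ~2.11
p = 2 * t.sf(abs(t_stat), df=n - 1)    # ~0.064 with 9 degrees of freedom

print(t_stat, p)  # p > .05 -> do not reject H0
```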
Parametric versus non-parametric tests
Parametric tests
 Examples: $z$-test, $t$-test, ANOVA, …
 Parametric tests assume that variables are Normally distributed. This may seem a strong assumption, but if the sample size is large enough, the validity of parametric tests is guaranteed by the CLT.
Non-parametric (or distribution-free) tests
 Examples: sign test, Wilcoxon signed-rank test, …
 Non-parametric tests make very weak or no assumptions about the distribution of variables. As a result, they are widely applicable (e.g., also for qualitative variables).

Pros and cons of parametric and non-parametric tests
 If the assumptions of parametric tests are violated (e.g., the distribution of a variable is very skewed) and we cannot rely on the CLT, we should use a non-parametric test.
 If the assumptions of parametric tests are not violated:
o Both parametric and non-parametric tests can be used.
o However, as parametric tests usually have more statistical power than non-parametric tests, parametric tests should be used.

Overview parametric versus non-parametric tests

| Setting | Parametric | Non-parametric (no normality or ordinal) | Non-parametric (nominal) |
| --- | --- | --- | --- |
| 1 sample | t-test ($H_0: \mu = \mu_0$) | Sign test | Binomial test |
| 2 paired samples | t-test on differences | Wilcoxon signed-rank test | / |
| 2 independent samples | t-test ($H_0: \mu_1 = \mu_2$) | Mann-Whitney-Wilcoxon test | Chi-square test |
| More than 2 independent samples | ANOVA ($H_0: \mu_1 = \mu_2 = \ldots = \mu_K$) | Kruskal-Wallis test | Chi-square test |
| Relation between 2 variables | Pearson correlation (linear relation between two quantitative variables) | Spearman correlation (relation between ordinal variables) | Chi-square test (relation between two qualitative variables) |