
Introduction to statistics

Statistics
  General Excel usage
WP1: Descriptive statistics
  Variables
    Characteristic
    Measurement
  Samples
    Strategies
    Bias
    Measures
  Graphs
    Box plot
    Scatter plot
WP2: Probability
  Some terminology
  Rules
    Complement Rule
    Addition rule
    Multiplication rule
  Contingency table
    Absolute numbers / Relative numbers
  Probability rules: example
WP3: probability distributions
  Types of probability distributions
    Discrete distribution
      Other important distributions
    Continuous distribution
      Continuous distribution: normal
      Other important distributions
    Empirical distribution
  Special case: The sample distribution: central limit theorem
    The average perspective
    The ‘sum of variables’ perspective
  Summary probability distributions
  Exercises
    Delhaize
    Pepsi
    Investments
WP4: regression analysis
  Intro to simple and multiple linear regression
    Simple linear regression
      Concept of residuals
      Least squares estimation
      Using excel to get the regression equation
    Introduction to multiple linear regression
    Example – price of a house
  Goodness of fit measures
    Coefficient of determination
  Individual and joint significance
  Categorical explanatory variables
    How to include categorical variables?
  Example – Delhaize
Excel
  Probability
  Distribution
    Normal
    Discrete
      Poisson
      Binom
    Continuous
      Exponential
  Regression

Excel:
=sum() and =sumproduct()

=average()

=min() or =max()

=count() : counts all cells that contain a number

=counta(): counts all cells that contain a value (text or number)

=countif() : counts all cells that meet a certain criterion, e.g. only positive numbers:
=countif(RANGE;">0")

=if(Condition; Output if true; Output if false)
E.g. =if(A1<0; "Loss"; "Gain")

=if(A1<0;"Loss";if(A1=0;"Break-even";"Gain"))

=if(AND(cond1; cond2; ...); "outcome if all conditions are true"; "outcome if at least 1 is false")
=if(OR(cond1; cond2; ...); "outcome if at least 1 condition is true"; "outcome if no conditions are true")

layout of tables: PPT 2

Layout of tables: PPT 3

PIVOT CHART:

Do we see a link/difference between x and y? → Put one of the two on the x-axis and the average of the other
on the y-axis. Also make a graph. Usually the categorical or discrete variable goes on the x-axis and the average of the
continuous variable on the y-axis.
WP1: Descriptive statistics
Variables
Characteristic
1. Categorical = different options where one is not better than the other (colour, brands, …)
2. Numerical
a. Discrete:
i. finite (countable) number of values: we can write all the possibilities down, e.g.
scores on an exam (12/20, 12,5/20, …) (the values don’t need to be whole numbers)
b. Continuous
i. infinite number of possible values

Measurement
• Nominal:
a. Distinct categories without ‘magnitude’
• Ordinal
a. Ordered categories: a scale of happiness, …
• Interval
a. Meaningful distances without absolute zero point (e.g. temperature in Celsius or Fahrenheit: there is no
absolute 0 point, it depends on the scale)
• Ratio
a. Meaningful distances with absolute zero point: temperature on the Kelvin scale

Samples

Strategies
Types of samples
→ they all have advantages and disadvantages

- Non-probability sample (subjective): CHANCE OF BEING SELECTED IS UNKNOWN
▪ Judgement
• The researcher selects units to be sampled based on their own existing knowledge
▪ Quota
• Non-probabilistic version of stratified sampling
▪ Convenience
• Units are selected because they are easy to reach; very easy to do
▪ Snowball
• One person selects someone, who selects someone else… (e.g. ice bucket challenge)
- Probability samples (objective): CHANCE OF BEING SELECTED IS KNOWN
▪ Simple random
• Subset of a statistical population in which each member of the population has an
equal probability of being chosen
▪ Systematic
• Selected according to a random starting point but with a fixed, periodic interval
▪ Stratified
• Subdivide the population into different subgroups that are homogeneous within the group and
heterogeneous between groups; sample from every group
▪ Cluster
• Same as stratified, but you only use some groups, not all of them

Bias
- Non response bias
- Response bias
- Socially accepted answers
- Voluntary response bias
→ If the experience was not good, people will be more eager to share it

Measures
THINK CRITICALLY ABOUT SUMMARY MEASURES!!


Graphs
Box plot
3 parts of the box plot:
• Q1 = first quartile: the value below which 25% of the observations lie
• Q2 = second quartile = 50% = median
• Q3 = third quartile = 75%
• IQR = interquartile range = distance between Q1 and Q3
• Outliers:
o 1,5 to 3 × IQR outside the box = mild outliers
o More than 3 × IQR outside the box = extreme outliers
o What to do with outliers: check whether they are a mistake or part of the dataset, and whether
this data point matches the research question (e.g. Elon Musk’s salary vs. the question: what does the
average American earn?) → if it is a mistake: delete it; if it is correct: keep it (hospital beds example)
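
The quartiles and outlier rules above can be sketched in Python. This is only an illustration: the salary figures are hypothetical, and the "inclusive" quartile method is an assumption (different tools, Excel included, may use slightly different quartile conventions).

```python
import statistics


def boxplot_summary(values):
    """Return the quartiles, the IQR and the mild/extreme outliers of a sample."""
    # Assumption: "inclusive" interpolation; other quartile conventions exist.
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    mild, extreme = [], []
    for v in values:
        # Distance outside the box (0 if the value lies between Q1 and Q3).
        dist = max(q1 - v, v - q3, 0)
        if dist > 3 * iqr:
            extreme.append(v)
        elif dist > 1.5 * iqr:
            mild.append(v)
    return {"q1": q1, "median": q2, "q3": q3, "iqr": iqr,
            "mild": mild, "extreme": extreme}


# Hypothetical salaries (in thousands), not taken from the course data.
salaries = [30, 32, 35, 36, 38, 40, 41, 43, 45, 60, 120]
summary = boxplot_summary(salaries)
print(summary)
```

Whether a flagged value is then deleted or kept is still a judgement call, as described in the notes.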

Scatter plot
A scatter plot lets us see/track a possible correlation,
BUT it cannot tell us whether there is causality.

THINK CRITICALLY ABOUT GRAPHS!! (years in between)


WP2: Probability
Some terminology
• Sample space of an experiment:
o Contains all possible outcomes.
• Event:
o Subset of the sample space
▪ If it contains only one outcome: simple event
o Exhaustive events:
▪ If all possible outcomes of an experiment are included in the events
o Mutually exclusive events:
▪ If the events do not share any common outcome of an experiment

Rules
Complement Rule
• $P(A^c) = 1 - P(A)$
o The probability of the complement of an event is equal to 1 minus the probability
of the event

Addition rule
• $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
o The probability that A or B occurs is equal to the probability that A occurs, plus the
probability that B occurs, minus the probability that both A and B occur

Multiplication rule
• $P(A \cap B) = P(A \mid B)\,P(B) = P(B \mid A)\,P(A)$
o The probability that A and B both occur is equal to the probability that A occurs given
that B has occurred, times the probability that B occurs

Contingency table
Absolute numbers Relative numbers
Probability rules: example
Questions

1. What’s the chance of being a man or using the self-scanning system?

2. What’s the chance of being female given that you’re using the self-scanning system?
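
As a rough illustration of how the two questions follow from the addition and multiplication rules, here is a small Python sketch. The counts are hypothetical (the original contingency table is not reproduced in these notes), so only the method carries over, not the numbers.

```python
# Hypothetical absolute numbers: (gender, checkout type) -> number of customers.
counts = {
    ("man", "self-scan"): 30,
    ("man", "cashier"): 20,
    ("woman", "self-scan"): 25,
    ("woman", "cashier"): 25,
}
total = sum(counts.values())


def prob(gender=None, checkout=None):
    """Share of customers matching the given gender and/or checkout type."""
    hits = sum(
        c for (g, k), c in counts.items()
        if (gender is None or g == gender) and (checkout is None or k == checkout)
    )
    return hits / total


# Question 1: addition rule, P(A or B) = P(A) + P(B) - P(A and B).
p_man = prob(gender="man")
p_scan = prob(checkout="self-scan")
p_man_or_scan = p_man + p_scan - prob("man", "self-scan")

# Question 2: multiplication rule rearranged, P(A | B) = P(A and B) / P(B).
p_woman_given_scan = prob("woman", "self-scan") / p_scan

print(p_man_or_scan, p_woman_given_scan)
```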

WP3: probability distributions


Types of probability distributions
Discrete distribution
• Discrete random variable
• Number of possible outcomes is countable
• Each outcome has a certain probability
• Mass function
o Ex: the probability that there are x students in class

Other important distributions

▪ Uniform
▪ Poisson
▪ Binomial
Continuous distribution
• Continuous random variable
• Number of possible outcomes is uncountable
• Probability for any specific value is zero [ P(Y = y) = 0 ]
• Density function

Continuous distribution: normal

• Most important distribution in statistics
• Defined by two parameters
o Average (µ)
o Standard deviation (σ)
• Bell-shaped curve, symmetric, asymptotic

E.g., the time to produce one product is on average 4 hours and has a standard deviation of 1 hour.
The variable follows a normal distribution.

• What’s the chance that it takes less than 3,25 hours to produce a product?
• What’s the time to produce a product so that there’s a 38% chance of exceeding it?
• What is the concept of z-values?
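
The three questions above can be checked with Python's standard library. This is a sketch using the example's µ = 4 and σ = 1; the z-value is the usual standardisation z = (x − µ)/σ, i.e. how many standard deviations x lies from the average.

```python
from statistics import NormalDist

mu, sigma = 4, 1
production_time = NormalDist(mu, sigma)

# Chance that producing one product takes less than 3.25 hours.
p_below = production_time.cdf(3.25)

# The same probability via the z-value and the standard normal distribution.
z = (3.25 - mu) / sigma
p_below_via_z = NormalDist().cdf(z)

# Time that is exceeded with 38% chance = time with 62% chance of staying below.
t_38 = production_time.inv_cdf(1 - 0.38)

print(round(p_below, 4), round(z, 2), round(t_38, 3))
```

These calls play the same role as `=norm.dist(...; 1)` and `=norm.inv(...)` in Excel.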


Other important distributions

Empirical distribution
• Pragmatic approach
o Use it if the data don’t match any of the other distributions
o For example: the age of the different employees
o Treat your sample as if it were the population, and count the share of
observations that satisfy the condition you’re interested in
o In practice: count the number of people that match what you are researching (e.g. age below
20) → then divide this by the total n
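
A minimal Python sketch of this counting approach, using a hypothetical list of employee ages (not the course data):

```python
def empirical_probability(sample, condition):
    """Share of observations in the sample that satisfy the condition."""
    return sum(1 for x in sample if condition(x)) / len(sample)


# Hypothetical ages of ten employees.
ages = [19, 23, 31, 18, 45, 52, 27, 19, 38, 41]

# Chance that an employee is younger than 20: count the matches, divide by n.
p_below_20 = empirical_probability(ages, lambda age: age < 20)
print(p_below_20)
```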

Special case: the sample distribution: central limit theorem
The average perspective

For example, with salary → we take the average salary of 3 random employees → we keep doing this → all these
averages together form a normal distribution.
The averages follow approximately $N(\mu, \sigma/\sqrt{n})$: µ and σ are those of the original variable, n = how many “people” are in one sample → n SHOULD ACTUALLY BE BIGGER
THAN 30
We use this if the question is: what is the chance that the average salary of “3” people is below x?

E.g., the time to produce one product is on average 4 hours and has a standard deviation of
1 hour. The variable follows a normal distribution.
• 100 products will be produced. What’s the chance that it takes on average less than 3,25
hours to produce a product?
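
A sketch of this calculation in Python, assuming the 100 production times are independent so that the average has standard deviation σ/√n:

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 4, 1, 100

# Distribution of the AVERAGE production time over n products.
average_time = NormalDist(mu, sigma / sqrt(n))

# Chance that the average over 100 products is below 3.25 hours.
p_average_below = average_time.cdf(3.25)
print(p_average_below)
```

The result is practically zero: an average over 100 products varies far less than a single production time, so 3,25 hours lies 7,5 standard deviations below the mean.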

The ‘sum of variables’ perspective


Different variables X, each with its own µ and standard deviation → the combination (sum) of all the variables is Y
→ Y will follow a normal distribution

Use: lead time: a production process has different stages, each with its own time and variability → use this to look at
total production time and its variability → calculate the chance that the product is finished within a certain time
→ is used for negotiation: see notes

E.g., the time to produce one product is on average 4 hours and has a standard deviation of 1 hour.
The variable follows a normal distribution.

• What’s the chance that it takes less than 325 hours to produce the 100 products?
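
A Python sketch of the sum perspective for this example. It assumes the individual times are independent, in which case the averages add up and the variances (not the standard deviations) add up:

```python
from math import sqrt
from statistics import NormalDist

# One (mu, sigma) pair per product; here 100 identical products from the example.
stages = [(4, 1)] * 100

mu_total = sum(m for m, _ in stages)                # averages add up
sigma_total = sqrt(sum(s ** 2 for _, s in stages))  # variances add up

total_time = NormalDist(mu_total, sigma_total)
p_total_below = total_time.cdf(325)
print(mu_total, sigma_total, p_total_below)
```

The same list could hold different (µ, σ) pairs per stage of a production process to estimate a total lead time.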

Summary probability distributions


Exercises
Delhaize
Let’s assume that the variable age follows a distribution for which the population average equals the
sample average, and a similar reasoning applies to the standard deviation.

Questions:

1. What’s the chance that the age of a customer is below 46 years?


2. What’s the chance that the average age of ten customers is below 46 years?
3. Do you think the variable ‘age’ is following a normal distribution?

Pepsi
Every day Pepsi produces 10.000 cans. Legally, a can must contain at least 12 ounces of soda. The
government inspects 10 randomly chosen cans each day. If at least two are underweight, Pepsi is
fined. The filling follows a normal distribution with µ = 12,05 ounces and σ = 0,03 ounces.

Question: What is the probability that Pepsi will be fined on a given day?
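
One possible way to approach this exercise, sketched in Python: first use the normal distribution to get the chance that a single can is underweight, then use the binomial distribution for "at least two out of ten". It assumes the ten inspected cans are independent, which is a reasonable approximation when 10 cans are drawn from 10.000.

```python
from math import comb
from statistics import NormalDist

fill = NormalDist(12.05, 0.03)

# Step 1: chance that one can contains less than the legal 12 ounces.
p_under = fill.cdf(12)


def binom_pmf(k, n, p):
    """P(X = k) for a binomial variable with n trials and success chance p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


# Step 2: fined if at least 2 of the 10 cans are underweight
#         = 1 - P(0 underweight) - P(1 underweight).
n_cans = 10
p_fined = 1 - binom_pmf(0, n_cans, p_under) - binom_pmf(1, n_cans, p_under)
print(round(p_under, 4), round(p_fined, 4))
```

Under these assumptions the chance of a fine comes out at roughly 8% per day.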

Investments
Your boss wants to invest $100 and asks you to deal with it. There are two investments on the
market for which you can either do the full investment or 50% of the investment

Basically, you have two options

• Investment A
• Split investment: 50% A and 50% B

So, investment A is the best option


WP4: regression analysis
Intro to simple and multiple linear regression
Regression analysis is the study of relationships between variables:

• Dependent variable (response variable) = Y
• Independent variable (explanatory variable) = X

using data:

• Cross-sectional data
• Time series data

in order to gain understanding and to make predictions.

Simple: only one explanatory variable. Multiple: more than one explanatory variable.

Simple linear regression


Only two variables

• One response variable Y
• One explanatory variable X

Coefficient b1: if x rises by one unit, y changes by b1 units.

b0 gives the value of y when x is 0 → this is not always a meaningful value: the regression line is based on
a data set with a certain range → outside this range the line doesn’t “count”, so if
there are no data around x = 0 → then the b0 value also doesn’t mean a lot.

Concept of residuals
• Fit a straight line
o Regression equation: ŷ = b0 + b1·x
o Observed values (y) versus fitted values (ŷ)
o Residual: e = y − ŷ

The computer calculates the sum of squares of all e’s (squares, so that positive and negative e’s
don’t cancel each other out when we add them together)

Least squares estimation


Which line fits the best?

• We want to minimize the sum of squares of all e’s
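
For simple linear regression, the line that minimizes the sum of squared residuals has a closed-form solution. The Python sketch below uses those textbook formulas on hypothetical data; the helper name `fit_line` is made up for illustration and is not something Excel exposes.

```python
def fit_line(xs, ys):
    """Least squares estimates (b0, b1) for the line y = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx           # slope
    b0 = y_bar - b1 * x_bar  # intercept
    return b0, b1


# Hypothetical observations.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_line(xs, ys)
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]  # e = y - y_hat
sse = sum(e ** 2 for e in residuals)                     # sum of squares of all e's
print(b0, b1, sse)
```

Any other line through the same points gives a larger sum of squares, which is what "best fit" means here.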


Using excel to get the regression equation

Introduction to multiple linear regression

• More than two variables
o One response variable Y
o More than one explanatory variable X

In Excel → one coefficient for every explanatory variable


Example – price of a house
• Explaining variability in the price of a house by knowing
o The number of bedrooms
o The distance from the city centre (km)
• Regression output table

Goodness of fit measures


Excel calculates the line with the lowest sum of squares → this is the best fit → even the best fit can
be very bad → we need measures to assess the fit!

Coefficient of determination
• Coefficient of determination R²
o R² cannot decrease when new explanatory variables are added
o So, keep on adding variables then?

• Concept of Adjusted R²
o For model comparison and selection only
o Correcting for sample size and number of explanatory variables
o No direct interpretation of adjusted R²
1. Standard error of estimate: a standard deviation, expressed in the unit of the variable → not very clear on its own
Divide the standard error of estimate by the average of y → this should be below 0,2
2. Coefficient of determination:

1 is an excellent model and 0 is a bad model: at 0 the model is no better at predicting the value than just
taking the average
If we add a variable, R² can stay the same or go up but never down: if the variable doesn’t add
value → the computer doesn’t use it and R² doesn’t change
Why not keep adding variables then? → It makes the model too complicated: we want to
explain the most with the least amount of variables

Intermezzo: watch out with multiple explanatory variables that are strongly correlated with each other: for example, a high
correlation between age and experience → weird things will happen in the formula → delete the
variable with the lowest correlation with y. (The correlation between 2 explanatory variables should be between -0,4 and 0,4.)

• If we then delete the variable: R² will stay the same or go down → probably not by a lot,
because we kept the more strongly correlated variable of the 2
3. Adjusted R²:
Only for comparison and selection between different linear regression models
No direct interpretation
Adjusts R² for sample size and number of explanatory variables
→ Best to choose the model with the highest adjusted R², even if the normal R² of another one is
higher

Difference between R² and adjusted R²:

Normal R² will get higher and higher (or stay the same) as more variables are added, whereas
adjusted R² will first rise, then plateau and go down again as more variables are added.
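
The three goodness-of-fit measures can be written out as follows. This Python sketch uses the standard textbook formulas with n observations and k explanatory variables; the observed and fitted values are hypothetical.

```python
from math import sqrt


def fit_measures(y, y_hat, k):
    """Return (R2, adjusted R2, standard error of estimate) for k explanatory variables."""
    n = len(y)
    y_bar = sum(y) / n
    sse = sum((obs - fit) ** 2 for obs, fit in zip(y, y_hat))  # unexplained variation
    sst = sum((obs - y_bar) ** 2 for obs in y)                 # total variation
    r2 = 1 - sse / sst
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)  # penalises extra variables
    se = sqrt(sse / (n - k - 1))                   # in the unit of y
    return r2, adj_r2, se


# Hypothetical observed and fitted values from a model with 1 explanatory variable.
y = [10, 12, 15, 19, 24]
y_hat = [9.5, 12.5, 15.5, 18.5, 24.0]

r2, adj_r2, se = fit_measures(y, y_hat, k=1)
relative_se = se / (sum(y) / len(y))  # rule of thumb from the notes: below 0.2
print(r2, adj_r2, se, relative_se)
```

Because of the (n − 1)/(n − k − 1) factor, adjusted R² only goes up when a new variable improves the fit by more than the penalty for adding it.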

Individual and joint significance


For every variable there is a null hypothesis (H0): the coefficient of the variable is equal to 0. The
alternative hypothesis is that the coefficient is different from 0.

If the null hypothesis is true, the coefficient is equal to zero: the variable is not significant/has no
explanatory power (we could also do this for b0, but this is too complicated for this course).

How to determine this: there are a lot of different ways; we look at the p-value.
Compare the p-value to the alpha value: alpha is the chance of making a type 1 error = rejecting H0 even when it
is actually true → alpha = 0,05. If the p-value is below alpha, we reject H0 and the variable is significant.

• If a variable is not significant: delete it → one by one, don’t delete multiple variables at the same time,
always make a new model first
Principle of PARSIMONY: explain the most with the least. It favours a model with fewer
explanatory variables, without losing too much explanatory power.

If we have multiple variables, it is possible that none of them is individually significant, but that the
combination of the variables does give us a good way of calculating y and is jointly
significant → we can test this with the ANOVA test → look at Significance F in Excel → if it is smaller
than 0,05 = GOOD

Categorical explanatory variables


How to include categorical variables?
• Categorical variables with two outcomes
o Assume this variable can only take two different outcomes
o Example
▪ Gender = categorical variable with 2 outcomes
▪ House = detached or not
• We need dummy variables to represent categorical variables
o A dummy variable has two possible outcomes
▪ 1 implies that the observation is in the category of interest
▪ 0 implies that it is not
• Categorical variables with more than two outcomes
o Assume this variable takes more than two different outcomes
o Example
▪ Hair = blond, black, red, other
▪ House = detached, semi-detached, other
• Required number of dummy variables
o Use one less dummy variable than the number of categories for the categorical
variable (do not use the categorical variable itself)
o The omitted dummy corresponds to the reference category
• How to do this easily in Excel: sort on the categorical variable and then give 1 or 0 to the whole block that has been
sorted (don’t filter!!)
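
The "one less dummy than categories" rule can be illustrated with a short Python sketch. The helper `make_dummies` and the house data are hypothetical; the choice of "other" as reference category is also just an assumption for the example.

```python
def make_dummies(values, reference):
    """Encode a categorical variable as 0/1 dummies, leaving out the reference category."""
    categories = sorted(set(values) - {reference})
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]


# Hypothetical house types; "other" is the omitted reference category.
houses = ["detached", "semi-detached", "other", "detached"]
rows = make_dummies(houses, reference="other")

for house, row in zip(houses, rows):
    print(house, row)
```

Three categories give two dummy columns; an observation in the reference category has 0 in both, so the regression coefficients of the dummies are read relative to that category.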

Example – Delhaize
Excel
Probability
1. Make a pivot table of all the data
2. Select the variables you want and put them in the columns and rows, and also add one of the two to the
values
3. Show them as a count and as a percentage for easy calculations

Distribution
Normal
What’s the chance that it takes less than 3,25 hours to produce a product? (µ: 4; σ: 1)

=norm.dist(x; µ; σ; 1)
=norm.dist(3,25;4;1;1)

What’s the time to produce a product so that there’s a 38% chance of exceeding it? (µ: 4; σ: 1)
A 38% chance of exceeding means a 62% chance of staying below, so prob = 1 - 0,38 = 0,62.

=norm.inv(prob; µ; σ)
=norm.inv(0,62; 4; 1)

Discrete
Poisson
=poisson.dist(x;µ; 0 or 1)

0: P(X = x)
1: P(X ≤ x)

Binom
=binom.dist(x; n; p; 0 or 1)

0: P(X = x)

1: P(X ≤ x)

Continuous
Exponential
=expon.dist( x; λ; 1)

1: P(X ≤ x)
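
For readers who want to check these Excel functions outside Excel, the sketch below mirrors them in plain Python. The function names are made up to resemble the Excel ones; the last argument plays the role of the 0/1 flag (False: P(X = x), True: P(X ≤ x)).

```python
from math import comb, exp, factorial


def poisson_dist(x, mu, cumulative):
    """Poisson: P(X = x) if cumulative is False, P(X <= x) if True."""
    def pmf(k):
        return exp(-mu) * mu ** k / factorial(k)
    return sum(pmf(k) for k in range(x + 1)) if cumulative else pmf(x)


def binom_dist(x, n, p, cumulative):
    """Binomial: P(X = x) if cumulative is False, P(X <= x) if True."""
    def pmf(k):
        return comb(n, k) * p ** k * (1 - p) ** (n - k)
    return sum(pmf(k) for k in range(x + 1)) if cumulative else pmf(x)


def expon_dist(x, lam):
    """Exponential: cumulative probability P(X <= x) for rate lam."""
    return 1 - exp(-lam * x)


print(poisson_dist(2, 3, True), binom_dist(1, 10, 0.05, True), expon_dist(2, 0.5))
```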

Questions

1. What’s the chance of being a man or using the self-scanning system?
$P(A \cup B) = P(A) + P(B) - P(A \cap B)$

2. What’s the chance of being female given that you’re using the self-scanning system?
$P(A \cap B) = P(A \mid B)\,P(B) = P(B \mid A)\,P(A)$, so $P(A \mid B) = P(A \cap B) / P(B)$

3. What’s the chance that the age of a customer is below 46 years?
P(x < 46y) =

4. What’s the chance that the average age of ten customers is below 46 years?
=norm.dist(46; 42; 13,22/sqrt(10); 1) = 83%

5. What is the probability that Pepsi will be fined on a given day?


Regression
• Data Analysis ToolPak
o “Regression”
o “Labels”
