Statistics
General Excel usage
WP1: Descriptive statistics
Variables
Characteristic
Measurement
Samples
Strategies
Bias
Measures
Graphs
Box plot
Scatter plot
WP2: Probability
Some terminology
Rules
Complement Rule
Addition rule
Multiplication rule
Contingency table
Absolute numbers Relative numbers
Probability rules: example
WP3: Probability distributions
Types of probability distributions
Discrete distribution
Other important distributions
Continuous distribution
Continuous distribution: normal
Other important distributions
Empirical distribution
Special case: The sample distribution: central limit theorem
The average perspective
The ‘sum of variables’ perspective
Summary probability distributions
Exercises
Delhaize
Pepsi
Investments
WP4: Regression analysis
Intro to simple and multiple linear regression
Simple linear regression
Concept of residuals
Least squares estimation
Using Excel to get the regression equation
Introduction to multiple linear regression
Example – price of a house
Goodness of fit measures
Coefficient of determination
Individual and joint significance
Categorical explanatory variables
How to include categorical variables?
Example – Delhaize
Excel
Probability
Distribution
Normal
Discrete
Poisson
Binom
Continuous
Exponential
Regression
Excel:
=sum() and =sumproduct()
=average()
=min() or =max()
=countif() : counts all cells that meet a certain criterion, e.g. only the positive numbers (use "<0" for the negative numbers):
=countif(RANGE;">0")
=if(A1<0;"Loss";if(A1=0;"Break-even";"Gain"))
=if(AND(cond1;cond2;...);"outcome if all conditions are true";"outcome if at least one is false")
=if(OR(cond1;cond2;...);"outcome if at least one condition is true";"outcome if no condition is true")
PIVOT CHART:
Use it to see whether there is a link or difference between x and y: put one of the two on the x-axis
and the average of the other on the y-axis. Usually the categorical or discrete variable goes on the
x-axis and the average of the continuous variable on the y-axis. Also make the graph.
WP1: Descriptive statistics
Variables
Characteristic
1. Categorical = different options where one is not better than the other (colour, brands, …)
2. Numerical
a. Discrete:
i. a finite number of values: we can write all the possibilities down, e.g.
scores on an exam (12/20, 12,5/20, …); the values do not need to be whole numbers
b. Continuous
i. an infinite number of possible values
Measurement
Nominal:
a. Distinct categories without ‘magnitude’
Ordinal
a. Ordered categories: a scale of happiness, …
Interval
a. Meaningful distances without an absolute zero point (e.g. temperature: the zero point
depends on whether you measure in Fahrenheit or Celsius)
Ratio
a. Meaningful distances with an absolute zero point: temperature on the Kelvin scale
Samples
Strategies
Types of samples
They all have advantages and disadvantages.
Bias
- Non-response bias
- Response bias
- Socially accepted answers
- Voluntary response bias
If the experience was not good, people are more eager to share it
Measures
THINK
Scatter plot
A scatter plot lets us see a possible correlation between two variables,
BUT it cannot tell us whether there is causality.
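To put a number on what the scatter plot shows, one option is the Pearson correlation coefficient. The sketch below is only an illustration: the data (years of experience and monthly wage) and the function name `pearson_r` are made up for this example and do not come from the course material. A value close to 1 or -1 still says nothing about causality.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: years of experience and monthly wage
experience = [1, 3, 5, 7, 9]
wage = [2000, 2300, 2900, 3100, 3600]

r = pearson_r(experience, wage)
print(round(r, 3))  # close to 1: strong positive linear relation
```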
Rules
Complement Rule
$P(A^c) = 1 - P(A)$
o The probability of the complement of an event is equal to 1 minus the probability
of the event
Addition rule
$P(A \cup B) = P(A) + P(B) - P(A \cap B)$
o The probability that A or B occurs is equal to the probability that A occurs, plus the
probability that B occurs, minus the probability that both A and B occur
Multiplication rule
$P(A \cap B) = P(A \mid B)\,P(B) = P(B \mid A)\,P(A)$
o The probability that A and B both occur is equal to the probability that A occurs given
that B has occurred, times the probability that B occurs
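The three rules can be checked on a small 2x2 table. This is a minimal sketch with hypothetical counts for 100 customers (the events and numbers are assumptions for illustration, not data from the course).

```python
# Hypothetical counts for 100 customers (assumption, not from the notes):
# A = "bought product A", B = "bought product B"
both = 30      # A and B
only_a = 20    # A but not B
only_b = 10    # B but not A
neither = 40   # neither A nor B
total = both + only_a + only_b + neither

p_a = (both + only_a) / total
p_b = (both + only_b) / total
p_a_and_b = both / total

# Complement rule: P(A^c) = 1 - P(A)
p_not_a = 1 - p_a

# Addition rule: P(A or B) = P(A) + P(B) - P(A and B)
p_a_or_b = p_a + p_b - p_a_and_b

# Conditional probabilities, used in the multiplication rule
p_a_given_b = p_a_and_b / p_b
p_b_given_a = p_a_and_b / p_a

print(p_not_a, p_a_or_b, p_a_given_b * p_b, p_b_given_a * p_a)
```

Both products in the last line give P(A and B) again, which is exactly what the multiplication rule states.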
Contingency table
Absolute numbers Relative numbers
Probability rules: example
Questions
Uniform
Poisson
Binomial
Continuous distribution
Continuous random variable
The number of possible outcomes is uncountable
Probability for any specific value is zero [ P(Y = y) = 0 ]
Density function
E.g., The time to produce one product is on average 4 hours and has a standard deviation of 1 hour.
The variable follows a normal distribution
What’s the chance that it takes less than 3,25 hours to produce a product?
What’s the time to produce a product so that there’s 38% chance of exceeding it?
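The two questions can be sketched with Python's standard library (`statistics.NormalDist`). Note that Python uses a decimal point where these notes use a decimal comma. The variable names are chosen for this example only.

```python
from statistics import NormalDist

# Production time: normal with mean 4 hours and standard deviation 1 hour
production_time = NormalDist(mu=4, sigma=1)

# Chance that it takes less than 3,25 hours: area to the left of 3.25
p_less_than_3_25 = production_time.cdf(3.25)

# Time with a 38% chance of being exceeded = time with 62% of the area to its left
time_exceeded_38pct = production_time.inv_cdf(1 - 0.38)

print(round(p_less_than_3_25, 4), round(time_exceeded_38pct, 3))
```

The second question needs 1 - 0,38 because the inverse function works with the probability to the left of the value.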
Empirical distribution
Pragmatic approach
o Use it when the data do not seem to follow any of the other distributions
o For example: the age of the different employees
o Treat your sample as if it were the population, and count the share of
observations that satisfy the condition you’re interested in
Easy: count the number of people that match what you are researching (e.g. age below
20), then divide this by the total n
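A minimal sketch of the counting approach, with a hypothetical list of employee ages (the numbers are made up for illustration):

```python
# Hypothetical sample of employee ages (assumption, not course data)
ages = [19, 23, 31, 18, 45, 52, 27, 19, 38, 41]

# Empirical probability: share of observations that satisfy the condition
share_below_20 = sum(1 for age in ages if age < 20) / len(ages)

print(share_below_20)
```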
Use: lead time. A production process has different stages, each with its own time and variability.
Use these to look at the total production time and its variability, and calculate the chance that
the product will be finished within a certain time. This is used for negotiation: see notes.
E.g., the time to produce one product is on average 4 hours and has a standard deviation of 1 hour.
The variable follows a normal distribution.
What’s the chance that it takes less than 325 hours to produce the 100 products?
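A sketch of this example from both perspectives, assuming the 100 production times are independent (that assumption is not stated in the example itself). The sum of n independent normal variables has mean n·µ and standard deviation √n·σ; the average has mean µ and standard deviation σ/√n.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 4, 1, 100  # hours per product, number of products

# 'Sum of variables' perspective: total time for 100 products
total_time = NormalDist(mu=n * mu, sigma=sqrt(n) * sigma)
p_sum = total_time.cdf(325)

# Average perspective: average time per product below 325 / 100 = 3.25 hours
average_time = NormalDist(mu=mu, sigma=sigma / sqrt(n))
p_avg = average_time.cdf(325 / n)

print(p_sum, p_avg)
```

Both perspectives give the same answer, and it is practically zero: 325 hours lies 7,5 standard deviations below the expected total of 400 hours.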
Questions:
Pepsi
Every day Pepsi produces 10.000 cans. Legally, a can must contain at least 12 ounces of soda. The
government inspects 10 randomly chosen cans each day. If at least two are underweight, Pepsi is
fined. The filling follows a normal distribution with µ = 12,05 ounces and σ = 0,03 ounces.
Question: What is the probability that Pepsi will be fined on a given day?
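One possible way to approach this question, as a sketch: first use the normal distribution to get the chance that a single can is underweight, then treat the 10 inspected cans as a binomial experiment. Treating the cans as independent is an assumption (reasonable here because 10 cans out of 10.000 is a very small sample); the helper `binom_pmf` is written for this example.

```python
from math import comb
from statistics import NormalDist

# Step 1: chance that one can contains less than 12 ounces
filling = NormalDist(mu=12.05, sigma=0.03)
p_under = filling.cdf(12)

# Step 2: number of underweight cans among 10 ~ binomial(n=10, p=p_under)
def binom_pmf(k, n, p):
    """P(X = k) for a binomial distribution."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n = 10
# Fined if at least two are underweight: 1 - P(X = 0) - P(X = 1)
p_fined = 1 - binom_pmf(0, n, p_under) - binom_pmf(1, n, p_under)

print(round(p_under, 4), round(p_fined, 4))
```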
Investments
Your boss wants to invest $100 and asks you to deal with it. There are two investments on the
market for which you can either do the full investment or 50% of the investment
Investment A
Split investment: 50% A and 50% B
Using data:
Cross-sectional data
Time series data
Simple: only one explanatory variable, versus multiple: several explanatory variables
b0 gives the value of y when x is 0. This is not always a meaningful value: the regression line
is fitted on a data set with a certain range, and outside this range the line doesn’t “count”. So if
there are no data around x = 0, the b0 value doesn’t mean a lot either.
Concept of residuals
Fit a straight line
o Regression equation
o Observed values (y) versus fitted values (ŷ); the residual is the difference y − ŷ
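A minimal sketch of least squares for simple linear regression, with hypothetical data (the x and y values are made up). It uses the standard formulas b1 = Sxy / Sxx and b0 = mean(y) − b1·mean(x), then computes the fitted values and the residuals.

```python
# Hypothetical data (assumption, not from the course)
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Least squares estimates of slope (b1) and intercept (b0)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
b1 = sxy / sxx
b0 = mean_y - b1 * mean_x

# Fitted values and residuals (observed minus fitted)
fitted = [b0 + b1 * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

print(round(b0, 2), round(b1, 2))
print([round(e, 2) for e in residuals])
```

With an intercept in the model, the residuals always add up to (about) zero; least squares picks the line that makes the sum of the squared residuals as small as possible.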
Coefficient of determination
Coefficient of determination R²
o R² cannot decrease when new explanatory variables are added
o So, keep on adding variables then?
Concept of Adjusted R²
o For model comparison and selection only
o Correcting for sample size and nr of explanatory variables
o No direct interpretation of adjusted R²
1. Standard error of estimate: a kind of standard deviation, expressed in the unit of the variable,
so not very clear on its own
Rule of thumb: standard error of estimate / average of y should be below 0,2
2. Coefficient of determination:
1 is an excellent model and 0 is a bad model: the model is no better at predicting the value than
just taking the average
If we add a variable, R² can stay the same or go up but never down: if the variable doesn’t add
value, it gets (almost) no weight and the R² doesn’t change
Why not keep adding variables then? It makes the model too complicated: we want to
explain the most with the fewest variables
Intermezzo: watch out with explanatory variables that are strongly correlated with each other, for
example age and experience: weird things will happen in the equation. Delete the variable with the
lowest correlation with y. (The correlation between 2 explanatory variables should be between -0,4 and 0,4.)
If we then delete the variable, R² will stay the same or go down, but probably not by a lot,
because we kept the more strongly correlated variable of the 2
3. Adjusted R²:
Only for comparison and selection between different linear regression models
No direct interpretation
Adjusts the R² for the sample size and the number of explanatory variables
Best to choose the model with the highest adjusted R², even if the normal R² of another one is
higher
The normal R² gets higher and higher (or stays the same) as more variables are added, whereas the
adjusted R² first rises, then plateaus and goes down again as more variables are added.
If variables are not significant: delete them one by one, not several at the same time;
always make a new model first
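The three measures can be sketched as follows. The formulas are the usual textbook ones (R² = 1 − SSE/SST, adjusted R² = 1 − (1 − R²)(n − 1)/(n − k − 1), standard error of estimate = √(SSE/(n − k − 1)), with k the number of explanatory variables); the data and the function name `fit_measures` are assumptions for this illustration.

```python
from math import sqrt

def fit_measures(y, fitted, k):
    """Goodness-of-fit measures for a regression with k explanatory variables."""
    n = len(y)
    mean_y = sum(y) / n
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # unexplained variation
    sst = sum((yi - mean_y) ** 2 for yi in y)               # total variation
    r2 = 1 - sse / sst
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    se = sqrt(sse / (n - k - 1))
    return {"r2": r2, "adj_r2": adj_r2, "se": se, "se_over_mean": se / mean_y}

# Hypothetical observed values and fitted values from a simple regression (k = 1)
y = [2.1, 3.9, 6.2, 7.8, 10.1]
fitted = [2.04, 4.03, 6.02, 8.01, 10.0]

measures = fit_measures(y, fitted, k=1)
for name, value in measures.items():
    print(name, round(value, 4))
```

For the same R² and n, a larger k gives a lower adjusted R², which is why adding variables that explain almost nothing eventually makes the adjusted R² go down.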
Example – Delhaize
Excel
Probability
1. Make a pivot table of all the data
2. Select the two variables you want, put one in the columns and one in the rows, and also add one
of the two to the values
3. Summarize the values as a count and show them as a percentage for easy calculations
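The same steps outside Excel, as a rough sketch: count each combination of the two variables and divide by the total. The survey records below are hypothetical and only serve as an illustration.

```python
from collections import Counter

# Hypothetical survey records: (gender, preferred brand)
records = [
    ("F", "Pepsi"), ("F", "Coke"), ("M", "Pepsi"), ("M", "Pepsi"),
    ("F", "Coke"), ("M", "Coke"), ("F", "Pepsi"), ("M", "Pepsi"),
]

# Steps 1 and 2: count every row/column combination (absolute numbers)
counts = Counter(records)
total = len(records)

# Step 3: show the counts as a share of the total (relative numbers)
shares = {cell: count / total for cell, count in counts.items()}

# Probabilities can now be read from the table
p_f_and_pepsi = shares[("F", "Pepsi")]
p_pepsi = sum(s for (gender, brand), s in shares.items() if brand == "Pepsi")
p_f_given_pepsi = p_f_and_pepsi / p_pepsi

print(p_f_and_pepsi, p_pepsi, p_f_given_pepsi)
```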
Distribution
Normal
What’s the chance that it takes less than 3,25 hours to produce a product? (µ:4; σ:1)
=norm.dist(x; µ; σ; 1)
=norm.dist(3,25;4;1;1)
What’s the time to produce a product so that there’s 38% chance of exceeding it? (µ:4; σ:1)
=norm.inv(prob; µ; σ) with prob = the chance of staying below the value, so 1 − 0,38 = 0,62
=norm.inv(0,62; 4; 1)
Discrete
Poisson
=poisson.dist(x;µ; 0 or 1)
0: P(X = x)
1: P(X ≤ x)
Binom
=binom.dist(x; n; p; 0 or 1)
0: P(X = x)
1: P(X ≤ x)
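What the last argument (0 or 1) does can be sketched with the standard formulas for both distributions. The function names are chosen for this example; they are not Excel or library functions.

```python
from math import comb, exp, factorial

def poisson_pmf(x, mu):
    """P(X = x), like the 0 option of poisson.dist."""
    return exp(-mu) * mu**x / factorial(x)

def poisson_cdf(x, mu):
    """P(X <= x), like the 1 option: add up P(X = 0) ... P(X = x)."""
    return sum(poisson_pmf(k, mu) for k in range(x + 1))

def binom_pmf(x, n, p):
    """P(X = x), like the 0 option of binom.dist."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

def binom_cdf(x, n, p):
    """P(X <= x), like the 1 option: add up P(X = 0) ... P(X = x)."""
    return sum(binom_pmf(k, n, p) for k in range(x + 1))

# Hypothetical examples: on average 2 events per period; 10 trials with p = 0.5
print(round(poisson_pmf(0, 2), 4), round(poisson_cdf(1, 2), 4))
print(round(binom_pmf(2, 10, 0.5), 4), round(binom_cdf(2, 10, 0.5), 4))
```

For "at least" questions, use the complement: P(X ≥ x) = 1 − P(X ≤ x − 1).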
Continuous
Exponential
=expon.dist(x; λ; 1)
1: P(X ≤ x)
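The cumulative option of the exponential distribution follows the formula P(X ≤ x) = 1 − e^(−λx). A minimal sketch, with a hypothetical rate of 2 arrivals per minute (the example and the function name `expon_cdf` are assumptions):

```python
from math import exp

def expon_cdf(x, lam):
    """P(X <= x) for an exponential distribution with rate lam."""
    if x < 0:
        return 0.0
    return 1 - exp(-lam * x)

# Hypothetical example: on average 2 arrivals per minute (lam = 2),
# chance that the next arrival comes within half a minute
p_within_half_minute = expon_cdf(0.5, 2)

print(round(p_within_half_minute, 4))
```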
Questions