
Descriptive Statistics

MPA

Stephan Dietrich
s.dietrich@maastrichtuniversity.nl

14 Sept Tutorial Data visualisation in Stata
15 Sept Optional practice Voluntary Stata exercise (video solution published on Monday morning)
Week 3
18 Sept Lecture Constructing research frameworks, hypotheses and variables
19 Sept Tutorial Using research frameworks
20 Sept Lecture Working with quantitative data
21 Sept Tutorial Summary statistics
22 Sept Optional practice Voluntary Stata exercise (video solution published on Monday morning)
Week 4
25 Sept Lecture Research methods, sampling and quality

26 Sept Tutorial Choosing research methods

27 Sept Free for preparation of Public Policy Assignment

28 Sept Free for preparation of Public Policy Assignment

29 Sept Free for preparation of Public Policy Assignment

Week 5
2 Oct Lecture Estimating population means

3 Oct Tutorial Samples and population means

4 Oct Lecture Ethical dilemmas in social science research design

5 Oct Tutorial Presentation of draft research proposals

6 Oct Optional practice Voluntary Stata exercise (video solution published on Monday morning)

Week 6
9 Oct Lecture Hypothesis testing

10 Oct Tutorial Hypothesis testing

11 Oct Lecture Ordinary Least Squares regression (OLS)

12 Oct Tutorial Ordinary Least Squares regression (OLS)

Recap

Source: Gallup WorldPoll 2008-2020

Lecture 3: Descriptive statistics

Descriptive Statistics
• Central Tendency
• Spread
• Relations
• Box Plot

Mean
• The most well-known indicator to describe data is the (arithmetic) mean
• We add up values of a variable and divide the sum by the number of observations

$\bar{x}$ = mean value of $x$:

$$\bar{x} = \frac{x_1 + x_2 + \dots + x_N}{N}$$

Example:

Respondent  1  2  3  4  5
Variable    2  1  2  4  6

$$\bar{x} = \frac{2+1+2+4+6}{5} = 3$$

[Figure: value of the variable for each respondent]

Application: all observations measured in same unit
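
The calculation in the example can be reproduced in a few lines of Python. This is only a minimal sketch (the course itself works in Stata, and the variable names are illustrative):

```python
# Values reported by respondents 1 to 5 in the example.
values = [2, 1, 2, 4, 6]

# Mean = sum of the values divided by the number of observations.
mean = sum(values) / len(values)
print(mean)  # 3.0
```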
Mean

Source: Gallup WorldPoll 2008-2020

Median
• The median is the middle value when the data is ordered
• At least half of the values are greater than or equal to the median
• The median is less sensitive to outliers than the mean

How do we calculate the median?


1. We order values in ascending/descending order
2. We select the middle value (or the mean of the two middle values if we have an even number of values)

Original order:
Respondent  1  2  3  4  5
Value       2  1  2  4  6

Ordered by value:
Respondent  2  1  3  4  5
Value       1  2  2  4  6

Respondent 3 is in the middle of the ordered list (two respondents come before and two come after) → median value is 2
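
The two steps can be sketched in Python as follows. The function name `median` is illustrative, and the sixth value (8) in the second call is a hypothetical addition to show the even-numbered case:

```python
def median(values):
    """Middle value of the ordered data (mean of the two middle values if N is even)."""
    ordered = sorted(values)   # step 1: order the values
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:             # odd N: a single middle value
        return ordered[middle]
    # even N: mean of the two values in the middle
    return (ordered[middle - 1] + ordered[middle]) / 2

print(median([2, 1, 2, 4, 6]))     # 2
print(median([2, 1, 2, 4, 6, 8]))  # 3.0
```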
Percentiles
• The median is the middle value
• We can also select other points in the distribution than the middle value
• A percentile indicates the value below which a given percentage of observations in our data falls

Median = 50th percentile


25th percentile → 25% of values are smaller and 75% are larger
1st percentile → 1% of values are smaller and 99% are larger

[Figure: distribution from the smallest value (minimum) to the largest value (maximum), marking the 25th percentile, the median (50th percentile) and the 75th percentile]
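
Percentiles can be computed in several slightly different ways. The sketch below uses the simple "nearest rank" convention, which is an assumption made here for illustration; statistical software often interpolates between neighbouring values and may return slightly different numbers:

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: the smallest ordered value that has
    at least p% of the observations at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))  # 1-based position
    return ordered[rank - 1]

values = [2, 1, 2, 4, 6]
print(percentile(values, 25))  # 2
print(percentile(values, 50))  # 2
print(percentile(values, 75))  # 4
```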
Mode
• The value that appears most often in the data
• Can be useful for nominal data

Example:

Respondent 1 3 2 4 5
Variable NL NL SK IT DK

Count occurrence of values and sort in descending order: Mode = NL
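
The counting step can be sketched in Python with the standard library's `Counter`. The list name `countries` is illustrative, and ties (several values sharing the highest count) are not handled here:

```python
from collections import Counter

countries = ["NL", "NL", "SK", "IT", "DK"]

# Count how often each value occurs, most frequent first.
counts = Counter(countries).most_common()
mode, frequency = counts[0]
print(mode, frequency)  # NL 2
```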

• These indicators describe central tendency, but not the distribution of values
Deviation from the mean
• The mean does not indicate how spread out values are
• The same mean can come from very different distributions

[Figure: values of Sample 1 and Sample 2 plotted side by side, with the mean marked]

Same mean, but values in sample 1 are much more dispersed than values in sample 2!
Deviation from the mean

Data source: Eurobarometer, 2019


Deviation from the mean
• The distance of each observation from the mean tells us something about the distribution of the variable

• Symmetric distribution if the number of observations and the deviations below and above the mean are similar

• Otherwise, the distribution is skewed

Problem: sum of all distances from the mean is always 0!
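
Why the distances always cancel out follows directly from the definition of the mean. A short derivation, writing $\bar{x}$ for the mean and $N$ for the number of observations:

```latex
\sum_{i=1}^{N} (x_i - \bar{x})
  = \sum_{i=1}^{N} x_i - N\bar{x}
  = N\bar{x} - N\bar{x}
  = 0
```

For the values 2, 1, 2, 4, 6 with mean 3, the distances are -1, -2, -1, 1, 3, which add up to 0.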


Variance
• Distance from the mean works for single values but the sum of distances is 0
• Variance is an indicator for the dispersion of values
• Describes how far values are spread out from the mean in the data

How do we calculate the variance?

• Sum up the squared distances and divide the sum by the number of observations
• A squared distance can't be negative
• The sum of squares can be different from 0

Example
Suppose we ask 5 people for their income:
[Figure: income of respondents P1 to P5, with the mean marked]

$$\text{Variance} = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2$$

Respondent             1    2    3    4    5
Response              10    7    5    8   25
$\bar{x}$             11   11   11   11   11
$(x_i - \bar{x})^2$    1   16   36    9  196
Variance              258/5 = 51.6
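
The table can be reproduced step by step in Python. This is a minimal sketch that divides by N as the formula here does; note that statistical software often divides by N − 1 instead, which gives a slightly larger number:

```python
incomes = [10, 7, 5, 8, 25]

n = len(incomes)
mean = sum(incomes) / n                                 # 11.0
squared_distances = [(x - mean) ** 2 for x in incomes]  # 1, 16, 36, 9, 196
variance = sum(squared_distances) / n                   # divide by N
print(variance)  # 51.6
```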
Standard Deviation
• The variance is scale variant
→ Variance may change if we measure income in other currency
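• The standard deviation carries the same information as the variance, but is expressed in the same unit as the variable (the variance is in squared units), which makes it easier to interpret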
• "standard" way of measuring what is a normal deviation from the mean, and what is a large deviation

How do we calculate the standard deviation?


• Square root of the variance

$$\text{Standard Deviation} = \sqrt{\text{Variance}}$$

Respondent           1    2    3    4    5
Income              10    7    5    8   25
Variance            51.6
Standard Deviation  $s = \sqrt{51.6} \approx 7.18$
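
A minimal Python sketch of the same calculation (the function name `std_dev` is illustrative, and the variance is computed with divisor N). The second part doubles every income as a hypothetical change of unit, to show how the standard deviation reacts to rescaling:

```python
import math

def std_dev(values):
    """Square root of the variance (variance computed with divisor N)."""
    mean = sum(values) / len(values)
    variance = sum((x - mean) ** 2 for x in values) / len(values)
    return math.sqrt(variance)

incomes = [10, 7, 5, 8, 25]
sd = std_dev(incomes)
print(round(sd, 2))  # 7.18

# Hypothetical change of unit: every income is doubled.
sd_doubled = std_dev([2 * x for x in incomes])
print(round(sd_doubled / sd, 2))  # 2.0
```

Doubling the values doubles the standard deviation, while the variance is multiplied by four.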

Standard Deviation – Outlier Classification
Outlier:
• Data organization includes data cleaning
• Large outliers can bias our analysis

How can we identify outliers?
1. Rule out implausible responses
→ e.g. 40 working hours per day
2. Trimming
→ e.g. remove 1% of observations at the top and bottom of a variable
3. Use rule-of-thumb outlier classification
→ e.g. outlier if an observation is more than 3 standard deviations away from the mean

[Figure: income of respondents P1 to P5, with the mean marked]
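
The rule of thumb in point 3 can be sketched in Python as follows. The function name `flag_outliers` and its `threshold` parameter are illustrative, and the stricter threshold of 1.5 in the second call is a hypothetical value used only to show the effect of the cut-off:

```python
import math

def flag_outliers(values, threshold=3.0):
    """Return the values that lie more than `threshold` standard deviations from the mean."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in values) / n)
    return [x for x in values if abs(x - mean) > threshold * sd]

incomes = [10, 7, 5, 8, 25]
print(flag_outliers(incomes))                 # []
print(flag_outliers(incomes, threshold=1.5))  # [25]
```

With a mean of 11 and a standard deviation of about 7.18, the income of 25 is roughly 1.95 standard deviations above the mean, so the 3-standard-deviation rule does not flag it.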
Box Plots
• Visualizes statistical indicators
(median, 25th and 75th percentiles)

Box = interquartile range (25th to 75th percentile)

Median = 50th percentile

Whisker = extends at most 1.5 × interquartile range beyond the box (lower boundary)

Data source: Eurobarometer, 2019

graph box d11, over(d70) intensity(25) ytitle("AGE") title("Age by Life Satisfaction Category, 2019")
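
The numbers behind a box plot can also be computed by hand. The Python sketch below uses hypothetical ages (not the Eurobarometer data) and the "inclusive" quartile method of the standard library; other software may use a different percentile convention and return slightly different quartiles:

```python
import statistics

# Hypothetical ages, not taken from the Eurobarometer data.
ages = [18, 22, 25, 30, 35, 40, 45, 50, 95]

q1, median, q3 = statistics.quantiles(ages, n=4, method="inclusive")
iqr = q3 - q1                 # height of the box
lower_bound = q1 - 1.5 * iqr  # whiskers stay within these bounds
upper_bound = q3 + 1.5 * iqr
outside = [a for a in ages if a < lower_bound or a > upper_bound]

print(q1, median, q3)  # 25.0 35.0 45.0
print(iqr)             # 20.0
print(outside)         # [95]
```

Values outside the two bounds (here the age of 95) are the ones a box plot draws as separate points.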
Example Box Plot

Source: DHS data India, Pakistan, Nepal and Bangladesh 2000-2017


Covariance
• So far, our indicators describe single variables

• Often, we are interested in relations between variables

• Covariance describes the extent to which two variables move in the same direction

How do we calculate the covariance?


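• Covariance between variable x and y

$$s_{xy} = \frac{(x_1 - \bar{x})(y_1 - \bar{y}) + \dots + (x_N - \bar{x})(y_N - \bar{y})}{N}$$

Respondent    1    2    3    4    5
Income       10    7    5    8   25
Education    10   12    8   13   12
Covariance   23/5 = 4.6

Both means equal 11, so the products of the deviations are 1, -4, 18, -6 and 14, which sum to 23.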
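
The covariance for the five respondents can be checked with a short Python sketch (variable names are illustrative; the sum is divided by N, as in the formula):

```python
incomes = [10, 7, 5, 8, 25]
education = [10, 12, 8, 13, 12]

n = len(incomes)
mean_x = sum(incomes) / n    # 11.0
mean_y = sum(education) / n  # 11.0

# Product of the two deviations for each respondent: 1, -4, 18, -6, 14
products = [(x - mean_x) * (y - mean_y) for x, y in zip(incomes, education)]
covariance = sum(products) / n  # divide by N
print(sum(products), covariance)  # 23.0 4.6
```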

Correlation Coefficient
• Covariance is scale variant

• Difficult to compare magnitude of relations

• The correlation coefficient standardizes the covariance to make it scale invariant and comparable

• Correlation ranges between -1 (perfect negative correlation) and 1 (perfect positive correlation)

How do we calculate the correlation coefficient?

$$r = \frac{s_{xy}}{s_x s_y}$$
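
A minimal Python sketch of the formula, using the income and education values of the five respondents from the covariance example. The helper names `pop_cov` and `correlation` are illustrative, all quantities use divisor N, and doubling the incomes is a hypothetical change of unit:

```python
import math

def pop_cov(xs, ys):
    """Covariance with divisor N; pop_cov(xs, xs) is the variance of xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

def correlation(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    s_x = math.sqrt(pop_cov(xs, xs))
    s_y = math.sqrt(pop_cov(ys, ys))
    return pop_cov(xs, ys) / (s_x * s_y)

incomes = [10, 7, 5, 8, 25]
education = [10, 12, 8, 13, 12]

r = correlation(incomes, education)
print(round(r, 3))  # 0.358

# Hypothetical change of unit: doubling every income leaves r unchanged.
r_rescaled = correlation([2 * x for x in incomes], education)
print(round(r_rescaled, 3))  # 0.358
```

The covariance doubles when the incomes are doubled, but so does the standard deviation of income, so the ratio stays the same.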
Correlation Coefficient
Global economic growth and CO2 emissions, 1961-2015

• Scatterplot visualizes correlation

• Correlation coefficient measures the degree to which variables are linearly related (here r=-0.39)

• But correlation does not imply causation!

Source: World Bank World Development Indicators


What is the mode of life satisfaction?

Is the mean of LS greater than the median?

The data set comprises 24,780 observations. If we had fewer observations (say 200 fewer), would the standard deviation of LS be larger or smaller?

Questions?

