Representation and Descriptive Statistics for Univariate Data
B.S. Global Studies
Universitat Pompeu Fabra
Lecturer: Jaume Borràs
Dataset types
1. Cross-section data: Information about different individuals, observed
at a particular point in time
Example with multiple periods and multiple individuals (panel data rather than a pure
cross-section): annual unemployment rate from 2008 to 2021 for all EU countries
Graphical representation of univariate data
• Univariate data: data on a single variable
Why do we want to have a graphical representation of data?
1) Know the distribution of the variable (which values occur, and with which
frequency)
2) Compare different variables / datasets
Graphical representation: categorical variables
1. Bar chart: The height/length of each bar is proportional to the frequency
(absolute or relative) of the corresponding variable outcome
2. Pie chart: Size of each slice is proportional to the relative frequency of the
corresponding variable outcome
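Both charts are built from a frequency table. As a minimal sketch of how absolute and relative frequencies could be computed in Python, the sample below is invented for illustration and is not the data behind the slide's charts:

```python
from collections import Counter

# Hypothetical sample of preferred car brands (invented for illustration only).
preferences = ["Ford", "Opel", "Ford", "Dacia", "Opel", "Ford", "Nissan", "Ford"]

# Absolute frequency: how many times each outcome appears.
absolute = Counter(preferences)

# Relative frequency: absolute frequency divided by the number of observations.
n = len(preferences)
relative = {brand: count / n for brand, count in absolute.items()}

for brand in sorted(relative):
    print(f"{brand}: absolute={absolute[brand]}, relative={relative[brand]:.3f}")
```

The relative frequencies always add up to 1, which is why they map directly onto the slices of a pie chart.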
[Figure: bar chart and pie chart, each titled "Popularity of Car Brands in Barcelona". Bar chart vertical axis: % of Car Preferences (0 to 0.4). Brands shown: Dacia, Ford, Hyundai, Nissan, Opel, Renault, Suzuki, Volkswagen.]
Example: Level of satisfaction of 40 citizens with the public health care system, on a scale
from 0 (highly dissatisfied) to 10 (highly satisfied)
[Figure: two charts of the satisfaction data. Horizontal axis in both: Level of satisfaction (0 to 10). Vertical axes run from 0 to 10 and from 0 to 45.]
Graphical representation: continuous variables
• When dealing with continuous variables, we use a histogram rather than a bar chart
• The main differences: 1) on the horizontal axis, each bar corresponds to an interval rather than a single value and 2) the
"bars" are not separated
Example: CO2 emissions per capita in countries with population of over 20 million people
[Figure: histogram. Vertical axis: Frequency (0 to 9). Horizontal axis intervals: [0, 3), [3, 6), [6, 9), [9, 12), [12, 15), [15, 18), [18, 21).]
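To make the interval notation concrete, here is a rough Python sketch of how observations could be counted into half-open intervals of width 3, such as [0, 3) and [3, 6). The emission values are invented for illustration and are not the data behind the histogram:

```python
# Hypothetical CO2 emissions per capita (invented values, not the slide's data).
emissions = [0.5, 1.2, 2.9, 3.0, 4.7, 5.9, 6.1, 8.8, 9.0, 15.4, 20.9]

width = 3    # interval width, as in [0, 3), [3, 6), ...
n_bins = 7   # intervals from [0, 3) up to [18, 21)

# Assumes every value lies in [0, 21); a value outside that range would not fit any bin.
counts = [0] * n_bins
for value in emissions:
    index = int(value // width)  # [a, b): a is included, b falls in the next interval
    counts[index] += 1

for i, count in enumerate(counts):
    print(f"[{i * width}, {(i + 1) * width}): {count}")
```

Note that the value 3.0 is counted in [3, 6) and not in [0, 3), because the right end of each interval is excluded.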
Objectives of time series analysis:
• Understand the underlying patterns related to time: recognize trends
(positive or negative), as well as seasonal and cyclical patterns
• Use known data to predict the future evolution of the data
• Examples: daily variation of the price of oil, mortality rate per year,
seasonal sales forecasts per quarter, monthly precipitation in a specific
location
More on time series
The 3 principal components of a time series:
1. Trend: long-term variation, which can be positive (increasing) or negative (decreasing), e.g.
average life expectancy at birth
2. Periodic (seasonal): any regular variation that is easy to predict, usually repeating over a
12-month period or over quarters.
3. Cyclical: differs from a periodic component in that it usually lasts longer and
occurs at irregular intervals, e.g. the four phases of the business cycle: peak - recession -
depression - expansion
Descriptive Statistics for Univariate Data
• We have seen the visual representation of univariate data
Mean
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
Where:
• n is the sample size, i.e. the number of observations
• x_1, x_2, …, x_n are the data: observations 1, 2, 3, …, n of the variable x
Translation: you sum all the values of the variable and divide by the number of observations
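The computation of the mean can be sketched in a few lines of Python. The function name `mean` and the sample values are illustrative choices, not part of the original slides:

```python
def mean(values):
    """Sum all the values and divide by the number of observations."""
    n = len(values)
    if n == 0:
        raise ValueError("the mean of an empty sample is undefined")
    return sum(values) / n


# Hypothetical sample, invented for illustration.
sample = [2, 4, 6, 8]
print(mean(sample))  # (2 + 4 + 6 + 8) / 4 = 5.0
```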
Median
• The median is the value of the variable which is larger than half of the
observations and smaller than the other half
• It's the "middle value"
• To compute the median, sort the data in increasing order, then find
the middle point
CAREFUL! The procedure to follow if the number of observations is
even is not the same as if the number of observations is odd!
Median: odd number of values
EXAMPLE: number of subjects taken at university
3 4 1 5 3 4 5
This case is easy: with only 7 observations, we can sort them and spot the middle value by eye. But what do we do
with a large number of observations? The middle is hard to spot visually, so we compute the
position at which the median lies.
Position = (n+1)/2 → Example: with 101 observations, the median is the value in position (101+1)/2 = 51 of the sorted data
Median: even number of values
As before, we first sort the numbers in increasing order and then find the middle point
PROBLEM: two values are in the middle! In this case, we take the
mean of both numbers to find the median
The same issue arises with a large sample: we compute the position using the same formula. In this case,
the position is not a whole number, which means that we have to take the two neighbouring values and compute
their mean. Example: if n = 200, Position = 201/2 = 100.5 → we take the average of the values in positions 100 and
101
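The odd and even cases can be put together in one short Python sketch. The function name `median` is an illustrative choice, and the even-sized sample is invented (the first six values of the subjects example):

```python
def median(values):
    """Sort the data, then take the middle value (odd n) or the mean of the two middle values (even n)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty sample is undefined")
    if n % 2 == 1:
        position = (n + 1) // 2        # 1-based position, e.g. (7 + 1) / 2 = 4
        return ordered[position - 1]   # Python lists are 0-based
    # Even n: (n + 1) / 2 is not a whole number, so average positions n/2 and n/2 + 1.
    lower = ordered[n // 2 - 1]
    upper = ordered[n // 2]
    return (lower + upper) / 2


# Odd case from the slides: sorted data are 1 3 3 4 4 5 5, so the 4th value is the median.
print(median([3, 4, 1, 5, 3, 4, 5]))  # 4

# Even case (hypothetical sample): sorted data are 1 3 3 4 4 5, so the median is (3 + 4) / 2.
print(median([3, 4, 1, 5, 3, 4]))     # 3.5
```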