TOPIC 2: Graphical Representation and Descriptive Statistics for Univariate Data
B.S. Global Studies
Universitat Pompeu Fabra
Lecturer: Jaume Borràs
Dataset types
1. Cross-section data: Information about different individuals, observed at a particular point in time
Example: household (multiple individuals) expenditure on tobacco during July 2023 (one period)

2. Time series data: Information about a single individual, observed at different time periods
Example: Spain (one individual) unemployment rate from 2006 to 2023 (multiple periods)

3. Panel data: Information about several individuals, observed over several time periods (a combination of the two types above)
Example: annual unemployment rate from 2008 to 2021 (multiple periods) for all EU countries (multiple individuals)

Graphical representation of univariate data
• Univariate data: from 1 variable only
Why do we want to have a graphical representation of data?
1) Know the distribution of the variable (which values it takes, and how frequently)
2) Compare different variables / datasets
Graphical representation: categorical variables
1. Bar chart: The height/length of each bar is proportional to the frequency
(absolute or relative) of the corresponding variable outcome

2. Pie chart: Size of each slice is proportional to the relative frequency of the
corresponding variable outcome
[Figure: "Popularity of Car Brands in Barcelona", shown as a bar chart (horizontal axis: Car Brand; vertical axis: % of Car Preferences, from 0 to 0,4) and as a pie chart. Brands: Dacia, Ford, Hyundai, Nissan, Opel, Renault, Suzuki, Volkswagen]
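Both charts are built from a frequency table. A minimal sketch of how the absolute and relative frequencies of a categorical variable could be computed is given here; the list `answers` is made-up data for illustration, not the data behind the slide.

```python
from collections import Counter

# Hypothetical survey answers (made-up data, for illustration only).
answers = ["Ford", "Opel", "Ford", "Dacia", "Nissan", "Ford", "Opel", "Dacia"]

# Absolute frequency: how many times each brand appears.
absolute = Counter(answers)

# Relative frequency: share of each brand in the total number of answers.
n = len(answers)
relative = {brand: count / n for brand, count in absolute.items()}

for brand in sorted(absolute):
    print(f"{brand:8s} absolute={absolute[brand]} relative={relative[brand]:.3f}")
```

The bar heights would be either set of values; the pie slices would use the relative frequencies, which add up to 1.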


Graphical representation: discrete variables
In addition, the cumulative frequency (absolute/relative) can be visualized (we
still use pie charts and bar charts when the variable is clearly countable)

Example: Level of satisfaction with public health care system of 40 citizens, with a scale
from 0 (highly dissatisfied) to 10 (highly satisfied)

[Figure: two bar charts with horizontal axis "Level of satisfaction" (0 to 10). One shows the absolute frequency, the other the cumulative absolute frequency]
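The cumulative frequency is a running total of the absolute frequencies. The sketch below shows one possible way to compute it; the 40 scores in `scores` are invented for illustration and are not the survey data from the slide.

```python
from collections import Counter
from itertools import accumulate

# Made-up satisfaction scores (0 to 10) for 40 citizens, for illustration only.
scores = ([0] * 1 + [2] * 2 + [3] * 3 + [4] * 4 + [5] * 8
          + [6] * 7 + [7] * 6 + [8] * 5 + [9] * 3 + [10] * 1)

counts = Counter(scores)
values = list(range(11))                      # every possible outcome, 0 to 10

# Absolute frequency of each outcome (a Counter gives 0 for missing outcomes).
absolute = [counts[v] for v in values]

# Cumulative absolute frequency: observations with a score <= v.
cumulative = list(accumulate(absolute))

# Cumulative relative frequency: the same, as a share of all observations.
relative_cumulative = [c / len(scores) for c in cumulative]

for v, a, c in zip(values, absolute, cumulative):
    print(f"score={v:2d} absolute={a:2d} cumulative={c:2d}")
```

The last cumulative value always equals the number of observations (here 40).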
Graphical representation: continuous variables
• When dealing with continuous variables, we will use a histogram rather than a bar chart
• The main differences: 1) on the horizontal axis each bar corresponds to an interval rather than a single number and 2) the "bars" are not separated
Example: CO2 emissions per capita in countries with population of over 20 million people
[Figure: histogram. Horizontal axis: CO2 emission per capita (intervals) [0, 3), [3, 6), [6, 9), [9, 12), [12, 15), [15, 18), [18, 21). Vertical axis: Frequency, from 0 to 9]


Histograms are a way to see how a variable is distributed. In the following lectures we will talk about particular types of distribution. Keep in mind: histograms are not the only way to represent continuous variables (tbd)
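Behind a histogram there is a count of observations per interval. A minimal sketch of that grouping step follows, assuming left-closed, right-open intervals of width 3 like those in the example; the values in `emissions` are made up and are not the real country data.

```python
# Made-up CO2 emissions per capita, for illustration only.
emissions = [0.5, 1.2, 2.9, 3.0, 4.7, 5.5, 7.1, 8.0, 9.3, 14.8, 15.0, 20.4]

width = 3
edges = list(range(0, 22, width))    # interval limits: 0, 3, 6, ..., 21

# Count the observations falling in each interval [lo, hi).
counts = {}
for lo, hi in zip(edges[:-1], edges[1:]):
    counts[f"[{lo}, {hi})"] = sum(lo <= x < hi for x in emissions)

for interval, frequency in counts.items():
    print(f"{interval:9s} {frequency}")
```

Note that a value exactly on a limit, such as 3.0, is counted in the interval that starts there, [3, 6), and not in [0, 3).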
Graphical representation: time series (continuous data)
With time series we usually focus on the relationship between a numerical variable and time, that is, on how the variable evolves over time (in economics, the analysis is not limited to that).

Objectives:
• Understand the underlying patterns related to time, recognize trends (positive or negative), as well as seasonality and cyclical patterns
• Use known data to predict future evolution of the data

• Examples: daily variation of the price of oil, mortality rate per year, seasonal sales forecasts per quarter, monthly precipitation in a specific location
More on time series
The 3 principal components of a time series:

1. Trend: long-term variation that can be positive (increasing) or negative (decreasing), e.g. average life expectancy at birth
2. Periodic (seasonal): any regular variation that is easy to predict, usually over a 12-month period or over quarters.
3. Cyclical: differs from a periodic component in that it usually lasts longer and occurs at irregular intervals, e.g. the four phases of the business cycle: peak - recession - depression - expansion
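One common way to make a trend visible when a seasonal pattern is present is a moving average over one full seasonal period. This technique is not part of the slides; the sketch below is only an illustration, and the quarterly figures in `sales` are made up.

```python
# Made-up quarterly sales for three years: a seasonal pattern that repeats
# every 4 quarters, on top of a rising trend. For illustration only.
sales = [10, 14, 8, 12,
         12, 16, 10, 14,
         14, 18, 12, 16]

window = 4   # one full seasonal period (4 quarters)

# Average of every run of 4 consecutive quarters. Each average contains one
# quarter of each type, so the seasonal ups and downs cancel out.
trend = [sum(sales[i:i + window]) / window
         for i in range(len(sales) - window + 1)]

print(trend)
```

The raw series goes up and down from quarter to quarter, while the averaged series rises steadily, which is the positive trend.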
Descriptive Statistics for Univariate Data
• We have seen visual representation of univariate data

• We will move now to “computations”

• In this section we will explore the main analytical tools to summarize information of a single variable: mean, median, quartile, kurtosis…
Central Tendency Measures
• The following measures are used to get a sense of “what is the most
common feature of the variable”. Hence, they are named “central tendency
measures”
1. Mode: most frequent value(s). For categorical and discrete variables.
2. Modal class: most frequent class(es) of grouped data (there can be more than one). For continuous variables, or discrete variables with many values.
3. (Arithmetic) Mean: the average of the data. For numerical variables only.
4. Median: the middle point of the data after sorting it in increasing order. For numerical variables only.
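The mode and the modal class can be sketched with the same counting step. In the example below, `modes` is a hypothetical helper name, `subjects` is a small discrete sample, and `heights` is made-up data grouped into classes of width 10 (an assumption chosen only for illustration).

```python
from collections import Counter

def modes(values):
    """Return every value that reaches the highest frequency, sorted."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

# Discrete variable: number of subjects taken at university.
subjects = [3, 4, 1, 5, 3, 4, 5]
print(modes(subjects))           # three values tie, so there are three modes

# Continuous variable (made-up heights in cm), grouped into classes of width 10.
heights = [158, 162, 165, 167, 171, 172, 174, 178, 181, 190]
width = 10
lower_limits = [(h // width) * width for h in heights]   # 171 -> class [170, 180)

for lower in modes(lower_limits):
    print(f"modal class: [{lower}, {lower + width})")
```

The first call shows why the definition says "value(s)": a dataset can have more than one mode.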
Mode
Modal Class
Mean
• To compute the mean, the formula is:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
Where:
• $n$ is the size of the sample, i.e. the number of observations
• $x_1, x_2, \ldots, x_n$ are the data: observations 1, 2, 3, …, n of the variable $x$
Translation: you sum all the values of the variable and divide by the number of observations
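As a small illustration of the formula, the mean can be computed by hand and checked against Python's standard library. The values in `values` are an illustrative sample.

```python
import math
import statistics

# Illustrative sample: number of subjects taken by 7 students.
values = [3, 4, 1, 5, 3, 4, 5]

n = len(values)            # number of observations
total = sum(values)        # sum of all the values
mean = total / n           # the mean: sum divided by the number of observations

print(f"n = {n}, sum = {total}, mean = {mean:.2f}")

# Cross-check with the standard library.
assert math.isclose(mean, statistics.fmean(values))
```

Here the sum is 25 and there are 7 observations, so the mean is 25/7, roughly 3,57.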
Median
• The median is the value of the variable which is larger than half of the
observations and lower than half of the observations
• It’s the “middle value”
• To compute the median, sort the data by increasing values, then find
the middle point
CAREFUL! The procedure to follow if the number of observations is
even is not the same as if the number of observations is odd!
Median: odd number of values
EXAMPLE: number of subjects taken at university

3 4 1 5 3 4 5

1) Sort the values in increasing order: 1,3,3,4,4,5,5

2) Find the value that lies in the middle

This case is easy: we have only 7 observations. But what do we do if we have a large number of observations? Visually, it is difficult to spot the middle, so we compute the position at which the median lies.
Position = (n+1)/2 → Example: if we have 101 observations, the median is in position (101+1)/2 = 51
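A minimal sketch of the odd case, applying the position formula to the 7 values of the example. Positions in the formula start at 1, while Python lists start at index 0, hence the `- 1`.

```python
values = [3, 4, 1, 5, 3, 4, 5]

ordered = sorted(values)        # step 1: sort in increasing order
n = len(ordered)                # 7 observations (odd)

position = (n + 1) // 2         # step 2: position of the middle value (1-based)
median = ordered[position - 1]  # Python indexes from 0

print(ordered, "position:", position, "median:", median)
```

With 7 observations the position is (7+1)/2 = 4, and the 4th sorted value is 4.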
Median: even number of values

As before, we first sort the numbers in increasing order and then find the middle point
PROBLEM: two values are in the middle! In this case, we take the mean of both numbers to find the median

The same problem arises with the sample size: we compute the position using the same formula. In this case the position is a decimal number, which means that we have to take 2 numbers and compute their mean! Example: if n=200, Position = 201/2 = 100,5 → we need to take the average of the values in the 100th and 101st places
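Both cases can be handled by one procedure based on the position formula. The function below is a sketch of that procedure; the name `median_by_position` is made up for this example.

```python
def median_by_position(values):
    """Median via Position = (n + 1) / 2, for odd and even n."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty dataset is not defined")

    position = (n + 1) / 2                 # 1-based, may end in .5
    if position.is_integer():              # odd n: one middle value
        return ordered[int(position) - 1]

    # Even n: the position ends in .5, so average the two neighbours,
    # e.g. position 100.5 means the 100th and 101st values.
    lower = ordered[int(position) - 1]
    upper = ordered[int(position)]
    return (lower + upper) / 2


print(median_by_position([3, 4, 1, 5, 3, 4, 5]))   # odd number of values
print(median_by_position([1, 3, 3, 4, 4, 5]))      # even number of values
print(median_by_position(range(1, 201)))           # n = 200
```

For n = 200 with the values 1 to 200, the 100th and 101st values are 100 and 101, so the median is 100,5.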
