Statistics I

Topic 2: Analysis of univariate data


Contents
1. Representations and graphs
   - Frequency tables
   - Bar and pie charts, pictograms, histograms, frequency polygons. Other graphs. Lying with graphs
2. Numerical measures to summarize and describe data:
   - Central tendency (mean, median, mode)
   - Location (quartiles and percentiles). Box plots
   - Spread (variance, standard deviation, quasi-variance, quasi-standard deviation, range, IQR, coefficient of variation)
   - Shape (coefficients of skewness and kurtosis)

Recommended reading
- Peña, D., Romo, J. Introducción a la Estadística para las Ciencias Sociales (1997). Chapters 2, 3, 4 and 5.
- Newbold, P. Statistics for Business and Economics (2008). Chapters 1 and 2.
- Triola, M.F. Essentials of Statistics, 5th Global ed. Chapters 2 and 3.
- Triola, M.F. Estadística, 12 ed. Chapters 2 and 3.
Description of categorical variables

- Sample: 46 employees of a US company
- Variable: EDUC: education level (1=High School; 2=College; 3=Advanced Degree)
- Variable: MGT: management position (1=yes; 0=no)

We want to extract information from these raw data. How?


The data (file employees.txt) in R
Converting categorical data coded as numbers into R factors
- Since variables EDUC and MGT are coded as numbers, R treats them by default as if they were numeric variables
- We have to tell R that they are categorical variables (factors in R parlance)
- R allows us to specify level names for the classes
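The conversion can be sketched as follows. The file employees.txt is not reproduced here, so the EDUC codes are rebuilt from the counts reported for this sample (14, 19 and 13 employees); the object names `educ_codes` and `educ` are illustrative assumptions, not names from the course files.

```r
# EDUC codes rebuilt from the reported counts (assumption: row order is arbitrary)
educ_codes <- rep(c(1, 2, 3), times = c(14, 19, 13))

# As plain numbers, R would treat the codes as a numeric variable
is.numeric(educ_codes)

# Declare the variable as a factor and give each code a readable level name
educ <- factor(educ_codes,
               levels = c(1, 2, 3),
               labels = c("High School", "College", "Advanced Degree"))

levels(educ)
table(educ)
```

With a real data frame the same call would typically be applied to a column, for example `employees$EDUC`, assuming that is how the file was imported.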
Describing categorical variables: frequency table and bar chart

Education level    Number of employees    Proportion of employees
High School        14                     0.304
College            19                     0.413
Advanced Degree    13                     0.283
Total              46                     1
Structure of a frequency table

Class (category): ci    Absolute frequency: ni    Relative frequency: fi
c1                      n1                        f1 = n1/n
c2                      n2                        f2 = n2/n
...                     ...                       ...
ck                      nk                        fk = nk/n
Total                   n                         1

Note:
- ni = number of individuals of class ci in the sample
- fi = ni/n
- 0 ≤ fi ≤ 1
Bar charts

- Bars are of the same width and equally spaced; their heights represent frequencies
- There are gaps between bars
- Bars are labeled with class names (or codes)
Bar graphs in R
Frequency tables in R
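A minimal sketch of both steps in base R, assuming the EDUC factor is rebuilt from the reported counts (14, 19, 13) because the data file is not available here; the object names are illustrative.

```r
# EDUC rebuilt from the reported counts (assumption), with levels in their natural order
educ <- factor(rep(c("High School", "College", "Advanced Degree"),
                   times = c(14, 19, 13)),
               levels = c("High School", "College", "Advanced Degree"))

n_i <- table(educ)        # absolute frequencies
f_i <- prop.table(n_i)    # relative frequencies, n_i / n

# Frequency table with both columns side by side
freq_table <- cbind(n_i = n_i, f_i = round(f_i, 3))
freq_table

# Bar chart: equal-width bars separated by gaps, heights = absolute frequencies
barplot(n_i, ylab = "Number of employees", main = "EDUC")
```

Passing `f_i` instead of `n_i` to `barplot()` would give the same picture on the relative-frequency scale.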
Other graphics for categorical variables: the pie chart

- Each pie sector is a fraction of the circle
- Sectors are labeled with their corresponding class names
- Computer software typically orders classes in alphabetical order
- Pie charts are visually engaging, but relative sector sizes are harder to assess correctly than in bar charts
- Avoid 3D pie charts: 3D perspective distorts our perception of relative sector sizes
Pie chart in R
[Figure: pie chart of EDUC with sectors High School, College and Advanced Degree]
Other graphics: the Pareto chart

- Bar chart in which the variable classes are ranked in decreasing order of frequency
- It only applies to nominal categorical variables
- Useful to identify the more relevant classes

The Pareto Principle (80/20 rule)
Pareto stated (c. 1896) that, typically, about 80% of the effects come from 20% of the possible causes
Example:
- 20% of the population owns about 80% of the wealth
- 80% of the population owns the remaining 20%
The Pareto chart: example

- Sample: Among the 1,100 visitors of the art exhibition Turner and the Masters (Prado Museum, 2010), those who bought their tickets online (20.3%). Source: Institute for Tourism Studies
- Variable: Main reason for buying the ticket online

Table 9. Visitors by main reason for buying the ticket online
Filter: bought the ticket online

Reason                                            %
Convenience                                    60.5
Speed                                          10.1
I can choose the day and time of the visit     14.0
I do not have to queue at the ticket office     9.5
The ticket is cheaper                           4.3
24-hour availability                            1.2
I had heard good things about the service       0.4
Total                                         100.0
The Pareto chart in R: Importing data
Import data from Excel file visitorsPrado.xlsx
The Pareto chart in R (not available in R Commander)

Need to install and load the package qcc

install.packages("qcc")  # run once
library(qcc)
with(visitorsPrado, pareto.chart(Frecuencia.rel., names.arg=Motivo, las=1))
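The qcc call needs the Excel file, which is not reproduced here. As a hedged alternative, the sketch below builds a Pareto-style chart in base R from the percentages of Table 9; the short English labels and object names are assumptions made for illustration, and this is not the qcc implementation.

```r
# Percentages from Table 9; labels are shortened English translations (assumption)
pct <- c("Convenience" = 60.5, "Speed" = 10.1, "Choose day and time" = 14.0,
         "No queue" = 9.5, "Cheaper" = 4.3, "24-hour availability" = 1.2,
         "Good reviews" = 0.4)

pareto  <- sort(pct, decreasing = TRUE)  # classes ranked by decreasing frequency
cum_pct <- cumsum(pareto)                # cumulative percentage

round(cum_pct, 1)

# Bars in decreasing order, with the cumulative percentage drawn over them
mids <- barplot(pareto, las = 2, ylim = c(0, 100), ylab = "%")
lines(mids, cum_pct, type = "b")
```

The cumulative line makes it easy to read how many classes are needed to cover a given share of the answers.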
Other graphics: pictograms
- Sample: 70 university students from Madrid
- Variable: Preferred political party

Preferred political party    Number of students    Proportion of students
PSOE                         23                    0.33
PP                           15                    0.21
Unidas Podemos               20                    0.29
Ciudadanos                    7                    0.10
Others                        5                    0.07
Total                        70                    1

The area of each class's pictogram is proportional to its frequency


Exercise
Results from a survey among 15–20 year-olds about their favorite leisure activity

- What is the variable and who are the individuals?
- For what percentage is reading the preferred leisure activity?
Exercise

From a test taken by a group of students, graded between 1 and 8, the following frequency table was obtained (some cells are left blank):

Grade, ci    ni    fi
1            4     0.08
2            4
3                  0.16
4            7     0.14
5            5
6            10
7            7     0.14
8

- How many students took the test?
- What percentage of students obtained a grade of 6 or more?
Exercise

In a survey, 30 randomly chosen university students were asked about their favorite sport. The results are shown in the following table:

Sport, ci     ni    fi
Basketball    12    0.4
Swimming       3    0.1
Football       9    0.3
None           6    0.2
Total         30    1

Which of the following charts represents the above data?


Exercise
[Figure: four candidate bar charts, a) to d), each titled "Deporte" (Sport), with a vertical axis from 0 to 14 and the categories Baloncesto (Basketball), Natación (Swimming), Fútbol (Football) and Ningún deporte (None)]
Description of discrete numeric variables: frequency table
- Sample: 100 shopping malls in which a promotion of a certain service was launched last November
- Variable: number of new customers of the service

        Absolute        Relative        Cumulative absolute    Cumulative relative
ci      frequency ni    frequency fi    frequency Ni           frequency Fi
0       1               0.01            1                      0.01
1       4               0.04            5                      0.05
2       7               0.07            12                     0.12
3       8               0.08            20                     0.20
4       8               0.08            28                     0.28
5       16              0.16            44                     0.44
6       18              0.18            62                     0.62
7       14              0.14            76                     0.76
8       10              0.10            86                     0.86
9       11              0.11            97                     0.97
10      3               0.03            100                    1.00
Total   100             1
Description of discrete numeric variables: frequency table

- What percentage of the sampled malls gained exactly 5 new customers?
- How many malls attracted at least 3 new customers?
- How many malls attracted fewer than 6 new customers?
- What percentage of the sampled malls gained between 4 and 8 new customers?
- What percentage of malls gained at most 7 new customers?
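One way to check answers to these questions is to rebuild the frequency table in R. The sketch below uses the absolute frequencies of the 100 malls; it assumes "between 4 and 8" includes both endpoints, and the object names are illustrative.

```r
c_i <- 0:10
n_i <- c(1, 4, 7, 8, 8, 16, 18, 14, 10, 11, 3)

f_i <- n_i / sum(n_i)   # relative frequencies
N_i <- cumsum(n_i)      # cumulative absolute frequencies
F_i <- cumsum(f_i)      # cumulative relative frequencies

data.frame(c_i, n_i, f_i, N_i, F_i)

pct_exactly_5  <- 100 * f_i[c_i == 5]                    # one row of f_i
n_at_least_3   <- sum(n_i[c_i >= 3])                     # n minus N_i at class 2
n_fewer_than_6 <- N_i[c_i == 5]                          # cumulative count up to 5
pct_4_to_8     <- 100 * sum(f_i[c_i >= 4 & c_i <= 8])    # endpoints included (assumption)
pct_at_most_7  <- 100 * F_i[c_i == 7]                    # one row of F_i
```

Questions phrased with "at most" or "fewer than" read directly off the cumulative columns, while "exactly" uses the non-cumulative ones.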
Description of discrete numeric variables: the bar chart
Bar charts can also be used for discrete data if there are not too
many different values
Description of discrete numeric variables: structure of a frequency table

             Absolute     Relative      Cumulative absolute    Cumulative relative
Class, ci    freq., ni    freq., fi     freq., Ni              freq., Fi
c1           n1           f1 = n1/n     N1 = n1                F1 = f1
c2           n2           f2 = n2/n     N2 = N1 + n2           F2 = F1 + f2
...          ...          ...           ...                    ...
ck           nk           fk = nk/n     Nk = n                 Fk = 1
Total        n            1

Note:
- c1 < c2 < ... < ck
- ni = number of individuals of class ci in the sample, fi = ni/n
- Ni = Ni−1 + ni, Fi = Fi−1 + fi
- 0 ≤ fi, Fi ≤ 1
- Fi and Ni also make sense for ordinal categorical variables
Ordinal categorical variables: cumulative frequencies

- Sample: 901 employees.
- Variable: satisfaction levels (S=satisfied, V=very, U=unsatisfied)

         Absolute     Relative     Cumulative absolute    Cumulative relative
Class    frequency    frequency    frequency              frequency
VU       62           0.07         62                     0.07
U        108          0.12         170                    0.19
S        319          0.35         489                    0.54
VS       412          0.46         901                    1.00
Total    901          1
Ordinal categorical variables: bar charts with cumulative frequencies

Beware! Many software programs arrange classes in alphabetical order when the variable is categorical. If the variable is ordinal, its classes must be arranged in their natural increasing order
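In R, the natural order can be imposed by declaring the levels explicitly. The following sketch rebuilds the satisfaction variable from the reported counts (62, 108, 319, 412); the object names are illustrative assumptions.

```r
# Counts taken from the satisfaction table
counts <- c(VU = 62, U = 108, S = 319, VS = 412)

# Without explicit levels, factor() sorts the classes alphabetically
alphabetical <- levels(factor(rep(names(counts), times = counts)))
alphabetical

# Declaring the levels fixes the order from very unsatisfied to very satisfied
satisfaction <- factor(rep(names(counts), times = counts),
                       levels = c("VU", "U", "S", "VS"), ordered = TRUE)

N_i <- cumsum(table(satisfaction))             # cumulative absolute frequencies
F_i <- round(N_i / length(satisfaction), 2)    # cumulative relative frequencies

barplot(N_i, ylab = "Cumulative absolute frequency")
```

Cumulative frequencies only make sense once the levels follow the ordinal scale, which is why the `levels` argument matters here.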
Bar charts for discrete data
- Sample: 46 employees of a company
- Variable: EXPRNC: years working in the company

Experience, ci Absolute freq., ni Relative freq., fi
1 5 0.109
2 4 0.087
3 4 0.087
4 4 0.087
5 3 0.065
6 4 0.087
7 1 0.022
8 4 0.087
10 4 0.087
11 2 0.043
12 2 0.043
13 2 0.043
14 1 0.022
15 1 0.022
16 3 0.065
17 1 0.022
20 1 0.022
Total 46 1
Description of discrete numeric variables: the bar chart

Too many different values!


Description of continuous numeric variables

- Sample: 46 employees of a company
- Variable: EXPRNC: years of experience
- Variable: SALARY: annual gross salary (in €)
Grouping in class intervals: numeric data

Class interval    Class mark (midpoint)      ni     fi     Ni     Fi
[ℓ0, ℓ1]          c1 = (ℓ0 + ℓ1)/2           n1     f1     N1     F1
(ℓ1, ℓ2]          c2 = (ℓ1 + ℓ2)/2           n2     f2     N2     F2
...               ...                        ...    ...    ...    ...
(ℓk−1, ℓk]        ck = (ℓk−1 + ℓk)/2         nk     fk     n      1
Total                                        n      1

Note:
- In R the left end-point is excluded and the right end-point is included (default option), except for the first interval
- Useful for tabulating discrete variables with many possible values
Grouping in class intervals
- Very often class intervals have the same width
- Determine the width w of each interval by

  w = (largest number − smallest number) / (number of desired intervals)

- How many intervals? Between 5 and 20 (practice and experience):

  Sample size       Number of classes
  Less than 50      5–7
  50 to 100         7–8
  101 to 500        8–10
  501 to 1000       10–11
  1001 to 5000      11–14
  More than 5000    14–20

- Class intervals cannot overlap
- Round up the interval width to get convenient interval endpoints
Grouping in class intervals

- Find range: 20 − 1 = 19
- Select number of classes: say k = √46 = 6.78 ≈ 7
- Compute interval width: 19/7 = 2.71 ⇒ 3.
- Determine the end-points (beginning before the first one and ending after the last one): [0, 3], (3, 6], . . . , (18, 21]
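These steps can be sketched in R as follows. The EXPRNC values are rebuilt from the frequency table of the 46 employees, since the data file is not available here; the object names are illustrative assumptions.

```r
# EXPRNC rebuilt from its frequency table (values and absolute frequencies)
exprnc <- rep(c(1:8, 10:17, 20),
              times = c(5, 4, 4, 4, 3, 4, 1, 4, 4, 2, 2, 2, 1, 1, 3, 1, 1))

r <- max(exprnc) - min(exprnc)       # range: 20 - 1 = 19
k <- round(sqrt(length(exprnc)))     # sqrt(46) = 6.78, rounded to 7
w <- ceiling(r / k)                  # 19/7 = 2.71, rounded up to 3

# End-points 0, 3, ..., 21: start before the minimum, end after the maximum
breaks <- seq(0, by = w, length.out = k + 1)

# include.lowest = TRUE closes the first interval on the left: [0,3], (3,6], ...
classes <- cut(exprnc, breaks = breaks, include.lowest = TRUE)
table(classes)
```

Note that `cut()` leaves the first interval open on the left unless `include.lowest = TRUE` is given, so a value equal to the first end-point would otherwise be dropped.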
Description of numeric variables: histogram
- There are no gaps between the bars/bins
- Bin widths = widths of class intervals (identical); class boundaries are marked on the horizontal axis
- Bin heights = frequencies (here, absolute)
- Bin areas are proportional to the frequencies
Description of numeric variables: histogram
Histogram and frequency polygon
Histograms in R Commander

Note that the chosen Number of bins is not necessarily used

[Figure: histogram of EXPRNC, horizontal axis from 0 to 20, frequency on the vertical axis]
Histograms in R (through commands to set the number of breaks (= bins+1))

[Figure: histogram of EXPRNC, horizontal axis from 0 to 20, frequency on the vertical axis]
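A hedged sketch of how such a histogram might be produced with explicit break points. EXPRNC is rebuilt from its frequency table, and the break points 0, 3, ..., 21 are one possible choice, not necessarily the one used for the figure.

```r
# EXPRNC rebuilt from its frequency table (values and absolute frequencies)
exprnc <- rep(c(1:8, 10:17, 20),
              times = c(5, 4, 4, 4, 3, 4, 1, 4, 4, 2, 2, 2, 1, 1, 3, 1, 1))

# A vector of breaks is used exactly as given: 8 breaks define 7 bins
h <- hist(exprnc, breaks = seq(0, 21, by = 3),
          xlab = "EXPRNC", ylab = "frequency", main = "")
h$counts   # absolute frequency of each bin
h$mids     # class marks (midpoints)

# Frequency polygon: join the tops of the bars at the class marks
lines(h$mids, h$counts, type = "b")

# A single number is only a suggestion; R chooses "pretty" break points
h2 <- hist(exprnc, breaks = 7, plot = FALSE)
length(h2$breaks) - 1   # number of bins actually used
```

This is why a requested number of bins is not necessarily honoured, whereas a full vector of break points is.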
Other graphics: cartograms (INE, Encuesta de Turismo de residentes)
Average travel expenditure per person during the third quarter of 2016

Average excursions expenditure per person during the third quarter of 2016
Other graphics: pictograms
Other graphics: time series

INE, Encuesta de Población Activa


How to lie with graphs

Published in La Voz de Galicia, on October 24, 2010.


- Letting height be proportional to frequency gives a false impression
- Is there anything else that is wrong with this graph?
Lying with graphs
Improper use of scales: the coordinate origin is not 0
Lying with graphs
The vertical axis scale is upside down
Lying with Statistics
A classic book: How to Lie with Statistics, by Darrell Huff, 1954

Available online:
https://archive.org/details/HowToLieWithStatistics
Numeric summaries: descriptive measures

Central tendency    Location       Spread                 Shape
mean                quartiles      range                  skewness
median              percentiles    interquartile range    kurtosis
mode                               variance
                                   standard deviation
                                   coeff. of variation
Numeric summaries: descriptive measures

- Why are they useful?
- Can we calculate them for all types of variables?
- Which are the most useful in each case?
- How can we compute them?

Measures of central tendency

- The (arithmetic) mean
- The median
- The mode
Central tendency: the (arithmetic) mean

The mean
The mean is the average of all the data

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + \dots + x_n}{n}$$

- It is the most common measure of central tendency
- It is the center of gravity of the data
- It should be calculated only for numeric variables
The mean: example

For the experience of the 46 employees, what is the mean?

$$\bar{x} = \frac{1 + 1 + 1 + 1 + 1 + 2 + 2 + 2 + 2 + \dots + 17 + 20}{46} = 7.5 \text{ years}$$

In R: mean(employees$EXPRNC)
The mean: example
How to calculate the mean from the absolute frequency table? And
from the relative frequency table?
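A sketch of both calculations for EXPRNC, using the values ci and absolute frequencies ni from its frequency table. In symbols, the mean is the sum of ci·ni divided by n, or equivalently the sum of ci·fi; the object names are illustrative.

```r
c_i <- c(1:8, 10:17, 20)                                        # observed values
n_i <- c(5, 4, 4, 4, 3, 4, 1, 4, 4, 2, 2, 2, 1, 1, 3, 1, 1)     # absolute frequencies

n   <- sum(n_i)
f_i <- n_i / n                                                  # relative frequencies

mean_from_abs <- sum(c_i * n_i) / n   # weight each value by its count, divide by n
mean_from_rel <- sum(c_i * f_i)       # weights already sum to 1, no division needed

c(mean_from_abs, mean_from_rel)       # both equal 7.5
weighted.mean(c_i, n_i)               # built-in equivalent
```

Both routes reproduce the 7.5 years obtained from the raw data, because no information is lost when a discrete variable is tabulated value by value.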
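Calculating the mean from grouped data

Apply the same formula, replacing each observation by the class mark (midpoint) of its interval and weighting by the class frequency.

For the salary of the 46 employees, what is the mean?

Note: the mean salary computed from the original (ungrouped) data equals 17250.41; the value obtained from grouped data is only an approximation to it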
The mean: properties
- Linearity: If Y = a + bX ⇒ ȳ = a + bx̄
  If Z = X + Y ⇒ z̄ = x̄ + ȳ
  - If the 46 employees' salaries increase by 2%, how does the mean salary change?
  - If the salary is reduced by 100 dollars, what is the new mean salary?
  - If the salary is increased with a productivity bonus that is recorded in variable Y, with mean ȳ, what is the new mean salary?
- Disadvantages: Affected by extreme values (outliers)
  Example: X: 3, 1, 5, 4, 2 and Y: 3, 1, 5, 4, 200

  $$\bar{x} = \frac{3+1+5+4+2}{5} = 3 \qquad \bar{y} = \frac{3+1+5+4+200}{5} = 42.6$$

  When the data are skewed, an alternative robust measure of central tendency is more appropriate
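The two properties can be checked numerically with the small samples X and Y. The constants `a` and `b` below are hypothetical values chosen only to illustrate linearity (a 2% increase followed by a reduction of 100).

```r
x <- c(3, 1, 5, 4, 2)
y <- c(3, 1, 5, 4, 200)   # same data with one extreme value

mean_x <- mean(x)         # 3
mean_y <- mean(y)         # 42.6, pulled up by the single outlier

# Linearity: the mean of a + b*X equals a + b times the mean of X
a <- -100                 # hypothetical shift
b <- 1.02                 # hypothetical 2% increase
lhs <- mean(a + b * x)
rhs <- a + b * mean_x
all.equal(lhs, rhs)

# Mean of a sum equals the sum of the means
all.equal(mean(x + y), mean_x + mean_y)
```

A single observation changed from 2 to 200 moves the mean from 3 to 42.6, a value larger than four of the five observations.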
Central tendency: the median

1. Order the data from smallest to largest
2. Include repetitions
3. The median is in the central position

Example with n = 11 (odd): 1 1 1 3 3 5 5 7 8 8 9 ⇒ M = 5
Example with n = 10 (even): 1 1 1 3 3 5 5 7 8 8 ⇒ M = (3 + 5)/2 = 4

Median
Ordered data from smallest to largest: x(1), x(2), . . . , x(n)

$$M = \begin{cases} x_{((n+1)/2)} & \text{if } n \text{ odd} \\[4pt] \dfrac{x_{(n/2)} + x_{(n/2+1)}}{2} & \text{if } n \text{ even} \end{cases}$$

In R: median(employees$EXPRNC)
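To make the definition concrete, here is a small sketch that follows the positional rule directly and compares it with R's built-in `median()`. The function name `median_by_position` is an illustrative assumption.

```r
# Median computed from the ordered data, following the odd/even rule
median_by_position <- function(x) {
  s <- sort(x)                 # order from smallest to largest, keeping repetitions
  n <- length(s)
  if (n %% 2 == 1) {
    s[(n + 1) / 2]             # n odd: the central value
  } else {
    (s[n / 2] + s[n / 2 + 1]) / 2   # n even: average of the two central values
  }
}

odd_sample  <- c(1, 1, 1, 3, 3, 5, 5, 7, 8, 8, 9)   # n = 11
even_sample <- c(1, 1, 1, 3, 3, 5, 5, 7, 8, 8)      # n = 10

median_by_position(odd_sample)    # 5
median_by_position(even_sample)   # 4
```

Both results agree with `median()`, which applies the same rule.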
Finding the median from a frequency table

Experience, ci    ni    fi       Ni    Fi
1                 5     0.109    5     0.109
2                 4     0.087    9     0.196
3                 4     0.087    13    0.283
4                 4     0.087    17    0.370
5                 3     0.065    20    0.435  < 0.5
6 (= M)           4     0.087    24    0.522  > 0.5
7                 1     0.022    25    0.543
8                 4     0.087    29    0.630
9                 0     0.000    29    0.630
10                4     0.087    33    0.717
11                2     0.043    35    0.761
12                2     0.043    37    0.804
13                2     0.043    39    0.848
14                1     0.022    40    0.870
15                1     0.022    41    0.891
16                3     0.065    44    0.957
17                1     0.022    45    0.978
18                0     0.000    45    0.978
19                0     0.000    45    0.978
20                1     0.022    46    1.000
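The lookup in the table amounts to finding the first class whose cumulative relative frequency reaches 0.5. A minimal sketch, assuming no class has Fi exactly equal to 0.5 (in that case the median would be the average of that class and the next one); object names are illustrative.

```r
c_i <- c(1:8, 10:17, 20)                                        # observed values
n_i <- c(5, 4, 4, 4, 3, 4, 1, 4, 4, 2, 2, 2, 1, 1, 3, 1, 1)     # absolute frequencies

F_i <- cumsum(n_i) / sum(n_i)     # cumulative relative frequencies

# First class whose cumulative relative frequency reaches 0.5
median_class <- c_i[which(F_i >= 0.5)[1]]
median_class                       # 6

# Cross-check against the raw data rebuilt from the table
median(rep(c_i, times = n_i))
```

Classes with zero frequency (9, 18, 19) can be omitted without affecting the result, since they do not change the cumulative frequencies.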
The median: properties
- Linearity: If Y = a + bX with b > 0 ⇒ My = a + bMx
  - If the 46 employees' salaries are increased by 2%, how does the median salary change?
  - Afterwards the salary is reduced by 100 dollars. What is the final median salary?
- Can we calculate the median for the EDUC variable? Can we calculate the median for the MGT variable?
- Advantage: Not affected by outliers
  Example: X: 3, 1, 5, 4, 2 and Y: 3, 1, 5, 4, 200

  Mx = 3    My = 4

  When the data are skewed it is a better measure of central tendency than the mean.
The median and the mean for asymmetric (skewed) data
Annual gross salary in 2014, Encuesta de Estructura Salarial 2014, INE

"The difference between the mean salary and the median salary arises because very high salaries have a strong influence on the mean, even though they correspond to few workers." (INE press release, 28/10/2016)
Central tendency: the mode

The mode is the most frequent value

The mode of the variable EXPRNC in the 46 employees example is 1 year, with an absolute frequency of 5 employees.
What is the mode of the variable MGT?
What is the mode of the variable EDUC?
The mode for grouped data

What if we have grouped data? ⇒ modal interval

The mode: properties

- It can be calculated for both categorical and numeric variables. Indeed, it is the only descriptive measure that makes sense for nominal categorical variables.
- Not affected by outliers
- There can be more than one mode: bimodal, trimodal, plurimodal. What could it suggest?
Location measures

- Quartiles
- Percentiles

Location measures: quartiles and percentiles

- Quartiles split the ranked data into four segments with an (approximately) equal number of values per segment.
- Percentiles split the ranked data into a hundred segments with an (approximately) equal number of values per segment.

1. Order the data from smallest to largest
2. Include repetitions
3. Select each quartile (percentile) according to:
   - The first quartile Q1 is in position (1/4)(n + 1).
   - The second quartile Q2 (= median) is in position (1/2)(n + 1).
   - The third quartile Q3 is in position (3/4)(n + 1).
   - The k-th percentile Pk is in position (k/100)(n + 1), k = 1, . . . , 99, leaving k% of data below
Quartiles and percentiles in R

Note:
Typically, the positions (1/4)(n + 1), (3/4)(n + 1) and (k/100)(n + 1) are not integers ⇒ a rounding criterion is used

In R:
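A hedged sketch with `quantile()`, using EXPRNC rebuilt from its frequency table. To my understanding, `type = 6` places the p-th quantile at position p(n + 1) with linear interpolation, which matches the position rule used in this topic, whereas R's default is `type = 7`; the two can give different answers.

```r
# EXPRNC rebuilt from its frequency table (values and absolute frequencies)
exprnc <- rep(c(1:8, 10:17, 20),
              times = c(5, 4, 4, 4, 3, 4, 1, 4, 4, 2, 2, 2, 1, 1, 3, 1, 1))

probs <- c(0.25, 0.50, 0.75)

# Position p(n + 1): Q3 sits at 0.75 * 47 = 35.25, between x(35) = 11 and x(36) = 12
q_type6 <- quantile(exprnc, probs = probs, type = 6)

# R's default (type 7) uses position 1 + p(n - 1) instead
q_type7 <- quantile(exprnc, probs = probs)

q_type6
q_type7

# A percentile, for example P90, under the p(n + 1) rule
quantile(exprnc, probs = 0.90, type = 6)
```

Here the two rules agree on Q1 and Q2 but differ on Q3 (11.25 versus 11), which illustrates why the criterion should be stated when reporting quartiles.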
Measures of spread

- The range and the interquartile range
- The variance and the standard deviation
- The coefficient of variation
Spread: range and interquartile range (IQR)

- The range is the simplest measure of spread

  R = xmax − xmin

- It ignores the way the data are distributed
- It is sensitive to outliers
  Example: Given observations 3, 1, 5, 4, 2, R = 5 − 1 = 4
  Example: Given observations 3, 1, 5, 4, 100, R = 100 − 1 = 99
- The interquartile range (IQR) reduces the influence of outliers: discard the lowest 25% and the highest 25% of the observations and calculate the range of the middle 50% of the data

  IQR = 3rd quartile − 1st quartile = Q3 − Q1

Spread: Interquartile range, outliers and boxplot

- Outliers are observations that fall
  - below the value Q1 − 1.5 · IQR
  - above the value Q3 + 1.5 · IQR
- For extreme outliers, replace 1.5 by 3 in the above definition

[Figure: boxplot with xmin = 12, Q1 = 24, median (Q2) = 31, Q3 = 42 and xmax = 58; each of the four segments contains 25% of the data; IQR = 18]
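For the five values shown in the figure, the outlier limits can be worked out as in this short sketch; the variable names are illustrative.

```r
# Five-number summary read from the figure
x_min <- 12; q1 <- 24; q2 <- 31; q3 <- 42; x_max <- 58

iqr <- q3 - q1                     # 42 - 24 = 18

# Limits for outliers (factor 1.5) and extreme outliers (factor 3)
lower_limit   <- q1 - 1.5 * iqr    # 24 - 27 = -3
upper_limit   <- q3 + 1.5 * iqr    # 42 + 27 = 69
lower_extreme <- q1 - 3 * iqr      # 24 - 54 = -30
upper_extreme <- q3 + 3 * iqr      # 42 + 54 = 96

# Neither the minimum nor the maximum lies beyond the limits
has_outliers <- (x_min < lower_limit) || (x_max > upper_limit)
has_outliers                       # FALSE
```

Since both extremes lie inside the interval (−3, 69), this sample has no outliers under the 1.5 · IQR criterion.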
Boxplot
- It shows five location measures
- It allows us to assess the spread of the data
- It allows us to assess the symmetry of the data
- It is very useful to compare different datasets
- Note: R produces a modified boxplot, where outliers are plotted as distinguished points (the min and max shown are those without outliers)
[Figure: boxplot of SALARY, vertical axis from 10000 to 25000]
Measures of spread: variance

- Average of squared deviations of values from the mean
- Sample variance (divided by n; the second expression is faster to calculate)

$$\hat{\sigma}^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n} = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n}$$

- Sample quasi-variance (corrected sample variance, divided by n − 1)

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n-1}$$

- They are related via

$$\hat{\sigma}^2 = \frac{n-1}{n}\, s^2$$

- If a, b are real numbers and y = a + bx, then $s_y^2 = b^2 s_x^2$
Measures of spread: standard deviation (SD)

- The most commonly used measure of spread
- The sample standard deviation and sample quasi-standard deviation are respectively

$$\hat{\sigma} = \sqrt{\hat{\sigma}^2} \qquad s = \sqrt{s^2}$$

- They both measure variation about the mean
- They have the same units as the original data, while the variance is in squared units
- Variance and SD are both sensitive to outliers
Calculating the variance and standard deviation
Example: X: 11, 12, 13, 16, 16, 17, 18, 21; Y: 14, 15, 15, 15, 16, 16, 16, 17; Z: 11, 11, 11, 12, 19, 20, 20, 20

$$\bar{x} = \frac{124}{8} = 15.5 \qquad \bar{y} = \frac{124}{8} = 15.5 \qquad \bar{z} = \frac{124}{8} = 15.5$$

$$\sum_{i=1}^{n} x_i^2 = 11^2 + 12^2 + \dots + 21^2 = 2000$$

$$\sum_{i=1}^{n} y_i^2 = 14^2 + 15^2 + \dots + 17^2 = 1928$$

$$\sum_{i=1}^{n} z_i^2 = 11^2 + 11^2 + \dots + 20^2 = 2068$$

$$s_x^2 = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n-1} = \frac{2000 - 8(15.5)^2}{8-1} = \frac{78}{7} = 11.1429 \Rightarrow s_x = 3.3381$$

$$s_y^2 = \frac{1928 - 8(15.5)^2}{8-1} = \frac{6}{7} = 0.8571 \Rightarrow s_y = 0.9258$$

$$s_z^2 = \frac{2068 - 8(15.5)^2}{8-1} = \frac{146}{7} = 20.8571 \Rightarrow s_z = 4.5670$$
Obtaining the variance and sample standard deviation with R
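A sketch using the samples X, Y and Z from the worked example. Note that R's `var()` and `sd()` divide by n − 1, so they return the quasi-variance and quasi-standard deviation; the helper `var_n` for the version divided by n is an illustrative assumption, not a built-in function.

```r
x <- c(11, 12, 13, 16, 16, 17, 18, 21)
y <- c(14, 15, 15, 15, 16, 16, 16, 17)
z <- c(11, 11, 11, 12, 19, 20, 20, 20)
n <- length(x)

# var() and sd() use the n - 1 denominator (quasi-variance, quasi-SD)
var(x)                                      # 78/7 = 11.1429
sd(x)                                       # 3.3381

# Shortcut formula: (sum of squares - n * mean^2) / (n - 1)
shortcut <- (sum(x^2) - n * mean(x)^2) / (n - 1)

# Variance divided by n, obtained through the relation (n - 1)/n * s^2
var_n <- function(v) (length(v) - 1) / length(v) * var(v)
var_n(x)                                    # 78/8 = 9.75

# Quasi-standard deviations of the three samples
quasi_sd <- sapply(list(x = x, y = y, z = z), sd)
round(quasi_sd, 4)
```

All three samples share the mean 15.5, so the differences in `quasi_sd` reflect spread alone.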
Comparing standard deviations
Example cont.: X: 11, 12, 13, 16, 16, 17, 18, 21; Y: 14, 15, 15, 15, 16, 16, 16, 17; Z: 11, 11, 11, 12, 19, 20, 20, 20

[Figure: dot plots of X, Y and Z on a common axis from 11 to 21, with x̄ = 15.5, sx = 3.3; ȳ = 15.5, sy = 0.9; z̄ = 15.5, sz = 4.6]
Measures of spread: coefficient of variation (CV)

- The CV measures relative variation and is defined as

$$CV = \frac{s}{|\bar{x}|}$$

- It is a unitless number (sometimes given in %)
- It represents variation relative to the mean

Example: Stock A: mean price last year = 50, quasi-standard deviation = 5
Stock B: mean price last year = 100, quasi-standard deviation = 5

$$CV_A = \frac{5}{50} = 0.10 \qquad CV_B = \frac{5}{100} = 0.05$$

Both stocks have the same quasi-SD, but stock B is less variable relative to its mean price
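The stock example can be reproduced with a one-line helper. Base R has no dedicated CV function that I am aware of, so `cv_from_summary` and `cv` below are illustrative names defined here.

```r
# CV from summary statistics: quasi-SD divided by the absolute value of the mean
cv_from_summary <- function(mean_value, s) s / abs(mean_value)

cv_a <- cv_from_summary(mean_value = 50,  s = 5)   # 0.10
cv_b <- cv_from_summary(mean_value = 100, s = 5)   # 0.05

# The same measure computed directly from a data vector
cv <- function(x) sd(x) / abs(mean(x))

prices <- c(11, 12, 13, 16, 16, 17, 18, 21)   # illustrative data, not stock prices
cv(prices)
```

Multiplying the result by 100 expresses the CV as a percentage.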
Standardizing variables

- Standardizing variable x means to calculate a new variable

$$z = \frac{x - \bar{x}}{s}$$

- If you apply this formula to all observations x1, . . . , xn and call the transformed ones z1, . . . , zn, then the z's have mean zero and standard deviation one
- Standardizing = calculating z-scores
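A short numerical check of this property, using an illustrative sample; `scale()` is the base R function that performs the same centering and scaling with the quasi-standard deviation.

```r
x <- c(11, 12, 13, 16, 16, 17, 18, 21)   # illustrative sample

# z-scores by the definition: subtract the mean, divide by the quasi-SD
z <- (x - mean(x)) / sd(x)

mean(z)   # zero, up to floating-point rounding
sd(z)     # one

# scale() returns a one-column matrix; convert it back to a plain vector
z_scale <- as.numeric(scale(x))
all.equal(z, z_scale)
```

A z-score tells how many standard deviations an observation lies above (positive) or below (negative) the mean.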
Measures of shape

- Coefficient of skewness
- Coefficient of kurtosis
Measures of shape: skewness

Do not judge the shape merely by comparing the mean, the median and the mode.

Coefficient of skewness

$$\gamma_1 = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^3$$

R (load package e1071): skewness(employees$SALARY) = 0.48


Measures of shape: kurtosis

Coefficient of kurtosis

$$\gamma_2 = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^4 - 3$$

R (load package e1071): kurtosis(employees$SALARY) = -0.93
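The two coefficients can also be computed in base R straight from their definitions, which avoids the extra package. The function names below are illustrative; I believe these formulas correspond to the default `type = 3` of the e1071 functions, but that correspondence is an assumption worth checking against the package documentation.

```r
# Coefficient of skewness: average cubed z-score (s is the quasi-SD)
skew_coef <- function(x) mean(((x - mean(x)) / sd(x))^3)

# Coefficient of kurtosis: average fourth-power z-score minus 3
kurt_coef <- function(x) mean(((x - mean(x)) / sd(x))^4) - 3

x <- c(3, 1, 5, 4, 2)      # symmetric sample
y <- c(3, 1, 5, 4, 200)    # one large value creates a long right tail

skew_coef(x)   # 0: symmetric data
skew_coef(y)   # positive: right-skewed data
kurt_coef(x)   # negative: flatter than the normal reference
```

A positive skewness indicates a longer right tail and a negative one a longer left tail; the subtraction of 3 makes the normal distribution the reference point for kurtosis.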


Numerical summaries in R Commander
Empirical rule
If the data are bell-shaped (normal), that is, symmetric with light tails, the following rule holds:
- About 68% of the data are in (x̄ − 1s, x̄ + 1s)
- About 95% of the data are in (x̄ − 2s, x̄ + 2s)
- About 99.7% of the data are in (x̄ − 3s, x̄ + 3s)
Note: This rule is also known as the 68–95–99.7 rule


Example: We know that for a sample of 100 observations, the mean is 40 and the quasi-standard deviation is 5. Assuming that the data are bell-shaped, give the endpoints of an interval that contains about 95% of the observations.

About 95% of the xi are in: (x̄ ± 2s) = (40 ± 2(5)) = (30, 50)
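The interval in the example, and the percentages behind the rule, can be checked with a few lines of R. The coverage values come from the standard normal distribution function `pnorm()`; the object names are illustrative.

```r
x_bar <- 40   # sample mean from the example
s     <- 5    # quasi-standard deviation from the example

# Interval expected to contain about 95% of the observations
interval_95 <- x_bar + c(-2, 2) * s
interval_95                          # 30 50

# Exact normal coverage within 1, 2 and 3 standard deviations of the mean
k <- 1:3
coverage <- pnorm(k) - pnorm(-k)
round(100 * coverage, 1)             # 68.3 95.4 99.7
```

With 100 observations, roughly 95 of them would be expected between 30 and 50 if the bell-shape assumption holds.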
