Contents
1. Representations and graphs
- Frequency tables
Recommended reading
- Peña, D., Romo, J. Introducción a la Estadística para las Ciencias Sociales (1997).
- Chapters 2, 3, 4 and 5.
Class (category) ci    Absolute frequency ni    Relative frequency fi
c1                     n1                       f1 = n1/n
c2                     n2                       f2 = n2/n
...                    ...                      ...
ck                     nk                       fk = nk/n
Total                  n                        1
Note:
- ni = number of individuals of class ci in the sample
- fi = ni/n
- 0 ≤ fi ≤ 1
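A minimal sketch of how such a table could be built in R. The vector `education` and the names `n_i`, `f_i` and `freq_table` are made up for illustration and are not part of the course data.

```r
# Hypothetical categorical sample (illustrative data only)
education <- c("High School", "College", "College", "Advanced Degree",
               "High School", "College", "High School", "College")

n_i <- table(education)      # absolute frequencies n_i
f_i <- prop.table(n_i)       # relative frequencies f_i = n_i / n

freq_table <- data.frame(
  class = names(n_i),
  n_i   = as.vector(n_i),
  f_i   = as.vector(f_i)
)
print(freq_table)
```

The absolute frequencies add up to n and the relative frequencies add up to 1, as in the Total row of the table.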
Bar charts
[Figure: bar chart with categories High School, College and Advanced Degree]
Other graphics: the Pareto chart
Speed 10.1
Total 100.0
The Pareto chart: example
The Pareto chart in R: Importing data
Import data from Excel file visitorsPrado.xlsx
The Pareto chart in R (not available in R Commander)
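A Pareto chart can be sketched in base R by sorting the categories by decreasing frequency and overlaying the cumulative percentage. The counts below are hypothetical and the Excel file is not read here; packages such as qcc also offer a ready-made `pareto.chart()` function.

```r
# Hypothetical counts by category (illustrative data only)
counts <- c(Price = 12, Speed = 10, Cleanliness = 45, Staff = 25, Other = 8)

# A Pareto chart orders the categories by decreasing frequency ...
sorted  <- sort(counts, decreasing = TRUE)
# ... and adds the cumulative percentage on top of the bars
cum_pct <- 100 * cumsum(sorted) / sum(sorted)

mids <- barplot(sorted, ylim = c(0, sum(sorted)),
                ylab = "Frequency", main = "Pareto chart (sketch)")
# Cumulative line, rescaled to the frequency axis
lines(mids, cum_pct / 100 * sum(sorted), type = "b", pch = 19)
# Right-hand axis labelled in percentages
axis(4, at = seq(0, 1, by = 0.25) * sum(sorted),
     labels = paste0(seq(0, 100, by = 25), "%"))
```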
Grade ci    ni    fi
1           4     0.08
2           4
3                 0.16
4           7     0.14
5           5
6           10
7           7     0.14
8
[Figure: bar charts of the variable Sport (Basketball, Swimming, Football, No sport) in several panels, including b) and d); vertical axes from 0 to 14]
Description of discrete numeric variables: frequency table
- Sample: 100 shopping malls in which a promotion of a certain service was launched last November
- Variable: number of new customers of the service

         Absolute        Relative        Cumulative absolute    Cumulative relative
ci       frequency ni    frequency fi    frequency Ni           frequency Fi
0        1               0.01            1                      0.01
1        4               0.04            5                      0.05
2        7               0.07            12                     0.12
3        8               0.08            20                     0.20
4        8               0.08            28                     0.28
5        16              0.16            44                     0.44
6        18              0.18            62                     0.62
7        14              0.14            76                     0.76
8        10              0.10            86                     0.86
9        11              0.11            97                     0.97
10       3               0.03            100                    1.00
Total    100             1
Note:
- c1 < c2 < ... < ck
- ni = number of individuals of class ci in the sample, fi = ni/n
- Ni = N(i−1) + ni, Fi = F(i−1) + fi
- 0 ≤ fi, Fi ≤ 1
- Fi and Ni also make sense for ordinal categorical variables
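The recurrences in the note are running sums, so the cumulative columns of the shopping-mall table can be reproduced with `cumsum()`. This is a small sketch; the variable names are chosen here for illustration.

```r
# Data from the frequency table: number of new customers in 100 shopping malls
c_i <- 0:10
n_i <- c(1, 4, 7, 8, 8, 16, 18, 14, 10, 11, 3)

n   <- sum(n_i)       # sample size
f_i <- n_i / n        # relative frequencies
N_i <- cumsum(n_i)    # cumulative absolute frequencies: N_i = N_(i-1) + n_i
F_i <- cumsum(f_i)    # cumulative relative frequencies: F_i = F_(i-1) + f_i

print(data.frame(c_i, n_i, f_i, N_i, F_i))
```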
Ordinal categorical variables: cumulative frequencies
         Absolute     Relative     Cumulative absolute    Cumulative relative
Class    frequency    frequency    frequency              frequency
VU       62           0.07         62                     0.07
U        108          0.12         170                    0.19
S        319          0.35         489                    0.54
VS       412          0.46         901                    1.00
Total    901          1
Ordinal categorical variables: bar charts with cumulative frequencies
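Note:
- In R, class intervals are by default open on the left and closed on the right, (a, b], except the first one, which also includes its left end-point, [a, b] (this is the default of hist(); cut() closes the first interval only with include.lowest = TRUE)
- Grouping in class intervals is useful for tabulating discrete variables with many possible values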
Grouping in class intervals
- Very often class intervals have the same width
- Determine the width w of each interval by
  w = (largest number − smallest number) / (number of desired intervals)
- How many intervals? Between 5 and 20, based on practice and experience:
  Sample size       Number of classes
  Less than 50      5–7
  50 to 100         7–8
  101 to 500        8–10
  501 to 1000       10–11
  1001 to 5000      11–14
  More than 5000    14–20
- Class intervals cannot overlap
- Round up the interval width to get convenient interval end-points
Grouping in class intervals
- Find the range: 20 − 1 = 19
- Select the number of classes: say k = √46 = 6.78 ≈ 7
- Compute the interval width: 19/7 = 2.71 ⇒ 3
- Determine the end-points (starting below the smallest observation and ending above the largest one): [0, 3], (3, 6], ..., (18, 21]
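The four steps of the example can be sketched in R as follows. The square-root rule and the rounding follow the example; starting the first interval at 0 is the choice made there, and the variable names are illustrative.

```r
n     <- 46    # sample size (from the example)
x_min <- 1     # smallest observation
x_max <- 20    # largest observation

k <- round(sqrt(n))                  # number of classes: sqrt(46) = 6.78, about 7
w <- ceiling((x_max - x_min) / k)    # interval width: 19 / 7 = 2.71, rounded up to 3

# k intervals of width w need k + 1 end-points, starting just below the minimum
breaks <- seq(from = 0, by = w, length.out = k + 1)
print(breaks)   # 0 3 6 9 12 15 18 21
```

The resulting end-points cover the whole range of the data, from below 1 to above 20.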
Description of numeric variables: histogram
- There are no gaps between the bars/bins
- Bin widths = widths of class intervals (identical); class boundaries are marked on the horizontal axis
- Bin heights = frequencies (here, absolute)
- Bin areas are proportional to the frequencies
Histogram and frequency polygon
Histograms in R Commander
[Figure: histogram of EXPRNC; horizontal axis from 0 to 20]
Histograms in R (through commands to set the number of breaks (= bins + 1))
[Figure: histogram of EXPRNC; horizontal axis from 0 to 20, vertical axis (frequency) from 0 to 12]
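A sketch of a histogram with explicit break points. The EXPRNC values are rebuilt here from the frequency table of years of experience given later in these slides (46 employees, values 1 to 20), and the breaks are the seven intervals of width 3 from the grouping example, which need not be the ones used in the figure.

```r
# EXPRNC rebuilt from the frequency table of experience (values 1 to 20)
n_i    <- c(5, 4, 4, 4, 3, 4, 1, 4, 0, 4, 2, 2, 2, 1, 1, 3, 1, 0, 0, 1)
exprnc <- rep(1:20, times = n_i)

# 7 bins need 7 + 1 = 8 break points
breaks <- seq(0, 21, by = 3)

# Defaults right = TRUE and include.lowest = TRUE give [0,3], (3,6], ..., (18,21]
h <- hist(exprnc, breaks = breaks, xlab = "EXPRNC", ylab = "frequency",
          main = "Histogram of EXPRNC")
print(h$counts)   # absolute frequency of each bin
```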
Other graphics: cartograms (INE, Encuesta de Turismo de residentes)
Average travel expenditure per person during the third quarter of 2016
Average excursion expenditure per person during the third quarter of 2016
Other graphics: pictograms
Other graphics: time series
Available online:
https://archive.org/details/HowToLieWithStatistics
Numeric summaries: descriptive measures
- The median
- The mode
Central tendency: the (arithmetic) mean
The mean
The mean is the average of all the data
\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + \dots + x_n}{n} \]
Note: the mean salary from the original data equals 17250.41
The mean: properties
- Linearity: if Y = a + bX, then ȳ = a + b x̄
  If Z = X + Y, then z̄ = x̄ + ȳ
If the 46 employees' salaries increase by 2%, how does the mean salary change?
If the salary is reduced by 100 dollars, what is the new mean salary?
If the salary is increased by a productivity bonus that is recorded in variable Y, with mean ȳ, what is the new mean salary?
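The linearity property can be checked numerically. The sketch below uses made-up salaries and bonuses (not the course data set) to mirror the three questions.

```r
# Hypothetical salaries and bonuses (illustrative values only)
salary <- c(12000, 15500, 17000, 18250, 21000)
bonus  <- c(500, 300, 800, 450, 600)

mean(salary * 1.02)     # 2% raise: equals 1.02 * mean(salary)
mean(salary - 100)      # 100-dollar cut: equals mean(salary) - 100
mean(salary + bonus)    # bonus Y added: equals mean(salary) + mean(bonus)
```

In each case the new mean is obtained by applying the same transformation to the old mean, so the individual data are not needed.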
1 1 1 3 3 5 5 7 8 8 9
1. Order the data from smallest to largest
2. Include repetitions
3. The median is in the central position
1 1 1 3 3 5 5 7 8 8 ⇒ M = (3 + 5)/2 = 4
Median
Ordered data from smallest to largest: x(1), x(2), ..., x(n)
\[ M = \begin{cases} x_{((n+1)/2)} & \text{if } n \text{ is odd} \\[4pt] \dfrac{x_{(n/2)} + x_{(n/2+1)}}{2} & \text{if } n \text{ is even} \end{cases} \]
In R: median(employees$EXPRNC)
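As an illustration, the definition can be written out by hand and compared with the built-in `median()` on the two small data sets of the example. The helper `median_by_position` is a name introduced here, not an R function.

```r
# The two small data sets used in the median example
x_odd  <- c(1, 1, 1, 3, 3, 5, 5, 7, 8, 8, 9)   # n = 11 (odd)
x_even <- c(1, 1, 1, 3, 3, 5, 5, 7, 8, 8)      # n = 10 (even)

# Median by the definition: central value, or mean of the two central values
median_by_position <- function(x) {
  x <- sort(x)
  n <- length(x)
  if (n %% 2 == 1) {
    x[(n + 1) / 2]
  } else {
    (x[n / 2] + x[n / 2 + 1]) / 2
  }
}

median_by_position(x_odd)    # 5
median_by_position(x_even)   # 4
median(x_even)               # built-in function, also 4
```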
Finding the median from a frequency table
Experience ci    ni    fi       Ni    Fi
1                5     0.109    5     0.109
2                4     0.087    9     0.196
3                4     0.087    13    0.283
4                4     0.087    17    0.370
5                3     0.065    20    0.435   < 0.5
6                4     0.087    24    0.522   > 0.5 ⇒ M = 6
7                1     0.022    25    0.543
8                4     0.087    29    0.630
9                0     0.000    29    0.630
10               4     0.087    33    0.717
11               2     0.043    35    0.761
12               2     0.043    37    0.804
13               2     0.043    39    0.848
14               1     0.022    40    0.870
15               1     0.022    41    0.891
16               3     0.065    44    0.957
17               1     0.022    45    0.978
18               0     0.000    45    0.978
19               0     0.000    45    0.978
20               1     0.022    46    1.000
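A sketch of the same search in R. It locates the central positions of the ordered sample through the cumulative absolute frequencies Ni, which here gives the first class whose Fi exceeds 0.5. The function name `median_from_table` is hypothetical.

```r
c_i <- 1:20
n_i <- c(5, 4, 4, 4, 3, 4, 1, 4, 0, 4, 2, 2, 2, 1, 1, 3, 1, 0, 0, 1)

median_from_table <- function(c_i, n_i) {
  N_i <- cumsum(n_i)   # cumulative absolute frequencies
  n   <- sum(n_i)      # sample size
  # Class of the observation in a given position of the ordered sample:
  # the first class whose cumulative frequency reaches that position
  value_at <- function(pos) c_i[which(N_i >= pos)[1]]
  if (n %% 2 == 1) {
    value_at((n + 1) / 2)
  } else {
    (value_at(n / 2) + value_at(n / 2 + 1)) / 2
  }
}

median_from_table(c_i, n_i)   # 6
```

With n = 46 the central positions are 23 and 24, and both fall in class 6 because N5 = 20 and N6 = 24.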
The median: properties
- Linearity: if Y = a + bX with b > 0, then My = a + b Mx
If the 46 employees' salaries are increased by 2%, how does the median salary change?
Afterwards the salary is reduced by 100 dollars. What is the final median salary?
Mx = 3 My = 4
- Quartiles
- Percentiles
Location measures: quartiles and percentiles
1. Order the data from smallest to largest
2. Include repetitions
3. Select each quartile (percentile) according to:
- The first quartile Q1 is in position (1/4)(n + 1)
- The second quartile Q2 (= median) is in position (1/2)(n + 1)
- The third quartile Q3 is in position (3/4)(n + 1)
- The k-th percentile Pk is in position (k/100)(n + 1), k = 1, ..., 99, leaving k% of the data below
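The position rule can be sketched in R as below. Linear interpolation for fractional positions is an assumption made here, since the slides do not say how to handle them, and `quantile_by_position` is a name introduced for illustration. R's `quantile()` applies the same (n + 1) rule when called with `type = 6`; its default is `type = 7`, which can give slightly different values.

```r
# Ordered sample from the median example, n = 11
x <- c(1, 1, 1, 3, 3, 5, 5, 7, 8, 8, 9)

# Value in position p * (n + 1); fractional positions are interpolated
# linearly between the two neighbouring order statistics (assumption)
quantile_by_position <- function(x, p) {
  x    <- sort(x)
  n    <- length(x)
  pos  <- p * (n + 1)
  lo   <- min(max(floor(pos), 1), n)
  hi   <- min(lo + 1, n)
  frac <- min(max(pos - lo, 0), 1)
  x[lo] + frac * (x[hi] - x[lo])
}

# Positions 3, 6 and 9, that is, the values 1, 5 and 8
q <- sapply(c(Q1 = 0.25, Q2 = 0.50, Q3 = 0.75), quantile_by_position, x = x)
print(q)

# Built-in equivalent of the (n + 1) rule
quantile(x, probs = c(0.25, 0.50, 0.75), type = 6)
```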
Quartiles and percentiles in R
Note:
In R:
Measures of spread
Range: R = xmax − xmin
Interquartile range: IQR = Q3 − Q1
Example: xmin = 12, Q1 = 24, Q2 (median) = 31, Q3 = 42, xmax = 58 ⇒ IQR = 42 − 24 = 18
Boxplot
- It shows five location measures
- It allows us to assess the spread of the data
- It allows us to assess the symmetry of the data
- It is very useful for comparing different datasets
- Note: R produces a modified boxplot, in which outliers are plotted as separate points (the min and max shown are those of the data without outliers)
[Figure: boxplot of SALARY; vertical axis from 10000 to 25000]
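A sketch of the modified boxplot in R, using hypothetical salaries with one unusually large value. `boxplot.stats()` returns the five numbers that are drawn and the points flagged as outliers, by default those lying more than 1.5 box lengths beyond the box.

```r
# Hypothetical salaries with one unusually large value (illustrative data only)
salary <- c(12000, 13500, 14000, 15000, 15500, 16000, 17000, 18000, 19500, 40000)

fivenum(salary)                  # min, lower hinge, median, upper hinge, max
stats <- boxplot.stats(salary)
stats$stats                      # whisker ends, hinges and median actually drawn
stats$out                        # points flagged as outliers

boxplot(salary, ylab = "SALARY")
```

Here the upper whisker stops at 19500, the largest value that is not an outlier, and 40000 is plotted as a separate point.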
Measures of spread: variance
\[ \hat{\sigma}^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n} = \frac{\overbrace{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}^{\text{faster to calculate}}}{n} \quad \Leftarrow \text{ divided by } n \]
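\[ \sum_{i=1}^{n} x_i^2 = 11^2 + 12^2 + \dots + 21^2 = 2000 \]
\[ \sum_{i=1}^{n} y_i^2 = 14^2 + 15^2 + \dots + 17^2 = 1928 \]
\[ \sum_{i=1}^{n} z_i^2 = 11^2 + 11^2 + \dots + 20^2 = 2068 \]
\[ s_x^2 = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n-1} = \frac{2000 - 8(15.5)^2}{8-1} = \frac{78}{7} = 11.1429 \;\Rightarrow\; s_x = 3.3381 \]
\[ s_y^2 = \frac{1928 - 8(15.5)^2}{8-1} = \frac{6}{7} = 0.8571 \;\Rightarrow\; s_y = 0.9258 \]
\[ s_z^2 = \frac{2068 - 8(15.5)^2}{8-1} = \frac{146}{7} = 20.8571 \;\Rightarrow\; s_z = 4.5670 \]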
Obtaining the variance and sample standard deviation with R
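A short sketch with the X, Y and Z samples of the example. `var()` and `sd()` divide by n − 1, so the version divided by n has to be rescaled by hand.

```r
x <- c(11, 12, 13, 16, 16, 17, 18, 21)
y <- c(14, 15, 15, 15, 16, 16, 16, 17)
z <- c(11, 11, 11, 12, 19, 20, 20, 20)

var(x); sd(x)    # 11.1429 and 3.3381
var(y); sd(y)    # 0.8571 and 0.9258
var(z); sd(z)    # 20.8571 and 4.5670

# Shortcut formula: (sum of squares - n * mean^2) / (n - 1)
n <- length(x)
s2_shortcut <- (sum(x^2) - n * mean(x)^2) / (n - 1)

# Variance divided by n instead of n - 1
sigma2_hat <- var(x) * (n - 1) / n
```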
Comparing standard deviations
Example (cont.):
X: 11, 12, 13, 16, 16, 17, 18, 21
Y: 14, 15, 15, 15, 16, 16, 16, 17
Z: 11, 11, 11, 12, 19, 20, 20, 20
x̄ = 15.5, sx = 3.3
ȳ = 15.5, sy = 0.9
z̄ = 15.5, sz = 4.6
[Figure: dot plots of X, Y and Z on a common axis from 11 to 21]
Measures of spread: coefficient of variation (CV)
\[ z = \frac{x - \bar{x}}{s} \]
- If you apply this formula to all observations x1, ..., xn and call the transformed values z1, ..., zn, then the z's have mean zero and standard deviation one
- Standardizing = calculating z-scores
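A quick numerical check of this property, sketched with the sample X from the variance example (the choice of data is only for illustration). The built-in `scale()` performs the same centring and scaling using the sample standard deviation.

```r
# Sample X from the variance example
x <- c(11, 12, 13, 16, 16, 17, 18, 21)

z <- (x - mean(x)) / sd(x)    # z-scores
mean(z)                       # 0, up to rounding error
sd(z)                         # 1

# scale() returns a one-column matrix; as.vector() flattens it
z_scaled <- as.vector(scale(x))
```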
Measures of shape
- Coefficient of skewness
- Coefficient of kurtosis
Measures of shape: skewness
Coefficient of kurtosis
\[ \gamma_2 = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^4 - 3 \]
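The formula translates directly into R. In this sketch the function name `excess_kurtosis` is hypothetical, and using `sd()` (divisor n − 1) for s is an assumption, since the slides do not state which standard deviation is meant; a different s can be passed explicitly.

```r
# Coefficient of kurtosis: mean of the fourth powers of the z-scores, minus 3
# Assumption: s defaults to the sample standard deviation sd(x)
excess_kurtosis <- function(x, s = sd(x)) {
  mean(((x - mean(x)) / s)^4) - 3
}

x <- c(11, 12, 13, 16, 16, 17, 18, 21)   # sample X from the variance example
excess_kurtosis(x)
```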