The background idea of this chapter is as follows: given a set of data we want to obtain some key
values that could be taken as a summary or reference for the whole distribution. We are going to see
that these values are the mode, the arithmetic mean, the median, the quantiles and the other means
different from the arithmetic mean. For the procedure, as a general rule, we have to distinguish
between non-grouped variables and grouped variables.
1. The mode: The mode is in general to be understood as the “trendiest value” of the distribution. Now
let’s see how it should be defined in accordance with this interpretation.
The mode for non-grouped variables: For non-grouped variables the mode is defined as the observed
value with the greatest non-cumulative frequency.
So, if we take the example of the salaries, the mode is $Mo = 1{,}400$ (€).
However, a distribution can have more than one mode, as for example happens with the variable X:
number of siblings, where the two modes are 1 and 2.
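The definition can be sketched in a few lines of Python. This is only an illustration: the function name `modes` is hypothetical, the salary frequencies are the ones used in this chapter, and the sibling frequencies are invented purely to show a tie between the values 1 and 2.

```python
def modes(freq):
    """Return every value whose absolute frequency is the largest one."""
    top = max(freq.values())
    return sorted(x for x, n in freq.items() if n == top)

# Salary distribution used in this chapter: value -> absolute frequency
salaries = {1400: 3, 1500: 2, 1950: 2, 2300: 1}
print(modes(salaries))  # [1400]

# Hypothetical sibling counts (not the chapter's data) with a tie at 1 and 2
siblings = {0: 4, 1: 7, 2: 7, 3: 2}
print(modes(siblings))  # [1, 2]
```

Returning a list rather than a single number keeps the case of several modes explicit.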
It can be justified (see the final appendix) that this graphical procedure leads to this formula:
$$Mo = L_{i-1} + \frac{h_i - h_{i-1}}{2h_i - h_{i-1} - h_{i+1}}\, a_i$$
where:
$L_{i-1}$ is the left-hand-side point of the modal interval,
$a_i$ is the width of the modal interval,
$h_i$ is the height in the histogram for the modal interval,
$h_{i-1}$ and $h_{i+1}$ are the heights in the histogram for the left and right-hand-side intervals with respect to the modal interval.
This construction is now shown for the example of the newborns’ weights. Remember that the situation
was the following:
| Intervals $(L_{i-1}, L_i]$ | Frequencies $n_i$ | Widths $a_i$ | Absolute densities $h_i = n_i / a_i$ |
|---|---|---|---|
| [2.4, 2.9] | 3 | 0.5 | 6 |
| (2.9, 3.3] | 6 | 0.4 | 15 |
| (3.3, 3.8] | 6 | 0.5 | 12 |
| (3.8, 4.5] | 4 | 0.7 | 5.71 |
The modal interval is the one with the greatest density, that is, (2.9, 3.3]. Thus $L_{i-1} = 2.9$, $a_i = 0.4$, $h_i = 15$, $h_{i-1} = 6$, $h_{i+1} = 12$ and
$$Mo = L_{i-1} + \frac{h_i - h_{i-1}}{2h_i - h_{i-1} - h_{i+1}}\, a_i = 2.9 + \frac{15 - 6}{2 \cdot 15 - 6 - 12} \cdot 0.4 = 3.2 \text{ (kg)},$$
as shown below.
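The same calculation can be sketched in Python. The function name `grouped_mode` is hypothetical, and treating a missing neighbour of the first or last interval as having height 0 is an assumption of this sketch, not something stated in the text.

```python
def grouped_mode(intervals, freqs):
    """Mode of grouped data from the histogram heights h_i = n_i / a_i."""
    widths = [right - left for left, right in intervals]
    heights = [n / a for n, a in zip(freqs, widths)]
    i = heights.index(max(heights))  # position of the modal interval
    # Assumption: a missing neighbour (first or last interval) has height 0
    h_prev = heights[i - 1] if i > 0 else 0.0
    h_next = heights[i + 1] if i + 1 < len(heights) else 0.0
    h = heights[i]
    return intervals[i][0] + (h - h_prev) / (2 * h - h_prev - h_next) * widths[i]

# Newborns' weights: interval end points (kg) and absolute frequencies
intervals = [(2.4, 2.9), (2.9, 3.3), (3.3, 3.8), (3.8, 4.5)]
freqs = [3, 6, 6, 4]
print(round(grouped_mode(intervals, freqs), 2))  # 3.2
```

Note that the modal interval is chosen by density, not by frequency: the second and third intervals both contain 6 observations, but only the second one has the greatest height.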
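2. The arithmetic mean: The arithmetic mean is defined as
$$\bar{x} = \frac{\sum_{i=1}^{k} x_i n_i}{N}$$
or, with other notation, $\bar{x} = \sum_{i=1}^{k} x_i f_i$, where $x_i$ are the observed values for non-grouped
variables or the representative values of the intervals for grouped variables.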
From this definition we obtain $\bar{x} N = \sum_{i=1}^{k} x_i n_i$. Thus $\bar{x}$ is just the constant value that, when taken $N$
times, adds up to the same amount as the total of the original distribution. If we say that for the
distribution of monthly salaries the arithmetic mean is
$$\bar{x} = \frac{1400 \cdot 3 + 1500 \cdot 2 + 1950 \cdot 2 + 2300}{8} = 1675 \text{ (€)}$$
we are stating that if every employee earned this salary then the amount of money the company distributes
would remain unchanged.
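A minimal Python sketch of this calculation, using the salary distribution of the chapter (the function name `arithmetic_mean` is hypothetical):

```python
def arithmetic_mean(values, freqs):
    """Sum of x_i * n_i divided by N, where N is the sum of the n_i."""
    total = sum(x * n for x, n in zip(values, freqs))
    return total / sum(freqs)

values = [1400, 1500, 1950, 2300]  # monthly salaries (EUR)
freqs = [3, 2, 2, 1]               # absolute frequencies
mean = arithmetic_mean(values, freqs)
print(mean)               # 1675.0
print(mean * sum(freqs))  # 13400.0, the total amount paid in salaries
```

The second printed value illustrates the interpretation above: the mean taken $N$ times reproduces the total of the distribution.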
The following properties of the arithmetic mean are easy to understand, and it is important to bear
them in mind:
a) The total summation of the deviations of the observed values with respect to their arithmetic mean
is zero (the deviation of an observed value with respect to a constant is defined as the difference
between the observed value and the constant).
Proof: From the properties of the summations and the definition of the arithmetic mean, we obtain the
following chain of identities:
$$\sum_{i=1}^{k} (x_i - \bar{x})\, n_i = \sum_{i=1}^{k} x_i n_i - \bar{x} \sum_{i=1}^{k} n_i = N\bar{x} - N\bar{x} = 0,$$
as required.
Of course, by simply dividing by $N$, the same property can also be written as $\sum_{i=1}^{k} (x_i - \bar{x})\, f_i = 0$.
Important remark: The reader must be sure to understand all the steps in the previous demonstration.
We will have to deal with summation symbols very often and it is important to recognize what
properties are true and can be applied. So just in case the reader finds any difficulty in understanding
what every step of the demonstration says, we are going to write the same demonstration but with
different notation:
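$$\sum_{i=1}^{k} (x_i - \bar{x})\, n_i = (x_1 - \bar{x}) n_1 + \dots + (x_k - \bar{x}) n_k = x_1 n_1 + \dots + x_k n_k - \bar{x}\,(n_1 + \dots + n_k) = N\bar{x} - N\bar{x} = 0$$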
Interpretation: This last property means that $\bar{x}$ is the centre of gravity of the set of values. For instance,
for the salary distribution we know that $\bar{x} = 1675$ and the situation is as follows:
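| Observed values $x_i$ | Frequencies $n_i$ | Deviations with respect to $\bar{x}$: $x_i - \bar{x}$ |
|---|---|---|
| 1400 | 3 | -275 |
| 1500 | 2 | -175 |
| 1950 | 2 | 275 |
| 2300 | 1 | 625 |

$$\sum_{i=1}^{k} (x_i - \bar{x})\, n_i = -275 \cdot 3 - 175 \cdot 2 + 275 \cdot 2 + 625 = 0$$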
With this graphical interpretation, where every big dot represents an observation, $\bar{x}$ is the centre of gravity of the distribution: it is where the balance point would be if the dots were weights.
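Property a) can be checked numerically for the salary distribution. The following lines are only a sketch that recomputes the deviations and their frequency-weighted sum:

```python
values = [1400, 1500, 1950, 2300]  # monthly salaries (EUR)
freqs = [3, 2, 2, 1]               # absolute frequencies
N = sum(freqs)
mean = sum(x * n for x, n in zip(values, freqs)) / N

# Deviation of each observed value with respect to the mean
deviations = [x - mean for x in values]
weighted_sum = sum(d * n for d, n in zip(deviations, freqs))
print(deviations)    # [-275.0, -175.0, 275.0, 625.0]
print(weighted_sum)  # 0.0
```

With other data the weighted sum may differ from zero by a tiny rounding error, because the mean is stored as a floating-point number.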
b) If a constant $c$ is added to every value $x_i$ of the original distribution, the new arithmetic mean is
the original arithmetic mean plus that constant.
This property is clearly intuitive, and if for instance the marks of a group of students are increased by
one point, everybody knows that the new arithmetic mean will be the original arithmetic mean also
increased by one point. However, it is very convenient to be able to understand and reproduce the
formal demonstration of this property:
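Proof: The tabulations of the original variable $X$ and that of the transformed one $Y = X + c$ are as
follows:

| Distribution of $X$: values | Frequencies | Distribution of $Y = X + c$: values | Frequencies |
|---|---|---|---|
| $x_1$ | $n_1$ | $y_1 = x_1 + c$ | $n_1$ |
| $x_2$ | $n_2$ | $y_2 = x_2 + c$ | $n_2$ |
| ... | ... | ... | ... |
| $x_k$ | $n_k$ | $y_k = x_k + c$ | $n_k$ |
| Total | $N$ | Total | $N$ |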
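The arithmetic mean of $X$ is $\bar{x} = \frac{\sum_{i=1}^{k} x_i n_i}{N}$, whereas the arithmetic mean for $Y$ is the following:
$$\bar{y} = \frac{1}{N} \sum_{i=1}^{k} y_i n_i = \frac{1}{N} \sum_{i=1}^{k} (x_i + c)\, n_i = \frac{1}{N} \left( \sum_{i=1}^{k} x_i n_i + c \sum_{i=1}^{k} n_i \right) = \frac{1}{N} \left( \sum_{i=1}^{k} x_i n_i + cN \right) = \bar{x} + c$$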
The following property is similar to the previous property and can be demonstrated in a similar way.
c) If every value $x_i$ is multiplied by a constant $c$, then the new arithmetic mean is the original
arithmetic mean multiplied by the same constant.
Remark (changes of origin and scale): When a constant is added to or subtracted from every value of a
distribution, we say that a change of origin has taken place. That is the situation if, for instance, we
want to convert years from the old Roman calendar to the Christian calendar,
because we have to subtract the constant 753 (see Chapter 2, page 3).
When we multiply by a positive constant, the operation is a change of scale, as happens, for
instance, when we change a measurement from centimetres to metres, because we have to
multiply by 0.01 (which is obviously the same as dividing by 100).
Therefore properties b) and c) tell us that the arithmetic mean is sensitive to changes of origin and
scale. Or written more formally, if $X$ is the original distribution and $Y$ is the transformed distribution,
when $Y = X + a$ then $\bar{y} = \bar{x} + a$, and when $Y = cX$ then $\bar{y} = c\bar{x}$. If we combine both results we can say
that when $Y = cX + a$ then $\bar{y} = c\bar{x} + a$, and so when a linear change is applied to the distribution, then
the arithmetic mean is transformed in the same way.
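The combined result can be verified numerically. In this sketch the scale factor `c = 1.5` and the shift `a = 100` are hypothetical values chosen only for illustration, applied to the salary distribution of the chapter:

```python
def mean(values, freqs):
    """Arithmetic mean of a frequency distribution."""
    return sum(x * n for x, n in zip(values, freqs)) / sum(freqs)

values = [1400, 1500, 1950, 2300]
freqs = [3, 2, 2, 1]
c, a = 1.5, 100  # hypothetical change of scale and change of origin

# Y = cX + a keeps the frequencies and transforms only the values
transformed = [c * x + a for x in values]
print(mean(transformed, freqs))     # 2612.5
print(c * mean(values, freqs) + a)  # 2612.5
```

Both lines print the same number, as the relation $\bar{y} = c\bar{x} + a$ predicts.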
d) Relationship between the arithmetic mean and the strata of a distribution: Consider a set of data
divided into two parts or strata, where we have the following information about each of these strata:

| | Stratum 1 | Stratum 2 |
|---|---|---|
| Size | $N_1$ | $N_2$ |
| Arithmetic mean | $\bar{x}_1$ | $\bar{x}_2$ |

Then the arithmetic mean for the whole set of data is $\bar{x} = \frac{N_1 \bar{x}_1 + N_2 \bar{x}_2}{N_1 + N_2}$.
Example: Let’s suppose that in a certain subject we consider two groups or strata according to the
gender, the first one of 20 men and the second of 30 women. We know that for the men the arithmetic
mean of the marks is 5.5, and 4.5 for the women. What will the arithmetic mean be for the whole group
of students?
Therefore the arithmetic mean for the whole group is $\bar{x} = \frac{5.5 \cdot 20 + 4.5 \cdot 30}{50} = 4.9$.
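The general demonstration is as follows. Since the size of the whole set of data is $N = N_1 + N_2$, then:
$$\bar{x}_1 = \frac{\sum_{\text{stratum 1}} x_i n_i}{N_1} \;\Rightarrow\; \sum_{\text{stratum 1}} x_i n_i = N_1 \bar{x}_1, \qquad \bar{x}_2 = \frac{\sum_{\text{stratum 2}} x_i n_i}{N_2} \;\Rightarrow\; \sum_{\text{stratum 2}} x_i n_i = N_2 \bar{x}_2$$
$$\bar{x} = \frac{\sum_{\text{global}} x_i n_i}{N} = \frac{\sum_{\text{stratum 1}} x_i n_i + \sum_{\text{stratum 2}} x_i n_i}{N_1 + N_2} = \frac{N_1 \bar{x}_1 + N_2 \bar{x}_2}{N}$$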
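The strata formula can be sketched in Python with the marks example. The function name `combined_mean` is hypothetical, and accepting any number of strata is an extension assumed here; the text only states the case of two.

```python
def combined_mean(sizes, means):
    """Global mean from the sizes N_j and the means of the strata."""
    return sum(n * m for n, m in zip(sizes, means)) / sum(sizes)

# 20 men with mean mark 5.5 and 30 women with mean mark 4.5
print(combined_mean([20, 30], [5.5, 4.5]))  # 4.9
```

The global mean is closer to 4.5 than to 5.5 because the second stratum is larger and therefore weighs more.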
3. The median. The median is to be understood as a middle point according to the order of the
observed values.
The median for non-grouped variables: Here the median is the first value for which the cumulative
relative frequency is equal to or greater than 0.5. If N is an odd number there is always only one
median, but if N is even there can be two of them, as shown in the tables for two examples where the
central values (the medians) are printed in bold.
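Set of data 2, 3, 5, 5, 5. The only central value is $Me = 5$.

| $x_i$ | $n_i$ | $F_i$ |
|---|---|---|
| 2 | 1 | 1/5 = 0.2 |
| 3 | 1 | 2/5 = 0.4 |
| 5 | 3 | 1 |

$N = 5$

Set of data 2, 4, 4, 5, 5, 5. There are two central values, $Me = 4$ and $Me = 5$.

| $x_i$ | $n_i$ | $F_i$ |
|---|---|---|
| 2 | 1 | 1/6 ≈ 0.17 |
| 4 | 2 | 3/6 = 0.5 |
| 5 | 3 | 1 |

$N = 6$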
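The two examples can be reproduced with a short Python sketch. The function name `medians` is hypothetical; the comparison is done with integers (twice the cumulative frequency against $N$) so that the case $F_i = 0.5$ is detected exactly, which is a choice of this sketch.

```python
def medians(data):
    """Central value(s) of a data set, found from the cumulative frequencies."""
    n_total = len(data)
    values = sorted(set(data))
    cumulative = 0
    for idx, x in enumerate(values):
        cumulative += data.count(x)
        if 2 * cumulative > n_total:
            return [x]                   # F_i exceeds 0.5: one median
        if 2 * cumulative == n_total:
            return [x, values[idx + 1]]  # F_i is exactly 0.5: two central values

print(medians([2, 3, 5, 5, 5]))     # [5]
print(medians([2, 4, 4, 5, 5, 5]))  # [4, 5]
```

Two central values can only appear when $N$ is even and some cumulative frequency is exactly $N/2$.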
The median for grouped variables: In line with the previous case, here $Me$ is defined as
the point for which the cumulative relative frequency is 0.5. That is, $Me$ is the value whose position in
the graph is like this:
Thus, the interval $(L_{i-1}, L_i]$ can be determined easily, and according to the previous figure the median is
$Me = L_{i-1} + m$, where $m$ needs to be calculated.
To this end, we can use that the density in the interval $(L_{i-1}, Me]$ is the same as that in $(L_{i-1}, L_i]$. But the
density is the amount of non-cumulative frequency divided by the width, therefore
$$\frac{0.5 - F_{i-1}}{m} = \frac{f_i}{a_i}.$$
Then, by isolating $m$, the result we were looking for is $m = \frac{0.5 - F_{i-1}}{f_i}\, a_i$, and the final formula is:
$$Me = L_{i-1} + \frac{0.5 - F_{i-1}}{f_i}\, a_i,$$
with $(L_{i-1}, L_i]$ as the interval where the relative cumulative frequency attains 0.5.
Example: We are asked to find the median value for the grouped variable of newborns’ weights.
Therefore, to find the answer we have to proceed in this way:
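A minimal Python sketch of this calculation, assuming the newborns' weights table used earlier in the chapter (left end points, widths and absolute frequencies as listed there). The function name `grouped_median` is hypothetical, and exact fractions are used for the frequencies only to avoid rounding when locating the interval.

```python
from fractions import Fraction

def grouped_median(left_ends, widths, freqs):
    """Me = L_{i-1} + (0.5 - F_{i-1}) / f_i * a_i for grouped data."""
    n_total = sum(freqs)
    cum_prev = Fraction(0)                  # F_{i-1}
    for left, a, n in zip(left_ends, widths, freqs):
        f = Fraction(n, n_total)            # relative frequency f_i
        if cum_prev + f >= Fraction(1, 2):  # F_i attains 0.5 in this interval
            return left + float((Fraction(1, 2) - cum_prev) / f) * a
        cum_prev += f

left_ends = [2.4, 2.9, 3.3, 3.8]
widths = [0.5, 0.4, 0.5, 0.7]
freqs = [3, 6, 6, 4]
print(round(grouped_median(left_ends, widths, freqs), 3))  # 3.342
```

Here the cumulative relative frequency is 9/19 at 3.3 and 15/19 at 3.8, so the median interval is (3.3, 3.8] and $m = \frac{0.5 - 9/19}{6/19} \cdot 0.5 \approx 0.042$.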
Remark 1: As can be seen, the median could also be defined as the value for which the absolute
cumulative frequency is $N/2$. This would lead to the formula:
$$Me = L_{i-1} + \frac{N/2 - N_{i-1}}{n_i}\, a_i,$$
with $(L_{i-1}, L_i]$ as the interval where the cumulative frequency attains $N/2$.
This formula is equivalent to the previous formula, as the only change is that, in the fraction, the
numerator and denominator have been multiplied by $N$.
Remark 2: The concept of median can be extended to qualitative characters on the condition that the
categories of the character can be sorted in a natural way, that is, if the character is measured on an
ordinal scale. For instance, if in a group of $N = 50$ people 15 have only primary education, another
30 have secondary education and the remaining 5 have university education, then as $N/2 = 25$
the median of the qualitative variable X: level of education is secondary education, because
this would be the central value once the whole group has been sorted according to the level of
education.
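The ordinal case can be sketched as follows. The function name `ordinal_median` is hypothetical, and the categories are assumed to be supplied already sorted in their natural order:

```python
def ordinal_median(levels, freqs):
    """First category, in its natural order, whose cumulative frequency reaches N/2."""
    n_total = sum(freqs)
    cumulative = 0
    for level, n in zip(levels, freqs):
        cumulative += n
        if 2 * cumulative >= n_total:
            return level

levels = ["primary", "secondary", "university"]  # sorted by level of education
freqs = [15, 30, 5]
print(ordinal_median(levels, freqs))  # secondary
```

Only the order of the categories is used; no arithmetic is done on the categories themselves, which is why the median makes sense on an ordinal scale while the arithmetic mean does not.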
4. The quantiles. The quantiles are a generalization of the median and are defined so that between any
two consecutive ones there is a certain percentage of the observed values. The most frequently used are
the following:
Quartiles: defined as the three points $Q_k$ ($k = 1, 2, 3$) such that between any two consecutive ones
there is a quarter of the total observed values. Graphically represented:
Figure: Between any two consecutive quartiles there are 25% of the values. Observe that the second quartile is the median.
Figure: The cumulative frequency associated with every quartile is as shown.
Deciles: defined as the nine points $D_k$ ($k = 1, 2, \dots, 9$) such that between any two consecutive ones
there is a tenth of the observed values, that is, graphically:
Figure: Between any two consecutive deciles there are 10% of the values. Observe that the fifth decile is the median.
Figure: The cumulative frequency associated with every decile is as shown.
Percentiles: similarly defined as the ninety-nine points $P_k$ ($k = 1, 2, \dots, 99$) such that between any two
consecutive ones there is a hundredth of the observed values. Therefore $P_{50}$ is the median, $P_{10}$ is
$D_1$, $P_{20}$ is $D_2$, $P_{25}$ is $Q_1$, and so on.
To calculate any quantile we can proceed as we did for the median with the only change being the
cumulative frequency. The most interesting case is for grouped data, and these are the formulae:
For the quartiles:
$$Q_k = L_{i-1} + \frac{\frac{k}{4} - F_{i-1}}{f_i}\, a_i,$$
with $(L_{i-1}, L_i]$ as the interval where the relative cumulative frequency attains $\frac{k}{4}$.
Or equivalently:
$$Q_k = L_{i-1} + \frac{\frac{k}{4} N - N_{i-1}}{n_i}\, a_i,$$
with $(L_{i-1}, L_i]$ as the interval where the cumulative frequency attains $\frac{k}{4} N$.
Example: Take again the table of newborns’ weights and let’s suppose we are asked to calculate the
first quartile and the 87th percentile. Then we would have to proceed in this way:
| Intervals $(L_{i-1}, L_i]$ | Absolute frequencies $n_i$ | Relative frequencies $f_i$ | Cumulative relative frequencies $F_i$ | Widths $a_i$ |
|---|---|---|---|---|
| [2.4, 2.9] | 3 | 3/19 | 3/19 ≈ 0.158 | 0.5 |
| (2.9, 3.3] | 6 | 6/19 | 9/19 ≈ 0.474 | 0.4 |
| (3.3, 3.8] | 6 | 6/19 | 15/19 ≈ 0.789 | 0.5 |
| (3.8, 4.5] | 4 | 4/19 | 1 | 0.7 |

$N = 19$
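The two requested quantiles can be obtained from this table with a short Python sketch. The function name `grouped_quantile` and its parameter `p` (the target cumulative relative frequency, 1/4 for $Q_1$ and 87/100 for $P_{87}$) are hypothetical; the figures printed are what this sketch yields from the table above.

```python
from fractions import Fraction

def grouped_quantile(p, left_ends, widths, freqs):
    """Point whose cumulative relative frequency is p, with 0 < p < 1."""
    n_total = sum(freqs)
    cum_prev = Fraction(0)        # F_{i-1}
    for left, a, n in zip(left_ends, widths, freqs):
        f = Fraction(n, n_total)  # relative frequency f_i
        if cum_prev + f >= p:     # F_i attains p in this interval
            return left + float((p - cum_prev) / f) * a
        cum_prev += f

left_ends = [2.4, 2.9, 3.3, 3.8]
widths = [0.5, 0.4, 0.5, 0.7]
freqs = [3, 6, 6, 4]

q1 = grouped_quantile(Fraction(1, 4), left_ends, widths, freqs)
p87 = grouped_quantile(Fraction(87, 100), left_ends, widths, freqs)
print(round(q1, 3), round(p87, 3))  # 3.017 4.068
```

The first quartile falls in (2.9, 3.3], because 0.25 lies between 0.158 and 0.474, and the 87th percentile falls in (3.8, 4.5], because 0.87 lies between 0.789 and 1.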
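| Distribution X: salaries $x_i$ (€) | Absolute frequencies | Relative frequencies | Distribution Y: salaries $y_i$ (€) | Absolute frequencies | Relative frequencies |
|---|---|---|---|---|---|
| $x_1 = 1000$ | $n_1 = 30$ | $f_1 = 0.3$ | $y_1 = 1000$ | $n_1 = 27$ | $f_1 = 0.3$ |
| $x_2 = 1500$ | $n_2 = 40$ | $f_2 = 0.4$ | $y_2 = 1500$ | $n_2 = 36$ | $f_2 = 0.4$ |
| $x_3 = 2000$ | $n_3 = 20$ | $f_3 = 0.2$ | $y_3 = 2000$ | $n_3 = 18$ | $f_3 = 0.2$ |
| $x_4 = 4000$ | $n_4 = 10$ | $f_4 = 0.1$ | $y_4 = 4000$ | $n_4 = 9$ | $f_4 = 0.1$ |
| Total | $N = 100$ | | Total | $N = 90$ | |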
The salaries do not change, and the relative frequencies do not change either. The mode, the arithmetic
mean, the median and in general the quantiles can be calculated by using the relative frequencies, and
therefore all of them keep their original values. We can easily see that in both cases the mode and the
median are €1,500, the arithmetic mean is €1,700, and so on.
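This invariance can be checked with a small Python sketch on the two distributions of the table. The function name `summary` is hypothetical, and for simplicity it returns only the first mode and the first median.

```python
def summary(values, freqs):
    """Mode, arithmetic mean and median of a non-grouped distribution."""
    n_total = sum(freqs)
    mode = values[freqs.index(max(freqs))]  # first value with the top frequency
    mean = sum(x * n for x, n in zip(values, freqs)) / n_total
    cumulative = 0
    for x, n in zip(values, freqs):
        cumulative += n
        if 2 * cumulative >= n_total:       # cumulative relative frequency >= 0.5
            median = x
            break
    return mode, mean, median

salaries = [1000, 1500, 2000, 4000]
print(summary(salaries, [30, 40, 20, 10]))  # (1500, 1700.0, 1500)
print(summary(salaries, [27, 36, 18, 9]))   # (1500, 1700.0, 1500)
```

The second list of frequencies is the first one multiplied by 0.9, and the three measures coincide.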
Appendix: Demonstration of the formula of the mode for grouped data. Remember that the
graphical representation of the mode was like this for the newborns’ weights:
Now we want to show the validity of the general formula $Mo = L_{i-1} + \frac{h_i - h_{i-1}}{2h_i - h_{i-1} - h_{i+1}}\, a_i$.
To this end, it can be seen in the figure that, once the modal interval $(L_{i-1}, L_i]$ has been selected, the mode $Mo$ is
defined as $Mo = L_{i-1} + m$, where $m$ is the distance on the X axis from $L_{i-1}$ to the point directly below the point $A$. To calculate $m$
we use the fact that the two triangles $B'AB$ and $CAC'$ are similar and consequently
$$\frac{m}{h_i - h_{i-1}} = \frac{a_i - m}{h_i - h_{i+1}}.$$
Clearing the denominators we obtain $h_i m - h_{i+1} m + h_i m - h_{i-1} m = (h_i - h_{i-1})\, a_i$.
Rearranging the equation gives $(2h_i - h_{i-1} - h_{i+1})\, m = (h_i - h_{i-1})\, a_i$, and if we isolate $m$ then we have
$m = \frac{h_i - h_{i-1}}{2h_i - h_{i-1} - h_{i+1}}\, a_i$. The final formula is thereby given, that is, $Mo = L_{i-1} + \frac{h_i - h_{i-1}}{2h_i - h_{i-1} - h_{i+1}}\, a_i$.
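The formula can be compared numerically with a geometric construction. Since the figure is not reproduced here, the construction used below is an assumption: one segment joins the top of the left neighbouring bar at $L_{i-1}$ with the top of the modal bar at $L_i$, the other joins the top of the modal bar at $L_{i-1}$ with the top of the right neighbouring bar at $L_i$, and the mode is taken as the abscissa of their crossing point. The function names are hypothetical.

```python
def intersect_x(p1, p2, q1, q2):
    """x-coordinate where the line through p1, p2 meets the line through q1, q2."""
    slope_p = (p2[1] - p1[1]) / (p2[0] - p1[0])
    slope_q = (q2[1] - q1[1]) / (q2[0] - q1[0])
    # Solve p1.y + slope_p * (x - p1.x) = q1.y + slope_q * (x - q1.x) for x
    return (q1[1] - p1[1] + slope_p * p1[0] - slope_q * q1[0]) / (slope_p - slope_q)

def mode_formula(left, a, h_prev, h, h_next):
    """Mo = L_{i-1} + (h_i - h_{i-1}) / (2 h_i - h_{i-1} - h_{i+1}) * a_i."""
    return left + (h - h_prev) / (2 * h - h_prev - h_next) * a

# Newborns' weights: modal interval (2.9, 3.3] with heights 6, 15 and 12
left, a, h_prev, h, h_next = 2.9, 0.4, 6, 15, 12
crossing = intersect_x((left, h_prev), (left + a, h), (left, h), (left + a, h_next))
print(round(crossing, 6), round(mode_formula(left, a, h_prev, h, h_next), 6))  # 3.2 3.2
```

Under that assumed construction, the crossing point and the formula give the same value for this example.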
Summary. The main position measures in a distribution are the mode, the arithmetic mean, the
median and the quantiles.
The mode informs about the “trendiest value”.
For the salary example, $Mo = 1{,}400$ (€).
The arithmetic mean $\bar{x} = \frac{\sum_{i=1}^{k} x_i n_i}{N}$ or $\bar{x} = \sum_{i=1}^{k} x_i f_i$ is the main average for a set of data due to the
fact that it is the constant value that, when taken $N$ times, adds up to the same amount as the total of the
original distribution, that is, $\bar{x} N = \sum_{i=1}^{k} x_i n_i$. Given two strata or groups, the relationship between the
global mean and those of the strata is given by $\bar{x} = \frac{N_1 \bar{x}_1 + N_2 \bar{x}_2}{N}$.
The median is such a value that its cumulative frequency is half of the whole.
For non-grouped variables you have to order the data and select the one in the central position (if N is
even there are two central positions and there could be two medians).
For grouped variables, to calculate Me you have to determine the point with half of the cumulative
frequency.
$$Me = L_{i-1} + \frac{0.5 - F_{i-1}}{f_i}\, a_i \quad \text{or} \quad Me = L_{i-1} + \frac{N/2 - N_{i-1}}{n_i}\, a_i$$
The quantiles are defined in a similar way to the median but with different cumulative
frequencies. The calculation is the same, just changing the frequency.
For the three quartiles: $Q_k = L_{i-1} + \frac{\frac{k}{4} - F_{i-1}}{f_i}\, a_i$ or $Q_k = L_{i-1} + \frac{\frac{k}{4} N - N_{i-1}}{n_i}\, a_i$
For the nine deciles: $D_k = L_{i-1} + \frac{\frac{k}{10} - F_{i-1}}{f_i}\, a_i$ or $D_k = L_{i-1} + \frac{\frac{k}{10} N - N_{i-1}}{n_i}\, a_i$
For the ninety-nine percentiles: $P_k = L_{i-1} + \frac{\frac{k}{100} - F_{i-1}}{f_i}\, a_i$ or $P_k = L_{i-1} + \frac{\frac{k}{100} N - N_{i-1}}{n_i}\, a_i$
Changes of origin and scale affect the mode, the arithmetic mean and the quantiles in the same way,
that is, by adding a constant or by multiplying by a positive constant.
No position measure is affected if all the absolute frequencies are multiplied by a positive constant.