
CHAPTER 3

FREQUENCY OF DISTRIBUTIONS FOR ONE CHARACTERISTIC

1. Tables and frequencies of a distribution. In Statistics, once the data collection process is finished,
the next step is to present this information in such a way that it can be easily read and understood. To
this end, the easiest method is to build a table showing what the observed data are and how often they
appear. The process of building such tables is called tabulation.

From now on, X will represent a variable under study that is associated with a population or a sample.
The total number of available data will be represented by $N$, and $x_1, x_2, \ldots, x_k$ will denote
the $k$ distinct observed values or categories of X. If the data are of a numerical nature or can be
sorted in a natural way, we will always suppose that the order is $x_1 < x_2 < \ldots < x_k$. Let us
note that $k \le N$, and that $k = N$ holds only when there is no repeated value or category.

Now let’s see how a set of data can be described and organized. To this end, $x_i$ will represent a
certain value or category of X and the following definitions can be given:

Absolute frequency of $x_i$: represented by $n_i$, is the number of times that $x_i$ appears.

Relative frequency of $x_i$: represented by $f_i$, is defined as the ratio between its absolute frequency
and the total amount of data, that is, $f_i = n_i / N$.

Now the next two properties should be recognized:

1. $n_1 + n_2 + \ldots + n_k = \sum_{i=1}^{k} n_i = N$ (the sum of all the absolute frequencies gives us
the size of the population or sample)

2. $f_1 + f_2 + \ldots + f_k = \sum_{i=1}^{k} f_i = 1$ (the sum of all the relative frequencies equals 1)

----------------------------------------------------------------------------------------------------------------------------
Example 1: A group of 9 people has been asked about the variable X: number of siblings they have,
and their answers are as follows: 1, 1, 0, 2, 5, 2, 2, 1, 0. Now we are asked to tabulate these data and to
indicate the absolute and relative frequencies of each value.

The answer must be something like this:


Observed values   Absolute frequencies   Relative frequencies
    $x_i$               $n_i$                  $f_i$
      0                   2                 2/9 ≈ 0.22
      1                   3                 3/9 ≈ 0.33
      2                   3                 3/9 ≈ 0.33
      5                   1                 1/9 ≈ 0.11
    Total               N = 9                    1
----------------------------------------------------------------------------------------------------------------------------

So tabulating is just listing the distinct values in order and indicating, through their frequencies,
how many times each one appears. This makes a first evaluation of the phenomenon easier.

Sometimes, the relative frequencies can appear as a percentage, that is, for the first relative frequency
instead of 0.22 we could write 22%, and so on.
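
The tabulation of Example 1 can be reproduced with a short Python sketch. The function name frequency_table is only an illustrative choice, not part of the chapter; the code assumes nothing beyond the standard library.

```python
from collections import Counter

def frequency_table(data):
    """Return rows (x_i, n_i, f_i), sorted by the observed value x_i."""
    counts = Counter(data)   # absolute frequencies n_i
    total = len(data)        # N, the total amount of data
    return [(x, n, n / total) for x, n in sorted(counts.items())]

siblings = [1, 1, 0, 2, 5, 2, 2, 1, 0]   # data of Example 1
table = frequency_table(siblings)

for x, n, f in table:
    # value, absolute frequency, relative frequency, percentage
    print(f"{x:>3} {n:>3} {f:6.2f} {f:6.0%}")

# The two properties of the frequencies can be checked directly.
assert sum(n for _, n, _ in table) == len(siblings)
assert abs(sum(f for _, _, f in table) - 1) < 1e-12
```

Sorting the distinct values before building the rows keeps the table in the order $x_1 < x_2 < \ldots < x_k$ assumed in the text.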

If the data can be sorted, these other definitions are useful too:

Cumulative (absolute) frequency of $x_i$: represented by $N_i$, is defined as the total number of
observations of X that are less than or equal to $x_i$, that is,
$N_i = n_1 + n_2 + \ldots + n_i = \sum_{j=1}^{i} n_j$.

Cumulative relative frequency of $x_i$: represented by $F_i$, is defined as the ratio between the
cumulative absolute frequency and the total amount of data, that is, $F_i = N_i / N$, or also
$F_i = f_1 + f_2 + \ldots + f_i = \sum_{j=1}^{i} f_j$.

----------------------------------------------------------------------------------------------------------------------------

Example 2: A total of N= 8 workers have been asked about their monthly salaries (in €) and their
answers are: 1400, 1950, 1400, 1500, 1400, 1500, 1950, 2300. Now we are asked to tabulate these data
by indicating all the frequencies we have defined.

Then, the frequency table of the variable X: monthly salary in euros should look like this:

$x_i$    $n_i$    $f_i$          $N_i$    $F_i$
1400       3      3/8 = 0.375      3      3/8 = 0.375
1500       2      2/8 = 0.25       5      5/8 = 0.625
1950       2      2/8 = 0.25       7      7/8 = 0.875
2300       1      1/8 = 0.125      8      8/8 = 1
         N = 8

----------------------------------------------------------------------------------------------------------------------------

Here there are other general properties that should be recognized:

1. $N_k = N$, $F_k = 1$

2. $N_i = N_{i-1} + n_i$ and $F_i = F_{i-1} + f_i$ for $i = 2, \ldots, k$
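
As a minimal sketch, the cumulative frequencies of Example 2 can be obtained as running sums of the absolute frequencies. The variable names (values, n, N_cum, F_cum) are illustrative and not taken from the chapter.

```python
from collections import Counter
from itertools import accumulate

salaries = [1400, 1950, 1400, 1500, 1400, 1500, 1950, 2300]  # Example 2
N = len(salaries)

# Distinct values x_i in increasing order with their absolute frequencies n_i.
values, n = zip(*sorted(Counter(salaries).items()))

N_cum = list(accumulate(n))        # N_i = n_1 + ... + n_i
F_cum = [Ni / N for Ni in N_cum]   # F_i = N_i / N

for x, ni, Ni, Fi in zip(values, n, N_cum, F_cum):
    print(x, ni, ni / N, Ni, Fi)

# Property 1: the last cumulative frequencies are N and 1.
assert N_cum[-1] == N and F_cum[-1] == 1
# Property 2: each cumulative frequency adds n_i to the previous one.
assert all(N_cum[i] == N_cum[i - 1] + n[i] for i in range(1, len(n)))
```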

2. Grouped variables. Sometimes we need to group several numerical values in the same interval.
This will mainly be because the variable is continuous and it isn’t worth distinguishing between values
that are very close, or because the amount of data is too large to deal with individual values. In this case
the intervals are usually taken left-open and right-closed, except the first interval which is also left-
closed. The left end point of the first interval is usually the smallest data value, and the right end point
of the last interval will usually be the greatest value.

The usual notation for grouping intervals is as shown below:

Intervals                              Absolute frequencies
1st interval    $[L_0, L_1]$           $n_1$
2nd interval    $(L_1, L_2]$           $n_2$
3rd interval    $(L_2, L_3]$           $n_3$
...             ...                    ...
kth interval    $(L_{k-1}, L_k]$       $n_k$

Definition: The width of the ith interval is $a_i = L_i - L_{i-1}$.


Example 3: Let X be the variable: weight in kg of N=19 newborns delivered in a hospital on a certain
day. In this case, as there is a lot of data, it doesn’t really matter if we don’t consider the exact weight,
and it has been decided to group them into these intervals [2.4, 2.9] (2.9, 3.3] (3.3, 3.8] (3.8, 4.5],
giving rise to this table:

Intervals           Frequencies   Cumulative          Widths   Interval
$L_{i-1} - L_i$     $n_i$         frequencies $N_i$   $a_i$    midpoints $x_i$
[2.4, 2.9]            3             3                  0.5       2.65
(2.9, 3.3]            6             9                  0.4       3.1
(3.3, 3.8]            6            15                  0.5       3.55
(3.8, 4.5]            4            19                  0.7       4.15

--------------------------------------------------------------------------------------------------------------------------
Some remarks:
1) When the data are not recorded exactly, but only assigned to the interval to which they belong,
there is a loss of information called grouping error.

2) Intervals can be identified with one suitable value when we need to, e.g., if we have to select one
of their points to calculate the data average. The usual procedure is to select the interval midpoint
$x_i = (L_{i-1} + L_i) / 2$. This will always be our default choice unless there is a reason for a more
suitable one.

So for instance, if we want to find out the average value of the newborns’ weights, we shall identify
each interval with its midpoint and calculate it as
$(2.65 \cdot 3 + 3.1 \cdot 6 + 3.55 \cdot 6 + 4.15 \cdot 4) / 19 \approx 3.392$ kg.
But let’s suppose that the exact N = 19 weights were the following:
2.8, 4.5, 3.9, 3.65, 2.65, 3, 3.65, 2.4, 4.15, 3.3, 3.15, 3.7, 3.2, 3.1, 3.75, 3.8, 3.15, 3.5, 3.95
Now the reader is asked to check that the grouping process was correct and that if we sum up every
value and divide the total by 19, we get that the average value is 3.437 kg. That is, the two
calculations do not match, due to the grouping error.

3) The number of intervals used to group the data should be neither too large nor too small (some
authors recommend taking around $\sqrt{N}$).

4) As a general criterion, an interval is well identified with the points it contains when those points
are distributed in a more or less uniform way inside the interval. So, for instance, were the points
distributed as shown in the next figure, we should choose three grouping intervals for that set of
data, because each one of them would correspond to a different rhythm or density.
The squares represent the data and must be grouped into three different intervals to be uniformly distributed.
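
The check proposed in remark 2 can be sketched in Python as follows. The helper interval_index is a hypothetical name introduced here; it assumes the convention stated above (intervals left-open and right-closed, except the first one, which is also left-closed).

```python
from bisect import bisect_left

weights = [2.8, 4.5, 3.9, 3.65, 2.65, 3, 3.65, 2.4, 4.15, 3.3,
           3.15, 3.7, 3.2, 3.1, 3.75, 3.8, 3.15, 3.5, 3.95]   # Example 3
edges = [2.4, 2.9, 3.3, 3.8, 4.5]                             # L_0, ..., L_k

def interval_index(x, edges):
    """0-based index of the interval (L_{i-1}, L_i] that contains x.

    The first interval is closed on the left, so x == L_0 maps to index 0.
    """
    if not edges[0] <= x <= edges[-1]:
        raise ValueError(f"{x} lies outside the grouping intervals")
    return max(bisect_left(edges, x) - 1, 0)

counts = [0] * (len(edges) - 1)          # absolute frequencies n_i
for w in weights:
    counts[interval_index(w, edges)] += 1

midpoints = [(lo + hi) / 2 for lo, hi in zip(edges, edges[1:])]

N = len(weights)
grouped_mean = sum(m * n for m, n in zip(midpoints, counts)) / N
exact_mean = sum(weights) / N

print(counts)                            # frequencies per interval
print(round(grouped_mean, 3), round(exact_mean, 3))
```

The difference between the two means is the effect of the grouping error: the grouped calculation only knows the interval of each weight, not its exact value.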

Definition: The density of frequencies in a grouping interval is defined as the amount of data that
there is inside the interval divided by its width, that is, $d_i = n_i / a_i$ for absolute frequencies
or $d_i = f_i / a_i$ for relative frequencies.
The density of frequencies is to be interpreted as “how packed the data is in an interval”, in the same
way that a density of population tells us how closely together the inhabitants live in a territory.
Therefore, in the above figure, the left interval is the one with the highest density, whereas the right-
hand-side interval has the lowest density.

3. Graphical representations of quantitative variables. Graphical representations are very useful to
give us a quick and overall view of the phenomena that are being studied. The graphs will differ
according to what information is required to be shown and whether the variable is grouped.

a) Non-grouped variables: For these variables, the non-cumulative frequencies are represented by using
bar charts and the cumulative frequencies by using cumulative frequency diagrams. Let’s study the
different cases:

For the example of the monthly salaries, the bar charts would look like these:

Bar charts for the distribution of monthly salaries

And the cumulative frequencies would look like these:


Graphs of the cumulative frequencies for the distribution of monthly salaries

It’s worth noticing that the cumulative frequency function is always non-decreasing and continuous
from the right, but it jumps at the data points. The height of each jump is just the non-cumulative
frequency of that value.
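
A minimal sketch of this step function for the salary data, assuming only the standard library; the function name cumulative_frequency is an illustrative choice. Using bisect_right counts the observations that are less than or equal to x, which is what makes the function continuous from the right.

```python
from bisect import bisect_right

salaries = [1400, 1950, 1400, 1500, 1400, 1500, 1950, 2300]  # Example 2
sorted_data = sorted(salaries)

def cumulative_frequency(x):
    """Number of observations less than or equal to x."""
    return bisect_right(sorted_data, x)

# The function is constant between data points and jumps at each of them.
print(cumulative_frequency(1399))   # 0: no salary is below 1400
print(cumulative_frequency(1400))   # 3: the jump at 1400 has height n_1 = 3
print(cumulative_frequency(1499))   # 3: still constant just before 1500
print(cumulative_frequency(1500))   # 5: the jump at 1500 has height n_2 = 2
```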

b) Grouped variables: When the data set is grouped, the non-cumulative frequencies are represented
using histograms. To draw a histogram one plots the intervals on the horizontal axis and the
densities on the vertical axis.

Therefore, for the case of the newborns’ weights we can proceed like this:

Intervals           Frequencies   Widths   Absolute densities
$L_{i-1} - L_i$     $n_i$         $a_i$    $d_i = n_i / a_i$
[2.4, 2.9]            3            0.5       6
(2.9, 3.3]            6            0.4      15
(3.3, 3.8]            6            0.5      12
(3.8, 4.5]            4            0.7       5.71

Histogram for the distribution of newborns’ weights (absolute densities)

Now it is important to realize that the area of a rectangle is its base multiplied by its height, that
is, for every rectangle in the histogram its area $A_i$ will be
$A_i = a_i \cdot d_i = a_i \cdot (n_i / a_i) = n_i$, and so the area of every rectangle equals the
frequency of the interval.
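
The relation between heights and areas can be verified numerically. The following Python sketch uses the intervals and frequencies of the newborns’ weights; the variable names are illustrative. It computes both the absolute densities $n_i / a_i$ and the relative densities $f_i / a_i$ defined earlier.

```python
edges = [2.4, 2.9, 3.3, 3.8, 4.5]   # interval end points L_0, ..., L_k
counts = [3, 6, 6, 4]               # absolute frequencies n_i
N = sum(counts)

widths = [hi - lo for lo, hi in zip(edges, edges[1:])]        # a_i
abs_density = [n / a for n, a in zip(counts, widths)]         # d_i = n_i / a_i
rel_density = [(n / N) / a for n, a in zip(counts, widths)]   # d_i = f_i / a_i

# Area of each rectangle = base * height.
abs_areas = [a * d for a, d in zip(widths, abs_density)]
rel_areas = [a * d for a, d in zip(widths, rel_density)]

for row in zip(counts, widths, abs_density, rel_density):
    print("{:>2} {:.1f} {:6.2f} {:6.3f}".format(*row))

# The areas recover the frequencies (up to floating-point rounding).
assert all(abs(A - n) < 1e-9 for A, n in zip(abs_areas, counts))
assert abs(sum(rel_areas) - 1) < 1e-9
```

Because the widths are computed by subtracting decimal end points, the results carry small floating-point errors, which is why the checks use a tolerance instead of exact equality.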

The same process could have been carried out using the relative frequencies instead of the absolute
frequencies. This would have led to a table and a graph like these:

Intervals           Frequencies   Widths   Densities
$L_{i-1} - L_i$     $f_i$         $a_i$    $d_i = f_i / a_i$
[2.4, 2.9]            3/19         0.5       0.316
(2.9, 3.3]            6/19         0.4       0.789
(3.3, 3.8]            6/19         0.5       0.632
(3.8, 4.5]            4/19         0.7       0.301

Histogram for the distribution of newborns’ weights (relative densities)

As you can see, both graphs are in practice the same, as only the scale has changed. Furthermore,
carrying out the same procedure as before we now see that every area is $A_i = f_i$ and the area of
every rectangle equals its relative frequency.

So, the essential thing is to remember that in a histogram the heights $h_i$ can be the absolute or the
relative densities, which in turn makes the areas equal to the frequencies. Therefore, the higher the
rectangle is, the more density it has, and the more area it has, the more data points there are in the
interval. For instance, in the previous figure, the second and third rectangles have the same area
because there are 6 data points in each interval, but the density of the second is greater and so its
rectangle is higher.

Important remark: If the width of every interval is constant, that is, if they have a common width $a$,
then it is also possible to take simply the frequencies as the heights. This can be done because in this
case $d_i = n_i / a$, and if we take $h_i = n_i$ then we obtain $h_i = a \, d_i$ and the heights are
proportional to the densities. In the same way, we obtain $A_i = h_i a = n_i a$ and the areas are
proportional to the frequencies, as they must be. In other words, the heights $h_i$ of the rectangles in
a histogram can always be the densities, but when the intervals have a common width, the heights can
also be the frequencies.

With regard to the representation of the cumulative frequencies, let’s start with a look at the table:

Intervals      Cumulative absolute    Cumulative relative
               frequencies $N_i$      frequencies $F_i$
[2.4, 2.9]             3                   0.158
(2.9, 3.3]             9                   0.474
(3.3, 3.8]            15                   0.789
(3.8, 4.5]            19                   1

From this table we know the cumulative frequencies at the end points of the intervals, that is, the
cumulative absolute frequency at 2.9 must be 3, at 3.3 it must be 9, and so on. But how can frequencies
be assigned to the other points? To answer this question we must make an additional assumption: that
the growth rate of the cumulative frequencies does not change inside the same interval, or, to say it in
a more precise way, that inside each interval the data distribution is uniform. This additional
assumption allows us to represent the cumulative frequencies by a segment on every interval, giving rise
to a polygonal graph of cumulative frequencies, which in our example would look like this:

Graph for the cumulative absolute frequencies of newborns’ weights

The graph is essentially the same for the relative frequencies, as the only thing that changes is the
scale of the vertical axis.

Graph for the relative cumulative frequencies of newborns’ weights

Let’s add two additional comments in relation to these polygonal graphs of cumulative frequencies:
• The assumption that the grouped distribution is uniform inside an interval is coherent with choosing
the interval midpoint as a representative value of the interval.

• The slope $m_i$ of every segment in the graphs is the ratio between the increase of the cumulative
frequency and the increase of the variable, that is, $m_i = (N_i - N_{i-1}) / a_i$ in the first graph
and $m_i = (F_i - F_{i-1}) / a_i$ in the second. But we know that $N_i - N_{i-1} = n_i$ and that
$F_i - F_{i-1} = f_i$, and hence the slopes are $m_i = n_i / a_i$ for the first graph and
$m_i = f_i / a_i$ for the second. Therefore the slopes are the densities, and the intervals where the
polygonal grows more quickly are those where the densities are higher.
In the example of the weights, the maximum slope is in the second interval, which is the one with
the highest density.
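
Under the uniformity assumption, the polygonal of cumulative frequencies is a linear interpolation between the end points of the intervals. The sketch below illustrates this for the newborns’ weights; the function name polygonal_cumulative is hypothetical, and the code assumes that the cumulative frequency is 0 at the left end point of the first interval.

```python
from bisect import bisect_left

edges = [2.4, 2.9, 3.3, 3.8, 4.5]   # end points L_0, ..., L_k
cum = [0, 3, 9, 15, 19]             # cumulative frequency at each end point
                                    # (assumed to be 0 at L_0)

def polygonal_cumulative(x):
    """Cumulative absolute frequency at x, by linear interpolation."""
    if x <= edges[0]:
        return 0.0
    if x >= edges[-1]:
        return float(cum[-1])
    i = bisect_left(edges, x)       # edges[i-1] < x <= edges[i]
    # The slope of the segment is the density n_i / a_i of the interval.
    slope = (cum[i] - cum[i - 1]) / (edges[i] - edges[i - 1])
    return cum[i - 1] + slope * (x - edges[i - 1])

# Halfway through the second interval, half of its 6 data points are
# counted, so the value is close to 3 + 3 = 6.
print(polygonal_cumulative(3.1))
```

Dividing the result by N = 19 gives the corresponding value on the graph of cumulative relative frequencies.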

4. Graphical representations of qualitative variables. Qualitative variables can’t be grouped and
sometimes it doesn’t make any sense to rank them either. Thus for these variables we can again use
bar charts to represent their frequencies, although other similar alternatives are widely used, as for
instance, pie charts.

Example 4: The following table shows the nationalities of foreign travellers who enter Spain through a
certain airport. We are asked to represent that data by a bar chart and by a pie chart.
Central and South Other
Nationality EU USA-Canada Africa
America countries
% of visitors 45.2 18.1 22.9 5.2 8.6

Answer: the graphs should look like these:


Summary: This chapter ends with a summary to help you remember the names of the graphs, what
they represent, and their main features. You must also remember that in general there is no essential
difference when you work with or represent absolute or relative frequencies, although the relative
frequencies are more informative and common in practice.

QUANTITATIVE VARIABLE

NON-CUMULATIVE FREQUENCY
   NON-GROUPED: Bar chart
      The lengths of the bars are proportional to the frequencies.
   GROUPED: Histogram
      The heights of the rectangles are proportional to the densities.
      The areas of the rectangles are proportional to the frequencies.

CUMULATIVE FREQUENCY
   NON-GROUPED: Graph of cumulative frequencies
      Continuous on the right-hand side.
      The step height is the non-cumulative frequency.
   GROUPED: Polygonal of cumulative frequencies
      The slope of every segment is proportional to the interval density.

QUALITATIVE VARIABLE

   Bar chart
      The lengths of the bars are proportional to the frequencies.
   Pie chart
      The areas of the sectors are proportional to the frequencies.
