CHAPTER 2 VISUAL PRESENTATION OF DATA

2.1 Frequency distributions


A dataset may be quite small or it may be very large. In any case, the best way to summarize
a set of data values relating to a given variable is to construct a frequency distribution for that
variable.
A simple (or ungrouped) frequency distribution is a table with two columns that shows
how many times each possible value of the variable is observed in the dataset. The possible
values are shown separately in the first column; for each possible value, the frequency, ie.
the number of times the given value occurs in the dataset, is shown in the second column. It
is not necessary to show classes with zero frequencies, especially if there are many of them.
Example 2.1.1: The following table shows an ungrouped frequency distribution.
Number of A grades scored by a group of students

Number of A grades    Number of students
0                     61
1                     38
2                     17
3                     8
4                     3
5                     2
6                     0
Total                 129

Additional examples A:
Below is an example set of 72 counts of the number of customers entering a shop in a one-
hour period.
7 5 8 0 4 2 5 7 0 1 3
3 5 5 1 7 4 7 4 4 7 3
4 4 6 2 3 7 6 7 5 4 5
5 6 1 2 2 5 3 11 6 7 3
5 3 7 8 2 5 3 9 4 5 2
7 6 5 3 3 8 5 5 6 3 2
4 6 5 7 4 4
Our aim is to summarize this set of measurement data. The most popular way to do this is by
forming a frequency table where each row is a different value, and alongside each value we
report the number of times it occurs. A frequency table for the above data would look like:

Number of customers per hour    Frequency
0                               2
1                               3
2                               8
3                               11
4                               11
5                               14
6                               7
7                               11
8                               3
9                               1
10                              0
11                              1
Total:                          72
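
The tallying step can be sketched in Python. This is an illustration only, not part of the original notes: it uses just the first row of the customer counts above, and the variable names are chosen for this example.

```python
from collections import Counter

# First row of the customer counts above (an illustrative subset only).
counts = [7, 5, 8, 0, 4, 2, 5, 7, 0, 1, 3]

# Counter maps each observed value to the number of times it occurs.
frequency = Counter(counts)

# List every value from the smallest to the largest, including values
# that never occur (frequency 0), so that the table has no missing rows.
table = [(value, frequency[value])
         for value in range(min(counts), max(counts) + 1)]

print("Value  Frequency")
for value, freq in table:
    print(f"{value:>5}  {freq:>9}")
print(f"Total  {sum(freq for _, freq in table):>9}")
```

Applying the same steps to all 72 counts would give one row per value from 0 to 11.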

A grouped frequency distribution condenses data even more, by grouping the possible
values into classes and showing for each class the class frequency, ie. how many values fall
in the class.
Example 2.1.2: The following table shows a grouped frequency distribution.
Fortnightly food expenditure for a group of households

Expenditure (nearest kina)    Number of households
10 - 19                       7
20 - 29                       16
30 - 39                       30
40 - 49                       14
50 - 59                       8
60 - 69                       3
70 - 79                       2
Total                         80

It is often useful to know what proportion, or percentage, of cases falls into each class; for
this purpose one can construct a relative frequency distribution.


Example 2.1.3: The following table shows a relative frequency distribution.


Fortnightly food expenditure for a group of households

Expenditure (nearest kina)    Percentage of households
10 - 19                       8.75
20 - 29                       20.00
30 - 39                       37.50
40 - 49                       17.50
50 - 59                       10.00
60 - 69                       3.75
70 - 79                       2.50
Total                         100.00
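
Each percentage is the class frequency divided by the total number of households, multiplied by 100. The short Python sketch below reproduces the percentages from the class frequencies in Example 2.1.2; the variable names are illustrative.

```python
# Class frequencies from Example 2.1.2 (number of households per class).
classes = ["10 - 19", "20 - 29", "30 - 39", "40 - 49",
           "50 - 59", "60 - 69", "70 - 79"]
frequencies = [7, 16, 30, 14, 8, 3, 2]

total = sum(frequencies)  # 80 households in all

# Relative frequency (%) = class frequency / total * 100.
percentages = [100 * f / total for f in frequencies]

for label, pct in zip(classes, percentages):
    print(f"{label:<10} {pct:6.2f}")
print(f"{'Total':<10} {sum(percentages):6.2f}")
```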

Additional Examples B:
Here is another example. Below is a set of 112 daily takings of a shop, in hundreds of kina.
38 68 69 64 65 55 66 64 52 63
69 50 57 67 50 47 32 51 46 56
57 54 74 48 60 55 68 53 50 69
49 57 40 73 71 62 62 65 62 65
68 47 51 54 57 48 53 43 61 48
74 57 72 58 64 72 42 60 46 78
73 74 57 50 64 59 68 53 63 54
55 76 61 63 62 54 59 72 63 72
57 54 61 69 61 75 67 56 39 33
69 45 67 75 67 51 47 83 64 48
63 62 64 37 67 49 47 70 64 59
66 52
We cannot use a simple frequency table here, since it would have far too many rows.
Instead we use a grouped frequency table (in which we group the data into a set of
intervals) to summarise the data. A grouped frequency table for the above data would look
like:
Interval    Frequency
30-39       5
40-49       16
50-59       33
60-69       42
70-79       15
80-89       1
Total:      112
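
The grouping step can be sketched in Python as follows. The sketch assumes classes of width 10 starting at 30 (that is, 30-39, 40-49, and so on), an assumption made here because it reproduces the frequencies in the table; the variable names are illustrative.

```python
# The 112 daily takings listed above (hundreds of kina).
takings = [
    38, 68, 69, 64, 65, 55, 66, 64, 52, 63,
    69, 50, 57, 67, 50, 47, 32, 51, 46, 56,
    57, 54, 74, 48, 60, 55, 68, 53, 50, 69,
    49, 57, 40, 73, 71, 62, 62, 65, 62, 65,
    68, 47, 51, 54, 57, 48, 53, 43, 61, 48,
    74, 57, 72, 58, 64, 72, 42, 60, 46, 78,
    73, 74, 57, 50, 64, 59, 68, 53, 63, 54,
    55, 76, 61, 63, 62, 54, 59, 72, 63, 72,
    57, 54, 61, 69, 61, 75, 67, 56, 39, 33,
    69, 45, 67, 75, 67, 51, 47, 83, 64, 48,
    63, 62, 64, 37, 67, 49, 47, 70, 64, 59,
    66, 52,
]

start = 30   # assumed lower limit of the first class
width = 10   # assumed class width

# Count how many values fall in the class beginning at each lower limit.
class_counts = {}
for x in takings:
    lower = start + ((x - start) // width) * width
    class_counts[lower] = class_counts.get(lower, 0) + 1

# One row per class, in order, with no gaps between classes.
grouped = [(lower, lower + width - 1, class_counts.get(lower, 0))
           for lower in range(start, max(takings) + 1, width)]

for lower, upper, freq in grouped:
    print(f"{lower}-{upper}  {freq:>3}")
print(f"Total  {sum(freq for _, _, freq in grouped):>3}")
```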
The advantage of using frequency tables to summarise data is that with practice you can more
readily “see” properties of the data, like its distribution, its average value, or its spread of
values. Frequency tables also enable you to easily spot outliers.


Here are five common sense rules to keep in mind when constructing frequency and grouped
frequency tables.
• A table should have approximately 8 rows
The tables earlier had respectively 12 and 6 rows. A table with many more
than 12 rows is hard to read “at a glance”. A table with many fewer than 6
rows causes too much condensation of information.
• Intervals in grouped tables should not overlap
Consider the intervals
5.0 - 6.0
6.0 - 7.0
7.0 - 8.0
These intervals overlap, since (say) a measurement of 6.0 could be entered
into two intervals. The usual way round this problem is to change slightly the
upper or lower end points of the intervals, e.g. 5.0 - 5.9, 6.0 - 6.9, etc. Another
method is to use common endpoints that cannot be a data measurement. For
example, if the data is measured to one decimal place, then the
intervals could be 4.95 - 5.95, 5.95 - 6.95, etc. Although the intervals overlap in
theory, in practice no measurement can fall on a shared endpoint.
• Interval widths, and end points of intervals, should be “friendly” values
The choice of interval widths and end points often requires quite a lot of
thought, and involves some trade-off. Interval widths should be one unit,
two units, or five units wide, or these numbers multiplied by a power of 10
(e.g. 0.001, 0.2, 500, 10000, etc.). Similarly, endpoints should be simple
numbers, e.g. 5.0 - 5.9, 6.0 - 6.9, etc. rather than (say) 5.3 - 6.2, 6.3 - 7.2, etc. The
trade-off in practice is that occasionally it is impossible to devise a scheme that
results in a good interval width, acceptable end points, and about eight
intervals. Something has to be sacrificed.
• A table should not have any row gaps
Consider the table that uses the following intervals:
100 - 109
110 - 119
130 - 139
140 - 149
150 - 159
Do you notice that there is a gap? Which interval is missing? If we leave a
gap, the table gives a distorted picture of the data, including its
distribution, average value and spread.
• Intervals of a table should all be the same width
The following table has intervals that are not all the same:
100 - 119
120 - 129
130 - 139
140 - 159
160 - 179
Using different interval widths distorts the picture of the information being
presented by the table.
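
The overlap, gap and equal-width rules can be checked mechanically. The Python sketch below is one possible check, under the assumption that the data are recorded in whole units so that consecutive intervals should look like 100 - 109, 110 - 119; the function name check_intervals and its unit parameter are hypothetical names introduced for this illustration.

```python
def check_intervals(intervals, unit=1):
    """Return a list of problems found in a list of (lower, upper) intervals.

    Assumes the intervals are given in ascending order and that the data
    are recorded to the nearest `unit` (an assumption of this sketch).
    """
    problems = []

    # Width of an interval such as 100 - 109 is 109 - 100 + 1 = 10 units.
    widths = {upper - lower + unit for lower, upper in intervals}
    if len(widths) > 1:
        problems.append("unequal widths")

    # Compare each interval with the one that follows it.
    for (_, upper), (next_lower, _) in zip(intervals, intervals[1:]):
        if next_lower <= upper:
            problems.append(f"overlap at {next_lower}")
        elif next_lower > upper + unit:
            problems.append(f"gap between {upper} and {next_lower}")

    return problems


# The two faulty tables from the rules above.
with_gap = [(100, 109), (110, 119), (130, 139), (140, 149), (150, 159)]
uneven = [(100, 119), (120, 129), (130, 139), (140, 159), (160, 179)]

print(check_intervals(with_gap))
print(check_intervals(uneven))
```

The first call reports the missing interval between 119 and 130, and the second reports the unequal widths.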

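A frequency histogram for the grouped daily takings data looks like:

[Figure: frequency histogram. Vertical axis: Frequency (0 to 50); horizontal axis: Mark, with
bar centres marked at 35, 45, 55, 65, 75 and 85.]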

We can make the following observations about this, or any, histogram:

• The widths of the histogram bars are all equal
This guarantees the property of (frequency) histograms that the area of each bar
is proportional to the frequency of the interval that the bar represents.
• The horizontal and vertical scales are linear
This serves the same purpose as making table intervals of equal width and
leaving no gaps: it ensures that an observer does not get a distorted
impression of the data being represented by the graph.
In the histogram above, the centre of the base of each bar has been marked. We could have
marked the end points of the bars instead.

A frequency polygon is a line graph obtained by joining the midpoints of the tops of the bars
of the corresponding frequency histogram, as is demonstrated in the following two diagrams:

[Two figures: frequency histogram and frequency polygon. Vertical axis: Frequency (0 to 50);
horizontal axis: Mark (35 to 85).]

Another common graph is the cumulative frequency polygon, or ogive. It is a line graph
obtained by joining the mid-points of a cumulative frequency histogram. This is illustrated in
the following sequence of diagrams:

Interval    Frequency    Cumulative Frequency
30-39       5            5
40-49       16           21
50-59       33           54
60-69       42           96
70-79       15           111
80-89       1            112
Total:      112          -

[Two figures: cumulative frequency histogram and cumulative frequency polygon (ogive).
Vertical axis: Cumulative Frequency (0 to 120); horizontal axis: Mark (35 to 85).]


Additional Examples C:

A sample of 20 juvenile lobsters is randomly selected from a tank containing several
hundred. Each lobster is measured for length (in cm) and the results are as follows.

4.9, 5.6, 7.2, 6.7, 3.1, 4.6, 6.0, 5.0, 3.7, 7.3, 6.0, 5.4, 4.2, 6.6, 4.7, 5.8, 4.4, 3.6, 4.2, 5.4

a. Organize the data using a grouped frequency table.

Solutions
The variable “length of a lobster” is continuous, even though the lengths have been
rounded to the nearest mm.
The shortest length is 3.1 cm and the longest is 7.3 cm, so we will use class intervals
of width 1 cm.

Length l (cm)    Frequency
3 ≤ l < 4        3
4 ≤ l < 5        6
5 ≤ l < 6        5
6 ≤ l < 7        4
7 ≤ l < 8        2

b. State the modal class.

The modal class is 4 ≤ l < 5.
c. Sketch a histogram for the above data.


d. Describe the distribution of the data.

The data is positively skewed.
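
Parts a and b of this example can be reproduced with a short Python sketch. It assumes classes that include their lower limit but not their upper limit (so that a length of 6.0 cm is counted in the class from 6 to 7), which is the assignment that matches the frequencies in the solution; the variable names are illustrative.

```python
import math
from collections import Counter

# The 20 lobster lengths (cm) from the example.
lengths = [4.9, 5.6, 7.2, 6.7, 3.1, 4.6, 6.0, 5.0, 3.7, 7.3,
           6.0, 5.4, 4.2, 6.6, 4.7, 5.8, 4.4, 3.6, 4.2, 5.4]

# A length l belongs to the class a <= l < a + 1, where a = floor(l).
class_counts = Counter(math.floor(l) for l in lengths)

# Rows of the grouped frequency table: (lower limit, upper limit, frequency).
table = [(a, a + 1, class_counts[a]) for a in range(3, 8)]

# The modal class is the class with the highest frequency.
modal_lower, modal_upper, modal_freq = max(table, key=lambda row: row[2])

for lower, upper, freq in table:
    print(f"{lower} <= l < {upper}   {freq}")
print(f"Modal class: {modal_lower} <= l < {modal_upper}")
```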

2.2 Class limits, sizes and midpoints


The lower and upper real limits of a class are respectively the lowest and highest possible
values which can fall in the class. The values of the real, or actual, limits of a class depend
on the type of data and on how the data has been measured.
The real limits must be determined in such a way that the classes are mutually exclusive, ie.
no value may fall into two classes. The classes must also be exhaustive, ie. every possible
value of the variable must fall into one of the classes.
The class interval, or class width, may be defined as the difference between the lower real
limit of a class and the lower real limit of the next class.
The class midpoint is the value halfway between the lower and upper real limits.
Example 2.2.1: For the second class in the fortnightly food expenditure distribution
given in Example 2.1.3 above, using the usual rounding conventions,
(a) lower real limit is K19.50;
(b) upper real limit is K29.49 (recurring) = K29.50 effectively;
(c) width is 29.50 - 19.50 = K10.00;
(d) midpoint is (29.50 + 19.50)/2 = K24.50.
However, if the data in the example referred to ages in completed years, the situation would
be as follows:
(a) lower real limit would be 20.00 years;
(b) upper real limit would be 29.99 (recurring) = 30.00 years effectively;
(c) width would be 30.00 - 20.00 = 10.00 years;
(d) midpoint would be (30.00 + 20.00)/2 = 25.00 years.
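
The two cases in Example 2.2.1 can be written as a small Python function. This is a sketch for whole-number stated limits only; the function name class_summary and the convention labels "nearest" and "completed" are hypothetical names introduced here.

```python
def class_summary(lower, upper, convention="nearest"):
    """Return (lower real limit, upper real limit, width, midpoint).

    `lower` and `upper` are the stated whole-number class limits.
    convention="nearest":   values rounded to the nearest unit.
    convention="completed": values recorded in completed units (e.g. ages).
    """
    if convention == "nearest":
        # 20 - 29 to the nearest unit covers 19.5 up to (effectively) 29.5.
        real_lower, real_upper = lower - 0.5, upper + 0.5
    elif convention == "completed":
        # 20 - 29 completed years covers 20 up to (effectively) 30.
        real_lower, real_upper = float(lower), float(upper + 1)
    else:
        raise ValueError("convention must be 'nearest' or 'completed'")

    # For adjoining classes this equals the distance between the lower
    # real limits of consecutive classes.
    width = real_upper - real_lower
    midpoint = (real_lower + real_upper) / 2
    return real_lower, real_upper, width, midpoint


print(class_summary(20, 29))               # expenditure to the nearest kina
print(class_summary(20, 29, "completed"))  # ages in completed years
```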


2.3 Cumulative frequency distributions


It is often useful to know how many observations, or scores, are less than a certain value or
greater than a certain value. A cumulative frequency distribution is constructed for this
purpose.
Example 2.3.1: The following table shows cumulative frequency distributions.
Fortnightly food expenditure for a group of households

Expenditure       Number of     Cumulative       Cumulative       Relative cumulative
(nearest kina)    households    frequency (<)    frequency (>)    frequency (<) (%)
10 - 19           7             7                80               8.75
20 - 29           16            23               73               28.75
30 - 39           30            53               57               66.25
40 - 49           14            67               27               83.75
50 - 59           8             75               13               93.75
60 - 69           3             78               5                97.50
70 - 79           2             80               2                100.00
Total             80

There are actually two types of cumulative frequency distribution: one (the “less than” type)
showing how many cases lie in each class and the lower-valued classes (third column in the
table above); the other (the “greater than” type) showing how many cases lie in each class
and the higher-valued classes (fourth column). For the fourth class in the example, the figure
67 in the third column shows that there are 67 values in that class and lower-valued classes,
ie. 67 households have expenditure of K49 or less. For the same class, the figure 27 in the
fourth column shows that there are 27 values in that class and higher-valued classes, ie. 27
households have expenditure of K40 or more.
One can also express cumulative frequency distributions in relative terms, of course, as is
done in the fifth column of the table above for the “less than” type.
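
The three cumulative columns of Example 2.3.1 are running totals of the class frequencies. A minimal Python sketch, with illustrative variable names, is:

```python
from itertools import accumulate

# Class frequencies from Example 2.3.1, lowest class first.
frequencies = [7, 16, 30, 14, 8, 3, 2]
total = sum(frequencies)

# "Less than" type: running total from the lowest class upwards.
less_than = list(accumulate(frequencies))

# "Greater than" type: running total from the highest class downwards,
# then put back into the original class order.
greater_than = list(accumulate(reversed(frequencies)))[::-1]

# Relative "less than" cumulative frequency, as a percentage.
relative_less_than = [100 * c / total for c in less_than]

print(less_than)
print(greater_than)
print(relative_less_than)
```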

2.4 Histogram
A histogram is a special kind of bar chart that is used to illustrate a frequency distribution;
each bar is drawn over a class interval (so that all the bars are touching, with no gaps) and the
area of the bar indicates the frequency, or relative frequency, of the class. Conveniently
rounded class limits should be used to mark off the axis.
If the bars are all the same width, which is usually the case, then the height of each bar also
represents the corresponding class frequency. If the bars are not all the same width, then the
area but not the height of each bar is proportional to the corresponding class frequency. In
any case, a vertical axis should not be shown, because the histogram is an “area” diagram –
not a “height” diagram.
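
To illustrate the unequal-width case, suppose (hypothetically) that the last two classes of the food expenditure distribution were merged into a single class, 60 - 79, with frequency 3 + 2 = 5. The Python sketch below computes bar heights as frequency divided by class width, so that each bar's area equals its class frequency. The merged class is an assumption made for this illustration, and the real limits follow the nearest-kina convention of Example 2.2.1.

```python
# (lower real limit, upper real limit, frequency) for each class.
# The last class is a hypothetical merge of 60 - 69 and 70 - 79.
classes = [
    (9.5, 19.5, 7),
    (19.5, 29.5, 16),
    (29.5, 39.5, 30),
    (39.5, 49.5, 14),
    (49.5, 59.5, 8),
    (59.5, 79.5, 5),
]

# Height = frequency / width, so that area = height * width = frequency.
heights = [freq / (upper - lower) for lower, upper, freq in classes]
areas = [h * (upper - lower) for h, (lower, upper, _) in zip(heights, classes)]

for (lower, upper, freq), h in zip(classes, heights):
    print(f"{lower:5.1f} to {upper:5.1f}: frequency {freq:2d}, height {h:.2f}")
```

The merged class is twice as wide as the others, so its bar is drawn at height 0.25 rather than 0.5.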


Example 2.4.1: Illustrate the food expenditure data given in Example 2.3.1 by drawing a
histogram.

2.5 Frequency polygon


A frequency polygon is a chart obtained by joining the midpoints of the tops of the
histogram bars (instead of drawing the bars), including a zero frequency at either end of the
range. It is an alternative way of illustrating a frequency distribution (one normally draws
one or the other – not both).
Example 2.5.1: Illustrate the data in Example 2.3.1 by drawing a frequency polygon.

[Figure: frequency polygon, “Food expenditure for a group of households”. Horizontal axis:
kina per fortnight, marked at 4.5, 14.5, 24.5, 34.5, 44.5, 54.5, 64.5, 74.5 and 84.5.]

If the variable is continuous and one uses many very narrow classes, the polygon would
approach a smooth curve, which is called a frequency curve.
Like the histogram, the frequency polygon and the frequency curve are “area diagrams”;
again a vertical axis should not be shown. The interpretation of these diagrams is that the
area under the polygon or curve between any two values of the variable represents the
relative frequency associated with the interval between the two values, ie. the proportion of
values in the dataset which fall in the interval.
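
The points of the frequency polygon for the food expenditure data can be listed with a few lines of Python. The sketch assumes class midpoints of 14.5, 24.5, and so on (as in Example 2.2.1) and adds a zero-frequency point one class width beyond each end; it also checks that the area under the polygon equals the total frequency times the class width, which is the area of the corresponding histogram. Variable names are illustrative.

```python
# Class midpoints and frequencies for the food expenditure data.
midpoints = [14.5, 24.5, 34.5, 44.5, 54.5, 64.5, 74.5]
frequencies = [7, 16, 30, 14, 8, 3, 2]
width = 10

# Add a zero-frequency point one class width beyond each end of the range.
xs = [midpoints[0] - width] + midpoints + [midpoints[-1] + width]
ys = [0] + frequencies + [0]
points = list(zip(xs, ys))

# Area under the polygon, found by adding the trapezium between each
# pair of neighbouring points.
area = sum((y0 + y1) / 2 * (x1 - x0)
           for (x0, y0), (x1, y1) in zip(points, points[1:]))

print(points)
print(area, sum(frequencies) * width)
```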


2.6 Ogive


The usual way of illustrating a cumulative frequency distribution graphically is to draw a
cumulative frequency curve, which is called an ogive (pronounced “o-jaiv”). The cumulative
frequencies in a “less than” cumulative frequency distribution relate to the number of cases
that are less than (or equal to) class upper limits, so the cumulative frequencies must be
plotted above the class upper limits - not above the class midpoints.
Example 2.6.1: Illustrate the “less than” cumulative frequency distribution in
Example 2.3.1 by an ogive.

[Figure: “less than” ogive, “Food expenditure of a group of households”. Vertical axis:
Cumulative frequency (0 to 80); horizontal axis: kina per fortnight (10 to 80).]

An ogive can also be used to illustrate a “greater than” cumulative frequency distribution, of
course. It has a downward slope and the cumulative frequencies are plotted above the class
lower limits.
The ogive is an extremely useful curve. On the “less than” ogive, the height of the curve
above any given value on the horizontal axis indicates the number of observations in the
dataset whose values are less than (or equal to) that given value. One can use such
information to estimate how many items fall into any given interval - not just class intervals.
By reading off the heights (cumulative frequencies) of the ogive at both ends of the interval,
one can calculate an estimate of the number of items falling in the interval.
Example 2.6.2: From the ogive in the previous example, estimate the number of cases in the
interval from 37 to 46.
Number of cases between 37 and 46 (ie. households with expenditure between K37 and K46)
= [No. of cases with values < 46] - [No. of cases with values < 37]
= [Ogive height at 46] - [Ogive height at 37]
= 61 - 43 = 18
So we estimate that 18 households have food expenditure between K37 and K46 per
fortnight.
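
Reading heights off the ogive can be imitated numerically by linear interpolation between the plotted points. The Python sketch below assumes the points are plotted at 10, 20, ..., 80 on the horizontal axis, with height 0 at 10, and joined by straight lines; both are assumptions of this illustration, and the function name ogive_height is hypothetical. Heights read by eye from a drawn curve, such as the 61 and 43 used above, will differ slightly from the interpolated values.

```python
from bisect import bisect_right

# Assumed plotting positions and "less than" cumulative frequencies.
boundaries = [10, 20, 30, 40, 50, 60, 70, 80]
cumulative = [0, 7, 23, 53, 67, 75, 78, 80]


def ogive_height(x):
    """Estimate the cumulative frequency at x by linear interpolation."""
    if x <= boundaries[0]:
        return 0.0
    if x >= boundaries[-1]:
        return float(cumulative[-1])
    # Index of the plotted point at or just below x.
    i = bisect_right(boundaries, x) - 1
    x0, x1 = boundaries[i], boundaries[i + 1]
    y0, y1 = cumulative[i], cumulative[i + 1]
    return y0 + (x - x0) / (x1 - x0) * (y1 - y0)


estimate = ogive_height(46) - ogive_height(37)
print(ogive_height(46), ogive_height(37), estimate)
```

Under these assumptions the interpolated heights are about 61.4 and 44, giving an estimate of about 17.4 households, close to the 18 obtained from the graph.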


Tutorial exercises
1. What is descriptive statistics? Give two examples of its use.
2. A sample of 10 grocery stores in Newcastle, New South Wales on a particular day
revealed that the average price per kilogram for hamburger mince was $4.50.
(a) What is the population of interest in this study?
(b) In the statistical inference process, we would like to estimate the average price per
kilogram for hamburger mince for all grocery stores in Newcastle, New South
Wales. Suggest a value of such an estimate.
3. Suppose we carry out two statistical surveys. One uses a sample of size 50. The
other uses a sample of size 500.
(a) Which procedure is likely to be more accurate?
(b) Since the second sample size is 10 times larger, can we expect the accuracy to be
10 times better? Why?
4. In each of the following situations, indicate whether the data used in the analysis is
primary or secondary.
(a) We wish to determine the proportion of tourists to PNG that come from Japan.
So we approach the Immigration Department and gain permission to analyse
data given by arriving tourists on their customs cards.
(b) We wish to determine the most common destination for PNG tourists. So we
interview PNG citizens at Jackson’s airport just before they enter the
International Departure Lounge to find out where they are travelling to.
(c) We wish to find the proportion of PNG companies that are willing to hire
union labour. So we approach the managers of a selection of companies and
ask them, and use the information gathered to calculate the required
proportion.
(d) We wish to calculate the movement of PNG stocks on the Sydney stock
exchange. So we get the daily papers, check the stock prices, and use this
information to calculate the average movement.
5. Rank the following observations (from lowest to highest):
1.73 1.77 1.83 1.80 1.74 1.72 1.79 1.79 1.75 1.77 1.73
1.77 1.78 1.77 1.80 1.65 1.75 1.69 1.75 1.73 1.80 1.75
1.80 1.82 1.81 1.81 1.71 1.84 1.71 1.74 1.72 1.76 1.70


6. In order to help decide which new staff to recruit, a large national company
administered an aptitude test to 74 Grade 10 school-leavers. The following are the
marks obtained by the applicants (the test was marked out of 100).
65 78 72 71 71 74 63 71 76 72 64 79 73 75 70 76 78
73 69 76 66 71 70 70 73 75 75 69 73 68 70 67 74 75
70 62 75 71 78 75 74 65 71 71 73 73 77 75 78 73 73
65 68 74 69 66 72 66 63 70 73 77 73 69 76 68 74 78
73 76 72 74 76 67

(a) Construct an ungrouped frequency distribution for this data.


(b) Construct a relative frequency distribution for this data.
(c) Construct a “less than” relative cumulative frequency distribution for this data.
(d) Illustrate the dataset by drawing a histogram.
(e) Is the distribution symmetric or skewed? If it is not symmetric, in which
direction is it skewed, ie. in which direction is the longer tail?
(f) Illustrate the cumulative frequency distribution in (c) with an ogive.
(g) Estimate the 65th percentile of the dataset.
(h) Estimate the first quartile of the dataset.


7. A clinic recorded the heights of 154 adult male patients. The heights (recorded in
metres, rounded to the nearest centimetre) were:

1.73 1.77 1.83 1.80 1.74 1.72 1.79 1.79 1.75 1.77 1.73
1.77 1.78 1.77 1.80 1.65 1.75 1.69 1.75 1.73 1.80 1.75
1.80 1.82 1.81 1.81 1.71 1.84 1.71 1.74 1.72 1.76 1.70
1.86 1.83 1.62 1.72 1.76 1.71 1.71 1.80 1.68 1.72 1.76
1.70 1.73 1.73 1.69 1.77 1.78 1.78 1.74 1.79 1.76 1.76
1.72 1.73 1.69 1.70 1.70 1.72 1.73 1.73 1.69 1.77 1.78
1.70 1.67 1.68 1.82 1.78 1.80 1.74 1.74 1.68 1.68 1.71
1.63 1.77 1.68 1.70 1.69 1.72 1.75 1.68 1.81 1.87 1.67
1.68 1.80 1.78 1.78 1.68 1.82 1.65 1.66 1.88 1.84 1.75
1.85 1.77 1.64 1.76 1.69 1.73 1.79 1.73 1.75 1.81 1.67
1.76 1.67 1.81 1.71 1.69 1.78 1.76 1.72 1.65 1.74 1.80
1.81 1.74 1.69 1.73 1.71 1.76 1.78 1.77 1.76 1.67 1.68
1.75 1.74 1.84 1.79 1.76 1.81 1.72 1.71 1.78 1.73 1.72
1.62 1.72 1.69 1.82 1.87 1.75 1.75 1.77 1.84 1.70 1.81

(a) Construct a grouped frequency distribution for this data.


(b) Construct a relative frequency distribution for this data.
(c) Construct a “greater than” cumulative frequency distribution for this data.
(d) Illustrate the dataset by drawing a frequency polygon.
(e) Is the distribution symmetric or skewed? If it is not symmetric, in which direction
is it skewed?
(f) Illustrate the cumulative frequency distribution in (c) with an ogive.
(g) Estimate the 43rd percentile of the dataset.
(h) Estimate the third quartile of the dataset.
