
2 Data collection and graphical summaries

2.1 Introduction

2.1.1 This chapter provides a review of some basic methods of presenting raw
data in a more digestible form, and some less well-known but very useful
techniques. None of these tools should be considered in isolation: they will
almost always be used in a progression or combination in the course of a real
application.
It is worth reviewing some of the essential principles of data collection and
recording. Often data is collected as a habit: perhaps someone started it long
ago for a purpose now obscured, and the routine has continued ever since. No
one dares to question its relevance or suitability in changing conditions. Some
important points include:
Objective: Is the data required for legal purposes (accounts, VAT, customer
documentation), control, solving a specific problem or management
information?
Type: Is the data obtained by counting or measurement? Samples or 100%
checks?
Frequency: Is it required once only, or for regular checks (once per hour,
every batch or shift, etc.)?
Recording format: Will this be automatic data capture, record sheets, tallies,
control charts, questionnaires? Document design should be considered in
terms of relevance, simplicity, ease of transfer of information to other media
(e.g. computer), possible multiple use of the same data for various purposes,
and facilitating calculation of summaries.
Communication and training: Do all concerned know the relevance and
importance of the data, and are they familiar with the correct procedure, methods
of calculation and ways of avoiding errors?
Integrity: Data is used for control, making decisions, diagnosing problems
and other important functions. It needs to be objective, honest, legible and
(especially where sampling is concerned) properly representative of the system
or process.

The collection of data does not, of itself, solve problems. It is sometimes
thought that SPC (statistical process control) in particular is concerned with
displaying control charts on
machines or workstations and carrying out various routine calculations (e.g.
capability indices). These are only the outward signs; nearer to the true meaning
of SPC is solving problems collectively. This places the emphasis on solving
problems rather than just collecting information about them, and on teamwork
rather than buck-passing.
Data collection must therefore be accompanied by such activities as flowcharting,
brainstorming, cause and effect analysis and Pareto analysis. Texts
covering 'statistical process control' and 'total quality management' deal with
these topics - they are not the subject-matter of the present book, but their
importance must not be under-rated.

2.2 Organizing raw data

2.2.1 In its raw form, data is rarely useful. Care is needed in presentation and
extracting useful summary statistics. Consider the following set of data:
238.9, 238.3, 240.4, 241.0, 239.0, 239.3, 239.6, 236.9, 239.3, 240.1, 238.8, 240.7,
241.0, 240.1, 239.3, 239.1, 239.5, 239.9, 238.9, 237.4, 238.2, 238.4, 239.2, 239.7,
239.4, 240.1, 240.3, 239.5, 239.0, 238.6, 239.8, 240.3, 239.7, 238.7, 240.7, 240.0,
240.1, 241.6, 239.8, 240.6, 239.7, 239.7, 240.4, 239.5, 238.6, 239.2, 237.6, 238.9,
239.2, 240.3, 239.4, 240.8.
In this form, the figures convey little except that they are all fairly close to 240.
What are they? And have we any background information?
The data make rather more sense when we learn that they are weights (in
grams) of valve liners moulded in a corrosion resistant plastic material. They
are weighed as a quality check (if they weigh too little, they may contain voids
or be undersize; if too much, they may be of poor shape due to improper mould
closure). There is a design specification of 240 ± 5 g.
While one can now examine whether the weights (about equal to a half-pound
of butter!) satisfy the specification, it is impossible to discern any pattern. Some
organization is needed.
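A minimal sketch of the specification check, written in Python. The language and the names `weights`, `TARGET` and `TOLERANCE` are illustrative assumptions, not part of the original text. It confirms only whether every value lies within 240 ± 5 g; it reveals nothing about pattern, which is why some organization is still needed.

```python
# Valve liner weights in grams, copied from the data set in 2.2.1.
weights = [
    238.9, 238.3, 240.4, 241.0, 239.0, 239.3, 239.6, 236.9, 239.3, 240.1, 238.8, 240.7,
    241.0, 240.1, 239.3, 239.1, 239.5, 239.9, 238.9, 237.4, 238.2, 238.4, 239.2, 239.7,
    239.4, 240.1, 240.3, 239.5, 239.0, 238.6, 239.8, 240.3, 239.7, 238.7, 240.7, 240.0,
    240.1, 241.6, 239.8, 240.6, 239.7, 239.7, 240.4, 239.5, 238.6, 239.2, 237.6, 238.9,
    239.2, 240.3, 239.4, 240.8,
]

TARGET = 240.0     # design target from the specification (g)
TOLERANCE = 5.0    # permitted deviation either side of the target (g)

lower_limit = TARGET - TOLERANCE
upper_limit = TARGET + TOLERANCE

# Collect any weights that fall outside the specification limits.
out_of_spec = [w for w in weights if not lower_limit <= w <= upper_limit]

print(f"n = {len(weights)}, min = {min(weights)}, max = {max(weights)}")
print(f"values outside {lower_limit}-{upper_limit} g: {len(out_of_spec)}")
```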

2.2.2 The stem-and-leaf table


The 'stem and leaf' table is a useful first step. It is based on identifying which
digits do not vary at all (the hundreds, in this case), which vary little (the tens)
and hence where the real variation begins (units). Taking a set of units which
covers the range of the data (say 235-242 to be on the generous side) as 'stems',
the remaining digits (the decimal part in this example) are allocated as leaves
on the appropriate stems. In other cases the stems might be hundreds or tens,
the leaves tenths or hundredths, etc.

Table 2.1 Stem-and-leaf table: valve liner data

242
241 0 0 6
240 4 1 7 1 1 3 3 7 0 1 6 4 3 8
239 0 3 6 3 3 1 5 9 2 7 4 5 0 8 7 8 7 7 5 2 2 4
238 9 3 8 9 2 4 6 7 6 9
237 4 6
236 9
235

Table 2.2 Final (ordered) stem-and-leaf table

241 0 0 6
240 0 1 1 1 1 3 3 3 4 4 6 7 7 8
239 0 0 1 2 2 2 3 3 3 4 4 5 5 5 6 7 7 7 7 8 8 9
238 2 3 4 6 6 7 8 9 9 9
237 4 6
236 9

In the example, the first value is 238.9, so the leaf .9 is allocated to the stem
238, then .3 also to stem 238, .4 to 240, etc., as in Table 2.1. Immediately a
pattern emerges - the values cluster around a centre (stem 239), tapering off in
each direction. Even at this stage, it is apparent that although all the values lie
within the specification, there are more values in the lower half (239.9 and
below) than in the upper half (240.0 and above). Of course, this may be a
deliberate saving on materials!
If required, the leaves within each stem can now be rearranged in ascending
order to give a complete ranking order which is useful for identifying quantiles
such as the median and first and third quartiles, as described in section 2.3.3.
This yields (deleting the unused stems 242 and 235) the presentation in Table 2.2.
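The construction described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than a prescribed method: the function names `stem_and_leaf` and `show` are hypothetical, stems are taken as whole grams and leaves as tenths, and stems with no leaves are simply omitted from the printout.

```python
from collections import defaultdict

# Valve liner weights in grams, copied from the data set in 2.2.1.
weights = [
    238.9, 238.3, 240.4, 241.0, 239.0, 239.3, 239.6, 236.9, 239.3, 240.1, 238.8, 240.7,
    241.0, 240.1, 239.3, 239.1, 239.5, 239.9, 238.9, 237.4, 238.2, 238.4, 239.2, 239.7,
    239.4, 240.1, 240.3, 239.5, 239.0, 238.6, 239.8, 240.3, 239.7, 238.7, 240.7, 240.0,
    240.1, 241.6, 239.8, 240.6, 239.7, 239.7, 240.4, 239.5, 238.6, 239.2, 237.6, 238.9,
    239.2, 240.3, 239.4, 240.8,
]


def stem_and_leaf(values, ordered=False):
    """Return {stem: [leaves]} with whole grams as stems and tenths as leaves."""
    table = defaultdict(list)
    for v in values:
        tenths = round(v * 10)            # integer tenths avoid floating-point error
        stem, leaf = divmod(tenths, 10)   # e.g. 2389 -> stem 238, leaf 9
        table[stem].append(leaf)          # leaves kept in order of arrival
    if ordered:
        for leaves in table.values():
            leaves.sort()                 # ascending leaves give a full ranking
    return dict(table)


def show(table):
    """Print the table with the largest stem at the top, as in Tables 2.1 and 2.2."""
    for stem in sorted(table, reverse=True):
        print(stem, " ".join(str(leaf) for leaf in table[stem]))


show(stem_and_leaf(weights))                 # unordered, as first compiled
print()
show(stem_and_leaf(weights, ordered=True))   # leaves sorted within each stem
```

Working in integer tenths is a design choice of this sketch: it keeps a value such as 238.9 from being split incorrectly because of binary rounding.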

2.2.3 The frequency table


Useful though it is in initially sorting data, the stem-and-leaf table does not
necessarily give a satisfactory final presentation. The frequency table, in which
data are collected into a reasonable number of classes, is a useful device, and
leads naturally to the diagrammatic form of the histogram.
For the most effective presentation, a table with about ten or twelve classes
of equal width is usually preferred. The stem-and-leaf table suggests that a set
of classes based on half-gram intervals should achieve this. Care is necessary
to avoid ambiguity at the class boundaries, e.g. groups comprising 236.5-237,
237-237.5, 237.5-238, etc., could lead to inconsistency, omission or even double
counting. Possible unambiguous class limits include:
236.5-236.9, 237.0-237.4 (defining the values to be included in any class)
or
236.45-236.95, 236.95-237.45 (the extra decimal avoids ambiguity)
Another possible approach is to make the class mid-values as simple as
possible. This is useful where the mid-values may be used to represent the whole
class for subsequent calculations. Often, by calculating class limits half-way
between the mid-values, ambiguity looks after itself, as in Table 2.3.
When the classes have been defined, a tally chart is compiled, using the
'five-bar gate' method. For the present example, half-gram mid-values are used
in Table 2.3. With this number of classes, the pattern remains clear, but a little
extra detail emerges. In this case, no major anomalies exist, but in other cases
lopsidedness (skewness) or a tendency to multiple peaks (bimodality or
multimodality) may be observed, leading to diagnosis of possible problems.

Table 2.3 Frequency table: valve liner data

Class limits     Mid-value (x)   Tallies            Frequency (f)
236.75-237.25    237.0           I                   1
237.25-237.75    237.5           II                  2
237.75-238.25    238.0           I                   1
238.25-238.75    238.5           IIII/               5
238.75-239.25    239.0           IIII/ IIII/        10
239.25-239.75    239.5           IIII/ IIII/ III    13
239.75-240.25    240.0           IIII/ III           8
240.25-240.75    240.5           IIII/ III           8
240.75-241.25    241.0           III                 3
241.25-241.75    241.5           I                   1
Total                                               52
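The frequency table for the valve liner data can be reproduced with a short Python sketch. The names `mid_value` and `tally` are hypothetical helpers, and rendering a five-bar gate as `IIII/` is a plain-text convention assumed here. Each weight is assigned to the nearest half-gram mid-value, which is equivalent to using class limits half-way between the mid-values.

```python
from collections import Counter

# Valve liner weights in grams, copied from the data set in 2.2.1.
weights = [
    238.9, 238.3, 240.4, 241.0, 239.0, 239.3, 239.6, 236.9, 239.3, 240.1, 238.8, 240.7,
    241.0, 240.1, 239.3, 239.1, 239.5, 239.9, 238.9, 237.4, 238.2, 238.4, 239.2, 239.7,
    239.4, 240.1, 240.3, 239.5, 239.0, 238.6, 239.8, 240.3, 239.7, 238.7, 240.7, 240.0,
    240.1, 241.6, 239.8, 240.6, 239.7, 239.7, 240.4, 239.5, 238.6, 239.2, 237.6, 238.9,
    239.2, 240.3, 239.4, 240.8,
]

CLASS_WIDTH = 0.5   # grams; mid-values fall at 237.0, 237.5, 238.0, ...


def mid_value(w, width=CLASS_WIDTH):
    """Nearest class mid-value. Data recorded to 0.1 g never land on a limit."""
    return round(w / width) * width


def tally(n):
    """Render n as five-bar gates ('IIII/' per five) followed by loose strokes."""
    gates, rest = divmod(n, 5)
    return " ".join(["IIII/"] * gates + (["I" * rest] if rest else []))


freq = Counter(mid_value(w) for w in weights)

print(f"{'Class limits':<15}{'Mid-value':>10}  {'Tallies':<16}{'Frequency':>9}")
for mid in sorted(freq):   # only classes that contain at least one value
    lo, hi = mid - CLASS_WIDTH / 2, mid + CLASS_WIDTH / 2
    print(f"{lo:.2f}-{hi:.2f}  {mid:>10.1f}  {tally(freq[mid]):<16}{freq[mid]:>9}")
print(f"{'Total':<43}{sum(freq.values()):>9}")
```

Because the weights are recorded to one decimal place and the class limits carry two, no value can coincide with a limit, so the rounding step never has to break a tie.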

2.3 Graphical presentation

2.3.1 Histogram
While the tally chart is often adequate for displaying the pattern, the presentation
may be improved by drawing a scaled bar chart or histogram. It is here that
the advantage of using equal class widths becomes apparent, as the histogram
columns have heights proportional to frequencies. If class widths vary, it is
necessary to calculate areas to give the correct representation. Figure 2.1 shows
a histogram for the valve liner data.

Fig. 2.1 Histogram of valve liner data (class boundaries from 236.75 to 241.75 g in 0.5 g steps; frequencies 1, 2, 1, 5, 10, 13, 8, 8, 3, 1).

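The point about areas can be illustrated with a small Python sketch. The function name `column_heights` is hypothetical, and the pooling of the three lowest classes into one wider class is an invented regrouping used purely for illustration; it does not appear in the text. Heights are computed as frequency per gram of class width, so that each column's area equals its frequency.

```python
# Class boundaries (g) and frequencies taken from Table 2.3.
boundaries = [236.75 + 0.5 * i for i in range(11)]
frequencies = [1, 2, 1, 5, 10, 13, 8, 8, 3, 1]


def column_heights(boundaries, frequencies):
    """Frequency per gram of class width, so that column area equals frequency."""
    widths = [hi - lo for lo, hi in zip(boundaries, boundaries[1:])]
    return [f / w for f, w in zip(frequencies, widths)]


# Equal widths: a crude text histogram with one '#' per observation.
for lo, hi, f in zip(boundaries, boundaries[1:], frequencies):
    print(f"{lo:.2f}-{hi:.2f} {'#' * f} {f}")

# Hypothetical regrouping (not in the text): pool the three sparse lowest
# classes into a single class 236.75-238.25 containing 1 + 2 + 1 = 4 values.
merged_boundaries = [236.75] + boundaries[3:]
merged_frequencies = [sum(frequencies[:3])] + frequencies[3:]
heights = column_heights(merged_boundaries, merged_frequencies)

print()
for lo, hi, h in zip(merged_boundaries, merged_boundaries[1:], heights):
    print(f"{lo:.2f}-{hi:.2f} height = {h:.2f} per gram")
```

With equal half-gram classes every height is simply twice the frequency, so the shape is unchanged. In the regrouped version the pooled class is drawn at 4/1.5, about 2.67 per gram, rather than at a height of 4, which would overstate the lower tail.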
At this point it is appropriate to consider the relationship between a sample
and the system, population or process from which it is drawn and which it
purports to represent. All samples will contain minor irregularities, and the fact
that two values in the above sample occurred in the 237.5 class, but only one
each in the 237 and 238 classes, would not be taken to imply that the batch
from which the sample was drawn contains, say, 200 valve liners in the 237.25-
237.75 weight range, but only 100 in each of the neighbouring classes. To obtain
an impression of the whole distribution, only the broad features are required.
In this case, we have a fairly central hump and a pair of roughly equal tails.

Fig. 2.2 (a) Frequency polygon; (b) frequency curve (horizontal axes marked from 237 to 242).
