Lecture 3 EDA 2022

KWAME NKRUMAH UNIVERSITY OF SCIENCE AND TECHNOLOGY
CHEMICAL ENGINEERING DEPARTMENT

CHE 357: EXPERIMENTAL DATA ANALYSIS
INSTRUCTOR: Dr. (Mrs.) Mizpah A. D. Rockson
LECTURE 3: Graphical Displays of Data
Learning Objectives
At the end of the lecture the student is expected to understand the following:
• Graphical Methods for Displaying Qualitative Data: Frequency Tables, Bar Charts and Pie Charts
• Graphical Methods for Displaying Quantitative Data: Frequency Tables and Histograms
• Other graphical methods: Dotplots, Time Series Graphs, Pareto Graphs, Stem-and-Leaf Diagrams, Digidot Plots
3.1 Graphical Description and Frequency Distribution
When conducting a statistical study, the researcher must gather data for the particular variable under study. To
describe situations, draw conclusions, or make inferences about events, the researcher must organize the data in
some meaningful way.
The most convenient method of organizing data is to construct a frequency distribution. After organizing the
data, the researcher must present them so they can be understood by those who will benefit from reading the study.
The most useful method of presenting the data is by constructing statistical charts and graphs.
This section explains how to organize data by constructing frequency distributions and how to present the data
by constructing charts and graphs.
3.1.1 Organizing Data
Consider a wastewater expert who collects 50 samples of coloured wastewater from a company and analyzes
their pH (rounded to the nearest whole number), obtaining:
1 2 6 7 12 13 2 6 9 5
14 3 3 11 4 13 1 10 5 2
4 12 4 5 8 6 5 14 5 2
1 4 3 1 9 2 10 11 4 10
3 14 8 2 4 14 1 3 2 6
Data in this form is known as raw data. Very little information can be obtained from the raw data unless it is
organized.
Raw data can be organized by constructing a frequency distribution. A frequency distribution is the organization of
raw data in table form, using classes and frequencies.
Now, a more general observation can be obtained from the data. For example, more than 60% of the wastewater
samples are acidic (pH below 7).
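The 60% figure can be checked by counting the readings directly. The following is a minimal sketch; the three category labels and the cut-off at pH 7 are assumptions based on the usual convention, not something stated in the notes.

```python
from collections import Counter

# pH readings of the 50 wastewater samples listed above
ph = [
    1, 2, 6, 7, 12, 13, 2, 6, 9, 5,
    14, 3, 3, 11, 4, 13, 1, 10, 5, 2,
    4, 12, 4, 5, 8, 6, 5, 14, 5, 2,
    1, 4, 3, 1, 9, 2, 10, 11, 4, 10,
    3, 14, 8, 2, 4, 14, 1, 3, 2, 6,
]


def classify(value):
    """Label a pH reading (assumed convention: below 7 acidic, 7 neutral, above 7 basic)."""
    if value < 7:
        return "acidic"
    if value == 7:
        return "neutral"
    return "basic"


counts = Counter(classify(v) for v in ph)
acidic_share = counts["acidic"] / len(ph)

for label in ("acidic", "neutral", "basic"):
    print(f"{label:8s} {counts[label]:3d}")
print(f"acidic share = {acidic_share:.0%}")
```

With these data, 32 of the 50 readings fall below 7, which gives 64%.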
Two types of frequency distributions are used. These are categorical and grouped distributions.
Categorical frequency distribution

This is used for data that can be placed in specific categories, such as field of study, religious affiliation,
or preferred sports discipline.
Example 3.1
Twenty-five (25) army inductees were given a blood test to determine their blood type. The data set is as
follows: A, B, B, AB, O, O, O, B, AB, B, B, B, O, A, O, A, O, O, O, AB, AB, A, O, B, and A.
Construct a frequency distribution for the data.
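One way to build the categorical frequency distribution for Example 3.1 is to count each class and express it as a percentage of the total. This is only a sketch; the column layout and the order in which the classes are printed are choices made here for illustration.

```python
from collections import Counter

# Blood types of the 25 army inductees, in the order given in Example 3.1
blood_types = [
    "A", "B", "B", "AB", "O", "O", "O", "B", "AB", "B",
    "B", "B", "O", "A", "O", "A", "O", "O", "O", "AB",
    "AB", "A", "O", "B", "A",
]

freq = Counter(blood_types)   # class -> frequency
n = len(blood_types)          # total number of observations

print(f"{'Class':<6}{'Frequency':>10}{'Percent':>9}")
for blood_type in ("A", "B", "O", "AB"):
    f = freq[blood_type]
    print(f"{blood_type:<6}{f:>10}{100 * f / n:>8.0f}%")
print(f"{'Total':<6}{n:>10}{100:>8.0f}%")
```

The frequencies come out as A = 5, B = 7, O = 9 and AB = 4, which sum to 25.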
Grouped frequency distribution
When the range of a quantitative variable is large, the data can be grouped into classes. Consider the distribution
below of the number of hours that boat batteries last.
In the 31 – 37 class limit, 31 is the lower class limit and 37 is the upper class limit. In the second column are
class boundaries. They are used to separate the classes so that there are no gaps in the frequency distribution.
The rule of thumb for deciding class limits and class boundaries is that the class limits should have the same
decimal place value as the data, while the class boundaries have one additional place value and end in a 5.
Class width: the lower (or upper) class limit of one class minus the lower (or upper) class limit of the previous
class. For example, the class width in the distribution = 31 – 24 = 30.5 – 23.5 = 7.
The following guidelines can be considered when constructing a grouped frequency distribution:
1. There should be between 5 and 20 classes. (Normally the number of classes = √n)
2. The class width should be an odd number.
3. The classes must be mutually exclusive.
4. The classes must be continuous.
5. The classes must be exhaustive.
6. The classes must be of equal width.
Example 3.2
The following data represent the record high temperatures for each of the 50 states in the US.
Construct a grouped frequency distribution for the data using 7 classes.
112 100 127 120 134 118 105 110 109 112
110 118 117 116 118 122 114 114 105 109
107 112 114 115 118 117 118 122 106 110
116 108 110 121 113 120 119 111 104 111
120 113 120 117 105 110 118 112 114 114
Solution
The following steps can be used in constructing the grouped frequency distribution:
- The range, R = highest value – lowest value = 134 – 100 = 34

- Select the number of classes desired (usually between 5 and 20). It can be estimated as √n = √50 = 7.07 ≈ 7.
- Find the class width: class width = R/(number of classes) = 34/7 ≈ 4.9. Round it up to 5.
- Select a starting point for the lowest class limit. This can be the smallest data value or any convenient number
less than the smallest data value. Here the smallest data value, 100, is used.
- Add the width to the starting point to get the lower limit of the next class. Keep adding until there are
7 classes, i.e. 100, 105, 110, etc.
- Subtract one unit from the lower limit of the second class to get the upper limit of the first class. Then add the
width to each upper limit to get all the upper limits. The first class is 100 – 104. The second class is 105 – 109,
etc.
- Subtract 0.5 from each lower class limit and add 0.5 to each upper class limit to obtain the class boundaries.
These should be 99.5 – 104.5, 104.5 – 109.5, etc.
- Tally the data and find the numerical frequencies from the tallies.
- Add the frequencies cumulatively from one class to the next to form the cumulative frequencies.
The results are summarized in the table below.
Class limits   Class boundaries   Frequency   Cumulative frequency
100 – 104      99.5 – 104.5       2           2
105 – 109      104.5 – 109.5      8           10
110 – 114      109.5 – 114.5      18          28
115 – 119      114.5 – 119.5      13          41
120 – 124      119.5 – 124.5      7           48
125 – 129      124.5 – 129.5      1           49
130 – 134      129.5 – 134.5      1           50
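The steps of Example 3.2 can also be carried out programmatically. The sketch below follows the same procedure (number of classes from √n, width rounded up, boundaries at ±0.5). It assumes whole-number data, so the upper limit of a class is its lower limit plus the width minus one; the variable names are illustrative.

```python
import math

# Record high temperatures from Example 3.2 (50 values)
temps = [
    112, 100, 127, 120, 134, 118, 105, 110, 109, 112,
    110, 118, 117, 116, 118, 122, 114, 114, 105, 109,
    107, 112, 114, 115, 118, 117, 118, 122, 106, 110,
    116, 108, 110, 121, 113, 120, 119, 111, 104, 111,
    120, 113, 120, 117, 105, 110, 118, 112, 114, 114,
]

n = len(temps)
num_classes = round(math.sqrt(n))             # sqrt(50) = 7.07 -> 7
data_range = max(temps) - min(temps)          # 134 - 100 = 34
width = math.ceil(data_range / num_classes)   # 34 / 7 = 4.86, rounded up to 5
start = min(temps)                            # lowest class limit

rows = []
cumulative = 0
for k in range(num_classes):
    lower = start + k * width
    upper = lower + width - 1                 # whole-number data assumed
    frequency = sum(lower <= t <= upper for t in temps)
    cumulative += frequency
    rows.append((lower, upper, lower - 0.5, upper + 0.5, frequency, cumulative))

for lower, upper, lb, ub, f, cf in rows:
    print(f"{lower} - {upper}   {lb} - {ub}   {f:2d}   {cf:2d}")
```

The computed frequencies and cumulative frequencies agree with the table above.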
Ungrouped frequency distribution
When the range of numerical data is small and each class is only one unit, an ungrouped frequency distribution can
be used to represent the data. For example, the frequency distribution below represents the number of miles per
gallon (mpg) that 30 selected four-wheel-drive sports utility vehicles obtained in city driving.
The reasons for constructing a frequency distribution are as follows:

1. To organize the data in a meaningful, intelligible way.
2. To enable the reader to determine the nature or shape of the distribution.
3. To facilitate computational procedures for measures of average and spread.
4. To enable the researcher to draw charts and graphs for the presentation of data.
5. To enable the reader to make comparisons among different data sets.
3.2 Histogram, frequency polygon and ogive
After you have organized the data into a frequency distribution, you can present them in graphical form. The
purpose of graphs in statistics is to convey the data to the viewers in pictorial form. It is easier for most people
to comprehend the meaning of data presented graphically than data presented numerically in tables or frequency
distributions. This is especially true if the users have little or no statistical knowledge. Statistical graphs can be
used to describe the data set or to analyze it. Graphs are also useful in getting the audience’s attention in a
publication or a speaking presentation. They can be used to discuss an issue, reinforce a critical point, or
summarize a data set. They can also be used to discover a trend or pattern in a situation over a period of time.
The three most commonly used graphs in research are the histogram, the frequency polygon, and the cumulative
frequency graph (also known as the ogive).
3.2.1 Histogram
The histogram is a graph that displays the data by using contiguous vertical bars (unless the frequency of a class
is 0) of various heights to represent the frequencies of the classes.
Example 3.3
Construct a histogram to represent the data in example 3.2.
Solution
The class boundaries and frequencies are shown below.
A plot of frequency against class boundaries, known as a histogram, is shown in figure 3.1 below.
Figure 3.1: Histogram plot
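As a rough stand-in for the plotted figure, the class boundaries and frequencies from Example 3.2 can be rendered as a text histogram. This sketch draws the bars horizontally rather than vertically; the function name `text_histogram` is a hypothetical helper, not part of any library.

```python
# Class boundaries and frequencies taken from the Example 3.2 table
boundaries = [99.5, 104.5, 109.5, 114.5, 119.5, 124.5, 129.5, 134.5]
frequencies = [2, 8, 18, 13, 7, 1, 1]


def text_histogram(boundaries, frequencies, symbol="#"):
    """Return one line per class: boundaries, a bar of `symbol`, and the count."""
    lines = []
    for lower, upper, f in zip(boundaries, boundaries[1:], frequencies):
        lines.append(f"{lower:5.1f}-{upper:5.1f} | {symbol * f} {f}")
    return lines


for line in text_histogram(boundaries, frequencies):
    print(line)
```

Because consecutive classes share a boundary, the bars are contiguous, which is what distinguishes a histogram from a bar graph.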
3.2.2 Frequency polygon
The frequency polygon is a graph that displays the data by using lines that connect points plotted for the
frequencies at the midpoints of the classes. The frequencies are represented by the heights of the points.
Example 3.4
Using the frequency distribution given in example 3.2, construct a frequency polygon.
Solution
The only missing link here is the class midpoint. This is calculated as the average of the class boundaries. The
complete table is shown below.
A plot of frequency against class midpoint, known as a frequency polygon, is shown in figure 3.2.
Figure 3.2: Frequency polygon plot
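The midpoint calculation described in the solution can be sketched as follows, using the class boundaries and frequencies of Example 3.2. The two zero-frequency end points, which close the polygon on the horizontal axis, follow a common plotting convention and are an assumption here rather than something stated in the notes.

```python
# Class boundaries and frequencies from Example 3.2
boundaries = [99.5, 104.5, 109.5, 114.5, 119.5, 124.5, 129.5, 134.5]
frequencies = [2, 8, 18, 13, 7, 1, 1]

# Midpoint of each class = average of its lower and upper boundaries
midpoints = [(lo + hi) / 2 for lo, hi in zip(boundaries, boundaries[1:])]

# Points to join with straight line segments
points = list(zip(midpoints, frequencies))

# Assumed convention: anchor the polygon to the axis one class width
# before the first midpoint and one class width after the last midpoint
width = boundaries[1] - boundaries[0]
closed_points = [(midpoints[0] - width, 0)] + points + [(midpoints[-1] + width, 0)]

for x, f in closed_points:
    print(f"{x:6.1f}  {f:2d}")
```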
3.2.3 Ogive
The ogive is a graph that represents the cumulative frequencies for the classes in a frequency distribution. It is
also known as the cumulative frequency graph. The cumulative frequency is the sum of the frequencies
accumulated up to the upper boundary of a class in the distribution.
Example 3.5
Construct an ogive for the frequency distribution in example 3.2.
Solution
The cumulative frequency for each class is shown in the table below.
A plot of cumulative frequency against upper class boundary, known as an ogive, is shown in figure 3.3.
Figure 3.3: Ogive plot
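The points of the ogive can be computed by pairing each upper class boundary with the running total of the frequencies. This is a sketch for the Example 3.2 distribution; starting the curve at zero on the first lower boundary is a common convention assumed here.

```python
from itertools import accumulate

# Class boundaries and frequencies from Example 3.2
boundaries = [99.5, 104.5, 109.5, 114.5, 119.5, 124.5, 129.5, 134.5]
frequencies = [2, 8, 18, 13, 7, 1, 1]

# Running total of the frequencies
cumulative = list(accumulate(frequencies))

# Assumed convention: the ogive starts at 0 on the first lower boundary,
# then rises to each cumulative frequency at the matching upper boundary
ogive_points = [(boundaries[0], 0)] + list(zip(boundaries[1:], cumulative))

for x, cf in ogive_points:
    print(f"{x:6.1f}  {cf:2d}")
```

The final point always equals the total number of observations, here 50.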
3.2.4 Relative frequency graphs
The histogram, the frequency polygon, and the ogive shown previously were constructed by using the raw
frequencies. These distributions can be converted to distributions that use proportions (relative frequencies)
instead of raw frequencies. Graphs of this type are called relative frequency graphs.
Example 3.6
Construct a histogram, frequency polygon, and ogive using relative frequencies for the distribution (shown
here) of the miles that 20 randomly selected runners ran during a given week.
Solution
The frequencies of each class are converted to relative frequencies by dividing each frequency by the total
frequency. The procedure for drawing relative frequency graphs follows the same order as normal frequency
graphs. The relative frequencies are shown here.
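The conversion itself is a single division per class. Since the runners' table is not reproduced in these notes, the sketch below applies the same conversion to the Example 3.2 frequencies instead; that substitution is made purely for illustration.

```python
from itertools import accumulate

# Frequencies from Example 3.2 (used in place of the runners' data)
frequencies = [2, 8, 18, 13, 7, 1, 1]
total = sum(frequencies)

# Relative frequency = class frequency / total frequency
relative = [f / total for f in frequencies]

# Cumulative relative frequency, as needed for a relative frequency ogive
cumulative_relative = [c / total for c in accumulate(frequencies)]

for f, rf, crf in zip(frequencies, relative, cumulative_relative):
    print(f"{f:2d}  {rf:.2f}  {crf:.2f}")
```

The relative frequencies sum to 1, and the last cumulative relative frequency is always 1.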
The relative frequency graphs are shown in figure 3.4.
3.3 Other types of graphs
Other graphs such as bar graphs, Pareto charts, time series graphs, pie graphs, dotplots, and stem and leaf plots
are sometimes used in statistics to give information.
3.3.1 Bar graph
When the data are qualitative or categorical, bar graphs can be used to represent the data. A bar graph can be
drawn using either horizontal or vertical bars. Apart from the vertical and horizontal bar graphs, there are two
more types of bar graphs, which are given below:
• Grouped Bar Graph
The grouped bar graph is also referred to as the clustered bar graph. It is used to show discrete values for
two or more categorical variables. In this graph, rectangular bars are grouped by position for the levels of one
categorical variable, with the same colors showing the secondary category level within each group. It can be shown
both vertically and horizontally.
• Stacked Bar Graph
The stacked bar graph is also referred to as the composite bar graph. It divides the whole bar into different
parts. In this, each part of a bar is represented using different colors to easily identify the different
categories. It requires specific labeling to indicate the different parts of the bar. Thus, in a stacked bar
graph every rectangular bar represents the whole, and each segment in the rectangular bar shows the
different parts of the whole. It can be shown vertically or horizontally.
Uses of Bar Graph
Bar graphs are widely used in mathematics and statistics. Some of the uses of the bar graph are as follows:
• Comparisons between different variables are easy and convenient.
• It is easy to prepare and requires little effort.
• It is one of the most widely used methods of data representation and is therefore used in various industries.
• It is used to compare data sets that are independent of one another.
• It helps in studying patterns over long periods of time.
Figure 3.4: Relative frequency plot
3.3.2 Pareto graphs
When the variable displayed on the horizontal axis is qualitative or categorical, a Pareto chart can also be used
to represent the data (see figure 3.5b).
A Pareto chart is used to represent a frequency distribution for a categorical variable, and the frequencies are
displayed by the heights of vertical bars, which are arranged in order from highest to lowest. In this way, the
chart visually depicts which situations are more significant.
When to use a Pareto chart:
• When analyzing data about the frequency of problems or causes in a process
• When there are many problems or causes and you want to focus on the most significant
• When analyzing broad causes by looking at their specific components
• When communicating with others about your data
Figure 3.5: Bar and Pareto charts
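The ordering that defines a Pareto chart can be sketched by sorting the categories from highest to lowest frequency. The example reuses the blood type counts from Example 3.1 for illustration. The cumulative percentage column is a common addition to Pareto charts and is an assumption here, not something described in the notes.

```python
from itertools import accumulate

# Category frequencies (blood type counts from Example 3.1)
counts = {"A": 5, "B": 7, "O": 9, "AB": 4}

# Pareto ordering: highest frequency first
ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)

# Assumed extra: cumulative percentage of the total after each bar
total = sum(counts.values())
cumulative_pct = [100 * c / total for c in accumulate(f for _, f in ordered)]

for (category, f), pct in zip(ordered, cumulative_pct):
    print(f"{category:<3} {'#' * f:<10} {f:2d}  {pct:5.1f}%")
```

Reading down the output shows at a glance which categories account for most of the observations.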

3.3.3 Time series charts
When data are collected over a period of time, they can be represented by a time series graph. A time series graph
represents data that occur over a specific period of time (figure 3.5c). When measurements are plotted as a time
series, we often see trends, cycles, or other broad features of the data that could not be seen otherwise.
3.3.4 Pie charts
The purpose of the pie graph is to show the relationship of the parts to the whole by visually comparing the
sizes of the sections. Percentages or proportions can be used. The variable is nominal or categorical.
A pie graph is a circle that is divided into sections or wedges according to the percentage of frequencies in each
category of the distribution (figure 3.5d).
3.3.4 Dotplots
A dotplot is a graph used for small data sets, in which each observation is plotted as a point on a single horizontal
axis. The dotplot’s axis is scaled so that each data point can be located uniquely on the axis. When more than one
observation has the same value, the points are “stacked” on top of each other. A dotplot allows us to quickly
and easily see the location or central tendency in the data and the spread or variability. When the number of
observations is small, it is often difficult to identify any specific pattern of variation; however, the dotplot
will frequently be helpful and may provide information about unusual features in the data. A dotplot gives
information about location, spread, extremes, and gaps in a small data set.
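The stacking rule can be illustrated with a small text dotplot. This is a sketch for integer data only; the sample is the first row of the pH readings from section 3.1.1, and `text_dotplot` is a hypothetical helper name.

```python
from collections import Counter


def text_dotplot(values):
    """Return the lines of a text dotplot for small integer data sets."""
    counts = Counter(values)
    axis = range(min(values), max(values) + 1)
    lines = []
    # Build the rows from the top of the tallest stack down to the axis
    for level in range(max(counts.values()), 0, -1):
        row = " ".join(" o" if counts[x] >= level else "  " for x in axis)
        lines.append(row.rstrip())
    # Axis labels, one two-character column per integer value
    lines.append(" ".join(f"{x:2d}" for x in axis))
    return lines


sample = [1, 2, 6, 7, 12, 13, 2, 6, 9, 5]   # first row of the pH data
for line in text_dotplot(sample):
    print(line)
```

Repeated values (here 2 and 6) appear as stacks of two dots, and the empty columns make the gaps in the data visible.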
3.3.5 Stem and Leaf Diagrams
The dot diagram is a useful data display for small samples, up to (say) about 20 observations. However, when the
number of observations is moderately large, other graphical displays may be more useful.
A stem-and-leaf diagram is a good way to obtain an informative visual display of a data set x1, x2, …, xn where
each number xi consists of at least two digits. To construct a stem-and-leaf diagram, use the following steps:
(1) Divide each number xi into two parts: a stem, consisting of one or more of the leading digits, and a leaf,
consisting of the remaining digit.
(2) List the stem values in a vertical column.
(3) Record the leaf for each observation beside its stem.
(4) Write the units for stems and leaves on the display.
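The four steps above can be sketched in code. The example uses the temperature data from Example 3.2 purely for illustration, and it assumes non-negative whole numbers with the ones digit as the leaf. The function name `stem_and_leaf` is hypothetical.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return {stem: [leaves]} with the ones digit as leaf (non-negative integers)."""
    groups = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)       # step 1: split into stem and leaf
        groups[stem].append(leaf)
    # steps 2 and 3: list every stem in order, keeping empty stems so gaps stay visible
    return {s: groups.get(s, []) for s in range(min(groups), max(groups) + 1)}


temps = [
    112, 100, 127, 120, 134, 118, 105, 110, 109, 112,
    110, 118, 117, 116, 118, 122, 114, 114, 105, 109,
    107, 112, 114, 115, 118, 117, 118, 122, 106, 110,
    116, 108, 110, 121, 113, 120, 119, 111, 104, 111,
    120, 113, 120, 117, 105, 110, 118, 112, 114, 114,
]

display = stem_and_leaf(temps)
for stem, leaves in display.items():
    print(f"{stem:3d} | {''.join(map(str, leaves))}  ({len(leaves)})")
print("Stem: leading digits. Leaf: ones digit.")   # step 4: state the units
```

The number in parentheses is the frequency count of leaves on each stem.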
For example, consider the data in the table below. These data are the compressive strengths in pounds per square
inch (psi) of 80 specimens of a new aluminum-lithium alloy undergoing evaluation as a possible material for
aircraft structural elements. The data were recorded in the order of testing, and in this format they do not convey
much information about compressive strength. Questions such as “What percent of the specimens fail below 120
psi?” are not easy to answer.
Compressive Strength (in psi) of 80 Aluminum-Lithium Alloy Specimens
To illustrate the construction of a stem-and-leaf diagram, consider the alloy compressive strength data in the table
above. We will select as stem values the numbers 7, 8, 9, …, 24. The resulting stem-and-leaf diagram is presented
in Figure 3.6 below. The last column in the diagram is a frequency count of the number of leaves associated with
each stem. Inspection of this display immediately reveals that most of the compressive strengths lie between 110
and 200 psi and that a central value is somewhere between 150 and 160 psi. Furthermore, the strengths are
distributed approximately symmetrically about the central value. The stem-and-leaf diagram enables us to
determine quickly some important features of the data that were not immediately obvious in the original table.
In some data sets, it may be desirable to provide more classes or stems. One way to do this would be to modify
the original stems as follows: Divide the stem 5 (say) into two new stems, 5L and 5U. The stem 5L has leaves 0,
1, 2, 3, and 4, and stem 5U has leaves 5, 6, 7, 8, and 9. This will double the number of original stems. We could
increase the number of stems fivefold by defining five new stems for each original stem: 5z with leaves 0 and 1,
5t (for two and three) with leaves 2 and 3, 5f (for four and five) with leaves 4 and 5, 5s (for six and seven) with
leaves 6 and 7, and 5e with leaves 8 and 9.
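The L/U split described above can be sketched as a small change to the stem label. The example again uses the Example 3.2 temperatures for illustration and assumes non-negative whole numbers. Stems with no leaves are simply omitted in this simplified version.

```python
def split_stem_and_leaf(values):
    """Return {label: [leaves]} where each stem is split into L (0-4) and U (5-9)."""
    rows = {}
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        label = f"{stem}{'L' if leaf <= 4 else 'U'}"
        rows.setdefault(label, []).append(leaf)
    return rows


temps = [
    112, 100, 127, 120, 134, 118, 105, 110, 109, 112,
    110, 118, 117, 116, 118, 122, 114, 114, 105, 109,
    107, 112, 114, 115, 118, 117, 118, 122, 106, 110,
    116, 108, 110, 121, 113, 120, 119, 111, 104, 111,
    120, 113, 120, 117, 105, 110, 118, 112, 114, 114,
]

split_display = split_stem_and_leaf(temps)
for label, leaves in split_display.items():
    print(f"{label:>4} | {''.join(map(str, leaves))}  ({len(leaves)})")
```

For these data the seven split stems coincide with the seven classes of width 5 in Example 3.2, so the leaf counts match the class frequencies 2, 8, 18, 13, 7, 1 and 1.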
Figure 3.6: Stem-and-leaf diagram for the compressive strength data
Figure 3.7 illustrates the stem-and-leaf diagram for 25 observations on batch yields from a chemical process. In
Fig. 3.7(a) we have used 6, 7, 8, and 9 as the stems. This results in too few stems, and the stem-and-leaf diagram
does not provide much information about the data. In Fig. 3.7(b) we have divided each stem into two parts,
resulting in a display that represents the data more adequately. Figure 3.7(c) illustrates a stem-and-leaf display with
each stem divided into five parts. There are too many stems in this plot, resulting in a display that does not tell us
much about the shape of the data.
A stem-and-leaf display conveys information about the following aspects of the data:
• Identification of a typical or representative value
• Extent of spread about the typical value
• Presence of any gaps in the data
• Extent of symmetry in the distribution of values
• Number and location of peaks
• Presence of any outlying values
Figure 3.7: Stem-and-leaf displays for observations on batch yields from a chemical process. Stem: tens digits.
Leaf: ones digits.
3.3.6 Digidot Plots

Sometimes it can be very helpful to combine a time series plot with some of the other graphical displays that we
have considered previously. J. Stuart Hunter (an American statistician) has suggested combining the stem-and-leaf
plot with a time series plot to form a digidot plot.
Figure 3.8: A digidot plot of the compressive strength data

Figure 3.8 shows a digidot plot for the observations on compressive strength from Figure 3.6, assuming that these
observations are recorded in the order in which they occurred. This plot effectively displays the overall variability
in the compressive strength data and simultaneously shows the variability in these measurements over time. The
general impression is that compressive strength varies around the mean value of 162.67, and there is no strong
obvious pattern in this variability over time.
Figure 3.9: A digidot plot of chemical process concentration readings, observed hourly.
The digidot plot in Figure 3.9 tells a different story. This plot summarizes 30 observations on concentration of
the output product from a chemical process, where the observations are recorded at one-hour time intervals. This
plot indicates that during the first 20 hours of operation this process produced concentrations generally above 85
grams per liter, but that following sample 20, something may have occurred in the process that resulted in lower
concentrations. If this variability in output product concentration can be reduced, operation of this process can be
improved.