
Summary of Chapter 2 – Displaying and Describing Qualitative Data

(discussion based on Excel file 1000Ch2RestaurantCategoricalDataTables&Graphs)

Starting from our original data set of randomly selected customers in the restaurant study, and after
using VLOOKUP and IF, we ended up with a number of categorical variables including geographical location
(nominal), gender (nominal), loyalty card (nominal), satisfaction (ordinal), last meal (nominal or ordinal)
and age (ordinal: young, mid age, senior). To make sense of the 125 qualitative descriptions for
each of these variables, we need to convert them into information that can be used to make
decisions of importance to us. Converting them into information involves summarizing these
descriptions, either by creating statistics (i.e., numerical measures) or by creating a ‘picture’
of the data.

Since statistics are numerical measures of a data set, how can we convert qualitative descriptions into
numerical measures? Basically, the only numerical measure for this type of data is a count of how many
observations belong to each category (frequencies) which, if necessary, can be converted into the
proportion or percentage of observations (relative frequencies) belonging to each category. We can
then summarize these frequencies or relative frequencies in either a table or a graph. Which table or
graph is relevant depends, to a large extent, on our objective and on whether we need to summarize
one categorical variable or more than one (usually two) to satisfy that objective. (If we need two
categorical variables, our objective is usually to look at the relationship between them and
determine whether or not there is one.)
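
The counting step described above can be sketched in a few lines of Python. This is only an
illustration: the location labels below are hypothetical, not the 125 observations in the Excel file,
and the function name frequency_table is an assumption made for this sketch.

```python
from collections import Counter

# Hypothetical location labels for a handful of customers (not the study's data).
locations = ["BC", "Alberta", "BC", "Ontario", "BC", "Alberta", "Ontario", "BC"]

def frequency_table(values):
    """Return {category: (frequency, relative frequency in percent)}."""
    counts = Counter(values)
    n = len(values)
    # most_common() lists categories from most to least frequent.
    return {cat: (k, 100 * k / n) for cat, k in counts.most_common()}

table = frequency_table(locations)
for cat, (freq, pct) in table.items():
    print(f"{cat:<8} {freq:>3} {pct:6.1f}%")
```

The frequencies sum to the number of observations and the relative frequencies sum to 100%.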

Rule of data analysis – summarize our data with a picture (e.g., graph) and/or table

To summarize our data with a picture or table, common tools are

• Frequency tables
• Bar charts
• Pie charts
• Contingency tables (conditional distributions)
• Segmented bar charts, side-by-side bar charts

The accompanying Excel file contains several tables (labelled A through G) and several graphs (labelled H
through M) based on some of the categorical data in the restaurant data. (These tables and graphs all
involved the use of the ‘Pivot Table’ function.)

A: A frequency table summarizing the locations of the restaurants in which the 125 customers
were sampled. This includes both the number of customers (frequency) and the percentage of
customers (relative frequency) sampled in each of the locations. (Technically, a frequency table
does not necessarily need to include both frequencies and relative frequencies.) Obviously, this
allows us to compare the numbers of customers or percentage of customers sampled in each
location. Since location is nominal, it is not necessary to order locations in a certain way
although, in this table, they were ordered from most to least number of customers, a common
practice with nominal data.

The rest of these tables look at two characteristics simultaneously, either by counting how many
customers fall into each combination of these two characteristics or by converting these counts into
relative frequencies or percentages. When counts are converted into percentages, each percentage
may express the count as: a percentage of the total count; a percentage of the total count in the row
in which this combination is found; or a percentage of the total count in the column in which this
combination is found. By basing these percentages on either row totals (table E) or column totals
(tables D and G), one can make a tentative observation as to whether or not there is a relationship
between these characteristics in our sample. In the subsequent statistics course, one can use the
sample results to test whether or not there is a true relationship between these two characteristics in
the population from which this sample was taken.
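
As a rough sketch of the three percentage bases just described, the snippet below expresses one cell
of a small contingency table as a percentage of the grand total, of its row total and of its column
total. The (age group, location) pairs are hypothetical, and the names cell, row_total, col_total and
percentages are assumptions made for this illustration, not part of the Excel file.

```python
from collections import Counter

# Hypothetical (age group, location) pairs; not the study's data.
customers = [
    ("young", "BC"), ("young", "BC"), ("young", "Alberta"),
    ("mid age", "BC"), ("mid age", "Alberta"), ("mid age", "Alberta"),
    ("senior", "BC"), ("senior", "Alberta"),
]

cell = Counter(customers)                      # count per (row, column) combination
row_total = Counter(r for r, _ in customers)   # count per age group
col_total = Counter(c for _, c in customers)   # count per location
n = len(customers)                             # grand total

def percentages(row, col):
    """Express one cell as % of grand total, % of its row, % of its column."""
    k = cell[(row, col)]
    return (100 * k / n, 100 * k / row_total[row], 100 * k / col_total[col])

of_total, of_row, of_col = percentages("young", "BC")
print(f"young/BC: {of_total:.1f}% of total, {of_row:.1f}% of row, {of_col:.1f}% of column")
```

The same count gives three different percentages, so a table should always state which total its
percentages are based on.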

B: A contingency table which summarizes the number of customers falling into each combination
of age and location, in addition to summarizing the total number of customers in each age
category and the total number of customers in each location. The counts in these combinations
sum to the sample size of 125. Although location, being nominal, did not
need to be ordered, it is customary to order ordinal data, such as age groups, in their natural
order, such as youngest to oldest. Although this table summarizes data relating to these two
variables, it may need to be further manipulated to satisfy some reasonable objective or to be of
use in decision making.

C: Another contingency table which converts the frequencies in table B into relative frequencies
where the relative frequencies are the percentage of customers falling into each combination of
age and location, in addition to summarizing the percentage of customers in each age group and
the percentage of customers in each location. The percentages in each combination of age and
location total 100%. As in table B, this table may need further manipulations to be of use in
decision making.

D: Another contingency table which converts the frequencies in table B into percentages of
customers in each location within each age group. In this table, the percentages in each age
group total 100%. Because the percentages total 100% in each age group, the percentages
within each age group give us the conditional distribution of locations for each age group. This
allows us to compare distributions of locations among the three age groups. If the distributions
are the same (i.e., same percentage of BC customers in each of the 3 age groups, same
percentage of Alberta customers in each of the 3 age groups, etc.), one could argue that
locations of customers are not affected by, or not related to, the age of the customers.

E: Another contingency table which converts the frequencies in table B into percentages of each
age group within each location. In this table, the percentages in each location total 100%. Because
the percentages total 100% in each location, the percentages within each location
give us the conditional distribution of ages within each location. This allows us to compare
distributions of age groups among the 6 locations. If the distributions are the same (i.e., same
percentage of young customers in each location, same percentage of mid age customers in each
location, etc.), one could argue that customer ages are not affected by, or not related to,
locations. Of tables C, D and E, this table gives us the best summary if we are interested in
coming up with some plan to improve the number or percentage of customers of a certain age
within certain locations.

F: A contingency table which summarizes the number of customers falling into each combination
of age and level of satisfaction, in addition to summarizing the total number of customers in
each age category and the total number of customers in each level of satisfaction. The counts
in these combinations sum to the sample size of 125. Both variables, being
ordinal, were placed in their natural order (e.g., youngest to oldest). Although this table
summarizes data relating to these two variables, it may need to be further manipulated to be of
use in decision making.

G: A contingency table created from table F by
converting the number of customers into the percentage of customers within each age group who
reported each level of satisfaction, giving us conditional distributions
(conditional on age). This table allows us to compare satisfaction across the three
age groups and so to determine whether different age groups have
different attitudes about the restaurant chain, which in turn lets us develop a plan to improve
satisfaction levels within certain age groups if necessary.
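
The comparison of conditional distributions described for tables D, E and G can be sketched as
follows. The responses and the two satisfaction labels are hypothetical (the levels used in the Excel
file are not listed in this summary), and conditional_distributions is a name chosen only for this
illustration.

```python
from collections import Counter, defaultdict

# Hypothetical (age group, satisfaction) pairs; the level names are assumed.
responses = [
    ("young", "satisfied"), ("young", "satisfied"),
    ("young", "dissatisfied"), ("young", "dissatisfied"),
    ("senior", "satisfied"), ("senior", "satisfied"),
    ("senior", "satisfied"), ("senior", "dissatisfied"),
]

def conditional_distributions(pairs):
    """For each group, return the % of that group's members in each outcome."""
    by_group = defaultdict(Counter)
    for group, outcome in pairs:
        by_group[group][outcome] += 1
    return {
        group: {o: 100 * k / sum(c.values()) for o, k in c.items()}
        for group, c in by_group.items()
    }

dist = conditional_distributions(responses)
for group, d in dist.items():
    print(group, d)
```

In this made-up sample the two age groups have different distributions (50% versus 75% satisfied),
which would hint at a relationship; identical distributions would suggest none.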

H & I: Graphs H (bar chart) and I (pie chart) give us two pictures of table A, the frequency table. The
heights of the bars in this bar chart indicate the number of customers (frequencies) within each of
the locations as summarized in table A and allow us to compare these numbers across the
locations. A bar chart using the relative frequencies could also have been constructed, but the
chart would look identical (in terms of the relative heights of the bars), with the heights
representing percentages instead of counts. (In the case of nominal data, when the bars in a bar
chart are ordered from highest to lowest, as in Graph H, the chart is often
referred to as a Pareto chart.) One could argue that the pie chart gives us a better picture than
a bar chart when we want to compare the relative ‘importance’ of each location to the whole.
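
As a small text-only sketch of the ordering used in Graph H, the snippet below draws one bar per
category with the bars sorted from highest to lowest frequency. The counts are hypothetical, and
text_bar_chart is a helper invented for this illustration; the graphs in the Excel file were built with
pivot tables, not with this code.

```python
from collections import Counter

# Hypothetical location counts (not the study's figures).
counts = Counter({"Alberta": 3, "BC": 5, "Ontario": 2})

def text_bar_chart(freqs):
    """Return one text line per category, longest bar first."""
    ordered = sorted(freqs.items(), key=lambda item: item[1], reverse=True)
    # Each '#' stands for one observation, so bar length equals frequency.
    return [f"{cat:<8} {'#' * k} {k}" for cat, k in ordered]

lines = text_bar_chart(counts)
print("\n".join(lines))
```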

J & K: These bar charts (J is a side-by-side bar chart and K is a segmented bar chart as defined by the
text) give us the same information as table B, but in picture form. With
locations across the horizontal axis and heights representing the number of customers of
each age group within each location, we are mainly interested in comparing the importance of
different age groups within each location. Because the heights are counts rather than percentages,
it is more difficult to compare the importance of different age groups across the different locations.

(There was no graph created which visualizes the information contained in table D.)

L: This bar chart (segmented bar chart) gives us the same information as contained in table E. By
having each bar in each location equalling 100% of the total number of customers in each
location and by having the 100% separated into percentages falling in each age group within
each location, we can visualize the importance of each age group to each location and we can
compare the importance of each age group across locations.

M: This bar chart (segmented bar chart) gives us the same information as contained in table G. It
enables us to visualize the distribution of levels of satisfaction within each age group and
to compare these distributions across the different age groups.

Observations:

Each of the above graphs was created from frequencies or relative frequencies; that is, each graph was
based on the importance or relative importance of each of the categories of a qualitative variable or of
each combination of categories of two qualitative variables. Importance need not be based only on
frequencies. Each of the above graphs can also be created using other definitions of importance. For
example, you may have a qualitative variable defined by where your money is being spent, with the
importance of each category measured by how much money is being spent in that
category. Whether importance is measured by frequencies (or relative frequencies) or by the amount
of money being spent (or relative amounts of money), the heights of bars or the sizes of pie slices are
based on quantities (e.g., frequencies or amounts being spent in each of the categories).
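
To illustrate importance measured by amount rather than by frequency, the sketch below totals the
money spent in each category and converts the totals into shares of the whole. The spending records
are made up for this example, and share_by_amount is a hypothetical helper name.

```python
from collections import defaultdict

# Hypothetical spending records (category, amount); the values are made up.
spending = [("food", 40.0), ("rent", 100.0), ("food", 20.0), ("travel", 40.0)]

def share_by_amount(records):
    """Return each category's share (in percent) of the total amount spent."""
    totals = defaultdict(float)
    for category, amount in records:
        totals[category] += amount          # sum amounts instead of counting rows
    grand_total = sum(totals.values())
    return {cat: 100 * t / grand_total for cat, t in totals.items()}

shares = share_by_amount(spending)
for category, pct in shares.items():
    print(f"{category:<7} {pct:5.1f}%")
```

Here food appears in two of the four records (50% by frequency) but accounts for only 30% of the
money spent, so the two definitions of importance can give different pictures.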

The above graphs were created, based on some specific objective, to give us an accurate and
undistorted picture of our data. One can distort the true picture by
not starting the vertical axis at zero. Another way to distort is to have
both the heights and the widths of the bars reflect
the frequencies. For example, if category A has 50 observations and category B has 100 observations,
doubling both the height and the width of the bar for category B gives the
impression that category B is more than twice as important as category A (perhaps four times, since
the bar's area is four times as large). Another common way to distort is to use 3D graphs instead of 2D graphs.
Two other ways to distort a picture of our data are to omit an important picture and to create a
picture with inaccurate or dishonest data.
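
The arithmetic behind the first two distortions can be checked with a short sketch. The 50 and 100
observations come from the example above; the axis starting value of 40 and both function names are
assumptions introduced only for this illustration.

```python
def area_ratio(freq_a, freq_b, scale_width=False):
    """Ratio of bar areas (B to A) when height, and optionally width, tracks frequency."""
    height_ratio = freq_b / freq_a
    width_ratio = height_ratio if scale_width else 1.0
    return height_ratio * width_ratio

def drawn_height_ratio(freq_a, freq_b, axis_start):
    """Ratio of drawn bar heights (B to A) when the vertical axis starts at axis_start."""
    return (freq_b - axis_start) / (freq_a - axis_start)

print(area_ratio(50, 100))                    # honest bars: B looks 2 times A
print(area_ratio(50, 100, scale_width=True))  # width also doubled: B looks 4 times A
print(drawn_height_ratio(50, 100, 40))        # axis starting at 40: B looks 6 times A
```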
