
MODULE 3 Organizing Data
WEEK 2

Summarizes concisely the organized and presented data
• Frequency Distributions
• Graphing Data

Engage: Warm Up!
Explore: Dig In!
Explain: Lecture Based
Elaborate: Wrought with Labor
Evaluate: Test Time!

MODULE 4 Measures of Central Tendency


Explain briefly the methods in the computation of the measures of central tendency
• Mean
• Median
• Mode

Organizing Data
In this chapter, and the next, we discuss what to do with the observations made when conducting a
study—namely, how to describe the data set through the use of descriptive statistics. First, we consider ways of
organizing the data. We need to take the large number of observations made during the course of a study and
present them in a manner that is easier to read and understand. Then, we discuss some simple descriptive
statistics. These statistics allow us to do some “number crunching”—to condense a large number of observations
into a summary statistic or set of statistics. The concepts and statistics described in this section can be used to
draw conclusions from data. They do not come close to covering all that can be done with data gathered from a
study. They do, however, provide a place to start.

Learning Objectives
• Organize data in a frequency distribution.
• Organize data in a class interval frequency distribution.
• Graph data in a bar graph.
• Graph data in a histogram.
• Graph data in a frequency polygon.
We will discuss two methods of organizing data: frequency distributions and graphs.

Frequency Distributions
To illustrate the processes of organizing and describing data, let’s use the data set presented in Table 3.1.
These data represent the scores of 30 students on an introductory psychology exam. One reason for organizing
data and using statistics is so that meaningful conclusions can be drawn. As you can see from Table 3.1, our list
of exam scores is simply that—a list in no particular order. As shown here, the data are not especially meaningful.
One of the first steps in organizing these data might be to rearrange them from highest to lowest or lowest to
highest.
Once this is accomplished (see Table 3.2), we can try to condense the data into a frequency
distribution, a table in which all of the scores are listed along with the frequency with which each occurs. We
can also show a relative frequency distribution, which indicates the proportion of the total observations included
in each score (the frequency of the score divided by the total number of observations, f/N). When the relative
frequency is multiplied by 100, it is read as a percentage. A frequency distribution and a relative frequency
distribution of our exam data are presented in Table 3.3.
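
The tallying described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `frequency_distribution` is an assumption made for this example, and the scores are copied from Table 3.1.

```python
from collections import Counter

# Exam scores for the 30 students, as listed in Table 3.1.
scores = [
    56, 69, 78, 80, 47, 85, 82, 74, 95, 65, 54, 60, 87, 76, 75,
    74, 70, 90, 74, 59, 86, 92, 60, 63, 45, 94, 93, 82, 77, 78,
]


def frequency_distribution(values):
    """Return (score, f, rf) rows ordered from lowest to highest score."""
    counts = Counter(values)   # f: how often each score occurs
    n = len(values)            # N: total number of observations
    return [(score, f, f / n) for score, f in sorted(counts.items())]


for score, f, rf in frequency_distribution(scores):
    print(f"{score:>5} {f:>3} {rf:.3f}")
```

Each printed row corresponds to one row of Table 3.3; multiplying the rf value by 100 gives the percentage of students with that score.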

The frequency distribution is a way of presenting data that makes the pattern of the data easier to see.
We can make the data set even easier to read (especially desirable with large data sets) if we group the scores
and create a class interval frequency distribution.
Table 3.1 Exam scores for 30 students

SCORE   SCORE
56      74
69      70
78      90
80      74
47      59
85      86
82      92
74      60
95      63
65      45
54      94
60      93
87      82
76      77
75      78

Table 3.2 Exam scores ordered from lowest to highest

SCORE   SCORE
45      76
47      77
54      78
56      78
59      80
60      82
60      82
63      85
65      86
69      87
70      90
74      92
74      93
74      94
75      95

We can combine individual scores into categories, or intervals, and list them along with the frequency of
scores in each interval. In our exam score example, the scores range from 45 to 95—a 50-point range. A rule of
thumb when creating class intervals is to have between 10 and 20 categories (Hinkle, Wiersma, & Jurs, 1988). A
quick method of calculating what the width of the interval should be is to subtract the smallest score from the
largest score and then divide by the number of intervals you would like (Schweigert, 1994). If we wanted 10
intervals in our example, we would proceed as follows:
$$\frac{95 - 45}{10} = \frac{50}{10} = 5$$
Class interval frequency distribution. A table in which the scores are grouped into intervals and listed
along with the frequency of scores in each interval.

The frequency distribution using the class intervals with a width of 5 is provided in Table 3.4. Notice how
much more compact the data appear when presented in a class interval frequency distribution. Although such
distributions have the advantage of reducing the number of categories, they have the disadvantage of not
providing as much information as a regular frequency distribution. For example, although we can see from the
class interval frequency distribution that five people scored between 75 and 79, we do not know their exact
scores within the interval.
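
The width calculation and the grouping into intervals can be sketched as follows. This is a minimal sketch under stated assumptions: the scores are whole numbers, the first interval starts at the lowest score, and the function name `class_interval_distribution` is hypothetical.

```python
# Exam scores for the 30 students, as listed in Table 3.1.
scores = [
    56, 69, 78, 80, 47, 85, 82, 74, 95, 65, 54, 60, 87, 76, 75,
    74, 70, 90, 74, 59, 86, 92, 60, 63, 45, 94, 93, 82, 77, 78,
]


def class_interval_distribution(values, n_intervals=10):
    """Group whole-number scores into intervals and return (low, high, f, rf) rows."""
    low, high = min(values), max(values)
    # Quick method from the text: (largest - smallest) / desired number of intervals.
    width = max(1, round((high - low) / n_intervals))
    rows = []
    start = low
    while start <= high:
        end = start + width - 1  # intervals are inclusive, e.g. 45 to 49
        f = sum(start <= v <= end for v in values)
        rows.append((start, end, f, f / len(values)))
        start += width
    return rows


for start, end, f, rf in class_interval_distribution(scores):
    print(f"{start}-{end} {f:>3} {rf:.3f}")
```

With a width of 5 starting at 45, the top score of 95 opens one more interval (95 to 99), so the sketch produces 11 rows, the same rows shown in Table 3.4.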

Table 3.3 Frequency and relative frequency distribution of exam data

SCORE f (FREQUENCY) rf (RELATIVE FREQUENCY)
45 1 .033
47 1 .033
54 1 .033
56 1 .033
59 1 .033
60 2 .067
63 1 .033
65 1 .033
69 1 .033
70 1 .033
74 3 .100
75 1 .033
76 1 .033
77 1 .033
78 2 .067
80 1 .033
82 2 .067
85 1 .033
86 1 .033
87 1 .033
90 1 .033
92 1 .033
93 1 .033
94 1 .033
95 1 .033
N = 30 1.00

Graphing Data
Frequency distributions can provide valuable information, but sometimes a picture is of greater value.
Several types of pictorial representations can be used to represent data. The choice depends on the type of data
collected and what the researcher hopes to emphasize or illustrate. The most common graphs used by
psychologists are bar graphs, histograms, and frequency polygons (line graphs). Graphs typically have two
coordinate axes, the x-axis (the horizontal axis) and the y-axis (the vertical axis). Most commonly, the y-axis is
shorter than the x-axis, typically 60% to 75% of the length of the x-axis.

Table 3.4 A class interval frequency distribution of exam data


Class Interval f rf
45–49 2 .067
50–54 1 .033
55–59 2 .067
60–64 3 .100
65–69 2 .067
70–74 4 .133
75–79 5 .167
80–84 3 .100
85–89 3 .100
90–94 4 .133
95–99 1 .033
N = 30 1.00

Bar Graphs and Histograms


Bar graphs and histograms are frequently confused. When the data collected are on a nominal scale, or
if the variable is a qualitative variable (a categorical variable for which each value represents a discrete category),
then a bar graph is most appropriate. A bar graph is a graphical representation of a frequency distribution in
which vertical bars are centered above each category along the x-axis and are separated from each other by a
space, indicating that the levels of the variable represent distinct, unrelated categories.
If the variable is a quantitative variable (the scores represent a change in quantity), or if the data
collected are ordinal, interval, or ratio in scale, then a histogram can be used. A histogram is also a graphical
representation of a frequency distribution in which vertical bars are centered above scores on the x-axis, but in
a histogram the bars touch each other to indicate that the scores on the variable represent related, increasing
values.
In both a bar graph and a histogram, the height of each bar indicates the frequency for that level of the
variable on the x-axis. The spaces between the bars on the bar graph indicate not only the qualitative differences
among the categories but also that the order of the values of the variable on the x-axis is arbitrary. In other
words, the categories on the x-axis in a bar graph can be placed in any order. The fact that the bars are contiguous
in a histogram indicates not only the increasing quantity of the variable but also that the variable has a definite
order that cannot be changed.
A bar graph is illustrated in Figure 3.1. For a hypothetical distribution, the frequencies of individuals who
affiliate with various political parties are indicated. Notice that the different political parties are listed on the x-
axis, whereas frequency is recorded on the y-axis. Although the political parties are presented in a certain order,
this order could be rearranged because the variable is qualitative.

qualitative variable
A categorical variable for which each value represents a discrete category.

bar graph A graphical representation of a frequency distribution in which vertical bars are centered above each
category along the x-axis and are separated from each other by a space, indicating that the levels of the variable
represent distinct, unrelated categories.

quantitative variable
A variable for which the scores represent a change in quantity.

histogram A graphical representation of a frequency distribution in which vertical bars centered above scores
on the x-axis touch each other to indicate that the scores on the variable represent related, increasing values.

Figure 3.1
Bar graph representing political affiliation for a distribution of 30 individuals (x-axis: political affiliation;
y-axis: frequency)

Figure 3.2 illustrates a histogram. In this figure, the frequencies of intelligence test scores from a hypothetical
distribution are indicated. A histogram is appropriate because the IQ score variable is quantitative. The variable
has a specific order that cannot be rearranged.

frequency polygon
A line graph of the frequencies of individual scores.

Frequency Polygons (Line Graphs)


We can also depict the data in a histogram as a frequency polygon—a line graph of the frequencies of individual
scores or intervals. Again, scores (or intervals) are shown on the x-axis and frequencies on the y-axis. Once all the
frequencies are plotted, the data points are connected. You can see the frequency polygon for the intelligence
score data in Figure 3.3.
Figure 3.2
Histogram representing IQ score data for 30 individuals

Figure 3.3
Frequency polygon of IQ score data for 30 individuals

Frequency polygons are appropriate when the variable is quantitative or the data are ordinal, interval, or
ratio. In this respect, frequency polygons are similar to histograms. Frequency polygons are especially useful for
continuous data (such as age, weight, or time) in which it is theoretically possible for values to fall anywhere
along the continuum. For example, an individual can weigh 120.5 pounds or be 35.5 years of age. Histograms
are more appropriate when the data are discrete (measured in whole units)—for example, number of college
classes taken or number of siblings.
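
To make the graphing ideas concrete without a plotting library, the sketch below renders the class intervals of Table 3.4 as a text histogram and computes the points a frequency polygon would connect. Several choices here are assumptions made for illustration: the bars are drawn horizontally rather than vertically, each interval is plotted at its midpoint, and the function names are hypothetical.

```python
# Class intervals and frequencies copied from Table 3.4.
table_3_4 = [
    ("45-49", 2), ("50-54", 1), ("55-59", 2), ("60-64", 3), ("65-69", 2),
    ("70-74", 4), ("75-79", 5), ("80-84", 3), ("85-89", 3), ("90-94", 4),
    ("95-99", 1),
]


def text_histogram(rows, symbol="#"):
    """Return one line per interval; the bar length equals the frequency."""
    return [f"{label} | {symbol * f} ({f})" for label, f in rows]


def polygon_points(rows):
    """Return (interval midpoint, frequency) pairs to be connected by a line."""
    points = []
    for label, f in rows:
        low, high = (int(part) for part in label.split("-"))
        points.append(((low + high) / 2, f))
    return points


print("\n".join(text_histogram(table_3_4)))
print(polygon_points(table_3_4))
```

The intervals are kept in increasing order in both outputs because the variable is quantitative; for a qualitative variable such as political affiliation, the rows could be listed in any order.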
Measures of Central Tendency
Learning Objectives
• Differentiate measures of central tendency.
• Know how to calculate the mean, median, and mode.
• Know when it is most appropriate to use each measure of central tendency.

Organizing data into tables and graphs can help make a data set more meaningful. These methods, however,
do not provide as much information as numerical measures. Descriptive statistics are numerical measures that
describe a distribution by providing information on the central tendency of the distribution, the width of the
distribution, and the distribution’s shape. A measure of central tendency characterizes an entire set of data in
terms of a single representative number. Measures of central tendency measure the “middleness” of a distribution
of scores in three ways: the mean, median, and mode.

Mean
The most commonly used measure of central tendency is the mean—the arithmetic average of a group of scores.
You are probably familiar with this idea. We can calculate the mean for our distribution of exam scores (from the
previous module) by adding all of the scores together and dividing by the total number of scores. Mathematically,
this would be:

$$\mu = \frac{\sum X}{N}$$
where
𝜇 (pronounced “mu”) represents the symbol for the population mean
∑ represents the symbol for “the sum of”
X represents the individual scores, and
N represents the number of scores in the distribution

To calculate the mean, then, we sum all of the Xs, or scores, and divide by the total number of scores in
the distribution (N). You may have also seen this formula represented as follows:

$$\bar{X} = \frac{\sum X}{N}$$

In this case $\bar{X}$ (read "X-bar") represents a sample mean.
We can use either formula (they are the same) to calculate the mean for the distribution of exam scores used in
Module 3. These scores are presented again in Table 4.1, along with a column showing frequency (f) and another
column showing the frequency of the score multiplied by the score (f times X).

Table 4.1 Frequency distribution of exam scores, including an fX column

X f fX
45 1 45
47 1 47
54 1 54
56 1 56
59 1 59
60 2 120
63 1 63
65 1 65
69 1 69
70 1 70
74 3 222
75 1 75
76 1 76
77 1 77
78 2 156
80 1 80
82 2 164
85 1 85
86 1 86
87 1 87
90 1 90
92 1 92
93 1 93
94 1 94
95 1 95
N = 30 ∑X = 2,220

The sum of all the values in the fX column is the sum of all the individual scores (∑X). Using this sum in the
formula for the mean, we have:

$$\mu = \frac{\sum X}{N} = \frac{2{,}220}{30} = 74.00$$
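
The same calculation can be checked with a short sketch that works directly from the X and f columns of Table 4.1. The function name `mean_from_frequencies` is an assumption made for this example.

```python
# (score X, frequency f) pairs copied from Table 4.1.
table_4_1 = [
    (45, 1), (47, 1), (54, 1), (56, 1), (59, 1), (60, 2), (63, 1), (65, 1),
    (69, 1), (70, 1), (74, 3), (75, 1), (76, 1), (77, 1), (78, 2), (80, 1),
    (82, 2), (85, 1), (86, 1), (87, 1), (90, 1), (92, 1), (93, 1), (94, 1),
    (95, 1),
]


def mean_from_frequencies(rows):
    """Return (sum of fX, N, mean) for a list of (X, f) rows."""
    n = sum(f for _, f in rows)          # N: total number of scores
    sum_x = sum(x * f for x, f in rows)  # sum of the fX column
    return sum_x, n, sum_x / n


sum_x, n, mean = mean_from_frequencies(table_4_1)
print(sum_x, n, mean)  # 2220 30 74.0
```

Summing the fX column gives the same total as adding the 30 raw scores one by one, which is why the frequency table is a convenient shortcut.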

Median
Another measure of central tendency, the median, is used in situations in which the mean might not be
representative of a distribution. Let’s use a different distribution of scores to demonstrate when it might be
appropriate to use the median rather than the mean. Imagine that you are considering taking a job with a small
computer company. When you interview for the position, the owner of the company informs you that the mean
income for employees at the company is approximately $100,000 and that the company has 25 employees. Most
people would view this as good news. Having learned in a statistics class that the mean might be influenced by
extreme scores, you ask to see the distribution of 25 incomes. The distribution is shown in Table 4.2.
The calculation of the mean for this distribution is:
$$\mu = \frac{\sum X}{N} = \frac{2{,}498{,}000}{25} = 99{,}920$$

Notice that, as claimed, the mean income of company employees is very close to $100,000. Notice also,
however, that the mean in this case is not very representative of central tendency, or “middleness.” In this
distribution, the mean is thrown off center or inflated by one very extreme score of $1,800,000 (the income of
the company’s owner, needless to say). This extremely high income pulls the mean toward it and thus increases
or inflates the mean. Thus, in distributions with one or a few extreme scores (either high or low), the mean will
not be a good indicator of central tendency. In such cases, a better measure of central tendency is the median.
The median is the middle score in a distribution after the scores have been arranged from highest to lowest or
lowest to highest. The distribution of incomes in Table 4.2 is already ordered from lowest to highest. To
determine the median, we simply have to find the middle score. In this situation, with 25 scores, that would be the
13th score. You can see that the median of the distribution would be an income of $27,000, which is far more
representative of the central tendency for this distribution of incomes.

Table 4.2 Yearly salaries for 25 employees


Income Frequency fX
15,000 1 15,000
20,000 2 40,000
22,000 1 22,000
23,000 2 46,000
25,000 5 125,000
27,000 2 54,000
30,000 3 90,000
32,000 1 32,000
35,000 2 70,000
38,000 1 38,000
39,000 1 39,000
40,000 1 40,000
42,000 1 42,000
45,000 1 45,000
1,800,000 1 1,800,000
N = 25 ∑ 𝑋 = 2,498,000

Why is the median not as influenced as the mean by extreme scores? Think about the calculation of each
of these measures. When calculating the mean, we must add in the atypical income of $1,800,000, thus distorting
the calculation. When determining the median, however, we do not consider the size of the $1,800,000 income;
it is only a score at one end of the distribution whose numerical value does not have to be considered in order
to locate the middle score in the distribution. The point to remember is that the median is not affected by extreme
scores in a distribution because it is only a positional value. The mean is affected because its value is determined
by a calculation that has to include the extreme value.
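
A small sketch can make this contrast visible. It expands Table 4.2 into the 25 individual incomes and computes the mean and median with and without the owner's income, using Python's standard `statistics` module. Dropping the owner's income is an extra step added here purely for illustration.

```python
from statistics import mean, median

# (income, frequency) pairs copied from Table 4.2, lowest to highest.
table_4_2 = [
    (15_000, 1), (20_000, 2), (22_000, 1), (23_000, 2), (25_000, 5),
    (27_000, 2), (30_000, 3), (32_000, 1), (35_000, 2), (38_000, 1),
    (39_000, 1), (40_000, 1), (42_000, 1), (45_000, 1), (1_800_000, 1),
]

# Expand the frequency table into one entry per employee.
incomes = [income for income, f in table_4_2 for _ in range(f)]

# All 25 employees, including the extreme score of $1,800,000.
with_owner = (mean(incomes), median(incomes))

# The same distribution with the extreme score (the last entry) left out.
without_owner = (mean(incomes[:-1]), median(incomes[:-1]))

print("with owner:   ", with_owner)
print("without owner:", without_owner)
```

Removing the single extreme income drops the mean from $99,920 to roughly $29,083, while the median stays at $27,000 in both cases.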

In the income example, the distribution had an odd number of scores (N = 25). Thus, the median was an
actual score in the distribution (the 13th score). In distributions with an even number of observations, the median
is calculated by averaging the two middle scores. In other words, we determine the middle point between the
two middle scores. Look back at the distribution of exam scores in Table 4.1. This distribution has 30 scores. The
median would be the average of the 15th and 16th scores (the two middle scores). Thus, the median would be
75.5—not an actual score in the distribution, but the middle point nonetheless. Notice that in this distribution,
the median (75.5) is very close to the mean (74.00). Why are they so similar? Because this distribution contains
no extreme scores, both the mean and the median are representative of the central tendency of the distribution.
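
The odd and even cases can be written out as a short sketch. The function name `median_by_position` is hypothetical, and the exam scores are copied from Table 3.1.

```python
# Exam scores for the 30 students, as listed in Table 3.1.
scores = [
    56, 69, 78, 80, 47, 85, 82, 74, 95, 65, 54, 60, 87, 76, 75,
    74, 70, 90, 74, 59, 86, 92, 60, 63, 45, 94, 93, 82, 77, 78,
]


def median_by_position(values):
    """Return the middle score, averaging the two middle scores when N is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd N: the median is an actual score in the distribution.
        return ordered[mid]
    # Even N: average the two middle scores (here the 15th and 16th of 30).
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median_by_position(scores))  # 75.5
```

For the 30 exam scores, the 15th and 16th ordered scores are 75 and 76, so the function returns 75.5.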

Like the mean, the median can be used with ratio and interval data and is inappropriate for use with
nominal data, but unlike the mean, the median can be used with most ordinal data.

Mode
The third measure of central tendency is the mode—the score in a distribution that occurs with the greatest
frequency. In the distribution of exam scores, the mode is 74 (similar to the mean and median). In the distribution
of incomes, the mode is $25,000 (similar to the median, but not the mean). In some distributions, all scores occur
with equal frequency; such a distribution has no mode. In other distributions, several scores occur with equal
frequency. Thus, a distribution may have two modes (bimodal), three modes (trimodal), or even more. The mode
is the only indicator of central tendency that can be used with nominal data. Although it can also be used with
ordinal, interval, or ratio data, the mean and median are more reliable indicators of the central tendency of a
distribution, and the mode is seldom used.
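
The cases described above (one mode, several modes, or no mode) can be sketched as follows. The function name `modes` and the convention of returning an empty list when every score occurs equally often are assumptions made for this example.

```python
from collections import Counter

# Exam scores for the 30 students, as listed in Table 3.1.
scores = [
    56, 69, 78, 80, 47, 85, 82, 74, 95, 65, 54, 60, 87, 76, 75,
    74, 70, 90, 74, 59, 86, 92, 60, 63, 45, 94, 93, 82, 77, 78,
]


def modes(values):
    """Return the most frequent score(s), or [] when the distribution has no mode."""
    if not values:
        return []
    counts = Counter(values)
    top = max(counts.values())
    # If several different scores all occur equally often, there is no mode.
    if len(counts) > 1 and all(f == top for f in counts.values()):
        return []
    return sorted(v for v, f in counts.items() if f == top)


print(modes(scores))           # [74]
print(modes([1, 1, 2, 2, 3]))  # [1, 2], a bimodal distribution
print(modes([1, 2, 3]))        # [], every score occurs once
```

Because the function only counts occurrences, it works equally well on category labels, which matches the point that the mode is the only measure of central tendency usable with nominal data.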
QUIZ #3 Organizing Data
1. What do you think might be the advantage of a graphical representation of data over a frequency
distribution?
2. A researcher observes driving behavior on a roadway, noting the gender of the drivers, the type of vehicle
driven, and the speed at which they are traveling. The researcher wants to organize the data in graphs
but cannot remember when to use bar graphs, histograms, or frequency polygons. Which type of graph
should be used to describe each variable?
3. In the example described in question 2, a researcher collected data on drivers’ gender,
type of vehicle, and speed of travel. What would be an appropriate measure of central tendency to
calculate for each type of data?
4. If one driver was traveling at a rate of 100 mph (25 mph faster than anyone else), which measure of
central tendency would you recommend against using?
