
MODULE 5

DATA MANAGEMENT

Introduction

Data management refers to a wide range of activities researchers perform as they collect,
arrange, describe, store, preserve, and often share materials generated by their research.
Many funding agencies and institutions require researchers to develop data management plans,
or DMPs, that outline their data management activities, and to disseminate their data through
repositories and archives.

After the end of the module, you should be able to:


1. define data and statistic;
2. contrast descriptive and inferential statistics;
3. construct a frequency distribution for a data set;
4. compute the mean, median, midrange, and mode of a data set;
5. find the variance and standard deviation of a data set and interpret the result;
6. compute z scores and use them to compare values from different data sets;
7. use a table to find areas under the standard normal distribution;
8. use the normal distribution to find probabilities;
9. use the normal distribution to find percentile rank;
10. draw and analyze scatter plots for two data sets;
11. calculate correlation coefficients;
12. determine if correlation coefficients are significant;
13. find a regression line for two data sets; and
14. use regression lines to make predictions.

GATHERING AND ORGANIZING DATA – Module 5


TIME FRAME: 6 hours

CHECK-UP TEST

Answer numbers 9 - 10 on page 651 (Sobecki, D. (2019). Math in Our World. New York, NY:
McGraw-Hill Education).

LESSON PROPER

I. Statistics
Data are measurements or observations that are gathered for an event under study.

Statistics is the branch of mathematics that involves collecting, organizing, summarizing, and
presenting data and drawing general conclusions from the data.

Populations and Samples

When statistical studies are performed, we usually begin by identifying the population for
the study. A population consists of all subjects under study (for example, all colleges in NDMU).
More often than not, it’s not realistic to gather data from every member of a population.

A sample is a representative subgroup or subset of a population.

Sampling Methods
We will study four basic sampling methods:

1. In order to obtain a random sample, each subject of the population must have an equal
chance of being selected.

2. A systematic sample is taken by numbering each member of the population and then
selecting every kth member, where k is a natural number. When using systematic sampling, it’s
important that the starting number is selected at random. For example,

consider every 5th value on the list.

Reading from left to right, every 5th value gives the sample 13, 23, 26, 34, 23, and 12.

When a population is divided into groups where the members of each group have similar
characteristics and members from each group are chosen at random, the result is called a
stratified sample.

When an existing group of subjects that represent the population is used for a sample, it is
called a cluster sample.

3. Stratified sampling – the process of subdividing the population into subgroups, or strata, and
drawing members at random from each subgroup or stratum.

Given the population of a certain university and a target sample size of 5,455, determine
the sample size for each subgroup (field of specialization).

Field of Specialization   Population
Nursing                        6,000
Accountancy                      500
Management                     2,000
Marketing                      1,000
Education                      2,500
Total                         12,000

Each subgroup’s percentage is its population divided by the total population, and its sample size
is that percentage of the target sample size.

Field of Specialization   Population   Percentage   Sample Size
Nursing                        6,000        50.00         2,728
Accountancy                      500         4.17           227
Management                     2,000        16.67           909
Marketing                      1,000         8.33           455
Education                      2,500        20.83         1,136
Total                         12,000       100.00         5,455
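The allocation in the table can be checked with a short script. This is only a sketch of proportional allocation: the names `population`, `target_sample`, and `allocate` are made up for illustration, and rounding each subgroup separately does not always add up to the target exactly (here it happens to).

```python
# Proportional allocation for a stratified sample (figures from the table).
population = {
    "Nursing": 6000,
    "Accountancy": 500,
    "Management": 2000,
    "Marketing": 1000,
    "Education": 2500,
}
target_sample = 5455


def allocate(strata, sample_size):
    """Return {stratum: (percentage, sample size)} by proportional allocation."""
    total = sum(strata.values())
    result = {}
    for name, size in strata.items():
        share = size / total  # fraction of the whole population
        result[name] = (round(share * 100, 2), round(share * sample_size))
    return result


allocation = allocate(population, target_sample)
for name, (percent, n) in allocation.items():
    print(f"{name:<12} {percent:6.2f}% {n:6,d}")
```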


4. Cluster sampling – process of selecting clusters from a population which is very large or
widely spread out over a wide geographical area.

Example: A researcher wishes to survey apartment dwellers in a large city. If there are 10
apartment buildings in the city, the researcher can select at random 2 buildings from the 10 and
interview all the residents of these buildings.

Example: Choosing a Sample


A student in an education class is given an assignment to find out how late typical students at his campus
stay up to study. He considered four different methods for gathering data. Describe what type of
sampling method he’s using in each case. Then describe whether you think he’s likely
to get a representative sample.
(a) Stop by the union before his 9 A.M. class and ask everyone sitting at a table how late they were up
studying the night before.
(b) Ask the school registrar to give him a list of 50 students whose student ID numbers end in 4.
(c) Survey the five people sitting closest to the door in each of his five classes.
(d) Look at the campus directory and choose the first name at the top of each page.
SOLUTION
(a) Since he is choosing all students in a particular place at a particular time, this is a cluster sample. The
sample is unlikely to be representative. Since he’s polling people early in the morning, those that tend to
stay up very late studying are less likely to be included in the sample. (Would you be at the union at 8:30
if you were up until 4 A.M. studying? Didn’t think so.)
(b) This is a random sample: you’d think that the last digit of an ID number is likely to be completely
random. How representative it is might depend on how big the campus is: if there are 500 students on
the campus, this would probably be pretty good. If there were 50,000 students, I’d be a lot less
comfortable with it.
(c) This is a stratified sample, since he’s randomly picking subjects from existing groups (the
classes). It sounds like a pretty good approach, depending on what kinds of classes he’s
taking. If they’re all senior-level courses that require a lot of studying that would probably bias
the results toward people staying up later.
(d) This is a classic example of a systematic sample, and would almost certainly be a good
representative sample.

Descriptive Statistics
There are two main branches of statistics: descriptive and inferential. Statistical techniques used
to describe data are called descriptive statistics. This is based on collecting, organizing, and
reporting data without using the data to draw any wide-ranging conclusions.

For example, a researcher might be interested in the average age of the full-time students on your
campus and how many credit hours they’re scheduled for this term.

Inferential Statistics
Statistical techniques used to make inferences are called inferential statistics. This is based on
studying characteristics of a sample within a larger population and using them to draw
conclusions about the entire population.

For example, the Bureau of Labor Statistics estimates the number of people in the United
States that are unemployed every month. Since it would be impossible to survey everyone, the
bureau picks a sample of adults to see what percentage are unemployed. Then they use that
information to estimate the unemployment rate for the entire population.

Frequency Distributions

The data collected for a statistical study are called raw data. In order to describe situations and
draw conclusions, we need to organize the data in a meaningful way.

Two methods that we will use are frequency distributions and stem and leaf plots.

The first type of frequency distribution we will investigate is the categorical frequency
distribution. This is used when the data are categorical rather than numerical.

Example: Constructing a Frequency Distribution
Twenty-five volunteers for a medical research study were given a blood test to obtain their
blood types. The data follow. Construct a frequency distribution for the data.
A, B, B, AB, O,
O, O, B, AB, B,
B, B, O, A, O,

AB, A, O, B, A,
A, O, O, O, AB
Solution:
Step 1 Make a table with all categories represented.

Step 2 Tally the data using the second column.


Step 3 Count the tallies and place the numbers in the third column.
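The three steps can be sketched in a few lines of Python. The variable names are illustrative, and `collections.Counter` stands in for tallying by hand.

```python
from collections import Counter

# Blood types of the 25 volunteers, in the order listed in the example.
blood_types = [
    "A", "B", "B", "AB", "O",
    "O", "O", "B", "AB", "B",
    "B", "B", "O", "A", "O",
    "AB", "A", "O", "B", "A",
    "A", "O", "O", "O", "AB",
]

frequency = Counter(blood_types)  # Steps 2 and 3: tally, then count

# Step 1: one row per category, with tally marks and the frequency.
for category in ["A", "B", "AB", "O"]:
    tally = "|" * frequency[category]
    print(f"{category:<3} {tally:<10} {frequency[category]}")
```

The frequencies should add up to the number of volunteers, which is a quick check on the tally.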

Grouped Frequency Distributions

Another type of frequency distribution that can be constructed uses numerical data and is
called a grouped frequency distribution. In a grouped frequency distribution, the numerical
data are divided into classes.

For example, if you gathered data on the weights of people in your class, there’s a decent chance
that no two people have the exact same weight. So it would be reasonable to group people into
weight ranges, like 100–119 pounds, 120–139 pounds, and so forth.

When deciding on classes for a grouped frequency distribution, here are some guidelines:

1. Try to keep the number of classes between 5 and 15.

2. Make sure the classes do not overlap.

3. Don’t leave out any numbers between the lowest and highest, even if nothing falls into a
particular class.

4. Make sure the range of numbers included in a class is the same for each one.

5. The beginning and ending values have to be chosen based on how the data values are
rounded.

Constructing a frequency distribution


Step 1. Determine the classes (there should be between 5 and 15).
– Find the range: R = HV - LV (highest value minus lowest value).
– Select the number of classes desired.
– Find the class width: W = R / (number of classes). Round up
(preferably to an odd number).
– Select a starting point (usually the lowest value or any convenient number less
than the lowest value); add the width repeatedly to get the lower class limits.
– Find the upper class limits.
Step 2. Tally the data.

Example: Constructing a Frequency Distribution
These data represent the record high temperatures for each of the 50 states in degrees Fahrenheit.
Construct a grouped frequency distribution for the data.

112 100 128 120 134 118 106 110 109 112
100 118 117 116 118 121 114 114 105 109
107 112 114 115 118 117 118 125 106 110
122 108 110 121 113 120 119 111 104 113
120 113 120 117 105 110 118 112 114 115
SOLUTION
Step 1 Subtract the lowest value from the highest value: 134 − 100 = 34.
Step 2 If we use a class width of 5 degrees, that will give us seven classes, since the entire range (34
degrees) divided by 5 is 6.8, which rounds up to 7. This is certainly not the only choice for the number
of classes: if we choose a class width of 4 degrees, there will be nine classes.
Step 3 Start with the lowest value and add 5 to get the lower class limits: 100, 105, 110, 115, 120, 125,
130. Notice that all of the data are rounded to the nearest whole number, so that’s reflected in our
choices of class limits.

Step 4 Set up the classes. To find the upper limit for each, subtract one from the next lower limit.
This gives the classes 100-104, 105-109, 110-114, 115-119, 120-124, 125-129, and 130-134.
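As a rough check on the procedure, the sketch below builds the classes and tallies the data. The 50 values are the record highs from this example's data listing; the names `temps`, `width`, and `frequency` are illustrative only.

```python
# Record high temperatures (degrees Fahrenheit) for the 50 states.
temps = [
    112, 100, 128, 120, 134, 118, 106, 110, 109, 112,
    100, 118, 117, 116, 118, 121, 114, 114, 105, 109,
    107, 112, 114, 115, 118, 117, 118, 125, 106, 110,
    122, 108, 110, 121, 113, 120, 119, 111, 104, 113,
    120, 113, 120, 117, 105, 110, 118, 112, 114, 115,
]
width = 5  # class width chosen in Step 2
lowest, highest = min(temps), max(temps)

# Steps 3 and 4: lower limits start at the lowest value and go up by the
# width; each upper limit is one less than the next lower limit.
frequency = {}
lower = lowest
while lower <= highest:
    frequency[(lower, lower + width - 1)] = 0
    lower += width

# Tally each temperature into the class that contains it.
for t in temps:
    start = lowest + ((t - lowest) // width) * width
    frequency[(start, start + width - 1)] += 1

for (lo, hi), count in frequency.items():
    print(f"{lo}-{hi}: {count}")
```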

II. Picturing Data
Bar Graphs and Pie Charts

When data are representative of certain categories, rather than numerical, we often use bar
graphs or circle graphs (commonly known as pie charts) to illustrate the data.

Pie Chart

A pie chart, also called a circle graph, is a circle that is divided into sections in proportion to
the frequencies corresponding to the categories.

The purpose of a pie chart is to show the relationship of the parts to the whole by visually
comparing the size of the sections.

Example: Drawing a Pie Chart to Represent Data

The marketing firm Deloitte Retail conducted a survey of grocery shoppers. The frequency
distribution below represents the responses to the survey question “How often do you bring your
own bags when grocery shopping?” Draw a pie chart to represent the data.

Response Frequency

Always 10

Never 39

Frequently 19

Occasionally 32

SOLUTION
Step 1 Find the number of degrees corresponding to each slice using the formula

$$\text{Degrees} = \frac{f}{n} \cdot 360^\circ$$

where f is the frequency for each class and n is the sum of the frequencies. Recall that there
are 360° in a circle, so what we're really doing here is finding the percentage of the circle that's
covered by each portion. In this case, n = 100, so the degree measures are:

Always: $\frac{10}{100} \cdot 360^\circ = 36^\circ$
Never: $\frac{39}{100} \cdot 360^\circ = 140.4^\circ$
Frequently: $\frac{19}{100} \cdot 360^\circ = 68.4^\circ$
Occasionally: $\frac{32}{100} \cdot 360^\circ = 115.2^\circ$

Step 2 Using a protractor, graph each section on the circle using the calculated angles. The pie
chart is shown on the next slide. Notice the labeling that makes it clear what each slice
represents. If you don’t have a protractor, you can find graph paper marked off in degrees by
doing an Internet search for “pie chart graph paper.”

Step 3 Calculate the percent of the circle covered by each slice.

Always: $\frac{10}{100} = 10\%$
Never: $\frac{39}{100} = 39\%$
Frequently: $\frac{19}{100} = 19\%$
Occasionally: $\frac{32}{100} = 32\%$

Label each section with the percent, as shown.
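The angle and percent for each slice can be computed in one pass. This is a small sketch with illustrative names (`responses`, `slices`); it does not draw the chart.

```python
# Survey responses and their frequencies from the example.
responses = {"Always": 10, "Never": 39, "Frequently": 19, "Occasionally": 32}
n = sum(responses.values())  # total number of responses

# For each response: (degrees of the circle, percent of the circle).
slices = {
    name: (round(f * 360 / n, 1), round(f * 100 / n, 1))
    for name, f in responses.items()
}

for name, (degrees, percent) in slices.items():
    print(f"{name:<13} {degrees:6.1f} degrees {percent:5.1f}%")
```

The angles should add up to 360° and the percents to 100%, which is a useful check before drawing.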

Bar Graph
A bar graph is used to compare amounts or percents using either vertical or horizontal bars of
various lengths which correspond to the amounts or percents.

While a pie chart is used to compare parts to a whole, a bar graph is used for comparing parts to
other parts.

Example: Drawing a Bar Graph to Represent Data

Let’s revisit the marketing firm’s survey of 100 grocery shoppers. Suppose the frequency
distribution below represents the responses to the survey question, “Which grocery stores have
you shopped at in the last month?”

Store Frequency

Publix 50

Trader Joe’s 40

Fareway 28

Aldi 33

Other 65

(a) Draw a vertical bar graph for the results, shown in the frequency distribution.
(b) Would a pie chart make sense for these data? Why or why not?

SOLUTION

(a) Step 1 Draw and label the axes. We were asked for a vertical bar graph, so the responses go on
the horizontal axis, and the frequencies on the vertical. Make sure that your labeling on
the vertical axis starts at zero.

Step 2 Draw vertical bars with heights that correspond to the frequencies.

(b) There are two good ways to decide that a pie chart would be a bad choice. One is numerical:
we were told that 100 shoppers were surveyed, but the numbers for the responses add up to more
than 100 (216 in all). So this isn’t comparing distinct parts to a whole: the categories overlap. Which brings us
to the other way you can tell that a pie chart is a bad choice. The nature of the survey (“Which
stores have you shopped at?”) pretty much guarantees that some people will choose more than
one of the responses. This is a dead giveaway that the categories overlap, so a pie chart would be
a bad choice.

Histograms and Frequency Polygons


When data are organized into grouped frequency distributions, two types of graphs are
commonly used to represent them: histograms and frequency polygons.

A histogram is similar to a vertical bar graph in that the heights of the bars correspond to
frequencies. The difference is that class limits are placed on the horizontal axis, rather than
categories.

Example: Drawing a Histogram for Grouped Data

Earlier in this module, we analyzed and organized data representing the record high temperature in
every state. It would therefore seem reasonable, if not expected, for us to illustrate that data
graphically. So that’s just what we’ll do. Below is a grouped frequency distribution for the data.
Use it to draw a histogram.

Class Frequency

100-104 3

105-109 8

110-114 16

115-119 13

120-124 7

125-129 2

130-134 1

SOLUTION
Step 1 Write the scale for the frequencies on the vertical axis and the class limits on the
horizontal axis. Make sure that your labeling on the vertical axis starts at zero.
Step 2 Draw vertical bars with heights that correspond to the frequencies for each class.

Frequency Polygons
A frequency polygon is similar to a histogram, but instead of bars, a series of line segments is
drawn connecting the midpoints of the classes. The heights of those points match the heights of
the bars in a histogram.

Example: Drawing a Frequency Polygon

Draw a frequency polygon for the record high temperature distribution used in the histogram example.

Class Frequency

100-104 3

105-109 8

110-114 16

115-119 13

120-124 7

125-129 2

130-134 1

SOLUTION
Step 1 Find the midpoints for each class. This is accomplished by adding the upper and lower
limits and dividing by 2. For the first two classes, we get:

$$\frac{100 + 104}{2} = 102 \qquad \frac{105 + 109}{2} = 107$$

The remaining midpoints are 112, 117, 122, 127, and 132.

Step 2 Write the scale for the frequencies on the vertical axis (making sure to start at zero), and
label a scale on the horizontal axis so that all midpoints will be included.
Step 3 Plot points at the midpoints with heights matching the frequencies for each class, then
connect those points with straight lines.
Step 4 Finish the graph by drawing a line back to the horizontal axis at the beginning and end.
The horizontal distance to the axis should equal the distance between the midpoints. In this case,
that distance is 5, so we extend back to 97 and forward to 137. See the figure on the next slide.
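The points needed for the polygon can be listed with a short sketch. The names are illustrative, and the code only computes coordinates; plotting is left to graph paper or whatever tool you prefer.

```python
# (lower limit, upper limit, frequency) for each class.
classes = [
    (100, 104, 3), (105, 109, 8), (110, 114, 16), (115, 119, 13),
    (120, 124, 7), (125, 129, 2), (130, 134, 1),
]

# Step 1: midpoint of each class.
midpoints = [(lo + hi) / 2 for lo, hi, _ in classes]

# Step 4: the polygon returns to the axis one midpoint spacing
# before the first midpoint and one spacing after the last.
spacing = midpoints[1] - midpoints[0]
points = (
    [(midpoints[0] - spacing, 0)]
    + [(m, f) for m, (_, _, f) in zip(midpoints, classes)]
    + [(midpoints[-1] + spacing, 0)]
)
print(points)
```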

Example: Identifying the Value of Frequency Polygons

These two grouped frequency distributions come from data gathered in two calculus classes in
Spring of 2015. One is for percentage on the final exam; the other is for overall percentage in the
course.

Final Exam Percentage

Percentage   Frequency
40-49.9              3
50-59.9              2
60-69.9             13
70-79.9             14
80-89.9             14
90-99.9             12

Overall Percentage

Percentage   Frequency
40-49.9              0
50-59.9              1
60-69.9              7
70-79.9             20
80-89.9             19
90-99.9             11

(a) Draw a frequency polygon for each distribution. Put them on the same axes.
(b) What can you learn about final exam percentage vs. overall percentage from the polygons?
(c) What would happen if you tried to draw a histogram for each on the same axes?
SOLUTION
(a) The midpoints for the classes are 44.95, 54.95, 64.95, 74.95, 84.95, and 94.95. The two
frequency polygons are shown on the next slide.

III. Measures of Average

Measures of Average
In casual terms, average means the most typical case, or the center of the distribution. Measures
of average are also called measures of central tendency, and include the mean, median, mode,
and midrange.

Mean
The Greek letter Σ (sigma) is used to represent the sum of a list of numbers. If we use the letter X
to represent data values, then ΣX means to find the sum of all values in a data set.
The mean is the sum of the values in a data set divided by the number of values. If $X_1, X_2, X_3, \ldots, X_n$
are the data values, we use $\bar{X}$ to stand for the mean, and

$$\bar{X} = \frac{X_1 + X_2 + X_3 + \cdots + X_n}{n} = \frac{\sum X}{n}$$

Example: Finding the Mean of a Data Set

Here’s the salary list for Vandelay Industries:


Employee   Salary
Jerry      $58,000
Kramer     $65,000
Newman     $944,000
George     $20,000
Elaine     $52,000
Susan      $51,000
Tim        $53,000
Estelle    $55,000
Frank      $50,000
Find the mean of all salaries for Vandelay Industries.

The company advertises that its average employee makes almost $150,000 per year. Is the
company’s claim technically truthful? Do you think it’s deceiving? Explain.

SOLUTION

The company has nine employees, so we need to add all the salaries and then divide the sum by
9.

$$\bar{X} = \frac{58 + 65 + 944 + 20 + 52 + 51 + 53 + 55 + 50}{9} = \frac{1{,}348}{9} \approx 149.8$$

All of the salaries were whole numbers of thousands, so it was easier to just add the number of
thousands and divide by 9.

The result of 149.8 tells us that the mean salary is about 149.8 thousand dollars, or $149,800.

The claim is, in fact, truthful—provided that by “average” you mean “mean.” But is it deceiving?
You bet it is! There’s only one person in the company that makes more than $65,000 per year—
the owner (Newman) who pays himself a handsome salary of $944,000. Given that we want
measures of average to describe a most typical case, $149,800 certainly doesn’t fit that bill.

Median

In short, the median of a data set is the value in the middle if all values are arranged in order. The
median will either be a specific data value in the set, or will fall in between two values.

Steps in Computing the Median of a Data Set

Step 1 Arrange the data in order, from smallest to largest. Actually, largest to smallest will work,
too. Whatever makes you happy.

Step 2 If the number of data values is odd, the median is the value in the exact middle of the list.
If the number of data values is even, the median is the mean of the two middle data values.

Example: Finding the Median of a Data Set

(a) Find the median salary for Vandelay Industries. How does it compare to the mean?

(b) Find the mean and median if Newman’s salary is left out. What can you conclude?

SOLUTION

(a) First, we need to arrange the salaries in order:

$20,000, $50,000, $51,000, $52,000, $53,000, $55,000, $58,000, $65,000, $944,000

There are nine salaries listed, and where I come from, nine is odd. So the median will be the
salary right in the middle: there will be four salaries less and four more. That makes it the fifth
salary on the list, which is $53,000. This is a whole lot less than the mean of $149,800, and in
fact is a much more reasonable measure of average for these data.
(b) Here’s the ordered list if we leave off Newman’s gigantic salary:
$20,000, $50,000, $51,000, $52,000, $53,000, $55,000, $58,000, $65,000
Now there are eight salaries, so we’ll need to find the mean of the two in the middle,
which are $52,000 and $53,000. It would be nice if you could just figure out that the mean is
halfway in between, but for the sake of completeness:
$$\frac{\$52{,}000 + \$53{,}000}{2} = \$52{,}500$$

So now the median is $52,500.


The new mean is

$$\frac{\$20{,}000 + \$50{,}000 + \$51{,}000 + \$52{,}000 + \$53{,}000 + \$55{,}000 + \$58{,}000 + \$65{,}000}{8} = \$50{,}500$$

Now that’s interesting. The median was almost unaffected by throwing away the largest value,
but the mean changed dramatically, to say the least. This is exactly why the mean was a poor
measure of average for this data set: the one very large value has a great impact on the mean, but
not so much on the median.
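The comparison is easy to reproduce with Python's standard `statistics` module. This is a sketch; the dictionary and variable names are illustrative.

```python
from statistics import mean, median

# Salary list for Vandelay Industries, in dollars.
salaries = {
    "Jerry": 58000, "Kramer": 65000, "Newman": 944000,
    "George": 20000, "Elaine": 52000, "Susan": 51000,
    "Tim": 53000, "Estelle": 55000, "Frank": 50000,
}

everyone = list(salaries.values())
without_newman = [pay for name, pay in salaries.items() if name != "Newman"]

print("All nine:       mean", round(mean(everyone)), "median", median(everyone))
print("Without Newman: mean", mean(without_newman), "median", median(without_newman))
```

Dropping the one extreme salary moves the mean by nearly $100,000 but the median by only $500.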
Midrange
The advantage of the midrange is that it’s very quick and easy to calculate. The disadvantage is
that it totally ignores most of the data values, so it’s not a particularly reliable measure.
Finding the Midrange for a Data Set

$$\text{Midrange} = \frac{\text{lowest value} + \text{highest value}}{2}$$

Example: Finding the Midrange of a Data Set

Find the midrange of all salaries at Vandelay Industries. Is it meaningful in this case?
SOLUTION
It’s not necessary to put a data set in order to find the midrange, but it sure doesn’t hurt. All we
need to know is the lowest and highest salaries, and since we already ordered the list in the median
example, it’s easy to see that those are $20,000 and $944,000. So the midrange is

$$\frac{\$20{,}000 + \$944{,}000}{2} = \$482{,}000$$

Wow. The midrange is a whopping $482,000, which is meaningful in that it emphasizes how big
Newman’s salary is, but as a measure of average it’s not good for much.
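The standard `statistics` module has no midrange function, so a one-line helper does the job. The name `midrange` below is our own, written as a sketch.

```python
def midrange(values):
    """Return the average of the lowest and highest values."""
    return (min(values) + max(values)) / 2


# Salaries at Vandelay Industries, in dollars (order does not matter).
salaries = [58000, 65000, 944000, 20000, 52000, 51000, 53000, 55000, 50000]
print(midrange(salaries))
```

Only two of the nine values affect the result, which is why a single extreme salary can dominate it.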

Mode
The mode is sometimes said to be the most typical case. The value that occurs most often in a
data set is called the mode. A data set can have more than one mode or no mode at all.

Example: Finding the Mode of a Data Set


These data represent the duration (in days) of the final 20 U.S. space shuttle voyages. Find the
mode.
11, 12, 13, 12, 15, 12, 15,
13, 15, 12, 12, 15, 13, 10,
13, 15, 11, 12, 15, 12
SOLUTION
If we construct a frequency distribution, it will be easy to find the mode: it’s simply the value
with the greatest frequency. The frequency distribution for the data is shown below, and the
mode is 12.

Days Frequency

10 1

11 2

12 7

13 4

15 6

Example: Finding the Mode of a Data Set

The number of Atlantic hurricanes for each of the years from 1997–2016 is shown in the list.
Find the mode, and describe what that tells you.
3, 10, 8, 8, 9, 4, 7, 9, 15, 5,
6, 8, 3, 12, 7, 10, 2, 6, 4, 7
SOLUTION
This time, we’ll find the mode without making a frequency distribution. Instead, we can just
work down the list, counting the number of occurrences for each number of hurricanes. It turns
out that there are two numbers that appear three times, while no others appear more than twice.
Those numbers are 7 and 8, so this data set has two modes. This means that over that 20-year
span, the most common numbers of Atlantic hurricanes were 7 and 8.
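Both mode examples can be checked with a small helper that returns every value tied for the highest frequency. This is a sketch: the function name `modes` is our own, and treating a data set where every value occurs once as having no mode is a convention we assume here.

```python
from collections import Counter


def modes(values):
    """Return all values tied for the highest frequency, in sorted order."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []  # assumed convention: no repeated value means no mode
    return sorted(value for value, count in counts.items() if count == top)


shuttle_days = [11, 12, 13, 12, 15, 12, 15, 13, 15, 12,
                12, 15, 13, 10, 13, 15, 11, 12, 15, 12]
hurricanes = [3, 10, 8, 8, 9, 4, 7, 9, 15, 5,
              6, 8, 3, 12, 7, 10, 2, 6, 4, 7]

print(modes(shuttle_days))  # one mode
print(modes(hurricanes))    # two modes
```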

Example: Finding the Mode for Categorical Data

A survey of the junior class at Fiesta State University shows the following number of students
majoring in each field. Find the mode.
Business 1,425
Liberal arts 878
Computer science 632
Education 471
General studies 95

You have to be a little careful here. If you focus on the numbers, you might conclude that there’s
no mode, since they’re all different. But that would be missing the point. The mode is supposed
to be the most typical case. Here, the most typical major is the one with the most students: that’s
business, so that’s the mode.

Mean for Grouped Data


The procedure for finding the mean for grouped data uses the midpoints and the frequencies of
the classes. This procedure will give only an approximate value for the mean, and it is used when
the data set is very large or when the original raw data are unavailable but have been grouped by
someone else.

Finding the Mean for Grouped Data

Step 1: Find the midpoint of each class in the grouped data.

Step 2: Multiply the frequency for each class by the midpoint of that class.

Step 3: Add up all of the products from step 2.

Step 4: Divide by the sum of all frequencies (which is the total number of data values).

If you prefer formulas to procedures:

$$\bar{X} = \frac{\sum (f \cdot X_m)}{n}$$

where f is the frequency for each class, Xm is the midpoint of each class, and n is the sum of all
frequencies.

Example: Finding the Mean for Grouped Data

Find the mean record high temperature for the 50 states.

Class Frequency

100-104 3

105-109 8

110-114 16

115-119 13

120-124 7

125-129 2

130-134 1

SOLUTION
First, we’ll need the midpoint for each class.
Since we’ll need to multiply by the frequencies, it’s convenient to make a new table with the
midpoints and frequencies, then multiply them.

We’ll also need the sum of those products and of the frequencies.

Class     Midpoint   Frequency   Midpoint × Frequency
100-104      102          3                306
105-109      107          8                856
110-114      112         16              1,792
115-119      117         13              1,521
120-124      122          7                854
125-129      127          2                254
130-134      132          1                132
Sums                     50              5,715

To get our mean, we divide the sum of the products by the sum of the frequencies:
$$\bar{X} = \frac{5{,}715}{50} = 114.3$$

The mean state record high temperature is about 114.3°F.
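The four steps for grouped data translate directly into code. The sketch below uses illustrative names and, like the hand calculation, gives only an approximation because it replaces each value with its class midpoint.

```python
# (lower limit, upper limit, frequency) for each class of record highs.
classes = [
    (100, 104, 3), (105, 109, 8), (110, 114, 16), (115, 119, 13),
    (120, 124, 7), (125, 129, 2), (130, 134, 1),
]

# Steps 1 and 2: midpoint of each class times its frequency.
products = [((lo + hi) / 2) * f for lo, hi, f in classes]

# Step 3: add the products. Step 4: divide by the sum of the frequencies.
n = sum(f for _, _, f in classes)
grouped_mean = sum(products) / n
print(f"n = {n}, sum of products = {sum(products):,.0f}, mean = {grouped_mean:.1f}")
```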

IV. Measures of Variation
In this section we will study measures of variation, which will help to describe how the data
within a set vary. The three most commonly used measures of variation are range, variance, and
standard deviation.

Range
The range of a data set is the difference between the highest and lowest values in the set.
Range = Highest value – lowest value

Example: Finding the Range of a Data Set

The first list below is the weights of the dogs in the first picture, and the second is the weights of
the dogs in the second picture. Find the mean, median, and range for each list, then describe any
observations you can make based on the results.
1st: 70, 73, 58, 60
2nd: 30, 85, 40, 125, 42, 75, 60, 55

SOLUTION

For the first list,

$$\text{Mean} = \frac{70 + 73 + 58 + 60}{4} = 65.25 \text{ lb}$$

The ordered list is 58, 60, 70, 73, so the median is the mean of 60 and 70, which is 65 lb.
Range = 73 - 58 = 15 lb

For the second list,

$$\text{Mean} = \frac{30 + 85 + 40 + 125 + 42 + 75 + 60 + 55}{8} = 64 \text{ lb}$$

The ordered list is 30, 40, 42, 55, 60, 75, 85, 125, so the median is the mean of 55 and 60, which
is 57.5 lb.
Range = 125 - 30 = 95 lb
As we suspected, the means and medians are similar, but the ranges are very different. This is
reflective of the fact that there’s a lot more variation in size among the dogs in the second
picture.
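A short sketch reproduces the three statistics for both lists. The helper name `summarize` is our own.

```python
from statistics import mean, median


def summarize(weights):
    """Return (mean, median, range) for a list of weights in pounds."""
    return mean(weights), median(weights), max(weights) - min(weights)


first = [70, 73, 58, 60]
second = [30, 85, 40, 125, 42, 75, 60, 55]

print("1st:", summarize(first))
print("2nd:", summarize(second))
```

The means and medians land within a few pounds of each other, while the ranges differ by a factor of more than six.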

Variance and Standard Deviation


If most of the values are similar, but there’s just one unusually high value, the range will make it
look like there’s a lot more variation than there actually is.

For this reason, we will next define variance and standard deviation, which are much more
reliable measures of variation.

Procedure for Finding the Variance and Standard Deviation

Step 1 Find the mean.

Step 2 Subtract the mean from each data value in the data set.

Step 3 Square the differences.

Step 4 Find the sum of the squares.

Step 5 Divide the sum by n – 1 to get the variance, where n is the number of data values.

Step 6 Take the square root of the variance to get the standard deviation.

Example: Finding Variance and Standard Deviation


Find the variance and standard deviation for the weights of the eight dogs in the second picture at
the beginning of this section. The weights are listed again for reference.
30, 85, 40, 125, 42, 75, 60, 55

SOLUTION
Step 1 Find the mean weight.
We found the mean of 64 lb in the range example above.
Step 2 Subtract the mean from each data value.
30 - 64 = -34, 85 - 64 = 21, 40 - 64 = -24, 125 - 64 = 61,
42 - 64 = -22, 75 - 64 = 11, 60 - 64 = -4, 55 - 64 = -9
Step 3 Square each result.
$(-34)^2 = 1{,}156$, $(21)^2 = 441$, $(-24)^2 = 576$, $(61)^2 = 3{,}721$,
$(-22)^2 = 484$, $(11)^2 = 121$, $(-4)^2 = 16$, $(-9)^2 = 81$
Step 4 Find the sum of the squares.
1,156 + 441 + 576 + 3,721 + 484 + 121 + 16 + 81 = 6,596
Step 5 Divide the sum by n - 1 to get the variance, where n is the sample size. In this case, n is 8,
so n - 1 = 7.

$$\text{Variance} = \frac{6{,}596}{7} \approx 942.3$$

Step 6 Take the square root of the variance to get standard deviation.

$$\text{Standard Deviation} = \sqrt{942.3} \approx 30.7 \text{ lb}$$

To organize the steps, you might find it helpful to make a table with three columns: the original
data, the difference between each data value and the mean, and their squares. Then you just add
the entries in the last column and divide by n - 1 to get the variance.

Data (X)   $X - \bar{X}$   $(X - \bar{X})^2$
   30          -34             1,156
   85           21               441
   40          -24               576
  125           61             3,721
   42          -22               484
   75           11               121
   60           -4                16
   55           -9                81
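The six-step procedure can be followed line by line in code. This is a sketch with illustrative variable names; it divides by n - 1, as the procedure does for a sample.

```python
from math import sqrt

weights = [30, 85, 40, 125, 42, 75, 60, 55]  # dog weights in pounds
n = len(weights)

mean_weight = sum(weights) / n                  # Step 1: find the mean
differences = [x - mean_weight for x in weights]  # Step 2: subtract the mean
squares = [d ** 2 for d in differences]         # Step 3: square the differences
sum_of_squares = sum(squares)                   # Step 4: add the squares
variance = sum_of_squares / (n - 1)             # Step 5: divide by n - 1
std_dev = sqrt(variance)                        # Step 6: take the square root

# One table row per data value: X, X - mean, (X - mean)^2.
for x, d, sq in zip(weights, differences, squares):
    print(f"{x:>4} {d:>6.0f} {sq:>8,.0f}")
print(f"variance = {variance:.1f}, standard deviation = {std_dev:.1f} lb")
```

Python's `statistics.variance` and `statistics.stdev` use the same n - 1 divisor, so they can serve as an independent check.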

Standard Deviation

To understand the significance of standard deviation, we’ll look at the process one step at a time.

Step 1 Compute the mean. Variation is a measure of how far the data vary from the mean, so it
makes sense to begin there.

Step 2 Subtract the mean from each data value. In this step, we are literally calculating how far
away from the mean each data value is. The problem is that since some are greater than the mean
and some less, their sum will always add up to zero. (Try it!) So that doesn’t help much.

Step 3 Square the differences. This solves the problem of those differences adding to zero—
when we square them, they’re all positive.

Step 4 Add the squares. In the next two steps, we’re getting an approximate average of the
squares of the individual variations from the mean. First we add them, then…

Step 5 Divide the sum by n − 1. It seems like dividing by the number of values (n) here is a good
idea, but it turns out that when we’re using a sample from a larger population to compute mean
and variance, dividing by n − 1 makes the sample variance more likely to be a true reflection of
the population variance. In any case, at this point we have an approximate average of the squares
of the individual variations from the mean.

Step 6 Take the square root of the result from Step 5 (the variance). This “undoes” the square we
did in Step 3. It will return the units of our answer to the units of the original data, giving us a
good measure of how far the typical data value varies from the mean.

The sample variance for a data set is an approximate average of the square of the distance
between each value and the mean.

If X represents individual values, X̄ is the mean, and n is the number of values:

$$s^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$$

The sample standard deviation (s) is the square root of the variance. It provides an approximate
average of the distances between data values and the mean.
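To see what dividing by n − 1 instead of n does in practice, here is a short sketch using Python's standard statistics module. The data list is the same set of eight weights used earlier in this lesson; treating it as a sample rather than a whole population is our assumption.

```python
import statistics

data = [30, 85, 40, 125, 42, 75, 60, 55]

sample_var = statistics.variance(data)        # divides the sum of squares by n - 1
population_var = statistics.pvariance(data)   # divides the sum of squares by n
sample_sd = statistics.stdev(data)            # square root of the sample variance

print(f"sample variance     = {sample_var:.1f}")
print(f"population variance = {population_var:.1f}")
print(f"sample std. dev.    = {sample_sd:.1f}")
```

The sample variance comes out a little larger than the population version, and the gap shrinks as n grows.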

Example: Interpreting Standard Deviation

The mean and standard deviation for heights of all adult males in the United States are 69.3
inches and 2.8 inches, respectively. The mean and standard deviation for the 2016–2017
Cleveland Cavaliers (a professional basketball team), on the other hand, were 78.9 inches and
3.75 inches. What can we conclude from comparing these statistics?
SOLUTION
There are two main things we can learn from these comparisons. First, the players on this
professional basketball team tend to be a lot taller than average people. If you’ve ever watched a
basketball game, this comes as no surprise. Second, the heights are spread out a bit more than the
population in general. This could be due to the small sample size of 14 players, but it might also
be because there are some basketball players that are good players because of speed and
athleticism even though they’re not that tall, while others are successful to some extent because
of their extreme height. In order to draw more reliable conclusions, we’d probably need to look
at a much larger sample of pro basketball players.

Example: Interpreting Standard Deviation

A professor has two sections of Math 115 this semester. The 8:30 A.M. class has a mean score of
74% with a standard deviation of 3.6%. The 2 P.M. class also has a mean score of 74%, but a
standard deviation of 9.2%. What can we conclude about the students’ averages in these two
sections?
SOLUTION
In relative terms, the morning class has a small standard deviation and the afternoon class has a
large one. So even though they have the same mean, the classes are quite different. In the
morning class, most of the students probably have scores relatively close to the mean, with few
very high or very low scores. In the afternoon class, the scores vary more widely, with a lot of
high scores and a lot of low scores that average out to a mean of 74%.

V. The Normal Distribution


A wide variety of quantities in the real world, like sizes of individuals in a population, IQ scores,
and many others, tend to exhibit the same pattern: the largest number of data values fall somewhere
in the middle of the range, and the classes farther away from the center have smaller frequencies.
In fact, this pattern is so common that frequency distributions of this type came to be known as
normal distributions.

A normal distribution is a continuous, symmetric, bell-shaped distribution.

A probability distribution whose values are spread symmetrically, with most of the results
situated around the mean, is called a normal distribution. Values are equally likely to fall above
or below the mean. The values cluster close to the mean and then tail off symmetrically away from
it.

Some Properties of a Normal Distribution

1. The value in the middle of the distribution, which appears most often in the sample, is the
mean.

2. The distribution is symmetric about the mean. This means that the graph has two halves that
are mirror images on either side of the mean value.

3. This is the key fact: the area under any portion of the curve is the percentage (in decimal form)
of data values that fall between the values that begin and end that region.

4. The total area under the entire curve is 1.

Example: Getting Information from a Normal Curve

The graph below shows a normal distribution for heights of women in the United States. The
numbers on the horizontal axis are heights in inches, and some areas are labeled for reference.

(a) What is the mean height?
(b) What percentage of women are between 57.4 and 59.1 inches tall?

(c) If there are 31,806 women at a stadium concert, how many of them would you expect to be
between 63.7 and 66.0 inches tall?
SOLUTION
(a) The mean is the value in the very center of a normal distribution, and it corresponds to the
value that appears most often. This would be the highest point on the graph, which is labeled
63.7. So the mean height for American women is 63.7 inches.
(b) This is what the third property listed above is about. The diagram indicates that the area
under the normal graph between 57.4 and 59.1 is 0.034. This is the decimal form of the
percentage of data values that fall in that range. Converting 0.034 to percent form by moving the
decimal point two places right, we get 3.4%. So we’d expect that about 3.4% of women would
have heights in that range.
(c) This is the same question with a twist. It asks for the number of women, not a percentage.
But we can find that number by first finding a percentage. In this case, the area under that portion
of the graph is 0.303, so we’d expect 30.3% of women to have heights between 63.7 and 66.0
inches. In particular, we’d expect 30.3% of the women at the concert to have a height in that
range.
30.3 % of 31,806 = 0.303 × 31,806 = 9,637.218
We’re talking about human beings here, not leftover body parts, so the .218 part is irrelevant.
We’d expect about 9,637 women to be between 63.7 and 66.0 inches tall.
The Empirical Rule
When data are normally distributed, approximately 68% of the values are within 1 standard
deviation of the mean, approximately 95% are within 2 standard deviations of the mean, and
approximately 99.7% are within 3 standard deviations of the mean.
The figure below illustrates the Empirical Rule with X̄ = mean and s = standard deviation.

Example: Using the Empirical Rule

According to the website answerbag.com, the mean height for male humans is 5 feet 9.3 inches,
with a standard deviation of 2.8 inches. If this is accurate, out of 1,000 randomly selected men,
how many would you expect to be between 5 feet 6.5 inches and 6 feet 0.1 inch?
SOLUTION
The given range of heights corresponds to those within 1 standard deviation of the mean, so we
would expect about 68% of men to fall in that range. In this case, we expect about 680 men to be
between 5 feet 6.5 inches and 6 feet 0.1 inch.
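The same reasoning works for 2 and 3 standard deviations. The sketch below is our own illustration in Python: it converts the mean to inches (5 feet 9.3 inches = 69.3 inches) and uses the approximate percentages from the Empirical Rule, so the counts are estimates, not exact values.

```python
mean = 69.3   # 5 feet 9.3 inches, expressed in inches
s = 2.8       # standard deviation in inches
men = 1000    # size of the random group in the example

# Approximate share of data within 1, 2, and 3 standard deviations of the mean
empirical_rule = {1: 0.68, 2: 0.95, 3: 0.997}

for k, share in empirical_rule.items():
    low, high = mean - k * s, mean + k * s
    print(f"within {k} sd: {low:.1f} to {high:.1f} inches, about {share * men:.0f} men")
```

The first line of output corresponds to the range in the example: 66.5 inches is 5 feet 6.5 inches, and 72.1 inches is 6 feet 0.1 inch.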

The Standard Normal Distribution


The standard normal distribution is a normal distribution with mean 0 and standard deviation 1.
The values under the curve shown indicate the proportion of area in each section.

z Score
For a data value X from a sample with mean X̄ and standard deviation s, the z score is

$$z = \frac{X - \bar{X}}{s}$$
Verbally, to find a z score, just subtract the mean, then divide the result by the standard
deviation.

A couple of notes:
A data point is greater than the mean if z > 0 and less than the mean if z < 0. z scores are
typically rounded to two decimal places.

Example: Computing a z score

Based on the information in the previous example, find the z score for a man who is 6 feet 4
inches tall and describe what it tells us.
SOLUTION
Use the formula for z scores with mean 5 feet 9.3 inches and standard deviation 2.8 inches. Note
that we converted the heights to inches to make it easier to subtract.

$$z = \frac{76\ \text{in} - 69.3\ \text{in}}{2.8\ \text{in}} \approx 2.39$$

This means that 6′4″ is about 2.4 standard deviations above the mean.

Example: Using z Scores to Compare Standardized Test Scores

As you probably know, there are two main companies that offer standardized college entrance
exams, ACT and SAT. Since each has a completely different scoring scale, it’s really difficult to
compare the scores of students that took different exams. One year the ACT had a mean score of
21.2 and a standard deviation of 5.1. That same year, the SAT had a mean score of 1498 and a
standard deviation of 347. Suppose that a scholarship committee is considering two students, one
who scored 26 on the ACT and another who scored 1800 on the SAT. Both are pretty good
scores, but which one is better?
SOLUTION
This is not an easy question because the two tests have completely different scales. But because
we know the mean and standard deviation for each, we can compute z scores. This will allow us
to decide which student did better relative to the others who took the same test.
26 on the ACT:

$$z = \frac{26 - 21.2}{5.1} \approx 0.94$$

1800 on the SAT:

$$z = \frac{1800 - 1498}{347} \approx 0.87$$
The student with 26 on the ACT did better. He or she is 0.94 standard deviations above the
mean, while the student who scored 1800 on the SAT is 0.87 standard deviations above the
mean.
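The comparison can be sketched in Python as follows. The function name z_score is our own choice for illustration; the means and standard deviations are the ones given in the example.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / sd

act = z_score(26, 21.2, 5.1)      # student who took the ACT
sat = z_score(1800, 1498, 347)    # student who took the SAT

print(f"ACT z = {act:.2f}, SAT z = {sat:.2f}")
if act > sat:
    print("The ACT score is better relative to the other test takers.")
else:
    print("The SAT score is better relative to the other test takers.")
```

Because z scores have no units, the same function works for any pair of data sets, no matter how different their scales are.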

z Score and Area

The value of z scores is that they will allow us to find areas under a normal curve using only
areas under a standard normal curve, which can be read from a table, like this one.
Area under a Normal Distribution curve between z = 0 and a positive z value

z       A        z       A        z       A        z       A

0.00    0.000    0.20    0.079    0.40    0.155    0.60    0.226
0.05    0.020    0.25    0.099    0.45    0.174    0.65    0.242
0.10    0.040    0.30    0.118    0.50    0.192    0.70    0.258
0.15    0.060    0.35    0.137    0.55    0.209    0.75    0.273

An area table with more values can be found in Table 11-3 and Appendix A of the text.
Two Important Facts about the Standard Normal Curve
1. The area under any normal curve is divided into two equal halves at the mean. Each of
the halves has area 0.500.
2. The area between z = 0 and a positive z score is the same as the area between z = 0 and
the negative of that z score.
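If no table is at hand, the same areas can be computed. The sketch below uses NormalDist from Python's standard statistics module; the helper name area_from_zero is hypothetical. Computed values can differ from a printed table by 0.001 because tables are rounded.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def area_from_zero(z):
    """Area under the standard normal curve between 0 and z.

    The sign of z is ignored because the curve is symmetric about 0 (fact 2).
    """
    # cdf gives the area to the left of |z|; subtract the left half (fact 1)
    return standard_normal.cdf(abs(z)) - 0.5

for z in (0.20, 0.40, 0.60, 0.75):
    print(f"z = {z:.2f}  A = {area_from_zero(z):.3f}")
```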

Example: Finding the Area between Two z scores

Find the area under the standard normal distribution


(a) Between z = 1.55 and z = 2.25.
(b) Between z = −0.60 and z = −1.35.
(c) Between z = 1.50 and z = −1.75.
SOLUTION
(a) Draw the picture, label z scores, and shade the requested area.

Using Table 11-3, the area between z = 0 and z = 2.25 is 0.488, and the area between z =
0 and z = 1.55 is 0.439.

The area we are looking for is the larger area minus the smaller:
0.488 - 0.439 = 0.049
The area between z = 1.55 and z = 2.25 is 0.049.

(b) Draw the picture, label z scores, and shade the requested area.
All of the z scores in the table are positive, but the areas for negative z scores are exactly
the same as for the corresponding positive z scores.
So we can get the area between z = 0 and z = −1.35 by looking up z = 1.35 in Table 11-3, which
gives us 0.412. We can then get the area for z = −0.60 by looking up z = 0.60 in the table: the
result is 0.226.
Again, the area we’re looking for is the larger area minus the smaller:
0.412 - 0.226 = 0.186
The area between z = -0.60 and z = -1.35 is 0.186.

(c) Draw the picture, label the z scores, and shade the requested area.
In this case, the values we get from Table 11-3 give us the areas between z = 0 and the z
scores of −1.75 and 1.50.

The entire shaded area is the sum of these two areas, so we’ll add the two areas we get from the
table.
The area corresponding to z = −1.75 is 0.460, and the area corresponding to z = 1.50 is 0.433.
0.460 + 0.433 = 0.893.
The area between z = 1.50 and z = −1.75 is 0.893.
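All three cases reduce to one subtraction when we work with areas to the left of each z score. This is a sketch under that assumption, using Python's statistics.NormalDist; area_between is our own helper name. Since the table values are rounded to three places, the computed answers may differ from the table answers in the last digit.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1

def area_between(z1, z2):
    """Area under the standard normal curve between two z scores, given in either order."""
    low, high = sorted((z1, z2))
    # Area left of the larger z score minus area left of the smaller one
    return Z.cdf(high) - Z.cdf(low)

print(f"(a) {area_between(1.55, 2.25):.3f}")
print(f"(b) {area_between(-0.60, -1.35):.3f}")
print(f"(c) {area_between(1.50, -1.75):.3f}")
```

The subtract-or-add decision made by hand in parts (a) through (c) is handled automatically here, because areas to the left already account for which side of the mean each z score is on.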

Example: Finding the Area to the Right of a z Score

Find the area under the standard normal distribution


(a) To the right of z = 1.70.
(b) To the right of z = – 0.95.
SOLUTION

(a) Draw the picture, label the z score, and shade the area.
The area of the entire portion to the right of z = 0 is 0.500.
According to Table 11-3, the area of the portion between z = 0 and z = 1.70 is 0.455.

So the shaded portion is the difference between 0.500 and 0.455:


0.500 - 0.455 = 0.045
The area to the right of z = 1.70 is 0.045.
(b) Draw the picture, label the z score, and shade the area.
Using Table 11-3, we find that the area between z = 0 and z = −0.95 is 0.329.

This time, the area is the sum of 0.500 (the entire right half of the distribution) and the area from
the table:
0.500 + 0.329 = 0.829
The area to the right of z = -0.95 is 0.829.

Example: Finding the Area to the Left of a z Score

Find the area under the standard normal distribution
(a) To the left of z = −2.20.
(b) To the left of z = 1.95.
SOLUTION
(a) Draw the picture, label the z score, and shade the area. The shaded area is the entire left half
(0.500) minus the area between z = 0 and z = -2.2, which is 0.486 according to Table 11-3.
0.500 - 0.486 = 0.014
The area to the left of z = -2.2 is 0.014.

(b) Draw the picture, label the z score, and shade the area. Table 11-3 gives us an area of 0.474
between z = 0 and z = 1.95. Adding to the area of the left half (0.500), we get
0.500 + 0.474 = 0.974
The area to the left of z = 1.95 is 0.974.
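Areas to the left and to the right of a z score can be checked with a short Python sketch. The helper names area_left and area_right are ours; NormalDist comes from the standard statistics module.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution

def area_left(z):
    """Area under the standard normal curve to the left of z."""
    return Z.cdf(z)

def area_right(z):
    """Area to the right of z; the total area under the curve is 1."""
    return 1 - Z.cdf(z)

print(f"right of  1.70: {area_right(1.70):.3f}")
print(f"right of -0.95: {area_right(-0.95):.3f}")
print(f"left of  -2.20: {area_left(-2.20):.3f}")
print(f"left of   1.95: {area_left(1.95):.3f}")
```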

VI. APPLICATIONS OF THE NORMAL DISTRIBUTION
The standard normal distribution is a normal distribution with a mean of 0 and a standard
deviation of 1.
$$z = \frac{x - \bar{x}}{s}$$

Three basic types of problems ask for the area:
• To the left of any z value.
• To the right of any z value.
• Between any two z values.
(Look up both z values and subtract the corresponding areas)
Note: Table E gives the area under the normal distribution curve to the left of any z value given
in two decimal places.

Example: Solving a Problem Using the Normal Distribution

If the weights of Oreos in a package are normally distributed with mean 518 grams and standard
deviation 4 grams, find the percentage of packages that will weigh less than 510 grams.
Step 1 Draw the figure and represent the area.
Step 2 Find the z score for data value 510.

$$z = \frac{\text{value} - \text{mean}}{\text{standard deviation}} = \frac{510 - 518}{4} = -2$$
This tells us that 510 grams is 2 standard deviations below the mean.
Step 3 Find the shaded area using Table 11-3. The shaded area is the area of the left half (0.5)
minus the area between z = 0 and z = -2. The area between z = 0 and z = -2 is 0.477 (using z =
+2 from the table).
Area = 0.5 - 0.477 = 0.023

Step 4 Interpret the area. In this case, an area of 0.023 tells us that only 2.3% of packages will
have weights less than 510 grams.
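As a quick check on the four steps, here is a minimal Python sketch. It assumes, as the example does, that the weights are normally distributed; the variable names are our own.

```python
from statistics import NormalDist

# Package weights in grams: mean 518, standard deviation 4
weights = NormalDist(mu=518, sigma=4)

z = (510 - 518) / 4               # Step 2: z score for 510 grams
share_below = weights.cdf(510)    # Step 3: area to the left of 510 grams

# Step 4: interpret the area as a percentage of packages
print(f"z = {z}, about {share_below:.1%} of packages weigh less than 510 grams")
```

NormalDist works directly with the original mean and standard deviation, so the conversion to a z score is shown only to match Step 2.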
Probability and Area under a Normal Distribution
The area under a normal distribution between two data values is the probability that a randomly
selected data value is between those two values.

Example: Using Area under a Normal Distribution to Find Probabilities

Based on data in an EPA study released in 2016, the average American generates 1,606 pounds
of garbage per year. Let’s estimate that the number of pounds generated per person is
approximately normally distributed with standard deviation 200 pounds. Find the probability that
a randomly selected person generates
(a) Between 1,250 and 2,050 pounds of garbage per year.
(b) More than 2,050 pounds of garbage per year.
SOLUTION
(a) Step 1 Draw the figure and represent the area.
Step 2 Find the z scores for the two given weights.

1,250 pounds: $$z = \frac{1{,}250 - 1{,}606}{200} = -1.78$$

2,050 pounds: $$z = \frac{2{,}050 - 1{,}606}{200} = 2.22$$
Step 3 Find the shaded area. Using Table 11-3, the area between z = 0 and z = 2.22 is about
0.487, and the area between z = 0 and z = −1.78 is about 0.462. So the shaded area is 0.487 + 
0.462 = 0.949.

Step 4 Interpret the area. This tells us that if a person is selected at random, the probability that
he or she generates between 1,250 and 2,050 pounds of garbage per year is 0.949.
(b) Most of the work was already done in part a. The desired area is the area between z = 0 and z
= 2.22 subtracted from the area of the entire right half (0.5).
0.5 - 0.487 = 0.013
The probability that a randomly selected person generates more than 2,050 pounds of garbage
per year is 0.013.

Example: Finding Number in a Sample and Percentile Rank

The American Automobile Association reports that the average time it takes to respond to an
emergency call is 25 minutes. If the response time is approximately normally distributed and the
standard deviation is 4.5 minutes:
(a) If 80 calls are randomly selected, approximately how many will have response times less than
17 minutes?
(b) In what percentile is a response time of 30 minutes?
SOLUTION
(a) Step 1 Draw a figure and represent the area.
Step 2 Find the z value for 17.

$$z = \frac{\text{value} - \text{mean}}{\text{standard deviation}} = \frac{17 - 25}{4.5} \approx -1.78$$
Step 3 Find the shaded area. Using Table 11-3, we find that the area between z = 0 and z = −1.78
is approximately 0.462. The shaded area is the difference of the entire left half (0.5) and 0.462.
Area = 0.5 − 0.462  = 0.038

Step 4 Interpret the area. An area of 0.038 tells us that 3.8% of calls will have a response time of
less than 17 minutes. Multiply the percentage by the number of calls in our sample:
3.8% of 80 calls = (0.038) (80) = 3.04
About 3 calls in 80 will have response time less than 17 minutes.

(b) Step 1 Draw a figure and represent the area.
Remember that to find the percentile, we are interested in the percentage of times that are less
than 30 minutes.
Step 2 Find the z score for a response time of 30 minutes.

$$z = \frac{\text{value} - \text{mean}}{\text{standard deviation}} = \frac{30 - 25}{4.5} \approx 1.11$$
Step 3 Find the shaded area. The area to the left of z = 1.11 is the area of the left half (0.5) plus
the area between z = 0 and z = 1.11 (about 0.366 according to Table 11-3).
Area = 0.5 + 0.366 = 0.866

Step 4 Interpret the area. An area of 0.866 means that 86.6% of calls will get a response time of
less than 30 minutes, so the time of 30 minutes is in the 87th percentile.
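Both parts can be sketched in Python. This is only an illustration under the example's assumption that response times are approximately normal; the variable names are ours, and the computed areas may differ slightly from table values because the z scores are not rounded first.

```python
from statistics import NormalDist

response = NormalDist(mu=25, sigma=4.5)   # response time in minutes

# (a) Expected number of calls, out of 80, answered in under 17 minutes
p_under_17 = response.cdf(17)
expected_calls = 80 * p_under_17

# (b) Percentile rank of a 30-minute response time:
#     the percentage of response times that are less than 30 minutes
percentile_30 = 100 * response.cdf(30)

print(f"(a) about {expected_calls:.0f} calls")
print(f"(b) about the {percentile_30:.0f}th percentile")
```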

Example: Identifying Normally Distributed Data

A random sample of 25 entrées served by a college cafeteria is tested to find the number of
calories per serving, with the data in the list below.
845 460 620 752 683
1,088 785 575 580 755
720 650 512 945 672
526 1,050 725 822 740
773 812 880 911 910
(a) Draw a histogram for the data using seven classes. Do the data appear to be normally
distributed?
(b) Find the mean and standard deviation for the data.
(c) What percentage of randomly selected entrées from this cafeteria has more than 700 calories?

SOLUTION
(a) The lowest number of calories is 460 and the highest is 1,088, a range of 628 calories, so
seven classes of width 100 cover all of the data. A good choice of classes here is simply using
the number of hundreds: 400–499, 500–599, and so on.

The histogram doesn’t look exactly like a normal curve, but it’s close enough that it would be
reasonable to guess that the data are approximately normally distributed.
(b) You can find the mean and standard deviation by hand, but with 25 data values, using
technology seems like a good idea. We used the AVERAGE and STDEV commands in Excel,
but you could also use a graphing calculator. The results:
X́ = 751.6
s = 160.7
(c) Since we’re already using technology, let’s also use it to find an area under a normal curve
with mean 751.6 and standard deviation 160.7. We want the area to the right of 700. For the
calculator, we press 2nd VARS 2 to get the normalcdf command, then key in
700,1000000,751.6,160.7). For a spreadsheet, we use the command
=1-NORM.DIST(700,751.6,160.7,TRUE), where TRUE asks for the cumulative area to the left of 700.
In each case, we see that the area is about 0.626. That tells us that there’s a 62.6% chance
that an entrée chosen at random has more than 700 calories.
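For readers who prefer a programming language to a spreadsheet or calculator, all three parts can be reproduced with Python's standard library. This is a sketch, not the method used in the text; the text-based histogram and the variable names are our own.

```python
from collections import Counter
from statistics import NormalDist, mean, stdev

calories = [845, 460, 620, 752, 683,
            1088, 785, 575, 580, 755,
            720, 650, 512, 945, 672,
            526, 1050, 725, 822, 740,
            773, 812, 880, 911, 910]

# (a) Seven classes by hundreds: 400-499, 500-599, ..., 1000-1099
classes = Counter((c // 100) * 100 for c in calories)
for lower in sorted(classes):
    print(f"{lower:>4}-{lower + 99:<4} {'*' * classes[lower]}")

# (b) Sample mean and sample standard deviation (divides by n - 1)
x_bar = mean(calories)
s = stdev(calories)
print(f"mean = {x_bar:.1f}, s = {s:.1f}")

# (c) Area to the right of 700 under a normal curve with that mean and s
p_over_700 = 1 - NormalDist(x_bar, s).cdf(700)
print(f"P(more than 700 calories) = {p_over_700:.3f}")
```

The rows of asterisks give a sideways histogram: the longest row is the 700–799 class, with shorter rows on either side.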

VII. Correlation and Regression Analysis
Correlation
In statistics, correlation is a term that describes an attempt to determine if a relationship
exists between two sets of data. In order to decide if a relationship exists between two data sets,
we have to first gather data in such a way that values from each set can be paired together.

We can then plot a graph designating one set of data as the x variable or independent
variable and the other as the y variable or dependent variable.

Scatter Plot

A scatter plot is a graph of the ordered pairs (x, y) consisting of data from two data sets.

After the scatter plot is drawn, we can analyze the graph to see if there is a pattern. If there is a
noticeable pattern, such as the points falling in an approximately straight line, then a possible
relationship between the two variables may exist.

Example: Drawing a Scatter Plot

A medical researcher selects a sample of small hospitals in his state and hopes to discover if
there is a relationship between the number of beds and the number of personnel employed by the
hospital.

(a) Do you think there should be a relationship between these two data sets? Describe any
relationship you’d expect.

(b) Draw a scatter plot for the data shown. Does it look like there’s a relationship between the
data sets?

SOLUTION

(a) As a general rule, you would think that larger hospitals would have more beds in them, and
you’d also think that larger hospitals would employ more personnel. So it seems reasonable to
predict that fewer beds would go along with fewer personnel, and more beds would go along with
more personnel.
(b) First we need to draw and label the axes. The information in the table indicates that the
number of beds should go on the x axis, and the largest number of beds in the table is 95. So
we’ll label that axis to go from 0 up to 100. On the y axis, we’ll go from 0 to 400 or so because
the largest number of personnel is 366. Then we plot each point from the table.

We can see from the graph that there appears to be a connection. In general, the points progress
from lower left to upper right, which confirms our prediction: lower numbers of beds tend to go
with lower numbers of personnel, and higher numbers of beds tend to go with higher numbers of
personnel.
Analyzing the Scatter Plot
The types of patterns and corresponding relationships are:
1. A positive linear relationship exists when the points fall approximately in an ascending
straight line from left to right, and both the x and y values increase at the same time.

2. A negative linear relationship exists when the points fall approximately in a descending
straight line from left to right. The relationship then is as the x values are increasing, the y values
are decreasing.

3. A nonlinear relationship exists when the points fall in a curved line. The relationship is
described by the nature of the curve.

4. No relationship exists when there is no discernible pattern to the points.

THE CORRELATION COEFFICIENT
The correlation coefficient, represented by the letter r, is a number that describes how close
two data sets are to having a linear relationship.

Correlation coefficients range from -1 (perfect negative linear relationship) to +1 (perfect
positive linear relationship). The closer this number is to 1 in absolute value, the more likely it
is that the data sets are linearly related. A correlation coefficient close to zero indicates that
the data are most likely not linearly related.

Example: Estimating Correlation Coefficients

Based on our description of the correlation coefficient for two data sets, make an estimate of the
correlation coefficient for each scatter plot below.

SOLUTION
The first plot shows a positive linear relationship, so r should be close to 1.
The second plot shows a negative linear relationship, so r should be close to -1.
The third and fourth plots don’t show linear relationships. One has a clear pattern and one
doesn’t, but the correlation coefficient only describes whether there’s a linear relationship.
Neither of those plots has one, so both should have r close to zero.

Formula for Finding the Value of r:

$$r = \frac{n(\sum xy) - (\sum x)(\sum y)}{\sqrt{[n(\sum x^2) - (\sum x)^2][n(\sum y^2) - (\sum y)^2]}}$$

where
n = the number of data pairs
Σx = the sum of the x values
Σy = the sum of the y values
Σxy = the sum of the products of the x and y values for each pair
Σx² = the sum of the squares of the x values
Σy² = the sum of the squares of the y values

Example: Calculating a Correlation Coefficient

Find the correlation coefficient for the hospital data in the scatter plot example, and discuss
what you think it indicates.

No. of beds (x)    28    56    34    42    45    78    84    36    74    95
Personnel (y)      72   195    74   211   145   139   184   131   233   366

SOLUTION
Step 1 Make a table like the one below with the following headings: x, y, xy, x², and y².
Step 2 Find the values for the product xy, the values for x², the values for y², and place them in
the columns as shown. Then find the sums of the columns.
Step 3 Substitute into the formula and evaluate (note that there are 10 data pairs):

$$r = \frac{n(\sum xy) - (\sum x)(\sum y)}{\sqrt{[n(\sum x^2) - (\sum x)^2][n(\sum y^2) - (\sum y)^2]}}$$

$$= \frac{10(113{,}865) - (572)(1{,}750)}{\sqrt{[10(37{,}802) - (572)^2][10(372{,}814) - (1{,}750)^2]}}$$

$$= \frac{137{,}650}{\sqrt{(50{,}836)(665{,}640)}} \approx 0.748$$

x      y       xy        x²        y²

28     72      2,016     784       5,184
56     195     10,920    3,136     38,025
34     74      2,516     1,156     5,476
42     211     8,862     1,764     44,521
45     145     6,525     2,025     21,025
78     139     10,842    6,084     19,321
84     184     15,456    7,056     33,856
36     131     4,716     1,296     17,161
74     233     17,242    5,476     54,289
95     366     34,770    9,025     133,956

Σx = 572    Σy = 1,750    Σxy = 113,865    Σx² = 37,802    Σy² = 372,814
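The column sums and the value of r can be verified with a short Python sketch that follows the formula term by term. The function name correlation and the list names are our own.

```python
from math import sqrt

beds      = [28, 56, 34, 42, 45, 78, 84, 36, 74, 95]          # x values
personnel = [72, 195, 74, 211, 145, 139, 184, 131, 233, 366]  # y values

def correlation(xs, ys):
    """Correlation coefficient r computed from the five sums in the formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

r = correlation(beds, personnel)
print(f"r = {r:.3f}")
```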

Significance Levels
Statisticians have traditionally agreed that when we conclude that two data sets have a
relationship, we can be satisfied with that conclusion if there is either a 95% or 99% chance that
we’re correct. This corresponds to a 5% or 1% chance of being wrong. These percentages are
called significance levels.

Sample Size    5%      1%        Sample Size    5%      1%

4              .950    .990      17             .482    .606
5              .878    .959      18             .468    .590
6              .811    .917      19             .456    .575
7              .754    .875      20             .444    .561
8              .707    .834      21             .433    .549
9              .666    .798      22             .423    .537
10             .632    .765      23             .412    .526
11             .602    .735      24             .403    .515
12             .576    .708      25             .396    .505
13             .553    .684      30             .361    .463
14             .532    .661      40             .312    .402
15             .514    .641      60             .254    .330
16             .497    .623      120            .179    .234

Using Significant Values for the Correlation Coefficient


If | r | is greater than or equal to the value given in Table 11-4 for either the 5% or 1%
significance level, then we can reasonably conclude that the two data sets are linearly related.

Example: Deciding if a Correlation Coefficient is Significant

Determine if the correlation coefficient r = 0.748 is significant at the 5% level.


SOLUTION
Since the sample size is n = 10 and the 5% significance level is used, the value of | r | has to be
greater than or equal to the number found in Table 11-4 to be significant. In this case, | r | =
0.748; this is greater than 0.632, which we found in the table. Conclusion: the correlation is
significant at the 5% level, so we can reasonably conclude that the data sets are linearly related.

However, we can’t conclude that there is a relationship between the data sets at the 1%
significance level, since | r | would need to be greater than or equal to 0.765. In this example,
| r | = 0.748, which is less than 0.765.
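The table lookup can be written as a small Python sketch. The dictionary below copies only a few rows of the table of significant values (sample sizes 8 through 12) to keep the example short, and the function name is_significant is hypothetical.

```python
# Part of the table of significant values: sample size -> (5% value, 1% value)
CRITICAL_VALUES = {
    8:  (0.707, 0.834),
    9:  (0.666, 0.798),
    10: (0.632, 0.765),
    11: (0.602, 0.735),
    12: (0.576, 0.708),
}

def is_significant(r, n, level="5%"):
    """Return True if |r| meets or exceeds the table value for sample size n."""
    five_percent, one_percent = CRITICAL_VALUES[n]
    cutoff = five_percent if level == "5%" else one_percent
    return abs(r) >= cutoff

print(is_significant(0.748, 10, "5%"))   # compares 0.748 with 0.632
print(is_significant(0.748, 10, "1%"))   # compares 0.748 with 0.765
```

Taking the absolute value first means the same check works for negative correlation coefficients.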

Regression
Once we have concluded that there is a significant relationship between the two variables,
the next step is to find the equation of the regression line through the data points.

	The regression line is the line that passes through the points in such a way that the
overall distance from the points to the line is at a minimum. The regression line is also called
the line of best fit.

Recall from algebra that the equation of a line in the slope-intercept form is

y = mx + b,

where m is the slope and b is the y intercept.

In statistics, the equation of the regression line is written as

y = a + bx,

where a is the y intercept and b is the slope.

Formulas for Finding the Values of a and b for the Equation of the Regression Line

Slope:

$$b = \frac{n(\sum xy) - (\sum x)(\sum y)}{n(\sum x^2) - (\sum x)^2}$$

y intercept:

$$a = \frac{\sum y - b(\sum x)}{n}$$


Example: Finding a Regression Line

Find the equation of the regression line for the data in the previous Example.
SOLUTION
We already calculated the values needed for each formula when we found the correlation
coefficient for these data.
Substitute into the first formula to find the value for the slope, b.

$$b = \frac{n(\sum xy) - (\sum x)(\sum y)}{n(\sum x^2) - (\sum x)^2} = \frac{10(113{,}865) - (572)(1{,}750)}{10(37{,}802) - (572)^2} = \frac{137{,}650}{50{,}836} \approx 2.708$$

Substitute into the second formula to find the value for the y intercept (a) when b = 2.708.

$$a = \frac{\sum y - b(\sum x)}{n} = \frac{1{,}750 - 2.708(572)}{10} = \frac{201.024}{10} \approx 20.102$$

The equation of the regression line is y = 20.102 + 2.708x.
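The slope and intercept formulas translate directly into code. In this Python sketch the sums are the ones from the hospital data, and predict is a hypothetical helper. Note one assumption: the code keeps b unrounded when computing a, so its intercept (about 20.12) differs slightly from the 20.102 obtained by rounding b to 2.708 first.

```python
# Sums from the hospital data (10 data pairs)
n = 10
sum_x, sum_y = 572, 1750
sum_xy, sum_x2 = 113865, 37802

# Slope and y intercept of the regression line y = a + bx
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = (sum_y - b * sum_x) / n

def predict(x):
    """Predicted y value for a given x on the regression line."""
    return a + b * x

print(f"y = {a:.3f} + {b:.3f}x")
```

Calling predict(65), for instance, estimates the number of personnel for a hospital with 65 beds. Rounding only at the end, as the code does, is the usual way to avoid carrying rounding error from b into a.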
This image shows the regression line y = 20.102 + 2.708x in comparison to the original scatter
plot.

The Relationship between r and the Regression Line
Two things should be noted about the relationship between the value of r and the
regression line.

First, the value of r and the value of the slope, b, always have the same sign. Second, the
closer the value of r is to +1 or -1, the better the points will fit the line. In other words, the
stronger the relationship, the better the fit.

These graphs show the relationship between the correlation coefficient and the regression line.

Example: Using a Regression Line to Make a Prediction

Use the equation of the regression line found in the previous example to predict the approximate
number of personnel for a hospital with 65 beds.

SOLUTION
Substitute 65 for x into the equation y = 20.102 + 2.708x and find the value for y.
y = 20.102 + 2.708(65) = 196.122 (which can be rounded to 196)
We predict that a hospital with 65 beds will have about 196 personnel.

CORRELATION AND CAUSATION


Possible Relationships between Data Sets:
1. There is a direct cause-and-effect relationship between two variables. That is, x causes y.
2. There is a reverse cause-and-effect relationship between the variables. That is, y causes x.
3. The relationship between the variables may be caused by a third variable.
4. There may be a complexity of interrelationships among many variables.
5. The relationship might simply be coincidental.

Exercises:
1. Find the area under the normal curve that lies between the given values of z.
a. z = 0 and z = 2.37
b. z = 0 and z = -1.94
ASSIGNMENT
Answer numbers 1 - 5 on page 651 (Sobecki, D. (2019). Math in Our World. New York, NY:
McGraw-Hill Education.)

References

Nocon, R. (2018). Essential Mathematics for the Modern World. Quezon City: C & E
Publishing, Inc.

Sirug, W. (2014). Business Mathematics, rev. ed. Manila: Mindshapers Co.

Sobecki, D. (2019). Math in Our World. New York, NY: McGraw-Hill Education.

End of Module 5
