DATA MANAGEMENT
Introduction
Data management refers to a wide range of activities researchers perform as they collect,
arrange, describe, store, preserve, and often share materials generated by their research.
Many funding agencies and institutions require researchers to develop data management plans,
or DMPs, that outline their data management activities, and to disseminate their data through
repositories and archives.
CHECK-UP TEST
I. Statistics
Data are measurements or observations that are gathered for an event under study.
Statistics is the branch of mathematics that involves collecting, organizing, summarizing, and
presenting data and drawing general conclusions from the data.
When statistical studies are performed, we usually begin by identifying the population for
the study. A population consists of all subjects under study (e.g., all colleges in NDMU). More
often than not, it's not realistic to gather data from every member of a population, so we instead
gather data from a sample, a subset of the population.
Sampling Methods
We will study four basic sampling methods:
1. In order to obtain a random sample, each subject of the population must have an equal
chance of being selected.
2. A systematic sample is taken by numbering each member of the population and then
selecting every kth member, where k is a natural number. When using systematic sampling, it's
important that the starting number is selected at random. For example, taking every 5th member
from left to right in a numbered list gives the sample 13, 23, 26, 34, 23, and 12.
3. When an existing group of subjects that represents the population is used for a sample, it is
called a cluster sample.
4. Stratified sampling is the process of subdividing the population into subgroups, or strata, and
drawing members at random from each subgroup or stratum.
Given the population of a certain university below and a target sample size of 5,455, determine
the sample size for each subgroup (course).
Nursing 6,000
Accountancy 500
Management 2,000
Marketing 1,000
Education 2,500
Total 12,000
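One way to carry out this stratified allocation is proportionally: each course receives a share of the sample equal to its share of the population. A minimal Python sketch (the use of `round`, and the variable names, are our own choices):

```python
# Proportional stratified sampling: stratum sample size =
# (stratum size / population size) * target sample size.
population = {
    "Nursing": 6000,
    "Accountancy": 500,
    "Management": 2000,
    "Marketing": 1000,
    "Education": 2500,
}
total = sum(population.values())   # 12,000
target = 5455                      # desired overall sample size

sample_sizes = {
    course: round(size / total * target)
    for course, size in population.items()
}
# e.g. Nursing -> 2728, Accountancy -> 227, Management -> 909
```

Rounding each share separately can make the pieces miss the target by one or two; for these figures they happen to sum to exactly 5,455.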
Example: A researcher wishes to survey apartment dwellers in a large city. If there are 10
apartment buildings in the city, the researcher can select at random 2 buildings from the 10 and
interview all the residents of these buildings.
Descriptive statistics are used to describe and summarize the data that were gathered. For
example, a researcher might be interested in the average age of the full-time students on your
campus and how many credit hours they're scheduled for this term.
Inferential Statistics
Statistical techniques used to make inferences are called inferential statistics. This is based on
studying characteristics of a sample within a larger population and using them to draw
conclusions about the entire population.
For example, the Bureau of Labor Statistics estimates the number of people in the United
States that are unemployed every month. Since it would be impossible to survey everyone, the
bureau picks a sample of adults to see what percentage are unemployed. Then they use that
information to estimate the unemployment rate for the entire population.
Frequency Distributions
The data collected for a statistical study are called raw data. In order to describe situations and
draw conclusions, we need to organize the data in a meaningful way.
Two methods that we will use are frequency distributions and stem and leaf plots.
The first type of frequency distribution we will investigate is the categorical frequency
distribution. This is used when the data are categorical rather than numerical.
For example, suppose 10 blood donors have the following blood types:
AB, A, O, B, A,
A, O, O, O, AB
Solution:
Step 1 Make a table with all categories represented.
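As a sketch, the tally in Step 1 can be automated with Python's `collections.Counter` (the variable names are ours):

```python
from collections import Counter

# The ten recorded blood types
blood_types = ["AB", "A", "O", "B", "A", "A", "O", "O", "O", "AB"]

freq = Counter(blood_types)   # category -> frequency
n = sum(freq.values())

for category in sorted(freq):
    print(f"{category:>2}: frequency {freq[category]} "
          f"({100 * freq[category] / n:.0f}%)")
```

The resulting categorical frequency distribution is A: 3, AB: 2, B: 1, O: 4.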
Another type of frequency distribution that can be constructed uses numerical data and is
called a grouped frequency distribution. In a grouped frequency distribution, the numerical
data are divided into classes.
For example, if you gathered data on the weights of people in your class, there’s a decent chance
that no two people have the exact same weight. So it would be reasonable to group people into
weight ranges, like 100–119 pounds, 120–139 pounds, and so forth.
When deciding on classes for a grouped frequency distribution, here are some guidelines:
1. Don't leave out any numbers between the lowest and highest, even if nothing falls into a
particular class.
2. Make sure the range of numbers included in a class is the same for each one.
3. The beginning and ending values for each class should be chosen based on how the data
values are rounded.
The data below are the record high temperatures (in degrees Fahrenheit) for each of the 50
states. Construct a grouped frequency distribution for the data.
112 100 128 120 134 118 106 110 109 112
100 118 117 116 118 121 114 114 105 109
107 112 114 115 118 117 118 125 106 110
122 108 110 121 113 120 119 111 104 113
120 113 120 117 105 110 118 112 114 115
SOLUTION
Step 1 Subtract the lowest value from the highest value: 134 − 100 = 34.
Step 2 If we use a range of 5 degrees, that will give us seven classes, since the entire range (34
degrees) divided by 5 is 6.8. This is certainly not the only choice for the number of classes: if we
choose a range of 4 degrees, there will be nine classes.
Step 3 Start with the lowest value and add 5 to get the lower class limits: 100, 105, 110, 115, 120, 125,
130. Notice that all of the data are rounded to the nearest whole number, so that’s reflected in our
choices of class limits.
Step 4 Set up the classes. To find the upper limit for each class, subtract one from the lower limit of the next class.
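Steps 1 through 4 can be sketched in Python for the record high temperature data, using the class width of 5 chosen in Step 2 (the data list repeats the 50 values given earlier; the helper names are ours):

```python
# Build a grouped frequency distribution with class width 5,
# starting the first class at the lowest data value.
temps = [
    112, 100, 128, 120, 134, 118, 106, 110, 109, 112,
    100, 118, 117, 116, 118, 121, 114, 114, 105, 109,
    107, 112, 114, 115, 118, 117, 118, 125, 106, 110,
    122, 108, 110, 121, 113, 120, 119, 111, 104, 113,
    120, 113, 120, 117, 105, 110, 118, 112, 114, 115,
]

width = 5
low = min(temps)          # 100, the lower limit of the first class
freq = {}
for t in temps:
    lower = low + (t - low) // width * width   # lower class limit for t
    freq[lower] = freq.get(lower, 0) + 1

for lower in sorted(freq):
    print(f"{lower}-{lower + width - 1}: {freq[lower]}")
```

The printed classes match the grouped frequency distribution used later in the histogram example: 100-104: 3, 105-109: 8, 110-114: 16, 115-119: 13, 120-124: 7, 125-129: 2, 130-134: 1.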
When data are representative of certain categories, rather than numerical, we often use bar
graphs or circle graphs (commonly known as pie charts) to illustrate the data.
Pie Chart
A pie chart, also called a circle graph, is a circle that is divided into sections in proportion to
the frequencies corresponding to the categories.
The purpose of a pie chart is to show the relationship of the parts to the whole by visually
comparing the size of the sections.
The marketing firm Deloitte Retail conducted a survey of grocery shoppers. The frequency
distribution below represents the responses to the survey question “How often do you bring your
own bags when grocery shopping?” Draw a pie chart to represent the data.
Response Frequency
Always 10
Never 39
Frequently 19
Occasionally 32
SOLUTION
Step 1 Find the number of degrees corresponding to each slice using the formula
Degrees = (f/n) · 360°
where f is the frequency for each class and n is the sum of the frequencies. Recall that there
are 360° in a circle, so what we're really doing here is finding the percentage of the circle that's
covered by each portion. In this case, n = 100, so the degree measures are:
Always: (10/100) · 360° = 36°
Never: (39/100) · 360° = 140.4°
Frequently: (19/100) · 360° = 68.4°
Occasionally: (32/100) · 360° = 115.2°
Step 2 Using a protractor, graph each section on the circle using the calculated angles. The pie
chart is shown on the next slide. Notice the labeling that makes it clear what each slice
represents. If you don’t have a protractor, you can find graph paper marked off in degrees by
doing an Internet search for “pie chart graph paper.”
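If you'd rather compute the angles than measure them with a protractor, Step 1 is a one-line calculation per slice. A minimal sketch:

```python
# Degrees for each slice: (f / n) * 360
responses = {"Always": 10, "Never": 39, "Frequently": 19, "Occasionally": 32}
n = sum(responses.values())   # 100

degrees = {label: f / n * 360 for label, f in responses.items()}
# Always -> 36.0, Never -> 140.4, Frequently -> 68.4, Occasionally -> 115.2
```

Since the frequencies sum to n, the computed angles always total 360°.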
While a pie chart is used to compare parts to a whole, a bar graph is used for comparing parts to
other parts.
Let’s revisit the marketing firm’s survey of 100 grocery shoppers. Suppose the frequency
distribution below represents the responses to the survey question, “Which grocery stores have
you shopped at in the last month?”
Store Frequency
Publix 50
Trader Joe’s 40
Fareway 28
Aldi 33
Other 65
(a) Draw a vertical bar graph for the results, shown in the frequency distribution.
(b) Would a pie chart make sense for these data? Why or why not?
SOLUTION
(a) Draw and label the axes. We were asked for a vertical bar graph, so the responses go on
the horizontal axis, and the frequencies on the vertical. Make sure that your labeling on
the vertical axis starts at zero.
(b) There are two good ways to decide that a pie chart would be a bad choice. One is numerical:
we were told that 100 shoppers were surveyed, but the numbers for the responses add up to more
than 100. So this isn't comparing distinct parts to a whole: the categories overlap. Which brings us
to the other way you can tell that a pie chart is a bad choice. The nature of the survey (“Which
stores have you shopped at?”) pretty much guarantees that some people will choose more than
one of the responses. This is a dead giveaway that the categories overlap, so a pie chart would be
a bad choice.
A histogram is similar to a vertical bar graph in that the heights of the bars correspond to
frequencies. The difference is that class limits are placed on the horizontal axis, rather than
categories.
In Section 11-1, we analyzed and organized data representing the record high temperature in
every state. It would therefore seem reasonable, if not expected, for us to illustrate that data
graphically. So that’s just what we’ll do. Below is a grouped frequency distribution for the data.
Use it to draw a histogram.
Class Frequency
100-104 3
105-109 8
110-114 16
115-119 13
120-124 7
125-129 2
130-134 1
SOLUTION
Step 1 Write the scale for the frequencies on the vertical axis and the class limits on the
horizontal axis. Make sure that your labeling on the vertical axis starts at zero.
Step 2 Draw vertical bars with heights that correspond to the frequencies for each class.
A frequency polygon is an alternative to a histogram: the frequencies are plotted as points above
the class midpoints, then connected with line segments. We'll use the same distribution:
Class Frequency
100-104 3
105-109 8
110-114 16
115-119 13
120-124 7
125-129 2
130-134 1
SOLUTION
Step 1 Find the midpoints for each class. This is accomplished by adding the upper and lower
limits and dividing by 2. For the first two classes, we get:
(100 + 104)/2 = 102    (105 + 109)/2 = 107
The remaining midpoints are 112, 117, 122, 127, and 132.
These two grouped frequency distributions come from data gathered in two calculus classes in
Spring of 2015. One is for percentage on the final exam; the other is for overall percentage in the
course.
Final Exam Percentage
Percentage Frequency
40-49.9 3
50-59.9 2
60-69.9 13
70-79.9 14
80-89.9 14
90-99.9 12
Overall Percentage
Percentage Frequency
40-49.9 0
50-59.9 1
60-69.9 7
70-79.9 20
80-89.9
90-99.9 11
(a) Draw a frequency polygon for each distribution. Put them on the same axes.
(b) What can you learn about final exam percentage vs. overall percentage from the polygons?
(c) What would happen if you tried to draw a histogram for each on the same axes?
SOLUTION
(a) The midpoints for the classes are 44.95, 54.95, 64.95, 74.95, 84.95, and 94.95. The two
frequency polygons are shown on the next slide.
Measures of Average
In casual terms, average means the most typical case, or the center of the distribution. Measures
of average are also called measures of central tendency, and include the mean, median, mode,
and midrange.
Mean
The Greek letter Σ (sigma) is used to represent the sum of a list of numbers. If we use the letter X
to represent data values, then ΣX means to find the sum of all values in a data set.
The mean is the sum of the values in a data set divided by the number of values. If X1, X2, X3, …,
Xn are the data values, we use X̄ to stand for the mean, and
X̄ = ΣX/n = (X1 + X2 + ... + Xn)/n
The nine employees of Vandelay Industries earn annual salaries, in thousands of dollars, of 58,
65, 944, 20, 52, 51, 53, 55, and 50. The company advertises that its average employee makes
almost $150,000 per year. Is the company's claim technically truthful? Do you think it's
deceiving? Explain.
SOLUTION
The company has nine employees, so we need to add all the salaries and then divide the sum by
9.
X̄ = (58 + 65 + 944 + 20 + 52 + 51 + 53 + 55 + 50)/9 ≈ 149.8
All of the salaries were whole numbers of thousands, so it was easier to just add the number of
thousands and divide by 9.
The result of 149.8 tells us that the mean salary is $149.8 thousand dollars, or $149,800.
The claim is, in fact, truthful—provided that by “average” you mean “mean.” But is it deceiving?
You bet it is! There’s only one person in the company that makes more than $65,000 per year—
the owner (Newman) who pays himself a handsome salary of $944,000. Given that we want
measures of average to describe a most typical case, $149,800 certainly doesn’t fit that bill.
Median
In short, the median of a data set is the value in the middle if all values are arranged in order. The
median will either be a specific data value in the set, or will fall in between two values.
Step 1 Arrange the data in order, from smallest to largest. Actually, largest to smallest will work,
too. Whatever makes you happy.
Step 2 If the number of data values is odd, the median is the value in the exact middle of the list.
If the number of data values is even, the median is the mean of the two middle data values.
(a) Find the median salary for Vandelay Industries. How does it compare to the mean?
(b) Find the mean and median if Newman’s salary is left out. What can you conclude?
SOLUTION
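Parts (a) and (b) can be checked with a short computation on the salary list (in thousands of dollars) from the mean example; the `statistics` module handles the ordering for us:

```python
import statistics

# Vandelay Industries salaries, in thousands of dollars
salaries = [58, 65, 944, 20, 52, 51, 53, 55, 50]

mean_all = statistics.mean(salaries)       # about 149.8 -> $149,800
median_all = statistics.median(salaries)   # 53 -> $53,000

# Leave out the owner's 944 (thousand) salary
rest = [s for s in salaries if s != 944]
mean_rest = statistics.mean(rest)          # 50.5 -> $50,500
median_rest = statistics.median(rest)      # 52.5 -> $52,500
```

The median barely moves when the owner's salary is removed, while the mean drops by almost $100,000: the median is far less sensitive to a single extreme value.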
Find the midrange of all salaries at Vandelay Industries. Is it meaningful in this case?
SOLUTION
It’s not necessary to put a data set in order to find the midrange, but it sure doesn’t hurt. All we
need to know is the lowest and highest salaries, and since we already ordered the list in Example
2, it’s easy to see that those are $20,000 and $944,000. So the midrange is
Midrange = ($20,000 + $944,000)/2 = $482,000
Wow. The midrange is a whopping $482,000, which is meaningful in that it emphasizes how big
Newman’s salary is, but as a measure of average it’s not good for much.
Mode
The mode is sometimes said to be the most typical case. The value that occurs most often in a
data set is called the mode. A data set can have more than one mode or no mode at all.
Days Frequency
10 1
11 2
12 7
13 4
15 6
The number of Atlantic hurricanes for each of the years from 1997–2016 is shown in the list.
Find the mode, and describe what that tells you.
3, 10, 8, 8, 9, 4, 7, 9, 15, 5,
6, 8, 3, 12, 7, 10, 2, 6, 4, 7
SOLUTION
This time, we'll find the mode without making a frequency distribution. Instead, we can just
work down the list, counting the number of occurrences for each number of hurricanes. It turns
out that there are two numbers, 7 and 8, that appear three times, while no others appear more
than twice. So the data set is bimodal, with modes 7 and 8 hurricanes.
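A short sketch that confirms the tally:

```python
from collections import Counter

hurricanes = [3, 10, 8, 8, 9, 4, 7, 9, 15, 5,
              6, 8, 3, 12, 7, 10, 2, 6, 4, 7]

counts = Counter(hurricanes)
top = max(counts.values())   # the highest frequency (3 occurrences)
modes = sorted(k for k, count in counts.items() if count == top)
print(modes)   # [7, 8]: the data set is bimodal
```

So in a typical year during that stretch, there were 7 or 8 Atlantic hurricanes.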
A survey of the junior class at Fiesta State University shows the following number of students
majoring in each field. Find the mode.
Business 1,425
Liberal arts 878
Computer science 632
Education 471
General studies 95
SOLUTION
You have to be a little careful here. If you focus on the numbers, you might conclude that there's
no mode, since they’re all different. But that would be missing the point. The mode is supposed
to be the most typical case. Here, the most typical major is the one with the most students: that’s
business, so that’s the mode.
Procedure for finding the mean for grouped data:
Step 1: Find the midpoint of each class.
Step 2: Multiply the frequency for each class by the midpoint of that class.
Step 3: Add the products.
Step 4: Divide by the sum of all frequencies (which is the total number of data values).
In symbols,
X̄ = Σ(f · Xm)/n
where f is the frequency for each class, Xm is the midpoint of each class, and n is the sum of all
frequencies.
We can apply this procedure to the distribution of state record high temperatures:
Class Frequency
100-104 3
105-109 8
110-114 16
115-119 13
120-124 7
125-129 2
130-134 1
SOLUTION
First, we’ll need the midpoint for each class.
Since we’ll need to multiply by the frequencies, it’s convenient to make a new table with the
midpoints and frequencies, then multiply them.
Midpoint (Xm) Frequency (f) f · Xm
102 3 306
107 8 856
112 16 1,792
117 13 1,521
122 7 854
127 2 254
132 1 132
Sums 50 5,715
To get our mean, we divide the sum of the products by the sum of the frequencies:
X̄ = 5,715/50 = 114.3
The mean state record high temperature is about 114.3°.
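The same computation can be sketched in Python, storing each class as (lower limit, upper limit, frequency):

```python
# Mean of grouped data: sum of (frequency * midpoint) over total frequency
classes = [(100, 104, 3), (105, 109, 8), (110, 114, 16), (115, 119, 13),
           (120, 124, 7), (125, 129, 2), (130, 134, 1)]

n = sum(f for _, _, f in classes)                          # 50
total = sum(f * (lo + hi) / 2 for lo, hi, f in classes)    # 5,715
mean = total / n
print(mean)   # 114.3
```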
Range
The range of a data set is the difference between the highest and lowest values in the set.
Range = Highest value – lowest value
The first list below is the weights of the dogs in the first picture, and the second is the weights of
the dogs in the second picture. Find the mean, median, and range for each list, then describe any
observations you can make based on the results.
1st: 70, 73, 58, 60
2nd: 30, 85, 40, 125, 42, 75, 60, 55
SOLUTION
1st list: the mean is 65.25 lb, the median is 65 lb, and the range is 73 − 58 = 15 lb.
2nd list: the mean is 64 lb, the median is 57.5 lb, and the range is 125 − 30 = 95 lb.
The two groups have very similar averages, but the second group's weights are spread out far
more widely. The range gives some sense of that spread, but it depends on only the two most
extreme values, so it can be misleading.
For this reason, we will next define variance and standard deviation, which are much more
reliable measures of variation.
Procedure for computing the sample variance and standard deviation:
Step 1 Find the mean of the data set.
Step 2 Subtract the mean from each data value in the data set.
Step 3 Square each of the differences.
Step 4 Add the squares.
Step 5 Divide the sum by n – 1 to get the variance, where n is the number of data values.
Step 6 Take the square root of the variance to get the standard deviation.
SOLUTION
Step 1 Find the mean weight.
We found the mean of 64 lb in Example 1.
Step 2 Subtract the mean from each data value.
30 - 64 = -34, 85 - 64 = 21, 40 - 64 = -24, 125 - 64 = 61,
42 - 64 = -22, 75 - 64 = 11, 60 - 64 = -4, 55 - 64 = -9
Step 3 Square each result.
(-34)² = 1,156, (21)² = 441, (-24)² = 576, (61)² = 3,721,
(-22)² = 484, (11)² = 121, (-4)² = 16, (-9)² = 81
Step 4 Find the sum of the squares.
1,156 + 441 + 576 + 3,721 + 484 + 121 + 16 + 81 = 6,596
Step 5 Divide the sum by n - 1 to get the variance, where n is the sample size. In this case, n is 8,
so n - 1 = 7.
Variance = 6,596/7 ≈ 942.3
Step 6 Take the square root of the variance to get standard deviation.
Standard Deviation = √ 942.3 ≈ 30.7 lb
To organize the steps, you might find it helpful to make a table with three columns: the original
data, the difference between each data value and the mean, and their squares. Then you just add
the entries in the last column and divide by n - 1 to get the variance.
Data (X)   X − X̄   (X − X̄)²
30   -34   1,156
85   21   441
40   -24   576
125   61   3,721
42   -22   484
75   11   121
60   -4   16
55   -9   81
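Python's `statistics` module performs Steps 1 through 6 in one call each; a sketch for the second list of dog weights:

```python
import statistics

# The second list of dog weights, in pounds
weights = [30, 85, 40, 125, 42, 75, 60, 55]

mean = statistics.mean(weights)      # 64
var = statistics.variance(weights)   # sample variance, about 942.3
sd = statistics.stdev(weights)       # about 30.7
```

Note that `statistics.variance` and `statistics.stdev` divide by n − 1, exactly as in Step 5.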
Standard Deviation
To understand the significance of standard deviation, we’ll look at the process one step at a time.
Step 1 Compute the mean. Variation is a measure of how far the data vary from the mean, so it
makes sense to begin there.
Step 2 Subtract the mean from each data value. In this step, we are literally calculating how far
away from the mean each data value is. The problem is that since some are greater than the mean
and some less, their sum will always add up to zero. (Try it!) So that doesn’t help much.
Step 3 Square the differences. This solves the problem of those differences adding to zero—
when we square them, they’re all positive.
Step 4 Add the squares. In the next two steps, we’re getting an approximate average of the
squares of the individual variations from the mean. First we add them, then…
Step 5 Divide the sum by n − 1. It seems like dividing by the number of values (n) here is a good
idea, but it turns out that when we’re using a sample from a larger population to compute mean
and variance, dividing by n − 1 makes the sample variance more likely to be a true reflection of
the population variance. In any case, at this point we have an approximate average of the squares
of the individual variations from the mean.
The sample variance for a data set is an approximate average of the square of the distance
between each value and the mean.
s² = Σ(X − X̄)²/(n − 1)
The sample standard deviation (s) is the square root of the variance. It provides an approximate
average of the distances between data values and the mean.
The mean and standard deviation for heights of all adult males in the United States are 69.3
inches and 2.8 inches, respectively. The mean and standard deviation for the 2016–2017
Cleveland Cavaliers (a professional basketball team), on the other hand, were 78.9 inches and
3.75 inches. What can we conclude from comparing these statistics?
SOLUTION
There are two main things we can learn from these comparisons. First, the players on a
professional basketball team tend to be a lot taller than average people. If you've ever watched a
basketball game, this comes as no surprise. Second, the heights are spread out a bit more than the
population in general. This could be due to the small sample size of 14 players, but it might also
be because there are some basketball players that are good players because of speed and
athleticism even though they’re not that tall, while others are successful to some extent because
of their extreme height. In order to draw more reliable conclusions, we’d probably need to look
at a much larger sample of pro basketball players.
A professor has two sections of Math 115 this semester. The 8:30 A.M. class has a mean score of
74% with a standard deviation of 3.6%. The 2 P.M. class also has a mean score of 74%, but a
larger standard deviation: its scores are more spread out about the same average.
A probability distribution that plots all of its values in a symmetrical fashion and most of the
results are situated around the probability's mean is called a normal distribution. Key properties:
1. The value in the middle of the distribution, which appears most often in the sample, is the
mean.
2. The distribution is symmetric about the mean. This means that the graph has two halves that
are mirror images on either side of the mean value.
3. This is the key fact: the area under any portion of the curve is the percentage (in decimal form)
of data values that fall between the values that begin and end that region.
The graph below shows a normal distribution for heights of women in the United States. The
numbers on the horizontal axis are heights in inches, and some areas are labeled for reference.
According to the website answerbag.com, the mean height for male humans is 5 feet 9.3 inches,
with a standard deviation of 2.8 inches. If this is accurate, out of 1,000 randomly selected men,
how many would you expect to be between 5 feet 6.5 inches and 6 feet 0.1 inch?
SOLUTION
The given range of heights corresponds to those within 1 standard deviation of the mean, so we
would expect about 68% of men to fall in that range. In this case, we expect about 680 men to be
between 5 feet 6.5 inches and 6 feet 0.1 inch.
A z score measures how many standard deviations a data value lies from the mean:
z = (value − mean)/(standard deviation)
A couple of notes: a data point is greater than the mean if z > 0 and less than the mean if z < 0,
and z scores are typically rounded to two decimal places.
Based on the information in the previous Example, find the z score for a man who is 6 feet 4
inches tall and describe what it tells us.
SOLUTION
Use the formula for z scores with mean 5 feet 9.3 inches and standard deviation 2.8 inches. Note
that we converted the heights to inches to make it easier to subtract.
z = (76 in − 69.3 in)/(2.8 in) ≈ 2.39
This means that 6′4″ is about 2.4 standard deviations above the mean.
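The z-score formula is simple enough to wrap in a small helper; a sketch (the function name is our own):

```python
# z score: how many standard deviations a value lies from the mean
def z_score(value, mean, sd):
    return (value - mean) / sd

# 6 feet 4 inches = 76 inches; mean 69.3 in, standard deviation 2.8 in
z = z_score(76, 69.3, 2.8)
print(round(z, 2))   # 2.39
```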
As you probably know, there are two main companies that offer standardized college entrance
exams, ACT and SAT. Since each has a completely different scoring scale, it’s really difficult to
compare the scores of students that took different exams. One year the ACT had a mean score of
21.2 and a standard deviation of 5.1. That same year, the SAT had a mean score of 1498 and a
standard deviation of 347. Suppose that a scholarship committee is considering two students, one
who scored 26 on the ACT and another who scored 1800 on the SAT. Both are pretty good
scores, but which one is better?
SOLUTION
This is not an easy question because the two tests have completely different scales. But because
we know the mean and standard deviation for each, we can compute z scores. This will allow us
to decide which student did better relative to the others who took the same test.
26 ACT: z = (26 − 21.2)/5.1 ≈ 0.94
1800 SAT: z = (1800 − 1498)/347 ≈ 0.87
The student with 26 on the ACT did better. He or she is 0.94 standard deviations above the
mean, while the student who scored 1800 on the SAT is 0.87 standard deviations above the
mean.
An area table with more values can be found in Table 11-3 and Appendix A of the text.
Two Important Facts about the Standard Normal Curve
1. The area under any normal curve is divided into two equal halves at the mean. Each of
the halves has area 0.500.
2. The area between z = 0 and a positive z score is the same as the area between z = 0 and
the negative of that z score.
Example: Find the area under the standard normal curve between z = 1.55 and z = 2.25.
The table gives an area of 0.439 between z = 0 and z = 1.55, and an area of 0.488 between
z = 0 and z = 2.25. The area we are looking for is the larger area minus the smaller:
0.488 - 0.439 = 0.049
The area between z = 1.55 and z = 2.25 is 0.049.
(b) Draw the picture, label z scores, and shade the requested area.
All of the z scores in the table are positive, but the areas for negative z scores are exactly
the same as for the corresponding positive z scores.
So we can get the area between z = 0 and z = −1.35 by looking up z = 1.35 in Table 11-3, which
gives us 0.412. We can then get the area for z = −0.60 by looking up z = 0.60 in the table: the
result is 0.226.
Again, the area we’re looking for is the larger area minus the smaller:
0.412 - 0.226 = 0.186
The area between z = -0.60 and z = -1.35 is 0.186.
The entire shaded area is the sum of these two areas, so we’ll add the two areas we get from the
table.
The area corresponding to z = −1.75 is 0.460, and the area corresponding to z = 1.50 is 0.433.
0.460 + 0.433 = 0.893.
The area between z = 1.50 and z = −1.75 is 0.893.
This time, the area is the sum of 0.500 (the entire right half of the distribution) and the area from
the table:
0.500 + 0.329 = 0.829
The area to the right of z = -0.95 is 0.829.
(b) Draw the picture, label the z score, and shade the area. Table 11-3 gives us an area of 0.474
between z = 0 and z = 1.95. Adding to the area of the left half (0.500), we get
0.500 + 0.474 = 0.974
The area to the left of z = 1.95 is 0.974.
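Table 11-3 rounds every area to three decimal places. With software you can get the areas directly from the error function; a sketch (`phi` is our own name for the cumulative area to the left of z):

```python
import math

def phi(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Area between z = 1.55 and z = 2.25
print(round(phi(2.25) - phi(1.55), 3))   # 0.048

# Area to the left of z = 1.95
print(round(phi(1.95), 3))   # 0.974
```

The first value comes out 0.048 rather than the table's 0.049; the small difference is just the rounding of the three-decimal table entries.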
Step 4 Interpret the area. In this case, an area of 0.023 tells us that only 2.3% of packages will
have weights less than 510 grams.
Probability and Area under a Normal Distribution
The area under a normal distribution between two data values is the probability that a randomly
selected data value is between those two values.
Based on data in an EPA study released in 2016, the average American generates 1,606 pounds
of garbage per year. Let’s estimate that the number of pounds generated per person is
approximately normally distributed with standard deviation 200 pounds. Find the probability that
a randomly selected person generates
(a) Between 1,250 and 2,050 pounds of garbage per year.
Step 4 Interpret the area. This tells us that if a person is selected at random, the probability that
he or she generates between 1,250 and 2,050 pounds of garbage per year is 0.949.
(b) Most of the work was already done in part a. The desired area is the area between z = 0 and z
= 2.22 subtracted from the area of the entire right half (0.5).
0.5 - 0.487 = 0.013
The probability that a randomly selected person generates more than 2,050 pounds of garbage
per year is 0.013.
The American Automobile Association reports that the average time it takes to respond to an
emergency call is 25 minutes. If the response time is approximately normally distributed and the
standard deviation is 4.5 minutes:
(a) If 80 calls are randomly selected, approximately how many will have response times less than
17 minutes?
Step 4 Interpret the area. An area of 0.038 tells us that 3.8% of calls will have a response time of
less than 17 minutes. Multiply the percentage by the number of calls in our sample:
3.8% of 80 calls = (0.038) (80) = 3.04
About 3 calls in 80 will have response time less than 17 minutes.
(b) Step 1 Draw a figure and represent the area as shown in the figure on the next slide.
Remember that to find the percentile, we are interested in the percentage of times that are less
than 30 minutes.
Step 2 Find the z score for a response time of 30 minutes.
z = (value − mean)/(standard deviation) = (30 − 25)/4.5 ≈ 1.11
Step 3 Find the shaded area. The area to the left of z = 1.11 is the area of the left half (0.5) plus
the area between z = 0 and z = 1.11 (about 0.366 according to Table 11-3).
Area = 0.5 + 0.366 = 0.866
So a response time of 30 minutes is at about the 87th percentile.
A random sample of 25 entrées served by a college cafeteria is tested to find the number of
calories per serving, with the data in the list below.
845 460 620 752 683
1,088 785 575 580 755
720 650 512 945 672
526 1,050 725 822 740
773 812 880 911 910
(a) Draw a histogram for the data using seven classes. Do the data appear to be normally
distributed?
(b) Find the mean and standard deviation for the data.
(c) What percentage of randomly selected entrées from this cafeteria has more than 700 calories?
SOLUTION
(a) The lowest number of calories is 460 and the highest is 1,088, a range of about 630
calories. A good choice of classes here is simply using the number of hundreds: 400–499,
500–599, and so on.
The histogram doesn’t look exactly like a normal curve, but it’s close enough that it would be
reasonable to guess that the data are approximately normally distributed.
(b) You can find the mean and standard deviation by hand, but with 25 data values, using
technology seems like a good idea. We used the AVERAGE and STDEV commands in Excel,
but you could also use a graphing calculator. The results:
X̄ = 751.6
s = 160.7
(c) Since we’re already using technology, let’s also use it to find an area under a normal curve
with mean 751.6 and standard deviation 160.7. We want the area to the right of 700. For the
calculator, we press 2nd VARS 2 to get the normalcdf command, then key in
700,1000000,751.6,160.7). For a spreadsheet, we use the command
=1-NORM.DIST(700,751.6,160.7,TRUE).
In each case, we see that the area is about 0.626. That tells us that there’s a 62.6% chance
that an entrée chosen at random has more than 700 calories.
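Parts (b) and (c) can also be sketched in Python instead of Excel; the erf-based expression here plays the role of the NORM.DIST calculation (the variable names are ours):

```python
import math
import statistics

calories = [845, 460, 620, 752, 683, 1088, 785, 575, 580, 755,
            720, 650, 512, 945, 672, 526, 1050, 725, 822, 740,
            773, 812, 880, 911, 910]

mean = statistics.mean(calories)    # 751.64
sd = statistics.stdev(calories)     # about 160.7

# P(X > 700), assuming the data are approximately normally distributed
z = (700 - mean) / sd
p_over_700 = 1 - 0.5 * (1 + math.erf(z / math.sqrt(2)))
print(round(p_over_700, 3))   # 0.626
```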
We can then plot a graph designating one set of data as the x variable or independent
variable and the other as the y variable or dependent variable.
Scatter Plot
A scatter plot is a graph of the ordered pairs (x, y) consisting of data from two data sets.
After the scatter plot is drawn, we can analyze the graph to see if there is a pattern. If there is a
noticeable pattern, such as the points falling in an approximately straight line, then a possible
relationship between the two variables may exist.
A medical researcher selects a sample of small hospitals in his state and hopes to discover if
there is a relationship between the number of beds and the number of personnel employed by the
hospital.
(a) Do you think there should be a relationship between these two data sets? Describe any
relationship you’d expect.
(b) Draw a scatter plot for the data shown. Does it look like there’s a relationship between the
data sets?
SOLUTION
We can see from the graph that there appears to be a connection. In general, the points progress
from lower left to upper right, which confirms our prediction: lower number of beds tend to go
with lower numbers of personnel, and higher numbers of beds tend to go with higher numbers of
personnel.
Analyzing the Scatter Plot
The types of patterns and corresponding relationships are:
1. A positive linear relationship exists when the points fall approximately in an ascending
straight line from left to right, and both the x and y values increase at the same time.
2. A negative linear relationship exists when the points fall approximately in a descending
straight line from left to right; as the x values increase, the y values decrease.
3. A nonlinear relationship exists when the points fall in a curved line. The relationship is
described by the nature of the curve.
Based on our description of the correlation coefficient for two data sets, make an estimate of the
correlation coefficient for each scatter plot below.
r = [n(Σxy) − (Σx)(Σy)] / √([n(Σx²) − (Σx)²][n(Σy²) − (Σy)²])
where
n = the number of data pairs
Σx = the sum of the x values
Σy = the sum of the y values
Σxy = the sum of the products of the x and y values for each pair
Σx² = the sum of the squares of the x values
Σy² = the sum of the squares of the y values
Find the correlation coefficient for the data in Example 1, and discuss what you think it indicates.
Personnel (y) 72 195 74 211 145 139 184 131 233 366
SOLUTION
Step 1 Make a table as shown on the next slide with the following headings: x, y, xy, x², and y².
Step 2 Find the values for the product xy, the values for x², the values for y², and place them in
the columns as shown. Then find the sums of the columns.
x  y  xy  x²  y²
Regression
Once we have concluded that there is a significant relationship between the two variables,
the next step is to find the equation of the regression line through the data points.
We say it is the line that passes through the points in such a way that the overall distance
each point is from the line is at a minimum. The regression line is also called the line of best fit.
Recall from algebra that the equation of a line in slope-intercept form is
y = mx + b,
where m is the slope and b is the y intercept. In statistics, the equation of the regression line is
usually written as
y = a + bx,
where a is the y intercept and b is the slope.
Formulas for Finding the Values of a and b for the Equation of the Regression Line
Slope:
b = [n(Σxy) − (Σx)(Σy)] / [n(Σx²) − (Σx)²]
y intercept:
a = [Σy − b(Σx)] / n
Find the equation of the regression line for the data in the previous Example.
SOLUTION
We already calculated the values needed for each formula when we found the correlation
coefficient in Example 3.
Substitute into the first formula to find the value for the slope, b.
b = [n(Σxy) − (Σx)(Σy)] / [n(Σx²) − (Σx)²]
= [10(113,865) − (572)(1,750)] / [10(37,802) − (572)²]
= 137,650/50,836 ≈ 2.708
Substitute into the second formula to find the value for the y intercept (a) when b = 2.708.
a = [Σy − b(Σx)] / n = [1,750 − 2.708(572)] / 10 = 201.024/10 ≈ 20.102
The equation of the regression line is y = 20.102 + 2.708x.
This image shows the regression line y = 20.102 + 2.708x in comparison to the original scatter
plot.
First, the value of r and the value of the slope, b, always have the same sign. Second, the
closer the value of r is to +1 or -1, the better the points will fit the line. In other words, the
stronger the relationship, the better the fit.
Use the regression line found in Example 5 to predict the approximate number of personnel for a
hospital with 65 beds.
SOLUTION
Substitute 65 for x into the equation y = 20.102 + 2.708x and find the value for y.
y = 20.102 + 2.708(65) ≈ 196.12 (which can be rounded to 196)
We predict that a hospital with 65 beds will have about 196 personnel.
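Both regression formulas, and the prediction, can be verified from the summary sums alone; a sketch (note that carrying b unrounded shifts the intercept slightly from the 20.102 obtained above with the rounded slope):

```python
# Summary sums from the correlation example (n = 10 hospitals)
n = 10
sum_x, sum_y = 572, 1750
sum_xy, sum_x2 = 113865, 37802

b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)   # slope
a = (sum_y - b * sum_x) / n                                    # y intercept

print(round(b, 3))          # 2.708
print(round(a, 3))          # about 20.118 with the unrounded slope
print(round(a + b * 65))    # predicted personnel for 65 beds: 196
```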
GATHERING AND ORGANIZING DATA – Module 5
Exercises:
1. Find the area under the normal curve that lies between the given values of z.
a. z = 0 and z = 2.37
b. z = 0 and z = -1.94
ASSIGNMENT
Answer numbers 1 - 5 on page 651 (Sobecki, D. (2019). Math in Our World. New York, NY:
McGraw-Hill Education.)
References
Nocon, R. (2018). Essential Mathematics for the Modern World. Quezon City: C & E
Publishing, Inc.
Sobecki, D. (2019). Math in Our World. New York, NY: McGraw-Hill Education.