
UNIT IV : STATISTICS

Introduction

Statisticians collect numerical data from subgroups of populations to find out
everything imaginable about the population as a whole, including whom they favor in an
election, what they watch on TV, how much money they make, what worries them, and
even how being attractive pays off. Comedians joke that 62.36% of all statistics are made
up on the spot. Because statisticians both record and influence our behavior, it is
important to distinguish between good and bad methods for collecting, presenting, and
interpreting data. In this unit, you will gain an understanding of where data come from and
how these numbers are used to make decisions.

Objectives:

At the end of this unit, the students should be able to:

• Use a variety of statistical tools to process and manage numerical data.
• Use methods of linear regression and correlation to predict the value of a variable
given certain conditions.
• Advocate the use of statistical data in making important decisions.
• Recognize the different types and classifications of data.
• Classify and characterize data.

Lesson I. Classification and Organization of Data

At the end of the twentieth century, there were 94 million households in the Philippines
with television sets. The television program viewed by the greatest percentage of such
households in that century was the final episode of Probinsiyano. Over 50 million Filipinos
watched this program.

Numerical information, such as the viewership figures above, is called data. The word
statistics is often used when referring to data. However, statistics has a second meaning:

Statistics is also a method for collecting, organizing, analyzing, and interpreting data, as
well as drawing conclusions based on the data. This methodology divides statistics into two
main areas.

Descriptive Statistics is concerned with collecting, organizing, summarizing, and
presenting data.

Inferential Statistics has to do with making generalizations about and drawing conclusions
from the data.

There are many classifications of data. Different kinds of data are collected,
analyzed, and interpreted. Being able to differentiate them is the first thing that must be
considered when organizing data.

Qualitative and Quantitative data are the two types of data.

Qualitative data deal with categories or attributes. Examples are eye color, ethnicity,
and brand of ice cream.

Quantitative data are numerical data. Quantitative data can be discrete or continuous.

Discrete data are obtained through counting. The number of countries in Southeast Asia and
the number of courses in a school term are examples of discrete data.

Continuous data are obtained by measuring. Weight and age are some examples of
continuous data.

Classification of data includes the levels of measurement of data. The levels of measurement of
data are nominal, ordinal, interval, and ratio.

Nominal level of measurement classifies qualitative data into two or more categories. It is
the lowest level of measurement.

Examples of nominal are the books in the library and courses in college.

Ordinal level of measurement ranks qualitative data.

Examples of ordinal data are the rankings of winners in a science quiz bee and levels of anxiety.

Interval level of measurement involves quantitative data that can be ranked and for which
differences between values are meaningful. There is no true zero (starting point) for this
level of measurement.

An example of interval data is the Celsius temperature.

Ratio level of measurement includes not only the characteristics of the interval level of
measurement but also a true zero starting point. It is the highest level of measurement.

Examples of ratio are weight, the time it takes to do a math project and the number of
absences of students in a class.

Lesson II. Measures of Central Tendency

Population and Sample

Consider the set of all Filipino TV households. Such a set is called the population. In
general, a population is the set containing all the people or objects whose properties are to
be described and analyzed by the data collector.

The population of Filipino TV households is huge. When the series concluded,
nearly 50 million of the 94 million such households watched the final
episode of Probinsiyano (only a part of the population). Because it is impractical to
survey every household, a sample, which is a subset
or subgroup of the population, is needed. In this case, it would be appropriate to have a
sample of a few thousand TV households to draw conclusions about the population of all
TV households.

Random Samples :

A random sample is a sample obtained in such a way that every element in the population
has an equal chance of being selected for the sample.
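
The idea of a random sample can be sketched in a few lines of Python. This is only an illustration: the household IDs and the helper name `draw_random_sample` are assumptions made for the example, not part of the lesson.

```python
import random

def draw_random_sample(population, k, seed=None):
    """Pick k distinct elements so that every element is equally likely to be chosen."""
    rng = random.Random(seed)          # a seed makes the draw repeatable
    return rng.sample(population, k)   # sampling without replacement

# Hypothetical population: ID numbers standing in for TV households.
households = list(range(1, 95))
sample = draw_random_sample(households, 10, seed=1)
print(sample)
```

Because `random.sample` draws without replacement, no household can appear twice in the sample.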

Measure of central tendency

According to researchers, “Robert,” the average American guy, is 31 years old, 5
feet 10 inches tall, weighs 172 pounds, works 6.1 hours daily, and sleeps 7.7 hours. These numbers
represent what is “average” or “typical” of American men. In statistics, such values are
known as measures of central tendency because they are generally located toward the
center of a distribution.

One of the most basic statistical concepts involves finding measures of central
tendency of a set of numerical data. It is often helpful to find numerical values that locate,
in some sense, the center of a set of data. Suppose Mikee is a senior at a university. In a
few months he plans to graduate and start a career as a landscape architect. A survey of
five landscape architects from last year’s senior class shows that they received job offers
with the following yearly salaries.

₱43,750 ₱39,500 ₱38,000 ₱41,250 ₱44,000

Before Mikee interviews for a job, he wishes to determine the average of these
five salaries. This average should be a “central” number around which the salaries cluster.
We will consider three types of averages, known as the arithmetic mean, the median, and
the mode.

Each of these averages is a measure of central tendency for the numerical data, and each
is calculated in a different way. Thus, it is better to use a
specific term (mean, median, or mode) than to use the generic descriptive term
“average.”

The arithmetic mean is the most commonly used measure of central tendency. The
arithmetic mean of a set of numbers is often referred to as simply the mean. To find the
mean for a set of data, find the sum of the data values and divide it by the number of data
values.

\[ \bar{x} = \frac{\sum x}{n} \]

where: $\bar{x}$ = the mean

$\sum x$ = the sum of all data items

n = the number of data items

For instance, to find the mean of the 5 salaries listed above, Mikee would divide the sum of
the salaries by 5.

\[ \text{Mean} = \bar{x} = \frac{\sum x}{n} = \frac{₱43,750 + ₱39,500 + ₱38,000 + ₱41,250 + ₱44,000}{5} = \frac{₱206,500}{5} = ₱41,300 \]

The mean suggests that Mikee can reasonably expect a job offer at a salary of about
₱41,300.
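
The same computation can be checked with a short Python sketch. The function name `mean` and the variable names are illustrative choices; the salaries are the five values listed above.

```python
def mean(values):
    """Arithmetic mean: the sum of the data values divided by how many there are."""
    return sum(values) / len(values)

# The five yearly salary offers (in pesos) from the survey.
salaries = [43750, 39500, 38000, 41250, 44000]
mean_salary = mean(salaries)
print(mean_salary)  # 41300.0
```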

In statistics it is often necessary to find the sum of a set of numbers. The traditional symbol
used to indicate a sum is the Greek letter sigma, Σ. Thus the notation Σx, called summation
notation, denotes the sum of all the numbers in a given set.

Mean

The mean of n numbers is the sum of the numbers divided by n.

\[ \text{Mean} = \bar{x} = \frac{\sum x}{n} \]

Statisticians often collect data from small portions of a large group in order to determine
information about the group. In such situations the entire group under consideration is
known as the population, and any subset of the population is called a sample. It is
traditional to denote the mean of a sample by x bar ($\bar{x}$) and to denote the mean of the
population by the Greek letter μ (lowercase mu).

Example : Find the mean

Nine friends in a mathematics class of 30 students received test grades of

95 85 75 70 88 90 93 69 89

Find the mean of these scores.

Solution:

The nine friends are a sample of the population of 30 students, so the mean is denoted by $\bar{x}$.

\[ \bar{x} = \frac{95 + 85 + 75 + 70 + 88 + 90 + 93 + 69 + 89}{9} = \frac{754}{9} \approx 83.778 \]

Exercises:

A doctor ordered 6 separate blood tests to measure a patient’s total blood cholesterol
levels . The test results were

246 236 225 224 215 210

Find the mean of the blood cholesterol levels.

The Median

Another type of average is the median. Essentially, the median is the middle number or
the mean of the two middle numbers in a list of numbers that have been arranged in
numerical order from smallest to largest or largest to smallest. Any list of numbers that
is arranged in numerical order in this way is called a ranked list.

Median

The median of a ranked list of n numbers is:

• The middle number if n is odd.
• The mean of the two middle numbers if n is even.

Example :

Find the median of the data in the following lists.

1. 4, 8, 1, 14, 9, 21, 12
2. 46, 23, 92,89, 77, 108

Solution :

1. The list 4, 8, 1, 14, 9, 21, 12 contains 7 numbers. The median of a list with an odd
number of entries is found by ranking the numbers and finding the middle number.
Rank the numbers from smallest to largest.
1, 4, 8, 9, 12, 14, 21

The middle number is 9; therefore 9 is the median.

2. The list 46, 23, 92, 89, 77, 108 contains 6 numbers. The median of a list of data
with an even number of entries is found by ranking the numbers and computing the
mean of the two middle numbers. Rank the numbers from smallest to largest.
23, 46, 77, 89, 92, 108

The two middle numbers are 77 and 89. The mean of 77 and 89 is 83. Therefore the
median of the data is 83.
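
The two cases (odd and even n) can be sketched in Python as follows. The function name `median` is an illustrative choice; the two lists are the ones from the example.

```python
def median(values):
    """Median of a list: rank the values, then take the middle one (odd n)
    or the mean of the two middle ones (even n)."""
    ranked = sorted(values)      # rank from smallest to largest
    n = len(ranked)
    mid = n // 2
    if n % 2 == 1:               # odd number of entries
        return ranked[mid]
    return (ranked[mid - 1] + ranked[mid]) / 2   # even number of entries

print(median([4, 8, 1, 14, 9, 21, 12]))     # 9
print(median([46, 23, 92, 89, 77, 108]))    # 83.0
```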

Exercises :

Find the median of the data in the following lists.

1. 15, 28, 4, 83, 65, 35, 9, 52, 77, 75

2. 22.5, 38.6, 15.7, 86.8, 19.6, 21.8, 32.5



Mode

The third type of average is the mode.

The mode of a list of numbers is the number that occurs most frequently.

Some lists of numbers do not have a mode. For instance, in the list 2, 7, 9, 11, 33, 16, 50
each number occurs exactly once. Because no number occurs more often than the other
numbers, there is no mode.

A list of numerical data can have more than one mode. For instance, in the list 4, 2, 6, 2, 7, 9, 2,
4, 9, 8, 9, 7, the number 2 occurs three times and the number 9 also occurs three times. All
other numbers occur fewer than three times. Thus 2 and 9 are both modes for the data.

Example :

Find the mode of the data in the following lists.

1. 18, 15, 21, 16, 15, 14, 15, 21


2. 2, 5, 8, 9, 11, 4, 7, 23

Solution :

1. In the list 18, 15, 21, 16, 15, 14, 15, 21 , the number 15 occurs more often than the
other numbers. Therefore 15 is the mode.
2. Each number in the list occurs only once. Because no number occurs more often
than the others , there is no mode.
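
A possible Python sketch of this rule is shown below. It assumes the convention used in this lesson, that a list in which no value occurs more often than the others has no mode; the function name `modes` is an illustrative choice.

```python
from collections import Counter

def modes(values):
    """Return every mode of the list, or an empty list when there is no mode.

    Assumption (following the lesson): if several distinct values all occur
    equally often, no value stands out, so the list has no mode.
    """
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([18, 15, 21, 16, 15, 14, 15, 21]))        # [15]
print(modes([2, 5, 8, 9, 11, 4, 7, 23]))              # []
print(modes([4, 2, 6, 2, 7, 9, 2, 4, 9, 8, 9, 7]))    # [2, 9]
```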

Exercises :

Find the mode of the data in the following lists.

1. 3, 3, 3, 3, 3, 4, 4, 5, 5, 5, 8, 8, 8, 8
2. 12, 34, 12, 71, 48, 93, 71, 93, 12

The mean, the median, and the mode are all averages; however, they are generally not
equal. The mean of a set of data is the most sensitive of the averages. A change in any of the
numbers changes the mean, and the mean can be changed drastically by changing an
extreme value.

In contrast, the median and the mode of a set of data are usually not changed by
changing an extreme value.

When a data set has one or more extreme values that are very different from the
majority of the data values, the mean will not necessarily be a good indicator of an average
value. In the following example, we compare the mean, median, and mode for the salaries
of 5 employees of a small company.

Salaries : ₱370,000 ₱60,000 ₱36,000 ₱20,000 ₱20,000

The sum of the 5 salaries is ₱506,000. Hence the mean is

\[ \frac{₱506,000}{5} = ₱101,200 \]

The median is the middle number, ₱36,000. Because the ₱20,000 salary occurs the
most, the mode is ₱20,000. The data contain one extreme value that is much larger than
the other values. This extreme value makes the mean considerably larger than the median.
Most of the employees of this company would probably agree that the median of ₱36,000
better represents the average of the salaries than does either the mean or the mode.
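
The comparison can be reproduced with Python's standard `statistics` module. This is a small sketch using the five salaries above; the variable names are illustrative.

```python
import statistics

# Salaries (in pesos) of the 5 employees; 370,000 is the extreme value.
salaries = [370000, 60000, 36000, 20000, 20000]

salary_mean = statistics.mean(salaries)      # pulled upward by the extreme value
salary_median = statistics.median(salaries)  # middle value of the ranked list
salary_mode = statistics.mode(salaries)      # most frequent value

print(salary_mean, salary_median, salary_mode)  # 101200 36000 20000
```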

The Weighted Mean

A value called the weighted mean is often used when some data values are more important
than others. For instance, many professors determine a student’s course grade from the
student’s tests and the final examination. Consider the situation in which a professor
counts the final examination score more heavily than each test score. To find the weighted
mean of the student’s scores, the professor first assigns a weight to each score. In this case, the
professor could assign each test a weight of 2 and the final exam score a weight of 3. A
student with test scores of 65, 70, and 75 and a final examination score of 90 has a
weighted mean of

\[ \frac{(65 \times 2) + (70 \times 2) + (75 \times 2) + (90 \times 3)}{9} = \frac{690}{9} \approx 76.667 \]


Note that the numerator of the weighted mean above is the sum of the products of each test
score and its corresponding weight. The number 9 in the denominator is the sum of all the
weights ( 2+2+2+3 ). The procedure for finding the weighted mean can be generalized as
follows.

The Weighted Mean

The weighted mean of the n numbers $x_1, x_2, x_3, \ldots, x_n$ with the respective assigned weights
$w_1, w_2, w_3, \ldots, w_n$ is

\[ \text{Weighted mean} = \bar{x} = \frac{\sum (x \cdot w)}{\sum w} \]

where $\sum (x \cdot w)$ is the sum of the products formed by multiplying each number by its
assigned weight, and $\sum w$ is the sum of all the weights.
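
The weighted mean formula translates directly into code. The sketch below is one way to write it in Python; the function name `weighted_mean` is an illustrative choice, and the scores and weights are those of the professor's example above.

```python
def weighted_mean(values, weights):
    """Sum of (value * weight) divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("each value needs exactly one weight")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("the weights must not sum to zero")
    return sum(x * w for x, w in zip(values, weights)) / total_weight

# Three tests weighted 2 each and a final examination weighted 3.
score = weighted_mean([65, 70, 75, 90], [2, 2, 2, 3])
print(round(score, 3))  # 76.667
```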

Many colleges use the 4- point grading system:

A = 4, B = 3, C = 2, D = 1, F = 0

A student’s grade point average (GPA) is calculated as a weighted mean, where the
student’s grade in each course is given a weight equal to the number of units (or
credits) that course is worth. Use this 4-point grading system for the given example.

Table 4.1 shows Peter’s first semester course grades. Use the weighted mean formula
to find Peter’s GPA for the first semester.

Table 4.1: Peter’s First Semester Course Grades

Course       Course Grade   Course Units
Math         B              4
History      A              3
Chemistry    D              3
Biology      C              4

Solution:

The B is worth 3 points, with a weight of 4; the A is worth 4 points, with a weight of 3; the D
is worth 1 point, with a weight of 3; and the C is worth 2 points, with a weight of 4. The
sum of all the weights is 4 + 3 + 3 + 4, or 14.

\[ \text{Weighted mean} = \bar{x} = \frac{(3 \times 4) + (4 \times 3) + (1 \times 3) + (2 \times 4)}{14} = \frac{35}{14} = 2.5 \]


Peter’s GPA for the first semester is 2.5.
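
Peter's GPA can also be computed programmatically. The sketch below is illustrative only: the dictionary `GRADE_POINTS` lists just the letter grades that appear in Peter's table, and the variable names are assumptions made for the example.

```python
# Point values for the letter grades that appear in Peter's table.
GRADE_POINTS = {"A": 4, "B": 3, "C": 2, "D": 1}

# (course, letter grade, units) for Peter's first semester.
courses = [
    ("Math", "B", 4),
    ("History", "A", 3),
    ("Chemistry", "D", 3),
    ("Biology", "C", 4),
]

# GPA is a weighted mean: grade points weighted by course units.
total_points = sum(GRADE_POINTS[grade] * units for _, grade, units in courses)
total_units = sum(units for _, _, units in courses)
gpa = total_points / total_units
print(gpa)  # 2.5
```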

EXERCISE SET 13 :

Table 4.2 shows Lourd’s second semester course grades. Use the weighted mean
formula to find Lourd’s GPA for the second semester.

Table 4.2: Lourd’s Second Semester Course Grades

Course       Course Grade   Course Units
Biology      A              4
Statistics   B              3
Business     C              3
Psychology   F              2
CAD          B              2

Frequency Distribution

After the data have been collected from the sample of the population, the next task
facing the statistician is to present the data in a condensed and manageable form. In this
way, the data can be interpreted more easily.

Data that have not been organized or manipulated in any manner are called raw
data. A large collection of raw data may not provide much readily observable information.
A frequency distribution, which is a table that lists observed events and the frequency of
occurrence of each observed event, is often used to organize raw data. For instance,
consider Table 4.3, which lists the number of laptop computers owned by
families in each of 40 homes in a subdivision.

A piece of data is called a data item. This list of data has 40 data items. Some of the
data items are identical. Three of the data items are 5. Thus, we say that the data
value 5 occurs three times. Similarly, because 14 of the data items are 2, the data value 2 occurs
14 times.

Collected data can be represented using a frequency distribution. Such a distribution
consists of two columns. The data values are listed in one column. Numerical data are
generally listed from smallest to largest. The adjacent column is labeled frequency and
indicates the number of times each value occurs.


Table 4.3: Number of Laptop Computers per Household

2 0 3 1 2 1 0 4

2 1 1 7 2 0 1 1

0 2 2 1 3 2 2 1

1 4 2 5 2 3 1 2

2 1 2 1 5 0 2 5

The next table (Table 4.4) is a frequency distribution table that was constructed using the
data from Table 4.3. The first column of the frequency distribution consists of the
numbers 0, 1, 2, 3, 4, 5, 6, and 7. The corresponding frequency of occurrence, f, of each of
the numbers in the first column is listed in the last column.

Table 4.4: A Frequency Distribution for Table 4.3

Observed event:                  Tally                 Frequency:
number of laptop computers, x                          number of households, f,
                                                       with x laptop computers

0                                |||||                 5
1                                ||||| ||||| ||        12
2                                ||||| ||||| ||||      14
3                                |||                   3
4                                ||                    2
5                                |||                   3
6                                                      0
7                                |                     1
                                                       ____
Total                                                  40

The row for x = 2 indicates that there are 14 households with 2 laptop computers.
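
Tallying by hand is easy to get wrong, so it can help to check a frequency distribution with code. The sketch below uses Python's `collections.Counter` on the 40 laptop counts listed earlier; the variable names are illustrative.

```python
from collections import Counter

# Number of laptop computers in each of the 40 households (raw data).
laptops = [
    2, 0, 3, 1, 2, 1, 0, 4,
    2, 1, 1, 7, 2, 0, 1, 1,
    0, 2, 2, 1, 3, 2, 2, 1,
    1, 4, 2, 5, 2, 3, 1, 2,
    2, 1, 2, 1, 5, 0, 2, 5,
]

counts = Counter(laptops)
# List every value from the smallest to the largest, including values
# that never occur (such as 6), so the table has no gaps.
frequency = {x: counts.get(x, 0) for x in range(min(laptops), max(laptops) + 1)}

for x, f in frequency.items():
    print(x, f)
print("total", sum(frequency.values()))
```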

The formula for a weighted mean can be used to find the mean of the data in a
frequency distribution. The only change is that the weights w1 , w2, w3, . . . wn are replaced
with the frequencies f1, f2, f3, . . . fn. This procedure is illustrated in the next example.


Example 1:

Find the mean of the data displayed in the frequency distribution in Table 4.4.

Solution:

The numbers in the right-hand column of Table 4.4 are the frequencies f for the numbers in the
first column. The sum of all the frequencies is 40.

\[ \text{Mean} = \bar{x} = \frac{\sum (x \cdot f)}{n} = \frac{(0 \cdot 5) + (1 \cdot 12) + (2 \cdot 14) + (3 \cdot 3) + (4 \cdot 2) + (5 \cdot 3) + (6 \cdot 0) + (7 \cdot 1)}{40} = \frac{79}{40} = 1.975 \]

The mean number of laptop computers per household for the homes in the subdivision is
1.975.
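
The same mean can be computed from the frequency distribution in Python. In this sketch the dictionary maps each data value x to its frequency f, as in the laptop frequency table; the names are illustrative.

```python
# Frequency distribution: number of laptops x -> number of households f.
frequency = {0: 5, 1: 12, 2: 14, 3: 3, 4: 2, 5: 3, 6: 0, 7: 1}

n = sum(frequency.values())                          # total number of data items
sum_xf = sum(x * f for x, f in frequency.items())    # sum of the products x * f
mean_laptops = sum_xf / n
print(mean_laptops)  # 1.975
```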

Example 2 :

Students’ Stress-Level Ratings (using the formula)

Table 4.5:

Stress Rating (x)   Frequency (f)   x · f
0                   2               0 · 2 = 0
1                   1               1 · 1 = 1
2                   3               2 · 3 = 6
3                   12              3 · 12 = 36
4                   16              4 · 16 = 64
5                   18              5 · 18 = 90
6                   13              6 · 13 = 78
7                   31              7 · 31 = 217
8                   26              8 · 26 = 208
9                   15              9 · 15 = 135
10                  14              10 · 14 = 140

Totals:             n = 151         Σxf = 975

\[ \text{Mean} = \bar{x} = \frac{\sum xf}{n} = \frac{975}{151} \approx 6.46 \]

The mean of the 0 to 10 stress-level ratings is approximately 6.46. Notice that the mean is
greater than 5, the middle of the 0 to 10 scale.

A frequency distribution that lists all possible data items can be quite cumbersome
when there are many such items . For example , consider the following data items . These
are statistics test scores for a class of 40 students.

82 47 75 64 57 82 63 93

76 68 84 54 88 77 79 80

94 92 94 80 94 66 81 67

75 73 66 87 76 45 43 56

57 74 50 78 71 84 59 76

It is difficult to determine how well the group did when the grades are displayed like
this. Because there are so many data items, one way to organize these data so that the results
are more meaningful is to arrange the grades into groups, or classes, based on something
that interests us. Many grading systems assign an A to grades in the 90–100 class, B to
grades in the 80–89 class, C to grades in the 70–79 class, and so on. These classes
provide one way to organize the data.

Looking at the 40 statistics test scores, we see that they range from a low of 43 to a
high of 94. We can use classes that run from 40 through 49, 50 through 59, 60 through 69,
and so on up to 90 through 99, to organize the scores. In the example, we go through the
data and tally each item into the appropriate class. This method for organizing data is called a
grouped frequency distribution.

Example: Construct a Grouped Frequency Distribution

Use the classes 40–49, 50–59, 60–69, 70–79, 80–89, and 90–99 to construct a
grouped frequency distribution for the 40 test scores listed above.

Table 4.6:

Class      Tally               Number of students (frequency)
40 - 49    |||                 3
50 - 59    ||||| |             6
60 - 69    ||||| |             6
70 - 79    ||||| ||||| |       11
80 - 89    ||||| ||||          9
90 - 99    |||||               5
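
The tallying step can be sketched in Python. The example assumes classes of width 10 that start at multiples of 10, as in this example, so integer division finds the lower limit of each score's class; the variable names are illustrative.

```python
from collections import Counter

# The 40 statistics test scores.
scores = [
    82, 47, 75, 64, 57, 82, 63, 93,
    76, 68, 84, 54, 88, 77, 79, 80,
    94, 92, 94, 80, 94, 66, 81, 67,
    75, 73, 66, 87, 76, 45, 43, 56,
    57, 74, 50, 78, 71, 84, 59, 76,
]

CLASS_WIDTH = 10
# Map each score to the lower limit of its class, e.g. 47 -> 40, 93 -> 90.
grouped = Counter((s // CLASS_WIDTH) * CLASS_WIDTH for s in scores)

for lower in sorted(grouped):
    upper = lower + CLASS_WIDTH - 1
    print(f"{lower} - {upper}: {grouped[lower]}")
```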

Omitting the tally column results in the grouped frequency distribution in Table 4.7. The
distribution shows that the greatest number of students scored in the 70–79 class. The
number of students generally decreases in classes that contain successively lower and higher scores.
The sum of the frequencies, 40, is equal to the original number of data items.

Table 4.7.

Class      Frequency
40 - 49    3
50 - 59    6
60 - 69    6
70 - 79    11
80 - 89    9
90 - 99    5

Total:     n = 40

The leftmost number in each class of a grouped frequency distribution is called the
lower class limit. For example, in Table 4.7, the lower limit of the first class is 40 and the
lower limit of the third class is 60. The rightmost number in each class is called the upper
class limit. In Table 4.7, 49 and 69 are the upper class limits of the first and third classes,
respectively. Notice that if we take the difference between two consecutive lower class
limits, we get the same number.

50 - 40 = 10, 60 - 50 = 10, 70 - 60 = 10, 80 - 70 = 10, 90 - 80 = 10

The number 10 is called the class width.


When setting up class limits, each class, with the possible exception of the first or
last, should have the same width. Because each data item must fall into exactly one class, it
is sometimes helpful to vary the width of the first or last class to allow for items that fall far
above or below most of the data.

Exercise :

A housing subdivision consists of 45 homes. The following frequency distribution shows the
number of homes in the subdivision that are two-bedroom homes, the number that are
three-bedroom homes, the number that are four-bedroom homes, and the number that are
five-bedroom homes. Find the mean number of bedrooms for the 45 homes.

Observed event:            Frequency:
number of bedrooms, x      number of homes with x bedrooms

2                          5
3                          25
4                          10
5                          5
                           ______
Total                      45

EXERCISE SET 14 :

1. The following table displays the ages of actors when they starred in their Oscar-winning
Best Actor performances in the 1980–2015 Academy Awards.

Table of performances in the 1980–2015 Academy Awards

41 33 31 74 33 49 38 61 21 41 26 80
42 29 33 36 45 49 39 34 26 25 33 35
35 28 30 29 61 32 33 45 66 25 46 55

Find the mean , median and mode for the data in the table.


2. In some 4.0 grading systems, a student’s grade point average (GPA) is calculated by
assigning letter grades the following numerical values.

A  = 4.00    B- = 2.67    D+ = 1.33
A- = 3.67    C+ = 2.33    D  = 1.00
B+ = 3.33    C  = 2.00    D- = 0.67
B  = 3.00    C- = 1.67    F  = 0.00

Use the above grading system to find the student’s GPA.

Aeron’s First Semester Grades.


Course Course grade Course units

English A 3
Anthropology A 3
Chemistry B 4
French C+ 3
Theatre B– 2
History D+ 3
Computer Science B+ 2
Math A- 3

3. Find the mean for the data in the given frequency distribution.

Points Scored by Lebron Harden

Points scored in a basketball game    Frequency
2                                     6
4                                     5
5                                     6
9                                     3
10                                    1
14                                    2
19                                    1


LESSON III. MEASURE OF RELATIVE DISPERSION

In the preceding lesson we introduced three types of averages for a data set: the mean,
the median, and the mode. Some characteristics of a set of data may not be evident from an
examination of averages.

Example 1:

For instance, consider a soft-drink dispensing machine that should dispense 8 oz of your
selection into a cup. Table 4.8 shows data for two of these machines.

Table 4.8 Soda Dispensed (ounces)

Machine 1          Machine 2
9.52               8.01
6.41               7.99
10.07              7.95
5.85               8.03
8.15               8.03

$\bar{x}$ = 8.0    $\bar{x}$ = 8.0

The mean data value for each machine is 8 oz . However , look at the variation in
data values for machine 1 . The quantity of soda dispensed is very inconsistent --- in some
cases the soda overflows the cup , and in other cases too little soda is dispensed. The
machine obviously needs adjustments. Machine 2 , on the other hand , is working just fine.
The quantity dispensed is very consistent , with little variation.

This example shows that average values do not reflect the spread or dispersion
of the data.

Example 2.

When you think of Houston, Texas and Honolulu, Hawaii, do the same temperatures come to
mind? Both cities have a mean temperature of 75°. However, the mean temperature does
not tell the whole story. The temperature in Houston differs seasonally from a low of about
40° in January to a high of close to 100° in July and August. By contrast, Honolulu’s
temperature varies less throughout the year, usually ranging between 60° and 90°.

Measures of dispersion are used to describe the spread of data items in a data set.
To measure the spread or dispersion of data, we introduce two of the most
common statistical values, known as the range and the standard deviation.

The Range

A quick but rough measure of dispersion is the range, the difference between the highest
(greatest) data value and the lowest (least) data value in a data set.

1. For example, if Houston’s hottest annual temperature is 103° and its coldest is 33°, the
range in temperature is

103° - 33° = 70°

If Honolulu’s hottest day is 89° and its coldest day is 61°, the range in temperature is

89° - 61° = 28°

2. Find the range of the numbers of ounces dispensed by machine 1 in the given table.

Solution :

The greatest number of ounces dispensed is 10.07 and the least is 5.85 . The range of the
numbers of ounces dispensed is 10.07 - 5.85 = 4.22 oz.
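
The range is simple enough to compute in one line of code. The sketch below uses the Machine 1 values from the soda table; the function name `data_range` is an illustrative choice, and the result is rounded because decimal values are stored only approximately in floating point.

```python
def data_range(values):
    """Range: highest data value minus lowest data value."""
    return max(values) - min(values)

# Ounces dispensed by machine 1.
machine_1 = [9.52, 6.41, 10.07, 5.85, 8.15]
range_1 = round(data_range(machine_1), 2)
print(range_1)  # 4.22
```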

The Range

The range , the difference between the highest and the lowest data values in a data set ,
indicates the total spread of the data.

Range = highest data value - lowest data value

Exercises:

1. Find the range of the numbers of ounces dispensed by machine 2.

2. Find the range for each group of data items.

a. 16, 17 , 18 , 19, 20

b. 11, 13 , 14 , 15 , 16 , 17

c. 3, 3, 4, 4, 5, 5

A second measure of dispersion, and one that is dependent on all of the data items, is
called the standard deviation. The standard deviation is found by determining how much
each data item differs from the mean.

In order to compute the standard deviation, it is necessary to find by how much each
data item deviates from the mean. First compute the mean. Then subtract the mean from
each data item. The example shows how it is done.

Example: Preparing to find the standard deviation; finding deviations from the mean.

Find the deviations from the mean for the five data items 778, 472, 147, 106, and 82,
the numbers of workers (in millions) in the countries with the most workers.

Solution:

First, calculate the mean.

\[ \text{Mean} = \bar{x} = \frac{\sum x}{n} = \frac{778 + 472 + 147 + 106 + 82}{5} = \frac{1585}{5} = 317 \text{ million} \]

Deviation = data item - mean

The mean for the countries with the largest labor forces is 317 million workers. Now, let’s
find by how much each of the five data items differs from 317, the mean.

Table 4.9 Deviations from the mean

Data item (x)                      Data item - mean ($x - \bar{x}$)
778 (labor force of China)         778 - 317 = 461
472 (labor force of India)         472 - 317 = 155
147 (labor force of USA)           147 - 317 = -170
106 (labor force of Indonesia)     106 - 317 = -211
82 (labor force of Brazil)         82 - 317 = -235

\[ \sum x = 1585, \qquad \text{Mean} = \bar{x} = \frac{\sum x}{n} = \frac{1585}{5} = 317 \]

For China , with 778 million workers , the computation is shown as follows:

Deviation from mean = data item - mean

= 778 - 317 = 461

This indicates that the labor force in China exceeds the mean by 461 million workers.

The computation for the United States, with 147 million workers, is given by

Deviation from the mean = 147 - 317 = -170

This indicates that the labor force in the United States is 170 million workers below the mean.

The sum of the deviations for a set of data is always zero. For the deviations in the table above:

461 + 155 + (-170) + (-211) + (-235) = 616 + (-616) = 0
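
A quick Python check of this fact, using the five labor-force figures above (variable names are illustrative):

```python
# Labor forces (in millions) of the five countries.
workers = [778, 472, 147, 106, 82]

mean_workers = sum(workers) / len(workers)          # 317.0
deviations = [x - mean_workers for x in workers]    # data item - mean

print(deviations)        # [461.0, 155.0, -170.0, -211.0, -235.0]
print(sum(deviations))   # 0.0
```

The positive and negative deviations cancel, which is why the plain mean of the deviations cannot serve as a measure of spread.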


This shows that we cannot find a measure of dispersion by finding the mean of the
deviations, because this value is always zero. However, a kind of average of the deviations
from the mean, called the standard deviation, can be computed. We do so by squaring
each deviation and later introducing a square root in the computation. Here are the details
on how to find the standard deviation for a set of data.


COMPUTING THE STANDARD DEVIATION FOR A DATA SET

1. Find the mean of the data items:

\[ \bar{x} = \frac{\sum x}{n} \]

2. Find the deviation of each data item from the mean:

\[ \text{data item} - \text{mean} = x - \bar{x} \]

3. Square each deviation:

\[ (\text{data item} - \text{mean})^2 = (x - \bar{x})^2 \]

4. Sum the squared deviations:

\[ \sum (\text{data item} - \text{mean})^2 = \sum (x - \bar{x})^2 \]

5. Divide the sum in step 4 by n - 1, where n represents the number of data items:

\[ \frac{\sum (\text{data item} - \text{mean})^2}{n - 1} = \frac{\sum (x - \bar{x})^2}{n - 1} \]

6. Take the square root of the quotient in step 5. This value is the standard deviation for
the data set.

\[ \text{Standard deviation} = \sqrt{\frac{\sum (\text{data item} - \text{mean})^2}{n - 1}} = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} \]

The computation of the standard deviation can be organized using a table with three
columns.

Data item    Deviation:                      (Deviation)^2:
(x)          data item - mean                (data item - mean)^2
             ($x - \bar{x}$)                 ($(x - \bar{x})^2$)


Example: Table 4.10 shows the number of workers, in millions, for the five
countries with the largest labor forces. Find the standard deviation, in millions, for these
five countries.

Table 4.10

Data item    Deviation:               (Deviation)^2:
(x)          data item - mean         (data item - mean)^2
             ($x - \bar{x}$)          ($(x - \bar{x})^2$)

778          778 - 317 = 461          (461)^2 = (461)(461) = 212,521
472          472 - 317 = 155          (155)^2 = (155)(155) = 24,025
147          147 - 317 = -170         (-170)^2 = (-170)(-170) = 28,900
106          106 - 317 = -211         (-211)^2 = (-211)(-211) = 44,521
82           82 - 317 = -235          (-235)^2 = (-235)(-235) = 55,225

Totals:      0                        $\sum (x - \bar{x})^2$ = 365,192

\[ \text{Standard deviation} = \sqrt{\frac{\sum (\text{data item} - \text{mean})^2}{n - 1}} = \sqrt{\frac{365,192}{4}} = \sqrt{91,298} \approx 302.16 \]


The standard deviation for the five countries with the largest labor force is approximately
302.16 million workers.
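
The six-step procedure can be sketched as a short Python function. The function name `sample_standard_deviation` is an illustrative choice; it divides by n - 1, as in step 5, and the result is cross-checked against `statistics.stdev` from the standard library, which uses the same n - 1 divisor.

```python
import math
import statistics

def sample_standard_deviation(values):
    """Standard deviation with n - 1 in the denominator (steps 1 to 6)."""
    n = len(values)
    mean = sum(values) / n                              # step 1
    squared = [(x - mean) ** 2 for x in values]         # steps 2 and 3
    return math.sqrt(sum(squared) / (n - 1))            # steps 4, 5, and 6

workers = [778, 472, 147, 106, 82]   # labor forces in millions
sd = sample_standard_deviation(workers)
print(round(sd, 2))                              # 302.16
print(round(statistics.stdev(workers), 2))       # 302.16
```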

Exercises:

A consumer group has tested a sample of 8 size-D batteries from each of 3 companies. The
results of the tests are shown in the following table. According to these tests, which
company produces batteries for which the values representing hours of constant use have
the smallest standard deviation?

Company Hours of constant use per battery

Ever ready 6.2 , 6.4 , 7.1 , 5.9 , 8.3 , 5.3 , 7.5 , 9.3

Energizer 6.8 , 6.2 , 7.2 , 5.9 , 7.0 , 7.4 , 7.3 , 8.2

Dependable 6.1 , 6.6 , 7.3 , 5.7 , 7.1 , 7.6 , 7.1 , 8.5


The Variance

A statistic known as the variance is also used as a measure of dispersion. The variance for
a given set of data is the square of the standard deviation of the data. The following chart
shows the mathematical notations that are used to denote standard deviation and
variance.

Notations for Standard Deviation and Variance

σ is the standard deviation of a population.

σ² is the variance of a population.

s is the standard deviation of a sample.

s² is the variance of a sample.


Example : The following numbers were obtained by sampling a population. Find the
standard deviation and the variance.

2 , 4 , 7 , 12 , 15

Solution :

\[ \text{Mean} = \bar{x} = \frac{2 + 4 + 7 + 12 + 15}{5} = \frac{40}{5} = 8 \]

Table 4.11

Data item    Deviation:                (Deviation)^2:
             (data item - mean)        (data item - mean)^2

2            2 - 8 = -6                (-6)^2 = 36
4            4 - 8 = -4                (-4)^2 = 16
7            7 - 8 = -1                (-1)^2 = 1
12           12 - 8 = 4                (4)^2 = 16
15           15 - 8 = 7                (7)^2 = 49

                                       Sum of the squared deviations: 118

\[ \text{Standard deviation} = s = \sqrt{\frac{\sum (\text{data item} - \text{mean})^2}{n - 1}} = \sqrt{\frac{118}{4}} = \sqrt{29.5} \approx 5.43 \]

\[ \text{Variance} = s^2 = (\sqrt{29.5})^2 = 29.5 \]
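
This example can be verified with Python's standard `statistics` module, whose `variance` and `stdev` functions treat the data as a sample (dividing by n - 1). The variable names below are illustrative.

```python
import statistics

data = [2, 4, 7, 12, 15]   # values obtained by sampling a population

sample_variance = statistics.variance(data)   # s^2, divides by n - 1
sample_sd = statistics.stdev(data)            # s, the square root of s^2

print(sample_variance)        # 29.5
print(round(sample_sd, 2))    # 5.43
```

For population data, the corresponding functions would be `statistics.pvariance` and `statistics.pstdev`, which divide by n instead.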


EXERCISE SET 15:

Find the Range , the standard deviation , and the variance for the following:

1. 1, 2 , 5 , 7 , 19 , 22

2. 3 , 4 , 7 , 11 , 12 , 12 , 15 , 16

3. 78 , 91, 87 , 93 , 59 , 68 , 92 , 100 , 81
4. 93 , 67 , 49 , 55 , 92 , 87 , 77 , 66 , 73 , 96 , 54

5. 8,6,8,6,8,6,8,6,8,6,8,6,8


LESSON IV : MEASURE OF RELATIVE POSITION

Consider an Internet site that offers movie downloads. Based on data kept by the
site, an estimate of the mean time to download a certain movie is 12 min, with a standard
deviation of 4 min. When you download this movie, the download takes 20 min, which you
think is an unusually long time for the download. On the other hand, when your friend
downloads the movie, the download takes only 6 min, and your friend is pleasantly
surprised at how quickly she receives the movie. In each case, a data value far from the
mean is unexpected.

The number of standard deviations between a data value and the mean is known as
the data value’s z – score or standard score .

z – Score

The z-score for a given data value x is the number of standard deviations that x is above or
below the mean of the data. The following formulas show how to calculate the z-score for a
data value x in a population and in a sample. In words, the z-score is (data item − mean) divided by the standard deviation.

Population: $z_x = \frac{x - \mu}{\sigma}$

Sample: $z_x = \frac{x - \bar{x}}{s}$

Where : zx = z-score ( of population or sample )

x = data item

μ = mean of the population

$\bar{x}$ = mean of the sample

s = standard deviation of the sample

σ = standard deviation of the population

The z-score equation involves four variables. If the values of any three of the four
variables are known , you can solve for the unknown variable.

Example : Aggu Utang has taken two tests in his chemistry class . He scored 72 on the first
test , for which the mean of all scores was 65 and the standard deviation was 8 . He
received a 60 on a second test , for which the mean of all scores was 45 and the standard
deviation was 12. In comparison to the other students , did Aggu Utang do better on the
first test or the second test ?

76

Solution :

Find the z-score for each test.


$z_{72} = \frac{72 - 65}{8} = 0.875$

$z_{60} = \frac{60 - 45}{12} = 1.25$

Aggu Utang scored 0.875 standard deviation above the mean on the first test and 1.25
standard deviations above the mean on the second test . The z-score indicates that , in
comparison to his classmates , Aggu Utang scored better on the second test than he did on
the first test.
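The comparison above can be reproduced with a small Python sketch. The function name `z_score` is an illustrative choice, not something defined in this unit.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / std_dev

first_test = z_score(72, mean=65, std_dev=8)
second_test = z_score(60, mean=45, std_dev=12)

# The larger z-score marks the better result relative to the class
better = "second" if second_test > first_test else "first"
print(first_test, second_test, better)
```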

Percentiles

Most standardized examinations provide scores in terms of percentiles , which are defined
as follows :

pth Percentile

A value x is called the pth percentile of a data set provided p% of the data values are less than
x.

Example: In a recent year, the median annual salary of a Medical Technologist was ₱185,698.00. If
the 90th percentile for the annual salary of a Medical Technologist was ₱205,500.00, find the
percent of Medical Technologists whose annual salary was

a. more than ₱185,698.00

b. less than ₱205,500.00

c. between ₱185,698.00 and ₱205,500.00

77

Solution :

a. By definition, the median is the 50th percentile. Therefore, 50% of the Medical
Technologists earned more than ₱185,698.00.

b. Because ₱205,500.00 is the 90th percentile, 90% of all Medical Technologists made less
than ₱205,500.00.

c. From parts a and b, 90% − 50% = 40% of the Medical Technologists earned between
₱185,698.00 and ₱205,500.00.

The following formula can be used to find the percentile that corresponds to a data value
in a set of data.

Percentile for a Given Data Value

Given a set of data and a data value x ,

$\text{Percentile of score } x = \frac{\text{number of data values less than } x}{\text{total number of data values}} \cdot 100$

Example :

On a reading examination given to 950 students, Jack Ammu’s score of 652 was higher than
the scores of 580 of the students who took the examination. What is the percentile for Jack
Ammu’s score?

$\text{Percentile} = \frac{\text{number of data values less than } 652}{\text{total number of data values}} \cdot 100 = \frac{580}{950} \cdot 100 \approx 61$

Jack Ammu’s score of 652 places him at the 61st percentile.
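A minimal Python sketch of the percentile formula follows. The helper names are hypothetical; the second helper assumes the raw scores are available as a list, which is not the case in the example above.

```python
def percentile_from_counts(count_below, total):
    """Percentile = (number of values less than x / total number of values) * 100."""
    return 100 * count_below / total

def percentile_of(x, values):
    """Same formula, counting the values below x directly from a data list."""
    count_below = sum(1 for v in values if v < x)
    return percentile_from_counts(count_below, len(values))

jack = percentile_from_counts(580, 950)
print(round(jack))  # rounded to the nearest whole percentile
```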

78

Quartiles

The three numbers Q1 , Q2 , and Q3 , that partition a ranked data set into four
( approximately ) equal groups are called the quartiles of the data .

Example: For the data set below, Q1 = 10, Q2 = 31, and Q3 = 79 are the quartiles of the data.

3, 3, 5, 7, 10, 13, 15, 21, 25, 31, 34, 38, 41, 66, 79, 88, 96, 105, 178, 286

The quartile Q1 , is called the first quartile . The quartile Q2 , is called the second quartile. It
is the median of the data. The quartile Q3 , is called the third quartile. The following
method of finding the quartile makes use of the medians.

The Median Procedure for Finding Quartiles

1. Rank the data.

2. Find the median of the data. This is the second quartile, Q2.

3. The first quartile, Q1, is the median of the data values less than Q2. The third quartile, Q3, is
the median of the data values greater than Q2.

Example: Use Medians to Find the Quartiles of a Data Set

The following table lists the calories per 100 milliliters of 25 popular soft drinks. Find the
quartiles for the data.

Calories per 100 milliliters , of selected soft drinks

43 37 42 40 53 62 36 32 50 49 26 53 73

48 45 39 45 48 40 56 41 36 58 42 39

79

Solution :

Step 1: Rank the data as shown in the following table.

 1) 26     11) 42     21) 53
 2) 32     12) 42     22) 56
 3) 36     13) 43     23) 58
 4) 36     14) 45     24) 62
 5) 37     15) 45     25) 73
 6) 39     16) 48
 7) 39     17) 48
 8) 40     18) 49
 9) 40     19) 50
10) 41     20) 53

Step 2: The median of these 25 data values has a rank of 13. Thus the median is 43. The
second quartile, Q2, is the median of the data, so Q2 = 43.

Step 3: There are 12 data values less than the median and 12 data values greater than the
median. The first quartile is the median of the data values less than the median. Thus Q1 is
the mean of the data values with ranks 6 and 7.

$Q_1 = \frac{39 + 39}{2} = 39$

The third quartile is the median of the data values greater than the median. Thus Q3 is
the mean of the data values with ranks 19 and 20.

$Q_3 = \frac{50 + 53}{2} = 51.5$
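The median procedure can be sketched in Python as follows. This is one possible reading of the procedure: it assumes the lower and upper groups are split by rank position (the middle value is left out when the number of data values is odd). The function name is illustrative.

```python
import statistics

def quartiles_by_medians(values):
    """Return (Q1, Q2, Q3) using the median procedure."""
    ranked = sorted(values)                 # Step 1: rank the data
    n = len(ranked)
    q2 = statistics.median(ranked)          # Step 2: the median is Q2
    half = n // 2
    lower = ranked[:half]                   # values ranked below the median
    # Skip the middle value itself when n is odd
    upper = ranked[half + 1:] if n % 2 else ranked[half:]
    # Step 3: Q1 and Q3 are the medians of the lower and upper groups
    return statistics.median(lower), q2, statistics.median(upper)

calories = [43, 37, 42, 40, 53, 62, 36, 32, 50, 49, 26, 53, 73,
            48, 45, 39, 45, 48, 40, 56, 41, 36, 58, 42, 39]
q1, q2, q3 = quartiles_by_medians(calories)
print(q1, q2, q3)
```

Other quartile conventions exist (for example, interpolation methods used by some software), so results from other tools may differ slightly from this procedure.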

80

EXERCISE SET 16 :

1. A data set has a mean of x = 75 and a standard deviation of 11.5 . Find the z-score for
each of the following:
a. x = 85 b. x = 95

c. x = 50 d. x = 75

2. Which of the following three test scores is the highest relative score?

a. A score of 65 on a test with a mean of 72 and a standard deviation of 8.2.

b. A score of 102 on a test with a mean of 130 and a standard deviation of 18.5.

c. A score of 605 on a test with a mean of 720 and a standard deviation of 116.4.

81

LESSON V : THE NORMAL DISTRIBUTION

Frequency Distributions and Histogram

Large sets of data are often displayed using a grouped frequency distribution or a
histogram. For instance, consider the following situation. An Internet Service Provider
(ISP) has installed new computers. To estimate the new download times its subscribers
will experience, the ISP surveyed 1000 of its subscribers to determine the time required
for each subscriber to download a particular file from an Internet site. The results of that
survey are summarized in Table 4.12.

A Grouped Frequency Distribution with 12 Classes

Table 4.12:

Download time      Number of
(in seconds)       subscribers

0 - 5              6
5 - 10             17
10 - 15            43
15 - 20            92
20 - 25            151
25 - 30            192
30 - 35            190
35 - 40            149
40 - 45            90
45 - 50            45
50 - 55            15
55 - 60            10
82

[Figure 4.1: Histogram for the frequency distribution. Vertical axis: number of subscribers (0 to 200); horizontal axis: download time in seconds (0 to 60).]

Table 4.12 is called a grouped frequency distribution. It shows how often
(frequency) certain events occurred. Each interval, 0 - 5, 5 - 10, and so on, is called a
class. This distribution has 12 classes. For the 10 - 15 class, 10 is the lower class
boundary and 15 is the upper class boundary. Any data value that lies on a common
boundary is assigned to the higher class. The graph of a frequency distribution is called a
histogram. A histogram provides a pictorial view of how the data are distributed. In Figure
4.1, the height of each bar of the histogram indicates how many subscribers experienced
the download times shown by the class on the base of the bar.

The type of frequency distribution that lists the percent of data in each class is
called a relative frequency distribution. The relative frequency histogram was drawn by
using the data in the relative frequency distribution. It shows the percent of subscribers
along its vertical axis.

83

One advantage of using a relative frequency distribution instead of a grouped
frequency distribution is that there is a direct correspondence between the percent values
of the relative frequency distribution and probabilities.

Example : Use a Relative Frequency Distribution

Use the relative frequency distribution in Table 4.14 to determine

a. the percent of subscribers who required at least 25 seconds to download the file.

b. the probability that a subscriber chosen at random will require at least 5 seconds but less
than 20 seconds to download the file.

Solution :

a. The percent of data in all the classes with a lower boundary of 25 seconds or more is the
sum of the percents marked "part a" in Table 4.14 below. Thus the percent of subscribers
who required at least 25 seconds to download the file is 69.1%.

Table 4.14

Download time      Percent of
(in seconds)       subscribers

0 - 5              0.6
5 - 10             1.7      (part b)
10 - 15            4.3      (part b)
15 - 20            9.2      (part b)
20 - 25            15.1
25 - 30            19.2     (part a)
30 - 35            19.0     (part a)
35 - 40            14.9     (part a)
40 - 45            9.0      (part a)
45 - 50            4.5      (part a)
50 - 55            1.5      (part a)
55 - 60            1.0      (part a)

Sum for part a: 69.1%
Sum for part b: 15.2%

b. The percent of data in all the classes with a lower boundary of at least 5 seconds and an
upper boundary of at most 20 seconds is the sum of the percents marked "part b" in Table 4.14
above. Thus, the percent of subscribers who required at least
5 seconds but less than 20 seconds to download the file is 15.2%. The probability that a
subscriber chosen at random will require at least 5 seconds but less than 20 seconds to
download the file is 0.152.
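The two sums can be checked with a brief Python sketch that starts from the subscriber counts in Table 4.12 and converts them to percents. The dictionary layout and variable names are illustrative assumptions.

```python
# (lower boundary, upper boundary) in seconds -> number of subscribers (Table 4.12)
counts = {
    (0, 5): 6, (5, 10): 17, (10, 15): 43, (15, 20): 92,
    (20, 25): 151, (25, 30): 192, (30, 35): 190, (35, 40): 149,
    (40, 45): 90, (45, 50): 45, (50, 55): 15, (55, 60): 10,
}

total = sum(counts.values())
# Relative frequency: percent of all subscribers in each class
percents = {cls: 100 * n / total for cls, n in counts.items()}

# Part a: classes whose lower boundary is 25 seconds or more
at_least_25 = sum(p for (low, high), p in percents.items() if low >= 25)

# Part b: classes from 5 seconds up to (but not including) 20 seconds
from_5_to_20 = sum(p for (low, high), p in percents.items()
                   if low >= 5 and high <= 20)
probability = from_5_to_20 / 100

print(round(at_least_25, 1), round(from_5_to_20, 1), round(probability, 3))
```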

84

A normal distribution is a continuous probability distribution. This means that it
generally uses either interval or ratio data. A histogram can be used to judge whether data are
approximately normally distributed: if a bell-shaped curve drawn over the histogram fits the bars, the data
approximately follow a normal distribution. A bell-shaped curve indicates that there is one central
peak. The rest of the data lie on either side of the center, tapering off toward the extremes.

[Figure 4.2: Results of the Preliminary Examination. Histogram with classes 0-9, 10-14, 15-19, 20-24, 25-30, 31-34, 35-39, 40-44, 45-49.]

[Figure 4.3: Results of the Midterm Examination. Histogram with classes 19-23, 24-28, 29-33, 34-38, 39-43, 44-48, 49-53.]

85

[Figure 4.4: Histogram of frequency for the classes 9-18, 19-28, 29-38, 39-48.]

Figures 4.2 and 4.3 show non-normal distributions . Figure 4.2 has two peaks. There is also
a gap in the data. The peak of figure 4.3 is not centered which violates the concept of a bell.
Figure 4.4 shows a normal distribution.

A normal distribution has the following properties :

1. Its graph is a bell-shaped curve.

2. The total area under the normal curve is 1.

3. The tails of the normal curve are asymptotic to the horizontal axis.

4. The curve is symmetric about the mean.

5. It is determined by the population mean μ and the population standard deviation σ. The
mean controls the center and the standard deviation controls the spread of the
distribution.

6. The mean, median, and mode have the same value.

86

The standard normal distribution has the same properties as the normal distribution,
except that its mean is 0 and its standard deviation is 1.
87

It was stated that the normal distribution is symmetric about the mean. This signifies that
the area for a z-value is the same whether the z-value is positive or negative. Hence, the area of −z is
equal to the area of +z.

Probabilities are used with the normal distribution. Probabilities range from 0
to 1. This means that the values of areas cannot be negative. Moreover, they cannot
have values greater than 1.

The notation P ( a < z < b ) , P ( z < a ) and P ( z > a ) will be used and their
meanings are as follows :

 P ( a < z < b ) is read as “the probability or area of z between a and b.”

 P ( z < a ) is read as “the probability or area of z less than a or to the left of a.”

 P ( z > a ) is read as “the probability or area of z greater than a or to the right of a.”

Note that, for areas under the normal curve, the symbols ≤ and ≥ have the same meanings as < and >. To find the areas, the
Table of Areas under the Normal Curve will be used. The table is also known as the z-table.

88
89
Using the z-table , the area of z = -0.46 is 0.1772 and the area of z = 0.52 is 0.1985.

For z = −0.46, look for 0.4 in the z column and then for the column headed 0.06. The entry at the
intersection of the row for 0.4 and the column for 0.06 is the area, which is 0.1772. The
same goes for z = 0.52, which has an area of 0.1985: look for 0.5 in the z column and for the column headed 0.02.
The entry at the intersection of the row for 0.5 and the column for 0.02 is the area, which is 0.1985.
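When a printed z-table is not at hand, the same areas can be approximated in Python. The sketch below assumes the table lists the area between the mean (z = 0) and z, which is consistent with the two values quoted above; `area_from_mean` is a hypothetical helper name.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1

def area_from_mean(z):
    """Area under the standard normal curve between 0 and z.

    By symmetry the result is the same for -z and +z.
    """
    return abs(STANDARD_NORMAL.cdf(z) - 0.5)

print(round(area_from_mean(-0.46), 4))
print(round(area_from_mean(0.52), 4))
```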

To find the areas under the normal curve, three things must be done:

1. Draw the normal curve.

2. Shade the appropriate region .

3. Calculate the area by using the Table of Areas under the Normal Curve.

Example :

1. P ( -0.72 < z < 0 ) , therefore the answer is 0.2642.

90
2. P ( -2.58 < z < 2.58 )

Since the mean is included in the shaded region, the two areas must be added. The area of 2.58 is 0.4951.
Therefore, 0.4951 + 0.4951 = 0.9902. Thus, the area is 0.9902.

3. P ( z > 1.95 )

Since the shaded area is on the extreme right, the area of 1.95 must be subtracted from 0.5.
The area of 1.95 is 0.4744. Thus, 0.5 − 0.4744 = 0.0256. Therefore, the answer is 0.0256.
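The three examples can be cross-checked numerically. This is a sketch using Python's standard library rather than the printed table, so the results may differ from the table values in the fourth decimal place because the table entries are rounded. The helper names are illustrative.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution

def prob_between(a, b):
    """P(a < z < b): area under the curve between a and b."""
    return Z.cdf(b) - Z.cdf(a)

def prob_greater(a):
    """P(z > a): area in the right tail beyond a."""
    return 1 - Z.cdf(a)

p1 = prob_between(-0.72, 0)      # Example 1
p2 = prob_between(-2.58, 2.58)   # Example 2: areas on both sides of the mean
p3 = prob_greater(1.95)          # Example 3: right-tail area

print(p1, p2, p3)
```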

91
If the areas are given , what are the values of z ? Here are some examples :

1. Find z0 such that P ( z > z0 ) = 0.0125

Since the area given is less than 0.5, the shaded area is on the extreme left or extreme right.
However, the inequality z > z0 shows that the shaded area is at the extreme
right.

Since the shaded area is at the extreme right, the area is to be subtracted from 0.5. Therefore,

0.5 − 0.0125 = 0.4875

Obtaining the exact or closest value from the z-table, the z-score is 2.24.

2. Find the values of ±z0 such that the area between −z0 and z0 is 0.8452.

Since the area given is more than 0.5 and there are two values of z0 to be obtained,
0.8452 has to be divided by 2:

$\frac{0.8452}{2} = 0.4226$

Therefore, obtaining from the z-table the exact value or the value closest to 0.4226,
the z-score is ±1.42.
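Finding z from a given area is the inverse of a table lookup. A minimal sketch, assuming Python's `statistics.NormalDist.inv_cdf` (which takes the cumulative area to the left of z) is an acceptable substitute for searching the z-table:

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution

# Example 1: P(z > z0) = 0.0125, so the area to the left of z0 is 1 - 0.0125
z_right_tail = Z.inv_cdf(1 - 0.0125)

# Example 2: central area 0.8452 between -z0 and z0, so the area to the
# left of +z0 is 0.5 plus half of the central area
z_central = Z.inv_cdf(0.5 + 0.8452 / 2)

print(round(z_right_tail, 2), round(z_central, 2))
```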
92

There are various applications of the normal distribution to real-life problems. As such ,
these problems are to be transformed to the standard normal distribution which makes use
of the formula:

$z = \frac{x - \mu}{\sigma}$

Where z = standard normal score

x = random variable

μ = population mean

σ = population standard deviation

Note that the calculated value of z is to be rounded to the hundredths place.

Examples:

1. Thirteen students who took the final exam last term have a mean grade of 34.08 and a
standard deviation of 7.62.

a. What is the probability that Akiwikiwag will get more than 40 in the final exam?

$z = \frac{40 - 34.08}{7.62} = 0.78$

From the z-table, the area of 0.78 is 0.2823. Since the region z > 0.78 is at the extreme right,
P ( z > 0.78 ) = 0.5 − 0.2823 = 0.2177. This means that Akiwikiwag has a 21.77% chance of
getting more than 40 in the final exam.

b. What is the probability that Akiwikiwag will get a score between 30 and 40?

$z_1 = \frac{30 - 34.08}{7.62} = -0.54$ and $z_2 = \frac{40 - 34.08}{7.62} = 0.78$
93

Therefore, the areas of -0.54 and 0.78 are added.

From the z-table, the area of -0.54 = 0.2054

The area of 0.78 = 0.2823

Sum = 0.4877

This means that Akiwikiwag has a 48.77% chance of getting a score between 30 and 40.

2. The average age at which Filipino men undergo the sacrament of matrimony is 29 years, with a standard
deviation of 2.5 years. Peter, aged 26, is contemplating whether he should marry already. What is
the probability that he will marry before he reaches 30?

$z_1 = \frac{26 - 29}{2.5} = -1.2$ and $z_2 = \frac{30 - 29}{2.5} = 0.4$

Therefore, the areas of −1.2 and 0.4 are added.

area of -1.2 = 0.3849

area of 0.4 = 0.1554

Sum = 0.5403

This means that Peter has a 54.03% chance of marrying between 26 and 30 years old.
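Both worked examples follow the same pattern: convert each value to a z-score rounded to the hundredths place, then read areas from the standard normal curve. The Python sketch below mirrors that pattern; the function names are illustrative, and the computed probabilities may differ from the table-based answers in the fourth decimal place because the table entries are rounded.

```python
from statistics import NormalDist

STANDARD = NormalDist()  # standard normal distribution

def z_score(x, mean, std_dev):
    """z = (x - mean) / std_dev, rounded to the hundredths place as in the text."""
    return round((x - mean) / std_dev, 2)

def prob_above(x, mean, std_dev):
    """P(X > x) for a normal variable with the given mean and standard deviation."""
    return 1 - STANDARD.cdf(z_score(x, mean, std_dev))

def prob_between(low, high, mean, std_dev):
    """P(low < X < high) for a normal variable."""
    return (STANDARD.cdf(z_score(high, mean, std_dev))
            - STANDARD.cdf(z_score(low, mean, std_dev)))

exam_above_40 = prob_above(40, mean=34.08, std_dev=7.62)
exam_30_to_40 = prob_between(30, 40, mean=34.08, std_dev=7.62)
marry_26_to_30 = prob_between(26, 30, mean=29, std_dev=2.5)

print(exam_above_40, exam_30_to_40, marry_26_to_30)
```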


94

Exercise Set 17 :

1. Find the area of the standard normal distribution between z = -1.44 and z = 0.

2. Find the area of the standard normal distribution between z = - 0.67 and z = 0.

3. A soft drink machine dispenses soft drinks into 12 – ounce cups. Tests show that the
actual amount of soft drinks dispensed is normally distributed , with a mean of 11.5 oz and
a standard deviation of 0.2 oz.

a. What percent of cups will receive less than 11.25 oz of soft drinks ?

b. What percent of cups will receive between 11.2 and 11.5 oz of soft drinks ?

c. If a cup is filled at random , what is the probability that the machine will overflow the
cup ?
95

LESSON VI : LINEAR REGRESSION AND CORRELATION

Correlation analysis has touched quantitative research in many ways. Relationships
among variables are very important because they can explain certain phenomena that
would eventually contribute to the whole well-being of humanity.

Linear Regression

When performing research studies, scientists often wish to know whether two
variables are related. If the variables are determined to be related, a scientist may then
wish to find an equation that can be used to model the relationship. For instance, a
geologist might want to know whether there is a relationship between the duration of an
eruption of a geyser and the time between eruptions. A first step in this determination is
to collect some data. Data involving two variables are called bivariate data. Table 6.1 gives
bivariate data showing the time between two eruptions and the duration of the second
eruption for 10 eruptions of the geyser Old Faithful.

Table 6.1 :

Time between eruptions (in seconds), x     272  227  237  238  203  270  218  226  250  245

Duration of eruption (in seconds), y        89   79   83   82   81   85   78   81   85   79

Once the data are collected , a scatter diagram or scatter plot can be drawn , as shown in
Figure 6.1
96

[Figure 6.1: Scatter plot of the data in Table 6.1. Horizontal axis: seconds between eruptions; vertical axis: length of eruption (in seconds).]

One way for a geologist to create a model of the relationship between the time between two
eruptions and the duration of the second eruption is to find a line that approximates the
data points plotted in the scatter plot (the dots). There are many lines that can be drawn
in Figure 6.1.

Of all the possible lines that can be drawn, the one that is usually of most interest is
called the line of best fit or the least-squares regression line. The least-squares line is the
line that fits the data better than any other line that might be drawn. The least-squares
regression line is defined as follows.

97

The Least- Squares Regression Line

The least-squares regression line for a set of bivariate data is the line that minimizes the sum
of the squares of the vertical deviations from each data point to the line. The least-squares
regression line, or simple linear regression line, is used to develop an equation that predicts
values of the dependent variable from values of the independent variable.

In this definition, the phrase “minimizes the sum of the squares of the vertical deviations”
means that, of all possible lines, the linear equation that minimizes the sum

$d_1^2 + d_2^2 + d_3^2 + d_4^2 + d_5^2 + d_6^2 + d_7^2 + d_8^2 + d_9^2 + d_{10}^2$

is the equation of the line of best fit. In this expression, each $d_n$ represents the vertical distance
from data point n to the line.

[Figure 6.2: Scatter plot of the data in Table 6.1 with the vertical deviations d1 through d10 from each data point to a line. Horizontal axis: seconds between eruptions; vertical axis: length of eruption.]

98

Applying some techniques from calculus , it is possible to find a formula for the least-
squares line.

The Formula for the least-squares Line

The equation of the least-squares line for the n ordered pairs

$(x_1, y_1), (x_2, y_2), (x_3, y_3), \ldots, (x_n, y_n)$

is the regression line, or prediction line, drawn on the scatter plot. It is given by

$\hat{y} = ax + b$,

where : $\hat{y}$ = predicted value of the dependent variable, y

a = slope of the regression line

x = value of the independent variable

b = y-intercept of the regression line

$a = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2}$ and $b = \bar{y} - a\bar{x}$

To apply this formula to the data for Old Faithful,

Ʃx = 272 + 227 + 237 + 238 + 203 + 270 + 218 + 226 + 250 + 245 = 2386

Ʃy = 89 + 79 + 83 + 82 + 81 + 85 + 78 + 81 + 85 + 79 = 822

$\sum x^2 = 272^2 + 227^2 + 237^2 + 238^2 + 203^2 + 270^2 + 218^2 + 226^2 + 250^2 + 245^2 = 573{,}560$

$\sum xy = (272 \times 89) + (227 \times 79) + (237 \times 83) + (238 \times 82) + (203 \times 81) + (270 \times 85) + (218 \times 78) + (226 \times 81) + (250 \times 85) + (245 \times 79) = 196{,}636$

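Next, we use these values to find the value of a.

$a = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2} = \frac{10(196{,}636) - (2386)(822)}{10(573{,}560) - (2386)^2} = \frac{5068}{42{,}604} \approx 0.1189559666$

We then find the values of $\bar{x}$ and $\bar{y}$.

99

$\bar{x} = \frac{\sum x}{n} = \frac{2386}{10} = 238.6$ and $\bar{y} = \frac{\sum y}{n} = \frac{822}{10} = 82.2$

And use them to find the y-intercept, b.

$b = \bar{y} - a\bar{x} = 82.2 - 0.1189559666(238.6) = 53.81710$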

The regression equation is $\hat{y} = 0.1189559666x + 53.81710$. The graph of the regression
equation and the scatter plot are shown below.

[Figure 6.3: Scatter plot of the data in Table 6.1 with the regression line $\hat{y} = 0.1189559666x + 53.81710$. Horizontal axis: seconds between eruptions; vertical axis: length of eruption.]

100

We can now use the regression equation to estimate the duration of an eruption given the
time between eruptions. For instance, if the time between two eruptions is 250 seconds,
then the estimated duration of the second eruption is

$\hat{y} = 0.1189559666(250) + 53.81710 = 83.556$

$\hat{y} \approx 84$

The approximate duration of the eruption is 84 seconds.
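The slope, intercept, and prediction for the Old Faithful data can be reproduced with a short Python sketch. It applies the least-squares formulas directly to the values in Table 6.1; the function name `least_squares_line` is an illustrative choice.

```python
# Old Faithful data from Table 6.1
x = [272, 227, 237, 238, 203, 270, 218, 226, 250, 245]  # seconds between eruptions
y = [89, 79, 83, 82, 81, 85, 78, 81, 85, 79]            # duration of eruption

def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y-hat = slope*x + intercept."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(xi * yi for xi, yi in zip(xs, ys))
    sum_x2 = sum(xi * xi for xi in xs)
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    intercept = sum_y / n - slope * (sum_x / n)   # y-bar minus slope times x-bar
    return slope, intercept

a, b = least_squares_line(x, y)
predicted = a * 250 + b   # estimated duration when 250 seconds separate eruptions

print(a, b, predicted)
```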

Correlation analysis is the study of the relationship between independent and
dependent variables. It measures the strength and direction of the relationship in continuous bivariate data.
Examples of bivariate data are time and academic performance, mass and width, etc.

The linear correlation coefficient, r, is used to determine if there is a linear
relationship between two variables. It has a value from -1 to +1. If the value of r is -1, then
there is a perfect negative linear relationship between the two variables; if the value of r
is +1, then there is a perfect positive linear relationship between the two variables; and if
the value of r is 0, then there is no linear relationship between the two variables. The
closer the value of r is to either -1 or +1, the stronger the negative or
positive linear relationship between the two variables.

The scatter plot is a visual representation of the linear relationship between the two
variables. It is a graph involving the x – and y – axes. The following scatter plots show the
difference of linear relationship between two variables.
101

[Figure 6.4: Three scatter plots of y against x showing a negative relationship, a positive relationship, and no relationship.]

102

There are many methods for obtaining the value of a correlation coefficient. However, the
Pearson product-moment correlation coefficient (or simply the Pearson correlation coefficient) will
be used throughout this lesson. The formula for the Pearson correlation coefficient is given by

$r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}}$

where :

x = independent variable

y = dependent variable

To illustrate, assume that a proprietor of a fabrication shop wants to know if there is
a relationship between the number of hours on the lathe machine and the income (Php in
hundred thousands) for each month of a year. The results are as follows:

Table 6.1:

Month        Lathe (x), hours    Income (y), Php

January      6.00                6.00
February     4.50                5.50
March        5.75                4.00
April        6.25                5.00
May          4.00                3.75
June         4.75                4.50
July         6.25                8.00
August       5.50                6.60
September    5.00                4.95
October      4.50                3.90
November     4.50                4.60
December     5.25                6.00

103

Constructing a scatter plot helps to see if there is a relationship between the two
variables. The scatter plot is drawn below:

[Figure 6.5: Scatter plot of the data. Horizontal axis: hours using the lathe machine; vertical axis: income.]

It can be presumed that there is a positive relationship between hours on the lathe machine and income.

Table 6.2 :

Month        X        Y        XY         X²          Y²

January      6.00     6.00     36.00      36.00       36.00
February     4.50     5.50     24.75      20.25       30.25
March        5.75     4.00     23.00      33.0625     16.00
April        6.25     5.00     31.25      39.0625     25.00
May          4.00     3.75     15.00      16.00       14.0625
June         4.75     4.50     21.375     22.5625     20.25
July         6.25     8.00     50.00      39.0625     64.00
August       5.50     6.60     36.30      30.25       43.56
September    5.00     4.95     24.75      25.00       24.5025
October      4.50     3.90     17.55      20.25       15.21
November     4.50     4.60     20.70      20.25       21.16
December     5.25     6.00     31.50      27.5625     36.00

Total        62.25    62.8     332.175    329.3125    345.995

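$r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}}$

$r = \frac{12(332.175) - (62.25)(62.8)}{\sqrt{[12(329.3125) - (62.25)^2][12(345.995) - (62.8)^2]}}$

$r = \frac{76.8}{\sqrt{[76.6875][208.1]}} \approx 0.61$

As with the scatter plot, the obtained value is positive. Therefore, there is a
positive relationship between the number of hours on the lathe machine and the income
per month.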
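The value of r can be verified with a small Python sketch that applies the Pearson formula to the columns of Table 6.2. The function name `pearson_r` is illustrative.

```python
from math import sqrt

# Monthly data from Table 6.2 (January through December)
hours = [6.0, 4.5, 5.75, 6.25, 4.0, 4.75, 6.25, 5.5, 5.0, 4.5, 4.5, 5.25]
income = [6.0, 5.5, 4.0, 5.0, 3.75, 4.5, 8.0, 6.6, 4.95, 3.9, 4.6, 6.0]

def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from the raw sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(xi * yi for xi, yi in zip(xs, ys))
    sum_x2 = sum(xi * xi for xi in xs)
    sum_y2 = sum(yi * yi for yi in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

r = pearson_r(hours, income)
print(round(r, 2))
```

A positive r agrees with the upward trend suggested by the scatter plot; a value near 0.6 indicates a moderate rather than a perfect linear relationship.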

Exercise Set 18 :

1. Find the equation of the least-squares line for the ordered pairs in the given table below.

Adult men

Stride length (m)    2.5   3.0   3.3   3.5   3.8   4.0   4.2   4.5

Speed (m/s)          3.4   4.9   5.5   6.6   7.0   7.7   8.3   8.7

2. Use the equation of the least-squares line from item 1 to predict the average speed
of an adult man for each of the following stride lengths. Round your answer to the nearest
tenth of a meter per second.

a. 2.8 m b. 4.8 m

105

UNIT IV : SUMMARY

The following tables summarize essential concepts in this unit. The references given
with each concept list examples and exercises that can be used to test your
understanding of the concept.

4.1 Measures of Central Tendency

Mean, Median, and Mode: The mean of n numbers is
the sum of the numbers divided by n. The
median of a ranked list of n numbers is the
middle number if n is odd, or the mean of
the two middle numbers if n is even. The
mode of a list of numbers is the number that
occurs most frequently.
See examples on pages 54, 57, and 58.

Weighted Mean: The formula for the
weighted mean of the n numbers $x_1, x_2, x_3, \ldots, x_n$ is

$\text{weighted mean} = \frac{\sum (x \cdot w)}{\sum w}$

where $\sum (x \cdot w)$ is the sum of the products
formed by multiplying each number by its
assigned weight, and $\sum w$ is the sum of all
the weights.
See the example on page 60 and then try the exercises on page 61.

4.2 : Measure of Dispersion

Range: The range of a set of data values is
the difference between the greatest data
value and the least data value.
See the example on page 69 and then try the exercises on page 70.

Standard Deviation and Variance:
See examples on pages 71 – 74 and then try the exercises on page 75.

If $x_1, x_2, x_3, \ldots, x_n$ is a population of n
numbers with mean μ, then the standard
deviation of the population is

$\sigma = \sqrt{\frac{\sum (x - \mu)^2}{n}}$, and the variance is $\sigma^2 = \frac{\sum (x - \mu)^2}{n}$.

If $x_1, x_2, x_3, \ldots, x_n$ is a sample of n numbers
with mean $\bar{x}$, then the standard deviation of
the sample is
106

$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$, and the variance is $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$.

4.3 : Measure of Relative Position

z-score: The z-score for a given data value x
is the number of standard deviations that x
is above or below the mean.

z-score for a population data value: $z_x = \frac{x - \mu}{\sigma}$

z-score for a sample data value: $z_x = \frac{x - \bar{x}}{s}$

Percentiles: A value x is called the pth
percentile of a data set provided p% of the
data values are less than x. Given a set of
data and a data value x,

$\text{Percentile of score } x = \frac{\text{number of data values less than } x}{\text{total number of data values}} \cdot 100$

Quartiles: The quartiles of a data set are
the three numbers Q1, Q2, and Q3 that partition
the ranked data into four (approximately)
equal groups. Q2 is the median of the data,
Q1 is the median of the data values less than
Q2, and Q3 is the median of the data values
greater than Q2.

4.4 : Normal Distribution

Frequency Distribution: A frequency
distribution displays a data set by dividing
the data into intervals, or classes, and
listing the number of data values that fall
into each interval. A relative frequency
distribution lists the percent of data in each
interval.
See page 82.

Normal Distribution:
See examples on pages 84-86 and 90-94 and try the exercises on page 95.

Using the Standard Normal Distribution:
The standard normal distribution is the
normal distribution that has a mean of 0 and a standard
deviation of 1. Any normal distribution can
be converted into the standard normal
distribution by converting data values to
their z-scores. Then the percent of the data
values that lie in a given interval can be
found as the area under the standard
normal curve between the z-scores of the
endpoints of the given interval. The table
on page 89 gives the areas under the
standard normal curve for z-scores between
0 and 3.33.

4.5 : Linear Regression and Correlation

Least-Squares Line: Bivariate data are
data given as ordered pairs. The least-squares
regression line (also called the least-squares line
or regression line) for a set of bivariate data
is the line that minimizes the sum of the
squares of the vertical deviations from each
data point to the line. The equation of the
least-squares line for the n ordered pairs

$(x_1, y_1), (x_2, y_2), (x_3, y_3), \ldots, (x_n, y_n)$ is
$\hat{y} = ax + b$, where

$a = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2}$ and

$b = \bar{y} - a\bar{x}$

The equation of the least-squares line can
be used to predict the value of one variable
when the value of the other is known.
See examples on page 99.

Linear Correlation Coefficient: The linear
correlation coefficient r measures the
strength of a linear relationship between
two variables. The closer r is to -1 or +1, the
stronger the linear relationship is between
the variables. For n ordered pairs

$(x_1, y_1), (x_2, y_2), (x_3, y_3), \ldots, (x_n, y_n)$, the
linear correlation coefficient is

$r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}}$
See examples on pages 103 – 105.

108

UNIT IV TEST :

1. Find the standard deviation of the following distribution of scores:

Class interval f

30 - 34 2

35 - 39 3

40 - 44 6

45 - 49 7

50 - 54 8

55 - 59 7

60 - 64 5

65 - 69 4

70 - 75 2

N = 44

2. The mean weight of newborn infants is 7 pounds and the standard deviation is 0.8
pound. The weights of newborn infants are normally distributed. Find the z-score for a
weight of
a. 9 pounds

b. 7 pounds

c. 6 pounds

109

3. Shown below are data on the number of years of school, x, completed by ten
randomly selected people and their scores on a test measuring prejudice, y. Higher
scores on the prejudice test (1 to 10) indicate greater levels of prejudice. Determine the
correlation coefficient between years of education and scores on the prejudice test.

Respondent                   A    B    C    D    E    F    G    H    I    J

Years of education (x)       12   5    14   13   8    10   16   11   12   4

Score on prejudice (y)       1    7    2    3    5    4    1    2    3    10

4. Consider the following satisfaction level ratings of 35 people.

9 12 10 8 9 12 12 11 14 12

10 8 10 9 12 8 12 15 9 8

13 10 9 9 11 10 11 10

Determine the mean, median, and mode of the data items given.

5. Find the least-squares regression line for the data in the table:

Employees X Y

A 2 8

B 8 10

C 4 11
D 11 13

E 5 9

F 13 17

G 4 8

H 15 14
