Mathematics in the
Modern World
Course
Modules
Weeks 7-12
MODULE 4
Data Management: Introduction to Statistics
4.1 Introduction
When we hear the word Statistics, the first thing that comes to mind is a set
of numerical figures, such as your monthly allowance, the number of hours you
spend in school, the number of hours you spend on Facebook, your vital
statistics, etc.
However, the study of statistics is not limited to knowing and memorizing
numerical figures. This module will give us a better understanding of what
Statistics is about. Discussion on how some of its processes are done is also
included.
4.2 Learning Outcomes
After finishing this module, you are expected to:
1. discuss the importance of statistics in your field of study;
2. compare and contrast descriptive statistics and inferential
statistics;
3. define data;
4. identify different types of data as well as their level of measurement;
5. identify appropriate data collection methods based on needed data; and
6. identify appropriate data presentation type for a set of data.
Why are all processes involved in Statistics important? Statistics has the
ability to provide us with tools we need to convert raw data into information that
we can use to make sensible decisions and intelligent choices.
People from various fields of interest need to obtain information to answer
different types of problems. Nowadays, we do this by performing a statistical
inquiry. This will allow us to answer problems with a clearer understanding of
a particular collection of information.
Usually, the population of interest is so large that it becomes too
expensive and time-consuming to collect data from every element of the
population. Thus, we have no other option but to get the data we need from only
a subset of the population. We use the term sample to refer to this subset of the
population.
In any statistical inquiry, we study certain characteristics or attributes of
the elements in the population, which we call variables. Just like in algebra, we
denote variables with letters of the English alphabet. We refer to these
characteristics as variables because their realized values may vary for the
different elements in the sample or population.
Example 2. Below are illustrations of variables together with their possible
values.
Example 4.
A summary measure that we are familiar with is the proportion. The
proportion is the quotient obtained when we divide the magnitude of a part by
the magnitude of the whole. Suppose that among the 35 students, 28 claimed
that they own a cellular phone. We can now compute the proportion of
students in the population with cellular phones.
$$P = \frac{\text{number of students with cellular phones}}{\text{number of students in the population}} = \frac{28}{35} = 0.8$$
The proportion of students in our population with cellular phones is an
example of a parameter because it is a summary measure describing a
characteristic of the population.
Suppose we take a sample of 10 students from this class. Among the 10
students in the sample, 7 own cellular phones. Using the sample alone, we
cannot compute the proportion 𝑃 of students in the population with cellular
phones, but we can compute 𝑃̂ (read as “𝑃 hat”), where 𝑃̂ is the proportion of
students in the sample with cellular phones, as follows:
$$\hat{P} = \frac{\text{number of students in the sample with cellular phones}}{\text{number of students in the sample}} = \frac{7}{10} = 0.7$$
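The difference between a parameter and a statistic can be sketched in a few
lines of Python. This is only an illustration using the counts from the
example; the function name `proportion` is our own and not part of the module.

```python
# Counts taken from the example: 28 of the 35 students in the class
# (the population) and 7 of the 10 students in the sample own a phone.
def proportion(part, whole):
    """Return part / whole, the share of the whole made up by the part."""
    return part / whole

P = proportion(28, 35)      # parameter: summarizes the whole population
P_hat = proportion(7, 10)   # statistic: summarizes only the sample

print(f"P = {P:.2f}, P-hat = {P_hat:.2f}")
```

The two values differ because the sample is only a subset of the population.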
Learning Activity 1
problems. On the other hand, mathematical statistics is concerned with the
development of the mathematical foundations of the methods used in applied
statistics.
There are two major areas of interest in applied statistics. These are
descriptive statistics and inferential statistics.
Inferential Statistics includes all the techniques used in analyzing sample
data that lead to generalizations about the population from which the sample
came. It consists of performing hypothesis testing, determining
relationships among variables, and making predictions.
be clear that whatever conclusions we make using inferential statistics are
always subject to some error.
Example 6. Below is an application of inferential statistics.
To determine if reforestation is effective, we can take a representative portion
of denuded forests and use inferential statistics to draw conclusions about the
effect of reforestation in all denuded forests.
Learning Activity 2
2. Quantitative Variables are numerical variables and can be measured.
Learning Activity 3
4.3.2.2 Levels of Measurement
Variables can also be classified according to the level of measurement.
There are four levels of measurement: Nominal, Ordinal, Interval, and Ratio.
1. Nominal Data. In this case, numbers are used to represent an item or
characteristic. Examples include: names, gender, religious affiliation,
civil status, college majors. Note that such data should not be treated as
numerical, since relative size has no meaning.
5 - Outstanding
4 - Very Satisfactory
3 - Satisfactory
2 - Poor
3. Interval Data. At this level, numbers can be ordered and there is an
exact difference between any two units, but there is no meaningful zero
or starting point. For example, temperature in degrees Celsius is
interval data: values can be ordered and there is an exact difference
between two degrees, but zero does not mark a starting point, since
there can be temperatures below zero.
4. Ratio Data. This is the highest level of measurement and allows for
all basic arithmetic operations, including division and multiplication.
Data at this level can be ordered, have exact differences between units,
and have a meaningful zero. Things that are counted are usually ratio
level, for example, business data such as cost, revenue, and profit.
Learning Activity 4
In the case where data are not properly gathered, the consequences are as
follows:
One can obtain documented data from previous studies of individuals,
written reports of government and nongovernment agencies, periodicals, and
others.
Example 7.
The Philippine Statistics Authority is a major collector of data for
government needs. It provides the public with basic data on various subject
matters. A few of these are household income and expenditure,
employment, and others.
Primary data are data documented by a primary source. The data collectors
themselves documented this data.
4.3.2.4b Surveys
DEFINITION 5.8. (survey, census, sample survey)
we did not. Both pots have the same soil type. We watered the pots at the same
time using the same amount of water. A few weeks later, we observed the heights
of the mongo plants.
In this experiment, the objective is to determine the effect of sunlight on the
height of a mongo plant. The explanatory variable is the amount of sunlight.
Categories for the explanatory variable are called “treatments” or factor levels.
The response variable is the height of the mongo plant and the extraneous
variables are identified to be the soil type and amount of water.
The extraneous variables are usually controlled by making sure that the two
groups receive the same levels or amounts. The use of a randomization
mechanism in assigning the treatments, together with the control of the
identified extraneous variables, makes the experiment a more effective method
of data collection for establishing cause and effect.
Example 11.
The school administration wishes to determine which of the two methods is
more effective in training new student leaders. They randomly assigned twenty
student leaders to training method 1 and twenty student leaders to training
method 2. After one month of training, they administered a standardized
achievement test to the two groups and compared their scores.
4.3.2.4d Observation
DEFINITION 5.10. (observation method)
The table below shows the comparison of survey, experiment, and
observation methods.
Aspect                                 Data Collection Method
                                       Survey      Experiment   Observation
Assessing the reliability of           Generally   Sometimes    Oftentimes
generalizations about a well-defined   possible    difficult    difficult
population
Learning Activity 5
4.3.3 Presentation of Data
After data collection, we need to organize and analyze the data. After
organizing and analyzing the data, we present the results in forms that
reveal the important information we obtained from the data.
There are three ways to present the information from our data. These
include textual, tabular, and graphical presentations.
4.3.3.1 Textual Presentation
Textual presentation of data incorporates important figures in a paragraph
of text. In this type of presentation, we insert important data figures or summary
measures within the paragraph of text to support our conclusions.
Textual presentation allows us to direct the reader’s interest to the vital
information we want to highlight. Summary measures like the minimum, maximum,
total, and percentages are just a few of the pieces of information that may be
included in a textual presentation.
It is necessary to select the most important figures we want to focus on.
Whenever we use textual presentation, we must always provide our readers with
additional discussion about the relevance of the figures in our presentation.
Example 12. Here is an illustration of textual presentation.
Excerpts taken from the Isabela Covid-19 Case Updates.
“As of 4PM today, the Department of Health reports a total number of COVID-
19 cases at 290,190, after 3,475 newly-confirmed cases were added to the list of
COVID-19 patients.
DOH likewise announces 400 recoveries. This brings the total number of
recoveries to 230,233.
Twenty-eight duplicates were removed from the total case count. Of these, 19
were recovered cases.
Moreover, 13 cases previously reported as recovered were reclassified as death
(12) and active (1) cases after final validation.”
From the illustration given, the paragraphs showed and highlighted only the
most important figures. Few numbers were included and minute details or a
large quantity of data were not presented. If we want to refer to other details of
the data, then it would be more appropriate to use tabular presentation.
4.3.3.2 Tabular Presentation
Tabular presentation of data arranges figures in a systematic manner in
rows and columns. It is the most common method of data presentation. We can
use it for various purposes such as description, comparison, and in showing
relationships between two or more variables of interest.
In tabular presentation, we arrange the data figures or summary measures
in rows and columns for easy reading. Tables should be simple and easy to
understand. Each row and column must have an appropriate label.
Three types of tabular presentation will be discussed in this module namely,
leader work, text tabulation, and the formal statistical table.
4.3.3.2a Leader Work
Leader work has the simplest layout among all three types of tables. It
contains no table title or column headings and has no table borders. We
incorporate this type within a paragraph presenting one or two columns of
figures as supporting data.
Example 13.
The population in the Philippines for the census years 1975 to 2000 is as
follows:
1975 42,070,660
1980 48,098,460
1990 60,703,206
1995 68,616,536
2000 76,498,735
4.3.3.2c Formal Statistical Table
The formal statistical table is the most complex type of table since it has all
the different parts like the table number, table title, head note, box head, stub
head, column headings, and so on. It is a stand-alone table and can be easily
understood even without a description.
The following presents the different parts of a formal statistical table:
Example 14.
Below is an example of a formal statistical table.
4.3.3.3 Graphical Presentation
Graphical presentation of data portrays numerical figures or relationships
among variables in pictorial form. Some statistical charts used in this type of
presentation are given in the following table:
Type of Chart   Description
Line Chart      • Useful for presenting historical data
                • Effective in showing movement of a series over time
                • Appropriate when comparing two or more time series data
                  and trends over time
Pictograph      • Like a horizontal bar chart that uses symbols or pictures
                  instead of bars
                • The purpose is to get the attention of the readers
3. To help the researcher in making credible decisions based on
quantitative data or arguments.
Excel Charts & Graphs: Learn the Basics for a Quick Start by Leila
Gharani
https://www.youtube.com/watch?v=DAU0qqh_I-A
A. Short-response Essay
B. Identification
1. The average weekly allowance of students last year at a private high school
was Php 600.00 per week, based on an enrollment of 1,080 students. The
third year students who did not have this information interviewed 50
students and found their average weekly allowance last year to be Php
550.00. Identify the following:
a. Population
b. Sample
c. Variable of interest
d. Parameter
e. Statistic
2. Observe the use of the number seven in the following statements. Classify
each statement according to the level of measurement used to get the value
7.
3. What method of data collection is most appropriate for the following cases?
4. Indicate the type of chart you would choose to present the information
given in each of the following cases.
Your answers in items where you are asked to discuss will be graded according
to the given standards/basis for grading:
Score   Criteria
0       Unable to elicit the ideas and concepts from the learning activity,
        material, or video
1       Able to elicit the ideas and concepts from the learning activity,
        material, or video but shows erroneous understanding
2       Able to elicit the ideas and concepts from the learning activity,
        material, or video and shows correct understanding
3       Able to elicit the correct ideas from the learning activity, material,
        or video and also shows evidence of internalization and consistently
        contributes additional thought to the core idea
MODULE 5
Data Management: Measures of Central Tendency,
Dispersion and Position
5.1 Introduction
Often we wish to describe a set of data with a single number, or a small set
of numbers, in such a way that these values will yield enough information about
the content of the data that we can produce a means of generating a similar set
of data from this description.
One manner in which this can be done is by specifying values that describe the
numerical center of the set of data, which may be defined in various ways. They
are measures of the central tendency of the data. We can also describe the data
by how it is dispersed around a particular measure of central tendency. A third
manner in which we can describe data is by how it tends to accumulate with
respect to the central tendency--such as whether it tends to accumulate
immediately to the left or to the right of the numerical center.
central tendency (mean, median, and mode) will be discussed for ungrouped (raw)
and grouped data. Ungrouped data are raw data, while grouped data are raw data
that have been condensed into a frequency distribution table for easier
understanding.
5.3.1.1 Mean
The arithmetic mean, or simply the mean, is the most familiar and most widely
used measure in our daily life activities. It is considered the most reliable
measure because all the values of the variable are taken into consideration:
it is the sum of all data values divided by the number of values in the data
set. The mean of a sample data set is denoted by 𝑥̅ and the mean of a
population data set by the Greek letter 𝜇.
Example 1. Find the mean score of the following sample data set:
Quiz Scores: 1, 5, 7, 7, 6, 8, 10, 9, 5, 10, 8
Solution.
Steps Actual process and result
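The computation for Example 1 can be sketched in Python as follows. This is
only an illustration of the definition of the mean (sum of the values divided
by the number of values); the variable names are our own.

```python
# Quiz scores from Example 1 (a sample data set).
scores = [1, 5, 7, 7, 6, 8, 10, 9, 5, 10, 8]

total = sum(scores)   # step 1: add all the data values
n = len(scores)       # step 2: count the data values
mean = total / n      # step 3: divide the sum by the count

print(f"sum = {total}, n = {n}, mean = {mean:.2f}")
```

Here the sum is 76 and there are 11 scores, so the mean score is about 6.91.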
Example 2.
What is the mean age in the following set of sample data?
3. Divide ∑ 𝑓𝑥 by 𝑛.
$$\bar{x} = \frac{\sum fx}{n} = \frac{618}{35} \approx 17.66$$
The mean age is 17.66.
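The same steps can be sketched in Python. We assume here that the frequency
table for this example is the age distribution shown in Example 5 (ages 16 to
19 with frequencies 5, 10, 12, and 8), since it gives the same totals of 618
and 35; the dictionary name `freq` is our own.

```python
# Assumed frequency distribution: age -> number of students.
freq = {16: 5, 17: 10, 18: 12, 19: 8}

n = sum(freq.values())                          # total frequency
sum_fx = sum(age * f for age, f in freq.items())  # sum of f times x
mean_age = sum_fx / n

print(f"sum of fx = {sum_fx}, n = {n}, mean = {mean_age:.2f}")
```

Multiplying each age by its frequency before adding avoids listing all 35
individual ages.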
5.3.1.2 Median
The median is the middle number. It is the value that separates the
highest 50% of the data values from the lowest 50%. It is denoted as 𝑥̃. To
calculate the median, arrange the data values in numerical order, then find the
middle number. If there is an odd number of values, the number in the middle
will be the median. If there is an even number of values, then the average of
the two numbers in the middle will be the median.
Example 3. Odd number of values:
Find the median of the following set of data.
35 47 36 24 55 32 29 57 32
Solution.
Steps Actual process and result
4. The $\left(\frac{n+1}{2}\right)$th value is the median of the set of data.
   In this case, the 5th value, which is 35, is the median.
3. Identify the $\left(\frac{n}{2}\right)$th observation and the
   $\left(\frac{n}{2} + 1\right)$th observation.
   In this case, we identify the 5th observation, which is 35, and the 6th
   observation, which is 36.
4. Find the mean of the $\left(\frac{n}{2}\right)$th observation and the
   $\left(\frac{n}{2} + 1\right)$th observation. The number that results
   is the median of the set of data.
   The median is given by $\tilde{x} = \frac{35 + 36}{2} = \frac{71}{2} = 35.5$.
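Both cases (odd and even number of values) can be sketched in one short Python
function. The odd case uses the nine values of Example 3. For the even case we
assume the ten-value list that appears later in the discussion of the mode
(the same values plus 40), because it gives the 5th and 6th observations of 35
and 36 quoted above. The function name `median_of` is our own.

```python
def median_of(values):
    """Median by the rule in the text: middle value, or mean of the two middle values."""
    ordered = sorted(values)          # arrange in ascending order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # odd n: the ((n + 1) / 2)th value
        return ordered[mid]
    # even n: mean of the (n / 2)th and (n / 2 + 1)th values
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_data = [35, 47, 36, 24, 55, 32, 29, 57, 32]
even_data = odd_data + [40]           # assumed data for the even case

print(median_of(odd_data))
print(median_of(even_data))
```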
Example 5. Ungrouped data in frequency distribution.
Find the median age in the given frequency distribution
Age (𝑥) 𝑓
16 5
17 10
18 12
19 8
Solution.
Steps Actual process and result
1. Find the total frequency 𝑛 and the cumulative frequency 𝑐𝑓.
   Note: Make sure that the entries in the first column are in order.

   Age (𝑥)   𝑓
   16         5
   17        10
   18        12
   19         8
   Total     35

4. Locate $\frac{n+1}{2}$ in 𝑐𝑓. Here $\frac{n+1}{2} = 18$, and we know that
   the 18th observation belongs to the range of positions 16 − 27, as
   indicated by the 𝑐𝑓 of 27.

   Age (𝑥)   𝑓    𝑐𝑓
   16         5     5
   17        10    15
   18        12    27
   19         8    35
   Total   𝑛 = 35

5. Find the $\left(\frac{n+1}{2}\right)$th observation in the first column,
   using the same table. In the example, the median age is 18.
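The cumulative-frequency search can be sketched in Python as follows. This is
a minimal illustration for ungrouped data in a frequency distribution; the
function name `median_from_frequencies` is our own, and the handling of an
even total (averaging the two middle observations) follows the general rule
for the median stated earlier.

```python
def median_from_frequencies(freq):
    """Median of ungrouped data given as {value: frequency}."""
    n = sum(freq.values())
    # Position(s) of the middle observation(s) in the ordered data.
    if n % 2 == 1:
        positions = [(n + 1) // 2]
    else:
        positions = [n // 2, n // 2 + 1]

    middle_values = []
    for position in positions:
        cf = 0                              # running cumulative frequency
        for value in sorted(freq):          # first column must be in order
            cf += freq[value]
            if position <= cf:              # position falls in this row
                middle_values.append(value)
                break
    return sum(middle_values) / len(middle_values)

ages = {16: 5, 17: 10, 18: 12, 19: 8}
print(median_from_frequencies(ages))
```

With 𝑛 = 35, the middle position is 18, which is first covered by the
cumulative frequency 27, so the function returns the age 18.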
5.3.1.3 Mode
The mode is the data value which appears most frequently in the set. A data
set may have one mode, more than one mode, or no mode at all. For example, in
the previous data:
35 47 36 24 55 32 29 57 32 40
Age (𝑥) 𝑓
16 5
17 10
18 12
19 8
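Finding the mode of these two data sets can be sketched in Python. This is an
illustration only; the function name `modes` is our own, and we treat a data
set in which every value appears exactly once as having no mode.

```python
from collections import Counter

def modes(values):
    """Return the list of most frequent values; empty list if no value repeats."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:                   # every value appears once: no mode
        return []
    return sorted(v for v, c in counts.items() if c == highest)

raw_data = [35, 47, 36, 24, 55, 32, 29, 57, 32, 40]

# Expand the frequency table (age -> frequency) into individual ages.
age_table = {16: 5, 17: 10, 18: 12, 19: 8}
ages = [age for age, f in age_table.items() for _ in range(f)]

print(modes(raw_data))   # 32 is the only value that appears twice
print(modes(ages))       # 18 has the highest frequency, 12
```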
3. The mean is unique, but it cannot be found for categorical data or for
open-ended frequency distributions.
4. The median does not use all the values, so it is less affected than the
mean by a few extremely large or small values.
6. The mode has the advantage that it can be used for nominal data, but it is
not unique: there may be more than one mode or none at all.
Learning Activity 1
Direction. Tell whether the following statements describe the Mean, Median,
or Mode
[Figure: histogram; vertical axis: Frequency]
2. Left-Skewed. This type of distribution has a few data values that are
much lower than the majority of values in the set (the tail extends to
the left). Generally, the mean is less than the median (and mode) in a
left-skewed distribution.
[Figure: bar chart illustrating a left-skewed distribution; vertical axis from 0 to 90]
3. Right-Skewed. This type of distribution has a few data values that are
much higher than the majority of values in the set (the tail extends to
the right). Generally, the mean is greater than the median (and mode) in
a right-skewed distribution.
[Figure: bar chart illustrating a right-skewed distribution; vertical axis from 0 to 90]
[Figure: bar chart with categories Freshmen, Sophomore, Junior, Senior]
5.3.2 Measures of Dispersion
Dispersion or variation in a data set is the amount of difference between
data values. It tells if the numbers in the data are close together or spread far
apart.
In a data set with little variation, almost all data values would be close to
one another. The histogram of such a data set would be narrow and tall. An
example of this is the set of quiz scores below.
Quiz Scores: 3, 3, 4, 4, 4, 4, 4, 4, 5, 5, 5
In a data set with a great deal of variation, the data values would be spread
widely. The histogram of this data set would be low and wide. An example is
the set of data that follows.
Quiz Scores: 1, 3, 4, 5, 6, 6, 7, 8, 8, 9, 10
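The difference in spread between the two sets of quiz scores can be checked
with a short Python sketch. It uses the range and the sample standard
deviation from Python's built-in `statistics` module; the function name
`describe` and the variable names are our own.

```python
from statistics import mean, stdev

narrow = [3, 3, 4, 4, 4, 4, 4, 4, 5, 5, 5]    # little variation
wide = [1, 3, 4, 5, 6, 6, 7, 8, 8, 9, 10]     # a great deal of variation

def describe(scores):
    """Return the range, mean, and sample standard deviation of the scores."""
    return {
        "range": max(scores) - min(scores),
        "mean": mean(scores),
        "stdev": stdev(scores),   # sample standard deviation (divides by n - 1)
    }

for name, scores in (("narrow", narrow), ("wide", wide)):
    summary = describe(scores)
    print(name, summary["range"], round(summary["stdev"], 2))
```

The first set has a range of 2 and the second a range of 9, and the second set
also has the larger standard deviation.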
Population variance:
$$\sigma^2 = \frac{\sum (x - \mu)^2}{N}$$
where $x$ represents the observations, $\mu$ the population mean, and $N$ the
population size.

Sample variance:
$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$
where $x$ represents the observations, $\bar{x}$ the sample mean, and $n$ the
sample size.
To find the variance in a set of data, the process is as follows:
Population standard deviation:
$$\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$$

Sample standard deviation:
$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$
Solution.
Steps Actual process and result
1. Determine the mean of the observations.
$$\bar{x} = \frac{\sum x}{n} = \frac{4 + 5 + 5 + 6 + 7 + 8 + 8 + 9 + 9 + 9}{10} = \frac{70}{10} = 7$$
2. For each observation, calculate the deviation or difference between each
   observation and the mean. Because these are sample data, we get 𝑥 − 𝑥̅.

   𝑥    𝑥̅    𝑥 − 𝑥̅
   4    7    −3
   5    7    −2
   5    7    −2
   6    7    −1
   7    7     0
   8    7     1
   8    7     1
   9    7     2
   9    7     2
   9    7     2
   𝑛 = 10
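The remaining steps (squaring the deviations, adding them, dividing by 𝑛 − 1,
and taking the square root) can be sketched in Python. This is only an
illustration that applies the sample variance and sample standard deviation
formulas given earlier to the ten observations of this example; the variable
names are our own.

```python
from math import sqrt

data = [4, 5, 5, 6, 7, 8, 8, 9, 9, 9]    # sample observations
n = len(data)
x_bar = sum(data) / n                     # sample mean

deviations = [x - x_bar for x in data]    # x - x_bar for each observation
sum_sq = sum(d ** 2 for d in deviations)  # sum of squared deviations

variance = sum_sq / (n - 1)               # sample variance s^2
std_dev = sqrt(variance)                  # sample standard deviation s

print(f"sum of squares = {sum_sq}, s^2 = {variance:.2f}, s = {std_dev:.2f}")
```

The squared deviations add up to 32, so the sample variance is 32/9, or about
3.56, and the sample standard deviation is about 1.89.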
The coefficient of variation (CV) makes it easier to tell if a standard deviation
is large or small by comparing the standard deviation to the mean and it allows
comparison of standard deviations that come from data sets with different
means.
For the population:
$$cv = \frac{\sigma}{\mu} \times 100\%$$
For the sample:
$$cv = \frac{s}{\bar{x}} \times 100\%$$
For the sample:
$$z = \frac{x - \bar{x}}{s}$$
1. The 𝑧-score of a value is positive if the value is above the mean and
negative if it is below the mean. The mean itself always has a 𝑧-score
of 0.
Example 7.
Students were selected from two sections and their scores in a Statistics
examination were gathered. The following information was obtained:

Section          Sample mean   Sample standard deviation
First section    75            5.6
Second section   72            7

Linda, who is from the first section, got a score of 68, while her friend
Jessa, who is in the second section, got a score of 60. Who has a higher
standard score?
Solution.
Linda:
$$z_1 = \frac{x - \bar{x}_1}{s_1} = \frac{68 - 75}{5.6} = \frac{-7}{5.6} = -1.25$$
Jessa:
$$z_2 = \frac{x - \bar{x}_2}{s_2} = \frac{60 - 72}{7} = \frac{-12}{7} \approx -1.71$$
Since −1.25 > −1.71, we conclude that Linda has a higher standard score.
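The comparison in Example 7 can be sketched in Python using the sample
𝑧-score formula. The function name `z_score` is our own.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / std_dev

z_linda = z_score(68, mean=75, std_dev=5.6)   # first section
z_jessa = z_score(60, mean=72, std_dev=7)     # second section

print(f"Linda: {z_linda:.2f}, Jessa: {z_jessa:.2f}")
print("Linda" if z_linda > z_jessa else "Jessa", "has the higher standard score")
```

Both scores are below their section means, so both 𝑧-scores are negative; the
one closer to zero is the higher standard score.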
Percentiles divide a data set into 100 equal parts. A percentile can be found
for any percent from 1 to 99 and is denoted as 𝑃𝑟, where the subscript 𝑟 is the
percentile rank, which indicates the percent of the distribution that falls
below the percentile. For example, 𝑃10 is the tenth percentile and is larger
than 10% of the values in the distribution.
Example 8. Using the data below, find 𝑃25, 𝑃60 and the percentile rank of 4.
2 6 3 4 2 1 2 0 1 3 6 3
Solution.
a) To find 𝑃25, we follow the steps given:
Thus, $P_{25} = \frac{1 + 2}{2} = \frac{3}{2} = 1.5$, which means that 25% of
the observations are less than 1.5.
b) To find 𝑃60, we will follow a similar process with the previous item.
From here, we conclude that 60% of the observations are less than 3.
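The results above can be reproduced with a short Python sketch. The procedure
coded here is an assumption on our part: it is a common textbook rule in which
we compute 𝑐 = 𝑛𝑝/100, average the 𝑐th and (𝑐 + 1)th ordered values when 𝑐 is
a whole number, and otherwise round 𝑐 up and take that value. It gives the
values 1.5 and 3 quoted in this example. The percentile rank function also
uses one common convention (values below 𝑥, plus 0.5, divided by 𝑛), which may
differ from the steps intended in this module.

```python
import math

def percentile(values, p):
    """p-th percentile (1 to 99) using the c = n * p / 100 rule described above."""
    ordered = sorted(values)
    n = len(ordered)
    c = n * p / 100
    if c == int(c):                        # whole number: average two values
        c = int(c)
        return (ordered[c - 1] + ordered[c]) / 2
    return ordered[math.ceil(c) - 1]       # otherwise round up

def percentile_rank(values, x):
    """Assumed convention: (number of values below x + 0.5) / n * 100."""
    below = sum(1 for v in values if v < x)
    return (below + 0.5) / len(values) * 100

data = [2, 6, 3, 4, 2, 1, 2, 0, 1, 3, 6, 3]

print(percentile(data, 25))
print(percentile(data, 60))
print(round(percentile_rank(data, 4)))
```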
Another measure of position is the deciles. Deciles divide the data set into
tenths and can be found for 1 through 9. Deciles are denoted as 𝐷𝑟 with a
subscript 𝑟, for example, D3 is the third decile and is the value that is larger than
three tenths of the other values.
DECILES
• divides ranked data into ten equal parts
10% 10% 10% 10% 10% 10% 10% 10% 10% 10%
D1 D2 D3 D4 D5 D6 D7 D8 D9
Quartiles divide a data set into fourths and can be found for 1 to 3. Q1 is
the first quartile and is the value that is larger than one fourth of the
observations in the distribution.
QUARTILES
• divides ranked scores into four equal parts
minimum     Q1     Q2 (median)     Q3     maximum
Solution.
Steps Actual process and result
1. Arrange the data in ascending order.
   12 13 13 14 17 18 18 19 20
   21 22 22 23 23 24 25 28 29
   31 32 35 36 43 48 55
Interpretation:
The stem-and-leaf plot shows that the largest group of students (10 of the 25)
obtained scores from 20 to 29.
Example 10. Make a stem-and-leaf plot for the following numbers.
215 239 212 245 226 228 246 213 247 225
236 223 221 248 237 242 218 236 232 238
Solution.
Steps Actual process and result
1. Arrange the data in ascending order.
   212 213 215 218 221 223 225 226 228 232
   236 236 237 238 239 242 245 246 247 248
3. Use the first 2 digits for the leading digit (or stem) and list all the
   last digits in order for the trailing digit (or leaf):

   Leading Digit (Stem)   Trailing Digit (Leaf)
   21                     2 3 5 8
   22                     1 3 5 6 8
   23                     2 6 6 7 8 9
   24                     2 5 6 7 8

Interpretation:
The stem-and-leaf plot shows that the largest group of values (6 of the 20)
falls in the range from 230 to 239.
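The construction of a stem-and-leaf plot can be sketched in Python. This
illustration assumes, as in Example 10, that the last digit of each number is
the leaf and the remaining leading digits form the stem; the function name
`stem_and_leaf` is our own.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group values into {stem: [leaves]}, with the last digit as the leaf."""
    plot = defaultdict(list)
    for value in sorted(values):          # ascending order keeps leaves sorted
        stem, leaf = divmod(value, 10)    # e.g. 215 -> stem 21, leaf 5
        plot[stem].append(leaf)
    return dict(plot)

data = [215, 239, 212, 245, 226, 228, 246, 213, 247, 225,
        236, 223, 221, 248, 237, 242, 218, 236, 232, 238]

plot = stem_and_leaf(data)
for stem, leaves in plot.items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
```

The same function works for the two-digit scores of the previous example,
where the tens digit becomes the stem.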
A BOX-AND-WHISKER PLOT graphs five values of the set of data on a
number line. The five values are:
1. The lowest value in the set of data.
2. The lower hinge.
3. The median.
4. The upper hinge.
5. The highest value of the set of data.
A box is drawn from the lower hinge to the upper hinge, and lines are drawn
from the box to the highest and lowest values. The lower hinge is the median of
all the values less than or equal to the median when the set of data has an
odd number of values, or the median of all values less than the median when the
set of data has an even number of values. The upper hinge is the median of all
values greater than or equal to the median when the set of data has an odd
number of values, or the median of all values greater than the median when the
set of data has an even number of values.
Example 11. A 100-item test was given to 25 statistics students. The results
are shown below:
55 32 20 22 43 14 17 48 24
31 21 22 35 23 36 23 18 25
13 28 12 29 13 18 19
Solution.
Steps Actual process and result
1. Arrange the data in ascending order.
   12 13 13 14 17 18 18 19 20
   21 22 22 23 23 24 25 28 29
   31 32 35 36 43 48 55
2. Determine the five values:
   • The lowest value in the data set is 12.
   • The highest value in the data set is 55.
   • The median is 23.
   • The lower hinge is the midpoint of the numbers below the median, which
     is 18.
   • The upper hinge is the midpoint of the numbers above the median, which
     is 31.5.
3. Set up the horizontal axis
containing the values
obtained in Step 2. In this
case, we start at 5 and end at
60 with an interval of 5.
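The five values in Step 2 can be reproduced with a short Python sketch. It
follows the wording of this solution, an assumption on our part: the hinges
are taken as the medians of the ordered values positioned below and above the
median, with the median itself left out of both halves. The function name
`five_values` is our own.

```python
from statistics import median

def five_values(values):
    """Lowest value, lower hinge, median, upper hinge, highest value."""
    ordered = sorted(values)
    n = len(ordered)
    lower_half = ordered[: n // 2]         # positions below the median
    upper_half = ordered[(n + 1) // 2:]    # positions above the median
    return (ordered[0], median(lower_half), median(ordered),
            median(upper_half), ordered[-1])

scores = [55, 32, 20, 22, 43, 14, 17, 48, 24,
          31, 21, 22, 35, 23, 36, 23, 18, 25,
          13, 28, 12, 29, 13, 18, 19]

print(five_values(scores))
```

If the median were included in both halves, as in the definition of the hinges
for an odd number of values, the upper hinge for these data would be 31
instead of 31.5.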
Interpretation:
The box-and-whisker plot shows that the data are not symmetrical and that the
data are positively skewed, since the whisker is longer on the right.
5.5 Flexible Teaching-Learning Modality
Remote (asynchronous)
• Module, exercises, problem sets, PowerPoint lessons
4. A quiz on the classification of research by general methodology was
administered to a group of 34 students at the College of Arts and
Sciences. The scores are reported below:
Male Female
8 10 20 14 13 10 10 13 10
17 17 12 14 14 9 14 15 8 17
12 10 9 18 14 15 13 17
16 8 18 14
6 16 10
a. Consider all the members of the group and compute the mean, median
and mode.
b. Calculate the mean, median and mode for male and female students.
c. Compare the mean and median within each group. Which has the higher
value? Why?
For items 5-6, find the mean, median, mode, range, variance, standard
deviation, and the coefficient of variation.
Ages   Number of students
16     2
17     10
18     8
19     5
7. Louie’s test scores for two semesters of mathematics are listed below. The
percentage of each semester’s grade represented by each score is also
given
1st Semester 2nd Semester % of Grade
78 87 15
68 66 15
84 81 15
86 89 15
90 88 40
9. Using the following, calculate the 𝑧-score that corresponds to the raw
score indicated.
𝑥̅ 𝑠 𝑥 𝑧-score
97 9.23 100
46 8.0 38
8 0.52 9
22 4.69 24
31 7.15 24
54 1.50 39
100 6.50 110
75 3.75 72
10. Using the following data, estimate the raw score that corresponds to the
𝑧-score indicated.
𝑥̅ 𝑠 𝑧-score 𝑥
28 5.2 −1.62
69 2.35 +2.58
7 0.86 +1.03
41 4.73 −2.37
72 1.05 +0.40
85 3.21 −3.20
150 9.61 −0.26
36 0.90 +3.50
MODULE 6
Data Management: Probabilities and Normal Distribution
6.1 Introduction:
The normal curve, also known as the Gaussian curve or the normal
probability curve, is the most fundamental distribution curve in statistics. In
this section, we shall discuss applications of the normal curve to the
performance of students in class or in their daily activities using standard
scores, or 𝑧-scores.
6.2 Learning Outcomes
At the end of this section, you will be able to:
1. give the importance of a normal distribution;
2. differentiate between a normal distribution and a skewed distribution;
3. give the significance of the standard or 𝑧-score;
4. compute areas under the normal curve; and
5. solve problems involving the normal distribution.
The normal distribution is used to find probabilities by finding the area
under the curve. The area under the graph from the mean to any given 𝑧-score
can be determined using Table 1.
The area under the curve is the same as the probability that a value will be
between the mean and the given number.
What is Probability?
A standard normal model is a normal distribution with a mean of 0 and a
standard deviation of 1. It has some distinct properties.
Because of its properties, the following are observed based on the empirical
rule.
1. Approximately 68% of the data values will fall within 1 standard deviation
of the mean.
2. Approximately 95% of the data values will fall within 2 standard deviations
of the mean.
3. Approximately 99.7% of the data values will fall within 3 standard
deviations of the mean.
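These percentages can be checked with a short Python sketch. It uses the
`NormalDist` class from Python's built-in `statistics` module (available from
Python 3.8) to compute the area under the standard normal curve between −𝑘 and
+𝑘 standard deviations; the function name `area_within` is our own.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)   # mean 0, standard deviation 1

def area_within(k):
    """Area under the standard normal curve between -k and +k."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {area_within(k) * 100:.2f}%")
```

The computed areas are close to 68%, 95%, and 99.7%, which is why the
empirical rule is stated with the word "approximately".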
One way of figuring out how data are distributed is to plot them in a graph.
If the data are distributed symmetrically around the mean, with most values
near the center, you may come up with a bell curve. A bell curve has a small
percentage of the points on both tails and the bigger percentage on the inner
part of the curve. In the standard normal model, about 5 percent of your data
would fall into the “tails” (colored darker orange in Figure 2) and 95
percent will be in between. For example, for test scores of students, the
normal distribution would show 2.5 percent of students getting very low scores
and 2.5 percent getting very high scores. The rest will be in the middle; not
too high or too low. The shape of the standard normal distribution looks like
this:
The standard normal distribution can help you figure out which subjects you
are doing well in and which subjects need more effort. If your score in one
subject is higher than your score in another, you might think that you are better
in the subject with the higher score. This is not always true.
You can only say that you are better in a particular subject if your score
lies more standard deviations above the mean in that subject than in the other.
The standard deviation tells you how tightly the data are clustered around the
mean; it allows you to compare scores from distributions that have different
means and different spreads.
For example, if you get a score of 90 in Math and 95 in English, you might
think that you are better in English than in Math. However, in Math, your score
is 2 standard deviations above the mean. In English, it’s only one standard
deviation above the mean. It tells you that in Math, your score is far higher than
most of the students (your score falls into the tail).
Based on this data, you actually performed better in Math than in English!
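The comparison above can be sketched in a few lines of code. The class means and standard deviations below are hypothetical values, chosen only so that the 𝑧-scores match the ones stated in the example (2 for Math, 1 for English); the text does not give the actual class statistics.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

# Hypothetical class statistics, chosen only to reproduce the z-scores in the text.
math_z = z_score(90, mean=80, sd=5)      # 2 standard deviations above the mean
english_z = z_score(95, mean=90, sd=5)   # 1 standard deviation above the mean

# The subject with the larger z-score is the stronger relative performance.
better = "Math" if math_z > english_z else "English"
print(math_z, english_z, better)
```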
Since not all problems are simple, a 𝑧-table has been prepared. A 𝑧-table
lists probabilities (areas under the standard normal curve) for 𝑧-scores, that
is, for distances from the mean measured in standard deviations. The mean is at
the center of the standard normal distribution, so a probability of 50%
corresponds to zero standard deviations.
There are different types of 𝑧-tables. It is important to read and check the
information given before we proceed to finding probabilities. The table which we
will use gives the probabilities to the left of a given 𝑧-value. We also take note
that since the total area under the normal curve is 1, the probability values are
also the areas to the left of a given 𝑧-value.
Source: https://www.math.arizona.edu/~jwatkins/normal-table.pdf
We will give more illustrations on finding probabilities using the 𝒛-table. This
time, we follow the steps given.
1. Area below 𝒛.
2. Area above 𝒛.
3. Area between two 𝒛-values.
3. To get 𝑃(−0.78 ≤ 𝑧 ≤ 1.65), we get the difference between the values we
obtained at 𝑧 = −0.78 and 𝑧 = 1.65:
$P(-0.78 \le z \le 1.65) = 0.9505 - 0.2177 = 0.7328$
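The three cases (area below 𝑧, area above 𝑧, and area between two 𝑧-values) can be sketched as follows. This is an illustration only, assuming Python's `statistics.NormalDist` in place of the printed 𝑧-table; the function names are introduced here and are not part of the module.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1

def area_below(z):
    """P(Z <= z): the left-tail value a cumulative z-table reports."""
    return Z.cdf(z)

def area_above(z):
    """P(Z >= z): the total area is 1, so subtract the left tail."""
    return 1 - Z.cdf(z)

def area_between(z1, z2):
    """P(z1 <= Z <= z2): difference of the two left-tail areas."""
    return Z.cdf(z2) - Z.cdf(z1)

print(round(area_below(1.65), 4))
print(round(area_below(-0.78), 4))
print(round(area_between(-0.78, 1.65), 4))
```

The same functions can be used to check answers to Learning Activity 1 after working them out with the table.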
Learning Activity 1
1. 𝑃(𝑧 ≤ −1.73)
2. 𝑃(𝑧 ≥ −0.67)
3. 𝑃(−1.73 ≤ 𝑧 ≤ −0.67)
How do you know that a word problem involves normal distribution? Look
for the key phrase “assume the variable is normally distributed” or “assume the
variable is approximately normal.”
Example 1. The mean time to complete a certain psychology examination is 34
minutes with a standard deviation of 8. If the distribution of the time to
complete the examination is approximately normally distributed, what is the
probability that a student will complete the examination
(a) in less than 28 minutes?
(b) in more than 45 minutes?
(c) between 28 and 45 minutes?
Solution.
(a)
Steps, with the actual process and result:
1. List the given mean and standard deviation.
   𝜇 = 34 minutes, 𝜎 = 8 minutes
2. Compute the 𝑧-score of 𝑥 = 28 minutes.
   $z = \frac{x - \mu}{\sigma} = \frac{28 - 34}{8} = -0.75$
3. Find the probability 𝑃(𝑧 ≤ −0.75).
(b)
Steps, with the actual process and result:
1. List the given mean and standard deviation.
   𝜇 = 34 minutes, 𝜎 = 8 minutes
2. Compute the 𝑧-score of 𝑥 = 45 minutes.
   $z = \frac{x - \mu}{\sigma} = \frac{45 - 34}{8} \approx 1.38$
3. Find the probability 𝑃(𝑧 ≥ 1.38).
(c)
$z = \frac{x - \mu}{\sigma} = \frac{45 - 34}{8} \approx 1.38$
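The three probabilities in this example can be cross-checked with a short sketch. It assumes Python's `statistics.NormalDist` and uses 𝑥 = 28 and 𝑥 = 45 minutes, the values that appear in the worked steps; results may differ slightly from table lookups because the table rounds 𝑧 to two decimal places.

```python
from statistics import NormalDist

# Completion time: mean 34 minutes, standard deviation 8 minutes.
time = NormalDist(mu=34, sigma=8)

p_less_28 = time.cdf(28)                  # (a) P(x < 28)
p_more_45 = 1 - time.cdf(45)              # (b) P(x > 45)
p_between = time.cdf(45) - time.cdf(28)   # (c) P(28 < x < 45)

print(f"(a) {p_less_28:.4f}  (b) {p_more_45:.4f}  (c) {p_between:.4f}")
```

Because the three events cover every possible completion time, the three probabilities add up to 1.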
Example 2.
Solution.
Example 3.
A company gives an employment test to all applicants for a job. The results of
the test are normally distributed with a mean score of 124 and a standard
deviation of 16. If only the top 75% of the applicants are to be interviewed, what
score must an applicant have to be interviewed?
Solution.
We have
𝑥 = 𝜎𝑧 + 𝜇
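A minimal sketch of this calculation follows, assuming Python's `statistics.NormalDist` instead of a reverse table lookup. The top 75% of applicants are those above the score that cuts off the lowest 25%, so the 𝑧-value needed is the one with an area of 0.25 to its left.

```python
from statistics import NormalDist

mu, sigma = 124, 16   # mean and standard deviation of the test scores

# z-value with 25% of the area to its left (75% to its right).
z = NormalDist().inv_cdf(0.25)

# Convert the z-value back to a raw score using x = sigma * z + mu.
x = sigma * z + mu
print(f"z = {z:.4f}, cutoff score = {x:.1f}")
```

With a printed table, 𝑧 would be read off to two decimal places, so a hand computation may differ from this result in the first decimal place.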
Learning Activity 2
The heights of 1000 students are normally distributed with a mean of 174.5
centimeters and a standard deviation of 6.9 centimeters. Assuming that the
heights are recorded to the nearest half centimeter, how many of these
students would you expect to have heights
2. Given a normal distribution with 𝜇 = 30 and 𝜎 = 6, find
a. what fraction of the cups will contain more than 224 milliliters?
b. what is the probability that a cup contains between 191 and 209
milliliters?
c. how many cups will probably overflow if 230- milliliter cups are used
for the next 1000 drinks?
d. below what value do we get the smallest 25% of the drinks?
MODULE 7
Data Management: Regression and Correlation
7.1 Introduction
In our daily activities, it is often necessary to establish the relationship
between variables before a decision is made. For example, the school registrar
must predict the enrollment before preparing the class schedules. One must know
the sequence of the courses to be offered before a feasible flow chart can be
prepared. In this section, we will discuss some commonly used measures of
association that show the linear relationship between two variables, such as
correlation analysis. The term “relationship” means that changes in two variables
are associated with each other. This relationship can be direct (both variables
increase together) or inverse (one increases as the other decreases). Moreover,
correlation is used to determine whether there is a relationship between two
variables and how strong that relationship is.
Correlation and linear regression help us deal with the relationship
between two or more continuous variables. We shall study the dependence of one
variable, the dependent variable, on another, the independent variable.
7.2 Learning Outcome
After finishing this module, you are expected to:
1. explain the purpose of correlation coefficients;
2. choose the appropriate correlation coefficients to show the relationship
between two variables;
3. compute the coefficients of correlation and determination;
4. calculate the average correlation between two variables across several
groups of people;
5. define linear regression;
6. give the purpose of linear regression;
7. define least-squares regression line and the assumptions underlying
the test of significance; and
8. use methods of linear regression and correlation to predict the value of
a variable given certain conditions.
7.3 What You Need to Know
7.3.1 What is the purpose of correlation analysis?
In correlation analysis, the purpose is to measure the strength or closeness
of the relationship between the variables. In other words, we would like to know
how strong or weak the relationship between the variables is. Note that two
variables being associated in a statistical sense does not guarantee the
existence of a causal relationship. In reverse, however, the existence of a
causal relationship usually does imply correlation. The magnitude of association
is measured by the absolute value of 𝑟, which can range from 0.00 to 1.00; the
greater the absolute value of 𝑟, the stronger the relationship between the two
variables.
The two types of variables involved in a relationship are the independent
variable (𝑋) and the dependent variable (𝑌). In correlation analysis, the
𝑋-variable is the predictor and the 𝑌-variable is the criterion variable.
A correlation is a relationship between two statistical variables measured
from the same population. In this module, we will only consider linear
correlation which comes in three types: positive linear correlation, negative
linear correlation and zero linear correlation.
A Positive Linear Correlation indicates that high values for one variable
tend to correspond to high values for the second variable or, simply, if one
value increases, so does the other. An example is height vs. weight for adults
(for a normal individual, as the height increases, the weight also increases).
A Negative Linear Correlation indicates that high values for one variable
tend to correspond to low values for the second variable, that is, one variable
increases and the other decreases. An example is the age of a vehicle and its
resale price (as the vehicle gets older, the resale price becomes lower).
A Zero Linear Correlation means that no linear relationship exists between
the variables. An example is height and number of years of education (the height
of a person in no way has a bearing on the number of years he or she has been in
school).
7.3.1.1 Simple Correlation
In simple correlation, only two variables are studied at once. The two
variables are the independent and dependent variables. The independent
variable (𝑋) is the variable that can be controlled or picked. The dependent
variable (𝑌) is the variable that you assume to be dependent on the other
variable. The independent variable is used to predict the dependent variable if
there is a correlation between the two variables.
One way to determine the type of linear correlation between two variables is
by means of a scatter plot. The scatter plot is a graph with the independent
variable at the bottom (along the 𝑥-axis) and the dependent variable along
the side (the 𝑦-axis). For each pair of numbers, we plot a point, but the points
are not connected with a line.
The scatter plot shows if there is a linear correlation between two variables.
We can then determine the type of linear correlation as follows:
1. Positive Linear Correlation - general trend in the plotted points is from
bottom left to top right.
2. Negative Linear Correlation - general trend in the plotted points is from
top left to bottom right.
3. No Linear Correlation - No general trend in plotted points, or a non-linear
trend.
The strength of the linear correlation can be judged by looking at how closely
the points approximate a straight line.
Example 1
The following table shows the Height (X) vs. Weight (Y) measurements (both in
inches) for 10 men:
x 70.8 66.2 71.7 68.7 67.6 69.2 66.5 67.2 68.3 65.6
y 42.5 40.2 44.4 42.8 40.0 47.3 43.4 40.1 42.1 36.0
Example 2.
The following table gives the resale value of a car bought in 1970 at
Php200,000.00.
x (Year)     1970  1973  1976  1979  1982  1985  1988  1991  1994  1997
y (Php 000)  200   150   145   135   120   100   79    65    54    35.0
Interpretation: The diagram indicates a negative linear correlation between the
variables.
Example 3.
Below are the scores in an examination. Make a scatter plot and interpret
the data.
Test scores
Mid-Term   Final
73         70
86         80
93         96
92         85
72         68
65         68
58         62
75         78
[Figure: scatter plot of Final Term Score (vertical axis) against Mid-Term Score (horizontal axis)]
7.3.1.2 Coefficient of Correlation
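A more precise method of determining the type and strength of a linear
correlation is to calculate the coefficient of linear correlation 𝑟, also known as
the Pearson Product-Moment Correlation Coefficient, for the two variables using
the formula:
$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}}$$
where 𝑛 is the number of pairs of values.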
Example 4.
Scores of students in the Midterm and Final Examinations were gathered.
The teacher wants to find the strength of linear relationship between the Midterm
scores and the Final Term scores. What is the coefficient of linear correlation?
Solution.
The scatter plot in the example suggests that a positive correlation exists
between Midterm and Final term scores.
$$r = \frac{8(47500) - 614(607)}{\sqrt{8(48236) - (614)^2}\,\sqrt{8(46917) - (607)^2}} = 0.933$$
From the result, we know that the Midterm score and the Final
term score have a strong positive linear correlation.
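The computation in this example can be reproduced with a short sketch. It uses the eight Mid-Term and Final score pairs listed in Example 3 and the sums-based form of the Pearson coefficient seen in the substitution above; the function name `pearson_r` is introduced here for illustration only.

```python
from math import sqrt

# Score pairs from the Test scores table in Example 3.
midterm = [73, 86, 93, 92, 72, 65, 58, 75]
final = [70, 80, 96, 85, 68, 68, 62, 78]

def pearson_r(xs, ys):
    """Pearson product-moment correlation computed from raw sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator

r = pearson_r(midterm, final)
print(round(r, 3))
```

The intermediate sums (614, 607, 47500, 48236 and 46917) are the same ones substituted into the formula in the solution.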
Learning Activity 1
Learning Activity 2
Mathematics   6     4     8   5   3.5
Chemistry     6.5   4.5   7   5   4
logical to believe that 𝑦 caused 𝑥. By convention, we plot the independent variable
along the horizontal axis or the 𝑥-axis and the dependent variable along the
vertical axis or 𝑦-axis.
Furthermore, simple linear regression is similar to correlation in that the
purpose is to measure to what extent there is a linear relationship between two
variables. In particular, the purpose of linear regression is to "predict" the
value of the dependent variable based upon the values of one or more independent
variables. The relationship is summarized by a regression equation consisting of
a slope and an intercept. The slope represents the amount by which the dependent
variable increases or decreases for each one-unit increase in the independent
variable, and the intercept indicates the value of the dependent variable when
the independent variable takes the value zero.
Example 6.
Find the equation of the least-squares line for the ordered pairs in the table
below.
𝑥 𝑦
2.5 3.4
3.0 4.9
3.3 5.5
3.5 6.6
3.8 7.0
4.0 7.7
4.2 8.3
4.5 8.7
Solution.
From the scatter plot in this example, we see that there is a positive
correlation between the two sets of data.
We now proceed with the process of finding the equation of the regression
line.
Steps, with the actual process and results:
1. Prepare the columns for 𝑥² and 𝑥𝑦.
   𝑥      𝑦      𝑥²      𝑥𝑦
   2.5    3.4    6.25    8.50
   3.0    4.9    9.00    14.70
   3.3    5.5    10.89   18.15
   3.5    6.6    12.25   23.10
   3.8    7.0    14.44   26.60
   4.0    7.7    16.00   30.80
   4.2    8.3    17.64   34.86
   4.5    8.7    20.25   39.15
   ∑𝑥 = 28.8, ∑𝑦 = 52.1, ∑𝑥² = 106.72, ∑𝑥𝑦 = 195.86
2. Compute the slope.
   $\text{slope} = \frac{8(195.86) - 28.8(52.1)}{8(106.72) - (28.8)^2} \approx 2.7303$
3. Find the means of the 𝑥 and 𝑦 values and the 𝑦-intercept 𝑏.
   $\bar{x} = \frac{\sum x}{n} = \frac{28.8}{8} = 3.6$
   $\bar{y} = \frac{\sum y}{n} = \frac{52.1}{8} = 6.5125$
The regression line is given by the red line in the next figure.
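The steps above can be sketched in code as a check on the arithmetic. This is an illustration under stated assumptions: the function name `least_squares` is hypothetical, and the intercept is computed as the mean of 𝑦 minus the slope times the mean of 𝑥, which is the standard least-squares relation.

```python
# Ordered pairs from the table in Example 6.
xs = [2.5, 3.0, 3.3, 3.5, 3.8, 4.0, 4.2, 4.5]
ys = [3.4, 4.9, 5.5, 6.6, 7.0, 7.7, 8.3, 8.7]

def least_squares(xs, ys):
    """Return (slope, intercept) of the least-squares line from raw sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    intercept = sum_y / n - slope * (sum_x / n)   # mean of y minus slope times mean of x
    return slope, intercept

slope, intercept = least_squares(xs, ys)
print(f"slope = {slope:.4f}, intercept = {intercept:.4f}")
```

Rounded to one decimal place, these coefficients give the line with slope 2.7 and intercept −3.3 that is used in Example 7.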
Example 7.
Use the equation of the least-squares line from the previous example to
predict the average 𝑦 values for each of the following 𝑥 values.
a. 2.8
b. 4.8
Solution.
Steps, with the actual process and results:
1. Substitute the given 𝑥 values into the formula that was obtained.
   a. 𝑦̂ = 2.7(2.8) − 3.3 = 4.26
   b. 𝑦̂ = 2.7(4.8) − 3.3 = 9.66
Example 8.
Five children aged 2, 3, 5, 7 and 8 years old weigh 14, 20, 32, 42 and 44
kilograms respectively.
a. Find the equation of the regression line of weight on age.
b. Based on this data, what is the approximate weight of a six-year-old
child?
Solution.
(a)
Steps, with the actual process and results:
1. Prepare the table with columns for 𝑥, 𝑦, 𝑥², and 𝑥𝑦.
   𝑥 (Age)   𝑦 (Weight in kg)   𝑥²    𝑥𝑦
   2         14                 4     28
   3         20                 9     60
   5         32                 25    160
   7         42                 49    294
   8         44                 64    352
   ∑𝑥 = 25, ∑𝑦 = 152, ∑𝑥² = 151, ∑𝑥𝑦 = 894
2. Compute the slope.
   $\text{slope} = \frac{5(894) - 25(152)}{5(151) - (25)^2} \approx 5.1538$
3. Find the mean of the 𝑦 values.
   $\bar{y} = \frac{\sum y}{n} = \frac{152}{5} = 30.4$
(b)
Steps, with the actual process and results:
1. Substitute the given 𝑥 value into the formula that was obtained.
   𝑦̂ = 5.2(6) + 4.6 = 35.8
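The prediction can be cross-checked with a short sketch. It uses the deviation form of the least-squares slope (sum of products of deviations divided by sum of squared 𝑥 deviations), which is algebraically equivalent to the raw-sums form used in the solution; the variable names are introduced here for illustration. It also shows how much rounding the coefficients to one decimal place changes the predicted weight.

```python
ages = [2, 3, 5, 7, 8]            # x, in years
weights = [14, 20, 32, 42, 44]    # y, in kilograms

n = len(ages)
mean_x = sum(ages) / n
mean_y = sum(weights) / n

# Deviation form of the least-squares slope and intercept.
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, weights))
s_xx = sum((x - mean_x) ** 2 for x in ages)
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

unrounded = slope * 6 + intercept   # prediction with full-precision coefficients
rounded = 5.2 * 6 + 4.6             # prediction with the coefficients used in the text

print(f"slope = {slope:.4f}, intercept = {intercept:.4f}")
print(f"unrounded prediction = {unrounded:.2f} kg, rounded prediction = {rounded:.1f} kg")
```

The two predictions differ by about a quarter of a kilogram, so it is worth stating how the coefficients were rounded when reporting a predicted value.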
Learning Activity 3
An exercise instructor remembers the data given in the following table,
which shows the recommended maximum exercise heart rates for individuals of
given ages.
Age (𝑥 years)             20    40    60
Maximum heart rate (𝑦)    170   153   136
7.6 Assessment Task
Direction. Answer the following items.
1. The table below shows the students’ involvement in community service
(in hours) and their general weighted average (GWA).