
GEC 3

Mathematics in the
Modern World

Course
Modules

Weeks 7-12
MODULE 4
Data Management: Introduction to Statistics

4.1 Introduction
When we hear the word Statistics, the first thing that comes to mind is a set
of numerical figures, such as your monthly allowance, the number of hours you
spend in school, the number of hours you spend on Facebook, your vital
statistics, etc.
However, the study of statistics is not limited to knowing and memorizing
numerical figures. This module will give us a better understanding of what
Statistics is about. It also discusses how some of its processes are carried
out.
4.2 Learning Outcomes
After finishing this module, you are expected to:
1. discuss the importance of statistics in your field of study;
2. compare and contrast descriptive statistics and inferential
statistics;
3. define data;
4. identify different types of data as well as their level of measurement;
5. identify appropriate data collection methods based on needed data; and
6. identify appropriate data presentation type for a set of data.

4.3 What You Need to Know


The following definition will give the meaning of the study of Statistics:

DEFINITION 5.1 (Statistics)

Statistics is the branch of science that deals with the collection,
presentation, organization, analysis, and interpretation of data.

Why are all processes involved in Statistics important? Statistics has the
ability to provide us with tools we need to convert raw data into information that
we can use to make sensible decisions and intelligent choices.
People from various fields of interest need to obtain information to answer
different types of problems. Nowadays, we do this by performing a statistical
inquiry. This will allow us to answer problems with a clearer understanding of a
particular collection of information.

DEFINITION 5.2 (population)

The population is the collection of all elements under consideration in a
statistical inquiry. The sample is a subset of the population.

Usually, the population of interest is so large that it becomes too
expensive and time-consuming to collect data from every element of the
population. Thus, we have no other option but to get the data we need from only
a subset of the population. We use the term sample to refer to this subset of the
population.
In any statistical inquiry, we study certain characteristics or attributes of
the elements in the population, which we call variables. Just like in algebra, we
denote variables with letters of the English alphabet. We refer to these
characteristics as variables because their realized values may vary for the
different elements in the sample or population.

DEFINITION 5.3 (variable, observation, and data)

The variable is a characteristic or attribute of the elements in a collection that
can assume different values for the different elements. An observation is a
realized value of a variable. Data is the collection of observations.

Example 1. Assigning the population and sample


If we define our population to be the set of all students of ISU for SY 2020-2021,
a sample is the set of first-year students in ISU for SY 2020-2021.

Example 2. Below are illustrations of variables together with their possible
values.

Variable                                     Possible Observations
S   sex of a student                         Male, Female
E   employment status of an employee         Temporary, Permanent, Contractual
I   monthly income of a person in pesos      𝑖 ≥ 0
N   number of children in a household        𝑛 = 0, 1, 2, 3, …

Example 3. Identifying population and variables of interest.


The research division of a certain pharmaceutical company is investigating
the effectiveness of a new diet pill in reducing weight on female patients.
Population: set of all female patients who will use the diet pill
Variables of interest: weight before taking the pill, weight after taking the pill

Regardless of whether we are using data collected from the population or
from the sample, it would be difficult to understand what all these numeric figures
convey. To give meaning to these numbers, it is necessary to summarize and
condense the information contained in this collection of observations into a
single numeric figure that describes a particular feature of the whole collection.
We call this single numeric figure a summary measure.

DEFINITION 5.4 (parameter and statistic)

The parameter is a summary measure describing a specific characteristic
of the population. The statistic is a summary measure describing a
specific characteristic of the sample.

Example 4.
A summary measure that we are familiar with is the proportion. The
proportion is the quotient obtained when we divide the magnitude of a part by
the magnitude of the whole. Suppose that among the 35 students in a class, 28
claimed that they own a cellular phone. Taking the class as our population, we
can now compute the proportion of students in the population with cellular
phones:

$$P = \frac{\text{number of students with cellular phones}}{\text{number of students in the population}} = \frac{28}{35} = 0.8$$

The proportion of students in our population with cellular phones is an
example of a parameter because it is a summary measure describing a
characteristic of the population.
Suppose we take a sample of 10 students from this class. Among the 10
students in the sample, 7 own cellular phones. Using the sample alone, we
cannot compute the proportion 𝑃 of students in the population with cellular
phones, but we can compute 𝑃̂ (read as “𝑃 hat”), where 𝑃̂ is the proportion of
students in the sample with cellular phones, as follows:

$$\hat{P} = \frac{\text{number of students with cellular phones in the sample}}{\text{number of students in the sample}} = \frac{7}{10} = 0.70$$

The proportion of students in our sample with cellular phones is an example of
a statistic because it is a summary measure describing a characteristic of the
sample.
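The two computations in Example 4 can be sketched in a few lines of Python. This is only an illustration; the function name `proportion` is our own and is not part of the module.

```python
def proportion(part: int, whole: int) -> float:
    """Return the quotient of the part divided by the whole."""
    if whole <= 0:
        raise ValueError("the whole must be a positive count")
    return part / whole


# Population: all 35 students in the class, 28 of whom own a cellular phone.
P = proportion(28, 35)      # a parameter, since it describes the population

# Sample: 10 students from the class, 7 of whom own a cellular phone.
P_hat = proportion(7, 10)   # a statistic, since it describes the sample

print(f"P = {P:.2f}, P_hat = {P_hat:.2f}")
```

The same function gives a parameter or a statistic depending only on whether the counts come from the whole population or from a sample.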

Learning Activity 1

Direction. Answer the given item.

Mr. Jose Mari Khan, a candidate for mayor in San Sebastian,
wants to find out if there is a need to intensify his campaign
efforts against his opponents. He requested the services of a
group of students to interview 1,000 of the 3,000 registered voters
in San Sebastian. The survey results showed that 75% of the 1,000
voters in the sample will vote for him as mayor.

a. Identify the population and the sample.
b. Identify the variable of interest.
c. Identify the parameter or statistic.

4.3.1 Major Areas of Applied Statistics


There are two major fields of Statistics. These are applied statistics and
theoretical or mathematical statistics. Applied statistics is concerned with
procedures and techniques used in the collection, presentation, organization,
analysis, and interpretation of data.
We study applied statistics in order to learn how to select and properly
implement the most appropriate statistical methods to answer research
problems. On the other hand, mathematical statistics is concerned with the
development of the mathematical foundations of the methods used in applied
statistics.
There are two major areas of interest in applied statistics. These are
descriptive statistics and inferential statistics.

DEFINITION 5.5 (Descriptive Statistics)

Descriptive Statistics includes all the techniques used in organizing,
summarizing, and presenting data on hand. It is concerned with summary
calculations, such as averages and percentages, and the construction of graphs,
charts, and tables.

We use methods in descriptive statistics to summarize and describe the
features of the data on hand. The data on hand may have come from all the
elements of the population so that the analysis using descriptive statistics will
allow us to describe the population. The data on hand may also come from the
elements of a selected sample. In this case, the analysis using descriptive
statistics will only allow us to describe the sample. The methods used in
descriptive statistics will not allow us to generalize about the population using
sample data.
Example 5. Below is an illustration of the application and restriction of
descriptive statistics.
Given the daily sales performance for a product for the previous year, we
can draw a line chart or a column chart to emphasize the upward/downward
movement of the series. Likewise, we can use descriptive statistics to calculate
a quantity index per quarter to compare the sales per quarter for the previous
year.

DEFINITION 5.6. (Inferential Statistics)

Inferential Statistics includes all the techniques used in analyzing sample
data that will lead to generalizations about the population from which the
sample came. It consists of performing hypothesis testing, determining
relationships among variables, and making predictions.

In inferential statistics, we do not simply describe the sample data. Rather,
we use the sample data to form conclusions about the population. Since the
sample is only a subset of the population, we arrive at the conclusions about the
population using inferential statistics under conditions of uncertainty. It should
be clear that whatever conclusions we make using inferential statistics are always
subject to some error.
Example 6. Below is an application of inferential statistics.
To determine if reforestation is effective, we can take a representative portion
of denuded forests and use inferential statistics to draw conclusions about the
effect of reforestation in all denuded forests.

Learning Activity 2

Direction. For each of the following statements, decide whether it
belongs to the field of descriptive statistics or inferential statistics.

1. A badminton player wants to know his average score for the past 10 games.
2. A car manufacturer wishes to estimate the average lifetime of batteries by
testing a sample of 50 batteries.
3. A politician wants to determine the total number of votes his rival obtained
in the past election based on his copies of the daily tally sheets of the
electoral returns.

4.3.2 Collection of Data

Data Collection is the process of gathering and measuring information on
variables of interest, in an established systematic fashion that enables one to
answer stated research questions, test hypotheses, and evaluate outcomes. The
data collection component of research is common to all fields of study including
physical and social sciences, humanities, business, etc. While methods vary by
discipline, the emphasis on ensuring accurate and honest collection remains the
same.

4.3.2.1 Quantitative and Qualitative Variables or Data


In doing a report or research, initially, we have to define the variables
relevant to the data. There are two major classifications of variables: qualitative
and quantitative.
1. Qualitative Variables are nonnumeric variables and cannot be
measured.

Examples include gender, religious affiliation, and ethnicity.

2. Quantitative Variables are numerical variables and can be measured.

Examples include the balance in your checking account and the number of
children in your family.

Some quantitative variables can take on only specific or isolated values
along a scale. For example, the number of children in the family may be 1, 2, 3,
or any other whole number, but it can never be 1.25 or 0.5. Thus, this variable
has values which can only be obtained through the process of counting and is
referred to as a discrete or discontinuous variable.
Quantitative variables can be ordered and ranked. They can be
classified into two groups:
1. Discrete variables take values that are obtained by counting. The results
are whole numbers. For example, the number of students in the room.

2. Continuous variables take values that are obtained by measuring. The
results can be any value between two specific values. For example, if
you take the height of each student in a room, you could get any
number between two reasonable amounts. So height is a continuous
variable.

Learning Activity 3

Directions. Write 𝑄 if the variable is qualitative and if it is quantitative, write
𝐷 if it is discrete and 𝐶 if continuous.

1. Speed of cars
2. Brand of watches
3. Height of children
4. Student number
5. IQ score
6. Educational attainment
7. Number of years in school
8. Weight of children
9. Height of the building
10. ID number
11. Place of residence
12. Rank of teachers
13. Time required to take the examination
14. Political affiliation

4.3.2.2 Levels of Measurement
Variables can also be classified according to the level of measurement.
There are four levels of measurement: Nominal, Ordinal, Interval, and Ratio.
1. Nominal Data. In this case, numbers are used to represent an item or
characteristic. Examples include: names, gender, religious affiliation,
civil status, college majors. Note that such data should not be treated as
numerical, since relative size has no meaning.

2. Ordinal or Rank Data. In this set, numbers can be ordered or ranked,
but a specific difference in the levels cannot be determined. For
example, the performance rating can be represented by numbers as
illustrated below:

5 - Outstanding
4 - Very Satisfactory
3 - Satisfactory
2 - Poor

Because order in this set is considered, we know that Outstanding
is higher than Very Satisfactory, and Very Satisfactory is higher than
Satisfactory, etc., but there is no exact difference between any two of
them. For example, the grades of Outstanding and Very Satisfactory
may be close (4.65 and 4.45) or may be far apart (5.00 and 4.25), so the
exact difference cannot be determined.

3. Interval Data. In this set, numbers can be ordered and have an exact
difference between any two units, but there is no meaningful zero or starting
point. For example, temperature in degrees Celsius is interval data: the values
can be ordered and there is an exact difference between two degrees, but zero
does not mark the starting point since there can be temperatures below
zero.

4. Ratio Data. This set is the highest level of measurement and allows for
all basic arithmetic operations, including division and multiplication.
Data at this level can be ordered, have exact differences between units,
and have a meaningful zero. Things that are counted are usually ratio
level, for example, business data such as cost, revenue, and profit.
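The four levels can be told apart by asking three questions in turn: can the values be ordered, are the differences between values exact, and is there a meaningful zero? The sketch below encodes that reasoning. The function name and the true/false answers given for each example variable are our own reading of the descriptions above, not an official classification rule.

```python
def level_of_measurement(ordered: bool, exact_differences: bool,
                         meaningful_zero: bool) -> str:
    """Classify a variable from the three properties discussed in the text."""
    if not ordered:
        return "nominal"    # labels only; relative size has no meaning
    if not exact_differences:
        return "ordinal"    # can be ranked, but gaps between ranks are unknown
    if not meaningful_zero:
        return "interval"   # exact differences, but zero is not a starting point
    return "ratio"          # ordered, exact differences, meaningful zero


# (ordered, exact differences, meaningful zero) for examples from this section
examples = {
    "civil status": (False, False, False),
    "performance rating": (True, False, False),
    "temperature": (True, True, False),
    "revenue": (True, True, True),
}

for name, properties in examples.items():
    print(f"{name}: {level_of_measurement(*properties)}")
```

Each level keeps the properties of the levels before it and adds one more, which is why the questions can be asked in a fixed order.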

Learning Activity 4

Direction. State the level of measurement used to measure the following
variables:

a. Postal zip code
b. Performance rating of an employee as excellent, very good, good, fair, and
bad
c. Student number
d. Ranking of a student in class
e. Annual salary of employee

4.3.2.3 Importance of Accurate and Appropriate Data Collection

Regardless of the field of study or preference for defining data (quantitative,
qualitative), accurate data collection is essential to maintaining the integrity of a
report or research. Both the selection of appropriate data collection instruments
(existing, modified, or newly developed) and clearly defined instructions for their
correct use reduce the likelihood of errors occurring.

In the case where data are not properly gathered, the consequences are as
follows:

1. inability to answer research questions accurately
2. inability to repeat and validate the study
3. distorted findings resulting in wasted resources
4. misleading other researchers to pursue unproductive ways of
investigation
5. compromising decisions for public policy
6. causing harm to human participants and animal subjects

4.3.2.4 Data Collection Methods


We discuss the most widely used methods for collecting data. These include
the use of documented data, surveys, experiments, and observations.
4.3.2.4a Use of Documented Data
Sometimes information is difficult to gather or measure personally. If
information is already available for use, then it would be more practical to use
documented data in gathering needed information.

One can obtain documented data from previous studies of individuals,
written reports of government and nongovernment agencies, periodicals, and
others.
Example 7.
The Philippine Statistics Authority is a major collector of data for
government needs. It provides the public with basic data on various subject
matters. A few of these are household income and expenditure,
employment, and others.

DEFINITION 5.7. (primary data, secondary data)

Primary data are data documented by a primary source. The data collectors
themselves documented this data.

Secondary data are data documented by a secondary source. An
individual/agency, other than the data collectors, documented this data.

Example 8. The following agencies can provide primary data:


a. Central Bank (CB) is a primary source of data on banking and finance.
b. Philippine Statistics Authority (PSA) is a primary source of data on
population, housing, and establishments.
c. Bureau of Agricultural Statistics (BAS) is a primary source of data on
agriculture and livestock.
d. The University Registrar’s Office is a primary source of student
records.
Example 9. The following are examples of secondary data.
a. The United Nations’ compiled data for its yearbook, which were
originally gathered by government statistical agencies of different
countries.
b. A medical researcher’s documented data for his research paper, which
were originally collected by the Department of Health.
c. The documented data of a student for his thesis, which were originally
collected by the Department of Labor and Employment.

4.3.2.4b Surveys
DEFINITION 5.8. (survey, census, sample survey)

The survey is a method of collecting data on the variable of interest by asking
people questions. When the data come from asking all the people in the
population, this is called a census. On the other hand, when the data come from
asking a sample of people from a well-defined population, this is called a
sample survey.

The interviewees are the respondents of the survey. A questionnaire which
contains all the questions that each respondent will have to answer is used.
Usually, respondents are selected objectively by employing a probability sampling
procedure. By following an objective method of selecting the sample, the
reliability of generalizations about the population under study can be assessed.
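One probability sampling procedure is simple random sampling, where every element of the population has the same chance of being selected. The sketch below draws such a sample using Python's standard `random` module. The list of 3,000 voter labels and the seed value are hypothetical and serve only to make the illustration repeatable.

```python
import random


def simple_random_sample(population, sample_size, seed=None):
    """Draw sample_size distinct elements, each equally likely to be chosen."""
    rng = random.Random(seed)   # a fixed seed makes the draw repeatable
    return rng.sample(population, sample_size)


# Hypothetical population: labels for 3,000 registered voters.
voters = [f"voter_{i:04d}" for i in range(1, 3001)]

# Select 1,000 respondents without replacement.
respondents = simple_random_sample(voters, 1000, seed=2016)

print(len(respondents), "respondents selected;", len(set(respondents)), "distinct")
```

Because the selection is left to chance rather than to the interviewer's judgment, no voter is favored over another.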
Example 10.
Pulse Asia conducted a sample survey on voter response to political ads in
the May 2016 election. Its respondents were selected registered voters who
intend to vote in the 2016 election.
There are various methods of communicating with the respondents in a
survey. Some of the most commonly used methods are personal interviews,
telephone interviews, self-administered questionnaires, online surveys, and
focus group discussions.
4.3.2.4c Experiments
DEFINITION 5.9. (experiment)

The experiment is a method of collecting data where there is
direct human intervention on the conditions that may affect the
values of the variable of interest.

In an experiment, there are different types of variables:

• The explanatory variable is the factor under study.
• The response variable is the observation which the researcher uses
for comparison after conducting the experiment using the factor
under study.
• The extraneous variables are those which the researcher believes may
have an effect on the response variable.
We consider the classical mongo experiment. We randomly selected mongo
seeds and planted them in two pots. We exposed one pot to sunlight and the other
we did not. Both pots have the same soil type. We watered the pots at the same
time using the same amount of water. A few weeks later, we observed the heights
of the mongo plants.
In this experiment, the objective is to determine the effect of sunlight on the
height of a mongo plant. The explanatory variable is the amount of sunlight.
Categories for the explanatory variable are called “treatments” or factor levels.
The response variable is the height of the mongo plant and the extraneous
variables are identified to be the soil type and amount of water.
The extraneous variables are usually controlled by making sure that the two
groups receive the same levels or amounts. The use of a randomization
mechanism in assigning the treatments and controlling the identified extraneous
variables makes the experiment a more effective method of data collection in
establishing cause and effect.
Example 11.
The school administration wishes to determine which of the two methods is
more effective in training new student leaders. They randomly assigned twenty
student leaders to training method 1 and twenty student leaders to training
method 2. After one month of training, they administered a standardized
achievement test to the two groups and compared their scores.
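The random assignment in Example 11 can be sketched as follows: shuffle the list of subjects, then cut it into equal groups, one per treatment. The function name, the leader labels, and the seed are our own assumptions for illustration only.

```python
import random


def assign_treatments(subjects, treatments, seed=None):
    """Randomly split subjects into equal-sized groups, one per treatment."""
    if len(subjects) % len(treatments) != 0:
        raise ValueError("subjects cannot be divided equally among treatments")
    rng = random.Random(seed)
    shuffled = list(subjects)   # copy so the original list is left unchanged
    rng.shuffle(shuffled)
    size = len(shuffled) // len(treatments)
    return {
        treatment: shuffled[i * size:(i + 1) * size]
        for i, treatment in enumerate(treatments)
    }


# Hypothetical labels for the forty student leaders in Example 11.
leaders = [f"leader_{i:02d}" for i in range(1, 41)]
groups = assign_treatments(leaders, ["method 1", "method 2"], seed=11)

for treatment, members in groups.items():
    print(treatment, len(members))
```

Since chance alone decides who receives which training method, differences between the two groups' test scores are less likely to be caused by how the groups were formed.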
4.3.2.4d Observation
DEFINITION 5.10. (observation method)

The observation method is a method of collecting data on the
phenomenon of interest by recording the observations made
about the phenomenon as it actually happens.

The observation method is useful in studying the reactions and behavior of
individuals or groups of persons/objects in a given situation or environment as
it happens. For example, a researcher may use the observation method to study
the behavior patterns in panic situations like a big fire, the landslide in Itogon,
Benguet, or the destruction of structures when Typhoon Yolanda hit Tacloban
City.
It is also practical to use the observation method when the subjects under study
cannot express their sentiments or are unable to speak. For example,
researchers often use the observation method to study the behavior of animals
in the wild, or the behavior of newborn babies in the nursery.

The table below shows the comparison of survey, experiment, and
observation methods.

                                        Data Collection Method
Aspect                                  Survey              Experiment           Observation
Assessing the reliability of            Generally possible  Sometimes difficult  Oftentimes difficult
generalizations about a
well-defined population
Ability to establish cause-and-effect   Poor                Superior             Poor
Realism of data                         Realistic           Least realistic      Most realistic

Learning Activity 5

Direction. Answer the following items.

1. Classify each of the following as a survey, an experiment, or an
observation.

a. A local TV network asked voters to indicate whom they voted for as they
exited the polling booth.
b. A private hospital divides terminally ill patients into two groups, with one
group receiving medication 𝐴 and the other group receiving medication 𝐵.
After a month, they measured each subject’s improvement.
c. A researcher investigates the level of pollution in key points in Metro
Manila by setting up pollution measuring devices at selected intersections.
d. The school administration asked students whether they are willing to have
an increase in laboratory fees if there is an upgrade of computers.

2. What method of data collection is most appropriate for the following cases?

a. Studying two groups of patients and determining if exercise lowers the
blood pressure.
b. The Department of Health monitors and evaluates the benefits of the family
planning methods given to a certain community.
c. A group of medical intern students studies the effects of laughter on
patients in a hospital.
d. A nongovernment organization compares the household expenditures of two
districts in Isabela.

4.3.3 Presentation of Data
After data collection, we need to organize and analyze the data. After
organization and analysis, we present the results in forms that will allow us to
reveal important information we obtained from the data.
There are three ways to present the information from our data. These
include textual, tabular, and graphical presentations.
4.3.3.1 Textual Presentation
Textual presentation of data incorporates important figures in a paragraph
of text. In this type of presentation, we insert important data figures or summary
measures within the paragraph of text to support our conclusions.
Textual presentation allows us to direct the reader’s interest to vital information
we want to highlight. Summary measures like the minimum, maximum, total, and
percentages are just a few of the figures that may be included in a textual
presentation.
It is necessary to select the most important figures we want to focus on.
Whenever we use textual presentation, we must always provide our readers with
additional discussion about the relevance of the figures in our presentation.
Example 12. Here is an illustration of textual presentation.
Excerpts taken from the Isabela Covid-19 Case Updates.
“As of 4PM today, the Department of Health reports a total number of COVID-19
cases at 290,190, after 3,475 newly-confirmed cases were added to the list of
COVID-19 patients.

DOH likewise announces 400 recoveries. This brings the total number of
recoveries to 230,233.
Twenty-eight duplicates were removed from the total case count. Of these, 19
were recovered cases.
Moreover, 13 cases previously reported as recovered were reclassified as death
(12) and active (1) cases after final validation.”

From the illustration given, the paragraphs showed and highlighted only the
most important figures. Few numbers were included and minute details or a
large quantity of data were not presented. If we want to refer to other details of
the data, then it would be more appropriate to use tabular presentation.
4.3.3.2 Tabular Presentation
Tabular presentation of data arranges figures in a systematic manner in
rows and columns. It is the most common method of data presentation. We can
use it for various purposes such as description, comparison, and in showing
relationships between two or more variables of interest.

In tabular presentation, we arrange the data figures or summary measures
in rows and columns for easy reading. Tables should be simple and easy to
understand. Each row and column must have an appropriate label.
Three types of tabular presentation will be discussed in this module namely,
leader work, text tabulation, and the formal statistical table.
4.3.3.2a Leader Work
Leader work has the simplest layout among all three types of tables. It
contains no table title or column headings and has no table borders. We
incorporate this type within a paragraph presenting one or two columns of
figures as supporting data.
Example 13.
The population in the Philippines for the census years 1975 to 2000 is as
follows:

1975 42,070,660
1980 48,098,460
1990 60,703,206
1995 68,616,536
2000 76,498,735
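Leader work gets its name from the row of dots (leaders) that guides the eye from a label to its figure. As a rough sketch, the layout of Example 13 can be produced with a few lines of Python. The function name and the line width of 30 characters are our own choices.

```python
# Census figures taken from Example 13.
census = [
    (1975, 42_070_660),
    (1980, 48_098_460),
    (1990, 60_703_206),
    (1995, 68_616_536),
    (2000, 76_498_735),
]


def leader_work(rows, width=30):
    """Format (label, figure) pairs as lines joined by dot leaders."""
    lines = []
    for label, figure in rows:
        left = str(label)
        right = f"{figure:,}"   # thousands separators for readability
        dots = "." * (width - len(left) - len(right) - 2)
        lines.append(f"{left} {dots} {right}")
    return lines


for line in leader_work(census):
    print(line)
```

The result has no title, headings, or borders, which is what distinguishes leader work from the other types of tables.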

4.3.3.2b Text Tabulation


The format of text tabulation is a little bit more complex than leader work.
It already has column headings and table borders, making it easier to
understand than leader work. This type does not have a table title or table
number. Thus, it still needs an introductory description for reader comprehension.
Example 14.
The distribution of cellular subscribers per telephone operator as of
December 2003 is as follows:
Telephone Operator     Number of Subscribers
SMART                             10,080,112
GLOBE                              8,800,000
PILTEL                             2,867,085
EXTELCOM                              29,896
Total                             22,509,560

4.3.3.2c Formal Statistical Table
The formal statistical table is the most complex type of table since it has all
the different parts like the table number, table title, head note, box head, stub
head, column headings, and so on. It is a stand-alone table and can be easily
understood even without a description.
The following presents the different parts of a formal statistical table:

Heading (located on top of the table of figures)
   Table number    number that identifies the position of the table in a
                   sequence
   Table title     states in telegraphic form the subject, data
                   classification, and place and period covered by the
                   figures in the table
   Head note       appears below the title but above the top cross rule of
                   the table and provides additional information about the
                   table

Box head
   Spanner head    caption or label describing two or more column heads
   Column head     label that describes the figures in a column
   Panel           set of column heads under the same spanner head

Stub (located at the left side of the table)
   Row caption     label that describes the figures in a row
   Center head     label describing a set of row captions
   Stub head       caption or label that describes all of the center heads
                   and row captions and is located at the first row

Field              collection of figures in the table
Line               row of figures
Column             column of figures
Cell               intersection of a row caption and a column heading
Footnote           a descriptive statement about a particular part of the
                   table or the whole table located at the bottom of the
                   table
Source note        gives the name of the agency that collected the data

Example 15.
Below is an example of a formal statistical table.

4.3.3.3 Graphical Presentation
Graphical presentation of data portrays numerical figures or relationships
among variables in pictorial form. Some statistical charts used in this type of
presentation are given in the following table:

Type of Chart           Description
Line Chart              • Useful for presenting historical data
                        • Effective in showing movement of a series over time
                        • Appropriate when comparing two or more time series
                          data and trends over time

Column Chart            • Compare amounts in a time series data
                        • Emphasis is on difference in magnitude
                        • For time series data, columns are arranged on the
                          horizontal axis

Horizontal Bar Chart    • Appropriate when we wish to show the distribution of
                          categorical data
                        • Used to compare magnitudes for different categories
                          of a qualitative variable

Pie Chart               • Circle divided into several sections
                        • Each section indicates the proportion of each
                          component

Pictograph              • Like a horizontal bar chart that uses symbols or
                          pictures instead of bars
                        • The purpose is to get the attention of the readers
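To see how a horizontal bar chart or pictograph compares magnitudes across categories, the rough sketch below draws one in plain text from the subscriber figures of Example 14. The function name, the `#` symbol, and the scale of one symbol per million subscribers (rounded to the nearest million) are our own assumptions.

```python
# Subscriber figures per operator, taken from Example 14.
subscribers = {
    "SMART": 10_080_112,
    "GLOBE": 8_800_000,
    "PILTEL": 2_867_085,
    "EXTELCOM": 29_896,
}


def text_bar_chart(data, unit=1_000_000, symbol="#"):
    """Return one line per category, with one symbol per `unit` (rounded)."""
    label_width = max(len(label) for label in data)
    lines = []
    # Sort from largest to smallest so the magnitudes are easy to compare.
    for label, value in sorted(data.items(), key=lambda item: item[1], reverse=True):
        bar = symbol * round(value / unit)
        lines.append(f"{label:<{label_width}} | {bar} {value:,}")
    return lines


for line in text_bar_chart(subscribers):
    print(line)
```

An operator with far fewer than half a million subscribers gets no symbol at all, which shows why the choice of scale matters when reading a pictograph.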

4.4 Supplementary Learning Content


Importance of Knowledge in Statistics
Data is an important part of an inquiry, and statistical knowledge is
essential in carrying out the different steps, from the proper methods of data
collection and correct data analyses to effective data presentation. A deep
understanding of statistics is necessary for the following reasons:
1. It enables anyone to become a better and more effective problem solver.

2. It provides procedures to gather data systematically and logically for
the advancement of knowledge.

3. It helps in organizing questions and testing theories.

4. It assists in describing and understanding the relationship between
variables that are often important in decision-making.

5. Knowledge of the statistical process can help us measure current
change and improve the forecasting process in predicting the future
accurately.
Role of Statistics in Data Analysis
The following list provides the roles Statistics plays in data analysis:
1. To organize the numbers derived from measuring a trait or a variable.

2. To describe and interpret the distribution of data, relationships
between variables, hypotheses being tested, or parameters being
predicted or estimated.

3. To help the researcher in making credible decisions based on
quantitative data or arguments.

4. To cope with changes by forecasting the future based on data on hand.

5. To provide a plausible foundation for building new learning or teaching
theory in education.

4.5 Supplementary Learning Resources

• Excel Charts & Graphs: Learn the Basics for a Quick Start by Leila
Gharani
https://www.youtube.com/watch?v=DAU0qqh_I-A

• Creating a Table in Word from Skillsoft YouTube
youtube.com/watch?v=koDeGamrxV4
4.6 Flexible Teaching-Learning Modality
Remote (asynchronous)
• Module, exercises, problem sets, PowerPoint lessons
4.7 Assessment Task

Direction. Answer each of the following assessment tasks.

A. Short-response Essay

1. Discuss the possible applications of Statistics in your respective field of
study.
2. Research on new discoveries in your field of study where Statistics was
applied. In which part was Statistics applied?

B. Identification

1. The average weekly allowance of students last year at a private high school
was Php 600.00 per week, based on an enrollment of 1,080 students. The
third year students who did not have this information interviewed 50
students and found their average weekly allowance last year to be Php
550.00. Identify the following:

a. Population
b. Sample
c. Variable of interest
d. Parameter
e. Statistic

2. Observe the use of the number seven in the following statements. Classify
each statement according to the level of measurement used to get the value
7.

a. Mark is in the 7th grade.
b. Mark measured the temperature of the object as 7℃.
c. Mark has a score of 7 in the Math quiz.
d. Mark’s basketball shirt number is 7.
e. Mark has 7 cousins.

3. What method of data collection is most appropriate for the following cases?

a. A group of Anthropology students studies the culture and norms of two
ethnic groups.
b. A social welfare organization gathers information on hospital patients
with mental disorders.
c. A construction contractor deciding on future house construction
wishes to determine the type of house in demand among families.
d. A milk manufacturer wishes to determine the preference in drink flavor
among children in planning for new products.
e. A car manufacturer studies the preference of cars for the next
production.

4. Indicate the type of chart you would choose to present the information
given in each of the following cases.

a. Percentage distribution of monthly expenditures of a Filipino family for
clothing, footwear, house maintenance, and food
b. The number of hardware stores in Isabela for the year 2015 to 2020
c. Log production for 1990 to 2005
d. Location of Savemore supermarkets in Region 2
e. Distribution of employees by civil status

Your answers in items where you are asked to discuss will be graded according
to the following standards:

Score   Criteria
0       Unable to elicit the ideas and concepts from the learning activity, material, or video
1       Able to elicit the ideas and concepts from the learning activity, material, or video but shows erroneous understanding
2       Able to elicit the ideas and concepts from the learning activity, material, or video and shows correct understanding
3       Able to elicit the correct ideas from the learning activity, material, or video and also shows evidence of internalization and consistently contributes additional thought to the core idea

MODULE 5
Data Management: Measures of Central Tendency,
Dispersion and Position

5.1 Introduction
Often we wish to describe a set of data with a single number, or a small set
of numbers, in such a way that these values will yield enough information about
the content of the data that we can produce a means of generating a similar set
of data from this description.

One manner in which this can be done is by specifying values that describe the
numerical center of the set of data, which may be defined in various ways. They
are measures of the central tendency of the data. We can also describe the data
by how it is dispersed around a particular measure of central tendency. A third
manner in which we can describe data is by how it tends to accumulate with
respect to the central tendency--such as whether it tends to accumulate
immediately to the left or to the right of the numerical center.

There are three ways of describing data: measures of central tendency,
measures of variation, and measures of position.
The measures of central tendency are the averages and tell about the middle
of the data. The measures of variation tell if the data are close together or spread
far apart. The measures of position tell the relative position of a number in a
given data set in comparison with the rest of the numbers.
Numerical measures computed from a population are called parameters, while
those computed from a sample are called statistics.
5.2 Learning Outcomes

After finishing this module, you are expected to:

1. Describe the measures of central tendency;


2. Compute or obtain the different measures of central tendency; and
3. Select the proper measures of central tendency to use.

5.3 What You Need to Know


5.3.1 Measures of Central Tendency
Measures of central tendency provide us a convenient way of describing a
set of data with a single number. Each is a value used to represent the typical or
“average” value in a data set. In this section, three commonly used measures of
central tendency (mean, median, and mode) will be discussed for ungrouped (raw)
and grouped data. Ungrouped data are raw data, while grouped data are raw data
that have been organized into a frequency distribution table for better and easier
understanding.
5.3.1.1 Mean
The arithmetic mean, or simply the mean, is the most familiar and most widely used
measure in our daily life activities. It is considered the most reliable measure because all the
values of the variable are taken into consideration. It is the sum of all data
values divided by the number of values in the data set. The mean of a sample
data set is denoted by $\bar{x}$ and the mean of a population data set by the Greek
letter $\mu$.

Sample mean: $\bar{x} = \dfrac{\sum x}{n}$, where the $x$’s are the values of the individual
observations and $n$ is the number of observations in the sample.

Population mean: $\mu = \dfrac{\sum x}{N}$, where the $x$’s are the values of the individual
observations and $N$ is the number of observations in the population.

Example 1. Find the mean score of the following sample data set:
Quiz Scores: 1, 5, 7, 7, 6, 8, 10, 9, 5, 10, 8
Solution.
Steps and results:
1. Find the sum: $\sum x = 1 + 5 + 7 + 7 + 6 + 8 + 10 + 9 + 5 + 10 + 8 = 76$.
2. Divide the sum by the number of observations. In this case, there are 11:
$\bar{x} = \dfrac{\sum x}{n} = \dfrac{76}{11} = 6.9090\ldots \approx 6.91$.
Thus, the sample mean is $\bar{x} = 6.91$.
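
The two steps of Example 1 can be sketched in Python as a quick check. This is only an illustration; the variable names are chosen here and are not part of the module.

```python
from statistics import mean

# Quiz scores from Example 1
quiz_scores = [1, 5, 7, 7, 6, 8, 10, 9, 5, 10, 8]

total = sum(quiz_scores)     # Step 1: sum of the observations
n = len(quiz_scores)         # number of observations
sample_mean = total / n      # Step 2: divide the sum by n

print(total, n, round(sample_mean, 2))

# Cross-check with the standard library's mean function
library_mean = mean(quiz_scores)
```

The hand computation and the library function should agree, which is a simple way to catch a missed term in a long sum.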

The mean for ungrouped data in a frequency distribution is found by
multiplying each value by its frequency, adding all the
products, and dividing by the total of the frequencies.

Sample mean: $\bar{x} = \dfrac{\sum fx}{n}$, where $f$ is the frequency of each value $x$, the
$x$’s are the values of the individual observations, and $n$ is the number of observations in the
sample.

Population mean: $\mu = \dfrac{\sum fx}{N}$, where $f$ is the frequency of each value $x$, the
$x$’s are the values of the individual observations, and $N$ is the number of observations in the
population.

Example 2.
What is the mean age in the following set of sample data?

Age (𝑥) Frequency (𝑓)


16 5
17 10
18 12
19 8
Solution.
Steps and results:
1. Find the product of $f$ and $x$ for each row.
2. Add the values under the column of $f$, which gives $n$, and under the column
of $fx$, which gives $\sum fx$.

Age ($x$)   Frequency ($f$)   $fx$
16          5                 80
17          10                170
18          12                216
19          8                 152
Total       $n = 35$          $\sum fx = 618$

3. Divide $\sum fx$ by $n$: $\bar{x} = \dfrac{\sum fx}{n} = \dfrac{618}{35} \approx 17.66$.
The mean age is 17.66.
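
As a sketch of the same computation in Python, the frequency table of Example 2 can be stored as two parallel lists. The list names are illustrative assumptions, not notation from the module.

```python
# Frequency table from Example 2
ages = [16, 17, 18, 19]
freqs = [5, 10, 12, 8]

n = sum(freqs)                                     # total frequency
sum_fx = sum(f * x for x, f in zip(ages, freqs))   # sum of the f*x products
mean_age = sum_fx / n                              # divide the sum of products by n

print(n, sum_fx, round(mean_age, 2))
```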

5.3.1.2 Median
The median is the middle number. It is the value which separates the
largest 50% of data values from the lowest 50%. It is denoted as 𝑥̃. To calculate
the median, arrange the data values in numerical order and then find the middle number. If
there is an odd number of values, the number in the middle will be the median.
If there is an even number of values, then the average of the two numbers in the
middle will be the median.

Example 3. Odd number of values:
Find the median of the following set of data.

35 47 36 24 55 32 29 57 32
Solution.
Steps and results:
1. Arrange the observations in ascending order: 24 29 32 32 35 36 47 55 57
2. Since the number of values is odd, find the number of observations plus 1, that is, $n + 1$.
There are 9 values. Here, $n + 1 = 9 + 1 = 10$.
3. Divide $n + 1$ by 2. The resulting number, $\dfrac{n+1}{2}$, tells us the place of
the median in the ordered array: $\dfrac{n+1}{2} = \dfrac{10}{2} = 5$.
4. The $\left(\dfrac{n+1}{2}\right)$th value is the median of the set of data. In this case,
the 5th value, which is 35, is the median.

Example 4. Even number of values.


Find the median of the following set of data:
35 47 36 24 55 32 29 57 32 40
Solution.
Steps and results:
1. Arrange the observations in ascending order: 24 29 32 32 35 36 40 47 55 57
2. Since the number of values is even, find half the number of observations:
$\dfrac{n}{2} = \dfrac{10}{2} = 5$.
3. Identify the $\left(\dfrac{n}{2}\right)$th observation and the $\left(\dfrac{n}{2} + 1\right)$th
observation. In this case, we identify the 5th observation, which is 35, and the 6th
observation, which is 36.
4. Find the mean of the $\left(\dfrac{n}{2}\right)$th observation and the
$\left(\dfrac{n}{2} + 1\right)$th observation. The resulting number is the median of the set
of data: $\tilde{x} = \dfrac{35 + 36}{2} = \dfrac{71}{2} = 35.5$.
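
The odd and even cases of Examples 3 and 4 can be combined in one small function. The sketch below is illustrative only; the function name `median_of` is an assumption made here.

```python
def median_of(values):
    """Median following the module's steps for odd and even n."""
    ordered = sorted(values)          # arrange in ascending order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # odd n: the (n + 1) / 2-th value, which is index n // 2 when counting from 0
        return ordered[mid]
    # even n: mean of the (n / 2)-th and (n / 2 + 1)-th values
    return (ordered[mid - 1] + ordered[mid]) / 2


example_3 = [35, 47, 36, 24, 55, 32, 29, 57, 32]
example_4 = [35, 47, 36, 24, 55, 32, 29, 57, 32, 40]
print(median_of(example_3), median_of(example_4))
```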

Example 5. Ungrouped data in frequency distribution.
Find the median age in the given frequency distribution

Age (𝑥) 𝑓
16 5
17 10
18 12
19 8
Solution.
Steps and results:
1. Find the total frequency $n$. Make sure that the entries in the first column are in
order. Here, $n = 5 + 10 + 12 + 8 = 35$.
2. Obtain the column for the cumulative frequency $cf$. To do this, copy the
first entry in $f$. In this case, it is 5. After this, add the first entry to the
second entry in $f$: $5 + 10 = 15$. Repeating the process, we will have
$15 + 12 = 27$ as the third entry. The last entry in $cf$ must equal $n$.

Age ($x$)   $f$        $cf$
16          5          5
17          10         15
18          12         27
19          8          35
Total       $n = 35$

3. Since $n$ is odd, compute $\dfrac{n+1}{2}$. In this case, we have
$\dfrac{n+1}{2} = \dfrac{35+1}{2} = \dfrac{36}{2} = 18$.
4. Locate $\dfrac{n+1}{2}$ in $cf$. The 18th observation belongs to the range
16 to 27 (the 16th to the 27th observations), as indicated by the $cf$ of 27.
5. Find the $\left(\dfrac{n+1}{2}\right)$th observation in the first column. The row with
$cf = 27$ corresponds to age 18, so the median age is 18.
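
The cumulative-frequency procedure of Example 5 can be sketched as follows. The helper `value_at` is a hypothetical name introduced here; it returns the value in whose cumulative-frequency range a given position falls. The even-n branch applies the rule from Example 4 and is not exercised by this data.

```python
from itertools import accumulate

# Frequency table from Example 5 (ages already in ascending order)
ages = [16, 17, 18, 19]
freqs = [5, 10, 12, 8]

cum_freqs = list(accumulate(freqs))   # running totals of f
n = cum_freqs[-1]                     # the last cf entry equals n


def value_at(position):
    """Value of the position-th observation (counting from 1)."""
    for x, cf in zip(ages, cum_freqs):
        if position <= cf:
            return x
    raise ValueError("position exceeds the total frequency")


if n % 2 == 1:
    median_age = value_at((n + 1) // 2)
else:
    median_age = (value_at(n // 2) + value_at(n // 2 + 1)) / 2

print(cum_freqs, median_age)
```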

5.3.1.3 Mode
The mode is the data value which appears most frequently in the set. A data
set may have one mode, more than one mode, or no mode at all. For example, in
the previous data:
35 47 36 24 55 32 29 57 32 40

The mode is 32, which appears two times.

The mode for ungrouped data in a frequency distribution is the value that has the highest frequency.
For example, in the data below, the mode is 18 years old.

Age (𝑥) 𝑓
16 5
17 10
18 12
19 8
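
A minimal Python sketch of finding the mode is given below. It assumes one particular convention, namely that a data set in which no value repeats is reported as having no mode; the function name `modes` is illustrative.

```python
from collections import Counter


def modes(values):
    """Return all modes in ascending order, or an empty list if no value repeats."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []   # assumption: no repeated value means no mode
    return sorted(v for v, c in counts.items() if c == highest)


raw_data = [35, 47, 36, 24, 55, 32, 29, 57, 32, 40]
# Expand the age frequency table into individual observations
age_data = [16] * 5 + [17] * 10 + [18] * 12 + [19] * 8

print(modes(raw_data), modes(age_data))
```

Returning a list rather than a single number keeps the one-mode, several-modes, and no-mode cases in the same form.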

5.3.1.4 Properties of Mean, Median, and Mode


1. The mean is the most commonly used measure of central tendency.

2. One drawback of the mean is that it is heavily influenced by a few very
high or very low data values. In these cases, it is more common to use the
median.

3. The mean is unique but cannot be found for categorical data or for
open-ended frequency distributions.

4. The median does not use all the values, so it is less affected than the mean
by a few extreme values.

5. The median is unique and can be found for open-ended frequency
distributions.

6. The mode has the advantage that it can be used with nominal data,
but it is not unique: there may be more than one mode or none at all.

Learning Activity 1

Direction. Tell whether each of the following statements describes the Mean, Median,
or Mode.

1. The most preferred descriptive measure in a skewed distribution
2. Will have the largest value in a negatively skewed distribution
3. Will have the largest value in a positively skewed distribution
4. The point above and below where half of the distribution of the data falls
5. Will have the same value in a bimodal distribution
6. The “center of gravity” of a distribution
7. Is equivalent to the 50th percentile of a distribution
8. The most popular score in a distribution
9. Influenced by the specific value of every observation
10. Is most appropriate to use when extreme scores are given

5.3.1.5 Shapes of Data Distributions

1. Symmetric. In this case, the data distribution has approximately the same
shape on either side of a central dividing line. The mean and median
(and mode if unimodal) are equal in a symmetric distribution. A
symmetric distribution that is also bell-shaped is often described as normal.

[Figure: histogram of a symmetric distribution; vertical axis labeled Frequency]

2. Left-Skewed. This type of distribution has a few data values that are
much lower than the majority of values in the set (the tail extends to the
left). Generally, the mean is less than the median (and mode) in a
left-skewed distribution.

[Figure: bar chart illustrating a left-skewed distribution]

3. Right-Skewed. This type of distribution has a few data values that are much
higher than the majority of values in the set (the tail extends to the right).
Generally, the mean is greater than the median (and mode) in a
right-skewed distribution.

[Figure: bar chart illustrating a right-skewed distribution]

4. Uniform. This type of distribution has all data values equally
represented.

[Figure: bar chart illustrating a uniform distribution across Freshmen, Sophomore, Junior, and Senior]

5.3.2 Measures of Dispersion
Dispersion or variation in a data set is the amount of difference between
data values. It tells if the numbers in the data are close together or spread far
apart.
In a data set with little variation, almost all data values would be close to
one another. The histogram of such a data set would be narrow and tall. An
example of this is the set of quiz scores below.
Quiz Scores: 3, 3, 4, 4, 4, 4, 4, 4, 5, 5, 5
In a data set with a great deal of variation, the data values would be spread
widely. The histogram of this data set would be low and wide. An example is
the set of data that follows.
Quiz Scores: 1, 3, 4, 5, 6, 6, 7, 8, 8, 9, 10

5.3.2.1 Common Measures of Dispersion


There are three common measures of dispersion: range, variance, and
standard deviation.
1. Range. It is the difference between the largest and smallest data values
in a data set.

𝑅𝑎𝑛𝑔𝑒 = 𝐻𝑖𝑔ℎ𝑒𝑠𝑡 𝑣𝑎𝑙𝑢𝑒 − 𝐿𝑜𝑤𝑒𝑠𝑡 𝑣𝑎𝑙𝑢𝑒
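
As a brief illustration, the range formula can be applied to the two sets of quiz scores given earlier in this section. The function and variable names are assumptions made for this sketch.

```python
def value_range(values):
    """Range = highest value - lowest value."""
    return max(values) - min(values)


# The two quiz-score sets used above to contrast little and great variation
narrow_scores = [3, 3, 4, 4, 4, 4, 4, 4, 5, 5, 5]
wide_scores = [1, 3, 4, 5, 6, 6, 7, 8, 8, 9, 10]

print(value_range(narrow_scores), value_range(wide_scores))
```

The set with little variation has the smaller range, which matches the narrow histogram described for it.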

2. Variance. It is the average of the squared deviations from the mean of a set
of data. It is calculated using one of two formulas, depending on whether the
data set being considered is a population or a sample.

Population variance: $\sigma^2 = \dfrac{\sum (x - \mu)^2}{N}$, where $x$ represents the
observations, $\mu$ is the population mean, and $N$ is the population size.

Sample variance: $s^2 = \dfrac{\sum (x - \bar{x})^2}{n - 1}$, where $x$ represents the
observations, $\bar{x}$ is the sample mean, and $n$ is the sample size.

To find the variance in a set of data, the process is as follows:

Procedure for Computing a Variance

1. Determine the mean of the observations.


2. For each observation, calculate the deviation (difference) between each
observation and the mean.
3. Calculate the square of each of the deviations and find the sum of these
squared deviations.
4. If the data is a population, then divide the sum by 𝑁. If the data is a sample,
then divide the sum by 𝑛 − 1.

3. Standard Deviation. It is the most commonly used measure of variation.
It measures the “average” distance of a data value from the mean of
the data set. It is also the square root of the variance.

Population standard deviation: $\sigma = \sqrt{\dfrac{\sum (x - \mu)^2}{N}}$

Sample standard deviation: $s = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}}$

To compute the standard deviation, we simply get the square root of
the variance.
Example 6.
What is the standard deviation in the given sample data?
4 5 5 6 7 8 8 9 9 9

Solution.
Steps and results:
1. Determine the mean of the observations.
$\bar{x} = \dfrac{\sum x}{n} = \dfrac{4 + 5 + 5 + 6 + 7 + 8 + 8 + 9 + 9 + 9}{10} = \dfrac{70}{10} = 7$
2. For each observation, calculate the deviation or difference between the
observation and the mean. Because this is sample data, we get $x - \bar{x}$.
3. Calculate the square of each of the deviations and find the sum of these
squared deviations. This means that we will get $(x - \bar{x})^2$ and $\sum (x - \bar{x})^2$.

$x$        $\bar{x}$   $x - \bar{x}$   $(x - \bar{x})^2$
4          7           −3              9
5          7           −2              4
5          7           −2              4
6          7           −1              1
7          7           0               0
8          7           1               1
8          7           1               1
9          7           2               4
9          7           2               4
9          7           2               4
$n = 10$                               $\sum (x - \bar{x})^2 = 32$

4. If the data is a population, then divide the sum by $N$. If the data is a sample,
then divide the sum by $n - 1$. Here the data is a sample:
$s^2 = \dfrac{\sum (x - \bar{x})^2}{n - 1} = \dfrac{32}{10 - 1} = \dfrac{32}{9} = 3.5556$
5. Find the square root of the variance to get the standard deviation:
$s = \sqrt{3.5556} = 1.8856$ or 1.89
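
The five steps of Example 6 translate directly into a short Python sketch. The variable names are illustrative; the sample formula (division by n − 1) is used because the data is a sample.

```python
from math import sqrt

# Sample data from Example 6
data = [4, 5, 5, 6, 7, 8, 8, 9, 9, 9]

n = len(data)
sample_mean = sum(data) / n                         # Step 1: mean
deviations = [x - sample_mean for x in data]        # Step 2: x - mean
sum_squared = sum(d ** 2 for d in deviations)       # Step 3: sum of squared deviations
sample_variance = sum_squared / (n - 1)             # Step 4: divide by n - 1 (sample)
sample_sd = sqrt(sample_variance)                   # Step 5: square root of the variance

print(sample_mean, sum_squared, round(sample_variance, 4), round(sample_sd, 4))
```

For population data, the only change would be dividing by the number of observations instead of n − 1 in Step 4.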

The coefficient of variation (CV) makes it easier to tell if a standard deviation
is large or small by comparing the standard deviation to the mean. It also allows
comparison of standard deviations that come from data sets with different
means.

For a population: $CV = \dfrac{\sigma}{\mu} \times 100\%$

For a sample: $CV = \dfrac{s}{\bar{x}} \times 100\%$
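
As an illustration, the sample formula can be applied to the data of Example 6, whose mean is 7 and whose sum of squared deviations is 32. The function name below is an assumption for this sketch.

```python
from math import sqrt


def coefficient_of_variation(sd, mean):
    """CV as a percentage: standard deviation relative to the mean."""
    return sd / mean * 100


# Values taken from Example 6: mean 7, sample variance 32 / 9
example_mean = 7
example_sd = sqrt(32 / 9)

cv = coefficient_of_variation(example_sd, example_mean)
print(round(cv, 2))
```

A CV of roughly 27% says that the standard deviation is about a quarter of the size of the mean for this sample.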

5.3.3 Measures of Position


Measures of position compare the location of a value in a data set in relation
to other values.
The standard score (or $z$-score) of a data value is the number of standard
deviations that the value lies above or below the mean. It is used to compare scores
from groups of data with different means and standard deviations.

For a population: $z = \dfrac{x - \mu}{\sigma}$

For a sample: $z = \dfrac{x - \bar{x}}{s}$

1. The 𝑧-score of a value is positive if the value is above the mean and
negative if it is below the mean. The mean itself always has a 𝑧-score
of 0.

2. A data value is considered to be unusual if it is more than two standard
deviations from the mean.

3. A data value is unusually high if it has a $z$-score larger than 2 and
unusually low if it has a $z$-score of less than −2.

Example 7.
Students were selected from two sections and their scores in a Statistics
examination were gathered. The following information was obtained:

Section          Sample mean   Sample standard deviation
First section    75            5.6
Second section   72            7

Linda, who is from the first section, got a score of 68 while her friend, Jessa,
who is in the second section, got a score of 60. Who has a higher standard score?
Solution.
Linda: $z_1 = \dfrac{x - \bar{x}_1}{s_1} = \dfrac{68 - 75}{5.6} = \dfrac{-7}{5.6} = -1.25$

Jessa: $z_2 = \dfrac{x - \bar{x}_2}{s_2} = \dfrac{60 - 72}{7} = \dfrac{-12}{7} = -1.71$

Since $-1.25 > -1.71$, we conclude that Linda has a higher standard score.
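
The comparison in Example 7 can be sketched in Python as follows. The function name `z_score` and the variable names are illustrative assumptions.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd


# Example 7: each score is compared with its own section's mean and standard deviation
z_linda = z_score(68, mean=75, sd=5.6)
z_jessa = z_score(60, mean=72, sd=7)

print(round(z_linda, 2), round(z_jessa, 2))
print("Linda" if z_linda > z_jessa else "Jessa", "has the higher standard score")
```

Both scores are below their section means, so both z-scores are negative; the one closer to zero is the higher standard score.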

5.3.3.1 Percentiles, Deciles, and Quartiles

Percentiles divide a data set into 100 parts. They can be found for any percent
from 1 to 99 and are denoted as $P_r$, where the subscript $r$ is the percentile rank,
which indicates the percent of the distribution that falls below the percentile.
For example, $P_{10}$ is the tenth percentile and is larger than 10% of the distribution.

Example 8. Using the data below, find 𝑃25, 𝑃60 and the percentile rank of 4.
2 6 3 4 2 1 2 0 1 3 6 3

Solution.
a) To find 𝑃25, we follow the steps given:

Steps and results:
1. Arrange the numbers in ascending order: 0 1 1 2 2 2 3 3 3 4 6 6
2. Find $C = nr$, where $n$ is the number of observations and $r$ is the
percentile rank which, in this example, is 25% = 0.25: $C = nr = 12(0.25) = 3$.
3. Since $C$ is a whole number, get the average of the 3rd and the 4th
numbers in the ordered list. This will be $P_{25}$: $P_{25} = \dfrac{1 + 2}{2} = \dfrac{3}{2} = 1.5$.

Thus, $P_{25} = 1.5$, which means that 25% of the observations are
less than 1.5.

b) To find $P_{60}$, we will follow a process similar to that of the previous item.

Steps and results:
1. Arrange the numbers in ascending order: 0 1 1 2 2 2 3 3 3 4 6 6
2. Find $C = nr$, where $n$ is the number of observations and $r$ is the
percentile rank, which is 60% = 0.60: $C = nr = 12(0.60) = 7.2$.
3. Since $C$ is not a whole number, we round up to 8. Locate the 8th
observation. This will be $P_{60}$: $P_{60} = 3$.

From here, we conclude that 60% of the observations are less than 3.

c) To find the percentile rank $r$ of 4, we use the formula given below. In the ordered list, 9 values are below 4.

$r = \dfrac{\text{number of values below the given value} + 0.5}{\text{total number of values}} (100\%) = \dfrac{9 + 0.5}{12} (100\%) \approx 79\%$
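
The procedure of Example 8 can be sketched in Python as shown below. The function names are assumptions. The percentile rank is computed as (number of values below + 0.5) divided by the total number of values, which reproduces the 9.5/12 used above. The percentile rank r is passed as a whole number of percent so that C = nr can be checked for a whole-number value without floating-point rounding.

```python
def percentile(values, r):
    """P_r for a whole-number percentile rank r from 1 to 99, following the module's steps."""
    ordered = sorted(values)
    n = len(ordered)
    whole, remainder = divmod(n * r, 100)   # C = n * r / 100, split into whole part and remainder
    if remainder == 0:
        # C is a whole number: average the C-th and (C + 1)-th values
        return (ordered[whole - 1] + ordered[whole]) / 2
    # C is not a whole number: round up and take that observation
    return ordered[whole]


def percentile_rank(values, x):
    """Percentile rank of x: (values below x + 0.5) / total, as a percentage."""
    below = sum(1 for v in values if v < x)
    return (below + 0.5) / len(values) * 100


data = [2, 6, 3, 4, 2, 1, 2, 0, 1, 3, 6, 3]
print(percentile(data, 25), percentile(data, 60), round(percentile_rank(data, 4)))
```

Other textbooks and software packages use slightly different percentile rules, so results from a calculator may not match this procedure exactly.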

Another measure of position is the decile. Deciles divide the data set into
tenths and can be found for 1 through 9. Deciles are denoted as $D_r$ with a
subscript $r$. For example, $D_3$ is the third decile and is the value that is larger than
three tenths of the other values.

DECILES
• divides ranked data into ten equal parts

10% 10% 10% 10% 10% 10% 10% 10% 10% 10%

D1 D2 D3 D4 D5 D6 D7 D8 D9

Quartiles divide a data set into fourths and can be found for 1 to 3. Q1 is
the first quartile and is the value that is larger than one fourth of the
observations in the distribution.

QUARTILES
• divides ranked scores into four equal parts

25% 25% 25% 25%

minimum   Q1   Q2 (median)   Q3   maximum

5.3.3.2 Exploratory Data Analysis


Exploratory data analysis is used to examine data to find out what can be
discovered about the data. Two methods used for exploratory data analysis
are the stem-and-leaf plot and the box plot.
A STEM-AND-LEAF PLOT uses the first digit (or digits) as the stem and the
last digit as the leaf to form groups or classes.
Example 9. A 100-item test was given to 25 statistics students. The results are
shown below:
55 32 20 22 43 14 17 48 24
31 21 22 35 23 36 23 18 25
13 28 12 29 13 18 19

Make a stem-and-leaf plot of the above data.

Solution.
Steps and results:
1. Arrange the data in ascending order:
12 13 13 14 17 18 18 19 20
21 22 22 23 23 24 25 28 29
31 32 35 36 43 48 55

2. Separate the data into classes using the first digit:
12 13 13 14 17 18 18 19
20 21 22 22 23 23 24 25 28 29
31 32 35 36
43 48
55

3. Use the first digit as the leading digit (or stem) and list all the last digits
in order as the trailing digits (or leaves):

Stem   Leaf
1      2 3 3 4 7 8 8 9
2      0 1 2 2 3 3 4 5 8 9
3      1 2 5 6
4      3 8
5      5

Interpretation:
The stem-and-leaf plot shows that the largest group of students (10 of the 25)
obtained scores from 20 to 29.
Example 10. Make a stem-and-leaf plot for the following numbers.
215 239 212 245 226 228 246 213 247 225
236 223 221 248 237 242 218 236 232 238

Solution.
Steps and results:
1. Arrange the data in ascending order:
212 213 215 218 221 223 225 226 228 232
236 236 237 238 239 242 245 246 247 248

2. Separate the data into classes using the first two digits:
212 213 215 218
221 223 225 226 228
232 236 236 237 238 239
242 245 246 247 248

3. Use the first two digits as the leading digits (or stem) and list all the last
digits in order as the trailing digits (or leaves):

Stem   Leaf
21     2 3 5 8
22     1 3 5 6 8
23     2 6 6 7 8 9
24     2 5 6 7 8

Interpretation:
The stem-and-leaf plot shows that the largest group of values (6 of the 20)
falls from 230 to 239.
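
The grouping steps used in Examples 9 and 10 can be sketched in Python. The sketch assumes whole-number data and takes the last digit as the leaf and all preceding digits as the stem; the function name is illustrative.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Group whole numbers by stem (all digits but the last) with leaves in ascending order."""
    groups = defaultdict(list)
    for v in sorted(values):            # Step 1: ascending order
        stem, leaf = divmod(v, 10)      # Steps 2-3: split off the last digit as the leaf
        groups[stem].append(leaf)
    return dict(groups)


example_9 = [55, 32, 20, 22, 43, 14, 17, 48, 24,
             31, 21, 22, 35, 23, 36, 23, 18, 25,
             13, 28, 12, 29, 13, 18, 19]
example_10 = [215, 239, 212, 245, 226, 228, 246, 213, 247, 225,
              236, 223, 221, 248, 237, 242, 218, 236, 232, 238]

for stem, leaves in stem_and_leaf(example_9).items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
```

Calling `stem_and_leaf(example_10)` in the same way gives the stems 21 to 24 of Example 10.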
A BOX-AND-WHISKER PLOT graphs five values of the set of data on a
number line. The five values are:
1. The lowest value in the set of data.
2. The lower hinge.
3. The median.
4. The upper hinge.
5. The highest value of the set of data.

A box is drawn from the lower hinge to the upper hinge and lines are drawn
from the box to the highest and lowest values. The lower hinge is the median of
all the values less than or equal to the median when the data set has an
odd number of values, or the median of all values less than the median when the
data set has an even number of values. The upper hinge is the median of all
values greater than or equal to the median when the data set has an odd number of
values, or the median of all values greater than the median when the data set
has an even number of values.
Example 11. Make a box-and-whisker plot of the following results of a 100-item
test given to 25 statistics students:
55 32 20 22 43 14 17 48 24
31 21 22 35 23 36 23 18 25
13 28 12 29 13 18 19

Solution.
Steps and results:
1. Arrange the data in ascending order:
12 13 13 14 17 18 18 19 20
21 22 22 23 23 24 25 28 29
31 32 35 36 43 48 55

2. Determine the five values:
• The lowest value in the data set is 12.
• The highest value in the data set is 55.
• The median is 23.
• The lower hinge is the midpoint of the numbers below the median, which is 18.
• The upper hinge is the midpoint of the numbers above the median, which is 31.5.

3. Set up the horizontal axis containing the values obtained in Step 2. In this
case, we start at 5 and end at 60 with an interval of 5.

4. Draw the boxplot.
a. Draw a vertical segment at the lowest value, 12, and at the highest value, 55.
b. Draw vertical lines at the median, 23, the lower hinge, 18, and the upper
hinge, 31.5, and form a box.
c. Draw horizontal segments connecting the box to the lowest and highest values.

Interpretation:
The box-and-whisker plot shows that the data are not symmetrical and that the
data are positively skewed since the whisker is longer on the right.
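
The five values of Example 11 can be computed with the Python sketch below. It assumes the convention used in the worked solution, where each hinge is the median of the values strictly below or above the median position. Including the median in each half, as in the definition given for an odd number of values, would give an upper hinge of 31 instead of 31.5 for this data. The function names are illustrative.

```python
def median_of(ordered):
    """Median of a list that is already in ascending order."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def five_values(values):
    """Lowest value, lower hinge, median, upper hinge, highest value."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]              # values before the median position
    upper_half = ordered[half + n % 2:]      # values after the median position
    return (ordered[0], median_of(lower_half), median_of(ordered),
            median_of(upper_half), ordered[-1])


scores = [55, 32, 20, 22, 43, 14, 17, 48, 24,
          31, 21, 22, 35, 23, 36, 23, 18, 25,
          13, 28, 12, 29, 13, 18, 19]
print(five_values(scores))
```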

5.4 Supplementary Learning Resources

• Descriptive statistics calculator
https://www.calculatorsoup.com/calculators/statistics/descriptivestatistics.php

• Stem and leaf plotter
https://www.calculatorsoup.com/calculators/statistics/stemleaf.php

• Boxplot Generator
https://www.desmos.com/calculator/h9icuu58wn

5.5 Flexible Teaching-Learning Modality
Remote (asynchronous)
• Module, exercises, problem sets, PowerPoint lessons

5.6 Assessment Task


Directions. Answer the following items. You may use online calculators and
solvers in answering.
1. For each of the following data sets determine the mean:

a. 72, 14, 8, 11, 57, 54, 31, 11, 67, 11, 19, 3, 66
b. 63, 9, 87, 16, 2, 96, 13, 67, 34
c. 1, 6, 8, 2, 7, 2, 9, 4, 8, 9, 8, 6
d. 12, 0, 4, 5, 8, 3, 6, 35, 47
e. 2, 7, 17, 33, 67, 73, 88, 33, 92, 57, 33

2. For each of the following data sets determine the median:

a. 48, 78, 10, 66, 45, 57, 96, 67, 40, 66, 63, 8, 20
b. 28, 3, 10, 60, 8, 23, 45, 97, 11, 10
c. 6, 9, 1, 6, 4, 8, 1, 7, 8, 3, 1, 0
d. 63, 9, 86, 16, 2, 97, 24, 67, 34, 40
e. 8, 4, 64, 99, 11, 42, 15, 88, 54, 77, 42

3. For each of the following data sets determine the mode:

a. 98, 37, 5, 33, 96, 67, 43, 33, 91, 33, 32, 8, 11
b. 104, 2, 51, 31, 8, 101, 104, 18, 47
c. 3, 4, 5, 3, 9, 8, 5, 7, 2, 5, 1
d. 19, 1, 9, 6, 4, 2, 13, 15, 24, 2
e. 8, 9, 39, 44, 55, 90, 19, 44, 28, 69, 44

4. A quiz on the classification of research by general methodology was
administered to a group of 34 students at the College of Arts and
Sciences. The scores are reported below:

Male Female
8 10 20 14 13 10 10 13 10
17 17 12 14 14 9 14 15 8 17
12 10 9 18 14 15 13 17
16 8 18 14
6 16 10

a. Consider all the members of the group and compute the mean, median
and mode.
b. Calculate the mean, median and mode for male and female students.
c. Compare the mean and median within each group. Which has the higher
value? Why?

For items 5-6, find the mean, median, mode, range, variance, standard
deviation, and the coefficient of variation.

5. Cost of notebooks for these sample prices:

7.95 9.98 5.58 4.99 10.75 6.25 7.63 8.50 8.88

6. Age of college students in ISU:

Ages   Number of students
16     2
17     10
18     8
19     5

7. Louie’s test scores for two semesters of mathematics are listed below. The
percentage of each semester’s grade represented by each score is also
given.
1st Semester 2nd Semester % of Grade
78 87 15
68 66 15
84 81 15
86 89 15
90 88 40

a. Compute the weighted arithmetic mean and standard deviation for
each semester.
b. Is Louie improving? Explain your answer.

8. The following set of data represents a simple random sample of IQ scores
of 32 students at ISU.

137 141 128 135 159 122 140 118


126 133 111 125 127 116 138 203
120 121 131 122 125 126 118 119
117 139 133 124 168 135 126 131

a. Calculate the range of the test scores.


b. Calculate the sample variance, and standard deviation.

9. Using the following, calculate the 𝑧-score that corresponds to the raw
score indicated.

𝑥̅ 𝑠 𝑥 𝑧-score
97 9.23 100
46 8.0 38
8 0.52 9
22 4.69 24
31 7.15 24
54 1.50 39
100 6.50 110
75 3.75 72

10. Using the following data, estimate the raw score that corresponds to the
𝑧-score indicated.

𝑥̅ 𝑠 𝑧-score 𝑥
28 5.2 −1.62
69 2.35 +2.58
7 0.86 +1.03
41 4.73 −2.37
72 1.05 +0.40
85 3.21 −3.20
150 9.61 −0.26
36 0.90 +3.50

MODULE 6
Data Management: Probabilities and Normal Distribution

6.1 Introduction
The normal curve, also known as the Gaussian curve or the normal
probability curve, is the most fundamental distribution curve in statistics. In this
section, we shall discuss the applications of the normal curve in statistics to
the performance of students in class or in their daily activities using standard
scores or $z$-scores.
6.2 Learning Outcomes
At the end of this section, you will be able to:
1. give the importance of a normal distribution;
2. differentiate between a normal distribution and a skewed distribution;
3. give the significance of the standard or 𝑧-score;
4. compute areas under the normal curve; and
5. solve problems involving the normal distribution.

6.3 What You Need to Know


A distribution of a continuous variable in which the mean, median, and mode are all
equal is called a normal distribution. Its graph is called a normal curve. The
normal distribution is often referred to as the Gaussian distribution in honor of Carl
Friedrich Gauss.
The graph of a normal distribution is symmetrical and approximates the bell
shape (see Figure 1). The area under the normal curve is equal to 1 (or 100%).
Since the mean is the same as the mode, the highest point on the bell
corresponds to the mean. Since the median is the same as the mean, 50% of
the values are below the mean and 50% are above the mean.

Figure 1. A normal curve

The normal distribution is used to find probabilities by finding the area
under the curve. The area under the graph from the mean to any given 𝑧-score
can be determined using Table 1.

The area under the curve is the same as the probability that a value will be
between the mean and the given number.

What is Probability?

By a probability, we mean the likelihood of occurrence of a particular
situation, which is described by a number between 0 and 1, inclusive. We
may think of this as a percentage between 0% and 100%, inclusive.

A situation that is not very likely to occur has a probability close to 0,
while a situation that is very likely to occur has a probability close to 1.

For instance, the probability of being struck by lightning is close to
0. However, if we randomly choose a freshman student from Isabela State
University, it is very likely that the student is under 20 years old, so the
probability is close to 1.

Because any situation has from a 0% to a 100% chance of occurring, probabilities
are always between 0 and 1, inclusive. If a situation is sure to occur, its
probability is 1. If it cannot occur, its probability is 0.

Figure 2. The Standard Normal Model

A standard normal model is a normal distribution with a mean of 0 and a
standard deviation of 1. It has some distinct properties.

6.3.1 Properties of a normal distribution

1. The mean, mode, and median are all equal.
2. The curve is symmetric about the center (i.e., around the mean, 𝜇).
3. Exactly half of the values are to the left of the center and exactly half of
the values are to the right.
4. The total area under the curve is 1.

Because of these properties, the following are observed. Together, these
figures are known as the empirical rule.

1. Approximately 68% of the data values will fall within 1 standard deviation
of the mean.
2. Approximately 95% of the data values will fall within 2 standard deviations
of the mean.
3. Approximately 99.7% of the data values will fall within 3 standard
deviations of the mean.
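
The three percentages can be checked numerically. The following is a minimal sketch, assuming Python's standard-library `statistics.NormalDist`; the helper name `area_within` is only illustrative and is not part of the module.

```python
from statistics import NormalDist

# Standard normal model: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)


def area_within(k):
    """Return the area under the standard normal curve between -k and +k."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


# Print the share of values within 1, 2, and 3 standard deviations of the mean.
for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {area_within(k):.1%}")
```

The printed values round to the familiar 68%, 95%, and 99.7%.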

The standard deviation controls the spread of the distribution. A smaller
standard deviation indicates that the data is tightly clustered around the mean;
the normal distribution will be taller. A larger standard deviation indicates that
the data is spread out around the mean; the normal distribution will be flatter
and wider.
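
To see the effect of the standard deviation on the height of the curve, the sketch below evaluates the normal curve at its mean for several standard deviations. It assumes Python's standard-library `statistics.NormalDist`; the mean of 50 and the listed standard deviations are hypothetical values chosen only for illustration.

```python
from statistics import NormalDist

MEAN = 50  # hypothetical mean, used only for illustration


def peak_height(sigma):
    """Height of the normal curve at its mean for a given standard deviation."""
    return NormalDist(mu=MEAN, sigma=sigma).pdf(MEAN)


# A smaller standard deviation gives a taller, narrower curve.
for sigma in (2, 5, 10):
    print(f"sigma = {sigma:>2}: peak height = {peak_height(sigma):.4f}")
```

Because the total area under every normal curve is 1, a curve that is taller at the center must also be narrower.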

6.3.2 Standard Normal Model: Distribution of Data

One way of figuring out how data are distributed is to plot them in a graph.
If the data are normally distributed, you will come up with a bell curve. A bell
curve has a small percentage of the points on both tails and the bigger
percentage on the inner part of the curve. In the standard normal model, about
5 percent of your data would fall into the “tails” (colored darker orange in
Figure 2) and 95 percent will be in between. For example, for test scores of
students, the normal distribution would show 2.5 percent of students getting
very low scores and 2.5 percent getting very high scores. The rest will be in
the middle; not too high or too low. The shape of the standard normal
distribution looks like this:

The standard normal distribution could help you figure out which subject
you are getting good grades in and which subjects you have to exert more effort
into due to low scoring percentages. Once you get a score in one subject that is
higher than your score in another subject, you might think that you are better
in the subject where you got the higher score. This is not always true.

You can only say that you are better in a particular subject if your score there
lies more standard deviations above the mean than your score in the other
subject. The standard deviation tells you how tightly the data are clustered
around the mean; it allows you to compare distributions that have different
types of data, including different means.

For example, if you get a score of 90 in Math and 95 in English, you might
think that you are better in English than in Math. However, in Math, your score
is 2 standard deviations above the mean. In English, it is only 1 standard
deviation above the mean. This tells you that in Math, your score is far higher
than those of most of the students (your score falls into the tail).
Based on this data, you actually performed better in Math than in English!
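
The comparison can be sketched in code. The text does not give the class means or standard deviations, so the values below (Math: mean 80, standard deviation 5; English: mean 90, standard deviation 5) are hypothetical numbers chosen only so that the scores land 2 and 1 standard deviations above the mean, as in the example.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd


# Hypothetical class statistics, assumed for illustration only.
math_z = z_score(90, mean=80, sd=5)
english_z = z_score(95, mean=90, sd=5)

print(f"Math z-score:    {math_z:.1f}")
print(f"English z-score: {english_z:.1f}")
print("Better relative performance:", "Math" if math_z > english_z else "English")
```

The raw English score is higher, but the Math score is farther above its own class mean.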

The key to solving questions involving the normal curve is understanding
what the area under a standard normal curve represents. The total area under
a standard normal distribution curve is 100% (which is “1” as a decimal). For
example, the left half of the curve is 50%, or 0.5. So the probability of a random
variable appearing in the left half of the curve is 0.5.

Since not all problems are this simple, a 𝑧-table has been prepared. A 𝑧-table
lists probabilities (areas under the curve) for 𝑧-values, which are expressed in
standard deviations from the mean. The mean is at the center of the standard
normal distribution, so a probability of 50% corresponds to zero standard
deviations from the mean (𝑧 = 0).

There are different types of 𝑧-tables. It is important to read and check the
information given before we proceed to finding probabilities. The table which we
will use gives the probabilities to the left of a given 𝑧-value. We also take note
that since the total area under the normal curve is 1, the probability values are
also the areas to the left of a given 𝑧-value.

For instance, if 𝑧 = 1.65, then we go to the row for 𝑧 = 1.6 in the table. Then
we move to the right until we reach the column for 0.05. Thus, the area to the
left of 𝑧 = 1.65 is 0.9505.
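
For readers curious about where the table entries come from, the area to the left of 𝑧 can be computed from the error function with the standard identity Φ(𝑧) = ½[1 + erf(𝑧/√2)]. The sketch below uses Python's `math.erf`; the function name `phi` is only illustrative.

```python
import math


def phi(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Row 1.6, column 0.05 of a left-tail z-table corresponds to z = 1.65.
print(round(phi(1.65), 4))
```

The rounded result agrees with the table entry 0.9505.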

Source: https://www.math.arizona.edu/~jwatkins/normal-table.pdf

We will give more illustrations on finding probabilities using the 𝒛-table. This
time, we follow the steps given.

1. Area below 𝑧.

Question: What is the probability at 𝑧 ≤ 1.65?

Step 1. In this case, we will get the area to the left of 𝑧 = 1.65 and denote
this as 𝑃(𝑧 ≤ 1.65). It will help if we draw a curve and shade the area we want
to get. This part is important because it will give us an idea of what the
final answer will be. Here, we know that the probability is greater than 0.5.

Step 2. We refer to the table for the next step. As given in the previous
example, we locate 1.6 and move to the right until we reach the value that
corresponds to the column of 0.05.

Thus, 𝑃(𝑧 ≤ 1.65) = 0.9505.

2. Area above 𝑧.

Question: What is the area at 𝑧 ≥ 1.65?

Step 1. For this one, we will get the area to the right of 𝑧 = 1.65 and denote
this as 𝑃(𝑧 ≥ 1.65). This time we know that the probability we should get must
be lower than 0.5.

Step 2. We locate the value in the table similar to what we have done in the
first case.

Step 3. Since the table gives the area to the left and we need the area to the
right, we subtract the area to the left from 1 to get the area to the right of
𝑧 = 1.65.

Thus, 𝑃(𝑧 ≥ 1.65) = 1 − 0.9505 = 0.0495.

3. Area between two 𝑧-values.

Question: What is the area at −0.78 ≤ 𝑧 ≤ 1.65?

Step 1. We draw the area on the normal curve and we see that the area is
between the values 𝑧 = −0.78 and 𝑧 = 1.65. We denote the probability as
𝑃(−0.78 ≤ 𝑧 ≤ 1.65).

Step 2. We locate the values in the table similar to what we have done in the
first two cases. For 𝑧 = −0.78, we locate −0.7 and move to the right until we
reach the column of 0.08, which gives 0.2177.

Step 3. To get 𝑃(−0.78 ≤ 𝑧 ≤ 1.65), we get the difference between the values we
obtained at 𝑧 = −0.78 and 𝑧 = 1.65.

Thus, 𝑃(−0.78 ≤ 𝑧 ≤ 1.65) = 0.9505 − 0.2177 = 0.7328.
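
The three cases (area below, area above, and area between) can be reproduced without a printed table. This is a minimal sketch that assumes Python's standard-library `statistics.NormalDist` in place of the 𝑧-table; the variable names are illustrative.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1

# Case 1: area below z = 1.65 (read directly, like the left-tail table).
below = Z.cdf(1.65)

# Case 2: area above z = 1.65 (total area 1 minus the area to the left).
above = 1 - Z.cdf(1.65)

# Case 3: area between z = -0.78 and z = 1.65 (difference of two left areas).
between = Z.cdf(1.65) - Z.cdf(-0.78)

print(f"P(z <= 1.65)          = {below:.4f}")
print(f"P(z >= 1.65)          = {above:.4f}")
print(f"P(-0.78 <= z <= 1.65) = {between:.4f}")
```

Each case uses only left-tail areas, which mirrors how the table in this section is read.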

Learning Activity 1

Direction. Find the following probabilities.

1. 𝑃(𝑧 ≤ −1.73)
2. 𝑃(𝑧 ≥ −0.67)
3. 𝑃(−1.73 ≤ 𝑧 ≤ −0.67)

6.3.3 Applications of the Normal Distribution

How do you know that a word problem involves normal distribution? Look
for the key phrase “assume the variable is normally distributed” or “assume the
variable is approximately normal.”

Example 1. The mean time to complete a certain psychology examination is 34
minutes with a standard deviation of 8 minutes. If the time to
complete the examination is approximately normally distributed, what is the
probability that a student will complete the examination
(a) in less than 28 minutes?
(b) in more than 45 minutes?
(c) between 28 and 45 minutes?

Solution.

(a)
Step 1. List the given mean and standard deviation.
    𝜇 = 34 minutes, 𝜎 = 8 minutes

Step 2. Compute the 𝑧-score of 𝑥 = 28 minutes.
    \[ z = \frac{x - \mu}{\sigma} = \frac{28 - 34}{8} = -0.75 \]

Step 3. Find the probability 𝑃(𝑧 ≤ −0.75). Since we want the time to be less
than 28 minutes, we need the area to the left of 𝑧 = −0.75, which the table
gives directly.
    𝑃(𝑧 ≤ −0.75) = 0.2266

The probability that a student will complete the
examination in less than 28 minutes is 0.2266.

(b)
Step 1. List the given mean and standard deviation.
    𝜇 = 34 minutes, 𝜎 = 8 minutes

Step 2. Compute the 𝑧-score of 𝑥 = 45 minutes.
    \[ z = \frac{x - \mu}{\sigma} = \frac{45 - 34}{8} \approx 1.38 \]

Step 3. Find the probability 𝑃(𝑧 ≥ 1.38).
    𝑃(𝑧 ≥ 1.38) = 1 − 0.9162 = 0.0838

The probability that a student will complete the
examination in more than 45 minutes is 0.0838.

(c)
Step 1. List the given mean and standard deviation.
    𝜇 = 34 minutes, 𝜎 = 8 minutes

Step 2. Compute the 𝑧-scores of 𝑥 = 28 minutes and 𝑥 = 45 minutes.
    \[ z = \frac{x - \mu}{\sigma} = \frac{28 - 34}{8} = -0.75 \]
    \[ z = \frac{x - \mu}{\sigma} = \frac{45 - 34}{8} \approx 1.38 \]

Step 3. Find the probability 𝑃(−0.75 ≤ 𝑧 ≤ 1.38).
    𝑃(−0.75 ≤ 𝑧 ≤ 1.38) = 0.9162 − 0.2266 = 0.6896

The probability that a student will complete the
examination between 28 and 45 minutes is 0.6896.
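
The same three-step process (list the mean and standard deviation, compute the 𝑧-score, then find the area) can be sketched in code for the limits of 28 and 45 minutes used in the solution. This is only an illustration that assumes Python's standard-library `statistics.NormalDist` instead of the printed table; the 𝑧-scores are rounded to two decimals to mimic a table lookup, and the helper name `z_score` is hypothetical.

```python
from statistics import NormalDist

MU, SIGMA = 34, 8   # mean and standard deviation, in minutes
Z = NormalDist()    # standard normal model


def z_score(x):
    """z-score of x, rounded to two decimals as when reading a z-table."""
    return round((x - MU) / SIGMA, 2)


z_28 = z_score(28)
z_45 = z_score(45)

p_less_than_28 = Z.cdf(z_28)             # (a) area to the left
p_more_than_45 = 1 - Z.cdf(z_45)         # (b) area to the right
p_between = Z.cdf(z_45) - Z.cdf(z_28)    # (c) area between

print(f"(a) {p_less_than_28:.4f}")
print(f"(b) {p_more_than_45:.4f}")
print(f"(c) {p_between:.4f}")
```

Without the rounding of 𝑧, the answers to (b) and (c) change slightly in the third decimal place, which is the usual price of using a two-decimal table.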

Example 2.

The time to complete a mathematics exam is approximately normally
distributed with a mean of 30 minutes and a standard deviation of 7 minutes. If
100 students take the examination, how many should finish in less than 25
minutes?

Solution.

Step 1. List the given mean, standard deviation, and the number of students.
    𝜇 = 30 minutes, 𝜎 = 7 minutes, number of students = 100

Step 2. Compute the 𝑧-score of 𝑥 = 25 minutes.
    \[ z = \frac{x - \mu}{\sigma} = \frac{25 - 30}{7} \approx -0.71 \]

Step 3. Find the probability 𝑃(𝑧 ≤ −0.71). Using the 𝑧-table, we obtain
    𝑃(𝑧 ≤ −0.71) = 0.2389
The probability that a student will complete the
examination in less than 25 minutes is 0.2389.

Step 4. Get the number of students who complete the examination in less
than 25 minutes by multiplying the number of students by the obtained
probability.
    100 × 0.2389 = 23.89 ≈ 24 students

Thus, about 24 students are expected to finish in less than 25 minutes.
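
The steps of Example 2 can be sketched as follows. The code assumes Python's standard-library `statistics.NormalDist` in place of the 𝑧-table and rounds the 𝑧-score to two decimals to mimic a table lookup; the variable names are illustrative.

```python
from statistics import NormalDist

mu, sigma = 30, 7        # mean and standard deviation, in minutes
n_students = 100

# Step 2: z-score of x = 25 minutes, rounded as for a table lookup.
z = round((25 - mu) / sigma, 2)

# Step 3: area to the left of z under the standard normal curve.
probability = NormalDist().cdf(z)

# Step 4: expected number of students = number of students x probability.
expected_students = round(n_students * probability)

print(f"z = {z}, probability = {probability:.4f}")
print(f"expected number of students: {expected_students}")
```

The result is an expected count, so the actual number in any one class of 100 may differ.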

Example 3.

A company gives an employment test to all applicants for a job. The results of
the test are normally distributed with a mean score of 124 and a standard
deviation of 16. If only the top 75% of the applicants are to be interviewed, what
score must an applicant have to be interviewed?

Solution.

Step 1. List the given information.
    𝜇 = 124, 𝜎 = 16
    Top 75% of the applicants will be interviewed.

Step 2. Draw the area under the normal curve indicating the top 75% = 0.7500.
In this case, we shade 75% from the right.

Step 3. Since the shaded region is from the right, we consider the area
1 − 0.7500 = 0.2500, which is the area to the left of the 𝑧-value we want to
obtain. In this case, we look in the table for the area equal to or closest to
0.2500.
    The area closest to 0.2500 is 0.2514. The 𝑧-value that corresponds to this
    area is −0.67.

Step 4. Convert the 𝑧-value obtained to a score by deriving 𝑥 from the formula
    \[ z = \frac{x - \mu}{\sigma} \]
We have
    \[ x = \sigma z + \mu = -0.67(16) + 124 = 113.28 \approx 113 \]

From the result, applicants with a score of at least 113 will be interviewed.
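
Example 3 reads the table in reverse: an area is given and the 𝑧-value is wanted. A minimal sketch of that reverse lookup is shown below, assuming Python's standard-library `statistics.NormalDist` and its `inv_cdf` method in place of searching the table; the variable names are illustrative.

```python
from statistics import NormalDist

mu, sigma = 124, 16      # mean and standard deviation of the test scores
top_fraction = 0.75      # top 75% of applicants are interviewed

# The area to the LEFT of the cutoff is 1 - 0.75 = 0.25.
area_left = 1 - top_fraction

# Reverse lookup: find the z-value whose left-tail area is 0.25.
z = NormalDist().inv_cdf(area_left)

# Convert the z-value back to a test score: x = sigma * z + mu.
cutoff = sigma * z + mu

print(f"z = {z:.4f}")
print(f"cutoff score = {cutoff:.2f}")
```

The computed 𝑧 is slightly different from the table value −0.67 because the table only lists areas to four decimals, but the cutoff still rounds to 113.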

Learning Activity 2

Direction. Solve the following problem.

The heights of 1000 students are normally distributed with a mean of 174.5
centimeters and a standard deviation of 6.9 centimeters. Assuming that the
heights are recorded to the nearest half centimeter, how many of these
students would you expect to have heights

a. less than 160.0 centimeters?
b. between 171.5 and 182.0 centimeters inclusive?
c. equal to 175.0 centimeters?
d. greater than or equal to 188.0 centimeters?

6.4 Supplementary Learning Resources

• Normal Curve Generator
http://onlinestatbook.com/2/calculators/normal_dist.html

6.5 Flexible Teaching-Learning Modality


Remote (asynchronous)
• Module, exercises, problem sets, PowerPoint lessons

6.6 Assessment Task


Direction. Solve the following problems.

1. Find the value of z if the area under a standard normal curve

a. to the right of z is 0.3622;


b. to the left of z is 0.1131;
c. between 0 and 𝑧, with 𝑧 > 0, is 0.4838;
d. between −𝑧 and 𝑧, with 𝑧 > 0, is 0.9500.

2. Given a normal distribution with 𝜇 = 30 and 𝜎 = 6, find

a. the normal curve area to the right of 𝑥 = 17;


b. the normal curve area to the left of 𝑥 = 22;
c. the normal curve area between 𝑥 = 32 and 𝑥 = 41;
d. the value of 𝑥 that has 80% of the normal curve area to the left;
e. the two values of 𝑥 that contain the middle 75% of the normal curve
area.

3. A research scientist reports that mice will live an average of 40 months


when their diets are sharply restricted and then enriched with vitamins
and proteins. Assuming that the lifetimes of such mice are normally
distributed with a standard deviation of 6.3 months, find the probability
that a given mouse will live

a. more than 32 months;


b. less than 28 months;
c. between 37 and 49 months.

4. A soft-drink machine is regulated so that it discharges an average of 200


milliliters per cup. If the amount of drink is normally distributed with a
standard deviation equal to 15 milliliters,

a. what fraction of the cups will contain more than 224 milliliters?
b. what is the probability that a cup contains between 191 and 209
milliliters?
c. how many cups will probably overflow if 230-milliliter cups are used
for the next 1000 drinks?
d. below what value do we get the smallest 25% of the drinks?

MODULE 7
Data Management: Regression and Correlation

7.1 Introduction
In our daily activities, it is often necessary to establish the relationship
between variables before a decision is made. For example, the school registrar
must predict the enrollment before preparing the class schedules. One must know
the sequence of the courses to be offered before a feasible flow chart can be
prepared. In this section, we will discuss correlation analysis, a commonly used
measure of association that shows the linear relationship between two variables.
The term “relationship” means that changes in two variables are associated with
each other. This relationship can be direct or inverse. Correlation is used to
determine whether there is a relationship between two variables and to determine
the strength of that relationship.
Correlation and linear regression can help us deal with the relationship
between two or more continuous variables. We shall study the dependence
of one variable, the dependent variable, on another, the independent variable.
7.2 Learning Outcomes
After finishing this module, you are expected to:
1. explain the purpose of correlation coefficients;
2. choose the appropriate correlation coefficients to show the relationship
between two variables;
3. compute the coefficients of correlation and determination;
4. calculate the average correlation between two variables across several
groups of people;
5. define linear regression;
6. give the purpose of linear regression;
7. define least-squares regression line and the assumptions underlying
the test of significance; and
8. use methods of linear regression and correlation to predict the value of
a variable given certain conditions.
7.3 What You Need to Know
7.3.1 What is the purpose of correlation analysis?
In correlation analysis, the purpose is to measure the strength or closeness
of the relationship between the variables. In other words, we would like to know
how strong or weak the relationship between the variables is. The fact that two
variables are associated in a statistical sense does not guarantee the existence
of a causal relationship. In reverse, however, the existence of a causal
relationship usually does imply correlation. The magnitude of association is
measured by the absolute value of 𝑟, which can range from 0.00 to 1.00; the
greater the absolute value of 𝑟, the stronger the relationship between the two
variables.
The two types of variables involved in a relationship are the independent
variable (𝑋) and the dependent variable (𝑌). In correlation analysis, the
𝑋-variable is the predictor and the 𝑌-variable is the criterion variable.
A correlation is a relationship between two statistical variables measured
from the same population. In this module, we will only consider linear
correlation which comes in three types: positive linear correlation, negative
linear correlation and zero linear correlation.
A Positive Linear Correlation indicates that high values for one variable
tend to correspond to high values for the second variable or, simply, if one
value increases, so does the other. For example, consider height vs. weight for
adults (for a normal individual, as the height increases, the weight also
increases).
A Negative Linear Correlation indicates that high values for one variable tend
to correspond to low values for the second variable; that is, as one variable
increases, the other decreases. For instance, consider the age of a vehicle
and its resale price (as the vehicle gets older, the resale price becomes lower).
A Zero Linear Correlation means that no linear relationship exists
between the variables. For example, consider height and number of years of
education (the height of a person in no way has a bearing on the number of years
he or she has been in school).
7.3.1.1 Simple Correlation
In simple correlation, only two variables are studied at once. The two
variables are the independent and dependent variables. The independent
variable, (𝑋), is the variable that can be controlled or picked. The dependent
variable, (𝑌), is the variable that you assume to be dependent on the other
variable. The independent variable is used to predict the dependent variable if
there is a correlation between the two variables.
One way to determine the type of linear correlation between two variables is
by means of a scatter plot. The scatter plot is a graph with the independent
variable at the bottom (or along the 𝑥-axis) and the dependent variable along
the side (𝑦-axis). For each pair of numbers, we plot a point, but the points are
not connected with a line.
The scatter plot shows if there is a linear correlation between two variables.
We can then determine the type of linear correlation as follows:
1. Positive Linear Correlation - general trend in the plotted points is from
bottom left to top right.

2. Negative Linear Correlation - general trend in the plotted points is from
top left to bottom right.
3. No Linear Correlation - No general trend in plotted points, or a non-linear
trend.

The strength of the linear correlation can be judged by looking at how closely
the points approximate a straight line.
Example 1
The following table shows the Height (X) vs. Weight (Y) measurements (both in
inches) for 10 men:

x 70.8 66.2 71.7 68.7 67.6 69.2 66.5 67.2 68.3 65.6
y 42.5 40.2 44.4 42.8 40.0 47.3 43.4 40.1 42.1 36.0

Interpretation: The scatter plot below, produced in Excel, shows a
positive linear correlation between the variables.

Example 2.
The following table gives the resale value (𝑦, in thousand pesos) by year (𝑥)
of a car bought in 1970 at Php200,000.00.
x (year)      1970 1973 1976 1979 1982 1985 1988 1991 1994 1997
y (Php '000)  200  150  145  135  120  100  79   65   54   35

Interpretation: The diagram indicates a negative linear correlation between the
variables.

Example 3.
Below are the scores in an examination. Make a scatter plot and interpret
the data.

Test scores
Mid-Term   Final
73         70
86         80
93         96
92         85
72         68
65         68
58         62
75         78

[Figure: scatter plot of Final Term Score (vertical axis) against Mid-Term Score (horizontal axis)]

Interpretation: There is a fairly strong positive correlation between scores in
the mid-term examination and the final examination.

7.3.1.2 Coefficient of Correlation
A more precise method of determining the type and strength of a linear
correlation is to calculate the coefficient of linear correlation 𝑟, also known as
Pearson Product-Moment Correlation Coefficient, for the two variables using the
formula:

\[ r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}} \]
The coefficient of linear correlation will always be a number between −1.00
and 1.00, with a positive value indicating a positive correlation and a negative
value a negative correlation. A coefficient of 𝑟 = 1.00 for a data set indicates
perfect positive linear correlation, and 𝑟 = −1.00 indicates perfect negative linear
correlation, while 𝑟 = 0 would indicate no linear correlation. The closer the value
of r is to ±1, the stronger the correlation, and the closer to zero, the weaker the
correlation.

Example 4.
Scores of students in the Midterm and Final Examinations were gathered.
The teacher wants to find the strength of linear relationship between the Midterm
scores and the Final Term scores. What is the coefficient of linear correlation?

Midterm score (𝑥)    Final Term score (𝑦)
73 70
86 80
93 96
92 85
72 68
65 68
58 62
75 78

Solution.
The scatter plot in Example 3 suggests that a positive correlation exists
between Midterm and Final term scores.

To verify, we solve for the coefficient of correlation.

Step 1. Compute 𝑥², 𝑦², and 𝑥𝑦 and the column totals.

𝑥      𝑦      𝑥²      𝑦²      𝑥𝑦
73     70     5329    4900    5110
86     80     7396    6400    6880
93     96     8649    9216    8928
92     85     8464    7225    7820
72     68     5184    4624    4896
65     68     4225    4624    4420
58     62     3364    3844    3596
75     78     5625    6084    5850
∑𝑥 = 614   ∑𝑦 = 607   ∑𝑥² = 48236   ∑𝑦² = 46917   ∑𝑥𝑦 = 47500

Step 2. Solve for 𝑟 using the formula.

\[ r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}} \]

\[ r = \frac{8(47500) - 614(607)}{\sqrt{8(48236) - (614)^2}\,\sqrt{8(46917) - (607)^2}} = \frac{7302}{\sqrt{8892}\,\sqrt{6887}} \approx 0.933 \]

From the result, we know that the Midterm score and the Final
term score have a strong positive linear correlation.
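
The computation in Example 4 can be sketched in code using the same sums that appear in the formula for 𝑟. The function name `pearson_r` is only illustrative, and the last line also squares 𝑟 to give the coefficient of determination mentioned in the learning outcomes.

```python
import math


def pearson_r(xs, ys):
    """Coefficient of linear correlation from the computational formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator


# Midterm (x) and Final Term (y) scores from Example 4.
midterm = [73, 86, 93, 92, 72, 65, 58, 75]
final = [70, 80, 96, 85, 68, 68, 62, 78]

r = pearson_r(midterm, final)
print(f"r   = {r:.3f}")
print(f"r^2 = {r ** 2:.3f}")  # coefficient of determination
```

Keeping the sums as separate variables makes it easy to compare each one with the column totals in the table.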

Learning Activity 1

Direction. Write T if the statement is true and F if it is not.

1. A high positive correlation indicates that variable 𝑋 causes a
predictable change in variable 𝑌.
2. The Pearson r can only be used with data measured in ordinal
scale.
3. Two variables with a correlation of −0.75 have a weaker association
than two variables with correlation of 0.56.
4. An inverse correlation indicates that the two variables tend to
change either in the direct or in the opposite directions.
5. If two variables have a low correlation, such as 𝑟 = +0.04, the two
variables cannot have a strong relationship to one another.
6. Direct relationships in a set of data are always stronger than
indirect or inverse relationship in a data set.
7. The relative strength of the relationships in different samples is
most accurately assessed by computing the 𝑟 2 values for the
samples.
8. A meaningful correlation coefficient can only be computed when
data are measured in interval or ratio scales.
9. The sign of the correlation coefficient can be used to determine the
strength of the relationships between two variables.
10. A correlation coefficient of −0.60 is approximately twice as strong as
correlation coefficient of +0.30.

Learning Activity 2

Direction. Solve the following.

1. The scores of five students in mathematics and chemistry classes are:

Mathematics   6     4     8     5     3.5
Chemistry     6.5   4.5   7     5     4

Calculate the linear correlation coefficient.

7.3.2 Regression Analysis


After a relationship between paired data, which are referred to as bivariate
data, has been discovered, one can model the relationship with an equation. One
method of determining a linear relationship for bivariate data is called linear
regression.
In linear regression, we assume that a change in 𝑥 (independent variable)
will lead directly to a change in 𝑦 (dependent variable). Sometimes, we are
interested in predicting the value of 𝑦 from the value of 𝑥. Generally, it is not
logical to believe that 𝑦 caused 𝑥. By convention, we plot the independent variable
along the horizontal axis or the 𝑥-axis and the dependent variable along the
vertical axis or 𝑦-axis.
Furthermore, simple linear regression is similar to correlation in that the
purpose is to measure to what extent there is a linear relationship between two
variables. In particular, the purpose of linear regression is to "predict" the value
of the dependent variable based upon the values of one or more independent
variables. The relationship is summarized by a regression equation consisting of
a slope and an intercept. The slope represents the amount by which the dependent
variable increases or decreases for each unit increase in the independent
variable, and the intercept indicates the value of the dependent variable when
the independent variable takes the value zero.

7.3.2.1 The Least-Squares Regression Line


The least-squares regression line for a set of bivariate data is the line that
minimizes the sum of the squares of the vertical deviations from each data point
to the line.
The least-squares regression line is also called the least-squares line. By
convention, we use the symbol 𝑦̂ (pronounced 𝑦-hat) in place of 𝑦 in the equation
of a least-squares line. This also helps us differentiate the line’s 𝑦-values from
the 𝑦-values of the given ordered pairs.
The equation of the least-squares line for 𝑛 ordered pairs
(𝑥₁, 𝑦₁), (𝑥₂, 𝑦₂), (𝑥₃, 𝑦₃), … , (𝑥ₙ, 𝑦ₙ) is 𝑦̂ = 𝑎𝑥 + 𝑏 where
\[ a = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{n\sum x^2 - \left(\sum x\right)^2} \]
and
\[ b = \bar{y} - a\bar{x} \]
The notation 𝑥̅ represents the mean of the 𝑥 values and 𝑦̅ represents the
mean of the 𝑦 values.

Example 6.
Find the equation of the least-squares line for the ordered pairs in the table
below.
𝑥 𝑦
2.5 3.4
3.0 4.9
3.3 5.5
3.5 6.6
3.8 7.0
4.0 7.7
4.2 8.3
4.5 8.7

Solution.
From the scatter plot in this example, we see that there is a positive
correlation between the two sets of data.

We now proceed with the process of finding the equation of the regression
line.
Step 1. Prepare the columns for 𝑥² and 𝑥𝑦.

𝑥       𝑦       𝑥²       𝑥𝑦
2.5     3.4     6.25     8.50
3.0     4.9     9.00     14.70
3.3     5.5     10.89    18.15
3.5     6.6     12.25    23.10
3.8     7.0     14.44    26.60
4.0     7.7     16.00    30.80
4.2     8.3     17.64    34.86
4.5     8.7     20.25    39.15
∑𝑥 = 28.8   ∑𝑦 = 52.1   ∑𝑥² = 106.72   ∑𝑥𝑦 = 195.86

Step 2. Compute the slope 𝑎.

\[ a = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{n\sum x^2 - \left(\sum x\right)^2} = \frac{8(195.86) - 28.8(52.1)}{8(106.72) - (28.8)^2} \approx 2.7303 \]

Step 3. Find the means of the 𝑥 and 𝑦 values and the 𝑦-intercept 𝑏.

\[ \bar{x} = \frac{\sum x}{n} = \frac{28.8}{8} = 3.6 \]

\[ \bar{y} = \frac{\sum y}{n} = \frac{52.1}{8} = 6.5125 \]

\[ b = \bar{y} - a\bar{x} = 6.5125 - 2.7303(3.6) = -3.31658 \]

Step 4. Round 𝑎 and 𝑏 to the nearest tenth and find 𝑦̂.
    𝑎 = 2.7
    𝑏 = −3.3

The least-squares line equation is 𝑦̂ = 2.7𝑥 − 3.3.

The regression line is given by the red line in the next figure.
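
The steps of Example 6 can be sketched in code, together with a quick check of what "least squares" means: the fitted line should give a smaller sum of squared vertical deviations than nearby lines. The function names `least_squares` and `sum_squared_errors` are illustrative only.

```python
def least_squares(xs, ys):
    """Return slope a and intercept b of the least-squares line y-hat = a*x + b."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    a = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b = sum_y / n - a * (sum_x / n)   # b = y-bar - a * x-bar
    return a, b


def sum_squared_errors(xs, ys, a, b):
    """Sum of squared vertical deviations of the points from the line a*x + b."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))


# Ordered pairs from Example 6.
xs = [2.5, 3.0, 3.3, 3.5, 3.8, 4.0, 4.2, 4.5]
ys = [3.4, 4.9, 5.5, 6.6, 7.0, 7.7, 8.3, 8.7]

a, b = least_squares(xs, ys)
print(f"a = {a:.4f}, b = {b:.4f}")

# Nudging the slope or the intercept should never reduce the error.
best = sum_squared_errors(xs, ys, a, b)
print(best <= sum_squared_errors(xs, ys, a + 0.1, b))
print(best <= sum_squared_errors(xs, ys, a, b - 0.1))
```

The intercept computed here uses the unrounded slope, so it can differ in the later decimal places from a hand computation that substitutes a rounded slope.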

Example 7.
Use the equation of the least-squares line from the previous example to
predict the average 𝑦 values for each of the following 𝑥 values.
a. 2.8
b. 4.8
Solution.
Step 1. Substitute the given 𝑥 values into the equation that was obtained.
    a. 𝑦̂ = 2.7(2.8) − 3.3 = 4.26
    b. 𝑦̂ = 2.7(4.8) − 3.3 = 9.66

Step 2. Round the computed value to the nearest tenth.
    a. 𝑦̂ = 4.3
    b. 𝑦̂ = 9.7

Example 8.
Five children aged 2, 3, 5, 7, and 8 years old weigh 14, 20, 32, 42, and 44
kilograms, respectively.
a. Find the equation of the regression line of weight on age.
b. Based on this data, what is the approximate weight of a six-year-old
child?

Solution.
(a)
Step 1. Prepare the table with columns for 𝑥, 𝑦, 𝑥², and 𝑥𝑦.

𝑥 (Age)   𝑦 (Weight in kg)   𝑥²    𝑥𝑦
2         14                 4     28
3         20                 9     60
5         32                 25    160
7         42                 49    294
8         44                 64    352
∑𝑥 = 25   ∑𝑦 = 152   ∑𝑥² = 151   ∑𝑥𝑦 = 894

Step 2. Compute the slope 𝑎.

\[ a = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{n\sum x^2 - \left(\sum x\right)^2} = \frac{5(894) - 25(152)}{5(151) - (25)^2} \approx 5.1538 \]

Step 3. Find the means of the 𝑥 and 𝑦 values and the 𝑦-intercept 𝑏.

\[ \bar{x} = \frac{\sum x}{n} = \frac{25}{5} = 5 \]

\[ \bar{y} = \frac{\sum y}{n} = \frac{152}{5} = 30.4 \]

\[ b = \bar{y} - a\bar{x} = 30.4 - 5.1538(5) = 4.631 \]

Step 4. Round 𝑎 and 𝑏 to the nearest tenth and find 𝑦̂.
    𝑎 = 5.2
    𝑏 = 4.6

The least-squares line equation is 𝑦̂ = 5.2𝑥 + 4.6.

(b)
Step 1. Substitute the given 𝑥 value into the equation that was obtained.
    𝑦̂ = 5.2(6) + 4.6 = 35.8

Step 2. Round the computed value to the nearest tenth.
    The predicted weight for a six-year-old is 35.8 kg.
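
As a cross-check of Example 8, the sketch below fits the line in code and predicts the weight at age 6 twice: once with the coefficients rounded to the nearest tenth, as in the solution, and once with the unrounded coefficients. The function name `least_squares` and the variable names are illustrative.

```python
def least_squares(xs, ys):
    """Return slope a and intercept b of the least-squares line y-hat = a*x + b."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    a = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b = sum_y / n - a * (sum_x / n)
    return a, b


ages = [2, 3, 5, 7, 8]            # x, in years
weights = [14, 20, 32, 42, 44]    # y, in kilograms

a, b = least_squares(ages, weights)

# Prediction with coefficients rounded to the nearest tenth (as in the solution).
rounded_prediction = round(a, 1) * 6 + round(b, 1)

# Prediction with the unrounded coefficients.
full_prediction = a * 6 + b

print(f"rounded coefficients:   {rounded_prediction:.1f} kg")
print(f"unrounded coefficients: {full_prediction:.1f} kg")
```

The two predictions differ by a few tenths of a kilogram, which shows how rounding the slope and intercept early carries into the final answer.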

Learning Activity 3

Direction. Solve the following.

An exercise instructor recalls the data given in the following table,
which shows the recommended maximum exercise heart rates for individuals of
given ages.

Age (𝑥 years)             20    40    60
Maximum heart rate (𝑦)    170   153   136

a. Find the equation of the least-squares line.
b. Use the equation to predict the maximum exercise heart rate for a person
who is 50.

7.4 Supplementary Learning Resources

• Linear Regression Calculator
https://www.socscistatistics.com/tests/regression/default.aspx
• Pearson Correlation Calculator
https://www.socscistatistics.com/tests/pearson/default2.aspx

7.5 Flexible Teaching-Learning Modality
Remote (asynchronous)
• Module, exercises, problem sets, PowerPoint lessons

7.6 Assessment Task
Direction. Answer the following items.
1. The table below shows the students’ involvement in community service
(in hours) and their general weighted average (GWA).

Community service (in hours)    General Weighted Average
37 1.75
35 3.00
24 2.35
15 2.00
28 2.50
35 1.63
32 2.80
38 2.52
30 2.95
27 1.95
29 2.13
30 2.42
24 2.53
39 2.85
35 2.50

a. Compute the correlation coefficient of the two variables.
b. Find the equation of the least-squares line.
