
Lesson 1: Introducing Statistics

OVERVIEW OF LESSON
In decision making, we use statistics although some of us may not be aware of it.
In this lesson, we make the students realize that to decide logically, they need to
use statistics. An inquiry could be answered or a problem could be solved through
the use of statistics. In fact, without knowing it we use statistics in our daily
activities.

LEARNING COMPETENCIES:
At the end of the lesson, the learner should be able to identify questions that
could be answered using a statistical process and describe the activities involved
in a statistical process.

LESSON OUTLINE:
1. Motivation
2. Statistics as a Tool in Decision-Making
3. Statistical Process in Solving a Problem

REFERENCES:
Albert, J. R. G. (2008). Basic Statistics for the Tertiary Level (ed. Roberto
Padua, Welfredo Patungan, Nelia Marquez), published by Rex Bookstore.

Handbook of Statistics 1 (1st and 2nd Edition), Authored by the Faculty of
the Institute of Statistics, UP Los Baños, College Laguna 4031

Workbooks in Statistics 1 (From 1st to 13th Edition), Authored by the Faculty
of the Institute of Statistics, UP Los Baños, College Laguna 4031

DEVELOPMENT OF THE LESSON

A. Motivation

You may ask the students to share a question that is on their minds at the moment,
and write these questions on the board. (Note: As you write them, you may try to
sort the questions into two groups: questions that are answerable by a single fact,
and questions that require more than one piece of information and need further
thinking.)

The following are examples of what you could have written on the board:

Group 1:
• How old is our teacher?
• Is the vehicle of the Mayor of our city/town/municipality bigger than the
vehicle used by the President of the Philippines?
• How many days are there in December?
• Does the Principal of the school have a postgraduate degree?
• How much does the Barangay Captain receive as allowance?
• What is the weight of my smallest classmate?

Group 2:
• How old are the people residing in our town?
• Do dogs eat more than cats?
• Does it rain more in our country than in Thailand?
• Do math teachers earn more than science teachers?
• How many books do my classmates usually bring to school?
• What is the proportion of Filipino children aged 0 to 5 years who are
underweight or overweight for their age?

The first group of questions could be answered by a piece of information which is
considered always true. There is a correct answer which is based on a fact, and you
don’t need a process of inquiry to answer this kind of question. For example, there
is one and only one correct answer to the first question in Group 1, and that is your
age as of your last birthday or the number of years since your birth year.

On the other hand, to respond to the questions in the second group, one needs
observations or data. For some questions, you need to get the observations or
responses of all those concerned. For the first question in the second group, you
need to ask all the people in the locality about their age and, from the values you
obtain, get a representative value. To answer the second question in the second
group, you would need the amount of food that all dogs and all cats eat. However,
we know that it is not feasible to do so. What you can do instead is get a
representative group of dogs and another representative group of cats, then measure
the amount of food each group of animals eats. From these two sets of values, we
could then infer whether dogs eat more than cats.

So as you can see in the second group of questions you need more information
or data to be able to answer the question. Either you need to get observations
from all those concerned or you get representative groups from which you gather
your data. But in both cases, you need data to be able to respond to the question.
Using data to find an answer or a solution to a problem or an inquiry is actually
using the statistical process or doing it with statistics.

Now, let us formalize what we discussed and know more about statistics and how
we use it in decision-making.

B. Main Lesson
1. Statistics as a Tool in Decision-Making

Statistics is defined as a science that studies data to be able to make a decision.
Hence, it is a tool in the decision-making process. Mention that Statistics as a
science involves the methods of collecting, processing, summarizing and
analyzing data in order to provide answers or solutions to an inquiry. One also
needs to interpret and communicate the results of the methods identified above
to support a decision that one makes when faced with a problem or an inquiry.

Trivia: The word “statistics” actually comes from the word “state”,
because governments have long been involved in statistical activities,
especially the conduct of censuses for either military or taxation purposes.
The need for and conduct of censuses are recorded in the pages of holy
texts. In the Christian Bible, particularly the Book of Numbers, God is
reported to have instructed Moses to carry out a census. Another census
mentioned in the Bible is the census ordered by Caesar Augustus
throughout the entire Roman Empire before the birth of Christ.

Inform students that uncovering patterns in data is not just a science
but also an art, and this is why some people may think “Stat is eeeks!”
and may view statistical procedures and results with much skepticism.

Make known to students that Statistics enables us to
• characterize persons, objects, situations, and phenomena;
• explain relationships among variables;
• formulate objective assessments and comparisons; and, more importantly
• make evidence-based decisions and predictions.

To use Statistics in decision-making, there is a statistical process to
follow, which is discussed in the next section.

2. Statistical Process in Solving a Problem

You may go back to one of the questions identified in the second group and use
it to discuss the components of a statistical process. As an illustration, let us
discuss how we could answer the question “Do dogs eat more than cats?”

As discussed earlier, this question requires you to gather data to generate statistics
which will serve as the basis for answering the query. There should be a plan or design
on how to collect the data so that the information we get from it is sufficient and
any bias in responding to the query is minimized. We said earlier that we cannot
gather the data from all dogs and cats. Hence, the plan is to get a representative
group of dogs and another representative group of cats. These representative groups
are observed for characteristics like the animal’s weight, the amount of food in
grams eaten per day, and the breed of the animal. Included in the plan are decisions
like how many dogs and cats to include in each group, how to select the animals
included in the representative groups, and when to observe these animals for their
characteristics.

After the data were gathered, we must verify the quality of the data to make a
good decision. Data quality check could be done as we process the data to
summarize the information extracted from the data. Then using this information,
one can then make a decision or provide answers to the problem or question at
hand.

To summarize, a statistical process in making a decision or providing solutions to
a problem includes the following:

• Planning or designing the collection of data to answer statistical questions in
a way that maximizes information content and minimizes bias;
• Collecting the data as required in the plan;
• Verifying the quality of the data after they were collected;
• Summarizing the information extracted from the data; and
• Examining the summary statistics so that insight and meaningful information
can be produced to support decision-making or solutions to the question or
problem at hand.

Hence, several activities make up a statistical process. For some questions the
process is simple, but for others it might be a bit complicated to implement.
Not all questions or problems can be answered by a simple statistical
process; there are indeed problems that need a complex statistical process.
However, one can be assured that logical decisions or solutions can be
formulated using a statistical process.

KEY POINTS
• Difference between questions that could be and those that could
not be answered using Statistics.
• Statistics is a science that studies data.
• There are many uses of Statistics but its main use is in decision-making.
• Logical decisions or solutions to a problem could be attained through a
statistical process.

ASSESSMENT (Do this!)


Note: Answers are provided inside the parentheses and italicized.
1. Identify which of the following questions are answerable using a
statistical process.
a. What is a typical size of a Filipino family?
b. How many hours are there in a day?
c. How old is the oldest man residing in the Philippines?
d. Is planet Mars bigger than planet Earth?
e. What is the average wage rate in the country?
f. Would Filipinos prefer eating bananas rather than apples?
g. How long did you sleep last night?
h. How much does a newly hired public school teacher in NCR earn in a month?
i. How tall is a typical Filipino?
j. Did you eat your breakfast today?
2. For each of the identified questions in Number 1 that are answerable using
a statistical process, describe the activities involved in the data
gathering process.

Lesson 2: Data Collection Activity

OVERVIEW OF LESSON
As we have learned in the previous lesson, Statistics is a science that studies data.
Hence, to teach Statistics, the use of real data sets is recommended. In this lesson, we
present an activity where the students will be asked to provide some data that will be
submitted for consolidation by the teacher for future lessons. Data on heights and
weights, for instance, will be used for calculating Body Mass Index in the integrative
lesson. Students will also be given the perspective that the data they provided is part
of a bigger group of data as the same data will be asked from much larger groups
(the entire class, all Grade 11 students in school, all Grade 11 students in the
district). The contextualization of data will also be discussed.

LEARNING COMPETENCIES:
At the end of the lesson, the learner should be able to:
• Recognize the importance of providing correct information in a data collection
activity;
• Understand the issue of confidentiality of information in a data collection activity;
• Participate in a data collection activity; and
• Contextualize data

LESSON OUTLINE:
1. Preliminaries in a Data Collection Activity
2. Performing a Data Collection Activity
3. Contextualization of Data

REFERENCES
Albert, J. R. G. (2008). Basic Statistics for the Tertiary Level (ed. Roberto Padua, Welfredo Patungan,
Nelia Marquez), published by Rex Bookstore.

Handbook of Statistics 1 (1st and 2nd Edition), Authored by the Faculty of the Institute of Statistics, UP
Los Baños, College Laguna 4031

Workbooks in Statistics 1 (From 1st to 13th Edition), Authored by the Faculty of the Institute of
Statistics, UP Los Baños, College Laguna 4031

https://www.khanacademy.org/math/probability/statistical-studies/statistical-questions/v/statistical-questions
https://www.illustrativemathematics.org/content-standards/tasks/703

DEVELOPMENT OF THE LESSON

A. Preliminaries in a Data Collection Activity


Before the lesson, prepare a sheet of paper listing everyone’s name in class with a
“Class Student Number” (see Attachment A for the suggested format). The class
student number is a random number chosen in the following fashion:

(a) Make a box with “tickets” (small pieces of papers of equal sizes) listing
the numbers 1 up to the number of students in the class.
(b) Shake the box, get a ticket, and assign the number in the ticket to the first
person in the list.
(c) Shake the box again, get another ticket, and assign the number of this ticket
to the next person in the list.
(d) Do (c) until you run out of tickets in the box.

At this point all the students have their corresponding class student number written
across their names in the prepared class list. Note that the preparation of the class
list is done before the class starts.
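If the class list is kept electronically, the ticket-drawing steps (a) to (d) can be mimicked with a short script. The following is only a minimal sketch: the function name, the `seed` parameter and the placeholder student names are assumptions made for illustration and are not part of the lesson materials.

```python
import random


def assign_class_student_numbers(names, seed=None):
    """Give each name a unique random number from 1 to len(names)."""
    rng = random.Random(seed)
    tickets = list(range(1, len(names) + 1))  # step (a): tickets numbered 1..N
    rng.shuffle(tickets)                      # shaking the box
    # steps (b) to (d): hand out the tickets one by one, in class-list order
    return dict(zip(names, tickets))


# Placeholder names (hypothetical); use the actual class list instead.
class_list = ["Student A", "Student B", "Student C", "Student D"]
numbers = assign_class_student_numbers(class_list)
for name, number in numbers.items():
    print(name, number)
```

Because the list of tickets is shuffled rather than drawn with replacement, no two students can receive the same number, just as with physical tickets that are not returned to the box.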

At the start of the class, inform each student confidentially of his/her class student
number. Perhaps, when the attendance is called, each student can be provided a
separate piece of paper that lists her/his name and class student number. Tell
students to remember their class student number, and to always use this throughout
the semester whenever data are requested of them. Explain to students that in a data
collection activity, specific identities like their names are not required, especially
because people have a right to confidentiality. There should, however, be a way to
develop and maintain a database so that the quality of the data provided can be
checked and, if necessary, verified with the respondents who provided them.

These preliminary steps for generating a class student number and informing
students confidentially of their class student number are essential for the data
collection activities to be performed in this lesson and other lessons so that students
can be uniquely identified, without having to obtain their names. Also inform the
students that the class student numbers they were given are meant to identify them
without having to know their specific identities in the class recording sheet (which will
contain the consolidated records that everyone had provided). This helps protect
confidentiality of information.

In statistical activities, facts are collected from respondents for purposes of getting
aggregate information, but confidentiality should be protected. Mention that the
agencies mandated to collect data are bound by law to protect the confidentiality of
information provided by respondents. Market research organizations in the
private sector and individual researchers also guard confidentiality, as they merely
want to obtain aggregate data. This way, respondents can be truthful in giving
information, and the researcher can give a commitment to respondents that the data
they provide will never be released to anyone in a form that will identify them without
their consent.

B. Performing a Data Collection Activity

Explain to the students that the purpose of this data collection activity is to gather
data that they could use for their future lessons in Statistics. It is important that they
do provide the needed information to the best of their knowledge. Also, before they
respond to the questionnaire provided in the Attachment B as Student Information
Sheet (SIS), it is recommended that each item in the SIS should be clarified. The
following are suggested clarifications to make for each item:

1. CLASS STUDENT NUMBER: This is the number that you provided
confidentially to the student at the start of the class.

2. SEX: This is the student’s biological sex and not their preferred gender. Hence,
they have to choose only one of the two choices by placing a check mark (√) in the
space provided before the choices.

3. NUMBER OF SIBLINGS: This is the number of brothers and sisters that the
student has in their nuclear or immediate family. This number excludes him or
her in the count. Thus, if the student is the only child in the family then he/she
will report zero as his/her number of siblings.

4. WEIGHT (in kilograms): This refers to the student’s weight based on the
student’s knowledge. Note that the weight has to be reported in kilograms. In
case the student knows his/her weight in pounds, the value should be converted
to kilograms by dividing the weight in pounds by a conversion factor of 2.2
pounds per kilogram.

5. HEIGHT (in centimeters): This refers to the student’s height based on the
student’s knowledge. Note that the height has to be reported in centimeters. In
case the student knows his/her height in inches, the value should be
converted to centimeters by multiplying the height in inches by a conversion
factor of 2.54 centimeters per inch.

6. AGE OF MOTHER (as of her last birthday in years): This refers to the age of
the student’s mother in years as of her last birthday; thus, this number should be
reported as a whole number. In case the student’s mother is deceased or nowhere to
be found, ask the student to provide the age as if the mother were alive or
around. You could help the student in determining his/her mother’s age based
on other information that the student could provide, like the birth year of the
mother or the student’s age. Note also that zero is not an acceptable value.

7. USUAL DAILY ALLOWANCE IN SCHOOL (in pesos): This refers to the usual
amount in pesos that the student is given when he/she goes to school on a
weekday. Note that the student can give zero as a response for this item, in case
he/she has no monetary allowance per day.

8. USUAL DAILY FOOD EXPENDITURE IN SCHOOL (in pesos): This refers to
the usual amount in pesos that the student spends for food, including drinks, in
school per day. Note that the student can give zero as a response for this item, in
case he/she does not spend for food in school.

9. USUAL NUMBER OF TEXT MESSAGES SENT IN A DAY: This refers to the usual
number of text messages that a student sends in a day. Note that the student can give
zero as a response for this item, in case he/she does not have a gadget to use to
send a text message or he/she simply does not send text messages.

10. MOST PREFERRED COLOR: The student is to choose the color that he/she
considers his/her most preferred among the given choices. Note that the
student can choose only one. Hence, he/she has to place a check mark (√)
in the space provided before that color.

11. USUAL SLEEPING TIME: This refers to the usual sleeping time at night during a
typical weekday or school day. Note that the time is to be reported using the
24-hour clock, also known as military time (0:00 to 23:59 are the
possible values to use).

12. HAPPINESS INDEX FOR THE DAY: The student has to respond on how
he/she feels at that time using codes from 1 to 10. Code 1 means that the
student is very unhappy, while Code 10 means that the student is very happy,
on the day when the data are being collected.
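The unit conversions described in items 4 and 5 can be checked with a few lines of code. This is a minimal sketch: the function names and the example values (132 pounds, 65 inches) are made up for illustration, and 2.2 pounds per kilogram is the approximate factor given in item 4.

```python
LB_PER_KG = 2.2      # approximate conversion factor from item 4
CM_PER_INCH = 2.54   # conversion factor from item 5


def pounds_to_kilograms(pounds):
    """Divide the weight in pounds by 2.2 pounds per kilogram."""
    return pounds / LB_PER_KG


def inches_to_centimeters(inches):
    """Multiply the height in inches by 2.54 centimeters per inch."""
    return inches * CM_PER_INCH


# Hypothetical student who knows only pounds and inches.
weight_kg = pounds_to_kilograms(132)
height_cm = inches_to_centimeters(65)
print(round(weight_kg, 1), "kg;", round(height_cm, 1), "cm")
```

The results are rounded only when printed, so the stored values keep their full precision for later computations.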

After the clarification, the students are provided at most 10 minutes to respond to the
questionnaire. Ask the students to submit the completed SIS so that you could
consolidate the data gathered using a formatted worksheet file provided to you as
Attachment C. Having the data in electronic file makes it easier for you to use it in the
future lessons. Be sure that the students provided the information in all items in the
SIS.
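When the responses are encoded in a worksheet, the clarifications given for the 12 items can double as simple data quality checks. The sketch below is one possible way to do this, not a prescribed procedure: the field names, the `check_sis_record` function and the M/F codes are assumptions made for illustration, while the rules themselves (no missing items, zero not allowed for age of mother, a 24-hour time, a happiness code from 1 to 10, and one of the twelve listed colors) come from the item descriptions above and the SIS.

```python
import re

# Hypothetical field names for the 12 SIS items.
REQUIRED_ITEMS = [
    "class_student_number", "sex", "siblings", "weight_kg", "height_cm",
    "age_of_mother", "allowance", "food_expenditure", "text_messages",
    "color", "sleeping_time", "happiness",
]
COLORS = {
    "WHITE", "RED", "PINK", "ORANGE", "YELLOW", "GREEN",
    "BLUE", "PEACH", "BROWN", "GRAY", "BLACK", "PURPLE",
}
TIME_24H = re.compile(r"([01]?\d|2[0-3]):[0-5]\d")  # 0:00 to 23:59


def check_sis_record(record):
    """Return a list of problems found in one SIS record (empty if none)."""
    missing = [item for item in REQUIRED_ITEMS if record.get(item) is None]
    if missing:
        return [f"{item}: missing" for item in missing]

    problems = []
    if record["sex"] not in {"M", "F"}:
        problems.append("sex: choose exactly one of M or F")
    for item in ("weight_kg", "height_cm", "age_of_mother"):
        if record[item] <= 0:
            problems.append(f"{item}: zero is not an acceptable value")
    for item in ("siblings", "allowance", "food_expenditure", "text_messages"):
        if record[item] < 0:
            problems.append(f"{item}: cannot be negative (zero is allowed)")
    if record["color"] not in COLORS:
        problems.append("color: not among the given choices")
    if not TIME_24H.fullmatch(record["sleeping_time"]):
        problems.append("sleeping_time: use the 24-hour clock, 0:00 to 23:59")
    if not 1 <= record["happiness"] <= 10:
        problems.append("happiness: use a code from 1 to 10")
    return problems


# First row of the sample data shown later in this lesson.
record = {
    "class_student_number": 1, "sex": "M", "siblings": 2, "weight_kg": 60,
    "height_cm": 156, "age_of_mother": 60, "allowance": 200,
    "food_expenditure": 150, "text_messages": 20, "color": "RED",
    "sleeping_time": "23:00", "happiness": 8,
}
print(check_sis_record(record))
```

A record that passes these checks can still be wrong (for example, a mistyped but plausible weight), so checks like these complement, rather than replace, verification with the student through his/her class student number.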

Inform the students that you will compile all their responses. Compiling
these records from everyone in the class is an example of a census, since data have
been gathered from every student in class. Mention that the government, through the
Philippine Statistics Authority (PSA), conducts censuses to obtain information about
socio-demographic characteristics of the residents of the country. Census data are
used by the government to make plans, such as how many schools and hospitals to
build. Censuses of population and housing are conducted every 10 years on years
ending in zero (e.g., 1990, 2000, 2010) to obtain population counts, and
demographic information about all Filipinos. Mid-decade population censuses have
also been conducted since 1995. Censuses of Agriculture, and of Philippine
Business and Industry, are also conducted by the PSA to obtain information on
production and other relevant economic information.

PSA is the government agency mandated to conduct censuses and surveys.
Through Republic Act 10625 (also referred to as The Philippine Statistical Act of
2013), PSA was created from four former government statistical agencies, namely:
National Statistics Office (NSO), National Statistical Coordination Board (NSCB),
Bureau of Labor and Employment Statistics (BLES) and Bureau of Agricultural
Statistics (BAS). The other agency created through RA 10625 is the Philippine
Statistical Research and Training Institute (PSRTI), which is mandated as the
research and training arm of the Philippine Statistical System. PSRTI was created
from its forerunner, the former Statistical Research and Training Center (SRTC).

C. Contextualization of Data

Ask students what comes to their minds when they hear the term “data” (which may
be viewed as a collection of facts from experiments, observations, sample surveys
and censuses, and administrative reporting systems).

Present to the students the following collection of numbers, figures, symbols, and
words, and ask them if they could consider the collection as data.

3, red, F, 156, 4, 65, 50, 25, 1, M, 9, 40, 68, blue, 78, 168, 69, 3, F, 6, 9, 45,
50, 20, 200, white, 2, pink, 160, 5, 60, 100, 15, 9, 8, 41, 65, black, 68, 165,
59, 7, 6, 35, 45,

Although the collection is composed of numbers and symbols that could be classified
as numeric or non-numeric, the collection has no meaning or it is not contextualized,
hence it cannot be referred to as data.

Tell the students that data are facts and figures that are collected, presented and
analyzed. Data are either numeric or non-numeric and must be contextualized. To
contextualize data, that is, to put meaning on the data, we must know the following
six W’s of the data:

1. Who? Who provided the data?

2. What? What information was provided by the respondents, and What is the unit of
measurement used for each piece of information (if there is any)?

3. When? When was the data collected?

4. Where? Where was the data collected?

5. Why? Why was the data collected?

6. HoW? HoW was the data collected?

Let us take as an illustration the data that you have just collected from the students,
and let us contextualize it by responding to the W-questions. It is recommended that
the students answer the W-questions themselves so that they will learn how to do it.

1. Who? Who provided the data?

• The students in this class provided the data.

2. What? What information was provided by the respondents, and What is the unit of
measurement used for each piece of information (if there is any)?

• The information gathered includes Class Student Number, Sex, Number of
Siblings, Weight, Height, Age of Mother, Usual Daily Allowance in School,
Usual Daily Food Expenditure in School, Usual Number of Text Messages
Sent in a Day, Most Preferred Color, Usual Sleeping Time and Happiness
Index for the Day.

• The units of measurement for the information on Number of Siblings,
Weight, Height, Age of Mother, Usual Daily Allowance in School, Usual Daily
Food Expenditure in School, and Usual Number of Text Messages Sent in a
Day are person, kilogram, centimeter, year, pesos, pesos and message,
respectively.

3. When? When was the data collected?

• The data was collected on the first few days of classes for Statistics
and Probability.

4. Where? Where was the data collected?

• The data was collected inside our classroom.

5. Why? Why was the data collected?

• As explained earlier, the data will be used in our future lessons in Statistics and
Probability.

6. HoW? HoW was the data collected?

• The students provided the data by responding to the Student Information Sheet
prepared and distributed by the teacher for the data collection activity.

Once the data are contextualized, the collection of numbers and symbols has
meaning. It may now look like the following, which is just a small part of the
data collected in the earlier activity.
Class    Sex  Number    Weight   Height   Age of  Usual daily  Usual daily  Usual number  Most       Usual     Happiness
Student       of        (in kg)  (in cm)  mother  allowance    food         of text       Preferred  Sleeping  Index for
Number        siblings                    (in     in school    expenditure  messages      Color      Time      the Day
              (in                         years)  (in pesos)   in school    sent in a
              person)                                          (in pesos)   day
1        M    2         60       156      60      200          150          20            RED        23:00     8
2        F    5         63       160      66      300          200          25            PINK       22:00     9
3        F    3         65       165      59      250          50           15            BLUE       20:00     7
4        M    1         55       160      55      200          100          30            BLACK      19:00     6
5        M    0         65       167      45      350          300          35            BLUE       20:00     8
:        :    :         :        :        :       :            :            :             :          :         :
:        :    :         :        :        :       :            :            :             :          :         :

KEY POINTS

• Providing correct information in a government data collection activity is
a responsibility of every citizen in the country.
• Data confidentiality is important in a data collection activity.
• A census collects data from all possible respondents.
• Data to be collected must be clarified before the actual data collection.
• Data must be contextualized by answering six W-questions.

Activity

ATTACHMENT : STUDENT INFORMATION SHEET (Do this!) Collect information from other groups

Instruction to the Students: Please provide completely the following information. Your
teacher is available to respond to your queries regarding the items in this information
sheet, if you have any. Rest assured that the information that you will be providing
will only be used in our lessons in Statistics and Probability.

1. CLASS STUDENT NUMBER:
2. SEX (Put a check mark, √): ___ Male ___ Female
3. NUMBER OF SIBLINGS:
4. WEIGHT (in kilograms):
5. HEIGHT (in centimeters):
6. AGE OF MOTHER (as of her last birthday in years):
(If mother is deceased, provide her age as if she were alive)
7. USUAL DAILY ALLOWANCE IN SCHOOL (in pesos):
8. USUAL DAILY FOOD EXPENDITURE IN SCHOOL (in pesos):
9. USUAL NUMBER OF TEXT MESSAGES SENT IN A DAY:
10. MOST PREFERRED COLOR (Put a check mark, √. Choose only one):

___ WHITE   ___ RED    ___ PINK   ___ ORANGE
___ YELLOW  ___ GREEN  ___ BLUE   ___ PEACH
___ BROWN   ___ GRAY   ___ BLACK  ___ PURPLE

11. USUAL SLEEPING TIME (on weekdays):
12. HAPPINESS INDEX FOR THE DAY:
On a scale from 1 (very unhappy) to 10 (very happy), how do you feel today?

Class    Sex  Number    Weight   Height   Age of  Usual daily  Usual daily  Usual number  Most       Usual     Happiness
Student       of        (in kg)  (in cm)  mother  allowance    food         of text       Preferred  Sleeping  Index for
Number        siblings                    (in     in school    expenditure  messages      Color      Time      the Day
              (in                         years)  (in pesos)   in school    sent in a
              person)                                          (in pesos)   day

Lesson 3: Basic Terms in Statistics
OVERVIEW OF LESSON
As a continuation of Lesson 2 (where we contextualized data), in this lesson we define
basic terms in statistics as we continue to explore data. These basic terms include
the universe, variable, population and sample. We will also discuss in detail other
concepts related to a variable.

LEARNING OUTCOME(S):

At the end of the lesson, the learner is able to

• Define universe and differentiate it from population; and
• Define and differentiate between qualitative and quantitative variables, and
between discrete and continuous variables (that are quantitative).

LESSON OUTLINE:
1. Recall previous lesson on ‘Contextualizing Data’
2. Definition of Basic Terms in Statistics (universe, variable, population and sample)
3. Broad Classification of Variables (qualitative and quantitative, discrete and
continuous)

REFERENCES
Albert, J. R. G. (2008). Basic Statistics for the Tertiary Level (ed. Roberto
Padua, Welfredo Patungan, Nelia Marquez), published by Rex Bookstore.

Handbook of Statistics 1 (1st and 2nd Edition), Authored by the Faculty of
the Institute of Statistics, UP Los Baños, College Laguna 4031

Takahashi, S. (2009). The Manga Guide to Statistics. Trend-Pro Co. Ltd.

Workbooks in Statistics 1 (From 1st to 13th Edition), Authored by the Faculty of
the Institute of Statistics, UP Los Baños, College Laguna 4031

DEVELOPMENT OF THE LESSON
A. Recall previous lesson on ‘Contextualizing Data’
Begin by recalling with the students the data they provided in the previous lesson
and how they contextualized such data. You could show them the compiled data set
in a table like this:
Class    Sex  Number    Weight   Height   Age of  Usual daily  Usual daily  Usual number  Most       Usual     Happiness
Student       of        (in kg)  (in cm)  mother  allowance    food         of text       Preferred  Sleeping  Index for
Number        siblings                    (in     in school    expenditure  messages      Color      Time      the Day
              (in                         years)  (in pesos)   in school    sent in a
              person)                                          (in pesos)   day
1        M    2         60       156      60      200          150          20            RED        23:00     8
2        F    5         63       160      66      300          200          25            PINK       22:00     9
3        F    3         65       165      59      250          50           15            BLUE       20:00     7
4        M    1         55       160      55      200          100          30            BLACK      19:00     6
5        M    0         65       167      45      350          300          35            BLUE       20:00     8
:        :    :         :        :        :       :            :            :             :          :         :
:        :    :         :        :        :       :            :            :             :          :         :
Recall also their response to the first W of the data, that is, to the question “Who
provided the data?” We said last time that the students of the class provided the data,
or that the data were taken from the students.
Another W of the data is What? What information was provided by the respondents,
and What is the unit of measurement used for each piece of information (if there is
any)? Our responses are the following:
• The information gathered include Class Student Number, Sex, Number of
Siblings, Weight, Height, Age of Mother, Usual Daily Allowance in School,
Usual Daily Food Expenditure in School, Usual Number of Text Messages
Sent in a Day, Most Preferred Color, Usual Sleeping Time and Happiness
Index.
• The units of measurement for the information on Number of Siblings, Weight,
Height, Age of Mother, Usual Daily Allowance in School, Usual Daily Food
Expenditure in School, and Usual Number of Text Messages Sent in a Day are
person, kilogram, centimeter, year, pesos, pesos and message, respectively.

B. Main Lesson

1. Definition of Basic Terms


The collection of respondents from whom one obtains the data is called the universe of
the study. In our illustration, the set of students of this Statistics and Probability class
is our universe. But we must caution the students that a universe is not necessarily
composed of people, since there are studies where the observations are taken from
plants or animals or even from non-living things like buildings, vehicles, farms, etc.
So formally, we define the universe as the collection or set of
units or entities from which we get the data. Thus, this set of units answers the first
W of data contextualization.
On the other hand, the pieces of information we asked from the students are referred to
as the variables of the study, and in the data collection activity we have 12 variables
including Class Student Number. A variable is a characteristic that is observable or
measurable in every unit of the universe. From each student of the class, we got
his/her sex, number of siblings, weight, height, age of mother, usual daily allowance
in school, usual daily food expenditure in school, usual number of text messages
sent in a day, most preferred color, usual sleeping time and happiness index for the
day. Since these characteristics are observable in each and every student of the
class, these are referred to as variables.
The set of all possible values of a variable is referred to as a population. Thus, for
each variable we observed, we have a population of values. The number of
populations in a study is equal to the number of variables observed. In the data
collection activity we had, there are 12 populations corresponding to the 12 variables.

A subgroup of a universe or of a population is a sample. There are several ways to
take a sample from a universe or a population, and the way we draw the sample
dictates the kind of analysis we do with our data.
We can further visualize these terms in the following figure:

[Figure: the UNIVERSE is linked, through VARIABLE 1, VARIABLE 2, ..., VARIABLE 12, to the
POPULATION OF VARIABLE 1, POPULATION OF VARIABLE 2, ..., POPULATION OF VARIABLE 12. A SAMPLE
is either A SAMPLE OF UNITS or A SAMPLE OF POPULATION VALUES.]

Figure 3.1 Visualization of the relationship among universe, variable, population and sample.
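The four terms can also be illustrated with the first five records of the class data. This is a minimal sketch under stated assumptions: only five of the twelve variables are included to keep it short, the five students are treated as if they were the whole universe, the population of each variable is taken to be the values observed across those units, and the variable names and the fixed seed are choices made for illustration.

```python
import random

# First five records from the class data table (subset of the variables).
variables = ["class_student_number", "sex", "siblings", "weight_kg", "height_cm"]
rows = [
    (1, "M", 2, 60, 156),
    (2, "F", 5, 63, 160),
    (3, "F", 3, 65, 165),
    (4, "M", 1, 55, 160),
    (5, "M", 0, 65, 167),
]

# Universe: the collection of units (students) from which data were obtained.
universe = [dict(zip(variables, row)) for row in rows]

# One population of values per variable observed on the universe.
populations = {v: [unit[v] for unit in universe] for v in variables}

# A sample of units, and the corresponding sample of population values.
rng = random.Random(0)  # fixed seed so the sketch is repeatable
sample_of_units = rng.sample(universe, 2)
sample_of_heights = [unit["height_cm"] for unit in sample_of_units]

print(len(universe), "units;", len(populations), "populations")
print("Population of siblings:", populations["siblings"])
print("Sample of heights:", sample_of_heights)
```

Note that the number of populations equals the number of variables kept in the sketch (five here, twelve in the full data collection activity).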
2. Broad Classification of Variables

Following up with the concept of variable, inform the students that usually, a variable
takes on several values. But occasionally, a variable can only assume one value,
then it is called a constant. For instance, in a class of fifteen-year olds, the age in
years of students is constant.

Variables can be broadly classified as either qualitative or quantitative, with the latter
further classified into discrete and continuous types (see Figure 3.3 below).

Figure 3.3 Broad Classification of Variables

(i) Qualitative variables express a categorical attribute, such as sex (male or
female), religion, marital status, region of residence, or highest educational
attainment. Qualitative variables do not strictly take on numeric values (although
we can have numeric codes for them, e.g., for the sex variable, 1 and 2 may refer to
male and female, respectively). Qualitative data answer questions of “what kind.”
Sometimes, there is a sense of ordering in qualitative data, e.g., income data
grouped into high, middle and low-income status. Data on sex or religion do not
have a sense of ordering, as there is no such thing as a weaker or stronger sex,
or a better or worse religion. Qualitative variables are sometimes referred to as
categorical variables.

(ii) Quantitative (otherwise called numerical) data, whose sizes are meaningful,
answer questions such as “how much” or “how many”. Quantitative variables
have actual units of measure. Examples of quantitative variables include the
height, weight, number of registered cars, household size, and total household
expenditures/income of survey respondents. Quantitative data may be further
classified into:

a. Discrete data are those data that can be counted, e.g., the number of days for
cellphones to fail, the ages of survey respondents measured to the nearest
year, and the number of patients in a hospital. These data assume only a
countable (finite or countably infinite) number of values.

b. Continuous data are those that can be measured, e.g. the exact height of a
survey respondent and the exact volume of some liquid substance. The
possible values are uncountably infinite.

With this classification, let us then test the understanding of our students by asking
them to classify the variables we had in our last data gathering activity. They should
be able to classify these variables as qualitative or quantitative and, furthermore,
as discrete or continuous. If they did it right, you have the following:

VARIABLE                                            TYPE OF VARIABLE   TYPE OF QUANTITATIVE VARIABLE
Class Student Number                                Qualitative
Sex                                                 Qualitative
Number of Siblings                                  Quantitative       Discrete
Weight (in kilograms)                               Quantitative       Continuous
Height (in centimeters)                             Quantitative       Continuous
Age of Mother                                       Quantitative       Discrete
Usual Daily Allowance in School (in pesos)          Quantitative       Discrete
Usual Daily Food Expenditure in School (in pesos)   Quantitative       Discrete
Usual Number of Text Messages Sent in a Day         Quantitative       Discrete
Usual Sleeping Time                                 Qualitative
Most Preferred Color                                Qualitative
Happiness Index for the Day                         Qualitative

Special Note:
For quantitative data, arithmetical operations have some physical interpretation. One
can add 301 and 302 if these have quantitative meanings, but if these numbers refer
to room numbers, then adding these numbers does not make any sense. Taking
numerical values does not by itself make a variable quantitative! The issue is
whether performing arithmetical operations on these data would make any sense. It
would certainly not make sense to sum two zip codes or multiply two room numbers.
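The test in the Special Note can be sketched in code. The snippet below is an illustration only: the lookup `VARIABLE_TYPES` copies a few rows of the classification table in this lesson, while the helper names `arithmetic_is_meaningful` and `total` and the sample values are assumptions made for the sketch.

```python
# A few rows of the classification table in this lesson: (type, subtype).
VARIABLE_TYPES = {
    "Class Student Number": ("qualitative", None),
    "Sex": ("qualitative", None),
    "Number of Siblings": ("quantitative", "discrete"),
    "Weight (in kilograms)": ("quantitative", "continuous"),
    "Height (in centimeters)": ("quantitative", "continuous"),
    "Most Preferred Color": ("qualitative", None),
}


def arithmetic_is_meaningful(variable):
    """Return True only for variables classified as quantitative."""
    kind, _subtype = VARIABLE_TYPES[variable]
    return kind == "quantitative"


def total(variable, values):
    """Sum the values only when the sum has an interpretation."""
    if not arithmetic_is_meaningful(variable):
        raise ValueError(f"Summing '{variable}' values makes no sense")
    return sum(values)


# Siblings can be added across students (invented values).
print(total("Number of Siblings", [2, 0, 3]))

# Class student numbers are numeric codes, so the same call is refused.
try:
    total("Class Student Number", [301, 302])
except ValueError as error:
    print(error)
```

The point of the design is that the decision depends on the recorded type of the variable, not on whether its values happen to look like numbers.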

KEY POINTS

• A universe is a collection of units from which the data were gathered.
• A variable is a characteristic we observed or measured from every element of the
universe.
• A population is a set of all possible values of a variable.
• A sample is a subgroup of a universe or a population.
• In a study there is only one universe, but there could be several populations.
• Variables could be classified as qualitative or quantitative, and the latter
could be further classified as discrete or continuous.

ASSESSMENT (Do this!)

1. A market research company requested all teachers of a particular school to fill
out a questionnaire in relation to their product market study. The following are
some of the information supplied by the teachers:
• highest educational attainment
• predominant hair color
• body temperature
• civil status
• brand of laundry soap being used
• total household expenditures last month in pesos
• number of children in the household
• number of hours standing in queue while waiting to be served by a
bank teller
• amount spent on rice last week by the household
• distance travelled by the teacher in going to school
• time (in hours) consumed on Facebook on a particular day

a. If we are to consider the collection of information gathered through the completed
questionnaire, what is the universe for this data set?
b. Which of the variables are qualitative? Which are quantitative? Among the
quantitative variables, classify them further as discrete or continuous.
• highest educational attainment
• predominant hair color
• body temperature
• civil status
• brand of laundry soap being used
• total household expenditures last month in pesos
• number of children in a household
• number of hours standing in queue while waiting to be served by a
bank teller

• amount spent on rice last week by a household
• distance travelled by the teacher in going to school
• time (in hours) consumed on Facebook on a particular day
2. The Engineering Department of a big city did a listing of all buildings in their
locality. If you are planning to gather the characteristics of these buildings,
a. What is the universe of this data collection activity?
b. What are the crucial variables to observe? It would also be better if you could
classify each variable as qualitative or quantitative.
Furthermore, classify each quantitative variable as discrete or continuous.
3. A survey of students in a certain school is conducted. The survey questionnaire
details the information on the following variables. For each of these variables,
identify whether the variable is qualitative or quantitative, and if the latter, state
whether it is discrete or continuous.
a. number of family members who are working
b. ownership of a cell phone among family members
c. length (in minutes) of longest call made on each cell phone owned
per month
d. ownership/rental of dwelling
e. amount spent in pesos on food in one week
f. occupation of household head
g. total family income
h. number of years of schooling of each family member
i. access of family members to social media
j. amount of time last week spent by each family member using the internet

Lesson 4: Levels of Measurement
OVERVIEW OF LESSON
In this lesson we discuss the different levels of measurement as we continue to
explore data. Knowing such will enable us to plan the data collection process we
need to employ in order to gather the appropriate data for analysis.

LEARNING OUTCOME(S):
At the end of the lesson, the learner is able to identify and differentiate the different
levels of measurement and methods of data collection.

LESSON OUTLINE:

1. Motivational Activity
2. Levels of Measurement
3. Data Collection Methods

REFERENCES
Albert, J. R. G. (2008). Basic Statistics for the Tertiary Level (ed. Roberto
Padua, Welfredo Patungan, Nelia Marquez), published by Rex Bookstore.
Handbook of Statistics 1 (1st and 2nd Edition), Authored by the Faculty of
the Institute of Statistics, UP Los Baños, College Laguna 4031
Takahashi, S. (2009). The Manga Guide to Statistics. Trend-Pro Co. Ltd.

Workbooks in Statistics 1 (From 1st to 13th Edition), Authored by the Faculty of
the Institute of Statistics, UP Los Baños, College Laguna 4031

DEVELOPMENT OF THE LESSON

A. Motivational Activity

Ask the students first if they believe the following statement:

“Students who eat a healthy breakfast will do best on a quiz, students who eat an
unhealthy breakfast will get an average performance, and students who do not eat
anything for breakfast will do the worst on a quiz”

You could further ask one or more students who have different answers to defend
their answers. Then challenge the students to apply a statistical process to
investigate the validity of this statement. You could enumerate on the board the
steps in the process, like the following:

1. Plan or design the collection of data to verify the validity of the statement in a
way that maximizes information content and minimizes bias;
2. Collect the data as required in the plan;
3. Verify the quality of the data after it was collected;
4. Summarize the information extracted from the data; and
5. Examine the summary statistics so that insight and meaningful information can
be produced to support your decision whether to believe or not the given
statement.

Let us discuss in detail the first step. In planning or designing the data collection
activity, we could consider the set of all the students in the class as our universe.
Then let us identify the variables we need to observe or measure to verify the validity
of the statement. You may ask the students to participate in the discussion by asking
them to identify a question to get the needed data. The following are some possible
suggested queries:

1. Do you usually have breakfast before going to school?
(Note: This is answerable by Yes or No)
2. What do you usually have for breakfast?
(Note: Possible responses for this question are rice, bread, banana,
oatmeal, cereal, etc.)

The responses to Questions 1 and 2 could lead us to identify whether a
student in the class had a healthy breakfast, an unhealthy breakfast or no breakfast
at all.

Furthermore, there is a need to determine the performance of the student in a quiz
on that day. The score in the quiz could be used to identify the student’s performance
as best, average or worst.
As we describe the data collection process to verify the validity of the statement, there is
also a need to include the levels of measurement for the variables of interest.

B. Main Lesson:

1. Levels of Measurement

Inform students that there are four levels of measurement of variables: nominal,
ordinal, interval and ratio. These are hierarchical in nature and are described as
follows:

Nominal level of measurement arises when we have variables that are categorical
and non-numeric or where the numbers have no sense of ordering. As an example,
consider the numbers on the uniforms of basketball players. Is the player wearing a
number 7 a worse player than the player wearing number 10? Maybe, or maybe not,
but the number on the uniform does not have anything to do with their performance.
The numbers on the uniform merely help identify the basketball player. Other
examples of variables measured at the nominal level include sex, marital status, and
religious affiliation. For the study on the validity of the statement regarding the effect of
breakfast on school performance, students who responded Yes to Question Number
1 can be coded 1, while those who responded No can be coded 0. The
numbers used are simply numerical codes and cannot be used for ordering or
any mathematical computation.

Ordinal level also deals with categorical variables like the nominal level, but in this
level ordering is important, that is the values of the variable could be ranked. For the
study on the validity of the statement regarding effect of breakfast on school
performance, students who had healthy breakfast can be coded 1, those who had
unhealthy breakfast as 2 while those who had no breakfast at all as 3. Using the
codes, the responses could be ranked. Thus, the students who had a healthy
breakfast are ranked first while those who had no breakfast at all are ranked last in
terms of having a healthy breakfast. The numerical codes here have a meaningful
sense of ordering: unlike the numbers on basketball uniforms, the numerical codes
suggest that one student is having a healthier breakfast than another student. Other
examples of the ordinal scale include socioeconomic status (A to E, where A is
wealthy, E is poor), difficulty of questions in an exam (easy, medium, difficult), rank in
a contest (first place, second place, etc.), and perceptions in Likert scales.

Note to Teacher: Let us also emphasize to the students that while there is a
sense of ordering, there is no zero point in an ordinal scale. In addition, there
is no way to find out how much “distance” there is between one category and
another. In a scale from 1 to 10, the difference between 7 and 8 may not be
the same as the difference between 1 and 2.

Interval level tells us that one unit differs by a certain amount of degree from another
unit. Knowing how much one unit differs from another is an additional property of the
interval level on top of having the properties possessed by the ordinal level. When
measuring temperature in Celsius, a 10 degree difference has the same meaning
anywhere along the scale – the difference between 10 and 20 degrees Celsius is the
same as between 80 and 90 degrees Celsius. But we cannot say that 80 degrees Celsius
is twice as hot as 40 degrees Celsius since there is no true zero, but only an
arbitrary zero point. A measurement of 0 degrees Celsius does not reflect a true "lack
of temperature." Thus, the Celsius scale is at the interval level. Another example of a
variable measured at the interval level is the Intelligence Quotient (IQ) of a person.
We can tell not only which person ranks higher in IQ but also how much higher he or
she ranks than another, but zero IQ does not mean no intelligence. The students could
also be classified or categorized according to their IQ level. Hence, the IQ as measured
at the interval level also has the properties of those measured at the ordinal as well as
the nominal level.
Special Note: Also inform the students that the interval level allows addition
and subtraction operations, but it does not possess an absolute zero. Zero is
arbitrary as it does not mean the value does not exist. Zero only represents an
additional measurement point.
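A short calculation may help students see why ratios are not meaningful on the Celsius scale. The sketch below re-expresses the same two temperatures on the Kelvin scale, which does have a true zero. The Kelvin conversion (adding 273.15) is not part of this lesson and is brought in only for illustration, and the function name `celsius_to_kelvin` is made up for the sketch.

```python
def celsius_to_kelvin(c):
    """Convert degrees Celsius to kelvins; 0 K is a true zero point."""
    return c + 273.15


low, high = 40.0, 80.0

# Differences survive the change of zero point (the interval property).
diff_celsius = high - low
diff_kelvin = celsius_to_kelvin(high) - celsius_to_kelvin(low)

# Ratios do not: 80/40 equals 2 in Celsius, but the same two
# temperatures give a very different ratio in kelvins.
ratio_celsius = high / low
ratio_kelvin = celsius_to_kelvin(high) / celsius_to_kelvin(low)

print("Difference:", diff_celsius, "vs", round(diff_kelvin, 2))
print("Ratio:     ", ratio_celsius, "vs", round(ratio_kelvin, 3))
```

The difference of 40 degrees is the same on both scales, while the ratio drops from 2 to roughly 1.13, so "twice as hot" is only an artifact of where the Celsius zero happens to be placed.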

Ratio level also tells us that one unit has so many times as much of the property as
does another unit. The ratio level possesses a meaningful (unique and non-arbitrary)
absolute, fixed zero point and allows all arithmetic operations. The existence of the
zero point is the only difference between the ratio and interval levels of measurement.
Examples of the ratio scale include mass, height, weight, energy and electric
charge. With mass as an example, the difference between 120 grams and 135
grams is 15 grams, and this is the same difference between 380 grams and 395
grams. The size of one unit is the same at any point of the scale, and a measurement
of 0 reflects a complete lack of mass. Amount of money is also at the ratio level. We
can say that 2,000 pesos is twice as much as 1,000 pesos. In addition, money has a
true zero point: if you have zero money, this implies the absence of money. For the
study on the validity of the statement regarding the effect of breakfast on school
performance, the student’s score in the quiz is measured at the ratio level. A score of
zero implies that the student did not get a correct answer at all.
In summary, we have the following levels of measurement:

Level      Property                                            Basic Empirical Operation
Nominal    No order, distance, or origin                       Determination of equivalence
Ordinal    Has order but no distance or unique origin          Determination of greater or lesser values
Interval   Both with order and distance but no unique origin   Determination of equality of intervals or difference
Ratio      Has order, distance and unique origin               Determination of equality of ratios or means

The levels of measurement depend mainly on the method of measurement, not on the
property measured. The weight of primary school students measured in kilograms is at
the ratio level, but the students can also be categorized as overweight, normal or
underweight, in which case the weight is measured at the ordinal level. Also, many
scales are only interval because their zero point is arbitrarily chosen.
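The hierarchical nature of the four levels can be sketched as follows. This is an illustrative sketch rather than a standard library: the class `Level`, the lookup `MINIMUM_LEVEL`, the operation labels and the function `allows` are names made up here, and the mapping simply follows the summary table of levels, properties and basic empirical operations.

```python
from enum import IntEnum


class Level(IntEnum):
    """Levels of measurement, ordered from lowest to highest."""
    NOMINAL = 1
    ORDINAL = 2
    INTERVAL = 3
    RATIO = 4


# Lowest level at which each basic empirical operation becomes meaningful.
MINIMUM_LEVEL = {
    "equivalence": Level.NOMINAL,              # same category or not
    "greater_or_lesser": Level.ORDINAL,        # ranking
    "equality_of_intervals": Level.INTERVAL,   # comparing differences
    "equality_of_ratios": Level.RATIO,         # "twice as much"
}


def allows(level, operation):
    """A level supports an operation if it is at or above the minimum level."""
    return level >= MINIMUM_LEVEL[operation]


for level in Level:
    supported = [op for op in MINIMUM_LEVEL if allows(level, op)]
    print(f"{level.name:<8} {supported}")
```

Each level prints every operation of the levels below it plus one more, which is what "hierarchical" means here: a ratio variable can always be treated as interval, ordinal or nominal, but not the other way around.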

To assess the students' understanding of the lesson, you may go back to the set of
variables in the data gathering activity done in Lesson 2. You could ask the students
to identify the level of measurement for each of the variables. If they did it right, you
have the following:

VARIABLE                                            LEVEL OF MEASUREMENT
Class Student Number                                Nominal
Sex                                                 Nominal
Number of Siblings                                  Ratio
Weight (in kilograms)                               Ratio
Height (in centimeters)                             Ratio
Age of Mother                                       Ratio
Usual Daily Allowance in School (in pesos)          Ratio
Usual Daily Food Expenditure in School (in pesos)   Ratio
Usual Number of Text Messages Sent in a Day         Ratio
Usual Sleeping Time                                 Nominal
Most Preferred Color                                Nominal
Happiness Index for the Day                         Ordinal

2. Methods of Data Collection

Variables are observed or measured using any of the three methods of data
collection, namely: objective, subjective and use of existing records. The objective
and subjective methods obtain the data directly from the source. The former uses
any one or a combination of the five senses (sight, touch, hearing, taste and
smell) to measure the variable, while the latter obtains data by getting responses
through a questionnaire. The resulting data from these two methods of data
collection are referred to as primary data. The data gathered in Lesson 2 are primary
data and were obtained using the subjective method.

On the other hand, secondary data are obtained through the use of existing records
or data collected by other entities for certain purposes. For example, when we use
data gathered by the Philippine Statistics Authority, we are using secondary data and
the method we employ to get the data is the use of existing records. Other data
sources include administrative records, news articles, internet, and the like.
However, we must emphasize to the students that when we use existing data we
must be confident of the quality of the data we are using by knowing how the data
were gathered. Also, we must remember to request permission and acknowledge the
source of the data when using data gathered by other agencies or people.

KEY POINTS

• Four levels of measurement: Nominal, Ordinal, Interval and Ratio
• Knowing the level at which a variable was measured or observed will guide us in
choosing the type of analysis to apply.
• Three methods of data collection include objective, subjective and use of
existing records.
• Using the data collection method as basis, data can be classified as
either primary or secondary data.

ASSESSMENT (Do this!)

1. Using the data of the teachers in a particular school gathered by a market
research company, identify the level of measurement for each of the following
variables.
• highest educational attainment
• predominant hair color
• body temperature
• civil status
• brand of laundry soap being used

• total household expenditures last month in pesos
• number of children in a household
• number of hours standing in queue while waiting to be served by a
bank teller
• amount spent on rice last week by a household
• distance travelled by the teacher in going to school
• time (in hours) consumed on Facebook on a particular day

2. The following variables are included in a survey conducted among students in
a certain school. Identify the level of measurement for each of the variables.
a. number of family members who are working
b. ownership of a cell phone among family members
c. length (in minutes) of longest call made on each cell phone owned per month
d. ownership/rental of dwelling
e. amount spent in pesos on food in one week
f. occupation of household head
g. total family income
h. number of years of schooling of each family member
i. access of family members to social media
j. amount of time last week spent by each family member using the internet

3. In the following, identify the data collection method used and the type of resulting
data.
a. The website of Philippine Airlines provides a questionnaire instrument that can
be answered electronically.
b. The latest series of the Consumer Price Index (CPI) generated by the
Philippine Statistics Authority was downloaded from PSA website.
c. A reporter recorded the number of minutes to travel from one end to another
of the Metro Manila Rail Transit (MRT) during peak and off-peak hours.
d. Students getting the height of the plants using a meter stick.
e. A PSA enumerator conducting the Labor Force Survey goes around the country
to interview household heads on employment-related variables.

Lesson 5: Data Presentation
OVERVIEW OF LESSON

In this lesson we enrich what the students have already learned from Grade 1 to 10
about presenting data. Additional concepts could help the students to appropriately
describe further the data set.

LEARNING OUTCOME(S):
At the end of the lesson, the learner is able to identify and use the appropriate
method of presenting information from a data set effectively.

LESSON OUTLINE:
1. Review of Lessons in Data Presentation taken up from Grade 1 to 12.
2. Methods of Data Presentation
3. The Frequency Distribution Table and Histogram

REFERENCES
Albert, J. R. G. (2008). Basic Statistics for the Tertiary Level (ed. Roberto
Padua, Welfredo Patungan, Nelia Marquez), published by Rex Bookstore.

Handbook of Statistics 1 (1st and 2nd Edition), Authored by the Faculty of
the Institute of Statistics, UP Los Baños, College Laguna 4031

Takahashi, S. (2009). The Manga Guide to Statistics. Trend-Pro Co. Ltd.

Workbooks in Statistics 1 (From 1st to 13th Edition), Authored by the Faculty of
the Institute of Statistics, UP Los Baños, College Laguna 4031

DEVELOPMENT OF THE LESSON

A. Review of Lessons in Data Presentation taken up from Grade 1 to 12.

You could assist the students to recall what they have learned in Grade 1 to 12
regarding data presentation by asking them to participate in an activity. The activity is
called ‘Toss the Ball’. This is actually a review and wake-up exercise. Toss a ball to a
student and he/she will give the most important concept he/she learned about data
presentation.

You may list on the board their responses. You could summarize their responses to
be able to establish what they already know about data presentation techniques and
from this you could build other concepts on the topic. A suggestion is to classify their
answers according to the three methods of data presentation, i.e. textual, tabular and
graphical. A possible listing will be something like this:

Textual or Narrative Presentation:
• Detailed information is given in textual presentation.
• Narrative report is a way to present data.

Tabular Presentation:
• Numerical values are presented using tables.
• Information is lost in tabular presentation of data.
• Frequency distribution table is also applicable for qualitative variables

Graphical Presentation:
• Trends are easily seen in graphs compared to tables.
• It is good to present data using pictures or figures like the pictograph.
• Pie charts are used to present data as part of one whole.
• Line graphs are for time-series data.
• It is better to present data using graphs than tables as they are much better to
look at.

B. Main Lesson
1. Methods of Data Presentation
You could inform the students that in general there are three methods to present
data. Two or all of these three methods could be used at the same time to present
appropriately the information from the data set. These methods include the (1)
textual or narrative; (2) tabular; and (3) graphical method of presentation.

In presenting the data in textual or paragraph or narrative form, one describes the
data by enumerating some of the highlights of the data set like giving the highest,
lowest or the average values. In case there are only a few observations, say fewer than
ten observations, the values could be enumerated if there is a need to do so. An
example is shown below:

The country’s poverty incidence among families as reported by the
Philippine Statistics Authority (PSA), the agency mandated to release
official poverty statistics, decreased from 21% in 2006 to 19.7% in
2012. For 2012, the regional estimates released by PSA indicate that the
Autonomous Region of Muslim Mindanao (ARMM) is the poorest region
with poverty incidence among families estimated at 48.7%. The region with
the smallest estimated poverty incidence among families at 2.6% is the
National Capital Region (NCR).

Data could also be summarized or presented using tables. The tabular method of
presentation is applicable for large data sets. Trends could easily be seen in this kind
of presentation. However, there is a loss of information when using such kind of
presentation. The frequency distribution table is the usual tabular form of presenting
the distribution of the data. The following are the common parts of a statistical table:

a. Table title includes the number and a short description of what is found inside the
table.
b. Column header provides the label of what is being presented in a column.
c. Row header provides the label of what is being presented in a row.
d. Body is the information in the cells where the rows and the columns intersect.

In general, a table should have at least three rows and/or three columns. However,
conveying too much information in a table is also not advisable. Tables are usually
used in written technical reports and in oral presentations. Table 5.1 is an example of
presenting data in tabular form. This example was taken from the 2015 Philippine
Statistics in Brief, a regular publication of the PSA, which is also the basis for the
example of the textual presentation given above.

Table 5.1 Regional estimates of poverty incidence among families
based on the Family Income and Expenditures Survey
conducted in the same year of reporting.

Region 2006 2009 2012


NCR 2.9 2.4 2.6
CAR 21.1 19.2 17.5
I 19.9 16.8 14.0
II 21.7 20.2 17.0
III 10.3 10.7 10.1
IV A 7.8 8.8 8.3
IV B 32.4 27.2 23.6
V 35.4 35.3 32.3
VI 22.7 23.6 22.8
VII 30.7 26.0 25.7
VIII 33.7 34.5 37.4
IX 40.0 39.5 33.7
X 32.1 33.3 32.8
XI 25.4 25.5 25.0
XII 31.2 30.8 37.1
Caraga 41.7 46.0 31.9
ARMM 40.5 39.9 48.7
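The highlights quoted in the textual presentation can be read off Table 5.1 with a few lines of Python. The sketch below copies the figures of Table 5.1 into a dictionary; the names `poverty`, `YEARS`, `extremes` and `increased` are chosen here for illustration only.

```python
# Poverty incidence among families (%), copied from Table 5.1.
# Each tuple holds the estimates for 2006, 2009 and 2012.
poverty = {
    "NCR": (2.9, 2.4, 2.6),       "CAR": (21.1, 19.2, 17.5),
    "I": (19.9, 16.8, 14.0),      "II": (21.7, 20.2, 17.0),
    "III": (10.3, 10.7, 10.1),    "IV A": (7.8, 8.8, 8.3),
    "IV B": (32.4, 27.2, 23.6),   "V": (35.4, 35.3, 32.3),
    "VI": (22.7, 23.6, 22.8),     "VII": (30.7, 26.0, 25.7),
    "VIII": (33.7, 34.5, 37.4),   "IX": (40.0, 39.5, 33.7),
    "X": (32.1, 33.3, 32.8),      "XI": (25.4, 25.5, 25.0),
    "XII": (31.2, 30.8, 37.1),    "Caraga": (41.7, 46.0, 31.9),
    "ARMM": (40.5, 39.9, 48.7),
}
YEARS = (2006, 2009, 2012)


def extremes(year):
    """Return the regions with the lowest and highest estimate for a year."""
    i = YEARS.index(year)
    lowest = min(poverty, key=lambda region: poverty[region][i])
    highest = max(poverty, key=lambda region: poverty[region][i])
    return lowest, highest


lowest, highest = extremes(2012)
print(f"Lowest in 2012:  {lowest} ({poverty[lowest][2]}%)")
print(f"Highest in 2012: {highest} ({poverty[highest][2]}%)")

# Regions whose 2012 estimate is higher than their 2006 estimate.
increased = [region for region, (y2006, _, y2012) in poverty.items() if y2012 > y2006]
print("Higher in 2012 than in 2006:", increased)
```

For 2012 this picks out NCR and ARMM, the same two regions named in the textual presentation, which shows how a narrative summary is usually a selection of highlights from a table.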

Graphical presentation on the other hand, is a visual presentation of the data. Graphs
are commonly used in oral presentation. There are several forms of graphs to use like
the pie chart, pictograph, bar graph, line graph, histogram and box-plot. Which form to
use depends on what information is to be relayed. For example, trends across time are
easily seen using a line graph. However, values of variables in nominal or ordinal levels
of measurement should not be presented using a line graph. Rather, a bar graph is more
appropriate to use. A graphical presentation in the form of a vertical bar graph of the 2012
regional estimates of poverty incidence among families is shown below:

[Figure: bar graph of poverty incidence among families (in percent, axis ticks from
10 to 60) by region, from NCR to ARMM.]
Figure 5.1 2012 Regional poverty incidence among families (2012 FIES).

Other examples of graphical presentations that are shown below are lifted from the
Handbook of Statistics 1 (listed in the reference section at the end of this Teaching
Guide).

Figure 5.2. Percentage distribution of dogs according to groupings identified in a dog
show.

Figure 5.3. Distribution of fruits sales of a store for two days.

Figure 5.4 Weapons arrest rate from 1965 to 1992 by age of offender.

[Scatter plot: weight in kg (vertical axis, 30 to 80) against height in cm
(horizontal axis, 110 to 190).]

Figure 5.5. Height and weight of STAT 1 students registered during the
previous term.

2. The Frequency Distribution Table and Histogram


A special type of tabular and graphical presentation is the frequency distribution table
(FDT) and its corresponding histogram. Specifically, these are used to depict the
distribution of the data. Most of the time, these are used in technical reports. An FDT
is a presentation containing non-overlapping categories or classes of a variable and
the frequencies or counts of the observations falling into the categories or classes.
There are two types of FDT according to the type of data being organized:
a qualitative FDT or a quantitative FDT. For a qualitative FDT, the non-overlapping
categories of the variable are identified, and frequencies, as well as the percentages
of observations falling into the categories, are computed. On the other hand, a
quantitative FDT is of two types: ungrouped and grouped. An ungrouped
FDT is constructed when there are only a few observations or if the data set contains
only a few possible values. On the other hand, a grouped FDT is constructed when there
is a large number of observations and when the data set involves many possible
values. The distinct values are grouped into class intervals. The creation of columns
for a grouped FDT follows a set of guidelines. One such procedure is described in
the following steps, which are lifted from the Workbook in Statistics 1 (listed in the
reference section at the end of this Teaching Guide).

Steps in the construction of a grouped FDT

1. Identify the largest data value or the maximum (MAX) and smallest data value or the
minimum (MIN) from the data set and compute the range, R. The range is the difference
between the largest and smallest value, i.e. R = MAX – MIN.

2. Determine the number of classes, k, using k = √N, where N is the total number of
observations in the data set. Round off k to the nearest whole number. It should be noted
that the computed k might not be equal to the actual number of classes constructed in an
FDT.

3. Calculate the class size, c, using c = R/k. Round off c to the nearest value with precision the
same as that with the raw data.

4. Construct the classes or the class intervals. A class interval is defined by a lower limit (LL)
and an upper limit (UL). The LL of the lowest class is usually the MIN of the data set. The
LL’s of the succeeding classes are then obtained by adding c to the LL of the preceding
classes. The UL of the lowest class is obtained by subtracting one unit of measure
(1/10^x, where x is the maximum number of decimal places observed from the raw data)
from the LL of the next class. The UL’s of the succeeding classes are then obtained by
adding c to the UL of the preceding classes. The lowest class should contain the MIN,
while the highest class should contain the MAX.

5. Tally the data into the classes constructed in Step 4 to obtain the frequency of each
class. Each observation must fall in one and only one class.

6. Add (if needed) the following distributional characteristics:

a. True Class Boundaries (TCB). The TCBs reflect the continuous property of continuous
data. A TCB is defined by a lower TCB (LTCB) and an upper TCB (UTCB). These are obtained
by taking the midpoints of the gaps between classes or by using the following formulas:
LTCB = LL – 0.5(one unit of measure) and UTCB = UL + 0.5(one unit of measure).

b. Class Mark (CM). The CM is the midpoint of a class and is obtained by taking the
average of the lower and upper TCB’s, i.e. CM = (LTCB + UTCB)/2.

c. Relative Frequency (RF). The RF refers to the frequency of the class as a fraction of the
total frequency, i.e. RF = frequency/N. RF can be computed for both qualitative and
quantitative data. RF can also be expressed in percent.

d. Cumulative Frequency (CF). The CF refers to the total number of observations greater
than or equal to the LL of the class (>CF) or the total number of observations less than or
equal to the UL of the class (<CF).

e. Relative Cumulative Frequency (RCF). RCF refers to the fraction of the total number of
observations greater than or equal to the LL of the class (>RCF) or the fraction of the
total number of observations less than or equal to the UL of the class (<RCF). Both the
<RCF and >RCF can also be expressed in percent.

The histogram is a graphical presentation of the frequency distribution table in the
form of a vertical bar graph. There are several forms of the histogram; the most
common form has the frequency on its vertical axis and the true class boundaries on
its horizontal axis.
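
Before moving to the example, the distributional characteristics listed in Step 6 (TCB, CM, RF, CF and RCF) can be sketched in Python. The class limits and frequencies below are hypothetical values chosen only for illustration (integer data, so one unit of measure is 1), and the function name `fdt_characteristics` is an assumption rather than part of the lesson.

```python
def fdt_characteristics(classes, freqs, unit=1):
    """Return one dict per class with TCB, CM, RF, <CF, >CF, <RCF and >RCF.

    classes: list of (LL, UL) class limits in increasing order
    freqs:   frequency of each class
    unit:    one unit of measure (1 for integer data, 0.1 for one decimal place)
    """
    n = sum(freqs)
    rows = []
    less_cf = 0
    for i, ((ll, ul), f) in enumerate(zip(classes, freqs)):
        ltcb = ll - 0.5 * unit           # LTCB = LL - 0.5(one unit of measure)
        utcb = ul + 0.5 * unit           # UTCB = UL + 0.5(one unit of measure)
        less_cf += f                     # observations <= UL of this class
        greater_cf = sum(freqs[i:])      # observations >= LL of this class
        rows.append({
            "LL": ll, "UL": ul, "F": f,
            "LTCB": ltcb, "UTCB": utcb,
            "CM": (ltcb + utcb) / 2,
            "RF": f / n,
            "<CF": less_cf, ">CF": greater_cf,
            "<RCF": less_cf / n, ">RCF": greater_cf / n,
        })
    return rows


# Hypothetical FDT (illustration only): four classes of integer-valued data
classes = [(60, 64), (65, 69), (70, 74), (75, 79)]
freqs = [4, 8, 6, 2]

for row in fdt_characteristics(classes, freqs):
    print(row)
```

Multiplying RF, <RCF and >RCF by 100 expresses them in percent.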

As an example, the FDT and its corresponding histogram of the 2012 estimated
poverty incidences of 144 municipalities and cities of Region VIII are shown below.

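Poverty Incidence (%)    Frequency
 0.015 - 20.015              3
20.015 - 40.015             59
40.015 - 60.015             78
60.015 - 80.015              4

[Figure: histogram of the poverty incidences, with Frequency on the vertical axis and True Class Boundaries on the horizontal axis]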

KEY POINTS

• Three methods of data presentation: textual, tabular and graphical
• Two or all the methods could be combined to fully describe the data at hand.
• Distribution of data is presented using frequency distribution table and
histogram.

ASSESSMENT (Do this!)

A. You are to describe the data on the following table. Perform what is being asked
for in the questions found after the table.
Table 5.2 Characteristics of the 30 members of the Batong Malake Senior Citizens Association (BMSCA)
who participated in their 2009 Lakbay-Aral.

No.  Gender  Age as of      Receiving Monthly  Gross Monthly Family   Number of Years
             Last Birthday  Pension? (Y/N)     Income (in thousand    as Member
                                               pesos)

1 Female 61 Yes 45.0 1
2 Female 64 Yes 26.3 2
3 Male 74 No 33.5 10
4 Male 80 No 50.0 12
5 Female 63 Yes 18.4 2
6 Female 71 Yes 30.0 9
7 Female 75 No 41.0 2
8 Male 64 No 10.1 3
9 Male 65 No 46.5 5
10 Female 68 Yes 18.0 3
11 Female 71 Yes 34.2 6
12 Female 63 Yes 73.1 2
13 Female 72 Yes 15.6 11
14 Male 76 Yes 17.4 11
15 Female 69 No 33.8 8
16 Male 70 Yes 35.1 9
17 Male 74 Yes 18.6 6
18 Female 68 Yes 65.7 8
19 Female 70 No 19.6 3
20 Male 65 Yes 53.0 2
21 Male 64 Yes 18.4 1
22 Female 62 Yes 27.8 1
23 Female 63 No 33.4 2
24 Male 68 No 38.0 5
25 Male 67 Yes 37.6 5
26 Male 69 No 50.4 7
27 Female 68 Yes 44.3 4
28 Female 66 No 36.7 3
29 Female 63 No 18.0 2
30 Male 64 Yes 63.2 2

1. Choose a QUANTITATIVE variable from the given data set. Construct a
quantitative grouped FDT for this variable. Show preliminary computations (R, k,
and c). Also, construct a histogram for the data. Use appropriate labels and titles
for the table and graph. Describe the characteristics of the units in the data set
using a brief narrative report. Refer to the FDT and histogram constructed.

R=
k=
c=

Table

Classes      Frequency   RF     CF            RCF (%)          TCB             CM
(LL - UL)    (F)         (%)    <CF   >CF     <RCF   >RCF      LTCB   UTCB

Histogram:

Textual presentation:

2. Which of the three methods of data presentation do you think is most
appropriate to use for the variable chosen in Number 1? Justify your answer.

3. Choose a QUALITATIVE variable from Table 5.2. Construct an appropriate graph.
Use labels and a title for the graph.

Give a brief report describing the variable:

Lesson 6: Measures of Central Tendency
OVERVIEW OF LESSON
The lesson begins with students engaging in a review of some measures of central
tendency by considering a numerical example. Students are also asked to examine
both strengths and limitations of these measures. Assessments will be given to
students on their ability to calculate these measures, and also to get an overall sense
of whether they recognize how these measures respond to changes in data values.

LEARNING OUTCOME(S):
At the end of the lesson, the learner is able to

• Calculate commonly used measures of central tendency,


• Provide a sound interpretation of these summary measures, and
• Discuss the properties of these measures.

LESSON OUTLINE:

1. Motivation
2. Common Measures of Central Tendency: Mean, Median and Mode
3. Properties of the Mean, Median and Mode

REFERENCES
Albert, J. R. G. (2008). Basic Statistics for the Tertiary Level (ed. Roberto
Padua, Welfredo Patungan, Nelia Marquez), published by Rex Bookstore.
“Deciding Which Measure of Center to Use”
http://www.sharemylesson.com/teaching-resource/deciding-which-measure-of-center-to-use-50013703/
Handbook of Statistics 1 (1st and 2nd Edition), Authored by the Faculty of
the Institute of Statistics, UP Los Baños, College Laguna 4031
Workbooks in Statistics 1 (From 1st to 13th Edition), Authored by the Faculty of
the Institute of Statistics, UP Los Baños, College Laguna 4031

DEVELOPMENT OF THE LESSON

A. Motivation

Present to the students the following frequency distribution table of the monthly
income of 35 families residing in a nearby barangay/village.

Monthly Family Income in Pesos    Number of Families
12,000 2
20,000 3
24,000 4
25,000 8
32,250 9
36,000 5
40,000 2
60,000 2

You may ask the students the following questions to pique their interest and at the same
time introduce to them some summary statistics.

1. What is the highest monthly family income? Lowest?

Answer: Highest monthly family income is 60,000 pesos while the lowest is
12,000 pesos.

You may emphasize that the highest and lowest values, which are commonly known
as maximum and minimum, respectively are summary measures of a data set. They
represent important location values in the distribution of the data. However, these
measures do not give a measure of location in the center of the distribution.

2. What monthly family income is most frequent in the village?

Answer: Monthly family income that is most frequent is 32,250 pesos.

The value of 32,250 occurs most often or it is the value with the highest frequency.
This is called the modal value or simply the mode. In this data set, the value of
32,250 is found in the center of the distribution.

3. If you list down individually the values of the monthly family income from lowest to
highest, what is the monthly family income where half of the total number of
families have monthly family income less than or equal to that value while the other
half have monthly family income greater than that value?

Answer: When the data are arranged in increasing order, that is, in an array as in
the following:

12,000; 12,000; 20,000; 20,000; 20,000; 24,000; 24,000; 24,000; 24,000; 25,000;
25,000; 25,000; 25,000; 25,000; 25,000; 25,000; 25,000; 32,250; 32,250; 32,250;
32,250; 32,250; 32,250; 32,250; 32,250; 32,250; 36,000; 36,000; 36,000; 36,000;
36,000; 40,000; 40,000; 60,000; 60,000

there are 17 values that are less than the middle value while another 17 values
are greater than or equal to the middle value. That middle value is the 18th observation
and it is equal to 32,250 pesos. The middle value is called the median and is
found in the center of the distribution.

4. What is the average monthly family income?

Answer: When computed using the data values, the average is 30,007.14 pesos.

The average monthly family income is commonly referred to as the arithmetic mean
or simply the mean which is computed by adding all the values and then the sum is
divided by the number of values included in the sum. The average value is also
found somewhere in the center of the distribution.

Let us now summarize what we have learned from our illustration and introduce the
three common measures of central tendency.

B. Common Measures of Central Tendency: Mean, Median and Mode

Inform students that the most widely used measure of the center is the (arithmetic)
mean. It is computed as the sum of all observations in the data set divided by the
number of observations that you include in the sum. If we use the summation
symbol $\sum_{i=1}^{N} x_i$, read as ‘sum of observations represented by xi where i takes the
values from 1 to N, and N refers to the total number of observations being added’,
we could compute the mean (usually denoted by the Greek letter µ) as

$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$$

Using the example earlier with 35 observations of family income, the mean is computed as

$$\mu = \frac{12{,}000 + 12{,}000 + 20{,}000 + \cdots + 60{,}000}{35} = \frac{1{,}050{,}250}{35} = 30{,}007.14$$

Alternatively, we could do the computation as follows:

Monthly Family Income    Number of Families    xi × fi
in Pesos (xi)            (fi)
12,000 2 12,000 × 2 = 24,000
20,000 3 20,000 × 3 = 60,000
24,000 4 24,000 × 4 = 96,000
25,000 8 25,000 × 8 = 200,000
32,250 9 32,250 × 9 = 290,250
36,000 5 36,000 × 5 = 180,000
40,000 2 40,000 × 2 = 80,000
60,000 2 60,000 × 2 = 120,000
Sum = 35 Sum = 1,050,250
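
The tabular computation above can be sketched in a few lines of Python. The variable names (`incomes`, `families`) are chosen here for illustration only; the values come from the frequency table.

```python
# Monthly family incomes (xi) and number of families (fi) from the table
incomes = [12_000, 20_000, 24_000, 25_000, 32_250, 36_000, 40_000, 60_000]
families = [2, 3, 4, 8, 9, 5, 2, 2]

n = sum(families)                                       # total number of observations
total = sum(x * f for x, f in zip(incomes, families))   # sum of xi × fi
mean = total / n

print(n, total, round(mean, 2))   # 35 1050250 30007.14
```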

For a large number of observations, it is advisable to use a computing tool like a
calculator or computer software, e.g. a spreadsheet application such as Microsoft Excel®.

The median, on the other hand, is the middle value in an array of observations. To
determine the median of a data set, the observations must first be arranged in
increasing or decreasing order. Then locate the middle value so that half of the
observations are less than or equal to that value while the other half of the observations
are greater than the middle value.

If N (total number of observations in a data set) is odd, the median or the middle
value is the $\left(\frac{N+1}{2}\right)^{th}$ observation in the array. On the other hand, if N is even, then
the median or the middle value is the average of the two middle values, that is, the
average of the $\left(\frac{N}{2}\right)^{th}$ and $\left(\frac{N}{2}+1\right)^{th}$ observations. In the example given earlier, there
are 35 observations so N is 35, an odd number. The median is then the
$\left(\frac{35+1}{2}\right)^{th}$ or 18th observation in the array. Locating the 18th observation in
the array gives a median of 32,250 pesos.

The mode or the modal value is the value that occurs most often or it is that value
that has the highest frequency. In other words, the mode is the most fashionable
value in the data set. Like in the example above, the value of 32,250 pesos occurs
most often or it is the value with the highest frequency which is equal to nine.
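
As a rough sketch, the rules for locating the median and the mode can be written in Python as follows. The frequency table is first expanded into an array; the variable names are illustrative assumptions.

```python
incomes = [12_000, 20_000, 24_000, 25_000, 32_250, 36_000, 40_000, 60_000]
families = [2, 3, 4, 8, 9, 5, 2, 2]

# Expand the frequency table into a sorted array of individual observations
array = sorted(x for x, f in zip(incomes, families) for _ in range(f))
n = len(array)

if n % 2 == 1:
    # N odd: the ((N + 1) / 2)-th observation (list indices start at 0)
    median = array[(n + 1) // 2 - 1]
else:
    # N even: average of the (N / 2)-th and (N / 2 + 1)-th observations
    median = (array[n // 2 - 1] + array[n // 2]) / 2

# Mode: every value whose frequency equals the highest frequency
highest = max(families)
modes = [x for x, f in zip(incomes, families) if f == highest]

print(median, modes)   # 32250 [32250]
```

Keeping the modes in a list allows for data sets with more than one mode.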

C. Properties of the Mean, Median and Mode

Each of these three measures has its own properties. Most of the time we use these
properties as basis for determining what measure to use to represent the center of
the distribution.
As mentioned before, the mean is the most commonly used measure of central tendency.
It can be likened to a “center of gravity”: if the values in an array were to be
put on a beam balance, the mean acts as the balancing point where smaller
observations will “balance” the larger ones, as seen in the following illustration.

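[Figure: beam balance of the monthly family incomes with the mean as the balancing point; labels shown: 24,000 and 36,000]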

Note that the frequency represented by the size of the rectangle serves as ‘weights’
in this beam balance.

To further illustrate this property, we could ask the students to subtract the value of the
mean from each observation (the difference is denoted as di) and then sum all the differences. The
computation can also be done alternatively as shown in the following table.

Monthly Family Income    di = xi - µ                      Number of Families    di × fi
in Pesos (xi)            (rounded off)                    (fi)
12,000                   12,000 – 30,007.14 = -18,007     2                     -18,007 × 2 = -36,014
20,000                   20,000 – 30,007.14 = -10,007     3                     -10,007 × 3 = -30,021
24,000                   24,000 – 30,007.14 = -6,007      4                     -6,007 × 4 = -24,029
25,000                   25,000 – 30,007.14 = -5,007      8                     -5,007 × 8 = -40,057
32,250                   32,250 – 30,007.14 = 2,243       9                     2,243 × 9 = 20,186
36,000                   36,000 – 30,007.14 = 5,993       5                     5,993 × 5 = 29,964
40,000                   40,000 – 30,007.14 = 9,993       2                     9,993 × 2 = 19,986
60,000                   60,000 – 30,007.14 = 29,993      2                     29,993 × 2 = 59,986
                                                          Sum = 35              Sum = 0

The sum of the differences across all observations will be equal to zero. This indicates
that the mean indeed is the center of the distribution since the negative and positive
deviations cancel out and the sum is equal to zero.
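
This property can be checked with a short Python sketch. Because tabulated differences are rounded off, their products may add up to a value slightly off zero; the sketch below uses exact fractions (an implementation choice made here, not part of the lesson) so that the sum comes out as exactly zero.

```python
from fractions import Fraction

incomes = [12_000, 20_000, 24_000, 25_000, 32_250, 36_000, 40_000, 60_000]
families = [2, 3, 4, 8, 9, 5, 2, 2]

n = sum(families)
# Exact mean as a fraction: 1,050,250 / 35
mean = Fraction(sum(x * f for x, f in zip(incomes, families)), n)

# Sum of di × fi, where di = xi - mean
sum_of_deviations = sum((x - mean) * f for x, f in zip(incomes, families))

print(sum_of_deviations)   # 0
```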

In the expression given above, we could see that each observation has a
contribution to the value of the mean. All the data contribute equally in its calculation.
That is, the “weight” of each of the data items in the array is the
reciprocal of the total number of observations in the data set, i.e. $1/N$.

Means are also amenable to further computation, that is, you can combine subgroup
means to come up with the mean for all observations. For example, if there are 3
groups with means equal to 10, 5 and 7 computed from 5, 15, and 10 observations
respectively, one can compute the mean for all 30 observations as follows:

$$\frac{(10)(5) + (5)(15) + (7)(10)}{5 + 15 + 10} = \frac{195}{30} = 6.5$$

If there are extremely large values, the mean will tend to be ‘pulled upward’, while if
there are extremely small values, the mean will tend to be ‘pulled downward’. The
extreme low or high values are referred to as ‘outliers’. Thus, outliers do affect the
value of the mean.

To illustrate this property, we could tell the students that if there is one family
with a very high monthly income of 600,000 pesos instead of only 60,000 pesos, the
computed value of the mean will be pulled upward, that is,

$$\mu = \frac{1{,}050{,}250 - 60{,}000 + 600{,}000}{35} = \frac{1{,}590{,}250}{35} = 45{,}435.71$$

Thus, in the presence of extreme values or outliers, the mean is not a good measure
of the center. An alternative measure is the median. The mean is also computed only
for quantitative variables that are measured at least in the interval scale.

Like the mean, the median is computed for quantitative variables. But the median
can also be computed for variables measured in at least the ordinal scale. Another
property of the median is that it is not easily affected by extreme values or outliers.
As in the example above with a monthly family income of 600,000 pesos as the
extreme value, the median remains the same, which is equal to 32,250 pesos.
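
The contrast between the two measures can be sketched with Python's standard `statistics` module. The sketch replaces one of the 60,000-peso incomes with 600,000 pesos, as in the illustration above; the variable names are assumptions made for this example.

```python
from statistics import mean, median

incomes = [12_000, 20_000, 24_000, 25_000, 32_250, 36_000, 40_000, 60_000]
families = [2, 3, 4, 8, 9, 5, 2, 2]
array = [x for x, f in zip(incomes, families) for _ in range(f)]

# Replace one 60,000-peso income with the extreme value 600,000 pesos
with_outlier = array.copy()
with_outlier[with_outlier.index(60_000)] = 600_000

print(round(mean(array), 2), median(array))                  # 30007.14 32250
print(round(mean(with_outlier), 2), median(with_outlier))    # 45435.71 32250
```

The mean moves by more than 15,000 pesos while the median stays where it was.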

For variables in the ordinal scale, the median should be used in determining the center of
the distribution. On the other hand, the mode is usually computed for data sets
which are mainly measured in the nominal scale of measurement. It is also
sometimes referred to as the nominal average. In a given data set, the mode can
easily be picked out by ocular inspection, especially if the data are not too many. In
some data sets, the mode may not be unique. The data set is said to be unimodal
if there is a unique mode, bimodal if there are two modes, and multimodal if there are
more than two modes. For continuous data, the mode is not very useful since here,
measurements (to the most precise significant digit) would theoretically occur only
once.

The mode is a more helpful measure for discrete and qualitative data with numeric
codes than for other types of data. In fact, in the case of qualitative data with numeric
codes, the mean and median are not meaningful.

The following diagram provides a guide in choosing the most appropriate measure of
central tendency to use in order to pinpoint or locate the center or the middle of the
distribution of the data set. Such a measure, being the center of the distribution,
‘typically’ represents the data set as a whole. Thus, it is crucial to use the
appropriate measure of central tendency.

KEY POINTS

• A measure of central tendency is a location measure that pinpoints the center or
middle value.
• The three common measures of central tendency are the mean, median and
mode.
• Each measure has its own properties that serve as basis in determining when
to use it appropriately.

ASSESSMENT (Do this!)

1. Thirty people were asked the question, “How many people do you consider your
best friend?” The graph below shows their responses.

[Figure: bar graph of the responses, with Frequency (0 to 12) on the vertical axis and Number of Best Friends (1 to 8) on the horizontal axis]

What measure of central tendency would you use to find the center for the
number of best friends people have? Explain your answer.

2. The mean age of 10 full time guidance counselors is 35 years old. Two new full
time guidance counselors, aged 28 and 30, are hired. Five years from now, what
would be the average age of these twelve guidance counselors?

3. Houses in a certain area in a big city have a mean price of PhP4,000,000 but a
median price of only PhP2,500,000. How might you best explain this?

4. Five persons were asked about the usual number of hours they spend watching
television in a week. Their responses are: 5, 7, 3, 38, and 7 hours.
a. Obtain the mean, median and mode.
b. If another person were to be asked the same question and he/she responded
200 hours, how would this affect the mean, median and mode?

5. For the senior high school dance, there is a debate going on among students
regarding the color that will be featured prominently. Votes were sent by
students via SMS, and the results are as follows:

Color                    Red   Green   Orange   White   Yellow   Blue   Brown   Purple
No. of Votes Received    300   550     70       130     220      710    35      5
a. Is there a clear winner on the choice of color?
b. Compute for the mean, median and modal color (if possible).
c. Why is it that we could or could not find each measure of the central tendency?
d. Which measure of central tendency will determine the color to be prominently
used during the senior high school dance?
6. Everyone studied very hard for the quiz in the Statistics and Probability Course.
There were 10 questions in the quiz, and the scores are distributed as follows:
Score Number of Students
10 8
9 12
8 6
7 5
6 3
5 2
4 0
3 1
2 1
1 0
0 2

a. Compute for the mean, median, and mode for this set of data. The
computation could be done as follows:

Score (xi)    Number of Students (fi)    xi × fi       Less Than Cumulative Frequency (< CF)
10            8
9             12
8             6
:             :
:             :
1             0
0             2
SUM           Sum = 40                   Sum = ___     ----

c. Suppose the teacher said “Everyone in the class will be getting
either the mean, median, or mode for their official score.”
i. What would students want to receive (mean, median, or mode)?
ii. Which would students want to receive the least (mean, median or mode)?
iii. What would be the fairest score to receive? Ask students to explain
their answers.

CHAPTER 2 : RANDOM VARIABLES
AND PROBABILITY DISTRIBUTIONS
Lesson 1: Probability

OVERVIEW OF LESSON
In this activity, learners initially review some basic concepts in Probability
that they may have learned prior to Grade 11. Then, they are taught extra
concepts on conditional probability. There are also discussions on the
classical birthday problem that show them how to compute for the chance or
probability of having at least two people in the classroom share the same
birthday.
LEARNING COMPETENCIES
At the end of the lesson, learners should be able to:

• define probability in terms of empirical frequencies


• show how to apply the General Addition Rule, and the Multiplication
Rule
• make use of a tree diagram for conditional probabilities

LESSON OUTLINE
A. Introduction / Motivation: What is Probability?
B. Main Lesson: Computing the Probability of an Event

REFERENCES
Many of the materials in this lesson were adapted from:

De Veaux, R. D., Velleman, P. F., and Bock, D. E. (2006). Intro Stats. Pearson Ed.
Inc.
Workbooks in Statistics 1, 11th Edition. Institute of Statistics, UP Los Baños,
College Laguna 4031
Probability and Statistics: Module 18. (2013). Australian Mathematical Sciences
Institute and Education Services Australia. Retrieved from
http://www.amsi.org.au/ESA_Senior_Years/PDF/Probability4a.pdf
DEVELOPMENT OF THE LESSON
A. Introduction / Motivation: What is Probability and How to Assign It?
Begin the session with a discussion on your uncertainty over summaries generated from
data, especially when data are “random” samples of a larger population of units (i.e.
people, farms, firms, etc).
Examples: (i) approval ratings or proportion of people voting for a candidate (in
an opinion poll); (ii) average family income (in the Philippine Statistics Authority’s
triennial Family Income and Expenditure Survey); (iii) average prices of
commodities (from sample outlets).
Explain that people can quantify uncertainty through the notion of PROBABILITY (or
Chance). Suggest to learners that if they were asked for the probability that they would
pass the next quiz, they may give a number between 0 percent and 100 percent. Typically, the
chances of a future outcome may be based on some past experience or data collected.
Very studious learners, for instance, may have passed their quizzes 100 percent of the time,
while average students may have passed their quizzes 85 percent of the time.
When considering probabilities of events, learners should be guided to consider a
particular context wherein possible outcomes are well defined and can be specified, at
least in principle, beforehand. This context is called a random process, wherein we do
not know which of the possible outcomes will occur, but we do know what is on the list
of possible outcomes. Learners can be informed that it can also be helpful to view the
probability of an event as its “long-run” empirical frequency or the fraction of times the
event may have occurred under repeated “trials” of the random process. In the next
lesson, we shall call this the “empirical probability,” and mention that in practice, we
expect these empirical probabilities to stabilize toward some “theoretical probability.”
This is called the law of large numbers.
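
The idea of a “long-run” empirical frequency can be illustrated with a small simulation. The sketch below assumes a fair coin (theoretical probability of heads equal to 0.5) and uses Python's pseudo-random number generator; the function name `empirical_frequency` and the seed value are choices made for this example only.

```python
import random


def empirical_frequency(n_tosses, seed=0):
    """Fraction of heads in n_tosses simulated tosses of a fair coin."""
    rng = random.Random(seed)   # fixed seed so the run can be repeated
    heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses


# The empirical frequency tends to settle near 0.5 as the number of tosses grows
for n in (10, 100, 1_000, 10_000, 100_000):
    print(n, empirical_frequency(n))
```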
Ask learners to think of random processes and an event where:
a. the outcome is certain. Examples may be getting a head (event) in the next toss
of a two-headed coin (random process) or getting a number of at most 6 (event)
when a die is thrown once (random process)
b. the outcome is impossible. Examples may be getting a tail (event) in the next
toss of a two-headed coin (random process) or getting a number greater than 6
(event) when a die is thrown once (random process)

c. the outcome has an even chance of occurring. Examples may be a couple
having a boy (event) as their next child (random process) or getting a red card
(event) when randomly selecting a card from a deck of cards (random process)
d. the outcome has a strong but not a certain chance of occurring. An example might be
getting a sum of at most 11 (event) when a pair of dice is thrown (random process)
Then, ask them the probability associated with these events. (Answers: 100 percent for
certain events, 0 percent for impossible events, and 50 percent for outcomes with even
chance of occurring. For the example in (d), there are 36 possible outcomes for tossing a
pair of fair dice, 35 of which have a sum of at most 11, so the chance of getting at
most 11 is 35/36.) The closer the value of the probability is to 1, the more likely the event
will occur, and the closer it is to 0, the less likely it will occur.

Important: Point out to learners the following properties of the probability of an event:

• the probability of an event is a non-negative value. In fact, it ranges from zero (0)
(when the event is impossible) to one (when the event is sure). The closer the
value to one, the more likely the event will occur
• the probability of the sure event is one (In other words, the chance of a sure
event is 100 percent).
• if A and B are mutually exclusive events, meaning it is impossible for these two
events to occur at the same time, then P(A or B) = P(A) + P(B). This is called the
Addition Rule.
A more general result (also called the General Addition Rule) states that:
P(A or B) = P(A) + P(B) - P(A and B)

Geometrically, from a Venn Diagram, the area of the union of A and B is the sum of the
areas of A and B, but this sum counts the intersection A ∩ B twice, so we need to subtract the
area of the intersection once from the sum of the areas of A and B.
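
A minimal sketch of the General Addition Rule, using a standard deck of 52 cards. The two events (A: the card is red; B: the card is a face card) are hypothetical examples chosen for illustration, not taken from the lesson.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = list(product(ranks, suits))   # 52 equally likely cards


def prob(event):
    """Probability of an event, given as a true/false function of a card."""
    return Fraction(sum(1 for card in deck if event(card)), len(deck))


def is_red(card):
    return card[1] in ("hearts", "diamonds")


def is_face(card):
    return card[0] in ("J", "Q", "K")


# Direct count of P(A or B) versus P(A) + P(B) - P(A and B)
p_union_direct = prob(lambda c: is_red(c) or is_face(c))
p_union_rule = prob(is_red) + prob(is_face) - prob(lambda c: is_red(c) and is_face(c))

print(p_union_direct, p_union_rule)   # 8/13 8/13
```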
Illustrate to learners that these properties can help us more readily compute for the
probabilities of events.
P(at most a sum of 12 when tossing a pair of fair dice) = P(at most a sum of 11
OR a sum of 12) = P (at most a sum of 11) + P(sum of 12)

But P(at most a sum of 12) = 1 and P (sum of 12) = 1/36;
Thus, when looking for the value of P (at most a sum of 11)
P (at most a sum of 11) = 1 – 1 /36 = 35/36
In general, if we are interested in $A^c$, the complement of an event A (i.e. the event that
happens when A does not), note that
$P(A \text{ or } A^c) = P(A) + P(A^c)$ and $P(A \text{ or } A^c) = P(\text{Sure event}) = 100\%$
Thus,
$P(A) + P(A^c) = 1$ or equivalently
$P(A^c) = 1 - P(A)$
In consequence, the chance that an event does not occur is one (1) minus the chance it
does occur.
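
The dice computation above can be verified by listing all 36 equally likely outcomes, as in the following sketch (variable names are illustrative).

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes when a pair of fair dice is tossed
outcomes = list(product(range(1, 7), repeat=2))

p_sum_12 = Fraction(sum(1 for a, b in outcomes if a + b == 12), len(outcomes))
p_at_most_11 = Fraction(sum(1 for a, b in outcomes if a + b <= 11), len(outcomes))

# Direct count agrees with the complement rule: P(at most 11) = 1 - P(sum of 12)
print(p_sum_12, p_at_most_11, 1 - p_sum_12)   # 1/36 35/36 35/36
```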
In terms of a Venn Diagram below, given a Sure Event S (represented by a square with
area 100%), and an event A (represented by the triangle whose area represents the
probability of A), then the chance of an event A not happening is one minus the chance
of event A happening (i.e. area of the square minus the area of the triangle).

Extra Notes: Mention also to learners that:
(1) Historically, probability was studied by gamblers who wanted to increase their
winnings (or at least decrease their losses).
(2) Probability describes random behavior, but does anything really happen at random?
Even Albert Einstein, when confronted by theories of quantum mechanics, was said to
have pointed out that “God does not play dice.” Yet, many events, especially in nature
“seem” to display random behavior. In many real life situations, we will be able to model
these by random processes and thus, apply probability to understand the behavior of
these situations.

B. Main Lesson: Computing Probabilities of Events


Mention to learners that the calculation of the probability of an event may sometimes
be considered directly from the nature of the phenomenon/random process, with some
assumptions of symmetry. Some underlying outcomes may be “equally likely” by
assumption such as fair coins and fair dice. In practice, these assumptions need to be
tested and will be the subject of inquiry in future lessons. These assumptions are
simplifications to help us calculate probabilities.
Example 1: Tell learners that a box contains green and blue chips. A chip is then drawn
from the box. If it is green, you win PhP100. If it is blue, you win nothing.
• Learners have a choice between two boxes:
- Box A with 3 blue chips and 2 green chips
- Box B with 30 blue chips and 20 green chips
• Which would learners prefer?
Some learners may say B, but tell learners that it actually should not matter, because
the chance of winning PhP100 is 2/5 = 40% in Box A, while in Box B, the chance of
winning is 20/50 = 40%, the same probability.
Conditional Probability
Mention to learners that sometimes, we may have extra information that can change
the probability of an event. Give the following definition of conditional probability.
The conditional probability of event A given that B has occurred is denoted as
P(A|B) and defined, provided P(B) > 0, as

$$P(A|B) = \frac{P(A \text{ and } B)}{P(B)}$$

Example 2: Suppose that we want to randomly select a student from among Grades 9 to
12 in a certain school

The chance of selecting a Grade 11 student, given that the student is male, can
be computed as follows:
Define events A and B as:
A = event that student selected is a Grade 11 student
B = event that student selected is male
Then

$$P(A|B) = \frac{P(A \text{ and } B)}{P(B)} = \frac{84/500}{185/500} = \frac{84}{185}$$


Example 3: A king comes from a family of two children. What is the chance that the king
has a sister?
Remind learners here that as the king comes from a family of two children, we are given
extra information that this family of two children has a boy, the king.

What we want to compute here is the probability that the sibling of the king is a girl.
Let B be the event of having at least one boy. So B={(b,b),(b,g),(g,b)}, where (x,y) gives
the sex of each child and the possible values are b for boy and g for girl. Let A be the
event that the king's sibling is a girl, A={(b,g),(g,b)}.
The original sample space of all possible outcomes is S={(b,b),(g,b),(b,g),
(g,g)}, and each outcome has a ¼ chance of occurring.
Thus, P (A | B) = P (A and B) / P(B) = (2/4) / (3/4) = 2/3
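
The same answer can be obtained by enumerating the sample space, as in this short sketch. It follows the example's assumption that the four outcomes are equally likely; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# Sample space: sex of each of the two children, b = boy, g = girl
S = list(product("bg", repeat=2))

B = [outcome for outcome in S if "b" in outcome]        # at least one boy
A_and_B = [outcome for outcome in B if "g" in outcome]  # one boy and one girl

p_B = Fraction(len(B), len(S))
p_A_and_B = Fraction(len(A_and_B), len(S))
p_A_given_B = p_A_and_B / p_B

print(p_A_given_B)   # 2/3
```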

Independent Events
Sometimes, the extra information provided may not really change the probability of an
event. In this case, the events are said to be independent. The conditional probability
of A given B may still be equal to the (unconditional) probability of event A.
Two events A and B are said to be independent if

P (A and B) = P (A) P (B)

This is also called the Multiplication Rule. Intuitively, we call events such as tossing a
coin (or dice) several times independent since future tosses are not affected by
previous outcomes.
If however, the events are not independent then we can still obtain the probability that
both events A and B will occur using the definition of conditional probability:

P (A and B) = P (A) P (B | A)

Example 4: Tell learners to suppose that there is a box that contains three tickets
marked 1, 2, and 3. We shake the box, draw out one ticket at random; shake the box
and draw out a second ticket. What would be the probability of getting a sum of “three” if
tickets were drawn with replacement? Without replacement?

The possible sums for the two tickets drawn with replacement are shown in a
contingency table and tree diagram below

In consequence, the probability of getting a sum of three is:

P (sum of three) = 2/9

while if the tickets were drawn without replacement, we have

P (sum of three) = 2/6 = 1/3
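
Both cases can be checked by listing the equally likely ordered draws. In the sketch below, `itertools.product` lists the draws with replacement and `itertools.permutations` lists the draws without replacement; the helper name `p_sum` is an assumption made for this example.

```python
from fractions import Fraction
from itertools import permutations, product

tickets = [1, 2, 3]

with_replacement = list(product(tickets, repeat=2))     # 9 ordered pairs
without_replacement = list(permutations(tickets, 2))    # 6 ordered pairs


def p_sum(draws, target=3):
    """Probability that the two tickets drawn add up to target."""
    return Fraction(sum(1 for d in draws if sum(d) == target), len(draws))


print(p_sum(with_replacement), p_sum(without_replacement))   # 2/9 1/3
```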

Exercise 5: (The Birthday Problem, originally posed by Richard von Mises in 1939,
reprinted in English in 1964) Mention to learners that in a room with at least 23
people, there is more than half a chance that at least two of them will have the same
birthday, and if there are more people, the chances increase further toward 100%
(about 99.9% with 70 people). Try it out with learners in your class.
Tell learners to identify how many of them have a birthday in January. Try to see if you
can get a match. Go to February if you don’t find anyone who matches. Then March, and
so forth.

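The chance of 2 people having different birthdays (assuming 365 equally likely birthdays) is:

$$\frac{365}{365} \times \frac{364}{365} = \frac{364}{365}$$

The chance of N people having different birthdays is:

$$\frac{365}{365} \times \frac{364}{365} \times \cdots \times \frac{365 - N + 1}{365}$$

So the chance that at least two of them will have the same birthday is:

$$1 - \frac{365}{365} \times \frac{364}{365} \times \cdots \times \frac{365 - N + 1}{365}$$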

We have the probabilities computed below for several values of N.
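
One way to compute such probabilities is sketched below in Python. It assumes 365 equally likely birthdays (leap years ignored); the function name `p_shared_birthday` and the particular values of N are choices made for this example.

```python
def p_shared_birthday(n, days=365):
    """Chance that at least two of n people share a birthday."""
    p_all_different = 1.0
    for k in range(n):
        # The (k + 1)-th person must avoid the k birthdays already taken
        p_all_different *= (days - k) / days
    return 1 - p_all_different


for n in (5, 10, 20, 23, 30, 50, 70):
    print(n, round(p_shared_birthday(n), 4))
```

The chance first exceeds one half at N = 23 and is already above 99.9% at N = 70.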

KEY POINTS
• Probability is a numerical representation of the likelihood of occurrence of an
event. Its value is between zero (0) and one (1). When the value approaches 1,
this means the event is very likely to occur, while a value close to zero (0) means
it is not likely to occur.
• When A and B are mutually exclusive events, then the probability of A or B is

P(A or B) = P(A) + P(B) (this is called the Addition Rule)

• If A and B are independent events, then the probability of A and B is

P(A and B) = P(A) P(B) (this is called the Multiplication Rule)

ASSESSMENT
1. What would be the probability of
a. picking a black card at random from a standard deck of 52 cards?
b. picking a face card (i.e. a king, queen, or jack)?
c. not picking a face card?

2. What is the probability of rolling, on a fair die:
a. a 3?
b. an even number?
c. zero?
d. a number greater than 4?
e. a number lying between 0 and 7?
f. a multiple of 3 given that an even number was rolled?

3. A standard deck of playing cards is well shuffled and from it, you are given two
cards. You can have 0, 1, or 2 aces: three possibilities altogether. So the
probability that you have two aces is equal to 1/3. What is flawed about this
argument?

4. You shuffle a deck of playing cards, and then start turning the cards one at a time.
The first one is black. The second one is also a black card. So is the third, and
this happens up to the 10th card. You start thinking, “the next one will likely be
red!” Are you correct in this reasoning?
5. The family of Tony delivers newspapers, one to each house in their village.

Philippine Star              250
Manila Standard Today        100
Philippine Daily Inquirer    300
Daily Tribune                 60
Manila Bulletin              150

a. the Manila Times?
b. the Manila Standard Today or the Philippine Daily Inquirer?
c. a newspaper other than Daily Tribune?

6. A class is going to play three games. In each game, some cards are put into a bag. Each card
has a square or a circle on it. One card will be taken out, then put back. If it is a circle, the
boys will get a point. If it is a square, the girls will get a point.
a. Which game are the girls least likely to win? Why?
b. Which game are the boys most likely to win? Why?
c. Which game are the girls certain to win?
d. Which game is impossible for the boys to win?
e. Which game is it equally likely that the boys or girls win?
f. Are any of the games unfair? Why?

7. In a computer ‘minefield’ game, ‘mines’ are hidden on grids. When you land randomly on a
square with a mine, you are out of the game.
a. The circles indicate where the mines are hidden on three different grids. On which
of the three grids is it hardest to survive?
b. Grid 1 above is a 3 by 6 grid with 6 mines. On which of the following grids is it hardest
to survive?
X. 99 mines on a 30 by 16 grid
Y. 40 mines on a 16 by 16 grid
Z. 10 mines on an 8 by 8 grid
Explain your reasoning.

CHAPTER 2: RANDOM VARIABLES
AND PROBABILITY DISTRIBUTIONS

Lesson 2: Random Variables


OVERVIEW OF LESSON
In this lesson, the concept of a random variable is discussed. The notion of a
statistical experiment is defined as well as random variables that relate to
experiments. Finally, two types of random variables, discrete and continuous,
are described.
LEARNING COMPETENCIES
At the end of the lesson, learners should be able to:
• illustrate/provide examples of random variables
• distinguish between discrete and continuous random variables
• find the possible values of a random variable

PRE-REQUISITE LESSONS
Types of Data (in particular, classifications of numerical variables) and
Probability

LESSON OUTLINE
A. Introduction/Motivation: The coin toss and breath-holding activities
B. Main Lesson:
I. Introduce the concepts of a statistical experiment and a random variable
II. Distinguish between discrete and continuous random variables and give
examples of random variables
C. Group Discussion
D. Enrichment

REFERENCES
De Veaux, R. D., Velleman, P. F., and Bock, D. E. (2006). Intro Stats. Pearson
Ed. Inc.
Workbooks in Statistics 1: 11th Edition, Institute of Statistics, UP Los
Baños, College Laguna 4031
Random Variables. Khan Academy. Retrieved from
https://www.khanacademy.org/math/probability/random-variables-topic/random_variables_prob_dist/v/random-variables
KEY CONCEPTS
Statistical Experiment, Outcomes, Random Variables, Discrete Random
Variables, Continuous Random Variables
MATERIALS NEEDED

• one-peso coin per student


• stop timer per group

DEVELOPMENT OF THE LESSON


A. Introduction/Motivation: The coin toss and breath-holding activities

Look at the current one-peso coin in circulation. It has Jose Rizal on one side, which
we will call Head (H), and the other side Tail (T). Ask learners to toss the one-peso
coin three times and record the results of the three tosses on their Activity Sheets,
using H for heads and T for tails. If needed, first define the head side and the tail
side of the coin. For example, if a learner tosses heads, tails, heads, then the
learner should write HTH in his/her notebook. Ask them to count the number of
heads that appeared and write it on their Activity Sheets as well.

Next, have all the learners hold their breath and record the time. This is best done
if they time it as accurately as possible (if possible, use a cell phone timer and
record up to the nearest hundredth of a second). If there is a limited number of
timers, do it in pairs, with one learner holding the timer while the other one is
holding his/her breath. Ask students to record the time on their Activity Sheets.

Then, record all the possible answers on the board for both activities. For the first
activity, write all eight possible outcomes, and then list down which one had zero
(0), 1, 2, or 3 heads. If you have time, tally the results. You can do this
systematically so that you do not get confused later on. Start with the outcomes with
zero (0) heads, then progress from there.

TTT
TTH
THT
HTT
THH
HTH
HHT
HHH

The breath-holding activity is a little more challenging. Expect a lot of possible
values; just write about 10 on the board. Then tell the learners that if they have
different values, they can raise their hands. Notice that the number of heads in the
first activity had only four possible values, while the value in the second activity is
almost unique to each individual. It may help to get the lowest and the highest
values recorded by students.

Emphasize the difference in the number of possible values in these two activities
as this is important in the discussion.

B. Main Lesson

I. Experiments and Random Variables

Begin the discussion with the definition of a Statistical Experiment: An activity that
will produce outcomes, or a process that will generate data. The outcomes have a
corresponding chance of occurrence. Examples are (a) tossing three coins
and counting the number of heads, (b) recording the time a person can hold his/her
breath, (c) counting the number of students in the classroom who are present today,
and (d) obtaining the height of a student.

Say that the two activities are examples of statistical experiments. Come up
with several examples, such as recording the results of an examination, asking
the weekly baon (or allowance) of students, identifying the waistline of students.

Emphasize that Statistical Experiments can have a few or a lot of possible
outcomes. In the coin toss example, there are eight possible outcomes. In the
breath example, there can be a lot of possible outcomes. However, learners can
indicate that the possible values are in a range, say from 10 seconds to 60 seconds
(ask for the shortest and the longest times in class and use those as the limits for
this example).

Suppose you give a learner candy based on the number of heads that appear in
the coin toss experiment (Remember: Giving a candy is optional). List down the
possible number of candies that can be given. Notice that it should only be zero (0),
1, 2, or 3. Then, you can list down all the outcomes of the experiment under each
value:

Number of candies Outcomes


0 TTT
1 TTH, THT, HTT
2 THH, HTH, HHT
3 HHH
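
The table above can be reproduced by enumerating the outcomes directly. The following is a minimal Python sketch, assuming a coin with faces labeled H and T; the function name `outcomes_by_heads` is illustrative and not part of the lesson materials.

```python
from itertools import product


def outcomes_by_heads(n_tosses=3):
    """Group every H/T sequence of length n_tosses by its number of heads."""
    groups = {k: [] for k in range(n_tosses + 1)}
    for seq in product("TH", repeat=n_tosses):
        outcome = "".join(seq)
        # The random variable maps each outcome to a number: its count of heads.
        groups[outcome.count("H")].append(outcome)
    return groups


if __name__ == "__main__":
    for heads, outcomes in outcomes_by_heads(3).items():
        print(heads, ", ".join(outcomes))
```

Each outcome lands in exactly one group, which mirrors the idea that a random variable assigns exactly one number to each outcome.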

Next, define a Random Variable: It is a way to map outcomes of a statistical
experiment determined by chance into numbers. It is typically denoted by a capital
letter, usually X.

X: outcome → number

A random variable is actually neither random nor a variable in the traditional sense in
which a variable is defined in an algebra class (where we solve for the value of a
variable). It is technically a function from the sample space of all possible outcomes
to the set ℝ of real numbers.

Tell students that a random variable must take exactly one value for each random
outcome. Generally, as with functions, a number of possible outcomes may have
the same value of the random variable, and in practice, this occurs frequently. For
instance, three outcomes above for tossing a coin thrice would have 1 candy, and
three outcomes would have 2 candies.

Learners need to understand that random variables are conceptually different from
the mathematical variables that they have met before in math classes. A random
variable is linked to observations in the real world, where uncertainty is involved.

Learners should be told that random variables are central to the use of probability in
practice. They help model random phenomena and are relevant to a wide range of
human activities and disciplines, including agriculture, biology, ecology, economics,
medicine, meteorology, physics, psychology, computer science, engineering, and
others. They are used to model outcomes of random processes that cannot be
predicted deterministically in advance (although the range of possible numerical
outcomes may be known).
In the coin example, we can define the random variable X to be the number of heads
that appear when tossing a coin three times. While we do not know in advance what
the specific outcome will be, we know that the possible values of X in this case are
zero (0), 1, 2, or 3. You can also define another random variable Y to be the time a
person can hold his/her breath. This variable can take any one of a very large
number of possible values.

In the second example, the possible values range between the lowest and the
highest value recorded by students. Notice that it is really difficult to list down all
the possible values. That is why in this example, it is better to state the possible
values as an interval, such as $10 \le Y \le 60$, if the lowest and highest values are
10 and 60, respectively.

II. Types of Random Variables:


Distinguish the two types of random variables, viz., discrete and continuous.

(a) Discrete Random Variables are random variables that can take on a finite (or
countably infinite) number of distinct values. Examples are the number of heads
obtained when tossing a coin thrice, the number of siblings a person has, the
number of students present in a classroom at a given time, the number of crushes a
person has at a particular time, etc.

Categorical variables can be considered discrete variables. For example, to record
whether a person has normal BMI or not, you can assign one (1) as the value for
normal BMI and zero (0) for not normal BMI. You can also assign numbers to
represent categorical variables with more than two categories. You can also use
ordinal variables, like how much they like adobo on a scale of 1 to 10 (where 1 means
favorable and 10 unfavorable).

(b) Continuous Random Variables, on the other hand, are random variables that
take an uncountably infinite number of possible values, typically measurable
quantities. Examples are the time a person can hold his/her breath, the height or
weight or BMI of a person (if measured very accurately), and the time a person
takes to bathe. The values that a continuous random variable can have lie
on a continuum, such as an interval.

Extra Notes:

• You can modify the experiment to just tossing a coin twice instead of
thrice to make things simpler. Here, the outcomes will be only four: HH,
HT, TH, TT, and the possible values of X are 0, 1, and 2.
• You may use other examples of continuous variables such as
height, weight, length, and age.
• Feel free to add more examples, or get examples from the seatwork
that is in the next section.

C. Group Discussion
Group learners into threes. Given the following experiments and random variables, ask
the groups to identify what the possible values of the random variables are. Also, for
each random variable, identify whether the variable is discrete or continuous.
(Answers in bold are Discrete, while answers in italics are Continuous)

1. Experiment: Roll a pair of dice
Random Variable: Sum of the numbers that appear on the pair of dice
2. Experiment: Ask a friend about preparing for a quiz in statistics
Random Variable: How much time (in hours) he/she spends studying for
this quiz
3. Experiment: Record the sex of family members in a family with four children
Random Variable: The number of girls among the children
4. Experiment: Buy an egg from the grocery
Random Variable: The weight of the egg in grams
5. Experiment: Record the number of hours one watches TV from 7 pm to
11 pm for the past five nights.
Random Variable: The number of hours spent watching TV from 7 pm
to 11 pm

D. Enrichment
In tossing a coin four times, how many outcomes correspond to each value of the
random variable?
What if the coin would be tossed five times? six times? seven times? eight times?
Try to relate the outcomes to the numbers in Pascal’s triangle.

For tossing the coin four times, there are five possible values,

0, 1, 2, 3, 4, with
1, 4, 6, 4, 1 outcomes, respectively.

For tossing the coin five times, there are six possible values,

0, 1, 2, 3, 4, 5, with
1, 5, 10, 10, 5, 1 outcomes, respectively.

In general, for n tosses of a coin, there are n+1 possible values, 0, 1, 2, 3, …, n. If
k is a possible value, then there are

$\binom{n}{k} = \dfrac{n!}{k!\,(n-k)!}$

outcomes associated with k.
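
The link to Pascal's triangle can be checked by brute force. The sketch below is illustrative only (the function name `count_outcomes` is hypothetical): it counts the head/tail sequences for each number of heads and compares the counts with the binomial coefficients from Python's `math.comb`.

```python
from itertools import product
from math import comb


def count_outcomes(n):
    """Count, by brute force, how many H/T sequences of length n have k heads."""
    counts = [0] * (n + 1)
    for seq in product("HT", repeat=n):
        counts[seq.count("H")] += 1
    return counts


if __name__ == "__main__":
    for n in range(4, 9):
        brute = count_outcomes(n)
        formula = [comb(n, k) for k in range(n + 1)]  # row n of Pascal's triangle
        print(n, brute, "matches formula:", brute == formula)
```

Brute-force enumeration is practical only for small n, since the number of sequences doubles with every additional toss; the formula has no such limit.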

Next, possibly read on probability distributions, which will be covered in the next
lesson.

KEY POINTS
• A Random Variable may be viewed as a way to map outcomes of a
statistical experiment determined by chance into numbers.
• There are two types of random variables:
o Discrete: takes on a finite (or countably infinite) number of values
o Continuous: takes an uncountably infinite number of possible
values, typically measurable quantities

ACTIVITY SHEET 02-02 (Do this!)

1. Toss a coin three times and record the results of the three tosses below.

(Use H for heads and T for tails.)

             First Toss     Second Toss     Third Toss
Outcome      __________     ___________     __________
2. Count the number of heads that appeared.

3. Write all possible outcomes for tossing a coin three times, and then count
the number of heads for each outcome. List them down.
4. Hold your breath and accurately record the time you held your breath. (If possible,
use a cell phone timer and record up to the nearest hundredth of a second.) Record
the time below: ________ seconds

ASSESSMENT (Do this!)

1. Identify possible random variables (if possible, two random variables) for
each of the following statistical experiments. If possible, identify whether each
variable is Discrete or Continuous.

a. Take a quiz
b. Ask the class about their breakfast
c. Ask a neighbor about television shows
d. Ask a friend about Facebook
e. Run 100m on the track

f. Ask a classmate about musical instruments
g. Visit the nearest market and look for poultry, such as
h. Ask your mother about the EDSA revolution
2. During a game of Tetris, we observe a sequence of three consecutive
pieces. For this exercise, suppose each piece has one of four possible
shapes, labeled here by the letters L, O, S, T. So in this random procedure,
we can observe a sequence such as STT, SOL, LLL and so on. Define:

• X to be the number of occurrences of “O” in a sequence of three
pieces. Then X can take the value 0, 1, 2 or 3. Fill in the table
below:

k    Outcomes (some are given already)    P(X=k) = no. of outcomes / 64
0    LST,
1    LOT,
2    LOO,
3    OOO

Note: There are 64 possible outcomes.

• Y to be the number of different shapes in a sequence of three
pieces. Then Y can take the value 1, 2 or 3.

k    Outcomes (some are given already)    P(Y=k) = no. of outcomes / 64
1    LLL,
2    LOL,
3    LOT,
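
For checking learners' tables, the 64 sequences can be enumerated by computer. The following is a hedged sketch, assuming the four shape labels L, O, S, T are equally likely and independent from piece to piece; the names `SHAPES` and `distributions` are illustrative.

```python
from collections import Counter
from itertools import product

SHAPES = "LOST"  # the four shape labels used in the exercise


def distributions():
    """Return the number of sequences and the outcome counts for X and Y."""
    sequences = ["".join(s) for s in product(SHAPES, repeat=3)]
    x_counts = Counter(seq.count("O") for seq in sequences)  # X: number of O's
    y_counts = Counter(len(set(seq)) for seq in sequences)   # Y: distinct shapes
    return len(sequences), x_counts, y_counts


if __name__ == "__main__":
    total, x_counts, y_counts = distributions()
    for k in sorted(x_counts):
        print(f"P(X={k}) = {x_counts[k]}/{total}")
    for k in sorted(y_counts):
        print(f"P(Y={k}) = {y_counts[k]}/{total}")
```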

CHAPTER 3: SAMPLING

Lesson 1: Coin Tossing Revisited from a Statistical Perspective

OVERVIEW OF LESSON

In this activity, learners revisit the coin tossing activity, but this time they look into how
the probability of getting a head is an unknown constant that needs to be estimated.
Other illustrations on sampling and estimation are also discussed.
LEARNING COMPETENCIES:

At the end of the lesson, the learner should be able to:


• describe random sampling
• distinguish between (population) parameter and (sample) statistic
• describe sampling distributions of statistics (sample mean)
• discuss the Central Limit Theorem
LESSON OUTLINE
A. Introduction / Motivation : A Coin Need Not Be Fair
B. Main Lesson: Estimation of Probability of Getting a Head in a Single Toss of a
Coin
C. Data Collection
D. Data Analysis and Interpretation
E. Enrichment
REFERENCES
Richardson, M. Using Dice to Introduce Sampling Distributions. STatistics Education
Web (STEW). Retrieved from
http://www.amstat.org/education/stew/pdfs/UsingDicetoIntroduceSamplingDistributions.doc
De Veaux, R. D., Velleman, P. F., and Bock, D. E. (2006). Intro Stats. Pearson Ed. Inc.
Workbooks in Statistics 1: 11th Edition. Institute of Statistics, UP Los Baños,
College Laguna 4031
Probability and statistics: Module 24. (2013). Australian Mathematical Sciences Institute
and Education Services Australia. Retrieved from
http://www.amsi.org.au/ESA_Senior_Years/PDF/InferenceProp4g.pdf
KEY CONCEPTS: Sampling, Estimation, Sampling Variation, Standard Error,
Central Limit Theorem

MATERIALS NEEDED: 1-peso coin per student

DEVELOPMENT OF THE LESSON

A. Introduction / Motivation: A Coin Need Not Be Fair

Learners may have heard of a “sample” of data being used: opinion polls which
estimate the fraction of voters who are likely to vote for a particular candidate in the
next presidential election; measurements of the heights and weights of senior
high school learners (done in the first chapter); or an experiment on a
sample of patients who are randomly allocated to either (a) a treatment group, which
is given some medical treatment, or (b) a control group, which is given a placebo
(such as a harmless salt solution) to control for the psychological effects of being
given a treatment.
The context here for sampling is to recall the coin-tossing experiment in Lesson 2-03
that involves tossing a one-peso coin, with the class getting either a head (H), the
face of Rizal on top, or a tail (T), the other side up. This time, however, it is crucial to
point out that the class does not assume beforehand that the coin is fair. That is, while
they may be able to assume that the probability of getting a head on a single toss of
a coin, P(H) = p, is a constant, it is not known; and thus, they would like to estimate
p. This makes the situation a statistical one that involves uncertainty.
Tell learners to suppose that they were to flip the coin n times. (Later on they will
begin to refer to this as taking a sample of size n.) The random variable X, the
number of heads, as defined before in Lesson 2-03, can take on values
{0, 1, 2, …, n}, and the number of outcomes corresponding to each value can be
read off Pascal’s Triangle. However, this time the outcomes are no longer
necessarily equally likely. Moreover, the probabilities cannot be computed exactly
because they are functions of p, which is not known to them.
Note to teacher: In the previous chapter, learners have learned that the
probability of independent events happening together is the product of
their probabilities. So, the probability of x heads and n-x tails in a specific sequence
is given by

$p^x (1-p)^{n-x}$

Moreover, there are $\binom{n}{x} = \dfrac{n!}{x!\,(n-x)!}$ (also written nCx) ways that x
heads can turn up out of n flips. Hence, the probability of observing x heads in n
tosses, given that the chance of getting a head on a single toss is p, is

$P(X = x) = \binom{n}{x}\, p^x (1-p)^{n-x}$ for x = 0, 1, 2, …, n
which is called the binomial probability mass function, or binomial pmf. As the
name implies, a pmf defines the probability mass corresponding to individual
values of the discrete random variable X. The binomial pmf depends on the value
of p, which may be assumed as constant but is unknown.
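
To see how the binomial pmf depends on p, the formula can be evaluated for a few trial values. This is a minimal sketch; the function name `binomial_pmf` and the trial value p = 0.6 are assumptions made for illustration and do not come from the lesson.

```python
from math import comb


def binomial_pmf(x, n, p):
    """P(X = x) for the number of heads in n independent tosses with P(H) = p."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


if __name__ == "__main__":
    n = 3
    for p in (0.5, 0.6):  # 0.6 is an arbitrary illustrative value
        probs = [binomial_pmf(x, n, p) for x in range(n + 1)]
        print(p, [round(q, 3) for q in probs], "sum =", round(sum(probs), 3))
```

With p = 0.5 the probabilities are proportional to the Pascal's Triangle counts; with any other p they are not, which is why the outcomes can no longer be treated as equally likely.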
Intuition should lead most, if not all, learners to consider the number of
heads X observed divided by the number of tosses (or sample size) as a reasonable
estimate for p; that is,

$\hat{p} = X/n$

is a natural estimate of the probability p of getting a head.

They know that $\hat{p}$ can take on values 0/n, 1/n, 2/n, …, n/n, but they will not know
which one until after the flipping is completed. Furthermore, they know that if the
experiment is repeated (flip the coin n times again), the observed X will not necessarily
be the same as in the previous one. Thus, X is no longer just a variable in the
mathematical sense; it is called a random variable (as was discussed in Lesson
2-03) because its value can change from one repetition to the next, and the change
cannot be predicted with certainty. Variability and the attendant uncertainty in the
result of the sampling experiment are thus introduced.
Inform learners that, as will be shown later in the course, the one experiment (of
n flips of a coin yielding a single outcome x) allows the class to
a) estimate the unknown probability p of getting a head (with x/n);
b) estimate the uncertainty in this estimate, i.e., the value of the “probable
error” in estimation from the sample, aware that the estimate is subject to
“sampling variation” or “sampling error.” For instance, the approval
ratings of the President obtained from an opinion poll of about 1,200
respondents randomly selected are theoretically within a margin of error of
about 3 percentage points (as will be illustrated later) from the actual
approval ratings.
c) compute an interval estimate from the sample, along with the chance that
the interval “captures” the unknown p. The uncertainty is still there, but it
can be measured using probability, and therein lies the connection
between statistics and probability (or mathematics).

B. Main Lesson: Estimation of Probability of Getting a Head in a Single Toss of a Coin
In discussing probability in the last chapter, we considered symmetry and
appropriate random mixing (such as shaking a die) to justify the assignment of
probabilities; for example, that the chance of rolling a two using a fair die is 1 out of
6. But knowing the probabilities of events, or even having a basis for assuming
particular values for probabilities, is actually not a common scenario. On the
contrary, we are often confronted with a situation such as a coin-tossing
experiment, where we know the size n of the random sample of units (or number of
trials), but we do not know the probability p of getting a head, and we would like to
estimate this constant but unknown quantity. We could extend this to the scenario
of estimating the
• percentage of voters who would be voting for a certain candidate in the
next election, or
• fraction of the population who are poor

One of the main reasons for studying probability distributions is that it provides the
foundation for making conclusions or inferences about unknown population
characteristics, such as p (on the basis of sample data). Inform learners that
generalizing results beyond the data collected, provided that the data collected is a
part (sample) of a large set of items (population), is known as statistical inference.

In the context of the two practical examples, we could get a random sample of

• voters, who can be asked about their current preference for voting in the
next election. We may be interested in using the sample to draw an
inference about the proportion of the population of voters who currently
prefer to vote for some candidate (and even profile these people in
relation to socioeconomic status, sex, age, or geographic location).
• respondents who can be asked about their income and/or expenditure, and
if some poverty line (that can be viewed as the minimum level of income or
expenditure required for a particular welfare level) is defined, we can draw
conclusions about the proportion of the population who are poor (and
consequently, describe the poor in relation to the non-poor).

Even without using any concepts from probability (discussed in the previous
chapter), learners should find it reasonable to think that a sample proportion should
tell us something about the population proportion (that is unknown). If we have a
“random sample” from the population, the sample is representative of the
population so we should be able to use the sample proportion as an estimate of the
population proportion. Provide some scenarios and ask what the estimate of p
would be for these scenarios:
• Flipping a coin 100 times and getting 52 heads. Ask learners what the
estimate of the probability of getting a head on a single toss would be.
The probability of getting a tail? They should say 52/100=0.52 and
48/100=0.48, respectively.

• Conducting an opinion poll of 1,200 randomly selected voters who
suggested these voting preferences:

              Metro Manila   Balance Luzon   Visayas   Mindanao   Total
Candidate X       195             197          115       261       768
Candidate Y       105             103          185        39       432
Total             300             300          300       300      1200

Ask learners what the estimated fraction of voting preference for
candidate X would be. Learners should see that nationally, the estimate is
768/1200=0.64 but the estimated proportions vary by geographic location.

• Conducting a sample survey of, say, 5 families selected randomly from a
list of families. They are asked to provide information on their monthly
family income and family size. Suppose that a family is poor if its
monthly per capita income, i.e. monthly total family income divided by
the family size, is less than Php 1,800.

Family   Monthly Total Family Income (Php)   Family Size   Per Capita Income (Php)
1                    40,000                       5               8000.00
2                    10,000                       6               1666.67
3                   100,000                       3              33333.33
4                     8,000                       8               1000.00
5                    75,000                       4              18750.00

Ask learners what the estimated proportion of families that are poor would be.
Learners should see that only the second and fourth families are poor, so 2/5=0.40=
40% of families are estimated to be poor.
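
The estimates in the poll and family scenarios above are simple sample proportions, and the arithmetic can be sketched as follows. The data are copied from the two tables; the variable and function names (`poll`, `families`, `proportion_for_x`, `proportion_poor`) are illustrative.

```python
# (votes for Candidate X, number of respondents) per area, from the poll table
poll = {
    "Metro Manila": (195, 300),
    "Balance Luzon": (197, 300),
    "Visayas": (115, 300),
    "Mindanao": (261, 300),
}

# (monthly total family income in Php, family size), from the family table
families = [(40_000, 5), (10_000, 6), (100_000, 3), (8_000, 8), (75_000, 4)]
POVERTY_LINE = 1_800  # Php per person per month, as supposed in the text


def proportion_for_x(poll_counts):
    """Overall sample proportion preferring Candidate X."""
    votes = sum(v for v, _ in poll_counts.values())
    total = sum(n for _, n in poll_counts.values())
    return votes / total


def proportion_poor(family_data, line):
    """Share of sampled families whose per capita income is below the line."""
    poor = [income / size < line for income, size in family_data]
    return sum(poor) / len(poor)


if __name__ == "__main__":
    for area, (votes, n) in poll.items():
        print(f"{area}: {votes / n:.3f}")
    print("National:", proportion_for_x(poll))
    print("Estimated proportion poor:", proportion_poor(families, POVERTY_LINE))
```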

C. Data Collection Activity

Give the learners the Activity Worksheet 3-01. Ask learners: If you were to toss a
coin for an extremely large number of times, what proportion of the tosses will be
heads? Of course, they will answer 1/2. Explain to learners that they are assuming
that the coin is fair. The goal of this activity is to estimate the proportion of tosses
that would result in a “head” and then, to examine the distribution of estimates in
repeated sampling.
Have learners work individually using the data collection procedure described on
the Activity Worksheet. Learners must determine the sample proportion of tosses
that would yield a “head” for each of the sample sizes of n=5, 10, 25, and 50.
Individual results are recorded on the Worksheet, and each student will write
individual results on the blackboard (or on a worksheet of a spreadsheet application
on a computer) in an appropriately labeled column.
For a 45-student class, there should be 45 sample proportion values for each of the
sample sizes. Collecting this data provides learners the opportunity to participate in an
example of obtaining repeated samples. Calculating the proportion of “heads” yielded
for each of the sample sizes helps to reinforce the idea of a sample proportion being a
random variable whose value changes from sample to sample.
After the individual sample proportions have been computed and the results copied
onto the blackboard (or a computer worksheet), ask learners to input the class data
into the Class Data Table on the Activity Worksheet. Based on the generated data,
construct a Stem and Leaf Display (you may also use bar graphs). From this graph,
ask learners to describe the distribution of the values that they generated.
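
If coins or class time are limited, the data collection can also be imitated on a computer. The sketch below is only a simulation under stated assumptions: a class of 45 learners, a simulated coin with P(head) = 0.5, an arbitrary random seed, and default sample sizes taken from the Individual Data Table (adjust `sizes` as needed). The function names are hypothetical.

```python
import random


def sample_proportion(n, p=0.5, rng=random):
    """Toss a simulated coin n times (P(head) = p); return the proportion of heads."""
    heads = sum(rng.random() < p for _ in range(n))
    return heads / n


def simulate_class(n_students=45, sizes=(5, 10, 25, 50), p=0.5, seed=1):
    """Return {sample size: list with one sample proportion per student}."""
    rng = random.Random(seed)  # fixed seed so the run can be repeated
    return {n: [sample_proportion(n, p, rng) for _ in range(n_students)]
            for n in sizes}


if __name__ == "__main__":
    for n, props in simulate_class().items():
        print(f"n={n}: min={min(props):.2f}, max={max(props):.2f}")
```

A simulation does not replace the hands-on activity, because it builds in a value of p that real coin tossing leaves unknown.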

D. Data Analysis and Interpretation

The figures below provide an example of results that can serve as a model. Here, 35
learners participated in producing the example data set. Stem and Leaf Displays have
been constructed for the proportions of tosses that yielded a ‘head’ for each of the four
sample sizes (5, 10, 25, and 50). By examining the class data, learners will begin to
discover how statistics differs from mathematics since statistics involves uncertainty.

Stem and Leaf Display for samples of size n=5
data rounded to nearest multiple of .1
plot in units of .1

0* | 0
0t | 22222
0f | 44444444444
0s | 66666666666
0. | 8888
1* | 000

Stem and Leaf Display for samples of size n=10
data rounded to nearest multiple of .1
plot in units of .1

0t | 222
0f | 4444444555555555
0s | 666666677777777
0. | 8

Stem and Leaf Display for samples of size n=25
data rounded to nearest multiple of .01
plot in units of .01

2. | 888
3* | 2
3. | 6
4* | 0000044444
4. | 88888
5* | 22222
5. | 6
6* | 00004
6. | 88
7* | 22

Stem and Leaf Display for samples of size n=50
data rounded to nearest multiple of .01
plot in units of .01

3* | 4
3. | 88
4* | 002222444
4. | 6688
5* | 00022244
5. | 666666
6* | 044
6. | 68

Figure 3-01.1. Stem and Leaf Display of Example Class Data

The example class data in Figure 3-01.1 is used for purposes of illustration, and
may serve as a prototype for the actual class data. For each sample size, learners
should construct a stem and leaf display (or any graphical representation of the
distribution such as a histogram) of the sample proportion values and describe the
shape, center, and spread of the distribution of values. For samples of size 5 and
size 10, there are only a few different values of the sample proportion, so the shape
is difficult to determine. It appears that the centers of the distribution for samples of
size 5 and size 10 are both at around 0.50. And, the sample proportion values range
from 0 to 1 for samples of size 5, and from 0.2 to 0.8 for samples of size 10. For
samples of size 25, different values are obtained for the sample proportion of
“heads.” The distribution has a center again at around 0.50. And, the sample
proportions range from 0.28 to 0.72. For samples of size 50, the distribution of the
sample proportions appears roughly like a normal curve, i.e. mound-shaped (with a
slight rightward skew). The center is at around .50. The sample proportion values
range from 0.34 to 0.68.
For each sample size, ask learners to calculate the mean and standard deviation of
the sample proportions. The calculated values of the mean of the sample proportion
distributions are 0.52, 0.529, 0.489, and .50, respectively for samples of size 5, 10,
25, and 50. The calculated values of the standard deviation of the sample proportion
distributions are 0.24, 0.15, 0.12, and 0.09, respectively for samples of size 5, 10,
25, and 50.
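
The means and standard deviations quoted above can be recomputed from the example data. The sketch below is an illustration: the (value, count) pairs are read from the leaves of Figure 3-01.1, with stems read so that every value falls within the ranges stated in the text, and the sample standard deviation (divisor 34) is assumed. The names `example_data`, `expand`, and `summarize` are hypothetical.

```python
from statistics import mean, stdev

# (sample proportion, number of learners) read from Figure 3-01.1
example_data = {
    5: [(0.0, 1), (0.2, 5), (0.4, 11), (0.6, 11), (0.8, 4), (1.0, 3)],
    10: [(0.2, 3), (0.4, 7), (0.5, 9), (0.6, 7), (0.7, 8), (0.8, 1)],
    25: [(0.28, 3), (0.32, 1), (0.36, 1), (0.40, 5), (0.44, 5), (0.48, 5),
         (0.52, 5), (0.56, 1), (0.60, 4), (0.64, 1), (0.68, 2), (0.72, 2)],
    50: [(0.34, 1), (0.38, 2), (0.40, 2), (0.42, 4), (0.44, 3), (0.46, 2),
         (0.48, 2), (0.50, 3), (0.52, 3), (0.54, 2), (0.56, 6), (0.60, 1),
         (0.64, 2), (0.66, 1), (0.68, 1)],
}


def expand(pairs):
    """Turn (value, count) pairs into a flat list of values."""
    return [value for value, count in pairs for _ in range(count)]


def summarize(pairs):
    """Return (number of values, mean, sample standard deviation)."""
    values = expand(pairs)
    return len(values), mean(values), stdev(values)


if __name__ == "__main__":
    for n, pairs in example_data.items():
        count, m, s = summarize(pairs)
        print(f"n={n}: {count} learners, mean={m:.3f}, sd={s:.2f}")
```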
Ask learners to think about the relationship between the center of the distribution
of the sample proportions and the value of the population proportion. Learners
should note that the distribution of sample proportion values is centered on
the value of the population proportion (here, approximately 1/2 = 0.50).
Tell learners to think about the relationship between the sample size and the shape
of the distribution of the sample proportion. Learners should note that as the sample
size increases, the distribution of the sample proportion tends more towards a
normal distribution. (This is known as the “Central Limit Theorem”).
Ask learners: For which sample size is the standard deviation of the sample
proportion values the largest and for which sample size is the standard deviation
the smallest? Ask them why they think this happens. Learners should observe that
the variability of the sample proportion values, whether measured from the range or
the standard deviation, is related to the sample size. A larger sample size results in
smaller variability (smaller range and smaller standard deviation) in the sample
proportion values.
The results from the analyses of the repeated sampling can lead to a discussion on
the theoretical properties of the sampling distribution of a sample proportion. The
mean value of the distribution of a sample proportion for repeated random samples
of size n, drawn from the same population, is equal to the corresponding value of
the proportion of the population, p . (Here, the value of p seems to be 0.5). The
standard deviation of the distribution of a sample proportion for repeated random
samples of size n, drawn from the same population, decreases with an increase in
the sample size. The theoretical standard deviation formula of a sample
proportion, also called the “standard error,” can now be introduced:

$\sqrt{\dfrac{p(1-p)}{n}}$

The standard error is inversely proportional to the square root of the sample size. In
consequence, the bigger the sample size (i.e., the number of tosses of the coin),
the less variability there will be in the estimates.

Technical Note: The function $p(1-p)$ is maximized when $p = 1/2$, so the standard
error of the sample proportion is at most $\frac{1}{2\sqrt{n}}$. As will be shown in later
lessons, organizations that conduct opinion polls have 95% confidence that the
opinion polls have margins of error, i.e. twice the standard error of a sample
proportion, of at most 3 percentage points. They design the polls so that
$2 \cdot \frac{1}{2\sqrt{n}} = \frac{1}{\sqrt{n}} \le 0.03$, or equivalently, the minimum
sample size n satisfies $n \ge \frac{1}{(0.03)^2} \approx 1{,}111$. This is why
these organizations use sample sizes of about 1,200.
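
The sample-size reasoning in the Technical Note can be checked numerically. This is a minimal sketch that takes the margin of error to be twice the standard error, as in the note; the function names are illustrative.

```python
from math import ceil, sqrt


def standard_error(p, n):
    """Standard error of a sample proportion: sqrt(p(1 - p) / n)."""
    return sqrt(p * (1 - p) / n)


def max_margin_of_error(n):
    """Twice the largest possible standard error, attained at p = 1/2."""
    return 2 * standard_error(0.5, n)  # equals 1 / sqrt(n)


def min_sample_size(margin):
    """Smallest n whose worst-case margin of error is at most `margin`."""
    return ceil(1 / margin**2)


if __name__ == "__main__":
    print("Minimum n for a 3-point margin:", min_sample_size(0.03))
    print("Worst-case margin at n = 1200:", round(max_margin_of_error(1200), 4))
```

Because p = 1/2 is the worst case, a poll designed this way stays within the target margin whatever the true proportion turns out to be.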

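Learners will also observe that the standard error is dependent on p, and thus, we may
instead estimate it with $\sqrt{\hat{p}(1-\hat{p})/n}$, where $\hat{p} = x/n$ is the sample
proportion, provided that the sample size is large enough (usually judged by requiring
both $n\hat{p}$ and $n(1-\hat{p})$ to exceed a rule-of-thumb threshold).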

Inform learners that the distribution of a sample proportion for repeated random
samples of size n, drawn from the same population, will approximately follow a normal
(bell-shaped) distribution.
Note: The activity above can be done with other “simulations” of situations, e.g.,
consider tossing a die and observing the proportion of times that the upward face of
the die yields a “four” or a “five” (which should be expected to be 2/6 = 1/3 , give or
take some random variation).

E. Enrichment

After learners have been introduced to the formula for a confidence interval for the
population proportion, data collected from the coin-tossing activity for n = 50 can be
used. Learners can construct a 95% confidence interval for the proportion of
tosses resulting in a “head”:

$\hat{p} \pm 2\sqrt{\dfrac{p(1-p)}{n}}$

which can be approximated as

$\hat{p} \pm 2\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}$

since p is unknown. Here $\hat{p} = x/n$ is the sample proportion of heads.

Note that since each student’s sample will be unique, after constructing their 95%
confidence intervals, learners can be asked to put the results onto the blackboard.
This way, the instructor can have a discussion on what confidence level means
(examining the percentage of the different confidence intervals that include ½ =
0.5). About 95 percent of learners can be expected to have confidence intervals that
hit the target of 0.5, while about 5 percent of the intervals will not.
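
A learner's interval can be computed with a few lines of code. The sketch below assumes the approximate interval of the sample proportion plus or minus twice the estimated standard error (a multiplier of 1.96 is also common); the function name `confidence_interval` and the example count of 28 heads are made up for illustration.

```python
from math import sqrt


def confidence_interval(heads, n, multiplier=2.0):
    """Approximate 95% interval: p_hat +/- multiplier * sqrt(p_hat(1 - p_hat)/n)."""
    p_hat = heads / n
    margin = multiplier * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


if __name__ == "__main__":
    # 28 heads out of 50 tosses is a hypothetical result, not class data
    low, high = confidence_interval(heads=28, n=50)
    print(f"({low:.3f}, {high:.3f})", "covers 0.5:", low <= 0.5 <= high)
```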
KEY POINTS

1. The probability p of getting a “head” in a single toss of a coin need not be 50%,
but it is an unknown number, which you can estimate by flipping a coin n times
and noting the number x of times you get a “head,” thus yielding an estimate
$\hat{p} = x/n$ of p.

2. If several learners were to obtain estimates from n tosses of the coin, the estimates
will not be the same; they will have “sampling” variability. The standard
deviation of the estimates, called the standard error, is $\sqrt{p(1-p)/n}$. This
standard error is dependent on p, and thus, we may instead estimate it with
$\sqrt{\hat{p}(1-\hat{p})/n}$. The standard error is inversely proportional to the square
root of the sample size. In consequence, the bigger the sample size (i.e., the
number of tosses of the coin), the less variability we will have in the estimates.

3. As the number of tosses increases, the distribution of estimates looks more and
more like a normal curve. (This is known as the “Central Limit Theorem,”
wherein the sample proportion has a sampling distribution whose shape can be
approximated by a normal curve, whose center is the value of the population
proportion and whose standard deviation is $\sqrt{p(1-p)/n}$. The larger the
sample, the better the approximation will be.)

ACTIVITY SHEET NUMBER 3-01 (Do This!)

Using Coin Tosses to Introduce Sampling Distributions Activity Sheet
Suppose that you were to toss a regular one-Php coin a large number of times.
What proportion of the tosses will yield a “head”? p = ______________.

Individual Data Table (use two decimal places)

                                             5 Trials   10 Trials   25 Trials   50 Trials
Number of Tosses Resulting in “Heads”        ________   _________   _________   _________
Proportion of Tosses Resulting in “Heads”    ________   _________   _________   _________

Copy your sample proportions (use two decimals) onto the blackboard in
the appropriately labeled column.

Input the class proportion of tosses resulting in a “head” into the Class Data Table.
Class Data Table: Class Proportion of Tosses Resulting in a “Head”

Sample   n = 5   n = 10   n = 25   n = 50      Sample   n = 5   n = 10   n = 25   n = 50
1                                              21
2                                              22
3                                              23
4                                              24
5                                              25
6                                              26
7                                              27
8                                              28
9                                              29
10                                             30
11                                             31
12                                             32
13                                             33
14                                             34
15                                             35
16                                             36
17                                             37
18                                             38
19                                             39
20                                             40

Use the Class Data to answer the following questions. Remember that p = 1/2.

1. For each sample size, construct a stem-and-leaf display/histogram of the
sample proportion values and describe the shape, center, and spread of the
distribution of values.
2. Based on your stem-and-leaf displays/histograms, what do you think is the
relationship between the center of the distribution of the sample proportions and
the value of the population proportion?
3. Based on your stem-and-leaf displays/histograms, what do you think is the
relationship between the sample size and the shape of the distribution of the
sample proportion?
4. For each sample size, calculate the mean and standard deviation of the sample
proportions and write a sentence to interpret the standard deviation.

Sample Size   Mean   Standard Deviation
n = 5
n = 10
n = 25
n = 50

5. For which sample size is the standard deviation the largest and for which
sample size is the standard deviation the smallest? Why do you suppose this
happens?

CHAPTER 3: SAMPLING

Lesson 2: The Need for Sampling


TIME FRAME: 120 minutes

OVERVIEW OF LESSON

In this lesson, learners are given lectures (and assessments) regarding sampling—basic
concepts, discussions on why it is important to sample, descriptions of different types of
samples, as well as kinds of survey errors.
LEARNING COMPETENCIES

At the end of the lesson, the learner should be able to:

• Define random sampling
• Give reasons for sampling
• Distinguish between parameter and statistic
• Recognize the value of randomization as a defense against bias
• Identify that the size of the sample (not the fraction of the population) determines
the precision of estimates from a probability sample
LESSON OUTLINE

A. Motivation: What is a Survey and Why do we use Sampling?
B. Lesson Proper
1. Probability Sampling
2. Non-probability Sampling
3. Survey Errors
4. Sampling Distribution, Accuracy, and Precision
C. Data Collection
D. Data Analysis and Interpretation
E. Enrichment

REFERENCES

Albert, J. R. G. (2008). Basic Statistics for the Tertiary Level (ed. Roberto Padua,
Welfredo Patungan, Nelia Marquez). Philippines: Rex Bookstore.
De Veaux, R. D., Velleman, P. F., and Bock, D. E. (2006). Intro Stats. Pearson Ed. Inc.
Workbooks in Statistics 1: 11th Edition, Institute of Statistics, UP Los Baños,
College, Laguna 4031

KEY CONCEPTS: Sampling, Estimation, Bias, Sampling Variation, Randomization

DEVELOPMENT OF THE LESSON

A. Motivation: What is a Survey and Why do we use Sampling (rather than
full enumeration)?
In Chapter 1, discussions on describing data assumed that data come from a
population of interest. When the recording of information of an entire population is
conducted, this is called a census. An example of this is collecting the grades of all
the Grade 11 learners, or the decennial population census done by the Philippine
Statistics Authority (PSA). However, in most cases, censuses involve great
challenges. Also, one does not need to do a full count to get information, especially
on flow data, such as agricultural production, household expenditure, and
establishment income. This brings us to sampling, which is the process of selecting
a section of the population.
Learners may have heard of sample surveys, especially opinion polls conducted before
an election. Ask a few learners to state the number of minutes they spend to get to
school in the morning. Then, after asking these few individuals, describe to them the
typical time it takes learners to get to school (such as the average time). Ask learners
if the descriptive statements you made are valid or not.
Next, define:

• a sample survey as a method of systematically gathering information on
a segment of the population, such as individuals, families, wildlife, farms,
business firms, and unions of workers, for the purpose of inferring
quantitative descriptors of the attributes of the population.
The fraction of the population being studied is called a sample.

Learners may wonder why people don’t just survey everyone instead and why they
“trust” opinion polls when these only interview 1,600 respondents and not the
actual millions of Filipinos who will be voting on election day. Learners should be
made aware that there are many reasons why we resort to sampling.
• Cost. A sample often provides useful and reliable information at a much lower
cost than a census. For extremely large populations, the conduct of a census
can even be impractical. In fact, the difficulty of analyzing complete census
data led to summarizing a census by taking a “sample” of returns.

• Timeliness. A sample usually provides more timely information because
fewer data are to be collected and processed. This attribute is particularly
important when information is needed quickly.
• Accuracy. A sample often provides information as accurate as, or more
accurate than, a census, because data errors typically can be
controlled better in smaller tasks.
• Detailed information. More time is spent in getting detailed information
with sample surveys than with censuses. In a census, we can often only
obtain stock, not flow data. For instance, agricultural production cannot be
generated from censuses.
• Destructive testing. When a test involves the destruction of an item,
sampling must be used. Battery life tests must use sampling
because something must be left to sell!

Inform learners that conducting a full census of voters can be quite costly and
besides, this is already done on Election Day itself. Only in rare cases is a full
enumeration census of the population taken. For instance, the PSA conducts the
Census of Population and Housing every ten years, typically when the year ends in
0, although in 1995, 2007, and 2015, the PSA also conducted mid-decade
censuses. The financial costs of conducting censuses and processing their results
are quite large (compared to sample surveys).
Explain that in a sample survey, we can generate flow information that describes
characteristics of the subject covering a period of time. For instance, agricultural
production is collected not in an agriculture census but in a sample survey of
agricultural households and establishments. A sample survey covers more detailed
information on the unit of inquiry than that of a census, and is also less expensive to
conduct than a census.
Sampling theory, developed a century ago, has shown that one does not need to
conduct a census to obtain information, i.e., conducting a sample survey will do just
as well. Look at it this way: One does not need to finish drinking a pot full of coffee
to know if the coffee tastes good. A cup or even a sip will do, provided the “sample”
is taken in a “fair manner.” Even hospitals only extract blood samples from patients
for medical tests rather than extracting all the blood of the patient to determine
whether or not the patient gets a clean bill of health. What is crucial is to design a
sample survey that is representative of the population it intends to
characterize. Typically, representativeness in a sample survey is best ensured
when chance methods are used for selecting respondents.

Even sample surveys conducted by the PSA (household surveys, establishment
surveys, and agricultural surveys that may involve households and establishments)
also use chance methods to select their survey respondents.
B. Lesson Proper

1. Probability Sampling

If data are to be used to make decisions about a population, then how the data
are collected is critical. For sample data to provide reliable information about a
population of interest, the sample must be representative of that population.
Selecting samples from the population using chance helps make the samples
representative.
If a sample survey gives every member of the population a
known, nonzero chance of being selected into the sample, then the resulting sample
is called a probability sample. Probability samples are meant to ensure that the
segment taken is representative of the entire population. Examples of these include
the Family Income and Expenditure Survey (FIES), the Labor Force Survey, and the
Quarterly Survey of Establishments, all conducted by the PSA. Opinion polls
conducted by some non-government organizations with track records such as the
Social Weather Stations and Pulse Asia, likewise use chance methods to select
their survey respondents. Data collected from these probability sampling-based
surveys yield estimates of characteristics of the population that these surveys
attempt to describe.
Basic Types of Probability Sampling

a. Simple random sampling (SRS) involves allowing each possible sample to
have an equal chance of being picked, so that every member of the population has
an equal chance of being included in the sample. Selection may be with
replacement (the selected individual or unit is returned to the frame for possible
reselection) or without replacement (the selected individual or unit isn’t returned to
the frame). This sampling method requires a listing of the elements of the
population called the sampling frame. In the case of agricultural surveys or
surveys of establishments, the sampling frame may either be based on a list
frame, or an area frame, or a mixture. Samples may be obtained from the table
of random numbers or computer random number generators.

b. Stratified sampling is an extension of simple random sampling which
allows for different homogeneous groups, called strata, in the population to
be represented in the sample. To obtain a stratified sample, the population is
divided into two or more strata based on common characteristics. An SRS is
then used to select from each stratum, with sample sizes proportional to strata
sizes. Samples from the strata are then combined into one. This is a
common technique when sampling from a population of voters, stratifying
across racial or socio-economic classes. When thinking of using stratification,
the following questions must be asked:
1. Are there different groups within the population?
2. Are these differences important to the investigation?

Figure 3-02.1 Illustration of Stratified Sampling

If the answer to both questions is yes, then stratified sampling is necessary.

Explanatory Note: Usually, stratified sampling is done when the population is
divided into several subgroups with common characteristics. The population
may be divided into urban and rural locations (as dwellings in rural areas may
tend to be more homogeneous than dwellings in urban areas); the student
population may be divided by the year level of learners; or the workers in a
hospital may be categorized by their different occupations: nurse, doctor,
janitor, secretary.

c. In systematic sampling, elements are selected from the population at
a uniform interval that is measured in time, order, or space.

Figure 3-02.2 Illustration of Systematic Sampling

Typically, a desired sample size n is decided on first. The frame of
N units is then divided into n groups of k units each, where k = N/n. Then, one unit is
randomly selected from the first group, with every kth unit thereafter also
selected. For instance, in Figure 3-02.2, consider the population of 20 trees.
If the sample size is 4, then the frame is divided into 4 groups of k = 20/4 = 5
trees each. Suppose that the fourth item is chosen in the first group; then every
fifth unit thereafter (the 9th, 14th, and 19th trees) is also
chosen.
d. Cluster sampling divides the population into groups called clusters,
selects a random sample of clusters, and then subjects the sampled clusters
to complete enumeration, that is, everyone in the sampled clusters is made
part of the sample.

Figure 3-02.3 Illustration of Cluster Sampling

Explanatory Note: Clusters in the population may be based on convenience in
the collection of data. For example, in a village, clusters can be blocks of houses.
In a school, the clusters can be the sections. In a dormitory, clusters can be the
rooms. In a city or municipality, the clusters can be the different barangays.
Cluster sampling is conducted so that data collected need not come from a huge
geographic range, thus saving resources. For instance, instead of getting a
simple random sample of households from all over a town, clusters of dwellings
can be selected from different barangays so that the cost of data collection can
be minimized.
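
The selection steps for simple random, systematic, and cluster sampling described above can be sketched with a computer random number generator. This is only an illustration: the function names, the fixed seed, and the made-up village of four blocks are assumptions, and the systematic version assumes that the frame size is a multiple of the sample size.

```python
import random

rng = random.Random(2024)  # fixed seed so the illustration is repeatable


def simple_random_sample(frame, n):
    """Draw n units without replacement; every subset of size n is equally likely."""
    return rng.sample(frame, n)


def systematic_sample(frame, n, start=None):
    """Take every k-th unit after a random start in the first group of k units.

    Assumes the frame size N is a multiple of n, so k = N / n is a whole number.
    `start` is the 1-based position picked within the first group.
    """
    k = len(frame) // n
    if start is None:
        start = rng.randint(1, k)
    return frame[start - 1::k][:n]


def cluster_sample(clusters, m):
    """Pick m clusters at random, then include every unit in the chosen clusters."""
    chosen = rng.sample(sorted(clusters), m)
    return [unit for label in chosen for unit in clusters[label]]


trees = list(range(1, 21))  # the 20 trees of Figure 3-02.2, numbered 1 to 20
print(systematic_sample(trees, 4, start=4))  # [4, 9, 14, 19]
print(simple_random_sample(trees, 4))

# Hypothetical village: four blocks (A to D) of five houses each.
blocks = {b: [f"{b}{i}" for i in range(1, 6)] for b in "ABCD"}
print(cluster_sample(blocks, 2))
```

Note the contrast in what is randomized: individual units in simple random sampling, only the starting point in systematic sampling, and whole groups in cluster sampling.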

Example:

Suppose you want to estimate the mean grade point average (GPA) of learners at a
certain higher educational institution. You decide that an appropriate sample size is n =
100. To estimate the mean GPA, you can use simple random sampling to select 100
learners and average their GPAs. Since freshman GPAs tend to be lower than senior
GPAs, you may want to make sure that all year levels are represented, so you decide to
use a stratified sample.

According to the university’s registrar, the student population consists of 35% freshmen,
30% sophomores, 20% juniors, and 15% seniors. Get samples from each stratum,
proportional to its size. Specifically, take simple random samples of 35 freshmen, 30
sophomores, 20 juniors, and 15 seniors. Then, average the GPAs of the learners to
estimate the GPA of the entire university.

Instead of year level, you can also form subgroups of the student population based on their
academic major, assuming that each student is assigned one major. When stratifying into
subgroups, the subgroups must be mutually exclusive. If they are not, then some subjects
will have a higher chance of being chosen since they belong in more than one subgroup.
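
The proportional allocation in the example can be sketched as follows. The shares come from the example; the university of 2,000 learners, the GPA values, the seeds, and the function names are made up purely for illustration.

```python
import random

# Shares of each year level reported by the registrar (in percent).
shares = {"freshmen": 35, "sophomores": 30, "juniors": 20, "seniors": 15}


def proportional_allocation(shares, n):
    """Split a total sample size n across strata in proportion to their shares."""
    return {stratum: round(n * pct / 100) for stratum, pct in shares.items()}


def stratified_sample(strata, allocation, seed=None):
    """Draw a simple random sample of the allocated size from each stratum."""
    rng = random.Random(seed)
    return {name: rng.sample(units, allocation[name]) for name, units in strata.items()}


allocation = proportional_allocation(shares, 100)
print(allocation)  # {'freshmen': 35, 'sophomores': 30, 'juniors': 20, 'seniors': 15}

# Hypothetical university of 2,000 learners with made-up GPAs.
gpa_rng = random.Random(0)
sizes = {"freshmen": 700, "sophomores": 600, "juniors": 400, "seniors": 300}
strata = {name: [round(gpa_rng.uniform(1.0, 4.0), 2) for _ in range(size)]
          for name, size in sizes.items()}

sample = stratified_sample(strata, allocation, seed=1)
gpas = [g for units in sample.values() for g in units]
estimate = sum(gpas) / len(gpas)  # estimated mean GPA of the university
```

With other totals or shares, rounding may make the allocated sizes add up to slightly more or less than n, so the allocation should be checked before sampling.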

Inform learners that Statistics is different from Mathematics. The essential paradigm
in Statistics is induction (from the particular to the general) while Mathematics uses
deduction (from the general to the particular). Modern Statistics aims to develop
tools that allow scientifically valid inference from samples to the populations
from which they came.
Specific parameters (numerical summaries of the population, such as a population
proportion or a population mean) are estimated by statistics, i.e., summaries of the
sample data such as a sample proportion or a sample mean. In probability
sampling, each member of the population has a positive and measurable chance
of inclusion in the sample. These inclusion probabilities serve as the bridge from
sample to population. However, this bridge is weak or nonexistent when the
inclusion probabilities cannot be computed, as in the case of non-probability samples.

Figure 3-02.4 Population, sample, and inference

2. Non-probability Sampling

Ask learners whether polls on voting preferences through SMS messages and
Facebook posts can be adequate to represent actual voting preference. Learners
should know or be made aware that results of such kinds of polls are filled with too
much noise as there is currently no way to determine the representativeness of
respondents to such surveys (if the targeted population is much bigger than the
sampled respondents). SMS and Facebook polls do not have complete coverage of
voters: not all voters have cellphones (especially among the poor) despite the
increase in mobile phone usage over the years; not everyone has internet access;
and, certainly, not every Filipino voter has a Facebook account. As of 2014, only a
third of Filipinos were reported to have access to the Internet.
In addition, a mere “random selection” of mobile phone numbers or of Facebook
users will in no way assure you of its representativeness of the voting population
even if everyone had a cellphone or a Facebook account since there will be
“nonresponses” that have to be accounted for.
Non-probability or judgment sampling is the generic name of several sampling
methods where some units in the population do not have a chance of being
selected in the sample, or where the inclusion probabilities cannot be computed.
Generally, the procedure involves arbitrary selection of “typical” or
“representative” units concerning which information is to be obtained. A few types
of non-probability samples are listed below:
a. Haphazard or accidental sampling involves an unsystematic selection of
sample units. Some disciplines like archaeology, history, and even medicine
draw conclusions from whatever items are made available. Some disciplines
like astronomy, experimental physics, and chemistry often do not care about
the “representativeness” of their specimens.
b. In convenience sampling, sample units expedient to the sampler are
taken.
c. For volunteer sampling, sample units are volunteers in studies
wherein the measuring process is painful or troublesome to a respondent.
d. Purposive sampling pertains to having an expert select a representative
sample based on his own subjective judgment. For instance, in Accounting, a
sample audit of ledgers may be taken of certain weeks (which are viewed as
typical). Many agricultural surveys also adopt this procedure for lack of a
specific sampling frame.
e. In Quota Sampling, sample units are picked for convenience but certain
quotas (such as the number of persons to interview) are given to interviewers.
This design is especially used in market research.
f. In Snowball Sampling, additional sample units are identified by asking
previously picked sample units for people they know who can be added to the
sample. Usually, this is used when the topic is not common, or the population
is hard to access.

Discuss with learners other ways of classifying surveys:

• size of the sample – e.g., large-scale or small-scale
• periodicity – longitudinal or panel, where respondents are
monitored periodically; cross-section; quarterly
• main objective – descriptive, analytic
• method of data collection – mail, face-to-face interview, e-survey, phone
survey, SMS survey
• respondents – individual, household, establishment (or enterprise), farmer,
OFW

3. Survey Errors

When collecting data, whether through sample surveys or censuses, a variety of
survey errors may arise. This is why it is crucial to design the data collection process
very carefully. Censuses may also overcount or undercount certain portions of the
population of interest. Household censuses in the Philippines, for instance, have
often been contentious because of undercounts and overcounts and their
implications on politics since congressional seats and Internal Revenue Allotment
(IRA) depend on population counts. Conclusions based on non-probability samples, such
as telephone polls used in early morning television shows, SMS polls, or surveys on
Facebook, do not hold the same weight as those based on probability samples. A probability
sample uses chance to make the sample much more representative of the
population, something that is not true of non-probability samples.

Survey errors involve sampling errors and non-sampling errors:

• In the conduct of sample surveys, sampling error is roughly the difference
between the value obtained in a sample statistic and the value of the population
parameter that would have arisen had a census been conducted. This difference
comes from the operation of the chance process that determines which particular
units in the population are included in the sample. This error can be positive or
negative, small or large, but increasing the sample size generally reduces this
type of error. This error can be estimated and reported along with the sample
statistic. Since estimates of a parameter from a probability sample would vary
from sample to sample, the variation in estimates serves as a measure of
sampling error. Statisticians can say, for instance, that in 2000, the FIES
indicated that 39.5 percent of the entire Filipino population was poor and that there
are 95 chances in 100 that a full census would reveal a value within 0.4 percentage
points of the stated figure. The approval ratings of the President, obtained from an opinion poll
of about 1,200 respondents who were selected judiciously through chance
methods, are theoretically within a margin of error of about 3 percentage points
from the actual approval ratings.
• Another type of error that statisticians consider in the collection of data is
called non-sampling error. There are many specific types of non-sampling
error. There may be selection bias, or the systematic tendency to exclude
a particular group of units from a survey. As a result, you get coverage errors,
which arise if, for example, we assume that the respondents in a telephone
poll in an early morning television show reflect the entire population of
voters. Yet in fact, telephone polls in the Philippines at best represent only the
population of telephone subscribers, which is, in truth, only a small minority of
the targeted population of all Filipinos. Current television and radio polls being
conducted by a number of media stations reflect only the population of those
who are watching or listening to the show and who are persistent in phoning
in their views. Thus, there is a serious issue of coverage. The same is true in
the case of Internet-based and SMS surveys. Even a seriously done Internet
survey will only reflect those who have Internet access, which is currently not
the majority of Filipino households. To illustrate coverage and other
non-sampling errors, consider the following case in point.

Example of Survey Estimate Fiasco:

In 1936, the Literary Digest, a famed magazine in the United States, conducted a
survey of its subscribers as well as telephone subscribers to predict the
outcome of the presidential race. The Digest erroneously predicted that then
incumbent President Franklin D. Roosevelt would receive 43% of the vote and
thus lose to the challenger, Kansas Governor Alfred Landon, when in actuality,
Landon received only 38% of the total vote. (The Digest went bankrupt thereafter.)
At the same time, George Gallup set up his polling organization and correctly
forecast Roosevelt’s victory from a mere sample of 50,000 people.
A post-mortem analysis revealed coverage errors arising from biases in sample
selection. The Literary Digest list of targeted respondents was taken from
telephone books, magazine subscriptions, club membership lists, and
automobile registrations. Inadvertently, the Digest targeted well-to-do voters, who
were predominantly Republican and who had a tendency to vote for their
candidate. The sample had a built-in bias to favor one group over another. This
is called selection bias. In addition, there was also a non-response bias
since, of the 10 million they targeted for the survey, only 2.4 million had actually
responded. A response rate of 24% is far too low to yield reliable estimates of
population parameters. Nonresponsive people may differ considerably in their
views from the views of responders.
Here, we see that obtaining a large number of respondents does not cure
procedural defects but only repeats them over and over again! When choosing a
sample, biases such as selection bias or nonresponse bias should be avoided.
However, in practice, it can be challenging to avoid nonresponse bias in surveys
since there are people who will fill out surveys and those who will not, even if
incentives are provided.
Provide more examples to emphasize the lesson on non-sampling errors, such
as asking your learners a question but only selecting specific people to answer
the question. For example, ask what their favorite toy or game was when they were
growing up, but ask only the boys or only the girls. Then, based on the responses,
conclude that their answers are true for the entire class. Have them react to
your statements. Say that it is an example of selection or coverage bias.
You can also ask what the average height of the class is, but only ask tall people.
Then, conclude that all members of the class are tall. Emphasize that
non-probability sampling makes the conclusions hard to generalize to the population.

To remedy biases (or failures of a sample to represent the population) resulting from
“convenience” sampling, polling organizations have since resorted to using
probability-based methodologies for selecting samples, where the subjects are
chosen on the basis of certain probabilities, which, in turn, allow us to compute
the number of population members each sampled respondent effectively represents.
Randomization, or using chance-based procedures for selecting respondents, is the
best guarantee against bias. However, it is important to first have a good
sampling frame, i.e., a listing of the targeted population, and carefully design the survey in
order to make it representative of the targeted population.
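
One common way to express how many members of the population a sampled respondent effectively represents is a sampling weight, the reciprocal of that respondent’s chance of selection. The sketch below uses made-up selection probabilities and hypothetical function names, and is only meant to show the arithmetic.

```python
def design_weight(inclusion_probability):
    """Number of population units that one sampled unit stands for."""
    return 1 / inclusion_probability


def weighted_proportion(responses):
    """Estimate a population proportion from (answer, inclusion probability) pairs.

    `answer` is 1 for "yes" and 0 for "no".
    """
    total_weight = sum(design_weight(pi) for _, pi in responses)
    yes_weight = sum(answer * design_weight(pi) for answer, pi in responses)
    return yes_weight / total_weight


# Hypothetical poll: urban respondents were sampled at 1 in 1,000,
# rural respondents at 1 in 4,000.
responses = [(1, 1 / 1000), (0, 1 / 1000), (1, 1 / 4000), (1, 1 / 4000)]
print(weighted_proportion(responses))
```

Here each rural respondent stands for four times as many people as an urban respondent, so the weighted estimate differs from the simple share of “yes” answers in the sample (3 out of 4).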

Other possible sources of biases in sample surveys that one should be
cautious about:
• wording of questions, which can influence the response enormously
• the sensitivity of a survey topic (e.g., income, sex, and illegal behavior)
• interviewer biases in selecting respondents or in the responses
generated because of the appearance and demeanor of the interviewer
• non-response biases, which happen when targeted respondents opt not to
provide information in the survey

Note to Teacher: You may mention the following examples to drive the point further
about survey errors. One rather famous example is the time when surveys
conducted by all the “reputable” pollsters (Gallup, Crossley, and Roper) in the United
States in 1948 embarrassingly resulted in the wrong prediction that New York
Governor Thomas Dewey would beat the reelectionist US President Harry Truman in
the presidential race. No less than the famed Chicago Daily Tribune printed an early
edition with the headline based on the (wrong!) poll predictions.

Re-electionist Harry Truman showing the Chicago Daily Tribune early edition

There, problems resulted from the sampling design, with the interviewers being
given too much discretion on whom to interview. These polls used quota
sampling. Interviewers may have selected the least threatening people they
would encounter in the field, e.g., the best-dressed people, so that the samples
chosen systematically over-represented a part of the population (and
under-represented other groups).

Special surveys that measure the difference between respondents and
non-respondents show that lower-income and upper-income people tend not to
respond to questionnaires, so modern polling organizations prefer
to use personal interviews rather than mailed questionnaires.

In developed countries like the United States, the typical response rate for personal
interviews is about 65%, compared to merely 25% for mailed questionnaires. A
number of methods are now being tested to improve response rates in polls
and other surveys.

4. Sampling Distribution, Accuracy and Precision

As was pointed out earlier, statistics generated from a sample survey are subject to
both non-sampling and sampling errors. The latter arise because only a part of the
population is observed. There is likely to be some difference between the sample
statistic and the true value of the population parameter (that you would have
obtained had a census been conducted). To know more about this difference or
sampling error and consequently establish the reliability of the sample statistic, you
have to understand the chance process involved in the sample selection. For this
purpose, you have to analyze the sampling distribution or the set of all possible
values that the point estimate could take under repeated sampling, and possibly
approximate this sampling distribution.

When estimating, you should know something about the population to which you will
generalize. One of the characteristics of the population that is often estimated is the
mean. There can be
several estimators of the population mean, including the sample mean, sample
median, sample mode, and sample midrange. In similar manner, there can be
several estimators of the population variance $\sigma^2$. Given sample data $x_1, x_2, \ldots, x_n$,
where $\bar{x}$ represents the sample mean (i.e., the sum of the data divided by the
sample size n), then the sample variance defined with denominator n-1

$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$

and that with denominator n

$\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$


are two estimators of the population variance.
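
The two variance estimators can be compared on a small data set. The numbers below are hypothetical and chosen only to keep the arithmetic simple; `statistics.variance` and `statistics.pvariance` are the Python standard library functions that use denominators n-1 and n, respectively.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample, n = 8

x_bar = statistics.mean(data)             # sample mean
ss = sum((x - x_bar) ** 2 for x in data)  # sum of squared deviations from the mean

var_n_minus_1 = ss / (len(data) - 1)      # estimator with denominator n - 1
var_n = ss / len(data)                    # estimator with denominator n

# The standard library provides both versions.
assert abs(var_n_minus_1 - statistics.variance(data)) < 1e-12
assert abs(var_n - statistics.pvariance(data)) < 1e-12

print(x_bar, var_n, round(var_n_minus_1, 3))
```

The version with denominator n is always the smaller of the two, and the gap between them narrows as the sample size grows.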

As was earlier pointed out, a good estimator must possess desirable properties:
accuracy and precision.

• Accuracy is a measure of how close the estimates are to the parameter. It
can be measured by bias, i.e., the difference of the expected value of the
estimate from the true value of the parameter. An estimator is said to be
unbiased if its bias is zero. Otherwise, the estimator is biased. When bias is
positive or greater than zero, the estimator overestimates the parameter. If
negative or below zero, the estimator underestimates the parameter.

• Precision is a measure of how close the estimates are to each other. The
variance of the estimator or its standard error gives a measure of how
precise the estimator is. The smaller the value of the standard error of an
estimator, the more precise the estimator is.

In general, we want the estimator to be both accurate and precise. We can illustrate
precision and accuracy by way of an analogy. Let us represent the parameter as a
target bull’s eye while the estimates of the parameters are the arrows shot by an
archer. The first target (1) in the figure below illustrates a precise but not an
accurate estimator. The second target (2) shows that the archer or estimator is
accurate but not precise. The third target (3) shows that the archer is both precise
and accurate, while the last target (4) shows an estimator that is neither accurate nor
precise.

(1)   (2)   (3)   (4)

Figure 3-02.5 Analogy between estimation and hitting the bull’s eye

Example: The sample mean (of a simple random sample) is an estimator of the
population mean that is both accurate and precise. Its expected value is equal to
the population mean itself, which is why it is unbiased and, consequently, an accurate
estimator. It is precise because statistical theory has shown that its standard error
is small compared to that of other commonly used estimators (for normally distributed
data, for example, it is smaller than the standard error of the sample median). Having these good
properties of an estimator makes the sample mean a good estimator of the
population mean.
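
Accuracy and precision can be explored by repeated sampling on a computer. The sketch below compares the sample mean with the sample median; it assumes a normally distributed population with made-up mean 50 and standard deviation 10, and the function name `sampling_distribution` is illustrative.

```python
import random
import statistics


def sampling_distribution(estimator, n=25, reps=2000, mu=50, sigma=10, seed=0):
    """Apply `estimator` to `reps` simple random samples of size n.

    The population is assumed normal with mean mu and standard deviation
    sigma; these values are made up for illustration.
    """
    rng = random.Random(seed)
    return [estimator([rng.gauss(mu, sigma) for _ in range(n)]) for _ in range(reps)]


if __name__ == "__main__":
    for name, estimator in (("mean", statistics.mean), ("median", statistics.median)):
        estimates = sampling_distribution(estimator)
        bias = statistics.mean(estimates) - 50   # accuracy: near zero if unbiased
        std_error = statistics.stdev(estimates)  # precision: smaller is better
        print(name, round(bias, 2), round(std_error, 2))
```

Under these assumptions, both estimators show a bias near zero, but the estimates from the sample mean are less spread out than those from the sample median.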

E. Enrichment

Encourage learners to come up with a survey to discover something that is
relevant to their experience in high school. For example, they can ask about the biggest
concern of learners in their grade level (is it academics, family, friends/peers,
etc.?), what the learners in their grade level want to do after high school, or whom they
want to support in the next national elections. They can explore different sampling
methods or try different ways of collecting data (interviews, questionnaires, text polls).
Then, have them report their findings and ask them how they can arrive at an
appropriate interpretation of the data they generated.
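
Learners who want to explore the different sampling methods on a computer could start from a sketch like the one below. It is a minimal illustration, not a prescribed procedure: the sampling frame (120 learners in four sections named A to D) and all function names are hypothetical, and the systematic sample here assumes a random start within the first interval.

```python
import random

random.seed(7)

# Hypothetical sampling frame: 120 learners in 4 sections of 30 (made-up IDs).
frame = [{"id": i, "section": "ABCD"[i // 30]} for i in range(120)]


def simple_random_sample(frame, n):
    """Every set of n units has the same chance of being selected."""
    return random.sample(frame, n)


def systematic_sample(frame, n):
    """Pick a random start in the first interval, then take every k-th unit."""
    k = len(frame) // n
    start = random.randrange(k)
    return frame[start::k][:n]


def stratified_sample(frame, n_per_stratum, key="section"):
    """Draw a separate simple random sample from each stratum."""
    strata = {}
    for unit in frame:
        strata.setdefault(unit[key], []).append(unit)
    sample = []
    for units in strata.values():
        sample.extend(random.sample(units, n_per_stratum))
    return sample


def cluster_sample(frame, n_clusters, key="section"):
    """Choose whole clusters at random and keep all of their units."""
    clusters = sorted({unit[key] for unit in frame})
    chosen = set(random.sample(clusters, n_clusters))
    return [unit for unit in frame if unit[key] in chosen]


print("SRS:       ", sorted(u["id"] for u in simple_random_sample(frame, 12)))
print("Systematic:", [u["id"] for u in systematic_sample(frame, 12)])
print("Stratified:", sorted(u["id"] for u in stratified_sample(frame, 3)))
print("Cluster:   ", sorted({u["section"] for u in cluster_sample(frame, 2)}))
```

Learners can compare the selected IDs across methods: the systematic sample is evenly spaced, the stratified sample has the same number of learners from every section, and the cluster sample contains everyone from the chosen sections only.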
KEY POINTS

• Sampling is often chosen over full enumeration (census) because selecting a sample
is less time-consuming and less costly than covering every item in the
population. An analysis of a sample is also less cumbersome and more
practical than an analysis of the entire population.
• Probability sampling involves units obtained using a chance mechanism and
requires the use of a sampling frame (a list/map of all the sampling units in the
population), while in a non-probability sample, units are chosen without regard
to their probability of occurrence. The latter type of sample should not be used
for statistical inference. The basic types of probability samples include:
o Simple random sample: a sample of size n selected in such a way that each set
of n elements in the population has an equal chance of being selected.
o Systematic sample: a sample drawn by first selecting a starting
point in the larger population and then obtaining subsequent observations
by using a constant interval between the units taken.
o Stratified random sample: a sample chosen in such a way that the
population is divided into several subgroups, called strata, with
random samples drawn from each stratum.
o Cluster sample: a sample where entire groups (or clusters) are chosen
at random.

• Types of survey errors:
o Sampling error results from chance variation from sample to sample in a
probability sample. It is roughly the difference between the value of a
sample statistic and the value of the population parameter that would
have been obtained had a census been conducted. Since estimates of a parameter
from a probability sample vary from sample to sample, the
variation in the estimates serves as a measure of sampling error.
o Non-sampling error:
§ Coverage error or selection bias results if some groups are
excluded from the frame and have no chance of being selected.
§ Non-response error or bias occurs when people who do
not respond differ from those who do respond.
§ Measurement error arises from weaknesses in question design,
respondent error, and the interviewer’s impact on the respondent.
• A representative sample, selected using chance-based methods, can provide
insights about a population. The size of the sample,
not its size relative to the larger population, determines the precision of the
statistics it generates. Randomization, i.e., using chance-based procedures for
selecting respondents, is the best guarantee against bias.
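
The statement that precision depends on the sample size rather than on the fraction of the population sampled can be checked numerically. The sketch below is an illustration under stated assumptions: it uses the standard approximate 95% margin of error for a proportion from a simple random sample drawn without replacement, with the finite population correction, and takes p = 0.5 as the most conservative case. The function name and the population and sample sizes are hypothetical values chosen for the example.

```python
import math


def margin_of_error(n, N, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion, based on a simple
    random sample of size n drawn without replacement from N units."""
    fpc = (N - n) / (N - 1)  # finite population correction
    return z * math.sqrt(p * (1 - p) / n * fpc)


# Hypothetical population sizes (N) and sample sizes (n).
for N in (100_000, 1_000_000, 100_000_000):
    for n in (300, 1_200, 4_800):
        moe = 100 * margin_of_error(n, N)
        print(f"N = {N:>11,}   n = {n:>5,}   margin of error = {moe:.2f} points")
```

With n = 1,200, the margin of error stays at roughly 2.8 percentage points whether the population has one hundred thousand or one hundred million units, while quadrupling the sample size roughly halves the margin of error.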
ASSESSMENT (Do this!)
I. Select the best choice.
1. The process of using sample statistics to draw conclusions about true population
parameters is called
a) statistical inference
b) the scientific method
c) sampling
d) descriptive statistics

2. The universe or "totality of items or things" under consideration is called
a) a sample
b) a population
c) a parameter
d) a statistic

3. The portion of the universe that has been selected for analysis is called
a) a sample
b) a frame
c) a parameter
d) a statistic

4. A summary measure that is computed to describe a characteristic from only a sample of the
population is called
a) a parameter
b) a census
c) a statistic
d) the scientific method

5. A summary measure that is computed to describe a characteristic of an entire population is
called
a) a parameter
b) a census
c) a statistic
d) the scientific method

6. Which of the following is most likely a population as opposed to a sample?
a) respondents to a newspaper survey
b) the first 5 learners completing an assignment
c) every third person to arrive at the bank
d) registered voters in a county

7. Which of the following is most likely a parameter as opposed to a statistic?
a) The average score of the first five learners completing an assignment
b) The proportion of females registered to vote in a county
c) The average height of people randomly selected from a database
d) The proportion of trucks stopped yesterday that were cited for bad brakes

8. Which of the following is NOT a reason for the need for sampling?
a) It is usually too costly to study the whole population.
b) It is usually too time-consuming to look at the whole population.
c) It is sometimes destructive to observe the entire population.
d) It is always more informative to investigate a sample than the entire population.

9. Which of the following is NOT a reason for drawing a sample?
a) A sample is less time-consuming than a census.
b) A sample is less costly to administer than a census.
c) A sample is always a good representation of the target population.
d) A sample is less cumbersome and more practical to administer.

10. The Philippine Airlines Internet site provides a questionnaire instrument that can be answered
electronically. Which of the 4 methods of data collection is involved when people complete the
questionnaire?
a) Published sources
b) Experimentation
c) Surveying
d) Observation

II. Identify the population, the parameter of interest, the sampling frame, the sample, the
sampling method, and any potential sources of bias in the following studies.
1. The producers of a television show asked Facebook users on the
TV show’s Facebook page for their sentiments (favorable, unfavorable, neutral) on
a segment of the TV show.
2. A question posted on the website of a daily newspaper in the Philippines asked visitors
of the site to indicate their voter preference for the next presidential election.
3. In March 2015, Pulse Asia reported that the leading urgent concerns of Filipinos are
inflation control (46%), the increase of workers' pay (44%), and the fight against
government corruption (40%). On the other hand, Filipinos are least concerned with
national territorial integrity (5%), terrorism (5%), and charter change (4%). The
nationwide survey was conducted from March 1 to 7, 2015 with 1,200 respondents.

4. A sample survey of persons with disability (PWDs) was designed to be representative of
PWDs, by making use of PWD registers from local government units, but an
assessment suggested the registers were severely undercovering PWDs. The design
was adjusted to make use of snowball sampling where existing sampled PWDs would
identify other future subjects from among their acquaintances. The study attempted to
examine the proportion of PWDs who were poor.

III. Identify which sampling method is applied in the following situations.
1. The teacher randomly selects 20 boys and 15 girls from a batch of learners to
be members of a group that will go on a field trip.
2. A sample of 10 mice is selected at random from a set of 40 mice to test the effect of a
certain medicine.
3. The people in a certain seminar are divided into five groups, and all members of two
of the five groups are asked what they think about the president.
4. A barangay health worker asks every fourth house in the village for the ages of
the children living in those households.
5. A sales clerk for a brand of clothing asks people who come up to her whether they
own an item from her brand.
6. A psychologist asks his patient, who suffers from depression, whether he knows other
people with the same condition, so he can include them in his study.
7. A brand manager of a toothpaste asks the ten dentists whose clinics are closest to his office
whether they use a particular brand of toothpaste.
