Statistical Analysis with Excel

 
Instructional Materials in
STAT 20053
STATISTICAL ANALYSIS WITH SOFTWARE APPLICATION
For the sole noncommercial use of the
Faculty of the Department of Mathematics and Statistics
Polytechnic University of the Philippines
2020
Contributors:
Elizon, Katrina
Usona, Laurence
Aranas, Peter John
Bautista, Lincoln
Baccay, Edcon
Republic of the Philippines
POLYTECHNIC UNIVERSITY OF THE PHILIPPINES
COLLEGE OF SCIENCE
Department of Mathematics and Statistics
Course Title : STATISTICAL ANALYSIS WITH SOFTWARE APPLICATION

Course Code : STAT 20053
Course Credit : 3 UNITS
Pre-Requisite :
Course Description : This course focuses on the conceptual understanding of everyday statistics
and basic statistical procedures. Topics include basic concepts of statistics,
descriptive statistics, and inferential statistics, especially parametric
estimation and hypothesis testing, all illustrated and applied to practical
situations. The course also builds students' competence in basic computer
technology by generating descriptive statistics and performing statistical
analysis using Excel.
COURSE GRADING SYSTEM
The final grade will be based on the weighted average of the student’s scores on each test
assigned at the end of each lesson. The final SIS grade equivalent will be based on the following
table according to the approved University Student Handbook.
Class Standing (CS) = (Weighted Average of all the Activities x 50) + 50
Midterm and/or Final Exam (MFE) = (Weighted Average of the Midterm and/or Final Tests x 50) + 50
Final Grade = (70% x CS) + (30% x MFE)
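As a worked illustration of the grading formulas, the short Python sketch below computes a final grade. It assumes, since the source does not say so explicitly, that each weighted average is expressed as a proportion between 0 and 1, so that the x 50 + 50 step yields a value between 50 and 100. The function names transmute and final_grade are hypothetical and chosen only for this example.

```python
def transmute(weighted_average: float) -> float:
    """Apply the (weighted average x 50) + 50 rule.

    Assumption: weighted_average is a proportion between 0 and 1.
    """
    if not 0.0 <= weighted_average <= 1.0:
        raise ValueError("weighted_average must be between 0 and 1")
    return weighted_average * 50 + 50


def final_grade(activities_average: float, exams_average: float) -> float:
    """Final Grade = (70% x CS) + (30% x MFE)."""
    class_standing = transmute(activities_average)  # CS
    exam_component = transmute(exams_average)       # MFE
    return 0.70 * class_standing + 0.30 * exam_component


# Hypothetical student: 80% on activities, 90% on the midterm/final tests.
# CS = 90 and MFE = 95, so the final grade is 0.7 * 90 + 0.3 * 95 = 91.5.
print(round(final_grade(0.80, 0.90), 2))
```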
Prepared by:
Katrina D. Elizon
Faculty Member, Department of Mathematics and Statistics
College of Science
Contents
1 Introduction to Statistical Concepts
1.1 Definitions and Terminology
1.2 Process of Statistics
1.3 Qualitative and Quantitative Variables
1.4 Discrete and Continuous Variables
1.5 Levels of Measurement
2 Data Collection and Basic Concepts in Sampling Design
2.1 Data Collection
2.2 Sources of Data
2.3 Methods of Collecting Primary and Secondary Data
2.4 Sample Size Determination
2.5 Basic Sampling Design
2.6 Sources of Errors in Sampling
3 Descriptive Statistics
3.1 Textual Presentation
3.2 Tabular Presentation
3.3 Graphical Presentation
3.4 Measures of Central Tendency
3.5 Measures of Relative Position
3.6 Measures of Variation or Dispersion
3.7 Karl Pearson’s Measure of Skewness
3.8 Percentile Coefficient of Kurtosis
3.9 Normal Distribution
3.10 Areas Under a Standard Normal Curve
4 Inferential Statistics
4.1 Procedures for Hypothesis Testing
4.2 Assessing and Testing Normality of the Data
4.3 Inference about Two Means (Dependent and Independent Sample T-Test)
4.4 One-Way Analysis of Variance
4.5 Pearson Product Moment Correlation
4.6 Chi-Square Test
MODULE 1: DEFINITION OF STATISTICS
INTRODUCTION TO THE STATISTICAL CONCEPTS

Objectives:
After successful completion of this module, you should be able to:
• Define statistics.
• Explain the process of statistics.
• Know the difference between descriptive and inferential statistics.
• Distinguish between qualitative and quantitative variables.
• Distinguish between discrete and continuous variables.
• Determine the level of measurement of a variable.

Statistics plays a major role in many aspects of our lives. It is used in sports, for example, to help a general manager decide which player might be the best fit for a team. It is used in politics to help candidates understand how the public feels about various policies. And statistics is used in medicine to help determine the effectiveness of new drugs. Used appropriately, statistics can enhance our understanding of the world around us. Used inappropriately, it can lend support to inaccurate beliefs. Understanding statistical methods will provide you with the ability to analyze and critique studies and the opportunity to become an informed consumer of information. Understanding statistical methods will also enable you to distinguish solid analysis from bogus “facts.”

Statistics is the science of collecting, organizing, summarizing, and analyzing information to draw conclusions or answer questions. In addition, statistics is about providing a measure of confidence in any conclusions.

What information is referred to in the definition? The information referred to in the definition is the data. According to the Merriam-Webster dictionary, data are “factual information used as a basis for reasoning, discussion, or calculation”.

Definitions:
• Universe is the set of all entities under study.
• Population is the set of all possible values of the variable. An individual is a person or object that is a member of the population being studied.
• A statistic is a numerical summary of a sample.
• A sample is a subset of the population.
• Descriptive statistics consist of organizing and summarizing data. Descriptive statistics describe data through numerical summaries, tables, and graphs.
• Inferential statistics uses methods that take a result from a sample, extend it to the population, and measure the reliability of the result.
• A parameter is a numerical summary of a population.

Example: Consider the Scenario.
You are walking down the street and notice that a person walking in front of you drops PHP100. Nobody seems to notice the PHP100 except you. Since you could keep the money without anyone knowing, would you keep the money or return it to the owner?

Suppose you wanted to use this scenario as a gauge of the morality of students at your school by determining the percent of students who would return the money. How might you do this? You could attempt to present the scenario to every student at the school, but this would be difficult or impossible if the student body is large. A second possibility is to present the scenario to 50 students and use the results to make a statement about all the students at the school.

In the PHP100 study presented, the population is all the students at the school. Each student is an individual. The sample is the 50 students selected to participate in the study.

Suppose 39 of the 50 students stated that they would return the money to the owner. We could present this result by saying that the percent of students in the survey who would return the money to the owner is 78%. This is an example of a descriptive statistic because it describes the results of the sample without making any general conclusions about the population. So 78% is a statistic because it is a numerical summary based on a sample. Descriptive statistics make it easier to get an overview of what the data are telling us.

If we extend the results of our sample to the population, we are performing inferential statistics. The generalization contains uncertainty because a sample cannot tell us everything about a population. Therefore, inferential statistics includes a level of confidence in the results. So rather than saying that 78% of all students would return the money, we might say that we are 95% confident that between 74% and 82% of all students would return the money. Notice how this inferential statement includes a level of confidence (measure of reliability) in our results. It also includes a range of values to account for the variability in our results. One goal of inferential statistics is to use statistics to estimate parameters.

PROCESS OF STATISTICS
1. Identify the research objective.
A researcher must determine the question(s) he or she wants answered. The question(s) must clearly identify the population that is to be studied.

2. Collect the information needed to answer the questions.
Conducting research on an entire population is often difficult and expensive, so we typically look at a sample. This step is vital to the statistical process, because if the data are not collected correctly, the conclusions drawn are meaningless. Do not overlook the importance of appropriate data collection.

Example:
A research objective is presented. For each research objective, identify the population and sample in the study.
1. The Philippine Mental Health Association contacts 1,028 teenagers who are 13 to 17 years of age and live in Antipolo City and asks whether or not they have been prescribed medications for any mental disorders, such as depression or anxiety.
Population: Teenagers 13 to 17 years of age who live in Antipolo City
Sample: 1,028 teenagers 13 to 17 years of age who live in Antipolo City
2. A farmer wanted to learn about the weight of his soybean crop. He randomly sampled 100 plants and weighed the soybeans on each plant.
Population: Entire soybean crop
Sample: 100 selected soybean plants

3. Organize and summarize the information.
Descriptive statistics allow the researcher to obtain an overview of the data and can help determine the type of statistical methods the researcher should use.

4. Draw conclusions from the information.
In this step the information collected from the sample is generalized to the population. Inferential statistics uses methods that take results obtained from a sample, extend them to the population, and measure the reliability of the result.

Take Note!
If the entire population is studied, then inferential statistics is not necessary, because descriptive statistics will provide all the information that we need regarding the population.

Example:
For each of the following statements, decide whether it belongs to the field of descriptive statistics or inferential statistics.
1. A badminton player wants to know his average score for the past 10 games. (Descriptive Statistics)
2. A car manufacturer wishes to estimate the average lifetime of batteries by testing a sample of 50 batteries. (Inferential Statistics)
3. Janine wants to determine the variability of her six exam scores in Algebra. (Descriptive Statistics)
4. A shipping company wishes to estimate the number of passengers traveling via their ships next year using their data on the number of passengers in the past three years. (Inferential Statistics)
5. A politician wants to determine the total number of votes his rival obtained in the past election based on his copies of the tally sheet of electoral returns. (Descriptive Statistics)
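To make the statistic-versus-inference distinction concrete, the Python sketch below revisits the PHP100 survey from earlier in this module (39 of 50 students would return the money). The descriptive part is simply the sample proportion. The inferential part uses a normal-approximation confidence interval with Z = 1.96 for 95% confidence; that method is an assumption of this sketch rather than something the module prescribes, and the function name is hypothetical. With only 50 respondents it gives a wider range than the 74% to 82% quoted as an illustration in the text.

```python
import math


def proportion_interval(successes: int, n: int, z: float = 1.96):
    """Return (sample proportion, lower bound, upper bound).

    Uses the normal approximation: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n).
    """
    p_hat = successes / n                            # descriptive statistic
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)  # sampling uncertainty
    return p_hat, p_hat - margin, p_hat + margin


p_hat, low, high = proportion_interval(39, 50)
print(f"Statistic: {p_hat:.0%} of the sample would return the money")
print(f"Approximate 95% interval for the population: {low:.1%} to {high:.1%}")
```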
DISTINCTION BETWEEN QUALITATIVE AND QUANTITATIVE VARIABLES
Variables are the characteristics of the individuals within the population. For example, recently my mother and I planted a tomato plant in our backyard. We collected information about the tomatoes harvested from the plant. The individuals we studied were the tomatoes. The variable that interested us was the weight of a tomato.

Variables can be classified into two groups:
1. A qualitative variable is a variable that yields categorical responses. It is a word or a code that represents a class or category.
2. A quantitative variable takes on numerical values representing an amount or quantity.

Example:
Determine whether the following variables are qualitative or quantitative.
1. Hair color (Qualitative)
2. Temperature (Quantitative)
3. Number of hamburgers sold (Quantitative)
4. Number of children (Quantitative)
5. Zip code (Qualitative)

DISTINCTION BETWEEN DISCRETE AND CONTINUOUS
Quantitative variables may be further classified into:
1. A discrete variable is a quantitative variable that has either a finite number of possible values or a countable number of possible values. If you count to get the value of a quantitative variable, it is discrete.
2. A continuous variable is a quantitative variable that has an infinite number of possible values that are not countable. If you measure to get the value of a quantitative variable, it is continuous.

Example:
Determine whether the following quantitative variables are discrete or continuous.
1. The number of heads obtained after flipping a coin five times. (Discrete)
2. The number of cars that arrive at a McDonald’s drive-through between 12:00 P.M. and 1:00 P.M. (Discrete)
3. The distance a 2005 Toyota Prius can travel in city conditions with a full tank of gas. (Continuous)
4. Number of words correctly spelled. (Discrete)
5. Time of a runner to finish one lap. (Continuous)

LEVELS OF MEASUREMENT
It is important to know which type of scale is represented by your data since different statistics are appropriate for different scales of measurement. A characteristic may be measured using nominal, ordinal, interval, and ratio scales.

1. Nominal Level - This is the first level of measurement and it is characterized by data that consist of names, labels, or categories only. The data cannot be arranged in an ordering scheme. Nominal scales have no numerical value.
They are sometimes called categorical scales or categorical data. Such a scale classifies persons or objects into two or more categories. Whatever the basis for classification, a person can only be in one category, and members of a given category have a common set of characteristics.

Example:
- Method of payment (cash, check, debit card, credit card)
- Type of school (public vs. private)
- Eye Color (Blue, Green, Brown)

[Figure: Levels of Measurement]

2. Ordinal Level - This involves data that may be arranged in some order, but differences between data values either cannot be determined or are meaningless. An ordinal scale not only classifies subjects but also ranks them in terms of the degree to which they possess a characteristic of interest. In other words, an ordinal scale puts the subjects in order from highest to lowest, from most to least. Although ordinal scales indicate that some subjects are higher or lower than others, they do not indicate how much higher or how much better.

Example:
- Food Preferences
- Rank of a Military officer
- Social Economic Class (First, Middle, Lower)

3. Interval Level - This measurement level not only classifies and orders the measurements, but also specifies that the distances between each interval on the scale are equivalent along the scale from low interval to high interval. A value of zero does not mean the absence of the quantity. Arithmetic operations such as addition and subtraction can be performed on values of the variable.

Example:
- Temperature on Fahrenheit/Celsius Thermometer
- Trait anxiety (e.g., high anxious vs. low anxious)
- IQ (e.g., high IQ vs. average IQ vs. low IQ)

4. Ratio Level - A ratio scale represents the highest, most precise level of measurement. It has the properties of the interval level of measurement, and the ratios of the values of the variable have meaning. A value of zero means the absence of the quantity. Arithmetic operations such as multiplication and division can be performed on the values of the variable.

Example:
- Height and weight
- Time
- Distance and speed
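The practical difference between the interval and ratio levels can be checked numerically. In the Python sketch below (an illustration with arbitrary temperatures and distances, not data from this module), the ratio of two Celsius readings changes when the same readings are expressed in Fahrenheit, because zero on those scales is arbitrary. The ratio of two distances is the same in kilometers and miles, because zero distance means no distance at all.

```python
def celsius_to_fahrenheit(celsius: float) -> float:
    return celsius * 9 / 5 + 32


def km_to_miles(km: float) -> float:
    return km * 0.621371  # approximate conversion factor


# Interval scale: 20 degrees C is not "twice as hot" as 10 degrees C.
ratio_celsius = 20 / 10
ratio_fahrenheit = celsius_to_fahrenheit(20) / celsius_to_fahrenheit(10)  # 68 / 50
print(ratio_celsius, ratio_fahrenheit)  # the two ratios disagree

# Differences are still meaningful on an interval scale:
# a 10-degree gap in Celsius is always an 18-degree gap in Fahrenheit.
gap_low = celsius_to_fahrenheit(20) - celsius_to_fahrenheit(10)
gap_high = celsius_to_fahrenheit(40) - celsius_to_fahrenheit(30)
print(gap_low, gap_high)

# Ratio scale: 20 km is twice 10 km in any unit.
ratio_km = 20 / 10
ratio_miles = km_to_miles(20) / km_to_miles(10)
print(ratio_km, ratio_miles)
```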
Example:
Categorize each of the following as nominal, ordinal, interval or ratio measurement.
1. Ranking of college athletic teams. (Ordinal)
2. Employee number. (Nominal)
3. Number of vehicles registered. (Ratio)
4. Brands of soft drinks. (Nominal)
5. Number of car passers along C5 on a given day. (Ratio)

ACTIVITIES/ASSESSMENTS:
I. A research objective is presented. For each, identify the (A) population and (B) sample in the study.
1. A polling organization contacts 2141 male university graduates who have a white-collar job and asks whether or not they had received a raise at work during the past 4 months.
A. ______________________________
B. ______________________________
2. Every year the PSA releases the Current Population Report based on a survey of 50,000 households. The goal of this report is to learn the demographic characteristics, such as income, of all households within the Philippines.
A. ______________________________
B. ______________________________

II. Indicate whether the following statements require the use of descriptive or inferential statistics.
______________1. A teacher wants to know the attitudes of all students towards abortion.
______________2. A market analyst of a sales firm draws a chart showing the sales figures of a given product for the period 2006-2007.
______________3. A forecaster predicts the results of an election using the number of votes cast in 15 out of 25 barangays.
______________4. Men are better in math than women.
______________5. Forty percent of the employees of an organization were recorded tardy for at least 15 working days.
______________6. There are very few gender-related occupations.
______________7. An accountant predicts the accuracy rate of a client’s financial resources.
______________8. A quality control manager wishes to check production output.
______________9. Records indicated that 75% of the faculty in the graduate school are doctoral degree holders.
______________10. There is no relationship between educational qualification of parents and academic achievement of their children.

III. Identify the qualitative and quantitative variables and indicate the highest level of measurement required in each. If quantitative, classify whether discrete or continuous.
______________1. Occupation
______________2. Number of government officials
______________3. Favorite color
______________4. Temperature in Celsius degrees
______________5. Type of school
______________6. Volume of mineral water sold daily
______________7. Employee number
______________8. Civil status
______________9. Zip code numbers
______________10. Brands of soft drinks
______________11. Socioeconomic status
______________12. Employment status
______________13. Number of vehicles registered
______________14. Jersey number
______________15. Number of employees collecting retirement benefits from GSIS
REFERENCES:
Sullivan, Michael, III. Statistics: Informed Decisions Using Data. Fifth Edition.
Lohr, Sharon L. Sampling: Design and Analysis. Second Edition.
MODULE 2: DATA COLLECTION
DATA COLLECTION AND BASIC CONCEPTS IN SAMPLING DESIGN

Objectives:
After successful completion of this module, you should be able to:
• Determine the sources of data (primary and secondary data).
• Distinguish the different methods under primary and secondary data.
• Determine the appropriate sample size.
• Differentiate various sampling techniques.
• Know the sources of errors in sampling.

Everybody collects, interprets and uses information, much of it in numerical or statistical form, in day-to-day life. People commonly receive large quantities of information every day through conversations, television, computers, the radio, newspapers, posters, notices and instructions. It is precisely because there is so much information available that people need to be able to absorb, select and reject it. In everyday life, in business and industry, certain statistical information is necessary, and it is important to know where to find it and how to collect it.

Data collection is the process of gathering and measuring information on variables of interest, in an established systematic fashion that enables one to answer stated research questions, test hypotheses, and evaluate outcomes.

Without proper planning for data collection, a number of problems can occur. If the data collection steps and processes are not properly planned, the research project can ultimately end up with a data set that does not serve the purpose for which it was intended. For example, if more than one person is involved in the data collection, but data collectors do not follow consistent data collection practices, they can end up with data with different units, collection processes, and variable names.

Consequences from Improperly Collected Data
• Inability to answer research questions accurately.
• Inability to repeat and validate the study.
• Distorted findings resulting in wasted resources.
• Misleading other researchers to pursue fruitless avenues of investigation.
• Compromising decisions for public policy.
• Causing harm to human participants and animal subjects.
Steps in Data Gathering
1. Set the objectives for collecting data.
2. Determine the data needed based on the set objectives.
3. Determine the method to be used in data gathering and define the comprehensive data collection points.
4. Design data gathering forms to be used.
5. Collect data.

SOURCES OF DATA
Whether conducting research in the social sciences, humanities, arts, or natural sciences, the ability to distinguish between primary and secondary sources is essential.

Primary Sources - Provide a first-hand account of an event or time period and are considered to be authoritative. They represent original thinking, reports on discoveries or events, or they can share new information. Often these sources are created at the time the events occurred, but they can also include sources that are created later. They are usually the first formal appearance of original research.

Secondary Sources - Offer an analysis, interpretation or a restatement of primary sources and are considered to be persuasive. They often involve generalization, synthesis, interpretation, commentary or evaluation in an attempt to convince the reader of the creator's argument. They often attempt to describe or explain primary sources.

The primary data can be collected by the following five methods:

1. Direct personal interviews - The researcher has direct contact with the interviewee. The researcher gathers information by asking questions to the interviewee.

2. Indirect/Questionnaire Method - This method of data collection involves sourcing and accessing existing data that were originally collected for the purpose of the study.

Key Design Principles of a Good Questionnaire
1. Keep the questionnaire as short as possible.
2. Decide on the type of questionnaire (Open Ended or Closed Ended).
3. Write the questions properly.
4. Order the questions appropriately.
5. Avoid questions that prompt or motivate the respondent to say what you would like to hear.
6. Write an introductory letter or an introduction.
7. Write special instructions for interviewers or respondents.
8. Translate the questions if necessary.
9. Always test your questions before taking the survey. (Pre-test)

An open-ended question is a type of question that does not include response categories. This type of question is usually appropriate for collecting subjective data.

A closed-ended question is a type of question that includes a list of response categories from which the respondent will select his answer. This type of question is usually appropriate for collecting objective data.

[Figure: Open-Ended versus Closed-Ended]

Take Note!
Question wording and question order have a large effect on the responses obtained.

Example:
Two surveys were taken in late 1993/early 1994 about Elvis Presley.
One survey asked: “In the past few years, there have been a lot of rumors and stories about whether Elvis Presley is really dead. How do you feel about this? Do you think there is any possibility that these rumors are true and that Elvis Presley is still alive, or don’t you think so?”
The second survey asked: “A recent television show examined various theories about Elvis Presley’s death. Do you think it is possible that Elvis is alive or not?”
8% of the respondents to the first question said it is possible that Elvis is still alive and 16% of respondents to the second question said it is possible that Elvis is still alive.

3. A focus group is a group interview of approximately six to twelve people who share similar characteristics or common interests. A facilitator guides the group based on a predetermined set of topics.

4. Experiment is a method of collecting data where there is direct human intervention on the conditions that may affect the values of the variable of interest.

Bear in mind that the experimental method has several limitations that you should be aware of.
- Ethical, moral, and legal concerns
- Unrealistic controlled environments
- Inability to control for all variables

5. Observation is a method of collecting data on the phenomenon of interest by recording the observations made about the phenomenon as it actually happens. It involves collecting
information without asking questions.

The secondary data can be collected by the following five methods:
1. Published reports in newspapers and periodicals.
2. Financial data reported in annual reports.
3. Records maintained by the institution.
4. Internal reports of the government departments.
5. Information from official publications.

Take Note!
• Always investigate the validity and reliability of the data by examining the collection method employed by your source.
• Do not use inappropriate data for your research.

SAMPLE SIZE
“How many participants should be chosen for a survey?”

One of the most frequent problems in statistical analysis is the determination of the appropriate sample size. One may ask why sample size is so important. The answer is that an appropriate sample size is required for validity. If the sample size is too small, it will not yield valid results, and any findings will be questionable. An appropriate sample size can produce accurate results. A sample size that is too large wastes money and time, because a sufficiently large sample will normally already give an accurate result.

The sample size is typically denoted by n and it is always a positive integer. No exact sample size can be mentioned here, and it can vary in different research settings. However, all else being equal, a larger sample leads to increased precision in estimates of various properties of the population.

Take Note!
- Representativeness, not size, is the more important consideration.
- Use no less than 30 subjects if possible.
- If you use complex statistics, you may need a minimum of 100 or more in your sample (varies with method).

[Figure: Representative Sample]
The choice of sample size depends on non-statistical considerations and statistical considerations.
• Non-statistical considerations - These may include availability of resources, manpower, budget, ethics and sampling frame.
• Statistical considerations - These include the desired precision of the estimate.

Three criteria need to be specified to determine the appropriate sample size:

1. Level of Precision
Also called sampling error, the level of precision is the range in which the true value of the population is estimated to be.

2. Confidence Level
It is a statistical measure of the number of times out of 100 that results can be expected to be within a specified range. For example, a confidence level of 90% means that results of an action will probably meet expectations 90% of the time.

To find the right z-score to use, refer to the table:

Desired Confidence Level    Z-Score
80%                         1.28
85%                         1.44
90%                         1.65
95%                         1.96
99%                         2.58

3. Degree of Variability
Depending upon the target population and attributes under consideration, the degree of variability varies considerably. The more heterogeneous a population is, the larger the sample size required to get an optimum level of precision.

Methods in Determining the Sample Size

• Estimating the Mean or Average
The sample size required to estimate the population mean µ with a given level of confidence and a specified margin of error e is given by

$$ n \ge \left( \frac{Z\sigma}{e} \right)^2 $$

where:
Z is the z-score corresponding to the level of confidence.
σ is the population standard deviation.
e is the level of precision.

Take Note:
When σ is unknown, it is common practice to conduct a preliminary survey to determine s and use it as an estimate of σ, or to use results from previous studies to obtain an estimate of σ. When using this approach, the size of the sample should be at least 30. The formula for the sample standard deviation s is

$$ s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} $$

Example:
A soft drink machine is regulated so that the amount of drink dispensed is approximately normally distributed with a standard deviation equal to 0.5 ounce. Determine the sample size needed if we wish to be 95% confident that our sample mean will be within 0.03 ounce from the true mean.

Solution: The z-score for confidence level 95% in the z-table is 1.96.
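The table lookup and the formula for estimating a mean can be sketched in Python. The sketch assumes Python 3.8 or later for statistics.NormalDist, and the function names are hypothetical. z_score recovers the two-sided critical value for a confidence level (the table lists these rounded to two decimals; 90% is about 1.645, shown there as 1.65), and sample_size_for_mean applies n ≥ (Zσ/e)² and rounds up, since a sample size must be a whole number. It is run on the soft drink example with the tabulated Z = 1.96.

```python
import math
from statistics import NormalDist


def z_score(confidence_level: float) -> float:
    """Two-sided critical value, e.g. 0.95 -> about 1.96."""
    tail = (1 - confidence_level) / 2
    return NormalDist().inv_cdf(1 - tail)


def sample_size_for_mean(z: float, sigma: float, e: float) -> int:
    """Smallest whole n satisfying n >= (z * sigma / e) ** 2."""
    return math.ceil((z * sigma / e) ** 2)


for level in (0.80, 0.85, 0.90, 0.95, 0.99):
    print(f"{level:.0%}: z is about {z_score(level):.3f}")

# Soft drink example: sigma = 0.5 ounce, e = 0.03 ounce, 95% confidence.
# (1.96 * 0.5 / 0.03)^2 = 1067.11, which rounds up to 1068.
print(sample_size_for_mean(1.96, 0.5, 0.03))
```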
Substituting Z = 1.96, σ = 0.5, and e = 0.03:

$$ n \ge \left( \frac{1.96(0.5)}{0.03} \right)^2 = 1067.11 $$

We need a sample of 1,068 for our study.

• Estimating Proportion (Infinite Population)
The sample size required to obtain a confidence interval for p with specified margin of error e is given by

$$ n \ge \left( \frac{Z}{e} \right)^2 p(1 - p) $$

Where:
Z is the z-score corresponding to the level of confidence.
e is the level of precision.
p is the population proportion.

There is a dilemma in this formula: it depends on

$$ p = \frac{x}{N} $$

which we know only after we have taken the sample.

There are two ways to solve this dilemma:

1. We could determine a preliminary value for p based on a pilot study or an earlier study.
Example: If last month 37% of all voters thought that state taxes are too high, then it is likely that the proportion with that opinion this month will not be dramatically different, and we would use the value 0.37 for p in the formula.

2. Simply replace p in the formula by 0.5. When p = 0.5, p(1 - p) takes its maximum value of 0.25. This is called the most conservative estimate, since it gives the largest possible estimate of n.

The conservative formula, using the strong law of large numbers, is

$$ n \ge \frac{1}{4} \left( \frac{Z}{e} \right)^2 $$

For example, with a confidence level of 95% (Z = 1.96) and a level of precision of 0.05, the formula gives n ≥ 384.16, so n ≈ 385.

Example:
Suppose we are doing a study on the inhabitants of a large town, and want to find out how many households serve breakfast in the mornings. We don’t have much information on the subject to begin with, so we’re going to assume that half of the families serve breakfast: this gives us maximum variability. So p = 0.5. We want 99% confidence and at least 1% precision.

Solution: The z-score for confidence level 99% in the z-table is 2.58.

$$ n \ge \left( \frac{2.58}{0.01} \right)^2 0.5(1 - 0.5) = 16{,}641 $$

We need a sample of 16,641 for our study.

• Slovin’s Formula
Slovin’s formula is used to calculate the sample size n given the population size and error. It is computed as

$$ n \ge \frac{N}{1 + Ne^2} $$
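The proportion formula, its conservative p = 0.5 version, and Slovin’s formula can be checked with a short Python sketch. The function names are hypothetical, and each result is rounded up because a sample size must be a whole number. The small rounding guard is an implementation choice of this sketch, added so that floating-point noise does not push an exact answer such as 16,641 up to 16,642.

```python
import math


def round_up(value: float) -> int:
    """Round up to a whole sample size, ignoring floating-point noise."""
    return math.ceil(round(value, 6))


def sample_size_for_proportion(z: float, e: float, p: float = 0.5) -> int:
    """n >= (z / e)^2 * p * (1 - p); p = 0.5 is the conservative default."""
    return round_up((z / e) ** 2 * p * (1 - p))


def slovin(population_size: int, e: float) -> int:
    """n >= N / (1 + N * e^2)."""
    return round_up(population_size / (1 + population_size * e ** 2))


# Conservative case at 95% confidence and 0.05 precision: 384.16 rounds up to 385.
print(sample_size_for_proportion(1.96, 0.05))
# Breakfast example at 99% confidence and 0.01 precision: 16641.
print(sample_size_for_proportion(2.58, 0.01))
# Slovin's formula with N = 1000 and e = 0.05: 1000 / 3.5 = 285.71 rounds up to 286.
print(slovin(1000, 0.05))
```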
In Slovin’s formula:
N is the total population.
e is the level of precision.

Example:
A researcher plans to conduct a survey about the food preference of BS Stat students. If the population of students is 1000, find the sample size if the error is 5%.

Solution:

$$ n \ge \frac{1000}{1 + 1000(0.05)^2} = 285.71 $$

The researcher needs to survey 286 BS Stat students.

• Finite Population Correction
If the population is small then the sample size can be reduced slightly.

$$ n \ge \frac{n_0}{1 + \frac{n_0 - 1}{N}} $$

Where:
n_0 is Cochran’s sample size recommendation.
N is the population size.

These are links to online sample size calculators:
https://select-statistics.co.uk/calculators/sample-size-calculator-population-proportion/
https://www.calculator.net/sample-size-calculator.html

BASIC SAMPLING DESIGN
The goal in sampling is to obtain individuals for a study in such a way that accurate information about the population can be obtained.

Reason for Sampling
- It is important that the individuals included in a sample represent a cross section of individuals in the population.
- If a sample is not representative, it is biased. You cannot generalize to the population from your statistical data.

Some definitions are needed to make the notion of a good sample more precise.

Definitions:
• Observation unit - An object on which a measurement is taken. This is the basic unit of observation, sometimes called an element. In studying human populations, observation units are often individuals.
• Target population - The complete collection of observations we want to study.
• Sampled population - The collection of all possible observation units that might have been chosen in a sample; the population from which the sample was taken.
• Sample - A subset of a population.
• Sampling unit - A unit that can be selected for a sample. We may want to study individuals, but do not have a list of all individuals in the target population. Instead, households serve as the sampling units, and the observation units are the individuals living in the households.
• Sampling frame - A list, map, or other specification of sampling units in the population from which a sample may be selected. For a survey using in-person interviews, the sampling frame might be a list of all street addresses.
• Sampling technique/Sampling Strategies - It is a plan you set forth to be sure that the sample you use in your research study represents the population from which you drew your sample.
• Sampling Bias - This involves problems in your sampling, which reveal that your sample is not representative of your population.
The following examples indicate some ways in which selection bias can occur:
- Deliberately or purposively selecting a "representative" sample.
- Misspecifying the target population.
- Failing to include all of the target population in the sampling frame, called undercoverage.
- Including population units in the sampling frame that are not in the target population, called overcoverage.
- Having multiplicity of listings in the sampling frame.
- Substituting a convenient member of a population for a designated member who is not readily available.
- Failing to obtain responses from all of the chosen sample. (Nonresponse)
- Allowing the sample to consist entirely of volunteers.
Advantage of Sampling Over Complete Enumeration
- Less Labor
- Reduced Cost
- Greater Speed
- Greater Scope
- Greater Efficiency and Accuracy
- Convenience
- Ethical Considerations
Two Types of Samples
1. Probability Sample
- Samples are obtained using some objective chance mechanism, thus involving randomization.
- They require the use of a complete listing of the elements of the universe called the sampling frame.
- The probabilities of selection are known.
- They are generally referred to as random samples.
- They allow drawing of valid generalizations about the universe/population.
2. Non-probability Sample
- Samples are obtained haphazardly, selected purposively or are taken as volunteers.
- The probabilities of selection are unknown.
- They should not be used for statistical inference.
Sampling Procedure
- Identify the population.
- Determine if the population is accessible.
- Select a sampling method.
- Choose a sample that is representative of the population.
- Ask the question, can I generalize to the general population from the accessible population?
Sampling techniques can be grouped by how the items are selected: probability sampling and non-probability sampling.
Basic Sampling Technique of Probability Sampling
• Simple Random Sampling
- Most basic method of drawing a probability sample.
- Assigns equal probabilities of selection to each possible sample.
- Results in a simple random sample.
Advantage: It is very simple and easy to use.
Disadvantage: The sample chosen may be distributed over a wide geographic area.
When to use: This is preferable if the population is not widely spread geographically. Also, this is more appropriate if the population is more or less homogeneous with respect to the characteristics of the population.
[Figure: Simple Random Sampling]
• Systematic Random Sampling
- It is obtained by selecting every kth individual from the population.
- The first individual selected corresponds to a random number between 1 and k.
Obtaining a Systematic Random Sample
1. Decide on a method of assigning a unique serial number, from 1 to N, to each one of the elements in the population.
2. Compute for the sampling interval
$k = \frac{N}{n} = \frac{\text{Population Size}}{\text{Sample Size}}$
3. Select a number i, from 1 to k, using a randomization mechanism. The element in the population assigned to this number is the first element of the sample. The other elements of the sample are those assigned to the numbers i + k, i + 2k, i + 3k, and so on until you get a sample of size n.
Example:
We want to select a sample of 50 students from 500 students. Under this method, every kth item is picked from the sampling frame.
Solution:
$k = \frac{500}{50} = 10$
We start the sample from a random number i between 1 and k and then take every kth unit after it. Suppose the random number i is 6; then we select units 6, 16, 26, 36, 46, and so on.
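The systematic selection in this example can be sketched in a few lines of Python. This is only an illustration, not part of the course's Excel workflow: the function name `systematic_sample` and its `start` and `seed` parameters are made up for this sketch, and it assumes the population size is an exact multiple of the sample size.

```python
import random

def systematic_sample(population_size, sample_size, start=None, seed=None):
    """Return the serial numbers (1..N) picked by systematic sampling."""
    # Sampling interval k = N / n (assumes N is a multiple of n)
    k = population_size // sample_size
    if start is None:
        # Random start between 1 and k, inclusive
        start = random.Random(seed).randint(1, k)
    # Take the start, then every kth serial number after it
    return [start + j * k for j in range(sample_size)]

picked = systematic_sample(500, 50, start=6)
print(picked[:5])   # [6, 16, 26, 36, 46]
print(len(picked))  # 50
```

With a start of 6 and k = 10 the last unit selected is 496, so the sample stays inside the frame of 500 students.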
Advantage: Drawing of the sample is easy. It is easy to administer in the field, and the sample is spread evenly over the population.
Disadvantage: May give poor precision when unsuspected periodicity is present in the population.
When to use: This is advisable to use if the ordering of the population is essentially random and when stratification with numerous data is used.
[Figure: Systematic Random Sampling]
• Stratified Random Sampling
- It is obtained by separating the population into non-overlapping groups called strata and then obtaining a simple random sample from each stratum.
- The individuals within each stratum should be homogeneous (or similar) in some way.
Example:
A sample of 50 students is to be drawn from a population consisting of 500 students belonging to two institutions A and B. The number of students in institution A is 200 and in institution B is 300. How will you draw the sample using proportional allocation?
Solution:
There are two strata in this case.
Given:
$N_1 = 200$, $N_2 = 300$, $N = 500$, $n = 50$
$n_1 = \left(\frac{n}{N}\right) N_1 = \left(\frac{50}{500}\right) 200 = 20$
$n_2 = \left(\frac{n}{N}\right) N_2 = \left(\frac{50}{500}\right) 300 = 30$
The sample sizes are 20 from A and 30 from B. Then the units from each institution are to be selected by simple random sampling.
Advantage: Stratification of respondents is advantageous in terms of precision of the estimates of the characteristics of the population. Sampling designs may vary by stratum to adjust for the differences in the conditions across strata. It is easy to use as a random sampling design.
Disadvantage: Values of the stratification variable may not be easily available for all units in the population especially if the characteristic of interest is homogeneous. It is possible that there are no representatives in one or two strata. Also, transportation costs can be high if the population covers a wide geographic area.
When to use: If the population is such that the distribution of the characteristics of the respondents under consideration is concentrated in small and spread segments of the population. Thus, this is preferred if precise estimates are desired for stratified parts of the population and if sampling problems differ in the various strata of the population.
[Figure: Stratified Random Sampling]
• Cluster Sampling
- You take the sample from naturally occurring groups in your population.
- The clusters are constructed such that the sampling units are heterogeneous within the cluster and homogeneous among the clusters.
Obtaining a Cluster Sample
1. Divide the population into non-overlapping clusters.
2. Number the clusters in the population from 1 to N.
3. Select n distinct numbers from 1 to N using a randomization mechanism. The selected clusters are the clusters associated with the selected numbers.
4. The sample will consist of all the elements in the selected clusters.
Example:
A researcher wants to survey academic performance of high school students in MIMAROPA.
1. He/She can divide the entire population into different clusters.
2. Then the researcher selects a number of clusters, depending on the research, through simple or systematic random sampling.
3. Then, from the selected clusters the researcher can either include all the high school students as subjects or select a number of subjects from each cluster through simple or systematic random sampling.
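The "Obtaining a Cluster Sample" steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function name `cluster_sample` is hypothetical, and the schools and student labels are made-up data, not taken from the MIMAROPA example.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """One-stage cluster sample: pick whole clusters at random, keep every element."""
    rng = random.Random(seed)
    # Steps 2-3: choose n distinct clusters using a randomization mechanism
    chosen = rng.sample(sorted(clusters), n_clusters)
    # Step 4: the sample consists of all the elements in the selected clusters
    sample = [unit for name in chosen for unit in clusters[name]]
    return chosen, sample

# Hypothetical clusters (step 1): schools and their students
schools = {
    "School A": ["A1", "A2", "A3"],
    "School B": ["B1", "B2"],
    "School C": ["C1", "C2", "C3", "C4"],
    "School D": ["D1", "D2"],
}
chosen, sample = cluster_sample(schools, 2, seed=1)
print(chosen)
print(sample)
```

The sample size is not fixed in advance here; it depends on how many elements the selected clusters happen to contain.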
Advantage: There is no need to come out with a list of units in the population; all that is needed is simply a list of the clusters. It is also less costly since the elements are physically closer together.
Disadvantage: In actual field applications, adjacent households tend to have more similar characteristics than households distantly apart.
When to use: If the population can be grouped into clusters where individual population elements are known to be different with respect to the characteristics under study, this is preferable to use.
[Figure: Cluster Sampling]
• Multi-Stage Sampling
- Selection of the sample is done in two or more steps or stages, with sampling units varying in each stage.
- The population is first divided into a number of first-stage sampling units from which a sample is drawn. Smaller units, called the secondary sampling units, comprising the selected first-stage units then serve as the sampling units for the next stage. If needed, additional stages may be added until the units of observation for the survey are clearly identified. The units comprising the samples selected from the previous stage constitute the frame for the succeeding stages.
Obtaining a Multi-Stage Sample
1. Organize the sampling process into stages where the unit of analysis is systematically grouped.
2. Select a sampling technique for each stage.
3. Systematically apply the sampling technique to each stage until the unit of analysis has been selected.
Example:
Suppose we wish to study the expenditure patterns of households in NCR. We can select a sample of households for this study using simple three-stage sampling.
- First, divide NCR into smaller cities/municipalities and collect a random sample of these cities/municipalities.
- Second, a random sample of smaller areas such as barangays is taken from within each of the cities/municipalities chosen in the first stage.
- Third, a random sample of even smaller areas such as households is taken from within each of the areas chosen in the second stage.
Advantage: It is easier to generate adequate sampling frames. Transportation costs are greatly reduced since there is some form of clustering among the ultimate or final samples; i.e., they are in the same lower-stage units.
Disadvantage: Its complexity in theory may be difficult to apply in the field. Estimation procedures may be difficult for non-statisticians to follow.
When to use: If no population list is available and if the population covers a wide area.
[Figure: Multi-Stage Sampling]
Take Note!
Use probability sampling if the main objective of the sample survey is making inferences about the characteristics of the population under study.
Basic Sampling Technique of Non-Probability Sampling
• Accidental Sampling - There is no system of selection; the sample consists only of those whom the researcher or interviewer meets by chance.
• Quota Sampling - A specified number of persons of certain types is included in the sample. The researcher is aware of categories within the population and draws samples from each category. The size of each categorical sample is proportional to the proportion of the population that belongs in that category.
• Convenience Sampling - It is a process of picking out people in the most convenient and fastest way to get reactions immediately. This method can be done by telephone interview to get the immediate reactions of a certain group on a certain issue.
• Purposive Sampling - It is based on certain criteria laid down by the researcher. People who satisfy the criteria are interviewed. It is used to determine the target population of those who will be taken for the study.
• Judgement Sampling - Selects the sample in accordance with an expert's judgment.
Cases wherein Non-Probability Sampling is Useful
- Only few are willing to be interviewed
- Extreme difficulties in locating or identifying subjects
- Probability sampling is more expensive to implement
- Cannot enumerate the population elements.
Sources of Errors in Sampling
1. Non-sampling Error
- Errors that result from the survey process.
- Any errors that cannot be attributed to the sample-to-sample variability.
Sources of Non-Sampling Error
1. Non-responses
2. Interviewer Error
3. Misrepresented Answers
4. Data entry errors
5. Questionnaire Design
6. Wording of Questions
7. Selection Bias
2. Sampling Error
- Error that results from taking one sample instead of examining the whole population.
- Error that results from using sampling to estimate information regarding a population.
ACTIVITIES/ASSESSMENTS:
I. Determine if the source would be a primary or a secondary source.
______________1. Government Records
______________2. Dictionary
______________3. Artifact
______________4. A TV show explaining what happened in the Philippines.
______________5. Autobiography about Rodrigo Duterte.
______________6. Enrile diary describing what he thought about World War II.
______________7. Audio and video recordings
______________8. Speeches
______________9. Newspaper
______________10. Review Articles
II. Determine the sample size of the following problems. Show your solution.
1. A dermatologist wishes to estimate the proportion of young adults who apply sunscreen regularly before going out in the sun in the summer. Find the minimum sample size required to estimate the proportion with precision of 3%, and 90% confidence.
2. The administration at a college wishes to estimate the proportion of all its entering freshmen who graduate within four years, with 95% confidence. Estimate the minimum size sample required. Assume that the population standard deviation is σ = 1.3 and precision level is 0.05.
3. A government agency wishes to estimate the proportion of drivers aged 16–24 who have been involved in a traffic accident in the last year. It wishes to make the estimate to within 1% error and at 90% confidence. Find the minimum sample size required, using the information that several years ago the proportion was 0.12.
4. An internet service provider wishes to estimate, to within one percentage error, the current proportion of all email that is spam, with 85% confidence. Last year the proportion that was spam was 71%. Estimate the minimum size sample required if the total email that is spam is 10,000.
III. Determine the type of sampling. (ex. Simple Random Sampling, Purposive Sampling)
______________1. To determine customer opinion of its boarding policy, Southwest Airlines randomly selects 60 flights during a certain week and surveys all passengers on the flights.
______________2. A member of Congress wishes to determine her constituency's opinion regarding estate taxes. She divides her constituency into three income classes: low-income households, middle-income households, and upper-income households. She then takes a simple random sample of households from each income class.
______________3. The presider of a guest-lecture series at a university stands outside the auditorium before a lecture begins and hands every fifth person who arrives, beginning with the third, a speaker evaluation survey to be completed and returned at the end of the program.
______________4. 24 Hour Fitness wants to administer a satisfaction survey to its current members. Using its membership roster, the club randomly selects 40 club members and asks them about their level of satisfaction with the club.
______________5. A radio station asks its listeners to call in their opinion regarding the use of U.S. forces in peacekeeping missions.
______________6. A tax auditor selects every 1000th income tax return that is received.
______________7. For a survey, a sample of municipalities was selected from every province in the country and included all child laborers in the selected municipalities.
______________8. To determine his DSL Internet connection speed, Shawn divides up the day into four parts: morning, midday, evening, and late night. He then measures his Internet connection speed at 5 randomly selected times during each part of the day.
______________9. A college official divides the student population into five classes: freshman, sophomore, junior, senior, and graduate student. The official takes a simple random sample from each class and asks the members' opinions regarding student services.
______________10. In the game of lotto, 6 balls are selected from a container with 42 balls.
IV. Using proportional allocation, determine the sample size needed for every school. The total population of students is 10,679, and the minimum sample is 2,450.
School                               Population per School    Sample
Antipolo National High School        1.28
Bagong Nayon National High School    1.44
Dela Paz National High School        1.65
Sta. Cruz National High School       1.96
Tubigan National High School         2.58
Total                                10,679
REFERENCES:
Michael Sullivan, III,. Fifth Edition
Sampling: Design and Analysis by Sharon L. Lohr. Second Edition
http://www.economicsdiscussion.net/statistics/sampling/advantages-of-sampling-over-completeenumeration-in-statistics/11980
http://www.natco1.org/research/files/SamplingStrategies.pdf
https://data36.com/statistical-bias-types-explained/
MODULE 3: DESCRIPTIVE STATISTICS
OBJECTIVES:
After successful completion of this module, you should be able to:
✦ Distinguish the three main forms of data presentation.
✦ Know the different parts of the table.
✦ Choose appropriate diagrams/graphs to present a given set of data.
✦ Organize qualitative and quantitative data in tables.
✦ Compute measures of central tendency, measures of variation and measures of relative position of grouped and ungrouped data.
✦ Describe the shape of a distribution.
✦ Identify regions under the normal curve corresponding to different standard normal values.
✦ Compute probabilities using the standard normal table and Excel.
Data Presentation
Data are usually collected in a raw format and thus the inherent information is difficult to understand. Therefore, raw data need to be summarized, processed, and analyzed to usefully derive information from them. However, no matter how well manipulated, the information derived from the raw data should be presented in an effective format, otherwise, it would be a great loss for both authors and readers. Planning how the data will be presented is essential before appropriately processing raw data.
Presentation of Data
Presentation of data refers to an exhibition or putting up data in an attractive and useful manner such that it can be easily interpreted.
The three main forms of presentation of data are:
Textual Presentation
Tabular Presentation
Graphical Presentation
Textual Presentation
• All the data is presented in the form of text, phrases, or paragraphs.
• It involves enumerating important characteristics, emphasizing significant figures and identifying important features of data.
• Text is the principal method for explaining findings, outlining trends, and providing contextual information.
Example:
A researcher is asked to present the performance of a section in the statistics test. The following are the test scores:
34 42 20 50 17 9 34 43
50 18 35 43 50 23 23 35
37 38 38 39 39 38 38 39
24 29 25 26 28 27 44 44
49 48 46 45 45 46 45 46
The data presented in textual form would be like this:
In the statistics class of 40 students, 3 obtained the perfect score of 50. Sixteen students got a score of 40 and above, while only 3 got 19 and below. Generally, the students performed well in the test, with 23 or 57.5% getting a passing score of 38 and above.
Advantage of Textual Presentation
✦ The data can be interpreted more fully.
✦ Can help in emphasizing some important points in data.
✦ Small sets of data can be easily presented.
Remember!
✦ Keep your paragraphs simple and short.
✦ Always make sure that the readers are provided with additional explanations about the relevance of the figures and their implications.
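The counts quoted in the textual-presentation example can be checked directly from the 40 test scores. The short Python sketch below is an illustration only (the course itself uses Excel); the variable names are made up for this check. It shows that 23 of the 40 students, or 23/40 = 57.5%, scored 38 and above.

```python
# The 40 test scores listed in the example
scores = [
    34, 42, 20, 50, 17, 9, 34, 43,
    50, 18, 35, 43, 50, 23, 23, 35,
    37, 38, 38, 39, 39, 38, 38, 39,
    24, 29, 25, 26, 28, 27, 44, 44,
    49, 48, 46, 45, 45, 46, 45, 46,
]

n = len(scores)
perfect = sum(s == 50 for s in scores)        # perfect score of 50
forty_up = sum(s >= 40 for s in scores)       # 40 and above
nineteen_down = sum(s <= 19 for s in scores)  # 19 and below
passed = sum(s >= 38 for s in scores)         # passing score of 38 and above

print(n, perfect, forty_up, nineteen_down)  # 40 3 16 3
print(passed, f"{passed / n:.1%}")          # 23 57.5%
```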
Tabular Presentation:
• It is a systematic and logical arrangement of data in the form of rows and columns with respect to the characteristics of data.
• A table is best suited for representing individual information and represents both quantitative and qualitative information.
Advantage of Tabular Presentation
✦ More information may be presented.
✦ Exact values can be read from a table to retain precision.
✦ Flexibility is maintained without distortion of data.
✦ Less work and less cost are required in the preparation.
Preparing Tables
The making of a compact table itself is an art. This should contain all the information needed within the smallest possible space. What the purpose of tabulation is and how the tabulated information is to be used are the main points to be kept in mind while preparing a statistical table. An ideal table should consist of the following main parts:
A. Title: The title must tell as simply as possible what is in the table. It should answer the questions:
✦ Who? White females with breast cancer, black males with lung cancer.
✦ What are the data? Counts, percentage distributions, rates.
✦ Where are the data from? Example: One hospital, or the entire population covered by your registry.
✦ When? A particular year, time period.
B. Boxhead: The boxhead contains the captions or column headings. The heading of each column should contain as few words as possible, yet explain exactly what the data in the columns represent.
C. Stubs: The row captions are known as the stub. Items in the stub should be grouped to facilitate interpretation of the data. For example, rows may stand for score classes and columns for data related to sex of students. In the process, there will be many rows for score classes but only two columns for male and female students.
D. Footnotes: Footnotes are given at the foot of the table for explanation of any fact or information included in the table which needs some explanation. Thus, they are meant for explaining or providing further details about the data that have not been covered in title, captions and stubs.
E. Sources of Data: We should also mention the source of information from which data are taken. This may preferably include the name of the author, volume, page and the year of publication. This should also state whether the data contained in the table is of 'primary or secondary' nature.
[Figure: Parts of the Table]
https://byjus.com/commerce/tabular-presentation-of-data/

Construction of Data Tables
✦ The title should be in accordance with the objective of study
✦ Comparison
✦ Alternative location of stubs
✦ Headings
✦ Footnote
✦ Size of columns
✦ Use of abbreviations
✦ Units
Example: Simple or One-Way Table
Optionally, the table may also include totals or percentages.
Example: Compound Table
A compound table is just an extension of a simple table in which there is more than one variable distributed among its attributes (subvariables). An attribute is just a quality, property or component of a variable according to which it can be differentiated with respect to other variables.
We may refer to a compound table as a cross tabulation or even as a contingency table depending on the context in which it is used.
Organize Quantitative Variable in Table
Classes are categories into which data are grouped. When a data set consists of a large number of different discrete data values or when a data set consists of continuous data, we create classes by using intervals of numbers.
Make sure that the classes do not overlap. This is necessary to avoid confusion as to which class a data value belongs. Also, make sure that the class widths are equal for all classes.
In the table below, the first number of each class is the Lower Class Limit (LC) and the second number is the Upper Class Limit (UC).
Age        Number (in thousands)
25 - 34    14,482
35 - 44    14,156
45 - 54    13,801
55 - 64    12,123
65 - 74    7,010
The class width is the difference between consecutive lower class limits.
One exception to the requirement of equal class widths occurs in open-ended tables. A table is open ended if the first class has no lower class limit or the last class has no upper class limit.
Scores         Frequency
10 - 19        25
20 - 29        36
30 - 39        40
40 and over    12
Guidelines for Determining the Lower Class Limit of the First Class and Class Width
Choosing the Lower Class Limit of the First Class:
Choose the smallest observation in the data set or a convenient number slightly lower than the smallest observation in the data set.
For example, the smallest observation is 10.2. A convenient lower class limit of the first class is 10.
Determining the Class Width:
• Decide on the number of classes. Generally, there should be between 5 and 20 classes. The smaller the data set, the fewer classes you should have.
• Determine the class width by computing:
$cw = \frac{x_{max} - x_{min}}{nc}$
cw is the class width
nc is the number of classes
Round this value up to a convenient number.
Remember!
Creating the classes for summarizing continuous data is an art form. There is no such thing as the correct frequency distribution. However, there can be less desirable frequency distributions. The larger the class width, the fewer classes a frequency distribution will have.
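The class-width guideline can be illustrated with a short Python sketch. The function names `class_width` and `class_limits` are hypothetical, "a convenient number" is taken here to mean the next whole number, and the minimum (9) and maximum (50) are the extremes of the 40 test scores used earlier in this module, with 5 classes chosen only for illustration.

```python
import math

def class_width(x_min, x_max, n_classes):
    """Class width cw = (x_max - x_min) / nc, rounded up to a whole number."""
    return math.ceil((x_max - x_min) / n_classes)

def class_limits(first_lower, width, n_classes):
    """Lower and upper class limits for whole-number data."""
    return [(first_lower + i * width, first_lower + (i + 1) * width - 1)
            for i in range(n_classes)]

cw = class_width(9, 50, 5)  # (50 - 9) / 5 = 8.2, rounded up to 9
print(cw)                   # 9
print(class_limits(9, cw, 5))
# [(9, 17), (18, 26), (27, 35), (36, 44), (45, 53)]
```

Consecutive lower class limits (9, 18, 27, ...) differ by the class width, and the classes do not overlap.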
How to Construct Frequency Distribution Table?
A frequency distribution lists each category of data and the number of occurrences for each category of data.
Example: Use the "Sample Data file".
Solution:
To answer this question we need to construct a frequency distribution to determine how many female and male respondents participated in the study.
Procedure in Constructing Frequency Table
✦ If the data is in the form of qualitative data
To construct the frequency distribution using Excel, use the command:
=FREQUENCY(data_array,bins_array)
Then press Ctrl + Shift + Enter so that the formula is entered as an array formula:
{=FREQUENCY(data_array,bins_array)}
FREQUENCY counts numeric values only, so the categories must be coded as numbers in the data file.
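For comparison with the Excel command, the same kind of frequency and percentage table can be sketched in Python. This is an illustration under stated assumptions: the "Sample Data file" is not reproduced here, so the list `sex` is a hypothetical column rebuilt from the counts this section reports for Table 1 (65 male and 63 female respondents).

```python
from collections import Counter

# Hypothetical raw column rebuilt from the counts reported for Table 1
sex = ["Male"] * 65 + ["Female"] * 63

counts = Counter(sex)         # frequency of each category
total = sum(counts.values())  # number of respondents

for category, freq in counts.items():
    # Category, frequency, and percentage of the total
    print(f"{category:<8}{freq:>5}{freq / total:>8.1%}")
print(f"{'Total':<8}{total:>5}{1:>8.1%}")
```

The percentages come out to 50.8% male and 49.2% female of 128 respondents, matching the figures quoted for Table 1.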

Final Output
Table 1 shows the frequency and percentage distribution of the respondents in terms of sex. It can be gleaned from the table that, out of 128 respondents considered in the study, 65 or 50.8% are male and 63 or 49.2% are female.
Example: Use the "Sample Data file".
Procedure in Constructing Frequency Table
✦ If the data is in the form of quantitative data
Steps
1. Set an interval or range for your data. It is needed for the "BIN RANGE".
2. Click "DATA" on the menu bar and click "DATA ANALYSIS" on the toolbar.
3. The dialog box "DATA ANALYSIS" will appear. Choose "HISTOGRAM" in the dialog box, then click OK.
4. Highlight your data for the "INPUT RANGE".
5. Highlight your intervals for the "BIN RANGE".
6. Click the box of "LABELS IN FIRST ROW", then click "OK".
7. The result will appear on a new worksheet of the Excel file. Get the percentage and total.
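The counting done in the steps above can be sketched in Python. This is a rough analogue, not the Excel tool itself: the function name `bin_counts` is hypothetical, each bin value is treated as the upper limit of its class, and the data are the 40 test scores from the textual-presentation example with upper class limits 17, 26, 35, 44 and 53 chosen for illustration.

```python
from bisect import bisect_left

def bin_counts(data, upper_limits):
    """Count values per bin; a value x falls in the first bin whose limit is >= x."""
    counts = [0] * (len(upper_limits) + 1)  # last slot collects values above every limit
    for x in data:
        counts[bisect_left(upper_limits, x)] += 1
    return counts

scores = [
    34, 42, 20, 50, 17, 9, 34, 43,
    50, 18, 35, 43, 50, 23, 23, 35,
    37, 38, 38, 39, 39, 38, 38, 39,
    24, 29, 25, 26, 28, 27, 44, 44,
    49, 48, 46, 45, 45, 46, 45, 46,
]
bins = [17, 26, 35, 44, 53]  # upper class limits (the bin range)

freq = bin_counts(scores, bins)
print(freq)       # [2, 7, 7, 13, 11, 0]
print(sum(freq))  # 40
```

Dividing each frequency by the total of 40 gives the percentage column asked for in step 7.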

Final Output
Example: Identify problems with the following table.
Answer:
✦ Useless Information - Don't show decimals if they are not needed.
✦ Poor Alignment - Make sure alignment makes sense.
• Don't center numbers, always right justify. Try to align decimal points.
• Consider the appropriate placement of row titles.
✦ Difficult to Read - Use commas when the number exceeds a thousand.
Graphical Presentation
✦ A graph is a very effective visual tool as it displays data at a glance, facilitates comparison, and can reveal trends and relationships within the data such as changes over time, and correlation or relative share of a whole.
✦ It is considered an important medium of communication because we are able to create a pictorial representation of the numerical figures.
✦ Suited when we need to show the results of the study to nonprofessionals and/or people who dislike numbers and overly lengthy text.
Bar Graph
✦ It is constructed by labeling each category of data on either the horizontal or vertical axis and the frequency or relative frequency of the category on the other axis. Rectangles of equal width are drawn for each category. The height of each rectangle represents the category's frequency or relative frequency.
✦ It is used to organize discrete data.
Example: Simple Bar Graph
The simple bar chart is used for the case of one variable only.
Example: Multiple Bar Graph / Grouped Column Chart
The multiple bar chart is an extension of a simple bar chart when there are quantities of several variables to be displayed. The bars representing the quantities for the different variables are placed next to one another for each attribute. The figure becomes very cumbersome when there are too many variables and components.
Example: Component Bar Graph / Subdivided Column Chart
In this type of bar chart, the components (quantities) of each variable are piled on top of one another. It saves space as compared to a multiple bar chart. One of the disadvantages of this graph is that it is not always easy to compare the sizes of the components, or parts. It is used to represent data in which the total magnitude is divided into different parts or components.

Remember!
• Bar graphs may also be drawn with horizontal bars. Horizontal bars are preferable when category names are lengthy.
• In bar graphs, the order of the categories does not usually matter. However, bar graphs that have categories arranged in decreasing order of frequency help prioritize categories for decision-making purposes in areas such as quality control, human resources, and marketing.
Histogram
✦ It is constructed by drawing rectangles for each class of data. The height of each rectangle is the frequency or relative frequency of the class. The width of each rectangle is the same and the rectangles touch each other.
✦ It is a graph used to present quantitative data and is similar to the bar graph.
✦ It is used to organize continuous data.
https://newonlinecourses.science.psu.edu/stat500/lesson/1/1.6/1.6.2
Pie Chart
✦ It is a circle divided into sectors. Each sector represents a category of data. The area of each sector is proportional to the frequency of the category.
✦ Pie charts are typically used to present the relative frequency of qualitative data. In most cases the data are nominal, but ordinal data can also be displayed in a pie chart.
When should a bar graph or a pie chart be used?
✦ Pie charts are useful for showing the division of all possible values of a qualitative variable into its parts.
✦ Bar graphs are useful when we want to compare the different parts, not necessarily the parts to the whole.

Line Graph
✦ A graph that shows information that is connected in some way (such as change over time).
✦ Line segments are drawn connecting the points. It is used to organize continuous data.
✦ Very useful in identifying trends in the data over time.
Example: Simple Line Graph
The simplest of line graphs is the single line graph, so called because it displays information concerning one variable only, in terms of its frequencies.

Example: Multiple Line Graph
Multiple line graphs illustrate information on several variables so that comparison is possible between them.
Guidelines for Constructing Good Graphics
✦ Title and label the graphic axes clearly, providing explanations if needed. Include units of measurement and a data source when appropriate.
✦ Avoid distortion.
✦ Minimize the amount of white space in the graph. Use the available space to let the data stand out. If you truncate the scales, clearly indicate this to the reader.
✦ Avoid clutter, such as excessive gridlines and unnecessary backgrounds or pictures.
✦ Don't distract the reader.
✦ Avoid three dimensions.
✦ Do not use more than one design in the same graphic. Let the data speak for themselves.
Grouped and Ungrouped Data
Data is often described as ungrouped or grouped.
Grouped data is the type of data which is classified into groups after collection.
Scores     Frequency
1 - 10     5
11 - 20    9
21 - 30    10
31 - 40    12
41 - 50    24
Total      60
Ungrouped data, which is also known as raw data, is data that has not been placed in any group or category after collection.
Ungrouped data with a frequency distribution
No. of Television Sets    Frequency
0                         7
1                         15
2                         12
3                         4
4                         5
5                         2
Total                     45
Ungrouped data without a frequency distribution
1, 5, 4, 7, 2, 4, 1, 3, 8, 2, 2, 9
Measures of Central Tendency: MEAN
• It is the sum of the data values divided by the number of data values.
• It is also called the average.
• It is appropriate only for data under the interval and ratio scales of measurement.

Advantages of Mean
✦ Simple to understand and easy to calculate.
✦ It is rigidly defined.
✦ It is least affected by fluctuations of sampling.
✦ It takes into account all the values in the series.

Formula for Mean:

Sample Mean
✦ For Ungrouped Data
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
where: x_i = data values, n = no. of sample observations
✦ For Grouped Data
$\bar{x} = \frac{\sum_{i=1}^{r} f x_i}{n}$
where: x_i = data values (class midpoints), f = frequency, r = no. of classes, n = no. of sample observations

Population Mean
✦ For Ungrouped Data
$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$
where: x_i = data values, N = no. of observations
✦ For Grouped Data
$\mu = \frac{\sum_{i=1}^{r} f x_i}{N}$
where: x_i = data values (class midpoints), f = frequency, r = no. of classes, N = no. of observations
Measures of Central Tendency: MEDIAN
• It is the “middle observation” when the data set is sorted (in either increasing or decreasing order).
• The median divides the distribution into two equal parts.

Advantages of Median
✦ The median is not affected by the size of extreme values but by the number of observations.
✦ The median can be calculated even when the frequency distribution contains “open-ended” intervals.
✦ It can also be used to define the middle of a number of objects, properties, or quantities which are not really quantitative in nature.
✦ It can be easily interpreted.

Formula for Median:
✦ For Ungrouped Data
1. Arrange the data from lowest to highest (or highest to lowest).
2. For an odd number of data, the median of a data set is the “middle observation”. When the number of data is even, the median is the “average of the two middle scores”.
✦ For Grouped Data
$\tilde{x} = LB + \left(\frac{\frac{n}{2} - {<}cf}{f}\right) i$
where:
LB = lower boundary of the median class
i = class width
n = no. of observations
<cf = less than cumulative frequency of the class preceding the median class
f = frequency of the median class
Measures of Central Tendency: MODE
• It is the most frequently occurring value in a list of data.
• It is sometimes called the nominal average.
• It is an appropriate measure of average for data using the nominal scale of measurement.
• It is the only measure of central tendency used in both quantitative and qualitative data.

Advantages of Mode
✦ The mode is easy to understand.
✦ Like the median, it is not greatly affected by extreme values.
✦ Like the median, it can be computed even when the frequency distribution contains “open-ended” intervals.

Formula for Mode:
✦ For Ungrouped Data
1. Obtain a frequency distribution of the distinct values of the data.
2. The mode is the most frequently occurring data value (if there is one).
✦ For Grouped Data
$\hat{x} = LB + \left(\frac{d_1}{d_1 + d_2}\right) i$
where:
LB = lower boundary of the modal class
i = class width
d1 = difference between the frequency of the modal class and the class preceding it
d2 = difference between the frequency of the modal class and the class following it
Remember!
• Whenever you hear the word average, be aware that the word may not always be referring to the mean. One average could be used to support one position, while another average could be used to support a different position.
• Unlike the mean and median, the mode is not always present in a data set.
• If you are interested in the “center of gravity” of your data, then use the mean; if you are interested in the “middle value” within your data, then use the median.

Choosing a Measure of Central Tendency:
We have discussed three measures of central tendency (the mode, the mean, and the median) and examined how they differ in terms of finding the center of a data distribution. The next legitimate question to ask may be “When do we use which measure?”

Consider the following data sets:

Data Set I    108 112 116 120 124
Data Set II   108 112 116 120 205

Determine the mean, median and mode.

In both data sets, the median is 116, as it is the number that divides the data set into two exact halves. However, you will notice that the mean is not identical in both data sets. For the first data set, the mean is equal to 116, whereas the mean of the second data set is equal to 132.2 (661 ÷ 5).

Notice how the mean of the second data set has been influenced by the presence of an unusual case/outlier in the data set. If we were to say the mean is equal to 132.2 for the second data set and that it represents a typical case, this would not make much sense because the majority of data values are 120 or less. Therefore, the mean should not be used when unusual, or outlying, data values are present in the data set, as the mean tends to be extremely sensitive to the unusual values. Rather, the median should be reported in this case.

• The mode is simply the most frequently occurring data value in the data set. Therefore, it is mainly useful for the nominal level of measurement. Both median and mean are useful when the variable being measured can be quantified. Also, neither data set has a mode, which is why the mode is not an appropriate measure to use for these data sets.
• It is better to use the median than the mean when the sample is small or asymmetrical (i.e., skewed) and unusual cases/outliers are present in the data set. This is why housing prices are usually reported with the median, since even one million-dollar house can distort the average housing price when most of the houses are in the Php500,000–Php650,000 range.
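The effect of a single outlier on the mean can be checked with a short script. This is only an illustrative sketch that uses Python's standard `statistics` module in place of Excel; the variable names are assumptions made for the example.

```python
from statistics import mean, median, multimode

data_set_1 = [108, 112, 116, 120, 124]
data_set_2 = [108, 112, 116, 120, 205]  # same values, except for one outlier

for name, data in (("Data Set I", data_set_1), ("Data Set II", data_set_2)):
    modes = multimode(data)
    # multimode() returns every value when all values occur equally often;
    # the text treats that situation as "no mode".
    has_mode = len(modes) < len(set(data))
    print(name,
          "| mean =", mean(data),
          "| median =", median(data),
          "| mode =", modes if has_mode else "none")
```

The median is identical for both lists, while the mean moves toward the outlying value 205.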

Example:
The data given below are the ages of the residents in Barangay 634, Sta. Mesa, Manila. Compute the mean, median and mode.

Class Interval   Frequency
55 - 59          3
50 - 54          6
45 - 49          7
40 - 44          9
35 - 39          6
30 - 34          4
25 - 29          5

Solution:
To compute the mean of grouped data, first you need to fill out a table with the columns Class Interval, Frequency (f), x and fx, together with the totals $n = \sum f$ and $\sum_{i=1}^{7} f x_i$.

x is the midpoint of every class interval. To compute this:
$x = \frac{\text{lower class limit} + \text{upper class limit}}{2}$
Ex:
$x = \frac{55 + 59}{2} = 57$
$x = \frac{50 + 54}{2} = 52$
Solution:

Class Interval   Frequency (f)   x    fx
55 - 59          3               57   171
50 - 54          6               52   312
45 - 49          7               47   329
40 - 44          9               42   378
35 - 39          6               37   222
30 - 34          4               32   128
25 - 29          5               27   135
Total            n = 40               $\sum_{i=1}^{7} f x_i$ = 1,675

$\bar{x} = \frac{\sum_{i=1}^{7} f x_i}{n} = \frac{1,675}{40} = 41.88$

The average age is 41.88.

Solution:
To compute the median and mode of grouped data, first you need to fill out a table with the columns Class Interval, f, LB and <cf.

To compute the lower boundary (LB), always subtract 0.5 from the lower class limit (LC).
Ex:
55 − 0.5 = 54.5
50 − 0.5 = 49.5
45 − 0.5 = 44.5

Solution:

Class Interval   f    LB     <cf
55 - 59          3    54.5   40
50 - 54          6    49.5   37
45 - 49          7    44.5   31
40 - 44          9    39.5   24
35 - 39          6    34.5   15
30 - 34          4    29.5   9
25 - 29          5    24.5   5
Total            n = 40

If the class intervals are arranged in descending order, always start the <cf column at the bottom part. Copy the frequency of the lowest class interval, then keep adding the frequency of the next class:
5 + 4 = 9, 9 + 6 = 15, 15 + 9 = 24, 24 + 7 = 31, 31 + 6 = 37, 37 + 3 = 40

First, compute $\frac{n}{2}$; it will help us to determine the median class and the <cf.
$\frac{n}{2} = \frac{40}{2} = 20$
The median class is the class containing the 20th item. Hence, the median class is 40 - 44.

$\tilde{x} = LB + \left(\frac{\frac{n}{2} - {<}cf}{f}\right) i = 39.5 + \frac{(20 - 15)5}{9} = 42.28$
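The grouped-data formulas for the mean, median and mode given earlier can be cross-checked with a short script. The following is a minimal sketch in Python rather than Excel; the function names and the `(lower limit, upper limit, frequency)` layout are assumptions made for this illustration, and the classes are listed from lowest to highest. Following the worked solutions, d1 is taken against the next lower class and d2 against the next higher class.

```python
# Age classes, lowest class first: (lower limit, upper limit, frequency)
classes = [(25, 29, 5), (30, 34, 4), (35, 39, 6), (40, 44, 9),
           (45, 49, 7), (50, 54, 6), (55, 59, 3)]

n = sum(f for _, _, f in classes)            # total number of observations
width = classes[0][1] - classes[0][0] + 1    # class width i (5 here)

def grouped_mean(classes):
    # midpoint x = (lower limit + upper limit) / 2, then sum(f * x) / n
    return sum(f * (lo + hi) / 2 for lo, hi, f in classes) / n

def grouped_median(classes):
    cf_before = 0                            # <cf of the preceding class
    for lo, hi, f in classes:
        if cf_before + f >= n / 2:           # first class whose <cf reaches n/2
            return (lo - 0.5) + (n / 2 - cf_before) / f * width
        cf_before += f

def grouped_mode(classes):
    freqs = [f for _, _, f in classes]
    # modal class = highest frequency; on a tie, take the highest class
    m = max(range(len(freqs)), key=lambda j: (freqs[j], j))
    below = freqs[m - 1] if m > 0 else 0
    above = freqs[m + 1] if m + 1 < len(freqs) else 0
    d1, d2 = freqs[m] - below, freqs[m] - above
    return (classes[m][0] - 0.5) + d1 / (d1 + d2) * width

print(round(grouped_mean(classes), 2),
      round(grouped_median(classes), 2),
      round(grouped_mode(classes), 2))
```

The sketch does not handle the degenerate case where all classes have the same frequency (d1 + d2 = 0).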

Solution:

Class Interval   f    LB     <cf
55 - 59          3    54.5   40
50 - 54          6    49.5   37
45 - 49          7    44.5   31
40 - 44          9    39.5   24
35 - 39          6    34.5   15
30 - 34          4    29.5   9
25 - 29          5    24.5   5

The modal class is the class interval with the highest frequency. The modal class is 40 - 44.
If two class intervals share the highest frequency, always choose the highest class interval.

$d_1 = 9 - 6 = 3$
$d_2 = 9 - 7 = 2$
$\hat{x} = LB + \left(\frac{d_1}{d_1 + d_2}\right) i = 39.5 + \left(\frac{3}{3 + 2}\right) 5 = 42.5$

Measures of Relative Position
Quantiles are statistics that describe various subdivisions of a frequency distribution into equal proportions.

Three special Quantiles:
1. Quartiles
2. Deciles
3. Percentiles
Quartiles - split the ordered data into four quarters.
Deciles - split the ordered data into ten equal parts.
Percentiles - split the ordered data into 100 equal parts.

Formula for Quartile:
✦ For Ungrouped Data
1. Arrange the data from lowest to highest. Then use this formula.
$Q_{class} = \frac{nk}{4} + 0.5$
2. If the resulting positioning point is an integer, the particular numerical observation corresponding to that point is chosen for the quartile. If not, use interpolation.
✦ For Grouped Data
$Q_k = LB + \left(\frac{\frac{nk}{4} - {<}cf}{f}\right) i$
where:
LB = lower boundary of the quartile class
i = class width
n = no. of observations
k = quartile position
<cf = less than cumulative frequency of the class preceding the quartile class
f = frequency of the quartile class
Formula for Decile:
✦ For Ungrouped Data
1. Arrange the data from lowest to highest. Then use this formula.
$D_{class} = \frac{nk}{10} + 0.5$
2. If the resulting positioning point is an integer, the particular numerical observation corresponding to that point is chosen for the decile. If not, use interpolation.
✦ For Grouped Data
$D_k = LB + \left(\frac{\frac{nk}{10} - {<}cf}{f}\right) i$
where:
LB = lower boundary of the decile class
i = class width
n = no. of observations
k = decile position
<cf = less than cumulative frequency of the class preceding the decile class
f = frequency of the decile class

Formula for Percentile:
✦ For Ungrouped Data
1. Arrange the data from lowest to highest. Then use this formula.
$P_{class} = \frac{nk}{100} + 0.5$
2. If the resulting positioning point is an integer, the particular numerical observation corresponding to that point is chosen for the percentile. If not, use interpolation.
✦ For Grouped Data
$P_k = LB + \left(\frac{\frac{nk}{100} - {<}cf}{f}\right) i$
where:
LB = lower boundary of the percentile class
i = class width
n = no. of observations
k = percentile position
<cf = less than cumulative frequency of the class preceding the percentile class
f = frequency of the percentile class
Example 1:
The data given below is the total number of hours lost due to tardiness and absences of employees in a company in a given year. Find Q3, D4 and P55.

Month       Hours Lost (x)
January     55
February    23
March       37
April       37
May         48
June        42
July        27
August      20
September   30
October     32
November    24
December    40

Solution: To compute Q3 of ungrouped data:
1. Arrange the data from lowest to highest.

Value      20  23  24  27  30  32  37  37  40  42  48  55
Position    1   2   3   4   5   6   7   8   9  10  11  12

$Q_{class} = \frac{(12)(3)}{4} + 0.5 = 9.5$

2. Use interpolation since the computed Qclass is not an integer. The 9th and 10th observations are 40 and 42.
$Q_3 = 40 + 0.5(42 - 40) = 41$
Solution: To compute D4 of ungrouped data:
1. Arrange the data from lowest to highest.

Value      20  23  24  27  30  32  37  37  40  42  48  55
Position    1   2   3   4   5   6   7   8   9  10  11  12

$D_{class} = \frac{(12)(4)}{10} + 0.5 = 5.3$

2. Use interpolation since the computed Dclass is not an integer. The 5th and 6th observations are 30 and 32.
$D_4 = 30 + 0.3(32 - 30) = 30.6$

Solution: To compute P55 of ungrouped data:
1. Arrange the data from lowest to highest (same ordered list as above).

$P_{class} = \frac{(12)(55)}{100} + 0.5 = 7.1$

2. Use interpolation since the computed Pclass is not an integer. The 7th and 8th observations are both 37.
$P_{55} = 37 + 0.1(37 - 37) = 37$
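The positioning-point rule for ungrouped data (position = nk/4, nk/10 or nk/100, plus 0.5, followed by interpolation) can be sketched in a few lines of Python. The function name `quantile_by_position` and the clamping of the position to the range 1..n are assumptions made for this illustration; they are not part of the original procedure.

```python
import math

def quantile_by_position(data, k, parts):
    """k-th quantile when the ordered data are split into `parts` equal parts
    (4 = quartiles, 10 = deciles, 100 = percentiles)."""
    values = sorted(data)                    # step 1: lowest to highest
    n = len(values)
    position = n * k / parts + 0.5           # 1-based positioning point
    position = min(max(position, 1), n)      # assumption: stay inside the data
    lower = math.floor(position)
    fraction = position - lower
    if fraction == 0:
        return values[lower - 1]             # integer position: take that value
    # step 2: interpolate between the two neighbouring ordered observations
    return values[lower - 1] + fraction * (values[lower] - values[lower - 1])

hours_lost = [55, 23, 37, 37, 48, 42, 27, 20, 30, 32, 24, 40]
q3 = quantile_by_position(hours_lost, 3, 4)
d4 = quantile_by_position(hours_lost, 4, 10)
p55 = quantile_by_position(hours_lost, 55, 100)
print(q3, round(d4, 2), p55)
```

Other software (including Excel's built-in quartile and percentile functions) may use a different positioning rule, so results can differ slightly from this method.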
Example 2:
The data given below are the ages of the residents in Barangay 634, Sta. Mesa, Manila. Compute Q1, D7, and P10.

Class Interval   Frequency
55 - 59          3
50 - 54          6
45 - 49          7
40 - 44          9
35 - 39          6
30 - 34          4
25 - 29          5

Solution:
To compute Q1, D7, and P10 of grouped data, first you need to fill out a table with the columns Class Interval, f, LB and <cf.

To compute the lower boundary (LB), always subtract 0.5 from the lower class limit (LC).
Ex:
55 − 0.5 = 54.5
50 − 0.5 = 49.5
45 − 0.5 = 44.5


Class Interval   f    LB     <cf
55 - 59          3    54.5   40
50 - 54          6    49.5   37
45 - 49          7    44.5   31
40 - 44          9    39.5   24
35 - 39          6    34.5   15
30 - 34          4    29.5   9
25 - 29          5    24.5   5
Total            n = 40

If the class intervals are arranged in descending order, always start the <cf column at the bottom part. Copy the frequency of the lowest class interval, then keep adding the frequency of the next class:
5 + 4 = 9, 9 + 6 = 15, 15 + 9 = 24, 24 + 7 = 31, 31 + 6 = 37, 37 + 3 = 40

First, compute $\frac{nk}{4}$; it will help us to determine the quartile class and the <cf.
$\frac{nk}{4} = \frac{(40)(1)}{4} = 10$
The quartile class is the class containing the 10th item. Hence, the quartile class is 35 - 39.

$Q_k = LB + \left(\frac{\frac{nk}{4} - {<}cf}{f}\right) i$
$Q_1 = 34.5 + \frac{(10 - 9)5}{6} = 35.33$

Solution (D7):

Class Interval   f    LB     <cf
55 - 59          3    54.5   40
50 - 54          6    49.5   37
45 - 49          7    44.5   31
40 - 44          9    39.5   24
35 - 39          6    34.5   15
30 - 34          4    29.5   9
25 - 29          5    24.5   5
Total            n = 40

First, compute $\frac{nk}{10}$; it will help us to determine the decile class and the <cf.
$\frac{nk}{10} = \frac{(40)(7)}{10} = 28$
The decile class is the class containing the 28th item. Hence, the decile class is 45 - 49.

$D_k = LB + \left(\frac{\frac{nk}{10} - {<}cf}{f}\right) i$
$D_7 = 44.5 + \frac{(28 - 24)5}{7} = 47.36$

Solution (P10):
Using the same table, first compute $\frac{nk}{100}$; it will help us to determine the percentile class and the <cf.
$\frac{nk}{100} = \frac{(40)(10)}{100} = 4$
The percentile class is the class containing the 4th item. Hence, the percentile class is 25 - 29. Since this is the lowest class, the <cf of the preceding class is 0.

$P_k = LB + \left(\frac{\frac{nk}{100} - {<}cf}{f}\right) i$
$P_{10} = 24.5 + \frac{(4 - 0)5}{5} = 28.5$

Example 3:
The ages of the townspeople in a certain community are as follows:

Class Interval   Frequency
18 - 24          28
25 - 31          54
32 - 38          38
39 - 45          20
46 - 52          17
53 - 59          3

Find Q2, D5, and P50.

Solution:
To compute Q2, D5, and P50 of grouped data, first you need to fill out a table with the columns Class Interval, f, LB and <cf.

To compute the lower boundary (LB), always subtract 0.5 from the lower class limit (LC).
Ex:
18 − 0.5 = 17.5
25 − 0.5 = 24.5
32 − 0.5 = 31.5

Class Interval   f     LB     <cf
18 - 24          28    17.5   28
25 - 31          54    24.5   82
32 - 38          38    31.5   120
39 - 45          20    38.5   140
46 - 52          17    45.5   157
53 - 59          3     52.5   160
Total            n = 160

If the class intervals are arranged in ascending order, always start the <cf column at the upper part. Copy the frequency of the lowest class interval, then keep adding the frequency of the next class:
28 + 54 = 82, 82 + 38 = 120, 120 + 20 = 140, 140 + 17 = 157, 157 + 3 = 160

First, compute $\frac{nk}{4}$; it will help us to determine the quartile class and the <cf.
$\frac{nk}{4} = \frac{(160)(2)}{4} = 80$
The quartile class is the class containing the 80th item. Hence, the quartile class is 25 - 31.

$Q_k = LB + \left(\frac{\frac{nk}{4} - {<}cf}{f}\right) i$
$Q_2 = 24.5 + \frac{(80 - 28)7}{54} = 31.24$

Solution (D5):

Class Interval   f     LB     <cf
18 - 24          28    17.5   28
25 - 31          54    24.5   82
32 - 38          38    31.5   120
39 - 45          20    38.5   140
46 - 52          17    45.5   157
53 - 59          3     52.5   160
Total            n = 160

First, compute $\frac{nk}{10}$; it will help us to determine the decile class and the <cf.
$\frac{nk}{10} = \frac{(160)(5)}{10} = 80$
The decile class is the class containing the 80th item. Hence, the decile class is 25 - 31.

$D_k = LB + \left(\frac{\frac{nk}{10} - {<}cf}{f}\right) i$
$D_5 = 24.5 + \frac{(80 - 28)7}{54} = 31.24$

Solution (P50):
Using the same table, first compute $\frac{nk}{100}$; it will help us to determine the percentile class and the <cf.
$\frac{nk}{100} = \frac{(160)(50)}{100} = 80$
The percentile class is the class containing the 80th item. Hence, the percentile class is 25 - 31.

$P_k = LB + \left(\frac{\frac{nk}{100} - {<}cf}{f}\right) i$
$P_{50} = 24.5 + \frac{(80 - 28)7}{54} = 31.24$
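Because the quartile, decile and percentile formulas for grouped data differ only in the divisor (4, 10 or 100), one helper can cover all three. The sketch below is an illustration in Python, not part of the Excel workflow; `grouped_quantile` and the `(lower limit, upper limit, frequency)` layout are assumed names, and the classes must be listed from lowest to highest.

```python
def grouped_quantile(classes, k, parts):
    """classes: (lower limit, upper limit, frequency), lowest class first.
    parts: 4 for quartiles, 10 for deciles, 100 for percentiles."""
    n = sum(f for _, _, f in classes)
    target = n * k / parts                   # nk/4, nk/10 or nk/100
    cf_before = 0                            # <cf of the preceding class
    for lo, hi, f in classes:
        if f > 0 and cf_before + f >= target:
            width = hi - lo + 1              # class width i
            lower_boundary = lo - 0.5        # LB = lower class limit - 0.5
            return lower_boundary + (target - cf_before) / f * width
        cf_before += f
    raise ValueError("k must not exceed parts")

# Ages of the townspeople (18 - 24 up to 53 - 59), n = 160
ages = [(18, 24, 28), (25, 31, 54), (32, 38, 38),
        (39, 45, 20), (46, 52, 17), (53, 59, 3)]

q2 = grouped_quantile(ages, 2, 4)
d5 = grouped_quantile(ages, 5, 10)
p50 = grouped_quantile(ages, 50, 100)
print(round(q2, 2), round(d5, 2), round(p50, 2))
```

Q2, D5 and P50 all locate the same point (the median), so the three calls return the same value.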

Sample Interpretation:
1. Jennifer just received the results of her SAT exam. Her SAT Mathematics score of 600 is in the 74th percentile. What does this mean?
A percentile rank of 74% means that 74% of SAT Mathematics scores are less than or equal to 600 and 26% of the scores are greater. So 26% of the students who took the exam scored better than Jennifer.
2. The time taken to finish a test is 35 minutes. This time was the first quartile. What does this mean?
25% of the learners finished the exam in 35 minutes or less, and 75% of the learners finished the exam in more than 35 minutes.

Measures of Dispersion/Variability
Based on the figure below, determine which of the two scatter diagrams illustrates larger variability.
[Figure 1 and Figure 2: two scatter diagrams]
Since the data points in Figure 2 are more scattered than the data points in Figure 1, the data set depicted in Figure 2 is more varied.
Measures of Dispersion/Variability: RANGE
It is the difference between the largest and the smallest observations or items in a set of data.
$R = X_{max} - X_{min}$

Range is simple to calculate. However, we should be cautious about using range as a measure of variability.
Range is a very crude measure of variability as it only uses the highest and lowest values in its computation. Therefore, it does not accurately capture information about how the data values in the set differ if the data set contains unusual cases/outliers.

Measures of Dispersion/Variability: STANDARD DEVIATION
• It is a measure of how far away items in a data set are from the mean.
• The larger the standard deviation, the more variation there is in the data set.
• The standard deviation can never be a negative number, due to the way it’s calculated and the fact that it measures a distance (distances are never negative numbers).
• The smallest possible value for the standard deviation is 0, and that happens only in contrived situations where every single number in the data set is exactly the same (no deviation).
Formula for Standard Deviation:

Sample Standard Deviation
✦ For Ungrouped Data
$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
where: x_i = data values, x̄ = mean, n = no. of sample observations
✦ For Grouped Data
$s = \sqrt{\frac{\sum_{i=1}^{r} f (x_i - \bar{x})^2}{n - 1}}$
where: x_i = data values, x̄ = mean, f = frequency, n = no. of sample observations

Population Standard Deviation
✦ For Ungrouped Data
$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$
where: x_i = data values, μ = mean, N = no. of observations
✦ For Grouped Data
$\sigma = \sqrt{\frac{\sum_{i=1}^{r} f (x_i - \mu)^2}{N}}$
where: x_i = data values, μ = mean, f = frequency, N = no. of observations

Measures of Dispersion/Variability: VARIANCE
It takes all data points in a set into account and is calculated by averaging the squared deviations of the data values from the mean.
Variance is not easy to read because it is expressed in squared units and hence is not easily interpretable. The standard deviation, however, is in the same units as the mean, so we can easily understand the spread of the data.
Formula for Variance:

Sample Variance
✦ For Ungrouped Data
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
where: x_i = data values, x̄ = mean, n = no. of sample observations
✦ For Grouped Data
$s^2 = \frac{\sum_{i=1}^{r} f (x_i - \bar{x})^2}{n - 1}$
where: x_i = data values, x̄ = mean, f = frequency, n = no. of sample observations

Population Variance
✦ For Ungrouped Data
$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
where: x_i = data values, μ = mean, N = no. of observations
✦ For Grouped Data
$\sigma^2 = \frac{\sum_{i=1}^{r} f (x_i - \mu)^2}{N}$
where: x_i = data values, μ = mean, f = frequency, N = no. of observations

Example 1:
The data given below are the ages of the residents in Barangay 634, Sta. Mesa, Manila. Compute the sample standard deviation and sample variance.

Class Interval   Frequency
55 - 59          3
50 - 54          6
45 - 49          7
40 - 44          9
35 - 39          6
30 - 34          4
25 - 29          5
Solution:
To compute the SD and Var of grouped data, first you need to fill out this table.

Class Interval   f    x    fx     (xi − x̄)²   f(xi − x̄)²
55 - 59          3    57   171    228.61      685.83
50 - 54          6    52   312    102.41      614.46
45 - 49          7    47   329    26.21       183.47
40 - 44          9    42   378    0.01        0.09
35 - 39          6    37   222    23.81       142.86
30 - 34          4    32   128    97.61       390.44
25 - 29          5    27   135    221.41      1107.05
Total            n = 40    1,675              3,124.20

The mean is needed first:
$\bar{x} = \frac{\sum_{i=1}^{7} f x_i}{n} = \frac{1,675}{40} = 41.88$

Squared deviations, for example:
$(x_1 - \bar{x})^2 = (57 - 41.88)^2 = 228.61$
$(x_2 - \bar{x})^2 = (52 - 41.88)^2 = 102.41$
$(x_3 - \bar{x})^2 = (47 - 41.88)^2 = 26.21$

Weighted squared deviations, for example:
$f(x_1 - \bar{x})^2 = 3(228.61) = 685.83$
$f(x_2 - \bar{x})^2 = 6(102.41) = 614.46$
$f(x_3 - \bar{x})^2 = 7(26.21) = 183.47$

Sample standard deviation:
$s = \sqrt{\frac{\sum_{i=1}^{7} f (x_i - \bar{x})^2}{n - 1}} = \sqrt{\frac{3,124.20}{40 - 1}} = 8.95$

Sample variance:
$s^2 = \frac{\sum_{i=1}^{7} f (x_i - \bar{x})^2}{n - 1} = \frac{3,124.20}{40 - 1} = 80.11$
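The grouped variance and standard deviation can be verified with a short Python sketch. The data layout and variable names below are assumptions made for the illustration. The script keeps the unrounded mean (41.875), so its sum of squares differs slightly from the hand computation that uses 41.88, but the results agree to two decimal places.

```python
import math

# Age classes: (lower limit, upper limit, frequency)
classes = [(25, 29, 5), (30, 34, 4), (35, 39, 6), (40, 44, 9),
           (45, 49, 7), (50, 54, 6), (55, 59, 3)]

freqs = [f for _, _, f in classes]
midpoints = [(lo + hi) / 2 for lo, hi, _ in classes]   # x = class midpoint
n = sum(freqs)

grouped_mean = sum(f * x for f, x in zip(freqs, midpoints)) / n

# sum of f * (x - mean)^2 over all classes
sum_sq = sum(f * (x - grouped_mean) ** 2 for f, x in zip(freqs, midpoints))

sample_variance = sum_sq / (n - 1)       # divide by n - 1 for a sample
sample_sd = math.sqrt(sample_variance)

print(round(sample_variance, 2), round(sample_sd, 2))
```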
How to interpret variance and standard deviation?
Consider the following data set of toddler weights (in pounds) taken in an outpatient clinic:

Data Set   15 13 20 19 14

Computed variance for this data set is 9.7.
Computed standard deviation for this data set is 3.11.
What does this mean?

The variance is hard to interpret directly as a measure of variability. The values represent weights measured in pounds taken from five toddlers. Because the deviation of each observation from the mean has been squared, the unit for the variance is now (pound)². What does (pound)² mean? If we were to say that the data values differ from the mean on average by about 9.7 (pound)², would this claim make sense? Probably not, since there is no such unit as a (pound)².

Why do we then take the square of the deviation if the (unit)² will not make sense to interpret at the end? The answer is simple: if you do not square the deviations and simply sum them, they will always add up to zero no matter what data set you work with.

$\sum_{i=1}^{n} (x_i - \bar{x}) = 0 \quad \rightarrow \quad \sum_{i=1}^{n} (x_i - \bar{x})^2 \neq 0$

How can we then talk about variability if the measure of variability comes out to be equal to zero? This is why we take the square of the deviation to compute the variance first and then take the square root of it to compute the standard deviation, bringing us back to the original unit of measurement.

We get the standard deviation of 3.11 by taking the square root of 9.7; we can then say that the data values differ from the mean (16.2 lbs.) by about 3.11 pounds on average. We can interpret this finding to mean that, on average, the weights fall between 13.09 and 19.31 pounds. This makes more sense when you look at the data set, compared to the variance. Note that the mean and standard deviation should always be reported together!

16.2 − 3.11 = 13.09
16.2 + 3.11 = 19.31

Choosing a Measure of Dispersion/Variability:
We have discussed four measures of dispersion/variability (the range, the interquartile range, the variance, and the standard deviation) and examined how they differ. The next legitimate question to ask may be “When do we use which measure?”

You should use the range only as a crude measure, since it is extremely sensitive to unusual values in the data set. The interquartile range is not as sensitive to unusual data values, whereas the standard deviation is very sensitive to unusual values. Therefore, the interquartile range should be used with the median when the data contain unusual data values. However, the standard deviation should be used with the mean when the data are free of unusual data values.
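The toddler-weight figures can be reproduced with Python's standard `statistics` module. This is a small illustrative sketch (the variable names are assumptions); it also shows that the unsquared deviations sum to zero, up to floating-point rounding.

```python
from statistics import mean, stdev, variance

weights = [15, 13, 20, 19, 14]               # toddler weights in pounds

x_bar = mean(weights)
deviations = [x - x_bar for x in weights]

print("mean:", x_bar)
print("sum of deviations:", round(sum(deviations), 10))   # always (about) zero
print("sample variance:", variance(weights))              # in squared pounds
print("sample standard deviation:", round(stdev(weights), 2))  # in pounds
print("mean - s, mean + s:",
      round(x_bar - stdev(weights), 2), round(x_bar + stdev(weights), 2))
```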
Shape of Distribution
Skewness and kurtosis are two statistics that give you insights into the shape of the distribution.
✦ Skewness is the degree of distortion from the symmetrical bell curve or the normal distribution. It measures the lack of symmetry in data distribution.
✦ Kurtosis is a measure of the combined sizes of the two tails. It tells you how tall and sharp the central peak is, relative to a standard bell curve.

Skewness
A symmetrical distribution will have a skewness of 0. So, a normal distribution will have a skewness of 0.
In a symmetrical distribution, the mean, median and mode are equal to each other and the ordinate at the mean divides the distribution into two equal parts.

There are two types of Skewness:
• Negatively Skewed/Skewed Left is when the tail on the left side of the distribution is longer or fatter than the tail on the right side. The mean and median will be less than the mode.
• Positively Skewed/Skewed Right is when the tail on the right side of the distribution is longer or fatter. The mean and median will be greater than the mode.

[Figure: three curves labeled Skewness < 0, Skewness > 0, and Skewness = 0]

Karl Pearson’s Measure of Skewness
Notice that the mean, median and mode are not equal in a skewed distribution.
Karl Pearson's measure of skewness is based upon the divergence of the mean from the mode in a skewed distribution. Karl Pearson’s Coefficient of Skewness (Sk) is given by
$S_k = \frac{\bar{x} - \hat{x}}{s}$
where:
x̄ is the mean
x̂ is the mode
s is the sample standard deviation
So far we have seen that Sk depends on the mode. If the mode is not defined for a distribution, we cannot find Sk. But the empirical relation between mean, median and mode states that, for a moderately skewed distribution, we have
Mean − Mode ≈ 3(Mean − Median)
Hence Karl Pearson's coefficient of skewness is defined in terms of the median as
$S_k = \frac{3(\bar{x} - \tilde{x})}{s}$
where:
x̄ is the mean
x̃ is the median
s is the sample standard deviation

Kurtosis
It is actually a measure of the outliers present in the distribution. The outliers in a sample, therefore, have even more effect on the kurtosis than they do on the skewness.
Higher kurtosis means more of the variance is the result of infrequent extreme deviations, as opposed to frequent modestly sized deviations. In other words, it’s the tails that mostly account for kurtosis, not the central peak.
The kurtosis decreases as the tails become lighter. It increases as the tails become heavier.
• Mesokurtic (Kurtosis = 3): This distribution has a kurtosis statistic similar to that of the normal distribution.
• Leptokurtic (Kurtosis > 3): The peak is higher and sharper than that of the normal distribution, which means that the data are heavy-tailed or have a profusion of outliers.
• Platykurtic (Kurtosis < 3): Compared to a normal distribution, its tails are shorter and thinner, and often its central peak is lower and broader.

Percentile Coefficient of Kurtosis
A measure of kurtosis based on quartiles and percentiles is
$k = \frac{QD}{P_{90} - P_{10}}$
where QD is the semi-interquartile range,
$QD = \frac{Q_3 - Q_1}{2}$
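Both shape measures defined above, Karl Pearson's median-based coefficient of skewness and the percentile coefficient of kurtosis, can be sketched in Python. The function names are assumptions made for this illustration, and the percentiles are located with the positioning rule used earlier for ungrouped data (nk/100 + 0.5, with interpolation); clamping the position to 1..n is an added assumption for very small samples.

```python
import math
from statistics import mean, median, stdev

def pearson_skewness(data):
    """Median-based coefficient: Sk = 3(mean - median) / s."""
    return 3 * (mean(data) - median(data)) / stdev(data)

def percentile(data, p):
    """p-th percentile via the positioning point n*p/100 + 0.5."""
    values = sorted(data)
    n = len(values)
    position = min(max(n * p / 100 + 0.5, 1), n)   # assumption: clamp to 1..n
    lower = math.floor(position)
    fraction = position - lower
    if fraction == 0:
        return values[lower - 1]
    return values[lower - 1] + fraction * (values[lower] - values[lower - 1])

def percentile_kurtosis(data):
    """k = QD / (P90 - P10), where QD = (Q3 - Q1) / 2."""
    qd = (percentile(data, 75) - percentile(data, 25)) / 2
    return qd / (percentile(data, 90) - percentile(data, 10))

symmetric = [108, 112, 116, 120, 124]        # mean equals median
with_outlier = [108, 112, 116, 120, 205]     # one large value on the right
hours_lost = [55, 23, 37, 37, 48, 42, 27, 20, 30, 32, 24, 40]

print(pearson_skewness(symmetric))           # zero: no skew
print(pearson_skewness(with_outlier) > 0)    # positive: skewed right
print(round(percentile_kurtosis(hours_lost), 4))
```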

How to Calculate Measures of Central Tendency, Measures of Variation, Skewness and Kurtosis for Ungrouped and Sample Data Using Excel?

Example:
The data given below are the scores of randomly selected applied statistics undergraduate students in Section A and Section B. Compare the scores of Section A and Section B based on measures of central tendency and measures of variation, and determine which section performed better in their final examination. Also, describe the shape of the distribution of these two data sets using skewness and kurtosis.

Data Set A   40 38 42 40 39 39 43 40 39 40
Data Set B   46 37 40 33 42 36 40 47 34 45

1. Click “DATA” on the menu bar and click “DATA ANALYSIS” on the toolbar. The dialog box will appear.
2. Select “Descriptive Statistics” then click “OK”.
3. Highlight your data for the “INPUT RANGE” and check the box “LABELS IN FIRST ROW”.
4. Check “Summary statistics” and then click “OK”. Repeat the process for Data Set B.
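Outside Excel, the main entries of the summary-statistics output can be cross-checked with Python's standard `statistics` module. This is only a sketch for comparison; the `summary` helper and its dictionary keys are names assumed for the illustration, and it covers only the measures of central tendency and variation.

```python
from statistics import mean, median, mode, stdev, variance

section_a = [40, 38, 42, 40, 39, 39, 43, 40, 39, 40]
section_b = [46, 37, 40, 33, 42, 36, 40, 47, 34, 45]

def summary(scores):
    """A few of the descriptive statistics for one sample."""
    return {
        "mean": mean(scores),
        "median": median(scores),
        "mode": mode(scores),
        "sample variance": variance(scores),      # divides by n - 1
        "standard deviation": stdev(scores),      # sample standard deviation
        "range": max(scores) - min(scores),
    }

for name, scores in (("Section A", section_a), ("Section B", section_b)):
    print(name)
    for label, value in summary(scores).items():
        print(f"  {label}: {round(value, 4)}")
```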
When comparing distributions, it is better to use a measure of variation/dispersion in addition to a measure of central tendency. Because Data Set A and Data Set B have the same values for the measures of central tendency in this example, we will use only the measures of variation/dispersion to compare the two data sets.

Based on the result, Data Set B has larger variability since it has larger values for the different measures of variation. This means that Data Set B is much more spread out than Data Set A.

In this example, we want a data set with a large mean value and a small standard deviation so we can say that this is the section that performed better. Section A and Section B have the same mean value, but Section A has a smaller standard deviation than Section B; therefore, Section A performed better in their final examination.

In terms of the shape of the distribution, these two data sets have the same shape in terms of skewness and kurtosis. The result shows that Data Set A and Data Set B are both platykurtic and skewed to the right.

Normal Distribution
✦ The normal distribution is sometimes called the bell curve because the graph of its probability density looks like a bell.
✦ It is also known as the Gaussian distribution, after the German mathematician Carl Friedrich Gauss who first described it.
✦ It is a probability function that describes how the values of a variable are distributed.
Normal Curve
[Figure: histogram with a red normal curve overlaid; horizontal axis marked 50, 100, 150]
The red curve is a model called the normal curve, which is used to describe continuous random variables that are said to be normally distributed.
A continuous random variable is normally distributed, or has a normal probability distribution, if its relative frequency histogram has the shape of a normal curve.

No data will ever be exactly/perfectly normally distributed in reality. If so, how do we know whether or not a collected data set is normally distributed?
We can begin with a visual display of the data in a histogram to see if the data set is normally distributed. However, a visual check, alone, may not be sufficient to know whether the data are normally distributed. There are statistical measures, skewness and kurtosis, which, along with a histogram, allow us to determine whether the set is normally distributed.
Why is it important to know if the data follows a normal distribution?
The most important reason is that many human characteristics fall into an approximately normal distribution and that the measurement scores are assumed to be normally distributed when running most statistical analyses. Therefore, the statistical results you get at the end may not be trustworthy if the variable is not normally distributed.

Properties of Normal Curve
1. The normal curve is bell-shaped and symmetric about the mean, μ.
2. Because mean, median and mode are equal, the normal curve has a single peak and the highest point occurs at x = μ.
3. The normal curve has inflection points at μ − σ and μ + σ.
[Figure: normal curve with the inflection points marked at μ − σ and μ + σ]
[Figure panels: μ1 = μ2, σ1 < σ2; μ1 < μ2, σ1 < σ2]

Properties of Normal Curve
4. The area under the normal curve is 1.
5. The area under the normal curve to the right of μ equals the area under the curve to the left of μ, which equals 0.50.
6. The normal curve approaches, but never touches, the x-axis as it extends farther and farther away from the mean.
[Figure: normal curve with total area = 1, split into 0.50 on each side of the mean]

Mean:
✦ Changing the mean shifts the entire curve left or right on the X-axis.
Standard Deviation:
✦ Changing the standard deviation either tightens or spreads out the width of the distribution along the X-axis. Larger standard deviations produce distributions that are more spread out.
[Figure panel: μ1 < μ2, σ1 = σ2]
Determine whether each graph represents a normal curve.
[Figure: four graphs labeled A, B, C and D]
None of them represents a normal curve.

Role of Area under a Normal Curve
Suppose that a random variable X is normally distributed with mean μ and standard deviation σ. The area under the normal curve for any interval of values of the random variable X represents either
✦ the proportion of the population with the characteristic described by the interval of values or
✦ the probability that a randomly selected individual from the population will have the characteristic described by the interval of values.
Standard Normal Distribution
A normal random variable having mean value μ = 0 and standard deviation σ = 1 is called a standard normal random variable, and its density curve is called the standard normal curve. It will always be denoted by the letter Z.

Standardizing a Normal Random Variable
The normal random variable of a standard normal distribution is called a standard score or a z-score. Every normal random variable X can be transformed into a z-score via the following equation:
z = (x − μ) / σ
where X is a normal random variable, μ is the mean of X, and σ is the standard deviation of X.
Probabilities for a standard normal random variable are computed using the Standard Normal Distribution Table, which shows the cumulative probability associated with a particular z-score.
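The standardizing equation can be sketched in a few lines of Python. This is only an illustration: the function name `z_score` and the sample values (mean 60, standard deviation 5) are assumptions made for the example and do not come from the module's data.

```python
def z_score(x, mu, sigma):
    """Standardize x from a normal distribution with mean mu and standard deviation sigma."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return (x - mu) / sigma


# Illustrative values only (assumed, not taken from the module)
print(z_score(70, 60, 5))   # a score two standard deviations above the mean
print(z_score(55, 60, 5))   # a score one standard deviation below the mean
```

A positive result means the score lies above the mean and a negative result means it lies below.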
Standard Normal Distribution Table 1 (Positive Side P(Z < z))
Remember!
Positive z-scores indicate how far above the mean a score falls, and negative z-scores indicate how far below the mean a score falls. Whether positive or negative, z-scores that are larger in absolute value mean that scores are farther from the mean, and smaller ones mean that scores are closer to the mean.

Standard Normal Distribution Table 2 (Negative Side P(Z < − z))

Patterns for Finding Areas under a Standard Normal Curve Using Table 1
(Area denotes the table entry for the positive z value of the same size.)
A. Area to the right of a negative z value or to the left of a positive z value: use Table 1 directly.
B. Area between z values on either side of 0: (area to the left of z2) − (area to the left of z1), where the area to the left of the negative value z1 is 1 − Area.
C. Area between z values on the same side of 0: (area to the left of z2) − (area to the left of z1); for negative z values each area to the left is 1 − Area.
D. Area to the right of a positive z value or to the left of a negative z value: 1 − Area.
E. Area between a given z value and 0: Area − 0.50.

Patterns for Finding Areas under a Standard Normal Curve Using Table 2
A. Area to the right of a positive z value or to the left of a negative z value: use Table 2 directly.
B. Area between z values on the same side of 0: the larger table area minus the smaller table area.
C. Area between z values on either side of 0: (0.50 − Area for z1) + (0.50 − Area for z2).
Patterns for Finding Areas under a Standard Normal Curve Using Table 2
D. Area to the right of a negative z value or to the left of a positive z value: (0.50 − Area) + 0.50.
E. Area between a given z value and 0: 0.50 − Area.

Example 1:
Scores on a standardized college entrance examination (CEE) are normally distributed with mean 510 and standard deviation 60. A selective university considers for admission only applicants with CEE scores over 560. Find the proportion of all individuals who took the CEE who meet the university's CEE requirement for consideration for admission.
Solution:
Given: μ = 510, σ = 60 and x = 560
Step 1: Draw a normal curve and shade the desired area.
Area = P(X > 560)
[Figure: normal curve for X with tick marks at 450, 510 and 570 and the area to the right of 560 shaded]
Using Table 1 (By-hand Approach)
Step 2: Convert the value of x to a z-score.
P(X > 560) = P(Z > (560 − 510)/60)
= P(Z > 0.83)
= 1 − P(Z ≤ 0.83)   (use the Complement Rule and determine one minus the area)
= 1 − 0.7967
= 0.2033

Using Table 2 (By-hand Approach)
Step 2: Convert the value of x to a z-score.
P(X > 560) = P(Z > (560 − 510)/60)
= P(Z > 0.83)
= 0.2033

[Figure: standard normal curve, Z axis from −2 to 2, area to the right of 0.83 shaded; Area = P(Z > 0.83) = 0.2033]
The proportion of all CEE scores that exceed 560 is 0.2033 or 20.33%.
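As a cross-check on the by-hand work, the same right-tail area can be sketched with Python's standard library. The helper name `right_tail_area` and its `z_decimals` parameter are assumptions introduced for this illustration; rounding z to two decimals mimics reading a printed table, so the result should agree with the table answer of about 0.2033. With z left unrounded the area comes out slightly smaller, at roughly 0.2023.

```python
from statistics import NormalDist


def right_tail_area(x, mu, sigma, z_decimals=2):
    """P(X > x) for X ~ Normal(mu, sigma), with z rounded as a printed table would."""
    z = round((x - mu) / sigma, z_decimals)
    # Complement rule: P(Z > z) = 1 - P(Z <= z)
    return 1 - NormalDist().cdf(z)


# Example 1: mean 510, standard deviation 60, cutoff score 560
area = right_tail_area(560, 510, 60)
print(round(area, 4))
```

The cumulative distribution function plays the same role as a cumulative table: it returns the area to the left of z, which is why the complement is needed for a right tail.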
Technology Approach!
Step 2: Use Excel to determine the area under any normal curve. Use “TRUE” for cumulative since we want the area under the normal curve.
The proportion of all CEE scores that exceed 560 is 0.2033 or 20.33%.

Example 2:
A pediatrician obtains the heights of her three-year-old female patients. The heights are approximately normally distributed, with mean 38.72 inches and standard deviation 3.17 inches. Determine the proportion of the three-year-old females that have a height less than 35 inches.
Solution:
Given: μ = 38.72, σ = 3.17 and x = 35
Step 1: Draw a normal curve and shade the desired area.
Area = P(X < 35)
[Figure: normal curve for X with tick marks at 35.55, 38.72 and 41.89 and the area to the left of 35 shaded]
Using Table 1:
P(X < 35) = P(Z < (35 − 38.72)/3.17)
= P(Z < −1.17)
= 1 − P(Z ≥ −1.17)   (use the Complement Rule and determine one minus the area)
= 1 − 0.8790
= 0.1210

Using Table 2:
P(X < 35) = P(Z < (35 − 38.72)/3.17)
= P(Z < −1.17)
= 0.1210

[Figure: standard normal curve, Z axis from −2 to 2, area to the left of −1.17 shaded; Area = P(Z < −1.17) = 0.1210]
The proportion of the pediatrician’s three-year-old females who are less than 35 inches tall is 0.1210 or 12.10%.
Step 2: Use Excel to determine the area under any normal curve. Use “TRUE” for cumulative since we want the area under the normal curve.
The proportion of the pediatrician’s three-year-old females who are less than 35 inches tall is 0.1210 or 12.10%.

Example 3:
A pediatrician obtains the heights of her three-year-old female patients. The heights are approximately normally distributed, with mean 38.72 inches and standard deviation 3.17 inches. Determine the probability that a randomly selected three-year-old girl is between 35 and 40 inches tall, inclusive.
Solution:
Given: μ = 38.72, σ = 3.17, and 35 ≤ X ≤ 40
Step 1: Draw a normal curve and shade the desired area.
Area = P(35 ≤ X ≤ 40)
[Figure: normal curve for X with tick marks at 35.55, 38.72 and 41.89 and the area between 35 and 40 shaded]
Using Table 1:
P(35 ≤ X ≤ 40) = P((35 − 38.72)/3.17 ≤ Z ≤ (40 − 38.72)/3.17)
= P(−1.17 ≤ Z ≤ 0.40)
= P(Z ≤ 0.40) − [1 − P(Z ≥ −1.17)]
= 0.6554 − [1 − 0.8790]
= 0.6554 − 0.1210
= 0.5344

Using Table 2:
P(35 ≤ X ≤ 40) = P((35 − 38.72)/3.17 ≤ Z ≤ (40 − 38.72)/3.17)
= P(−1.17 ≤ Z ≤ 0.40)
= [0.50 − P(Z ≥ 0.40)] + [0.50 − P(Z ≤ −1.17)]
= [0.50 − 0.3446] + [0.50 − 0.1210]
= 0.1554 + 0.3790
= 0.5344

[Figure: standard normal curve, Z axis from −2 to 2, area between −1.17 and 0.40 shaded; Area = P(−1.17 ≤ Z ≤ 0.40)]
The probability that a randomly selected three-year-old female is between 35 and 40 inches tall is 0.5344.
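The left-tail and between-values areas from Examples 2 and 3 can be sketched the same way. The helper name `area_between` is an assumption made for this illustration, and z is again rounded to two decimals so that the output lines up with the table-based answers (about 0.1210 and 0.5344); unrounded z values give slightly different figures.

```python
from statistics import NormalDist


def area_between(low, high, mu, sigma, z_decimals=2):
    """P(low <= X <= high) for X ~ Normal(mu, sigma), with table-style rounding of z."""
    std = NormalDist()  # standard normal: mean 0, standard deviation 1
    z_low = round((low - mu) / sigma, z_decimals)
    z_high = round((high - mu) / sigma, z_decimals)
    # Area to the left of the upper value minus area to the left of the lower value
    return std.cdf(z_high) - std.cdf(z_low)


# Example 2: P(X < 35); Example 3: P(35 <= X <= 40)
left = NormalDist().cdf(round((35 - 38.72) / 3.17, 2))
between = area_between(35, 40, 38.72, 3.17)
print(round(left, 4), round(between, 4))
```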
Step 2: Use Excel to determine the area under any normal curve. Use “TRUE” for cumulative since we want the area under the normal curve.

ACTIVITIES/ASSESSMENTS:
1. Which one do you think is more informative? Why?
2. What features of the ‘Good Presentation’ make it better than the ‘Bad Presentation’?
[Figures A and B: the two presentations being compared]

3. Review the table and consider questions such as the following.
Origin / Rating   Poor   Needs Improvement   Satisfactory   V Good   Excellent   Total
External          0%     2%                  12%            19%      9%          41%
Internal          4%     8%                  15%            23%      9%          59%
Grand Total       4%     10%                 27%            41%      17%         100%
1. What percentage of the employees originated from within the organization?
2. What percentage of the employees are both internal and rated ‘Very Good’?
3. What percentage of the employees received ‘Needs Improvement’ or ‘Poor’?
4. What category contains the greatest number of employees?
5. Do you see any notable differences in the percentage by category?
4. Consider the following Frequency Distribution of Salaries.
Salary               Frequency   Percentage
41,000 - 50,000      1           1%
51,000 - 60,000      20          13%
61,000 - 70,000      53          35%
71,000 - 80,000      43          29%
81,000 - 90,000      26          17%
91,000 - 100,000     6           4%
101,000 - 110,000    1           1%
Total                150         100%
1. What percentage of the employees earn less than or equal to 80,000?
2. What is the salary range of values?
3. What salary categories have a percentage less than 5?
4. What salary category includes the most employees?

5. The length of life of an instrument produced by a machine has a normal distribution with a mean of 12 months and standard deviation of 2 months. Find the probability that an instrument produced by this machine will last
A. less than 7 months.
B. between 7 and 12 months.
Be sure to draw a normal curve with the area corresponding to the probability shaded.

6. The lengths of human pregnancies are approximately normally distributed, with mean μ = 266 days and standard deviation σ = 16 days.
A. What proportion of pregnancies lasts more than 270 days?
B. What proportion of pregnancies lasts less than 250 days?
C. What proportion of pregnancies lasts between 240 and 280 days?
D. What is the probability that a randomly selected pregnancy lasts more than 280 days?
Be sure to draw a normal curve with the area corresponding to the probability shaded.
7. Construct a frequency distribution table based on the scores of 75 randomly selected students.
37 46 37 26 30 41 28 49 29 34 46 50 38 35 42
35 46 45 27 41 26 45 39 43 46 36 32 46 36 48
49 47 30 43 31 34 38 41 39 45 28 43 37 39 26
38 30 29 38 26 31 42 44 48 43 37 46 38 27 50
42 33 42 42 43 39 39 31 46 46 48 48 50 45 31
Scores      Frequency   Percentage (%)
26 to 30
31 to 35
36 to 40
41 to 45
46 to 50
Total
A. Based on the frequency distribution, compute measures of central tendency, measures of variation, Q1, D9, P10, skewness and kurtosis.
B. Based on the raw data, compute measures of central tendency, measures of variation, skewness and kurtosis using Excel.
C. Compute skewness and kurtosis of grouped and ungrouped data. Make sure to describe the shape of the distribution.
D. Do you think that the computed values for grouped and ungrouped data are the same?

8. Begin with the following set of data, call it Data Set I.
5, −2, 6, 14, −3, 0, 1, 4, 3, 2, 5
A. Compute the sample standard deviation and sample mean of Data Set I.
B. Form a new data set, Data Set II, by adding 3 to each number in Data Set I. Calculate the sample standard deviation and sample mean of Data Set II.
C. Form a new data set, Data Set III, by subtracting 6 from each number in Data Set I. Calculate the sample standard deviation and sample mean of Data Set III.
D. Comparing the answers to parts (A), (B), and (C), can you guess the pattern? State the general principle that you expect to be true.

9. Using the “Encoded Data file”, construct a frequency distribution table for age, sex, marital status and educational attainment, and interpret the table.

References
https://prezi.com/rirrca9ckuiz/textual-presentation-of-data/
https://www.toppr.com/guides/economics/presentation-of-data/textual-and-tabular-presentation-of-data/
Michael Sullivan, III. Fifth Edition
MODULE 4: INFERENTIAL STATISTICS
OBJECTIVES:
After successful completion of this module, you should be able to:
✦ Differentiate the null and alternative hypotheses.
✦ Formulate the appropriate null and alternative hypotheses.
✦ Explain the logic of hypothesis testing.
✦ Assess and test if the data follow a normal distribution.
✦ Distinguish between independent and dependent sampling.
✦ Identify the appropriate test statistics for normally distributed data.
✦ Conduct a test for two categorical variables.

What is HYPOTHESIS TESTING?
Hypothesis testing is a procedure, based on sample evidence and probability, used to test claims regarding a characteristic of one or more populations.

What is HYPOTHESIS?
• A statement or claim regarding a characteristic of one or more populations.
• A preconceived idea, assumed to be true but has to be tested for its truth or falsity.

Procedures for Testing Hypothesis
1. State the null and alternative hypothesis.
2. Set the level of significance or alpha level (α).
3. Determine the test distribution to use.
4. Calculate test statistic or p-value.
5. Make statistical decision.
6. Draw conclusion.

1. State the Null and Alternative Hypothesis
Two Types of Hypothesis
1. Null Hypothesis
• Denoted by H0.
• The statement being tested.
• Assumed true until evidence indicates otherwise.
• Must contain the condition of equality and must be written with the symbol =, ≤, or ≥.
2. Alternative Hypothesis
• Denoted by Ha.
• Statement that must be true if the null hypothesis is false.
• Sometimes referred to as the research hypothesis.
• Must not contain the condition of equality and must be written with the symbol ≠, < or >.
Example Hypothesis:
Null Hypothesis:
✦ Students who eat breakfast and students who do not eat breakfast will perform the same on a math exam.
✦ Students who experience test anxiety prior to an English exam and students who do not will get the same scores.
✦ Motorists who talk on the phone while driving and motorists who do not will make the same number of errors on a driving course.
Alternative Hypothesis:
✦ Students who eat breakfast will perform better on a math exam than students who do not eat breakfast.
✦ Students who experience test anxiety prior to an English exam will get higher scores than students who do not experience test anxiety.
✦ Motorists who talk on the phone while driving will be more likely to make errors on a driving course than those who do not talk on the phone.

Reminders:
If you are conducting a research study and you want to use a hypothesis test to support your claim, the claim must be stated in such a way that it becomes the alternative hypothesis, so it cannot contain the condition of equality.

Two Types of Alternative Test
1. One-tailed test
✦ Left tailed
✦ Right tailed
2. Two-tailed test
2. Set the Level of Significance or Alpha Level (α)
• You should establish a predetermined level of significance, below which you will reject the null hypothesis.
• The generally accepted levels are 0.10, 0.05, and 0.01.
• Be as rigorous as possible.

Two Types of Error
A type I error is rejecting a null hypothesis that is actually true. A type II error is failing to reject a null hypothesis that is actually false.
Example:
H0: The defendant is innocent.
Ha: The defendant is not innocent.
What happens to the defendant if the jury makes a type I or a type II error?
Answer:
A type I error is like putting an innocent person in jail.
A type II error is like letting a guilty person go free.
Reminders:
It is important to note that we want to set α before we start our study because the Type I error is the more ‘grievous’ error to make.
The smaller α is, the smaller the region of rejection.

3. Determine the Test Distribution to Use.
Determine the appropriate statistical test to be used.
✦ Dependent Sample t - Test
✦ Independent Sample t - Test
✦ One Way Analysis of Variance (ANOVA) Test
✦ Pearson r
✦ Chi - Square Test
4. Calculate Test Statistic or p-value.
Perform the statistical analysis using statistical software such as Excel, SPSS, R, Minitab, SAS, etc.

5. Make Statistical Decision
✦ Using confidence interval
✦ Using p-value approach
✦ Using traditional method

Decision Rule:
✦ Using Confidence Interval
Reject the null hypothesis if the hypothesized value of the parameter is not within the range specified by the confidence interval.
✦ Using Traditional Approach
Reject Ho if the computed value of the test statistic falls in the region of rejection.
✦ Using P-value Approach
Reject the null hypothesis if the computed p-value is less than or equal to the set significance level; otherwise, do not reject the null hypothesis.
Example: If the level of significance is α = 0.05,
P-value   Decision
0.01      Reject H0
0.05      Reject H0
0.10      Fail to reject H0
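The p-value decision rule is simple enough to write as a small function. The sketch below is illustrative only; the function name `p_value_decision` is an assumption, and the three p-values are the ones from the example above with α = 0.05.

```python
def p_value_decision(p_value, alpha=0.05):
    """Apply the p-value approach: reject H0 when p-value <= alpha."""
    if not 0 <= p_value <= 1:
        raise ValueError("p_value must be between 0 and 1")
    if not 0 < alpha < 1:
        raise ValueError("alpha must be strictly between 0 and 1")
    return "Reject H0" if p_value <= alpha else "Fail to reject H0"


# The three cases from the example, at alpha = 0.05
for p in (0.01, 0.05, 0.10):
    print(p, p_value_decision(p))
```

Note that a p-value exactly equal to α leads to rejection, because the rule uses "less than or equal to".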
Traditional Approach
Rejection region or critical region is the set of all values of the test statistic which will lead to the rejection of H0.
Acceptance region is the set of all values of the test statistic that leads the researcher to retain H0.
One-tailed and Left tailed: Ha : μ1 < μ2 [Figure: rejection region in the left tail]
One-tailed and Right tailed: Ha : μ1 > μ2 [Figure: rejection region in the right tail]
Two-tailed: Ha : μ1 ≠ μ2 [Figure: rejection regions in both tails]
In stating your decision you can use:
✦ Fail to reject the null hypothesis / Do not reject the null hypothesis / Retain the null hypothesis
✦ Reject the null hypothesis.
It is important to recognize that we never accept the null hypothesis. We are merely saying that the sample evidence is not strong enough to warrant rejection of the null hypothesis.

6. Draw Conclusion
Record conclusions and recommendations in a report, and associate interpretations to justify your conclusion or recommendations.

Assessing and Testing Normality of the Data
To determine if the data follow a normal distribution, we can use the graphical or numerical method.
Graphical:
Normal Q-Q Plot
Histogram
Numerical:
Shapiro Wilk Test
Kolmogorov Smirnov Test
How to Check Normality?
A histogram plots the observed values against their frequency and gives a visual estimate of whether the distribution is bell shaped or not.
Q-Q probability plots display the observed values against normally distributed data (represented by the line).

Reminders:
Graphical methods are typically not very useful when the sample size is small.

Hypotheses of Normality Test
The hypotheses used are:
Ho: The sample data follows a normal distribution.
Ha: The sample data does not follow a normal distribution.
When we are testing normality:
• If p-value > alpha, we fail to reject Ho, so the data are treated as normal.
• If p-value ≤ alpha, we reject Ho, so the data are treated as NOT normal.
How to Calculate Shapiro - Wilk Test in Excel?
[Table: Sample Data]
STEP 1: Rearrange the data in ascending order.
STEP 2: Calculate SS as follows:
SS = Σ_{i=1}^{n} (x_i − x̄)^2
Use the "=DEVSQ( )" function in Excel.
STEP 3: Calculate b as follows:
b = Σ_{i=1}^{m} a_i (x_{n+1−i} − x_i)
where n is the number of observations.
If n is even: m = n/2
If n is odd: m = (n − 1)/2
Since n is even in this example, m = 8. That’s why we used a1 to a8.
SS means Sum of Squares.
Shapiro - Wilk Table
Take the a_i weights from the table of Shapiro - Wilk (based on the value of n).
Note that if n is odd, the median data value is not used in the calculation of b.
STEP 4: Calculate the test statistic:
W = b^2 / SS
STEP 5: Find the value in the table of Shapiro - Wilk (for a given value of n) that is closest to W, interpolating if necessary. This is the p-value for the test.
We choose this interval in the table of Shapiro - Wilk because our n = 16 and our test statistic (W = 0.955) is within this interval.
We used interpolation to get the p-value of the Shapiro - Wilk Test.

Result
Since the computed p-value is greater than the set level of significance, we fail to reject the null hypothesis. Therefore, the sample data follows a normal distribution.
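Steps 1 to 4 can be sketched in Python as a check on the spreadsheet arithmetic. This is an assumption-laden illustration, not the module's Excel workbook: the function name `shapiro_wilk_w` is hypothetical, the a_i weights must still be read from the Shapiro - Wilk table and passed in by the caller, and the p-value lookup in STEP 5 is not attempted. The demonstration uses three made-up, evenly spaced values with the single weight 0.7071 (the tabled a1 for n = 3), for which W should come out very close to 1.

```python
def shapiro_wilk_w(data, weights):
    """Compute the Shapiro-Wilk W statistic from data and table weights a_1..a_m."""
    x = sorted(data)                      # STEP 1: ascending order
    n = len(x)
    m = n // 2                            # n/2 if n is even, (n - 1)/2 if n is odd
    if len(weights) != m:
        raise ValueError(f"expected {m} weights for n = {n}")
    mean = sum(x) / n
    ss = sum((v - mean) ** 2 for v in x)  # STEP 2: same quantity as Excel's DEVSQ
    if ss == 0:
        raise ValueError("all values are identical, so W is undefined")
    # STEP 3: pair the i-th smallest value with the i-th largest value
    b = sum(a * (x[n - 1 - i] - x[i]) for i, a in enumerate(weights))
    return b ** 2 / ss                    # STEP 4


# Illustrative data only (not the module's sample of n = 16)
w = shapiro_wilk_w([1, 2, 3], [0.7071])
print(round(w, 4))
```

For an odd n the middle value is never paired with anything, which matches the note that the median is not used in calculating b.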
Inferential Statistics
1. Parametric Tests
✦ Assume underlying statistical distributions in the data. Therefore, several conditions of validity must be met so that the result of a parametric test is reliable.
✦ Apply to data in ratio scale, and some apply to data in interval scale.
2. Non Parametric Test
✦ Refer to a statistical method in which the data is not required to fit a normal distribution.
✦ Most non-parametric tests apply to data in an ordinal scale, and some apply to data in nominal scale.

Inference About Two Means
To perform inference on the difference of two population means, we must first determine whether the data come from an independent or dependent sample.

Distinguish between Independent and Dependent Sample
✦ A sampling method is independent when the individuals selected for one sample do not dictate which individuals are to be in a second sample.
✦ A sampling method is dependent when the individuals selected to be in one sample are used to determine the individuals to be in the second sample.
Example:
Determine whether the sample is independent or dependent.
1. An urban economist believes that commute times to work in the South are less than commute times to work in the Midwest. He randomly selects 40 employed individuals in the South and 45 employed individuals in the Midwest and determines their commute times.
Answer: Independent
2. In an experiment conducted in biology class, Prof. Rhea measured the time required for 12 students to catch a falling meter stick using their dominant hand and nondominant hand. The goal of the study was to determine whether the reaction time in an individual’s dominant hand is different from the reaction time in the nondominant hand.
Answer: Dependent
3. A researcher wants to know if the mean length of stay in for-profit hospitals is different from the mean length of stay in not-for-profit hospitals. He randomly selected 20 individuals in the for-profit hospital and matched them with 20 individuals in the not-for-profit hospital by diagnosis.
Answer: Dependent
Dependent Sample t - Test
The dependent sample t-test (also called the paired t-test or paired-samples t-test) compares the means of two related groups to determine whether there is a statistically significant difference between these means.
H0 : μ1 ≥ μ2 and Ha : μ1 < μ2
H0 : μ1 ≤ μ2 and Ha : μ1 > μ2
H0 : μ1 = μ2 and Ha : μ1 ≠ μ2

Assumptions
1. Your dependent variable should be measured at the interval or ratio level (i.e., it is continuous).
2. Your independent variable should consist of two categorical, "related groups" or "matched pairs".
3. There should be no significant outliers in the differences between the two related groups.
4. The distribution of the differences in the dependent variable between the two related groups should be approximately normally distributed.
Example:
A teacher is interested to know if the new learning program will help to increase the number of correctly remembered words. Ten subjects learn a list of 50 words. Learning performance is measured using a recall test. After the first test, all subjects are instructed how to use the learning program and then learn a second list of 50 words. Learning performance is again measured with the recall test. In the following table the number of correctly remembered words is listed for both tests.

1. State the Null and Alternative Hypothesis
Null hypothesis: Ho : μ1 ≥ μ2
The new learning program will not help to increase the number of correctly remembered words.
Alternative hypothesis: Ha : μ1 < μ2
The new learning program will help to increase the number of correctly remembered words.

2. Set the Level of Significance or Alpha Level (α)
α = 0.05
3. Determine the Test Distribution to Use.
Dependent Variable:
Number of correct remembered words
Independent Variable:
Treatment (Before and After)
Since we are comparing the means of two related groups, we will use the dependent sample t-test.

4. Calculate Test Statistic or p-value.
Click “Data”, then click “Data Analysis”.
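To make the paired computation concrete, the sketch below derives the t statistic from the pairwise differences. It is only an illustration under stated assumptions: the function name `paired_t` and the four before/after scores are hypothetical (the module's data table is not reproduced here), and the p-value or critical value still has to come from Excel or a t table with n − 1 degrees of freedom.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(first, second):
    """Return (t, df) for a dependent-sample t-test on paired observations."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two equally long samples with at least two pairs")
    d = [a - b for a, b in zip(first, second)]  # differences, first minus second
    sd = stdev(d)                               # sample standard deviation of d
    if sd == 0:
        raise ValueError("all differences are identical, so t is undefined")
    n = len(d)
    return mean(d) / (sd / sqrt(n)), n - 1


# Hypothetical scores for four subjects (illustration only)
before = [1, 2, 3, 4]
after = [2, 4, 5, 6]
t, df = paired_t(before, after)
print(round(t, 4), df)
```

With the differences taken as first minus second, a negative t points in the direction of Ha : μ1 < μ2.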
5. Make Statistical Decision
Using p-value approach: If p-value ≤ α, reject Ho; otherwise, fail to reject Ho.
Decision: Reject Ho

6. Draw Conclusion
There is sufficient evidence to support that the new learning program helps to increase the number of correctly remembered words.
Proper Presentation of Results

Exercises:
Apply the procedure in testing the hypothesis.
Professor Rhea measured the time (in seconds) required to catch a falling meter stick for 10 randomly selected students' dominant hand and non-dominant hand. Professor Rhea claims that the reaction time in an individual's dominant hand is less than the reaction time in their non-dominant hand. Test the claim at the level of significance. The data obtained are presented:
Result
Independent Sample t - Test
The independent sample t - test allows
researchers to evaluate or to compare the mean
difference between two populations using the data
from two separate samples. It is used to test
whether population means are significantly
different from each other, using the means from
randomly drawn samples.
H0 : μ1 ≥ μ2 and Ha : μ1 < μ2
H0 : μ1 ≤ μ2 and Ha : μ1 > μ2
H0 : μ1 = μ2 and Ha : μ1 ≠ μ2
Assumptions
1. Your dependent variable should be measured on a continuous scale (i.e., it is measured at the interval or ratio level).
2. Your independent variable should consist of two categorical, independent groups.
3. You should have independence of observations, which means that there is no relationship between the observations in each group or between the groups themselves.
4. There should be no significant outliers.
5. Your dependent variable should be approximately normally distributed for each group of the independent variable.
6. There needs to be homogeneity of variances.

Example:
Researchers wanted to know whether there was a difference in comprehension among students learning a computer program based on the style of the text. They randomly divided 18 students into two groups of 9 each. The researchers verified that the 18 students were similar in terms of educational level, age, and so on. Group 1 individuals learned the software using a visual manual (multimodal instruction), while Group 2 individuals learned the software using a textual manual (unimodal instruction). The following data represent scores the students received on an exam given to them after they studied from the manuals.
1. State the Null and Alternative Hypothesis
Null hypothesis: Ho : μ1 = μ2
There is no significant difference between the scores of the students learning the computer program using the textual and visual styles.
Alternative hypothesis: Ha : μ1 ≠ μ2
There is a significant difference between the scores of the students learning the computer program using the textual and visual styles.

2. Set the Level of Significance or Alpha Level (α)
α = 0.05

3. Determine the Test
Dependent Variable:
Scores
Independent Variable:
Style of the Text (Visual and Textual)
Since we are comparing the means of two independent groups, we will use the independent sample t-test.

4. Calculate Test Statistic or p-value.
Click “Data”, then click “Data Analysis”.
Determine if the variances are equal or not equal.
Ho: Equal Variances Assumed
Ha: Equal Variances Not Assumed
Using p-value approach: If p-value ≤ α, reject Ho; otherwise, fail to reject Ho.
Decision: Failed to reject Ho
Since we failed to reject Ho, we will proceed to t-Test: Two-Sample Assuming Equal Variances.
Result
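The two quantities behind this step, the ratio of sample variances used by the F-test and the pooled-variance t statistic used when equal variances are assumed, can be sketched as follows. The function names `variance_ratio` and `pooled_t` and the two tiny samples are assumptions for illustration only; they are not the exam scores from the example, and the p-values still come from Excel or from F and t tables.

```python
from math import sqrt
from statistics import mean, variance


def variance_ratio(x, y):
    """Ratio of sample variances, the statistic behind the F-test for equal variances."""
    return variance(x) / variance(y)


def pooled_t(x, y):
    """Return (t, df) for an independent-sample t-test assuming equal variances."""
    n1, n2 = len(x), len(y)
    if n1 < 2 or n2 < 2:
        raise ValueError("each group needs at least two observations")
    # Pooled variance: a weighted average of the two sample variances
    sp2 = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    t = (mean(x) - mean(y)) / sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2


# Hypothetical scores for two small groups (illustration only)
group1 = [1, 2, 3]
group2 = [2, 3, 4]
t, df = pooled_t(group1, group2)
print(variance_ratio(group1, group2), round(t, 4), df)
```

Pooling is only appropriate after the variance check fails to reject equal variances, which is why the module runs the F-test first.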

5. Make Statistical Decision
Using p-value approach: If p-value ≤ α, reject Ho; otherwise, fail to reject Ho.
Decision: Failed to reject Ho

6. Draw Conclusion
There is not enough evidence to support that there is a difference in comprehension among students learning a computer program based on the style of the text.
Proper Presentation of Results
Exercises:
Apply the procedure in testing the hypothesis.
Twenty participants were given a list of 20 words to
process. The 20 participants were randomly assigned to
one of two treatment conditions. Half were instructed to
count the number of vowels in each word (shallow
processing). Half were instructed to judge whether the
object described by each word would be useful if one
were stranded on a desert island (deep processing).
After a brief distractor task, all subjects were given a
surprise free recall task. The number of words correctly recalled
was recorded for each subject. Did the instruction affect the
level of recall? Here are the data:
Result
Since the F-test concludes that the variances of the two groups are equal, we will apply “Assuming Equal Variances”.
One - Way Analysis of Variance (ANOVA)
One-way analysis of variance (ANOVA) is a method of testing the equality of three or more population means by analyzing sample variances.
Ho : μ1 = μ2 = . . . = μk
Ha : At least one of the population means is different from the others.

Assumptions
1. Your dependent variable should be measured at the interval or ratio level (i.e., it is continuous).
2. Your independent variable should consist of two or more categorical, independent groups.
3. You should have independence of observations, which means that there is no relationship between the observations in each group or between the groups themselves.
4. There should be no significant outliers.
5. Your dependent variable should be approximately normally distributed for each category of the independent variable.
6. There needs to be homogeneity of variances.
Example:
Researchers wanted to compare math test scores of students at the end of secondary school from various cities. Eight randomly selected students from Makati, Manila, and Quezon City each were administered the same exam; the results are presented in the following table. Can the researchers conclude that the distribution of exam scores is different for each city at the level of significance?

1. State the Null and Alternative Hypothesis
Null hypothesis:
There is no significant difference between the mathematics scores of students from the various cities.
Alternative hypothesis:
There is a significant difference between the mathematics scores of students from the various cities.

2. Set the Level of Significance or Alpha Level (α)
α = 0.10
3. Determine the Test
Dependent Variable:
Mathematics Scores
Independent Variable:
Cities (Makati, Manila, Quezon City)
Since we are comparing the means across one independent variable that consists of two or more categorical groups, we will use the one-way ANOVA.

Click “Data”, then click “Data Analysis”.
Determine if the variances are equal or not equal.
Ho: Equal Variances Assumed
Ha: Equal Variances Not Assumed
[Three Excel output screenshots, each annotated: Failed to reject Ho, Equal Variances Assumed]
4. Calculate Test Statistic or p-value.
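The F statistic that the one-way ANOVA output reports can be reproduced with a short sketch. Everything named here is an assumption for illustration: the function `one_way_anova_f` is hypothetical and the three small groups are made-up numbers, not the city scores from the example. The p-value itself still comes from Excel or an F table with the returned degrees of freedom.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or n_total <= k or any(len(g) == 0 for g in groups):
        raise ValueError("need at least two non-empty groups and more observations than groups")
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Between-group sum of squares: how far each group mean is from the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of the observations around their own group mean
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    if ss_within == 0:
        raise ValueError("no within-group variation, so F is undefined")
    df_between, df_within = k - 1, n_total - k
    return (ss_between / df_between) / (ss_within / df_within), df_between, df_within


# Hypothetical scores for three groups (illustration only)
f_stat, df_b, df_w = one_way_anova_f([[1, 2, 3], [2, 3, 4], [3, 4, 5]])
print(f_stat, df_b, df_w)
```

A large F means the group means vary more than the within-group spread would explain, which is what "analyzing sample variances" refers to.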

Result

5. Make Statistical Decision
Decision: Reject Ho

6. Draw Conclusion
There is enough evidence to support that the mathematics exam scores differ among the cities, that is, at least one city mean is different from the others.
Proper Presentation of Results

Exercises:
A teacher is concerned about the level of knowledge possessed by PUP students regarding Philippine history. Students completed a high school senior level standardized history exam. Academic major of the students was also recorded. Data in terms of percent correct is recorded below for 24 students. Is there a significant difference between the levels of knowledge possessed by PUP students regarding Philippine history when grouped according to their academic major?
Result

Pearson Product Moment Correlation
The Pearson product moment correlation coefficient (Pearson r) is a measure of the strength of a linear association between two variables and is denoted by r.
Ho: There is no significant relationship between two continuous variables.
Ha: There is a significant relationship between two continuous variables.

Features of r
• Unit free
• Ranges between -1 and 1
• The closer to -1, the stronger the negative linear relationship.
• The closer to 1, the stronger the positive linear relationship.
• The closer to 0, the weaker the linear relationship.
If r is positive, the correlation is direct.
If r is negative, the correlation is inverse.

Sample of Observations from Various r Values
[Figure: scatter plots of Y against X for r = -1, r = -.6, r = 0, r = .6 and r = 1]
Reminders:
• Correlation does not imply causation.
• Watch out for hidden (lurking) variables.

Lurking Variable
• A variable that is not included as an explanatory or response variable in the analysis but can affect the interpretation of relationships between variables.
• Can falsely identify a strong relationship between variables or it can hide the true relationship.

Assumptions
1. Your two variables should be measured at the interval or ratio level (i.e., they are continuous).
2. There is a linear relationship between your two variables.
3. There should be no significant outliers.
4. Your variables should be approximately normally distributed.
Significance Testing of Pearson r
Test Statistic:
t = r √(df / (1 − r^2))
where:
df = degrees of freedom
r = correlation coefficient of Pearson r
Note:
df = n − 2

Example:
A dietetics student wanted to look at the relationship between calcium intake and knowledge about calcium in sports science students. The table shows the data she collected. Is there a relationship between calcium intake and knowledge about calcium in sports science students?
1. State the Null and Alternative

Hypothesis
Null hypothesis:
There is no significant relationship between the
calcium intake and knowledge about calcium in sports
science students.
Alternative hypothesis:
There is significant relationship between the calcium
intake and knowledge about calcium in sports science
students.
2. Set the Level of Significance or Alpha
Level (α)
α = 0.05
3. Determine the Test
Variables:
Calcium Intake
Knowledge about Calcium
Since we are testing the significant relationship of two variables, we will use Pearson r.
4. Calculate Test Statistic or p-value.
$t = r\sqrt{\dfrac{df}{1 - r^2}}$
df = n − 2
[Result: Excel output]
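The test statistic can be sketched in Python as follows. The values r = 0.80 and n = 10 are hypothetical and are not the results of the calcium example. Because the Python standard library has no t distribution, this sketch compares |t| with a critical value read from a standard t table (two-tailed, α = 0.05, df = 8) instead of computing a p-value; both approaches lead to the same decision.

```python
import math

def pearson_t(r, n):
    """t statistic for Ho: no linear relationship, with df = n - 2."""
    df = n - 2
    t = r * math.sqrt(df / (1 - r ** 2))
    return t, df

# Hypothetical sample result: r = 0.80 computed from n = 10 pairs.
t, df = pearson_t(0.80, 10)

# Two-tailed critical value for alpha = 0.05 and df = 8, from a standard t table.
T_CRITICAL = 2.306

reject_ho = abs(t) >= T_CRITICAL
print("t =", round(t, 4), "df =", df)
print("Reject Ho" if reject_ho else "Fail to reject Ho")
```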
5. Make Statistical Decision
Using p-value approach: If p-value ≤ α, reject Ho; otherwise, fail to reject Ho.
Reject Ho
Strong and Direct Correlation
6. Draw Conclusion
There is sufficient evidence to conclude that there is a significant relationship between the calcium intake and knowledge about calcium in sports science students.
Proper Presentation of Results
Exercises:
Apply the procedure in testing the hypothesis.
A group of twelve children participated in a psychological study designed to assess the relationship, if any, between age (years) and average total sleep time (minutes). To obtain a measure for average total sleep time, recordings were taken on each child on five consecutive nights and then averaged. The results obtained are shown in the table.
Chi-Square Distribution
Definition:
The chi-square distribution is written as χ² distribution.
The symbol χ is the Greek letter “chi”, pronounced as “ki”.
Chi-Square: Test for Independence
✦ Used to discover if there is association between two categorical variables.
✦ Used when you want to decide whether two variables are independent or dependent.
✦ A contingency table will be constructed.

H0: The two categorical variables are independent.
Ha: The two categorical variables are dependent.
The test statistic for a test of independence is given by
$\chi^2 = \sum \dfrac{(O - E)^2}{E}$
where:
O is the observed frequency for a category
E is the expected frequency for a category
$E = \dfrac{(\text{row total})(\text{column total})}{\text{grand total}}$
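The two formulas above can be sketched in Python as shown below. The function names and the 2 by 2 table of counts are hypothetical and only illustrate the arithmetic.

```python
def expected_frequencies(observed):
    """Expected count per cell: (row total)(column total) / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

def chi_square(observed):
    """Sum of (O - E)^2 / E over every cell of the contingency table."""
    expected = expected_frequencies(observed)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )

# Hypothetical observed counts: two rows and two columns of categories.
observed = [[30, 20],
            [15, 35]]

expected = expected_frequencies(observed)
chi2 = chi_square(observed)
print("Expected frequencies:", expected)
print("Chi-square statistic:", round(chi2, 4))
```

The larger the gap between the observed and expected counts, the larger the statistic, and the stronger the evidence against independence.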
Observed and Expected Frequencies
The frequencies obtained from the performance of an experiment are called the observed frequencies and are denoted by O.
The expected frequencies, denoted by E, are the frequencies that we expect to obtain if the null hypothesis is true.
Example of Contingency Table:
Observed Values      Low   Medium   High   Row Total
Some College         20    35       20     80
Bachelor's Degree    17    33       25     70
Master's Degree      11    18       21     50
Column Total         48    86       66     200
Assumptions
1. There are 2 variables, and both are measured as categories, usually at the nominal level.
2. The two variables should consist of two or more categorical, independent groups.
3. The data in the cells should be frequencies, or counts of cases rather than percentages or some other transformation of the data.
4. For a 2 by 2 table, all expected frequencies > 5.
5. For a larger table, all expected frequencies > 1 and no more than 20% of all cells may have expected frequencies < 5.
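Assumptions 4 and 5 can be checked mechanically once the expected frequencies are known. The Python sketch below is one way to do so; the function name and the tables of expected counts are hypothetical.

```python
def expected_counts_ok(expected):
    """Check the expected-frequency assumptions of the chi-square test."""
    cells = [e for row in expected for e in row]
    is_2_by_2 = len(expected) == 2 and len(expected[0]) == 2
    if is_2_by_2:
        # 2 by 2 table: every expected frequency must exceed 5.
        return all(e > 5 for e in cells)
    # Larger table: every expected frequency must exceed 1, and no more
    # than 20% of the cells may have an expected frequency below 5.
    share_below_5 = sum(e < 5 for e in cells) / len(cells)
    return all(e > 1 for e in cells) and share_below_5 <= 0.20

# Hypothetical expected frequencies.
table_2x2 = [[22.5, 27.5], [22.5, 27.5]]
table_3x3 = [[3.0, 10.0, 12.0], [8.0, 9.0, 11.0], [7.0, 6.0, 15.0]]
table_3x2 = [[2.0, 3.0], [10.0, 12.0], [8.0, 9.0]]

print(expected_counts_ok(table_2x2))
print(expected_counts_ok(table_3x3))
print(expected_counts_ok(table_3x2))
```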
Example:
1. A doctor who knows that hypertension depends on smoking habits can tell his smoking patients what they should do.
2. If the traffic condition (light, moderate, heavy, standstill) is found to be dependent on vehicle plate numbers (odd, even), a traffic officer may decide to revise traffic law enforcement.
3. If poverty status of households is found to be correlated with family size, government ought to adopt a viable poverty management program.
Reminders:
The word contingency refers to dependence, but this is only a statistical dependence and cannot be used to establish a direct cause-and-effect link between the two variables in question.
Example:
Educators are always looking for novel ways in which to teach statistics to undergraduates as part of a non-statistics degree course (e.g., psychology). With current technology, it is possible to present how-to guides for statistical programs online instead of in a book. However, different people learn in different ways. An educator would like to know whether gender (male/female) is associated with the preferred type of learning medium (online vs. books). Use “Data_Example and Exercises file”.
1. State the Null and Alternative Hypothesis
Null hypothesis:
Gender is independent of the preferred type of learning medium.
Alternative hypothesis:
Gender is dependent on the preferred type of learning medium.
2. Set the Level of Significance or Alpha Level (α)
α = 0.05
3. Determine the Test Distribution to Use.
Two Categorical Variables
Gender (Male and Female)
Preferred type of learning medium (online vs. books)
Since we are testing the significant relationship of two categorical variables, we will use Chi-square test.
4. Calculate Test Statistic or p-value.
Click “Insert”, then click “Pivot Table”.
[Figure: pivot table output showing the Row Total, Column Total, and Grand Total]
$E = \dfrac{(\text{row total})(\text{column total})}{\text{grand total}}$
5. Make Statistical Decision
Using p-value approach: If p-value ≤ α, reject Ho; otherwise, fail to reject Ho.
Reject Ho
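The decision rule can be sketched in Python as follows. For a 2 by 2 table the chi-square statistic has (2 − 1)(2 − 1) = 1 degree of freedom, and in that case the right-tail p-value equals erfc(√(χ²/2)), which the standard library can evaluate. The statistic 9.09 is a hypothetical value, not the result of the gender example; in Excel, CHISQ.TEST returns the p-value directly from the observed and expected ranges.

```python
import math

def chi_square_p_value_df1(chi2):
    """Right-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))

def decide(p_value, alpha):
    """p-value approach: reject Ho when p-value <= alpha."""
    return "Reject Ho" if p_value <= alpha else "Fail to reject Ho"

# Hypothetical chi-square statistic from a 2 by 2 contingency table.
chi2 = 9.09
alpha = 0.05

p_value = chi_square_p_value_df1(chi2)
decision = decide(p_value, alpha)
print("p-value =", round(p_value, 4))
print(decision)
```

This shortcut holds only for 1 degree of freedom; larger tables need the general chi-square distribution.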
6. Draw Conclusion
There is sufficient evidence to conclude that gender is associated with the preferred type of learning medium.
Proper Presentation of Results
Exercises:
A survey was conducted at a community college of 102 randomly selected students who dropped a course in the current semester to learn why students drop courses. Personal drop reasons include financial, transportation, family issues, health issues, and lack of child care. Course drop reasons include reducing one's load, being unprepared for the course, the course was not what was expected, dissatisfaction with teaching, and not getting the desired grade. Work drop reasons include an increase in hours, a change in shift, and obtaining full-time employment. Test whether gender is independent of drop reason at the 1% level of significance. Use “Data_Example and Exercises file”.
Determine whether the sampling is dependent or independent.
________1. A researcher wishes to compare academic aptitudes of married mathematicians and their spouses. She obtains a random sample of 287 such couples who take an academic aptitude test and determines each spouse's academic aptitude.
________2. A political scientist wants to know how a random sample of 18- to 25-year-olds feel about Democrats and Republicans in Congress. She obtains a random sample of 1030 registered voters 18 to 25 years of age and asks, “Do you have a favorable/unfavorable opinion of the Democratic/Republican party?” Each individual was asked to disclose his or her opinion about each party.
________3. An educator wants to determine whether a new curriculum significantly improves standardized test scores for third grade students. She randomly divides 80 third-graders into two groups. Group 1 is taught using the new curriculum, while group 2 is taught using the traditional curriculum. At the end of the school year, both groups are given the standardized test and the mean scores are compared.
________4. A stock analyst wants to know if there is a difference between the mean rate of return from energy stocks and that from financial stocks. He randomly selects 13 energy stocks and computes the rate of return for the past year. He randomly selects 13 financial stocks and computes the rate of return for the past year.
________5. An urban economist believes that commute times to work in the South are less than commute times to work in the Midwest. He randomly selects 40 employed individuals in the South and 45 employed individuals in the Midwest and determines their commute times.
Solve the following problems. Make sure to follow the 6-step procedure.
1. A study is designed to test whether there is a difference in mean daily calcium intake in adults with normal bone density, adults with osteopenia (a low bone density which may lead to osteoporosis) and adults with osteoporosis. Adults 60 years of age with normal bone density, osteopenia and osteoporosis are selected at random from hospital records and invited to participate in the study. Each participant's daily calcium intake is measured based on reported food intake and supplements. The data are shown below. Is there a statistically significant difference in mean calcium intake in patients with normal bone density as compared to patients with osteopenia and osteoporosis?
Normal Bone Density   Osteopenia   Osteoporosis
1200                  1000         890
1000                  1100         650
980                   700          1100
900                   800          900
750                   500          400
800                   700          350
2. Some studies have shown that in the United States, men spend more than women buying gifts and cards on Valentine’s Day. Suppose a researcher wants to test this hypothesis by randomly sampling nine men and 10 women with comparable demographic characteristics from various large cities across the United States to be in a study. Each study participant is asked to keep a log beginning one month before Valentine’s Day and record all purchases made for Valentine’s Day during that one-month period. The resulting data are shown below. Use these data and a 1% level of significance to test to determine if, on average, men actually do spend significantly more than women on Valentine’s Day. Assume that such spending is normally distributed in the population and that the population variances are equal.
Men (in $)   Women (in $)
107.48       125.98
143.61       45.53
90.19        56.35
125.53       80.62
70.7         46.37
83           44.34
129.63       75.21
154.22       68.48
93.8         85.82
             126.11
3. A researcher is interested whether a training course increases the teaching performance of the teachers who attended the training courses. Test at 10% level of significance. The data are shown below:
Case   Before   After      Case   Before   After
1      85       95         11     89       97
2      84       98         12     87       98
3      86       97         13     82       95
4      87       92         14     81       95
5      89       96         15     86       92
6      82       93         16     89       91
7      80       94         17     89       94
8      84       95         18     84       95
9      86       90         19     85       96
10     82       82         20     88       97
4. A pediatrician wants to determine the relation that may exist between a child’s height and head circumference. She randomly selects eleven 3-year-old children from her practice, measures their heights and head circumference, and obtains the data shown in the table below.
Height (inches)   Head Circumference (inches)
27.75             17.5
24.5              17.1
25.5              17.1
26                17.3
25                16.9
27.75             17.6
26.5              17.3
27                17.5
26.75             17.3
26.75             17.5
27.5              17.5
5. The following data represent the smoking status from a random sample of 1054 U.S. residents 18 years or older by level of education.
No. of Years     Smoking Status
of Education     Current   Former   Never
Less than 12     178       88       208
12               137       69       143
13 - 15          44        25       44
16 or more       34        33       51
Test whether smoking status and level of education are independent at the α = 0.05 level of significance.
References
https://wolfweb.unr.edu/homepage/ania/stat352f12lectures/352lecture21f12.pdf
Statistics: Informed Decisions Using Data by Michael Sullivan, III, Fifth Edition
http://www.real-statistics.com/tests-normality-and-symmetry/statistical-tests-normality-symmetry/shapiro-wilk-test/
STATISTICAL ANALYSIS WITH SOFTWARE APPLICATION

MIDTERM EXAMINATION
Name: Course & Section:
Directions: Read each item carefully. For each item, write the letter corresponding to the best answer on yellow paper.
Write NONE if no correct choice is given. Make sure to also write your solutions.
1. A bank surveyed all of its 60 employees to determine the proportion who participate in volunteer activities.
Which of the following statements is true?
(a) The bank should not use the data from this survey because this is an observational study.
(b) The bank does not need to use an inference procedure to determine the proportion of employees who
participate in volunteer activities because the survey was a census of all employees.
(c) The bank can use the result of this survey to prove that working for the bank causes employees to
participate in volunteer activities.
(d) The bank did not select a random sample of employees, so the survey will not provide the bank with useful
information.
2. In the design of a survey, which of the following best explains how to minimize response bias?
(a) Increase the sample size (c) Randomly select the sample
(b) Carefully word and field-test survey questions (d) Increase the number of questions in the survey
3. A body of principles that deals with the collection, analysis, interpretation, and presentation of numerical facts or
data.
(a) Statistic (b) Descriptive (c) Inferential (d) Statistics
4. Cluster sampling is an example of:
(a) Simple Random Sampling (c) Nonprobability Sampling

(b) Probability Sampling (d) Stratified Sampling
5. Which of the following statements regarding a researcher's use of inferential statistics is true?
(a) It is best to measure every member of a population if possible.
(b) A random sample provides a perfect estimate of the population values.
(c) Descriptive statistics from a sample are used to estimate the characteristics of the population.
(d) We usually need to take several samples to obtain a good estimate of the population values.
6. The ______ divides the distribution into ten equal parts.
(a) Decile (b) Percentile (c) Median (d) Quartile
7. What sampling technique is used when the respondents are chosen on the basis of pre-determined criteria set
by the researchers?
(a) cluster sampling (b) systematic sampling (c) purposive sampling (d) convenience sampling
8. In a ______ distribution, the mean < median < mode.
(a) Normal (b) Unimodal (c) Negatively Skewed (d) Positively Skewed
9. Which one of the following variables is not categorical?

(a) score on the exam.
(b) Educational Attainment: elementary graduate, high school graduate, college graduate.
(c) Color: blue, red, white.
(d) Subject: algebra, calculus, trigonometry
10. Given the data set 40, 50, 70, 70, 60, 90, 80, 80, 90, what will happen to the standard deviation if we replace the
data value 90 in the data set by 5?
(a) Increase (b) Decrease (c) stay the same (d) None of the above
11. If the statistics grades of Karen are 87, 85, 91, 89 and X, what must be the value of X so that the average is
89?
(a) 92 (b) 95 (c) 93 (d) 91
12. In descriptive statistics, we study

(a) The description of decision making process
(b) The methods for organizing, displaying, and describing data
(c) How to describe the probability distribution
(d) None of the above
13. In statistics, conducting a survey means
(a) Collecting information from elements
(b) Making mathematical calculations
(c) Drawing graphs and pictures
14. Which of the following represents the middle point in a set of numbers arranged in order of magnitude?
(a) Mean (b) Median (c) Mode (d) Variance
15. Mr. Martin had seven students in his after-school statistics tutorial. The scores they received on their last quiz
were as follows: 81, 73, 84, 78, 89, 82, 81. What was the mean score?
(a) 81.14 (b) 78.5 (c) 82 (d) 79.5
16. If all the units of a population are surveyed it is called
(a) Survey (b) Population (c) Census (d) Sample
17. For percentiles, the total number of partition values are
(a) 10 (b) 25 (c) 99 (d) 100
18. Which of the following represents median?
(a) First Quartile (b) Fiftieth Percentile (c) Sixth decile (d) Third quartile
19. If 5 is subtracted from each observation of a set, then the mean of the observations is reduced by
(a) 5 (b) 1 (c) 0 (d) 15
20. The standard deviation of 10 observations is 15. If 5 is added to each observation, the value of the new standard
deviation is
(a) 5 (b) 1 (c) 0 (d) 15
21. If the minimum value in a set is 9 and its range is 57, the maximum value of the set is
(a) 33 (b) 66 (c) 48 (d) 24
22. Which of the following situations exhibit the function of Inferential Statistics?
(a) The highest score obtained by BSS section 1 in their first quiz is 48.
(b) All the ten scores are closely scattered around the average value.
(c) Mathematical anxiety of the students will be related with their academic performance.
(d) Line graphs will be used to exhibit the fluctuating trend of monthly consumption of electricity.
23. Which of the following situations exhibit the function of Descriptive Statistics?
(a) Determining the most favored characteristics of the ideal teacher students perceived.
(b) Relating the number of absences committed by students with their academic performance.
(c) Citing the differences in perception of the male and female students towards NO ID-NO ENTRY policy.
(d) Comparing the course grades in Statistics of every section who are taking the subject during the first
semester.
For items 24 to 27, consider this situation. There were 200 students of PUP San Juan enrolled in General
Statistics in the first semester. A periodic examination was given and it was found out that the average score
is 93. When a random section with 50 students is chosen, it was found out that 89 is the average score of the
section.
24. What do we call the number 200?
(a) statistic (b) sample size (c) parameter (d) population size
For items 28 to 30, consider this situation. A group of undergraduate researchers aims to execute stratified
random sampling among 63 Section 1 students, 52 Section 2 students, 48 Section 3 students and 37 Section 4
students. The margin of error is 5%.
28. What is the sample size?
(a) 124 students (b) 134 students (c) 144 students (d) 154 students
29. How many students of Section 2 will be included in the sample?
30. How many students of Section 4 will be included in the sample?
31. Which of the following is an example of a primary source of data?
(a) TV station (b) encyclopedias (c) living organisms (d) scientific journals
32. A marketing team specializing in food products set stands in a mall to determine the preference of the mall-goers
in choosing and consuming finger-foods. What sampling technique is appropriate in doing this?
(a) cluster sampling (b) purposive sampling (c) convenience sampling (d) systematic sampling
33. A market research company asks a sample of students to rate the taste of a new soft drink. The response scale
is really yummy, yummy, ok, yuck, really yuck. This is an example of a
(a) Nominal Level (b) Ordinal Level (c) Interval Level (d) Ratio Level
34. A researcher is studying students in college in PUP. She takes a sample of 400 students from 10 colleges. The
average age of selected college students in PUP is
(a) statistic. (b) parameter. (c) the median. (d) a population.
35. A coffee shop wants to know the temperature of coffee that most people prefer. They brew coffee at the typical
temperature for the shop and then ask customers “Do you prefer coffee to be at this temperature?” and record
a yes or no answer for each customer. What is the level of measurement of the way they measured preferred
temperature?
(a) Nominal (b) Ordinal (c) Interval (d) Ratio
36. The same coffee shop later repeats the study but this time they ask “Do you prefer coffee to be a lot colder, a
little cooler, this temperature, a little warmer or a lot hotter?” and record the person's response. Now, what is
the level of measurement of the way they measured preferred temperature?
(a) Nominal (b) Ordinal (c) Interval (d) Ratio
37. Determine the characteristics of a Normal Curve.

I. The normal curve is bell-shaped and symmetric about the mean.
II. The mean, median and mode are not equal.
III. The total area under the curve is equal to one.
IV. The normal curve approaches, but never touches the x-axis as it extends farther and farther away from the
mean.
(a) I, II and III (b) I, II, III and IV (c) II, III and IV (d) I, III and IV
38. Given a normal distribution, find the area under the curve which lies to the right of z = 1.96.
(a) 0.9750 (b) 0.0196 (c) 0.4750 (d) 0.0250
For items 39 to 43, consider this situation. A researcher has collected the following sample data: 5, 12, 6, 8, 5,
6, 7, 5, 12, 4
39. Find the median.
(a) 5 (b) 6 (c) 7 (d) 8
40. Find the mode.
(a) 5 (b) 6 (c) 7 (d) 8
41. Find the mean.
(a) 5 (b) 6 (c) 7 (d) 8
42. Find the standard deviation.
(a) 1.2 (b) 2.2 (c) 3.2 (d) 4.2
43. Find the Pearson coefficient of skewness using the value of median.
(a) 1.2 (b) 2.2 (c) 3.2 (d) 4.2
Problem Solving
A. The PUPCET scores for the math portion of the test were normally distributed, with a mean of 23.4 and a
standard deviation of 4.8. Find the probability that a randomly selected student who took the math portion
of the PUPCET has a score that is
(a) less than 18.
(b) between 21 and 26.
B. Given the following frequency distribution.
Class Interval Frequency

240 - 259 5
220 - 239 5
200 - 219 12
180 - 199 13
160 - 179 5
140 - 159 10
Compute the following:

(a) Mean
(b) Median
(c) Mode
(d) Standard Deviation
(e) Q1
(f) Q3
(g) D1
(h) D9
(i) P10
(j) P90
(k) Karl Pearson's Measure of Skewness
(l) Kurtosis
C. Construct a frequency distribution table.
No. of Children Frequency Percentage (%)

0
1
2
3
4
5
Total
(a) What percentage of couples married seven years has two children?
(b) What percentage of couples married seven years has at least two children?
STATISTICAL ANALYSIS WITH SOFTWARE APPLICATION

FINAL EXAMINATION
Name: Course & Section:
Directions: Read each item carefully. For each item, write the letter corresponding to the best answer on yellow paper.
Write NONE if no correct choice is given. Make sure to also write your solutions.
1. Which of the following is an alternative hypothesis?

(a) There will be a significant difference between the length of time taken to complete a test online and the
time taken to complete a test on paper.
(b) There are no significant factors.
(c) There will be no difference between the length of time taken to complete tests online and tests completed
on paper, and if there is it is due to chance.
2. The alternative hypothesis of the F-test is ______.
(a) Equal variances assumed (c) Data follows a Normal Distribution
(b) Equal variances Not assumed (d) Data does not follow a Normal Distribution
3. The two forms of t-tests are
(a) One-way and two-way (c) Chi-square - Independent

(b) Independent and dependent (d) Pearson r and chi-square
4. If a researcher conducts a study in which the reading ability of a class of 20 second graders is tested at the
beginning and at the end of the year, the appropriate statistical procedure to analyze the results would be
(a) One-way ANOVA (c) Dependent sample t - test

(b) Independent sample t - test (d) Pearson r
5. Suppose a researcher is conducting a study in which five groups of adults, each group having a distinct life
situation, are assessed on a measure of stress. The appropriate statistical procedure to compare the groups is
a(n)

6. When the value of the x variable increases and the value of the y variable also increases, it is known as ______.
(a) No Relationship (c) Inverse Relationship

(b) Direct Relationship (d) None of the above
7. If the computed correlation coefficient of two continuous variables is 0.967, then describe the relationship.
(a) Weak Negative and Inverse Relationship
(b) Strong Negative and Inverse Relationship
(c) Strong Positive and Direct Relationship
(d) Weak Positive and Direct Relationship
8. If the computed value for Pearson r is negative, this implies that there is a/an ______ relationship between
variables x and y.
(a) No Relationship (c) Inverse Relationship

(b) Direct Relationship (d) Undefined
9. You find children who take vitamins have higher health index scores than children who do not take vitamins
(p < 0.05). You have found that these two groups of children are
(a) significantly different
(b) different because of chance
(c) positively correlated
(d) negatively correlated
10. A conclusion in a research on Science Teaching in selected Quezon City high schools states, “Most schools
lack adequate facilities.” Which of the following is a proper recommendation for this conclusion?
(a) School administrators should be pro-active and skillful in acquiring adequate facilities.
(b) School administrators should conduct Science achievement tests that are centralized and uniform
(c) School administrators should hire more competent Science teachers for proper handling of the facilities.
(d) School administrators should work on the revision of the Science curricula so that lessons may adapt with
the facilities.
11. Which of the following is a positive correlation?
(a) Gas mileage decreases as vehicle weight increases
(b) As study time decreases, students achieve lower grades
(c) As levels of self-esteem decline, levels of depression increase
(d) People who exercise regularly are less likely to be obese
12. A friend of mine studies the effects of praise on happiness. She believes that children who receive praise are
happier overall than children who do not receive praise. She measures happiness by counting the number of
times a child smiles in a one-hour period. She knows that in the population of children who do not receive praise,
children smile an average of 4 times per hour with a standard deviation of 0.5, and that these data are normally distributed.
She selects a sample of 100 children whom she knows receive praise and finds that they smile an average of 3.5
times per hour.
An appropriate null hypothesis for this study is:
(a) Children who receive praise smile more than children who do not.
(b) Children who receive praise smile the same amount as children who do not.
(c) Children who receive praise are happier than children who do not.
(d) Children who receive praise do not smile more than children who do not.
13. What is the criterion for rejecting the null hypothesis using p value approach?
(a) If p value is less than or equal to the level of significance retain Ho, otherwise Reject Ho.
(b) If p value is less than or equal to the level of significance reject Ho, otherwise retain Ho.
(c) If p value is greater than or equal to the level of significance reject Ho, otherwise retain Ho.
(d) If p value is greater than or equal to the level of significance retain Ho, otherwise Reject Ho.
14. The alternative hypothesis of the Shapiro-Wilk test is ______.
(a) Equal variances assumed (c) Data follows a Normal Distribution
(b) Equal variances Not assumed (d) Data does not follow a Normal Distribution
15. An inspector needs to learn if customers are getting fewer ounces of a soft drink than the 28 ounces stated on
the label. After she collects data from a sample of bottles, she is going to conduct a test of a hypothesis. She
should use
(a) A two tailed test.
(b) A one tailed test with an alternative to the right.
(c) A one tailed test with an alternative to the left.
(d) Either a one or a two tailed test because they are equivalent.
16. A hypothesis test is done in which the alternative hypothesis is that more than 10% of a population is left-
handed. The computed p value is 0.25. Which statement is correct?
(a) We can conclude that more than 10% of the population is left-handed.
(b) We can conclude that more than 25% of the population is left-handed.
(c) We can conclude that exactly 25% of the population is left-handed.
(d) We cannot conclude that more than 10% of the population is left-handed.
17. There is a negative correlation between the number of absences students have and their grades. What can we conclude
from this research finding?
(a) That being absent leads to lower grades
(b) That students that are absent more often are likely to have lower grades
(c) That low grades lead to people being absent
(d) That this is an illusory correlation
18. It is a procedure based on sample evidence and probability, used to test claims regarding a characteristic of one or
more populations.
(a) Parametric Statistics (c) Hypothesis

(b) Non-Parametric Statistics (d) Hypothesis Testing
19. If the computed p-value is 0.0001 and the level of significance is 0.01, what do you think will be the decision
of the researcher?
(a) Reject Ho (c) Reject Ha

(b) Failed to Reject Ho (d) Failed to Reject Ha
20. Which of the following statistical tests is not used for testing significant difference?

Problem Solving
A. The ACT is a college entrance exam. ACT has determined that a score of 22 on the mathematics portion of
the ACT suggests that a student is ready for college-level mathematics. To achieve this goal, ACT recommends that
students take a core curriculum of math courses: Algebra I, Algebra II, and Geometry. Suppose a random sample
of 200 students who completed this core set of courses results in a mean ACT math score of 22.6 with a standard
deviation of 3.9. Do these results suggest that students who complete the core curriculum are ready for college-level
mathematics? That is, are they scoring above 22 on the math portion of the ACT?
1. State the appropriate null and alternative hypotheses.
2. If p - value is 0.001, write your decision and conclusion.
B. A corporation owns a chain of several hundred gasoline stations on the eastern seaboard. The marketing
director wants to test a proposed marketing campaign by running ads on some local television stations and determining
whether gasoline sales at a sample of the company's stations increase after the advertising. The following
data represent gasoline sales for a day before and a day after the advertising campaign. Determine whether sales
increased significantly after the advertising campaign. Use an alpha of 0.05.
Station Before After

1 10,500 12,600
2 8,870 10,660
3 12,300 11,890
4 10,510 14,630
5 5,570 8,580
6 9,150 10,115
7 11,980 14,350
8 6,740 6,900
9 7,340 8,890
10 13,400 16,540
11 12,200 11,300
12 10,570 13,330
13 9,880 9,990
14 12,100 14,050
15 9,000 9,500
16 11,800 12,450
17 10,500 13,450
1. Step 1:
2. Step 2:
3. Step 3:
Check the assumptions.
4. Step 4:
5. Step 5:
6. Step 6: