
ENGINEERING DATA ANALYSIS
__________________________________________
COURSE MATERIAL

Prepared by:
Engr. Marizen Contreras
Engr. Siddartha Valle
Engr. Framces Thea De Mesa
Engr. Nadine Alex Bravo

COURSE TECHNICALITIES

Course Details

Course Code: EDA 1
Course Title: Engineering Data Analysis
Course Description: This course is designed for undergraduate engineering students, with emphasis on problem solving related to societal issues that engineers and scientists are called upon to solve. It introduces different methods of data collection and the suitability of using a particular method for a given situation. The relationship of probability to statistics is also discussed, providing students with the tools they need to understand how "chance" plays a role in statistical analysis. Probability distributions of random variables and their uses are also considered, along with a discussion of linear functions of random variables within the context of their application to data analysis and inference. The course also covers estimation techniques for unknown parameters, hypothesis testing used in making inferences from sample to population, and inference for regression parameters, including building models for estimating means and predicting future values of key variables under study. Finally, statistically based experimental design techniques and the analysis of experimental outcomes are discussed with the aid of statistical software.
No. of Units: 3

Program Outcomes to be met:

PO  PO Statement
a   Apply knowledge of mathematics and science to solve engineering problems
b   Design and conduct experiments, as well as analyze and interpret data
i   Recognize the need for, and an ability to engage in, life-long learning
j   Apply knowledge of contemporary issues
n   Participate in the generation of new knowledge or in research and development projects

Course Outcomes

By the end of the course, students will be able to:


CO   CO Statement
CO1  Identify different methods of collecting data suitable for a certain situation or problem.
CO2  Compute the probability of a certain event happening.
CO3  Apply statistical tools in analyzing data.
CO4  Use MS Excel as a tool in statistical calculations.
CO5  Test hypotheses based on statistical calculations.

Mapping of PO to CO

POa POb POi POj POn


CO1
CO2
CO3
CO4
CO5

Learning Process (Table of Contents)

Module 1: Obtaining Data
  Role of Statistics in Engineering (p. 8)
  Methods of Collecting Data (p. 19)
  Planning and Conducting Surveys (p. 31)
  Introduction to Design of Experiments (p. 37)
  Planning and Conducting Experiments (p. 39)
  Organizing Data (p. 45)
  Averages and Variation, Quantiles, Skewness and Kurtosis (p. 65)

Module 2: Probability
  Probability Theory (p. 95)
  Sample Space and Relationship Among Events (p. 101)
  Counting Rules Useful in Probability (p. 108)
  Rules of Probability (p. 115)
  Joint Probability Distribution (p. 126)

Module 3: Discrete Probability Distributions
  Introduction (p. 135)
  Binomial Distributions (p. 142)
  Geometric and Negative Binomial Distributions (p. 146)
  Hypergeometric Distributions (p. 149)
  Poisson Distributions (p. 151)

Module 4: Continuous Probability Distributions
  Continuous Random Variable (p. 154)
  Probability Density Function (p. 156)
  Normal Distribution (p. 159)

Module 5: Sampling Distributions and Estimation
  Sampling Distribution (p. 175)
  Estimation (p. 188)
  Sample Size (p. 197)

Module 6: Test of Hypothesis
  Introduction to Hypothesis (p. 201)
  P-value (p. 207)
  Parametric Tests (p. 210)
    T-Test for Independent Samples (p. 210)
    T-Test for Correlated Samples (p. 213)
    One Sample Mean Tests (p. 216)
    Two Sample Mean Test (p. 217)
    One-Way ANOVA (F test) (p. 219)
    Scheffe's Test (p. 222)
    Two-Way ANOVA (p. 223)
    Three-Way ANOVA (p. 229)
    Simple Linear Regression (p. 240)
    Multiple Regression and Its Significance (p. 243)
    Pearson Coefficient of Correlation r (p. 250)
  Non-Parametric Tests (p. 259)
    Chi-Square Test of Goodness of Fit (p. 260)
    Chi-Square Test of Homogeneity (p. 263)
    Chi-Square Test of Independence (p. 265)
    Mann-Whitney U Test (p. 268)
    Kruskal-Wallis Test (p. 273)
    Sign Test for Two Independent Samples (p. 276)
    Fisher Sign Test (p. 278)
    Median Test for K Independent Samples (p. 280)
    Spearman Rank-Order Coefficient of Correlation rs (p. 283)
    Friedman Fr Test for Randomized Block Design (p. 286)
    McNemar's Test for Correlated Proportions (p. 288)
    Kendall's Coefficient of Concordance (p. 288)

Module 7: Reliability and Validity Tests
  Introduction to Reliability and Validity Tests (p. 294)
  Methods Used for Reliability and Validity Testing (p. 304)

Module 8: Computer Aided Statistics
  Basic Introduction to the Use of MS Excel in Statistical Data Analysis (p. 311)

Obtaining Data
____________________________________________________
MODULE 1

Session 1
Role of Statistics in Engineering

By the end of this session you should be able to:


1. Identify the role that statistics can play in the engineering problem-solving process.
2. Discuss how variability affects the data collected and used for making engineering decisions.
3. Explain the difference between enumerative and analytical studies.
4. Discuss the different methods that engineers use to collect data.
5. Identify the advantages that designed experiments have in comparison to other methods of collecting engineering data.

Lecture:

THE ENGINEERING METHOD AND STATISTICAL THINKING

An engineer is someone who solves problems of interest to society by the efficient application of scientific principles. Engineers accomplish this either by refining an existing product or process or by designing a new product or process that meets customers' needs. The engineering, or scientific, method is the approach to formulating and solving these problems. The steps in the engineering method are as follows:

1. Develop a clear and concise description of the problem.


2. Identify, at least tentatively, the important factors that affect this problem or
that may play a role in its solution.
3. Propose a model for the problem, using scientific or engineering knowledge
of the phenomenon being studied. State any limitations or assumptions of the
model.
4. Conduct appropriate experiments and collect data to test or validate the
tentative model or conclusions made in steps 2 and 3.
5. Manipulate the model to assist in developing a solution to the problem.
6. Conduct an appropriate experiment to confirm that the proposed solution to
the problem is both effective and efficient.
7. Draw conclusions or make recommendations based on the problem solution.

Figure 1. Steps in Engineering Method of Problem Solving

Statistical methods are used to help us describe and understand variability. By variability,
we mean that successive observations of a system or phenomenon do not produce
exactly the same result. We all encounter variability in our everyday lives, and statistical
thinking can give us a useful way to incorporate this variability into our decision-making
processes.

For example, consider the gasoline mileage performance of your car. Do you always get
exactly the same mileage performance on every tank of fuel? Of course not—in fact,
sometimes the mileage performance varies considerably. This observed variability in
gasoline mileage depends on many factors, such as the type of driving that has occurred
most recently (city versus highway), the changes in condition of the vehicle over time
(which could include factors such as tire inflation, engine compression, or valve wear),
the brand and/or octane number of the gasoline used, or possibly even the weather
conditions that have been recently experienced.

These factors represent potential sources of variability in the system. Statistics gives us
a framework for describing this variability and for learning about which potential sources
of variability are the most important or which have the greatest impact on the gasoline
mileage performance.

In engineering, suppose that an engineer is designing a nylon connector to be used in an automotive engine application. The engineer is considering establishing the design
specification on wall thickness at 3/32 inch but is somewhat uncertain about the effect of
this decision on the connector pull-off force. If the pull-off force is too low, the connector
may fail when it is installed in an engine. Eight prototype units are produced and their pull-
off forces measured, resulting in the following data (in pounds): 12.6, 12.9, 13.4, 12.3,
13.6, 13.5, 12.6, and 13.1. As we anticipated, not all of the prototypes have the same pull-
off force.

Because the pull-off force measurements exhibit variability, we consider the pull-off force
to be a random variable. A convenient way to think of a random variable, say X, that
represents a measurement, is by using the model
X = μ + ε

where μ is a constant and ε is a random disturbance.

The constant μ remains the same with every measurement, but small changes in the environment, test equipment, differences in the individual parts themselves, and so forth change the value of ε.

If there were no disturbances, ε would always equal zero and X would always be equal to the constant μ. However, this never happens in the real world, so the actual measurements X exhibit variability. We often need to describe, quantify, and ultimately reduce variability.
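As a rough illustration, this model can be simulated. The following is a minimal sketch in Python (not part of the original course software); the values assumed for μ and for the spread of ε are hypothetical:

    import random

    mu = 13.0     # assumed constant part of the measurement (pounds)
    sigma = 0.45  # assumed standard deviation of the random disturbance epsilon

    # Simulate eight pull-off force measurements following X = mu + epsilon
    measurements = [mu + random.gauss(0, sigma) for _ in range(8)]
    print([round(x, 1) for x in measurements])

Each run produces a different set of eight values, which is exactly the variability the model is meant to capture.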

Figure 2. Dot Diagram of Pull-off Force

The dot diagram is a very useful plot for displaying a small body of data, say, up to about 20 observations. This plot allows us to see easily two features of the data: the location, or the middle, and the scatter or variability. When the number of observations is small, it is usually difficult to identify any specific patterns in the variability, although the dot diagram is a convenient way to see any unusual data features.

From testing the prototypes, the engineer knows that the average pull-off force is 13.0 pounds. However, he thinks that this may be too low for the intended application, so he decides to consider an alternative design with a greater wall thickness, 1/8 inch. Eight prototypes of this design are built, and the observed pull-off force measurements are 12.9, 13.7, 12.8, 13.9, 14.2, 13.2, 13.5, and 13.1. The average is 13.4 pounds.

Figure 3. Dot diagram of pull-off force for two wall thicknesses
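A short Python sketch of how the two samples could be summarized (the data values are the ones given in the text):

    thin = [12.6, 12.9, 13.4, 12.3, 13.6, 13.5, 12.6, 13.1]   # 3/32-inch wall
    thick = [12.9, 13.7, 12.8, 13.9, 14.2, 13.2, 13.5, 13.1]  # 1/8-inch wall

    mean_thin = sum(thin) / len(thin)     # 13.0 pounds
    mean_thick = sum(thick) / len(thick)  # about 13.4 pounds
    print(f"thin wall: {mean_thin:.1f} lb, thick wall: {mean_thick:.1f} lb")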

This display gives the impression that increasing the wall thickness has led to an increase
in pull-off force. However, there are some obvious questions to ask. For instance, how do
we know that another sample of prototypes will not give different results? Is a sample of
eight prototypes adequate to give reliable results? If we use the test results obtained so
far to conclude that increasing the wall thickness increases the strength, what risks are
associated with this decision?

For example, is it possible that the apparent increase in pull-off force observed in the
thicker prototypes is only due to the inherent variability in the system and that increasing
the thickness of the part (and its cost) really has no effect on the pull-off force?

Often, physical laws (such as Ohm's law and the ideal gas law) are applied to help design products and processes. We are familiar with this reasoning from general laws to specific cases. But it is also important to reason from a specific set of measurements to more general cases to answer the previous questions. This reasoning is from a sample (such as the eight connectors) to a population (such as the connectors that will be sold to customers). The reasoning is referred to as statistical inference.

Figure 4. Types of reasoning

Clearly, reasoning based on measurements from some objects to measurements on all objects can result in errors (called sampling errors). However, if the sample is selected properly, these risks can be quantified and an appropriate sample size can be determined.

In some cases, the sample is actually selected from a well-defined population. The
sample is a subset of the population. For example, in a study of resistivity a sample of
three wafers might be selected from a production lot of wafers in semiconductor
manufacturing. Based on the resistivity data collected on the three wafers in the sample,
we want to draw a conclusion about the resistivity of all of the wafers in the lot.

In other cases, the population is conceptual (such as with the connectors), but it might be
thought of as future replicates of the objects in the sample. In this situation, the eight
prototype connectors must be representative, in some sense, of the ones that will be
manufactured in the future. Clearly, this analysis requires some notion of stability as an
additional assumption. For example, it might be assumed that the sources of variability in
the manufacture of the prototypes (such as temperature, pressure, and curing time) are
the same as those for the connectors that will be manufactured in the future and ultimately
sold to customers.

The wafers-from-lots example is called an enumerative study. A sample is used to make an inference to the population from which the sample is selected. The connector example is called an analytic study. A sample is used to make an inference to a conceptual (future) population. The statistical analyses are usually the same in both cases, but an analytic study clearly requires an assumption of stability.

Figure 5. Enumerative Versus Analytic Study

COLLECTING ENGINEERING DATA

In the engineering environment, three basic methods of collecting data are:

• A retrospective study using historical data
  - A retrospective study would use either all or a sample of the historical process data archived over some period of time.
• An observational study
  - In an observational study, the engineer observes the process or population, disturbing it as little as possible, and records the quantities of interest. Because these studies are usually conducted for a relatively short time period, sometimes variables that are not routinely measured can be included.
• A designed experiment
  - In a designed experiment the engineer makes deliberate or purposeful changes in the controllable variables of the system or process, observes the resulting system output data, and then makes an inference or decision about which variables are responsible for the observed changes in output performance.

An effective data collection procedure can greatly simplify the analysis and lead to
improved understanding of the population or process that is being studied.

Illustrative Example:

Montgomery, Peck, and Vining (2001) describe an acetone-butyl alcohol distillation column for which the concentration of acetone in the distillate or output product stream is an important variable. Factors that may affect the distillate are the reboil temperature, the condensate temperature, and the reflux rate.

Production personnel obtain and archive the following records: the concentration of acetone in an hourly test sample of output product; the reboil temperature log, which is a plot of the reboil temperature over time; the condenser temperature controller log; and the nominal reflux rate each hour. The reflux rate should be held constant for this process. Consequently, production personnel change it very infrequently.

The study objective might be to discover how the two temperatures and the reflux rate affect the acetone concentration in the output product stream.

Using a retrospective study would present the following problems:

1. We may not be able to see the relationship between the reflux rate and acetone
concentration, because the reflux rate didn’t change much over the historical
period.
2. The archived data on the two temperatures (which are recorded almost
continuously) do not correspond perfectly to the acetone concentration
measurements (which are made hourly). It may not be obvious how to construct
an approximate correspondence.
3. Production maintains the two temperatures as closely as possible to desired
targets or set points. Because the temperatures change so little, it may be
difficult to assess their real impact on acetone concentration.
4. Within the narrow ranges that they do vary, the condensate temperature tends
to increase with the reboil temperature. Consequently, the effects of these two
process variables on acetone concentration may be difficult to separate.

As we can see, a retrospective study may involve a lot of data, but that data may contain
relatively little useful information about the problem. Furthermore, some of the relevant
data may be missing, there may be transcription or recording errors resulting in outliers
(or unusual values), or data on other important factors may not have been collected and
archived. In the distillation column, for example, the specific concentrations of butyl
alcohol and acetone in the input feed stream are a very important factor, but they are not
archived because the concentrations are too hard to obtain on a routine basis.

Using the observational study the engineer would design a form to record the two
temperatures and the reflux rate when acetone concentration measurements are made.
It may even be possible to measure the input feed stream concentrations so that the
impact of this factor could be studied.

Generally, an observational study tends to solve problems 1 and 2 above and goes a long
way toward obtaining accurate and reliable data. However, observational studies may not
help resolve problems 3 and 4.

Designed experiments are a very powerful approach to studying complex systems, such
as the distillation column. This process has three factors, the two temperatures and the
reflux rate, and we want to investigate the effect of these three factors on output acetone
concentration.

A good experimental design for this problem must ensure that we can separate the effects of all three factors on the acetone concentration. The specified values of the three factors used in the experiment are called factor levels. Typically, we use a small number of levels for each factor, such as two or three. For this distillation column problem, suppose we use a "high" and a "low" level (denoted +1 and -1, respectively) for each of the factors. We thus would use two levels for each of the three factors.

A very reasonable experiment design strategy uses every possible combination of the
factor levels to form a basic experiment with eight different settings for the process. This
type of experiment is called a factorial experiment.

Table 1. Designed Experiment (Factorial Design) for the Distillation Column

With each setting of the process conditions, we allow the column to reach equilibrium,
take a sample of the product stream, and determine the acetone concentration. We then
can draw specific inferences about the effect of these factors. Such an approach allows
us to proactively study a population or process.
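As an illustration, the eight runs of such a two-level, three-factor (2^3) factorial design can be enumerated programmatically. This is a sketch in Python; the factor names are taken from the distillation example:

    from itertools import product

    factors = ["reboil temperature", "condensate temperature", "reflux rate"]

    # Every combination of the low (-1) and high (+1) levels: 2^3 = 8 runs
    for run, levels in enumerate(product([-1, +1], repeat=3), start=1):
        print(f"Run {run}:", dict(zip(factors, levels)))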

Observing Processes Over Time

Often data are collected over time. In this case, it is usually very helpful to plot the data
versus time in a time series plot. Phenomena that might affect the system or process
often become more visible in a time-oriented plot and the concept of stability can be better
judged.
Figure 6. Dot diagram of acetone concentration taken hourly from the distillation column

The large variation displayed on the dot diagram indicates a lot of variability in the
concentration, but the chart does not help explain the reason for the variation.

Figure 7. Time series plot of concentration, with a process mean shift detected

A shift in the process mean level is visible in the plot and an estimate of the time of the
shift can be obtained.
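A minimal sketch of generating and plotting such a time series (Python with matplotlib, assuming it is installed; the shift time, magnitude, and noise level here are assumed, not taken from the course data):

    import random
    import matplotlib.pyplot as plt

    random.seed(0)
    # 100 hourly observations; the process mean shifts up by 2 units after hour 60
    data = [90 + (2 if t > 60 else 0) + random.gauss(0, 1) for t in range(100)]

    plt.plot(data, marker="o")
    plt.axvline(60, linestyle="--")  # assumed time of the shift
    plt.xlabel("Hour")
    plt.ylabel("Acetone concentration (g/l)")
    plt.show()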

W. Edwards Deming, a very influential industrial statistician, stressed that it is important to understand the nature of variability in processes and systems over time. He conducted an experiment in which he attempted to drop marbles as close as possible to a target on a table. He used a funnel mounted on a ring stand, and the marbles were dropped into the funnel.

Figure 8. Deming's Funnel Experiment Set-up

The funnel was aligned as closely as possible with the center of the target. He then used
two different strategies to operate the process.
1. He never moved the funnel. He just dropped one marble after another and
recorded the distance from the target.
2. He dropped the first marble and recorded its location relative to the target. He then moved the funnel an equal and opposite distance in an attempt to compensate for the error. He continued to make this type of adjustment after each marble was dropped.

After both strategies were completed, he noticed that the variability of the distance from
the target for strategy 2 was approximately 2 times larger than for strategy 1. The
adjustments to the funnel increased the deviations from the target. The explanation is that
the error (the deviation of the marble’s position from the target) for one marble provides
no information about the error that will occur for the next marble. Consequently,
adjustments to the funnel do not decrease future errors. Instead, they tend to move the
funnel farther from the target.

This interesting experiment points out that adjustments to a process based on random
disturbances can actually increase the variation of the process. This is referred to as over
control or tampering. Therefore adjustments should be applied only to compensate for
a nonrandom shift in the process.

Figure 9. Time Series of two methods in Deming’s Funnel Experiment

The target value for the process is 10 units. The figure displays the data with and without
adjustments that are applied to the process mean in an attempt to produce data closer to
target. Each adjustment is equal and opposite to the deviation of the previous
measurement from target. For example, when the measurement is 11 (one unit above
target), the mean is reduced by one unit before the next measurement is generated. The
over control has increased the deviations from the target.
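A hedged simulation of the two strategies (a Python sketch; the target, noise level, and number of drops are assumed):

    import random
    import statistics

    random.seed(1)
    target, n, sigma = 10.0, 1000, 1.0

    # Strategy 1: never adjust; each measurement is the target plus random error
    no_adjust = [target + random.gauss(0, sigma) for _ in range(n)]

    # Strategy 2: after each measurement, shift the mean by an amount equal and
    # opposite to the previous deviation (over control, or tampering)
    mean, adjusted = target, []
    for _ in range(n):
        x = mean + random.gauss(0, sigma)
        adjusted.append(x)
        mean -= x - target  # compensating adjustment

    print(statistics.variance(no_adjust))  # close to sigma squared
    print(statistics.variance(adjusted))   # roughly twice as large

The adjusted series shows roughly double the variance, consistent with Deming's observation that tampering increases variation.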

Figure 10. Time series of Deming’s funnel experiment with applied adjustment at the
point of process mean shift.

Figure 10 displays the data without adjustment except that the measurements after
observation number 50 are increased by two units to simulate the effect of a shift in the
mean of the process. When there is a true shift in the mean of a process, an adjustment
can be useful. It also displays the data obtained when one adjustment (a decrease of two
units) is applied to the mean after the shift is detected (at observation number 57). Note
that this adjustment decreases the deviations from target.

The question of when to apply adjustments (and by what amounts) begins with an
understanding of the types of variation that affect a process. A control chart is an
invaluable way to examine the variability in time-oriented data.

Figure 11. Control Chart for Acetone Concentration Data

The center line on the control chart is just the average of the concentration measurements for the first 20 samples (x̄ = 91.5 g/l) when the process is stable. The upper control limit and the lower control limit are a pair of statistically derived limits that reflect the inherent
or natural variability in the process. These limits are located three standard deviations
of the concentration values above and below the center line. If the process is operating
as it should, without any external sources of variability present in the system, the
concentration measurements should fluctuate randomly around the center line, and
almost all of them should fall between the control limits.

In this control chart the visual frame of reference provided by the center line and the
control limits indicates that some upset or disturbance has affected the process around
sample 20 because all of the following observations are below the center line and two of
them actually fall below the lower control limit. This is a very strong signal that corrective
action is required in this process. If we can find and eliminate the underlying cause of this
upset, we can improve process performance considerably.

Control charts are a very important application of statistics for monitoring, controlling, and
improving a process. The branch of statistics that makes use of control charts is called
statistical process control, or SPC.
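A sketch of how the center line and three-sigma limits could be computed from 20 stable samples (Python; the data values below are placeholders, not the course's actual measurements):

    import statistics

    # Placeholder data: 20 hourly acetone concentrations (g/l) from a stable period
    stable = [91.5, 90.8, 92.1, 91.0, 91.9, 90.5, 92.3, 91.4, 91.7, 90.9,
              92.0, 91.2, 91.6, 90.7, 91.8, 91.3, 92.2, 90.6, 91.1, 91.4]

    center = statistics.mean(stable)   # center line
    sigma = statistics.stdev(stable)   # sample standard deviation
    ucl = center + 3 * sigma           # upper control limit
    lcl = center - 3 * sigma           # lower control limit
    print(f"center {center:.1f}, UCL {ucl:.1f}, LCL {lcl:.1f}")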

Activity 1:

State any experiment you have done before that you can repeatedly perform in the comfort of your home. Record measurements and observe variations. What do you think are the factors that might have contributed to the variations? Among the factors, did you find any variable that can be controlled? If yes, what is it and how can it be controlled?

Session 2
Methods of Collecting Data, Planning and Conducting Surveys,
Planning and Conducting Experiments

By the end of this session you should be able to:

1. Enumerate and illustrate the different methods of statistical data collection.
2. Explain the method of conducting design of experiments.

Lecture:

INTRODUCTION

Statistics plays an important role in nearly all phases of our lives. It is used in agriculture,
biology and natural sciences, business and economics, electronics and computer
sciences, education, political science and sociology, and other fields of science and
engineering as mentioned in session 1.

Statistics or statistical methods are the mathematical techniques used to facilitate the
interpretation of numerical data secured from entities, individuals, or observations.
Statistics is a mathematical science including methods of collecting, organizing and
analyzing data in such a way that meaningful conclusions can be drawn from them. In
general, its investigations and analyses fall into two broad categories called descriptive
and inferential statistics.

Descriptive statistics deals with the processing of data without attempting to draw any inferences from it. The data are presented in the form of tables and graphs. The characteristics of the data are described in simple terms. Events that are dealt with include everyday happenings such as accidents, prices of goods, business, incomes, epidemics, sports data, and population data. Inferential statistics, on the other hand, is a scientific discipline that uses mathematical tools to make forecasts and projections by analyzing the given data. It makes use of generalizations, predictions, estimations, or approximations in the face of uncertainty.

The basic idea behind all statistical methods of data analysis is to make inferences about a population by studying a small sample chosen from it. A population often consists of a large group of specifically defined elements. For example, the population of a specific country means all the people living within the boundaries of that country. Usually, it is not possible or practical to measure data for every element of the population under study. We randomly select a small group of elements from the population and call it a sample. Inferences about the population are then made on the basis of several samples. A number that describes a population is called a parameter, while a number that describes a sample is called a statistic.
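A small illustration of the distinction (a Python sketch with made-up numbers; the population here is simulated purely for demonstration):

    import random
    import statistics

    random.seed(42)
    # Hypothetical population: 100,000 height measurements (cm)
    population = [random.gauss(170, 10) for _ in range(100_000)]
    parameter = statistics.mean(population)  # describes the population

    sample = random.sample(population, 50)
    statistic = statistics.mean(sample)      # describes the sample; estimates the parameter
    print(f"parameter {parameter:.1f} cm, statistic {statistic:.1f} cm")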

STATISTICAL INQUIRY

It is a systematic process of collecting data and logically analyzing information about various concerns in society.

It means statistical examination. The person who conducts an investigation is referred to as an investigator, and the investigator requires the help of an enumerator, who gathers data, and a respondent, who gives information for the statistical examination.

For example, gathering information about how many students have cleared the MBA entrance exam and how many have not.

METHODS OF DATA COLLECTION

Information gathering can be from a variety of sources. Importantly, there is no single best method of data collection. In principle, how data are collected depends on the nature of the research or the phenomena being studied.

Data collection is a crucial aspect of any level of research work. If data are inaccurately collected, it will surely impact the findings of the study, thereby leading to false or worthless outcomes.

Data collection is a systematic method of collecting and measuring data gathered from different sources of information in order to provide answers to relevant questions. An accurate evaluation of collected data can help researchers predict future phenomena and trends.

Data collection can be classified into two types, namely: primary and secondary data. Primary data are raw data, i.e., fresh data collected for the first time. They are collected through personal experiences or evidence, particularly for research, and are also described as raw data or first-hand information. The investigator supervises and controls the data collection process directly. Mostly, the data are collected through observations, physical testing, mailed questionnaires, surveys, personal interviews, telephonic interviews, case studies, and focus groups, etc. Secondary data, on the other hand, are
data that were previously collected and tested. These are second-hand data that is
already collected and recorded by some researcher for their purpose and not for the
current research problem. It is accessible in the form of data collected from different
sources such as government publications, censuses, internal records of the organization,
books, journal articles, websites, and reports etc. This method of gathering data is
affordable, readily available, saves cost and time.

The differences between primary data and secondary data are summarized below:

Meaning:
  Primary data are those which are collected for the first time.
  Secondary data refers to data which have already been collected by some other person.

Originality:
  Primary data are original because they are collected by the investigator for the first time.
  Secondary data are not original because someone else collected them for his own purpose.

Nature of Data:
  Primary data are in the form of raw materials.
  Secondary data are in finished form.

Reliability and Suitability:
  Primary data are more reliable and suitable for the enquiry because they are collected for a particular purpose.
  Secondary data are less reliable and less suitable because someone else collected the data, which may not perfectly match our purpose.

Time and Money:
  Collecting primary data is quite expensive in both time and money.
  Secondary data require less time and money, so they are economical.

Precaution and Editing:
  No special precaution or editing is required while using primary data, as these were collected with a definite purpose.
  Both precaution and editing are essential for secondary data, as they were collected by someone else for his own purpose.

Primary data can be divided into two categories: quantitative and qualitative.

Quantitative data comes in the form of numbers, quantities and values. It describes things
in concrete and easily measurable terms. Because quantitative data is numeric and
measurable, it lends itself well to analytics. When you analyze quantitative data, you may uncover insights that can help you better understand your audience. Because this kind of
data deals with numbers, it is very objective and has a reputation for reliability.

Qualitative data is descriptive, rather than numeric. It is less concrete and less easily measurable than quantitative data. This data may contain descriptive phrases and opinions. Qualitative data helps explain the "why" behind the information quantitative data reveals. For this reason, it is useful for supplementing quantitative data, which will form the foundation of your data strategy.

HOW TO COLLECT DATA IN 5 STEPS

There are many different techniques for collecting different types of quantitative data, but
there’s a fundamental process you’ll typically follow, no matter which method of data
collection you’re using. This process consists of the following five steps.

1. Determine what information you want to collect

The first thing you need to do is choose what details you want to collect. You’ll need to
decide what topics the information will cover, who you want to collect it from and how
much data you need.

2. Set a timeframe for data collection

Next, you can start formulating your plan for how you’ll collect your data. In the early
stages of your planning process, you should establish a timeframe for your data
collection. You may want to gather some types of data continuously.

3. Determine your data collection method

At this step, you will choose the data collection method that will make up the core of your
data-gathering strategy. To select the right collection method, you’ll need to consider the
type of information you want to collect, the timeframe over which you’ll obtain it and the
other aspects you determined.

4. Collect the data

Once you have finalized your plan, you can implement your data collection strategy and
start collecting data. You can store and organize your data. Be sure to stick to your plan
and check on its progress regularly. It may be useful to create a schedule for when you
will check in with how your data collection is proceeding, especially if you are collecting data continuously. You may want to make updates to your plan as conditions change and
you get new information.

5. Analyze the data and implement your findings

Once you’ve collected all of your data, it’s time to analyze it and organize your findings.
The analysis phase is crucial because it turns raw data into valuable insights that you can
use to enhance your marketing strategies, products and business decisions.

DATA COLLECTION

The system of data collection is based on the type of study being conducted. Depending
on the researcher’s research plan and design, there are several ways data can be
collected.

The most commonly used methods are: published literature sources, surveys (email and
mail), interviews (telephone, face-to-face or focus group), observations, documents and
records, and experiments.

1. Literature sources

This involves the collection of data from already published text available in the public
domain. Literature sources can include: textbooks, government or private companies’
reports, newspapers, magazines, online published papers and articles.

This method of data collection is referred to as secondary data collection. In comparison to primary data collection, it is inexpensive and not time consuming.

2. Surveys

A survey is another method of gathering information for research purposes. Information is gathered through questionnaires, mostly based on individual or group experiences regarding a particular phenomenon.

There are several ways by which this information can be collected. Most notable are web-based questionnaires and paper-based questionnaires (printed forms). The results of this method of data collection are generally easy to analyze.

3. Interviews

Interview is a qualitative method of data collection whose results are based on intensive
engagement with respondents about a particular study. Usually, interviews are used in
order to collect in-depth responses from the professionals being interviewed.

An interview can be structured (formal), semi-structured, or unstructured (informal). In essence, an interview method of data collection can be conducted through a face-to-face meeting with the interviewee(s) or through telephone.

4. Observations

The observation method of information gathering is used by monitoring participants in a specific situation or environment at a given time and day. Basically, researchers observe the behavior of the surrounding environments or people that are being studied. This type of study can be controlled, natural, or participant.

Controlled observation is when the researcher uses a standardized procedure for observing participants or the environment. Natural observation is when participants are observed in their natural conditions. Participant observation is where the researcher becomes part of the group being studied.

5. Documents and records

This is the process of examining existing documents and records of an organization for
tracking changes over a period of time. Records can be tracked by examining call logs,
email logs, databases, minutes of meetings, staff reports, information logs, etc.

For instance, an organization may want to understand why there are lots of negative reviews and complaints from customers about its products or services. In this case, the organization will look into records of its products or services and recorded interactions of employees with customers.

6. Experiments

Experimental research is a research method in which the causal relationship between two variables is examined. One of the variables is manipulated, and the other is measured. These two variables are classified as dependent and independent variables.

In experimental research, data are mostly collected based on the cause and effect of the two variables being studied. This type of research is common among medical researchers, and it uses a quantitative research approach. Additional topics will be discussed in the introduction to design of experiments.

USES OF DATA COLLECTION

Collecting data is valuable because you can use it to make informed decisions. The more
relevant, high-quality data you have, the more likely you are to make good choices when
it comes to marketing, sales, customer service, product development and many other
areas of your business. Some specific uses of customer data include the following:

1. Improving Your Understanding of Your Audience

It can be difficult or impossible to get to know every one of your customers personally,
especially if you run a large business or an online business. The better you understand
your customers, though, the easier it will be for you to meet their expectations. Data
collection enables you to improve your understanding of who your audience is and
disseminate that information throughout your organization. Through the primary data
collection methods described above, you can learn about who your customers are, what
they’re interested in and what they want from you as a company.

2. Identifying Areas for Improvement or Expansion

Collecting and analyzing data helps you see where your company is doing well and where
there is room for improvement. It can also reveal opportunities for expanding your
business.
Looking at transactional data, for example, can show you which of your products are the
most popular and which ones do not sell as well. This information might lead you to focus
more on your bestsellers, and develop other similar products. You could also look at
customer complaints about a product to see which aspects are causing problems.
Data is also useful for identifying opportunities for expansion. For example, say you run
an e-commerce business and are considering opening a brick-and-mortar store. If you
look at your customer data, you can see where your customers are and launch your first
store in an area with a high concentration of existing customers. You could then expand
to other similar areas.

3. Predicting Future Patterns

Analyzing the data you collect can help you predict future trends, enabling you to prepare
for them. As you look at the data for your new website, for instance, you may discover
videos are consistently increasing in popularity, as opposed to articles. This observation would lead you to put more resources into your videos. You might also be able to predict
more temporary patterns and react to them accordingly. If you run a clothing store, you
might discover pastel colors are popular during spring and summer, while people gravitate
toward darker shades in the fall and winter. Once you realize this, you can introduce the
right colors to your stores at the right times to boost your sales.

You can even make predictions on the level of the individual customer. Say you sell
business software. Your data might show companies with a particular job title often have
questions for tech support when it comes time to update their software. Knowing this in
advance allows you to offer support proactively, making for an excellent customer
experience.

4. Better Personalizing Your Content and Messaging

When you know more about your customers or site visitors, you can tailor the messaging
you send them to their interests and preferences. This personalization applies to
marketers designing ads, publishers choosing which ads to run and content creators
deciding what format to use for their content.
Using data collection in marketing can help you craft ads that target a given audience.
For example, say you’re a marketer looking to advertise a new brand of cereal. If your
customer data shows most people who eat the cereal are in their 50s and 60s, you can
use actors in those age ranges in your ads. If you’re a publisher, you likely have
information about what topics your site visitors prefer to read about. You can group your
audience based on the characteristics they share and then show visitors with those
characteristics content about topics popular with that group.

METHODS OF SAMPLING FROM A POPULATION

It would normally be impractical to study a whole population, for example when doing a
questionnaire survey. Sampling is a method that allows researchers to infer information
about a population based on results from a subset of the population, without having to
investigate every individual. Reducing the number of individuals in a study reduces the
cost and workload, and may make it easier to obtain high quality information, but this has
to be balanced against having a large enough sample size with enough power to detect
a true association.

If a sample is to be used, by whatever method it is chosen, it is important that the individuals selected are representative of the whole population. This may involve specifically targeting hard to reach groups.
There are several different sampling techniques available, and they can be subdivided
into two groups: probability sampling and non-probability sampling. In probability
(random) sampling, you start with a complete sampling frame of all eligible individuals
from which you select your sample. In this way, all eligible individuals have a chance of
being chosen for the sample, and you will be more able to generalize the results from
your study. Probability sampling methods tend to be more time-consuming and expensive
than non-probability sampling. In non-probability (non-random) sampling, you do not start
with a complete sampling frame, so some individuals have no chance of being selected.
Consequently, you cannot estimate the effect of sampling error and there is a significant
risk of ending up with a non-representative sample which produces non-generalizable
results. However, non-probability sampling methods tend to be cheaper and more
convenient, and they are useful for exploratory research and hypothesis generation.

PROBABILITY SAMPLING METHODS

1. Simple random sampling


A simple random sample takes a small, random portion of the entire population to
represent the entire data set, where each member has an equal probability of being
chosen. Researchers can create a simple random sample using methods like lotteries or
random draws.

2. Systematic sampling

Systematic sampling is a type of probability sampling method in which sample members from a larger population are selected according to a random starting point but with a fixed, periodic interval (the sampling interval).

3. Stratified sampling

Stratified random sampling allows researchers to obtain a sample population that best
represents the entire population being studied. Stratified random sampling involves
dividing the entire population into homogeneous groups called strata.

4. Clustered sampling

The researcher divides the population into separate groups, called clusters. Then, a
simple random sample of clusters is selected from the population. The researcher
conducts his analysis on data from the sampled clusters.
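The four probability sampling methods can be sketched in a few lines of Python. This is an illustrative sketch over a made-up sampling frame of 100 units; none of it comes from the course material:

    import random

    population = list(range(1, 101))  # hypothetical sampling frame of 100 units

    # Simple random sampling: every member has an equal chance of selection
    srs = random.sample(population, 10)

    # Systematic sampling: random start, then every k-th member
    k = len(population) // 10  # sampling interval
    start = random.randrange(k)
    systematic = population[start::k]

    # Stratified sampling: split into homogeneous strata, sample within each
    strata = {"low": population[:50], "high": population[50:]}
    stratified = [x for s in strata.values() for x in random.sample(s, 5)]

    # Cluster sampling: randomly select whole clusters, keep all their members
    clusters = [population[i:i + 10] for i in range(0, 100, 10)]
    cluster_sample = [x for c in random.sample(clusters, 3) for x in c]
    print(srs, systematic, stratified, cluster_sample, sep="\n")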

NON-PROBABILITY SAMPLING METHODS

1. Convenience sampling

Convenience sampling (also known as availability sampling) is a specific type of non-probability sampling method that relies on data collection from population members who are conveniently available to participate in a study. Facebook polls or questions can be mentioned as a popular example of convenience sampling.

2. Quota sampling

Quota sampling means taking a tailored sample that is in proportion to some characteristic or trait of a population. For example, you could divide a population by the state they live in, income or education level, or sex. The population is divided into groups (also called strata) and samples are taken from each group to meet a quota.

3. Judgement (or Purposive) Sampling

Purposive sampling, also known as judgmental, selective, or subjective sampling, is a form of non-probability sampling in which researchers rely on their own judgment when choosing members of the population to participate in their study.
The main goal of purposive sampling is to focus on particular characteristics of a
population that are of interest, which will best enable you to answer your research
questions.

4. Snowball sampling

Snowball sampling, or chain-referral sampling, is defined as a non-probability sampling technique in which the samples have traits that are rare to find. It is a sampling technique in which existing subjects provide referrals to recruit the samples required for a research study.

BIAS IN SAMPLING

There are five important potential sources of bias that should be considered when
selecting a sample, irrespective of the method used. Sampling bias may be introduced
when:
1. Any pre-agreed sampling rules are deviated from
2. People in hard-to-reach groups are omitted
3. Selected individuals are replaced with others, for example if they are difficult to
contact
4. There are low response rates
5. An out-of-date list is used as the sample frame (for example, if it excludes people
who have recently moved to an area)

PLANNING AND CONDUCTING SURVEYS

1. Develop the Survey's Objectives
a. Use different formative research methods to help identify the objectives.
b. Obtain approval of the objectives from management and/or funders.
2. Design the Survey
a. Choose the most appropriate type of survey
b. Decide on how the survey will be administered
c. Check to see if there are existing surveys with similar objectives
d. Check with other colleagues or agencies conducting similar programs
e. Adapt some or all questions from existing surveys
f. Decide on question and response types that will get the best responses
g. Prepare a draft questionnaire
h. Identify samples for both the survey and the pilot test
3. Pilot-test the Survey
a. Decide where the pilot will be conducted
b. Mail or give out questionnaires, supervise the data collection, or conduct interviews
c. Analyze the pilot-test data
d. Make any necessary revisions to the survey design
4. Conduct the Survey
a. Assign or hire staff
b. Train staff
c. Decide if an incentive is appropriate to get a better response rate
d. Decide on a timeframe for conducting the survey
e. Decide where the survey will be conducted
f. Mail or give out questionnaires, supervise the data collection, or conduct
interviews
g. Monitor the quality of the surveys being completed

h. Re-train staff (if necessary)

I. SURVEY OBJECTIVES

An effective survey begins with specific objectives. Choose survey questions with the
objectives in mind. Every question should refer back to the objectives.

Examples:

1. To determine the prevalence of HIV infection by race, ethnicity, age, and geographic location of women patients accessing our downtown clinic from July 1 to December 31.

2. To identify the most common misconceptions about HIV among high school seniors at Washington High School from September 1 through the 30th.

II. DESIGN ELEMENTS OF SURVEYS

There are 3 types of surveys:

A. QUESTIONNAIRES

There are 3 types of questionnaires:

• Self-Administered
  o Surveys that respondents complete by themselves. Questionnaires can be printed and distributed in person, made available on location, mailed, or available online.
• Group-Administered
  o Surveys that are administered by a group facilitator (i.e., questions are read aloud to respondents) and respondents complete the survey in the group setting.
• Face-To-Face Interviews
  o Surveys that are conducted between two people: one who asks the questions and records the answers, and the respondent who answers the questions. These can be in person or over the phone.

B. STRUCTURED RECORD REVIEWS

Surveys where a form is created to be used by an individual to systematically look through a collection of records to gather information.

C. STRUCTURED OBSERVATIONS

Surveys where data are collected visually by observers in a systematic way to gather
information.

III. ISSUES TO CONSIDER WHEN PILOT-TESTING A QUESTIONNAIRE

• Are there any misspelled words or other errors in the survey?
• Is the type size big enough to be read easily?
• Are there easy questions mixed in with the difficult questions?
• Does the order of questions flow well?
• Are the skip patterns easy to follow?
• Do the item numbers make sense?
• Were the respondents able to follow instructions?
• Do the questions meet the objectives?
• Are the questions appropriate for the respondents?
• Is the vocabulary appropriate for the respondents?
• Is the survey in the most appropriate language for the respondents?
• Are the questions sensitive to possible cultural barriers?
• Is the survey too long?
• How did respondents react to the survey?
• What comments did respondents make?

IV. CONSIDER THE RESOURCES INVOLVED WHEN CONDUCTING THE SURVEY

How can you be sure that you have an adequate amount of resources (time, experience
and money) to conduct your survey? Answer these seven questions:

1. What are the major tasks involved in conducting a survey?


2. What skills and resources are needed to complete each task?
3. How much time does each task take?
4. How much time is available to complete the survey?
5. Who can be hired or assigned to perform each task?
6. What are the costs involved in performing each task?
7. What additional resources are needed?

GUIDELINES FOR ASSESSING REASONABLE RESOURCES

A survey’s resources are reasonable if they adequately cover the financial costs
and time needed for all activities of the survey in the time planned. You need to
consider the following:

1. Other formative research needed (e.g., focus groups, reviewing existing surveys, etc.)
2. Hiring and training of staff
3. Deciding on the target population and sample
4. Designing the survey form
5. Pilot-testing the survey form
6. Administering the survey
7. Entering responses into a database
8. Analyzing and interpreting the data
9. Reporting the findings

INTRODUCTION TO DESIGN OF EXPERIMENTS

One of the methods of data collection is conducting experiments. As mentioned, experimental research is a research method in which the causal relationship between two variables is examined. A variable is defined as an attribute of an object of study. Choosing which variables to measure is central to good experimental design. You need to know which types of variables you are working with in order to choose appropriate statistical tests and interpret the results of your study. As mentioned earlier, we have two types of primary data: quantitative and qualitative. Quantitative data represents amounts, while qualitative or categorical data represents groupings. A variable that contains quantitative data is a quantitative variable; a variable that contains categorical data is a categorical variable. Each of these types of variable can be broken down into further types.

QUANTITATIVE VARIABLES

When you collect quantitative data, the numbers you record represent real amounts that
can be added, subtracted, divided, etc. There are two types of quantitative variables:
discrete and continuous.

Discrete vs Continuous Variables

Discrete variables (aka integer variables)
  What the data represent: counts of individual items or values.
  Examples: number of students in a class; number of different tree species in a forest.

Continuous variables (aka ratio variables)
  What the data represent: measurements of continuous or non-finite values.
  Examples: distance, volume, age.

CATEGORICAL VARIABLES

Categorical variables represent groupings of some kind. They are sometimes recorded
as numbers, but the numbers represent categories rather than actual amounts of things.
There are three types of categorical variables: binary, nominal, and ordinal variables.

Binary vs Nominal vs Ordinal Variables

Binary variables (aka dichotomous variables)
  What the data represent: yes/no outcomes.
  Examples: heads/tails in a coin flip; win/lose in a football game.

Nominal variables
  What the data represent: groups with no rank or order between them.
  Examples: species names, colors, brands.

Ordinal variables
  What the data represent: groups that are ranked in a specific order.
  Examples: finishing place in a race; rating scale responses in a survey.

Statistical studies usually include one or more independent variables and one dependent
variable. The independent variable in an experimental study is the one that is being
manipulated by the researcher. The independent variable is also called the explanatory
variable. The resultant variable is called the dependent variable or the outcome variable.
You will probably also have variables that you hold constant (control variables) in order
to focus on your experimental treatment.

Independent vs Dependent vs Control Variables

Independent variables (aka treatment variables)
  Definition: variables you manipulate in order to affect the outcome of an experiment.
  Example (salt tolerance experiment): the amount of salt added to each plant's water.

Dependent variables (aka response variables)
  Definition: variables that represent the outcome of the experiment.
  Example: any measurement of plant health and growth; in this case, plant height and wilting.

Control variables
  Definition: variables that are held constant throughout the experiment.
  Example: the temperature and light in the room the plants are kept in, and the volume of water given to each plant.

OTHER COMMON TYPES OF VARIABLES

Once you have defined your independent and dependent variables and determined
whether they are categorical or quantitative, you will be able to choose the correct
statistical test.
But there are many other ways of describing variables that help with interpreting your
results. Some useful types of variable are listed below.

The examples below also refer to the salt tolerance experiment.

Confounding variables
  Definition: a variable that hides the true effect of another variable in your experiment. This can happen when another variable is closely related to a variable you are interested in, but you haven't controlled it in your experiment.
  Example: pot size and soil type might affect plant survival as much as or more than salt additions. In an experiment you would control these potential confounders by holding them constant.

Latent variables
  Definition: a variable that can't be directly measured, but that you represent via a proxy.
  Example: salt tolerance in plants cannot be measured directly, but can be inferred from measurements of plant health in our salt-addition experiment.

Composite variables
  Definition: a variable that is made by combining multiple variables in an experiment. These variables are created when you analyze data, not when you measure it.
  Example: the three plant health variables could be combined into a single plant-health score to make it easier to present your findings.

INTRODUCTION TO DESIGN OF EXPERIMENTS

Design of experiments (DOE) is a systematic method to determine the relationship between factors affecting a process and the output of that process. In other words, it is used to find cause-and-effect relationships. This information is needed to manage process inputs in order to optimize the output.

An understanding of DOE first requires knowledge of some statistical tools and experimentation concepts. Although a DOE can be analyzed in many software programs, it is important for practitioners to understand basic DOE concepts for proper application.

COMMON DOE TERMS AND CONCEPTS

The most commonly used terms in the DOE methodology include: controllable and
uncontrollable input factors, responses, hypothesis testing, blocking, replication and
interaction.

Controllable input factors, or x factors, are those input parameters that can be modified
in an experiment or process. For example, in cooking rice, these factors include the
quantity and quality of the rice and the quantity of water used for boiling.

Uncontrollable input factors are those parameters that cannot be changed. In the rice-cooking example, this may be the temperature in the kitchen. These factors need to be recognized to understand how they may affect the response.

Responses, or output measures, are the elements of the process outcome that gauge the desired effect. In the cooking example, the taste and texture of the rice are the responses.

The controllable input factors can be modified to optimize the output; DOE characterizes the relationship between these factors and the responses.

Hypothesis testing helps determine the significant factors using statistical methods. There are two possibilities in a hypothesis statement: the null and the alternative. The null hypothesis represents the status quo; the alternative hypothesis holds if the status quo is shown not to be valid. Testing is done at a level of significance, which is based on a probability.
Blocking and replication: Blocking is an experimental technique to avoid any unwanted
variations in the input or experimental process. For example, an experiment may be
conducted with the same equipment to avoid any equipment variations. Practitioners also
replicate experiments, performing the same combination run more than once, in order to
get an estimate for the amount of random error that could be part of the process.

Interaction: When an experiment has three or more variables, an interaction is a situation in which the simultaneous influence of two variables on a third is not additive.

A well-performed experiment may provide answers to questions such as:

1. What are the key factors in a process?
2. At what settings would the process deliver acceptable performance?
3. What are the key, main, and interaction effects in the process?
4. What settings would bring about less variation in the output?

An iterative approach to gaining knowledge is encouraged, typically involving these consecutive steps:

1. A screening design that narrows the field of variables under assessment.
2. A "full factorial" design that studies the response of every combination of factors and factor levels, and an attempt to home in on a region of values where the process is close to optimization (see the sketch after this list).
3. A response surface design to model the response.
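To make the "full factorial" idea in step 2 concrete, here is a minimal Python sketch (our own illustration, not part of the original text; the factor names and levels are hypothetical) that enumerates every run of such a design for the rice-cooking example:

# Enumerate every combination of factors and factor levels
# of a full factorial design using itertools.product.
from itertools import product

factors = {
    "rice_quantity_g": [150, 200],             # hypothetical levels
    "water_quantity_ml": [300, 400, 500],
    "rice_quality": ["standard", "premium"],
}

# 2 x 3 x 2 = 12 experimental runs, one per combination
for run in product(*factors.values()):
    print(dict(zip(factors.keys(), run)))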

WHEN TO USE DESIGN OF EXPERIMENT

Use DOE when more than one input factor is suspected of influencing an output. For
example, it may be desirable to understand the effect of temperature and pressure on
the strength of a glue bond.

DOE can also be used to confirm suspected input/output relationships and to develop a
predictive equation suitable for performing what-if analysis.

STEPS FOR PLANNING, CONDUCTING AND ANALYZING A RESEARCH PROJECT OR AN EXPERIMENT

When you are involved in conducting a research project, you generally go through the
steps described below, either formally or informally. Some of these are directly involved
in designing the experiment to test the hypotheses required by the project. The following
steps are generally used in conducting a research project.

1. Review pertinent literature to learn what has been done in the field and to become
familiar enough with the field to allow you to discuss it with others. The best ideas
often cross disciplines and species, so a broad approach is important. For example,
recent research in controlling odors in swine waste has exciting implications for fly
and nematode control.

2. Define your objectives and the hypotheses that you are going to test. You can't be
vague. You must be specific. A good hypothesis is:
a. Clear enough to be tested
b. Adequate to explain the phenomenon
c. Good enough to permit further prediction
d. As simple as possible

3. Specify the population on which research is to be conducted. For example, specify whether you are going to determine the P requirements of papaya on the Kauai Branch Station (a Typic Gibbsihumox), or the P requirements of papaya throughout the State, or the P requirements of papaya in sand or solution culture. The types of experiments required to solve these problems vary greatly in scope and complexity and also in resource requirements.

4. Evaluate the feasibility of testing the hypothesis. One should be relatively certain that
an experiment can be set up to adequately test the hypotheses with the available
resources. Therefore, a list should be made of the costs, materials, personnel,
equipment, etc., to be sure that adequate resources are available to carry out the
research. If not, modifications will have to be made to design the research to fit the
available resources.

5. Select Research Procedure:

a. Selection of treatment design is very crucial and can make the difference between success or failure in achieving the objectives. You should seek the help of a statistical resource person (statistician) or of others more experienced in the field. Statistical help should be sought when planning an experiment rather than afterward, when a statistician is expected to extract meaningful conclusions from a poorly designed experiment. An example of a poor selection of treatments is the experiment which demonstrated that each of three treatments, Scotch and water, gin and water, and Bourbon and water, taken orally in sufficient quantities, produces some degree of intoxication. Will this experiment provide information on which ingredient or mixture causes intoxication? Why? How can this experiment be improved? An example related to agriculture is an experiment with 2 treatments, Ammonium Sulfate and Calcium Nitrate, selected to determine whether or not maize responds to N fertilizer on a Typic Paleudult soil. Will this experiment provide the desired information? What is lacking? What sources of confusion are included in the treatments?

b. Selection of the sampling or experimental design and number of replicates. This is the major topic of this course, so it will not be discussed further other than to comment that, in general, you should choose the simplest design that will provide the precision you require.

c. Selection of measurements to be taken. With the computer it is now possible to analyze large quantities of data, and thus the researcher can gain considerably more information about the crop, etc., than just the effects of the imposed variables on yield. For example, with corn, are you going to measure just the yield of grain, or of ears, or of grain plus stover? What about days to tasseling and silking? Height of ears, kernel depth, kernel weight, etc.? What about nutrient levels at tasseling, or weather conditions, especially if there are similar experiments at other locations having different climates? With animal experiments, you can measure just the increase in weight, or also total food intake, components of blood, food digestibility, etc.

d. Selection of the unit of observation, i.e., the individual plant, one row, or a whole plot,
etc? One animal or a group of animals?

e. Control of border effects or effects of adjacent units on each other or "competition". Proper use of border rows or plants and randomization of treatments to the experimental units helps minimize border effects. Proper randomization of treatments to the experimental unit is also required by statistical theory, so be sure this is properly done.

f. Probable results: Make an outline of pertinent summary tables and probable results.
Using information gained in the literature review write out the results you expect.
Essentially perform the experiment in theory and predict the results expected.

g. Make an outline of statistical analyses to be performed. Before you plant the first pot or plot or feed the first animal, you should have set up an outline of the statistical analysis of your experiment to determine whether or not you are able to test the factors you wish with the precision you desire. One of the best ways to do this is to write out the analysis of variance table (source of variation and df) and determine the appropriate error terms for testing the effects of interest. A cardinal rule is to be sure you can analyze the experiment yourself and will not require a statistician to do it for you--he might not be there when you need him. Another danger in this age of the computer and statistical programs is to believe that you can just run the data through the statistical program and the data will be analyzed for you. While this is true to a certain extent, you must remember that the computer is a perfect idiot and does only what you tell it to do. Therefore, if you do not know what to tell the computer to do and/or if you don't know what the computer is doing, you may end up with a lot of useless output: garbage!! Also, there is the little matter of interpreting all the computer output that you can get in a very short time. This is your responsibility, and you had better know what it is all about.

6. Selection of suitable measuring instruments and control of bias in data collection: Measuring instruments should be sufficiently accurate for the precision required. You don't want a gram balance (scale) to weigh watermelons or sugarcane. The experimental procedure should be free of personal bias, i.e., if treatment effects must be graded (subjective evaluation), such as in herbicide or disease control experiments, the treatments should be randomized and the grader should not know what treatment he is grading until after he has graded it. Have two people do the data collection, one to grade and the other to record.

7. Install experiment: Care should be taken in measuring treatment materials (fertilizers, herbicides, or other chemicals, food rations, etc.) and in the application of treatments to the experimental units. Errors here can have disastrous effects on the experimental results. In field experiments, you should personally check the bags of fertilizer or seed of varieties which should be placed on each plot, to be certain that the correct fertilizer or variety will be applied to the correct plot before any fertilizer is applied or any seed planted. Once fertilizer is applied to a plot, it generally cannot be removed easily. With laboratory experiments or preparation of various rations for feeding trials, check calculations and reagents or ingredients, etc., and set up a system of formulating the treatments to minimize the possibility of errors.

8. Collect Data: Careful measurements should be made with the appropriate instruments. It is better to collect too much data than not enough. Data should also be recorded properly in a permanent notebook. In many studies data collection can be quite rapid, and before you know it you have data scattered in 6 notebooks, 3 folders, and 2 packs of paper towels!! When it is time to analyze the data, it is a formidable task, especially if someone has used the paper towels to dry their hands. Thus, a little thought early in the experiment will save a lot of time and grief later. Avoid recording data on loose sheets at all costs, as this is one good way to prolong your stay here by having to repeat experiments because the data were lost. Avoid fatigue in collecting data, as errors increase as one gets tired. Also avoid recopying data, as this is a major source of errors in experimental work. If data must be recopied, check figures against the originals immediately. It is better to have two people do the checking, one reading the original data and the other reading the copied data. When one person is making measurements and another recording, have the person recording repeat the value being recorded. This will minimize errors.

9. Make a complete analysis of the data: Be sure to have a plan of analysis, e.g., which
analysis and in what order will they be done? Interpret the results in the light of the
experimental conditions and hypotheses tested. Statistics do not prove anything and
there is always the possibility that your conclusions may be wrong. One must consider
the consequences of drawing an incorrect conclusion and modify the interpretation
accordingly. Do not jump to a conclusion just because an effect is significant. This is
especially so if the conclusion doesn't agree with previously established facts. The
experimental data should be checked very carefully if this occurs, as the results must
make sense!

10. Finally, prepare a complete, correct, and readable report of the experiment. This may
be a report to the farmer or researcher or an extension publication. There is no such
thing as a negative result. If the null hypothesis is not rejected, it is positive evidence
that there may be no real difference among the treatments tested.

In summary, you should remember the 3 R's of experimentation:

1. Replicate: This provides a measure of variation (an error term) which is used in
evaluating the effects observed in the experiment. This is the only way that the validity
of your conclusions from the experiment can be measured.

2. Randomize: Statistical theory requires the assignment of treatments to the experimental units in a purely random manner. This prevents bias.

3. Request Help: Ask for help when in doubt about how to design, execute or analyze your experiment. Not everyone is a statistician, but everyone should know the important principles of scientific experimentation. Be on guard against common pitfalls and ask for help when you need it. Do this when planning an experiment, not after it is completed.

Activity:

Find a published study online on any engineering field that is of interest to you, and is
aligned with your undergraduate program. Use GOOGLE SCHOLAR as your search
engine, and limit your search results to studies published within the last 5 years.

Design your own study based on the research of your interest. You may replicate the
study while altering the control variables, such as changing the materials to be used,
adjusting physical dimensions of an experimental setup, varying mixture proportion,
different time and place of sampling, etc. Identify the rationale of your proposed study,
and list down at least two (2) objectives. In each objective, identify the independent and
dependent variables. An example is provided for your reference.

Example:

Proposed Research Title: Modal Analysis of the Bridge of Promise via Ambient Vibration Test (AVT)*
Reference Literature: Feng, M., Fukuda, Y., Mizuta, M., & Ozer, E. (2015). Citizen sensors for SHM: Use of accelerometer data from smartphones. Sensors, 15: 2980-2998
Rationale of the Study: To provide a quantitative assessment of the structural health of the Bridge using the smartphone accelerometers

Objective: Measure the natural frequencies of the Bridge of Promise
 Independent variable: Location of sampling point
 Dependent variable: Measured natural frequency of the bridge

Objective: Determine the maximum angular velocity
 Independent variable: Ambient frequency
 Dependent variable: Angular velocity

*A study conducted by CE students K. Cristobal, R. Catapang, J. Latoza, D. Marocom, C. Montalbo, R. Ramos and K. Reyna for their undergraduate research project

Session 3
Organizing Data

By the end of this session you should be able to:

1. Summarize and present data in different forms;
2. Organize raw data into an array, utilize exploratory data analysis tools and construct the frequency distribution;
3. Define and illustrate histograms, frequency polygons and ogives.

Lecture:

DATA PRESENTATION

The main portion of statistics is the display of summarized data. Data are initially collected from a given source, whether through experiments, surveys, or observation. The presentation of data also needs careful planning. If data are properly and interestingly presented, the benefits will go not only to the readers or users but more so to the statisticians who will make the analysis and interpretation of the data gathered. The selection of the method of presentation depends on the type of data, the method of analysis, and the type of information sought from the data. The data gathered are summarized and presented in any of three methods: textual form, tabular form and graphical form.

TEXTUAL PRESENTATION

Textual presentation of data means presenting data in the form of words, sentences and paragraphs. Its counterpart is the graphical presentation of data. While graphical presentation is the most popular and widely used in research, textual presentation allows the researcher to present qualitative data that cannot be presented in graphical or tabular form.

FACTORS TO CONSIDER IN TEXTUAL PRESENTATION OF DATA

The textual presentation of data is very helpful in presenting contextual data. It helps the
researcher explain and analyze specific points in data. While presenting data in textual
form the researcher should consider the following factors.

1. The researcher should know the target audience who are going to read it, and should use language that is easy to understand and that highlights the main points of the data findings.
2. The author should use wording that does not introduce bias into the research. Avoid the use of biased, slanted, or emotional language.
3. Accuracy should be maintained in presenting data, and the numbers and percentages presented in the text should be reviewed to avoid any mistakes in the presentation.
4. To make it easier for the audience to comprehend the important points in the data, the researcher should avoid unnecessary details. Too much detail can make it difficult for the audience to concentrate on the key points.
5. Do not repeat the same point again and again; it will ruin the purpose of textual presentation. Your data presentation will become monotonous if data findings are repeated.
6. Try to shorten longer phrases wherever possible, and combine two phrases when they can be merged into one.
7. One mistake that researchers often make is to use general descriptive words like "too much", "little", "exactly", "all", "always", "never" and "must". These words should be avoided, as they only add unnecessary detail to your data presentation; numbers and percentages better describe and fulfill the aim of data presentation.
8. Another point to consider is to avoid decorative language and to make sure that you use scholarly language in your data presentation.

ADVANTAGES OF TEXTUAL PRESENTATION

1. Textual presentation of data allows the researcher to interpret the data more elaborately during the presentation. According to In and Lee (2017), text is the principal method for explaining findings, outlining trends, and providing contextual information.
2. It allows the researcher to present qualitative data that cannot be presented in graphical or tabular form.
3. Textual presentation can help in emphasizing some important points in the data. It allows the researcher to explain the data in a contextual manner, and the reader can draw meaning out of it.
4. Small sets of data can be easily presented through textual presentation. For example, simple data like "there are 30 students in the class, 20 of whom are girls while 10 are boys" is easier to understand through textual presentation. No table or graph is required to present such data, as it can be comprehended through text.

DISADVANTAGES OF TEXTUAL PRESENTATION

1. The major disadvantage of the textual presentation of data is that it produces extensive data in the form of words and paragraphs, making it difficult for the reader to draw conclusions at a glance. Data presented in tables or graphs, on the other hand, make it easier for readers to draw conclusions.
2. Textual presentation is not suitable for large sets of data that have too many details. Graphical or tabular forms allow the researcher to display large data sets easily.
3. In textual presentation, one has to read through the whole text to understand and comprehend the main point.

TABULAR PRESENTATION

A table helps represent even a large amount of data in an engaging, easy-to-read and coordinated manner. The data are arranged in rows and columns. This is one of the most widely used forms of data presentation, since data tables are simple to prepare and read.

The most significant benefit of tabulation is that it coordinates data for additional statistical treatment and decision-making. The classification used in tabulation is of 4 types:

1. Qualitative Classification: When the classification is done according to traits, such as physical status, nationality, social status, etc., it is known as qualitative classification.
2. Quantitative Classification: In this, the data is classified on the basis of features which are quantitative in nature. In other words, these features can be estimated quantitatively.
3. Temporal Classification: In this classification, time becomes the categorizing variable and data are classified according to time. Time may be in years, months, weeks, days, hours, etc.
4. Spatial Classification: When the categorization is done on the basis of location, it is called spatial classification. The place may be a country, state, district, block, village/town, etc.

OBJECTIVES OF TABULATION:

 To Simplify the Complex Data
 To Bring Out Essential Features of the Data
 To Facilitate Comparison
 To Facilitate Statistical Analysis
 Saving of Space

THE ADVANTAGES OF TABULAR PRESENTATION

1. Ease of representation: A large amount of data can be easily confined in a data table. Evidently, it is the simplest form of data presentation.
2. Ease of analysis: Data tables are frequently used for statistical analysis like
calculation of central tendency, dispersion etc.
3. Helps in comparison: In a data table, the rows and columns which are required to
be compared can be placed next to each other. To point out, this facilitates
comparison as it becomes easy to compare each value.
4. Economical: Construction of a data table is fairly easy and presents the data in a
manner which is really easy on the eyes of a reader. Moreover, it saves time as
well as space.

WHAT ARE THE THREE LIMITATIONS OF A TABLE?

1. Lacks Description
a. The table represents only figures and not attributes.
b. It ignores the qualitative aspects of facts.
2. Incapable of Presenting Individual Items
a. It does not present individual items.
b. It presents aggregate data.
3. Needs Special Knowledge
a. The understanding of the table requires special knowledge.
b. It cannot be easily used by the layman.

MAIN PARTS OF A TABLE:

Table Number
  The table number is the very first item mentioned on the top of each table, for easy identification and further reference.

Title
  The title of the table is the second item, shown just above the table. It narrates the contents of the table, so it has to be very clear, brief and carefully worded.

Headnote
  The headnote is the third item, shown just after the title and above the table. It gives information about the unit of data, like "Amount in Rupees or $" or "Quantity in Tonnes". It is generally given in brackets.

Captions or Column Headings
  At the top of each column in a table, a column designation/head is given to explain the figures of the column. This column heading is called a "Caption".

Stubs or Row Headings
  The titles of the horizontal rows are called "Stubs".

Body of the Table
  It contains the numeric information and reveals the whole story of the investigated facts. Columns are read vertically from top to bottom and rows are read horizontally from left to right.

Source Note
  It is a brief statement or phrase indicating the source of the data presented in the table.

Footnote
  It explains a specific feature of the table which is not self-explanatory and has not been explained earlier, for example, points of exception if any.

CONSTRUCTION OF DATA TABLES

There are many ways to construct a good table. However, some basic ideas are:

1. The title should be in accordance with the objective of study: The title of a table should provide a quick insight into the table.
2. Comparison: If a need might arise to compare any two rows or columns, then these should be kept close to each other.
3. Alternative location of stubs: If the rows in a data table are lengthy, then the stubs
can be placed on the right-hand side of the table.
4. Headings: Headings should be written in a singular form. For example, ‘good’ must
be used instead of ‘goods’.
5. Footnote: A footnote should be given only if needed.
6. Size of columns: Size of columns must be uniform and symmetrical.
7. Use of abbreviations: Headings and sub-headings should be free of abbreviations.
8. Units: There should be a clear specification of units above the columns.

GRAPHICAL PRESENTATION

Graphical representation is a way of analyzing numerical data. It exhibits the relation between data, ideas, information and concepts in a diagram. It is easy to understand and is one of the most important learning strategies. The choice of graph always depends on the type of information in a particular domain. Some of the different types of graphical representation are as follows:

 Line Graphs – Line graphs are used to display continuous data and are useful for predicting future events over time.
 Bar Graphs – A bar graph is used to display categories of data, and it compares the data using solid bars to represent the quantities.
 Histograms – A graph that uses bars to represent the frequency of numerical data that are organized into intervals. Since all the intervals are equal and continuous, all the bars have the same width.

 Line Plot – It shows the frequency of data on a given number line. An 'x' is placed above the number line each time that data value occurs.
 Frequency Table – A table that shows the number of pieces of data that fall within a given interval.
 Circle Graph – Also known as a pie chart, it shows the relationships of the parts to the whole. The whole circle represents 100%, and each category occupies a sector labeled with its specific percentage, like 15%, 56%, etc.
 Stem and Leaf Plot – In a stem and leaf plot, the data are organized from least value to greatest value. The digits of the least place value form the leaves, and the next place value digits form the stems.
 Box and Whisker Plot – This plot summarizes the data by dividing it into four parts. A box and whisker plot shows the range (spread) and the middle (median) of the data.

GENERAL RULES FOR GRAPHICAL REPRESENTATION OF DATA

There are certain rules to effectively present the data and information in the graphical
representation. They are:

 Suitable Title: Make sure that an appropriate title is given to the graph, indicating the subject of the presentation.
 Measurement Unit: Mention the measurement unit in the graph.
 Proper Scale: To represent the data accurately, choose a proper scale.
 Index: Index the appropriate colors, shades, lines and designs in the graph for better understanding.
 Data Sources: Include the source of information wherever necessary, at the bottom of the graph.
 Keep it Simple: Construct the graph in a way that everyone can understand.
 Neat: Choose the correct size, lettering, colors, etc. in such a way that the graph serves as a visual aid for the presentation of information.

ADVANTAGES AND DISADVANTAGES OF GRAPHICAL PRESENTATION

Graphs and charts provide major benefits. First, they can quickly provide information related to trends and comparisons by allowing a global view of the data. They also allow members of the audience who may be less versed in numerical analysis to follow the information and understand the presentation more fully. Second, graphs and charts provide a visual version of data, which can be helpful for visual learners.

However, these benefits are balanced by disadvantages. The major disadvantage of using charts and graphs is that these aids may oversimplify data, which can give a misleading view of the data. Attempting to correct this can make charts overly complex, which reduces their value in aiding a presentation. Finally, it is important to use the correct chart and/or graph when presenting information, though the right choice can be difficult to identify for ambiguous data.

STEM AND LEAF DISPLAY

A stem and leaf display is an exploratory data analysis tool. It is a technique used to
classify either discrete or continuous variables. It is a graphical method of displaying data.
It is particularly useful when your data are not too numerous.

Stem and Leaf Diagrams allow you to display raw data visually. Each raw score is divided
into a stem and a leaf. The leaf is typically the last digit of the raw value. The stem is the
remaining digits of the raw value. To generate a stem and leaf diagram you must first
create a vertical column that contains all of the stems. Then list each leaf next to the
corresponding stem. In these diagrams, all of the scores are represented in the diagram
without the loss of any information. The graph below is an example of a Stem and Leaf
Diagram:

Number of touchdown passes:

37, 33, 33, 32, 29, 28, 28, 23, 22,
22, 22, 21, 21, 21, 20, 20, 19, 19,
18, 18, 18, 18, 16, 15, 14, 14, 14,
12, 12, 9, 6

Stem and leaf display of the number of touchdown passes:

3 | 2337
2 | 001112223889
1 | 2244456888899
0 | 69

From this, we can now easily arrange our data from lowest to highest value.
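A short Python sketch (our own illustration; the function and variable names are not from this text) shows how such a display can be generated from the raw scores:

# Build a stem-and-leaf display: the stem is everything but the
# last digit, the leaf is the last digit, with leaves listed in order.
from collections import defaultdict

def stem_and_leaf(values):
    groups = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)        # e.g. 37 -> stem 3, leaf 7
        groups[stem].append(leaf)
    for stem in sorted(groups, reverse=True):
        print(stem, "|", "".join(str(leaf) for leaf in groups[stem]))

touchdowns = [37, 33, 33, 32, 29, 28, 28, 23, 22, 22, 22, 21, 21, 21,
              20, 20, 19, 19, 18, 18, 18, 18, 16, 15, 14, 14, 14, 12, 12, 9, 6]
stem_and_leaf(touchdowns)                 # reproduces the display above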

HISTOGRAM

A frequency distribution shows how often each different value in a set of data occurs. A
histogram is the most commonly used graph to show frequency distributions. It looks very
much like a bar chart, but there are important differences between them; the bars always
touch and the width of the bar represents a quantitative value, such as age. This helpful
data collection and analysis tool is considered one of the seven basic quality tools.

A histogram consists of a number of bars placed side by side. The y-axis uses frequency as its label, so histograms are generally referred to as frequency histograms. The width of each bar on the x-axis indicates the interval size, and the height of each bar indicates the frequency of the interval. To create a frequency histogram you must first determine the frequency of each interval of the axes, then draw the bars of the histogram. The bars are drawn from the lower limits to the upper limits along the x-axis. Since the intervals are connected, the bars of the histogram should also be connected. The worked examples later in this session show completed frequency histograms.

WHEN TO USE A HISTOGRAM

Use a histogram when:

1. The data are numerical
2. You want to see the shape of the data's distribution, especially when determining whether the output of a process is distributed approximately normally
3. Analyzing whether a process can meet the customer's requirements
4. Analyzing what the output from a supplier's process looks like
5. Seeing whether a process change has occurred from one time period to another
6. Determining whether the outputs of two or more processes are different
7. You wish to communicate the distribution of data quickly and easily to others

FREQUENCY POLYGONS

Frequency polygons are a graphical device for understanding the shapes of distributions.
They serve the same purpose as histograms, but are especially helpful for comparing
sets of data. Frequency polygons are also a good choice for displaying cumulative
frequency distributions.

FREQUENCY POLYGON

The steps to generating a frequency polygon are listed below:

1. draw and label the axes


2. add two extra intervals: one below the lowest interval and one above the highest
interval
3. determine the midpoint for each interval
4. plot the frequency for each of the midpoints on the graph
5. connect the dots with a straight line

RELATIVE FREQUENCY POLYGON

Relative frequency is used to compare two distributions that have different numbers of subjects. Relative frequency can be graphed as a relative frequency polygon, which is created in the same manner as the frequency polygon; the only difference is that you use relative frequency values instead of frequency values.

CUMULATIVE FREQUENCY POLYGONS (OGIVES)

Cumulative polygons show the number of data values in a distribution that fall below a particular score. Cumulative polygons are created in the same manner as the frequency polygon; the only difference is that you use cumulative values, and the upper real limits of the intervals are used instead of the midpoints. The s-shaped curve of such a graph is called an "ogive", pronounced "oh-jive."

The Less than and Greater than Ogives for the Entrance Examination Scores of 60 Students

HOW TO DEVELOP A FREQUENCY DISTRIBUTION, HISTOGRAM, FREQUENCY
POLYGON, RELATIVE FREQUENCY POLYGON AND OGIVES

• Arrange data in ascending order. We can use stem and leaf display.
• Determine the class width. If we decide on the number of classes, it should be between 5 and 15 classes only. With fewer than 5 classes we lose too much information, and with more than 15 classes the graph becomes too crowded. We use the formula:

$$CW = \frac{\text{Highest value} - \text{Lowest value}}{\text{Desired number of classes}}$$

If we don't have any desired number of classes, then we can use the formula:

$$CW = \frac{\text{Highest value} - \text{Lowest value}}{1 + 3.322 \log N}$$

• Computed class width (CW) should be adjusted to the next whole number.
• Create the distinct classes (class limits). We use the convention that the lower
class limit of the first class is the smallest data value. Add the class width to this
number to get the lower class limit of the next class.
• Determine the class boundaries. Apply the continuity correction (subtract 0.5 from the lower limit and add 0.5 to the upper limit).
• Compute the midpoint (class mark) for each class. Add the upper limit and lower
limit then divide the sum by 2.
• Tally the data into classes. Each data value should fall into exactly one class. Total
the tallies to obtain each class frequency. This will be the basis for the histogram
and frequency polygon.
• For each class, compute the relative frequency f/n, where f is the class frequency
and n is the total sample size. This will be the basis for the relative frequency
polygon.
• For the cumulative frequency, we have the less than cumulative frequency and the greater than cumulative frequency. For the less than cumulative frequency, frequencies are added from top to bottom; for the greater than cumulative frequency, frequencies are added from bottom to top. This is the basis for the cumulative frequency polygon, or simply the ogives. A code sketch of the whole procedure follows below.
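The whole procedure can be sketched in a few lines of Python (an illustration with our own function names, not part of the original text; it assumes integer-valued data). It reproduces the class widths and columns computed in the examples that follow:

# Build a frequency distribution following the steps above.
import math

def class_width(data, k=None):
    spread = max(data) - min(data)
    if k is None:                              # no desired number of classes
        k = 1 + 3.322 * math.log10(len(data))
    # "Adjusted to the next whole number": 6.66 -> 7, and 8 -> 9
    return math.floor(spread / k) + 1

def frequency_distribution(data, k=None):
    cw = class_width(data, k)
    lower, rows, cum = min(data), [], 0
    while lower <= max(data):
        upper = lower + cw - 1
        f = sum(lower <= x <= upper for x in data)     # tally the class
        cum += f
        rows.append({
            "limits": (lower, upper),
            "boundaries": (lower - 0.5, upper + 0.5),  # continuity correction
            "midpoint": (lower + upper) / 2,
            "freq": f,
            "cum_freq": cum,                           # the f< column
            "rel_freq": f / len(data),                 # the RF column
        })
        lower = upper + 1
    return rows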

Example 1: Given below are the one-way commuting distances in miles for 60 workers in
Downtown Dallas.
13 47 10 3 16 20 17 40 4 2
7 25 8 21 19 15 3 17 14 6
12 45 1 8 4 16 11 18 23 12
6 2 14 13 7 15 46 12 9 18
34 13 41 28 36 17 24 27 29 9
14 26 10 24 37 31 8 16 12 16

(a) Construct the stem and leaf plot for the data above.
(b) Determine the class width, and setup a frequency distribution.
(c) Construct the frequency histogram, relative frequency polygon and the cumulative
frequency ogives.

Stem Leaf

0 1 2 2 3 3 4 4 6 6 7 7 8 8 8 9 9
1 0 0 1 2 2 2 2 3 3 3 4 4 4 5 5 6 6 6 6 7 7 7 8 8 9
2 0 1 3 4 4 5 6 7 8 9
3 1 4 6 7
4 0 1 5 6 7

Arranged data:
1 7 12 15 19 29
2 8 12 16 20 31
2 8 12 16 21 34
3 8 13 16 23 36
3 9 13 16 24 37
4 9 13 17 24 40
4 10 14 17 25 41
6 10 14 17 26 45
6 11 14 18 27 46
7 12 15 18 28 47

Determination of class width:

$$CW = \frac{\text{Highest value} - \text{Lowest value}}{1 + 3.322 \log N} = \frac{47 - 1}{1 + 3.322 \log 60} = 6.66 \approx 7$$

Frequency Distribution

Class Limits Class Boundaries Midpoint freq f< f> RF

Lower Upper Lower Upper

1 7 0.5 7.5 4 11 11 60 0.18

8 14 7.5 14.5 11 18 29 49 0.30

15 21 14.5 21.5 18 14 43 31 0.23

22 28 21.5 28.5 25 7 50 17 0.12

29 35 28.5 35.5 32 3 53 10 0.05

36 42 35.5 42.5 39 4 57 7 0.07

43 49 42.5 49.5 46 3 60 3 0.05

N = 60

Histogram and Frequency Polygon

Relative Frequency Histogram and Polygon

Cumulative Frequency Polygon

Example 2: Many airline passengers seem weighted down with their carry-on luggage.
Just how much weight are they carrying? The carry-on luggage weights for a random
sample of 40 passengers returning from a vacation to Hawaii were recorded in pounds.
Use 6 classes.

Weights of Carry-On Luggage in Pounds
30 27 12 42 35 47 38 36 27 35
22 17 29 3 21 10 38 32 41 33
26 45 18 43 18 32 31 32 19 21
33 31 28 29 51 12 32 18 21 26

Stem Leaf
0 3
1 0 2 2 7 8 8 8 9
2 1 1 1 2 6 6 7 7 8 9 9
3 0 1 1 2 2 2 2 3 3 5 5 6 8 8
4 1 2 3 5 7
5 1

Arranged data

3 17 19 22 27 30 32 33 38 43
10 18 21 26 28 31 32 35 38 45
12 18 21 26 29 31 32 35 41 47
12 18 21 27 29 32 33 36 42 51

Determination of class width:

$$CW = \frac{\text{Highest value} - \text{Lowest value}}{\text{Desired number of classes}} = \frac{51 - 3}{6} = 8$$

Adjusting to the next whole number gives CW = 9, which ensures the six classes cover the largest value, 51.

Frequency Distribution

Class Limits Class Boundaries Midpoint freq f< f> RF

Lower Upper Lower Upper

3 11 2.5 11.5 7 2 2 40 0.05

12 20 11.5 20.5 16 7 9 38 0.18

21 29 20.5 29.5 25 11 20 31 0.28

30 38 29.5 38.5 34 14 34 20 0.35

39 47 38.5 47.5 43 5 39 6 0.13

48 56 47.5 56.5 52 1 40 1 0.03

N = 40

Histogram and Frequency Polygon

Relative Frequency Histogram and Polygon

Cumulative Frequency Polygon

Activity:
1) The following scores represent the final examination grade for an elementary statistics
course:
23 60 79 32 57 74 52 70 82 36
80 77 81 95 41 65 92 85 55 76
52 10 64 75 78 25 80 98 81 67
41 71 83 54 54 72 88 62 74 43
60 78 89 76 76 48 84 90 15 79
34 67 17 82 82 74 63 80 85 61

a) Construct a stem and leaf plot for the examination grades in which the stems are 1, 2, 3, ..., 9.
b) Set up a relative frequency distribution.
c) Construct the frequency histogram, relative frequency polygon and the cumulative
frequency ogives.

2) The following data represent the length of life in years, measured to the nearest tenth,
of 30 similar fuel pumps:
2.0 3.0 0.3 3.3 1.3 0.4
0.2 6.0 5.5 6.5 0.2 2.3
1.5 4.0 5.9 1.8 4.7 0.7
4.5 0.3 1.5 0.5 2.5 5.0
1.0 6.0 5.6 6.0 1.2 0.2

a) Construct a stem and leaf plot for the life in years of the fuel pumps, using the digit to the left of the decimal point as the stem for each observation.
b) Set up a relative frequency distribution.
c) Construct the frequency histogram, relative frequency polygon and the cumulative
frequency ogives.

3) The following data represent the length of life in seconds of 50 fruit flies to a new spray
in a controlled laboratory experiment.
17 20 10 9 23 13 12 19 18 24
12 14 6 9 13 6 7 10 13 7
16 18 8 13 3 32 9 7 10 11
13 7 18 7 10 4 27 19 16 8
7 10 5 14 15 10 9 6 7 15

a) Construct a double-stem and leaf plot for the life span of the fruit flies using the stems
b) Set up a relative frequency distribution.
c) Construct the frequency histogram, relative frequency polygon and the cumulative frequency ogives.

Session 4
Measures of Central Tendency and Dispersion

By the end of this session you should be able to:

1. Define, illustrate, and distinguish the different measures of central tendency and
dispersion for grouped and ungrouped data;
2. Interpret the computations for grouped and ungrouped data

Lecture:

MEASURES OF CENTRAL TENDENCY

INTRODUCTION

A measure of central tendency is a single value that attempts to describe a set of data by
identifying the central position within that set of data. As such, measures of central
tendency are sometimes called measures of central location. They are also classed as
summary statistics. The mean (often called the average) is most likely the measure of
central tendency that you are most familiar with, but there are others, such as the median
and the mode.

The mean, median and mode are all valid measures of central tendency, but under
different conditions, some measures of central tendency become more appropriate to use
than others. In the following sections, we will look at the mean, mode and median, and
learn how to calculate them and under what conditions they are most appropriate to be
used.

MEAN

The mean (or average) is the most popular and well known measure of central tendency.
It can be used with both discrete and continuous data, although its use is most often with
continuous data. The mean is equal to the sum of all the values in the data set divided by
the number of values in the data set. So, if we have n values in a data set and they have
values x1, x2, x3…xn, the sample mean, usually denoted by 𝑥̅ , (pronounced "x bar"), is:

$$\bar{x} = \frac{x_1 + x_2 + x_3 + \ldots + x_n}{n}$$

This formula is usually written in a slightly different manner using the Greek capital letter, $\Sigma$, pronounced "sigma", which means "sum of...":

$$\bar{x} = \frac{\sum x}{n}$$

You may have noticed that the above formula refers to the sample mean. So, why have
we called it a sample mean? This is because, in statistics, samples and populations have
very different meanings and these differences are very important, even if, in the case of
the mean, they are calculated in the same way. To acknowledge that we are calculating
the population mean and not the sample mean, we use the Greek lower case letter "mu",
denoted as µ:

$$\mu = \frac{\sum x}{N}$$

The mean is essentially a model of your data set. It is the value that is most common.
You will notice, however, that the mean is not often one of the actual values that you have
observed in your data set. However, one of its important properties is that it minimizes
error in the prediction of any one value in your data set. That is, it is the value that
produces the lowest amount of error from all other values in the data set.

An important property of the mean is that it includes every value in your data set as part
of the calculation. In addition, the mean is the only measure of central tendency where
the sum of the deviations of each value from the mean is always zero.

WHEN NOT TO USE THE MEAN

The mean has one main disadvantage: it is particularly susceptible to the influence of
outliers. These are values that are unusual compared to the rest of the data set by being
especially small or large in numerical value. For example, consider the wages of staff at
a factory below:

Staff 1 2 3 4 5 6 7 8 9 10
Salary 15k 18k 16k 14k 15k 15k 12k 17k 90k 95k

The mean salary for these ten staff is $30.7k. However, inspecting the raw data suggests
that this mean value might not be the best way to accurately reflect the typical salary of a
worker, as most workers have salaries in the $12k to 18k range. The mean is being
skewed by the two large salaries. Therefore, in this situation, we would like to have a
better measure of central tendency. We can apply a resistant measure called trimmed
mean.

A trimmed mean (similar to an adjusted mean) is a method of averaging that removes a small designated percentage of the largest and smallest values before calculating the mean. After removing the specified outlier observations, the trimmed mean is found using a standard arithmetic averaging formula. The use of a trimmed mean helps eliminate the influence of outliers or data points on the tails that may unfairly affect the traditional mean.

Example:

Many airline passengers seem weighted down with their carry-on luggage. Just how much
weight are they carrying? The carry-on luggage weights for a random sample of 40
passengers returning from a vacation to Hawaii were recorded in pounds.

Weights of Carry-On Luggage in Pounds


30 27 12 42 35 47 38 36 27 35
22 17 29 3 21 10 38 32 41 33
26 45 18 43 18 32 31 32 19 21
33 31 28 29 51 12 32 18 21 26

$$\bar{x} = \frac{\sum x}{n} = \frac{1141}{40} = 28.53$$

For the 5% trimmed mean:

1. Arrange the data in order.
2. Compute 5% of n = 40, which is 2.
3. Remove the 2 smallest and the 2 largest values from the data.
4. Compute the mean of the remaining values.

3 17 19 22 27 30 32 33 38 43
10 18 21 26 28 31 32 35 38 45
12 18 21 26 29 31 32 35 41 47
12 18 21 27 29 32 33 36 42 51

$$\bar{x} = \frac{\sum x}{n} = \frac{1030}{36} = 28.61$$
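The same computation can be checked in Python with SciPy's trim_mean, which cuts the stated proportion from each end of the ordered data (a sketch using the luggage data above; the variable name is our own):

# Ordinary mean vs. 5% trimmed mean for the carry-on luggage weights.
from scipy import stats

weights = [30, 27, 12, 42, 35, 47, 38, 36, 27, 35,
           22, 17, 29, 3, 21, 10, 38, 32, 41, 33,
           26, 45, 18, 43, 18, 32, 31, 32, 19, 21,
           33, 31, 28, 29, 51, 12, 32, 18, 21, 26]

print(sum(weights) / len(weights))      # 28.525      (ordinary mean)
print(stats.trim_mean(weights, 0.05))   # 28.611...   (2 values cut from each end)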

MEDIAN

The median is the middle score for a set of data that has been arranged in order of
magnitude. The median is less affected by outliers and skewed data. In order to calculate
the median, suppose we have the data below:

65 55 89 56 35 14 56 55 87 45 92

We first need to rearrange that data into order of magnitude (smallest first):

14 35 45 55 55 56 56 65 87 89 92

Our median mark is the middle mark - in this case, 56 (highlighted in bold). It is the middle
mark because there are 5 scores before it and 5 scores after it. This works fine when you
have an odd number of scores, but what happens when you have an even number of
scores? What if you had only 10 scores? Well, you simply have to take the middle two
scores and average the result. So, if we look at the example below:

65 55 89 56 35 14 56 55 87 45

We again rearrange that data into order of magnitude (smallest first):

14 35 45 55 55 56 56 65 87 89

Only now we have to take the 5th and 6th score in our data set and average them to get
a median of 55.5.
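This odd/even rule is exactly what Python's statistics module implements, as a quick check shows (the variable names are our own):

# statistics.median returns the middle value for an odd-sized set
# and the average of the two middle values for an even-sized set.
import statistics

odd_scores  = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92]
even_scores = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45]

print(statistics.median(odd_scores))    # 56
print(statistics.median(even_scores))   # 55.5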

MODE

The mode is the most frequent score in our data set. On a bar chart or histogram it is represented by the highest bar. You can, therefore, sometimes consider the mode as being the most popular option. An example of a mode is presented below:

EXAMPLES OF THE MODE

For example, in the following list of numbers, 16 is the mode since it appears more times
in the set than any other number:

3, 3, 6, 9, 16, 16, 16, 27, 27, 37, 48

A set of numbers can have more than one mode (this is known as bimodal if there are
two modes) if there are multiple numbers that occur with equal frequency, and more times
than the others in the set.

3, 3, 3, 9, 16, 16, 16, 27, 37, 48

In the above example, both the number 3 and the number 16 are modes as they each
occur three times and no other number occurs more often.

If no number in a set of numbers occurs more than once, that set has no mode:

3, 6, 9, 16, 27, 37, 48

A set of numbers with two modes is bimodal, a set of numbers with three modes is trimodal, and a set of numbers with four or more modes is multimodal.

GROUPED DATA TO FIND THE MEAN

A mean can be determined for grouped data, or data that is placed in intervals. Unlike
listed data, the individual values for grouped data are not available, and you are not able
to calculate their sum. To calculate the mean of grouped data, the first step is to determine
the midpoint (also called a class mark) of each interval, or class. These midpoints must
then be multiplied by the frequencies of the corresponding classes. The sum of the
products divided by the total number of values will be the value of the mean.

In other words, the mean for a population can be found by dividing $\sum xf$ by N, where x is the midpoint of the class and f is the frequency. As a result, the formula

$$\mu = \frac{\sum xf}{N}$$

summarizes the steps used to determine the value of the mean for a set of grouped data. If the set of data represented a sample instead of a population, the process would remain the same, and the formula would be written as

$$\bar{x} = \frac{\sum xf}{n}$$

Example:

Class
Midpoint freq
Class Limits Boundaries f< f> RF
(x) (f)
Lower Upper Lower Upper
1 7 0.5 7.5 4 11 11 60 0.18
8 14 7.5 14.5 11 18 29 49 0.30
15 21 14.5 21.5 18 14 43 31 0.23
22 28 21.5 28.5 25 7 50 17 0.12
29 35 28.5 35.5 32 3 53 10 0.05
36 42 35.5 42.5 39 4 57 7 0.07
43 49 42.5 49.5 46 3 60 3 0.05
n = 60

$$\bar{x} = \frac{\sum xf}{n} = \frac{1059}{60} = 17.65$$
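A short Python sketch (our own illustration) reproduces this grouped mean from the midpoint and frequency columns of the table:

# Grouped mean: sum of (midpoint x frequency) divided by total frequency.
midpoints   = [4, 11, 18, 25, 32, 39, 46]
frequencies = [11, 18, 14, 7, 3, 4, 3]

n = sum(frequencies)                                         # 60
mean = sum(x * f for x, f in zip(midpoints, frequencies)) / n
print(round(mean, 2))                                        # 17.65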

FINDING THE MEDIAN OF GROUPED DATA

We use the formula:

$$\tilde{x} = L_o + \frac{\frac{n}{2} - CF_<}{f} \times CW$$

This formula is used to find the median of grouped data with class intervals. The median is the value of the data in the middle position of the set when the data are arranged in numerical order. The class where the middle position is located is called the median class, and this is also the class where the median lies.

where:
Lo is the lower class boundary of the group containing the median
n is the sum of frequencies
CF< is the cumulative frequency before the median class.
f is the frequency of the median group
CW is the class width

Example:

Class
Midpoint freq
Class Limits Boundaries f< f> RF
(x) (f)
Lower Upper Lower Upper
1 7 0.5 7.5 4 11 11 60 0.18
8 14 7.5 14.5 11 18 29 49 0.30
15 21 14.5 21.5 18 14 43 31 0.23 (median class, since n/2 = 60/2 = 30)
22 28 21.5 28.5 25 7 50 17 0.12
29 35 28.5 35.5 32 3 53 10 0.05
36 42 35.5 42.5 39 4 57 7 0.07
43 49 42.5 49.5 46 3 60 3 0.05
n = 60

$$\tilde{x} = L_o + \frac{\frac{n}{2} - CF_<}{f} \times CW = 14.5 + \frac{\frac{60}{2} - 29}{14} \times 7 = 15$$
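The formula translates directly into code; the sketch below (our own function and parameter names) reproduces the computation above:

# Grouped median: lo = lower boundary of the median class,
# cf_below = cumulative frequency before it, f = its frequency.
def grouped_median(lo, n, cf_below, f, cw):
    return lo + (n / 2 - cf_below) / f * cw

print(grouped_median(lo=14.5, n=60, cf_below=29, f=14, cw=7))   # 15.0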

FINDING THE MODE OF GROUPED DATA

Again, looking at our data:

Class
Midpoint freq
Class Limits Boundaries f< f> RF
(x) (f)
Lower Upper Lower Upper
1 7 0.5 7.5 4 11 11 60 0.18
8 14 7.5 14.5 11 18 29 49 0.30
15 21 14.5 21.5 18 14 43 31 0.23
22 28 21.5 28.5 25 7 50 17 0.12
29 35 28.5 35.5 32 3 53 10 0.05
36 42 35.5 42.5 39 4 57 7 0.07
43 49 42.5 49.5 46 3 60 3 0.05
n = 60

We can estimate the mode using the following formula:

$$\hat{x} = L_{mo} + \left(\frac{f_{mo} - f_1}{2f_{mo} - f_1 - f_2}\right) \times CW$$

where:

Lmo is the lower class boundary of the modal group
f1 is the frequency of the group before the modal group
fmo is the frequency of the modal group
f2 is the frequency of the group after the modal group
CW is the group width

$$\hat{x} = 7.5 + \left(\frac{18 - 11}{2(18) - 11 - 14}\right) \times 7 = 11.95$$
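The same translation works for the mode formula (again a sketch with our own names):

# Grouped mode: f_mo = frequency of the modal class, f1 and f2 the
# frequencies of the classes just before and just after it.
def grouped_mode(lo, f_mo, f1, f2, cw):
    return lo + (f_mo - f1) / (2 * f_mo - f1 - f2) * cw

print(round(grouped_mode(lo=7.5, f_mo=18, f1=11, f2=14, cw=7), 2))  # 11.95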

DISPERSION AND MEASURES OF DISPERSION

Dispersion is the state of being dispersed or spread out. Statistical dispersion means the extent to which numerical data are likely to vary about an average value. In other words, dispersion helps us understand the distribution of the data.

MEASURES OF DISPERSION

In statistics, the measures of dispersion help to interpret the variability of data, i.e., to know how homogeneous or heterogeneous the data are. In simple terms, they show how squeezed or scattered the variable is.

TYPES OF MEASURES OF DISPERSION

There are two main types of dispersion methods in statistics:

 Absolute Measures of Dispersion
 Relative Measures of Dispersion

ABSOLUTE MEASURE OF DISPERSION

An absolute measure of dispersion is expressed in the same unit as the original data set. Absolute dispersion methods express the variations in terms of the average deviation of the observations, like the standard or mean deviations. They include the range, standard deviation, quartile deviation, etc.

The types of absolute measures of dispersion are:

1. Range: It is simply the difference between the maximum value and the minimum
value given in a data set.

For ungrouped data:

Range = Highest Value − Lowest Value

For grouped data:

Range = Highest Class Boundary − Lowest Class Boundary

2. Variance: Deduct the mean from each value in the set, square each of the differences, add the squares, and divide the sum by the total number of values in the data set.

For ungrouped data:

Population variance:
$$\sigma^2 = \frac{\sum (x - \mu)^2}{N}$$

Sample variance:
$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$

For grouped data:

Population variance:
$$\sigma^2 = \frac{\sum (x - \mu)^2 f}{N}$$

Sample variance:
$$s^2 = \frac{\sum (x - \bar{x})^2 f}{n - 1}$$

3. Standard Deviation: The square root of the variance is known as the standard deviation.

For ungrouped data:

Population standard deviation:
$$\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$$

Sample standard deviation:
$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$

For grouped data:

Population standard deviation:
$$\sigma = \sqrt{\frac{\sum (x - \mu)^2 f}{N}}$$

Sample standard deviation:
$$s = \sqrt{\frac{\sum (x - \bar{x})^2 f}{n - 1}}$$

4. Quartiles and Quartile Deviation: The quartiles are values that divide a list of numbers into quarters. The quartile deviation is half of the distance between the third and the first quartile.

For ungrouped data:

First quartile (Q1): the value at the $\left(\frac{n}{4}\right)$th position in the ordered data

Third quartile (Q3): the value at the $\left(\frac{3n}{4}\right)$th position in the ordered data

Quartile deviation (QD):
$$QD = \frac{Q_3 - Q_1}{2}$$

For grouped data:

First quartile (Q1):
$$Q_1 = L_o + \frac{\frac{n}{4} - CF_<}{f} \times CW$$

Third quartile (Q3):
$$Q_3 = L_o + \frac{\frac{3n}{4} - CF_<}{f} \times CW$$

Quartile deviation (QD):
$$QD = \frac{Q_3 - Q_1}{2}$$

5. Mean Deviation: The mean deviation (also called the mean absolute deviation) is the mean of the absolute deviations of a set of data about the data's mean. For a sample of size n, the mean deviation is defined by:

For ungrouped data:
$$MD = \frac{\sum |x - \bar{x}|}{n}$$

For grouped data:
$$MD = \frac{\sum |x - \bar{x}| f}{n}$$

RELATIVE MEASURE OF DISPERSION:

The relative measures of dispersion are used to compare the distribution of two or more data sets. These measures compare values without units. Common relative dispersion methods include:

1. Coefficient of Range
2. Coefficient of Variation
3. Coefficient of Standard Deviation
4. Coefficient of Quartile Deviation
5. Coefficient of Mean Deviation

COEFFICIENT OF DISPERSION

The coefficients of dispersion are calculated along with a measure of dispersion when two series that differ widely in their averages are compared. The dispersion coefficient is also used when two series with different measurement units are compared. It is denoted as C.D.
The common coefficients of dispersion are:

In terms of Range: C.D. = (Xmax − Xmin) ⁄ (Xmax + Xmin)
In terms of Quartile Deviation: C.D. = (Q3 − Q1) ⁄ (Q3 + Q1)
In terms of Standard Deviation (s): C.D. = s ⁄ x̄
In terms of Mean Deviation: C.D. = MD ⁄ x̄

EXAMPLE:

For ungrouped data, consider the following eight observations:

3 10 12 12 17 18 18 18

Range: Highest Value − Lowest Value = 18 − 3 = 15

Sample variance:

x      x̄      (x − x̄)²
3      13.5    110.25
10     13.5    12.25
12     13.5    2.25
12     13.5    2.25
17     13.5    12.25
18     13.5    20.25
18     13.5    20.25
18     13.5    20.25
       ∑(x − x̄)² = 200

$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{200}{7} = 28.57$$

Sample standard deviation:

$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{28.57} = 5.345$$

Quartile deviation: With n = 8, Q1 is the (8/4)th = 2nd observation, which is 10, and Q3 is the (3 × 8/4)th = 6th observation, which is 18, so

$$QD = \frac{Q_3 - Q_1}{2} = \frac{18 - 10}{2} = 4$$

Mean deviation:

x      x̄      |x − x̄|
3      13.5    10.5
10     13.5    3.5
12     13.5    1.5
12     13.5    1.5
17     13.5    3.5
18     13.5    4.5
18     13.5    4.5
18     13.5    4.5
       ∑|x − x̄| = 34

$$MD = \frac{\sum |x - \bar{x}|}{n} = \frac{34}{8} = 4.25$$
COEFFICIENT OF DISPERSION:

Coefficient of Range:
$$\frac{x_{max} - x_{min}}{x_{max} + x_{min}} = \frac{18 - 3}{18 + 3} = 0.714$$

Coefficient of Quartile Deviation:
$$\frac{Q_3 - Q_1}{Q_3 + Q_1} = \frac{18 - 10}{18 + 10} = 0.286$$

Coefficient of Standard Deviation:
$$\frac{s}{\bar{x}} = \frac{5.345}{13.5} = 0.3959$$

Coefficient of Mean Deviation:
$$\frac{MD}{\bar{x}} = \frac{4.25}{13.5} = 0.3148$$
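These ungrouped computations are easy to verify in Python (a sketch; statistics.variance and statistics.stdev use the n − 1 sample formulas, and the variable names are our own):

# Verify the ungrouped dispersion example.
import statistics

data = [3, 10, 12, 12, 17, 18, 18, 18]
xbar = statistics.mean(data)                        # 13.5

s2 = statistics.variance(data)                      # 28.571...  (sample variance)
s  = statistics.stdev(data)                         # 5.345...   (sample std dev)
md = sum(abs(x - xbar) for x in data) / len(data)   # 4.25       (mean deviation)
print(s2, s, md)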
For Grouped Data:

Class
Midpoint freq
Class Limits Boundaries f< f> RF
(x) (f)
Lower Upper Lower Upper
1 7 0.5 7.5 4 11 11 60 0.18
8 14 7.5 14.5 11 18 29 49 0.30
15 21 14.5 21.5 18 14 43 31 0.23
22 28 21.5 28.5 25 7 50 17 0.12
29 35 28.5 35.5 32 3 53 10 0.05
36 42 35.5 42.5 39 4 57 7 0.07
43 49 42.5 49.5 46 3 60 3 0.05
n = 60

x̄ = 17.65

Range = Highest Boundary − Lowest Boundary = 49.5 − 0.5 = 49

Sample variance:

$$s^2 = \frac{\sum (x - \bar{x})^2 f}{n - 1} = \frac{8077.65}{59} = 136.9093$$

Sample standard deviation:

$$s = \sqrt{\frac{\sum (x - \bar{x})^2 f}{n - 1}} = \sqrt{\frac{8077.65}{59}} = 11.7$$

First quartile (Q1):

$$Q_1 = L_o + \frac{\frac{n}{4} - CF_<}{f} \times CW = 7.5 + \frac{\frac{60}{4} - 11}{18} \times 7 = 9.06$$

Third quartile (Q3):

$$Q_3 = L_o + \frac{\frac{3n}{4} - CF_<}{f} \times CW = 21.5 + \frac{\frac{3(60)}{4} - 43}{7} \times 7 = 23.5$$

Quartile deviation (QD):

$$QD = \frac{Q_3 - Q_1}{2} = \frac{23.5 - 9.06}{2} = 7.22$$

Mean Deviation (MD):


Class
Class Limits Midpoint freq
Boundaries f< f> RF |𝑥 − 𝑥̅ |f
Lower Upper Lower Upper (x) (f)
1 7 0.5 7.5 4 11 11 60 0.18 150.2
8 14 7.5 14.5 11 18 29 49 0.3 119.7
15 21 14.5 21.5 18 14 43 31 0.23 4.9
22 28 21.5 28.5 25 7 50 17 0.12 51.45
29 35 28.5 35.5 32 3 53 10 0.05 43.05
36 42 35.5 42.5 39 4 57 7 0.07 85.4
43 49 42.5 49.5 46 3 60 3 0.05 85.05
n= 60 ∑|𝑥 − 𝑥̅|f 539.7

$$MD = \frac{\sum |x - \bar{x}| f}{n} = \frac{539.7}{60} = 8.995$$
60

COEFFICIENT OF DISPERSION:

Coefficient of Range:
$$\frac{x_{max} - x_{min}}{x_{max} + x_{min}} = \frac{49.5 - 0.5}{49.5 + 0.5} = 0.98$$

Coefficient of Quartile Deviation:
$$\frac{Q_3 - Q_1}{Q_3 + Q_1} = \frac{23.5 - 9.06}{23.5 + 9.06} = 0.44$$

Coefficient of Standard Deviation:
$$\frac{s}{\bar{x}} = \frac{11.7}{17.65} = 0.66$$

Coefficient of Mean Deviation:
$$\frac{MD}{\bar{x}} = \frac{8.995}{17.65} = 0.51$$
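The grouped measures can be verified the same way from the midpoint and frequency columns (a sketch with our own variable names):

# Grouped variance, standard deviation and mean deviation.
midpoints   = [4, 11, 18, 25, 32, 39, 46]
frequencies = [11, 18, 14, 7, 3, 4, 3]
n    = sum(frequencies)                                          # 60
mean = sum(x * f for x, f in zip(midpoints, frequencies)) / n    # 17.65

ss = sum((x - mean) ** 2 * f for x, f in zip(midpoints, frequencies))
s2 = ss / (n - 1)                                                # 136.91...
md = sum(abs(x - mean) * f for x, f in zip(midpoints, frequencies)) / n
print(s2, s2 ** 0.5, md)                                         # 136.91 11.70 8.995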

QUANTILES

Sorting the values we have is a common first step in data analysis. When we sort values, normally in increasing order, we obtain the order statistics. Having the order statistics helps us locate data in the given data set, compare a value with others in terms of ranking, or obtain an idea about the shape of the data distribution, since we can approximate where data are concentrated and where they are sparse. A particular kth entry in an increasing-order list is the kth order statistic, and approximately $\left(\frac{k}{n}\right)100\%$ of the observations in the list fall at or below the kth entry.

Quantiles refer to the set of n − 1 values of a variable that partition it into n equal proportions. If a set of data is divided into 2 equal proportions, then we have 2 − 1 = 1 value that partitions the data set, and this is the most common quantile, the median. The median, or middle datum, has $\left(\frac{1}{2}\right)100\% = 50\%$ of the observations falling

at or below it. There may be different ways on how we can partition our data set but the
most common are quartiles, deciles and percentiles.

The quartiles of a data set divide it into quarters, or 4 equal parts, which means every data set has three quartiles, named Q1, Q2, and Q3. Basically, the first quartile, Q1, is the data value that has 25% of the data set at or below it and 75% of the data set at or above it. The second quartile, Q2, is the data value that has 50% of the data set at or below and above it, which shows that Q2 is equal to the median. And Q3, the third quartile, is the data value that separates the 75% of the data set at or below it from the 25% of the data set at or above it.

On the other hand, deciles divide the data set into 10 equal parts using 9 partition values, or 9 deciles, named D1, D2, D3, ..., and D9. The first decile, D1, is the datum that has 10% of the data set at or below it and 90% at or above it, while the second decile, D2, is the datum that has 20% of the data set at or below it and the remaining 80% at or above it. Note that D5 is the same as Q2 and the median.

Lastly, percentiles divide the data set into 100 equal parts using 99 partition values, or 99 percentiles, named P1, P2, P3, ..., and P99. The first percentile, P1, is the datum that has 1% of the data set at or below it and 99% at or above it, while the second percentile, P2, is the datum that has 2% of the data set at or below it and 98% at or above it. It is worth noting that P50 = D5 = Q2 = x̃, and that other quantiles may coincide as well, e.g., P10 = D1.

In practical settings, percentiles are also referred to as "quantiles": P1 is referred to as the 0.01 quantile, P25 as the 0.25 quantile, and so on.

QUARTILE, DECILE, PERCENTILE OF UNGROUPED DATA

To find a particular quantile in a set of ungrouped data, the following steps must be followed:

1. Rearrange a given data set based on order of magnitude (smallest first).


2. Compute the rank or location of the quantile using the following formulas:

For the nth Quartile:

Qn = (nN/4)th observation

For the nth Decile:

Dn = (nN/10)th observation

For the nth Percentile:

Pn = (nN/100)th observation

Note: N represents the total number of observations.
3. If the formula gives an integer as an answer, simply locate the data in the order
statistics. But if the formula gives a non-integer answer, round up to the next whole
number and then locate the data in the order statistics.

Example:

The following observations represent the oxidation-induction time (min) for various
commercial oils:

87 103 130 160 180 195 132 145 211 105


145 153 152 138 87 99 93 119 129

First, rearrange the data set based on order of magnitude (smallest first):

87 87 93 99 103 105 119 129 130 132


138 145 145 152 153 160 180 195 211

Where N = 19

To solve for Q1:

Q1 = (nN/4)th = (1 × 19/4)th observation = 4.75th → round up to the 5th observation, which is 103

To solve for Q2:

Q2 = (nN/4)th = (2 × 19/4)th observation = 9.5th → round up to the 10th observation, which is 132; this also represents the median

To solve for Q3:

Q3 = (nN/4)th = (3 × 19/4)th observation = 14.25th → round up to the 15th observation, which is 153

To solve for D2:

D2 = (nN/10)th = (2 × 19/10)th observation = 3.8th → round up to the 4th observation, which is 99

To solve for D5:

D5 = (nN/10)th = (5 × 19/10)th observation = 9.5th → round up to the 10th observation, which is 132 and is equal to Q2 and the median

To solve for P20:

P20 = (nN/100)th = (20 × 19/100)th observation = 3.8th → round up to the 4th observation, which is 99 and is equal to D2

To solve for P50:

P50 = (nN/100)th = (50 × 19/100)th observation = 9.5th → round up to the 10th observation, which is 132 and is equal to Q2, D5, and the median

To solve for P75:

P75 = (nN/100)th = (75 × 19/100)th observation = 14.25th → round up to the 15th observation, which is 153 and is equal to Q3.

QUARTILE, DECILE, PERCENTILE OF GROUPED DATA

Since the actual values of the observations cannot be recovered when only grouped data are given, the quartiles, deciles and/or percentiles may be approximated using the following formulas:

For the nth Quartile:

Qn = Lo + [(nN/4 − CF<)/f] × CW

where:
• Lo is the lower boundary of the class containing the nth quartile
• N is the sum of frequencies, i.e., the total number of observations
• CF< is the cumulative frequency before the class containing the nth quartile
• f is the frequency of the class containing the nth quartile
• CW is the class width

For the nth Decile:

Dn = Lo + [(nN/10 − CF<)/f] × CW

where:
• Lo is the lower boundary of the class containing the nth decile
• N is the sum of frequencies, i.e., the total number of observations
• CF< is the cumulative frequency before the class containing the nth decile
• f is the frequency of the class containing the nth decile
• CW is the class width

For the nth Percentile:

Pn = Lo + [(nN/100 − CF<)/f] × CW

where:
• Lo is the lower boundary of the class containing the nth percentile
• N is the sum of frequencies, i.e., the total number of observations
• CF< is the cumulative frequency before the class containing the nth percentile
• f is the frequency of the class containing the nth percentile
• CW is the class width

Example:

Class Limits     Class Boundaries     Midpoint   freq
Lower   Upper    Lower     Upper        (x)       (f)     f<    f>     RF
  1       7       0.5       7.5          4        11      11    60    0.18
  8      14       7.5      14.5         11        18      29    49    0.30
 15      21      14.5      21.5         18        14      43    31    0.23
 22      28      21.5      28.5         25         7      50    17    0.12
 29      35      28.5      35.5         32         3      53    10    0.05
 36      42      35.5      42.5         39         4      57     7    0.07
 43      49      42.5      49.5         46         3      60     3    0.05
                                               n = 60

Using the same grouped data, solve for the Q1, D7 and P85.

To solve for Q1:

Q1 = Lo + [(nN/4 − CF<)/f] × CW

Take the value of nN/4 and locate the class where this observation falls. A quantile lies within the first class whose CF< is greater than or equal to nN/4.

For Q1, nN/4 = 60/4 = 15, and the first CF< greater than 15 is 29, which belongs to the second class with limits 8 - 14; this is the class containing the first quartile.


Therefore:

Q1 = Lo + [(nN/4 − CF<)/f] × CW = 7.5 + [(15 − 11)/18] × 7 = 9.06
18

To solve for D7:

D7 = Lo + [(nN/10 − CF<)/f] × CW

Take the value of nN/10 and locate the class where this observation falls.

For D7, nN/10 = (7)(60)/10 = 42, and the first CF< greater than 42 is 43, which belongs to the third class with limits 15 - 21; this is the class containing the seventh decile.


Therefore:

D7 = Lo + [(nN/10 − CF<)/f] × CW = 14.5 + [(42 − 29)/14] × 7 = 21
14

To solve for P85:

P85 = Lo + [(nN/100 − CF<)/f] × CW

Take the value of nN/100 and locate the class where this observation falls.

For P85, nN/100 = (85)(60)/100 = 51, and the first CF< greater than 51 is 53, which belongs to the fifth class with limits 29 - 35; this is the class containing the 85th percentile.


Therefore:

P85 = Lo + [(nN/100 − CF<)/f] × CW = 28.5 + [(51 − 50)/3] × 7 = 30.83
3
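Before moving on, here is a hedged Python sketch (illustrative only, not from the module) of the grouped-data quantile formula, using the class boundaries, frequencies, and <-type cumulative frequencies from the example:

```python
# Each class: (lower boundary Lo, frequency f, cumulative frequency f<)
classes = [
    (0.5, 11, 11), (7.5, 18, 29), (14.5, 14, 43), (21.5, 7, 50),
    (28.5, 3, 53), (35.5, 4, 57), (42.5, 3, 60),
]
CW, N = 7, 60  # class width and total observations

def grouped_quantile(n, divisions):
    """Approximate the nth quantile (divisions = 4, 10, or 100) from grouped data."""
    target = n * N / divisions      # nN/4, nN/10, or nN/100
    cf_before = 0
    for lo, f, cf in classes:       # first class whose f< reaches the target
        if cf >= target:
            return lo + (target - cf_before) / f * CW
        cf_before = cf

print(round(grouped_quantile(1, 4), 2))     # Q1  -> 9.06
print(round(grouped_quantile(7, 10), 2))    # D7  -> 21.0
print(round(grouped_quantile(85, 100), 2))  # P85 -> 30.83
```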

BOX-AND-WHISKER PLOTS

The box plot is a graphical display that has been used successfully to describe several of
a data set's most prominent features. These features include (1) center, (2) spread, (3)
the extent and nature of any departure from symmetry, and (4) identification of “outliers,”
observations that lie unusually far from the main body of the data. Because even a single
outlier can drastically affect the values of the mean and standard deviation, a box plot is
based on measures that are “resistant” to the presence of a few outliers—the median and
a measure of variability called the interquartile range.

The simplest box plot is based on the following Five-number summary:

• Lowest value
• Lower quartile (Q1)
• Median
• Upper quartile (Q3)
• Highest value

To make a box-and-whiskers plot:

• Draw a vertical scale to include the highest and lowest values in the data set.
• Draw a box from Q1 to Q3. The horizontal width of the box is irrelevant, while the height of the box is exactly the IQR.
• Up from Q3, measure out a distance of 1.5 times the IQR and draw a so-called whisker up to the highest value that lies within this distance, marked with a horizontal line.
• Down from Q1, measure out a distance of 1.5 times the IQR and draw a whisker down to the lowest value that lies within this distance, marked with a horizontal line.
• All observations beyond the whiskers may be marked with a small circle or an asterisk (*). A point beyond a whisker but less than 3 IQR from the box edge is considered a mild outlier, while points more than 3 IQR from the box edge are considered extreme outliers. (A worked sketch follows below.)
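As a worked sketch of these steps (an illustration, not part of the module), the five-number summary and whisker limits can be computed for the ungrouped oxidation-induction data using this module's round-up quantile rule:

```python
import math

data = sorted([87, 87, 93, 99, 103, 105, 119, 129, 130, 132,
               138, 145, 145, 152, 153, 160, 180, 195, 211])
N = len(data)

q1 = data[math.ceil(N / 4) - 1]       # 103
med = data[math.ceil(N / 2) - 1]      # 132
q3 = data[math.ceil(3 * N / 4) - 1]   # 153
iqr = q3 - q1                         # 50

# Whiskers extend to the most extreme values within 1.5 IQR of the box
low_whisker = min(x for x in data if x >= q1 - 1.5 * iqr)
high_whisker = max(x for x in data if x <= q3 + 1.5 * iqr)
outliers = [x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]

print(min(data), q1, med, q3, max(data))  # five-number summary: 87 103 132 153 211
print(low_whisker, high_whisker, outliers)  # 87 211 [] -> no outliers here
```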

SKEWNESS AND KURTOSIS

Skewness and kurtosis are two different measures that characterize the shape of a data set's distribution. Skewness refers to the degree of symmetry, or departure from symmetry, of a set of data, while kurtosis is a measure of the degree to which a unimodal distribution is peaked.

SKEWNESS

A distribution is said to be right-skewed (or positively skewed) if the right tail seems to be
stretched from the center. A left-skewed (or negatively skewed) distribution is stretched
to the left side. A symmetric distribution has a graph that is balanced about its center, in
the sense that half of the graph may be reflected about a central line of symmetry to match
the other half.
The degree of skewness is measured by the coefficient of skewness, sk, where

sk = 3(Mean − Median) / (Standard deviation)

and ranges from -3 to 3.

A distribution where x̄ = x̃ = x̂ is called a normal or symmetrical distribution, and its graph is usually called the normal curve or bell-shaped curve. Normally distributed data have a degree of skewness equal to zero.

Normal or Symmetrical Distribution

A distribution where x̄ < x̃ is negatively skewed, or skewed to the left. Negatively skewed distributions have negative sk values down to −3; they contain extremely low scores with low frequency but do not contain extremely high scores with correspondingly low frequency.

Negatively Skewed Distribution (heavy left tail): longer left tail, with extreme low scores occurring at low frequency

A distribution where x̃ < x̄ is positively skewed, or skewed to the right. Positively skewed distributions have positive sk values up to 3; they contain extremely high scores with low frequency but do not contain extremely low scores with correspondingly low frequency.

Positively Skewed Distribution (heavy right tail): longer right tail, with extreme high scores occurring at low frequency

KURTOSIS

The term kurtosis was introduced by Karl Pearson. This word literally means "the amount
of hump", and is used to represent the degree of peakedness or flatness of a unimodal
frequency curve. One measure of kurtosis is the percentile coefficient of kurtosis (ku)
which is defined by the formula
ku = (Q3 − Q1) / [2(P90 − P10)]

and whose value ranges from 0 to 0.5.

There are three categories of kurtosis that can be displayed by a set of data. All
measures of kurtosis are compared against a standard normal distribution, or bell curve.

The first category of kurtosis is a mesokurtic distribution. This type is the most similar to a standard normal distribution in that it also resembles a bell curve, though its tails may be slightly fatter and its peak slightly lower than those of the standard normal curve. Data of this type are considered normally distributed in shape; for normality, the data must have ku = 0.263.

The second category is a leptokurtic distribution. Any distribution that is leptokurtic displays greater kurtosis than a mesokurtic distribution. This type of distribution has extremely thick tails and a very thin, tall peak. The prefix "lepto-" means "skinny," making the shape of a leptokurtic distribution easy to remember. Leptokurtic distributions have ku values less than 0.263. T-distributions are leptokurtic.

The final type of distribution is a platykurtic distribution. These types of distributions have slender tails and a peak that is smaller than that of a mesokurtic distribution. The prefix "platy-" means "broad," describing a short and broad-looking peak. Platykurtic distributions have ku values greater than 0.263. Uniform distributions are platykurtic.
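As an illustrative check (a sketch, not part of the module), both shape coefficients can be evaluated for the grouped-data example; the median (15.0), P10 (≈ 4.32), and P90 (≈ 37.25) below were obtained with the grouped quantile formulas of the previous section:

```python
# Values taken/derived from the grouped-data example above
mean, median, s = 17.65, 15.0, 11.7
q1, q3 = 9.06, 23.5
p10, p90 = 4.32, 37.25

sk = 3 * (mean - median) / s            # coefficient of skewness
ku = (q3 - q1) / (2 * (p90 - p10))      # percentile coefficient of kurtosis

print(round(sk, 2))   # ~0.68 -> positive, so the data are skewed to the right
print(round(ku, 3))   # ~0.219 -> below 0.263, so the shape is leptokurtic
```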

Comparison of mesokurtic, leptokurtic, and platykurtic curves

CHEBYSHEV’S THEOREM

Chebyshev's theorem allows us to determine, at minimum, what percentage of a data set lies within a certain number of standard deviations of the mean.

Chebyshev's theorem states that for any k > 1, at least 1 − 1/k² of the data lies within k standard deviations of the mean. As stated, the value of k must be greater than 1.

Using this formula and plugging in the value 2, we get 1 − 1/2² = 75%. This means that at least 75% of the data for a set of numbers lies within 2 standard deviations of the mean. The actual proportion could be greater; it could even be 100%, but it is guaranteed to be at least 75%. This is what Chebyshev's theorem computes.

If we plug in 3 for k, the resulting value is 88.89%, meaning that at least 88.89% of a data set lies within 3 standard deviations of the mean.

If we plug in 4 for k, the resulting value is 93.75%, meaning that at least 93.75% of a data set lies within 4 standard deviations of the mean.

Chebyshev's theorem is a useful tool for finding, at minimum, what percentage of a population lies within a certain number of standard deviations above or below the mean. Given a value of k, the number of standard deviations from the mean, the computed value 1 − 1/k² represents the minimum proportion of the data set that falls within k standard deviations of the mean. Because the theorem makes no assumption about the shape of the distribution, it can be applied to any numerical data set.

In summary, for any numerical data set,

• at least 3/4 of the data lie within two standard deviations of the mean, that is, in the interval with endpoints x̄ ± 2s for samples and with endpoints μ ± 2σ for populations;
• at least 8/9 of the data lie within three standard deviations of the mean, that is, in the interval with endpoints x̄ ± 3s for samples and with endpoints μ ± 3σ for populations;
• at least 1 − 1/k² of the data lie within k standard deviations of the mean, that is, in the interval with endpoints x̄ ± ks for samples and with endpoints μ ± kσ for populations, where k is any number greater than 1.

It is important to pay careful attention to the words “at least” at the beginning of each of
the three parts of Chebyshev’s Theorem. The theorem gives the minimum proportion of
the data which must lie within a given number of standard deviations of the mean; the
true proportions found within the indicated regions could be greater than what the theorem
guarantees.

EXAMPLE

The East Coast Independent News periodically runs ads in its own classified section offering a month's free subscription to those who respond. In this way, management can get a sense of the number of subscribers who read the classified section each day. Over a period of 2 years, careful records have been kept. The mean number of responses per ad is 525, with a standard deviation of 30.

Determine the interval about the mean in which at least 88.9% of the data fall.

At least 88.9% of the data fall in the interval from µ - 3σ to µ + 3σ

µ - 3σ to µ + 3σ

525 – 3(30) to 525 + 3(30)

435 to 615
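A quick sketch of the bound and of this example (illustrative names, not from the module):

```python
def chebyshev_bound(k):
    """Minimum fraction of data within k standard deviations of the mean (k > 1)."""
    return 1 - 1 / k**2

mu, sigma, k = 525, 30, 3
print(round(chebyshev_bound(k), 4))     # 0.8889 -> at least 88.9% of the data
print(mu - k * sigma, mu + k * sigma)   # interval: 435 to 615
```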

Activity:
1) The following scores represent the final examination grade for an elementary statistics
course:
23 60 79 32 57 74 52 70 82 36
80 77 81 95 41 65 92 85 55 76
52 10 64 75 78 25 80 98 81 67
41 71 83 54 54 72 88 62 74 43
60 78 89 76 76 48 84 90 15 79
34 67 17 82 82 74 63 80 85 61

• For the ungrouped data, find the sample mean and median. Determine the range, variance, standard deviation, quartile deviation, mean deviation, and the dispersion coefficients.
• Set up a relative frequency distribution. Then, for the grouped data, find the sample mean and median, and determine the range, variance, standard deviation, quartile deviation, mean deviation, and the dispersion coefficients.

2) The following data represent the length of life in years, measured to the nearest tenth,
of 30 similar fuel pumps:

2.0 3.0 0.3 3.3 1.3 0.4
0.2 6.0 5.5 6.5 0.2 2.3
1.5 4.0 5.9 1.8 4.7 0.7
4.5 0.3 1.5 0.5 2.5 5.0
1.0 6.0 5.6 6.0 1.2 0.2

• For the ungrouped data, find the sample mean and median. Determine the range, variance, standard deviation, quartile deviation, mean deviation, and the dispersion coefficients.
• Set up a relative frequency distribution. Then, for the grouped data, find the sample mean and median, and determine the range, variance, standard deviation, quartile deviation, mean deviation, and the dispersion coefficients.

3) The following data represent the length of life in seconds of 50 fruit flies to a new spray
in a controlled laboratory experiment.

17 20 10 9 23 13 12 19 18 24
12 14 6 9 13 6 7 10 13 7
16 18 8 13 3 32 9 7 10 11
13 7 18 7 10 4 27 19 16 8
7 10 5 14 15 10 9 6 7 15

• For the ungrouped data, find the sample mean and median. Determine the range, variance, standard deviation, quartile deviation, mean deviation, and the dispersion coefficients.
• Set up a relative frequency distribution. Then, for the grouped data, find the sample mean and median, and determine the range, variance, standard deviation, quartile deviation, mean deviation, and the dispersion coefficients.

Probability
________________________________________________
MODULE 2

Session 1
Elementary Probability Theory

By the end of this session, you should be able to:

1. Understand the concept of the naive definition of probability.


2. Apply the different properties of probability when solving problems.

Lecture:

The mathematical theory of probability gives us the basic tools for constructing and analyzing mathematical models for random phenomena. In studying a random phenomenon, we are dealing with an experiment whose outcome is not predictable in advance.

In science and engineering, random phenomena describe a wide variety of situations that can be grouped into two broad classes: physical or natural phenomena involving uncertainties, and those exhibiting variability. Since uncertainty and variability are present in our modeling of all real phenomena, it is only natural that probabilistic modeling and analysis occupy a central place in the study of a wide variety of topics in science and engineering.
PROBABILITY

Probability is used to quantify the likelihood, or chance, that an outcome of a random


experiment will occur. The probability of an outcome can be interpreted as our subjective
probability, or degree of belief, or intuition, that the outcome will occur. Another
interpretation of probability is based on the conceptual model of repeated replications of
the random experiment. If a die is tossed in the air, then it is certain that the die will come
down, but it is not certain that, say, a 6 will appear. However, suppose we repeat this
experiment of tossing a die; let s be the number of successes, i.e. the number of times a
6 appears, and let n be the number of tosses. Then it has been empirically observed that
the ratio f = s/n, called the relative frequency, becomes stable in the long run, i.e., it approaches a limit. This stability is the basis of probability. When the model of equally likely outcomes is assumed, the probabilities are chosen to be equal.
Historically, probability theory began with the study of games of chance, such as roulette
and cards. The probability p of an event A was defined as follows: if A can occur in s ways
out of a total of n equally likely ways, then
p = P(A) = s/n = (number of occurrences of A) / (total number of possible outcomes in the experiment)

Example 1:
What is the probability of drawing one card from a standard deck of 52 cards and having
it be a king?
When you select a card from a deck, there are 52 possible outcomes, 4 of which are favorable as kings. Thus, the probability of drawing a king is 4/52 = 1/13.

Example 2:
Human eye color is controlled by a single pair of genes, one of which comes from the
mother and one of which comes from the father, called a genotype. Brown eye color, B,
is dominant over blue eye color, l. Therefore, in the genotype Bl, which consists of one
brown gene B and one blue gene l, the brown gene dominates. A person with a genotype
Bl will have brown eyes.
If both parents have genotype Bl, what is the probability that their child will have blue
eyes?

Considering every possible eye-color genotype for the child, the table shows the possible outcomes:

                Father
Mother       B        l
   B        BB       Bl
   l        lB       ll

Since all four possible outcomes are equally likely to occur and blue eyes can only occur with the ll genotype, there is only one outcome favorable to blue eyes. Thus, the probability that the child has blue eyes is 1/4.

Example 3:

Find the probability of getting a non-defective bulb from 200 bulbs already examined,
where 12 bulbs are found to be defective.
P(non-defective bulb) = (200 − 12 non-defective bulbs) / (200 bulbs) = 188/200 = 47/50

PROPERTIES OF PROBABILITY

1. The probability of a sample space S is 1.

   P(S) = 1

2. The probability of a null set is 0.

   P(φ) = 0

3. The probability of an event E always lies in the range from 0 to 1.

   0 ≤ P(E) ≤ 1

4. The sum of the probabilities of all events (or final outcomes) of an experiment, denoted by ΣP(Ei), is always 1.

   ΣP(Ei) = P(E1) + P(E2) + P(E3) + ⋯ + P(En) = 1

To illustrate the last property, consider the experiment of tossing a coin twice. Then S = {HH, TH, HT, TT}. Define the events E1, E2, E3 as follows:

E1 = the event of getting zero heads; then E1 = {TT} and P(E1) = 1/4.

E2 = the event of getting 1 head; then E2 = {HT, TH} and P(E2) = 2/4 or 1/2.

E3 = the event of getting 2 heads; then E3 = {HH} and P(E3) = 1/4.

Thus, to verify the fourth property of probability, we have:

ΣP(Ei) = P(E1) + P(E2) + P(E3) = 1/4 + 1/2 + 1/4 = 1

Another illustration of the fourth property of probability is given as follows:

Example 1:

Five of the 100 randomly delivered plastic connectors of assorted colors are found to be green. What is the probability of getting the color green in the next delivery of connectors?

Let E denote the event of having the color green in the next delivery of connectors. Using the relative frequency concept of probability with n = 100 and f = 5, we obtain:

P(E) = P(next connector is green) = f/n = 5/100 = 0.05

Connectors        f      Relative Frequency
Other color      95      95/100 = 0.95
Green             5       5/100 = 0.05
              n = 100     Sum = 1.00

Distribution for the Sample of Connectors

Example 2:

All 275 employees of a factory were asked if they are smokers or non-smokers and whether or not they are college graduates. Based on this information, the two-way classification table below was prepared. If one employee is selected at random from this factory, find the probability that this employee is (a) a college graduate; (b) a non-smoker.

              College Graduate   Not a College Graduate   Total
Smoker               15                    50                65
Nonsmoker            90                   120               210
Total               105                   170               275

a. Let E1 be the event that an employee is a college graduate. Since there are 105 graduates out of 275 employees,

   P(E1) = 105/275 ≈ 0.38

b. Let E2 be the event that an employee is a nonsmoker. Since there are 210 nonsmokers out of 275 employees,

   P(E2) = 210/275 ≈ 0.76

Activity

1. A mutual fund company offers its customers a variety of funds: a money-market


fund, three different bond funds (short, intermediate, and long-term), two stock
funds (moderate and high-risk), and a balanced fund. Among customers who own
shares in just one fund, the percentages of customers in the different funds are as
follows:

Money-market          20%
Short bond            15%
Intermediate bond     10%
Long bond              5%
Moderate-risk stock   25%
High-risk stock       18%
Balanced               7%

A customer who owns shares in just one fund is randomly selected.

a. What is the probability that the selected individual owns shares in the balanced
fund?
b. What is the probability that the individual owns shares in a bond fund?
c. What is the probability that the selected individual does not own shares in a
stock fund?

2. A hat contains 40 marbles, 16 of which are red and 24 are green. If one marble is
randomly selected out of this hat, what is the probability that this marble is a) red?
b) green?

3. Out of 30 families living in an apartment complex in a compound, 6 paid no income


tax last year. What is the probability that a randomly selected family from these
30 families paid income tax last year?

4. A multiple-choice question in a test contains five answers. If Kate chooses one


answer based on “pure guess,” what is the probability that her answer is a) correct?
b) wrong?

5. The table below gives a two-way classification of all the scientists and engineers. If one person is selected at random from these scientists and engineers, find the probability that this person is a) an engineer? b) a male?

              Male    Female
Scientist      58       79
Engineer       64       37

Session 2
Sample Spaces and Relationship Among Events

By the end of this session, you should be able to:

1. Understand and describe sample spaces and events for random experiments
with graphs, tables, lists, or tree diagrams
2. Identify the relationships existing between events and differentiate them

Lecture:

RANDOM EXPERIMENTS

There are two types of experiments: deterministic and random. A deterministic experiment is one whose outcome may be predicted with certainty beforehand, such as combining hydrogen and oxygen, or adding two numbers such as 2 + 3. A random experiment is one whose outcome is determined by chance. We posit that the outcome of a random experiment may not be predicted with certainty beforehand, even in principle, and that it may result in different outcomes even when it is repeated in the same manner every time. Examples of random experiments include tossing a coin, rolling a die, throwing a dart at a board, counting how many red lights you encounter on the drive home, and measuring the amount of current flowing through a thin wire, where small variations may produce different readings.

SAMPLE SPACES

Sample spaces are simply sets whose elements describe all the possible outcomes of the random experiment in which we are interested. The sample space is denoted as S.

A sample space is often defined based on the objectives of the analysis. It is useful to distinguish between two types of sample spaces: discrete and continuous. A sample space is discrete if it consists of a finite or countably infinite set of outcomes. A sample space is continuous if it contains an interval (either finite or infinite) of real numbers.

Example 1:

Consider an experiment in which you select a molded plastic part, such as a connector,
and measure its thickness. The possible values for thickness depend on the resolution of
the measuring instrument, and they also depend on upper and lower bounds for

thickness. However, it might be convenient to define the sample space as simply the
positive real line

𝑆 = 𝑅+ = {𝑥 |𝑥 > 0}

because a negative value for thickness cannot occur. This sample space is continuous.

If it is known that all connectors will be between 10 and 11 millimeters thick, the continuous
sample space could be

𝑆 = {𝑥|10 < 𝑥 < 11}

If the objective of the analysis is to consider only whether a particular part is low, medium,
or high for thickness, the sample space might be taken to be the set of three outcomes:

𝑆 = {𝑙𝑜𝑤, 𝑚𝑒𝑑𝑖𝑢𝑚, ℎ𝑖𝑔ℎ}

which is an example of a discrete sample space.

If the objective of the analysis is to consider only whether or not a particular part conforms
to the manufacturing specifications, the discrete sample space might be simplified to the
set of two outcomes

𝑆 = {𝑦𝑒𝑠, 𝑛𝑜}

Example 2:

In random experiments in which items are selected from a batch, we will indicate whether
or not a selected item is replaced before the next one is selected.

If a batch consists of three items {a, b, c} and our experiment is to select two items without
replacement, the sample space can be represented as

𝑆𝑤𝑖𝑡ℎ𝑜𝑢𝑡 = {𝑎𝑏, 𝑎𝑐, 𝑏𝑎, 𝑏𝑐, 𝑐𝑎, 𝑐𝑏}

If the items are replaced before the next one is selected, the sampling is referred to as
with replacement. Then the possible ordered outcomes are

𝑆𝑤𝑖𝑡ℎ = {𝑎𝑎, 𝑎𝑏, 𝑎𝑐, 𝑏𝑎, 𝑏𝑏, 𝑏𝑐, 𝑐𝑎, 𝑐𝑏, 𝑐𝑐}

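Both ordered sample spaces can be enumerated with the standard library's itertools; this is an illustrative sketch, not part of the module:

```python
from itertools import permutations, product

batch = ['a', 'b', 'c']

# Two items selected without replacement: ordered pairs of distinct items
S_without = [''.join(p) for p in permutations(batch, 2)]
# Two items selected with replacement: repeated items are allowed
S_with = [''.join(p) for p in product(batch, repeat=2)]

print(S_without)  # ['ab', 'ac', 'ba', 'bc', 'ca', 'cb']
print(S_with)     # ['aa', 'ab', 'ac', 'ba', 'bb', 'bc', 'ca', 'cb', 'cc']
```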
Example 3:

Sometimes it is not necessary to specify the exact item selected, but only a property of
the item.

Suppose that there are 5 defective parts and 95 good parts in a batch. To study the quality
of the batch, two are selected without replacement. Let 𝑔 denote a good part and 𝑑 denote
a defective part. It might be sufficient to describe the sample space (ordered) in terms of the quality of each part selected:

𝑆 = {𝑔𝑔, 𝑔𝑑, 𝑑𝑔, 𝑑𝑑}

Also, if there were only one defective part in the batch, there would be fewer possible
outcomes

𝑆 = {𝑔𝑔, 𝑔𝑑, 𝑑𝑔}

because dd would be impossible.

TREE DIAGRAMS

An effective way of graphically describing a sample space is through the use of tree
diagrams. When a sample space can be constructed in several steps or stages, we can
represent each of the 𝑛1 ways of completing the first step as a branch of a tree. Each of
the ways of completing the second step can be represented as 𝑛2 branches starting from
the ends of the original branches, and so forth.

Example:

An automobile manufacturer provides vehicles equipped with selected options. Each


vehicle is ordered

 With or without automatic transmission


 With or without air-conditioning
 With one of three choices of a stereo system
 With one of four exterior colors

If the sample space consists of the set of all possible vehicle types, what is the number
of outcomes in the sample space?

Per the tree diagram, each step multiplies the number of branches (2 × 2 × 3 × 4), so the sample space contains 48 outcomes.
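The same count can be checked by enumerating every path of the tree with itertools.product; the option labels below are placeholders, since the module does not name them:

```python
from itertools import product

# Each list is one stage of the tree (labels are illustrative placeholders)
transmission = ['automatic', 'manual']
aircon = ['with AC', 'without AC']
stereo = ['stereo 1', 'stereo 2', 'stereo 3']
color = ['color 1', 'color 2', 'color 3', 'color 4']

sample_space = list(product(transmission, aircon, stereo, color))
print(len(sample_space))  # 48 = 2 * 2 * 3 * 4
```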

EVENTS

Often we are interested in a collection of related outcomes from a random experiment. We call such a collection of related outcomes an event.

An event, 𝐸, is a subset of the sample space of a random experiment.

We can also be interested in describing new events from combinations of existing events.
Because events are subsets, we can use basic set operations such as unions,
intersections, and complements to form other events of interest. Some of the basic set
operations are summarized below in terms of events:

• The union of two events A and B, denoted by the symbol A ∪ B, is the event containing all the elements that belong to A or B or both.

Illustration: Let A = {a, b, c} and B = {b, c, d, e}; then A ∪ B = {a, b, c, d, e}

• The intersection of two events A and B, denoted by the symbol A ∩ B, is the event containing all elements that are common to A and B.

Illustration: Let E be the event that a person selected at random in a classroom is majoring in engineering, and let F be the event that the person is female. Then E ∩ F is the event of all female engineering students in the classroom.

Let V = {a, e, i, o, u} and C = {l, r, s, t}; then it follows that V ∩ C = φ. That
is, V and C have no elements in common and, therefore, cannot both
simultaneously occur.

For certain statistical experiments it is by no means unusual to define two events, A and
B, which cannot both occur simultaneously. The events A and B are then said to be
mutually exclusive or disjoint, if A ∩ B = φ, that is, if A and B have no elements in
common.

• The complement of an event A with respect to S is the subset of all elements of S that are not in A. We denote the complement of A by the symbol A′.

Illustration: Let R be the event that a red card is selected from an ordinary deck of 52 playing cards and let S be the entire deck. Then R′ is the event that the card selected from the deck is not a red card but a black card.

Example:

Consider the sample space S = {copper, sodium, nitrogen, potassium, uranium, oxygen,
zinc} and the events
A = {copper, sodium, zinc}
B = {sodium, nitrogen, potassium}
C = {oxygen}

List the elements of the sets corresponding to the following events:

a. A′    b. A ∪ C    c. (A ∩ B′) ∪ C′

a. A′ = {nitrogen, potassium, uranium, oxygen}

b. A ∪ C = {copper, sodium, zinc, oxygen}

c. A = {copper, sodium, zinc}
   B′ = {copper, uranium, oxygen, zinc}
   C′ = {copper, sodium, nitrogen, potassium, uranium, zinc}
   A ∩ B′ = {copper, zinc}
   (A ∩ B′) ∪ C′ = C′
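Python's built-in set type mirrors these operations directly; the following sketch (not from the module) verifies the three answers:

```python
S = {'copper', 'sodium', 'nitrogen', 'potassium', 'uranium', 'oxygen', 'zinc'}
A = {'copper', 'sodium', 'zinc'}
B = {'sodium', 'nitrogen', 'potassium'}
C = {'oxygen'}

print(S - A)                     # A' = complement of A with respect to S
print(A | C)                     # A ∪ C
print((A & (S - B)) | (S - C))   # (A ∩ B') ∪ C'
print((A & (S - B)) | (S - C) == S - C)  # True: the result equals C'
```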

VENN DIAGRAMS

Diagrams are often used to portray relationships between sets, and these diagrams are
also used to describe relationships between events. We can use Venn diagrams to
represent a sample space and events in a sample space.

The sample space of the random experiment is represented as the points in the rectangle S, and the events A and B are subsets of points in the indicated regions. Typical cases shown with Venn diagrams include:

• two events with no common outcomes (disjoint regions);
• two events with some common outcomes, whose overlap is A ∩ B;
• a random experiment with three events A, B, and C, showing regions such as A ∩ B ∩ C and (A ∩ C)′.

• If two events have no outcomes in common, they are called mutually exclusive
events. The intersection of such two events is a null set.

Activity

1. Suppose that vehicles taking a particular freeway exit can turn right (R), turn left
(L), or go straight (S). Consider observing the direction for each of three successive
vehicles.
a. List all outcomes in the event A that all three vehicles go in the same direction.
b. List all outcomes in the event B that all three vehicles take different directions.
c. List all outcomes in the event C that exactly two of the three vehicles turn right.
d. List all outcomes in the event D that exactly two vehicles go in the same
direction.
e. List outcomes in D', C∪D, and C∩D.

2. An engineering construction firm is currently working on power plants at three


different sites. Let Ai denote the event that the plant at site i is completed by the
contract date. Use the operations of union, intersection, and complementation to
describe each of the following events in terms of A1, A2, and A3, draw a Venn
diagram, and shade the region corresponding to each one.
a. At least one plant is completed by the contract date.
b. All plants are completed by the contract date.
c. Only the plant at site 1 is completed by the contract date.
d. Exactly one plant is completed by the contract date.
e. Either the plant at site 1 or both of the other two plants are completed by the
contract date.

Session 3
COUNTING RULES USEFUL IN PROBABILITY

By the end of this session, you should be able to:

1. Differentiate permutations and combinations.


2. Apply the appropriate counting principle in various areas of problem solving.

Lecture:

PRINCIPLE OF COUNTING

If a choice consists of two steps, of which the first can be made in n1 ways and the second can be made in n2 ways, then the whole choice can be made in n1 × n2 ways.

Example 1:

In the design of a casing for a gear housing, we can use four different types of fasteners,
three different bolt lengths, and three different bolt locations. How many casing designs
are possible?

4 × 3 × 3 = 36 casing designs

Example 2:

In a medical study, patients are classified in 8 ways according to whether they have blood
type AB+, AB-, A+, A-, B+, B-, O+, or O-, and also according to whether their blood
pressure is low, normal, or high. In how many ways can a patient be classified?

8 × 3 = 24 classifications

Example 3:

A developer of a new subdivision offers prospective home buyers a choice of Tudor,


rustic, colonial, and traditional exterior styling in ranch, two-story, and split-level floor
plans. In how many different ways can a buyer order one of these homes?

4 × 3 = 12 possible homes

PERMUTATIONS

Permutation is an arrangement of a group of things in a definite order, that is, there is a


first element, a second, a third, etc. In other words, the order or arrangement of the
elements is important. For example, the letters a, b, and c have the following possible
arrangements or permutations:

abc acb bac bca cab cba

LINEAR PERMUTATIONS are permutations in a straight line:

Permutations of n distinct objects taken all at a time

𝒏𝑷𝒏 = 𝒏!

Illustration:

How many permutations can be made from three distinct objects if all are taken at a time?

𝑛! = 3! = 3 × 2 × 1 = 6 permutations

Permutations of n distinct objects taken r at a time, r < n

nPr = n! / (n − r)!

Illustration:

In the game of Jai Alai, players are numbered from 1 to 10. The first player to score 9 points will be the first placer, and the next two highest scorers will be declared 2nd and 3rd placers. Each bet must therefore consist of three numbers corresponding to the player numbers of the 1st placer, 2nd placer, and 3rd placer, respectively. Neglecting the possibility of not having a 2nd and 3rd placer, how many bets are possible?

nPr = 10P3 = 10!/(10 − 3)! = 720 bets

Permutations of n objects (some objects are identical) taken all at a time

nPni = n! / (i1! i2! ⋯)

where i1 = number of type-1 identical things
      i2 = number of type-2 identical things

Illustration:

How many permutations can you make out of the letters of the word PHILIPPINES if all
letters are taken at a time?

nPni = 11!/(3! 3!) = 1,108,800 permutations

CIRCULAR PERMUTATIONS are permutations positioned in a circular shape:

Permutation of n distinct objects taken all at a time

𝒏𝑷𝒄𝒏 = (𝒏 − 𝟏)!

Illustration:

In how many ways can three couples be seated at a circular table of 6 seats if one person occupies a particular position?

𝑛𝑃𝑐𝑛 = (6 − 1)! = 120 ways

Permutation of n objects (some objects are identical) taken all at a time

nPcn = (n − 1)! / (i1! i2! ⋯)

where the given formula may only be applied provided that at least one thing is distinct.

Illustration:

In how many ways can you arrange 6 chairs in a circular position if 2 of which are
identical?

nPcn = (6 − 1)!/2! = 60 ways

Permutation of n distinct objects taken r at a time, r < n

𝒏𝑷𝒄𝒓 = (𝒏𝑪𝒓)(𝒓 − 𝟏)!

Illustration:

How many circular permutations of 4 objects can you make out of 6 different objects?

𝑛𝑃𝑐𝑟 = 6𝑃𝑐4 = (6𝐶4)(4 − 1)! = 90 permutations

PERMUTATION BY GROUP (collection of things arranged in peculiar ways)

PG = G! n1! n2! ⋯

where G = number of groups
      n1 = number of things in group 1
      n2 = number of things in group 2

Illustration:

In how many ways can you line-up 4 marines and 6 army soldiers if all the marines must
be side by side with each other and the army soldiers are likewise?

𝑃𝐺 = 2! 4! 6! = 34,560 ways

In the last problem, what if all the marines must be together while the army soldiers need
not be necessarily together, how many ways are possible?

𝑃𝐺 = 7! 4! = 120,960 ways
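The permutation counts above can be checked with the standard library's math module (math.perm and math.comb require Python 3.8+); an illustrative sketch:

```python
from math import comb, factorial, perm

print(perm(10, 3))                                     # Jai Alai bets -> 720
print(factorial(11) // (factorial(3) * factorial(3)))  # PHILIPPINES  -> 1,108,800
print(factorial(6 - 1))                                # 6 seats in a circle -> 120
print(factorial(6 - 1) // factorial(2))                # circle, 2 identical -> 60
print(comb(6, 4) * factorial(4 - 1))                   # 4 of 6 in a circle -> 90
print(factorial(2) * factorial(4) * factorial(6))      # two blocks in line -> 34,560
print(factorial(7) * factorial(4))                     # marines together  -> 120,960
```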

COMBINATIONS

A combination also concerns arrangements, but without regard to order. This means that the order or arrangement in which the elements are taken is not important. For example, ab is the same as ba, and abc is the same arrangement as acb or bac. Also, the number of combinations of the letters a, b, and c taken 2 at a time is 3, and these are ab, ac, and bc. We note that ab = ba, ac = ca, and bc = cb.

Combinations from n distinct objects taken r at a time


nCr = nPr / r! = n! / (r! (n − r)!)

Illustration:
How many number combinations are there in Lotto 6/49? (Note that each combination
consists of 6 different numbers selected from 1 to 49 wherein the arrangement is not
important)
49C6 = 49P6/6! = 49!/(6!(49 − 6)!) = 13,983,816 combinations
6!

A semiconductor company will hire 7 men and 4 women. In how many ways can the
company choose from 9 men and 6 women who qualified for the positions?

𝐶 = 9𝐶7 × 6𝐶4 = 540 ways


Combinations of n things taken 1, 2, 3, 4, ..., or n at a time (at least one thing chosen)

C = 2^n − 1
Illustration:

In how many ways can you fill a box with book(s) if you can choose from 6 different books?

𝐶 = 26 − 1 = 63 ways
PARTITION

A partition is defined as the division of things into smaller groups and is counted by the formula

D = n! / (g1! g2! ⋯ × r1! r2! ⋯)

where g1, g2 = the numbers of members in group 1, group 2, and so on
      r1, r2 = the numbers of groups having the same number of members, respectively

Illustration:
In how many ways can you divide 17 students into 5 committees, namely: 2 committees with 2 members each, 3 committees with 3 members each, and the last, a 4-member committee?

D = 17! / (2! 2! 3! 3! 3! 4! × 2! 3! 1!) = 1,429,428,000 ways

(The factor 2! 3! 1! removes the over-counting that comes from interchanging the two 2-member committees among themselves and the three 3-member committees among themselves.)
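A sketch verifying the combination and partition counts with math.comb and math.factorial (illustrative, not part of the module; the partition line reflects the division by 2!·3!·1! discussed above):

```python
from math import comb, factorial

print(comb(49, 6))              # Lotto 6/49            -> 13,983,816
print(comb(9, 7) * comb(6, 4))  # 7 men and 4 women     -> 540
print(2**6 - 1)                 # nonempty book choices -> 63

# Partition of 17 students into committees of sizes 2, 2, 3, 3, 3, 4
g = factorial(2)**2 * factorial(3)**3 * factorial(4)   # group sizes
r = factorial(2) * factorial(3) * factorial(1)         # repeated group sizes
print(factorial(17) // (g * r))                        # -> 1,429,428,000
```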

Supplementary Examples

Solve for the following counting problems:

1. In a city, the bus route numbers consist of a natural number less than 100, followed
by one of the letters A, B, C, D, E and F. How many different bus routes are
possible?
2. There are 3 questions in a question paper. If the questions have 4, 3, and 2 solutions respectively, find the total number of solutions.
3. In how many ways can an animal trainer arrange 5 lions and 4 tigers in a row so
that no two lions are together?
4. In how many ways can the letters of the word LEADING be arranged in such a way that the vowels always come together?
5. In how many ways can 4 girls and 5 boys be arranged in a row so that all the girls
are together?
6. 12 points lie on a circle. How many cyclic quadrilaterals can be drawn by using
these points?
7. In a box, there are 5 black pens, 3 white pens and 4 red pens. In how many ways
can 2 black pens, 2 white pens and 2 red pens can be chosen?
8. A question paper consists of 10 questions divided into two parts A and B. Each
part contains five questions. A candidate is required to attempt six questions in all
of which at least 2 should be from part A and at least 2 from part B. In how many
ways can the candidate select the questions if he can answer all questions equally
well?
9. A committee of 5 persons is to be formed from 6 men and 4 women. In how many
ways can this be done when:
a. At least 2 women are included?
b. At most 2 women are included?
10. The Indian Cricket team consists of 16 players. It includes 2 wicket keepers and 5
bowlers. In how many ways can a cricket eleven be selected if we have to select
1 wicket keeper and at least 4 bowlers?

Activity

1. An order for a personal digital assistant can specify any one of five memory sizes,
any one of three types of displays, any one of four sizes of a hard disk, and can
either include or not include a pen tablet. How many different systems can be
ordered?

2. In a manufacturing operation, a part is produced by machining, polishing, and
painting. If there are three machine tools, four polishing tools, and three painting
tools, how many different routings (consisting of machining, followed by polishing,
and followed by painting) for a part are possible?

3. New designs for a wastewater treatment tank have proposed three possible
shapes, four possible sizes, three locations for input valves, and four locations for
output valves. How many different product designs are possible?

4. A byte is a sequence of eight bits and each bit is either 0 or 1.


a. How many different bytes are possible?
b. If the first bit of a byte is a parity check, that is, the first byte is determined from
the other seven bits, how many different bytes are possible?

5. In the design of an electromechanical product, seven different components are to


be stacked into a cylindrical casing that holds 12 components in a manner that
minimizes the impact of shocks. One end of the casing is designated as the bottom
and the other end is the top.
a. How many different designs are possible?
b. If the seven components are all identical, how many different designs are
possible?
c. If the seven components consist of three of one type of component and four of
another type, how many different designs are possible?

Session 4
APPROACHES TO AND RULES OF PROBABILITY

By the end of this session, you should be able to:

1. Identify the different events that exist in probability problems and differentiate
them from one another when applicable.
2. Apply the different probability rules when computing probabilities.

Lecture:

APPROACHES TO PROBABILITY

There are two ways to approach probability:

1. Marginal Probability
2. Conditional Probability

MARGINAL PROBABILITY

Marginal probability is the probability of a single event without consideration of any other
event. Marginal probability is also called simple probability.

Suppose 40 students in a certain school were asked whether they are in favor of
or against the wearing of school uniform. The table below gives a two-way classification
of the responses of these 40 students.

In Favor Against
Male 6 9
Female 17 8

The four marginal probabilities are calculated as follows:

P(male) = (number of males)/(total number of students) = 15/40 = 0.375

As we can observe, the probability that a male will be selected is obtained by dividing the total of the row labeled "Male" (15) by the grand total (40).

Similarly,

P(female) = (number of females)/(total number of students) = 25/40 = 0.625

P(in favor) = (number in favor)/(total number of students) = 23/40 = 0.575

P(against) = (number against)/(total number of students) = 17/40 = 0.425

Let I = the event of having the opinion "in favor", A = the event of having the opinion "against", M = the event of a male student, and F = the event of a female student.

Then,

P(I) = 23/40 = 0.575 P(M) = 15/40 = 0.375

P(A) = 17/40 = 0.425 P(F) = 25/40 = 0.625

CONDITIONAL PROBABILITY

Conditional probability is the probability that an event will occur given that another
event has already occurred. If A and B are two events, then the conditional probability of
A is written as P(A/B) and read as “the probability of A given that B has already
occurred.”

P(A/B) = P(A ∩ B) / P(B)

Suppose we are asked to compute P(in favor/male) in the two-way classification of the responses of the 40 students regarding the wearing of school uniform.

P(in favor/male) = P(in favor ∩ male) / P(male) = (6/40) / (15/40) = 6/15 = 0.40

Similarly, the probability that a randomly selected student is against (given that this student is female) is

P(against/female) = P(against ∩ female) / P(female) = (8/40) / (25/40) = 8/25 = 0.32

EVENTS ON PROBABILITY

The following events are commonly observed when solving probability problems:

1. Mutually Exclusive Events


Events that cannot occur together are mutually exclusive events. Such events do
not have any common outcomes. If two or more are mutually exclusive, then at
most one of them will occur every time we repeat the experiment. Thus, the
occurrence of one event excludes the occurrence of the other event or events.

2. Independent and Dependent Events


Two events are said to be independent if the occurrence of one does not affect the probability of the occurrence of the other. In contrast, dependent events are those whose probabilities affect each other.

In other words, A and B are independent events if either

P(A/B) = P(A) or P(B/A) = P(B).

On the other hand, the two events are dependent if either

P(A/B) ≠ P(A) or P(B/A) ≠ P(B).

3. Complementary Events
Because two complementary events, taken together, include all the outcomes for
an experiment and because the sum of the probabilities of all outcomes is 1, it is
obvious that P(A) + P(A’) = 1.

PROBABILITY RULES

MULTIPLICATION RULE

The probability that events A and B can happen together or the probability of the
intersection of two events A and B is called the joint probability of A and B and is
written as P(A and B).

The probability of the intersection of two events is obtained by multiplying the marginal
probability of one event by the conditional probability of the second event. This rule is
called the multiplication rule.

The probability of the intersection of two events A and B or the joint probability of
A and B is

𝑷(𝑨 𝒂𝒏𝒅 𝑩) = 𝑷(𝑨 ∩ 𝑩) = 𝑷(𝑨𝑩) = 𝑷(𝑨)𝑷(𝑩/𝑨)

𝑷(𝑨 𝒂𝒏𝒅 𝑩) = 𝑷(𝑨 ∩ 𝑩) = 𝑷(𝑨𝑩) = 𝑷(𝑩)𝑷(𝑨/𝑩)

for dependent events.

If events A and B are deemed independent from each other, then,

𝑷(𝑨 𝒂𝒏𝒅 𝑩) = 𝑷(𝑨)𝑷(𝑩).

Example 1:
A lot contains 40 batteries, 4 of which are defective. If 2 batteries are selected at random
(without replacement) from the lot, what is the probability that
a) both are defective?
b) both are nondefective?
c) the first is defective but the second is not?
d) If two batteries are selected out of 40 with replacement where 4 are defective, what
is the probability that both are defective?

Let us define the following events:


A = event that the first battery selected is good;
B = event that the first battery selected is defective;
C = event that the second battery selected is good;
D = event that the second battery selected is defective.

a) We are to calculate the joint probability of dependent events B and D, which is


given by P(B and D) = P(B) P(D/B). As we know, there are 4 defective batteries in 40. Consequently, the probability of selecting a defective battery at the first selection is

P(B) = 4/40 = 0.10.

To calculate the probability P(D/B), we know that the first batteries selected is
defective because B has already occurred. Because the selections are made
without replacement, there are 39 total batteries and 3 of them are defective at the
time of the second selection. Therefore,

P(D/B) = 3/39

Hence, the required probability is

P(B and D) = P(B) P(D/B) = (4/40)(3/39) = 0.0077.

b) Let the dependent events A and C = the event of selecting 2 non-defective (or good) batteries. Since there are 36 good batteries, then

P(A and C) = P(A) × P(C/A) = (36/40)(35/39) = 21/26

c) Let the dependent events B and C = the event of selecting a defective first battery and a nondefective second battery, then

P(B and C) = P(B) × P(C/B) = (4/40)(36/39) = 6/65

d) Let the independent events B and D = the event of selecting 2 defective batteries (with replacement). Then

P(B and D) = P(B) × P(D) = (4/40)(4/40) = 1/100
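The four joint probabilities can be verified with exact fractions; a small sketch (not part of the module):

```python
from fractions import Fraction as F

# Without replacement (dependent events)
print(F(4, 40) * F(3, 39))    # both defective       -> 1/130 ≈ 0.0077
print(F(36, 40) * F(35, 39))  # both good            -> 21/26
print(F(4, 40) * F(36, 39))   # defective then good  -> 6/65
# With replacement (independent events)
print(F(4, 40) * F(4, 40))    # both defective       -> 1/100
```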

Example 2:
The probability that a patient is allergic to penicillin is 0.30. Suppose the drug is
administered to three patients.

a) Find the probability that all three of them are allergic to it.
b) Find the probability that at least one of them is not allergic to it.

Let A, B and C denote the events that the first, second, and third patients are allergic to
penicillin, respectively.
a) We are to find the joint probability of A, B, and C. All three events are independent because whether or not one patient is allergic does not depend on whether or not any of the other patients is allergic. Hence,

P(A and B and C) = P(A) P(B) P(C) = (0.30)(0.30)(0.30) = 0.027

b) Let us define the following events: D = all three patients are allergic; E = at least one patient is not allergic.

"At least one patient is not allergic" means that one, two, or all three of the patients are not allergic; the only case excluded is that all three patients are allergic. Notice that events D and E are complementary events, and that D consists of the intersection of events A, B, and C. Hence, from part (a),

P(D) = P(A and B and C) = 0.027

Therefore using the complementary event rule, we get

P(E) = 1 – P(D) = 1 – 0.027 = 0.973

ADDITION RULE

The method used to calculate the probability of the union of events is called the addition
rule. It is defined as follows

The probability of the union of two events A and B is

𝑷(𝑨 𝒐𝒓 𝑩) = 𝑷(𝑨) + 𝑷(𝑩) − 𝑷(𝑨 𝒂𝒏𝒅 𝑩)

Thus, to calculate the probability of the union of two events A and B, we add their
marginal probabilities and subtract their joint probability from this sum. We must subtract
the joint probability of A and B from the sum of their marginal probabilities to avoid
double counting due to common outcomes in A and B.

If A and B are mutually exclusive or disjoint events, 𝑷(𝑨 𝒂𝒏𝒅 𝑩) = 𝟎, then

𝑷(𝑨 𝒐𝒓 𝑩) = 𝑷(𝑨) + 𝑷(𝑩)

Example 1:
In a college graduating class of 100 students, 58 studied math, 70 studied history, and 30 studied both math and history. If one student is selected at random, find the probability that

a) the student takes math or history;
b) the student does not take either of these subjects;
c) the student takes history but not math.

In the Venn diagram that depicts the given situation, let M be the event that a student takes Math and H be the event that a student takes History. Since there are 30 who studied both subjects and math has 58 students in all, there are 58 − 30 = 28 students who studied math alone. Also, there are 70 − 30 = 40 students who studied history alone.

(Venn diagram: M only = 28, M ∩ H = 30, H only = 40.)
a) Let M and H be the events that the student studied math and history respectively; then

P(M) = 58/100 = 0.58 and P(H) = 70/100 = 0.70.

If M ∪ H denotes the event that a student studied math or history, then

P(M or H) = P(M) + P(H) − P(M and H) = 58/100 + 70/100 − 30/100 = 98/100 or 0.98

b) Let (M or H)′ denote the event that the student did not study either of the two subjects. Then, using the complementary event rule, we have

P((M or H)′) = 1 − P(M or H) = 1 − 0.98 = 0.02

c) Let A be the event that the student takes history but not math. Then, based on the Venn diagram, we have

P(A) = 40/100 = 0.40

Example 2:
The table below lists the history of 940 wafers in a semiconductor manufacturing process.
Suppose one wafer is selected at random. Let H denote the event that the wafer contains
high levels of contamination. And C denote the event that the wafer is in the center of a
sputtering tool.

                     Location in Sputtering Tool
Contamination        Center      Edge      Total
Low                    514         68        582
High                   112        246        358
Total                  626        314        940

Since P(H) = 358/940 and P(C) = 626/940, the probability that the selected wafer contains high levels of contamination or is in the center of the sputtering tool is

P(H or C) = (358/940) + (626/940) – (112/940) = 872/940

Example 3:
The Cost Less Clothing Store carries seconds in slacks. If you buy a pair of slacks in your
regular waist size without trying them on, the probability that the waist will be too tight is
0.30 and the probability that it will be too loose is 0.10

If you choose a pair of slacks at random in your regular waist size, what is the probability
that the waist will be too tight or too loose?

Based on the given in the problem,

P(too tight) = 0.30


P(too loose) = 0.10
Since these events are mutually exclusive, then,

P(too tight or too loose) = P(too tight) + P(too loose) = 0.30 + 0.10 = 0.40

TOTAL PROBABILITY RULE

Suppose A1, A2, A3, and A4 are mutually exclusive and exhaustive events that together complete a sample space, and B is an event that intersects each Ai, so that the pieces Ai ∩ B form four other mutually exclusive events. By the total probability rule,

P(B) = P(A1 ∩ B) + P(A2 ∩ B) + P(A3 ∩ B) + P(A4 ∩ B)

But since P(A1 ∩ B) = P(A1)P(B/A1) according to conditional probability, therefore,

P(B) = P(A1)P(B/A1) + P(A2)P(B/A2) + P(A3)P(B/A3) + P(A4)P(B/A4)

Example:
Suppose that in a semiconductor manufacturing, the probability is 0.10 that a chip that is
subjected to high levels of contamination during manufacturing causes a product failure.
The probability is 0.005 that a chip that is not subjected to high contamination levels
during manufacturing causes a product failure. In a particular production run, 20% of the
chips are subject to high levels of contamination. What is the probability that a product
using one of these chips fails?

Let F denote the event that the product fails and H denote the event that the chip is
exposed to high levels of contamination.

Based on the given in the problem,

𝑃(𝐹/𝐻) = 0.10 𝑃(𝐹/𝐻′) = 0.005 𝑃(𝐻) = 0.20.

Since 𝑃(𝐻′ ) = 1 − 𝑃(𝐻) = 1 − 0.20 = 0.80,

P(F) = P(H)P(F/H) + P(H′)P(F/H′)

P(F) = (0.20)(0.10) + (0.80)(0.005) = 0.024

BAYES' THEOREM

If we have two events A and B and we are given the conditional probability of A given B
known as P(A/B), we can use Bayes’ Theorem to find P(B/A) using the formula

P(B/A) = P(B) P(A/B) / [P(B) P(A/B) + P(B′) P(A/B′)]

Example:
In a factory, there are two machines manufacturing bolts. Machine A manufactures 75%
of the bolts and the Machine B manufactures the remaining 25%. From machine A, 5%
of the bolts are defective and from machine B, 8% of the bolts are defective. A bolt is
selected at random, what is the probability that the bolt came from machine A, given that
it is defective?

Let 𝐴 denote the event that a bolt came from machine A


𝐵 denote the event that a bolt came from machine B
𝐷 denote the event that a bolt is defective

Based on the given in the problem,

𝑃(𝐴) = 0.75 𝑃(𝐵) = 0.25 𝑃(𝐷/𝐴) = 0.05 𝑃(𝐷/𝐵) = 0.08

Then,

P(A/D) = P(A) P(D/A) / [P(A) P(D/A) + P(A′) P(D/A′)]

P(A/D) = P(A) P(D/A) / [P(A) P(D/A) + P(B) P(D/B)]

P(A/D) = (0.75)(0.05) / [(0.75)(0.05) + (0.25)(0.08)] = 0.6522
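The total probability rule and Bayes' theorem combine naturally in code. Below is a hedged sketch (function names are illustrative, not from the module) that reproduces the chip-failure and defective-bolt answers:

```python
def total_probability(priors, likelihoods):
    """P(B) = sum over i of P(Ai) * P(B|Ai)."""
    return sum(p * l for p, l in zip(priors, likelihoods))

def bayes(prior, likelihood, total):
    """P(Ai|B) = P(Ai) * P(B|Ai) / P(B)."""
    return prior * likelihood / total

# Chip-failure example: P(F) = P(H)P(F|H) + P(H')P(F|H')
print(total_probability([0.20, 0.80], [0.10, 0.005]))  # 0.024

# Bolt example: P(A|D), with P(D) from the total probability rule
p_d = total_probability([0.75, 0.25], [0.05, 0.08])
print(round(bayes(0.75, 0.05, p_d), 4))                # 0.6522
```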

Activity

1. A box contains six 40-W bulbs, five 60-W bulbs, and four 75-W bulbs. If bulbs are
selected one by one in random order, what is the probability that at least two bulbs
must be selected to obtain one that is rated 75 W?

2. Human visual inspection of solder joints on printed circuit boards can be very
subjective. Part of the problem stems from the numerous types of solder defects
(e.g., pad nonwetting, knee visibility, voids) and even the degree to which a joint
possesses one or more of these defects. Consequently, even highly trained
inspectors can disagree on the disposition of a particular joint. In one batch of
10,000 joints, inspector A found 724 that were judged defective, inspector B found
751 such joints, and 1159 of the joints were judged defective by at least one of the
inspectors. Suppose that one of the 10,000 joints is randomly selected.
a. What is the probability that the selected joint was judged to be defective by
neither of the two inspectors?
b. What is the probability that the selected joint was judged to be defective by
inspector B but not by inspector A?

3. The route used by a certain motorist in commuting to work contains two


intersections with traffic signals. The probability that he must stop at the first signal
is .4, the analogous probability for the second signal is .5, and the probability that
he must stop at at least one of the two signals is .6. What is the probability that he
must stop
a. At both signals?
b. At the first signal but not at the second one?
c. At exactly one signal?

4. Semiconductor lasers used in optical storage products require higher power levels
for write operations than for read operations. High-power-level operations lower
the useful life of the laser.
Lasers in products used for backup of higher speed magnetic disks primarily write,
and the probability that the useful life exceeds five years is 0.95. Lasers that are in
products that are used for main storage spend approximately an equal amount of
time reading and writing, and the probability that the useful life exceeds five years
is 0.995. Now, 25% of the products from a manufacturer are used for backup and
75% of the products are used for main storage.
Let A denote the event that a laser’s useful life exceeds five years, and let B denote
the event that a laser is in a product that is used for backup. Use a tree diagram to
determine the following:

a. 𝑃(𝐵)
b. 𝑃(𝐴/𝐵)
c. 𝑃(𝐴/𝐵′)
d. 𝑃(𝐴 ∩ 𝐵)
e. 𝑃(𝐴 ∩ 𝐵′)
f. 𝑃(𝐴)
g. What is the probability that the useful life of a laser exceeds five years?
h. What is the probability that a laser that failed before five years came from a
product used for backup?

5. Samples of emissions from three suppliers are classified for conformance to air-
quality specifications. The results from 100 samples are summarized as follows:

                conforms
                Yes    No
supplier   1     22     8
           2     25     5
           3     30    10

Let A denote the event that a sample is from supplier 1, and let B denote the event
that a sample conforms to specifications.

a. Are events A and B independent?


b. Determine 𝑃(𝐵/𝐴).

Session 5
Joint Probability Distribution

By the end of this session, you should be able to:

1. Understand basic concepts of random variables.
2. Use joint probability mass functions to calculate probability.
3. Calculate marginal and conditional probability distributions from joint probability
distributions.

Lecture:

RANDOM VARIABLES

For a given sample space S of some experiment, a random variable (rv) is any rule that
associates a number with each outcome in S. In mathematical language, a random
variable is a function whose domain is the sample space and whose range is the set of
real numbers.

Random variables are customarily denoted by uppercase letters, such as X and Y, near
the end of our alphabet. In contrast to our previous use of a lowercase letter, such as x,
to denote a variable, we will now use lowercase letters to represent some particular value
of the corresponding random variable. The notation 𝑋(𝑠) = 𝑥 means that x is the value
associated with the outcome s by the rv X.

Example 1:

When a student calls a university help desk for technical support, he/she will either
immediately be able to speak to someone (S, for success) or will be placed on hold (F,
for failure). With S={S,F}, define an rv X by

𝑋 (𝑆 ) = 1 𝑋 (𝐹 ) = 0

The rv X indicates whether (1) or not (0) the student can immediately speak to someone.

The rv X in the example was specified by explicitly listing each element of S and the
associated number. Such a listing is tedious if S contains more than a few outcomes, but
it can frequently be avoided.

Example 2:

Consider the experiment in which a telephone number in a certain area code is dialed
using a random number dialer (such devices are used extensively by polling
organizations), and define an rv Y by

𝑌 = { 1  𝑖𝑓 𝑡ℎ𝑒 𝑠𝑒𝑙𝑒𝑐𝑡𝑒𝑑 𝑛𝑢𝑚𝑏𝑒𝑟 𝑖𝑠 𝑢𝑛𝑙𝑖𝑠𝑡𝑒𝑑
    { 0  𝑖𝑓 𝑡ℎ𝑒 𝑠𝑒𝑙𝑒𝑐𝑡𝑒𝑑 𝑛𝑢𝑚𝑏𝑒𝑟 𝑖𝑠 𝑙𝑖𝑠𝑡𝑒𝑑 𝑖𝑛 𝑡ℎ𝑒 𝑑𝑖𝑟𝑒𝑐𝑡𝑜𝑟𝑦

For example, if 5282966 appears in the telephone directory, then 𝑌(5282966) = 0,


whereas 𝑌(7727350) = 1 tells us that the number 7727350 is unlisted. A word description
of this sort is more economical than a complete listing, so we will use such a description
whenever possible.

There are two types of random variables: discrete and continuous.

A discrete random variable is an rv whose possible values either constitute a finite set
or else can be listed in an infinite sequence in which there is a first element, a second
element, and so on (“countably” infinite).

A random variable is continuous if both of the following apply:

1. Its set of possible values consists either of all numbers in a single interval on the
number line (possibly infinite in extent, e.g., from −∞ to ∞) or all numbers in a
disjoint union of such intervals (e.g., [0,10] ∪ [20,30] ).
2. No possible value of the variable has positive probability, that is, P(X = c) = 0 for
any possible value c.

Given below is an example of a probability distribution (or probability mass function, pmf)
for a single rv X.

𝑥 0 1 2
𝑓(𝑥) 0.50 0.20 0.30

Sometimes, we're simultaneously interested in two or more variables in a random
experiment. We're looking for a relationship between the two variables.

Examples for discrete random variables:

 Year in college vs. Number of credits taken


 Number of cigarettes smoked per day vs. Day of the week

Examples for continuous random variables:

 Time when bus driver picks you up vs. Quantity of caffeine in bus driver's system
 Dosage of a drug (ml) vs. Blood compound measure (percentage)

In general, if X and Y are two random variables, the probability distribution that defines
their simultaneous behaviour is called a joint probability distribution.

Shown below is a table for two discrete random variables, which gives 𝑃(𝑋 = 𝑥, 𝑌 = 𝑦).

                𝒙
          1      2      3
𝒚   1     0     1/6    1/6
    2    1/6     0     1/6
    3    1/6    1/6     0

For two continuous random variables, the joint behaviour is described by a joint density surface 𝑓𝑋,𝑌(𝑥, 𝑦) over the (x, y) plane (figure omitted).

If X and Y are discrete, this distribution can be described with a joint probability mass
function.

If X and Y are continuous, this distribution can be described with a joint probability
density function.

Example of a Discrete joint pmf: Plastic Covers for CDs

Dimensions of plastic covers for CDs are measured. Measurements of the length and width
of rectangular plastic CD covers are rounded to the nearest mm, so they are considered
discrete.

Let X denote the length.

Let Y denote the width.

The possible values of X are 129, 130, and 131 mm while the possible values of Y are 15
and 16 mm. Thus, both X and Y are discrete. There are 6 possible pairs (X,Y). We show
the probability for each pair in the following table:

                    𝑥 = 𝑙𝑒𝑛𝑔𝑡ℎ
𝑦 = 𝑤𝑖𝑑𝑡ℎ      129     130     131
    15         0.12    0.42    0.06
    16         0.08    0.28    0.04

The sum of all probabilities is 1.0. The combination with the highest probability is (130,15).
The combination with the lowest probability is (131, 16).

The joint probability mass function is the function 𝑓𝑋𝑌 (𝑥, 𝑦) = 𝑃(𝑋 = 𝑥, 𝑌 = 𝑦). For
example, we have 𝑓𝑋𝑌 (129,15) = 0.12.

If we are given a joint probability distribution for X and Y, we can obtain the individual
probability distribution for X or Y and these are called the Marginal Probability
Distributions.

Example:

Referring to the example of plastic covers of CDs, find the probability that a CD cover has
a length of 129 mm (i.e. X=129).

                    𝑥 = 𝑙𝑒𝑛𝑔𝑡ℎ
𝑦 = 𝑤𝑖𝑑𝑡ℎ      129     130     131
    15         0.12    0.42    0.06
    16         0.08    0.28    0.04

𝑃(𝑋 = 129) = 𝑃(𝑋 = 129 𝑎𝑛𝑑 𝑌 = 15) + 𝑃(𝑋 = 129 𝑎𝑛𝑑 𝑌 = 16)
𝑃(𝑋 = 129) = 0.12 + 0.08 = 𝟎. 𝟐𝟎

What is the probability distribution of X?

                    𝑥 = 𝑙𝑒𝑛𝑔𝑡ℎ
𝑦 = 𝑤𝑖𝑑𝑡ℎ        129     130     131
    15           0.12    0.42    0.06
    16           0.08    0.28    0.04
Column totals    0.20    0.70    0.10

The probability distribution for X appears in the column totals, resulting in

𝑥         129     130     131
𝑓𝑋(𝑥)    0.20    0.70    0.10

We've used a subscript X in the probability mass function of X, or 𝑓𝑋 (𝑥), for clarification
since we're considering more than one variable at a time now.
We can do the same for the Y random variable:
                    𝑥 = 𝑙𝑒𝑛𝑔𝑡ℎ
𝑦 = 𝑤𝑖𝑑𝑡ℎ        129     130     131    Row totals
    15           0.12    0.42    0.06      0.60
    16           0.08    0.28    0.04      0.40
Column totals    0.20    0.70    0.10      1.00

Then, the probability distribution for Y appears as

𝑦 15 16
𝑓𝑌 (𝑦) 0.60 0.40

Because the probability mass functions for X and Y appear in the margins of the table
(i.e. column and row totals), they are often referred to as the Marginal Distributions for
X and Y. When there are two random variables of interest, we also use the term bivariate
probability distribution or bivariate distribution to refer to the joint distribution.

JOINT PROBABILITY MASS FUNCTION

The joint probability mass function of the discrete random variables X and Y, denoted as
𝑓𝑋𝑌 (𝑥, 𝑦), satisfies
1. 𝑓𝑋𝑌 (𝑥, 𝑦) ≥ 0
2. ∑𝑥 ∑𝑦 𝑓𝑋𝑌 (𝑥, 𝑦) = 1
3. 𝑓𝑋𝑌 (𝑥, 𝑦) = 𝑃(𝑋 = 𝑥, 𝑌 = 𝑦)

MARGINAL PROBABILITY MASS FUNCTION

If X and Y are discrete random variables with joint probability mass function 𝑓𝑋𝑌 (𝑥, 𝑦), then
the marginal probability mass functions of X and Y are

𝑓𝑋 (𝑥 ) = ∑𝑦 𝑓𝑋𝑌 (𝑥, 𝑦) and 𝑓𝑌 (𝑦) = ∑𝑥 𝑓𝑋𝑌 (𝑥, 𝑦)

Where the sum for 𝑓𝑋 (𝑥 ) is over all points in the range of (X,Y) for which 𝑋 = 𝑥 and the
sum for 𝑓𝑌 (𝑦) is over all points in the range of (X,Y) for which 𝑌 = 𝑦.

When asked for 𝐸(𝑋) or 𝑉(𝑋) (i.e. values related to only one of the two variables) but you are
given a joint probability distribution, first calculate the marginal distribution 𝑓𝑋(𝑥) and then
treat it like a univariate case (i.e. a single random variable).
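A small Python sketch (added as an illustration, assuming the CD-cover joint pmf above) shows how the marginal of X is obtained by summing over y, after which E(X) is computed exactly as in the univariate case:

    # Joint pmf keyed by (x, y) = (length, width)
    joint = {
        (129, 15): 0.12, (130, 15): 0.42, (131, 15): 0.06,
        (129, 16): 0.08, (130, 16): 0.28, (131, 16): 0.04,
    }

    f_x = {}
    for (x, y), p in joint.items():   # sum over y for each x
        f_x[x] = f_x.get(x, 0.0) + p

    print(f_x)                        # about {129: 0.20, 130: 0.70, 131: 0.10}
    mean_x = sum(x * p for x, p in f_x.items())
    print(mean_x)                     # E(X) is about 129.9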

Example:

Suppose that 2 batteries are randomly chosen without replacement from the following
group of 12 batteries: 3 new, 4 used but working and 5 defective. Find 𝑓𝑋𝑌 (𝑥, 𝑦)

Let X denote the number of new batteries chosen


Let Y denote the number of used batteries chosen

Solving for the problem:

Though X can take on values 0,1, and 2, and Y can take on values 0,1, and 2, when we
consider them jointly, 𝑋 + 𝑌 ≤ 2. So, not all combinations of (𝑋, 𝑌) are possible. But, there
are 6 possible cases:

CASE: no new, no used (so, all defective)

𝑓𝑋𝑌(0,0) = 5𝐶2 / 12𝐶2 = 10/66

CASE: no new, 1 used

𝑓𝑋𝑌(0,1) = (4𝐶1)(5𝐶1) / 12𝐶2 = 20/66

CASE: no new, 2 used

𝑓𝑋𝑌(0,2) = 4𝐶2 / 12𝐶2 = 6/66

CASE: 1 new, no used

𝑓𝑋𝑌(1,0) = (3𝐶1)(5𝐶1) / 12𝐶2 = 15/66

CASE: 2 new, no used

𝑓𝑋𝑌(2,0) = 3𝐶2 / 12𝐶2 = 3/66

CASE: 1 new, 1 used

𝑓𝑋𝑌(1,1) = (3𝐶1)(4𝐶1) / 12𝐶2 = 12/66

The joint distribution for X and Y becomes

                            𝑥 = 𝑛𝑜. 𝑜𝑓 𝑛𝑒𝑤 𝑐ℎ𝑜𝑠𝑒𝑛
                             0        1        2
𝑦 = 𝑛𝑜. 𝑜𝑓 𝑢𝑠𝑒𝑑      0    10/66    15/66    3/66
𝑐ℎ𝑜𝑠𝑒𝑛               1    20/66    12/66      0
                     2     6/66       0       0

There are 6 possible (X,Y) pairs and ∑𝑥 ∑𝑦 𝑓𝑋𝑌 (𝑥, 𝑦) = 1.
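The six cases can also be generated in one loop; the Python sketch below (an added illustration) rebuilds the joint pmf with math.comb and confirms the probabilities sum to 1:

    from math import comb

    # f_XY(x, y) = C(3,x) C(4,y) C(5, 2-x-y) / C(12,2): x new, y used, rest defective
    new, used, defective, n = 3, 4, 5, 2

    joint = {}
    for x in range(n + 1):
        for y in range(n + 1 - x):
            joint[(x, y)] = comb(new, x) * comb(used, y) * comb(defective, n - x - y) / comb(12, n)

    print(joint[(0, 0)])        # 10/66, about 0.1515
    print(sum(joint.values()))  # 1.0 (up to floating-point rounding)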

CONDITIONAL PROBABILITY DISTRIBUTIONS

Recall that for events A and B,

𝑃(𝐴/𝐵) = 𝑃(𝐴 ∩ 𝐵) / 𝑃(𝐵)

We now apply this conditioning to random variables X and Y. Given random variables X
and Y with joint probability 𝑓𝑋𝑌 (𝑥, 𝑦), the conditional probability distribution of Y given 𝑋 =
𝑥 is

𝑓𝑌|𝑥(𝑦) = 𝑓𝑋𝑌(𝑥, 𝑦) / 𝑓𝑋(𝑥)     for 𝑓𝑋(𝑥) > 0.

The conditional probability can be stated as the joint probability over the marginal
probability.

Example:

From the previously introduced example on plastic covers of CDs

                    𝑥 = 𝑙𝑒𝑛𝑔𝑡ℎ
𝑦 = 𝑤𝑖𝑑𝑡ℎ        129     130     131    Row totals
    15           0.12    0.42    0.06      0.60
    16           0.08    0.28    0.04      0.40
Column totals    0.20    0.70    0.10      1.00

a) Find the probability that a CD cover has a length of 130 mm GIVEN that the width
is 15 mm.

𝑃(𝑋 = 130|𝑌 = 15) = 𝑃(𝑋 = 130, 𝑌 = 15) / 𝑃(𝑌 = 15) = 0.42 / 0.60 = 𝟎.𝟕𝟎

b) Find the conditional distribution of X given that Y=15.

𝑃(𝑋 = 129|𝑌 = 15) = 0.12 / 0.60 = 𝟎.𝟐𝟎

𝑃(𝑋 = 130|𝑌 = 15) = 0.42 / 0.60 = 𝟎.𝟕𝟎

𝑃(𝑋 = 131|𝑌 = 15) = 0.06 / 0.60 = 𝟎.𝟏𝟎

Once we're given that Y=15, we're in a "different" space: from the subset of the covers
with a width of 15 mm, how are the lengths (X) distributed?

The conditional distribution of X given 𝑌 = 15, or 𝑓𝑋|𝑌=15(𝑥):

𝑥                129     130     131
𝑓𝑋|𝑌=15(𝑥)      0.20    0.70    0.10

The sum of the probabilities is 1 and this is a legitimate distribution.
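The conditional distribution can be computed the same way in a short Python sketch (illustrative; it repeats the joint table above):

    joint = {
        (129, 15): 0.12, (130, 15): 0.42, (131, 15): 0.06,
        (129, 16): 0.08, (130, 16): 0.28, (131, 16): 0.04,
    }

    f_y15 = sum(p for (x, y), p in joint.items() if y == 15)        # marginal f_Y(15) = 0.60
    cond = {x: p / f_y15 for (x, y), p in joint.items() if y == 15}
    print(cond)   # about {129: 0.20, 130: 0.70, 131: 0.10}; the values sum to 1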

Activity

1. Show that the following function satisfies the properties of a joint probability
mass function.

2. In problem 1, determine the probabilities of the following:

a. 𝑃(𝑋 < 2.5, 𝑌 < 3)


b. 𝑃(𝑋 < 2.5)
c. 𝑃(𝑌 < 3)
d. 𝑃(𝑋 > 1.8, 𝑌 > 4.7)

3. In the same problem 1, determine:

a. The marginal probability distribution of the random variable X.


b. The conditional probability distribution of Y given that X=1.5.
c. The conditional probability distribution of X given that Y=2.
d. Are X and Y independent?

4. Determine the value of c that makes the function 𝑓(𝑥, 𝑦) = 𝑐(𝑥 + 𝑦) a joint
probability mass function over the nine points with x=1,2,3 and y=1,2,3.
5. In problem 4, determine the probabilities of the following:

a. 𝑃(𝑋 = 1, 𝑌 < 4)
b. 𝑃(𝑋 = 1)
c. 𝑃(𝑌 = 2)
d. 𝑃(𝑋 < 2, 𝑌 < 2)

6. In the same problem 4, determine:

e. The marginal probability distribution of the random variable X.


f. The conditional probability distribution of Y given that X=1.
g. The conditional probability distribution of X given that Y=2.
h. Are X and Y independent?
Discrete Probability Distributions
____________________________________________________
MODULE 3

Session 1
Introduction to Discrete Random Variables and Discrete Probability
Distributions

By the end of this session, you should be able to:


1. Differentiate discrete random variables and continuous random variables
2. Determine probability distributions of discrete random variables
3. Calculate means and variances of discrete random variables

Lecture:

RANDOM VARIABLES

In a random experiment, it is sometimes useful to associate a number with each outcome


in the sample space. Because the particular outcome of the experiment is not known in
advance, the resulting value of our variable is not known in advance. For this reason, the
variable that associates a number with the outcome of a random experiment is referred
to as a random variable.

A random variable is a function that assigns a real number to each outcome in the sample
space of a random experiment. A random variable is denoted by an uppercase letter such
as X. After an experiment is conducted, the measured value of the random variable is
denoted by a lowercase letter.

So we would often encounter the equation 𝑋 = 𝑥.

Sometimes a measurement (such as current in a copper wire or length of a machined


part) can assume any value in an interval of real numbers (at least theoretically). Then
arbitrary precision in the measurement is possible. Of course, in practice, we might round
off to the nearest tenth or hundredth of a unit. The random variable that represents this
measurement is said to be a continuous random variable. The range of the random
variable includes all values in an interval of real numbers; that is, the range can be thought
of as a continuum.

In other experiments, we might record a count such as the number of transmitted bits that
are received in error. Then the measurement is limited to integers. Or we might record
that a proportion such as 0.0042 of the 10,000 transmitted bits were received in error.
Then the measurement is fractional, but it is still limited to discrete points on the real line.
Whenever the measurement is limited to discrete points on the real line, the random
variable is said to be a discrete random variable.

A discrete random variable is a random variable with a finite (or countably infinite) range.
A continuous random variable is a random variable with an interval (either finite or infinite) of real
numbers for its range.

In some cases, the random variable X is actually discrete but, because the range of
possible values is so large, it might be more convenient to analyze X as a continuous
random variable. For example, suppose that current measurements are read from a
digital instrument that displays the current to the nearest one-hundredth of a milliampere.
Because the possible measurements are limited, the random variable is discrete.
However, it might be a more convenient, simple approximation to assume that the current
measurements are values of a continuous random variable.

Examples of continuous random variables:


electrical current, length, pressure, temperature, time, voltage, weight
Examples of discrete random variables:
number of scratches on a surface, proportion of defective parts among 1000 tested,
number of transmitted bits received in error.

Illustrative examples of discrete random variables:

Example 1: A voice communication system for a business contains 48 external lines. At


a particular time, the system is observed, and some of the lines are being used. Let the
random variable X denote the number of lines in use. Then, X can assume any of the
integer values 0 through 48. When the system is observed, if 10 lines are in use, x = 10.

Example 2: In a semiconductor manufacturing process, two wafers from a lot are tested.
Each wafer is classified as pass or fail. Assume that the probability that a wafer passes
the test is 0.8 and that wafers are independent. Then the random variable X is the number
of wafers that passed the test.

Example 3: Define the random variable X to be the number of contamination particles on


a wafer in semiconductor manufacturing. Although wafers possess a number of
characteristics, the random variable X summarizes the wafer only in terms of the number
of particles. The possible values of X are integers from zero up to some large value that
represents the maximum number of particles that can be found on one of the wafers. If
this maximum number is very large, we might simply assume that the range of X is the
set of integers from zero to infinity.

DISCRETE PROBABILITY DISTRIBUTION

The probability distribution of a random variable X is a description of the probabilities


associated with the possible values of X.

For a discrete random variable, the distribution is often specified by just a list of the
possible values along with the probability of each. In some cases, it is convenient to
express the probability in terms of a formula.

Example 4: In a semiconductor manufacturing process, two wafers from a lot are tested.
Each wafer is classified as pass or fail. Assume that the probability that a wafer passes
the test is 0.8 and that wafers are independent. If the discrete random variable X
represents the number of wafers that passed the test, find the discrete probability
distribution of the random variable X.

Solution:
𝑓𝑜𝑟 𝑥 = 2;
𝑃(𝑝, 𝑝) = 0.8 𝑥 0.8 = 0.64
𝑓𝑜𝑟 𝑥 = 1;
𝑃(𝑝, 𝑓) = 0.8 𝑥 0.2 = 0.16
𝑃(𝑓, 𝑝) = 0.8 𝑥 0.2 = 0.16
𝑓𝑜𝑟 𝑥 = 0;
𝑃(𝑓, 𝑓) = 0.2 𝑥 0.2 = 0.04

The sample space for the experiment and the associated probabilities give the probability
distribution of X:

𝑥        0      1      2
𝑓(𝑥)    0.04   0.32   0.64

By the law of total probability, the probabilities sum to 1.

Mean or Expected Value of Discrete Random Variable


By definition, the mean is a measure of the center or the middle of the probability distribution.
The mean of a discrete random variable is also known as its expected value.

The mean or expected value of a discrete random variable is denoted as μ or E(X):

μ or E(X) = ∑ 𝑥𝑃(𝑥)
From the probability distribution of the Example 4,

𝜇 𝑜𝑟 𝐸 (𝑋) = (0)(0.04) + (1)(0.32) + (2)(0.64)


= 1.6

Variance and Standard Deviation of Discrete Random Variable


By definition variance is a measure of the dispersion, or variability in the distribution.
Standard deviation (σ) is the square root of the variance.

The variance of a discrete random variable is denoted as 𝝈² or 𝑽(𝑿):

𝝈𝟐 𝒐𝒓 𝑽(𝑿) = ∑(𝑥 − 𝜇)2 𝑃(𝑥)

From the probability distribution of Example 4,

𝝈𝟐 = (0 − 1.6)2 (0.04) + (1 − 1.6)2 (0.32) + (2 − 1.6)2 (0.64) = 𝟎.𝟑𝟐
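As a quick check, the following Python sketch (an added illustration) reproduces the mean and variance of Example 4:

    # Discrete pmf of X = number of wafers that pass the test
    pmf = {0: 0.04, 1: 0.32, 2: 0.64}

    mean = sum(x * p for x, p in pmf.items())
    var = sum((x - mean) ** 2 * p for x, p in pmf.items())
    print(mean, var)   # 1.6 and 0.32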

Example 5. Two new product designs are to be compared on the basis of revenue
potential. Marketing feels that the revenue from design A can be predicted quite
accurately to be $3 million. The revenue potential of design B is more difficult to assess.
Marketing concludes that there is a probability of 0.3 that the revenue from design B will
be $7 million, but there is a 0.7 probability that the revenue will be only $2 million. Which
design do you prefer?

Solution: Let X denote the revenue from design A.

Because there is no uncertainty in the revenue from design A, we can model the
distribution of the random variable X as $3 million with probability 1.

Therefore, 𝝁𝒙 or E(X) = $3 million

Let Y denote the revenue from design B.

Therefore, 𝝁𝒚 or E(Y) = $7(0.3) + $2(0.7) = $3.5 million

Because 𝜇𝑦 exceeds 𝜇𝑥 , we might prefer design B.

However, the variability of the result from design B is larger. That is,
𝝈𝟐 = (7 − 3.5)2 (0.3) + (2 − 3.5)2 (0.7)
= 𝟓. 𝟐𝟓 𝐦𝐢𝐥𝐥𝐢𝐨𝐧 𝐝𝐨𝐥𝐥𝐚𝐫𝐬 𝐬𝐪𝐮𝐚𝐫𝐞𝐝

Because the units of the variables in this example are millions of dollars, and because the
variance of a random variable squares the deviations from the mean, the units of variance
are millions of dollars squared. These units make interpretation difficult.

Because the units of standard deviation are the same as the units of the random variable,
the standard deviation is easier to interpret.

In this example, we can summarize our results as “the average deviation of Y from its
mean is 𝝈 = $2.29 million.’’

This is quite a large deviation, and it may therefore make design A preferable.

Example 6. The number of messages sent per hour over a computer network has the
following distribution:

𝑥 = messages per hour     10     11     12     13     14     15
𝑓(𝑥)                     0.08   0.15   0.30   0.20   0.20   0.07

Determine the mean and standard deviation of the number of messages sent per hour.
𝜇 = (10)(0.08) + (11)(0.15) + (12)(0.3) + (13)(0.2)
+(14)(0.2) + (15)(0.07)
𝝁 = 𝟏𝟐. 𝟓

𝝈𝟐 = (−2.5)2 (0.08) + (−1.5)2 (0.15) + (−0.5)2 (0.3) + (0.5)2 (0.2) + (1.5)2 (0.2) + (2.5)2 (0.07)
𝝈𝟐 = 𝟑𝟕/𝟐𝟎 = 1.85
𝝈 = √(𝟑𝟕/𝟐𝟎) = 𝟏.𝟑𝟔

Activity:

1. Given the following probability distribution:

Find the following probabilities:


a. P(X≤2)
b. P(X≤-1 or X=2)

2. Marketing estimates that a new instrument for the analysis of soil samples will be
very successful, moderately successful, or unsuccessful, with probabilities 0.3,
0.6, and 0.1, respectively. The yearly revenue associated with a very successful,
moderately successful, or unsuccessful product is $10 million, $5 million, and $1
million, respectively. Let the random variable X denote the yearly revenue of the
product. Determine the probability distribution of the random variable.

3. If the range of X is the set {0, 1, 2, 3, 4} and P(X=x)=0.2, determine the mean and
variance of the random variable.

4. The range of the random variable X is {0, 1, 2, 3, 𝑥}, where 𝑥 is unknown. If each
value is equally likely and the mean of X is 6, determine 𝑥.

Session 2
Binomial Distributions

At the end of this session, you should be able to:


1. Identify binomial distributions
2. Calculate probabilities of binomial distributions
3. Calculate means and variances of binomial distributions

Lecture:

Consider the following random experiments and random variables:


 Flip a coin 10 times. Let X = number of heads obtained.
 A worn machine tool produces 1% defective parts. Let X = number of defective parts
in the next 25 parts produced.
 Each sample of air has a 10% chance of containing a particular rare molecule. Let
X = the number of air samples that contain the rare molecule in the next 18 samples
analyzed.
 Of all bits transmitted through a digital transmission channel, 10% are received in
error. Let X = the number of bits in error in the next five bits transmitted.
 A multiple choice test contains 10 questions, each with four choices, and you
guess at each question. Let X = the number of questions answered correctly.
 In the next 20 births at a hospital, let X = the number of female births.
 Of all patients suffering a particular illness, 35% experience improvement from a
particular medication. In the next 100 patients administered the medication, let X =
the number of patients who experience improvement.

Each of these random experiments can be thought of as consisting of a series of


repeated, random trials: 10 flips of the coin in experiment 1, the production of 25 parts in
experiment 2, and so forth. The random variable in each case is a count of the number of
trials that meet a specified criterion.

The outcome from each trial either meets the criterion that X counts or it does not;
consequently, each trial can be summarized as resulting in either a success or a failure.

For example, in the multiple choice experiment, for each question, only the choice that is
correct is considered a success. Choosing any one of the three incorrect choices results
in the trial being summarized as a failure.

A trial with only two possible outcomes is used so frequently as a building block of a
random experiment that it is called a Bernoulli trial.

PROBABILITY OF BINOMIAL DISTRIBUTION

It is usually assumed that the trials that constitute the random experiment are
independent. This implies that the outcome from one trial has no effect on the outcome
to be obtained from any other trial.

Furthermore, it is often reasonable to assume that the probability of a success in each
trial is constant. For instance, in the multiple choice experiment, if the test taker has no
knowledge of the material and just guesses at each question, we might assume that the
probability of a correct answer is ¼ for each question.

Example 1: The chance that a bit transmitted through a digital transmission channel is
received in error is 0.1. Also, assume that the transmission trials are independent. Let X
be the number of bits in error in the next four bits transmitted. Determine 𝑃(𝑥 = 2).

Solution: Let the letter E denote a bit in error, and let the letter O denote that the bit is
okay, that is, received without error. We can represent the outcomes of this experiment
as a list of four letters that indicate the bits that are in error and those that are okay.

The event that X = 2 consists of the six outcomes:


{𝐸𝐸𝑂𝑂, 𝐸𝑂𝐸𝑂, 𝐸𝑂𝑂𝐸, 𝑂𝐸𝐸𝑂, 𝑂𝐸𝑂𝐸, 𝑂𝑂𝐸𝐸}

Using the assumption that the trials are independent, the probability of {𝐸𝐸𝑂𝑂} is:
𝑃(𝐸𝐸𝑂𝑂) = 𝑃(𝐸 )𝑃(𝐸 )𝑃(𝑂)𝑃(𝑂) = (0.1)2 (0.9)2 = 0.0081

Also, any one of the six mutually exclusive outcomes for which X=2 has the same
probability of occurring. Therefore,
𝑃(𝑋 = 2) = 6(0.0081) = 0.0486

A random experiment consists of n Bernoulli trials such that:

1) The trials are independent.
2) Each trial results in only two possible outcomes, labeled as “success’’ and “failure’’.
3) The probability of a success in each trial, denoted as p, remains constant.

The random variable X that equals the number of trials that result in a success is a binomial
random variable with parameters 0 < 𝑝 < 1 and 𝑛 = 1, 2, 3, …

The probability mass function of X is

𝑃(𝑥) = (𝑛𝐶𝑥) 𝑝^𝑥 (1 − 𝑝)^(𝑛−𝑥)     𝑥 = 0, 1, 2, …, 𝑛
Example 2: Each sample of water has a 10% chance of containing a particular organic
pollutant. Assume that the samples are independent with regard to the presence of the
pollutant. Find the probability that in the next 18 samples, exactly 2 contain the pollutant.

Solution:

Let X be the number of samples that contain the pollutant in the next 18 samples
analyzed.

Then X is a binomial random variable with p = 0.1 and n =18.


𝑷(𝑿 = 𝟐) = (𝟏𝟖𝑪𝟐)(𝟎.𝟏)^𝟐 (𝟎.𝟗)^𝟏𝟔 = 𝟎.𝟐𝟖𝟒
From Example 2, determine the probability that at least four samples contain the pollutant.

Solution:
𝑃(𝑋 ≥ 4) = ∑_{x=4}^{18} (18𝐶𝑥)(0.1)^𝑥 (0.9)^(18−𝑥)

However, it is easier to use the complementary event,

𝑃(𝑋 ≥ 4) = 1 − 𝑃(𝑋 < 4) = 1 − ∑_{x=0}^{3} (18𝐶𝑥)(0.1)^𝑥 (0.9)^(18−𝑥)
         = 1 − [0.150 + 0.300 + 0.284 + 0.168] = 0.098
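The binomial probabilities above can be reproduced with a short Python sketch (an added illustration using only the standard library):

    from math import comb

    def binom_pmf(x, n, p):
        """P(X = x) for a binomial random variable with parameters n and p."""
        return comb(n, x) * p**x * (1 - p)**(n - x)

    # Example 2: exactly 2 polluted samples out of n = 18 with p = 0.1
    print(round(binom_pmf(2, 18, 0.1), 3))                        # 0.284

    # P(X >= 4) via the complementary event, as in the worked solution
    p_at_least_4 = 1 - sum(binom_pmf(x, 18, 0.1) for x in range(4))
    print(round(p_at_least_4, 3))                                 # 0.098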

MEAN AND VARIANCE OF BINOMIAL DISTRIBUTION


If X is a binomial random variable with parameters p and n,

𝜇 = 𝐸(𝑋) = 𝑛𝑝 and 𝜎 2 = 𝑉(𝑋) = 𝑛𝑝(1 − 𝑝)

Example 3. The chance that a bit transmitted through a digital transmission channel is
received in error is 0.1. Also, assume that the transmission trials are independent. Let X
be the number of bits in error in the next four bits transmitted. Find the mean and variance
of X.
𝝁 = 𝑬(𝑿) = 𝒏𝒑 = 𝟒(𝟎. 𝟏) = 𝟎. 𝟒
𝝈𝟐 = 𝑽(𝑿) = 𝒏𝒑(𝟏 − 𝒑) = 𝟒(𝟎. 𝟏)(𝟏 − 𝟎. 𝟏) = 𝟎. 𝟑𝟔

Activity:

1. The random variable X has a binomial distribution with n=10 and p=0.5. Determine
the following probabilities: P(X=5), P(X≤2), P(X≥9), P(3≤X<5). Also find mean and
variance of the random variable X.
2. Because not all airline passengers show up for their reserved seat, an airline sells
125 tickets for a flight that holds only 120 passengers. The probability that a
passenger does not show up is 0.10, and the passengers behave independently.
a. What is the probability that every passenger who shows up can take the
flight?
b. What is the probability that the flight departs with empty seats?

Session 3
Geometric and Negative Binomial Distributions

At the end of this session, you should be able to:


1. Identify Geometric and Negative Binomial distributions
2. Calculate probabilities of geometric and negative binomial distributions
3. Calculate means and variances of geometric and negative binomial distributions

Lecture:

GEOMETRIC DISTRIBUTIONS

Geometric distribution involves series of Bernoulli trials just like binomial distributions.
However, instead of a fixed number of trials, trials are conducted until a success is
obtained.
In a series of Bernoulli trials (independent trials with constant probability p of a success), let the
random variable X denote the number of trials until the first success.
Then X is a geometric random variable with parameter 0 < 𝑝 < 1 and
𝑃(𝑥) = (1 − 𝑝)^(𝑥−1) 𝑝     𝑥 = 1, 2, …

Example 1: The probability that a bit transmitted through a digital transmission channel is
received in error is 0.1. Assume the transmissions are independent events, and let the
random variable X denote the number of bits transmitted until the first error.
Solution:

Then, P(X = 5) is the probability that the first four bits are transmitted correctly and the
fifth bit is in error. This event can be denoted as {OOOOE}, where O denotes an okay bit.
Because the trials are independent and the probability of a correct transmission is 0.9,
𝑃(𝑋 = 5) = 𝑃(𝑂𝑂𝑂𝑂𝐸) = (0.9)^4 (0.1) = 0.066

Example 2: The probability that a wafer contains a large particle of contamination is 0.01.
If it is assumed that the wafers are independent, what is the probability that exactly 125
wafers need to be analyzed before a large particle is detected?

Solution: 𝑃(𝑋 = 125) = (0.99)^124 (0.01) = 0.0029

MEAN AND VARIANCE OF GEOMETRIC DISTRIBUTIONS
If X is a geometric random variable with parameter p,

𝜇 = 𝐸(𝑋) = 1⁄𝑝 and 𝜎 2 = 𝑉(𝑋) = (1 − 𝑝)/𝑝2

Example 3: The probability that a bit transmitted through a digital transmission channel is
received in error is 0.1. Assume the transmissions are independent events, and let the
random variable X denote the number of bits transmitted until the first error. Find the mean
number of transmissions until the first error.

Solution:
The mean number of transmissions until the first error is:
𝜇 = 1/0.1 = 10
The standard deviation of the number of transmissions before the first error is
𝜎 = [(1 − 0.1)/(0.1)^2]^(1/2) = 9.49
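A minimal Python sketch (an added illustration) of the geometric pmf, mean, and standard deviation for the bit-error example:

    def geom_pmf(x, p):
        """P(X = x): first success on trial x with success probability p."""
        return (1 - p) ** (x - 1) * p

    print(round(geom_pmf(5, 0.1), 3))      # 0.066: first error on the fifth bit
    print(1 / 0.1)                         # mean: 10 transmissions until the first error
    print(((1 - 0.1) / 0.1 ** 2) ** 0.5)   # standard deviation: about 9.49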

NEGATIVE BINOMIAL DISTRIBUTIONS

A generalization of a geometric distribution in which the random variable is the number of


Bernoulli trials required to obtain r successes results in the negative binomial
distribution.
In a series of Bernoulli trials (independent trials with constant probability p of a success), let the
random variable X denote the number of trials until r successes occur.
Then X is a negative binomial random variable with parameters 0 < 𝑝 < 1 and 𝑟 = 1, 2, 3, …, and
𝑃(𝑥) = ((𝑥−1)𝐶(𝑟−1)) (1 − 𝑝)^(𝑥−𝑟) 𝑝^𝑟     𝑥 = 𝑟, 𝑟 + 1, 𝑟 + 2, …

Example 4: Suppose the probability that a bit transmitted through a digital transmission
channel is received in error is 0.1. Assume the transmissions are independent events,
and let the random variable X denote the number of bits transmitted until the fourth error.
Find the probability that the fourth error bit is transmitted on the 10th trial.

Solution:
Then, X has a negative binomial distribution with r = 4. P(X = 10) is the probability that
exactly three errors occur in the first nine trials and then trial 10 results in the fourth error.

𝑃(𝑥) = ((𝑥−1)𝐶(𝑟−1)) (1 − 𝑝)^(𝑥−𝑟) 𝑝^𝑟
𝑃(𝑋 = 10) = (9𝐶3)(0.9)^6 (0.1)^4 = 4.4641x10^(−3)

MEAN AND VARIANCE OF NEGATIVE BINOMIAL DISTRIBUTIONS

If X is a negative binomial random variable with parameter p and r,

𝜇 = 𝐸(𝑋) = 𝑟⁄𝑝 and 𝜎 2 = 𝑉(𝑋) = 𝑟(1 − 𝑝)/𝑝2

Example 5: A Web site contains three identical computer servers. Only one is used to
operate the site, and the other two are spares that can be activated in case the primary
system fails. The probability of a failure in the primary computer (or any activated spare
system) from a request for service is 0.0005. Assuming that each request represents an
independent trial, what is the mean number of requests until failure of all three servers?
𝜇 = 𝑟/𝑝 = 3/0.0005 = 6000 𝑟𝑒𝑞𝑢𝑒𝑠𝑡𝑠
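The negative binomial results can be checked with the following Python sketch (an added illustration):

    from math import comb

    def negbinom_pmf(x, r, p):
        """P(X = x): the r-th success occurs on trial x."""
        return comb(x - 1, r - 1) * (1 - p) ** (x - r) * p ** r

    print(negbinom_pmf(10, 4, 0.1))   # about 0.0044641: fourth bit error on trial 10
    print(3 / 0.0005)                 # Example 5 mean: 6000 requests until all three servers fail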

Activity:

1. Assume that each of your calls to a popular radio station has a probability of 0.02
of connecting, that is, of not obtaining a busy signal. Assume that your calls are
independent.
a. What is the probability that your first call that connects is your tenth call?
b. What is the probability that it requires more than five calls for you to connect?
c. What is the mean number of calls needed to connect?
2. An electronic scale in an automated filling operation stops the manufacturing line
after three underweight packages are detected. Suppose that the probability of an
underweight package is 0.001 and each fill is independent.
a. What is the mean number of fills before the line is stopped?
b. What is the standard deviation of the number of fills before the line is
stopped?

Session 4
Hypergeometric and Poisson Distributions

At the end of this session, you should be able to:


1. Identify Hypergeometric and Poisson distributions
2. Calculate probabilities of Hypergeometric and Poisson distributions
3. Calculate means and variances of Hypergeometric and Poisson distributions

Lecture:

HYPERGEOMETRIC DISTRIBUTIONS

The hypergeometric distribution covers experiments that resemble binomial experiments,
except that the trials are not independent because the objects are selected without
replacement.

PROBABILITY OF HYPERGEOMETRIC DISTRIBUTIONS

A set of N objects contains:

K objects classified as successes
N − K objects classified as failures

A sample of size n objects is selected randomly (without replacement) from the N objects, where
𝐾 ≤ 𝑁 and 𝑛 ≤ 𝑁.
Let the random variable X denote the number of successes in the sample. Then
X is a hypergeometric random variable and

𝑃(𝑥) = [(𝐾𝐶𝑥) ((𝑁−𝐾)𝐶(𝑛−𝑥))] / (𝑁𝐶𝑛)

Example 1: A batch of parts contains 100 parts from a local supplier of tubing and 200
parts from a supplier of tubing in the next state. If four parts are selected randomly and
without replacement, what is the probability they are all from the local supplier?

Solution:
𝑃(𝑋 = 4) = [(100𝐶4)(200𝐶0)] / (300𝐶4) = 0.0119

From Example 1, what is the probability that two or more parts in the sample are from
the local supplier?

Solution:
𝑃(𝑋 ≥ 2) = [(100𝐶2)(200𝐶2) + (100𝐶3)(200𝐶1) + (100𝐶4)(200𝐶0)] / (300𝐶4)
𝑃(𝑋 ≥ 2) = 0.298 + 0.098 + 0.0119 = 0.408

From Example 1, what is the probability that at least one part in the sample is from the
local supplier?

Solution:
𝑃(𝑋 ≥ 1) = 1 − 𝑃(𝑋 = 0) = 1 − [(100𝐶0)(200𝐶4)] / (300𝐶4) = 1 − 0.196 = 0.804

MEAN AND VARIANCE OF HYPERGEOMETRIC DISTRIBUTIONS

If X is a hypergeometric random variable with parameters N, K and n, then

𝜇 = 𝐸(𝑋) = 𝑛𝑝
𝜎^2 = 𝑉(𝑋) = 𝑛𝑝(1 − 𝑝)[(𝑁 − 𝑛)/(𝑁 − 1)]

Where: 𝑝 = 𝐾/𝑁

From Example 1, what is the mean or expected value of X? Also find its variance.

Solution: From the formula we first compute 𝑝:

𝑝 = 𝐾/𝑁 = 100/300 = 1/3
𝜇 = 𝑛𝑝 = (4)(1/3) = 4/3 = 1.3333
𝜎^2 = 𝑉(𝑋) = 𝑛𝑝(1 − 𝑝)[(𝑁 − 𝑛)/(𝑁 − 1)] = 4(1/3)(1 − 1/3)[(300 − 4)/(300 − 1)] = 0.88

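A short Python sketch (an added illustration) reproduces the hypergeometric probabilities of Example 1; note that the exact value of P(X ≥ 2) is about 0.407, and the 0.408 above comes from rounding each term first:

    from math import comb

    def hyper_pmf(x, N, K, n):
        """P(X = x) successes in a sample of size n drawn without replacement."""
        return comb(K, x) * comb(N - K, n - x) / comb(N, n)

    # N = 300 parts, K = 100 from the local supplier, sample of n = 4
    print(round(hyper_pmf(4, 300, 100, 4), 4))                           # 0.0119
    print(round(sum(hyper_pmf(x, 300, 100, 4) for x in (2, 3, 4)), 3))   # 0.407
    print(round(1 - hyper_pmf(0, 300, 100, 4), 3))                       # 0.804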
POISSON DISTRIBUTION

Given an interval of real numbers, assume counts occur at random throughout the
interval. If the interval can be partitioned into subintervals of small enough length such
that:
1. the probability of more than one count in a subinterval is zero,
2. the probability of one count in a subinterval is the same for all subintervals and
proportional to the length of the subinterval, and
3. the count in each subinterval is independent of other subintervals, the random
experiment is called a Poisson process.

PROBABILITY OF RANDOM VARIABLE IN A POISSON PROCESS

The random variable X that equals the number of counts in the interval is a Poisson
random variable with parameter 𝜆 > 0. The probability of random variable X can be computed as:

𝑃(𝑥) = (𝑒^(−𝜆) 𝜆^𝑥) / 𝑥!     𝑥 = 0, 1, 2, …

It is important to use consistent units in the calculation of probabilities, means, and
variances involving Poisson random variables. The following illustrates unit conversions:
if the average number of flaws per millimeter of wire is 3.4, then the average number of
flaws in 10 millimeters of wire is 34, and the average number of flaws in 100 millimeters
of wire is 340.

If a Poisson random variable represents the number of counts in some interval, the mean
of the random variable must equal the expected number of counts in the same length of
interval.

Example 2: Flaws occur at random along the length of a thin copper wire. Suppose that
the number of flaws follows a Poisson distribution with a mean of 2.3 flaws per millimeter.
Determine the probability of exactly 2 flaws in 1 millimeter of wire.
Solution:
𝑃(𝑥) = (𝑒^(−𝜆) 𝜆^𝑥) / 𝑥! = (𝑒^(−2.3) (2.3)^2) / 2! = 0.265
From Example 2: Determine the probability of 10 flaws in 5 millimeters of wire.

Solution:
𝜆 = (5 𝑚𝑚)(2.3 𝑓𝑙𝑎𝑤𝑠/𝑚𝑚) = 11.5 𝑓𝑙𝑎𝑤𝑠
Therefore,
𝑃(10) = (𝑒^(−𝜆) 𝜆^𝑥) / 𝑥! = (𝑒^(−11.5) (11.5)^10) / 10! = 0.113
From Example 2: Determine the probability of at least 1 flaw in 2 millimeters of wire.

Solution:
𝜆 = (2 𝑚𝑚)(2.3 𝑓𝑙𝑎𝑤𝑠/𝑚𝑚) = 4.6 𝑓𝑙𝑎𝑤𝑠
Therefore,
𝑃(𝑥 ≥ 1) = 1 − 𝑃(𝑥 = 0) = 1 − (𝑒^(−4.6) (4.6)^0) / 0! = 0.9899
Example 3: Contamination is a problem in the manufacture of optical storage disks. The
number of particles of contamination that occur on an optical disk has a Poisson
distribution, and the average number of particles per centimeter squared of media surface
is 0.1. The area of a disk under study is 100 squared centimeters. Find the probability
that 12 particles occur in the area of a disk under study.

Solution:
𝜆 = (100 𝑐𝑚^2)(0.1 𝑝𝑎𝑟𝑡𝑖𝑐𝑙𝑒𝑠/𝑐𝑚^2) = 10 𝑝𝑎𝑟𝑡𝑖𝑐𝑙𝑒𝑠

Therefore,
𝑃(𝑥 = 12) = (𝑒^(−10) (10)^12) / 12! = 0.095

MEAN AND VARIANCE OF POISSON DISTRIBUTIONS

If X is a Poisson random variable with parameter 𝜆, then

𝜇 = 𝐸(𝑋) = 𝜆  and  𝜎^2 = 𝑉(𝑋) = 𝜆

The mean and variance of a Poisson random variable are equal.
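The Poisson examples above can be reproduced with a minimal Python sketch (an added illustration):

    from math import exp, factorial

    def poisson_pmf(x, lam):
        """P(X = x) counts in an interval whose mean count is lam."""
        return exp(-lam) * lam ** x / factorial(x)

    print(round(poisson_pmf(2, 2.3), 3))        # 0.265: exactly 2 flaws in 1 mm
    print(round(poisson_pmf(10, 11.5), 3))      # 0.113: exactly 10 flaws in 5 mm
    print(round(1 - poisson_pmf(0, 4.6), 4))    # 0.9899: at least 1 flaw in 2 mm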

Activity:

1. Suppose that the number of customers that enter a bank in an hour is a Poisson
random variable, and suppose that P(X=0) = 0.05. Determine the mean and
variance of X.
2. Printed circuit cards are placed in a functional test after being populated with
semiconductor chips. A lot contains 140 cards, and 20 are selected without
replacement for functional testing.
a. If 20 cards are defective, what is the probability that at least 1 defective card
is in the sample?
b. If 5 cards are defective, what is the probability that at least 1 defective card
appears in the sample?

Continuous Probability Distributions
____________________________________________________
MODULE 4

Session 1
Introduction to Continuous Random Variables and Probability Density
Functions
By the end of this session, you should be able to:
1. Determine probabilities of continuous random variables from probability density
functions.
2. Calculate means and variances for continuous random variables.

Lecture:

Continuous random variable is a random variable X whose range includes all


values in an interval of real numbers. That is, the range of X can be thought of as a
continuum.
Probability density function f(x) can be used to describe the probability
distribution of a continuous random variable X.

If an interval is likely to contain a value for X, its probability is large and it


corresponds to large values for f(x).
The probability that X is between a and b is determined as the integral of f(x)
from a to b.

For a continuous random variable X, a probability density function is a function such that:

1. 𝑓(𝑥) ≥ 0
2. ∫_{−∞}^{∞} 𝑓(𝑥) 𝑑𝑥 = 1
3. 𝑃(𝑎 ≤ 𝑋 ≤ 𝑏) = ∫_{𝑎}^{𝑏} 𝑓(𝑥) 𝑑𝑥 = area under 𝑓(𝑥) from a to b

A probability density function is zero for x values that cannot occur and it is
assumed to be zero wherever it is not specifically defined.
A histogram is an approximation to a probability density function.

For each interval of the histogram, the area of the bar equals the relative frequency
(proportion) of the measurements in the interval.
The relative frequency is an estimate of the probability that a measurement falls in
the interval.
Similarly, the area under f(x) over any interval equals the true probability that a
measurement falls in the interval.
PROBABILITY OF A CONTINUOUS RANDOM VARIABLE

If X is a continuous random variable, then for any x1 and x2,

𝑃(𝑥1 ≤ 𝑋 ≤ 𝑥2) = 𝑃(𝑥1 < 𝑋 ≤ 𝑥2) = 𝑃(𝑥1 ≤ 𝑋 < 𝑥2) = 𝑃(𝑥1 < 𝑋 < 𝑥2)

Example 1: Let the continuous random variable X denote the current measured in a
thin copper wire in milliamperes. Assume that the range of X is [0, 20
mA], and assume that the probability density function of X is 𝑓(𝑥 ) = 0.05
for 0 ≤ 𝑥 ≤ 20. What is the probability that a current measurement is less
than 10 milliamperes?

𝑃(𝑋 < 10) = ∫_{0}^{10} 𝑓(𝑥) 𝑑𝑥 = ∫_{0}^{10} 0.05 𝑑𝑥 = 0.5

Example 2: Let the continuous random variable X denote the diameter of a hole
drilled in a sheet metal component. The target diameter is 12.5
millimeters. Most random disturbances to the process result in larger
diameters. Historical data show that the distribution of X can be modeled
by a probability density function 𝑓(𝑥) = 20𝑒^(−20(𝑥−12.5)), 𝑥 ≥ 12.5. If a part with
a diameter larger than 12.60 millimeters is scrapped, what proportion of
parts is scrapped?

Solution:
𝑃(𝑋 > 12.60) = ∫_{12.6}^{∞} 20𝑒^(−20(𝑥−12.5)) 𝑑𝑥 = −𝑒^(−20(𝑥−12.5)) |_{12.6}^{∞} = 𝑒^(−2) = 0.1353

From Example 2, what proportion of parts is between 12.5 and 12.6 millimeters?

Solution:
𝑃(12.5 < 𝑋 < 12.6) = ∫_{12.5}^{12.6} 20𝑒^(−20(𝑥−12.5)) 𝑑𝑥 = 1 − 𝑒^(−2) = 0.8647

Because the total area under 𝑓(𝑥) equals 1, we can also calculate
𝑃(12.5 < 𝑋 < 12.6) = 1 − 𝑃(𝑋 > 12.6) = 1 − 0.1353 = 0.8647

MEAN AND VARIANCE OF A CONTINUOUS RANDOM VARIABLE

Suppose X is a continuous random variable with probability density function f(x).

The mean or expected value of X is computed as follows:

𝜇 = 𝐸(𝑋) = ∫_{−∞}^{∞} 𝑥 𝑓(𝑥) 𝑑𝑥

The variance of X, denoted as 𝑉(𝑋) or 𝜎^2, is

𝜎^2 = 𝑉(𝑋) = ∫_{−∞}^{∞} (𝑥 − 𝜇)^2 𝑓(𝑥) 𝑑𝑥 = ∫_{−∞}^{∞} 𝑥^2 𝑓(𝑥) 𝑑𝑥 − 𝜇^2

From Example 1, assume that the range of X is [0, 20 mA], and assume that the
probability density function of X is 𝑓 (𝑥 ) = 0.05 for 0 ≤ 𝑥 ≤ 20. What is the
mean and variance of the random variable X?
Solution:
𝜇 = 𝐸(𝑋) = ∫_{−∞}^{∞} 𝑥 𝑓(𝑥) 𝑑𝑥 = ∫_{0}^{20} 0.05𝑥 𝑑𝑥 = 0.05 (𝑥^2/2) |_{0}^{20} = 10

𝜎^2 = 𝑉(𝑋) = ∫_{−∞}^{∞} (𝑥 − 𝜇)^2 𝑓(𝑥) 𝑑𝑥 = ∫_{0}^{20} 0.05(𝑥 − 10)^2 𝑑𝑥
     = 0.05 (𝑥 − 10)^3/3 |_{0}^{20} = 33.3333

From Example 2, historical data show that the distribution of X can be modeled by
a probability density function 𝑓(𝑥) = 20𝑒^(−20(𝑥−12.5)), 𝑥 ≥ 12.5. Determine
the mean of the random variable X.

Solution:
𝜇 = 𝐸(𝑋) = ∫_{12.5}^{∞} 𝑥 𝑓(𝑥) 𝑑𝑥 = ∫_{12.5}^{∞} 20𝑥 𝑒^(−20(𝑥−12.5)) 𝑑𝑥

By integration by parts, with 𝑢 = 𝑥, 𝑑𝑢 = 𝑑𝑥, 𝑑𝑣 = 20𝑒^(−20(𝑥−12.5)) 𝑑𝑥 and
𝑣 = −𝑒^(−20(𝑥−12.5)):

𝜇 = [−𝑥𝑒^(−20(𝑥−12.5)) − (1/20)𝑒^(−20(𝑥−12.5))] |_{12.5}^{∞} = 12.5 + 0.05 = 12.55

If X is a continuous uniform random variable over a ≤ x ≤ b (that is, if its probability density function is constant),

𝜇 = 𝐸(𝑋) = (𝑎 + 𝑏)/2  and  𝜎^2 = 𝑉(𝑋) = (𝑏 − 𝑎)^2/12

From Example 1, assume that the range of X is [0, 20 mA], and assume that the
probability density function of X is 𝑓 (𝑥 ) = 0.05 for 0 ≤ 𝑥 ≤ 20. What is the
mean and standard deviation of random variable X?
Solution:
𝜇 = 𝐸(𝑋) = (𝑎 + 𝑏)/2 = (0 + 20)/2 = 10 𝑚𝐴

𝜎^2 = 𝑉(𝑋) = (20 − 0)^2/12 = 33.3333

Therefore 𝜎 = √𝑉(𝑋) = √33.3333 = 5.7735 𝑚𝐴
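These results can also be verified numerically; the Python sketch below (an added illustration) approximates the two integrals with a midpoint Riemann sum instead of evaluating them in closed form:

    # X uniform on [0, 20] with f(x) = 0.05; approximate E(X) and V(X) numerically
    a, b, f = 0.0, 20.0, 0.05
    n = 100_000
    dx = (b - a) / n
    xs = [a + (i + 0.5) * dx for i in range(n)]   # midpoints of the subintervals

    mean = sum(x * f * dx for x in xs)
    var = sum((x - mean) ** 2 * f * dx for x in xs)
    print(round(mean, 4), round(var, 4))   # 10.0 and 33.3333, matching (a+b)/2 and (b-a)^2/12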

Activity

1. Suppose that 𝑓 (𝑥) = 𝑒 −𝑥 𝑓𝑜𝑟 𝑥 > 0. Determine the following probabilities:


a. 𝑃(𝑋 > 1) b. 𝑃(1 < 𝑋 < 2.5) c. 𝑃(𝑋 = 3)

d. 𝑃(𝑋 < 4) e. 𝑃(𝑋 ≥ 3)

2. The probability density function of the net weight in pounds of a packaged chemical
herbicide is 𝑓(𝑥 ) = 2.0 𝑓𝑜𝑟 49.75 < 𝑥 < 50.25 pounds.
a. Determine the probability that a package weighs more than 50 pounds.
b. How much chemical is contained in 90% of all packages?
c. Determine the mean and variance of the weight of packages.

Session 2
NORMAL DISTRIBUTION

By the end of this session, you should be able to:


1. Characterize normal distribution
2. Standardize normal random variables.
3. Use the table for the cumulative distribution function of a standard normal
distribution to calculate probabilities.

Lecture:

NORMAL DISTRIBUTION
Random variables with different means and variances can be modeled by normal
probability density functions with appropriate choices of the center and width of the curve.
The value of E(X) = 𝜇 determines the center of the probability density function and the
value of 𝑉(𝑋) = 𝜎^2 determines the width.

The figure illustrates several normal probability density functions with selected
values of 𝜇 and 𝜎 2 . Each has the characteristic symmetric bell-shaped curve, but the
centers and dispersions differ.
In a normal distribution for any normal random variable, the probability distribution
can be approximated by the figure below:

 Since it is symmetrical, 𝑃(𝑋 < 𝜇) = 𝑃(𝑋 > 𝜇) = 0.5


 Since 𝑓 (𝑥 ) is positive for all x, this model assigns some probability to each
interval of the real line.

 The probability density function decreases as x moves farther from μ.
Consequently, the probability that a measurement falls far from µ is small,
and at some distance from µ the probability of an interval can be
approximated as zero.
STANDARD NORMAL DISTRIBUTION

A normal random variable with

𝜇 = 0 and 𝜎^2 = 1

is called a standard normal random variable and is denoted as Z (also known as the Z-score).

PROBABILITIES USING THE CUMULATIVE STANDARD NORMAL DISTRIBUTION
Example 3: Solve for 𝑃(𝑍 ≤ 1.5)

Solution: This probability is from −∞ 𝑡𝑜 1.5, graphically

The intersection of row (1.5) and column (0.00) is 0.933193,
Therefore: 𝑷(𝒁 ≤ 𝟏. 𝟓) = 𝟎. 𝟗𝟑𝟑𝟏
Example 4: Solve for 𝑃(𝑍 > 1.26)

Solution: This probability is from 1.26 𝑡𝑜 ∞

The intersection of row (1.2) and column (0.06) (1.2 + 0.06 = 1.26) is
0.896165, this is 𝑃 (𝑍 ≤ 1.26). But the required probability is 𝑃(𝑍 > 1.26)

𝑃(𝑍 > 1.26) = 1 − 𝑃(𝑍 ≤ 1.26)


= 1 − 0.896165
𝑷(𝒁 > 𝟏. 𝟐𝟔) = 𝟎. 𝟏𝟎𝟑𝟖𝟑𝟓
Graphically,

Example 5: Solve for 𝑃(𝑍 < −0.86)
Solution:
This probability is from −∞ 𝑡𝑜 − 0.86 which is graphically,

The intersection of row (-0.8) and column (-0.06):


(−0.8) + (−0.06) = −0.86 is 0.194894,
Therefore: 𝑷(𝒁 < −𝟎. 𝟖𝟔) = 𝟎. 𝟏𝟗𝟒𝟖𝟗𝟒
Example 6: Solve for 𝑃(−1.25 < 𝑍 < 0.37)
Solution:

This probability is from −1.25 𝑡𝑜 0.37, which is graphically


𝑃(−1.25 < 𝑍 < 0.37) = 𝑃(𝑍 ≤ 0.37) - 𝑃(𝑍 ≤ −1.25)

𝑃(𝑍 ≤ 0.37) = 0.644309

𝑃(𝑍 ≤ −1.25) = 0.105650

𝑷(−𝟏. 𝟐𝟓 < 𝒁 < 𝟎. 𝟑𝟕) = 𝑃(𝑍 ≤ 0.37) − 𝑃(𝑍 ≤ −1.25)

= 0.644309 − 0.105650 = 𝟎. 𝟓𝟑𝟖𝟔𝟓𝟗

Example 7: Solve for 𝑃(𝑍 ≤ −4.6)


Solution:
This probability cannot be found from the table since the table gives up
to 𝑃(𝑍 ≤ −3.99) only.

𝑃(𝑍 ≤ −3.99) = 0.000033
Since 𝑃(𝑍 ≤ −4.6) is less than 𝑃(𝑍 ≤ −3.99), therefore 𝑃(𝑍 ≤ −4.6) can
be approximated as almost zero.
Example 8. Find the value of 𝑧 such that 𝑃(𝑍 > 𝑧) = 0.05
Solution:
Since the total probability is 1, if the probability that Z is greater than z is
0.05, we can say that the probability that Z is less than or equal to z is
(1 − .05) = 0.95.

This can be expressed as: 𝑃(𝑍 ≤ 𝑧) = 0.95.


Table II will now be used in reverse. We search through the probabilities
to find the value that corresponds to 0.95.

From this table we can say that:


𝑃 (𝑍 ≤ 1.65) = 0.950529 ≈ 0.95

Therefore we can say that in the expression 𝑃(𝑍 > 𝑧) = 0.05, 𝑧 = 1.65

Example 9: Find the value of 𝑧 such that 𝑃 (−𝑧 < 𝑍 < 𝑧) = 0.99
Solution:
Because of the symmetry of the normal distribution, if the area of the
shaded region is to equal 0.99, the area in each tail of the distribution must
equal (1 − 0.99)/2 = 0.005.

Therefore, the value for z corresponds to a probability of 0.995 in Table II.

From this table we can say that:


𝑃(𝑍 ≤ 2.58) ≈ 0.995

Therefore we can say that in the expression 𝑃(−𝑧 < 𝑍 < 𝑧) = 0.99,
𝒛 = 𝟐. 𝟓𝟖

TRANSFORMATION TO STANDARD NORMAL RANDOM VARIABLE (Z)
The preceding examples show how to calculate probabilities for standard normal
random variables. To use the same approach for an arbitrary normal random variable
would require a separate table for every possible pair of values for 𝜇 and 𝜎.
Fortunately, all normal probability distributions are related algebraically, and
Appendix Table II can be used to find the probabilities associated with an arbitrary normal
random variable by first using a simple transformation.

If X is a normal random variable with 𝐸(𝑋) = µ and 𝑉(𝑋) = 𝜎^2,

𝑍 = (𝑋 − 𝜇)/𝜎

is a normal random variable with 𝐸(𝑍) = 0 and 𝑉(𝑍) = 1. That is, Z is a standard
normal random variable.

Creating a new random variable by this transformation is referred to as


standardizing. The random variable Z represents the distance of X from its mean in
terms of standard deviations.
Example 10: Assume that the current measurements in a strip of wire follow a
normal distribution with a mean of 10 mA and a variance of 4 (mA) 2.
a. What is the probability that a measurement exceeds 13 mA?
b. What is the probability that a current measurement is between 9
and 11 mA?
c. Determine the value of X for which the probability that a current
measurement is below this value is 0.98.
Solution:
a. What is the probability that a measurement exceeds 13 mA? Let X
denote the current in milliamperes. And the requested probability
can be represented as 𝑃(𝑋 > 13).

Transforming X to Z: 𝑋 = 13, 𝜇 = 10, 𝜎^2 = 4, therefore 𝜎 = 2

𝑍 = (𝑋 − 𝜇)/𝜎 = (13 − 10)/2 = 1.5

𝑃(𝑋 > 13) = 𝑃(𝑍 > 1.5)

𝑃(𝑍 > 1.5) = 1 − 𝑃 (𝑍 ≤ 1.5)
Then from the table

𝑃(𝑍 ≤ 1.5) = 0.933193


𝑃(𝑍 > 1.5) = 1 − 0.933193 = 0.066807
𝑇ℎ𝑒𝑟𝑒𝑓𝑜𝑟𝑒: 𝑃(𝑋 > 13) = 0.066807
b. What is the probability that a current measurement is between 9 and
11 mA?
This can be expressed as 𝑃(9 < 𝑋 < 11).
Standardizing the values of X: 𝑍 = (𝑋 − 𝜇)/𝜎
At X = 9: Z = (9 − 10)/2 = −0.5     At X = 11: Z = (11 − 10)/2 = 0.5
Then: 𝑃(9 < 𝑋 < 11) = 𝑃(−0.5 < 𝑍 < 0.5)
𝑃(−0.5 < 𝑍 < 0.5) = 𝑃(𝑍 < 0.5) − 𝑃(𝑍 < −0.5)

𝑃(𝑍 < 0.5) = 0.691462

𝑃(𝑍 < −0.5) = 0.308538
𝑇ℎ𝑒𝑟𝑒𝑓𝑜𝑟𝑒:
𝑃 (9 < 𝑋 < 11) = 𝑃 (−0.5 < 𝑍 < 0.5)
= 0.691462 − 0.308538
= 𝟎. 𝟑𝟖𝟐𝟗𝟐𝟒
c. Determine the value of X for which the probability that a current
measurement is below this value is 0.98.
This can be expressed as 𝑥 𝑠𝑢𝑐ℎ 𝑡ℎ𝑎𝑡 𝑃(𝑋 < 𝑥 ) = 0.98. First we
have to find the value of 𝑧 such that 𝑃(𝑍 < 𝑧) = 0.98

𝑃(𝑍 < 2.05) = 0.979818
𝑃(𝑍 < 2.05) ≈ 0.98
𝑇ℎ𝑒𝑟𝑒𝑓𝑜𝑟𝑒:
𝑇ℎ𝑒 𝑣𝑎𝑙𝑢𝑒 𝑜𝑓 𝑧 𝑠𝑢𝑐ℎ 𝑡ℎ𝑎𝑡 𝑃(𝑍 < 𝑧) = 0.98 is 2.05
𝑇ℎ𝑒𝑛 𝑡ℎ𝑒 𝑣𝑎𝑙𝑢𝑒 𝑥 𝑠𝑢𝑐ℎ 𝑡ℎ𝑎𝑡 𝑃(𝑋 < 𝑥 ) = 0.98 𝑖𝑠:
𝑧 = (𝑥 − 𝜇)/𝜎
2.05 = (𝑥 − 10)/2
𝒙 = 𝟏𝟒. 𝟏 𝒎𝑨
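Since Python 3.8 the standard library can evaluate normal probabilities directly, which makes a convenient check on the table lookups; the sketch below (an added illustration) reproduces the answers of Example 10:

    from statistics import NormalDist

    X = NormalDist(mu=10, sigma=2)    # current in mA, as in Example 10
    Z = NormalDist()                  # standard normal, mu = 0 and sigma = 1

    print(round(1 - X.cdf(13), 6))         # P(X > 13) = 0.066807
    print(round(1 - Z.cdf(1.5), 6))        # the same via Z = (13 - 10)/2 = 1.5
    print(round(X.cdf(11) - X.cdf(9), 6))  # P(9 < X < 11), about 0.382925
    print(round(X.inv_cdf(0.98), 2))       # about 14.11; the table answer 14.1 uses z = 2.05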
Example 11. Assume that in the detection of a digital signal the background noise
follows a normal distribution with a mean of 0 volt and standard deviation
of 0.45 volt. The system assumes a digital 1 has been transmitted when
the voltage exceeds 0.9.
a. What is the probability of detecting a digital 1 when none was
sent?
b. Determine symmetric bounds about 0 that include 99% of all noise
readings.
Solution:
a. What is the probability of detecting a digital 1 when none was
sent?
Let the random variable N denote the voltage of noise. The
requested probability is 𝑃(𝑁 > 0.9).
Converting it to standard score Z:
𝑍 = (𝑋 − 𝜇)/𝜎 = (𝑁 − 𝜇)/𝜎 = (0.9 − 0)/0.45 = 2
𝑇ℎ𝑒𝑟𝑒𝑓𝑜𝑟𝑒: 𝑃(𝑁 > 0.9) = 𝑃(𝑍 > 2)
𝑃(𝑁 > 0.9) = 𝑃(𝑍 > 2) = 1 − 𝑃(𝑍 ≤ 2)

𝐹𝑟𝑜𝑚 𝑡ℎ𝑒 𝑡𝑎𝑏𝑙𝑒: 𝑃(𝑍 ≤ 2) = 0.977250

𝑃 (𝑍 > 2) = 1 − 𝑃(𝑍 ≤ 2) = 1 − 0.977250 = 0.02275
Therefore: 𝑷(𝑵 > 𝟎. 𝟗) = 𝟎. 𝟎𝟐𝟐𝟕𝟓
This probability can be described as the probability of a false detection.
b. Determine symmetric bounds about 0 that include 99% of all noise
readings.
The question requires us to find x such that 𝑃(−𝑥 < 𝑁 < 𝑥 ) = 0.99
Graphically,

𝑃(−𝑧 < 𝑍 < 𝑧) = 0.99


𝑃(−𝑧 < 𝑍 < 𝑧) = 𝑃(𝑍 < 𝑧) − 𝑃(𝑍 < −𝑧)
Since the graph is symmetrical, we need to find z such that :
𝑃(𝑍 < 𝑧) = 0.995
𝑇ℎ𝑖𝑠 𝑤𝑖𝑙𝑙 𝑙𝑒𝑎𝑣𝑒 𝑡ℎ𝑒 0.005 𝑝𝑟𝑜𝑏𝑎𝑏𝑖𝑙𝑖𝑡𝑦 𝑖𝑛 𝑡ℎ𝑒 𝑟𝑖𝑔ℎ𝑡 𝑡𝑎𝑖𝑙 𝑜𝑓 𝑡ℎ𝑒 𝑔𝑟𝑎𝑝ℎ
We also need to find the probability such that :
𝑃(𝑍 < −𝑧) = 0.005
𝑇ℎ𝑖𝑠 𝑤𝑖𝑙𝑙 𝑙𝑒𝑎𝑣𝑒 𝑡ℎ𝑒 0.005 𝑝𝑟𝑜𝑏𝑎𝑏𝑖𝑙𝑖𝑡𝑦 𝑖𝑛 𝑡ℎ𝑒 𝑙𝑒𝑓𝑡 𝑡𝑎𝑖𝑙 𝑜𝑓 𝑡ℎ𝑒 𝑔𝑟𝑎𝑝ℎ

𝑃(𝑍 < 2.58) = 0.995060 ≈ 0.995

𝑃(𝑍 < −2.58) = 0.004940 ≈ 0.005
Therefore the value of 𝑧 such that 𝑃(−𝑧 < 𝑍 < 𝑧) = 0.99 is 2.58
Converting to N:
𝑧 = (𝑁 − 𝜇)/𝜎; at z = 2.58:
2.58 = (𝑁 − 0)/0.45;  𝑁 = 1.161 𝑉
0.45

Example 12: The diameter of a shaft in an optical storage drive is normally


distributed with mean 0.2508 inch and standard deviation 0.0005 inch.
The specifications on the shaft are 0.2500 ±0.0015 inch. What proportion
of shafts conforms to specifications?
Solution:
Let X denote the shaft diameter in inches. The required probability is
𝑃(0.2485 < X < 0.2515)
Transforming to standard normal random variable: 𝑍 = (𝑋 − 𝜇)/𝜎
At 𝑋 = 0.2485: 𝑍 = (0.2485 − 0.2508)/0.0005 = −4.6
At 𝑋 = 0.2515: 𝑍 = (0.2515 − 0.2508)/0.0005 = 1.4
𝑃(0.2485 < X < 0.2515) = P(−4.6 < Z < 1.4)
= P(Z < 1.4) − P(Z < −4.6)

P(Z < 1.4) = 0.919243
Since P(Z < −4.6) is less than the smallest entry in the table
𝑃 (𝑍 < −3.99) = 0.000033, it is approximated as zero.
P(−4.6 < Z < 1.4) = P(Z < 1.4) − P(Z < −4.6)
= 0.919243 − 0
= 0.919243
Therefore:
𝑷(𝟎. 𝟐𝟒𝟖𝟓 < 𝑿 < 𝟎. 𝟐𝟓𝟏𝟓) = 𝟎. 𝟗𝟏𝟗𝟐𝟒𝟑
This means that 91.92% of the shafts conform to the specifications.
From Example 12, most of the nonconforming shafts are too large, because the
process mean is located very near to the upper specification limit. If the
process is centered so that the process mean is equal to the target value
of 0.2500, the required probability 𝑃(0.2485 < X < 0.2515) will be
transformed to standard normal random variable as: 𝑍 = (𝑋 − 𝜇)/𝜎

At 𝑋 = 0.2485: 𝑍 = (0.2485 − 0.2500)/0.0005 = −3
At 𝑋 = 0.2515: 𝑍 = (0.2515 − 0.2500)/0.0005 = 3
Therefore: 𝑃(0.2485 < X < 0.2515) = P(−3 < Z < 3)
P(−3 < Z < 3) = P(Z < 3) − P(Z < −3)

𝑷(𝒁 < 𝟑) = 𝟎. 𝟗𝟗𝟖𝟔𝟓𝟎

𝑷(𝒁 < −𝟑) = 𝟎. 𝟎𝟎𝟏𝟑𝟓𝟎
Therefore:
𝑷(−𝟑 < 𝒁 < 𝟑) = 𝑷(𝒁 < 𝟑) − 𝑷(𝒁 < −𝟑)
= 0.998650 − 0.001350
= 0.9973
This means that by recentering the process, the yield is increased to
approximately 99.73%.

Activity

1. Assume Z has a standard normal distribution. Use the Cumulative Standard


Normal Distribution Table to determine the value for z that solves each of the
following:
a. 𝑃(𝑍 < 𝑧) = 0.5     b. 𝑃(𝑍 > 𝑧) = 0.1     c. 𝑃(−1.24 < 𝑍 < 𝑧) = 0.8
2. Assume X is normally distributed with a mean of 10 and a standard deviation of 2.
Determine the following:
a. 𝑃(𝑋 < 13) b. 𝑃(2 < 𝑋 < 4) c. 𝑃(−12 < 𝑋 < 8)

3. The time it takes a cell to divide (called mitosis) is normally distributed with an
average time of one hour and a standard deviation of 5 minutes.
a. What is the probability that a cell divides in less than 45 minutes?
b. What is the probability that it takes a cell more than 65 minutes to divide?
c. What is the time that it takes approximately 99% of all cells to complete
mitosis?

Sampling Distribution and Estimation
______________________________________________
MODULE 5

Session 1
Sampling Distribution

By the end of this session you should be able to:


1. Understand the concept of sampling distribution.
2. Illustrate the use of central limit theorem.

Lecture:

INTRODUCTION

A sampling distribution is a probability distribution of a statistic obtained from a larger


number of samples drawn from a specific population. The sampling distribution of a given
population is the distribution of frequencies of a range of different outcomes that could
possibly occur for a statistic of a population.

In statistics, a population is the entire pool from which a statistical sample is drawn. A
population may refer to an entire group of people, objects, events, hospital visits, or
measurements. A population can thus be said to be an aggregate observation of subjects
grouped together by a common feature.

UNDERSTANDING SAMPLING DISTRIBUTION

A lot of data drawn and used by academicians, statisticians, researchers, marketers,


analysts, etc. are actually samples, not populations. A sample is a subset of a population.
For example, a medical researcher that wanted to compare the average weight of all
babies born in North America from 1995 to 2005 to those born in South America within
the same time period cannot within a reasonable amount of time draw the data for the
entire population of over a million childbirths that occurred over the ten-year time frame.
He will instead only use the weight of, say, 100 babies, in each continent to make a
conclusion. The weight of 200 babies used is the sample and the average weight
calculated is the sample mean.

Now suppose that instead of taking just one sample of 100 newborn weights from each
continent, the medical researcher takes repeated random samples from the general
population, and computes the sample mean for each sample group. So, for North
America, he pulls up newborn weights recorded in the US, Canada, and Mexico as
follows: four samples of 100 from select hospitals in the US, five samples of 70 from
Canada, and three samples of 150 from Mexico, for a total of 1,200 weights of newborn
babies grouped in 12 sets. He also collects sample data of 100 birth weights from each of the
12 countries in South America.

The average weight computed for each sample set is the sampling distribution of the
mean. Not just the mean can be calculated from a sample. Other statistics, such as the
standard deviation, variance, proportion, and range can be calculated from sample data.
The standard deviation and variance measure the variability of the sampling distribution.

The number of observations in a population, the number of observations in a sample and


the procedure used to draw the sample sets determine the variability of a sampling
distribution. The standard deviation of a sampling distribution is called the standard error.
While the mean of a sampling distribution is equal to the mean of the population, the
standard error depends on the standard deviation of the population, the size of the
population and the size of the sample.

Knowing how spread apart the mean of each of the sample sets are from each other and
from the population mean will give an indication of how close the sample mean is to the
population mean. The standard error of the sampling distribution decreases as the sample
size increases.

SPECIAL CONSIDERATIONS

A population, or a single sample set of numbers, may follow a normal distribution. However,
because a sampling distribution includes multiple sets of observations, it will not
necessarily have a bell-curved shape.

Following our example, the population average weight of babies in North America and in
South America has a normal distribution because some babies will be underweight (below
the mean) or overweight (above the mean), with most babies falling in between (around
the mean). If the average weight of newborns in North America is seven pounds, the
sample mean weight in each of the 12 sets of sample observations recorded for North
America will be close to seven pounds as well.

However, if you graph each of the averages calculated in each of the sample
groups, the resulting shape may be roughly uniform, but it is difficult to predict
with certainty what the actual shape will turn out to be. The more samples the researcher
uses from the population of over a million weight figures, the more the graph will start
forming a normal distribution.

UNDERSTANDING DIFFERENT SAMPLING METHODS

When you conduct research about a group of people, it’s rarely possible to collect data
from every person in that group. Instead, you select a sample. The sample is the group
of individuals who will actually participate in the research.

To draw valid conclusions from your results, you have to carefully decide how you will
select a sample that is representative of the group as a whole. As discussed in the
previous lectures there are two types of sampling methods:

 Probability sampling involves random selection, allowing you to make statistical
inferences about the whole group.
 Non-probability sampling involves non-random selection based on convenience or
other criteria, allowing you to easily collect initial data.

POPULATION VS SAMPLE

First, you need to understand the difference between a population and a sample, and
identify the target population of your research.

 The population is the entire group that you want to draw conclusions about.
 The sample is the specific group of individuals that you will collect data from.

The population can be defined in terms of geographical location, age, income, and many
other characteristics.

It can be very broad or quite narrow: maybe you want to make inferences about the whole
adult population of your country; maybe your research focuses on customers of a certain
company, patients with a specific health condition, or students in a single school.
It is important to carefully define your target population according to the purpose and
practicalities of your project.
If the population is very large, demographically mixed, and geographically dispersed, it
might be difficult to gain access to a representative sample.

SAMPLING FRAME

The sampling frame is the actual list of individuals that the sample will be drawn from.
Ideally, it should include the entire target population (and nobody who is not part of that
population).

Example:
You are doing research on working conditions at Company X. Your population is all 1000
employees of the company. Your sampling frame is the company’s HR database which
lists the names and contact details of every employee.

SAMPLE SIZE

The number of individuals in your sample depends on the size of the population, and on
how precisely you want the results to represent the population as a whole.
You can use a sample size calculator or compute using different methods that will be
discussed later to determine how big your sample should be. In general, the larger the
sample size, the more accurately and confidently you can make inferences about the
whole population.

CENTRAL LIMIT THEOREM

The central limit theorem states that if you have a population with mean μ and standard
deviation σ and take sufficiently large random samples from the population with
replacement, then the distribution of the sample means will be
approximately normally distributed. This will hold true regardless of whether the source
population is normal or skewed, provided the sample size is sufficiently large (usually n >
30). If the population is normal, then the theorem holds true even for samples smaller
than 30. In fact, this also holds true even if the population is binomial, provided that
min(np, n(1-p))> 5, where n is the sample size and p is the probability of success in the
population. This means that we can use the normal probability model to quantify

uncertainty when making inferences about a population mean based on the sample
mean.
For the random samples we take from the population, we can compute the mean of the
sample means:

$$\mu_{\bar{x}} = \mu$$

and the standard deviation of the sample means:

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$

Before illustrating the use of the Central Limit Theorem (CLT) we will first illustrate the
result. In order for the result of the CLT to hold, the sample must be sufficiently large
(n ≥ 30). Again, there are two exceptions to this. If the population is normal, then the result
holds for samples of any size (i.e., the sampling distribution of the sample means will be
approximately normal even for samples of size less than 30); and if the population is
binomial, the result holds provided min(np, n(1 − p)) > 5.

CENTRAL LIMIT THEOREM WITH A NORMAL POPULATION

The figure below illustrates a normally distributed characteristic, X, in a population in
which the population mean is 75 with a standard deviation of 8.

If we take simple random samples (with replacement) of size n=10 from the population
and compute the mean for each of the samples, the distribution of sample means should
be approximately normal according to the Central Limit Theorem. Note that the sample
size (n=10) is less than 30, but the source population is normally distributed, so this is not
a problem. The distribution of the sample means is illustrated below. Note that the
horizontal axis is different from the previous illustration, and that the range is narrower.

The mean of the sample means is 75 and the standard deviation of the sample means is
2.53, computed as follows:

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{8}{\sqrt{10}} = 2.53$$

If we were to take samples of n=5 instead of n=10, we would get a similar distribution, but
the variation among the sample means would be larger. In fact, when we did this we got
a sample mean = 75 and a sample standard deviation = 3.6.
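
This behavior can be illustrated by simulation. The sketch below, written in Python with numpy (an assumed tool), draws 10,000 random samples of size n = 10 and n = 5 from a normal population with μ = 75 and σ = 8, and compares the spread of the resulting sample means with σ/√n:

import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 75, 8

for n in (10, 5):
    # 10,000 random samples of size n; one sample mean per row
    means = rng.normal(mu, sigma, size=(10_000, n)).mean(axis=1)
    # the empirical SD of the means should be close to sigma / sqrt(n)
    print(f"n={n}: mean of means = {means.mean():.2f}, "
          f"SD of means = {means.std():.2f}, sigma/sqrt(n) = {sigma/np.sqrt(n):.2f}")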

CENTRAL LIMIT THEOREM WITH A DICHOTOMOUS OUTCOME

Now suppose we measure a characteristic, X, in a population and that this characteristic


is dichotomous (e.g., success of a medical procedure: yes or no) with 30% of the
population classified as a success (i.e., p=0.30) as shown below.

The Central Limit Theorem applies even to binomial populations like this provided that
the minimum of np and n (1-p) is at least 5, where "n" refers to the sample size, and "p"

is the probability of "success" on any given trial. In this case, we will take samples of n=20
with replacement, so min (np, n (1-p)) = min (20(0.3), 20(0.7)) = min (6, 14) = 6. Therefore,
the criterion is met.

We saw previously that the population mean and standard deviation for a binomial
distribution are:

Mean binomial probability:

𝜇 = 𝑛𝑝

Standard deviation:

𝜎 = √𝑛𝑝𝑞

The distribution of sample means based on samples of size n=20 is shown below.

The mean of the sample means is

$$\mu_{\bar{x}} = np = 20(0.3) = 6$$

and the standard deviation of the sample means is:

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{\sqrt{npq}}{\sqrt{n}} = \frac{\sqrt{20(0.3)(0.7)}}{\sqrt{20}} = 0.46$$

Now, instead of taking samples of n=20, suppose we take simple random samples (with
replacement) of size n=10. Note that in this scenario we do not meet the sample size
requirement for the Central Limit Theorem (i.e., min(np, n(1 − p)) = min(10(0.3), 10(0.7))
= min(3, 7) = 3). The distribution of sample means based on samples of size n=10 is
therefore not quite normally distributed. The sample size must be larger in order for the
distribution to approach normality.

CENTRAL LIMIT THEOREM WITH A SKEWED DISTRIBUTION

The Poisson distribution is another probability model that is useful for modeling discrete
variables such as the number of events occurring during a given time interval. For
example, suppose you typically receive about 4 spam emails per day, but the number
varies from day to day. Today you happened to receive 5 spam emails. What is the
probability of that happening, given that the typical rate is 4 per day? The Poisson
probability is:
$$P(x;\mu) = \frac{(e^{-\mu})(\mu^{x})}{x!}$$

Mean = μ

Standard deviation: $\sigma = \sqrt{\mu}$

The mean for the distribution is μ (the average or typical rate), "X" is the actual number
of events that occur ("successes"), and "e" is the constant approximately equal to
2.71828. So, in the example above

$$P(5;4) = \frac{(2.71828^{-4})(4^{5})}{5!} = 0.1563$$
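
As a quick check, this Poisson probability can be reproduced with scipy (assumed available):

from scipy.stats import poisson

# P(X = 5) when the average rate is 4 per day
print(poisson.pmf(5, mu=4))   # about 0.1563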

Now let's consider another Poisson distribution, with μ = 3 and σ = 1.73. The distribution is
shown in the figure below.

This population is not normally distributed, but the Central Limit Theorem will apply if
n ≥ 30. In fact, if we take samples of size n=30, we obtain sample means distributed with a
mean of 3 and a standard deviation of 0.32. In contrast, with small samples of n=10, the
variation among the sample means is larger. Note that n=10 does not meet the criterion
for the Central Limit Theorem, and such small samples give a distribution of sample means
that is not quite normal. Also note that the standard deviation of the sample means (also
called the "standard error") is larger with smaller samples, because it is obtained by
dividing the population standard deviation by the square root of the sample size. Another
way of thinking about this is that extreme values will have less impact on the sample mean
when the sample size is large.

$$n = 30: \quad \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{1.73}{\sqrt{30}} = 0.3158$$

$$n = 10: \quad \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{1.73}{\sqrt{10}} = 0.5471$$

EXAMPLE 1:

Data from the Framingham Heart Study found that subjects over age 50 had a mean HDL
of 54 and a standard deviation of 17. Suppose a physician has 40 patients over age 50
and wants to determine the probability that the mean HDL cholesterol for this sample of
40 men is 60 mg/dl or more (i.e., low risk). Probability questions about a sample mean
can be addressed with the Central Limit Theorem, as long as the sample size is
sufficiently large. In this case n=40, so the sample mean is likely to be approximately
normally distributed, and we can compute the probability that the sample mean HDL
exceeds 60 by using the standard normal distribution table.
The population mean is 54, but the question is: what is the probability that the sample
mean will be greater than 60?

In general,

$$z = \frac{x - \mu}{\sigma}$$

the standard deviation of the sample mean is

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$

Therefore, the formula to standardize a sample mean is:

$$z = \frac{\bar{x} - \mu_{\bar{x}}}{\sigma_{\bar{x}}} = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$$

And in this case:

$$z = \frac{60 - 54}{17/\sqrt{40}} = 2.23$$
P(z > 2.23) can be looked up in the standard normal distribution table, and because we
want the probability that z exceeds 2.23, we compute it as P(z > 2.23) = 1 − 0.9871 = 0.0129.
Therefore, the probability that the mean HDL in these 40 patients will exceed 60 is 1.29%.

EXAMPLE 2:

Assume that the weight of 2,000 students from TUP are normally distributed with a mean
of 45 kgs and standard deviation of 2 kgs. If 100 samples consisting of 25 students each

are obtained, what would be the expected mean and standard deviation of the resulting
sampling distribution of means if sampling were done with replacement? What is the
probability that the mean is a) between 44.7 and 46.1 kgs? b) less than 46 kgs?

$$\mu_{\bar{x}} = \mu = 45, \qquad \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{2}{\sqrt{25}} = 0.4$$

a) $$P(44.7 \le \bar{x} \le 46.1) = P\left(\frac{44.7 - 45}{0.4} \le z \le \frac{46.1 - 45}{0.4}\right) = P(-0.75 \le z \le 2.75) = 0.2734 + 0.4970 = 0.7704 \text{ or } 77.04\%$$

b) $$P(\bar{x} \le 46 \text{ kgs}) = P\left(z \le \frac{46 - 45}{0.4}\right) = P(z \le 2.5) = 0.9938 \text{ or } 99.38\%$$
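
Both examples reduce to standardizing a sample mean and reading a normal probability. As a quick software check, here is a minimal sketch in Python with scipy (an assumption about available tooling):

from math import sqrt
from scipy.stats import norm

# Example 1: P(sample mean HDL > 60) with mu = 54, sigma = 17, n = 40
z = (60 - 54) / (17 / sqrt(40))
print(round(z, 2), round(norm.sf(z), 4))   # z ~ 2.23, probability ~ 0.013

# Example 2: mu = 45, sigma = 2, n = 25, so the standard error is 0.4
se = 2 / sqrt(25)
print(round(norm.cdf(46.1, 45, se) - norm.cdf(44.7, 45, se), 4))  # ~0.7704
print(round(norm.cdf(46, 45, se), 4))                             # ~0.9938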

Activity:

1) Given a standard normal distribution, find the area under the curve that lies (a) to the
right of z = 1.84, and (b) between z = – 1.97 and z = 0.86.

2) Given a standard normal distribution, find the value of k such that P (z > k) = 0.3015.

3) Given a standard normal distribution, find the value of k such that P(k < z < −0.18) = 0.4197.

4) Given a random variable x having a normal distribution with μ = 50 and σ = 10, find the
probability that x assumes a value between 45 and 62.

5) Given that x has a normal distribution with μ = 300 and σ = 50, find the probability that
x assumes a value greater than 362.

6) Given a normal distribution with μ = 40 and σ = 6, find the value of x that has (a)
45% of the area to the left, and (b) 14% of the area to the right.

7) A certain type of storage battery lasts, on average, 3.0 years, with a standard
deviation of 0.5 year. Assuming that the battery lives are normally distributed, find the
probability that a given battery will last less than 2.3 years.

8) An electrical firm manufactures light bulbs that have a life, before burn-out, that is
normally distributed with mean equal to 800 hours and a standard deviation of 40
hours. Find the probability that a bulb burns between 778 and 834 hours.

9) A certain machine makes electrical resistors having a mean resistance of 40 ohms


and a standard deviation of 2 ohms. Assuming the resistance follows a normal
distribution and can be measured to any degree of accuracy, what percentage of
resistors will have a resistance exceeding 43 ohms?

10) In an industrial process the diameter of a ball bearing is an important component part.
The buyer sets specifications on the diameter to be 3.0 ± 0.01 cm. The implication is
that no part falling outside these specifications will be accepted. It is known that in the
process the diameter of a ball bearing has a normal distribution with mean 3.0 and
standard deviation 0.005. On the average, how many manufactured ball bearings will
be scrapped? Hint: x1 = 3.0 − 0.01, and x2 = 3.0 + 0.01.

Session 2
Estimation
By the end of this session you should be able to:
1. Explain the concept of an interval estimate of the population mean.
2. Apply formulas for a confidence interval for a population mean or proportion, and
for the difference between two population means or proportions.

Lecture:

INTRODUCTION

In statistics, estimation refers to the process by which one makes inferences about a
population, based on information obtained from a sample.

POINT ESTIMATE VS. INTERVAL ESTIMATE


Statisticians use sample statistics to estimate population parameters. For example,
sample means are used to estimate population means; sample proportions, to estimate
population proportions.

An estimate of a population parameter may be expressed in two ways:


 Point estimate. A point estimate of a population parameter is a single value of a
statistic. For example, the sample mean x̄ is a point estimate of the population
mean μ. Similarly, the sample proportion p̂ is a point estimate of the population
proportion p.
 Interval estimate. An interval estimate is defined by two numbers, between which
a population parameter is said to lie. For example, a < x < b is an interval estimate
of the population mean μ. It indicates that the population mean is greater than a
but less than b.

CONFIDENCE INTERVALS
Statisticians use a confidence interval to express the precision and uncertainty associated
with a particular sampling method. A confidence interval consists of three parts.
 A confidence level.
 A statistic.
 A margin of error.

The confidence level describes the uncertainty of a sampling method. The statistic and
the margin of error define an interval estimate that describes the precision of the method.

The interval estimate of a confidence interval is defined by the sample statistic ± margin
of error.
For example, suppose we compute an interval estimate of a population parameter. We
might describe this interval estimate as a 95% confidence interval. This means that if we
used the same sampling method to select different samples and compute different interval
estimates, the true population parameter would fall within a range defined by the sample
statistic ± margin of error 95% of the time.

Confidence intervals are preferred to point estimates, because confidence intervals
indicate (a) the precision of the estimate and (b) the uncertainty of the estimate.

CONFIDENCE LEVEL

The probability part of a confidence interval is called a confidence level. The confidence
level describes the likelihood that a particular sampling method will produce a confidence
interval that includes the true population parameter.
Here is how to interpret a confidence level. Suppose we collected all possible samples
from a given population, and computed confidence intervals for each sample. Some
confidence intervals would include the true population parameter; others would not. A
95% confidence level means that 95% of the intervals contain the true population
parameter; a 90% confidence level means that 90% of the intervals contain the population
parameter; and so on.

MARGIN OF ERROR

In a confidence interval, the range of values above and below the sample statistic is called
the margin of error.

For example, suppose the local newspaper conducts an election survey and reports that
the independent candidate will receive 30% of the vote. The newspaper states that the
survey had a 5% margin of error and a confidence level of 95%. These findings result in
the following confidence interval: We are 95% confident that the independent candidate
will receive between 25% and 35% of the vote.

Note: Many public opinion surveys report interval estimates, but not confidence intervals.
They provide the margin of error, but not the confidence level. To clearly interpret survey
results you need to know both! We are much more likely to accept survey findings if the
confidence level is high (say, 95%) than if it is low (say, 50%).

ESTIMATING µ WITH LARGE SAMPLES

The most fundamental point and interval estimation process involves the estimation of a
population mean. Suppose it is of interest to estimate the population mean, μ, for a
quantitative variable. Data collected from a simple random sample can be used to
compute the sample mean, x̄, where the value of x̄ provides a point estimate of μ.

When the sample mean is used as a point estimate of the population mean, some error
can be expected owing to the fact that a sample, or subset of the population, is used to
compute the point estimate. The absolute value of the difference between the sample
mean, x̄, and the population mean, μ, written |x̄ − μ|, is called the sampling error. Interval
estimation incorporates a probability statement about the magnitude of the sampling
error. The sampling distribution of x̄ provides the basis for such a statement.

Statisticians have shown that the mean of the sampling distribution of x̄ is equal to the
population mean, μ, and that the standard deviation is given by $\sigma/\sqrt{n}$, where σ is the
population standard deviation. The standard deviation of a sampling distribution is called
the standard error. For large sample sizes, the central limit theorem indicates that the
sampling distribution of x̄ can be approximated by a normal probability distribution. As a
matter of practice, statisticians usually consider samples of size 30 or more to be large.
• the reliability of an estimate will be measured by the confidence level, c
• zc is the critical value for a confidence level of c

Level of confidence, c    Critical value, zc
0.75                      1.150
0.80                      1.280
0.85                      1.440
0.90                      1.645
0.95                      1.960
0.98                      2.330
0.99                      2.580

ERROR OF ESTIMATE

E is the maximal error tolerance for the error of estimate at a given confidence level c,
computed using the formula:

$$E = z_c \frac{s}{\sqrt{n}}$$

The confidence interval for μ is:

$$\bar{x} - E < \mu < \bar{x} + E$$

EXAMPLE:

Julia enjoys jogging. She has been jogging over a period of several years, during which
time her physical condition has remained constantly good. Usually she jogs 2 miles per
day. During the past year Julia has sometimes recorded her times required to run 2 miles.
She has a sample of 90 of these times. For these 90 times the mean was 15.60 minutes
and the standard deviation was 1.80. Find a 0.95 confidence interval for µ.

$$E = z_c \frac{s}{\sqrt{n}} = 1.96\,\frac{1.80}{\sqrt{90}} = 0.37$$

$$\bar{x} - E < \mu < \bar{x} + E$$
$$15.60 - 0.37 < \mu < 15.60 + 0.37$$
$$15.23 < \mu < 15.97$$
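
The same interval can be produced in a few lines of Python (scipy assumed), with the critical value computed from the confidence level rather than read from the table:

from math import sqrt
from scipy.stats import norm

xbar, s, n, c = 15.60, 1.80, 90, 0.95
zc = norm.ppf(1 - (1 - c) / 2)      # two-sided critical value, ~1.96
E = zc * s / sqrt(n)                # margin of error, ~0.37
print(round(xbar - E, 2), round(xbar + E, 2))   # ~15.23, ~15.97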

ESTIMATING µ WITH SMALL SAMPLES

If the population standard deviation is unknown and the sample size n is small then when
we substitute the sample standard deviation s for σ the normal approximation is no longer
valid. The solution is to use a different distribution, called Student’s t-distribution with n−1
degrees of freedom. Student’s t-distribution is very much like the standard normal
distribution in that it is centered at 0 and has the same qualitative bell shape, but it has
heavier tails than the standard normal distribution does, as indicated by the figure below,
in which the curve (in brown) that meets the dashed vertical line at the lowest point is the
t-distribution with two degrees of freedom, the next curve (in blue) is the t-distribution with
five degrees of freedom, and the thin curve (in red) is the standard normal distribution. As
also indicated by the figure, as the sample size n increases, Student’s t-distribution ever
more closely resembles the standard normal distribution. Although there is a different t-
distribution for every value of n, once the sample size is 30 or more it is typically
acceptable to use the standard normal distribution instead, as we will always do in this
text.

Student’s t distribution

Just as the symbol zc stands for the critical value of the standard normal distribution for a
confidence level c, the symbol tc stands for the corresponding critical value of Student's
t-distribution. This gives us the following confidence interval formulas.
$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$$

Degrees of freedom: d.f. = n − 1

$$E = t_c \frac{s}{\sqrt{n}}$$

The confidence interval for μ is:

$$\bar{x} - E < \mu < \bar{x} + E$$

EXAMPLE:

A company has a new process for manufacturing large artificial sapphires. The production
of each gem is expensive, so the number available for examination is limited. In a trial run
12 sapphires are produced. The mean weight for these 12 gems is 6.75 carats, and the
sample standard deviation is 0.33 carats. Find a 95% confidence interval for the mean.
$$t_c = 2.201 \quad \text{with d.f.} = 12 - 1 = 11$$

$$E = t_c \frac{s}{\sqrt{n}} = 2.201\,\frac{0.33}{\sqrt{12}} = 0.21$$

$$\bar{x} - E < \mu < \bar{x} + E$$
$$6.75 - 0.21 < \mu < 6.75 + 0.21$$
$$6.54 < \mu < 6.96$$
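
A sketch of the same computation using scipy's t-distribution (assumed available) to look up tc:

from math import sqrt
from scipy.stats import t

xbar, s, n, c = 6.75, 0.33, 12, 0.95
tc = t.ppf(1 - (1 - c) / 2, df=n - 1)   # ~2.201 for 11 degrees of freedom
E = tc * s / sqrt(n)                    # margin of error, ~0.21 carat
print(round(xbar - E, 2), round(xbar + E, 2))   # ~6.54, ~6.96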

ESTIMATING P IN BINOMIAL DISTRIBUTION

The binomial distribution is completely determined by the number of trials n and the
probability p of success in a single trial. For most experiments, the number of trials is
chosen in advance. Then the distribution is completely determined by p.
Point estimate for p:

$$\hat{p} = \frac{r}{n}, \qquad \hat{q} = 1 - \hat{p}$$

$$E = z_c \sqrt{\frac{\hat{p}\hat{q}}{n}}$$

$$\hat{p} - E < p < \hat{p} + E$$

E = maximal error tolerance of the error of estimate |𝑝̂ − 𝑝| for a confidence level c.

EXAMPLE:

A random sample of 188 books purchased showed that 66 of the books were murder
mysteries.
What is the point estimate for p?
Find a 90% confidence interval for p.

$$\hat{p} = \frac{r}{n} = \frac{66}{188} = 0.35, \qquad \hat{q} = 1 - \hat{p} = 1 - 0.35 = 0.65$$

$$E = z_c \sqrt{\frac{\hat{p}\hat{q}}{n}} = 1.645\sqrt{\frac{(0.35)(0.65)}{188}} = 0.06$$

$$\hat{p} - E < p < \hat{p} + E$$
$$0.35 - 0.06 < p < 0.35 + 0.06$$
$$0.29 < p < 0.41$$
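
A corresponding sketch for the proportion interval (Python with scipy assumed):

from math import sqrt
from scipy.stats import norm

r, n, c = 66, 188, 0.90
p_hat = r / n                            # point estimate, ~0.35
zc = norm.ppf(1 - (1 - c) / 2)           # ~1.645
E = zc * sqrt(p_hat * (1 - p_hat) / n)   # ~0.06
print(round(p_hat - E, 2), round(p_hat + E, 2))   # ~0.29, ~0.41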

ESTIMATING THE DIFFERENCE BETWEEN TWO POPULATION MEANS

Suppose we wish to compare the means of two distinct populations. The figure below
illustrates the conceptual framework of our investigation in this and the next section. Each
population has a mean and a standard deviation. We arbitrarily label one population as
Population 1 and the other as Population 2, and subscript the parameters with the
numbers 1 and 2 to tell them apart. We draw a random sample from Population 1 and
label the sample statistics it yields with the subscript 1. Without reference to the first

sample we draw a sample from Population 2 and label its sample statistics with the
subscript 2.

Our goal is to use the information in the samples to estimate the difference μ1−μ2 in the
means of the two populations and to make statistically valid inferences about it.
POINT ESTIMATION OF (𝜇1 − 𝜇2 ) LARGE SAMPLES

$$E = z_c \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$

$$(\bar{x}_1 - \bar{x}_2) - E < \mu_1 - \mu_2 < (\bar{x}_1 - \bar{x}_2) + E$$
EXAMPLE:

To compare customer satisfaction levels of two competing cable television companies,
174 customers of Company 1 and 355 customers of Company 2 were randomly selected
and were asked to rate their cable companies on a five-point scale, with 1 being least
satisfied and 5 most satisfied. The survey results are summarized in the following table:

Company 1 Company 2
n1 = 174 n2 = 355
𝑥̅ 1 = 3.51 𝑥̅ 2 = 3.24
s1 = 0.51 s2 = 0.52

Construct a point estimate and a 99% confidence interval for μ1−μ2, the difference in
average satisfaction levels of customers of the two companies as measured on this five-
point scale.
$$E = z_c \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} = 2.58\sqrt{\frac{0.51^2}{174} + \frac{0.52^2}{355}} = 0.123$$

$$(\bar{x}_1 - \bar{x}_2) - E < \mu_1 - \mu_2 < (\bar{x}_1 - \bar{x}_2) + E$$
$$(3.51 - 3.24) - 0.123 < \mu_1 - \mu_2 < (3.51 - 3.24) + 0.123$$
$$0.147 < \mu_1 - \mu_2 < 0.393$$
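
The same interval in code (Python, scipy not required since the table value zc = 2.58 is used directly):

from math import sqrt

n1, x1, s1 = 174, 3.51, 0.51
n2, x2, s2 = 355, 3.24, 0.52
zc = 2.58                                # from the zc table for c = 0.99
E = zc * sqrt(s1**2 / n1 + s2**2 / n2)   # ~0.123
d = x1 - x2
print(round(d - E, 3), round(d + E, 3))  # ~0.147, ~0.393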

POINT ESTIMATION OF (𝝁𝟏 − 𝝁𝟐 ) SMALL SAMPLES

$$s = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$$

$$E = t_c\, s \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$

$$(\bar{x}_1 - \bar{x}_2) - E < \mu_1 - \mu_2 < (\bar{x}_1 - \bar{x}_2) + E$$

EXAMPLE:

Given the following data compute for 90% confidence interval.

Group I
16 19.6 19.9 20.9 20.1 20.1 16.4 20.6
20.1 22.3 18.8 19.1 17.4 21.1 22.1

Group II
8.2 5.4 6.8 6.5 4.7 5.9 2.9 7.6
10.2 6.4 8.8 5.4 8.3 5.4

Group 1 Group 2
n1=15 n2=14
𝑥̅ 1=19.63 𝑥̅ 2=6.61
s1=1.86 s2=1.89

$$s = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}} = \sqrt{\frac{(15 - 1)(1.86)^2 + (14 - 1)(1.89)^2}{15 + 14 - 2}} = 1.87$$

$$t_c = 1.703, \quad d.f. = 15 + 14 - 2 = 27$$

$$E = t_c\, s\sqrt{\frac{1}{n_1} + \frac{1}{n_2}} = (1.703)(1.87)\sqrt{\frac{1}{15} + \frac{1}{14}} = 1.18$$

$$(\bar{x}_1 - \bar{x}_2) - E < \mu_1 - \mu_2 < (\bar{x}_1 - \bar{x}_2) + E$$
$$(19.63 - 6.61) - 1.18 < \mu_1 - \mu_2 < (19.63 - 6.61) + 1.18$$
$$11.84 < \mu_1 - \mu_2 < 14.20$$
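
And a sketch of the pooled small-sample case (Python, scipy assumed; small rounding differences from the hand computation are expected):

from math import sqrt
from scipy.stats import t

n1, x1, s1 = 15, 19.63, 1.86
n2, x2, s2 = 14, 6.61, 1.89
sp = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))  # pooled SD, ~1.87
tc = t.ppf(1 - (1 - 0.90) / 2, df=n1 + n2 - 2)                    # ~1.703
E = tc * sp * sqrt(1 / n1 + 1 / n2)                               # ~1.18
d = x1 - x2
print(round(d - E, 2), round(d + E, 2))   # ~11.8 and ~14.2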

POINT ESTIMATION OF (𝒑𝟏 − 𝒑𝟐 ) LARGE SAMPLES

$$E = z_c \sqrt{\frac{\hat{p}_1\hat{q}_1}{n_1} + \frac{\hat{p}_2\hat{q}_2}{n_2}}$$

$$(\hat{p}_1 - \hat{p}_2) - E < p_1 - p_2 < (\hat{p}_1 - \hat{p}_2) + E$$

EXAMPLE:

The book Survey Responses: An Evaluation of Their Validity, by E.J. Wentland and K.
Smith, includes studies reporting the accuracy of answers to questions from surveys. A
study by Locander et al. considered the question, "Are you a registered voter?" Accuracy
of response was confirmed by a check of city voting records. Two methods of survey were
used: a face-to-face interview and a telephone interview. A random sample of 93 people
was asked the question during a telephone interview; 84 respondents gave accurate
answers. Another random sample of 83 people was asked the voter registration question
face to face; 69 respondents gave accurate answers. Find a 95% confidence interval for
p1 − p2.

$$\hat{p}_1 = \frac{84}{93} = 0.90, \quad \hat{q}_1 = 0.10$$
$$\hat{p}_2 = \frac{69}{83} = 0.83, \quad \hat{q}_2 = 0.17$$

$$E = z_c \sqrt{\frac{\hat{p}_1\hat{q}_1}{n_1} + \frac{\hat{p}_2\hat{q}_2}{n_2}} = 1.96\sqrt{\frac{(0.90)(0.10)}{93} + \frac{(0.83)(0.17)}{83}} = 0.10$$

$$(\hat{p}_1 - \hat{p}_2) - E < p_1 - p_2 < (\hat{p}_1 - \hat{p}_2) + E$$
$$(0.90 - 0.83) - 0.10 < p_1 - p_2 < (0.90 - 0.83) + 0.10$$
$$-0.03 < p_1 - p_2 < 0.17$$
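
A sketch of the difference-of-proportions interval in Python (standard library only):

from math import sqrt

r1, n1 = 84, 93
r2, n2 = 69, 83
p1, p2 = r1 / n1, r2 / n2
zc = 1.96   # critical value for a 95% confidence level
E = zc * sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)   # ~0.10
d = p1 - p2
print(round(d - E, 2), round(d + E, 2))   # ~ -0.03 and ~0.17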

SAMPLE SIZE

SLOVIN'S FORMULA

It is used to calculate the sample size (n) given the population size (N) and a margin of
error (e). It is a formula for estimating the sample size needed when sampling at random.
It is computed as

$$n = \frac{N}{1 + Ne^2}$$

where:
n = number of samples
N = total population
e = margin of error

WHEN TO USE SLOVIN'S FORMULA

If a sample is taken from a population, a formula must be used to take into account
confidence levels and margins of error. When taking statistical samples, sometimes a lot
is known about a population, sometimes a little and sometimes nothing at all. For
example, we may know that a population is normally distributed (e.g., for heights, weights
or IQs), we may know that there is a bimodal distribution (as often happens with class
grades in mathematics classes) or we may have no idea about how a population is going
to behave (such as polling college students to get their opinions about quality of student
life). Slovin's formula is used when nothing about the behavior of a population is known
at all.

EXAMPLE:

To use the formula, first figure out what you want your margin of error to be. For
example, you may be happy with a confidence level of 95 percent (giving a margin of error
of 0.05), or you may require a tighter accuracy of a 98 percent confidence level (a margin
of error of 0.02). Plug your population size and required margin of error into the formula.
The result will be the number of samples you need to take.

In research methodology, for example, suppose N = 1000 and e = 0.05:

$$n = \frac{N}{1 + Ne^2} = \frac{1000}{1 + 1000(0.05)^2} = 285.71 \approx 286$$
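
Slovin's formula is a one-liner in code; a minimal sketch in Python:

import math

def slovin(N, e):
    # Slovin's formula: sample size for population N and margin of error e
    return math.ceil(N / (1 + N * e**2))

print(slovin(1000, 0.05))   # 286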

SAMPLE SIZE FOR ESTIMATING µ


$$n = \left(\frac{z_c\,\sigma}{E}\right)^2$$

A wildlife study is designed to find the mean weight of salmon caught by an Alaskan
fishing company. As a preliminary study, a random sample of 50 freshly caught salmon
is weighed. The sample standard deviation is 2.15lbs. How large a sample should be
taken to be 99% confident that the sample mean is within 0.20lb of the true mean weight?
$$n = \left(\frac{z_c\,\sigma}{E}\right)^2 = \left(\frac{2.58 \times 2.15}{0.20}\right)^2 = 769.23 \approx 770 \text{ salmon}$$
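
In code (Python), using the table value zc = 2.58 for a 99% confidence level and rounding up to a whole salmon:

import math

zc, sigma, E = 2.58, 2.15, 0.20   # zc from the critical-value table for c = 0.99
n = (zc * sigma / E) ** 2         # 769.23...
print(math.ceil(n))               # 770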

SAMPLE SIZE FOR ESTIMATING p

$$n = \frac{1}{4}\left(\frac{z_c}{E}\right)^2$$

A company is in the business of selling wholesale popcorn to grocery stores. The company
buys directly from farmers. A buyer for the company is examining a large amount of corn
from a certain farmer. Before the purchase is made, the buyer wants to estimate p, the
probability that a kernel will pop.

EXAMPLE:

The buyer wants to be 95% sure that the point estimate for p will be in error either way
by less than 0.01.

$$n = \frac{1}{4}\left(\frac{z_c}{E}\right)^2 = \frac{1}{4}\left(\frac{1.96}{0.01}\right)^2 = 9604$$

Let us consider the scenario where the buyer already has a preliminary estimate of 90%
that a kernel will pop.

$$n = p(1 - p)\left(\frac{z_c}{E}\right)^2 = 0.90(1 - 0.90)\left(\frac{1.96}{0.01}\right)^2 = 3457.44 \approx 3458$$
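
Both popcorn computations, sketched in Python:

import math

zc, E = 1.96, 0.01

# no preliminary estimate: use the conservative bound p(1 - p) <= 1/4
print(math.ceil(0.25 * (zc / E) ** 2))          # 9604

# with a preliminary estimate p = 0.90
p = 0.90
print(math.ceil(p * (1 - p) * (zc / E) ** 2))   # 3458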

Activity:

1) The average zinc concentration recovered from a sample of zinc measurements in
36 different locations in a river is found to be 2.6 grams per milliliter. Find the 95% and 99%
confidence intervals for the mean zinc concentration in the river. Assume that the
population standard deviation is 0.3.

2) How large a sample is required in Problem [1] if we want to be 95% confident that our
estimate of μ is off by less than 0.05?

3) The contents of 7 similar containers of sulfuric acid are 9.8, 10.2, 10.4, 9.8, 10.0, 10.2,
and 9.6 liters. Find a 95% confidence interval for the mean of all such containers,
assuming an approximate normal distribution.

4) In a random sample of n = 500 families owning television sets in the city of Hamilton,
Canada, it is found that x = 340 subscribed to HBO. Find a 95% confidence interval
for the actual proportion of families in this city who subscribe to HBO.

5) How large a sample is required in Problem [4] if we want to be at least 95% confident
that our estimate of p is within 0.02?

6) An experiment was conducted in which two types of engines, A and B, were
compared. Gas mileage in miles per gallon was measured. Fifty experiments were
conducted using engine type A and 75 experiments were done for engine type B. The
gasoline used and other conditions were held constant. The average gas mileage for
engine A was 36 miles per gallon and the average for engine B was 42 miles per
gallon. Find the 96% confidence interval on μB − μA, where μB and μA are the population
mean gas mileages for engines B and A, respectively. Assume that the population
standard deviations are 6 and 8 for engines A and B, respectively.

7) A study was conducted by the Department of Zoology at the Virginia Polytechnic
Institute and State University to estimate the difference in the amount of the chemical
orthophosphorus measured at two different stations on the James River.
Orthophosphorus is measured in milligrams per liter. Fifteen samples were collected
from station 1 and 12 samples were collected from station 2. The 15 samples from
station 1 had an average orthophosphorus content of 3.84 milligrams per liter and a
standard deviation of 3.07 milligrams per liter, while the 12 samples from station 2
had an average content of 1.49 milligrams per liter and a standard deviation of 0.80
milligrams per liter. Find a 95% confidence interval for the difference in the true
average orthophosphorus contents at these two stations, assuming that the
observations came from normal populations with different variances.

8) A certain change in a process for the manufacture of component parts is being
considered. Samples are taken using both the existing and the new procedure so as
to determine if the new process results in an improvement. If 75 of 1500 items from
the existing procedure were found to be defective and 80 of 2000 items from the new
procedure were found to be defective, find a 90% confidence interval for the true
difference in the fraction of defectives between the existing and the new process.

Test of Hypothesis
______________________________________________
MODULE 6

Session 1
Introduction to Hypothesis Testing

By the end of this session you should be able to:


1. Define hypothesis and hypothesis testing.
2. Discuss the procedure in the test of statistical hypothesis.
3. Differentiate parametric and non-parametric statistics.
4. Formulate and test statistical hypotheses.
5. Discuss and apply the principles embodied in parametric and non-parametric
statistics.

Lecture:

INTRODUCTION

The statistician R. Fisher explained the concept of hypothesis testing with a story of a
lady tasting tea. Here we will present an example based on James Bond who insisted
that martinis should be shaken rather than stirred. Let's consider a hypothetical
experiment to determine whether Mr. Bond can tell the difference between a shaken and
a stirred martini. Suppose we gave Mr. Bond a series of 16 taste tests. In each test, we
flipped a fair coin to determine whether to stir or shake the martini. Then we presented
the martini to Mr. Bond and asked him to decide whether it was shaken or stirred. Let's
say Mr. Bond was correct on 13 of the 16 taste tests. Does this prove that Mr. Bond has
at least some ability to tell whether the martini was shaken or stirred?

This result does not prove that he does; it could be he was just lucky and guessed right
13 out of 16 times. But how plausible is the explanation that he was just lucky? To assess
its plausibility, we determine the probability that someone who was just guessing would
be correct 13/16 times or more. This probability can be computed from the binomial
distribution, and the binomial distribution calculator shows it to be 0.0106. This is a pretty
low probability, and therefore someone would have to be very lucky to be correct 13 or
more times out of 16 if they were just guessing. So either Mr. Bond was very lucky, or he
can tell whether the drink was shaken or stirred. The hypothesis that he was guessing is
not proven false, but considerable doubt is cast on it. Therefore, there is strong evidence
that Mr. Bond can tell whether a drink was shaken or stirred.

STATISTICAL HYPOTHESIS

It is an assertion or conjecture concerning one or more populations. This assertion may
or may not be true. The best way to determine whether a statistical hypothesis is true is
to examine the entire population.
It is also known as an educated guess about something in the world around you. It should
be testable, either by experiment or observation. For example:

 A new medicine you think might work.
 A way of teaching you think might be better.
 A possible location of new species.
 A fairer way to administer standardized tests.

It can really be anything at all as long as you can put it to the test.

HYPOTHESIS TESTING

Hypothesis testing in statistics is a way for you to test the results of a survey or experiment
to see if you have meaningful results. You’re basically testing whether your results are
valid by figuring out the odds that your results have happened by chance. If your results
may have happened by chance, the experiment won’t be repeatable and so has little use.

Hypothesis testing can be one of the most confusing aspects for students, mostly
because before you can even perform a test, you have to know what your null hypothesis
is. Often, those tricky word problems that you are faced with can be difficult to decipher.
But it’s easier than you think; all you need to do is follow the steps in hypothesis testing:
• Problem identification
• State the null hypothesis (Ho) and the alternative hypothesis (Ha)
• Choose the level of significance (α)
• Select the appropriate test statistic and establish the critical region.
• Collect the data and compute the value of the test statistic from the sample data.
• Make the decision. Reject Ho if the value of the test statistic belongs to the critical
region. Otherwise, fail to reject Ho.
• Draw conclusion about the population

TWO TYPES OF STATISTICAL HYPOTHESES

NULL HYPOTHESIS (HO)


The hypothesis that an apparent effect is due to chance is called the null hypothesis. It is
the commonly accepted fact and the opposite of the alternative hypothesis. Researchers
work to reject, nullify or disprove the null hypothesis. Researchers come up with an
alternative hypothesis, one that they think explains a phenomenon, and then work to
reject the null hypothesis.

ALTERNATIVE HYPOTHESIS (HA)

It is the operational statement of the theory that the investigator believes to be true and
wishes to prove. This is the hypothesis that sample observations are influenced by some
non-random cause. It is the contradiction of the null hypothesis.
EXAMPLE 1:

It's an accepted fact that ethanol boils at 173.1°F; you have a theory that ethanol actually
has a different boiling point, of over 174°F. The accepted fact ("ethanol boils at 173.1°F")
is the null hypothesis; your theory ("ethanol boils at over 174°F") is the alternative
hypothesis.

EXAMPLE 2:
A classroom full of students at a certain elementary school is performing at lower than
average levels on standardized tests. The low test scores are thought to be due to poor
teacher performance. However, you have a theory that the students are performing poorly
because their classroom is not as well ventilated as the other classrooms in the school.
The accepted theory (“low test scores are due to poor teacher performance”) is the null
hypothesis; your theory (“low test scores are due to inadequate ventilation in the
classroom”) is the alternative hypothesis.

CRITICAL REGION OR REJECTION REGION

It is the set of values of the test statistic for which the null hypothesis will be rejected and
the acceptance region is the set of values of the test statistic for which the null hypothesis
will not be rejected. The rejection and acceptance regions are separated by a critical value
of the test statistic.

ONE TAILED AND TWO TAILED TESTS

The statistical tests used will be one tailed or two tailed depending on the nature of the
null hypothesis and the alternative hypothesis. The following hypothesis applies to test
for the mean:

One tailed test:
Ho: μ = μ0    Ha: μ > μ0 (or Ha: μ < μ0)

Two tailed test:
Ho: μ = μ0    Ha: μ ≠ μ0

When we are interested only in extreme values that are greater than or less than a
comparative value (μ0), we use a one tailed test to test for significance. When we are
interested in determining that things are different or not equal, we use a two tailed test.

TEST STATISTIC

It is a value calculated from sample measurements and on which the statistical decision
will be based. This involves specifying the statistic that will be used to assess the validity
of the null hypothesis.

DECISION RULE

It is defined as a procedure that a researcher uses to decide whether to reject or not to
reject the null hypothesis.

There are two types of errors that can result from a decision rule; Type I and Type II error.

The chances of committing these two types of errors are inversely proportional: that is,
decreasing type I error rate increases type II error rate, and vice versa.
A type I error is also known as a false positive and occurs when a researcher incorrectly
rejects a true null hypothesis. This means that you report that your findings are significant
when in fact they have occurred by chance.

A type II error is also known as a false negative and occurs when a researcher fails to
reject a null hypothesis which is really false. Here a researcher concludes there is not a
significant effect, when actually there really is.
The probability of making a type I error is represented by your alpha level (α), which is
the p-value threshold below which you reject the null hypothesis. An alpha of 0.05 indicates
that you are willing to accept a 5% chance that you are wrong when you reject the null
hypothesis.
You can reduce your risk of committing a type I error by using a lower value for α. For
example, an alpha of 0.01 would mean there is a 1% chance of committing a type I error.

However, using a lower value for alpha means that you will be less likely to detect a true
difference if one really exists (thus risking a type II error).
The probability of making a type II error is called Beta (β), and this is related to the power
of the statistical test (power = 1- β). You can decrease your risk of committing a type II
error by ensuring your test has enough power.
You can do this by ensuring your sample size is large enough to detect a practical
difference when one truly exists.

SELECTION OF STATISTICAL TESTS

The statistical tests are classified into parametric and non-parametric tests. Parametric
tests assume underlying statistical distributions in the data. Therefore, several conditions
of validity must be met so that the result of a parametric test is reliable. For example,
Student’s t-test for two independent samples is reliable only if each sample follows a
normal distribution and if sample variances are homogeneous. Nonparametric tests do
not rely on any distribution. They can thus be applied even if parametric conditions of
validity are not met. Parametric tests often have nonparametric equivalents. Nonparametric
tests are more robust than parametric tests. In other words, they are valid in a broader
range of situations (fewer conditions of validity).

The advantage of using a parametric test instead of a nonparametric equivalent is that
the former will have more statistical power than the latter. In other words, a parametric
test is more able to lead to a rejection of H0. Most of the time, the p-value associated with
a parametric test will be lower than the p-value associated with a nonparametric equivalent
that is run on the same data.

P-VALUE

The P value, or calculated probability, is the probability of finding the observed, or more
extreme, results when the null hypothesis (H0) of a study question is true – the definition
of ‘extreme’ depends on how the hypothesis is being tested. P is also described in terms
of rejecting H0 when it is actually true, however, it is not a direct probability of this state.

The null hypothesis is usually a hypothesis of "no difference" e.g. no difference between
blood pressures in group A and group B. Define a null hypothesis for each study question
clearly before the start of your study.

The only situation in which you should use a one sided P value is when a large change
in an unexpected direction would have absolutely no relevance to your study. This
situation is unusual; if you are in any doubt then use a two sided P value.

The term significance level (alpha) is used to refer to a pre-chosen probability and the
term "P value" is used to indicate a probability that you calculate after a given study.

The alternative hypothesis (H1) is the opposite of the null hypothesis; in plain language
terms this is usually the hypothesis you set out to investigate. For example, the question is
"is there a significant (not due to chance) difference in blood pressures between groups
A and B if we give group A the test drug and group B a sugar pill?" and the alternative
hypothesis is "there is a difference in blood pressures between groups A and B if we give
group A the test drug and group B a sugar pill".

If your P value is less than the chosen significance level then you reject the null hypothesis
i.e. accept that your sample gives reasonable evidence to support the alternative
hypothesis. It does NOT imply a "meaningful" or "important" difference; that is for you to
decide when considering the real-world relevance of your result.

The choice of significance level at which you reject H0 is arbitrary. Conventionally the 5%
(less than 1 in 20 chance of being wrong), 1% and 0.1% (P < 0.05, 0.01 and 0.001) levels
have been used. These numbers can give a false sense of security.

In the ideal world, we would be able to define a "perfectly" random sample, the most
appropriate test and one definitive conclusion. We simply cannot. What we can do is try
to optimize all stages of our research to minimize sources of uncertainty. When presenting
P values some groups find it helpful to use the asterisk rating system as well as quoting
the P value:

P < 0.05   *
P < 0.01   **
P < 0.001  ***

Most authors refer to statistically significant as P < 0.05 and statistically highly significant
as P < 0.001 (less than one in a thousand chance of being wrong).

The asterisk system avoids the woolly term "significant". Please note, however, that many
statisticians do not like the asterisk rating system when it is used without showing P
values. As a rule of thumb, if you can quote an exact P value then do. You might also
want to refer to a quoted exact P value as an asterisk in text narrative or tables of contrasts
elsewhere in a report.

At this point, a word about error. Type I error is the false rejection of the null hypothesis
and type II error is the false acceptance of the null hypothesis. As an aide-mémoire: think
that our cynical society rejects before it accepts.

The significance level (alpha) is the probability of type I error. The power of a test is one
minus the probability of type II error (beta). Power should be maximized when selecting

statistical methods. If you want to estimate sample sizes then you must understand all of
the terms mentioned here.

The following table shows the relationship between power and error in hypothesis testing:

TRUTH          DECISION: Accept H0               DECISION: Reject H0
H0 is true     correct decision, P = 1 − alpha   type I error, P = alpha (significance)
H0 is false    type II error, P = beta           correct decision, P = 1 − beta (power)

H0 = null hypothesis
P = probability

Activity:

Essay
1. What is the importance of hypothesis testing in engineering data analysis?
2. What is the role of P value in hypothesis testing?
3. Give 5 examples of events where the researcher might commit the two types of
errors.

Session 2
Parametric Statistics

By the end of this session you should be able to:


1. Formulate and test statistical hypothesis
2. Solve hypothesis testing problems using parametric statistics

Lecture:

INTRODUCTION

A parameter in statistics refers to an aspect of a population, as opposed to a statistic,
which refers to an aspect of a sample. For example, the population mean is a parameter,
while the sample mean is a statistic. A parametric statistical test makes an assumption
about the population parameters and the distributions that the data came from. These
types of tests include Student's t-tests and ANOVA tests, which assume data come from
a normal distribution.
The opposite is a nonparametric test, which doesn’t assume anything about the
population parameters. Nonparametric tests include chi-square, Fisher’s exact test and
the Mann-Whitney test.
Every parametric test has a nonparametric equivalent. For example, if you have
parametric data from two independent groups, you can run a 2 sample t test to compare
means. If you have nonparametric data, you can run a Wilcoxon rank-sum test to compare
means.

THE T-TEST

The t-Test is used to compare two means, the means of two independent samples or two
independent groups and the means of correlated samples before and after the treatment.
Ideally the t-test is used when there are less than 30 samples, but some researchers use
the t-test even if there are more than 30 samples.

THE T-TEST FOR INDEPENDENT SAMPLES/ GROUPS


$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\left(\dfrac{SS_1 + SS_2}{n_1 + n_2 - 2}\right)\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$

where:
t = the t-test statistic
x̄1 = the mean of group 1
x̄2 = the mean of group 2
SS1 = the sum of squares of group 1
SS2 = the sum of squares of group 2
n1 = the number of observations in group 1
n2 = the number of observations in group 2

EXAMPLE:

The following are the scores of 10 male and 10 female AB students in spelling. Test the
null hypothesis that there is no significant difference between the performance of male
and female AB students in the test. Use the t-test at 0.05 level of significance.

Male Female
X1 X2
14 12
18 9
17 11
16 5
4 10
14 3
12 7
10 2
9 6
17 13

Problem:

Is there a significant difference between the performance of male and female students in
spelling?
Hypotheses:

Ho: There is no significant difference between the performance of male and female AB
students in spelling. (𝐻𝑜: 𝜇1 = 𝜇2 )
Ha: There is a significant difference between the performance of male and female AB
students in spelling. (Ha: μ1 ≠ μ2)

Level of Significance:
α = 0.05
d.f. = n1+n2-2
= 10+10-2
= 18
t0.05 = 2.101 t-tabular value at 0.05 (two tailed test)
Statistics:
t-test for two independent samples

Computations:

Male Female
X1 X2
14 12
18 9
17 11
16 5
4 10
14 3
12 7
10 2
9 6
17 13
∑x1 = 131,  ∑x2 = 78,  ∑x1² = 1891,  ∑x2² = 738,  x̄1 = 13.1,  x̄2 = 7.8

$$SS_{x_1} = \sum x_1^2 - \frac{(\sum x_1)^2}{n_1} = 1891 - \frac{131^2}{10} = 174.9$$

$$SS_{x_2} = \sum x_2^2 - \frac{(\sum x_2)^2}{n_2} = 738 - \frac{78^2}{10} = 129.6$$

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\left(\dfrac{SS_1 + SS_2}{n_1 + n_2 - 2}\right)\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} = \frac{13.1 - 7.8}{\sqrt{\left(\dfrac{174.9 + 129.6}{10 + 10 - 2}\right)\left(\dfrac{1}{10} + \dfrac{1}{10}\right)}} = 2.88$$
Decision Rule:

If the computed value is greater than or beyond the t-tabular/ critical value,
reject Ho.
Conclusion:

Since the t-computed value of 2.88 is greater than the t-tabular value of 2.101 at 0.05
level of significance with 18 degrees of freedom, the Ho is rejected in favor of the research
hypothesis. This means that there is a significant difference between the performance of
the male and female AB students in spelling.
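
The same test can be run with statistical software. A minimal sketch using scipy (an assumed tool; the pooled two-sample t-test it performs matches the formula above):

from scipy.stats import ttest_ind

male   = [14, 18, 17, 16, 4, 14, 12, 10, 9, 17]
female = [12, 9, 11, 5, 10, 3, 7, 2, 6, 13]

# equal-variance (pooled) two-sample t-test
t_stat, p_value = ttest_ind(male, female)
print(round(t_stat, 2), round(p_value, 4))   # t ~ 2.88, p < 0.05, so reject Ho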

THE T-TEST FOR CORRELATED SAMPLES

$$t = \frac{\bar{D}}{\sqrt{\dfrac{\sum D^2 - \dfrac{(\sum D)^2}{n}}{n(n-1)}}}$$

where:
D̄ = the mean difference between the pretest and the posttest
∑D² = the sum of squares of the differences between the pretest and the posttest
n = sample size

EXAMPLE:
An experimental study was conducted on the effect of programmed materials in English
on the performance of 20 selected college students. Before the program was
implemented the pre-test was administered and after 5 months the same instrument was
used to get the post test result. The following is the result of the experiment.

Pre test Post test Pre test Post test


20 25 18 30
30 35 15 10
10 25 15 16
15 25 20 25
20 20 18 10
10 20 40 45
18 22 10 15
14 20 10 10
15 20 12 18
20 15 20 25

Problem:

Is there a significant difference between the pretest and the posttest on the use of
programmed materials in English?

Hypotheses:

Ho: There is no significant difference between the pretest and posttest or the use of
programmed materials did not affect the performance of students in English(𝐻𝑜: 𝜇1 = 𝜇2 )

Ha: The posttest result is higher than the pretest result. (𝐻1: 𝜇1 < 𝜇2 )

Level of Significance:
α = 0.05
d.f. = n-1 = 19
t0.05 = -1.729 (one-tailed test)

Statistics: t-test for correlated samples

Computations:

Pretest  Posttest  D  D²
20 25 -5 25
30 35 -5 25
10 25 -15 225
15 25 -10 100
20 20 0 0
10 20 -10 100
18 22 -4 16
14 20 -6 36
15 20 -5 25
20 15 5 25
18 30 -12 144
15 10 5 25
15 16 -1 1
20 25 -5 25
18 10 8 64
40 45 -5 25
10 15 -5 25
10 10 0 0
12 18 -6 36
20 25 -5 25
∑D = −81   ∑D² = 947   D̄ = −81/20 = −4.05

$$t = \frac{\bar{D}}{\sqrt{\dfrac{\sum D^2 - \dfrac{(\sum D)^2}{n}}{n(n-1)}}} = \frac{-4.05}{\sqrt{\dfrac{947 - \dfrac{(-81)^2}{20}}{20(20 - 1)}}} = -3.17$$

Decision Rule:
If the t-computed value is greater than or beyond the critical value, reject Ho.

Conclusion:
The t-computed value of -3.17 is beyond the t-critical value of -1.729 at 0.05 level of
significance with 19 degrees of freedom, the null hypothesis is therefore rejected in favor
of the research hypothesis. This means that the posttest result is higher than the pretest
result. It implies that the use of the programmed materials in English is effective.
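
A quick software check of this paired test, again sketched with scipy (assumed available); the one-sided alternative matches Ha: μ1 < μ2:

from scipy.stats import ttest_rel

pre  = [20, 30, 10, 15, 20, 10, 18, 14, 15, 20, 18, 15, 15, 20, 18, 40, 10, 10, 12, 20]
post = [25, 35, 25, 25, 20, 20, 22, 20, 20, 15, 30, 10, 16, 25, 10, 45, 15, 10, 18, 25]

# paired (correlated) t-test with a one-sided alternative
t_stat, p_value = ttest_rel(pre, post, alternative='less')
print(round(t_stat, 2), round(p_value, 4))   # t ~ -3.17, p < 0.05, so reject Ho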

THE Z-TEST

The z-test is another test under parametric statistics which requires the normality of the
distribution. It utilizes the two population parameters μ and σ. It is used to compare two
means, the sample mean, and the perceived population mean. It is also used to compare
two sample means taken from the same population. It is used when the sample size is
equal to or greater than 30. The z-test can be applied in two ways: the One-Sample Mean
Test and the Two-Sample Mean Test.

THE ONE-SAMPLE MEAN TEST

The one-sample mean test is used when the sample mean is being compared to the
perceived population mean. However, if the population standard deviation is not known,
the sample standard deviation can be used as a substitute.
$$z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$$

where:
x̄ = sample mean
μ = hypothesized value of the population mean
σ = population standard deviation
n = sample size

EXAMPLE:

A tire manufacturing plant produces 15.2 tires per hour. This yield has an established
variance of 2.5 tires/hour. New machines are recommended, but will be expensive to
install. Before deciding to implement the change, 32 new machines are tested. They
produce 16.8 tires per hour. Is it worth buying the new machines? Assume that the data
come from a normal distribution (an assumption of the test).

Problem:
Is the claim true that the mean yield of new machines is greater than 15.2 tires per hour?

Hypotheses:
Ho: μ = 15.2 (i.e., the mean yield of new machines is equal to 15.2)
Ha: μ > 15.2 (i.e., the mean yield of new machines is greater than 15.2)

Level of Significance:
α = 0.05
z = 1.645 (one-tailed test)

Statistics: one sample mean test


Computation (with σ = √2.5 ≈ 1.58):

$$z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} = \frac{16.8 - 15.2}{1.58/\sqrt{32}} = 5.73$$

Decision Rule: If the z computed value is greater than or beyond the z tabular value, reject
Ho.
Conclusion: Since the z computed value of 5.73 is beyond the critical value of 1.645 at
0.05 level of significance the null hypothesis is rejected in favor of the research
hypothesis.
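
scipy has no built-in one-sample z-test, but the statistic is a one-liner; a sketch, with scipy used only for the normal table:

from math import sqrt
from scipy.stats import norm

xbar, mu, var, n = 16.8, 15.2, 2.5, 32
z = (xbar - mu) / (sqrt(var) / sqrt(n))   # ~5.7 (5.73 with the rounded sigma above)
print(round(z, 2), norm.sf(z))            # one-tailed p-value is ~0, so reject Ho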

THE TWO-SAMPLE MEAN TEST

The two-sample mean test is used when comparing two separate samples drawn at
random from a normal population. To test whether the difference between the two values
of x̄1 and x̄2 is significant or can be attributed to chance, the following formula is used:

$$z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$

EXAMPLE:
An admission test was administered to incoming freshmen in the Colleges of Nursing and
Veterinary Medicine, with 100 students randomly selected from each college. The mean
scores of the given samples were 90 and 85 and the variances of the test scores were 40
and 35, respectively. Is there a significant difference between the two groups? Use 0.01
level of significance.
Problem:
Is there a significant difference between the test scores of the incoming freshmen in the
Colleges of Nursing and Veterinary Medicine?

Hypotheses:
Ho: 𝜇1 = 𝜇2
Ha: 𝜇1 ≠ 𝜇2

Level of Significance:
α = 0.01
z = ±2.58 (two-tailed test)

Statistics: two sample mean test

Computation:
$$z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} = \frac{90 - 85}{\sqrt{\dfrac{40}{100} + \dfrac{35}{100}}} = 5.77$$

Decision Rule: If the z computed value is greater than or beyond the z tabular value, reject
Ho.

Conclusion: Since the z computed value of 5.77 is beyond the critical value of 2.58 at
0.01 level of significance the null hypothesis is rejected in favor of the research
hypothesis.
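
A minimal sketch of the same two-sample computation (all values taken from the problem statement):

```python
# A minimal sketch of the two-sample mean (z) test above.
import math

x1, x2 = 90, 85          # sample means
v1, v2 = 40, 35          # sample variances
n1, n2 = 100, 100        # sample sizes

z = (x1 - x2) / math.sqrt(v1 / n1 + v2 / n2)
print(f"z = {z:.2f}")    # z is approximately 5.77 > 2.58, so Ho is rejected
```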

THE F-TEST (ONE WAY ANOVA)

The F-test is the test statistic of the analysis of variance (ANOVA), which is used in
comparing the means of two or more independent groups. One-way ANOVA involves a
single factor; two-way ANOVA is used when two variables are involved, the column and
the row variables, and the researcher is interested in whether there are significant
differences between and among columns and rows. Two-way ANOVA is also used in
looking at the interaction effect between the variables being analyzed.
Steps in ANOVA

• Determine the null and alternate hypotheses


• Find SSTOT
$$SS_{TOT} = \sum x_{TOT}^2 - \frac{(\sum x_{TOT})^2}{N}$$
• Find SSBET
$$SS_{BET} = \sum \frac{(\sum x_i)^2}{n_i} - \frac{(\sum x_{TOT})^2}{N}$$
• Find SSW
$$SS_W = \sum \left( \sum x_i^2 - \frac{(\sum x_i)^2}{n_i} \right)$$
• Find variance estimates (mean squares)
• Find the F ratio and complete the ANOVA test

EXAMPLE:

A sari-sari store is selling 4 brands of shampoo. The owner is interested in whether there
is a significant difference in the average sales for one week. The following data are
recorded.

A B C D
7 9 2 4
3 8 3 5
5 8 4 7
6 7 5 8
9 6 6 3
4 9 4 4
3 10 2 5

Perform the analysis of variance and test the hypothesis at 0.05 level of significance that
the average sales of the four brands of shampoo are equal.

Problem:
Is there a significant difference in the average sales of the four brands of shampoo?
Hypotheses:

Ho: 𝜇1 = 𝜇2 = 𝜇3 = 𝜇4

Ha: At least one of the means differs from the others.

Level of Significance:

α = 0.05
d.fbet = k-1 = 4-1 = 3

d.fwith =N-k = 28-4 = 24


d.ftot = N-1 = 28-1 = 27

F0.05 = 3.01 (refer to the critical values of F)

Statistics: One Way ANOVA

Computations:

A     A²     B     B²      C     C²     D     D²
7     49     9     81      2     4      4     16
3     9      8     64      3     9      5     25
5     25     8     64      4     16     7     49
6     36     7     49      5     25     8     64
9     81     6     36      6     36     3     9
4     16     9     81      4     16     4     16
3     9      10    100     2     4      5     25
∑A=37 ∑A²=225 ∑B=57 ∑B²=475 ∑C=26 ∑C²=110 ∑D=36 ∑D²=204

Ā = 5.29    B̄ = 8.14    C̄ = 3.71    D̄ = 5.14

$$\sum x_{tot} = \sum A + \sum B + \sum C + \sum D = 37 + 57 + 26 + 36 = 156$$

$$\sum x_{tot}^2 = \sum A^2 + \sum B^2 + \sum C^2 + \sum D^2 = 225 + 475 + 110 + 204 = 1014$$

$$SS_{TOT} = \sum x_{TOT}^2 - \frac{(\sum x_{TOT})^2}{N} = 1014 - \frac{156^2}{28} = 144.857$$

$$SS_{BET} = \frac{37^2}{7} + \frac{57^2}{7} + \frac{26^2}{7} + \frac{36^2}{7} - \frac{156^2}{28} = 72.286$$

$$SS_{W} = \left(225 - \frac{37^2}{7}\right) + \left(475 - \frac{57^2}{7}\right) + \left(110 - \frac{26^2}{7}\right) + \left(204 - \frac{36^2}{7}\right) = 72.571$$

Sources of   Sum of    Degrees of   Mean      Fratio   F0.05   Decision
Variation    Squares   Freedom      Squares
Between      72.286    3            24.095    7.979    3.01    Reject the
Within       72.571    24           3.02                       null hypothesis
Total        144.857   27

Since the decision is to reject the null hypothesis, we need to determine where the
variation lies by using Scheffé's test.
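
The full one-way table can be verified with SciPy before moving on to Scheffé's test. A minimal sketch (the four lists restate the sales data):

```python
# A minimal sketch verifying the one-way ANOVA above with SciPy.
from scipy import stats

a = [7, 3, 5, 6, 9, 4, 3]
b = [9, 8, 8, 7, 6, 9, 10]
c = [2, 3, 4, 5, 6, 4, 2]
d = [4, 5, 7, 8, 3, 4, 5]

f, p = stats.f_oneway(a, b, c, d)
print(f"F = {f:.3f}, p = {p:.4f}")  # F is approximately 7.98 > 3.01, so Ho is rejected
```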

SCHEFFE’S TEST

The F-test tells us that there is a significant difference in the average sales of the four
brands of shampoo but as to where the difference lies, it has to be tested further by
another test, the Scheffe’s test.
$$F' = \frac{(\bar{x}_1 - \bar{x}_2)^2}{\dfrac{MS_w(n_1 + n_2)}{n_1 n_2}}$$

Where:
F’ = Scheffe’s test

𝑥̅1 = mean of group 1

𝑥̅ 2 = mean of group 2

n1= number of samples in group 1

n2 = number of samples in group 2


MSw = within mean squares

A × B
$$F' = \frac{(5.29 - 8.14)^2}{\dfrac{3.02(7+7)}{7 \times 7}} = 9.41$$

A × C
$$F' = \frac{(5.29 - 3.71)^2}{\dfrac{3.02(7+7)}{7 \times 7}} = 2.89$$

A × D
$$F' = \frac{(5.29 - 5.14)^2}{\dfrac{3.02(7+7)}{7 \times 7}} = 0.026$$

B × C
$$F' = \frac{(8.14 - 3.71)^2}{\dfrac{3.02(7+7)}{7 \times 7}} = 22.74$$

B × D
$$F' = \frac{(8.14 - 5.14)^2}{\dfrac{3.02(7+7)}{7 \times 7}} = 10.43$$

C × D
$$F' = \frac{(3.71 - 5.14)^2}{\dfrac{3.02(7+7)}{7 \times 7}} = 2.37$$
Sources of   F'computed   F'0.05 = F0.05 × (k−1)   Remarks
Variation                 = 3.01 × 3 = 9.03
A × B        9.41         9.03                     significant
A × C        2.89         9.03                     not significant
A × D        0.026        9.03                     not significant
B × C        22.74        9.03                     significant
B × D        10.43        9.03                     significant
C × D        2.37         9.03                     not significant

All of the significant comparisons (A × B, B × C, and B × D) involve Brand B; this means
that the variation lies in Brand B.
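
The six pairwise comparisons are mechanical, so a small helper is convenient. A sketch under the F′ formula above; the function name scheffe_f is illustrative, not a library routine:

```python
# A sketch of the pairwise Scheffe comparisons, per the F' formula above.
from itertools import combinations

means = {"A": 5.29, "B": 8.14, "C": 3.71, "D": 5.14}
ms_w, n, k = 3.02, 7, 4
f_crit = 3.01 * (k - 1)          # F0.05 scaled by (k - 1) = 9.03

def scheffe_f(m1, m2):           # illustrative helper, not a library call
    return (m1 - m2) ** 2 / (ms_w * (n + n) / (n * n))

for g1, g2 in combinations(means, 2):
    f = scheffe_f(means[g1], means[g2])
    verdict = "significant" if f > f_crit else "not significant"
    print(f"{g1} x {g2}: F' = {f:.2f} -> {verdict}")
```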

TWO WAY ANOVA

A two-way ANOVA test is a statistical test used to determine the effect of two nominal
predictor variables on a continuous outcome variable. ANOVA stands for analysis of
variance and tests for differences in the effects of independent variables on a dependent
variable.
A two-way ANOVA tests the effect of two independent variables on a dependent variable.
It analyzes the effect of the independent variables on the expected outcome along with
their relationship to the outcome itself. Random factors would be considered to have no
statistical influence on a data set, while systematic factors would be considered to have
statistical significance.

By using ANOVA, a researcher is able to determine whether the variability of the
outcomes is due to chance or to the factors in the analysis. ANOVA has many applications
in finance, economics, science, medicine, and social science.

The two-factor ANOVA can test the following null hypotheses:

1. Factor A has no significant effect.
2. Factor B has no significant effect.
3. Factors A and B have no interaction effect.
Illustration of Two-Way ANOVA Classification
Two-Way Classification Table

        A1    A2    A3    …    Am
B1      ..    ..    ..         ..
B2      ..    ..    ..         ..
B3      ..    ..    ..         ..
⋮
Bn      ..    ..    ..    …    ..

Summary of the Two-Way ANOVA

Sources of   Sum of    Degrees of   Mean      Fratio   Ftabular   Decision
Variation    Squares   Freedom      Squares            value
Factor A     SSA       DFA          MSSA      FA
Factor B     SSB       DFB          MSSB      FB
A × B        SSA×B     DFA×B        MSSA×B    FA×B
Within       SSwit     DFwit        MSSwit
Total        SStot     DFtot

Where:
Column 1
SV = sources of variations
Factor A = effect of factor A
Factor B = effect of factor B
A × B = interaction effect of factors A and B
Within = variations within groups
Total = sum of all variations

Column 2
SS = sum of squares
SSA = factor A sum of squares
SSB = factor B sum of squares
SSA×B = interaction A × B sum of squares
Within = within groups sum of squares
Total = total sum of squares

Column 3
DF = degrees of freedom
DFA = factor A degrees of freedom
DFB = factor B degrees of freedom
DFA×B = interaction A × B degrees of freedom
DFwit = within groups degrees of freedom
DFtot = total degrees of freedom

Column 4
MSS = mean sum of squares
MSSA = factor A mean sum of squares
MSSB = factor B mean sum of squares
MSSA×B = interaction A × B mean sum of squares
MSSwit = within groups mean sum of squares

Column 5
F = f-statistic
FA = factor A computed value of f-statistic
FB = factor B computed value of f-statistic
FA×B = interaction A × B computed value of f-statistic

Formulas in Two-Way ANOVA

Column 2
$$SS_A = \sum \frac{(\sum x_{A_i})^2}{n_{A_i}} - \frac{(\sum x_i)^2}{N}$$
$$SS_B = \sum \frac{(\sum x_{B_i})^2}{n_{B_i}} - \frac{(\sum x_i)^2}{N}$$
$$SS_{bet} = \sum \frac{(\sum x_{A_iB_i})^2}{n_{A_iB_i}} - \frac{(\sum x_i)^2}{N}$$
$$SS_{A \times B} = SS_{bet} - SS_A - SS_B$$
$$SS_{tot} = \sum x_i^2 - \frac{(\sum x_i)^2}{N}$$
$$SS_{wit} = SS_{tot} - SS_{bet}$$

Column 3
DFA = A − 1 (categories in A − 1)
DFB = B − 1 (categories in B − 1)
DFA×B = (A − 1)(B − 1)
DFwit = N − AB
DFtot = N − 1

Column 4
$$MSS_A = \frac{SS_A}{DF_A} \qquad MSS_B = \frac{SS_B}{DF_B} \qquad MSS_{A \times B} = \frac{SS_{A \times B}}{DF_{A \times B}} \qquad MSS_{wit} = \frac{SS_{wit}}{DF_{wit}}$$

Column 5
$$F_A = \frac{MSS_A}{MSS_{wit}} \qquad F_B = \frac{MSS_B}{MSS_{wit}} \qquad F_{A \times B} = \frac{MSS_{A \times B}}{MSS_{wit}}$$

Where:

x = observed value
i = individual observation or cell
A = the first given factor
B = the second given factor
A × B = interaction of factors A and B
n = number of samples in a particular category
N = total samples

Example:

Determine the effects of socio-economic status and location of residence on the number
of absences incurred by the students in the past semester. Test at α = 0.05.

Location of          Socio-economic Status (A)
Residence (B)        High      Average   Low
Within the city      8  3      5  8      1  9
                     2  5      6  2      3  4
                     4  6      4  5      6  2
                     7  9      0  3      8  5
Outside the city     4  2      6  1      4  1
                     3  9      4  5      7  2
                     4  7      8  4      6  4
                     1  6      3  2      2  5

Problems:

Do the socio-economic status and location of residence have a significant effect on the
number of absences incurred by the students in the past semester?
Is there an interaction effect between socio-economic status and location of
residence?

Hypotheses:
HO1: Socio-economic status has no significant effect on the number of absences
incurred by the students in the past semester.

HO2: The location of the residence has no significant effect on the number of absences
incurred by the students in the past semester.
HO3: Socio-economic status and location of residence have no interaction effect.
Level of Significance:

∝= 0.05

𝑑. 𝑓.𝐴 = 𝐴 − 1 = 3 − 1 = 2

𝑑. 𝑓.𝐵 = 𝐵 − 1 = 2 − 1 = 1
𝑑. 𝑓.𝐴×𝐵 = (𝐴 − 1)(𝐵 − 1) = (3 – 1)(2 – 1) = 2

𝑑. 𝑓.𝑊𝐼𝑇 = 𝑁 − 𝐴𝐵 = 48 − (3)(2) = 42

𝑑. 𝑓. 𝑇𝑂𝑇𝐴𝐿 = 𝑁 − 1 = 48 − 1 = 47

Test statistics: TWO-WAY ANOVA

Location of          Socio-economic Status
Residence            High      Average   Low
Within the city      8  3      5  8      1  9
                     2  5      6  2      3  4
                     4  6      4  5      6  2
                     7  9      0  3      8  5
  Cell totals        44        33        38       (row total: 115)
Outside the city     4  2      6  1      4  1
                     3  9      4  5      7  2
                     4  7      8  4      6  4
                     1  6      3  2      2  5
  Cell totals        36        33        31       (row total: 100)
Column totals        80        66        69       (grand total: 215)

$$SS_A = \frac{80^2}{16} + \frac{66^2}{16} + \frac{69^2}{16} - \frac{215^2}{48} = 6.792$$

$$SS_B = \frac{115^2}{24} + \frac{100^2}{24} - \frac{215^2}{48} = 4.688$$

$$SS_{BET} = \frac{44^2}{8} + \frac{33^2}{8} + \frac{38^2}{8} + \frac{36^2}{8} + \frac{33^2}{8} + \frac{31^2}{8} - \frac{215^2}{48} = 13.854$$

$$SS_{A \times B} = SS_{BET} - SS_A - SS_B = 13.854 - 6.792 - 4.688 = 2.374$$

$$SS_{TOT} = 1233 - \frac{215^2}{48} = 269.979$$

$$SS_{WIT} = 269.979 - 13.854 = 256.125$$

Sources of   Sum of    d.f.   Mean      Fratio   F0.05   Decision
Variation    Squares          Squares
A            6.792     2      3.396     0.55     3.22    Fail to
B            4.688     1      4.688     0.769    4.07    reject Ho
A × B        2.374     2      1.187     0.1947   3.22
Within       256.125   42     6.098
Total        269.979   47
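
The same table can be reproduced with statsmodels. A minimal sketch (the data frame re-encodes the absence counts in long format; the column names are illustrative):

```python
# A sketch of the two-way ANOVA above using statsmodels (column names illustrative).
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

within  = {"High": [8, 3, 2, 5, 4, 6, 7, 9], "Average": [5, 8, 6, 2, 4, 5, 0, 3],
           "Low": [1, 9, 3, 4, 6, 2, 8, 5]}
outside = {"High": [4, 2, 3, 9, 4, 7, 1, 6], "Average": [6, 1, 4, 5, 8, 4, 3, 2],
           "Low": [4, 1, 7, 2, 6, 4, 2, 5]}

rows = [(s, loc, x) for loc, d in [("within", within), ("outside", outside)]
        for s, xs in d.items() for x in xs]
df = pd.DataFrame(rows, columns=["status", "location", "absences"])

model = ols("absences ~ C(status) * C(location)", data=df).fit()
# The SS column should reproduce 6.792, 4.688, 2.374 and 256.125 from above.
print(sm.stats.anova_lm(model, typ=2))
```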

THREE-WAY ANOVA

The three-way ANOVA is used by statisticians to determine whether there is a three-way
relationship among variables on an outcome. It determines what effect, if any, three
factors had on an outcome. Three-way ANOVAs are useful for gaining an understanding
of complex interactions where more than one variable may influence the result, and they
have many applications in finance, science and medical research, and a host of other fields.

A three-way ANOVA is also known as three-factor ANOVA. By using ANOVA, a


researcher is able to determine whether the variability of the outcomes is due to chance
or to the factors in the analysis. ANOVA has many applications in science and
engineering, medicine, and social science.

HOW THREE-WAY ANOVA WORKS

A pharmaceutical company, for example, may do a three-way ANOVA to determine the


effect of a drug on a medical condition. One factor would be the drug, another may be the
gender of the subject, and another may be the age of the subject. These three factors
may each have a distinguishable effect on the outcome. They may also interact with each
other. The drug may have a positive effect on male subjects, for example, but it may not
work on males above a certain age. Three-way ANOVA allows the scientist to quantify
the effects of each and whether the factors interact.

The three-factor ANOVA can test the following null hypotheses:

1. Factor A has no significant effect.
2. Factor B has no significant effect.
3. Factor C has no significant effect.
4. Factors A and B have no interaction effect.
5. Factors B and C have no interaction effect.
6. Factors A and C have no interaction effect.
7. Factors A, B and C have no interaction effect.

Illustration of Three-Way Classification

Three-Way Classification Table

         A1    A2    …    Al
B1  C1   ..    ..         ..
    C2   ..    ..         ..
    ⋮
    Cn   ..    ..    …    ..
B2  C1   ..    ..         ..
    C2   ..    ..         ..
    ⋮
    Cn   ..    ..    …    ..
⋮
Bm  C1   ..    ..         ..
    C2   ..    ..         ..
    ⋮
    Cn   ..    ..    …    ..

Summary of Three-Way ANOVA

Sources of    Sum of     Degrees of    Mean       Fratio    Ftabular    Decision
Variation     Squares    Freedom       Squares              value
A             SSA        DFA           MSSA       FA
B             SSB        DFB           MSSB       FB
C             SSC        DFC           MSSC       FC
A × B         SSA×B      DFA×B         MSSA×B     FA×B
B × C         SSB×C      DFB×C         MSSB×C     FB×C
A × C         SSA×C      DFA×C         MSSA×C     FA×C
A × B × C     SSA×B×C    DFA×B×C       MSSA×B×C   FA×B×C
Within        SSwit      DFwit         MSSwit
Total         SStot      DFtot

Where:

Column 1
SV = sources of variations
A = effect of factor A
B = effect of factor B
C = effect of factor C
A × B = interaction effect of factor A × B
B × C = interaction effect of factor B × C
A × C = interaction effect of factor A × C
A × B × C = interaction effect of factor A × B × C
Within = variation within groups
Total = sum of all variations

Column 2
SS = sum of squares
SSA = factor A sum of squares
SSB = factor B sum of squares
SSC = factor C sum of squares
SSA×B = interaction A × B sum of squares
SSB×C = interaction B × C sum of squares
SSA×C = interaction A × C sum of squares
SSA×B×C = interaction A × B × C sum of squares
SSwit = within groups sum of squares
SStot = total sum of squares

Column 3
DF = degrees of freedom
DFA = factor A degrees of freedom
DFB = factor B degrees of freedom
DFC = factor C degrees of freedom
DFA×B = interaction A × B degrees of freedom
DFB×C = interaction B × C degrees of freedom
DFA×C = interaction A × C degrees of freedom
DFA×B×C = interaction A × B × C degrees of freedom
DFwit = within groups degrees of freedom
DFtot = total degrees of freedom

Column 4
MSS = mean sum of squares
MSSA = factor A mean sum of squares
MSSB = factor B mean sum of squares
MSSC = factor C mean sum of squares
MSSA×B = interaction A × B mean sum of squares
MSSB×C = interaction B × C mean sum of squares
MSSA×C = interaction A × C mean sum of squares
MSSA×B×C = interaction A × B × C mean sum of squares
MSSwit = within groups mean sum of squares

Column 5
F = f-statistic
FA = factor A computed value of f-statistic
FB = factor B computed value of f-statistic
FC = factor C computed value of f-statistic
FA×B = interaction A × B computed value of f-statistic
FB×C = interaction B × C computed value of f-statistic
FA×C = interaction A × C computed value of f-statistic
FA×B×C = interaction A × B × C computed value of f-statistic

Formulas in Three-Way ANOVA

Column 2
$$SS_A = \sum \frac{(\sum x_{A_i})^2}{nBC} - \frac{(\sum x_i)^2}{N}$$
$$SS_B = \sum \frac{(\sum x_{B_i})^2}{nAC} - \frac{(\sum x_i)^2}{N}$$
$$SS_C = \sum \frac{(\sum x_{C_i})^2}{nAB} - \frac{(\sum x_i)^2}{N}$$
$$SS_{A \times B} = \sum \frac{(\sum x_{A_iB_j})^2}{nC} - \sum \frac{(\sum x_{A_i})^2}{nBC} - \sum \frac{(\sum x_{B_j})^2}{nAC} + \frac{(\sum x_i)^2}{N}$$
$$SS_{B \times C} = \sum \frac{(\sum x_{B_iC_j})^2}{nA} - \sum \frac{(\sum x_{B_i})^2}{nAC} - \sum \frac{(\sum x_{C_j})^2}{nAB} + \frac{(\sum x_i)^2}{N}$$
$$SS_{A \times C} = \sum \frac{(\sum x_{A_iC_j})^2}{nB} - \sum \frac{(\sum x_{A_i})^2}{nBC} - \sum \frac{(\sum x_{C_j})^2}{nAB} + \frac{(\sum x_i)^2}{N}$$
$$SS_{A \times B \times C} = \sum \frac{(\sum x_{A_iB_jC_k})^2}{n} - \sum \frac{(\sum x_{A_iB_j})^2}{nC} - \sum \frac{(\sum x_{B_jC_k})^2}{nA} - \sum \frac{(\sum x_{A_iC_k})^2}{nB} + \sum \frac{(\sum x_{A_i})^2}{nBC} + \sum \frac{(\sum x_{B_j})^2}{nAC} + \sum \frac{(\sum x_{C_k})^2}{nAB} - \frac{(\sum x_i)^2}{N}$$
$$SS_{wit} = \sum x_i^2 - \sum \frac{(\sum x_{A_iB_jC_k})^2}{n}$$
$$SS_{tot} = \sum x_i^2 - \frac{(\sum x_i)^2}{N}$$

Column 3
DFA = A − 1 (categories in A − 1)
DFB = B − 1 (categories in B − 1)
DFC = C − 1 (categories in C − 1)
DFA×B = (A − 1)(B − 1)
DFB×C = (B − 1)(C − 1)
DFA×C = (A − 1)(C − 1)
DFA×B×C = (A − 1)(B − 1)(C − 1)
DFwit = N − ABC
DFtot = N − 1

Column 4
$$MSS_A = \frac{SS_A}{DF_A} \qquad MSS_B = \frac{SS_B}{DF_B} \qquad MSS_C = \frac{SS_C}{DF_C}$$
$$MSS_{A \times B} = \frac{SS_{A \times B}}{DF_{A \times B}} \qquad MSS_{B \times C} = \frac{SS_{B \times C}}{DF_{B \times C}} \qquad MSS_{A \times C} = \frac{SS_{A \times C}}{DF_{A \times C}}$$
$$MSS_{A \times B \times C} = \frac{SS_{A \times B \times C}}{DF_{A \times B \times C}} \qquad MSS_{wit} = \frac{SS_{wit}}{DF_{wit}}$$

Column 5
$$F_A = \frac{MSS_A}{MSS_{wit}} \qquad F_B = \frac{MSS_B}{MSS_{wit}} \qquad F_C = \frac{MSS_C}{MSS_{wit}}$$
$$F_{A \times B} = \frac{MSS_{A \times B}}{MSS_{wit}} \qquad F_{B \times C} = \frac{MSS_{B \times C}}{MSS_{wit}} \qquad F_{A \times C} = \frac{MSS_{A \times C}}{MSS_{wit}} \qquad F_{A \times B \times C} = \frac{MSS_{A \times B \times C}}{MSS_{wit}}$$
Where:
x = observed value
i = individual observation or cell
A = the first given factor
B = the second given factor
C = the third given factor
A × B = interaction of factors A and B
B × C = interaction of factors B and C
A × C = interaction of factors A and C
A × B × C = interaction of factors A, B and C
n = number of samples in a particular category
N = total samples

Example:

Determine the effect of factors A, B and C on the given data.

        A1        A2        A3
        C1   C2   C1   C2   C1   C2
B1      5    2    8    2    1    9
        6    7    7    8    5    2
        7    1    3    6    4    3
        9    6    5    4    9    8
        3    8    4    1    7    6
B2      7    5    8    7    8    2
        4    7    4    1    1    3
        5    6    9    4    3    5
        8    8    6    2    6    6
        1    3    7    5    2    9
B3      4    8    5    7    4    2
        6    7    9    6    1    6
        7    9    4    2    3    6
        8    5    5    9    5    3
        2    7    8    4    7    1
B4      1    3    8    1    5    6
        3    4    3    9    3    5
        5    9    1    2    8    2
        6    2    4    7    9    7
        4    5    2    5    6    3

Solution. Set the null hypotheses ( H 0 ) and alternative hypotheses ( H a ).

H0: Factor A has no significant effect.
Factor B has no significant effect.
Factor C has no significant effect.
Factors A and B have no interaction effect.
Factors B and C have no interaction effect.
Factors A and C have no interaction effect.
Factors A, B and C have no interaction effect.

Ha: Factor A has a significant effect.
Factor B has a significant effect.
Factor C has a significant effect.
Factors A and B have an interaction effect.
Factors B and C have an interaction effect.
Factors A and C have an interaction effect.
Factors A, B and C have an interaction effect.

Table A  B
A1 A2 A3 Total
B1 54 48 54 156
B2 54 53 45 152
B3 63 53 38 154
B4 42 42 54 138
Total 213 196 191 600

Table B  C
C1 C2 Total
B1 83 79 156
B2 79 73 152
B3 72 82 154

B4 68 70 138
Total 302 298 600

Table A  C
A1 A2 A3 Total
C1 101 104 97 302
C2 112 92 94 298
Total 213 196 191 600

Table A  B  C
A1 A2 A3 Total
C1 C2 C1 C2 C1 C2
B1 30 24 27 21 26 28 156
B2 25 29 34 19 20 25 152
B3 27 36 25 28 20 18 154
B4 19 23 18 24 31 23 138
Total 101 112 104 92 97 94 600

Column 2
$$SS_A = \sum \frac{(\sum x_{A_i})^2}{nBC} - \frac{(\sum x_i)^2}{N} = \frac{213^2 + 196^2 + 191^2}{(5)(4)(2)} - \frac{600^2}{120}$$
$$SS_A = 3{,}006.65 - 3{,}000.00 = 6.65$$

$$SS_B = \sum \frac{(\sum x_{B_i})^2}{nAC} - \frac{(\sum x_i)^2}{N} = \frac{156^2 + 152^2 + 154^2 + 138^2}{(5)(3)(2)} - \frac{600^2}{120}$$
$$SS_B = 3{,}006.67 - 3{,}000.00 = 6.67$$

$$SS_C = \sum \frac{(\sum x_{C_i})^2}{nAB} - \frac{(\sum x_i)^2}{N} = \frac{302^2 + 298^2}{(5)(3)(4)} - \frac{600^2}{120}$$
$$SS_C = 3{,}000.13 - 3{,}000.00 = 0.13$$

$$SS_{A \times B} = \frac{54^2 + 48^2 + 54^2 + \cdots + 54^2}{(5)(2)} - 3{,}006.65 - 3{,}006.67 + 3{,}000.00$$
$$SS_{A \times B} = 3{,}055.20 - 3{,}013.32 = 41.88$$

$$SS_{B \times C} = \frac{83^2 + 79^2 + \cdots + 70^2}{(5)(3)} - 3{,}006.67 - 3{,}000.13 + 3{,}000.00$$
$$SS_{B \times C} = 3{,}014.67 - 3{,}006.80 = 7.87$$

$$SS_{A \times C} = \frac{101^2 + 104^2 + 97^2 + \cdots + 94^2}{(5)(4)} - 3{,}006.65 - 3{,}000.13 + 3{,}000.00$$
$$SS_{A \times C} = 3{,}013.50 - 3{,}006.78 = 6.72$$

$$SS_{A \times B \times C} = \frac{30^2 + 24^2 + 27^2 + \cdots + 23^2}{5} - 3{,}055.20 - 3{,}014.67 - 3{,}013.50 + 3{,}006.65 + 3{,}006.67 + 3{,}000.13 - 3{,}000.00$$
$$SS_{A \times B \times C} = 3{,}110.40 - 3{,}069.92 = 40.48$$

$$SS_{wit} = \sum x_i^2 - \sum \frac{(\sum x_{A_iB_jC_k})^2}{n} = (5^2 + 2^2 + 8^2 + \cdots + 3^2) - 3{,}110.40 = 603.60$$

$$SS_{tot} = \sum x_i^2 - \frac{(\sum x_i)^2}{N} = (5^2 + 2^2 + 8^2 + \cdots + 3^2) - 3{,}000.00 = 714.00$$

Column 3
DFA = A − 1 = 3 − 1 = 2
DFB = B − 1 = 4 − 1 = 3
DFC = C − 1 = 2 − 1 = 1
DFA×B = (A − 1)(B − 1) = (3 − 1)(4 − 1) = 6
DFB×C = (B − 1)(C − 1) = (4 − 1)(2 − 1) = 3
DFA×C = (A − 1)(C − 1) = (3 − 1)(2 − 1) = 2
DFA×B×C = (A − 1)(B − 1)(C − 1) = (3 − 1)(4 − 1)(2 − 1) = 6
DFwit = N − ABC = 120 − (3)(4)(2) = 96
DFtot = N − 1 = 120 − 1 = 119

As a check, the value in the "Total" row of the ANOVA table must equal the sum of all
the values above it in the corresponding column; for the degrees of freedom,
2 + 3 + 1 + 6 + 3 + 2 + 6 + 96 = 119.

Column 4
$$MSS_A = \frac{6.65}{2} = 3.33 \qquad MSS_B = \frac{6.67}{3} = 2.22 \qquad MSS_C = \frac{0.13}{1} = 0.13$$
$$MSS_{A \times B} = \frac{41.88}{6} = 6.98 \qquad MSS_{B \times C} = \frac{7.87}{3} = 2.62 \qquad MSS_{A \times C} = \frac{6.72}{2} = 3.36$$
$$MSS_{A \times B \times C} = \frac{40.48}{6} = 6.75 \qquad MSS_{wit} = \frac{603.6}{96} = 6.29$$

Column 5
$$F_A = \frac{3.33}{6.29} = 0.53 \qquad F_B = \frac{2.22}{6.29} = 0.35 \qquad F_C = \frac{0.13}{6.29} = 0.02$$
$$F_{A \times B} = \frac{6.98}{6.29} = 1.11 \qquad F_{B \times C} = \frac{2.62}{6.29} = 0.42 \qquad F_{A \times C} = \frac{3.36}{6.29} = 0.53$$
$$F_{A \times B \times C} = \frac{6.75}{6.29} = 1.07$$

After computing all the required statistics, prepare the three-way ANOVA table as shown:

Three-Way ANOVA Table

Sources of   Sum of    Degrees of   Mean      Fratio   F0.05   Decision
Variation    Squares   Freedom      Squares
Factor A     6.65      2            3.33      0.53     3.09    Fail to reject Ho
Factor B     6.67      3            2.22      0.35     2.70    Fail to reject Ho
Factor C     0.13      1            0.13      0.02     3.94    Fail to reject Ho
A × B        41.88     6            6.98      1.11     2.20    Fail to reject Ho
B × C        7.87      3            2.62      0.42     2.70    Fail to reject Ho
A × C        6.72      2            3.36      0.53     3.09    Fail to reject Ho
A × B × C    40.48     6            6.75      1.07     2.20    Fail to reject Ho
Within       603.6     96           6.29
Total        714       119

Conclusion
Since Fratio is less than F0.05 for factors A, B and C and for the interactions A × B,
B × C, A × C and A × B × C, we fail to reject H0 and conclude the following:

Factor A has no significant effect.
Factor B has no significant effect.
Factor C has no significant effect.
Factors A and B have no interaction effect.
Factors B and C have no interaction effect.
Factors A and C have no interaction effect.
Factors A, B and C have no interaction effect.
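
A three-factor table like this is tedious by hand but short in statsmodels. The sketch below uses a small hypothetical data set (not the table above) purely to show the model formula; building the data frame from the actual A × B × C data would reproduce the hand computation:

```python
# A sketch of a three-way ANOVA in statsmodels. The data set here is hypothetical,
# used only to demonstrate the formula; substitute the real A x B x C observations.
import itertools
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Hypothetical balanced layout: 2 levels each of A, B, C with 3 replicates per cell.
levels = list(itertools.product(["A1", "A2"], ["B1", "B2"], ["C1", "C2"]))
values = [5, 6, 7, 2, 7, 1, 8, 7, 3, 2, 8, 6, 1, 5, 4, 9, 2, 3, 9, 6, 5, 8, 4, 1]
chunks = [values[i:i + 3] for i in range(0, 24, 3)]
rows = [(a, b, c, v) for (a, b, c), vs in zip(levels, chunks) for v in vs]
df = pd.DataFrame(rows, columns=["A", "B", "C", "y"])

# C(A)*C(B)*C(C) expands to all main effects and interactions, i.e., the seven
# F-tests summarized in the table above plus the within (residual) term.
model = ols("y ~ C(A) * C(B) * C(C)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
```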

SIMPLE LINEAR REGRESSION

In simple linear regression, we predict scores on one variable from the scores on a second
variable. The variable we are predicting is called the criterion variable and is referred to
as Y. The variable we are basing our predictions on is called the predictor variable and is
referred to as X. When there is only one predictor variable, the prediction method is called
simple regression. In simple linear regression, the topic of this section, the predictions of
Y when plotted as a function of X form a straight line.

The example data are plotted in the figure below. You can see that there is a positive
relationship between X and Y: if you were going to predict Y from X, the higher the value
of X, the higher your prediction of Y.
Example data:
X Y
1 1
2 2
3 1.3
4 3.75
5 2.25

A scatter plot of the example data

Linear regression consists of finding the best-fitting straight line through the points. The
best-fitting line is called a regression line. The black diagonal line in the figure below is
the regression line and consists of the predicted score on Y for each possible value of X.
The vertical lines from the points to the regression line represent the errors of prediction.
As you can see, the red point is very near the regression line; its error of prediction is
small. By contrast, the yellow point is much higher than the regression line and therefore
its error of prediction is large.

A scatter plot of the example data. The black line consists of the predictions, the points
are the actual data, and the vertical lines between the points and the black line
represent errors of prediction.

The error of prediction for a point is the value of the point minus the predicted value (the
value on the line). The table below shows the predicted values (Y') and the errors of
prediction (Y-Y'). For example, the first point has a Y of 1.00 and a predicted Y (called
Y') of 1.21. Therefore, its error of prediction is -0.21.

Example data
X Y Y' Y-Y' (Y-Y')2
1 1 1.21 -0.21 0.044
2 2 1.635 0.365 0.133
3 1.3 2.06 -0.76 0.578
4 3.75 2.485 1.265 1.6
5 2.25 2.91 -0.66 0.436

You may have noticed that we did not specify what is meant by "best-fitting line." By far,
the most commonly used criterion for the best-fitting line is the line that minimizes the
sum of the squared errors of prediction. That is the criterion that was used to find the
line in the figure above. The last column in the table shows the squared errors of
prediction. The sum of the squared errors of prediction shown there is lower than it
would be for any other regression line.

The formula for a regression line is


y=a+bx
Where:
y = the dependent variable
x = the independent variable
a = the y intercept
b = the slope of the line

$$b = \frac{n\sum xy - \sum x \sum y}{n(\sum x^2) - (\sum x)^2}$$

$$a = \bar{y} - b\bar{x}$$

For the example data, these formulas give the fitted line Y' = 0.785 + 0.425X.


For X = 1,
Y' = (0.425) (1) + 0.785 = 1.21

For X = 2,
Y' = (0.425) (2) + 0.785 = 1.64

Test statistic

To test a hypothesis on the slope (b), the test statistic is:

$$t = \frac{b - \beta_0}{s/\sqrt{S_{xx}}} \qquad d.f. = n - 2$$

To test a hypothesis on the intercept (a), the test statistic is:

$$t = \frac{a - \alpha_0}{s\sqrt{\dfrac{\sum_{i=1}^{n} x_i^2}{n\,S_{xx}}}} \qquad d.f. = n - 2$$

Where:

$$S_{xx} = n\sum x^2 - \left(\sum x\right)^2 \qquad S_{yy} = n\sum y^2 - \left(\sum y\right)^2 \qquad S_{xy} = n\sum xy - \left(\sum x\right)\left(\sum y\right)$$

$$s = \frac{S_{yy} - bS_{xy}}{n - 2}$$
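
The fitted line for the five example points can be verified with SciPy's linregress, which also reports the slope test described above. A minimal sketch:

```python
# A minimal sketch fitting the example data with SciPy's linregress.
from scipy import stats

x = [1, 2, 3, 4, 5]
y = [1, 2, 1.3, 3.75, 2.25]

fit = stats.linregress(x, y)
# slope is approximately 0.425 and intercept approximately 0.785, i.e., Y' = 0.785 + 0.425X;
# fit.pvalue tests Ho: slope = 0 with d.f. = n - 2.
print(f"Y' = {fit.intercept:.3f} + {fit.slope:.3f}X, p = {fit.pvalue:.3f}")
```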

MULTIPLE LINEAR REGRESSION

Multiple linear regression (MLR), also known simply as multiple regression, is a


statistical technique that uses several explanatory variables to predict the outcome of a
response variable. The goal of multiple linear regression (MLR) is to model the linear

relationship between the explanatory (independent) variables and response
(dependent) variable.

In essence, multiple regression is the extension of ordinary least-squares (OLS)


regression that involves more than one explanatory variable.

THE FORMULA FOR MULTIPLE LINEAR REGRESSION IS

yi = β0 + β1xi1 + β2xi2 + ... + βpxip + ϵ

Where, for i = n observations:

yi = dependent variable
xi = explanatory variables
β0 = y-intercept (constant term)
βp = slope coefficients for each explanatory variable
ϵ = the model's error term (also known as the residuals)

A simple linear regression is a function that allows an analyst or statistician to make


predictions about one variable based on the information that is known about another
variable. Linear regression can only be used when one has two continuous variables—
an independent variable and a dependent variable. The independent variable is the
parameter that is used to calculate the dependent variable or outcome. A multiple
regression model extends to several explanatory variables.

The multiple regression model is based on the following assumptions:

 There is a linear relationship between the dependent variables and the


independent variables.
 The independent variables are not too highly correlated with each other.
 yi observations are selected independently and randomly from the population.
 Residuals should be normally distributed with a mean of 0 and variance σ².

The coefficient of determination (R-squared) is a statistical metric that is used to measure


how much of the variation in outcome can be explained by the variation in the independent
variables. R2 always increases as more predictors are added to the MLR model even
though the predictors may not be related to the outcome variable.

Thus, R² by itself can't be used to identify which predictors should be included in a model
and which should be excluded. R² can only be between 0 and 1, where 0 indicates that
the outcome cannot be predicted by any of the independent variables and 1 indicates that
the outcome can be predicted without error from the independent variables.

When interpreting the results of a multiple regression, beta coefficients are valid while
holding all other variables constant ("all else equal"). The output from a multiple
regression can be displayed horizontally as an equation, or vertically in table form.

REGRESSION EQUATIONS WITH TWO PREDICTORS

$$y' = b_1x_1 + b_2x_2 + a$$

$$b_1 = \frac{(SP_{X_1Y})(SS_{X_2}) - (SP_{X_1X_2})(SP_{X_2Y})}{(SS_{X_1})(SS_{X_2}) - (SP_{X_1X_2})^2}$$

$$b_2 = \frac{(SP_{X_2Y})(SS_{X_1}) - (SP_{X_1X_2})(SP_{X_1Y})}{(SS_{X_1})(SS_{X_2}) - (SP_{X_1X_2})^2}$$

$$a = M_Y - b_1M_{X_1} - b_2M_{X_2} = \bar{y} - b_1\bar{x}_1 - b_2\bar{x}_2$$

where

M = mean
SSX1 = sum of squared deviations for X1
SSX2 = sum of squared deviations for X2
SPX1Y = sum of products of deviations for X1 and Y
SPX2Y = sum of products of deviations for X2 and Y
SPX1X2 = sum of products of deviations for X1 and X2

$$R^2 = \frac{SS_{regression}}{SS_y} \qquad SS_{regression} = R^2 SS_y$$

$$\text{standard error of estimate} = \sqrt{MS_{residual}} = \sqrt{\frac{SS_{residual}}{d.f.}}$$

or

$$R^2 = \frac{b_1SP_{X_1Y} + b_2SP_{X_2Y}}{SS_y}$$

which describes the proportion of the total variability of the Y scores that is accounted for
by the regression equation.

Example 1:

Person Y X1 X2
A 11 4 10
B 5 5 6
C 7 3 7
D 3 2 4
E 4 1 3
F 12 7 5
G 10 8 8
H 4 2 4
I 8 7 10
J 6 1 3
∑Y = 70   ∑X1 = 40   ∑X2 = 60
∑Y² = 580   ∑X1² = 222   ∑X2² = 424
ȳ = 7   x̄1 = 4   x̄2 = 6
∑X1Y = 334
∑X2Y = 467
∑X1X2 = 282

$$SS_{X_1} = \sum X_1^2 - \frac{(\sum X_1)^2}{n} = 222 - \frac{40^2}{10} = 62$$

$$SS_{X_2} = \sum X_2^2 - \frac{(\sum X_2)^2}{n} = 424 - \frac{60^2}{10} = 64$$

$$SS_Y = \sum Y^2 - \frac{(\sum Y)^2}{n} = 580 - \frac{70^2}{10} = 90$$

$$SP_{X_1Y} = \sum X_1Y - \frac{\sum X_1 \sum Y}{n} = 334 - \frac{(40)(70)}{10} = 54$$

$$SP_{X_2Y} = \sum X_2Y - \frac{\sum X_2 \sum Y}{n} = 467 - \frac{(60)(70)}{10} = 47$$

$$SP_{X_1X_2} = \sum X_1X_2 - \frac{\sum X_1 \sum X_2}{n} = 282 - \frac{(40)(60)}{10} = 42$$

$$b_1 = \frac{(SP_{X_1Y})(SS_{X_2}) - (SP_{X_1X_2})(SP_{X_2Y})}{(SS_{X_1})(SS_{X_2}) - (SP_{X_1X_2})^2} = \frac{(54)(64) - (42)(47)}{(62)(64) - (42)^2} = 0.672$$

$$b_2 = \frac{(SP_{X_2Y})(SS_{X_1}) - (SP_{X_1X_2})(SP_{X_1Y})}{(SS_{X_1})(SS_{X_2}) - (SP_{X_1X_2})^2} = \frac{(47)(62) - (42)(54)}{(62)(64) - (42)^2} = 0.293$$

$$a = \bar{y} - b_1\bar{x}_1 - b_2\bar{x}_2 = 7 - (0.672)(4) - (0.293)(6) = 2.554$$

$$y' = b_1x_1 + b_2x_2 + a = 0.672x_1 + 0.293x_2 + 2.554$$

$$R^2 = \frac{b_1SP_{X_1Y} + b_2SP_{X_2Y}}{SS_y} = \frac{0.672(54) + 0.293(47)}{90} = 0.5562$$

$$SS_{regression} = R^2SS_y = 0.5562(90) = 50.059$$

$$SS_{residual} = (1 - R^2)SS_y = 0.4438(90) = 39.942$$

$$MS_{residual} = \frac{SS_{residual}}{d.f.} = \frac{39.942}{10 - 3} = 5.706$$

$$MS_{regression} = \frac{SS_{regression}}{d.f.} = \frac{50.059}{3 - 1} = 25.0295$$

$$F = \frac{MS_{regression}}{MS_{residual}} = \frac{25.0295}{5.706} = 4.3865$$

$$F_{0.05} = 4.7374 \qquad d.f. = (2, 7)$$

We cannot conclude that the regression equation accounts for a significant portion of
the variance for the Y scores.
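
The coefficients of Example 1 can be verified with NumPy's least-squares solver. A minimal sketch:

```python
# A minimal sketch of the two-predictor regression in Example 1 using NumPy.
import numpy as np

y  = np.array([11, 5, 7, 3, 4, 12, 10, 4, 8, 6])
x1 = np.array([4, 5, 3, 2, 1, 7, 8, 2, 7, 1])
x2 = np.array([10, 6, 7, 4, 3, 5, 8, 4, 10, 3])

# Design matrix with an intercept column; lstsq solves for [a, b1, b2].
X = np.column_stack([np.ones_like(x1), x1, x2])
(a, b1, b2), *_ = np.linalg.lstsq(X, y, rcond=None)
print(f"y' = {b1:.3f}*x1 + {b2:.3f}*x2 + {a:.3f}")  # approximately 0.672, 0.293, 2.554
```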

Example 2
Evaluate the relationship between churches and crime while controlling for population.

Number of Churches   Population   Number of Crimes
X1                   X2           Y
1 1 4
2 1 1
3 1 2
4 1 3
5 1 5
7 2 8
8 2 11
9 2 9
10 2 7
11 2 10
13 3 15
14 3 14
15 3 16
16 3 17

17 3 13
∑X1 = 135   ∑X2 = 30   ∑Y = 135
∑X1² = 1605   ∑X2² = 70   ∑Y² = 1605
x̄1 = 9   x̄2 = 2   ȳ = 9
∑X1Y = 1578
∑X2Y = 330
∑X1X2 = 330

$$SS_{X_1} = \sum X_1^2 - \frac{(\sum X_1)^2}{n} = 1605 - \frac{135^2}{15} = 390$$

$$SS_{X_2} = \sum X_2^2 - \frac{(\sum X_2)^2}{n} = 70 - \frac{30^2}{15} = 10$$

$$SS_Y = \sum Y^2 - \frac{(\sum Y)^2}{n} = 1605 - \frac{135^2}{15} = 390$$

$$SP_{X_1Y} = \sum X_1Y - \frac{\sum X_1 \sum Y}{n} = 1578 - \frac{(135)(135)}{15} = 363$$

$$SP_{X_2Y} = \sum X_2Y - \frac{\sum X_2 \sum Y}{n} = 330 - \frac{(30)(135)}{15} = 60$$

$$SP_{X_1X_2} = \sum X_1X_2 - \frac{\sum X_1 \sum X_2}{n} = 330 - \frac{(135)(30)}{15} = 60$$

$$b_1 = \frac{(363)(10) - (60)(60)}{(390)(10) - (60)^2} = 0.1$$

$$b_2 = \frac{(60)(390) - (60)(363)}{(390)(10) - (60)^2} = 5.4$$

$$a = \bar{y} - b_1\bar{x}_1 - b_2\bar{x}_2 = 9 - (0.1)(9) - (5.4)(2) = -2.7$$

$$y' = b_1x_1 + b_2x_2 + a = 0.1x_1 + 5.4x_2 - 2.7$$

$$R^2 = \frac{b_1SP_{X_1Y} + b_2SP_{X_2Y}}{SS_y} = \frac{0.1(363) + 5.4(60)}{390} = 0.9239$$

$$SS_{regression} = R^2SS_y = 0.9239(390) = 360.321$$

$$SS_{residual} = (1 - R^2)SS_y = 0.0761(390) = 29.679$$

$$MS_{residual} = \frac{SS_{residual}}{d.f.} = \frac{29.679}{15 - 3} = 2.47325$$

$$MS_{regression} = \frac{SS_{regression}}{d.f.} = \frac{360.321}{3 - 1} = 180.1605$$

$$F = \frac{MS_{regression}}{MS_{residual}} = \frac{180.1605}{2.47325} = 72.8436$$

$$F_{0.05} = 3.8853 \qquad d.f. = (2, 12)$$

We can conclude that the regression equation accounts for a significant portion of the
variance for Y. With an R² of 0.9239, the predictors account for 92.4% of the variance
(population by itself predicts nearly all of it).

As for the standard error of estimate,

$$\sqrt{MS_{residual}} = \sqrt{2.47325} = 1.573$$

THE PEARSON PRODUCT MOMENT CORRELATION COEFFICIENT, r

The Pearson product-moment correlation coefficient r is a measure of the linear


relationship between two questions/measures/variables, X and Y. The correlation value
can range from +1 to -1.

A positive correlation (e.g., +0.32) means there is a positive relationship between X and
Y. For example, a positive correlation between height and weight means that as height
increases, so does weight.

A negative correlation (e.g., -0.25) means there is a relationship between X and Y that
moves in the opposite direction. For example, a negative correlation between people's
waist size and running speed means that the larger the waist size of a person, the slower
they run.

A correlation of 0 means that there is no linear relationship between X and Y (although


there could be a non-linear relationship between X and Y).

The Pearson correlation coefficient is the most common and widely used measure of the
degree of linear relationship between two variables. It should be noted that the Pearson
product-moment correlation tells us whether there is a linear relationship between two
variables, but it does not tell us anything about causality. That is, we do not know if X
causes Y or if Y causes X, or if some other variable causes both X and Y.

The Pearson product-moment correlation is sometimes used for driver analysis, as it is
simple and quick to run (hence lowest cost) and may be the pragmatic solution when we
need to run many driver analyses (e.g., 100).

$$r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{(n\sum x^2 - (\sum x)^2)(n\sum y^2 - (\sum y)^2)}}$$

Where:
r= the Pearson Product Moment Coefficient of Correlation
n= sample size
Σxy = sum of the product of x and y
ΣxΣy = the product of the sum of Σx and the sum of Σy
Σx2 = sum of squares of x
Σy2 = sum of squares of y

Test statistic is

$$t = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}}$$

df = n-2

Where:
n= number of pair of x and y
r = correlation coefficient

Example:

Below are the midterm (x) and final (y) examination grades of 10 students in Mathematics.

x 75 70 65 90 85 85 80 70 65 90
y 80 75 65 95 90 85 90 75 70 90

Problem:
Is there a significant relationship between the midterm and final examinations of 10
students in Mathematics?

Hypotheses:
Ho: There is no significant relationship between the midterm grades and the final
examination/ grades of 10 students in Mathematics.
Ha: There is a significant relationship between the midterm grades and the final
examination/ grades of 10 students in Mathematics.

Level of Significance:
α = 0.05
df = n-2 = 10-2 = 8

t 0.05 = 2.306

Statistics: Pearson Product Moment Correlation Coefficient, r

Computation:

x y x2 y2 xy
75 80 5625 6400 6000
70 75 4900 5625 5250
65 65 4225 4225 4225
90 95 8100 9025 8550
85 90 7225 8100 7650
85 85 7225 7225 7225
80 90 6400 8100 7200
70 75 4900 5625 5250
65 70 4225 4900 4550
90 90 8100 8100 8100
∑x = 775   ∑y = 815   ∑x² = 60925   ∑y² = 67325   ∑xy = 64000

x̄ = 77.5   ȳ = 81.5

$$r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{(n\sum x^2 - (\sum x)^2)(n\sum y^2 - (\sum y)^2)}} = \frac{10(64000) - (775)(815)}{\sqrt{[10(60925) - 775^2][10(67325) - 815^2]}} = 0.94925$$

$$t = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}} = \frac{0.94925\sqrt{10 - 2}}{\sqrt{1 - 0.94925^2}} = 8.536$$

Decision Rule: If the t computed value is greater than or beyond the t tabular value, reject
Ho.
Conclusion: Since the t computed value of 8.536 is beyond the critical value of 2.306 at
the 0.05 level of significance, we reject Ho and accept the research hypothesis, which
means that there is a significant relationship between the midterm grades and the final
examination grades of the 10 students in Mathematics.
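
A minimal sketch verifying r with SciPy (pearsonr also returns the two-tailed p-value directly, so the t step can be skipped):

```python
# A minimal sketch verifying the Pearson r above with SciPy.
from scipy import stats

midterm = [75, 70, 65, 90, 85, 85, 80, 70, 65, 90]
final   = [80, 75, 65, 95, 90, 85, 90, 75, 70, 90]

r, p = stats.pearsonr(midterm, final)
print(f"r = {r:.5f}, p = {p:.6f}")  # r is approximately 0.94925; p < 0.05, so Ho is rejected
```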

COEFFICIENT OF DETERMINATION

The coefficient of determination (denoted by R2) is a key output of regression


analysis. It is interpreted as the proportion of the variance in the dependent variable that
is predictable from the independent variable.

 The coefficient of determination is the square of the correlation (r) between


predicted y scores and actual y scores; thus, it ranges from 0 to 1.
 With linear regression, the coefficient of determination is also equal to the square
of the correlation between x and y scores.
 An R2 of 0 means that the dependent variable cannot be predicted from the
independent variable.
 An R2 of 1 means the dependent variable can be predicted without error from the
independent variable.
 An R2 between 0 and 1 indicates the extent to which the dependent variable is
predictable. An R2 of 0.10 means that 10 percent of the variance in Y is
predictable from X; an R2 of 0.20 means that 20 percent is predictable; and so
on.

The formula for computing the coefficient of determination for a linear regression model
with one independent variable is given below.

CD = r2 x 100%

What is the coefficient of determination when r = .949?

CD = r2 x 100%

= (.949)2 x 100%

= 90.06%

Activity:

Solve the following problems using stepwise method:

1. From long experience with a process for manufacturing an alcoholic beverage it is


known that the yield is normally distributed with a mean of 500 and a standard
deviation of 96 units. For a modified process the yield is 535 units for a sample of
size 50. At α= .05 does the modified process increase the yield?

2. A random sample of companies in electric utilities (I), financial services (II), and
food processing (III) gave the following information regarding annual profits per
employee (units in thousands of dollars).
I 49.1 43.4 32.9 27.8 38.3 36.1 20.2
II 55.6 25.0 41.3 29.9 39.5
III 39.0 37.3 10.8 32.5 15.8 42.6
Shall we reject or not the claim that there is no difference in population mean
annual profits per employee in each of the three types of companies? Use 1% level
of significance.
3. The following data represent the operating time in hours for 4 types of pocket
calculators before a recharge is required.
Fx101 Fx202 Fx303 Fx404
6.4 5.9 7.1 5.3
6.1 5.8 7.1 4.9
6.5 5.9 7.2 6.1
6.2 5.1 7.3 7.1
6.3 5.0 7.4 4.9
Test at 0.01 level of significance if the operating times for all four calculators are
equal.
4. Two groups of experimental animals are given two brands of feed supplement and
the following weight gains are recorded in grams.
Group I 260 204 105 275 210 100 110 143 170 189
Group II 65 78 140 182 195 48 79 80 69 89
Test whether the two samples differ at the 0.05 level of significance.
5. Twelve students were given remedial instruction in reading comprehension. A pretest
and a posttest were also administered, the results of which follow:
Pretest Posttest
25 30
20 25
20 20
8 12
15 20
8 8
15 20
8 8
20 20
21 20
35 38
20 30
Test at 0.05 level of significance to find out if there is significant difference between
the pretest and posttest.
6. Is there a significant difference on the performance of male and female Grade V
pupils in three different subjects using two different methods? Use α=.05

Subjects Methods of Teaching

Method I Method II
Male Female Male Female
90 87 85 80
88 85 90 95
1 82 78 88 89

80 75 89 82
2 91 88 85 80
85 80 90 92

81 76 85 80
3 89 86 87 82
83 79 88 85

7. The following data are based on information from the book Life in America's Small
Cities (by G.S. Thomas, Prometheus Books). Let x be the percentage of 16- to 19-
year-olds not in school and not high school graduates. Let y be the reported violent
crimes per 1000 residents. Six small cities in Arkansas (Blytheville, El Dorado, Hot
Springs, Jonesboro, Rogers, and Russellville) reported the following information
about x and y:

X 24.2 19 18.2 14.9 19.0 17.5


Y 13 4.4 9.3 1.3 0.8 3.6

Is there a significant relationship between x and y? Test at 0.05 level of


significance.
If there is a significant relationship between x and y, compute for y if x =
26,27,28,29.

8. Determine the effects of gender and type of high school graduated from on the
college entrance examination scores of the students. Test at 0.05 level of
significance

Gender High School Graduated from


Public Private
Male 85 79 87 76
90 81 78 83
70 93 90 73
75 78 80 85
Female 75 79 70 90
83 87 93 89
91 83 87 78
90 88 89 85

9. Determine the effects of socio-economic status, type of school and location of


residence on the number of absences incurred by the students in the past
semester.

Socio Location of Residence


Economic Within the City Outside the City
Status Public Private Public Private
High 8 3 4 2
2 5 3 9
4 6 4 7
7 9 1 6
5 6 4 6
Average 5 8 6 1
6 2 4 5
4 5 8 4
0 3 3 2
2 1 4 3
Low 1 9 4 1
3 4 7 2
6 2 6 4
8 5 2 5
5 6 4 4

10. A teacher wants to find out whether a relationship exists between her students’
GWA and the result of their exam in the Entrance Test for Graduate School
Program. Test at 0.05 level of significance

GWA Entrance
Exam Results
2.55 98
2.82 90
2.73 95
2.72 93
2.88 99
2.8 97
2.53 98
2.09 97
2.75 94
1.55 99
2.62 95

2.86 95
2.35 95
2.31 97
2.66 98

11. Suppose we want to predict job performance of Chevy mechanics based on


mechanical aptitude test scores and test scores from a personality test that
measures conscientiousness.
Job Mechanical Conscientiousness
Performance Aptitude
Y X1 X2
1 40 25
2 45 20
1 38 30
3 50 30
2 48 28
3 55 30
3 53 34
4 55 36
4 58 32
3 40 34
5 55 38
3 48 28
3 45 30
2 55 36
4 60 34
5 60 38
5 60 42
5 65 38
4 50 34
3 58 38
6 60 43
3 45 30

Test at 0.05 level of significance.

Session 3
Nonparametric Statistics

By the end of this session you should be able to:


3. Formulate and test statistical hypothesis
4. Solve hypothesis testing problems using nonparametric statistics

Lecture:

INTRODUCTION

Nonparametric statistics refers to a statistical method in which the data are not assumed
to come from prescribed models that are determined by a small number of parameters;
examples of such models include the normal distribution model and the linear regression
model. Nonparametric statistics sometimes uses data that is ordinal, meaning it does not
rely on numbers, but rather on a ranking or order of sorts. For example, a survey
conveying consumer preferences ranging from like to dislike would be considered ordinal
data.

Nonparametric statistics includes nonparametric descriptive statistics, statistical models,


inference, and statistical tests. The model structure of nonparametric models is not
specified a priori but is instead determined from data. The term nonparametric is not
meant to imply that such models completely lack parameters, but rather that the number
and nature of the parameters are flexible and not fixed in advance. A histogram is an
example of a nonparametric estimate of a probability distribution.

In statistics, parametric statistics includes parameters such as the mean, standard


deviation, Pearson correlation, variance, etc. This form of statistics uses the observed
data to estimate the parameters of the distribution. Under parametric statistics, data are
often assumed to come from a normal distribution with unknown parameters μ (population
mean) and σ2 (population variance), which are then estimated using the sample mean
and sample variance. Nonparametric statistics makes no assumption about the sample
size or whether the observed data is quantitative.

Nonparametric statistics does not assume that data is drawn from a normal distribution.
Instead, the shape of the distribution is estimated under this form of statistical
measurement. While there are many situations in which a normal distribution can be
assumed, there are also some scenarios in which the true data generating process is far
from normally distributed.

SPECIAL CONSIDERATIONS

Nonparametric statistics have gained appreciation due to their ease of use. As the need
for parameters is relieved, the data becomes more applicable to a larger variety of tests.
This type of statistics can be used without the mean, sample size, standard deviation, or
the estimation of any other related parameters when none of that information is available.

Since nonparametric statistics makes fewer assumptions about the sample data, its
application is wider in scope than parametric statistics. In cases where parametric testing
is more appropriate, nonparametric methods will be less efficient. This is because
nonparametric statistics discard some information that is available in the data, unlike
parametric statistics.

COMMONLY USED TESTS UNDER NONPARAMETRIC TESTS

They are:

 The Chi-Square Tests
 The Wilcoxon Rank-Sum Test or Wilcoxon Two-Sample Test
 The Kruskal-Wallis Test or Kruskal-Wallis H-Test
 The Spearman Rank Order Coefficient of Correlation rs
 A Sign Test for Two Correlated Samples (Fisher Sign Test)
 The Sign Test for K Independent Samples (The Median Test: Multi-Sample Case)
 The McNemar's Test for Correlated Samples
 The Friedman Fr Test for Randomized Block Designs
 The Kendall's Coefficient of Concordance W
THE CHI-SQUARE TEST

This is a test of difference between the observed and expected frequencies. The Chi-
Square is considered a unique test due to its 3 functions which are as follows:

 The test of goodness-of-fit


 The test of homogeneity
 The test of independence

THE CHI-SQUARE TEST OF GOODNESS-OF-FIT

Chi-Square goodness of fit test is a non-parametric test that is used to find out how the
observed value of a given phenomena is significantly different from the expected value.
In Chi-Square goodness of fit test, the term goodness of fit is used to compare the
observed sample distribution with the expected probability distribution. Chi-Square

goodness of fit test determines how well a theoretical distribution (such as the normal,
binomial, or Poisson) fits the empirical distribution. In the Chi-Square goodness of fit test,
sample data are divided into intervals; the number of points that fall into each interval is
then compared with the expected number of points in each interval.

PROCEDURE FOR CHI-SQUARE GOODNESS OF FIT TEST:

Set up the hypothesis for Chi-Square goodness of fit test:

A. Null hypothesis: In Chi-Square goodness of fit test, the null hypothesis assumes that
there is no significant difference between the observed and the expected value.

B. Alternative hypothesis: In Chi-Square goodness of fit test, the alternative hypothesis


assumes that there is a significant difference between the observed and the expected
value.

Compute the value of Chi-Square goodness of fit test using the following formula:

$$x^2 = \sum \frac{(O - E)^2}{E}$$

Where:

x² = the chi-square test
O = the observed frequencies
E = the expected frequencies

Degree of freedom: In Chi-Square goodness of fit test, the degree of freedom depends
on the distribution of the sample. The following table shows the distribution and an
associated degree of freedom:

Type of distribution     No. of constraints     Degrees of freedom
Binomial distribution    1                      n − 1
Poisson distribution     2                      n − 2
Normal distribution      3                      n − 3

Hypothesis testing: Hypothesis testing in Chi-Square goodness of fit test is the same as
in other tests, like t-test, ANOVA, etc. The calculated value of Chi-Square goodness of
fit test is compared with the table value. If the calculated value of Chi-Square goodness
of fit test is greater than the table value, we will reject the null hypothesis and conclude
that there is a significant difference between the observed and the expected frequency.

If the calculated value of Chi-Square goodness of fit test is less than the table value, we
will accept the null hypothesis and conclude that there is no significant difference between
the observed and expected value.

Example:

The theory of Mendel regarding crossing of peas is in the ratio of 9:3:3:1, meaning 9 parts
are smooth yellow, 3 parts wrinkled yellow, 3 parts smooth green and one part wrinkled
green. The researcher conducted an experiment and the result was that out of 560 peas,
310 were smooth yellow, 100 were wrinkled yellow, 110 were smooth green and 40 were
wrinkled green. Is there a significant difference between the observed and the expected?
Use X2 – test at 0.05 level of significance.

Problem: Is there a significant difference between the observed (actual experiment) and
the expected (theory) frequencies?

Hypotheses:
Ho: There is no significant difference between the observed and the expected
frequencies.
Ha: There is a significant difference between the observed and the expected frequencies.

Level of Significance
α = 0.05
df = h-1 = 4-1 = 3
X2.05 = 7.815 tabular value

Statistics: Chi square goodness of fit test


Computations:

Attributes          Ratio    Actual Result    Theory        O − E
                             (Observed)       (Expected)
Smooth Yellow       9        310              315           -5
Wrinkled Yellow     3        100              105           -5
Smooth Green        3        110              105           5
Wrinkled Green      1        40               35            5
Total               16       560              560

For the Theory or Expected values: divide 560 by 16 = 35, then multiply 35 by each ratio
to obtain the expected frequencies.

$$x^2 = \sum \frac{(O - E)^2}{E} = \frac{(-5)^2}{315} + \frac{(-5)^2}{105} + \frac{5^2}{105} + \frac{5^2}{35} = 1.27$$

Decision Rule: If the 𝑥 2 computed value is greater than or beyond the 𝑥 2 tabular value,
reject Ho.

Conclusion: Since the 𝑥 2 computed value of 1.27 is less than the critical value of 7.815
at 0.05 level of significance with 3 degrees of freedom, we fail to reject the null hypothesis,
there is no significant difference between the observed and the expected frequencies.
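
A minimal sketch of the same goodness-of-fit test with SciPy:

```python
# A minimal sketch of the goodness-of-fit test above using SciPy.
from scipy.stats import chisquare

observed = [310, 100, 110, 40]
expected = [315, 105, 105, 35]   # 560 split in the 9:3:3:1 ratio

stat, p = chisquare(f_obs=observed, f_exp=expected)
print(f"x^2 = {stat:.2f}, p = {p:.3f}")  # x^2 is approximately 1.27 < 7.815, fail to reject Ho
```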

THE CHI-SQUARE TEST OF HOMOGENEITY

The chi-square test of homogeneity tests to see whether different columns (or rows) of
data in a table come from the same population or not (i.e., whether the differences are
consistent with being explained by sampling error alone). For example, in a table showing
political party preference in the rows and states in the columns, the test has the null
hypothesis that each state has the same party preferences.

Use the formula below for a 2x2 contingency table:

$$x^2 = \frac{N(ad - bc)^2}{klmn}$$

Where:
x² = chi square
N = grand total
klmn = the product of the row totals (k, l) and column totals (m, n)

For a 2x2 contingency table, label the cells as follows:

                    Total
      a       b     k
      c       d     l
Total m       n     N

Use this formula for any form of contingency table:

$$x^2 = \sum \frac{(O - E)^2}{E}$$

Where:

x² = the chi-square test
O = the observed frequencies
E = the expected frequencies

Example:

To illustrate this, we can evaluate the attitude of a sample of Lakas and Laban parties on
the issue of peace and order in Mindanao. To carry out such study, a separate random
sample of members of each party is drawn from the nationwide population of Lakas and
Laban and each individual in both samples responds to the scale. Scores are then
classified into favorable or unfavorable categories. The following frequencies are
obtained:
Favorable Unfavorable Total
Lakas 65 35 100
Laban 50 50 100
Total 115 85 200

Problem: Is there a significant difference between the attitudes of the two political parties
on the issue of peace and order in Mindanao?

Hypotheses:
Ho: There is no significant difference between the attitudes of the two political parties on
the issue of peace and order in Mindanao
Ha: There is a significant difference between the attitudes of the two political parties on
the issue of peace and order in Mindanao

Level of Significance:
α = 0.05
d.f. = (c − 1)(r − 1) = (2 − 1)(2 − 1) = 1
X²0.05 = 3.841

Computation:

$$x^2 = \frac{N(ad - bc)^2}{klmn} = \frac{200(3250 - 1750)^2}{(100)(100)(115)(85)} = 4.604$$

Decision Rule: If the chi square computed value is greater than the chi square tabular
value, reject the Ho.

Conclusion:
Since the chi square computed value of 4.604 is greater than the chi square tabular value
of 3.841 at the 0.05 level of significance with 1 degree of freedom, the research hypothesis
is confirmed. There is a significant difference between the attitudes of the two political
parties on the issue of peace and order in Mindanao.
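
A minimal sketch with SciPy; correction=False is needed to match the hand computation, since SciPy otherwise applies Yates' continuity correction to 2x2 tables:

```python
# A sketch of the 2x2 homogeneity test above with SciPy.
from scipy.stats import chi2_contingency

table = [[65, 35],   # Lakas: favorable, unfavorable
         [50, 50]]   # Laban: favorable, unfavorable

stat, p, dof, expected = chi2_contingency(table, correction=False)
print(f"x^2 = {stat:.3f}, p = {p:.4f}")  # x^2 is approximately 4.604 > 3.841, reject Ho
```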

THE CHI-SQUARE TEST OF INDEPENDENCE

The Chi-Square Test of Independence determines whether there is an association


between categorical variables (i.e., whether the variables are independent or related). It
is a nonparametric test.

This test is also known as:

 Chi-Square Test of Association.

This test utilizes a contingency table to analyze the data. A contingency table (also known
as a cross-tabulation, crosstab, or two-way table) is an arrangement in which data is
classified according to two categorical variables. The categories for one variable appear
in the rows, and the categories for the other variable appear in columns. Each variable
must have two or more categories. Each cell reflects the total count of cases for a specific
pair of categories.

The Chi-Square Test of Independence is commonly used to test the following:

 Statistical independence or association between two or more categorical variables.

The Chi-Square Test of Independence can only compare categorical variables. It cannot
make comparisons between continuous variables or between categorical and continuous
variables. Additionally, the Chi-Square Test of Independence only assesses associations
between categorical variables, and cannot provide any inferences about causation.

Your data must meet the following requirements:
1. Two categorical variables.
2. Two or more categories (groups) for each variable.
3. Independence of observations.
a. There is no relationship between the subjects in each group.
b. The categorical variables are not "paired" in any way (e.g. pre-test/post-test
observations).
4. Relatively large sample size.
a. Expected frequencies for each cell are at least 1.
b. Expected frequencies should be at least 5 for the majority (80%) of the cells.

The null hypothesis (H0) and alternative hypothesis (H1) of the Chi-Square Test of
Independence can be expressed in two different but equivalent ways:

Ho: "[Variable 1] is independent of [Variable 2]"


Ha: "[Variable 1] is not independent of [Variable 2]"

or

Ho: "[Variable 1] is not associated with [Variable 2]"


Ha: "[Variable 1] is associated with [Variable 2]"

Use this formula:

$$x^2 = \sum \frac{(O - E)^2}{E}$$

Where:

x² = the chi-square test
O = the observed frequencies
E = the expected frequencies

Example:

Ninety individuals, male and female, were given a test in psychomotor skills and their
scores were classified into high and low. Use the x2 test of independence at 0.05 level of
significance. The table is shown.

Scores
Sex High Low Total
Male 18 28 46
Female 32 12 44
Total 50 40 90

Problem: Is there a significant relationship between sex and scores in psychomotor skill?

Hypotheses:
Ho: There is no significant relationship between sex and scores in psychomotor skill
Ha: There is a significant relationship between sex and scores in psychomotor skill

Level of Significance:
α=0.05
d.f= (c-1)(r-1)
= (2-1)(2-1)
=1
2
X = 3.841

Statistics: Chi-Square Test of Independence

Computation:

Scores
Sex High Low Total
O E O E
Male 18 25.56 28 20.44 46
Female 32 24.44 12 19.56 44
Total 50 40 90

For expected values: Multiply the column total with the row total and divide the product
by the grand total.

$$x^2 = \sum \frac{(O - E)^2}{E} = \frac{(18 - 25.56)^2}{25.56} + \frac{(32 - 24.44)^2}{24.44} + \frac{(28 - 20.44)^2}{20.44} + \frac{(12 - 19.56)^2}{19.56} = 10.292$$

Decision Rule: If the chi square computed value is greater than the chi square tabular
value, reject the Ho.

Conclusion:
Since the chi square computed value of 10.292 is greater than the chi square tabular
value of 3.841 at the 0.05 level of significance with 1 degree of freedom, the research
hypothesis is confirmed. There is a significant relationship between sex and scores in
psychomotor skill.
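
The independence test follows the same pattern. A minimal sketch:

```python
# A sketch of the independence test above; correction=False reproduces the
# uncorrected x^2 computed by hand.
from scipy.stats import chi2_contingency

table = [[18, 28],   # Male: high, low
         [32, 12]]   # Female: high, low

stat, p, dof, expected = chi2_contingency(table, correction=False)
print(f"x^2 = {stat:.3f}, p = {p:.5f}")  # x^2 is approximately 10.29 > 3.841, reject Ho
```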

MANN WHITNEY U TEST (WILCOXON RANK SUM TEST)

The modules on hypothesis testing presented techniques for testing the equality of means
in two independent samples. An underlying assumption for appropriate use of those tests
was that the continuous outcome was approximately normally distributed or that the
samples were sufficiently large (usually n1 > 30 and n2 > 30) to justify their use based on
the Central Limit Theorem. When comparing two independent samples in which the
outcome is not normally distributed and the samples are small, a nonparametric test is
appropriate.

A popular nonparametric test to compare outcomes between two independent groups is


the Mann Whitney U test. The Mann Whitney U test, sometimes called the Mann Whitney
Wilcoxon Test or the Wilcoxon Rank Sum Test, is used to test whether two samples are
likely to derive from the same population (i.e., that the two populations have the same
shape). Some investigators interpret this test as comparing the medians between the two
populations. Recall that the parametric test compares the means (H 0: μ1=μ2) between
independent groups.

In contrast, the null and two-sided research hypotheses for the nonparametric test are
stated as follows:

Ho: The two populations are equal versus

Ha: The two populations are not equal.

This test is often performed as a two-sided test and, thus, the research hypothesis
indicates that the populations are not equal as opposed to specifying directionality. A one-
sided research hypothesis is used if interest lies in detecting a positive or negative shift
in one population as compared to the other. The procedure for the test involves pooling
the observations from the two samples into one combined sample, keeping track of which
sample each observation comes from, and then ranking lowest to highest from 1 to n1+n2,
respectively.

EXAMPLE:

Consider a Phase II clinical trial designed to investigate the effectiveness of a new drug
to reduce symptoms of asthma in children. A total of n=10 participants are randomized to
receive either the new drug or a placebo. Participants are asked to record the number of
episodes of shortness of breath over a 1 week period following receipt of the assigned
treatment. The data are shown below.

Placebo 7 5 6 4 12
New Drug 3 6 4 2 1

Is there a difference in the number of episodes of shortness of breath over a 1 week
period in participants receiving the new drug as compared to those receiving the placebo?
By inspection, it appears that participants receiving the placebo have more episodes of
shortness of breath, but is this statistically significant?

In this example, the outcome is a count and in this sample the data do not follow a normal
distribution.

[Figure: frequency histogram of the number of episodes of shortness of breath]

In addition, the sample size is small (n1 = n2 = 5), so a nonparametric test is appropriate.


The hypothesis is given below, and we run the test at the 5% level of significance (i.e.,
α=0.05).

H0: The two populations are equal versus

H1: The two populations are not equal.

Note that if the null hypothesis is true (i.e., the two populations are equal), we expect to
see similar numbers of episodes of shortness of breath in each of the two treatment
groups, and we would expect to see some participants reporting few episodes and some

reporting more episodes in each group. This does not appear to be the case with the
observed data. A test of hypothesis is needed to determine whether the observed data is
evidence of a statistically significant difference in populations.

The first step is to assign ranks and to do so we order the data from smallest to largest.
This is done on the combined or total sample (i.e., pooling the data from the two treatment
groups (n=10)), and assigning ranks from 1 to 10, as follows. We also need to keep track
of the group assignments in the total sample.

Total Sample          Ordered (Smallest to Largest)     Ranks
Placebo   New Drug    Placebo   New Drug                Placebo   New Drug
7         3                     1                                 1
5         6                     2                                 2
6         4                     3                                 3
4         2           4         4                       4.5       4.5
12        1           5                                 6
                      6         6                       7.5       7.5
                      7                                 9
                      12                                10

Note that the lower ranks (e.g., 1, 2 and 3) are assigned to responses in the new drug
group while the higher ranks (e.g., 9, 10) are assigned to responses in the placebo group.
Again, the goal of the test is to determine whether the observed data support a difference
in the populations of responses. Recall that in parametric tests (discussed in the modules
on hypothesis testing), when comparing means between two groups, we analyzed the
difference in the sample means relative to their variability and summarized the sample
information in a test statistic. A similar approach is employed here. Specifically, we
produce a test statistic based on the ranks.

First, we sum the ranks in each group. In the placebo group, the sum of the ranks is 37;
in the new drug group, the sum of the ranks is 18. Recall that the sum of the ranks will
always equal n(n+1)/2. As a check on our assignment of ranks, we have n(n+1)/2 =
10(11)/2=55 which is equal to 37+18 = 55.

For the test, we call the placebo group 1 and the new drug group 2 (assignment of groups
1 and 2 is arbitrary). We let R1 denote the sum of the ranks in group 1 (i.e., R1 = 37), and
R2 denote the sum of the ranks in group 2 (i.e., R2 = 18). If the null hypothesis is true (i.e.,

if the two populations are equal), we expect R1 and R2 to be similar. In this example, the
lower values (lower ranks) are clustered in the new drug group (group 2), while the higher
values (higher ranks) are clustered in the placebo group (group 1). This is suggestive, but
is the observed difference in the sums of the ranks simply due to chance? To answer this
we will compute a test statistic to summarize the sample information and look up the
corresponding value in a probability distribution.

TEST STATISTIC FOR THE MANN WHITNEY U TEST

The test statistic for the Mann Whitney U Test is denoted U and is the smaller of U1 and
U2, defined below:

U_1 = n_1 n_2 + \frac{n_1(n_1 + 1)}{2} - R_1

U_2 = n_1 n_2 + \frac{n_2(n_2 + 1)}{2} - R_2

where R1 = sum of the ranks for group 1 and R2 = sum of the ranks for group 2.

For this example,

U_1 = 5(5) + \frac{5(5 + 1)}{2} - 37 = 3

U_2 = 5(5) + \frac{5(5 + 1)}{2} - 18 = 22

In our example, U = 3. Is this evidence in support of the null or research hypothesis? Before
we address this question, note the range of the test statistic U: U = 0 indicates complete
separation of the two groups (evidence against Ho), while values of U near n1n2/2 (here
12.5) arise when the ranks are thoroughly intermixed between the groups (consistent with Ho).

EXAMPLE:

A new approach to prenatal care is proposed for pregnant women living in a rural
community. The new program involves in-home visits during the course of pregnancy in
addition to the usual or regularly scheduled visits. A pilot randomized trial with 15 pregnant
women is designed to evaluate whether women who participate in the program deliver
healthier babies than women receiving usual care. The outcome is the APGAR score
measured 5 minutes after birth. Recall that APGAR scores range from 0 to 10 with scores
of 7 or higher considered normal (healthy), 4-6 low and 0-3 critically low. The data are
shown below.

Usual Care     8   7   6   2   5    8   7   3
New Program    9   8   7   8   10   9   6

Problem: Is there statistical evidence of a difference in APGAR scores in women receiving
the new and enhanced versus usual prenatal care?

Hypotheses:

• H0: The two populations are equal versus


• H1: The two populations are not equal.

Level of Significance

• α = 0.05
• n1 = 8, n2 = 7
• critical value of U = 10

Statistic: Mann Whitney U test

Computation:

Total Sample          Ordered (Smallest to Largest)     Ranks
Usual    New          Usual    New                      Usual    New
Care     Program      Care     Program                  Care     Program
8        9            2                                 1
7        8            3                                 2
6        7            5                                 3
2        8            6        6                        4.5      4.5
5        10           7        7                        7        7
8        9            7                                 7
7        6            8        8                        10.5     10.5
3                     8        8                        10.5     10.5
                               9                                 13.5
                               9                                 13.5
                               10                                15
Rank sums:                                              R1 = 45.5   R2 = 74.5

U_1 = n_1 n_2 + \frac{n_1(n_1 + 1)}{2} - R_1 = 8(7) + \frac{8(8 + 1)}{2} - 45.5 = 46.5

U_2 = n_1 n_2 + \frac{n_2(n_2 + 1)}{2} - R_2 = 8(7) + \frac{7(7 + 1)}{2} - 74.5 = 9.5
2

Thus, the test statistic is U=9.5.

Decision:

The appropriate critical value can be found in a table of critical values for the Mann
Whitney U statistic. To determine the appropriate critical value we need the sample sizes
(n1 = 8 and n2 = 7) and our two-sided level of significance (α = 0.05). The critical value for
this test with n1 = 8, n2 = 7 and α = 0.05 is 10, and the decision rule is as follows: Reject
H0 if U < 10.

Conclusion:

We reject H0 because 9.5 < 10. We have statistically significant evidence at α =0.05 to
show that the populations of APGAR scores are not equal in women receiving usual
prenatal care as compared to the new program of prenatal care.
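The ranking above is easy to get wrong by hand; as a check, the sketch below (assuming SciPy is available) reproduces the example. SciPy reports U for the first sample passed in, so the smaller of that value and n1·n2 minus it is taken to match the statistic used here.

```python
# A minimal sketch of the Mann Whitney U test for the APGAR example, assuming SciPy.
from scipy.stats import mannwhitneyu

usual_care  = [8, 7, 6, 2, 5, 8, 7, 3]
new_program = [9, 8, 7, 8, 10, 9, 6]

res = mannwhitneyu(usual_care, new_program, alternative='two-sided')

u1 = res.statistic                                   # U for the first sample
u = min(u1, len(usual_care) * len(new_program) - u1)
print(u)            # 9.5, matching the hand computation
print(res.pvalue)   # two-sided p-value, below 0.05 here -> reject H0
```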

KRUSKAL WALLIS TEST (H-TEST)

The Kruskal-Wallis H test (sometimes also called the "one-way ANOVA on ranks") is a
rank-based nonparametric test that can be used to determine if there are statistically
significant differences between two or more groups of an independent variable on a
continuous or ordinal dependent variable. It is considered the nonparametric alternative
to the one-way ANOVA, and an extension of the Mann-Whitney U test to allow the
comparison of more than two independent groups.

The formula is:

H = \frac{12}{n(n + 1)} \sum \frac{R_i^2}{n_i} - 3(n + 1)

Where:

H = the Kruskal-Wallis test statistic
n = the total number of observations in all groups combined
R_i = the sum of the ranks of the i-th group
n_i = the number of observations in the i-th group
12, 3 and 1 are constants

HOW DO WE USE THE H-TEST?

• Rank the data/observations of all the groups from the lowest to the highest value
• Assign the ranks to the corresponding observations
• Get the rank sums R1, R2, … of the groups
• Determine the sample size of every group
• Use the formula of the H test
• Compare the computed value to the critical value (x2 table)

Example:

Consider the examination scores of samples of high school students who are taught
English using three different methods: Method 1 (classroom instruction and language
laboratory), Method 2 (only classroom instruction) and Method 3 (only self-study in
language laboratory). Use the H-test at 0.05 level of significance to test the null hypothesis
that there is no significant difference among the scores under the three methods. Consider
the following data:

Method 1   Method 2   Method 3
94         85         89
88         88         78
90         90         75
95         80         65
92         79         80
90         85
           80

Problem: Is the claim true that there are no significant differences in the examination
scores of high school students using three different methods?

Hypotheses:

Ho: There are no significant differences in the examination scores of high school students
using three different methods

Ha: There are significant differences in the examination scores of high school students
using three different methods

Level of Significance:

α = 0.05

d.f. = k - 1 = 3 - 1 = 2 (k = number of groups)

x2 = 5.99

Statistics: Kruskal-Wallis (H-test)

Computations:

Method 1   R1     Method 2   R2     Method 3   R3
94         17     85         8.5    89         12
88         10.5   88         10.5   78         3
90         14     90         14     75         2
95         18     80         6      65         1
92         16     79         4      80         6
90         14     85         8.5
                  80         6
Sum of
Ranks      89.5              57.5              24

H = \frac{12}{n(n + 1)} \sum \frac{R_i^2}{n_i} - 3(n + 1)

H = \frac{12}{18(18 + 1)} \left( \frac{89.5^2}{6} + \frac{57.5^2}{7} + \frac{24^2}{5} \right) - 3(18 + 1) = 10.458

Decision Rule:

If the x2 computed value is greater than the x2 tabular value, reject the Ho.

Conclusion:

Since the H computed value of 10.458 is greater than the x2 tabular value of 5.99 at 0.05
level of significance with 2 degrees of freedom, the null hypothesis of no significant
difference in the examination scores across the three teaching methods is rejected.
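The same test is available in Python; the sketch below assumes SciPy is installed. SciPy applies a correction for tied ranks, so its H is slightly larger than the uncorrected hand value.

```python
# A minimal sketch of the Kruskal-Wallis H test, assuming SciPy is installed.
from scipy.stats import kruskal

method_1 = [94, 88, 90, 95, 92, 90]
method_2 = [85, 88, 90, 80, 79, 85, 80]
method_3 = [89, 78, 75, 65, 80]

h, p = kruskal(method_1, method_2, method_3)

# SciPy corrects for tied ranks, so h is about 10.57 rather than the
# uncorrected 10.458 computed by hand; the conclusion is the same.
print(h, p)   # p < 0.05 -> reject Ho
```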

THE SIGN TEST FOR TWO INDEPENDENT SAMPLES (MEDIAN TEST TWO-SAMPLE
CASE)

The median test is a nonparametric test used to determine whether two independent
groups differ in central tendency, that is, whether they come from populations with the
same median. It is based on the direction of the plus and minus signs of the observations
relative to the joint median, and not on their numerical magnitude. It is related to the
binomial sign test with p = 0.5 used for paired data. The sign test is considered a weaker
test, because it only records whether each value lies above or at/below the median and
does not measure the size of the differences. This is the counterpart of the t-test under
the parametric tests.

The data consist of two independent samples of n1 and n2 observations. The median of
the two samples taken jointly is found. In each sample, observations above the joint
median are assigned a plus sign and those at or below it a minus sign. The number of +
and − signs in each sample is then obtained, and a x2 test is used to determine whether
the observed frequencies of + and − signs differ significantly.

Apply the formula

x^2 = \frac{N(ad - bc)^2}{klmn}

Where:

x2 = Chi-square test
a and c = observed (+) frequencies
b and d = observed (-) frequencies
k and l = the row total
m and n = the column total
N = grand total

           +      −      Total
Sample 1   a      b      k
Sample 2   c      d      l
Total      m      n      N

Example:

Consider the test scores of 12 female and 9 male students in a spelling test.

Female 12 26 25 10 10 10 22 20 19 17 17 15
Male 6 22 19 7 8 12 16 8 19

Problem:

Is there a significant difference in the performance of the two groups?

Hypotheses:

Ho: There is no significant difference in the performance of the two groups

Ha: There is a significant difference in the performance of the two groups.

Level of significance:

α = 0.05

df = (c - 1)(r - 1) = (2 - 1)(2 - 1) = 1

x2 0.05 = 3.841

Computations:

The median of the female and male observations taken jointly is 16. Assigning a + to
values above the median and a − to the values at or below it, we have the following result.

Female - + + - - - + + + + + -
Male - + + - - - - - +

This data may be tabulated in the form of a 2x2 table as follows:

          +        −        Total
Female    a = 7    b = 5    k = 12
Male      c = 3    d = 6    l = 9
Total     m = 10   n = 11   N = 21

x^2 = \frac{N(ad - bc)^2}{klmn}

x^2 = \frac{21(42 - 15)^2}{(12)(9)(10)(11)} = 1.288

Decision Rule:

If the x2 computed value is greater than the x2 tabular value, reject the Ho.

Conclusion:

Since the x2 computed value of 1.288 is less than the x2 tabular value of 3.841 at 0.05
level of significance with 1 degree of freedom, the null hypothesis of no significant
difference in the performance of the two groups is not rejected.
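The whole procedure (joint median, sign counts, x2 formula) can be mirrored in a few lines of Python; the sketch below assumes NumPy and SciPy are installed.

```python
# A minimal sketch of the two-sample median (sign) test, assuming NumPy/SciPy.
import numpy as np
from scipy.stats import chi2

female = [12, 26, 25, 10, 10, 10, 22, 20, 19, 17, 17, 15]
male   = [6, 22, 19, 7, 8, 12, 16, 8, 19]

median = np.median(female + male)    # joint median = 16

a = sum(x > median for x in female)  # + signs, female: 7
b = len(female) - a                  # - signs (at or below), female: 5
c = sum(x > median for x in male)    # + signs, male: 3
d = len(male) - c                    # - signs (at or below), male: 6

N = len(female) + len(male)
x2 = N * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
print(round(x2, 3))                  # 1.288
print(chi2.sf(x2, df=1))             # p > 0.05 -> fail to reject Ho
```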

THE SIGN TEST FOR TWO-CORRELATED SAMPLES (FISHER SIGN TEST)

This test is under nonparametric statistics. It is the counterpart of the t-test for correlated
samples under the parametric test. The Fisher Sign Test compares two correlated samples
and is applicable to data composed of N paired observations. The difference between
each pair of observations is obtained. The test is based on the idea that, under the null
hypothesis, half the differences between the paired observations will be positive and the
other half negative.

The formula is:

z = \frac{|D| - 1}{\sqrt{N}}

Where:

z = the Fisher Sign test

D = the difference between the number of + and − signs

N = the number of + and − signs (zeros are disregarded)

Example:

Consider the pretest and the posttest results before and after the implementation of the
program.

Pretest (x)   Posttest (y)
15 19
19 30
31 26
36 8
10 10
11 6
19 17
15 13
10 22
16 8

Problem:

Is there a significant difference between the pretest and posttest results of the 10
students?

Hypothesis:

Ho: There is no significant difference between the pretest and posttest results of the 10
students

Ha: There is a significant difference between the pretest and posttest results of the 10
students.

Level of Significance:

α = 0.05

Z0.05 = ±1.96

Statistics: Fisher Sign Test

Pretest (x)   Posttest (y)   Sign of x − y (D)
15 19 -
19 30 -
31 26 +
36 8 +
10 10 0

11 6 +
19 17 +
15 13 +
10 22 -
16 8 +

In this example, there are 6 plus signs, 3 minus signs, and 1 zero. The zero is disregarded,
so N = 9. It may be shown that:

z = \frac{|D| - 1}{\sqrt{N}}

z = \frac{|6 - 3| - 1}{\sqrt{9}} = \frac{2}{3} = 0.67

Decision Rule:

If the z computed value is greater than the z tabular value, reject the Ho.

Conclusion:

Since the z computed value of 0.67 is less than the z tabular value of 1.96 at 0.05 level of
significance, the null hypothesis is not rejected, which means that there is no significant
difference between the pretest and the posttest results of the ten students.
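The sketch below (assuming SciPy is installed) reproduces the z statistic and, as a cross-check, runs the exact binomial version of the sign test on the same counts; both lead to the same conclusion here.

```python
# A minimal sketch of the Fisher sign test for paired data, assuming SciPy.
import math
from scipy.stats import binomtest

pretest  = [15, 19, 31, 36, 10, 11, 19, 15, 10, 16]
posttest = [19, 30, 26, 8, 10, 6, 17, 13, 22, 8]

diffs = [x - y for x, y in zip(pretest, posttest) if x != y]  # drop zeros
plus  = sum(d > 0 for d in diffs)                             # 6
minus = sum(d < 0 for d in diffs)                             # 3
n = plus + minus                                              # 9

z = (abs(plus - minus) - 1) / math.sqrt(n)
print(round(z, 2))                      # 0.67 < 1.96 -> fail to reject Ho

# Exact binomial version of the sign test (p = 0.5 under Ho):
print(binomtest(plus, n, 0.5).pvalue)   # ~0.51, same conclusion
```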

THE MEDIAN TEST: MULTI-SAMPLE TEST

This test is under the nonparametric tests. This is a straightforward extension of the
median test for two independent samples. The x2 test is used for k independent samples.

x^2 = \sum \frac{(O - E)^2}{E}
Where:

X2= the chi-square test


O = the observed frequencies
E = the expected frequencies

Example:

A sampling of the acidity of rain for 10 randomly selected rainfalls was recorded at three
different locations in the province of Northern Samar: Biri Island, Catarman, and Silvino
Lubos. The pH readings for these 30 rainfalls are shown in the table. (Note that pH
readings range from 0 to 14; 0 is acidic, 14 is alkaline. Pure water falling through clean air
has a pH reading of 5.7.)

Biri   Catarman   Silvino Lubos
4.4    4.6        4.7
4.0    4.5        4.8
4.1    4.3        5.0
3.5    3.8        4.9
2.4    4.2        3.9
3.8    4.5        4.5
4.2    4.7        4.6
3.9    4.3        4.3
4.1    4.5        4.0
4.2    4.8        4.7

Use the median test at 0.05 level of significance to test the null hypothesis that there is
no significant difference among the pH readings of the three different municipalities of
Northern Samar.

Problem: Is there a significant difference in the pH readings among the three different
municipalities of Northern Samar?

Hypotheses:

Ho: There is no significant difference in the pH readings among the three municipalities
in Northern Samar.

Ha: There is a significant difference in the pH readings among the three municipalities in
Northern Samar.

Level of Significance:

𝛼 = 0.05

d.f. = (c-1)(r-1) = (2-1)(3-1) = 2

x2= 5.991

Statistics: Median test for k independent samples

Computation:

Biri   Catarman   Silvino Lubos
+      +          +
-      +          +
-      -          +
-      -          +
-      -          -
-      +          +
-      +          +
-      -          -
-      +          -
-      +          +

                 Above 4.3 (+)       At or below 4.3 (−)
Municipality     O        E          O        E            Total
Biri             1        4.7        9        5.3          10
Catarman         6        4.7        4        5.3          10
Silvino Lubos    7        4.7        3        5.3          10
Total            14                  16                    30

x^2 = \sum \frac{(O - E)^2}{E} = 8.299

Decision Rule:

If the computed value is greater than the tabular value, reject the Ho.

Conclusion:

Since the computed value of 8.299 is greater than the tabular value of 5.991 at 0.05 level
of significance with 2 degrees of freedom, we reject the Ho: there is a significant difference
in the pH readings among the three municipalities of Northern Samar.
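The same median test can be mirrored in Python (assuming NumPy and SciPy); SciPy also offers scipy.stats.median_test as a one-call alternative, but the sketch below follows the textbook steps so the intermediate counts stay visible.

```python
# A minimal sketch of the k-sample median test, assuming NumPy/SciPy.
import numpy as np
from scipy.stats import chi2_contingency

biri     = [4.4, 4.0, 4.1, 3.5, 2.4, 3.8, 4.2, 3.9, 4.1, 4.2]
catarman = [4.6, 4.5, 4.3, 3.8, 4.2, 4.5, 4.7, 4.3, 4.5, 4.8]
silvino  = [4.7, 4.8, 5.0, 4.9, 3.9, 4.5, 4.6, 4.3, 4.0, 4.7]

groups = [biri, catarman, silvino]
grand_median = np.median(np.concatenate(groups))   # 4.3

# Rows of the table: counts above / at-or-below the grand median per group.
above = [sum(x > grand_median for x in g) for g in groups]   # [1, 6, 7]
below = [len(g) - a for g, a in zip(groups, above)]          # [9, 4, 3]

chi2, p, dof, expected = chi2_contingency([above, below], correction=False)
print(round(chi2, 3), dof, p)   # ~8.30 with 2 df, p < 0.05 -> reject Ho
```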

THE SPEARMAN RANK-ORDER COEFFICIENT OF CORRELATION rs

The Spearman rank-order correlation coefficient (Spearman’s correlation, for short) is a
nonparametric measure of the strength and direction of association that exists between
two variables measured on at least an ordinal scale. It is denoted by the symbol rs (or the
Greek letter ρ, pronounced rho). The test is used for either ordinal variables or for
continuous data that has failed the assumptions necessary for conducting the Pearson's
product-moment correlation. For example, you could use a Spearman’s correlation to
understand whether there is an association between exam performance and time spent
revising; whether there is an association between depression and length of
unemployment; and so forth.

HOW DO WE USE THE SPEARMAN RANK-ORDER COEFFICIENT OF CORRELATION rs?

• Rank the data in x, the independent variable.
• Rank the data in y, the dependent variable.
• Find D, the difference between Rx and Ry.
• Find D2, the square of each difference between Rx and Ry.
• Get ΣD2.
• Determine the sample size n.
• Use the formula below.

The formula is:

r_s = 1 - \frac{6 \sum D^2}{n(n^2 - 1)}

Where:

• rs = Spearman Rank Order Coefficient of Correlation
• ΣD2 = sum of the squares of the differences between rank x and rank y
• n = sample size
• 6 = constant

EXAMPLE:

The following are the number of hours which 12 students spent in studying for a midterm
examination and the grades they obtained in English. Calculate r s at 0.05 level of
significance.

Number of Hours Studied (x)   Midterm Grades (y)
5 50
6 60
11 79
20 90
19 85
20 92
10 80
12 82
8 65
15 85
18 94
10 70

Problem:

Is there a significant relationship between the number of hours spent in studying English
and the corresponding grades in the midterm examination?

Hypothesis:

Ho: There is no significant relationship between the number of hours spent in studying
English and the corresponding grades in the midterm examination.

Ha: There is a significant relationship between the number of hours spent in studying
English and the corresponding grades in the midterm examination.

Level of significance:

α = 0.05

df = 12 – 1 = 11

rs = 0.532

Statistics: Spearman Rank Order Coefficient Correlation rs

Computation:

Number of Hours Studied (x)   Midterm Grades (y)   Rx    Ry    D      D2
5 50 12 12 0 0
6 60 11 11 0 0
11 79 7 8 -1 1
20 90 1.5 3 -1.5 2.25
19 85 3 4.5 -1.5 2.25
20 92 1.5 2 -0.5 0.25
10 80 8.5 7 1.5 2.25
12 82 6 6 0 0
8 65 10 10 0 0
15 85 5 4.5 0.5 0.25
18 94 4 1 3 9
10 70 8.5 9 -0.5 0.25
∑ D2 =17.5

r_s = 1 - \frac{6 \sum D^2}{n(n^2 - 1)}

r_s = 1 - \frac{6(17.5)}{12(12^2 - 1)} = 0.94

Decision Rule:

If the rs computed value is greater than the rs tabular value, reject the Ho.

Conclusion:

Since the rs computed value of 0.94 is greater than the rs tabular value of 0.532 at 0.05
level of significance with 11 degrees of freedom, the research hypothesis is confirmed. A
significant relationship between the number of hours spent in studying English and the
grade in the midterm examination in English is established. It implies that the more hours
are devoted to studying, the higher the result in the examination.
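In Python (assuming SciPy is installed), the coefficient can be computed directly. SciPy ranks from lowest to highest rather than highest to lowest, but the resulting coefficient is the same.

```python
# A minimal sketch of Spearman's rank-order correlation, assuming SciPy.
from scipy.stats import spearmanr

hours  = [5, 6, 11, 20, 19, 20, 10, 12, 8, 15, 18, 10]
grades = [50, 60, 79, 90, 85, 92, 80, 82, 65, 85, 94, 70]

rho, p = spearmanr(hours, grades)
print(round(rho, 2))   # 0.94, matching the hand computation
print(p)               # p < 0.05 -> significant relationship
```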
FRIEDMAN Fr TEST FOR RANDOMIZED BLOCK DESIGNS

The Friedman Fr test is a nonparametric test used for comparing the distributions of
measurements for k treatments laid out in b blocks using a randomized block design. The
procedure for conducting the test is similar to that used for the Kruskal-Wallis H test.
When either the number k of treatments or the number b of blocks is larger than five, the
sampling distribution of Fr can be approximated by a chi-square distribution with (k - 1)
df. The formula for the Fr test is:

F_r = \frac{12}{bk(k + 1)} \sum T_i^2 - 3b(k + 1)

Where:

• Fr = the Friedman test statistic
• b = the number of blocks
• k = the number of treatments
• Ti = the rank sum for treatment i, where i = 1, 2, …, k

Example

In a study of the palatability of antibiotics in children, five healthy children were used as
subjects to assess their reaction to the taste of four antibiotics. The children’s response
was measured on a 10-centimeter visual analog scale incorporating the use of faces, from
sad (low score) to happy (high score). The minimum score was 0 and the maximum was
10. The following data were recorded:

Antibiotics
Child 1 2 3 4
1 5.8 2.5 6.7 6.2
2 9 9 6.6 9.5
3 5 2.6 3.5 6.6
4 7.9 9.4 5.3 8.4
5 3.9 7.5 2.5 2.5

Problem: Is there a significant difference in the reaction of the five children to the 4
different antibiotics?

Hypotheses:

Ho: There is no significant difference in the reaction of the 5 children to the 4 different
antibiotics

Ha: There is a significant difference in the reaction of the 5 children to the 4 different
antibiotics

Level of Significance:

α=0.05

df=k-1 =4-1 = 3

x2 = 7.815

Statistics: Friedman Fr Test for Randomized Block Designs

Computation:

Antibiotics
Child 1 T1 2 T2 3 T3 4 T4
1 5.8 2 2.5 1 6.7 4 6.2 3
2 9 2.5 9 2.5 6.6 1 9.5 4
3 5 3 2.6 1 3.5 2 6.6 4
4 7.9 2 9.4 4 5.3 1 8.4 3
5 3.9 3 7.5 4 2.5 1.5 2.5 1.5
Rank Sum 12.5 12.5 9.5 15.5

F_r = \frac{12}{bk(k + 1)} \sum T_i^2 - 3b(k + 1)

F_r = \frac{12}{(5)(4)(4 + 1)} \left[ 12.5^2 + 12.5^2 + 9.5^2 + 15.5^2 \right] - 3(5)(4 + 1) = 2.16

Decision Rule: If the computed value of Fr is greater than the tabular value of chi square,
reject the Ho.

Conclusion: Since the computed value of Fr = 2.16 is less than the chi square tabular
value of 7.815, we fail to reject Ho: there is no significant difference in the children's
reactions to the four antibiotics.
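The test is also available in SciPy; the sketch below passes one list per treatment, ordered by block (child). SciPy corrects for tied ranks, so its statistic is slightly larger than the uncorrected hand value.

```python
# A minimal sketch of the Friedman test, assuming SciPy is installed.
from scipy.stats import friedmanchisquare

# One list per treatment (antibiotic), ordered by block (child 1..5).
t1 = [5.8, 9.0, 5.0, 7.9, 3.9]
t2 = [2.5, 9.0, 2.6, 9.4, 7.5]
t3 = [6.7, 6.6, 3.5, 5.3, 2.5]
t4 = [6.2, 9.5, 6.6, 8.4, 2.5]

stat, p = friedmanchisquare(t1, t2, t3, t4)

# SciPy applies a tie correction, so stat is about 2.25 rather than the
# uncorrected 2.16 computed by hand; either way, p > 0.05 -> fail to reject Ho.
print(stat, p)
```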

MCNEMAR’S TEST FOR CORRELATED PROPORTIONS

McNemar's test assesses the significance of the difference between two correlated
proportions, such as might be found when the two proportions are based on the same
sample of subjects or on matched-pair samples. It is a chi square test for situations where
the samples are matched, that is, not independent. It suits a before-and-after design,
where the aim is to find out whether there is a significant change between the before and
after situations.

The formula is:

x^2 = \frac{(b - c)^2}{b + c}

Where:

x2 = chi square test

b = is the first cell of the 2nd column in a 2x2 table

c = is the first cell of the 2nd row in a 2x2 table

Example:

Consider the data below on seat belt use before and after involvement in auto accidents
for a sample of 100 accident victims

                               Seat belt worn regularly
                               after the accident
Seat belt worn regularly       Yes       No        Total
before the accident
Yes                            a = 60    b = 6     66
No                             c = 19    d = 15    34
Total                          79        21        100

Problem: Is there a significant difference in the use of seat belt before and after
involvement in an automobile accident?

Hypotheses:

Ho: There is no significant difference in the use of seat belt before and after involvement
in an automobile accident.

Ha: There is a significant difference in the use of seat belt before and after involvement
in an automobile accident.

Level of Significance:

α=0.05

d.f. = (c-1)(r-1) = (2-1)(2-1) = 1

x2 = 3.841

Statistics: McNemar’s test for correlated proportions

Computation:

x^2 = \frac{(b - c)^2}{b + c}

x^2 = \frac{(6 - 19)^2}{6 + 19} = 6.76

Decision rule: If the computed value is greater than the tabular value, reject the Ho.

Conclusion:

Since the computed value of 6.76 is greater than the tabular value of 3.841 at 0.05 level
of significance with 1 degree of freedom, we reject the Ho: there is a significant difference
in the use of seat belts before and after involvement in an automobile accident.
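Because only the discordant cells b and c enter the formula, the computation is tiny in code. The sketch below assumes SciPy for the p-value; the statsmodels package also provides a ready-made mcnemar function if preferred.

```python
# A minimal sketch of McNemar's test for the seat belt example, assuming SciPy.
from scipy.stats import chi2

# Discordant pairs: b = yes-before/no-after, c = no-before/yes-after.
b, c = 6, 19

x2 = (b - c) ** 2 / (b + c)
print(round(x2, 2))        # 6.76
print(chi2.sf(x2, df=1))   # p < 0.05 -> reject Ho
```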

KENDALL’S COEFFICIENT OF CONCORDANCE W

It is used to find out if there is an agreement or concordance among raters or judges of N
objects or individuals. The interpretation of the value of W: W = 1 indicates complete
agreement, and W = 0 indicates no agreement.

The formula is:

W = \frac{12 \sum D^2}{m^2 N(N^2 - 1)}

Where:

W= the coefficient of concordance

D= the difference between the individual sum of ranks of the raters or judges and the
average of the sum of ranks of the object or individuals

∑ 𝐷2 = The sum of squares of the difference

m= number of judges or raters

N= number of objects or individuals being rated or ranked

Significance test:

In the case of complete ranks, a commonly used significance test for W against a null
hypothesis of no agreement is

x^2 = m(n - 1)W

where the test statistic follows a chi-squared distribution with df = n - 1 degrees of
freedom.

Example:

The data on the ranking of 10 projects by 4 judges.

Individual   Judges' Ranks
Projects     A    B    C    D
1            1    2    3    4
2            3    1    2    2
3            4    4    1    3
4            5    5    5    1
5            2    6    7    6
6            8    3    4    7
7            6    8    6    5
8            7    7    8    9
9            10   10   9    8
10           9    9    10   10

Problem: Is there an agreement or concordance among the 4 judges regarding the 10
projects?

Hypotheses:

Ho: There is no agreement or concordance among the 4 judges regarding the 10 projects.

Ha: There is an agreement or concordance among the 4 judges regarding the 10 projects.

Level of Significance:

α = 0.05

d.f. = n-1=10-1=9

x2 = 16.92

Statistics: W Coefficient of Concordance

Computations:

Individual   Judges' Ranks          Sum of    D = R̄ minus    D2
Projects     A    B    C    D       Ranks     Sum of Ranks
1            1    2    3    4       10        12              144
2            3    1    2    2       8         14              196
3            4    4    1    3       12        10              100
4            5    5    5    1       16        6               36
5            2    6    7    6       21        1               1
6            8    3    4    7       22        0               0
7            6    8    6    5       25        -3              9
8            7    7    8    9       31        -9              81
9            10   10   9    8       37        -15             225
10           9    9    10   10      38        -16             256
Total                               220                       1048
R̄ = 220/10 = 22

W = \frac{12 \sum D^2}{m^2 N(N^2 - 1)}

W = \frac{12(1048)}{4^2(10)(10^2 - 1)} = 0.79

x^2 = m(n - 1)W = 4(10 - 1)(0.79) = 28.44

Decision rule:

If the computed value is greater than the tabular value, reject the Ho.

Conclusion:

Since the computed value of x2 = 28.44 is greater than the tabular value of x2 = 16.92 at
0.05 level of significance with d.f. = 9, we reject the Ho. There is agreement or
concordance in the ranking of the judges regarding the 10 projects.
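The computation is easy to vectorize; the sketch below (assuming NumPy) computes W and the chi-square statistic from the rank matrix. Carrying the unrounded W forward gives x2 of about 28.6; the 28.44 above comes from using the rounded W = 0.79.

```python
# A minimal sketch of Kendall's coefficient of concordance W, assuming NumPy.
import numpy as np

# ranks[i, j] = rank given by judge j to project i (m = 4 judges, N = 10 projects)
ranks = np.array([
    [1, 2, 3, 4], [3, 1, 2, 2], [4, 4, 1, 3], [5, 5, 5, 1], [2, 6, 7, 6],
    [8, 3, 4, 7], [6, 8, 6, 5], [7, 7, 8, 9], [10, 10, 9, 8], [9, 9, 10, 10],
])

m, N = ranks.shape[1], ranks.shape[0]
row_sums = ranks.sum(axis=1)        # sum of ranks per project
D = row_sums.mean() - row_sums      # deviations from the mean rank sum (mean = 22)

W = 12 * (D ** 2).sum() / (m ** 2 * N * (N ** 2 - 1))
x2 = m * (N - 1) * W
print(round(W, 2))                  # 0.79
print(round(x2, 2))                 # ~28.58 with the unrounded W
```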

Activity:

1. The following table shows the Myers-Briggs personality preference and professions for
a random sample of 2408 people in the listed professions (Atlas of Type Tables, by
Macdaid, McCaulley, and Kainz). E refers to extroverted, and I refers to introverted.

   Occupation                    E       I       Row Total
   Clergy (all denominations)    308     226     534
   M.D.                          667     936     1603
   Lawyer                        112     159     271
   Column total                  1087    1321    2408

2. The following data are the results of 4 groups of animals given four different
treatments.

T1 T2 T3 T4
5 8 17 10
6 15 18 12
16 12 15 20
15 15 12 20
15 10 13 25
Test at 0.05 level of significance to find out if there is a significant difference in
the 4 groups of animals given four different treatments assuming data is not
normal.

3. From an English class of 18 students using programmed materials, 10 are selected at
random and given additional instruction by the teacher. The results on the final
examination were as follows:
Grades in the Final Examination
With Additional No Additional
Instruction Instruction
87 75
91 78
85 81
80 85
86 72
85 80
81 81
85 85
73
79
Test at 0.01 level of significance to find out if the additional instruction affects the
average grade assuming that the data is not normal.

4. Six subjects were exposed to 4 treatments and the following data were recorded.
Treatments
Subjects T1 T2 T3 T4
1 9 5 6 2
2 10 10 3 5
3 8 7 9 10
4 5 6 3 4
5 10 9 8 7
6 5 6 8 9
Test at 0.05 level of significance to test the null hypothesis that there is no
significant difference among the six subjects on four different treatments.

5. Fifteen women were enrolled in a slimming program for a period of 8 weeks. Their
weights were taken before and after the program, and are shown below.
Before After Before After
130 125 141 135
120 120 130 130
125 126 125 125
140 115 114 110
140 148 115 105
135 120 120 110
115 100 135 120
148 128
Test at 0.05 level of significance if there was a significant difference before and
after the program. Assuming that data is not normal.

6. Three groups of six subjects each, taught using three different methods of teaching
English, had the following scores.
Method 1 Method 2 Method 3
37 38 40
38 40 70
40 45 68
30 50 70
35 49 38
36 30 49
Test at 0.05 level of significance. Assume data is not normal.

7. Data on charter change before and after a televised debate for a sample of 50
registered voters are found below.
Before the debate After the Debate Total
Yes No
Yes 19 11 30
No 8 12 20
27 23 50
Test at 0.05 level of significance to find out if televised debate will change the
opinions of 50 registered voters.

8. Three judges rank-order a group of 5 students in an examination as follows:

   Judges   Students
            A    B    C    D    E
   A        1    3    2    5    4
   B        2    1    3    4    5
   C        1    2    4    3    5
Test at 0.05 level of significance

Reliability and Validity Tests
______________________________________________
MODULE 7

Reliability and Validity Tests

By the end of this session you should be able to:

1. Define reliability, including the different types and how they are assessed.
2. Define validity, including the different types and how they are assessed.
3. Describe the kinds of evidence that would be relevant to assessing the reliability
and validity of a particular measure.

Lecture:

INTRODUCTION

Reliability and validity are concepts used to evaluate the quality of research. They indicate
how well a method, technique or test measures something. Reliability is about the
consistency of a measure, and validity is about the accuracy of a measure.

It’s important to consider reliability and validity when you are creating your research
design, planning your methods, and writing up your results, especially in quantitative
research.
RELIABILITY VS VALIDITY

What does it tell you?
• Reliability: the extent to which the results can be reproduced when the research is
repeated under the same conditions.
• Validity: the extent to which the results really measure what they are supposed to
measure.

How is it assessed?
• Reliability: by checking the consistency of results across time, across different
observers, and across parts of the test itself.
• Validity: by checking how well the results correspond to established theories and other
measures of the same concept.

How do they relate?
• Reliability: a reliable measurement is not always valid: the results might be
reproducible, but they're not necessarily correct.
• Validity: a valid measurement is generally reliable: if a test produces accurate results,
they should be reproducible.

UNDERSTANDING RELIABILITY VS VALIDITY

Reliability and validity are closely related, but they mean different things. A measurement
can be reliable without being valid. However, if a measurement is valid, it is usually also
reliable.

WHAT IS RELIABILITY?

Reliability refers to how consistently a method measures something. If the same result
can be consistently achieved by using the same methods under the same circumstances,
the measurement is considered reliable.

You measure the temperature of a liquid sample several times under identical conditions.
The thermometer displays the same temperature every time, so the results are reliable.
A doctor uses a symptom questionnaire to diagnose a patient with a long-term medical
condition. Several different doctors use the same questionnaire with the same patient but
give different diagnoses. This indicates that the questionnaire has low reliability as a
measure of the condition.

WHAT IS VALIDITY?

Validity refers to how accurately a method measures what it is intended to measure. If
research has high validity, that means it produces results that correspond to real
properties, characteristics, and variations in the physical or social world.

High reliability is one indicator that a measurement is valid. If a method is not reliable, it
probably isn’t valid.
If the thermometer shows different temperatures each time, even though you have
carefully controlled conditions to ensure the sample’s temperature stays the same, the
thermometer is probably malfunctioning, and therefore its measurements are not valid.

If a symptom questionnaire results in a reliable diagnosis when answered at different
times and with different doctors, this indicates that it has high validity as a measurement
of the medical condition.

However, reliability on its own is not enough to ensure validity. Even if a test is reliable, it
may not accurately reflect the real situation.
The thermometer that you used to test the sample gives reliable results. However, the
thermometer has not been calibrated properly, so the result is 2 degrees lower than the
true value. Therefore, the measurement is not valid.
A group of participants take a test designed to measure working memory. The results are
reliable, but participants’ scores correlate strongly with their level of reading
comprehension. This indicates that the method might have low validity: the test may be
measuring participants’ reading comprehension instead of their working memory.

Validity is harder to assess than reliability, but it is even more important. To obtain useful
results, the methods you use to collect your data must be valid: the research must be
measuring what it claims to measure. This ensures that your discussion of the data and
the conclusions you draw are also valid.

HOW ARE RELIABILITY AND VALIDITY ASSESSED?

Reliability can be estimated by comparing different versions of the same measurement.
Validity is harder to assess, but it can be estimated by comparing the results to other
relevant data or theory. Methods of estimating reliability and validity are usually split up
into different types.

TYPES OF RELIABILITY

Different types of reliability can be estimated through various statistical methods.

Test-retest
What does it assess? The consistency of a measure across time: do you get the same
results when you repeat the measurement?
Example: A group of participants complete a questionnaire designed to measure
personality traits. If they repeat the questionnaire days, weeks or months apart and give
the same answers, this indicates high test-retest reliability.

Interrater
What does it assess? The consistency of a measure across raters or observers: do you
get the same results when different people conduct the same measurement?
Example: Based on an assessment criteria checklist, five examiners submit substantially
different results for the same student project. This indicates that the assessment checklist
has low inter-rater reliability (for example, because the criteria are too subjective).

Internal consistency
What does it assess? The consistency of the measurement itself: do you get the same
results from different parts of a test that are designed to measure the same thing?
Example: You design a questionnaire to measure self-esteem. If you randomly split the
results into two halves, there should be a strong correlation between the two sets of
results. If the two results are very different, this indicates low internal consistency.

Parallel forms
What does it assess? The correlation between two equivalent versions of a test. You use
it when you have two different assessment tools or sets of questions designed to measure
the same thing.
Example: A set of questions is formulated to measure financial risk aversion in a group of
respondents. The questions are randomly divided into two sets, and the respondents are
randomly divided into two groups. Both groups take both tests: group A takes test A first,
and group B takes test B first. The results of the two tests are compared, and the results
are almost identical, indicating high parallel forms reliability.

Split-half
What does it assess? The split-half method assesses the internal consistency of a test,
such as psychometric tests and questionnaires. It measures the extent to which all parts
of the test contribute equally to what is being measured.
Example: The Minnesota Multiphasic Personality Inventory has subscales measuring
different behaviors such as depression, schizophrenia and social introversion. The
split-half method would therefore not be an appropriate method to assess the reliability of
this personality test.

TYPES OF VALIDITY

The validity of a measurement can be estimated based on three main types of evidence.
Each type can be evaluated through expert judgement or statistical methods.

Construct
What does it assess? The adherence of a measure to existing theory and knowledge of
the concept being measured.
Example: A self-esteem questionnaire could be assessed by measuring other traits known
or assumed to be related to the concept of self-esteem (such as social skills and optimism).
Strong correlation between the scores for self-esteem and associated traits would indicate
high construct validity.

Content
What does it assess? The extent to which the measurement covers all aspects of the
concept being measured.
Example: A test that aims to measure a class of students' level of Spanish contains
reading, writing and speaking components, but no listening component. Experts agree that
listening comprehension is an essential aspect of language ability, so the test lacks content
validity for measuring the overall level of ability in Spanish.

Criterion
What does it assess? The extent to which the result of a measure corresponds to other
valid measures of the same concept.
Example: A survey is conducted to measure the political opinions of voters in a region. If
the results accurately predict the later outcome of an election in that region, this indicates
that the survey has high criterion validity.

To assess the validity of a cause-and-effect relationship, you also need to consider internal
validity (the design of the experiment) and external validity (the generalizability of the
results).

HOW TO ENSURE VALIDITY AND RELIABILITY IN YOUR RESEARCH

The reliability and validity of your results depends on creating a strong research design,
choosing appropriate methods and samples, and conducting the research carefully and
consistently.

ENSURING VALIDITY

If you use scores or ratings to measure variations in something (such as psychological
traits, levels of ability or physical properties), it's important that your results reflect the real
variations as accurately as possible. Validity should be considered in the very earliest
stages of your research, when you decide how you will collect your data.

 Choose appropriate methods of measurement


Ensure that your method and measurement technique are high quality and targeted to
measure exactly what you want to know. They should be thoroughly researched and
based on existing knowledge.

For example, to collect data on a personality trait, you could use a standardized
questionnaire that is considered reliable and valid. If you develop your own questionnaire,
it should be based on established theory or findings of previous studies, and the questions
should be carefully and precisely worded.

 Use appropriate sampling methods to select your subjects

To produce valid generalizable results, clearly define the population you are researching
(e.g. people from a specific age range, geographical location, or profession). Ensure that
you have enough participants and that they are representative of the population.

ENSURING RELIABILITY

Reliability should be considered throughout the data collection process. When you use a
tool or technique to collect data, it’s important that the results are precise, stable and
reproducible.
 Apply your methods consistently

Plan your method carefully to make sure you carry out the same steps in the same way
for each measurement. This is especially important if multiple researchers are involved.

For example, if you are conducting interviews or observations, clearly define how specific
behaviors or responses will be counted, and make sure questions are phrased the same
way each time.
 Standardize the conditions of your research

When you collect your data, keep the circumstances as consistent as possible to reduce
the influence of external factors that might create variation in the results.
For example, in an experimental setup, make sure all participants are given the same
information and tested under the same conditions.
WHERE TO WRITE ABOUT RELIABILITY AND VALIDITY IN A THESIS

It's appropriate to discuss reliability and validity in various sections of your thesis or
dissertation. Showing that you have taken them into account in planning your research
and interpreting the results makes your work more credible and trustworthy.

Reliability and validity in a thesis:

Section: Literature review
Discuss: What have other researchers done to devise and improve methods that are
reliable and valid?

Section: Methodology
Discuss: How did you plan your research to ensure the reliability and validity of the
measures used? This includes the chosen sample set and size, sample preparation,
external conditions and measuring techniques.

Section: Results
Discuss: If you calculate reliability and validity, state these values alongside your main
results.

Section: Discussion
Discuss: This is the moment to talk about how reliable and valid your results actually were.
Were they consistent, and did they reflect true values? If not, why not?

Section: Conclusion
Discuss: If reliability and validity were a big problem for your findings, it might be helpful
to mention this here.

EXAMPLE OF VALIDATION OF QUESTIONNAIRE

For example, a researcher wishes to validate a questionnaire in Science. He requests
experts in Science to judge if the items measure the knowledge, skills, and values
supposed to be measured. Another way of testing content validity is for experts to check
if the test items or questions represent the knowledge, skills, and values suggested in the
Science course content.

Below is an example of an open-ended questionnaire on the teaching strategies used in
teaching Science. For content validity, the experts are directed to check (√) the items to
be retained as: (3) retain; (2) needs improvement; and (1) delete.
Directions: Below are teaching strategies used in the teaching of Science. Indicate the
extent to which each strategy is used in teaching Science in your school by encircling one
of the options on the right column. The options 4, 3, 2, and 1 represent the extent of use,
thus:

4 – very often    3 – often    2 – sometimes    1 – never

Item                                Extent of Use   Retain   Needs         Delete
                                                    (3)      Improvement   (1)
                                                             (2)
1   Curriculum enrichment           4, 3, 2, 1
2   Theory and practice scheme      4, 3, 2, 1
3   Brainstorming method            4, 3, 2, 1
4   Experimental method             4, 3, 2, 1
5   Supervised or directed study    4, 3, 2, 1
6   Laboratory approach             4, 3, 2, 1
7   Discovery approach              4, 3, 2, 1
8   Guided discovery approach       4, 3, 2, 1
9   Unstructured approach           4, 3, 2, 1
10  Project method                  4, 3, 2, 1
11  Others (please specify)         4, 3, 2, 1

The researcher requires a selected group of experts to validate the content of the
questionnaire on the basis of the foregoing criteria. If the weighted mean is 2.5 and above,
the item is retained; 1.5 to 2.4, it needs improvement; and below 1.5, it is deleted.

For instance, there are five experts to validate the above questionnaire. In Item 1,
“Curriculum enrichment,” four experts rated it 3 and one rated it 2. The weighted mean is
used. The computation is as follows:

x    f    xf
3    4    12
2    1    2
     5    14

\bar{x} = \frac{\sum fx}{\sum f} = \frac{14}{5} = 2.8 \approx 3 (retain)

Thus, Item 1 “Curriculum enrichment” should be retained.

EXAMPLE OF METHODS IN TESTING THE RELIABILITY OF A GOOD RESEARCH INSTRUMENT

For the Test-retest method, the same instrument is administered twice to the same group
of subjects and correlation coefficient is determined.
Spearman rho Computation of the First and Second administration of Achievement Test
in Computer

Respondents 1 2 R1 R2 D D2
1 75 75 6 6 0 0
2 53 55 13 12.5 0.5 0.25
3 47 48 15 15 0 0
4 83 80 2 3.5 -1.5 2.25
5 70 75 8.5 6 2.5 6.25
6 69 70 10 9.5 0.5 0.25
7 70 70 8.5 9.5 -1 1
8 55 55 12 12.5 -0.5 0.25

9 77 75 5 6 -1 1
10 85 85 1 1 0 0
11 79 80 4 3.5 0.5 0.25
12 57 55 11 12.5 -1.5 2.25
13 81 82 3 2 1 1
14 50 55 14 12.5 1.5 2.25
15 71 71 7 8 -1 1
∑D2 18

r_s = 1 - \frac{6 \sum D^2}{N^3 - N} = 1 - \frac{6(18)}{15^3 - 15} = 0.97 (very high reliability)
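The same test-retest coefficient can be checked in Python (assuming SciPy); because there are tied scores, SciPy's tie handling can differ in the later decimals from the shortcut formula, but it rounds to the same 0.97.

```python
# A minimal sketch of test-retest reliability via Spearman rho, assuming SciPy.
from scipy.stats import spearmanr

first  = [75, 53, 47, 83, 70, 69, 70, 55, 77, 85, 79, 57, 81, 50, 71]
second = [75, 55, 48, 80, 75, 70, 70, 55, 75, 85, 80, 55, 82, 55, 71]

rho, p = spearmanr(first, second)
print(round(rho, 2))   # ~0.97 -> very high test-retest reliability
```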

For the Split-half method. The test in this method may be administered once, but the test
items are divided into halves. The common procedure is to divide a test into odd and even
items. The two halves of the test must be similar but not identical in content, number of
items, difficulty, means, and standard deviations. Each student obtains two scores, one
on the odd and the other on the even items in the same test. The scores obtained in the
two halves are correlated. The result is reliability coefficient for a half test. Since the
reliability holds only for a half test, the reliability coefficient of a whole test is estimated by
using the Spearman Brown formula. This formula is as follows:
r_{wt} = \frac{2 r_{ht}}{1 + r_{ht}}

Where:
rwt = the reliability of the whole test

rht = reliability of a half test.


For example, a test is administered to twelve students as pilot sample to test the reliability
coefficient of the odd and even items. For illustration purposes, consider the computation
in the table below.

Computation of Reliability Coefficient of Odd and Even Items

Scores Ranks Differences


Students Odd Even Ro Re D D2
1 55 66 8 7 1 1
2 71 79 3 1 2 4
3 72 70 2 4 -2 4

4 43 50 10.5 9.5 1 1
5 35 31 12 12 0 0
6 64 72 6 3 3 9
7 57 57 7 8 -1 1
8 70 67 4 6 -2 4
9 69 69 5 5 0 0
10 48 50 9 9.5 -0.5 0.25
11 43 41 10.5 11 -0.5 0.25
12 75 75 1 2 -1 1
∑D2 25.5

r_s = 1 - \frac{6 \sum D^2}{N^3 - N} = 1 - \frac{6(25.5)}{12^3 - 12} = 0.91 (reliability of the half test)

r_{wt} = \frac{2 r_{ht}}{1 + r_{ht}} = \frac{2(0.91)}{1 + 0.91} = 0.95 (very high reliability)
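A short Python sketch (assuming SciPy) reproduces both steps: the half-test correlation and the Spearman-Brown step-up. With tied scores, SciPy's value can differ slightly in the later decimals from the shortcut formula.

```python
# A minimal sketch of split-half reliability with the Spearman-Brown
# correction, assuming SciPy is installed.
from scipy.stats import spearmanr

odd  = [55, 71, 72, 43, 35, 64, 57, 70, 69, 48, 43, 75]
even = [66, 79, 70, 50, 31, 72, 57, 67, 69, 50, 41, 75]

r_half, _ = spearmanr(odd, even)      # reliability of a half test (~0.91)
r_whole = 2 * r_half / (1 + r_half)   # Spearman-Brown step-up (~0.95)
print(round(r_half, 2), round(r_whole, 2))
```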

For the Internal-consistency method. This method is used with psychological tests which
consist of dichotomously scored items. The examinee either passes or fails an item. A
score of 1 (one) is assigned for a pass and 0 (zero) for a failure. The reliability coefficient
in this method is determined by Kuder-Richardson Formula 20. This formula is a measure
of internal consistency or homogeneity of the research instrument. The formula is as
follows:

r_{xx} = \left[ \frac{N}{N - 1} \right] \left[ \frac{SD^2 - \sum p_i q_i}{SD^2} \right]

Where:

N = number of items

SD^2 = variance of scores = \frac{\sum (x - \bar{x})^2}{n - 1}

\sum p_i q_i = sum over all items of the product of the proportion passing (p_i) and the
proportion failing (q_i) item i

The proportion of individuals passing item i is denoted by the symbol pi and the proportion
failing by qi, where qi = 1 - pi. If all items are perfectly correlated, a situation which can
only arise when all have the same difficulty, rxx = 1.0.
The steps in applying Kuder-Richardson Formula 20 are as follows:

Step 1. Compute for the variance (𝑆𝐷2 ) of the test scores for the whole group.

Step 2. Find the proportion passing each item (pi) and the proportion failing each item (qi).
For instance, ten of the twelve students passed (got the correct answer for) Item 2, hence
pi = 10/12 = 0.83. Two students failed Item 2, thus qi = 2/12 = 0.17. It can also be taken
as qi = 1 - pi = 1 - 0.83 = 0.17.

Step 3. Multiply p and q for each item. For instance, 0.83 x 0.17 = 0.1411. This gives the
pi qi value. Get the sum of the pi qi column to get \sum p_i q_i.
Step 4. Substitute the calculated value using formula.
For illustration purposes, consider the example below. Suppose a test of 15 items has
been administered to a group of twelve students. The table below presents the
Kuder-Richardson Formula 20 computation for the twelve students' responses to the
fifteen-item test.
Computation of Kuder-Richardson Formula 20 of the Twelve Students Responses for a
Fifteen-Item Test

Students
Item 1 2 3 4 5 6 7 8 9 10 11 12 f pi qi piqi
1 1 1 1 1 1 1 1 1 1 1 1 0 11 0.92 0.08 0.08
2 1 1 1 1 1 1 1 1 1 1 0 0 10 0.83 0.17 0.14
3 1 1 1 1 1 1 1 1 0 0 0 0 8 0.67 0.33 0.22
4 1 1 1 1 1 1 1 1 0 0 0 0 8 0.67 0.33 0.22
5 1 1 1 1 1 1 1 1 0 0 0 0 8 0.67 0.33 0.22
6 1 1 1 1 1 1 1 0 0 0 0 0 7 0.58 0.42 0.24
7 1 1 1 1 1 1 1 0 0 0 0 0 7 0.58 0.42 0.24
8 1 1 1 1 1 1 1 0 0 0 0 0 7 0.58 0.42 0.24
9 1 1 1 1 1 1 1 0 0 0 0 0 7 0.58 0.42 0.24
10 1 1 1 1 1 0 0 0 0 0 0 0 5 0.42 0.58 0.24
11 1 1 1 1 1 0 0 0 0 0 0 0 5 0.42 0.58 0.24
12 1 1 1 1 0 0 0 0 0 0 0 0 4 0.33 0.67 0.22
13 1 1 1 0 0 0 0 0 0 0 0 0 3 0.25 0.75 0.19
14 1 1 1 0 0 0 0 0 0 0 0 0 3 0.25 0.75 0.19
15 1 1 0 0 0 0 0 0 0 0 0 0 2 0.17 0.83 0.14
Total 15 15 14 12 11 9 9 5 2 2 1 0 ∑piqi 3.08

Variance (SD2) Computation

Student   x     x − x̄    (x − x̄)2
1         15    7.08     50.1264
2         15    7.08     50.1264
3         14    6.08     36.9664
4         12    4.08     16.6464
5         11    3.08     9.4864
6         9     1.08     1.1664
7         9     1.08     1.1664
8         5     -2.92    8.5264
9         2     -5.92    35.0464
10        2     -5.92    35.0464
11        1     -6.92    47.8864
12        0     -7.92    62.7264
Total     95             354.9168

\bar{x} = \frac{\sum x}{n} = \frac{95}{12} = 7.92

SD^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{354.9168}{12 - 1} = 32.27

Kuder-Richardson Formula 20 Computation

r_{xx} = \left[ \frac{N}{N - 1} \right] \left[ \frac{SD^2 - \sum p_i q_i}{SD^2} \right]

r_{xx} = \left[ \frac{15}{15 - 1} \right] \left[ \frac{32.27 - 3.0768}{32.27} \right] = 0.97 (very high reliability)
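KR-20 is straightforward to compute directly from the 0/1 response matrix; the sketch below (assuming NumPy) applies it to the fifteen-item, twelve-student table above and reproduces rxx ≈ 0.97.

```python
# A minimal sketch of Kuder-Richardson Formula 20, assuming NumPy.
import numpy as np

# rows = items (15), columns = students (12); 1 = pass, 0 = fail
scores = np.array([
    [1,1,1,1,1,1,1,1,1,1,1,0],
    [1,1,1,1,1,1,1,1,1,1,0,0],
    [1,1,1,1,1,1,1,1,0,0,0,0],
    [1,1,1,1,1,1,1,1,0,0,0,0],
    [1,1,1,1,1,1,1,1,0,0,0,0],
    [1,1,1,1,1,1,1,0,0,0,0,0],
    [1,1,1,1,1,1,1,0,0,0,0,0],
    [1,1,1,1,1,1,1,0,0,0,0,0],
    [1,1,1,1,1,1,1,0,0,0,0,0],
    [1,1,1,1,1,0,0,0,0,0,0,0],
    [1,1,1,1,1,0,0,0,0,0,0,0],
    [1,1,1,1,0,0,0,0,0,0,0,0],
    [1,1,1,0,0,0,0,0,0,0,0,0],
    [1,1,1,0,0,0,0,0,0,0,0,0],
    [1,1,0,0,0,0,0,0,0,0,0,0],
])

N = scores.shape[0]                   # number of items
p = scores.mean(axis=1)               # proportion passing each item
q = 1 - p                             # proportion failing each item
sd2 = scores.sum(axis=0).var(ddof=1)  # variance of students' total scores
r_xx = (N / (N - 1)) * (sd2 - (p * q).sum()) / sd2
print(round(r_xx, 2))                 # 0.97 -> very high reliability
```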

Activity:

Essay:

1. Is a valid research instrument always reliable? Why?
2. Is a reliable research instrument always valid? Why?

3. How do you administer the test-retest method in testing the reliability of a research
instrument?
4. Which of the qualities of the research instrument is most practical? Why?
5. How is the parallel-forms method of estimating the reliability of a research instrument
constructed?

Problem Solving:

1. Using the data below, find out if the research instrument is reliable applying the
split half method administered to 15 students.
Even-Odd Even-Odd Even-Odd Even-Odd Even-Odd
51-52 21-23 51-50 47-49 40-42
28-30 20-19 23-25 40-41 32-30
55-52 38-37 22-23 29-25 57-52

2. Find out if there is internal consistency in the responses of 12 students as a pilot
sample in a 7-item test in English.

Students

Items 1 2 3 4 5 6 7 8 9 10 11 12

1 1 1 1 1 1 1 1 1 1 1 1 1

2 1 1 1 1 1 1 1 1 0 1 1 1

3 1 1 1 1 1 1 1 1 0 1 1 1

4 0 1 1 1 1 1 1 1 0 1 1 1

5 0 0 1 0 1 1 1 1 1 1 1 0

6 0 0 0 0 1 1 1 1 0 0 1 0

7 0 0 0 0 1 1 1 0 0 0 0 0

Total 3 4 5 4 7 7 7 6 2 5 6 4

Using Excel for Statistical Data Analysis
______________________________________________
MODULE 8

Using Excel for Statistical Data Analysis

By the end of this session you should be able to use Microsoft Excel for data summary,
presentation and statistical analysis.

Lecture:

INTRODUCTION

Microsoft Excel is a spreadsheet program. That means it's used to create grids of text,
numbers and formulas specifying calculations. That's extremely valuable for many
businesses, which use it to record expenditures and income, plan budgets, chart data
and succinctly present fiscal results.

It can be programmed to pull in data from external sources such as stock market feeds,
automatically running the data through formulas such as financial models to update such
information in real time. Like Microsoft Word, Excel has become a de facto standard in
the business world, with Excel spreadsheets frequently emailed and otherwise shared to
exchange data and perform various calculations.

Excel also contains fairly powerful programming capabilities for those who wish to use
them that can be used to develop relatively sophisticated financial and scientific
computation capabilities.

You can perform statistical analysis with the help of Excel. It is used by many data
scientists who need an understanding of statistical concepts and of the behavior of the
data. But when the data set is huge or you need a specialized data analysis model such
as linear regression, you should go for advanced tools such as Python or R.
Here, we will go through the basic concept of statistical analysis and will apply the
concepts to our own data.

Before starting, you need to check whether the Excel Analysis ToolPak is enabled (it is an
add-in provided by Microsoft Excel). To check whether it is enabled or not, go to Excel →
Data and see whether a Data Analysis option appears at the top right corner. If it is not
there, go to Excel → File → Options → Add-ins and enable the Analysis ToolPak by
selecting the Excel Add-ins option in the Manage tab and then clicking Go. This will open
a small window; select the Analysis ToolPak option and enable it.

These are some of the tests you can perform using Excel Statistical Analysis.

DESCRIPTIVE ANALYSIS

You can find descriptive analysis by going to Excel→ Data→ Data Analysis → Descriptive
statistics. It is the most basic set of analysis that can be performed on any data set. It
gives you the general behavior and pattern of the data. It is helpful when you have a set
of data and want a summary of that dataset. This will show the following statistics for the
chosen dataset.

 Mean, Standard Error and Median
 Mode and Standard Deviation
 Sample Variance
 Kurtosis and Skewness
 Range, Minimum, Maximum, Sum and Count

Example:

For example, you may have the scores of 14 participants for a test.

To generate descriptive statistics for these scores, execute the following steps.

1. On the Data tab, in the Analysis group, click Data Analysis.

2. Select Descriptive Statistics and click OK.

3. Select the range A2:A15 as the Input Range.

4. Select cell C1 as the Output Range.

5. Make sure Summary statistics is checked.

6. Click OK.

Result:

ANOVA (ANALYSIS OF VARIANCE)

It is a data analysis method which shows whether the means of two or more data sets are
significantly different from each other or not. In other words, it analyses two or more
groups simultaneously and finds out whether any relationship exists among the groups of
data or not. For example, you can use ANOVA if you want to analyse the traffic of three
different cities and find out which one is more efficient in handling the traffic (or whether
there are no significant differences among the traffic).

You will find three types of ANOVA in the Excel

 ANOVA single factor


 ANOVA two factor with replication
 ANOVA two factor without replication

If you have three groups of datasets and want to check whether there is any significant
difference between these groups or not, you can use ANOVA single factor.

If the P-value in the ANOVA summary table is less than 0.05, you can say that there is a
significant difference between the groups.

EXAMPLE:

This example teaches you how to perform a single factor ANOVA (analysis of variance)
in Excel. A single factor or one-way ANOVA is used to test the null hypothesis that the
means of several populations are all equal.

Below you can find the salaries of people who have a degree in economics, medicine or
history.

H0: μ1 = μ2 = μ3
H1: at least one of the means is different.

To perform a single factor ANOVA, execute the following steps.

1. On the Data tab, in the Analysis group, click Data Analysis.

2. Select Anova: Single Factor and click OK.

3. Click in the Input Range box and select the range A2:C10.

4. Click in the Output Range box and select cell E1.

5. Click OK.
Result:

Conclusion: if F > F crit, we reject the null hypothesis. This is the case, 15.196 > 3.443.
Therefore, we reject the null hypothesis. The means of the three populations are not all
equal. At least one of the means is different. However, the ANOVA does not tell you
where the difference lies. You need a t-Test to test each pair of means.

MOVING AVERAGE

Moving average is usually applicable for time series data such as stock price, weather
report, attendance in class etc. For example, it is heavily used in stock price as a technical
indicator. If you want to predict the stock price of today, the last ten days data would be
more relevant than the last 1 year. So, you can plot the moving average of the stock
having a 10-day time period and you can then predict the price to some extent. The same
applies to the temperature of a city. The recent temperature of a city can be calculated
by taking the average of last few weeks rather than previous months.

Example

This example teaches you how to calculate the moving average of a time series in Excel.
A moving average is used to smooth out irregularities (peaks and valleys) to easily
recognize trends.

1. First, let's take a look at our time series.

2. On the Data tab, in the Analysis group, click Data Analysis

3. Select Moving Average and click OK.

4. Click in the Input Range box and select the range B2:M2.

5. Click in the Interval box and type 6.

6. Click in the Output Range box and select cell B3.

7. Click OK.

8. Plot a graph of these values.

Explanation: because we set the interval to 6, the moving average is the average of the
previous 5 data points and the current data point. As a result, peaks and valleys are
smoothed out. The graph shows an increasing trend. Excel cannot calculate the moving
average for the first 5 data points because there are not enough previous data points.

9. Repeat steps 2 to 8 for interval = 2 and interval = 4.

Conclusion: The larger the interval, the more the peaks and valleys are smoothed out.
The smaller the interval, the closer the moving averages are to the actual data points.
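Outside Excel, the same trailing moving average can be reproduced with pandas. The sketch below is illustrative only; the values are made up, since the worksheet data appears only in the screenshots.

```python
# A minimal sketch of a trailing 6-point moving average, assuming pandas.
# The values are hypothetical; the worksheet data is shown only in screenshots.
import pandas as pd

series = pd.Series([24, 27, 25, 30, 33, 31, 35, 38, 36, 41, 43, 45])

# window=6 averages the current point and the previous 5, as in the text;
# the first 5 entries are NaN because there are not enough prior points.
print(series.rolling(window=6).mean())
```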

RANK AND PERCENTILE

It calculates the ranking and percentile in the data set. For example, if you are managing
a business of several products and want to find out which product is contributing to a
higher revenue, you can use this rank method in Excel.

In the left table, we have our data on the revenues of different products. And we want to
rank this data of products based on their revenue.

With the help of rank and percentile, we can get the table shown on the right. You can
observe that now the data is sorted and respective rank is also marked with each data.

Percentile shows the category in which the data belongs, such as top 50%, top 30% etc.
In the summary table, the rank of product 7 is 4. As the total number of data is 7, we can
easily say that it belongs to the top 50% of the data.

REGRESSION

Regression is a process of establishing a relationship among many variables. Usually, we
establish a relationship between a dependent variable and one or more independent
variables. For example, you may want to see whether an increase in the revenue of a
product is due to an increase in advertising spending or not.

This is the window you will see when you click the Regression option in Data Analysis. Here, you provide the dependent variable in the Input Y Range box and the independent variable in the Input X Range box. In our example, product revenue is the dependent variable and advertising spend is the independent variable (if your data has labels, check the Labels checkbox). Finally, provide the range of cells where you want the output to appear.

EXAMPLE:

This example teaches you how to run a linear regression analysis in Excel and how to
interpret the Summary Output.

Below you can find our data. The big question is: is there a relation between Quantity Sold (output) and Price and Advertising (inputs)? In other words: can we predict Quantity Sold if we know Price and Advertising?

1. On the Data tab, in the Analysis group, click Data Analysis.

2. Select Regression and click OK.

3. Select the Y Range (A1:A8). This is the response variable (also called the dependent variable).

4. Select the X Range (B1:C8). These are the explanatory variables (also called
independent variables). These columns must be adjacent to each other.

5. Check Labels.

6. Click in the Output Range box and select cell A11.

7. Check Residuals.

8. Click OK.

Excel produces the following Summary Output (rounded to 3 decimal places).

R SQUARE

R Square equals 0.962, which is a very good fit: 96% of the variation in Quantity Sold is explained by the independent variables Price and Advertising. The closer R Square is to 1, the better the regression line fits the data.

SIGNIFICANCE F AND P-VALUES

To check whether your results are reliable (statistically significant), look at Significance F (0.001). If this value is less than 0.05, you're OK. If Significance F is greater than 0.05, it is probably better to stop using this set of independent variables: delete the variable with the highest P-value (greater than 0.05) and rerun the regression until Significance F drops below 0.05.

Most or all P-values should be below 0.05. In our example this is the case (0.000, 0.001 and 0.005).

COEFFICIENTS

The regression line is: y = Quantity Sold = 8536.214 - 835.722 * Price + 0.592 * Advertising. In other words, for each unit increase in Price, Quantity Sold decreases by 835.722 units, and for each unit increase in Advertising, Quantity Sold increases by 0.592 units. This is valuable information.

You can also use these coefficients to make a forecast. For example, if Price equals $4 and Advertising equals $3000, you might be able to achieve a Quantity Sold of 8536.214 - 835.722 * 4 + 0.592 * 3000 ≈ 6970.

RESIDUALS

The residuals show you how far the actual data points are from the predicted data points (using the equation). For example, the first data point equals 8500. Using the equation, the predicted data point equals 8536.214 - 835.722 * 2 + 0.592 * 2800 = 8523.009, giving a residual of 8500 - 8523.009 = -23.009. (Excel computes this with the unrounded coefficients, so redoing the arithmetic with the 3-decimal coefficients shown above gives a slightly different value.)

You can also create a scatter plot of these residuals.
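
As a rough sketch, the forecast and the residual arithmetic can be reproduced from the published coefficients alone (the full data set appears only in the screenshots, so the rounded coefficients are taken as given):

    # Reproduce the forecast and residual arithmetic from the fitted line,
    # using the rounded coefficients from the Summary Output.
    intercept, b_price, b_adv = 8536.214, -835.722, 0.592

    def predict(price, advertising):
        # Quantity Sold predicted by the fitted regression line.
        return intercept + b_price * price + b_adv * advertising

    print(predict(4, 3000))          # forecast: about 6970
    print(8500 - predict(2, 2800))   # first residual: about -23
    # Excel reports -23.009 because it uses the unrounded coefficients.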

CORRELATION

The correlation coefficient (a value between -1 and +1) tells you how strongly two variables are related to each other. We can use the CORREL function or the Analysis ToolPak add-in in Excel to find the correlation coefficient between two variables.

A correlation coefficient of +1 indicates a perfect positive correlation: as variable X increases, variable Y increases, and as variable X decreases, variable Y decreases.

A correlation coefficient of -1 indicates a perfect negative correlation: as variable X increases, variable Z decreases, and as variable X decreases, variable Z increases.

A correlation coefficient near 0 indicates no correlation.

To use the Analysis ToolPak add-in in Excel to quickly generate correlation coefficients between multiple variables, execute the following steps.

1. On the Data tab, in the Analysis group, click Data Analysis.

2. Select Correlation and click OK.

3. For example, select the range A1:C6 as the Input Range.

4. Check Labels in first row.

5. Select cell A8 as the Output Range.

6. Click OK.

Result.

Conclusion: variables A and C are positively correlated (0.91). Variables A and B are not correlated (0.19). Variables B and C are also not correlated (0.11). You can verify these conclusions by looking at the graph.
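
The same pairwise matrix can be produced with NumPy. The three columns below are hypothetical stand-ins for the range A1:C6.

    import numpy as np

    # Hypothetical columns standing in for variables A, B and C.
    a = [1, 2, 3, 4, 5]
    b = [2, 5, 1, 4, 3]
    c = [2, 3, 5, 8, 9]

    # np.corrcoef returns the full matrix of pairwise correlation
    # coefficients, like the output of Excel's Correlation tool.
    print(np.corrcoef([a, b, c]).round(2))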

RANDOM NUMBER GENERATOR

Although you can generate a series of random numbers with a simple worksheet function such as RAND, this option in Data Analysis gives you more flexibility and more control over the generated data.

This is a screenshot of the Random Number Generation window. The Number of Variables field takes the number of columns in which you need random numbers generated, and the Number of Random Numbers field takes the number of rows to be filled. If you set Number of Variables to 10 and Number of Random Numbers to 3, you get the following result.

You can also select the distribution of the generated data, such as Uniform, Normal, Bernoulli, Binomial, or Poisson.
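
A comparable block of random data can be generated with NumPy; the seed, shape and distribution parameters below are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(seed=1)   # fixed seed for reproducibility

    # 3 rows by 10 columns, mirroring Number of Random Numbers = 3 and
    # Number of Variables = 10 in the Excel dialog.
    uniform = rng.uniform(0, 1, size=(3, 10))
    normal = rng.normal(loc=0, scale=1, size=(3, 10))
    poisson = rng.poisson(lam=4, size=(3, 10))
    print(uniform.round(3))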

SAMPLING

This data analysis tool is used for creating samples from a large population. You can select data from the data set at random, or select every nth item from the set. For example, to measure the effectiveness of call-centre employees, you can use this tool to randomly select a few calls every month, listen to the recordings, and give each call a rating.

This is a screenshot of the Sampling option in Data Analysis. In the Input Range box, give the reference of the population data set (check the Labels checkbox if your data has labels). Next, specify the sampling method. If you select the Periodic method, you supply a number as the period, and Excel creates the sample by taking every nth value from the population; for example, if your period is 5, it selects every 5th value. If you choose Random, Excel randomly selects n values from the population. The random method can give a closer picture of the actual population because the data are selected randomly, but duplicate values may appear in the sample. Lastly, specify the output location in the output options.
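
Both sampling methods are easy to sketch in Python. The population below is hypothetical, and the random selection is done with replacement to match the duplicate behaviour noted above.

    import numpy as np

    population = np.arange(1, 101)   # hypothetical population of 100 values

    # Periodic sampling with period 5: every 5th value.
    periodic = population[4::5]

    # Random sampling of n = 10, with replacement (duplicates possible).
    rng = np.random.default_rng(seed=7)
    random_sample = rng.choice(population, size=10, replace=True)

    print(periodic)
    print(random_sample)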

The statistical analysis tools are among the most important features in Excel, and it is always worth using them whenever you receive data, to gain a better understanding of it. As the examples above show, you can get a general sense of a data set with descriptive statistics, calculate a moving average to help predict future values, rank each value and find its percentile, test two groups of data against each other, generate large sets of random data of a specific kind, and find relationships between variables using correlation and regression. A few more options are available that we do not cover here; if you wish, you can read about them on the official Microsoft Excel website.

Activity:

Answer the following problems using Data Analysis in MS Excel:

1. A random sample of companies in electric utilities (I), financial services (II), and
food processing (III) gave the following information regarding annual profits per
employee (units in thousands of dollars).

I     49.1   43.4   32.9   27.8   38.3   36.1   20.2
II    55.6   25.0   41.3   29.9   39.5
III   39.0   37.3   10.8   32.5   15.8   42.6

Should we reject the claim that there is no difference in population mean annual profits per employee among the three types of companies? Use a 1% level of significance.

2. The following data represent the operating time in hours for 4 types of pocket
calculators before a recharge is required.

Fx101 Fx202 Fx303 Fx404


6.4 5.9 7.1 5.3
6.1 5.8 7.1 4.9
6.5 5.9 7.2 6.1
6.2 5.1 7.3 7.1
6.3 5.0 7.4 4.9

Test at the 0.01 level of significance whether the operating times for all four calculators are equal.

3. Twelve students were given remedial instruction in reading comprehension. A pretest and a posttest were also administered, the results of which follow:

Pretest Posttest
25 30
20 25
20 20
8 12
15 20
8 8
15 20
8 8
20 20
21 20
35 38
20 30

Test at the 0.05 level of significance to find out whether there is a significant difference between the pretest and the posttest.

4. The following data are based on information from the book Life in America’s Small Cities (by G.S. Thomas, Prometheus Books). Let x be the percentage of 16- to 19-year-olds not in school and not high school graduates. Let y be the reported violent crimes per 1000 residents. Six small cities in Arkansas (Blytheville, El Dorado, Hot Springs, Jonesboro, Rogers, and Russellville) reported the following information about x and y:

x   24.2   19.0   18.2   14.9   19.0   17.5
y   13.0    4.4    9.3    1.3    0.8    3.6

Is there a significant relationship between x and y? Test at the 0.05 level of significance. If there is a significant relationship between x and y, compute y for x = 26, 27, 28, 29.

5. A teacher wants to find out whether a relationship exists between her students’ GWA and the results of their exam in the Entrance Test for the Graduate School Program. Test at the 0.05 level of significance.

GWA     Entrance Exam Result
2.55 98
2.82 90
2.73 95

2.72 93
2.88 99
2.8 97
2.53 98
2.09 97
2.75 94
1.55 99
2.62 95
2.86 95
2.35 95
2.31 97
2.66 98

6. Suppose we want to predict the job performance of Chevy mechanics based on mechanical aptitude test scores and scores from a personality test that measures conscientiousness.

Job Performance    Mechanical Aptitude    Conscientiousness
       Y                   X1                    X2
1 40 25
2 45 20
1 38 30
3 50 30
2 48 28
3 55 30
3 53 34
4 55 36
4 58 32
3 40 34
5 55 38
3 48 28
3 45 30
2 55 36
4 60 34
5 60 38
5 60 42
5 65 38
4 50 34
3 58 38
6 60 43
3 45 30

Test at the 0.05 level of significance.

7. From an English class of 18 students using programmed materials, 10 are selected at random and given additional instruction by the teacher. The results of the final examination were as follows:

Grades in the Final Examination

With Additional Instruction    No Additional Instruction
87 75
91 78
85 81
80 85
86 72
85 80
81 81
85 85
73
79

Test at the 0.01 level of significance to find out whether the additional instruction affects the average grade, assuming that the data are normally distributed.

COURSE REFERENCES

A. Books
1. Probability and Statistics, Altares, Tayag, Torres, Yap, Ymas 2015
2. Statistics for the Behavioral Sciences, Frederick J. Gravetter, Larry B. Wallnau,
2015
3. Introduction to Probability and Statistics for Engineers and Scientists, 5th ed., Ross, Sheldon M., 2014
4. Applied Statistics and Probability for Engineers, 5th ed., Montgomery, Douglas, 2011
5. Schaum's Outline of Probability, 2nd ed., Lipschutz, 2011
6. Basic Statistics with Calculator and Computer Applications, Narag, E., 2010
7. Probability, Random Variables, Random processes 3/e, Hsu, Hwei P., 2014
8. Business Statistics, Cabero, Salamat, Sta. Maria, 2013
9. Nonparametric Statistics, Broto, 2008
10. Understandable Statistics: Concepts and Methods, 9th ed., Brase and Brase, 2009

B. Journals
Taper, M. L., & Ponciano, J. M. (2016). Evidential statistics as a statistical
modern synthesis to support 21st century science. Population Ecology, 58(1), 9-
29. doi:http://dx.doi.org/10.1007/s10144-015-0533-y

Probability theory: A first course in probability theory and statistics (2017).
Beaverton: Ringgold Inc. Retrieved from
https://search.proquest.com/docview/1867035442?accountid=160143

Liu, K., Luedtke, A., & Tintle, N. (2013). Optimal methods for using posterior
probabilities in association testing. Human Heredity, 75(1), 2-11.
doi:http://dx.doi.org/10.1159/000349974

Statistics with JMP: Graphs, descriptive statistics, and probability (online access
included) (2015). Beaverton: Ringgold Inc. Retrieved from
https://search.proquest.com/docview/1686840904?accountid=160143

Kim, S. (2014). Introduction to probability and statistics for engineers. Choice,
51(7), 1257. Retrieved from
https://search.proquest.com/docview/1512627742?accountid=160143

Probability and statistics. (2014). Mathematics Teaching in the Middle School,
19(9), 571. Retrieved from
https://search.proquest.com/docview/1526982340?accountid=160143

C. Internet Sources
http://onlinestatbook.com/2/introduction/inferential.html
https://statistics.laerd.com/statistical-guides/descriptive-inferential-statistics.php
http://www.statisticshowto.com/probability-and-statistics/hypothesis-testing/

https://onlinecourses.science.psu.edu/statprogram/node/138/
http://stattrek.com/hypothesis-test/hypothesis-testing.aspx
https://www.lotame.com/what-are-the-methods-of-data-collection/
https://www.investopedia.com
https://magoosh.com/excel/using-excel-statistical-analysis/
https://www.excel-easy.com/examples/descriptive-statistics.html
https://www.scribbr.com/methodology/reliability-vs-validity/
