Probability and Statistics v2

LECTURE NOTES IN
( STATISTICAL METHODS )
INTRODUCTION
Decision making is an important aspect of our lives. We make decisions based on the
information we have, our attitudes, and our values. Statistical methods help us examine information.
Moreover, statistics can be used for making decisions when we are faced with uncertainties. For
instance, if we wish to estimate the proportion of people who will have a severe reaction to a
dengue or flu shot without giving a shot to everyone who wants it, statistics provides appropriate
methods. Statistical methods enable us to look at information from a small collection of people or
items and make inferences about a larger collection of people or items.
* Definition of Terms
Statistic - a numerical measure computed from data, such as a mean, median, or standard deviation
Statistics - the study of the systematic collection, presentation, analysis, and interpretation
of numerical data
Collection - the process of obtaining data
Presentation - the organization of data into graphs, charts, or tables
Analysis - the process of extracting relevant information from the organized data
Interpretation - the task of drawing conclusions and making predictions from the analyzed data
Critical thinking is essential in understanding and evaluating information. There are more than
a few situations in statistics in which the lack of critical thinking can lead to conclusions that are
misleading or incorrect.
Statistical literacy is fundamental for applying and interpreting statistical results. Students need
to know correct statistical terminology. The knowledge of correct terminology helps the students
focus on correct analysis and processes.
Calculators and computers are very good at providing the numerical results of statistical
processes. It is up to the user of statistics to interpret the results in the context of the application.
Even when the correct process is used to analyze the data, the user must still ask: what do the results mean?
The general prerequisite of statistical decision making is the gathering of data. We need to
identify the individuals or objects to be included in the study and the characteristics or features of
the individuals that are of interest.
* Individuals are the people or objects included in the study.
* Variable is a characteristic of the individual to be measured or observed.
* Quantitative variable has a value or numerical measurement for which operations such as addition
or averaging make sense.
* Qualitative variable describes an individual by placing the individual into a category or group,
such as male or female. It is a non-numerical observation.
Parameter is a numerical measure that describes an aspect of a population
Statistic is a numerical measure that describes an aspect of a sample
Levels of Measurement
* Nominal level - applies to data that consist of names, labels, or categories. There are no criteria by
which the data can be ordered.
* Ordinal level - applies to data that can be arranged in order. Differences between
data values either cannot be determined or are meaningless.
* Interval level - applies to data that can be arranged in order. In addition, differences between
data values are meaningful.
* Ratio level - applies to data that can be arranged in order. In addition, both differences
between data values and ratios of data values are meaningful.
Examples: Identify the type of data and level of measurement:
1. Kim, James, and Peter are names of three students from the population of students in a
university. Solution: These data are at the nominal level. The data values are simply
names. We cannot determine if one name is “greater than” or “less than” another. Any
ordering of the names would be numerically meaningless.
2. In a high school graduating class of 320 students, John ranked 25th, James ranked 19th,
Walter ranked 10th, and Joe ranked 4th.
3. Water temperatures in degrees Celsius of a milkfish pond in Victorias City vary from
1 – 2, 3 – 4, and 5 – 6.
4. Height of basketball players: A is ¾ taller than B.
Distinguish Statistician and Statistics Practitioner

Statistician refers to an individual who works with the mathematics of statistics. His or her work
involves research that develops techniques and concepts that in the future may help
the statistics practitioner.
Statistics practitioner is a person who uses statistical techniques. Examples:
1. Financial analyst who develops stock portfolios based on historical rates of return
2. Economist who uses statistical models to help explain and predict variables such as
inflation rates, unemployment rates, and changes in GDP
3. Market researcher who surveys consumers and converts the responses into useful
information, say consumer preferences for dairy products, cough medicine, etc.
Statisticians are also statistics practitioners, frequently conducting empirical research and serving as
consultants.
Major Areas of Statistics

1. Descriptive Statistics deals with the methods of organizing, summarizing, and
presenting data in a convenient and informative way. One form of descriptive
statistics uses graphical techniques that allow statistics practitioners to present
data in a way that makes it easy for the reader to extract useful information.
Another form of descriptive statistics uses numerical techniques to summarize
data, for example by the average or mean. Ex. average sales for the month, most frequent sales value (mode)
2. Inferential Statistics is a body of methods used to draw conclusions or inferences
about the characteristics of the population based on the sample data
(generalization).
Statistical inference problems involve three key concepts: the population,
the sample, and the statistical inference.
* Population is the group of all items of interest. It is frequently very large and may be
infinitely large.
Parameter is a descriptive measure of a population.
* Sample is a set of data drawn from the studied population.
Statistic is a descriptive measure of a sample.
* Statistical inference is the process of
making an estimate, prediction, or decision about the population based on sample data.
Types of Data
1. Interval data are real numbers such as height, income, and distance. They are
quantifiable.
2. Nominal data are categories such as responses to questions about marital
status: single, married, divorced, widowed. You can assign codes to each
category, say S = single, M = married, D = divorced, W = widowed.
3. Ordinal data appear to be nominal, but the difference is that the order
of their values has meaning, for example Excellent, 5; Very good, 4; Good, 3;
Fair, 2; Poor, 1. The order must be maintained.
4. Ratio data are interval data with a true zero point, so that ratios of values,
expressed as percentages or fractions, are meaningful.
Calculations for Types of Data
1. Interval data - all calculations are permitted, say averages and measures of variability.
2. Nominal data - codes are arbitrary, so no calculations can be done on the
codes themselves. Use frequencies and percentages of each category.
3. Ordinal data - the important aspect is the order of the values. The permissible
calculations are those based on ranking.
4. Ratio data - all calculations permitted for interval data, plus ratios such as
percentages and fractions.
Classification of Data
1. One-way classification - has only one variable described by at least
two categories
Example:
Civil Status    Frequency
Single          20
Married         25
Separated       10
Widowed         15
2. Two-way classification - has two variables, each described by its
respective categories
Example: Attitude of Respondents on the Proposed Federalism
                    Attitude
Gender     Favor   Against   Undecided   Total
Male       35      20        10          65
Female     25      10        5           40
Total      60      30        15          105
Kinds of Sampling
1. Random or Probability Sampling - method of selecting a sample from the
population where all elements have an equal chance of being selected.
Example: N = 1000, n = 100, p = n/N = 100/1000 = 10% chance of
being selected
2. Non-random or Non-probability Sampling - method of
selecting a sample from the studied population where not all elements
of the population are given an equal chance of being selected. Others are
deliberately left out.
Example: In hiring personnel, the selection is non-random because of the
qualifications set by the hiring personnel.
Methods of Sampling
1. Lottery - similar to a raffle
2. Table of random numbers
3. Systematic - every kth element is selected, where k = N/n. Example: N = 1000, n = 100,
k = 1000/100 = 10, so every 10th element is selected
4. Stratified - the population is divided into homogeneous strata, and a sample is drawn from each stratum
Example: Study on the Income Profile of the Employees
of A-1, Inc.
5. Cluster or Area - similar to stratified, i.e. respondents are clustered
according to interests or specializations, and selection is based on location,
say, clusters of teachers or clusters of engineers in Brgy. X
6. Multistage sampling - used when the population is very large or geographically spread
out. In such cases, the sample is constructed through a multistage design
of several stages, and the final stage consists of clusters.
7. Convenience sampling uses results or data that are conveniently and
readily obtained. It is considered biased.
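The lottery and systematic methods can be sketched in Python. This is only an illustrative sketch under stated assumptions: the population is a hypothetical list of ID numbers 1 to 1000, and the helper names `simple_random_sample` and `systematic_sample` are made up for this example.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Lottery-style draw: every element has the same chance n/N of selection."""
    rng = random.Random(seed)
    return rng.sample(population, n)

def systematic_sample(population, n, seed=None):
    """Pick a random start in the first interval, then take every k-th element."""
    k = len(population) // n      # sampling interval, e.g. 1000 // 100 = 10
    rng = random.Random(seed)
    start = rng.randrange(k)      # random start among the first k elements
    return population[start::k][:n]

# Hypothetical population of ID numbers 1..1000 (N = 1000), sample size n = 100
population = list(range(1, 1001))
srs = simple_random_sample(population, 100, seed=1)
sys_sample = systematic_sample(population, 100, seed=1)

print(len(srs), len(sys_sample))   # both samples contain 100 elements
print(sys_sample[:5])              # consecutive picks are 10 apart
```

The random start in the systematic method keeps the first k elements equally likely to be chosen; after that the selection is fully determined by the interval.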
Sampling error is the difference between measurements from a sample and
the corresponding measurements from the respective population.
Survey - the best choice for gathering information across a wide range of
variables. Many questions can be included in a survey.
* Pitfalls of Surveys:
1. Nonresponse: Individuals either cannot be contacted or refuse to
participate. Nonresponse can result in significant undercoverage of
a population.
2. Truthfulness of response: Respondents may lie intentionally or
inadvertently.
3. Faulty recall: Respondents may not accurately remember when or
whether an event took place.
4. Hidden bias: The order of questions might lead to biased responses.
Also, the number of responses on a Likert scale may force responses
that do not reflect the respondent’s feelings or experience.
5. Vague wording: Words such as “often”, “seldom”, and “occasionally”
mean different things to different people.
6. Interviewer influence: Factors such as tone of voice, body language,
dress, gender, authority, and ethnicity of the interviewer might
influence responses.
7. Voluntary response: Individuals with strong feelings about the subject
are more likely than others to respond. Such a study is interesting
but not reflective of the population.
* Lurking variable is one for which no data have been collected but which
nevertheless has influenced other variables in the study.
* Confounding variables are two variables such that the effect of one
variable cannot be distinguished from the effect of the other. They may be
part of the study or outside lurking variables.
Statistical Measures
Parameter - numerical value which describes a characteristic of the
population
Statistic - numerical value which describes a characteristic of the
sample
Measures of Central Location
1. Mean
Population mean (parameter): $\mu = \sum X / N$
Sample mean (statistic): $\bar{X} = \sum X / n$
2. Median - the middle value when the items are arranged in ascending order. The
population median and the sample median are found the same way.
Example: 10, 12, 14, 16, 18, 20
Calculate: Mean, Median
3. Mode, Mo, is the value which appears most often. It may or may not exist.
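As a quick check of the example data 10, 12, 14, 16, 18, 20, the three measures can be computed with Python's standard `statistics` module. This is a minimal sketch; the variable names are chosen only for illustration.

```python
from statistics import mean, median, multimode

data = [10, 12, 14, 16, 18, 20]

data_mean = mean(data)      # sum of the values divided by their count
data_median = median(data)  # even count: average of the two middle values
modes = multimode(data)     # list of the most frequent value(s)

print(data_mean, data_median)
# Every value appears exactly once, so multimode returns all six values:
# no single mode stands out, which is the "may not exist" case.
print(modes)
```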
Measures of Variability
The three measures of central location do not by themselves give an
adequate description of the data. We also need to know how the
observations spread out from the average.
Variance is the average of the squares of the deviations of individual
values from the mean.
Standard Deviation is a special form of the average deviation from
the mean. It is the square root of the variance.
Range is the difference between the largest and the smallest measures.
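The definitions above can be illustrated with the same six values 10, 12, 14, 16, 18, 20 used earlier. The sketch below assumes Python's standard `statistics` module. Note that the definition given in the text (the average of the squared deviations) corresponds to the population form, which divides by N; the sample form, which divides by n - 1, is shown only for comparison and is an addition to the text.

```python
from statistics import pvariance, pstdev, variance, stdev

data = [10, 12, 14, 16, 18, 20]

data_range = max(data) - min(data)   # largest minus smallest value

pop_var = pvariance(data)   # population variance: sum of squared deviations / N
pop_sd = pstdev(data)       # population SD: square root of pop_var

samp_var = variance(data)   # sample variance: sum of squared deviations / (n - 1)
samp_sd = stdev(data)       # sample SD: square root of samp_var

print(data_range, pop_var, pop_sd, samp_var, samp_sd)
```

The squared deviations from the mean of 15 sum to 70, so the population variance is 70/6 and the sample variance is 70/5 = 14.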
Measures of Relative Dispersion
1. Chebyshev’s Theorem - enables us to make statements about the proportion
of data values that must be within a specified number of standard
deviations (K SD) of the mean.
“At least $1 - 1/K^2$ of the data values must be within K standard deviations
of the mean, where K > 1.”
Illustration: Suppose a set of data has a mean of 150 and a std. deviation of
25. Between what values will X fall if K = 2? If K = 3?
Soln. For K = 2: $1 - 1/K^2 = 1 - 1/2^2 = 1 - 1/4 = 3/4$ or 75%; X will take on a value
within KS of the mean. Thus, $\bar{X} \pm KS = 150 \pm 2(25) = 150 \pm 50$, or 100 to 200.
At least 75% of the data values lie between 100 and 200.
For K = 3: $1 - 1/3^2 = 8/9$ or about 88.9%; $\bar{X} \pm KS = 150 \pm 3(25) = 150 \pm 75$, or 75 to 225.
At least 88.9% of the data values lie between 75 and 225.
Chebyshev’s Theorem
The two values most often used to describe a set of data are the mean and the standard
deviation. If the distribution has a small SD, we would expect most of the
values to be grouped closely around the mean. However, a large value of
the SD indicates greater variability, in which case we would expect the
observations to be more spread out from the mean.
Chebyshev, a Russian mathematician, discovered that the fraction of the measurements
falling between any two values symmetric about the mean is related to the standard deviation.

Chebyshev’s Theorem - “At least the fraction $1 - 1/K^2$ of the measurements
of any set of data must lie within K SD of the mean, where K > 1.”
For K = 2, the theorem states that at least $1 - 1/K^2 = 1 - 1/4 = 3/4$ or 75% of the
measurements must lie within 2 SD on both sides of the mean. That is, ¾ or more
of the observations must lie in the interval $\mu \pm 2$ SD.
Exercise: What percent of the observations lie within K = 3? K = 4?
Note: Chebyshev’s theorem enables us to make statements about the proportion
of data values that must be within a specified number of standard deviations of the mean ($\bar{X} \pm K$ SD).
Problem Exercise: If the IQ scores of a random sample of 1080 college students have a
mean score of 120 and a std. dev. of 8, determine the interval that contains at least 810 of the IQ scores.
Solution: $1 - 1/K^2 = 810/1080 = 3/4$, so $1/K^2 = 1 - 3/4 = 1/4$ and K = 2. Then
$\bar{X} \pm 2(8) = 120 \pm 16$, or 104 to 136. Thus at least 3/4 of 1080, or
810, of the IQ scores are found in the interval 104 - 136.
Note: The variable X will take on a value within KS of the mean, that is, $\bar{X} \pm KS$, where K refers to the
number of standard deviations. If K = 2 and S = 8, then KS = 16.
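The Chebyshev calculations in the illustration (mean 150, SD 25) and in the IQ exercise (mean 120, SD 8) can be checked with a short Python sketch. The function name `chebyshev_interval` is a hypothetical helper written for this example, not part of any library.

```python
def chebyshev_interval(mean, sd, k):
    """Return (minimum proportion, lower bound, upper bound) for k SDs about the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's theorem requires k > 1")
    proportion = 1 - 1 / k**2          # at least this fraction of the data
    return proportion, mean - k * sd, mean + k * sd

illustration = chebyshev_interval(150, 25, 2)   # mean 150, SD 25, K = 2
iq_scores = chebyshev_interval(120, 8, 2)       # mean 120, SD 8, K = 2

print(illustration)                 # proportion 0.75, interval 100 to 200
print(iq_scores)                    # proportion 0.75, interval 104 to 136
print(iq_scores[0] * 1080)          # at least 810 of the 1080 scores
```

The bound holds for any data set, whatever its shape, which is why the stated proportion is a minimum rather than an exact value.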
Problem Exercise: A certain type of flight of City Airlines carried a daily
average of 78 passengers with a standard deviation of 13 passengers. The management needed to know how
dispersed the number of passengers is, for such an estimate is needed
in order to make decisions which will maximize efficiency. The management
wanted to know what percent of the time, and how often, the number of
passengers is within 2.5 std. deviations of the mean, and what that interval is.
Solution: For any data set, at
least $1 - 1/K^2$ of the observations lie within K SD of the mean, thus
$1 - 1/K^2 = 1 - 1/(2.5)^2 = 1 - 1/6.25 = 1 - 0.16 = 0.84$ or 84%.
If we move 2.5 SD (KS) above and below the mean, i.e.
2.5 x 13 = 32.5 or about 33 passengers, we have the interval $\bar{X} \pm KS$ or 78 ± 2.5(13),
or about 45 to 111 passengers.
The management can ascertain that at least 84% of the time the number
of daily passengers is between 45 and 111. And how often? In one month, say,
84% of 30 days is 25.2 or about 25 days on which the number of daily passengers is between 45 and 111.
2. Z-Score measures the number of standard deviations the variable X is from the
mean. It is a measure of the relative location of an observation in a data set.
Observations in two different data sets with the same Z-score can be said
to have the same relative location, in terms of the same number of standard deviations
from the mean. The Z-score is used to compare two observations from two different
populations or samples in order to determine their relative rank. A method of ranking
these two observations is to convert the individual observations into standard deviation
units known as Z-scores or Z-values.
$Z = \frac{X - \mu}{\sigma}$ or $Z = \frac{X - \bar{X}}{S}$
Problem Exercise:
Let us assess the accomplishment of a student in Math 1 and Physics 1. The
student’s score in Math 1 is 82 and in Physics 1 is 89. Can we conclude that the
student is a better student in Physics 1?
Soln. We should consider how the student performed relative to the other students
in each class. For Math 1 the mean score of the class is 68 with SD of 8, while in
Physics 1 the mean score of the class is 80 with SD of 6.
Math 1: $\mu = 68$, $\sigma = 8$, $X = 82$; $Z = (X - \mu)/\sigma = (82 - 68)/8 = +1.75$
Physics 1: $\mu = 80$, $\sigma = 6$, $X = 89$; $Z = (X - \mu)/\sigma = (89 - 80)/6 = +1.50$
Since the Z-score of the student in Math 1 is higher than his Z-score in Physics 1, the
student’s relative performance in Math 1 is better than his performance in Physics 1.
Conclusion: The student is a better student in Math 1.
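The Math 1 and Physics 1 comparison can be reproduced in a few lines of Python. The function name `z_score` is an illustrative helper, and the figures are the ones given in the exercise.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

z_math = z_score(82, 68, 8)       # Math 1: score 82, class mean 68, SD 8
z_physics = z_score(89, 80, 6)    # Physics 1: score 89, class mean 80, SD 6

# The subject with the larger Z-score is the one with the better relative standing
better_subject = "Math 1" if z_math > z_physics else "Physics 1"

print(z_math, z_physics, better_subject)
```

Although the raw Physics 1 score is higher, the comparison is made in standard deviation units, so the ranking depends on each class's mean and spread.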
Problem Exercises:
1. Let us assess the encoding speed of an applicant to see whether he is suited for the Dean’s
Office, Business Office, or Personnel Office.
Office       Applicant’s Score   Standard Speed   Std. Deviation
Dean         141 sec             180 sec          30 sec
Business     7 min               10 min           2 min
Personnel    33 min              26 min           5 min
For which office does the applicant seem best suited?
Since speed is of primary importance, we are looking for the Z-score that represents
the greatest number of SDs to the left of the mean (a negative Z-score).
3. Coefficient of Variation, CV - a measure of uniformity
$CV = \frac{\text{Std. Deviation}}{\text{Mean}}$
The smaller the coefficient of variation, the more uniform the
distribution.
Example: Nicotine content in mg of n cigarettes
1.2; 1.3; 1.4; 1.5; 1.6; 1.7; 1.8; 1.9; 2.0
Calculate: Mean, Std. Deviation, Coef. of Variation (CV)
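One way to carry out the nicotine calculation is sketched below in Python. It assumes the nine values are treated as a sample, so the standard deviation divides by n - 1; the text does not say which form is intended, and using the population form would give a slightly smaller CV.

```python
from statistics import mean, stdev

nicotine_mg = [1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0]

avg = mean(nicotine_mg)    # arithmetic mean of the nine values
sd = stdev(nicotine_mg)    # sample standard deviation (assumption: n - 1 divisor)
cv = sd / avg              # coefficient of variation as a fraction of the mean

print(round(avg, 2), round(sd, 4), f"{cv:.1%}")
```

Because the CV is a ratio of two quantities in the same unit, it has no unit and can be compared across data sets measured on different scales.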
If n is small relative to N, the probability for each draw will change only slightly.
Hence, we essentially have a binomial experiment and can approximate
the hypergeometric distribution by using the binomial distribution with p = k/N.
The mean and variance can also be approximated by the formulas
$\mu = np = \frac{nk}{N}$
$\sigma^2 = \frac{N - n}{N - 1} \cdot n \cdot \frac{k}{N}\left(1 - \frac{k}{N}\right) \approx npq$, where $q = 1 - p$
$(N - n)/(N - 1)$ is a correction factor, negligible when n is small relative to N.
Example. A case of wine has 12 bottles, 3 of which contain spoiled wine.
A sample of 4 bottles is randomly selected from the case.
1. Find the probability distribution of X = 0, 1, 2 bottles of
spoiled wine.
2. What are the mean and variance of X?
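The wine example can be worked through with a short Python sketch. It assumes the usual hypergeometric notation, which the text does not spell out: N is the population size (12 bottles), k the number of spoiled bottles (3), and n the sample size (4). The helper name `hypergeom_pmf` is hypothetical.

```python
from math import comb

def hypergeom_pmf(x, N, k, n):
    """P(X = x): x spoiled bottles in a sample of n drawn without replacement."""
    return comb(k, x) * comb(N - k, n - x) / comb(N, n)

N, k, n = 12, 3, 4   # bottles in the case, spoiled bottles, bottles sampled

# Probability of each possible count of spoiled bottles in the sample
pmf = {x: hypergeom_pmf(x, N, k, n) for x in range(0, min(k, n) + 1)}

mean_x = n * k / N                                       # mu = nk/N
var_x = (N - n) / (N - 1) * n * (k / N) * (1 - k / N)    # with correction factor

for x, p in pmf.items():
    print(x, round(p, 4))
print(mean_x, round(var_x, 4))
```

Here n is a third of N, so the correction factor (N - n)/(N - 1) = 8/11 is far from 1 and the binomial approximation would not be close.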
THE NORMAL DISTRIBUTION
The normal probability distribution, also known as the Gaussian distribution
after the mathematician and astronomer Karl Gauss, is a continuous
distribution which is regarded as the most significant probability distribution
in the entire theory of statistics.
Properties of the normal curve/distribution
1. The mean, median, and mode have the same value and are plotted
at the central point along the horizontal line.
2. The curve is symmetrical about the vertical line which contains
the mean.
3. The curve is asymptotic to the horizontal line.
4. The total area under the curve is 1.0 or 100%, and it is usually subdivided
into three standard scores (3 SDs) on each side, to the left and to the
right of the vertical axis.
The Empirical Rule - for a normal distribution, regardless of the value of the mean and standard deviation,
about 68% of all observations lie within one SD of the mean
about 95.5% of all observations lie within two SDs of the mean
about 99.7% of all observations lie within three SDs of the mean
The Normal Deviate
The Standardized Normal Distribution
There can exist an infinite number of possible normal distributions, each with its
own mean and standard deviation. It is necessary to convert all these normal distributions
to the standard normal distribution with the Z-formula:
$Z = \frac{X - \mu}{\sigma}$ or $Z = \frac{X - \bar{X}}{s}$
Where: Z is the normal deviate and X is some specified value of the random variable.
Z measures the number of standard deviations an observation is from the mean.
After the conversion process, the mean of the distribution is 0 and the SD is 1.
Illustration: Telecom, a telephone answering service for business executives in Metro Bacolod,
has found that the average telephone message is 150 sec. long with SD of 15 sec. The
length of a message is normally distributed. A particular phone message took 180
seconds. How many seconds longer than the average is it?
Solution: $Z = \frac{X - \mu}{\sigma} = \frac{180 - 150}{15} = 2$, i.e. 2 SDs or 30 seconds above the mean
What is the probability that a single message takes between 150 sec. and 180 sec.?
Solution: $Z = \frac{180 - 150}{15} = 2$. From the normal curve table, the area between the mean and Z = 2.0 (on either side) is .4772.
Telecom concludes that there is a 47.72% chance that any single telephone message will
last between 150 sec. and 180 sec.
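The table lookup in the Telecom illustration can be cross-checked in Python, where `statistics.NormalDist` from the standard library plays the role of the normal curve table. This is a sketch for verification only; the variable names are illustrative.

```python
from statistics import NormalDist

# Message length: normal with mean 150 sec and SD 15 sec (from the illustration)
message_length = NormalDist(mu=150, sigma=15)

z = (180 - 150) / 15                       # normal deviate of a 180-second message

# Area under the curve between 150 sec (the mean) and 180 sec
prob = message_length.cdf(180) - message_length.cdf(150)

print(z, round(prob, 4))
```

The `cdf` method gives the area to the left of a value, so the area between two values is the difference of the two cumulative areas.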
Exercises: Find the area under the normal curve. Use the normal curve table
1. Between Z = -1.75 and Z = + 2.85

2. Below Z = -2.75
3. Beyond Z = + 2.20
4. Between Z = - 2.25 and Z = - 2.90
Problem Exercises:
1. A random sample of 1000 construction workers gave their average daily wage as
P420 with SD of P35. Assuming that daily wages are normally distributed,
a. what is the probability that a worker selected at random earns between
P380 and P490 a day?
b. how many workers earn less than P450 a day?
c. if workers who earn P480 and above are asked to contribute P70 for a sick
co-worker, how much is the expected total contribution?
2. A study of prevailing market prices for one day shows that the average price of
rice per kilo is P40.00 with SD of P1.50. Assuming that prices are normally distributed,
a. What percentage of rice sells at higher than P43.00 per kilo?
b. If in a particular market 1000 sacks of rice were sold, how many kilos were
sold at less than P42.00 per kilo? (1 sack = 50 kilos)
c. What average price per kilo should the government try to maintain so that
80% of rice sells at not more than P42.50?
3. The average life of a certain type of motor is 10 years, with SD of 2 years. The manufacturer replaces
free all motors that fail while under guarantee. If he is willing to replace only 3%
of the motors that fail, how long a guarantee should he offer?
4. Assume that heights of women in a population follow a normal curve with mean
of 64.3 inches and SD of 2.6 inches.
a. What proportion of women stand between 60 inches and 66 inches?
b. A certain woman has a height 0.5 SD above the mean. What proportion
of women are taller than she?
5. A distribution of test scores in Statistics follows a normal distribution with mean of
80 and std. deviation of 12. There are 120 students who took the test.
a. How many scores do you expect to find above 100?
b. How many scores do you expect to find between 90 and 110?
6. A soft drink machine is regulated so that it discharges an average of 210 ml
per cup. If the amount of drink is normally distributed with standard deviation
of 18 ml,
a. What fraction of the cups will contain more than 230 ml?
b. How many cups will likely overflow if 245 ml cups are used for the next
1000 drinks?
c. Below what value do we get the smallest 20% of the drinks?
7. The IQs of 600 applicants to the College of Engineering are approximately
normally distributed with a mean of 115 and standard deviation of 12. If the
college requires an IQ of at least 95, how many of these students will be rejected
on this basis regardless of their other qualifications?
CORRELATION AND REGRESSION ANALYSIS
Definition: Correlation is the measure of relationship between or among variables.
Coefficient of correlation, r, is the index of relationship between variables.
It measures the strength of the relationship or association between variables.
These variables include independent variables and a dependent variable.
The values of the coefficient of correlation are between -1.0 and +1.0.
If r is +1.0, it indicates that the two variables are perfectly related in
a positive sense, which means that if X increases, Y also increases; if X decreases,
Y does likewise. If r is negative, it indicates that X and Y are inversely related,
meaning that if X increases, Y decreases.
Table of the Coefficient of Correlation
Coefficient of Correlation, r      Interpretation
r = 0.00                           No Correlation
r from +/- .01 to +/- .19          Negligible Correlation
r from +/- .20 to +/- .39          Low Correlation
r from +/- .40 to +/- .59          Moderate Correlation
r from +/- .60 to +/- .79          Moderately High Correlation
r from +/- .80 to +/- 1.00         High Correlation
* Lurking variable is a variable that has an important effect on the response
but is not included among the explanatory variables studied.
* Explanatory and response variables - When changes in a variable X are thought
to explain or even cause changes in a second variable Y, X is called the
explanatory variable and Y is called the response variable.
* Strong correlation does not imply any cause and effect relationship.
* Causation: Changes in X cause changes in Y.
* Common response: Changes in both X and Y are caused by changes in a
lurking variable Z.
* Confounding: The effect (if any) of X on Y is confounded with the effect on Y of a
lurking variable Z.
Association Between X and Y
1. Causation: Changes in X cause changes in Y.
Example: Quitting smoking may reduce a person’s chance of
getting lung cancer if causation holds.
2. Common response: Both X and Y respond to changes in some unobserved
variables. Y can sometimes be predicted from X, but intervening
to change X would not bring about a change in Y. The genetic
hypothesis claims that both smoking behavior and lung cancer are
responses to genetic predisposition; under this hypothesis, quitting smoking does not
reduce the chance of getting lung cancer.
3. Confounding: The effect on Y of the explanatory variable X is hopelessly
mixed up with the effects on Y of other variables. The “sloppy
lifestyle” hypothesis claims that smoking is confounded with other
types of behavior, so that we have no information about the effect
of smoking alone on health.
Techniques of Correlation
1. Spearman Rank-Order Correlation
$r = 1 - \frac{6\sum d^2}{N(N^2 - 1)}$
Where: 6 = constant
N = no. of pairs
d = difference between ranks
Example: Relationship between Height and Weight of persons
2. Pearson Product-Moment (PPM)
$r = \frac{N\sum XY - (\sum X)(\sum Y)}{\sqrt{[N\sum X^2 - (\sum X)^2][N\sum Y^2 - (\sum Y)^2]}}$
3. Deviation from mean
$r = \frac{\sum (d_x \cdot d_y)}{\sqrt{(\sum d_x^2)(\sum d_y^2)}}$
Where: $d_x = X - \bar{X}$
$d_y = Y - \bar{Y}$
Illustration:
1. The relationship between the AGE of machines and their REPAIR costs
X                Y
Age in Years     Repair costs in P10^3
2                1.0
4                2.2
5                2.5
7                3.0
8                4.5
10               5.0
Find the coefficient of correlation, r, and interpret.
2. Relationship between pressure and volume of a confined gas
V (cm^3)       50     60     70     90     100    110
P (kg/cm^2)    64.7   51.3   40.5   25.9   15.5   10.2
Calculate the coefficient of correlation, r, and interpret.
3. Ten candidates for Miss Philippines were ranked by Judge X and Judge Y
according to beauty and talent. Find if the choice of Judge X is consistent
with the choice of Judge Y.
Candidates:  A  B  C  D  E  F  G   H   I  J
Judge X:     2  3  1  6  4  5  10  7   8  9
Judge Y:     2  1  3  5  6  4  8   10  9  7
4. Relationship between Height and Weight of seven students selected at random
Height in ft. and inches    Weight in lbs.
5 ft. 8 in.                 150 lbs.
5 ft. 2 in.                 90 lbs.
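The Pearson product-moment and Spearman rank-order formulas can be applied to two of the illustrations above: the machine age and repair cost data, and the rankings by Judge X and Judge Y. The Python sketch below uses hypothetical helper names (`pearson_r`, `spearman_r`) and assumes the ranks contain no ties, which is the case the rank-order formula is written for.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson product-moment coefficient from the raw-score sums."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    return (n * sxy - sx * sy) / sqrt((n * sxx - sx**2) * (n * syy - sy**2))

def spearman_r(rank_x, rank_y):
    """Spearman rank-order coefficient for two sets of untied ranks."""
    n = len(rank_x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n**2 - 1))

# Age of machines (years) and repair costs (P10^3)
age = [2, 4, 5, 7, 8, 10]
cost = [1.0, 2.2, 2.5, 3.0, 4.5, 5.0]
r_machines = pearson_r(age, cost)

# Rankings of the ten candidates by the two judges
judge_x = [2, 3, 1, 6, 4, 5, 10, 7, 8, 9]
judge_y = [2, 1, 3, 5, 6, 4, 8, 10, 9, 7]
r_judges = spearman_r(judge_x, judge_y)

print(round(r_machines, 3), round(r_judges, 3))
```

Both results fall in the .80 to 1.00 band of the interpretation table, that is, high correlation.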
Limitations of Rank-Order Correlation
It takes into account only the rank positions of the items in the series and makes no
allowance for gaps between adjacent measures.
Advantages:
1. Rank-Order - provides a convenient way of estimating the coef. of correlation if
N is small
2. Pearson Product-Moment (PPM) - takes into account the absolute size of
the measures and not merely their rank positions
Coefficient of Determination, $r^2$, expresses the proportion of total variation in Y that
can be accounted for or explained by the independent variable X.
Thus, if r = .60, then $r^2 = .36$, meaning 36% of the variation in Y is accounted for by X.
REGRESSION ANALYSIS
Definition. Regression is a quantitative expression of the basic nature of the relationship
between X and Y (the independent and dependent variables). It determines the
change in Y given a change in X.
Correlation measures the strength of the relationship between X and Y. If X and Y are
related in a linear manner, then as X changes by a constant amount, Y also changes
by a constant amount.
Linear Regression Equation
$\hat{Y} = a + bX$
Where: $\hat{Y}$ = estimated value of Y
a = Y-intercept, that is, the value of Y when X = 0
b = slope, i.e. the change in Y for every unit change in X
X = independent variable
where: $b = \frac{N\sum XY - (\sum X)(\sum Y)}{N\sum X^2 - (\sum X)^2}$
$a = \bar{Y} - b\bar{X}$
thus, $\hat{Y} = a + bX \pm SE_{est}$
$SE_{est}$ is the standard error of estimate. It measures the dispersion about an average
line called the regression line.
$SE_{est} = \sqrt{\frac{\sum Y^2 - a(\sum Y) - b(\sum XY)}{n - 2}}$
Coefficient of determination, $r^2$, expresses the proportion of total variation in Y that can be
accounted for or explained by the independent variable X. Thus, if r = .6, then $r^2 = .36$, meaning
36% of the variation in Y is accounted for by X.
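The formulas for b, a, and the standard error of estimate can be tried on the six-machine age and repair cost data from the correlation illustration (ages 2, 4, 5, 7, 8, 10 years; costs 1.0, 2.2, 2.5, 3.0, 4.5, 5.0 in P10^3). The following Python sketch uses a hypothetical helper `fit_line`, and the choice of this data set is an assumption made for illustration, not a solution to the exercises.

```python
from math import sqrt

def fit_line(xs, ys):
    """Least-squares intercept a, slope b, and standard error of estimate."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    b = (n * sxy - sx * sy) / (n * sxx - sx**2)     # slope
    a = sy / n - b * sx / n                         # intercept: Y-bar minus b times X-bar
    se = sqrt((syy - a * sy - b * sxy) / (n - 2))   # standard error of estimate
    return a, b, se

age = [2, 4, 5, 7, 8, 10]                  # X: age in years
cost = [1.0, 2.2, 2.5, 3.0, 4.5, 5.0]      # Y: repair cost in P10^3

a, b, se_est = fit_line(age, cost)
estimate_at_6 = a + b * 6                  # estimated cost of a 6-year-old machine

print(round(a, 4), round(b, 4), round(se_est, 4), round(estimate_at_6, 4))
```

Since the mean age in this data set is 6 years, the estimate at X = 6 equals the mean repair cost, which is a handy check that a and b were computed consistently.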
Exercises:
1. An assembly plant wanted to study the relationship between the age of a machine
and its repair cost. The following data represent the age in years and repair costs
of a random sample of 8 machines.
Age in Years    Repair costs (P10^3)
5               3.5
6               5.0
7               5.0
9               5.2
12              6.0
13              6.0
15              6.2
16              7.1
a. Determine the coefficient of correlation, r
b. Estimate the repair cost of a ten-year-old machine
2. The data below represent the electrical energy consumption in KWH and the
amount due over a period of 6 months.
Energy Consumption (KWH)    Amount Due (P10^3)
83                          0.853
160                         1.425
190                         1.732
153                         1.505
147                         1.547
170                         1.655
a. Calculate the coefficient of correlation, r
b. Estimate the amount due for 200 KWH consumption
3. The data below represent the supply of product A and its price per unit.
Price per unit (P)    Supply (10^3 units)
25                    100
20                    120
30                    80
25                    110
35                    90
30                    100
40                    75
a. Find the coefficient of correlation by PPM
b. Estimate the price per unit of the product when supply is 150,000 units
4. The marketing research dept. of A-1, Inc. wanted to study the relationship
between the advertising expenditures and sales volume of a certain product.
Ad Expense (P10^3)    Sales Volume (P10^3)
5                     40
7                     45
10                    60
12                    65
15                    75
20                    80
25                    95
a. Calculate the coefficient of correlation by Pearson Product-Moment
b. Estimate the sales volume for an Ad expense of P50,000
5. Relationship between the teaching performance and tenure in years of faculty
Teaching Performance (%)    Tenure in Years
84                          4
86                          6
90                          14
87                          8
92                          15
94                          12
95                          16
80                          5
85                          7
88                          10
a. Calculate the coefficient of correlation by PPM
b. Estimate the performance rating of a teacher who has been in the
profession for 9 years
Multiple Regression Analysis
Multiple Regression Equation: $\hat{Y} = a + b_1X_1 + b_2X_2$
Sub-equations: (1) $\sum Y = na + b_1\sum X_1 + b_2\sum X_2$
(2) $\sum X_1Y = a\sum X_1 + b_1\sum X_1^2 + b_2\sum X_1X_2$
(3) $\sum X_2Y = a\sum X_2 + b_1\sum X_1X_2 + b_2\sum X_2^2$
Problem illustration:
Mr. de Los Santos has been concerned for some time with the overhead
costs in his furniture shop. For the last 7 months he has kept a record not only of the
direct labor hours in the shop but also of the total cost of lumber used in the operation.
The data are found in the following table:
Month   Overhead Costs (P10^3)   Labor Hrs./mo. (10^3)   Cost of Lumber Used (P10^3)
Jan     3.1                      0.39                    2.4
Feb     2.6                      0.36                    2.6
Mar     2.9                      0.38                    2.3
Apr     2.7                      0.39                    1.9
May     2.8                      0.37                    1.9
June    3.0                      0.39                    2.1
July    3.2                      0.38                    2.4
1. What is the dependent variable?
2. What are the independent variables?
3. Find the values of a, b1, and b2.
4. Find the overhead costs when labor equals 400 hrs./month and
lumber used costs P2500.
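The three sub-equations form a system of three linear equations in a, b1, and b2, which can be solved by elimination. The Python sketch below sets up the system from the furniture shop data and solves it with a small hand-written Gaussian elimination; the helper names `dot` and `solve_linear` are hypothetical. It assumes, following the table headings, that labor hours are recorded in thousands (so 400 hrs. is 0.40) and lumber cost in P10^3 (so P2500 is 2.5).

```python
def dot(u, v):
    """Sum of products of paired values, e.g. the sum of X1*Y."""
    return sum(p * q for p, q in zip(u, v))

def solve_linear(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [r] for row, r in zip(A, rhs)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):                   # back substitution
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

y = [3.1, 2.6, 2.9, 2.7, 2.8, 3.0, 3.2]           # overhead costs (P10^3)
x1 = [0.39, 0.36, 0.38, 0.39, 0.37, 0.39, 0.38]   # labor hours (10^3, assumed)
x2 = [2.4, 2.6, 2.3, 1.9, 1.9, 2.1, 2.4]          # lumber cost (P10^3)
n = len(y)

# Coefficient matrix and right-hand side of sub-equations (1), (2), (3)
A = [[n,       sum(x1),     sum(x2)],
     [sum(x1), dot(x1, x1), dot(x1, x2)],
     [sum(x2), dot(x1, x2), dot(x2, x2)]]
rhs = [sum(y), dot(x1, y), dot(x2, y)]

a, b1, b2 = solve_linear(A, rhs)
overhead_estimate = a + b1 * 0.40 + b2 * 2.5      # 400 hrs. and P2500 of lumber

# Residuals of the fitted equation; the sub-equations force them to sum to zero
residuals = [yi - (a + b1 * u + b2 * v) for yi, u, v in zip(y, x1, x2)]
print(a, b1, b2, overhead_estimate)
```

A useful self-check is that the residuals sum to approximately zero and are uncorrelated with X1 and X2, since that is exactly what the three sub-equations state.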
TESTS OF HYPOTHESIS
Inferential or sampling statistics is useful in generalizing about populations from a small
sample.
In many instances a researcher can only rely on the information provided by the sample.
Hypothesis is an educated guess or a tentative answer to a question. It is a statement
about an expected relationship between two or more variables that can be empirically
tested.
Kinds of Hypothesis
1. Ho: Null hypothesis or statistical hypothesis is a negative statement which
indicates the absence of a relationship or correlation between two variables, the absence
of a significant difference between the proportions of two groups, or the absence of a significant
difference between the means of two groups.
Example: There is no significant relationship between A and B.
There is no significant difference between A and B.
2. Ha: Alternative hypothesis or research hypothesis is a positive form of the null
hypothesis. It may state the presence of a significant relationship between the
independent and dependent variables, or the presence of a significant difference
between two means or two proportions. It is the opposite of the null hypothesis.
Example: There is a significant relationship between A and B.
3. Directional hypothesis states whether the relationship between two variables is
direct (positive) or inverse (negative). A positive or direct relationship is present
when the value of one variable increases with the increase in the value of another.
The relationship is negative when the value of one variable increases as the value
of another decreases.
Example: The taller, the heavier. X > Y; A < B.
4. Non-directional hypothesis does not specify the direction of the relationship between
variables. It merely states the presence or absence of a relationship between two
variables, or that one variable influences the other variable.
Example: There is a significant difference between the performance ratings
of students who attended the review class and those who did not.
A ≠ B; X ≠ Y
Level of Significance, α
The significance level of a test is the maximum probability of
rejecting the null hypothesis when in fact it is true.
A 5% level of significance (α = .05) means that you accept at most a 5% chance of
rejecting a true null hypothesis. Loosely, you are 95% confident that you have
made the right decision in accepting or rejecting the hypothesis.
A 1% significance level (α = .01) means that you could wrongly reject a true null
hypothesis with a probability of 1%. Loosely, you are 99% confident that you have
made the right decision.
Rejection of the null hypothesis (Ho) implies acceptance of the
alternative hypothesis (Ha).
Acceptance of the null hypothesis (Ho) implies rejection of the alternative
hypothesis (Ha).
Ho: Null hypothesis – statement of non-significance of the difference between X and Y
Ho: X = Y
Ha: Alternative hypothesis – statement of a significant difference between X and Y
Ha: X ≠ Y (non-directional alternative hypothesis). This is a two-tailed test.
Ha: X > Y or X < Y (directional alternative hypothesis). This is a one-tailed test.
Example: Ho: There is no significant difference between X and Y (X = Y)
Ha: There is a significant difference between X and Y (X ≠ Y)
Ha: X is significantly greater than Y (X > Y) or (Y < X)
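The choice between a one-tailed and a two-tailed test changes the table value that a computed Z is compared with, because a two-tailed test splits α between the two tails. The following is a minimal sketch using Python's standard-library `statistics.NormalDist`; the helper name `critical_z` is hypothetical, and a printed Z table gives the same values to the usual rounding.

```python
from statistics import NormalDist

def critical_z(alpha, two_tailed=True):
    """Table (critical) value of Z for a given significance level.

    A two-tailed test puts alpha/2 in each tail; a one-tailed test
    puts all of alpha in a single tail.
    """
    tail_area = alpha / 2 if two_tailed else alpha
    return NormalDist().inv_cdf(1 - tail_area)

for alpha in (0.05, 0.01):
    print(alpha,
          round(critical_z(alpha, two_tailed=True), 3),
          round(critical_z(alpha, two_tailed=False), 3))
```

For a directional Ha of the form X < Y, the rejection region is in the lower tail, so the same value is used with a negative sign.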
Z – TEST AND T – TEST OF HYPOTHESIS
1. Z-test is used when the population standard deviation is known and the sample size
is 30 or more (n ≥ 30) (Ronald Walpole, Intro. to Statistics).
2. t-test is used when only the sample standard deviation is known and the sample size
is less than 30.
Note: The normal approximation of the sample mean is good when n ≥ 30.
If the sample size is small (n < 30), the values of s² fluctuate considerably from
sample to sample and the distribution of the standardized sample mean is no longer
a standard normal distribution.
Uses of Z – test:
1. Sample mean is compared with population mean
$$Z = \frac{(\bar{X} - \mu)\sqrt{n}}{\sigma}$$
µ = population mean; X̄ = sample mean
n = sample size; σ = population standard deviation
Note: For Z – test, use the table of critical values of Z based on the area under the
normal curve.
2. Comparing two sample means
$$Z = \frac{\bar{X}_1 - \bar{X}_2}{\sigma\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$
n1 = first sample size; n2 = second sample size
3. Comparing two sample proportions
$$Z = \frac{P_1 - P_2}{\sqrt{\frac{P_1 Q_1}{n_1} + \frac{P_2 Q_2}{n_2}}}$$
P1 = proportion of first sample; Q1 = 1 – P1
P2 = proportion of second sample; Q2 = 1 – P2
Accept Ho if the absolute value of the computed Z is less than the table value.
Reject Ha.
Reject Ho if the absolute value of the computed Z is equal to or greater than the table
value. Accept Ha.
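The three uses of the Z-test and the decision rule can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, the sample figures are borrowed from Problem Exercises 6, 1, and 2 in these notes, and the table value 1.96 is the standard two-tailed critical Z at 5% alpha.

```python
from math import sqrt

def z_one_sample(xbar, mu, sigma, n):
    """Sample mean compared with population mean."""
    return (xbar - mu) * sqrt(n) / sigma

def z_two_means(xbar1, xbar2, sigma, n1, n2):
    """Two sample means with a common (combined) standard deviation."""
    return (xbar1 - xbar2) / (sigma * sqrt(1 / n1 + 1 / n2))

def z_two_proportions(p1, n1, p2, n2):
    """Two sample proportions, with Q = 1 - P for each sample."""
    q1, q2 = 1 - p1, 1 - p2
    return (p1 - p2) / sqrt(p1 * q1 / n1 + p2 * q2 / n2)

def decide(z, z_table):
    """Decision rule of these notes: compare |Z| with the table value."""
    return "Reject Ho" if abs(z) >= z_table else "Accept Ho"

# Figures taken from the problem exercises, used here only as sample inputs.
z_mean = z_one_sample(xbar=190, mu=150, sigma=18, n=70)
z_wires = z_two_means(xbar1=78.3, xbar2=87.2, sigma=5.6, n1=50, n2=50)
z_prop = z_two_proportions(p1=80 / 200, n1=200, p2=75 / 250, n2=250)

print(round(z_mean, 2), round(z_wires, 2), round(z_prop, 2))
print(decide(z_prop, z_table=1.96))
```

Keeping the statistic and the decision in separate functions mirrors the hand procedure: compute Z first, then look up the table value for the chosen alpha and number of tails.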
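Uses of t-test:
1. Test of population mean
$$t = \frac{(\bar{X} - \mu)\sqrt{n - 1}}{s}$$
s = sample standard deviation
2. Test of two sample means (n1 ≠ n2)
$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{n_1 s_1^2 + n_2 s_2^2}{n_1 + n_2 - 2}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$
3. Test of two samples, n1 = n2
$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{\sum d^2}{n(n-1)}}}$$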
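The three t formulas can be sketched in Python as follows. The function names are hypothetical and the inputs are borrowed from Problem Exercises 9, 10, and 5 of these notes. For the equal-n case, the sketch assumes that Σd² is the sum of squared deviations of both samples about their own means, which is how the d1 and d2 columns of Exercise 5 appear to be defined; the notes do not state this explicitly.

```python
from math import sqrt

def t_one_sample(xbar, mu, s, n):
    """Test of a population mean using the sample standard deviation s."""
    return (xbar - mu) * sqrt(n - 1) / s

def t_two_means(xbar1, s1, n1, xbar2, s2, n2):
    """Test of two sample means with unequal sample sizes."""
    pooled = (n1 * s1 ** 2 + n2 * s2 ** 2) / (n1 + n2 - 2)
    return (xbar1 - xbar2) / sqrt(pooled * (1 / n1 + 1 / n2))

def t_equal_n(x1, x2):
    """Test of two samples of equal size n.

    ASSUMPTION: sum of d^2 = squared deviations of each sample about
    its own mean, added over both samples.
    """
    n = len(x1)
    mean1, mean2 = sum(x1) / n, sum(x2) / n
    sum_d2 = (sum((x - mean1) ** 2 for x in x1)
              + sum((x - mean2) ** 2 for x in x2))
    return (mean1 - mean2) / sqrt(sum_d2 / (n * (n - 1)))

# Sample inputs taken from the problem exercises.
t_registration = t_one_sample(xbar=40, mu=50, s=12, n=28)
t_methods = t_two_means(xbar1=85, s1=4, n1=15, xbar2=87, s2=5, n2=13)
t_gpa = t_equal_n([2.0, 2.0, 2.3, 2.1, 2.4], [1.9, 1.9, 2.0, 2.1, 2.3])

print(round(t_registration, 3), round(t_methods, 3), round(t_gpa, 3))
```

The computed t is then compared with the t-table value at the chosen alpha and the appropriate degrees of freedom, in the same way as the Z decision rule.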
Problem Exercises:
1. Two types of wire are being compared for strength. Fifty pieces of each type of wire
are tested under similar conditions. Type A has an average tensile strength of 78.3 N,
while Type B has an average tensile strength of 87.2 N. The combined standard
deviation of the wires is 5.6 N. Which type of wire is stronger? Test at 1% alpha.
2. A sample survey of TV programs in Metro Bacolod shows that 80 of 200 men prefer
watching NBA. From another group, 75 of 250 prefer watching PBA. What is
the preference of men? Test at 5% alpha.
3. A cigarette manufacturer claims that his cigarettes have an average nicotine content of
1.83 mg. If a random sample of 28 cigarettes of this type has an average nicotine
content of 1.90 mg with a sample SD of .11 mg, will you agree with the claim of the
manufacturer? Test at 1% alpha.
4. A researcher wishes to find out whether or not there is a significant difference
between the weekly pay of the night and day shifts of a certain company. By random
sampling, she selected 25 day-shift and 27 night-shift workers and computed their
mean weekly pay and standard deviations. The day shift has a mean weekly pay of
P1575 with a standard deviation of P55. The night shift has a mean weekly pay of
P1850 with a standard deviation of P65. Does the night shift earn significantly more
than the day shift? Test at 5% alpha.
5. To determine whether membership in a campus club is beneficial or detrimental to
one's GPA, the following GPAs were collected over a period of 5 years. Test at 1%
alpha.
Year   X1 (Member)   X2 (Non-member)   d1² = (X1 − mean X1)²   d2² = (X2 − mean X2)²
1      2.0           1.9
2      2.0           1.9
3      2.3           2.0
4      2.1           2.1
5      2.4           2.3
6. Data from a subdivision survey show that the average monthly electrical consumption
of residential homes is 150 kWh with a standard deviation of 18 kWh. A sample of 70
residential homes was selected randomly and found to consume 190 kWh on
average. Do the 70 residences consume significantly more than the rest?
Test at 2.5% alpha.
7. Alpha Company manufactures steel cable with an average tensile strength of 150 N.
The laboratory tested 30 pieces and found an average tensile strength of 145 N
and a standard deviation of 6.5 N. Is the result of the laboratory in accordance with
what the manufacturer claims? Test at 5% level of significance.
8. Two types of thread are compared for strength. Twenty-five pieces of each type of
thread are tested under similar conditions. Type X has an average strength of 78.3 N
and a standard deviation of 5.6 N. Type Y has an average strength of 87.2 N
and a standard deviation of 6.2 N. Which thread is stronger? Test at 5% level of
significance.
9. The average length of time for students to register at a certain college has been 50
minutes. A new registration procedure using a modern computing machine is being
tried. If a random sample of 28 students had an average registration time of 40
minutes with a standard deviation of 12 minutes under the new system, can we
conclude that the new system has significantly reduced the registration time? Test
at 1% level of significance.
10. A course in mathematics is taught to 15 students by the conventional method. A
second group of 13 students was given the same course by means of programmed
materials. The first group, taught by the conventional method, made an
average grade of 85 with a standard deviation of 4, while the second group,
using programmed materials, made an average grade of 87 with a
standard deviation of 5. Which method is more effective? Test at 5% level of
significance.
CHI SQUARE ANALYSIS (χ²)
Chi Square is used as a test of significance when data are expressed in frequencies,
or in percentages or proportions that can be reduced to
frequencies.
The applications of chi square are with discrete data; however, any continuous data may be
reduced to categories and tabulated so that chi square may be applied.
Example: Scores on a test of mental ability and dexterity test could be tabulated into
a contingency table.
Dexterity Test Score
Mental Ability Scores 12 - 20 21 – 29 30 - 38

140 and up none 1 3
120 – 139 2 5 2
100 - 119 5 3 1
80 - 99 3
To use the chi square statistic, the data must be independent, i.e., no response
is related to any other response. Also, the categories into which data are placed must be
mutually exclusive, i.e., each observation must be placed in one and only one category.
Finally, all the observed data must be used in a chi square
problem.
Formula for Chi Square:
$$\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}$$
where: fo = observed/actual cell frequency
fe = expected or theoretical frequency
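As a small illustration of the formula, the sum over cells can be written directly in Python. The function name `chi_square` is hypothetical, and the sample frequencies are those of the coin-tossing example given later in these notes (4 heads and 6 tails observed, 5 and 5 expected).

```python
def chi_square(observed, expected):
    """Sum of (fo - fe)^2 / fe over all cells."""
    return sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))

# Coin tossed 10 times: observed 4 heads and 6 tails, expected 5 and 5.
coin = chi_square(observed=[4, 6], expected=[5, 5])
print(round(coin, 2))
```

Each cell contributes its squared gap between observed and expected frequency, scaled by the expected frequency, so cells with small fe weigh more heavily for the same gap.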
Classification of Data
1. One-way classification - has one variable described by at least two categories
Example:
Civil Status   frequency (f)
Single         5
Married        8
Widowed        6
Separated      3
df = no. of categories − 1 = 4 − 1 = 3
2. Two-way classification - has two variables each described by at least two
categories
Example:
                  Attitude Towards Charter Change
Gender     Favor          Against        Undecided      Total
           fo    fe       fo    fe       fo    fe
Male       15    15.09    8              2              25
Female     20             10             3              33
Total      35             18             5              58
Where: fo = observed/actual frequency
fe = expected/theoretical frequency
fe = (R x C)/N
R = Row total; C = Column total; N = Grand total
Example: fe = (25 x 35)/58 = 15.09
df = (r – 1)(c – 1)
r = no. of rows; c = no. of columns
Example: df = (2 – 1)(3 – 1) = 2
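The expected frequencies and the degrees of freedom of a two-way table follow mechanically from the row totals, column totals, and grand total. A minimal sketch, using the observed frequencies of the charter change example; the function names are hypothetical.

```python
def expected_frequencies(table):
    """fe = (row total x column total) / grand total, for every cell."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

def degrees_of_freedom(table):
    """df = (no. of rows - 1) x (no. of columns - 1)."""
    return (len(table) - 1) * (len(table[0]) - 1)

# Observed frequencies: rows = Male, Female; columns = Favor, Against, Undecided.
observed = [[15, 8, 2],
            [20, 10, 3]]

fe = expected_frequencies(observed)
df = degrees_of_freedom(observed)
for row in fe:
    print([round(value, 2) for value in row])
print("df =", df)
```

The expected frequencies in each row add up to that row's observed total, which is a quick check on hand computations.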
Uses of Chi Square:
1. to test the “goodness of fit” to a normal curve, i.e., to find out whether or not a
sample distribution conforms to hypothetical/ideal distribution
Example: Tossing of a coin 10 times
fo fe
Head 4 5
Tail 6 5
2. To find out whether or not an observed proportion is equal to some given
ideal proportion
Ex. A doctor claims that a particular drug can reduce weight. Out of 30 persons
who took the drug, 18 reduced in weight. If the ideal proportion is 75%, can
we conclude that the drug is effective in reducing weight? Test at 5% alpha.
                           fo    fe
Reduced in weight          18    22.5
Did not reduce in weight   12    7.5
(fe = 0.75 x 30 = 22.5 and 0.25 x 30 = 7.5)
3. To test the independence of one variable from another variable.
Example: Is the employment of new graduates independent of the school graduated
from?
                 Status
School    Hired    Not hired    Total
SUC       175      125          300
PUC       140      60           200
______________________________________
Total     315      185          500
Ho: Hiring of new graduates is independent of school graduated from
Ha: Hiring of new graduates is dependent on the school graduated from
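For the test of independence, the chi square statistic combines the two previous ideas: expected frequencies from the table totals, then the sum of (fo − fe)²/fe over all cells. The following sketch applies this to the hiring table; the function name `chi_square_independence` is hypothetical.

```python
def chi_square_independence(table):
    """Chi square for a two-way table, with fe = (R x C) / N per cell."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, fo in enumerate(row):
            fe = row_totals[i] * col_totals[j] / grand_total
            statistic += (fo - fe) ** 2 / fe
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df

# Rows = SUC, PUC; columns = Hired, Not hired.
hiring = [[175, 125],
          [140, 60]]

chi2, df = chi_square_independence(hiring)
print(round(chi2, 3), df)
```

The computed value is compared with the chi square table value at the chosen alpha and the returned df; Ho is rejected when the computed value is equal to or greater than the table value.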
Problem Examples:
1. In a public opinion poll on attitudes towards women in the military,
113 subjects were interviewed. The question asked was
“Do you favor women in the military?” Test at 2% alpha if sex is independent
of response.
                 RESPONSE
Sex       YES    NO    DON’T KNOW    TOTAL
Male      30     20    5             55
Female    38     18    2             58
__________________________________________________
Total     68     38    7             113
3. In 100 tosses of a coin, 63 heads and 37 tails are observed. Is this a balanced coin?
Test at 1% alpha.
4. In an experiment to study the dependence of hypertension on smoking habits, the
following data were taken on 180 individuals.
                     Non smokers   Moderate smokers   Heavy smokers   Total
With hypertension    21            36                 30              87
No hypertension      48            26                 19              93
Test the hypothesis that the absence or presence of hypertension is
independent of smoking habits. Test at 2% alpha.
5. A marketing research department has divided the sales area into six districts. It is
believed that all districts have the same sales potential. The number of units sold
in each district is given below. Test whether or not the six districts have equal
sales potential. Test at 2% alpha.
District Units Sold
1 12
2 18
3 15
4 25
5 22
6 28
Ho:
Ha:
Note: Chi Square for a 2x2 table with df = (2-1)(2-1) = 1 can be found without computing fe
a    c    k
b    d    l
m    n    N
$$\chi^2 = \frac{N(ad - bc)^2}{klmn}$$
Example: Is the hiring of graduates independent of the school graduated from?
SCHOOL    HIRED      NOT HIRED    TOTAL
SUC       175 (a)    125 (c)      300 (k)
PUC       140 (b)    60 (d)       200 (l)
TOTAL     315 (m)    185 (n)      500 (N)
Ho: Hiring of graduates is independent of school graduated from
Ha: Hiring of graduates is dependent on the school graduated from
Solution:
$$\chi^2 = \frac{N(ad - bc)^2}{klmn} = \frac{500(175 \times 60 - 140 \times 125)^2}{300 \times 200 \times 315 \times 185} = 7.007$$
df = (2-1)(2-1) = 1; α = 5%; Table, App. D: χ²(.05) = 3.841
7.007 > 3.841, so Ho: Rejected, Ha: Accepted
Conclusion: Hiring of graduates is dependent on the school graduated from
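The 2x2 shortcut can be checked with a few lines of Python. This is a sketch with a hypothetical function name; the cell labels follow the layout used in these notes, where a and c are in the first row and b and d in the second.

```python
def chi_square_2x2(a, c, b, d):
    """Shortcut chi square for a 2x2 table laid out as:
           a  c | k
           b  d | l
           m  n | N
    """
    k, l = a + c, b + d      # row totals
    m, n = a + b, c + d      # column totals
    grand_total = k + l      # N
    return grand_total * (a * d - b * c) ** 2 / (k * l * m * n)

# Hiring example: SUC 175 hired, 125 not hired; PUC 140 hired, 60 not hired.
chi2_shortcut = chi_square_2x2(a=175, c=125, b=140, d=60)
table_value = 3.841  # chi square table value at 5% alpha, df = 1 (from the notes)

print(round(chi2_shortcut, 3))
print("Reject Ho" if chi2_shortcut >= table_value else "Accept Ho")
```

The shortcut applies only to 2x2 tables; larger tables need the expected frequency of each cell.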
Note: For other types of table, say 2x3, 3x3, 3x4, 4x5, 5x5
df = (row -1)(column -1)
fe =( RxC )/N
Problem Exercise
A study was conducted to determine the relationship between sales and the location
of fast food outlets.
                        Sales
Location            Low    Average    High    Total
Near School         30     25         40
Near Church         25     20         42
Near Movie House    28     35         43
ANALYSIS OF VARIANCE
(ANOVA)
Analysis of Variance is a widely used and highly developed parametric statistical
method for comparing three or more means.
Assumptions Underlying the Use of ANOVA
1. The individuals in the groups and subgroups are selected randomly from a normally
distributed population
2. The samples that constitute the groups are independent
ONE- WAY CLASSIFICATION ANOVA
Sources of Variation:
1. Variance between groups – Sum of Squares Between Groups (SSB)

2. Variance within groups - Sum of Squares Within Groups (SSW)
Illustration: Distance traveled by cars with one liter of gasoline
                  Distance traveled, kilometers
CAR      XA     XB     XC     XA²     XB²     XC²
1        12     18     6      144     324     36
2        18     17     4      324     289     16
3        16     16     14     256     256     196
4        8      18     4      64      324     16
5        6      12     6      36      144     36
6        12     17     12     144     289     144
7        10     10     14     100     100     196
_________________________________________________________________
Total    82     108    60     1068    1726    640
Mean     11.71  15.43  8.57
∑X² = 1068 + 1726 + 640 = 3434
∑X = 82 + 108 + 60 = 250
Ho: No significant difference between the distance traveled by cars with one liter of
gasoline
Ha: There is a significant difference between the distance traveled by cars with one
liter of gasoline
Steps:
1. TSS = ∑X² − (∑X)²/N
Correction Factor, C.F. = (∑X)²/N = (250)²/21 = 2976.19
TSS = 3434 − (250)²/21 = 457.81
2. SSB = Variation between groups
SSB = ∑[(∑X)² of each group]/No. of rows − C.F.
SSB = {(82)² + (108)² + (60)²}/7 − 2976.19
= 3141.14 − 2976.19 = 164.95
3. SSW = Variation within the groups
SSW = TSS − SSB = 457.81 − 164.95 = 292.86
ANOVA TABLE
Source of Variation   Sum of Squares   df                          Mean Square
Between groups        SSB = 164.95     dfB = c − 1 = 3 − 1 = 2     MSSB = SSB/dfB = 164.95/2 = 82.48
Within groups         SSW = 292.86     dfW = dfT − dfB = 20 − 2 = 18   MSSW = SSW/dfW = 292.86/18 = 16.27
Total variance        TSS = 457.81     dfT = N − 1 = 21 − 1 = 20
F-test = MSSB/MSSW
= 82.48/16.27
= 5.07
F-test is interpreted with the use of the F table
At 5% alpha, F-table value is 3.55
At 1% alpha, F-table value is 6.01
Thus, at 5% alpha, F-test = 5.07 > 3.55, Ho: Rejected, Ha: Accepted
and at 1% alpha, F-test = 5.07 < 6.01, Ho: Accepted, Ha: Rejected
Conclusion at 5% alpha:
There is a significant difference between the distance traveled by cars
with one liter of gasoline
Conclusion at 1% alpha:
There is no significant difference between the distance traveled by cars with
one liter of gasoline
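The one-way ANOVA steps in the illustration (correction factor, TSS, SSB, SSW, mean squares, and the F ratio) can be sketched in Python as a check on the hand computation. The function name `one_way_anova` and the dictionary keys are hypothetical; the data are the distances for cars A, B, and C from the illustration, and intermediate values are kept unrounded, so the last decimal place may differ slightly from a hand computation that rounds at each step.

```python
def one_way_anova(groups):
    """One-way ANOVA by the correction-factor method."""
    values = [x for group in groups for x in group]
    n_total = len(values)
    correction = sum(values) ** 2 / n_total             # C.F. = (sum X)^2 / N
    tss = sum(x * x for x in values) - correction       # total sum of squares
    ssb = sum(sum(g) ** 2 / len(g) for g in groups) - correction  # between groups
    ssw = tss - ssb                                      # within groups
    df_between = len(groups) - 1
    df_within = (n_total - 1) - df_between
    ms_between = ssb / df_between
    ms_within = ssw / df_within
    return {
        "TSS": tss, "SSB": ssb, "SSW": ssw,
        "dfB": df_between, "dfW": df_within,
        "MSSB": ms_between, "MSSW": ms_within,
        "F": ms_between / ms_within,
    }

car_a = [12, 18, 16, 8, 6, 12, 10]
car_b = [18, 17, 16, 18, 12, 17, 10]
car_c = [6, 4, 14, 4, 6, 12, 14]

result = one_way_anova([car_a, car_b, car_c])
for name, value in result.items():
    print(name, round(value, 2))

# F-table values quoted in the notes for df = (2, 18).
print("5% alpha:", "Reject Ho" if result["F"] > 3.55 else "Accept Ho")
print("1% alpha:", "Reject Ho" if result["F"] > 6.01 else "Accept Ho")
```

Dividing each group's squared total by its own size, rather than by a fixed number of rows, lets the same function handle groups of unequal size.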
Problem Exercises:
(1) Ratings of Trainees by Four Supervisors (Scale of 10)
Trainees Raters
A B C D
1 10 6 8 7
2 4 5 3 4
3 8 4 7 4
4 3 4 2 2
5 6 8 6 7
6 9 7 8 7
(2) In an experiment designed to compare the effects of coaching on the scores obtained
on an aptitude test used for entrance to a professional school, three levels of
coaching were used: none, 4 hours, and 12 hours. A random sample of 18
applicants was chosen from the population of applicants. Their scores are given
below. Test at 5% alpha.
Scores Obtained
Group None 4 Hours 12 Hours
1 30 32 35
2 27 30 33
3 26 29 32
4 24 27 30
5 22 25 27
6 20 24 26
State: Ho:
Ha:
