Introduction to Statistics
Objectives:
1. define Statistics and its concepts
2. identify the type of variables
Descriptive Statistics consists of methods for organizing and summarizing data or information.
Inferential Statistics consists of methods for drawing conclusions about a population from sample
data and measuring the reliability of those conclusions.
CHAPTER 5
TESTS OF HYPOTHESIS
Objectives
The learner should be able to:
1. illustrate null and alternative hypotheses,
2. illustrate level of significance and rejection region,
3. formulate the appropriate null and alternative hypotheses on a population mean and
proportion,
4. identify the appropriate form of the test statistic,
5. compute the test statistic value, and
6. solve problems involving tests of hypothesis.
A hypothesis test is a statistical test that is used to determine whether there is enough
evidence in a sample of data to infer that a certain condition is true for the entire population.
A hypothesis test examines two opposing hypotheses about a population:
a. null hypothesis (Ho)
b. alternative hypothesis (Ha)
Null hypothesis
It is the statement being tested. Usually the null hypothesis is a statement of "no effect" or "no
difference".
Alternative hypothesis
It is the statement you want to be able to conclude is true. A common misconception is that
statistical hypothesis tests are designed to select the more likely of two hypotheses. Instead, a test
retains the null hypothesis unless there is enough evidence (data) to support the alternative
hypothesis.
Example:
Problem: Is normal body temperature really 98.6 °F?
Solution:
Consider the population of many adults. A researcher hypothesizes that the average adult body
temperature is lower than the often-advertised 98.6 degrees F. That is, the researcher wants an answer to
the question: "Is the average adult body temperature 98.6 degrees? Or is it lower?" To answer this
research question, the researcher starts by assuming that the average adult body temperature is 98.6
degrees F.
Then, the researcher looks for evidence that refutes this initial assumption. In doing so, he
selects a random sample of 130 adults. The average body temperature of the 130 sampled adults is 98.25
degrees.
Then, the researcher uses the data he collected to make a decision about his initial assumption. It is
either likely or unlikely that the researcher would collect the evidence he did given his initial
assumption that the average adult body temperature is 98.6 degrees:
If it is likely, then the researcher does not reject his initial assumption that the average adult
body temperature is 98.6 degrees. There is not enough evidence to do otherwise.
If it is unlikely, then:
either the researcher's initial assumption is correct and he experienced a very unusual
event;
or the researcher's initial assumption is incorrect.
Types of Test
1. Two-tailed test
A test to determine whether a population parameter differs from the hypothesized value in either
direction. The null hypothesis can be rejected by observing a statistic that falls in either of the
two tails of the sampling distribution.
2. One-tailed test
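It is used when the alternative hypothesis specifies a direction, that is, when either of the
following applies:
1. the sample data are expected to come from a population whose parameter is less than the
hypothesized value (left-tailed test)
2. the sample data are expected to come from a population whose parameter is greater than the
hypothesized value (right-tailed test)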
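As a rough illustration of how the type of test changes the rejection region, the sketch below computes critical z values for a chosen level of significance. It assumes the test statistic follows the standard normal distribution; the function name critical_z and the tail labels "two", "left", and "right" are hypothetical names chosen for this example.

```python
from statistics import NormalDist


def critical_z(alpha, tail):
    """Return the critical z value(s) that bound the rejection region.

    alpha is the level of significance; tail is "two", "left", or "right".
    """
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if tail == "two":
        # alpha is split equally between the two tails
        z = std_normal.inv_cdf(1 - alpha / 2)
        return (-z, z)
    if tail == "left":
        return (std_normal.inv_cdf(alpha),)
    if tail == "right":
        return (std_normal.inv_cdf(1 - alpha),)
    raise ValueError("tail must be 'two', 'left', or 'right'")


for tail in ("two", "left", "right"):
    print(tail, [round(z, 3) for z in critical_z(0.05, tail)])
```

A two-tailed test splits the level of significance between both tails, so its critical values lie farther from zero than the single critical value of a one-tailed test at the same level.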
Note:
If we reject the null hypothesis, we do not prove that the alternative
hypothesis is true.
If we do not reject the null hypothesis, we do not prove that the null
hypothesis is true.
We merely state whether or not there is enough evidence to reject the null hypothesis.
Because of this, whatever the decision, there is always a chance that we made an error.
Type I error - when the null hypothesis is rejected even if it is true
Type II error - when the null hypothesis is not rejected even if it is false
[Figures: rejection regions for the right-tailed, left-tailed, and two-tailed tests]
Lesson 5.3.1 Hypothesis Test for a Proportion
The P-value is the probability of observing a sample statistic at least as extreme as the test
statistic, assuming the null hypothesis is true. Since the test statistic is a z-score, use the normal
distribution to find the probability associated with it.
If the standard deviation is not given, compute it from the hypothesized proportion P and the
sample size n:
σ = √[ P(1 − P) / n ]
This is a test to determine whether the difference between two proportions is significant. The
test procedure, called the two-proportion z-test, is appropriate when the following conditions are
met:
The sampling method for each population is simple random sampling.
The samples are independent.
Each sample includes at least 10 successes and 10 failures.
Each population is at least 20 times as big as its sample.
Every hypothesis test requires the analyst to state a null hypothesis and an
alternative hypothesis. The table below shows three sets of hypotheses. Each makes a
statement about the difference, d, between two population proportions, P1 and P2. (In the table,
the symbol ≠ means "not equal to".)
Set   Null hypothesis   Alternative hypothesis   Number of tails
1     P1 - P2 = 0       P1 - P2 ≠ 0              2
2     P1 - P2 ≥ 0       P1 - P2 < 0              1
3     P1 - P2 ≤ 0       P1 - P2 > 0              1
The first set of hypotheses (Set 1) is an example of a two-tailed test, since an extreme value
on either side of the sampling distribution would cause a researcher to reject the null hypothesis. The
other two sets of hypotheses (Sets 2 and 3) are one-tailed tests, since an extreme value on only one
side of the sampling distribution would cause a researcher to reject the null hypothesis.
When the null hypothesis states that there is no difference between the two population
proportions (i.e., d = 0), the null and alternative hypothesis for a two-tailed test are often stated in the
following form.
H0: P1 = P2
Ha: P1 ≠ P2
The analysis plan describes how to use sample data to decide whether or not to reject the null
hypothesis. It should specify the following elements.
Significance level. Often, researchers choose significance levels equal to 0.01, 0.05, or
0.10; but any value between 0 and 1 can be used.
Test method. Use the two-proportion z-test to determine whether the hypothesized difference
between population proportions differs significantly from the observed sample difference.
Using sample data, complete the following computations to find the test statistic and its
associated P-value.
Pooled sample proportion. Since the null hypothesis states that P1 = P2, we use a
pooled sample proportion (p) to compute the standard error of the sampling distribution.
p = (p1 ∙ n1 + p2 ∙ n2) / (n1 + n2)
where p1 and p2 are the sample proportions and n1 and n2 are the sample sizes.
Standard Error. Compute the standard error (SE) of the sampling distribution of the
difference between two proportions.
SE = √{ p ∙ (1 − p) ∙ [ (1/n1) + (1/n2) ] }
Test Statistic. The test statistic is a z-score (z) defined by the following equation.
z = (p1 − p2) / SE
P-value. The P-value is the probability of observing a sample statistic at least as extreme as
the test statistic, assuming the null hypothesis is true.
4. Interpret results.
If the sample findings are unlikely, given the null hypothesis, the researcher rejects the
null hypothesis. Typically, this involves comparing the P-value to the significance level,
and rejecting the null hypothesis when the P-value is less than the significance
level.
Example:
Problem: Suppose a drug company develops a new drug designed to prevent colds. The company
states that the drug is equally effective for men and women. To test this claim, they choose a
simple random sample of 100 women and 200 men from a population of 100,000 volunteers.
At the end of the study, 38% of the women caught a cold, and 51% of the men caught a cold.
Based on these findings, can we reject the company's claim that the drug is equally effective
for men and women? Use a 0.05 level of significance.
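Solution:
Step 1: State the hypotheses.
H0: P1 = P2
Ha: P1 ≠ P2
Note: These hypotheses constitute a two-tailed test. The null hypothesis will be rejected if the
proportion from population 1 is too big or if it is too small.
Step 2: Formulate an analysis plan. The significance level is 0.05. The test method is a
two-proportion z-test.
Step 3: Analyze sample data. Let sample 1 be the women (p1 = 0.38, n1 = 100) and sample 2 be the
men (p2 = 0.51, n2 = 200).
Pooled sample proportion:
p = [ (0.38)(100) + (0.51)(200) ] / (100 + 200)
p = 140 / 300
p = 0.467
Standard error:
SE = √[ (0.467)(0.533)(1/100 + 1/200) ]
SE = √[ (0.2489)(0.015) ]
SE = √0.00373
SE = 0.061
Z score:
z = (0.38 − 0.51) / 0.061
z = -2.13
Since we have a two-tailed test, the P-value is the probability that the z-score is less than
-2.13 or greater than 2.13. Thus, the P-value = 0.0166 + 0.0166 = 0.0332.
Note: From Table 3, when z = -2.13, P-value = 0.0166
Step 4: Interpret results. Since the P-value (0.0332) is less than the significance level (0.05),
we reject the null hypothesis that the drug is equally effective for men and women.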
Note: If you use this approach on an exam, you may also want to mention why this
approach is appropriate. Specifically, the approach is appropriate because the sampling
method was simple random sampling, the samples were independent, each population
was at least 20 times larger than its sample, and each sample included at least 10
successes and 10 failures.
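The computations in this example can be checked with a short script. The sketch below is only an illustration, assuming a two-tailed two-proportion z-test with a pooled proportion; the function name two_proportion_z_test is a hypothetical name chosen here, and the normal probabilities come from Python's standard library rather than from Table 3.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(p1, n1, p2, n2):
    """Two-tailed two-proportion z-test using a pooled sample proportion."""
    # Pooled sample proportion: p = (p1*n1 + p2*n2) / (n1 + n2)
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    # Standard error of the difference between the two proportions
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    # Test statistic
    z = (p1 - p2) / se
    # Two-tailed P-value: area in both tails beyond |z|
    p_value = 2 * NormalDist().cdf(-abs(z))
    return pooled, se, z, p_value


# Women: 38% of 100 caught a cold; men: 51% of 200 caught a cold
pooled, se, z, p_value = two_proportion_z_test(0.38, 100, 0.51, 200)
print(round(pooled, 3), round(se, 3), round(z, 2), round(p_value, 3))
print("reject Ho" if p_value < 0.05 else "do not reject Ho")
```

Because the script keeps the unrounded z-score, its P-value (about 0.033) can differ slightly in the last digit from the value obtained by rounding z to -2.13 before reading the table.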
Hypothesis Test of a Mean will be conducted when the following conditions are met:
The sampling method is simple random sampling.
The sampling distribution is normal or nearly normal.
We can say that the sampling distribution will be approximately normally distributed if any
of the following conditions apply :
The population distribution is normal.
The population distribution is symmetric, unimodal, without outliers, and the sample
size is 15 or less.
The population distribution is moderately skewed, unimodal, without outliers, and the sample
size is between 16 and 40.
The sample size is greater than 40, without outliers.
State the hypotheses. Every hypothesis test requires the analyst to state a null
hypothesis and an alternative hypothesis. The table below may be used.
Formulate an Analysis Plan. The analysis plan describes how to use sample data to
decide whether or not to reject the null hypothesis. It should specify the following elements.
Test method. Use the one-sample t-test to determine whether the
hypothesized mean differs significantly from the observed sample mean.
Analyze Sample Data. Using sample data, conduct a one-sample t-test. This involves:
Standard Error. When the population size is much larger (at least 20 times larger) than the
sample size, the standard error (SE) can be approximated by:
SE = s / √n
where s is the sample standard deviation and n is the sample size.
Degrees of Freedom. The degrees of freedom (DF) are equal to the sample
size (n) minus one.
DF = n - 1
Test Statistic. The test statistic is a t statistic (t) defined by the following equation.
t = (x̄ − µ) / SE
where x̄ is the sample mean and µ is the hypothesized population mean.
P-value. The P-value is the probability of observing a sample statistic at least as extreme as
the test statistic, assuming the null hypothesis is true.
Interpret Results. This involves comparing the P-value to the significance level, and
rejecting the null hypothesis when the P-value is less than the significance level.
Example:
Problem :
An elementary school has 1000 students. The principal of the school thinks that the average IQ of
students at Bon Air is at least 110. To prove her point, she administers an IQ test to 20 randomly
selected students. Among the sampled students, the average IQ is 108 with a standard deviation of 10.
Based on these results, should the principal accept or reject her original hypothesis? Assume a
significance level of 0.01. (Assume that test scores in the population of students are normally
distributed.)
Solution:
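Step 1: State the hypotheses. The first step is to state the null hypothesis and an alternative
hypothesis.
Ho: µ ≥ 110
Ha: µ < 110
Note: The hypotheses constitute a one-tailed test. The null hypothesis will be rejected if the
sample mean is too small.
Step 2: Formulate an analysis plan. The significance level is 0.01. The test method is a
one-sample t-test.
Step 3: Analyze sample data.
Standard error:
SE = s / √n
SE = 10 / √20
SE = 2.236
Degrees of freedom:
DF = n - 1 = 20 - 1 = 19
t-test statistic (t):
t = (x̄ − µ) / SE
t = (108 − 110) / 2.236
t = -0.894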
The observed sample mean produced a t - test statistic of -0.894. The P(t<-0.894) =
0.19. This means we would expect to find a sample mean of 108 or smaller in 19 percent of our
samples, if the true population IQ were 110. Thus the P-value in this analysis is 0.19
Step 4: Interpret results. Since the P-value (0.19) is greater than the significance level (0.01), we
cannot reject the null hypothesis.
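The steps of this example can be sketched in a few lines of code. This is an illustration only: the function names are hypothetical, and because Python's standard library has no t-distribution, the left-tail probability is approximated here by numerically integrating the t density with Simpson's rule instead of reading a t table.

```python
from math import gamma, pi, sqrt


def one_sample_t(sample_mean, hypothesized_mean, sample_sd, n):
    """Return the standard error, degrees of freedom, and t statistic."""
    se = sample_sd / sqrt(n)                  # SE = s / sqrt(n)
    df = n - 1                                # DF = n - 1
    t = (sample_mean - hypothesized_mean) / se
    return se, df, t


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coeff = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coeff * (1 + x * x / df) ** (-(df + 1) / 2)


def t_cdf(t, df, steps=2000):
    """Approximate P(T < t) with Simpson's rule on [0, |t|] (steps is even)."""
    h = abs(t) / steps
    total = t_pdf(0.0, df) + t_pdf(abs(t), df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3                      # area between 0 and |t|
    return 0.5 + area if t >= 0 else 0.5 - area


se, df, t = one_sample_t(108, 110, 10, 20)
p_value = t_cdf(t, df)                        # left-tailed test: P(T < t)
print(round(se, 3), df, round(t, 3), round(p_value, 2))
print("reject Ho" if p_value < 0.01 else "cannot reject Ho")
```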
This lesson explains how to conduct a hypothesis test for the difference between two means.
The test procedure, called the two-sample t-test, is appropriate when the following conditions are
met:
The sampling method for each sample is simple random sampling.
The samples are independent.
Each population is at least 20 times larger than its respective sample.
A Z-test is any statistical test for which the distribution of the test statistic under the null
hypothesis can be approximated by a normal distribution. It is used for testing hypotheses when the:
1. population standard deviation is known
2. sample size is at least 30
The one-sample z-test is used when there is only one sample in the experiment and both the standard
deviation and the mean of the population are known. Likewise, the two-sample z-test is used to compare
the population means between two groups.
When the data are normally distributed we shall follow the following steps:
Example 1:
Problem:
A researcher reported that the mean grade of Grade 11 students in statistics was 84%. A random
sample of 100 students showed a mean of 87% with a standard deviation of 4%. Is there a significant
difference between the reported mean and the sample mean of the grades of Grade 11 students?
Use α = 0.05.
This problem may be solved in two ways: using either a two-tailed test or a one-tailed test.
The two-tailed test is shown first.
Step 1: Formulate the null hypothesis and the alternative hypothesis.
Ho: µ = 84
Ha: µ ≠ 84
Step 2: Specify the level of significance, decide whether a two-tailed test or a one-tailed test
(right-tailed or left-tailed) shall be used, decide the test statistic to be used, and find the
critical value from Table 4.
Step 3: Compute the value of the test statistic.
z = (x̄ − µ)√n / s
µ = 84
x̄ = 87
n = 100
s = 4
z = (87 − 84)√100 / 4
z = 7.5
Step 4: Graph the computed z value and critical value, and make a decision.
Note: The shaded part is the rejection region and the unshaded part is the
acceptance region.
Since the computed z is located in the rejection region, the null hypothesis is
rejected.
Using a one-tailed test:
Step 1: Formulate the null hypothesis and the alternative hypothesis.
Ho: µ = 84
Ha: µ > 84 (since the sample mean is 87)
Step 2: Specify the level of significance, decide whether a two-tailed test or a one-tailed test
(right-tailed or left-tailed) shall be used, decide the test statistic to be used, and find the
critical value from Table 4.
Step 3: Compute the value of the test statistic.
z = (x̄ − µ)√n / s
µ = 84
x̄ = 87
n = 100
s = 4
z = (87 − 84)√100 / 4
z = 7.5
Step 4: Graph the computed z value and critical value, and make a decision.
Note: The shaded part is the rejection region and the unshaded part is the
acceptance region.
Since the computed z is located in the rejection region, the null hypothesis is
rejected.
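Both solutions of Example 1 can be reproduced with a short script. The sketch below is illustrative only: the function names one_sample_z and is_rejected are hypothetical, and the critical values are computed from the standard normal distribution in Python's standard library instead of being read from Table 4.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z(sample_mean, hypothesized_mean, sd, n):
    """Test statistic z = (x_bar - mu) * sqrt(n) / s."""
    return (sample_mean - hypothesized_mean) * sqrt(n) / sd


def is_rejected(z, alpha, tail):
    """Return True when z falls in the rejection region."""
    std_normal = NormalDist()
    if tail == "two":
        return abs(z) > std_normal.inv_cdf(1 - alpha / 2)
    if tail == "right":
        return z > std_normal.inv_cdf(1 - alpha)
    if tail == "left":
        return z < std_normal.inv_cdf(alpha)
    raise ValueError("tail must be 'two', 'left', or 'right'")


z = one_sample_z(87, 84, 4, 100)
print("z =", z)
print("two-tailed, reject Ho:", is_rejected(z, 0.05, "two"))
print("right-tailed, reject Ho:", is_rejected(z, 0.05, "right"))
```

With z = 7.5, the statistic lies far beyond the critical value in either version of the test, which is why both solutions reach the same decision.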
Example 2:
Problem :
A supermarket owner believes that the mean family income of its customers is Php 45,000
per month. Forty-nine customers were randomly selected and asked their family income. The sample
mean was Php 42,200 per month and the standard deviation was Php 2,800. Is there enough evidence to
say that the mean family income differs from Php 45,000 per month at the 1% significance level?
Solution:
Ho: µ = 45,000
Ha: µ ≠45,000
Step 2: Specify the level of significance, decide whether a two-tailed test or a one-tailed test
(right-tailed or left-tailed) shall be used, decide the test statistic to be used, and find the
critical value from Table 4.
Critical value: ±2.575 (taken from the tabular z values in Table 4)
Step 3: Compute the value of the test statistic.
z = (x̄ − µ)√n / s
µ = 45,000
x̄ = 42,200
n = 49
s = 2,800
z = (42,200 − 45,000)√49 / 2,800
z = -7
Step 4: Graph the z value and critical value, and make a decision.
Note: The shaded part is the rejection region and the unshaded part is the
acceptance region.
Since the computed z is located in the rejection region, the null hypothesis is
rejected.
Example 3:
Problem :
The average lifetimes of 120 Brand X AA batteries and 120 Brand Y AA batteries were found to
be 9.1 hours and 9.6 hours, respectively. Suppose the population standard deviations of the lifetimes
are 1.9 hours for Brand X and 2.1 hours for Brand Y batteries. Test the hypothesis that the two
brands have the same mean lifetime using α = 0.05.
Ho: µ1 = µ2
Ha: µ1≠ µ2
Step 2: Specify the level of significance, decide whether a two-tailed test or a one-tailed test
(right-tailed or left-tailed) shall be used, decide the test statistic to be used, and find the
critical value from Table 4.
Tailed test: two-tailed test
Test statistic: Z-test
Critical value: ±1.96 (taken from the tabular z values in Table 4)
Step 3: Compute the value of the test statistic.
x̄1 = 9.1
x̄2 = 9.6
n1 = 120
n2 = 120
s1 = 1.9
s2 = 2.1
z = (x̄1 − x̄2) / √[ (s1)²/n1 + (s2)²/n2 ]
z = (9.1 − 9.6) / √[ (1.9)²/120 + (2.1)²/120 ]
z = -0.5 / √(0.03 + 0.037)
z = -1.93
Step 4: Graph the computed z value and critical value, and make a decision.
z = -1.93
Note: The shaded part is the rejection region and the unshaded part is the
acceptance region.
Since the computed z is located in the acceptance region, the null hypothesis is
not rejected.
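The two-sample z computation above can be verified with the following sketch. It is an illustration under the assumptions of this example (two independent samples with known standard deviations); the function name two_sample_z is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_z(mean1, mean2, sd1, sd2, n1, n2):
    """z statistic for the difference between two sample means."""
    # Standard error of the difference between the two means
    se = sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return (mean1 - mean2) / se


# Brand X: mean 9.1 h, sd 1.9 h; Brand Y: mean 9.6 h, sd 2.1 h; 120 each
z = two_sample_z(9.1, 9.6, 1.9, 2.1, 120, 120)
critical = NormalDist().inv_cdf(1 - 0.05 / 2)  # two-tailed, alpha = 0.05

print("z =", round(z, 2))
print("critical values = ±", round(critical, 2))
print("reject Ho" if abs(z) > critical else "do not reject Ho")
```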
Step 2: Specify the level of significance, decide whether a two-tailed test or a one-tailed test
(right-tailed or left-tailed) shall be used, decide the test statistic to be used, and find the
critical value from Table 4.
Step 3: Compute the value of the test statistic.
x̄1 = 9.1
x̄2 = 9.6
n1 = 120
n2 = 120
s1 = 1.9
s2 = 2.1
z = (9.1 − 9.6) / √[ (1.9)²/120 + (2.1)²/120 ]
z = -0.5 / √(0.03 + 0.037)
z = -1.93
Step 4: Graph the computed z value and critical value, and make a decision.
z = -1.93
Note: The shaded part is the rejection region and the unshaded part is the
acceptance region.
A t-test is a statistical test used to compare population means. A t-test with two samples is
commonly used with small sample sizes, testing the difference between the samples when the
variances of two normal distributions are not known.
A t-test uses the t-statistic, the t-distribution, and the degrees of freedom to determine the
probability of a difference between populations; the test statistic in the test is known as the
t-statistic.
We may use the t-test if
the population variance is unknown and therefore has to be estimated from the
sample itself
the sample size is less than 30 (n < 30)
To compute t-test, we shall follow the following steps:
Example 1:
Problem:
A chemical company alleged that the average weight of its bag of chemical is 50 kg. A sample of
25 bags was taken and revealed a mean weight of 48.1 kg with a standard deviation of 0.9 kg. If
the significance level is 1%, is there a significant difference between the alleged weight and the
sample mean weight of the chemical bags?
Step 2: Specify the level of significance, decide whether a two-tailed test or a one-tailed test
(right-tailed or left-tailed) shall be used, decide the test statistic to be used, find the
degrees of freedom, and find the critical value from Table 5.
Step 3: Compute the value of the test statistic.
x̄ = 48.1
µ = 50
n = 25
s = 0.9
t = (48.1 − 50)√25 / 0.9
t = -10.56
Step 4: Graph the computed t value and critical value, and make a decision.
Note: The shaded part is the rejection region and the unshaded part is the
acceptance region.
Since the computed t is located in the rejection region, the null hypothesis is
rejected.
Step 2: Specify the level of significance, decide whether a two-tailed test or a one-tailed test
(right-tailed or left-tailed) shall be used, decide the test statistic to be used, find the
degrees of freedom, and find the critical value from Table 5.
Step 3: Compute the value of the test statistic.
x̄ = 48.1
µ = 50
n = 25
s = 0.9
t = (48.1 − 50)√25 / 0.9
t = -10.56
Step 4: Graph the computed t value and critical value, and make a decision.
Note: The shaded part is the rejection region and the unshaded part is the
acceptance region.
Since the computed t is located in the rejection region, the null hypothesis is
rejected.
Example 2:
Problem :
A study was conducted to examine the relationship between attitudes towards mathematics
and success at college-level mathematics. Twenty-two men and twenty women were identified as being
at high risk of failure. The students were asked to respond to a series of questions, and their answers
were used to obtain a math anxiety score. Summary values appear in the table below. Test the
hypothesis using a 0.05 level of significance.
Gender   n    x̄      s
Male     22   40.8   9.3
Female   20   37.5   10.2
Step 2: Specify the level of significance, decide whether a two-tailed test or a one-tailed test
(right-tailed or left-tailed) shall be used, decide the test statistic to be used, find the
degrees of freedom, and find the critical value from Table 5.
df = (n1 + n2) − 2
= (22 + 20) − 2
= 42 − 2
= 40
Critical value: +1.684 (taken from Table 5; the positive value is used since it is a
right-tailed test)
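Step 3: Compute the value of the test statistic.
x̄1 = 40.8
x̄2 = 37.5
n1 = 22
n2 = 20
s1 = 9.3
s2 = 10.2
t = (x̄1 − x̄2) / √{ [ (n1 − 1)(s1)² + (n2 − 1)(s2)² ](n1 + n2) / [ (n1 + n2 − 2)(n1)(n2) ] }
t = (40.8 − 37.5) / √{ [ (22 − 1)(9.3)² + (20 − 1)(10.2)² ](22 + 20) / [ (22 + 20 − 2)(22)(20) ] }
t = 1.1
Step 4: Graph the computed t value and critical value, and make a decision.
t = 1.1
Since the computed t is located in the acceptance region, the null hypothesis is
not rejected.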
Step 2: Specify the level of significance, decide whether a two-tailed test or a one-tailed test
(right-tailed or left-tailed) shall be used, decide the test statistic to be used, find the
degrees of freedom, and find the critical value from Table 5.
df = (n1 + n2) − 2
= (22 + 20) − 2
= 42 − 2
= 40
Critical value: ±2.021 (taken from Table 5; one is positive and the other
is negative since it is a two-tailed test)
Step 3: Compute the value of the test statistic.
x̄1 = 40.8
x̄2 = 37.5
n1 = 22
n2 = 20
s1 = 9.3
s2 = 10.2
t = (40.8 − 37.5) / √{ [ (22 − 1)(9.3)² + (20 − 1)(10.2)² ](22 + 20) / [ (22 + 20 − 2)(22)(20) ] }
t = 1.1
Step 4: Graph the computed t value and critical value, and make a decision.
t = 1.1
Note: The shaded part is the rejection region and the unshaded part is the
acceptance region.
Since the computed t is located in the acceptance region, the null hypothesis is
not rejected.
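The pooled two-sample t statistic used in this example can be checked with a short script. The sketch below assumes equal population variances (which is what pooling implies); the function name pooled_two_sample_t is hypothetical, and the critical values are the ones quoted from Table 5 in the solution.

```python
from math import sqrt


def pooled_two_sample_t(mean1, sd1, n1, mean2, sd2, n2):
    """Return the pooled-variance t statistic and its degrees of freedom."""
    df = n1 + n2 - 2
    # Pooled variance: weighted average of the two sample variances
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    # Standard error of the difference between the two sample means
    se = sqrt(pooled_var * (n1 + n2) / (n1 * n2))
    return (mean1 - mean2) / se, df


# Male: n = 22, mean 40.8, s = 9.3; Female: n = 20, mean 37.5, s = 10.2
t, df = pooled_two_sample_t(40.8, 9.3, 22, 37.5, 10.2, 20)
print("t =", round(t, 1), "df =", df)

# Critical values quoted in the solution (Table 5, df = 40, alpha = 0.05)
print("right-tailed:", "reject Ho" if t > 1.684 else "do not reject Ho")
print("two-tailed:", "reject Ho" if abs(t) > 2.021 else "do not reject Ho")
```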
CHAPTER TEST
Apply the appropriate test hypothesis steps and procedures for the following research
problems.
1. A supermarket owner believes that the mean income of its customers is Php50,000 per month.
One hundred customers are randomly selected and asked of their monthly income. The sample mean is
Php48,500 per month and the standard deviation is Php3,200. Is there sufficient evidence to indicate
that the mean income of the customers of the supermarket is Php50,000 per month? Use α = 0.05.
2. It is reported that the average monthly salary of accounting graduates in the accounting field is
Php18,000. A dean of a certain university conducted a survey of 60 accounting graduates and found
their average salary to be Php20,500 per month with a standard deviation of Php1,500 per month. Using
α = 0.05, is there a significant difference between the reported and the surveyed salaries of the
accounting graduates?
3. A prospective MBA student was made to estimate the difference in the monthly salaries of
professors in private and state colleges. He claimed that the difference in the starting salaries of
MBA graduates of the two colleges was relevant. An independent study of simple random
samples of the most recent MBA graduates of both colleges revealed the following statistics:
4. A distributor claims that the average strength of the brand X rope exceeds the average strength of
the brand Y rope. To test its claim, 25 pieces of each brand are tested under similar conditions.
Brand X had an average strength of 90.7 kilograms with a standard deviation of 7.82 kilograms,
while brand Y had an average strength of 93.7 kilograms with a standard deviation of 6.75
kilograms. Test whether the claim of the distributor is correct at the 5% level of significance.
5. Job satisfaction as a function of a work schedule was investigated in different factories. In the first
factory, the employees are on fixed shift system while in the second factory, the workers have
rotating shift system. Using the data in the table below, determine if there is a significant
difference in job satisfaction between the two groups of workers. Use α = 0.01.
CHAPTER 6
Objectives
The learner should be able to
1. construct a scatter plot,
2. describe shape, trend, and variation based on scatter plot,
3. estimate strength of association between the variables based on scatter plot,
4. calculate the Pearson’s sample correlation coefficient,
5. solve problems involving correlation analysis,
6. identify the independent and dependent variables,
7. draw the best-fit line on a scatter plot,
8. calculate the slope and y-intercept of the regression line,
9. predict the value of the dependent variable given the value of the independent
variable, and
10. solve problems involving regression analysis.
Correlation analysis is used to quantify the association between two continuous variables
(e.g., between an independent and a dependent variable or between two independent variables).
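To make the idea of quantifying an association concrete, the sketch below computes Pearson's sample correlation coefficient, which the chapter objectives mention. The data (hours of revision and test marks for five students) are made up for illustration, and the function name pearson_r is hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's sample correlation coefficient for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares of deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical data: hours of revision and test marks for five students
hours = [1, 2, 3, 4, 5]
marks = [52, 58, 61, 70, 74]

r = pearson_r(hours, marks)
print("r =", round(r, 2))
```

A value of r close to +1 indicates a strong positive linear association, a value close to -1 a strong negative one, and a value near 0 little or no linear association.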
Regression analysis is a related technique to assess the relationship between an outcome
variable and one or more risk factors or confounding variables.
The outcome variable is also called the response or dependent variable and the risk
factors and confounders are called the predictors, or explanatory or independent variables.
In regression analysis, the dependent variable is denoted "y" and the independent
variables are denoted by "x".
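As a small sketch of how y is related to x in simple linear regression, the code below computes the slope and y-intercept of the least-squares line and uses them to predict a value of the dependent variable. The data are made up for illustration, and the names fit_line and predict are hypothetical.

```python
def fit_line(xs, ys):
    """Return the slope and y-intercept of the least-squares regression line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x  # the line passes through the means
    return slope, intercept


def predict(slope, intercept, x):
    """Predicted value of the dependent variable y for a given x."""
    return intercept + slope * x


# Hypothetical data: x = hours of revision, y = test mark
hours = [1, 2, 3, 4, 5]
marks = [52, 58, 61, 70, 74]

slope, intercept = fit_line(hours, marks)
print("slope =", round(slope, 2), "intercept =", round(intercept, 2))
print("predicted mark for 6 hours =", round(predict(slope, intercept, 6), 1))
```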
In some cases, the measurement scale for data is ordinal, but the variable is treated as
continuous. For example, a Likert scale that contains five values - strongly agree, agree, neither
agree nor disagree, disagree, and strongly disagree - is ordinal. However, where a Likert scale
contains seven or more values - strongly agree, moderately agree, agree, neither agree nor disagree,
disagree, moderately disagree, and strongly disagree - the underlying scale is sometimes treated as
continuous (although whether you should do this is a cause of great dispute).
It is worth noting that how we categorize variables is somewhat of a choice. Whilst we
categorized gender as a dichotomous variable (you are either male or female), social scientists may
disagree with this, arguing that gender is a more complex variable involving more than two distinctions
and also including measurement levels like genderqueer, intersex and transgender. At the same time,
some researchers would argue that a Likert scale, even with seven values, should never be treated as a
continuous variable.
Dependent and Independent Variables
Independent Variable
Sometimes called an experimental or predictor variable, it is a variable that is being manipulated
in an experiment in order to observe the effect on a dependent variable.
Dependent Variable
It is sometimes called an outcome variable. The dependent variable is simply a variable that
is dependent on an independent variable(s).
All experiments examine some kind of variable(s). A variable is not only something that we
measure, but also something that we can manipulate and something we can control for.
The dependent variable is just what the name suggests; it depends upon some factor that you,
the researcher, control. For example:
Whatever event you are expecting to change is always the dependent variable. In
the first example above, race performance is the variable you would expect to change if you changed
your training, so that is the dependent variable. In the second example, the dependent variable is
weight, and in the third example the dependent variable is the amount earned.
Example:
Imagine that a tutor asks 100 students to complete a math test. The tutor wants to know why
some students perform better than others. Whilst the tutor does not know the answer to this, she thinks
that it might be because of two reasons: (1) some students spend more time revising for their test; and
(2) some students are naturally more intelligent than others. As such, the tutor decides to investigate the
effect of revision time and intelligence on the test performance of the 100 students. The dependent and
independent variables for the study are:
Dependent Variable: Test Mark (measured from 0 to 100)
Independent Variables: Revision time (measured in hours) Intelligence (measured
using IQ score)
Categorical variables are also known as discrete or qualitative variables. Categorical variables
can be further categorized as nominal, ordinal or dichotomous.
• Nominal variables are variables that have two or more categories, but which do not have an
intrinsic order. For example, a real estate agent could classify their types of property into
distinct categories such as houses, condos, co-ops or bungalows. So
"type of property" is a nominal variable with 4 categories called houses, condos, co-ops and
bungalows. Of note, the different categories of a nominal variable can also be referred to as
groups or levels of the nominal variable. Another example of a nominal variable would be
classifying where people live in the USA by state. In this case there will be many more levels of
the nominal variable (50 in fact).
• Dichotomous variables are nominal variables which have only two categories or levels. For
example, if we were looking at gender, we would most probably categorize somebody as either
"male" or "female". This is an example of a dichotomous variable (and also a nominal variable).
Another example might be if we asked a person if they owned a mobile phone. Here, we may
categorise mobile phone ownership as either "Yes" or "No". In the real estate agent example, if
type of property had been classified as either residential or commercial then "type of property"
would be a dichotomous variable.
• Ordinal variables are variables that have two or more categories just like nominal variables only
the categories can also be ordered or ranked. So if you asked someone if they liked the policies
of the Democratic Party and they could answer either "Not very much", "They are OK" or "Yes,
a lot" then you have an ordinal variable. Why? Because you have 3 categories, namely "Not
very much", "They are OK" and "Yes, a lot" and you can rank them from the most positive
(Yes, a lot), to the middle response (They are OK), to the least positive (Not very much).
However, whilst we can rank the levels, we cannot place a "value" to them; we cannot say that
"They are OK" is twice as positive as "Not very much" for example.
Continuous variables are also known as quantitative variables. Continuous variables can be
further categorized as either interval or ratio variables.
• Interval variables are variables for which their central characteristic is that they can be
measured along a continuum and they have a numerical value (for example, temperature
measured in degrees Celsius or Fahrenheit). So the difference between 20C and 30C is the same
as 30C to 40C. However, temperature measured in degrees Celsius or Fahrenheit is NOT a
ratio variable.
• Ratio variables are interval variables, but with the added condition that 0 (zero) of the
measurement indicates that there is none of that variable. So, temperature measured in
degrees Celsius or Fahrenheit is not a ratio variable because 0C does not mean there is no
temperature. However, temperature measured in Kelvin is a ratio variable as 0 Kelvin (often
called absolute zero) indicates that there is no temperature whatsoever. Other examples of ratio
variables include height, mass, distance and many more. The name "ratio" reflects the fact that
you can use the ratio of measurements. So, for example, a distance of ten meters is twice the
distance of 5 meters.
In the tutor example, the dependent variable is the exam mark (measured from 0 to 100), and the
independent variables are revision time (measured in hours) and intelligence (measured using IQ
score). Here, it would be possible to use an experimental design and manipulate the revision time of
the students. The tutor could divide the students into two groups, each made up of 50 students. In
"group one", the tutor could ask the students not to do any revision. Alternatively, "group two"
could be asked to do 20 hours of revision in the two weeks prior to the test. The tutor could then
compare the marks that the students achieved.
Bivariate analysis means the analysis of bivariate data. It is one of the simplest forms of
statistical analysis used to determine if there is a relationship between two sets of values. It usually
involves the variables X and Y.
The results from bivariate analysis can be stored in a two-column data table.
Example:
You might want to find out the relationship between the age of the students and their academic
achievement. The age would be your independent variable, X and the academic achievement would be
your dependent variable, Y.
Lesson 6.3 Scatter plot
Scatter Plot
It is a type of plot or mathematical diagram using Cartesian coordinates to display values for
typically two variables for a set of data. If the points are color-coded, one additional variable can be
displayed. The data are displayed as a collection of points, each having the value of one variable
determining the position on the horizontal axis and the value of the other variable determining the
position on the vertical axis.
A scatter plot can be used either when one continuous variable is under the control of the
experimenter and the other depends on it, or when both continuous variables are independent. If a
parameter exists that is systematically incremented and/or decremented by the experimenter, it is called the
control parameter or independent variable and is customarily plotted along the horizontal axis. The
measured or dependent variable is customarily plotted along the vertical axis.
A scatter plot can suggest various kinds of correlations between variables with a certain
confidence interval.
Example:
Plotting Weight vs. Height. Weight would be on the y-axis and height would be on the x-axis.
Correlations may be positive (rising), negative (falling), or null (uncorrelated).
Pattern of dots
Positive correlation if the pattern of dots slopes from lower left to upper right.
Negative correlation if the pattern of dots slopes from upper left to lower right.
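The direction of the pattern can also be checked numerically instead of by eye. The sketch below is only an illustration: the function name `direction` and the data points are made up for this example. It classifies paired data by the sign of the sum of the products of the deviations from the two means, which is positive for a rising pattern and negative for a falling one.

```python
def direction(xs, ys):
    """Classify the linear trend of paired data by the sign of the
    sum of products of deviations from the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if s_xy > 0:
        return "positive"
    if s_xy < 0:
        return "negative"
    return "null"


# Hypothetical data, for illustration only
print(direction([1, 2, 3, 4], [2, 4, 5, 8]))  # dots rise from lower left to upper right
print(direction([1, 2, 3, 4], [9, 7, 4, 1]))  # dots fall from upper left to lower right
```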
A line of best fit (or "trend" line) is a straight line that best represents the data on a scatter
plot. This line may pass through some of the points, none of the points, or all of the points. A line of
best fit can be drawn in order to study the relationship between the variables. An equation for the
correlation between the variables can be determined by established best-fit procedures. For a linear
correlation, the best-fit procedure is known as linear regression and is guaranteed to generate a correct
solution in a finite time. No universal best-fit procedure is guaranteed to generate a correct solution for
arbitrary relationships. A scatter plot is also very useful when we wish to see how two comparable data
sets agree with each other. In this case, an identity line, i.e., a y = x line, or a 1:1 line, is often
drawn as a reference. The more the two data sets agree, the more the points tend to concentrate in the
vicinity of the identity line; if the two data sets are numerically identical, the points fall on the identity
line exactly.
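For a linear relationship, the best-fit line can be computed directly rather than drawn by eye. The following is a minimal sketch of the usual least-squares calculation, assuming the line is written as y = a + bx; the function name `least_squares_line` and the sample points are hypothetical.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sum of squares of the x deviations
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical points that lie exactly on y = 2x + 1
slope, intercept = least_squares_line([0, 1, 2, 3], [1, 3, 5, 7])
print(slope, intercept)
```

Because these sample points lie exactly on a line, the fitted line passes through all of them; with real data it will usually pass through only some of the points or none.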
Lesson 6.5 Pearson’s Correlation Coefficient
Correlation between sets of data is a measure of how well they are related. The most common
measure of correlation in statistics is the Pearson Correlation. The full name is the Pearson Product
Moment Correlation, or PPMC. It shows the linear relationship between two sets of data. In
simple terms, it answers the question: Can I draw a line graph to represent the data? Two letters
are used to represent the Pearson correlation: the Greek letter rho (ρ) for a population and the letter “r”
for a sample. It tells you whether there is a linear relationship between the variables. To compute the
value of the Pearson Correlation, we have the formulas:
Formula 1:
Formula 2:
$$ r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \, \sum (y - \bar{y})^2}} $$
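Formula 2 divides the sum of the products of the deviations by the square root of the product of the two sums of squared deviations. A minimal Python sketch follows. The function names are illustrative, and the second function uses the common raw-score form built from Σx, Σy, Σxy, Σx² and Σy², which is assumed here to be the form that the column sums in the worked examples feed into. The data are the age and glucose values from Example 1 of this lesson.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's r from deviations about the means (Formula 2)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)


def pearson_r_raw(xs, ys):
    """Pearson's r from raw sums (assumed raw-score form)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


ages = [40, 25, 36, 45, 50, 61]
glucose = [98, 59, 83, 70, 90, 85]
print(round(pearson_r(ages, glucose), 4))      # 0.5108
print(round(pearson_r_raw(ages, glucose), 4))  # 0.5108
```

The two forms are algebraically equivalent, so either one can be used; the raw-score form is usually more convenient for hand computation with a table.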
The results will be between -1 and 1. You will rarely see exactly 0, -1 or 1 as a result. You’ll get a
number somewhere in between those values. The closer the value of r gets to zero, the greater the
variation of the data points around the line of best fit. To interpret the obtained results, the table below
may be used.
Example 1:
Problem :
Researchers want to know if there is a significant relationship between a person's age and their
glucose level. They used six (6) persons as their sample and obtained the data below.
Solution :
Step 1: Make a chart. Use the given data, and add three more columns: xy, x², and y².
Sample   Age (x)   Glucose Level (y)   xy      x²      y²
1        40        98
2        25        59
3        36        83
4        45        70
5        50        90
6        61        85
Step 2: Multiply x and y together to fill the xy column. For example, row 1 would be 40 × 98 = 3920.
Sample   Age (x)   Glucose Level (y)   xy      x²      y²
1        40        98                  3920
2        25        59                  1475
3        36        83                  2988
4        45        70                  3150
5        50        90                  4500
6        61        85                  5185
Step 3: Take the square of the numbers in the x column, and put the result in the x² column.
Sample   Age (x)   Glucose Level (y)   xy      x²      y²
1        40        98                  3920    1600
2        25        59                  1475    625
3        36        83                  2988    1296
4        45        70                  3150    2025
5        50        90                  4500    2500
6        61        85                  5185    3721
Step 4: Take the square of the numbers in the y column, and put the result in the y² column.
Sample   Age (x)   Glucose Level (y)   xy      x²      y²
1        40        98                  3920    1600    9604
2        25        59                  1475    625     3481
3        36        83                  2988    1296    6889
4        45        70                  3150    2025    4900
5        50        90                  4500    2500    8100
6        61        85                  5185    3721    7225
Step 5: Add up all of the numbers in each column and put the result in the bottom row. The Greek
letter sigma (Σ) is a short way of saying “sum of” or “summation of”.
Sample   Age (x)   Glucose Level (y)   xy      x²      y²
1        40        98                  3920    1600    9604
2        25        59                  1475    625     3481
3        36        83                  2988    1296    6889
4        45        70                  3150    2025    4900
5        50        90                  4500    2500    8100
6        61        85                  5185    3721    7225
Σ        257       485                 21218   11767   40199
Step 6: Substitute the values obtained into the formula and compute.
The result is 0.5108, which means the variables have a high positive correlation.
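The substitution in Step 6 can be written out as follows. This is a sketch that assumes the raw-score form of the Pearson formula, using the column totals of the table: Σx = 257, Σy = 485, Σxy = 21218, Σx² = 11767, Σy² = 40199, with n = 6.

```latex
\begin{aligned}
r &= \frac{n\sum xy - \sum x \sum y}
          {\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]
                 \left[n\sum y^2 - \left(\sum y\right)^2\right]}} \\
  &= \frac{6(21218) - (257)(485)}
          {\sqrt{\left[6(11767) - 257^2\right]\left[6(40199) - 485^2\right]}} \\
  &= \frac{127308 - 124645}
          {\sqrt{(70602 - 66049)(241194 - 235225)}} \\
  &= \frac{2663}{\sqrt{(4553)(5969)}}
   \approx \frac{2663}{5213.14} \approx 0.5108
\end{aligned}
```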
Example 2:
Problem :
Calculate the Pearson correlation coefficient of the obtained scores by 5 students in
algebra and trigonometry as given below:
Algebra 15 16 12 10 8
Trigonometry 18 11 10 20 17
Solution:
Complete the table by following steps 1 to 5.
x         y         xy          x²          y²
15        18        270         225         324
16        11        176         256         121
12        10        120         144         100
10        20        200         100         400
8         17        136         64          289
Σx = 61   Σy = 76   Σxy = 902   Σx² = 789   Σy² = 1234
Step 6: Substitute the values obtained into the formula and compute.
The result is –0.4241, which means the variables have a medium negative correlation.
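The whole procedure for this example can be retraced in a few lines of Python. This is a sketch, not the only way to do it: the variable names are illustrative, and the last step assumes the raw-score form of the Pearson formula applied to the column totals.

```python
from math import sqrt

algebra = [15, 16, 12, 10, 8]
trigonometry = [18, 11, 10, 20, 17]

# Steps 1-4: build the x, y, xy, x^2 and y^2 columns row by row
rows = [(x, y, x * y, x * x, y * y) for x, y in zip(algebra, trigonometry)]

# Step 5: column totals
sum_x, sum_y, sum_xy, sum_x2, sum_y2 = (sum(col) for col in zip(*rows))

# Step 6: substitute the totals into the raw-score formula
n = len(rows)
r = (n * sum_xy - sum_x * sum_y) / sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)

print(sum_x, sum_y, sum_xy, sum_x2, sum_y2)  # 61 76 902 789 1234
print(round(r, 4))                           # -0.4241
```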
Activity:
x 5 6 4 2
y 7 3 9 8
2. The scores of 6 pupils in two subjects, Physics and Chemistry, are given below.
Calculate the coefficient of correlation.
Chemistry 45 53 67 40 35 50
Physics 68 76 70 64 54 66
Lesson 6.6 Regression
Simple regression is used to examine the relationship between one dependent and one
independent variable. After performing an analysis, the regression statistics can be used to predict the
dependent variable when the independent variable is known. Regression goes beyond correlation by adding
prediction capabilities.
People use regression on an intuitive level every day, such as:
in business, a well-dressed man is thought to be financially successful;
a mother knows that more sugar in her children’s diet results in higher energy levels; and
the ease of waking up in the morning often depends on how late you went to bed the night
before.
The regression line (known as the least squares line) is a plot of the expected value of the
dependent variable for all values of the independent variable. Technically, it is the line that minimizes the
sum of the squared residuals. The regression line is the one that best fits the data on a scatter plot.
Using the regression equation, the dependent variable may be predicted from the independent
variable. The slope of the regression line (b) is defined as the rise divided by the run. The
y-intercept (a) is the point on the y-axis where the regression line would intercept the y-axis. The slope
and y-intercept are incorporated into the regression equation. The intercept is usually called the
constant, and the slope is referred to as the coefficient. Since the regression model is usually not a
perfect predictor, there is also an error term in the equation.
In the regression equation, y is always the dependent variable and x is always the independent
variable. Here are two equivalent ways to mathematically describe a linear regression model:
y = intercept + (slope × x) + error
y = a + bx + e
The significance of the slope of the regression line is determined from the t-statistic. It is the
probability that a correlation at least as strong as the one observed would occur by chance if the true
correlation were zero. Some researchers prefer to report the F-ratio instead of the t-statistic. In simple
regression, the F-ratio is equal to the t-statistic squared.
The t-statistic for the significance of the slope is essentially a test to determine if the regression
model (equation) is usable. If the slope is significantly different from zero, then we can use the
regression model to predict the dependent variable for any value of the independent variable.
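The relationship between the t-statistic of the slope and the F-ratio can be checked numerically. The sketch below is an illustration under stated assumptions: it fits a simple least-squares line, takes the standard error of the slope as the square root of the mean squared error divided by the sum of squared x deviations, and uses n – 2 degrees of freedom. The function name `slope_t_and_f` is hypothetical, and the data are the age and glucose values from Example 1 of Lesson 6.5.

```python
from math import sqrt


def slope_t_and_f(xs, ys):
    """Return (t, F) for the slope of a simple least-squares regression."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = s_xy / s_xx            # slope
    a = mean_y - b * mean_x    # y-intercept
    # Error (residual) and regression sums of squares
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ssr = sum(((a + b * x) - mean_y) ** 2 for x in xs)
    mse = sse / (n - 2)        # mean squared error, n - 2 degrees of freedom
    t = b / sqrt(mse / s_xx)   # slope divided by its standard error
    f = ssr / mse              # regression mean square over error mean square
    return t, f


ages = [40, 25, 36, 45, 50, 61]
glucose = [98, 59, 83, 70, 90, 85]
t, f = slope_t_and_f(ages, glucose)
print(t, f, t * t)
```

The printed F-ratio and the squared t-statistic agree up to rounding error. Turning t into a significance value would additionally require the t distribution with n – 2 degrees of freedom, which this sketch does not compute.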
$$ m = \frac{\Delta y}{\Delta x} $$
$$ m = \frac{y_2 - y_1}{x_2 - x_1} $$
Forms of Linear Equation
4. General form: Ax + By + C = 0
The slope indicates the steepness of a line and the intercept indicates the location where it
intersects an axis. The slope and the intercept define the linear relationship between two variables, and
can be used to estimate an average rate of change. The greater the magnitude of the slope, the steeper the
line and the greater the rate of change.
By examining the equation of a line, you quickly can discern its slope and y-intercept (where the
line crosses the y-axis).
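For a line given in general form, Ax + By + C = 0, solving for y gives y = –(A/B)x – C/B, so the slope is –A/B and the y-intercept is –C/B whenever B is not zero. A small sketch of that conversion follows; the function name and the sample equation are made up for illustration.

```python
from fractions import Fraction


def slope_intercept_from_general(a, b, c):
    """For Ax + By + C = 0 with B != 0, return (slope, y_intercept)."""
    if b == 0:
        raise ValueError("vertical line: the slope is undefined")
    return Fraction(-a, b), Fraction(-c, b)


# Hypothetical example: 3x + 4y - 12 = 0
m, y_intercept = slope_intercept_from_general(3, 4, -12)
print(m, y_intercept)  # -3/4 3
```

Exact fractions are used so that slopes such as –3/4 are not rounded.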
y = –(3/4)x + 3
The slope is 0. When x increases by 1, y neither increases nor decreases. The y-intercept
is 2.
Usually, this relationship can be represented by the equation y = b0 + b1x, where b0 is the
y-intercept and b1 is the slope.
Example :
Problem:
A company determines that job performance for employees in a production department can be
predicted using the regression model y = 130 + 4.3x, where x is the hours of in-house training they
received (from 0 to 20) and y is their score on a job skills test. The value of the y-intercept (130)
indicates the average job skill score for an employee with no training. The value of the slope (4.3)
indicates that for each hour of training, the job skill score increases, on average, by 4.3 points.
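The model in this example can be used as a small prediction function. The sketch below is illustrative: the function name is hypothetical, and the range check simply reflects that the model was described only for 0 to 20 hours of training.

```python
def predict_job_skill(hours):
    """Predicted job skills score from y = 130 + 4.3x (x = training hours)."""
    if not 0 <= hours <= 20:
        raise ValueError("the model was described only for 0 to 20 training hours")
    return 130 + 4.3 * hours


print(predict_job_skill(0))   # the y-intercept: score with no training
print(predict_job_skill(10))  # 10 hours of training add 10 * 4.3 points
```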
Regression analysis is a statistical process for estimating the relationships among variables.
It includes many techniques for modelling and analyzing several variables, when the focus is on the
relationship between a dependent variable and one or more independent variables. More
specifically, regression analysis helps one understand how the typical value of the dependent
variable changes when any one of the independent variables is varied, while the other independent
variables are held fixed. Regression Analysis estimates the conditional expectation of the dependent
variable given the independent variables – that is, the average value of the dependent variable when
the independent variables are fixed. In all cases, the estimation target is a function of the independent
variables called the regression function.
Activity:
Can we predict the number of total calories based upon the total fat grams?
Solution:
1. Prepare a scatter plot of the data on graph paper.
2. Using a strand of spaghetti, position the spaghetti so that the plotted points are as close to the
strand as possible.
3. Find two points that you think will be on the "best-fit" line.
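4. We are choosing the points (9, 260) and (30, 530). (You may choose different points.)
5. Calculate the slope of the line through your two points (rounded to three decimal places).
$$ m = \frac{y_2 - y_1}{x_2 - x_1} = \frac{530 - 260}{30 - 9} = \frac{270}{21} \approx 12.857 $$
6. Write the equation of the line through the point (9, 260) with this slope: y = 12.857 (x – 9) + 260
7. This equation can now be used to predict information that was not plotted in the scatter plot.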
Question a: Predict the total calories based upon 22 grams of fat.
y = 12.857 (x – 9) + 260
y = 12.857 (22 – 9) + 260
y = 12.857 (13) + 260
y = 167.141 + 260
y = 427.141 calories
Question b: Predict the total calories based upon 18 grams of fat.
y = 12.857 (x – 9) + 260
y = 12.857 (18 – 9) + 260
y = 12.857 (9) + 260
y = 115.713 + 260
y = 375.713 calories
Question c: Predict the total calories based upon 26 grams of fat.
y = 12.857 (x – 9) + 260
y = 12.857 (26 – 9) + 260
y = 12.857 (17) + 260
y = 218.569 + 260
y = 478.569 calories
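The predictions for 22, 18 and 26 grams of fat can be reproduced with a short script. This is a sketch of the two-point method used in this solution, not a least-squares fit: the helper name `line_through` is hypothetical, and the slope is rounded to three decimal places to match the hand computation.

```python
def line_through(p1, p2, digits=3):
    """Return (slope, predict) for the line through two chosen points.

    The slope is rounded to `digits` decimals, as in the hand computation.
    """
    (x1, y1), (x2, y2) = p1, p2
    m = round((y2 - y1) / (x2 - x1), digits)

    def predict(x):
        # Point-slope form anchored at the first point
        return m * (x - x1) + y1

    return m, predict


m, predict = line_through((9, 260), (30, 530))
print(m)  # 12.857
for grams in (22, 18, 26):
    print(grams, round(predict(grams), 3))
```

Choosing two different points on the strand of spaghetti would give a slightly different slope and therefore slightly different predictions.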
CHAPTER TEST
Problem solving.
1. To interpret the relationship between years of education and salary potential, 5 persons were
surveyed. The results obtained on their number of years of higher education (college degree and
higher) and their monthly salaries are shown below. Compute the Pearson Product Moment Correlation
Coefficient and interpret the relationship between the variables.
2. A financial analyst believes that the interest rate on bonds is inversely related to the interest rate on
loans. Hence, bonds perform well when lending rates are down, and vice versa. The results of the
observation are shown in the table below. Find the slope and y-intercept from the data and predict the
interest rate on bonds (%) when the interest rate on loans (%) is
a. 7
b. 11
c. 12
Interest Rate on Loan (%) Interest Rate on Bond (%)
10 6
5 9
8 7
6 8
8 6
Tables
Table 2: Level of Confidence
Table 3: P Value Table
Table 4: Tabular value of Z at indicated levels of significance (α)
Test/α        0.005    0.01     0.05     0.1
One-tailed    ±2.58    ±2.33    ±1.645   ±1.28
Two-tailed    ±2.81    ±2.575   ±1.96    ±1.645
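The entries of this table can be recomputed from the standard normal distribution. The sketch below uses `NormalDist` from Python's standard `statistics` module (Python 3.8 or later); the helper name `z_critical` is hypothetical. For a one-tailed test the whole significance level sits in one tail, while a two-tailed test splits it equally between both tails.

```python
from statistics import NormalDist


def z_critical(alpha, tails=1):
    """Positive critical z value for significance level alpha."""
    if tails not in (1, 2):
        raise ValueError("tails must be 1 or 2")
    # Area to the left of the critical value under the standard normal curve
    return NormalDist().inv_cdf(1 - alpha / tails)


for alpha in (0.005, 0.01, 0.05, 0.1):
    one_tailed = round(z_critical(alpha, tails=1), 3)
    two_tailed = round(z_critical(alpha, tails=2), 3)
    print(alpha, one_tailed, two_tailed)
```

Computed values can differ from the table in the last decimal place because of rounding conventions; for example, the two-tailed value at 0.01 computes to about 2.576 where the table lists 2.575.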
Name: Score
Grade / Section : Date :
Chapter 1 : Exercise
4. 16 possible outcomes
5. 1024 possible outcomes
IV. Three coins are tossed simultaneously. Find the probability of :
1. at least 2 tails
2. no tail
3. 1 head
4. 1 tail and 2 heads
5. 2 heads and 2 tails
V. The number of adults living in homes in a randomly selected barangay in Tunasan is
described by the following probability distribution.
Solution :
Name: Score
Grade / Section : Date :
Chapter 2 : Exercise
3. x = 45 ; x̄ = 42 ; s = 5
4. x = 250 ; x̄ = 255 ; s = 2
5. x = 28 ; x̄ = 24 ; s = 2.5
IV. Draw the graph of the following probability.
1. Pr ( - 1.8 ≤ z ≤ 2.7 )
2. Pr ( 1.2 ≤ z ≤ 2.8 )
3. Pr ( z ≤ 1.5 )
V. Word problems.
1. Scores on a history test have an average of 80 with a standard deviation of 6. If there were
50 students who took the test on this subject, how many students got a score of at least 75 ?
2. The weight of chocolate bars from a particular chocolate factory has a mean of 8 ounces with a standard
deviation of 0.1 ounce. What is the z-score corresponding to a weight of 8.17 ounces?
3. Books in the library are found to have an average length of 350 pages with a standard deviation of
100 pages. If there are 10,000 books in the library , how many books have a corresponding length
of at least 80 pages?
5. The mean growth of the thickness of trees in a forest is found to be 0.5 cm/year with a
standard deviation of 0.1 cm/year. What is the z-score corresponding to 1 cm/year?
Name: Score
Grade / Section : Date :
Chapter 3 : Exercise
2. n = 6; r = 3
3. n = 8; r = 2
4. n = 9; r = 3
5. n = 10; r = 4
IV. Problem solving.
A population consists of the 4 values 12, 14, 16, and 18. Compute the following when r = 2:
a. number of sample values or combinations
b. population mean
c. sample variance
d. standard deviation
e. Fill up the table below based on the computed values.
Solution:
Name: Score
Grade / Section : Date :
Chapter 4 : Exercise
III. Find the value of the unknown with the following given:
1. α = if CI ( confidence interval ) = 80 %
2. Z 0.01 = if α = 0.01
3. CI = if Z 0.5 = 1.645
4. α = if CI = 0.95
5. CI = if 2α = 0.01
IV. Problem solving.
1. There are hundreds of mangoes on the trees and we want to know if they are
big enough. 46 mangoes were randomly chosen and the following data were
obtained: ( use confidence interval of 95 % )
x = 86
s = 6.2
Find : a. margin of error
b. true mean
2. Lyceum of Alabang P.E. department wants to calculate the proportion of students who have attended a
women’s basketball game at the college. They use student email addresses, randomly choose 220
students, and email them. Of the 145 who responded, 22 had attended a women’s basketball game.
a. What is the sample proportion of students who have attended a women’s basketball game?
b. What is the sample proportion of students who have not attended a women’s basketball game ?
c. Can a normal distribution be used to model the sampling distribution for the sample
proportion ? Explain.
Name: Score
Grade / Section : Date :
Chapter 5 : Exercise
II. Determine if a z-test or a t-test is appropriate for each of the following. Write Z or T in the blank
before each number. If neither of the two tests is applicable, write X.
1. s = 2.5 ; n = 50
2. n = 15 ;
3. s2 = unknown ; n = 25
4. s = 16 ; n = 20
5. s = 36 ; n = 30
III. Compute for the degrees of freedom based on the following given.
1. df = if n1 = 16 and n2 = 20
2. df = if n = 28
3. df = if n1 = 24 and n2 = 26
V. Problem solving.
1. The average heart rate for Americans is 72 beats/minute. A group of 25 individuals participated in an
aerobics fitness program to lower their heart rate. After six months the group was evaluated to determine
whether the program had significantly slowed their heart rate. The mean heart rate for the group was 69
beats/minute with a standard deviation of 6.5. Was the aerobics program effective in lowering heart
rate?
Answer the following:
a. Ho:
b. Ha:
c. α=
d. Test statistic
e. Tailed test
f. degrees of freedom
g. critical value
h. computed t-value
i. Graph
j. Conclusion:
2. The amount of a certain trace element in blood is known to vary with a standard deviation of 14.1
ppm (parts per million) for male blood donors and 9.5 ppm for female donors. Random samples of 75
male and 50 female donors yield concentration means of 28 and 33 ppm, respectively. What is the
likelihood that the population means of concentrations of the element are the same for men and
women?
Answer the following:
a. Ho:
b. Ha:
c. α=
d. Test statistic
e. Tailed test
f. degrees of freedom
g. critical value
h. computed t-value
i. Graph
j. Conclusion:
Name: Score
Grade / Section : Date :
Chapter 6 : Exercise
II. Find the slope and y-intercept with the following given.
1. 2y = 6x + 4 m= y-intercept =
2. ( 4, 6 ) and ( 2, 18 ) m= y-intercept =
3. – 10x + 5y + 20 = 0 m= y-intercept =
4. 2x/3 + y/4 = 1 m= y-intercept =
III. Problem solving.
1. Find the Pearson’s correlation coefficient (r) using the following data (α = 0.02; two-tailed
test) and state the correlation.
Samples x y xy x2 y2
1 2 6
2 7 16
3 5 11
Σ
2. You have to examine the relationship between the age and price for used cars sold in the last year
by a car dealership company.
Note : Use points ( 5, 4500 ) and ( 7, 4200 ) as basis for the computation of the slope.
Find :
a. Predict the price when the car age is 6 years.
b. Predict the price when the car age is 9 years.
c. Predict the price when the car age is 15 years.