
STATISTICS AND PROBABILITY LECTURE NOTES

(Prepared by: Marivic D. Taňola)

NAME: ____________________________________

4th QUARTER – Week 1:


HYPOTHESIS TESTING

JUST REFLECT

• Sometimes we hear claims on social media that we find unbelievable, such as a
whitening product advertisement stating that if you use the product, your
skin will become as white as snow.
• The weatherman stating that there is a 90% chance of rain tomorrow.

We might feel compelled to challenge such claims. To challenge a claim, we must run a
research study on a sample (since surveying the entire population would be
impossible). To test a claim, you must write two hypotheses.
Hypothesis testing is a decision-making process for evaluating claims about a
population.

• The null hypothesis (Ho) is basically, “The population is like this.” It states, in formal terms,
that the population is no different than usual.
• The alternative hypothesis (Ha) is, “The population is like something else.” It states that
the population is different from the usual, that something has happened to this
population, and that as a result it has a different mean or a different shape than the usual
case.

In order to state the hypothesis correctly, the researcher must translate the claim into
mathematical symbols. There are three possible sets of statistical hypotheses.

1. TWO-TAILED TEST
   Ho : parameter = specific value
   Ha : parameter ≠ specific value

2. LEFT-TAILED TEST
   Ho : parameter = specific value
   Ha : parameter < specific value

3. RIGHT-TAILED TEST
   Ho : parameter = specific value
   Ha : parameter > specific value

In hypothesis testing, there are four possible outcomes.

                 Reject Ho           Do not Reject Ho
Ho is True       Type I Error        Correct Decision
Ho is False      Correct Decision    Type II Error

• A type I error occurs if one rejects the null hypothesis when it is true.
• A type II error occurs if one does not reject the null hypothesis when it is false.

The decision is made based on probabilities: “How large a difference is necessary to
reject the null hypothesis?” Here is where the level of significance is used.
The level of significance is the maximum probability of committing a type I error.
This probability is symbolized by α (Greek letter alpha). That is, P(type I error) = α. The
probability of a type II error is symbolized by β (Greek letter beta). That is, P(type II error) = β,
although in most hypothesis testing situations, β cannot be computed.

Generally, statisticians agree on using three arbitrary significance levels: the 0.10,
0.05, and 0.01 levels. That is, when the null hypothesis is true, the probability of rejecting it
(a type I error) will be 10%, 5%, or 1%, and the probability of correctly not rejecting it will be
90%, 95%, or 99%, depending on which level of significance is used. In other words, when
α = 0.05, there is a 5% chance of rejecting a true null hypothesis.
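The meaning of α can be checked with a small simulation. The sketch below is only an illustration under assumed conditions: the population mean (50), standard deviation (10), sample size (30), and the function name `type_i_error_rate` are all hypothetical and are not part of these notes. It repeatedly samples from a population where Ho is true and counts how often a two-tailed z-test rejects it.

```python
import random
from statistics import NormalDist, mean

def type_i_error_rate(alpha, trials=2000, n=30, mu0=50.0, sigma=10.0, seed=1):
    """Estimate how often a true null hypothesis is rejected at level alpha."""
    rng = random.Random(seed)
    # Two-tailed critical value for a z-test (about 1.96 when alpha = 0.05).
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(trials):
        # Draw a sample from a population where Ho (mean = mu0) really is true.
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]
        z = (mean(sample) - mu0) / (sigma / n ** 0.5)
        if abs(z) > z_crit:
            rejections += 1  # rejecting a true Ho is a type I error
    return rejections / trials

for alpha in (0.10, 0.05, 0.01):
    print(alpha, type_i_error_rate(alpha))
```

The estimated rates should land close to 0.10, 0.05, and 0.01, although they vary a little from run to run when the seed is changed.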

• You can reflect on these figures, which are commonly used in hypothesis testing
in research:

After a significance level is chosen, a critical value is selected from a table for the
appropriate test.

• The critical value determines the critical and non-critical regions.


• The critical region or the rejection region is the range of values of the test value that
indicates that there is a significant difference and that the null hypothesis should be
rejected.
• The non-critical or non-rejection region is the range of values of the test value that
indicates that the difference was probably due to chance and that the null hypothesis
should not be rejected.

If the test is two-tailed, there are two critical values, one negative and one positive. If the
test is left-tailed, the critical value will be negative. If the test is right-tailed, the critical
value will be positive.
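As a rough illustration of how the critical value depends on the significance level and the type of test, the sketch below computes critical z values with Python's standard library instead of a table. It assumes a z-test (standard normal distribution); the function name `z_critical_values` is made up for this example.

```python
from statistics import NormalDist

def z_critical_values(alpha, tail):
    """Return the critical z value(s) for a 'two', 'left', or 'right' tailed test."""
    z = NormalDist()  # standard normal: mean 0, standard deviation 1
    if tail == "two":
        upper = z.inv_cdf(1 - alpha / 2)  # alpha is split between both tails
        return (-upper, upper)
    if tail == "left":
        return (z.inv_cdf(alpha),)        # all of alpha is in the left tail
    if tail == "right":
        return (z.inv_cdf(1 - alpha),)    # all of alpha is in the right tail
    raise ValueError("tail must be 'two', 'left', or 'right'")

for alpha in (0.10, 0.05, 0.01):
    for tail in ("two", "left", "right"):
        values = ", ".join(f"{v:+.3f}" for v in z_critical_values(alpha, tail))
        print(f"alpha={alpha:.2f}  {tail:>5}-tailed  critical z: {values}")
```

For α = 0.05, this gives approximately ±1.960 for a two-tailed test, −1.645 for a left-tailed test, and +1.645 for a right-tailed test, which matches the usual z table.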
JUST LEARN

A hypothesis is essentially an idea about the population that you think might be true, but
which you cannot prove to be true. While you usually have good reasons to think it is true,
and you often hope that it is true, you need to show that the sample data support your idea.

In hypothesis testing the following steps should be considered:


1. State the null and alternative hypotheses.
2. Select the level of significance.
3. Determine the critical value and the rejection region/s.
4. State the decision rule.
5. Compute the test statistic.
6. Make a decision, whether to reject or not to reject the null hypothesis.
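The six steps can be traced in a short script. This is a minimal sketch for a right-tailed one-sample z-test; every number in it (claimed mean 100, population standard deviation 15, sample of 36 with mean 106) is hypothetical and chosen only to show the flow of the steps.

```python
from math import sqrt
from statistics import NormalDist

# All numbers below are hypothetical, chosen only to illustrate the steps.
mu0, sigma = 100.0, 15.0      # claimed population mean and known population SD
n, sample_mean = 36, 106.0    # sample size and sample mean

# Step 1: Ho: mu = 100, Ha: mu > 100 (a right-tailed test).
# Step 2: select the level of significance.
alpha = 0.05
# Step 3: critical value; the rejection region is z > z_crit.
z_crit = NormalDist().inv_cdf(1 - alpha)
# Step 4: decision rule: reject Ho if the test statistic exceeds z_crit.
# Step 5: compute the test statistic.
z = (sample_mean - mu0) / (sigma / sqrt(n))
# Step 6: make the decision.
reject = z > z_crit

print(f"z = {z:.3f}, critical value = {z_crit:.3f}")
print("Reject Ho" if reject else "Do not reject Ho")
```

With these made-up numbers the test statistic is 2.4, which is larger than the critical value of about 1.645, so Ho is rejected.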
JUST EVALUATE
4th QUARTER – Week 2:

JUST REFLECT

• You can reflect on these statements which are commonly used in research.










• The symbol ≠ in the alternative hypothesis suggests either a greater than ( > )
relation or a less than ( < ) relation.
• When the alternative hypothesis utilizes the ≠ symbol, the test is said to be
non-directional. It is also called a two-tailed test.
• When the alternative hypothesis utilizes the > or the < symbol, the test is said
to be directional. It may be called either right-tailed ( > ) or left-tailed ( < ).

These are the graphical representations of two-tailed test and the one-tailed test:
JUST LEARN:
JUST EVALUATE
4th QUARTER – Week 3:

JUST RECALL AND REFLECT:

Directions: Choose the letter that corresponds to your answer. Write your answer on a
separate sheet.

1. Which of the following is a Null Hypothesis test formula?

A. Test statistic B. Population statistic C. Variance statistic D. Null statistic

2. If the hypothesis contains the greater than symbol (>) the rejection region is
______.

A. Left-tailed B. Right-tailed C. Center-tailed D. Cross-tailed

3. If the hypothesis contains the less than symbol (<) the rejection region is ____.

A. Center tailed B. Right tailed C. Left tailed D. Cross tailed

4. What is the main purpose of hypothesis testing?

A. Test how far the mean of a sample is from zero.


B. Determine whether a statistical result is significant.
C. Determine the appropriate value of the significance level.
D. Derive the standard error of the data.

5. What do you call a population for testing purposes?

A. Statistic B. Level of Significance C. Hypothesis D. Test-Statistic


• The rejection region (RR) specifies the values of the test statistic for which the null
hypothesis is rejected in favor of the alternative hypothesis.

• If the computed value of the test statistic falls in the RR, we reject the null hypothesis (Ho)
and accept the alternative hypothesis (H1).

• If the value of the test statistic does not fall in the rejection (critical) region, we do not
reject Ho. The region other than the rejection region is called the acceptance region.

• Typical values for α are 0.01, 0.05 and 0.1. It is a value that we select based on the
certainty we need. In most cases, the choice of α is determined by the context we are
operating in, but 0.05 is the most commonly used value.
JUST LEARN:
DO IT IN A GROUP:

1. Directions: Briefly answer the Self-Assessment Questions (SAQ) below.

SAQ 1: When do we accept the null hypothesis?


SAQ 2: When do we reject the null hypothesis?

2. Directions: Identify the Rejection Region.

PROBLEM 1. Professor Balenciaga has recorded her students’ grades for several semesters,
and the average of all these grades is 83. Her new class of 28 students
seems to be above average in ability, and she wants to demonstrate that the current
class is superior to the previous classes based on its average. With a class average
of 86.2 and a standard deviation of 12, is there sufficient evidence to support
her argument that the current class is superior? Use the 0.05 significance level.

PROBLEM 2. Professor Balenciaga has recorded her students’ grades for several semesters,
and the average of all these grades is 83. Her new class of 30 students
seems to be above average in ability, and she wants to demonstrate that the current
class is superior to the previous classes based on its average. With a class average
of 86.2 and a standard deviation of 12, is there sufficient evidence to support
her argument that the current class is superior? Use the 0.05 significance level.
JUST EVALUATE:
Directions: Choose the letter that corresponds to your answer. Write your answer on a
separate sheet.
1. Null hypothesis is rejected as direct evidence that the alternative hypothesis is:

a. True b. False c. Either d. Neither

2. One or two tailed tests will determine ________.


a. that the hypothesis has one or two conclusions.
b. the two values of the sample that need to be rejected.
c. whether the rejection region is located in one or two tails of the distribution.
d. that the rejection region is located in one tail of the distribution.
4th QUARTER – Week 4:

JUST RECALL AND REFLECT

A test statistic is a value computed from the sample data. It is used to assess
the evidence for rejecting or not rejecting the null hypothesis. Each kind of test uses its
own test statistic.
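For tests on a population mean, the two test statistics most often used are sketched below. The notation is an assumption and is not taken from these notes: x̄ is the sample mean, μ0 is the value of the mean stated in Ho, σ is the population standard deviation, s is the sample standard deviation, and n is the sample size.

```latex
z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}
\qquad \text{or} \qquad
t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}, \quad df = n - 1
```

As a general guide, z is used when σ is known or the sample is large, and t is used when σ is unknown and the sample is small.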

JUST LEARN
HYPOTHESIS TESTING ON A POPULATION MEAN

STEP 6: Draw the appropriate conclusion.


Since H0 is rejected, there is enough evidence to support the claim that
college students watch less television than the general public.
JUST LEARN WITH THE GROUP
JUST EVALUATE

For items 4 and 5, refer to the following information:


Previously, an organization reported that teenagers spent 4.5 hours per week, on
average, on the phone. The organization thinks that, currently, the mean is higher. Fifteen
(15) randomly chosen teenagers were asked how many hours per week they spend on the
phone. The sample mean was 4.75 hours with a sample standard deviation of 2.0. Conduct a
hypothesis test.

4. The null and alternate hypotheses are:

(a) Ho: x̄ = 4.5, Ha: x̄ > 4.5 (b) Ho: μ ≥ 4.5, Ha: μ < 4.5

(c) Ho: μ = 4.75, Ha: μ > 4.75 (d) Ho: μ = 4.5, Ha: μ > 4.5

5. At a significance level of α = 0.05, what is the correct conclusion?

(a) There is enough evidence to conclude that the mean number of hours is more
than 4.75.
(b) There is enough evidence to conclude that the mean number of hours is more
than 4.5.
(c) There is not enough evidence to conclude that the mean number of hours is
more than 4.5.
(d) There is not enough evidence to conclude that the mean number of hours is
more than 4.75.
4th QUARTER – Week 5

JUST RECALL AND REFLECT

JUST LEARN
6. Decision
• If we reject 𝐻0, we can conclude that 𝐻𝐴 is true.
• If, however, we do not reject 𝐻0, we may conclude only that 𝐻0 may be true.

Decision rule using 𝑝 – value:


• If the 𝑝 – value is less than or equal to 𝛼, we reject the null hypothesis (𝑝 ≤ 𝛼).
• If the 𝑝 – value is greater than 𝛼, we do not reject the null hypothesis (𝑝 > 𝛼).
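The decision rule can be sketched in code as follows. This assumes a z test statistic; the helper names `p_value` and `decide`, and the example value z = −1.80, are hypothetical and used only for illustration.

```python
from statistics import NormalDist

def p_value(z, tail):
    """p-value of a z test statistic for a 'two', 'left', or 'right' tailed test."""
    cdf = NormalDist().cdf
    if tail == "left":
        return cdf(z)                    # area to the left of z
    if tail == "right":
        return 1 - cdf(z)                # area to the right of z
    if tail == "two":
        return 2 * (1 - cdf(abs(z)))     # area in both tails beyond |z|
    raise ValueError("tail must be 'two', 'left', or 'right'")

def decide(p, alpha):
    """Apply the decision rule: reject Ho when p <= alpha."""
    return "reject Ho" if p <= alpha else "do not reject Ho"

z, alpha = -1.80, 0.05  # hypothetical test statistic and significance level
for tail in ("left", "two"):
    p = p_value(z, tail)
    print(f"{tail}-tailed: p = {p:.4f} -> {decide(p, alpha)}")
```

With the same test statistic, the left-tailed test rejects Ho (p is about 0.036) while the two-tailed test does not (p is about 0.072), which shows why the direction of Ha must be fixed before the data are examined.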
SHARE INSIGHTS (BY GROUP)
JUST EVALUATE

Directions: Study the problem and answer the task given.

PROBLEM 1: A company manufactures calculators with an average mass of 500 g. An
engineer believes the average mass to be different and decides to calculate the
average mass of 60 calculators.

TASK: State the null and alternative hypothesis.


𝑯𝑶:
𝑯𝟏:

PROBLEM 2: Reyes performed a study to validate a translated version of the Western
Mindanao State University (WMSU) questionnaire used with English-speaking patients with
hip or knee osteoarthritis. For the 76 women classified with severe hip pain, the WMSU
mean function score was 70.7 with a standard deviation of 14.6. We wish to know if we may
conclude that the mean function score for a population of similar women subjects with severe
hip pain is less than 75. Let α = 0.01.

TASK: Perform hypothesis testing by following the steps below.

1. Data:
2. Assumption:
3. Hypothesis:
4. Test Statistics:
5. Decision Rule:
6. Decision:
4th QUARTER – Week 6
Test Statistic for Population Proportion

JUST LEARN:

Step 5: Make a statistical decision.

Since the computed test statistic 𝒛 = −𝟐.𝟏𝟑𝟑 falls in the rejection region, reject the
null hypothesis.
Step 6: Draw the appropriate conclusion.
Since H0 is rejected, then there is enough evidence to conclude that the percentage
of voters for the administration candidate is different from 65%.
4th QUARTER – Week 7

BIVARIATE DATA AND SCATTERPLOT

JUST RECALL AND REFLECT

Have you ever wondered whether tall people have longer arms than short people?
We’ll explore this question by collecting data on two variables — height and arm span
(measured from left fingertip to right fingertip).

• Do people with above-average arm spans tend to have above-average heights?


• Do people with below-average arm spans tend to have below-average heights?

Directions: Study the table given and answer the questions that follow.

Person Number 1 2 3 4 5 6 7 8 9 10 11 12
Arm Span 156 157 159 160 161 161 162 165 170 170 173 173
Height 162 160 162 155 160 162 170 166 170 167 185 176

The methods we use to study the relationship between two variables depend on the type of
variables we are dealing with; that is, they depend on whether the data are numerical or
categorical. There are ways of measuring the relationship between each of the following
pairs of variables:
• a numerical variable and categorical variable (for example, height and nationality)
• two categorical variables (for example, gender and religious denomination)
• two numerical variables (for example, height and weight)

In a relationship between two variables, if the values of one variable ‘depend’ on the
values of another variable, then the former variable is referred to as the dependent variable
and the latter variable is referred to as the independent variable.

BIVARIATE DATA - consist of two (2) variables, which can be independent or dependent.
The independent variable is the variable that can cause the dependent variable to change.
The dependent variable is the variable that is influenced or affected by the independent
variable.
It is useful to identify the independent and dependent variables where possible, since it is
the usual practice when displaying data on a graph to place the independent variable on the
horizontal axis and the dependent variable on the vertical axis.

EXAMPLE 1.
You want to test a new dosage of a drug that supposedly prevents sneezing in people
allergic to flowers.
• Variable on the x-axis: new dosage of the drug
• Variable on the y-axis: sneezing

EXAMPLE 2.
A soap manufacturer wants to prove that a small amount of detergent can remove a
greater amount of stain.
• Variable on the x-axis: amount of detergent
• Variable on the y-axis: amount of stain removed

SCATTERPLOT – is a diagram that is used to show the degree and pattern of
relationship between two (2) sets of data. It is constructed on the coordinate
plane, and each data point on a scatterplot represents two (2) values.

A scatterplot is used to determine if there is a relationship between two numerical
variables. Data are collected on the two variables and often displayed in a table of ordered
pairs. A scatterplot is a graph of the ordered pairs of numbers. Each ordered pair is a dot on
the graph.

PATTERNS OF DATA IN SCATTERPLOT

1. SHAPE (FORM) - Refers to whether a data pattern is linear (straight) or nonlinear
(curved).

LINEAR FORM
If the points seem to approximate a straight line, the association is a linear form.

NON-LINEAR FORM
If the points seem to approximate a curve, the association is a non-linear form.

FORM OF AN ASSOCIATION
• Linear form – when the points tend to follow a straight line.
• Non-linear form – when the points tend to follow a curved line.
2. TREND (DIRECTION) - Refers to the direction of change in variable Y when variable X gets
bigger. If variable Y also gets bigger, the slope is positive; but if variable Y gets smaller, the
slope is negative.

POSITIVE
Positive association exists between the variables if the gradient of the line is
positive, that is, the dots on the scatterplot tend to go up as we go from left to right.

NEGATIVE
Negative association exists between the variables if the gradient of the line is
negative, that is, the dots on the scatterplot tend to go down as we go from left to right.

DIRECTION OF AN ASSOCIATION
• Positive – gradient of the line is positive.
• Negative – gradient of the line is negative.

3. VARIATION (STRENGTH) - Refers to the degree of “scatter” in the plot. If the dots are
widely spread, the relationship between variables is weak. If the dots are concentrated
around a line, the relationship is strong.

STRONG
In strong association the dots will tend to follow a single stream. A pattern
is clearly seen. There is only a small amount of scatter in the plot.

MODERATE
In moderate association the amount of scatter in the plot increases and the
pattern becomes less clear. This indicates that the association is less strong.

WEAK
In weak association the amount of scatter increases further, and the
pattern becomes even less clear. Linear form is less evident.

STRENGTH OF AN ASSOCIATION
• Strong – small amount of scatter in the plot.
• Moderate – modest amount of scatter in the plot.
• Weak – large amount of scatter in the plot.
EXAMPLE 3.
Determine the relationship between height and arm span. The data
collected on these variables are shown in the table of ordered pairs.

Height (cm)    172 159 178 162 156 174 151 162 165 185 186 176 166 180 158
Arm Span (cm)  172 162 182 164 159 180 151 165 168 189 188 184 167 184 161

For each person, two numerical variables are recorded: height and arm span. To construct a
scatterplot, draw a number plane with height on the horizontal axis and arm span on
the vertical axis. Plot each ordered pair as a dot. The scatterplot shows there is a
relationship between these variables.
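The pairing of the data in Example 3 can be sketched in code. The script below builds the ordered pairs that would be plotted and, as an informal check of direction (not a method from these notes), counts how many points lie on the same side of both means. The variable names are chosen for this illustration.

```python
from statistics import mean

height   = [172, 159, 178, 162, 156, 174, 151, 162, 165, 185, 186, 176, 166, 180, 158]
arm_span = [172, 162, 182, 164, 159, 180, 151, 165, 168, 189, 188, 184, 167, 184, 161]

# Each person becomes one ordered pair (height, arm span), i.e. one dot on the plot.
pairs = list(zip(height, arm_span))

mean_h, mean_a = mean(height), mean(arm_span)

# Count the dots that lie on the same side of both means:
# above-average height with above-average arm span, or below with below.
same_side = sum((h > mean_h) == (a > mean_a) for h, a in pairs)

print(pairs[:3])
print(f"{same_side} of {len(pairs)} points are on the same side of both means")
```

Here all 15 points fall on the same side of both means, which agrees with a positive association: people with above-average height in this sample also have above-average arm span.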

JUST EVALUATE
Directions: Construct a scatterplot using the tables and describe the a. shape (form), b.
trend (direction), and c. strength (variation).
4th QUARTER – Week 8
THE PEARSON PRODUCT-MOMENT CORRELATION

JUST RECALL AND REFLECT

Directions: Identify the direction and the strength of the following correlation given. Choose
your answer from the box.

a. Strong positive correlation b. Moderate positive correlation


c. No correlation d. Moderate negative correlation
e. Strong negative correlation f. Perfect correlation

TASK: Research the life of Karl Pearson and his important contributions to the field of
statistics. Do not forget to copy and study the formula he proposed for computing the
coefficient of correlation (r).

The correlation coefficient, computed from the sample data, measures the strength and
direction of a linear relationship between two variables. The strength of correlation is indicated
by the coefficient of correlation. There are several coefficients of correlation. The one most
commonly used in linear correlation is the Pearson Product-Moment coefficient of correlation,
symbolized by r and named in honor of Karl Pearson, the statistician who did a lot of research
in this area.
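A common computational form of the Pearson formula, written with the same symbols X, Y, and n used in these notes, is given below. It is assumed here to be the form intended, since it uses exactly the column sums (∑XY, ∑X², ∑Y²) that appear in the worked example and exercises.

```latex
r = \frac{n\sum XY - \left(\sum X\right)\left(\sum Y\right)}
         {\sqrt{\left[n\sum X^{2} - \left(\sum X\right)^{2}\right]
                \left[n\sum Y^{2} - \left(\sum Y\right)^{2}\right]}}
```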
Where r is the Pearson correlation coefficient, which indicates the degree of relationship
between the two sets of values,
X is the values in the first set of data,
Y is the values in the second set of data, and
n is the total number of data pairs.

Analyze the diagram below:

The Pearson correlation coefficient, r, can take a range of values from +1 to -1.

A value greater than 0 indicates a positive correlation; that is, as the value of one
variable increases, so does the value of the other variable.

A value less than 0 indicates a negative correlation; that is, as the value of one variable
increases, the value of the other variable decreases.

A value of 0 indicates that there is no linear correlation between the two variables.
• The direction in which the scattered points run tells the direction of the correlation that
exists between the variables.

Explore the Correlation Scale.

The stronger the association of the two variables, the closer the Pearson correlation
coefficient, r, will be to either +1 or -1 depending on whether the relationship is positive or
negative, respectively. See table below (Table of range of values).

PEARSON r              QUALITATIVE DESCRIPTION

±1                     Perfect
±0.75 to < ±1          Very high
±0.50 to < ±0.75       Moderately high
±0.25 to < ±0.50       Moderately low
> 0 to < ±0.25         Very low
0                      No correlation
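The table above can be turned into a small lookup function. This is a sketch only: the function name `describe_r` is made up for illustration, and the cut-offs are copied from the table.

```python
def describe_r(r):
    """Map a Pearson r value to the qualitative description in the table above."""
    size = abs(r)  # strength depends on the size of r, not on its sign
    if size > 1:
        raise ValueError("r must lie between -1 and +1")
    if size == 1:
        return "Perfect"
    if size >= 0.75:
        return "Very high"
    if size >= 0.50:
        return "Moderately high"
    if size >= 0.25:
        return "Moderately low"
    if size > 0:
        return "Very low"
    return "No correlation"

for r in (1.0, -0.8, 0.6, -0.3, 0.1, 0.0):
    direction = "positive" if r > 0 else "negative" if r < 0 else "none"
    print(f"r = {r:+.2f}: {describe_r(r)} (direction: {direction})")
```

The sign of r gives the direction of the correlation, while the absolute value gives its strength, so r = −0.8 and r = +0.8 are both described as very high.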
Different relationships and their correlation coefficients are shown in the diagram
below:

• Achieving a value of +1 or -1 means that all your data points are included
on the line of best fit – there are no data points that show any variation
away from this line. Values for r between +1 and -1 (for example, r = 0.7
or -0.3) indicate that there is variation around the line of best fit.
• The closer the value of r to 0 the greater the variation around the line of
best fit. It indicates the closeness of the point to the trend line.
• The closer the points are to the trend line, the stronger the relationship is.
LESSON 2

The correlation coefficient formula is used to find how strong the relationship between two
sets of data is. The formula returns a value between -1 and 1, where:
• 1 indicates a perfect positive relationship.
• -1 indicates a perfect negative relationship.
• A result of zero indicates no linear relationship at all.

Meaning
✓ A correlation coefficient of 1 means that for every increase in one variable, there is an
increase of a fixed proportion in the other. For example, shoe sizes go up in
(almost) perfect correlation with foot length.
✓ A correlation coefficient of -1 means that for every increase in one variable,
there is a decrease of a fixed proportion in the other. For example, the amount
of gas in a tank decreases in (almost) perfect correlation with speed.
✓ Zero means that an increase in one variable is not accompanied by any consistent increase
or decrease in the other. The two just aren’t linearly related.
Let’s find the value of the correlation coefficient from the table below.

SUBJECT AGE X GLUCOSE LEVEL Y


1 43 99
2 21 65
3 25 79
4 42 75
5 57 87
6 59 81

STEP 1: Make a chart. Use the given data, and add three more columns: xy, x², and y².

Subject Age x Glucose level y xy x² y²


1 43 99
2 21 65
3 25 79
4 42 75
5 57 87
6 59 81

STEP 2: Multiply x and y together to fill the xy column. For example, row 1 would be 43 × 99 =
4,257.

Subject Age x Glucose level y xy x² y²


1 43 99 4257
2 21 65 1365
3 25 79 1975
4 42 75 3150
5 57 87 4959
6 59 81 4779

STEP 3: Take the square of the numbers in the x column and put the results in the x² column.

Subject Age x Glucose level y xy x² y²


1 43 99 4257 1849
2 21 65 1365 441
3 25 79 1975 625
4 42 75 3150 1764
5 57 87 4959 3249
6 59 81 4779 3481

STEP 4: Take the square of the numbers in the y column and put the results in the y² column.

Subject Age x Glucose level y xy x² y²
1 43 99 4257 1849 9801
2 21 65 1365 441 4225
3 25 79 1975 625 6241
4 42 75 3150 1764 5625
5 57 87 4959 3249 7569
6 59 81 4779 3481 6561
The range of the correlation coefficient is from -1 to 1. Our result is 0.5298, which
means the two variables have a moderate positive correlation.
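The result 0.5298 can be reproduced from the completed chart. The sketch below adds up each column and substitutes the sums into the common computational form of the Pearson formula, which is assumed here to be the formula intended in these notes. The variable names are chosen for this illustration.

```python
from math import sqrt

age     = [43, 21, 25, 42, 57, 59]   # x
glucose = [99, 65, 79, 75, 87, 81]   # y
n = len(age)

# Column totals of the chart built in Steps 1 to 4.
sum_x  = sum(age)
sum_y  = sum(glucose)
sum_xy = sum(x * y for x, y in zip(age, glucose))
sum_x2 = sum(x * x for x in age)
sum_y2 = sum(y * y for y in glucose)

# r = [n*Sxy - Sx*Sy] / sqrt([n*Sx2 - Sx^2] * [n*Sy2 - Sy^2])
numerator = n * sum_xy - sum_x * sum_y
denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator

print(sum_x, sum_y, sum_xy, sum_x2, sum_y2)   # 247 486 20485 11409 40022
print(round(r, 4))                            # 0.5298
```

The column totals are ∑x = 247, ∑y = 486, ∑xy = 20485, ∑x² = 11409, and ∑y² = 40022, so r = 2868 / √(7445 × 3936) ≈ 0.5298.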

Assumptions
For the Pearson r correlation, both variables should be normally distributed (normally
distributed variables have a bell-shaped curve). Other assumptions include linearity and
homoscedasticity. Linearity assumes a straight-line relationship between each of the two
variables and homoscedasticity assumes that data are equally distributed about the
regression line.

JUST EVALUATE

I. Directions: Calculate r and make a generalization regarding the information that you get
from the computed correlation coefficient for each of the following:
a. ∑X = 225, ∑Y = 22, ∑X² = 9653, ∑Y² = 143, ∑XY = 651, n = 6
b. ∑X = 32, ∑Y = 1105, ∑X² = 220, ∑Y² = 364525, ∑XY = 3402, n = 6
c. ∑X = 180, ∑Y = 147, ∑X² = 6914, ∑Y² = 5273, ∑XY = 4013, n = 7
II. Directions: Solve the Problem.
The following are the heights of fathers and their eldest sons, in inches:
Heights of the Father 71 69 67 68 68 66 70 72 65 60
Heights of the Eldest Son 71 69 69 65 66 63 68 70 60 58

QUESTION: Do the data support the hypothesis that height is hereditary? Explain.
Accompany your explanation with statistical computations.

III. Directions: Read the statement carefully and choose the best answer.
For items 1 – 5. Complete the table below.

Consider the scores obtained in Math(X) and Statistics (Y) subjects by 10 students.
Observation  Math Score (X)  Stat Score (Y)  X²    Y²    XY
1            5               2               25    4     10
2            8               7               64    49    56
3            10              8               100   64    80
4            12              9               144   81    108
5            12              10              144   100   120
6            14              12              196   144   168
7            15              14              225   196   210
8            16              10              256   100   160
9            18              16              324   256   288
10           20              12              400   144   240
Sum

1. The ∑X² is equal to ________.


a. 1118 b. 1138 c. 1878 d. 1873
2. Find ∑XY.
a. 1440 b. 1040 c. 1400 d. 1140
3. How many respondents are being observed?
a. 20 b. 12 c. 10 d. 6
4. Based on the given data, solve for the Pearson’s correlation coefficient.
a. 0.78 b. 0.87 c. 0.86 d. 0.76
5. Evaluate what conclusion can be derived from the result of r obtained in the data.
a. There is no relationship between math scores and statistics scores of the
students.
b. There is a strong negative relationship between math scores and statistics scores of
the students.
c. There is a moderately positive relationship between math scores and statistics
scores of the students.
d. There is a strong positive relationship between math scores and statistics scores of
the students.
