GEC 410

Part II
Dr Agarana M.C.

Hypothesis Testing
What Is a Hypothesis?

• A hypothesis is a prediction that can be tested.
• A hypothesis is a specific, testable prediction.
• A hypothesis describes, in concrete terms, what
you expect will happen in a certain
circumstance.
• A hypothesis is a statement about the value
of a population parameter.
Examples of Hypotheses
• The average height of female students and
that of male students in GEC 410 class are the
same
• 70% of Chemical Engineering students are
eligible for the exam
• The mean age of students in the GEC class is 20
• 90% of Covenant University students are born
again
Large Population
• When the population of interest is large, we
take a sample from the population.
• For instance, if the hypothesis concerns the mean
monthly income of all Engineering lecturers in
Nigerian universities, it would not be feasible
to study all the items, or persons, in the
population.
• So we take an appropriate sample from the
population.
Illustration
• Imagine you have GEC test 2 tomorrow and
you decided to stay out late tonight and see a
movie with friends. You know that when you
study the night before a test, you get good
grades, generally speaking.
• What do you think will happen on tomorrow's
test?
• When you answer this question, you form a
hypothesis.
• Your hypothesis may be:
'If not studying lowers test performance and I
do not study, then I will get a low grade on the
test.'
Why Hypothesis?
• A hypothesis is used in an experiment to
define the relationship between two
variables.
• The purpose of a hypothesis is to find the
answer to a question.
• A formalized hypothesis will force us to think
about what results we should look for in an
experiment
• The first variable is called the independent
variable. This is the part of the experiment that
can be changed or controlled. It can be considered
the cause of any changes in the outcome.
• The second variable is called the dependent
variable. This is the part of the experiment that is the
outcome.
• The independent variable in our previous
example is not studying for a test.
• The dependent variable that you are using to
measure outcome is your test score
A hypothesis should always:
• Explain what you expect to happen
• Be clear and understandable
• Be testable
• Be measurable
• And contain an independent and dependent
variable
What is Hypothesis Testing?
• Hypothesis testing is a procedure based on
sample evidence and probability theory used
to determine whether the hypothesis is a
reasonable statement and should not be
rejected, or is unreasonable and should be
rejected.
• The objective of hypothesis testing is to check
the validity of a statement about a population
parameter.
The 7 Steps of Statistical Hypothesis
Testing

• Step 1: State the Null Hypothesis
• Step 2: State the Alternative Hypothesis
• Step 3: Set α (Level of significance)
• Step 4: Collect Data
• Step 5: Calculate a test statistic
• Step 6: Construct Acceptance / Rejection regions
(Determine the decision rule)
• Step 7: Based on steps 5 and 6, draw a conclusion about
H0. (Make a decision)
State the Null Hypothesis
• The null hypothesis can be thought of as the
opposite of the "guess" the researcher made.
• The null hypothesis is a tentative assumption made
about the value of a population parameter.
• Usually it is a statement that the population
parameter has a specific value.
• The hypothesis to be tested is the null hypothesis.
• The null hypothesis is set up for the purpose of
either rejecting it or failing to reject it.
• It is designated H0.
Examples
• 1) The packaging department of General Foods
Corporation is concerned that, on average,
boxes of groundnuts are overweight. The
groundnuts are packed in 500 g boxes. So
H0: μ ≤ 500
H1: μ > 500
Note that the inequality sign in H1 points to the
region of rejection in the upper tail.
• 2) An automobile leasing company wants the
purchased tires to average, say, 70,000 km of
wear under normal usage. They will therefore
reject a shipment of tires if tests reveal that the
life of the tires is significantly below 70,000 km
on average. So
H0: μ ≥ 70,000
H1: μ < 70,000
• Note that the inequality sign in H1 points to the left. The
rejection region is therefore in the left tail.
• 3) The efficiency ratings of Boeing employees at the
Seattle plant have been normally distributed over a period
of many years. The arithmetic mean of the
distribution is 300, and the standard deviation is 16.
Recently, however, young employees have been hired
and new training and production methods
inaugurated. Using the 0.01 level of significance, we
want to test the hypothesis that the mean is still 300.
So,
• H0: μ = 300
• H1: μ ≠ 300
State the Alternative Hypothesis
• The Alternative Hypothesis is a statement that
will be accepted if our sample data provide us
with ample evidence that the Null hypothesis
is false.
• It usually describes what you will believe if you
reject the null hypothesis
• It is designated H1.
Set α (Level of significance)
• The third step, after setting up the null
hypothesis and the alternative hypothesis, is to
determine the level of significance.
• The level of significance is the probability of
rejecting the null hypothesis when it is
actually true.
• It is designated α.
• It is also referred to as the level of risk
• It is the risk you take of rejecting the Null
Hypothesis when it is really true.
TYPE I and TYPE II errors
• Type I error: Rejecting the null hypothesis, H0,
when it is actually true.
• Type II error: Failure to reject the null
hypothesis when it is actually false.
It is important to set α before the experiment
(a priori), because the Type I error is usually
regarded as the more grievous error to make.
The typical value of α is 0.05, which corresponds to a 95%
confidence level. For this course we will assume
α = 0.05 or 0.01.
Collect Data

• Remember the importance of recognizing
whether data are collected through an
experimental design or an observational study.
• Take a sample
Calculate a test statistic

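• The standard normal distribution, using the
statistic z, is applied for large-sample tests of
means and proportions.
Testing a hypothesis about the population mean
(a) If the standard deviation of the population, σ, is known,
the test statistic z is computed by:
$$z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$$
where X̄ is the sample mean, μ is the hypothesized population mean, and n is the sample size.
(b) If σ is unknown and n ≥ 30, substitute the sample standard deviation, s, for σ:
$$z = \frac{\bar{X} - \mu}{s / \sqrt{n}}$$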
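The z statistic for a single mean can be sketched in a few lines of Python. The function name `z_one_mean` and the numbers are hypothetical, chosen only to illustrate the arithmetic; they are not taken from the course material.

```python
import math

def z_one_mean(sample_mean, mu0, sd, n):
    """z statistic for one mean.

    sd is the population standard deviation if it is known,
    otherwise the sample standard deviation (large samples, n >= 30).
    """
    return (sample_mean - mu0) / (sd / math.sqrt(n))

# Hypothetical numbers, for illustration only
z = z_one_mean(sample_mean=52.0, mu0=50.0, sd=8.0, n=64)
print(round(z, 2))  # 2.0
```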
Hypothesis testing: Two means,
large samples
• The objective here is to determine whether or
not there is a difference between two
population means using large samples (30 or
more).
• The formula for z is:
$$z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$
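As a rough illustration, the two-sample statistic can be computed as follows. The function name `z_two_means` and the sample figures are assumptions made up for this sketch.

```python
import math

def z_two_means(mean1, mean2, s1, s2, n1, n2):
    """z statistic for the difference between two means (large samples)."""
    standard_error = math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
    return (mean1 - mean2) / standard_error

# Hypothetical numbers, for illustration only
z = z_two_means(mean1=75.0, mean2=72.0, s1=6.0, s2=8.0, n1=36, n2=64)
print(round(z, 2))  # 2.12
```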
Hypothesis Testing about a single
proportion
• A proportion is a fraction, ratio, or percent of a
population that has a particular trait.
• The formula for determining the z value is
$$z = \frac{p - P}{\sqrt{\dfrac{P(1 - P)}{n}}}$$
Where:
• p is the sample proportion
• n is the size of the sample
• P is the population proportion
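A minimal sketch of the single-proportion test statistic, using the symbols defined above (p for the sample proportion, P for the hypothesized population proportion). The function name and the counts are hypothetical.

```python
import math

def z_one_proportion(x, n, P):
    """z statistic for one proportion.

    x: number in the sample with the trait
    n: sample size
    P: hypothesized population proportion
    """
    p = x / n  # sample proportion
    return (p - P) / math.sqrt(P * (1 - P) / n)

# Hypothetical numbers: 60 of 100 sampled items have the trait, H0 says P = 0.5
z = z_one_proportion(x=60, n=100, P=0.5)
print(round(z, 2))  # 2.0
```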
Hypothesis testing about two
proportions
• The following formula is used to compute z
$$z = \frac{p_1 - p_2}{\sqrt{\dfrac{p_c(1 - p_c)}{n_1} + \dfrac{p_c(1 - p_c)}{n_2}}}$$
• Where
• n1 is the total number in the first sample
• n2 is the total number in the second sample
• pc is the pooled estimate of the population proportion,
• Where:
$$p_c = \frac{x_1 + x_2}{n_1 + n_2}$$
• x1 is the number possessing the trait in the first sample
• x2 is the number possessing the trait in the second sample
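The pooled two-proportion statistic might be computed as in the sketch below. It assumes the sample proportions are p1 = x1/n1 and p2 = x2/n2; the function name and the counts are invented for illustration.

```python
import math

def z_two_proportions(x1, n1, x2, n2):
    """z statistic for the difference between two proportions (pooled)."""
    p1 = x1 / n1                   # first sample proportion
    p2 = x2 / n2                   # second sample proportion
    pc = (x1 + x2) / (n1 + n2)     # pooled estimate of the population proportion
    standard_error = math.sqrt(pc * (1 - pc) / n1 + pc * (1 - pc) / n2)
    return (p1 - p2) / standard_error

# Hypothetical counts, for illustration only
z = z_two_proportions(x1=60, n1=100, x2=40, n2=100)
print(round(z, 2))  # 2.83
```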
Construct Acceptance / Rejection regions
(Determine the decision rule)
• The decision rule states the condition or
conditions under which the null hypothesis is
rejected.
• Based on the sampling distribution, if the
computed value of z falls in the rejection
region, H0 is rejected. Otherwise H0
is not rejected.
• Critical value: the value that separates the
region where H0 is rejected from the region
where it is not rejected.
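One way to express the decision rule in code is sketched below, using `statistics.NormalDist` from the Python standard library to obtain the critical value of z. The helper names `critical_z` and `reject_h0` and the tail labels are assumptions of this sketch.

```python
from statistics import NormalDist

def critical_z(alpha, tail):
    """Positive critical value of z for tail in {'two', 'upper', 'lower'}."""
    if tail not in ("two", "upper", "lower"):
        raise ValueError("tail must be 'two', 'upper' or 'lower'")
    if tail == "two":
        # alpha is split equally between the two tails
        return NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().inv_cdf(1 - alpha)

def reject_h0(z, alpha, tail):
    """True if the computed z falls in the rejection region."""
    c = critical_z(alpha, tail)
    if tail == "two":
        return abs(z) > c
    if tail == "upper":
        return z > c
    return z < -c  # lower tail

print(round(critical_z(0.05, "two"), 2))    # 1.96
print(round(critical_z(0.01, "upper"), 2))  # 2.33
print(reject_h0(2.5, 0.05, "two"))          # True
```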
Make a Decision
• Based on steps 5 and 6, draw a conclusion
about H0.
Problem 1: Two-Tailed Test

An inventor has developed a new, energy-efficient
lawn mower engine. He claims that the
engine will run continuously for 300 minutes on
a single gallon of regular gasoline. From his stock
of 2000 engines, the inventor selects a simple
random sample of 50 engines for testing. The
engines run for an average of 295 minutes, with
a standard deviation of 20 minutes.
• Test the null hypothesis that the mean run
time is 300 minutes against the alternative
hypothesis that the mean run time is not 300
minutes. Use a 0.05 level of significance.
(Assume that run times for the population of
engines are normally distributed.)
• (Note: a 0.05 level of significance corresponds to a 95% confidence level.)
SOLUTION
• The solution to this problem takes four steps:
(1) state the hypotheses, (2) formulate an
analysis plan, (3) analyze sample data, and (4)
interpret results. We work through these steps
as follows:
• State the hypotheses. The first step is to state
the null hypothesis and an alternative
hypothesis.
• Null hypothesis: μ = 300
Alternative hypothesis: μ ≠ 300
• Note that these hypotheses constitute a two-
tailed test.
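The remaining steps for Problem 1 could be carried out as in the sketch below. It assumes the large-sample z test described earlier is appropriate (n = 50, so the sample standard deviation stands in for σ); this is one reasonable reading of the problem, not necessarily the solution intended in the course.

```python
import math
from statistics import NormalDist

# Figures from the problem statement
n, sample_mean, s, mu0, alpha = 50, 295.0, 20.0, 300.0, 0.05

standard_error = s / math.sqrt(n)
z = (sample_mean - mu0) / standard_error

# Two-tailed test: alpha is split between the two tails
critical = NormalDist().inv_cdf(1 - alpha / 2)
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
reject = abs(z) > critical

print(round(z, 2), round(critical, 2), reject)  # -1.77 1.96 False
```

Because |z| is smaller than the critical value, the computed z falls outside the rejection region, so under these assumptions the null hypothesis that the mean run time is 300 minutes is not rejected at the 0.05 level.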
Problem 2: One-Tailed Test
Bon Air Elementary School has 1000 students.
The principal of the school thinks that the
average IQ of students at Bon Air is at least 110.
To prove her point, she administers an IQ test to
20 randomly selected students. Among the
sampled students, the average IQ is 108 with a
standard deviation of 10.
Based on these results, should the principal
accept or reject her original hypothesis? Assume
a significance level of 0.01. (Assume that test
scores in the population of students are normally
distributed.)
SOLUTION
• The solution to this problem takes four steps:
(1) state the hypotheses, (2) formulate an
analysis plan, (3) analyze sample data, and (4)
interpret results. We work through those steps
below:
• State the hypotheses. The first step is to state
the null hypothesis and an alternative
hypothesis.
• Null hypothesis: μ >= 110
Alternative hypothesis: μ < 110
• Note that these hypotheses constitute a one-
tailed test.
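A possible continuation for Problem 2 is sketched below. Since n = 20 is below the large-sample threshold of 30 used in this course, the sketch assumes a Student t statistic with n − 1 = 19 degrees of freedom instead of z; the critical value 2.539 is read from a standard t table (one tail, α = 0.01, 19 degrees of freedom) and is hard-coded as an assumption.

```python
import math

# Figures from the problem statement
n, sample_mean, s, mu0 = 20, 108.0, 10.0, 110.0

# Assumption: one-tailed critical value from a standard t table
# (alpha = 0.01, degrees of freedom = n - 1 = 19)
T_CRITICAL = 2.539

t = (sample_mean - mu0) / (s / math.sqrt(n))

# H1 is mu < 110, so the rejection region is in the lower (left) tail
reject = t < -T_CRITICAL

print(round(t, 3), reject)  # -0.894 False
```

Under these assumptions the statistic does not fall in the rejection region, so the null hypothesis μ ≥ 110 is not rejected at the 0.01 level.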
SIMPLE LINEAR REGRESSION
Univariate Vs Bivariate Data

• Univariate data. When we conduct a study
that looks at only one variable, we say that we
are working with univariate data.
• For example,
• Suppose we conducted a survey to estimate
the average weight of some mechanical tools.
Since we are only working with one variable
(weight), we would be working with univariate
data.
• Bivariate data. When we conduct a study that
examines the relationship between two
variables, we are working with bivariate data.
Example:
• Suppose we conducted a study to see if there
were a relationship between courses of study
and gender of students offering such courses.
Since we are working with two variables
(course of study and gender), we would be
working with bivariate data.
Differences between univariate and bivariate data.

[Table: Univariate vs Bivariate data]
REGRESSION
• Regression provides the line that "best" fits
the data. This line can then be used to
examine how the response variable changes
as the predictor variable changes.
• Regression predicts the value of a response
variable (y) for a given value of the predictor variable (x).
What is the origin of the word Regression?
• The word regression was first used by
Sir Francis Galton in 1877 in his study of
heredity. He found that the heights of the
descendants of tall parents tended to regress
(i.e., go back) towards the average height of
the population.
• He called the Mathematical line that he
developed to explain the relationship between
the height of children and the height of their
parents the line of regression.
What are the types of Regressions?

• Linear Regression
• Logistic Regression
• Polynomial Regression
• Stepwise Regression
• Ridge Regression
• Lasso Regression
• ElasticNet Regression
Types of Linear Regression

• Simple Linear Regression
• Multiple Linear Regression
• Ordinary Least Squares Regression
What is simple linear regression?

• Simple linear regression examines the linear
relationship between two continuous
variables: one response (y) and one predictor
(x). When the two variables are related, it is
possible to predict a response value from a
predictor value with better than chance
accuracy.
• In simple linear regression, we predict scores on
one variable from the scores on a second
variable. The variable we are predicting is called
the criterion (response, target or dependent)
variable and is referred to as Y. The variable we
are basing our predictions on is called the
predictor (independent) variable and is referred
to as X.
• What is multiple linear regression?
• Multiple linear regression examines the linear
relationships between one continuous
response and two or more predictors
• What is ordinary least squares regression?
• In ordinary least squares (OLS) regression, the
estimated equation is calculated by
determining the equation that minimizes the
sum of the squared distances between the
sample's data points and the values predicted
by the equation.
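To make the least squares criterion concrete, the sketch below compares the sum of squared distances for two candidate lines on a small, made-up data set. The data and the function name `sum_squared_errors` are hypothetical; for this data the least squares line works out to Y = 2.2 + 0.6X, and any other line gives a larger sum.

```python
def sum_squared_errors(xs, ys, a, b):
    """Sum of squared vertical distances between each Y and the line a + b*X."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

sse_ols = sum_squared_errors(xs, ys, a=2.2, b=0.6)    # least squares line for this data
sse_other = sum_squared_errors(xs, ys, a=2.0, b=0.7)  # an arbitrary alternative line

print(round(sse_ols, 2), round(sse_other, 2))  # 2.4 2.55
```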
SIMPLE LINEAR REGRESSION
• In this course we shall concentrate on Simple
linear regression analysis.

• Dependent variable: The variable that is being
predicted or estimated.
• Independent variable: A variable that
provides the basis for estimation. It is the
predictor variable.
What is Regression Analysis?

Regression analysis, generally, is a form of
predictive modelling technique which
investigates the relationship between a
dependent (target or criterion) variable and
one or more independent (predictor) variables. This
technique is used for forecasting, time series
modelling and finding the causal-effect
relationship between the variables.
• For example, the relationship between rash
driving and the number of road accidents by a
driver is best studied through regression.
• Regression analysis is an important tool for
modelling and analyzing data. Here, we fit a
curve / line to the data points in such a
manner that the distances of the data points
from the curve or line are minimized.
• Regression analysis is a statistical tool for the
investigation of relationships between
variables.
• Usually, the investigator seeks to ascertain the
causal effect of one variable upon another
For example:
• (i) the effect of a load increase upon vibration
of plate.
• (ii) the effect of changes in velocity upon the
momentum of a body
Example

• A linear regression model attempts to explain the
relationship between two or more variables
using a straight line. Consider the data obtained
from the Chemical Engineering Laboratory, where
students carrying out an experiment reason that
the yield of the chemical process is related to the
reaction temperature (see the table below).
SCATTER DIAGRAM
• This is a chart that portrays the relationship
between two variables of interest.
• A useful first step in looking at the relationship
between two variables is to portray the
information in a scatter diagram.
Exercise
• Draw the scatter diagram for the above
example.
LEAST SQUARES PRINCIPLE
• Determining a regression equation by
minimizing the sum of the squares of the
vertical distances between the actual Y values
and the predicted values of Y.
• This method gives what is commonly referred
to as the best-fitting straight line.
• This Mathematical method eliminates
subjective judgements in an attempt to
determine the regression line
• Next we are going to develop an equation to
express the relationship between the two
variables, x and y.
• Also we want to estimate the value of y based
on the values of x.
• This equation is called regression equation.
• The general form of the regression equation is:
Y = a + bX
Where,
Y - is the predicted value of the Y variable for a selected X value
a - is the Y-intercept. It is the estimated value of Y when X=0.
It is the estimated value of Y where the regression line crosses
the Y-axis when X is zero.
b - is the slope of the line, or the average change in Y for each
change of one unit (either increase or decrease) in the
independent variable X.
X - is any value of the independent variable that is selected.
• The values of ‘a’ and ‘b’ in the regression
equation are called the regression coefficients.
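• The formulas for ‘b’ and ‘a’ are as follows:
$$b = \frac{n\left(\sum XY\right) - \left(\sum X\right)\left(\sum Y\right)}{n\left(\sum X^2\right) - \left(\sum X\right)^2}$$
$$a = \frac{\sum Y}{n} - b\,\frac{\sum X}{n}$$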
• Where,
• X is a value of the independent variable
• Y is a value of the dependent variable
• n is the number of items in the sample
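The formulas for b and a translate directly into code. The sketch below uses a small hypothetical data set, not the laboratory table referred to in the example; the function name `regression_coefficients` is likewise an assumption.

```python
def regression_coefficients(xs, ys):
    """Return (a, b) for the least squares line Y = a + bX."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * sum_x / n
    return a, b

# Hypothetical data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = regression_coefficients(xs, ys)
print(round(a, 2), round(b, 2))  # 2.2 0.6

# Estimated Y for a selected X value, here X = 6
print(round(a + b * 6, 2))       # 5.8
```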
Solution to above Example
• Develop the regression equation:
• i) Find b.
• ii) Find a.
• iii) Write out Y = a + bX
Drawing the line of regression
• How is the regression line placed on the
scatter diagram?
• The least squares equation Y = a + bX
is used to determine the least squares line of
regression to be drawn on the scatter diagram.
Exercise
• Draw the line of regression for the above
example.
CHI-SQUARE GOODNESS-OF-FIT TEST
NONPARAMETRIC
• Nonparametric or distribution-free tests are
Hypothesis tests concerned with nominal or
ordinal levels of measurement.
• Distribution-free test implies that these tests
are free of assumptions regarding the
distribution of the parent population.
• They are relatively easy to apply.
• Nominal-level data are the “lowest” type of
data. They can only be classified into
categories, such as APC, PDP, and “all others”,
or male and female.
• Ordinal level of measurement assumes that
one category is ranked higher than the next
one.
CHI-SQUARE TEST
• The chi-square goodness-of-fit test is one of
the most commonly used nonparametric tests.
It is appropriate for both nominal and ordinal
levels of data.
• The purpose of goodness-of-fit test is to
determine how well an observed set of data
fits an expected set of data.
• An example can best describe the hypothesis
testing situation.
• The test statistic follows the chi-square distribution
and is given as:
$$\chi^2 = \sum \left[ \frac{(f_o - f_e)^2}{f_e} \right]$$
• With k − 1 degrees of freedom, where k is the
number of categories.
• fo and fe are the observed and expected
frequencies, respectively, in a particular
category.
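A short sketch of the goodness-of-fit computation follows. The observed and expected frequencies are invented for illustration, and the critical value 7.815 (α = 0.05, 3 degrees of freedom) is taken from a standard chi-square table and hard-coded as an assumption.

```python
def chi_square_statistic(observed, expected):
    """Chi-square statistic: sum over categories of (fo - fe)^2 / fe."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))

# Hypothetical frequencies for k = 4 categories
observed = [30, 20, 25, 25]
expected = [25, 25, 25, 25]

chi2 = chi_square_statistic(observed, expected)
df = len(observed) - 1  # k - 1 degrees of freedom

# Assumption: critical value from a standard chi-square table (alpha = 0.05, df = 3)
CRITICAL_05_DF3 = 7.815

print(chi2, df, chi2 > CRITICAL_05_DF3)  # 2.0 3 False
```

Here the computed statistic is below the critical value, so for these made-up frequencies the hypothesis that the observed data fit the expected data would not be rejected.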
