
Semester: Autumn, 2020 ASSIGNMENT NO. 2


ROLL NO: BY626671    Reg No: 19BZB00169
Program: B.Ed (1.5 Year)    Course Code: 8614
Course Name: Educational Statistics
…………………………………….…………………………………………………

Q No 1: Define hypothesis testing and explain the logic behind hypothesis testing.

Ans: Hypothesis Testing


It is usually impossible for a researcher to observe each individual in a population.
Therefore, he selects some individuals from the population as a sample and collects
data from that sample. He then uses the sample data to answer questions about the
population. For this purpose, he uses some statistical techniques.

Hypothesis testing is a statistical method that uses sample data to evaluate a
hypothesis about a population parameter (Gravetter & Wallnau, 2002). A
hypothesis test is usually used in the context of a research study. Depending on the
type of research and the type of data, the details of the hypothesis test will change
from one situation to another.

Hypothesis testing is a formalized procedure that follows a standard series of
operations. In this way a researcher has a standardized method for evaluating the
results of his research study. Other researchers will recognize and understand
exactly how the data were evaluated and how conclusions were drawn.

Logic of Hypothesis Testing:


According to Gravetter & Wallnau (2002), the logic underlying hypothesis testing is
as follows:

1. First, a researcher states a hypothesis about a population. Usually, the
hypothesis concerns the value of the population mean. For example, we
might hypothesize that the mean IQ for registered voters in Pakistan is
μ = 100.

2. Before a researcher actually selects a sample, he uses the hypothesis to
predict the characteristics that the sample should have. For example, if he
hypothesizes that the population mean IQ = 100, then he would predict that
the sample should have a mean around 100. It should be kept in mind that
the sample should be similar to the population, but there is always the chance
of a certain amount of sampling error.

3. Next, the researcher obtains a random sample from the population. For
example, he might select a random sample of n= 200 registered voters to
compute the mean IQ for the sample.

4. Finally, he compares the obtained sample data with the prediction that was
made from the hypothesis. If the sample mean is consistent with the
prediction, he will conclude that the hypothesis is reasonable. But if there is a
big difference between the data and the prediction, he will decide that the
hypothesis is wrong.

Four-Step Process for Hypothesis Testing:

The process of hypothesis testing goes through the following four steps.

(1): Stating the Hypothesis


The process of hypothesis testing begins by stating a hypothesis about the
unknown population. Usually, a researcher states two opposing hypotheses. And
both hypotheses are stated in terms of population parameters.

The first and most important of the two hypotheses is called the null hypothesis. A
null hypothesis states that the treatment has no effect. In general, the null hypothesis
states that there is no change, no effect, no difference; nothing has happened. The null
hypothesis is denoted by the symbol H0 (H stands for hypothesis and the subscript 0
denotes zero effect).

The null hypothesis (H0) states that in the general population there is no change,
no difference, or no relationship. In an experimental study, the null hypothesis (H0)
predicts that the independent variable (treatment) will have no effect on the
dependent variable for the population.

The second hypothesis is simply the opposite of the null hypothesis and is called the
scientific or alternative hypothesis. It is denoted by H1. This hypothesis states that
the treatment has an effect on the dependent variable.

The alternative hypothesis (H1) states that there is a change, a difference, or a
relationship for the general population. In an experiment, H1 predicts that the
independent variable (treatment) will have an effect on the dependent variable.

(2): Setting Criteria for the Decision

In common practice, a researcher uses the data from the sample to evaluate the
credibility of the null hypothesis. The data will either support or negate the null
hypothesis. To formalize the decision process, a researcher uses the null hypothesis
to predict exactly what kind of sample should be obtained if the treatment has no
effect. In particular, the researcher will examine all the possible sample means that
could be obtained if the null hypothesis is true.

(3): Collecting Data and Computing Sample Statistics

The next step in hypothesis testing is to obtain the sample data. The raw data are
then summarized with appropriate statistics such as the mean and standard
deviation. It is then possible for the researcher to compare the sample mean with
the null hypothesis.

(4): Making a Decision

In the final step the researcher decides, in the light of the analysis of the data,
whether to accept or reject the null hypothesis. If the analysis of the data supports
the null hypothesis, he accepts it; otherwise he rejects it.
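As a minimal illustration of these four steps, the following Python sketch tests the hypothesized mean IQ of 100 used in the earlier example. The sample values and the choice of a one-sample t test are assumptions made for illustration, not part of the original assignment.

```python
import numpy as np
from scipy import stats

# Hypothetical IQ scores for a small random sample of registered voters
# (invented for illustration).
sample = np.array([102, 98, 110, 95, 105, 99, 101, 108, 97, 103])

mu_0 = 100      # Step 1: H0 states that the population mean IQ is 100
alpha = 0.05    # Step 2: criterion for the decision

# Step 3: collect data and compute sample statistics
sample_mean = sample.mean()
std_error = sample.std(ddof=1) / np.sqrt(len(sample))

# Step 4: compare the sample with the prediction made from H0.
# A one-sample t test is used here because the population standard
# deviation is unknown.
t_stat = (sample_mean - mu_0) / std_error
p_value = 2 * stats.t.sf(abs(t_stat), df=len(sample) - 1)

if p_value < alpha:
    print("Reject H0: the sample mean is unlikely if the population mean is 100")
else:
    print("Fail to reject H0: the data are consistent with a mean of 100")
```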

Q No 2: Explain the types of ANOVA. Describe possible situations in which each type
should be used.

Ans: An ANOVA test is a way to find out if survey or experiment results
are significant. In other words, it helps you to figure out if you need to reject the
null hypothesis or accept the alternate hypothesis.

Basically, you’re testing groups to see if there’s a difference between them.
Examples of when you might want to test different groups:

• A group of psychiatric patients is trying three different therapies:
counseling, medication and biofeedback. You want to see if one therapy is
better than the others.
• A manufacturer has two different processes to make light bulbs. They want
to know if one process is better than the other.


• Students from different colleges take the same exam. You want to see if one
college outperforms the other.
What Does “One-Way” or “Two-Way” Mean?
One-way or two-way refers to the number of independent variables (IVs) in your
Analysis of Variance test.
• One-way has one independent variable (with two levels). For
example: brand of cereal.
• Two-way has two independent variables (each can have multiple levels). For
example: brand of cereal, calories.
What are “Groups” or “Levels”?
Groups or levels are different groups within the same independent variable. In the
above example, your levels for “brand of cereal” might be Lucky Charms, Raisin
Bran, Cornflakes — a total of three levels. Your levels for “Calories” might be:
sweetened, unsweetened — a total of two levels.
Types of Tests:
There are two main types: one-way and two-way. Two-way tests can be with or
without replication.

• One-way ANOVA between groups: used when you want to test two or more
groups to see if there’s a difference between them.
• Two way ANOVA without replication: used when you have one
group and you’re double-testing that same group. For example, you’re
testing one set of individuals before and after they take a medication to
see if it works or not.
• Two way ANOVA with replication: Two groups, and the members of
those groups are doing more than one thing. For example, two groups
of patients from different hospitals trying two different therapies.
One Way ANOVA
A one way ANOVA is used to compare the means of two or more independent
(unrelated) groups using the F-distribution. The null hypothesis for the test is that
all the group means are equal. Therefore, a significant result means that at least two
of the means are unequal.
Examples of when to use a one way ANOVA

Situation 1: You have a group of individuals randomly split into smaller groups and
completing different tasks. For example, you might be studying the effects of tea
on weight loss and form three groups: green tea, black tea, and no tea.

Situation 2: Similar to situation 1, but in this case the individuals are split into
groups based on an attribute they possess. For example, you might be studying leg
strength of people according to weight. You could split participants into weight
categories (obese, overweight and normal) and measure their leg strength on a
weight machine.
Limitations of the One Way ANOVA:
A one way ANOVA will tell you that at least two groups were different from each
other, but it won’t tell you which groups were different. If your test returns a
significant F-statistic, you may need to run a post hoc test (like the Least Significant
Difference test) to tell you exactly which groups had a difference in means.
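As an illustration, the sketch below runs a one-way ANOVA on invented weight-loss data for the tea example above using SciPy; the numbers are assumptions, not real results.

```python
from scipy import stats

# Hypothetical weight-loss values (kg) for the three tea groups described above
green_tea = [2.1, 3.0, 2.5, 1.8, 2.9]
black_tea = [1.5, 1.9, 2.2, 1.1, 1.7]
no_tea    = [0.4, 0.9, 0.6, 1.0, 0.3]

f_stat, p_value = stats.f_oneway(green_tea, black_tea, no_tea)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")

# A significant F only says that at least two group means differ;
# a post hoc comparison (e.g. Tukey's HSD or an LSD test) is still
# needed to identify which particular groups differ.
```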
Two Way ANOVA
A Two Way ANOVA is an extension of the One Way ANOVA. With a One Way, you
have one independent variable affecting a dependent variable. With a Two Way
ANOVA, there are two independent variables. Use a two way ANOVA when you have
one measurement variable (i.e. a quantitative variable) and two nominal variables.
In other words, if your experiment has a quantitative outcome and you have two
categorical explanatory variables, a two way ANOVA is appropriate.
For example, you might want to find out if there is an interaction between income
and gender for anxiety level at job interviews. The anxiety level is the outcome, or
the variable that can be measured. Gender and Income are the two categorical
variables. These categorical variables are also the independent variables, which are
called factors in a Two Way ANOVA.
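As a hedged sketch of this setup, the following example runs a two-way ANOVA (with the gender × income interaction) on invented anxiety scores using statsmodels; the data values and variable names are assumptions for illustration only.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical data: anxiety score (quantitative outcome) with gender
# and income as the two categorical factors (values invented).
df = pd.DataFrame({
    "anxiety": [6.1, 5.8, 7.2, 4.9, 6.5, 5.2, 7.8, 6.0,
                6.3, 5.5, 7.0, 5.1],
    "gender":  ["M", "M", "F", "F", "M", "M", "F", "F",
                "M", "F", "F", "M"],
    "income":  ["low", "high", "low", "high", "low", "high", "low", "high",
                "low", "high", "low", "high"],
})

# Two-way ANOVA with the gender x income interaction
model = smf.ols("anxiety ~ C(gender) * C(income)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
```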

Q NO 3: What is the range of the correlation coefficient? Explain strong, moderate
and weak relationships.

Ans: The term ‘correlation coefficient’ was coined by Karl Pearson in 1896. Accordingly,
this statistic is over a century old and is still going strong. It is one of the most used
statistics today, second only to the mean. The correlation coefficient's weaknesses and
warnings of misuse are well documented. As a consulting statistician with 15 years of
practice, who also teaches continuing and professional studies for the Database
Marketing/Data Mining industry, I see too often that these weaknesses and warnings
are not heeded. Among the weaknesses, one that I have never seen addressed is that
the correlation coefficient interval [−1, +1] is restricted by the individual
distributions of the two variables being correlated. The purpose of this article is (1)
to introduce the effects the distributions of the two individual variables have on
the correlation coefficient interval and (2) to provide a procedure for calculating
an adjusted correlation coefficient, whose realized correlation coefficient interval
is often shorter than the original one.

The implication for marketers is that now they have the adjusted correlation
coefficient as a more reliable measure of the important ‘key-drivers’ of their
marketing models. In turn, this allows the marketers to develop more effective
targeted marketing strategies for their campaigns.

The correlation coefficient, denoted by r, is a measure of the strength of the
straight-line or linear relationship between two variables. The well-known
correlation coefficient is often misused, because its linearity assumption is not
tested. The correlation coefficient can – by definition, that is, theoretically –
assume any value in the interval between +1 and −1, including the end values +1 or
−1.

The following points are the accepted guidelines for interpreting the correlation
coefficient:

1. 0 indicates no linear relationship.


2. +1 indicates a perfect positive linear relationship – as one variable increases
in its values, the other variable also increases in its values through an exact
linear rule.
3. −1 indicates a perfect negative linear relationship – as one variable increases
in its values, the other variable decreases in its values through an exact linear
rule.
4. Values between 0 and 0.3 (0 and −0.3) indicate a weak positive (negative)
linear relationship through a shaky linear rule.
5. Values between 0.3 and 0.7 (0.3 and −0.7) indicate a moderate positive
(negative) linear relationship through a fuzzy-firm linear rule.
6. Values between 0.7 and 1.0 (−0.7 and −1.0) indicate a strong positive
(negative) linear relationship through a firm linear rule.

7. The value of r2, called the coefficient of determination and denoted R2, is
typically interpreted as ‘the percent of variation in one variable explained by
the other variable,’ or ‘the percent of variation shared between the two
variables.’ Good things to know about R2:

(a): It is the correlation coefficient between the observed and


modelled (predicted) data values.

(b): It can increase as the number of predictor variables in the model
increases; it does not decrease. Modellers may unwittingly think that
a ‘better’ model is being built, as they have a tendency to include more
(unnecessary) predictor variables in the model. Accordingly, an
adjustment of R2 was developed, appropriately called the adjusted R2.
The interpretation of this statistic is the same as that of R2, but it penalises
the statistic when unnecessary variables are included in the model.

(c): Specifically, the adjusted R2 adjusts the R2 for the sample size and
the number of variables in the regression model. Therefore, the
adjusted R2 allows for an ‘apples-to-apples’ comparison between
models with different numbers of variables and different sample sizes.
Unlike R2, the adjusted R2 does not necessarily increase, if a predictor
variable is added to a model.

(d): It is a first-blush indicator of a good model.

(e): It is often misused as the measure to assess which model produces


better predictions. The RMSE (root mean squared error) is the
measure for determining the better model. The smaller the RMSE
value, the better the model, viz., the more precise the predictions.

8. Linearity Assumption: the correlation coefficient requires that the underlying
relationship between the two variables under consideration is linear. If the
relationship is known to be linear, or the observed pattern between the two
variables appears to be linear, then the correlation coefficient provides a
reliable measure of the strength of the linear relationship. If the relationship
is known to be non-linear, or the observed pattern appears to be non-linear,
then the correlation coefficient is not useful, or at least questionable.
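To make the quantities in points 7(a)–(e) concrete, here is a minimal NumPy sketch (with invented data) that fits a one-predictor least-squares line and reports R2, the adjusted R2 and the RMSE.

```python
import numpy as np

# Hypothetical data: fit a one-predictor least-squares line and report
# R-squared, adjusted R-squared and RMSE (values invented).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])

slope, intercept = np.polyfit(x, y, 1)
y_hat = slope * x + intercept

ss_res = np.sum((y - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)       # total sum of squares
r2 = 1 - ss_res / ss_tot

n, p = len(y), 1                           # sample size, number of predictors
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
rmse = np.sqrt(ss_res / n)                 # root mean squared error

print(round(r2, 3), round(adj_r2, 3), round(rmse, 3))
```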

The calculation of the correlation coefficient for two variables, say X and Y,
is simple to understand. Let zX and zY be the standardized versions
of X and Y, respectively, that is, zX and zY are both re-expressed to have
means equal to 0 and standard deviations (s.d.) equal to 1. The re-
expressions used to obtain the standardized scores are in equations (1) and
(2):

zXi = [Xi − mean(X)] / s.d.(X)    (1)

zYi = [Yi − mean(Y)] / s.d.(Y)    (2)

The correlation coefficient is defined as the mean product of the paired
standardised scores (zXi, zYi), as expressed in equation (3):

rX,Y = Σ [zXi × zYi] / (n − 1)    (3)

where n is the sample size.

For a simple illustration of the calculation, consider the sample of five observations
in Table 1. Columns zX and zY contain the standardised scores of X and Y,
respectively. The last column is the product of the paired standardised scores. The
sum of these scores is 1.83. The mean of these scores (using the adjusted divisor
n − 1, not n) is 0.46. Thus, rX,Y = 0.46.
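A short NumPy sketch of this calculation is given below. Table 1 is not reproduced here, so the five paired observations are invented; the point is only to show equation (3) at work and that it matches NumPy's built-in Pearson correlation.

```python
import numpy as np

# Hypothetical paired sample of five observations (Table 1 is not
# reproduced in this document, so these values are invented).
X = np.array([12.0, 15.0, 11.0, 18.0, 14.0])
Y = np.array([200.0, 230.0, 190.0, 260.0, 215.0])

# Standardise with the sample standard deviation (divisor n - 1)
zX = (X - X.mean()) / X.std(ddof=1)
zY = (Y - Y.mean()) / Y.std(ddof=1)

r = np.sum(zX * zY) / (len(X) - 1)          # equation (3)
print(round(r, 2))
print(round(np.corrcoef(X, Y)[0, 1], 2))    # same value from NumPy's built-in
```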

Q No 4: Explain the chi-square independence test. In what situation should it be
applied?

Ans: Chi-Square Independence Test:


A chi-square (χ²) test of independence is the second important form of chi-square
test. It is used to explore the relationship between two categorical variables, each
of which can have two or more categories.
It determines whether there is a significant relationship between two nominal
(categorical) variables. The frequency of one nominal variable is compared across
the different values of the second nominal variable. The data can be displayed in an
R × C contingency table, where R is the number of rows and C is the number of
columns. For example, suppose a researcher wants to examine the relationship
between gender (male vs. female) and empathy (high vs. low). The researcher will
use the chi-square test of independence. If the null hypothesis is accepted, the
conclusion is that there is no relationship between gender and empathy. If the null
hypothesis is rejected, the conclusion is that there is a relationship between gender
and empathy (e.g. females tend to score higher on empathy and males tend to score
lower).

Being a non-parametric technique, the chi-square test of independence follows less
strict assumptions; still, there are some general assumptions which should be taken
care of:
(1): Random Sample – the sample should be selected using the simple random
sampling method.
(2): Variables – both variables under study should be categorical.
(3): Independent Observations – each person or case should be counted only once
and should not appear in more than one category or group. The data from one
subject should not influence the data from another subject.
(4): Expected Frequencies – if the data are displayed in a contingency table, the
expected frequency count for each cell of the table should be at least 5.

The two chi-square tests are sometimes confused, but they are quite different from
each other.
• The chi-square test for independence compares two sets of data to see if
there is a relationship.
• The chi-square goodness-of-fit test checks how well one categorical variable
fits a hypothesized distribution.

Chi-Square Test for Independence:


This lesson explains how to conduct a chi-square test for independence. The test
is applied when you have two categorical variables from a single population. It is
used to determine whether there is a significant association between the two
variables.

For example, in an election survey, voters might be classified by gender (male or
female) and voting preference (Democrat, Republican, or Independent). We could
use a chi-square test for independence to determine whether gender is related to
voting preference. The sample problem at the end of the lesson considers this
example.

When to Use Chi-Square Test for Independence

The test procedure described in this lesson is appropriate when the following
conditions are met:

▪ The sampling method is simple random sampling.


▪ The variables under study are each categorical.
▪ If sample data are displayed in a contingency table, the expected frequency
count for each cell of the table is at least 5.

This approach consists of four steps: (1) state the hypotheses, (2) formulate an
analysis plan, (3) analyze sample data, and (4) interpret results.

State the Hypotheses


Suppose that Variable A has r levels, and Variable B has c levels. The null
hypothesis states that knowing the level of Variable A does not help you predict
the level of Variable B. That is, the variables are independent.

Ho: Variable A and Variable B are independent.


Ha: Variable A and Variable B are not independent.

The alternative hypothesis is that knowing the level of Variable A can help you
predict the level of Variable B.

Note: Support for the alternative hypothesis suggests that the variables are
related; but the relationship is not necessarily causal, in the sense that one variable
"causes" the other.

Formulate an Analysis Plan


The analysis plan describes how to use sample data to accept or reject the null
hypothesis. The plan should specify the following elements.

▪ Significance level. Often, researchers choose significance levels equal to


0.01, 0.05, or 0.10; but any value between 0 and 1 can be used.
▪ Test method. Use the chi-square test for independence to determine
whether there is a significant relationship between two categorical variables.

Analyze Sample Data


Using sample data, find the degrees of freedom, expected frequencies, test
statistic, and the P-value associated with the test statistic. The approach described
in this section is illustrated in the sample problem at the end of this lesson.

▪ Degrees of freedom. The degrees of freedom (DF) is equal to:

DF = (r - 1) * (c - 1)

where r is the number of levels for one categorical variable, and c is the
number of levels for the other categorical variable.

▪ Expected frequencies. The expected frequency counts are computed


separately for each level of one categorical variable at each level of the other
categorical variable. Compute r * c expected frequencies, according to the
following formula.

Er,c = (nr * nc) / n

where Er,c is the expected frequency count for level r of Variable A and
level c of Variable B, nr is the total number of sample observations at level r
of Variable A, nc is the total number of sample observations at level c of
Variable B, and n is the total sample size.

▪ Test statistic. The test statistic is a chi-square random variable (Χ2) defined
by the following equation.

Χ2 = Σ [ (Or,c - Er,c)2 / Er,c ]

where Or,c is the observed frequency count at level r of Variable A and


level c of Variable B, and Er,c is the expected frequency count at level r of
Variable A and level c of Variable B.

▪ P-value. The P-value is the probability of observing a sample statistic as


extreme as the test statistic. Since the test statistic is a chi-square, use
the Chi-Square Distribution Calculator to assess the probability associated
with the test statistic. Use the degrees of freedom computed above.
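As a hedged illustration of these quantities, the sketch below computes the degrees of freedom, expected frequencies, chi-square statistic and P-value for a small hypothetical 2 × 3 table; the counts are invented, and only the formulas above are taken from the text.

```python
import numpy as np
from scipy.stats import chi2

# Hypothetical 2 x 3 table of observed counts
# (rows = levels of Variable A, columns = levels of Variable B).
observed = np.array([[20, 30, 25],
                     [30, 20, 25]])

n = observed.sum()
row_totals = observed.sum(axis=1, keepdims=True)   # nr
col_totals = observed.sum(axis=0, keepdims=True)   # nc

expected = row_totals * col_totals / n             # Er,c = (nr * nc) / n
df = (observed.shape[0] - 1) * (observed.shape[1] - 1)
chi_sq = ((observed - expected) ** 2 / expected).sum()
p_value = chi2.sf(chi_sq, df)                      # upper-tail probability

print(df, round(chi_sq, 3), round(p_value, 4))
```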

Interpret Results
If the sample findings are unlikely, given the null hypothesis, the researcher rejects
the null hypothesis. Typically, this involves comparing the P-value to
the significance level, and rejecting the null hypothesis when the P-value is less
than the significance level.

Test Your Understanding


Problem
A public opinion poll surveyed a simple random sample of 1000 voters.
Respondents were classified by gender (male or female) and by voting preference
(Republican, Democrat, or Independent). Results are shown in the contingency
table below.

                  Voting Preferences
                Rep     Dem     Ind     Row total
Male            200     150      50       400
Female          250     300      50       600
Column total    450     450     100      1000

Is there a gender gap? Do the men's voting preferences differ significantly from the
women's preferences? Use a 0.05 level of significance.

Solution
The solution to this problem takes four steps: (1) state the hypotheses, (2)
formulate an analysis plan, (3) analyze sample data, and (4) interpret results. We
work through those steps below:

▪ State the hypotheses. The first step is to state the null hypothesis and an
alternative hypothesis.

Ho: Gender and voting preferences are independent.


Ha: Gender and voting preferences are not independent.

▪ Formulate an analysis plan. For this analysis, the significance level is 0.05.
Using sample data, we will conduct a chi-square test for independence.
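The remaining two steps (analyze sample data and interpret results) are not worked out numerically here; the following Python sketch carries them out with SciPy on the table above. It is an illustrative shortcut rather than the hand calculation the lesson intends.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts from the contingency table above
# (rows: Male, Female; columns: Rep, Dem, Ind).
observed = np.array([[200, 150, 50],
                     [250, 300, 50]])

chi_sq, p_value, df, expected = chi2_contingency(observed)
print(f"chi-square = {chi_sq:.2f}, df = {df}, p = {p_value:.4f}")

if p_value < 0.05:
    print("Reject H0: gender and voting preference are not independent")
else:
    print("Fail to reject H0: no evidence of a gender gap")
```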

Q No 5: Correlation is a prerequisite of regression analysis. Explain.

Ans: Regression
A correlation quantifies the degree and direction to which two variables are
related. It does not fit a line through the data points, and it is not concerned with
cause and effect. It does not matter which of the two variables is called dependent
and which is called independent.

On the other hand, regression finds the best line that predicts the dependent variable
from the independent variable. The decision of which variable is called dependent
and which is called independent is an important matter in regression, as we will get a
different best-fit line if we exchange the two variables, i.e. make the dependent
variable independent and the independent variable dependent. The line that best
predicts the independent variable from the dependent variable will not be the same
as the line that predicts the dependent variable from the independent variable.
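A small NumPy sketch (with invented data) illustrates this asymmetry: the correlation is identical whichever way round the variables are taken, but the two regression lines differ.

```python
import numpy as np

# Hypothetical paired data (invented) to illustrate the point above.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

# Correlation is symmetric: corr(X, Y) equals corr(Y, X).
print(np.corrcoef(X, Y)[0, 1], np.corrcoef(Y, X)[0, 1])

# Regression is not symmetric: the line for predicting Y from X differs
# from the line for predicting X from Y (they coincide only when r = ±1).
slope_y_on_x, _ = np.polyfit(X, Y, 1)   # best-fit slope for Y on X
slope_x_on_y, _ = np.polyfit(Y, X, 1)   # best-fit slope for X on Y
print(slope_y_on_x, 1 / slope_x_on_y)
```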

Let us start with the simple case of studying the relationship between two variables
X and Y. The variable Y is the dependent variable and the variable X is the independent
variable. We are interested in seeing how various values of the independent
variable X predict corresponding values of the dependent variable Y. This statistical
technique is called regression analysis. We can say that regression analysis is a
technique used to model the dependency of one dependent variable upon one
independent variable. The Merriam-Webster online dictionary defines regression as
a functional relationship between two or more correlated variables that is often
empirically determined from data and is used especially to predict values of one
variable when given values of the others. According to Gravetter & Wallnau (2002),
regression is a statistical technique for finding the best-fitting straight line for a set
of data, and the resulting straight line is called the regression line.

Objectives of Regression Analysis:


Regression analysis is used to explain variability in the dependent variable by means
of one or more independent variables, to analyze relationships among variables in
order to answer the question of how much the dependent variable changes with
changes in the independent variables, and to forecast or predict the value of the
dependent variable based on the values of the independent variable.

The primary objective of regression is to develop a relationship between a
response variable and the explanatory variable for the purpose of prediction. It
assumes that a functional relationship exists between the variables; where no such
relationship exists, alternative approaches are superior.

Why do we use Regression Analysis?


Regression analysis estimates the relationship between two or more variables and
is used for forecasting or for finding the cause-and-effect relationship between the
variables. There are multiple benefits of using regression analysis. These are as
follows:
i) It indicates the significant relationships between dependent and the
independent variables.
ii) It indicates the strength of impact of multiple independent variables on a
dependent variable.
iii) It allows us to compare the effects of variables measured on different
scales.

These benefits help a researcher to estimate and evaluate the best set of variables
to be used for building productive models.

Types of Regression
Commonly used types of regression are:
i) Linear Regression
It is the most commonly used type of regression. In this technique the dependent
variable is continuous, the independent variable can be continuous or discrete,
and the nature of the regression line is linear. Linear regression establishes a
relationship between the dependent variable (Y) and one or more independent
variables (X) using a best-fit straight line (also known as the regression line).
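As an illustration, here is a minimal SciPy sketch of simple linear regression on invented data (hours studied vs. exam score); the variable names and values are assumptions.

```python
from scipy import stats

# Hypothetical data (invented): hours studied (X) and exam score (Y).
hours = [1, 2, 3, 4, 5, 6, 7, 8]
score = [52, 55, 61, 64, 70, 74, 79, 83]

result = stats.linregress(hours, score)
print(f"regression line: Y = {result.intercept:.1f} + {result.slope:.2f} * X")
print(f"r = {result.rvalue:.3f}, p = {result.pvalue:.4f}")
```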

ii) Logistic Regression


Logistic regression is a statistical method for analyzing a dataset in which there are
one or more independent variables that determine an outcome. The outcome is
measured with a dichotomous (binary) variable. Like all regression analyses,
logistic regression is a predictive analysis. It is used to describe and explain the
relationship between one dependent binary variable and one or more nominal,
ordinal, interval or ratio level independent variables.
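A hedged sketch of logistic regression with scikit-learn on invented pass/fail data is shown below; the predictor, outcome and values are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data (invented): hours of revision (X) and a binary
# pass/fail outcome (0 = fail, 1 = pass).
X = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression().fit(X, y)

# Predicted probabilities of fail / pass for a student revising 2.2 hours
print(model.predict_proba([[2.2]]))
```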

iii) Polynomial Regression


It is a form of regression analysis in which the relationship between the independent
variable X and the dependent variable Y is modeled as an nth-degree polynomial in X.
This type of regression fits a non-linear relationship between the values of X and
the corresponding values of Y.
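For illustration, a small NumPy sketch fitting a 2nd-degree polynomial to invented data with a curved trend:

```python
import numpy as np

# Hypothetical data (invented) with a curved, roughly quadratic trend.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
Y = np.array([2.2, 5.1, 10.3, 16.8, 26.0, 37.1])

coeffs = np.polyfit(X, Y, deg=2)    # 2nd-degree polynomial in X
Y_hat = np.polyval(coeffs, X)       # fitted (non-linear) values

print(coeffs)                       # highest-degree coefficient first
print(np.round(Y_hat, 1))
```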

iv) Stepwise Regression


It is a method of fitting a regression model in which the choice of predictive variables
is carried out by an automatic procedure. In each step, a variable is considered for
addition to or subtraction from the set of explanatory variables based on some
pre-specified criterion. The general idea behind this procedure is that we build our
regression model from a set of predictor variables by entering and removing
predictors, in a stepwise manner, until there is no justifiable reason to enter or
remove any more.

v) Ridge Regression
It is a technique for analyzing multiple regression data that suffer from
multicollinearity (independent variables are highly correlated). When
multicollinearity occurs, least squares estimates are unbiased, but their variances
are large, so that they may be far from the true value. By adding a degree of bias
to the regression estimates, ridge regression reduces the standard errors.

vi) LASSO Regression


LASSO or lasso stands for Least Absolute Shrinkage and Selection Operator. It is a
method that performs both variable selection and regularization in order to
enhance the prediction accuracy and interpretability of the statistical model it
produces. This type of regression uses shrinkage. Shrinkage is where data values
are shrunk towards a central point, like the mean.

vii) Elastic Net Regression


This type of regression is a hybrid of lasso and ridge regression techniques. It is
useful when there are multiple features which are correlated.
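As a hedged illustration of these three shrinkage techniques, the sketch below fits ridge, LASSO and elastic net models from scikit-learn to invented data with two nearly collinear predictors; the penalty values (alpha) are arbitrary choices for the example, not recommendations.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

# Hypothetical data (invented) with two nearly collinear predictors,
# the situation in which these shrinkage methods are useful.
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = x1 + rng.normal(scale=0.05, size=50)      # almost a copy of x1
X = np.column_stack([x1, x2])
y = 3 * x1 + 2 * x2 + rng.normal(size=50)

for model in (Ridge(alpha=1.0), Lasso(alpha=0.1), ElasticNet(alpha=0.1)):
    model.fit(X, y)
    print(type(model).__name__, np.round(model.coef_, 2))
```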
