the sample should have a mean around 100. It should be kept in mind that
the sample should be similar to the population, but there is always a chance
of a certain amount of sampling error.
3. Next, the researcher obtains a random sample from the population. For
example, he might select a random sample of n = 200 registered voters and
compute the mean IQ for the sample.
4. Finally, he compares the obtained sample data with the prediction that was
made from the hypothesis. If the sample mean is consistent with the
prediction, he will conclude that the hypothesis is reasonable. But if there is
a big difference between the data and the prediction, he will decide that the
hypothesis is wrong.
The first and most important of the two hypotheses is called the null hypothesis. A null
hypothesis states that the treatment has no effect. In general, the null hypothesis
states that there is no change, no effect, no difference: nothing happened. The null
hypothesis is denoted by the symbol H0 (H stands for hypothesis and the 0 denotes
zero effect).
The null hypothesis (H0) states that in the general population there is no change,
no difference, or no relationship. In an experimental study, the null hypothesis (H0)
predicts that the independent variable (treatment) will have no effect on the
dependent variable for the population.
The second hypothesis is simply the opposite of the null hypothesis, and it is called
the scientific or alternative hypothesis. It is denoted by H1. This hypothesis states
that the treatment does have an effect on the dependent variable.
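The logic above can be made concrete with a small computation. Below is a minimal sketch, assuming a hypothesized population mean of 100 and invented sample scores, of how a one-sample t-test compares the sample data against the prediction made from H0:

# Hedged sketch: the sample values below are invented for illustration.
import numpy as np
from scipy import stats

sample = np.array([104, 98, 111, 95, 107, 102, 99, 110, 96, 105])

# H0: the population mean is 100 (no effect)
# H1: the population mean is not 100 (the treatment has an effect)
t_stat, p_value = stats.ttest_1samp(sample, popmean=100)

print(f"sample mean = {sample.mean():.1f}")
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
# A small p-value (e.g. below 0.05) means a big difference between the
# data and the prediction, so we would reject H0.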
• One-way ANOVA between groups: used when you want to test two or more
groups to see if there’s a difference between them.
• Two way ANOVA without replication: used when you have one
group and you’re double-testing that same group. For example, you’re
testing one set of individuals before and after they take a medication to
see if it works or not.
• Two way ANOVA with replication: Two groups, and the members of
those groups are doing more than one thing. For example, two groups
of patients from different hospitals trying two different therapies.
One Way ANOVA
A one way ANOVA is used to compare the means of two or more independent
(unrelated) groups using the F-distribution. The null hypothesis for the test is that
all of the group means are equal. Therefore, a significant result means that at least
two of the means are unequal.
Examples of when to use a one way ANOVA
Situation 1: You have a group of individuals randomly split into smaller groups and
completing different tasks. For example, you might be studying the effects of tea
on weight loss and form three groups: green tea, black tea, and no tea.
Situation 2: Similar to situation 1, but in this case the individuals are split into
groups based on an attribute they possess. For example, you might be studying leg
strength of people according to weight. You could split participants into weight
categories (obese, overweight and normal) and measure their leg strength on a
weight machine.
Limitations of the One Way ANOVA:
A one way ANOVA will tell you that at least two groups were different from each
other, but it won’t tell you which groups were different. If your test returns a
significant F-statistic, you may need to run a post hoc test (like the Least Significant
Difference test) to tell you exactly which groups had a difference in means.
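As a concrete illustration of both the test and its limitation, here is a minimal sketch using SciPy's f_oneway on the tea example from Situation 1; all the weight-loss figures are invented:

# Hedged sketch: one way ANOVA for the tea/weight-loss example.
# All weight-loss values below are invented for illustration.
from scipy import stats

green_tea = [3.2, 4.1, 2.8, 3.9, 3.5]
black_tea = [2.1, 2.9, 2.4, 3.0, 2.6]
no_tea    = [0.8, 1.4, 1.1, 0.9, 1.6]

# H0: all three group means are equal
f_stat, p_value = stats.f_oneway(green_tea, black_tea, no_tea)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")

# A significant F only says at least two groups differ; a post hoc
# test (e.g. Tukey's HSD) is needed to find which pairs differ.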
Two Way ANOVA
A Two Way ANOVA is an extension of the One Way ANOVA. With a One Way, you
have one independent variable affecting a dependent variable. With a Two Way
ANOVA, there are two independent variables. Use a two way ANOVA when you have
one measurement variable (i.e. a quantitative variable) and two nominal variables.
In other words, if your experiment has a quantitative outcome and you have two
categorical explanatory variables, a two way ANOVA is appropriate.
For example, you might want to find out if there is an interaction between income
and gender for anxiety level at job interviews. The anxiety level is the outcome, or
the variable that can be measured. Gender and Income are the two categorical
variables. These categorical variables are also the independent variables, which are
called factors in a Two Way ANOVA.
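To make the interview example concrete, here is a hedged sketch using the statsmodels formula API; the data frame, its column names (anxiety, gender, income) and all values are assumptions invented for illustration:

# Hedged sketch: two way ANOVA for anxiety ~ gender x income.
# The data and the column names are invented for illustration.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.DataFrame({
    "anxiety": [7, 6, 8, 5, 4, 6, 3, 2, 4, 8, 7, 9],
    "gender":  ["M", "M", "M", "F", "F", "F"] * 2,
    "income":  ["low"] * 6 + ["high"] * 6,
})

# C() marks each factor as categorical; '*' includes the interaction.
model = ols("anxiety ~ C(gender) * C(income)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))

The anova_lm table reports a separate F-test for each factor and for the gender-income interaction, which is the effect the example asks about.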
Ans: The term ‘correlation coefficient’ was coined by Karl Pearson in 1896. Accordingly,
this statistic is over a century old and is still going strong. It is one of the most used
statistics today, second only to the mean. The correlation coefficient's weaknesses and
warnings of misuse are well documented. As a consulting statistician of 15 years,
who also teaches continuing and professional studies for the Database
Marketing/Data Mining industry, I see too often that the weaknesses
and warnings are not heeded. Among the weaknesses, I have never seen the issue
that the correlation coefficient interval [−1, +1] is restricted by the individual
distributions of the two variables being correlated. The purpose of this article is (1)
to introduce the effects the distributions of the two individual variables have on
the correlation coefficient interval and (2) to provide a procedure for calculating
an adjusted correlation coefficient, whose realized correlation coefficient interval
is often shorter than the original one.
The implication for marketers is that now they have the adjusted correlation
coefficient as a more reliable measure of the important ‘key-drivers’ of their
marketing models. In turn, this allows the marketers to develop more effective
targeted marketing strategies for their campaigns.
The following points are the accepted guidelines for interpreting the correlation
coefficient:
• 0 indicates no linear relationship.
• +1 indicates a perfect positive linear relationship; −1 indicates a perfect
negative linear relationship.
• Values between 0 and 0.3 (0 and −0.3) indicate a weak positive (negative)
linear relationship.
• Values between 0.3 and 0.7 (−0.3 and −0.7) indicate a moderate positive
(negative) linear relationship.
• Values between 0.7 and 1.0 (−0.7 and −1.0) indicate a strong positive
(negative) linear relationship.
(c): Specifically, the adjusted R2 adjusts the R2 for the sample size and
the number of variables in the regression model. Therefore, the
adjusted R2 allows for an ‘apples-to-apples’ comparison between
models with different numbers of variables and different sample sizes.
Unlike R2, the adjusted R2 does not necessarily increase if a predictor
variable is added to a model.
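The adjustment itself is a short formula in n (sample size) and k (number of predictors). The helper below is an illustrative sketch of the standard formula, not code from the original text:

# Hedged sketch: the standard adjusted R-squared formula,
# R2_adj = 1 - (1 - R2) * (n - 1) / (n - k - 1),
# with n the sample size and k the number of predictors.
def adjusted_r2(r2: float, n: int, k: int) -> float:
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# A small gain in R2 bought with many extra predictors can
# lower the adjusted R2:
print(adjusted_r2(0.80, n=50, k=3))   # ~0.787
print(adjusted_r2(0.81, n=50, k=10))  # ~0.761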
The calculation of the correlation coefficient for two variables, say X and Y,
is simple to understand. Let zX and zY be the standardized versions
of X and Y, respectively; that is, zX and zY are both re-expressed to have
means equal to 0 and standard deviations (s.d.) equal to 1. The re-
expressions used to obtain the standardized scores are in equations (1) and
(2):

zX = (X − mean of X) / s.d. of X   (1)
zY = (Y − mean of Y) / s.d. of Y   (2)
For a simple illustration of the calculation, consider the sample of five observations
in Table 1. Columns zX and zY contain the standardized scores of X and Y,
respectively. The last column is the product of the paired standardized scores. The
sum of these scores is 1.83. The mean of these scores (using the adjusted divisor
n − 1, not n) is 0.46. Thus, rX,Y = 0.46.
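The same arithmetic is easy to reproduce in code. The sketch below uses invented X and Y values (the actual values of Table 1 are not reproduced here): it standardizes both variables, averages the paired products with divisor n − 1, and checks the result against numpy's built-in corrcoef:

# Hedged sketch: Pearson's r as the mean product of paired z-scores.
# X and Y are invented; they are not the Table 1 values.
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.0, 1.0, 4.0, 3.0, 6.0])

# Standardize using the sample s.d. (divisor n - 1), as in the text.
zX = (X - X.mean()) / X.std(ddof=1)
zY = (Y - Y.mean()) / Y.std(ddof=1)

# r = sum of paired products of z-scores, divided by n - 1.
r = (zX * zY).sum() / (len(X) - 1)
print(f"r = {r:.2f}")
print(f"check: {np.corrcoef(X, Y)[0, 1]:.2f}")  # same value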
will be that there is a relationship between gender and empathy (e.g. females tend
to score higher on empathy and males tend to score lower on empathy).
The two chi-square tests are sometimes confused, but they are quite different from
each other.
• The chi-square test for independence compares two categorical variables to
see if there is a relationship between them.
• The chi-square goodness of fit test checks whether one categorical variable
fits a specified distribution.
The test procedure described in this lesson is appropriate when the following
conditions are met:
▪ The sampling method is simple random sampling.
▪ The variables under study are each categorical.
▪ If sample data are displayed in a contingency table, the expected frequency
count for each cell of the table is at least 5.
This approach consists of four steps: (1) state the hypotheses, (2) formulate an
analysis plan, (3) analyze sample data, and (4) interpret results.
The null hypothesis is that knowing the level of Variable A does not help you
predict the level of Variable B; that is, the two variables are independent. The
alternative hypothesis is that knowing the level of Variable A can help you
predict the level of Variable B.
Note: Support for the alternative hypothesis suggests that the variables are
related; but the relationship is not necessarily causal, in the sense that one variable
"causes" the other.
DF = (r - 1) * (c - 1)
where r is the number of levels for one categorical variable, and c is the
number of levels for the other categorical variable.
Er,c = (nr * nc) / n
where Er,c is the expected frequency count for level r of Variable A and
level c of Variable B, nr is the total number of sample observations at level r
of Variable A, nc is the total number of sample observations at level c of
Variable B, and n is the total sample size.
▪ Test statistic. The test statistic is a chi-square random variable (Χ2) defined
by the following equation:
Χ2 = Σ [ (Or,c − Er,c)2 / Er,c ]
where Or,c is the observed frequency count at level r of Variable A and
level c of Variable B, and Er,c is the corresponding expected frequency count.
Interpret Results
If the sample findings are unlikely, given the null hypothesis, the researcher rejects
the null hypothesis. Typically, this involves comparing the P-value to
the significance level, and rejecting the null hypothesis when the P-value is less
than the significance level.
Voting Preferences

               Rep   Dem   Ind   Row total
Male           200   150    50         400
Female         250   300    50         600
Column total   450   450   100        1000
Is there a gender gap? Do the men's voting preferences differ significantly from the
women's preferences? Use a 0.05 level of significance.
Solution
The solution to this problem takes four steps: (1) state the hypotheses, (2)
formulate an analysis plan, (3) analyze sample data, and (4) interpret results. We
work through those steps below:
▪ State the hypotheses. The first step is to state the null hypothesis and an
alternative hypothesis. H0: Gender and voting preference are independent.
Ha: Gender and voting preference are not independent.
▪ Formulate an analysis plan. For this analysis, the significance level is 0.05.
Using sample data, we will conduct a chi-square test for independence.
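For the remaining two steps (analyze sample data and interpret results), the contingency table above can be handed directly to SciPy. This is a hedged sketch of the computation, not part of the original worked solution:

# Sketch: chi-square test for independence on the voting table above.
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[200, 150, 50],    # Male:   Rep, Dem, Ind
                     [250, 300, 50]])   # Female: Rep, Dem, Ind

chi2, p, dof, expected = chi2_contingency(observed)

print(f"DF = {dof}")               # (2 - 1) * (3 - 1) = 2
print(f"chi-square = {chi2:.1f}")  # about 16.2
print(f"p = {p:.4f}")              # far below 0.05

Since the P-value is below the 0.05 significance level, we reject the null hypothesis and conclude that gender and voting preference are related.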
Ans: Regression
A correlation quantifies the degree and direction to which two variables are
related. It does not fit a line through the data points, and it makes no assumption
about cause and effect. It does not matter which of the two variables is called
dependent and which is called independent.
On the other hand, regression finds the best line that predicts the dependent
variable from the independent variable. The decision of which variable is called
dependent and which is called independent is an important matter in regression,
as you will get a different best-fit line if you exchange the two variables, i.e.
dependent to independent and independent to dependent. The line that best
predicts the independent variable from the dependent variable will not be the
same as the line that predicts the dependent variable from the independent variable.
Let us start with the simple case of studying the relationship between two variables,
X and Y. The variable Y is the dependent variable and the variable X is the independent
variable. We are interested in seeing how various values of the independent
variable X predict corresponding values of the dependent variable Y. This statistical
technique is called regression analysis. We can say that regression analysis is a
technique that is used to model the dependency of one dependent variable upon
one independent variable. The Merriam-Webster online dictionary defines regression
as a functional relationship between two or more correlated variables that is often
empirically determined from data and is used especially to predict values of one
variable when given values of others. According to Gravetter & Wallnau (2002),
regression is a statistical technique for finding the best-fitting straight line for a set
of data, and the resulting straight line is called the regression line.
response variable and the explanatory variable for the purpose of prediction,
assumes that a functional relationship exists, and alternative approaches are
superior.
These benefits help a researcher to estimate and evaluate the best set of variables
to be used for building predictive models.
Types of Regression
Commonly used types of regression are:
i) Linear Regression
It is the most commonly used type of regression. In this technique the dependent
variable is continuous, the independent variables can be continuous or discrete,
and the nature of the regression line is linear. Linear regression establishes a
relationship between the dependent variable (Y) and one or more independent
variables (X) using a best-fit straight line (also known as the regression line).
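As a minimal sketch with invented data, SciPy's linregress fits such a best-fit line:

# Hedged sketch: simple linear regression of Y on X (data invented).
import numpy as np
from scipy import stats

X = np.array([1, 2, 3, 4, 5, 6], dtype=float)
Y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

result = stats.linregress(X, Y)
print(f"Y = {result.slope:.2f} * X + {result.intercept:.2f}")
print(f"r-squared = {result.rvalue ** 2:.3f}")

# Note: regressing X on Y would give a different best-fit line,
# as explained earlier.
print(f"predicted Y at X = 7: {result.slope * 7 + result.intercept:.1f}")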
v) Ridge Regression
It is a technique for analyzing multiple regression data that suffer from
multicollinearity (independent variables that are highly correlated). When
multicollinearity occurs, least squares estimates are unbiased, but their variances
are so large that they may be far from the true value. By adding a degree of bias
to the regression estimates, ridge regression reduces the standard errors.
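A hedged sketch with scikit-learn's Ridge on deliberately collinear, invented data; alpha is the penalty that introduces the bias mentioned above:

# Hedged sketch: ridge regression on two highly correlated predictors.
# All data are invented; x2 is nearly a copy of x1 (multicollinearity).
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.01, size=100)  # almost identical to x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.5, size=100)

# alpha > 0 shrinks the coefficients, trading a little bias for a
# large reduction in variance when predictors are collinear.
model = Ridge(alpha=1.0).fit(X, y)
print("coefficients:", model.coef_)
print("intercept:", model.intercept_)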