
ALEXIS JAN M. PATACSIL
ADVANCE RESEARCH METHODS - Statistical Tools

I. T-TEST

1. Definition: The t-test was developed by a chemist working for the Guinness brewing
company as a simple way to monitor the quality of stout. It was further developed and
adapted, and now refers to any test of a statistical hypothesis in which the test
statistic is expected to follow a t-distribution if the null hypothesis is supported.

A t-distribution is a continuous probability distribution that arises when the mean of a
normally distributed population is estimated from a small sample with an unknown
population standard deviation. The null hypothesis is the default assumption that no
relationship exists between two different measured phenomena.

2. Assumptions:
The first assumption concerns the scale of measurement: the data collected should
follow a continuous or ordinal scale, such as the scores on an IQ test.

The second assumption is that of a simple random sample: the data are collected from
a representative, randomly selected portion of the total population.

The third assumption is that the data, when plotted, follow a normal, bell-shaped
distribution curve.

The fourth assumption is that a reasonably large sample size is used. A larger sample
size means the distribution of results should approach a normal bell-shaped curve.

The final assumption is homogeneity of variance. Homogeneous, or equal, variance
exists when the standard deviations of the samples are approximately equal.
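
For illustration, a minimal sketch of an independent-samples t-test in Python using
scipy (the IQ-style scores below are hypothetical, not from the source):

# Minimal sketch: independent-samples t-test on hypothetical scores.
import numpy as np
from scipy import stats

group_a = np.array([102, 98, 110, 105, 99, 107])
group_b = np.array([95, 101, 96, 104, 100, 97])

# equal_var=True assumes homogeneity of variance; use False for Welch's t-test.
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=True)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")  # reject H0 of equal means if p < .05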

3. Reference: https://www.investopedia.com/ask/answers/073115/what-assumptions-are-made-when-conducting-ttest.asp

II. ONE WAY ANOVA

1. Definition: A one-way ANOVA is used to compare the means of two or more
independent (unrelated) groups using the F-distribution. The null hypothesis for the
test is that all group means are equal. Therefore, a significant result means that at
least two of the means are unequal.

2. Assumptions:
a. The observations are random and independent samples from the populations. This is
commonly referred to as the assumption of independence.
The null hypothesis actually says that the samples come from populations that have the
same mean. The samples must be random and independent if they are to be
representative of the populations. The value of one observation is not related to any
other observation. In other words, one person’s score should not provide any clue as to
how any of the other people should score. That is, one event does not depend on
another.

b. The distributions of the populations from which the samples are selected are normal.
This is commonly referred to as the assumption of normality.
This assumption implies that the dependent variable is normally distributed (a
theoretical requirement of the underlying distribution, the F distribution) in each of the
groups.
c. The variances of the distributions in the populations are equal. This is commonly
referred to as the assumption of homogeneity of variance.
This assumption (along with the normality assumption and the null hypothesis) provides
that the distributions in the populations have the same shapes, means, and variances;
that is, they are the same populations. In other words, the variances on the dependent
variable are equal across the groups.
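
For illustration, a minimal sketch of a one-way ANOVA in Python using scipy, with three
hypothetical groups:

# Minimal sketch: one-way ANOVA comparing three hypothetical groups.
from scipy import stats

group1 = [4.1, 5.0, 4.8, 5.3, 4.6]
group2 = [5.9, 6.2, 5.5, 6.0, 5.8]
group3 = [4.5, 4.9, 5.1, 4.7, 5.0]

f_stat, p_value = stats.f_oneway(group1, group2, group3)
print(f"F = {f_stat:.3f}, p = {p_value:.3f}")  # significant p: at least two means differ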

3. References: http://www.statisticshowto.com/probability-and-statistics/hypothesis-testing/anova/;
http://oak.ucc.nau.edu/rh232/courses/EPS525/Handouts/Understanding%20the%20One-way%20ANOVA.pdf

III. TWO WAY ANOVA

1. Definition: A two-way ANOVA is an extension of the one-way ANOVA. With a one-way
ANOVA, you have one independent variable affecting a dependent variable. With a
two-way ANOVA, there are two independent variables. Use a two-way ANOVA when you
have one measurement variable (i.e., a quantitative variable) and two nominal variables.
In other words, if your experiment has a quantitative outcome and you have two
categorical explanatory variables, a two-way ANOVA is appropriate.

2. Assumptions:
a. The population must be close to a normal distribution.
b. Samples must be independent.
c. Population variances must be equal.
d. Groups must have equal sample sizes.
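
For illustration, a minimal sketch of a two-way ANOVA in Python with statsmodels, using
a small simulated dataset with two categorical IVs (fertilizer and sunlight) and one
quantitative DV (the variable names and values are hypothetical):

# Minimal sketch: two-way ANOVA with hypothetical, balanced plant-growth data.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.DataFrame({
    "growth":     [20, 22, 19, 24, 25, 23, 28, 27, 30, 33, 31, 29],
    "fertilizer": ["A"] * 6 + ["B"] * 6,
    "sunlight":   (["low"] * 3 + ["high"] * 3) * 2,
})

# Main effects of the two categorical IVs plus their interaction.
model = ols("growth ~ C(fertilizer) * C(sunlight)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))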

3. Reference: http://www.statisticshowto.com/probability-and-statistics/hypothesis-testing/anova/

IV. MANOVA
1. Definition: Analysis of variance (ANOVA) tests for differences between means.
MANOVA is just an ANOVA with several dependent variables. It's similar to many other
tests and experiments in that its purpose is to find out if the response variable (i.e.,
your dependent variable) is changed by manipulating the independent variable. The test
helps to answer many research questions, including:
• Do changes to the independent variables have statistically significant
effects on dependent variables?
• What are the interactions among dependent variables?
• What are the interactions among independent variables?
2. Assumptions:
a. Observations are randomly and independently sampled from the population
b. Each dependent variable has an interval measurement
c. Dependent variables are multivariate normally distributed within each group of the
independent variables (which are categorical)
d. The population covariance matrices of each group are equal (this is an extension of
homogeneity of variances required for univariate ANOVA)
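
For illustration, a minimal sketch of a MANOVA in Python with statsmodels, using
simulated data with one categorical IV and two DVs (all names and values are
hypothetical):

# Minimal sketch: MANOVA with two hypothetical dependent variables and one factor.
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

df = pd.DataFrame({
    "group":  ["control"] * 5 + ["treatment"] * 5,
    "score1": [10, 12, 11, 13, 12, 15, 16, 14, 17, 15],
    "score2": [20, 22, 21, 19, 23, 26, 27, 25, 28, 24],
})

# Both DVs are modelled jointly against the categorical IV.
maov = MANOVA.from_formula("score1 + score2 ~ group", data=df)
print(maov.mv_test())  # Wilks' lambda, Pillai's trace, etc.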

3. Reference: http://www.real-statistics.com/multivariate-statistics/multivariate-analysis-of-variance-manova/manova-assumptions/

V. MULTIPLE LINEAR REGRESSION


1. Definition:
Multiple linear regression is the most common form of linear regression analysis. As a
predictive analysis, the multiple linear regression is used to explain the relationship
between one continuous dependent variable and two or more independent variables.
The independent variables can be continuous or categorical (dummy coded as
appropriate).
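
For illustration, a minimal sketch of fitting a multiple linear regression in Python with
statsmodels (one continuous DV and two simulated continuous IVs; the variable names
are hypothetical):

# Minimal sketch: MLR with two hypothetical predictors of an exam score.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
hours_studied = rng.uniform(0, 20, size=100)
prior_gpa = rng.uniform(1.0, 4.0, size=100)
exam_score = 40 + 2 * hours_studied + 5 * prior_gpa + rng.normal(0, 5, size=100)

X = sm.add_constant(np.column_stack([hours_studied, prior_gpa]))
model = sm.OLS(exam_score, X).fit()
print(model.summary())  # coefficients, R-squared, and t-tests for each predictor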

2. Assumptions:

Level of measurement
a. IVs: Two or more continuous (interval or ratio) or dichotomous variables (it may be
necessary to recode multichotomous categorical or ordinal IVs and non-normal interval
or ratio IVs into dichotomous variables or a series of dummy variables).
b. DV: One continuous (interval or ratio) variable
Sample size
a. Some rules of thumb:
1. Enough data are needed to provide reliable estimates of the correlations. Use at
least 50 cases and at least 10 to 20 times as many cases as there are IVs (as the
number of IVs increases, more inferential tests are conducted if each predictor is
tested, so more data are needed); otherwise the estimates of the regression line are
probably unstable and are unlikely to replicate if the study is repeated.
2. Green (2001) and Tabachnick and Fidell (2007) suggest 50 + 8k for testing an
overall regression model and 104 + k when testing individual predictors (where k is
the number of IVs).
3. These sample size suggestions are based on detecting a medium effect size
(β >= .20), with critical α <= .05 and power of 80%.
4. To be more accurate, study-specific power and sample size calculations should be
conducted (e.g., use an a-priori sample size calculator for multiple regression; note
that such calculators use f² for the anticipated effect size - see the Formulas link for
how to convert R² to f²).
Normality
a. Check the univariate descriptive statistics (M, SD, skewness and kurtosis)
b. Check the histograms with a normal curve imposed
c. Be wary of (avoid!) using inferential tests of normality (e.g., the Shapiro-Wilk test);
they are notoriously overly sensitive for the purposes of regression.
d. Estimates of correlations will be more reliable and stable when the variables are
normally distributed, but regression will be reasonably robust to minor or moderate
deviations from normality when moderate to large sample sizes are involved. More
important is the examination of scatterplots for bivariate outliers (non-normal
univariate data may make bivariate and multivariate outliers more likely).
e. Further information:
• Non-normality - PROPHET StatGuide: Do your data violate linear
regression assumptions?, Northwestern University
• Regression when the OLS residuals are not normally distributed
• How do I perform a regression on non-normal data which remain non-
normal when transformed?
Linearity
a. Are the bivariate relationships linear?
b. Check scatterplots and correlations between the DV (Y) and each of the IVs (Xs)
c. Check for influence of bivariate outliers

Homoscedasticity
a. Are the bivariate distributions reasonably evenly spread about the line of best fit?
b. Check scatterplots between Y and each of the Xs and/or check the scatterplot of the
residuals (ZRESID) against the predicted values (ZPRED).

Multicollinearity
a. Is there multicollinearity between the IVs? Predictors should not be overly correlated
with one another. Ways to check:
1. Examine bivariate correlations and scatterplots between each of the IVs (i.e., are the
predictors overly correlated - above ~.7?).
2. Check the collinearity statistics in the coefficients table:
• Various recommendations for acceptable levels of VIF and Tolerance have been
published.
• The Variance Inflation Factor (VIF) should be low (< 3 to 10), or
• Tolerance should be high (> .1 to .3).
• Note that VIF and Tolerance have a reciprocal relationship (i.e., Tolerance = 1/VIF),
so only one of the two indicators needs to be used.
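
For illustration, a minimal sketch of checking VIF and Tolerance in Python with
statsmodels (the two deliberately correlated predictors are simulated, not real data):

# Minimal sketch: VIF and Tolerance for a set of hypothetical predictors.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = 0.8 * x1 + rng.normal(scale=0.5, size=100)   # deliberately correlated with x1

X = sm.add_constant(np.column_stack([x1, x2]))
vif = [variance_inflation_factor(X, i) for i in range(1, X.shape[1])]  # skip constant
tolerance = [1 / v for v in vif]                   # Tolerance = 1 / VIF
print(vif, tolerance)   # flag predictors with VIF above ~10 or Tolerance below ~.1 to .3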

Multivariate outliers
a. Check whether there are influential multivariate outliers (MVOs) using Mahalanobis'
Distance (MD) and/or Cook's D (CD).
b. SPSS: Linear Regression - Save - Mahalanobis (can also include Cook's D)
1. After execution, new variables called mah_1 (and coo_1) will be added to the data
file.
2. In the output, check the Residuals Statistics table for the maximum MD and CD.
3. The maximum MD should not exceed the critical chi-square value with degrees of
freedom (df) equal to the number of predictors, at critical alpha = .001. CD should not
be greater than 1.
c. If outliers are detected:
1. Go to the data file, sort the data in descending order by mah_1, identify the cases
with mah_1 distances above the critical value, and consider why these cases have been
flagged (these cases will each have an unusual combination of responses for the
variables in the analysis, so check their responses).
2. Remove these cases and re-run the MLR.
• If the results are very similar (e.g., similar R² and coefficients for each of the
predictors), then it is best to use the original results (i.e., including the multivariate
outliers).
• If the results are different when the MVOs are not included, then these cases probably
have had undue influence and it is best to report the results without these cases.
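
The Mahalanobis distance check described above can also be approximated outside SPSS;
a minimal sketch in Python (simulated predictor matrix; the critical value follows the
chi-square rule above, not the SPSS procedure itself):

# Minimal sketch: flag multivariate outliers via squared Mahalanobis distance.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))             # hypothetical cases-by-predictors matrix

diff = X - X.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
md_sq = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)   # squared MD per case (like mah_1)

critical = chi2.ppf(1 - 0.001, df=X.shape[1])           # df = number of predictors
print(np.where(md_sq > critical)[0])                    # indices of flagged cases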

Normality of residuals
1. Residuals are more likely to be normally distributed if each of the variables in the
analysis is normally distributed.
2. Check histograms of all variables in an analysis.
3. Normally distributed variables will enhance the MLR solution.

3. References: http://www.statisticssolutions.com/what-is-multiple-linear-regression/;
https://en.wikiversity.org/wiki/Multiple_linear_regression/Assumptions

VI. EXPLORATORY FACTOR ANALYSIS

1. Definition: Exploratory factor analysis is a statistical technique that is used to reduce
data to a smaller set of summary variables and to explore the underlying theoretical
structure of the phenomena. It is used to identify the structure of the relationship
between the variable and the respondent.
2. Assumptions:
a. Variables used should be metric. Dummy variables can also be considered, but only
in special cases.
b. Sample size: Sample size should be more than 200. In some cases, sample size
may be considered for 5 observations per variable.
c. Homogeneous sample: A sample should be homogeneous. Violation of this
assumption increases the required sample size as the number of variables increases.
Reliability analysis is conducted to check the homogeneity between variables.
d. In exploratory factor analysis, multivariate normality is not required.
e. Correlation: Correlations of at least 0.30 are required between the research variables.
f. There should be no outliers in the data.
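
For illustration, a minimal sketch of an exploratory factor analysis in Python using
scikit-learn's FactorAnalysis (the item responses are simulated; the varimax rotation
option assumes a reasonably recent scikit-learn version):

# Minimal sketch: exploratory factor analysis on a hypothetical item matrix.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(3)
latent = rng.normal(size=(300, 2))                        # two underlying factors
loadings = rng.normal(size=(2, 8))
items = latent @ loadings + rng.normal(scale=0.5, size=(300, 8))  # 8 observed items

fa = FactorAnalysis(n_components=2, rotation="varimax")
fa.fit(items)
print(fa.components_)    # estimated loadings of each item on the two factors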

3. Reference: http://www.statisticssolutions.com/factor-analysis-sem-exploratory-factor-analysis/

VII. DISCRIMINANT FUNCTION ANALYSIS

1. Definition: Discriminant function analysis is used to determine which variables
discriminate between two or more naturally occurring groups. For example, an
educational researcher may want to investigate which variables discriminate between
high school graduates who decide (1) to go to college, (2) to attend a trade or
professional school, or (3) to seek no further training or education. For that purpose
the researcher could collect data on numerous variables prior to students' graduation.
After graduation, most students will naturally fall into one of the three categories.
Discriminant analysis could then be used to determine which variable(s) are the best
predictors of students' subsequent educational choice.

2. Assumptions:

a. Normal distribution. It is assumed that the data (for the variables) represent a
sample from a multivariate normal distribution. You can examine whether or not
variables are normally distributed with histograms of frequency distributions. However,
note that violations of the normality assumption are usually not "fatal," meaning, that
the resultant significance tests etc. are still "trustworthy." You may use specific tests for
normality in addition to graphs.

b. Homogeneity of variances/covariances. It is assumed that the variance/covariance
matrices of the variables are homogeneous across groups. Again, minor deviations are
not that important; however, before accepting final conclusions for an important study
it is probably a good idea to review the within-groups variances and correlation
matrices. In particular, a scatterplot matrix can be produced and can be very useful for
this purpose. When in doubt, try re-running the analyses excluding one or two groups
that are of less interest. If the overall results (interpretations) hold up, you probably do
not have a problem. You may also use the numerous tests available to examine whether
or not this assumption is violated in your data. However, as mentioned in ANOVA/
MANOVA, the multivariate Box M test for homogeneity of variances/covariances is
particularly sensitive to deviations from multivariate normality and should not be taken
too seriously.

c. Correlations between means and variances. The major "real" threat to the validity of
significance tests occurs when the means for variables across groups are correlated
with the variances (or standard deviations). Intuitively, if there is large variability in a
group with particularly high means on some variables, then those high means are not
reliable. However, the overall significance tests are based on pooled variances, that is,
the average variance across all groups. Thus, the significance tests of the relatively
larger means (with the large variances) would be based on the relatively smaller pooled
variances, resulting erroneously in statistical significance. In practice, this pattern may
occur if one group in the study contains a few extreme outliers, who have a large
impact on the means, and also increase the variability. To guard against this problem,
inspect the descriptive statistics, that is, the means and standard deviations or
variances for such a correlation.

d. The matrix ill-conditioning problem. Another assumption of discriminant function
analysis is that the variables that are used to discriminate between groups are not
completely redundant. As part of the computations involved in discriminant analysis,
you will invert the variance/covariance matrix of the variables in the model. If any one
of the variables is completely redundant with the other variables, then the matrix is
said to be ill-conditioned, and it cannot be inverted. For example, if a variable is the
sum of three other variables that are also in the model, then the matrix is
ill-conditioned.

e. Tolerance values. In order to guard against matrix ill-conditioning, check the
so-called tolerance value for each variable. This tolerance value is computed as 1 minus
the R-square of the respective variable with all other variables included in the current
model. Thus, it is the proportion of variance that is unique to the respective variable.
You may also refer to Multiple Regression to learn more about multiple regression and
the interpretation of the tolerance value. In general, when a variable is almost
completely redundant (and, therefore, the matrix ill-conditioning problem is likely to
occur), the tolerance value for that variable will approach 0.
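
For illustration, a minimal sketch of a linear discriminant analysis in Python using
scikit-learn, with simulated stand-ins for the student variables and the three outcome
groups described above (not the StatSoft software referenced below):

# Minimal sketch: discriminant analysis on hypothetical student data (3 groups).
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(4)
X = rng.normal(size=(150, 4))        # e.g., grades, test scores, motivation (simulated)
y = rng.integers(0, 3, size=150)     # 0 = college, 1 = trade school, 2 = no further study

lda = LinearDiscriminantAnalysis()
lda.fit(X, y)
print(lda.predict(X[:5]))   # predicted group membership for the first five cases
print(lda.score(X, y))      # proportion of cases classified correctly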

3. Reference: http://www.statsoft.com/Textbook/Discriminant-Function-Analysis#assumptions

VIII. LOGISTIC REGRESSION

1. Definition:
Logistic regression is the appropriate regression analysis to conduct when the
dependent variable is dichotomous (binary). Like all regression analyses, the logistic
regression is a predictive analysis. Logistic regression is used to describe data and to
explain the relationship between one dependent binary variable and one or more
nominal, ordinal, interval or ratio-level independent variables.

2. Assumptions:
a. First, logistic regression does not require a linear relationship between the dependent
and independent variables. Second, the error terms (residuals) do not need to be
normally distributed. Third, homoscedasticity is not required. Finally, the dependent
variable in logistic regression is not measured on an interval or ratio scale.
However, some other assumptions still apply.
First, binary logistic regression requires the dependent variable to be binary and ordinal
logistic regression requires the dependent variable to be ordinal.

b. Second, logistic regression requires the observations to be independent of each
other. In other words, the observations should not come from repeated measurements
or matched data.

c. Third, logistic regression requires there to be little or no multicollinearity among the
independent variables. This means that the independent variables should not be too
highly correlated with each other.

d. Fourth, logistic regression assumes linearity of independent variables and log odds.
Although this analysis does not require the dependent and independent variables to be
related linearly, it requires that the independent variables are linearly related to the log
odds.

e. Finally, logistic regression typically requires a large sample size. A general guideline
is that you need a minimum of 10 cases with the least frequent outcome for each
independent variable in your model. For example, if you have 5 independent variables
and the expected probability of your least frequent outcome is .10, then you would
need a minimum sample size of 500 (10 * 5 / .10).
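
For illustration, a minimal sketch of a binary logistic regression in Python with
statsmodels (the pass/fail data and predictor names are simulated; exponentiated
coefficients give odds ratios):

# Minimal sketch: binary logistic regression with two hypothetical predictors.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
hours = rng.uniform(0, 10, size=200)
attendance = rng.uniform(0, 1, size=200)
log_odds = -3 + 0.6 * hours + 2 * attendance
passed = rng.binomial(1, 1 / (1 + np.exp(-log_odds)))      # binary DV

X = sm.add_constant(np.column_stack([hours, attendance]))
model = sm.Logit(passed, X).fit()
print(model.summary())        # coefficients are on the log-odds scale
print(np.exp(model.params))   # exponentiate to get odds ratios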

3. Reference: http://www.statisticssolutions.com/assumptions-of-logistic-regression/

IX. KRUSKAL-WALLIS H TEST

1. Definition: The Kruskal-Wallis H test (sometimes also called the "one-way ANOVA on
ranks") is a rank-based nonparametric test that can be used to determine if there
are statistically significant differences between two or more groups of an
independent variable on a continuous or ordinal dependent variable. It is considered
the nonparametric alternative to the one-way ANOVA, and an extension of the
Mann-Whitney U test to allow the comparison of more than two independent
groups.

For example, you could use a Kruskal-Wallis H test to understand whether exam
performance, measured on a continuous scale from 0-100, differed based on test
anxiety levels (i.e., your dependent variable would be "exam performance" and your
independent variable would be "test anxiety level", which has three independent
groups: students with "low", "medium" and "high" test anxiety levels). Alternately, you
could use the Kruskal-Wallis H test to understand whether attitudes towards pay
discrimination, where attitudes are measured on an ordinal scale, differed based on job
position (i.e., your dependent variable would be "attitudes towards pay discrimination",
measured on a 5-point scale from "strongly agree" to "strongly disagree", and your
independent variable would be "job description", which has three independent groups:
"shop floor", "middle management" and “boardroom").

2. Assumptions:
a. Your dependent variable should be measured at the ordinal or continuous level (i.e.,
interval or ratio). Examples of ordinal variables include Likert scales (e.g., a 7-point
scale from "strongly agree" through to "strongly disagree"), amongst other ways of
ranking categories (e.g., a 3-point scale explaining how much a customer liked a
product, ranging from "Not very much", to "It is OK", to "Yes, a lot"). Examples of
continuous variables include revision time (measured in hours), intelligence
(measured using IQ score), exam performance (measured from 0 to 100), weight
(measured in kg), and so forth.
b. Your independent variable should consist of two or more categorical, independent
groups. Typically, a Kruskal-Wallis H test is used when you have three or more
categorical, independent groups, but it can be used for just two groups (i.e., a
Mann-Whitney U test is more commonly used for two groups). Example independent
variables that meet this criterion include ethnicity (e.g., three groups: Caucasian,
African American and Hispanic), physical activity level (e.g., four groups: sedentary,
low, moderate and high), profession (e.g., five groups: surgeon, doctor, nurse,
dentist, therapist), and so forth.
c. You should have independence of observations, which means that there is no
relationship between the observations in each group or between the groups
themselves. For example, there must be different participants in each group with no
participant being in more than one group. This is more of a study design issue than
something you can test for, but it is an important assumption of the Kruskal-Wallis H
test. If your study fails this assumption, you will need to use another statistical test
instead of the Kruskal-Wallis H test (e.g., a Friedman test).

As the Kruskal-Wallis H test does not assume normality in the data and is much less
sensitive to outliers, it can be used when these assumptions have been violated and the
use of a one-way ANOVA is inappropriate. In addition, if your data is ordinal, a one-way
ANOVA is inappropriate, but the Kruskal-Wallis H test is not. However, the Kruskal-Wallis
H test does come with an additional data consideration (assumption d), which is
discussed below:

d. In order to know how to interpret the results from a Kruskal-Wallis H test, you have
to determine whether the distributions in each group (i.e., the distribution of scores for
each group of the independent variable) have the same shape (which also means the
same variability).
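
For illustration, a minimal sketch of a Kruskal-Wallis H test in Python using scipy, with
three hypothetical test-anxiety groups as in the exam-performance example above:

# Minimal sketch: Kruskal-Wallis H test across three hypothetical anxiety groups.
from scipy import stats

low    = [88, 92, 75, 89, 95, 90]
medium = [70, 78, 82, 74, 80, 77]
high   = [60, 65, 58, 72, 66, 61]

h_stat, p_value = stats.kruskal(low, medium, high)
print(f"H = {h_stat:.3f}, p = {p_value:.3f}")  # significant p: at least one group differs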

3. Reference: https://statistics.laerd.com/spss-tutorials/kruskal-wallis-h-test-using-spss-statistics.php
