
Multiple Regression

Kean University
February 12, 2013
Content

1. Multiple Linear Regression

2. Logistic Regression

3. Statistical Definitions

4. Regression Models & SEM

Multiple Linear Regression
Regression techniques are primarily used to create an equation that can be used to predict values of the
dependent variable for all members of the population. A secondary function of regression is that it can
be used as a means of explaining causal relationships between variables.

Types of Linear Regression

Standard Multiple Regression-All independent variables are entered into the analysis simultaneously

Sequential Multiple Regression (Hierarchical Multiple Regression)-Independent variables are entered into the
equation in a particular order as decided by the researcher

Stepwise Multiple Regression-Typically used as an exploratory analysis, and used with large sets of predictors

1. Forward Selection-Bivariate correlations between all the IVs and the DV are calculated, and IVs are
entered into the equation from the strongest correlate to the smallest

2. Stepwise Selection-Similar to forward selection; however, if, in combination with other predictors, an
IV no longer appears to contribute much to the equation, it is removed.

3. Backward Deletion-All IVs are entered into the equation. Partial F tests are calculated on each variable as
if it were entered last, to determine its level of contribution to the overall prediction. The variable with the
smallest partial F is removed, based on a predetermined criterion.

Variables

IV- Also referred to as predictor variables, one or more continuous variables

DV-Also referred to as the outcome variable, a single continuous variable

Assumptions that must be met:

1. Normality. All errors should be normally distributed, which can be tested by looking at the skewness,
kurtosis, and histogram plots. Technically, normality is necessary only for the t-tests to be valid; estimation
of the coefficients only requires that the errors be identically and independently distributed.

2. Independence. The errors associated with one observation are not correlated with the errors of any other
observation.

3. Linearity. The relationship between the IVs and DV should be linear.

4. Homoscedasticity. The variances of the residuals across all levels of the IVs should be consistent, which can
be tested by plotting the residuals.

5. Model specification. The model should be properly specified (including all relevant variables, and excluding
irrelevant variables)

Other important issues:

Influence - individual observations that exert undue influence on the coefficients. Are there covariates that
you should be including in your model?

Collinearity - The predictor variables should be related, but not so strongly correlated that they are measuring
the same thing (e.g., using age and grade), which will lead to multicollinearity. Multicollinearity misleadingly
inflates the standard errors, which can make some variables appear statistically nonsignificant when they should
otherwise be significant.

Unusual and Influential data

A single observation that is substantially different from all other observations can make a large difference in
the results of your regression analysis. If a single observation (or small group of observations) substantially
changes your results, you would want to know about this and investigate further. There are three ways that an
observation can be unusual.

Outliers: In linear regression, an outlier is an observation with a large residual. In other words, it is an
observation whose dependent-variable value is unusual given its values on the predictor variables. An outlier may
indicate a sample peculiarity or may indicate a data entry error or other problem.

Leverage: An observation with an extreme value on a predictor variable is called a point with high leverage.
Leverage is a measure of how far an observation deviates from the mean of that variable. These leverage points
can have an unusually large effect on the estimate of regression coefficients.

Influence: An observation is said to be influential if removing the observation substantially changes the
estimate of coefficients. Influence can be thought of as the product of leverage and outlierness.

Collinearity Diagnostics

VIF: Formally, variance inflation factors (VIF) measure how much the variance of an estimated coefficient is
increased over the case of no correlation among the X variables. If no two X variables are correlated, then
all the VIFs will be 1. If two or more variables have a VIF around or greater than 5 (some say up to 10 is
okay), one of these variables should be removed from the regression model. To determine the best one to
remove, remove each one individually and select the regression equation that explains the most variance (the
highest R2).

Tolerance: Tolerance is the reciprocal of VIF. Its value should be greater than .10; a value less than .10
indicates a collinearity issue.
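For readers who want to check these diagnostics outside SPSS, the sketch below computes VIF and tolerance in Python with statsmodels; the data and column names (x1, x2, x3) are hypothetical, not variables from this handout.

    # Minimal sketch: VIF and tolerance for a set of predictors (hypothetical data).
    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    rng = np.random.default_rng(1)
    X = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
    X["x3"] = 0.8 * X["x1"] + rng.normal(scale=0.5, size=200)  # x3 is correlated with x1

    X_const = sm.add_constant(X)  # include an intercept column, as regression software does
    for i, name in enumerate(X_const.columns):
        if name == "const":
            continue
        vif = variance_inflation_factor(X_const.values, i)
        print(f"{name}: VIF = {vif:.2f}, tolerance = {1 / vif:.3f}")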

Other informal signs of multicollinearity are:


• Regression coefficients change drastically when adding or deleting an X variable.
• A regression coefficient is negative when theoretically Y should increase with increasing values of that
X variable, or the regression coefficient is positive when theoretically Y should decrease with increasing
values of that X variable.
• None of the individual coefficients has a significant t statistic, but the overall F test for fit is
significant.
• A regression coefficient has a nonsignificant t statistic, even though on theoretical grounds that X
variable should provide substantial information about Y.
• High pairwise correlations between the X variables. (But three or more X variables can be multicollinear
together without having high pairwise correlations.)

How to deal with multicollinearity:


• Increasing the sample size is a common first step, but it only partially offsets the problem.
• The easiest solution: Remove the most intercorrelated variable(s) from analysis. This method is
misguided if the variables were there due to the theory of the model.
• Combine variables into a composite variable through building indexes. Remember: in order to create an
index, you need to have theoretical and empirical reasons to justify this action.
• Use centering: transform the offending independents by subtracting the mean from each case (see the
sketch after this list).
• Drop the intercorrelated variables from analysis but substitute their crossproduct as an interaction
term, or in some other way combine the intercorrelated variables. This is equivalent to respecifying the
model by conceptualizing the correlated variables as indicators of a single latent variable. Note: if a
correlated variable is a dummy variable, other dummies in that set should also be included in the
combined variable in order to keep the set of dummies conceptually together.
• Leave one intercorrelated variable as is but then remove the variance in its covariates by regressing
them on that variable and using the residuals.
• Assign the common variance to each of the covariates by some probably arbitrary procedure.
• Treat the common variance as a separate variable and decontaminate each covariate by regressing them
on the others and using the residuals. That is, analyze the common variance as a separate variable.
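As a quick illustration of the centering remedy mentioned above, the sketch below mean-centers two hypothetical predictors before forming their cross-product; the variable names (age, income) are invented for the example.

    # Minimal sketch: mean-centering predictors before building an interaction term.
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(2)
    df = pd.DataFrame({"age": rng.normal(40, 10, 300),
                       "income": rng.normal(50, 15, 300)})

    # Centering: subtract each variable's mean from every case.
    df["age_c"] = df["age"] - df["age"].mean()
    df["income_c"] = df["income"] - df["income"].mean()

    # Build the cross-product (interaction) term from the centered variables,
    # which usually lowers its correlation with the main-effect terms.
    df["age_x_income"] = df["age_c"] * df["income_c"]
    print(df[["age_c", "income_c", "age_x_income"]].corr().round(2))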

To Conduct the Analysis in SPSS

1. Your dataset must be open. To run the analysis, click Analyze, then Regression, then Linear.

2. The Linear Regression window will open. Select the outcome variable, then click the right arrow to place it
in the Dependent box. Highlight all of the independent variables, then click the right arrow to place them in
the Independent(s) box.

3. Select the method of regression that is most appropriate for the data set:
   a. Enter-enters all IVs, one at a time, into the model regardless of significant contribution
   b. Stepwise-combines the Forward and Backward methods, and uses criteria for both entering and removing
      IVs from the equation
   c. Remove-first uses the Enter method; specific variable(s) are then removed from the model and the Enter
      method is repeated
   d. Backward-enters all IVs one at a time and then removes them one at a time based on a predetermined
      level of significance for removal (the default is p ≥ .10)
   e. Forward-only enters IVs that significantly contribute to the model.

4. Click on the Statistics button to open the Statistics dialogue box. Check the appropriate statistics, which
usually include Estimates, Model Fit, Descriptives, Part and Partial Correlations, and Collinearity
Diagnostics. Note: if running a stepwise regression, also check R Squared Change.

5. Click on the Options button to open the Options dialogue box. Here you can change the inclusion and
exclusion criteria (the probability or F value for entry and removal), depending on the method of
regression used.

6. Optional: if needed, click on the Plots button to add plots and histograms to the output. Also, clicking the
Save button gives options to save the residuals, etc.

7. To create a syntax file, simply click on Paste.

Output

Run the syntax, and the output should look similar to the tables below. The Variables Entered box shows which
variables have been included in or excluded from the regression analysis, and the method by which they were
entered. Depending on the method of regression used, certain variables may be removed for failing to meet
predetermined criteria.

The Model Summary box outlines the overall fit of the model. R is the correlation between the variables, which
should be the same as shown in the Correlations table. The R Square value indicates the amount of variance in
the dependent variable explained by the predictor variables; in this case, the predictor variables account for
9.8% of the variance in number of offenses. The Adjusted R Square is a more conservative estimate of variance
explained that removes variability likely due to chance, and it is the value to use when there is more than one
predictor (IV); however, it is not often reported or interpreted. The ANOVA is used to test whether or not
the model significantly predicts the outcome variable. In this example, the model does significantly predict
the outcome variable, because p < .001.

The F ratio and its significance indicate the degree to which the model predicts the DV.

The Coefficients box notes the degree and significance of the effect that each predictor has on the outcome
variable. The unstandardized B is the degree to which each predictor variable individually impacts the outcome
variable (DV); the standardized Betas tell the amount of variance in the DV that is explained by each predictor
variable; and the t statistic and its significance tell whether or not each predictor is significant. In this
example, only whether or not one is incarcerated and entry age are significant predictors. When conducting
regression analyses, it may be useful to run multiple combinations of predictor variables and regression
methods.

Sample Write Up & Table

A multiple regression was also conducted to predict the number of offenses based on the available independent
variables. The predictors included incarcerated (vs. not incarcerated), the age at first offense, the number of
days in placement, and race. The overall model was significant, F (5, 220) = 4.80, p < .001, and accounted for
9.8% of the variance. The results indicated that incarceration and the age at first offense were significant
predictors of the number of offenses committed (see Table 9). The number of days in placement and race were
not significant predictors of the number of offenses committed. Incarceration (compared to non-incarceration)
was associated with an increase in the number of offenses committed (Beta = .243, p < .01). In addition,
controlling for the other predictors, as age at first offense increased, the number of offenses also increased
(Beta = .187, p < .01).

________________________________________________________________________

Table 9

Multiple Regression Analyses of Incarceration Status, Age, Days in Placement, and

Race on Number of Offenses (N = 226)


________________________________________________________________________

Unstandardized
B SE Beta t p

Incarcerated .668 .24 .243 2.76 .006


Age at First Offense .147 .05 .187 2.71 .007
Days in Placement .000 .00 - .057 -0.66 .511
African American .065 .22 .022 0.30 .766
Caucasian - .212 .20 - .077 -1.03 .302
________________________________________________________________________
Note. F (5, 220) = 4.80, p < .001, R2 = .098
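For comparison, a standard (simultaneous-entry) multiple regression of this kind can also be run outside SPSS. The sketch below uses Python's statsmodels formula interface; the file name and column names (offenses, incarcerated, age_first, days_placement, african_american, caucasian) are hypothetical placeholders that merely echo the example, not the actual data.

    # Minimal sketch: a standard multiple regression with all predictors entered at once.
    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("offenses.csv")  # hypothetical data file

    model = smf.ols(
        "offenses ~ incarcerated + age_first + days_placement + african_american + caucasian",
        data=df,
    ).fit()

    print(model.summary())      # coefficients, t tests, R-squared, and the overall F test
    print(model.rsquared_adj)   # adjusted R-squared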

Logistic Regression
Logistic regression, which includes binary and multinomial forms, is a type of prediction analysis that predicts
a categorical (most often dichotomous) dependent variable based on a set of independent variables.

Variables

DV-one dichotomous dependent variable (e.g. alive/dead, married/single, purchase/not purchase)

IVs-one or more independent variables that can be either continuous or categorical

Assumptions that must be met:

1. Sample Size. Reducing a continuous variable to a binary or categorical one loses information and attenuates
effect sizes, reducing the power of the logistic procedure. Therefore, in many cases, a larger sample size is
needed to ensure the power of the statistical procedure. It is recommended that the sample size be at least 30
times the number of parameters, or at least 10 cases per independent variable.

2. Meaningful coding. Logistic coefficients will be difficult to interpret if variables are not coded meaningfully. The
convention for binary logistic regression is to code the dependent class of greatest interest as 1 ("the event
occurring") and the other class as 0 ("the event not occurring").

3. Proper specification of the model is particularly crucial; parameters may change magnitude and even
direction when variables are added to or removed from the model.

a. Inclusion of all relevant variables in the regression model: If relevant variables are omitted, the common
variance they share with included variables may be wrongly attributed to those variables, or the error
term may be inflated.

b. Exclusion of all irrelevant variables: If causally irrelevant variables are included in the model, the
common variance they share with included variables may be wrongly attributed to the irrelevant
variables. The more the correlation of the irrelevant variable(s) with other independents, the greater
the standard errors of the regression coefficients for these independents.

4. Linearity. Logistic regression does not require linear relationships between the independent factor or
covariates and the dependent, but it does assume a linear relationship between the independents and the log
odds (logit) of the dependent.

a. Box-Tidwell Transformation (Test): Add to the logistic model interaction terms which are the
cross-product of each independent times its natural logarithm. If these terms are significant, then there
is nonlinearity in the logit. This method is not sensitive to small nonlinearities. (A sketch of this check
appears after this list.)

5. No outliers. As in linear regression, outliers can affect results significantly.
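A rough version of the Box-Tidwell check in assumption 4a can be scripted as sketched below; the file and variable names (satisfied, age, hours) are hypothetical, and the cross-product terms only make sense for continuous predictors with strictly positive values.

    # Minimal sketch of a Box-Tidwell style check for linearity in the logit.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("jobs.csv")              # hypothetical data file
    for var in ["age", "hours"]:              # continuous predictors, values must be > 0
        df[f"{var}_lnx"] = df[var] * np.log(df[var])

    # If the x * ln(x) terms are significant, the linearity-in-the-logit
    # assumption is questionable for those predictors.
    model = smf.logit("satisfied ~ age + hours + age_lnx + hours_lnx", data=df).fit()
    print(model.summary())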

Types of Logistic Regression

Binary Logistic Regression-treats all IVs as continuous covariates by default; categorical variables must be set in SPSS
Multinomial Logistic Regression-categorical IVs are explicitly entered as factors, and the reference category of the
outcome variable must be set in SPSS.

To conduct this analysis in SPSS
1. Your data set must be open. To run the analysis, click Analyze, then Regression, then either Binary
Logistic or Multinomial Logistic.

2. Move the DV into the “Dependent” box, and move the IVs into the “Covariates” box.

3. Select the method of regression that is most appropriate for the data set:
   a. Enter-enters all IVs, one at a time, into the model regardless of significant contribution
   b. Stepwise-combines the Forward and Backward methods, and uses criteria for both entering and removing
      IVs from the equation
   c. Remove-first uses the Enter method; specific variable(s) are then removed from the model and the Enter
      method is repeated
   d. Backward-enters all IVs one at a time and then removes them one at a time based on a predetermined
      level of significance for removal (the default is p ≥ .10)
   e. Forward-only enters IVs that significantly contribute to the model.

4. Click on the Options button, and check the box next to “CI for exp(B).” Then click “Continue.”

5. Paste and run the syntax.

Output

Run the syntax, and the output should look similar to the tables below. The first box outlines how many cases
were included in and excluded from the analysis; report the n included in the analysis in your write-up. The
Dependent Variable Encoding box shows the label for each coding. This is important to note, because SPSS
creates the regression equation based on the likelihood of having a value of 1. In this case, SPSS is creating an
equation to predict the likelihood that an individual is not very satisfied.

The next set of tables falls under the heading of “Block 0: Beginning Block,” and consists of three tables:
Classification Table, Variables in the Equation, and Variables not in the Equation. This block provides a
description of the null model and does not include the predictor variables. These values are not interpreted or
reported.

Block 1 is what is interpreted and reported in the write-up. The Omnibus Test uses a chi-square to determine
whether the model is statistically significant. The -2 Log likelihood is not interpreted or reported. Cox & Snell
R Square and Nagelkerke R Square are measures of effect size; typically Nagelkerke is reported over Cox &
Snell.

Based on the regression equation created from the analysis, SPSS predicts which group individual cases will
belong to, and then calculates the percentage of correct predictions.

In the Variables in the Equation table, report the significance of each predictor and the confidence interval
(both the lower and upper limits). The estimated coefficient and standard error are reported, but not
interpreted. Exp(B) is the odds ratio for each predictor. As mentioned, SPSS is predicting the likelihood of the
DV being a 1, in this case “Not Very Satisfied.” When the odds ratio is less than 1, increasing values of the
variable correspond to decreasing odds of the event’s occurrence. When the odds ratio is greater than 1,
increasing values of the variable correspond to increasing odds of the event’s occurrence.

Sample Write Up and Table


A logistic regression analysis was conducted to predict whether an individual was not very satisfied with his or her
job (see Table 1). Overall, the model was significant, χ2(4) = 16.71, p = .002, Nagelkerke R2 = .025. Of all the
predictor variables, only age was a significant predictor, p <.001, and had an odds ratio of .980, indicating that
as an individual’s age increases he or she is less likely to be not very satisfied with his or her job. As a
predictor, years of education was marginally significant, p = .085, and had an odds ratio of .957, indicating that
as years of education increases, there is a decrease in the likelihood of being not very satisfied with one’s job.
None of the remaining predictors (e.g., hours worked per week, number of siblings) were significant
predictors of job satisfaction.
________________________________________________________________________

Table 1

Summary of Logistic Regression Predicting Job Satisfaction Satisfied or Not Satisfied


________________________________________________________________________

                    β     Odds Ratio     95% CI Lower     95% CI Upper       p

Age -.021 .980 .97 .99 .000

Years of Education -.044 .957 .91 1.01 .085

Hours per Week -.004 .996 .99 1.01 .384

Number of Siblings .000 1.000 .95 1.05 .990


________________________________________________________________________
Note. χ2(4) = 16.71, p = .002, Nagelkerke R2 = .025.
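A comparable binary logistic model can be fit outside SPSS; the sketch below uses Python's statsmodels with hypothetical file and variable names that parallel the job-satisfaction example, exponentiating the coefficients to obtain odds ratios and their 95% confidence intervals.

    # Minimal sketch: binary logistic regression with odds ratios and 95% CIs.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("gss_subset.csv")   # hypothetical file; DV coded 1 = "not very satisfied"

    model = smf.logit("not_satisfied ~ age + educ + hours + siblings", data=df).fit()

    ci = np.exp(model.conf_int())        # confidence limits for Exp(B)
    table = pd.DataFrame({"OR": np.exp(model.params),   # Exp(B), the odds ratio
                          "CI lower": ci[0],
                          "CI upper": ci[1],
                          "p": model.pvalues})
    print(table.round(3))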

Statistical Definitions
Binary (dichotomous) variable
A binary variable has only two values, typically 0 or 1. Similarly, a dichotomous variable is a categorical variable
with only two values. Examples include success or failure; male or female; alive or dead.
Categorical variable
A variable that can be placed into separate categories based on some characteristic or attribute. Also referred
to as “qualitative”, “discrete”, or “nominal” variables. Examples include gender, drug treatments, race or
ethnicity, disease subtypes, dosage level.
Causal relationship
A causal relationship is one in which a change in one variable can be attributed to a change in another variable.
The study needs to be designed in a way that it is legitimate to infer cause. In most cases, the term “causal
conclusion” indicates findings from an experiment in which the subjects are randomly assigned to a control or
experimental group. For instance, causality cannot be determined from a correlational research design.
Furthermore, it is important to note that a significant finding (small p-value) does not signify causality. The
medical statistician, Austin B. Hill, outlined nine criteria to establish causality in epidemiological research:
temporal relationship, strength, dose-response relationship, consistency, plausibility, consideration of alternate
explanations, experiment, specificity, and coherence.
Central Limit Theorem
The Central Limit Theorem is the foundation for many statistical techniques. The theorem proposes that the
larger the sample size (>30), the more closely the sampling distribution of the mean will approach a normal
distribution. The mean of the sampling distribution of the mean will approach the true population mean, and its
standard deviation will be σ / √n (the population standard deviation divided by the square root of n). The population from which the sample is
drawn does not need to be normally distributed. Furthermore, the Central Limit Theorem explains that the
approximation improves with larger samples as well as why sampling error is smaller with larger samples than it
is with smaller samples.
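The theorem is easy to see by simulation; the sketch below draws repeated samples from a clearly non-normal (exponential) population and shows that the sample means cluster near the population mean with a spread close to σ / √n.

    # Minimal sketch: the Central Limit Theorem by simulation.
    import numpy as np

    rng = np.random.default_rng(0)
    n, reps = 50, 10_000
    # Exponential population: mean = 1 and standard deviation = 1, clearly not normal.
    sample_means = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)

    print("mean of the sample means:", round(sample_means.mean(), 3))   # close to 1
    print("SD of the sample means:  ", round(sample_means.std(), 3))    # close to 1/sqrt(50), about 0.141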
Confidence interval
A confidence interval is an interval estimate of a population parameter, consisting of a range of values bounded
by upper and lower confidence limits. The parameter is estimated as falling somewhere between the two values.
Researchers can assign a degree of confidence for the interval estimate (typically 90%, 95%, 99%), indicating
that the interval will include the population parameter that percentage of the time (i.e., 90%, 95%, 99%). The
wider the confidence interval, the higher the confidence level.
Confounding variable
A confounding variable is one that obscures the effects of another variable. In other words, a confounding
variable is one that is associated with both the independent and dependent (outcome) variables, and therefore
affects the results. Confounding variables are also called extraneous variables and are problematic because the
researcher cannot be sure if results are due to the independent variable, the confounding variable, or both.
Smoking, for instance, would be a confounding variable in the relationship between drinking alcohol and lung
cancer. Therefore, a researcher studying the relationship between alcohol consumption and lung cancer should
control for the effects of smoking. A positive confounder is related to the independent and dependent variables
in the same direction; a negative confounder displays an opposite relationship to the two variables. If there is a
confounding effect, researchers can use a stratified sample and/or a statistical model that controls for the
confounding variable (e.g., multiple regression, analysis of covariance).

Continuous variables
A variable that can take on any value within the limits of its range. Continuous variables are measured on
ratio or interval scales. Examples include age, temperature, height, and weight.
Control group
In experimental research and many types of clinical trials, the control group is the group of participants that
does not receive the treatment. The control group is used for comparison and is treated exactly like the
experimental group except that it does not receive the experimental treatment. In many clinical trials, one
group of patients will be given an experimental drug or treatment, while the control group is given either a
standard treatment for the illness or a placebo (ex: sugar pill).
Covariate
A covariate is a variable that is statistically controlled for using techniques such as multiple regression analysis
or analysis of covariance. Covariates are also known as control variables and in general, have a linear relationship
with the dependent variable. Using covariates in analyses allows the researcher to produce more precise
estimates of the effect of the independent variable of interest. In order to determine if the use of a covariate
is legitimate, the effect of the covariate on the residual (error) variance should be examined. If the covariate
reduces the error, then it is likely to improve the analysis.
Degrees of freedom
The degrees of freedom is usually abbreviated “df” and represents the number of values free to vary when
calculating a statistic. For instance, the degrees of freedom in a 2x2 crosstab table are calculated by
multiplying the number of rows minus 1 by the number of columns minus 1. Therefore, if the totals are fixed,
only one of the four cell counts is free to vary, and the df = (2-1) (2-1) = 1.
Dependent variable
The dependent variable is the effect of interest that is measured in the study. It is termed the “dependent”
variable because it “depends” on another variable. Also referred to as outcome variables or criterion variables.
Descriptive / Inferential Statistics
Descriptive Statistics. Descriptive statistics provide a summary of the available data. The descriptive
statistics are used to simplify large amounts of data by summarizing, organizing, and graphing quantitative
information. Typical descriptive statistics include measures of central tendency (mean, median, mode) and
measures of variability or spread (range, standard deviation, variance).
Inferential Statistics. Inferential statistics allow researchers to draw conclusions or inferences from the
data. Typically, inferential statistics are used to make inferences or claims about a population based on a sample
drawn from that population. Examples include independent t tests and Analysis of Variance (ANOVA)
techniques.
Effect size
An effect size is a measure of the strength of the relationship between two variables. Sample-based effect
sizes are distinguished from test statistics used in hypothesis testing, in that they estimate the strength of an
apparent relationship, rather than assigning a significance level reflecting whether the relationship could be due
to chance. The effect size does not determine the significance level, or vice-versa. Some fields using effect
sizes apply words such as "small", "medium", and "large" to the size of the effect. Whether an effect size should
be interpreted as small, medium, or large depends on its substantive context and its operational definition. Some
common measures of effect size are Cohen’s D, Cramer’s V, Odds Ratios, Standardized Beta weights, Pearson’s
R, and partial Eta squared.

Independent variable
The independent variables are typically controlled or manipulated by the researcher. Independent variables are
also used to predict the values of another variable. Furthermore, researchers often use demographic variables
(e.g., gender, race, age) as independent variables in statistical analysis. Examples of independent variables
include the treatment given to groups, dosage level of an experimental drug, gender, and race.
Measures of Central Tendency: Mean, Median, Mode
Measures of central tendency are a way of summarizing data using the value which is most typical or
representative, including the mean, median, and mode.
Mean. The mean (strictly speaking arithmetic mean) is also known as the average. It is calculated by adding
up the values for each case and dividing by the total number of cases. It is often symbolized by M or X̄ (“X-
bar”). The mean is influenced by outliers and also should not be used with skewed distributions.
Median. The median is the central value of a set of values, ranked in ascending (or descending) order. Since
50% of all scores fall at or below the 50th percentile, the median is therefore the score located at the
50th percentile. The median is not influenced by extreme scores and is the preferred measure of central
tendency for a skewed distribution.
Mode. The mode is the value which occurs most frequently in a set of scores. The mode is not influenced by
extreme values.
Measures of Dispersion: Variance, Standard deviation, range
Measures of dispersion include statistics that show the amount of variation or spread in the scores, or values
of, a variable. Widely scattered or variable data results in large measures of dispersion, whereas tightly
clustered data results in small measures of dispersion. Commonly used measures of dispersion include the
variance and the standard deviation.
Variance. A measure of the amount of variability in a set of scores. Variance is calculated as the square of
the standard deviation of scores. Larger values for the variance indicate that individual cases are further
away from the mean and a wider distribution. Smaller variances indicate that individual cases are closer to
the mean and a tighter distribution. The population variance is symbolized by σ2 and the sample variance is
symbolized by s2.
Standard Deviation. A measure of spread or dispersion in a set of scores. The standard deviation is the
square root of the variance. Similar to the variance, the more widely the scores are spread out, the larger
the standard deviation. Unlike the variance, which is expressed in squared units of measurement, the
standard deviation is expressed in the same units as the measurements of the original data. In the event
that the standard deviation is greater than the mean, the mean would be deemed inappropriate as a
representative measure of central tendency. The empirical rule states that for normal distributions,
approximately 68% of the distribution falls within ± 1 standard deviation of the mean, 95% of the
distribution falls within ± 2 standard deviations of the mean, and 99.7% of the distribution falls within ± 3
standard deviations of the mean. The standard deviation is symbolized by SD or s.
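The sketch below computes these summary statistics for a small, made-up set of scores in Python; nothing here comes from the handout's data.

    # Minimal sketch: central tendency and dispersion for made-up scores.
    import numpy as np
    from statistics import mode

    scores = np.array([4, 7, 7, 8, 10, 12, 15])

    print("mean:", scores.mean())
    print("median:", np.median(scores))
    print("mode:", mode(scores.tolist()))
    print("range:", scores.max() - scores.min())
    print("sample variance:", round(scores.var(ddof=1), 2))  # ddof=1 gives the sample estimate
    print("sample SD:", round(scores.std(ddof=1), 2))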
Multivariate / Bivariate / Univariate
Multivariate. Quantitative methods for examining multiple variables at the same time. For instance, designs
that involve two or more independent variables and two or more dependent variables would use multivariate
analytic techniques. Examples include multiple regression analysis, MANOVA, factor analysis, and
discriminant analysis.
Bivariate. Quantitative methods that involve two variables.
Univariate. Methods that involve only one variable. Often used to refer to techniques in which there is only
one outcome or dependent variable.

Normal distribution
The normal distribution is a bell-shaped, theoretical continuous probability distribution. The horizontal axis
represents all possible values of a variable and the vertical axis represents the probability of those values. The
scores on the variable are clustered around the mean in a symmetrical, unimodal fashion. The mean, median, and
mode are all the same in the normal distribution. The normal distribution is widely used in statistical inference.
Null / Alternative Hypotheses
Null Hypothesis. In general, the null hypothesis (H0) is a statement of no effect. The null hypothesis is set
up under the assumption that it is true, and is therefore tested for rejection.
Alternative Hypothesis. The hypothesis alternative to the one being tested (i.e., the alternative to the null
hypothesis). The alternative hypothesis is denoted by Ha or H1 and is also known as the research or
experimental hypothesis. Rejecting the null hypothesis (on the basis of some statistical test) indicates that
the alternative hypothesis may be true.
Parametric / non-Parametric
Parametric Statistics: Statistical techniques that require the data to have certain characteristics
(approximately normally distributed, interval/ratio scale of measurement). Parametric tests are the most
commonly used inferential statistics.
Non-Parametric Statistics: Statistical techniques designed for use when the data does not conform to the
characteristics required for parametric tests. Non-parametric statistics are also known as distribution-free
statistics. Examples include the Mann-Whitney U test, Kruskal-Wallis test and Wilcoxon's (T) test. In
general, parametric tests are more robust, more complicated to compute, and have greater power efficiency.
Population / Sample
The population is the group of persons or objects that the researcher is interested in studying. To generalize
about a population, the researcher studies a sample that is representative of the population.
Power
The power of a statistical test is the probability that the test will reject the null hypothesis when the null
hypothesis is false (i.e. that it will not make a Type II error, or a false negative decision). As the power
increases, the chances of a Type II error occurring decrease. Power analysis can be used to calculate the
minimum sample size required so that one can be reasonably likely to detect an effect of a given size. Power
analysis can also be used to calculate the minimum effect size that is likely to be detected in a study using a
given sample size. In addition, the concept of power is used to make comparisons between different statistical
testing procedures: for example, between a parametric and a nonparametric test of the same hypothesis.
p-value
The p-value stands for probability value and represents the likelihood that a result is due to chance alone. More
specifically, the p-value is the probability of obtaining a result at least as extreme as the one that was actually
observed given that the null hypothesis is true. For instance, given a p-value of 0.05 (1/20) and repeated
experiments, we would expect that in approximately every 20 replications of the experiment, there would be
one in which the relationship between the variables would be equal to or more extreme than what was found.
The p-value is compared with the alpha value set by the researcher (usually .05) to determine if the result is
statistically significant. If the p-value is less than the alpha level, the result is significant and the null
hypothesis is rejected. If the p-value is greater than the alpha level, the result is non-significant and the
researcher fails to reject the null hypothesis. When interpreting the p-value, it is important to understand the
measurement as well as the practical significance of the results. The p-value indicates significance, but does not
reveal the size of the effect. In addition, a non-significant p-value does not necessarily mean that there is no
association; rather, the non-significant result could be due to a lack of power to detect an association. In
clinical trials, the level of statistical significance depends on the number of participants studied and the
observations made, as well as the magnitude of differences observed.

Skewed distribution and other distribution shapes (bimodal, J-shaped)
Skewed Distribution. A skewed distribution is a distribution of scores or measures that produces a
nonsymmetrical curve when plotted on a graph. The distribution may be positively skewed (infrequent scores
on the high or right side of the distribution) or negatively skewed (infrequent scores on the low or left side
of the distribution). The mean, mode, and median are not equal in a skewed distribution.
Bimodal. A bimodal distribution is a distribution that has two modes. The bimodal distribution has two
values that both occur with the highest frequency in the distribution. This distribution looks like it has two
peaks where the data centers on the two values more frequently than other neighboring values.
J-Shaped. A J-shaped distribution occurs when one of the first values on either end of the distribution
occurs with the most frequency with the following values occurring less and less frequently so that the
distribution is extremely asymmetrical and roughly resembles a “J” lying on its side.
Standard error
Standard error (SE) is a measure of the extent to which the sample mean deviates from the population mean.
Another name for standard error (SE) is standard error of the mean (SEM). This alternative name gives more
insight into the standard error statistic. The standard error is the standard deviation of the means of multiple
samples from the same population. In other words, multiple samples are taken from a population and the
standard error is the standard deviation of those multiple sample means. The standard error can be
thought of as an index to how well the sample reflects the population. The smaller the standard error, the more
the sampling distribution resembles the population.
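A small simulation makes the definition concrete: the standard deviation of many sample means matches the usual formula, σ / √n. The numbers below are invented.

    # Minimal sketch: the standard error as the SD of repeated sample means.
    import numpy as np

    rng = np.random.default_rng(3)
    population = rng.normal(loc=100, scale=15, size=100_000)

    n = 25
    means = np.array([rng.choice(population, size=n).mean() for _ in range(5_000)])

    print("SD of the sample means:", round(means.std(), 2))                     # empirical SE
    print("sigma / sqrt(n):       ", round(population.std() / np.sqrt(n), 2))   # formula value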
z-score
The z-score (aka standard score) is the statistic of the standard normal distribution. The standard normal
distribution has a mean of zero and a standard deviation of 1. Raw scores can be standardized into z-scores
(thus also known as standard scores). The z-score measures the location of a raw score by its distance from the
mean in standard deviation units. Since the mean of the standard normal distribution is zero, a z-score of 1
would reflect a raw score that falls exactly one standard deviation above the mean. In the same manner, a z-score of -1
would reflect a raw score that falls exactly one standard deviation below the mean. If we were reading
standardized IQ scores (raw mean = 100, SD = 15), for example, a z-score of 1 would reflect a raw score of 115
and a z-score of -1 would reflect a raw score of 85.
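The IQ example reduces to one line of arithmetic, sketched here.

    # Minimal sketch: z-scores for the IQ example (mean = 100, SD = 15).
    def z_score(raw, mean, sd):
        return (raw - mean) / sd

    print(z_score(115, 100, 15))   # 1.0, one SD above the mean
    print(z_score(85, 100, 15))    # -1.0, one SD below the mean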

Regression Models & SEM
Bayesian Linear Regression
Bayesian linear regression is an approach to linear regression in which the statistical analysis is undertaken
within the context of Bayesian inference. When the regression model has errors that have a normal
distribution, and if a particular form of prior distribution is assumed, explicit results are available for the
posterior probability distributions of the model's parameters. Bayesian methods can be used for any probability
distribution.
Bootstrapped Estimates
Bootstrapped estimates assume the sample is representative of the universe and do not make parametric
assumptions about the data.
Canonical Correlation
A canonical correlation is the correlation of two canonical (latent) variables, one representing a set of
independent variables, the other a set of dependent variables. Canonical correlation is used for many-to-many
relationships. There may be more than one such linear correlation relating the two sets of variables, with each
such correlation representing a different dimension by which the independent set of variables is related to the
dependent set. The purpose of canonical correlation is to explain the relation of the two sets of variables, not
to model the individual variables.
Categorical Regression

The goal of categorical regression is to describe the relationship between a response variable and a set of
predictors. It is a variant of regression that can handle nominal independent variables, but it has now largely been
replaced by generalized linear models. Scale values are assigned to each category of every variable such that
these values are optimal with respect to the regression.

Cox Regression

Cox regression is used to analyze time-to-event data. Cox regression
is designed for analysis of time until an event or time between events. The classic univariate example is time
from diagnosis with a terminal illness until the event of death (hence survival analysis). The central statistical
output is the hazard ratio.

Curve Estimation
Curve estimation lets the researcher explore how linear regression compares to any of 10 nonlinear models, for
the case of one independent predicting one dependent, and thus is useful for exploring which procedures and
models may be appropriate for relationships in one's data. Curve fitting compares linear, logarithmic, inverse,
quadratic, cubic, power, compound, S-curve, logistic, growth, and exponential models based on their relative
goodness of fit for models where a single dependent variable is predicted by a single independent variable or by
a time variable.
Discriminant Function Analysis

Discriminant function analysis is used when the dependent variable is a dichotomy but other assumptions of
multiple regression can be met, making it more powerful than logistic regression for binary or multinomial
dependents. Discriminant function analysis, a.k.a. discriminant analysis or DA, is used to classify cases into the
values of a categorical dependent, usually a dichotomy. If discriminant function analysis is effective for a set of
data, the classification table of correct and incorrect estimates will yield a high percentage correct.

Multiple discriminant analysis (MDA) is an extension of discriminant analysis and a cousin of multiple analysis of
variance (MANOVA), sharing many of the same assumptions and tests. MDA is used to classify a categorical
dependent which has more than two categories, using as predictors a number of interval or dummy independent
variables. MDA is sometimes also called discriminant factor analysis or canonical discriminant analysis.

Dummy Coding

Dummy variables are a way of adding the values of a nominal or ordinal variable to a regression equation. The
standard approach to modeling categorical variables is to include the categorical variables in the regression
equation by converting each level of each categorical variable into a variable of its own, usually coded 0 or 1. For
instance, the categorical variable "region" may be converted into dummy variables such as "East," "West,"
"North," or "South." Typically "1" means the attribute of interest is present (ex., South = 1 means the case is
from the region South). Of course, once the conversion is made, if we know a case's value on all the levels of a
categorical variable except one, that last one is determined. We have to leave one of the levels out of the
regression model to avoid perfect multicollinearity (singularity; redundancy), which will prevent a solution (for
example, we may leave out "North" to avoid singularity). The omitted category is the reference category
because b coefficients must be interpreted with reference to it.
The interpretation of b coefficients is different when dummy variables are present. Normally, without dummy
variables, the b coefficient is the amount the dependent variable increases when the independent variable
associated with the b increases by one unit. When using a dummy variable such as "region" in the example above,
the b coefficient is how much more the dependent variable increases (or decreases if b is negative) when the
dummy variable increases one unit (thus shifting from 0=not present to 1=present, such as South=1=case is from
the South) compared to the reference category (North, in our example). Thus for the set of dummy variables
for "Region," assuming "North" is the reference category and education level is the dependent, a b of -1.5 for
the dummy "South" means that the expected education level for the South is 1.5 years less than the average of
"North" respondents.
Entry Terms – Forward/Backward/Stepwise/Blocking/Hierarchical

Forward selection starts with the constant-only model and adds variables one at a time in the order they are
best by some criterion, until some cutoff level is reached (ex., until the step at which all variables
not in the model have a significance higher than .05).

Backward selection starts with all variables and deletes one at a time, in the order they are worst by some
criterion.
Stepwise multiple regression is a way of computing OLS regression in stages. In stage one, the independent
variable best correlated with the dependent is included in the equation. In the second stage, the remaining
independent with the highest partial correlation with the dependent, controlling for the first independent, is
entered. This process is repeated, at each stage partialing for previously-entered independents, until the
addition of a remaining independent does not increase R-squared by a significant amount (or until all variables
are entered, of course). Alternatively, the process can work backward, starting with all variables and eliminating
independents one at a time until the elimination of one makes a significant difference in R-squared.
Hierarchical multiple regression (not to be confused with hierarchical linear models) is similar to stepwise
regression, but the researcher, not the computer, determines the order of entry of the variables. F-tests are
used to compute the significance of each added variable (or set of variables) to the explanation reflected in R-
square. This hierarchical procedure is an alternative to comparing betas for purposes of assessing the
importance of the independents. In more complex forms of hierarchical regression, the model may involve a
series of intermediate variables which are dependents with respect to some other independents, but are
themselves independents with respect to the ultimate dependent. Hierarchical multiple regression may then
involve a series of regressions for each intermediate as well as for the ultimate dependent.
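A hierarchical (researcher-ordered) entry can be expressed as a comparison of nested models; the sketch below fits two blocks with Python's statsmodels and tests the R-squared change with an F test. The file and variable names are hypothetical.

    # Minimal sketch: hierarchical entry as a comparison of nested OLS models.
    import pandas as pd
    import statsmodels.formula.api as smf
    from statsmodels.stats.anova import anova_lm

    df = pd.read_csv("study.csv")   # hypothetical data file

    block1 = smf.ols("outcome ~ age + gender", data=df).fit()               # control block
    block2 = smf.ols("outcome ~ age + gender + treatment", data=df).fit()   # adds the predictor of interest

    print("R-squared change:", round(block2.rsquared - block1.rsquared, 3))
    print(anova_lm(block1, block2))   # F test for the added block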

Exogenous and endogenous variables.
Exogenous variables in a path model are those with no explicit causes (no arrows going to them, other than the
measurement error term). If exogenous variables are correlated, this is indicated by a double headed arrow
connecting them. Endogenous variables, then, are those which do have incoming arrows. Endogenous variables
include intervening causal variables and dependents. Intervening endogenous variables have both incoming and
outgoing causal arrows in the path diagram. The dependent variable(s) have only incoming arrows.
Factor Analysis

Factor analysis is used to uncover the latent structure (dimensions) of a set of variables. It reduces attribute
space from a larger number of variables to a smaller number of factors and does not assume a dependent
variable is specified. Exploratory factor analysis (EFA) seeks to uncover the underlying structure of a relatively
large set of variables. The researcher's à priori assumption is that any indicator may be associated with any
factor. This is the most common form of factor analysis. There is no prior theory and one uses factor loadings
to intuit the factor structure of the data. Confirmatory factor analysis (CFA) seeks to determine if the number
of factors and the loadings of measured (indicator) variables on them conform to what is expected on the basis
of pre-established theory. Indicator variables are selected on the basis of prior theory and factor analysis is
used to see if they load as predicted on the expected number of factors. The researcher's à priori assumption
is that each factor (the number and labels of which may be specified à priori) is associated with a specified
subset of indicator variables. A minimum requirement of confirmatory factor analysis is that one hypothesize
beforehand the number of factors in the model, but usually also the researcher will posit expectations about
which variables will load on which factors (Kim and Mueller, 1978b: 55). The researcher seeks to determine, for
instance, if measures created to represent a latent variable really belong together.

There are several different types of factor analysis, with the most common being principal components analysis
(PCA), which is preferred for purposes of data reduction. However, common factor analysis is preferred for
purposes of causal analysis and for confirmatory factor analysis in structural equation modeling, among other
settings.

Generalized Least Squares

Generalized least squares (GLS) is an adaptation of OLS that minimizes the sum of the squared differences between
observed and predicted covariances rather than between estimates and scores. GLS works well even for non-
normal data when samples are large (n>2500).

General Linear Model (multivariate)


Although regression models may be run easily in GLM, as a practical matter univariate GLM is used primarily to
run analysis of variance (ANOVA) and analysis of covariance (ANCOVA) models. Multivariate GLM is used
primarily to run multiple analysis of variance (MANOVA) and multiple analysis of covariance (MANCOVA) models.
Multiple regression with just covariates (and/or with dummy variables) yields the same inferences as multiple
analysis of variance (MANOVA), to which it is statistically equivalent. GLM can implement regression models
with multiple dependents.
Generalized Linear Models/Generalized Estimating Equations
GZLM/GEE are the generalization of linear modeling to a form covering almost any dependent distribution with
almost any link function, thus supporting linear regression, Poisson regression, gamma regression, and many
others. The general linear model is the special case: variance and regression models which analyze normally distributed dependent variables
using an identity link function (that is, prediction is directly of the values of the dependent).

Linear Mixed Models
Linear mixed models (LMM) handle data where observations are not independent. That is, LMM correctly models
correlated errors, whereas procedures in the general linear model family (GLM) usually do not. (GLM includes
such procedures as t-tests, analysis of variance, correlation, regression, and factor analysis, to name a few.)
LMM is a further generalization of GLM to better support analysis of a continuous dependent for:

1. random effects: where the set of values of a categorical predictor variable are seen not as the complete
set but rather as a random sample of all values (ex., the variable "product" has values representing only 5 of
a possible 42 brands). Through random effects models, the researcher can make inferences over a wider
population in LMM than possible with GLM.
2. hierarchical effects: where predictor variables are measured at more than one level (ex., reading
achievement scores at the student level and teacher-student ratios at the school level).
3. repeated measures: where observations are correlated rather than independent (ex., before-after
studies, time series data, matched-pairs designs).
LMM uses maximum likelihood estimation to estimate these parameters and supports more variations and data
options. Hierarchical models in SPSS require LMM implementation. Linear mixed models include a variety of
multi-level modeling (MLM) approaches, including hierarchical linear models, random coefficients models (RC),
and covariance components models. Note that multi-level mixed models are based on a multi-level theory which
specifies expected direct effects of variables on each other within any one level, and which specifies cross-
level interaction effects between variables located at different levels. That is, the researcher must postulate
mediating mechanisms which cause variables at one level to influence variables at another level (ex., school-level
funding may positively affect individual-level student performance by way of recruiting superior teachers, made
possible by superior financial incentives). Multi-level modeling tests multi-level theories statistically,
simultaneously modeling variables at different levels without necessary recourse to aggregation or
disaggregation. It should be noted, though, that in practice some variables may represent aggregated scores.
Logistic regression/odds ratio

Logistic Regression. Logistic regression is a form of regression that is used with dichotomous dependent
variables (usually scored 0, 1) and continuous and/or categorical independent variables. It is usually used for
predicting if something will happen or not, for instance, pass/fail, heart disease, or anything that can be
expressed as an Event or Non-Event. Logistic regression works with the natural logarithm of the odds (the
logit) of the dependent variable, which reduces nonlinearity. The technique estimates the odds of an event
occurring by calculating changes in the log odds of the dependent variable. Logistic regression techniques do not
assume linear relationships between the independent and dependent variables, do not require normally
distributed variables, and do not assume
homoscedasticity. However, the observations must be independent and the independent variables must be
linearly related to the logit of the dependent variable.
Odds Ratios. An odds ratio is the ratio of two odds. An odds ratio of 1.0 indicates that the independent has no
effect on the dependent and that the variables are statistically independent. An odds ratio greater than 1
indicates that the independent variable increases the likelihood of the event. The "event" depends on the coding
of the dependent variable. Typically, the dependent variable is coded as 0 or 1, with the 1 representing the
event of interest. Therefore, a unit increase in the independent variable is associated with an increase in the
odds that the dependent equals 1 in binomial logistic regression. An odds ratio less than 1 indicates that the
independent variable decreases the likelihood of the event. That is, a unit increase in the independent variable
is associated with a decrease in the odds of the dependent being 1.

Logit Regression
Logit regression uses log-linear techniques to predict one or more categorical dependent variables. Logit models
discriminate better than probit models for high and low potencies and are therefore more appropriate when the
binary dependent is seen as representing an underlying equal distribution (large tails). The logit model is
equivalent to binary logistic regression for grouped data. The logit is the value of the left-hand side of the
equation and is the natural log of the odds, p/(1-p), where p is the probability of response. Thus, if the
probability is .025, the logit = ln(.025/.975) = -3.66; if the probability is .5, the logit = ln(.5/.5) = 0; etc.

Multicollinearity
Multicollinearity refers to excessive correlation of the predictor variables. When correlation is excessive (some
use the rule of thumb of r > .90), standard errors of the b and beta coefficients become large, making it
difficult or impossible to assess the relative importance of the predictor variables. Multicollinearity is less
important where the research purpose is sheer prediction since the predicted values of the dependent remain
stable, but multicollinearity is a severe problem when the research purpose includes causal modeling.
Multiple Linear Regression
Multiple Linear Regression is employed to account for (predict) the variance in an interval dependent, based on
linear combinations of interval, dichotomous, or dummy independent variables. Multiple regression can establish
that a set of independent variables explains a proportion of the variance in a dependent variable at a significant
level (through a significance test of R2), and can establish the relative predictive importance of the independent
variables. Power terms can be added as independent variables to explore curvilinear effects. Cross-product
terms can be added as independent variables to explore interaction effects. One can test the significance of
difference of two R2's to determine if adding an independent variable to the model helps significantly. Using
hierarchical regression, one can see how much variance in the dependent can be explained by one or a set of new
independent variables, over and above that explained by an earlier set. Of course, the estimates (b coefficients
and constant) can be used to construct a prediction equation and generate predicted scores on a variable for
further analysis.
The multiple regression equation takes the form y = b1x1 + b2x2 + ... + bnxn + c. The b's are the regression
coefficients, representing the amount the dependent variable y changes when the corresponding independent
changes 1 unit. The c is the constant, where the regression line intercepts the y axis, representing the amount
the dependent y will be when all the independent variables are 0. The standardized version of the b coefficients
are the beta weights, and the ratio of the beta coefficients is the ratio of the relative predictive power of the
independent variables. Associated with multiple regression is R2, multiple correlation, which is the percent of
variance in the dependent variable explained collectively by all of the independent variables. In addition, it is
important that the model being tested is correctly specified. The exclusion of important causal variables or the
inclusion of extraneous variables can change markedly the beta weights and hence the interpretation of the
importance of the independent variables.
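A minimal sketch of such an analysis in Python with statsmodels is shown below; the data and variable names are hypothetical, but the output illustrates the b coefficients, the constant, R2, and predicted scores described above.

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical data: predict exam score from hours studied and hours slept
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "hours_studied": rng.uniform(0, 10, 100),
    "hours_slept": rng.uniform(4, 9, 100),
})
df["score"] = 40 + 3 * df["hours_studied"] + 2 * df["hours_slept"] + rng.normal(0, 5, 100)

X = sm.add_constant(df[["hours_studied", "hours_slept"]])  # adds the constant c
model = sm.OLS(df["score"], X).fit()

print(model.params)     # constant and b coefficients of y = b1*x1 + b2*x2 + c
print(model.rsquared)   # R2, proportion of variance explained
print(model.f_pvalue)   # significance test of the overall model (R2)

# Predicted scores on the dependent for further analysis
df["predicted"] = model.predict(X)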

Multinomial Logistic Regression


Logistic regression for a categorical dependent variable with more than two levels.

Negative Binomial Regression


This is similar to Poisson regression, also being used for count data, but applies when the variance is larger than the mean (overdispersion), often signaled by more zeros than a Poisson model would predict. It is not assumed that all cases have an equal probability of experiencing the rare event; rather, events may cluster. The negative binomial model is
therefore sometimes called the "overdispersed Poisson model." Values must still be non-negative integers. The
negative binomial is specified by an ancillary (dispersion) parameter, k. When k=0, the negative binomial is
identical to the Poisson distribution. The researcher may specify k or allow it to be estimated by the program.
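A brief, hedged sketch using statsmodels and made-up count data follows; note that statsmodels calls the dispersion parameter alpha rather than k and estimates it from the data.

import numpy as np
import statsmodels.api as sm

# Hypothetical overdispersed counts (e.g., number of doctor visits) whose
# mean increases with x but whose variance exceeds the mean
rng = np.random.default_rng(1)
x = rng.uniform(0, 2, 500)
X = sm.add_constant(x)
y = rng.negative_binomial(n=2, p=1 / (1 + np.exp(0.5 + 0.8 * x) / 2))

# statsmodels estimates the dispersion parameter (alpha) along with the coefficients
model = sm.NegativeBinomial(y, X).fit()
print(model.summary())

# For comparison, a Poisson fit assumes variance == mean and will understate
# the standard errors when the data are overdispersed
poisson = sm.Poisson(y, X).fit()
print(poisson.summary())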

Nonlinear Regression
Nonlinear regression refers to algorithms for fitting complex and even arbitrary curves to one's data using
iterative estimation when the usual methods of dealing with nonlinearity fail. Simple curves can be implemented
in general linear models (GLM) and OLS regression, and in generalized linear models (where the dependent is transformed by some nonlinear link function). Nonlinear regression is used to fit
curves not amenable to transformation methods. That is, it is used when the nonlinear relationship is
intrinsically nonlinear because there is no possible transformation to linearize the relationship of the
independent(s) to the dependent. Common models for nonlinear regression include logistic population growth
models and asymptotic growth and decay models.
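As an illustration, the sketch below fits an asymptotic growth model by iterative estimation with scipy's curve_fit; the model form, starting values, and data are assumptions made for the example.

import numpy as np
from scipy.optimize import curve_fit

# Asymptotic growth model: y approaches the asymptote a at rate b
def asymptotic_growth(x, a, b):
    return a * (1 - np.exp(-b * x))

# Hypothetical data that no single transformation can linearize exactly
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = asymptotic_growth(x, a=5.0, b=0.7) + rng.normal(0, 0.2, x.size)

# Iterative least-squares estimation, starting from rough initial guesses
params, covariance = curve_fit(asymptotic_growth, x, y, p0=[1.0, 0.1])
print("estimated a, b:", params)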
Ordinal Regression

Ordinal regression is a special case of generalized linear modeling (GZLM). Ordinal regression is used with
ordinal dependent (response) variables, where the independents may be categorical factors or continuous
covariates. Ordinal regression models are sometimes called cumulative logit models. Ordinal regression typically
uses the logit link function, though other link functions are available. Ordinal regression with a logit link is also
called a proportional odds model, since the parameters of the predictor variables may be converted to odds
ratios, as in logistic regression. Ordinal regression requires assuming that the effect of the independents is the
same for each level of the dependent. Thus if an independent is age, then the effect on the dependent for a 10
year increase in age should be the same whether the difference is between age 20 to age 30, or from age 50 to
age 60. The "test of parallel lines assumption" tests this critical assumption, which should not be taken for
granted.
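A short sketch, assuming statsmodels' OrderedModel (available in recent versions) and hypothetical survey data, shows a cumulative logit fit; checking the parallel-lines assumption itself would require a separate test.

import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

# Hypothetical data: satisfaction rated low (0), medium (1), or high (2)
rng = np.random.default_rng(3)
n = 300
age = rng.uniform(20, 70, n)
latent = 0.05 * age + rng.logistic(size=n)
satisfaction = np.digitize(latent, bins=[2.0, 3.5])  # codes 0, 1, or 2

# Cumulative logit (proportional odds) model
model = OrderedModel(satisfaction, pd.DataFrame({"age": age}), distr="logit")
result = model.fit(method="bfgs", disp=False)
print(result.summary())

# The exponentiated age coefficient (listed first, before the thresholds) is
# the common odds ratio assumed to hold at every cut point of the dependent
print(np.exp(result.params[:1]))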

Ordinary Least Squares Regression


This is the common form of multiple regression, used in early, stand-alone path analysis programs. It makes
estimates based on minimizing the sum of squared deviations of the linear estimates from the observed scores.
However, even for path modeling of one-indicator variables, MLE is still preferred in SEM because MLE
estimates are computed simultaneously for the model as a whole, whereas OLS estimates are computed
separately in relation to each endogenous variable. OLS assumes similar underlying distributions but not multivariate normality; it is therefore less restrictive and is a better choice when MLE's multivariate normality assumption is severely violated.

Path Analysis

Path analysis is an extension of the regression model, used to test the fit of the correlation matrix against two
or more causal models which are being compared by the researcher. The model is usually depicted in a circle-
and-arrow figure in which single-headed arrows indicate causation. A regression is done for each variable in the
model as a dependent on others which the model indicates are causes. The regression weights predicted by the
model are compared with the observed correlation matrix for the variables, and a goodness-of-fit statistic is
calculated. The best-fitting of two or more models is selected by the researcher as the best model for
advancement of theory. Path analysis requires the usual assumptions of regression. It is particularly sensitive to
model specification because failure to include relevant causal variables or inclusion of extraneous variables
often substantially affects the path coefficients, which are used to assess the relative importance of various
direct and indirect causal paths to the dependent variable. Such interpretations should be undertaken in the
context of comparing alternative models, after assessing their goodness of fit discussed in the section on
structural equation modeling. When the variables in the model are latent variables measured by multiple
observed indicators, path analysis is termed structural equation modeling.
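Because each equation in a simple path model is an ordinary regression, the sketch below (hypothetical variables; statsmodels for the component regressions) illustrates how path coefficients and an indirect effect can be estimated; dedicated SEM software would add the overall goodness-of-fit test.

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical causal model: education -> income, and education -> savings
# both directly and indirectly through income
rng = np.random.default_rng(7)
n = 500
education = rng.normal(14, 2, n)
income = 2.0 * education + rng.normal(0, 4, n)
savings = 0.5 * education + 0.8 * income + rng.normal(0, 3, n)
df = pd.DataFrame({"education": education, "income": income, "savings": savings})

# Standardize so the coefficients are path (beta) coefficients
z = (df - df.mean()) / df.std()

# One regression per endogenous variable, as the model specifies
income_eq = sm.OLS(z["income"], sm.add_constant(z[["education"]])).fit()
savings_eq = sm.OLS(z["savings"], sm.add_constant(z[["education", "income"]])).fit()

print("education -> income:", income_eq.params["education"])
print("education -> savings (direct):", savings_eq.params["education"])
print("income -> savings:", savings_eq.params["income"])

# Indirect effect of education on savings through income
print("education -> savings (indirect):",
      income_eq.params["education"] * savings_eq.params["income"])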

Partial Least Squares Regression
Partial least squares (PLS) regression is sometimes called “soft modeling” because it makes relaxed assumptions about the data. PLS can support small
sample models, even where there are more variables than observations, but it is lower in power than SEM
approaches. The advantages of PLS include ability to model multiple dependents as well as multiple independents;
ability to handle multicollinearity among the independents; robustness in the face of data noise and missing
data; and creating independent latents directly on the basis of crossproducts involving the response variable(s),
making for stronger predictions. Disadvantages of PLS include greater difficulty of interpreting the loadings of
the independent latent variables (which are based on crossproduct relations with the response variables, not
based as in common factor analysis on covariances among the manifest independents) and because the
distributional properties of estimates are not known, the researcher cannot assess significance except through
bootstrap resampling.
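A minimal sketch with scikit-learn's PLSRegression follows, assuming illustrative data with more predictors than observations; the number of components is an arbitrary choice for the example.

import numpy as np
from sklearn.cross_decomposition import PLSRegression

# Hypothetical "wide" data: 20 observations, 50 correlated predictors
rng = np.random.default_rng(5)
X = rng.normal(size=(20, 50))
y = X[:, :3].sum(axis=1) + rng.normal(0, 0.5, 20)   # only a few predictors matter

# PLS extracts a small number of latent components built from
# crossproducts with the response, then regresses y on them
pls = PLSRegression(n_components=2)
pls.fit(X, y)

print(pls.score(X, y))        # R2 of the fitted model
print(pls.x_loadings_.shape)  # loadings of the predictors on the latent components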

Poisson Regression

Poisson regression is a form of regression analysis used to model count variables, contingency tables, and count data in event history analysis. It has a very strong assumption: the conditional variance equals the conditional
mean. A Poisson regression model is sometimes known as a log-linear model, especially when used to model
contingency tables. Data appropriate for Poisson regression do not happen very often. Nevertheless, Poisson
regression is often used as a starting point for modeling count data and Poisson regression has many extensions.
A rule of thumb is to use a Poisson rather than binomial distribution when n is 100 or more and the probability is
.05 or less. Whereas the binomial distribution is used when the variable of interest is a count of successes per given number of trials, the Poisson distribution is used for a count of successes per given number of time units.
The Poisson distribution is also used when "event occurs" can be counted but non-occurrence cannot be counted.
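A short sketch using statsmodels' GLM with a Poisson family (and its default log link) on hypothetical count data follows.

import numpy as np
import statsmodels.api as sm

# Hypothetical data: number of website errors per hour as traffic varies
rng = np.random.default_rng(2)
traffic = rng.uniform(0, 5, 400)
errors = rng.poisson(np.exp(0.2 + 0.4 * traffic))

X = sm.add_constant(traffic)
model = sm.GLM(errors, X, family=sm.families.Poisson()).fit()
print(model.summary())

# exp(coefficient) gives the multiplicative change in the expected count
# for a one-unit increase in traffic
print(np.exp(model.params))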

Probit Algorithms

Probit models are similar to logistic models but use the inverse cumulative normal transformation (the probit transformation) of the dependent variable. Whereas logit and logistic regression are appropriate when the categories of the
dependent are equal or well dispersed, probit may be recommended when the middle categories have greater
frequencies than the high and low tail categories, or with binomial dependents when an underlying normal
distribution is assumed. As a practical matter, probit and logistic models yield the same substantive conclusions
for the same data the great majority of the time.

R2
Also called multiple correlation or coefficient of multiple determination, R2 is the percent of the variance in the
dependent explained uniquely or jointly by the independents. R-squared can also be interpreted as the
proportionate reduction in error in estimating the dependent when knowing the independents. That is, R2 compares the error made when using the regression model to estimate the dependent with the total error made when using only the dependent's mean as the estimate for all cases. Adjusted R-square is an adjustment for the fact that when one has a large number of independents relative to the number of cases, R2 can become inflated simply because some independents explain small portions of the variance of the dependent by chance; the adjustment penalizes R2 for each additional predictor.
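The sketch below spells out this reduction-in-error arithmetic along with the standard adjusted R2 formula; the observed and predicted values are hypothetical.

import numpy as np

def r_squared(y, y_pred, n_predictors):
    """R2 as the proportionate reduction in squared error, plus adjusted R2."""
    ss_residual = np.sum((y - y_pred) ** 2)   # error using the regression model
    ss_total = np.sum((y - y.mean()) ** 2)    # error using only the mean
    r2 = 1 - ss_residual / ss_total
    n = len(y)
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)
    return r2, adj_r2

# Hypothetical observed values and model predictions from two predictors
y = np.array([10.0, 12.0, 9.0, 15.0, 13.0, 11.0])
y_pred = np.array([10.5, 11.5, 9.5, 14.0, 13.5, 11.0])
print(r_squared(y, y_pred, n_predictors=2))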

Recursive Partitioning
Recursive partitioning creates a decision tree that attempts to correctly classify members of the population based on several dichotomous independent variables. It creates a rule that researchers can use to calculate the probability that a participant belongs to a particular category. For example, to predict whether a patient has a disease, recursive partitioning might create a rule such as 'If a patient has finding x, y, or z, they probably have disease q.' Advantages include generating clinically intuitive models that do not require the user to perform calculations, allowing the researcher to prioritize particular misclassifications in order to create a decision rule with greater sensitivity or specificity, and potentially greater accuracy.

Disadvantages include handling continuous variables poorly and a tendency to overfit the data.
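As a hedged sketch, scikit-learn's DecisionTreeClassifier below builds such a tree from hypothetical clinical findings; the finding names, the simulated data, and the depth limit are assumptions made for the example.

import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical binary findings (x, y, z) and disease status for 200 patients
rng = np.random.default_rng(11)
findings = rng.integers(0, 2, size=(200, 3))
has_disease = ((findings[:, 0] | findings[:, 1]) & findings[:, 2]).astype(int)

# Limit depth to keep the rule clinically readable (and reduce overfitting)
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(findings, has_disease)

# Print the decision rule in "if finding ... then ..." form
print(export_text(tree, feature_names=["finding_x", "finding_y", "finding_z"]))

# Predicted probability of disease for a new patient with findings x and z
print(tree.predict_proba([[1, 0, 1]]))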

Spuriousness
A given bivariate correlation or beta weight may be inflated because one has not yet introduced control
variables into the model by way of partial correlation. For instance, regressing height on hair length will
generate a significant b coefficient, but only when gender is left out of the model specification (women are
shorter and tend to have longer hair).

Structural Equation Modeling


Structural equation modeling (SEM) grows out of and serves purposes similar to multiple regression, but in a
more powerful way which takes into account the modeling of interactions, nonlinearities, correlated
independents, measurement error, correlated error terms, multiple latent independents each measured by
multiple indicators, and one or more latent dependents also each with multiple indicators. SEM may be used as a
more powerful alternative to multiple regression, path analysis, factor analysis, time series analysis, and analysis
of covariance. That is, these procedures may be seen as special cases of SEM, or, to put it another way, SEM is
an extension of the general linear model (GLM) of which multiple regression is a part. Advantages of SEM
compared to multiple regression include more flexible assumptions (particularly allowing interpretation even in
the face of multicollinearity), use of confirmatory factor analysis to reduce measurement error by having
multiple indicators per latent variable, the attraction of SEM's graphical modeling interface, the desirability of
testing models overall rather than coefficients individually, the ability to test models with multiple dependents,
the ability to model mediating variables rather than be restricted to an additive model, the ability to model
error terms, the ability to test coefficients across multiple between-subjects groups, and ability to handle
difficult data (time series with autocorrelated error, non-normal data, incomplete data). Moreover, where
regression is highly susceptible to error of interpretation by misspecification, the SEM strategy of comparing
alternative models to assess relative model fit makes it more robust. SEM is usually viewed as a confirmatory
rather than exploratory procedure, using one of three approaches:
1. Strictly confirmatory approach

2. Alternative models approach


3. Model development approach

Regardless of approach, SEM cannot itself draw causal arrows in models or resolve causal ambiguities.
Theoretical insight and judgment by the researcher are still of utmost importance. SEM is a family of statistical
techniques which incorporates and integrates path analysis and factor analysis. In fact, use of SEM software
for a model in which each variable has only one indicator is a type of path analysis. Use of SEM software for a
model in which each variable has multiple indicators but there are no direct effects (arrows) connecting the
variables is a type of factor analysis. Usually, however, SEM refers to a hybrid model with both multiple
indicators for each variable (called latent variables or factors), and paths specified connecting the latent
variables. Synonyms for SEM are covariance structure analysis, covariance structure modeling, and analysis of
covariance structures. Although these synonyms rightly indicate that analysis of covariance is the focus of
SEM, be aware that SEM can also analyze the mean structure of a model.
Suppression
Suppression occurs when the omitted variable has a positive causal influence on the included independent and a
negative influence on the included dependent (or vice versa), thereby masking the impact the independent would
have on the dependent if the third variable did not exist. Note that when the omitted variable has a suppressing
effect, coefficients in the model may underestimate rather than overestimate the effect of those variables on
the dependent.

Two-Stage Least-Squares Regression
Two-stage least squares regression (2SLS) is a method of extending regression to cover models which violate
ordinary least squares (OLS) regression's assumption of recursivity (all arrows flow one way, with no feedback
looping), specifically models where the researcher must assume that the disturbance term of the dependent
variable is correlated with the cause(s) of the independent variable(s). Second, 2SLS is used for the same
purpose to extend path analysis, except that in path models there may be multiple endogenous variables rather
than a single dependent variable. Third, 2SLS is an alternative to maximum likelihood estimation (MLE) in
estimating path parameters of non-recursive models with correlated error among the endogenous variables in
structural equation modeling (SEM). Fourth, 2SLS can be used to test for selection bias in quasi-experimental
studies involving a treatment group and a comparison group, in order to reject the hypothesis that self-
selection or other forms of selection into the two groups accounts for differences in the dependent variable. 
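A minimal sketch of the two stages, carried out by hand with OLS for transparency (dedicated 2SLS routines would normally be used, since the manual second-stage standard errors are not correct), is shown below; the instrument and variable names are hypothetical.

import numpy as np
import statsmodels.api as sm

# Hypothetical setup: x is endogenous (correlated with the disturbance of y),
# z is an instrument correlated with x but not with that disturbance
rng = np.random.default_rng(4)
n = 1000
z = rng.normal(size=n)
disturbance = rng.normal(size=n)
x = 0.8 * z + 0.5 * disturbance + rng.normal(size=n)   # endogenous regressor
y = 2.0 * x + disturbance + rng.normal(size=n)

# Stage 1: regress the endogenous variable on the instrument(s)
stage1 = sm.OLS(x, sm.add_constant(z)).fit()
x_hat = stage1.fittedvalues

# Stage 2: regress y on the predicted (purged) values of x
stage2 = sm.OLS(y, sm.add_constant(x_hat)).fit()
print("2SLS estimate:", stage2.params[1])      # close to the true value 2.0

# Naive OLS, by contrast, is biased upward by the correlated disturbance
naive = sm.OLS(y, sm.add_constant(x)).fit()
print("OLS estimate:", naive.params[1])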
Weight Estimation

One of the critical assumptions of OLS regression is homoscedasticity: that the variance of residual error
should be constant for all values of the independent(s). Weighted least squares (WLS) regression compensates
for violation of the homoscedasticity assumption by weighting cases differentially: cases whose value on the
dependent variable corresponds to large variances on the independent variable(s) count less and those with
small variances count more in estimating the regression coefficients. That is, cases with greater weights
contribute more to the fit of the regression line. The result is that the estimated coefficients are usually very
close to what they would be in OLS regression, but under WLS regression their standard errors are smaller.
Apart from its main function in correcting for heteroscedasticity, WLS regression is sometimes also used to
adjust fit to give less weight to distant points and outliers, or to give less weight to observations thought to be
less reliable.
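A brief sketch with statsmodels follows, assuming heteroscedastic data in which the error variance grows with the predictor; the weights are taken as the inverse of that (known, for the example) error variance.

import numpy as np
import statsmodels.api as sm

# Hypothetical heteroscedastic data: the error spread grows with x
rng = np.random.default_rng(6)
x = rng.uniform(1, 10, 300)
error_sd = 0.5 * x                      # larger x -> noisier y
y = 3.0 + 2.0 * x + rng.normal(0, error_sd)

X = sm.add_constant(x)

# Weight each case by the inverse of its error variance:
# noisier cases count less in estimating the coefficients
wls = sm.WLS(y, X, weights=1.0 / error_sd**2).fit()
ols = sm.OLS(y, X).fit()

print("WLS:", wls.params, wls.bse)      # typically smaller standard errors
print("OLS:", ols.params, ols.bse)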
Zero-Inflated Regression
Zero-inflated models attempt to account for excess zeros. In other words, two kinds of zeros are thought to
exist in the data, "true zeros" and "excess zeros". Zero-inflated models estimate two equations simultaneously,
one for the count model and one for the excess zeros. One common cause of overdispersion is excess zeros generated by an additional data-generating process. If the data-generating process does not allow for any zeros (such as the
number of days spent in the hospital), then a zero-truncated model may be more appropriate.
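A hedged sketch using statsmodels' ZeroInflatedPoisson (available in recent versions) on made-up data follows; the inflation part uses a logit model for the excess-zero process, and the constant-only inflation equation is an assumption made for the example.

import numpy as np
import statsmodels.api as sm
from statsmodels.discrete.count_model import ZeroInflatedPoisson

# Hypothetical data: Poisson counts plus an extra zero-generating process
rng = np.random.default_rng(8)
n = 1000
x = rng.uniform(0, 2, n)
counts = rng.poisson(np.exp(0.3 + 0.5 * x))
always_zero = rng.random(n) < 0.3          # 30% structural (excess) zeros
counts[always_zero] = 0

X = sm.add_constant(x)

# Two equations estimated together: a Poisson count model and a logit
# model for the probability of being an excess zero
model = ZeroInflatedPoisson(counts, X, exog_infl=np.ones((n, 1)), inflation="logit")
result = model.fit(method="bfgs", disp=False)
print(result.summary())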
