Professional Documents
Culture Documents
1
x
i
+ u
i
+ e
i
5a
Multiplying out gives:
y
i
=
0
+
1
x
i
+
2
1
x
i
+
2
u
i
+ e
i
|{z}
v
i
5b
Or, rearranging as a function of x gives
y
i
=
0
+
1
+
2
1
x
i
+
2
u
i
+ e
i
5c
Whichever way we look at it, whereas the slope of x was correctly estimated in Eq. (2), it cannot be correctly estimated in
Eq. (3) because as shown in Eq. (5c), the slope will include the correlation of x with z (i.e.,
1
). Thus, x correlates with the error
term (as per Eq. (5b)) and is inconsistent. In the presence of omitted variable bias, one does not estimate
1
as per Eq. (3), but
something else (
1
). Whether
1
would go up or down when including z will depend on the signs of
2
and
1
. It is also clear that if
2
=0 or if
1
=0 then v
i
reduces to e
i
and there is no omitted variable bias if z is excluded from the model.
Also, bear in mind that all other regressors that correlate with z and x will be inconsistent too when estimating the wrong
model. What effect the regression coefcients capture is thus not clear when there are omitted variables, and this bias can increase
or decrease remaining coefcients or change their signs!
Thus, it is important that all possible sources of variance in y that correlate with the regressor are included in the regression
model. For instance, the construct of emotional intelligence has not been adequately tested in leadership or general work
Table 1
The 14 threats to validity.
Validity threat Explanation
1. Omitted variables (a) Omitting a regressor, that is, failing to include important control variables when testing the
predictive validity of dispositional or behavioral variables (e.g., testing predictive validity of
emotional intelligence without including IQ or personality; not controlling for competing
leadership styles)
(b) Omitting xed effects
(c) Using random-effects without statistical justication (i.e., Hausman test)
(d) In all other cases, independent variables not exogenous (if it is not clear what the controls should be)
2. Omitted selection (a) Comparing a treatment group to other non-equivalent groups (i.e., where the treatment group is not
the same as the other groups)
(b) Comparing entities that are grouped nominally where selection to group is endogenous (e.g.,
comparing men and women leaders on leadership effectiveness where the selection process to
leadership is not equivalent)
(c) Sample (participants or survey responses) suffers from self-selection or is non-representative
3. Simultaneity (a) Reverse causality (i.e., an independent variable is potential caused by the dependent variable)
4. Measurement error (a) Including imperfectly measured variables as independent variables and not modelling measurement error
5. Common-method variance (a) Independent and dependent variables are gathered from the same rating source
6. Inconsistent inference (a) Using normal standard errors without examining for heteroscedasticity
(b) Not using cluster-robust standard errors in panel data (i.e., multilevel hierarchical or longitudinal)
7. Model misspecication (a) Not correlating disturbances of potentially endogenous regressors in mediation models (and not
testing for endogeneity using a Hausman test or augmented regression),
(b) Using a full information estimator (e.g., maximum likelihood, three-stage least squares) without
comparing estimates to a limited information estimator (e.g., two stage-least squares).
Note: The 14 threats to validity mentioned are the criteria we used for coding the studies we reviewed.
1091 J. Antonakis et al. / The Leadership Quarterly 21 (2010) 10861120
situations; one of the reasons is that researchers fail to include important control variables like IQ, personality, sex, age, and the
like (Antonakis, 2009; Antonakis, Ashkanasy, & Dasborough, 2009; Antonakis & Dietz, 2010, in press a,b).
What if irrelevant regressors are included? It is always safer to err on the side of caution by including more than fewer control
variables (Cameron & Trivedi, 2005). The regressors that should be included are ones that are theoretically important; the cost of
including them is reduced efciency (i.e., higher standard errors), but that is a cheap price to pay when consistency is at stake.
Note, there are tests akin to Ramsey's (1969) regression-error-specication (RESET) test, which can be useful for testing whether
there are unmodeled linearities present in the residuals by regressing y on the predicted value of polynomials of y (the 2nd, 3rd,
and 4th powers) and the independent variables. This test is often incorrectly used as a test of omitted variables or functional form
misspecication (Wooldridge, 2002); however, the test actually looks at whether the predicted value of y is linear given the
predictors.
3.1.2. Omitting xed effects
Oftentimes, researchers have panel data (repeated observations nested under an entity). Panel data can be hierarchical (e.g.,
leaders nested in rms; followers nested in leaders) or longitudinal (e.g., observations of leaders over time). Our discussion later is
relevant to both types of panels, though we will discuss the rst form, hierarchical panels (or otherwise known as pseudo-panels).
If there are xed-effects as in the case of having repeated observations of leaders (Level 1) nested in rms (Level 2), the
estimates of the other regressors included in the model would be inconsistent if these xed effects are not explicitly modeled
(Cameron & Trivedi, 2005; Wooldridge, 2002). By xed-effects, we mean the unobserved rm-invariant (Level 2) constant effects
(or in the case of a longitudinal panel, the time-invariant panel effects) common to those leaders nested under a rm (we refer to
these effects as u
j
later, see Eq. (7)).
We discuss an example regarding the modeling of rm xed-effects. By explicitly modeling the xed effects (i.e., intercepts)
using OLS, any possible unobserved heterogeneity in the level (intercept) of y common to leaders in a particular rm which
would otherwise have been pooled in e thus creating omitted variable bias is explicitly captured. As such, the estimator is
consistent if the regressors are exogenous. If the xed effects correlating with Level 1 variables are not modeled, Level 1 estimates
will be inconsistent to the extent that they correlate with the xed effects (which is likely). What is useful is to conceptualize the
error term e
ij
in a xed effects model as having two components: u
j
, the Level 2 invariant component (that is explicitly modeled
with xed-effects), and e
ij
, the idiosyncratic error component. To maintain a distinction between the xed-effects model and the
random-effects model, we will simply refer to the error term as e
ij
in the xed-effect model (given that the error u
j
is considered
xed and not random and is explicitly modeled using dummy variables, as we show later).
Obtaining consistent estimates by including xed-effects comes at the expense of not allowing any Level 2 (rm-level)
predictors because they will be perfectly collinear with the xed effects (Wooldridge, 2002). If one wants to add Level 2 variables
to the model (and remove the xed effects) then one must ensure that the estimator is consistent by comparing estimates fromthe
consistent estimator to with the more efcient one, as we discuss in the next section.
Assume we estimate a model where we have data from 50 rms and we have 10 leaders from each rm (thus we have 500
observations at the leader level). Assume that leaders completed an IQ test (x) and were rated on their objective performance
(adherence to budget), y. Thus, we estimate the following model for leader i in rm j:
y
ij
=
0
+
1
x
ij
+
50
k=2
k
D
kj
+ e
ij
6
The xed effects are captured by 49 (k1) dummy or indicator variables, D identifying the rms. Not including these dummy
variables would be a big risk to take because it is possible, indeed likely, that the xed effects are correlated with x (e.g., some rms
may select leaders on their IQ) and they will most certainly predict variance in y (e.g., xed effects would capture things like rm
size, which may predict y). Thus, even though x is exogenous with respect to e
ij
the coefcient of x will be consistent only if the
dummies are included; if the dummies are not included then the e
ij
term will include u
j
and thus biases estimates of x. If the
dummies are not included then the modeler faces the same problem as in the previous example: omitted-variable bias. Note, if x
correlates with e
ij
, the remedy comes using another procedure, which we discuss later when introducing two-stage least squares
estimation.
Fixed-effects could be present for a number of reasons including group, microeconomic, macroeconomic, country-level, or time
effects and researchers should pay more attention to these contextual effects because they can affect estimate consistency (Liden &
Antonakis, 2009). Finally, when observations are nested (clustered), standard errors should not be estimated the conventional
way (refer to the section later regarding consistency of inference).
3.1.3. Using random effects without meeting assumptions of the estimator
If the modeler wants to determine whether Level 2 variables predict y, the model could be estimated using the random-effects
estimator. The random effects estimator allows for a randomly varying intercept between rms this model is referred to as the
intercepts as outcomes in multilevel modeling vernacular (Hofmann, 1997). Instead of explicitly estimating this heterogeneity
via xed effects, this estimator treats the leader level differences in y (i.e., the intercepts) as randomeffects between rms that are
drawn from a population of rms and assumed to be uncorrelated with the regressors and the disturbances; the randomeffects are
also assumed to be constant over rms and independently distributed. Failure to meet these assumptions will lead to inconsistent
estimates and is tantamount to having omitted variable bias.
1092 J. Antonakis et al. / The Leadership Quarterly 21 (2010) 10861120
Also, prior to using this estimator, the modeler should test for the presence of random effects using a Breusch and Pagan
Lagrangian multiplier test for randomeffects if the model has been estimated by GLS (Breusch & Pagan, 1980), or a likelihood-ratio
test for random effects if the model has been estimated with maximum likelihood estimation (see Rabe-Hesketh & Skrondal,
2008); this is a chi-square with 1 degree of freedom and if signicant, rules in favor of the random-effects model. We do not
discuss the random-coefcients model, which is direct extension of the of the random-effects model and allows varying slopes
across groups. Important to note is that before one uses such a model, one must test whether it is justied by testing the random-
coefcients models versus the random-effects model (using a likelihood-ratio test); only if the test is signicant (i.e., the
assumption that the slopes are xed is rejected) should the random-coefcients estimator by used (Rabe-Hesketh & Skrondal,
2008).
Now, the advantage of the random-effects estimator (which could simultaneously be its Achilles heel) is that then Level 2
variables can be included as predictors (e.g., rmsize, public vs. private organization, etc.), in the following specication for leader
i in rm j:
y
ij
=
0
+
1
x
ij
+
q
k=1
k
z
kj
+ e
ij
+ u
j
7
In Eq. 7, we include regressors 1 to q (e.g., rm size, type, etc.) and omit the xed-effects, but include a rm-specic error
component, u
j.
The random effects estimator is more efcient than the xed-effects estimator because it is designed to minimize
the variance of the estimated parameters (loosely speaking it has fewer independent variables because it does not include all the
dummies). But you guessed it; it comes with a hefty price in that it may not be consistent vis--vis the xed-effects estimator
(Wooldridge, 2002). That is, u
j
might correlate with the Level 1 regressors. To test whether the estimator is consistent, one can use
what is commonly called a Hausman Test (see Hausman, 1978) this test, which is crucial to ensuring that the random-effects
model is tenable does not seemto be routinely used by researchers outside of econometrics, and not even in sociology, a domain
that is close to economics (Halaby, 2004).
Basically, what the Hausman test does is to compare the Level 1 estimates fromthe consistent (xed-effects) estimator to those
of the efcient (random-effects estimator). If the estimates differ signicantly, then the efcient estimator is inconsistent and the
xed-effects estimator must be retained; the inconsistency must have come from u
j
correlating with the regressor. In this case
estimates from the random-effects estimator cannot be trusted; our leitmotif in this case is consistency always trumps efciency.
The most basic Hausman test is that for one parameter, where is the element of being tested (Wooldrige, 2002). Thus, the test
examines whether the estimate of of the efcient (RE) estimator differ signicantly from that of the consistent (FE) estimator,
using the following t test (which has an asymptotic standard normal distribution):
z =
FE
RE
SE
FE
2
SE
RE
2
q
This test can be extended for an array of parameters. In comparing the xed-effects to the random-effects estimator, an
alternative to the Hausman test is the SarganHansen test (Schaffer & Stillman, 2006), which can be used with robust or cluster-
robust standard errors. Both these tests are easily implemented in Stata (see StataCorp, 2009), our software of choice. Again,
because observations are nested (clustered), standard errors should not be estimated the conventional way (refer to the section
later regarding consistency of inference).
One way to get around the problem of omitted xed effects and to still include Level 2 variables is to include the cluster means
of all Level 1 covariates in the estimated model (Mundlak, 1978). The cluster means can be included as regressors or subtracted
(i.e., cluster-mean centering) from the Level 1 covariate. The cluster means are invariant within cluster (and vary between
clusters) and allow for consistent estimation of Level 1 parameters just as if xed-effects had been included (see Rabe-Hesketh &
Skrondal, 2008). Thus, if the Hausman test is signicant, we could still obtain consistent estimates of the Level 1 parameters with
either one of the following specications (given that the cluster mean will be correlated with the covariate but not with u
j
):
y
ij
=
0
+
1
x
ij
+
2
x
j
+
q
k=1
k
z
kj
+ e
ij
+ u
j
8
y
ij
=
0
+
1
x
ij
x
j
+
q
k=1
k
z
kj
+ w
ij
+ g
j
9
In the previous two equations, the interpretation of the coefcient of the cluster mean differs; that is, in Eq. (8) it refers to the
difference in the between and within effects whereas in Eq. (9) it refers to the between effect (Rabe-Hesketh & Skrondal, 2008). In
both cases, however, the estimate of
1
or
1
is consistent (and equals that of the xed-effects estimator, though the intercept will
be different in the case of Eq. (9)). Note that if Level 2 variables are endogenous, the cluster-mean trick cannot help; however,
there are ways to obtain consistent estimates by exploiting the exogenous variation in Level 2 covariates (see Hausman & Taylor,
1981).
1093 J. Antonakis et al. / The Leadership Quarterly 21 (2010) 10861120
3.1.4. Omitting selection
Selection refers to the general problem of treatment not being assigned randomly to individuals. That is, the treatment is
endogenous. Assume:
y
i
=
0
+
1
x
i
+
2
z
i
+ e
i
10
Here, x takes the value of 1 if the individual receives a treatment (e.g., attends a leadership-training program), else x is 0 (the
individual has not received the treatment). Assume that y is how charismatic the individual is rated. However, assume that
individuals have been selected (either self-selected or otherwise) to receive the training. That is, x, the binary variable has not been
randomly assigned, which means that the groups might not be the same on the outset on observed or unobserved factors and
these factors could be correlated with y and of course x. Thus, the problem arises because x is explained by other factors (i.e., the
selection can be predicted) that are not observed in Eq. (10), which we refer to as x* (subsumed) in e. That is, assume x* is modeled
in the following probit (or logistic) equation (Cong & Drukker, 2001)
x
i
=
0
+
q
k=1
k
z
kj
+ u
i
11
Where k refers to regressors 1 to q and u to a disturbance term. We observe x=1 when x*N0 (i.e., treatment has been received),
else x=0. The problem arises because u will be correlated with e (this correlation is called
e,u
) and thus x will be correlated with e.
As an example, suppose that individuals who have a higher IQ (as well as some other individual differences that correlate with
leadership) are more likely to attend the training; it is also likely, however, that these individuals are more charismatic. Thus, there are
unmodeled sources of variance (omitted variables) subsumed in e that correlate with x. Suppose that one is highly motivated, so u
i
is
high, andattends training. If motivationis alsocorrelatedwithcharisma, thene
i
will be higher too; hencethe twodisturbances correlate.
As it is evident, this problemis similar to omitted variable bias inthe sense that there are excluded variables, pooled into the error term,
that correlate with endogenous choice variable and the outcome (Kennedy, 2003). If the selection is not explicitly (and correctly
modeled), then using untreated individuals to estimate the counterfactual is misleading: they differ from treated individuals with
respect to things we do not know; that is, the counterfactuals are missing (andthe effect of the treatment will be incorrectly estimated).
Although this problemmight seemunsolvable, it is not; this model can be estimated correctly if this selection process is explicitly
modeled (Cong & Drukker, 2001; Maddala, 1983). In fact, for a related type of model where y is only observed for those who received
treatment (see Heckman, 1979), James Heckman won the Nobel Prize in Economics! We discuss how this model is estimated later.
Another problem that is somewhat related to selection (but has nothing to do with selection to treatment) is having non-
representative selection to participation in a study or censored samples (a kind of missing-data problem). We briey discuss the
problem here and suggested remedies, given that the focus of our paper is geared more towards selection problems. The problem of
nonrepresentativeness has to do with affecting the observed variability, which thus attenuates estimates. Range restriction would be
anexample of this problem; for example, estimating the effect of IQonleadershipina sample that is highonIQwill bias the estimate of
IQ downwards (thus, the researcher must either obtain a representative sample or correct for range restriction). Another example
wouldbe usingself-selectedparticipants for leadershiptraining(where participants are thenrandomly assignedto treatment); inthis
case, it is possible that the participants are not representative of the population (and only those that are interested in leadership, for
example, volunteered to participate). Thus, the researcher should check whether the sample is representative of the population. Also,
consider the case where managers participate in a survey and they select the subordinates that will rate them (the managers will
probably select subordinates that like them). Thus, ideally, samples must be representative and random (and for all types of studies,
whether correlational or testing for group differences); if they are not, the selection process must be modeled. Other examples of this
probleminclude censored observations above or belowa certain threshold (which creates a missing-data problemon the dependent
variable). Various remedies are available in such cases, for example, censored regression models (Tobin, 1958) or other kinds of
truncated regression models (Long & Freese, 2006) depending on the nature of the problem at hand.
3.2. Simultaneity
This problemis one that is tricky and which has given many economists and other social scientists a headache. Suppose that x
causes y and this relation should be negative; you regress y on x but to your surprise, you nd a non-signicant relation (or even a
positive effect). How can this be? If y also causes x it is quite possible that their covariation is not negative. Simultaneity has to do
with a two variables simultaneously causing each other. Note, this problem is not necessarily the supposed simplistic backward
causality problemoften evoked by researchers (i.e., that the positive regression coefcient of x on y could be due to y causing x); it
has to do with simultaneous causation, which is a different sort of problem.
Here is a simple example to demonstrate the simultaneity problem: Hiring more police-ofcers (x) should reduce crime (y), right?
However, it is also possible too that when crime goes up, cities hire more police ofcers. Thus, x is not exogenous and will necessarily
correlate with e in the y equation (see Levitt, 1997; Levitt, 2002). To make this problem more explicit, assume that x is a particular
leadershipstyle(useof sanctions) andy is follower performance (andweexpect therelation, as estimatedin
1
, tobenegative):
y
i
=
0
+
1
x
i
+ e
i
12
1094 J. Antonakis et al. / The Leadership Quarterly 21 (2010) 10861120
Because leader style is not randomly assigned it will correlate with e
i
making
1
inconsistent. Why? For one, leaders could also
change their style as a function of followers' performance, leading to Eq. (13).
x
i
=
1
y
i
+ u
j
13
We expect
1
, to be positive. Now, because we do not explain y perfectly, y varies as a function of e too; y could randomly
increase (e.g., higher satisfaction of followers because of a company pay raise) or decrease (e.g., an unusually hot summer).
Suppose y increases due to e; as a consequence x will vary; thus, e affects x via y in Eq. (13). In simple terms e correlates with x,
rendering
1
inconsistent. Instrumental-variable estimation can solve this problem, as we discuss later.
3.3. Measurement error (errors-in-variables)
Suppose we intend to estimate our basic specication, however, this time what we intent to observe is a latent variable,
x*:
y
i
=
0
+
1
x
i
+ e
i
14
However, instead of observing x*, which is exogenous and a theoretically pure or latent construct, we observe instead a not-
so-perfect indicator or proxy of x*, which we call x (assume that x* is the IQ of leader i). This indicator consists of the true
component (x*) in addition to an error term (u) as follows (see Cameron & Trivedi, 2005; Maddala, 1977):
x
i
= x
i
+ u
i
15a
x
i
= x
i
u
i
15b
Now substituting Eq. (15b) into Eq. (14) gives:
y
i
=
0
+
1
x
i
u
i
+ e
i
16
Expanding and rearranging the terms gives:
y
i
=
0
+
1
x
i
+ e
i
1
u
i
17
As is evident, the coefcient of x will be inconsistent given that the full error term, which nowincludes measurement error too,
is correlated with x. Note that measurement error in the y variable does not bias coefcients and is not an issue because it is
absorbed in the error term of the regression model.
The above discussion concerns a special kind of omitted variable bias because by estimating the model only with x, we omit u
from the model; given that u is a cause of x creates endogeneity of the sort that x correlates with the combined error term in Eq.
(17). This bias attenuates the coefcient of x, particularly in the presence of further covariates (Angrist & Krueger, 1999); the bias
will also taint the coefcients of other independent variables that are correlated with x (Bollen, 1989; Kennedy, 2003) refer to
Antonakis (2009) for a example in leadership research, where he showed that emotional intelligence was more strongly related
to IQ and the big ve than some have suggested (which means that failure to include these controls and failure to model
measurement error will severely bias model estimates, e.g., see Antonakis & Dietz, in press-b, Fiori & Antonakis, in press). In fact,
using error-in-variables (with maximum likelihood estimation, given that he reanalyzed summary data), Antonakis (2009)
showed that emotional intelligence measures were linearly dependent on the big ve and intelligence, with multiple rs ranging
between .48 and .76 depending on the measures used. However, this relation was vastly underestimated when ignoring
multivariate effects and measurement error, leading to incorrect inference.
The effect of measurement error can be eliminated with a very simple x: by constraining the residual variance of x to (1
reliability
x
)Variance
x
(Bollen, 1989); if reliability is unknown, the degree of validity of the indicator can be assumed from
theory and hence the residual is constrained accordingly (Hayduk, 1996; Onyskiw & Hayduk, 2001). What the modeler needs is a
reasonably good estimate for the reliability (or validity) of the measure. If x were a test of IQ, for example, and we have good reason
to think that IQ is exogenous as we discuss later (see Antonakis, in press), a reasonable estimate could be the testretest reliability
or the Cronbach alpha internal consistency of estimate of the scale. Otherwise, theory is the best guide to howreliable the measure
is. Using this technique is very simple in the context of a regression model with a program like Stata and its eivreg (errors-in-
variables) routine or most structural equation modeling programs using maximum likelihood estimation (e.g., Mplus). The
advantage of using a program like Stata is that the eivreg least-squares estimator does not have the computational difculties,
restrictive assumptions, and sample size requirements inherent to maximum likelihood estimation so it is useful with single
indicator (or index) measures (e.g., see Bollen, 1996; Draper & Smith, 1998; Kmenta, 1986); for multi-item measures of a latent
variable a structural equation modeling program must be used.
Later, we also discuss a second way to x the problem of measurement error particularly if the independent variable
correlates with e for other reasons beyond measurement error using two-stage least squares regression.
1095 J. Antonakis et al. / The Leadership Quarterly 21 (2010) 10861120
3.4. Common source, common-method variance
Related to the previous problem of measurement error is what has been termed common-method variance. That x causes
y could be because they both depend on q. For example, suppose raters rate their leaders on their leader style (x) and raters
are simultaneously asked to provide ratings on the leaders' effectiveness (y); given that a common source is being used it is
quite likely that the source (i.e., rater) will strive to maintain consistency between the two types of ratings (Podsakoff,
MacKenzie, Lee, & Podsakoff, 2003; Podsakoff & Organ, 1986) suppose due to q, which could reect causes including halo
effects from the common source (note, a source could also be a method of data gathering). Important to note is that the
common source/method problem does not only inate estimates as most researchers believe; it could bias them upwards as
well as downwards as we will show later. As will be evident from our demonstrations, common-method variance is a very
serious problem and we disagree in the strongest possible terms with Spector (2006) that effects associated with common-
method variance is simply an urban legend.
Although Podsakoff et al. (2003) suggested that the common-method variance problem biases coefcients, they did not
specically explain why the coefcient of x predicting y can be biased upwards or downwards. To our knowledge, we make
this demonstration explicit for the rst time (at least as far as the management and applied psychology literature is
concerned). We also provide an alternative solution to deal with common-method variance (i.e., two-stage least squares, as
discussed later), particularly in situations where the common cause cannot be identied. An often-used remedy for common-
methods variance problem is to obtain independent and dependent variables from different sources or different times, a
remedial action which we nd satisfactory as long as the independent variables are exogenous. In the case of split-sample
designs where half the raters rate the leader's style and the other half the leader's effectiveness (e.g., Koh, Steers, & Terborg,
1995) precision of inference (i.e., standard errors) will be reduced particularly if the full sample is not large. Also, splitting
measurement occasions across different time periods still does not fully address the problem because the common-method
variance problem could still affect the independent variables that have been measured from the common source (refer to the
end of this section).
One proposed way to deal with this problem is to include a latent common factor in the model to account for the common
(omitted) variance (Loehlin, 1992; Podsakoff et al., 2003, see Figure 3A in Table 5; Widaman, 1985). Although Podsakoff, et al.
suggested this method as a possible remedy and cited research that has used it as evidence of its utility, they noted that this
method is limited in its applicability. We will go a step further and suggest that this procedure should never be used. As we will
show, one cannot remove the common bias with a latent method factor because the modeler does not knowhowthe unmeasured
cause affects the variables (Richardson, Simmering, & Sturman, 2009). It is impossible to estimate the exact effect of the common
source/method variance without directly measuring the common source variable and including it in the model in the correct
causal specication.
Suppose that an unmeasured common cause, degree of organizational safety and risk, affects two latent variables, as depicted
in Fig. 2; this context of leadership is one where teammembers are exposed to danger (e.g., oil rig). We generated data for a model
where 1 and 2 measure subordinate ratings of a leader's style (task and person-oriented leadership respectively). The effect of
the cause on 1 is positive (.57), that is, in a high-risk situation the leader is very task-oriented because in these situations,
violation of standards could cost lives; however, for 2 the effect of the common cause is negative (.57), that is, in high-risk
situations, leaders pay less attention to being nice to subordinates. Thus, leadership style is endogenous; this explanation should
make it clear why leader style can never be modeled as an independent variable. When controlling for the common cause the
residual correlation between 1 and 2 is zero. The data are such that the indicators of each respective factor are tau equivalent
(i.e., they have the same loadings on their respective factors) and with strong loadings (i.e., all s are .96 and are equal on their
respective factors). We made the models tau equivalent to increase the likelihood that the model is identied when introducing a
latent common method/source factor. The sample size is 10,000, and the model ts the data perfectly, according to the
overidentication test:
2
(31)=32.51, pN.05 (as well as to adjunctive measures of t CFI =1.00, RMSEA=.00 which we do
not care much for as we discuss later). Estimating the model without the common cause gives a biased structural estimate (a
correlation of .32 between the two latent variables), although the model ts perfectly:
2
(25)=28.06, pN.05 (CFI =1.00,
RMSEA=.00); hence, it is important of theoretically establishing if modeled variables are exogenous or not because a misspecied
model (with endogeneity) could still pass a test of t. Finally, when including a latent common factor to account for the supposed
common-cause effects, the model still ts well:
2
(17)=20.32, pN.05 (CFI =1.00, RMSEA=.00). However, the loadings and the
structural parameter are severely biased. This method, which is very popular with modelers, is obviously not useful; also, as is
evident, this misspecication is not picked up with the test of model t. The correct model estimates could have been recovered
when using instrumental variables (we present this solution later for the simple case of a path model and then extend this
procedure to a full structural-equation model).
We rst broaden Podsakoff et al.'s (2003) work to show the exact workings of common-method bias, and then present a
solution to the common-method problem. We start with our basic specication, where a rater
i
has rated leader
j
(n=50 leaders)
on leader style x and leader effectiveness y, where we control for the xed effects of rm (note, the estimator should be a robust
one for clustering, as discussed later; also, assume in the following that we do not have random effects):
y
ij
=
0
+
1
x
ij
+
50
k=2
k
D
jk
+ e
ij
18
1096 J. Antonakis et al. / The Leadership Quarterly 21 (2010) 10861120
Similar to the case of measurement error we cannot directly observe y* or x*; however what we do observe is y and x in the
following respective equations (where q
i
is the common bias):
y
ij
= y
ij
+
y
q
ij
19
Fig. 2. Correcting for common-source variance: the common method factor fallacy (estimates are standardized). A: this model is correctly specied. B: failing to
include the common cause estimates the correlation between 1 and 2 incorrectly (0.32). C: including an unmeasured common factor estimates the loadings
(which are also not signicant for 1) and the correlation between 1 and 2 (0.19, not signicant) incorrectly.
1097 J. Antonakis et al. / The Leadership Quarterly 21 (2010) 10861120
x
ij
= x
ij
+
x
q
ij
20
Rearranging the equations gives:
y
ij
= y
ij
y
q
ij
21
x
ij
= x
ij
x
q
ij
22
Substituting Eqs. (21) and (22) into Eq. (18) shows the following:
y
ij
y
q
ij
=
0
+
1
x
ij
x
q
ij
+
50
k=2
k
D
jk
+ e
ij
23
Rearranging the equation gives:
y
ij
=
0
+
1
x
ij
+
50
k=2
k
D
jk
+ e
ij
x
q
ij
+
y
q
ij
24
As with measurement error, common-method variance introduces a correlation between x and the error term, which now
consists of three components (and cannot be eliminated by estimating the xed effects). Unlike before with measurement error,
however, the bias can attenuate or accentuate the coefcient of x. Furthermore, it is now clear that this bias cannot be eliminated
unless q is directly measured (or instruments are used to purge the bias using two-stage least-squares estimation).
Thus, as we alluded to previously, the problem is not one of ination of variance of coefcients; it is one of consistency of
coefcients. The coefcient
1
is uninterpretable because it includes the effect of q on x and y. Assuming that the researcher has no
other option but to gather data in a common-source way, and apart from measuring and including q directly in the model, which
may be difcult to do because q could reect a number of causes, there is actually a rather straightforward solution to this problem
and one that, to our knowledge, will be presented for the rst time to leadership, management, and applied-psychology
researchers. This solution has been available to econometricians for quite some time, and we will discuss this solution in the
section on two-stage least squares estimation.
Note, assume the case where only the independent variables (e.g., assume x
1
and x
2
) suffer from common-method variance; in
this case, the estimates of the two independent variables will be biased to zero and be inconsistent (though their relative
contribution,
x1
x2
is consistent), which can be shown as follows. Suppose:
y =
0
+
1
x
1
+
2
x
2
+ e 25
Instead of observing the latent variables x
1
* and x
2
*, we observe x
1
and x
2
, which are assumed to have approximately the same
variance and are both equally dependent on a common variable q. Thus, by substitution it can be shown that both estimates will be
biased downwards but equally so, suggesting their relative contribution will remain consistent:
y =
0
+
1
x
1
+
2
x
2
+ e
1
q
26
3.5. Consistency of inference
We nish this section by bringing up another threat to validity, which has to do with inference. Froma statistical point-of-view
we mean whether the standard errors are consistent. There has been quite a bit of research on this area particularly after the
papers by Huber (1967) and White (1980); this work is extremely technical so we will just provide a short overview of its
importance and remedial action that can be taken to ensure correct standard errors.
In a simple experimental setting, regression residuals will usually be i.i.d. (identically and independently distributed). By
identically distributed we mean that residuals are homoscedastic, that is, they have been drawn from the same population and
have a uniform variance. By independently distributed we mean that they are not clustered or serially correlated (as when
observations are nested under a Level 2 entity). It is always a good idea, however, to check whether residuals are homoscedastic.
Whether they are clustered is certainly evident fromthe data-gathering design. Programs like Stata have nice routines for checking
for heteroscedasticity, including White's test, and for the presence of clustering.
If residuals are heteroscedastic, coefcient estimates will be consistent; however, standard errors will not be. In this context, the
variance of the parameters has to be estimateddifferently as the usual assumptions donot hold. The variance estimator is basedonthe
work of HuberWhite, andthe standarderrors are usually referredto as HuberWhite standarderrors, sandwichedstandarderrors, or
just robust standard errors. We cannot stress the importance of having the standard errors correctly estimated (either with a robust
variance estimator or using bootstrapping) and this concern is really not on the radar screen of researchers in our eld. Consistent
standard errors are just as important as consistent estimates. If standard errors are not correctly estimated, p-values will be over or
1098 J. Antonakis et al. / The Leadership Quarterly 21 (2010) 10861120
understated, which means that results could change fromsignicant to non-signicant (and vice-versa); refer to Antonakis and Dietz
(in press-a) for an example.
Similar to the previous problem of heteroscedasticity, is the problem of standard errors from clustered observations. A recent
paper published in a top economics journal blasted economists for failing to correctly estimate the variance of the parameters and
suggested that many of the results published with clustered data that had not corrected for the clustering were dubious (see
Bertrand, Duo, & Mullainathan, 2004), and this in a domain that is known for its methodological rigor! The variance estimator for
clustered data is similar in form to the robust one but relaxes the assumptions about the independence of residuals. Note that at
times, researchers have to correct standard errors for multiple dimensions of clustering; that is, we are not discussing the case of
hierarchically clustered but truly independently clustered dimensions (see Cameron, Gelbach, & Miller, in press). Again, these
corrections are easily achieved with Stata or equivalent programs.
4. Methods for inferring causality
To extend our discussion regarding how estimates can become inconsistent, we now review methods that are useful for
recovering causal parameters in eld settings where randomization is not possible. We introduce two broad methods of ensuring
consistent estimates. The rst is what we refer to as statistical adjustment, which is only possible when all sources of variation in y
are known and are observable. The second way we refer to as quasi-experimentation: Here, we include simultaneous equation
models (with extensive discussion on two-stage least squares), regression discontinuity models, difference-in-differences models,
selection models (with unobserved endogeneity), and single-group designs. These methods have many interesting and broad
applications in real-world situations, where external validity (i.e., generalizability) is assured, but where internal validity (i.e.,
experimental control) is not easily assured. Given space constraints, our presentation of these methods is cursory; our goal is to
introduce readers to the intuition and the assumptions behind the methods and to explain how they can recover the causal
parameter of interest. We include a summary of these methods in Table 2.
4.1. Statistical adjustment
The most simple way to ensure that estimates are consistent is to measure and include all possible sources of variance of y in
the regression model (cf. Angrist & Krueger, 1999); of course, we must control for measurement error and selection effects if
relevant. Controlling for all sources of variance in the context of social science, though, is not feasible because the researcher has to
identify everything that causes variation in y (so as to remove this variance frome). At times, there is unobserved selection at hand
or other causes unbeknown to the researcher; from a practical point-of-view, this method is not very useful per se. We are not
suggesting that researchers must not use controls; on the contrary, all known theoretical controls must be included. However, it is
likely that researchers might unknowingly (or even knowingly) omit important causes, so they must also use other methods to
ensure consistency because of possible endogeneity.
4.1.1. Propensity score analysis (PSA)
Readers should refer back to Eqs. (10) and (11) so as to understand why PSA could recover the causal parameter of interest and
thus approximate a randomized eld experiment (Rubin, 2008; Rubin & Thomas, 1996). PSA can only provide consistent estimates
to the extent that (a) the researcher has knowledge of variables that predict whether an individual would have received treatment
or not, and (b) e and u in Eqs. (10) and (11) do not correlate. If e and u correlate, which may often be the case, a Heckman
treatment effects model must be used to derive consistent estimates (discussed later).
The idea behind PSA is quite simple and has to do with comparing treated individuals to similar control individuals (i.e., to
recreate the counterfactual). Going back to the randomized experiment: What is the probability, or propensity to provide an
introduction to the term, that a particular individual is in the treatment versus the control group? If the treatment is assigned
randomly, it is .50 (i.e., 1 out of 2). However, this probability is not .50 is the treatment was not assigned randomly. Thus, the
essence of PSA is to determine the probability (propensity) that an individual would have received treatment (as a function of
measured covariates). Then, the researcher attempts to compare (match) individuals from the treatment and control groups who
Table 2
Six methods for inferring causality in non-experimental settings.
Method Brief description
1. Statistical adjustment Measure and control for all causes of y (impractical and not recommended)
2. Propensity score analysis Compare individuals who were selected to treatment to statistically similar
controls using a matching algorithm
3. Simultaneous-equation models Using instruments (exogenous sources of variance that do not correlate with the
error term) to purge the endogenous x variable from bias.
4. Regression discontinuity Select individuals to treatment using a modelled cut-off.
5. Difference-in-differences models Compare a group who received an exogenous treatment to a similar control group
over time
6. Heckman selection models Predict selection to treatment (where treatment is endogenous) and then control
for unmodeled selection to treatment in predicting y.
1099 J. Antonakis et al. / The Leadership Quarterly 21 (2010) 10861120
have the same probability of receiving treatment. In this way, the design mimics the true experiment (and the counterfactual),
given that the researcher attempts to determine the treatment effect on y by comparing individuals who received the treatment to
similar individuals who did not.
Suppose in our example that we want to compare individuals who undertook leadership training (and were self-selected)
versus a control group. In the rst instance, we estimate a probit (or logistic) model to predict the probability that an individual
receives the treatment:
x
i
=
0
+
q
k=1
k
z
ki
+ u
i
27
And x=1 when x* N0 (i.e., treatment has been received), else x=0. The predicted probability of receiving the treatment (x) as a
function of q covariates (e.g., IQ, demographics, and so forth) for each individual (i) is saved. This score, which ranges from0 to 1, is
the propensity score. The point is to match individuals in the treatment and control with the same propensity scores. That is,
suppose two individual having the same (or almost the same) propensity score but one is in the treatment group and the other in
the control group. What they differ on is what is captured by the error term in the propensity equation: u
i
. That is, what explains,
beyond the covariates, whether a particular individual should have received the treatment but did not is the error term. In other
words, if two subjects have the same propensity score and they are in different groups (assuming that u
i
is just noise), then it is
almost like these two subject were randomly assigned to the treatment (D'Agostino, 1998). As mentioned, the assumption of this
method is that u
i
is unrelated to the residual term e
i
(in Eq. (10)); the unobserved factors which explain whether someone
received the treatment must not correlate with unobserved factors in the y equation (Cameron & Trivedi, 2005). If this assumption
is tenable, then in the simplest case we can match individuals to obtain the counterfactual. That is, a simple t-test for the
individuals matched across the two groups using some matching algorithm (rule) will indicate the average treatment effect. For
more information on this method, readers should consult more detailed exposs and examples (Cameron & Trivedi, 2005; Cook et
al., 2008; D'Agostino, 1998; Rosenbaum & Rubin, 1983, 1984, 1985).
4.2. Quasi-experimentation
Next, we introduce quasi-experimental methods, focusing extensively on two methods that are able to recover causal
parameters in rather straightforward ways: Simultaneous equation models, and regression discontinuity models. We discuss two
other methods, which have more restrictive assumptions (e.g., difference-in-differences models, selection models) but which are
also able to establish causality if these assumptions are met. We complete our methodological journey with a brief discussion on
single group, quasi-experimental designs.
4.2.1. Simultaneous-equation models
We begin the explanations in this section with two-stage least squares (2SLS) regression. This method, also referred to also as
instrumental-variable estimation, is used to estimate simultaneous equations where one or more predictors are endogenous. 2SLS
is standard practice in economics a workhorse and probably the most useful and most-used method to ensure consistency of
estimates threatened by endogeneity. Unfortunately, beyond economics, this method has not had a big impact on other social
science disciplines including psychology and management research (see Cameron & Trivedi, 2005; Foster & McLanahan, 1996;
Gennetian et al., 2008). We hope that our review will help correct this state of affairs particularly because this approach can be
useful for solving the common-method variance problem.
The 2SLS estimator (or its cousin, the Limited-Information Maximum Likelihood estimator, LIML) is handy for a variety of
problems where there is endogeneity because of simultaneity, omitted variables, common-method variance, or measurement
error (Cameron & Trivedi, 2005; Greene, 2008; Kennedy, 2003). This estimator has some commonalities with the selection models
discussed later because it relies on simultaneous equations and instrumental variables. Instrumental variables, or simply
instruments, are exogenous variables and do not depend on other variables or disturbances in the system of equations. Recall, the
problem of endogeneity makes estimates inconsistent because the problematic (endogenous) variable which is supposed to
predict a dependent variable correlates with the error termof the dependent variable. Refer to Fig. 3, for a simplied depiction of
the problem and the solution, which we explain in detail later.
In our basic specication we will assume that x is continuous. The endogenous variable could be dichotomous too, in which
case the 2SLS estimation procedure must employ a probit model in the rst stage equation, that is, in Eq. (29) (Greene, 2008).
Other estimators are available too for this class of model (e.g., where the y variable is a probit but the endogenous is a continuous
variable). The Stata cmp command (Roodman, 2008) can estimate a broad class of such single-indicator mixed models by
maximum likelihood similar to the Mplus structural-equation modeling program (L. K. Muthn & Muthn, 2007).
Turning back to the issue at hand, let us assume we have a common methods variance problem, where x (leader behavior) and
y (perceptions of leader effectiveness) have been gathered from a common source: boss
i
rating leader
i
(n=50 leaders; with q
representing unobserved common-source variance, and c control variables). Here, following Eq. (24) we estimate:
y
i
=
0
+
1
x
i
+
c
k=1
k
f
ik
+ e
i
x
q
i
+
y
q
i
28
1100 J. Antonakis et al. / The Leadership Quarterly 21 (2010) 10861120
The coefcient of x could be interpreted causally if an exogenous source of variance, say z, were found that strongly predicts x and is
related to y via x only, and unrelated to e (the combined term). For identication of the parameters two conditions must be satised: We
must have at least as many instruments as endogenous variables and one instrument must be excluded fromthe second-stage equation;
also, the instruments should be signicantly and strongly related to the endogenous variable x (Wooldridge, 2002). If we have more
instruments than endogenous variables, then we can test the overidentifying restrictions in this system. If appropriate instruments are
found, then the causal effect of x on y can be recovered by rst estimating Eq. (29) (i.e., the rst-stage equation) and then using the
predicted value of x to predict y. Note, all exogenous variables (in this case c, the control variables) should usually be used as instruments
of theendogenous variables, otherwiseestimates maybeinconsistent incertainconditions (for further informationrefer toBaltagi, 2002).
To illustrate the workings of 2SLS we use a theoretical example (later, we also demonstrate 2SLS with a simulated data set as
well to should how solve the problem of common-method variance). Assume that z is IQ; given that IQ is genetically determined
(i.e., has high genetic heritability, thus it is exogenous) it makes for an excellent instrument as would personality, and other stable
individual differences (Antonakis, in press), as long as they do not correlate with omitted causes with respect to y. IQ affects how
effectively a leader behaves (Antonakis, in press) and leader behavior affects leader outcomes (Barling, Weber, & Kelloway, 1996;
Dvir, Eden, Avolio, & Shamir, 2002; Howell & Frost, 1986); note, these studies are not correlational but manipulated leadership.
Also, the instruments must be related to y but less strongly than is the endogenous predictor. Assume that d is the distance of the
rater from the leader (which is assigned by the company randomly), and which may impact how effective a leader can be with
respect to that follower because it limits interaction frequency with followers (Antonakis & Atwater, 2002). We also include c
control variables (e.g., leader age, leader sex, etc.). Thus, we model:
x
i
=
0
+
1
z
i
+
2
d
i
+
c
k=1
k
f
ik
+ u
i
29
Because z and d (and f, which directly effects y) are exogenous, they will, of course not correlate with u, and more importantly
with the error term in Eq. (28) (which consists of three components). Thus, the predicted value, x, will not correlate with the
combined error term either. In the second stage we use x to predict y as follows:
y
i
=
0
+
1
x
i
+
c
k=1
k
f
ik
+ e
i
30
Fig. 3. Consistent estimation with a simultaneous equation (mediatory) model. A:
1
is inconsistent because x correlates with e. B:
1
is consistent because z and d, the
instruments (which are truly exogenous), do not correlate with e (or u for that matter). C:
1
is inconsistent because the common cause of x and y, which is reected in the
correlationbetweeneandu, is not estimated(i.e., this is akintoestimatingthesystemof equations usingOLS, whichignores cross-equationcorrelations amongdisturbances).
1101 J. Antonakis et al. / The Leadership Quarterly 21 (2010) 10861120
Howdoes 2SLS ensure consistency? What the 2SLS estimator does is very simple. Only the portionof variance that z and d (andthe
controls) predict in x that overlaps with y is estimated; given that z and d are exogenous, this portion of variance is isolated from the
error termin the y equation (for anexcellent intuitive explanation see Kennedy, 2003). Thus, the 2SLS estimate of
1
is consistent, but
less efcient thanthe OLS estimator given that less information is used to produce the estimate (Kennedy, 2003); this procedure must
not be done manually but using specialized software to estimate the standard error correctly. The signicance of each indirect
(nonlinear) effect, that is
1
1
and
2
1
canalsobetestedusingthetraditional Sobel (delta) method(Sobel, 1982) or bootstrapping
(Shrout & Bolger, 2002). Note that sums of indirect effects can be tested too in programs like Stata (i.e.,
1
1
+
2
1
).
What is very important to understand here about the estimation procedure is that, as we depict in Fig. 3, consistency can only
occur if the cross-equation disturbances (e and u) are estimated. This procedure is standard practice in econometrics and the
reason that it is done is quite straightforward. If the errors are not correlated then estimates will equivalent to OLS, which will be
inconsistent if x is endogenous (Maddala, 1977). Estimating this correlation acknowledges the unmodeled common cause of x and
y; it is a common unmeasured shock which affects both x and y, which must be included in the model. Failing to estimate it
suggests that x is exogenous and does not require instrumenting.
How can we test if the errors u and e are correlated? The Hausman endogeneity test (see Hausman, 1978) or the DurbinWu
Hausman endogeneity test (or an augmented regression, wherein the residuals of the rst-stage equation are included as a control
in the second-stage equation) (see Baum, Schaffer, & Stillman, 2007) can tell us if the mediator is endogenous. Given that we have
one endogenous regressor, this is a one degree of freedom chi-square test of the difference between the constrained model (the
correlation of the disturbance is not estimated) and the unconstrained model (where the correlation of the disturbance is
estimated); this procedure can be done in SEM programs. Thus, if the model where x is instrumented (the consistent estimator),
generates a signicantly different estimate from that where x is not instrumented (the OLS estimator, which is efcient), the OLS
model must be rejected and x requires instrumenting.
A common mistake we see in management and applied psychology is the estimation of simultaneous equations without
correlating the cross-equation disturbances as per the method suggested by Baron and Kenny (1986) or derivatives of this method.
If the correlation is not estimated and if x is endogenous, then the estimate of will change accordingly (and will not be
consistent). Thus, most of the papers testing mediation models that have not correlated the disturbances of the two endogenous
variables have estimates that are potentially biased. If, however, x is exogenous, then the system of equations could be estimated
by OLS (or Maximum Likelihood) without correlating disturbances (refer to Section 4.2.1.4 for a specic example with data). This
procedure we propose should not be confused with correlating disturbances of observed indicators in factor models, which
addresses another issue to the one we discuss in mediation (or two-stage models). In principle, disturbances of indicators of factor
models should not be correlated unless the modeler has a priori reason to do so (see Cole, Ciesla, & Steiger, 2007).
Systems of equations can be estimated using 2SLS, which is a limited information estimator (i.e., it uses information only from
an upstream equation to estimate a downstream variable). This estimator is usually a safe bet estimator because if there is a
misspecication in one part of the model and if the model is quite complicated with many equations, this misspecication will not
bias estimates in other parts of the model as would full-information estimators like three-stage least squares (e.g., Zellner & Theil,
1962) or maximum likelihood, the usual estimator in most structural-equation modeling programs (Baltagi, 2002; Bollen, 1996;
Bollen, Kirby, Curran, Paxton, & Chen, 2007). Thus, using a Hausman test, one could check whether the full-information estimator
yields different model estimates (of the coefcients) from the limited-information estimator; if the estimates are signicantly
different, then the limited-information estimator must be retained (as long as the model ts).
4.2.1.1. Examining t in simultaneous-equation models (overidentication tests). In the previous example, we can test whether the
veracity of the model and the appropriateness of the instruments. For instance, one can examine whether the instruments are
strong (Stock, Wright, & Yogo, 2002); these routines are implemented in the ivreg2 module of Stata (Baum et al., 2007). Also
important, if not more important, is to test whether the overidentifying restrictions of the system of equations are viable (when
having more instruments than mediators); this is a test of t to determine whether there is a discrepancy between the implied and
actual model. Essentially, what these tests examine is whether the instruments correlate with the residuals of the y equation. It
should be now clear to readers that this undesirable situation is due to a model that is misspecied, which means that estimates
are biased and cannot be interpreted. Thus, the model must t before estimates can be interpreted.
In the previous example, Eq. (29) is overidentied (i.e., we have one more instrument that we do endogenous regressors); thus,
the chi-square test of t has 1 degree of freedom; if we had only one instrument, the model would be just-identied and a test of t
cannot be conducted (though the Hausman endogeneity test can still be done). In the context of regression models, these test of t
are chi-square tests and are usually called Sargan tests, HansenSargan tests, or simply J-tests (see Basmann, 1960; Hansen, 1982;
Sargan, 1958). These tests are direct analogs to the chi-square test of t in the context of maximum likelihood estimation, as is
usually the case with structural equation modeling software. A signicant p-value for this test means that the model fails to t (i.e.,
that the data rejected the model); this test is well-known in psychology and management but is often (and incorrectly so) ignored.
Interestingly, economists pay attentionto the test of t. If it is signicant, the model is no good, endof story (andone must rene the
model or ndbetter instruments); they do not use approximate indexes of t, for instance the RMSEA(Browne &Cudeck, 1993), CFI
(Bentler, 1990), or TLI (Tucker & Lewis, 1973), which are not statistical tests with known distributions (Fan & Sivo, 2005; Marsh,
Hau, & Wen, 2004) or have arbitrary cut-offs, as in the case of RMSEA (Chen, Curran, Bollen, Kirby, & Paxton, 2008).
There are researchers (outside of economics) who are starting to seriously question the common practice in some social-sciences
eld of accepting models that fail the chi-square test of t apparently because with a large sample even minute discrepancies will be
detected and thus the p-value of the test will always be signicant (see Antonakis, House, Rowold & Borgmann, submitted for
1102 J. Antonakis et al. / The Leadership Quarterly 21 (2010) 10861120
publication; Hayduk, Cummings, Boadu, Pazderka-Robinson, & Boulianne, 2007; Kline, 2010; Marsh et al., 2004; McIntosh, 2007;
Shipley, 2000). If the model is correct specied, it will not be rejected by the chi-square test even at very large samples sizes (Bollen,
1990); the chi-square test accommodates random uctuations and "forgives" a certain discrepancy due to chance. Also, the chi-
squared test is the most powerful test to detect a misspecied model, as Marsh et al. (2004) demonstrated in comparing the chi-
square test to a variety of approximate t indices. Thus, we urge researchers to pay attention to the chi-square test of t and not to
report failed models as acceptable.
Finally, it is essential to study samples that are causally homogenous (Mulaik & James, 1995); causally homogenous samples
are not innite (thus, there is a limit to how large the sample can be). Thus, nding sources of population heterogeneity and
controlling for it will improve model t whether using multiple groups (moderator models) or multiple indicator, multiple causes
(MIMIC) models (Antonakis et al., submitted for publication; Bollen, 1989; B. O. Muthn, 1989).
4.2.1.2. The PLS problem. Researchers in some elds (particularly information systems and less so in some management
subdisciplines) use what has been referred to as Partial-Least Squares (PLS) techniques to test path models or latent variable
(particularly composite) models. We discuss this modeling method briey, because it is quite popular in other elds yet PLS has no
important advantages of regression or OLS. Because it seems to be slowly creeping into management research we feel it is
important to warn researchers to not use PLS to test their models. PLS estimates are identical to OLS in saturated models with
observed variables. Whether modeling composites in PLS or indexes/parcels in saturated regression models will not change
estimates by much (Temme, Kreis, & Hildebrandt, 2006).
The problem with PLS, however, is that it cannot test systems of equations causally (i.e., overidentifying restrictions cannot be
tested) nor can it directly estimate standard errors of estimates. Because the model's t cannot be tested, the modeller cannot
know if model estimates are biased. Also, its apparent advantages over regression-based (OLS and 2SLS) or covariance-based
modeling (e.g., SEM) is rather exaggerated (see Hair, Black, Babin, Anderson, & Tatham, 2006; Hwang, Malhotra, Kim, Tomiuk, &
Hong, 2010; Marcoulides & Saunders, 2006; McDonald, 1996); recently it has been also shown that PLS can experience
convergence problems (Henseler, 2010). PLS users commonly repeat the mantra that PLS is good for prediction, particularly in
early phases of theory development whereas SEM models are good for theory testing; this comment suggests that one cannot
predict using SEM or 2SLS, which is obviously a baseless assertion. We really nd it odd that those using PLS would knowingly not
want to test their model when they could use more robust tests.
In a recent simulation PLS was found to perform worse than SEM (both in conditions of correct and misspecication); also,
although a new approach, referred to as generalized structured component analysis, has been proposed as a better alternative to
PLS (because it is similar to SEMin the sense that it can test model t), it does not provide for better estimation when the model is
correctly specied (Hwang et al., 2010). Interesting in this simulation is that the new method performed better under conditions
of model misspecication (which makes sense given that it is a limited-information estimator); however, it is unclear as to
whether this estimation approach is better than other limited-information (e.g., 2SLS) estimators (e.g., Bollen et al., 2007).
Other apparent advantages of PLS are that it makes no distributional assumptions regarding variables and does not require
large sample sizes; however, regression or two-stage least squares analysis do not make any assumptions either about
independent variables and can estimate models with small sample sizes. More importantly, there are estimators build into
programs like MPlus, LISREL, EQS, and Stata that can accommodate a large class of models, using robust estimation and various
types of variables, which might not be normally distributed or continuous (e.g., dichotomous, polytomous, ordered, counts,
composite variables, etc.). Thus, given the advances that have been made today in statistics software, there is no use for PLS
whatsoever (see in particular McDonald, 1996). We thus strongly encourage researchers to abandon it.
4.2.1.3. Finding instruments. Finally, one of the biggest challenges that researchers face when attempting to estimate instrumental
variable models has to do with where to nd instruments. In the case of an experiment, where the modeler wishes to establish
mediation, the modeler will have the perfect instrument/s: the manipulated variables. As long as the model is estimated correctly
(with the cross-equation disturbances of the endogenous variables correlated), then the causal mediation inuence can be
correctly identied. In the case of cross-sectional or longitudinal research, stable individual difference that are genetically
determined could do (personality and cognitive ability), as would age, height, hormones (e.g., testosterone), or physical
appearance (Antonakis, in press); geographic factors (distance from the leader as mentioned previously) could work. Time effects
could be used as an exogenous source of variance as could exogenous shocks of from a particular event; there are contextual
effects that could affect leadership, including laws or cultural-level factors (Liden & Antonakis, 2009). With panel data, xed-
effects of leaders or more simply cluster-means should also do the trick because they would capture all unobserved sources of
variance in the leader that predict behavior (e.g., Antonakis et al., submitted for publication); this procedure will essentially purge
rater i's score from idiosyncratic bias, common-method bias, or other errors, given that the xed-effect (i.e., the cluster-mean
score) should mostly capture true variance (Mount & Scullen, 2001; Scullen, Mount, & Goff, 2000). Others have had ingenious
ideas, estimating the effect of a change of leadership (presidents) on country-level outcomes using death in ofce as an exogenous
source of variance (Jones & Olken, 2005); thus, the change of the handover of power is random (exogenous sources of variances
such as this could be used to identify causal effects in two-stage models). Finding instruments is, at times, not easy; however, the
time spent to nd instruments is an investment that will serve science and society in good stead because the estimated parameters
of the model will be consistent.
Important to note, once again, is that the instruments must not correlate with e, omitted causes. For instance, if an omitted
common cause of leader style and effectiveness is affect for the leader and if leader IQ is used as an instrument, the modeler must
1103 J. Antonakis et al. / The Leadership Quarterly 21 (2010) 10861120
be sure that affect for the leader and IQdonot correlate. If they docorrelate, the model will be misspecied; however, misspecication
could be caught by the overidentication test (as long as true exogenous variables, in addition to the bad instrument are included).
Thus, it is crucial to try and obtain more instruments than endogenous variables so that the overidentication test can be performed.
Also, the instruments must rst pass a theoretical overidentication test before anempirical one (if all the modeled instruments are
not truly exogenous the overidentication test will not necessarily catch the misspecication, as we have shown).
4.2.1.4. Solving the common-method variance problem with 2SLS. We provide two examples next; one where we show how to
recover causal estimates with instrumental variable and 2SLS. The second example is a full SEM causal model, where we recover
the causal estimates with instrumental variables using maximum likelihood estimation.
Example 1 using 2SLS: The previous discussion has been a theoretical one and readers might be skeptical about how the 2SLS
estimator can recover causal estimates. We thus generated data with a known structure where there is a strong common-method
variance effect. Assume that we have an endogenous independent variable x, a dependent variable y, two exogenous and perfectly-
measured variables m and n, and a common source effect, q. The true model that generated the data is (note that e and u are
normally distributed and independent of each other):
x =
0
+ q + 0:8m + 0:8n + e 31
y =
0
+ q0:2x + u 32
We generated this data for a sample size of n=10,000. Refer to Table 3 for the correlation matrix and sample statistics of this
data (note, these summary data can be inputted into a structural-equation modeling program to derive the same estimates with
maximum likelihood).
Estimating the OLS model (or using Maximum Likelihood), where y is simply regressed on x, clearly gives a wrong estimate
with the wrong sign (.11); the true estimate (.20) is 281.82% lower! Here is an example of the sinister effect of the common
method variable, which when omitted from the equation makes x endogenous; as we mentioned, the biased OLS coefcient could
be higher, lower, of a different sign or not signicant. We trust it is now clear that Spector's (2006) suggestion that common-
method variance is an urban legend is an urgent legend in itself.
The estimates of this model are depicted in Panel A of Table 4. The known-model estimates, based on two OLS equations (i.e.,
not correlating cross-equation disturbances, which is not needed because of sources of variance in the endogenous variables are
accounted for) reproduce the correct estimate precisely (.20), as indicated Panel B in Table 4. However, in the real-world this
model would not be estimated because it is highly probably that the common cause, q, cannot be measured directly.
Thus, the only correct solution that is available to address this problem is one that is straightforward to use, provided the
modeler has instruments. Using the 2SLS estimator, which exploits the exogenous sources of variance from m and n, recovers the
true estimate (see Panel C in Table 4); the exogenous variables do not correlate with q (and thus not with e when q is not included
in the equation) nor with u because they vary randomly. They are strongly related to x and only affect y via x. Next, even though q is
not included in the model, the 2SLS estimator recovers the estimate of interest exactly (.20), though with a slightly larger
condence interval; as we said before, the price that is paid is reduced efciency. In the case of two-equation models, and with
strong instruments, the 2SLS estimator gives similar estimates to three stage least squares (3SLS), iterated 3SLS, maximum
likelihood (ML), and limited information ML (LIML).
To demonstrate the stability of the 2SLS estimate, a Monte Carlo simulation of this data structure based on 1000 simulations
provided a mean estimate of .20, with a 95% condence interval of between .2007859 and .1992456!). Finally, a Sargan chi-
square test of overidentication (Sargan, 1958) suggests that the instruments are valid, p=.30 (the simulation results conrmed
this nding too, mean p=.32).
Now, had we estimated this previous model using the standard approach (irrespective of the estimator) that is usually used in
management and applied psychology where the cross-equation disturbance are not correlated would have given an incorrect
estimate (i.e., .11, which is, in fact, that of the OLS model); not estimating the cross-equation disturbance suggests that there is no
common shock that might predict x and y, which is unmeasured and not accounted for in the model. That assumption is too
strong to make, and as we demonstrate, incorrect in the context of such mediation models.
Table 3
Correlation matrix for 2SLS demonstration.
Variable Mean SD q m n x y
q .01 1.00 1.00
m .01 1.01 .01 1.00
n .02 1.00 .00 .01 1.00
x .01 1.82 .55 .44 .45 1.00
y .00 1.32 .62 .13 .12 .15 1.00
N=10,000.
1104 J. Antonakis et al. / The Leadership Quarterly 21 (2010) 10861120
Example 2 using ML: The previous demonstration should now explain further that if the effect of a common source/method is not
explicitly modeled, true parameter estimates cannot be recovered (e.g., by attempting to model a method factor, because how the
method factor affects the variable is unknown to the researcher). Thus, one defensible statistical way to control for this problemis
the way we have demonstrated previously, by using instrumental variables. The same procedures can be extended to full SEM
models. We provide a brief example later, following a similar specication in Fig. 2 where we include a dependent variable y,
presidential leader effectiveness, and two independent variables. All measures were obtained from voters, who only have some
knowledge of the leaders behaviors. We include a common cause suppose it is affect for the leader or any other common-cause
mechanism as well as two instrumental variables z1 and z2 that do not correlate with the common cause (also assume no
selection effects due to the instruments). The rst instrument, z1 is the leader's IQ and z2 is the leader's neuroticism, which are
orthogonal to each other. 1 and 2 are transformational and transactional-oriented leadership respectively (for simplicity these
are the only styles of leadership that matter). Thus, the more subordinates like the leader the more they see her as charismatic and
the less they see her as transactional; however, these styles vary too because of the leaders' personality and IQ. Given that the
leader individual differences are largely exogenous (i.e., due to genes), they will vary independently of other factors in the model.
The correct model is depicted in Fig. 4, Panel A, which ts the data perfectly:
2
(51)=47.48, pN.05, n=10,000; all estimates
are standardized. In Panel B, we estimate the model only using the instruments. The parameters estimates are correct as long as
the disturbances are correlated; the model ts perfectly, despite omitting the common cause:
2
(45)=48.50, pN.05. In Panel C the
model is still correct. Given that the instruments are exogenous, they do not correlate with the common cause. Thus, omitting the
instruments (or the common cause, as we showed in Panel B) will not bias the estimates; also, the model ts perfectly:
2
(37)=
37.96, pN.05. Finally, the model depicted in Panel D is incorrect because the latent variables are endogenous and are not purged
from endogeneity bias. The structural estimates are incorrect although the model ts:
2
(31)=34.46, pN.05. Again, this example
not only demonstrates that instruments can purge the bias from endogenous variables but that it is imperative that the model be
correctly specied. Note, we tried to recover the correct causal estimates by modeling a latent common factor; however, the model
produced a Heywood case on y whose variance we had to constrain so that it could be estimated); doing so resulted in good
model t. However, the model estimates were still wrong.
Thus, we hope that our demonstrations will provide new directions in solving the common-method problem and in
estimating mediation models correctly. Also, as is evident, the modeler must rely on theory as well as statistical tests when
specifying models and ensure that they model exogenous sources of variance to obtain consistent estimates.
Table 4
Estimates for 2SLS demonstration.
Independent variables Coef. Std. err. t p-value 95% conf. interval
Panel A: OLS (dependent variable is y)
F(1, 9998)=237.47, pb.001, r
2
=.02
x .11 .01 15.41 .00 .10 .12
Constant .01 .01 .26 .79 .02 .03
Panel B Two-equation model estimated with OLS (dependent variable is y)
F(2, 9997)=3927.65, pb.001, r
2
=.44
x .20 .01 30.34 .00 .21 .18
q 1.02 .01 86.26 .00 1.00 1.04
Constant .01 .01 .83 .41 .01 .03
Two-equation model estimated with OLS (dependent variable is x)
F(3, 9996)=7706.09, pb.001, r
2
=.70
q 1.00 .01 10.20 .00 .98 1.02
m .80 .01 81.41 .00 .78 .82
n .81 .01 81.33 .00 .79 .83
Constant .01 .01 1.07 .29 .03 .01
Panel C Simultaneous equation model estimated with 2SLS (dependent variable is y)
F(1, 9998)=263.32, pb.001, r
2
=.02
a
; Sargan overidentication
2
(1)=1.07, p=.30
x .20 .01 16.23 .00 .23 .18
Constant .00 .01 .02 .98 .03 .03
Simultaneous equation model estimated with 2SLS (dependent variable is x)
F(2, 9997)=3985.68, r
2
=.44
m .80 .01 56.93 .00 .77 .82
n .82 .01 57.75 .00 .79 .84
Constant .02 .01 1.35 .18 .04 .01
N=10,000.
a
Note, it is possible that the r-square in the y equation in simultaneous equations models is undened; however, this is not a problemin simultaneous equation
models and structural estimates will be correct (Wooldridge, 2009). As a measure of regression t, the predicted value of y, can be correlated to the observed y
and then squared (which is one way that r-square is calculated). We used this calculation for r-square in this model.
1105 J. Antonakis et al. / The Leadership Quarterly 21 (2010) 10861120
4.3. Regression discontinuity models
The regression discontinuity design (RDD) is a deceptively simple and useful design. It was rst proposed by Thistlethwaite and
Campbell (1960) and brought to the fore by Cook and Campbell (1979). Interestingly, this design has been rediscovered
independently in several elds (Cook, 2008). After the randomized experiment the RDD is the design that most closely
approximates a randomized experiment (Cook et al., 2008); however, it is underutilized in social-sciences research and not well
understood (Shadish et al., 2002). It is currently experiencing a renaissance in economics (Cook, 2008). We discuss this design
extensively, because it is very useful in eld settings.
The reason why the design is so useful is that like the randomized experiment, it specically models the selection procedure.
Whereas in the randomized experiment selection to treatment is random, in the RDD selection is due to a specic cut-off (or
threshold) that is observed explicitly and modeled as such; this cut-off can be a pretest or any other continuous variable that does
not necessarily have to be correlated with y (Shadish et al., 2002). Froman ethical point of view, and when the cutoff is a pretest of
y, this design is very useful in that the individuals who are most likely to benet from the treatment obtain it; however, this non-
randomization to condition is precisely why the RDD is difcult to grasp in terms of whether its estimates are consistent. The
reason why the RDD yield consistent estimates is that selection to experiment and control group is based on an explicitly
1
x1
1
x2
2
x3
3
x4
4
2
x5
5
x6
6
x7
7
x8
8
Measured
Common
Cause
.25 -.58
.66
.94 .95 .94 .94 .96 .96 .96 .96
y
.55
.77
.01
3
1 2
C: Correct Model
1
x1
1
x2
2
x3
3
x4
4
2
x5
5
x6
6
x7
7
x8
8
.24
.94 .95 .94 .94 .96 .96 .96 .96
y
.69
.00
3
D: Incorrect Model
A: Correct Model B: Correct Model
1
x1
1
x2
2
x3
3
x4
4
2
x5
5
x6
6
x7
7
x8
8
Measured
Common
Cause
.26 -.58
.66
.94 .95 .94 .94 .96 .96 .96 .96
y
.56
z2 z1
.68 .57
.77
.00
3
.00 .00
1 2
.00
.00
.00
1
x1
1
x2
2
x3
3
x4
4
2
x5
5
x6
6
x7
7
x8
8
.67
.94 .95 .94 .95 .96 .96 .96 .96
y
.54
z2 z1
.68 .57
-.25
3
.32 -.64
1 2
.00
Fig. 4. Recovering causal estimates in a structural equation model with instrumental variables (estimates are standardized). A: This is the correctly specied model
including the common cause and two instrumental variables z1 and z2; note, the instruments and the common cause do not correlated. Thus, omitting the
common cause or z1 and z2 will not bias estimates. B: Despite omitting the common cause, this model is correctly specied given that the endogeneity bias is
purged with instruments z1 and z2. C: This model is also correct; because the common cause does not correlate with the instruments z1 and z2. D: This model is
incorrect, and estimates are biased because the latent variables 1 and 2 correlate with 3.
1106 J. Antonakis et al. / The Leadership Quarterly 21 (2010) 10861120
measured criterion that is included in the regression equation (thus, the disturbance term contains no information that might
correlate with the grouping variable). The advantages of this design are many, given that it is relatively easy to implement in eld
settings to test the effectiveness of a policy (particularly when using a pretest threshold).
To explain the basic workings of this design assume that a company decides to give leadership training to its managers;
however, the company CEO is not sure if leadership training works. A professor, eager to test the workings of the RDD, suggests
that they could emulate a randomized experiment and simultaneously help those that need leadership training the most (i.e.,
provide the training only to the leaders who are belowa certain threshold). Let s be the selection threshold for training, based on a
pretest of a validated diagnostic test of the leaders' charisma (x
i
). Leaders who score below a threshold which will be the group
mean (this choice maximizes power in the RDD Shadish et al., 2002) are placed in the treatment condition. Thus, for leader i,
selection is based on the following explicit rule:
s
i
= 1 if x
i
P
x
s
i
= 0 if x
i
N
P
x
Then, the following regression model is estimated, using the mean-centerd pretest score to set the intercept to the cut-off value
(note, controls could be included in this equation to increase power):
y
i
=
0
+
1
s
1
+
2
x
i
P
x + e
i
33
The treatment effect is
1
. The counterfactual is
0
(what the treatment group would have had had it not been treated). Too
good to be true, right? As mentioned by Shadish et al. (2002) many will think it implausible that the regression discontinuity
design would yield useful, much less unbiased, estimates of treatment effect (p. 220). Below we thus show explicitly why RDD
approximates the randomized experiment almost perfectly (we base our examples on the arguments and gures presented by
Shadish et al., 2002). We use our own, simulated data, with a known structure, coupled with errors-in-variables regression as well
as a Monte Carlo experiment to showhowthe RDD can provide consistent estimates (so we kill three birds with one stone given
that we plead in the conclusions too for more use of the Monte Carlo simulation method).
We rst begin with a simple example to show the parallel between the RDD and the randomized experiment. Five hundred
participants were randomly assigned to a control and treatment condition (x=1 if in treatment, else x=0). We include a perfectly
measured pretest (z) correlating .60 with the posttest (y); both variables are standardized, thus structural parameters are
standardized too. We generated the data such that the treatment increases y by 2 points on the scale. The regression model we
estimated (the typical ANCOVA model in psychology) was:
y
i
=
0
+
1
x
i
+
2
z
i
+ e
i
34
Results indicate a signicant regression model, F(2, 497)=544.80. The coefcient of
1
=2.00, standard error=.07, t =27.87,
pb.001. The coefcient of
2
=.60, standard error=.04, t =16.71, pb.001. The constant is 2.60. The regression lines are parallel
given that the xz interaction was insignicant (see Fig. 5A).
Now, to understand how regression discontinuity works and to see its visual relation to the experiment (Shadish et al., 2002),
suppose that we had given the treatment only to that part of the treatment group that scored below9, which was the group mean,
on the pretest. Also, suppose that those who score above the threshold do not receive the treatment. Using the same data as before,
we obtain the two regression lines (see Fig. 5B).
The discontinuity can be seen at the mean of x (the threshold for assigning a participant to the treatment or control condition);
this sharp-drop in the line suggests that those just left of the treatment cut-off benetted greatly as compared to those just to the
right of the cut-off in the control group who did not receive the treatment. Estimating Eq. (33) shows that the regression model
was signicant, F(2, 243)=82.21. The coefcient of
1
=1.99, standard error=.17, t =6.78, pb.001. The coefcient of
2
=.58,
standard error=.09, t =6.78, pb.001. The constant is 8.03. The regression lines are parallel given that the xz interaction was
insignicant; note, it is always good policy to include the xz interaction in case the experiment produces not only a change in the
constant but also in the slope (Hahn, Todd, & Van der Klaauw, 2001; Lalive, 2008).
The treatment effect is almost precisely the same as before (1.99 now, versus 2.00). As we mentioned before, the counterfactual is
the constant; thus, if the experimental group had not received the treatment, its mean would have been 8.03. Now, going back to the
randomized experiment, the tted model indicated that =2.60+2x+.60z. Thus, at the mean value of z we predict y to be the
following for the control group, which is the true counterfactual or the estimated marginal mean: 2.60+20+.609=8.00!
This exercise never ceases to amaze us, but it is so obvious once one understands how the RDD works. As is evident from the
graphs, the randomized experiment replaces the discontinuity with random assignment. Rather than allocating everyone using a
cutoff to the treated condition, the randomized experiment assigns a random subgroup to either the treated or the control
condition. Furthermore, readers should not fall into the trap of thinking that RDDis simply explained by regression to the mean, in
the sense that when remeasuring participants with extreme values their post-scores regress to the mean. As mentioned by Shadish
et al. (2002), any regression effects are already captured in the regression line. Of course, those initially scoring in the extremes
will regress; however, this causes the slope of the regression line to become atter, but it does not cause discontinuities.
To test RDD a step further we then conducted a Monte Carlo experiment. To provide for a strong test, we made the correlation
between y and x more realistic by adding error to x, and thus also showthe workings of the errors-in-variables estimator: We add a
1107 J. Antonakis et al. / The Leadership Quarterly 21 (2010) 10861120
normally distributed error term(e) to x (i.e., .5e). The reliability of x is (Bollen, 1989): 1(error variance)/(total variance). Given
that the original x (without error) had a variance of 1, and we observe the variance of x-with-error to be 1.28, the theoretical
reliability of x is 1(1.281)/1.28=.78. We then ran a Monte Carlo experiment, estimating the same regression model as in the
RDD shown previously using the mean of x as the cut-off. We simulated this process 1000 times to see how stable this estimator is
(i.e., specically to see how the causal parameters of interested were distributed).
The results showed that the RDD coupled with the errors-in-variables estimator recovered the true causal parameters almost
precisely! The mean of the constant was 8.01 (95% condence: 8.00 to 8.01). The mean of coefcient of
1
=2.02 (95% condence:
2.01 to 2.03). Finally, the coefcient for of
2
=.60 (95% condence: .59 to 60)!
We re-ran the Monte Carlo using OLS to demonstrate the effect of measurement error on the estimates. The mean of the constant
was 8.21 (95% condence: 8.20 to 8.21). The mean of coefcient of
1
=1.62 (95% condence: 1.61 to 1.63). Finally, the coefcient for
of
2
=.34 (95%condence: .34 to .34). These estimates are way off the correct estimates; the treatment effect was underestimated by
a large margin (19.80%). The effect of the covariate was underestimated by a much larger margin (43.33%). Finally, the
counterfactual was slightly overestimated (+2.50%); however, the intercept seems to be less affected. Results with more thanone ill-
measured covariate would certainly create much more bias than what we have showed here with a very simple model.
To conclude, we trust that our demonstrations will create some interest in using RDDin leadership research as well as in related
areas (management, applied psychology, strategy, etc.). This design is clean and simple to run. Because of space restrictions we
have only covered the basics of RDD; readers should refer to more specialized literature for further details (e.g., Angrist & Krueger,
1999; Angrist &Pischke, 2008; Hahn et al., 2001; Lee &Lemieux, 2009). Finally, modelers can nd creative ways to use the RDD. For
instance, regression discontinuities could also be used where the modeled cut-off is an exogenous shock (e.g., war).
4.4. Difference-in-differences models
In the case where a treatment and a very similar control group are compared before and after a treatment, causal inference
could be made provided certain assumptions are met (the most important being that in the absence of the treatment, the
difference between the two groups is relatively stable over time). This type of model is called a difference-in-differences model in
6
8
1
0
1
2
1
4
6 8 10 12
x
A
6
8
1
0
1
2
6 7 8 9 10 11
x
B
control
treated
fitted
fitted
control
treated
fitted
fitted
Fig. 5. Similarity between randomized experiment and regression discontinuity. A): Estimating causal effect using a randomized experiment. B): Estimating the
causal effect using regression discontinuity.
1108 J. Antonakis et al. / The Leadership Quarterly 21 (2010) 10861120
economics (see Angrist & Krueger, 1999; Angrist & Pischke, 2008; Meyer, 1995); in psychology, it is usually referred to as an
untreated control group design with pre- and post-test (Shadish et al., 2002). We will discuss these models from the point of view
of economics, given that the literature on estimation methods is more developed in this eld.
The basic idea of the difference-in-differences model is to observe the effect of an exogenous shock on a treatment group;
the treatment effect is the difference between the treated group and a comparable control group across time. Using a comparable
group thus differences-out confounding contemporaneous factors. For a graphic depiction of this model see Fig. 6.
The following model, in panel form, is thus estimated:
y
it
=
0
+
1
x
i
+
2
t +
3
x
i
t + e
it
35
Where person i is in a group (x=1 for treatment group, else it is 0) in a particular time period (t =1 if post treatment, else it is 0);
for simplicity, suppose that data are on two periods, before and after the intervention, where t =1 is post-treatment, else it is 0; the
model shouldinclude control variables, whichwe have omittedfor simplicity. The treatment effect is capturedby the coefcient of the
interaction term,
3
. Another way of looking at the treatment effect is to difference y across time and groups, which gives:
fEY
it
j x
i
= 1; t = 1EY
it
j x
i
= 0; t = 1gfEY
it
j x
i
= 1; t = 0EY
it
j x
i
= 0; t = 0g =
3
36
That there may be differences between the groups prior to the intervention is captured by the xed effect of group
membership, that is, the coefcient of x (thus, random assignment is not of issue here, as long as the assumptions of the
method are satised). Fixed effects of time are captured by the coefcient of t, that is, because changes in y might be due to
time. Note, xed effects of individual could be modeled as well in which case the between group differences will be captured
by the individual xed effects rather than by the parameter
1
. What is important for this model is that xt is not endogenous,
that is, that the difference between the groups is stable over time and that the timing of the treatment is exogenous (i.e., that
differences in y are not due to unmeasured factors); this assumption can be examined by comparing data historically to see if
differences are stable across the groups before (and after) the treatment (Angrist & Krueger, 1999). Also, given that the data
are panel data, it is important to correct standard errors for clustering on the panel variables (Bertrand et al., 2004). Note, that
0
+
1
+
2
provides for the counterfactual (i.e., y of the treatment group had it not been treated). Of course, the basic
difference-in-differences model can be expanded in more sophisticated ways (Angrist & Krueger, 1999, 2001; Angrist &
Pischke, 2008; Meyer, 1995).
Applied to leadership research, suppose that a CEO of a company that has two similar factory sites decides to hire a professor to
conduct an experiment to see whether leadership training works. She decided this on the basis of yearly data the company has
been gathering using a 360 leadership instrument, which showed that the mean level of charisma has been declining across the
two sites (at a similar rate) and that it is now belowa critical threshold in both sites. Trained as a medical researcher, she suggests
to the professor that all the company's 500 supervisors should be randomized into a treatment or a control group. Instead of doing
a randomized experiment within each factory, which could have spillover effects from the treatment to the control groups, the
professor convinces the CEOto allowher to train the managers on one site only. Given the fact that they are separated by a distance
of 2000 km and because they produce pharmaceutical products for different markets (which both have strong demand for their
products), it is very unlikely that managers in the control site will get to know about the training that will be conducted in the
experimental site. Furthermore, because demographic indicators regarding the managers and the workers are similar in the two
sites (as are charisma trends in the managers), and socioeconomics about the same (as historical data indicates) the difference-in-
differences would be an appropriate tool to use in this particular case.
4.5. Selection models (Heckman models)
As discussed previously, when there is unmodeled selection to treatment (i.e., participants attend leadership training, but
training is not assigned randomly), estimates will be inconsistent because unobserved variance which affects selection in the
selection equation (see Eq. (11)) could be correlated with unobserved variance that affects the dependent variable (see Eq. (10))
Fig. 6. Estimating causal effects using difference-in-differences.
1109 J. Antonakis et al. / The Leadership Quarterly 21 (2010) 10861120
this endogeneity can inate or deate the treatment effect. One way around this problem is to estimate a Heckman type two-step
selection model (Heckman, 1979) or more specically, what is referred to as a treatment effects model (see Cong & Drukker, 2001;
Maddala, 1983).
The idea behind this model is to use instruments to predict participation in the treatment or control group (the probit rst-step
equation). Thereafter, a control variable, which captures all unobserved differences between the treatment and control groups due
to selection, is added in the second step (the substantive equation). This control variable will remove the variance from the error
term due to selection, so as the coefcient on the treatment term can be correctly estimated. This model is easily estimated in
advanced statistics programs like Stata.
Note, there are other types of models that can be estimated having sample selection bias, for example, models where the
dependent outcome could be binary instead of continuous. Another type of selection model is the classic Heckman two-step model
for situations where one observes the dependent variable only for the selected group (i.e., there is missing data on the dependent
variable). The example Heckman used is to estimate the effect of education on wages for women, with the selection problembeing
the fact that women choose to work depending on the offer wage and the minimumwage a woman would expect to have (i.e., her
reservation wage); thus, simply regressing the wages on education will not provide a consistent estimate.
Applied to leadership, suppose we wish to estimate whether there are sex differences in leadership effectiveness. The selection
problem is that individuals are appointed to positions of leadership based in part on their sex and not only on their competence
(Eagly & Carli, 2004). That is, because of social prejudice mechanisms, stereotype threat, and self-limiting behavior, females may
be less likely to be appointed to leader roles as a function of the gender typing of the context. Thus, in male-oriented environments,
the sample of observed male leaders is biased upward in male-stereotypical settings and only the performance of very competent
women would be observed (because women are held to a higher standard of performance and thus only the more competent
women are observed); indeed, when comparing the effectiveness of women versus men in business settings that would reect this
selection, women are signicantly more effective (Antonakis, Avolio, &Sivasubramaniam, 2003; Eagly, Johannesen-Schmidt, &van
Engen, 2003). The effect of being a woman on leadership effectiveness could thus be overstated.
The Heckman model could be useful in this context. In the rst step, we would predict the probability of being a leader using
exogenous instruments (e.g., sex, competence, sex-typing of the job, cultural factors, etc). Then in the second step, we would
include sex as a predictor and control for unobserved heterogeneity in the selection in predicting effectiveness of leaders who we
observe to derive a consistent estimate of the effect of sex on effectiveness.
4.6. Other types of quasi-experimental designs
There are other ways to obtain causal estimates using very simple methods. Researchers should refer to Cook and Campbell
(1979) and Shadish et al. (2002) for ideas. For instance, extending the idea behind the non-equivalent dependent variable design
(see Shadish et al., 2002), suppose that a researcher wants to investigate the efcacy of a leadership training program; however,
for whatever reason (e.g., restrictions imposed by an organization, ethical reasons, etc.) the researcher cannot have a control
group. One way to obtain estimates that could be consistent is to pretest the participants on the measure of interest (e.g.,
charisma) as well as on a closely-related measure that the researcher did not intend to change (e.g., communication skills, see
Frese, Beimel, & Schoenborn, 2003 for an example). The point of this design is to show a signicant difference between the Time 1
and Time 2 measure of interest and no difference in the other measure that the researchers did not intend to manipulate. In the
Frese et al. (2003) study, however, they did nd differences too in communication skills, which can be interpreted as learning
effects; however, they could have used this information to unbias their parameters of interest (though they did not). That is, a
simple way to remove the variance due to learning effects is to include the non-equivalent measure as a control variable,
particularly if one has pre and post measures as well as control variables (and can thus estimate a panel model). Of course, such
methods will are not substitutes for the experiments, but if the right controls are included they may provide good enough
estimates of treatment effect.
Next, we discuss the state-of-the-art of causal analysis in leadership research. We rst explain the sample we used in this
review and our coding method. Thereafter we present the ndings and discuss their implications.
5. Review of robustness of causal inference in management and applied psychology
5.1. Sample
To gauge whether leadership research is currently dealing with central threats to causal inference (i.e., reporting estimates
that are consistent), we reviewed and coded a random sample of articles appearing in top management and applied
psychology journals. The initial sample from which the nal set of articles was drawn was quite large (n=120) and current
covering the last 10 years (i.e., between 1999 and 2008). We did not code any laboratory experiments given that their
estimates would be consistent by design (because of randomization). We only coded empirical non-experimental papers and
eld experiments, because it is in these categories of research where potential problems would be evident. The population of
journals we surveyed, including The Leadership Quarterly are all top-tier journals according to objective criteria (i.e., 5-year ISI
impact factor reported in 2009) in the domain of management or applied psychology. These journals publish research on
leadership and have a strong micro or psychology focus. We include the list of journals as well as their 5-year impact factor (IF)
1110 J. Antonakis et al. / The Leadership Quarterly 21 (2010) 10861120
and rank in either management (MGT) where there are 89 journals listed and/or in applied psychology (AP) where there are
61 listed journals:
1. Academy of Management Journal: Management (IF=7.67; MGT=3rd)
2. Journal of Applied Psychology (IF=6.01; AP=1st)
3. Journal of Management (IF=4.53; MGT=9th)
4. Journal of Organizational Behavior (IF=3.93; MGT=14th; AP=4th)
5. The Leadership Quarterly (IF=3.50; MGT=18th; AP=5th)
6. Organizational Behavior & Human Decision Processes (IF=3.19; MGT=21st; AP=10th)
7. Personnel Psychology (IF=5.06; AP=2nd)
We rst identied the population of articles that met our selection criteria. We used ISI Web of Science to initially identify
potential articles which included either leader or leadership in the topics eld, which searches in the title, keywords, and
abstract. We only examined studies that focused on leadership per se. We limited studies using the denition of leadership
provided by Antonakis, Cianciolo, and Sternberg (Antonakis, Cianciolo, & Sternberg, 2004, p. 5), that is, leadership can be dened
as the nature of the inuencing process and its resultant outcomes that occurs between a leader and followers and how this
inuencing processes is explained by the leader's dispositional characteristics and behaviors, follower perceptions and
attributions of the leader, and the context in which the inuencing process occurs. Thus, we coded only studies that examine the
inuencing process of leaders from a dispositional or behavioral perspective, where leadership could be either an independent or
dependent variable.
We then determined how many papers were quantitative non-experimental studies or eld experiments. The population of
studies that met our criteria was 287 (i.e., 281 non-experiment and 6 eld experiments). This population was distributed as
follows across the journals: Academy of Management Journal (9.06%), Journal of Applied Psychology (24.04%), Journal of Management
(3.14%), Journal of Organizational Behavior (13.24%), The Leadership Quarterly (42.16%), Organizational Behavior & Human Decision
Processes (3.14%), and Personnel Psychology (5.22%).
We then randomly selected 120 studies using stratied (proportionate) sampling by journal and type of study (i.e., non-
experimental or eld experiment). Fromthis sample of 120 studies, we dropped 10 which, although quantitative in nature, did not
make any implicit or explicit causal claims as in the case of scale validation studies; we did though retain those that made, for
example, comparison of factors across groupings like gender (e.g., Antonakis et al., 2003). Thus, the nal sample was 110 studies,
distributed as follows: Academy of Management Journal (8.18%), Journal of Applied Psychology (26.36%), Journal of Management
(3.64%), Journal of Organizational Behavior (14.55%), Leadership Quarterly (38.18%), Organizational Behavior & Human Decision
Processes (3.64%), and Personnel Psychology (5.45%). The nal distribution of papers was the same as the original distribution,
2
(6)=.67, p=1.00.
5.2. Coding
We evaluated studies on each of the sub-criteria of the seven categories listed later (i.e., in total there were 14 criteria). We
coded each criterion, using a categorical scale: 0=irrelevant criterion; 1=relevant criterion for which the authors did not correct;
2=relevant criterion for which we were unable to determine whether it was taken into account by the authors; 3=relevant
criterion which the authors addressed. Note, we not code for correct use of statistical tests, for example, use of the chi-square
overidentication test. The criteria we coded included those listed in Table 1.
When papers reporting several studies, we only coded those which were non-experimental studies or eld experiments even if
they represent a small portion of the paper; for example, the coding of de Cremer and van Knippenberg (2004) is based solely on
the one non-experimental study reported by the authors and does not take into account the experimental studies presented in the
paper.
The coding was undertaken by the second and third authors of this study. The coders were rst familiarized with the coding
criteria. To ensure that the coders of the study were well calibrated with each other, they independently coded ve randomly-
selected studies from the eligible population of leadership studies we had identied (but which had not been selected in our
random sample). Thereafter, differences were reconciled. The coders then independently coded 20 studies and we calculated
agreement statistics (which indicated very high agreement, i.e., 80.51% agreement across the 280 coding events). After differences
were reconciled, the coders then coded the rest of the studies independently.
Each study was then discussed between the coders and differences were reconciled. Finally, the rst author crossed-checked a
random sample of 10 studies from the total population of studies coded (and reconciled situations where either one or the other
coder was unsure as to what to code). The nal coding represents the agreed ratings of both coders.
6. Results
We rst report results for the coding to examine whether it was undertaken reliably by the two coders. The total coding events
were 1540 (14 criteria times 110 papers); however, we computed agreement statistics for 1519 coding events only given that for
21 of the coding events either one or the other coder was unsure about how the coding procedure should be applied. In this case,
the rst author reconciled the coding.
1111 J. Antonakis et al. / The Leadership Quarterly 21 (2010) 10861120
Initial agreement based on the rst independent coding of the 110 studies was 74.39% (1130 agreements out a possible 1519
coding events) and the initial agreement coefcient was .60, SE=.02, z=33.51, pb.001 (see Cohen, 1960). This result suggests
that the coders did signicantly better than chance (which would have generated an agreement of 36.41% given the coding events
and coding categories). This level of agreement has been qualied as being close to substantial (Landis & Koch, 1977).
We present the results of the coding in Appendix A, summarized for the full sample and by journal. As a descriptive indicator of
our ndings, the data across the four coding categories for all journals indicated that: 43.83% (675/1540) received a code of 0
(irrelevant criterion), 37.21% (573/1540) received a code of 1 (relevant criterion for which the authors did not correct), 13.57%
(209/540) received a code of 2 (relevant criterion for which we were unable to determine whether the authors undertook the
necessary correction), and 5.39% (83) received a code of 3 (relevant criterion, which the authors addressed). The frequency
distribution of coding categories across the journals (see Appendix) were very similar as indicated by a chi-square test:
2
(18)=
14.88, p=.67. Because the distribution of this chi-square test can be affected by small sample sizes in cells and given that we could
not compute the Fisher exact test (Fisher, 1922) withso many permutations, we repeatedthis analysis only for the journals that had
many observations (i.e., whichregularly publishleadership research: Academy of Management Journal, Journal of Applied Psychology,
Journal of Organizational Behavior, and The Leadership Quarterly). The result remained unchanged:
2
(9)=7.20, p=.62.
Considering only the codings that were applicable (i.e., excluding the 675 codings receiving 0), indicates that 66.24% received a
code of 1, 24.16% a code of 2, and 9.60% a code of 3. Assuming that those codings given a 2 were not actually corrected by the
authors indicates that 90.40% (66.24%+24.16%) of coded validity threats were not adequately handled. We can consider this
90.40% as an upperbound percentage of codings that did not deal with the validity threat appropriately; thus, 66.24% is the
lowerbound (assuming that those codings receiving a 2 were actually corrected for by the authors but that the correction was not
reported). These results are depicted in Fig. 7. Again, the distribution of aggregate coding categories across the journals were very
similar for the full sample, chi-square test:
2
(12)=8.37, p=.76, as well as for the four journals that had sufcient observations:
chi-square test:
2
(6)=1.18, p=.98. Note that all articles, save one, had at least one threat to validity and most (90.91%) had
three or more threats.
We compared the distribution of codings across the seven journals for the 14 coding criteria using the Fisher (1922) exact test
and setting the overall Type I error to be less .05 across the 14 tests (Bonferroni correction). The distributions across the 14 criteria
were the same across the journals, suggesting that practices and standards for these top-tier journals regarding leadership
research were essentially the same.
What are the most frequent and important threats to validity (see bottom part of Panel A in the Appendix)? Criterion 4
(measurement error) and 6a (heteroscedastic errors) applies to more than 94% of all the studies we survey. Measurement error is
not addressed by 70% of all studies to which the problem applies, and only 16.4% of the studies adequately deal with the problem.
Heteroscedasticity is a potential pervasive problem but 92.3% of the studies possibly facing the problem do not report whether or
not they dealt with it; only 6.7% reported using robust inference. Common-method variance (Criterion 5) is a threat to validity that
is also very pervasive in leadership research (it applies to 83.6% of the studies). Yet 77% of the studies affected by the threat do not
deal with it adequately, and only 20.7% of the studies adequately address it.
Fig. 7. Summary of coded validity threats by journal.
1112 J. Antonakis et al. / The Leadership Quarterly 21 (2010) 10861120
Are there threats to validity that are dealt with better than others? Threat 2c (sample is not representative) applies to 58.2% of
all the studies we survey. Of the studies that face this problem, 32.8% address it adequately whereas 48.4% do not; leaving 18.8% for
which we cannot assess whether sample selection has been addressed or not.
7. Discussion
ur review indicated that methodological practices regarding causal modeling in the domain of leadership are unsatisfactory.
Our results essentially point to the same conclusions as do the recent reviews of the literature regarding endogeneity by Hamilton
and Nickerson (2003) in the strategy domain, that of Halaby (2004) in sociology regarding panel models, and that of Bertrand et al.
(2004) regarding the use of cluster-robust standard errors in econometrics. Although we looked at similar issues to those of the
three reviews, the contribution of our review was unique in that we examined multiple validity threats (beyond those three
reviews).
Except for The Leadership Quarterly, the articles we coded were published in general management and organizational
behavior journals. Thus, we could assume that the practices of others disciplines publishing in those journals are very similar to
the practices we identied; our ndings may thus have implications for other areas and also for the meta-analytic reviews which
may have used estimates that were inconsistent. We can only echo what Halaby (2004, p. 508) noted about for research in
sociology using panel data Key principles that ought to routinely inform analysis are at times glossed over or ignored
completely.
Why is current practice not where it should be given that the methodological tools have been available for some time? We
can only speculate as to why practice has been slow to follow the methodological advances that have been made. The most
important reason probably has to do with doctoral training; in psychology at least, it appears that adequate training in eld
research and quantitative methods in general is not provided, even at elite universities (Aiken, West, & Millsap, 2008). We can
assume that the level of training provided in non-experimental causal analysis in management is insufcient as well, particularly
in econometrics training. As Aiken et al. (2008, p. 44) state Psychologists must reinvigorate the teaching of research design to
our next generation of graduate students, to bring new developments burgeoning in other elds into the mainstream of
psychology.
We believe that coupled with the previous problem is the fact that users of statistical programs have been very slow to adopt
software that can do the job correctly when causal analysis in non-experimental settings is concerned; as mentioned by Steiger
(2001) statistical practice is, unfortunately, software driven and there are many simplied books that make it easy to use
software to estimate complicated models (Alberto Holly, one of our econometrics colleagues, refers to this as the push-button
statistics syndrome). We nd it very unfortunate that easy-to-use programs (e.g., like SPSS now called PASW), which have very
limited and at times inexistent routines to handle many of the challenging methodological situations we identied in our review,
are rmly entrenched in psychology and business schools. In our experience, SPSS is sufcient for analyzing basic experimental
data, but as soon as researchers venture out into the non-experimental domain we would urge them to migrate to other software
(e.g., Stata, SAS and R) that will allow them to test models in robust ways and also to widen the research horizons on which they
can explore. Of course, professors who teach methods and statistics classes should also seriously consider using more appropriate
software (and in also providing more extensive training to their students).
We note the same concern regarding structural-equation modeling (SEM) software, where much of the market is using SPSS's
AMOS software; this program makes it very easy to estimate models. However, this program has very limited capabilities as
compared to MPlus (our SEM software of choice), LISREL or EQS, though even these programs have some catching-up to do
concerning the estimation of certain types of models (e.g., selection models).
7.1. Recommendations: the 10 commandments of causal analysis
Our review and the coding criteria we identied can be used as a summary framework around which researchers should plan
and evaluate their work to ensure that estimates are consistent and that inferences are valid. We briey present these criteria next,
grouped in the formof 10 best practices, implicating research design and analysis issues. Concerning these two aspects of research,
simply put, design rules (Shadish &Cook, 1999); only when the design is adequate can appropriate statistical procedures be used to
obtain consistent estimates.
7.1.1. Best practice for causal inference
1. To avoid omitted variable bias include adequate control variables. If adequate control variables cannot be identied or
measured obtain panel data and use exogenous sources of variance (i.e., instruments) to identify consistent effects.
2. With panel (multilevel) data, always model the xed effects using dummy variables or cluster means of level 1 variables. Do
not estimate random-effects models without ensuring that the estimator is consistent with respect to the xed-effects
estimator (using a Hausman test).
3. Ensure that independent variables are exogenous. If they are endogenous (and this for whatever reason) obtain instruments
to estimate effects consistently.
4. If treatment has not been randomly assigned to individuals in groups, if membership to a group is endogenous, or samples are
not representative between-group estimates must be corrected using the appropriate selection model or other procedures
(difference-in-differences, propensity scores).
1113 J. Antonakis et al. / The Leadership Quarterly 21 (2010) 10861120
5. Use overidentication tests (chi-square tests of t) in simultaneous equations models to determine if the model is tenable.
Models that fail overidentication tests have untrustworthy estimates that cannot be interpreted.
6. When independent variables are measured with error, estimate models using errors-in-variables or use instruments (well-
measured, of course, in the context of 2SLS models) to correct estimates for measurement bias.
7. Avoid common-method bias; if it is unavoidable use instruments (in the context of 2SLS models) to obtain consistent
estimates.
8. To ensure consistency of inference, check if residuals are i.i.d. (identically and independently distributed). Use robust variance
estimators as the default (unless residuals can be demonstrated to be i.i.d.). Use cluster-robust variance estimators with panel
data (or group-specic regressors).
9. Correlate disturbances of potentially endogenous regressors in mediation models (and use a Hausman test to determine if
mediators are endogenous or not).
10. Do not use a full-information estimator (i.e., maximum likelihood) unless estimates are not different to that of limited
information (2SLS) estimator (based on the Hausman test). Never use PLS.
Apart from addressing the previous guidelines and the methods we reviewed, researchers should also consider using Monte
Carlo analysis more than they currently do. Monte Carlo analysis is very useful for understanding the working of estimators
(Mooney, 1997); for example, when an estimator may be potentially unstable (e.g., in the case of high multicollinearity) a
researcher could identify the sample size requirement to ensure that the estimator is consistent.
8. Conclusion
Research in applied psychology and related social sciences is at the cusp of a renaissance regarding causal analysis and eld
experimentation; there are many reasons for this push including, in part, for the need for evidence-based practice (Shadish &Cook,
2009). Researchers cannot miss this call; understanding the causal foundations of social phenomena is too important a function for
society. Important social phenomena deserve to be studied using the best possible methods and in sample situations that can
generalize to real-world settings; ideally our goals should be to improve policies and practices.
Although our reviewmakes for telling conclusions we are hopeful and condent that research practice will change in ways that
produces research that will be more useful to society. We conclude by referring to the problem of alignment of theory, analysis,
and measurement: When not correctly aligned Schriesheim, Castro, Zhou, and Yammarino (2001, p. 516) noted that researchers
may wind up erecting theoretical skyscrapers on foundations of empirical jello. This warning is pertinent for a broader class of
problems relating to causal modeling too; implicit or explicit causal claims must be made on concrete foundations.
Acknowledgements
We are grateful to the Yearly-Review Editor, Fran Yammarino and to the reviewers for their helpful comments. We also thank
Marius Brulhart, Suzanne de Treville, Saskia Faulk, Lorenz Goette, Mikko Ketokivi, Cameron McIntosh, Edward Rigdon, and Laura
Stapleton for their helpful suggestions. Any errors or omissions are our responsibility.
Appendix A
Coded studies and results.
Coding criteria
a
Study coded 1a 1b 1c 1d 2a 2b 2c 3 4 5 6a 6b 7a 7b Total Total