Abstract—In judging the quality of a research study it is very important to consider threats to the validity of the study and its results. This is particularly important for empirical research, where there is often a multitude of possible threats. With the growing focus on empirical research methods in software engineering it is important that there is a consensus in the community on this importance, that validity analysis is done by every researcher, and that there is common terminology and support on how to do and report it. Even though there are previous relevant results, they have primarily focused on quantitative research methods and in particular experiments. Here we look at the existing advice and guidelines and then review 43 papers published in the ESEM conference in 2009, analysing the validity analysis they include and which threats, and strategies for overcoming them, were given by the authors. Based on this analysis we then discuss what is working well and less well in validity analysis of empirical software engineering research and present recommendations on how to better support validity analysis in the future.

Keywords—Validity threats, Empirical research, Software engineering, Research methodology

I. INTRODUCTION

Empirical research in software engineering (ESE) is increasingly important to advance the state-of-the-art in a scientific manner [1]. A critical element of any empirical research study is to analyze and mitigate threats to the validity of the results. A number of summaries, models and lists of validity threats (VTs) have been presented to help a researcher in analyzing validity and mitigating threats [1, 2]. Many of these results focus on VT analysis for quantitative research such as experiments [1], while others consider VT analysis for qualitative research [2].

For a researcher in ESE it may not be clear whether the existing checklists are consistent or which one applies to a particular study. In particular for qualitative ESE research this can be a problem, since few studies with results specific to software engineering have been presented. Furthermore, several recent results have pointed to problems in how ESE research is conducted or reported in sub-areas [3, 4, 5]. In this paper we want to analyse in particular how validity threats are analysed and mitigated in ESE research studies. Ultimately this will lead to additional support and guidelines for how to more consistently perform such analysis and mitigation, and thus increase the validity of future ESE results.

In this study we survey how validity threats are analyzed in recent papers published in ESE and discuss the current state-of-the-art and what may have caused it. We then propose ways that can help in improving the analysis of validity threats. In particular we address the research questions:

• RQ1. What advice on and checklists for VTs exist in ESE?
• RQ2. What is the current practice in discussing and handling validity threats in research studies in ESE?
• RQ3. Is the current practice of sufficient quality and, if not, how can it be improved?

To limit the scope, the survey presented here has been constrained to cover only one year of a major ESE publication venue. Because of this, our results are only preliminary.

Section II below describes existing advice on validity threats in software engineering. In Section III we describe the methodology for the survey we have performed on VT analysis and mitigation in ESE. Section IV gives the results of the survey and analyses them. Finally, Section V discusses the VTs of this study itself and future work, while Section VI concludes.

II. VALIDITY IN SOFTWARE ENGINEERING RESEARCH

Validity of research is concerned with the question of how the conclusions might be wrong, i.e. the relationship between conclusions and reality [2]. This is distinct from the larger question of how a piece of research might have low quality, since quality has more aspects than validity alone, for example relevance and replicability. In non-technical terms, validity is concerned with 'How might the results be wrong?', not with the larger question of 'How might this research be bad?'. In many cases they overlap, though.

Validity is a goal, not something that can be proven or assured with the use of specific procedures. A validity threat (VT) is a specific way in which you might be wrong [2]. In a validity analysis (VA) you identify possible threats and discuss and decide how to address them. A specific choice or action used to increase validity by addressing a specific threat we call a mitigation strategy.

For quantitative research in software engineering, such as experiments, specific advice on validity analysis and threats was given by Wohlin et al. [1], structured according to previous results from Cook et al. [6]. Wohlin et al. discuss four main types of validity threats: conclusion, internal, construct and external.

Conclusion validity focuses on how sure we can be that the treatment we used in an experiment really is related to the actual outcome we observed. Typically this concerns whether there is a statistically significant effect on the outcome.

If there is a statistically significant relationship, internal validity focuses on how sure we can be that the treatment actually caused the outcome. There can be other factors that have caused the outcome, factors that we do not have control over or have not measured.

Construct validity focuses on the relation between the theory behind the experiment and the observation(s). Even if we have established that there is a causal relationship between the treatment of our experiment and the observed outcome, the treatment might not correspond to the cause we think we have controlled and altered. Similarly, the observed outcome might not correspond to the effect we think we are measuring.

Finally, external validity is concerned with whether we can generalize the results outside the scope of our study. Even if we have established a statistically significant causal relation between a treatment and an outcome, and they correspond to the cause and effect we set out to investigate, the results are of little use if the cause and effect we have established do not hold in other situations.

For qualitative research methods there are no software engineering specific results and we have to turn to other fields of study. Epistemologically, qualitative research stretches between positivism as one extreme and interpretivism as the other [7]. The two similar notions of subtle realism and anti-realism have also been used by some authors [8].

The fundamental difference between the two resides in the interpretivists' belief that humans differ from the subjects/objects of study in the natural sciences, since social reality has a meaning for humans. That meaning is the basis both for human actions and for the meanings that humans attribute to their own actions as well as the actions of others [7]. Positivists, on the other hand, view humans as any other data source that can be sensed, measured and positively verified [7].

Both positivistic and interpretivistic scientists are interested in assessing whether they are observing, measuring or identifying what they think and say they are [9]. However, since their research questions, methods and views on reality differ, so do the methods to assess the quality of their work. Positivists usually use natural science criteria: internal and external validity, reliability and objectivity [7]. The interpretivists use a set of alternative criteria: credibility, transferability, dependability and confirmability [10]. No matter the criteria, there are threats to the validity of the research. Lincoln and Guba suggest reactivity (the researcher's presence might influence the setting and the subjects' behaviour), respondent bias (where subjects knowingly or unknowingly provide invalid information) and researcher bias (assumptions and personal beliefs of the researcher might affect the study) as main threats. Maxwell focuses on the researcher and lists description (of the collected data), interpretation (of the collected data) and theory (not considering alternative explanations) as the main threats [2].

Table I below summarizes the main validity threats that have been discussed above.

III. REVIEW METHODOLOGY

Below we describe how we selected the papers to be included in the review, the overall process used, and the forms used in the analysis of each paper.

A. Selection of papers

To limit the scope of our initial review we wanted to select only recent papers published in ESE. This is also motivated by the fact that ESE has been maturing; if we included papers from too many years back they might not be representative of current practice. Based on this we selected all full research papers published in the Empirical Software Engineering and Measurement (ESEM) conference in 2009. Even though this is a limited selection, our assumption is that ESEM is the state-of-the-art conference on empirical software engineering research and as such can be expected to have higher expectations, standards and experience of empirical research.

A total of 43 full papers were published in ESEM 2009 and are thus included in our results and analysis below.

B. Process for Review

For each of the papers we read and filled in one form summarizing the type and empirical content of the paper (Paper form). For each validity threat or issue discussed in the paper we then filled in a form with detailed information about the threat and any mitigation strategies mentioned for it (VT form). The forms used are presented in Tables II and III respectively and described in more detail below.

The forms themselves were created in discussions between the researchers, based on the review of the existing literature on validity threats and analysis presented above. These initial forms were then piloted on a random selection of 4 papers which were reviewed independently by both authors. The pilot test did not uncover any discrepancies between the analyses of the two authors; it only prompted clarification of where to record what and the exclusion of a redundant question. Since we used the same papers in this step, this also validated that we had the same interpretation of the forms and used them in the same way. We also discarded the idea of classifying each identified threat, since there is no clear unified model for how this can be done.

The remaining 39 papers were then reviewed by one of the authors and the corresponding forms filled in. During this
Table I: Main types of validity threats discussed in the literature.

Validity threat type: Example of typical questions to be answered
Conclusion validity: Does the treatment/change we introduced have a statistically significant effect on the outcome we measure?
Internal validity: Did the treatment/change we introduced cause the effect on the outcome? Can other factors also have had an effect?
Construct validity: Does the treatment correspond to the actual cause we are interested in? Does the outcome correspond to the effect we are interested in?
External validity, Transferability: Is the cause and effect relationship we have shown valid in other situations? Can we generalize our results? Do the results apply in other contexts?
Credibility: Are we confident that the findings are true? Why?
Dependability: Are the findings consistent? Can they be repeated?
Confirmability: Are the findings shaped by the respondents and not by the researcher?
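The conclusion validity question in Table I asks whether a treatment has a statistically significant effect on a measured outcome. As a minimal, hedged sketch of how such a check can be carried out, consider a permutation test on the difference in group means. The `permutation_test` helper, the data, and the choice of test are our own hypothetical illustration, not taken from any of the surveyed papers:

```python
import random
import statistics

def permutation_test(treatment, control, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in group means.

    Estimates the p-value as the fraction of random label shufflings
    whose absolute mean difference is at least as large as observed.
    """
    rng = random.Random(seed)
    observed = abs(statistics.mean(treatment) - statistics.mean(control))
    pooled = list(treatment) + list(control)
    n_t = len(treatment)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(statistics.mean(pooled[:n_t]) - statistics.mean(pooled[n_t:]))
        if diff >= observed:
            hits += 1
    return hits / n_permutations

# Hypothetical defect counts per module under two review treatments.
treatment = [4, 5, 3, 6, 4, 5]   # e.g. checklist-based review
control = [8, 7, 9, 6, 8, 10]    # e.g. ad-hoc review
print(f"p = {permutation_test(treatment, control):.4f}")
```

A small p-value supports the conclusion that the treatment is related to the outcome; internal, construct and external validity must then still be argued separately, since significance alone establishes none of them.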
REFERENCES

[1] C. Wohlin, M. Höst, P. Runeson, M. Ohlsson, B. Regnell, and A. Wesslén, Experimentation in software