
Validity and reliability assessment of quantitative research
questionnaires in health units:
The case of a questionnaire concerning the evaluation of a nursing
services management information system of a hospital
Ioannis Apostolakis (apost@ced.tuc.gr), Maria Aggeliki Stamouli (mstamouli@yahoo.com)

Abstract: The aim of this paper is to present the essential conceptual approaches to, and the differences between, the reliability and validity of questionnaires used in quantitative research in health units. Moreover, it presents their various determination techniques in detail, together with criteria for choosing among them, and analyzes examples that give a more complete picture of their implementation.
Key Words: Reliability, Validity, Factor Analysis, Eigenvalues, Construct Validity, Cronbach alpha
coefficient, KMO criterion, KR21.
1. Introduction
In the field of quantitative research carried out in health units, such as research on the integration of new technologies, the efficiency and/or effectiveness of the administrative, nursing and medical personnel, or the degree of patient satisfaction, data are most often collected through questionnaires. These questionnaires, apart from their pilot application, should also be checked for reliability and validity.
This paper initially attempts to approach the reliability and validity of questionnaires conceptually and to clarify their differences. Their various calculation techniques are then presented in detail, together with the criteria for choosing each of them. Finally, a more complete picture of applying the reliability and validity calculation techniques in practice is given by presenting the case of a questionnaire developed for the evaluation of the nursing services management information system of a hospital. The most important criterion for the evaluation of this information system is its compliance with the requirements and conditions that should be fulfilled by all information systems developed for this particular area.
2. Conceptual Approaches
2.1. Reliability
The term "questionnaire reliability" refers to the precision, consistency and stability of a questionnaire's results: that is, the degree to which results arising from its repeated applications are equivalent (Carmines and Zeller [8]). Reliability is a very important characteristic, because when a questionnaire is characterized as reliable, we can trust its results when reaching conclusions.
The techniques used for calculating questionnaire reliability are based on the principle that specific results can be related to each other in order to determine the reliability of an instrument (e.g. a questionnaire). In this way, the reliability coefficients become correlation coefficients between two sets of scores. The usual techniques for calculating the reliability coefficients are the following:
1. The test-retest technique
2. The technique of alternate or parallel forms
3. The split-half technique and
4. The internal consistency technique.
An analytic description of each technique follows (Burns [7]; Carmines and Zeller [8]).

The test-retest technique
According to this technique, a coefficient of stability is measured, expressed as the correlation between the results of two administrations of the same questionnaire to the same sample. No specific time interval between the two measurements is prescribed, although a minimum of one day and a maximum of one year are generally observed. In general, the optimal and acceptable time interval is considered to be two to three months.
If the time interval is very short, the individuals who participated in the research may remember the answers they gave the first time and thus artificially inflate the consistency of the results. At the other extreme, if the time interval is very long, various other factors (such as age, experience etc.) might influence the results of the second administration of the questionnaire and thus cause under-estimation of the instrument's reliability. In general, a rule of thumb is very difficult to apply regarding the most suitable time interval between the two administrations of a test. However, if the characteristic to be measured is relatively constant, does not alter over time, and is not subject to specific tests/stimuli that can alter it, then the time period between the two administrations of the questionnaire can be relatively long. Yet, if the particular characteristic is changeable under the influence of various factors, it is advisable for the interval to be quite short, but not so short that the answers of the participants are influenced by their memory, which would artificially alter the correlation between the two administrations of the test.
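For illustration, a minimal Python sketch of this calculation (not part of the original paper; the score vectors below are hypothetical): the stability coefficient is simply the Pearson correlation between the two administrations.

```python
# Minimal sketch of the test-retest technique: the stability coefficient is
# the Pearson correlation between two administrations of the same
# questionnaire to the same sample. All scores are hypothetical.
import numpy as np
from scipy.stats import pearsonr

scores_first  = np.array([12, 15, 11, 18, 14, 16, 13, 17])  # first administration
scores_second = np.array([13, 14, 12, 17, 15, 16, 12, 18])  # retest, e.g. 2-3 months later

stability, _ = pearsonr(scores_first, scores_second)
print(f"Test-retest (stability) coefficient: {stability:.3f}")
```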
The alternate or parallel forms technique
Since, with the previous technique, both the general ability and the specific knowledge of the individuals in the research sample can change during the interval between the test and the retest, another technique can be applied as proof of the reliability of a questionnaire: the alternate or parallel forms technique. According to this technique, the instrument with which a particular characteristic is measured has two equivalent forms, constructed so as to be fully comparable in content, extent and degree of difficulty. These two test forms are then distributed to the individuals of the sample at the same time. The results deriving from each form of the test are correlated, and a coefficient of equivalence arises, which measures the consistency of the individuals' performance between the two alternative forms of the same test. With this technique, temporal variations in the performance of the research participants are not captured.
This technique has two important advantages over the test-retest one. The first is that one does not need to worry about the memory of the participants, because the two forms, although they measure the same characteristic, consist of different items: basically, we have two questionnaires with different questions, designed to measure the same characteristic. The second advantage is that a more precise estimate of the test's reliability can be achieved, because it arises from a greater breadth of questions. These two advantages, however, come at a cost: double the effort and time must be spent to create the two forms of the questionnaire, as it is obviously much more difficult to create two equivalent forms measuring the same characteristic than to create one. Besides, the equivalence of the two forms is difficult to confirm. If the two forms are not equivalent, the estimate of the test reliability is likely to be low, since non-equivalence tends to decrease the correlation between the two forms of the test (Burns [7]).

Moreover, the estimation of reliability is also influenced both by temporary environmental factors and by the state of the individuals in the sample, such as tiredness and boredom. During successive administrations of the test, the presence of these factors is not easy to trace or quantify, and thus the test's reliability estimate can be decreased.
The split-half technique
This technique is based on the idea that many of the temporal factors influencing the two reliability calculation techniques previously mentioned can be minimized if a single reliability coefficient is determined from the results of one and only one administration of the same instrument. Two sets of results are easy to produce by dividing the test into two parts and then correlating the results arising from each part separately.
The major problems of this division are:
a) each half may end up with different types of questions, with different degrees of difficulty,
b) some of the individuals in the sample may not complete the second half because of lack of time, and
c) some of the individuals in the sample may not complete the second half of the test correctly because of lack of interest or boredom.
A commonly accepted way of dividing the test into two equal parts is for the first part to consist of the odd-numbered questions of the test/questionnaire and the second of the even-numbered ones. Of course, this kind of split is meaningful only if successive questions are equivalent; only then does it provide useful information to the person who created the test. However, there is an additional problem with this technique. When a test is divided into two parts and their results are correlated, the final outcome is the correlation between two tests, each of which consists of only half the initial questions. The solution to this problem is given by the Spearman-Brown formula (Brown [6]; Burns [7]):
$$ r_{tt'} = \frac{n \, r_{tt}}{1 + (n-1) \, r_{tt}} \qquad (1) $$
where n is the ratio of the desired (total) length of the test to the length of the present test (by 'length' we mean the total number of questions/items of the test), r_tt is the correlation coefficient between the two halves of the test, and r_tt' is the reliability estimate for the full-length test. Whenever this technique is applied to calculate a test's reliability, it is very important to remember to apply the Spearman-Brown formula to the correlation coefficient obtained from the two halves, in order to arrive at a reliability estimate suitable for the total test. For example, if the correlation between the two eight-question halves of a sixteen-question test is 0,60, then by use of the above formula we have:
$$ r_{tt'} = \frac{(16/8) \cdot 0{,}60}{1 + (16/8 - 1) \cdot 0{,}60} = \frac{1{,}2}{1{,}6} = 0{,}75 $$
Thus, after applying the Spearman-Brown formula, the reliability of the total test with the split-half technique is 0,75. In general, as the length of a test increases, its reliability tends to increase too.
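As a hedged illustration (not from the original study; the Likert-type responses below are simulated), the odd/even split and the Spearman-Brown correction of formula (1) can be expressed in a few lines of Python:

```python
# Minimal sketch of the split-half technique with the Spearman-Brown
# correction of formula (1). The Likert-type item responses are simulated.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(42)
trait = rng.integers(1, 5, size=(30, 1))                           # 30 respondents
items = np.clip(trait + rng.integers(-1, 2, size=(30, 8)), 1, 4)   # 8 correlated items

odd_half  = items[:, 0::2].sum(axis=1)  # questions 1, 3, 5, 7
even_half = items[:, 1::2].sum(axis=1)  # questions 2, 4, 6, 8

r_half, _ = pearsonr(odd_half, even_half)
n = 2  # the full test is twice as long as each half
r_full = n * r_half / (1 + (n - 1) * r_half)  # formula (1)
print(f"Half-test correlation: {r_half:.3f}, corrected reliability: {r_full:.3f}")
```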
The internal consistency technique
A very common technique for calculating questionnaire reliability is the internal consistency technique, developed by Kuder and Richardson. The most important difference between this technique and those already mentioned is that it regards each question/variable as a basic unit of measurement, in contrast to the other techniques, which divide the questionnaire into parts, so that a different correlation coefficient may result depending on the split. The result of the internal consistency technique is a reliability estimate equivalent to the average of the correlation coefficients resulting from all possible split-half divisions of the questionnaire.
The simplest form of the Kuder-Richardson formula is KR21 (Elvin [11]; The encyclopedia [22]):

$$ KR21 = \frac{n}{n-1} \left( 1 - \frac{m \, (n - m)}{n \sigma^2} \right) \qquad (2) $$

where n is the number of questions in the test, m is the mean score on the test and σ is the standard deviation of the test scores.
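As a sketch under the assumption of a 20-item test with the hypothetical totals below, formula (2) translates directly into Python:

```python
# Minimal sketch of formula (2) (KR21), given the vector of total test
# scores and the number of items n. The scores are hypothetical.
import numpy as np

def kr21(total_scores, n):
    m = total_scores.mean()        # mean score on the test
    sigma2 = total_scores.var()    # variance of the test scores
    return (n / (n - 1)) * (1 - m * (n - m) / (n * sigma2))

scores = np.array([14, 11, 16, 12, 15, 13, 17, 10])  # totals out of n = 20 items
print(f"KR21 = {kr21(scores, n=20):.3f}")
```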
Another form of the internal consistency technique for estimating the reliability of a questionnaire is the Cronbach alpha coefficient. This is the form used most often, and it is incorporated into various statistical programs such as SPSS. This technique measures the internal consistency of a questionnaire's items, that is to say, whether all the items tend to measure the same thing. The more the items of a questionnaire inter-correlate, the larger the value of alpha; in other words, the more all the items measure the same characteristic. The reliability estimate from the Cronbach alpha coefficient is equivalent to that of Kuder-Richardson (KR21), but the latter is easier to calculate.
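A minimal sketch of the Cronbach alpha coefficient, computed from its standard definition rather than through SPSS (the item matrix below is simulated):

```python
# Minimal sketch of the Cronbach alpha coefficient from its standard
# definition: alpha = (k/(k-1)) * (1 - sum of item variances / variance of totals).
import numpy as np

def cronbach_alpha(items):
    items = np.asarray(items, dtype=float)   # rows = respondents, columns = items
    k = items.shape[1]
    sum_item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - sum_item_vars / total_var)

rng = np.random.default_rng(0)
trait = rng.integers(1, 5, size=(86, 1))                           # hypothetical 86 respondents
items = np.clip(trait + rng.integers(-1, 2, size=(86, 8)), 1, 4)   # 8 Likert items
print(f"Cronbach alpha = {cronbach_alpha(items):.4f}")
```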
2.1.1. Factors that influence the reliability of a questionnaire: a review
All the reliability estimation techniques presented often lead to different results. This happens because different factors and situations influence the reliability of a questionnaire, and for this reason no single technique produces a unique reliability coefficient. In general, these factors can be classified into the following categories:
i) Factors connected to the nature of the test (e.g. its length: the longer the test, the higher the reliability) and to the questions that compose it (e.g. questions that can only be evaluated subjectively usually provide lower reliability),
ii) Factors connected to the nature of the individuals taking the test (e.g. an intelligence test administered to the students of a high school has a higher reliability coefficient if the calculation concerns the students of all classes rather than those of just one class), and
iii) Factors that are related to the administration of the test. When a researcher tries to interpret the reliability coefficient resulting from a test/questionnaire, it is important that this interpretation be carried out in relation to the reliability estimation technique that was used. Techniques that take into account both the stability and the equivalence of a test tend to produce smaller reliability coefficients. This happens because they include more types of error variance¹. Therefore, the Kuder-Richardson and Cronbach alpha coefficient techniques produce smaller reliability coefficients than those produced by the split-half technique, as they do not simply capture the consistency between the two parts of the test, but also reflect the homogeneity of the correlations between the test's items. For this reason, the Kuder-Richardson and Cronbach alpha coefficient techniques should never be applied to a heterogeneous set of questions, but only to questions that measure the same characteristic. More analytically, if a questionnaire consists of different units concerning different dimensions of the same characteristic, these techniques should be applied to each unit separately and not to the whole test. Moreover, for any technique that involves two administrations of the test, the longer the time interval between the two trials, the smaller the reliability.

¹ Taking into account that (Burns [7])

$$ \sigma^2_{obs} = \sigma^2_{true} + \sigma^2_{error}, $$

the statistical definition of reliability arises:

$$ r_{tt} = 1 - \frac{\sigma^2_{error}}{\sigma^2_{obs}} \qquad (3) $$

where $r_{tt}$ is the reliability, $\sigma^2_{obs}$ the observed variance, $\sigma^2_{true}$ the true variance and $\sigma^2_{error}$ the error variance. It follows that the larger the error variance, the smaller the reliability; if the observed variance equals the true variance, reliability is perfect.
2.2. Validity
We conduct a validity check of a questionnaire/instrument when we wish to verify that the test measures the attribute for which it was designed (Carmines and Zeller [8]). It is possible to have created a fully reliable test which nevertheless measures a different attribute from the one we intended. This is exactly the point that defines the difference between validity and reliability. In other words, a test may be reliable, producing results of absolute consistency, yet not valid at all, if its measurements concern something different from what it was designed to measure. It goes without saying that if a test is not reliable, it cannot be valid either.
An example of a reliable but not valid instrument is the following: in a factory that manufactures sound level meters there was an error in the production process, and a particular lot always shows 10 sound measurement units (dB(A), decibels in a working place) less than the level that really exists in each area. These instruments are therefore reliable, since they take consistent measurements (they measure sound volume and always show 10 units less), but they are not valid, as they do not give the correct measurements of the noise level in the specific areas. The following types of validity exist:
i) Content validity
ii) Predictive validity
iii) Concurrent validity
iv) Face validity, and
v) Construct validity
Each of these types is presented in detail below (Burns [7]; Carmines and Zeller [8]; Koulakoglou [16]).
Content validity
This type of validity is mainly connected with ability tests. A test of this kind has content validity if it clearly reflects the aims and objectives of a given educational process and the emphasis placed on the content of the corresponding educational field. For example, each professor follows a teaching plan based on specific targets, and thus each ability measurement test should be based on that plan. More specifically, an ability test in information technology will not have content validity if it measures the students' abilities in anatomy. When students complain that the examination topics were not included in the official examination material, they are talking about the content validity of the examination.
However, it should be pointed out that an ability test may have satisfactory content validity at a specific moment in time, for a specific class and teacher, but may not be valid at another moment in time, for a different class and a different teacher. For this reason, the standardized tests that appear on the market do not have satisfactory content validity in all cases, and their results should be interpreted with extra caution. This does not mean that standardized tests are useless or that they should be used only once at some specific time, because they actually produce a very useful set of information about the group of students under review. However, it should be stressed that content validity is a characteristic that does not remain unchanged over time; for this reason, it should be re-examined each time the same test is used with a different group of individuals, or whenever the conditions under which the test is carried out are modified.
Predictive validity
Predictive validity concerns the desire to make predictions, mainly about performance, through some technique or specific kind of evaluation. An example is the relationship between the academic performance of students in the last senior high school grade and the university school to which the student is finally admitted. This type of validity is particularly important in research techniques for profession selection. Schoolteachers in elementary schools could use the results of intelligence tests and reading ability tests as guidance for separating students into classes of 'equal abilities'. Of course, predictive validity can be checked and proved only through a comparison between the initial results of the test and the subsequent performance (perhaps some years later) of the same individuals. This subsequent performance, which is to be predicted, is called the performance criterion; it is usually general, and a precise score is rarely required.
Sometimes it is possible to express predictive validity as a correlation coefficient between the predicted situation and the performance criterion. This coefficient is also called the validity coefficient. For example, a doctor could predict the outcome of a treatment (success or failure, improvement or not) for a group of patients by using information collected for each patient at the beginning of the treatment. Later on, the real condition of the patients is observed and recorded at the end of the treatment; in other words, it is checked whether there has been improvement or not. The correlation coefficient, calculated as the degree of correspondence between the prediction and the real condition, may be regarded as a measure of predictive validity.
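A small hedged sketch of such a validity coefficient (the treatment predictions and outcomes below are invented for illustration):

```python
# Minimal sketch of a predictive validity coefficient: the correlation
# between a prediction made at the start of treatment and the performance
# criterion observed at its end. All values are hypothetical.
from scipy.stats import pearsonr

predicted_improvement = [0.6, 0.8, 0.3, 0.9, 0.5, 0.7]  # initial predictions
observed_improvement  = [0.5, 0.9, 0.2, 0.8, 0.6, 0.7]  # recorded at end of treatment

validity_coefficient, _ = pearsonr(predicted_improvement, observed_improvement)
print(f"Predictive validity coefficient: {validity_coefficient:.3f}")
```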
However, as it is difficult to select the best criterion of performance for the measurement of predictive validity, some suggestions should be taken into consideration (Burns [7]):
i) One should make sure that performance on the criterion measurement results from the same characteristics of each individual, and the same external environmental conditions, that influence performance on the test whose validity is being measured.
ii) The criterion measurement should be reliable; in other words, it should remain constant over time. Obviously, it is very difficult to predict something that is constantly changing.
iii) When choosing the criterion measurement, factors such as the time and cost needed for the specific measurement should be taken into consideration.
Concurrent validity
This type of validity bears a great resemblance to the predictive validity analyzed above; their main difference concerns the dimension of time. More specifically, concurrent and predictive validity check the results of a measurement instrument at the present time and in the future, respectively. For example, when a test for the measurement of depression (a depression scale) is developed, in order to check its predictive validity a question like the following would be asked: 'Will the individual who achieved the highest score on this test develop depression some time in the future?', while concurrent validity would require an answer to the question: 'Is the individual who achieved the highest score on the test suffering from depression now?'. The concurrent validity coefficient, expressed as the correlation coefficient between the performance criterion and the present situation (performance on the test), can be measured at lower cost and in less time than the predictive validity coefficient, which considers future performance. However, high concurrent validity does not necessarily imply high predictive validity.
The concurrent validity coefficient may also be used to evaluate ability estimation tests. For example, a student's performance on an arithmetic computation test should be related to his performance in other subject areas that require high computational ability (these are the performance criteria). When the relationship between performance on the test and the performance criteria is weak, then either these criteria are not acceptable, or the criteria measure a different attribute than the one measured by the test. A third possibility is that the test has low concurrent validity, but this should be concluded only after the previous two possibilities have been carefully examined and rejected.
Face validity
In certain cases, someone may want to ask the following questions: 'Is it obvious from inspection of the test items that the test measures what we wish to measure?' or 'Does the test measure what its title implies?'. Researchers often need high face validity for tests or techniques related to research programs, education, military programs and so on.
This particular type of validity is extremely difficult, if not impossible, to measure. High face validity can be the motive for someone to participate or not in a specific test: if the test participants think that the parts constituting the test are irrelevant to its target, they will be motivated not to fill in the questionnaire. However, face validity may also serve other functions. If, for example, a researcher wishes to use a test that measures psychological disorders, this test should be characterized by low face validity, so that the test participants cannot easily understand what attribute is being measured. At the same time, such a test should be characterized by high predictive validity, high concurrent validity and high construct validity (the last one is analyzed below).
Face validity is also very important in the case of attitude measurement scales, where there may be a great difference between what the scale seems to measure (face validity) and what it really measures. In certain cases it is possible to construct a test with false face validity, that is, a test constructed so as to hide its real aim while increasing its apparent face validity through the use of intermediary questions. In this way, the individuals under review cannot understand what the test really measures.
Construct validity
When we have constructed a test with which we measure a specific characteristic such as 'effectiveness', 'satisfaction', 'intelligence' etc., what we are especially interested in is the degree to which the questionnaire items/variables measure the same characteristic (Flynn et al. [12]). As constructs (characteristics) of this type are quite complex and not easily observed, when one wants to check the construct validity of an instrument, one should gather a set of indirect proofs whose existence witnesses the existence of the specific constructs. Some of these proofs are (Aiken [1]): analysis of the internal coherence of the test, correlations between the test and other tests of the same kind with proven construct validity, a systematic examination of the test participants' answers to the specific test, experts' views on the test contents, and so on.
Another very important method for checking the construct validity of a test is factor analysis. Factor analysis is a complex statistical method by which the number of variables constituting the test in question (the questionnaire) is reduced significantly and factors are created. Each factor is a concept comprising those variables that correlate with one another. In this way, the degree to which each question contributes to the measurement of the attribute we want to evaluate is calculated (Alexopoulos [2]; Mellon [19]). If the characteristic under measurement is not complex and only one general factor is produced by the factor analysis, this can be considered proof that the test answers reflect the characteristic under measurement, and the test is thus characterized by construct validity. If the characteristic under measurement is complex, more than one factor will result from the factor analysis. In that case, the test will cover the specific characteristic we are measuring, and will be characterized by construct validity, if a small number of factors comes up (Koulakoglou [16]).
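As an illustrative sketch (not the authors' procedure, which relies on SPSS), a one-factor solution can be fitted in Python with the factor_analyzer package, here on data simulated so that a single latent trait drives all items:

```python
# Minimal sketch of a construct-validity check via factor analysis, using the
# factor_analyzer package (pip install factor_analyzer). The data are simulated
# so that one latent trait drives all eight items.
import numpy as np
from factor_analyzer import FactorAnalyzer

rng = np.random.default_rng(1)
latent = rng.normal(size=(200, 1))                # one underlying characteristic
items = latent + 0.8 * rng.normal(size=(200, 8))  # 8 items loading on it

fa = FactorAnalyzer(n_factors=1, rotation=None)
fa.fit(items)
print("Loadings on the single factor:")
print(fa.loadings_.round(3))  # uniformly high loadings suggest one-dimensionality
```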
2.2.1. Criticism
From the information provided above, one can observe that validity, in contrast to reliability, is more theoretically oriented, as it is established solely by the degree to which a particular measure captures the characteristic it is supposed to measure. In fact, we do not prove the validity of a test, but of the data that derive from a particular process. This distinction is made because it is possible for a measurement instrument/questionnaire to be valid for the measurement of a specific phenomenon, but at the same time totally inappropriate for the measurement of other phenomena. This is why the validity of an instrument is never checked independently of the aim for which it was designed.
The various validity techniques mentioned earlier follow different approaches to evaluating the degree to which each instrument measures what it claims to measure.
It is widely believed that content validity is not so much a specific kind of validity as the achievement of a goal, namely obtaining valid measurements of any kind. These measurements should cover the contents of a theoretical concept. Content validity does not provide a method or procedure for determining the degree to which this goal has been achieved.
Predictive validity and concurrent validity (both also called criterion validity) also have a limited scope of application. This does not mean that there are no particular conditions under which a measurement's validity can be calculated by comparing the performance of this measurement to performance on a specific criterion variable. In the Social Sciences, however, which in contrast to the Applied Sciences are usually concerned with abstract theoretical concepts, it is extremely difficult to find well-established criterion variables with which to make the necessary comparisons.
Finally, construct validity, in contrast to all the rest, is generally applied in both the social and the applied sciences. This kind of validity of empirical measurements can be proved if the measurements are in accordance with a specific theoretical framework. If they agree with this framework, they are characterized by construct validity; if they do not, it can only be concluded that they lack construct validity with respect to that specific theoretical framework.
3. Case Presentation
3.1. Validity and Reliability Techniques Selection Procedure Model

To recapitulate what was mentioned above concerning the validity and reliability check of a test, the following can be said:
A test may be reliable but not valid; if it is not reliable, it cannot be valid. For this reason, the reliability check is carried out first and a validity check follows.
The best-known questionnaire reliability calculation methods are the test-retest technique, the alternate or parallel forms technique, the split-half technique and the internal consistency technique. If the same test is administered twice to the same group of individuals and the results of the two administrations are correlated, the test-retest technique is being applied to check the reliability of the test. If the same test cannot be re-used (because the participants remember the answers they gave the first time), then an equivalent form of the test is used and the alternate or parallel forms technique is applied; the results derived from the equivalent forms of the test are then correlated. In those cases where the test can be administered only once, either the split-half or the Cronbach internal consistency technique is applied. In the first of these, the test is divided into two equivalent parts and distributed to the same group of people; next, the results of the two parts are correlated. To find the reliability of the whole test, it is essential to apply the Spearman-Brown formula.
Of course, the best-known method of all is Cronbach's internal consistency alpha. It is equivalent to the split-half technique, but it should be applied only to a homogeneous group of questions. Connecting all the information mentioned above, we arrive at the following table (table 1).
(Suggested position of table 1)
After the reliability of an instrument/questionnaire is checked and proved, the check of its validity follows. The following forms of validity exist: content validity, predictive validity, concurrent validity, face validity and construct validity.
Content validity is connected mainly to ability measurement tests and is checked when we wish to see whether the test applied to a specific group of individuals also portrays the content of the educational field being examined. Predictive validity and concurrent validity (also called criterion validities) check the results of a measurement instrument in the future and in the present, respectively. The first is calculated as a degree of agreement (correlation coefficient) between the prediction and the real situation, while the second is calculated as the correlation coefficient between the performance criterion and the present situation. A check of a test's face validity is carried out when the test's constructor wants to see whether it is obvious from the questions of the test that it measures the characteristic for which it was made. This particular validity is difficult to measure, and a high level of it is not always desirable.
A very important form of validity, however, is construct validity. It is mainly applied to tests/questionnaires that measure a specific characteristic such as 'effectiveness', 'satisfaction' and so on, where what essentially matters is the degree to which the questionnaire variables measure the same characteristic. To find this out, factor analysis is mainly used. Alternatively, a whole group of indirect proofs can be used (such as analysis of the test's internal cohesion, correlations of the test with other tests of the same type and of proven construct validity, opinions of experts on the test content, etc.); their existence also constitutes positive testimony for the construct validity of a test.
3.2. Application of reliability and validity techniques in the questionnaire under study
The case presented here concerns a questionnaire drawn up for the evaluation of the information system developed for the management and administration of the nursing service of a hospital. The evaluation is based on the conditions that all information systems developed for this particular area should fulfil (Stamouli [21]).
The questionnaire was organised so as to fulfil the design standards of a research questionnaire (McNeill [18]), and a pilot study preceded its distribution to the individuals of the sample (Alreck and Settle [3]), at a local level among graduate and postgraduate students, with the immediate aim of checking it. As a result of this pilot application, small changes, mainly on wording or formulation issues, were made to the test. Its structure is the following: the first questions concern the personal details of the participants (e.g. age, gender, marital status etc.); the second group of questions inquires about their work situation (e.g. student or hospital employee, years of experience etc.); and finally their educational level is asked (e.g. graduate or postgraduate student, graduate of a Technical School or of a University etc.). The next unit of questions concerns the participants' general attitude towards computers, such as possession of and familiarity with PCs, the wish to acquire or improve computer knowledge and so on. The following unit includes questions related to the opinions of users on the information system (I.S.) that was developed: opinions about its ease of learning and use, its homogeneity in appearance, the preservation of the system's data precision and so on. The questionnaire is completed with more general questions concerning information systems developed in the field of nursing service administration (for example, whether information systems can improve administrative effectiveness etc.).
3.2.1. Questionnaire's Reliability
To check the reliability of the particular questionnaire, the internal consistency technique (Cronbach alpha coefficient) was applied separately to the last two parts of the questionnaire, that is to say, to part a (which includes the questions related to users' opinions on the particular information system developed) and to part b (which includes more general questions about all the information systems developed in the field of nursing services administration). More analytically, the variables of the questionnaire whose reliability was measured appear in table 2.
(Suggested position of table 2)
Before applying this technique, it is important to check whether all the above-mentioned variables/questions are on the same scale and in the same conceptual direction. In our case this holds, since all the questions are on a Likert scale and are coded in the same direction (from 1 = a lot to 4 = not at all). Applying the internal consistency technique to Part a, through the statistical data processing program SPSS for Windows (Green et al. [13]; Howitt [14]), yields tables 3 and 4:
(Suggested position for table 3)
(Suggested position for table 4)
As in any statistical analysis, it is important that the descriptive statistics are presented first (table 3), so that we can check that our data do not present major anomalies: that the mean of each variable lies between the possible values (in our case between 1 and 4), that no extraordinarily high values of standard deviation are observed (which might indicate a typing error), and that the correlations among the variables are positive (table 4). If not, one of the variables involved in the negative correlations may have been coded on a reversed scale (Green et al. [13]). From the tables above, it follows that our data have been registered and coded correctly. The results of the Cronbach alpha coefficient application appear in table 5.
(Suggested position for table 5)

The second-to-last column shows the consistency of each question with the total test, while the last column shows the internal consistency of the test (Alexopoulos [2]), or in other words the reliability of the questionnaire if the particular item is deleted. The alpha value for part a of the questionnaire is 0,7619 and appears underneath table 5. This value, being above the critical value of 0,7, is acceptable. We also observe that there is no variable whose removal would increase the internal consistency of the first part of the test.
We could therefore say that the part of the test that examines the users' opinions on the information system presents an alpha of 0,7619 and is acceptable as far as internal consistency-reliability is concerned.
In the same way we examine the internal consistency of the second part of the test, where, from the table of descriptive statistics and the table of correlations among the variables, it is concluded that the data have been entered and coded correctly. The results of the Cronbach alpha coefficient appear in table 6.
(Suggested position for table 6)
From this table we observe that the value of alpha for the second part of the test is 0,5678, which is not acceptable, since it is much smaller than the critical value. In the column presenting the Corrected Item-Total Correlation of the second part of the test, we observe that the third variable ('Should Information Systems produce automated shift scheduling for the nursing personnel?') has very small consistency with the total (0,1983); at the same time, the last column shows that if this particular question is deleted, the reliability of the second part of the test increases to 0,7960, which is an acceptable value. It is therefore suggested that this variable be omitted from the study.
Of course, if there is particular interest in the users' opinions resulting from this question, the question could remain in the test, but in a separate part.
The reliability of the first part of the test can also be checked with the split-half technique and the application of the Spearman-Brown formula.
Indeed, by dividing this part of the test into two halves, of which the first includes the odd-numbered questions and the second the even-numbered ones (two halves of 4 questions each, with 90 individuals asked), the correlation between them turns out to be 0,705. Through the use of the formula described above, we have:

$$ r_{tt'} = \frac{(8/4) \cdot 0{,}705}{1 + (8/4 - 1) \cdot 0{,}705} = \frac{1{,}41}{1{,}705} = 0{,}827 $$

We observe, therefore, that the test is characterized by internal consistency-reliability by this method as well, which indeed yields a larger consistency coefficient, as was expected for the reasons mentioned earlier.
3.2.2. Questionnaire's construct validity
In order to check whether the particular questionnaire indeed evaluates the information system developed, that is to say, to check its construct validity, the method of factor analysis will be applied. The questions related to the evaluation of the information system are the 8 questions of the first part of the questionnaire, as named above in the analysis of its reliability; factor analysis will therefore be applied to these 8 questions. Before applying the factor analysis method, it is important to examine the sampling adequacy with the Kaiser-Meyer-Olkin (KMO) measure and to check the feasibility of factor extraction through Bartlett's test of sphericity. It was found that KMO = 0,718 > 0,5, which means that our sample is characterised by sampling adequacy (Apostolakis et al. [5]; Alexopoulos [2]), and that Bartlett's test of sphericity = 163,073 with p = 0,0005, which is statistically significant, meaning that the data are suitable for conducting factor analysis.
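For readers working outside SPSS, a hedged Python sketch of these two preliminary checks using the factor_analyzer package (the item matrix is simulated, so the statistics will not reproduce the KMO = 0,718 and chi-square = 163,073 reported above):

```python
# Minimal sketch of the KMO measure and Bartlett's test of sphericity with
# the factor_analyzer package. The item responses are simulated.
import numpy as np
from factor_analyzer.factor_analyzer import calculate_kmo, calculate_bartlett_sphericity

rng = np.random.default_rng(2)
latent = rng.normal(size=(90, 1))
items = latent + rng.normal(size=(90, 8))   # 90 respondents, 8 items

chi_square, p_value = calculate_bartlett_sphericity(items)
_, kmo_total = calculate_kmo(items)
print(f"KMO = {kmo_total:.3f} (sampling adequacy if > 0,5)")
print(f"Bartlett chi-square = {chi_square:.3f}, p = {p_value:.4f}")
```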
For factor extraction and determination of the correlations between the items of the questionnaire, the Principal Component Analysis method was selected. For the selection of the number of factors to be extracted, the eigenvalues criterion was used (table 7).
(Suggested position for table 7)
Looking at the table above, and according to the eigenvalues criterion, 3 factors with eigenvalues larger than one (1) were identified, together explaining 66,814% of the total variance of the test. When, however, the eigenvalues plot (scree plot) was used (Cattell [9]), the results were different (Figure 1).
(Suggested position for figure 1)
Indeed, from the figure it appears that only one factor can be extracted, since only one factor lies in the sharply descending part of the plot before the eigenvalues start to level off (it is therefore not necessary to rotate the factors for a better interpretation of the results) (Alexopoulos [2]). This factor explains 37,897% of the total variance of the test. Moreover, beyond the point where the curve begins to level off, the additional factors (from the second one onwards) each explain a considerably smaller amount of variance. Of course, an attempt was made to extract more than one factor, but the resulting solutions did not appear meaningful.
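A brief sketch of the eigenvalues criterion and the scree plot themselves (simulated items; the eigenvalues are those of the items' correlation matrix, as in principal component analysis):

```python
# Minimal sketch of the eigenvalues criterion and the scree plot: the
# eigenvalues of the items' correlation matrix, sorted in descending order.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
latent = rng.normal(size=(90, 1))
items = latent + rng.normal(size=(90, 8))

corr = np.corrcoef(items, rowvar=False)              # 8 x 8 correlation matrix
eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]
print("Eigenvalues:", eigenvalues.round(3))
print("Factors with eigenvalue > 1:", int((eigenvalues > 1).sum()))

plt.plot(range(1, 9), eigenvalues, marker="o")
plt.xlabel("Component Number")
plt.ylabel("Eigenvalue")
plt.title("Scree Plot")
plt.show()
```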
From the table of factor loadings (table 8), we observe that the correlations (loadings) between the variables and the factor produced are distinct and very important, ranging between 0,501 and 0,715.
(Suggested position for table 8)
From the factor analysis, therefore, it can be concluded that the test is characterized by homogeneity (it is one-dimensional). That is to say, the questions evaluate only one main factor, which means that they indeed measure the attribute for which they were created.
3.3. Results
To examine the reliability and validity of the questionnaire under review, the Cronbach internal consistency technique (Cronbach alpha coefficient), the split-half method and factor analysis were applied, in this order.
The reliability check using the internal consistency method was applied to the two parts of the questionnaire separately. In the first part, which investigates the users' opinion of the information system under review in relation to the criteria that all systems developed for the nursing service should satisfy, the Cronbach alpha coefficient was found to be 0,7619, a value that constitutes evidence of the test's internal consistency. At the same time, the removal of any of the variables of this part does not increase the value of the alpha coefficient. In the second part, which investigates the general opinions of the users concerning information systems developed for the specific area, the consistency coefficient was found to be 0,5678, a value that does not constitute evidence of internal consistency. The removal, however, of the variable 'Should Information Systems produce automated shift scheduling for the nursing personnel?', which has very small consistency with the total, increases the value of the coefficient to 0,7960, which is acceptable. Its removal from the research is therefore suggested, unless the researchers are particularly interested in the information resulting from the answers to this question; if they are, it is recommended that the specific question be included in another, separate part of the questionnaire with relevant questions.
The split-half method was applied only to the first part of the questionnaire and showed similar results: with the Spearman-Brown formula, the resulting coefficient was 0,827, a value that constitutes evidence of the test's internal consistency.
To examine the validity of the questionnaire, we focused on the construct validity of the first part only. This examination was accomplished with factor analysis, after sampling adequacy and data suitability were first checked and confirmed. After the application of principal component analysis and the eigenvalues plot criterion, it was found that only one factor could be extracted, onto which all variables clearly load (>0,5); thus, none of them is removed from the questionnaire. This finding, which characterizes the test as one-dimensional or homogeneous, combined with the fact that the variables included in this part (part a) summarise most of the requirements that all information systems for the nursing services area should satisfy, leads to the conclusion that the questionnaire indeed evaluates the suitability of the information system according to the conditions that should be fulfilled by all the systems developed for this particular area.
It should be noted that these requirements are summarised as: reliability requirements (Shortliffe et al. [20]) concerning the system's hardware and software; flexibility requirements; user-friendliness requirements (Van de Velde [23]), which include the ease of learning and using the system, the homogeneity of its appearance and the completeness of the produced messages and reports; the security of the system's database (Korpan [15]); and data precision and integrity. Additional requirements are transparency (McHugh [17]) as to the response to the users' actions and, finally, data processing speed (Cox et al. [10]), which, however, depends directly on the hardware of the system.
The questionnaire under study, therefore, is reliable and valid and can thus be used for the aim for which it was created.
4. Conclusions - Discussion
From the analysis of reliability and validity techniques carried out above, the following can be said:
The determination of a questionnaire's reliability depends mainly on the most suitable choice of technique and not so much on the interpretation of the results. Among the techniques presented, the least recommended for use are the test-retest and the split-half ones. The biggest problem of the test-retest technique is that the first administration of the test usually influences the answers in the second one, while the disadvantage of the split-half technique is that the correlation between the two halves of the test is not constant, but varies depending on how the total number of variables is divided. Excellent reliability estimation techniques, on the other hand, are the alternate or parallel forms technique and the internal consistency technique through the Cronbach alpha coefficient. In the first, the only difficulty is the creation of an equivalent form of the same test; if this practical difficulty is overcome, the correlation between the two alternative forms of the test constitutes an excellent estimate of the reliability of the questionnaire under study. As for the internal consistency technique, the only difficulty is the interpretation of the results, and more specifically the value above which the level of reliability is considered acceptable; usually, a reliability value above 0,7 is considered acceptable.
Contrary to reliability, validity is more theoretically oriented, which is why the interpretation of its results requires much more attention. Particularly for abstract theoretical concepts, such as 'effectiveness', proving validity is exceptionally difficult, and for this reason the results of techniques such as factor analysis cannot be interpreted superficially, as this can lead to misleading conclusions. In those cases, however, where some comparable theoretical framework exists, agreement with this framework, or with the results of instruments of proven validity, also constitutes important evidence of a test's validity. Factor analysis therefore constitutes a tool to be used alongside theoretical analysis, not as its replacement.
5. Bibliography
[1] Aiken, L.R., Psychological Testing and Assessment, Allyn and Bacon,
Boston (1994)
[2] Alexopoulos, D., Psychometrics, Planning test and analysis of questions,
Vol.A’. Athens, Ellinika Grammata (1998) (in Greek).
[3] Alreck, P., Settle, R., Survey Research a Handbook, IRWIN, USA (1995)
[4] Anastasi, A., Psychological Testing, New York, Macmillan (1988)
[5] Apostolakis, I., Kastania, A., Pierrakou, H., Data Statistical Processing in
Health, Athens, Papazisi (2003) (in Greek).
[6] Brown, D.J., Statistics Corner, Questions and answers about language testing
statistics: Can we use the Spearman-Brown prophecy formula to defend low
reliability?. Shiken: JALT Testing & Evaluation SIG Newsletter, Vol. 4 No.
3, pp.7-9, [Online] Available at http://www.jalt.org/test/bro_9.htm (2001)
[7] Burns, R., Introduction to Research Methods, London, Sage Publications
(2000)
[8] Carmines, G.E., Zeller A.R., Reliability and Validity Assessment, Sage
Publications Inc, USA (1979).
[9] Cattell, R.B., The scree test for the number of factors, Multivariate Behavioral Research, Vol. 1, pp.245-276 (1966)
[10] Cox, H.C., Hersanyi, B. and Dean, L.C., Computers and Nursing,
Application to Practice, Appleton & Lange, East Norwalk, (1987)
[11] Elvin, C., Test Item Analysis Using Microsoft Excel Spreadsheet Program.
[Online] Available at
http://www.eflclub.com/elvin/publications/2003/itemanalysis.html (2003)
[12] Flynn, B.B., Schroeder, R.G. and Sakakibara, S., A framework for quality
management research and an associated management instrument, Journal of
Operations Management, Vol. 11, pp. 339-366 (1994)
[13] Green, S., Salkind, N., Akey, T., Using SPSS for Windows: Analyzing and Understanding Data, 2nd edition, Prentice Hall, USA (2000)
[14] Howitt, D., Cramer, D., Statistics with SPSS 10 for Windows, Kleidarithmos,
Athens, (2001) (in Greek)
[15] Korpan, R.A., System Reliability Assurance of Quality and Security, Clinics
in Laboratory Medicine, Vol.11, No1, pp.165-177 (1991)
[16] Koulakoglou, K., Psychometrics and Psychological Evaluation, Ellinika
Grammata, Athens, (1998) (in Greek).

[17] McHugh, M., Information Access: a Basis for Strategic Planning and Control
of Operations, Nursing Administration Quarterly, Vol.10, No 2, pp.10-20
(1986)
[18] McNeill, P., Research methods, 2nd edition, Routledge, London, (1990)
[19] Mellon, R., Psycho-diagnostic Methods, Ellinika Grammata, Athens, (1998)
(in Greek).
[20] Shortliffe, E.H., Perreault, L.E., Wiederhold, G. and Fagan, L.M., Medical Informatics: Computer Applications in Health Care, Addison-Wesley Publishing Company, New York (1990)
[21] Stamouli, M.A., Quality Control of a Nursing Service Information System for
a Hospital: Selected papers of the 5th Pan-Hellenic Scientific Congress of
Health Services Management, Health Administration and Economy
Sciences, pp.193-208, Mediforce, Athens (2004) (in Greek).
[22] The encyclopedia, Centre for Applied Language Studies, Vocabulary
Acquisition Research Group Encyclopedia, [Online] Available at
http://www.swan.ac.uk/cals/calsres/encyclopedia/Cronbach's_Alpha.htm
[23] Van de Velde, R., Hospital Information Systems-The Next Generation,
Springer-Verlag, New York (1992)
Table 1. Reliability control techniques

                            Number of Test Forms
Number of test              One                               Two
applications
One                         Split-half technique;             Alternate or parallel forms
                            Cronbach alpha internal           technique (with a short time
                            consistency technique             interval between the two
                                                              applications)
Two                         Test-retest technique             Alternate or parallel forms
                                                              technique (with a long time
                                                              interval between the two
                                                              applications)

Source: Anastasi [4]

Table 2. Questionnaire variables whose reliability was measured

Part a: Users' opinions on the I.S., with regard to the criteria that all the I.S. developed for the particular field should satisfy

Variable    Description
1           Ease of learning the system
2           Ease of using the system
3           Uniformity in the system's appearance
4           Satisfaction of security conditions
5           Satisfaction with data precision
6           Production of informative messages
7           Satisfactory speed of data processing
8           Creation and production of satisfactory reports

Part b: General questions on the I.S. in the field

Variable    Description
1           Do the Information Systems offer substantial help?
2           Do they provide administrative effectiveness?
3           Should they produce automated shift scheduling for the nursing personnel?

Table 3. Descriptive statistics concerning Part a

Variable    Mean      Standard deviation    N
1           1,7093    0,5712                86
2           1,5581    0,5658                86
3           1,4070    0,5610                86
4           1,5349    0,5881                86
5           1,3256    0,5189                86
6           1,6279    0,6333                86
7           1,4767    0,5472                86
8           1,5116    0,5476                86

Table 4. Correlations among the items of the questionnaire under study (Part a)


Variables 1 2 3 4 5 6 7 8
1 1,00
2 0,58 1,00
3 0,30 0,31 1,00
4 0,26 0,29 0,40 1,00
5 0,16 0,34 0,36 0,46 1,00
6 0,09 0,39 0,27 0,32 0,30 1,00
7 0,03 0,27 0,20 0,19 0,07 0,42 1,00
8 0,10 0,28 0,39 0,20 0,19 0,42 0,39 1,00

Table 5. Cronbach alpha coefficient results (Part a)

Variable    Scale Mean        Scale Variance     Corrected Item-      Alpha if
            if Item Deleted   if Item Deleted    Total Correlation    Item Deleted
1           10,4419           6,4142             0,3419               0,7574
2           10,5930           5,8207             0,5821               0,7144
3           10,7442           6,0044             0,5131               0,7272
4           10,6163           5,9804             0,4880               0,7315
5           10,8256           6,3339             0,4314               0,7417
6           10,5233           5,7818             0,5079               0,7277
7           10,6744           6,4339             0,3589               0,7537
8           10,6395           6,1862             0,4566               0,7373

Cronbach Alpha for 8 items: Alpha = 0,7619
Table 6. Cronbach alpha coefficient results (Part b)

Variable    Scale Mean        Scale Variance     Corrected Item-      Alpha if
            if Item Deleted   if Item Deleted    Total Correlation    Item Deleted
1           2,8111            0,6493             0,5307               0,2307
2           2,9000            0,7652             0,4773               0,3524
3           2,3778            0,7321             0,1983               0,7960

Cronbach Alpha for 3 items: Alpha = 0,5678
Table 7. Results of the Initial Solution

             Initial Eigenvalues                          Extraction Sums of Squared Loadings
Component    Total    % of Variance    Cumulative %       Total    % of Variance    Cumulative %
1            3,032    37,897           37,897             3,032    37,897           37,897
2            1,294    16,170           54,066             1,294    16,170           54,066
3            1,020    12,747           66,814             1,020    12,747           66,814
4            0,762    9,529            76,343
5            0,622    7,773            84,116
6            0,498    6,227            90,343
7            0,464    5,801            96,144
8            0,308    3,856            100,000

Figure 1. Scree plot of the eigenvalues (Eigenvalue plotted against Component Number; the eigenvalues are those of Table 7, levelling off after the first component)

Table 8. Factor loadings

Variable    Factor 1
2           0,715
3           0,666
6           0,663
4           0,644
8           0,606
5           0,593
7           0,502
1           0,501
