
Research Briefs

Reliability and Internal Consistency Findings from the C-SEI

Katie Anne Adamson, PhD, RN; Mary E. Parsons, PhD, RN; Kim Hawkins, MS, APRN-NP;
Julie A. Manz, MS, RN; Martha Todd, MS, APRN-NP; and Maribeth Hercinger, PhD, RN-C


Human patient simulation (HPS) is increasingly being used as both a teaching and an evaluation strategy in nursing education. To meaningfully evaluate student performance in HPS activities, nurse educators must be equipped
with valid and reliable instruments for measuring student
performance. This study used a novel method, including
leveled, video-archived simulation scenarios, a virtual classroom, and webinar and e-mail communication, to assess the
reliability and internal consistency of data produced using
the Creighton Simulation Evaluation Instrument. The interrater reliability, calculated using intraclass correlation (2,1)
and 95% confidence interval, was 0.952 (0.697, 0.993). The
intrarater reliability, calculated using intraclass correlation
(3,1) and 95% confidence interval, was 0.883 (0.001, 0.992),
and the internal consistency, calculated using Cronbach's alpha, was α = 0.979. This article includes a sample of the
instrument and provides valuable resources and reliability
data for nurse educators and researchers interested in measuring student performance in HPS activities.

Received: November 11, 2010

Accepted: May 31, 2011
Posted Online: July 15, 2011
Dr. Parsons is Associate Professor, Ms. Hawkins is Assistant Professor,
Ms. Manz is Assistant Professor, Ms. Todd is Assistant Professor, and Dr.
Hercinger is Assistant Professor, School of Nursing, Creighton University,
Omaha, Nebraska. Dr. Adamson is Assistant Professor, University of Washington Tacoma, Tacoma, Washington.
This research was completed as part of Dr. Adamson's dissertation
work at Washington State University College of Nursing in Spokane, Washington.
The authors thank the National League for Nursing, Nursing Education Research Grants Program, and the Washington Center for Nursing for
funding this research.
The authors have no financial or proprietary interest in the materials
presented herein.
Address correspondence to Mary E. Parsons, PhD, RN, Associate Professor, School of Nursing, Creighton University, 2500 California Plaza, Omaha,
NE 68178; e-mail:

Journal of Nursing Education Vol. 50, No. 10, 2011

Experiential learning, or learning by doing, is an ancient concept, and human patient simulation (HPS) is a prime example of an experiential teaching strategy that is increasingly used in nursing education. Human patient simulation
allows nursing students to not only learn by doing, but also to
demonstrate what they know in the context of realistic patient
care scenarios. By replicating specific simulations, nurse educators may evaluate multiple students or groups of students performing in the same patient care situation. This ability to control and manipulate students' clinical encounters provides nurse
educators with opportunities to make objective, formative, and
summative evaluations of student performance.
In education and research, the quality of evaluation data is determined, largely, by the quality of the strategies and instruments
used to gather it. A review of the literature (Kardong-Edgren,
Adamson, & Fitzgerald, 2010) revealed that educators across
the health care disciplines are developing and using a variety
of instruments to evaluate student performance in HPS activities. These instruments purport to measure constructs ranging
from student satisfaction (The Student Satisfaction and Self-Confidence in Learning survey) (National League for Nursing,
2005) to clinical performance expectations (Herm, Scott, & Copley, 2007). Unfortunately, the body of knowledge related to the
reliability and validity of many simulation evaluation instruments is still inadequate. For nurse educators to maximize the
opportunities that HPS affords for experiential teaching, learning, and evaluation, valid and reliable evaluation instruments are
imperative (Diekelmann & Ironside, 2002; Oermann, 2009).
One instrument that has demonstrated initial validity and
reliability when used with a sample of senior-level baccalaureate students is the Creighton Simulation Evaluation Instrument
(C-SEI). To further establish the psychometric properties of this
instrument, we assessed the reliability (interrater and test-retest,
also known as intrarater) and internal consistency of data from
the C-SEI when the instrument was used by a nation-wide sample of baccalaureate nurse educators to evaluate video-archived
simulated clinical experiences.

The investigators used the following methods to answer the
question: What are the reliability and internal consistency of
data from the C-SEI when it is used by a nation-wide sample of
baccalaureate nurse educators to evaluate video-archived simulated clinical experiences?



The C-SEI is a 22-item, dichotomous scale divided into four categories based on the American Association of Colleges of Nursing's (AACN) (1998) Essentials of Baccalaureate Education: assessment, communication, critical thinking, and
technical skills. Evaluators assign each item a score of either 0
or 1, with 1 indicating achieved competency of the behavior. In
practice, each item on the C-SEI is evaluated for its relevance to the scenario being used, and course faculty determine minimal expected behaviors for the scenario prior to the HPS activity, based on the scenario objectives and the level of the student. In this study, the minimum expected behaviors were determined by the investigators. A total
score is determined by summing the scores on the relevant items. A percent score is then calculated, and faculty determine a passing score for the clinical scenario. Previously established reliability scores, determined by percent agreement among six faculty who evaluated student groups, ranged from 84.4% to 89.1% across the four categories. Content validity was established through
the AACN framework and an expert panel review (Todd, Manz,
Hawkins, Parsons, & Hercinger, 2008).
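The scoring procedure described above can be sketched as a short Python function. The item names, the relevance set, and the helper name `csei_percent_score` are illustrative assumptions rather than part of the instrument; the 75% default threshold mirrors the passing cutoff used in this study.

```python
# Sketch of the C-SEI scoring procedure (hypothetical data, not the instrument itself).
# Each item is scored 0 or 1; items judged irrelevant to the scenario are
# excluded before the total and percent scores are computed.

def csei_percent_score(item_scores, relevant, pass_threshold=75.0):
    """Return (percent_score, passed) for one student or group.

    item_scores -- dict of item name -> 0 or 1
    relevant    -- set of item names deemed relevant to this scenario
    """
    scored = [score for item, score in item_scores.items() if item in relevant]
    total = sum(scored)                    # total score: achieved relevant items
    percent = 100.0 * total / len(scored)  # percent of relevant items achieved
    return percent, percent >= pass_threshold

# Hypothetical 5-item excerpt; the real instrument has 22 items in four categories.
scores = {"assessment_1": 1, "communication_1": 1, "critical_thinking_1": 0,
          "technical_1": 1, "technical_2": 1}
relevant = {"assessment_1", "communication_1", "critical_thinking_1", "technical_1"}

percent, passed = csei_percent_score(scores, relevant)
print(round(percent, 1), passed)  # prints: 75.0 True (3 of 4 relevant items achieved)
```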

Prior to the recruitment of participants, the investigators received a certificate of exemption for the involvement of human subjects from the institutional review board. A total of
38 nurse educators from across the United States participated in
the study. These participants were recruited through professional and simulation-related listserv connections. By self-report,
participants met the following inclusion criteria: they currently
taught in an accredited, prelicensure, baccalaureate nursing program in the United States; they had at least 1 year of experience
using HPS in prelicensure, baccalaureate nursing education;
and they had clinical teaching or practice experience in an acute
care setting as an RN over the past 10 years.
Study Design

This study employed a novel method, including leveled, video-archived simulation scenarios, a virtual classroom, and
webinar and e-mail communication. As part of the study, the investigators assessed the reliability (interrater and intrarater) and
internal consistency of data produced using the C-SEI (Todd et
al., 2008).
Leveled, Video-Archived Simulations

Video-recordings are frequently used in clinical and educational research to provide standardized subjects for observation-based evaluations (Baer, Smith, Rowe, & Masterton, 2003;
Cusick, Vasquez, Knowles, & Wallen, 2005; McConvey &
Bennett, 2005; Portney & Watkins, 2008). Indeed, one of the
primary barriers to analyzing the reliability of data from simulation evaluation instruments (or any instrument designed for
making observation-based evaluations) is the lack of standardized subjects to evaluate. To overcome this barrier, the investigators produced and video-archived three simulated patient care
scenarios depicting nursing students performing patient care
below, at, and above the level of expectation for senior baccalaureate nursing students. These scenarios were uploaded onto the university's server and made accessible through linked Web addresses. (A description of the development of these scenarios will be available in an upcoming publication, and access to the scenarios may be obtained by e-mailing the first author at katie.)
Virtual Classroom and Webinar Communication

Another barrier to rigorous psychometric analyses of simulation evaluation instruments is difficulty accessing an adequate
number of qualified participants. The investigators established
a virtual classroom and used webinar communication to facilitate the participation of nurse educators from around the United
States. The virtual classroom served as a portal for information
related to the study, including reference and training materials.
Prior to participating in the study, individuals attended webinar
orientation meetings where the investigators familiarized them
with the C-SEI. Participants then accessed a training video embedded in the online classroom where they were taught how to
score a sample scenario using the C-SEI. One of the unique features of the C-SEI is the standardized training that is required
prior to using the instrument. This training was adapted for the
scenario that was used in the study and included expected minimal behaviors for the simulation.
Data Collection Procedures

After completing the orientation and guided scoring of the

sample scenarios, participants began the 6-week data collection procedures. During the first 3 weeks, participants received
one e-mail per week instructing them to access and score one
of the three leveled scenarios (depicting students performing
below, at, or above the level of expectation). To assess test-retest reliability, during the second 3 weeks of the study, participants received e-mails instructing them to access and score
the same scenarios they scored during the first 3 weeks of the
study. The sequences of the scenarios were randomized for each participant, and participants were blinded to the intended levels of the scenarios they were viewing and scoring each week.

Twenty-nine (76.31%) of the original 38 participants completed the entire 6-week data collection procedure. Using Walter, Eliasziw, and Donner's (1998) equation for determining sample size for reliability studies, this sample size provided adequate power (80%, α = 0.05) when the minimal acceptable level of reliability was 0.40. However, the width of the confidence interval for the intrarater reliability result demonstrated that additional scenarios or a larger number of participants would have been better suited for determining test-retest
reliability. Data from the evaluation instruments were entered
into SPSS version 17.0 software for analysis. In addition to the
raw scores assigned by each of the participants, scores were
converted into percentages to categorize them as "passing" or "not passing." Scores of 75% or greater were designated as "passing," and scores less than 75% were designated as "not passing." Analysis of variance (ANOVA) comparisons revealed that
scores assigned by the participants who completed the study
were not significantly different from the scores assigned by participants who did not complete the study. The following sections
present the descriptive statistics and the reliability and internal
consistency findings using data from only the participants who
completed the study (n = 29).
Descriptive Statistics

The descriptive statistics, including means and standard deviations of the scores assigned to each of the scenarios during the first viewings (weeks 1 through 3 of the data collection procedures) and the second viewings (weeks 4 through 6), provided validity evidence supporting the intended levels of the
scenarios. The mean scores and standard deviations for the below-the-level-of-expectation scenario were 4.24 (3.00) and 4.62
(3.56) for the first and second viewings, respectively. Likewise,
the mean scores and standard deviations for the at-the-level-of-expectation scenario were 14.43 (3.79) and 14.39 (4.66) for the
first and second viewings, respectively. The mean scores and
standard deviations for the above-the-level-of-expectation scenario were 19.14 (1.48) and 19.28 (1.51) for the first and second
viewings, respectively. ANOVA comparisons revealed significant differences between each of the levels.
When the scores were categorized as "passing" or "not passing," all of the raters assigned the students in the below-the-level-of-expectation scenario a score of "not passing" and the students in the above-the-level-of-expectation scenario a score of "passing" during both the first and second viewings. However, when the scores for the at-the-level-of-expectation scenario were categorized in this way, 16 (55%) of the 29 raters assigned a score of "passing" during the first viewing, whereas 18 (62%) of the 29 raters assigned a score of "passing" during the second viewing. ANOVA revealed that the differences between raw scores
assigned during the first and second viewings were not statistically significant.
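A one-way ANOVA of this kind can be run with SciPy. The score lists below are made-up illustrations of three clearly separated performance levels, not the study data:

```python
# One-way ANOVA comparing scores assigned to three leveled scenarios
# (hypothetical ratings, not the study data).
from scipy.stats import f_oneway

below = [3, 5, 4, 6, 4]          # below-the-level-of-expectation scenario
at_level = [14, 13, 15, 16, 14]  # at-the-level-of-expectation scenario
above = [19, 20, 19, 18, 19]     # above-the-level-of-expectation scenario

F, p = f_oneway(below, at_level, above)
print(p < 0.05)  # True: the three levels differ significantly
```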
Interrater Reliability

Reliability is essentially a measure of consistency. To assess the consistency of scores assigned by different raters, intraclass
correlation (2,1) agreement was used. The selection of this
type of intraclass correlation was based on three specifications:
two-way ANOVA design; participants (raters) were considered
random effects (i.e., they are intended to represent a random
sample from a larger population); and the unit of analysis was the
individual rating (Shrout & Fleiss, 1979). Intraclass correlation
(2,1) (95% CI) was 0.952 (0.697, 0.993).
Intrarater (Test-Retest) Reliability

To assess the consistency of scores assigned to the same scenario by the same participant (rater) from the first viewing
to the second viewing, the investigators used intraclass correlation (3,1) consistency. The selection of intraclass correlation
(3,1) consistency was appropriate for these analyses based on
three specifications: a two-way ANOVA design; participants
(raters) were considered mixed effects because, as a measure
of intrarater reliability, the investigators were interested only in
the reliability of specific raters; and the unit of analysis was a
single rating (Portney & Watkins, 2008; Shrout & Fleiss, 1979).
The intrarater reliability between the first and second viewings, across all three scenarios, with 95% confidence interval, was 0.883 (0.001, 0.992).
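Both intraclass correlations reported above, ICC(2,1) for interrater agreement and ICC(3,1) for intrarater consistency, can be computed directly from the two-way ANOVA mean squares (Shrout & Fleiss, 1979). A minimal NumPy sketch using hypothetical ratings, not the study data:

```python
import numpy as np

def icc(X, model):
    """Intraclass correlation for an n-targets x k-raters matrix X
    (Shrout & Fleiss, 1979).
    model "2,1": two-way random effects, absolute agreement, single rating.
    model "3,1": two-way mixed effects, consistency, single rating."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    grand = X.mean()
    MSR = k * np.sum((X.mean(axis=1) - grand) ** 2) / (n - 1)  # between targets
    MSC = n * np.sum((X.mean(axis=0) - grand) ** 2) / (k - 1)  # between raters
    SSE = np.sum((X - grand) ** 2) - (n - 1) * MSR - (k - 1) * MSC
    MSE = SSE / ((n - 1) * (k - 1))                            # residual
    if model == "2,1":
        return (MSR - MSE) / (MSR + (k - 1) * MSE + k * (MSC - MSE) / n)
    if model == "3,1":
        return (MSR - MSE) / (MSR + (k - 1) * MSE)
    raise ValueError(model)

# Hypothetical ratings: 3 leveled scenarios (rows) scored by 4 raters (columns).
ratings = [[4, 5, 4, 5],
           [14, 15, 13, 14],
           [19, 19, 20, 19]]
print(round(icc(ratings, "2,1"), 3), round(icc(ratings, "3,1"), 3))
```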
Internal Consistency

Internal consistency may be considered a measure of both reliability and validity. It assesses whether the items in an instrument are all measuring the same construct. Cronbach's alpha was used to measure the internal consistency of the items on the C-SEI (α = 0.979).
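Cronbach's alpha can be computed directly from the item variances and the variance of the total scores. A minimal sketch with hypothetical dichotomous scores, not the study data:

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an observations x items score matrix:
    alpha = k/(k-1) * (1 - sum(item variances) / variance(total scores))."""
    X = np.asarray(items, dtype=float)
    k = X.shape[1]                          # number of items
    item_vars = X.var(axis=0, ddof=1)       # variance of each item
    total_var = X.sum(axis=1).var(ddof=1)   # variance of total scores
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Hypothetical dichotomous scores: 6 observations x 4 items.
scores = [[1, 1, 1, 1],
          [1, 1, 1, 0],
          [1, 0, 1, 0],
          [0, 1, 0, 0],
          [0, 0, 0, 0],
          [1, 1, 0, 1]]
print(round(cronbach_alpha(scores), 3))  # prints: 0.656
```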

Discussion and Limitations

Evaluation methods used in nursing education vary considerably across programs (Oermann, Yarbough, Saewert, Ard, &
Charasika, 2009), and many quantitative and qualitative evaluation strategies do not have established validity and reliability.
Nursing programs often choose to use letter grades for didactic
courses and pass-or-fail for clinical courses, indicating a reluctance to explicitly delineate between varying levels of performance. Although many factors influence how nurse educators
evaluate students and what they choose to base those evaluations on, failing to provide meaningful, valid, and reliable evaluations of students' practice performance in a performance-based
discipline is not acceptable. It is essential that nurse educators
consider the validity and reliability of the instruments they use
to evaluate students.
Although sources vary on what is considered an acceptable
level of reliability, in general, reliabilities equal to or greater
than 0.75 are considered good (Portney & Watkins, 2008). The
results reported in the previous sections demonstrate the reliabilities (interrater and intrarater) and internal consistency findings from the C-SEI as better than good when the instrument
was used by a nation-wide sample of baccalaureate nurse educators to evaluate video-archived simulated clinical experiences. These findings support the expanded use of the instrument
in additional populations and circumstances, including pilot
use of the C-SEI for clinical evaluation. This work is currently
being initiated in National Council of State Boards of Nursing
schools with a modified version of the C-SEI.
The findings from this study also add to the evidence about
something that most nurse educators know intuitively: it is relatively easy to identify very poor and very good performances,
but it is more difficult to delineate between mid-range performances. These findings are similar to previously reported findings (Yule et al., 2009) and necessitate further investigation into how to better evaluate mid-level students.
This study had several limitations related to the methods,
sample, and analytic strategies. In a true evaluation situation,
the instructors would meet together to establish and agree on
the expected minimum behaviors for the simulation scenario,
rather than have the behaviors predetermined by the investigators. They would likely have a teacher-student relationship with
the students they were evaluating, and there would be identified
significance to the results of the evaluation (i.e., grades, passing
a course, need for remediation). Although some of these factors,
if maintained within the study methods, may have contributed
to rater bias, they also would have created a more realistic environment for evaluation.
The sample size and the nonrandom selection of participants present further limitations of the study and analyses. The intraclass correlation (2,1) has, as one of its assumptions, a random
sample of raters. However, the sample used for this study was a
convenience sample of nursing faculty who met specific inclusion criteria. Finally, the widths of the 95% confidence intervals for the reliability results were influenced by the number of raters and
the number of scenarios that were used. Had the investigators
used more scenarios or a larger number of participants, the sizes
of these confidence intervals could have been reduced.

Ongoing, rigorous psychometric assessment of the C-SEI in
different settings with larger and more diverse samples will allow for further refinement of the instrument. The development
and use of additional scenarios would also improve the quality of the intrarater reliability data and support further delineation of mid-range performance. Ultimately, comparison of student performance and learning between groups who are exposed to various
teaching strategies, including HPS, needs ongoing examination.
The testing of these types of instruments allows for refinement
and improved accuracy in evaluating student performance and
more effective feedback to students.

References

American Association of Colleges of Nursing. (1998). The essentials of
baccalaureate education for professional nursing practice. Washington,
DC: Author.
Baer, G.D., Smith, M.T., Rowe, P.J., & Masterton, L. (2003). Establishing the reliability of mobility milestones as an outcome measure for
stroke. Archives of Physical Medicine and Rehabilitation, 84, 977-981.
Cusick, A., Vasquez, M., Knowles, L., & Wallen, M. (2005). Effect of rater
training on reliability of Melbourne Assessment of Unilateral Upper
Limb Function scores. Developmental Medicine and Child Neurology,
47, 39-45.


Diekelmann, N.L., & Ironside, P.M. (2002). Developing a science of nursing education: Innovation with research. Journal of Nursing Education,
41, 379-380.
Herm, S., Scott, K., & Copley, D. (2007). Simsational revelations.
Clinical Simulation in Nursing Education, 3, e25-e30. doi:10.1016/
Kardong-Edgren, S., Adamson, K., & Fitzgerald, C. (2010). A review of
currently published evaluation instruments for human patient simulation. Clinical Simulation in Nursing, 6(1), e25-e35. doi:10.1016/
McConvey, J., & Bennett, S.E. (2005). Reliability of the Dynamic Gait Index in individuals with multiple sclerosis. Archives of Physical Medicine and Rehabilitation, 86, 130-133. doi:10.1016/j.apmr.2003.11.033
National League for Nursing. (2005). Nursing education research. Retrieved from
Oermann, M.H. (2009). Evidence-based programs and teaching/evaluation
methods: Needed to achieve excellence in nursing education. In M. Adams & T. Valiga (Eds.), Achieving excellence in nursing education (pp.
63-76). New York, NY: National League for Nursing.
Oermann, M.H., Yarbough, S.S., Saewert, K.J., Ard, N., & Charasika, M.E.
(2009). Clinical evaluation and grading practices in schools of nursing:
National survey findings, part II. Nursing Education Perspectives, 30,
Portney, L.G., & Watkins, M.P. (2008). Foundations of clinical research:
Applications to practice (3rd ed.). Upper Saddle River, NJ: Prentice Hall.
Shrout, P.E., & Fleiss, J.L. (1979). Intraclass correlations: Uses in assessing
rater reliability. Psychological Bulletin, 86, 420-428.
Todd, M., Manz, J.A., Hawkins, K.S., Parsons, M.E., & Hercinger, M.
(2008). The development of a quantitative evaluation tool for simulations in nursing education. International Journal of Nursing Education
Scholarship, 5, Article 41.
Walter, S.D., Eliasziw, M., & Donner, A. (1998). Sample size and optimal designs for reliability studies. Statistics in Medicine, 17, 101-110.
Yule, S., Rowley, D., Flin, R., Maran, N., Youngson, G., Duncan, J., et al.
(2009). Experience matters: Comparing novice and expert ratings of
non-technical skills using the NOTSS system. Surgical Education, 79,

Copyright SLACK Incorporated
