ARTICLE INFO

Keywords:
Intra-rater reliability
Inter-rater reliability
REBA
Assessment

ABSTRACT

Background: Ergonomics researchers and practitioners use many techniques to assess risk. The Rapid Entire Body Assessment (REBA) is a common tool used to facilitate the measurement and evaluation of the risks associated with working postures as a part of ergonomic workload. However, little work has been reported regarding the reliability of REBA reporting.

Objective: This study assesses the reliability of this commonly used tool for research and practice.

Methods: The study was conducted as part of the larger Safe Workload Ergonomic Exposure Project (SWEEP), which is a University of Minnesota research initiative for custodians. For this effort, a secondary data analysis was conducted on data collected during a study of custodians' exposures to risks of musculoskeletal disorders. Eight observers used the REBA tool to sequentially evaluate tasks performed two times in succession by the same individual.

Results: This study reports high intra-rater reliability (ICC = 0.925) for REBA raw scores and moderate inter-rater reliability (IRR) (Fleiss kappa = 0.54) for a categorical scoring of REBA.

Conclusion: A moderate amount of IRR was found, and a standardized training and calibration protocol is proposed as a potential means to improve intra- and inter-rater reliability.
1. Introduction

Ergonomists use many tools to measure risk of injury. The Rapid Entire Body Assessment (REBA) is a practitioner's field tool (McAtamney and Hignett, 2000; Coyle, 2005; Motamedzade et al., 2011) that is designed to facilitate the measurement and evaluation of risks associated with working postures as a part of ergonomic workload. It is "sensitive to musculoskeletal risks in a variety of tasks, divides the body into segments to be coded individually, regarding movement planes, and provides a scoring system for muscle activity caused by static, dynamic, rapidly changing or unstable postures" (McAtamney and Hignett, 2000). The REBA tool is also commonly utilized in research settings (Janowitz et al., 2006; Jones and Kumar, 2010; Kee and Karwowski, 2007; Pascual and Naqvi, 2008; Nawi et al., 2013; Lee et al., 2008). It is a popular assessment tool, well represented in the technical literature (Dempsey et al., 2005); for example, a general Google Scholar search for "REBA Ergonomics" found 2700 results.

David (2005) concluded that assessments that use observations (such as the REBA) "provide the levels of costs, capacity, versatility, generality and exactness best matched to the needs of occupational safety and health practitioners." These practitioners have limited financial and temporal resources and as such need to do triage for intervention.

However, little work has been reported regarding the validity or reliability of REBA. This is a concern due to lack of knowledge of the consistency of this tool. The Safe Workload Ergonomic Exposure Project (described below) used REBA as an assessment tool during a study of custodians' exposure to risk factors associated with musculoskeletal disorders. There were multiple instances of two successive observations of a task by the same observer; in addition, multiple observers evaluated multiple tasks performed by a single individual. These data afforded an assessment of the intra-rater and inter-rater reliability of REBA.

1.1. Safe Workload Ergonomic Exposure Project (SWEEP) study

The SWEEP study is a mixed prospective and retrospective cohort study of unionized janitors in the Twin Cities, Minnesota. It was initiated by the University of Minnesota, Division of Environmental Health Sciences and examined mental workload, physical workload, sleep, stress, job satisfaction, fitness, and occupational injury burden on janitors. Reflecting the population, the study was conducted to
∗ Corresponding author. E-mail address: schw1562@umn.edu (A.H. Schwartz).
https://doi.org/10.1016/j.ergon.2019.02.010
Received 27 March 2018; Received in revised form 4 January 2019; Accepted 19 February 2019
Available online 18 March 2019
0169-8141/ © 2019 Elsevier B.V. All rights reserved.
A.H. Schwartz, et al. International Journal of Industrial Ergonomics 71 (2019) 111–116
accommodate English, Spanish, and Somali languages.

1.2. Current data regarding reliability and validity of REBA

1.2.1. Studies of inter-method reliability

Several studies (Jones and Kumar, 2010; Kee and Karwowski, 2007; Ansari and Sheikh, 2014; Manavakun, 2017; Gentzler and Stader, 2010) have compared the results of REBA analyses with results from application of other risk assessment tools. These studies might be described as reports of inter-method reliability, where inter-method reliability can be thought of as an instance of parallel forms reliability; that is, two different assessment tools intended to measure the same concept, such as the risk of musculoskeletal injury, should give consistent results. This could also be considered convergent validity, which is a form of construct validity.

These inter-method reliability studies of REBA have compared the risk, or action, categories to other tools such as the Rapid Upper Limb Assessment (RULA) and the Ovako Working Posture Analysis System (OWAS).

Ansari and Sheikh (2014) assessed risk levels in the same sample of workers in India with REBA and RULA. RULA classified 40% of workers as at a high-risk level, compared to REBA, which placed 53% of the workers evaluated at the same level, as high risk.

Manavakun (2017) observed tree cutting in Thailand, using both the REBA and OWAS, and found that "postural load [as measured] by REBA was generally higher than by OWAS … 22.6% of 248 postures were classified at the action category 3 or 4 by OWAS, about 72.6% of the postures were classified into action level 3 or 4 by REBA … OWAS underestimated posture-related risk compared to REBA."

Kee and Karwowski (2007) used OWAS, REBA and RULA to evaluate tasks in various industries (iron, steel, electronics, chemical, medical, and automotive) and found that the "inter-method reliability for postural load category between OWAS and RULA was … 29.2%, and the reliability between RULA and REBA was 48.2%." They inferred that, "compared to RULA, OWAS and REBA generally underestimated postural loads for the analyzed postures" (Kee and Karwowski, 2007).

Jones and Kumar (2010) evaluated sawmill workers using five different assessment tools: RULA, REBA, the American Conference of Government Industrial Hygienists Threshold Limit Value for Mono-Task Hand Work (ACGIH-TLV), the Strain Index (SI) and the Concise Exposure Index (OCRA). They scored the different tools in two ways: first for agreement on risk category (at risk vs. not at risk) and second for perfect agreement between risk levels (e.g., both tool A and tool B rate the job as low, high or very high risk). Excepting the ACGIH-TLV, there was a high level of agreement between the evaluation tools for at-risk versus not-at-risk classifications and a more modest level of perfect agreement of classifications.

Gentzler and Stader (2010) found that for firefighters lifting hoses "above the shoulder for drainage, the NIOSH lifting equation showed a danger of lifting the hose from the ground to chest height and especially from chest height to above the shoulders, and the REBA determined that there was a very high level of risk for injury" for the same action.

While these studies suggest that REBA is reliable in the sense that it measures similar things, as do other tools intended to measure musculoskeletal risk, they do not describe inter- or intra-rater reliability of REBA. Knowledge of intra- and inter-rater reliability is necessary as a prerequisite for establishing the validity of the tool.

1.2.2. Intra- and inter-rater reliability of REBA

At present, there are limited data available regarding the intra- and inter-rater reliability of the REBA tool. Hignett and McAtamney (McAtamney and Hignett, 2000) reported that inter-observer reliability of REBA scoring ranged between 62 and 85 percent, but it is not clear if the agreement referred to raw or categorical scores. Lamarão et al. (2014) also reported modest percent agreement for two observers in a Portuguese-language version of REBA. Cohen (1960) has criticized the use of percent agreement on the basis that it does not account for chance agreement between raters.

Janowitz et al. (2006) reported a similar level of inter-rater reliability for a modified version of REBA when classifying risks to the back: 0.54 for the upper back and 0.66 for the lower back. However, the modification of the REBA tool reported by Janowitz et al. (2006), and the restricted scope limiting scoring to the back, limit the ability to generalize these results to the standard form of the REBA tool. Similarly, although Lamarão et al. (2014) also report modest agreement for intra- and inter-rater reliability of the REBA tool, the version they studied was a Portuguese-language version, somewhat limiting the ability to generalize to an English-language version of REBA.

1.2.3. Raw scoring versus categorical scoring of REBA

REBA assigns raw scores to risk analyses of tasks; those scores are continuous and range from a minimum of 0–11 and upward. In turn, those raw scores are interpreted by converting them to one of five action, or risk, categories. Those categories are Negligible Risk, Low Risk, Medium Risk, High Risk, and Very High Risk. Thus, the action expected of the evaluator depends on the action category, rather than on a specific raw score. As an example, raw scores of 8, 9 and 10 fall into the High-Risk action category, and the recommended action is "Investigate and Implement Change".

1.2.4. Predictive validity of REBA

The purpose of the REBA and other similar tools is to enable practitioners to identify jobs that are at risk for development of musculoskeletal disorders; that is, they must first effectively and efficiently discriminate at-risk jobs from jobs not at risk, before attempting to discriminate levels of risk. An ideal assessment tool would have high sensitivity and high specificity – that is, it would minimize false positives and false negatives.

However, before the predictive validity of REBA risk scores can be examined, it is necessary to establish that the measurements made by different observers are consistent and reliable. Reliability of a measure is a prerequisite to establishing validity (Moskal and Leydens, 2000; Cook and Beckman, 2006).

Two types of reliability are of interest: intra-rater and inter-rater reliability. Without some knowledge of the consistency of ratings within individual raters and between different raters, it is not possible to assess the predictive validity of REBA.

2. Methods

2.1. Intra-rater reliability

In collecting the data for a field study of the musculoskeletal risks associated with custodial tasks (described below), several observers used the REBA tool to sequentially evaluate a task performed two times in succession by the same individual. Comparing the successive ratings provided an opportunity to assess the intra-rater reliability of REBA risk assessments (Table 1). The participants in this project were all experienced custodians. There were 30 people, 15 men and 15 women.

Table 1
Ergonomic impact (REBA score) by task.

Task             Mean   Standard Deviation   Median   Minimum   Maximum
Toilet cleaning  10.40  2.11                 10       7         13
Dusting          8.92   2.41                 9        4         13
Large trash      10.49  1.09                 11       4         13
Small trash      10.68  1.44                 11       4         13
Mopping          9.13   1.86                 10       2         12
Mirror cleaning  9.27   1.68                 9        5         13
Sink cleaning    8.96   2.36                 9        4         13
Vacuuming        9.65   2.08                 10       1         13
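For illustration, the raw-score-to-category conversion described in Section 1.2.3 (and applied in Section 2.8) can be sketched in Python. Note that this paper only confirms the 8–10 High Risk band; the remaining cut points below are the commonly published REBA action bands and are an assumption of this sketch, not a statement from the study.

```python
# Sketch of the REBA raw-score-to-action-category conversion.
# ASSUMPTION: all bands except 8-10 (High Risk, confirmed in the text)
# follow the commonly published REBA thresholds.

def reba_category(raw_score: int) -> str:
    """Map a REBA raw score to one of the five action (risk) categories."""
    if raw_score <= 1:
        return "Negligible Risk"
    if raw_score <= 3:
        return "Low Risk"
    if raw_score <= 7:
        return "Medium Risk"
    if raw_score <= 10:
        return "High Risk"       # e.g. 8, 9, 10 -> "Investigate and Implement Change"
    return "Very High Risk"

# Numeric coding used in Section 2.8: 1 = Negligible Risk ... 5 = Very High Risk.
CATEGORY_CODE = {
    "Negligible Risk": 1, "Low Risk": 2, "Medium Risk": 3,
    "High Risk": 4, "Very High Risk": 5,
}
```

Converting raw scores to these coded categories is what allows a chance-corrected agreement statistic such as the Fleiss kappa of Section 2.8 to be computed.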
Balanced (10 in each age group) tertile cut points of the sample were made, with the groups being ages 21–39, 40–56, and 60–71. The average height for women was 1.59 m and their average weight was 73 kg. On average, men were 1.68 m tall and weighed 80.5 kg.

The consistency of the repeated measurements (intra-rater reliability) was assessed using an Intra-Class Correlation (ICC) as suggested in previous work on intra-rater reliability (Bennell et al., 1998; Berg et al., 1995). ICCs are "measures of the relative similarity of quantities which share the same observational units …", and in areas of application such as "reliability studies (e.g., products from the same machine, measurements of characteristics for the same person)", and "… persons contacted by the same interviewer" (Koch, 2004).

Lamarão et al. (2014) note that there are two approaches to using tools such as REBA to evaluate injury risk: observation in situ and observation of video. In this instance, we used live, in situ evaluations, as this is a common practice amongst practitioners. While there is some possibility of variation in the manner in which a task is performed between successive trials, each instance was performed by the same, experienced individual in exactly the same environment.

2.2. Inter-rater reliability

During the study of the potential musculoskeletal risks associated with custodial tasks, several observers concurrently evaluated tasks performed by the same individual. These data provided an opportunity to assess the inter-rater reliability of REBA risk assessments.

Gwet (2014) describes inter-rater reliability as two or more raters "classifying subjects or objects into predefined classes or categories" and that "the extent to which these two categorizations coincide represents what is often referred to as inter-rater reliability." As defined in this paper, inter-rater reliability refers to the consistency of measurements of postures made by multiple observers of the same work tasks performed by the same individual. The consistency of these measurements is described using the Technical Error of Measurement (TEM). The TEM is used to calculate a reliability coefficient (R).

2.3. Technical Error of Measurement

Lewis (1999) noted that when two individuals measure the same thing the value obtained will not always be the same, thus producing the Technical Error of Measurement (TEM). The TEM is commonly used to assess intra-rater and inter-rater reliability of measurements made by anthropometrists (Lewis, 1999; Ulijaszek and Kerr, 1999; WHO Multicentre Growth Reference Study Group/de Onis, 2006; Geeta et al., 2009). The TEM may be thought of as the standard deviation of replicate measures (Marks et al., 1985).

This paper utilizes TEM to assess the inter-rater reliability of measurements of postures made concurrently by multiple individuals while observing the same individual performing a custodial task.

A formula to calculate inter-rater reliability for multiple observers is given by de Onis as

\mathrm{TEM}_{inter} = \left[ \frac{1}{N} \sum_{i=1}^{N} \frac{1}{K_i - 1} \left( \sum_{j=1}^{K_i} Y_{ij}^2 - \frac{\left( \sum_{j=1}^{K_i} Y_{ij} \right)^2}{K_i} \right) \right]^{1/2}

where Y_{ij} is one of the measures made by observer j for task i, K_i is the number of observers that measured task i, and N is the number of tasks observed (WHO Multicentre Growth Reference Study Group/de Onis, 2006). It is important to note that the formula can accommodate differing numbers of observers for each of the tasks observed.

de Onis (WHO Multicentre Growth Reference Study Group/de Onis, 2006) describes the Coefficient of Reliability, R, as an estimate of "the proportion … of the total measurement variance that is not due to measurement error. A reliability coefficient of 0.8 means that 80% of the total variability is true variation, while the remaining proportion (20%) is attributable to measurement error …". de Onis gives the formula for calculating R as

R = 1 - \frac{(\mathrm{TEM}_{inter})^2}{SD^2}

where TEM_{inter} is the inter-rater TEM and SD^2 is an estimate of the variance of all measurements.

2.4. Intra-rater reliability of REBA in the SWEEP study

During the SWEEP study, eight observers completed 189 repeated observations (first and second observations) of nine tasks. These paired data were used to compute an Intra-Class Correlation (ICC).

2.5. Inter-rater reliability of REBA in the SWEEP study

The Technical Error of Measurement and reliability coefficients were calculated for a total of eight observers who evaluated eight simulated custodial tasks (emptying large trash cans, emptying small trash cans, mopping, vacuuming, dusting, cleaning toilets, mirrors, and sinks) as they were performed by an experienced custodian. The simulated tasks were created from information gathered from focus groups and interviews with custodians; experienced custodians helped design the tasks to be realistic and representative. Seven of the observers were novices; one was a professional ergonomist experienced in risk assessments. Not all observers were able to evaluate all tasks.

2.6. Simulated custodial tasks

For all of the simulated tasks, janitors were asked to complete the tasks as they would normally.

Emptying large trash cans: Janitors were asked to move two preloaded 18.2 kg trash cans on rolling platforms 6.1 m and then empty the cans, after which they were to place new bags in the cans.

Emptying small trash cans: There were four 6.8 kg trash cans; janitors were asked to grab them from under desks, empty them, tie off the bags, put in new liners, put the small bags into a large bin, and then replace the small bins under the desks.

Mopping: A 3 m by 3 m polished concrete floor was marked off with yellow tape. Inside the area was a table and three folding chairs. Janitors were given a Kentucky string mop and mop bucket with wringer and asked to clean the floor. Janitors either moved the chairs and the table out of the 3 m by 3 m area, or simply mopped around the furniture.

Vacuuming: Two 0.91 by 2.44 m rugs were placed side by side, creating a 1.83 × 4.9 m area. Janitors were asked to plug a standing style vacuum into an extension cord, vacuum the rugs, and then unplug the cord.

Dusting: In an office cubicle, purple duct tape was placed on several surfaces to provide a standardized cleaning routine. The horizontal surfaces were the top of the cubicle walls, the top of the monitor and computer tower, and a filing cabinet. Vertical surfaces included a telephone, part of a desk lamp, and several pictures on the wall. Janitors were provided a feather duster.

Cleaning toilets: Janitors were asked to wipe down the toilet outer surface with a cloth, scrub the bowl with a long-handled brush, and refill the toilet paper dispenser.

Mirrors: Janitors were asked to clean a mirror that was approximately 0.61 m × 0.91 m. They were given a cloth and a spray bottle that weighed approximately 2.3 kg.

Sinks: Janitors were asked to clean a white porcelain sink that was 0.76 m wide, 0.61 m deep, and 0.35 m tall. They were given a cloth and a spray bottle that weighed 2.2 kg.

Each observer recorded his or her postural assessments on a modified REBA scoresheet. The scoresheets were modified to avoid details unnecessary to the data gathering stage, namely the table calculations
Fig. 1. Modified version of REBA scoresheet. With permission from Hignett, S., McAtamney, L. (2000) Rapid Entire Body Assessment (REBA). Applied Ergonomics 31,
201-205.
and score rankings. The scoresheets were collected from all observers; final REBA scores (Raw Scores) were then determined based on the postural assessments recorded on the modified scoresheets. The modified scoresheet is shown in Fig. 1.

2.7. Inter-rater reliability for raw scores

The Technical Error of Measurement and Reliability coefficients were calculated for the REBA raw scores for each of the eight jobs using the formulas suggested by de Onis et al. (WHO Multicentre Growth Reference Study Group/de Onis, 2006).

2.8. Categorical scoring

Each raw score was converted to a risk assessment category, based on the raw score ranges described for REBA: Negligible Risk, Low Risk, Medium Risk, High Risk, and Very High Risk. The risk categories were then assigned a number from 1 to 5, with 1 corresponding to Negligible Risk and 5 to Very High Risk. A Fleiss kappa was calculated for six observers who concurrently assessed four tasks using categorical scores.

3. Results

3.1. Intra-rater reliability

The ICC value for 189 pairs of observations made by nine observers of eight tasks was determined to be 0.925. These were raw scores, not categorical.

The high ICC value indicates that there was a high degree of consistency in both the manner in which the tasks were performed by the custodians on each of the two successive trials and in the REBA rating assigned to each trial by the observers.

3.2. Inter-rater reliability

The Reliability Coefficient, R, of the continuous, raw data was calculated to be 0.41. That is, about 59 percent of the total variation in the raw scores was due to inter-rater variation.

However, REBA evaluates risk categorically. Consequently, the raw scores were converted to categorical ratings based on the guidelines described on the REBA scoresheet and the inter-rater reliability of the measurements was assessed.

The Fleiss kappa for the categorical scoring was 0.54. According to Landis and Koch (Gore et al., 1996), this is considered moderate agreement.

4. Discussion

This study suggests that there is strong intra-rater reliability among individual observers when they immediately observe the same task twice. However, the results for inter-rater reliability are more complex. The moderate agreement among multiple raters regarding classification of risk categories suggests that categorical classification, rather than raw scores, should be used to classify risk. A similar methodology has been used by Motamedzade et al. (2011) and Jones and Kumar (2010).

The high level of intra-rater reliability suggests that observations made by a single observer are generally comparable. That is, the
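The inter-rater arithmetic of Sections 2.3 and 3.2 (the TEM for unequal numbers of observers per task, and the reliability coefficient R) can be reproduced with a short script. The ratings below are invented for illustration only; they are not the SWEEP data.

```python
import math

# Made-up raw REBA scores for illustration (NOT the SWEEP data).
# Each inner list holds the scores assigned to one task; lists may differ
# in length because not all observers evaluated all tasks (each task
# needs at least two observers for the TEM to be defined).
ratings = [
    [10, 11, 9, 10],   # e.g. toilet cleaning, four observers
    [8, 9, 9],         # e.g. dusting, three observers
    [11, 10, 12, 11],  # e.g. large trash, four observers
]

def tem_inter(tasks):
    """Inter-rater Technical Error of Measurement (Section 2.3 formula)."""
    n = len(tasks)
    total = 0.0
    for scores in tasks:
        k = len(scores)
        # Sum of squares minus (sum)^2 / K_i, divided by K_i - 1, per task.
        total += (sum(y * y for y in scores) - sum(scores) ** 2 / k) / (k - 1)
    return math.sqrt(total / n)

def reliability_coefficient(tasks):
    """R = 1 - TEM^2 / SD^2, where SD^2 is the (sample) variance of all scores."""
    all_scores = [y for scores in tasks for y in scores]
    mean = sum(all_scores) / len(all_scores)
    sd2 = sum((y - mean) ** 2 for y in all_scores) / (len(all_scores) - 1)
    return 1 - tem_inter(tasks) ** 2 / sd2
```

With R in hand, 1 − R is read directly as the proportion of total variance attributable to measurement (inter-rater) error, which is how the study interprets its reported R of 0.41.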
References

Lewis, S.J., 1999. Quantifying Measurement Error. Oxbow Books (for the Osteoarchaeological Research Group).
Manavakun, N. A comparison of OWAS and REBA observational techniques for assessing postural loads in tree felling and processing. [cited 2017 Nov 28]; Available from: https://www.formec.org/images/proceedings/2014/a115.pdf.
Marks, G.C., Habicht, J.P., Mueller, W.H., 1985. Reliability, dependability and precision of anthropometric measurements: the second national health and nutrition examination survey, 1975–1980. Am. J. Epidemiol. 30, 57–87.
McAtamney, L., Hignett, S., 2000. REBA: Rapid Entire Body Assessment. Appl. Ergon. 31, 201–205.
Moskal, B.M., Leydens, J.A., 2000. Scoring rubric development: validity and reliability. Practical Assess. Res. Eval. 7 (10), 71–81.
Motamedzade, M., Ashuri, M.R., Golmohammadi, R., Mahjub, H., 2011. J. Res. Health Sci. JRHS [Internet]. Vol. 11, Journal of Research in Health Sciences. Univ. of Medical Sciences.
Nawi, N.S.M., Deros, B.M., Nordin, N., 2013, December. Assessment of oil palm fresh fruit bunches harvesters working postures using REBA. In: Advanced Engineering Forum, vol. 10. Trans Tech Publications Ltd, pp. 122–127.
Pascual, S.A., Naqvi, S., 2008. An investigation of ergonomics analysis tools used in industry in the identification of work-related musculoskeletal disorders. Int. J. Occup. Saf. Ergon. 14 (2), 237–245.
Ulijaszek, S.J., Kerr, D.A., 1999. Anthropometric measurement error and the assessment of nutritional status. Br. J. Nutr. 82 (3), 165–177.
WHO Multicentre Growth Reference Study Group, de Onis, M., 2006. Reliability of anthropometric measurements in the WHO Multicentre Growth Reference Study. Acta Paediatr. 95, 38–46.