Pzy 159

Review
M. Jakobsson, RPT, PhD, Back in Mo-

Level of Evidence for Reliability, tion Research Group, Department of
Orthopaedics, Institute of Clinical Sci-
Validity, and Responsiveness of ences, University of Gothenburg, Möl-

ndal Hospital, Göteborgsvägen 31, 431
80 Mölndal, Gothenburg, 41326 Swe-
Physical Capacity Tasks Designed to den. Address all correspondence to Dr
Jakobsson at: max.jakobsson@gu.se.
Assess Functioning in Patients With A. Gutke, RPT, PhD, Division of Physio-

therapy, Department of Health and Re-
Low Back Pain: A Systematic Review habilitation, Institute of Neuroscience

and Physiology, University of Gothen-
burg.
Using the COSMIN Standards
Downloaded from https://academic.oup.com/ptj/article/99/4/457/5252375 by guest on 19 January 2021

L.B. Mokkink, PhD, Department of Epi-
demiology and Biostatistics, Amster-
Max Jakobsson, Annelie Gutke, Lidwine B. Mokkink, Rob Smeets, Mari Lundberg dam Public Health Research Institute,
VU University Medical Center, Amster-
dam, the Netherlands.
Background. Physical capacity tasks (ie, observer-administered outcome measures that R. Smeets, MD, PhD, Department
comprise a standardized activity) are useful for assessing functioning in patients with low of Rehabilitation Medicine, Research
back pain. School of CAPHRI, Maastricht Univer-
sity, Maastricht, the Netherlands; and
Purpose. The purpose of this study was to systematically review the level of evidence CIR Revalidatie, Eindhoven, the Nether-
for the reliability, validity, and responsiveness of physical capacity tasks. lands.
M. Lundberg, RPT, PhD, Division
Data Sources. MEDLINE, CINAHL, PsycINFO, Scopus, the Cochrane Library, and rele- of Physiotherapy, Department of
vant reference lists were used as data sources. Health and Rehabilitation, Institute of
Neuroscience and Physiology, Univer-
Study Selection. Two authors independently selected articles addressing the reliabil- sity of Gothenburg; Department of
Orthopaedics, Sahlgrenska University
ity, validity, and responsiveness of physical capacity tasks, and a third author resolved
Hospital, Gothenburg, Sweden; and
discrepancies.
Division of Physiotherapy, Department
of Neurobiology, Care Sciences and
Data Extraction and Quality Assessment. One author performed data extraction, Society (NVS), Karolinska Institutet,
and a second author independently checked the data extraction for accuracy. Two authors Stockholm, Sweden.
independently assessed the methodological quality with the Consensus-Based Standards [Jakobsson M, Gutke A, Mokkink LB,
for the Selection of Health Measurement Instruments (COSMIN) 4-point checklist, and a Smeets R, Lundberg M. Level of ev-
third author resolved discrepancies. idence for reliability, validity, and re-
sponsiveness of physical capacity tasks
Data Synthesis and Analysis. Data synthesis was performed by all authors to de- designed to assess functioning in pa-
termine the level of evidence per measurement property per physical capacity task. The tients with low back pain: a systematic
5-repetition sit-to-stand, 5-minute walk, 50-ft (∼15.3-m) walk, Progressive Isoinertial Lifting review using the COSMIN standards.
Phys Ther. 2019;99:457–477.]
Evaluation, and Timed “Up & Go” tasks displayed moderate to strong evidence for positive

C The Author(s) 2018. Published by
ratings of both reliability and construct validity. The 1-minute stair-climbing, 5-repetition
sit-to-stand, shuttle walking, and Timed “Up & Go” tasks showed limited evidence for Oxford University Press on behalf of
the American Physical Therapy Asso-
positive ratings of responsiveness.
ciation. This is an Open Access ar-
ticle distributed under the terms of
Limitations. The COSMIN 4-point checklist was originally developed for patient- the Creative Commons Attribution Li-
reported outcome measures and not physical capacity tasks. cense (http://creativecommons.org/li
censes/by/4.0/), which permits unre-
Conclusions. The 5-repetition sit-to-stand, 50-ft walk, 5-minute walk, Progressive Isoin- stricted reuse, distribution, and repro-
ertial Lifting Evaluation, Timed “Up & Go,” and 1-minute stair-climbing tasks are promising duction in any medium, provided the
tests for the measurement of functioning in patients with chronic low back pain. However, original work is properly cited.
more research on the measurement error and responsiveness of these tasks is needed to Published Ahead of Print:
be able to fully recommend them as outcome measures in research and clinical practice. December 18, 2018
Accepted: October 23, 2018
Submitted: July 15, 2018
2019 Volume 99 Number 4 Physical Therapy 457

Systematic Review of Physical Capacity Tasks
R
esearch suggests that low back pain (LBP) causes The objective of this study was to systematically review
more years lived with disability than any other the level of evidence of the measurement properties of
condition.1 Many patients with acute LBP recover physical capacity tasks that are designed to assess
quickly, but the pain becomes a chronic condition for functioning in patients with LBP.
some, with significant consequences related to restricted
functioning.2 The measurement of functioning is complex Methods
and covers multidimensional constructs.3 Functioning is The study protocol was registered at the International
divided into 3 domains according to the International Prospective Register of Systematic Reviews (PROSPERO;
Classification of Functioning, Disability, and Health (ICF). http://www.crd.york.ac.uk/PROSPERO; registration
The body functions and structures domain refers to number: CRD42016042011). We followed the working
psychological and physiological processes and anatomical procedure developed by the Consensus-Based Standards
structures; the activity domain refers to people’s ability to for the Selection of Health Measurement Instruments

perform activities in their daily life; and the participation (COSMIN) for conducting a systematic review of
domain describes people’s interactions with their measurement properties.23 The systematic review was
sociocultural environment (eg, when buying groceries, reported according to the Preferred Reporting Items for
participating in sports, or working).3 Researchers have Systematic Reviews and Meta-Analyses (PRISMA).26
argued for the use of outcome measures that specifically
measure functioning in the activity domain.4–6 Data Sources and Searches
The electronic data sources were MEDLINE (through the
The ICF activity domain is typically measured with Ovid interface), CINAHL (EBSCOhost), PsycINFO (Ovid),
patient-reported outcome measures (PROMs), with which Scopus (Elsevier), and the Cochrane Library (Wiley). The
patients rate their perceived ability to perform various reference lists of the articles from the electronic data
activities in their usual environment.7–11 PROMs are sources that were identified for full-text reading were used
relatively time saving, easy to administer, and, importantly, as additional data sources.
include the patient’s perspective.12 However, PROMs as
used in LBP research have shown both floor and ceiling A search strategy was developed in collaboration with 2
effects as well as low- to very-low-quality evidence for medical librarians who also performed the search
content validity.13,14 Clinical experience and scientific (eAppendix 1, available at https://academic.oup.com/ptj).
evidence also indicate frequent discrepancies between The search strategy included 4 main filters, specifically
how patients score PROMs and how they actually move tailored for each electronic information source: target
and perform activities when observed in the clinic.15–17 population (ie, LBP), the construct of interest (ie,
Several authors5,6,16–18 have therefore recommended the capacity), type of outcome measures (ie, physical capacity
use of physical capacity tasks that measure the capacity tasks), and article type (ie, articles on the reliability,
qualifier of the ICF activity domain,5,6,16–18 defined as “the validity, and responsiveness of outcome measures). The
ability to execute a task or an action in a standardized search strategy was based on a preliminary search strategy
environment.”3 In a physical capacity task, a patient as described in the PROSPERO protocol. Improvements of
performs a standardized activity that is administered by an the preliminary search strategy included: (1) adaptation of
observer using predefined criteria such as repetition a validated search filter for measurement properties;27 (2)
counts or task timing.18 addition of the specific names of all physical capacity
tasks identified by the preliminary search strategy; (3) use
Physical capacity tasks are designed to assess what of the Ovid interface rather than PubMed for MEDLINE
patients actually can do rather than what they think they due to the benefit of positional operators; and (4) the
can do, and can therefore capture important information addition of supplementary search terms for LBP. The first
about functioning that PROMs often do not.17,19,20 Physical search with the final search strategy was performed on
capacity tasks also appear to be less influenced by May 11, 2017 (no limitations for publication period) and
education level and language skills compared with an updated search was performed on August 29, 2018
PROMs.17,18,21,22 (publication period from January 1, 2016). No restrictions
were applied for language. Two authors (M.J. and either
All outcome measures should have sufficient support for A.G. or M.L.) independently performed a hand search of
their reliability, validity, and responsiveness in order to be the reference lists of all articles that had been identified
used in research or in clinical practice. Otherwise, there is for full-text reading.
a significant risk of imprecise or biased results that might
lead to incorrect conclusions regarding the patient’s level Study Selection
of functioning.23–25 However, to our knowledge, no study Two authors (M.J. and either A.G. or M.L.) independently
has yet summarized the level of evidence of measurement screened the titles and abstracts of articles from the
properties of physical capacity tasks for patients with LBP. electronic and hand searches. A third author (R.S.) was
consulted to resolve the disagreement if consensus could
not be reached on the eligibility of articles for full-text
458 Physical Therapy Volume 99 Number 4 2019

review. Two authors (M.J. and A.G., R.S., or M.L.) then Two authors (M.J. and M.L.) independently assessed all of
independently reviewed the full-text articles for eligibility, the included studies for methodological quality using the
and a third author (A.G., R.S., or M.L.) was consulted if COSMIN 4-point checklist,31,32 and a third author (A.G.,
needed for consensus. R.S., or L.B.M.) was consulted if needed to reach
consensus. The COSMIN 4-point checklist is specifically
Articles that met 4 criteria (target population, construct, designed to determine methodological quality scores of
outcome measure, and article type) were included. studies of measurement properties. The scores (excellent,
good, fair, or poor) are assigned for each measurement
Target population. Patients in the study population property using the “worst score counts method.”31,32 The
were at least 18 years old and had had LBP for 6 weeks or item for sample size in the checklist was excluded from
more.28 Articles that contained pregnant participants or the “worst score counts method” as recommended when
participants with fibromyalgia, confirmed rheumatic performing systematic reviews of measurement properties

diseases, infections, tumors, osteoporosis, fractures, of physical capacity tasks.33 The rating authors (M.J. and
structural deformities (eg, scoliosis), or cauda equina M.L.) pretested and then discussed the ratings of the
syndrome were excluded unless data were presented checklist with a third author (L.B.M.) to achieve
specifically for patients who met the eligibility criteria. consistency in scoring, as recommended to ensure the
validity of the scoring method.34
Construct. The test was a measure of “capacity” under Data Synthesis and Analysis
the ICF activity domain, defined as the ability to execute a A “best-evidence synthesis” approach was performed by
task or an action in a standardized environment.3 consensus of all authors to synthesize the level of
evidence of measurement properties per physical capacity
Outcome measure. The test was a physical capacity task.35 This procedure is similar to the Grading of
task, defined as a standardized test that is used for an Recommendations Assessment, Development, and
evaluative purpose and that is administered by an Evaluation (GRADE),36 but different in that the
observer, includes an activity (as classified by the ICF) that best-evidence synthesis determines the level of evidence
is performed in a standardized setting, and requires for the results of measurement properties, rather than the
readily available, low-cost, and portable equipment. results of clinical trials.23 First, the results of each
Articles that exclusively investigated test batteries and did measurement property for each physical capacity task
not present results for individual physical capacity tasks were rated as “positive,” “negative,” or “indeterminate”
were excluded. If an article cited an original test manual according to result rating criteria accepted with consensus
that could then not be obtained, the test was excluded. in an international Delphi study (Tab. 1).30 Then, the level
of evidence for the ratings of measurement properties was
determined on the basis of the consistency of the result
Article type. The article presented original data
ratings of the measurement properties, the sample size of
reporting the reliability (including reliability, measurement
the combined eligible articles, and the methodological
error, and internal consistency), validity (including content
quality of the articles. Multiple articles were combined if
validity, construct validity, and criterion validity), or
they concerned the same physical capacity task and
responsiveness of physical capacity tasks.29
included samples with comparable characteristics. The
possible levels of evidence were assigned according to 5
Data Extraction and Quality Assessment criteria (strong, moderate, limited, unknown, and
A pilot version of a data extraction form was developed by conflicting) used in previous systematic reviews of the
the first author (M.J.) and piloted by 3 authors (M.J., R.S., reliability, validity, and responsiveness of physical capacity
and M.L.) on 5 randomly selected, included articles. The tasks.33,35
data extraction form was modified in response to the pilot
to include the following 6 data items: (1) patient sample Strong. This criterion encompassed consistent result
characteristics; (2) eligibility criteria; (3) setting; (4) ratings in at least 2 good-quality articles or at least 1
procedure and equipment for performing the physical excellent-quality article, with a total sample size of eligible
capacity tasks; (5) results of the measurement properties, articles equal to or greater than 100.
and (6) minimal important change. Minimal important
change is not considered a measurement property in the Moderate. This criterion encompassed consistent result
COSMIN taxonomy29 but was included in the data ratings in at least 2 fair-quality articles or 1 good-quality
extraction form because this measure is necessary to article, with a total sample size of eligible articles equal to
determine the level of evidence for measurement error.30 or greater than 50.
Data from all included articles were subsequently
extracted by 1 author (M.J.) and then independently Limited. This criterion encompassed at least 1 fair-,
checked for accuracy by a second author (A.G., R.S., or good-, or excellent-quality article, with a total sample size
M.L.). of eligible articles of 25 to 49.

Table 1.
Criteria for Result Ratings of Measurement Propertiesa
Measurement Property Ratingb Criterion for Result Rating

Reliability + ICC or weighted κ ≥0.70
? ICC or weighted κ not reported
− Criteria for “+” not met
Measurement error + SDC or LoA < MICc
? MIC not defined

Hypothesis testing for construct validity + 75% of the results in accordance with the hypotheses
? No hypotheses defined
Criterion validity + Correlation with gold standard ≥ 0.70 or AUC ≥ 0.70
? Not all information for “+” reported
Responsiveness + 75% of the results in accordance with the hypotheses or AUC ≥ 0.70
? No hypotheses defined

a
Based on Prinsen et al. AUC = area under the receiver operating characteristic curve; ICC = intraclass correlation coefficient; LoA = limits of
30
agreement; MIC = minimal important change; SDC = smallest detectable change.

b
+ = positive rating; ? = indeterminate rating; − = negative rating.
c
This evidence can come from different studies.
Unknown. This criterion encompassed the following 4 Results

options: indeterminate result ratings, all eligible articles Study Selection and Characteristics
were of poor methodological quality, the total sample size The search of electronic data sources and the hand search
of eligible articles was less than 25, or conflicting result of reference lists resulted in 7900 articles after duplicates
ratings. had been removed (Figure). Of these, 25 articles fulfilled
the eligibility criteria (Tab. 2). The included articles
comprised 18 physical capacity tasks that involved the
Conflicting. This criterion encompassed conflicting following activities: walking (10 tasks), stair-climbing (1
findings. task), lifting (3 tasks), rising from a chair (2 tasks), a
combination of walking/rising from a chair (1 task), and a
If an article lacked a priori hypotheses for validity and combination of walking and carrying (1 task) (Tab. 3).
responsiveness, we extracted what the authors had
expected in relation to those measurement properties One article included patients with a pain duration of at
from the description found in the article. We generated least 6 weeks (“subacute” LBP28 ),37 whereas the rest
our own hypotheses for validity and responsiveness based included patients with a pain duration of at least 12 weeks
on these descriptions, which were then added to the data (“chronic” LBP28 ). Eleven articles included patients with
synthesis (eAppendixes 2 and 3, available at back-related diagnoses known to affect walking capacity
https://academic.oup.com/ptj). severely.15,20,38–46 Of these, 1 article included patients with
lumbar spondylolisthesis and postlaminectomy,38 6 articles
included patients with lumbar spinal stenosis,20,39–43 3
Role of the Funding Source articles included patients with lumbar disk herniation and
This review was supported by AFA Försäkring 120216, the lumbar spinal stenosis,15,44,45 and 1 article included
Eurospine Research Grants 8-2014, the Health and Medical patients with lumbar disk herniation, lumbar spinal
Care Executive Board of the Västra Götaland Region, and stenosis, and spondylolisthesis.46
Vetenskapsrådet 2015-02511. The funders played no role
in the study design, data analysis, interpretation of data, or The included articles investigated 5 measurement
writing of the manuscript. The views expressed here are properties: reliability, measurement error, construct
those of the authors and do not necessarily reflect those of
the funding sources.

12,301 Articles Identified Through 2302 Articles Identified Through 1557 Articles Identified Through
the First Database Search the Second Database Search Hand Search of 362 Reference
[Inception – May 11, 2017] [Jan 1, 2016 – Aug 29, 2018] Lists
7900 Articles After Duplicates Removed

Abstracts of 7900 Articles Screened 7440 Articles Excluded
435 Full-Text Articles Excluded:

460 Full-Text Articles Assessed for Eligibility 1. Original data not included: 201
2. Wrong condition or not fulfilling inclusion criteria for

age: 47
3. Tests not adequately described or original reference

for test procedure not obtained: 14
25 Articles Included in the Data Extraction 4. Wrong construct: 65

(of which, 5 were from hand search)
5. Tests not evaluated by an observer: 15
6. Insufficient standardization/expensive equipment: 18
7. Aim was not to evaluate measurement properties: 69
8. Results only presented for composite score of a test

25 Articles Included in the Methodological battery: 5
Quality Assessment
Full-text articles not found: 1
3 Articles Not Included in the Data Synthesis

22 Articles Included in the Data Synthesis for Any Measurement Property
(poor methodological quality)
Figure.
Flow of information through the different phases of the systematic review.
validity (hypothesis testing), criterion validity, and best-evidence synthesis. The poor scores were due to the
responsiveness. fact that patients received treatment between the first and
second administration of the tasks.
Quality Assessment and Data Synthesis With regard to best-evidence synthesis, a strong level of
evidence for positive ratings for test-retest reliability was
Reliability. The results for reliability are shown in
found for 3 tasks (50-ft [∼15.3-m] walk, 5-minute walk,
Table 4. Regarding methodological quality, 11 articles
and 5-repetition sit-to-stand tasks). A moderate level of
investigated reliability.5,17,37,39,40,42,46–50 Three studies of
evidence for positive ratings for test-retest reliability was
reliability in 1 article48 were rated as poor for
found for 4 tasks (1-minute stair-climbing, Progressive
methodological quality and therefore excluded from the

462

Table 2.
Physical Therapy
Characteristics of Included Studies in Alphabetical Order of First Authors’ Namesa
Patient-
Sample Age, Eligible
Criteria for Setting Back Pain Back Pain Reported
Study Size Mean (SD) Physical
Patient Inclusion (Country) Durationb Intensityb Disability,
(% Women) [Range] Capacity Task
Volume 99
Mean (SD)
Andersson Nonspecific LBP for >3 198 (47) 41.9 (9.1) Tertiary care Median: 24.0 VAS: 48.9 RMDQ: 13.8 (3.6) 1 min of stair climbing,
et al51 (2010) mo (the Netherlands) mo (IQR: (24.7) 5-min walk, 50-ftc walk,
12.0–73.0 PILE, sit-to-stand
mo)
Number 4
Armstrong LBP for ≥6 mo 10 (80) 40.8 (11.3) Tertiary care 77.1 (35.8) VAS: 46 (23) Shuttle walking test
et al52 (2005) (United Kingdom) mo
Campbell LBP for >12 mo, 250 (44) 40 (8.7) Hospitals 7.6 (6.7) y ODI for improved: Shuttle walking test
et al38 (2006) considered for surgical [19–55] (United Kingdom) 42.6 (13.6); ODI
stabilization of the spine for stable: 46.8
(unspecified LBP, (13.7); ODI for
spondylolisthesis, or deteriorated: 51.8
postlaminectomy) (13.5)
Conway Lumbar spinal stenosis 12 (25) 66.3 (9.8) Tertiary care 4.7 (1.5) y VAS: 39.9 ODI: 48.9 (11.6); Self-paced walking test
et al20 (2011) [53–81] (United States) (27.0) QBPDS: 48.2 (16.8)
Deen et al39 Lumbar spinal stenosis 28 (39.3) 73.8 [57–91] Treadmill examination
(2000) (only the test parameter
“total ambulation”)
Gautschi Lumbar disk herniation, 253 (42.3) 58.4 (16.1) Hospitals VAS: 39.1 RMDQ: 11.8 (5.2); Timed “Up & Go” Test
et al15 (2016) lumbar spinal stenosis, (Switzerland) (28.5); VAS ODI: 48.4 (20.8)
and lumbar degenerative for the leg:
disk disease scheduled for 48.8 (29.0)
lumbar spine surgery
(continued)
2019
2019
Table 2.
Continued
Patient-
Mean (SD)
Gautschi Lumbar disk herniation, 136 (44.1) 57.7 (15.8) Hospitals VAS: 45.6 ODI: 45.6 (17.6); Timed “Up & Go” Test
et al45 (2016) lumbar spinal stenosis, (Switzerland) (17.6); VAS RMDQ: 11.3 (5.1)
and lumbar degenerative for the leg:
disk disease scheduled for 5.5 (2.6)
Gautschi Lumbar disk herniation, 100 (43) 56.2 (16.1) Hospitals VAS: 38.0 RMDQ: 12 (5); Timed “Up & Go” Test
et al44 (2017) lumbar spinal stenosis, (Switzerland) (10); VAS for ODI: 46 (18)
and lumbar degenerative the leg: 55.0
disk disease scheduled for (28)
Kahraman Nonspecific LBP for >12 38 (37) 35 (10) 7 (3) mo VAS with ODI: 22.6 (14.2) 30-s chair stand test
et al47 (2016) wk rest: 21.3
(15.1); VAS
with activity:
64.5 (20.1)
Lee et al19 LBP with mechanical, 83 (57.8) 45.64 (10.12) Orthopedic spine 108.2 (127.9) VAS: 36.9 RMDQ: 10.4 (6.1) 5-min walk, 50-ft walk,
(2001) structural, or nonspecific [24–65] clinic (United mo (27.0) 5-repetition sit-to-stand
origins States)
Magnussen LBP for >8 wk 32 (66%) 38.1 (9.2) Tertiary care RMDQ: 10.0 (4.0); Lift test
et al37 (2004) (Norway) FFbH-R: 22.5 (4.9)
Ocarino Nonspecific LBP for ≥3 30 43.16 (11.23) University clinic 42.3 (80.6) RMDQ: 9.9 (3.7) 50-ft walk, sit-to-stand
et al53 (2009) mo (Brazil) mo
Volume 99
Odebiyi LBP for ≥3 mo with 23 [25–65] Two hospitals VAS for men: RMDQ for men: 8.4 50-ft walk, 5-min walk,
et al54 (2007) radiating leg pain (Nigeria) 45.2 (19.4); (5.3); RMDQ for 5-repetition sit-to-stand
VAS for women: 9.6 (4.9)
women: 64.1
Number 4
(22.4)
Pratt et al40 Lumbar spinal stenosis 29 (41) 69 [49–82] Orthopedic hospital ODI: 39.8 (16.5) Shuttle walking test
(2002) (United Kingdom)
(continued)
Physical Therapy
463

464

Table 2.
Continued
Physical Therapy
Patient-
Mean (SD)
Rainville et al41 Lumbar spinal stenosis 50 (42) 68 (7.9) Spine center of a VAS: 61; VAS ODI: 35 Motorized treadmill test,
Volume 99
(2012) [48–86] hospital for the leg: self-paced walking test
(United States) 54
Simmonds Nonspecific, mechanical 44 (63.6) 42.6 Orthopedic clinic 12.4 (20.8) VAS: 27.3 RMDQ: 8.3 (6.0) 1 min of stair climbing,
et al17 (1998) LBP (United States) [1–72] mo (23.9) 5-min walk, 50-ft walk,
Number 4
[0.0–7.2] 50-ft walk (preferred
speed), 5-repetition
sit-to-stand
Smeets Nonspecific LBP for >3 53 (52.8) 43.19 (9.27) Tertiary care 53.4 (67.7) VAS: 44.5 RMDQ: 13.2 (4.2) 1-min of stair climbing,
et al5 (2006) mo (the Netherlands) mo (23.5) 5-min walk, 50-ft walk,
PILE, 5-repetition
sit-to-stand
Soer et al55 LBP for >3 mo 53 (39.6) 38.5 (9.8) 250 (375) wk RMDQ: 9.2 (5.5) PILE
(2006)
Staartjes and Lumbar disk herniation, 157 (49) 49.9 (14.1) Outpatient spine VAS: 5.8 ODI: 43.0 (17.6); 5-repetition sit-to-stand
Schroder46 lumbar spinal stenosis, surgery clinic (the (2.8); VAS for RMDQ: 11.7 (5.3)
(2018) spondylolisthesis, Netherlands) the leg: 7.4
degenerative disk disease (2.0)
Strand LBP for >3 mo 114 (60) 43.9 (10.6) Tertiary care 10 (9.0) y VAS: 52.6 Disability Rating Lift test
et al56 (2002) (Norway) (20.5) Index: 56.2 (13.3)
Strand LBP for >3 mo 98 (49.0) 37.4 (10.4) Tertiary care RMDQ: 11.8 (4.4); 15-m walk test, lift test
et al48 (2011) [18–60] (Norway) FFbH-R: 7.9 (5.3) (modified), PILE
Taylor Mechanical LBP with or 44 (59.2) 48.2 (9.1) Tertiary care (Great ODI: 28.5
et al49 (2001) without sciatica for ≥6 Britain)
mo
(continued)
2019
2019
Table 2.
Continued
Patient-
Mean (SD)
Teixeira da LBP for >3 mo 30 (63.3) 33.0 (10.6) Tertiary care (Brazil) VAS: 44 (26) RMDQ: 9.8 (5.8) 50-ft walk, 5-min walk,
Cunha-Filho 5-repetition sit-to-stand,
et al50 (2010) Timed “Up & Go” Test
Tomkins Lumbar spinal stenosis 45 (57.8) 66.9 (9.6) 10 (11) y; leg Self-paced walking test,
et al42 (2009) pain: 7 (9) y treadmill protocol
Whitehurst Lumbar spinal stenosis 57 (56.1) Men: 69 (5); Treadmill walking test,
et al43 (2001) women: 70 (3) weight-carrying test
a
FFbH-R = Hannover Functional Ability Questionnaire; IQR = interquartile range; LBP = low back pain; ODI = Oswestry Disability Index; PILE = Progressive Isoinertial Lifting Evaluation;
QBPDS = Quebec Back Pain Disability Scale; RMDQ = Roland-Morris Disability Questionnaire; SD = standard deviation; VAS = 100-mm visual analog scale.
b
Reported as mean (SD) [range] unless otherwise indicated.
c
50 ft ≈ 15.3 m.
Volume 99
Number 4
Physical Therapy
465

Table 3.
Brief Descriptions of Included Physical Capacity Tasks
Physical Capacity Task Activity Quantification Measure Equipment Needed

1-min stair climbing Stair climbing No. of stairs climbed in 1 min A flight of stairs with handrails,
stopwatch
30-s chair stand test Rising from a chair No. of repetitions (sitting to standing) performed in Chair, stopwatch
30 s
5-repetition sit-to-stand Rising from a chair Seconds to complete 5 repetitions of sitting to Chair, stopwatch
standing
50-fta walk Walking Seconds to complete a 50-ft course at maximum Measuring tape, stopwatch, markers
speed for indicating track end points

50-ft walk, preferred speed Walking Seconds to complete a 50-ft course at preferred Measuring tape, stopwatch, markers
speed for indicating track endpoints
5-min walk Walking Meters walked in 5 min Measuring tape, stopwatch, markers
for indicating track end points
Lift test Lifting Ordinal scale of no. of repetitions of lifting a box with Table, 1.35-kg box with 5-kg sandbag
a sandbag from floor to table and back
Lift test, modified Lifting No. of repetitions of lifting a box from floor to table Table, 1.35-kg box with 5-kg sandbag
and back in 1 min (4 kg for women)
Motorized treadmill test Walking Total walking time and distance walked at preferred Treadmill, stopwatch
speed (nonmodifiable during the test) at the moment
when walking-related symptoms make the
participant stop
Progressive isoinertial lifting Lifting Weight (in kg) of the box during the last completed Standardized box, an assortment of
evaluation cycle48 /no. of completed lifting cycles5,51 weights
Self-paced walking test Walking Total walking time, speed, and distance walked at the Distance instrument, stopwatch
moment when walking-related symptoms make the
participant stop
Shuttle walking test Walking Meters walked until the participant fails to complete a Standardized audiotape, measuring
predefined “shuttle” in the time allocated tape, markers for indicating track end
points
Timed “Up & Go” Test Rising from a chair Seconds to complete rising from a chair, walking 3 m, Chair, stopwatch
and walking turning around, walking back to the chair, and sitting
down
Treadmill examination, 1.2 Walking Total time walked at 1.2 mph on a treadmill at the Treadmill, stopwatch
mphb moment when walking-related symptoms make the
participant stop (time limit: 15 min)
Treadmill examination, Walking Total time walked at preferred speed on a treadmill at Treadmill, stopwatch
preferred speed the moment when walking-related symptoms make
the participant stop (time limit: 15 min)
Treadmill protocol Walking Total distance, time, and average speed walked on a Treadmill, stopwatch
treadmill at preferred speed (modifiable during the
test) at the moment when symptoms of lumbar
spinal stenosis or other reasons make the participant
stop (time limit: 30 min)
Treadmill walking test Walking Distance walked at 53.6 m/min on a treadmill at the Treadmill, stopwatch, heart rate
moment when pain or fatigue makes the participant monitor
stop or when 70% of the predicted maximum heart
rate (70[220 – age]/100) is reached
Weight-carrying test Walking and Time needed to walk 20 m as quickly as possible Stopwatch, an assortment of
carrying while carrying dumbbells equaling 10% of the dumbbells
person’s weight
a
50 ft ≈ 15.3 m.
b
1.2 miles ≈ 1.9 km.

2019
Table 4.
COSMIN Methodological Quality Ratings, Result Ratings, and Level of Evidence for Reliability and Measurement Error as Well as Minimal Important Change
Per Physical Capacity Taska
Reliability Measurement Error
Physical Study Sample Best-Evidence Best-Evidence

Study Result Result MICc
Capacity Task Designb Size COSMIN Synthesis: COSMIN Synthesis:
Ratings Ratings
Score32 Level of Score32 Level of
(+/?/−)30 (+/?/−)30
Evidence Evidence
1-min stair Andersson et al51 Test-retest 134 Test-retest: Poord SDC: 28.6 Test-retest: 14.5
climbing (2010) moderate (+) stepse (−) moderate (−) steps
Smeets et al5 Test-retest 53 Good ICC(1,1): Good LoA: ±14.7

(2006) 0.96 (+) steps (−)
(95% CI:
0.93–0.98)
30-s chair stand Kahraman et al47 Intrarater 38 Goodf ICC(2,1): Intrarater: limited No information
test (2016) 0.94 (+) (+)
(95% CI:
0.89–0.97)
5-repetition Andersson et al51 Test-retest 134 Test-retest: strong Poord SDC: 11.6 se Test-retest: strong −4.1 s
sit-to-stand (2010) (+); interrater: (−) (−); interrater:
unknown unknown
Simmonds et al17 Test-retest 44 Goodf ICC(1,k): Goodf SDC: 6.38 se

(1998) (participants 0.89 (+) (−)
tested on
different days)
Simmonds et al17 Test-retest 44 Fair ICC(1,1): Fair SDC: 18.46 se

(1998) (participants 0.45 (−) (−)
Volume 99
tested on the
same day)
Simmonds et al17 Interrater 22 Fair ICC(1,1): SDC: 1.14 se

(1998) 0.99 (+) (+)
Number 4
Smeets et al5 Test-retest 53 Good ICC(1,1): Good LoA: ±/7.6 s
(2006) 0.91(+) (−)
(95% CI:
0.84–0.94)
(continued)
Physical Therapy
467

468

Table 4.
Continued
Physical Therapy
Ratings Ratings
(+/?/−)30 (+/?/−)30
Evidence Evidence
Staartjes and Test-retest 66 Fair 0.97 (95% SDC: 4.1 se
Volume 99
Schroder46 (2018) CI:
0.94–0.98)
Teixeira da Test-retest 30 Fair ICC(2,1): Fair

Cunha-Filho et al50 0.99 (+)
Number 4
(2010)
50-ftb,g walk Andersson et al51 Test-retest 133 Test-retest: strong Poord SDC: 3.0 se Test-retest: strong −0.7 s
(2010) (+); interrater: (−) (−); interrater:
unknown unknown
Simmonds et al17 Test-retest 44 Goodf ICC(1,1): Goodf SDC: 4.82 se

(1998) (participants 0.80 (+) (−)
tested on
different days)

(1998) (participants 0.99 (+) (−)
tested on the
same day)
Simmonds et al17 Interrater 22 Fairf ICC(1,1): Fairf SDC: 0.61 s

(1998) 0.99 (+) (+)
Smeets et al5 (2006) Test-retest 52 Good ICC(1,1): Good LoA: ±3.9 s

0.76 (+) (−)
(95% CI:
0.61–0.85)
Strand et al48 (2011) Test-retest 9 Poord ICC(2,1): Poord SDC: 0.6 s (+)
0.77 (+)
(95% CI:
0.24–0.94)
(continued)
2019
2019
Table 4.
Continued

Ratings Ratings
(+/?/−)30 (+/?/−)30
Evidence Evidence
Teixeira da Test-retest 30 Fair ICC(2,1):
(2010)
50-ft walkg , Simmonds et al17 Test-retest 44 Good ICC(1,1): Test-retest: Goodf SDC: 7.10 se Unknown
preferred walking (1998) (participants 0.64 (−) conflicting; (?)
speed tested on interrater:
different days) unknown

(1998) (participants 0.95 (+) (?)
tested on the
same day)
Simmonds et al17 Interrater 22 Fairf ICC(1,1): Fair SDC: 1.52 se

(1998) 0.98 (+) (?)
5-min walk Andersson et al51 Test-retest 132 Test-retest: strong Poord SDC: 105.1 Test-retest: 21.4 m
(2010) (+) me (−) moderate (−)
Simmonds et al17 Test-retest 44 Goodf ICC(1,1): Goodf SDC: 100.3

(1998) (participants 0.99 (+) me (−)
tested on
different days)
Smeets et al5 (2006) Test-retest 53 Good ICC(1,1): Good LoA: ±82.7 m

0.89 (+) (?)
Volume 99
(95% CI:
0.81–0.93)

(2010)
Number 4
Lift test Magnussen37 (2004) Interrater 32 Goodf κ: 1.00 (+) Interrater, subacute No information
(95% CI: LBP: limited (+);
1.00–1.00) test-retest,
subacute LBP:
limited (−)
(continued)
Physical Therapy
469

470
Table 4.

Continued

Ratings Ratings
(+/?/−)30 (+/?/−)30
Evidence Evidence
Physical Therapy
Magnussen37 (2004) Test-retest 28 Goodf κ: 0.55 (−)
(95% CI:
0.51–0.59)
Lift test, modified Strand et al48 (2011) Test-retest 9 Poord ICC(2,1): Test-retest: Poord SDC: 6.5 Unknown 2.5
Volume 99
0.87 (+) unknown lifts/min (−) lifts/min
(95% CI:
0.50–0.97)
Progressive Andersson et al51 Test-retest 126 Test-retest: Poord SDC: 4.2 Test-retest: 1.5
isoinertial lifting (2010) moderate (+) stagese (−) moderate (−) stages
Number 4
evaluation
Smeets et al5 (2006) Test-retest 50 Good ICC(1,1): Good LoA: ±2

0.92 (+) stages (?)
(95% CI:
0.87–0.96)
Strand et al48 (2011) Test-retest 9 Poord ICC(2,1): Poord SDC: 8.4 kg

0.91 (+) (?)
(95% CI:
0.65–0.98)
Self-paced walking Tomkins et al42 Test-retest 33 Fair ICC(3,1): Test-retest, LSS: No information
test (2009) 0.98 (+) limited (+)
Shuttle walking Armstrong et al52 Test-retest Test-retest: limited Goodf LoA: ±22 m Unknown
test (2005) (+); test-retest, (?)
LSS: limited (+)
Pratt et al40 (2002) Test-retest 29 Fairf ICC(?,?):

0.92 (+)
Taylor et al49 (2001) Test-retest 44 Fairf ICC(1,1): Fair LoA: ±49.5 m

0.99 (+) (?)
Timed “Up & Go” Gautschi et al44 Test-retest 18–62 Test-retest: Poord Average SDC: Test-retest:
Test (2017) moderate (+); 4.9 s (?) unknown; LDH/LSS/
interrater: test-retest, DDD:
unknown; LDH/LSS/DDD: 3.4 s
test-retest, unknown
LDH/LSS/DDD:
unknown
(continued)
2019
2019
Table 4.
Continued

Ratings Ratings
(+/?/−)30 (+/?/−)30
Evidence Evidence
Simmonds et al17 Test-retest 44 Goodf ICC(1,k): Goodf SDC: 2.74 se
(1998) (participants 0.98 (+) (?)
tested on
different days)

(1998) (participants 0.98 (+) (?)
tested on the
same day)
Simmonds et al17 Interrater 22 Fairf ICC(1,1): Fairf SDC: 0.80 se

(1998) 0.99 (+)

(2010)
Treadmill Deen et al39 (2000) Test-retest 28 Fairf CCC: 0.90 Test-retest, LSS: No information
examination, 1.2 (+) limited (+)
mi/hh
Treadmill Deen et al39 (2000) Test-retest 28 Fairf CCC: 0.96 Test-retest, LSS: No information
examination, (+) limited (+)
preferred speed
a
CCC = concordance correlation coefficient (comparable to ICC67 ); COSMIN = Consensus-Based Standards for the Selection of Health Measurement Instruments; DDD = lumbar degenerative disk
Volume 99
disease; ICC = intraclass correlation coefficient; LBP = low back pain; LDH = lumbar disk herniation; LoA = limits of agreement; LSS = lumbar spinal stenosis; MIC = minimal important change;
SDC = smallest detectable change. + = positive rating: ICC or weighted κ ≥ 0.70 (reliability), or SDC or LoA < MIC (measurement error); ? = indeterminate rating: ICC or weighted κ not reported
(reliability) or MIC not defined (measurement error); − = negative rating: criteria for “+” not met.
b
Intrarater reliability = when presenting repeatedly the same observations to 1 observer; interrater reliability = when presenting the same observations to 2 or more observers; test-retest reliabil-
ity = when presenting the same task to the same subjects 2 or more times.68
c
The MIC (defined as “the smallest change in score that is perceived as important by patients”23 ) was added for reference because this value is needed to determine the result ratings for measurement
Number 4
error.66
d
Not included in the data synthesis because of the poor methodological score.23
e
The SDC (defined as “the smallest change that can be detected by the measurement instrument,
√ beyond measurement error”23 ) was calculated23 by the authors of the present systematic review by
multiplying the standard error of measurement presented in the original article by 1.96 × 2.
f
The rating changed after removal of the sample size item from the “worst score counts” summary in the COSMIN 4-point checklist.
g
50 ft ≈ 15.3 m.
h
1.2 miles ≈ 1.9 km.
Physical Therapy
471

Isoinertial Lifting Evaluation, and Timed “Up & Go” tasks). affect walking capacity (Timed “Up & Go” task, motorized
Four tasks were specifically evaluated for patients with treadmill test, treadmill walking test, and weight-carrying
back-related diagnoses known to severely affect walking test).
capacity (self-paced walking test, shuttle walking test, and
treadmill examination at 1.2 mph [∼1.9 km/h] and at Criterion validity. The results for criterion validity are
preferred speed), and all of these had limited evidence for shown in eTable 1.
positive ratings for test-retest reliability. Intrarater
reliability was assessed for 1 task (30-second chair stand With regard to methodological quality, 1 article
test), which had limited evidence for a positive rating. investigated criterion validity42 and was rated as good for
Most of the tasks investigated for interrater reliability methodological quality and was therefore included in the
displayed unknown evidence due to small sample sizes. best-evidence synthesis.

Measurement error. The results for measurement error As regards best-evidence synthesis, the treadmill protocol
are shown in Table 4. displayed limited evidence for a positive rating for
criterion validity for patients with lumbar spinal stenosis.
Regarding methodological quality, 8 articles investigated
measurement error.5,17,44,46,48,49,51,52 Nine studies of Responsiveness. The results for responsiveness are
measurement error in 3 articles44,48,51 were rated as poor shown in eTable 2.
and therefore excluded from the best-evidence synthesis.
The poor scores were due to the fact that patients in the Regarding methodological quality, 7 articles investigated
articles received treatment between the first and second responsiveness.38,41,48,49,51,56,57 All studies of
administration of the tasks. responsiveness in the articles received either fair or good
scores and were therefore included in the best-evidence
With regard to best-evidence synthesis, a moderate level of synthesis.
evidence for negative ratings for measurement error was
displayed by the 1-minute stair-climbing, 5-minute walk, With regard to best-evidence synthesis, a limited level of
and Progressive Isoinertial Lifting Evaluation tasks. A evidence for positive ratings for responsiveness was found
strong level of evidence for negative ratings was displayed for 3 tasks (1-minute stair-climbing task, shuttle walking
by 50-ft walk and 5-repetition sit-to-stand tasks. The test, and 5-repetition sit-to-stand task). A moderate level of
negative ratings were a consequence of failure to meet the evidence for negative ratings was found for 3 tasks (50-ft
criterion stating that the smallest detectable change must walk task, modified lift test, and Progressive Isoinertial
be smaller than the minimal important change.30 The level Lifting Evaluation). A limited level of evidence for negative
of evidence was unknown for 4 tasks (50-ft walk ratings was found for 4 tasks (5-minute walk task, lift test,
[preferred speed] task, modified lift test, shuttle walking motorized treadmill test, and self-paced walking test). A
test, and Timed “Up & Go” task), because the minimal limited level of evidence for a positive rating for
important change was not investigated. responsiveness was found for 1 task (Timed “Up & Go”
task) that was specifically for patients with back-related
Construct validity (hypothesis testing). The results for diagnoses known to severely affect walking capacity.
construct validity (hypothesis testing) are shown in
eTable 1 (available at https://academic.oup.com/ptj). Summary of the level of evidence. A summary of the
level of evidence is shown in Table 5.
With regard to methodological quality, 12 articles
investigated construct validity.17,19,20,41,43,44,47,48,50,53–55 Five The 5-repetition sit-to-stand task was the only physical
studies of construct validity in 2 articles53,54 were rated as capacity task that displayed positive ratings for more than
poor and therefore excluded from the best-evidence 2 measurement properties: test-retest reliability (strong
synthesis. The poor scores were due to an absence of a evidence), construct validity (moderate evidence), and
priori validity hypotheses (such as the hypothesized responsiveness (limited evidence). The 50-ft walk,
correlation between 2 outcome measures required for 5-minute walk, Progressive Isoinertial Lifting Evaluation,
adequate validity). and Timed “Up & Go” tasks displayed moderate to strong
evidence for positive ratings for both test-retest reliability
With regard to best-evidence synthesis, a moderate level of and construct validity. The above-mentioned tasks,
evidence for positive ratings of construct validity was however, also displayed moderate to strong evidence for
found for 6 tasks (50-ft walk task, 5-minute walk task, negative ratings for measurement error. The 1-minute
modified lift test, Progressive Isoinertial Lifting Evaluation, stair-climbing task and shuttle walking test displayed
5-repetition sit-to-stand task, and Timed “Up & Go” task). positive ratings for responsiveness (limited evidence) as
Limited evidence for positive ratings for construct validity well as for test-retest reliability (moderate evidence for
was found for 4 walking tasks investigated specifically for 1-minute stair-climbing task and limited evidence for
patients with back-related diagnoses known to severely shuttle walking test).

Table 5.
Overview of the Level of Evidence Per Measurement Property Per Physical Capacity Taska
Physical Test-Retest Interrater Intrarater Measurement Hypothesis Criterion

Responsiveness
Capacity Task Reliability Reliability Reliability Error Testing Validity
1-min stair climbing Moderate (+) 0 0 Moderate (−) 0 0 Limited (+)
30-s chair stand 0 0 Limited (+) 0 Limited (+) 0 0
5-repetition Strong (+) Unknown 0 Strong (−) Moderate (+) 0 Limited (+)
sit-to-stand
50-ftb walk Strong (+) Unknown 0 Strong (−) Moderate (+) 0 Moderate (−)
50-ft walk, preferred Conflicting Unknown 0 Unknown Conflicting 0 0

speed

5-min walk Strong (+) 0 0 Moderate (−) Moderate (+) 0 Limited (−)
Lift test Limited (−) Limited (+) 0 0 0 0 Limited (−)
Lift test, modified Unknown 0 0 Unknown Moderate (+) 0 Moderate (−)

c
Motorized treadmill 0 0 0 0 Limited (+)c 0 Limited (−)
test
Progressive isoinertial Moderate (+) 0 0 Moderate (−) Moderate (+) 0 Moderate (−)
lifting evaluation
Self-paced walking Limited (+)c 0 0 0 Conflicting c 0 Limited (−)c

test
Shuttle walking test Limited (+), 0 0 Unknown 0 0 Limited (+), Limited (+)c
Limited (+)c
Timed “Up & Go” Moderate (+) Unknown 0 Unknown Moderate (+), Limited 0 Limited (+)c
Test (+)c
Treadmill Limited (+)c 0 0 0 0 0 0

examination, 1.2
mphd
Treadmill Limited (+)c 0 0 0 0 0 0

examination,
preferred speed
Treadmill protocol 0 0 0 0 0 Limited (+)c 0
Treadmill walking test 0 0 0 0 Limited (+)c 0 0
Weight-carrying test 0 0 0 0 Limited (+)c 0 0

a
+ = positive result rating; − = negative result rating; 0 = no information
b
50 ft ≈ 15.3 m.
c
The level of evidence marked in italics primarily concerns patients with back-related diagnoses known to severely affect walking capacity (eg, lumbar spinal stenosis,
spondylolisthesis).
d
1.2 miles ≈ 1.9 km.
Of the walking tasks that were specifically investigated for confident that the scores of these tasks have a high degree
patients with diagnoses known to severely affect walking of reliability and are likely to measure the constructs they
capacity, the Timed “Up & Go” task and shuttle walking are designed to measure.23 A previous systematic review
test were the only tasks that displayed positive ratings for investigated the reliability of physical capacity tasks, but
more than 1 measurement property. not the validity and responsiveness.58 That review
reported similar results for reliability, despite a different
method for determining the level of evidence.58
Discussion
To our knowledge, this is the first study to determine the
The 1-minute stair climbing task, 5-repetition sit-to-stand
level of evidence for the reliability, validity, and
task, shuttle walking test, and Timed “Up & Go” displayed
responsiveness of physical capacity tasks designed to
limited evidence for positive ratings for responsiveness.
assess functioning for patients with LBP. The 5-repetition
These results suggest that these tasks appear to have the
sit-to-stand, 50-ft walk, 5-minute walk, Progressive
ability to detect changes in functioning after a treatment,
Isoinertial Lifting Evaluation, and Timed “Up & Go” tasks
although the evidence is limited. In contrast, most
displayed moderate to strong evidence for positive ratings
included tasks that were investigated for responsiveness
for both test-retest reliability and construct validity. These
displayed limited to moderate evidence for negative
results suggest that a clinician or researcher can be

ratings. An important aspect of the definition of evidence that the heterogeneity would affect the results of
responsiveness in the COSMIN taxonomy is the ability to measurement properties. Not least, this is seen in the
detect change over time specifically in the construct to be almost unanimous positive results for test-retest reliability
measured.29 However, most of the included articles and construct validity regardless of the characteristics of
investigated responsiveness by comparison with global the study populations. A subset of the results is, however,
rating scales of constructs such as “LBP-associated primarily generalizable to patients with back-related
disability,”51 “general health improvements,”38 and “return diagnoses known to severely affect walking capacity such
to work,”56 all constructs that are different from those as lumbar spinal stenosis and spondylolisthesis (see the
found in the physical capacity tasks. In contrast, several level of evidence marked in italics in Tab. 5).62–64 In the
authors have recommended using global rating scales of data synthesis of walking tasks, we therefore did not
the same constructs as the outcome measures of interest.23 combine studies of such patients with other studies.
For instance, the responsiveness of a walking task could

preferably be investigated by comparing the results with a Predominantly positive ratings for single measurement
global rating scale concerning walking ability.23 Future properties were found for walking tasks specifically
high-quality studies with a priori hypotheses that include investigated for patients with back-related diagnoses
global rating scales of capacity could show clearer known to affect walking capacity severely. However, only
evidence for the responsiveness of physical capacity tasks. the Timed “Up & Go” task and shuttle walking test
displayed positive ratings for more than 1 measurement
A moderate to strong level of evidence for negative ratings property. This result appears to be due to a lack of studies
for measurement error was displayed by 5 tasks because investigating measurement properties with the same test
the smallest detectable change was larger than the minimal protocol. For instance, 5 treadmill walking tasks with
important change. Similar findings have been reported similar protocols20,41–43 were evaluated for patients with
previously for PROMs of self-efficacy and health-related lumbar spinal stenosis, but these protocols were
quality of life.59,60 The smallest detectable change (defined considered too heterogeneous for the results to be
as “the smallest change that can be detected by the combined in the data synthesis. Physical capacity tasks
measurement instrument, beyond measurement error”23 ) investigated in future research might show a stronger level
should be smaller than the minimal important change of evidence considering the promising results reported in
(defined as “the smallest change in score that is perceived these articles.
as important by patients”23 ) in order to detect changes that
are as small as the minimal important change.61 The data Strengths and Limitations
of the systematic review suggest that the smallest A strength of the present review was the use of the
detectable change of the investigated tasks appears to be COSMIN 4-point checklist for assessing the methodological
too large to distinguish minimal important change from quality of the included articles.32 To our knowledge, the
measurement error. Nevertheless, all negative ratings of checklist is the most widely used consensus-based quality
measurement error were due to comparisons with minimal assessment tool specifically developed for studies on
important change values determined for 5 tasks in a single measurement properties. During the course of this
article.51 That article investigated minimal important systematic review, the COSMIN checklist and the data
change values with a global rating scale of “LBP-related synthesis procedure were updated.65,66 For instance, poor
disability” rather than the constructs that the physical studies are included in the new data synthesis
capacity tasks measure. In contrast, minimal important procedure.65,66 After weighing the pros and cons we
change is recommended to be compared with similar decided to proceed with the original methodology as in
constructs as those in the outcome measures of interest.61 our PROSPERO study protocol. In summary, we reasoned
Moreover, the COSMIN 4-point scale is also not designed that the changes in the updated COSMIN methodology
to assess the quality of studies that investigate minimal would not change the quality of the analysis. Moreover,
important change32 and therefore the methodological breaching the study protocol would have decreased our
quality of the relevant article was undetermined. trustworthiness in performing the data synthesis.
We argue that the results of the systematic review are Another strength is the clearly defined construct that
generalizable to most individuals with chronic LBP. formed the basis for the selection of articles, ie, “capacity”
Research suggests that a representative sample is required in the ICF activity domain.3 This approach generated more
to generalize the results of measurement properties homogeneous test content, which can make the results of
beyond the research setting.23 The patients in the the systematic review easier to interpret. Capacity denotes
systematic review comprise a heterogeneous sample, a patient’s ability to perform an activity in a standardized
which is partly reflected in the differences in levels of pain environment and is useful to assess and compare patients’
and disability seen between the included studies (Tab. 2). functioning in a standardized way.3 However, it is
Although we acknowledge that such heterogeneity can important to note that results from a physical capacity task
influence the results of physical capacity testing (eg, are not directly transferable to a patient’s environment
meters walked or weight lifted), there seems to be little outside the test conditions, but rather signify the patient’s

potential of performing the activity outside the test Writing: M. Jakobsson, A. Gutke, L.B. Mokkink, R. Smeets, M. Lundberg
conditions. Data collection: M. Jakobsson, M. Lundberg
Data analysis: M. Jakobsson, A. Gutke, L.B. Mokkink, R. Smeets, M. Lundberg
This review also has a few limitations. First, the COSMIN Project management: M. Jakobsson, M. Lundberg
Fund procurement: M. Lundberg
4-point checklist was originally developed for PROMs,32
Providing facilities/equipment: M. Lundberg
which could compromise its validity for assessing physical
Providing institutional liaisons: R. Smeets, M. Lundberg
capacity tasks. However, we argue that the checklist is Consultation (including review of manuscript before submitting): A. Gutke,
indeed appropriate for physical capacity tasks because the L.B. Mokkink, R. Smeets, M. Lundberg
items of the checklist are phrased in a general manner and
The authors of the systematic review would like to acknowledge the medical
draw upon the researchers’ knowledge of the test at hand,
librarians Anders Wändahl and Klas Moberg at Karolinska Institutet
such as the appropriate time interval between the first and
University Library for their valuable help in developing the search strategy
second administrations in a reliability study. The checklist and performing the search.

has also been successfully used in previous systematic
reviews of physical capacity tasks for other patient Funding
groups.33,35
This review was supported by AFA Försäkring 120216, the Eurospine
Research Grants 8-2014, the Health and Medical Care Executive Board of the
Another possible limitation is the exclusion of so-called
Västra Götaland Region, and Vetenskapsrådet 2015-02511.
functional capacity evaluations because we could not
obtain the original test manuals for these tests due to
copyright issues. Functional capacity evaluations are test
Clinical Trial or Systematic Review Registration
batteries commonly used for return-to-work evaluations The study protocol was registered at the International Prospective Register of
and could potentially have added value to this systematic Systematic Reviews (PROSPERO; http://www.crd.york.ac.uk/PROSPERO;
review. However, we could not reliably determine the registration number: CRD42016042011).
quality of the tests because we could not review the
detailed procedures. Also, many of the excluded functional
capacity evaluation articles appeared to focus on either Disclosures
the composite scores of the batteries or constructs such as The authors completed the ICJME Form for Disclosure of Potential Conflicts
“level of effort,” “work capacity,” and “safe lifting,” which of Interest. A. Gutke reported a grant received by her institution from the
would have led to the exclusion of several articles. Swedish Research Council. No other authors reported any conflicts of
interest.
A final possible limitation is that the data extraction was DOI: 10.1093/ptj/pzy159
not performed completely independently, as 1 author
(M.J.) extracted the data and a second author (A.G., M.L.,
or R.S.) double-checked it for accuracy. In contrast, at least References
2 authors independently performed all other 1. Hoy D, March L, Brooks P et al. The global burden of low
back pain: estimates from the Global Burden of Disease 2010
methodological steps of the review. We decided on the study. Ann Rheum Dis. 2014;73:968–974.
current data extraction procedure because of the high 2. Breivik H, Collett B, Ventafridda V, Cohen R, Gallacher D.
accuracy of the first author in the pilot of this Survey of chronic pain in Europe: prevalence, impact on
methodological step. We therefore argue that having a daily life, and treatment. Eur J Pain. 2006;10:287–333.
second author double-checking all data extracted was 3. World Health Organization. The International Classification
of Functioning, Disability and Health (ICF). 2nd ed. Geneva,
almost equivalent to having 2 authors independently Switzerland: World Health Organization; 2001.
performing the data extraction. 4. Gautschi OP, Corniola MV, Schaller K, Smoll NR, Stienen MN.
The need for an objective outcome measurement in spine
Conclusion surgery–the timed-up-and-go test. Spine J.
2014;14:2521–2522.
In conclusion, the 5-repetition sit-to-stand, 5-minute walk,
5. Smeets RJ, Hijdra HJ, Kester AD, Hitters MW, Knottnerus JA.
50-ft walk, Progressive Isoinertial Lifting Evaluation, The usability of six physical performance tasks in a
Timed “Up & Go,” and 1-minute stair-climbing tasks are rehabilitation population with chronic low back pain. Clin
promising physical capacity tasks for the measurement of Rehabil. 2006;20:989–998.
functioning in patients with chronic LBP. However, more 6. Wittink H. Functional capacity testing in patients with
chronic pain. Clin J Pain. 2005;21:197–199.
research on the measurement error and responsiveness of
7. Daltroy LH, Cats-Baril WL, Katz JN, Fossel AH, Liang MH. The
these tasks is needed to be able to fully recommend them North American spine society lumbar spine outcome
as outcome measures in research and clinical practice for assessment instrument: reliability and validity tests. Spine
patients with LBP. (Phila Pa 1976). 1996;21:741–749.
8. Fairbank JC, Pynsent PB. The Oswestry Disability Index.
Spine. 2000;25:2940–2952.
Author Contributions and Acknowledgments 9. Salen BA, Spangfort EV, Nygren AL, Nordemar R. The
Disability Rating Index: an instrument for the assessment of
Concept/idea/research design: M. Jakobsson, A. Gutke, L.B. Mokkink, R. disability in clinical settings. J Clin Epidemiol.
Smeets, M. Lundberg 1994;47:1423–1435.

10. Kopec JA, Esdaile JM, Abrahamowicz M et al. The Quebec nonspecific low back pain in primary care. Eur Spine J.
Back Pain Disability Scale: conceptualization and 2006;15:S169–S191.
development. J Clin Epidemiol. 1996;49:151–161. 29. Mokkink LB, Terwee CB, Patrick DL et al. The COSMIN study
11. Roland M, Morris R. A study of the natural history of back reached international consensus on taxonomy, terminology,
pain. Part I: development of a reliable and sensitive measure and definitions of measurement properties for health-related
of disability in low-back pain. Spine (Phila Pa 1976). patient-reported outcomes. J Clin Epidemiol.
1983;8:141–144. 2010;63:737–745.
12. Food and Drug Administration Center for Drug Evaluation 30. Prinsen CAC, Vohra S, Rose MR et al. How to select outcome
and Research. Guidance for Industry. Patient-Reported measurement instruments for outcomes included in a “Core
Outcome Measures: Use in Medical Product Development to Outcome Set”—a practical guideline. Trials. 2016;17:449.
Support Labeling Claims. Silver Spring, MD: Food and Drug
Administration; 2009. 31. Mokkink L, Terwee C, Patrick D et al. The COSMIN checklist
for assessing the methodological quality of studies on
13. Brodke DS, Goz V, Lawrence BD, Spiker WR, Neese A, Hung measurement properties of health status measurement
M. Oswestry Disability Index: a psychometric analysis with instruments: an international Delphi study. Qual Life Res.
1,610 patients. Spine J. 2017;17:321–327. 2010;19:539–549.

14. Chiarotto A, Ostelo RW, Boers M, Terwee CB. A systematic 32. Terwee C, Mokkink L, Knol D, Ostelo RJG, Bouter L, de Vet
review highlights the need to investigate the content validity HW. Rating the methodological quality in systematic reviews
of patient-reported outcome measures for physical of studies on measurement properties: a scoring system for
functioning in patients with low back pain. J Clin Epidemiol. the COSMIN checklist. Qual Life Res. 2012;21:651–657.
2018;95:73–93.
33. Dobson F, Hinman RS, Hall M, Terwee CB, Roos EM, Bennell
15. Gautschi OP, Smoll NR, Corniola MV et al. Validity and KL. Measurement properties of performance-based measures
reliability of a measurement of objective functional to assess physical function in hip and knee osteoarthritis: a
impairment in lumbar degenerative disc disease: the Timed systematic review. Osteoarthritis Cartilage.
Up and Go (TUG) test. Neurosurgery. 2016;79:270–278. 2012;20:1548–1562.
16. Caporaso F, Pulkovski N, Sprott H, Mannion AF. How well do 34. Mokkink LB, Terwee CB, Gibbons E et al. Inter-rater
observed functional limitations explain the variance in agreement and reliability of the COSMIN (COnsensus-based
Roland Morris scores in patients with chronic non-specific Standards for the selection of health status Measurement
low back pain undergoing physiotherapy? Eur Spine J. Instruments) checklist. BMC Med Res Methodol. 2010;10:82.
2012;21:187–195.
35. Kroman SL, Roos EM, Bennell KL, Hinman RS, Dobson F.
17. Simmonds MJ, Olson SL, Jones S et al. Psychometric Measurement properties of performance-based outcome
characteristics and clinical usefulness of physical measures to assess physical function in young and
performance tests in patients with low back pain. Spine. middle-aged people known to be at high risk of hip and/or
1998;23:2412–2421. knee osteoarthritis: a systematic review. Osteoarthritis
Cartilage. 2014;22:26–39.
18. Harding VR, Williams AC, Richardson PH et al. The
development of a battery of measures for assessing physical 36. Guyatt GH, Oxman AD, Vist GE et al. GRADE: an emerging
functioning of chronic pain patients. Pain. 1994;58:367–375. consensus on rating quality of evidence and strength of
recommendations. BMJ. 2008;336:924–926.
19. Lee CE, Simmonds MJ, Novy DM, Jones S. Self-reports and
clinician-measured physical function among patients with 37. Magnussen L, Strand LI, Lygren H. Reliability and validity of
low back pain: a comparison. Arch Phys Med Rehabil. the back performance scale: observing activity limitation in
2001;82:227–231. patients with back pain. Spine (Phila Pa 1976).
2004;29:903–907.
20. Conway J, Tomkins CC, Haig AJ. Walking assessment in
people with lumbar spinal stenosis: capacity, performance, 38. Campbell H, Rivero-Arias O, Johnston K, Gray A, Fairbank J,
and self-report measures. Spine J. 2011;11:816–823. Frost H. Responsiveness of objective, disease-specific, and
generic outcome measures in patients with chronic low back
21. Guralnik JM, Branch LG, Cummings SR, Curb JD. Physical pain: an assessment for improving, stable, and deteriorating
performance measures in aging research. J Gerontol.
patients. Spine (Phila Pa 1976). 2006;31:815–822.
1989;44:M141–M146.
39. Deen HG, Jr, Zimmerman RS, Lyons MK, McPhee MC,
22. Wand BM, Chiffelle LA, O’Connell NE, McAuley JH, Desouza
Verheijde JL, Lemens SM. Test-retest reproducibility of the
LH. Self-reported assessment of disability and exercise treadmill examination in lumbar spinal stenosis.
performance-based assessment of disability are influenced by
Mayo Clin Proc. 2000;75:1002–1007.
different patient characteristics in acute low back pain. Eur
Spine J. 2010;19:633–640. 40. Pratt RK, Fairbank JC, Virr A. The reliability of the Shuttle
Walking Test, the Swiss Spinal Stenosis Questionnaire, the
23. de Vet HCW, Terwee CB, Mokkink LB, Knol DL. Oxford Spinal Stenosis Score, and the Oswestry Disability
Measurement in Medicine: A Practical Guide. Cambridge, Index in the assessment of patients with lumbar spinal
England: Cambridge University Press; 2011. stenosis. Spine (Phila Pa 1976). 2002;27:84–91.
24. Brakenhoff TB, Mitroiu M, Keogh RH, Moons KGM, 41. Rainville J, Childs LA, Pena EB et al. Quantification of
Groenwold RHH, van Smeden M. Measurement error is often walking ability in subjects with neurogenic claudication from
neglected in medical literature: a systematic review. J Clin lumbar spinal stenosis–a comparative study. Spine J.
Epidemiol. 2018;98:89–97. 2012;12:101–109.
25. Brakenhoff TB, van Smeden M, Visseren FLJ, Groenwold 42. Tomkins CC, Battie MC, Rogers T, Jiang H, Petersen S. A
RHH. Random measurement error: why worry? An example
of cardiovascular risk factors. PLoS One. 2018;13:e0192298. criterion measure of walking capacity in lumbar spinal
stenosis and its comparison with a treadmill protocol. Spine
26. Moher D, Liberati A, Tetzlaff J, Altman DG. Preferred (Phila Pa 1976). 2009;34:2444–2449.
reporting items for systematic reviews and meta-analyses: the
43. Whitehurst M, Brown LE, Eidelson SG, D’Angelo A.
PRISMA statement. PLoS Med. 2009;6:e1000097.
Functional mobility performance in an elderly population
27. Terwee CB, Jansma EP, Riphagen II, de Vet HCW. with lumbar spinal stenosis. Arch Phys Med Rehabil.
Development of a methodological PubMed search filter for 2001;82:464–467.
finding studies on measurement properties of measurement
44. Gautschi OP, Stienen MN, Corniola MV et al. Assessment of
instruments. Qual Life Res. 2009;18:1115–1123.
the minimum clinically important difference in the Timed Up
28. van Tulder M, Becker A, Bekkering T et al. Chapter 3. and Go test after surgery for lumbar degenerative disc
European guidelines for the management of acute disease. Neurosurgery. 2017;80:380–385.

45. Gautschi OP, Joswig H, Corniola MV et al. Pre- and 57. Gautschi OP, Joswig H, Corniola MV et al. Pre- and
postoperative correlation of patient-reported outcome postoperative correlation of patient-reported outcome
measures with standardized Timed Up and Go (TUG) test measures with standardized Timed Up and Go (TUG) test
results in lumbar degenerative disc disease. Acta Neurochir results in lumbar degenerative disc disease. Acta Neurochir
(Wien). 2016;158:1875–1881. (Wien). 2016;158:1875–1881.
46. Staartjes VE, Schroder ML. The five-repetition sit-to-stand test: 58. Denteneer L, Van Daele U, Truijen S, De Hertogh W, Meirte J,
evaluation of a simple and objective tool for the assessment Stassijns G. Reliability of physical functioning tests in
of degenerative pathologies of the lumbar spine. J Neurosurg patients with low back pain: a systematic review. Spine J.
Spine. 2018;29:380–387. 2018;18:190–207.
47. Kahraman T, Ozcan Kahraman B, Salik Sengul Y, Kalemci O. 59. Maughan EF, Lewis JS. Outcome measures in chronic low
Assessment of sit-to-stand movement in nonspecific low back back pain. Eur Spine J. 2010;19:1484–1494.
pain: a comparison study for psychometric properties of
field-based and laboratory-based methods. Int J Rehabil Res. 60. van der Roer N, Ostelo RW, Bekkering GE, van Tulder MW,
2016;39:165–170. de Vet HC. Minimal clinically important change for pain
intensity, functional status, and general health status in
48. Strand LI, Anderson B, Lygren H, Skouen JS, Ostelo R, patients with nonspecific low back pain. Spine (Phila Pa

Magnussen LH. Responsiveness to change of 10 physical tests 1976). 2006;31:578–582.
used for patients with back pain. Phys Ther. 2011;91:404–415. 61. de Vet HC, Terwee CB, Ostelo RW, Beckerman H, Knol DL,
49. Taylor S, Frost H, Taylor A, Barker K. Reliability and Bouter LM. Minimal changes in health status questionnaires:
responsiveness of the shuttle walking test in patients with distinction between minimally detectable change and
chronic low back pain. Physiother Res Int. 2001;6:170–178. minimally important change. Health Qual Life Outcomes.
2006;4:54.
50. Teixeira da Cunha-Filho I, Lima FC, Guimarães FR, Leite HR.
Use of physical performance tests in a group of Brazilian 62. Budithi S, Dhawan R, Cattell A, Balain B, Jaffray D. Only
Portuguese-speaking individuals with low back pain. walking matters—assessment following lumbar stenosis
Physiother Theory Pract. 2010;26:49–55. decompression. Eur Spine J. 2017;26:481–487.
51. Andersson EI, Lin CC, Smeets RJ. Performance tests in people 63. Tomkins-Lane CC, Battie MC. Predictors of objectively
with chronic low back pain: responsiveness and minimal measured walking capacity in people with degenerative
clinically important change. Spine (Phila Pa 1976). lumbar spinal stenosis. J Back Musculoskelet Rehabil.
2010;35:E1559–E1563. 2013;26:345–352.
52. Armstrong M, McDonough S, Baxter D. Reliability and 64. Drury T, Ames SE, Costi K, Beynnon B, Hall J. Degenerative
repeatability of the shuttle walk test in patients with chronic spondylolisthesis in patients with neurogenic claudication
low back pain. Int J Ther Rehabil. 2005;12:438–443. effects functional performance and self-reported quality of
life. Spine (Phila Pa 1976). 2009;34:2812–2817.
53. Ocarino J, Gonçalves G, Vaz D, Cabral A, Porto J, Silva M.
Correlation between a functional performance questionnaire 65. Mokkink LB, de Vet HCW, Prinsen CAC et al. COSMIN Risk of
and physical performance tests among patients with low Bias checklist for systematic reviews of Patient-Reported
back pain. Braz J Phys Ther. 2009;13:343–349. Outcome Measures. Qual Life Res. 2018;27:1171–1179.
54. Odebiyi DO, Kujero SO, Lawal TA. Relationship between 66. Prinsen CAC, Mokkink LB, Bouter LM et al. COSMIN
spinal mobility, physical performance, pain intensity and guideline for systematic reviews of patient-reported outcome
functional disability in patients with chronic low back pain. measures. Qual Life Res. 2018;27:1147–1157.
Nigerian Journal of Medical Rehabilitation. 2007;11:49–54. 67. Lin LI. A concordance correlation coefficient to evaluate
55. Soer R, Poels BJ, Geertzen JH, Reneman MF. A comparison of reproducibility. Biometrics. 1989;45:255–268.
two lifting assessment approaches in patients with chronic 68. Rousson V, Gasser T, Seifert B. Assessing intrarater, interrater
low back pain. J Occup Rehabil. 2006;16:639–646. and test-retest reliability of continuous measurements. Stat
56. Strand LI, Moe-Nilssen R, Ljunggren AE. Back Performance Med. 2002;21:3431–3446.
Scale for the assessment of mobility-related activities in
people with back pain. Phys Ther. 2002;82:1213–1223.

Pzy 159

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Pzy 159

Uploaded by

Copyright:

Available Formats

Review

M. Jakobsson, RPT, PhD, Back in Mo-

Validity, and Responsiveness of ences, University of Gothenburg, Möl-

Assess Functioning in Patients With A. Gutke, RPT, PhD, Division of Physio-

Low Back Pain: A Systematic Review habilitation, Institute of Neuroscience

Downloaded from https://academic.oup.com/ptj/article/99/4/457/5252375 by guest on 19 January 2021

2019 Volume 99 Number 4 Physical Therapy 457

Downloaded from https://academic.oup.com/ptj/article/99/4/457/5252375 by guest on 19 January 2021

458 Physical Therapy Volume 99 Number 4 2019

Downloaded from https://academic.oup.com/ptj/article/99/4/457/5252375 by guest on 19 January 2021

2019 Volume 99 Number 4 Physical Therapy 459

Measurement Property Ratingb Criterion for Result Rating

? ICC or weighted κ not reported

− Criteria for “+” not met

Measurement error + SDC or LoA < MICc

? MIC not deﬁned

− Criteria for “+” not met

Downloaded from https://academic.oup.com/ptj/article/99/4/457/5252375 by guest on 19 January 2021

− Criteria for “+” not met

Criterion validity + Correlation with gold standard ≥ 0.70 or AUC ≥ 0.70

? Not all information for “+” reported

− Criteria for “+” not met

− Criteria for “+” not met

agreement; MIC = minimal important change; SDC = smallest detectable change.

Unknown. This criterion encompassed the following 4 Results

460 Physical Therapy Volume 99 Number 4 2019

7900 Articles After Duplicates Removed

Downloaded from https://academic.oup.com/ptj/article/99/4/457/5252375 by guest on 19 January 2021

435 Full-Text Articles Excluded:

2. Wrong condition or not fulfilling inclusion criteria for

3. Tests not adequately described or original reference

25 Articles Included in the Data Extraction 4. Wrong construct: 65

6. Insufficient standardization/expensive equipment: 18

7. Aim was not to evaluate measurement properties: 69

8. Results only presented for composite score of a test

3 Articles Not Included in the Data Synthesis

2019 Volume 99 Number 4 Physical Therapy 461

et al52 (2005) (United Kingdom) mo

Downloaded from https://academic.oup.com/ptj/article/99/4/457/5252375 by guest on 19 January 2021

Downloaded from https://academic.oup.com/ptj/article/99/4/457/5252375 by guest on 19 January 2021

Physical Capacity Task Activity Quantiﬁcation Measure Equipment Needed

Downloaded from https://academic.oup.com/ptj/article/99/4/457/5252375 by guest on 19 January 2021

466 Physical Therapy Volume 99 Number 4 2019

Reliability Measurement Error

Physical Study Sample Best-Evidence Best-Evidence

Smeets et al5 Test-retest 53 Good ICC(1,1): Good LoA: ±14.7

Simmonds et al17 Test-retest 44 Goodf ICC(1,k): Goodf SDC: 6.38 se

Simmonds et al17 Test-retest 44 Fair ICC(1,1): Fair SDC: 18.46 se

Simmonds et al17 Interrater 22 Fair ICC(1,1): SDC: 1.14 se

Downloaded from https://academic.oup.com/ptj/article/99/4/457/5252375 by guest on 19 January 2021

Reliability Measurement Error

Teixeira da Test-retest 30 Fair ICC(2,1): Fair

Simmonds et al17 Test-retest 44 Goodf ICC(1,1): Goodf SDC: 4.82 se

Simmonds et al17 Test-retest 44 Fair ICC(1,1): Fair SDC: 1.19 se

Simmonds et al17 Interrater 22 Fairf ICC(1,1): Fairf SDC: 0.61 s

Smeets et al5 (2006) Test-retest 52 Good ICC(1,1): Good LoA: ±3.9 s

Reliability Measurement Error

Physical Study Sample Best-Evidence Best-Evidence

Simmonds et al17 Test-retest 44 Fair ICC(1,1): Fair SDC: 2.74 se

Simmonds et al17 Interrater 22 Fairf ICC(1,1): Fair SDC: 1.52 se

Simmonds et al17 Test-retest 44 Goodf ICC(1,1): Goodf SDC: 100.3

Smeets et al5 (2006) Test-retest 53 Good ICC(1,1): Good LoA: ±82.7 m

Teixeira da Test-retest 30 Fair ICC(2,1):