
META-ANALYSIS

Can we rely on adolescents to self-assess puberty stage? A systematic review and meta-analysis

Susan C. Campisi,1,2 Josée D. Marchand,3 Fahad Javaid Siddiqui,1,4 Muhammad Islam,1 Zulfiqar A. Bhutta,1,2,5 and Mark R. Palmert3,6
1Centre for Global Child Health, The Hospital for Sick Children, Toronto, ON M5G 0A4, Canada; 2Department of Nutritional Sciences, University of Toronto, Toronto, ON M5S 1A8, Canada; 3Division of Endocrinology, The Hospital for Sick Children, Toronto, ON M5G 1X8, Canada; 4Health Services and Systems Research, Duke-NUS Medical School, 169857, Singapore; 5Centre of Excellence in Women and Child Health, The Aga Khan University South-Central Asia, 74000, East Africa & United Kingdom; and 6Departments of Pediatrics and Physiology, University of Toronto, Toronto, ON M5S 1A8, Canada

Downloaded from https://academic.oup.com/jcem/article/105/8/2846/5807960 by guest on 07 January 2024

ORCiD numbers: 0000-0003-1072-9228 (S. C. Campisi); 0000-0002-9046-5105 (F. J. Siddiqui); 0000-0003-0637-599X (Z. A. Bhutta).

Context: Clinicians, researchers, and global health advocates often include pubertal development among their outcomes. However, assessing pubertal stage can be challenging because of the sensitive nature and limited feasibility of clinical examinations, especially in large-scale settings.

Objective: To determine the accuracy of self-assessed Tanner staging when compared with
physically assessed Tanner stages by a clinician.

Data Sources: MEDLINE, PubMed, Embase, Web of Science, Scopus, the Cochrane Library, CINAHL.

Study Selection: Studies were included if they reported 5 × 5 tables of self-assessment compared with clinician assessment for the 5-stage Tanner scale.

Data Extraction: We extracted data to generate complete 5 × 5 tables for each study, including
any subgroup eligible for the analysis, such as overweight/obese youth.

Data Synthesis: After screening, 22 studies representing 21,801 participants met our inclusion criteria for the meta-analysis. Overall agreement between the 2 assessments was moderate or substantial. Breast stage 1, female pubic hair stage 1, male pubic hair stage 1, and male pubic hair stage 5 had the highest agreement. When stages were collapsed into pre- (Tanner stage 1), in (stages 2-3), and completing (stages 4-5) puberty, levels of agreement improved, especially for pre- and completing pubertal development. Most included studies comprised Caucasian youth. More studies are needed that include a broader range of geographic and socioeconomic settings, as well as a greater diversity of racial/ethnic groups.

Conclusions: Self-assessment of puberty is most accurate for identifying Tanner stage 1 and Tanner stage 5. It is also more accurate when development is categorized into prepubertal, in-puberty, and completing-puberty phases. Use of self-assessment data should be structured accordingly. (J Clin Endocrinol Metab 105: 2846–2856, 2020)

Protocol Registration: PROSPERO # CRD42018100205

Key Words: puberty assessment, systematic review, meta-analysis

ISSN Print 0021-972X ISSN Online 1945-7197


Printed in USA
© Endocrine Society 2020. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com
Received 11 December 2019. Accepted 11 March 2020.
First Published Online 16 March 2020.
Corrected and Typeset 1 July 2020.
Abbreviations: B, female Tanner stage breast; BMI, body mass index; fPH, female pubic hair; G, male Tanner stage genitals; HCP, health care professional; mPH, male pubic hair 1–5; κ, Cohen's kappa.

2846  J Clin Endocrinol Metab, August 2020, 105(8):2846–2856   https://academic.oup.com/jcem   doi:10.1210/clinem/dgaa135



Monitoring a youth's growth and pubertal development is an important part of health surveillance, for an individual and for a population. Abnormalities of pubertal timing often represent the extreme ends of the normal spectrum. They can also be a manifestation of an underlying disorder or poor nutritional status (1, 2). Moreover, for those who have a known systemic illness, assessments of growth and puberty are key indicators of effective management. Knowledge of pubertal status is also important in assessing some aspects of physiology. For example, assessments of metabolic status should consider the increasing insulin resistance that accompanies the onset of puberty.

Interestingly, the timing of puberty is now known to be associated with many later-life health outcomes in both men and women. In women, early puberty is associated with a higher risk of breast cancer, cardiovascular disease, and metabolic disorders. Later puberty has been associated with an increased risk of disorders such as celiac disease and low bone mineral density (3–5). In men, early puberty has been associated with an increased risk of cardiovascular disease and bipolar disorder in later life. Later puberty in men has been associated with anxiety, depression, asthma, and eczema (4). These later-life associations have increased the need to understand the factors that regulate the timing of puberty. They also support ongoing assessment of pubertal timing in the general population to monitor trends toward earlier development.

For all of these reasons, many groups require accurate assessment of pubertal timing to guide their work. These include health care providers such as family doctors, general pediatricians, adolescent medicine providers, and pediatric subspecialists, as well as epidemiologists and other researchers. However, physical assessment data are difficult to obtain. Youth may be reluctant to undergo examinations, and staff are needed to conduct the assessments. Both problems may be worse in large population studies and in studies requiring serial evaluations. Other barriers include providers' lack of familiarity or expertise with the examinations, and conflicts between the sensitive nature of the assessment and local cultural practices among families and youth. These limitations often compel reliance on the results of self-assessments, but one wonders: are such self-assessment results reliable?

To address this question, we conducted a systematic review and meta-analysis of the agreement between self-assessed Tanner staging and Tanner staging from physical assessment by a health care provider (HCP). Here we report the results of this meta-analysis, including subanalyses of agreement for obese youth and of self-assessment tools.

Methods

We used a 3-step search strategy to identify studies reporting agreement between Tanner staging conducted by an HCP (hereafter, HCP assessment) and self-assessed Tanner staging by adolescents. First, we ran an initial limited search of PubMed and analyzed the text words in titles and abstracts and the index terms used to describe the articles. Second, we searched all included databases using all identified keywords and index terms. Third, we scanned the reference lists of the identified articles for articles not previously identified. The search strategy was developed by S.C., J.M., and the Hospital for Sick Children librarian (6). The searches were initially run on June 27, 2018, and updated on December 19, 2018. They covered MEDLINE, PubMed, Embase, Web of Science, Scopus, the Cochrane Library, and CINAHL, and the search strategies are available (6). Because the study lacked capacity for translation, non-English titles were excluded. There was no limit on publication date or geographical location for the inclusion of studies. The study protocol was registered at PROSPERO #CRD42018100205.

Inclusion and exclusion criteria
Cohort and cross-sectional studies reporting participant ages between 6 and 19 years were eligible. Participants with or without disease were eligible, as were studies reporting agreement for overweight and/or obese populations. We also included studies of participants with conditions such as cystic fibrosis, diabetes mellitus, and Crohn's disease, which are not likely to affect participants' ability to assess their puberty stage. Studies using any tool for self-assessment of puberty were eligible. To enter the meta-analyses, a study had to report a 5 × 5 table, or details from which such a table could be constructed. The table could cover Tanner stage breast (B) 1-5, Tanner stage genitals (G) 1-5, female pubic hair (fPH) 1-5, or male pubic hair (mPH) 1-5. If a study report did not provide the required data, the corresponding authors were contacted twice and the missing data requested. However, no additional data could be obtained. Studies of participants with cognitive impairment were excluded. We also excluded studies reporting puberty in participants with body dysmorphic disorders such as anorexia. We believed the body dysmorphic nature of these disorders would alter participants' ability to assess their own puberty. As the focus was on youth self-assessment, studies in which a parent or other adult assessed or assisted the participant were not included.

Reference standard and index test
The reference standard was assessment of Tanner stages by a physician, nurse, or other trained HCP. Tanner staging comprises indicators of the development of secondary sexual characteristics during puberty: breast and pubic hair development in girls, and genital and pubic hair development in boys. To allow direct comparison of the index and reference tests, this systematic review adheres
to a strict definition of Tanner staging (7, 8). For example, this definition does not include assessment of testicular volume by Prader orchidometer. The index test was self-assessed puberty staging using Tanner criteria. The included studies used several self-assessment tools. All tools included an image, either a photograph or a line drawing, with or without an accompanying description. Most studies based their self-assessment tools on the Morris and Udry (9) or Tanner (7, 8) drawings; however, some designed their own tools.

Study selection and data collection
Screening of titles, abstracts, and full-text articles was conducted in duplicate (J.M. and S.C.). All references were managed in an EndNote v7.8 library and exported to the Covidence (10) online review software for study selection. J.M. and S.C. designed the extraction forms a priori and pilot tested them (n = 5). After piloting, the forms were revised to standardize the extracted data. All data were extracted in duplicate from the included studies by I.M., F.J.S., A.K., J.M., or S.C. and then matched. Data extracted included:
- author and publication year
- study design and sample size
- type of recruitment (prospective or retrospective)
- study population characteristics and the clinical context
- details of the reference standard and the index test used
- data to generate 5 × 5 tables for breast, genital, mPH, and fPH Tanner stages
Discrepancies were resolved by consensus.

Assessment of methodological quality and risk of bias
The methodological quality of each included study was assessed using the Quality Assessment of Diagnostic Accuracy Studies-2 (QUADAS-2) tool (11). Risk of bias was assessed in 4 domains: patient selection, index test, reference standard, and flow and timing. Applicability concerns were assessed in the first 3 domains. In each domain, the signaling questions were answered "yes," "no," or "unclear." The risk of bias was then judged "low," "high," or "unclear." Two review authors (J.M. and S.C.) independently evaluated the included studies and resolved any disagreements by discussion, with referral to a third review author as necessary.

Statistical analysis
We obtained stage-wise and phase-wise Cohen's kappa (κ) statistics for each body part (ie, B, G, mPH, and fPH). To do so, we transformed the 5 × 5 table from each study into 2 × 2 tables, either stage-wise (stages 1-5) or phase-wise (pre-, in/early, and completing/late puberty). The 3 puberty phases reflect a commonly used puberty phase classification (12). Prepuberty corresponded to Tanner stage 1, in/early puberty to Tanner stages 2 and 3 combined, and completing/late puberty to Tanner stages 4 and 5 combined. For each study, a κ statistic and 95% CI were calculated (13), stage-wise or phase-wise, for female B, male G, mPH, and fPH. The κ statistics were then meta-analyzed using a random-effects model to obtain pooled κ estimates and 95% CIs for each stage or phase.

We also estimated how often participants overestimated or underestimated their stage. We first calculated the percent overestimation and underestimation in each study for female B, male G, mPH, and fPH. Pooled estimates and 95% CIs were then obtained by random-effects meta-analysis.

To determine how different self-assessment tools affected agreement, we conducted an analysis stratified by tool. This analysis used the 5 × 5 tables for overall Tanner stages (1-5) for female B, male G, mPH, and fPH. The tool variable had 4 categories: (a) line drawings without descriptions, (b) line drawings with descriptions, (c) photographs without descriptions, and (d) photographs with descriptions. Pooled κ and linearly weighted κ (κw) statistics with 95% CIs were calculated for each tool category (13).

A body mass index (BMI) > 85th percentile can lead to confusion between adipose and breast tissue during Tanner staging, so the main analysis excluded these participants. However, a separate analysis compared agreement in studies reporting self-assessment results for overweight or obese adolescents with agreement for youth with normal BMI. The κ statistics and pooled estimates for this analysis were calculated as described above.

Statistical heterogeneity was assessed by examining forest plots for the variability of study estimates and the overlap of their 95% CIs. The percentage of variance in the meta-analysis attributable to study heterogeneity was calculated using I² (14). An I² value greater than 75% suggests substantial heterogeneity among the studies' κ statistics. We therefore used a random-effects model to accommodate the inherently high methodological heterogeneity and to reduce the influence of large studies on the pooled κ estimates. For the final κ value, agreement between the physical examination and the self-assessment was interpreted as follows (15):
- 0.21 to 0.40: fair
- 0.41 to 0.60: moderate
- 0.61 to 0.80: substantial
- 0.81 to 0.99: almost perfect
All analyses were performed using STATA (v14.2).

Results

Our search identified 838 studies. After screening, 22 studies representing 21 801 participants (boys n = 9854 [45.2%] and girls n = 11 947 [54.8%]) contributed to the meta-analysis, as outlined in the PRISMA flowchart (Fig. 1). Twenty studies (91%) reported agreement for female B, 14 (64%) for male G, 19 (86%) for fPH, and 18 (82%) for mPH. The same numbers of studies contributed to the analysis of collapsed puberty phases (pre, in or early, and completing or late puberty).

Figure 1. PRISMA flowchart of study selection.

The timing and sequence of assessments also varied. With regard to timing, 20 studies were cross-sectional, with the self-assessment and HCP assessment taking place immediately one after the other. The other 2 studies were prospective, with intervals between assessments of 1 week and 2 weeks, respectively. With regard to sequence, self-assessment came before the HCP assessment in 16 studies, 5 of which stated that the HCP was blinded to the self-assessment score. Physical assessment came before self-assessment in 5 studies. One study stated that self-assessment was conducted either before or after the HCP assessment.

Twenty studies recruited participants by convenience sampling (ie, from endocrinology clinics), 1 study used consecutive recruitment, and 1 study used random sampling. A single study of participants with anorexia nervosa was excluded. One study examined the reliability of parent assessment as well as child assessment, each compared with HCP assessment. For that study, only the child data were used in the meta-analysis and the parent data were disregarded.

Line drawings with description were used to guide self-assessment in 12 (55%) studies, line drawings alone in 5 (23%), photographs alone in 2 (9%), and photographs with a description in 2 (9%). One study (5%) used photographs and line drawings plus description (Table 1). By World Bank Income Groups, 15 studies (68%) were from high-income countries and 7 (32%) from middle-income countries. By World Health Organization region, 12 studies (54.5%) were from the Region of the Americas, 3 (13.6%) from the European Region, and 3 (13.6%) from the Western Pacific Region. Two studies (9.0%) were from the South-East Asia Region, 1 (4.5%) from the African Region, and 1 (4.5%) from the Eastern Mediterranean Region.

Overall, the included studies were of high quality for applicability (patient selection, index test, and reference standard) and for risk of bias (index test, reference standard, and flow and timing). For patient selection, 13 (59%) studies were rated at low risk of bias and 9 (41%) at "unclear risk." Most of the "unclear risk" ratings were downgraded because participant selection was not random. Given the nature of the topic, these studies used convenience samples, such as school settings or cross-sections of cohort studies (6). No studies were excluded based on the risk-of-bias assessment.
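The stage-wise and phase-wise collapse described under Statistical analysis can be illustrated with a short Python sketch. It assumes each study's 5 × 5 table stores HCP stages as rows and self-assessed stages as columns. It uses a hypothetical example table, not data from any included study. Cohen's κ is computed with the simple large-sample standard error; reference (13) may use a different variance formula. The function names (`collapse`, `stage_vs_rest`, `phase_vs_rest`, `cohen_kappa`) are illustrative only.

```python
import math

Table = list[list[int]]  # rows: HCP stage 1-5; columns: self-assessed stage 1-5

# Phase classification from the text: pre = stage 1, in = 2-3, completing = 4-5.
PHASES = {"pre": [1], "in": [2, 3], "completing": [4, 5]}


def collapse(table: Table, groups: list[list[int]]) -> Table:
    """Merge Tanner stages (1-based) into the categories listed in `groups`."""
    k = len(groups)
    out = [[0] * k for _ in range(k)]
    for i, row_stages in enumerate(groups):
        for j, col_stages in enumerate(groups):
            out[i][j] = sum(table[r - 1][c - 1] for r in row_stages for c in col_stages)
    return out


def stage_vs_rest(table: Table, stage: int) -> Table:
    """2 x 2 table: one Tanner stage versus all other stages."""
    others = [s for s in range(1, 6) if s != stage]
    return collapse(table, [[stage], others])


def phase_vs_rest(table: Table, phase: str) -> Table:
    """2 x 2 table: one puberty phase versus the other two phases."""
    members = PHASES[phase]
    others = [s for s in range(1, 6) if s not in members]
    return collapse(table, [members, others])


def cohen_kappa(table: Table) -> tuple[float, float, float]:
    """Unweighted kappa with an approximate 95% CI (large-sample SE)."""
    n = sum(map(sum, table))
    k = len(table)
    po = sum(table[i][i] for i in range(k)) / n  # observed agreement
    rows = [sum(table[i]) / n for i in range(k)]
    cols = [sum(table[i][j] for i in range(k)) / n for j in range(k)]
    pe = sum(r * c for r, c in zip(rows, cols))  # agreement expected by chance
    kappa = (po - pe) / (1 - pe)
    se = math.sqrt(po * (1 - po) / (n * (1 - pe) ** 2))
    return kappa, kappa - 1.96 * se, kappa + 1.96 * se


# Hypothetical 5 x 5 table for one body part in one study.
example = [
    [40, 5, 0, 0, 0],
    [6, 30, 8, 1, 0],
    [0, 7, 25, 9, 1],
    [0, 1, 8, 35, 6],
    [0, 0, 1, 7, 45],
]

for stage in range(1, 6):
    print(f"stage {stage}:", cohen_kappa(stage_vs_rest(example, stage)))
for phase in PHASES:
    print(f"{phase}:", cohen_kappa(phase_vs_rest(example, phase)))
```

Collapsing to "one category versus the rest" gives a separate κ for each stage or phase. Each stage-level or phase-level estimate can then be pooled across studies on its own.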
Table 1. Characteristics of Included Studies

| Author | Publication Year | Country | Sample Size | Age Range (years) | Health Professional Performing the Physical Assessment | Self-Assessment Tool |
|---|---|---|---|---|---|---|
| Boas et al (16) | 1995 | USA | 61 | 12.0–18.9 | Physician and nurse | Photos (7,8) + description |
| Bonat et al (17) | 2002 | USA | 244 | 12.9–15.5 | Physician and nurse | Line drawings + description |
| Brooks-Gunn et al (18) | 1987 | USA | 85 | 11.0–13.0 | Physician and nurse | Line drawings |
| Chan et al (19) | 2008 | China | 354 | 8.0–18.0 | Other clinician | Line drawings (9) + description |
| Chavarro et al (20) | 2017 | Mexico | 245 | 8.0–13.0 | Physician | Line drawings + description |
| Desmangles et al (21) | 2006 | USA | 240 | 6.0–16.0 | Physician | Photos (7,8) |
| Ernst et al (22) | 2018 | Denmark | 197 | 13.7–16.6 | Physician | Line drawings + description |
| Hergenroeder et al (23) | 1999 | USA | 107 | 8.0–17.0 | Physician | Line drawings + description |
| Jaruratansirikul et al (24) | 2015 | Thailand | 1934 | 8.0–18.0 boys; 7.0–16.0 girls | Physician | Line drawings + description |
| Lamb et al (25) | 2011 | USA | 232 | 8.0–18.0 | Physician | Line drawings |
| Lee et al (26) | 2006 | USA | 67 | 8.0–18.0 | Physician and nurse | Line drawings (9) |
| Leone et al (27) | 2007 | Canada | 47 | 12.0–17.0 | Physician | Line drawings + description |
| Morris et al (9) | 1980 | USA | 47 | 12.0–16.0 | Physician | Line drawings (9) |
| Norris et al (28) | 2005 | South Africa | 182 | 10.0–18.0 | Nurse | Photographs (7,8) + line drawings (9) + description |
| Peng et al (29) | 2018 | China | 174 | 8.0–18.0 | Physician | Photos (computer generated) |
| Rabbani et al (30) | 2013 | Iran | 190 | 9.0–16.0 | Physician | Line drawings + description |
| Rasmussen et al (31) | 2015 | Denmark | 868 | 7.4–14.9 | Physician | Line drawings + description |
| Rollof et al (32) | 2012 | Sweden | 100 | 10.0–16.0 | Nurse | Line drawings + description |
| Schall et al (33) | 2012 | USA | 100 | 8.0–18.0 | Physician | Line drawings + description |
| Stephen et al (34) | 2008 | USA | 87 | 8.0–16.0 | Physician | Line drawings |
| Sun et al (35) | 2012 | China | 16 046 | 8.0–18.9 | Physician | Line drawings + description |
| Wacharasindhu et al (36) | 2002 | Thailand | 194 | 7.0–15.0 | Physician | Photos (7,8) + description |
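The tool-stratified analysis in the Statistical analysis subsection reports linearly weighted κ (κw) on the full 5 × 5 table. A minimal sketch of that computation follows. It assumes agreement weights of 1 − |i − j|/(k − 1), the usual linear scheme; the text says only "linearly weighted" and does not spell out the weights. The function name and the example table are hypothetical.

```python
def weighted_kappa_linear(table: list[list[int]]) -> float:
    """Linearly weighted kappa for a k x k table (rows: HCP, columns: self)."""
    k = len(table)
    n = sum(map(sum, table))
    p = [[table[i][j] / n for j in range(k)] for i in range(k)]
    rows = [sum(p[i]) for i in range(k)]
    cols = [sum(p[i][j] for i in range(k)) for j in range(k)]
    # Agreement weights: 1 on the diagonal, partial credit for near-misses.
    w = [[1 - abs(i - j) / (k - 1) for j in range(k)] for i in range(k)]
    po = sum(w[i][j] * p[i][j] for i in range(k) for j in range(k))
    pe = sum(w[i][j] * rows[i] * cols[j] for i in range(k) for j in range(k))
    return (po - pe) / (1 - pe)


# Hypothetical table in which every disagreement is off by exactly one stage.
near_misses = [
    [40, 5, 0, 0, 0],
    [6, 30, 8, 0, 0],
    [0, 7, 25, 9, 0],
    [0, 0, 8, 35, 6],
    [0, 0, 0, 7, 45],
]
print(round(weighted_kappa_linear(near_misses), 3))
```

Per-participant stage pairs may be available instead of a pre-tabulated table. In that case, scikit-learn's `cohen_kappa_score(y1, y2, labels=[1, 2, 3, 4, 5], weights="linear")` should give the same value. Passing `labels` keeps the 5-stage spacing even when some stage is absent from the data.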

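The percent overestimation and underestimation described under Statistical analysis can be read off the same 5 × 5 table. A minimal sketch follows. It assumes rows hold the HCP stage and columns the self-assessed stage, so cells above the diagonal are overestimates and cells below it are underestimates. The table and function name are hypothetical. The per-study percentages would still need to be pooled as described in the text.

```python
def over_under_percent(table: list[list[int]]) -> tuple[float, float]:
    """Percent of participants whose self-assessed stage is above / below the HCP stage."""
    k = len(table)
    n = sum(map(sum, table))
    over = sum(table[i][j] for i in range(k) for j in range(k) if j > i)
    under = sum(table[i][j] for i in range(k) for j in range(k) if j < i)
    return 100 * over / n, 100 * under / n


# Hypothetical study: 200 girls, breast stage (rows: HCP, columns: self).
breast = [
    [30, 6, 0, 0, 0],
    [4, 28, 9, 0, 0],
    [0, 5, 30, 8, 1],
    [0, 0, 6, 33, 5],
    [0, 0, 0, 4, 31],
]
over, under = over_under_percent(breast)
print(f"overestimated: {over:.1f}%, underestimated: {under:.1f}%")
# overestimated: 14.5%, underestimated: 9.5%
```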


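The text states that per-study κ values were pooled with a random-effects model and that heterogeneity was summarized with I². It does not name the between-study variance estimator. The sketch below assumes the DerSimonian–Laird estimator, a common default, and maps the pooled κ onto the agreement bands in the Statistical analysis subsection. The study values and function names are hypothetical, and STATA's implementation may differ in detail.

```python
import math


def dersimonian_laird(estimates: list[float], ses: list[float]) -> tuple[float, float, float, float, float]:
    """Random-effects pooled estimate, 95% CI, tau^2 and I^2 (%)."""
    w = [1 / s**2 for s in ses]  # inverse-variance (fixed-effect) weights
    fixed = sum(wi * yi for wi, yi in zip(w, estimates)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, estimates))  # Cochran's Q
    df = len(estimates) - 1
    c = sum(w) - sum(wi**2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0  # between-study variance
    w_re = [1 / (s**2 + tau2) for s in ses]  # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_re, estimates)) / sum(w_re)
    se = math.sqrt(1 / sum(w_re))
    i2 = 100 * max(0.0, (q - df) / q) if q > 0 else 0.0
    return pooled, pooled - 1.96 * se, pooled + 1.96 * se, tau2, i2


def agreement_label(kappa: float) -> str:
    """Bands from the text: 0.21-0.40 fair, 0.41-0.60 moderate,
    0.61-0.80 substantial, 0.81-0.99 almost perfect."""
    if kappa > 0.80:
        return "almost perfect"
    if kappa > 0.60:
        return "substantial"
    if kappa > 0.40:
        return "moderate"
    if kappa > 0.20:
        return "fair"
    return "below fair"  # not named in the text; Landis and Koch call it slight or poor


# Hypothetical stage-1 kappas and standard errors from four studies.
kappas = [0.72, 0.65, 0.81, 0.58]
ses = [0.05, 0.08, 0.04, 0.10]
pooled, lo, hi, tau2, i2 = dersimonian_laird(kappas, ses)
print(f"pooled kappa {pooled:.2f} ({lo:.2f}-{hi:.2f}), I2 {i2:.0f}%,", agreement_label(pooled))
```

Because τ² is added to every study's variance, large studies carry relatively less weight than under a fixed-effect model. This matches the stated aim of reducing the impact of large studies on pooled κ estimates.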

For females, we observed substantial agreement (κ: 0.61-0.80) for B 1 and fPH 1. Moderate agreement (κ: 0.41-0.60) was observed for B 2, B 4, B 5, fPH 2, fPH 3, and fPH 5. Fair agreement (κ: 0.21-0.40) was observed for B 3 and fPH 4. We then broadened the criteria by using puberty phases: pre (Tanner stage 1), early/in (Tanner stages 2-3), and late/completing (Tanner stages 4-5). With phases, agreement was substantial for prepuberty and completing puberty for both B and fPH development. It was moderate for in puberty for both B and fPH (Fig. 2A).

For males, substantial agreement was observed for mPH 1, mPH 5, and G 5. Moderate agreement was observed for G 1, G 4, mPH 2, mPH 3, and mPH 4, whereas fair agreement was observed for G 2 and G 3. Again, agreement was greater when puberty phases were used. Agreement was substantial for completing puberty for genital development, and for prepuberty and completing puberty for mPH development. Agreement was moderate for prepuberty and in puberty for G development, and for in puberty for mPH (Fig. 2B).

Figure 2. Pooled Kappa estimates (κ) and 95% confidence intervals for self-assessed Tanner stages and puberty phases by sex. [Panel A, Female Pubertal Milestones: breast development (n = 20 studies) and pubic hair (n = 19 studies). Panel B, Male Pubertal Milestones: genital development (n = 14 studies) and pubic hair (n = 18 studies). y-axis: κ (95% CI) on a 0 to 100 scale, with bands for fair, moderate, and substantial agreement. n, number of studies used to calculate the pooled kappa estimate.]

Girls and boys were as likely to overestimate their pubic hair and breast or genital stage as to underestimate it. The pooled percentages were as follows:
- Overestimation: fPH 16.81% (95% CI, 16.15-17.49); mPH 15.36% (95% CI, 14.66-16.09); B 18.77% (95% CI, 18.08-19.48); G 24.22% (95% CI, 23.32-25.14).
- Underestimation: fPH 15.03% (95% CI, 14.39-15.68); mPH 15.19% (95% CI, 14.49-15.91); B 14.88% (95% CI, 14.26-15.53); G 18.44% (95% CI, 17.63-19.28).

Tools used for self-assessment varied. Eight studies cited the source of their images, either Morris and Udry (9) or Tanner (7, 8). The remaining studies did not provide a citation, but some provided a sample of their illustrations. In these cases, the images were clearly derived directly from the Morris and Udry (9) or Tanner (7, 8) images. When overall Tanner staging was assessed, adding a description to an image (line drawing or photograph) increased agreement between self-assessment and HCP assessment for boys and girls. Additionally, photographs led to a significant improvement in agreement over line drawings for boys (Fig. 3). When specific stages and specific tools were assessed, line drawings derived from Morris and Udry (9) and those derived from Tanner (7, 8) achieved similar agreement. However, photos with description were superior for assessment of breast development and mPH, whereas photos with or without description performed best for fPH (6).

Figure 3. Pooled weighted Kappa estimates (κw) and 95% confidence intervals for self-assessment tools by sex. [Panel A, Self-Assessment Tool - Girls: breast development and pubic hair. Panel B, Self-Assessment Tool - Boys: genital development and pubic hair. Each panel compares line drawing, line drawing with description, photographs, and photographs with description. y-axis: weighted kappa (%) with 95% CI, with bands for fair, moderate, and substantial agreement. Each estimate is labeled with its number of studies (n).]

Only 2 included studies separately reported agreement for overweight and/or obese participants (children with BMI > 85th percentile) for breast development. Only 1 study did so for each of the other physical characteristics. However, these studies had large sample sizes that allowed a separate analysis (n = 3715; 54.6% female). In these limited data, agreement for breast and fPH Tanner stages and puberty phases did not differ significantly between overweight/obese and nonoverweight/nonobese participants; the CIs overlapped. For boys, agreement did not differ significantly between overweight/obese and nonobese studies for most G and mPH Tanner stages or puberty phases. The exceptions were significantly lower agreement for G 2, G 5, mPH 2, and mPH 3, and for the pre- versus in-puberty phases, in overweight/obese males (6).

Discussion

Individual studies have assessed the agreement between self-assessed Tanner staging and the gold standard of assessment by an HCP. A limited number of reports have reviewed some of those data (37–41). However, no previous systematic review has included as many studies or used formal meta-analysis to quantify agreement between self-assessed and physically examined pubertal stage. Our meta-analysis comprised 22 studies and a large number of participants (n = 21 801). These included a substantial number of male participants (n = 9854), who are frequently underrepresented in puberty research. These results show how self-assessed Tanner staging can be implemented to improve its agreement with HCP-assessed Tanner staging.

Using currently available tools, self-assessment can reasonably separate pubertal development into 3 phases: prepuberty, in/early puberty (Tanner stages 2 and 3), and completing/late puberty (Tanner stages 4 and 5) (12). Agreement between self-assessment and examination was lower for individual Tanner stages, especially stages 2, 3, and 4. This likely reflects how complex these stages are to evaluate. Agreement was usually higher for pubic hair development in both girls and boys. Girls also appeared better at discerning breast stages than boys were at discerning genital stages.

This difference between girls and boys is perhaps not surprising. Genital staging in boys is complex because it is based on size and general appearance. This is possibly more subjective than breast staging, which relies more on structural aspects of nipple/areola appearance (42). This subjectivity could lower agreement between self- and HCP assessments. Regarding tools, agreement was higher when the tool included an image with an accompanying description than when it used an image alone.

Our results have several important practical implications. If investigators or clinicians need data on individual Tanner stages, they should probably obtain them through HCP examinations whenever possible. If, instead, the variable of interest is whether an individual is prepubertal, in puberty, or completing puberty, self-assessment may suffice. Some settings rely on self-assessment, such as large epidemiology studies or global health initiatives. In these settings, questions that focus on phases of puberty rather than individual stages may be more tractable and produce more reliable results. When self-assessment is used, an image with an accompanying description will likely yield more accurate results than an image alone. However, societal, cultural, or religious factors may at times limit the use of images.

Finally, pubic hair alone is not necessarily a sign of central puberty (hypothalamic-pituitary-gonadal axis maturation), because the hair could derive from adrenarche instead of gonadarche. Self-assessment agreed with HCP assessment more closely for breast development than for male genital development. This suggests that girls may be more accurate than boys at identifying the onset and progression of true puberty.

Another important question is whether self-assessment can be used to evaluate pubertal development among overweight and obese youth. Our results indicate that overweight/obese children may be able to assess their pubertal development as effectively as children with normal BMI. The exceptions were G 2, mPH 2, mPH 3, and in-puberty mPH in boys. Interestingly, much concern surrounds breast assessment in females, specifically the ability to distinguish
2854  Campisi et al   Reliability of Self-assessed Puberty Stages J Clin Endocrinol Metab, August 2020, 105(8):2846–2856

true glandular tissue from adipose tissue, these results did not indicate that overweight/obese girls' self-assessments were less accurate than those made by girls of normal weight. Although encouraging, these results should be considered preliminary within this subpopulation, as they are limited to 1 or 2 studies. More evidence, including data from different geographic locations and ethnicities, would increase the representativeness of these findings.

Our meta-analysis does have limitations. Although physical examination is the gold standard for assessment of the secondary sexual characteristics of puberty, some have suggested that hormonal levels should instead be the comparator (43–45). There are current limitations to the use of hormonal data but, more importantly, hormonal levels conceptually address a slightly different question from the one we asked regarding how accurately youth can assess the physical stages of puberty.

In the studies we assessed, few details were provided concerning the HCP's level of training in pubertal assessments, and we know that knowledge of pubertal examinations and examiner confidence can vary considerably (46). Differences in HCP training and proficiency could therefore introduce variation (inaccuracy) into the gold standard itself. In overweight or obese patients, HCP training and experience are critical, as it may be difficult to determine Tanner stages of breast development (because of the presence of adipose tissue) and genital development (because of the presence of a suprapubic fat pad). In terms of generalizability, most of the included studies comprised Caucasian youth and were conducted in high-income countries; as such, more studies from other geographic and socioeconomic settings, including a broader range of racial/ethnic groups, are needed. As none of the included studies were from low-income countries, data from resource-constrained settings are also needed because they would help inform the use of self-assessment as part of global health initiatives. Last, we identified no studies for inclusion in our analysis that assessed the use of self-assessment among gender-incongruent youth or among individuals with clinical diagnoses of delayed or precocious puberty.

Having strict inclusion/exclusion criteria is an important component of a rigorous meta-analysis because such criteria allow for direct comparison of data across studies. However, as a result of these criteria, not all published studies comparing self-assessment of pubertal development with physical examination were included in our study. For example, we did not include studies that used the Pubertal Development Scale (47) because it uses only text descriptions and its categories do not align 1:1 with the 5 Tanner stages. We also excluded studies reporting parent-assessed or parent-assisted assessments instead of self-assessments conducted independently by the adolescent, because our goal was to evaluate self-assessment. Inconsistent results regarding the correlation between parent reports and physical examination warrant further examination: a study by Rasmussen et al (31) reported that parents tended to underestimate pubertal development, whereas Dorn et al (48) reported correlations ranging from r = 0.75 to 0.87. Regardless, self-assessment is unlikely to be replaced by parent-assisted measures, given adolescents' increased desire for privacy as they approach mid-puberty.

In addition, 16 studies were not included in the meta-analysis because the reported agreement data were not in the required 5 × 5 format and no additional raw data were received in response to letters to the authors. However, an informal review of these 16 studies indicated that their results are consistent with those of our meta-analysis. Similarly, a recently published narrative review of methods of puberty assessment could not formally assess the strength of agreement between self- and HCP assessments, but the authors did highlight higher agreement for both sexes in the assessment of pubic hair and lower agreement for male genital development (38), consistent with our results.

Despite these limitations, our study has many strengths. It addresses a question with implications for individuals in many fields, including HCPs, those conducting physiologic and genetic research, epidemiologists, and global health workers. A rigorous methodology, as well as a large number of studies and youth, are key assets of this meta-analysis. The results point to practical actions regarding which data can be obtained more reliably and the importance of using an image with a description whenever feasible.

Conclusion

Our data indicate that when used to identify Tanner stage 1 or 5, or to categorize pubertal development into prepuberty, in/early puberty, or completing/late puberty, self-assessment tools yield data that likely have substantial agreement with data obtained from HCP examination. Depending on the research or clinical question being asked, the 3 puberty phases, which yield improved agreement in self-assessment, may be more useful than the 5 Tanner stages. However, it is important to note that a potential for misclassification remains even when using the 3-phase classification system. Using tools that provide the most information (images and
descriptions) leads to results with the highest agreement with the physical examination. Because the studies in this meta-analysis used several different images and descriptions, 1 area of future work is the generation of, or agreement on, standard images and descriptions to help minimize variation in self-assessment. The available data suggest that self-assessment data for overweight and obese youth may also be reliable, although this is another area where further research is warranted. In clinical or research settings where discrimination among Tanner stages 2, 3, and 4 is required, physical assessment remains paramount.

Acknowledgments

Thank you to Tamsin Adams-Webber for her assistance with search strategy terms and with translation of the MEDLINE search strategy into the command languages of other database platforms. Thank you also to Amirtha Karunakaran for assistance with data collection.

Financial Support: Susan Campisi is supported by a SickKids Restracomp Doctoral Award. Josée Marchand was supported by Health Canada through the Maternal-Infant Research on Environmental Chemicals Study. There was no external funding for this manuscript.

Author Contributions: S.C.C. conceptualized and designed the study, developed the data collection instruments, assisted with data collection, drafted the initial manuscript, and reviewed and revised the manuscript. J.D.M. designed the study, developed the data collection instruments, collected data, drafted the initial manuscript, and reviewed and revised the manuscript. F.J.S. laid out the statistical analysis plan, collected data, drafted the initial manuscript, and reviewed and revised the manuscript. M.I. collected data, performed the statistical analysis, and reviewed the manuscript. M.R.P. and Z.A.B. provided senior supervision for the study, assisted with interpretation of the results, and reviewed and revised the final manuscript. All authors approved the final manuscript as submitted.

Clinical Trial Registration: PROSPERO # CRD42018100205

Additional Information

Correspondence: Susan C. Campisi, MHSc, PhD Candidate, University of Toronto, Department of Nutritional Sciences, Centre for Global Child Health, Hospital for Sick Children, 686 Bay Street, 11th Floor, Suite 11.9805, Toronto, Ontario, M5G 0A4 Canada. E-mail: susan.campisi@sickkids.ca.

Disclosure Summary: S.C.C., J.D.M., F.J.S., M.I., and Z.A.B. have nothing to declare. M.R.P. receives royalties from UpToDate for having coauthored 2 chapters related to the evaluation and treatment of precocious puberty.

Data Availability: Datasets generated and/or analyzed during the current study are not publicly available but are available from the corresponding author on reasonable request.

References

1. Carel JC, Léger J. Clinical practice. Precocious puberty. N Engl J Med. 2008;358(22):2366–2377.
2. Palmert MR, Dunkel L. Clinical practice. Delayed puberty. N Engl J Med. 2012;366(5):443–453.
3. Zhu J, Chan Y-M. Adult consequences of self-limited delayed puberty. Pediatrics. 2017;139(6):e20163177.
4. Day FR, Elks CE, Murray A, Ong KK, Perry JR. Puberty timing associated with diabetes, cardiovascular disease and also diverse health outcomes in men and women: the UK Biobank study. Sci Rep. 2015;5:11208.
5. Elhakeem A, Frysz M, Tilling K, Tobias JH, Lawlor DA. Association between age at puberty and bone accrual from 10 to 25 years of age. JAMA Netw Open. 2019;2(8):e198918.
6. Campisi S, Marchand J, Siddiqui F, Islam M, Bhutta Z, Palmert MR. Supplementary files for self-assessment of puberty - systematic review. 2019; www.osf.io/h7jfb
7. Marshall WA, Tanner JM. Variations in pattern of pubertal changes in girls. Arch Dis Child. 1969;44(235):291–303.
8. Marshall WA, Tanner JM. Variations in the pattern of pubertal changes in boys. Arch Dis Child. 1970;45(239):13–23.
9. Morris NM, Udry JR. Validation of a self-administered instrument to assess stage of adolescent development. J Youth Adolesc. 1980;9(3):271–280.
10. Covidence Systematic Review Software. Australia: Veritas Health Innovation; 2017.
11. Whiting PF, Rutjes AW, Westwood ME, et al.; QUADAS-2 Group. QUADAS-2: a revised tool for the quality assessment of diagnostic accuracy studies. Ann Intern Med. 2011;155(8):529–536.
12. Royal College of Paediatrics and Child Health. Fact Sheet: UK 2-18 years Growth Chart. In: RCPCH, ed. London, England: Royal College of Paediatrics and Child Health; 2012.
13. Kottner J, Audige L, Brorson S, et al. Guidelines for Reporting Reliability and Agreement Studies (GRRAS) were proposed. Int J Nurs Stud. 2011;48(6):661–671.
14. Higgins JP, Thompson SG, Deeks JJ, Altman DG. Measuring inconsistency in meta-analyses. BMJ. 2003;327(7414):557–560.
15. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977;33(1):159–174.
16. Boas SR, Falsetti D, Murphy TD, Orenstein DM. Validity of self-assessment of sexual maturation in adolescent male patients with cystic fibrosis. J Adolesc Health. 1995;17(1):42–45.
17. Bonat S, Pathomvanich A, Keil MF, Field AE, Yanovski JA. Self-assessment of pubertal stage in overweight children. Pediatrics. 2002;110(4):743–747.
18. Brooks-Gunn J, Warren MP, Rosso J, Gargiulo J. Validity of self-report measures of girls' pubertal status. Child Dev. 1987;58(3):829–841.
19. Chan NP, Sung RY, Kong AP, Goggins WB, So HK, Nelson EA. Reliability of pubertal self-assessment in Hong Kong Chinese children. J Paediatr Child Health. 2008;44(6):353–358.
20. Chavarro JE, Watkins DJ, Afeiche MC, et al. Validity of self-assessed sexual maturation against physician assessments and hormone levels. J Pediatr. 2017;186:172–178.e3.
21. Desmangles JC, Lappe JM, Lipaczewski G, Haynatzki G. Accuracy of pubertal Tanner staging self-reporting. J Pediatr Endocrinol Metab. 2006;19(3):213–221.
22. Ernst A, Lauridsen LLB, Brix N, et al. Self-assessment of pubertal development in a puberty cohort. J Pediatr Endocrinol Metab. 2018;31(7):763–772.
23. Hergenroeder AC, Hill RB, Wong WW, Sangi-Haghpeykar H, Taylor W. Validity of self-assessment of pubertal maturation in African American and European American adolescents. J Adolesc Health. 1999;24(3):201–205.
24. Jaruratanasirikul S, Kreetapirom P, Tassanakijpanich N, Sriplung H. Reliability of pubertal maturation self-assessment in a school-based survey. J Pediatr Endocrinol Metab. 2015;28(3-4):367–374.
25. Lamb MM, Beers L, Reed-Gillette D, McDowell MA. Feasibility of an Audio Computer-Assisted Self-Interview method to self-assess sexual maturation. J Adolesc Health. 2011;48(4):325–330.
26. Lee K, Valeria B, Kochman C, Lenders CM. Self-assessment of height, weight, and sexual maturation: validity in overweight children and adolescents. J Adolesc Health. 2006;39(3):346–352.
27. Leone M, Comtois AS. Validity and reliability of self-assessment of sexual maturity in elite adolescent athletes. J Sports Med Phys Fitness. 2007;47(3):361–365.
28. Norris SA, Richter LM. Usefulness and reliability of Tanner pubertal self-rating to urban Black adolescents in South Africa. J Res Adolesc. 2005;15(4):609–624.
29. Peng X, Peng Y, Li Y, et al. Validity of web-based self-assessment of pubertal development against pediatrician assessments. Pediatr Investig. 2018;2(3):141–148.
30. Rabbani A, Noorian S, Fallah JS, Setoudeh A, Sayarifard F, Abbasi F. Reliability of pubertal self assessment method: an Iranian study. Iran J Pediatr. 2013;23(3):327–332.
31. Rasmussen AR, Wohlfahrt-Veje C, Tefre de Renzy-Martin K, et al. Validity of self-assessment of pubertal maturation. Pediatrics. 2015;135(1):86–93.
32. Rollof L, Elfving M. Evaluation of self-assessment of pubertal maturation in boys and girls using drawings and orchidometer. J Pediatr Endocrinol Metab. 2012;25(1–2):125–129.
33. Schall JI, Semeao EJ, Stallings VA, Zemel BS. Self-assessment of sexual maturity status in children with Crohn's disease. J Pediatr. 2002;141(2):223–229.
34. Stephen MD, Bryant WP, Wilson DP. Self-assessment of sexual maturation in children and adolescents with diabetes mellitus. Endocr Pract. 2008;14(7):840–845.
35. Sun Y, Tao FB, Su PY; China Puberty Research Collaboration. Self-assessment of pubertal Tanner stage by realistic colour images in representative Chinese obese and non-obese children and adolescents. Acta Paediatr. 2012;101(4):e163–166.
36. Wacharasindhu S, Pri-Ngam P, Kongchonrak T. Self-assessment of sexual maturation in Thai children by Tanner photograph. J Med Assoc Thai. 2002;85(3):308–319.
37. Dorn LD, Dahl RE, Woodward HR, Biro F. Defining the boundaries of early adolescence: a user's guide to assessing pubertal status and pubertal timing in research with adolescents. Appl Dev Sci. 2006;10(1):30–56.
38. Walker IV, Smith CR, Davies JH, Inskip HM, Baird J. Methods for determining pubertal status in research studies: literature review and opinions of experts and adolescents. J Dev Orig Health Dis. 2020;11(2):168–187.
39. Coleman L, Coleman J. The measurement of puberty: a review. J Adolesc. 2002;25(5):535–550.
40. Biro FM, Dorn LD. Issues in measurement of pubertal development. In: Handbook of Anthropometry. New York: Springer; 2012:237–251.
41. Dorn LD. Moving research on puberty forward: measures are the key component. J Adolesc Health. 2015;56(6):580–581.
42. Herman-Giddens ME, Bourdony CJ, Dowshen SA, Reiter EO. Assessment of sexual maturity stages in girls and boys. Illinois, USA: American Academy of Pediatrics; 2011.
43. Balzer BWR, Garden FL, Amatoury M, et al. Self-rated Tanner stage and subjective measures of puberty are associated with longitudinal gonadal hormone changes. J Pediatr Endocrinol Metab. 2019;32(6):569–576.
44. Chavarro JE, Watkins DJ, Afeiche MC, et al. Validity of self-assessed sexual maturation against physician assessments and hormone levels. J Pediatr. 2017;186:172–178.e3.
45. Singh GK, Balzer BW, Kelly PJ, et al. Urinary sex steroids and anthropometric markers of puberty - a novel approach to characterising within-person changes of puberty hormones. PLoS One. 2015;10(11):e0143555.
46. Ens A, Janzen K, Palmert MR. Development of an online learning module to improve pediatric residents' confidence and knowledge of the pubertal examination. J Adolesc Health. 2017;60(3):292–298.
47. Petersen AC, Crockett L, Richards M, Boxer A. A self-report measure of pubertal status: reliability, validity, and initial norms. J Youth Adolesc. 1988;17(2):117–133.
48. Dorn LD, Susman EJ, Nottelmann ED, Inoff-Germain G, Chrousos GP. Perceptions of puberty: adolescent, parent, and health care personnel. Dev Psychol. 1990;26:322.
