
Received: 16 June 2021 Revised: 20 June 2021 Accepted: 24 June 2021

DOI: 10.1002/sono.12276

EDUCATION ARTICLE

Correlation does not imply agreement: A cautionary tale for researchers and reviewers

Christopher Edwards1,2,3,4 | Heather Allen1,5 | Crispen Chamunyonga1,4

1 School of Clinical Sciences, Faculty of Health, Queensland University of Technology, Brisbane, Queensland, Australia
2 Mater Research Institute-University of Queensland, South Brisbane, Queensland, Australia
3 Department of Medical Imaging, Redcliffe Hospital, Redcliffe, Queensland, Australia
4 Centre for Biomedical Technologies, Queensland University of Technology, Brisbane, Queensland, Australia
5 Department of Medical Imaging, The Prince Charles Hospital, Chermside, Queensland, Australia

Correspondence
Christopher Edwards, School of Clinical Sciences, Queensland University of Technology, Gardens Point Campus, 2 George St, Brisbane, QLD 4000, Australia.
Email: c8.edwards@qut.edu.au

Abstract
Sonography researchers are often confronted with questions concerning the relationship between measurements acquired in clinical practice. This educational commentary provides clarity on the definition of the terms correlation and agreement. We discuss the statistical tests used to assess correlation and outline some common pitfalls authors fall into when reporting these. We provide examples of the inaccurate use of these tests in the sonography literature and recommend better alternatives for assessing agreement. This review will benefit authors embarking on studies comparing measurements and reviewers considering manuscripts to ensure the methodology used is sufficient to justify the claims. The authors also provide plots created in the R statistical software package, and the supplementary data file supplied can be used and adapted for those wishing to perform similar statistical tests.

KEYWORDS
agreement, biostatistics, Bland–Altman plot, correlation, intra-class correlation

1 | INTRODUCTION

Research in clinical ultrasound practice often involves demonstrating an association between numerical observations and measurements. These may be direct clinical characteristics such as investigating a relationship between age and renal length measured by ultrasound, or in the context of a department audit, investigating the number of scans performed as they relate to years of experience. Studies are also designed to test whether measurements are reproducible under different conditions or if there is agreement between a single measurement acquired using different technologies. An example, which we will return to later in this article, is the comparison of transabdominal ovarian volume measurements with measurements obtained using a transvaginal approach. All observations from these groups can be measured as numerical values and are referred to as continuous variables.1 When researchers are interested in reporting the degree of association or concordance between such continuous variables, correlation and agreement are often used. However, the term ‘correlation’ as a synonym for ‘agreement’ can be misleading in research.2,3

This educational article aims to clarify the terms correlation and agreement and explore some of the standard statistical methods used in the context of continuous variables. This discussion and the tests described will aid sonographers and reviewers as they conduct and appraise research to reduce inconsistencies and improve quality.

2 | WHAT IS CORRELATION?

In mathematical terms, correlation is a test that summarises the strength and direction of a relationship between two independent continuous variables. In most cases, it refers to a linear relationship, expressed as the Pearson correlation coefficient (r),4 which has a value ranging from −1 to +1. A positive correlation occurs when the value of one continuous variable increases in tandem with the other. In contrast, a negative correlation occurs when, as one variable increases, the other decreases. A coefficient of 1 (or −1 for a negative relationship) represents a perfect linear relationship, meaning all values fit perfectly to a straight line. In contrast, a coefficient of 0 or close to zero indicates the lack of any linear relationship. The units of each variable do not influence the value of the Pearson correlation coefficient. Other types of coefficients are available for non-linear data, such as Spearman5 and the

Sonography. 2021;1–6. wileyonlinelibrary.com/journal/sono © 2021 Australasian Sonographers Association 1


2 EDWARDS ET AL.

Kendall rank correlation test.6 This discussion will concentrate on data with a linear relationship and refer to the Pearson coefficient.

Notably, the magnitude of the correlation coefficient only provides information on how close the points lie to the regression or ‘least-squares’ line. The position of this line is calculated mathematically and drawn as a straight line that minimises the sum of the squared errors, referred to as least squares regression.7 The coefficient does not provide any information on the absolute agreement between two variables. In most scenarios, two sets of variables are never perfectly correlated (r = 1), and there remains an error between each data point and the regression line. Figure 1 demonstrates four Pearson coefficient values. In plots A–C, the slope and direction of each corresponding regression line remain relatively unchanged; however, the distance between the data points and the regression line increases, resulting in a lower coefficient value. The coefficient calculation also ignores the units of each variable, and reversing the graph axes will not affect the value of r.

Before interpreting a correlation coefficient, it is essential to plot the relationship between the variables of interest on a scatter plot to avoid incorrect assumptions. This is the easiest way to find the nature of the correlation. Inspecting the data in this way reveals whether the correlation is linear or non-linear. In some instances, data may be highly correlated but related in a non-linear monotonic fashion, for example, when the increase or decrease of one variable occurs at a different rate to the other. The plotted data in this instance would appear curved (or skewed), and alternative correlation tests are more appropriate. The Pearson correlation is also highly influenced by the presence of outliers or subgroups, or if significant clustering of one variable (range restriction) is present; again, visual inspection of the data will reveal these.8 Although Pearson's correlation does not assume normality, it does assume constant variance (homoscedasticity). Unequal variance (heteroscedasticity) occurs in situations where the dispersion of one variable differs across the range of values of the other. Graphical examples of cases in which the Pearson correlation should not be used are shown in Figure 2.

Correlation analysis also assumes that all observations are independent of each other. Independent variables are those in which the distribution of one variable does not depend on the other. Therefore, correlation is inappropriate in scenarios involving repeated measurements. Examples of non-independent designs include comparing the peak systolic velocity in an artery before and after exercise or, from the earlier example, comparing two different ultrasound techniques (transabdominal vs. transvaginal). In these examples, a repeated measurement is acquired from a single patient. Unfortunately, this is a common misuse of the correlation coefficient in the medical literature, particularly in studies that aim to assess ‘agreement’ rather than ‘correlation’.

A correlation coefficient is fairly easy to calculate in most statistical software packages such as the R statistical package9 (https://www.r-project.org/) or IBM's SPSS Statistics Software (IBM Corp. Armonk, NY). It can be computed to compare two variables or to create a matrix to quickly assess how one variable is correlated to a range of others, which is helpful to select relevant variables for further regression analysis. The calculation is an example of pairwise testing, meaning that a value for each variable is required; as such, in most cases, the software will automatically ignore missing values.
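The properties described in this section can be checked numerically. The following is a minimal sketch in Python (not the R code used by the authors), assuming the standard definition of the Pearson coefficient, r = Σ(x − x̄)(y − ȳ) / √(Σ(x − x̄)² Σ(y − ȳ)²). The age and renal length values are hypothetical and serve only to show that swapping the axes or changing the measurement units leaves r unchanged.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient computed from its definition."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least two pairs)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when either variable is constant")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical illustration data (not from the article).
age_years = [22, 31, 38, 45, 53, 60, 68, 74]
renal_length_mm = [112, 110, 111, 108, 106, 104, 101, 99]

r = pearson_r(age_years, renal_length_mm)

# Swapping the axes does not change r.
r_swapped = pearson_r(renal_length_mm, age_years)

# Changing the units (mm to cm) does not change r either.
renal_length_cm = [value / 10 for value in renal_length_mm]
r_cm = pearson_r(age_years, renal_length_cm)

print(f"r = {r:.3f}, axes swapped = {r_swapped:.3f}, in cm = {r_cm:.3f}")
```

Because r is built from deviations scaled by each variable's own spread, it says nothing about whether the two sets of values are numerically close to each other, which is the question agreement addresses.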

FIGURE 1 Scatter plots to demonstrate the relationship between two sets of variables. Figure 1A–C shows a positive linear correlation, indicated by Pearson coefficient values. In Figure 1D, there is no relationship
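The behaviour shown in panels A–C can be reproduced with simulated data. The sketch below, written in Python under assumed settings (a true slope of 1, uniformly spread x values, normally distributed noise and a fixed random seed, none of which come from the article), fits a least-squares line and computes r for three noise levels. It is intended only to illustrate that the slope can stay roughly constant while r falls as the points scatter further from the line.

```python
import math
import random


def least_squares_slope_and_r(x, y):
    """Return the least-squares slope of y on x and the Pearson r."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / sxx, sxy / math.sqrt(sxx * syy)


def simulate(noise_sd, n=400, true_slope=1.0, seed=1):
    """Simulated data: y = true_slope * x + normally distributed noise."""
    rng = random.Random(seed)
    x = [rng.uniform(0, 100) for _ in range(n)]
    y = [true_slope * xi + rng.gauss(0, noise_sd) for xi in x]
    return x, y


results = {}
for noise_sd in (5, 20, 60):
    slope, r = least_squares_slope_and_r(*simulate(noise_sd))
    results[noise_sd] = (slope, r)
    print(f"noise SD {noise_sd:>2}: slope = {slope:.2f}, r = {r:.2f}")
```

The noise standard deviations were chosen arbitrarily; any increasing sequence should show the same pattern.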

FIGURE 2 Examples where the Pearson correlation coefficient should not be applied. (A) Data containing outliers, (B) heteroscedasticity, in which there is unequal variance between the two variables, (C) clustered or grouped data, (D) non-linear or curved relationships
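Two of these situations, a curved monotonic relationship and a single outlier, are easy to demonstrate with a few lines of code. The following Python sketch uses small hypothetical data sets. It assumes the usual definition of the Spearman coefficient as the Pearson coefficient calculated on ranks (with tied values given their average rank); the helper names are illustrative and not taken from any package.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient computed from its definition."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman_rho(x, y):
    """Spearman coefficient: Pearson r applied to the ranks."""
    return pearson_r(ranks(x), ranks(y))


# Curved but perfectly monotonic relationship (hypothetical data).
x_curve = list(range(1, 11))
y_curve = [value ** 3 for value in x_curve]
r_pearson_curve = pearson_r(x_curve, y_curve)
r_spearman_curve = spearman_rho(x_curve, y_curve)

# Essentially unrelated data, then the same data plus one extreme point.
x_flat = list(range(1, 11))
y_flat = [5, 3, 6, 4, 5, 4, 6, 3, 5, 4]
r_no_outlier = pearson_r(x_flat, y_flat)
r_with_outlier = pearson_r(x_flat + [30], y_flat + [20])

print(f"curve: Pearson = {r_pearson_curve:.2f}, Spearman = {r_spearman_curve:.2f}")
print(f"outlier: without = {r_no_outlier:.2f}, with = {r_with_outlier:.2f}")
```

In the first case the rank-based coefficient captures the monotonic relationship that the Pearson coefficient understates; in the second, a single extreme point produces a strong Pearson correlation from data that are otherwise almost uncorrelated.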

3 | STATISTICAL SIGNIFICANCE AND R-SQUARED

Most statistical software reports a corresponding p-value when calculating the correlation coefficient. In this instance, the p-value represents the probability of observing a correlation at least as strong as the one measured if no linear correlation existed between the variables (the null hypothesis). The value of p will depend on the number of observations, and a low p-value (p ≤ 0.05) provides confidence that the measured correlation coefficient in the sample represents a ‘real’ correlation between the two variables in the wider population. However, a low p-value does not provide further evidence on the degree of correlation. Even medium sample sizes with small correlation coefficients can show p-values that are highly significant, and hence p-values should not be used out of context. In particular, statistical significance should not be confused with clinical significance. The p-value only provides a measure of the test's statistical reliability. The analysis of confidence intervals and study design provides complementary evidence of clinical significance.10

r-squared (r²), the square of the correlation coefficient, is a related measure often encountered in linear regression analysis. When a correlation exists, linear regression enables the calculation of the equation that minimises the distance between the fitted line and all the data points in the sample. This is useful to develop a model for the prediction of one value when another value is known, for example, predicting an individual's weight from their measured height. In simple linear regression, the r-squared provides a percentage of how well the model fits the data, sometimes referred to as the ‘goodness of fit’. In technical terms, r-squared indicates the percentage of variance in the dependent variable (weight) that the independent variable (height) explains, and it attempts to answer the question: does the model do a good job explaining changes in the dependent variable? Several assumptions need to be met and tested during linear regression analysis, and these are beyond the scope of this article; the point here is the distinction and relationship between r and r-squared.

4 | WHAT IS AGREEMENT?

The term agreement is separate and represents the concordance between two measurements of one variable. In other words, it is used to assess whether measurements by two observers or two different techniques yield similar results.2 We can explore this concept using the earlier example of comparing ovarian volumes measured via transabdominal and transvaginal techniques. In this example, the actual volume of the ovary is unknown, and the analysis should be designed to assess how closely measurements taken from one method match those of another, not simply their relationship. Since we are measuring the same structure on the same patient (albeit with a different technique), the two values will be highly correlated but may not necessarily agree (agreement could be by chance or error). In this example, Pearson's or a similar correlation coefficient should not be applied. To demonstrate this, we provide two additional plots (Figure 3), using fictional data. Transvaginal scan (TVS) measurements have been plotted on the x-axis and transabdominal scan (TAS) measurements on the y-axis. In both examples, the calculated correlation coefficient remains high (r ≈ 0.9), and it may be tempting to state that the two methods agree, but by inspecting the relationship on a scatter plot, it is easy to see that both demonstrate poor agreement with significant bias. In the first example (Figure 3A), TAS consistently measures the structures higher, a systematic bias. In Figure 3B, TAS overestimates the value when values are low and underestimates it when values are high compared to TVS. The dashed line represents perfect agreement in both plots.

FIGURE 3 Scatter plots to demonstrate the relationship between two highly correlated imaging methods with bias. (A) Systematic bias and (B) proportional bias

To highlight some similar examples from the literature, in a study by Riestra-Candelaria et al.,11 the authors sought to evaluate ultrasound measurements of right liver lengths using different ultrasound modes. As part of the study, Pearson's correlation coefficient was used to compare standard 2D and panoramic imaging measurements in different imaging planes. The study concluded that the craniocaudal plane is optimal for performing these measurements, justified by the highest r-value. From the previous discussion, it should be evident that measurements of the same liver, regardless of the imaging method, are likely to be highly correlated. Furthermore, r values approaching 1 do not provide sufficient evidence that the two methods agree. Indeed, a high r-value may indicate almost perfect agreement, but without further analysis to assess the presence of bias, the agreement remains unknown. In another example, Lam et al.12 studied an image review scoring system to assess sonographer performance during morphology examinations. As part of the study, individual scores from two reviewers were graphed as scatter plots and compared using Pearson's correlation coefficient. An increase in the correlation coefficient between the pre- and post-training phase of the study was used to demonstrate the utility of the scoring system. Once again, the correlation coefficient has been incorrectly applied to assess agreement. If both reviewers gave the same (or nearly the same) score for each image set, this would imply good agreement. However, one reviewer may consistently mark each image set three points higher than another reviewer, producing an almost perfect correlation but poor agreement.

5 | ALTERNATIVE METHODS

Two common alternative methods to assess the agreement between continuous measurements are available: the intra-class correlation coefficient (ICC)13 and the Bland–Altman plot with limits of agreement.14 The ICC provides a single measure of the agreement between two pairs of measurements.13 In contrast, a Bland–Altman plot offers a graphical alternative to a scatter plot to assess both the agreement and bias between paired measurements.14 Methods used to assess agreement between measurements vary depending on the type of variable (continuous or categorical) and the number of observers. Since the focus of this article is continuous variables, we limit the discussion to two common methods used for this type of variable. Ordinary least squares regression could potentially be used to assess agreement; however, this type of simple regression assumes one method is error-free, which is unrealistic in most settings. Alternatively, other types of regression are better suited for method comparison studies, such as Deming's linear regression15 or Passing–Bablok linear regression,16 which readers may wish to explore.17

5.1 | Intra-class correlation coefficient

The ICC is often encountered in studies designed to assess measurement variability caused by the influence of individual operators.18 One example is a study evaluating the variability of measured left ventricular ejection fractions under different circumstances.19 The ICC provides a single measure of the extent to which the readings agree between the operators, instruments, or time points. It conveys the between-pair variance as a proportion of the total variance in a group of observations.20 The ICC is expressed as a unitless number from 0 to 1, with zero indicating no agreement and 1 indicating perfect agreement. Although no strict guidelines are available, and it depends on the clinical context of the observations, values greater than 0.9 are often cited as excellent agreement, between 0.75 and 0.9 as good, between 0.5 and 0.75 as moderate and below 0.5 as poor agreement.21,22 One advantage of the ICC is that more than two sets of measurements can be assigned to a single ICC value, for example, the agreement between measurements performed by four sonographers or using three different ultrasound machines. Several different forms of ICCs are available, and both authors and reviewers should be aware that the selection of ICC should be consistent with the stated aims of any research project. In addition, when reporting ICC values, including the 95% confidence interval is essential to convey the precision of the estimate. For more information on selecting and reporting ICCs, readers are encouraged to consult the excellent guide by Koo and Li.22

5.2 | Bland–Altman plots

When the agreement between two measurements is required, a Bland–Altman plot provides a convenient process of comparing two techniques or methods.14 A Bland–Altman plot can also be used in combination with the ICC and is a scatter plot of the difference between two measurements on the Y-axis (Method 1 minus Method 2) against the mean of the two measurements on the X-axis ((Method 1 + Method 2)/2). This type of basic plot reveals any systematic difference (bias) between measurements, which is the mean difference between Method 1 and Method 2. On the same plot, the limits of agreement (LOA) are plotted as separate lines, demonstrating the range within which 95% of the differences between one method and the other are included.23 The LOA are expressed as the mean observed difference ±1.96 standard deviations (SD) of the observed differences.2 The interpretation of what is acceptable for the limits of agreement will depend on the clinical context; however, broad LOA or values that consistently fall outside these bounds likely indicate a lack of agreement between the two methods. To provide further context on how a Bland–Altman plot is displayed and interpreted, plots of the same values used in Figures 1A,C and 2A,B are shown in Figure 4.

FIGURE 4 Bland–Altman plots showing agreement between two sets of measurements. The solid line shows the mean difference between the values while the dotted lines are the upper and the lower limits of agreement (95% of observed differences)

6 | CONCLUSION

The terms correlation and agreement are not interchangeable in the scientific literature, and sonographers and reviewers must understand their differences. In essence, correlation can be used to infer an association between different independent variables. On the other hand, agreement is the concordance between measurements or techniques for the same variable. Pearson's correlation coefficient and its associated p-value are commonly cited for inferring an association, but two highly correlated variables may not agree. This article outlines the situations in which correlation may be used and offers better alternatives for studies that aim to assess agreement.

CONFLICT OF INTEREST
Christopher Edwards is an editorial board member for Sonography and a co-author on this article. CE was blinded and not involved in the peer review process; management of the peer review process and decision-making for this article was undertaken by the Editor-in-Chief, Kerry Thoirs, acting as Handling Editor.

ORCID
Christopher Edwards https://orcid.org/0000-0001-7466-9530
Heather Allen https://orcid.org/0000-0002-4680-4149
Crispen Chamunyonga https://orcid.org/0000-0002-6714-4362
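As a supplementary illustration of the Bland–Altman calculation described in Section 5.2, the following Python sketch (not the authors' R code or their supplementary data) computes the bias and 95% limits of agreement for a small set of invented TAS and TVS volumes. The values are hypothetical and were constructed so that TAS reads about 2 units higher than TVS; treating TAS as Method 1 and TVS as Method 2 is also an assumption made for the example.

```python
import math
import statistics


def pearson_r(x, y):
    """Pearson correlation coefficient computed from its definition."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def bland_altman(method_1, method_2):
    """Bias and 95% limits of agreement for paired measurements."""
    differences = [a - b for a, b in zip(method_1, method_2)]
    means = [(a + b) / 2 for a, b in zip(method_1, method_2)]
    bias = statistics.mean(differences)
    sd = statistics.stdev(differences)  # sample SD of the differences
    return {
        "means": means,              # X-axis values of the plot
        "differences": differences,  # Y-axis values of the plot
        "bias": bias,
        "lower_loa": bias - 1.96 * sd,
        "upper_loa": bias + 1.96 * sd,
    }


# Hypothetical ovarian volumes with a built-in systematic bias.
tvs = [4.1, 5.3, 6.0, 6.8, 7.5, 8.2, 9.1, 10.4, 11.0, 12.3]
tas = [6.3, 7.2, 8.3, 8.6, 9.6, 9.9, 11.3, 12.4, 12.9, 14.4]

r = pearson_r(tvs, tas)
ba = bland_altman(tas, tvs)  # Method 1 (TAS) minus Method 2 (TVS)

print(f"Pearson r = {r:.3f}")
print(f"bias = {ba['bias']:.2f}")
print(f"limits of agreement = {ba['lower_loa']:.2f} to {ba['upper_loa']:.2f}")
```

With data of this kind the correlation coefficient is close to 1, yet the limits of agreement do not include zero, so the two methods could not be used interchangeably without correcting for the bias. Plotting `ba["differences"]` against `ba["means"]` with horizontal lines at the bias and the two limits gives the Bland–Altman plot itself.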

REFERENCES

1. Dettori JR, Norvell DC. The anatomy of data. Glob Spine J. 2018;8:311–3. https://doi.org/10.1177/2192568217746998
2. Ranganathan P, Pramesh CC, Aggarwal R. Common pitfalls in statistical analysis: measures of agreement. Perspect Clin Res. 2017;8:187–91. https://doi.org/10.4103/picr.PICR_123_17
3. Rosso A. Correlation does not mean agreement: why is it still used as a synonym of agreement? Radiology. 2015;276:617–9. https://doi.org/10.1148/radiol.2015150302
4. Pearson K, Lee A. On the laws of inheritance in man: I. Inheritance of physical characters. Biometrika. 1903;2:357. https://doi.org/10.2307/2331507
5. Dodge Y. Spearman rank correlation coefficient. The concise encyclopedia of statistics. New York, NY: Springer; 2008. p. 502–5.
6. George M. Kendall rank correlation coefficient. The concise encyclopedia of statistics. Volume 1938. New York, NY: Springer; 2008. p. 278–81.
7. Aggarwal R, Ranganathan P. Common pitfalls in statistical analysis: linear regression analysis. Perspect Clin Res. 2017;8:100–2. https://doi.org/10.4103/2229-3485.203040
8. Armstrong RA. Should Pearson's correlation coefficient be avoided? Ophthalmic Physiol Opt. 2019;39:316–27. https://doi.org/10.1111/opo.12636
9. R Core Team. R: a language and environment for statistical computing; 2013.
10. du Prel J-B, Hommel G, Röhrig B, Blettner M. Confidence interval or P-value? Part 4 of a series on evaluation of scientific publications. Dtsch Aerzteblatt. 2009;106:335–9. https://doi.org/10.3238/arztebl.2009.0335
11. Riestra-Candelaria BL, Rodriguez-Mojica W, Jorge JC. Anatomical criteria to measure the adult right liver lobe by ultrasound. Sonography. 2018;5:181–6. https://doi.org/10.1002/sono.12162
12. Lam P, Samson A, Magotti R, Benzie R. The effect of preliminary training on quantitative evaluation of sonographer performance in the fetal morphology ultrasound examination. Australas J Ultrasound Med. 2013;16:142–6. https://doi.org/10.1002/j.2205-0140.2013.tb00102.x
13. Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater reliability. Psychol Bull. 1979;86:420–8. https://doi.org/10.1037/0033-2909.86.2.420
14. Martin Bland J, Altman D. Statistical methods for assessing agreement between two methods of clinical measurements. Lancet. 1986;327:307–10. https://doi.org/10.1016/S0140-6736(86)90837-8
15. Deming WE. Statistical adjustment of data. Oxford, England: Wiley; 1943.
16. Passing H, Bablok W. A new biometrical procedure for testing the equality of measurements from two different analytical methods. Application of linear regression procedures for method comparison studies in clinical chemistry, part I. Clin Chem Lab Med. 1983;21:709–20. https://doi.org/10.1515/cclm.1983.21.11.709
17. Twomey PJ, Kroll MH. How to use linear regression and correlation in quantitative method comparison studies. Int J Clin Pract. 2008;62:529–38. https://doi.org/10.1111/j.1742-1241.2008.01709.x
18. Popovic ZB, Thomas JD. Assessing observer variability: a user's guide. Cardiovasc Diagn Ther. 2017;7:317–24. https://doi.org/10.21037/cdt.2017.03.12
19. Kusunose K, Shibayama K, Iwano H, Izumo M, Kagiyama N, Kurosawa K, et al. Reduced variability of visual left ventricular ejection fraction assessment with reference images: the Japanese Association of Young Echocardiography Fellows multicenter study. J Cardiol. 2018;72:74–80. https://doi.org/10.1016/j.jjcc.2018.01.007
20. Weir JP. Quantifying test-retest reliability using the intraclass correlation coefficient and the SEM. J Strength Cond Res. 2005;19:231–40. https://doi.org/10.1519/15184.1
21. Portney LG, Watkins MP. Foundations of clinical research: applications to practice. Vol 892. Upper Saddle River, NJ: Pearson/Prentice Hall; 2009.
22. Koo TK, Li MY. A guideline of selecting and reporting Intraclass correlation coefficients for reliability research. J Chiropr Med. 2016;15:155–63. https://doi.org/10.1016/j.jcm.2016.02.012
23. Van Stralen KJ, Jager KJ, Zoccali C, Dekker FW. Agreement between methods. Kidney Int. 2008;74:1116–20. https://doi.org/10.1038/ki.2008.306

SUPPORTING INFORMATION
Additional supporting information may be found online in the Supporting Information section at the end of this article.

How to cite this article: Edwards C, Allen H, Chamunyonga C. Correlation does not imply agreement: A cautionary tale for researchers and reviewers. Sonography. 2021;1–6. https://doi.org/10.1002/sono.12276
