
Behavior Research Methods

https://doi.org/10.3758/s13428-022-01950-0

Better to be in agreement than in bad company

A critical analysis of many kappa-like tests

Paulo Sergio Panse Silveira1,2   · Jose Oliveira Siqueira2

Accepted: 1 August 2022


© The Author(s) 2022

Abstract
We assessed several agreement coefficients applied to 2x2 contingency tables, which are common in research due to
dichotomization. Here, we not only studied some specific estimators but also developed a general method for the study
of any candidate estimator for agreement measurement. This method was developed in open-source R code and is
available to researchers. We tested it by verifying the performance of several traditional estimators over all
possible table configurations with sizes ranging from 1 to 68 (a total of 1,028,789 tables). Cohen’s kappa showed handicapped
behavior similar to Pearson’s r, Yule’s Q, and Yule’s Y. Scott’s pi and Shankar and Bangdiwala’s B seem to better assess
situations of disagreement than agreement between raters. Krippendorff’s alpha emulates, without any advantage, Scott’s pi
in cases with nominal variables and two raters. Dice’s F1 and McNemar’s chi-squared incompletely assess the information
of the contingency table, showing the poorest performance of all. We concluded that Cohen’s kappa is a measurement
of association and that McNemar’s chi-squared assesses neither association nor agreement; the only two authentic agreement
estimators are Holley and Guilford’s G and Gwet’s AC1. These two estimators also showed the best performance over the
range of table sizes and should be considered the first choices for agreement measurement in 2x2 contingency tables. All
procedures and data were implemented in R and are available for download from Harvard Dataverse: https://doi.org/10.7910/
DVN/HMYTCK.

Keywords  Agreement coefficient · Contingency table · Categorical data analysis · Inter-rater reliability

Mathematics Subject Classification (2020)  62H17 · 62H20 · 62P10 · 62P15 · 92-04 · 92B15

* Correspondence: Jose Oliveira Siqueira, siqueira@usp.br

1 Department of Pathology (LIM01-HCFMUSP), Medical School, University of Sao Paulo, Av. Dr. Arnaldo 455, room 1103, Sao Paulo, SP 01246-903, Brazil
2 Department of Legal Medicine, Bioethics, Occupational Medicine and Physical Medicine and Rehabilitation, Medical School, University of Sao Paulo, Av. Dr. Arnaldo 455, room 1103, Sao Paulo 01246-903, Brazil

Introduction

Contingency tables are prevalent and, among them, 2x2 tables reign. Conclusions drawn from them depend on statistical measures. In psychology, many observations are somewhat fuzzy, and thus careful researchers must verify whether different assessors agree, which is usually an application of Cohen’s κ test (Banerjee, Capozzoli, McSweeney, & Sinha, 1999; Gwet, 2008; Gwet, 2010). In healthcare, clinical trials depend on the application of drugs, devices, or procedures for checking their association with potentially positive or negative effects, which makes Fisher’s exact and chi-squared tests largely used (Ludbrook, 2011). In epidemiology, researchers also need to assess agreement between observers, for which they apply McNemar’s χ2 test (Kirkwood & Sterne, 2003). In other situations, concerned with quantification of the effect intensities of potentially harmful or protective expositions to the increasing, decreasing, or mere existence of diseases at the populational level, odds ratio estimations are traditionally applied (King, Harper, & Young, 2012).

In all these frequent situations, dichotomization is handy. Sometimes it is imposed by the study design, when a researcher is compared to another one or when patients are classified into two levels (e.g., male or female). In other


cases, some long-evaluated cutoff is determined and applied to label two groups as healthy or diseased, exposed or non-exposed, showing a given effect or not. This is not a defect of scientific research but the recommended, most parsimonious approach: to start an investigation with the simplest classification of events. Statistics has the role of refining researcher impressions, providing evidence for or against initial hypotheses. It is therefore essential that statistical procedures be reliable in several situations.

Before one alleges that this study approaches extreme, atypical, or unrealistic scenarios, let us consider that imbalanced tables are the main goal for researchers. Psychologists desire the highest agreement between assessors, while clinicians expect the strongest association between intervention and patient outcome, thus resulting in 2x2 tables with data concentrated along the main or off-diagonal. Likewise, epidemiologists pursue situations of high association between exposition and effect on populational diseases, usually low in prevalence, resulting in 2x2 tables with relatively empty cells corresponding to affected people.

Conversely, balanced tables add little to scientific knowledge, for they would show the inability to obtain consistent measures from a given method, absence of treatment effect, or lack of relationship between exposition and effect. No one starts a scientific study unless the hypothesis is solid enough and the method is designed to control confounding variables. When, after all the researcher’s best efforts, the 2x2 table results are well balanced, statistical tests will show non-rejection of the null hypothesis and the study will have failed to confirm one’s expectation, leaving conclusions dubious either by lack of effect or by sample insufficiency.

These few examples show that imbalanced tables are the rule, not the exception. Estimators for agreement or disagreement between observations or observers (generically named ‘raters’ along this text) must therefore be robust to extreme tables in order to provide inferential support to researchers.

Agreement coefficients are fundamental statistics. Except for Francis Galton and Karl Pearson starting in the 1880s, who created the correlation coefficient (without the intention to apply it to assess agreement, although the computation in this work shows its equivalence), and the pioneering work of Yule (1912), who created the Q and Y coefficients explicitly intending to measure the association between nominal variables, there was great interest in the creation of agreement coefficients between the 1940s and 1970s (Dice, 1945; Cramér, 1946; McNemar, 1947; Scott, 1955; Cohen, 1960; Holley & Guilford, 1964; Matthews, 1975; Hubert, 1977), including most of the coefficients in use nowadays. Renewed interest appears in this century, with the creation of new coefficients searching for improvement and avoidance of known flaws of the older propositions (Gwet, 2008, 2010; Shankar & Bangdiwala, 2014), or attempts to improve traditional estimators (e.g., Lu, 2010; Lu, Wang, & Zhang, 2017).

The two most used coefficients explored here are the widespread and famous Cohen’s κ, which was the departure point of the present work, and McNemar’s χ2, praised in epidemiology textbooks (Kirkwood & Sterne, 2003, p. 218). Here we exhaustively analyzed all possible tables with total sizes from 1 to 68, totaling 1,028,789 different contingency tables, to create a comprehensive map of several traditional association or agreement coefficients. By ‘exhaustive’ we mean the computation of all possible configurations of 2x2 tables, given the number of observations. The range matters to cover different qualitative scenarios (see “Methods, Table range”).

This study includes Cohen’s κ, Holley and Guilford’s G, Yule’s Q, Yule’s Y, Pearson’s r, McNemar’s χ2, Scott’s π, Krippendorff’s α, Dice’s F1, Shankar and Bangdiwala’s B, and Gwet’s AC1. We found that some estimators may compute improperly with imbalanced tables due to higher counts concentrated in one or more table cells, that the traditional Cohen’s κ and McNemar’s χ2 are problematic, and that G and AC1 are the best agreement coefficients.

Methods

Notation

To simplify mathematical notation along this text, a convention was adopted according to Table 1. A and B are representations of events (positive observer evaluation, existence of disease, exposition or effect, etc.), and Ā and B̄ are their respective negations (negative observation, healthy individuals, absence of exposition or effect, etc.). In the main diagonal, a and d are counts or proportions of positive and negative agreements. In the off-diagonal, b and c are counts or proportions of disagreements. The total sample size along this text is n = a + b + c + d.

Table 1  Convention for 2x2 tables

        B      B̄
A       a      b      a+b
Ā       c      d      c+d
        a+c    b+d    a+b+c+d

Table range

All estimators selected below were exhaustively studied with tables from 1 to 68 observations, totaling 1,028,789 tables. Each of these tables is different, containing all possible configurations of a, b, c, d given the total number of observations, n. For instance, with n = 1 there are four possible tables, with n = 2 there are ten possible tables, and with n = 68 there are 57,155 possible tables.
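The enumeration just described is implemented in the paper as the R function agr2x2_gentablen(n) (see “Analysis methods”). As a minimal sketch of the same idea, a Python equivalent (function name hypothetical) reproduces the counts quoted above; the number of tables of size n is the stars-and-bars count C(n+3, 3):

```python
from itertools import count  # unused placeholder removed below
from math import comb

def gen_tables(n):
    """All (a, b, c, d) with a + b + c + d == n: every 2x2 table of size n."""
    return [(a, b, c, n - a - b - c)
            for a in range(n + 1)
            for b in range(n - a + 1)
            for c in range(n - a - b + 1)]

# Counts per size follow C(n+3, 3) ("stars and bars"):
assert len(gen_tables(1)) == comb(4, 3) == 4
assert len(gen_tables(2)) == comb(5, 3) == 10
assert len(gen_tables(68)) == comb(71, 3) == 57155

# Summing over n = 1..68 reproduces the paper's 1,028,789 tables:
total = sum(comb(n + 3, 3) for n in range(1, 69))
print(total)  # 1028789
```

The closed-form count makes it cheap to verify the exhaustiveness of the enumeration without materializing every table.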


This range is important because it covers small to large samples, comfortably surpassing some traditional cutoff sizes (e.g., n > 15 for the exact Fisher test, n > 20 for Pearson’s χ2, n > 20 for Spearman’s ρ, n > 35 for McNemar’s χ2), beyond which asymptotic tests and the central limit theorem are valid (Siegel & Castellan, 1988). Simulations and essential estimators presented here were implemented and provided to allow replication of our findings in R, a free statistical language (R Core Team, 2022).

Cohen’s kappa

This test was originally published by Cohen (1960). The original publication proposed a statistical method to verify agreement, typically applied to the comparison of observers or of the results of two measurement methods, such as laboratory results (raters). The kappa (κ) statistic is given by

    κ = (po − pc) / (1 − pc)    (1)

where po is the realized proportion of agreement and pc is the proportion expected by chance (i.e., under the assumption of the null hypothesis) in which both raters agree (i.e., the sum of proportions along the matrix main diagonal).

For 2x2 tables, in more concrete terms,

    po = (a + d) / n
    pc = [(a + b)(a + c) + (c + d)(b + d)] / n²    (2)

Thus, Cohen’s κ can be computed, showing tension between the main (ad) and off (bc) diagonals, by

    κ = 2(ad − bc) / [(a + c)(c + d) + (b + d)(a + b)]    (3)

Historically, the first contingency table presented in the seminal paper of Cohen (1960) shows two clinical psychologists classifying individuals into schizophrenic, neurotic, or brain-damaged categories. This table is an example of a 3x3 table with proportions of agreement or disagreement between raters, while researchers more often pursue agreement in 2x2 tables. It is followed by a lengthy discussion of traditional measures of agreement such as Pearson’s χ2 and the contingency coefficient. Curiously, for this first contingency table, Cohen only computed χ2 and reported it as significant, concluding for the existence of association but arguing that χ2 is not a defensible measurement of agreement. However, in 2x2 tables both association and agreement are coincident and provide equal p values, thus Pearson’s χ2 and Cohen’s κ are equivalent (Feingold, 1992), which may suggest that, at least in 2x2 tables, Cohen’s κ is a mere test of association. The computation of κ was not presented in Cohen’s paper for this first 3x3 table, thus we computed κ = −0.092, concluding that this is an example of slight disagreement between the raters, an unfortunate initial example for someone presenting a new measure of agreement. There follows the introduction of the κ calculation, mentioning other similar statistics such as Scott’s π, which was published in 1955 (Scott, 1955). Then a second 3x3 contingency table is presented, bringing the same storyline of two psychologists classifying patients into three categories, for which the numbers were subtly amended, increasing two out of three values in the main diagonal and decreasing five out of six values in the other table cells, to compute χ2 and κ in a scenario of raters’ agreement.

In addition, when the intensity of agreement is to be qualified, Cohen’s recommendation is to compute the kappa statistic’s maximum value, κM, permitted by the marginals, and to correct the agreement value of κ by κ/κM, which was largely forgotten in the literature (Sim & Wright, 2005). The maximum κ is given by:

    κM = (poM − pc) / (1 − pc)    (4)

where poM is the sum of minimal marginal values taken in pairs. For 2x2 tables it is

    poM = min((a + c), (a + b)) + min((b + d), (c + d))    (5)

Cohen did not propose correction by any lower bound when κ < 0 (i.e., rater disagreement), stating that it is more complicated and depends on the marginal values. For this lower limit of κ, we quote:

    “The lower limit of K is more complicated since it depends on the marginal distributions. [...] Since κ is used as a measure of agreement, the complexities of its lower limit are of primarily academic interest. It is of importance that its upper limit is 1.00. If it is less than zero (i.e., the observed agreement is less than expected by chance), it is likely to be of no further practical interest.” (Cohen, 1960)

From this consideration, we have decided not to attempt any correction for negative values of κ. Along this text we refer to Cohen’s κ normalized, when κ > 0, as “Cohen’s κ corrected by κM.” After this correction, the intensity of κ can be qualified by Table 2, but there is an additional complication, for the criteria vary according to different authors (Wongpakaran, Wongpakaran, Wedding, & Gwet, 2013).
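As a concrete check of Eqs. 1–5, the computation can be sketched as follows (a Python sketch with hypothetical function names; the paper’s own implementation is in R). The abcd shortcut of Eq. 3 must agree with the po/pc definition of Eqs. 1–2:

```python
def cohen_kappa(a, b, c, d):
    """Cohen's kappa via Eqs. 1-2: (po - pc) / (1 - pc)."""
    n = a + b + c + d
    po = (a + d) / n
    pc = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2
    return (po - pc) / (1 - pc)

def cohen_kappa_abcd(a, b, c, d):
    """Equivalent shortcut of Eq. 3, exposing the ad - bc tension."""
    return 2 * (a * d - b * c) / ((a + c) * (c + d) + (b + d) * (a + b))

def kappa_max(a, b, c, d):
    """Maximum kappa permitted by the marginals, Eqs. 4-5 (as a proportion)."""
    n = a + b + c + d
    poM = (min(a + c, a + b) + min(b + d, c + d)) / n
    pc = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2
    return (poM - pc) / (1 - pc)

a, b, c, d = 40, 10, 5, 45  # an illustrative table, not from the paper
assert abs(cohen_kappa(a, b, c, d) - cohen_kappa_abcd(a, b, c, d)) < 1e-12
k, kM = cohen_kappa(a, b, c, d), kappa_max(a, b, c, d)
print(k, kM, k / kM)  # kappa, its marginal maximum, and the corrected kappa
```

For this table, κ = 0.7 while the marginals only permit κM = 0.9, so the corrected value κ/κM ≈ 0.78.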


Table 2  Criteria for qualitative categorization of κ according to different authors (modified from Wongpakaran et al., 2013)

κ           Landis and Koch    κ           Altman       κ           Fleiss
[-1.0,0.0)  Poor
[0.0,0.2)   Slight             [-1.0,0.2)  Poor
[0.2,0.4)   Fair               [0.2,0.4)   Fair         [-1.0,0.4)  Poor
[0.4,0.6)   Moderate           [0.4,0.6)   Moderate
[0.6,0.8)   Substantial        [0.6,0.8)   Good         [0.4,0.75)  Intermediate to good
[0.8,1.0]   Almost perfect     [0.8,1.0]   Very good    [0.75,1.0]  Excellent
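For illustration, one column of Table 2 can be encoded as a simple lookup; the sketch below uses the Landis and Koch bands (the function name is hypothetical):

```python
def landis_koch(kappa):
    """Map a kappa value to the Landis and Koch labels of Table 2."""
    if kappa < 0.0:
        return "Poor"
    if kappa < 0.2:
        return "Slight"
    if kappa < 0.4:
        return "Fair"
    if kappa < 0.6:
        return "Moderate"
    if kappa < 0.8:
        return "Substantial"
    return "Almost perfect"  # [0.8, 1.0]

print(landis_koch(0.45), landis_koch(0.85))  # Moderate Almost perfect
```

Swapping the thresholds and labels gives the Altman or Fleiss columns instead.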

Some other estimators are redundant and were not analyzed, for varied reasons:

• Janson and Vegelius’ J, Daniel–Kendall’s generalized correlation coefficient, Vegelius’ E-correlation, and Hubert’s Γ (see “Holley and Guilford’s G”);
• Goodman–Kruskal’s γ, odds ratio, and risk ratio (see “Yule’s Q”);
• Pearson’s χ2, Yule’s ϕ, Cramér’s V, Matthews’ correlation coefficient, and Pearson’s contingency coefficient (see “Cramér’s V” and “Pearson’s r”);
• Fleiss’ κ (see “Scott’s π”).

All these alternatives are effect-size measures, therefore independent of the sample size, n. A brief description of each test in the present context follows.

Holley and Guilford’s G

One of the simplest approaches to a 2x2 table was proposed by Holley & Guilford (1964), given by

    G = [(a + d) − (b + c)] / (a + b + c + d)    (6)

A generalized agreement index is J (Janson & Vegelius, 1982), which can be applied to larger tables. However, in 2x2 tables it reduces to J = G², thus the performance of J was excluded from the current analysis.

Another proposed coefficient, Hubert’s Γ (1977), is a special case of Daniel–Kendall’s generalized correlation coefficient and of Vegelius’ E-correlation (Janson & Vegelius, 1982). For 2x2 tables it is computed by

    Γ = 1 − 4(a + d)(b + c) / n²    (7)

Although it looks like another coefficient, it is possible to show its equivalency to:

    Γ = [((a + d) − (b + c)) / (a + b + c + d)]² = G²    (8)

Since it is redundant to both J and Holley and Guilford’s G, the analysis of the Γ coefficient is also not required.

According to Zhao, Liu, & Deng (2013), the G index was independently rediscovered by other authors. In particular, it was preceded by Bennett, Alpert, & Goldstein’s (1954) S, which is computed by

    S = [k / (k − 1)] (P − 1/k)    (9)

where k is the number of categories. Being k = 2 and P = (a + d)/(a + b + c + d) in 2x2 tables, from Eq. 9 it is derived that:

    S = [(a + d) − (b + c)] / (a + b + c + d)    (10)

Therefore, S is equal to G. However, here we kept the credit to Holley and Guilford’s G for historical reasons. The literature attributes this measure to the latter authors. While Bennett et al. (1954) were chiefly focused on the question of consistency and presented it in the form of Eq. 9, Holley and Guilford extensively divulged and applied this index in terms of inter-rater reliability, being the first authors to formulate this coefficient for 2x2 tables in the abcd terms presented in Eq. 6.

In addition to that, Lienert (1972) proposed an inferential asymptotic statistical test for Holley & Guilford’s (1964) G, computing:

    u = (a + d − n/2) / √(n/4)    (11)

For large samples (n > 30), the statistic (a + d) has an approximately normal distribution with mean n/2 and variance n/4; consequently, u has a standard z distribution, from which we can compute the corresponding two-sided p values for the statistical decision under the null hypothesis, H0: G = 0. This inferential decision became necessary for the present study because this coefficient was elected as the benchmark, for reasons exposed in the Section “Results”.
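Equations 6 and 11 are straightforward to implement. A Python sketch (function names hypothetical; the paper’s R implementation is in the Supplemental Material) computes G and Lienert’s asymptotic test, obtaining the normal CDF from the error function:

```python
from math import erf, sqrt

def g_index(a, b, c, d):
    """Holley and Guilford's G (Eq. 6); identical to Bennett's S when k = 2."""
    return ((a + d) - (b + c)) / (a + b + c + d)

def g_test(a, b, c, d):
    """Lienert's asymptotic test of H0: G = 0 (Eq. 11).

    Returns (u, p): the z statistic and its two-sided p value."""
    n = a + b + c + d
    u = (a + d - n / 2) / sqrt(n / 4)
    p = 2 * (1 - 0.5 * (1 + erf(abs(u) / sqrt(2))))  # 2 * P(Z > |u|)
    return u, p

a, b, c, d = 30, 5, 5, 25  # an illustrative table, not from the paper
print(g_index(a, b, c, d))  # (55 - 10) / 65
u, p = g_test(a, b, c, d)
print(u, p)  # large u, small p: agreement beyond chance
```

Here n = 65 > 30, so the asymptotic normal approximation quoted above is reasonable.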


Yule’s Q

Goodman–Kruskal’s γ measures the association between ordinal variables. In the special case of 2x2 tables, Goodman–Kruskal’s γ corresponds to Yule’s Q (1912), also known as Yule’s coefficient of association. It can be computed by

    Q = (ad − bc) / (ad + bc)    (12)

Yule’s Q is also related to the odds ratio, which was neither conceived nor applied as an agreement measure (although it could be). The relationship is

    OR = ad / bc = (1 + Q) / (1 − Q)    (13)

It is recommended to express OR as a logarithm. Therefore,

    log(OR) = log(ad) − log(bc)    (14)

Again, it is possible to observe that OR and Q are statistics from the same tension-between-diagonals family.

The risk ratio (RR) also belongs to this family, being defined (assuming exposition in rows and outcome in columns of a 2x2 table) as the probability of outcome among exposed individuals relative to the probability of outcome among non-exposed individuals. Since RR describes only the probability ratio of occurrence of outcomes, we propose to define it as a positive risk ratio, computed by

    RR+ = [a / (a + b)] / [c / (c + d)]    (15)

To our knowledge, it is not usual in epidemiology to define a negative risk ratio (the ratio between the probabilities of absence of outcome among exposed individuals and of absence of outcome among non-exposed individuals), which should be conceived as

    RR− = [b / (a + b)] / [d / (c + d)]    (16)

Consequently:

    OR = RR+ / RR− = ad / bc    (17)

Therefore, it is arguable that the traditional RR+ is a somewhat incomplete measure of agreement, for it does not explore all the information of a 2x2 table when compared to OR.

Both RR and OR are transformations of Q, inheriting its characteristics. For that reason, only Q is analyzed in this work.

Yule’s Y

The coefficient of colligation, Y, was also developed by Yule (1912). It is computed by

    Y = (√(ad) − √(bc)) / (√(ad) + √(bc))    (18)

which is a variant of Yule’s Q. Here, each term can be interpreted as a geometric mean.

Cramér’s V

The traditional Pearson’s chi-squared (χ2) test can be computed by

    X² = (ad − bc)²(a + b + c + d) / [(a + b)(c + d)(a + c)(b + d)]    (19)

This formulation is interesting because it reveals the χ2 statistic containing the tension between diagonals, ad − bc.

For the special case of 2x2 tables, the absolute value of κ and χ2 are associated (Feingold, 1992). However, the χ2 statistic is not an effect-size measurement because it depends on the sample size; thus, in its pure form, χ2 does not belong to the agreement family of coefficients. In order to remove the sample-size dependence and turn the χ2 statistic into an effect-size measurement, it should be divided by n = a + b + c + d. The square root of this transformation is Cramér’s V (1946), computed by

    V = √(X² / n) = |ad − bc| / √[(a + b)(c + d)(a + c)(b + d)]    (20)

Cramér’s V can also be regarded as the absolute value of an implicit Pearson’s correlation between nominal variables or, in other words, an effect-size measure ranging from 0 to 1. Cramér’s V, in other words, is the tension between diagonals, ad − bc (which is the 2x2 matrix determinant), normalized by the product of the marginals (a + b)(c + d)(a + c)(b + d).

Pearson’s r

There are many relations and mathematical identities among coefficients that converge to Pearson’s correlation coefficient, r.

Matthews’ correlation coefficient is a measure of association between dichotomous variables (Matthews, 1975), also based on the χ2 statistic. Since it is defined from the assessment of true and false positives and negatives, it is regarded as a measure of agreement between measurement methods. It
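The identities in Eqs. 12, 13, 18, and 23 can be checked numerically; a Python sketch (function names hypothetical) on an arbitrary illustrative table:

```python
from math import sqrt

def yule_q(a, b, c, d):
    """Yule's Q (Eq. 12): tension between diagonals ad and bc."""
    return (a * d - b * c) / (a * d + b * c)

def yule_y(a, b, c, d):
    """Yule's coefficient of colligation Y (Eq. 18), with geometric means."""
    return (sqrt(a * d) - sqrt(b * c)) / (sqrt(a * d) + sqrt(b * c))

def pearson_r(a, b, c, d):
    """Pearson's r, equal to rho, tau, and phi for 2x2 tables (Eq. 23)."""
    return (a * d - b * c) / sqrt((a + b) * (a + c) * (b + d) * (c + d))

a, b, c, d = 4, 1, 2, 3  # an illustrative table, not from the paper
q = yule_q(a, b, c, d)
odds_ratio = (a * d) / (b * c)
# Eq. 13: OR = (1 + Q) / (1 - Q)
assert abs(odds_ratio - (1 + q) / (1 - q)) < 1e-12
# Cramer's V (Eq. 20) is simply |r| in a 2x2 table:
cramer_v = abs(a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))
assert abs(cramer_v - abs(pearson_r(a, b, c, d))) < 1e-12
print(q, yule_y(a, b, c, d), pearson_r(a, b, c, d))
```

The asserts make the family relations explicit: OR is a monotone transformation of Q, and V collapses onto |r|.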


happens that Matthews’ correlation coefficient is identical to Pearson’s ϕ coefficient and Yule’s ϕ coefficient (Cohen, 1968), computed by

    ϕ = (ad − bc) / √[(a + b)(a + c)(b + d)(c + d)]    (21)

Another famous estimator is Pearson’s contingency coefficient, usually defined from the chi-squared statistic and also expressed as a function of ϕ by

    PCC = √[X² / (X² + n)] = √[ϕ² / (ϕ² + 1)]    (22)

It was not included in this analysis for two reasons: it is not taken as an agreement coefficient, and its value ranges from zero to √(1/2) ≈ 0.707, which makes this coefficient not promising in the current context.

Two other correlation coefficients, Spearman’s ρ and Kendall’s τ, also provide the same values as Pearson’s r for 2x2 tables. In the notation adopted here:

    r = ρ = τ = (ad − bc) / √[(a + b)(a + c)(b + d)(c + d)]    (23)

From Eq. 23 other coincidences are observed:

• Equation 21 shows that ϕ = r for 2x2 tables, thus Matthews’ correlation coefficient and Pearson’s contingency coefficient also share the properties of r.
• Equation 20 shows that Cramér’s V is merely the absolute value of Pearson’s ϕ coefficient (Matthews, 1975), which is, in turn, equal to r.

Consequently, for the current work, only Pearson’s r is computed as representative of all these other estimators.

McNemar’s chi-squared

This test was created by McNemar (1947) and became known in the literature as McNemar’s χ2. It is applicable to 2x2 tables by

    X²MN = (b − c)² / (b + c)    (24)

to assess marginal homogeneity. A typical example is the verification of change before and after an intervention, such as the disappearance of disease under treatment in a given number of subjects.

The inferential test is based on the probability ratio and confidence interval given by b/c; the null hypothesis is rejected when the unitary value is not included in its 95% confidence interval. In other words, McNemar’s χ2 resembles an odds ratio, but it is incomplete for not including the main diagonal.

This traditional McNemar’s χ2 cannot be directly confronted with other estimators because it is not restricted to the interval [−1,1]. It can be normalized to show values between 0 and 1, as an effect-size measurement, if divided by |b − c|, resulting in

    MN = [(b − c)² / (b + c)] / |b − c| = |b − c| / (b + c)    (25)

It is noteworthy that both McNemar’s χ2 and its normalized correspondent MN are even more partial than RR, for they use only the information of the off-diagonal. Problems caused by such weakness are explored below.

Two recent revisions aimed to include information from the main diagonal to improve the traditional McNemar’s χ2 (Lu, 2010; Lu et al., 2017). The first revision (Lu, 2010) computes:

    X²MN(2010) = (b − c)² / [1 + (a + d)/(a + b + c + d)]    (26)

To show the obtained improvement, this author applied Pearson’s χ2 statistics and the exact binomial test to compare five exemplary tables, fixing b and c and varying a + d.

The second revision (Lu et al., 2017) aimed at a further improvement, in order to differentiate situations in which a or d dominates the main diagonal, by proposing:

    X²MN(2017) = (a + b + c + d)(b − c)² / [(2a + b + c)(2d + b + c)]    (27)

Following the same strategy, these authors compared Pearson’s χ2 statistics of several tables with two fixed values of b and c and varied values of a and d, trying to show the improvement of this new equation.

Since, to our knowledge, there is no implementation of these methods in R packages with inferential statistics, we may apply bootstrapping (Efron, 2007) to reject the null hypothesis when X² = 0 does not belong to the 95% prediction HDI interval (see “Analysis methods” for details).

Scott’s π

This statistic is similar to Cohen’s κ, measuring inter-rater reliability for nominal variables (Scott, 1955). It applies the same equation as κ but changes the estimation of pc, using squared joint proportions, computed as

    π = (po − pc) / (1 − pc)    (28)

where


    pc = [((a + c) + (a + b)) / (2n)]² + [((c + d) + (b + d)) / (2n)]²    (29)

thus

    π = [ad − ((b + c)/2)²] / {[a + (b + c)/2][d + (b + c)/2]}    (30)

Besides the tension between the diagonals shown in the numerator of this expression (ad vs. b + c), Scott’s π also coincides with Fleiss’ κ in the special case of 2x2 tables; thus our analysis is restricted to Scott’s π.

There is also Krippendorff’s α (1970). It converges, at least asymptotically, to Scott’s π in 2x2 tables with nominal categories and two raters (Wikipedia, 2021). The structure of several other coefficients may be unified in a single structure comparing the observed and expected proportions or probabilities (Feng, 2015). This generalized structure, of whose family Krippendorff’s α is part (Gwet, 2011; Krippendorff, 2011), may become equivalent to different coefficients of the so-called S family (including Bennett’s S, Holley and Guilford’s G, Cohen’s κ, Scott’s π, and Gwet’s AC1), depending on what assumptions are adopted for the chance agreement (Feng, 2015; Krippendorff, 2016; Feng & Zhao, 2016). This coefficient deals with any number of categories or raters (Krippendorff, 2011) and is provided by a combinatory computational procedure (Gwet, 2011; Krippendorff, 2011). Although it was not possible, so far, to deduce a feasible general abcd formula, it was tested with the implementation by Hughes (2021) to verify whether its pattern is close to Scott’s π, as predicted by theory, for 2x2 tables with nominal categories (see Supplemental Material, “Implementation of Krippendorff’s α” for details).

Dice’s F1

Also known as the F-score or F-measure, it was first developed by Dice (1945) as a measure of test accuracy; therefore, it can be assumed to be an agreement statistic in the same sense as Matthews’ correlation coefficient described above. It is computed by

    F1 = a / [a + (b + c)/2] = 2a / (2a + b + c)    (31)

F1 has been suggested as a suitable agreement measure to replace Cohen’s κ in medical situations (Hripcsak & Rothschild, 2005), thus its analysis was included here.

There are fundamental differences between F1 and all the other agreement statistics. It does not belong to the tension-between-diagonals family, for it computes only the proportion between the positive agreement (a) and the upper-left triangle of a 2x2 table.

In addition, F1 is difficult to compare a priori with other measurements because it ranges from F1 = 0 if a = 0 (disagreement) to F1 = 1 if b = 0 and c = 0 (agreement), in both cases neglecting agreement in negative counts (d). The range of F1 being shorter than that of the other measurements, neutral situations (i.e., when there is neither agreement nor disagreement) should find F1 ≈ 0.5, while the concurrent measurements presented here should provide zero. In order to make its range more comparable, we propose to rescale it to the interval [−1,1] by

    F1adj = 2·F1 − 1 = [2a − (b + c)] / [2a + (b + c)]    (32)

which adjusts the F1 range to span from −1 to 1 without changing its original pattern. Interestingly, this rescaling created a partial tension between the positive agreement and the off-diagonal that was not present in the original presentation of F1.

Shankar and Bangdiwala’s B

This statistic was proposed to assess 2x2 tables, with reported good performance (Shankar & Bangdiwala, 2014). It is computed by

    B = (a² + d²) / [(a + c)(a + b) + (b + d)(c + d)]    (33)

Similar to Dice’s F1, this estimator ranges from 0 to 1, with 0 corresponding to disagreement, 0.5 to neutrality, and 1 to agreement. Therefore, we also propose to explore its adjustment by scaling to the range [−1,1] with:

    Badj = 2B − 1 = [a² + d² − (2bc + (a + d)(b + c))] / [a² + d² + (2bc + (a + d)(b + c))]    (34)

Gwet’s AC1

This first-order agreement coefficient (AC1) was developed by Gwet (2008) as an attempt to correct Cohen’s κ distortions when there is high or low agreement. AC1 seems to have better performance than Cohen’s κ in assessing the inter-rater reliability of personality disorders (Wongpakaran et al., 2013; Xie, Gadepalli, & Cheetham, 2017) and has been applied in information retrieval (Manning, Raghavan, & Schutze, 2008). It is computed by

    AC1 = [a² + d² − (b + c)²/2] / [a² + d² + (b + c)²/2 + (a + d)(b + c)]    (35)

AC1 somewhat reflects the tension between diagonals, since the values of a and d are added and the values of b and c are subtracted in the numerator.

Association and agreement

Part of the problem is to define what is really being measured by any of these coefficients. Lienert (1972) argued that association tests are not of the same nature as tests for agreement. Association tests the null hypothesis

    H0,ind : (α·δ) − (β·γ) = 0    (36)

while agreement assesses

    H0,agr : (α + δ) − (β + γ) = 0    (37)

where α, β, γ, and δ are the populational proportions respectively estimated by a, b, c, and d (see the notation in Table 1). Association tests (of which Cohen’s κ, Pearson’s ϕ, and other Pearson’s χ2-based statistics are representatives) and agreement tests (of which G is a representative) are, therefore, sensitive to different types of association (Shreiner, 1980).

In the same line of reasoning, Pearson’s χ2 and the contingency coefficient, as well as the McNemar test (McNemar, 1947), were previously criticized by Cohen (1968), who stated that association does not necessarily imply agreement for any table size. This aspect will be further discussed below. The question is to know how these coefficients are related, when they measure association or agreement, and when their measures are coincident or discrepant.

Holley and Guilford (1964) showed that G is equal to Pearson’s ϕ coefficient only when the marginal values are (a + b)/n = (a + c)/n = (b + d)/n = (c + d)/n = 0.5, where n = a + b + c + d, a condition in which G = ϕ = 0 but also κ = 0. That is to say that κ and ϕ are, otherwise, entities different from G, with potentially different performances in detecting association or agreement in 2x2 contingency tables. ϕ and G are related (Holley & Guilford, 1964) by

    ϕ = [G/4 − ((a + b)/n − 1/2)((a + c)/n − 1/2)] / √[(a + b)/n · (a + c)/n · (b + d)/n · (c + d)/n]    (38)

and κ is related to G (Green, 1981) by

    κ = [(G + 1)/2 − pc] / (1 − pc)    (39)

The parcel pc, included in the computation of κ, is known in the literature as the ‘chance correction factor’. Since κ includes more parcels, one would expect that the performance of κ should exceed that of G. However, Green (1981) stated that “Because the standard equation for kappa clearly [...] marginals”, κ and ϕ underestimate agreement, while G is a stable estimator (Shreiner, 1980).

Analysis methods

In order to assess the performance of the different agreement coefficients, we first developed a comparative analysis of all these statistics by challenging them with fabricated 2x2 tables, considering scenarios with more balanced tables and then with more extreme tables containing 0 or 1 in some cells. These analyses confront the intuition of agreement and the measures among all estimates (see “Results, Challenge tables”).

A second step is the analysis of all possible tables with n = 64 (see the section “Inferential statistics: tables with n = 64” for details), verifying the coincidence and stability of the statistical decisions taken from the estimators. For that, we developed an R function, agr2x2_gentablen(n), which can create all combinations of 2x2 tables with n observations. This function and examples of its use are available in the Supplemental Material (see “Creating all 2x2 tables with size n”). To compare the performance of the estimators, not only the estimate of the agreement coefficients but also the decision by inferential statistics is required. Thus, inferential statistics for all proposed estimators were computed by applying R functions from selected packages, when available. When no function was available, the confidence interval was computed by bootstrapping (this procedure is described for the simple agreement coefficient, SAC, in the Supplemental Material; see “Implementation of Holley and Guilford’s G”, using the results of the Bell and Kato-Katz examination, Hoff et al. (1982)). Exhaustive testing showed that Holley and Guilford’s G, among all the studied estimators, minimized the discrepancy of inferential decisions from the others, and it was selected as the benchmark (see “Figures and performance checking” for details).

Finally, in “Comprehensive maps: all tables with 1 ≤ n ≤ 68” we exhaustively mapped all estimates from all possible tables created in this range of sizes, to show the entire region that each concurrent estimator covers. Tables with sizes ranging from 1 to 68 were generated (see Supplemental Material, “Creating all 2x2 tables with size n”), which resulted in a little more than 1 million (exactly 1,028,789) different 2x2 tables covering all possible configurations of ‘abcd’. A global measurement of estimator quality was computed by the Pearson’s and Spearman’s correlations between Holley and Guilford’s G and the other estimators across all 1,028,789 tables. Pearson’s correlation (not to be confounded with the application of Pearson’s r to the agreement coefficient under investigation here) assesses linear
trend, while Spearman’s assesses the monotonic trend of
includes a chance correction factor, many authors [...] have
each pair of estimators. These correlations were separately
suggested its usage. Unfortunately, chance has not been
estimated for each n (from 1 to 68) and, then, lower (HDI
explicitly defined.” It was also shown that, under skewed
LB) and upper (HDI UB) bounds of the 95% highest density

13
Behavior Research Methods

interval (HDI95%) were obtained (see Supplemental Material, "Computation of Table 6"). HDI95% is the smallest prediction interval that contains 95% of the probability density function (Hyndman, 1996).

All R source codes and data presented in this work, and some others not shown in the main text, are hosted by Harvard Dataverse https://doi.org/10.7910/DVN/HMYTCK

Results

Challenge tables

A set of 17 tables was chosen to represent several scenarios: eight of agreement, two of neutrality, and seven of disagreement in several 2x2 configurations (Tables 3 and 4).

In Table 3, regular scenarios were tested with three levels of agreement, three levels of disagreement, two neutral scenarios, and a parallel scenario where b = 2a and d = 2c. The intention here is to confront one's intuition with the values provided by all estimators assessed in the current work.

It may be interesting to observe Table 3 by columns:

• The first column indicates the test names and the count of problems (f/t) across the selected nine scenarios.
• Many tests provide a reasonable value for the scenarios of high agreement, except Cohen's κ corrected by κM (the maximum kappa, as recommended by the original author) and Yule's Q (exaggerated values), Shankar and Bangdiwala's B (underestimated when the adjustment is applied), and normalized McNemar's χ2 (estimated as zero when b = c, or underestimated when b and c are too close).
• The low agreement scenario shows that normalized McNemar's χ2 underestimated, while corrected Cohen's κ and Yule's Q overestimated, values in comparison with the other tests. Adjusted Shankar and Bangdiwala's B mistakenly deemed this table as disagreement.
• The next three scenarios provide disagreement scenarios that are the reverse of the previous three. Yule's Q reveals the same exaggeration for high disagreement (Cohen's κ has no proposed correction for negative values). Normalized McNemar's χ2 and Shankar and Bangdiwala's B provide only positive values. If normalized McNemar's χ2 is taken by its absolute number, disagreement was underestimated. Rescaled Shankar and Bangdiwala's B shows that disagreement was overestimated.
• The first neutral scenario has all values equal. Raw Dice's F1 is equal to 0.5 (as expected), providing zero when rescaled. Shankar and Bangdiwala's B, however, could not deal with this completely neutral scenario, showing disagreement when rescaled.
• The next neutral scenario caused major problems for normalized McNemar's χ2, Shankar and Bangdiwala's B, and Dice's F1, while Scott's π and Gwet's AC1 slightly deviated from zero.
• The last scenario is a special situation of low disagreement that caused problems for many estimators. The matrix determinant of a parallel 2x2 table is null, and all estimators belonging to the tension-between-diagonals family become, therefore, null. The slight disagreement was equally captured by G, Gwet's AC1, Scott's π and

Table 3  Test performance of estimators with varied scenarios offering 2x2 regular tables; bold text in the original indicates discrepant estimates; f/t is the count of discrepancies (failures) over the scenarios (trials)

Scenarios, with cells given as [a b / c d]:
agr.high [90 10 / 10 90]; agr.high [90 11 / 9 90]; agr.low [60 41 / 39 60]; dis.high [10 90 / 90 10]; dis.high [10 91 / 89 10]; dis.low [41 60 / 60 39]; neutral [50 50 / 50 50]; neutral [75 25 / 75 25]; dis. a/c=b/d [44 88 / 22 44]

Estimator (f/t), followed by the estimates in scenario order:
Holley and Guilford's G (0/8): 0.80000, 0.80000, 0.20000, –0.80000, –0.80000, –0.20000, 0.00000, 0.00000, –0.11111
Gwet's AC1 (1/9): 0.80000, 0.80000, 0.20000, –0.80000, –0.80000, –0.19988, 0.00000, 0.05882, –0.11111
Scott's π (1/9): 0.80000, 0.80000, 0.20000, –0.80000, –0.80000, –0.20012, 0.00000, –0.06667, –0.11111
Krippendorff's α (1/9): 0.80050, 0.80050, 0.20200, –0.79550, –0.79550, –0.19712, 0.00250, –0.06400, –0.10831
Cohen's κ (1/9): 0.80000, 0.80002, 0.20008, –0.80000, –0.79982, –0.20012, 0.00000, 0.00000, 0.00000
Cohen's κ corrected by κM (4/9): 1.00000, 0.98000, 0.98000, –0.80000, –0.79982, –0.20012, 0.00000, 0.00000, 0.00000
Pearson's r (1/9): 0.80000, 0.80018, 0.20012, –0.80000, –0.79998, –0.20012, 0.00000, 0.00000, 0.00000
Yule's Q (7/9): 0.97561, 0.97585, 0.38488, –0.97561, –0.97561, –0.38488, 0.00000, 0.00000, 0.00000
Yule's Y (1/9): 0.80000, 0.80090, 0.20015, –0.80000, –0.79999, –0.20015, 0.00000, 0.00000, 0.00000
Shankar and Bangdiwala's B: 0.81000, 0.81008, 0.36004, 0.01000, 0.01000, 0.16008, 0.25000, 0.31250, 0.22222
  verified by rescaled B (9/9): 0.62000, 0.62016, –0.27993, –0.98000, –0.98000, –0.67983, –0.50000, –0.37500, –0.55556
Dice's F1: 0.90000, 0.90000, 0.60000, 0.10000, 0.10000, 0.40594, 0.50000, 0.60000, 0.44444
  verified by rescaled F1 (1/9): 0.80000, 0.80000, 0.20000, –0.80000, –0.80000, –0.18812, 0.00000, 0.20000, –0.11111
Normalized McNemar's χ2 (8/9): 0.00000, 0.10000, 0.02500, 0.00000, 0.01111, 0.00000, 0.00000, 0.50000, 0.60000
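The G column of Table 3 can be reproduced directly from the scenario matrices, since Holley and Guilford's G reduces to (a + d − b − c)/n. The sketch below is illustrative Python (the paper's own implementation is in R, in the Supplemental Material):

```python
# Holley and Guilford's G for the nine Table 3 scenarios:
# G = (a + d - b - c) / n, with n = a + b + c + d.
scenarios = [  # (a, b, c, d) read row-wise from each 2x2 table
    (90, 10, 10, 90), (90, 11, 9, 90), (60, 41, 39, 60),
    (10, 90, 90, 10), (10, 91, 89, 10), (41, 60, 60, 39),
    (50, 50, 50, 50), (75, 25, 75, 25), (44, 88, 22, 44),
]

def g_index(a, b, c, d):
    """Holley and Guilford's G: agreement minus disagreement proportion."""
    return (a + d - b - c) / (a + b + c + d)

g_values = [round(g_index(*t), 5) for t in scenarios]
print(g_values)  # matches the G row of Table 3
```

Note how the two neutral scenarios and the parallel scenario fall out immediately: equal diagonals give G = 0, and [44 88 / 22 44] gives −22/198 ≈ −0.11111.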


Table 4  Test performance of estimators with an extreme 2x2 challenge set; bold text in the original indicates discrepant estimates; f/t is the count of discrepancies (failures) over the scenarios (trials)

Scenarios, with cells given as [a b / c d]:
agr. c=1 [94 11 / 1 94]; dis. d=1 [11 94 / 94 1]; agr. b=1, c=1 [99 1 / 1 99]; agr. b=0, c=1 [100 0 / 1 99]; agr. d=0 [180 10 / 10 0]; dis. d=0 [10 180 / 10 0]; agr. c=0, d=0 [190 10 / 0 0]; dis. c=0, d=0 [10 190 / 0 0]

Estimator (f/t), followed by the estimates in scenario order:
Holley and Guilford's G (0/9): 0.88000, –0.88000, 0.98000, 0.99000, 0.80000, –0.90000, 0.90000, –0.90000
Gwet's AC1 (0/8): 0.88000, –0.87531, 0.98000, 0.99000, 0.88950, –0.89526, 0.94744, –0.89526
Scott's π (2/8): 0.88000, –0.88471, 0.98000, 0.99000, –0.05263, –0.90476, –0.02564, –0.90476
Krippendorff's α (2/8): 0.88030, –0.88000, 0.98005, 0.99002, –0.05000, –0.90000, –0.02308, –0.90000
Cohen's κ (4/8): 0.88030, –0.88471, 0.98000, 0.99000, –0.05263, –0.10465, 0.00000, 0.00000
Cohen's κ corrected by κM (4/8): 0.90025, –0.88471, 1.00000, 0.99000, –0.05263, –0.10465, 0.00000, 0.00000
Pearson's r (4/8): 0.88471, –0.88471, 0.98000, 0.99005, –0.05263, –0.68825, div/0, div/0
Yule's Q (6/8): 0.99751, –0.99751, 0.99980, 1.00000, –1.00000, –1.00000, div/0, div/0
Yule's Y (6/8): 0.93184, –0.93184, 0.98000, 1.00000, –1.00000, –1.00000, div/0, div/0
Shankar and Bangdiwala's B: 0.88581, 0.00608, 0.98010, 0.99005, 0.89503, 0.01786, 0.95000, 0.05000
  verified by rescaled B (3/8): 0.77163, –0.98783, 0.96020, 0.98010, 0.79006, –0.96429, 0.90000, –0.90000
Dice's F1: 0.94000, 0.10476, 0.99000, 0.99502, 0.94737, 0.09524, 0.97436, 0.09524
  verified by rescaled F1 (0/8): 0.88000, –0.79048, 0.98000, 0.99005, 0.89474, –0.80952, 0.94872, –0.80952
Normalized McNemar's χ2 (5/8): 0.83333, 0.00000, 0.00000, 1.00000, 0.00000, 0.89474, 1.00000, 1.00000

adjusted F1. Adjusted Shankar and Bangdiwala's B and normalized McNemar's χ2 overestimated the disagreement.

Table 4 shows a second set of unbalanced tables, with the presence of the values 0 or 1 in some table cells to provide more extreme conditions, including scenarios of agreement or disagreement:

• Many estimators are problematic. Cohen's κ, κ adjusted by the maximum κ, Pearson's r, and Yule's Q and Y present problems when there are zeros in some cells, providing null or non-computable estimates. Q and Y easily approached 1 or -1 even when the agreement or disagreement is not perfect. Normalized McNemar's χ2 uses information only from the off-diagonal and is disturbed when information is concentrated in the main diagonal. Despite its adjustment, Shankar and Bangdiwala's B failed in some scenarios of disagreement.
• The second major cause of problems is due to 0 in the main diagonal (last four columns), leading to underestimation of agreement by Pearson's r, Cohen's κ, and Scott's π, and underestimation of disagreement by Pearson's r and Cohen's κ.
• When 0 appears in both diagonals (last two columns), many estimators are not computable, while others produce underestimated values. Scott's π underestimates agreement. Normalized McNemar's χ2 overestimated agreement and disagreement. Holley and Guilford's G, Gwet's AC1, Dice's F1, and Shankar and Bangdiwala's B were able to generate adequate values in both scenarios.
• When there are no zeros (first three scenarios), Yule's Q and Y may still overestimate agreement or disagreement. Adjusted Shankar and Bangdiwala's B underestimated agreement and overestimated disagreement in the first two contingency tables. Normalized McNemar's χ2 could not detect disagreement and agreement in the second and third scenarios.

Based on this preliminary analysis, the famous Cohen's κ failed in the most extreme situations. Other indices that showed over- and underestimation were unable to cope with disagreement or failed to generate a coherent value. The best estimators seem to be Holley and Guilford's G and Gwet's AC1. Scott's π and Dice's F1 are also competitive (since the rescaling of F1 makes possible the comparison with the other coefficients).
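The second neutral scenario of Table 3, [75 25 / 75 25], illustrates why Cohen's κ is exactly zero there while Scott's π and Gwet's AC1 deviate slightly from zero: the three estimators differ only in how the chance term is built from the marginals. A minimal sketch using the standard two-rater, two-category formulas (illustrative Python; the paper's computations are in R):

```python
# Chance-corrected agreement for the neutral table [75 25 / 75 25],
# using the standard two-rater formulas for kappa, pi, and AC1.
a, b, c, d = 75, 25, 75, 25
n = a + b + c + d
po = (a + d) / n  # observed agreement

# Cohen's kappa: chance term from the product of each rater's marginals.
pe_kappa = ((a + b) / n) * ((a + c) / n) + ((c + d) / n) * ((b + d) / n)
kappa = (po - pe_kappa) / (1 - pe_kappa)

# Scott's pi: chance term from the squared averaged marginals.
p_bar = ((a + b) / n + (a + c) / n) / 2
pe_pi = p_bar ** 2 + (1 - p_bar) ** 2
pi = (po - pe_pi) / (1 - pe_pi)

# Gwet's AC1: chance term 2q(1 - q) with the same averaged marginal q.
pe_ac1 = 2 * p_bar * (1 - p_bar)
ac1 = (po - pe_ac1) / (1 - pe_ac1)

print(round(kappa, 5), round(pi, 5), round(ac1, 5))
# reproduces the Table 3 entries: 0.0, -0.06667, 0.05882
```

The asymmetric marginals pull pe above po for π (hence a small negative value) and below po for AC1 (hence a small positive value), while κ's product-of-marginals chance term equals po exactly.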


Inferential statistics: tables with n = 64

Figure 1 shows the computation of all possible 2x2 tables with 64 observations as a function of Holley and Guilford's G, which was selected as the benchmark (see "Analysis methods" in the Methods section and "Computation of Fig. 1 and Table 5" in the Supplemental Material for details). Both the point-estimate and the inferential decision of agreement or disagreement are compared, thus creating two probability density functions multiplied by the number of tables, m (i.e., the area under each curve is proportional to the number of involved tables). In comparison with the other estimators, the solid lines show the distributions of G point-estimates computed from tables for which there is discrepancy in the rejection of the null hypothesis of absence of agreement; the dashed lines are the same, in tables with concordance in this decision. Therefore, the greater the area under the solid line, the greater the number of discrepancies between G and the agreement coefficient under assessment, which is counted as the percentage of total discrepancies and in three situations (Table 5): disagreement (H1−), neutrality (H0), and agreement (H1+).

Figure 1A differs from the other subfigures to show that Holley and Guilford's G is perfectly correlated with the proportion (a + d)/n, thus representing the bisector of reference, which is further evidence that G can be a good choice for the benchmark. The interval 0.391 ≤ (a + d)/n ≤ 0.609 corresponds to the non-rejection of the null hypothesis, H0: G = 0, interpreted here as populational neutrality (neither disagreement nor agreement, i.e., randomness). The other two regions, denoted H1− and H1+, correspond to the rejection of the null hypothesis, respectively meaning disagreement or agreement between raters. This corresponding interval of G appears in the subsequent panels (Fig. 1B to L). The total distribution of G has an ogival, symmetrical shape represented by a fine dotted line, which corresponds to the total of 47,905 possible tables with n = 64.

The small discrepancy in the inferential decision between SAC and G (Fig. 1B) is caused by differences between the bootstrapping and the asymptotic statistical test (see Supplemental Material, "Implementation of Holley and Guilford's G"); for this reason, a small number of discrepancies are located in the transition from the H0 to the H1 areas. Besides the number of total discrepancies (solid lines in Fig. 1), their location is also important:

• Gwet's AC1 has no mistakes in H1+; mistakes in H1− are close to the transition to H0 (Fig. 1C).
• Krippendorff's α and Scott's π show similar patterns.
• Krippendorff's α has a small number of discrepancies and Scott's π has no discrepancies in H1− (Fig. 1D and E); see also Table 5.
• Pearson's r (Fig. 1F), Cohen's κ (Fig. 1G), and Yule's Q (Fig. 1I) are similar; the first two presented similar percentages of mistakes in all three areas, while the latter had fewer mistakes when rejecting the null hypothesis.
• Yule's Y (Fig. 1H) was slightly better than Yule's Q in the total number of discrepancies, but has more mistaken decisions around the null hypothesis.
• Dice's F1 (Fig. 1J) presented a total number of discrepancies greater than Yule's Q, and asymmetry relative to the corresponding G distribution (regarding the inferential decision, Dice's F1 and Dice's F1 rescaled to the interval [-1,1] are identical, respectively checking whether 0.5 and 0.0 are inside the estimated HDI95%).
• Similarly, the original and rescaled Shankar and Bangdiwala's B (Fig. 1K) are qualitatively similar to Scott's π, with quantitatively worsened performance.
• Normalized McNemar's χ2 (Fig. 1L) produced a flawed approach to the inferential statistics.
• Traditional McNemar's χ2 (Fig. 1M) and the two recently revised (Lu, 2010; Lu et al., 2017) versions (Fig. 1N and O) do not generate values in the interval [-1,1] but, regarding inferential decisions, are still comparable with G, showing an increasing number of total discrepancies.

Table 5 shows the counts of involved tables, errors, and discrepancies from Fig. 1, generated with all 47,905 possible configurations of 2x2 tables with n = 64. Depending on the table configuration, sometimes the estimator cannot be computed (failed estimator, e.g., by division by zero), or it is computed but the corresponding p value is not returned by the R function (no p value). Otherwise, there is an inferential decision, but discrepancies with the benchmark (Holley and Guilford's G) may occur on the null hypothesis (when G does not reject and the other estimator rejects the null hypothesis) or on the rejection of the null hypothesis (H1−, when G assumes disagreement but the other estimator does not reject the null hypothesis; H1+, when G assumes agreement but the other estimator does not reject the null hypothesis).

Comprehensive maps: all tables with 1 ≤ n ≤ 68

Tables with sizes ranging from 1 to 68 observations were generated (see Supplemental Material "Creating all 2x2 tables with size n"), which resulted in 1,028,789 different 2x2 tables covering all possible arrangements of 'abcd'. In addition to the results presented here, the stability of the agreement coefficients was verified (not shown; see Supplemental Material "Computation of Figs. 2, 3, 4, and 5").

Correlations between the estimates of Holley and Guilford's G (assumed as the benchmark) and all other studied estimators are ordered in Table 6. As defined in the previous section, Holley and Guilford's G was adopted as the benchmark. It is possible to observe


Fig. 1  (A) Relative performance of the proposed estimators concerning the inferential statistics of Holley and Guilford's coefficient (G). G is a linear transformation of (a + d)/n. The gray area corresponds to p ≥ 0.05. (B to O) Probability distribution function of G multiplied by m tables: lines are the distribution of G from tables in which there is discrepancy (solid lines) or concordance (dashed lines) between the inferential decision of G (benchmark) and each of the other estimators. Mistakes are discrepancies of inferential decision, and fails are impossibilities of estimator computation (e.g., division by zero), given as proportions relative to all 47,905 possible tables with n = 64.

that Gwet's AC1 has the best correlation with Holley and Guilford's G, but many others also show acceptable correlations. However, the correlation was lower for Cohen's κ, Yule's Q, Dice's F1, and Yule's Y, and much lower, close to zero or absent, for normalized and traditional McNemar's χ2.

A more detailed view for assessing the quality of each estimator is in Fig. 2. This is a representation in hexbin plots, an alternative to scatterplots when there is a large number of data points—in this case, a little more than one million points in each subfigure—in such a way that closely located points are overlapped into a single hexagonal bin (hence the name hexbin) and its color depends on the count of collapsed points (the greater the number of points, the darker the hexbin). Each subfigure from A to L represents the point estimates obtained from Holley and Guilford's G against all other estimators under assessment. The better the concordance between the estimators, the closer the darker hexbins are to the bisector. According to this second criterion, again the best estimator is Gwet's AC1 (Fig. 2A), and the worst is normalized McNemar's χ2 (Fig. 2L). The traditional McNemar's χ2 is not comparable to the other estimators because it does not provide values in the interval [-1,1] (although not shown here, its mapping was tested with procedures available in the Supplemental Material). Cohen's κ (Fig. 2C), Pearson's r (Fig. 2E), Yule's Y (Fig. 2F) and Q (Fig. 2G), and Dice's F1 (Fig. 2J) are mediocre estimators of agreement. Rescaled F1 aligned its darker hexbins with the bisector, but could not fix the number of tables with mistaken estimates below the bisector (Fig. 2K).

Shankar and Bangdiwala's B (in its original form, Fig. 2H) is defective, with darker hexbins close to the bisector line only when G approaches 1. When scaled to the interval [-1,1] (Fig. 2I), the alignment to the bisector is improved, but it leaves darker hexbins away from and lighter hexbins close to the bisector line, meaning that it tends to underestimate in comparison to Holley and Guilford's G. From Table 6, one would expect a better performance of B. In this figure, it is possible to observe that its excellent correlation depended on fairly aligned pairs of G and B observations; but, in terms of linear regression, the great number of tables that are not close to the bisector leads to the not-so-good performance observed in Fig. 1. That is to say, such a simple rescaling of B cannot fix this estimator, and it is structurally defective.

For more extreme 2x2 tables, Figs. 3, 4, and 5 show the pattern of Gwet's AC1 (the estimator that best captured the estimates of G), Cohen's κ (the most popular coefficient of agreement), and normalized McNemar's χ2 (also popular but, again, the least reliable agreement estimator according to our analysis). In this analysis, Holley and Guilford's G is again assumed as the benchmark (x axis), while exemplary cases are detailed. The scales of the y axes are distorted, but dashed lines are drawn to show the location of the bisector; in the same way as in Fig. 2, the closer the darker hexbins are to the bisector, the better the estimator. Subfigures A to D show the pattern of the estimator when one of the contingency table cells (a, b, c, or d) concentrates 90% or more of all observations (therefore, A and B are cases of high agreement and C and D are cases of high disagreement). Subfigures E to H are the opposite scenarios: almost empty cells a to d, which provides a mixture of disagreement, randomness, and agreement cases—therefore, G may vary from -1 to 1, but it is expected that the other estimator should do the same along the bisector. Subfigures I and J are, respectively, concentrations of observations in the main (agreement) and off (disagreement) diagonals. Finally, subfigures K to N are, respectively, concentrations of observations in the first row, second row, first column, and second column, which are situations that are mostly cases of neutrality.

Figure 3 reveals the good performance of Gwet's AC1 (attention to the values of the y axes), in concordance with Holley and Guilford's G in position, always presenting darker hexbins close to the bisector, with some overestimation up to 0.3 in subfigures K to N, as expected from Fig. 2. Figure 4 analyzes Cohen's κ, an intermediate estimator, showing discrepancies in subfigures A to D, with hexbins away from the bisector and estimates around zero where agreement or disagreement should be computed. In the mixed situations illustrated in subfigures E to H, Cohen's κ leaks around the bisector. In the other situations, its pattern is fairly adequate to the benchmark. Finally, Fig. 5 shows the McNemar's χ2 pattern, providing estimates from 0 to 1 in subfigures A, B, and I (it uses only information from the off-diagonal), the reason for its better estimates in subfigures C and D (although away from the bisector, because it provides only positive values), as well as in subfigure J, in which it computes values below 0.1 (here interpreted as consistent with disagreement between raters). In subfigures E to H, McNemar's χ2 is erratic. In subfigures K to N, the pattern of this estimator provided values between 0.6 and 1.0, an overestimation of neutrality.

Discussion

It was long understood by theoreticians that G is a superior estimator to Cohen's κ, but practitioners of applied statistics, for some reason, adhere to the latter. Theoretical reasons and our findings led to the use of Holley and


Table 5  Number of valid and failed computations, number of non-computable p values, and percentage of discrepant inferential decisions with
Holley and Guilford’s G (n = 64; 47,905 tables)
Estimator | Valid, Failed, no p value | H1−, H0, H1+, Total (% of discrepancies)
SAC 47905 0 0 0.30 0.00 0.35 0.66
Gwet's AC1 47905 0 2 5.49 3.98 0.00 9.46
Krippendorff’s α 47903 2 0 0.35 3.92 11.64 15.91
Scott’s π 47903 2 0 0.00 4.15 12.08 16.22
Pearson’s r 47649 256 0 7.23 6.60 7.23 21.07
Cohen’s κ 47903 2 254 7.21 6.70 7.21 21.12
Yule’s Y 47649 256 0 5.41 12.47 3.84 21.72
Yule’s Q 47649 256 0 8.38 5.30 8.38 22.06
Dice’s F1 47904 1 0 7.00 14.90 8.62 30.52
Shankar and Bangdiwala’s B 47903 2 0 0.00 18.25 16.45 34.70
Normalized McNemar’s χ2 47840 65 0 8.05 23.80 13.28 45.12
Traditional McNemar’s χ2 47840 65 0 9.78 21.15 18.72 49.65
Revised McNemar’s χ2 (2010) 47840 65 0 13.75 17.13 20.10 50.98
Revised McNemar’s χ2 (2017) 47903 2 0 13.82 16.69 23.70 54.21
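The table counts quoted throughout (47,905 tables for n = 64 in Table 5; 1,028,789 tables for 1 ≤ n ≤ 68) follow from simple combinatorics: a 2x2 table with total n is a nonnegative integer solution of a + b + c + d = n, of which there are C(n + 3, 3). A minimal Python check (illustrative; the paper's generator, agr2x2_gentablen, is implemented in R):

```python
from math import comb

# Number of distinct 2x2 tables with total n = number of nonnegative
# integer solutions of a + b + c + d = n, i.e., C(n + 3, 3).
def n_tables(n):
    return comb(n + 3, 3)

print(n_tables(64))                            # 47905 tables with n = 64
print(sum(n_tables(n) for n in range(1, 69)))  # 1028789 tables, 1 <= n <= 68
```

The closed form also explains why exhaustive mapping becomes impractical for much larger n: the count grows cubically per n, which is why later studies cited in the Discussion resorted to simulation.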

Table 6  Pearson and Spearman correlation coefficients between Holley and Guilford's G and the other estimators. Table rows are ordered by the median of the Spearman correlations
Estimator | Pearson: Median, HDI LB, HDI UB | Spearman: Median, HDI LB, HDI UB
Gwet's AC1 0.9931 0.9923 0.9934 0.9933 0.9899 0.9943
Shankar and Bangdiwala’s B 0.9698 0.9677 0.9713 0.9772 0.6699 0.9890
B rescaled to [-1,1] 0.9698 0.9677 0.9713 0.9772 0.6699 0.9890
Scott’s π 0.9555 0.9315 0.9643 0.9578 0.9386 0.9657
Krippendorff’s α 0.9555 0.9315 0.9643 0.9578 0.9392 0.9658
Pearson’s r 0.9131 0.9089 0.9474 0.8661 0.3033 0.9583
Cohen’s κ 0.8713 0.7973 0.8928 0.8659 0.7925 0.8897
Cohen’s κ corrected by κM 0.8351 0.7770 0.8596 0.8371 0.7775 0.8604
Dice’s F1 0.7665 0.7349 0.7792 0.7611 0.7378 0.7751
F1 rescaled to [-1,1] 0.7665 0.7349 0.7792 0.7611 0.7378 0.7751
Yule’s Q 0.7841 0.7147 0.8326 0.7182 0.2305 0.8818
Yule’s Y 0.7384 0.6704 0.8000 0.7182 0.2305 0.8818
Normalized McNemar’s χ2 0.0968 0.0084 0.3324 0.1089 − 0.0316 0.6615
McNemar’s χ2 (rev. Lu, 2010) − 0.3812 − 0.4037 − 0.2978 − 0.2991 − 0.3866 0.2988
Traditional McNemar’s χ2 − 0.3978 − 0.4202 − 0.3126 − 0.3066 − 0.3950 0.2880
McNemar’s χ2 (rev. Lu et al., 2017) − 0.5299 − 0.5434 − 0.5289 − 0.5081 − 0.5301 − 0.2498
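Table 6 contrasts the two correlation measures used as global quality criteria: Pearson's coefficient captures the linear trend between G and a concurrent estimator, while Spearman's captures any monotonic trend (Spearman's is simply Pearson's applied to ranks). A self-contained sketch of that distinction, using hypothetical toy vectors rather than the paper's data:

```python
from math import sqrt

def pearson(x, y):
    # Linear association between paired samples.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = sqrt(sum((xi - mx) ** 2 for xi in x))
    sy = sqrt(sum((yi - my) ** 2 for yi in y))
    return cov / (sx * sy)

def spearman(x, y):
    # Monotonic association: Pearson correlation of the ranks
    # (simple untied ranking, enough for this illustration).
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = float(rank)
        return r
    return pearson(ranks(x), ranks(y))

# y is a nonlinear but strictly monotonic function of x, so Spearman's
# coefficient is 1 while Pearson's stays below 1.
x = [1.0, 2.0, 3.0, 4.0]
y = [1.0, 4.0, 9.0, 16.0]
print(round(pearson(x, y), 4), round(spearman(x, y), 4))
```

This is why an estimator can post a high Spearman correlation with G in Table 6 and still sit far from the bisector in the hexbin maps: monotonic agreement in ordering does not imply agreement in value.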

Guilford's G index as a reference for the performance of the concurrent estimators. More recently, Zhao et al. (2013) presented an extensive discussion on several agreement estimators, including situations with more than two observers or categories, and provided an important discussion of assumptions frequently forgotten by applied researchers. Such an analysis is beyond the scope of the present work, which applies several estimators to assess their performance, but it is interesting to emphasize that Gwet's AC1 seems to provide fair estimates in uneven situations, which are arguably the most important and useful situations for researchers.

In fact, many authors compare several of the estimators but, to our knowledge, no one has performed an exhaustive analysis with hundreds of thousands of tables, as presented here, to obtain a comprehensive map of estimator patterns. Many comparisons stick to particular cases, using challenging tables similar to those in Tables 3 and 4, sometimes to show the weakness or strength of particular coefficients in particular situations. Even so, many of their conclusions


[Fig. 2: sixteen hexbin panels (A to P) — Gwet's AC1, Scott's pi, Cohen's kappa, Krippendorff's alpha, kappa corrected by kM, Pearson's r, Yule's Q, Yule's Y, S & B's B, Adjusted B, Dice's F1, Adjusted F1, Norm. McNemar's X², McN's X² (rev. 2010), McN's X² (rev. 2017), and Trad. McNemar's X² — each plotted against Holley and Guilford's G on the x axis, with the count of failed (non-computable) tables per panel]

Fig. 2  Hexbin plots of estimator computation of all possible 2x2 tables with n ranging from 1 to 68 (total of 1,028,789 tables). Failed is the count of non-computable tables. Holley and Guilford's G was adopted as the benchmark. Darker hexbins correspond to the location of higher counts of tables

point to many of Cohen's κ problems and favor Holley and Guilford's G (Shreiner, 1980; Green, 1981) or Gwet's AC1 (Kuppens, Holden, Barker, & Rosenberg, 2011; Wongpakaran et al., 2013; Xie et al., 2017).

In accordance with our results, assuming Holley and Guilford's G as the benchmark, Gwet's AC1 is also a good estimator. Not only are AC1's mistakes the fewest in inferential statistics, but these mistakes are also located (Fig. 1C) around the null hypothesis, eventually assuming low agreement when there is none, and in the low disagreement area, eventually failing to detect it, which is—paraphrasing Cohen (1960)—"unlikely in practice". In the most important area, that of agreement between raters (H1+ in Fig. 1C), AC1 was the only coefficient with no mistaken inferential decisions (Table 5). AC1's agreement is also close to the bisector line (darker hexbins in Fig. 2A), and it is not confounded by


[Fig. 3: fourteen hexbin panels (A to N) of Gwet's AC1 versus Holley and Guilford's G for the extreme-table scenarios described in the caption, with the count of failed tables per panel]

Fig. 3  Hexbin plots of Gwet's AC1 computation of extreme 2x2 tables with n ranging from 1 to 68 (total of 1,028,789 different tables; the percentage is of non-computable tables) in comparison to Holley and Guilford's G (adopted as the benchmark). First row (A to D): each cell contains more than 90% of all data. Second row (E to H): each cell contains less than 10% of all data. I: 90% of data in the main diagonal; J: 90% of data in the off-diagonal. Fourth row (K to N): respectively with 90% of data in the first row, second row, first column, and second column of a 2x2 table. The scale of the y-axis varies to show the pattern of coincidences between the estimators. The dashed gray lines represent the bisector

extreme tables (Fig. 3). In a recent study, Konstantinidis, Le, & Gao (2022) compared the performance of Cohen's κ, Fleiss' κ, Conger's κ, and Gwet's AC1, with the difference that the authors had to apply simulations because of the explosive number of possible tables with greater samples, up to 500 observations, exploring tables with multiple observers and fixing the agreement to known values to compare the relative performance of these estimators. Conger's κ

Behavior Research Methods

Fig. 4  Hexbin plots of Cohen's κ computation of extreme 2x2 tables with n ranging from 1 to 68 (total of 1,028,789 different tables; the percentage is of non-computable tables) in comparison to Holley and Guilford's G (adopted as the benchmark). First row (A to D): each cell contains more than 90% of all data. Second row (E to H): each cell contains less than 10% of all data. I: 90% of data in main diagonal; J: 90% of data in off-diagonal. Fourth row (K to N): respectively with 90% of data in the first row, second row, first column, and second column of a 2x2 table. The scale of the y-axis varies to show the pattern of coincidences between the estimators. The dashed gray lines represent the bisector.

(1980) was not approached in our study because it is a generalization for more than two observers and, therefore, does not apply to 2x2 tables. The main conclusion of these authors is in agreement with our results, with Gwet's AC1 superior to the other statistics, especially for what we called more extreme tables (these authors named them asymmetric cases).
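The exhaustive approach used here is tractable because a 2x2 table with total n is a weak composition of n into four parts, so there are C(n + 3, 3) distinct tables for each n; summing over n = 1 to 68 gives the 1,028,789 tables examined. A minimal Python sketch of this enumeration (illustrative only; the paper's implementation is in R):

```python
from math import comb

def tables_with_total(n):
    """Generate every 2x2 table (a, b, c, d) with a + b + c + d == n."""
    for a in range(n + 1):
        for b in range(n - a + 1):
            for c in range(n - a - b + 1):
                yield a, b, c, n - a - b - c

def count_tables(max_n):
    """Count all 2x2 tables with totals 1..max_n: sum of C(n + 3, 3)."""
    return sum(comb(n + 3, 3) for n in range(1, max_n + 1))

# count_tables(68) reproduces the 1,028,789 tables of this study
```

The closed-form count explains why exhaustive computation stops being feasible for the larger samples Konstantinidis et al. handled by simulation: the total grows as C(max_n + 4, 4), i.e., on the order of n to the fourth power.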


Fig. 5  Hexbin plots of normalized McNemar's χ2 computation of extreme 2x2 tables with n ranging from 1 to 68 (total of 1,028,789 different tables; the percentage is of non-computable tables) in comparison to Holley and Guilford's G (adopted as the benchmark). First row (A to D): each cell contains more than 90% of all data. Second row (E to H): each cell contains less than 10% of all data. I: 90% of data in main diagonal; J: 90% of data in off-diagonal. Fourth row (K to N): respectively with 90% of data in the first row, second row, first column, and second column of a 2x2 table. The scale of the y-axis varies to show the pattern of coincidences between the estimators. The dashed gray lines represent the bisector.
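Fig. 5 contrasts a normalized McNemar statistic with G. Because McNemar's statistic is built only from the off-diagonal cells b and c, the agreement cells a and d never enter the computation. A minimal Python sketch, assuming the normalization (b − c)/(b + c), which reproduces the MN values reported in this paper:

```python
def mcnemar_normalized(b, c):
    """Normalized McNemar statistic from the off-diagonal cells only.

    Assumed form: (b - c) / (b + c); note that the agreement cells
    a and d play no role whatsoever.
    """
    return (b - c) / (b + c)

# The same 36 disagreements yield opposite inferential readings:
# mcnemar_normalized(26, 10) is about 0.44, mcnemar_normalized(18, 18) is 0.0
```

Whether a and d hold 2 or 2 thousand observations, these values are unchanged, which is why the statistic cannot measure agreement.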


Cohen's κ is not only a poor estimator of agreement but is also prone to provide incorrect statistical decisions when there is neutrality and no large disagreement or agreement between raters. Fig. 1G and Table 5 show that it is mistaken in around 21% of all possible tables with n = 64, with roughly one-third in each region, but this distribution is not uniform: mistakes are less likely to occur away from the greater disagreements or agreements. The solid line of Cohen's κ has two peaks in the transition from the gray area (null hypothesis) to the white areas (rejection of H0); thus, it is when there is low disagreement or agreement that Cohen's κ is more subject to inferential errors. This weakness is not visible in the low agreement or disagreement scenarios in Table 3, which denotes the importance of exhaustive computation of many tables; drawing general conclusions from the study of particular cases may be misleading. Another weakness is that there were problems in 256 tables of size n = 64 (Table 5). Two tables failed when all data are in a or d (a clear 100% agreement) because this causes a division by zero (Eq. 3). In addition, the inferential decision provided by epiR::epi.kappa (z statistic for κ associated with a p value) is also unavailable in another 254 tables when one row or column is empty (i.e., a + b = 0, a + c = 0, b + d = 0, or c + d = 0). Finally, Fig. 4 shows some details of κ's limitations. In scenarios of agreement in which most of the data are concentrated in a or d, κ mistakenly provides values close to 0 (the darkest hexbins in Fig. 4A and B). That happens due to Eq. 3: a and d appear in both parcels of the denominator, creating an exaggerated value, while the parcel ad in the numerator is a low value; thus κ underestimates the agreement in these situations. Concentration in b or c, which appear only once in each denominator parcel, only causes a small number in the numerator due to bc (always a small number due to high b and low c, or vice versa), leading to underestimation of the disagreement (Fig. 4C and D). Cohen's κ also has problems when only one of the four cells is relatively empty, which may happen, for instance, if one of the raters is more lenient than the other; in these cases, Cohen's κ assessment has an excess of variability (Fig. 4E to H). The last comment on Cohen's κ is that the correction by maximum κ, proposed by Cohen himself and forgotten by subsequent researchers, may be a correction for effect size (Table 2), but it worsened the general estimator pattern (compare Fig. 2D and C).

We started this review by criticizing Cohen's κ, but the investigation of alternatives led, in the end, to the development of a method to assess any other estimator, some of them described in this text. In that process, McNemar's χ2 was collateral damage: such poor performance was not anticipated for such a widespread estimator. On the other hand, we tested the performance of McNemar's χ2 to verify whether it could be applied to a general situation of agreement, concluding that it cannot. To be fair, one must recognize that McNemar's χ2 test was designed for a very specific situation: change of signal in a pre-post scenario, using only the off-diagonal (McNemar, 1947). Unfortunately, McNemar's χ2 is sometimes applied in the context of verifying agreement between methods (e.g., Kirkwood and Sterne, 2003, pp. 216-218). By using only b and c, it becomes a futile mental exercise to link rejection or non-rejection of the null hypothesis with agreement or disagreement between raters. For instance, in trying to interpret McNemar's measure as agreement between raters, b = 26, c = 10 leads to MN = 0.44, CI95%(MN) = [0.14, 0.74], rejects H0, and would suggest disagreement, but b = 18, c = 18 (the same amount of disagreement) leads to MN = 0.00, CI95%(MN) = [−0.02, 0.34], does not reject H0, and would suggest agreement. Both decisions completely disregard the agreement values, a and d; it does not matter whether they are 2 or 2 thousand. All that would matter is that one rater systematically opposes the other and that one of them is biased to provide many more positives or negatives than the other. If they are perfectly opposed, making b = c, then this perfect disagreement would no longer be detectable. Revised versions designed to include more table information by bringing a and d into play do not seem to improve this test's ability to measure agreement. Consequently, the application of McNemar's χ2 as a measure of association or agreement, when removed from its original context, can only lead to the confusing results observed on the challenge 2x2 tables, with the inability to detect agreement or disagreement in some situations and overestimation of agreement or disagreement in others (Tables 3 and 4), failures in non-rejection of the null hypothesis (Fig. 1L to O), and mixed estimates along clear situations of agreement or disagreement (Fig. 2M to P). In essence, McNemar's χ2 is not an agreement estimator, and its use must be restricted to its original context.

To close this discussion, a brief comment on the other estimators is in order. Scott's π appears in third place. It is not a bad estimator but, contrary to its original proposition as inter-rater reliability, it seems a more reliable estimator of disagreement (Fig. 1E). Another way to confirm this statement is to observe that most of its correct counts are on the bisector, but the dispersion increases with increasing rater agreement (Fig. 2B). Krippendorff's α (Fig. 2C) showed a pattern similar to Scott's π, as predicted by theory for this case of nominal variables and two raters. Perhaps, Krippendorff's α being a generalization of other coefficients, its pattern depends on the dimensions and nature of the involved variables, inheriting the strengths and weaknesses of the respective coefficients that it may emulate.

Pearson's r is, primarily, a measure of association. However, in 2x2 tables its performance is also very similar to that of Cohen's κ, conceived to measure agreement, which can be observed in their similar density plots (Fig. 1F and G), mapping of hexbins (Fig. 2F and D), or their similar correlations with


Table 7  Nature of estimators

Estimator                             Numerator                          Classification
Holley and Guilford's G               (a + d) − (b + c)                  Agreement
Gwet's AC1                            a² + d² − (b + c)²/2               Agreement
Adjusted Dice's F1                    2a − (b + c)                       Mostly agreement
Cohen's κ                             2(ad − bc)                         Association
Yule's Q                              ad − bc                            Association
Yule's Y                              √(ad) − √(bc)                      Association
Odds ratio                            ad/bc, or log(ad) − log(bc)        Association
Pearson's r                           ad − bc                            Association
Scott's π                             ad − ((b + c)/2)²                  Mixed
Rescaled Shankar and Bangdiwala's B   a² + d² − (2bc + (a + d)(b + c))   Mixed
McNemar's χ²                          b/c, or log(b) − log(c)            Undefined
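The split in Table 7 can be made concrete with a small numerical check: on a table with 90% of the ratings concentrated in a single agreement cell, an agreement-type numerator stays high while an association-type numerator can even turn negative. Illustrative Python (function names are ours; cell labels a, b, c, d as in the paper):

```python
def agreement_numerator(a, b, c, d):
    # numerator of Holley and Guilford's G: (a + d) - (b + c)
    return (a + d) - (b + c)

def association_numerator(a, b, c, d):
    # core of the numerators of Cohen's kappa, Pearson's r, and Yule's Q
    return a * d - b * c

table = (90, 5, 5, 0)                 # 90% agreement, all of it in cell a
strong = agreement_numerator(*table)  # 80: strong agreement detected
weird = association_numerator(*table) # -25: read as negative association
```

This is the same mechanism behind the dark hexbins of Fig. 4A and B: association-type estimators need both diagonal cells populated, while authentic agreement estimators only need the diagonal sum to dominate.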

G in Table 6. We observe that the equations for κ (Eq. 3), Q (Eq. 12), and r (Eq. 23, which is also equal to ρ, τ, ϕ, and Cramér's V) all have a sort of OR (Eq. 13) in their numerators (ad − bc). Although they have similar global performances, the deficiencies of r and κ are due to different reasons (see Tables 3 and 4).

Yule's Q and Y come next, with a small advantage to Y. Their global performances are close to those of Cohen's κ and Pearson's r, but Y made more mistakes when there is neutrality (i.e., in the region of non-rejection of the null hypothesis), while Q made relatively more mistakes when there is disagreement or agreement between raters (i.e., in the regions of rejection of the null hypothesis, Fig. 1H and I). Both also show a tendency to overestimate agreement, providing values equal to 1 for any disagreement, neutrality, or agreement provided by G, which is represented by the horizontal lines at the top of Figs. 2G and 2H, especially for higher agreements (darker hexbins); this also explains the results observed in Table 4. At least, as happens with Scott's π (Fig. 2B), Yule's Y concentrates most of its estimates around the bisector line, while Yule's Q is more problematic, showing a sigmoid shadow of darker hexbins (Fig. 2G) that explains its greater tendency to overestimate both agreement and disagreement, as shown in Table 3.

Dice's F1 and Shankar and Bangdiwala's B only provide positive estimates in their original forms. Since it is confusing to compare a range of [0, 1] with the [−1, 1] provided by the other coefficients, we proposed rescaling by 2i − 1 (where i is F1 or B). This is a linear transformation that does not affect these coefficients' properties, with zero now corresponding to neutrality or randomness, negative values to disagreement, and positive values to agreement between raters. The graphical effect of this transformation is to bring the pattern of computed coefficients closer to the bisector (Figure 2K to L and Figure 2I to J). Rescaled Dice's F1 (Fig. 2L) fills around half of the graph area, which indicates an inability to detect many occurrences of agreement between raters. Incidentally, F1 does not use the entire information from a 2x2 table, leaving d out of reach (the traditional McNemar's χ2, which showed poor performance, also uses incomplete information). Perhaps the first criterion for a good agreement estimator should be that it uses all the available information.

Shankar and Bangdiwala's B has another type of serious problem. Except for the McNemar's versions, it was the estimator that most often mistakenly rejected the null hypothesis in situations of neutrality and failed to reject the null hypothesis in agreement situations, H1+ (Fig. 1K and Table 5). Like Scott's π (Fig. 1E), it makes no mistakes in the H1− area, although the performance of π is a lot better.

In a nutshell, the estimators can be divided, according to section "Association and agreement", into measurements of association (Eq. 36) or agreement (Eq. 37), depending mainly on the structure of the numerator of their respective equations. Table 7 shows that most of the estimators usually taken as measurements of agreement are merely estimates of association; only two of the estimators analyzed in this work are authentic agreement coefficients.

This work has the humble mission of restoring Holley and Guilford's G as the best agreement estimator, closely followed by Gwet's AC1. Both have inferential statistics associated with them, satisfying research requirements. Gwet's AC1 is already implemented in R packages. We could not find any implementation of Holley and Guilford's G, but the R scripts presented in the Supplemental Material, in the section named "Implementation of Holley and Guilford's G", can be easily adapted, including the asymptotic test proposed in the literature for tables with n > 30 (bootstrapping techniques are easy to adapt for smaller


tables). Holley and Guilford's G and Gwet's AC1 should be considered by modern researchers as the first choices for agreement measurement in 2x2 tables.

Supplementary Information  The online version contains supplementary material available at https://doi.org/10.3758/s13428-022-01950-0.

Open Access  This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

References

Banerjee, M., Capozzoli, M., McSweeney, L., & Sinha, D. (1999). Beyond kappa: A review of interrater agreement measures. Canadian Journal of Statistics, 27(1). https://doi.org/10.2307/3315487
Bennett, E. M., Alpert, R., & Goldstein, A. C. (1954). Communications through limited-response questioning. Public Opinion Quarterly, 18(3). https://doi.org/10.1086/266520
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1).
Cohen, J. (1968). Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit. Psychological Bulletin, 70(4). https://doi.org/10.1037/h0026256
Conger, A. J. (1980). Integration and generalization of kappas for multiple raters. Psychological Bulletin, 88(2). https://doi.org/10.1037/0033-2909.88.2.322
Cramér, H. (1946). Mathematical methods of statistics. Princeton University Press.
Dice, L. R. (1945). Measures of the amount of ecologic association between species. Ecology, 26(3). https://doi.org/10.2307/1932409
Efron, B. (2007). Bootstrap methods: Another look at the jackknife. The Annals of Statistics, 7(1). https://doi.org/10.1214/aos/1176344552
Feingold, M. (1992). The equivalence of Cohen's kappa and Pearson's chi-square statistics in the 2 × 2 table. Educational and Psychological Measurement, 52(1). https://doi.org/10.1177/001316449205200105
Feng, G. C. (2015). Mistakes and how to avoid mistakes in using intercoder reliability indices. Methodology, 11(1). https://doi.org/10.1027/1614-2241/a000086
Feng, G. C., & Zhao, X. (2016). Commentary: Do not force agreement: A response to Krippendorff. https://doi.org/10.1027/1614-2241/a000120
Green, S. B. (1981). A comparison of three indexes of agreement between observers: Proportion of agreement, G-index, and kappa. Educational and Psychological Measurement, 41(4). https://doi.org/10.1177/001316448104100415
Gwet, K. L. (2008). Computing inter-rater reliability and its variance in the presence of high agreement. British Journal of Mathematical and Statistical Psychology, 61(1). https://doi.org/10.1348/000711006X126600
Gwet, K. L. (2010). Handbook of inter-rater reliability: The definitive guide to measuring the extent of agreement among raters (3rd edn.). Advanced Analytics, LLC.
Gwet, K. L. (2011). On the Krippendorff's alpha coefficient. Manuscript submitted for publication. https://agreestat.com/papers/onkrippendorffalpha_rev10052015.pdf
Hoff, R., Hoff, R., Sleigh, A., Mott, K., Barreto, M., de Paiva, T. M., ..., Sherlock, I. (1982). Comparison of filtration staining (Bell) and thick smear (Kato) for the detection and quantitation of Schistosoma mansoni eggs in faeces. Transactions of the Royal Society of Tropical Medicine and Hygiene, 76(3). https://doi.org/10.1016/0035-9203(82)90201-2
Holley, J. W., & Guilford, J. P. (1964). A note on the G index of agreement. Educational and Psychological Measurement, 24(4). https://doi.org/10.1177/001316446402400402
Hripcsak, G., & Rothschild, A. S. (2005). Agreement, the F-measure, and reliability in information retrieval. Journal of the American Medical Informatics Association, 12(3). https://doi.org/10.1197/jamia.M1733
Hubert, L. (1977). Nominal scale response agreement as a generalized correlation. British Journal of Mathematical and Statistical Psychology, 30(1). https://doi.org/10.1111/j.2044-8317.1977.tb00728.x
Hughes, J. (2021). krippendorffsalpha: An R package for measuring agreement using Krippendorff's alpha coefficient. R Journal, 13(1). https://doi.org/10.32614/rj-2021-046
Hyndman, R. J. (1996). Computing and graphing highest density regions. American Statistician, 50(2). https://doi.org/10.1080/00031305.1996.10474359
Janson, S., & Vegelius, J. (1982). The J-index as a measure of nominal scale response agreement. Applied Psychological Measurement, 6(1). https://doi.org/10.1177/014662168200600111
King, N. B., Harper, S., & Young, M. E. (2012). Use of relative and absolute effect measures in reporting health inequalities: Structured review. BMJ (Online), 345(7878). https://doi.org/10.1136/bmj.e5774
Kirkwood, B. R., & Sterne, J. A. C. (2003). Essential medical statistics (2nd edn.). Blackwell Publishing.
Konstantinidis, M., Le, L. W., & Gao, X. (2022). An empirical comparative assessment of inter-rater agreement of binary outcomes and multiple raters. Symmetry, 14(2). https://doi.org/10.3390/sym14020262
Krippendorff, K. (1970). Estimating the reliability, systematic error and random error of interval data. Educational and Psychological Measurement, 30(1). https://doi.org/10.1177/001316447003000105
Krippendorff, K. (2011). Computing Krippendorff's alpha-reliability. https://repository.upenn.edu/asc_papers/43/
Krippendorff, K. (2016). Misunderstanding reliability. https://repository.upenn.edu/asc_papers/537/
Kuppens, S., Holden, G., Barker, K., & Rosenberg, G. (2011). A kappa-related decision: K, Y, G, or AC1. Social Work Research, 35(3). https://doi.org/10.1093/swr/35.3.185
Lienert, G. A. (1972). Note on tests concerning the G index of agreement. Educational and Psychological Measurement, 32(2). https://doi.org/10.1177/001316447203200205
Lu, Y. (2010). A revised version of McNemar's test for paired binary data. Communications in Statistics - Theory and Methods, 39(19). https://doi.org/10.1080/03610920903289218
Lu, Y., Wang, M., & Zhang, G. (2017). A new revised version of McNemar's test for paired binary data. Communications in Statistics - Theory and Methods, 46(20). https://doi.org/10.1080/03610926.2016.1228962
Ludbrook, J. (2011). Is there still a place for Pearson's chi-squared test and Fisher's exact test in surgical research? ANZ Journal of Surgery, 81(12). https://doi.org/10.1111/j.1445-2197.2011.05906.x
Manning, C. D., Raghavan, P., & Schutze, H. (2008). Introduction to information retrieval. Cambridge University Press. https://doi.org/10.1017/cbo9780511809071
Matthews, B. W. (1975). Comparison of the predicted and observed secondary structure of T4 phage lysozyme. BBA - Protein Structure, 405(2). https://doi.org/10.1016/0005-2795(75)90109-9
McNemar, Q. (1947). Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika, 12(2), 153–157. https://doi.org/10.1007/BF02295996
R Core Team (2022). R: A language and environment for statistical computing. https://www.r-project.org/
Scott, W. A. (1955). Reliability of content analysis: The case of nominal scale coding. Public Opinion Quarterly, 19(3). https://doi.org/10.1086/266577
Shankar, V., & Bangdiwala, S. I. (2014). Observer agreement paradoxes in 2x2 tables: Comparison of agreement measures. BMC Medical Research Methodology, 14(1). https://doi.org/10.1186/1471-2288-14-100
Shreiner, S. C. (1980). Agreement or association: Choosing a measure of reliability for nominal data in the 2 × 2 case - a comparison of phi, kappa, and g. Substance Use and Misuse, 15(6). https://doi.org/10.3109/10826088009040066
Siegel, S., & Castellan, N. J. (1988). Nonparametric statistics for the behavioral sciences (2nd edn.). McGraw–Hill.
Sim, J., & Wright, C. C. (2005). The kappa statistic in reliability studies: Use, interpretation, and sample size requirements. Physical Therapy, 85(3). https://doi.org/10.1093/ptj/85.3.257
Wikipedia (2021). Krippendorff's alpha. https://en.wikipedia.org/wiki/Krippendorff's_alpha
Wongpakaran, N., Wongpakaran, T., Wedding, D., & Gwet, K. L. (2013). A comparison of Cohen's kappa and Gwet's AC1 when calculating inter-rater reliability coefficients: A study conducted with personality disorder samples. BMC Medical Research Methodology, 13(1). https://doi.org/10.1186/1471-2288-13-61
Xie, Z., Gadepalli, C., & Cheetham, B. M. G. (2017). Reformulation and generalisation of the Cohen and Fleiss kappas. LIFE: International Journal of Health and Life-Sciences, 3(3), 1–15. https://doi.org/10.20319/lijhls.2017.33.115
Yule, G. U. (1912). On the methods of measuring association between two attributes. Journal of the Royal Statistical Society, 75(6). https://doi.org/10.2307/2340126
Zhao, X., Liu, J. S., & Deng, K. (2013). Assumptions behind intercoder reliability indices. Annals of the International Communication Association, 36(1). https://doi.org/10.1080/23808985.2013.11679142

Publisher's note  Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.