
Journal of Electrocardiology 63 (2020) 75–82


Comparison of PR, QRS, and QT interval measurements by seven ECG interpretation programs

J. De Bie a,⁎, I. Diemberger b, J.W. Mason c

a Mortara Instrument Europe s.r.l., Bologna, Italy
b Department of Experimental, Diagnostic and Specialty Medicine, University of Bologna, Bologna, Italy
c Mason Cardiac Safety Consulting, Reno, Nevada, USA

Keywords: ECG; Electrocardiogram; Automated ECG measurement; QRS-duration; PR-interval; QT-interval

Abstract

Background: Electrocardiograph-generated measurements of PR, QRS, and QT intervals are generally thought to be more precise than manual measurements on paper records. However, the performance of different programs has not been well compared.
Methods: Routinely obtained digital electrocardiograms (ECGs), including over 500 pediatric ECGs, were used to create over 2000 10 s analog ECGs that were replayed through seven commercially available electrocardiographs. The measurements for PR interval, QRS duration, and QT interval made by each program were extracted and compared against each other (using the median of the programs after correction for program bias) and the population mean values.
Results: Small but significant systematic biases were seen between programs. The smallest and largest variation from the population mean differed by 4.7 ms for PR intervals, 5.8 ms for QRS duration, and 12.4 ms for QT intervals. In pairwise comparison programs showed similar accuracy for most ECGs, with the average absolute errors at the 75th percentile for PR intervals being 4–6 ms from the median, QRS duration 4–8 ms, and QT interval 6–10 ms. However, substantial differences were present in the numbers and extent of large, clinically significant errors (e.g. at the 98th percentile), for which programs differed by a factor of two for absolute errors, as well as differences in the mix of overestimations and underestimations.
Conclusions: When reading digital ECGs, users should be aware that small systematic differences exist between programs and that there may be large clinically important errors in difficult cases.

© 2020 The Authors. Published by Elsevier Inc. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

Introduction

Computer programs that make electrocardiogram (ECG) measurements and diagnostic interpretations were first developed in the late 1960s. Their use substantially increased in the 1980s when the results could be shown in real time directly on the ECG. Currently, almost all ECG acquisition equipment includes computer measurement and interpretation. Clinicians inexperienced in making ECG interval measurements often rely on these automated results to make treatment decisions [1,2], yet algorithmic measurements and interpretations are intended to be preliminary and followed up by physician review [3–5].

The issue of the quality of computer interpretation and measurements was raised in the 1970s [6]. Computer interpretations of single programs have generally been compared with the findings of one or more experts on sets of ECGs available only to the researchers [7] or the manufacturers. However, since methods, experts, and datasets have differed between studies, the results have not been comparable, and many of the findings are now dated because of changes in the technology. The subject of correct automatic assessment of ECG intervals (PR/PQ, QRS, and QT) is relevant because these intervals are the bases of the interpretation algorithms. They are also used in cardiac safety monitoring by clinicians and in the development and approval of new medications and interventions. Comparative studies by Kligfield et al. [8,9] on six of the seven programs in the current work showed small but consistent mean differences, but no analysis of variability was performed. Similar small biases were found by Vancura et al. [10] in a small study of QRS durations measured by two of the same programs. Mason et al. [11] assessed the same two programs in a large study to establish normal reference ranges for duration, but neither Vancura et al. nor Mason et al. analyzed variability beyond standard deviations (SDs). De Pooter et al. [12] compared QRS duration measured by two human readers and two programs, finding no mean difference between the programs but much wider pairwise limits of agreement between the programs than between human readers for wide-QRS ECGs.

More comparative data on the performance of commercially available ECG recorders are needed to characterize differences fully and to

⁎ Corresponding author at: Mortara Instrument Europe s.r.l., Via G. di Vittorio 21b, Castel Maggiore, 40013 Bologna, Italy.
E-mail address: Johannes.debie@hillrom.com (J. De Bie).

https://doi.org/10.1016/j.jelectrocard.2020.10.006
0022-0736/© 2020 The Authors. Published by Elsevier Inc. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
develop methods for detecting and preventing measurement errors. In this study, we directly assessed differences in bias and stability of interval measurements between seven mainstream ECG programs through the analysis of a large, representative real-world ECG dataset.

Materials and methods

In this study, we assessed the PR interval, QRS duration, and QT interval measurements of seven ECG interpretation programs (Table 1), using ECGs obtained routinely in hospitals and acute-care settings. Since manufacturers do not provide digital input possibilities for validation of their interpretation programs, we created a large set of representative analog ECGs converted from previously recorded digital ECGs acquired with equipment that complied with the requirements of International Electrotechnical Commission standard IEC 60601-2-51:2003 [13].

Table 1
Brand name and model of the electrocardiographs used to acquire and process the ECGs, with name and version of the interpretation program.

Electrocardiograph, program name          | Program version
GE Healthcare MAC2000™, 12SL              | Version 22
Burdick® 8500, Glasgow                    | Version 26.5
Welch-Allyn® CP150, MEANS                 | Revision 2016-7
Midmark® IQManager® resting ECG           | Version 8.6.1
Hillrom™/Mortara™ ELI® 380, VERITAS®      | Version 7.3.0
Philips® TC 20, DXL                       | Version PH100B
Schiller® MS2015                          | Version R16.01

Selection of ECGs

Anonymized ECGs were obtained from ECG databases in eight centers in Australia, Italy, and the USA, including 510 pediatric ECGs from three centers in Italy with no specific selection criteria apart from being consecutive for recording date and time. In addition, 228 ECGs with statements related to acute ischemia (generated by the VERITAS program, Mortara Instrument, Milwaukee, WI, USA) were obtained from an ambulance service database and a university hospital database. ECGs from patients with pacemakers were excluded because pacemaker spikes are usually detected in the analog portion of electrocardiographs and are not faithfully recorded in stored records. Most ECGs were originally recorded with various models of electrocardiographs manufactured by Mortara Instrument Inc. In total we used 2155 digital ECGs representative of real-world conditions but biased toward abnormal ECGs. Table 2 shows the major characteristics of the database.

Table 2
Overview of the database characteristics.

Metrics                               | Percentage
Patient population
Age (years) a
  ≤2                                  | 4.6%
  >2–10                               | 12.9%
  >10–16                              | 6.3%
  >16–40                              | 4.9%
  >40–60                              | 12.2%
  >60–70                              | 10.5%
  >70–80                              | 12.4%
  >80                                 | 10.8%
  Unknown                             | 25.4%
Gender a
  Male                                | 37.4%
  Female                              | 37.2%
  Unknown                             | 25.4%
ECG interpretation
Heart rate (bpm)
  ≤60                                 | 16.1%
  60–99                               | 67.8%
  100–119                             | 9.2%
  ≥120                                | 6.9%
Normal or borderline interpretation   | 42.0%
Abnormal interpretation               | 58.0%
Sinus rhythm                          | 86.7%
Atrial fibrillation/flutter           | 10.5%
Atrial enlargement criteria           | 8.8%
Abnormal axis/fascicular block        | 14.9%
RVH/LVH criteria                      | 21.4%
Atrioventricular (PR) abnormality     | 8.2%
Old infarct                           | 16.4%
ACS/STEMI                             | 15.2%
Non-acute ST&T abnormality            | 8.0%

a If age and/or sex were unknown, they were set at 40 years and/or, randomly, to male or female at the time of re-recording. Interpretations in this table were made by the VERITAS program (Mortara Instrument, Milwaukee, WI, USA). Abbreviations: ACS/STEMI, acute coronary syndrome or ST-elevation myocardial infarction; LVH, left ventricular hypertrophy; RVH, right ventricular hypertrophy.

Re-recording of ECGs

To be able to use the ECGs for our study, we had to convert them back into analog format in an endless loop and replay them into physical electrocardiographs; we did this through a Whaleteq MECG 2.0 Multichannel ECG Test System (WHALETEQ Co Ltd., Taipei City, Taiwan) connected to a laptop PC. More details on the method and ECG pre-processing can be found in De Bie et al. [14]. The dataset in this study was the same as is described in that paper. After pre-processing, 2155 ECGs were played back three times; thus, for each of the seven electrocardiographs, three new digital recordings were derived from each of the 2155 original digital ECG recordings, giving a total of 45,255 digital recordings.

Although the repeat frequency of all playback loops was exactly 10 s, it could not be guaranteed in which phase of the loop the acquired traces would start. For instance, some recordings might start between two beats or within a QRS complex, and a single abnormal beat could be in various positions within the captured record. For this reason and because of small reproduction differences, amplifier noise, and sampling effects, the three automatic measurement sets and interpretations of each ECG differed slightly, as happens in normal practice when repeat ECGs are acquired from an individual patient.

Outcomes

Our first goal was to estimate each program's systematic bias in measuring the PR interval, QRS duration, and QT interval, replicating previous work by Kligfield et al. [8,9] in a more representative real-world dataset. In that work, the bias was simply calculated through the population mean for all ECGs. However, if some programs tend to grossly overestimate and others to grossly underestimate intervals in a small number of difficult cases, these small numbers of outliers will skew the population mean such that it does not represent precisely the program bias on the large majority of the ECGs. We therefore adopted a two-step procedure to estimate the bias of each program. First, the total population mean was calculated for each program. We then removed from the set the 25% of ECGs with the largest absolute differences between any two programs, thereby effectively removing outliers. Finally, we recalculated the population mean from the remaining 75% as our bias estimate. This approach avoided the influence of skew in the programs' treatment of outliers and offered a better representation of program bias for the majority of ECGs. The difference between the simple population mean and the mean of the population without outliers was small (<2 ms in all cases and >1 ms for PR interval and QT interval in three and two programs, respectively).

Once biases were calculated, to assess the variability, consistency, and reliability of each program, we took the bias-corrected median value of the 21 results for each ECG to be its most probable de-facto "true" value. Using the median reduces the sensitivity to occasional


large outliers (e.g. when an ECG landmark is missed by a program). The "error" of each individual measurement was calculated as the difference from the median with correction for the program's bias. Cases where more than 10 measurements of the 21 were absent were discarded for all programs (e.g. the PR interval in cases of atrial fibrillation). Most programs have a measurement resolution of 2 ms, some of 1 ms; we therefore rounded the program's bias to the nearest even or integer number for this analysis in order to preserve this quantization in the results. Statistical analysis of the resulting error distributions showed that the two middle quartiles followed a normal distribution, but that the outer quartiles were closer to a Laplace distribution, revealing many more significant outliers than expected for a normal distribution. Considering this pattern and the skewing toward overestimation or underestimation for some programs, it did not make much sense to illustrate the clinical significance of differences between programs with the SD, the mean absolute error, or other simple statistical characterizations. We therefore chose to characterize performance with a graphical presentation of the cumulative error distributions in addition to non-parametric distribution percentiles. From the distributions, we calculated four metrics, each showing different aspects of the program's performance.

Metric 1 was the 75th percentile width of the absolute error, which measures a program's intrinsic variability without considering gross errors (outlier values). This value effectively represents performance for ECGs with easily identified onset and offset landmarks.

Metric 2 was the 98th percentile width of the absolute error, capturing errors that would be clinically very significant. Such errors are often caused by inflection points or waves that are missed by the program or by inclusion of the following wave (e.g. the U wave or the next beat's P wave for the QT interval). Metric 2 gives a good summary of the program's tendency for outlier errors but does not give insight into whether these are mostly underestimations or overestimations compared with the median.

With metrics 3 and 4 we tried to estimate in how many cases a measurement error would lead to a wrong clinical interpretation for a prolonged PR interval indicating first-degree atrioventricular (AV) block, a prolonged QRS duration indicating intraventricular conduction defects (bundle-branch block), or a prolonged QTc interval indicating long QT syndrome or drug-induced long QT. We assessed how many times a measurement error would lead to a false-positive (metric 3) or false-negative (metric 4) interpretation. We set limits of normality at 190 ms for the PR interval (indicating first-degree AV block), 110 ms for QRS duration, and 460 ms for QTc (corrected with a linear [Framingham] RR correction with a coefficient of 0.154 [16]). These thresholds were chosen to be close to generally used normal limits and close to the 10th percentile for the population in this database, so that application of these thresholds resulted in about 200 "abnormal" values for each interval. By no means do we imply that these are the accepted thresholds for normal vs. abnormal, nor did we correct for age. However, they are close enough to estimate what the effect of measurement errors in the critical zone around clinically important thresholds would be. For example, a 10 ms error in a QRS of duration 150 ms would not change the clinical diagnosis very much, while around 110 ms it would be much more important. We again removed the program-dependent bias from each measurement and took as truth the median of all 21 measurements for an ECG. False-positive results were defined as measurements more than 10 ms above the limit of normality while the truth was below it. For false-negative results, the opposite was true. The numbers of ECGs in our database did not permit analysis of abnormally short intervals.

Statistical analysis

We calculated 95% confidence intervals (CIs) for bias, based on the SD of each program's error distribution around the estimated true values, presuming a normal distribution. We also presumed a normal approximation for estimating the 95% CIs for the percentile metrics (metrics 1 and 2), but considering a minimum value of 2 ms (±1 ms is the resolution of most of the measurements). For the cumulative error distributions, we used the Wilson [15] score instead of the normal approximation to estimate the 95% CI error bars for percentages, since it gives a better estimate for values close to zero or one, and we display the program's measurement resolution (1 or 2 ms) for the errors on the x-axis. We also used the Wilson score for the 95% CIs of metrics 3 and 4. To assess the probability (p value) that the pairwise performance of the programs did not differ, we used the Student t-test for the bias and the McNemar test on paired nominal data for the true-positive and false-positive rates (metrics 3 and 4). We report p values <0.1 by this value and higher values as "not significant".

Results

ECG processing and analysis

After playback of the 2155 ECGs, reported heart rate inconsistencies revealed that for six cases the wrong ECG was played back to some devices and that one ECG had technical playback issues. These were excluded for all devices, resulting in 2148 ECGs and a test set of 45,108 digital recordings; no recordings were excluded for quality of the originally recorded ECG. We were able to parse all records and extract the record identifiers, age, sex, ECG interval measurements, and interpretation text.

The interval measurements for all ECGs and all programs varied substantially. The mean population values were 153 ms (range 65–394 ms and 5th to 95th percentile 109–208 ms) for the PR interval, 95 ms (range 56–198 ms and 5th to 95th percentile 71–134 ms) for QRS duration, and 400 ms (range 217–579 ms and 5th to 95th percentile 311–485 ms) for the QT interval.

Measurement bias

We found small but consistent differences between programs, mostly in QRS duration and QT interval. The bias of the programs for PR interval was small, with the maximum difference between programs, for GE and Mortara, being only 4.7 ms (Fig. 1A). The bias for all other programs was within 1 ms. Pairwise differences were highly significant (p < 0.001) except between Means and Philips, Schiller and Glasgow, and Schiller and Midmark.

The maximum difference in bias for QRS durations was 5.8 ms (between GE and Mortara). This difference is quite high compared with the QRS duration range and considering the nature of this type of waveform. The GE, Glasgow, and Schiller programs had negative bias and the Means, Midmark, Mortara, and Philips programs were above the average (Fig. 1B). All pairwise differences except between Midmark and Mortara were significant (Midmark vs. Philips, p = 0.06, Mortara vs. Philips, p = 0.006, and p < 0.001 for the others).

The QT interval is the most difficult to measure because the end of the T wave is not well defined and might coincide with the following wave. Thus, unsurprisingly, the differences in bias between programs were the greatest for this interval. The largest difference was 12 ms, seen between GE and Schiller, with that between Schiller and Philips being similar (Fig. 1C). Of the other programs, Midmark and Mortara measured intervals that were shorter than the population mean (−4 ms) and Glasgow and Means longer (+2 ms). All pairwise differences except between GE and Philips were statistically highly significant (all p < 0.001).

Measurement variability

In contrast to measurement of bias, analysis of variability may provide an implicit judgment of an individual program's quality. We have, therefore, decided not to disclose the identity of the programs for these metrics and have simply named them A–G in undisclosed order for these results. Fig. 2 shows the cumulative error distributions.
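The Wilson score method cited in the Statistical analysis section (used for the 95% CI error bars on percentages) can be sketched as follows. This is an illustrative reconstruction, not the authors' analysis code; the function name and example counts are ours.

```python
import math

def wilson_ci(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion of k events in n trials.

    Unlike the normal approximation, the interval never leaves [0, 1]
    and stays sensible for proportions close to zero or one, which is
    why it suits the small tail percentages plotted in the figures.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = k / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return max(0.0, centre - half), min(1.0, centre + half)
```

For example, 3 large errors among 2148 ECGs give an interval strictly above zero (roughly 0.05%–0.4%), whereas the normal approximation would produce a negative lower bound for such a small proportion.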


Fig. 1. Mean biases in PR interval (A), QRS duration (B), and QT interval (C), compared with the overall population mean (0).
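The two-step bias estimate behind Fig. 1 (plain population mean per program, then recomputation after discarding the 25% of ECGs on which the programs disagree most) can be sketched as below. This is a hypothetical reconstruction run on synthetic data; the variable names and simulated values are ours, not the authors' code or data.

```python
import random
import statistics

random.seed(0)
N_ECG, N_PROG = 2000, 7

# meas[i][j]: interval measured by program j on ECG i, in ms (synthetic).
meas = [[150 + random.gauss(0, 5) for _ in range(N_PROG)] for _ in range(N_ECG)]
for row in meas[:100]:                 # simulate "difficult" outlier ECGs on
    for j in range(N_PROG):            # which programs scatter widely
        row[j] += random.gauss(0, 40)

# Step 1: plain population mean per program; outliers can skew this.
mean_all = [statistics.fmean(row[j] for row in meas) for j in range(N_PROG)]

# Step 2: drop the 25% of ECGs with the largest spread between any two
# programs, then recompute each program's mean on the remaining 75%.
order = sorted(range(N_ECG), key=lambda i: max(meas[i]) - min(meas[i]))
kept = [meas[i] for i in order[: (3 * N_ECG) // 4]]
mean_trim = [statistics.fmean(row[j] for row in kept) for j in range(N_PROG)]

# Bias of each program, expressed against the overall population mean (0),
# as in Fig. 1.
grand = statistics.fmean(mean_trim)
bias = [m - grand for m in mean_trim]
```

Because every program contributes the same number of kept measurements, the biases sum to zero by construction, matching the convention in Fig. 1 of plotting deviations around a population mean of 0.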

For the PR interval, the 75th percentiles of absolute errors were between 4 and 6 ms from the median, without clinically significant differences between programs. However, the cumulative distribution shows that programs B, E, and G made more underestimations over 10 ms than the other programs (~10% vs. ~3%). For QRS duration, the 75th percentile for errors was between 4 and 6 ms from the median for all programs

Fig. 2. Cumulative error distributions for all seven programs, by interval. (left) PR intervals. (center) QRS durations. (right) QT intervals. Each point on the graph represents the percentage of measurements with an (absolute) error larger than the value of the horizontal scale. Underestimates are shown on the left of each graph and overestimates are shown on the right. Horizontal axes span ±10 ms, or ±0.25 mm on a printed ECG with a common recording speed of 25 mm/s. As most programs measure with 2 ms resolution, points are plotted at even error values. Error bars indicate measurement resolution (horizontal; 1 or 2 ms) and 95% confidence interval (vertical, Wilson score [15]). Insets show the value of the 75th percentile of the absolute error in ms.
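The curves of Fig. 2 can be reproduced schematically from raw measurements: take the bias-corrected median across programs as the de-facto "true" value for each ECG, compute each program's bias-corrected deviation from it, and report the percentage of deviations exceeding each threshold. A simplified sketch (one measurement per program rather than the paper's 21 results per ECG, and hypothetical names):

```python
import statistics

def tail_curves(meas, bias, thresholds):
    """Cumulative error distributions in the style of Fig. 2 (sketch).

    meas[j][i]: measurement of program j on ECG i (ms).
    bias[j]:    previously estimated systematic bias of program j (ms).
    Returns, per program, the percentage of measurements whose absolute
    bias-corrected deviation from the per-ECG median exceeds each threshold.
    """
    n_prog, n_ecg = len(meas), len(meas[0])
    # De-facto "true" value: median across programs after bias correction.
    truth = [statistics.median(meas[j][i] - bias[j] for j in range(n_prog))
             for i in range(n_ecg)]
    curves = []
    for j in range(n_prog):
        errors = [meas[j][i] - bias[j] - truth[i] for i in range(n_ecg)]
        curves.append([100.0 * sum(abs(e) > t for e in errors) / n_ecg
                       for t in thresholds])
    return curves
```

Splitting the errors by sign instead of taking the absolute value yields the separate underestimation and overestimation branches shown in the figures.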


except C (8 ms), which had a much larger percentage of errors, both underestimations and overestimations, greater than 10 ms. For the QT interval, the 75th error percentiles were lowest (6 ms) for programs B and D, and highest (10 ms) for C and F and were caused mainly by overestimations. As expected, error distributions were wider than for PR intervals and QRS durations. All programs behaved similarly for underestimations of QT intervals, trying to avoid false-negative long-QT interpretations.

Large errors

For PR intervals (Fig. 3), at the 98th percentile, programs B and, to a lesser degree, E and G had high percentages of underestimations compared with the median. The distribution for program E became similar to those for programs A, C, and D when underestimations were greater than 40 ms. Of note, program F had a high number of very large (>60 ms) underestimations. Program B and, to a lesser extent, program E missed a higher number of prolonged PR intervals than the others. The number of large overestimations was similar for all programs except C, which had a much higher number greater than 30 ms and, consequently, the highest false-positive rate for prolonged PR.

For QRS duration, we noted a much wider dispersion of overestimation errors than underestimation errors (Fig. 4). This finding was as expected because the high slopes within the QRS make it less likely to grossly underestimate the duration than to overestimate it. However, program G had twice as many underestimations greater than 30 ms than the next program, resulting in a false-negative rate of more than 5% for intraventricular conduction defects. Greater variance was seen between programs in overestimations, with programs C, F, and E on the outside of the distribution and D on the inside. Likewise, C, F, and, to a lesser extent, E showed the highest false-positive rates for conduction defects.

For QT intervals, as expected, the error distributions were wider than those for PR intervals and QRS durations, and, as for QRS duration, varied more for overestimations than for underestimations (Fig. 5). The number of underestimations was similar for all programs, but the distribution was slightly wider for program D, which also had a notably higher false-negative rate – three times as high as the other programs – for prolonged QTc. Programs A, B, and C distinguished themselves with lower numbers of underestimations greater than 40 ms than the other programs. Programs F and C had more overestimations greater than 40 ms and, accordingly, high false-positive rates for prolonged QTc. The number of very large errors (>80 ms) was particularly high in program F at 1%, followed by A at 0.45%, with the other programs below 0.2%.

A summary of the largest measurement errors with their directions is provided in Table 3.

Discussion

The purpose of this study was to establish whether important systematic or performance differences exist between seven commercial interpretation programs from different manufacturers in measurements obtained with automated electrocardiographs. The programs assessed are all widely used in current clinical practice and for clinical research. Our analysis indicated several differences in interval measurements, not only in the average values but also in relation to clinically relevant cut-offs.

A similar approach was previously adopted by Kligfield et al. in two studies [8,9]. In the later study they processed 800 ECGs with seven programs, six of which were included also in our analysis (GE, Glasgow, Means, Mortara, Philips and Schiller). The test set in that study consisted of 200 normal ECGs, 200 ECGs with slightly prolonged drug-induced QT intervals but which were otherwise normal, and 400 ECGs from patients

Fig. 3. Tails of error distributions of PR intervals and accuracy for prolonged PR. (A) Each point on the graph represents the percentage of measurements with an (absolute) error larger than the value of the horizontal scale. Underestimates are shown on the left of each graph and overestimates are shown on the right. Shorter tails indicate fewer very large errors and narrower curves indicate fewer significant errors. Insets show the value of the 98th percentile of the absolute error in ms. (B) False-positive rates for prolonged PR detection (first-degree atrioventricular block). (C) False-negative rates.


Fig. 4. Tails of error distributions of QRS durations and accuracy for prolonged QRS. (A) Cumulative errors. Each point on the graph represents the percentage of measurements with an (absolute) error larger than the value of the horizontal scale. Underestimates are shown on the left of each graph and overestimates are shown on the right. Shorter tails indicate fewer very large errors and narrower curves indicate fewer significant errors. Insets show the value of the 98th percentile of the absolute error in ms. (B) False-positive rates for prolonged QRS detection (intraventricular conduction delay). (C) False-negative rates.

Fig. 5. Tails of error distributions of QT interval and accuracy for prolonged QTc. (A) Cumulative errors. Each point on the graph represents the percentage of measurements with an (absolute) error larger than the value of the horizontal scale. Underestimates are shown on the left of each graph and overestimates are shown on the right. Shorter tails indicate fewer very large errors and narrower curves indicate fewer significant errors. Insets show the value of the 98th percentile of the absolute error in ms. (B) False-positive rates for prolonged QTc detection. (C) False-negative rates.
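The prolonged-QTc classification underlying Fig. 5 rests on the linear Framingham rate correction with coefficient 0.154 and the study's working limit of 460 ms, both given in the Methods. A minimal sketch; the function names are ours, and the 460 ms limit is the study's working threshold rather than a universal norm:

```python
def qtc_framingham(qt_ms: float, rr_s: float) -> float:
    """Linear Framingham correction: QTc = QT + 0.154 * (1 - RR).

    The coefficient 0.154 applies with both intervals in seconds;
    with QT in milliseconds it becomes 154 ms per second of RR deficit.
    """
    return qt_ms + 154.0 * (1.0 - rr_s)

def prolonged_qtc(qt_ms: float, heart_rate_bpm: float,
                  limit_ms: float = 460.0) -> bool:
    """Apply the study's working threshold for a prolonged QTc."""
    rr_s = 60.0 / heart_rate_bpm  # RR interval in seconds
    return qtc_framingham(qt_ms, rr_s) > limit_ms
```

At 60 bpm (RR = 1 s) the correction vanishes and QTc equals QT; at 80 bpm a measured QT of 440 ms corrects to 478.5 ms and would be flagged as prolonged, which illustrates how a modest QT measurement error near the threshold can flip the interpretation.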


Table 3 overestimations and underestimations differed strikingly between pro-


Summary of the most important overestimations and underestimations for each interval grams in some cases. For example, program F had a high number of very
by program.
large underestimations for the PR interval because it sometimes did not
Program Overestimation Underestimation detect the presence of the first phase of biphasic P waves. Program G did
A not detect the outer waves of some type of QRS patterns indicating
B PR right-bundle-branch block.
C PR, QRS, QT Large errors, particularly for QRS interval landmarks, will lead to fur-
D QT
ther measurement errors, such as in Q wave durations or ST amplitudes,
E
F QRS, QT and, consequently, to more program interpretation errors. The large in-
G QRS terval errors themselves might also lead to interpretation errors by
The median value of the 21 results of the seven programs for each ECG was taken to be the
reviewing cardiologists if high trust is placed in program measure-
most probable de-facto “true” value against which results were compared. ments. Some insight into this effect is gained from our comparison met-
rics for prolonged interval interpretation. The false-positive rates for
clinically abnormal prolonged intervals were generally low at 0.5% for
PR interval, 1.0% for QRS duration, and 1.8% for QTc prolongation, but
with long QT syndrome. The ECGs were presented to the programs in programs differed up to tenfold, meaning that the number of interpreta-
digital form. In the current research, we transformed ECGs into analog tions requiring correction could be sizeable for some programs. The
electrical signals that were reacquired and processed by the electrocar- number of false-negative results is potentially more important, and
diographs. In addition, our ECG test set was larger, came from normal these were higher with average values of 5.7% for prolonged PR, 2.1%
clinical practice, included pediatric ECGs, and contained all kinds of ab- for QRS prolongation, and 8.5% for prolonged QTc and for some pro-
normalities and quality issues. For the six programs common to both grams up to three times the average. These findings are noteworthy
studies, the population means found by Kligfield and colleagues [9] because they could lead to decisions to stop certain cardiovascular or
were 154, 90, and 421 ms for the PR interval, QRS duration, and QT in- non-cardiovascular treatments [18] or deny the initiation of certain
terval, respectively, compared with 153, 95, and 401 ms in our study. treatments or procedures, such as cardiac resynchronization therapy
The longer QRS duration in the current study can be explained by the [12,19]. For QTc interval in particular, a substantial false-negative rate
fact that the test set of Kligfield et al. did not contain ECGs from patients could lead to missed cases of long QT syndrome or drug-induced QTc
with bundle-branch block, whereas our set did. The longer QT interval in the study of Kligfield et al. was due to the disproportionate number of patients with long QT syndrome. These differences reinforce that our dataset is more representative of real-life ECGs.

Like Kligfield and colleagues, we deliberately did not attempt to define standard references for the intervals, mainly because human measurement of these intervals varies and depends heavily on the experience and habits of the reader. It is very difficult, if not impossible, to decide how an appropriate standard could be reached, particularly given the lack of clear medical definitions for the end of the QRS and T waves [17]. Nevertheless, although differences in bias between individual programs were small, they were clearly present, and awareness of them is important for clinical decision making. The values we found are in line with previous reports [6–12] in terms of relative rank between programs and order of magnitude, although the actual values differ somewhat. This is not surprising, however, given the differences in the types of ECGs in the various datasets, and possibly differences in program versions.

Previous reports have generally reported bias without details on program measurement variance. We decided to use the median of all programs as a de facto standard for the "true" value in our evaluation of each program's measurement variance, while taking into account each individual program's bias. We found that simple statistical metrics such as the SD or distribution percentiles do not adequately document the differences between programs, because the error distributions are not normal. Additionally, the tails can be very skewed, indicating either positive or negative large errors, which are clinically the most important. We therefore decided to document the programs' behaviors mainly by the distributions themselves. We found the cumulative distributions most appropriate, as they clearly depict the proportions of clinically important cases with errors above certain values.

Overall, we found that all programs did a reasonable job for most of the ECGs, with little variability seen in the center quartiles of the distribution. The 75th percentile values varied from the median by less than 6 ms for PR interval, 8 ms for QRS duration, and 10 ms for QT interval. Clinically, large errors are more important: average 98th percentile values for the absolute difference from the median were 31 ms for PR interval, 17 ms for QRS duration, and 42 ms for QT interval, and the programs could differ by up to a factor of two. However, the absolute error percentile does not give the whole picture. The distributions of

prolongation, markers of an increased risk of life-threatening arrhythmias. Effects on clinical research related to drug development may also occur.

Limitations of the study

Manufacturers do not provide digital input possibilities for performance studies of their programs; therefore, we needed to convert the ECGs to analog voltages and replay them into the electrocardiographs. Small differences in signal processing, filtering, and the characteristics of the electronic amplification and sampling circuitry might have influenced our results. However, these effects are likely to be small, since all manufacturers adhere to internationally accepted minimum performance standards [20], which guarantee faithful reproduction of the original ECG. In addition, we disabled all filtering on the play-back devices except the AC-interference filter, which was set to the mains frequency of 50 Hz and would normally be active. This approach could have affected ECG reproduction. To counteract it, we repeated each acquisition for each program three times to reduce the effect of small differences in the input records, thus improving the statistical power of the study. Although we could not use the same approach as Kligfield et al. [8,9], which required the collaboration of all manufacturers, we believe that the pre-processing, the triple acquisition, and the wide variety of ECG patterns make up for the methodological limit we encountered. Incidentally, the triple-acquisition method also allowed us to study the effect of small changes, and thus the robustness of the programs, which has been reported elsewhere [21].

A digital input possibility for ECG measurement and interpretation programs would greatly facilitate their comparison and alleviate another limitation of our work, namely that some of the compared programs have been revised since we acquired the electrocardiographs, so their performance may have changed. All programs we compared have a long history and their measurement modules are stable, but incremental improvements might well have been made in newer versions. Any comparison study, like ours, is a snapshot, and the comparison itself eventually becomes obsolete. However, our main conclusions on weak points, and the methodology to reveal them, remain valid. With digital input to the programs, it would be trivial to repeat our study for up-to-date results. Any hospital with digital ECG management could provide real-world datasets, even much bigger than we have
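The evaluation scheme just described (estimate each program's systematic bias, subtract it, take the per-ECG median of the corrected measurements as the de facto reference, then summarize each program's absolute deviations from that reference by percentiles rather than SD) can be sketched as follows. This is a minimal illustration on synthetic numbers; the array shapes, bias values, and outlier rates are our own assumptions, not data from the study.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic QT measurements (ms): rows = ECGs, columns = programs.
# Each "program" gets its own systematic bias plus rare large outliers.
n_ecgs, n_programs = 2000, 7
true_qt = rng.normal(400, 30, size=(n_ecgs, 1))
bias = np.array([-4, -2, -1, 0, 1, 3, 6], dtype=float)   # per-program offset (assumed)
noise = rng.normal(0, 4, size=(n_ecgs, n_programs))
outliers = rng.random((n_ecgs, n_programs)) < 0.02       # ~2% gross errors
measured = true_qt + bias + noise + np.where(
    outliers, rng.normal(0, 40, (n_ecgs, n_programs)), 0.0)

# 1) Estimate each program's bias as its mean deviation from the per-ECG median.
raw_median = np.median(measured, axis=1, keepdims=True)
est_bias = (measured - raw_median).mean(axis=0)

# 2) Bias-correct, then recompute the median as the de facto "true" value.
corrected = measured - est_bias
reference = np.median(corrected, axis=1, keepdims=True)

# 3) Heavy-tailed error distributions: report percentiles of |error|
#    per program instead of a single SD.
abs_err = np.abs(corrected - reference)
for p in (75, 98):
    print(f"{p}th percentile of |error| (ms):",
          np.round(np.percentile(abs_err, p, axis=0), 1))
```

Comparing the 75th and 98th percentiles per program mirrors the contrast drawn above: programs that look nearly identical in the central quartiles can still differ markedly in the outlier region.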
used, and our methodology is automatic; no manual "truth" annotation is needed. The only requirement is that enough programs process the same ECGs, so that their median results can be accepted as the de facto truth. Unfortunately, an easy way to digitally input ECGs into the major ECG analysis programs does not exist today.

Our study did not include any ECGs from patients with artificial electronic pacemakers. Recorded digital ECGs are not sampled at frequencies high enough to reproduce pacemaker spikes. The same is true for play-back devices like the one used in this study. Therefore, measurement performance in the presence of pacemaker spikes could not be reported.

Conclusions

We performed a thorough and direct comparison of the ECG intervals (PR interval, QRS duration, and QT interval) measured by seven principal ECG interpretation programs. Like previous studies, we found small differences between programs (bias) in the average QRS duration and QT interval, but not in the PR interval. The interval measurement variability of the programs was similar for most ECGs, but the programs differed substantially in the numbers of outlying measurement errors, which might be clinically significant because these probably occur on the most abnormal ECGs. Except for two programs, all had important weak points that would affect automatic interpretation and in some cases could have led to misdiagnosis if not corrected by a human reader. Additionally, several different metrics were required to bring these weak points to light, presumably because of their different underlying causes; a simple statistical metric was clearly insufficient, and in fact the actual cumulative distributions were the most insightful. We gained much insight into the characteristics of the programs, and we hope that this study will stimulate manufacturers to actively pursue improvements to their measurement programs, especially for those errors that produce false-positive or false-negative diagnoses. With digital input to the programs, our study would not be hard to repeat and could clearly demonstrate the results of those improvements.

Funding

No external funding was received for this work.

CRediT authorship contribution statement

J De Bie: Conceptualization, Methodology, Formal Analysis, Investigation, Data Curation, Writing – Original Draft, Visualization, Project Administration. I Diemberger: Methodology, Formal Analysis, Writing – Review & Editing. J W Mason: Conceptualization, Writing – Review & Editing.

Declaration of Competing Interest

J de Bie is Chief Scientist at Hillrom, owner of the Mortara VERITAS program and licensee of the MEANS program for ECG interpretation. I Diemberger and J W Mason declare no competing interests.

References

[1] Montgomery H, Hunter S, Morris S, Naunton-Morgan R, Marshall RM. Interpretation of electrocardiograms by doctors. BMJ. 1994;309:1551–2. https://doi.org/10.1136/bmj.309.6968.1551.
[2] Willems JL, Abreu-Lima C, Arnaud P, et al. The diagnostic performance of computer programs for the interpretation of electrocardiograms. N Engl J Med. 1991;325:1767–73. https://doi.org/10.1056/NEJM199112193252503.
[3] Kligfield P, Gettes LS, Bailey JJ, et al. Recommendations for the standardization and interpretation of the electrocardiogram: part I: the electrocardiogram and its technology: a scientific statement from the American Heart Association Electrocardiography and Arrhythmias Committee, Council on Clinical Cardiology; the American College of Cardiology Foundation; and the Heart Rhythm Society: endorsed by the International Society for Computerized Electrocardiology. Circulation. 2007;115:1306–24. https://doi.org/10.1161/CIRCULATIONAHA.106.180200.
[4] Macfarlane PW, Mason JW, Kligfield P, et al. Debatable issues in automated ECG reporting. J Electrocardiol. 2017;50:833–40. https://doi.org/10.1016/j.jelectrocard.2017.08.027.
[5] Litell JM, Meyers HP, Smith SW. Emergency physicians should be shown all triage ECGs, even those with a computer interpretation of "Normal". J Electrocardiol. 2019;54:79–81. https://doi.org/10.1016/j.jelectrocard.2019.03.003.
[6] Bailey JJ, Itscoitz SB, Grauer LE, Hirshfeld JW, Horton MR. A method for evaluating computer programs for electrocardiographic interpretation. II. Application to version D of the PHS program and the Mayo Clinic program of 1968. Circulation. 1974;50:80–7. https://doi.org/10.1161/01.cir.50.1.80.
[7] Massel D, Dawdy JA, Melendez LJ. Strict reliance on a computer algorithm or measurable ST segment criteria may lead to errors in thrombolytic therapy eligibility. Am Heart J. 2000;140:221–6. https://doi.org/10.1067/mhj.2000.108240.
[8] Kligfield P, Badilini F, Rowlandson I, et al. Comparison of automated measurements of electrocardiographic intervals and durations by computer-based algorithms of digital electrocardiographs. Am Heart J. 2014;167:150–9. https://doi.org/10.1016/j.ahj.2013.10.004.
[9] Kligfield P, Badilini F, Denjoy I, et al. Comparison of automated interval measurements by widely used algorithms in digital electrocardiographs. Am Heart J. 2018;200:1–10. https://doi.org/10.1016/j.ahj.2018.02.014.
[10] Vancura V, Wichterle D, Ulc I, et al. The variability of automated QRS duration measurement. Europace. 2017;19:636–43. https://doi.org/10.1093/europace/euw015.
[11] Mason JW, Ramseth DJ, Chanter DO, Moon TE, Goodman DB, Mendzelevski B. Electrocardiographic reference ranges derived from 79,743 ambulatory subjects. J Electrocardiol. 2007;40:228–34.e8. https://doi.org/10.1016/j.jelectrocard.2006.09.003.
[12] De Pooter J, El Haddad M, Stroobandt R, De Buyzere M, Timmermans F. Accuracy of computer-calculated and manual QRS duration assessments: clinical implications to select candidates for cardiac resynchronization therapy. Int J Cardiol. 2017;236:276–82. https://doi.org/10.1016/j.ijcard.2017.01.129.
[13] International Electrotechnical Commission. IEC 60601-2-51:2003 Medical electrical equipment - Part 2-51: Particular requirements for safety, including essential performance, of recording and analysing single channel and multichannel electrocardiographs. Geneva: International Electrotechnical Commission; 2003.
[14] De Bie J, et al. Performance of seven ECG interpretation programs in identifying arrhythmia and acute cardiovascular syndrome. J Electrocardiol. 2020;58:143–9. https://doi.org/10.1016/j.jelectrocard.2019.11.043.
[15] Wilson EB. Probable inference, the law of succession, and statistical inference. J Am Stat Assoc. 1927;22:209–12. https://doi.org/10.1080/01621459.1927.10502953.
[16] Sagie A, Larson MG, Goldberg RJ, Bengtson JR, Levy D. An improved method for adjusting the QT interval for heart rate (the Framingham Heart Study). Am J Cardiol. 1992;70:797–801. https://doi.org/10.1016/0002-9149(92)90562-D.
[17] Kligfield P, Hancock EW, Helfenbein ED, et al. Relation of QT interval measurements to evolving automated algorithms from different manufacturers of electrocardiographs. Am J Cardiol. 2006;98:88–92. https://doi.org/10.1016/j.amjcard.2006.01.060.
[18] Diemberger I, Massaro G, Cubelli M, et al. Repolarization effects of multiple-cycle chemotherapy and predictors of QTc prolongation: a prospective female cohort study on >2000 ECGs. Eur J Clin Pharmacol. 2015;71:1001–9. https://doi.org/10.1007/s00228-015-1874-3.
[19] Boriani G, Ziacchi M, Nesti M, et al. Cardiac resynchronization therapy: how did consensus guidelines from Europe and the United States evolve in the last 15 years? Int J Cardiol. 2018;261:119–29. https://doi.org/10.1016/j.ijcard.2018.01.039.
[20] International Electrotechnical Commission. IEC 60601-2-25:2011 Medical electrical equipment - Part 2-25: Particular requirements for the basic safety and essential performance of electrocardiographs. Geneva: International Electrotechnical Commission; 2011.
[21] De Bie J, Diemberger I. Interpretation and measurement consistency of seven ECG computer programs. J Electrocardiol. 2019;57(Suppl):S99. https://doi.org/10.1016/j.jelectrocard.2019.08.021.