Journal of Electrocardiology
Abstract

Background: Electrocardiograph-generated measurements of PR, QRS, and QT intervals are generally thought to be more precise than manual measurements on paper records. However, the performance of different programs has not been well compared.

Methods: Routinely obtained digital electrocardiograms (ECGs), including over 500 pediatric ECGs, were used to create over 2000 10 s analog ECGs that were replayed through seven commercially available electrocardiographs. The measurements for PR interval, QRS duration, and QT interval made by each program were extracted and compared against each other (using the median of the programs after correction for program bias) and the population mean values.

Results: Small but significant systematic biases were seen between programs. The smallest and largest variation from the population mean differed by 4.7 ms for PR intervals, 5.8 ms for QRS duration, and 12.4 ms for QT intervals. In pairwise comparison, programs showed similar accuracy for most ECGs, with the average absolute errors at the 75th percentile for PR intervals being 4–6 ms from the median, QRS duration 4–8 ms, and QT interval 6–10 ms. However, substantial differences were present in the numbers and extent of large, clinically significant errors (e.g. at the 98th percentile), for which programs differed by a factor of two for absolute errors, as well as differences in the mix of overestimations and underestimations.

Conclusions: When reading digital ECGs, users should be aware that small systematic differences exist between programs and that there may be large clinically important errors in difficult cases.

Keywords: ECG; Electrocardiogram; Automated ECG measurement; QRS duration; PR interval; QT interval

© 2020 The Authors. Published by Elsevier Inc. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
Introduction

Computer programs that make electrocardiogram (ECG) measurements and diagnostic interpretations were first developed in the late 1960s. Their use substantially increased in the 1980s when the results could be shown in real time directly on the ECG. Currently, almost all ECG acquisition equipment includes computer measurement and interpretation. Clinicians inexperienced in making ECG interval measurements often rely on these automated results to make treatment decisions [1,2], yet algorithmic measurements and interpretations are intended to be preliminary and followed up by physician review [3–5].

The issue of the quality of computer interpretation and measurements was raised in the 1970s [6]. Computer interpretations of single programs have generally been compared with the findings of one or more experts on sets of ECGs available only to the researchers [7] or the manufacturers. However, since methods, experts, and datasets have differed between studies, the results have not been comparable, and many of the findings are now dated because of changes in the technology. The subject of correct automatic assessment of ECG intervals (PR/PQ, QRS, and QT) is relevant because they are the bases of the interpretation algorithms. They are also used in cardiac safety monitoring by clinicians and in the development and approval of new medications and interventions. Comparative studies by Kligfield et al. [8,9] on six of the seven programs in the current work showed small but consistent mean differences, but no analysis of variability was performed. Similar small biases were found by Vancura et al. [10] in a small study of QRS durations measured by two of the same programs. Mason et al. [11] assessed the same two programs in a large study to establish normal reference ranges for duration, but neither Vancura et al. nor Mason et al. analyzed variability beyond standard deviations (SDs). De Pooter et al. [12] compared QRS duration measured by two human readers and two programs, finding no mean difference between the programs but much wider pairwise limits of agreement between the programs than between human readers for wide-QRS ECGs.

More comparative data on the performance of commercially available ECG recorders are needed to characterize differences fully.

⁎ Corresponding author at: Mortara Instrument Europe s.r.l., Via G. di Vittorio 21b, Castel Maggiore, 40013 Bologna, Italy.
E-mail address: Johannes.debie@hillrom.com (J. De Bie).
https://doi.org/10.1016/j.jelectrocard.2020.10.006
0022-0736/© 2020 The Authors. Published by Elsevier Inc.
J. De Bie, I. Diemberger and J.W. Mason Journal of Electrocardiology 63 (2020) 75–82
We compared the interval measurements between seven mainstream ECG programs through the analysis of a large, representative real-world ECG dataset.

Materials and methods

In this study, we assessed the PR interval, QRS duration, and QT interval measurements of seven ECG interpretation programs (Table 1), using ECGs obtained routinely in hospitals and acute-care settings. Since manufacturers do not provide digital input possibilities for validation of their interpretation programs, we created a large set of representative analog ECGs converted from previously recorded digital ECGs acquired with equipment that complied with the requirements of International Electrotechnical Commission standard IEC 60601-2-51:2003 [13].

Selection of ECGs

Anonymized ECGs were obtained from ECG databases in eight centers in Australia, Italy, and the USA, including 510 pediatric ECGs from three centers in Italy with no specific selection criteria apart from being consecutive for recording date and time. In addition, 228 ECGs with statements related to acute ischemia (generated by the VERITAS program, Mortara Instrument, Milwaukee, WI, USA) were obtained from an ambulance service database and a university hospital database.

ECGs from patients with pacemakers were excluded because pacemaker spikes are usually detected in the analog portion of electrocardiographs and are not faithfully recorded in stored records. Most ECGs were originally recorded with various models of electrocardiographs manufactured by Mortara Instrument Inc. In total we used 2155 digital ECGs representative of real-world conditions but biased toward abnormal ECGs. Table 2 shows the major characteristics of the database.

Re-recording of ECGs

To be able to use the ECGs for our study, we had to convert them back into analog format in an endless loop and replay them into physical electrocardiographs; we did this through a Whaleteq MECG 2.0 Multichannel ECG Test System (WHALETEQ Co Ltd., Taipei City, Taiwan) connected to a laptop PC. More details on the method and ECG pre-processing can be found in De Bie et al. [14]. The dataset in this study was the same as is described in that paper. After pre-processing, the 2155 ECGs were played back three times; thus, for each of the seven electrocardiographs, three new digital recordings were derived from each of the 2155 original digital ECG recordings, giving a total of 45,255 digital recordings.

Although the repeat frequency of all playback loops was exactly 10 s, it could not be guaranteed in which phase of the loop the acquired traces would start. For instance, some recordings might start between two beats or within a QRS complex, and a single abnormal beat could be in various positions within the captured record. For this reason and because of small reproduction differences, amplifier noise, and sampling effects, the three automatic measurement sets and interpretations of each ECG differed slightly, as happens in normal practice when repeat ECGs are acquired from an individual patient.

Table 1
Brand name and model of the electrocardiographs used to acquire and process the ECGs, with name and version of the interpretation program.

Electrocardiograph, program name | Program version
GE Healthcare MAC2000™, 12SL | Version 22
Burdick® 8500, Glasgow | Version 26.5
Welch-Allyn® CP150, MEANS | Revision 2016-7
Midmark® IQManager® resting ECG | Version 8.6.1
Hillrom™ / Mortara™ ELI® 380, VERITAS® | Version 7.3.0
Philips® TC 20, DXL | Version PH100B
Schiller® MS2015 | Version R16.01

Table 2
Major characteristics of the ECG database.

Metrics | Percentage
Patient population
Age (years) [a]
≤2 | 4.6%
>2–10 | 12.9%
>10–16 | 6.3%
>16–40 | 4.9%
>40–60 | 12.2%
>60–70 | 10.5%
>70–80 | 12.4%
>80 | 10.8%
Unknown | 25.4%
Gender [a]
Male | 37.4%
Female | 37.2%
Unknown | 25.4%
ECG interpretation
Heart rate (bpm)
≤60 | 16.1%
60–99 | 67.8%
100–119 | 9.2%
≥120 | 6.9%
Normal or borderline interpretation | 42.0%
Abnormal interpretation | 58.0%
Sinus rhythm | 86.7%
Atrial fibrillation/flutter | 10.5%
Atrial enlargement criteria | 8.8%
Abnormal axis/fascicular block | 14.9%
RVH/LVH criteria | 21.4%
Atrioventricular (PR) abnormality | 8.2%
Old infarct | 16.4%
ACS/STEMI | 15.2%
Non-acute ST&T abnormality | 8.0%

[a] If age and/or sex were unknown, they were set at 40 years and/or, randomly, to male or female at the time of re-recording. Interpretations in this table were made by the VERITAS program (Mortara Instrument, Milwaukee, WI, USA). Abbreviations: ACS/STEMI, acute coronary syndrome or ST-elevation myocardial infarction; LVH, left ventricular hypertrophy; RVH, right ventricular hypertrophy.

Outcomes

Our first goal was to estimate each program's systematic bias in measuring the PR interval, QRS duration, and QT interval, replicating previous work by Kligfield et al. [8,9] in a more representative real-world dataset. In that work, the bias was simply calculated through the population mean for all ECGs. However, if some programs tend to grossly overestimate and others to grossly underestimate intervals in a small number of difficult cases, these small numbers of outliers will skew the population mean such that it does not represent precisely the program bias on the large majority of the ECGs. We therefore adopted a two-step procedure to estimate the bias of each program. First, the total population mean was calculated for each program. We then removed from the set the 25% of ECGs with the largest absolute differences between any two programs, thereby effectively removing outliers. Finally, we recalculated the population mean from the remaining 75% as our bias estimate. This approach avoided the influence of skew in the programs' treatment of outliers and offered a better representation of program bias for the majority of ECGs. The difference between the simple population mean and the mean of the population without outliers was small (<2 ms in all cases and >1 ms for PR interval and QT interval in three and two programs, respectively).

Once biases were calculated, to assess the variability, consistency, and reliability of each program, we took the bias-corrected median value of the 21 results for each ECG to be its most probable de-facto "true" value. Using the median reduces the sensitivity to occasional
large outliers (e.g. when an ECG landmark is missed by a program). The "error" of each individual measurement was calculated as the difference from the median with correction for the program's bias. Cases where more than 10 measurements of the 21 were absent were discarded for all programs (e.g. the PR interval in cases of atrial fibrillation). Most programs have a measurement resolution of 2 ms, some of 1 ms; we therefore rounded the program's bias to the nearest even or integer number for this analysis in order to preserve this quantization in the results. Statistical analysis of the resulting error distributions showed that the two middle quartiles followed a normal distribution, but that the outer quartiles were closer to a Laplace distribution, revealing many more significant outliers than expected for a normal distribution. Considering this pattern and the skewing toward overestimation or underestimation for some programs, it did not make much sense to illustrate the clinical significance of differences between programs with the SD, the mean absolute error, or other simple statistical characterizations. We therefore chose to characterize performance with a graphical presentation of the cumulative error distributions in addition to non-parametric distribution percentiles. From the distributions, we calculated four metrics, each showing different aspects of the program's performance.

Metric 1 was the 75th percentile width of the absolute error, which measures a program's intrinsic variability without considering gross errors (outlier values). This value effectively represents performance for ECGs with easily identified onset and offset landmarks.

Metric 2 was the 98th percentile width of the absolute error, which captures clinically very significant errors. Such errors are often caused by inflection points or waves that are missed by the program or by inclusion of the following wave (e.g. the U wave or the next beat's P wave for the QT interval). Metric 2 gives a good summary of the program's tendency for outlier errors but does not give insight into whether these are mostly underestimations or overestimations compared with the median.

With metrics 3 and 4 we tried to estimate in how many cases a measurement error would lead to a wrong clinical interpretation for a prolonged PR interval indicating first-degree atrioventricular (AV) block, a prolonged QRS duration indicating intraventricular conduction defects (bundle-branch block), or a prolonged QTc interval indicating long QT syndrome or drug-induced long QT. We assessed how many times a measurement error would lead to a false-positive (metric 3) or false-negative (metric 4) interpretation. We set limits of normality at 190 ms for the PR interval (indicating first-degree AV block), 110 ms for QRS duration, and 460 ms for QTc (corrected with a linear [Framingham] RR correction with a coefficient of 0.154 [16]). These thresholds were chosen to be close to generally used normal limits and close to the 10th percentile for the population in this database, so that application of these thresholds resulted in about 200 "abnormal" values for each interval. By no means do we imply that these are the accepted thresholds for normal vs. abnormal, nor did we correct for age. However, they are close enough to estimate what the effect of measurement errors in the critical zone around clinically important thresholds would be. For example, a 10 ms error in a QRS of duration 150 ms would not change the clinical diagnosis very much, while around 110 ms it would be much more important. We again removed the program-dependent bias from each measurement and took as truth the median of all 21 measurements for an ECG. False-positive results were defined as measurements more than 10 ms above the limit of normality while the truth was below it. For false-negative results, the opposite was true. The numbers of ECGs in our database did not permit analysis of abnormally short intervals.

Statistical analysis

We calculated 95% confidence intervals (CIs) for bias, based on the SD of each program's error distribution around the estimated true values, presuming a normal distribution. We also presumed a normal approximation for estimating the 95% CIs for the percentile metrics (metrics 1 and 2), but considering a minimum value of 2 ms (±1 ms is the resolution of most of the measurements). For the cumulative error distributions, we used the Wilson [15] score instead of the normal approximation to estimate the 95% CI error bars for percentages, since it gives a better estimate for values close to zero or one, and we display the program's measurement resolution (1 or 2 ms) for the errors on the x-axis. We also used the Wilson score for the 95% CIs of metrics 3 and 4. To assess the probability (p value) that the pairwise performance of the programs did not differ, we used the Student t-test for the bias and the McNemar test on paired nominal data for the false-positive and false-negative rates (metrics 3 and 4). We report p values <0.1 by value and higher values as "not significant".

Results

ECG processing and analysis

After playback of the 2155 ECGs, reported heart rate inconsistencies revealed that for six cases the wrong ECG was played back to some devices and that one ECG had technical playback issues. These were excluded for all devices, resulting in 2148 ECGs and a test set of 45,108 digital recordings; no recordings were excluded for quality of the originally recorded ECG. We were able to parse all records and extract the record identifiers, age, sex, ECG interval measurements, and interpretation text.

The interval measurements for all ECGs and all programs varied substantially. The mean population values were 153 ms (range 65–394 ms and 5th to 95th percentile 109–208 ms) for the PR interval, 95 ms (range 56–198 ms and 5th to 95th percentile 71–134 ms) for QRS duration, and 400 ms (range 217–579 ms and 5th to 95th percentile 311–485 ms) for the QT interval.

Measurement bias

We found small but consistent differences between programs, mostly in QRS duration and QT interval. The bias of the programs for PR interval was small, with the maximum difference between programs, for GE and Mortara, being only 4.7 ms (Fig. 1A). The bias for all other programs was within 1 ms. Pairwise differences were highly significant (p < 0.001) except between Means and Philips, Schiller and Glasgow, and Schiller and Midmark.

The maximum difference in bias for QRS durations was 5.8 ms (between GE and Mortara). This difference is quite high compared with the QRS duration range and considering the nature of this type of waveform. The GE, Glasgow, and Schiller programs had negative bias and the Means, Midmark, Mortara, and Philips programs were above the average (Fig. 1B). All pairwise differences except between Midmark and Mortara were significant (Midmark vs. Philips, p = 0.06, Mortara vs. Philips, p = 0.006, and p < 0.001 for the others).

The QT interval is the most difficult to measure because the end of the T wave is not well defined and might coincide with the following wave. Thus, unsurprisingly, the differences in bias between programs were the greatest for this interval. The largest difference was 12 ms, seen between GE and Schiller, with that between Schiller and Philips being similar (Fig. 1C). Of the other programs, Midmark and Mortara measured intervals that were shorter than the population mean (−4 ms) and Glasgow and Means longer (+2 ms). All pairwise differences except between GE and Philips were statistically highly significant (all p < 0.001).

Measurement variability

In contrast to measurement of bias, analysis of variability may provide an implicit judgment of an individual program's quality. We have, therefore, decided not to disclose the identity of the programs for these metrics and have simply named them A–G in undisclosed order for these results. Fig. 2 shows the cumulative error distributions.
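The two-step bias estimate and the median-based "error" definition described in Methods can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the function names are ours, the array holds one measurement per program per ECG (the study used 21 results per ECG, 7 programs × 3 playbacks), and the outlier screen uses the max-minus-min spread as the "largest absolute difference between any two programs".

```python
import numpy as np

def estimate_bias(intervals):
    """Two-step per-program bias estimate (illustrative sketch).

    intervals: (n_ecgs, n_programs) array of one interval type,
    e.g. QT in ms; rows are ECGs, columns are programs.
    """
    # Largest absolute difference between any two programs = max - min.
    spread = intervals.max(axis=1) - intervals.min(axis=1)
    # Drop the 25% of ECGs with the largest inter-program disagreement,
    # then take the population mean of the remaining 75% per program.
    keep = spread <= np.percentile(spread, 75)
    return intervals[keep].mean(axis=0)

def errors_vs_median(intervals, bias):
    """Error of each measurement relative to the bias-corrected median,
    which the study takes as the de-facto 'true' value of each ECG."""
    corrected = intervals - bias              # remove per-program bias
    truth = np.median(corrected, axis=1)      # robust 'truth' per ECG
    return corrected - truth[:, None]
```

Because the median, not the mean, defines the truth, a single program missing a landmark on one ECG barely shifts that ECG's reference value.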
Fig. 1. Mean biases in PR interval (A), QRS duration (B), and QT interval (C), compared with the overall population mean (0).

Fig. 2. Cumulative error distributions for all seven programs, by interval. (left) PR intervals. (center) QRS durations. (right) QT intervals. Each point on the graph represents the percentage of measurements with an (absolute) error larger than the value of the horizontal scale. Underestimates are shown on the left of each graph and overestimates are shown on the right. Horizontal axes span ±10 ms, or ±0.25 mm on a printed ECG with a common recording speed of 25 mm/s. As most programs measure with 2 ms resolution, points are plotted at even error values. Error bars indicate measurement resolution (horizontal; 1 or 2 ms) and 95% confidence interval (vertical, Wilson score [15]). Insets show the value of the 75th percentile of the absolute error in ms.

For the PR interval, the 75th percentiles of absolute errors were between 4 and 6 ms from the median, without clinically significant differences between programs. However, the cumulative distribution shows that programs B, E, and G made more underestimations over 10 ms than the other programs (~10% vs. ~3%). For QRS duration, the 75th percentile for errors was between 4 and 6 ms from the median for all programs
except C (8 ms), which had a much larger percentage of errors, both underestimations and overestimations, greater than 10 ms. For the QT interval, the 75th error percentiles were lowest (6 ms) for programs B and D, and highest (10 ms) for C and F, and were caused mainly by overestimations. As expected, error distributions were wider than for PR intervals and QRS durations. All programs behaved similarly for underestimations of QT intervals, trying to avoid false-negative long-QT interpretations.

Large errors

For PR intervals (Fig. 3), at the 98th percentile, programs B and, to a lesser degree, E and G had high percentages of underestimations compared with the median. The distribution for program E became similar to those for programs A, C, and D when underestimations were greater than 40 ms. Of note, program F had a high number of very large (>60 ms) underestimations. Program B and, to a lesser extent, program E missed a higher number of prolonged PR intervals than the others. The number of large overestimations was similar for all programs except C, which had a much higher number greater than 30 ms and, consequently, the highest false-positive rate for prolonged PR.

For QRS duration, we noted a much wider dispersion of overestimation errors than underestimation errors (Fig. 4). This finding was as expected because the high slopes within the QRS make it less likely to grossly underestimate the duration than to overestimate it. However, program G had twice as many underestimations greater than 30 ms as the next program, resulting in a false-negative rate of more than 5% for intraventricular conduction defects. Greater variance was seen between programs in overestimations, with programs C, F, and E on the outside of the distribution and D on the inside. Likewise, C, F, and, to a lesser extent, E showed the highest false-positive rates for conduction defects.

For QT intervals, as expected, the error distributions were wider than those for PR intervals and QRS durations and, as for QRS duration, varied more for overestimations than for underestimations (Fig. 5). The number of underestimations was similar for all programs, but the distribution was slightly wider for program D, which also had a notably higher false-negative rate (three times as high as the other programs) for prolonged QTc. Programs A, B, and C distinguished themselves with lower numbers of underestimations greater than 40 ms than the other programs. Programs F and C had more overestimations greater than 40 ms and, accordingly, high false-positive rates for prolonged QTc. The number of very large errors (>80 ms) was particularly high in program F at 1%, followed by A at 0.45%, with the other programs below 0.2%.

A summary of the largest measurement errors with their directions is provided in Table 3.

Discussion

The purpose of this study was to establish whether important systematic or performance differences exist between seven commercial interpretation programs from different manufacturers in measurements obtained with automated electrocardiographs. The programs assessed are all widely used in current clinical practice and for clinical research. Our analysis indicated several differences in interval measurements, not only in the average values but also in relation to clinically relevant cut-offs.

A similar approach was previously adopted by Kligfield et al. in two studies [8,9]. In the later study they processed 800 ECGs with seven programs, six of which were included also in our analysis (GE, Glasgow, Means, Mortara, Philips and Schiller). The test set in that study consisted of 200 normal ECGs, 200 ECGs with slightly prolonged drug-induced QT intervals but which were otherwise normal, and 400 ECGs from patients
Fig. 3. Tails of error distributions of PR intervals and accuracy for prolonged PR. (A) Each point on the graph represents the percentage of measurements with an (absolute) error larger than the value of the horizontal scale. Underestimates are shown on the left of each graph and overestimates are shown on the right. Shorter tails indicate fewer very large errors and narrower curves indicate fewer significant errors. Insets show the value of the 98th percentile of the absolute error in ms. (B) False-positive rates for prolonged PR detection (first-degree atrioventricular block). (C) False-negative rates.
Fig. 4. Tails of error distributions of QRS durations and accuracy for prolonged QRS. (A) Cumulative errors. Each point on the graph represents the percentage of measurements with an (absolute) error larger than the value of the horizontal scale. Underestimates are shown on the left of each graph and overestimates are shown on the right. Shorter tails indicate fewer very large errors and narrower curves indicate fewer significant errors. Insets show the value of the 98th percentile of the absolute error in ms. (B) False-positive rates for prolonged QRS detection (intraventricular conduction delay). (C) False-negative rates.
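The 95% confidence intervals shown as error bars in these figures use the Wilson score interval (Methods; ref. [15]) rather than the normal approximation, which misbehaves for the near-zero error percentages in the distribution tails. A minimal sketch, with a function name of our choosing:

```python
from math import sqrt

def wilson_ci(k, n, z=1.96):
    """Wilson score interval for a proportion k/n (z = 1.96 for 95%).
    Unlike the normal approximation, it stays within [0, 1] and
    gives sensible widths for proportions close to zero or one."""
    if n == 0:
        return 0.0, 1.0
    p = k / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return centre - half, centre + half
```

Note that even for zero observed errors (k = 0) the upper bound is positive, so a tail point at 0% still carries a visible, honest error bar.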
Fig. 5. Tails of error distributions of QT interval and accuracy for prolonged QTc. (A) Cumulative errors. Each point on the graph represents the percentage of measurements with an (absolute) error larger than the value of the horizontal scale. Underestimates are shown on the left of each graph and overestimates are shown on the right. Shorter tails indicate fewer very large errors and narrower curves indicate fewer significant errors. Insets show the value of the 98th percentile of the absolute error in ms. (B) False-positive rates for prolonged QTc detection. (C) False-negative rates.
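The false-positive and false-negative rates in Figs. 3–5 follow the Methods: QT is corrected with the linear Framingham RR correction (coefficient 0.154 [16]), and a measurement counts as a false positive only when it exceeds the limit of normality (190/110/460 ms) by more than 10 ms while the truth is below the limit, and vice versa for false negatives. A hedged sketch of that logic, with hypothetical function names:

```python
def qtc_framingham(qt_ms, rr_ms, k=0.154):
    """Linear Framingham correction, in ms: QTc = QT + k * (1000 - RR)."""
    return qt_ms + k * (1000.0 - rr_ms)

def count_misclassifications(measured_ms, truth_ms, limit_ms, guard_ms=10.0):
    """Count threshold misclassifications with a 10 ms guard band.
    False positive: measurement > limit + guard while truth < limit.
    False negative: measurement < limit - guard while truth > limit."""
    fp = sum(1 for m, t in zip(measured_ms, truth_ms)
             if m > limit_ms + guard_ms and t < limit_ms)
    fn = sum(1 for m, t in zip(measured_ms, truth_ms)
             if m < limit_ms - guard_ms and t > limit_ms)
    return fp, fn
```

The guard band keeps borderline cases, where measurement and truth sit on opposite sides of the limit by only a few milliseconds, from being counted as clinically meaningful misreads.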
used and our methodology is automatic, no manual "truth" annotation is needed. The only requirement is that enough programs process the same ECGs to accept their median results as the de-facto truth. Unfortunately, an easy way to digitally input ECGs to the major ECG analysis programs does not exist today.

Our study did not include any ECGs from patients with artificial electronic pacemakers. Recorded digital ECGs are not sampled at frequencies high enough to reproduce pacemaker spikes. The same is true for playback devices like the one used in this study. Therefore, measurement performance in the presence of pacemaker spikes could not be reported.

Conclusions

We performed a thorough and direct comparison of ECG intervals (PR interval, QRS duration, and QT interval) measured by seven principal ECG interpretation programs. Like previous studies, we found small differences between programs (bias) in the average QRS duration and QT interval but not in the PR interval. The interval measurement variability of programs was similar for most ECGs, but programs differed substantially in the numbers of outlying measurement errors, which might be clinically significant because these are probably the most abnormal ECGs. Except for two programs, all had important weak points that would affect automatic interpretation and in some cases could have led to misdiagnosis if not corrected by a human reader. Additionally, several different metrics were required to bring these weak points to light, presumably because of the different underlying causes; a simple statistical metric was clearly insufficient, and in fact the actual cumulative distributions were most insightful. We gained much insight into the characteristics of the programs, and we hope that this study will stimulate manufacturers to actively pursue improving measurement programs, especially for those errors that produce false-positive or false-negative diagnoses. With digital input to programs, our study would not be hard to repeat and could clearly demonstrate the results of those improvements.

Funding

No external funding was received for this work.

CRediT authorship contribution statement

J De Bie: Conceptualization, Methodology, Formal Analysis, Investigation, Data Curation, Writing – Original Draft, Visualization, Project Administration. I Diemberger: Methodology, Formal Analysis, Writing – Review & Editing. J W Mason: Conceptualization, Writing – Review & Editing.

Declaration of Competing Interest

J de Bie is Chief Scientist at Hillrom, owner of the Mortara VERITAS program and licensee of the MEANS program for ECG interpretation. I Diemberger and J W Mason declare no competing interests.

References

[1] Montgomery H, Hunter S, Morris S, Naunton-Morgan R, Marshall RM. Interpretation of electrocardiograms by doctors. BMJ. 1994;309:1551–2. https://doi.org/10.1136/bmj.309.6968.1551.
[2] Willems JL, Abrue Lima C, Arnaud P, et al. The diagnostic performance of computer programs for the interpretation of electrocardiograms. N Engl J Med. 1991;325:1767–73. https://doi.org/10.1056/NEJM199112193252503.
[3] Kligfield P, Gettes LS, Bailey JJ, et al. Recommendations for the standardization and interpretation of the electrocardiogram: part I: the electrocardiogram and its technology: a scientific statement from the American Heart Association Electrocardiography and Arrhythmias Committee, Council on Clinical Cardiology; the American College of Cardiology Foundation; and the Heart Rhythm Society: endorsed by the International Society for Computerized Electrocardiology. Circulation. 2007;115:1306–24. https://doi.org/10.1161/CIRCULATIONAHA.106.180200.
[4] Macfarlane PW, Mason JW, Kligfield P, et al. Debatable issues in automated ECG reporting. J Electrocardiol. 2017;50:833–40. https://doi.org/10.1016/j.jelectrocard.2017.08.027.
[5] Litell JM, Meyers HP, Smith SW. Emergency physicians should be shown all triage ECGs, even those with a computer interpretation of "Normal". J Electrocardiol. 2019;54:79–81. https://doi.org/10.1016/j.jelectrocard.2019.03.003.
[6] Bailey JJ, Itscoitz SB, Grauer LE, Hirshfeld JW, Horton MR. A method for evaluating computer programs for electrocardiographic interpretation. II. Application to version D of the PHS program and the Mayo Clinic program of 1968. Circulation. 1974;50:80–7. https://doi.org/10.1161/01.cir.50.1.80.
[7] Massel D, Dawdy JA, Melendez LJ. Strict reliance on a computer algorithm or measurable ST segment criteria may lead to errors in thrombolytic therapy eligibility. Am Heart J. 2000;140:221–6. https://doi.org/10.1067/mhj.2000.108240.
[8] Kligfield P, Badilini F, Rowlandson I, et al. Comparison of automated measurements of electrocardiographic intervals and durations by computer-based algorithms of digital electrocardiographs. Am Heart J. 2014;167:150–9. https://doi.org/10.1016/j.ahj.2013.10.004.
[9] Kligfield P, Badilini F, Denjoy I, et al. Comparison of automated interval measurements by widely used algorithms in digital electrocardiographs. Am Heart J. 2018;200:1–10. https://doi.org/10.1016/j.ahj.2018.02.014.
[10] Vancura V, Wichterle D, Ulc I, et al. The variability of automated QRS duration measurement. Europace. 2017;19:636–43. https://doi.org/10.1093/europace/euw015.
[11] Mason JW, Ramseth DJ, Chanter DO, Moon TE, Goodman DB, Mendzelevski B. Electrocardiographic reference ranges derived from 79,743 ambulatory subjects. J Electrocardiol. 2007;40:228–34. https://doi.org/10.1016/j.jelectrocard.2006.09.003.
[12] De Pooter J, El Haddad M, Stroobandt R, De Buyzere M, Timmermans F. Accuracy of computer-calculated and manual QRS duration assessments: clinical implications to select candidates for cardiac resynchronization therapy. Int J Cardiol. 2017;236:276–82. https://doi.org/10.1016/j.ijcard.2017.01.129.
[13] International Electrotechnical Commission. IEC 60601-2-51:2003 medical electrical equipment - part 2-51: particular requirements for safety, including essential performance, of recording and analysing single channel and multichannel electrocardiographs. Geneva: International Electrotechnical Commission; 2003.
[14] De Bie J, et al. Performance of seven ECG interpretation programs in identifying arrhythmia and acute cardiovascular syndrome. J Electrocardiol. 2020;58:143–9. https://doi.org/10.1016/j.jelectrocard.2019.11.043.
[15] Wilson EB. Probable inference, the law of succession, and statistical inference. J Am Stat Assoc. 1927;22:209–12. https://doi.org/10.1080/01621459.1927.10502953.
[16] Sagie A, Larson MG, Goldberg RJ, Bengtson JR, Levy D. An improved method for adjusting the QT interval for heart rate (the Framingham heart study). Am J Cardiol. 1992;70:797–801. https://doi.org/10.1016/0002-9149(92)90562-D.
[17] Kligfield P, Hancock EW, Helfenbein ED, et al. Relation of QT interval measurements to evolving automated algorithms from different manufacturers of electrocardiographs. Am J Cardiol. 2006;98:88–92. https://doi.org/10.1016/j.amjcard.2006.01.060.
[18] Diemberger I, Massaro G, Cubelli M, et al. Repolarization effects of multiple-cycle chemotherapy and predictors of QTc prolongation: a prospective female cohort study on >2000 ECGs. Eur J Clin Pharmacol. 2015;71:1001–9. https://doi.org/10.1007/s00228-015-1874-3.
[19] Boriani G, Ziacchi M, Nesti M, et al. Cardiac resynchronization therapy: how did consensus guidelines from Europe and the United States evolve in the last 15 years? Int J Cardiol. 2018;261:119–29. https://doi.org/10.1016/j.ijcard.2018.01.039.
[20] International Electrotechnical Commission. IEC 60601-2-25:2011 medical electrical equipment - part 2-25: particular requirements for the basic safety and essential performance of electrocardiographs. Geneva: International Electrotechnical Commission; 2011.
[21] De Bie J, Diemberger I. Interpretation and measurement consistency of seven ECG computer programs. J Electrocardiol. 2019;57 Suppl:S99. https://doi.org/10.1016/j.jelectrocard.2019.08.021.