
Effect of an Artificial Neural Network on Radiologists' Performance in the Differential Diagnosis of Interstitial Lung Disease Using Chest Radiographs

Kazuto Ashizawa(1,2), Heber MacMahon(1), Takayuki Ishida(1), Katsumi Nakamura(1), Carl J. Vyborny(1), Shigehiko Katsuragawa(1), Kunio Doi(1)

OBJECTIVE. We developed a new method to distinguish between various interstitial lung diseases that uses an artificial neural network. This network is based on features extracted from chest radiographs and on clinical parameters. The aim of our study was to evaluate the effect of the output from the artificial neural network on radiologists' diagnostic accuracy.

MATERIALS AND METHODS. The artificial neural network was designed to differentiate among 11 interstitial lung diseases using 10 clinical parameters and 16 radiologic findings. Thirty-three clinical cases (three cases for each lung disease) were selected. In the observer test, chest radiographs were viewed by eight radiologists (four attending physicians and four residents) with and without network output, which indicated the likelihood of each of the 11 possible diagnoses in each case. The radiologists' performance in distinguishing among the 11 interstitial lung diseases was evaluated by receiver operating characteristic (ROC) analysis with a continuous rating scale.

RESULTS. When chest radiographs were viewed in conjunction with network output, a statistically significant improvement in diagnostic accuracy was achieved (p < .0001). The average area under the ROC curve was .826 without network output and .911 with network output.

CONCLUSION. An artificial neural network can provide a useful "second opinion" to assist radiologists in the differential diagnosis of interstitial lung disease using chest radiographs.

Received September 9, 1998; accepted after revision November 16, 1998.

C. J. Vyborny, H. MacMahon, and K. Doi are shareholders of R2 Technology, Inc., Los Altos, CA.

Supported by United States Public Health Service grants CA24806 and CA62625. K. Ashizawa supported in part by a Japanese Board of Radiology grant and by the Konica Company, Tokyo, Japan.

(1) Kurt Rossmann Laboratories for Radiologic Image Research, Department of Radiology, The University of Chicago, 5841 S. Maryland Ave., Chicago, IL 60637. Address correspondence to K. Doi.

(2) Present address: Department of Radiology, Nagasaki University School of Medicine, Sakamoto 1-7-1, Nagasaki 852-8501, Japan.

AJR 1999;172:1311-1315. 0361-803X/99/1725-1311. © American Roentgen Ray Society

Differential diagnosis of interstitial lung disease is a major subject in chest radiology. Although CT has greater diagnostic accuracy in the assessment of interstitial lung disease, chest radiography remains the imaging technique of choice for initial detection and diagnosis. However, differential diagnosis of interstitial lung disease using chest radiographs has always been difficult for radiologists because of the overlapping spectrum of radiographic appearances and the complexity of clinical parameters. Thus, one must often merge many radiologic features and clinical parameters to make a correct diagnosis.

Because of an ability to process large amounts of information simultaneously, artificial neural networks may be useful in the differential diagnosis of interstitial lung disease. In fact, artificial neural networks have been shown to be a powerful tool for pattern recognition and data classification in medical imaging [1-12]. In previous studies [1, 2], we applied an artificial neural network to the differential diagnosis of interstitial lung disease and showed the network to perform well. However, we have not compared the performance of the artificial neural network with radiologists' performance without and with network output. In this study, we used receiver operating characteristic (ROC) analysis to test the effect of network output on radiologists' performance in differentiating between certain interstitial lung diseases using chest radiographs.

Materials and Methods

Artificial Neural Network Scheme

The artificial neural network scheme and its performance for the differential diagnosis of interstitial lung disease have been described in detail [2]. A single three-layer, feed-forward artificial neural network with a back-propagation algorithm was used in this study. We designed the artificial neural network to distinguish among 11 types of interstitial lung disease on the basis of a given set of 26 clinical parameters and radiologic findings. The artificial neural network consisted of 26 input units for 10 clinical parameters and 16 radiologic findings, 11 output units corresponding to the 11 types of interstitial lung disease, and 18 hidden units. The 10 clinical parameters included the patient's age, sex, duration of symptoms, severity of symptoms, temperature, immune status, underlying malignancy, history of smoking, dust exposure, and drug treatment.
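The three-layer, feed-forward architecture described above (26 input units, 18 hidden units, 11 output units, trained by back-propagation) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the sigmoid activations, squared-error loss, learning rate, and weight initialization are all assumptions the paper does not specify.

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer sizes from the paper: 26 inputs (10 clinical parameters +
# 16 radiologic findings), 18 hidden units, 11 output units
# (one per interstitial lung disease).
N_IN, N_HID, N_OUT = 26, 18, 11

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class ThreeLayerNet:
    def __init__(self, rng):
        self.w1 = rng.normal(0.0, 0.5, (N_IN, N_HID))
        self.b1 = np.zeros(N_HID)
        self.w2 = rng.normal(0.0, 0.5, (N_HID, N_OUT))
        self.b2 = np.zeros(N_OUT)

    def forward(self, x):
        self.h = sigmoid(x @ self.w1 + self.b1)      # hidden layer
        self.y = sigmoid(self.h @ self.w2 + self.b2) # outputs in [0, 1]
        return self.y

    def backprop(self, x, target, lr=0.5):
        y = self.forward(x)
        # Squared-error loss; deltas via the chain rule through the sigmoids.
        d_out = (y - target) * y * (1.0 - y)
        d_hid = (d_out @ self.w2.T) * self.h * (1.0 - self.h)
        self.w2 -= lr * np.outer(self.h, d_out)
        self.b2 -= lr * d_out
        self.w1 -= lr * np.outer(x, d_hid)
        self.b1 -= lr * d_hid
        return float(((y - target) ** 2).sum())

net = ThreeLayerNet(rng)
# One hypothetical case: 26 inputs normalized to [0, 1]; the target is a
# one-hot vector for one of the 11 diseases.
x = rng.uniform(0.0, 1.0, N_IN)
t = np.zeros(N_OUT)
t[3] = 1.0
errors = [net.backprop(x, t) for _ in range(200)]
```

The output vector plays the role of the network's "likelihood of each of the 11 possible diseases" for a case; repeated back-propagation steps drive the error down on this toy example.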



The 16 radiologic findings (Table 1) were classified into three categories, namely, distribution of infiltrates (upper, middle, and lower zones of the right and left lungs; proximal or peripheral predominance), characteristics of infiltrates (homogeneity; fineness or coarseness; nodularity; degree of septal lines, honeycombing, and loss of lung volume), and three additional thoracic abnormalities (lymphadenopathy, pleural effusions, heart size). The interstitial lung diseases selected for differential diagnosis included sarcoidosis, miliary tuberculosis, lymphangitis carcinomatosa, interstitial lung edema, silicosis, Pneumocystis carinii pneumonia, scleroderma, eosinophilic granuloma, idiopathic pulmonary fibrosis, viral pneumonia, and pulmonary drug toxicity.

Our databases for the artificial neural network included 150 actual clinical cases, 110 published cases, and 110 hypothetical cases. Diagnoses of actual clinical cases were based on a detailed clinical correlation (n = 55 [37%]) or on pathologic (n = 61 [40%]) or bacteriologic (n = 34 [23%]) proof of the pulmonary lesion. For clinical cases and published cases, subjective ratings for the 16 radiologic findings were provided independently by three experienced radiologists. Table 1 shows examples of one radiologist's ratings for two clinical cases (Fig. 1).

TABLE 1: One Radiologist's Ratings(a)

Radiologic Finding           Case 1    Case 2
Infiltrate distribution
  Right upper field             4         5
  Right middle field            6         4
  Right lower field             8         3
  Left upper field              4         5
  Left middle field             6         4
  Left lower field              8         3
  Proximal/peripheral           8         5
Infiltrate characteristics
  Homogeneous                   7         8
  Fine/coarse                   4         8
  Nodular                       3         9
  Septal lines                  2         2
  Honeycombing                  8         0
  Loss of lung volume           2         0
Lymphadenopathy                 0         7
Pleural effusion                0         0
Heart size                      1         1

Note: The ratings ranged from 0 to 10, with the exception of heart size, which ranged from 1 to 5. For proximal/peripheral, 10 = peripheral; for fine/coarse, 10 = coarse.
(a) For 16 radiologic findings in a 64-year-old man with idiopathic pulmonary fibrosis (case 1) and a 34-year-old woman with sarcoidosis (case 2).

Input data obtained from clinical parameters and subjective ratings for radiologic findings were normalized to a range from 0 to 1. We used a modified round-robin (leave-one-out) method [8] to evaluate the performance of the artificial neural network in distinguishing the actual clinical cases. With this method, although a round-robin method was applied to all databases for training, only clinical cases were used for testing. Output values ranging from 0 to 1 indicated the likelihood of each of the 11 possible diseases in each case.

The performance of the artificial neural network was evaluated using ROC analysis [13]. Binormal ROC curves for the diagnosis of each disease were estimated by use of the LABROC4 algorithm developed in our laboratories [14]. Az values representing the area under each of the 11 ROC curves were calculated. The average performance was estimated by averaging the two binormal parameters of the 11 individual ROC curves [15]. The average Az value as a measure of overall performance was .947. The performance of the artificial neural network was also assessed by comparing output values indicating the likelihood of each of the 11 diseases in each case. By the two largest outputs, both the sensitivity and the specificity of the artificial neural network for indicating the correct diagnosis were 89%.

Case Selection

In the observer test, 33 actual clinical cases (three cases per disease) were selected from a database of 150 cases [2] by two experienced radiologists who did not participate in the observer test. The 33 clinical cases included 22 men and 11 women who were 21-84 years old (mean, 51 years). In each of the cases, the disease had only one cause.

Although each case initially had three output values based on the three input values provided by three radiologists, we averaged the output values for each case and presented these averages to the observers for the observer test. ROC analysis of the average output values for the artificial neural network found an Az value of .977 for the 33 cases. By the two largest outputs, the sensitivity and specificity of the artificial neural network for indicating the correct diagnosis were 91% and 89%, respectively. This level of performance of the artificial neural network, obtained with a subset of our database, was similar to that obtained with all 150 clinical cases.

Observer Test

An ROC observer test can be either of two types: independent or sequential [16]. Ours was sequential. Each chest radiograph and its clinical parameters were shown to an observer, who rated the likelihood of each of the 11 diseases (without network output). Subsequently, the network output (Fig. 2) was presented to the same observer, who rated the likelihood a second time (with network output). The observer could either change the initial ratings or leave them unchanged.

Eight radiologists (four experienced radiologists [attending physicians] and four radiology residents) who knew nothing about the cases participated as observers. Before the test, the observers were told that only one of the 11 possible diseases was the correct diagnosis for each case; that the reading condition was based on the sequential test; that the role of the network output was to provide a "second opinion"; and that, by the two largest outputs, the sensitivity and specificity of the artificial neural network for indicating the correct diagnosis were 91% and 89%, respectively.
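The two-largest-outputs criterion used above to score the network can be sketched as follows. The per-disease scoring (counting a "positive call" for a disease whenever it appears among the two largest outputs) is our reading of the paper's criterion, and the four-disease example vectors are invented for illustration, not study data.

```python
def top2(outputs):
    """Indices of the two largest output values."""
    return sorted(range(len(outputs)), key=outputs.__getitem__, reverse=True)[:2]

def sens_spec(cases, disease):
    """Sensitivity and specificity for one disease.

    cases: list of (output_vector, true_disease_index) pairs.
    A positive call for the disease means it appears in the top two
    outputs for that case.
    """
    tp = fn = tn = fp = 0
    for outputs, truth in cases:
        called = disease in top2(outputs)
        if truth == disease:
            tp, fn = tp + called, fn + (not called)
        else:
            fp, tn = fp + called, tn + (not called)
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical network outputs for four cases over four diseases.
cases = [
    ([0.9, 0.4, 0.1, 0.0], 0),  # correct disease among the top two
    ([0.3, 0.8, 0.5, 0.1], 0),  # correct disease missed
    ([0.2, 0.9, 0.1, 0.0], 1),
    ([0.7, 0.1, 0.6, 0.2], 2),
]
```

With real data this would be evaluated once per disease over all 33 cases, giving the 91% sensitivity and 89% specificity reported above.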

Fig. 1. Examples of chest radiographs used in this study.
A, 64-year-old man with idiopathic pulmonary fibrosis (case 1 in Table 1).
B, 34-year-old woman with sarcoidosis (case 2 in Table 1).
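The modified round-robin (leave-one-out) evaluation described in the Materials and Methods can be sketched as below. Each clinical case is held out in turn for testing while all remaining cases (clinical, published, and hypothetical) are used for training; `train_fn` and `predict_fn` are hypothetical stand-ins for network training and inference, and the toy "model" in the check simply averages its training values.

```python
def modified_round_robin(clinical, augmented, train_fn, predict_fn):
    """Leave-one-out over the clinical cases only.

    The augmented cases (published and hypothetical) always remain in
    the training set and are never tested.
    """
    predictions = []
    for i, held_out in enumerate(clinical):
        training = clinical[:i] + clinical[i + 1:] + augmented
        model = train_fn(training)
        predictions.append(predict_fn(model, held_out))
    return predictions

# Toy check with numeric "cases" and an averaging "model".
preds = modified_round_robin(
    clinical=[1.0, 2.0, 3.0],
    augmented=[10.0],
    train_fn=lambda cases: sum(cases) / len(cases),
    predict_fn=lambda model, case: model,
)
```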




Before the test, four training cases were shown to the observers to familiarize them with the rating method and with use of network output as a second opinion.

The observers' confidence about the likelihood of each of the 11 possible diseases was represented using an analog continuous-rating scale with a line-checking method [16]. For the initial ratings, the observers used a graphite pencil to mark their confidence levels along a 7-cm-long line. Ratings of "definitely absent" and "definitely present" were marked above the left and the right ends, respectively, of the line. For second ratings that were different from the initial ratings, observers used a red pencil to mark their confidence levels along the same line. For data analysis, the confidence level was scored by measuring the distance from the left end of the line to the marked point and converting the measurement to a scale from 0 to 100.

Fig. 2. Graph shows one example of artificial neural network output presented to eight observers. Output values were obtained for 64-year-old man in Figure 1A. Largest output value among 11 diseases corresponds to correct diagnosis. tbc = tuberculosis, ca. = carcinomatosa, PCP = Pneumocystis carinii pneumonia, EG = eosinophilic granuloma, IPF = idiopathic pulmonary fibrosis, ANN = artificial neural network.

Data Analysis

The radiologists' diagnostic performance with and without network output was evaluated using ROC analysis [13]. We defined confidence ratings data with the correct diagnosis as "actual positives" and those with any other diseases as "actual negatives." For each observer and each reading condition (with and without network output), we used a maximum likelihood estimation to fit a binormal ROC curve to the confidence ratings data for all 11 possible diseases in the 33 cases [14]. This combining of data for all diseases was done because of the small number of cases of each disease. The Az value was then calculated for each fitted curve. The statistical significance of differences between ROC curves for each reading condition was determined by applying a two-tailed t test for paired data to the reader-specific Az values. We also analyzed the statistical significance of differences between ROC curves for attending physicians and those for radiology residents using a two-tailed t test. To represent the overall performance for each group of observers, average ROC curves were generated for the four residents, the four attending physicians, and all radiologists by averaging the two binormal parameters of their individual ROC curves [15].

Another indication of performance was the number of correctly diagnosed cases for which the observer's ranking changed because of network output. Four rankings were used (1, 2, 3, and less than 3), with 1 corresponding to a case that the observer diagnosed correctly with the highest confidence rating, 2 corresponding to a case diagnosed correctly with the second highest confidence rating, and so on. An improvement in a ranking, such as a change from 2 to 1, indicated that network output benefited diagnostic performance; the opposite indicated a detrimental effect. Using a two-tailed t test for paired data, we analyzed the statistical significance of the difference between the number of cases benefited and the number not benefited. The same test was used to analyze differences between the number of attending physicians' cases affected and the number of residents' cases affected.

Results

The overall performance by the three groups of observers is illustrated by the average ROC curves in Figures 3-5. Diagnostic performance improved when chest radiographs and clinical parameters were shown in conjunction with network output. The average Az value for all radiologists increased to a statistically significant degree, from .826 without network output to .911 with network output (p < .0001). However, for all radiologists, the average Az value with network output was still lower than for the network alone (Az = .977). For the four attending physicians, the average Az values without and with network output were .839 and .905, respectively, whereas for the four residents, the average Az values without and with network output were .812 and .916, respectively. These differences were also statistically significant (p = .0026 and p = .0074, respectively). Table 2 shows the Az values without and with network output for each radiologist.

TABLE 2: Az Values for ROC Curves of Diagnostic Accuracy With and Without ANN Output (table body not legible in this copy)

Fig. 3. Comparison of average receiver operating characteristic curves for all observers without and with artificial neural network output and receiver operating characteristic curve for artificial neural network output alone. Observer performance with network output was significantly higher than that without network output but was still lower than performance of network output alone. ANN = artificial neural network.

Fig. 4. Average receiver operating characteristic curves for attending physicians without and with artificial neural network output. Observer performance with network output was significantly improved. ANN = artificial neural network.
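The binormal ROC curves plotted in Figures 3-5 have the form TPF = Phi(a + b * Phi^-1(FPF)), where Phi is the standard normal cumulative distribution, and an average curve for a reader group is obtained by averaging the two binormal parameters across readers [15]. A minimal sketch, using illustrative (a, b) values rather than the study's fitted parameters:

```python
from statistics import NormalDist

phi = NormalDist()  # standard normal CDF and its inverse

def binormal_tpf(a, b, fpf):
    """Binormal ROC curve: TPF = Phi(a + b * Phi^-1(FPF))."""
    return phi.cdf(a + b * phi.inv_cdf(fpf))

def az(a, b):
    """Area under the binormal ROC curve: Az = Phi(a / sqrt(1 + b^2))."""
    return phi.cdf(a / (1.0 + b * b) ** 0.5)

# Average curve for a reader group: average the two binormal parameters,
# then evaluate the curve. The (a, b) pairs here are illustrative.
readers = [(1.2, 0.9), (1.5, 1.1), (1.3, 1.0)]
a_avg = sum(a for a, _ in readers) / len(readers)
b_avg = sum(b for _, b in readers) / len(readers)
avg_az = az(a_avg, b_avg)
```

The closed form for Az means the average curve's area can be read directly from the averaged parameters, which is how single summary Az values such as .826 and .911 arise from per-reader fits.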




The performance of all radiologists improved when network output was used.

When we compared the average performance of the four attending physicians with that of the four residents, the difference in Az values failed to reach statistical significance for each of the two reading conditions (without [p = .360] and with [p = .565] network output). Although the gain in performance with use of network output was larger for residents than for attending physicians, the difference was not statistically significant (p = .077).

Table 3 shows the number of cases affected by network output for the individual radiologists. The net effect of network output was clearly an improvement in performance. The average number of cases affected beneficially and detrimentally by network output for all radiologists was 9.8 and 1.4, respectively, and this difference was statistically significant (p = .0002). When the network ranking was 1, all cases affected by network output showed a beneficial effect for all radiologists (Fig. 6). The difference between the average number of cases affected beneficially and detrimentally by network output was also statistically significant (p = .0034 and p = .0002, respectively) for the attending-physician group (6.6 and 1.0, respectively) and the resident group (13.0 and 1.8, respectively).

In a comparison of the average number of cases benefited by network output for the four attending physicians and for the four residents, the difference was statistically significant (p = .0001), whereas the difference in the detrimental effect was not statistically significant (p = .414) (Fig. 6). Performance improved significantly more for the residents than for the attending physicians.

TABLE 3: Number of Cases (of 33 Total) for Which Ranking Was Affected by Artificial Neural Network Output (table body not legible in this copy)

Fig. 5. Average receiver operating characteristic curves for radiology residents without and with artificial neural network output. Observer performance with network output was significantly improved. ANN = artificial neural network.

Fig. 6. Graph shows number of correctly diagnosed cases for which observer's ranking (1 or 2) changed because of network output. Network output clearly improved performance of all observers. ANN = artificial neural network, white bars = ranking of 1, gray bars = ranking of 2.

Discussion

The results of our observer test indicate that network output can significantly improve radiologists' performance in the differential diagnosis of interstitial lung disease using chest radiographs. Three radiologists (one attending physician and two residents) participated in a pilot observer test before this study. Another set of 33 clinical cases, which did not overlap the cases used in this study, was selected from our database and used for the pilot test. The average performance of the three radiologists was significantly greater with network output (Az = .930) than without it (Az = .867) (p < .05). The pilot results support those of the main study.

The performance of the artificial neural network (Az = .977) was substantially better than the radiologists' performance (Az = .826), and the difference was statistically significant (p < .0001). This finding can be interpreted as follows. The differential diagnosis of interstitial lung disease using chest radiographs is generally considered to require extraction of image features and subsequent merging of extracted features and clinical parameters. However, the features used by one radiologist may differ from those used by another and may vary from case to case. The approach tends to be less than systematic and influenced by anecdotal experience. Therefore, the 16 image features used by the network would likely not be identical to those used by the radiologists but, probably, would be more comprehensive. Thus, the network's information would consistently be more complete than the radiologists'. In addition, the network would likely be better able than the radiologists to combine features. To verify this assumption rigorously, however, one would need to compare the ability of the network and the radiologists to merge the same features. In practice, performance differences between the network and the radiologists may have resulted from differences in both the extraction and the merging.
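The ranking analysis reported in the Results (the rank of the correct diagnosis among an observer's confidence ratings, before and after the network output is shown) can be sketched as below. The rating vectors are invented for illustration; ties are counted as sharing the better rank, an assumption the paper does not address.

```python
def rank_of_correct(ratings, correct_idx):
    """Rank of the correct diagnosis among the confidence ratings.

    Ratings are the 0-100 line-marking scores; rank 1 means the highest
    confidence was assigned to the correct disease, 2 the second
    highest, and so on.
    """
    return 1 + sum(1 for r in ratings if r > ratings[correct_idx])

def net_effect(before, after, correct_idx):
    """Classify the effect of network output on one correctly diagnosed case.

    A ranking that improves after the network output is shown counts as
    beneficial; one that worsens counts as detrimental.
    """
    r_before = rank_of_correct(before, correct_idx)
    r_after = rank_of_correct(after, correct_idx)
    if r_after < r_before:
        return "benefit"
    if r_after > r_before:
        return "detriment"
    return "unchanged"
```

Tallying these labels over all observers and cases yields counts like the 9.8 beneficial versus 1.4 detrimental cases per radiologist reported above.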




Studies on the detection of abnormalities such as lung nodules on chest radiographs and microcalcifications on mammograms have shown that the computer helped radiologists detect findings even if its performance was lower than that of the radiologists [16, 17]. This result was probably possible because the computer output alerted radiologists to the locations of some lesions that might be missed by radiologists. When the computer scheme indicates whether a lesion exists by using an arrow in a detection task, it is relatively easy for radiologists to decide either to agree with or to disregard the computer output. In other words, radiologists would be able to correct their mistakes if they recognize that, for example, they might have overlooked an obvious lesion. Our study, on the other hand, found that the performance of radiologists with network output was lower than the performance of the artificial neural network alone. Because the performance of the network was much higher than that of the radiologists alone, one might expect that the performance of radiologists with network output would be at least equal to that of the network. However, the radiologists could not take full advantage of network output because they were not familiar with using it. Network output indicates the likelihood of each of the 11 possible diseases, information that radiologists may find difficult to assimilate. In addition, although the radiologists knew that the network performed well, they might not have had confidence in all of its output, some of which may have disagreed with their own knowledge and experience. To gain confidence in network output and familiarity with using it, radiologists need to apply it prospectively to differential diagnosis in actual clinical situations.

The performance of one of the attending physicians without and with network output was lower than that of the other attending physicians, possibly because for almost all the cases he did not use the clinical parameters when interpreting chest radiographs. Therefore, even for experienced observers, consideration of these parameters would seem important. His low performance probably led to the comparability between the average Az values of the attending-physician group and those of the resident group. However, the improvement in his performance with network output was similar to that of the other attending physicians, so the comparison of gains for the two groups might be meaningful. In fact, the performance gain for residents, as measured by both Az value and number of cases benefited, was larger than that for attending physicians. This finding indicates that network output can improve the performance of radiologists, especially those less experienced. We believe further study is needed to determine the true difference in performance between the two groups of observers under the two reading conditions.

Independent ROC observer tests, which have been used widely, have two sessions. In the first, each observer interprets half the cases with network output and the other half without. In the second, the first half is interpreted without and the second half with network output. The general assumption is that observers do not remember the cases interpreted in the first session. This assumption is correct for most simple tests, such as the detection of lung nodules, but questionable for our study because the task required was differential diagnosis, which is more complicated than detection. Thus, radiologists would likely remember details of some cases, especially first-session cases with network output.

Unlike independent tests, a sequential test, which we used, measures in one session the effect of network output on diagnostic decisions. Concern about variations in the observer's memory is thus eliminated. The sequential test is not, however, an established method for ROC observer studies and is inherently biased because the observer is always first tested without network output. However, one study [16] indicated that these two types of observer tests reach similar conclusions. Therefore, we believe that the sequential test can be used to evaluate the effect of computer output, especially for differential diagnosis.

In conclusion, our results indicate that artificial neural network output can significantly improve the accuracy of radiologists in the differential diagnosis of interstitial lung disease using chest radiographs. We believe that network output, when used as a second opinion, can help radiologists with decision making.

Acknowledgments

We thank Hajime Nakata (University of Occupational and Environmental Health, School of Medicine, Fukuoka, Japan) and Kuniaki Hayashi (Nagasaki University School of Medicine, Nagasaki, Japan) for supplying valuable clinical cases. We also thank John J. Fennessy, Laurence Monnier, Walter Cannon, Thomas Woo, Scott Stacy, Bruce Lin, Dixson Gilbert, and Shawn Kenney for participating as observers; Charles E. Metz for useful suggestions and discussions about receiver operating characteristic analysis; Hiroyuki Yoshida and Yulei Jiang for helpful discussions; and E. Lanzl for improving the manuscript.

References

1. Asada N, Doi K, MacMahon H, et al. Potential usefulness of an artificial neural network for differential diagnosis of interstitial lung disease: pilot study. Radiology 1990;177:857-860
2. Ashizawa K, Ishida T, MacMahon H, Vyborny CJ, Katsuragawa S, Doi K. Artificial neural networks in chest radiography: application to the differential diagnosis of interstitial lung disease. Acad Radiol 1999;6:2-9
3. Ishida T, Katsuragawa S, Ashizawa K, MacMahon H, Doi K. Artificial neural networks in chest radiographs: detection and characterization of interstitial lung disease. Proc SPIE 1997;3034:931-937
4. Gross GW, Boone JM, Greco-Hunt V, Greenberg B. Neural networks in radiologic diagnosis. II. Interpretation of neonatal chest radiographs. Invest Radiol 1990;25:1017-1023
5. Lo SC, Freedman MT, Lin JS, Mun SK. Automatic lung nodule detection using profile matching and back-propagation neural network techniques. J Digit Imaging 1993;6:48-54
6. Gurney JW, Swensen SJ. Solitary pulmonary nodules: determining the likelihood of malignancy with neural network analysis. Radiology 1995;196:823-829
7. Wu Y, Doi K, Giger ML, Nishikawa RM. Computerized detection of clustered microcalcifications in digital mammograms: application of artificial neural networks. Med Phys 1992;19:555-560
8. Wu Y, Giger ML, Doi K, Vyborny CJ, Schmidt RA, Metz CE. Artificial neural networks in mammography: application to decision making in the diagnosis of breast cancer. Radiology 1993;187:81-87
9. Heitmann KR, Kauczor H, Mildenberger P, et al. Automatic detection of ground glass opacities on lung HRCT using multiple neural networks. Eur Radiol 1997;7:1463-1472
10. Henschke CI, Yankelevitz DF, Mateescu I, et al. Neural networks for the analysis of small pulmonary nodules. Clin Imaging 1997;21:390-399
11. Bocchi L, Coppini G, De Dominicis R, et al. Tissue characterization from X-ray images. Med Eng Phys 1997;19:336-342
12. Lin JS, Hasegawa A, Freedman MT, et al. Differentiation between nodules and end-on vessels using a convolution neural network architecture. J Digit Imaging 1995;8:132-141
13. Metz CE. ROC methodology in radiologic imaging. Invest Radiol 1986;21:720-733
14. Metz CE, Herman BA, Shen JH. Maximum-likelihood estimation of receiver operating characteristic (ROC) curves from continuously-distributed data. Stat Med 1998;17:1033-1053
15. Metz CE. Some practical issues of experimental design and data analysis in radiological ROC studies. Invest Radiol 1989;24:234-245
16. Kobayashi T, Xu XW, MacMahon H, Metz CE, Doi K. Effect of a computer-aided diagnosis scheme on radiologists' performance in detection of lung nodules on radiographs. Radiology 1996;199:843-848
17. Chan H-P, Doi K, Vyborny CJ, et al. Improvement in radiologists' detection of clustered microcalcifications on mammograms: the potential of computer-aided diagnosis. Invest Radiol 1990;25:1102-1110

