
Scientific and Business Applications

Automatic Abstracting and Indexing Survey and Recommendations*


H. P. EDMUNDSON AND R. E. WYLLYS, Planning Research Corp., Los Angeles, California

Abstract. In preparation for the widespread use of automatic scanners which will read documents and transmit their contents to other machines for analysis, this report presents a new concept in automatic analysis: the relative-frequency approach to measuring the significance of words, word groups, and sentences. The relative-frequency approach is discussed in detail, as is its application to problems of automatic indexing and automatic abstracting. Included in the report is a summary of automatic analysis studies published as of the date of writing. Conclusions are drawn that point toward more sophisticated mathematical and linguistic techniques for the solution of problems of automatic analysis.
* This study was conducted at the Planning Research Corporation under Contract AF 30(602)-1748 for the Intelligence Laboratory, Rome Air Development Center, Air Research and Development Command, United States Air Force, during a period ending in October 1959.

226 Communications of the ACM

1. Introduction

Traditionally the subject content of documents has been established and exhibited in two ways. A trained librarian scans the document, classifies it, and then appends to the recorded title an arbitrary number of subject headings which are intended to single out the pertinent topics of discourse; and an expert in the field with which the document is concerned reads it carefully and then appends to the recorded title an abstract, the purpose of which is to concentrate the essential meaning of the document into a few sentences or a short paragraph.

There is a growing consensus that these traditional methods of establishing and exhibiting the content of documents are no longer adequate to our needs. The flood of documents threatens to overwhelm librarians and makers of abstracts. Moreover, many users of documents feel that the librarians' subject headings and the experts' abstracts are too subjective: the presuppositions of the librarians and the personal bias of the experts tend to emphasize some items of information and to suppress others.

Automation will make it possible to cope with the mass of documents pouring into every library and will assure objective records of the essential content of documents. Once an automated system is in operation, trained indexers and abstracters can be relieved of the slow and intellectually wasteful task of analysis and can instead devote their talents and skills to the evaluation and improvement of the system output.

For some time it has been understood that a frequency count of the significant words of a document (mostly nouns, adjectives, and verbs) can serve to isolate the special vocabulary used to convey information in any
particular realm of discourse [4]. Recently, more or less concurrently with the research here reported upon, diverse studies have been undertaken which apply this principle to the automatic derivation of indexes for any given article. These automatic indexes are intended to supplant the traditional subject headings. Moreover, efforts are being made, by an extension of the principle of word frequency, to use word correlation (the concurrence of words of high frequency within sentences) to isolate the most significant sentences in any given document, sentences which constitute automatic abstracts in the language of the author of the document [1, 2, 3].

All such experiments depend upon the use of automatic scanners which will read all document type fonts and transmit the content word by word to a machine. It is intended that the machine will then sort the words in order of frequency to produce frequency-ordered automatic indexes, and that a second scanning will select the key sentences to be printed out as automatic abstracts. Such key sentences are called topic sentences in the context of this report.

A number of linguistic-mathematical approaches have appeared potentially applicable to the automatic indexing and abstracting problem. In the study being reported on here, we have considered from the theoretical standpoint several of these approaches. Section 2 presents studies of automatic indexing and automatic abstracting, and proposes mathematical and linguistic solutions for problems hitherto unresolved. Section 3 is a detailed comparison of the principal techniques that have been proposed for the automatic analysis of documents. The conclusion makes specific recommendations for advanced research in automatic analysis and outlines step by step what course this research should take.

2. Relative-Frequency Technique

This study has indicated that the problem of the automation of indexing and abstracting can be broken down into five areas:

1. The investigation of more exact measures for the significance value of single words.
2. The investigation of more exact measures for the significance value of groups of words (both word pairs and clusters of words syntactically connected).
3. The investigation of more exact measures for the significance value of sentences.
4. The construction of automatic abstracts made up of the sentences with maximal significance value.
5. The employment of significant words and groups of words in automatic indexes for classification, cross-referencing, and retrieval.

The following discussions present the initial steps toward a firm statistical foundation for future work in the automatic analysis of linguistic data along the lines of the quinquepartite division above.

2.1. WORD SIGNIFICANCE. All proposed methods for making an automatic abstract of a document involve using the author's own words by selecting complete sentences, thereby reducing abstraction to the simpler task of extraction. Criteria for the selection of these topic sentences may be purely positional (involving only the position of the sentence, e.g., the first sentence of each paragraph), or semantic, making use of the meaning of words such as "summary" or "conclusions". They may involve rating the sentences on the basis of significant words or word groups in each sentence. The criteria for attributing significance to words, in turn, may again be positional (in virtue of their occurrence in titles or section headings), or semantic (in virtue of their relation to words like "summary"), or perhaps even pragmatic (in the case of names of specialists mentioned in text, footnotes, or bibliography). And they may be statistical (involving frequency of occurrence) or non-statistical (involving only the fact of occurrence, as in a title).

2.2. WORD FREQUENCY. Currently proposed statistical methods for machine measurement of a word's significance depend upon the frequency of occurrence of the word within the document being analyzed. They operate on the reasonable assumption that an author will use terms bearing on his special topic with greater frequency than those which do not.
To take frequency within the document as the only criterion, however, tends to include often-used general terms like "importance" or "connection" or logical connectives such as "and", "all", or "if", which give no clue to specific content. P. B. Baxendale reports that an IBM 650 has been programmed to delete a pregiven list of about 150 common words of low indicative value from the list of most frequent words in a document. While fairly good results have been obtained in a small number of test cases, frequency within the document seems too crude a test to rest content with. A highly specialized term, exceedingly rare in general or even professional usage, might be highly significant in a document in which it did not occur often enough to rank high in the frequency count. It is worth noting that in Oswald's experiment [3], described in Section 3, only one term in a word group deemed significant is, ordinarily, of unusually high frequency; the others are of much lower frequency individually, and derive their significance from the fact of joint occurrence with high-frequency terms. The measure of significance to be proposed here meets the problem of frequency in an intrinsic way.
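The within-document frequency count with deletion of common words, as described above, might be sketched as follows. This is only an illustration: the stop list `COMMON_WORDS` is a hypothetical stand-in (the text mentions a list of about 150 words but does not reproduce it), and the function name and tokenization rule are assumptions of this sketch.

```python
import re
from collections import Counter

# Hypothetical stop list; the list of about 150 common words is not given in the text.
COMMON_WORDS = {"a", "an", "and", "all", "if", "in", "is", "of", "the", "to"}

def frequency_index(text, common_words=COMMON_WORDS, top_n=5):
    """Count words in `text`, skip the common words, and return the top_n most frequent."""
    words = re.findall(r"[a-z]+", text.lower())  # crude tokenization (assumption)
    counts = Counter(w for w in words if w not in common_words)
    return counts.most_common(top_n)

if __name__ == "__main__":
    sample = "The rocket and the launching velocity of the rocket"
    print(frequency_index(sample, top_n=2))
```

A count of this kind still ranks a frequent general term such as "importance" highly unless it happens to be on the list, which is the weakness the relative-frequency measure is meant to address.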

2.3. REFERENCE FREQUENCY. Very general considerations from information theory suggest that a word's information should vary inversely with its frequency rather than directly, its lower probability evidencing greater selectivity, or deliberation, in its use. It is the rare, special, or technical word that will indicate most strongly the subject of an author's discussion. Here, however, it is clear that by "rare" we must mean rare in general usage, not rare within the document itself. In fact it would seem natural to regard the contrast between the word's relative frequency f within the document and its relative frequency r in general use (scil., 0 < f < 1 and 0 < r < 1) as a more revealing indication of the word's value in indicating the subject matter of a document. Such a contrast can be represented by the ratio f/r between the two frequencies, or the difference f - r, or by various other functions of f and r. The following section presents a comparative study of the merits of several such significance functions s(f, r) and draws the conclusion that either f - r or f/r seems good.¹ Experimentation would determine which is better.

2.4. COMPARISON OF MEASURES. Let $N_{wd}$ be the number of occurrences of word w in document d, and let $N_d$ be the total number of running words in d, i.e.,

$$N_d = \sum_{w \in d} N_{wd}.$$

Let c be a class of documents, with $N_{wc}$ and $N_c$ defined analogously for c, and let

$$f_{wd} = \frac{N_{wd}}{N_d}, \qquad r_{wc} = \frac{N_{wc}}{N_c}.$$

Four elementary functions of $f_{wd}$ and $r_{wc}$ that express the significance $s_{wd}$ of a word w for document d in terms of the contrast of the frequency of w in d to the frequency of w in the reference body of documents c are (dropping subscripts where no confusion can occur):

$$s_1 = f - r, \qquad s_2 = \frac{f}{r}, \qquad s_3 = \frac{f}{f + r}, \qquad s_4 = \log\frac{f}{r}.$$
For fixed r, $s_1$ and $s_2$ increase linearly with frequency of occurrence within the document, while $s_3$ and $s_4$ increase less steeply with increasing f. Hence, $s_1$ and $s_2$ afford the best discrimination for high f (the range of interest). $s_1$ is the most easily computed of the functions and it is finite. It increases less steeply than $s_2$, however, which might make it less desirable for small r. $s_2$ has the most natural and suggestive interpretation, since it answers the question, "How many times as frequently does the word occur in the document as it does in normal usage?", but it offers the threat of possible overflow.

¹ The authors wish to thank Herbert Bohnert, who contributed to the comparative study.

FIG. 1. Comparison of word-significance measures

| Possible forms for significance | Min (small f) | Max (small r) | Breakeven point (f = r) | $\partial s/\partial f$ | $\partial s/\partial r$ | Shape of s vs. f curve for constant r |
|---|---|---|---|---|---|---|
| 1) $s_1 = f - r$ | $-1$ | $1$ | $0$ | $1$ | $-1$ | Straight line through (r, 0), slope 1 |
| 2) $s_2 = f/r$ | $0$ | $\infty$ | $1$ | $1/r$ | $-f/r^2$ | Straight line through (0, 0), slope $1/r$ |
| 3) $s_3 = f/(f+r)$ | $0$ | $1$ | $1/2$ | $r/(f+r)^2$ | $-f/(f+r)^2$ | Curve (sketched for two values $r_1$, $r_2$) |
| 4) $s_4 = \log(f/r)$ | $-\infty$ | $\infty$ | $0$ | $1/f$ | $-1/r$ | Curve (sketched for two values $r_1$, $r_2$) |

Remarks:

$s_1$: 1. Almost as sharp as $s_2$. 2. Intuitive breakeven point. 3. Convenient max and min. 4. Particularly easy for machine computation and discrimination; all negatively valued words can be immediately disregarded.

$s_2$: 1. Sharpest discrimination. 2. Easy interpretation ("How many times as frequently does the word occur?"). 3. Easy to compute.

$s_3$: 1. Less discrimination. 2. Intuitive breakeven point. 3. Convenient max and min.

$s_4$: 1. Better discrimination than $s_3$, since $\partial s_4/\partial f > \partial s_3/\partial f$ for $r > 0$. 2. Inconvenient min, max. 3. Time-consuming computation.

$s_3$ has convenient upper and lower bounds, which may be important for purposes of machine handling in that no machine overflow can occur for cases where an extremely rare word is used very frequently. $s_4$ requires a time-consuming computation which is not compensated for by other features. These and other features of the functions are summarized in the accompanying table, Figure 1.

From the foregoing considerations, it seems that either the difference f - r or the ratio f/r is the best choice. Whichever of these simultaneously monotonic functions is chosen, it is clear that defining significance in terms of the contrast between frequency in a document and in general usage would give low significance both to normally rare words which occur rarely in the document and to common words used frequently within the document, while giving high significance to normally rare words used frequently. It would distinguish sharply between common and rare words whose frequencies in the document are equal and therefore indistinguishable by Luhn's method (see Section 3), and it would eliminate most of the general and syntactic words now deleted by list.
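The four measures compared above can be tried out with a few lines of code. The sketch below is illustrative only: the function names and the frequency values are invented for the example, and treating a word absent from the reference table as having r = 0 in the ranking is an assumption of this sketch, workable for $s_1$ but not for $s_2$ or $s_4$.

```python
import math

def significance(f, r):
    """Return the four measures for relative frequencies 0 < f < 1 and 0 < r < 1."""
    return {
        "s1": f - r,              # difference
        "s2": f / r,              # ratio
        "s3": f / (f + r),        # bounded between 0 and 1
        "s4": math.log(f / r),    # logarithm of the ratio
    }

def rank_by_difference(doc_freqs, ref_freqs):
    """Rank words by s1 = f - r, highest first.

    Words missing from ref_freqs are given r = 0.0 (an assumption of this sketch).
    """
    scores = {w: f - ref_freqs.get(w, 0.0) for w, f in doc_freqs.items()}
    return sorted(scores, key=scores.get, reverse=True)

if __name__ == "__main__":
    # Invented frequencies: a common word, a rare technical word, a connective.
    doc = {"the": 0.06, "reactor": 0.02, "of": 0.03}
    ref = {"the": 0.06, "reactor": 0.0001, "of": 0.035}
    print(rank_by_difference(doc, ref))
    print(significance(0.02, 0.0001))
```

With these invented values the technical word ranks first even though "the" and "of" are more frequent within the document, and the two connectives receive a zero and a negative $s_1$, respectively.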
2.5. SPECIAL FIELD FREQUENCIES. A further refinement of the process of automatic analysis would be the development of special sets of reference frequencies for special fields of interest. This would have two benefits: it would become possible to classify documents as to field, and it would become possible to note the significance of words which are frequent in the document and frequent in a very large reference class $c_0$ of literature (i.e., these words would not be significant with respect to $c_0$) but which are rare in the special field. For example, the word "emotion" might be too common in general usage to seem significant, but frequent occurrence of the word would stand out in a paper on electronic circuitry (e.g., of a robot) when compared with its frequency in general electrical engineering literature.

To see how this method would operate, let us imagine that the relative frequencies of m words have been established, both for a large reference class $c_0$ of literature and also for n special fields of interest $c_j$, j = 1, 2, ..., n. Thus, we have n + 1 values of relative frequency for each word w, where w runs from 1 to m:

$r_{w0}$ = relative frequency of word w with respect to the class $c_0$ of literature
$r_{wj}$ = relative frequency of word w with respect to special field $c_j$

Then we may form the m × (n + 1) matrix $(r_{wj})$, each column of which contains the frequencies of all the listed words for a particular field (the whole body of literature being represented in the first column) and each row of which contains the frequencies of a particular word in all the listed fields.

2.6. AUTOMATIC INDEXING. The automatic indexing step of the automatic analysis of a document d would thus consist in: first, the determination of the words that are significant with respect to general literature by the comparison of the relative frequencies $f_{wd}$ of words in d with the relative frequencies in the first column of the matrix $(r_{wj})$; second, the comparison of the document's frequencies with the other columns in the matrix in order to determine which column forms the "best fit" with the document; and third, the determination of the words that are significant with respect to the special field. One standard method of determining "best fit" would be to find the column j whose frequencies differ least from those of the document, i.e., the column which minimizes $\sum_{w=1}^{m} (f_{wd} - r_{wj})^2$, or, equivalently (when the columns have equal sums of squares), which maximizes $\sum_{w=1}^{m} f_{wd} r_{wj}$.

The second step could be speeded up through the use of subfields: e.g., the first sort could determine that the document was scientific; the second sort could categorize with respect to the basic sciences (chemistry, geology, physics, etc.); when it had been determined that the document belonged in, say, physics, the third sort could decide that the subject was nuclear physics rather than fluid mechanics or optics. Once frequency-ordered indexes have been established for various subject-fields, the automatic index of any new document can be compared with them by machine processes.
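The "best fit" comparison against the columns of the reference matrix can be sketched as below. The field names and frequencies are invented for illustration, the dictionary layout stands in for the matrix $(r_{wj})$, and the sketch applies the least-squares criterion directly rather than the sum-of-products form.

```python
def squared_difference(doc_freqs, ref_freqs):
    """Sum over all listed words of (f_wd - r_wj)**2; a missing word counts as frequency 0."""
    vocabulary = set(doc_freqs) | set(ref_freqs)
    return sum((doc_freqs.get(w, 0.0) - ref_freqs.get(w, 0.0)) ** 2 for w in vocabulary)

def best_fit_field(doc_freqs, field_freqs):
    """Return the name of the field whose column differs least from the document."""
    return min(field_freqs, key=lambda name: squared_difference(doc_freqs, field_freqs[name]))

if __name__ == "__main__":
    # Invented relative frequencies for two special fields (columns of the matrix).
    fields = {
        "nuclear physics": {"neutron": 0.015, "lens": 0.0005},
        "optics": {"neutron": 0.0002, "lens": 0.02},
    }
    document = {"neutron": 0.02, "lens": 0.001}
    print(best_fit_field(document, fields))
```

The hierarchical speed-up through subfields would amount to calling `best_fit_field` repeatedly, each time on the sub-columns of the field chosen at the previous level.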
The results of the comparison will determine, first, to which subject-field the document properly belongs (classification); second, with what other subject-fields it should be associated (cross-reference); and finally, what terms are significant enough to be used as identification tags for the process of recovering the document (retrieval).

2.7. WEIGHTED FREQUENCY. Extensions of the relative-frequency approach, involving syntactic and semantic approaches, would be the following. A cataloguer or abstract-writer would naturally give more weight to a technical word that appears in a title, in a first paragraph, or in a summary. A machine can be programmed to do the same. It can be instructed to recognize the title by position and capitalization and to place a title indication after each word appearing in the title as it compiles its list. Similarly, it can place first-paragraph indications after all words it meets until it recognizes the end of the first paragraph. It can test every heading or subtitle for the words "summary" or "conclusions" and place a summary indication after each word in the summary paragraphs. At the conclusion of its "reading" of the article, it can then compute each word's weighted significance S according to the formula:

$$S = b_1 b_2 b_3 \, s(f, r),$$

where for a given word w,

$$b_1 = \begin{cases} b_t & \text{if } w \text{ bears a title indication} \\ 1 & \text{otherwise} \end{cases}$$

$$b_2 = \begin{cases} b_p & \text{if } w \text{ bears a first-paragraph indication} \\ 1 & \text{otherwise} \end{cases}$$

$$b_3 = \begin{cases} b_s & \text{if } w \text{ bears a summary indication} \\ 1 & \text{otherwise} \end{cases}$$

and where $b_t$, $b_p$, and $b_s$ are preassigned weights, all greater than one, for occurrence in title, first paragraph, and summary, respectively.

Alternatively, statistical methods of this type might be used as a preliminary sorting for later application of nonstatistical criteria. For example, when a word already known to be somewhat significant by statistical methods also occurs in the title, its significance might be taken as guaranteed, and the machine program could recognize the fact by placing it on the definitely significant list, even though the word was outranked in significance by other words.

Actual values of significance will, of course, depend upon the choice of the reference class of documents c. However, these values can be expected to converge to their limits as c grows large, given reasonable randomness of input. After an initial establishment of reference values, the values could be continuously corrected at the document-analysis center (assuming it was not specializing in a small range of subjects).

2.8. SELECTION CRITERIA. The final selection of words to be used in the abstracting process would be based on three criteria:

1. Significance of the word with respect to general literature.
2. Significance of the word with respect to a specialized field.
3. Placement of the word on a "definitely significant" list.

It would be possible to select, under criteria 1 and 2, either (a) all words whose significance value exceeded a predetermined threshold value s, or (b) only the first n words in order of significance from the highest down, adding, in either case, those words selected by criterion 3. Experience would show which method, (a) or (b), would be more satisfactory.

2.9. WORD-GROUP AND SENTENCE SIGNIFICANCE. In extending the measure of significance from words to sentences, we find that the problem is to weight the significance factor of the sentence to allow for its length and the density of significant words.
A combination of Luhn's method of "word clusters" and Oswald's technique (see Section 3) should provide an ideal solution. Suppose we choose a positive function E of the statistic s(f, r) such that E(s) > 1 for large s, E(s) ≈ 1 for medium s, and E(s) < 1 for small s. Then, given two significant words having function values $E_1$ and $E_2$, we define their combined significance factor C as:

$C = E_1 E_2$ if the words are juxtaposed;
$C = \tfrac{4}{5} E_1 E_2$ if 1 non-significant word intervenes;
$C = \tfrac{3}{5} E_1 E_2$ if 2 non-significant words intervene;
$C = \tfrac{2}{5} E_1 E_2$ if 3 non-significant words intervene;
$C = \tfrac{1}{5} E_1 E_2$ if 4 non-significant words intervene;
$C = 0$ if 5 or more non-significant words intervene.

These weights are chosen to agree with Luhn's use of only those significant words that are not more than four words away from the nearest other significant words. Luhn does not state how he settled on the number four as the critical distance, so we may feel free to investigate the effects of using other values for the critical distance. To put the weighting scheme above into a general form, let us say that if h or more non-significant words intervene between a particular significant word and the nearest other significant word, we shall consider that particular significant word to be isolated (i.e., Luhn uses h = 5). Then the weighting scheme may be expressed as:

$$C = \begin{cases} \dfrac{h - t}{h} E_1 E_2 & \text{when } t \text{ non-significant words intervene, } t = 0, 1, 2, \ldots, h - 1; \\ 0 & \text{otherwise.} \end{cases}$$

This scheme provides weights that decrease linearly with the distance between significant words. Linear weighting is only one of the possible schemes; another that seems plausible is: $C = \frac{1}{\ldots} E_1 E_2$ when t non-significant words intervene.

It is highly desirable that experiments be performed with various weighting schemes and critical distances to try to ascertain optima. Since the E-function would be extended as a further product, such as $E_1 E_2 E_3$ when three high-frequency words were adjacent, it would make a distinction among word groups of different lengths. The same goal could be attained by making the E-function a set of modified log weights, to be added in a fashion similar to the products above. The values $C_i$ for an entire sentence could be summed (or multiplied) and then divided by some function of the sentence length for normalization purposes. Experiments could be performed to determine the expected value and variance for a sentence. It would then not be necessary to reject meaningful word groups just because two significant words were not immediately adjacent. On the other hand, juxtapositions could be given more importance (at least for suitable C) than Luhn does. Arbitrary requirements for a significant sentence, such as a given number of significant word pairs, could be avoided and the threshold could be set in finer steps, with less discrimination being made between writers preferring short sentences and those preferring long ones. Finally, one possible difficulty in Oswald's technique could be avoided: namely, that one sentence with, say, three significant word pairs among 60 words is not necessarily better than two sentences each with three such pairs among 30 words. In fact, the latter two might well be more meaningful because of their higher density of significant pairs. Further research will be required to solve these problems.

3. Survey and Comparison of Various Automatic Indexing and Automatic Abstracting Methods

This section summarizes and compares the methods proposed for automatic analysis by H. P. Luhn, P. B. Baxendale, and V. A. Oswald, Jr.

3.1. LUHN'S STUDY. Luhn holds that the significance of a sentence is to be taken as a function of (a) the significance of the words in the sentence, and (b) the relative positions of the words within the sentence. The significance of a word is to be a function of its frequency within the document, in the following way: (a) common words such as pronouns, prepositions, and articles are not to be considered significant; (b) the least frequent words are not to be considered significant; (c) words lying in the frequency range above least frequent are to be considered significant. In regard to the relative positions of words, Luhn argues that "the closer certain words are associated, the more specifically an aspect of the subject is being treated. Therefore, wherever the greatest number of frequently occurring different words are found in greatest physical proximity to each other, the probability is very high that the information being conveyed is most representative of the article ... the criterion is the relationship of the significant words to each other rather than their distribution over a whole sentence." Accordingly, Luhn computes the significance factor of a sentence in the following manner:

a. Word frequency count.
(1) Common words (a list of pronouns, prepositions, and articles) are not counted.
(2) Words that begin with the same letters are consolidated (i.e., treated as the same word) if they have fewer than 7 dissimilarities after their identical beginnings; otherwise such words are considered distinct.

b. Calculation of significance factor.
(1) Words whose frequency is below a predetermined value V are not considered significant. Words whose frequency exceeds V are designated significant.
(2) Significant words that are separated from other significant words by more than 4 non-significant words are considered isolated and are disregarded.
(3) The longest clusters of non-isolated significant words (i.e., clusters having not more than 4 non-significant words between any pair of significant words) are established within each sentence, each cluster being started and ended by a significant word.
(4) The significance factor $S_i$ is then calculated by squaring the number $p_i$ of significant words in cluster i and dividing this square by the total number $q_i$ of words in cluster i: $S_i = p_i^2 / q_i$.
(5) If a sentence has only one such cluster, the significance factor of the sentence is that of the cluster. If a sentence has more than one such cluster, the significance factor of the sentence is taken to be that of the highest-valued cluster in the sentence: $\max_i S_i$.

Sentences with significance factors above a predetermined value S (or, for other purposes, the n sentences with highest significance factors) are printed out in the order of their appearance in the document.

3.2. BAXENDALE'S STUDY. Baxendale's work is concerned not specifically with the problem of automatic abstracting, but rather with methods of constructing automatic indexes for documents; these methods, however, involve selecting the significant words and phrases of the documents and hence are applicable to the abstracting problem. What Baxendale has done is to select certain words or phrases from the document and order them according to their post-selection frequencies. The index is then constructed from the n words or phrases with the highest frequencies, n being chosen so as to yield an index with the desired number of entries. (Baxendale says n should equal 0.5 per cent, but lets it equal 1.0 per cent, of the number of words in the document.) She has experimented with three methods of selection:

a. Deletion of "all parts of speech whose grammatical functions are of a connective or reiterative type ... pronouns, articles, conjunctions, conjunctive adverbs, copula and auxiliary verbs, as well as quantitative adjectives." The words selected by this method are, of course, all the other words in the document.

b. Selection of only the first and last sentences in each paragraph in the document. [Baxendale reports that in a sample of 200 paragraphs the topic sentence was the initial sentence in 85 percent of the paragraphs and the final sentence in 7 percent of the paragraphs.]

c. Selection of the first four words following each preposition, unless a second preposition or a punctuation mark is encountered. [Baxendale comments that a prepositional phrase is a "unit of expression ... with a more flexible function than any other syntactical unit ... [so that] it is logical to speculate that the phrase is likely to reflect the content of an article more closely than any other simple construction .... But more significant is the qualitative advantage of the phrase unit in coordinating the terms of the index with each other ... [for example, compare] the questionable words levels and forbidden ... [with the phrases] direct energy levels and forbidden energy region." It may be observed here that this advantage is of greater importance in analytic languages like English and Chinese than in languages like Russian, in which inflections often serve the purpose of prepositional phrases, or German, in which groups of basic words are formed into single compound words rather than into phrases. It may further be observed that Baxendale's exhibited indexes are made up of single words rather than of word groups, in spite of the strong case she makes for using groups.]
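Selection methods b and c above lend themselves to a short sketch. Everything named here is an assumption made for illustration: the preposition list `PREPOSITIONS` is a small hypothetical sample (Baxendale's own list is not given in the text), sentences are split naively on end punctuation, and any non-alphabetic token is treated as a punctuation mark.

```python
import re

# Hypothetical sample of prepositions; the actual list used is not given in the text.
PREPOSITIONS = {"of", "in", "on", "for", "with", "by", "to", "from", "at"}

def first_and_last_sentences(paragraph):
    """Method b: keep only the first and last sentences of a paragraph."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]
    if len(sentences) <= 2:
        return sentences
    return [sentences[0], sentences[-1]]

def prepositional_phrases(sentence, max_words=4):
    """Method c: up to max_words words after each preposition, cut short by a
    second preposition or by a punctuation mark."""
    tokens = re.findall(r"[A-Za-z]+|[^\sA-Za-z]", sentence)
    phrases = []
    i = 0
    while i < len(tokens):
        if tokens[i].lower() in PREPOSITIONS:
            phrase = []
            j = i + 1
            while j < len(tokens) and len(phrase) < max_words:
                token = tokens[j]
                if not token.isalpha() or token.lower() in PREPOSITIONS:
                    break  # punctuation or a second preposition ends the phrase
                phrase.append(token)
                j += 1
            if phrase:
                phrases.append(" ".join(phrase))
            i = j  # resume at the token that ended the phrase
        else:
            i += 1
    return phrases

if __name__ == "__main__":
    print(prepositional_phrases("Electrons move in direct energy levels of the forbidden energy region."))
```

The phrases returned still contain connective words such as "the"; removing them would be a separate step before the frequency count.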

Presumably, although Baxendale does not say so explicitly, methods b and c also entail, as their second step, deletion of the connective and reiterative words in the selected sentences and phrases before the frequency count is made. The sets of index words or phrases obtained by these three different selection methods are surprisingly similar: the same words, with approximately the same relative frequencies, tend to be chosen by each method. This suggests that the choice among the methods could be made on nonlinguistic grounds; and here the fact that methods b and c leave a much smaller text to be frequency-counted than does method a ought, perhaps, to be decisive (methods b and c leave about 10 per cent of the initial text, and method a about 50 per cent).

3.3. THE OSWALD EXPERIMENT. Oswald's experiment in automatic abstracting differs from Luhn's and Baxendale's techniques in that it combines the notion of significance as a function of word frequency and the notion of significance as a function of word groupings, by employing juxtapositions of significant words as the basic unit for measuring the importance of a sentence. The postulates of Oswald's technique are:

a. Automatically generated indexes should include not only single words but also groups of words.
b. The choice of sentences for the automatic abstract should be governed by the number of significant groups in the sentence.

To accomplish these purposes, the technique uses the following means:
a. Words whose function is essentially syntactic (articles, prepositions, conjunctions, etc.) along with qualifiers of little semantic importance ("good", "very", etc.) are discarded, and only words that are significant in the context of the document are kept.

b. The retained words are frequency-counted.

c. Next, every juxtaposition (of 2 or more words) involving a high-frequency word is recorded as a significant word group. The recording of such groups begins with those that contain the single word of highest frequency and continues until 6 successive words, in order of descending frequency, produce either no significant groups or no new significant groups. (For example, in one article examined by Oswald the words "launching" and "velocity" were of high frequency. The juxtaposition "launching velocity" was itself of high enough frequency to be recorded as a significant word group; and also the distinct juxtaposition "effective launching velocity" occurred sufficiently often to be recorded as significant.)

d. Sentences containing 2 or more significant word groups are then listed as significant sentences.

e. The automatic abstract is formed by the choice, beginning with the sentence having the largest number of significant word groups, of enough significant sentences to meet one of two criteria: either (1) a fixed number of sentences, or (2) enough sentences so that their total length approximates a given percentage of the number of words in the article. Sentences having the same number of significant word groups are chosen according to their total number of words, longest first. Chosen sentences are printed out in the order of their occurrence in the document.

3.4. COMPARISON OF THE TECHNIQUES. Baxendale's work is concerned solely with the automatic construction of indexes; she does not extend her treatment of word significance into the area of automatic abstracting. The essential part of her work appears to have been the comparison of three methods of choosing words to be frequency-counted. Although she discusses the advantages of using word groups in the construction of indexes, she does not use them in the indexes she exhibits.

Luhn's technique and Oswald's experiment both employ the concept that the simple frequency of a word in a document is the measure of its significance. The difference between the two techniques lies in the ways they are extended to develop measures of significance for word groups and thence to measures of significance for sentences. Oswald's experiment, limited to hand processing, considered juxtapositions involving a high-frequency word; Luhn, with computers at his disposal, permitted up to 4 nonsignificant words to intervene between significant ones and thus dealt with clusters rather than juxtapositions of words. From a broader, mathematical point of view these two ad hoc rules appear about equally arbitrary.

In the extension of significance measures from word groups to sentences both Luhn and Oswald again use ad hoc methods. Luhn takes the highest significance factor of any one cluster in a sentence to be the significance factor for the sentence. An obvious difficulty with this procedure is that a long and important sentence with several clusters, each of which had a significance factor that was only average with respect to the document, could be outranked by a less important sentence that happened to have a single, but high-valued, cluster.
Thus, in the way Luhn handles his clusters he does part of the job of coping with the density of significant words, but he does not go on to cope with the density of the clusters in the sentence. The Oswald technique assigns significance measures to sentences according to the number of significant word groups they contain; if two sentences have the same number of such groups, the longer sentence is chosen. The selection by number of such groups is equivalent to summing the significance factors of word groups rather than using only the highest-valued word-group significance factors, but the preference given longer over shorter sentences brings to the fore the sentences of lower density. What is needed is a program of systematic experimentation rather than experiments using ad hoc rules.
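The contrast between the two sentence-ranking rules can be made concrete with a small Python sketch. It is illustrative only: the `Sentence` record, the function names, and the numerical significance factors are hypothetical, and the way each cluster or word group obtains its factor is deliberately left out. Only the two rules described above are modeled: Luhn's use of the single highest cluster factor, and Oswald's count of significant word groups with ties going to the longer sentence.

```python
from dataclasses import dataclass

@dataclass
class Sentence:
    position: int        # order of occurrence in the document
    n_words: int         # total number of words in the sentence
    group_factors: list  # hypothetical significance factor of each cluster or word group

def luhn_score(s):
    """Luhn's rule: the highest factor of any one cluster in the sentence."""
    return max(s.group_factors, default=0.0)

def oswald_key(s):
    """Oswald's rule: number of significant word groups, ties to the longer sentence."""
    return (len(s.group_factors), s.n_words)

def rank(sentences, key):
    """Sentence positions from most to least significant under the given rule."""
    return [s.position for s in sorted(sentences, key=key, reverse=True)]

def oswald_abstract(sentences, n):
    """Keep sentences with 2 or more groups, take the best n, print in document order."""
    significant = [s for s in sentences if len(s.group_factors) >= 2]
    best = sorted(significant, key=oswald_key, reverse=True)[:n]
    return sorted(s.position for s in best)

# Hypothetical document of three sentences.
doc = [
    Sentence(0, 40, [2.0, 2.0, 2.0]),  # long sentence, several average clusters
    Sentence(1, 12, [3.5]),            # short sentence, one high-valued cluster
    Sentence(2, 25, [1.0, 1.5, 1.2]),  # as many groups as sentence 0, but shorter
]

print(rank(doc, luhn_score))    # sentence 1 outranks sentence 0 under Luhn's rule
print(rank(doc, oswald_key))    # sentence 0 comes first under Oswald's rule
print(oswald_abstract(doc, 2))
```

Under these assumed factors the short sentence with one high-valued cluster leads Luhn's ranking, whereas Oswald's rule puts the long sentence first and, among sentences with equal group counts, prefers the longer one.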

The basic, still-to-be-resolved question is: How satisfactory can automatic indexes and abstracts be made, whatever be the method by which they are produced?² Some remarks of Oswald's [3, pp. 17-18] deserve to be quoted:

"The experiment has clearly revealed some basic limitations and disadvantages inherent in any technique for the isolation of topic sentences which is based solely upon the frequencies of the words in a single document. . . . [These] inherent deficiencies can be summarized as those of:

"1. Style (syntactics). Even though the topic sentences in an automatic abstract appear there in their proper order, the natural flow of ideas from one sentence to the next in the original document is perforce interrupted; and no bridging passages, such as a human abstract-writer would provide, fill the gaps.

"2. Meaning (semantics). The essence of the original document, when contained in the topic sentences, is presented in the automatic abstract, but missing are the discourse and the reasoning leading to or following from the significant sentences.

"3. Relevance (pragmatics). No subjective feelings of the importance, relevance, or validity of the original document are imparted by any automatic abstract, since the document is not put into relation with the general body of knowledge or the works of other authors."

4. Conclusions and Recommendations

4.1. SUMMARY. This report has surveyed the techniques proposed by others for automatic indexing and automatic abstracting, and it has presented a new concept of word significance. This new concept, the formulation of word significance as a function of relative frequency, is fundamentally different from significance measures heretofore employed, and is based on considerations drawn from information theory.

² Some initial investigations bearing on this question are presented in [5, 6].

Any technique for automatic abstracting will, ultimately, succeed or fail according as the author of the document being abstracted succeeds or fails in expressing his thoughts clearly in the document. One cannot put more into an automatic abstract than the author of the document provides. This dependence may actually turn out to be an advantage, for automatic abstracts will reveal the poverty or richness of a document without either disguise or embellishment; whereas conventional abstracts sometimes make a document appear more valuable than it really is and are liable to subconscious biases and misunderstanding on the part of the human abstracter. One may even hope that when, inevitably, automatic abstracting becomes widely used, it will tend to induce authors to set forth more clearly and explicitly the main points of their articles.

No matter how much analytic techniques may be improved, it may well be that the concept of extracting topic sentences to form an automatic abstract will not always be capable of yielding a thoroughly satisfactory substitute for an abstract produced by intellectual analysis. Users may have to learn how to read automatic abstracts in order to make proper use of them. Nevertheless, the potential advantages of automatic analysis far outweigh the disadvantages.

The following recommendations are made for the carrying out of further research in automatic analysis along the lines of the quinquepartite outline presented in Section 2.

a. Investigation of More Exact Measures for the Significance Value of Single Words. The relative-frequency concept of word significance should be compared with the simple-frequency approach. Idioglossaries, i.e., word-frequency counts for specialized subject-fields, need to be constructed and utilized in the measurement of word significance as well as for the purposes of classification, cross-referencing, and retrieval of documents. Functions that derive a measure of significance from the relative frequency of a word should be compared for ease of interpretation, amount of discrimination, and estimated relative costs of computation. Non-probabilistic measures of significance such as the position of a word (e.g., in the title or in the first paragraph) and the nearness of a word to key words like "summary" and "conclusion" need to be studied. Included in this study area are the problem of whether it will be necessary or desirable, with the relative-frequency approach, to delete syntactic words, and the problem of determining the semantic equivalence of closely related word forms (e.g., singular and plural forms of substantives, and sets of word forms like: differ, differing, difference, differential).
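The comparison called for in recommendation a can be tried on a small scale. The Python sketch below is illustrative only: the four candidate functions of f (relative frequency of a word in the document) and r (its relative frequency in a reference count such as an idioglossary), the floor value used for words absent from the reference, and all the reference frequencies are assumptions made for the example, not formulas or data confirmed against this report.

```python
import math
from collections import Counter

def relative_frequencies(words):
    """Relative frequency of each word within one text."""
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# Candidate significance functions of f and r (assumed for illustration).
CANDIDATES = {
    "difference": lambda f, r: f - r,
    "ratio": lambda f, r: f / r,
    "share": lambda f, r: f / (f + r),
    "log_ratio": lambda f, r: math.log(f / r),
}

def significance(doc_words, reference_freq, func, floor=1e-6):
    """Score every word of the document; unseen reference words get a small floor."""
    f = relative_frequencies(doc_words)
    return {w: func(fw, reference_freq.get(w, floor)) for w, fw in f.items()}

doc = "the launching velocity of the rocket exceeds the escape velocity".split()

# Hypothetical reference relative frequencies (not measured values).
reference = {
    "the": 0.06, "of": 0.03, "velocity": 0.00005, "launching": 0.0001,
    "rocket": 0.0001, "exceeds": 0.0002, "escape": 0.0002,
}

simple = relative_frequencies(doc)  # the simple-frequency approach
scores = {name: significance(doc, reference, fn) for name, fn in CANDIDATES.items()}
top = {name: max(s, key=s.get) for name, s in scores.items()}
print(max(simple, key=simple.get), top)
```

With these assumed numbers the simple frequency count and the difference function both put the syntactic word "the" first, while the ratio-based functions put "velocity" first, which is the kind of difference in discrimination the recommendation asks to have measured.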

b. Investigation of More Exact Measures for the Significance Value of Groups of Words. Formulas for measuring the significance of word groups, utilizing the significance factors for single words, need to be studied. The researchers should investigate several mathematical functions for assigning measures of significance, doing so in a systematic way by varying the parameters of each function through a wide range of values. Another problem in this area is that of developing methods for the recognition of the semantic equivalence of syntactically different word groups, such as "earth's rotation" and "rotation of the earth."

c. Investigation of More Exact Measures for the Significance Value of Sentences. The significance factors for single words and word groups derived by means of the techniques developed in the first two study areas should be employed in the evaluation of sentence significance factors. Again the problem should be attacked through the method of systematically testing possible functions for the evaluation of sentence significance by varying the parameter values for each function. One of the main problems in this area will be to normalize the measure of sentence significance to provide a practical comparison of significance factors for sentences of different lengths. It will be necessary to allow for the greater range of ideas (essential in an abstract) which is possible in a long sentence without valuing too highly a sentence that contains many words but few ideas. On the other hand, it would appear undesirable to exclude all short sentences just because their total of word significances is low. It may turn out to be desirable to use nonprobabilistic criteria in this area also; e.g., it might prove desirable always to include the first and last sentences of a document in the abstract.

d. Construction of Automatic Abstracts Made Up of the Sentences with Maximal Significance Value. There should be research on the nature of the criteria to be applied to the measures of sentence significance in order to select sentences for abstracts. Among the questions that ought to be answered are: How many topic sentences, or how many running words, are required to furnish an optimal representation of the contents of the document? To what extent should the length of the abstract reflect the length of the document? Should documents longer than a given length be divided into two or more parts before abstracts are made? If so, what is the optimal maximum length for a document to be handled as a unit? These questions should be investigated through the process of mathematically formulating rules for the various decisions required and systematically varying the values of the decision-rule parameters in experiments.
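One way to set up the parameter-varying experiments of recommendations c and d is sketched below in Python. Everything in it is an assumption made for illustration: the normalization by sentence length raised to a power `alpha`, the two stopping rules (a fixed number of sentences or a fraction of the running words), and the per-word significance values. It is not a rule proposed in this report, only an example of a decision rule with parameters that can be varied systematically.

```python
def sentence_score(word_scores, alpha=0.5):
    """Sum of word significances divided by (sentence length ** alpha).

    alpha = 0 gives the raw total, which favors long sentences;
    alpha = 1 gives the mean per word, which favors dense sentences.
    """
    n = len(word_scores)
    return sum(word_scores) / (n ** alpha) if n else 0.0

def select_abstract(sentences, alpha=0.5, max_sentences=None, max_fraction=None):
    """Pick sentence indexes for the abstract and return them in document order.

    sentences: per-word significance values for each sentence, in document order.
    Selection stops at max_sentences, or as soon as the next sentence would
    push the abstract past max_fraction of the running words.
    """
    total_words = sum(len(s) for s in sentences)
    ranked = sorted(range(len(sentences)),
                    key=lambda i: sentence_score(sentences[i], alpha),
                    reverse=True)
    chosen, used = [], 0
    for i in ranked:
        if max_sentences is not None and len(chosen) >= max_sentences:
            break
        if max_fraction is not None and used + len(sentences[i]) > max_fraction * total_words:
            break
        chosen.append(i)
        used += len(sentences[i])
    return sorted(chosen)

# Hypothetical per-word significance values for a three-sentence document.
doc_scores = [
    [1, 0, 0, 0, 1, 0, 0, 1, 0, 0],  # long sentence, few significant words
    [2, 2],                          # short, dense sentence
    [0, 0, 1],                       # short sentence, low total
]

by_total = select_abstract(doc_scores, alpha=0.0, max_sentences=2)
by_mean = select_abstract(doc_scores, alpha=1.0, max_sentences=2)
by_budget = select_abstract(doc_scores, alpha=1.0, max_fraction=0.4)
print(by_total, by_mean, by_budget)
```

Varying `alpha` between 0 and 1 changes which sentences survive: with the raw total the long sentence is kept, while with the per-word mean it gives way to the short sentence whose total is low, which is the trade-off described in recommendation c.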

e. Employment of Significant Words and Groups of Words in Automatic Indexes for Classification, Cross-Referencing, and Retrieval. The use of single words and word groups in the formation of automatic indexes, in classification, and in cross-referencing should be studied. Study in this area should begin with investigations of criteria for the automatic construction of indexes formed of single words and word groups. The questions to be studied include those of the optimal length of the index, how the index length should be related to the document length, and whether it is desirable or possible to predetermine the ratio of single words to word groups. The question of using statistical methods for classifying documents should be studied also. An experimental matrix of standard relative frequencies for various subject-fields should be constructed in machine-usable form, and research should be performed on the evaluation of word significance by means of different sets of relative frequencies and on the production of experimental indexes, classifications, and cross-references. Investigations should be carried out to answer the questions of how sharp a breakdown into categories can be achieved and how much the classification process can be shortened through the use of sub-fields rather than the comparison of the frequencies in the document with all the subject-field relative-frequency sets in the matrix.
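A minimal Python sketch of classification against such a matrix follows. It is only an illustration of the idea: the matrix entries are invented, and the distance measure (sum of absolute differences in relative frequency) is an assumption chosen for simplicity, not a measure prescribed in this report.

```python
from collections import Counter

def relative_frequencies(words):
    """Relative frequency of each word within one text."""
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def distance(doc_freq, field_freq):
    """Sum of absolute frequency differences over the combined vocabulary (assumed measure)."""
    vocab = set(doc_freq) | set(field_freq)
    return sum(abs(doc_freq.get(w, 0.0) - field_freq.get(w, 0.0)) for w in vocab)

def classify(doc_words, matrix):
    """Return the subject-field whose standard frequencies lie closest to the document's."""
    doc_freq = relative_frequencies(doc_words)
    return min(matrix, key=lambda name: distance(doc_freq, matrix[name]))

# Hypothetical matrix of standard relative frequencies for two subject-fields.
matrix = {
    "rocketry": {"velocity": 0.3, "launching": 0.3, "thrust": 0.4},
    "linguistics": {"syntax": 0.5, "word": 0.3, "sentence": 0.2},
}

significant_words = ["launching", "velocity", "velocity", "thrust"]
print(classify(significant_words, matrix))
```

This version compares the document with every subject-field in the matrix; the shortening through sub-fields mentioned above would replace the single `min` over all fields with a search that first chooses a broad field and then compares only its sub-fields.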
4.2. AN EXTENSION. This report has reviewed the research in two areas of the field of automatic analysis, automatic indexing and automatic abstracting. Included in the report have been not only suggestions about the directions along which further research in these areas should proceed but also indications of how techniques for automatic indexing and automatic abstracting can be extended to automate other areas of document analysis, such as classification and cross-referencing. To conclude the report, there should be mentioned still another area to which automatic indexing and automatic abstracting will contribute importantly. This area is automatic translation.

To illustrate how the contribution will be made, it is sufficient to consider only one function of machine translation, namely, the translation of scientific and technical documents from English into foreign languages. It is already obvious that whatever computer programs are eventually devised to perform translation will be extremely complex (hence time-consuming and costly) and that it will be highly desirable to have some means of choosing for translation only those documents which are worth translating. A solution is the production in English of a distillate of the contents of the documents and the translation of only this distillate into the foreign language. After examining the translated distillate, the users may request full translations of only those documents that they want.

[FIG. 2. Automatic abstracting and indexing applied to automatic translation. The figure starts from the items ($10^{9}$ words) and shows three possible modes of automatic translation. First alternative, no selection: auto-translation of all items, $10^{9}$ words. Second alternative, chance selection: auto-translation on a hit-and-miss basis, ? words. Third alternative, systematic selection: auto-indexing ($10^{7}$ words) and auto-abstracting ($10^{8}$ words), users' selection, and auto-translation of selected items, $1.2 \times 10^{8}$ words.]

A quantitative appraisal of the scope of the problem is appropriate here. Needed first are some reasonable order-of-magnitude estimates for the various levels of representation of the essence of a document:
Representation                        Length (words)                   Reduction Factor

Document                              $10^{3}$ to $2 \times 10^{5}$    $10^{0}$
Review of Document                    $10^{3}$ to $2 \times 10^{3}$    $10^{0}$ to $10^{-2}$
Abstract of Document                  $10^{2}$ to $2 \times 10^{2}$    $10^{-1}$ to $10^{-3}$
Index of Document                     $10$ to $2 \times 10$            $10^{-2}$ to $10^{-4}$
Subject Classification of Document    $1$ to $5$                       $10^{-3}$ to $10^{-5}$
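The reduction factors in the table can be recomputed from the length ranges. The short Python sketch below does so under one assumption that the text does not state: that the shortest form of a representation is paired with the shortest document and the longest form with the longest document. Results are rounded to the nearest power of ten, in keeping with the order-of-magnitude character of the estimates; the names used are hypothetical.

```python
import math

# (shortest, longest) length in words of each representation, as given in the table.
LENGTHS = {
    "Document": (1e3, 2e5),
    "Review of Document": (1e3, 2e3),
    "Abstract of Document": (1e2, 2e2),
    "Index of Document": (1e1, 2e1),
    "Subject Classification of Document": (1, 5),
}

def reduction_range(name):
    """Ratio of representation length to document length at both ends of the range.

    Assumes shortest goes with shortest and longest with longest.
    """
    doc_lo, doc_hi = LENGTHS["Document"]
    lo, hi = LENGTHS[name]
    return lo / doc_lo, hi / doc_hi

def order_of_magnitude(x):
    """Nearest power-of-ten exponent of x."""
    return round(math.log10(x))

exponents = {
    name: tuple(order_of_magnitude(v) for v in reduction_range(name))
    for name in LENGTHS if name != "Document"
}
for name, (first, second) in exponents.items():
    print(f"{name}: 10^{first} to 10^{second}")
```

Under that pairing the computed exponents agree with the Reduction Factor column; the subject classification case is the only one that is not an exact power of ten (5 words out of $2 \times 10^{5}$ is $2.5 \times 10^{-5}$).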

Needed next is an estimate of the number of documents. The 1957 UNESCO report, "Scientific and Technical Translating," estimates that from $10^{6}$ to $2 \times 10^{6}$ items (articles, reports, patents, and books) of scientific and technical literature are published each year. The same report states that a random sample of 1,000 out of the 50,000 titles in UNESCO's "World List of Scientific Periodicals" showed 44 percent of them to be in English. Using only the conservative estimates of $0.44 \times 10^{6}$ scientific items published annually in English and $2.3 \times 10^{3}$ words in each item, one finds that there are $(0.44 \times 10^{6})(2.3 \times 10^{3}) \approx 10^{9}$ words, in all, which it might be desired to translate into foreign languages each year.

If, in place of the translation of these $10^{9}$ words, automatic indexes are produced, then it becomes necessary (using only the lesser reduction factors) to translate no more than $10^{-2} \times 10^{9} = 10^{7}$ words; if automatic abstracts are produced, no more than $10^{-1} \times 10^{9} = 10^{8}$ words. The respective savings are $10^{9} - 10^{7} = 9.9 \times 10^{8}$ words and $10^{9} - 10^{8} = 9 \times 10^{8}$ words. If the automatic indexes and automatic abstracts make it possible to choose, say, only 1 item in 100 for full translation, there will have to be translated $10^{9} \times 10^{-2} = 10^{7}$ words in the items, plus $10^{7}$ words for the automatic indexes and $10^{8}$ words for the automatic abstracts of all the articles. The total is thus $10^{7} + 10^{7} + 10^{8} = 1.2 \times 10^{8}$ words translated, a saving of $10^{9} - (1.2 \times 10^{8}) = 8.8 \times 10^{8}$ words over full translation of all items.

Figure 2 depicts the alternatives and the savings possible through automatic indexing and automatic abstracting. Thus it reveals the intrinsic interrelationship among automatic translating, automatic indexing, and automatic abstracting.

REFERENCES
1. BAXENDALE, P. B. Machine-made index for technical literature--an experiment. IBM J. Res. Dev. 2, 4 (Oct. 1958), 354-361.
2. LUHN, H. P. The automatic creation of literature abstracts. IBM J. Res. Dev. 2, 2 (Apr. 1958), 159-165.
3. OSWALD, V. A., JR., ET AL. Automatic indexing and abstracting of the contents of documents. RADC-TR-59-208, 31 October 1959, prepared for the Rome Air Development Center, Air Research and Development Command, United States Air Force, pp. 5-34, 59-133.
4. OSWALD, V. A., JR.; AND LAWSON, R. H. An idioglossary for mechanical translation. Mod. Language Forum 38, 2 (Sept.-Dec. 1953), 1-11.
5. RATH, G. J.; RESNICK, A.; AND SAVAGE, T. R. The formation of abstracts by the selection of sentences. Research Report RC-184, 29 June 1959, IBM Research Center, Yorktown Heights, N. Y.
6. RATH, G. J.; RESNICK, A.; AND SAVAGE, T. R. Comparisons of four types of lexical indicators of contents. Research Report RC-187, 14 August 1959, IBM Research Center, Yorktown Heights, N. Y.
7. RESNICK, A.; AND SAVAGE, T. R. A re-evaluation of machine generated abstracts. Research Report RC-230, 1 March 1960, IBM Research Center, Yorktown Heights, N. Y.