THE VARIATION IN THE INFORMATION CONTENT
OF TITLES OF RESEARCH PAPERS WITH
TIME AND DISCIPLINE

A. B. BUXTON and A. J. MEADOWS

Primary Communications Research Centre, University of Leicester


The relative information content of titles of research papers in different
subject areas has been examined by counting the number of their 'substantive' words in eleven English periodicals, two French and two German.
Chemistry and botany (in which KWIC indexes are already produced) are
found to have the highest values, followed by physics, medicine, history,
and the social sciences, with philosophy lowest. The information content of
the foreign titles when translated into English was almost equal to that of
English titles in the same subjects. Most subjects showed a significant increase in the number of substantive words between 1947 and 1973.
Some difficulties of searching by title due to the vocabularies of non-scientific subjects are discussed.
INTRODUCTION

THERE are now several important KWIC indexes of titles of research papers
available in scientific disciplines, for example Chemical Titles, and the subject
indexes of Biological Abstracts and Geo Abstracts. Science Citation Index and Social
Science Citation Index have 'Permuterm' indexes based on keywords in titles. In
addition, there are several large machine-readable databases which can be searched
by title words. The question of whether such retrieval methods would work
acceptably in other, non-scientific subject areas does not seem to have been
studied hitherto, and the work reported here represents one approach towards
an answer.
Bottle and Preibish¹ have made a comparison of index terms assigned in Psychological Abstracts with the corresponding titles in order to assess the suitability of a
KWIC index for psychology. Such a technique is applicable only in subject areas
where an indexing periodical exists, and cannot be used to compare different
subjects because of differences in depth of indexing and specificity of index terms.
To carry out interdisciplinary comparisons, a technique is necessary which
depends on the content of the titles alone.
METHOD AND RESULTS

One method fulfilling this condition (and capable of mechanization) is that used
by Tocatlian² on chemical titles and by Bird and Knight³ on titles in Nature and
Journal of Clinical Endocrinology and Metabolism. Both these studies investigated the
change in the information content with time during two decades by counting
the number of 'substantive' or 'key' words per title in their samples. Non-substantive words were defined by Tocatlian as 'words that convey little or no
information by themselves, such as articles, prepositions, conjunctions, pronouns,
and auxiliary verbs', while Bird and Knight defined non-keywords as 'the 650
common words on the Permuterm Index stop-list, as well as single letter, number
or hieroglyph codes'. Obviously the most suitable stop-lists for different subjects
will vary considerably; e.g. 'history' should be stopped in a history index, but not
for astronomy; 'solution' should probably be stopped in mathematics, but not
in chemistry; and 'reflections' will be useful in physics, but not in English literature. Clearly, there is no way of deciding on an ideal stop-list for interdisciplinary
studies, and it seems safest to confine the list to prepositions, pronouns, and
auxiliary verbs, together with very general words such as 'aspect', 'different',
'method', 'problem', and 'very'. In the following measurements, a common
stop-list of about 230 words (including inflexional variants) was used, and equivalent words were stopped in non-English titles. (In a frequency analysis of 1,000
English history titles, we found that 36.7% of all words were stopped by our list.
Twenty words, all of which were on the Permuterm Index stop-list, stopped 34.4%
of all words, which suggests that the effect of minor differences in the stop-list
will be very small.)

Journal of Documentation, Vol. 33, No. 1, March 1977, pp. 46-52

In comparing across subjects, it is necessary to adopt some conventions about
what will be allowed as a 'word'. As Tocatlian says, the formulation of clear rules
is difficult. Chemical formulae, e.g. NaCl, Fe²⁺, and ⁴⁰Ar, serve much the same
purpose as the chemical names, and are here counted as single words. (Tocatlian
counted multiword chemical names, e.g. 'hydrogen cyanide', as one word: here
the words are counted separately.) Numbers, e.g. '1949', together with any
following symbols, e.g. '273K', are also counted as single words. Hyphenated words are
counted as two if each part can stand alone as a word, e.g. 'twentieth-century', but
otherwise as one, e.g. 'non-English'. Abbreviations, e.g. 'NMR', are counted as
single words. Subtitles have been included with the titles as they are in the
indexing journals.
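The counting conventions above can be sketched in code. This is only an illustration under stated assumptions, not the procedure actually used: the stop-list shown is a small hypothetical subset of the list of about 230 words (which is not reproduced in the text), and the rule for hyphenated words is approximated by a short, assumed list of prefixes that cannot stand alone.

```python
# Hypothetical subset of the ~230-word stop-list; the full list is not given in the text.
STOP_WORDS = {
    "a", "an", "the", "of", "in", "on", "by", "for", "to", "and", "or",
    "with", "at", "from", "its", "their", "is", "are", "be",
    "aspect", "different", "method", "problem", "very",
}

# Assumed prefixes that cannot stand alone, so that 'non-English' counts as one word.
BOUND_PREFIXES = {"non", "anti", "pre", "post", "co", "semi"}


def split_words(title):
    """Split a title into words, approximating the conventions described above."""
    words = []
    for token in title.split():
        token = token.strip(".,:;()'\"")
        if not token:
            continue
        parts = token.split("-")
        # A hyphenated word counts as two only if both parts are alphabetic
        # and the first part is not one of the assumed bound prefixes.
        if (len(parts) == 2 and all(p.isalpha() for p in parts)
                and parts[0].lower() not in BOUND_PREFIXES):
            words.extend(parts)
        else:
            # Formulae, numbers with symbols ('273K') and abbreviations stay whole.
            words.append(token)
    return words


def count_title(title):
    """Return (all words, substantive words, proportion of substantive words)."""
    words = split_words(title)
    substantive = [w for w in words if w.lower() not in STOP_WORDS]
    proportion = len(substantive) / len(words) if words else 0.0
    return len(words), len(substantive), proportion
```

With this sketch, `count_title("Silicon heterocyclic compounds: ring closure by hydrosilation")` counts seven words, of which six are substantive. A real count would depend on the full stop-list and on a dictionary-based test of whether each part of a hyphenated word can stand alone.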
Samples of 100 titles each were taken from eleven English periodicals covering
a range of subjects for the years 1947, 1962, and 1973. Where the periodicals had
fewer than 100 titles in the year, adjacent years were combined as necessary.
Samples were also taken from French and German periodicals for 1973 in physical
chemistry and history; similar counts were made on the translations of the chemical titles in Chemical Abstracts, and on the forty-four translations of the German
historical titles which could be found in Historical Abstracts. The mean numbers of
alphanumeric characters per substantive word in the French and German titles,
and in the English titles used for comparison, were counted.
The results are given in Table 1. The cases where a periodical showed a significant difference in the mean value of a parameter between two different years are
marked in the Table. (The test used was whether the difference was greater than
2.33 times its standard error, which has a probability of less than 1% of arising
by chance.)
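The significance criterion can be written out explicitly. The sketch below assumes that the standard error of the difference is obtained from the two sample standard deviations and sample sizes in the usual way for independent samples, and that the direction of the change is ignored; the text states neither point, and the figures in the example are hypothetical rather than taken from Table 1.

```python
import math


def standard_error_of_difference(sd1, n1, sd2, n2):
    """Standard error of the difference between two independent sample means."""
    return math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)


def is_significant(mean1, sd1, n1, mean2, sd2, n2, threshold=2.33):
    """True if the two means differ by more than `threshold` standard errors."""
    se = standard_error_of_difference(sd1, n1, sd2, n2)
    return abs(mean2 - mean1) > threshold * se


# Hypothetical values: two samples of 100 titles, each with S.D. 2.0.
# The standard error of the difference is sqrt(0.08), about 0.283.
print(is_significant(5.0, 2.0, 100, 6.0, 2.0, 100))  # True: 1.0 exceeds 2.33 x 0.283
print(is_significant(5.0, 2.0, 100, 5.5, 2.0, 100))  # False: 0.5 does not
```

With samples of 100 titles and standard deviations of about 2, therefore, a change of roughly two-thirds of a word per title would be needed to reach significance under these assumptions.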
DISCUSSION

(a) English titles


Both Tocatlian and Bird and Knight concluded that titles in scientific journals
are becoming more informative, and suggested that this may be related to the
introduction of KWIC indexes. As shown in Table 1, our results indicate a significant increase in the number of substantive words per title for all three chemical
periodicals examined during the period 1962-73 (i.e. following the introduction
of Chemical Titles in 1960). However, these periodicals also showed an increase
in the earlier period 1947-62, though the increase was significant only for
Analytical Chemistry. The two life science periodicals, Annals of Botany and Lancet,
also showed significant increases in substantive words during 1962-73. (The
BASIC index of Biological Abstracts was introduced at the end of 1959.) The Lancet
showed a significant increase for the period 1947-62, while Annals of Botany
actually showed a decrease in this period. There were also significant increases
during 1962-73 for Economica, English Historical Review, and Journal of the Optical
Society of America, which cannot be attributed to the introduction of KWIC
indexes in their subject areas. English Historical Review showed a significant increase throughout 1947-62.

TABLE 1

Columns: Journal; Year; All words per title (Mean, S.D.); Substantive words per title (Mean, S.D.); Proportion of substantive words (Mean, S.D.); Characters per substantive word (Mean, S.D.).
English periodicals: Trans. Faraday Soc. (JCS Faraday Trans. in 1973); Analyt. Chem.; J. Organic Chem.; Ann. Bot.; Lancet; J. Opt. Soc. Amer.; Engl. Hist. Rev.; Philosophy; J. Soc. Psychol.; Brit. J. Sociol. (started 1950); Economica.
Foreign periodicals: Z. Phys. Chem. (Leipzig), 1973, with English translation; J. Chim. Phys., 1974, with English translation; Hist. Z., 1970-4, with English translation; Annales, 1972-3.
[Numerical entries of Table 1 are not legible in this copy.]
* Significant change relative to 1947.
Significant change relative to 1962.

The increase in information content of chemical and biological titles since
1960 is thus to be seen in the context of a trend to more informative titles which
has occurred over a wide range of subjects (philosophy being the only exception
found), and which was already apparent before KWIC indexes and mechanized
searching of title words became common. The introduction of these tools may
be responsible to some extent for an awareness of the need for informative titles,
but it cannot provide an explanation for the generality of the trend observed.
Bird and Knight suggest another possible cause of the increasing informativeness of titles, viz. the need to pick out papers of possible interest from ever-increasing numbers of papers in the field. If there are only a few periodicals of
interest, each containing say twenty papers a year, it is an easy matter to scan all
the papers as they appear. This is possibly still the position in philosophy: the
number of papers in Philosophy increased from eleven in 1947 to thirty-three in
1973. As the numbers of periodicals and papers per year grow, increasing reliance
must be placed on scanning lists of titles either in the journals themselves or in a
secondary journal. As early as 1962 the editor⁴ of English Historical Review requested contributors to word their titles so that 'the reader scanning contents or
index knows where he is in time and space'. The Journal of Organic Chemistry grew
from 122 papers in 1947 to about 1,100 in 1975, and the number of periodicals of
possible interest to an organic chemist has also escalated. Clearly, his current
awareness problem is of a different order from the philosopher's. We would
suggest this as an explanation both of the greater number of substantive words in
chemical titles than in philosophy, and of the increase of title-length with time
in chemistry.
An attempt was made to discover what kinds of words were represented by the
increases in the number of substantive words. Of the increase of 1.16 words in
Journal of Organic Chemistry over 1947-73, the biggest contribution was found
to be +0.70 from words relating to structures and mechanisms; +0.30 was due
to chemical names and +0.26 to names of reactions. The increase of 2.39 substantive words in Analytical Chemistry during the same period included +1.62
from words relating to instrumentation and techniques. The increases in chemical
titles thus represent the introduction into the titles of words relating to new techniques used and aspects studied. In the social sciences and arts subjects new techniques and new aspects of study are comparatively much rarer, so that their
influence on the information content of titles is much less. In the Journal of Social
Psychology, for example, words relating to tests and technique contributed only
1.0 substantive word per title in 1947 and 0.4 in 1973. (Bottle and Preibish¹ found
the value to be 8% of keywords, or about 0.5 per title, for 300 titles from 1968
Psychological Abstracts.)
There seems to be a general increase in the proportion of substantive words in
English scientific titles with time. This may arise partly from the increasingly
common use of nouns attributively, giving rise to such multiword concepts as
'the nuclear magnetic resonance spin echo technique'. The coining of single words
for new apparatus and phenomena seems to be less common now than formerly.
(b) French and German titles
The number of substantive words does not form a valid basis for comparison of
the information content of titles in different languages because languages differ
in the extent to which several concepts may be combined into a single word, e.g.
German 'Arbeiterklasse', English 'working classes'; French 'autodiffusion', English
'self diffusion'. The differences between the average numbers of characters per
substantive word, shown in Table 1, are to some extent a reflection of this: German substantive words are significantly longer than English ones in both history
and chemistry.
The numbers of substantive words in the English translations of the German
and French physical chemistry periodicals are very similar to each other and to
that in J. C. S. Faraday Transactions of 1973. Likewise, the number in the translation of Historische Zeitschrift compares well with that in 1971-4 English Historical
Review. Thus there appears to be little difference between the information content
of French, German and English titles of the same date and subject.
There is a significant difference between the proportion of substantive words
in the titles of Journal de Chimie physique and its English translation (p < 0.005). This
seems to be due partly to the attributive use of nouns in English, e.g. 'gallium
oxide', 'Montbéliard region', where French uses a preposition: oxyde de gallium,
pays de Montbéliard, and partly to the less frequent use of the definite article in
English. That the proportion is lower in the English translations than in titles
from corresponding English journals may be due to the rephrasing of the original
titles by the abstracting journals. This eliminates some non-substantive words,
and makes the titles more concise than is ordinarily found in English.
IMPLICATIONS FOR RETRIEVAL

We have counted all substantive words as equal, but clearly within a given title
this is in no respect true, and there may be gross differences between subjects in
the usefulness of title words. For example, 'Silicon heterocyclic compounds: ring
closure by hydrosilation' (Journal of Organic Chemistry, 1973) and 'Misleading
questions and irrelevant answers in Berkeley's theory of vision' (Philosophy, 1968)
each have six substantive words, but it is not clear that the philosophical title gives
as much information about the paper it describes as does the chemical title.
The traditional 'precision' and 'recall' values of retrieval experiments depend,
of course, not on the data elements alone but on the relation between them and
terms in the search profile. Suppose that the user's interest was 'English agricultural history', and the search profile was written as '(BRIT- or ENGL-) and
(AGRICULTUR- or FARM-)'. Then the title 'Wheat-growing in fourteenth
century East Anglia' would be missed, but had this been a subtitle to a main title
'Medieval English Farming. Part V', the paper would have been retrieved.
Strictly, the main title is redundant, since it is implied by the subtitle, and so in
one sense it contributes nothing to the information content. However, using the
number of substantive words as a measure, the information content is increased
from six to nine by the inclusion of the 'series' heading. Similarly, we may consider that other substantive words in the title will contribute to precision rather
than recall, for example, 'East' in the title above.
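The retrieval example can be made concrete. The sketch below assumes a simple reading of the profile: each truncated term matches any title word beginning with that stem, terms within a bracketed group are combined with 'or', and the groups are combined with 'and'. The function and variable names are illustrative only.

```python
import re

# The profile from the text: (BRIT- or ENGL-) and (AGRICULTUR- or FARM-).
PROFILE = [("BRIT", "ENGL"), ("AGRICULTUR", "FARM")]


def title_words(title):
    """Upper-case alphabetic words of a title; hyphens and punctuation separate words."""
    return re.findall(r"[A-Z]+", title.upper())


def matches(title, profile=PROFILE):
    """True if every group has at least one stem that begins some title word."""
    words = title_words(title)
    return all(
        any(word.startswith(stem) for word in words for stem in group)
        for group in profile
    )


subtitle = "Wheat-growing in fourteenth century East Anglia"
series = "Medieval English Farming. Part V: " + subtitle

print(matches(subtitle))  # False: no word begins with any of the four stems
print(matches(series))    # True: 'ENGLISH' and 'FARMING' satisfy both groups
```

The subtitle on its own fails both groups, while the series heading supplies a match for each, which is the effect on recall described above.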
Thus, for a particular title, the number of substantive words is not simply related
to its value in retrieval either as regards recall or precision. However, it seems
reasonable to suppose that in general a longer title will contain more words related
to the subject matter of the paper, and will be of more use as a basis for retrieval.
Olive et al.⁵ have studied the value of titles in operating an SDI service based on
Nuclear Science Abstracts. They found that, for titles of fewer than 100 characters,
index terms gave better recall than titles; while for longer titles, the titles gave
better recall. The shorter titles gave 51% precision and the longer ones 40%.
(One factor responsible for this difference is probably that papers with shorter
titles tend to be on more general subjects: they are more likely to be of some
interest to the user whereas a highly specific paper on the wrong aspect of a subject
will be irrelevant.)

SEARCHING BY TITLE WORDS IN NON-SCIENTIFIC DISCIPLINES

There are certain features in the vocabulary of scientific subjects which may make
title-searching easier than in non-scientific subjects.
(i) In chemistry, and to a lesser extent in biology and medicine, word-fragments
are often meaningful enough to be useful in retrieval. For example, the fragment
'-ase' will retrieve a large proportion of enzymes such as 'oxidase', 'urease',
'ligase', 'hydrogenase', etc. (as well as 'base', 'release', and a few other false drops).
Complex chemical names such as 'trans-2,3-dimethyl-1-phthalimidoaziridines'
give rise to several entries (five in this case) in Chemical Titles by being split
before each meaningful fragment. 'Hemocytoblastosis' is indexed at three points
in the KWIC index of Biological Abstracts. Such fragmentation is not possible to
anything like the same extent in history or psychology, so the number of entry
points will be almost limited to the number of substantive words, and the elements available for matching against a profile will be similarly reduced.
Fragmentation would, however, be important in a German KWIC index,
because of the frequency of agglutination.
(ii) The nomenclature of chemistry permits searches on two or more facets of
a compound, for instance a paper on 'ammonium trifluoroacetate' could be
retrieved by someone interested in ammonium compounds, acetates, or fluorocompounds. However, no-one is likely to be interested in the class of people with
the Christian name 'Samuel', so that the word 'Samuel' in 'Samuel Johnson'
serves only to improve precision when searching under 'Johnson'. Similarly,
topographical nomenclature does not indicate broader terms in the hierarchy. In
expanding fully the concept 'United States' in a history search, one would have
to include the name of each state, town, and region.
(iii) An important facet in history-related subjects will often be date. The ways
in which titles may indicate relevance to a particular period are many and unpredictable. For example, someone may be interested in English agriculture during
the period 1750-1850. The title words 'Georgian', 'eighteenth-century', '1800-1914', would all indicate the inclusion of potentially useful information. Thus
natural language searching seems to present formidable difficulties for searching
by date. (If, on the other hand, a title is located in a KWIC index by a word relating
to another facet of the problem, a date given in the context should indicate
whether the paper is relevant.)
(iv) There is a tendency in philosophy, and to some extent in other arts subjects,
for whimsical or metaphorical titles to be used, e.g., 'Never smile at a crocodile'
(Journal for the Theory of Social Behavior, 1973), and 'The cow on the roof' (Journal
of Philosophy, 1973). Indicative as these may be to the initiated, they seem to have
little value for retrieval. Sometimes a subtitle gives a more literal statement, as
'Right, left and centre: the Second Spanish Republic' (Historical Journal, 1972).
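Two of the points above, fragment searching in (i) and date expressions in (iii), can be illustrated briefly. Both functions are sketches under stated assumptions: the word list and the titles in the example are invented, and the lookup from period words to year ranges is a hypothetical hand-built table (the years assigned to 'Georgian' in particular are an assumption). A table of this kind is exactly the aid that plain title-word searching lacks.

```python
import re


def suffix_search(fragment, words):
    """Return the words ending in the given fragment, e.g. 'ase'."""
    return [w for w in words if w.lower().endswith(fragment)]


# Hypothetical lookup from period words to (start, end) years.
PERIOD_WORDS = {
    "georgian": (1714, 1830),
    "eighteenth-century": (1701, 1800),
}


def title_periods(title):
    """Collect year ranges implied by explicit spans or by known period words."""
    lowered = title.lower()
    periods = [(int(a), int(b))
               for a, b in re.findall(r"\b(\d{4})-(\d{4})\b", lowered)]
    periods += [span for word, span in PERIOD_WORDS.items() if word in lowered]
    return periods


def overlaps(title, start, end):
    """True if any period found in the title overlaps the years start..end."""
    return any(a <= end and b >= start for a, b in title_periods(title))


words = ["oxidase", "urease", "ligase", "hydrogenase", "base", "release", "enzyme"]
print(suffix_search("ase", words))
# ['oxidase', 'urease', 'ligase', 'hydrogenase', 'base', 'release']

# Invented titles, tested against an interest in the period 1750-1850.
print(overlaps("Enclosure and rents, 1800-1914", 1750, 1850))  # True
print(overlaps("Georgian farm buildings", 1750, 1850))         # True
print(overlaps("Tudor sheep farming", 1750, 1850))             # False
```

The suffix search retrieves the four enzymes together with the false drops 'base' and 'release'. The date test succeeds only for expressions that have been anticipated in the lookup, so a relevant title using an unlisted period word would be missed.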
CONCLUSION

On the basis of the number of informative words they contain, the titles of research papers in physics, history, psychology, and to a somewhat lesser extent
other social sciences, do not seem to fall far short of chemistry and the life sciences
in their suitability for retrieval. However, they do not enjoy the semi-systematic
nomenclature of the sciences, which means that although sufficient information
may be present in the titles it may not be in a form suitable for retrieval. Further
work is needed on the vocabularies of titles in different subjects from the point of
view of specificity and predictability.
ACKNOWLEDGEMENT
One of us (A.B.B.) is grateful to the Department of Education and Science for
support from an Information Science Studentship.

REFERENCES
1. BOTTLE, R. T. and PREIBISH, C. I. The proposed KWIC index for psychology: an experimental
test of its effectiveness. Journal of the American Society for Information Science, 21, 1970, 427.
2. TOCATLIAN, J. J. Are titles of chemical papers becoming more informative? Journal of the
American Society for Information Science, 21, 1970, 345-50.
3. BIRD, P. R. and KNIGHT, M. A. Word count statistics of the titles of scientific papers. Information
Scientist, 9, 1975, 67-9.
4. HAY, D. in: English Historical Review, 77, 1962, 3.
5. OLIVE, C., TERRY, J. E. and DATTA, S. Studies to compare retrieval using titles with that using
index terms: SDI from 'Nuclear Science Abstracts'. Journal of Documentation, 29, 1973,
169-91.
(Received 3 August 1976)
