Scientometrics (2011) 86:317–324

DOI 10.1007/s11192-010-0265-x

Correlation between impact and collaboration

Jiann-wien Hsu • Ding-wei Huang

Received: 24 March 2010 / Published online: 1 July 2010


© Akadémiai Kiadó, Budapest, Hungary 2010

Abstract We obtained statistically significant data to verify the intuitive impression
that collaboration leads to higher impact. We selected eight scientific journals and analyzed
the correlation between the number of citations and the number of coauthors. In every
journal, single-authored articles received the fewest citations on average, and articles with
fewer than five coauthors received fewer citations than the journal average. We also provide
a simple measure of the value of authorship in terms of the increase in the number of
citations. Compared to the citation distribution, similar but smaller fluctuations appear in
the coauthor distribution. Around 70% of the citations are accumulated in 30% of the papers,
while 60% of the coauthors appear in 40% of the papers. We find that predicting the citation
number from the coauthor number can be more reliable than predicting the coauthor number
from the citation number. For both the citation distribution and the coauthor distribution,
the standard deviation is larger than the average value. We caution against the use of such
an unrepresentative average value, which can be biased significantly by an extreme minority
and might not reflect the majority.

Keywords Citation · Impact · Coauthorship · Collaboration

Introduction

Fast accumulation of scientific knowledge is the hallmark of the last century. The practice of
scientific research has evolved rapidly in recent years (Börner et al. 2004). With the
advances in communication technology, research collaborations can be established much
more easily. As a result, the number of coauthors listed on scientific articles has been
increasing over the years. The first impression is that greater collaboration leads to higher
impact (Leimu and Koricheva 2005). There is also a general impression that articles published
in high-impact journals have more coauthors. This work aims to verify such an
impression. We note that coauthorship in scientific publications is a direct, but not the only,
measurement of collaboration. The coauthor number reflects collaboration at the personal
level. Collaborations at other levels, such as research groups, institutes, and
countries, have been analyzed recently (Matia et al. 2005; van Raan 2006).

J. Hsu
General Education Center, National Tainan Institute of Nursing, Tainan, Taiwan, ROC

D. Huang (corresponding author)
Department of Physics, Chung Yuan Christian University, Chung-li, Taiwan, ROC
e-mail: dwhuang@phys.cycu.edu.tw

The correlation between coauthorship and citation impact has been studied by Glänzel
(2002). In that study, three different fields were selected: Biomedical Research, Chemistry,
and Mathematics. Citation and coauthor counts were analyzed for papers published in
1996. The mean citation rate did increase with the coauthor number. To our
surprise, however, the most pronounced tendency was observed in Mathematics,
where both the citation and coauthor numbers were the lowest among the three fields.
The average citation rate was less than two in the three-year citation window and the
average coauthor number was less than three. With such low counts, the
positive correlation between collaboration and impact is not entirely convincing. It would
be interesting to see whether such a trend is supported by data with statistical significance.
In this work, we study the correlation between coauthor numbers and citation numbers
of published papers. The methods of data collection are described in the next section.
The fluctuations in both coauthor numbers and citation numbers are discussed in
the "Wide fluctuations" section. Concluding remarks are presented in the last section.

Methodology

The practice of research can be very different in different disciplines. In this study, we
collected the research papers published in eight journals, as listed in Table 1. To avoid
direct comparison across disciplines, we analyzed the correlation between citation
numbers and coauthor numbers within each journal. The data from different
journals are compared only after normalization by the journal average.
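The within-journal normalization described above can be sketched as follows. This is a minimal illustration, not the authors' actual procedure: the function name, the `(n_coauthors, n_citations)` tuple layout, and the toy data are assumptions made for the example. Articles are grouped by coauthor number (1 to 10, and more than 10), and each group's mean citation count is divided by the journal mean.

```python
from collections import defaultdict


def normalized_citations_by_coauthors(articles, cap=10):
    """Mean citations per coauthor group, divided by the journal mean.

    articles: iterable of (n_coauthors, n_citations) for ONE journal
              (hypothetical layout chosen for this sketch).
    Groups are the integers 1..cap plus the string '>cap' for larger teams.
    """
    articles = list(articles)
    journal_mean = sum(cit for _, cit in articles) / len(articles)

    totals = defaultdict(lambda: [0, 0])  # group -> [citation sum, article count]
    for n_auth, n_cit in articles:
        group = n_auth if n_auth <= cap else f">{cap}"
        totals[group][0] += n_cit
        totals[group][1] += 1

    return {g: (s / n) / journal_mean for g, (s, n) in totals.items()}


# Toy, made-up data: journal mean is 30 citations per article.
toy = [(1, 10), (1, 20), (2, 30), (12, 60)]
result = normalized_citations_by_coauthors(toy)
print(result)
```

Because each journal is divided by its own average, a value of 1 always means "as cited as the typical article in that journal", which is what allows curves from journals with very different citation levels to be compared.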
We collected the data from ISI Web of Knowledge. The eight journals listed in Table 1
were selected for both a high Impact Factor and a large number of annual
publications. According to Journal Citation Reports (Science Edition 2008), the well-known
journal Nature has an Impact Factor of 31.43 and publishes 899 articles annually. No journal
has both an Impact Factor larger than 10 and more than 2000 articles per year. Seven
journals have both an Impact Factor larger than 10 and more than 500 articles per year.
We selected four of them: Nature, Science, Circulation, and Blood. We further
selected the only four journals that meet the criteria of an Impact Factor larger than 6 and
more than 2000 articles per year: Proceedings of the National Academy of Sciences
of the United States of America (Proc. Natl Acad. Sci.), Journal of the American Chemical
Society (J. Am. Chem. Soc.), Physical Review Letters (Phys. Rev. Lett.), and Astrophysical
Journal (Astrophys. J.). To obtain sufficient data, we collected the research articles published
from 1995 to 2004 for the first four journals. The last four journals publish many more
articles per year than the first four. To keep the database sizes similar,
the time span of data collection for the last four journals was shortened to 5 years, i.e.,
from 2000 to 2004.

Table 1 Eight journals studied in this paper

Journal                   IF     Papers   Coauthors           Citations
                                          Average   Maximum   Average   Maximum
Nature                   31.43    9,946     5.79       349    196.65     4,073
Science                  28.10    9,299     5.62       177    201.68     4,175
Circulation              14.60    8,856     7.47       484     77.23     1,897
Blood                    10.43   10,736     7.36        82     57.33     1,785
Proc. Natl Acad. Sci.     9.38   13,283     5.78       106     69.95     3,636
J. Am. Chem. Soc.         8.09   12,885     4.22        31     45.47       901
Phys. Rev. Lett.          7.18   15,317    12.49       743     42.73     2,532
Astrophys. J.             6.33   11,712     4.56       118     33.42     1,396

IF: Impact Factor (Journal Citation Reports, Science Edition 2008)

For each of these 92,034 articles, we counted the number of listed coauthors and the
citations accumulated up to August 2009. Within each journal, the average and maximum
numbers for both coauthors and citations are listed in Table 1. Across the eight journals,
the average number of citations does not appear to correlate with
the average number of coauthors. However, within each journal, there is a positive
correlation between citations and the number of coauthors. The results are shown in Fig. 1,
where the citations are normalized by the average citations of each journal and the articles
are grouped by the number of coauthors. For example, on average a Nature article has 6
coauthors and 197 citations. Yet a single-authored Nature article has an average of only 61
citations, while a 10-authored Nature article has an average of 263 citations. For
Nature articles with more than ten coauthors, the average rises further to 370 citations.
In all eight journals, the single-authored articles have the lowest average number of
citations. Consistently, articles with fewer than five
coauthors receive fewer citations than the journal average. The trend can be well
approximated by a concave curve with the analytical expression

    y = (0.2x)^{1/3},    (1)

where y denotes the normalized citations and x denotes the number of coauthors. This
simple formula might provide a measure of the value of authorship, which has been a
much debated issue in the ethics of scientific practice (Slone 1996; Tarnow 1999, 2002).
As a straightforward judgment, the number of citations can be taken as a basis for sharing
credit. With the simple guideline of y = 1 at x = 5, the above formula implies that the first
author should take 58% of the credit. The second author adds an extra 16%. The third
and fourth authors have 10 and 9%, respectively. With a concave curve, the contribution
is dominated by the first author.

Fig. 1 Average citations as functions of coauthor numbers (vertical axis: normalized citations; horizontal axis: number of coauthors, 1 to 10 and > 10; one curve per journal)
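The credit-sharing reading of Eq. 1 can be checked with a short calculation. The sketch below assumes that each author's share is the increment y(x) - y(x-1), which is one interpretation of "adds an extra" in the text; the function names are hypothetical. Under that assumption the increments come out at roughly 0.58, 0.15, 0.11, 0.08, and 0.07 for the first five authors, which agrees with the percentages quoted above to within about one percentage point.

```python
def normalized_citations(x):
    """Eq. 1: y = (0.2 x)^(1/3), with x the number of coauthors."""
    return (0.2 * x) ** (1.0 / 3.0)


def marginal_shares(n_authors=5):
    """Increment in y contributed by each successive author (assumed reading)."""
    shares = []
    previous = 0.0
    for x in range(1, n_authors + 1):
        current = normalized_citations(x)
        shares.append(current - previous)
        previous = current
    return shares


shares = marginal_shares(5)
for position, share in enumerate(shares, start=1):
    print(f"author {position}: {100 * share:.1f}% of the credit")
```

The increments sum to y(5) = 1 by construction, and they shrink with each added author because the curve is concave.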

Wide fluctuations

We notice that wide fluctuations can be observed in both the citation distribution and the
coauthor distribution. As listed in Table 1, for both the number of coauthors and the number of
citations, the maximum can be greater than 50-fold the average. We note that such
maxima are not isolated or rare events, but emerge as a part of the continuous
distribution. We show in Fig. 2a the Zipf plot for citations, where the citation numbers of the
articles are sorted from large to small. A similar plot for coauthors is shown in Fig. 2b. The
Zipf plot is often used to demonstrate a power law distribution for the high rankings, i.e.,
large citation numbers in Fig. 2a and large coauthor numbers in Fig. 2b. A power law
distribution appears as a straight line on the log–log plot.
In this study, however, we are more concerned with the entire distribution over all
rankings. We normalize the scales and plot a semi-log graph. For each journal, we normalize
the rankings by the number of papers. The citations and coauthors are normalized
by the corresponding average numbers. In Fig. 3a, we show that the data from different
journals collapse onto the same curve. It is easy to see that only one-third of the articles
receive citations above the average. Very few articles receive citations around the average
number. Even if we extend the range to plus and minus 40% of the average, as marked by
the grey bold lines in Fig. 3a, only 30% of the articles are covered. The highest 20% is
still outside this range, and the lower half of the articles receive fewer than 60% of
the average number of citations.

Fig. 2 Zipf plots: a citation distribution; b coauthor distribution (log–log axes: citations or coauthors versus rankings; one curve per journal)

Fig. 3 Normalized Fig. 2: a citation distribution; b coauthor distribution (semi-log axes: normalized citations or coauthors versus normalized rankings; one curve per journal)

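The normalized ranking curve and the fractions quoted above can be computed along the following lines. This is a sketch under stated assumptions: the function name and the toy citation counts are invented for illustration, and the band is taken as inclusive of its end points.

```python
def distribution_summary(values, band=0.4):
    """Normalized Zipf-style curve plus two summary fractions.

    Returns (curve, above, within) where
      curve  = [(rank / n, value / mean), ...] sorted from large to small,
      above  = fraction of items strictly above the mean,
      within = fraction of items inside mean * (1 +/- band), inclusive.
    """
    ranked = sorted(values, reverse=True)
    n = len(ranked)
    mean = sum(ranked) / n
    normalized = [v / mean for v in ranked]

    curve = [((rank + 1) / n, y) for rank, y in enumerate(normalized)]
    above = sum(1 for y in normalized if y > 1.0) / n
    within = sum(1 for y in normalized if 1.0 - band <= y <= 1.0 + band) / n
    return curve, above, within


# Toy, made-up citation counts with mean 20.
toy_citations = [100, 40, 20, 15, 10, 7, 4, 2, 1, 1]
curve, above, within = distribution_summary(toy_citations)
print(above, within)
```

Applying the same function to the coauthor counts of a journal would give the corresponding quantities for the coauthor distribution.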
Similar fluctuations can also be observed in the coauthor distribution; see Fig. 3b.
Except for Phys. Rev. Lett., data from the journals follow the same curve. Only 6% of
Phys. Rev. Lett. articles have coauthor numbers larger than the average. In contrast, the
ratio is around 40% for the other seven journals. If the top 2% of large collaborations (with
coauthor numbers larger than 150) are removed, the Phys. Rev. Lett. data fall in
line with the others. The fluctuations of coauthor numbers are not as widely spread as those
of the citation numbers. An extension to plus and minus 40% of the average covers half
of the articles, with the top 20% and the bottom 30% not included.
In Fig. 4a, we plot the accumulated citations versus the accumulated fraction of papers. The
concave curve implies that the citations are unevenly distributed. A uniform
distribution would give a straight line with slope 1. From this curve, it can be
observed that 70% of the citations are accumulated on 30% of the papers.
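An accumulation curve of this kind can be sketched as below. The code assumes that papers are ordered from most to least cited before accumulating, which is what produces a concave curve; the function name and toy data are hypothetical.

```python
def accumulation_curve(values):
    """Points (fraction of papers, fraction of total) with papers sorted
    from the largest value to the smallest, starting at (0, 0)."""
    ranked = sorted(values, reverse=True)
    total = sum(ranked)
    n = len(ranked)

    points = [(0.0, 0.0)]
    running = 0
    for i, v in enumerate(ranked, start=1):
        running += v
        points.append((i / n, running / total))
    return points


# Toy, made-up citation counts: the top 3 of 10 papers hold 160 of 200 citations.
toy_citations = [100, 40, 20, 15, 10, 7, 4, 2, 1, 1]
points = accumulation_curve(toy_citations)
print(points[3])
```

If every paper had the same count, the points would lie on the diagonal, which is the straight line with slope 1 mentioned in the text.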

Fig. 4 Accumulated ratios: a citation distribution; b coauthor distribution (accumulated citations or coauthors versus accumulated papers; one curve per journal)

In Fig. 4b, we show a similar plot for the accumulated coauthors. The distribution
is somewhat more even. Roughly speaking, 60% of the coauthors appear on 40% of the
papers. The most uniform distribution is observed in J. Am. Chem. Soc., while the
most uneven distribution is observed in Phys. Rev. Lett., where 75% of the coauthors are
accumulated on 10% of the papers.

Further discussions

In this study, we analyze eight different databases. Each database contains around 10^4
articles (Table 1), which provides sufficient statistics for a more detailed analysis. Within
each database, we further separate the articles into 40 bins according to their citation
rankings, and we ask whether articles with more citations evidently have more coauthors. The
results are shown in Fig. 5a. The large fluctuations in Phys. Rev. Lett. are exceptional. For
the majority of citation regimes, the number of coauthors is around the average value. The
variation of the coauthor number is obvious only in the extreme cases. For the top citation
rankings, the number of coauthors increases slightly; for the bottom citation rankings,
the number of coauthors decreases slightly, especially for Nature and Science. Similarly,
Fig. 5b shows the results when the articles are separated according to the number of
coauthors. Again, for the majority of coauthor regimes, the number of citations is around the
average value. A slight increase in citations can be observed for the top coauthor rankings,
and a slight decrease in citations can be observed for the bottom coauthor rankings.
Comparing the two figures, a slightly larger variation can be observed in Fig. 5b, i.e., the
data in Fig. 5a stay closer to the average value shown by the solid line. An
interesting implication can be drawn from this observation: predicting the citation number
from the coauthor number can be more reliable than predicting the coauthor number from
the citation number.
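The binning behind this comparison can be sketched as follows. The details are assumptions made for illustration: articles are ranked from high to low, split into near-equal bins by position, and each bin average is divided by the overall average so that 1 corresponds to the journal mean. The function name, tuple layout, and toy data are hypothetical.

```python
def binned_means(articles, n_bins=40, sort_key=1, value_key=0):
    """Rank articles by one quantity and average the other within rank bins.

    articles:  list of (n_coauthors, n_citations) tuples (assumed layout).
    sort_key:  index used for ranking (1 = citations, 0 = coauthors).
    value_key: index averaged within each bin.
    Returns bin means divided by the overall mean, top-ranked bin first.
    """
    ranked = sorted(articles, key=lambda a: a[sort_key], reverse=True)
    n = len(ranked)
    overall = sum(a[value_key] for a in ranked) / n

    means = []
    for b in range(n_bins):
        lo = b * n // n_bins
        hi = (b + 1) * n // n_bins
        chunk = ranked[lo:hi]
        if chunk:  # bins can be empty when n < n_bins
            means.append(sum(a[value_key] for a in chunk) / len(chunk) / overall)
    return means


# Toy, made-up data: rank by citations, average the coauthor numbers.
toy = [(1, 10), (2, 20), (3, 30), (6, 40)]
coauthors_by_citation_rank = binned_means(toy, n_bins=2, sort_key=1, value_key=0)
print(coauthors_by_citation_rank)
```

Swapping the two keys (`sort_key=0, value_key=1`) gives the reverse view, citations averaged within coauthor-ranking bins.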
We have presented evidence supporting the positive correlation between citation and coauthor
numbers. However, asking a different question can give a different impression.
Suppose we randomly select two different articles in a journal. What is the probability that
the article with more citations will also have more coauthors? With the positive correlation
shown in Fig. 1, one expects that the probability should be high. As we have counted the
coauthor number and citation number for each article, the probability can be readily and
exactly calculated. The results are shown in Table 2. We list two probabilities: (1) an
article with more citations will have more coauthors; (2) an article with more coauthors
will have more citations. The difference between these two values can be related to the fact
that two randomly selected articles are more likely to have the same coauthor number than
the same citation number. With an average probability of 52% for (1) and 58% for (2), the
correlation between citations and coauthors is not so obvious. This discrepancy between the weak
correlation shown in Table 2 and the strong correlation shown in Fig. 1 can be attributed to
the wide fluctuations in both the citation distribution and the coauthor distribution.

Fig. 5 a Average coauthors in different citation regimes; b average citations in different coauthor regimes (log-scale vertical axes; horizontal axes: rankings by citations in a and by coauthors in b; one curve per journal)

Table 2 Probabilities of observing that (1) an article with more citations has more coauthors, and (2) an
article with more coauthors has more citations

Journal                  Probability (1) (%)   Probability (2) (%)
Nature                         58.08                 64.90
Science                        56.75                 63.22
Circulation                    51.67                 56.37
Blood                          52.43                 57.02
Proc. Natl Acad. Sci.          52.46                 58.23
J. Am. Chem. Soc.              44.38                 52.52
Phys. Rev. Lett.               48.72                 56.04
Astrophys. J.                  48.72                 56.58
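The two pairwise probabilities can be illustrated with a brute-force count over all pairs of articles. The text does not spell out how ties are handled, so the sketch below makes an explicit assumption: pairs tied in the conditioning quantity are excluded, and a tie in the other quantity counts as a miss. That choice is at least consistent with the remark that equal coauthor numbers are more common than equal citation numbers. Names and toy data are hypothetical.

```python
from itertools import combinations


def pairwise_probabilities(articles):
    """Return (p1, p2) for a list of (n_coauthors, n_citations) tuples.

    p1: among pairs with different citation counts, the fraction in which
        the more-cited article also has strictly more coauthors.
    p2: among pairs with different coauthor numbers, the fraction in which
        the article with more coauthors also has strictly more citations.
    Tie handling is an assumption of this sketch, not taken from the paper.
    Raises ZeroDivisionError if all articles tie in either quantity.
    """
    diff_citations = diff_coauthors = concordant = 0
    for (a1, c1), (a2, c2) in combinations(articles, 2):
        if c1 != c2:
            diff_citations += 1
        if a1 != a2:
            diff_coauthors += 1
        if (a1 - a2) * (c1 - c2) > 0:  # same ordering in both quantities
            concordant += 1
    return concordant / diff_citations, concordant / diff_coauthors


# Toy, made-up data in which three articles share the same coauthor number.
toy = [(1, 10), (1, 20), (1, 30), (2, 40)]
p1, p2 = pairwise_probabilities(toy)
print(p1, p2)
```

In the toy data the many coauthor ties pull the first probability down to 0.5 while the second stays at 1.0, the same direction of difference as in Table 2. The loop visits every pair, so for a journal of about 10^4 articles it performs on the order of 5 x 10^7 comparisons.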
As a final remark, we caution against the use of such an unrepresentative average value. The
average value can be biased significantly by an extreme minority, and might not reflect the
majority. The average citation number of articles published in a journal can be related
directly to the Impact Factor of the journal. However, given the wide spread of the citation
distribution, the average citation number is not a good indicator of the expected citations of
an article. As shown in this work, only a few articles receive citations around the average
number. Two-thirds of the articles fall short of the average citations. With
conventional statistics, the average citation number of a Nature article can be written as
197 ± 269. For a Science article, the citation number is 202 ± 295. These numbers are
awkward, as the standard deviation is larger than the average value. The same trend is
observed in all eight journals. Since the citation number is non-negative, an obvious
implication is that the average value is not representative.
For the citation distributions shown in Fig. 3a, power law behaviors can be observed in the
top 10% and the bottom 10% of articles. For the majority of articles, i.e., the remaining 80%, the
distributions are characterized by an exponential decay. For an exponential distribution,
the standard deviation is equal to the average value. With the extreme cases included, i.e.,
the top 10% and bottom 10%, the standard deviation becomes larger than the average value.
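
The statement that an exponential distribution has a standard deviation equal to its mean can be verified in a few lines. The derivation below treats the citation number c as a continuous variable with decay rate λ; both the continuous approximation and the symbol λ are assumptions introduced here for illustration.

```latex
\begin{align}
p(c) &= \lambda e^{-\lambda c}, \qquad c \ge 0, \\
\langle c \rangle &= \int_0^\infty c \,\lambda e^{-\lambda c}\, dc = \frac{1}{\lambda}, \\
\langle c^2 \rangle &= \int_0^\infty c^2 \,\lambda e^{-\lambda c}\, dc = \frac{2}{\lambda^2}, \\
\sigma &= \sqrt{\langle c^2 \rangle - \langle c \rangle^2}
        = \sqrt{\frac{2}{\lambda^2} - \frac{1}{\lambda^2}}
        = \frac{1}{\lambda} = \langle c \rangle .
\end{align}
```

A ratio σ/⟨c⟩ above 1, as in 197 ± 269 for Nature, therefore indicates a distribution that is broader than a pure exponential.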

References

Börner, K., et al. (2004). The simultaneous evolution of author and paper networks. Proceedings of the
National Academy of Sciences of the United States of America, 101, 5266–5273.
Glänzel, W. (2002). Coauthorship patterns and trends in the sciences (1980–1998): A bibliometric study
with implications for database indexing and search strategies. Library Trends, 50, 461–473.
Leimu, R., & Koricheva, J. (2005). Does scientific collaboration increase the impact of ecological articles?
BioScience, 55, 438–443.
Matia, K., et al. (2005). Scaling phenomena in the growth dynamics of scientific output. Journal of the
American Society for Information Science and Technology, 56, 893–902.
Slone, M. (1996). Coauthors' contributions to major papers published in the AJR: Frequency of undeserved
authorship. American Journal of Roentgenology, 167, 571–579.
Tarnow, E. (1999). The authorship list in science: Junior physicists' perceptions of who appears and why.
Science and Engineering Ethics, 5, 73–88.
Tarnow, E. (2002). Coauthorship in physics. Science and Engineering Ethics, 8, 175–190.
van Raan, A. F. J. (2006). Statistical properties of bibliometric indicators: Research group indicator
distributions and correlations. Journal of the American Society for Information Science and Technology,
57, 408–430.