
Download patterns of journal papers and their influencing factors

Article  in  Scientometrics · July 2017


DOI: 10.1007/s11192-017-2456-1


2 authors, including:

Zequan Xiong
East China Normal University

All content following this page was uploaded by Zequan Xiong on 09 April 2019.



Scientometrics (2017) 112:1761–1775
DOI 10.1007/s11192-017-2456-1

Download patterns of journal papers and their influencing factors

Yufeng Duan1,2 • Zequan Xiong1,3

Received: 1 April 2017 / Published online: 1 July 2017


© Akadémiai Kiadó, Budapest, Hungary 2017

Abstract A two-step cluster analysis was performed on the absolute downloads and the
relative downloads of Chinese journal papers published between 2006 and 2008. Four
patterns were identified from the perspective of absolute downloads; the first three patterns
can be expressed as power functions, signifying their evident aging trends, although this
does not apply to pattern 4. Two patterns were identified from the perspective of relative
downloads, and both present power distributions with minor differences in decline speed.
Furthermore, we examined the relationships between total downloads and article features
within each pattern and found only weak correlations between total downloads and title
length, number of authors, and number of keywords. However, there are moderate to high
correlations between initial downloads (defined as downloads made during the first year
after publication) and total downloads, suggesting that it is possible to forecast total
downloads from initial downloads. Additionally, the total downloads of highly downloaded
papers show no significant correlation with these article features.

Keywords Downloads · Download pattern · Citation · Aging · Obsolescence · Correlations

✉ Zequan Xiong
zqxiong@library.ecnu.edu.cn

1 Faculty of Economics and Management, East China Normal University, Shanghai 200241, People’s Republic of China
2 Institute for Academic Evaluation and Development, East China Normal University, Shanghai 200241, People’s Republic of China
3 Library, East China Normal University, Shanghai 200241, People’s Republic of China

Introduction

With the development of the Internet and the digitalization of publishing, an increasing
number of academic papers can be accessed and utilized in digital form via electronic
databases, and these download behaviors are continuously recorded by usage monitoring
systems and stored as big datasets. Although these data do not include unknowable
quantities, such as user motivations or goals, they disclose direct information about the
usage preferences of articles, such as which article was used, who used it, where that
person was, when it was used and so on (Kurtz and Bollen 2010). In addition, a tally of
downloads is a timely measurement of usage. Thus, some scholars are aware of the
importance of usage data in exploring user behaviors (Davis and Solla 2003; Davis and
Price 2006; Wang et al. 2012, 2013a, b), the obsolescence principles of articles (Moed
2005; Kurtz and Bollen 2010; Wang et al. 2014a, b), or even the supplementary effects of
citations in research evaluation (Bollen and van de Sompel 2008; Wan et al. 2010; Glänzel
and Gorraiz 2015).
Although increasing attention is given to downloads, systematic research on them is
seldom conducted. Existing studies are concerned with the following three aspects: the
comparison between downloads and citations (Bollen et al. 2005; Moed 2005;
Schloegl and Gorraiz 2010, 2011; Lippi and Favaloro 2013; Lu et al. 2016); the half-
life of downloads (Liu 2012; Xu 2014); and the influencing factors of downloads
(Jamali and Nikzad 2011; Guerrero-Bote and Moya-Anegon 2014; Subotic and
Mukherjee 2014).
Despite the significant correlations between downloads and citations that have been
demonstrated, there are still some problems in terms of data selection and processing.
Restricted by commercial database suppliers, most research can use only one or several
journals as data sources. Moed (2005) put forward the two-factors model of downloads
using data from Tetrahedron Letters and noted that during the first 3 months after a paper
is cited, its number of downloads increased 25% compared to what one would expect this
number to be if the paper had not been cited. Jamali and Nikzad (2011) researched the
influence of title type on downloads using data from six journals in PLoS and found that
question titles and short titles tend to get more downloads. These studies evidently did not
consider journal features, such as impact factors and themes, which may also influence
downloads.
Additionally, some scholars use ScienceDirect as the data source for downloads in
combination with citations data from other databases (such as Web of Knowledge) to
analyze the correlations between downloads and citations (Schloegl and Gorraiz
2010, 2011; Schloegl et al. 2014). These studies, however, take samples as a whole unit
and ignore that there may be different download patterns for varying reasons even within
the same sample. In the study of literature aging, diverse citation patterns have been
identified based on the annual citations of highly cited articles (Aversa 1985), while
analogous studies based on annual downloads have not yet been reported. Therefore, the
mining of download patterns, the reasons for the formation of these patterns, and the
influencing factors of different types of papers will contribute to our studies on download
behaviors as well as on the correlations between downloads and citations.
In this paper, we used data from Chinese Library and Information Science (CLIS) journals to
explore whether there are diverse download patterns. We also examined how article
downloads age over time and the main factors determining downloads, which may
pave the way for further research on the correlations between different download patterns
and citation patterns and provide theoretical foundations for the feasibility of using
downloads as a complementary indicator for article influence evaluation.


Methods

Data source and processing

The data were harvested from 11 CLIS journals published between 2006 and 2008 in the China
National Knowledge Infrastructure (CNKI), preliminarily resulting in 10,334 papers. These
journals are all core publications and are fully covered by CNKI; further, their print
publication dates are nearly in line with their online dates. Some other journals, such as
Library and Information Service and Journal of Library Science, have long online lags and
thus are not included in our research. We eliminated some noisy data, such as catalogs, prefaces,
contributions, and news, and obtained DataSet 1, consisting of 9919 papers.
In DataSet 1, the data are composed of basic information such as titles, authors, journals,
and downloads per year from 2006 to 2015. The actual downloads of a paper in the
year following its publication are computed as
$D'_{Y+1} = D_Y + \frac{D_{Y+1}}{12} \times (12 - M)$, and its
actual downloads two years after publication are expressed as
$D'_{Y+2} = \frac{D_{Y+1}}{12} \times M + \frac{D_{Y+2}}{12} \times (12 - M)$;
therein, $M$ denotes the remaining months of the current year after its
publication. Similarly, we obtained other years’ actual download data and stored them in
DataSet 2. It should be emphasized that in these calculations, we assumed an equal number
of downloads in each month throughout the year. Next, we normalized these data and
obtained the ratio of downloads in each year to total downloads,
$R_{Y+1} = \frac{D'_{Y+1}}{\sum D'}$, which
were stored in DataSet 3. Ultimately, we collected three datasets: the raw downloads
dataset, DataSet 1; the absolute downloads dataset, DataSet 2; and the normalized
downloads dataset, DataSet 3. The following is an example to demonstrate the calculations
for DataSet 2 and DataSet 3 based on DataSet 1. Given a paper published in September
2008, its raw download in each year as of 2015 is stored in DataSet 1, shown in Table 1. In
DataSet 2,
$D'_{2008+1} = D_{2008} + \frac{D_{2008+1}}{12} \times (12 - 3) = 36 + \frac{45}{12} \times 9 = 69.75$, and
$D'_{2008+2} = \frac{D_{2008+1}}{12} \times 3 + \frac{D_{2008+2}}{12} \times (12 - 3) = \frac{45}{12} \times 3 + \frac{30}{12} \times 9 = 33.75$.
Correspondingly, in DataSet 3,
$R_{2008+1} = 69.75 / (69.75 + 33.75 + 26.25 + 21.25 + 12.50 + 20.50 + 16.50) \times 100\% = 34.79\%$, and
$R_{2008+2} = 33.75 / (69.75 + 33.75 + 26.25 + 21.25 + 12.50 + 20.50 + 16.50) \times 100\% = 16.83\%$.
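The conversion from DataSet 1 to DataSets 2 and 3 can be sketched in a few lines of Python. This is only an illustration of the formulas above, not the authors' own code; the function names and the `remaining_months` parameter (the M of the text) are assumptions introduced here. The input row is the September 2008 paper from Table 1.

```python
from typing import List


def absolute_downloads(raw: List[float], remaining_months: int) -> List[float]:
    """Convert calendar-year downloads into downloads per year since publication.

    raw[0] is the publication year; remaining_months is M in the text.
    Like the paper, this assumes downloads are spread evenly over the
    months of each calendar year.
    """
    m = remaining_months
    # First year: the whole publication year plus (12 - M) months of the next year.
    result = [raw[0] + raw[1] / 12 * (12 - m)]
    # Later years: M months of one calendar year plus (12 - M) months of the next.
    for k in range(1, len(raw) - 1):
        result.append(raw[k] / 12 * m + raw[k + 1] / 12 * (12 - m))
    return result


def normalized_downloads(absolute: List[float]) -> List[float]:
    """Share of each year's downloads in the total over the observation window."""
    total = sum(absolute)
    return [value / total for value in absolute]


raw_2008_paper = [36, 45, 30, 25, 20, 10, 24, 14]  # DataSet 1 row from Table 1
dataset2 = absolute_downloads(raw_2008_paper, remaining_months=3)
dataset3 = normalized_downloads(dataset2)

print([round(v, 2) for v in dataset2])
print([f"{v:.2%}" for v in dataset3])
```

Run on the Table 1 row, the sketch gives the DataSet 2 and DataSet 3 rows of that table, for example 69.75 and 34.79% for the first year.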

Analyzing method

Normality test

We used the Q–Q (Quantile–Quantile) Plot to observe the distribution of total downloads
and examined whether it passed the K–S test. The Q–Q Plot shows a scatter plot with
observed values on the X axis and expected values on the Y axis. If all the scatter points are
close to the reference line, we can say that the dataset follows the given distribution.

Table 1 An example of calculations for different datasets

DataSet 1: raw download (by calendar year)
2008      2009      2010      2011      2012      2013      2014      2015
36        45        30        25        20        10        24        14

DataSet 2: absolute download
1st year  2nd year  3rd year  4th year  5th year  6th year  7th year
69.75     33.75     26.25     21.25     12.50     20.50     16.50

DataSet 3: normalized download
1st year  2nd year  3rd year  4th year  5th year  6th year  7th year
34.79%    16.83%    13.09%    10.60%    6.23%     10.22%    8.23%

Cluster analysis

A two-step cluster analysis was conducted to explore different download patterns. The
two-step cluster method is a scalable cluster analysis algorithm designed to handle very
large datasets. It can handle both continuous and categorical variables and is also considered
more reliable and accurate when compared to traditional clustering methods such
as the k-means clustering algorithm (Norusis 2007). As the name suggests, the two-step
cluster procedure involves two distinct steps: (1) pre-cluster the cases (or records) into
many small sub-clusters; (2) cluster the sub-clusters resulting from the pre-cluster step into
the desired number of clusters. This procedure can also automatically select the number of
clusters (Tkaczynski 2017). In this study, annual downloads in DataSet2 and DataSet3
were chosen as continuous variables, respectively, and the Bayesian information criterion
(BIC) was used for clustering. The clustering quality is estimated by the Silhouette
measure of cohesion and separation. This procedure measures the relationship of the
variables within and between clusters. A score above 0.0 would ensure that the within-
cluster distance and the between-cluster distance was valid among the different variables
(Tkaczynski 2017).
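For readers unfamiliar with the Silhouette measure, the following is a minimal sketch of its textbook definition: for each case, the mean distance a to its own cluster and the mean distance b to the nearest other cluster are combined as (b - a) / max(a, b). The statistical package used for the two-step procedure may compute the measure differently (for instance on a sample or with another distance), so this is not a reproduction of the paper's values. The four one-dimensional points and their labels are hypothetical toy data.

```python
import math
from typing import Dict, List, Optional, Sequence


def silhouette_score(points: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Mean silhouette coefficient with Euclidean distance (needs >= 2 clusters)."""
    cluster_ids = sorted(set(labels))
    if len(cluster_ids) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores: List[float] = []
    for i, point in enumerate(points):
        # Mean distance from this point to the other members of every cluster.
        mean_dist: Dict[int, Optional[float]] = {}
        for cid in cluster_ids:
            dists = [math.dist(point, other)
                     for j, other in enumerate(points)
                     if labels[j] == cid and j != i]
            mean_dist[cid] = sum(dists) / len(dists) if dists else None
        a = mean_dist[labels[i]]  # cohesion: mean distance within the own cluster
        if a is None:  # a singleton cluster scores 0 by convention
            scores.append(0.0)
            continue
        # Separation: mean distance to the nearest other cluster.
        b = min(d for cid, d in mean_dist.items()
                if cid != labels[i] and d is not None)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


toy_points = [[0.0], [1.0], [10.0], [11.0]]  # hypothetical, not from the paper
toy_labels = [0, 0, 1, 1]
toy_score = silhouette_score(toy_points, toy_labels)
print(round(toy_score, 3))
```

Values close to 1 indicate compact, well-separated clusters, values near 0 indicate overlapping clusters, and negative values indicate cases that sit closer to another cluster than to their own.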

Correlation analysis

The Spearman correlation coefficient was used to test the correlations between
downloads and the features of papers in the context of different patterns.
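As a reminder of what the Spearman coefficient measures, the sketch below ranks both variables (tied values share the mean of their ranks) and then computes the Pearson correlation of the ranks. It is a plain-Python illustration rather than the statistical package used in the study, it does not compute significance levels, and the five download counts are hypothetical.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson product-moment correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


initial = [12, 40, 7, 95, 33]      # hypothetical first-year downloads
total = [60, 150, 55, 900, 170]    # hypothetical total downloads
rho = spearman(initial, total)
print(round(rho, 3))
```

Because only ranks enter the calculation, the coefficient is insensitive to the strong skew of download counts, which is the reason the rank-based measure is preferred over Pearson for such data.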

Half-life analysis

Analogously to the concept of cited half-life, the download half-life of a paper is defined as
‘‘the median age of the articles that were downloaded in the considered year’’ (Schloegl
and Gorraiz 2010). The half-lives of different download patterns can be computed from the
average downloads of papers per year.

Results

Normality test

Fig. 1 Probability plot of downloads (left) and the log-transformed downloads (right)

We drew the Q–Q probability plot of downloads, shown in the left chart of Fig. 1. In this
normality test, the X axis represents the observed values in DataSet 2 and the Y axis represents
the expected normal values. Overall, the curve of the total downloads of most articles
significantly deviates from the expected values and presents a remarkably skewed distribution,
which is in line with the results from Lu et al. (2016). Given these results, we
employed the Spearman correlation coefficient rather than Pearson for further correlation
tests.
We converted the total downloads of each paper into logarithms and obtained a new
log-transformed dataset. We also used the Q–Q plot to test whether its distribution is
normal (the right chart in Fig. 1). The result shows that the log-transformed data satisfactorily
fit a normal distribution. Therefore, a logarithmic function was applied to fit the total
downloads distribution, shown in Fig. 2, and the regression equation is
$y = -259.1 \ln(x) + 2409.4$ with $R^2 = 0.930$.

Fig. 2 Frequency distribution of total downloads of papers


Dynamic changing patterns of absolute downloads

The data of absolute downloads in DataSet 2 can be divided into four groups representing
four different changing patterns after two-step cluster analysis with a silhouette measure of
cohesion and separation of 0.5. The detailed information is shown in Table 2.

Table 2 Information of each pattern from DataSet 2

Pattern  Paper number  Proportion of  Fitting expression                          Goodness of fit
         included      the total (%)
1        4885          49.25          $y_1 = 36.219 x_1^{-0.744}$                 $R^2 = 0.992$
2        3512          35.41          $y_2 = 84.428 x_2^{-0.712}$                 $R^2 = 0.992$
3        1328          13.39          $y_3 = 132.75 x_3^{-0.482}$                 $R^2 = 0.944$
4        194           1.96           $y_4 = 7.7456 x_4^2 - 63.552 x_4 + 282.32$  $R^2 = 0.938$

Table 3 Annual downloads in four patterns

Pattern  1st year  2nd year  3rd year  4th year  5th year  6th year  7th year
1        38.13     21.08     14.75     12.65     11.23     9.86      8.70
2        88.07     51.43     36.15     30.04     26.62     24.31     22.19
3        142.85    94.26     71.86     62.13     58.21     58.49     58.35
4        230.24    184.24    157.22    147.85    159.31    194.21    208.08

Fig. 3 Changing trend of the four patterns in terms of average absolute downloads (x axis: time, 1st to 7th year; y axis: average downloads; series: patterns 1 to 4)

The annual downloads of each pattern were computed, as shown in Table 3 and Fig. 3.
With respect to pattern 1, pattern 2, and pattern 3, the same changing tendency is shown:
their download counts reach peak activity in the first year and thereafter show a downtrend,
which fits negative power functions. The differences among them consist of absolute
downloads: the absolute downloads of pattern 2 and pattern 3 each year are nearly 2.3–2.6
times and 3.7–6.7 times as much as pattern 1, respectively.
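The paper does not state how the power functions were fitted. One common approach, assumed here, is ordinary least squares on log-transformed values, since $y = ax^{b}$ becomes a straight line $\ln y = \ln a + b \ln x$. The sketch below applies it to the pattern 1 row of Table 3; the function name is introduced for illustration only.

```python
import math
from typing import Sequence, Tuple


def fit_power_law(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of y = a * x**b via a straight line in log-log space."""
    log_x = [math.log(v) for v in x]
    log_y = [math.log(v) for v in y]
    n = len(log_x)
    mean_x, mean_y = sum(log_x) / n, sum(log_y) / n
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    b = sxy / sxx                        # slope of the log-log line = exponent
    a = math.exp(mean_y - b * mean_x)    # intercept of the log-log line = ln(a)
    return a, b


years = [1, 2, 3, 4, 5, 6, 7]
pattern1 = [38.13, 21.08, 14.75, 12.65, 11.23, 9.86, 8.70]  # Table 3, pattern 1
a, b = fit_power_law(years, pattern1)
print(f"y = {a:.3f} * x^{b:.3f}")
```

Under this assumption the fitted coefficients for pattern 1 come out close to the expression reported in Table 2 (a near 36.2, b near -0.74). The same call can be repeated for patterns 2 and 3; pattern 4 is not monotonic and is fitted with a quadratic in the paper instead.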
By contrast, the downloads in pattern 4 first decline and then rise, hitting bottom
in the 4th year before climbing by the 7th year to nearly the number of downloads in the 1st
year. The best-fitting function is a quadratic polynomial, and its absolute downloads are
6.04–23.92 times those of pattern 1.
Table 4 shows the download half-life, total downloads, and average downloads for each
pattern. Patterns 1 and 2, both with a half-life of about 2 years, account for 85% of the total sample
but contribute only 60% of total downloads. By contrast, pattern 4, which represents
highly downloaded papers, accounts for less than 2% of the total sample and contributes
more than 10% of the total downloads, with average downloads of about 1530 per paper. Pattern 4 also
shows a lower aging rate, with a half-life of 3.47 years.
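The exact half-life computation is not spelled out in the paper. A plausible reading, assumed here, is the time needed to accumulate half of the downloads in the 7-year window, with linear interpolation inside the year in which the halfway point is crossed. Applied to the average annual downloads in Table 3, this sketch gives values that agree with the half-lives listed in Table 4; the function name is hypothetical.

```python
from typing import Dict, List, Sequence


def download_half_life(annual: Sequence[float]) -> float:
    """Years needed to accumulate half of the downloads in the observation window.

    Downloads are assumed to be spread evenly within each year, so the crossing
    point inside a year is found by linear interpolation.
    """
    total = sum(annual)
    if total <= 0:
        raise ValueError("annual downloads must sum to a positive value")
    half = total / 2
    cumulative = 0.0
    for year_index, downloads in enumerate(annual):
        if downloads > 0 and cumulative + downloads >= half:
            return year_index + (half - cumulative) / downloads
        cumulative += downloads
    return float(len(annual))  # only reachable through floating-point rounding


table3: Dict[int, List[float]] = {  # average annual downloads from Table 3
    1: [38.13, 21.08, 14.75, 12.65, 11.23, 9.86, 8.70],
    2: [88.07, 51.43, 36.15, 30.04, 26.62, 24.31, 22.19],
    3: [142.85, 94.26, 71.86, 62.13, 58.21, 58.49, 58.35],
    4: [230.24, 184.24, 157.22, 147.85, 159.31, 194.21, 208.08],
}
half_lives = {p: round(download_half_life(v), 2) for p, v in table3.items()}
print(half_lives)
```

Note that a half-life defined this way depends on the length of the observation window: for pattern 4, whose downloads are still rising in the 7th year, a longer window would yield a larger value.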

Dynamic changing pattern of relative download

We continued using the two-step cluster analysis for DataSet 3. The sample of 9919
papers was finally divided into two clusters with a Silhouette measure of cohesion and
separation of 0.4, representing two different relative download patterns. This basic
information is shown in Table 5.
The proportions of annual downloads to total downloads in each pattern are shown in
Table 6 and Fig. 4. Overall, the annual proportions in both patterns show a downtrend.
Download half-life and the proportion of the total sample in each pattern are shown in
Table 7. Pattern A, with a half-life of 1.54, has higher proportions than pattern B, with a
half-life of 2.68, in the first 2 years, while its decline is also faster: starting from the
third year, pattern A’s annual proportion begins to lag behind that of pattern B.
We also identified the aging trend of the two patterns from their absolute downloads: the
absolute downloads of pattern A are larger than those of pattern B in the 1st year but drop
by about 50% the following year and remain below those of pattern B in subsequent
years (Table 6).

Table 4 Download half-life and average download in four patterns

Pattern  Download   Sample  Proportion of     Total      Proportion of       Average
         half-life  number  total sample (%)  download   total download (%)  download
1        1.95       4885    49.25             620,142    22.04               126.95
2        2.00       3512    35.41             1,072,632  38.13               305.42
3        2.50       1328    13.39             823,751    29.28               620.30
4        3.47       194     1.96              296,957    10.56               1530.71

Table 5 Information of each pattern from DataSet 3

Pattern  Paper number  Proportion of  Fitting expression           Goodness of fit
         included      the total (%)
A        4426          44.62          $y_A = 0.3943 x_A^{-1.029}$  $R^2 = 0.9994$
B        5493          55.38          $y_B = 0.2291 x_B^{-0.421}$  $R^2 = 0.9682$

Table 6 Annual download proportion and average download in DataSet 3

Pattern  Measure            1st year  2nd year  3rd year  4th year  5th year  6th year  7th year
A        Relative download  39.16%    19.90%    12.36%    9.43%     7.57%     6.25%     5.32%
A        Absolute download  81.13     42.02     26.21     19.77     15.95     13.58     11.79
B        Relative download  24.35%    16.63%    13.26%    12.30%    11.75%    11.19%    10.53%
B        Absolute download  67.52     47.07     38.04     34.77     33.86     34.37     33.88

Fig. 4 Changing trend of the two download patterns in DataSet 3 (x axis: time, 1st to 7th year; y axis: download ratio (%); series: patterns A and B)

Table 7 Download half-life and average downloads in the two patterns in DataSet 3

Pattern  Download   Sample  Proportion of     Total      Proportion of       Average
         half-life  number  total sample (%)  download   total download (%)  download
A        1.54       4426    44.62             995,491    35.38               224.92
B        2.68       5493    55.38             1,817,991  64.62               330.97

Matrix analysis on absolute and relative download changing patterns

Table 8 Sample matrix of different patterns in the two datasets

Relative pattern  Pattern 1       Pattern 2       Pattern 3       Pattern 4     Total
A                 2442 (24.62%)   1681 (16.95%)   291 (2.93%)     12 (0.12%)    4426 (44.62%)
B                 2443 (24.63%)   1831 (18.46%)   1037 (10.46%)   182 (1.84%)   5493 (55.38%)
Total             4885 (49.25%)   3512 (35.41%)   1328 (13.39%)   194 (1.96%)   9919 (100%)

Table 9 Download matrix of different patterns in the two datasets

Relative pattern  Pattern 1         Pattern 2           Pattern 3         Pattern 4         Total
A                 312,339 (11.10%)  497,834 (17.70%)    171,579 (6.10%)   13,739 (0.49%)    995,491 (35.38%)
B                 307,803 (10.94%)  574,798 (20.43%)    652,172 (23.18%)  283,218 (10.07%)  1,817,991 (64.62%)
Total             620,142 (22.04%)  1,072,632 (38.13%)  823,751 (29.28%)  296,957 (10.56%)  2,813,482 (100%)

We established a sample matrix and a download matrix for DataSets 2 and 3, as shown in
Tables 8 and 9. As is evident, patterns 1 and 2, with smaller absolute downloads, are distributed
almost equally between patterns A and B, while patterns 3 and 4, with larger downloads, are
concentrated in pattern B. The distribution indicates that articles with higher downloads
tend to show smaller changes in annual download proportion and to have longer half-lives.
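The concentration of patterns 3 and 4 in pattern B can be read directly off the paper counts in Table 8. The short sketch below computes, for each absolute pattern, the percentage of its papers that fall into pattern B; the dictionary layout and function name are choices made here for illustration.

```python
from typing import Dict

# Paper counts from Table 8: rows are relative patterns, columns are absolute patterns.
table8: Dict[str, Dict[int, int]] = {
    "A": {1: 2442, 2: 1681, 3: 291, 4: 12},
    "B": {1: 2443, 2: 1831, 3: 1037, 4: 182},
}


def share_in_row(counts: Dict[str, Dict[int, int]], row: str) -> Dict[int, float]:
    """Percentage of each column's papers that fall into the given row."""
    shares = {}
    for column in counts[row]:
        column_total = sum(r[column] for r in counts.values())
        shares[column] = round(100 * counts[row][column] / column_total, 1)
    return shares


share_in_b = share_in_row(table8, "B")
print(share_in_b)
```

About half of the papers in patterns 1 and 2 belong to pattern B, against roughly 78% of pattern 3 and 94% of pattern 4.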
After sorting articles by downloads, the top 1% of papers, 99 in total, were identified as
highly downloaded papers. These received 190,665 downloads, comprising 6.78% of the
total downloads. Moreover, they all belong to pattern 4 in terms of absolute downloads; 97
of them are attributed to pattern B in terms of relative download, with the other two
belonging to pattern A.

Correlation analysis of download and paper features

Generally, people search for targeted papers by way of titles, keywords, and authors before
downloading them, so papers with longer titles, more keywords and more
authors are more likely to be retrieved and then downloaded. Hence, we analyzed the correlation
between total downloads of articles and their features, including title length, number of
authors, number of keywords, and impact factor, as shown in Table 10. Additionally, we
analyzed the correlation between total downloads and initial downloads—defined as
downloads made during the first year after publication—to observe the impact of earlier
downloads on total downloads.

Table 10 Correlations between total download and paper features in different patterns

Feature             Pattern 1  Pattern 2  Pattern 3  Pattern 4  Total sample
Title length        0.018      -0.062**   -0.071**   -0.116     0.019
Number of authors   0.185**    0.043      -0.010     0.180*     0.253**
Number of keywords  0.228*     0.006      -0.028     -0.065     0.174**
Impact factor       0.127**    0.087**    -0.004     0.064      0.055**
Initial downloads   0.731**    0.474**    0.466**    0.411**    0.869**

** Correlation is significant at the 0.01 level (2-tailed); * correlation is significant at the 0.05 level (2-tailed)


As shown in Table 10, there are only weak correlations between total downloads and features
including title length, number of authors, and number of keywords, regardless of
pattern. Moreover, the correlations between downloads and individual features vary across
patterns. Title length has a weak negative correlation with total downloads in patterns 2
and 3: the longer the title, the lower the total downloads. In patterns 1 and 4, by contrast,
title length shows no significant correlation with total downloads, whereas the number of
authors shows a weak positive correlation.
Overall, among the four patterns, the highest correlations between total downloads and number
of authors, number of keywords and impact factor are seen in pattern 1.
Strong correlations between total downloads and initial downloads are
observed only in the total sample and pattern 1, with coefficients of 0.869 and 0.731, respectively;
in patterns 2 to 4 the correlations are moderate (0.411 to 0.474) but still significant.
Hence, taking initial downloads as the independent variable $x$ and total downloads as the
dependent variable $y$ (shown in Fig. 5), we carried out curve estimation on the whole
sample in the new DataSet 2 and eventually found that the best-fitting curve is a power
function, which can be expressed as $y = 7.198 x^{0.839}$ ($R^2 = 0.751$).
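As a small usage illustration, the fitted power function can be wrapped in a helper that turns first-year downloads into a rough estimate of total downloads. The coefficients 7.198 and 0.839 are those reported in the text; the function name and the sample inputs are hypothetical, and with $R^2 = 0.751$ the output should be read as an approximate expectation for CLIS papers in this sample rather than a precise forecast.

```python
def predict_total_downloads(initial_downloads: float,
                            a: float = 7.198, b: float = 0.839) -> float:
    """Evaluate the fitted power function y = a * x**b from the text."""
    if initial_downloads <= 0:
        raise ValueError("initial downloads must be positive")
    return a * initial_downloads ** b


for first_year in (10, 50, 100):  # hypothetical first-year download counts
    print(first_year, round(predict_total_downloads(first_year), 1))
```

Because the exponent is below 1, the predicted total grows less than proportionally with initial downloads: doubling first-year downloads raises the estimate by a factor of about 2^0.839, that is, less than double.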
We further analyzed the relationships between themes and patterns. In accordance with
Chinese Library Classification, we assigned every paper a classification number denoting a
certain theme. Classification numbers with more than 100 papers were selected for further
research, as shown in Table 11. Regardless of theme, the average downloads in pattern B
are higher than those in pattern A.
Next, we focused on the features of highly downloaded articles and analyzed the correlations
between total downloads and title length, number of authors, number of
keywords, and impact factor, as shown in Table 12. It turns out that total downloads
are not significantly correlated with any of these features but have a weak negative
correlation with initial downloads.

Fig. 5 Correlation between initial downloads and total downloads

Table 11 Download patterns in different themes (A = pattern A, B = pattern B)

Classification  Theme                                               Total  A papers  A downloads  A average  B papers  B downloads  B average
G250            Library science                                     1598   762       173,050      227.10     836       272,833      326.36
G252            Service for readers                                 927    472       101,171      214.35     455       154,905      340.45
G258            Various types of library                            758    298       56,327       189.02     460       122,723      266.79
G251            Library management                                  591    307       59,543       193.95     284       85,812       302.16
TP39            Application of computer                             532    321       71,961       224.18     211       81,110       384.41
G259            Librarianship of the world                          366    80        10,470       130.88     286       51,332       179.48
TP31            Computer software                                   331    219       42,148       192.46     113       45,504       402.69
G354            Information retrieval                               329    205       51,514       251.29     124       56,764       457.77
G253            Collection development and collection organization  316    146       26,888       184.16     170       45,226       266.04
G254            Document indexing and cataloging                    307    138       20,367       147.59     169       37,387       221.23
G350            Information science                                 274    108       28,397       262.94     166       75,037       452.03
F270            Enterprise economic theory                          254    150       50,647       337.65     104       46,556       447.65
F272            Enterprise planning and business decision making    243    151       50,150       332.12     92        45,771       497.51
G256            Philology                                           238    13        991          76.231     225       31,460       139.82
G203            Information resource management                     230    93        21,659       232.89     137       58,704       428.45
G353            The processing of intelligence data                 216    59        11,428       193.70     157       77,492       493.58
G255            Various document service                            197    81        12,440       153.58     116       20,904       180.21
G201            Information theory                                  121    38        8,915        234.61     83        37,515       451.99
F49             Information industry economics (pandect)            126    53        11,902       224.57     73        30,261       414.53

Table 12 Correlations between downloads of highly downloaded papers and paper features

                         Title length  Author number  Keyword number  Impact factor  Initial downloads
Correlation coefficient  -0.037        0.128          -0.020          0.030          -0.281**
Significance             0.718         0.206          0.844           0.769          0.005

** Correlation is significant at the 0.01 level (2-tailed)

Discussion

Literature aging law based on downloads

The traditional research on the literature aging law mainly relies on citations, generating
a series of mathematical models such as the negative exponential model, the Burton–Kebler
aging equation, the Brookes accumulation exponential model, and the Avramescu
equation. It has also been reported that there are different citation patterns for articles
(Avramescu 1979; Aversa 1985). Meanwhile, because of the limited accessibility of data, there
is little research on the aging law based on downloads, let alone on whether there are
different patterns. With two-step cluster analysis, article downloads can be separated
into distinct patterns. As far as absolute downloads are concerned, patterns 1, 2 and 3
follow the power function $y = at^{-b}$, where $a$ denotes the initial downloads and $b$ represents
the rate of change in downloads. For pattern 4, downloads first decline and then
rise, akin to a quadratic function. Taking the inevitable aging trend of literature into
consideration, we conjecture that pattern 4 would eventually follow the power function
as the former three do. In the time frame of our study, the change in downloads
in pattern 4 deviates from this aging trajectory, which may be due to the effect of citation.
Moed (2005) noted that during the first 3 months after an article is cited, its downloads rose 25%
compared to the non-cited condition; Schloegl et al. (2014) also reported that the
downloads of an article always show an increase after being cited. Assuming that $z$ denotes
the influence of citation in period $t$ on the downloads in period $t + 1$, we speculate
that a more general function for absolute downloads changing over time is
$Y = az + (1 - a)y$ $(0 < a < 1)$, where $a$ here is a weighting coefficient distinct from the
$a$ of the power function. For most papers, the influence can be ignored because
of few citations, namely, $a \approx 0$ and thus $Y = y$.
From the perspective of relative downloads, the changing trend occurring in pattern 4
does not appear, and both patterns follow power functions. The total downloads in
the context of relative downloads are set to 1, which eliminates the differences in
absolute downloads among papers and thus yields a more generalized literature aging
law: the downloads of a paper reach peak activity in the first year after its publication
and then gradually fall, finally approaching 0, a state that can be regarded as the death
of the paper.
There are pros and cons to both perspectives. Concerning absolute downloads, we
identified the ‘‘unusual’’ download pattern of highly downloaded papers and speculated on
the impact of citations on downloads. However, this perspective relies on a research time
window that is too narrow to forecast the developing trend of pattern 4 in the future.
Regarding relative downloads, we focus more on the changing trend of downloads over
time to explore the aging pattern of literature but ignore some factors that may impact
absolute downloads, resulting in a generalized aging pattern.


Influencing factors of paper downloads

As mentioned above, there is a body of research on the correlations between paper
citations and their influencing factors, including title type (Jamali and Nikzad 2011; Fox and
Burns 2015), title length (Jacques and Sebire 2010; Jamali and Nikzad 2011), number of
authors (Borsuk et al. 2009; Rao 2014) and number of keywords (Uddin and Khan 2016),
but research on correlations between downloads and these factors is rare. At present, the
most relevant research is concerned with the correlation between downloads and title
length. Some scholars found that articles with short titles tend to receive more downloads
than those with long titles (Jamali and Nikzad 2011; Lin 2012), while others came to the
opposite conclusion (Habibzadeh and Yadollahie 2010; Jacques and Sebire 2010).
The disparate results could be due to differences in the data. In this paper, we find
weak correlations between downloads and paper features including title length and number
of authors, as well as number of keywords. Moreover, downloads show different reactions
to these features in different patterns: for example, title length has a weak negative cor-
relation with downloads in patterns 2 and 3 but no correlation in patterns 1, 4 and the total
sample. It is unreliable to simply study the relationship between only a single feature and
downloads without considering the different influencing factors of downloads that could
give rise to different changing patterns.
We found moderate to high correlations between initial downloads and total downloads,
with a coefficient reaching 0.869 for the whole sample. Across the
patterns, the higher the downloads, the weaker the correlation. Similarly, the
highly downloaded papers show no significant correlations with paper features, suggesting
that article quality, rather than these features, contributes to their higher downloads. In view
of the decline-then-rise trend of highly downloaded papers, and considering
the lag of paper citation, we speculate that citation plays a promoting role in the
rise in downloads. After the source paper is downloaded, references to it would
garner more attention from researchers and contribute to more downloads, which also makes
its download aging pattern different from the patterns of other types of papers.
Meanwhile, higher correlations between total downloads and number of authors, number of
keywords and impact factor can be observed in pattern 1. These correlations indicate that
downloads of articles with low download counts may depend primarily on the factors that
increase the probability of being retrieved, such as longer titles, a larger number of keywords
and more authors.

Acknowledgements The authors are grateful to China National Knowledge Infrastructure (CNKI) for making
available the download files analyzed in this paper.

Appendix

See Table 13.


Table 13 Basic information for 11 journals

Journal                                             Paper number  Total download  Average download
Journal of Academic Libraries                       431           184,902         429.007
Information Studies: Theory and Application         671           271,441         404.532
Information Science                                 1211          448,009         369.950
Document Information and Knowledge                  450           155,264         345.031
Journal of Information                              1854          591,758         319.179
New Technology of Library and Information Service   739           193,928         262.419
Library and Information                             647           163,611         252.876
Library Tribune                                     1267          280,108         221.080
Library Journal                                     986           200,474         203.320
Library Work and Study                              871           172,700         198.278
Library                                             792           151,287         191.019
Total                                               9919          2,813,482       283.646

References
Aversa, E. S. (1985). Citation patterns of highly cited papers and their relationship to literature aging—A
study of the working literature. Scientometrics, 7(3–6), 383–389.
Avramescu, A. (1979). Actuality and obsolescence of scientific literature. Journal of the American Society
for Information Science, 30(5), 296–303.
Bollen, J., de Sompel, H. V., Smith, J. A., & Luce, R. (2005). Toward alternative metrics of journal impact:
A comparison of download and citation data. Information Processing and Management, 41(6),
1419–1440.
Bollen, J., & van de Sompel, H. (2008). Usage impact factor: The effects of sample characteristics on usage-
based impact metrics. Journal of the American Society for Information Science and Technology, 59(1),
136–149.
Borsuk, R. M., Budden, A. E., Leimu, R., Aarssen, L. W., & Lortie, C. J. (2009). The influence of author
gender, national language and number of authors on citation rate in ecology. Open Ecology Journal,
2(1), 25–28.
Davis, P. M., & Price, J. S. (2006). Ejournal interface can influence usage statistics: Implications for
libraries, publishers, and project counter. Journal of the Association for Information Science and
Technology, 57(9), 1243–1248.
Davis, P. M., & Solla, L. R. (2003). An IP-level analysis of usage statistics for electronic journals in
chemistry: Making inferences about user behavior. Journal of the Association for Information Science
and Technology, 54(11), 1062–1068.
Fox, C. W., & Burns, C. S. (2015). The relationship between manuscript title structure and success: Editorial
decisions and citation performance for an ecological journal. Ecology and Evolution, 5(10),
1970–1980.
Glänzel, W., & Gorraiz, J. (2015). Usage metrics versus altmetrics: Confusing terminology? Scientometrics,
102(3), 2161–2164.
Guerrero-Bote, V. P., & Moya-Anegon, F. (2014). Relationship between downloads and citations at journal
and paper levels, and the influence of language. Scientometrics, 101(2), 1043–1065.
Habibzadeh, F., & Yadollahie, M. (2010). Are shorter article titles more attractive for citations? Cross-
sectional study of 22 scientific journals. Croatian Medical Journal, 51(2), 165–170.
Jacques, T. S., & Sebire, N. J. (2010). The impact of article titles on citation hits: An analysis of general and
specialist medical journals. JRSM Short Reports, 1(1), 2.
Jamali, H. R., & Nikzad, M. (2011). Article title type and its relation with the number of downloads and
citations. Scientometrics, 88(2), 653–661.
Kurtz, M. J., & Bollen, J. (2010). Usage bibliometrics. Annual Review of Information Science and Tech-
nology, 44, 3–64.

123
Scientometrics (2017) 112:1761–1775 1775

Lin, J. (2012). Article title and its relation with the number of downloads and citations. Journal of Academic
Libraries, 04, 14–17.
Lippi, G., & Favaloro, E. J. (2013). Article downloads and citations: Is there any relationship? Clinica
Chimica Acta, 415, 195.
Liu, X. (2012). Download’ half-life establishment of scientific journal and bibliometrics importance. Chi-
nese Journal of Scientific and Technical Periodicals, 04, 561–564.
Lu, W., Qian, K., & Tang, X. (2016). Correlation analysis of paper download and citations-with library and
information field. Information Science, 01, 3–8.
Moed, H. F. (2005). Statistical relationships between downloads and citations at the level of individual
documents within a single journal. Journal of the American Society for Information Science and
Technology, 56(10), 1088–1097.
Norusis, M. J. (2007). SPSS 15.0 advanced statistical procedures companion. Chicago, IL: Prentice Hall.
Rao, I. K. R. (2014). Weak relations among the impact factors, number of citations, references and authors.
Collnet Journal of Scientometrics and Information Management, 8(1), 17–30.
Schloegl, C., & Gorraiz, J. (2010). Comparison of citation and usage indicators: The case of oncology
journals. Scientometrics, 82(3), 567–580.
Schloegl, C., & Gorraiz, J. (2011). Global usage versus global citation metrics: The case of pharmacology
journals. Journal of the American Society for Information Science and Technology, 62(1), 161–170.
Schloegl, C., Gorraiz, J., Gumpenberger, C., Jack, K., & Kraker, P. (2014). Comparison of downloads,
citations and readership data for two information systems journals. Scientometrics, 101(2), 1113–1128.
Subotic, S., & Mukherjee, B. (2014). Short and amusing: The relationship between title characteristics,
downloads, and citations in psychology articles. Journal of Information Science, 40(1), 115–124.
Tkaczynski, A. (2017). Segmentation using two-step cluster analysis. In T. Dietrich, S. Rundle-Thiele, & K.
Kubacki (Eds.), Segmentation in social marketing: Process, methods and application (pp. 109–125).
Singapore: Springer Singapore.
Uddin, S., & Khan, A. (2016). The impact of author-selected keywords on citation counts. Journal of
Informetrics, 10(4), 1166–1177.
Wan, J. K., Hua, P. H., Rousseau, R., & Sun, X. K. (2010). The journal download immediacy index (DII):
Experiences using a chinese full-text database. Scientometrics, 82(3), 555–566.
Wang, X., Mao, W., Xu, S., & Zhang, C. (2014a). Usage history of scientific literature: Nature metrics, and
metrics of nature, publications. Scientometrics, 98(3), 1923–1933.
Wang, X., Peng, L., Zhang, C., Xu, S., Wang, Z., Wang, C., et al. (2013a). Exploring scientists’ working
timetable: A global survey. Journal of Informetrics, 7(3), 665–675.
Wang, X., Wang, Z., Mao, W., & Liu, C. (2014b). How far does scientific community look back? Journal of
Informetrics, 8(3), 562–568.
Wang, X., Wang, Z., & Xu, S. (2013b). Tracing scientist’s research trends realtimely. Scientometrics, 95(2),
717–729.
Wang, X., Xu, S., Peng, L., Wang, Z., Wang, C., Zhang, C., et al. (2012). Exploring scientists’ working
timetable: Do scientists often work overtime? Journal of Informetrics, 6(4), 655–660.
Xu, X. (2014). Empirical research on half-life period of journal based on downloads. Journal of Intelligence,
06, 117–121.
